Missing time-series vs. Empty time-series
Lokad is about time-series forecasting. As simple as the time-series model may seem (after all, a time-series is nothing more than a list of time-value pairs), there are several subtleties in the way time-series are managed. In this post, we will see how the Lokad time-series model distinguishes missing time-value pairs from empty time-value pairs. Since the topic is slightly complex, if you’re not familiar with the Lokad technology, I would suggest having a look at our User Guide (in particular, the Forecasting tasks section).
A practical situation
Let’s start with a practical real-life situation: let’s assume that we have a time-series that includes 12 time-values, one value for each month of the year 2005 (starting January 2005, ending December 2005). We can imagine that this time-series represents the monthly sales of a web shop. At the time I am writing this post, it’s the beginning of January 2007. What happens if I now insert this time-series into my Lokad account and ask for a monthly forecast? Well, there is an ambiguity in the time-series model, because there are two possibilities:
- Returning a forecast for January 2007 (let’s call it the clock-centric approach). In this case, we consider that the 12 values for the year 2006 are simply missing. Thus, we skip them and produce a forecast nonetheless, based on the data of the year 2005.
- Returning a forecast for January 2006 (let’s call it the data-centric approach). The forecast is based on the last time-value pair available (i.e. December 2005 in the present situation), which is equivalent to assuming that there are no missing values. In this case, the delivered forecast might refer to a period that is already in the past.
Let’s make things clear: Lokad has chosen the data-centric approach. If you ask for a monthly forecast for your 12 time-values ranging from January 2005 to December 2005, you will get a forecast for January 2006, no matter whether you request the forecast at the beginning of 2006 or in the distant future. Lokad takes the last time-value pair of your time-series as the reference to compute the forecasts. This option has been chosen because we believe it’s closer to business requirements.
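To make the difference between the two approaches concrete, here is a minimal Python sketch. It is only an illustration of the reasoning above, not the Lokad API: the function names and the sample sales figures are made up for the example.

```python
from datetime import date

def next_month(d: date) -> date:
    """Return the first day of the month that follows d."""
    return date(d.year + (d.month == 12), d.month % 12 + 1, 1)

def data_centric_target(series: dict) -> date:
    """Forecast period = the month right after the last time-value pair."""
    return next_month(max(series))

def clock_centric_target(today: date) -> date:
    """Forecast period = the current month, whatever the data contains."""
    return date(today.year, today.month, 1)

# Hypothetical monthly sales for 2005 (made-up values, keyed by month start).
sales_2005 = {date(2005, m, 1): 100.0 + m for m in range(1, 13)}

# The forecast is requested at the beginning of January 2007.
request_day = date(2007, 1, 5)
data_target = data_centric_target(sales_2005)    # January 2006
clock_target = clock_centric_target(request_day)  # January 2007
```

Note that the data-centric target depends only on the series itself, so changing `request_day` does not change it.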
Some arguments supporting the data-centric approach
Let’s review the arguments in favor of the data-centric approach:
- The data-centric approach has persistent semantics. If the input time-series data does not change, the forecast time-range does not change either (yet the actual values of the forecast may change over time).
- The data-centric approach offers the possibility to benchmark the Lokad forecasting services. You can import your 2005 product sales data into your Lokad account, get the forecast for 2006, and see how far our forecasts are from your historical record for 2006.
- The data-centric approach assumes that there is no missing data in your time-series after the initial time-value pair. This assumption has a strong advantage: its simplicity. Indeed, in some data mining fields, missing data is very frequent (think of medical surveys, for example), but when it comes to time-series, it’s quite rare.
Yet, this approach involves a minor drawback: you need to handle the lack of data explicitly. For example, in the previous web shop situation, some products of the catalog may not have been sold even once in a given month. In such a case, you must explicitly add a zero time-value to your time-series to represent this lack of sales.
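As an illustration, the zero time-values can be added before the data is uploaded. The following Python sketch assumes monthly data keyed by the first day of each month; the helper names and the sample figures are hypothetical and are not part of any Lokad API.

```python
from datetime import date

def month_range(first: date, last: date):
    """Yield the first day of every month from first to last, inclusive."""
    year, month = first.year, first.month
    while (year, month) <= (last.year, last.month):
        yield date(year, month, 1)
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)

def fill_missing_with_zero(series: dict, last_period: date) -> dict:
    """Return a copy where every month without sales is an explicit 0."""
    first = min(series)
    return {m: series.get(m, 0.0) for m in month_range(first, last_period)}

# Hypothetical product sold only in January and April 2005 (made-up values).
sparse = {date(2005, 1, 1): 3.0, date(2005, 4, 1): 1.0}

# The shop has recorded sales data up to June 2005, so the zeros run to June.
filled = fill_missing_with_zero(sparse, date(2005, 6, 1))
```

The trailing zeros matter as much as the ones in the middle: since the last time-value pair is the reference for the forecast, leaving out May and June here would presumably make April 2005 look like the end of the series.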