Multivariate time series forecasting is an important problem in many domains. Be it forecasting the demand for a product or finding weather patterns, using time series data from the present to predict the future is vital to many organisations. In this article we are going to see how to combine recurrent neural networks with convolutional neural networks to forecast grocery store sales. The data can be found here, and the code can be found here. This implementation was developed as a prototype for AI Hello, a company based in Toronto that provides e-commerce solutions, including sales prediction, for sellers on e-commerce platforms such as Amazon, WooCommerce, and Shopify.
We are using an implementation called LSTNet to perform this prediction. LSTNet uses convolutional and recurrent networks in conjunction, and can detect both long-term and short-term patterns according to the nature of the data.
Data Preparation and Exploration
For this article we’ll be looking at the Rossmann Store Sales data-set, available on Kaggle.
We are cleaning the data to get overall sales across all stores, broken down by day of the week and promotions. The code to extract this is given below:
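A minimal pandas sketch of this step, assuming the Kaggle train.csv schema with its Date, DayOfWeek, Sales, Customers, and Promo columns (the file path is illustrative):

```python
import pandas as pd

# Load the raw Kaggle file (path is illustrative; adjust to your download location).
df = pd.read_csv("rossmann/train.csv", parse_dates=["Date"])

# Aggregate across all stores for each calendar day.
daily = (
    df.groupby("Date")
      .agg(Sales=("Sales", "sum"),
           Customers=("Customers", "sum"),
           DayOfWeek=("DayOfWeek", "first"),
           Promo=("Promo", "mean"))   # fraction of stores running a promotion
      .sort_index()
)
```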
The data now has aggregated values:
Now let’s gain a better understanding of the cleaned data.
The graph below shows the average sales for each day of the week: Sundays have the lowest sales, and Mondays the highest.
- The relation between the number of customers and the total sales follows a linear pattern. This is only to be expected, but it is helpful when modelling our prediction algorithm.
- Promotions have a big impact: mean sales almost double when a promotion is available.
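A quick way to reproduce these summary statistics, continuing from the df and daily frames in the earlier sketch:

```python
# Average daily sales by day of week (1 = Monday ... 7 = Sunday in this data-set).
print(daily.groupby("DayOfWeek")["Sales"].mean())

# Mean per-store sales with and without a promotion.
print(df.groupby("Promo")["Sales"].mean())
```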
We will split our data into train, validation, and test sets: we will train on 60 percent of the data, and use 20 percent for validation and 20 percent for testing.
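Because this is a time series, the split must be chronological rather than shuffled. A simple sketch, using the daily frame from above:

```python
# Chronological 60/20/20 split: train on the oldest data, test on the newest.
n = len(daily)
train = daily.iloc[: int(n * 0.6)]
valid = daily.iloc[int(n * 0.6): int(n * 0.8)]
test  = daily.iloc[int(n * 0.8):]
```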
Before we go into the model, however, there is one important question to tackle: what are our features and what are our labels? In other words, what input are we fitting to what desired output?
In time series problems, our input (X) is the set of records from timestep t(n-k) to t(n-1). The value we try to fit to this input is the record at timestep t(n). To model the problem in this fashion we need to do further processing on the data. The code below demonstrates how to transform time series data this way.
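A sketch of that transformation (scaling is omitted for brevity; the assumption that Sales is the first column is ours, and the 90-step window matches the hyper-parameter chosen later):

```python
import numpy as np

def make_windows(values, window=90):
    """Turn a (time, features) array into supervised pairs:
    X[i] holds timesteps i .. i+window-1, y[i] is the Sales value at i+window."""
    X, y = [], []
    for t in range(window, len(values)):
        X.append(values[t - window:t])
        y.append(values[t, 0])          # assumes Sales is column 0
    return np.array(X), np.array(y)

X_train, y_train = make_windows(train.to_numpy(dtype="float32"))
X_valid, y_valid = make_windows(valid.to_numpy(dtype="float32"))
X_test,  y_test  = make_windows(test.to_numpy(dtype="float32"))
```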
Model Architecture
Before we train the model, let’s take a look at the model architecture. LSTNet has a novel architecture: it uses both convolutional and recurrent components. It also has a recurrent-skip component to better take long-term patterns in the data into account. Below is a detailed analysis of each component.
1. Convolutional Component
The first layer of LSTNet is a convolutional network without pooling, which aims to extract short-term patterns in the time dimension as well as local dependencies between variables. The convolutional layer consists of multiple filters of width ω and height n (the height is set to be the same as the number of variables).
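In the LSTNet paper, the k-th filter sweeping over the input matrix X produces

h_k = RELU(W_k ∗ X + b_k)

where ∗ denotes the convolution operation and the output h_k is a vector.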
2. Recurrent Component
The output of the convolutional layer is simultaneously fed into the recurrent component and the recurrent-skip component. The recurrent component is a recurrent layer with the Gated Recurrent Unit (GRU) that uses the RELU function as the hidden update activation function.
3. Recurrent-skip Component
Recurrent layers with GRU and LSTM units are carefully designed to memorise historical information and hence be aware of relatively long-term dependencies. Due to vanishing gradients, however, GRU and LSTM usually fail to capture very long-term correlations in practice. LSTNet therefore uses a recurrent structure with temporal skip-connections to extend the temporal span of the information flow and ease the optimisation process. Specifically, skip-links are added between the current hidden cell and the hidden cells in the same phase in adjacent periods.
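Concretely, the skip cell looks like a standard GRU whose previous hidden state h_(t-1) has been replaced by h_(t-p), where p is the number of skipped time-steps. Paraphrasing the update equations from the LSTNet paper:

r_t = σ(x_t W_xr + h_(t-p) W_hr + b_r)
u_t = σ(x_t W_xu + h_(t-p) W_hu + b_u)
c_t = RELU(x_t W_xc + r_t ⊙ (h_(t-p) W_hc) + b_c)
h_t = (1 - u_t) ⊙ h_(t-p) + u_t ⊙ c_t

For daily sales data with a weekly cycle, p = 7 lets each Monday’s hidden state feed directly into the next Monday’s.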
4. Temporal Attention Layer
The recurrent-skip layer requires a predefined hyper-parameter p, which is unfavourable for non-seasonal time series, or for series whose period length changes over time. To alleviate this issue, an alternative approach is an attention mechanism, which learns a weighted combination of the hidden representations at each window position of the input matrix.
Hyper-parameters for the model
The following are the parameters to decide for the model:
Epochs — We found the loss function plateauing after 40 epochs for this data-set, so we decided that running 50 epochs would give us good accuracy.
Window — This specifies how many time-steps from the past we take into account when predicting sales at the next time-step. A key consideration is to make the window small enough to detect short-term patterns while still capturing the larger trend. We decided that the sales information from the past quarter would provide the predictor with enough information, so our window size is 90.
Batch size — We are passing 128 samples in one batch.
CNN Filters — The number of output filters in the CNN layer.
CNN Kernel — The filter size of the CNN layer. We are using a value of 6.
GRU Units — The number of hidden states in the GRU layer. We are using 100 hidden states.
Skip — The number of time-steps to skip when looking for recurring patterns. We are assuming a weekly pattern, so we use a value of 7.
Highway — The number of time series values to feed into the linear layer, i.e. the autoregressive component. We keep this at 90 as well.
Lastly, we are using RMSE as our error function and the Adam optimizer for our model.
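Collected into one place for the code that follows (a sketch; the CNN filter count is our assumption, since the article does not state a value for it):

```python
hparams = {
    "window": 90,        # time-steps of history per sample
    "epochs": 50,
    "batch_size": 128,
    "cnn_filters": 32,   # assumed; not specified above
    "cnn_kernel": 6,
    "gru_units": 100,
    "skip": 7,           # weekly pattern
    "highway": 90,       # window for the autoregressive component
}
```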
Now we will see how to set up the neural network:
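Here is a hedged Keras sketch of an LSTNet-style network using the hyper-parameters above. The dropout rate is our assumption, and the skip-component reshaping follows the common Keras ports of LSTNet rather than the authors’ exact code:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

hp = hparams                                    # the dict defined above
conv_len = hp["window"] - hp["cnn_kernel"] + 1  # length after a "valid" Conv1D
skip_steps = conv_len // hp["skip"]             # full weeks in the conv output
trim = conv_len - skip_steps * hp["skip"]       # leading steps to drop

n_features = X_train.shape[-1]
inputs = layers.Input(shape=(hp["window"], n_features))

# 1. Convolutional component: short-term patterns and local variable dependencies.
conv = layers.Conv1D(hp["cnn_filters"], hp["cnn_kernel"], activation="relu")(inputs)
conv = layers.Dropout(0.2)(conv)                # dropout rate is our assumption

# 2. Recurrent component: a GRU over the convolutional features.
rnn = layers.GRU(hp["gru_units"])(conv)

# 3. Recurrent-skip component: regroup the sequence so each sub-sequence contains
#    only time-steps that are `skip` apart (i.e. the same weekday), then run a GRU.
def to_phases(x):
    b = tf.shape(x)[0]
    x = x[:, trim:, :]                          # make the length divisible by skip
    x = tf.reshape(x, (b, skip_steps, hp["skip"], hp["cnn_filters"]))
    x = tf.transpose(x, (0, 2, 1, 3))           # (batch, skip, steps, filters)
    return tf.reshape(x, (b * hp["skip"], skip_steps, hp["cnn_filters"]))

skip = layers.Lambda(to_phases)(conv)
skip = layers.GRU(hp["gru_units"])(skip)        # (batch * skip, units)
skip = layers.Lambda(lambda x: tf.reshape(x, (-1, hp["skip"] * hp["gru_units"])))(skip)

neural = layers.Dense(1)(layers.Concatenate()([rnn, skip]))

# 4. Autoregressive "highway": a linear model on the last `highway` Sales values.
ar = layers.Lambda(lambda x: x[:, -hp["highway"]:, 0])(inputs)
ar = layers.Dense(1)(ar)

outputs = layers.Add()([neural, ar])
model = Model(inputs, outputs)
```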
With our model defined, we can delve into the code for actually training it.
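A training sketch using the split created earlier. Keras has no built-in RMSE loss, so we minimise MSE, which has the same minimiser, and report RMSE as a metric:

```python
model.compile(optimizer="adam",
              loss="mse",
              metrics=[tf.keras.metrics.RootMeanSquaredError()])

history = model.fit(X_train, y_train,
                    validation_data=(X_valid, y_valid),
                    epochs=hp["epochs"],
                    batch_size=hp["batch_size"])
```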
Evaluate Model
The mean squared error value and the correlation value for our model during training and validation are given below. We see a considerable decrease in the error and a rising trend in the correlation, both of which are good indicators.
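One way to compute the same numbers on the held-out test windows (using scipy’s pearsonr is our choice; the article does not say how it computed correlation):

```python
import numpy as np
from scipy.stats import pearsonr

pred = model.predict(X_test).ravel()
rmse = np.sqrt(np.mean((pred - y_test) ** 2))
corr, _ = pearsonr(pred, y_test)
print(f"test RMSE: {rmse:.2f}, correlation: {corr:.3f}")
```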
Given below is the prediction for the whole time period. Let’s concentrate on the prediction over the test period.
The blue values indicate the actual sales.
We have not considered certain factors, such as holidays and individual store closures, which would help us predict the outliers. But the general trend and dips, as well as the weeks in which sales were generally low or high, are predicted remarkably well by the algorithm. Taking a closer look reveals this:
Comparison
To compare the results, here is the prediction pattern we get with FB Prophet (another time series prediction tool):
FB Prophet Results
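For reproducibility, a hedged sketch of how such a Prophet baseline can be produced (Prophet expects ds/y columns; the 80 percent fit span mirrors our train-plus-validation split):

```python
from prophet import Prophet  # pip install prophet

# Prophet expects a dataframe with columns `ds` (date) and `y` (value).
pdf = daily.reset_index().rename(columns={"Date": "ds", "Sales": "y"})[["ds", "y"]]

m = Prophet()
m.fit(pdf.iloc[: int(0.8 * len(pdf))])           # fit on the train + validation span

future = m.make_future_dataframe(periods=len(pdf) - int(0.8 * len(pdf)))
forecast = m.predict(future)                     # the `yhat` column holds predictions
```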
We see that while FB Prophet is good at accounting for the periodicity and general trend of the data, it does not do as good a job of predicting the actual sales value for each day. Even though sales rise and fall from week to week, FB Prophet predicts the same values for every week. Our algorithm, however, is able to predict how each week will differ from the last, which is a huge factor when it comes to sales strategy.