How to improve the accuracy of time series forecasts with cross-validation?


Time series analysis is one of the main areas of data science, and techniques like clustering, splitting, and cross-validation require a different understanding of the data when applied to it. In one of our articles, we discussed time series clustering. In this article, we are going to discuss cross-validation in time series. The main points to be discussed in the article are listed below.

Contents

  1. Simple cross-validation
  2. Cross-validation in time series
  3. k-fold cross-validation on time series
  4. Implementation of k-fold cross-validation
  5. Comparison between k-fold cross-validation and general modeling
    1. General modeling
    2. Modeling with cross-validation

Let’s briefly review what cross-validation is.

Simple cross-validation

In general, cross-validation is one of the methods to assess model performance. It works by splitting the data into several folds: we train the model on all folds except one and validate it on the held-out fold. This validation is performed multiple times, so that each fold serves as the validation set once.

The final result is formed by averaging the scores obtained on each fold. By using this modeling procedure, we try to avoid overfitting and check the accuracy of the model while taking its robustness into account. The image below represents the idea behind cross-validation.

[Image: the idea behind cross-validation]

In the above, we can see the basic concept behind cross-validation. There are different techniques to perform cross-validation, such as k-fold, stratified k-fold, rolling, and holdout. Here we will discuss k-fold cross-validation.
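As a quick illustration, here is a minimal sketch of plain k-fold cross-validation with scikit-learn; the linear model and the synthetic data are placeholders chosen only for illustration.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# placeholder data: a simple linear relationship
X = np.arange(20).reshape(-1, 1)
y = 2 * X.ravel() + 1

# 5 folds drawn at random -- fine for i.i.d. data, not for time series
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=kf)
print("Fold scores:", scores)
print("Average score:", scores.mean())  # the final cross-validated score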


Cross-validation in time series

We need to think differently about cross-validation in time series because such data evolves on an ongoing basis. In general, cross-validation can be thought of as randomly selecting some data from the whole dataset and performing the analysis on the selected data. With time series data, we need to remember that the data is generated as variables change over time. This means that any data point in the series is highly correlated with the previous ones, so we cannot select observations randomly.

Instead, cross-validation on time series data should select observations based on time, not as a random percentage of the dataset. Performing the evaluation step by step in time order can increase model efficiency by respecting the sequence of the data.
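A minimal sketch of such a time-ordered split on a hypothetical series, where the test set always comes after the training set in time:

import numpy as np

series = np.arange(10)           # ten observations in time order
cutoff = int(len(series) * 0.8)  # split by position in time, not at random
train, test = series[:cutoff], series[cutoff:]
print("Train:", train)  # the earliest 80% of the observations
print("Test:", test)    # the most recent 20%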

k-fold cross-validation in time series

As we know, the data in a time series is sequential and each point is often correlated with its previous points. In such an environment, cross-validation must be performed as the model makes predictions: once the model predicts a set of points, their accuracy is checked, and those points are then combined with the older data to make the next predictions.
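A minimal sketch of this walk-forward idea on a hypothetical series, using a naive mean forecast purely as a placeholder model:

# predict the next point, score it, then fold the observed point
# back into the history before making the next prediction
series = [3, 5, 4, 6, 7, 8]
history, errors = list(series[:3]), []
for actual in series[3:]:
    forecast = sum(history) / len(history)  # placeholder: mean of the history
    errors.append((forecast - actual) ** 2)
    history.append(actual)                  # expand the window with the observed point
print("Walk-forward MSE:", round(sum(errors) / len(errors), 2))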

Let’s get an intuition from the image below.

[Image: cross-validation on time series data]

In the image above, we can see how cross-validation works with time series data. The idea behind time series cross-validation can be further explained by taking a simple data example.

Let’s say [1, 2, 3, 4, 5] is our data, and we need to perform k-fold cross-validation with k = 4.

According to k-fold cross-validation, we are required to create 4 pairs of training and test data, which can be generated using the following rules:

  • Each test set should contain observations the model has not seen in training.
  • Observations from the training set come first, followed by observations from the test set.

Let’s see how to generate such pairs from the dataset.

  • Training data: [1] Test data: [2]
  • Training data: [1, 2] Test data: [3]
  • Training data: [1, 2, 3] Test data: [4]
  • Training data: [1, 2, 3, 4] Test data: [5]
  • Compute the average of the accuracies over the 4 test folds.

Here we can see how cross-validation works on time series, and we can understand that it differs from general modeling.
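A minimal hand-rolled sketch that generates the four pairs listed above:

# expanding-window pairs for values = [1, 2, 3, 4, 5] and k = 4
values = [1, 2, 3, 4, 5]
k = 4
for i in range(1, k + 1):
    train, test = values[:i], [values[i]]
    print("Training data:", train, "Test data:", test)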

Implementation of k-fold cross-validation

Let’s see how we can do this using Python.

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4, 5, 6])

# by default, TimeSeriesSplit produces 5 expanding-window splits
tscv = TimeSeriesSplit()
for train_index, test_index in tscv.split(X):
    print("Train data:", train_index, "Test data:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

Output:

With the default of five splits on six observations, each training set expands by one observation at a time, and each test set contains the single next observation. This is an example of running k-fold cross-validation with time series data.
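Continuing the example above, the number of folds can be changed through the n_splits parameter:

# three expanding-window splits instead of the default five
tscv3 = TimeSeriesSplit(n_splits=3)
for train_index, test_index in tscv3.split(X):
    print("Train data:", train_index, "Test data:", test_index)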

Comparison between k-fold cross-validation and general modeling

We can compare general modeling and modeling with cross-validation by implementing both procedures. Let’s look at the procedure with which we can improve the performance of time series modeling.

General modeling

Let’s start by importing the data.

import pandas as pd

# load the AirPassengers dataset, indexed by month
path="/content/drive/MyDrive/Yugesh/deseasonalizing time series/AirPassengers.csv"
data = pd.read_csv(path, index_col="Month")
data.head(20)


Output:
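The seasonal models we fit below generally expect a proper datetime index; a minimal sketch of parsing the Month column up front, assuming the standard AirPassengers CSV layout:

import pandas as pd

# parse the Month column as dates and declare a monthly frequency
data = pd.read_csv(path, index_col="Month", parse_dates=["Month"])
data = data.asfreq("MS")  # month-start frequency for monthly data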

In this article, we are using the AirPassengers dataset, which records the monthly number of passengers of an airline. The data can be downloaded here. Let’s separate the data into train and test sets.

# divide the data into train (first 80%) and test (last 20%) in time order
train_ind = int(len(data)*0.8)
train = data[:train_ind]
test = data[train_ind:]

Now, after dividing, we can fit a seasonal exponential smoothing model in the following way.

from statsmodels.tsa.holtwinters import ExponentialSmoothing
from sklearn.metrics import mean_squared_error

# fit an exponential smoothing model with multiplicative yearly seasonality
model = ExponentialSmoothing(train, seasonal="mul", seasonal_periods=12).fit()
pred = model.predict(start=test.index[0], end=test.index[-1])
MSE = round(mean_squared_error(test, pred), 2)
MSE


Output:

Here we can see the MSE between the test data and the model’s predictions. Now let’s perform similar modeling with a cross-validation procedure.

Modeling with cross-validation

In the sections above, we looked at how k-fold cross-validation works with time series. Here we don’t need a single train/test split, because the chosen number of splits k defines the folds of the data and we fit a model on each of them. Let’s start by running k-fold splitting on the data.

from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
for train_index, test_index in tscv.split(data):
    print('TRAIN:', train_index, 'TEST:', test_index)
    # index positionally, since the dataframe is indexed by month labels
    X_train, X_test = data.Passengers.iloc[train_index], data.Passengers.iloc[test_index]
    y_train, y_test = data.index[train_index], data.index[test_index]

Output:

Here we can see that our folds are prepared: with 144 monthly observations and five splits, each test set contains 24 observations and the first training set contains the first 24. Now let’s split our data according to these folds.

# split manually to match the five folds printed above
train1, test1 = data.iloc[:24, 0], data.iloc[24:48, 0]
train2, test2 = data.iloc[:48, 0], data.iloc[48:72, 0]
train3, test3 = data.iloc[:72, 0], data.iloc[72:96, 0]
train4, test4 = data.iloc[:96, 0], data.iloc[96:120, 0]
train5, test5 = data.iloc[:120, 0], data.iloc[120:144, 0]

After splitting, we fit a model on each of the folds.

# fit an exponential smoothing model on each training fold and
# compute the MSE on the corresponding test fold
folds = [(train1, test1), (train2, test2), (train3, test3),
         (train4, test4), (train5, test5)]
fold_mses = []
for train_f, test_f in folds:
    model_f = ExponentialSmoothing(train_f, seasonal="mul", seasonal_periods=12).fit()
    pred_f = model_f.predict(start=test_f.index[0], end=test_f.index[-1])
    fold_mses.append(round(mean_squared_error(test_f, pred_f), 2))
MSE1, MSE2, MSE3, MSE4, MSE5 = fold_mses

After fitting these models, let’s check all the MSE values between the test data and the predictions.


print ("MSE:", MSE)
print ("MSE1:", MSE1)
print ("MSE2:", MSE2)
print ("MSE3:", MSE3)
print ("MSE4:", MSE4)
print ("MSE5:", MSE5)


Output:

Here we can see that some of the folds performed very well, since their MSE values are lower than that of the general model.

Now let’s check the overall MSE value.

Overall_MSE = round((MSE1 + MSE2 + MSE3 + MSE4 + MSE5) / 5, 2)
print("Overall MSE:", Overall_MSE)

Output:

Here we can see that the overall MSE obtained with k-fold cross-validation is much better than that of the general modeling. This is how we can improve the performance of time series modeling.

Final words

In this article, we discussed cross-validation on time series, which must work differently from ordinary cross-validation because the data points in a time series are correlated with their lagged values. Along with this, we discussed how to implement it using the sklearn package and compared general modeling with modeling under cross-validation.


