In this section, the conventional LSTM network is introduced, its shortcomings are analyzed, and then the $ {\overline{\delta }}_{relax} $-LSTM network is proposed for traffic flow forecasting.

The conventional LSTM network for forecasting
The LSTM network has been proven stable and powerful in modeling the long-term correlation of traffic flow sequences[27]. The LSTM network is composed of several basic LSTM cell units and a fully connected network (FCN). Taking cell unit un as an example, hn−1 represents the cell hidden state at moment n−1, and xn is the cell input at moment n. When hn−1, xn, and the offset b enter the sigmoid (sig) and tanh boxes, they pass through a basic neural network[23], with outputs denoted by in, fn, on, and $ \tilde{C} $ respectively. The relationship is expressed in Eqn (1).
$ \left(\begin{array}{c}\begin{array}{c}{i}_{n}\\ {f}_{n}\\ {o}_{n}\end{array}\\ \tilde{C}\end{array}\right)=\left(\begin{array}{c}\begin{array}{c}sigmoid\\ sigmoid\\ sigmoid\end{array}\\ tanh\end{array}\right)W\left(\genfrac{}{}{0pt}{}{{x}_{n}}{{h}_{n-1}}\right)+\left(\begin{array}{c}\begin{array}{c}{b}_{i}\\ {b}_{f}\\ {b}_{o}\end{array}\\ {b}_{C}\end{array}\right) $ (1)

where W represents the weight matrix in the hidden layer of the basic neural network, and xn is the normalized data. In Eqn (1), in, fn, and on are called the input gate, the forgetting gate, and the output gate respectively, and
$ \tilde{C} $ is an intermediate variable used to calculate the cell state cn. bi, bf, bo, and bC are the offset vectors corresponding to in, fn, on, and $ \tilde{C} $. Since the range of the sigmoid function is from 0 to 1, in, fn, and on are all non-negative, while $ \tilde{C} $ ranges from −1 to 1, as determined by the hyperbolic tangent function. The cell state cn is then calculated by summing cn−1 and $ \tilde{C} $ in a certain proportion. The proportion of cn−1 is determined by fn, while the contribution of $ \tilde{C} $ is controlled by in, as shown in Eqn (2).

$ {c}_{n}={f}_{n}\otimes{c}_{n-1}+{i}_{n}\otimes\tilde{C} $ (2)

Since cn−1 is the previous cell state, $ \tilde{C} $ is calculated by the current cell, and fn and in are their corresponding coefficients, fn and in are called the forgetting gate and the input gate. The cell output hn is then calculated as the activated value of cn, taken to a certain extent. The extent is determined by the output gate on, as shown in Eqn (3).
$ {h}_{n}={o}_{n}\otimes\mathrm{tanh}\left({c}_{n}\right) $ (3)

Finally, hn is fed into an FCN to obtain the prediction $ {\hat{x}}_{n+1} $. Since the cell state at any moment is related to the previous cell state and the input of the current moment, hn contains the information of all previous moments and the current moment, which realizes the correlation dependence of long sequences. Before using the LSTM network to predict traffic flow, it is necessary to train the parameters of the LSTM network by the back-propagation algorithm under the guidance of an error function. The error function of the conventional LSTM network is the mean square error (MSE) function, and its expression is shown in Eqn (4).
$ MSE=\dfrac{1}{N}\textstyle\sum _{n=1}^{N}{\left({\hat{x}}_{n+1}-{x}_{n+1}\right)}^{2} $ (4)

where $ {\hat{x}}_{n+1} $ is the predicted value at time n+1, xn+1 represents the true value at time n+1, and N is the total number of samples in the training set. As shown in Eqn (4), when the data form a stationary sequence, or when the noise is Gaussian or absent, satisfying $ |{\hat{x}}_{n+1}-{x}_{n+1}| < 1 $, the network parameters guided by MSE converge rapidly. However, non-Gaussian outlier noise is often generated in traffic flow data for various reasons, such as accidents. When the error satisfies $ |{\hat{x}}_{n+1}-{x}_{n+1}| > 1 $, the square operation in MSE further amplifies the error and thereby distorts the parameters in the network. The MSE loss thus makes the LSTM network vulnerable to non-Gaussian noise, so the canonical LSTM network guided by MSE loss cannot provide accurate predictions under non-Gaussian distributions, especially for traffic flow data. Therefore, the standard LSTM network needs to be further improved.
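As a concrete illustration of Eqns (1)−(4), the following is a minimal PyTorch sketch of one cell step. The function name, the random initialization, and the sizes (input size 1; hidden size 256, matching Table 2) are our illustrative choices, not the paper's implementation.

```python
import torch

input_size, hidden_size = 1, 256                       # hidden size as in Table 2
# Weight matrix W of Eqn (1), mapping [x_n; h_{n-1}] to all four pre-activations
W = torch.randn(4 * hidden_size, input_size + hidden_size) * 0.1
b = torch.zeros(4 * hidden_size)                       # offsets b_i, b_f, b_o, b_C stacked

def lstm_cell_step(x_n, h_prev, c_prev):
    """One LSTM cell step following Eqns (1)-(3)."""
    z = W @ torch.cat([x_n, h_prev]) + b               # affine map of Eqn (1)
    i_n, f_n, o_n, c_tilde = z.chunk(4)
    i_n, f_n, o_n = torch.sigmoid(i_n), torch.sigmoid(f_n), torch.sigmoid(o_n)
    c_tilde = torch.tanh(c_tilde)                      # intermediate variable, in (-1, 1)
    c_n = f_n * c_prev + i_n * c_tilde                 # Eqn (2): proportional sum
    h_n = o_n * torch.tanh(c_n)                        # Eqn (3): gated, activated state
    return h_n, c_n

mse = torch.nn.MSELoss()                               # Eqn (4), the conventional criterion
```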
The $ {\overline{\delta }}_{relax} $-LSTM network for forecasting
Although the LSTM model can learn long-sequence dependence, its prediction performance is highly dependent on the MSE criterion. However, the MSE criterion assumes that the prediction errors are independent and identically Gaussian distributed (i.i.d.), which makes MSE unsuitable for complex traffic flow sequences containing non-Gaussian noise such as impulse noise. To solve this problem, we propose to introduce the MCVC function into the LSTM network to guide the network parameters and thus carry out higher-quality traffic flow forecasting.
It is well known that the error function plays a key role in the performance of deep learning networks. From the perspective of information theory, the correntropy criterion, as a nonlinear similarity measure, has been successfully used as an effective optimization cost in signal processing and machine learning[28]. The correntropy between two random variables X and Y is shown in Eqn (5).
$ V(X,Y)=\mathbb{E}\left[\kappa \left(e\right)\right]=\dfrac{1}{N}\textstyle\sum _{n=1}^{N}\kappa \left({e}_{n}\right) $ (5)

where $ \mathbb{E}[\cdot ] $ denotes the expectation operator, $ \kappa (\cdot ) $ is the Mercer kernel, e = X − Y, and N represents the number of samples.

It is worth noting that the selection of the kernel function $ \kappa $ plays an important role in the correntropy. If the kernel adopts the triangle kernel $ \kappa \left(e\right)={\parallel e\parallel }^{d} $ with d = 2, V degenerates to MSE. When the kernel is the Gaussian kernel $ {G}_{\delta }\left(e\right)=\mathrm{exp}\left[-\dfrac{{e}^{2}}{2{\delta }^{2}}\right] $, so that $ V=\dfrac{1}{N}\textstyle\sum _{n=1}^{N}{G}_{\delta }\left({e}_{n}\right) $, V is the maximum correntropy. Further, when the kernel function in Eqn (5) adopts a mixed Gaussian kernel, V is called the mixed correntropy (MC), as shown in Eqn (6).

$ {V}_{MC}=\textstyle\sum _{i=1}^{I}{\alpha }_{i}\dfrac{1}{N}\textstyle\sum _{n=1}^{N}{G}_{{\delta }_{i}}\left({e}_{n}\right) $ (6)

where δi is the kernel bandwidth of the ith Gaussian kernel, and αi is the corresponding proportionality coefficient, satisfying α1 + α2 + ... + αI = 1. Since the Taylor expansion of the Gaussian kernel contains measures from zero to infinite order, it covers the measure orders of non-Gaussian noise, whether heavy-tailed or light-tailed, so the Gaussian kernel readily suppresses non-Gaussian noise during training.
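A quick numeric check (ours, not from the paper) of why the Gaussian kernel suppresses outliers: a single impulse-noise error dominates the squared-error terms used by MSE, but contributes almost nothing to the correntropy terms.

```python
import numpy as np

def gaussian_kernel(e, delta=1.0):
    # G_delta(e) = exp(-e^2 / (2 delta^2)), the Gaussian kernel of Eqn (6)
    return np.exp(-e**2 / (2 * delta**2))

errors = np.array([0.1, -0.2, 0.15, 50.0])  # last entry mimics an impulse-noise outlier
print(errors**2)                 # MSE terms: [0.01, 0.04, 0.0225, 2500.0] -- outlier dominates
print(gaussian_kernel(errors))   # correntropy terms: [~0.995, ~0.980, ~0.989, ~0.0]
```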
In Eqn (6), VMC is a linear combination of multiple Gaussian kernels. However, every Gaussian kernel in VMC is centered at zero error, so VMC only performs well on noise whose distribution is centered at zero. Chen et al.[30] therefore proposed the MCVC criterion, which further improves the performance of correntropy by broadening its applicability, as shown in Eqn (7).
$ {V}_{MCVC}=\textstyle\sum _{k=1}^{K}{\lambda }_{k}\dfrac{1}{N}\textstyle\sum _{n=1}^{N}{G}_{{\delta }_{k}}({e}_{n}-{c}_{k}) $ (7)

where δk defines the kernel bandwidth of the kth Gaussian kernel, ck is its center, and λk is the corresponding proportionality coefficient, satisfying λ1 + λ2 + ... + λK = 1.
It should be noted that the kernel function in VMCVC is a multi-Gaussian function, which usually does not satisfy Mercer's condition. However, this is not a problem, because Mercer's condition is not required for a similarity measure[30]. As for the convergence of VMCVC, it involves the kernel method and the unified framework of regression and classification; however, its convergence can be guaranteed if an appropriate parameter search method is adopted[31].
To account for both the Gaussian and non-Gaussian errors of the LSTM network, a $ {\overline{\delta }}_{relax} $-LSTM network is proposed. In this network, a new loss function based on the MCVC criterion is adopted, as shown in Eqn (8).

$ \begin{split}\mathcal{L}=\;&1-{V}_{MCVC} =1-\sum _{k=1}^{K}{\lambda }_{k}\dfrac{1}{N}\textstyle\sum _{n=1}^{N}{G}_{{\delta }_{k}}\left({e}_{n}-{c}_{k}\right)\\ =\;&1-\Bigg({\lambda }_{1}\dfrac{1}{N}\textstyle\sum _{n=1}^{N}\mathrm{exp}\left[-\dfrac{{\left({e}_{n}-{c}_{1}\right)}^{2}}{2{{\delta }_{1}}^{2}}\right]+\cdots +\\&{\lambda }_{K}\dfrac{1}{N}\textstyle\sum _{n=1}^{N}\mathrm{exp}\left[-\dfrac{{\left({e}_{n}-{c}_{K}\right)}^{2}}{2{{\delta }_{K}}^{2}}\right]\Bigg)\end{split} $ (8)

Through the analysis of Eqn (8), the advantages of $ \mathcal{L} $ are as follows:

● $ \mathcal{L} $ performs a negative exponential operation on the prediction error. This means that when the sequence is mixed with non-Gaussian noise such as impulse noise or outliers, the value of $ \dfrac{{\left({e}_{n}-{c}_{k}\right)}^{2}}{2{{\delta }_{k}}^{2}} $ becomes very large, but the negative exponential operation makes the correntropy VMCVC tend to zero. That is, VMCVC is not sensitive to non-Gaussian noise, which weakens the misjudgment of the LSTM network.

● When K = 2, δ1 < δ2, and $ {\delta }_{1}\to \infty $ are satisfied, VMCVC is approximately equal to MSE, which means that the $ {\overline{\delta }}_{relax} $-LSTM network has the potential to maintain good performance in a Gaussian noise environment. On the other hand, when ck = 0 (k = 1, 2, ..., K) is satisfied, VMCVC = VMC, that is, the performance of the MCVC-guided LSTM network in a non-Gaussian noise environment is not inferior to that under the MC criterion. The proposed $ \mathcal{L} $ loss function can therefore give the LSTM network excellent prediction performance under both Gaussian and non-Gaussian noise.

● The single Gaussian kernels in $ \mathcal{L} $ are no longer limited to zero mean, but can be concentrated at different positions. By studying the Gaussian mixture kernel with variable centers, it is found that VMCVC is more general and flexible and can adapt to more complex error distributions, such as skewed, multi-peak, and discrete-valued distributions. Therefore, when $ \mathcal{L} $ is employed as the cost function of the LSTM network, traffic flow forecasting can achieve better performance by setting the centers appropriately.
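To make the loss concrete, the following is a minimal PyTorch sketch of Eqn (8) with K = 2. The class name MCVCLoss is ours, and the default parameter values are simply the A1 settings from Table 3, not a recommendation.

```python
import torch

class MCVCLoss(torch.nn.Module):
    """Sketch of the MCVC-based loss L = 1 - V_MCVC of Eqn (8), K = 2."""
    def __init__(self, lambdas=(0.6, 0.4), deltas=(0.3, 10.0), centers=(0.0, -1.0)):
        super().__init__()
        self.lambdas, self.deltas, self.centers = lambdas, deltas, centers

    def forward(self, pred, target):
        e = pred - target                        # prediction errors e_n
        v_mcvc = 0.0
        for lam, delta, c in zip(self.lambdas, self.deltas, self.centers):
            # lambda_k * mean of the kernel values G_{delta_k}(e_n - c_k)
            v_mcvc = v_mcvc + lam * torch.exp(-(e - c)**2 / (2 * delta**2)).mean()
        return 1.0 - v_mcvc                      # Eqn (8)
```

During training it simply replaces the MSE criterion, e.g. `loss = MCVCLoss()(prediction, target)` followed by `loss.backward()`.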
In this section, the performance of the $ {\overline{\delta }}_{relax} $-LSTM network in traffic flow prediction is tested. In addition to the classical historical average (HA), Kalman filter (KF)[32], stacked auto-encoder (SAE)[20], and MSE-based LSTM method, the NiLSTM method[27] is also selected as a comparison benchmark because of its excellent robustness in non-Gaussian noise environments. Unless otherwise noted, all experiments were conducted on a computer equipped with an Intel Core i7-8850H CPU and 32 GB of RAM, and the source code is implemented in PyTorch 1.2.0 on Python 3.7.3.

Data description
The datasets A1, A2, A4, and A8, collected with Monica sensors by Wang et al.[33], were used in the experiments; they record the per-minute traffic flow of the A1, A2, A4, and A8 freeways over 35 d starting from May 20, 2010. These datasets are widely used in the evaluation of traffic flow prediction models[7,9,16,21,34,35].
The geographical locations of the four expressways are shown in Fig. 2. Among them, the A1 highway, connecting Amsterdam and the German border, is the first double three-lane highway in Europe and has a high utilization rate. Its traffic volume varies greatly over time, which increases the difficulty of prediction. The A2 motorway connects Amsterdam to the Belgian border and carries more than 2,000 vehicles an hour. The A4 highway connects the city of Amsterdam to Belgium's northern border and is 154 km long. The A8 highway starts at the northern end of the A10 highway and ends at Zaandijk, with a length of less than 10 km.
In the experiment, the data are aggregated into 10-min intervals and expressed as an hourly flow rate, in units of vehs/h, which is consistent with other traffic flow prediction models[6,9]. The first 28 d of each dataset were used for training the model, and the last 7 d were used for testing. All data are min-max normalized before being fed into the model.
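A sketch of this preprocessing is given below, assuming `flow_per_min` is a NumPy array of one dataset's per-minute vehicle counts. The function name is ours, and using training-set statistics for the min-max normalization is our assumption, since the paper does not specify it.

```python
import numpy as np

def preprocess(flow_per_min, train_days=28, test_days=7):
    # Aggregate 1-min counts into 10-min intervals, expressed in vehs/h (x6)
    flow_10min = flow_per_min.reshape(-1, 10).sum(axis=1) * 6
    # Min-max normalization, here computed on the training portion only
    split = train_days * 24 * 6                 # 6 ten-minute samples per hour
    lo, hi = flow_10min[:split].min(), flow_10min[:split].max()
    normed = (flow_10min - lo) / (hi - lo)
    return normed[:split], normed[split:]       # 28 d for training, 7 d for testing
```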
Evaluation criteria
In the test, two common indicators, root mean square error (RMSE) and mean absolute percentage error (MAPE), were used to evaluate all the prediction methods. RMSE measures the average difference between the predicted and true values, while MAPE represents the percentage difference between them. The calculation methods of RMSE and MAPE are shown in Eqns (9) and (10), respectively.
$ \mathrm{RMSE}=\sqrt{\dfrac{1}{M}\textstyle\sum _{m=1}^{M}{({y}_{m}-{\hat{y}}_{m})}^{2}} $ (9)

$ \mathrm{MAPE}=\dfrac{1}{M}\textstyle\sum _{m=1}^{M}\left|\dfrac{{y}_{m}-{\hat{y}}_{m}}{{y}_{m}}\right|\times 100{\text{%}} $ (10)

where M is the total number of samples in the test set, $ {\hat{y}}_{m} $ represents the predicted value of the mth sample in the test set, and ym is the corresponding true value.
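Direct implementations of Eqns (9) and (10) might look as follows; the function and argument names are ours.

```python
import numpy as np

def rmse(y_true, y_pred):
    # Eqn (9): root mean square error, in vehs/h
    return np.sqrt(np.mean((y_true - y_pred)**2))

def mape(y_true, y_pred):
    # Eqn (10): mean absolute percentage error, in %
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100
```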
Performance evaluation
In this section, the test results of the $ {\overline{\delta }}_{relax} $-LSTM and the eight baseline networks are compared, which demonstrates the superiority of the proposed $ {\overline{\delta }}_{relax} $-LSTM network for traffic forecasting. Then, the influence of different parameters on the performance of the $ {\overline{\delta }}_{relax} $-LSTM is analyzed, to explore how the parameters affect the $ {\overline{\delta }}_{relax} $-LSTM network. Each result in the experiment is averaged over 20 replicates.

In this part, the performance of the $ {\overline{\delta }}_{relax} $-LSTM is compared with the following eight models on the traffic flow datasets: historical average (HA), Kalman filter (KF)[32], artificial neural network (ANN)[36], stacked auto-encoder (SAE)[16], GSA-ELM[13], PSOGSA-ELM[21], LSTM, and NiLSTM[27].

The data preprocessing of the KF model in Table 1 adopts the wavelet de-noising method proposed by Xie et al.[32]: the mother wavelet is Daubechies 4, and the variance of the process error is V = 0.1I, where I represents the identity matrix. The variance of the measurement noise is 0, so the measurements are considered correct. The initial state is defined as [1/N, ..., 1/N] with N = 8, and the covariance matrix of the initial state estimation error is 10−2I. The ANN is a one-hidden-layer feed-forward neural network, where the mean squared error is set to 0.001, the spread of the radial basis function (RBF) is 2000, and the maximum number of neurons in the hidden layer is set to 40. Through cross-validation, the layer sizes of the SAE network are set to [120, 60, 30], and the layer-wise greedy training method is adopted. In the LSTM, NiLSTM, and $ {\overline{\delta }}_{relax} $-LSTM networks, the tanh function is used as the activation function of the LSTM layer, while the sigmoid function is used for the fully connected layer. In the back-propagation algorithm, the Adam optimization method is used with an initial learning rate of 0.001. The other hyperparameters of the three networks are shown in Table 2. The Gaussian kernel bandwidth in the NiLSTM network is set to δ = 1.0.

Table 1. The comparison of the $ {\overline{\delta }}_{relax} $-LSTM model with the eight baseline models on the four benchmark datasets, with boldface representing the best performance.
| Models | Criterion | A1 | A2 | A4 | A8 |
|---|---|---|---|---|---|
| HA | RMSE (vehs/h) | 404.84 | 348.96 | 357.85 | 218.72 |
| | MAPE (%) | 16.87 | 15.53 | 16.72 | 16.24 |
| KF | RMSE (vehs/h) | 332.03 | 239.87 | 250.51 | 187.48 |
| | MAPE (%) | 12.46 | 10.72 | 12.62 | 12.63 |
| ANN | RMSE (vehs/h) | 299.64 | 212.95 | 225.86 | 166.50 |
| | MAPE (%) | 12.61 | 10.89 | 12.49 | 12.53 |
| SAE | RMSE (vehs/h) | 295.43 | 209.32 | 226.91 | 167.01 |
| | MAPE (%) | 11.92 | 10.23 | 11.87 | 12.03 |
| GSA-ELM | RMSE (vehs/h) | 287.89 | 203.04 | 221.39 | 163.24 |
| | MAPE (%) | 11.69 | 10.25 | 11.72 | 12.05 |
| PSOGSA-ELM | RMSE (vehs/h) | 288.03 | 204.09 | 220.52 | 163.92 |
| | MAPE (%) | 11.53 | 10.16 | 11.67 | 12.02 |
| LSTM | RMSE (vehs/h) | 289.56 | 204.71 | 224.49 | 165.13 |
| | MAPE (%) | 12.38 | 10.56 | 11.99 | 12.48 |
| NiLSTM | RMSE (vehs/h) | 285.54 | 203.69 | 223.72 | 163.25 |
| | MAPE (%) | 12.00 | 10.14 | 11.57 | 11.76 |
| $ {\overline{\delta }}_{relax} $-LSTM | RMSE (vehs/h) | **280.54** | **195.28** | **220.08** | **161.69** |
| | MAPE (%) | **11.48** | **10.02** | **11.51** | **11.54** |

In addition, for the $ {\overline{\delta }}_{relax} $-LSTM network, K = 2. Combined with the experimental verification in the literature[31], the parameter ranges of λ1, λ2, δ1, δ2, c1, and c2 are set as follows: λ1 = [0.2, 0.4, 0.6, 0.8], λ2 = 1 − λ1, δ1 = [0.1, 0.3, 0.5, 0.7, 1, 3, 5, 7], δ2 = [1, 3, 5, 7, 10, 15, 30, 60], c1 = [−5, −3, −1, −0.5, 0, 0.5, 1, 3, 5], and c2 = [−5, −3, −1, −0.5, 0, 0.5, 1, 3, 5]. Through grid search, the best-performing parameters for each dataset are shown in Table 3; a sketch of this search appears at the end of this section.

Table 2. The hyperparameters for the LSTM, NiLSTM, and $ {\overline{\delta }}_{relax} $-LSTM networks.
| Hyperparameter | Value |
|---|---|
| Hidden layers | 1 |
| Hidden units | 256 |
| Batch size | 32 |
| Input length | 12 |
| Epochs | 200 |

Table 3. The parameter settings of $ \mathcal{L} $ for the $ {\overline{\delta }}_{relax} $-LSTM network.
| Dataset | λ1 | λ2 | δ1 | δ2 | c1 | c2 |
|---|---|---|---|---|---|---|
| A1 | 0.6 | 0.4 | 0.3 | 10 | 0 | −1 |
| A2 | 0.8 | 0.2 | 30 | 0.3 | 5 | 0 |
| A4 | 1 | 0 | 0.7 | 30 | 0 | −0.5 |
| A8 | 0.6 | 0.4 | 0.3 | 15 | 0 | −1 |

The performance results are listed in Table 1. According to these results, the prediction accuracy of the $ {\overline{\delta }}_{relax} $-LSTM is better than that of all the other baseline models. This is because the parametric models among the baselines can hardly capture the nonlinear relationships in traffic data with limited parameters and fixed model settings, while the machine learning baselines cannot accurately capture the long-term dependence between traffic flow sequences. In addition, the ordinary LSTM model is limited by its network setting and cannot effectively resist Gaussian noise and non-Gaussian noise at the same time. For these reasons, the baseline models struggle to achieve better performance in the real world. The $ {\overline{\delta }}_{relax} $-LSTM method fully considers the large uncertainty of traffic flow data and provides more selectivity and pertinence in the network setting, thereby obtaining a better prediction effect.
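The grid search over the parameter ranges given above could be sketched as follows, reusing the MCVCLoss sketch from earlier. Here `train_and_eval` is a hypothetical helper that trains the network with the given loss and returns a validation error; the exhaustive loop (20,736 combinations) is shown only for clarity.

```python
from itertools import product

lambda1_grid = [0.2, 0.4, 0.6, 0.8]
delta1_grid  = [0.1, 0.3, 0.5, 0.7, 1, 3, 5, 7]
delta2_grid  = [1, 3, 5, 7, 10, 15, 30, 60]
center_grid  = [-5, -3, -1, -0.5, 0, 0.5, 1, 3, 5]   # shared by c1 and c2

best = (float("inf"), None)
for l1, d1, d2, c1, c2 in product(lambda1_grid, delta1_grid, delta2_grid,
                                  center_grid, center_grid):
    loss_fn = MCVCLoss(lambdas=(l1, 1 - l1), deltas=(d1, d2), centers=(c1, c2))
    score = train_and_eval(loss_fn)          # hypothetical helper, e.g. validation RMSE
    if score < best[0]:
        best = (score, (l1, 1 - l1, d1, d2, c1, c2))
```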
In this paper, a $ {\overline{\delta }}_{relax} $-LSTM network for short-term traffic flow prediction is proposed. The network formulates a loss function in which the centers of the Gaussian mixture kernels can be placed at different positions, i.e., variable centers. In this way, the $ {\overline{\delta }}_{relax} $-LSTM network can effectively resist various noise distributions, such as Gaussian noise and impulse noise, to achieve high prediction accuracy and robustness. Extensive experiments on four benchmark datasets show that the $ {\overline{\delta }}_{relax} $-LSTM model performs better than typical prediction models as well as the most advanced LSTM-family models. In the future, we plan to explore the combination of Gaussian and non-Gaussian kernels as a new hybrid kernel and apply it to short-term traffic flow prediction.
About this article
Cite this article
Fang W, Li X, Lin Z, Zhou J, Zhou T. 2024. Mixture correntropy with variable center LSTM network for traffic flow forecasting. Digital Transportation and Safety 3(4): 264−270 doi: 10.48130/dts-0024-0023
- Received: 11 September 2024
- Revised: 28 October 2024
- Accepted: 11 November 2024
- Published online: 27 December 2024
Abstract: Timely and accurate traffic flow prediction is the core of an intelligent transportation system. Canonical long short-term memory (LSTM) networks are guided by the mean square error (MSE) criterion, so they can handle Gaussian noise in traffic flow effectively. The MSE criterion is a global measure of the total error between the predictions and the ground truth. When the errors between the predictions and the ground truth are independent and identically Gaussian distributed, MSE-guided LSTM networks work well. However, traffic flow is often impacted by non-Gaussian noise and can no longer maintain an identical Gaussian distribution. Therefore, a $ {\overline{\delta }}_{relax} $-LSTM network guided by the mixture correntropy with variable center (MCVC) criterion is proposed to simultaneously handle both Gaussian and non-Gaussian distributions. Extensive experiments on four benchmark traffic flow datasets show that the $ {\overline{\delta }}_{relax} $-LSTM network obtains more accurate prediction results than state-of-the-art models.
Key words:
- Traffic flow theory
- Machine learning
- Robust modeling
- Mixture correntropy