
forecasts which can lead to runaway error propagation. To mitigate this issue, we introduce cross-scale
normalization at each step.
Our approach re-orders model capacity to shift the focus toward scale awareness, but does not fundamentally alter the attention-driven paradigm of transformers. As a result, it can be readily adapted to work jointly with multiple recent time series transformer architectures, acting broadly orthogonally to their own contributions. Leveraging this, we chose to operate with various transformer-based backbones (e.g. FEDformer, Autoformer, Informer, Reformer, Performer) to further probe the effect of our multi-scale method across a variety of experimental setups.
Our contributions are as follows: (1) we introduce a novel iterative scale-refinement paradigm that
can be readily adapted to a variety of transformer-based time series forecasting architectures. (2) To
minimize distribution shifts between scales and windows, we introduce cross-scale normalization on
outputs of the Transformer. (3) Using Informer and Autoformer, two state-of-the-art architectures, as backbones, we empirically demonstrate the effectiveness of our approach on a variety of datasets.
Depending on the choice of transformer architecture, our multi-scale framework results in mean squared error reductions ranging from 5.5% to 38.5%. (4) Via a detailed ablation study of our
findings, we demonstrate the validity of our architectural and methodological choices.
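To make contributions (1) and (2) concrete, the following is a minimal sketch of how an iterative scale-refinement loop with cross-scale normalization could be wired around a generic transformer forecaster. The scale factors, the average-pooling and linear-interpolation choices, and the `forecaster` callable are illustrative assumptions, not the exact implementation detailed later in the paper.

```python
import torch
import torch.nn.functional as F

def cross_scale_normalize(history, forecast, eps=1e-5):
    # Share normalization statistics between the look-back window and the
    # current forecast so consecutive scales (and windows) see comparably
    # distributed inputs. The exact statistics used here are an assumption.
    joint = torch.cat([history, forecast], dim=1)                  # (B, T, D)
    mu = joint.mean(dim=1, keepdim=True)
    sigma = joint.std(dim=1, keepdim=True)
    return (history - mu) / (sigma + eps), (forecast - mu) / (sigma + eps)

def multi_scale_forecast(forecaster, history, horizon, scales=(16, 8, 4, 2, 1)):
    # Iteratively refine the forecast from the coarsest scale to the finest.
    forecast = None
    for s in scales:
        # Downsample the look-back window by average pooling with factor s.
        hist_s = F.avg_pool1d(history.transpose(1, 2), kernel_size=s).transpose(1, 2)
        horizon_s = horizon // s
        if forecast is None:
            # Coarsest scale: start from a zero-initialized decoder input.
            dec_in = history.new_zeros(history.size(0), horizon_s, history.size(2))
        else:
            # Finer scales: upsample the previous forecast as the decoder input.
            dec_in = F.interpolate(forecast.transpose(1, 2), size=horizon_s,
                                   mode="linear", align_corners=False).transpose(1, 2)
        hist_n, dec_n = cross_scale_normalize(hist_s, dec_in)
        forecast = forecaster(hist_n, dec_n)                       # (B, horizon_s, D)
    return forecast
```

At the finest scale, the output can be mapped back to the original data range by inverting the normalization; the forecaster itself is assumed to accept a normalized look-back window and decoder input of matching dimensionality.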
2 RELATED WORKS
Time-series forecasting: Time-series forecasting plays an important role in many domains, including
weather forecasting (Murphy, 1993), inventory planning (Syntetos et al., 2009), astronomy (Scargle,
1981), and economic and financial forecasting (Krollner et al., 2010). One distinguishing characteristic of time-series data is the need to capture seasonal trends (Brockwell & Davis, 2009). There exists a vast variety
of time-series forecasting models (Box & Jenkins, 1968; Hyndman et al., 2008; Salinas et al., 2020;
Rangapuram et al., 2018; Bai et al., 2018; Wu et al., 2020). Early approaches such as ARIMA (Box
& Jenkins, 1968) and exponential smoothing models (Hyndman et al., 2008) were followed by the
introduction of neural-network-based approaches involving either Recurrent Neural Networks (RNNs) and their variants (Salinas et al., 2020; Rangapuram et al., 2018) or Temporal
Convolutional Networks (TCNs) (Bai et al., 2018).
More recently, time-series Transformers (Wu et al., 2020; Zerveas et al., 2021; Tang & Matteson,
2021) were introduced for the forecasting task by leveraging self-attention mechanisms to learn
complex patterns and dynamics from time series data. Informer (Zhou et al., 2021) reduced quadratic
complexity in time and memory to O(L log L) by enforcing sparsity in the attention mechanism
with the ProbSparse attention. Yformer (Madhusudhanan et al., 2021) proposed a Y-shaped encoder-
decoder architecture to take advantage of the multi-resolution embeddings. Autoformer (Xu et al.,
2021) used a cross-correlation-based attention mechanism to operate at the level of subsequences.
FEDformer (Zhou et al., 2022b) employs a frequency transform to decompose the sequence into multiple frequency-domain modes for feature extraction, further improving the performance of Autoformer.
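As a rough illustration of how sparsity in the attention mechanism yields the O(L log L) cost mentioned above for Informer, the sketch below scores queries against a random subset of keys, lets only the top u ≈ c·log L queries attend to all keys, and falls back to the mean of the values for the rest. The scoring rule and the mean fallback are simplifications in the spirit of ProbSparse attention, not Informer's exact implementation.

```python
import math
import torch

def sparse_attention(Q, K, V, c=5):
    # Q, K, V: (batch, length, dim). Only ~c*log(L) "active" queries attend
    # to all keys; the remaining queries default to the mean of the values.
    B, L, d = Q.shape
    u = min(L, max(1, int(c * math.log(L))))
    # Score each query against a random sample of keys (cost ~ O(L log L)).
    idx = torch.randint(L, (u,), device=K.device)
    sample = Q @ K[:, idx, :].transpose(-2, -1) / math.sqrt(d)      # (B, L, u)
    sparsity = sample.max(dim=-1).values - sample.mean(dim=-1)      # (B, L)
    top = sparsity.topk(u, dim=-1).indices                          # (B, u)
    # Full attention for the selected queries only.
    Q_sel = Q.gather(1, top.unsqueeze(-1).expand(B, u, d))          # (B, u, d)
    attn = torch.softmax(Q_sel @ K.transpose(-2, -1) / math.sqrt(d), dim=-1)
    out = V.mean(dim=1, keepdim=True).expand(B, L, d).clone()
    out.scatter_(1, top.unsqueeze(-1).expand(B, u, d), attn @ V)
    return out
```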
Multi-scale neural architectures: Multi-scale and hierarchical processing is useful in many domains,
such as computer vision (Fan et al., 2021; Zhang et al., 2021; Liu et al., 2018), natural language
processing (Nawrot et al., 2021; Subramanian et al., 2020; Zhao et al., 2021) and time series
forecasting (Chen et al., 2022; Ding et al., 2020). Multiscale Vision Transformers (Fan et al., 2021)
was proposed for video and image recognition by connecting the seminal idea of multiscale feature hierarchies with transformer models; however, it focuses on the spatial domain and is specifically designed for computer vision tasks. Cui et al. (2016) proposed to use different transformations of a time series,
such as downsampling and smoothing in parallel to the original signal to better capture temporal
patterns and reduce the effect of random noise. Many different architectures have been proposed
recently (Chung et al., 2016; Che et al., 2018; Shen et al., 2020; Chen et al., 2021) to improve RNNs
in tasks such as language processing, computer vision, time-series analysis, and speech recognition.
However, these methods mainly focus on proposing new RNN-based modules that are not directly applicable to transformers. The same direction has also been investigated in Transformer, TCN, and MLP models. Recently, Du et al. (2022) proposed multi-scale segment-wise correlations
as a multi-scale version of the self-attention mechanism. Our work is orthogonal to the above methods