Research on Power Load Time Series Forecasting Method Based on
Transformer Model
by
Ruishen Liu
A Thesis Presented to the
FACULTY OF THE USC DEPARTMENT OF MATHEMATICS
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
MASTER OF ARTS
(APPLIED MATHEMATICS)
December 2024
Copyright 2024 Ruishen Liu



TABLE OF CONTENTS
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
Chapter 1: Research Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Research Background and Significance . . . . . . . . . . . . . . . . . . . . . 1
1.2 Research Status . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
Chapter 2: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1 Convolutional Neural Network (CNN) . . . . . . . . . . . . . . . . . . . . . 4
2.1.1 Basic Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1.2 Time Series Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.3 Applications in Power Load Forecasting . . . . . . . . . . . . . . . . 5
2.2 Long Short-Term Memory Network (LSTM) . . . . . . . . . . . . . . . . . . 6
2.2.1 Basic Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2.2 Time Series Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2.3 Applications in Power Load Forecasting . . . . . . . . . . . . . . . . 7
2.3 Transformer Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3.1 Basic Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3.2 Applications in Power Load Forecasting . . . . . . . . . . . . . . . . 11
2.4 Standardization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.5 Sliding Window . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.6 Mean Squared Error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13



Chapter 3: Experimental Design and Data Processing . . . . . . . . . . . . . . . . . . 15
3.1 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.1.1 Data Source . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.1.2 Data Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.2 Data Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.2.1 Data Standardization and Sequence Generation . . . . . . . . . . . . 18
3.2.2 Data Conversion to CNN Input Format . . . . . . . . . . . . . . . . 19
3.2.3 Creation of DataLoader . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.2.4 Data Conversion to LSTM Input Format . . . . . . . . . . . . . . . . 20
3.2.5 Data Conversion to Transformer Input Format . . . . . . . . . . . . 21
3.3 Model Construction and Training . . . . . . . . . . . . . . . . . . . . . . . . 22
3.3.1 Configuration and Training of the CNN Model . . . . . . . . . . . . . 22
3.3.2 Configuration and Training of the LSTM Model . . . . . . . . . . . . 26
3.3.3 Configuration and Training of the Transformer Model . . . . . . . . . 28
3.4 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
Chapter 4: Experimental Results and Analysis . . . . . . . . . . . . . . . . . . . . . . 33
4.1 Performance Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.1.1 CNN Results Display . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.1.2 LSTM Results Display . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.1.3 Transformer Results Display . . . . . . . . . . . . . . . . . . . . . . . 35
4.1.4 Comparative Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.2 Implementation Recommendations . . . . . . . . . . . . . . . . . . . . . . . 37
Chapter 5: Conclusion and Outlook . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5.1 Research Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5.2 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42



Abstract
This study compares the predictive capabilities of three deep learning models—
Convolutional Neural Network (CNN), Long Short-Term Memory Network (LSTM), and
Transformer—for the critical task of power load forecasting. Through experimental design
and model construction, the performance of each model in power load forecasting is verified, and a series of evaluation metrics are used for comparative analysis. The results show
that the Transformer model, with its multi-head attention mechanism, performs best in
capturing long- and short-term dependencies and handling complex fluctuations. Compared
with LSTM and CNN, the Transformer not only has higher prediction accuracy but also
better handles the periodic and seasonal fluctuations of power load. This study provides
an accurate load forecasting tool for power systems, which can effectively improve energy
management and scheduling efficiency. Additionally, the paper explores the possibility of
model optimization and practical deployment, proposing model selection and deployment
recommendations based on scenario requirements.



Chapter 1: Research Background
1.1 Research Background and Significance
With the rapid development of smart grids and the continuous growth of electricity
demand, power load forecasting has become crucial for the stable operation of power systems.
As a core component of the smooth operation of power systems, power load forecasting plays
a key role in power dispatching, energy management, and market transactions. Accurate
power load forecasting can help power companies anticipate future changes in electricity
demand, optimize production and dispatch strategies to meet electricity needs at different
times, and ensure the safety and stability of power systems. However, traditional load
forecasting methods, such as statistical methods and neural networks, still need improvement
in terms of prediction accuracy and robustness when facing complex and variable power
systems.
As a rapidly developing deep learning network architecture in recent years, the Transformer model has gradually become an emerging tool for power load forecasting due to
its adaptability and strong feature extraction capabilities in time series prediction. Power
load forecasting methods based on the Transformer model can effectively improve prediction
accuracy and stability and handle complex power system data.
This study aims to explore the potential of a Transformer-based power load forecasting
model in handling power load fluctuations, improving prediction accuracy, reducing energy
consumption, and enhancing grid dispatching capabilities. At the same time, the model will
be compared with traditional deep learning models, such as CNN and LSTM, to validate the
superiority of the Transformer model, providing theoretical support and practical solutions
for intelligent and refined management of power systems. This research not only helps
improve the production and dispatching efficiency of power companies but also provides
technical assurance for the sustainable development of the energy industry.



1.2 Research Status
Power load forecasting is a crucial component of the intelligentization of power systems, as
it is closely related to the economic operation and supply-demand balance of power systems.
High-accuracy short-term power load forecasting plays a significant role in ensuring the
safe and stable operation of the power grid, optimizing energy management, improving the
utilization of power generation equipment, and reducing operating costs.
In recent years, researchers have proposed various innovative methods to address the
complexity and nonlinearity of power load forecasting, aiming to improve the accuracy of
both short-term and long-term power load predictions. These methods mainly focus on deep
learning models, ensemble algorithms, and emerging neural network architectures.
Luo et al. [5] proposed a short-term power load forecasting method based on a combination of CNN-BiLSTM-Attention and XGBoost. This method selects similar day data
through adaptive hierarchical clustering and employs CNN for feature extraction. BiLSTM
captures the long-term dependencies in the time series, while the Attention mechanism optimizes weight allocation, ultimately improving prediction accuracy. Experimental results
show that the MAPE of this method is reduced by 5.88% to 69.40% compared to other
algorithms.
Nabavi et al. [6] combined LSTM with discrete wavelet transform (DWT-LSTM) to
address the impact of complex social and environmental factors on power consumption and
performed predictions on multi-country power market datasets. DWT-LSTM demonstrated
high accuracy in predictions across different time periods, with particularly outstanding
performance during special events such as holidays.
Wang et al. [8] proposed a short-term power load forecasting method based on CNN-BiLSTM-Attention. This method uses secondary data cleaning and adaptive variational
mode decomposition (VMD) for data processing, and employs CNN, BiLSTM, and an Attention mechanism for feature extraction and optimization, achieving high prediction accuracy.



Kshetrimayum et al. [3] developed a parallel ConvLSTM (PConvLSTM) model for short-term power load forecasting. This model processes two-dimensional features and effectively
captures spatiotemporal patterns in time series data. Experimental results show that PConvLSTM outperforms other models in multiple evaluation metrics, including MAE, MAPE,
and RMSE.
Yang et al. [9] proposed a transformer-based rolling prediction method for power load
using the Informer model. Through a self-attention distillation mechanism, they reduced the
input sequence length, improving the model’s memory efficiency and prediction accuracy.
Zhao et al. [10] proposed a load forecasting model combining NeuralProphet with BiLSTM-SA. By analyzing the impact of trends, seasonality, and holidays, this model demonstrated strong predictive capability during peak periods and introduced a peak-weighted
mean squared error metric to optimize model performance.



Chapter 2: Introduction
2.1 Convolutional Neural Network (CNN)
2.1.1 Basic Concepts
A Convolutional Neural Network (CNN) is a deep learning model widely used in the
field of image processing. It is composed of three main types of layers: convolutional layers,
pooling layers, and fully connected layers, each playing a different role in the data processing
workflow.
The core function of the convolutional layer is to extract features from the input data
by using learnable filters to capture the spatial and temporal local features of the data.
Each filter slides over the input and generates a feature map by computing the dot product
between the filter and the local regions of the input. The formula is as follows:
(X * W)_{i,j} = \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} X_{i+m,\,j+n} W_{m,n} + b,

where X is the input data, W is the convolutional kernel, b is the bias, k is the size of the kernel, and (i, j) is the position in the output feature map. [1]
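As a concrete illustration of this sum (not code taken from the thesis), the convolution with valid padding and stride 1 can be written directly in NumPy; the array sizes below are arbitrary example values.

    import numpy as np

    def conv2d_valid(X, W, b=0.0):
        # Direct implementation of the sum (X * W)_{i,j} with valid padding and stride 1
        k = W.shape[0]                                   # square k x k kernel
        out = np.zeros((X.shape[0] - k + 1, X.shape[1] - k + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                # dot product between the kernel and the local region of the input, plus bias
                out[i, j] = np.sum(X[i:i + k, j:j + k] * W) + b
        return out

    X = np.arange(25, dtype=float).reshape(5, 5)         # toy 5x5 input
    W = np.ones((3, 3)) / 9.0                            # 3x3 averaging kernel
    print(conv2d_valid(X, W).shape)                      # (3, 3) feature map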
The pooling layer is mainly used to reduce the dimensionality of the feature map to lower
the computational complexity of the model while enhancing the model’s robustness to local
variations in the input data. The formula for max pooling is:
P_{i,j} = \max_{m,n}\, X_{i+m,\,j+n},

where P is the output after pooling, X is the input data, (i, j) are the coordinates of the top-left corner of the pooling region, and (m, n) ranges over the pooling window. [4]
The fully connected layer is usually located at the end of a CNN network. It maps
the extracted features to the final output and further processes them through an activation
function to perform classification or regression tasks. The formula is as follows:
y = W \cdot x + b,

where x is the input vector, W is the weight matrix, b is the bias, and y is the output result.
2.1.2 Time Series Analysis
Although CNN was originally designed for image processing, its feature extraction capability is also well-suited for time series analysis. The convolutional layer can effectively
capture local patterns in time series data, such as trends and periodic variations. In time
series analysis, sliding window convolution is a commonly used technique. By defining a
window that slides along the time series, CNN can identify local temporal dependencies,
thereby enabling the prediction of future events.
2.1.3 Applications in Power Load Forecasting
The application of CNN in power load forecasting demonstrates its advantages in handling complex data. When using CNN for power load forecasting, the data needs to undergo
preprocessing steps, such as normalization, to ensure a more stable and efficient training
process.
The design and optimization of the model architecture are key to improving prediction
accuracy. In power load forecasting, adjusting the depth of convolutional layers, configuring
pooling layers, and determining the number of fully connected layers can enhance the model’s
ability to capture variations in power load. Increasing the number of convolutional layers
helps the model learn more complex temporal dependency patterns, while appropriately
setting pooling layers can help reduce overfitting.
In practical applications, CNN is trained using a large amount of historical power load
data to learn the complex variation patterns of power load. By optimizing hyperparameters
such as the learning rate, the model’s prediction performance can be further improved.
During the evaluation phase, metrics such as Mean Squared Error (MSE) and Mean Absolute
Percentage Error (MAPE) are used to evaluate the model, and it is compared with traditional
forecasting methods to validate the effectiveness of the CNN model in power load forecasting.
2.2 Long Short-Term Memory Network (LSTM)
2.2.1 Basic Concepts
The Long Short-Term Memory network (LSTM) is an improved Recurrent Neural Network (RNN) that can handle long-term dependency problems and addresses the vanishing
or exploding gradient issues encountered by traditional RNNs in long sequence data. LSTM
effectively manages the flow of information and maintains long-term dependencies through
three gating mechanisms: the forget gate, input gate, and output gate. The core of LSTM
is the cell state, which is updated or maintained through these gating mechanisms.
The forget gate determines which information in the current cell state needs to be forgotten. The formula is as follows:
f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f),

where f_t is the output of the forget gate, \sigma is the sigmoid activation function, W_f is the weight matrix of the forget gate, [h_{t-1}, x_t] is the concatenation of the previous hidden state h_{t-1} and the current input x_t, and b_f is the bias term of the forget gate.
The input gate determines which new information will be updated into the cell state.
The calculation formula for the input gate is as follows:
i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i).

The new candidate memory content is generated through the tanh function, and the formula is as follows:

\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C),

where i_t is the output of the input gate, \tilde{C}_t is the new candidate memory content, W_i and W_C are the weight matrices for the input gate and the candidate memory content, and b_i and b_C are the corresponding bias terms.
The cell state is then updated by combining the retained previous state with the new candidate content:

C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t.

The output gate determines the hidden state at the current time step. The calculation formula for the output gate is as follows:

o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o).

The final hidden state h_t is controlled by the output gate and computed through the tanh activation function:

h_t = o_t \cdot \tanh(C_t),

where o_t is the output of the output gate, h_t is the hidden state at the current time step, W_o is the weight matrix of the output gate, and b_o is the bias term of the output gate. [2]
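To make the gate equations concrete, the following is a minimal single-step LSTM cell written directly from the formulas above; the hidden size and weight shapes are illustrative assumptions, not the configuration used in later chapters.

    import torch

    def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_C, W_o, b_f, b_i, b_C, b_o):
        # One LSTM time step, following the gate equations in Section 2.2.1
        z = torch.cat([h_prev, x_t], dim=-1)          # [h_{t-1}, x_t]
        f_t = torch.sigmoid(z @ W_f.T + b_f)          # forget gate
        i_t = torch.sigmoid(z @ W_i.T + b_i)          # input gate
        c_tilde = torch.tanh(z @ W_C.T + b_C)         # candidate memory content
        c_t = f_t * c_prev + i_t * c_tilde            # cell state update
        o_t = torch.sigmoid(z @ W_o.T + b_o)          # output gate
        h_t = o_t * torch.tanh(c_t)                   # hidden state for this time step
        return h_t, c_t

    hidden, n_in = 8, 1                                          # illustrative sizes
    Ws = [torch.randn(hidden, hidden + n_in) for _ in range(4)]  # W_f, W_i, W_C, W_o
    bs = [torch.zeros(hidden) for _ in range(4)]                 # b_f, b_i, b_C, b_o
    h, c = torch.zeros(1, hidden), torch.zeros(1, hidden)
    h, c = lstm_step(torch.randn(1, n_in), h, c, *Ws, *bs)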
2.2.2 Time Series Analysis
The main advantage of LSTM is its ability to effectively handle and remember long-term dependencies, which is difficult to achieve with traditional RNNs. Through its gating
mechanisms, LSTM can forget certain information when it is not needed and remember
key information when necessary, thereby avoiding the vanishing gradient problem. This
makes LSTM highly effective in fields such as financial time series analysis, natural language
processing, and power load forecasting.
2.2.3 Applications in Power Load Forecasting
In power load forecasting, LSTM can utilize historical load data to predict future demand.
The model can enhance its ability to model time series data through multi-layer network
structures (such as bidirectional LSTM) and combine with attention mechanisms to improve
responsiveness to key time periods.
Feature selection: before training the model, data normalization is performed to ensure
that data of different magnitudes does not affect the training efficiency. Additionally, timestamp information is used as an input feature to capture the periodicity and trends in power
load.
Network parameter configuration: the input layer of the LSTM model is typically designed to accept continuous load data for 48 hours. The hidden layer is configured with
multiple units, and a bidirectional LSTM structure is employed to capture bidirectional dependencies in time series. By adjusting the learning rate and batch size, and using Mean
Squared Error (MSE) as the loss function, the model’s performance is continuously optimized.
2.3 Transformer Model
2.3.1 Basic Concepts
The Transformer model is a deep learning model used for processing sequential data. It
was initially applied to natural language processing tasks and has been widely adopted in
time series forecasting due to its outstanding performance in handling temporal data. The
core innovation of the Transformer model lies in its self-attention mechanism, which overcomes the limitations of traditional recurrent neural networks (such as LSTM) in processing
long sequences. The Transformer processes input sequences in a parallel manner, without
the need for step-by-step recursion, significantly improving computational efficiency.



Figure 2.1: The Structure of the Transformer Model
The basic structure of the Transformer can be divided into two parts: the Encoder and
the Decoder, and it is mainly composed of the following components:
(1) Self-Attention Mechanism
The self-attention mechanism is the core of the Transformer. It computes the relevance of
each element in the sequence to every other element without losing the positional information
of the sequence. Through the self-attention mechanism, the model can capture long-distance
dependencies in the sequence. The specific calculation process is as follows:
Attention Scores are generated by creating Query, Key, and Value vectors from the input
sequence. The attention scores are obtained by computing the dot product between the
Query and Key, as shown in the following formula:
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right) V,

where Q is the query matrix, K is the key matrix, V is the value matrix, d_k is the dimension of the key vectors, and 1/\sqrt{d_k} is a scaling factor used to prevent large dot-product values from causing vanishing or exploding gradients.
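A minimal PyTorch sketch of this computation (the tensor shapes are example values only):

    import math
    import torch
    import torch.nn.functional as F

    def scaled_dot_product_attention(Q, K, V):
        # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
        d_k = Q.size(-1)
        scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # [batch, seq, seq] similarity scores
        weights = F.softmax(scores, dim=-1)                  # normalize over the key positions
        return weights @ V                                   # weighted sum of the value vectors

    Q = K = V = torch.randn(2, 24, 64)                       # batch 2, sequence length 24, dim 64
    print(scaled_dot_product_attention(Q, K, V).shape)       # torch.Size([2, 24, 64])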
(2) Multi-Head Attention
The multi-head attention mechanism divides the Query, Key, and Value vectors into
multiple heads, performs self-attention calculations separately for each head, and then concatenates the results. The formula for multi-head attention is:
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \mathrm{head}_2, \ldots, \mathrm{head}_h)\, W^O,

where \mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V), and W^O is the output weight matrix.
(3) Feed-Forward Neural Network
The output at each position in the Transformer passes through a fully connected feedforward neural network. The feed-forward network typically consists of two layers of linear
transformations with a ReLU activation function in between. The calculation formula is:
\mathrm{FFN}(x) = \max(0,\, x W_1 + b_1) W_2 + b_2,

where W_1 and W_2 are the weight matrices of the two linear transformations, respectively, and b_1 and b_2 are the bias terms.
(4) Positional Encoding
Since the Transformer does not have a recurrent structure and cannot directly capture
the positional information of elements in a sequence, positional encoding is added to the
input sequence. Positional encoding is usually represented using sine and cosine functions:
\mathrm{PE}_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \qquad \mathrm{PE}_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d}}\right),

where pos represents the position in the sequence, i denotes the dimension index, and d is the dimension of the positional encoding. [7]
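A short NumPy sketch of the sinusoidal encoding above (the sequence length and dimension are example values):

    import numpy as np

    def positional_encoding(seq_len, d):
        # PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d))
        pos = np.arange(seq_len)[:, None]                # positions 0 .. seq_len-1
        i = np.arange(d // 2)[None, :]                   # dimension index
        angle = pos / np.power(10000.0, 2 * i / d)
        pe = np.zeros((seq_len, d))
        pe[:, 0::2] = np.sin(angle)                      # even dimensions
        pe[:, 1::2] = np.cos(angle)                      # odd dimensions
        return pe

    print(positional_encoding(seq_len=24, d=64).shape)   # (24, 64)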
2.3.2 Applications in Power Load Forecasting
The application of Transformer in power load forecasting benefits from its ability to
handle long-term dependencies. In this task, the Transformer model can leverage the self-attention mechanism to capture the global patterns of load data without facing the performance degradation issues encountered by RNNs with long sequence dependencies. By
incorporating auxiliary information such as timestamps and weather data, the Transformer
can significantly improve the accuracy of power load forecasting.
2.4 Standardization
The main goal of standardization is to rescale the data so that each feature has a mean of 0 and a standard deviation of 1, eliminating the dimensional differences between different features and thereby improving the convergence speed and stability of the model.



The formula for standardization is:
x' = \frac{x - \mu}{\sigma},

where x is the original data point (i.e., the feature value to be standardized), \mu is the mean of the feature across the entire training set, \sigma is the standard deviation of the feature in the training set, and x' is the standardized value.

The standardized data x' will have the following properties:
• Mean of Zero: The standardized data is centered at 0.
• Standard deviation of 1: The range of variation in the standardized data is adjusted
to have a standard deviation of 1.
\mu = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad \sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2},

where n is the number of samples in the training set, and x_i is the feature value of the i-th sample.
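In practice this is exactly what scikit-learn's StandardScaler computes; a brief sketch with made-up values, fitting the statistics on the training data only:

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    train = np.array([[2533.0], [2465.0], [2364.0], [2313.0], [2279.0]])   # toy training values
    test = np.array([[2400.0], [2350.0]])                                  # toy test values

    scaler = StandardScaler()
    train_scaled = scaler.fit_transform(train)   # mu and sigma are estimated on the training set only
    test_scaled = scaler.transform(test)         # the test set reuses the training-set statistics
    print(scaler.mean_, scaler.scale_)           # the fitted mu and sigma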
2.5 Sliding Window
The sliding window is a commonly used data processing technique, especially suited for
time series analysis. It works by moving a fixed-length window across consecutive time steps,
dividing the time series data into multiple overlapping or non-overlapping subsequences.
Each subsequence serves as input for the model to predict subsequent time steps.
The core idea behind the sliding window technique is to use continuous segments of the
time series as model inputs to capture the temporal dependencies in the data. Each time
the sliding window moves forward by one time step, a new input-output pair is generated,
continuing until the entire time series is covered. Each window can be viewed as a historical
observation, and the model uses this historical data to predict future trends. This method
allows the sliding window to help the model capture local patterns in the time series, such
as trends and periodic changes, making it suitable for multi-step forecasting tasks, such as
power load prediction.
The sliding window technique not only generates multiple training samples but also allows
flexibility in adjusting the window size and stride to suit different task requirements, making
it widely applicable in various time series analysis scenarios.
For example, given a time series \{x_1, x_2, \ldots, x_n\} and a window size \omega, the sliding window divides the series into multiple subsequences of length \omega. At the t-th step, the input taken from the sliding window is \{x_t, x_{t+1}, \ldots, x_{t+\omega-1}\}, and the model uses this window to predict the following time steps, i.e. \{x_{t+\omega}, \ldots, x_{t+\omega+h-1}\} for an h-step forecast.
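A minimal sketch of this segmentation for a one-dimensional NumPy series; the window length, horizon, and stride are example values:

    import numpy as np

    def sliding_windows(series, window, horizon, stride=1):
        # Split a 1-D series into (input, target) pairs of lengths `window` and `horizon`
        X, y = [], []
        for t in range(0, len(series) - window - horizon + 1, stride):
            X.append(series[t:t + window])                        # historical observations
            y.append(series[t + window:t + window + horizon])     # future values to predict
        return np.array(X), np.array(y)

    series = np.arange(100, dtype=float)                          # toy series
    X, y = sliding_windows(series, window=24, horizon=12)
    print(X.shape, y.shape)                                       # (65, 24) (65, 12)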
2.6 Mean Squared Error
The Mean Squared Error (MSE) loss function is a commonly used loss function in regression tasks. It measures the model’s error by calculating the mean of the squared differences
between the predicted values and the actual values. The goal of MSE is to minimize the
deviation between the predicted and actual values, thereby improving the model’s accuracy.
The formula for the MSE loss function is:
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2,

where (y_i - \hat{y}_i)^2 represents the squared error between the predicted value \hat{y}_i and the actual value y_i for the i-th sample.
The MSE loss function has two notable characteristics. First, because the error is squared,
MSE strongly penalizes large deviations, forcing the model to pay more attention to significant errors during optimization, thereby effectively reducing major prediction biases. Second,
MSE is smooth and differentiable, making it suitable for gradient descent optimization. By
minimizing MSE, the model continuously adjusts its parameters, bringing the predicted
values closer to the actual values and improving prediction accuracy.
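For reference, the loss can be computed by hand or with PyTorch's built-in criterion; the values below are made up:

    import torch
    import torch.nn as nn

    y_true = torch.tensor([2533.0, 2465.0, 2364.0])   # toy actual values
    y_pred = torch.tensor([2500.0, 2480.0, 2400.0])   # toy predictions

    mse_manual = torch.mean((y_true - y_pred) ** 2)   # (1/n) * sum of squared errors
    mse_builtin = nn.MSELoss()(y_pred, y_true)        # same value via the built-in criterion
    print(mse_manual.item(), mse_builtin.item())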



Chapter 3: Experimental Design and Data Processing
3.1 Data Collection
3.1.1 Data Source
The power load dataset used in this study comes from the PJM market, which covers multiple states in the eastern United States and is one of the largest power markets in the world. The dataset consists of hourly energy consumption data for the DEOK (Duke Energy Ohio/Kentucky) region, which covers parts of Ohio and Kentucky, spanning from January 1, 2012, to August 2018, with a collection frequency of once per hour. This provides high-precision time series data for short-term load forecasting tasks. Each record corresponds to the power load for one hour, measured in megawatts (MW), reflecting the energy consumption in the region at a specific time; this is the core feature used for modeling and prediction. The dataset is stored in CSV format, with each row containing a timestamp and the corresponding power load.
Datetime DEOK_MW
0 2012-01-01 01:00:00 2533.0
1 2012-01-01 02:00:00 2465.0
2 2012-01-01 03:00:00 2364.0
3 2012-01-01 04:00:00 2313.0
4 2012-01-01 05:00:00 2279.0
Table 3.1: The First 5 Rows of Electric Load Data
The dataset is a Pandas DataFrame with 57,739 rows and 2 columns, as shown in Table 3.1.
Datetime: The recording time, with a data type of datetime64[ns], indicating the
specific time corresponding to each record with an hourly granularity.
DEOK_MW: The hourly power load in the DEOK region, with a data type of float64.
The unit is megawatts (MW), reflecting the power consumption in the region at a specific
time.
The dataset has no missing values (each column has 57,739 non-null records), indicating
high data quality, suitable for further analysis and modeling. The dataset size is approximately 902.3 KB.
3.1.2 Data Features
To improve the model’s prediction accuracy for power load data, this study implemented
time-based feature engineering. The feature engineering process includes converting datetime formats, sorting the time series data, and setting up indices to ensure that the input
data fully expresses the dependencies in the time series, providing a clear input structure for
subsequent model training.
(1) Date-Time Conversion
The Datetime column in the original dataset is stored in string format, which is not
convenient for direct time series analysis. Therefore, it first needs to be converted to the
datetime64 format supported by Pandas. By using the pd.to_datetime() function, the
timestamp information can be correctly parsed. This step provides fundamental support for
subsequent time series operations, such as time window segmentation and data visualization.
(2) Time Series Sorting
To ensure the sequential order of the time series data and avoid issues with data disorder,
the sort_values() method is used to sort the data in ascending order based on the Datetime
column. This step ensures that the model can learn patterns in the time series without being
affected by data disorder, thereby improving prediction performance.
(3) Index Setting
Setting the index for time series data is a key step in feature construction. By setting the
Datetime column as the index of the DataFrame, the model can retrieve and analyze data
based on the time dimension. In practical applications, setting the index helps in quickly
accessing data for specific time periods and facilitates subsequent sliding window operations
and time series forecasting.
(4) Dataset Splitting
During the data preparation stage, the dataset is split into a training set and a test set
in an 80% ratio. The training set is used for training the model, while the test set is used
to evaluate the model’s generalization ability. This splitting method ensures that the model
can make effective predictions on new data and helps avoid overfitting.
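A condensed pandas sketch of steps (1)–(4); the CSV file name and the exact split point are assumptions for illustration:

    import pandas as pd

    df = pd.read_csv("DEOK_hourly.csv")               # assumed file name for the DEOK load data
    df["Datetime"] = pd.to_datetime(df["Datetime"])   # (1) parse timestamps
    df = df.sort_values("Datetime")                   # (2) sort the time series
    df = df.set_index("Datetime")                     # (3) index by time

    split = int(len(df) * 0.8)                        # (4) chronological 80% / 20% split
    train_data, test_data = df.iloc[:split], df.iloc[split:]
    print(train_data.shape, test_data.shape)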
(5) Data Visualization
To visually display the trend of power load changes over time, a time series plot of the
power load was created using the Matplotlib library.
Figure 3.1: Changes in Power Load over Time
Figure 3.1 shows the trend of power load changes (in megawatts, MW) over time in the
DEOK region from 2012 to 2018. The power load data in the figure is recorded hourly,
and clear periodic variations can be observed. With the change of seasons, the power load
exhibits cyclical fluctuations at different times. Electricity consumption is usually higher in
winter and summer, while relatively lower in spring and autumn, due to varying heating and
cooling demands caused by seasonal changes.
It can be seen from the figure that there are approximately two peak loads each year
(in summer and winter), accompanied by noticeable fluctuations, reflecting changes in seasonal electricity demand. Additionally, there are occasional sudden drops, which could be
due to anomalies or power outages. Overall, the trend of power load changes aligns with
expectations, showing significant seasonal variations in energy consumption.
3.2 Data Preprocessing
The data preprocessing steps in this study mainly include data normalization, the construction of time series data, and the conversion of data into a format suitable for input into
deep learning models. Figure 3.2 shows the preprocessing workflow.
Figure 3.2: Data Preprocessing Flowchart
In multi-step forecasting tasks, the data preprocessing workflows for CNN, LSTM, and
Transformer models are generally similar.
3.2.1 Data Standardization and Sequence Generation
First, the raw power load data is processed using StandardScaler. The goal of standardization is to rescale the data to a mean of 0 and variance of 1, thereby eliminating
the dimensional differences in the data, which helps the model converge more quickly. The
training data is standardized using the fit_transform method, while the test data is scaled
in the same way using the transform method.
• train_data_scaled: The standardized result of the training data.
• test_data_scaled: The standardized result of the test data.



After standardization, the data is transformed into input and target output sequences
suitable for multi-step forecasting. This is achieved using the create_multistep_sequences
function, which applies a sliding window technique to segment the time series data into input
sequences of length seq_length and output sequences of length n_steps:
• seq_length=24: The length of the input sequence, indicating that the model will use
data from the past 24 time steps as input.
• n_steps=12: The number of future time steps to predict; the model needs to forecast
power load values for the next 12 time steps.
In the create_multistep_sequences function, the process of generating input and output sequences is as follows:
The function iterates through the time series, extracting a data segment of length seq_
length as input (x), and the following data segment of length n_steps as the target output
(y). By iterating through the entire time series, multiple input-output pairs are generated,
and finally, these data pairs are returned as arrays.
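Based on this description, create_multistep_sequences could plausibly be implemented as follows (a sketch, not the thesis code verbatim):

    import numpy as np

    def create_multistep_sequences(data, seq_length=24, n_steps=12):
        # Turn a scaled series into (input, target) pairs for multi-step forecasting
        xs, ys = [], []
        for i in range(len(data) - seq_length - n_steps + 1):
            xs.append(data[i:i + seq_length])                          # past seq_length values
            ys.append(data[i + seq_length:i + seq_length + n_steps])   # next n_steps values
        return np.array(xs), np.array(ys)

    toy = np.random.rand(100)                          # stands in for train_data_scaled (flattened)
    X, y = create_multistep_sequences(toy, 24, 12)
    print(X.shape, y.shape)                            # (65, 24) (65, 12)
    # a trailing feature dimension can then be added with X[..., None] before tensor conversion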
3.2.2 Data Conversion to CNN Input Format
After generating the input and output sequences, the data needs to be converted into the
format required by the CNN model. The CNN model’s input typically requires specifying
the number of channels, which represents the feature dimension for each time step. For this
type of univariate time series data, the channel is set to 1.
• X_train and X_test: Input data, after dimension transformation, is adjusted to the
format [batch_size, channels, seq_length] using the permute function. Specifically, the original input data has the dimensions [batch_size, seq_length, 1],
which needs to be converted to [batch_size, 1, seq_length].
• y_train and y_test: Target output data, with the dimensions [batch_size, n_
steps], representing the predicted values for the next 12 time steps.



These tensors are converted from NumPy arrays into PyTorch tensor format using the
torch.from_numpy method, then cast to floating-point data and loaded onto the specified
computing device (CPU or GPU) for subsequent training and inference.
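A sketch of this conversion, assuming the arrays come out of the sequence-generation step with shape [num_samples, seq_length, 1] (toy data is generated here so the snippet stands alone):

    import numpy as np
    import torch

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # toy arrays standing in for the output of the sequence-generation step
    X_train = np.random.rand(1000, 24, 1).astype(np.float32)   # [num_samples, seq_length, 1]
    y_train = np.random.rand(1000, 12).astype(np.float32)      # [num_samples, n_steps]

    # [num_samples, seq_length, 1] -> [num_samples, 1, seq_length], as Conv1d expects channels first
    X_train_t = torch.from_numpy(X_train).float().permute(0, 2, 1).to(device)
    y_train_t = torch.from_numpy(y_train).float().to(device)
    print(X_train_t.shape)                                      # torch.Size([1000, 1, 24])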
3.2.3 Creation of DataLoader
To efficiently perform model training and testing, the data is organized into a DataLoader,
which extracts a small batch of data from the dataset at each iteration and feeds it into the
model:
• batch_size=128: In each training or testing iteration, 128 samples are extracted from
the dataset.
• train_loader: The training data loader extracts batches from TensorDataset(X_
train, y_train) and shuffles the data (shuffle=True) to ensure that the model
sees the data in a random order in each epoch, which helps improve the model’s
generalization ability.
• test_loader: The test data loader uses a batch size of 128 and loads the test data
in sequential order (shuffle=False), ensuring that the data order remains the same
during the evaluation process.
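A sketch matching these settings; the toy tensors stand in for the converted training and test data:

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    # toy tensors standing in for the converted training and test data
    X_train_t, y_train_t = torch.randn(1000, 1, 24), torch.randn(1000, 12)
    X_test_t, y_test_t = torch.randn(250, 1, 24), torch.randn(250, 12)

    train_loader = DataLoader(TensorDataset(X_train_t, y_train_t),
                              batch_size=128, shuffle=True)    # random order in every epoch
    test_loader = DataLoader(TensorDataset(X_test_t, y_test_t),
                             batch_size=128, shuffle=False)    # fixed order for evaluation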
3.2.4 Data Conversion to LSTM Input Format
The input format for an LSTM model is [batch_size, seq_length, input_size].
Unlike CNNs, LSTMs do not require adjusting the channel dimension, as LSTMs directly
process each time step and its corresponding features in the time series.
• X_train and X_test: In the LSTM model, the input data retains its original dimensions [batch_size, seq_length, input_size]. In this case, input_size=1, meaning each time step has only one input feature (the power load value). Therefore, there
is no need to adjust the channel dimension as done with CNNs.
• y_train and y_test: The output data is still the target values for the next 12 time
steps, with the dimension [batch_size, n_steps], consistent with the CNN section.
Unlike in CNNs, the LSTM data dimensions do not need to be transformed using the
permute function. The standard 3D tensor [batch_size, seq_length, input_size] can
be processed directly. Other steps, such as standardization, sequence generation, and the
creation of data loaders, are the same as in the CNN section and are not repeated here.
3.2.5 Data Conversion to Transformer Input Format
The input format for the Transformer model is [batch_size, seq_length, input_
dim], where input_dim represents the number of input features at each time step. Similar
to LSTM, the Transformer processes time series data without the need to adjust the channel
dimension as in CNNs.
• X_train and X_test: In the Transformer, the input data retains its original dimensions [batch_size, seq_length, input_dim]. In this case, input_dim=1, meaning
each time step has only one input feature. Therefore, there is no need to perform
the permute operation, and the data remains in the standard format [batch_size,
seq_length, input_size].
• y_train and y_test: The output data retains its dimensions as [batch_size, n_
steps], consistent with the settings in LSTM and CNN.
Similar to LSTM, the Transformer does not require channel adjustment. It only needs
to maintain the standard 3D tensor format [batch_size, seq_length, input_dim] for
processing. Other steps such as data standardization, sequence generation, and data loader
creation are the same as in the previous two models and will not be repeated here.



3.3 Model Construction and Training
3.3.1 Configuration and Training of the CNN Model
(1) Model Structure
The input data for the model is processed using a sliding window approach. The sliding
window divides the time series into multiple fixed-length subsequences, with each subsequence representing historical time step data used to predict multiple future time steps.
Specifically, the sliding window length in this paper is set to seq_length, meaning that the
model inputs historical load data covering seq_length time steps at a time and outputs
predictions for the next 12 future time steps through the network.
By using the sliding window mechanism, the model can leverage continuously shifting
historical segments during both training and prediction to generate forecasts for multiple
future time steps. Each time step within the window undergoes one-dimensional convolution
operations to extract local features, and the feature maps generated layer by layer produce
the prediction values. This design not only enhances the model’s ability to capture patterns
in the time series but also effectively utilizes historical data for multi-step forecasting.
The model adopts a two-layer one-dimensional convolutional architecture, fully considering the temporal dependencies in time series data. The input data is a univariate time
series, meaning that each time step contains only one load value, so the number of input
channels is set to 1.
The first convolutional layer consists of 64 output channels, with a kernel size of 3.
This means that each convolution operation covers three adjacent time steps, allowing the
model to effectively capture short-term dependencies in the local time series. The features
extracted by the convolution represent local patterns within the input sequence. Next, the
ReLU activation function introduces non-linearity, enabling the model to better represent
complex load variation patterns.
The second convolutional layer also consists of 64 output channels, with a kernel size of
3, and continues deep feature extraction from the feature map produced by the first layer.
This layer further uncovers potential patterns within the time series and captures higher-level
dependencies. The two convolutional layers work together, extracting features hierarchically,
helping the model to capture both short-term and long-term trends.
The output of the convolutional layers undergoes a flattening operation, converting the
two-dimensional feature maps into a one-dimensional vector. This vector is then passed
to a fully connected layer, which maps the high-dimensional feature space to the target
prediction dimension, i.e., the model’s output is the power load prediction for 12 future
time steps. This structural design enables the model to comprehensively utilize information
from previous time steps when making multi-step forecasts, improving the continuity and
accuracy of the predictions.
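One way to realize this architecture in PyTorch is sketched below; the layer sizes follow the description, while details such as the use of unpadded (valid) convolutions are assumptions, since the exact code is not reproduced in the text.

    import torch
    import torch.nn as nn

    class CNNMultistep(nn.Module):
        def __init__(self, seq_length=24, n_steps=12):
            super().__init__()
            self.conv1 = nn.Conv1d(1, 64, kernel_size=3)    # 1 input channel -> 64 feature maps
            self.conv2 = nn.Conv1d(64, 64, kernel_size=3)   # second convolutional layer
            self.relu = nn.ReLU()
            # each unpadded convolution with kernel size 3 shortens the sequence by 2 steps
            self.fc = nn.Linear(64 * (seq_length - 4), n_steps)

        def forward(self, x):                  # x: [batch_size, 1, seq_length]
            x = self.relu(self.conv1(x))
            x = self.relu(self.conv2(x))
            x = x.flatten(start_dim=1)         # flatten the feature maps into one vector per sample
            return self.fc(x)                  # [batch_size, n_steps]

    model = CNNMultistep()
    print(model(torch.randn(8, 1, 24)).shape)   # torch.Size([8, 12])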
(2) Model Training
During the model training process, MSE was chosen as the loss function. MSE measures
the deviation between the model’s predicted values and the true values, and it is more
sensitive to larger prediction errors, helping to reduce significant error points during training.
The Adam optimizer was used for gradient updates, combining the advantages of momentum
and adaptive learning rates, which accelerates convergence and helps avoid local minima. The
learning rate was set to 0.00001, a relatively low value, to ensure that the model converges
steadily during training, preventing unstable weight updates due to fast changes.
The training data samples were generated using a sliding window approach, where each
sample consisted of seq_length time steps as input and 12 time steps as output. As the
sliding window moves through the time series, the model learns different patterns from the
entire sequence. The data was split into a training set and a test set, with the training set used
to optimize model parameters and the test set used to evaluate the model’s generalization
ability.
For each batch, the model first performs forward propagation, where the input data
passes through the model layers to generate predictions. Then, the loss between the predicted
values and the true values is calculated, followed by backpropagation, where gradients are
computed, and the model weights are updated. During the training phase, the model is in
training mode, allowing for gradient updates; during the testing phase, the model switches
to evaluation mode, disabling gradient updates to ensure that the model parameters are not
affected during evaluation. After each batch, the model progressively minimizes the loss
function and updates its weights.
The entire training process lasted for 10 epochs, and after each epoch, the average loss
on the training and test sets was recorded to track the model’s performance over time. By
observing the training and test loss curves, overfitting can be detected, and it can be ensured
that the model is gradually converging to a global optimum. After training, the loss curve
was plotted to visually present the model’s training process and loss trend, allowing for the
assessment of the model’s convergence and stability. Finally, the model’s weight parameters
were saved in the file cnn_multistep_model.pth for use in future load forecasting tasks.
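A condensed, self-contained sketch of this training loop; the stand-in model and loaders mirror the earlier sketches, and the learning rate, epoch count, and output file name follow the text:

    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader, TensorDataset

    model = nn.Sequential(nn.Conv1d(1, 64, 3), nn.ReLU(), nn.Conv1d(64, 64, 3), nn.ReLU(),
                          nn.Flatten(), nn.Linear(64 * 20, 12))   # stand-in for the CNN above
    train_loader = DataLoader(TensorDataset(torch.randn(1000, 1, 24), torch.randn(1000, 12)),
                              batch_size=128, shuffle=True)
    test_loader = DataLoader(TensorDataset(torch.randn(250, 1, 24), torch.randn(250, 12)),
                             batch_size=128, shuffle=False)

    criterion = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)     # learning rate from the text

    for epoch in range(10):
        model.train()                                  # training mode: weights are updated
        train_loss = 0.0
        for xb, yb in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(xb), yb)            # forward pass and loss
            loss.backward()                            # backpropagation
            optimizer.step()                           # weight update
            train_loss += loss.item()

        model.eval()                                   # evaluation mode: no weight updates
        test_loss = 0.0
        with torch.no_grad():
            for xb, yb in test_loader:
                test_loss += criterion(model(xb), yb).item()
        print(f"epoch {epoch + 1}: train {train_loss / len(train_loader):.4f}, "
              f"test {test_loss / len(test_loader):.4f}")

    torch.save(model.state_dict(), "cnn_multistep_model.pth")     # file name from the text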
(3) Training Results



Figure 3.3: CNN Training and Test Loss over Epochs
As shown in Figure 3.3, the training loss and test loss of the CNN model both gradually
decrease as the number of epochs increases. Initially, the training loss and test loss are
relatively high, at 0.7389 and 0.5641, respectively, but they rapidly decline during the first
two epochs, indicating a significant improvement in the model’s fit to the data during the
early stages. Afterward, the loss values stabilize, and by the 10th epoch, the training loss has
dropped to 0.1785, and the test loss has dropped to 0.1986. This indicates that the model’s
performance on both the training set and the test set is relatively close, without obvious
signs of overfitting, and the model has gradually converged. This demonstrates that the
model has good generalization ability and is able to make reasonably accurate predictions
on the test data.



3.3.2 Configuration and Training of the LSTM Model
(1) Model Structure
The model consists of two layers of LSTM networks, each with 64 hidden units. The
input data is a univariate time series, meaning that each time step contains only one power
load value (input_size=1). Through the stacked structure of multiple LSTM units, the
model can effectively extract complex temporal features from the sequence data. The first
LSTM layer processes the input sequence and generates a set of hidden state vectors, while
the second LSTM layer further processes these hidden states, capturing deeper dependencies
within the sequence.
During the model’s forward propagation, the input data has the dimensions [batch_
size, seq_length, input_size], meaning each batch contains multiple samples, and each
sample consists of a series of time steps and features. The model retains the hidden state
of the last time step in the LSTM layer, which encapsulates the information from all time
steps. This hidden state is then passed through a fully connected layer to map it to the
predicted values for 12 future time steps (output_size=12). This design allows the model
to generate multi-step load forecasts based on historical load data.
To enhance the model’s generalization ability and prevent overfitting, a dropout rate
of 10% (dropout=0.1) is introduced between the LSTM layers. During training, dropout
randomly deactivates a portion of neurons, forcing the model to learn different combinations
of input features, which improves the model’s robustness. The model uses the parameter
batch_first=True, ensuring that the batch size comes first in the input data format, which
is common in time series tasks. This means that the first dimension is the batch size, the
second dimension is the time steps, and the third dimension is the number of features.
Finally, the model maps the hidden state of the last time step to the predicted power
load values for 12 future time steps through a fully connected layer. This structure enables
the model to capture key patterns from historical time series data and apply them to future
multi-step forecasting tasks.
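A PyTorch sketch consistent with this description (the exact thesis code is not reproduced here):

    import torch
    import torch.nn as nn

    class LSTMMultistep(nn.Module):
        def __init__(self, input_size=1, hidden_size=64, num_layers=2, output_size=12):
            super().__init__()
            self.lstm = nn.LSTM(input_size, hidden_size, num_layers,
                                batch_first=True, dropout=0.1)   # 2 stacked layers, 10% dropout
            self.fc = nn.Linear(hidden_size, output_size)

        def forward(self, x):                  # x: [batch_size, seq_length, input_size]
            out, _ = self.lstm(x)              # out: [batch_size, seq_length, hidden_size]
            last = out[:, -1, :]               # hidden state of the last time step
            return self.fc(last)               # [batch_size, 12]

    model = LSTMMultistep()
    print(model(torch.randn(8, 24, 1)).shape)   # torch.Size([8, 12])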



(2) Model Training
During the model training process, MSE was set as the loss function, which measures the
error between the model’s predicted values and the true values. Since MSE is more sensitive
to larger errors, it helps the model converge more quickly to an optimal solution. The
Adam algorithm was used as the optimizer, which combines the benefits of momentum and
adaptive learning rates, allowing for faster model convergence while avoiding local minima.
The learning rate was set to 0.0005, which is relatively high to accelerate training, while still
ensuring the stability of the model training process.
The model was trained for a total of 10 epochs, meaning the model was trained on the
entire dataset 10 times. During each epoch, the model extracted training samples from the
time series data using a sliding window. Each training sample consisted of seq_length
historical time steps as input data and 12 time steps as target output. During the forward
propagation stage, the input data first passed through the LSTM layers, and then through
the fully connected layer to generate predictions for the next 12 time steps. The loss between
the model’s predictions and the true values was then calculated, and the model entered the
backpropagation stage, where gradients were computed based on the loss, and the Adam
optimizer was used to update the model parameters and gradually optimize the weights.
In each batch, the model performed both forward propagation and backpropagation until
all batches of data were processed. After each epoch, the model evaluated its loss on both
the training set and the test set, tracking the training progress. During the testing phase,
the model switched to evaluation mode, disabling gradient updates to ensure that the model
weights were not updated during testing, allowing for an accurate assessment of the model’s
generalization performance.
(3) Training Results



Figure 3.4: LSTM Training and Test Loss over Epochs
The loss curves for the training process are shown in Figure 3.4. Both the training loss
and test loss gradually decrease as the number of epochs increases, demonstrating good
convergence of the model. Initially, the training loss and test loss were 0.8580 and 0.6440,
respectively. As the training progressed, both losses rapidly declined and then stabilized in
the later stages of training. By the 10th epoch, the training loss had dropped to 0.1988, and
the test loss to 0.2102, indicating that the model’s performance on both the training set and
test set was similar, with no significant signs of overfitting. The continuous decrease and
convergence of the losses suggest that the LSTM model effectively learned the patterns in
the power load time series, demonstrating strong predictive performance.
3.3.3 Configuration and Training of the Transformer Model
(1) Model Structure



The input to the Transformer model first passes through a linear projection layer, which
projects the original input data from input_dim=1 into a higher-dimensional space of
d_model=64. This projection step maps the low-dimensional time series data to a higher-dimensional feature space, enhancing the model’s expressive power and enabling it to better capture complex temporal patterns. By mapping the input features to a vector space
matching the model’s dimension, the model can more effectively learn and predict sequential
patterns.
After the projection, the data is passed into the Transformer encoder. The encoder
consists of num_layers=2 Transformer layers, each containing nhead=8 parallel multi-head
self-attention mechanisms. These attention heads process temporal features from different subspaces in parallel. Through this multi-head self-attention mechanism, the model
can simultaneously focus on multiple parts of the time series and capture the global dependencies in the data, enhancing its ability to recognize complex temporal patterns. Additionally, each encoder layer includes a feed-forward neural network with a dimension of
dim_feedforward=256, which further processes the features extracted by the self-attention
layers. This feed-forward network strengthens the model’s ability to capture and identify
temporal features, ensuring better performance on long sequence data.
The Transformer model requires the input data to have the format [seq_length,
batch_size, d_model]. Therefore, after projecting the input to the d_model dimension,
the input data’s dimensions are adjusted using x.permute(1, 0, 2) to match the encoder’s input requirements. After processing by the Transformer encoder, the temporal features are extracted and enhanced, and the data’s dimensions are restored to [batch_size,
seq_length, d_model] for further processing.
In the final step of the model, only the features of the last time step are retained and
passed through a fully connected layer. The purpose of this fully connected layer is to map
the extracted features into output_size=12 prediction results, representing the forecasted
power load for the next 12 time steps. With this design, the model is able to extract key
features from the historical time step information and generate accurate predictions for
multiple future time steps.
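A simplified PyTorch sketch consistent with this description (not the thesis code itself; positional encoding is omitted because it is not detailed in the configuration above):

    import torch
    import torch.nn as nn

    class TransformerMultistep(nn.Module):
        def __init__(self, input_dim=1, d_model=64, nhead=8,
                     num_layers=2, dim_feedforward=256, output_size=12):
            super().__init__()
            self.input_proj = nn.Linear(input_dim, d_model)          # project 1 -> 64 dimensions
            encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                                       dim_feedforward=dim_feedforward)
            self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
            self.fc = nn.Linear(d_model, output_size)

        def forward(self, x):                   # x: [batch_size, seq_length, input_dim]
            x = self.input_proj(x)              # [batch_size, seq_length, d_model]
            x = x.permute(1, 0, 2)              # [seq_length, batch_size, d_model] for the encoder
            x = self.encoder(x)
            x = x.permute(1, 0, 2)              # back to [batch_size, seq_length, d_model]
            return self.fc(x[:, -1, :])         # last time step's features -> 12 predictions

    model = TransformerMultistep()
    print(model(torch.randn(8, 24, 1)).shape)   # torch.Size([8, 12])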
(2) Model Training
During the training process, the time series data is processed using a sliding window
method. The sliding window divides the complete time series into multiple subsequences
of length seq_length, with each subsequence containing a segment of historical data. The
model uses these historical time steps to predict multiple future time steps. By moving the
sliding window, the model can observe different local sequences during training and use these
local features to construct predictions for future time steps.
Each time, the sliding window extracts a fixed-length segment of historical time steps
from different positions in the sequence as input and predicts the power load for 12 future
time steps.
In the training phase, the model uses MSE as the loss function, and the Adam optimizer
updates the weights with a learning rate of 0.00005. The model was trained for a total of 10
epochs. In each epoch, the model generated training samples using the sliding window, and
through forward propagation, backpropagation, and gradient updates, it gradually optimized
the weights. During the testing phase, the model switched to evaluation mode, disabling
gradient updates, and the model’s performance on the test set was recorded.
Throughout the training process, the training and test loss values were tracked, and
loss curves were plotted to visually demonstrate the training results. After training was
completed, the model’s weights were saved to the file transformer_multistep_model.pth,
allowing the model to be used for future multi-step power load forecasting tasks.
(3) Training Results



Figure 3.5: Transformer Training and Test Loss over Epochs
As shown in Figure 3.5, the loss curves for the model during training demonstrate that
both the training loss and test loss gradually decrease as the number of epochs increases.
Initially, the training loss was 0.5642, and the test loss was 0.4604. As the training progressed,
the model’s loss gradually decreased, and by the 10th epoch, the training loss had dropped
to 0.3895, and the test loss had dropped to 0.4287. The training loss converged quickly, and
the test loss gradually decreased, indicating that the model exhibited good convergence on
both the training set and test set without significant overfitting. The loss curves reflect a
stable training process, confirming the effectiveness of the Transformer model in multi-step
forecasting tasks.



3.4 Chapter Summary
This chapter introduces the process of experimental design and data processing. First,
power load data from the DEOK region in the PJM market was collected and processed
through feature engineering for model training and forecasting. During data preprocessing,
techniques such as standardization, time series sorting, index setting, and dataset splitting
were applied to ensure the data was suitable for input into the CNN, LSTM, and Transformer
models.
In terms of model construction and training, CNN, LSTM, and Transformer models were
designed and trained for multi-step power load forecasting. Mean Squared Error (MSE) was
used as the loss function, and the Adam optimizer was employed to optimize the parameters.
After 10 epochs of training, all models demonstrated good convergence on both the training
set and test set, with the loss curves showing stability and generalization ability. The
experimental results indicate that all three models performed well in power load forecasting,
and their respective structural characteristics enabled them to effectively handle time series
data.



Chapter 4: Experimental Results and Analysis
4.1 Performance Comparison
In this experiment, the Convolutional Neural Network (CNN) model, Long Short-Term
Memory (LSTM) model, and Transformer model were used to predict the power load in the
DEOK region. The models’ performances were analyzed using a series of evaluation metrics.
After training the models, predictions were made on the test set, and the predicted values
were compared with the actual power load values for a comparative analysis.
4.1.1 CNN Results Display
Figure 4.1: CNN Multistep Prediction: Electric Load
As shown in Figure 4.1, the comparison between the actual power load and the predicted
values from the CNN model over time is presented. In the figure, the blue curve represents
the actual power load values, while the red curve indicates the model’s predictions. It can be
seen that the model effectively captures the overall trend of the power load, especially during
periods of relatively stable load variation, where the predicted values closely match the actual
values. This figure effectively demonstrates the performance of the CNN model in multi-step
forecasting tasks, highlighting its feasibility and accuracy in power load prediction.
Regarding model performance evaluation, the model achieved an R² score of 0.9219 on
the test set, indicating that the model can explain over 92% of the power load variation.
Additionally, the model’s Mean Absolute Error (MAE) was 129.65 MW, and the Mean Absolute Percentage Error (MAPE) was 4.33%, indicating that the model can provide relatively
accurate load predictions during most time periods.
4.1.2 LSTM Results Display
Figure 4.2: LSTM Multistep Prediction: Electric Load
Figure 4.2 presents the multi-step forecasting results of the LSTM model for power load
prediction. The red curve represents the model’s predicted values, while the blue curve shows
the actual power load values. By comparing the two curves, it is evident that the model effectively captures the trend of load variation, especially during periods of relatively stable load
changes, where the predicted values closely match the actual values. The model demonstrates
strong predictive ability across different time periods, indicating that the LSTM model can
effectively capture the time series patterns of power load.



In terms of performance evaluation, the LSTM model achieved an R² score of 0.9364,
indicating that the model can explain most of the power load fluctuations. The Mean
Absolute Error (MAE) was 118.33 MW, and the Mean Absolute Percentage Error (MAPE)
was 4.03%, suggesting that the model can provide relatively accurate load predictions during
most time periods.
4.1.3 Transformer Results Display
Figure 4.3: Transformer Multistep Prediction: Electric Load
Figure 4.3 shows the comparison between the predicted power load values from the Transformer model (in red) and the actual values (in blue). It is evident that the predicted values
closely overlap with the actual values, demonstrating that the Transformer model accurately
captures the fluctuations in power load across different time periods. Over the period from April 2017 to August 2018, the Transformer model handles long-range dependencies in the time series particularly well, especially during the summer and winter peak load periods, where the predicted values closely match the actual values. This indicates that the Transformer model not only captures the cyclical fluctuations of power load but also handles complex temporal features, adapting well to load changes across different seasons.
The evaluation results show that the Transformer model achieved an R² score of 0.9510,
meaning it can explain over 95% of the power load variation. The Mean Absolute Error
(MAE) was 105.83 MW, and the Mean Absolute Percentage Error (MAPE) was 3.57%,
indicating that the model provides relatively accurate predictions during the vast majority
of time periods.
4.1.4 Comparative Analysis
Table 4.1 below provides a detailed comparison of the performance metrics of the CNN,
LSTM, and Transformer models in the power load forecasting task, including key statistics
such as R-squared, MAE, and MAPE.
Table 4.1: Model Performance Comparison

Model         R-squared (R²)    MAE (MW)    MAPE (%)
CNN           0.9219            129.6546    4.3288
LSTM          0.9364            118.3261    4.0336
Transformer   0.9510            105.8325    3.5653
It can be seen that the Transformer model performs the best across all three performance metrics, indicating that the Transformer is more accurate in capturing the temporal
dependencies of power load and has the smallest prediction error, resulting in higher model
accuracy.
When comparing the power load forecasting results of the CNN, LSTM, and Transformer
models, we can also analyze the models’ predictive performance through three visualization
line charts.
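For illustration, a comparison chart of this kind can be produced along the following lines; the helper function and its arguments are assumptions for demonstration, not the plotting code used to generate Figures 4.1 to 4.3.

import matplotlib.pyplot as plt

def plot_forecast(timestamps, y_true, y_pred, model_name):
    # Overlay actual and predicted load (blue = actual, red = predicted).
    plt.figure(figsize=(12, 4))
    plt.plot(timestamps, y_true, color="blue", label="Actual load")
    plt.plot(timestamps, y_pred, color="red", label="Predicted load")
    plt.xlabel("Time")
    plt.ylabel("Load (MW)")
    plt.title(f"{model_name} Multistep Prediction: Electric Load")
    plt.legend()
    plt.tight_layout()
    plt.show()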
In Figure 4.1, while the CNN model’s predicted values (in red) generally capture the
overall trend of the power load compared to the actual values (in blue), the predicted values
show weaker fluctuations in some highly volatile intervals, especially during peak periods
of seasonal changes. The model fails to fully capture the actual peaks and troughs. This
suggests that when dealing with complex time series data, the CNN may struggle to fully
capture long-term dependencies due to the locality of the convolutional layers. The CNN
model can predict the general trend well, but it falls short in capturing extreme fluctuations
in power load, resulting in slightly higher errors.
From the LSTM model’s visualization chart (Figure 4.2), we can see that the predicted
values (in red) are very close to the actual values (in blue), especially at the seasonal peaks
and troughs, where the predictions closely match the actual fluctuations. This is because
the LSTM model can effectively capture both short-term and long-term dependencies when
processing time series data. The LSTM model performs well in forecasting power load,
particularly during periods with significant seasonal fluctuations, with lower errors than the
CNN model. The model demonstrates good performance in handling dependencies in time
series data.
From the Transformer model’s visualization chart (Figure 4.3), it is evident that the
predicted values (in red) from the Transformer model have the highest degree of alignment
with the actual values (in blue). The model not only accurately captures the overall trend of
power load but also finely handles both short-term and long-term fluctuations, particularly
performing exceptionally well in peak and trough intervals. This is attributed to the Transformer’s multi-head attention mechanism, allowing the model to simultaneously focus on
different positions within the time series and capture more complex temporal relationships.
The Transformer model exhibits the best predictive performance, outperforming the CNN
and LSTM models across all evaluation metrics. It captures more details when handling
complex time series, significantly reducing errors.
4.2 Implementation Recommendations
For the task of power load forecasting, based on the performance comparison of the
aforementioned models and the requirements of practical application scenarios, the following
implementation recommendations are proposed to ensure effective deployment and efficient
operation of the models:
(1) Prioritize the Use of the Transformer Model
The Transformer model demonstrated the highest accuracy in this forecasting task, especially excelling in handling long-term dependencies and complex fluctuations. It is recommended to prioritize deploying the Transformer model in scenarios that require high
predictive accuracy. Its inherent multi-head attention mechanism can capture short-term
variations in power load and effectively manage long-term trends. Therefore, the Transformer
is the best choice for tasks involving medium to long-term power load planning, anomaly
detection, and seasonal demand fluctuation forecasting.
• Data Preparation: Ensure standardized and normalized processing of input data,
particularly the completeness and continuity of time series.
• Model Parameter Tuning: Adjust the hidden layer dimension (d_model), the number of attention heads (nhead), and the number of layers in the Transformer model according to the characteristics of different regions or demand profiles to optimize performance (a configuration sketch is given after this list).
• Hardware Requirements: The Transformer model is relatively complex and requires significant computational resources. It is recommended to perform training
and inference in a GPU environment to shorten training time and improve prediction
efficiency.
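As a rough guide, the following PyTorch sketch shows where d_model, nhead, and the number of layers enter a Transformer-based forecaster built on nn.TransformerEncoder; the layer sizes are placeholders, and the thesis's actual architecture (including its positional encoding, omitted here for brevity) may differ.

import torch.nn as nn

class LoadTransformer(nn.Module):
    # Illustrative Transformer encoder for load forecasting.
    # d_model, nhead, and num_layers are the tuning knobs mentioned above.
    def __init__(self, n_features=1, d_model=64, nhead=4,
                 num_layers=2, horizon=1):
        super().__init__()
        self.input_proj = nn.Linear(n_features, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer,
                                             num_layers=num_layers)
        self.head = nn.Linear(d_model, horizon)

    def forward(self, x):                 # x: (batch, seq_len, n_features)
        h = self.encoder(self.input_proj(x))
        return self.head(h[:, -1, :])     # predict from the last time step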
(2) Consider LSTM Model for Medium- and Short-Term Forecasting Tasks
The LSTM model performs well in capturing short-term changes and seasonal fluctuations in power load, making it suitable for medium- and short-term forecasting tasks. If
hardware resources are limited, or if the prediction time span is relatively short and the
accuracy requirements are not as high as for the Transformer, the LSTM model is a more
balanced choice. LSTM can provide relatively accurate predictions, especially in periods
with relatively stable fluctuations.
• Data Preprocessing: Ensure the completeness and standardization of time series
data, and adjust the length of the sliding window as needed to fit different time spans.
• Parameter Tuning: Adjust the hidden state size and the number of LSTM layers according to the length of the prediction horizon to balance accuracy and computational efficiency (see the sketch after this list).
• Training Environment: In a CPU or standard GPU environment, LSTM training completes quickly and yields accurate results, making the model suitable for resource-constrained scenarios.
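A minimal sketch of such an LSTM forecaster, with the hidden size and layer count exposed as the tuning knobs mentioned above, is given below; the default values are placeholders rather than the settings used in this study.

import torch.nn as nn

class LoadLSTM(nn.Module):
    # Illustrative LSTM forecaster; hidden_size and num_layers are adjusted
    # to the prediction horizon and available compute.
    def __init__(self, n_features=1, hidden_size=64, num_layers=2, horizon=1):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden_size,
                            num_layers=num_layers, batch_first=True)
        self.head = nn.Linear(hidden_size, horizon)

    def forward(self, x):                 # x: (batch, seq_len, n_features)
        out, _ = self.lstm(x)
        return self.head(out[:, -1, :])   # use the last hidden state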
(3) Consider CNN Model for Quick Predictions in Specific Scenarios
Although the CNN model’s predictive accuracy was slightly lower than that of the Transformer and LSTM in this experiment, it has unique advantages in handling local features and
training speed. CNN is suitable for quick predictions of short-term power load fluctuations,
especially when the task scenario focuses on high-frequency predictions or requires real-time
responsiveness. CNN can serve as a fast computational solution and provide acceptable
prediction accuracy in scenarios with limited computational resources.
• Model Adjustment: Simplify the CNN network structure to improve prediction speed, for example by removing some convolutional layers to reduce computational load and accelerate inference (a lightweight variant is sketched after this list).
• Application Scenarios: For real-time monitoring and temporary scheduling in power
systems where high precision is not required but quick response is needed, CNN can
be used as a fast prediction tool.
• Edge Computing Deployment: The CNN model is suitable for deployment on edge
devices for distributed prediction, reducing the load pressure on the main server.
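The sketch below illustrates one possible lightweight 1-D CNN of this kind, kept deliberately shallow for fast inference on constrained hardware; the channel count and kernel size are placeholders rather than the configuration used in this experiment.

import torch.nn as nn

class FastLoadCNN(nn.Module):
    # Minimal 1-D CNN for quick short-term load predictions, e.g. at the edge.
    def __init__(self, n_features=1, channels=16, horizon=1):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_features, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.head = nn.Linear(channels, horizon)

    def forward(self, x):                 # x: (batch, seq_len, n_features)
        x = x.transpose(1, 2)             # Conv1d expects (batch, channels, seq_len)
        h = self.conv(x).squeeze(-1)      # (batch, channels)
        return self.head(h)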
Chapter 5: Conclusion and Outlook
5.1 Research Conclusions
This study conducted a comprehensive comparison and analysis of the performance of
three models—CNN, LSTM, and Transformer—in the task of power load forecasting, aiming
to identify the most suitable model for handling complex time series data. The experimental
results indicate that the Transformer model excels at capturing both short-term and long-term dependencies, achieving the highest prediction accuracy and outperforming LSTM and
CNN across various performance metrics. The LSTM model shows good performance in
short-term forecasting tasks, while the CNN model, despite its strengths in processing local
features, is less effective in handling long-term dependencies.
Further improvements in prediction accuracy were achieved by fine-tuning hyperparameters and optimizing the architecture of each model. The Transformer, with its robust multi-head attention mechanism, more effectively captures the complex fluctuation characteristics
of power load data, especially during seasonal peaks and troughs. The results demonstrate
that the Transformer model holds significant potential for real-world power load forecasting
tasks, performing best when dealing with time series data exhibiting periodicity and complex
fluctuations.
5.2 Future Directions
Given the current limitations and results achieved in this study, future work can be
expanded in the following areas:
• Data Diversification: Future research could incorporate additional dimensions of
data, such as weather, socioeconomic factors, and holidays, to enhance the model’s
robustness and prediction accuracy. This would improve adaptability in diverse power
load scenarios.
• Model Optimization and Simplification: Efforts could focus on streamlining the model structure and reducing computational complexity while maintaining prediction accuracy. Exploring more efficient training algorithms will also be beneficial, especially for deployment in resource-constrained environments such as edge computing.
• Advanced Deep Learning Techniques: With the advancement of deep learning,
future work could explore models that integrate Graph Convolutional Networks (GCN)
or self-attention mechanisms to enhance the ability to capture complex power load
patterns and improve prediction accuracy.
• Improvements in Real-Time Prediction Systems: Research could focus on efficiently deploying models within actual power systems, optimizing data flow, and enhancing the model’s responsiveness and real-time predictive capabilities. This would
ensure more precise real-time scheduling and load management in power systems.
• Model Integration and Hybrid Approaches: Future studies could explore combining CNN, LSTM, and Transformer models, leveraging the strengths of each through
hybrid models or ensemble strategies to further improve prediction performance.