Designing Data-Effective Machine Learning Pipeline in
Application to Physics and Material Science
by
Chao Cao
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(PHYSICS)
May 2024
Copyright 2024 Chao Cao
Acknowledgements
I would like to extend my deepest gratitude to all those who have made the journey of
my Ph.D. both possible and memorable.
I would like to give my thanks to my advisor Professor Stephan Haas for his unwavering support, invaluable guidance, insightful critiques, and constant encouragement
throughout this research journey. Your mentorship has been a guiding light in my academic growth and professional development.
I am also profoundly thankful to Professor Marcin Abram. His broad knowledge
of machine learning and wise insight carried me through all the problems and troubles
throughout my research. He has always been responsible and helpful.
Also, I would like to thank my parents for their physical, emotional, and financial support. They made me, my every single day, and my journey through the Ph.D. possible.
Finally, I would like to thank all the teachers and professors who taught me, and all
my friends who accompanied me.
Table of Contents
Acknowledgements
List of Figures
Abstract
Chapter 1: A Brief Introduction to Machine Learning and Its Role in Physics Research
  1.1 Introduction to Machine Learning
  1.2 Neural Network
    1.2.1 Structure of Neural Network and Forward Propagation
    1.2.2 Loss Function
    1.2.3 Update Policy
  1.3 Role of Machine Learning in Physics
Chapter 2: Role of Data Representation in Machine Learning
  2.1 Importance of Data Representation in Physics
  2.2 Naive Data Representation Methods
    2.2.1 Feature Importance
    2.2.2 FragVAE
  2.3 Experiment on More Complex Data Representation Methods
    2.3.1 Data Preparation
    2.3.2 Experiment Procedure and Result
    2.3.3 Discussion
Chapter 3: Machine Learning Curricula for Sparse-label-related Physics Problems
  3.1 Background
  3.2 CIFAR100 - an Illustrative Example
    3.2.1 Problem Setup
    3.2.2 Methods
      3.2.2.1 Dataset Generation
      3.2.2.2 Model Architecture and Metrics
      3.2.2.3 Experiment on Different Curricula
    3.2.3 Results and Discussion
      3.2.3.1 Label Prediction with Different Curricula
      3.2.3.2 Discussion on Results and Limitations
  3.3 Superconductor - an Application
    3.3.1 Background
    3.3.2 Methods
      3.3.2.1 Dataset Generation
      3.3.2.2 Model Architecture and Metrics
    3.3.3 Results and Discussion
Chapter 4: Machine Learning Curricula for Dense-label-related Physics Problems
  4.1 Background
  4.2 Linear Quantum System
    4.2.1 Problem Setup
    4.2.2 Methods
      4.2.2.1 Prediction Pipeline
      4.2.2.2 Data Preparation
      4.2.2.3 Model Architecture and Metrics
    4.2.3 Results and Discussion
Chapter 5: Conclusion
References
Appendices
  A ESOL dataset
  B SuperCon Dataset
List of Figures
1.1 Typical structure of a simple neural network
2.1 Breakdown plot of feature importance
2.2 Illustration of FragVAE fragments
2.3 Performance comparison between different data representations
3.1 Accuracy plot for different training schedules on learning fine label prediction to predict coarse labels
3.2 Accuracy plot for different training schedules on learning coarse label prediction to predict fine labels
3.3 Prediction accuracy as the epoch of training increases
4.1 The machine-learning-based emulation pipeline
4.2 Structure of emulator neural network
4.3 Result of GRU-based Emulator
4.4 Model generalization performance
5.1 Snapshot of features of Extended ESOL Dataset
5.2 Snapshot of labels of Extended ESOL Dataset
5.3 Snapshot of SuperCon Dataset
Abstract
In this data-driven era, people are realizing the importance of data analysis, and the ability to harness vast amounts of information has become a cornerstone of progress. Machine learning, especially neural networks, has thus emerged as an invaluable tool to deal with these colossal data reserves. Beyond well-known applications in fields like computer vision, natural language processing, and finance, machine learning is also playing an increasingly important role in the realm of physics. Many outstanding studies have demonstrated how machine learning can enhance physics research. In [1], Artificial Neural Networks (ANN) are used to reduce the non-trivial correlations in the many-body wave function. [2] proposed an innovative approach to finding phase transitions by detecting peaks of network performance. Beyond these, supervised machine learning is also widely used to identify particles, as summarized in [3].
In my dissertation, I will focus on the following two categories of data that arise in much physics-related research: dense data, when one can sample arbitrarily from a data distribution; and sparse data, when one only has a limited database and knows nothing in between the entries. Dense-data problems, such as simulation problems, have arbitrarily many data points available, and thus how to select representative data from the dataset remains one of the major issues. A representative dataset would greatly improve the convergence speed while demanding minimal data generation and storage costs. On the other hand, when dealing with sparse data, such as phase prediction when the available data collection is not continuous, generalization capability takes the lead, because the model is given too few examples to learn from, and repeated training will surely induce overfitting in the expressive models that complicated problems require. For both cases, it is thus imperative to highlight the role of an effective sampling strategy: curriculum learning.
In this dissertation, we will focus on navigating the use of machine learning techniques and exploring possible optimizations to propose sampling strategies that could
improve model performance.
Our sampling strategy uses the notion of curriculum learning, which is inspired by the way humans learn progressively from easy to hard tasks. Instead of indiscriminately feeding data to models, curriculum learning organizes the learning procedure in a meaningful sequence. This structured process inherently possesses the two key features mentioned above: it can provide a smoother learning curve, making it easier for the model to generalize to unseen data and to converge faster to a local minimum.
Most curriculum learning processes emphasize teaching the model in an easy-to-hard order. In sparse-data-related problems, while we agree that this approach is reasonable, our work questions the necessity of this order and suggests that other strategies are possible.
In dense-data-related problems, with the capability of drawing arbitrarily many samples from a dense distribution, we propose a strategy for the creation of a curriculum
in the presence of a cost for the generation, storage, and computation of each new data
point, to ensure good prediction accuracy while also promoting generalization.
Throughout this dissertation, we will show many different physics-related example
problems and how, instead of feeding the neural network with data randomly, designing
a specific data sampling strategy could help with the model performance.
The outline of the dissertation is as follows: We will start with a brief introduction to machine learning in the context of physics to familiarize the reader with basic machine learning concepts, especially neural networks. Then we will discuss the importance of data representation in machine learning, which is the very first step of data preparation. After that, we will focus on how to use machine learning in sparse-label problems in physics, with two distinct examples. Finally, we will address machine learning in dense-label problems in physics, and provide another illustrative instance.
Chapter 1
A Brief Introduction to Machine Learning and Its Role in
Physics Research
1.1 Introduction to Machine Learning
Machine learning has many similar yet subtly different definitions. The definition that best fits the purpose of this dissertation comes from Murphy [4]: machine learning is a set of methods that can automatically detect patterns in data, and then use the uncovered patterns to predict future data or to perform other kinds of decision making under uncertainty. As can be seen from this definition, a typical machine-learning
procedure consists of three steps:
• Data Preparation: To perform data preparation, one will need to find a reliable data
source or generate data by themselves, and use tools or scripts such as web scraping, web API, or database queries to extract data from it. The collected data must
include the target of the task which is usually referred to as "labels". Labels can be
either continuous numbers such as temperatures, or discrete classes such as cats,
dogs, and rabbits. For those discrete classes, distinct integers starting from 0 are
usually assigned as numerical classes to further label them for the convenience of
the algorithms. However, simple data collection is, under most circumstances, not
sufficient. One should also perform data preprocessing to prepare the best data for
machine learning usage.
One common preprocessing practice is data cleaning [5], which addresses and rectifies issues such as outliers, missing values, or erroneous entries. This step ensures
the integrity and reliability of the dataset.
Other practices include data augmentation [6], which enhances the dataset with
new data points by applying operations such as image rotation, noise injection, and
window slicing.
Additionally, feature engineering is another very popular preprocessing step [7]. In feature engineering, one usually either creates new variables from the original data to obtain more informative or relevant features for the machine learning task, or deletes redundant or repetitive features that provide little helpful information toward the machine learning target. Sometimes one adopts a more radical approach called data transformation, which maps all features in the dataset into another set of features that might be more amenable to the machine learning model. Data transformation is also often referred to as data representation, a topic I will discuss in depth in Chapter 2.
Besides all the preprocessing steps above, data splitting is an essential step before data is actually fed to the machine learning model. The common practice in machine learning is to divide the dataset into three subsets: training, validation, and testing sets. The validation set helps the user select a better model with more sensible hyper-parameters, while the test set serves as a standard for performance assessment. The original data could be split into an 80-10-10 distribution, with 80% of the dataset for training, 10% for validation, and 10% for testing (a minimal code sketch of this whole pipeline appears after this list).
Data preparation is also a very important step because the quality and relevance
of data to the task directly influence the performance and reliability of the trained
model. If the data used to train the model is erroneous, the resulting model will
surely be inaccurate, since the model is based on the data collected. Data size is
a major concern. When the dataset is limited in size, the information provided
to discern the underlying pattern will be insufficient, and one would expect the
model to perform poorly or even overfit, a term describing cases in which a machine learning model tends to memorize training samples and is incapable of working on unseen cases [8]. On the other hand, if a dataset is full of samples with tens of thousands of input features, the resulting machine-learning model will be much more complex, placing a high load on the computational infrastructure. A raw dataset, when left untreated, can sometimes demand an unaffordable computational cost. Consequently, producing, transforming, and filtering high-quality datasets with a reasonable number of features and samples to feed the machine learning model is a crucial step before the training actually happens.
• Model and Model Training: With well-designed data preparation, the next step is
to find a suitable model to solve the task and make full use of the dataset, usually
referred to as a "machine learning model". And the process of utilizing available
data to alter parameters in the model is called "training the model". Depending
on the nature of the labels, machine learning algorithms can be divided into two
different groups. For continuous labels such as temperature, the algorithms used
are called regression models. To tackle discrete labels, the algorithms employed are
known as classification models. Since the birth of the word "machine learning" in
1959, a plethora of researchers have dedicated their efforts to the domain of machine
learning, leading to the proposition of a myriad of algorithms. Among all those
algorithms, the most well-known and widely-used algorithm is probably linear regression, which finds the best fit straight line between input features and resulting
labels. Other famous models include the Support Vector Machine (SVM). The Support
Vector Machine works by finding the hyperplane that best divides the dataset into
designated classes, maximizing the margin between data from different classes.
The decision tree algorithm is more complicated than the two mentioned above. A
decision tree scans through features and splits data into subsets based on the values,
forming a tree-like structure. Each tree node represents a decision made according
to what value the data has on the specific feature.
Random forest, a simple yet powerful machine learning algorithm, is a variant of the
decision tree. Random forest operates by constructing multiple decision trees, often
referred to as an "ensemble" of decision trees. Each decision tree is trained separately and casts its own vote when making predictions; these votes are then aggregated to produce the final result. This ensemble learning method is widely used across various domains due to its simplicity and effectiveness.
And finally, another important machine learning algorithm that drives today’s artificial intelligence is the neural network. A neural network is a machine learning
model inspired by the human brain’s structure, with several layers of interconnected
nodes called "neurons". After certain operations, the final layer will produce the desired outcome, either classification or regression. Neural network is the main theme
of this thesis and will be introduced in more detail in the next section.
With all these algorithms available to build the machine learning model, model
training is the essential step to make the model really work. Different models have
different parameters: linear regression models have slopes and intersects, decision
tree models have value thresholds to make decisions, and neural network models
have different weights and offsets in summing the nodes. To find the best set of parameters, all machine learning algorithms require data input for calibration. Each
entry of features and labels updates the parameters, and one full pass of the model over the whole dataset is called an "epoch". Several epochs of training are usually required for the model to learn sufficiently from the whole dataset. While 10, 20, or 30 epochs are commonly chosen numbers of training iterations in machine learning models, it is not unusual to see as few as 1 epoch or as many as 1000 epochs being used, depending on the nature and volume of the data as well as the structure of the model.
• Model Evaluation: The next phase after the model has undergone sufficient training
is named model evaluation. As mentioned in the data preparation, one separate
portion of the original dataset is split as the test set. In this phase, the test set is
fed into the model for outcomes to evaluate the overall performance of the model.
Again, depending on whether the task is regression or classification, the metrics for
evaluation are different.
For regression tasks, R-squared, or the coefficient of determination, determines the
proportion of variance in the dependent variable that can be explained by the independent variable and is usually used to judge whether the fit is good or not. Other
metrics include the Mean Absolute Error (MAE), which represents the average of the absolute differences between the predicted and actual values, and the Mean Squared Error (MSE), which is similar to MAE except that it uses the squared difference instead.
For classification tasks, accuracy is a straightforward performance metric, calculating the ratio of correctly predicted instances to the total number of instances.
However, under certain circumstances, such as an imbalanced dataset, other more
complicated metrics are necessary. For example, if the test set contains 90% labeled
with cats and 10% labeled with dogs, simply predicting everything as a cat can already achieve an accuracy of 90%. In this scenario, the accuracy metric does not adequately reflect the model’s performance shortfall, highlighting the need for more
nuanced evaluation metrics to assess the model’s true predictive capabilities. To
address data imbalance issues, the F1-score is the harmonic mean of Precision and Recall, where Precision is the ratio of correctly predicted positive observations to
the total predicted positives and Recall is the ratio of correctly predicted positive
observations to all actual positives. Another popular metric is Area Under the ROC
Curve (AUC) which "provides an aggregate measure of performance across all possible classification thresholds" to measure the ability of a model to distinguish between two classes and thus find a good balance between sensitivity and specificity
irrespective of the actual class distribution. Even with the metrics calculated, however, the whole machine-learning process is not a one-off task. Depending on how good the metrics look and how they compare to the metrics on the training data, different optimizations can be applied to further improve the model in another round of machine learning. A low value on both training and test metrics implies a weak and inadequate model; in such cases, a more complex model could improve the situation. A more common case is when the model performs better on the training set than on the test set, which was mentioned in the data preparation section as "overfitting". When overfitting happens, the model tends to memorize the training data and is thus unable to generalize to unseen cases in the test set. To address overfitting, one usually needs to enhance the dataset by collecting more data or applying data augmentation, use simpler models, or reduce the number of epochs so that the model does not start to memorize information.
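To make the three steps above concrete, here is a minimal sketch of the prepare-train-evaluate cycle using scikit-learn; the synthetic dataset, the 80-10-10 split, the random forest model, and the metric set are illustrative assumptions, not the exact pipeline used later in this dissertation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Step 1: Data preparation. Synthetic stand-in for a cleaned, numeric dataset.
X, y = make_classification(n_samples=1000, n_features=15, random_state=0)

# 80-10-10 split: first carve out 20%, then halve it into validation and test.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.2, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Step 2: Model and model training.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# The validation set guides model and hyper-parameter choices (here just reported).
print("validation accuracy:", accuracy_score(y_val, model.predict(X_val)))

# Step 3: Model evaluation on the held-out test set.
y_pred = model.predict(X_test)
print("test accuracy:", accuracy_score(y_test, y_pred))
print("test F1:", f1_score(y_test, y_pred))
print("test AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```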
1.2 Neural Network
Among all the machine learning algorithms mentioned in the previous section, the one that currently attracts the most interest is the neural network. In 1989, [9] made a seminal contribution to the field by establishing a universality theorem, stating that with an appropriate activation function in place, a neural network can approximate any given function to an arbitrary degree of precision, provided there is a sufficient number of neurons within its hidden layer. Subsequent scholarly endeavors, including [10], [11], [12] and [13], have built upon Cybenko's pioneering work, underscoring the pivotal role of architectural design in neural networks. The inherent universality of neural networks, as delineated by these findings, provides a compelling rationale for their extensive utilization in computational research, even though no effective algorithm so far can guarantee that a model converges quickly enough to the ideal solution.
In this section, we will introduce what a neural network is and how it operates, and
familiarize readers with neural-network-related terms in this dissertation.
1.2.1 Structure of Neural Network and Forward Propagation
The structure of a typical Neural Network can be seen in Figure 1.1. The basic element of the
network is called "Neurons". Several Neurons constitute one layer of the network, and
Neural Networks usually have multiple layers, each with a different number of Neurons.
The first layer is called the "input layer", which is directly transferred from the result of
data preparation. The last layer is the result of the whole Neural Network and is usually
referred to as the "output layer". All the remaining layers between the input layer and
output layer are named "hidden layers".
A neural network with two or three hidden layers is usually enough to solve simple
tasks such as MNIST handwritten digit recognition with more than 90% accuracy. However, with increasingly demanding and complicated tasks, the number of hidden layers
to fully solve the problem also grows, and can expand up to more than 100 layers. Those
networks with a large number of hidden layers are widely known as "Deep Neural Networks".
For a typical Neural Network, every Neuron of a given layer establishes linkages with
all Neurons in the subsequent layer. Each of these linkages is allocated a specific weight.
Figure 1.1: Typical structure of a simple neural network
The leftmost layer is the input layer which is the direct result of data preparation. The
last layer is the output layer which contains the result of the network. All intermediate
layers are called hidden layers. Advancing from one layer to the next involves matrix calculations and activation functions.
The transformation of the initial layer to every Neuron in the succeeding layer is effectuated through the inner product involving the initial layer and the respective weights associated with each linkage. This inner product is followed by a non-linear map f : R → R,
which is often called an "Activation Function". Such a function is necessary to ensure the
Neural Network is capable of approximating any function to a high precision, as is stated
in the universality theorem. The most popular activation function is rectified linear unit
(ReLU), which is written as f(x) = max(0, x). Other popular activation functions include the Sigmoid function f(x) = 1/(1 + e^(-x)) and the Hyperbolic Tangent f(x) = tanh(x). These activation functions have been verified by numerous scholars to be effective in improving network performance. The activation function usually signifies the end of information
transformation to the next layer. Now with a new layer of numbers, this transformation
procedure is repeated until the output layer is reached and the outcome is obtained. This
procedural sequence from the input layer to the output layer is denominated as "Forward
Propagation". When the network architecture is well-designed and all the parameters
in the network are assigned proper values, the Neural Network Model will be able to
generate desired outcomes for any given input.
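To illustrate forward propagation, here is a minimal NumPy sketch of the repeated multiply-then-activate pattern just described; the layer sizes and random weights are arbitrary assumptions chosen only for demonstration.

```python
import numpy as np

def relu(x):
    # Rectified linear unit: f(x) = max(0, x), applied element-wise.
    return np.maximum(0.0, x)

rng = np.random.default_rng(0)

# Arbitrary layer sizes: 15 input features -> 32 and 16 hidden -> 3 outputs.
sizes = [15, 32, 16, 3]
weights = [rng.normal(size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def forward(x):
    # Each step is an inner product with the weights followed by activation;
    # the final layer is left linear here, as is common for regression outputs.
    for W, b in zip(weights[:-1], biases[:-1]):
        x = relu(x @ W + b)
    return x @ weights[-1] + biases[-1]

output = forward(rng.normal(size=15))
print(output)  # one value per neuron in the output layer
```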
1.2.2 Loss Function
With the Neural Network structure finished, the model is now able to generate results. But it has not learned from the dataset yet and thus cannot perform better than random guessing. The parameters inside the network need to be tuned to best
mimic the map from input features to output labels and minimize the difference between
truth and prediction.
The very first step to improve is to understand how far away the prediction result is
from the ground truth, and this is where the "Loss Function" plays its role: measure how
well the model’s predictions match the actual data. As machine learning tasks can be
divided into regression tasks and classification tasks, different loss functions are designed to suit each of them.
For regression tasks, Mean Squared Error (MSE) is a popular loss function. It measures
the average squared difference between the estimated values and the actual value. The
MSE is formally defined as:
$$\mathrm{MSE}(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \qquad (1.1)$$

where $y_i$ denotes the true value for the $i$-th data point, $\hat{y}_i$ denotes the predicted value for the $i$-th data point, and $n$ is the number of observations in the dataset.
Other loss functions include Mean Absolute Error (MAE), which calculates absolute
difference instead of squared difference. This slight change in the function definition,
however, causes MSE and MAE to behave very differently. The presence of the square
function makes MSE very sensitive to outliers with large errors, while MAE tends to
measure all errors on the same linear scale. This difference in tolerance to outliers therefore needs to be taken into consideration when selecting the appropriate loss function. Also, the absolute value tends to encourage sparser solutions because of the sharp kink it induces at zero, the same property exploited by L1 regularization.
For classification tasks, Cross Entropy is the most widely-used loss function defined
as:
$$H(p, q) = -\sum_{i=1}^{C} p(i) \log(q(i)) \qquad (1.2)$$
where C is the number of classes, p(i) represents the true probability of the observation
being in class i, which is 1 for the true class and 0 for all other classes, and q(i) represents
the probability of the observation being in class i as predicted by the neural network.
Note here that while the loss functions introduced above are the most widely used, loss function construction is never rigid. A real-life loss function can be a combination of different basic loss functions, with the weights of different terms adjusted higher or lower according to how important the corresponding result is. It can also contain losses calculated from intermediate layers if those are important targets to optimize. L1 and L2 regularization are also common techniques in loss function construction; they add a penalty term based on the weight vector of the network to the loss function in order to restrict large weights.
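As a concrete illustration, here is a minimal NumPy sketch of the loss functions defined above, together with a composite loss carrying an L2 penalty; the penalty coefficient of 0.01 is an arbitrary choice for demonstration.

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean squared error, Eq. (1.1).
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    # Mean absolute error: linear in the size of each error.
    return np.mean(np.abs(y_true - y_pred))

def cross_entropy(p, q, eps=1e-12):
    # Cross entropy, Eq. (1.2); p is the one-hot truth, q the predicted probabilities.
    return -np.sum(p * np.log(q + eps))

def composite_loss(y_true, y_pred, weights, l2=0.01):
    # A combined loss: data term plus an L2 penalty restricting large weights.
    return mse(y_true, y_pred) + l2 * np.sum(weights ** 2)

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 1.9, 3.3])
print(mse(y_true, y_pred), mae(y_true, y_pred))
print(cross_entropy(np.array([0, 1, 0]), np.array([0.2, 0.7, 0.1])))
print(composite_loss(y_true, y_pred, np.array([0.5, -1.2])))
```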
1.2.3 Update Policy
While loss functions are able to measure the error or distance between prediction and
truth, an update policy to improve parameters inside the neural network needs to be
implemented to utilize the result.
• Stochastic Gradient Descent (SGD): SGD is the simplest update policy that updates
the weight of every connection in the network by calculating the gradient of the loss
function with respect to that weight for every single training sample. The gradient
for those weights indirectly connected to the result is calculated with the chain rule
and the whole computation process is called "Backpropagation". A "learning rate"
is usually multiplied by every update on each weight. This "learning rate" is used
to balance over-corrections and under-corrections. Setting the learning rate too low
can result in an overly conservative update policy, causing the model to become
trapped in a local minimum of the loss function or converge at a very slow pace.
Conversely, a too-high learning rate can lead the network to overshoot the local
minimum, or in some cases even begin to diverge. Determining the optimal learning
rate value typically requires a process of trial and error.
• Mini-batch Gradient Descent: Mini-batch Gradient Descent is a similar approach that uses gradients computed from the input training data. This policy alleviates the computational burden with minimal impact on the result by updating the weights after computing the gradient on a small subset of the training data (a mini-batch), and is thus a more practical algorithm. On the other hand, the mini-batch gradient is noisier, which helps the model jump out of local minima and consequently gives it a better chance of landing in the global minimum.
• Learning Rate Schedule: In most cases, parameter updating is most abrupt at the beginning. In the later stages of training, after several epochs, one would expect the parameters to oscillate around a certain local minimum instead of jumping from one spot to another. To follow this intuition, a learning rate schedule is often used: it starts from a large learning rate and gradually decreases it over the epochs or the number of samples trained. An exponential function is usually used for this schedule, but linear functions or step functions are also possible choices.
• Adaptive Moment Estimation (Adam): Adam is a much more complicated algorithm than those mentioned above, but is still based on Gradient Descent. Adam
keeps records of two moments, one as the rolling mean of gradients to determine the
value of correction to every weight in the network, and one as the uncentered variance of the gradients to alter the learning rate adaptively. Adam is widely accepted
as the default optimizer for neural networks because of its efficiency in training and
versatility under different cases.
• Adaptive Moment Estimation with Weight Decay (AdamW): As can be seen from
the name, AdamW is a variant of the Adam algorithm, and is the main optimizing
algorithm used in this dissertation for most of the Neural Network Models trained.
Besides all the steps in Adam, AdamW also incorporates a decay applied directly to
all the weights in the network. While retaining all the advantages in Adam, AdamW
often leads to a model that can generalize better than those trained with other optimization algorithms.
• Gradient Clipping: In some cases during the process of training, gradient explosion
could happen, which means some calculated gradient is abnormally high, and thus
the corresponding weight would be drastically changed, leading to a total failure
in training. Gradient Clipping is the simple solution to this issue by clipping the
gradients if they exceed a certain threshold assigned. Gradient clipping is a separate technique in gradient updating and thus can often be used together with other
policies such as SGD or AdamW.
Update policy is a key step in neural network training. An inappropriately selected update policy would result in an unsatisfactory performance or sometimes even a consistent failure to converge. Determining the optimal update policy and identifying the most
effective set of parameters, especially the learning rate, usually requires extensive experimentation.
1.3 Role of Machine Learning in Physics
It is compelling to acknowledge the often-overlooked intersection between machine learning and physics: though distinct fields of study, the two share several interesting similarities.
First of all, both of them are based on data. While physics studies matter and its
constituents with various theories, it is fundamentally driven by experimental data. Experimentation forms the cornerstone of the discipline, serving as both the genesis and
benchmark for all theories. This emphasis on data as a foundational element is a thread
that also runs through the field of machine learning. The very first step of machine learning is data collection. Machine learning demands an enormous amount of well-labeled
high-quality data to function properly.
Another resemblance lies in their use of modeling and prediction. Physics theories,
from Newton’s Law to Quantum Mechanics, are intricately constructed models to explain
and predict natural phenomena. Whereas in machine learning, various algorithms with
different hyper-parameters and parameters serve as models to train on collected data and
make predictions.
These shared characteristics between physics and machine learning suggest a reciprocal benefit between the two fields, implying that methodologies and principles from one
discipline can potentially enhance the other. In fact, such enhancement is not a recent
development. Machine learning is deeply rooted in physics. Hopfield Network is often
known as the precursor of modern deep learning [14]. Invented by physicist and chemist
John Hopfield, the Hopfield Network is a form of Recurrent Artificial Neural Network
that resembles a type of spin glass system [15].
Given this historical background and these similarities, a "backpropagation" of ideas from machine learning to physics is an emerging and increasingly recognized trend. The potential of machine learning to enhance various fields within physics is substantial.
• Data Analysis and Interpretation: Machine learning excels at processing and analyzing large datasets, a capability crucial in physics research involving extensive
data from sources like particle accelerators or telescopes. By employing machine learning models, researchers can find patterns and anomalies in vast amounts of data more efficiently. This is already widely used in astronomy. [16]
employed a novel combination of genetic algorithms (GA) and Support Vector Machine (SVM) learning algorithms. This combination was specifically applied to the
tasks of star-galaxy separation and the estimation of photometric redshifts of galaxies. Their approach yielded high accuracy while operating under significantly fewer
assumptions than traditional methods.
• Simulation: In theoretical physics, machine learning can aid the simulation of complex systems. Clever machine learning designs can bypass unnecessary intermediate steps and focus directly on the final result. This flexibility allows researchers
to accelerate simulations considerably, achieving speeds hundreds of
times faster than traditional methods. [17] proposed a machine learning framework
called "Graph Network-based Simulators" (GNS) that is capable of performing simulation in a wide range of challenging physics domains including liquids, rigid
solids, and deformable materials interacting with one another.
• Generalization and Extended Prediction: Effective machine learning models, known
for their strong generalization capabilities to unseen cases, can be instrumental in
physics for feature space exploration. This is extremely important for materials-science-related problems. With such models, researchers can identify inputs that
may lead to previously unobserved properties, offering insights into the discovery of materials with novel characteristics. The design of high-temperature alloys,
for example, requires simultaneous consideration of various mechanisms operating
at different length scales. [18] proposed a workflow that integrates critical physical
principles with machine learning (ML) techniques. This approach is aimed at predicting the properties of complex high-temperature alloys, exemplified by the yield
strength of steels containing 9–12 wt% chromium. [19] is a more recent example
that implemented a scaled machine learning model called graph networks for materials exploration (GNoME). This model successfully identified thousands of new
stable chemical structures, showcasing the potential of advanced machine-learning
techniques in the field of material discovery and exploration.
• Interpretability and Reverse Engineering: Model interpretability can be very important and promising in physics, because it could potentially contribute to theoretical insights. Reverse engineering an interpretable model could help fully explain the whole physical procedure and pave the way
for the discovery of new physics. Even in the case of a non-interpretable model,
analysis of the weights can indicate feature importance, shedding light on which
aspect of the data is deemed significant, thus guiding researchers by highlighting
areas that warrant further investigation.
• Large Language Model (LLM) for Physics: In this age of information overload,
countless experiments are being conducted globally, and hundreds of new papers
are published daily. Since it is impractical for any individual to assimilate all knowledge in a specific field, Large Language Models (LLMs) emerge as a potential solution. With the recent advent of LLMs, driven by ChatGPT, people have been amazed by
their ability to synthesize and distill vast amounts of information. These models offer the possibility of summarizing and consolidating knowledge from a wide range
of sources, thereby providing a more manageable and comprehensible overview of
a given subject area. [20] is such an approach with "a large language model that can
store, combine and reason about scientific knowledge." In their work, they trained
an LLM with a large number of existing papers and were able to achieve high scores
on technical knowledge probes in many scopes. However, there are certain limitations to this method. [21] found that LLMs are more prone to erroneous answers in
domains that necessitate specialized expertise. [22] is a more comprehensive article
on the challenges and risks of LLMs.
In summary, machine learning and physics not only share multiple foundational similarities but also have been interconnected deeply for an extended period. While the use
of machine learning to help physics research is a rather novel approach, the scope for
potential uses is vast, ranging from data analysis to prediction generalization, promising
significant contributions to the field.
Chapter 2
Role of Data Representation in Machine Learning
2.1 Importance of Data Representation in Physics
As is described in the data preparation section, the significance of data in the realm of
machine learning cannot be overstated. Data serves as the fundamental building block
for machine learning algorithms that determine their performance, accuracy, and generalization capability. Good data is often characterized by its ability to present relevant
information succinctly within the dataset, meaning that the data not only encompasses
all necessary and pertinent details but also does so in a manner that is concise and easily
interpretable for machine learning models. Meeting this demand is the core task of data representation.
Data representation refers to the way input data is prepared and structured to be effectively processed and analyzed by machine learning algorithms. All machine learning
algorithms require input to be numbers, but not all physics problems are well defined by
numbers. For example, in material-science-related problems, simply entering the name of
a material into a model provides no meaningful information. This highlights the necessity
for an efficient method of data representation.
On the other hand, even in those physics datasets well-defined with numerical features, data representation can play a pivotal role. The sheer amount of possible available
features can sometimes be a problem. If all are fed into a neural network, the network
will have to accommodate millions of parameters, leading to reduced efficiency and a
high risk of overfitting. In other cases, even when the available data are already succinct
and informative, transforming the original dataset can still enhance the learning procedure and boost model performance.
In this chapter, we will focus on a simple yet important task: molecular classification. We will start with naive representation methods, then introduce selected data representation methods from the literature and compare their performance on a simple task.
In the end, we will propose a possible improvement in the representation method.
2.2 Naive Data Representation Methods
The simplest and most straightforward approach to representing different molecules is, of course, to use their names directly. While this naive approach is enough to tell one molecule from another, it typically falls short of providing adequate information for
analytical purposes. More importantly, our ultimate goal is to learn from what we know
and predict on unknown molecules, but names cannot make any contribution to such
predictions. Besides, an advanced Natural Language Processing (NLP) model is usually
required to handle the information within, adding an additional layer of difficulty, thus
making this approach less feasible for practical applications.
Another straightforward approach to conveying efficient information is to characterize and represent materials by their various properties, also known as "descriptors", from
micro ones such as atomic structure and covalent radius to macro ones such as electrical
conductivity and elasticity. But such methods often suffer from the issue we already mentioned: too many features. While neural networks are able to deal with excessive features,
this usually requires extensive computational power and a large number of samples. And
thus people are usually faced with the question: which descriptors should we pick for the
best performance?
2.2.1 Feature Importance
In [23] and [24], Lee's team proposed an alternative solution, using feature filtering along with the notion of feature importance as a unique identifier, or "fingerprint", to classify multi-principal element alloys (MPEAs). In their paper, they start
with 125 distinct descriptors of the alloy, for instance, "Difference between minimum and
maximum numbers of unfilled valence orbitals" or "Average melting temperature". Then
the features get filtered by a Pearson correlation coefficient (PCC) check, and all with a
value lower than the specified threshold (0.4 for the stringent case) are selected, resulting in a filtered dataset of 12 features. The filtered dataset is subsequently employed in
an ensemble Support Vector Machine (SVM) model to classify the phase of the chemical
compositions. Once the model is operational, it is used to calculate the local importance of each of the 12 features for every single sample through a "minimum decrease" procedure. In each step of the procedure, every remaining feature of the sample is "relaxed" to all possible values observed in the same column, and predictions are made. The relaxed feature with the minimum drop in average prediction value is selected at that step, with the drop recorded as the value of the feature. The step is repeated until all features are relaxed. This whole
procedure will generate a brand new dataset as the data representation of the original
data.
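The sketch below is my reading of this "minimum decrease" fingerprint procedure, implemented with scikit-learn on random stand-in data; the random forest model, the choice of the predicted probability as the monitored output, and the handling of an already-relaxed feature are assumptions made for illustration, not the authors' original implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 12))            # 200 samples, 12 filtered features
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # synthetic phase label

model = RandomForestClassifier(random_state=0).fit(X, y)

def fingerprint(sample):
    # Relax features one by one; at each step pick the feature whose
    # relaxation changes the average predicted probability the least,
    # and record that drop as the feature's importance value.
    sample = sample.copy()
    remaining = set(range(X.shape[1]))
    fp = np.zeros(X.shape[1])
    base = model.predict_proba(sample[None, :])[0, 1]
    while remaining:
        drops = {}
        for j in remaining:
            variants = np.tile(sample, (len(X), 1))
            variants[:, j] = X[:, j]          # relax feature j over its column
            avg = model.predict_proba(variants)[:, 1].mean()
            drops[j] = abs(base - avg)
        j_min = min(drops, key=drops.get)     # "minimum decrease" feature
        fp[j_min] = drops[j_min]
        sample[j_min] = X[:, j_min].mean()    # assumed form of the relaxed state
        base = model.predict_proba(sample[None, :])[0, 1]
        remaining.remove(j_min)
    return fp

print(fingerprint(X[0]))  # per-feature importance fingerprint for one sample
```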
While feature filtering is a popular way to select representative features, using feature
importance as a unique identifier is uncommon. Generating an identifier by varying one column in the feature list and observing the resultant changes in prediction makes
sense because different materials can have different reactions to such alterations. Thus,
this novel idea opens a new dimension to the existing information. However, there are
multiple drawbacks to this method.
Figure 2.1: Breakdown plot of feature importance
The leftmost bar is the original prediction result. Each other bar is the change of the
prediction result if the corresponding feature is relaxed. A tall bar indicates strong importance for the prediction, while a short bar means the feature is rather irrelevant in this
material and does not make much contribution to the prediction procedure. In this plot,
maxdiffelctronegativity and devNdvalence are the two most important features.
• Information leakage: The feature importance fingerprint could introduce information leakage because it has already seen the label data when it is generated. Consequently, for a given prediction task, the fingerprint must be generated from a different label than the one being predicted (see Section 2.3.2).
• Computational Load: Feature importance requires every feature to be varied through all of its possible values in the dataset, which poses a high load on computational power. In our test dataset with only 1000 molecules, the computation
time already takes more than 24 hours on 4 CPUs, even though we have changed
the algorithm to sample randomly on only 1/10 of possible values. This indicates
that the method is computationally intensive, particularly for larger datasets.
While this method generates new information, our test shows that this does not guarantee
improvement in classification performance, as can be seen later in our summary of results.
Thus, we want to find some other ways of data representation.
2.2.2 FragVAE
Another paper [25] devised a fragment-based graphical encoder called FragVAE to represent molecular topology instead of looking at all molecular names [26]. In their approach,
they treat molecules as a set of atoms and bonds [AW,BW,V], and deconstruct them into
constituent fragments together with fragment connectivities. These fragments and their
connectivities are then encoded into latent spaces ZF for the fragments and ZC for the
connectivity separately. The molecule is broken down into smaller parts, such as individual rings or chains of atoms, as shown in Figure 2.2. While these fragments already
contain abundant information, they cannot uniquely determine a molecule, and that is
why connectivity is added as a second step. Information on how the fragments are connected with each other is stored in the ZC latent space. With the two encoders mentioned
Figure 2.2: illustration of FragVAE fragments
The top panel is the molecular structure of Benzyl Benzoate. The bottom panel is the bag of fragments created by FragVAE.
above, FragVAE maps the original molecule into a continuous latent space [ZF,ZC]. Subsequently, the encoded feature goes through the decoder aiming at reconstructing the input molecule from the encoded representation. By imposing a loss function to minimize
the difference between the reconstructed molecule and the input molecule and training
on a selected dataset, the autoencoder network gradually learns how to capture important aspects of input data because inefficient encoding of information would lead to bad
decoding results. Through full training on the Zinc15 dataset, the encoder was fine-tuned
to produce an efficient representation of the original molecule, demonstrating a robust
capability in predicting the logarithm of the partition coefficient (logP).
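To make the encode-decode training signal concrete, here is a schematic autoencoder in PyTorch trained with a reconstruction loss. This is a generic illustration of the principle FragVAE relies on, with dense vectors standing in for its fragment and connectivity encodings; it is not the FragVAE architecture itself.

```python
import torch
import torch.nn as nn

# Stand-in: molecules as fixed-size feature vectors (FragVAE actually encodes
# fragments and their connectivity into separate latent spaces ZF and ZC).
encoder = nn.Sequential(nn.Linear(64, 16))   # maps input to a latent code
decoder = nn.Sequential(nn.Linear(16, 64))   # reconstructs input from the code

optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3
)
loss_fn = nn.MSELoss()

x = torch.randn(128, 64)                     # a batch of toy "molecules"
for step in range(200):
    optimizer.zero_grad()
    z = encoder(x)                           # latent representation
    x_hat = decoder(z)                       # attempted reconstruction
    loss = loss_fn(x_hat, x)                 # penalize reconstruction error
    loss.backward()
    optimizer.step()

# After training, encoder(x) is a compact representation: inefficient encoding
# would have led to poor reconstruction, so the latent code must retain the
# important aspects of the input.
```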
2.3 Experiment on More Complex Data Representation Methods
Building upon the methods introduced earlier, we conducted an experiment to compare
the performance of these different data representation techniques: raw data, FragVAE,
and feature importance. Additionally, different combinations of these methods are also
tested for potential enhancements in performance.
2.3.1 Data Preparation
The Estimated Solubility (ESOL) dataset is used as the dataset to test data representation
performances. The ESOL dataset is a collection of data used for predicting the aqueous solubility of small molecules [27], which is a crucial property in drug discovery and
chemical design. It consists of a set of chemical compounds along with their measured
solubility values. The dataset is often utilized in machine learning and cheminformatics
research to develop models that can predict solubility based on molecular structures [28].
Further detailed information for each molecule in the ESOL dataset is then scraped
from Pubchem, a comprehensive and freely accessible database of chemical molecules
and their activities against biological assays managed by the National Center for Biotechnology Information (NCBI). The scraping process is completed using a Python script with
the PUG VIEW Application Programming Interface (API) provided by the Pubchem website to access their available data.
With the molecule list from ESOL and molecular information from Pubchem, we then
cleaned the dataset by deleting empty entries and invalid rows, finally arriving at a
dataset with 1009 samples and 15 distinct features. Each sample has 6 binary labels for
testing: Irritant, Health Hazard, Flammable, Environmental Hazard, Acute Toxic, and
Corrosive.
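As an illustration of the scraping step, here is a minimal sketch of a single PUG View request; the compound CID and the fields printed are placeholders, and the actual script iterated over the ESOL molecule list.

```python
import requests

# PubChem PUG View returns the full record for a compound as JSON.
# CID 2244 (aspirin) is used here purely as a placeholder example.
cid = 2244
url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/{cid}/JSON"

response = requests.get(url, timeout=30)
response.raise_for_status()
record = response.json()["Record"]

# The record is a nested tree of sections (e.g. "Chemical Safety",
# "Chemical and Physical Properties") that a real script would walk
# to pull out descriptors and hazard labels.
print(record["RecordTitle"])
print([s["TOCHeading"] for s in record["Section"]][:5])
```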
2.3.2 Experiment Procedure and Result
Different data representation methods are applied to the dataset to test their performance.
For FragVAE, the encoders are used on the molecules to generate a representation of
the atoms and structure.
Feature importance fingerprint is much more complicated. Six different Random Forest models are first trained on the raw data for the six target labels. Each model is further
used on every single sample to calculate feature-wise importance, resulting in six different feature-importance representations. Each of them counts as a different representation and needs to be tested separately. However, testing the representation generated from the "Health Hazard" label on the "Health Hazard" label itself reached nearly 100% accuracy, confirming that information leaks into the representation when the importance representation is generated from the same label. Consequently, for each prediction task, only a representation generated using another label is valid. For fairness, when testing the other methods, a column containing another label is provided as an
extra feature.
All the resulting representations are then used to train models to make predictions on
the six labels. For each task, two models with two different algorithms are generated, one implementing the simple Linear Regression algorithm and the other utilizing Random Forest. Linear regression, being one of the most naive algorithms,
serves as the baseline, while Random Forest is more complicated. By analyzing the performances and differences of these three data representations on the two basic algorithms,
we aim to differentiate between different data representations. The result of the prediction of the "Health Hazard" label is shown below.
In the figure, "fp" stands for the feature importance fingerprint, "raw" refers to the raw data
without any further processing after cleansing, and "frag" is the FragVAE encoder. The
blue bar represents the performance of the data representation on the simple Linear Regression, while the orange bar on top of it showcases the performance difference between
the two algorithms. Note here that the AUC score is used instead of the accuracy, since
some of the labels are imbalanced, and AUC is the preferable metric in this case. Also,
to address the randomness issue of random forest, 10 random forests with different random states are trained, and their AUC scores are averaged as the final result.

Figure 2.3: Performance comparison between different data representations
Each bar stands for a different data representation method. The blue portion of the bar indicates the prediction accuracy achieved by using a regression algorithm with the given data representation. The orange section, layered on top of the blue, illustrates the additional accuracy improvement attained when applying a random forest algorithm to the same data representation method.

From the figure, we can see that FragVAE surpasses raw data and feature importance on both algorithms, but combinations of different methods yield even better results. Specifically,
the integration of FragVAE with raw data achieves the highest AUC score, along with a
notably small performance gap between the two algorithms.
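For reference, here is a minimal sketch of the averaging procedure just described, with synthetic imbalanced data standing in for one of the representations:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Placeholder imbalanced data standing in for one representation of the dataset.
X, y = make_classification(n_samples=1009, n_features=15, weights=[0.8], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Train 10 random forests with different random states and average their AUCs,
# smoothing out run-to-run randomness in the final reported score.
scores = []
for state in range(10):
    clf = RandomForestClassifier(random_state=state).fit(X_tr, y_tr)
    scores.append(roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))

print("mean AUC:", np.mean(scores))
```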
2.3.3 Discussion
It is easy to observe the supremacy of the FragVAE data representation over the other two
methods, with a higher performance on the prediction task.
It is also important to acknowledge that FragVAE has a relatively small discrepancy
in the AUC score between the Random Forest and Linear Regression models. This is also
an important part of our analysis, because this gap between the two algorithms is a good
indicator of whether the representation is easy for machine learning models to learn, and
this is exactly why we choose Random Forest, a widely-used advanced algorithm, as the
main model, and Linear Regression, the simplest and most straightforward algorithm, as
the comparison group.
From the result, we can also conclude that the feature importance representation, despite its complexity and sophistication, is not an ideal method to represent data, at least
not in this molecular scenario. This method is not representative enough compared with
FragVAE and also seems to make the situation so much more complicated that only nonlinear algorithms can perform well, as can be seen from the gap in the figure.
Finally, it is important to acknowledge the higher performance that the combined representations offer, specifically FragVAE combined with raw data, and the Importance Fingerprint combined with FragVAE. This suggests that these representations, while perhaps a little more complex, can potentially deliver superior predictive performance and thus might be a better choice. This makes sense because a combination of different representation methods provides the machine learning model with more aspects of the subject, together with more information. Due to limited available resources, our dataset is relatively small and uncomplicated. The features provided therefore demand much less than the full capability of the machine learning models, so an enhancement in performance from combining different representations is reasonable. This data size is already representative of many problems in academic research. For much larger datasets, on the other hand, the effect of data representation and combination is unknown, but my work shows that data representation has the potential to enhance performance and is thus worth investigating.
Chapter 3
Machine Learning Curricula for Sparse-label-related
Physics Problems
3.1 Background
As an illustrative example, we will show in this chapter how a designed curriculum could
affect model performance and what its limitations are.
3.2 CIFAR100 - an Illustrative Example
3.2.1 Problem Setup
The CIFAR100 dataset, building upon the foundation of CIFAR10, serves as a significant benchmark in computer vision and deep learning and is widely used to train machine learning algorithms [29, 30]. CIFAR100 consists of 60,000 images that are evenly divided into 20 coarse labels, such as fish, flowers, or vehicles. Each coarse label is further divided into 5 distinct fine labels; for example, flowers break down into orchids, poppies, roses, etc. We select CIFAR100 as an illustrative example for the following reasons.
First of all, CIFAR100 has intermediate complexity. Other datasets like ImageNet contain high-resolution images and a larger number of classes, presenting a significant challenge due to their sheer size and complexity; the extensive volume of data and the detailed nature of these images make them demanding to process and analyze. CIFAR100 strikes a balance here: it is moderately difficult, leaving enough room for improvement. Thus CIFAR100 is a good example with which to test and benchmark algorithms.
Secondly, the dataset has perfectly balanced labels. Each coarse or fine label contains the same number of samples. This balance rules out class-imbalance issues and makes performance easy to measure: simple accuracy precisely reflects algorithm performance.
Finally, CIFAR100 has a hierarchical structure. This versatile structure offers a breadth
of problem space since we can experiment with both fine-grained and coarse-grained
classification.
With the advantages explained above, we want to explore the following questions in this example: (1) Does a gradual curriculum that teaches from easy to difficult help performance? (2) Does a harsh curriculum that targets the difficult problem first help the model learn the easy problems better? In other words, with coarse labels treated as easier targets and fine labels treated as harder targets, we want to explore whether learning easy problems first can improve performance on hard problems, and vice versa.
To make the problem more illustrative, we further vary the size of the dataset. Namely,
we randomly shrink the training set together with the validation set and test set to explore
the trend of the difference between different curricula according to data size.
3.2.2 Methods
3.2.2.1 Dataset Generation
We employed a conventional division strategy for the CIFAR100 dataset, which comprises a total of 60,000 samples: 50,000 samples are designated for the training set, while the remaining 10,000 are evenly distributed between the validation and test sets. We then ran a series of experiments varying the dataset size, from a minimum of 1,000 samples up to the full 50,000 training samples of the traditional split. For each dataset size, we adopted a systematic approach: we randomly draw samples from the original dataset and repeat the experiment ten times. To ensure consistency and reproducibility, we use manual random seeds to control randomness. Note that the drawn dataset itself is determined by the random seed, because the specific training set is a crucial part of model training. This becomes even more pronounced with reduced dataset sizes, where the specific selection of samples can significantly influence the model's performance and learning trajectory.
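A minimal sketch of how such seeded subsampling might be implemented with PyTorch and torchvision is shown below; the helper name and the transform are illustrative rather than the exact code used in our experiments.

```python
# A sketch of seeded subsampling of the CIFAR100 training set. The manual
# seed fixes which samples are drawn, making each trial reproducible.
import torch
from torch.utils.data import Subset
from torchvision import datasets, transforms

def subsample_cifar100(root, size, seed):
    full = datasets.CIFAR100(root=root, train=True, download=True,
                             transform=transforms.ToTensor())
    gen = torch.Generator().manual_seed(seed)
    idx = torch.randperm(len(full), generator=gen)[:size]
    return Subset(full, idx.tolist())

# Ten repetitions for one dataset size, each with its own manual seed.
subsets = [subsample_cifar100("./data", size=1000, seed=s) for s in range(10)]
```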
3.2.2.2 Model Architecture and Metrics
As an image-related problem, the most common and efficient way to extract important
information from a dataset is to use a Convolution Neural Network(CNN). The features
encoded from the original image with CNN then go into two consecutive predictors according to the different aims of the experiment. For the experiment training first on coarse
labels then on fine labels, the encoded features will first be used to predict its coarse label.
This coarse label is then concatenated with the original encoded features to go into the
final predictor of fine labels. For the fine label to coarse label experiment, the encoded
features go to the fine label predictor first and then concatenate with itself to predict the
coarse label. The detailed architecture is below:
• Convolutional Neural Network: Each image of size 32*32*3 is used as input. The input goes through two consecutive sets of convolutions, each consisting of a convolution layer, a Rectified Linear Unit (ReLU) activation function, and a 2*2 max-pooling operation. The first convolution layer has 3 input channels, 32 output channels, and filters of size 3*3. The second convolution layer has 32 input channels, 64 output channels, and also filters of size 3*3.
• Fine Label Predictor: 4096 features from the convolutional network are used as input. If the model is trained first on coarse labels to predict fine labels, another 20 features from the coarse prediction are concatenated to form 4116 features. The features are then reduced to 512 nodes, followed by a ReLU activation and a dropout rate of 0.2. The 512 features are then reduced to 256 nodes and again activated with ReLU. Finally, the 256 nodes are reduced to 100 nodes as the final prediction.
• Coarse Label Predictor: 4096 features from the convolutional network are used as input. If the model is trained first on fine labels to predict coarse labels, another 100 features from the fine prediction are concatenated to form 4196 features. The procedure then mirrors the fine label predictor: features are reduced to 512 nodes with ReLU and a 0.2 dropout rate, then to 64 nodes with ReLU, and finally to 20 nodes as the prediction.
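A PyTorch sketch of this architecture is given below. It is our reconstruction from the description above, not the exact training code: we assume a padding of 1 in both convolutions (so the flattened feature size comes out to 64 × 8 × 8 = 4096, as stated), and we keep a single head of fixed input size per label level, padding the auxiliary slot with zeros when the other prediction is withheld; this zero-padding also realizes the epoch masking described in the next section.

```python
# A sketch of the CNN encoder with coarse and fine label predictors.
# Zero-padding the auxiliary slot implements the masking curriculum.
import torch
import torch.nn as nn

class HierarchicalCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(),                                  # -> 4096 features
        )
        self.fine_head = nn.Sequential(                    # 4096 + 20 -> 100
            nn.Linear(4096 + 20, 512), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, 100),
        )
        self.coarse_head = nn.Sequential(                  # 4096 + 100 -> 20
            nn.Linear(4096 + 100, 512), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(512, 64), nn.ReLU(),
            nn.Linear(64, 20),
        )

    def forward(self, x, coarse_first=True, mask=False):
        z = self.encoder(x)
        if coarse_first:
            coarse = self.coarse_head(torch.cat([z, z.new_zeros(len(z), 100)], 1))
            aux = torch.zeros_like(coarse) if mask else coarse
            fine = self.fine_head(torch.cat([z, aux], 1))
        else:
            fine = self.fine_head(torch.cat([z, z.new_zeros(len(z), 20)], 1))
            aux = torch.zeros_like(fine) if mask else fine
            coarse = self.coarse_head(torch.cat([z, aux], 1))
        return coarse, fine
```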
The output of the model is of size either 100 or 20, depending on whether the target is to predict fine or coarse labels. For backpropagation, we use a loss function that combines the predictions for both coarse and fine labels, each with a cross-entropy loss.
For performance analysis, we use prediction accuracy to assess coarse label prediction, which is a fair metric since the dataset is perfectly balanced. For fine label prediction, however, raw accuracy can be low due to the problem's complexity, so one common practice is to use "Top N Accuracy": we count a prediction as correct as long as the model gives the target label a score among the top n labels. In our experiment, we pick n to be three.
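A small helper for this metric might look as follows; this is a sketch in which logits and targets are assumed to be a batch of raw scores and integer labels.

```python
# Top-N accuracy: a prediction counts as correct if the target label is
# among the n highest-scoring labels (n = 3 in our experiments).
import torch

def top_n_accuracy(logits, targets, n=3):
    topn = logits.topk(n, dim=1).indices               # (batch, n)
    hits = (topn == targets.unsqueeze(1)).any(dim=1)   # (batch,)
    return hits.float().mean().item()
```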
A crucial point to highlight is that it is the predicted fine label that is concatenated with the encoder output, not the actual fine label, and vice versa. Consequently, no label information leaks into the prediction.
3.2.2.3 Experiment on Different Curricula
With the dataset and model architecture in place, we test the model on different curricula to compare their performance [31]. For each trial with a specific random seed and data size, we train four independent models, one per curriculum, for 30 epochs with a batch size of 16.
For the fine label prediction task, as described in the previous section, the model first generates a coarse label prediction, which is concatenated with the original encoding to finalize the fine label prediction. For different curricula, we mask this coarse label prediction with 20 columns of zeros for a fixed number of epochs. This masking controls how many epochs the model receives information from the coarse label prediction when forming its fine label prediction: 0 means the fine label predictor always sees the coarse prediction before making its own, and 30 means it never does. We tested four curricula, masking for 0, 10, 20, and 30 of the 30 epochs, on all the data sizes mentioned above.
The coarse prediction task is analogous. The fine predictor's output is masked with 100 columns of zeros for 0, 10, 20, or 30 epochs, giving four curricula that are again evaluated on all data sizes.
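A sketch of the resulting training loop is shown below, assuming the HierarchicalCNN sketched earlier and a loader that yields images together with both label levels; the function name is ours.

```python
# Curriculum loop: the auxiliary (here coarse) prediction is masked with
# zeros for the first `mask_epochs` of the 30 epochs (0, 10, 20, or 30).
import torch.nn.functional as F

def train_curriculum(model, loader, optimizer, mask_epochs, total_epochs=30):
    for epoch in range(total_epochs):
        masked = epoch < mask_epochs       # withhold the auxiliary prediction
        for images, fine_y, coarse_y in loader:
            coarse_out, fine_out = model(images, coarse_first=True, mask=masked)
            # Combined objective: cross-entropy on both label hierarchies.
            loss = (F.cross_entropy(coarse_out, coarse_y)
                    + F.cross_entropy(fine_out, fine_y))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```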
3.2.3 Results and Discussion
3.2.3.1 Label Prediction with Different Curricula
Figure 3.1: Accuracy plot for different training schedules that learn fine label prediction to aid coarse label prediction
Different schedules are tested on data sizes from 1,000 to 50,000. As the data size grows, the improvement from training on fine labels becomes increasingly noticeable.
In the plot shown above, the blue line serves as the baseline in which the model never receives any information from the fine label prediction, while the other three lines are trained with knowledge of the fine label prediction to some extent. As the plot shows, when the training set size is smaller than 15,000, training on fine label prediction makes no discernible difference in performance. However, as the sample size grows, the gap in prediction accuracy between models with and without training on fine labels becomes increasingly noticeable.
Figure 3.2: Accuracy plot for different training schedules that learn coarse label prediction to aid fine label prediction
Different schedules are tested on data sizes from 1,000 to 50,000. The blue line is the comparison group with no training on coarse label prediction; the yellow, green, and red lines are trained for 10, 20, and 30 epochs on coarse labels. No schedule shows a clear advantage at any data size.
Similar to the fine-to-coarse plot, the blue line is the baseline with no coarse label prediction information, and the others are trained with coarse label prediction for some number of epochs. This time, however, the performance curves are entangled with each other. As the number of samples grows, no significant improvement is observed for curricula that teach the easy task first. For further insight, we also computed "Top N Accuracy" for N equal to 1, 5, and 10, but none of these shows any curriculum's superiority over another. A simple curriculum that learns an easier target before aiming for the hard target therefore does not improve model performance in this case.
3.2.3.2 Discussion on Results and Limitations
Our investigation found that while introducing fine label prediction enhances the model's capability to predict coarse labels, the reverse does not hold. This is an interesting trend that may stem from several causes and carries a range of implications.
As we can see from Figure 3.1, learning fine prediction before making the coarse prediction does improve model performance, and this boost even increases with larger sample sizes. This makes sense because focusing on a harder target can force the model to converge toward a better optimum instead of getting stuck in a local minimum. Moreover, fine labels, being more granular, provide clearer boundaries in the high-dimensional feature space, which, when combined to form the coarse label, yield a more intricate decision boundary. A harder task also reduces the tendency to overfit.
Using coarse label prediction for fine labels, as seen in Figure 3.2, does not improve prediction accuracy as one might expect. There are several possible explanations. Coarse label prediction is a task already contained within fine label prediction, so it provides no extra insight into how the model should make its predictions. Admittedly, learning easy tasks before hard tasks is known to yield a smoother learning curve; in this problem, however, the model converges within about 10 epochs, which negates the advantage of learning an easier task first.
3.3 Superconductor - an Application
3.3.1 Background
Superconductivity [32] is a quantum mechanical phenomenon where certain materials,
when cooled below a critical temperature (Tc), exhibit zero electrical resistance and expel
magnetic fields due to the Meissner effect. This zero-resistance property makes superconducting materials pivotal in modern science and technology, promising massive energy savings and drastic improvements to electrical systems. Superconductors could revolutionize various fields, including medical imaging with
advanced MRI technologies, sustainable energy transmission with minimal power loss,
and public transportation systems using magnetic levitation trains. Moreover, superconductors are crucial for scientific breakthroughs. Superconducting circuits serve as central components of quantum computers, and superconducting magnets are used in particle accelerators such
as the famous Large Hadron Collider (LHC) operated by the European Organization for
Nuclear Research (CERN).
Despite all these promising applications, the use of superconductors remains highly limited. One major cause of this limitation is the critical temperature: superconductors only exhibit superconductivity below Tc, and most known superconductors have a critical temperature close to or below 77 K, the boiling point of nitrogen. Cooling and maintaining materials at such extremely low temperatures is difficult and costly. Thus, finding superconductors with higher critical temperatures, or even room-temperature superconductors, is one of the top priorities of the superconductor community.
Therefore, understanding and predicting the critical temperature is fundamental to harnessing the potential of superconductors. In the absence of a reliable theoretical model for predicting the critical temperature, a data-driven, experience-based model can serve as a quick and accessible stopgap. A model that predicts the critical temperature well and generalizes to unseen cases can show which regions of feature space might indicate a high critical temperature, narrowing the search scope. In the worst case, the model can at least hint at which features are essential for achieving room-temperature superconductivity.
3.3.2 Methods
3.3.2.1 Dataset Generation
To make the best prediction of superconductors' critical temperature, we need a dataset as detailed and diverse as possible. The SuperCon database is the most comprehensive collection of superconducting materials, encapsulating empirical data from a wide range of studies, and is well known for its extensive detail on a large number of superconducting materials. It serves as a pivotal resource for pushing forward the boundary of high-temperature superconductors. In our work, we use the data from [33], which was gleaned from the SuperCon database to suit the specific analytical needs of our research. Beyond mere data retrieval, Hamidieh performed extra data preparation to obtain clean and meaningful data: recording errors were rectified, and materials with missing critical temperatures, or those listed as zero or excessively high, were excluded. This meticulous process resulted in a refined dataset comprising 21,263 unique superconductors. For each superconductor, the critical temperature is provided together with 81 features, including mean atomic mass, mean atomic radius, and other microscopic or macroscopic physical quantities.
3.3.2.2 Model Architecture and Metrics
In the CIFAR100 dataset, coarse and fine labels are provided from the start. In our SuperCon dataset, however, the only label available is the critical temperature, which is neither hierarchical nor separated into distinct classes, as it is a continuous variable. Consequently, we use manually defined intervals to define fine and coarse classes. This makes sense because precise temperatures, be it 280 K or 290 K, offer negligible functional differences; the practical significance lies in whether a superconductor operates in a high or low temperature range. Understanding the temperature category thus suffices for most practical purposes, making finer distinctions superfluous. For fine labels,
the data is divided into 56 classes with a step length of 2 K. Materials with critical temperature in [0 K, 2 K) belong to class 0, those in [2 K, 4 K) to class 1, and so on, up to materials in [108 K, 110 K) as class 54; those with Tc > 110 K form the last class, 55, since only a minimal number of materials fall in this range. For coarse labels, the data is similarly divided into a chosen number of classes, with the difference that the intervals are set by sample-number quantiles. For instance, to divide the samples into 4 coarse labels, we rank the continuous Tc from low to high and split it into four consecutive intervals, each containing 1/4 of the dataset, calling the first interval class 0, the second class 1, and so on. The main result of this section uses 12 bins to separate the critical temperatures. A minimal sketch of this binning is given below, followed by the detailed architecture:
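The sketch below assumes tc is the array of critical temperatures; pandas.qcut is one convenient way to form the quantile bins, not necessarily the implementation we used.

```python
# Label construction: fixed 2 K bins for the fine label (56 classes) and
# quantile bins for the coarse label (12 classes in the main result).
import numpy as np
import pandas as pd

def make_labels(tc, n_coarse=12):
    # Fine label: [0, 2) K -> 0, [2, 4) K -> 1, ..., capped so that all
    # materials with Tc > 110 K fall into the last class.
    fine = np.minimum(np.floor(np.asarray(tc) / 2.0).astype(int), 55)
    # Coarse label: each bin holds roughly 1/n_coarse of the samples.
    coarse = pd.qcut(tc, q=n_coarse, labels=False)
    return fine, coarse
```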
• Encoder Neural Network: Each sample uses all features from the SuperCon dataset as input. The input goes through three consecutive dense layers, with the first two followed by a Rectified Linear Unit (ReLU) activation function. The resulting encoding has 64 dimensions.
• Fine Label Predictor: The 64-dimensional encoder output is fed into a predictor of the fine label. We experimented with three variants. The first is the baseline without a fine-label predictor. The second fine-label predictor contains three dense layers with 32, 16, and 1 nodes; the final 1-node layer predicts the critical temperature directly. For the third fine-label predictor, the critical temperature is divided into 56 bins, as described above, so the last layer contains 56 nodes, with two intermediate dense layers of 64 nodes each and ReLU activation functions.
• Coarse Label Predictor: The coarse label predictor also differs across the three experiments. For the baseline without fine-label knowledge, the predictor takes only the 64 neurons from the encoder, passes them through a 32-node dense layer and then a 16-node dense layer, and finally outputs a 12-neuron prediction of the critical-temperature class. For the 1-node fine-label predictor, the architecture is similar; the only change is the input shape, now 65, to include the fine-label prediction. For the third variant, the input shape is 120 to include the 56-bin fine-label prediction; it goes through a 64-node dense layer, then a 32-node dense layer, and finally outputs the 12-neuron class prediction.
For all three experiments, the output of the model has size 12, as described above. Backpropagation is similar to the CIFAR100 experiment: the loss function combines the predictions for both coarse and fine labels, each using a cross-entropy loss. For performance analysis, we again use prediction accuracy to assess coarse label prediction.
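The sketch below shows the binned (third) variant and its combined objective. The intermediate widths of the encoder are our assumption, since only its 64-dimensional output is specified above; the head sizes follow the description.

```python
# A sketch of the binned variant: encoder -> 56-bin fine head -> 12-bin
# coarse head, trained with the sum of two cross-entropy losses. The
# encoder's hidden widths (128) are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SuperConNet(nn.Module):
    def __init__(self, n_features=81):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_features, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 64),
        )
        self.fine_head = nn.Sequential(            # 64 -> 56 bins
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, 56),
        )
        self.coarse_head = nn.Sequential(          # 64 + 56 -> 12 bins
            nn.Linear(64 + 56, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, 12),
        )

    def forward(self, x):
        z = self.encoder(x)
        fine = self.fine_head(z)
        coarse = self.coarse_head(torch.cat([z, fine], dim=1))
        return coarse, fine

def combined_loss(coarse_out, fine_out, coarse_y, fine_y):
    return (F.cross_entropy(coarse_out, coarse_y)
            + F.cross_entropy(fine_out, fine_y))
```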
3.3.3 Results and Discussion
We ran three different neural network pipelines on the same dataset for the same task. The first is the baseline with no fine-label information. The second predicts the exact critical temperature as a single value, which is combined with the original encoding to predict the coarse label. The third maps the critical temperature into 56 bins as the fine label to aid the coarse label prediction. The main result can be seen in Figure 3.3.
The figure clearly shows the difference in performance among the three approaches. The baseline remains the worst throughout the whole training procedure. The second approach starts with accuracy similar to the baseline, but it improves quickly as the number of epochs grows, and its growth stabilizes after 25 epochs; the final improvement is around 4%. The third method has a
Figure 3.3: Prediction accuracy as training epochs increase
The blue line is the baseline with no fine-label information. The orange line includes a single-value fine-label prediction. The green line splits the critical temperature into 56 bins and predicts the bin before predicting the coarse label. The green line performs much better than the orange line, while the baseline has the worst prediction accuracy.
much better performance than the other two methods. After the first epoch, it already shows a large gap over the other methods. The gap narrows as training proceeds, indicating better convergence. This method ends with a 7% improvement in prediction accuracy.
This experiment supports the conclusion drawn from the CIFAR100 experiment: training on the fine label before predicting the coarse label can improve performance. We successfully improved the accuracy of predicting the critical temperature of superconductors using this approach, notably without incorporating any additional information. This finding underscores the effectiveness of our approach in achieving more precise predictions in a specific physics context, demonstrating the potential of strategic training methodologies in machine learning applications.
Chapter 4
Machine Learning Curricula for Dense-label-related
Physics Problems
4.1 Background
Data is the core of machine learning. The design of a machine learning algorithm to solve physics problems always depends on the specific dataset at hand. In an ideal situation, the amount of available data can be substantial; for example, datasets generated at the LHC are typically enormous. In other scenarios, such as predicting the state of a system whose evolution equation is known, the state of the system is in principle calculable and thus theoretically available in unlimited quantity, as long as the computation power is sufficient.
While favorable, having abundant data does not always guarantee a successful machine learning model [34], let alone infinitely available datasets, where it would be impossible to train on all the data even once given limited computation power. Thus the question arises: when the dataset is too large, how can we efficiently use it to train a high-performance model?
In the following sections, to shed light on how to pick representative samples from the original dataset so as to accelerate training and enhance model performance, I present my approaches to simulating quantum systems with neural networks. These problems involve arbitrarily many data points generated by traditional quantum evolution algorithms, and are thus perfect examples for illustrating strategies in such scenarios.
4.2 Linear Quantum System
4.2.1 Problem Setup
As a proof of concept, we initially concentrated on the quantum dynamics of a one-dimensional system. We consider a system governed by the one-dimensional time-dependent Schrödinger equation in atomic units,

$$ i\,\frac{\partial}{\partial t}\,\psi(x,t) = \hat{H}\,\psi(x,t), \qquad 0 \le x < L_x,\; t \ge 0 \tag{4.1} $$

where $L_x$ is the size of the system and the Hamiltonian operator $\hat{H}$ is the sum of kinetic and potential energy operators,

$$ \hat{H} = -\frac{1}{2}\,\frac{\partial^2}{\partial x^2} + \hat{V}(x). \tag{4.2} $$
For the computer to handle this problem, the space and time of the system need to be discretized into meshes. The wave function $\psi(x,t)$ is discretized on an $N_x \times N_t$ mesh grid with equal spacing $\Delta x = L_x/N_x$, while $\Delta t$ is chosen to be accurate enough yet computationally efficient. The $(i,j)$-th point in space and time is given by $x_i = i\Delta x$, $t_j = j\Delta t$, and the wave function is represented by its value at each point, $\psi_i^j = \psi(x_i, t_j)$, stored as two numbers: the real and imaginary parts of the wave function at that point. We wanted to build a neural network-based algorithm to emulate how a Gaussian wave packet evolves under all kinds of potential landscapes: zero potential, a square well, a rectangular barrier, or even irregularly shaped barriers. Traditional algorithms, such as the space-splitting method, can already simulate one-dimensional quantum dynamics, so unlimited data are available to train the neural network model.
4.2.2 Methods
4.2.2.1 Prediction Pipeline
An accurate simulation needs a sufficiently fine mesh, usually more than 512 grid points over the original quantum space. Consequently, simply treating the whole state of the system at a given time as model input is inefficient: given the complexity of the problem, it would require training millions of connections in the neural network, posing a big challenge to the available computation power. On the other hand, frame-wise emulation makes accuracy a major challenge: an error in a single frame affects all frames afterward, and the accumulation of these errors makes the predicted wave function collapse very quickly. We therefore need a more rigorous point-wise loss to ensure every prediction is accurate at every pixel. A neural network that takes the whole wave function as input and predicts the whole function at the next frame can ensure a low overall loss, but it cannot address sparse outliers in its prediction whose deviations are large enough to destroy later predictions yet not significant enough to be effectively penalized by the loss function.
Therefore, we proposed a window-based point-wise prediction algorithm to make accurate predictions on each point as well as greatly reduce the complexity of the model.
This algorithm would require a slightly different pipeline than what we have seen in the
previous chapters.
Below is a plot of the whole pipeline. Data are first generated using a traditional quantum dynamics simulation algorithm. These data are then chunked into windows of 23 pixels × 4 frames, where 23 is the spatial size of the window and 4 is the number of
Figure 4.1: The machine-learning-based emulation pipeline.
Space Splitting Method (SSM) quantum dynamics simulator is first used to generate the
raw dataset. Random snapshots are drawn from the raw dataset as samples with a weight
policy to favor snapshots involving potential barriers. These snapshots are the data used to train the machine-learning-based emulator.
consecutive frames of that window that we record. For each frame and pixel of the window, three different values are recorded: the real part of the wave function, the imaginary
part of the wave function, and the potential value at the spot. This window chunking is
repeated randomly for all data runs to form the final training data. These window data
are then fed into the Gated-Recurrent-Unit (GRU) Neural Network model for training.
The neural network model takes a 4 × 23 × 3 array as input and returns a 23 × 2 array holding the real and imaginary parts of the prediction for the window at the next frame, given the
last 4 frames. With a model trained to reach convergence, prediction is then performed
differently. For a designated potential landscape with a given input wave, we use the
model to predict subsequent evolution frame by frame. Predictions from the last four
frames are stacked as raw input. To predict the wave function value of a certain point at a
certain frame, we use all chunks that contain the target point, which is 23 different chunks
in this case. Each chunk is passed through the neural network model separately, and the individual predictions are combined with a Gaussian-weighted sum. This
procedure is repeated for all points in the system to generate the final prediction of the
system at the next frame and is again repeated for 400 frames for the whole propagation.
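A sketch of this combination step is shown below; the Gaussian kernel width is an illustrative choice, as the text does not fix it.

```python
# Combining the 23 overlapping window predictions for one grid point's
# frame. Each grid point is covered by up to 23 windows; the per-window
# predictions are averaged with Gaussian weights favoring the window
# center. The kernel width (sigma = 4 grid points) is an assumption.
import numpy as np

WINDOW = 23
HALF = WINDOW // 2
weights = np.exp(-0.5 * (np.arange(-HALF, HALF + 1) / 4.0) ** 2)

def combine_windows(window_preds, starts, n_grid):
    """window_preds: (n_windows, 23, 2) real/imag predictions;
    starts: left edge of each window on the full grid."""
    total = np.zeros((n_grid, 2))
    norm = np.zeros(n_grid)
    for pred, s in zip(window_preds, starts):
        total[s:s + WINDOW] += weights[:, None] * pred
        norm[s:s + WINDOW] += weights
    return total / np.maximum(norm, 1e-12)[:, None]
```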
4.2.2.2 Data Preparation
A Space Splitting Method (SSM) quantum dynamics simulation algorithm [35] [36] [37]
was implemented to generate raw data on the evolution of a wave packet in a closed space
with different potential landscapes over time. The important parameters that govern the
wave packet propagation are listed below:
• Nx, Lx, ∆t: The size of the system is set by Lx, while Nx is the number of meshes the system is divided into. ∆t is the step size of the simulation, dictating how far into the future each simulation step projects. A larger Nx and a finer ∆t guarantee better simulation accuracy, but the improvement becomes minimal beyond a certain threshold, so we ran various experiments to pinpoint the set of Nx, ∆t that provides sufficient accuracy at the lowest computational cost.
• Hb, Wb: In our training data, the potential barrier sits in the middle of the simulation box and has a rectangular shape; Hb and Wb control its height and width, respectively.
• X0, S0, E0: Since our study concerns the evolution of a Gaussian wave packet, three parameters are essential: X0 is the center of the wave packet, S0 its spread, and E0 its energy.
For every case of wave propagation, we split the whole space into 1024 meshes and
ran the algorithm for 400 units of time. Two sets of training data corresponding to two
different wave propagation scenarios are established to present the model with sufficient
information about what circumstances are possible in propagation. The first scenario
is free propagation, where no potential barriers are present in the space. The second
scenario introduces a single rectangular barrier located at the center of the simulation
box.
The training samples involving the freely dispersing wave packet are derived from combinations of the following parameters:
• X0 = 10.0, 40.0, 70.0 a.u.
• S0 = 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0 a.u.
• E0 = 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0 a.u.
For the case of a potential barrier in the middle of the simulation space, samples are
generated using the following parameters:
• X0 = 10.0, 40.0, 70.0 a.u.
• S0 = 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0 a.u.
• E0 = 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0 a.u.
• Hb = 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0 a.u.
• Wb = 7.0 a.u.
As shown in the pipeline we introduced in the last section, with the wave propagation
for different sets of parameters, we need to further chunk them into small pieces for the
machine learning model to learn about prediction. Every sample of wave propagation
consists of the value of wave function on 1024 spots × 500 frames of unit time. To fit the
input of the neural network, we pick a random window of length 23 meshes at a random
time and record its value for the next four frames. This random sampling is repeated for
each parameter set, with a spatial sampling rate of 0.1 and a temporal sampling rate of 0.9
to reduce possible correlations between samples. A weighting in favor of cases involving the potential barrier is also implemented: barrier-related windows are less common, given the barrier's sparsity in the system space, yet predicting the wave's interaction with the potential barrier is much more difficult than free propagation.
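A sketch of this weighted sampling policy is given below. The interface and the weight factor are illustrative: psi holds one simulated run as (frames, grid points, 3 channels), with channel 2 the potential.

```python
# Weighted window sampling: windows overlapping the potential barrier are
# drawn with higher probability. The weight factor of 5 is an assumption.
import numpy as np

def sample_windows(psi, n_samples, barrier_weight=5.0, seed=0):
    rng = np.random.default_rng(seed)
    n_frames, n_grid, _ = psi.shape
    starts = np.arange(n_grid - 23 + 1)
    # Favor windows whose potential channel (index 2) is non-zero anywhere;
    # the potential is static, so checking frame 0 suffices.
    has_barrier = np.array([psi[0, s:s + 23, 2].any() for s in starts])
    prob = np.where(has_barrier, barrier_weight, 1.0)
    prob /= prob.sum()
    samples = []
    for _ in range(n_samples):
        s = rng.choice(starts, p=prob)
        t = rng.integers(0, n_frames - 4)     # 4 input frames + 1 target
        samples.append((psi[t:t + 4, s:s + 23], psi[t + 4, s:s + 23, :2]))
    return samples
```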
The test cases are generated by a slightly different process. The space is again split into 1024 meshes to run the traditional SSM algorithm; however, unlike in training, no chunking is required for testing. This distinction arises because training uses small windows, while testing summarizes predictions across these windows to emulate the behavior of the quantum wave as a whole. In addition, the test potential barriers take different shapes, triangular, circular, or even random, instead of being a fixed rectangle.
4.2.2.3 Model Architecture and Metrics
Through a series of experiments, we found that incorporating historical data, rather than relying only on the last frame, significantly enhances overall performance, enabling the predicted wave function to maintain accuracy over more frames.
In the realm of neural networks, there are various methodologies for processing sequential data. For our research, we implemented the Gated Recurrent Unit (GRU) [38], which is widely used due to its effectiveness in handling sequential data. A GRU uses two gates, an update gate and a reset gate, which help the network retain or forget information [39]. These gates allow the model to adaptively capture information from different parts of the input sequence, making GRUs useful for tasks like language modeling and time-series analysis [40]. We chose the GRU for this task because it strikes a balance between complexity and performance, demanding minimal computational load while yielding satisfactory performance.
The detailed architecture of the GRU Neural Network is shown below:
As introduced in the last section, the input of the network is 4 consecutive frames × 23 consecutive meshes × 3 values. Each input frame passes through the same shared dense layer, followed by a ReLU activation function, resulting
Figure 4.2: Structure of the emulator neural network
Input features comprise 4 consecutive frames, each with 23 contiguous grid points. Each grid point carries three values: the real and imaginary parts of the wave function and the background potential at that point. These input features pass through three layers, a dense layer, the GRU layer, and another dense layer, finally producing 23 columns × 2 rows representing the predicted real and imaginary parts of the wave function for the next frame.
in a representation of 4 frames with 46 values each. These pass through the GRU layer, also followed by a ReLU activation, which attends mostly to the last frame while making minor corrections using the other three frames. The GRU output then passes through a final dense layer that shrinks it to 46 nodes as the emulator's prediction, which is rearranged into 23 columns × 2 rows carrying the real and imaginary parts of the predicted wave function.
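A PyTorch sketch of this architecture follows; the GRU hidden size of 46 is inferred from the layer widths quoted above, and the class name is ours.

```python
# Emulator sketch: a dense layer shared across the 4 input frames, a GRU
# over the frame sequence, and a final dense layer producing the 23 x 2
# (real, imaginary) prediction for the next frame.
import torch
import torch.nn as nn

class GRUEmulator(nn.Module):
    def __init__(self, window=23, channels=3, hidden=46):
        super().__init__()
        self.window = window
        self.frame_fc = nn.Linear(window * channels, hidden)  # shared weights
        self.gru = nn.GRU(hidden, hidden, batch_first=True)
        self.out_fc = nn.Linear(hidden, window * 2)

    def forward(self, x):                               # x: (batch, 4, 23, 3)
        h = torch.relu(self.frame_fc(x.flatten(2)))     # (batch, 4, 46)
        seq, _ = self.gru(h)                            # (batch, 4, 46)
        last = torch.relu(seq[:, -1])                   # summary of last frame
        return self.out_fc(last).view(-1, self.window, 2)
```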
In the training phase, since the focus is on single window slices, the loss function is the Mean Squared Error (MSE) between the sliced ground truth and the model's prediction.
In the test phase, however, the interest shifts to assessing the predicted wave function in its entirety, so the evaluation compares the full prediction with the full ground-truth wave function. Two metrics are used to give a comprehensive evaluation of the model.
The first metric is the Mean Absolute Error (MAE),

$$ |\varepsilon| = \frac{1}{N_x} \sum_{i=1}^{N_x} \left| \hat{\psi}_i - \psi_i \right| \tag{4.3} $$
where $\hat{\psi}_i$ and $\psi_i$ are the ground truth and predicted wave function, and $N_x$ is the number of meshes in the system. Under most circumstances, MAE faithfully reflects model performance. In some cases, however, such as when the wave has just started spreading and occupies only a few grid points, MAE can remain low even while the prediction blows up locally, because the error is diluted by the $1/N_x$ average.
Our second metric, the normalized correlation, handles this situation. It is defined as

$$ C = \frac{\sum_{i=1}^{N_x} \hat{\psi}_i^{*}\, \psi_i}{|\hat{\psi}|\,|\psi|} \tag{4.4} $$
The definitions of $\hat{\psi}_i$, $\psi_i$, and $N_x$ remain unchanged, and the $*$ symbol denotes the complex conjugate. The normalized correlation treats the predicted and ground-truth wave functions as two vectors and measures their angular difference, and is thus able to discern small deviations even when the wave function is concentrated on only a few grid points.
It is important to note that while the correlation, as an angular measure, provides useful insight, it omits information about the amplitude of the wave function, since the vector lengths are divided out in its calculation. Consequently, both metrics must be considered together to best evaluate the emulator model. Both metrics are computed for all 400 frames in our study.
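Both metrics are straightforward to compute; a sketch is given below, with psi_hat and psi as complex NumPy arrays over the grid. Taking the modulus of the complex correlation in Eq. (4.4) is our convention for reporting a real number.

```python
# The two evaluation metrics of Eqs. (4.3) and (4.4). psi_hat and psi are
# complex arrays of length N_x; the modulus of C is returned so the
# correlation is a real number in [0, 1].
import numpy as np

def mean_absolute_error(psi_hat, psi):
    return np.mean(np.abs(psi_hat - psi))

def normalized_correlation(psi_hat, psi):
    num = np.sum(np.conj(psi_hat) * psi)
    return np.abs(num) / (np.linalg.norm(psi_hat) * np.linalg.norm(psi))
```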
4.2.3 Results and Discussion
The major result of the emulator model is summarized in the figure below. We designed a whole set of test cases, from easy scenarios of free propagation or very small barriers to hard cases with non-trivial potential landscapes; the test results in the figure show two examples, with the left panel a triangular (pyramid-shaped) potential barrier and the right panel a circular potential barrier. The figure presents color-scale maps of both the ground-truth and predicted wave functions over the whole system space from the first frame to the 400th frame, along with side-by-side wave function snapshots at selected frames for both difficult potential barriers. The difference between prediction and ground truth is minimal even after the full 400 frames, when dispersion has almost flattened the whole Gaussian wave packet.
There are two major advances in our result, the first being the emulator's prediction precision. Notably, unlike other machine learning tasks where inputs come from collected data, our model's predictions are based on its own predictions from the preceding frames, applied iteratively. The iteration is repeated up to the 400th frame without any noticeable deviation from the ground truth. This is much more challenging than a simple accuracy requirement, because one
Figure 4.3: Result of the GRU-based emulator
Although our machine-learning-based emulator was trained only on rectangular barriers, it is tested on different unseen cases. Panels d and f correspond to a triangular barrier, while panels e and g correspond to a circular barrier. Panels d and e are heatmaps of the real part of the wave function for prediction and ground truth, containing all values from the first frame to the 400th frame. Panels f and g are snapshots of ground truth and prediction at the 50th, 200th, and 350th frames. Even after 350 frames, the difference between ground truth and prediction remains small.
single wrong prediction can be diluted by the many correct ones in ordinary tasks, whereas a single mistake in the iterative prediction is carried to the next frame and rapidly destroys later predictions. Our model, as the figure shows, maintains the integrity of the wave function through the end of the 400 frames with negligible deviation from the ground truth.
The other notable advantage of our emulator model is its capability to generalize. Given the window-slicing strategy and the potential barrier distribution introduced earlier in this chapter, the situations the model has seen fall into three categories: (1) free propagation of the quantum wave with zero potential; (2) propagation with non-zero but constant potential; (3) propagation with a step-function potential, where the left several consecutive pixels are non-zero and the rest zero, or vice versa. While it is not surprising for the model to behave well on rectangular barrier potentials, since those were encountered during training, our tests show that the GRU-based emulator is capable of much more. Pyramidal and circular potentials involve a background that changes rapidly over space; producing accurate iterative predictions on such landscapes requires the model to genuinely capture the underlying physics of the interaction with the background potential, with minimal error.
On the other hand, our emulator model has inherent limitations. We conducted a comprehensive test with various sets of parameters to probe the model's performance on unseen cases; the outcomes are illustrated in the figure below.
We plot the average normalized correlation across all emulation time steps, denoted ⟨C⟩, as each single parameter is varied, using five different random seeds. The red vertical lines in the figure mark parameter values used in training; our comprehensive test extends well beyond those values.
Performance is least affected by varying the rectangular barrier width, with the correlation remaining high across the whole range. This makes sense for long barriers, because grids are window-sliced in our
Figure 4.4: Model generalization performance
We test our model under variation of different parameters to assess its capability to generalize. The x-axis shows the values of the parameter being tested, while the y-axis shows the average normalized correlation across all emulation time steps. The model generalizes well to unseen values of the wave packet spread and potential barrier width, but performance decays quickly when the wave packet energy or barrier height moves far from the values used in training.
prediction procedure, and a longer barrier does not introduce untrained cases. The short-barrier limit, however, introduces cases where only one or two grid points in the center of the window have non-zero potential, and our model performs well on these new cases.
Wave packet energy and potential barrier height, by contrast, are highly sensitive parameters. The normalized correlation drops quickly as soon as these parameters go beyond the trained values, meaning our emulator does not generalize well in energy-related respects. The likely cause is the restricted range of energy parameters used during training, which may confine the emulator to a local minimum, specifically a low-energy approximation. To address this, expanding the range of energy parameters in the training dataset is a viable solution: it would expose the emulator to a broader spectrum of energy-related scenarios, potentially enhancing its ability to generalize across energy levels.
Chapter 5
Conclusion
Throughout this thesis, we have shown the importance of data preparation and data sampling strategies when applying machine learning to physics problems. In the second chapter, we applied three different data representation methods to a molecular classification task. Our results indicate the superiority of one method (FragVAE) over the other two, while also pointing to a possible improvement from combining different representation methods to achieve better performance. In the third chapter,
we studied physics problems with sparsely distributed data. We illustrated the limitation of using coarse-label prediction to aid fine-label prediction on the CIFAR100 dataset. We also proposed a novel approach that uses fine-label prediction to aid coarse-label prediction, which we validated on CIFAR100 and later applied to the SuperCon superconductor dataset, achieving a 7% improvement in critical-temperature prediction accuracy. Our proposed strategy makes better use of limited information and can be helpful in many classification problems. Further improving our neural network model is one possible future direction, which could aid the discovery of room-temperature superconductors by predicting the features a room-temperature superconductor should have. In the fourth chapter, we proposed a sampling
curriculum for drawing from a dense data distribution, applied to the problem of emulating quantum waves with a GRU-based neural network. We drew weighted samples of wave functions generated with fixed-width rectangular barriers to train the neural network, resulting in a model that is highly resistant to noise and generalizes well beyond the training scenarios. The resulting model shows minimal deviation from the true wave function after 400 frames of consecutive prediction. Though trained on
rectangular-barrier wave functions, the model robustly emulates wave functions for much more difficult cases, such as triangular or circular barriers. Our workflow for the dense-feature problem has the potential to be applied to many other settings, such as higher-dimensional quantum systems, open quantum systems, or other simulation problems.
References
1. Carleo, G. & Troyer, M. Solving the quantum many-body problem with artificial
neural networks. Science 355, 602–606. doi:10.1126/science.aag2302 (2017).
2. Van Nieuwenburg, E. P. L., Liu, Y.-H. & Huber, S. D. Learning phase transitions by
confusion. Nature Physics 13, 435–439. doi:10.1038/nphys4037 (2017).
3. Karagiorgi, G., Kasieczka, G., Kravitz, S., Nachman, B. & Shih, D. Machine learning
in the search for new fundamental physics. Nature Reviews Physics 4, 399–412 (2022).
4. Murphy, K. P. Machine Learning: A Probabilistic Perspective (MIT Press, 2012).
5. Wu, S. A review on coarse warranty data and analysis. Reliability Engineering & System Safety 114, 1–11. doi:10.1016/j.ress.2012.12.021 (2013).
6. LeCun, Y., Jackel, L. D., Bottou, L., Cortes, C., Denker, J. S., Drucker, H.,et al. Learning
algorithms for classification: A comparison on handwritten digit recognition. Neural
networks: the statistical mechanics perspective 261, 2 (1995).
7. Zheng, A. & Casari, A. Feature engineering for machine learning: principles and techniques for data scientists (" O’Reilly Media, Inc.", 2018).
8. Everitt, B. & Skrondal, A. The Cambridge Dictionary of Statistics (Cambridge University Press, 2010).
9. Cybenko, G. Approximation by superpositions of a sigmoidal function. Mathematics
of control, signals and systems 2, 303–314 (1989).
10. Hornik, K. Approximation capabilities of multilayer feedforward networks. Neural
networks 4, 251–257 (1991).
11. Leshno, M., Lin, V. Y., Pinkus, A. & Schocken, S. Multilayer feedforward networks
with a nonpolynomial activation function can approximate any function. Neural networks 6, 861–867 (1993).
12. Pinkus, A. Approximation theory of the MLP model in neural networks. Acta numerica 8, 143–195 (1999).
13. Kratsios, A. Characterizing the Universal Approximation Property. arXiv preprint
ArXiv:1910.03344 (2020).
14. Hopfield, J. J. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the national academy of sciences 79, 2554–2558 (1982).
15. Barra, A., Bernacchia, A., Santucci, E. & Contucci, P. On the equivalence of Hopfield
networks and Boltzmann machines. Neural Networks 34, 1–9 (2012).
16. Heinis, S., Kumar, S., Gezari, S., Burgett, W. S., Chambers, K. C., Draper, P. W., et al. Of genes and machines: application of a combination of machine learning tools to astronomy data sets. The Astrophysical Journal 821, 86. doi:10.3847/0004-637X/821/2/86 (2016).
17. Sanchez-Gonzalez, A., Godwin, J., Pfaff, T., Ying, R., Leskovec, J. & Battaglia, P. Learning to Simulate Complex Physics with Graph Networks in Proceedings of the 37th International Conference on Machine Learning (eds III, H. D. & Singh, A.) 119 (PMLR, 2020),
8459–8468.
18. Peng, J., Yamamoto, Y., Hawk, J. A., Lara-Curzio, E. & Shin, D. Coupling physics
in machine learning to predict properties of high-temperatures alloys. npj Computational Materials 6, 141 (2020).
19. Merchant, A., Batzner, S., Schoenholz, S. S., Aykol, M., Cheon, G. & Cubuk, E. D.
Scaling deep learning for materials discovery. Nature, 1–6 (2023).
20. Taylor, R., Kardas, M., Cucurull, G., Scialom, T., Hartshorn, A., Saravia, E., et al.
Galactica: A Large Language Model for Science 2022.
21. Li, X., Tang, H., Chen, S., Wang, Z., Maravi, A. & Abram, M. Context Matter: DataEfficient Augmentation of Large Language Models for Scientific Applications 2023.
22. Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., et al. On
the Opportunities and Risks of Foundation Models 2022.
23. Lee, K., Ayyasamy, M. V., Delsa, P., Hartnett, T. Q. & Balachandran, P. V. Phase classification of multi-principal element alloys via interpretable machine learning. npj
Computational Materials 8, 25 (2022).
24. Lee, K., Ayyasamy, M. V., Ji, Y. & Balachandran, P. V. A comparison of explainable
artificial intelligence methods in the phase classification of multi-principal element
alloys. Scientific Reports 12, 11591 (2022).
25. Armitage, J., Spalek, L. J., Nguyen, M., Nikolka, M., Jacobs, I. E., Marañón, L., et
al. Fragment graphical variational autoencoding for screening molecules with small
data. arXiv preprint arXiv:1910.13325 (2019).
26. Mukaidaisi, M., Vu, A., Grantham, K., Tchagang, A. & Li, Y. Multi-objective drug
design based on graph-fragment molecular representation and deep evolutionary
learning. Frontiers in pharmacology 13, 920747 (2022).
27. Delaney, J. S. ESOL: estimating aqueous solubility directly from molecular structure.
Journal of chemical information and computer sciences 44, 1000–1005 (2004).
28. Meng, J., Chen, P., Wahib, M., Yang, M., Zheng, L., Wei, Y., et al. Boosting the predictive performance with aqueous solubility dataset curation. Scientific Data 9, 71
(2022).
29. Sharma, N., Jain, V. & Mishra, A. An analysis of convolutional neural networks for
image classification. Procedia computer science 132, 377–384 (2018).
30. Singla, S., Singla, S. & Feizi, S. Improved deterministic l2 robustness on CIFAR-10
and CIFAR-100. arXiv preprint arXiv:2108.04062 (2021).
31. Bengio, Y., Louradour, J., Collobert, R. & Weston, J. Curriculum learning in Proceedings
of the 26th annual international conference on machine learning (2009), 41–48.
32. Combescot, R. Superconductivity: An Introduction doi:10.1017/9781108560184 (Cambridge University Press, 2022).
33. Hamidieh, K. A Data-Driven Statistical Model for Predicting the Critical Temperature of a
Superconductor 2018.
34. Al-Jarrah, O. Y., Yoo, P. D., Muhaidat, S., Karagiannidis, G. K. & Taha, K. Efficient
machine learning for big data: A review. Big Data Research 2, 87–93 (2015).
35. McLachlan, R. I. & Quispel, G. R. W. Splitting methods. Acta Numerica 11, 341–434
(2002).
36. Nakano, A., Vashishta, P. & Kalia, R. K. Massively parallel algorithms for computational nanoelectronics based on quantum molecular dynamics. Computer physics
communications 83, 181–196 (1994).
37. Carrillo-Bernal, M., Martínez-y-Romero, R., Núñez-Yépez, H., Salas-Brito, A. & Solis, D. A. Classical and quantum space splitting: the one-dimensional hydrogen atom.
European Journal of Physics 41, 065405 (2020).
38. Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk,
H., et al. Learning phrase representations using RNN encoder-decoder for statistical
machine translation. arXiv preprint arXiv:1406.1078 (2014).
39. Ravanelli, M., Brakel, P., Omologo, M. & Bengio, Y. Light gated recurrent units for
speech recognition. IEEE Transactions on Emerging Topics in Computational Intelligence
2, 92–102 (2018).
40. Gruber, N. & Jockisch, A. Are GRU cells more specific and LSTM cells more sensitive in motive classification of text? Frontiers in Artificial Intelligence 3, 1–6 (2020).
Appendices
A ESOL dataset
The Estimated Solubility (ESOL) dataset is an essential dataset in the fields of drug discovery and chemical design, as it aids in predicting the aqueous solubility of small molecules,
a key property in these areas. It comprises a collection of chemical compounds, each accompanied by its measured solubility value.
To enrich this dataset, additional detailed information is gathered for each molecule
through Pubchem, a comprehensive and publicly accessible database managed by the
National Center for Biotechnology Information (NCBI). This database provides extensive
data on chemical molecules and their biological assay activities. The data collection was
conducted using a Python script, which interacted with the PUG VIEW Application Programming Interface (API) offered by the Pubchem website.
The dataset underwent a cleaning process involving the removal of empty entries and
invalid rows, resulting in a refined dataset comprising 1009 samples, each characterized
by 15 distinct features. Furthermore, each sample was labeled with 6 binary categories for
evaluation purposes: Irritant, Health Hazard, Flammable, Environmental Hazard, Acute
Toxic, and Corrosive.
Figure 5.1: Snapshot of features of the Extended ESOL Dataset
A snapshot of all the features of the Extended ESOL Dataset. With extra information from PubChem, the dataset contains 15 features. Due to space limitations, only the first 36 entries are shown; the full dataset contains 1009 samples.
Figure 5.2: Snapshot of labels of the Extended ESOL Dataset
A snapshot of all the labels of the Extended ESOL Dataset. In this thesis, we mainly present results on the prediction of the "Health Hazard" label.
B SuperCon Dataset
We used data from [33], which was gleaned from the SuperCon database to suit the specific analytical needs of our research. Hamidieh's dataset involved additional steps to ensure cleanliness and relevance, including correction of recording errors and exclusion of materials with missing critical temperatures, as well as those recorded as zero or abnormally high. This thorough process yielded a refined dataset of 21,263 unique superconductors. For each superconductor, the dataset provides the critical temperature along with 81 different features.
Figure 5.3: Snapshot of the SuperCon Dataset
A snapshot of the SuperCon Dataset as cleaned by Hamidieh. The dataset contains 21,263 unique superconductors, each with 81 features. Due to space limitations, only a small portion of the features and the first 36 rows are shown.
Abstract
In this data-driven era, people are realizing the importance of data analysis, and the ability to harness vast amounts of information has become a cornerstone of progress. Machine learning, especially neural networks, has thus emerged as an invaluable tool for dealing with these colossal data reserves. Beyond well-known applications in fields like computer vision, natural language processing, and finance, machine learning is also playing an increasingly important role in physics. Many outstanding studies have demonstrated how machine learning can enhance physics research: Artificial Neural Networks (ANNs) have been used to reduce the non-trivial correlations in many-body wave functions; Nieuwenburg's paper proposed an innovative approach to finding phase transitions by detecting peaks in network performance; and supervised machine learning is widely used to identify particles.
In my dissertation, I focus on two categories of data that arise in much physics-related research: dense data, when one can sample arbitrarily from a data distribution; and sparse data, when only a limited database is available and nothing is known between entries. Dense-data cases, such as simulation problems, have arbitrarily much data available, so how to select representative data from the dataset remains one of the major issues; a representative dataset greatly improves convergence speed while demanding minimal data generation and storage costs. When dealing with sparse data, on the other hand, such as phase prediction where the available data collection is not continuous, generalization capability takes the lead: the model is given too few examples to learn from, and repeated training will induce overfitting for the expressive models that complicated problems require. In both cases, it is thus imperative to highlight the role of an effective sampling strategy, such as curriculum learning.
In this dissertation, we will focus on navigating the use of machine learning techniques and exploring possible optimizations to propose sampling strategies that could improve model performance.
Our sampling strategy uses the notion of curriculum learning, which is inspired by the way humans learn progressively from easy to hard tasks. Instead of indiscriminately feeding data to models, curriculum learning organizes the learning procedure in a meaningful sequence. This structured process, a notion that has recently become prevalent, inherently possesses the two key features mentioned above: a curriculum learning strategy can provide a smoother learning curve, making the model generalize more easily to unseen data and converge faster to a good minimum.
Most curriculum learning processes emphasize teaching the model in an easy-to-hard order. For sparse-data problems, while we agree that this ordering is reasonable, our work questions its necessity and suggests that other strategies are possible.
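For concreteness, here is a minimal sketch of the conventional easy-to-hard variant; alternative orderings amount to changing the sort key. This is an illustration, not the exact procedure developed later in the dissertation: the scalar difficulty scores, the 20%-to-100% linear pacing, and the numpy-array inputs are all assumptions.

```python
import numpy as np

def curriculum_batches(X, y, difficulty, n_epochs, batch_size=32):
    """Yield mini-batches following an easy-to-hard curriculum.

    `difficulty` holds one scalar score per example (assumed given);
    the pool of visible examples grows linearly from 20% to 100% of
    the sorted data over training. X, y are assumed numpy arrays.
    """
    order = np.argsort(difficulty)             # easiest examples first
    X, y = X[order], y[order]
    n = len(X)
    for epoch in range(n_epochs):
        # Fraction of the (sorted) data visible at this epoch.
        frac = 0.2 + 0.8 * epoch / max(n_epochs - 1, 1)
        pool = max(batch_size, int(frac * n))
        idx = np.random.permutation(pool)      # shuffle within the pool
        for start in range(0, pool, batch_size):
            batch = idx[start:start + batch_size]
            yield X[batch], y[batch]
```

A training loop would consume these batches exactly as it would a standard shuffled loader, so a curriculum of this kind can be swapped in without touching the model code.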
For dense-data problems, where arbitrarily many samples can be drawn from a dense distribution, we propose a strategy for constructing a curriculum in the presence of a cost for the generation, storage, and computation of each new data point, ensuring good prediction accuracy while also promoting generalization.
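One way to picture such a cost-aware strategy is to generate fresh samples only when learning stalls on the data already paid for. The sketch below is our own illustration of that trade-off, not the algorithm developed later in the dissertation; the generator and model.fit interfaces, the seed size of 64, and the plateau criterion are all hypothetical.

```python
def train_with_generation_budget(model, generator, budget,
                                 patience=3, max_rounds=500):
    """Enlarge the training set from a simulator only when learning stalls.

    Assumed (hypothetical) interfaces: `generator(n)` returns a list of
    n fresh (x, y) pairs at unit cost per sample, and `model.fit(data)`
    trains for one round and returns the validation loss.
    """
    data = generator(64)                  # small seed set
    spent, best, stale = 64, float("inf"), 0
    for _ in range(max_rounds):
        val_loss = model.fit(data)
        if val_loss < best - 1e-4:        # still improving on current data
            best, stale = val_loss, 0
        else:
            stale += 1
        if stale >= patience and spent < budget:
            data += generator(64)         # plateau: spend budget on new data
            spent += 64
            stale = 0
        elif stale >= patience:
            break                         # budget exhausted and no progress
    return model
```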
Throughout this dissertation, we present several physics-related example problems and show how, instead of feeding the neural network data at random, a purpose-designed data sampling strategy can improve model performance.
The outline of the dissertation is as follows. We start with a brief introduction to machine learning in the context of physics, to familiarize the reader with basic machine learning concepts, especially neural networks. We then discuss the importance of data representation in machine learning, the very first step of data preparation. After that, we focus on how to use machine learning in sparse-data problems in physics, with two distinct examples. Finally, we address machine learning in dense-data problems in physics and provide another illustrative instance.
Conceptually similar
Multi-scale quantum dynamics and machine learning for next generation energy applications
Simulation and machine learning at exascale
Theoretical modeling of nanoscale systems: applications and method development
Graph machine learning for hardware security and security of graph machine learning: attacks and defenses
Deep learning for subsurface characterization and forecasting
Photoexcitation and nonradiative relaxation in molecular systems: methodology, optical properties, and dynamics
Coulomb interactions and superconductivity in low dimensional materials
Efficient machine learning techniques for low- and high-dimensional data sources
Alleviating the noisy data problem using restricted Boltzmann machines
Physics informed neural networks and electrostrictive cavitation in water
Explorations in the use of quantum annealers for machine learning
Exploring complexity reduction in deep learning
Deep learning techniques for supervised pedestrian detection and critically-supervised object detection
Learning fair models with biased heterogeneous data
Neural network for molecular dynamics simulation and design of 2D materials
Machine-learning approaches for modeling of complex materials and media
Dynamical representation learning for multiscale brain activity
Labeling cost reduction techniques for deep learning: methodologies and applications
Learning distributed representations from network data and human navigation
High throughput computational framework for synthesis and accelerated discovery of dielectric polymer materials using polarizable reactive molecular dynamics and graph neural networks