Photoplethysmogram-based Biomarker for Assessing Risk of Vaso-occlusive Crisis in Sickle Cell Disease: Machine Learning Approaches

by Ji, Yunhua

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Biomedical Engineering)

May 2020

Contents

List of Figures
List of Tables
1 Introduction
1.1 Overview
1.2 Specific Aims and Objectives
1.2.1 Children's Hospital Los Angeles Dataset
1.2.2 Sleep and Asthma Cohort Study Dataset
2 Literature Review
2.1 Sickle Cell Disease and Vaso-occlusive Crisis
2.1.1 Sickle Cell Disease
2.1.2 Vaso-occlusive Crisis
2.2 Machine Learning Methods
2.2.1 Introduction
2.2.2 Medical Applications
2.2.3 Types of Machine Learning Methods
2.2.4 Neural Network and Deep Learning
2.2.5 Convolutional Neural Network
2.2.6 Recurrent Neural Network
2.2.7 Support Vector Machine
2.2.8 Decision Tree
2.2.9 Ensemble Learning
2.2.10 Random Forest
2.2.11 Extremely Randomized Trees
2.2.12 Adaptive Boosting
2.2.13 Gradient Boosting
2.2.14 Extreme Gradient Boosting
2.3 Model Interpretation
2.3.1 Gradient-weighted Class Activation Mapping
2.3.2 Feature Visualization
2.3.3 Feature Importance
3 Data
3.1 Children's Hospital Los Angeles Dataset
3.1.1 Dataset and Studies
3.1.2 Data Preparation
3.1.3 Labels
3.2 Sleep and Asthma Cohort Study Dataset
3.2.1 Dataset and Studies
3.2.2 Data Preparation
3.2.3 Pain Categories
4 Methods
4.1 Children's Hospital Los Angeles Dataset
4.1.1 Data Preprocessing
4.1.2 Machine Learning Models
4.1.3 Model Training
4.2 Sleep and Asthma Cohort Study Dataset
4.2.1 Machine Learning Model
4.2.2 Data Preprocessing
4.2.3 Feature Selection
4.2.4 Model Training
4.2.5 Model Performance
5 Results
5.1 Children's Hospital Los Angeles Dataset
5.1.1 Evaluation Metric
5.1.2 Model Comparison
5.1.3 Data Visualization
5.2 Sleep and Asthma Cohort Study Dataset
5.2.1 Evaluation Metric
5.2.2 Benchmark of Model Performance
5.2.3 Model Comparison
6 Discussion
6.1 Children's Hospital Los Angeles Dataset
6.1.1 Model Analysis
6.1.2 Limitations
6.2 Sleep and Asthma Cohort Study Dataset
6.2.1 Model Analysis
6.2.2 Challenges and Limitations
6.2.3 Application and Significance of the Work
7 Future Work
7.1 Model Interpretation
7.2 Model Improvement
7.3 Future Research
Acknowledgements
Bibliography
Appendix I Models and Hyper-parameters
Appendix II Model Performance

List of Figures

Figure 2.1 Normal red blood cells vs sickle red blood cells.
Figure 2.2 Fundamental model of the vaso-occlusive unit in sickle cell disease.
Figure 2.3 Performance of ImageNet models over the years compared to human performance.
Figure 2.4 Computational model of neurons.
Figure 2.5 Examples of nonlinear activation functions.
Figure 2.6 Example of a 2-layer fully connected neural network.
Figure 2.7 Example of an input layer with 5 neurons and a convolutional layer with 2 filters of size 3.
Figure 2.8 Example of a max-pooling layer of size 3.
Figure 2.9 Long short-term memory unit (cell).
Figure 2.10 Illustration of distance and geometric margin.
Figure 2.11 Illustration of the optimal margin separating hyperplane.
Figure 2.12 A made-up example of a decision tree.
Figure 3.1 Photoplethysmogram and photoplethysmogram amplitude.
Figure 3.2 Examples of normalized photoplethysmogram amplitude signals and their labels.
Figure 3.3 Overview of the Sleep and Asthma Cohort Study (SAC) dataset.
Figure 3.4 Example of Rembrandt software screenshot for one subject's PSG study (overview).
Figure 3.5 Example of Rembrandt software screenshot for one subject's PSG study (channels).
Figure 3.6 Example of Rembrandt software screenshot for one subject's PSG study (annotations).
Figure 4.1 Illustration of the weighted mean feature.
Figure 4.2 Illustration of the trend feature.
Figure 4.3 Architecture of the multilayer perceptron model with hand-engineered features as input.
Figure 4.4 Architecture of the convolutional neural network model.
Figure 4.5 Architecture of the long short-term memory network.
Figure 4.6 Architecture of the two-level stacking model.
Figure 4.7 Illustration of the cross-validation training and evaluation process.
Figure 4.8 Overall process for training and evaluation of the two-level stacking model.
Figure 5.1 Visualization of the linear regression model with hand-engineered features as input on one of the training sets.
Figure 5.2 Visualization of the linear regression model with hand-engineered features as input on one of the validation sets.
Figure 5.3 Visualization of the multilayer perceptron model with hand-engineered features as input on one of the training sets.
Figure 5.4 Visualization of the multilayer perceptron model with hand-engineered features as input on one of the validation sets.
Figure 5.5 Gradient-weighted Class Activation Mapping for two photoplethysmogram amplitude signals based on the convolutional neural network model.
Figure 5.6 Generated photoplethysmogram amplitude signal based on gradients of the output.
Figure 5.7 F1-score contour with respect to precision and recall.
Figure 5.8 Performance of the benchmark random model.
Figure 5.9 Model performance comparison (paired t-test).
Figure 6.1 Illustration of performance vs size of dataset for deep neural networks and traditional learning algorithms.
Figure 6.2 MLP performance with different inputs (paired t-test).

List of Tables

Table 2.1 Notation for neural networks.
Table 4.1 Cooccurrence probabilities for target words "ice" and "steam" with selected context words from a 6-billion-word corpus.
Table 5.1 Performance of all models (CHLA dataset).
Table 5.2 Model performance comparison (SAC dataset).
Table 6.1 Performance of base models (SAC dataset).
Table 6.2 MLP performance with different inputs.

1 Introduction

1.1 Overview

Sickle cell disease (SCD) is an inherited genetic blood disorder that can lead to many catastrophic consequences. SCD is caused by the substitution of glutamic acid by valine at the 6th position of the β chain of hemoglobin, which results in the formation of sickle hemoglobin (HbS) (Itano and Pauling, 1949; Pauling et al., 1949). When HbS polymerizes in the deoxygenated state, flexible discoid red blood cells (RBCs) are transformed into rigid sickle-shaped erythrocytes, which can obstruct microvascular blood flow if they fail to escape the microvasculature before the transformation occurs (Coates et al., 2018). The blockage promotes the hallmark symptom of SCD, known as vaso-occlusive crisis (VOC), or pain crisis. VOC is characterized by recurrent episodes of excruciating pain, ischemic injuries, irreversible organ damage and premature death (Ballas, 2005; Platt et al., 1994). Therefore, the microvascular flow rate likely plays a key role in triggering VOC and incurring these severe consequences.
Although blood rheology can be affected by many factors, such as blood viscosity, inflammation, activation of coagulation, etc., it is quite likely that vasoconstriction mediated by the autonomic nervous system (ANS) can tilt the delicate balance towards vaso-occlusion (Coates et al., 2018; Christoph et al., 2005; Hofrichter, 1986; Ferrone et al., 1985; Ferrone et al., 1985; Eaton et al., 1976; Hofrichter et al., 1974). Thus, sustained peripheral vasoconstriction can be a warning sign of VOC for SCD patients. In this work, analyses were conducted on two datasets to understand the causal relationship between the ANS (particularly vasoconstriction) and VOC: the Children's Hospital Los Angeles (CHLA) dataset and the Sleep and Asthma Cohort Study (SAC) dataset (Willen et al., 2017; Vanderbilt University, 2011).

The CHLA dataset contains measurements from three different studies, exposing subjects to external ANS, heat pain and cold pain stimuli respectively (Chalacheva et al., 2019; Khaleel et al., 2017; Khaleel et al., 2015; Veluswamy et al., 2018; Veluswamy et al., 2017). In order to quantify the severity of peripheral vasoconstriction, we focused on photoplethysmogram (PPG) signals, whose amplitude (PPGa) reflects the level of vasoconstriction (Allen, 2007; Elgendi, 2012), and developed two features based on symbolic dynamics (Costa et al., 2017; Cysarz et al., 2015; Cysarz et al., 2013; Voss et al., 2000) and a rule-based algorithm to identify the pattern of sustained vasoconstriction. To evaluate the effectiveness of the designed features, seven SCD researchers were asked to label each PPGa signal from 0 to 4 based on the perceived tendency of sustained vasoconstriction; this rating is referred to as the vasoconstriction index in the rest of this work. Machine learning models were built to see how well the designed features can predict the vasoconstriction index. Moreover, deep learning models (including a convolutional neural network (CNN) and a long short-term memory (LSTM) network) were developed to understand whether deep models are applicable to physiological signal analysis with a relatively small dataset (163 recordings). As a result, the deep neural networks outperformed the hand-engineered features by a large margin, reducing the mean squared error (MSE) from 0.239 to 0.108.

The SAC dataset contains records of subjects' hospitalization history due to sickle pain induced by VOC (Willen et al., 2017; Vanderbilt University, 2011), which is an objective gold standard for VOC severity. Unlike the CHLA studies, the SAC study is a polysomnography (PSG, a type of sleep study) study that does not contain significant external stimuli to induce sustained vasoconstriction, and therefore there is no corresponding pattern in the PPGa signals. Instead, a coefficient of variation (CV) based vasoconstriction detection algorithm (Chalacheva et al., 2019) was applied to detect vasoconstriction events in the PPGa signals. Autonomic nervous system (ANS) indices were then developed based on the detected events. These indices, along with other features (including physiological features, heart rate variability (HRV) features, etc.), were used to identify the pain rate of SCD patients through a two-level stacking machine learning model. The first level of the model contained seven different base models, while the second level was a meta model which took both the subject features and the outputs of the base models as its inputs.
Results showed that physiological and ANS features predicted future (post-PSG) pain rate significantly better than past (pre-PSG) pain rate, which agreed with previous research (Chalacheva et al., 2019) and implied that ANS features are predictors for VOC, but not vice versa. It was also observed that vasoconstriction-related features did improve the performance of the model.

1.2 Specific Aims and Objectives

1.2.1 Children's Hospital Los Angeles Dataset

For the CHLA dataset, the objectives of the analyses are:
1. To determine whether deep learning methods can be applied to a relatively small database to detect patterns of sustained vasoconstriction in PPGa signals.
2. To use data visualization techniques to learn how the convolutional neural network (CNN) works and to get hints about important features or patterns of the PPGa signals that are crucial for identification of sustained vasoconstriction.

1.2.2 Sleep and Asthma Cohort Study Dataset

For the SAC dataset, the overall objective of the analyses is to determine whether ANS features, especially vasoconstriction-related biomarkers, can predict VOC in SCD patients. This question is handled by addressing the following specific sub-questions:
1. Can future (post-PSG) pain rate be predicted by an SCD subject's physiological and ANS features derived from PPG measurements during sleep?
2. Is there an association between past (pre-PSG) pain rate and the subject's physiological and ANS features derived from PPG measurements during sleep?
3. Can vasoconstriction-related biomarkers improve the prediction of future (post-PSG) pain rate over other biomarkers that do not incorporate direct information about vasoconstriction?

The first two sub-questions focus on determining whether ANS features are predictors for VOC. The answer will be "yes" if post-PSG pain rate can be better predicted than pre-PSG pain rate, or "no" otherwise. The third sub-question concentrates on the vasoconstriction-related features among all ANS features. Our future goal is to explore whether the PPG signal, which is easily accessible through wearable devices, has the potential to provide a VOC-related biomarker for better therapeutic approaches and risk management strategies for SCD patients.

2 Literature Review

2.1 Sickle Cell Disease and Vaso-occlusive Crisis

2.1.1 Sickle Cell Disease

Sickle cell disease (SCD), also known as sickle cell anemia, is a group of inherited genetic blood disorders affecting approximately 100,000 Americans, most prevalent in African Americans (about 1 out of every 365 births) (Yates, 2018; Centers for Disease Control and Prevention, 2017; Hassell, 2010). The median age at death for SCD patients is younger than 50 years (Hassell, 2010; Platt et al., 1994). As discussed in Section 1.1, the hallmark symptom of SCD is vaso-occlusive crisis (VOC). VOC is characterized by acute episodes of pain, likely caused by obstruction of microvascular blood flow by polymerized sickle-shaped erythrocytes (which contain abnormal hemoglobin S [HbS]).

Figure 2.1 Normal red blood cells (RBCs) vs sickle RBCs. (A) shows microvascular blood flow with normal RBCs, and the cross-section of a normal RBC. (B) shows a sickle RBC blocking microvascular blood flow, and the cross-section of a sickle RBC. Figure reproduced from the National Heart, Lung, and Blood Institute.

2.1.2 Vaso-occlusive Crisis

In the 1970s, Eaton et al. proposed that the fundamental mechanism of VOC is obstruction in the microvasculature.
The obstruction is caused by the polymerized sickle-shaped RBCs (SS-RBCs) and reduced blood flow (Christoph et al., 2005; Hofrichter, 1986; Ferrone et al., 1985; Ferrone et al., 1985; Eaton et al., 1976; Hofrichter et al., 1974). In other words, the possibility of regional vaso-occlusion mainly depends on two factors: first, the RBC transit time $T_t$ in the microvasculature; second, the delay time to polymerization $T_d$, i.e. the time between the onset of deoxygenation and polymerization (Coates et al., 2018). If $T_t > T_d$, then there is a greater chance for SS-RBCs to be retained in the microvasculature and block the small vessels. $T_d$ is mainly affected by the intracellular concentration of HbS (Coates et al., 2018; Eaton and Bunn, 2017). $T_t$ is the reciprocal of the velocity of microvascular blood flow, which is affected by autonomic nervous system (ANS) induced vasoconstriction, rheological properties of blood, inflammation and adhesion of cellular elements in venules, and nitric oxide availability (which is reduced by hemolysis and modulates blood flow) (Coates et al., 2018).

Figure 2.2 Fundamental model of the vaso-occlusive unit in SCD. SS-RBCs obstruct the small vessels when $T_{d1} < T_t$; otherwise, when $T_{d2} > T_t$, SS-RBCs can escape the microvasculature and flow freely to larger post-capillary vessels. Factors that decrease blood flow and thus increase the transit time $T_t$ are shown in the boxes. ET-1 refers to endothelin-1. Figure reproduced from Coates et al., 2018.

2.2 Machine Learning Methods

2.2.1 Introduction

Machine learning is the technology of building algorithms or mathematical models to perform a certain task, such that performance can be improved progressively by updating (i.e. training) the parameters of the model with sample data, or "training data". This terminology was first proposed by Arthur Samuel in 1959 (Samuel, 1959). For the first half century after its birth, machine learning struggled with perception tasks, including image recognition, speech recognition, pattern recognition, diagnosis, etc., as machines performed much worse than human beings. However, things changed dramatically in the past decade. In 2012, the release of AlexNet (Krizhevsky et al., 2012) marked the starting point of the deep learning era. AlexNet deployed an eight-layer convolutional neural network (CNN) model in the ImageNet Large Scale Visual Recognition Challenge (Russakovsky et al., 2015), much deeper than its predecessors (which only contained two to three layers), and outperformed its competitors by a large margin. Since then, machine learning models have been evolving extremely fast. Three years later, PReLU-Net became the first model to beat human performance in image recognition tasks (He et al., 2015). Soon after that, machines surpassed humans in another perception task: in 2016, research scientists at Microsoft reported that their speech recognition technology had surpassed the performance of human transcriptionists (Xiong et al., 2016). The significantly improved accuracy and scalability of learning algorithms, along with the dramatically increased amount of accessible data, make machine-assisted diagnosis an obvious choice to improve the efficiency of the current medical system.

2.2.2 Medical Applications

There have been a large number of machine learning applications in the medical area since the start of the deep learning era. One major application is the implementation of computer vision technologies in medical imaging.
There are three main categories of medical imaging challenges. The first is classification, which focuses on classifying/diagnosing an image or a region of interest within the image; a well-known work of this type is the skin cancer diagnosis research published in 2017 (Esteva et al., 2017). Detection is another category, which involves both locating regions of interest in medical images and classifying those regions. A widely adopted and effective detection framework is Faster R-CNN, proposed by Ren et al., 2015, and much research on detection problems in medical imaging has been based on Faster R-CNN and its variations (e.g. Zhao et al., 2018; Ding et al., 2017; Yang et al., 2017). The last category is segmentation, which is essentially a pixel-wise classification problem. A typical question in a segmentation problem looks like "Does this pixel belong to a blood vessel, a tumor, or the background?". The classical benchmark segmentation model is U-Net, proposed by Ronneberger et al., 2015.

Figure 2.3 Performance of ImageNet models (top-5 classification error) over the years compared to human performance. Figure reproduced from Devopedia, 2019.

Another key application of machine learning algorithms lies in physiological signals. Researchers have used 1-dimensional CNNs to classify ECG signals and detect arrhythmia (Rajpurkar et al., 2017). A large-scale assessment of smartwatch measurements for detecting atrial fibrillation has also been conducted (Perez et al., 2019). Moreover, many commercially available wearable devices employ machine learning algorithms to detect sleep stages and energy consumption using acceleration and PPG signals (e.g. Biostrap Sleep Lab).

2.2.3 Types of Machine Learning Methods

In general, machine learning methods may be divided into three classes: supervised learning, unsupervised learning and reinforcement learning, although more detailed classifications exist.

Supervised learning models are built with a set of data containing both the inputs and the outputs (known as the labels). Supervised learning algorithms include regression algorithms and classification algorithms. When the desired outputs are continuous in value, the algorithms are known as regression algorithms; examples include linear regression, regression trees, neural networks with a regression head, etc. When the outputs are categorical, the algorithms are called classification algorithms; examples include logistic regression, support vector machines, decision trees, neural networks with a classification head, etc. This work primarily focuses on supervised learning methods.

Unsupervised learning models are built with a set of data containing the inputs only. Unsupervised learning algorithms are usually used to find structure in the data: for instance, clustering of the data points, encoding data into vectors (also known as embeddings), anomaly detection, etc.

Reinforcement learning models deal with how agents (e.g. robots, cars, etc.) act in certain environments to maximize a reward. Reinforcement learning algorithms are commonly deployed in the robotics industry, autonomous driving and AI gaming. The famous Go program AlphaGo (Silver and Hassabis, 2016), released in 2016, falls into this category.

In our application, we have found supervised learning algorithms to be the most relevant class, and thus in the rest of this chapter we review only the supervised learning algorithms applied in this work.
2.2.4 Neural Network and Deep Learning

A neural network, or artificial neural network, is a computational model inspired by the biological neural networks of the human brain. As its name suggests, a neural network is a network of artificial neurons, which mathematically model biological neurons. The structure of the single-neuron model is shown in Figure 2.4. A neuron takes inputs through its dendrites from the axons of the previous neurons. The dendrites apply weights to the inputs. The cell body collects the weighted inputs from all dendrites and sums them up, i.e. obtains a linear combination of all inputs (a bias term is usually included as well). The cell body then determines whether the neuron fires by passing the summation through a nonlinear activation function.

Figure 2.4 Computational model of neurons. Figure reproduced from Stanford CS231n.

Figure 2.5 shows some examples of nonlinear activation functions. tanh (hyperbolic tangent) and sigmoid used to be the commonly adopted activation functions. In recent years, the rectified linear unit (ReLU) has become very popular. One major advantage of ReLU is that it can facilitate the training of neural networks due to its linear, non-saturating form, as the derivative (also known as the gradient) of ReLU is either 1 or 0. In contrast, tanh and sigmoid suffer from vanishing gradients when the summation is very small or very large; this will become clearer after the backpropagation algorithm is introduced later in this section.

Note that the activation function must be nonlinear for two reasons. First, it makes the model more flexible, with a nonlinear decision boundary. Second, it makes it meaningful for a neural network (introduced in the next paragraph) to have multiple layers: linear algebra shows that if the activation function is linear, multiple layers are equivalent to a single layer (a product of matrices is itself a matrix). Finally, the neuron passes its activation (its output) to the following neurons through its axon.

Artificial neurons form a neural network in the following manner:
1. Multiple neurons form a layer. Neurons in the same layer are not connected to each other.
2. Neurons in each layer connect only to neurons in the neighboring layers.
3. Different layers can have different types of connections. Layers are usually named based on the connection to their previous layer.

Neural networks are usually named according to their encoding layers; for instance, convolutional neural networks use convolutional layers as the encoder. Figure 2.6 shows an example of a multilayer fully connected neural network, known as a multilayer perceptron (MLP). In a neural network, the first layer is the input layer and the last layer is the output layer; all other intermediate layers are called hidden layers. Let us define some notation and then describe neural networks mathematically (see Table 2.1).

Figure 2.5 Examples of nonlinear activation functions: hyperbolic tangent (left), sigmoid (middle), rectified linear unit (ReLU) (right).

Figure 2.6 Example of a 2-layer fully connected neural network. This network has one input layer, one hidden layer and one output layer. When counting the total layers of a neural network, only the layers with trainable parameters are included; thus this network is a 2-layer network (the input layer does not have trainable parameters).
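To make the single-neuron model above concrete, the following minimal numpy sketch (an illustration, not code from this work) computes the weighted sum at the cell body and passes it through a nonlinear activation; all input values and weights are made up.

```python
import numpy as np

def relu(z):
    # Rectified linear unit: gradient is 1 for z > 0 and 0 otherwise
    return np.maximum(0.0, z)

def sigmoid(z):
    # Saturating activation: its gradient vanishes for very large |z|
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b, f=relu):
    # Cell body: linear combination of the inputs plus a bias, then activation
    z = np.dot(w, x) + b
    return f(z)

x = np.array([0.5, -1.2, 3.0])   # inputs from previous neurons (made up)
w = np.array([0.1, 0.4, -0.2])   # dendrite weights (made up)
print(neuron(x, w, b=0.05))      # activation passed on through the axon
```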
Table 2.1 Notation for neural networks

Symbol: Meaning
$x$: input
$y$: output
$b$: bias term
$z$: summation at the cell body, including the bias term
$a$: activation of the current neuron
$f$: activation function
superscript $[l]$: the $l$-th layer (the input layer is layer 0)
superscript $(i)$: the $i$-th example in the training set
subscript $k$ (except for weights): the $k$-th neuron in the current layer
$w_{kj}$: weight of the dendrite connecting the $k$-th neuron in the current layer and the $j$-th neuron in the previous layer
Example: $a^{[1](5)}_3$ denotes the output of the 3rd neuron in the 1st hidden layer for the 5th training example.

The output of a neural network is calculated layer by layer, from the first to the last; this whole process is called forward propagation. Take the model in Figure 2.6 as an example. The activations of the input layer for the $i$-th training example are

$$a^{[0](i)}_1 = x^{(i)}_1, \quad a^{[0](i)}_2 = x^{(i)}_2, \quad a^{[0](i)}_3 = x^{(i)}_3$$

The activations of the first hidden layer for the $i$-th training example are

$$a^{[1](i)}_k = f\left(z^{[1](i)}_k\right) = f\left(w^{[1]}_{k1} a^{[0](i)}_1 + w^{[1]}_{k2} a^{[0](i)}_2 + w^{[1]}_{k3} a^{[0](i)}_3 + b^{[1]}_k\right), \quad k = 1, 2, 3$$

The output of the neural network for the $i$-th training example is

$$z^{[2](i)}_1 = w^{[2]}_{11} a^{[1](i)}_1 + w^{[2]}_{12} a^{[1](i)}_2 + w^{[2]}_{13} a^{[1](i)}_3 + b^{[2]}_1$$

Depending on the task, a different activation function may be applied to the final output $z^{[2](i)}_1$. For regression tasks (as in this study), usually no activation function is applied to the output (or, equivalently, an identity linear activation function is applied).

The forward propagation can also be expressed in matrix format, which is simpler:

$$A^{[0]} = X, \quad A^{[1]} = f\left(Z^{[1]}\right) = f\left(W^{[1]} A^{[0]} + b^{[1]}\right), \quad Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]}$$

where

$$A^{[l]} = \begin{bmatrix} a^{[l](1)}_1 & \cdots & a^{[l](m)}_1 \\ \vdots & \ddots & \vdots \\ a^{[l](1)}_n & \cdots & a^{[l](m)}_n \end{bmatrix}, \quad W^{[l]} = \begin{bmatrix} w^{[l]}_{11} & \cdots & w^{[l]}_{1n'} \\ \vdots & \ddots & \vdots \\ w^{[l]}_{n1} & \cdots & w^{[l]}_{nn'} \end{bmatrix}, \quad b^{[l]} = \begin{bmatrix} b^{[l]}_1 \\ \vdots \\ b^{[l]}_n \end{bmatrix}$$

Here $n$ is the total number of neurons in the $l$-th layer, $n'$ is the total number of neurons in the $(l-1)$-th layer, and $m$ is the total number of examples in the training set; $X$ and $Z^{[l]}$ are defined similarly.

In order to train the neural network, a cost function is required to describe how far the output $z^{[2]}_1$ is from the gold standard $y$ provided in the training set. For regression tasks, the mean squared error (MSE) is a commonly used cost function:

$$J(W, b) = \frac{1}{m} \sum_{i=1}^{m} \left( y_i - z^{[2](i)} \right)^2$$

Note that the cost function is a metric for how well the model performs: the lower the cost (the value of the cost function), the better the model. Also note that it is a function of the model parameters (i.e. the weights and biases), which means model performance can be improved by adjusting the parameters to lower the cost. This can be done by taking the derivative of the cost function with respect to each parameter and adjusting the corresponding parameter in the negative-derivative direction by a small amount $\alpha$, called the learning rate. All the derivatives (also known as gradients) are calculated in a step called backpropagation, which is shorthand for "the backward propagation of errors" (McGonagle et al.). It is so named because the gradients are calculated from the last layer to the first according to the chain rule of calculus. For the detailed mathematics of backpropagation, refer to (McGonagle et al.; Riedmiller and Braun, 1993; Hecht-Nielsen, 1992; Werbos, 1990).
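As an illustration (a sketch under assumed shapes, not the dissertation's code), forward propagation in matrix format and the MSE cost for a 2-layer network like the one in Figure 2.6, with three inputs, three hidden neurons, one linear output and five made-up examples:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)

# Made-up data: one column per training example
X = rng.normal(size=(3, 5))                 # A[0] = X
y = rng.normal(size=(1, 5))                 # gold-standard outputs
W1, b1 = rng.normal(size=(3, 3)), np.zeros((3, 1))
W2, b2 = rng.normal(size=(1, 3)), np.zeros((1, 1))

# Forward propagation in matrix format
A1 = relu(W1 @ X + b1)                      # hidden-layer activations
Z2 = W2 @ A1 + b2                           # linear output head (regression)

# Mean squared error cost J(W, b)
J = np.mean((y - Z2) ** 2)
print(J)
```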
Once the gradients are available, the parameters can be updated by the following equations:

$$w^{[l]}_{kj} = w^{[l]}_{kj} - \alpha \frac{\partial}{\partial w^{[l]}_{kj}} J(W, b), \qquad b^{[l]}_k = b^{[l]}_k - \alpha \frac{\partial}{\partial b^{[l]}_k} J(W, b)$$

The above steps are repeated until convergence. This algorithm is called gradient descent. Based on it, there are some widely accepted improvements that can accelerate convergence, including gradient descent with momentum (Qian, 1999), RMSProp (unpublished, proposed by Geoff Hinton [Ruder, 2016]), and adaptive moment estimation (Adam) optimization, which combines gradient descent with momentum and RMSProp (Kingma and Ba, 2015).

In conclusion, neural network models are built and trained with the following steps:
1. Determine the structure of the neural network (input and output layers, number of hidden layers, type of hidden layers, etc.).
2. Define a cost function.
3. Randomly initialize the weights and zero-initialize all the biases.
4. Use forward propagation to compute the cost.
5. Use backpropagation to compute the gradients.
6. Update the parameters according to the gradients.
7. Iterate steps 4 to 6 until convergence.

Note that in step 3, unless the neural network has no hidden layers (e.g. logistic regression is equivalent to a single-layer neural network with a sigmoid activation function and no hidden layers), random initialization is needed in order to avoid symmetric weights in the hidden layers, which would lead to failure of the model. For deep models, the weights also need to be initialized carefully to avoid the vanishing gradient problem (when gradient values are small, the gradients decrease layer by layer due to the chain rule and eventually "vanish") and the exploding gradient problem, which can make training fail to converge (particularly the vanishing gradient problem, because exploding gradients can usually be handled by applying thresholds). Refer to (Nielsen, 2015) for more details.

The concept of the neural network dates back to the 1940s-1950s, while some key algorithms, including backpropagation, were invented in the 1970s-1980s. However, due to limited computational power and the vanishing gradient problem, neural network models at that time were usually shallow (no more than two hidden layers) and ineffective, and they were overtaken by other machine learning methods like linear regression, the support vector machine, etc.

Recently, with the rise of computational power, the availability of big data, and techniques to overcome the vanishing gradient problem, there has been a revival of interest in neural networks, particularly deep models ("deep" is relative; there is no exact definition of "deep"), for their superior performance over traditional methods. AlexNet (Krizhevsky et al., 2012), the first deep model in the ImageNet competition, outperformed all shallow models by a large margin. AlphaGo (Silver and Hassabis, 2016), an AI Go program based on deep models and developed by Google DeepMind, beat a human world champion Go player in 2016, which had been considered an impossible task for decades due to the complexity of the game of Go. These iconic events, along with the fact that deep models have performance on par with or better than humans in perception tasks like voice recognition, image recognition, diagnosis, etc., make deep learning an excellent candidate to improve the performance of other models.
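Continuing the earlier forward-propagation sketch (reusing its variables; an illustration rather than the dissertation's implementation), steps 4-6 can be written as a gradient-descent loop, with the gradients of the tiny network derived by hand via the chain rule:

```python
alpha = 0.01                                 # learning rate

for step in range(1000):
    # Step 4: forward propagation and cost
    Z1 = W1 @ X + b1
    A1 = np.maximum(0.0, Z1)
    Z2 = W2 @ A1 + b2
    J = np.mean((y - Z2) ** 2)

    # Step 5: backpropagation via the chain rule
    m = X.shape[1]
    dZ2 = 2.0 * (Z2 - y) / m
    dW2 = dZ2 @ A1.T
    db2 = dZ2.sum(axis=1, keepdims=True)
    dZ1 = (W2.T @ dZ2) * (Z1 > 0)            # ReLU gradient: 1 where Z1 > 0
    dW1 = dZ1 @ X.T
    db1 = dZ1.sum(axis=1, keepdims=True)

    # Step 6: gradient-descent parameter update
    W1 -= alpha * dW1; b1 -= alpha * db1
    W2 -= alpha * dW2; b2 -= alpha * db2
```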
2.2.5 Convolutional Neural Network

In the MLP model introduced in the previous section (shown in Figure 2.6), all the neurons in neighboring layers are connected to each other. This type of layer is called a fully connected layer (alternative names include dense layer and linear layer). Despite being powerful in forming a nonlinear decoder for the input or representation, the dense layer has two major drawbacks. First, when the connected layers both have a large number of neurons, say $n_1$ and $n_2$ respectively, the number of connections, and thus the number of weights, will be $n_1 n_2$, which is extremely large and will cause problems like long training time and overfitting. Second, a dense layer treats all neurons in the input equivalently, which means it breaks the inner structure of the input. For instance, in physiological signals, nearby data points usually have a much stronger inner connection than points that are far away from each other.

One simple solution to these problems is to restrict the connections between layers to be local. This idea is inspired by the fact that neurons in the animal visual cortex only respond to stimuli in a certain region. In addition, both natural images and physiological signals tend to be "stationary", i.e. different parts of the image/signal have similar statistics (e.g. mean, variance, etc.), which means patterns or features learned from a small part of the image/signal can be useful for other parts as well. The convolutional layer emerges from these thoughts. The rest of this section focuses on 1-dimensional (1D) convolutional layers, since this work is about physiological signals, which are 1D.

The weights (dendrites) in convolutional layers are called filters (or kernels). Each convolutional layer can have a number of filters of the same size. Filter size is defined by the number of neurons the filter connects to in the previous layer. Figure 2.7 shows an input layer with five neurons and a convolutional layer with 2 filters of size 3. Each filter detects a certain local feature (i.e. pattern) at different locations. For instance, the blue filter with weights 1, 0, -1 detects a local decreasing pattern by cross-correlating every three consecutive neurons in the input layer with the three weights, passing the result through a nonlinear activation function (usually ReLU), and obtaining the value for the output neuron. Note that although the layer and the model are named after "convolution", the operation is mathematically a "cross-correlation"; this change does not affect the result of the model, i.e. it is computationally equivalent, and it merely saves the effort of flipping the input (or the order of the weights). If the input has $n'$ neurons and the filter size of the convolutional layer is $s$, then the output of the convolutional layer for this filter will have

$$n = n' - s + 1$$

neurons (only stride 1 is considered in this section; the stride is the number of neurons the filter shifts at each step). If there are $n_f$ filters in this convolutional layer, the size of the output of the whole layer will be $n \times n_f$, i.e. the different filters are treated as an additional dimension, called depth.

Figure 2.7 Example of an input layer with five neurons and a convolutional layer with 2 filters of size 3. ReLU is applied as the activation function for the convolutional layer. The first number "3" in the convolutional layer block denotes the filter size; the second number "2" refers to the number of filters.
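As an illustration of the cross-correlation described above (a sketch, not the dissertation's code), the filter with weights 1, 0, -1 from Figure 2.7 slid over a five-neuron input with stride 1, producing $n = n' - s + 1 = 3$ output neurons:

```python
import numpy as np

def conv1d_relu(x, w, b=0.0):
    # "Valid" cross-correlation with stride 1: n = len(x) - len(w) + 1 outputs
    s = len(w)
    z = np.array([np.dot(x[i:i + s], w) + b for i in range(len(x) - s + 1)])
    return np.maximum(0.0, z)                 # ReLU activation

x = np.array([2.0, 1.0, 0.0, 1.0, 3.0])      # five input neurons (made up)
w = np.array([1.0, 0.0, -1.0])               # filter detecting a decreasing pattern
print(conv1d_relu(x, w))                     # -> [2. 0. 0.]
```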
In the case that a convolutional layer (with filters of size ) takes the output of some other convolutional layer (with the size ) as its input, then each filter will have weights, i.e. each filter is detecting some feature across the depth and along the non-depth dimension in the cross-correlation manner, and the size of its output will still be , where . As pointed out in the previous paragraph, convolutional layers dramatically reduce the number of parameters in the model. Consider an input of size 500, for a convolutional layer with filter size 7, and 32 different filters, there will be 7 * 32 = 224 weights and 32 biases or a total of 256 parameters for that layer. The feature map will contain 494 * 32 = 15808 neurons. Instead, for a dense layer to have the same size of feature map, there will be 500 * 15808 = 7904000 weights and 15808 biases or a total of 7919808 parameters for that layer. Even if ignoring the depth dimension, there will still be 500 * 494 + 494 = 247494 parameters to generate a feature map with 494 neurons. In addition, convolutional layers will also be able to capture local relation of the inputs due to their pattern recognition nature. Another common type of layer used in convolutional neural networks is the pooling layer, specifically the max-pooling layer. It basically downsamples the feature maps along their non-depth dimension by a factor of its pool size, in the manner of keeping the most activated neuron, i.e. neuron with the largest activation. It has two major functions. First, reduce the model size and complexity by acting as a downsampling method. Second, make model robust to small shifts and noise in the signal, because as long as a certain feature is detected within a small region (determined by pool size), the max-pooling layer will recognize it as “this local region has detected this feature”. Figure 2.8 shows an example of max-pooling layer with pool size of 3. Note that pooling layers do not have parameters to train, thus they are usually not counted in total number of layers. When we count the total number of layers in a neural network, we usually only consider the layers with trainable parameters. n*n f n f s n′ *n′ f s*n′ f n*n′ f n = n′ −s +1 18 A convolutional neural network will contain multiple convolutional layers, with lower level layers detecting simple features and higher level layers detecting more complex features. Lower level neurons are also activated by a small region of the input, while higher level neurons have much larger scopes. Typically the output size along non-depth dimension will be smaller as moving deeper due to the pooling layers and convolutional layers (especially when having a stride larger than 1), while depth increased instead since complex features have more variations. All convolutional layers act like the encoder, and the feature map of the last convolutional layer will be the new representation of the input. One or more dense layer will follow to act like the decoder and form a nonlinear predictor based on the representation or feature map encoded by the convolutional layers. 2.2.6 Recurrent Neural Network Recurrent neural network (RNN) is a class of neural network for sequence modeling. Unlike the neural networks previously mentioned in this chapter, RNN usually has no more than three layers. Each layer has multiple RNN units (or sometimes called “cell”) of some type, and all the units in the same layer 19 Figure 2.8 Example of max-pooling layer of size 3. 
2.2.6 Recurrent Neural Network

The recurrent neural network (RNN) is a class of neural networks for sequence modeling. Unlike the neural networks previously mentioned in this chapter, an RNN usually has no more than three layers. Each layer has multiple RNN units (sometimes called "cells") of some type, and all the units in the same layer share the same parameters. Each unit takes one input (which can be a vector) at a time, and computes the representation (called the hidden state) based on the input of the current time step and the hidden state of the previous time step. The word "deep" for an RNN generally implies a longer input sequence, while "deep" for a CNN, MLP or other feedforward-style neural network generally refers to a large number of layers.

Here we use $a$ to denote the hidden state, since it is the "activation map" of the RNN; a superscript with angle brackets denotes the time step:

$$a^{\langle t \rangle} = f\left( W_{aa} a^{\langle t-1 \rangle} + W_{ax} x^{\langle t \rangle} + b_a \right)$$

Note that $a^{\langle t \rangle}$ is a vector of size $s_a$ (usually high-dimensional, since it is the representation calculated by the RNN model). $x^{\langle t \rangle}$ represents the input at time $t$, which is a vector of size $s_x$ ($s_x = 1$ if $x^{\langle t \rangle}$ is a scalar). $W_{aa}$ and $W_{ax}$ are weight matrices of the RNN unit, of size $s_a \times s_a$ and $s_a \times s_x$ respectively. $b_a$ is the bias vector of the RNN unit, of size $s_a$. $f$ is the nonlinear activation function, which is usually tanh for RNN units.

It can be seen from the mathematical expression that the RNN actually shares some concepts with the Markov chain. In a Markov chain, the current state is determined only by its previous state through the transition matrix. In an RNN, the parameters are the analogy of the "transition matrix", both the current input and the previous hidden state form the "previous state", and the current hidden state is the "current state".

The output type of an RNN is very flexible. It can produce an output at each time step (based on the hidden state at the same time step); for instance, a model to detect "Hey Siri" outputs at each time step whether it has received the voice activation command. Another example is a model trained to predict the next object in the sequence based on the current input (a well-trained RNN can "compose" poems or music). An RNN can also produce one output for the whole input sequence, based on the hidden state of the last time step:

$$\hat{y}^{\langle t \rangle} = f\left( W_{ya} a^{\langle t \rangle} + b_y \right)$$

where $\hat{y}^{\langle t \rangle}$, a vector of size $s_y$, is the output at time $t$, and $W_{ya}$ and $b_y$ are parameters of size $s_y \times s_a$ and $s_y$ respectively.

The long short-term memory (LSTM) network (Hochreiter and Schmidhuber, 1997), which is composed of LSTM units, is one of the most widely used and successful RNNs. In addition to the hidden state, it introduces a memory cell, which allows early hidden states to affect the current state, and a set of gates, which control the usage of the memory cell. Figure 2.9 shows a diagram of the LSTM unit. We show all the related equations first and then explain them one by one:

$$\tilde{c}^{\langle t \rangle} = \tanh\left( W_{ca} a^{\langle t-1 \rangle} + W_{cx} x^{\langle t \rangle} + b_c \right)$$
$$\Gamma_i = \sigma\left( W_{ia} a^{\langle t-1 \rangle} + W_{ix} x^{\langle t \rangle} + b_i \right)$$
$$\Gamma_f = \sigma\left( W_{fa} a^{\langle t-1 \rangle} + W_{fx} x^{\langle t \rangle} + b_f \right)$$
$$c^{\langle t \rangle} = \Gamma_i \odot \tilde{c}^{\langle t \rangle} + \Gamma_f \odot c^{\langle t-1 \rangle}$$
$$\Gamma_o = \sigma\left( W_{oa} a^{\langle t-1 \rangle} + W_{ox} x^{\langle t \rangle} + b_o \right)$$
$$a^{\langle t \rangle} = \Gamma_o \odot c^{\langle t \rangle}$$

Figure 2.9 LSTM unit (cell). Figure reproduced from deeplearning.ai.

$\tilde{c}^{\langle t \rangle}$ is the candidate for updating the memory cell; it is a vector of the same size as the hidden state ($s_a$), and it is determined by $a^{\langle t-1 \rangle}$ and $x^{\langle t \rangle}$. Instead of directly updating the current hidden state as in the previous example, the LSTM unit updates the candidate first (to be more precise, the candidate of the candidate, since the memory cell is itself the candidate for updating the hidden state).

$\Gamma_i$, $\Gamma_f$ and $\Gamma_o$ are the update gate, forget gate and output gate respectively. All of them are vectors of size $s_a$, and each of their elements ranges from 0 to 1 (determined by a sigmoid activation function, denoted by $\sigma$), with 1 representing fully update/remember/output and 0 the opposite. Note that although $\Gamma_f$ is called the forget gate, the numerical values of its elements actually indicate how much to remember. All three gates are determined by $a^{\langle t-1 \rangle}$ and $x^{\langle t \rangle}$.

$c^{\langle t \rangle}$ is the memory cell of the current time step, which is the candidate for the current hidden state; it is defined as "how much to update from $\tilde{c}^{\langle t \rangle}$ and how much to remember from $c^{\langle t-1 \rangle}$". $\odot$ denotes element-wise multiplication. For instance, if all elements of $\Gamma_i$ are 1, then $\Gamma_i \odot \tilde{c}^{\langle t \rangle} = \tilde{c}^{\langle t \rangle}$; if $\Gamma_i$ is a vector of zeros, then $\Gamma_i \odot \tilde{c}^{\langle t \rangle} = 0$. Finally, the hidden state of the current time step is computed through element-wise multiplication of $\Gamma_o$ and $c^{\langle t \rangle}$.

In addition to "short-term memory", the LSTM allows early hidden states to affect the current one via the memory cells (long-term memory), and it properly addresses the vanishing gradient problem through the implementation of gates (refer to Hochreiter and Schmidhuber, 1997 for details).
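A minimal numpy sketch of a single LSTM step following the equations above (an illustration with made-up sizes $s_a$ and $s_x$, not the dissertation's implementation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(a_prev, c_prev, x, P):
    # One time step; P holds the weight matrices and bias vectors
    lin = lambda Wa, Wx, b: P[Wa] @ a_prev + P[Wx] @ x + P[b]
    c_tilde = np.tanh(lin("Wca", "Wcx", "bc"))   # candidate memory cell
    g_i = sigmoid(lin("Wia", "Wix", "bi"))       # update gate
    g_f = sigmoid(lin("Wfa", "Wfx", "bf"))       # forget gate
    g_o = sigmoid(lin("Woa", "Wox", "bo"))       # output gate
    c = g_i * c_tilde + g_f * c_prev             # element-wise products
    a = g_o * c                                  # new hidden state
    return a, c

rng = np.random.default_rng(0)
s_a, s_x = 4, 2                                  # made-up sizes
P = {k: rng.normal(size=(s_a, s_a)) for k in ("Wca", "Wia", "Wfa", "Woa")}
P.update({k: rng.normal(size=(s_a, s_x)) for k in ("Wcx", "Wix", "Wfx", "Wox")})
P.update({k: np.zeros(s_a) for k in ("bc", "bi", "bf", "bo")})
a, c = lstm_step(np.zeros(s_a), np.zeros(s_a), rng.normal(size=s_x), P)
```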
2.2.7 Support Vector Machine

2.2.7.1 Optimal Margin Classifier

The support vector machine (SVM) implements the idea of optimal margins, which separate the data with the largest gap. The margins are determined by "support vectors", from which the SVM derives its name. Before the deep learning era, many researchers regarded the SVM as the best supervised learning algorithm.

To introduce the SVM algorithm, let us first consider a simple case: a linearly separable dataset with two different classes $y \in \{-1, 1\}$. We write our linear classifier as

$$h_{w,b}(x) = g\left(w^T x + b\right)$$

where $x$ are the features and $w, b$ are the parameters (weights and bias). $g$ is the function such that $g(z) = 1$ if $z \ge 0$ and $g(z) = -1$ if $z < 0$. The decision boundary of this linear classifier is the straight line $w^T x + b = 0$, or equivalently $\frac{w^T}{\|w\|} x + \frac{b}{\|w\|} = 0$.

Consider a training set $S = \{ (x^{(i)}, y^{(i)}); i = 1, \ldots, m \}$. We define the distance of sample $i$ to the decision boundary to be $\gamma^{(i)}$. As shown in Figure 2.10, point $A$ represents a positive sample (i.e. $y^{(i)} = 1$) with input $x^{(i)}$, and the length of segment $AB$ gives $\gamma^{(i)}$. According to Cartesian analytic geometry, the vector $w$ is perpendicular to the decision boundary $w^T x + b = 0$. Since $A$ represents $x^{(i)}$, point $B$ is given by $x^{(i)} - \gamma^{(i)} \cdot w / \|w\|$. In addition, we know that point $B$ is on the decision boundary, hence

$$w^T \left( x^{(i)} - \gamma^{(i)} \cdot \frac{w}{\|w\|} \right) + b = 0$$

Solving for $\gamma^{(i)}$ yields

$$\gamma^{(i)} = \left( \frac{w}{\|w\|} \right)^T x^{(i)} + \frac{b}{\|w\|}$$

Figure 2.10 Illustration of distance $\gamma^{(i)}$. The geometric margin is defined as $\gamma = \min_{i=1,\ldots,m} \gamma^{(i)}$. Figure reproduced from Ng.

The above equation was derived for the case of a positive sample; it can be extended to include all positive and negative samples:

$$\gamma^{(i)} = y^{(i)} \left( \left( \frac{w}{\|w\|} \right)^T x^{(i)} + \frac{b}{\|w\|} \right)$$

The geometric margin is defined as the smallest distance over all samples:

$$\gamma = \min_{i=1,\ldots,m} \gamma^{(i)}$$

Intuitively, the optimal margin is the maximum geometric margin, since it gives us the most confidence in classifying a sample: every sample will be at least $\gamma$ away from the decision boundary. Thus, for a linearly separable training set, we can pose the following optimization problem to solve for the classifier:
$$\max_{w,b} \; \gamma \quad \text{s.t.} \quad y^{(i)} \left( \left( \frac{w}{\|w\|} \right)^T x^{(i)} + \frac{b}{\|w\|} \right) \ge \gamma, \quad i = 1, \ldots, m$$

Figure 2.11 Illustration of the optimal margin separating hyperplane (solid line). The dashed lines represent the support vectors. Figure reproduced from Ng.

To simplify it, we define $\hat{\gamma} = \|w\| \cdot \gamma$, hence

$$\max_{w,b} \; \frac{\hat{\gamma}}{\|w\|} \quad \text{s.t.} \quad y^{(i)} \left( w^T x^{(i)} + b \right) \ge \hat{\gamma}, \quad i = 1, \ldots, m$$

The objective function in the above optimization problem is non-convex, so there is no existing optimization solver for it. To handle this issue, we observe that scaling $w, b$ leaves the sign of $w^T x + b$ unchanged, so $\hat{\gamma}$ is actually a scaling constraint in this optimization problem. Since scaling constraints can be satisfied by rescaling $w, b$ later, we can set $\hat{\gamma}$ to 1, a constant. The objective function then becomes $\frac{1}{\|w\|}$. Moreover, maximizing $\frac{1}{\|w\|}$ is equivalent to minimizing $\frac{1}{2}\|w\|^2$, hence the optimization problem becomes

$$\min_{w,b} \; \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y^{(i)} \left( w^T x^{(i)} + b \right) \ge 1, \quad i = 1, \ldots, m$$

The above optimization problem can be solved by quadratic programming software, and its solution gives the optimal margin classifier.

2.2.7.2 Dual Problem

Another key feature that makes the SVM strong is the concept of the kernel, which enables SVM algorithms to have non-linear decision boundaries. Before moving on to the topic of kernels, we first briefly introduce the concept of duality.

In mathematical optimization theory, optimization problems can be viewed from two perspectives: the primal problem (a minimization problem) and the dual problem. The solution to the dual problem provides a lower bound to the solution of the primal problem. When the primal problem is convex, this lower bound can be reached under constraint qualification conditions, known as the Karush-Kuhn-Tucker (KKT) conditions. Our optimization problem for the optimal margin classifier is indeed a convex primal problem, so the solution of its dual problem gives the lower bound, and this bound can be achieved when the KKT conditions are met. The derivation from the primal problem to the dual problem is beyond the scope of this study, as is the verification that the KKT conditions hold for our problem. Without showing the details, the dual optimization problem has the following form:

$$\max_{\alpha} \; W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j \langle x^{(i)}, x^{(j)} \rangle$$
$$\text{s.t.} \quad \alpha_i \ge 0, \; i = 1, \ldots, m; \qquad \sum_{i=1}^{m} \alpha_i y^{(i)} = 0$$

where $\langle x^{(i)}, x^{(j)} \rangle$ is the inner product of $x^{(i)}$ and $x^{(j)}$, i.e. a less ambiguous notation for $x^{(i)T} x^{(j)}$. The above dual problem may look confusing, but we only need to focus on two conclusions from it. First, the solution of this dual problem provides the optimal $\alpha_i, i = 1, \ldots, m$. Second, the following key equation is obtained during the derivation of the dual problem:

$$w^T x + b = \sum_{i=1}^{m} \alpha_i y^{(i)} \langle x^{(i)}, x \rangle + b$$

This function shows how we can calculate the prediction based on $\alpha_i, i = 1, \ldots, m$, and that its value depends only on the inner products between the points $x^{(i)}$ in the training set and $x$, the sample to be predicted. Now we are ready to move on to the topic of kernels.

2.2.7.3 Kernel

We have shown in the previous section that the prediction value of the classifier depends only on $\langle x^{(i)}, x \rangle$. This conclusion still holds if we perform a feature mapping $\phi : \mathbb{R}^n \to \mathbb{R}^m$ on the features $x \in \mathbb{R}^n$; then $\langle \phi(x^{(i)}), \phi(x) \rangle$ can also be used to calculate the prediction value.
The kernel is defined as

$$K\left(x^{(i)}, x\right) = \langle \phi(x^{(i)}), \phi(x) \rangle$$

The kernel is a very strong tool in the sense that a training data set that is not linearly separable in the original feature space might be linearly separable in some other space (usually one with higher dimensions), and if the decision boundary is projected back to the original feature space, it becomes non-linear. In other words, the kernel enables the optimal margin classifier to provide a non-linear decision boundary. The most commonly used kernel is the radial basis function (RBF) kernel, also known as the Gaussian kernel:

$$K\left(x^{(i)}, x\right) = \exp\left( -\frac{\|x^{(i)} - x\|^2}{2\sigma^2} \right)$$

The other kernel that will be used in this study is the polynomial kernel:

$$K\left(x^{(i)}, x\right) = \left( \langle x^{(i)}, x \rangle + c \right)^d$$

where $c$ is a constant and $d$ is the degree of the polynomial kernel.

2.2.7.4 Non-separable Case and Regularization

SVM algorithms also work for a training data set that is not linearly separable in either the original feature space or the kernel space (the feature space after the mapping $\phi$). The primal and dual optimization problems can be reformulated as

$$\min_{w,b} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i$$
$$\text{s.t.} \quad y^{(i)} \left( w^T x^{(i)} + b \right) \ge 1 - \xi_i, \; i = 1, \ldots, m; \qquad \xi_i \ge 0, \; i = 1, \ldots, m$$

and

$$\max_{\alpha} \; W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j \langle x^{(i)}, x^{(j)} \rangle$$
$$\text{s.t.} \quad 0 \le \alpha_i \le C, \; i = 1, \ldots, m; \qquad \sum_{i=1}^{m} \alpha_i y^{(i)} = 0$$

The above problem can be solved by the sequential minimal optimization (SMO) algorithm, which is beyond the scope of this work. The reformulated problem allows samples to have margins less than 1. For a sample with margin $1 - \xi_i$, an additional cost of $C\xi_i$ is added to the objective function. The parameter $C$ adjusts the relative weighting of these samples.
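A minimal sketch of a soft-margin SVM with the RBF kernel using scikit-learn (illustrative, not the dissertation's code). Note that scikit-learn parameterizes the Gaussian kernel as exp(-gamma * ||x - x'||^2), so gamma corresponds to $1/(2\sigma^2)$ in the notation above, and C is the slack-penalty parameter:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Made-up data that is not linearly separable: class 1 inside a ring, -1 outside
X = rng.normal(size=(200, 2))
y = np.where(np.linalg.norm(X, axis=1) < 1.0, 1, -1)

clf = SVC(kernel="rbf", C=1.0, gamma=0.5)     # RBF-kernel optimal margin classifier
clf.fit(X, y)
print(clf.predict([[0.0, 0.0], [3.0, 3.0]]))  # expected: [ 1 -1]
```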
2.2.8 Decision Tree

The decision tree is a machine learning model that implements algorithms consisting only of conditional control statements. The word "tree" in its name indicates its tree-like structure. Once an input sample is fed into the model, it flows through various conditional statements, and its path is determined by whether the sample meets each condition or not. Figure 2.12 shows an example of a decision tree.

Figure 2.12 A made-up example of a decision tree.

The initial node, at the very top of the tree, is called the root node, or just the "root". The root node has no parent, i.e. no arrow pointing to it. The nodes in the middle are called internal nodes or just "nodes"; they have both a parent and children, i.e. both arrows pointing to them and arrows pointing away from them. The root node and the internal nodes hold the conditional statements, which further split the tree. A conditional statement usually asks a question about a certain feature of the input. An example of a question for a continuous feature $x_i$ looks like "Is feature $x_i$ larger than or equal to $a$?". An example of a question for a categorical feature looks like "What is the value of feature $x_i$?" or "Is feature $x_i$ (assumed to be a color feature) red?". The nodes with no children, i.e. with arrows pointing to them but no arrows pointing away from them, are called leaf nodes or just "leaves". Leaf nodes hold the final decision for a given input. In the case of classification tasks, each leaf node holds one class/category. To make a prediction, an input sample is passed through the tree until it finally reaches a leaf; the class held at that leaf is the prediction for the sample.

When training a decision tree with a training set $S = \{ (x^{(i)}, y^{(i)}); y_i \in \{1, \ldots, k\}; i = 1, \ldots, m \}$, we grow the tree from its root. We first start with a tree of depth (the length of the longest path from the root node to a leaf node) 1. We set up a conditional statement at the root, which splits all the training samples into multiple (usually two) leaf nodes after passing them through the tree. In general, each leaf node will contain samples of many possible outcomes $y_i \in \{1, \ldots, k\}$; in other words, the leaf nodes are impure. Impurity offers a great clue for measuring the effectiveness of the split, as we can compare how much the impurity has decreased after passing through the root node or an internal node. There are multiple metrics for impurity; the two most commonly used are the Gini impurity and entropy (a short sketch at the end of this section illustrates these computations). The Gini impurity is defined as

$$G = \sum_{c=1}^{k} p_c \left( 1 - p_c \right) = 1 - \sum_{c=1}^{k} p_c^2$$

where $p_c$ is the probability for a sample in a certain node to have outcome $c$, i.e. $y_i = c$. It can be seen from the expression that the Gini impurity is smaller when a certain outcome dominates a node, reaching its minimum value 0 when there is only one possible outcome in the node. When all possible outcomes are equally likely in a node, the Gini impurity of the node reaches its maximum value $\frac{k-1}{k}$. To evaluate the effectiveness of a certain split at a node, we define the Gini gain as the Gini impurity of the node minus the sum of the Gini impurities of all its children nodes (weighted by the probability of a training sample flowing into each child node):

$$GG = G_{parent} - \sum_{j=1}^{n} p_j G_j$$

where $G_{parent}$ is the Gini impurity of the node, $n$ is the number of children nodes it has, $p_j$ is the probability of a training sample flowing into child node $j$, and $G_j$ is the Gini impurity of child node $j$. Another commonly adopted metric, entropy, is defined as

$$H = -\sum_{c=1}^{k} p_c \log_2 p_c$$

Similar to the Gini gain, the "entropy gain", which is called the information gain, is defined as

$$IG = H_{parent} - \sum_{j=1}^{n} p_j H_j$$

When selecting the conditional statement for the split, we choose the feature that gives the maximum impurity gain (e.g. Gini gain or information gain). For a continuous feature, we also need to choose the split threshold. For a categorical feature, if the question is binary (as in the color example above), a certain category also needs to be chosen. We grow the decision tree by converting the leaf nodes of the previous step into internal nodes and repeating the above process, until a stopping criterion is met. Common stopping criteria include the maximum depth of the tree, the minimum number of training samples in a leaf node, the minimum information gain at each split, etc.

The decision tree is a powerful non-linear machine learning algorithm with strong interpretation power (due to its conditional statements). However, a single decision tree can easily be overfitted to the training set, so researchers and engineers usually ensemble a large group of trees together to resolve the overfitting issue. Ensemble learning methods are discussed in the next section.
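A minimal sketch (not from this work) of the Gini impurity and Gini gain defined above, evaluated for a made-up binary split:

```python
import numpy as np

def gini(labels):
    # Gini impurity: 1 minus the sum of squared class probabilities
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_gain(parent, children):
    # Parent impurity minus the weighted impurities of the children
    m = len(parent)
    return gini(parent) - sum(len(c) / m * gini(c) for c in children)

y = np.array([0, 0, 0, 1, 1, 1])             # made-up labels at a node
left, right = y[:4], y[4:]                   # one candidate split
print(gini(y), gini_gain(y, [left, right]))  # 0.5, 0.25
```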
2.2.9 Ensemble Learning

Ensemble learning is the technique of combining multiple machine learning models to achieve better performance. This work involves three types of ensemble learning methods: bootstrap aggregating (bagging), boosting and stacking.

2.2.9.1 Bootstrap Aggregating

Bootstrap aggregating (bagging) is an ensemble learning method that focuses on reducing variance (high variance indicates overfitting). It ensembles base machine learning models (base models, also known as base learners, are usually homogeneous, i.e. of the same type but trained in different ways) to improve performance. The random forest algorithm introduced in Section 2.2.10 belongs to this family. As its name indicates, the method consists of two parts: bootstrapping and aggregating.

Bootstrapping is a random sampling with replacement process. Given a training set $S = \{ (x^{(i)}, y^{(i)}); i = 1, \ldots, m \}$, for each base model we sample a new data set $S_i$ of size $m$ with replacement. When $m$ is large, $S_i$ is expected to contain $1 - \frac{1}{e} \approx 63.2\%$ of the unique samples in $S$ (see the short simulation sketch at the end of this section).

Aggregating combines the predictions of the base models by majority voting (for a categorical outcome) or by averaging (for a continuous outcome).

2.2.9.2 Boosting

Boosting is the idea of building an ensemble incrementally by training new models in a modified way, e.g. with emphasis on hard training samples, or targeting the residual (the difference between the true outcome and the predicted outcome), etc. This method primarily focuses on reducing bias (improving performance on the training data set), and with proper implementation it also reduces variance. The adaptive boosting, gradient boosting and extreme gradient boosting algorithms introduced in Sections 2.2.12 to 2.2.14 belong to this family.

2.2.9.3 Stacking

Stacking involves training a machine learning model to combine the predictions of other models (sometimes it also takes additional features as inputs) to improve performance. Stacking is a state-of-the-art technique that most winners of Kaggle challenges (online data science challenges) have adopted in their models. The model trained during the stacking step is known as the meta model or meta learner, since it is a model trained on meta data (the predictions, or outputs, of the other models).
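The expected fraction of unique samples after bootstrapping can be checked with a short simulation (a sketch with a made-up training-set size, not from this work):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 100_000                                  # made-up training-set size

boot = rng.integers(0, m, size=m)            # sample m indices with replacement
print(np.unique(boot).size / m)              # ≈ 1 - 1/e ≈ 0.632
```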
2.2.11 Extremely Randomized Trees

Extremely randomized trees (ExtraTrees) (Geurts et al., 2006) goes one step further in randomness compared to random forest. It is very similar to random forest, but with two main differences. First, each decision tree is trained on all samples in the training data set (no bootstrapping is applied). Second, unlike the typical training process of a decision tree, ExtraTrees does not select the optimal split threshold for each feature when growing the tree. Recall from Section 2.2.8 that a decision tree grows in the following manner:

1. For each feature, select the optimal threshold which gives the largest impurity gain.
2. Select the feature that gives the largest impurity gain.
3. Grow the tree by splitting the samples on the selected feature at its optimal threshold.

The ExtraTrees algorithm substitutes the optimal threshold in step 1 with a random threshold. It further reduces the variance of the model by introducing extra randomness.

2.2.12 Adaptive Boosting

Adaptive boosting (AdaBoost) (Freund and Schapire, 1997) is a boosting algorithm that ensembles base models incrementally, with each new member focusing on the mistakes made by the previous models. Conceptually the base model can be any model, but in real applications it is usually a stump. A stump is a decision tree with depth 1, i.e. a tree with only the root node and two leaf nodes.

Sample weights play an important role in AdaBoost. For a balanced training data set $S = \{(x^{(i)}, y^{(i)});\; y^{(i)} \in \{1,\dots,k\};\; i = 1,\dots,m\}$, each sample $(x^{(i)}, y^{(i)})$ initially has the same weight $\frac{1}{m}$. If the data set is unbalanced (the outcome is categorical and not all categories contain the same number of samples), the samples of the rare categories are usually assigned larger weights so that the total sample weight of each category stays balanced.

When training a stump, the total error, instead of the Gini gain or information gain, is used as the criterion for feature and threshold selection. Let the sample weight for sample $i$ be $w_i$ and the prediction of the stump be $f(x^{(i)})$; the total error is then

$$\epsilon = \frac{\sum_{y^{(i)} \neq f(x^{(i)})} w_i}{\sum_{i=1}^{m} w_i}$$

Unlike random forest, where each base model has the same voting power (model weight), in AdaBoost each stump $f_j$ has its own model weight $\alpha_j$, which is determined by its total error:

$$\alpha_j = \frac{1}{2} \ln\left(\frac{1-\epsilon_j}{\epsilon_j}\right)$$

Since the sample weights in each stump are determined by the previously built stumps, we use the notation $w_i^{(j)}$ to represent the sample weight of sample $i$ in stump $j$. The sample weights for the following stump are updated by

$$w_i^{(j+1)} = w_i^{(j)} e^{-y^{(i)} \alpha_j f_j(x^{(i)})}$$

and then $w_i^{(j+1)}$ is normalized so that the total sample weights sum up to 1:

$$w_i^{(j+1)} \leftarrow \frac{w_i^{(j+1)}}{\sum_{i=1}^{m} w_i^{(j+1)}}$$

The above steps are repeated until we have $n$ (pre-determined; a hyper-parameter, i.e. a non-trainable parameter) stumps. For a categorical outcome, the prediction of the ensemble for a sample $x$ is the category with the largest total model weight:

$$\text{prediction} \triangleq F(x) = \operatorname{argmax}_c \sum_{f_j(x)=c} \alpha_j$$

For a continuous outcome, it is the weighted average of the outputs of all stumps:

$$F(x) = \frac{\sum_{j=1}^{n} \alpha_j f_j(x)}{\sum_{j=1}^{n} \alpha_j}$$
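One AdaBoost round can be sketched as follows, assuming binary labels coded as ±1 (which is what the exponential weight update above implies); this is an illustration rather than the exact implementation used in this work.

```python
import numpy as np

def adaboost_round(w, y_true, y_pred):
    """One boosting round: total error, model weight, and updated sample weights.
    y_true and y_pred are arrays of +/-1 labels; w are the current sample weights."""
    eps = np.sum(w[y_true != y_pred]) / np.sum(w)  # weighted total error
    alpha = 0.5 * np.log((1 - eps) / eps)          # model weight of this stump
    w_new = w * np.exp(-alpha * y_true * y_pred)   # emphasize misclassified samples
    return alpha, w_new / w_new.sum()              # normalize weights to sum to 1
```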
2.2.13 Gradient Boosting

Gradient boosting (Friedman, 2001; Friedman, 2002) is another commonly used boosting algorithm that also ensembles base models incrementally. The key difference between gradient boosting and AdaBoost is that the former trains each new base model on the residuals (the differences between the true outcomes and the predictions) of the previously built models, while the latter trains each new base model with a focus on the mistakes made by the previously built models. (More accurately, gradient boosting trains the new base model on pseudo-residuals, which are discussed later in this section.)

The gradient boosting algorithm is slightly different for classification (when the outcome is categorical) and regression (when the outcome is continuous). Since only the classification version of the algorithm was applied in this work, this section mainly focuses on the classification version. In order to apply gradient boosting, a differentiable loss function $L(y, F(x))$ is needed. Like the cost function introduced in Section 2.2.4, the loss function is a metric that describes how close the prediction is to the true outcome. The slight difference is that the cost function conventionally refers to multiple samples or the whole training data set, while the loss function refers to a single sample (nowadays, however, more and more researchers discard the distinction between the two terms and use "loss function" for both cases).

The gradient boosting algorithm can be described as follows. Assume the training data set is $S = \{(x^{(i)}, y^{(i)});\; i = 1,\dots,m\}$ and $n$ base models need to be trained.

Step 1: initialize the model with a constant value,

$$F_0(x) = \operatorname{argmin}_{\gamma} \sum_{i=1}^{m} L(y^{(i)}, \gamma)$$

Step 2: for $j = 1$ to $n$, do the following.

Step 2.1: compute the pseudo-residuals

$$r_i^{(j)} = -\left[\frac{\partial L(y, F(x))}{\partial F(x)}\right]_{y=y^{(i)},\, F(x)=F_{j-1}(x^{(i)})} \quad \text{for } i = 1,\dots,m$$

Step 2.2: train a base model $f_j$ with the training set $\{(x^{(i)}, r_i^{(j)});\; i = 1,\dots,m\}$.

Step 2.3: compute the multiplier $\gamma_j$ by

$$\gamma_j = \operatorname{argmin}_{\gamma} \sum_{i=1}^{m} L\left(y^{(i)}, F_{j-1}(x^{(i)}) + \gamma f_j(x^{(i)})\right)$$

Step 2.4: update the model by

$$F_j(x) = F_{j-1}(x) + \gamma_j f_j(x)$$

Step 3: the final gradient boosting model is $F(x) = F_n(x)$.
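The loop below is a minimal sketch of these steps for the regression case with a squared-error loss, where the pseudo-residuals reduce to $y - F(x)$; for simplicity the line-search multiplier $\gamma_j$ is replaced by a fixed learning rate, a common practical shortcut rather than the exact procedure above.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, n_rounds=50, lr=0.1):
    F = np.full(len(y), y.mean())    # Step 1: initialize with a constant
    trees = []
    for _ in range(n_rounds):
        r = y - F                    # Step 2.1: pseudo-residuals (squared error)
        tree = DecisionTreeRegressor(max_depth=3).fit(X, r)   # Step 2.2
        F = F + lr * tree.predict(X)                          # Steps 2.3-2.4
        trees.append(tree)
    return y.mean(), trees           # F(x) = constant + lr * sum_j f_j(x)
```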
2.2.14 Extreme Gradient Boosting

Extreme gradient boosting (XGBoost) (Chen and Guestrin, 2016) is a gradient boosting tree method (gradient boosting with decision trees as base models) with a well-designed regularization term controlling the model complexity (to prevent overfitting). XGBoost is one of the state-of-the-art machine learning models available in the toolkit. A detailed derivation of this algorithm is beyond the scope of this study. We focus on discussing its objective function (a more general, alternative term for both loss function and cost function, used by the original author of the algorithm), because (1) it determines how the base models are trained, and (2) the overall steps of XGBoost are very similar to those of gradient boosting discussed in Section 2.2.13, as XGBoost belongs to the gradient boosting family. In order to be consistent throughout this chapter, the notation here differs from the original XGBoost paper.

The objective function of XGBoost consists of both the loss function part from gradient boosting and a regularization part $\sum_{j=1}^{n} \Omega(f_j)$. When training the $j$th decision tree, the objective function is

$$obj_j = \sum_{i=1}^{m} L\left(y^{(i)}, F_j(x^{(i)})\right) + \sum_{d=1}^{j} \Omega(f_d) = \sum_{i=1}^{m} L\left(y^{(i)}, F_{j-1}(x^{(i)}) + f_j(x^{(i)})\right) + \Omega(f_j) + \text{constant}$$

Note that at this step, the objective function is a function of the parameters in model $f_j$ only. For some specific loss functions, e.g. the mean squared error (MSE) for a continuous outcome, the objective function has a friendly form. In general, however, the expression is quite complex to handle, so researchers usually replace the loss function with its Taylor expansion up to the second order:

$$obj_j \approx \sum_{i=1}^{m} \left[ L\left(y^{(i)}, F_{j-1}(x^{(i)})\right) + g_i f_j(x^{(i)}) + \frac{1}{2} h_i f_j^2(x^{(i)}) \right] + \Omega(f_j) + \text{constant}$$

where $g_i$ and $h_i$ denote the first and second derivatives of the loss evaluated at the current prediction,

$$g_i \triangleq \left.\frac{\partial L(y, F(x))}{\partial F(x)}\right|_{y=y^{(i)},\, F(x)=F_{j-1}(x^{(i)})}, \qquad h_i \triangleq \left.\frac{\partial^2 L(y, F(x))}{\partial F(x)^2}\right|_{y=y^{(i)},\, F(x)=F_{j-1}(x^{(i)})}$$

Since $L(y^{(i)}, F_{j-1}(x^{(i)}))$ does not depend on $f_j$, it can be absorbed into the constant:

$$obj_j \approx \sum_{i=1}^{m} \left[ g_i f_j(x^{(i)}) + \frac{1}{2} h_i f_j^2(x^{(i)}) \right] + \Omega(f_j) + \text{constant}$$

The regularization term, which penalizes the complexity of the model to prevent overfitting, has the form

$$\Omega(f_j) = \gamma T + \frac{1}{2} \lambda \sum_{l=1}^{T} w_l^{(j)2}$$

where $T$ is the total number of leaf nodes in tree $f_j$, $w_l^{(j)}$ is the weight of leaf node $l$ (XGBoost assigns a different weight to each leaf node), and $\gamma$ and $\lambda$ are hyper-parameters that control the weight of each part.

In addition, note that an input $x^{(i)}$ passed through a decision tree finally reaches a certain leaf node, i.e. $x^{(i)} \rightarrow q(x^{(i)})$. Thus we can rewrite $f_j$ as $f_j(x) = w_{q(x)}^{(j)}$. Finally, after discarding the constants (as they have no contribution to the training process), the objective function for the $j$th tree takes the following forms (a sum with respect to samples and an equivalent sum with respect to leaf nodes):

$$obj_j \approx \sum_{i=1}^{m} \left[ g_i w_{q(x^{(i)})}^{(j)} + \frac{1}{2} h_i w_{q(x^{(i)})}^{(j)2} \right] + \gamma T + \frac{1}{2} \lambda \sum_{l=1}^{T} w_l^{(j)2} = \sum_{l=1}^{T} \left[ \left(\sum_{q(x^{(i)})=l} g_i\right) w_l^{(j)} + \frac{1}{2} \left(\sum_{q(x^{(i)})=l} h_i + \lambda\right) w_l^{(j)2} \right] + \gamma T$$

2.3 Model Interpretation

2.3.1 Gradient-weighted Class Activation Mapping

Gradient-weighted Class Activation Mapping (Grad-CAM) (Selvaraju et al., 2017) is an approach that generates a heat map indicating the parts of the original input that contribute to activating a certain neuron (or neurons). It was originally used in image classification tasks to localize the object that activates a certain class, but it generalizes to regression tasks on 1D signals as well. Grad-CAM takes advantage of the CNN's nature of preserving localization information (i.e. each neuron is affected by a local region of the input, which we refer to as its "scope" in Section 4.1.2.3), and computes the gradients of the target neuron with respect to the feature maps of the last convolutional layer to determine which part of the input positively or negatively contributes to the prediction result. One limitation of Grad-CAM is that it does not work for LSTM networks, since RNN models do not preserve localization information. Thus this analysis was applied only to the CNN model with PPGa signal input. The open source Python library keras-vis (Kotikalapudi and contributors, 2017) was used to accomplish it (the same library was used for feature visualization).
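The core Grad-CAM computation can be sketched with tf.GradientTape as below; keras-vis was the library actually used in this work, and this is only an equivalent minimal illustration for a Keras functional model, where `last_conv_name` is the assumed name of the last convolutional layer.

```python
import numpy as np
import tensorflow as tf

def grad_cam_1d(model, x, last_conv_name):
    """Weight the last conv layer's feature maps by the gradient of the
    regression output with respect to them, then sum over channels."""
    conv_layer = model.get_layer(last_conv_name)
    grad_model = tf.keras.Model(model.inputs, [conv_layer.output, model.output])
    with tf.GradientTape() as tape:
        conv_out, pred = grad_model(x[np.newaxis, ...])
        target = pred[:, 0]                      # the regression output neuron
    grads = tape.gradient(target, conv_out)      # d(target) / d(feature maps)
    weights = tf.reduce_mean(grads, axis=1)      # average over the time axis
    cam = tf.reduce_sum(weights[:, tf.newaxis, :] * conv_out, axis=-1)
    return cam.numpy().squeeze()                 # one importance value per time step
```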
2.3.2 Feature Visualization

Feature visualization is another major thread of research in the interpretation of neural networks. The idea is to generate the input that maximally activates a certain neuron, filter or layer. The general approach is to optimize the input based on the gradients of the target neuron/filter/layer with respect to the input. We randomly initialize the input and optimize it from white noise. We have not yet implemented regularization terms (frequency penalization, transformation robustness, learned priors, etc.) (Olah et al., 2017), but will try different regularization methods in the future to improve the results (i.e. to generate more realistic input signals).

2.3.3 Feature Importance

Feature importance is a metric for the impact of a certain feature on the outcome in a decision tree or an ensemble of decision trees. Its applications include feature selection, model explanation, etc. There are two widely used types of feature importance: impurity-based feature importance and permutation feature importance. Note that this section uses a different notation from the previous part of this chapter.

2.3.3.1 Impurity Feature Importance

In a decision tree, each non-leaf node splits the tree based on a certain feature, and the impurity gain reflects how well this feature helps to determine the output. Since the probability for a sample to pass through each node is different, we define the node importance as a weighted impurity gain:

$$ni_t = p_t I_t - \sum_{k \in \text{children}(t)} p_k I_k$$

where $ni_t$ is the node importance of node $t$, $p_t$ is the probability for a sample to pass through node $t$, $I_t$ is the impurity of node $t$, and $\text{children}(t)$ is the set of indices of the nodes pointed to from node $t$. The feature importance $fi_s^{(j)}$ for feature $s$ in decision tree $j$ can be calculated as

$$fi_s^{(j)} = \frac{\sum_{k \in T_s} ni_k}{\sum_{t \in T} ni_t}$$

where $T_s$ is the set of indices of nodes that split on feature $s$, and $T$ is the set of indices of all nodes that split on some feature. We also normalize it for each tree:

$$normfi_s^{(j)} = \frac{fi_s^{(j)}}{\sum_{s \in \text{all features}} fi_s^{(j)}}$$

The feature importance $fi_s$ for feature $s$ in the whole ensemble model is then

$$fi_s = \frac{\sum_{j=1}^{n} normfi_s^{(j)}}{n}$$

where $n$ is the total number of trees in the ensemble model. The major cons of impurity feature importance are that (1) it tends to inflate the importance of numerical features over categorical features, and (2) the importance can be high for features that are not predictive of the outcome but are used by the model to overfit the training data set. Permutation feature importance, discussed in the next section, can mitigate these limitations, but it has its own drawbacks.

2.3.3.2 Permutation Feature Importance

Permutation feature importance (Altmann et al., 2010) is calculated with the following steps.

Step 1: train the model with the original training data set and measure its performance with a certain metric.

Step 2: for each feature, permute its values across all training samples, train the model with the newly obtained training data set (with that one feature permuted), and measure its performance with the same metric again.

Step 3: for each feature, the difference in performance between the two models (the original minus the one with that feature permuted) is the importance of that feature.

The major cons of permutation feature importance are that (1) it can be misleading when input features are correlated, and (2) it is computationally expensive.
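A minimal sketch of the retraining variant described above (assuming a scikit-learn-style estimator and a metric with signature metric(y_true, y_pred)):

```python
import numpy as np
from sklearn.base import clone

def permutation_importance_retrain(model, X, y, metric, rng):
    """Permute one feature at a time, retrain, and record the performance drop."""
    base = metric(y, clone(model).fit(X, y).predict(X))
    drops = []
    for j in range(X.shape[1]):
        X_perm = X.copy()
        X_perm[:, j] = rng.permutation(X_perm[:, j])  # destroy feature j
        drops.append(base - metric(y, clone(model).fit(X_perm, y).predict(X_perm)))
    return np.array(drops)   # larger drop = more important feature
```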
3 Data

3.1 Children's Hospital Los Angeles Dataset

3.1.1 Dataset and Studies

The Children's Hospital Los Angeles (CHLA) dataset contains 163 sets of physiological time series data from three studies, each applying a different type of external stimulus to the subjects. We included recordings from both sickle cell disease (SCD) patients and controls for two reasons. First, a biomarker that reflects a subject's tendency of sustained peripheral vasoconstriction is meaningful for both SCD patients and normal controls. Second, a larger dataset is necessary to train a deep model.

The three studies are the autonomic nervous system (ANS) study (76 samples), the thermal pain study (54 samples) and the cold pain study (33 samples). In the ANS study (Chalacheva et al., 2019), the subjects were instructed to rest quietly after the monitoring devices were connected. When the signals became stable, a number of ANS stimuli were given to the subjects. The protocols were baseline (approximately 5 minutes), intermittent hypoxia I (five breaths of nitrogen), three voluntary sighs, intermittent hypoxia II, hand grip, flow-mediated dilation and head-up tilt. In the thermal pain study (Khaleel et al., 2017; Khaleel et al., 2015), participants were instructed to rest quietly for 15 minutes after the setup of the devices. Once the signals had stabilized, a 5-minute baseline was recorded, followed by five thermal stimuli of different patterns. In the cold pain study (Veluswamy et al., 2018; Veluswamy et al., 2017), the setup was similar to the thermal pain study. The protocol was replaced by a baseline followed by four shuffled temperature stimuli (heat detection, cold detection, heat pain and cold pain), all graded individually based on the subjects' feedback.

3.1.2 Data Preparation

3.1.2.1 Photoplethysmogram

Photoplethysmogram (PPG) is an optical technique used to detect regional blood volume changes in the microvasculature below the skin. PPG signals are often obtained through pulse oximeters, which illuminate the skin and record the difference in light absorption (Shelley, 2001). The changes in red blood cell concentration due to the cardiac cycle lead to different levels of attenuation of either the reflected or the transmitted light, which reflects the volumetric change of blood in the periphery, and the changes in the PPG amplitude (PPGa) of the pulses indicate the peripheral vasoconstriction level, as shown in Figure 3.1.

Figure 3.1 Photoplethysmogram and photoplethysmogram amplitude. The figure shows an example of a transmitting pulse oximeter. Decreased PPGa indicates peripheral vasoconstriction while increased PPGa represents vasodilation.

3.1.2.2 Preparation

The electrocardiogram (ECG) recordings in the CHLA dataset had been processed to derive the inter-beat (R-peak to R-peak) interval (RRI). The same time stamps of the R-peaks were used to determine the peaks of the photoplethysmogram (PPG) signals, and to find the local minimum between each pair of peaks. The PPG amplitude (PPGa) was defined as the amplitude from each minimum to its following peak. The sampling frequency of the PPGa signals is 30 Hz.

For each recording, we took the part from 20 minutes pre-baseline to 60 minutes after the beginning of baseline (80 minutes, or 144,000 samples, in total). Shorter recordings (either at the beginning or at the end) were padded with zeros. Since PPGa is a relative measurement, each 80-minute cropped PPGa signal was normalized according to its own 95th percentile value. The 20-minute pre-baseline portion was included to find a more accurate reference value for normalization, because some subjects were nervous and began vasoconstriction even before the baseline.
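A minimal numpy sketch of this preparation step (an illustration; the exact padding and indexing conventions used here are assumptions):

```python
import numpy as np

FS = 30                       # PPGa sampling frequency (Hz)
CROP = 80 * 60 * FS           # 80 minutes = 144,000 samples

def prepare_ppga(ppga, baseline_idx):
    """Crop 20 min pre-baseline to 60 min post-baseline, zero-pad short
    recordings, and normalize by the 95th percentile of the cropped signal."""
    start = baseline_idx - 20 * 60 * FS
    seg = ppga[max(start, 0): start + CROP]
    pad_left = max(-start, 0)
    seg = np.pad(seg, (pad_left, CROP - len(seg) - pad_left))
    return seg / np.percentile(seg, 95)   # PPGa is a relative measurement
```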
3.1.3 Labels

As mentioned in Section 1.1, seven researchers were asked to label all PPGa signals in the CHLA dataset according to the overall tendency of sustained vasoconstriction, as a subjective gold standard to evaluate our machine learning models. The researchers tended to agree on the overall tendency of sustained vasoconstriction; however, it was extremely hard to obtain agreement on the quantified features or patterns that led to the judgement.

The labels contained five different levels for the tendency of sustained vasoconstriction, from 0 to 4, with 0 denoting least severe and 4 denoting most severe. All 163 signals were shuffled and repeated three times for consistency, i.e. each researcher labelled each signal three times. Weights of 0.2, 0.4 and 0.4 were assigned to the three labels (for each signal) respectively, and their weighted average was used to represent the label for a given signal from a given researcher. Finally, the labels from the seven researchers were averaged to obtain the final label for each signal. A MATLAB GUI tool was developed to collect all the labels. Figure 3.2 shows some examples of normalized PPGa signals and their labels.

Figure 3.2 Examples of normalized PPGa signals and their labels. (A) & (B) are from the ANS study, (C) & (D) are from the thermal pain study, (E) & (F) are from the cold pain study. (A), (C) & (E) are examples with strong tendency of sustained vasoconstriction. (B), (D) & (F) are examples with weak tendency of sustained vasoconstriction.

3.2 Sleep and Asthma Cohort Study Dataset

3.2.1 Dataset and Studies

The Sleep and Asthma Cohort Study (SAC), conducted by Vanderbilt University School of Medicine (Willen et al., 2017), is an observational study of children and adolescents with SCD. It contains two PSG studies with recordings that include PPG signals at 100 Hz, involving three study sites in the United States and England (Washington University School of Medicine in St. Louis, Missouri; Case Western Reserve University in Cleveland, Ohio; and University College London in London, UK). 273 subjects participated in the study. 252 of them had SCD (240 defined as homozygous for sickle cell hemoglobin [HbSS], and 12 defined as compound heterozygous for sickle β zero thalassemia [HbSβ0]) and were enrolled in the first PSG study. 212 subjects were included in this work after discarding subjects who received blood transfusions during the study or whose ECG/PPG recordings were bad (including invalid PPG, ectopic beats throughout the recording and invalid R-peaks).

The first PSG study took place in the middle of the SAC study, while the second one was conducted near its end. This work focuses only on the first PSG study, as the second one lacks followup recordings of the subjects. The first PSG study is therefore abbreviated as the PSG study for the rest of this work.

The recordings of the PSG study contain two parts: raw signals and annotations. There are 28 channels in the raw signals, including ECG, PPG, airflow, SpO2, and EMG and EEG signals from various locations. The signals were saved in a proprietary file format, MSR, which can be accessed through the Rembrandt software. There are around 50 different types of annotations, including arousal events, respiratory events, R-peak time stamps, sleep stages, limb movements, etc. All the annotations were saved in a proprietary format that can only be accessed via Rembrandt as well.

The SAC study also recorded hospitalization events of the subjects due to severe pain induced by VOC, including admission dates and discharge dates. These records were used to derive the pain category, which is the dependent variable (to be predicted by other factors) in this work. The details are discussed in Section 3.2.3.
3.2.2 Data Preparation

The recordings were first extracted from the Rembrandt software in non-proprietary formats so that customized analyses could be conducted. The raw signals were exported to the EDF format (European Data Format, an open and standard file format designed for medical time series data, commonly used in sleep studies) (Kemp et al., 1992) with a built-in tool in the Rembrandt software, while the annotations were extracted to text files (TXT) by running a self-written AutoIt script. MATLAB programs were then written to read and preprocess the data. The details of the data preprocessing are discussed in Section 4.2.2.

Figure 3.3 Overview of the SAC dataset. Each vertical solid black line represents the followup history of a subject. The vertical axis is real-world time (calendar year). Blue circles represent hospitalization events of pain episodes. Red dots represent the PSG study (sleep study).

Figure 3.4 Example of a Rembrandt software screenshot for one subject's PSG study (overview).

Figure 3.5 Example of a Rembrandt software screenshot for one subject's PSG study (channels).

Figure 3.6 Example of a Rembrandt software screenshot for one subject's PSG study (annotations).

3.2.3 Pain Categories

3.2.3.1 Post-PSG Pain Category

Pain rate is defined as the average number of pain episodes (that require hospitalization) per year. Future pain rate, or post-PSG pain rate, is defined as the pain rate from the PSG study to the end of the SAC study. In recently published research on the SAC study from Dr. Michael DeBaun's lab (Willen et al., 2017), researchers tried to predict the post-PSG pain rate based on measurements taken during the PSG study. Previous work from our lab (Chalacheva et al., 2019) showed that, by dividing the subjects into two categories according to pain rate, namely high pain and low pain, a vasoconstriction-related biomarker named Mvasoc is an effective predictor of a subject's pain category. We first used pain categories based on the number of pain events six months before and after the PSG study, and selected the cutoff at 2 episodes, since hematologists used to prescribe hydroxyurea treatment for SCD patients with more than 2 pain-induced hospitalization events within a year. Next, we changed the pain categories to be based on the post-PSG pain rate, for consistency with the earlier research by Willen et al., 2017. The cutoff was chosen at 1.5 pain episodes per year so that the population of high pain subjects was similar to that in the ±6 months case. In order to be consistent with these two previous studies, the post-PSG pain categories used in this work also used the cutoff of 1.5 episodes per year. With this cutoff, 36 subjects were categorized as high pain and 176 subjects as low pain. Note that, since a subject's ANS status may fluctuate over time, applying the 2-episode cutoff rule to a short time window and then re-defining the cutoff for a long time window based on the population ratio is a reasonable approach.

Another reason that a categorical dependent variable (pain category) was chosen over a continuous one (pain rate) is that a small portion of SCD patients contributes a much larger portion of all pain incidences. For instance, in the SAC study, 5% of the subjects contribute ~25% of all pain incidences. According to hematologists, this is generally true for all SCD patients (in or outside the SAC study). Thus it is very likely that certain phenotype(s) of SCD patients tend to experience pain events more frequently.
Third, it is more important to distinguish the SCD patients with 2 pain episodes per year from the patients with 0 episodes than to distinguish the patients with 6 episodes per year from the ones with 2 per year. Pain category serves this goal better than the continuous pain rate. In addition, only 212 subjects were included in this work, which is a very small dataset in the context of machine learning. Adopting pain category over pain rate limits the impact of outliers (extremely high pain rates) on the machine learning model, reduces the amount of overfitting and allows the models to be trained more effectively.

3.2.3.2 Pre-PSG Pain Category

In order to answer the first two sub-questions raised in Section 1.2.2, a pre-PSG pain rate was developed as well. It was defined as the pain rate within a certain time window before the PSG study. For subjects with a longer followup history before the PSG study (than after it), the time window was set equal to the length of the followup time after the PSG study. For subjects with a shorter followup history before the PSG study, the time window was set to span from the beginning of the SAC study to the PSG study. The same cutoff, 1.5 pain episodes per year, was used to split the subjects into the high pain (≥ 1.5) and low pain (< 1.5) categories. With this cutoff, 31 subjects were categorized as high pain and 181 subjects as low pain.

4 Methods

4.1 Children's Hospital Los Angeles Dataset

4.1.1 Data Preprocessing

4.1.1.1 Symbolic Dynamics

Symbolic dynamics is a technique of coarsely graining physiological time series data by transforming them into symbol sequences that preserve information about the dynamical properties of the corresponding system (Costa et al., 2017; Cysarz et al., 2015; Cysarz et al., 2013; Voss et al., 2000). It is usually used in heart rate variability analysis with methods such as the σ-method, max-min, equal probability, etc. (Cysarz et al., 2015). In this study, symbolic dynamics with variations of the max-min method is adopted as a preprocessing technique for the PPGa signal, to obtain sequences of numerical symbols for further analysis.

Method 1: max-min with Six Symbols, 1 Hz

First, downsample the normalized PPGa signals from 30 Hz to 1 Hz by taking the mean of every 30 samples. Then divide each signal $x^{<t>(i)},\; t = 1,2,\dots,T_i$ (where $T_i$ is the length of signal $i$, which differs across signals since padded zeros have been removed) into six bins of equal size $l = (\max(x^{<t>(i)}) - \min(x^{<t>(i)}))/6$. Last, assign the numerical symbols $\{1,2,3,4,5,6\}$ to these bins in ascending order:

$$S_t = \begin{cases} 1 & \text{if } Min \le x^{<t>(i)} < Min + l \\ 2 & \text{if } Min + l \le x^{<t>(i)} < Min + 2l \\ 3 & \text{if } Min + 2l \le x^{<t>(i)} < Min + 3l \\ 4 & \text{if } Min + 3l \le x^{<t>(i)} < Min + 4l \\ 5 & \text{if } Min + 4l \le x^{<t>(i)} < Min + 5l \\ 6 & \text{if } Min + 5l \le x^{<t>(i)} \le Max \end{cases}$$

where $Min \triangleq \min(x^{<t>(i)})$ and $Max \triangleq \max(x^{<t>(i)})$.

A sequence with six different symbols was a good compromise for heart rate variability analysis, and turned out to work well for PPGa signals as well. This method was used in extracting the weighted mean PPGa hand-engineered feature (see Section 4.1.1.2).
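Method 1 can be sketched in a few lines of numpy (an illustration, not the original implementation):

```python
import numpy as np

def symbolize_method1(ppga_30hz):
    """max-min symbolic dynamics with six symbols at 1 Hz."""
    x = ppga_30hz[: len(ppga_30hz) // 30 * 30]
    x = x.reshape(-1, 30).mean(axis=1)            # downsample 30 Hz -> 1 Hz
    l = (x.max() - x.min()) / 6                   # six bins of equal size
    s = np.minimum((x - x.min()) // l, 5) + 1     # symbols 1..6 (top edge -> 6)
    return s.astype(int)
```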
Method 2: Two-Region max-min with Six Symbols, 1 Hz

First, downsample the normalized PPGa signals from 30 Hz to 1 Hz by taking the mean of every 30 samples. Then divide each signal into two regions, $x^{<t>(i)} < 1$ and $x^{<t>(i)} \ge 1$, and split the signal into four bins of equal size $l_1 = (1 - \min(x^{<t>(i)}))/4$ for the $x^{<t>(i)} < 1$ region and two bins of equal size $l_2 = (\max(x^{<t>(i)}) - 1)/2$ for the $x^{<t>(i)} \ge 1$ region:

$$S_t = \begin{cases} 1 & \text{if } Min \le x^{<t>(i)} < Min + l_1 \\ 2 & \text{if } Min + l_1 \le x^{<t>(i)} < Min + 2l_1 \\ 3 & \text{if } Min + 2l_1 \le x^{<t>(i)} < Min + 3l_1 \\ 4 & \text{if } Min + 3l_1 \le x^{<t>(i)} < 1 \\ 5 & \text{if } 1 \le x^{<t>(i)} < Max - l_2 \\ 6 & \text{if } Max - l_2 \le x^{<t>(i)} \le Max \end{cases}$$

where $Min \triangleq \min(x^{<t>(i)})$ and $Max \triangleq \max(x^{<t>(i)})$. This method was used in extracting the trend hand-engineered feature (see Section 4.1.1.2).

Method 3: Two-Region max-min with Seven Symbols, 0.1 Hz

Method 3 is similar to Method 2 except for two changes. First, it downsamples the normalized PPGa signals to 0.1 Hz instead of 1 Hz. Second, the $x^{<t>(i)} < 1$ region is divided into five bins instead of four:

$$S_t = \begin{cases} 1 & \text{if } Min \le x^{<t>(i)} < Min + l_1 \\ 2 & \text{if } Min + l_1 \le x^{<t>(i)} < Min + 2l_1 \\ 3 & \text{if } Min + 2l_1 \le x^{<t>(i)} < Min + 3l_1 \\ 4 & \text{if } Min + 3l_1 \le x^{<t>(i)} < Min + 4l_1 \\ 5 & \text{if } Min + 4l_1 \le x^{<t>(i)} < 1 \\ 6 & \text{if } 1 \le x^{<t>(i)} < Max - l_2 \\ 7 & \text{if } Max - l_2 \le x^{<t>(i)} \le Max \end{cases}$$

where $Min \triangleq \min(x^{<t>(i)})$ and $Max \triangleq \max(x^{<t>(i)})$.

In addition to symbol sequences, by concatenating every three consecutive symbols into a symbolic word, we can obtain a symbolic word sequence for each signal as well. For instance, the symbol sequence {3,1,4,7,3,1,4} forms the symbolic word sequence {'314', '147', '473', '731', '314'}. Note that the first word '314' and the last word '314' represent the same word. Also note that the symbolic word sequence is shorter than the corresponding symbol sequence by two (more generally, by "word length − 1") due to the length of the word. The purpose of having seven symbols instead of six is to obtain a larger vocabulary. With seven symbols, there are up to $7^3 + 1 = 344$ different words including the unknown word (the actual vocabulary size is 262 words since some combinations never occur). In this way, a PPGa signal can be converted into a sentence of symbolic words, and that inspired the idea of applying successful natural language processing models to process the signal. In order to accomplish that, one additional preprocessing step for the words was needed, which is introduced in Section 4.1.2.3. This method was used in the preprocessing steps for the deep neural networks (see Section 4.1.2.3 and Section 4.1.2.4).

4.1.1.2 Feature Engineering

PPGa signals showing a tendency of sustained vasoconstriction had two common features: first, sustained vasoconstriction; second, a tendency of vasoconstriction when exposed to external stimuli to the autonomic nervous system. Two hand-engineered features, "weighted mean PPGa" and "trend", were designed to describe them.

Weighted mean PPGa differs from the mean of PPGa in that it puts more weight on the sustained parts of the signal. First, symbolic dynamics Method 1 described in Section 4.1.1.1 is applied to preprocess the signals. Once the signals are coarsely grained into symbolic sequences, we are able to define "sustained" according to the symbols. A symbolic sequence can further be regarded as segments of identical symbols with variable length. For each segment, its value is the symbol and its weight is the square of its length. In this way, the value still represents the level of vasoconstriction; however, the sustained parts, i.e. longer segments, have larger weights, as illustrated in Figure 4.1:

$$wmPPGa = \frac{\sum_s y_s L_s^2}{\sum_s L_s^2}$$

where $y_s$ is the value (symbol) of segment $s$ of a certain PPGa signal and $L_s$ is the length of segment $s$.

Figure 4.1 Illustration of the weighted mean feature. Signal A and signal B have the same mean but different weighted means due to the "sustained" part in signal B.
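A sketch of the weighted mean computation over symbol runs (illustrative; `symbols` is assumed to be a Method 1 symbol sequence):

```python
from itertools import groupby

def weighted_mean_ppga(symbols):
    """wmPPGa = sum(y_s * L_s^2) / sum(L_s^2): sustained (longer) runs of the
    same symbol receive quadratically larger weights."""
    runs = [(sym, len(list(grp))) for sym, grp in groupby(symbols)]
    num = sum(sym * length ** 2 for sym, length in runs)
    den = sum(length ** 2 for _, length in runs)
    return num / den
```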
Symbolic dynamics method 2 introduces in Section 4.1.1.1 was adopted to preprocess the PPGa signals for this feature. From the obtained symbolic sequence, we looked at all the locations with value changes. If the current value is higher than its previous value, this “change” will positively contributed to the trend, and its increment to the value of trend will be determined by both the increment of symbolic value and the location of the “change”. The closer it is to the end of recording, the more impact it will have. On the contrary, if the current value is lower than its previous value, it will decrease the trend. The larger the magnitude of the “change”, and the closer it is to the beginning of the recording, the more decrement it will bring. Since the studies did not formally start until the baseline, we defined the beginning of the recording as the start of baseline, and treated all wmPPGa = ∑y s L 2 s ∑L 2 s y s s L s s 55 Figure 4.1 Illustration of the weighted mean feature. Signal A and signal B have the same mean but different weighted mean due to the “sustained” part in signal B. changes occurring pre-baseline as if they occurred at the beginning of baseline (more precisely, the first possible location for a change to happen, which is between the first and second symbol post-baseline). where is the value for the symbol , is the length of symbol sequence, is the total number of changes occurring in the sequence. Note that to a symbol sequence of length , there are possible locations for a change to take place. This expression also satisfies symmetry, i.e. the effect of a decrease taken place at time (between symbol and symbol ) will cancel out with the effect of an increase of the same magnitude taken place at time (between symbol and symbol ). Both features were normalized by subtracting their mean and divided by their range. trend = ∑ y t >y t−1 t−1 T−1 (y t −y t−1 )+∑ y t <y t−1 T−t+1 T−1 (y t −y t−1 ) N c y t t(t = 2,⋯,T) T N c T T−1 t t t−1 T+2−t T+1−t T+2−t 56 Figure 4.2 Illustration of the trend feature. Signal A has 0 trend due to its symmetry. Signal B has a negative trend for its overall decreasing tendency. 4.1.1.3 Global Vectors for Word Representation Words have different meanings to humans. However, for computers, they are just symbols or permutations of letters without engineers’ help. An effective way to represent meaning of words in computers is to encode words into high dimension vectors, as known as embeddings, with each component representing some feature. The feature can be properties like “gender”, “height”, or descriptions like “beautiful”, “dark”, or something that can hardly be directly understood by human beings. In fact, it is very likely that humans understand words in a similar way —— high dimension vectors. When we mention/capture a word in a certain context, we subconsciously project its corresponding high dimension vector to some lower dimension hyperplane. For instance, when we mention “red” while driving, it is likely to mean “stop”; in the topic of art design, it refers to a certain color; in a security training class, it probably indicates danger; in the context of sickle cell disease, it perhaps denotes the red blood cell. There are many algorithms that can effectively train the embeddings based on a large corpus (e.g. 
4.1.1.3 Global Vectors for Word Representation

Words have different meanings to humans. For computers, however, they are just symbols, permutations of letters, without engineers' help. An effective way to represent the meaning of words in computers is to encode words into high-dimensional vectors, known as embeddings, with each component representing some feature. The feature can be a property like "gender" or "height", a description like "beautiful" or "dark", or something that can hardly be directly understood by human beings. In fact, it is very likely that humans understand words in a similar way: as high-dimensional vectors. When we mention or capture a word in a certain context, we subconsciously project its corresponding high-dimensional vector onto some lower-dimensional hyperplane. For instance, when we mention "red" while driving, it is likely to mean "stop"; in the topic of art design, it refers to a certain color; in a security training class, it probably indicates danger; in the context of sickle cell disease, it perhaps denotes the red blood cell.

There are many algorithms that can effectively train embeddings based on a large corpus (e.g. a collection of news articles or Wikipedia articles), including the neural language model (Bengio et al., 2003), Word2Vec skip-gram (Mikolov et al., 2013), negative sampling (Mikolov et al., 2013), Embeddings from Language Models (ELMo) (Peters et al., 2018), Universal Language Model Fine-tuning for Text Classification (ULM-FiT) (Howard and Ruder, 2018), Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2018), etc. In this study, a fast, effective and widely accepted algorithm called Global Vectors for word representation (GloVe) (Pennington et al., 2014) is used.

Since words are represented as high-dimensional vectors, the distance between vectors should naturally reflect the similarity of the meanings of words. GloVe is a method based on the cooccurrence probability of words. Define $X$ to be the cooccurrence matrix, where $X_{ik}$ tabulates the number of times word $k$ occurs in the context of word $i$; $X_i = \sum_k X_{ik}$ to be the total count of all words appearing in the context of word $i$; and $P_{ik} = P(k|i) = X_{ik}/X_i$ to be the cooccurrence probability of word $k$ to word $i$, i.e. the probability of word $k$ appearing in the context of word $i$. It turns out that ratios of cooccurrence probabilities indicate the distances between words, i.e. $P_{ik}/P_{jk} = P(k|i)/P(k|j)$ reflects the distance between word $i$ and word $j$. In the example given in the original GloVe paper (shown in Table 4.1), with word $i$ = "ice" and word $j$ = "steam", noise from non-discriminative words like "water" and "fashion" cancels out in the ratio (close to 1), while large ratios correlate well with features specific to "ice" and small ratios correlate well with features specific to "steam". Thus the ratios of cooccurrence carry the information, or hints, for training the embeddings.

Table 4.1 Cooccurrence probabilities for target words "ice" and "steam" with selected context words from a 6-billion-word corpus. Noise from non-discriminative words like "water" and "fashion" is cancelled out in the ratio. Ratios much greater than 1 correlate well with "ice"-specific features and ratios much less than 1 correlate well with "steam"-specific features. Table reproduced from (Pennington et al., 2014).

Probability and Ratio | k = solid | k = gas | k = water | k = fashion
P(k|ice)              | 1.9E-04   | 6.6E-05 | 3.0E-03   | 1.7E-05
P(k|steam)            | 2.2E-05   | 7.8E-04 | 2.2E-03   | 1.8E-05
P(k|ice) / P(k|steam) | 8.9       | 0.085   | 1.36      | 0.96

With mathematical assumptions and simplifications, it can be derived that well-trained embeddings need to satisfy the following equation:

$$\vec{e}_i^{\,T} \vec{e}_k + b_i + b_k = \log(X_{ik})$$

where $\vec{e}_i$ and $\vec{e}_k$ are the embeddings for word $i$ and word $k$, and $b_i$ and $b_k$ are bias terms. However, it is rarely the case that the embeddings will match the cooccurrence counts perfectly. Therefore, a cost function is needed for computational methods to train the embeddings, aiming to minimize the error between the left-hand and right-hand sides of the above equation:
$$J = \sum_{i,k=1}^{V} f(X_{ik}) \left( \vec{e}_i^{\,T} \vec{e}_k + b_i + b_k - \log(X_{ik}) \right)^2$$

where $V$ is the size of the vocabulary and $f$ is a weighting function that deals with the boundary case $X_{ik} = 0$ (since $\log 0$ is undefined) and prevents the domination of frequent occurrences. A possible choice of $f$ is

$$f(x) = \begin{cases} (x/x_{max})^{\alpha} & \text{if } x < x_{max} \\ 1 & \text{otherwise} \end{cases}$$

The authors (Pennington et al., 2014) found $x_{max} = 100$ and $\alpha = 3/4$ to work well in their experiments. The same values are used in this study.

In the CHLA dataset, the symbolic word sequences obtained through symbolic dynamics Method 3 (see Section 4.1.1.1) have a vocabulary size of 262, including a "Null" word (whose embedding is a zero vector) to represent any symbolic word containing padded zeros (from the signal preparation process). The dimension (i.e. the number of features) of the embeddings is decided before training; in natural language processing applications, 10k~100k unique words are usually trained on corpora with billions of words for embeddings with 50~300 dimensions. In this study, we set 20 dimensions for the "CHLA PPGa corpus" of roughly 70k words. GloVe acts as an encoder preceding the models' encoding layers: it transforms words into embeddings, and these embeddings, instead of the words, are input into the machine learning models.

4.1.2 Machine Learning Models

4.1.2.1 Linear Regression

Linear regression is one of, if not the, most widely used machine learning decoders. It tries to find a linear combination of all features in the representation to predict the continuous-valued output, which is the risk level index in this study, such that the error is minimized according to some metric (a.k.a. the cost function). The most commonly used metric is the sum of squared errors, or equivalently the mean squared error (MSE). Using the representation of the two hand-engineered features as an example, linear regression seeks the following expression to predict the risk index:

$$w_1 \cdot norm\_wmPPGa + w_2 \cdot norm\_trend + b = VasoconstrictionIndex$$

The pros of linear regression include: (1) it is simple and fast; (2) it has analytical solutions for many commonly used cost functions; (3) the magnitudes of the weights indicate the importance of the corresponding features. The major limitation of linear regression is that the assumption of linearity restricts the model to making predictions based on a straight (hyper)line in the feature space.

4.1.2.2 Multilayer Perceptron

A multilayer perceptron (MLP) is basically a neural network with multiple fully-connected layers, or dense layers (see Section 2.2.4). It overcomes the limitation of linearity through the nonlinear activation functions in its neurons. In deep neural network models, it is usually added after the encoding layers to form a nonlinear decoder of the representation. Note that all the hidden layers in an MLP are encoders themselves; MLPs are classified as decoders in this work for two reasons. First, the main purpose of the MLP here is to nonlinearly decode the representation. Second, encoding layers like convolutional layers are more general and can be applied to other datasets of the same type with similar distributions. The MLP model with the hand-engineered features as input has two hidden layers with five and four neurons respectively, as illustrated in Figure 4.3. The MLP decoder (dense layer) used in the CNN and LSTM models contains 128 neurons.

Figure 4.3 Architecture of the MLP model with hand-engineered features as input. The 3-layer neural network contains two hidden layers of size five and four respectively and a regression output neuron. The activation function applied in the two hidden layers is ReLU.
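The two-hidden-layer MLP described above can be written in a few lines of Keras (a sketch consistent with the stated architecture; the loss and optimizer settings here are assumptions):

```python
import tensorflow as tf

mlp = tf.keras.Sequential([
    tf.keras.layers.Dense(5, activation="relu", input_shape=(2,)),  # norm_wmPPGa, norm_trend
    tf.keras.layers.Dense(4, activation="relu"),                    # second hidden layer
    tf.keras.layers.Dense(1),                                       # regression output: index
])
mlp.compile(optimizer="adam", loss="mse")
```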
4.1.2.3 Convolutional Neural Network

The architecture of the convolutional neural network (CNN) designed for the vasoconstriction index prediction task is shown in Figure 4.4. Note that the CNN model includes both the encoder (convolutional layers) and the decoder (dense layers); both parts are included in this section since the whole model was trained at the same time (similarly for the LSTM network in the next section). The scope of each neuron in the encoding layers (how much of the input affects the activation of a single neuron in the corresponding layer) is shown on top of the layer blocks. The model was built and trained in the Python deep learning framework Keras with the TensorFlow backend.

Figure 4.4 Architecture of the CNN model. The network is a 6-layer model (counting only layers with trainable parameters), including 4 convolutional layers and 2 dense layers (the output layer is a dense layer with 1 neuron).

Both PPGa signals (downsampled from 30 Hz to 0.1 Hz by taking the mean of every 300 samples, i.e. average-pooling) and symbol sequences processed by Method 3 (Section 4.1.1.1) were tested as inputs (with parameters trained for each type of input respectively). The output of the models was the vasoconstriction index (a continuous scalar). ReLU was used as the activation function for all convolutional layers and for the dense layer prior to the output. Batch normalization (Ioffe and Szegedy, 2015) was applied after each convolutional layer. The network was trained from scratch for 5000 epochs. Weights were initialized with Xavier initialization (uniform distribution) (Glorot and Bengio, 2010), and biases were initialized to zero. We used the Adam optimizer with $\beta_1 = 0.9$ and $\beta_2 = 0.999$ (Kingma and Ba, 2015), with an initial learning rate of $\alpha = 0.001$ decaying as $\alpha = 0.001/(1 + 0.004 \cdot num\_epoch)$. One epoch is defined as all training data passing both forward and backward through the neural network once.
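The stated learning-rate schedule can be reproduced in Keras with a scheduler callback (a sketch; `model`, `x` and `y` are placeholders):

```python
import tensorflow as tf

def lr_schedule(epoch):
    """alpha = 0.001 / (1 + 0.004 * num_epoch), as described above."""
    return 0.001 / (1 + 0.004 * epoch)

optimizer = tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)
callback = tf.keras.callbacks.LearningRateScheduler(lr_schedule)
# model.compile(optimizer=optimizer, loss="mse")
# model.fit(x, y, epochs=5000, callbacks=[callback])
```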
4.1.2.4 Long Short-Term Memory Network

Figure 4.5 shows the architecture of the LSTM network. The model was built and trained in the Python deep learning framework Keras with the TensorFlow backend. Three different types of input were tested: the PPGa signal (0.1 Hz), the symbolic sequence (processed by Method 3, Section 4.1.1.1), and word embeddings of symbolic words (processed by Method 3, Section 4.1.1.1, and GloVe, Section 4.1.1.3). Note that all the inputs used for the LSTM network were further processed to move all the padded trailing zeros to the beginning of the signals/sequences, because the output was determined by the last hidden state, which is affected by the input at the last time step.

Figure 4.5 Architecture of the LSTM network model. The diagram shows the version with word embedding input. The other two versions have the same encoding and decoding layers; only the input part is replaced with the PPGa signal and the symbolic sequence respectively.

We initialized the hidden state and memory cell to be zero vectors. Although some researchers have pointed out that non-zero initialization of the hidden state can possibly accelerate the training process, zero initialization is still widely adopted. In addition, the padded leading zeros in the input have no impact on the result if both the hidden state and the memory cell are zero-initialized. We applied dropout regularization after both LSTM layers with a rate of 0.5 to prevent overfitting. The ReLU activation function was used after the dense layer. We trained the model from scratch for 4000 epochs, initializing the input weights ($W_{cx}$, $W_{ix}$, $W_{fx}$ and $W_{ox}$, defined in Section 2.2.6) with Xavier initialization (uniform distribution) (Glorot and Bengio, 2010) and the hidden state weights ($W_{ca}$, $W_{ia}$, $W_{fa}$ and $W_{oa}$, defined in Section 2.2.6) with random orthogonal initialization (Saxe et al., 2014). We used the Adam optimizer with $\beta_1 = 0.9$ and $\beta_2 = 0.999$ (Kingma and Ba, 2015), with an initial learning rate of $\alpha = 0.001$ decaying as $\alpha = 0.001/(1 + 0.008 \cdot num\_epoch)$.

4.1.3 Model Training

Due to the limited size of the available data, it was unaffordable to split the dataset into three parts: a training set (for training parameters), a validation or development set (for adjusting hyper-parameters) and a test set (for an unbiased estimation of the final model). Instead, cross-validation with four folds was used to evaluate the deep neural networks. The 163 recordings were first divided into two groups according to the vasoconstriction index to reduce bias. Group A contained 16 recordings with a vasoconstriction index greater than or equal to 3, and group B contained the other 147 recordings. Each group was randomly drawn into 4 sets, and each fold consisted of one set from group A and one set from group B. As a result, the four folds contained 41, 41, 41 and 40 recordings respectively. During cross-validation, three folds were treated as the training set and the remaining one as the validation set. Once a full cycle was completed, the model was evaluated by its average performance over the four validation folds. The above training process was repeated 15 times for each model to control the effect of local minima due to initial conditions.

4.2 Sleep and Asthma Cohort Study Dataset

4.2.1 Machine Learning Model

As illustrated in Figure 4.6, a two-level stacking model (introduced in Section 2.2.9.3) was built to predict the pain category of each patient. The first level includes seven base models: logistic regression, SVM with a linear kernel, SVM with an RBF kernel, random forest, ExtraTrees, AdaBoost and a CNN. How these models were selected is discussed in Section 6.2.1.1. The CNN took preprocessed PPGa and R-peak to R-peak interval (RRI) signals as inputs, while the other six models took subject features as inputs. The details of the inputs are discussed in the next section. The output of these base models is the pain category defined in Section 3.2.3. The second level is an MLP model which takes the outputs of the base models, as well as the subject features, as inputs. Its output is the final prediction (of the whole machine learning model) of the subject's pain category. All the base models and the meta model are classification models, since the target output, i.e. the label, is the pain category.

Figure 4.6 Architecture of the two-level stacking model.
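The level-2 step of the stacking model amounts to concatenating the subject features with the level-1 outputs, as in the sketch below (`base_scores` is the assumed (n_subjects × 7) array of base model outputs described in Section 4.2.4.4, and `meta_model` any fitted classifier with a scikit-learn-style `predict`):

```python
import numpy as np

def stack_predict(meta_model, X_features, base_scores):
    """Meta model input = subject features + base model outputs."""
    meta_in = np.column_stack([X_features, base_scores])
    return meta_model.predict(meta_in)   # final pain category per subject
```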
4.2.2 Data Preprocessing

The inputs to the machine learning model can be summarized into the following categories: physiological features, PSG features, vasoconstriction features, heart rate variability (HRV) features (i.e. RRI features), PPGa features and signals (PPGa & RRI).

4.2.2.1 Physiological Features

Physiological features include age, sex, hemoglobin, white blood cell count, reticulocyte count, neutrophil count, systolic blood pressure, diastolic blood pressure and body mass index (BMI) of the subjects at the time of the PSG study. A binary feature indicating whether the subject had been prescribed hydroxyurea treatment during the SAC study is also included in this category, as hydroxyurea can increase the amount of fetal hemoglobin to reduce the polymerization of sickle hemoglobin and the clinical complications of SCD. In reality, however, this feature can be very noisy, as many patients did not take the treatment properly as prescribed.

4.2.2.2 PSG Features

Features associated with activities during sleep are categorized as PSG features, including the arousal index, the apnea-hypopnea index (AHI), the number of limb movements, the number of periodic limb movement (PLM) sequences, and the effective sleep duration (defined as the total duration of non-wake sleep stages from sleep onset to the end of sleep).

4.2.2.3 Vasoconstriction Features

The vasoconstriction features used in this work were first proposed in Chalacheva et al., 2019. The features are based on vasoconstriction events detected from the PPGa signal during sleep. The PPGa signal was obtained from the recorded PPG signal and the R-peaks that were automatically generated by the Rembrandt software based on the ECG signal. The R-peaks were also used to compute the RRI signal. Both PPGa and RRI signals were generated as uniformly sampled time series using zero-order-hold interpolation, and were then downsampled to 2 Hz by averaging (Berger et al., 1986). Bad signals with artifacts were removed. Ectopic beats (determined as RRIs that are longer or shorter than 3.5 standard deviations from the mean) were interpolated. PPGa signals were normalized by dividing by their 95th percentile value (from sleep onset to the end of sleep).

The idea of detecting a vasoconstriction event is to locate a transient drop in PPGa relative to its local baseline fluctuations. The respiratory influences on the PPGa signal were first removed using a low-pass filter with a cutoff frequency of 0.18 Hz, to ensure that the drops in PPGa were not induced by respiratory modulation. Next, a moving-window coefficient of variation (CV) was computed over a 5-second window with a step size of 2 seconds; the mean value in the CV calculation was computed over a larger window of 15 seconds to make the algorithm more robust to noise. Any CV peaks greater than the 75th percentile of the calculated CV of the PPGa signal were identified as candidates for vasoconstriction events, with the constraint that any two candidates must be at least 10 seconds apart. For each candidate, the local minimum PPGa was located within a ±3-second window around the CV peak. Then the median of the PPGa from 20 seconds to 5 seconds prior to the local minimum was taken as the temporary PPGa baseline. The onset of the vasoconstriction event was detected as the first point with a value lower than the temporary baseline. The actual baseline was then updated as the median from 20 seconds to 5 seconds prior to the vasoconstriction onset. The end of the vasoconstriction was defined as the first point where the value returned to (greater than or equal to) the actual baseline. Note that the CV only reflects changes relative to the adjacent mean values but not the direction of the changes; thus all vasoconstriction candidates with a minimum PPGa larger than their temporary baseline were discarded. In order to obtain a reliable baseline, the vasoconstriction events were also required to be independent, i.e. candidates were discarded if their baseline overlapped with a valid event.
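The moving-window CV at the heart of the detector can be sketched as follows (the exact window alignment is an assumption: std over a 5 s window, mean over a 15 s window, step of 2 s, at 2 Hz):

```python
import numpy as np

def moving_cv(ppga_2hz, fs=2, win=5, mean_win=15, step=2):
    """Coefficient of variation: local std divided by a wider local mean."""
    w, mw, st = win * fs, mean_win * fs, step * fs
    centers = np.arange(mw // 2, len(ppga_2hz) - mw // 2, st)
    cv = np.array([
        np.std(ppga_2hz[c - w // 2: c + w // 2]) /
        np.mean(ppga_2hz[c - mw // 2: c + mw // 2])
        for c in centers
    ])
    return centers, cv   # candidate events: CV peaks above the 75th percentile
```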
Moreover, to further ensure that the detected events were not due to a drift in the PPGa baseline, candidates with a duration (from onset to end of event) longer than 150 seconds were discarded as well. The above process was repeated for all vasoconstriction candidates.

The vasoconstriction features were developed based on the detected vasoconstriction events, using both the PPGa and RRI signals, since (1) vasoconstriction is a response to autonomic stimuli and (2) previous research (Chalacheva et al., 2019) shows that both PPGa and RRI are critical to identifying the phenotype of a subject's response to autonomic stimuli in the head-up tilt test. Like the PPGa signal, the RRI signal was preprocessed to 2 Hz in the same manner, and passed through the same low-pass filter to remove respiratory modulations. The RRI signal also has its own baseline, which is the median from 20 seconds to 5 seconds prior to the vasoconstriction onset. Ten different features were developed:

1. Compensated RRI area $A^i_{RRI,C}$, defined as the RRI area below baseline during vasoconstriction event $i$; a lower RRI indicates a higher heart rate, which compensates for the decrease in cardiac output due to vasoconstriction.
2. Uncompensated RRI area $A^i_{RRI,U}$, defined as the RRI area above baseline during vasoconstriction event $i$.
3. Net RRI area $A^i_{RRI} = A^i_{RRI,U} - A^i_{RRI,C}$.
4. Ratio of uncompensated duration versus vasoconstriction duration of vasoconstriction event $i$, $R^i_{RRI,U}$.
5. Vasoconstriction area $A^i_{vasoc}$, defined as the PPGa area below baseline during vasoconstriction event $i$.
6. Magnitude of vasoconstriction $M^i_{vasoc}$, defined as the average PPGa during vasoconstriction event $i$.
7. Duration of vasoconstriction event $i$, $T^i_{vasoc}$.
8. Total number of vasoconstriction events during the sleep, $N_{vasoc}$.
9. Exposure index (over all vasoconstriction events during the sleep), $I_{Exposure} = \sum_{i=1}^{N_{vasoc}} \left(A^i_{RRI} + A^i_{vasoc}\right)$.
10. Dosage index (over all vasoconstriction events during the sleep), $I_{Dosage} = \frac{\sum_{i=1}^{N_{vasoc}} \left(A^i_{RRI} + A^i_{vasoc}\right)}{\sum_{i=1}^{N_{vasoc}} T^i_{vasoc}}$.

Note that the values of features 1~6 are all relative values with respect to the corresponding baseline of each event. For the single-event features (features 1~7), the median value over all events during the sleep was taken to represent that feature. Features 9 and 10 are simple aggregates of the per-event quantities, as sketched below.
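A minimal sketch of the exposure and dosage indices, computed from the per-event areas and durations:

```python
import numpy as np

def exposure_dosage(a_rri, a_vasoc, t_vasoc):
    """I_exposure = sum_i (A_RRI_i + A_vasoc_i);
    I_dosage = I_exposure / sum_i T_vasoc_i."""
    exposure = np.sum(np.asarray(a_rri) + np.asarray(a_vasoc))
    return exposure, exposure / np.sum(t_vasoc)
```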
4.2.2.4 HRV Features

HRV features (i.e. RRI features) include:

1. SDNN, defined as the standard deviation of all normal RRIs during sleep.
2. SDANN, calculated by first averaging all normal RRIs within each 5-minute segment during sleep, and then taking the standard deviation of the averaged values.
3. SDNN index, calculated by first taking the standard deviation of all normal RRIs within each 5-minute segment during sleep, and then averaging these values.
4. Triangular index, defined as the integral of the histogram (with a bin width of 1/128 seconds) of the RRI during sleep.
5. Median of the triangular indices calculated for each 5-minute segment during sleep.
6. Median of the RMSSD (root mean square of successive differences of normal RRIs) values calculated for each 5-minute segment during sleep.
7. Median of the mean RRI values calculated for each 5-minute segment during sleep.
8. Median of the RRI standard deviation values calculated for each 5-minute segment during sleep.
9. Median of the low-frequency power (LFP) of the RRI calculated for each 5-minute segment during sleep.
10. Median of the high-frequency power (HFP) of the RRI calculated for each 5-minute segment during sleep.
11. Median of the low-high ratio (LHR, the ratio of low-frequency power to high-frequency power) of the RRI calculated for each 5-minute segment during sleep.
12. Median of the normalized low-frequency power of the RRI (LFP of the RRI divided by the mean RRI) calculated for each 5-minute segment during sleep.
13. Median of the normalized high-frequency power of the RRI (HFP of the RRI divided by the mean RRI) calculated for each 5-minute segment during sleep.

4.2.2.5 PPGa Features

PPGa features include:

1. CV of the PPGa over the whole sleep.
2. Median of the PPGa CV values calculated for each 5-minute segment during sleep.
3. Median of the mean PPGa values calculated for each 5-minute segment during sleep.
4. Median of the PPGa standard deviation values calculated for each 5-minute segment during sleep.
5. Median of the LFP of the PPGa calculated for each 5-minute segment during sleep.
6. Median of the HFP of the PPGa calculated for each 5-minute segment during sleep.

4.2.2.6 PPGa and RRI Signals

The 2 Hz PPGa and RRI signals around vasoconstriction events were used directly as inputs for the CNN model. Both signals (only the selected parts, i.e. the parts around the vasoconstriction events) were normalized to have zero mean and unit standard deviation (with respect to the selected parts only) so that they have a similar scale (this process was done only for the signals used to train the CNN model). For each vasoconstriction event, a 180-second segment of PPGa and RRI was selected, starting 30 seconds prior to the onset of the vasoconstriction event. Thus the first 30 seconds of each segment cover the baseline of the signal, and the later part (150 seconds) covers the longest possible vasoconstriction event. All vasoconstriction events belonging to the same subject share the same label, which is the pain category of that subject.
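Extracting one CNN input segment can be sketched as below (per-segment z-scoring is a simplifying assumption of how the normalization over the selected parts was applied):

```python
import numpy as np

FS = 2                    # PPGa/RRI sampling frequency (Hz)
SEG = 180 * FS            # 180-second segment = 360 samples

def event_segment(ppga, rri, onset_idx):
    """A 180 s window starting 30 s before the vasoconstriction onset,
    stacked as a (360, 2) array and z-scored per channel."""
    sl = slice(onset_idx - 30 * FS, onset_idx - 30 * FS + SEG)
    seg = np.stack([ppga[sl], rri[sl]], axis=-1)
    return (seg - seg.mean(axis=0)) / seg.std(axis=0)
```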
4.2.3 Feature Selection
For the physiological features, all 10 were selected as inputs for the machine learning models, since each describes the subject from a different aspect (and the absolute values of their pairwise Spearman's correlation coefficients were in fact smaller than 0.5, except for age and BMI). For the other feature types, among the features (within each type) that were highly correlated (absolute Spearman's correlation coefficient ≥ 0.5), only the most effective one was selected. The effectiveness of a feature was determined by how informative it was for distinguishing the two pain categories, quantified using the Wilcoxon rank-sum test: each feature was separated into two groups according to the subjects' pain categories, and the Wilcoxon rank-sum test was performed on the feature between the two groups. The feature with the smaller test p-value was regarded as more effective. The feature selection algorithm can be described as follows (a sketch implementing it is given at the end of this section):
1. Compute the pairwise Spearman's correlation coefficients for all features of the same type.
2. For each feature, separate it into two groups based on pain categories, and perform the Wilcoxon rank-sum test on it.
3. Put all features in the feature pool, and rank them by their Wilcoxon rank-sum test p-values in ascending order.
4. While the feature pool is not empty, select the feature with the highest rank, i.e. the lowest p-value, and move it from the feature pool to the selected pool.
5. Remove from the feature pool all features whose absolute Spearman's rank correlation coefficient with the feature selected in step 4 is ≥ 0.5.
6. Repeat steps 4-5 until the feature pool is empty.
7. The features in the selected pool are the selected features for that feature type.

The feature types are the ones discussed in Section 4.2.2.1 to Section 4.2.2.5. Note that the vasoconstriction features were further divided into three types: RRI-related (only) vasoconstriction features (features 1-4 in Section 4.2.2.3), PPGa-related (only) vasoconstriction features (features 5-6 in Section 4.2.2.3), and other vasoconstriction features (features 7-10 in Section 4.2.2.3). The final selected features for the machine learning models are:
Physiological features (10): age, sex, hemoglobin, white blood cell, reticulocyte, neutrophil, systolic blood pressure, diastolic blood pressure, BMI (all at the time of the PSG study), and whether the subject had been prescribed hydroxyurea treatment during the SAC study.
PSG features (4): AHI, arousal index, number of limb movements, effective sleep duration.
HRV features (1): median of the mean RRI over each 5-minute segment during sleep.
PPGa features (1): CV of PPGa over the whole sleep.
RRI-related vasoconstriction features (2): $A_{RRI,C}$ and $A_{RRI,U}$.
PPGa-related vasoconstriction features (1): $A_{vasoc}$.
Other vasoconstriction features (2): $N_{vasoc}$ and $I_{Dosage}$.
In our previous work (Chalacheva et al., 2019), $M_{vasoc}$ was shown to be a predictor of pain category after adjusting for the effects of age, sex and hemoglobin in a logistic regression model, while $A_{vasoc}$ was not. Thus both features were tested in our model (by selecting one of them at a time).
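The following is a minimal sketch of the greedy procedure above using SciPy; features is a hypothetical dict mapping feature names to per-subject arrays, and y holds the binary pain categories (1 = high pain).

```python
import numpy as np
from scipy.stats import ranksums, spearmanr

def select_features(features, y, rho_max=0.5):
    """Greedy selection within one feature type (steps 1-7 above)."""
    # Steps 2-3: rank features by Wilcoxon rank-sum p-value, ascending.
    pvals = {name: ranksums(x[y == 1], x[y == 0]).pvalue
             for name, x in features.items()}
    pool = sorted(features, key=pvals.get)
    selected = []
    while pool:                      # steps 4 and 6
        best = pool.pop(0)           # lowest remaining p-value
        selected.append(best)
        # Step 5: drop features strongly correlated with the chosen one.
        keep = []
        for name in pool:
            rho, _ = spearmanr(features[name], features[best])
            if abs(rho) < rho_max:
                keep.append(name)
        pool = keep
    return selected                  # step 7
```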
4.2.4 Model Training
4.2.4.1 Cross-Validation
Both the base models and the meta model were trained and evaluated using cross-validation with four folds. The cross-validation process can be described as follows (illustrated in Figure 4.7):
1. Shuffle the subjects using a random number generator (controlled by a seed) and put them into four folds. Each fold contains 53 subjects (44 low pain, 9 high pain).
2. Take fold i as the test fold, and take the other three folds as the training fold.
3. Train the model on the training fold, evaluate its performance on the test fold (fold i), and save the results (both model outputs and evaluation) to the output pool.
4. Perform steps 2-3 for i = 1, 2, 3, 4.
5. The output pool now contains the model outputs and the evaluation of model performance for all subjects.

When training, the base models were first trained in a cross-validation manner, with the shuffle governed by a random seed α. The outputs of the base models were then re-ordered (by subject) back to the original order of the dataset. Next, the meta model was also trained in a cross-validation manner, with the shuffle of both the features (original order) and the base-model outputs (original order) governed by another random seed β. This whole process is illustrated in Figure 4.8.

Figure 4.7 Illustration of the cross-validation training and evaluation process. The figure demonstrates the process for i = 4, when the model outputs for fold 1 to fold 3 have already been calculated.

4.2.4.2 Base Models (Except CNN)
The inputs for all base models except the CNN model were the selected features of the subjects. Each training sample contains all selected features of a single subject, and the target output, i.e. the label, is the pain category of that subject.

4.2.4.3 Base Model (CNN)
The inputs for the CNN model were the preprocessed PPGa and RRI signals (Section 4.2.2.6). Each training sample contained both the PPGa and RRI signals of a single vasoconstriction event. The 212 subjects had 54,159 vasoconstriction events in total. When applying cross-validation, the dataset was split on a subject basis, i.e. all vasoconstriction events belonging to a given subject were assigned to the same fold.

Figure 4.8 Overall process for training and evaluation of the two-level stacking model.

When testing or predicting, all vasoconstriction events of a given subject were passed through the CNN model. The output of the CNN model for each vasoconstriction event was a score from 0 to 1 indicating the probability of the event belonging to a high pain subject. Events with a score ≥ 0.5 were classified as high pain, and the remaining events were classified as low pain. In this way, each subject had a portion of events classified as high pain. If a subject had more than 25% of his/her events classified as high pain (this threshold is a hyper-parameter of the model; how it was chosen is discussed in Section 4.2.4.5), the subject was classified as high pain; otherwise he/she was classified as low pain. A sketch of this aggregation rule follows.
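A minimal sketch of the event-to-subject aggregation rule (the thresholds 0.5 and 0.25 are the ones stated above; the function name is hypothetical):

```python
import numpy as np

def subject_from_events(event_scores, event_thresh=0.5, subject_thresh=0.25):
    """Aggregate per-event CNN scores (0..1) into one subject-level label."""
    scores = np.asarray(event_scores)
    frac_high = np.mean(scores >= event_thresh)  # portion of high pain events;
                                                 # also the CNN output fed to
                                                 # the meta model (Section 4.2.4.4)
    return int(frac_high > subject_thresh)       # 1 = high pain, 0 = low pain
```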
4.2.4.4 Meta Model
The inputs for the meta model include the selected features of the subjects and the outputs of all base models. Note that the support vector machines with linear and RBF kernels only give a binary class number (1 for high pain, 0 for low pain), so their outputs were binary class numbers. The other four base models except the CNN output floating-point scores from 0 to 1 representing the predicted probability of a subject being in the high pain category. For the CNN, the output for each subject was the percentage of the subject's vasoconstriction events classified as high pain, which is also a floating-point number from 0 to 1. Each training sample contains all selected features of a single subject together with the base-model outputs for that subject. The target output in each training example of the meta model is again the pain category of the subject.

4.2.4.5 Hyper-Parameter Tuning
The non-trainable parameters in machine learning are called hyper-parameters: for instance, the number of hidden layers in a multilayer perceptron, the depth of trees in decision tree models, the coefficients of regularization terms, etc. In this work, the number of subjects in the high pain category (36 for post-PSG, 31 for pre-PSG) and in the low pain category (176 for post-PSG, 181 for pre-PSG) was fairly unbalanced, so different class weights were assigned to the subjects to reduce the negative impact of the unbalanced data. Low pain subjects were assigned a class weight of 1, while high pain subjects were assigned a class weight of 176/36 ≈ 4.89 for post-PSG and 181/31 ≈ 5.84 for pre-PSG. The other hyper-parameters were tuned according to model performance on the test folds in the cross-validation process. The same random seed was used when tuning hyper-parameters for all base models, while a different random seed was used when tuning hyper-parameters for the meta model.

4.2.5 Model Performance
Due to the limited amount of data, we could not afford an untouched, unbiased test dataset for model evaluation. The performance of the models was instead evaluated by repeating the cross-validation process 20 times with different random seeds (which control how subjects are split into folds). These seeds were all different from the ones used during the hyper-parameter tuning process. In this way, 1. model performance can be evaluated statistically (and is thus more reliable), and 2. the variance (the amount of overfitting) introduced during hyper-parameter tuning is reduced.
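A minimal sketch of this repeated evaluation, assuming a hypothetical run_cv(X, y, seed) that performs the 4-fold procedure of Section 4.2.4.1 and returns the pooled f1-score for one seed:

```python
import numpy as np

def repeated_evaluation(X, y, run_cv, n_repeats=20, seed_offset=1000):
    # seed_offset keeps these seeds distinct from the tuning seeds.
    scores = np.array([run_cv(X, y, seed_offset + k) for k in range(n_repeats)])
    mean = scores.mean()
    half = 1.96 * scores.std(ddof=1) / np.sqrt(n_repeats)  # ~95% CI of the mean
    return mean, (mean - half, mean + half)
```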
5 Results
5.1 Children's Hospital Los Angeles Dataset
5.1.1 Evaluation Metric
Since the output of each model was a continuous value, mean squared error (MSE) was chosen as the evaluation metric for the machine learning models. Recall that each model was trained 15 times to limit the impact of random initial conditions. In addition, each model was trained for a fixed number of epochs (large enough for training to converge) to reduce variance. The model with the lowest MSE among the 15 trials was selected.

5.1.2 Model Comparison
From the results shown in Table 5.1, the CNN models were clearly the best under this metric. This was expected, since a CNN is in essence a pattern recognition model, and the vasoconstriction index labels were based on patterns of sustained vasoconstriction. A more detailed analysis of each model is given in Section 6.1.1.

Table 5.1 Performance of all models. CNN models outperform the other ones. LR denotes linear regression.
Model | Input | Symbols | Min MSE
LR | 2 normalized features | 6; 4 + 2 | 0.311
MLP | 2 normalized features | 6; 4 + 2 | 0.239
CNN | PPGa | N/A | 0.115
CNN | symbolic sequence | 5 + 2 | 0.108
LSTM | PPGa | N/A | 0.279
LSTM | symbolic sequence | 5 + 2 | 0.147
LSTM | word embedding | 5 + 2 | 0.154

5.1.3 Data Visualization
5.1.3.1 2D Plot for Hand Engineering Features
Figure 5.1 and Figure 5.2 show the visualization of the linear regression model on the training data and validation data respectively. Figure 5.3 and Figure 5.4 show the visualization of the MLP model on the training data and validation data respectively. PPGa signals are represented by filled circles colored according to their vasoconstriction index labels: black for [0, 1), blue for [1, 2), green for [2, 3), red for [3, 4]. The same colors apply to the prediction map (predicted value for the corresponding norm_wmPPGa and norm_trend) in the background.

Figure 5.1 Visualization of the linear regression model with hand engineering feature input on one of the training sets. MSE 0.318. Each data point represents one study. The color and number of each point are determined by the vasoconstriction index label. The background color shows the prediction map of the linear regression model based on ranges.

Figure 5.2 Visualization of the linear regression model with hand engineering feature input on one of the validation sets. MSE 0.262. Each data point represents one study. The color and number of each point are determined by the vasoconstriction index label. The background color shows the prediction map of the linear regression model based on ranges.

Figure 5.3 Visualization of the MLP model with hand engineering feature input on one of the training sets. MSE 0.214. Each data point represents one study. The color and number of each point are determined by the vasoconstriction index label. The background color shows the prediction map of the MLP model based on ranges.

Figure 5.4 Visualization of the MLP model with hand engineering feature input on one of the validation sets. MSE 0.205. Each data point represents one study. The color and number of each point are determined by the vasoconstriction index label. The background color shows the prediction map of the MLP model based on ranges.

5.1.3.2 Gradient-weighted Class Activation Mapping
Gradient-weighted Class Activation Mapping (Grad-CAM) (Selvaraju et al., 2017) of two PPGa signals is shown in Figure 5.5. Signal (A) has a high-valued vasoconstriction index while signal (B) has a low-valued vasoconstriction index. Both positive gradients (showing the parts of the signal that increase the vasoconstriction index value, blue line) and negative gradients (showing the parts that decrease it, red line) are drawn in the same plot. For signal (A), the sustained vasoconstriction after baseline clearly increases the vasoconstriction index value, with a small dip in the region with vasodilation. The relatively low values at the beginning of the recording (pre-baseline) account for the major part of the negative contributions. Note that the positive and negative gradients are each normalized with respect to their own maximum magnitude, so for this signal the maximum magnitude of the positive gradients is very likely much larger than that of the negative gradients. For signal (B), the vasodilation in the middle of the study is the major contributor to its low-valued vasoconstriction index. The vasoconstriction near the end of the recording does lead to an increase in the vasoconstriction index value, as expected. Both examples show that the CNN model properly captures the vasoconstriction and vasodilation patterns we are interested in. (A minimal sketch of the Grad-CAM computation is given at the end of Section 5.1.3.)

Figure 5.5 Grad-CAM for two PPGa signals based on the CNN model. (A) has a high-valued vasoconstriction index, (B) has a low-valued vasoconstriction index. Blue lines are positive gradients, red lines are negative gradients. All gradients are normalized with respect to their own maximum magnitude.

5.1.3.3 Feature Visualization
Figure 5.6 shows the PPGa signal generated according to the gradients of the output with respect to the input. However, after we normalized it by its 95th percentile and passed it through the CNN, the prediction value was below 3, which is lower than that of many real PPGa signals. We believe proper regularization and constraints need to be added to the algorithm for better performance, i.e. to generate a signal that looks more realistic and has a very high prediction value.

Figure 5.6 Generated PPGa signal based on gradients of the output. It is supposed to have a high vasoconstriction index value; however, the actual performance does not meet our expectations. We will work to improve the algorithm in the future.
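For reference, the following is a minimal one-dimensional Grad-CAM sketch in PyTorch, following Selvaraju et al. (2017) adapted to 1-D signals; model and target_layer (its last convolutional layer) are hypothetical stand-ins for the CNN of Section 4.1, and the separation into self-normalized positive and negative parts mirrors Figure 5.5.

```python
import torch

def grad_cam_1d(model, signal, target_layer):
    """Per-time-step relevance of `signal` for the model's scalar output.

    signal: tensor of shape (1, 1, T) holding one downsampled PPGa recording.
    """
    store = {}

    def fwd_hook(module, args, output):
        store["act"] = output.detach()          # activations of the conv layer

    def bwd_hook(module, grad_input, grad_output):
        store["grad"] = grad_output[0].detach() # gradients w.r.t. those activations

    h1 = target_layer.register_forward_hook(fwd_hook)
    h2 = target_layer.register_full_backward_hook(bwd_hook)
    model.zero_grad()
    model(signal).squeeze().backward()          # backprop the regression output
    h1.remove()
    h2.remove()

    act, grad = store["act"], store["grad"]     # shape (1, C, T')
    weights = grad.mean(dim=2, keepdim=True)    # pooled per-channel weights
    cam = (weights * act).sum(dim=1).squeeze(0) # relevance at the conv layer's
                                                # temporal resolution
    pos = torch.clamp(cam, min=0)
    neg = torch.clamp(cam, max=0)
    pos = pos / (pos.abs().max() + 1e-8)        # each part normalized by its
    neg = neg / (neg.abs().max() + 1e-8)        # own maximum magnitude
    return pos, neg
```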
5.2 Sleep and Asthma Cohort Study Dataset
5.2.1 Evaluation Metric
Since there are exactly two output categories (high pain and low pain), if we define the high pain category as the positive category (as subjects in this group would require more attention), we can further define subjects who were actually positive and predicted positive as true positives (TP), subjects who were actually positive but predicted negative as false negatives (FN), subjects who were actually negative and predicted negative as true negatives (TN), and subjects who were actually negative but predicted positive as false positives (FP).

One key metric for detection/diagnosis problems is the true positive rate, also known as "sensitivity" in the statistical literature or "recall" in the machine learning field, defined as

$\mathrm{Sensitivity\ (Recall)} = \frac{TP}{TP + FN}$

Recall indicates the capability of the model to identify positive subjects among all actually positive subjects. Another key metric is the positive predictive value, also known as "precision", which indicates the reliability of the algorithm, i.e. how many of the positively predicted subjects are true positives:

$\mathrm{Precision} = \frac{TP}{TP + FP}$

Note that for any machine learning model there is a trade-off between high recall and high precision. If a model simply predicts every subject as positive, its recall will be 1, while its precision will only equal the population proportion of positive subjects. On the other hand, if a model is extremely conservative when predicting a positive subject, it ends up with high precision but will probably miss quite a few positive subjects. Thus, when comparing models, both recall and precision need to be taken into consideration. However, comparing two numbers at the same time is non-intuitive, and it is debatable which model is better when some have higher recall while others have higher precision. When the dataset is balanced (the numbers of positive and negative samples are roughly the same), accuracy (defined below) can properly describe the performance of a model, but when the dataset is unbalanced, accuracy is no longer an effective option. This is where the f1-score (defined below) comes in: it takes both recall and precision into account (as their harmonic mean). The f1-score was the metric we applied to compare model performance in predicting the pain category of the subjects in the SAC dataset.

$\mathrm{Accuracy} = \frac{TP + TN}{\mathrm{number\ of\ subjects}}$

$\mathrm{F1\text{-}score} = \frac{2 \times \mathrm{Recall} \times \mathrm{Precision}}{\mathrm{Recall} + \mathrm{Precision}}$

Figure 5.7 F1-score contours with respect to precision and recall.
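These four quantities can be computed directly from the confusion-matrix counts; equivalent helpers (precision_score, recall_score, f1_score) exist in scikit-learn.

```python
def classification_metrics(tp, fn, tn, fp):
    recall = tp / (tp + fn)                     # sensitivity / true positive rate
    precision = tp / (tp + fp)                  # positive predictive value
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    f1 = 2 * recall * precision / (recall + precision)  # harmonic mean
    return recall, precision, accuracy, f1
```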
5.2.2 Benchmark of Model Performance
In many other machine learning studies on medical applications (e.g. those mentioned in Section 2.2.2), researchers usually take the performance of human experts (e.g. medical doctors) as the benchmark for evaluating the effectiveness of machine learning models. That benchmark is not available in this work, as no one (even a human expert, i.e. a hematologist) knows how to diagnose effectively from the known features of the patients. In other words, the performance of the experts can be assumed to be random. Recall that in statistical tests the null hypothesis is likewise based on an assumption of randomness (e.g. assume the mean performance of two models is the same, i.e. the difference between them is due to random sampling). Thus, the performance of a random model can be considered the benchmark.

Two random models were defined as follows: random model A acted like a fair coin and output the positive and the negative category with equal probability (50% each); random model B acted like an unfair coin with head (positive) probability p. The SAC dataset contains 212 subjects, 36 of them belonging to the positive category (high pain) and 176 to the negative category (low pain). Thus for random model B, the expected counts are TP = 36p, FN = 36(1 − p), FP = 176p, and TN = 176(1 − p). Recall would equal p, and precision would depend only on the ratio of positive to negative subjects, namely 36/(36 + 176) = 9/53. It is easily derived that the optimal f1-score of random model B is reached when p = 1. Note that at p = 1 the model is no longer random, as it always predicts positive, but we still call it a random model, as it was derived based on probability p. The performance of random models A and B on the SAC dataset is illustrated in Figure 5.8. The recall, precision, accuracy and f1-score of random model A in Figure 5.8 (A) are 0.5, 0.1698, 0.5 and 0.2535 respectively. The recall, precision, accuracy and f1-score of random model B in Figure 5.8 (B) are 1, 0.1698, 0.1698 and 0.2903 respectively. Any post-PSG pain category prediction model with an f1-score significantly better than 0.2903 can be regarded as "better than random", i.e. effective to some extent.

Figure 5.8 Performance of the benchmark random models. (A) Random model A (fair coin). (B) Random model B (unfair coin giving the optimal f1-score).

5.2.3 Model Comparison
To answer the sub-questions raised in Section 1.2.2, the two-level stacking model was adjusted to have different combinations of base models, inputs and target output.

5.2.3.1 Post-PSG Pain Category vs Pre-PSG Pain Category
To understand whether ANS features are predictors of pain crisis, two stacking models with the same structure and inputs but different target outputs were trained, evaluated and compared. Each model includes 7 base models and utilizes 21 features, as discussed in Section 4.2.1 and Section 4.2.3 respectively. Their target outputs are the post-PSG pain category and the pre-PSG pain category (both defined in Section 3.2.3) respectively. Model performance was evaluated in terms of the mean f1-score over the 20 trials (discussed in Section 4.2.5) and is shown in Table 5.2. Since the allocation of subjects (how subjects were split into folds) in corresponding trials was controlled by the same random seed (e.g. trial 5 of both models used the same allocation of subjects in cross-validation training and evaluation, while the allocation differs across trials), and since the performance across trials for each model was roughly normally distributed, a paired t-test can be used to compare the performance of different models. As shown in Table 5.2 and Figure 5.9, ANS features predict the post-PSG pain category far better than the pre-PSG pain category (p-value < 0.001), which demonstrates that ANS features are predictors of VOC. Both models are far better than the random models. However, a closer look at the performance of the base models in each stacking model (see Appendix II for details) reveals that for the model predicting the pre-PSG pain category, the base models perform similarly to random model A, whereas for the model predicting the post-PSG pain category, the base models still outperform random model A by a large margin.
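A quick numerical check of random model B, sweeping the head probability p over the 212-subject dataset (the expected counts follow the expressions above):

```python
import numpy as np

p = np.linspace(0.01, 1.0, 100)
tp, fn, fp = 36 * p, 36 * (1 - p), 176 * p
recall = tp / (tp + fn)            # equals p
precision = tp / (tp + fp)         # constant 36/212 = 9/53 ~ 0.1698
f1 = 2 * recall * precision / (recall + precision)
print(p[np.argmax(f1)], f1.max())  # -> 1.0 and ~0.2903, as stated above
```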
5.2.3.2 Impact of Vasoconstriction Related Features
To answer the third sub-question raised in Section 1.2.2, models with vasoconstriction-related inputs were compared with a model without vasoconstriction-related inputs. The vasoconstriction-related inputs include all features derived from the detected vasoconstriction events ($A_{RRI,C}$, $A_{RRI,U}$, $M_{vasoc}$, $A_{vasoc}$, $N_{vasoc}$, $I_{Dosage}$), the CV of PPGa over the whole sleep, and the output of one base model, the CNN model (since the CNN looks for patterns in PPGa and RRI during vasoconstriction events). The results are shown in Table 5.2 and Figure 5.9. The full model (with vasoconstriction-related features and the CNN as a base model) outperformed the model without vasoconstriction-related inputs the most. In addition, among all models with vasoconstriction-related inputs, the models with $M_{vasoc}$ outperformed the models with $A_{vasoc}$, despite the fact that $A_{vasoc}$ has the smaller p-value in the Wilcoxon rank-sum test (refer to Section 4.2.3). This finding agrees with what was discovered in our previous work (Chalacheva et al., 2019). More details of the model comparison are given in Appendix II.

5.2.3.3 Comparison with Model in Previous Work
In our previous work (Chalacheva et al., 2019), a negative binomial model was developed to predict the post-PSG pain rate. The independent variables in that model are age, hemoglobin at PSG, whether the subject had been prescribed hydroxyurea treatment during the SAC study, and $M_{vasoc}$. Since the output of the negative binomial model is the post-PSG pain rate, it was converted to a post-PSG pain category using the same cutoff of 1.5 episodes per year. The model was evaluated in a cross-validation manner (refer to Section 4.2.4.1). The fold split was done in MATLAB and controlled by rng(0) (random number generator with seed 0). The parameters were estimated using JMP Pro 14 (the same software as used in the previous work). As shown in Table 5.2 and Figure 5.9, the two-level stacking model proposed in this work outperforms the negative binomial model by a large margin, despite being less interpretable than the latter, which is a generalized linear model.

Table 5.2 Model performance comparison.
Model Name | Model Input | Model Output | F1-score | 95% Confidence Interval
Random A | N/A | N/A | 0.2535 | N/A
Random B | N/A | N/A | 0.2903 | N/A
Negative Binomial | age, hgb, hydroxyurea, M_vasoc | post-PSG | 0.3279 | N/A
M_vasoc | 21 features (M_vasoc), 7 base models (CNN) | post-PSG | 0.4255 | [0.4183, 0.4327]
A_vasoc | 21 features (A_vasoc), 7 base models (CNN) | post-PSG | 0.4135 | [0.4075, 0.4196]
No M & A | 20 features (no M & A), 7 base models (CNN) | post-PSG | 0.4155 | [0.4041, 0.4269]
No Vasoc-related | 15 features (no vasoc), 6 base models (no CNN) | post-PSG | 0.4040 | [0.3964, 0.4115]
Pre-PSG | 21 features (M_vasoc), 7 base models (CNN) | pre-PSG | 0.3505 | [0.3377, 0.3633]

Figure 5.9 Model performance comparison; paired t-tests based on 20 different random seeds that control the split of subjects.
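A minimal sketch of the paired comparison behind Figure 5.9; f1_model_a and f1_model_b are placeholder arrays of f1-scores for two models over the same 20 random splits (real values would come from the repeated evaluation of Section 4.2.5).

```python
import numpy as np
from scipy.stats import ttest_rel

# Placeholder paired scores; in practice, one f1 value per shared split seed.
rng = np.random.default_rng(0)
f1_model_a = 0.42 + 0.01 * rng.standard_normal(20)
f1_model_b = 0.40 + 0.01 * rng.standard_normal(20)

t_stat, p_value = ttest_rel(f1_model_a, f1_model_b)  # pairs share the fold-split seed
print(t_stat, p_value)
```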
6 Discussion
6.1 Children's Hospital Los Angeles Dataset
6.1.1 Model Analysis
For the convolutional neural network (CNN) models, we downsampled the input to 0.1 Hz mainly to reduce the size of the model, so that we would not badly overfit with a training set of only 120+ recordings, and so that training would be faster. In addition, since each neuron in the input layer contains 10 seconds of information, with a filter size of 7 for the first convolutional layer, the lowest-level features are extracted from 70 seconds of information, which is reasonable for PPGa signals. Moreover, the downsampling also makes it possible for the neurons in the 4th convolutional layer to have a scope of 19.5 minutes, which is appropriate considering that the cold pain studies usually last around 40 minutes (after baseline), and having four convolutional layers is acceptable for a training dataset of 120+ recordings. For the symbolic sequence input, we chose seven symbols to be consistent with the input of the LSTM model, so that the two deep neural networks could be compared more fairly (we also tested six symbols, and the performance was only slightly worse than with seven symbols, with a difference in MSE < 0.02). The fact that the symbolic sequence version slightly outperforms the PPGa version indicates that symbolic dynamics is an effective preprocessing technique: the benefit of reducing the noise in the PPGa signals exceeds the drawback of the loss of information. (A hedged illustration of one such symbolization scheme is given at the end of this section.)

For the long short-term memory (LSTM) networks, regarding the number of layers: since RNN models are usually deep in terms of the number of time steps rather than the number of layers (usually no more than three layers), and given the limited availability of data, we tested both one and two LSTM layers and selected the latter for its better performance. Compared to the network with PPGa input, the network using symbolic sequences significantly improved the performance, which suggests that sequential models are probably more vulnerable to noise in pattern recognition tasks. For the word embedding version of the network, we had actually expected it to perform better at pattern recognition, since each symbolic word was constructed from a 30-second small pattern. However, it only delivered results on par with or slightly worse than the network with symbolic sequence input. We believe the main limitation here is the size of the corpus for training the word embeddings. As mentioned in Section 4.1.1.3, we only had roughly 70k words for a vocabulary size of 262, compared to billions of words in natural language processing corpora with vocabulary sizes of 10k-100k. However, because training word embeddings is a semi-supervised learning task (it only needs the corpus), recruiting a larger training set is much easier than for supervised learning tasks where labels are needed. We are confident that this method has great potential with a larger corpus.

The hand engineering feature models were outperformed by well-built deep learning models by a large margin even with a small training set (122-123 recordings), which shows the power and huge potential of deep learning techniques in physiological signal processing tasks. As illustrated in Figure 6.1, deep learning models in general keep improving as more and more data are collected, while the performance of traditional machine learning models plateaus at a much smaller data size.

Figure 6.1 Illustration of performance vs size of dataset for deep neural networks and traditional learning algorithms.

However, the fact that the MLP model with hand engineering feature input still beat the LSTM model with downsampled PPGa proves that 1. well-designed features are still useful for problems with small datasets, and 2. feeding data into deep models without proper preprocessing is unlikely to generate satisfying output, especially when the data size is small, even though it is common for researchers to directly feed time series data (e.g. physiological signals) to LSTM networks, since LSTM models are well known for handling sequential data.
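The exact symbolization scheme used in this work is the one defined in Section 4.1.1; purely as an illustration of the idea, a percentile-based coarse graining (in the spirit of Cysarz et al., 2015) would map each PPGa sample to one of seven symbols as follows. This is a hedged sketch, not the thesis's scheme.

```python
import numpy as np

def symbolize(ppga, n_symbols=7):
    """Illustrative percentile-based coarse graining into n_symbols levels."""
    # Cut points at equally spaced percentiles of the recording itself.
    edges = np.percentile(ppga, np.linspace(0, 100, n_symbols + 1)[1:-1])
    return np.digitize(ppga, edges)   # integer symbols 0 .. n_symbols-1
```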
6.1.2 Limitations 6.1.2.1 Three Different Studies Deep neural networks work the best with data from the same distribution. However, the Children’s Hospital Los Angeles (CHLA) dataset contains three studies with different distribution. In order to have a larger training set, as well as a larger test set to more accurately evaluate the model, we have to compromise and put data with different kinds of stimuli together. Theoretically, deep neural networks can smartly recognize the noise in the input and discard with enough data. Although we are far from “enough” data, our deep models still impress us with the accuracy of prediction. 6.1.2.2 Subjective Gold Standard Another controversial part of the CHLA dataset is the labels since they are subjective. The problem with the CHLA dataset is that there is no objective data directly related with VOC or sustained peripheral vasoconstriction. Many statistical and rule-based approaches had been tested and none of them were satisfying: researchers could always find quite a few adversary examples. Then a question showed up: How can we raise adversary examples without any standards? The answer is that researchers do have estimation in their minds for the overall tendency of sustained vasoconstriction, or vasoconstriction index for each recording, but cannot precisely quantify the estimation or propose some rules that work for all recordings. The solution to this problem was rough estimation. Researchers were asked to label the signals with five different levels revealing the tendency of sustained vasoconstriction (five levels were chosen because five is an odd number so that there is a level in the middle, and the number of available options is proper). It was found that even though there was difference and preference among researchers, everyone agreed that the mean of labels correctly reveal the different tendency of sustained vasoconstriction. Thus this approach was accepted as the subjective gold standard. 95 6.2 Sleep and Asthma Cohort Study Dataset 6.2.1 Model Analysis 6.2.1.1 Base Model and Meta Model Selection Both base models and the meta model are selected according to their performance (f1-score, with 21 features including as inputs) on selected features. 11 models were tested. They were the 7 selected base models plus SVM with polynomial kernel, gradient boosting trees (gradient boosting with decision trees as base model), extreme gradient boosting trees (XGBoost), and MLP. As shown in Table 6.1, MLP performed the best, and was thus selected as the meta model in the second level. SVM with polynomial kernel, gradient boosting trees, and XGBoost were discarded as they performed significantly worse than the others. The rest 7 models were selected as base models. One possible reason for SVM with polynomial kernel having bad performance is that the polynomial hyper-plane does not help with the classification, as the powers of the features are not physiologically meaningful. Gradient booting trees is a model that tends to overfit the training data. This trend will dramatically harm its performance for small and noisy dataset, like the SAC dataset (refer to Section 6.2.2 for more information about dataset limitation). XGBoost is an improved version of gradient boosting trees by introducing multiple regularization terms to reduce overfitting. Although it beat gradient boosting trees by a large margin, it was not on par with other candidates. 
6.2.1.2 Effectiveness of the Stacking Model
Several multilayer perceptron models with different inputs were trained to illustrate the effectiveness of the two-level stacking model. The first one was trained with the outputs of the 7 base models only; as shown in Table 6.2 and Figure 6.2, its performance was better than that of any single base model. The second one was trained with the selected features, and the last one was trained with both (the "full power" model proposed in this work). The full model was the best one: despite the small performance margin, its variance (and thus the standard error of the mean) was much smaller. Note that in this work the SAC dataset has quite a few limitations (discussed in the next section) which made it very challenging to train machine learning models. The outputs of the base models were therefore very noisy, which constrained the stacking model's potential. With better data availability in future studies, the stacking model will show more of its power.

Table 6.2 MLP performance with different inputs.
Inputs | F1-score | 95% Confidence Interval
21 features (including M_vasoc) | 0.4147 | [0.4033, 0.4261]
7 base models (including CNN) | 0.3827 | [0.3695, 0.3960]
21 features + 7 base models (full model) | 0.4255 | [0.4183, 0.4327]

Figure 6.2 MLP performance with different inputs; paired t-tests based on 20 different random seeds that control the split of subjects.

6.2.2 Challenges and Limitations
Limitations of various kinds negatively impact the performance and reliability of the model and introduce additional challenges into this project. These limitations include the following sources.

6.2.2.1 Unbalanced Data
The numbers of subjects in the high pain versus low pain categories are highly unbalanced (with ratios of 1:4.89 for the post-PSG pain category and 1:5.84 for the pre-PSG pain category). Although this issue was handled by assigning larger class weights to the positive samples during training (equivalent to repeatedly sampling each positive subject 4.89 times for post-PSG or 5.84 times for pre-PSG; a sketch of this weighting is given below), model performance is still harmed by the lack of diversity among the positive samples.
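A minimal sketch of this class weighting as accepted by most scikit-learn classifiers (logistic regression is used here only as an example, not as the tuned model of this work):

```python
from sklearn.linear_model import LogisticRegression

# post-PSG: 176 low pain vs 36 high pain subjects -> weight 176/36 ~ 4.89
clf = LogisticRegression(class_weight={0: 1.0, 1: 176 / 36}, max_iter=1000)
```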
6.2.2.2 Data Availability
The size of the dataset (212 subjects) is quite limited for a machine learning problem. In addition, only one PSG study per subject was conducted, in the middle of the study period. Predicting variables derived over years from measurements of a single night adds further difficulty to the challenge.

6.2.2.3 Data Leakage
Due to the limited amount of available data, an untouched test dataset for unbiased evaluation was not affordable. Some extent of data leakage is inevitable with the use of cross-validation, which biases the hyper-parameter tuning and evaluation slightly towards overfitting. The effect was reduced by applying different random seeds in hyper-parameter tuning and evaluation, but it cannot be removed entirely.

6.2.2.4 Feature Selection and Feature Interaction
The feature selection process applied in this work was based on the similarity of distribution between a single independent variable (a feature) and the dependent variable (pain category), which does not properly consider interaction effects among features. For instance, $A_{vasoc}$ was selected over $M_{vasoc}$ by the current feature selection process; however, as shown in the results, the model fed with $M_{vasoc}$ actually performed better than the model fed with $A_{vasoc}$. Thus, if a better feature selection pipeline can be found, the model can be improved.

Two possible candidates are forward selection and backward selection (see the sketch after this subsection). Forward selection starts with an empty pool and adds, one at a time, the feature that maximizes model performance, repeating until a stopping criterion is met (e.g. the increment in performance falls below a threshold). Backward selection, on the contrary, starts with all features in the pool, and features are removed one at a time until the optimal set remains. Note that while more features do carry more information, they bring in more noise as well; the stopping criterion is designed to find the turning point where noise starts to dominate information. The challenges of these two methods include high computational cost and the choice of base models and their hyper-parameters.
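One possible realization of the forward-selection idea uses scikit-learn's SequentialFeatureSelector with f1 as the criterion; the base estimator and its settings here are illustrative, not the tuned models of this work.

```python
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

selector = SequentialFeatureSelector(
    LogisticRegression(class_weight={0: 1.0, 1: 176 / 36}, max_iter=1000),
    direction="forward",    # "backward" gives backward selection
    scoring="f1",
    cv=4,                   # mirrors the 4-fold scheme of Section 4.2.4.1
)
# selector.fit(X, y) then retains the subset maximizing cross-validated f1.
```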
6.2.2.5 Vasoconstriction Detection Algorithm
The vasoconstriction detection algorithm proposed by Chalacheva et al. (2019) was critical to this work, as all features related to vasoconstriction events were based on it. The algorithm also helped select the inputs of the CNN model (signal segments) and thus improved model performance by removing noise from unrelated segments of the signals. On the other hand, the algorithm contains a number of hyper-parameters (the CV threshold, various window sizes, the step size, the return threshold, etc.), and has the potential to achieve better outcomes.

6.2.2.6 Fixed Length of Signal Inputs to the CNN Model
The convolutional neural network (CNN) has proven to be an extremely effective model for pattern recognition. One limitation is that it requires the input length to be fixed, whereas the duration of vasoconstriction events is inconsistent. Information and patterns in the 180-second clip that are close to, but not part of, the targeted vasoconstriction event may negatively impact model performance.

6.2.2.7 Noise
Noise increases uncertainty and makes it harder for machine learning models to extract reliable information from the dataset. In this work, the models were trained with features derived from measurements made on a single day to predict a variable based on years; any noise and fluctuation during the measurement will dramatically hurt the robustness of the models. The noise sources include fluctuations in the autonomic function (or autonomic dysfunction) of the subjects on the study day, artifacts from motion and unmounted devices, device calibration issues, device accuracy and reliability, mis-detection of R-peaks, and information loss due to signal overflow and data compression/conversion/processing (e.g. imperfect filters). Moreover, differences in subjects' pain tolerance and decisions about hospital visits, potential recording errors in hospitalization history, and whether subjects strictly followed prescribed treatment (e.g. hydroxyurea) all introduce additional noise into the dataset.

6.2.2.8 Model Regularization
Due to the challenges and limitations discussed in the previous sections, strong model regularization was introduced when training the models. Regularization penalizes the complexity of the model to reduce overfitting. The mathematical forms of the regularization terms for the different models were discussed in Section 2.2. The hyper-parameters chosen for the models are detailed in Appendix I.

6.2.3 Application and Significance of the Work
The models in this work were developed to help manage the risk of pain crises for SCD patients. Instead of finding biomarkers, this work focused on developing more powerful models with better predictive f1-scores, at the cost of interpretability. The reliability of the current model is still not satisfying (with an f1-score just over 0.4), but the methods used to develop the model have proven effective, as the model performed better than previous models that focused on identifying biomarkers. This work also corroborates, with a different approach, the previous finding that ANS indices, especially the biomarker $M_{vasoc}$, are predictors of future pain crises for SCD patients. With the increasing popularity of wearable devices and exponentially growing data, computer-aided diagnosis and risk management is a sure choice. With more high-quality data available, model performance and reliability will also be dramatically improved.

7 Future Work
7.1 Model Interpretation
Despite the fact that the models developed in this work performed much better than straightforward linear models, their interpretability was much weaker. Better approaches can be explored in the future to help researchers further understand the mechanics of pain crises in SCD patients, in addition to assisting in risk management by making predictions.

7.2 Model Improvement
One obvious approach will be to collect more high-quality data. Machine learning models benefit directly from better data availability, as larger datasets with fewer noise sources better reveal the underlying distribution of the data. Besides that, making predictions for shorter periods from measurements over longer durations will help increase the signal-to-noise ratio and reduce the amount of uncertainty in the data. More effective feature selection (as discussed in Section 6.2.2.4) can also be critical to improving the model, especially when the dataset is small and the noise sources are numerous.

7.3 Future Research
The only signals that require continuous measurement in this work are PPG and RRI. Since RRI can be replaced by the inter-beat interval (IBI) obtained from PPG, PPG is the only necessary signal. Many wearable devices on the market provide PPG measurement, which makes continuous health tracking possible for SCD patients. Our research group is connecting with vendors for a wearable device solution, and meanwhile designing methods to collect feedback from SCD patients about their pain crisis status (candidates include an app, text messages, surveys, etc.). Moreover, for patients with large amounts of day-to-day follow-up data, developing personalized within-patient models for individuals will be possible. Combining them with between-patient models developed for the whole group, researchers can assist clinicians to better help SCD patients.

Acknowledgements
This work is supported by NIH Grants U01-HL117719 and P41-EB001978. I would like to thank my advisor, Dr. Michael C.K. Khoo, and co-advisor, Dr. Thomas D. Coates, for their advice and guidance through each stage of this work.
I would also like to express my gratitude to the other committee members, Dr. John C. Wood, Dr. B. Keith Jenkins, and Dr. Krishna S. Nayak, for their inspiring advice and feedback on my work. The support on the CHLA dataset from Dr. Patjanaporn S. Chalacheva, Dr. Saranya Veluswamy, and Payal M. Shah, and the support on the SAC dataset from Dr. Michael R. DeBaun, Mark Rodeghier and their colleagues, was critical to this work. This work would not have been possible without them. For this, I am extremely grateful. Constructive feedback and suggestions from my colleagues Dr. Patjanaporn S. Chalacheva, Dr. John Sunwoo, Wanwara Toey Thuptimdang, Kelby Knox, Dr. Alison Hu, Dr. Roberta Kato, Payal M. Shah, Dr. Saranya Veluswamy, Dr. Christopher Denton, Dr. Jon Detterich, Silvie Suriany, and Dr. Sarah R. Martin were instrumental to this work. I take this opportunity to thank them sincerely as well. Last but not least, I would like to express my thanks to my wife Qizhao Weng, my father Shuolu Ji and my mother Qiuyan Ji for their continuous support during my prolonged preoccupation with this work.

Bibliography
Allen, J. (2007, March 1). Photoplethysmography and its application in clinical physiological measurement. Physiological Measurement. https://doi.org/10.1088/0967-3334/28/3/R01
Altmann, A., Toloşi, L., Sander, O., & Lengauer, T. (2010). Permutation importance: a corrected feature importance measure. Bioinformatics, 26(10), 1340–1347. https://doi.org/10.1093/bioinformatics/btq134
Awad, A. A., Ghobashy, M. A. M., Ouda, W., Stout, R. G., Silverman, D. G., & Shelley, K. H. (2001). Different Responses of Ear and Finger Pulse Oximeter Wave Form to Cold Pressor Test. Anesthesia and Analgesia, 1483–1486. https://doi.org/10.1097/00000539-200106000-00026
Ballas, S. K. (2005). Pain Management of Sickle Cell Disease. Hematology/Oncology Clinics of North America, 19(5), 785–802. https://doi.org/10.1016/j.hoc.2005.07.008
Barron, S. A., Rogowski, Z., Kanter, Y., & Hemli, J. (1993). DC photoplethysmography in the evaluation of sympathetic vasomotor responses. Clinical Physiology, 13(6), 561–572. https://doi.org/10.1111/j.1475-097X.1993.tb00472.x
Bengio, Y., Ducharme, R., Vincent, P., & Jauvin, C. (2003). A Neural Probabilistic Language Model. Journal of Machine Learning Research, 3(Feb), 1137–1155. Retrieved from http://www.jmlr.org/papers/v3/bengio03a.html
Berger, R. D., Akselrod, S., Gordon, D., & Cohen, R. J. (1986). An Efficient Algorithm for Spectral Analysis of Heart Rate Variability. IEEE Transactions on Biomedical Engineering, BME-33(9), 900–904. https://doi.org/10.1109/TBME.1986.325789
Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123–140. https://doi.org/10.1007/bf00058655
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32. https://doi.org/10.1023/A:1010933404324
Centers for Disease Control and Prevention. (2017). Data & Statistics on Sickle Cell Disease | CDC. Retrieved January 1, 2019, from https://www.cdc.gov/NCBDDD/sicklecell/data.html
Chalacheva, P., Ji, Y., Rosen, C. L., DeBaun, M. R., Khoo, M. C. K., & Coates, T. D. (2019). High levels of peripheral vasoconstriction detected by polysomnography predict more acute severe pain episodes in children with sickle cell anemia. Blood, 134(Suppl. 1), 894 (abstract).
Chalacheva, P., Kato, R. M., Shah, P., Veluswamy, S., Denton, C., Sunwoo, J., Thuptimdang, W., Wood, J. C., Detterich, J. A., Coates, T. D., & Khoo, M. C. K. (2019). Peripheral vasoconstriction with blunted cardiac response to head-up tilt: an abnormal autonomic phenotype characteristic of sickle cell disease. Frontiers in Physiology, 10, 381. https://doi.org/10.3389/fphys.2019.00381
Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 785–794). Association for Computing Machinery. https://doi.org/10.1145/2939672.2939785
Christoph, G. W., Hofrichter, J., & Eaton, W. A. (2005). Understanding the Shape of Sickled Red Cells. Biophysical Journal, 88(2), 1371–1376. https://doi.org/10.1529/BIOPHYSJ.104.051250
Coates, T. D., Chalacheva, P., Zeltzer, L., & Khoo, M. C. K. (2018). Autonomic nervous system involvement in sickle cell disease. Clinical Hemorheology and Microcirculation, 68(2–3), 251–262. https://doi.org/10.3233/CH-189011
Connes, P., & Coates, T. D. (2013). Autonomic nervous system dysfunction: Implication in sickle cell disease. Comptes Rendus Biologies, 336(3), 142–147. https://doi.org/10.1016/J.CRVI.2012.09.003
Costa, M. D., Davis, R. B., & Goldberger, A. L. (2017). Heart rate fragmentation: A symbolic dynamical approach. Frontiers in Physiology, 8(NOV), 1–14. https://doi.org/10.3389/fphys.2017.00827
Cysarz, D., Porta, A., Montano, N., Leeuwen, P. V., Kurths, J., & Wessel, N. (2013). Quantifying heart rate dynamics using different approaches of symbolic dynamics. European Physical Journal: Special Topics, 222(2), 487–500. https://doi.org/10.1140/epjst/e2013-01854-7
Cysarz, D., Edelhauser, F., Javorka, M., Montano, N., & Porta, A. (2015). A percentile-based coarse graining approach is helpful in symbolizing heart rate variability during graded head-up tilt. Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology Society, EMBS, 286–289. https://doi.org/10.1109/EMBC.2015.7318356
Devopedia. (2019). "ImageNet." Version 15. Retrieved January 17, 2020, from https://devopedia.org/imagenet
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Retrieved from http://arxiv.org/abs/1810.04805
Dieng, A. B., Ranganath, R., Altosaar, J., & Blei, D. M. (2018). Noisin: Unbiased Regularization for Recurrent Neural Networks. Retrieved from http://arxiv.org/abs/1805.01500
Ding, J., Li, A., Hu, Z., & Wang, L. (2017). Accurate pulmonary nodule detection in computed tomography images using deep convolutional neural networks. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 10435 LNCS, pp. 559–567). Springer Verlag. https://doi.org/10.1007/978-3-319-66179-7_64
Eaton, W. A., & Bunn, H. F. (2017). Treating sickle cell disease by targeting HbS polymerization. Blood, 129(20), 2719–2726. https://doi.org/10.1182/blood-2017-02-765891
Eaton, W. A., Hofrichter, J., & Ross, P. D. (1976). A Possible Determinant in Sickle Cell Disease. Blood, 47(4), 621–627.
Elgendi, M. (2012). On the Analysis of Fingertip Photoplethysmogram Signals. Current Cardiology Reviews, 8(1), 14–25. https://doi.org/10.2174/157340312801215782
Esteva, A., Kuprel, B., Novoa, R. A., Ko, J., Swetter, S. M., Blau, H. M., & Thrun, S. (2017). Dermatologist-level classification of skin cancer with deep neural networks. Nature, 542(7639), 115–118. https://doi.org/10.1038/nature21056
Ferrone, F. A., Hofrichter, J., & Eaton, W. A. (1985). Kinetics of sickle hemoglobin polymerization: I. Studies using temperature-jump and laser photolysis techniques. Journal of Molecular Biology, 183(4), 591–610. https://doi.org/10.1016/0022-2836(85)90174-3
Ferrone, F. A., Hofrichter, J., & Eaton, W. A. (1985). Kinetics of sickle hemoglobin polymerization: II. A double nucleation mechanism. Journal of Molecular Biology, 183(4), 611–631. https://doi.org/10.1016/0022-2836(85)90175-5
Freund, Y., & Schapire, R. E. (1997). A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. Journal of Computer and System Sciences, 55(1), 119–139. https://doi.org/10.1006/jcss.1997.1504
Friedman, J. H. (2002). Stochastic gradient boosting. Computational Statistics and Data Analysis, 38(4), 367–378. https://doi.org/10.1016/S0167-9473(01)00065-2
Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29(5), 1189–1232. https://doi.org/10.2307/2699986
Geurts, P., Ernst, D., & Wehenkel, L. (2006). Extremely randomized trees. Machine Learning, 63(1), 3–42. https://doi.org/10.1007/s10994-006-6226-1
Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS) 2010.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., … Bengio, Y. (2014). Generative Adversarial Nets. Retrieved from http://papers.nips.cc/paper/5423-generative-adversarial-nets
Hamunen, K., Kontinen, V., Hakala, E., Talke, P., Paloheimo, M., & Kalso, E. (2012). Effect of pain on autonomic nervous system indices derived from photoplethysmography in healthy volunteers. British Journal of Anaesthesia, 108(5), 838–844. https://doi.org/10.1093/bja/aes001
Hassell, K. L. (2010). Population Estimates of Sickle Cell Disease in the U.S. American Journal of Preventive Medicine, 38(4), S512–S521. https://doi.org/10.1016/J.AMEPRE.2009.12.022
He, K., Zhang, X., Ren, S., & Sun, J. (2015). Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 1026–1034).
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep Residual Learning for Image Recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Hecht-Nielsen, R. (1992). Theory of the Backpropagation Neural Network. Neural Networks for Perception, 65–93. https://doi.org/10.1016/B978-0-12-741252-8.50010-8
Hochreiter, S., & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9(8), 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735
Hofrichter, J., Ross, P. D., & Eaton, W. A. (1974). Kinetics and mechanism of deoxyhemoglobin S gelation: a new approach to understanding sickle cell disease. Proceedings of the National Academy of Sciences of the United States of America, 71(12), 4864–4868. https://doi.org/10.1073/PNAS.71.12.4864
Hofrichter, J. (1986). Kinetics of sickle hemoglobin polymerization: III. Nucleation rates determined from stochastic fluctuations in polymerization progress curves. Journal of Molecular Biology, 189(3), 553–571. https://doi.org/10.1016/0022-2836(86)90324-4
Howard, J., & Ruder, S. (2018). Universal Language Model Fine-tuning for Text Classification. Retrieved from http://arxiv.org/abs/1801.06146
Ioffe, S., & Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. Retrieved from http://arxiv.org/abs/1502.03167
Itano, H., & Pauling, L. (1949). A rapid diagnostic test for sickle cell anemia. Blood, (1186), 66–68. https://doi.org/10.1088/0004-637X/760/1/39
Kemp, B., Värri, A., Rosa, A. C., Nielsen, K. D., & Gade, J. (1992). A simple format for exchange of digitized polygraphic recordings. Electroencephalography and Clinical Neurophysiology, 82(5), 391–393. https://doi.org/10.1016/0013-4694(92)90009-7
Khaleel, M., Puliyel, M., Shah, P., Sunwoo, J., Kato, R. M., Chalacheva, P., … Coates, T. D. (2017). Individuals with sickle cell disease have a significantly greater vasoconstriction response to thermal pain than controls and have significant vasoconstriction in response to anticipation of pain. American Journal of Hematology, 92(11), 1137–1145. https://doi.org/10.1002/ajh.24858
Khaleel, M., Puliyel, M., Sunwoo, J., Shah, P., Kato, R. M., Chalacheva, P., … Coates, T. D. (2015). Thermal Pain and Pain Anticipation Induce a Decrease in Microvascular Perfusion in Sickle Cell and Normal Subjects. Blood, 126(23). Retrieved from http://www.bloodjournal.org/content/126/23/67?sso-checked=true
Kingma, D. P., & Ba, J. (2015). Adam: A Method for Stochastic Optimization. Retrieved from http://arxiv.org/abs/1412.6980
Kotikalapudi, R., & Contributors. (2017). keras-vis. GitHub.
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet Classification with Deep Convolutional Neural Networks. Advances in Neural Information Processing Systems, 1–9. https://doi.org/10.1016/j.protcy.2014.09.007
Lee, J. G., Jun, S., Cho, Y. W., Lee, H., Kim, G. B., Seo, J. B., & Kim, N. (2017). Deep learning in medical imaging: General overview. Korean Journal of Radiology. Korean Radiological Society. https://doi.org/10.3348/kjr.2017.18.4.570
Lin, M., Chen, Q., & Yan, S. (2013). Network In Network. Retrieved from http://arxiv.org/abs/1312.4400
McGonagle, J., Shaikouski, G., & Williams, C. (n.d.). Backpropagation. Retrieved January 5, 2019, from https://brilliant.org/wiki/backpropagation/
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. Retrieved from https://arxiv.org/abs/1301.3781
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed Representations of Words and Phrases and their Compositionality. Retrieved from http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases
Morgan, B. J., Crabtree, D. C., & Toiber, F. S. (2013). Neurocirculatory consequences of abrupt change in sleep state in humans, 1627–1636.
Ng, A. (n.d.). CS229 Lecture notes: Support Vector Machines.
Ng, A. (2018). Machine Learning Yearning. deeplearning.ai.
Nielsen, M. A. (2015). Neural Networks and Deep Learning. Determination Press. Retrieved from http://neuralnetworksanddeeplearning.com/chap5.html
Olah, C., Mordvintsev, A., & Schubert, L. (2017). Feature Visualization. Distill, 2(11), e7. https://doi.org/10.23915/distill.00007
Oliphant, T. E. (2006). Guide to NumPy. Retrieved from http://www.trelgol.com
Pascanu, R., Tomas, M., & Bengio, Y. (2013). On the difficulty of training recurrent neural networks. Proceedings of the 30th International Conference on Machine Learning, (2), 41–71. https://doi.org/10.1109/72.279181
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., … Chintala, S. (2019). PyTorch: An Imperative Style, High-Performance Deep Learning Library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, & R. Garnett (Eds.), Advances in Neural Information Processing Systems 32 (pp. 8024–8035). Curran Associates, Inc. Retrieved from http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf
Pauling, L., Itano, H. A., Singer, S. J., & Wells, I. C. (1949). Sickle cell anemia, a molecular disease. Science, 110(2865), 543–548. https://doi.org/10.1126/science.110.2865.543
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., … Duchesnay, E. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
Pennington, J., Socher, R., & Manning, C. (2014). GloVe: Global Vectors for Word Representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–1543. https://doi.org/10.3115/v1/D14-1162
Penzel, T., Fricke, R., Brandenburg, U., Becker, H. F., & Vogelmeier, C. (2002). Peripheral arterial tonometry monitors changes of autonomous nervous system in sleep apnea. Annual International Conference of the IEEE Engineering in Medicine and Biology - Proceedings, 2, 1552–1553. https://doi.org/10.1109/IEMBS.2002.1106532
Pérez, F., & Granger, B. E. (2007). IPython: A system for interactive scientific computing. Computing in Science and Engineering, 9(3), 21–29. https://doi.org/10.1109/MCSE.2007.53
Perez, M. V., Mahaffey, K. W., Hedlin, H., Rumsfeld, J. S., Garcia, A., Ferris, T., … Turakhia, M. P. (2019). Large-scale assessment of a smartwatch to identify atrial fibrillation. New England Journal of Medicine, 381(20), 1909–1917. https://doi.org/10.1056/NEJMoa1901183
Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep contextualized word representations. Retrieved from http://arxiv.org/abs/1802.05365
Platt, O. S., Brambilla, D. J., Rosse, W. F., Milner, P. F., Castro, O., Steinberg, M. H., & Klug, P. P. (1994). Mortality in Sickle Cell Disease -- Life Expectancy and Risk Factors for Early Death. New England Journal of Medicine, 330(23), 1639–1644. https://doi.org/10.1056/NEJM199406093302303
Platt, O. S., Thorington, B. D., Brambilla, D. J., Milner, P. F., Rosse, W. F., Vichinsky, E., & Kinney, T. R. (1991). Pain in Sickle Cell Disease. New England Journal of Medicine, 325(1), 11–16. https://doi.org/10.1056/NEJM199107043250103
Qian, N. (1999). On the momentum term in gradient descent learning algorithms. Neural Networks, 12(1), 145–151. https://doi.org/10.1016/S0893-6080(98)00116-6
Rajpurkar, P., Hannun, A. Y., Haghpanahi, M., Bourn, C., & Ng, A. Y. (2017). Cardiologist-Level Arrhythmia Detection with Convolutional Neural Networks. arXiv:1707.01836.
Rauh, R., Posfay, A., & Muck-Weymann, M. (2003). Quantification of inspiratory-induced vasoconstrictive episodes: a comparison of laser Doppler fluxmetry and photoplethysmography. Clinical Physiology and Functional Imaging, 23(6), 344–348. https://doi.org/10.1046/j.1475-0961.2003.00516.x
Ren, S., He, K., Girshick, R., & Sun, J. (n.d.). Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. Retrieved from https://github.com/
Riedmiller, M., & Braun, H. (1993). A direct adaptive method for faster backpropagation learning: the RPROP algorithm. In IEEE International Conference on Neural Networks (pp. 586–591). IEEE. https://doi.org/10.1109/ICNN.1993.298623
Ruder, S. (2016). An overview of gradient descent optimization algorithms. Retrieved from http://arxiv.org/abs/1609.04747
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., … Fei-Fei, L. (2015). ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision, 115(3), 211–252. https://doi.org/10.1007/s11263-015-0816-y
Samuel, A. L. (1959). Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 3(3), 210–229.
Sangeetha, V., & Prasad, K. J. R. (2006). Syntheses of novel derivatives of 2-acetylfuro[2,3-a]carbazoles, benzo[1,2-b]-1,4-thiazepino[2,3-a]carbazoles and 1-acetyloxycarbazole-2-carbaldehydes. Indian Journal of Chemistry - Section B Organic and Medicinal Chemistry, 45(8), 1951–1954. https://doi.org/10.1002/chin.200650130
Saxe, A. M., McClelland, J. L., & Ganguli, S. (2013). Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. Retrieved from http://arxiv.org/abs/1312.6120
Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., & Batra, D. (2017). Grad-CAM: Visual explanations from deep networks via gradient-based localization. Proceedings of the IEEE International Conference on Computer Vision, 618–626. https://doi.org/10.1109/ICCV.2017.74
Shelley, K., & Shelley, S. (2001). Pulse oximeter waveform: Photoelectric plethysmography. In C. Lake, R. Hines, & C. Blitt (Eds.), Clinical Monitoring (pp. 420–428). W.B. Saunders Company.
Silver, D., & Hassabis, D. (2016). AlphaGo: Mastering the ancient game of Go with machine learning. Retrieved from https://ai.googleblog.com/2016/01/alphago-mastering-ancient-game-of-go.html
Simonyan, K., Vedaldi, A., & Zisserman, A. (2013). Deep inside convolutional networks: Visualising image classification models and saliency maps. Retrieved from http://arxiv.org/abs/1312.6034
Springenberg, J. T., Dosovitskiy, A., Brox, T., & Riedmiller, M. (2014). Striving for simplicity: The all convolutional net. Retrieved from http://arxiv.org/abs/1412.6806
Van Der Walt, S., Colbert, S. C., & Varoquaux, G. (2011). The NumPy array: A structure for efficient numerical computation. Computing in Science and Engineering, 13(2), 22–30. https://doi.org/10.1109/MCSE.2011.37
Vanderbilt University School of Medicine, Department of Pediatrics. (2011). Retrieved from https://pediatrics.mc.vanderbilt.edu/interior.php?mid=8307
Veluswamy, S., Shah, P., Khaleel, M., Puliyel, M., Thuptimdang, W., Chalacheva, P., … Coates, T. D. (2018). Sickle cell subjects have a stronger and faster neurally mediated vasoconstriction response to cold pain that correlates with anxiety scores. Blood, 132(Suppl 1), 854. https://doi.org/10.1182/BLOOD-2018-99-113655
Veluswamy, S., Shah, P., Khaleel, M., Puliyel, M., Thuptimdang, W., Chalacheva, P., … Coates, T. D. (2017). Cold pain causes the greatest drop in microvascular blood flow in sickle cell disease subjects and normal controls exposed to graded thermal stimuli. Blood, 130(Suppl 1). Retrieved from http://www.bloodjournal.org/content/130/Suppl_1/4773
Virtanen, P., Gommers, R., Oliphant, T. E., Haberland, M., Reddy, T., Cournapeau, D., … SciPy 1.0 Contributors. (2019). SciPy 1.0 -- Fundamental algorithms for scientific computing in Python. arXiv:1907.10121.
Voss, A., Wessel, N., Baier, V., Osterziel, K. J., Kurths, J., & Dietz, R. (2000). Symbolic dynamics -- a powerful tool in non-invasive biomedical signal processing. Applied Sciences, 1–8.
Wei, X.-S. (2015). Must know tips/tricks in deep neural networks. Retrieved from http://lamda.nju.edu.cn/weixs/project/CNNTricks/CNNTricks.html
Werbos, P. J. (1990). Backpropagation through time: What it does and how to do it. Proceedings of the IEEE, 78(10), 1550–1560. https://doi.org/10.1109/5.58337
Wikipedia Contributors. (2019). Artificial neural network. Retrieved January 5, 2019, from https://en.wikipedia.org/w/index.php?title=Artificial_neural_network&oldid=877015996
Willen, S. M., Rodeghier, M., Rosen, C. L., & DeBaun, M. R. (2017). Sleep disordered breathing does not predict acute severe pain episodes in children with sickle cell anemia. American Journal of Hematology, 93(4), 478–485. https://doi.org/10.1002/ajh.25013
Xiong, W., Droppo, J., Huang, X., Seide, F., Seltzer, M. L., Stolcke, A., … Zweig, G. (2016). Achieving human parity in conversational speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25(12), 2410–2423. https://doi.org/10.1109/TASLP.2017.2756440
Yang, S., Fang, B., Tang, W., Wu, X., Qian, J., & Yang, W. (2018). Faster R-CNN based microscopic cell detection. In 2017 International Conference on Security, Pattern Analysis, and Cybernetics, SPAC 2017 (pp. 345–350). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/SPAC.2017.8304302
Yaroglu Kazanci, S. (2015). Attention deficit hyperactivity disorder in a patient with congenital mirror movement disorder and colpocephaly. Iranian Journal of Pediatrics, 25(5). https://doi.org/10.5812/ijp.1787
Yates, A. (2018). 10 statistics and facts on sickle cell disease. Retrieved January 1, 2019, from https://www.verywellhealth.com/things-may-not-know-sickle-cell-disease-401318
Zeiler, M. D., & Fergus, R. (2013). Visualizing and understanding convolutional networks. Retrieved from http://arxiv.org/abs/1311.2901
Zhao, H., Chen, X., Zhang, D., Zhu, W., & Shi, F. (2018). Automatic localization and segmentation of optical disk based on faster R-CNN and level set in fundus image. In E. D. Angelini & B. A. Landman (Eds.), Medical Imaging 2018: Image Processing (p. 65). SPIE. https://doi.org/10.1117/12.2292950

Appendix I  Models and Hyper-parameters

1. Base Models with Feature Inputs

The base models with feature inputs were implemented with the open-source Python machine learning library scikit-learn (Pedregosa et al., 2011). Detailed explanations of the hyper-parameters can be found in the scikit-learn API documentation.

Logistic regression:
  C: 0.01
  class_weight: 'balanced'
  solver: 'lbfgs'

SVM with linear kernel:
  kernel: 'linear'
  C: 0.005
  class_weight: 'balanced'

SVM with RBF kernel:
  kernel: 'rbf'
  C: 0.2
  gamma: 'scale'
  class_weight: 'balanced'

Random Forest:
  n_estimators: 1000
  criterion: 'entropy'
  max_leaf_nodes: 2
  class_weight: 'balanced'
  random_state: the same seed that controls the subject split

ExtraTrees:
  n_estimators: 1000
  criterion: 'gini'
  max_leaf_nodes: 4
  class_weight: 'balanced'
  random_state: the same seed that controls the subject split

AdaBoost:
  base_estimator: DecisionTreeClassifier(max_depth=1, class_weight='balanced')
  n_estimators: 100
  learning_rate: 0.05
  random_state: the same seed that controls the subject split
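As a concrete illustration, these base models could be instantiated as in the minimal sketch below, assuming a scikit-learn version contemporary with this work (where AdaBoostClassifier still takes base_estimator); the variable seed is a stand-in for the random seed that controls the subject split.

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import (RandomForestClassifier, ExtraTreesClassifier,
                              AdaBoostClassifier)
from sklearn.tree import DecisionTreeClassifier

seed = 0  # stand-in for the seed controlling the subject split

base_models = {
    'logistic_regression': LogisticRegression(C=0.01, class_weight='balanced',
                                              solver='lbfgs'),
    'svm_linear': SVC(kernel='linear', C=0.005, class_weight='balanced'),
    'svm_rbf': SVC(kernel='rbf', C=0.2, gamma='scale', class_weight='balanced'),
    'random_forest': RandomForestClassifier(n_estimators=1000, criterion='entropy',
                                            max_leaf_nodes=2, class_weight='balanced',
                                            random_state=seed),
    'extra_trees': ExtraTreesClassifier(n_estimators=1000, criterion='gini',
                                        max_leaf_nodes=4, class_weight='balanced',
                                        random_state=seed),
    'adaboost': AdaBoostClassifier(
        base_estimator=DecisionTreeClassifier(max_depth=1, class_weight='balanced'),
        n_estimators=100, learning_rate=0.05, random_state=seed),
}
# Each model is then fit on the training features, e.g.:
# base_models['random_forest'].fit(X_train, y_train)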
2. CNN

The CNN model was implemented with the open-source Python deep learning framework PyTorch (Paszke et al., 2019). It was trained using the Adam optimizer with learning rate 0.001 and weight decay 1.

Model structure:
  Convolutional Layer 1D: output_channels=32, kernel_size=7, padding=3
  ReLU activation
  Batch Normalization 1D
  MaxPooling 1D: size=3, ceil_mode=True
  Dropout: ratio=0.5
  Convolutional Layer 1D: output_channels=64, kernel_size=5, padding=2
  ReLU activation
  Batch Normalization 1D
  MaxPooling 1D: size=3, ceil_mode=True
  Dropout: ratio=0.5
  Convolutional Layer 1D: output_channels=64, kernel_size=3, padding=1
  ReLU activation
  Batch Normalization 1D
  MaxPooling 1D: size=3, ceil_mode=True
  Dropout: ratio=0.5
  Convolutional Layer 1D: output_channels=128, kernel_size=3, padding=1
  ReLU activation
  Batch Normalization 1D
  Dropout: ratio=0.5
  Global Average Pooling 1D
  Fully Connected Layer (Linear Layer): output_channels=1

3. MLP

The MLP model was implemented with the open-source Python deep learning framework PyTorch (Paszke et al., 2019). It was trained using the Adam optimizer with learning rate 0.001 and weight decay 1. It contains one hidden layer of size 100, with ReLU as the activation function for the hidden layer.
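As an illustration, the structures above map onto PyTorch modules roughly as in the sketch below. This is a minimal sketch, not the exact training code; in particular, the number of input channels (in_channels) is an assumption here, standing in for the PPGa/RRI input described earlier.

import torch
import torch.nn as nn

class CNN1d(nn.Module):
    """Sketch of the 1-D CNN listed above: Conv -> ReLU -> BatchNorm ->
    MaxPool -> Dropout blocks, global average pooling, linear output."""
    def __init__(self, in_channels=2):  # channel count is an assumption
        super().__init__()
        def block(c_in, c_out, k, pad, pool=True):
            layers = [nn.Conv1d(c_in, c_out, kernel_size=k, padding=pad),
                      nn.ReLU(),
                      nn.BatchNorm1d(c_out)]
            if pool:
                layers.append(nn.MaxPool1d(3, ceil_mode=True))
            layers.append(nn.Dropout(0.5))
            return layers
        self.features = nn.Sequential(
            *block(in_channels, 32, 7, 3),
            *block(32, 64, 5, 2),
            *block(64, 64, 3, 1),
            *block(64, 128, 3, 1, pool=False),
            nn.AdaptiveAvgPool1d(1),   # global average pooling 1D
        )
        self.fc = nn.Linear(128, 1)    # single output logit

    def forward(self, x):              # x: (batch, channels, length)
        return self.fc(self.features(x).squeeze(-1))

class MLP(nn.Module):
    """Sketch of the MLP: one hidden layer of size 100 with ReLU."""
    def __init__(self, in_features):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_features, 100),
                                 nn.ReLU(),
                                 nn.Linear(100, 1))
    def forward(self, x):
        return self.net(x)

# Adam with the stated learning rate and weight decay:
model = CNN1d()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=1.0)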
Appendix II  Model Performance

1. Models Predicting Post-PSG Pain Category

Each table shows the performance of one meta or base model with different sets of inputs; each column gives the results for one set of inputs. Results are F1-scores from 20 trials, where each trial uses a subject split (for cross-validation) controlled by a different random seed. The mean F1-score over the 20 trials represents the overall performance of a model with a given set of inputs. Results (p-values) of paired t-tests between input sets are shown below the F1-scores; p-values below 0.05, and especially below 0.01, mark significant differences.

F1-score for meta model (MLP) predicting post-PSG pain category
Base models: Models 0-3 stack 6 models; Models 4-6 stack 7 models (6 + CNN).
Model 0: 21 features (select Mvasoc); Model 1: 21 features (select Avasoc); Model 2: 20 features (no Mvasoc, Avasoc); Model 3: 15 features (no vasoc-related); Model 4: 21 features (select Mvasoc); Model 5: 21 features (select Avasoc); Model 6: 20 features (no Mvasoc, Avasoc).

Seed  Model 0  Model 1  Model 2  Model 3  Model 4  Model 5  Model 6
1     0.4262   0.3944   0.3714   0.3971   0.3857   0.3973   0.3913
2     0.4196   0.4054   0.4651   0.3944   0.4341   0.4138   0.4225
3     0.4062   0.3944   0.3944   0.4122   0.4286   0.4060   0.4148
4     0.4189   0.3942   0.4320   0.3971   0.4255   0.3946   0.4058
5     0.4308   0.4394   0.4154   0.4118   0.4255   0.4173   0.4029
6     0.4167   0.4219   0.4154   0.4029   0.4511   0.4211   0.4348
7     0.4496   0.4252   0.3916   0.3944   0.4189   0.4258   0.4576
8     0.4306   0.4308   0.4203   0.4252   0.4394   0.4203   0.4296
9     0.4143   0.3972   0.4590   0.4029   0.4262   0.3830   0.4522
10    0.4133   0.4118   0.4062   0.4143   0.4255   0.4276   0.4737
11    0.4082   0.4308   0.4179   0.4286   0.4545   0.4286   0.3971
12    0.4173   0.4161   0.4138   0.4219   0.4265   0.4027   0.4088
13    0.4088   0.3944   0.3717   0.3867   0.4211   0.4056   0.3916
14    0.4174   0.4177   0.4370   0.4179   0.4375   0.4167   0.4122
15    0.4058   0.4113   0.4154   0.4179   0.4275   0.4384   0.4306
16    0.4317   0.4394   0.4341   0.4027   0.4286   0.4091   0.4091
17    0.3969   0.4203   0.4258   0.4000   0.4296   0.4118   0.4154
18    0.4143   0.3852   0.4179   0.4068   0.4275   0.4328   0.4088
19    0.4058   0.3973   0.3889   0.3497   0.4058   0.4148   0.3623
20    0.4179   0.3846   0.4058   0.3946   0.3913   0.4031   0.3885
Mean         0.4175  0.4106  0.4150  0.4040  0.4255  0.4135  0.4155
95% conf lo  0.4123  0.4030  0.4042  0.3964  0.4183  0.4075  0.4041
95% conf hi  0.4227  0.4182  0.4257  0.4115  0.4327  0.4196  0.4269

Paired t-test (p-values):
         Model 0  Model 1  Model 2  Model 3  Model 4  Model 5  Model 6
Model 0  N/A      0.0407   0.3403   0.0036   0.0568   0.1690   0.3629
Model 1  0.0407   N/A      0.2370   0.0619   0.0003   0.2357   0.2162
Model 2  0.3403   0.2370   N/A      0.0346   0.0204   0.4144   0.4686
Model 3  0.0036   0.0619   0.0346   N/A      0.0000   0.0181   0.0235
Model 4  0.0568   0.0003   0.0204   0.0000   N/A      0.0024   0.0453
Model 5  0.1690   0.2357   0.4144   0.0181   0.0024   N/A      0.3730
Model 6  0.3629   0.2162   0.4686   0.0235   0.0453   0.3730   N/A

F1-score for logistic regression predicting post-PSG pain category
Model 0: 21 features (select Mvasoc); Model 1: 21 features (select Avasoc); Model 2: 20 features (no Mvasoc, Avasoc); Model 3: 15 features (no vasoc-related).

Seed  Model 0  Model 1  Model 2  Model 3
1     0.3889   0.4000   0.3784   0.3889
2     0.3929   0.3964   0.3860   0.3652
3     0.3704   0.3853   0.4000   0.3750
4     0.3551   0.3585   0.3738   0.3670
5     0.3619   0.3619   0.3654   0.3894
6     0.4144   0.4259   0.4182   0.3810
7     0.3600   0.3529   0.3600   0.3396
8     0.3964   0.3717   0.3717   0.3889
9     0.4381   0.4314   0.4038   0.4118
10    0.3423   0.3585   0.3519   0.3853
11    0.3208   0.3429   0.3269   0.3619
12    0.3621   0.3540   0.3652   0.4035
13    0.3551   0.3551   0.3429   0.3619
14    0.3048   0.3486   0.3019   0.3333
15    0.3423   0.3423   0.3019   0.3455
16    0.3964   0.4144   0.3784   0.3964
17    0.3964   0.3818   0.3889   0.3925
18    0.3894   0.4107   0.3929   0.3670
19    0.3670   0.3774   0.3810   0.3853
20    0.4464   0.4505   0.4286   0.3717
Mean         0.3751  0.3810  0.3709  0.3756
95% conf lo  0.3594  0.3669  0.3561  0.3664
95% conf hi  0.3907  0.3951  0.3856  0.3847

Paired t-test (p-values):
         Model 0  Model 1  Model 2  Model 3
Model 0  N/A      0.0472   0.1475   0.4702
Model 1  0.0472   N/A      0.0117   0.2107
Model 2  0.1475   0.0117   N/A      0.2322
Model 3  0.4702   0.2107   0.2322   N/A

F1-score for SVM with linear kernel predicting post-PSG pain category
Model 0: 21 features (select Mvasoc); Model 1: 21 features (select Avasoc); Model 2: 20 features (no Mvasoc, Avasoc); Model 3: 15 features (no vasoc-related).

Seed  Model 0  Model 1  Model 2  Model 3
1     0.3761   0.3966   0.3571   0.4000
2     0.4151   0.3818   0.4107   0.3784
3     0.3738   0.4071   0.4037   0.4078
4     0.3529   0.3774   0.3462   0.3762
5     0.3551   0.3860   0.3962   0.3619
6     0.4324   0.3966   0.4071   0.4000
7     0.3619   0.3704   0.3846   0.3168
8     0.4259   0.4174   0.4324   0.4040
9     0.4466   0.4528   0.4314   0.4040
10    0.3717   0.3784   0.3818   0.3960
11    0.3429   0.3670   0.3551   0.3636
12    0.3793   0.3894   0.3717   0.4112
13    0.3670   0.3636   0.3619   0.3810
14    0.3462   0.3551   0.3619   0.3434
15    0.3670   0.3894   0.3364   0.3434
16    0.4074   0.4074   0.3551   0.3922
17    0.3636   0.3604   0.3670   0.3838
18    0.3750   0.4068   0.3894   0.3689
19    0.4118   0.4272   0.3962   0.3960
20    0.4381   0.4444   0.4259   0.3600
Mean         0.3855  0.3938  0.3836  0.3794
95% conf lo  0.3713  0.3819  0.3709  0.3683
95% conf hi  0.3997  0.4056  0.3962  0.3905

Paired t-test (p-values):
         Model 0  Model 1  Model 2  Model 3
Model 0  N/A      0.0337   0.3521   0.1956
Model 1  0.0337   N/A      0.0318   0.0188
Model 2  0.3521   0.0318   N/A      0.2858
Model 3  0.1956   0.0188   0.2858   N/A

F1-score for SVM with RBF kernel predicting post-PSG pain category
Model 0: 21 features (select Mvasoc); Model 1: 21 features (select Avasoc); Model 2: 20 features (no Mvasoc, Avasoc); Model 3: 15 features (no vasoc-related).

Seed  Model 0  Model 1  Model 2  Model 3
1     0.3750   0.3621   0.3684   0.3883
2     0.3962   0.3704   0.3689   0.3585
3     0.3704   0.3704   0.3551   0.3396
4     0.3551   0.3604   0.3774   0.3208
5     0.3619   0.3717   0.3619   0.3519
6     0.4038   0.4423   0.3960   0.3846
7     0.3962   0.4037   0.3883   0.3529
8     0.3889   0.4071   0.4037   0.3689
9     0.3818   0.3966   0.4000   0.3878
10    0.3774   0.3423   0.3551   0.3738
11    0.3200   0.3238   0.3495   0.3918
12    0.3486   0.3455   0.3670   0.3519
13    0.3519   0.3860   0.3925   0.3774
14    0.3238   0.3509   0.3077   0.3564
15    0.3529   0.3551   0.3333   0.3636
16    0.3495   0.3604   0.3619   0.3619
17    0.3774   0.4000   0.3738   0.3429
18    0.4324   0.4174   0.4182   0.3654
19    0.3918   0.4242   0.3838   0.3366
20    0.3960   0.3810   0.3762   0.3529
Mean         0.3726  0.3786  0.3719  0.3614
95% conf lo  0.3605  0.3652  0.3608  0.3531
95% conf hi  0.3846  0.3920  0.3831  0.3697

Paired t-test (p-values):
         Model 0  Model 1  Model 2  Model 3
Model 0  N/A      0.0974   0.4438   0.0784
Model 1  0.0974   N/A      0.0849   0.0250
Model 2  0.4438   0.0849   N/A      0.0656
Model 3  0.0784   0.0250   0.0656   N/A

F1-score for SVM with polynomial kernel predicting post-PSG pain category
Model 0: 21 features (select Mvasoc); Model 1: 21 features (select Avasoc); Model 2: 20 features (no Mvasoc, Avasoc); Model 3: 15 features (no vasoc-related).

Seed  Model 0  Model 1  Model 2  Model 3
1     0.1944   0.1690   0.1644   0.2143
2     0.2025   0.1707   0.1481   0.2391
3     0.1975   0.1500   0.1299   0.1667
4     0.2933   0.2703   0.2564   0.2198
5     0.2169   0.1750   0.2069   0.1750
6     0.2500   0.2286   0.1918   0.2143
7     0.2895   0.1972   0.2254   0.2619
8     0.2308   0.2222   0.2169   0.2824
9     0.2469   0.2892   0.2716   0.2759
10    0.1928   0.1687   0.1220   0.2273
11    0.1579   0.1519   0.1481   0.2273
12    0.2667   0.2564   0.2532   0.2500
13    0.2785   0.2651   0.2892   0.2708
14    0.1842   0.1351   0.1538   0.1702
15    0.2750   0.2683   0.2683   0.1839
16    0.3590   0.2750   0.2597   0.2022
17    0.3467   0.2597   0.2400   0.2268
18    0.2169   0.2353   0.2195   0.2796
19    0.1795   0.2105   0.1892   0.2697
20    0.2029   0.2388   0.2609   0.2651
Mean         0.2391  0.2169  0.2108  0.2311
95% conf lo  0.2149  0.1955  0.1879  0.2145
95% conf hi  0.2633  0.2382  0.2336  0.2478

Paired t-test (p-values):
         Model 0  Model 1  Model 2  Model 3
Model 0  N/A      0.0087   0.0034   0.2987
Model 1  0.0087   N/A      0.1112   0.0998
Model 2  0.0034   0.1112   N/A      0.0476
Model 3  0.2987   0.0998   0.0476   N/A

F1-score for random forest predicting post-PSG pain category
Model 0: 21 features (select Mvasoc); Model 1: 21 features (select Avasoc); Model 2: 20 features (no Mvasoc, Avasoc); Model 3: 15 features (no vasoc-related).

Seed  Model 0  Model 1  Model 2  Model 3
1     0.3457   0.3614   0.3750   0.3953
2     0.3250   0.3704   0.3377   0.4146
3     0.3250   0.3457   0.3291   0.3171
4     0.3810   0.3571   0.3810   0.4146
5     0.3704   0.3415   0.3678   0.4000
6     0.4250   0.3750   0.4000   0.4146
7     0.3908   0.3864   0.3529   0.3864
8     0.3000   0.3038   0.3333   0.4146
9     0.3733   0.3590   0.4156   0.4048
10    0.4103   0.3733   0.4390   0.4691
11    0.3158   0.3077   0.3171   0.4235
12    0.3333   0.3333   0.3765   0.4471
13    0.4103   0.4156   0.3684   0.4595
14    0.3855   0.4048   0.3810   0.4471
15    0.3146   0.3182   0.3059   0.3953
16    0.3377   0.3421   0.3377   0.3678
17    0.2791   0.2651   0.3023   0.3721
18    0.3659   0.3704   0.3750   0.4000
19    0.4000   0.4051   0.4000   0.4337
20    0.4444   0.4304   0.4634   0.3953
Mean         0.3617  0.3583  0.3679  0.4086
95% conf lo  0.3420  0.3404  0.3493  0.3935
95% conf hi  0.3813  0.3762  0.3866  0.4238

Paired t-test (p-values):
         Model 0  Model 1  Model 2  Model 3
Model 0  N/A      0.2500   0.1260   0.0001
Model 1  0.2500   N/A      0.0921   0.0000
Model 2  0.1260   0.0921   N/A      0.0002
Model 3  0.0001   0.0000   0.0002   N/A

F1-score for ExtraTrees predicting post-PSG pain category
Model 0: 21 features (select Mvasoc); Model 1: 21 features (select Avasoc); Model 2: 20 features (no Mvasoc, Avasoc); Model 3: 15 features (no vasoc-related).

Seed  Model 0  Model 1  Model 2  Model 3
1     0.3922   0.3762   0.3774   0.3853
2     0.4231   0.4078   0.4000   0.4074
3     0.3774   0.3925   0.3774   0.3636
4     0.3585   0.3429   0.3396   0.3462
5     0.3962   0.3962   0.3964   0.3784
6     0.3434   0.3673   0.3846   0.3883
7     0.3800   0.3883   0.3689   0.4144
8     0.3774   0.3846   0.3654   0.4000
9     0.4158   0.4158   0.4242   0.4112
10    0.4381   0.4272   0.4340   0.3889
11    0.3434   0.3469   0.3333   0.3495
12    0.3564   0.3600   0.3704   0.3571
13    0.3762   0.3800   0.4078   0.4299
14    0.3585   0.3619   0.3704   0.4037
15    0.3689   0.3689   0.3774   0.3964
16    0.3495   0.3462   0.3396   0.3519
17    0.3366   0.3400   0.3301   0.3429
18    0.3529   0.3564   0.4000   0.3889
19    0.3818   0.3784   0.3750   0.3571
20    0.4231   0.4231   0.4000   0.3964
Mean         0.3775  0.3780  0.3786  0.3829
95% conf lo  0.3645  0.3664  0.3660  0.3716
95% conf hi  0.3904  0.3897  0.3912  0.3941

Paired t-test (p-values):
         Model 0  Model 1  Model 2  Model 3
Model 0  N/A      0.4015   0.4020   0.1994
Model 1  0.4015   N/A      0.4404   0.1885
Model 2  0.4020   0.4404   N/A      0.1946
Model 3  0.1994   0.1885   0.1946   N/A

F1-score for AdaBoost predicting post-PSG pain category
Model 0: 21 features (select Mvasoc); Model 1: 21 features (select Avasoc); Model 2: 20 features (no Mvasoc, Avasoc); Model 3: 15 features (no vasoc-related).

Seed  Model 0  Model 1  Model 2  Model 3
1     0.2418   0.2581   0.2444   0.3469
2     0.4000   0.3958   0.4000   0.4565
3     0.3404   0.3617   0.3368   0.3371
4     0.3297   0.3297   0.3297   0.3516
5     0.3333   0.3469   0.3400   0.3529
6     0.3556   0.3556   0.3556   0.3542
7     0.3711   0.3750   0.3673   0.3960
8     0.3478   0.3571   0.3820   0.4000
9     0.3488   0.3765   0.3810   0.4000
10    0.4231   0.4231   0.4423   0.4082
11    0.3333   0.3368   0.3333   0.3913
12    0.3762   0.4082   0.3878   0.3711
13    0.3656   0.3958   0.3958   0.4138
14    0.3656   0.3617   0.3617   0.3673
15    0.3226   0.3297   0.3333   0.3542
16    0.3226   0.3077   0.3043   0.3023
17    0.3469   0.3299   0.3333   0.3913
18    0.3711   0.3542   0.3579   0.4286
19    0.3956   0.3736   0.3956   0.4082
20    0.3778   0.3864   0.3820   0.4211
Mean         0.3534  0.3582  0.3582  0.3826
95% conf lo  0.3370  0.3417  0.3398  0.3665
95% conf hi  0.3699  0.3746  0.3766  0.3988

Paired t-test (p-values):
         Model 0  Model 1  Model 2  Model 3
Model 0  N/A      0.0975   0.0815   0.0003
Model 1  0.0975   N/A      0.4957   0.0019
Model 2  0.0815   0.4957   N/A      0.0015
Model 3  0.0003   0.0019   0.0015   N/A

F1-score for gradient boosting trees predicting post-PSG pain category
Model 0: 21 features (select Mvasoc); Model 1: 21 features (select Avasoc); Model 2: 20 features (no Mvasoc, Avasoc); Model 3: 15 features (no vasoc-related).

Seed  Model 0  Model 1  Model 2  Model 3
1     0.1579   0.1690   0.1867   0.1493
2     0.2857   0.2941   0.3099   0.2540
3     0.2121   0.2388   0.1714   0.2609
4     0.2609   0.2059   0.2609   0.2899
5     0.2462   0.2462   0.2388   0.3099
6     0.2778   0.3143   0.2571   0.2609
7     0.2813   0.2817   0.2985   0.3288
8     0.1538   0.1613   0.1765   0.2951
9     0.2769   0.2319   0.2727   0.2941
10    0.2368   0.2424   0.2353   0.2121
11    0.2000   0.2222   0.1695   0.2333
12    0.2903   0.3125   0.3226   0.3051
13    0.3385   0.2813   0.3235   0.2857
14    0.1818   0.1493   0.1972   0.1905
15    0.2059   0.2647   0.2609   0.2778
16    0.1818   0.1765   0.1538   0.1818
17    0.1905   0.2222   0.1846   0.2535
18    0.1587   0.2687   0.1791   0.3030
19    0.2121   0.2090   0.2154   0.2222
20    0.2581   0.2373   0.2712   0.2258
Mean         0.2304  0.2365  0.2343  0.2567
95% conf lo  0.2074  0.2152  0.2101  0.2356
95% conf hi  0.2533  0.2577  0.2585  0.2778

Paired t-test (p-values):
         Model 0  Model 1  Model 2  Model 3
Model 0  N/A      0.2463   0.2369   0.0192
Model 1  0.2463   N/A      0.4068   0.0284
Model 2  0.2369   0.4068   N/A      0.0372
Model 3  0.0192   0.0284   0.0372   N/A

F1-score for XGBoost predicting post-PSG pain category
Model 0: 21 features (select Mvasoc); Model 1: 21 features (select Avasoc); Model 2: 20 features (no Mvasoc, Avasoc); Model 3: 15 features (no vasoc-related).

Seed  Model 0  Model 1  Model 2  Model 3
1     0.3291   0.3373   0.3210   0.3529
2     0.3133   0.4000   0.3250   0.4000
3     0.3488   0.3457   0.3373   0.3908
4     0.3250   0.2791   0.2963   0.3077
5     0.2927   0.2785   0.2785   0.3409
6     0.3765   0.3133   0.3373   0.3696
7     0.2683   0.3095   0.3095   0.2619
8     0.3250   0.2895   0.3421   0.2963
9     0.3333   0.3457   0.3721   0.3038
10    0.3488   0.3171   0.3704   0.3590
11    0.2439   0.2927   0.2824   0.3896
12    0.3333   0.3810   0.3571   0.4000
13    0.3171   0.3500   0.2963   0.3636
14    0.2564   0.2927   0.2716   0.3333
15    0.3256   0.3415   0.3488   0.3291
16    0.2353   0.2619   0.2791   0.2651
17    0.3457   0.3855   0.3500   0.3864
18    0.2927   0.3294   0.3409   0.3789
19    0.2892   0.3023   0.3373   0.2750
20    0.3014   0.3429   0.2899   0.3889
Mean         0.3101  0.3248  0.3221  0.3446
95% conf lo  0.2937  0.3082  0.3081  0.3245
95% conf hi  0.3265  0.3414  0.3362  0.3648

Paired t-test (p-values):
         Model 0  Model 1  Model 2  Model 3
Model 0  N/A      0.0472   0.0302   0.0019
Model 1  0.0472   N/A      0.3687   0.0123
Model 2  0.0302   0.3687   N/A      0.0352
Model 3  0.0019   0.0123   0.0352   N/A

F1-score for CNN predicting post-PSG pain category
Input: PPGaRRI.

Seed  CNN Model
1     0.3871
2     0.3974
3     0.4133
4     0.3370
5     0.3803
6     0.3537
7     0.3636
8     0.3467
9     0.3399
10    0.3648
11    0.3399
12    0.3418
13    0.3268
14    0.3659
15    0.3509
16    0.3293
17    0.3974
18    0.3673
19    0.3550
20    0.3586
Mean         0.3608
95% conf lo  0.3503
95% conf hi  0.3713

F1-score for MLP predicting post-PSG pain category
Model 0: 21 features (select Mvasoc); Model 1: 21 features (select Avasoc); Model 2: 20 features (no Mvasoc, Avasoc); Model 3: 15 features (no vasoc-related); Model 4: Base 7 (no features); Model 5: 21 features + Base 7 (select Mvasoc).

Seed  Model 0  Model 1  Model 2  Model 3  Model 4  Model 5
1     0.3750   0.3910   0.3759   0.4062   0.3649   0.3857
2     0.4113   0.4265   0.4211   0.4085   0.4189   0.4341
3     0.4000   0.3796   0.3714   0.4000   0.3832   0.4286
4     0.4054   0.4296   0.4138   0.3881   0.3924   0.4255
5     0.4186   0.4370   0.5045   0.4068   0.3441   0.4255
6     0.4110   0.4348   0.3830   0.4000   0.4324   0.4511
7     0.3810   0.3971   0.4127   0.3885   0.3316   0.4189
8     0.4483   0.4242   0.4870   0.4590   0.4000   0.4394
9     0.4394   0.4091   0.4030   0.4103   0.4028   0.4262
10    0.3651   0.4161   0.3937   0.4202   0.3758   0.4255
11    0.4655   0.3885   0.4426   0.4234   0.3605   0.4545
12    0.4151   0.4429   0.4274   0.4127   0.3536   0.4265
13    0.3939   0.4031   0.4211   0.3857   0.4192   0.4211
14    0.4091   0.4590   0.4160   0.3607   0.3452   0.4375
15    0.4127   0.4000   0.4194   0.4779   0.3879   0.4275
16    0.4355   0.4593   0.4444   0.4062   0.3420   0.4286
17    0.4478   0.3741   0.4194   0.4000   0.4026   0.4296
18    0.4444   0.4098   0.4058   0.3876   0.3708   0.4275
19    0.4056   0.3759   0.3881   0.3857   0.4031   0.4058
20    0.4094   0.3968   0.4062   0.4091   0.4234   0.3913
Mean         0.4147  0.4127  0.4178  0.4068  0.3827  0.4255
95% conf lo  0.4033  0.4014  0.4033  0.3956  0.3695  0.4183
95% conf hi  0.4261  0.4240  0.4324  0.4181  0.3960  0.4327

Paired t-test (p-values):
         Model 0  Model 1  Model 2  Model 3  Model 4  Model 5
Model 0  N/A      0.3652   0.3295   0.0936   0.0009   0.0276
Model 1  0.3652   N/A      0.2143   0.2220   0.0055   0.0154
Model 2  0.3295   0.2143   N/A      0.0543   0.0035   0.1627
Model 3  0.0936   0.2220   0.0543   N/A      0.0077   0.0021
Model 4  0.0009   0.0055   0.0035   0.0077   N/A      0.0000
Model 5  0.0276   0.0154   0.1627   0.0021   0.0000   N/A

2. Models Predicting Pre-PSG Pain Category

The tables in this section follow the same format as those in the previous section.

F1-score for meta model (MLP) predicting pre-PSG pain category
Base models: Models 0-3 stack 6 models; Models 4-6 stack 7 models (6 + CNN).
Model 0: 21 features (select Mvasoc); Model 1: 21 features (select Avasoc); Model 2: 20 features (no Mvasoc, Avasoc); Model 3: 15 features (no vasoc-related); Model 4: 21 features (select Mvasoc); Model 5: 21 features (select Avasoc); Model 6: 20 features (no Mvasoc, Avasoc).

Seed  Model 0  Model 1  Model 2  Model 3  Model 4  Model 5  Model 6
1     0.3239   0.3768   0.3359   0.3721   0.3582   0.3333   0.3504
2     0.3448   0.3546   0.3538   0.3582   0.3692   0.3506   0.3520
3     0.3942   0.3866   0.3571   0.3652   0.3741   0.3796   0.3504
4     0.4298   0.4299   0.3600   0.3607   0.3741   0.3803   0.3780
5     0.3711   0.3065   0.3231   0.3179   0.3721   0.3165   0.3562
6     0.3188   0.3217   0.4037   0.3259   0.3636   0.3810   0.3404
7     0.3564   0.3226   0.3357   0.3000   0.3448   0.3692   0.3651
8     0.3731   0.3741   0.3425   0.3556   0.3548   0.3934   0.3796
9     0.3478   0.3444   0.3262   0.3243   0.3740   0.3273   0.3817
10    0.3662   0.3401   0.3492   0.3504   0.3942   0.3902   0.3636
11    0.3571   0.3916   0.3721   0.3262   0.3472   0.3704   0.3521
12    0.3200   0.2685   0.2774   0.3448   0.2667   0.2740   0.3077
13    0.3529   0.3453   0.3556   0.3467   0.3465   0.3308   0.3759
14    0.3433   0.3433   0.3286   0.3404   0.3504   0.3387   0.3750
15    0.3226   0.2975   0.3514   0.3333   0.3429   0.3548   0.3404
16    0.3465   0.3407   0.3438   0.3111   0.2958   0.3594   0.3221
17    0.3562   0.3544   0.3425   0.3881   0.3472   0.4098   0.3571
18    0.3967   0.3636   0.3556   0.3871   0.3692   0.3910   0.3521
19    0.3594   0.3089   0.3212   0.2832   0.3465   0.3624   0.3540
20    0.3910   0.3231   0.3650   0.3333   0.3182   0.3433   0.3614
Mean         0.3586  0.3447  0.3450  0.3412  0.3505  0.3578  0.3558
95% conf lo  0.3460  0.3285  0.3341  0.3292  0.3377  0.3437  0.3475
95% conf hi  0.3711  0.3609  0.3559  0.3532  0.3633  0.3719  0.3640

Paired t-test (p-values):
         Model 0  Model 1  Model 2  Model 3  Model 4  Model 5  Model 6
Model 0  N/A      0.0289   0.0437   0.0180   0.1402   0.4576   0.3149
Model 1  0.0289   N/A      0.4839   0.3233   0.2261   0.0486   0.0708
Model 2  0.0437   0.4839   N/A      0.3135   0.2042   0.0193   0.0502
Model 3  0.0180   0.3233   0.3135   N/A      0.1233   0.0262   0.0270
Model 4  0.1402   0.2261   0.2042   0.1233   N/A      0.1511   0.1508
Model 5  0.4576   0.0486   0.0193   0.0262   0.1511   N/A      0.3888
Model 6  0.3149   0.0708   0.0502   0.0270   0.1508   0.3888   N/A

F1-score for logistic regression predicting pre-PSG pain category
Model 0: 21 features (select Mvasoc); Model 1: 21 features (select Avasoc); Model 2: 20 features (no Mvasoc, Avasoc); Model 3: 15 features (no vasoc-related).

Seed  Model 0  Model 1  Model 2  Model 3
1     0.2936   0.2804   0.2883   0.2857
2     0.2941   0.2549   0.2941   0.2600
3     0.2718   0.2718   0.2718   0.2692
4     0.2800   0.2772   0.2745   0.2991
5     0.2708   0.3137   0.3168   0.2772
6     0.2526   0.2449   0.2653   0.2593
7     0.2366   0.2340   0.2553   0.2745
8     0.3158   0.3368   0.3673   0.3462
9     0.3178   0.3119   0.3019   0.2936
10    0.2718   0.2718   0.2772   0.3333
11    0.3529   0.3564   0.3564   0.2772
12    0.3107   0.3366   0.3333   0.3091
13    0.2830   0.2778   0.2642   0.2991
14    0.2941   0.3048   0.3238   0.2885
15    0.2804   0.2642   0.2778   0.3019
16    0.2500   0.2353   0.2692   0.2574
17    0.2963   0.2778   0.2909   0.2545
18    0.2476   0.2500   0.2430   0.2718
19    0.2020   0.1837   0.1875   0.2500
20    0.2708   0.2947   0.3125   0.2885
Mean         0.2796  0.2789  0.2886  0.2848
95% conf lo  0.2651  0.2610  0.2708  0.2737
95% conf hi  0.2941  0.2969  0.3063  0.2960

Paired t-test (p-values):
         Model 0  Model 1  Model 2  Model 3
Model 0  N/A      0.4344   0.0347   0.2389
Model 1  0.4344   N/A      0.0044   0.2267
Model 2  0.0347   0.0044   N/A      0.3206
Model 3  0.2389   0.2267   0.3206   N/A

F1-score for SVM with linear kernel predicting pre-PSG pain category
Model 0: 21 features (select Mvasoc); Model 1: 21 features (select Avasoc); Model 2: 20 features (no Mvasoc, Avasoc); Model 3: 15 features (no vasoc-related).

Seed  Model 0  Model 1  Model 2  Model 3
1     0.2593   0.2500   0.2857   0.3165
2     0.2476   0.2593   0.2642   0.3063
3     0.2574   0.2626   0.2553   0.2957
4     0.2883   0.2909   0.2909   0.3226
5     0.2526   0.2500   0.2526   0.3077
6     0.2476   0.2407   0.2667   0.2794
7     0.2549   0.2453   0.2453   0.2807
8     0.2718   0.2136   0.2549   0.3750
9     0.2830   0.2593   0.2885   0.2857
10    0.2321   0.2342   0.2342   0.3200
11    0.2885   0.3200   0.3238   0.2881
12    0.2909   0.2569   0.2727   0.3256
13    0.2931   0.2832   0.2832   0.3063
14    0.2810   0.2931   0.3036   0.3016
15    0.2241   0.2105   0.2087   0.2707
16    0.2264   0.2222   0.2243   0.2791
17    0.2667   0.2632   0.2564   0.2698
18    0.2564   0.2241   0.2301   0.2632
19    0.1837   0.1856   0.1684   0.2712
20    0.2243   0.2385   0.2524   0.3333
Mean         0.2565  0.2502  0.2581  0.2999
95% conf lo  0.2441  0.2363  0.2427  0.2879
95% conf hi  0.2689  0.2640  0.2735  0.3119

Paired t-test (p-values):
         Model 0  Model 1  Model 2  Model 3
Model 0  N/A      0.0842   0.3444   0.0000
Model 1  0.0842   N/A      0.0145   0.0000
Model 2  0.3444   0.0145   N/A      0.0000
Model 3  0.0000   0.0000   0.0000   N/A

F1-score for SVM with RBF kernel predicting pre-PSG pain category
Model 0: 21 features (select Mvasoc); Model 1: 21 features (select Avasoc); Model 2: 20 features (no Mvasoc, Avasoc); Model 3: 15 features (no vasoc-related).

Seed  Model 0  Model 1  Model 2  Model 3
1     0.3051   0.3025   0.2833   0.3500
2     0.2617   0.2642   0.2885   0.2941
3     0.3423   0.2936   0.3455   0.3178
4     0.3448   0.3304   0.3509   0.3273
5     0.3009   0.3091   0.3273   0.2936
6     0.3168   0.2941   0.3168   0.2718
7     0.2857   0.2752   0.2727   0.2617
8     0.2941   0.2430   0.3243   0.3273
9     0.3248   0.3252   0.3333   0.3276
10    0.3471   0.3306   0.3333   0.3303
11    0.3276   0.3248   0.3103   0.3509
12    0.3363   0.3559   0.3559   0.3130
13    0.3393   0.3091   0.3214   0.3036
14    0.3520   0.3548   0.3577   0.3158
15    0.2881   0.2881   0.2833   0.3000
16    0.3186   0.3103   0.3419   0.3051
17    0.2602   0.2439   0.2581   0.2857
18    0.2034   0.2131   0.1880   0.2500
19    0.2000   0.1802   0.1667   0.2385
20    0.3361   0.3419   0.3419   0.3276
Mean         0.3042  0.2945  0.3051  0.3046
95% conf lo  0.2848  0.2741  0.2821  0.2910
95% conf hi  0.3237  0.3149  0.3280  0.3181

Paired t-test (p-values):
         Model 0  Model 1  Model 2  Model 3
Model 0  N/A      0.0143   0.4219   0.4796
Model 1  0.0143   N/A      0.0367   0.0939
Model 2  0.4219   0.0367   N/A      0.4778
Model 3  0.4796   0.0939   0.4778   N/A

F1-score for random forest predicting pre-PSG pain category
Model 0: 21 features (select Mvasoc); Model 1: 21 features (select Avasoc); Model 2: 20 features (no Mvasoc, Avasoc); Model 3: 15 features (no vasoc-related).

Seed  Model 0  Model 1  Model 2  Model 3
1     0.2373   0.2333   0.2182   0.2258
2     0.2456   0.2373   0.1754   0.2333
3     0.1887   0.1887   0.1509   0.2687
4     0.3000   0.2951   0.3000   0.2857
5     0.2000   0.2000   0.1935   0.2535
6     0.1017   0.1356   0.1356   0.2424
7     0.2373   0.2143   0.2857   0.2462
8     0.2456   0.2143   0.2105   0.2456
9     0.2388   0.2388   0.2388   0.2807
10    0.1429   0.1455   0.1852   0.2333
11    0.1818   0.1852   0.2105   0.1639
12    0.2029   0.2059   0.2154   0.2817
13    0.2545   0.2222   0.2456   0.2500
14    0.2258   0.2258   0.2712   0.3077
15    0.2090   0.2188   0.2353   0.3117
16    0.1852   0.1887   0.1923   0.2540
17    0.2154   0.2154   0.2222   0.2121
18    0.2414   0.2373   0.2373   0.2687
19    0.2143   0.2414   0.2456   0.2571
20    0.3158   0.2759   0.3158   0.3548
Mean         0.2192  0.2160  0.2243  0.2588
95% conf lo  0.1981  0.1996  0.2039  0.2412
95% conf hi  0.2403  0.2324  0.2446  0.2764

Paired t-test (p-values):
         Model 0  Model 1  Model 2  Model 3
Model 0  N/A      0.2151   0.2340   0.0006
Model 1  0.2151   N/A      0.1099   0.0001
Model 2  0.2340   0.1099   N/A      0.0010
Model 3  0.0006   0.0001   0.0010   N/A

F1-score for ExtraTrees predicting pre-PSG pain category
Model 0: 21 features (select Mvasoc); Model 1: 21 features (select Avasoc); Model 2: 20 features (no Mvasoc, Avasoc); Model 3: 15 features (no vasoc-related).

Seed  Model 0  Model 1  Model 2  Model 3
1     0.2368   0.2632   0.2162   0.3218
2     0.2250   0.2308   0.2222   0.2195
3     0.2368   0.2368   0.2338   0.2857
4     0.2791   0.2927   0.3210   0.2609
5     0.3158   0.2933   0.3158   0.3457
6     0.2059   0.2000   0.2133   0.3023
7     0.2857   0.3143   0.3056   0.2532
8     0.2093   0.2093   0.1928   0.2143
9     0.2410   0.2410   0.2791   0.2410
10    0.1944   0.1667   0.2078   0.2824
11    0.2286   0.2162   0.2432   0.2564
12    0.2439   0.2469   0.2716   0.2588
13    0.2469   0.2308   0.2105   0.2278
14    0.3200   0.3333   0.3210   0.2791
15    0.3038   0.2785   0.2651   0.3182
16    0.2535   0.2571   0.2609   0.3077
17    0.2093   0.2250   0.2338   0.2299
18    0.1818   0.1867   0.2162   0.1975
19    0.1892   0.1918   0.2133   0.2078
20    0.2469   0.2278   0.2338   0.2921
Mean         0.2427  0.2421  0.2489  0.2651
95% conf lo  0.2249  0.2229  0.2308  0.2467
95% conf hi  0.2605  0.2613  0.2669  0.2835

Paired t-test (p-values):
         Model 0  Model 1  Model 2  Model 3
Model 0  N/A      0.4374   0.1242   0.0087
Model 1  0.4374   N/A      0.1012   0.0187
Model 2  0.1242   0.1012   N/A      0.0702
Model 3  0.0087   0.0187   0.0702   N/A

F1-score for AdaBoost predicting pre-PSG pain category
Model 0: 21 features (select Mvasoc); Model 1: 21 features (select Avasoc); Model 2: 20 features (no Mvasoc, Avasoc); Model 3: 15 features (no vasoc-related).

Seed  Model 0  Model 1  Model 2  Model 3
1     0.2651   0.2250   0.2278   0.2338
2     0.1975   0.1538   0.2025   0.2051
3     0.1558   0.1408   0.1558   0.1842
4     0.2588   0.2588   0.2381   0.3404
5     0.2683   0.2683   0.2651   0.2955
6     0.1975   0.2078   0.1951   0.2791
7     0.3095   0.3023   0.3023   0.2558
8     0.1750   0.1951   0.1750   0.2469
9     0.2553   0.2273   0.2581   0.2667
10    0.1795   0.1600   0.1818   0.1928
11    0.2727   0.2759   0.2727   0.2697
12    0.2791   0.2892   0.2824   0.2500
13    0.2353   0.2353   0.2353   0.2273
14    0.2472   0.3226   0.2667   0.3182
15    0.2500   0.1951   0.2750   0.2727
16    0.2667   0.2703   0.2667   0.2308
17    0.3000   0.2927   0.2927   0.3218
18    0.1364   0.1395   0.1364   0.2069
19    0.2338   0.2432   0.2338   0.2308
20    0.2973   0.2817   0.2973   0.2632
Mean         0.2390  0.2342  0.2380  0.2546
95% conf lo  0.2173  0.2097  0.2166  0.2358
95% conf hi  0.2607  0.2588  0.2595  0.2734

Paired t-test (p-values):
         Model 0  Model 1  Model 2  Model 3
Model 0  N/A      0.2210   0.3610   0.0579
Model 1  0.2210   N/A      0.2742   0.0187
Model 2  0.3610   0.2742   N/A      0.0467
Model 3  0.0579   0.0187   0.0467   N/A

F1-score for CNN predicting pre-PSG pain category
Input: PPGaRRI.

Seed  CNN Model
1     0.2737
2     0.2549
3     0.2907
4     0.2647
5     0.2732
6     0.2714
7     0.3077
8     0.2699
9     0.2637
10    0.2673
11    0.2624
12    0.2745
13    0.2591
14    0.2529
15    0.2640
16    0.2900
17    0.2537
18    0.2640
19    0.2737
20    0.2708
Mean         0.2701
95% conf lo  0.2642
95% conf hi  0.2760
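The summary rows (mean and 95% confidence bounds) and the paired t-tests in these tables can be reproduced along the lines of the sketch below. This is a minimal illustration with placeholder scores; the exact confidence-interval formula and the tail convention of the t-tests used for the tables are assumptions here.

# A minimal sketch, assuming NumPy and SciPy; f1_a and f1_b stand in for the
# per-seed F1-scores of two models over the same 20 seed-controlled trials.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
f1_a = rng.uniform(0.40, 0.45, size=20)  # placeholder scores for model A
f1_b = rng.uniform(0.39, 0.44, size=20)  # placeholder scores for model B

# Mean F1-score over the 20 trials, with a t-based 95% confidence interval
# of the mean (one standard choice; the thesis formula is not restated here).
mean_a = f1_a.mean()
conf_lo, conf_hi = stats.t.interval(0.95, df=len(f1_a) - 1,
                                    loc=mean_a, scale=stats.sem(f1_a))

# Paired t-test between two models evaluated on the same seeds.
t_stat, p_value = stats.ttest_rel(f1_a, f1_b)
print(f"mean={mean_a:.4f}  95% CI=({conf_lo:.4f}, {conf_hi:.4f})  p={p_value:.4f}")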
Abstract
Sickle cell disease (SCD) is an inherited genetic blood disorder associated with severe vaso-occlusive pain that can lead to catastrophic consequences. Deoxygenated sickle hemoglobin polymerizes and becomes rigid, increasing the likelihood that red blood cells block the microvasculature. Microvascular flow therefore plays a key role, as it determines the time red blood cells take to escape the microvasculature. Moreover, autonomic nervous system (ANS) function affects microvascular flow through vasoconstriction and the regulation of cardiac output, so ANS-related information may have a causal relationship with vaso-occlusive crisis (VOC).

In this work, two different machine learning approaches were applied to two different datasets to predict the risk of VOC in SCD subjects. One approach assessed risk from the vasoconstriction patterns revealed by photoplethysmogram amplitude (PPGa) signals, using pattern recognition algorithms that included self-designed features, symbolic dynamics, convolutional neural networks (CNN), and recurrent neural networks (RNN). CNN proved the most effective algorithm for identifying vasoconstriction patterns with potential risk. The other approach was to build a two-level stacking machine learning model that predicts risk from SCD subjects' physiological and ANS features. The results show that subject features have a causal relationship with risk of VOC, as they predict future risk significantly better than previous risk (F1-score 0.4255 vs 0.3505, p-value < 0.0001). The model also performed much better than the negative binomial model adopted in previous research (F1-score 0.4255 vs 0.3279). In addition, the model that included vasoconstriction-related features among its inputs performed significantly better than the model without such features (F1-score 0.4255 vs 0.4040, p-value < 0.0001).

These findings show that (1) machine learning models are effective in predicting risk of VOC for SCD subjects, and (2) ANS functions, particularly vasoconstriction-related features, have a causal relationship with risk of VOC in SCD patients. The models can assist clinicians in managing treatment plans for SCD patients and potentially reduce the risk of these life-threatening events dramatically. The approaches also scale well with the spread and popularity of wearable devices.