Physics-Informed Machine Learning Techniques for the Estimation and Uncertainty Quantification of Breath Alcohol Concentration from Transdermal Alcohol Biosensor Data

by

Clemens Bastian Oszkinat

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(APPLIED MATHEMATICS)

August 2022

Copyright 2022 Clemens Bastian Oszkinat

Thoughts without content are empty, intuitions without concepts are blind.
—Immanuel Kant, 1724-1804

Acknowledgements

I would like to thank my advisor, Gary Rosen, for giving me an exciting and fun problem to work on, for the freedom to research and investigate the things I was most interested in, for his steady advice when needed, and for the endless hours he spent with me on Zoom during the pandemic. I also want to thank my committee members, Susan Luczak and Chunming Wang, for the time they invested in me and for their support, as well as for their careful review of my work. Further, I thank my friends in the math department who made the past years unique. Thanks for all the Beer Fridays (and some other days) you spent with me; they kept me going. I thank my parents for their ongoing support during my time in grad school. Who would have thought it takes 27 years to raise a child? Finally, I thank Simon for making it through a long-distance relationship with me. A bright future is ahead of us.

Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
Chapter 2: Literature Survey
Chapter 3: Human Subject Data for Transdermal and Breath Alcohol Concentration
Chapter 4: A Diffusion Equation Model for the Transport of Alcohol through the Epidermal Layer of the Skin
    4.1 An Abstract Parabolic Framework
    4.2 Consistency and Approximation
    4.3 Numerical Results and Clustering
        4.3.1 Fitting the Population Model
        4.3.2 Numerical Demonstration of Theorem 8
        4.3.3 Clustering Methods
Chapter 5: A Physics-Informed Neural Network for the Estimation of the Distribution of Random Parameters
    5.1 Physics-Informed Adversarial Learning
        5.1.1 Kullback-Leibler Based Training
        5.1.2 Integration of the Physical Model
        5.1.3 Estimating the Parameter Distribution
        5.1.4 Posterior Distribution of the Latent Variable
        5.1.5 Network Design
    5.2 Deconvolution of BAC/BrAC from the TAC Signal
    5.3 Numerical Results
        5.3.1 Artificial Training Data
            5.3.1.1 Estimating Single Parameter Values
            5.3.1.2 Distribution Estimation
        5.3.2 Real Training Data
            5.3.2.1 Distribution of the Unknown Parameters
            5.3.2.2 Deconvolution of the BAC/BrAC Signal
Chapter 6: A Physics-Informed Long Short-Term Memory Network for the Estimation and Uncertainty Quantification of BrAC
    6.1 A Physics-Informed Long Short-Term Memory Network
    6.2 Network Architecture
    6.3 Uncertainty Quantification using Dropout
    6.4 Numerical Studies
        6.4.1 Numerical Results
        6.4.2 Error Evaluation
        6.4.3 Impact of the Parameter $\beta$
        6.4.4 Role of the Bias Terms
        6.4.5 Results on Field Data
        6.4.6 Importance of Covariates using Shapley Values
Chapter 7: On a Covariate-Dependent Hidden Markov Model and a Physics-Informed Viterbi Algorithm
    7.1 A Hidden Markov Model for BrAC from TAC
        7.1.1 Hidden Markov Models
        7.1.2 Training and Initialization
    7.2 Introducing Covariates and Using the HMM to Obtain eBrAC from TAC
        7.2.1 Covariate-Dependent State Transitions
        7.2.2 Covariate-Dependent Emissions or Observations
        7.2.3 Using the HMM to Estimate or Predict BrAC from Observations of TAC
        7.2.4 Model Complexity
    7.3 Decoding the Hidden States and Uncertainty Quantification
        7.3.1 Local Decoding and Global Decoding
        7.3.2 Emission Uncertainty
    7.4 Numerical Studies
        7.4.1 An HMM without Covariates
        7.4.2 A Covariate-Dependent HMM
        7.4.3 Uncertainty Quantification
        7.4.4 Results on Field Data
        7.4.5 Correlation Between Hidden States and BAC
    7.5 A Physics-Informed Viterbi Algorithm
        7.5.1 Results of Numerical Studies
Chapter 8: Conclusion and Ongoing Work
References

List of Tables

3.1 List of covariates.
4.1 Average error values depending on the number of clusters.
5.1 Converged mean parameter values for varying values of $\beta$.
5.2 Converged parameter variances for varying values of $\beta$.
5.3 Converged parameter values for varying values of $\sigma$.
5.4 Estimated Parameter Statistics.
6.1 Error values on training and test set.
6.2 Error values on training and test set, heart rate included.
7.1 Median (computed with respect to all training (TR) and testing (T) datasets) relative error in eBrAC and eTAC signals, peak eBrAC and eTAC values, time of peak eBrAC and eTAC, and area underneath eBrAC and eTAC curves, with eBrAC and eTAC estimated using either BrAC and TAC or TAC alone.
8.1 Comparison of the performance of three physics-informed methods.

List of Figures

1.1 (Left) Giner, Inc. WristTAS™ 7 transdermal continuous alcohol monitoring device; (right) SCRAM Systems transdermal continuous alcohol monitoring device.
3.1 BrAC signals with corresponding TAC signals.
3.2 Four BrAC signals from the dataset with corresponding TAC signals from the TAC devices placed on either arm.
4.1 Scaled histograms of individually fitted $q_i$ parameters together with probability density functions of distributional fits.
4.2 Numerical consistency result. Panel (a) shows the deconvolved BrAC signal with 90% error bands based on the true distribution and discretization parameters $N = (10, 10, 10)$. The subsequent panels show the corresponding results based on $n$ samples of the true distribution and growing values of $N$.
4.3 Left: three drinking episodes from the dataset with corresponding mean BrAC estimates and 90% error bands based on one cluster. Right: the same three drinking episodes with corresponding BrAC estimates and 90% error bands based on $k = 5$ clusters.
5.1 Generated drinking episodes for varying parameter values (left) and an example of a real-data drinking episode (right).
5.2 Histograms of the parameter values for $q_1$ and $q_2$ using varying values of $\beta$.
5.3 BrAC signal together with mean TAC signal of the 10,000 generated drinking episodes using varying standard deviations for the $q_1$ and $q_2$ distributions. The shaded pink region indicates the range of all drinking episodes corresponding to the samples of $q_1$ and $q_2$.
5.4 Histograms of the approximated $q_2$ distribution together with the pdf of the true distribution.
5.5 Graphs of the function $q_2(z)$ for the three test cases.
5.6 Histograms of the approximated $q_1$ distribution.
5.7 BrAC signal together with mean TAC signal of 10,000 generated drinking episodes using the standard deviation $\sigma = 0.1$ for the $q_1$ distribution; $q_2$ has been fixed at $q_2 = 1$. The shaded pink region indicates the range of all drinking episodes corresponding to the samples of $q_1$.
5.8 (a) Values of the loss functions over the number of iterations and (b) values of $q_1$ and $q_2$ over the number of iterations.
5.9 Samples of the estimated joint distribution of $(q_1, q_2)$. (a) shows the distribution using 88 selected drinking episodes with a standard normal distribution for the prior of the latent variable, while (b) shows the distribution using the posterior over the latent variable. (c) displays the distribution for the full set of 126 drinking episodes with a standard normal prior for the latent variable, and (d) shows the corresponding distribution using the posterior distribution of the latent variable.
5.10 Distribution of the posterior latent variable. (a) shows the histogram for 88 drinking sessions used as training data and (b) shows the histogram for 126 drinking sessions used as training data.
5.11 Results of the estimation of BAC/BrAC from TAC for different drinking episodes. (a) and (b) show a good match between real data and the estimated BAC/BrAC signal; (c) and (d) show a real signal at the upper and lower edge of the predicted credible region, respectively.
5.12 Comparison of predicted BAC/BrAC signals using two different drinking episodes. (a) and (d) show the estimated BAC/BrAC curves yielded by a standard normal distribution for the prior of the latent variable and $\beta = 1$; (b) and (e) show the corresponding results using the posterior distribution of the latent variable and $\beta = 1$; (c) and (f) display the corresponding results using a standard normal distribution for the prior of the latent variable together with $\beta = 4$.
6.1 Schematic illustration of a single LSTM cell. The figure shows the information flow in a cell between the various gates.
6.2 Structure of the proposed network for the handling of the time-dependent covariates (grouped as "TAC") and the time-independent covariates.
6.3 Performance of the physics-informed LSTM model on four drinking sessions from the test set. The figures show the BrAC data $u$, the predicted mean eBrAC signal $\bar{u}$, and credible regions for the BrAC signal.
6.4 Results of the BrAC estimation using different values for the parameter $\beta$ on a single drinking episode.
6.5 Results of the BrAC estimation using different values for the parameter $\beta$ on a single drinking episode.
6.6 Results of the BrAC estimation using models with and without bias terms. We display three drinking episodes from the test set.
6.7 Application of the LSTM model to field-data drinking sessions. We show the BrAC data, eBrAC signals obtained from the LSTM model without physics (labeled LSTM), and eBrAC signals obtained from the physics-informed LSTM model (labeled PILSTM).
6.8 Shapley values of a selection of covariates. For a description of the covariates refer to Appendix A. The order of the covariates is based on the Shapley value importance. Longer bars indicate more data points in that region, while points indicate single data points. The color encoding can be seen as a heat map: high covariate values are red, low values are blue. For 0/1 covariates such as "female", red means "yes" and blue means "no". The impact of a covariate on the BrAC AUC is indicated by the distance from the grey middle bar.
7.1 Performance of the HMM on three drinking episodes from the test set. The figures on the left display TAC and BrAC data together with their predictions when both TAC and BrAC are used as input for the Viterbi algorithm. On the right-hand side, only TAC is used as an input to the Viterbi algorithm to produce the eBrAC signal.
7.2 eBrAC signals based on the TAC signal and covariate-dependent transition probabilities for test set episodes. The figures show BrAC data, eBrAC using a covariate-dependent transition matrix, and eBrAC using a covariate-independent transition matrix.
7.3 eBrAC signals based on the TAC signal in the case of covariate-dependent linear Gaussian emissions for three drinking episodes from the test set. The figures display BrAC data, eBrAC using a covariate-dependent conditional Gaussian emission, and eBrAC using a covariate-independent conditional Gaussian emission.
7.4 Log-likelihood of the trained model together with BrAC estimation error (7.11) for the training and test sets vs. the number of included covariates. For any number $k$ of covariates, all possible combinations of $k$ covariates out of 10 have been considered and averaged.
7.5 Demonstration of the proposed uncertainty quantification for the BrAC estimation for three drinking episodes from the test set. The figures display BrAC data, eBrAC based on the global state decoding using the Viterbi algorithm, and a credible band for the BrAC signal based on the uncertainty quantification.
7.6 Application of the fit HMM model to natural drinking episodes recorded in the field. The figures show recorded TAC and BrAC signals and eBrAC based only on the TAC signal.
7.7 TAC and BrAC together with the decoded hidden states based on the TAC.
7.8 Performance of the HMM on three drinking episodes from the test set. The figures depict the true BrAC data $u$ as well as eBrAC signals $\tilde{u}$ that are based on the TAC $y$ only. The black lines show $\tilde{u}$ obtained from the physics-informed Viterbi algorithm, whereas the red lines show the estimates obtained using the classical Viterbi algorithm.
7.9 Impact of the weight $\beta$ on the behavior of the eBrAC signals obtained using the physics-informed Viterbi algorithm.
7.10 Application of the trained HMM model to naturalistic drinking episodes recorded in the field.

Abstract

In this work, we consider the problem of estimating breath alcohol concentration (BrAC) based on transdermal alcohol concentration (TAC) biosensor data. Since classical ways of obtaining BrAC suffer from the need for user participation and other disadvantages, TAC measurements provide a promising alternative for alcohol researchers and clinicians. While this problem has been studied extensively in the past, here we want to advance the research through the development and investigation of physics-informed machine learning techniques.
To this end, we first establish and discuss a mathematical diffusion partial differential equation (PDE) model with unknown parameters for the transport of ethanol through the epidermal layer of the skin. We show how the unknown parameters can be modeled as random variables and how this naturally leads to a population model in an abstract parabolic framework. We discretize the model both in time and space and, using results and techniques from linear semigroup theory, we prove convergence results for the discretized model. Further, we show that the model depends continuously on the random parameters, and thus we prove consistency of the parameter estimators. Based on this model, we consider three machine learning models and propose ways to introduce prior knowledge into these models in the form of the above-mentioned population model. This combines the data-driven machine learning models with a first-principles physical model, and so hybrid, physics-informed models are created. The application of modern machine learning techniques offers many advantages: these models are able to train efficiently on larger data sets, their flexibility allows for the inclusion of other covariates, and their complexity can capture thus-far unmodeled aspects of the relation between BrAC and TAC. At the same time, the physics information acts as a regularizing term to encourage physically meaningful behavior of these models.

In the first of these models, we use a generative adversarial network (GAN) with a residual-augmented loss function to estimate the distribution of the random parameters of the diffusion model. We then design another physics-informed neural network for the deconvolution of the BrAC signal from the TAC signal. Based on the distribution of the random parameters, this network is able to estimate the BrAC signal and quantify the uncertainty in the form of conservative error bands. Moreover, we show how a posterior latent variable can be used to sharpen these conservative error bands. We apply the techniques to an extensive data set of drinking episodes and demonstrate the advantages and shortcomings of this approach.

The second model under consideration is a long short-term memory (LSTM) network. Here, we design a network in such a way that, besides the TAC signal, both time-dependent and time-independent covariates can be processed to estimate BrAC. We use the mathematical population model to add physics information to the training stage of the model. In addition, we use a recently discovered relation between Bayesian learning and the stochastic regularization technique dropout to quantify the uncertainty of the estimates. We employ Shapley values to measure the individual importance of the covariates, and demonstrate the proposed model with numerical studies.

Finally, we develop a novel approach to estimating BrAC from TAC based on covariate-dependent physics-informed hidden Markov models with two emissions. The hidden Markov chain serves as a forward full-body alcohol model with BrAC and TAC, the two emissions, assumed to be described by a bivariate normal which depends on the hidden Markovian states and on person-level and session-level covariates via built-in regression models. An extension of hidden Markov modeling is developed wherein the hidden Markov model framework is regularized by the first-principles PDE model to yield a hybrid model.
We train the model via the Baum-Welch algorithm on 256 sets of BrAC and TAC signals and covariate measurements collected in the laboratory. Forward filtering of TAC to obtain estimated BrAC is achieved via a new physics-informed regularized Viterbi algorithm which determines the most likely path through the hidden Markov chain using TAC alone. The Markovian states are decoded and used to yield estimates of BrAC and to quantify the uncertainty in the estimates. We present and discuss numerical studies.

Chapter 1
Introduction

Alcohol researchers and clinicians alike have long been interested in qualitatively and quantitatively measuring and tracking alcohol consumption. Near-continuous measurements in particular are considered important, as they can be used for the monitoring of blood alcohol, the discovery of addiction problems, and the examination of naturalistic drinking patterns. However, producing meaningful quantitative measures of alcohol consumption in naturalistic settings is a challenging task. Typically, such measures are based on the use of a breath alcohol analyzer or a drinker's self-report. Unfortunately, both methods have their shortcomings. Obtaining the deep lung samples (alveolar air) needed for accurate results can be difficult, and alcohol remaining in the mouth after drinking often contaminates the results [1]. Moreover, the procedure obviously does not allow for continuous measurements; the intense exhaling cannot be repeated frequently. Especially in the context of naturalistic drinking settings, this method is hence not well suited. Self-reports, on the other hand, might be inaccurate as well. It is often difficult for people to judge what quantities of alcohol they consumed in a particular situation, and if an alcoholic drink was prepared by someone else, e.g. a drink at a bar, this judgment is essentially impossible. Even worse, it is known that alcohol directly affects the memory function of the brain [2], and hence a drinker's self-report might lack the required reliability. Additionally, self-monitoring via breathalyzers and self-report is not a viable option if participants are motivated to conceal their alcohol use from researchers and clinicians.

After the consumption of alcohol, the human liver metabolizes most of the alcohol. Some of the alcohol is exhaled through breath, some leaves the body through urine, and about 1% of the alcohol diffuses through the skin [3]. The excretion of ethanol through perspiration has been known since the 1930s [4-6]. Hence, measuring transdermal alcohol creates a possible alternative for the tracking of alcohol consumed. In recent years, biosensor devices have been developed that are able to measure transdermal alcohol concentration (TAC) [7-9].

Figure 1.1: (Left) Giner, Inc. WristTAS™ 7 transdermal continuous alcohol monitoring device; (right) SCRAM Systems transdermal continuous alcohol monitoring device.

TAC biosensors (two such devices are shown in Fig. 1.1) mostly work as fuel cells: in a redox reaction, four electrons are produced for every ethanol molecule. This generates a current that is measured by those biosensors and converted into TAC.
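To make the fuel-cell transduction concrete, the following sketch converts a measured cell current into a rate of ethanol oxidation via Faraday's law, using the four-electron stoichiometry just described. This is an illustration only, not the thesis's method: the function name and the example current are ours, and real devices apply proprietary calibrations to map such an oxidation rate to a reported TAC value.

    # Minimal sketch (not from the thesis): fuel-cell current -> ethanol
    # oxidation rate via Faraday's law, using the 4-electron stoichiometry.
    # The device-specific calibration from this rate to TAC is not modeled.
    FARADAY = 96485.33  # Faraday constant, C/mol
    Z = 4               # electrons transferred per ethanol molecule

    def ethanol_oxidation_rate(current_amperes):
        """Moles of ethanol oxidized per second for a given cell current."""
        return current_amperes / (Z * FARADAY)

    print(ethanol_oxidation_rate(1e-6))  # 1 microampere -> ~2.6e-12 mol/s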
The availability of TAC devices allows for near-continuous measurements of alcohol consumption and helps researchers gain insight into alcohol metabolism and drinking behavior. In addition, TAC sensors collect the data passively; i.e., contrary to self-reports or breath alcohol analyzers, no active participation of the subject is required. In this context, those devices can be used for forensic applications such as DUI monitoring, and the underlying mechanism can potentially also be extended to other drugs. Following the recent developments in the market for wearable computers, TAC biosensors could even be integrated into those devices to reach the consumer market.

However, researchers interested in drinking behavior and alcohol consumption typically base their studies on breath alcohol concentration (BrAC) or blood alcohol concentration (BAC), which have been shown to reasonably agree [10]. Hence, in order to make TAC sensors useful for alcohol research, the need arises to convert TAC signals to BAC/BrAC signals. Unfortunately, the direct conversion from TAC to BAC/BrAC proves to be difficult due to many confounding factors. Differences between devices, a subject's drinking behavior, and varying environmental conditions such as temperature and humidity lead to variations in the measurements. Intra- and inter-individual variations are additional sources that pose further challenges to the direct conversion from TAC to BAC/BrAC. The porosity, thickness, tortuosity, hydration, and vasodilation of an individual's skin are important factors in the functional relationship between BAC/BrAC and TAC. Given all these varying factors and effects, the main problem in this thesis is to robustly produce BrAC estimates (eBrAC) based on TAC measurements. This is the main application considered in this thesis.

We approach this problem in three different, yet related, ways. All of these approaches are based on a paradigm called physics-informed learning. In recent years, advances in machine learning have yielded very successful methods and algorithms, as well as widespread interest in this research area. Typically, machine learning models rely on large training data sets to extract knowledge and to train the underlying mathematical models. However, in a range of applications, data gathering can be a difficult or expensive task. This also applies to the alcohol biosensor problem considered here. The required drinking data, in the form of TAC and BrAC time series, is human subject data that has to be gathered in time-consuming and highly controlled lab sessions. For this reason, the corresponding data sets tend to be rather small; see Chapter 3. The scarcity of available training data creates a potential problem when it comes to model training. State-of-the-art machine learning models with a large number of parameters to be fit cannot reliably be trained on small data sets; the attempt to do so may result in models that exhibit overfitting and a lack of generalization ability. This is the motivation for introducing the recent idea of making these models physics-informed (or physics-guided). This paradigm makes use of available prior knowledge about the problem that helps in the training of the model. Such prior knowledge typically takes the form of known physical equations, for example partial differential equations (PDEs) or systems thereof. Then, based on the available training data and the prior knowledge, learning algorithms are designed that leverage the combination of both data and physics. In that way, the lack of training data can be compensated for. Moreover, since the model learns not only from data but also from physical principles, it is encouraged to learn physically meaningful relations and so is likely to have good generalization capabilities.
This observation fits into a more philosophical way of thinking about learning. Modern machine learning models try to emulate the way human beings learn. That is, their architecture is often meant to mimic the architecture of the human brain. The most prominent example of this are deep learning networks whose neurons, activation functions, and interconnections between neurons resemble a human brain. In this setting, it makes sense not only to adapt the model architecture to the human brain, but also to design the learning processes in a way that mimics how people actually learn. In this context, we introduce the terminology knowing that and knowing how, borrowing from an article in the New Yorker [11]. By knowing that, we mean factual knowledge, for example knowledge that can be expressed as a mathematical equation. On the other hand, knowing how describes a more empirical process of learning that involves observations, trials, and errors. Indeed, if we think about how humans learn, it is pretty clear that the way we learn involves a combination of knowing that and knowing how. While learning factual knowledge is necessary to learn a new concept, the empirical way of learning by doing is just as important. Numerous real-life examples such as learning a new language, participating in sports, baking a pie, or succeeding in an interview should prove this point. However, when it comes to mathematical models and machine learning, there appears to be a gap between those two categories: classical mathematical models, such as PDE models, describe the world in a factual, a priori way. These models are great in terms of their explainability and represent the knowing that side of learning. Data-driven machine learning models, on the other hand, are mostly after-the-fact empirical models, and so in that sense they are based on knowing how. Now, if the learning model at hand should not only be similar to the human brain in terms of its architecture but also in the way it learns, it is a natural idea to combine these two learning mechanisms. Hence, the above-mentioned physics-informed machine learning models are not only a way to overcome a certain lack of training data; rather, they represent a learning philosophy that tries to match the way a machine learns with the human way of learning. In this context, it is quite remarkable that the philosopher Immanuel Kant (1724-1804) anticipated this development nearly 250 years ago. His famous quote "Thoughts without content are empty, intuitions without concepts are blind" can be interpreted to mean that only the combination of thoughts and concepts (knowing that) on the one hand, and content and intuition (knowing how) on the other, yields a thorough and meaningful way of learning.

The structure of this thesis is as follows. In Chapter 2, we review the existing literature with respect to existing approaches to the alcohol biosensor problem as well as the recent advances in machine learning that are the basis for the models considered here. Since the work for this thesis was conducted as part of an interdisciplinary research team, a large emphasis was placed on the performance of the methods on actual human subject data. For this reason, we introduce the dataset that is used throughout this thesis in Chapter 3. Then, in Chapter 4, we derive and explain a first-principles diffusion equation model for the transport of ethanol from the blood to the surface of the skin.
This model has been investigated intensively in the past and used for the estimation of BrAC signals based on TAC signals. It serves as the underlying mathematical model for the physics-informed machine learning models reported on here. We show how this model can be formulated as an abstract parabolic evolution equation to make it readily accessible to some of the machine learning models, and how it can be extended to a statistical population model using randomly distributed parameters, and we comment on the discretization of this model. Based on this, we employ linear semigroup theory to prove convergence and consistency of the model with respect to the discretization and the statistical estimators. Following this, in Chapter 5, we describe a physics-informed neural network for the alcohol biosensor problem. Specifically, we use a generative adversarial network (GAN) with a residual-augmented loss function to estimate the distribution of unknown parameters in the mentioned diffusion equation model. We design another physics-informed neural network for the deconvolution of the BrAC signal from the TAC signal. Based on the distribution of the unknown parameters, this network is able to estimate the blood alcohol signal and quantify the uncertainty in the form of credible bands. We show how a posterior latent variable can be used to sharpen these credible bands, and we apply the techniques to our data set of drinking episodes and demonstrate the advantages and shortcomings of this approach.

As a second model under investigation, we consider a physics-informed LSTM model for the BrAC estimation in Chapter 6. We propose a network architecture that takes into account both time-dependent and time-independent covariates, and show how the information encoded in the PDE model can be included in the time series network. In addition, we demonstrate how the dropout regularization technique can be employed to yield approximations to credible bands for the estimated signals. In an extensive numerical study, we present results on the data set, and we reflect on the interplay between the data-driven terms of the loss function and the physics-driven parts. Finally, in Chapter 7, we consider a covariate-dependent (forward) hidden Markov model (HMM) with two emission or observed variables. The hidden Markov chain serves as a forward full-body alcohol model with BrAC and TAC, the two observed quantities, assumed to be described by a bivariate normal distribution which is a function of the state of the hidden Markov chain. In addition, the bivariate normal and hidden Markov chain parameters are allowed to depend on covariates, including person-level variables of sex, height, weight, hip and waist measurements, and session-level covariates including alcohol dosing pattern and blood pressure readings, via built-in regression models. The model is trained using the Baum-Welch/Expectation Maximization (EM) algorithms on our data set. The resulting fit model, together with observations of TAC only and the Viterbi algorithm, is then used to find the most likely path through the hidden Markov chain and the corresponding emissions, with the intent of solving the inverse problem of obtaining an eBrAC signal from a TAC signal alone. An approach for including covariates when the HMM is fit, a method for decoding the hidden Markov states, their possible interpretation as blood alcohol concentration (BAC), and their use in quantifying eBrAC uncertainty are discussed.
We present the results of our numerical studies, and discuss and investigate an extension to a physics-informed HMM scheme. Having discussed the three described approaches in detail in the preceding chapters, we compare the different models, make some concluding remarks, and give an outlook in the final Chapter 8.

Chapter 2
Literature Survey

In the last decade, our research team and other groups have worked on different approaches to overcome the difficulties in the BrAC estimation. Some early works used deterministic models [12-18] for the relationship between BAC/BrAC and TAC. Some of them were based on regression models [13], while others modeled the transport of the alcohol from the blood through the epidermis by a one-dimensional diffusion equation with unknown parameters. Those parameters were then fit to an individual drinking session, known as an "alcohol challenge". Once the parameters were calibrated, the biosensor data was used to deconvolve the BAC/BrAC signal. Although this method was quite successful, there were two major caveats. First, this method required the alcohol researchers to carry out an alcohol challenge for each individual before the device could be used in the field, and second, it did not account for the natural variation and uncertainty across subjects and environmental conditions. For example, a parameter calibration obtained during the alcohol challenge could yield inaccurate results in a more realistic drinking setting [19, 20]. More recent contributions consider the unknown parameters as random variables and estimate their distribution by fitting a population model to a range of training data across varying subjects, devices, and environmental conditions [21-23]. Using the estimated distribution of the parameters, it is not only possible to deconvolve the BAC/BrAC signal using the most likely parameter values, but credible bands can be obtained to measure and quantify the corresponding uncertainty. A first work on this approach used a least squares estimator based on naive pooled data [21], while another used a Bayesian approach to find a posterior distribution of the parameters [24].

The idea of approaching the BrAC estimation using machine learning techniques is relatively new. In [25], the authors used a tree-based ensemble regression algorithm to create estimates of BrAC from transdermal data. Apart from this, we believe that the methods reported herein are the first works on this problem using machine learning techniques, especially physics-informed learning. We will not try to give an extensive review of the recent machine learning literature in general. Instead, we want to provide a brief overview of the relevant developments in physics-informed learning. The first ideas to augment the loss function of neural networks with differential equation constraints to incorporate physical knowledge of the problem into the training process date back to the 1990s and can be found in the works by Psichogios and Ungar [26] and Lagaris et al. [27]. In a series of more recent papers, Raissi et al. revisited this approach and established the notion of physics-informed neural networks (PINNs) [28-33]. Based on these works, the research on PINNs accelerated, and numerous variants [34-36] were introduced, as well as related techniques [37-39].
We note that while the idea of combining physics-based modeling with machine learning techniques is quite new, it is an area that has seen significant activity as of late (a comprehensive survey and extensive bibliography can be found in the recent review article [40]). As of today, these methods are employed in fields such as finance [41, 42], fluid mechanics [43, 44], heat transfer [45, 46], material sciences [47-49], and bio-engineering [50, 51]. However, virtually all of this work has been in the area of deep neural networks and deep learning (see, for example, [30-32, 52]).

In this thesis, we also want to extend this technique of physics-informed learning to another class of machine learning models, namely hidden Markov models (HMMs). The basic idea of, and the theory behind, HMMs was established in a series of research papers by Baum and his colleagues in the late 1960s and early 1970s [53-57], and combined into a single comprehensive tutorial in the paper by Rabiner [58]. Since then, they have been used across various applications as both a forward model and as a means of formulating and solving inverse problems. The most often cited application of HMMs, dating back to the late 1980s, is speech recognition [58-61], with later applications following in the areas of acoustics, signal and image processing, genomics, and bioinformatics. They continue to enjoy widespread popularity with researchers in areas as diverse as musicology, gesture recognition, trend analysis, and finance [62]. A comprehensive bibliography of references for hidden Markov models and their application from 1989 through 2000 can be found in [63]. In addition, [64] is an excellent general resource which, in addition to providing a rigorous treatment of the theory and derivations of the various computational algorithms for HMMs, includes an extensive survey of the literature. A more recent survey and review of the literature, essentially up to the present, covering the latest research on the theory and application of HMMs can be found in [62]. While HMMs have been around for the past three to four decades, the idea of combining and regularizing them with physics-based models appears to be new. We are aware of only one instance of this in the literature, in the form of a preprint posted on arXiv [65].

Chapter 3
Human Subject Data for Transdermal and Breath Alcohol Concentration

The research presented in this work was conducted in a joint research project with applied mathematicians and psychologists. Hence, an emphasis throughout the work was placed on validating the proposed models and methods on real human subject data gathered in the Luczak Laboratory at the University of Southern California [66]. The data consists of BrAC, TAC, and covariate data from a sample of 40 participants who each completed four laboratory drinking sessions. 50% of the participants were female, the ages ranged between 21 and 33 years, and 35% of the participants had a BMI of at least 25.0. Each participant consumed the same amount of alcohol in all four sessions, which was designed to reach a maximum BrAC peak of .080 (the legal limit for intoxication in the US) for that person based on body water weight. There were three patterns of consumption: participants consumed the alcohol in 15 minutes (Single Dose) in two of the sessions, in two 15-minute periods spaced 30 minutes apart (Dual Dose) in one session, and over 60 minutes (Steady Dose) in one session.
An Intoximeter IV breath analyzer (Intoximeters, Inc., St. Louis, MO) was used to obtain BrAC, and two SCRAM-CAM devices (Alcohol Monitoring Systems, AMS, Littleton, CO) were placed on the upper arms to obtain TAC. Measurements were taken at 10-minute intervals over the entire BrAC curve. After BrAC returned to .000, the TAC devices continued to record automatically at 30-minute intervals until TAC also returned to .000. The entire dataset consists of 267 drinking episodes. At some points in this thesis, several sets of BrAC-TAC data were excluded from analyses due to a change in the protocol; in these cases, our final sample consisted of 256 BrAC-TAC datasets. We also obtained covariates that may affect the BrAC-TAC relationship, including person-level covariates of sex, age, race/ethnicity, body measurements (e.g. hip and waist measurements, percent body fat), and past 90-day drinking behavior, and session-level covariates of alcohol dosing pattern and blood pressure readings (baseline, minimum, and maximum). A complete list of the covariates used in this work is given in Table 3.1 below.

Before using the BrAC and TAC signals, we apply some preprocessing steps to them. First, as most body processes happen in a continuous way, it is reasonable to expect rather smooth signals. For this reason, we smooth the noisy BrAC and TAC data using smoothing splines. Then, we interpolate the signals at a frequency of 1/60 Hz, i.e., one sample per minute. The smoothed and interpolated data is referred to as the dataset in the following chapters.
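As an illustration of these two preprocessing steps, the sketch below smooths a noisy TAC-like series with a cubic smoothing spline and resamples it on a uniform 60-second grid (1/60 Hz). This is a minimal sketch, not the laboratory pipeline: the smoothing parameter s, the clipping of negative values, and the synthetic example data are all choices made here for illustration.

    import numpy as np
    from scipy.interpolate import UnivariateSpline

    def preprocess(t_raw, y_raw, s=1e-4):
        """Smooth a noisy BrAC/TAC series with a smoothing spline and
        resample it at 1/60 Hz (one sample per minute); times in seconds."""
        spline = UnivariateSpline(t_raw, y_raw, s=s)    # cubic smoothing spline
        t_grid = np.arange(t_raw[0], t_raw[-1], 60.0)   # uniform 60 s grid
        y_grid = np.clip(spline(t_grid), 0.0, None)     # concentrations >= 0
        return t_grid, y_grid

    # Synthetic example: a TAC-like curve sampled every 10 minutes over 8 hours
    t_raw = np.arange(0.0, 8 * 3600.0, 600.0)
    y_raw = 0.05 * np.exp(-((t_raw - 4 * 3600.0) / 3600.0) ** 2)
    y_raw += 0.002 * np.random.randn(t_raw.size)        # measurement noise
    t, y = preprocess(t_raw, y_raw)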
In Figure 3.1 we display two typical BrAC and TAC signals from the dataset to demonstrate two characteristics. First, the TAC curve follows the BrAC curve with some delay. This is due to the physiological effect that the alcohol that is already in the vascular system – and thus in the breath – takes some time to diffuse through the epidermal layer of the skin to reach the TAC device on the surface of the skin. Second, comparing Figures 3.1a and 3.1b, we see that the extent to which BrAC is expressed as TAC can vary a lot. It is reasonable to assume that these differences can be attributed to the already mentioned covariates. To give an example, a person with a very high BMI might have a different peak BrAC to peak TAC ratio than a person with a low BMI. In that sense, a method that only takes into account BrAC and TAC while ignoring all other covariates is unlikely to yield accurate results.

Figure 3.1: BrAC signals with corresponding TAC signals.

However, before first-order effects like the covariates can be considered in a meaningful way, the zero-order effect, i.e. the measurement of TAC and BrAC, needs to be consistent. Figure 3.2 shows four BrAC signals from the dataset together with the TAC signals from the two TAC devices on the left and right arm, respectively. While Figures 3.2a and 3.2b show examples where the two TAC signals reasonably agree, Figures 3.2c and 3.2d depict drinking episodes where the two TAC devices yield disagreeing results. Roughly 1/3 of the drinking episodes in our dataset exhibit this issue. To this date, the precise cause of this issue is unknown. From a physiological perspective, it appears very unlikely that the actual TAC on the left and right arm differs a lot, as the human body is quite symmetric. Hence, the observed differences might be explained by variation in the TAC devices themselves, or the precise location at which they were worn on the arm. The explanation of these differences is currently under research. Alas, even if this variation were to be explained, it is clear that the application of TAC devices in the field will be affected by these factors, too. Whereas in the lab the exact position of the TAC device can be enforced, this is hardly possible in a more naturalistic drinking setting. So, although the as yet unexplained variation within drinking episodes appears unsatisfying and frustrating, this dataset might be quite representative of the data one might expect to gather in the field. Thus, a particular challenge in light of the data is to find robust methods that are able to deal with the high degree of variation and that can quantify the associated uncertainty of BrAC estimates effectively.

Figure 3.2: Four BrAC signals from the dataset with corresponding TAC signals from the TAC devices placed on either arm.

Table 3.1: List of covariates.

Name            Description
Session Type    Single, Dual, or Steady
Gender          Male or Female
Ethnicity       African/African-American, Asian, Caucasian, Hispanic, or Mixed
BMI             Body mass index
RM              Resting metabolism
SM              Skeletal muscle
Waist_Hip       Waist-to-hip ratio
AGE             Age of participant in years
ATE             Number of hours from eating to start of drinking session
WATER           Number of hours from time last drank water to start of drinking
CAL_LUN         Amount of calories for lunch
BPBase_S        Systolic blood pressure at rest
BPBase_D        Diastolic blood pressure at rest
BPPeak_S        Systolic blood pressure at peak
BPPeak_D        Diastolic blood pressure at peak
BPLow_S_S       Lowest systolic blood pressure reading before lunch
BPLow_S_D       Diastolic blood pressure reading at lowest systolic reading before lunch
BPLow_D_S       Systolic blood pressure reading at lowest diastolic reading before lunch
BPLow_D_D       Lowest diastolic blood pressure reading before lunch
28d_M           Past 28 days: maximum number of drinks on a single day
28d_B           Past 28 days: number of binge days (5 drinks for women, 4 for men)
28d_T           Past 28 days: total number of drinks
90d_M           Past 90 days: maximum number of drinks on a single day
90d_B           Past 90 days: number of binge days (5 drinks for women, 4 for men)
90d_T           Past 90 days: total number of drinks

Chapter 4
A Diffusion Equation Model for the Transport of Alcohol through the Epidermal Layer of the Skin

4.1 An Abstract Parabolic Framework

The process of ethanol diffusion through the epidermal layer of the skin has been carefully studied over the past years, and several, mostly similar, physical models have been developed to capture the physiological processes. Here, we use a distributed parameter diffusion model that was described in [24] and takes the form

\frac{\partial x}{\partial t}(t,\eta) = q_1 \frac{\partial^2 x}{\partial \eta^2}(t,\eta), \quad t > 0, \ \eta \in (0,1),   (4.1)
\frac{dw}{dt}(t) = q_3 \frac{\partial x}{\partial \eta}(t,0) - q_4 w(t), \quad t > 0,   (4.2)
x(t,0) = w(t), \quad t > 0,   (4.3)
q_1 \frac{\partial x}{\partial \eta}(t,1) = q_2 u(t), \quad t > 0,   (4.4)
w(0) = w_0, \quad x(0,\eta) = \phi_0(\eta), \quad \eta \in (0,1),   (4.5)
y(t) = w(t), \quad t \ge 0.   (4.6)

In this model, $x(t,\eta)$ is the concentration of ethanol in the epidermal layer at time $t$ and depth $\eta \in [0,1]$, and $w(t)$ is the concentration of ethanol in the sensor's collection chamber at time $t$. The TAC output is denoted by $y(t)$; in the case of the dimensionless system considered above, $w(t)$ and $y(t)$ coincide. The BrAC input to the model is denoted by $u(t)$. The dimensionless parameter $q_1$ models the diffusivity of the skin, and the dimensionless $q_2$ describes the alcohol gain from the dermal layer of the skin. Both parameters are unknown and un-measurable; they are subject-dependent physiological parameters. The parameters $q_3$ and $q_4$ are device-dependent.
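For readers who want to see the model in action, here is a minimal explicit finite-difference simulation of (4.1)-(4.6) producing a TAC curve from a BrAC input. The parameter values, grid sizes, and the one-sided treatment of the boundary conditions are illustrative choices made here; they are not the discretization used in the remainder of this chapter, which is developed below in an abstract Galerkin framework.

    import numpy as np

    def simulate_tac(u, dt, m=10, q=(0.5, 0.9, 0.8, 0.6)):
        """Explicit finite-difference simulation of (4.1)-(4.6).
        u: BrAC input sampled at step dt (dimensionless time units);
        m: number of spatial subintervals of [0, 1];
        q: (q1, q2, q3, q4), illustrative values only.
        Returns the TAC output y(t) = w(t)."""
        q1, q2, q3, q4 = q
        deta = 1.0 / m
        assert dt <= deta ** 2 / (2 * q1), "explicit-scheme stability bound"
        x = np.zeros(m + 1)  # ethanol concentration in the epidermal layer
        w = 0.0              # concentration in the sensor collection chamber
        y = np.zeros(len(u))
        for k, uk in enumerate(u):
            lap = (x[2:] - 2 * x[1:-1] + x[:-2]) / deta ** 2
            x[1:-1] += dt * q1 * lap                        # diffusion (4.1)
            x[-1] = x[-2] + deta * q2 * uk / q1             # flux input (4.4)
            w += dt * (q3 * (x[1] - x[0]) / deta - q4 * w)  # chamber (4.2)
            x[0] = w                                        # coupling (4.3)
            y[k] = w                                        # output (4.6)
        return y

    dt = 0.005
    t = np.arange(0.0, 6.0, dt)
    u = 0.05 * np.exp(-((t - 1.0) / 0.5) ** 2)  # synthetic BrAC pulse
    y = simulate_tac(u, dt)                     # y lags and attenuates u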
In what follows, we will assume the parameters $\mathbf{q} = [q_1, q_2, q_3, q_4]$ to be random, and we try to estimate their distributions in order to use them in the above model. The model describes a (linear) relationship between the BrAC $u$ as the input of the model and the TAC $y$ as the output of the model. In the following, we rewrite the initial-boundary value problem (4.1)-(4.6) as an abstract parabolic system in a Gelfand triple of Hilbert spaces. Then, we show how the random parameters $\mathbf{q}$ can essentially be considered as additional spatial variables, and how the resulting model can be discretized. Based on this discretized model, we present an approach to estimate the distribution of $\mathbf{q}$, and we develop a consistency theory for the estimators.

First, we consider a finite time horizon $T > 0$, and we assume a zero-order hold input, that is, $u(t) = u_k$, $t \in [k\tau, (k+1)\tau)$, $k = 0, 1, 2, \ldots, K$, with $\tau$ denoting a sampling time. Such an assumption is not unrealistic in this application, as biosensors have a sampling time, and breath alcohol concentration changes slowly over the course of hours. Analogously, we define $x_k = x_k(\eta) = x(k\tau, \eta)$, $w_k = w(k\tau)$, and $y_k = y(k\tau)$, $k = 0, 1, \ldots, K$, with $K\tau = T$. Then, we consider the system (4.1)-(4.6) on the intervals $[k\tau, (k+1)\tau]$ for $k = 0, 1, 2, \ldots, K$. With the change of variables $v(t,\eta) = x(t,\eta) - \xi(\eta) u_k$, where $\xi(\eta) = \frac{q_2}{q_1}\eta$, we obtain the system

\frac{\partial v}{\partial t}(t,\eta) = q_1 \frac{\partial^2 v}{\partial \eta^2}(t,\eta), \quad k\tau < t < (k+1)\tau, \ \eta \in (0,1),   (4.7)
\frac{dw}{dt}(t) = q_3 \frac{\partial v}{\partial \eta}(t,0) - q_4 w(t) + \frac{q_3 q_2}{q_1} u_k, \quad k\tau < t < (k+1)\tau,   (4.8)
v(t,0) = w(t), \quad k\tau < t \le (k+1)\tau,   (4.9)
q_1 \frac{\partial v}{\partial \eta}(t,1) = 0, \quad k\tau < t < (k+1)\tau,   (4.10)
v(k\tau, \cdot) = x(k\tau, \cdot) - \xi(\cdot) u_k = x_k - \xi u_k.   (4.11)
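To see why this choice of $\xi$ homogenizes the input boundary condition — a one-line check we add here for the reader — note that, using (4.4),

\[
q_1 \frac{\partial v}{\partial \eta}(t,1) = q_1 \frac{\partial x}{\partial \eta}(t,1) - q_1 \xi'(1)\, u_k = q_2 u_k - q_1 \cdot \frac{q_2}{q_1}\, u_k = 0,
\]

which is (4.10); similarly, substituting $\frac{\partial x}{\partial \eta}(t,0) = \frac{\partial v}{\partial \eta}(t,0) + \frac{q_2}{q_1} u_k$ into (4.2) produces the additional source term $\frac{q_3 q_2}{q_1} u_k$ appearing in (4.8).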
With that, one can show that A(q q q) ˆ f = A(q q q)(f(0);f)=(q 3 f 0 (0) q 4 f(0);q 1 f 00 ) for ˆ f2 Dom(A(q q q)). Further, A(q q q) is densely defined on H q , regularly dissipative and self-adjoint and thus, A(q q q) is an infinitesimal generator of a uniformly exponentially stable, self-adjoint, ana- lytic semigroup of bounded linear operators,fe A(q q q)t : t 0g, on H q . For details, we refer to [67, 17 68]. With this, the time-discrete system (4.7)-(4.11) takes the form d ˆ v dt (t)= A(q q q) ˆ v(t)+( q 3 q 2 q 1 ;0)u k where ˆ v(kt)=(w k ;x k x u k ) in the interval kt < t <(k+ 1)t. Define ˆ x k =(w k ;x k ) and the operator ˆ A(q q q)= e A(q q q)t 2L(H q ;H q ). Further, define the operator B(q q q)= q 3 q 2 q 1 ;0 A(q q q)(0;x) and based on that ˆ B(q q q)= Z t 0 e A(q q q)s B(q q q)ds= A(q q q) 1 e A(q q q)s B(q q q)j t 0 =( ˆ A(q q q) I)A(q q q) 1 B(q q q)=(I ˆ A(q q q)) (0;x) A(q q q) 1 q 3 q 2 q 1 ;0 2L(R;H q ); we obtain the state space form ˆ x k+1 =(w k+1 ;x k+1 )=(w((k+ 1)t);x((k+ 1)t;))= ˆ v((k+ 1)t)+(0;x)u k = e A(q q q)t (w k ;x k x u k )+ Z t 0 e A(q q q)s q 3 q 2 q 1 ;0 dsu k +(0;x)u k = ˆ A(q q q) ˆ x k +(I ˆ A(q q q))(0;x)u k + A(q q q) 1 ( ˆ A(q q q) I) q 3 q 2 q 1 ;0 u k = ˆ A(q q q) ˆ x k + ˆ B(q q q)u k : Defining further the output operator ˆ C2L(H q ;R) as ˆ C(q;f)=q for (q;f)2 H q , our discrete time model becomes ˆ x k+1 = ˆ A(q q q) ˆ x k + ˆ B(q q q)u k ; k= 0;1;2;:::;K ˆ x 0 =(w 0 ;f 0 ); y k = ˆ C ˆ x k ; k= 0;1;2;:::;K: 18 Assuming zero initial conditions for the epidermal layer of the skin and the collection chamber of the sensor, that is ˆ x 0 =(w 0 ;f 0 )=(0;0), the output y of the system takes the form of a discrete time convolution of the input, u, with a filter, h(q q q), as y k = k1 å j=0 ˆ C ˆ A(q q q) k j1 ˆ B(q q q)u j = k1 å j=0 h k j1 (q q q)u j ; k= 0;1;2;:::;K: Here, the filter is h i (q q q)= ˆ C ˆ A(q q q) i ˆ B(q q q);i= 0;1;2;:::. Now, shifting from the assumption that q q q is deterministic to the idea of q q q being a random vector, we denote this random vector asq and assume thatq has support in the cubeÕ 4 i=1 [a i ;b i ] for some¥< ¯ a< a i < b i < ¯ b<¥. Define the vectors~ a=[a i ] 4 i=1 , ~ b=[b i ] 4 i=1 and letQR 4 be a compact parameter set. Assume further thatq is distributed according to an absolutely continuous cumulative density function, F(q;[~ a; ~ b;r]) or equivalently by a push forward measurep =p([~ a; ~ b;r]) forr2Q. We setq =(~ a; ~ b;r) and write p(q) f(q), where f(q)= f(;q) is a density function. Based on this, we define the Bochner spacesV = L 2 p (Q;V) andH = L 2 p (Q;H q ) and obtain the Gelfand tripleV ,!H ,!V where V denotes the dual ofV. Further, we define the bilinear forma(q;;) onVV by a(q; ˆ f; ˆ y)=E p a q q q; ˆ f(q q q); ˆ y(q q q) = Z Q a q q q; ˆ f(q q q); ˆ y(q q q) dp(q q q;q) for ˆ f; ˆ y2V. Leveraging the properties (1)-(3) and the Cauchy-Schwarz inequality, we compute ja(q; ˆ f; ˆ y)j Z Q ja q q q; ˆ f(q q q); ˆ y(q q q) jdp(q q q;q) a 0 Z Q jj ˆ f(q q q)jj V jj ˆ y(q q q)jj V dp(q q q;q) a 0 Z Q jj ˆ f(q q q)jj 2 V dp(q q q;q) 1=2 Z Q jj ˆ y(q q q)jj 2 V dp(q q q;q) 1=2 =a 0 jj ˆ fjj V jjyjj V ; 19 as well as a(q; ˆ f; ˆ f)+l 0 j ˆ fj 2 H = Z Q a(q q q; ˆ f(q q q); ˆ f(q q q))dp(q q q;q)+l 0 Z Q j ˆ f(q q q)j 2 H dp(q q q;q) = Z Q a(q q q; ˆ f(q q q); ˆ f(q q q))+l 0 j ˆ f(q q q)j 2 H dp(q q q;q) m 0 Z Q jj ˆ f(q q q)jj 2 V dp(q q q;q) =m 0 jj ˆ fjj 2 V ; and thus,a(;;) is also a bounded and coercive sesquilinear form onVV with the same con- stants as a(;;). 
With that, we define the bounded linear mapA(q) :V!V byhA ˆ f; ˆ yi V ;V = a(q; ˆ f; ˆ y). The operatorA, appropriately restricted or extended, is an infinitesimal generator of an analytic semigroup of bounded linear operators,T(t;q)=fe A(q)t : t 0g, onV;H andV , see [67–69]. Analogously, we define the operatorB(q)2L(R;H) by hB(q)u; ˆ yi q =E p(q) hB(q q q)u; ˆ yi q q q = Z Q hB(q q q)u; ˆ yi q q q dp(q q q;q); u2R and set ˆ A(q)=T(t;q)2L(H;H), ˆ B(q)=A(q) 1 ( ˆ A(q)I)B(q)2L(R;H), whereI is the identity onH. Note that by the coercivity ofa(;;) we may assume thatA(q) 1 exists and is bounded. Lastly, we define the output operator ˆ C(q) by ˆ C(q) ˆ f =E p(q) ( ˆ C ˆ f)2L(H;R). In an effort to keep the notation readable, we omit the dependence onq, unless we want to emphasize it. With that, we obtain the time-discrete system ˆ x k+1 = ˆ A ˆ x k + ˆ Bu k (4.12) y k = ˆ Cx k : (4.13) Once again, the outputy k can then be written as a convolution as y k = k1 å j=0 ˆ C ˆ A k j1 ˆ Bu j = k1 å j=0 h k j1 u j ; k= 0;1;2;:::;K: (4.14) 20 Based on this, we can state the problem of input estimation or deconvolution as a constrained, regularized optimization problem. In this problem, we denote the distributional parameters ofq by q =(~ a; ~ b;r), and the optimal parameter set byq =(~ a ; ~ b ;r ) together with the optimal support of the random parameters, Q . Then, we consider the input u of the model as a random continuous time signalu, that can be sampled asu j =u( jt)2 L 2 p(q ) (Q ), where t > 0 again denotes the length of a sampling interval. Clearly,u depends on both the random parametersq and the time t. Hence, we letu2 S(0;T) with S(0;T)= H 1 (0;T ;L 2 p(q ) (Q )) and consider a compact subset U of S(0;T). With that, the input estimation problem is expressed as u = argmin U J(u)= argmin U K å k=1 jy k (u) ˆ y k j 2 +jjujj 2 S(0;T) ; (4.15) wherejjujj 2 S(0;T) denotes a regularization term with a suitable normjjjj S(0;T) on S(0;T) andy(u) as in (4.14). At this point, the model is in state space form, time-discrete, and the randomness of q q q has been taken into account. However, the filterh in (4.14) involves the semigroupfT(t) : t 0g that is defined on the infinite dimensional Hilbert spaceH. Thus, as a last step, we need to apply a finite dimensional approximation. To this end, we construct a series of approximating finite-dimensional subspacesV N ofV as well as sequences of operators ˆ A N ; ˆ B N , and ˆ C N that approximate ˆ A; ˆ B, and ˆ C, respectively. Here, by N we denote the multi-index N =(n;m 1 ;m 2 ;m 3 ;m 4 ). As before, we assume that the random parameters q i are supported in [a i ;b i ] for i= 1;:::;4. Partition [a i ;b i ] into m i subintervals of equal width for i= 1;:::;4 and denote the characteristic function on the j-th subinterval by c m i j for j= 1;:::;m i . For n= 1;2;::: letff n j g n j=0 be the standard linear B-splines on [0;1] with respect to the uniform meshf0; 1 n ; 2 n ;:::;1g and set ˆ f n j =(f n j (0);f n j )2 V . With J we denote a multi-index J = ( j 0 ; j 1 ;:::; j 4 ) with j 0 2f0;1;:::;ng and j i 2f1;2;:::;m i g;i = 1;2;3;4. We then set F N J = ˆ f n j 0 Õ 4 i=1 c m i j i and letV N = span J fF N J g. In this way, the random variablesq are essentially treated as additional spatial variables. Then, we letP N :H!V N be the orthogonal projection ofH ontoV N . Using standard arguments from spline theory and 21 piecewise constant approximation in L 2 , one can see thatP N converges strongly to the identity in bothH andV, see e.g. [70]. 
At this point, the model is in state space form, time-discrete, and the randomness of $\mathbf{q}$ has been taken into account. However, the filter $h$ in (4.14) involves the semigroup $\{\mathcal{T}(t) : t \geq 0\}$ that is defined on the infinite dimensional Hilbert space $\mathcal{H}$. Thus, as a last step, we need to apply a finite dimensional approximation. To this end, we construct a sequence of approximating finite-dimensional subspaces $\mathcal{V}^N$ of $\mathcal{V}$ as well as sequences of operators $\hat{\mathcal{A}}^N$, $\hat{\mathcal{B}}^N$, and $\hat{\mathcal{C}}^N$ that approximate $\hat{\mathcal{A}}$, $\hat{\mathcal{B}}$, and $\hat{\mathcal{C}}$, respectively. Here, by $N$ we denote the multi-index $N = (n, m_1, m_2, m_3, m_4)$. As before, we assume that the random parameters $q_i$ are supported in $[a_i, b_i]$ for $i = 1, \ldots, 4$. Partition $[a_i, b_i]$ into $m_i$ subintervals of equal width for $i = 1, \ldots, 4$ and denote the characteristic function of the $j$-th subinterval by $\chi^{m_i}_j$ for $j = 1, \ldots, m_i$. For $n = 1, 2, \ldots$, let $\{\varphi^n_j\}_{j=0}^n$ be the standard linear B-splines on $[0, 1]$ with respect to the uniform mesh $\{0, \frac{1}{n}, \frac{2}{n}, \ldots, 1\}$ and set $\hat{\varphi}^n_j = (\varphi^n_j(0), \varphi^n_j) \in V$. With $J$ we denote a multi-index $J = (j_0, j_1, \ldots, j_4)$ with $j_0 \in \{0, 1, \ldots, n\}$ and $j_i \in \{1, 2, \ldots, m_i\}$, $i = 1, 2, 3, 4$. We then set $\Phi^N_J = \hat{\varphi}^n_{j_0} \prod_{i=1}^4 \chi^{m_i}_{j_i}$ and let $\mathcal{V}^N = \operatorname{span}_J\{\Phi^N_J\}$. In this way, the random variables $\mathbf{q}$ are essentially treated as additional spatial variables. Then, we let $\mathcal{P}^N : \mathcal{H} \to \mathcal{V}^N$ be the orthogonal projection of $\mathcal{H}$ onto $\mathcal{V}^N$. Using standard arguments from spline theory and piecewise constant approximation in $L^2$, one can see that $\mathcal{P}^N$ converges strongly to the identity in both $\mathcal{H}$ and $\mathcal{V}$; see, e.g., [70]. This shows that the sequence $\mathcal{V}^N$ approximates $\mathcal{V}$ in an appropriate way.

For the operator $\mathcal{A}$ we use a standard Galerkin approach to come up with an approximation $\mathcal{A}^N$, and set $\hat{\mathcal{A}}^N = \exp(\mathcal{A}^N\tau) \in \mathcal{L}(\mathcal{V}^N, \mathcal{V}^N)$; for details see [71]. Based on this, we set $\hat{\mathcal{B}}^N = (\mathcal{A}^N)^{-1}(\hat{\mathcal{A}}^N - \mathcal{I}^N)\mathcal{P}^N\mathcal{B} \in \mathcal{L}(\mathbb{R}, \mathcal{V}^N)$ and $\hat{\mathcal{C}}^N = \hat{\mathcal{C}}\mathcal{P}^N$. Finally, we can write

$$y^N_k = \sum_{j=0}^{k-1} \hat{\mathcal{C}}^N (\hat{\mathcal{A}}^N)^{k-j-1} \hat{\mathcal{B}}^N u_j = \sum_{j=0}^{k-1} h^N_{k-j-1} u_j, \quad k = 0, 1, 2, \ldots, K, \quad (4.16)$$

for the output $y^N$ that only depends on the input $u$. Further, we let $M$ denote the multi-index $M = (m, m_1, m_2, m_3, m_4)$, and we set $L = (N, M)$ with $N$ as defined above. In the following, by $N \to \infty$, $M \to \infty$, or $L \to \infty$ it is indicated that all entries of the multi-index approach infinity. With this, we introduce the discretized space $\mathcal{U}^M$ as a closed subset of $\mathcal{U}$ that is contained in a finite dimensional subspace of $S(0,T)$. Hence, the compactness of $\mathcal{U}$ carries over to $\mathcal{U}^M$. Having defined all these discretizations, we can state the discretized input estimation problem analogously to (4.15) as

$$u^L = \operatorname*{arg\,min}_{\mathcal{U}^M} J^L(u) = \operatorname*{arg\,min}_{\mathcal{U}^M} \sum_{k=1}^K |y^N_k(u) - \hat{y}_k|^2 + \|u\|^2_{S(0,T)}, \quad (4.17)$$

with $y^N(u)$ as in (4.16).

4.2 Consistency and Approximation

Definition 1. Let $V$ be a normed linear space, and let $\{\hat{\theta}_n\}$ be a sequence of estimators of the parameter $\theta$. Then a doubly indexed sequence of $V$-valued estimators $\tau^N_n = \tau^N(\hat{\theta}_n) = \tau^N(\hat{\theta}_n(X_1, X_2, \ldots, X_n))$ is said to be an approximating consistent sequence of estimators for the function of the parameter $\theta$, $\tau(\theta)$, where $\tau$ is continuous from $\Theta$ into the normed linear space $V$, if for every $\varepsilon > 0$ and every $\theta \in \Theta$ we have $\lim_{N,n\to\infty} P_\theta(\|\tau^N_n - \tau(\theta)\|_V \leq \varepsilon) = 1$, or, equivalently, $\lim_{N,n\to\infty} P_\theta(\|\tau^N_n - \tau(\theta)\|_V > \varepsilon) = 0$.

Theorem 1. Let $X_1, X_2, \ldots$ be a sequence of samples or observations according to the distribution given by $f(\cdot\,; \theta)$ with parameter $\theta \in \Theta \subseteq \{\mathbb{R}^m, |\cdot|_m\}$. For each $n = 1, 2, \ldots$, let $\hat{\theta}_n = \hat{\theta}_n(X_1, X_2, \ldots, X_n)$ be a consistent estimator of the parameter $\theta \in \Theta$. Let $\{V, \|\cdot\|\}$ be a normed linear space, let $\tau : \Theta \to V$ be a continuous map from $\Theta$ into $V$, and let $\{\tau^N\}_{N=1}^\infty$, $\tau^N : \Theta \to V$, be a family of equi-continuous maps from $\Theta$ into $V$ with the property that $\lim_{N\to\infty} \|\tau^N(\theta) - \tau(\theta)\| = 0$ for each $\theta \in \Theta$. Then $\tau^N_n = \tau^N(\hat{\theta}_n)$ is an approximating consistent sequence of estimators for the function of the parameter $\theta$, $\tau(\theta)$. That is, given $\varepsilon > 0$ and $\theta \in \Theta$,

$$\lim_{N,n\to\infty} P_\theta(\|\tau^N_n - \tau(\theta)\| > \varepsilon) = \lim_{N,n\to\infty} P_\theta(\|\tau^N(\hat{\theta}_n) - \tau(\theta)\| > \varepsilon) = 0,$$

and so $\tau^N_n = \tau^N(\hat{\theta}_n) \to \tau(\theta)$ in probability as $N, n \to \infty$ for every $\theta \in \Theta$.

Proof. Let $\varepsilon > 0$ and $\theta \in \Theta$ be given. Let $\hat{N}$ be such that

$$\|\tau^N(\theta) - \tau(\theta)\| \leq \frac{\varepsilon}{2}, \quad N \geq \hat{N}.$$

Let $B^N_n = \{\omega : \|\tau^N(\hat{\theta}_n) - \tau^N(\theta)\| > \frac{\varepsilon}{2}\}$ be an event, and, by the equi-continuity of the family $\{\tau^N\}_{N=1}^\infty$, let $\delta = \delta(\frac{\varepsilon}{2}, \theta) > 0$ be such that $\|\hat{\theta} - \theta\| \leq \delta$ implies $\|\tau^N(\hat{\theta}) - \tau^N(\theta)\| \leq \varepsilon/2$, $N = 1, 2, \ldots$. Then for each $N = 1, 2, \ldots$ and $n = 1, 2, \ldots$, define

$$A^N_n = \{\omega : \|\tau^N(\hat{\theta}_n) - \tau(\theta)\| > \varepsilon\}. \quad (4.18)$$

Letting $\omega \in A^N_n$, the definition of $A^N_n$ given in (4.18), the triangle inequality, and $N \geq \hat{N}$ yield that

$$\varepsilon < \|\tau^N(\hat{\theta}_n) - \tau(\theta)\| \leq \|\tau^N(\hat{\theta}_n) - \tau^N(\theta)\| + \|\tau^N(\theta) - \tau(\theta)\| \leq \|\tau^N(\hat{\theta}_n) - \tau^N(\theta)\| + \frac{\varepsilon}{2}. \quad (4.19)$$

The estimate given in (4.19) then implies that

$$\|\tau^N(\hat{\theta}_n) - \tau^N(\theta)\| > \varepsilon - \frac{\varepsilon}{2} = \frac{\varepsilon}{2}$$

for all $N \geq \hat{N}$, and therefore that $\omega \in B^N_n$ whenever $N \geq \hat{N}$. Consequently, it follows that $A^N_n \subseteq B^N_n$ whenever $N \geq \hat{N}$. Now let $C_n = \{\omega : \|\hat{\theta}_n - \theta\| > \delta\}$. By the equi-continuity of the family $\{\tau^N\}_{N=1}^\infty$, it follows that $B^N_n \subseteq C_n$ for all $N = 1, 2, \ldots$ and all $n = 1, 2, \ldots$. Thus, if $N \geq \hat{N}$, we have that $A^N_n \subseteq B^N_n \subseteq C_n$ for all $n = 1, 2, \ldots$. The assumption that $\hat{\theta}_n$ is a consistent estimator for $\theta$ implies that $P_\theta(C_n) \to 0$ as $n \to \infty$,
or, equivalently, that, given $\sigma > 0$, there exists an $\hat{n}$ such that $P_\theta(C_n) \leq \sigma$ for all $n \geq \hat{n}$. It follows that

$$P_\theta(\|\tau^N(\hat{\theta}_n) - \tau(\theta)\| > \varepsilon) = P_\theta(A^N_n) \leq P_\theta(B^N_n) \leq P_\theta(C_n) \leq \sigma$$

for all $N \geq \hat{N}$ and all $n \geq \hat{n}$, and the theorem is proven.

Theorem 1 rests on two kinds of assumptions: on the one hand, the continuity of $\tau$ and the equi-continuity of the family $\{\tau^N\}_{N=1}^\infty$ with respect to $\theta$, and, on the other hand, the convergence of the approximations $\tau^N$. In what follows, we take care of these assumptions separately.

First, we focus on the continuity assumption. As described above, the spaces $\mathcal{H}$ and $\mathcal{V}$ (or, more specifically, their inner products) depend on the distribution of $\mathbf{q}$ that is given by the density function $f(\cdot\,; \theta)$ and the support $Q$. Hence, a continuity result for the semigroup generated by $\mathcal{A}(\theta)$ with respect to $\theta$ needs to take into account the changing spaces $\mathcal{H}_n$ and $\mathcal{V}_n$ as $\theta_n \to \theta$. For $n = 1, 2, \ldots$, we let $\vec{a}^n = [a^n_i]_{i=1}^4$ and $\vec{b}^n = [b^n_i]_{i=1}^4$ with $-\infty < a^n_i < b^n_i < \infty$. Then, we set $\theta_n = (\vec{a}^n, \vec{b}^n, \rho_n)$, and further $Q_n = \prod_{i=1}^4 [a^n_i, b^n_i]$. Based on this, we define the spaces $\mathcal{H}_n = L^2_{\pi(\theta_n)}(Q_n; H_q)$ and $\mathcal{V}_n = L^2_{\pi(\theta_n)}(Q_n; V)$, and we let $\mathcal{I}_n : \mathcal{H} \to \mathcal{H}_n$ be the identity map such that $\mathcal{I}_n(\psi) = \psi$ for any $\psi \in \mathcal{H}$. We base our continuity proof on a variant of the Trotter-Kato semigroup approximation theorem [69, 72, 73]. This variant was presented in [74] and in more detail in [21], and deals with the situation where the approximating spaces are not subspaces of the underlying infinite dimensional state space.

For a Hilbert space $X$, we introduce the notation $F \in G(M, \beta)$ for an operator $F : \mathrm{Dom}(F) \subseteq X \to X$ that is the infinitesimal generator of an analytic semigroup of bounded linear operators $\{T(t)\}$ that satisfies $|T(t)| \leq Me^{\beta t}$. Assuming a bound, uniformly in $n$, on the semigroups generated by the operators $\mathcal{A}_n$, we can show the following theorem.

Theorem 2. Let $\mathcal{H}$, $\mathcal{H}_n$ be Hilbert spaces as defined above, where $|\cdot|_n$ denotes the norm on $\mathcal{H}_n$. Let $\mathcal{I}_n : \mathcal{H} \to \mathcal{H}_n$ be an operator such that $\mathrm{Im}(\mathcal{I}_n) = \mathcal{H}_n$ and $|\mathcal{I}_n z|_n \leq |z|$. Let $\mathcal{A} \in G(M, \lambda_0)$ on $\mathcal{H}$, and $\mathcal{A}_n \in G(M, \lambda_0)$ on $\mathcal{H}_n$. Suppose that for some $\lambda > \lambda_0$,

$$|\mathcal{I}_n R_\lambda(\mathcal{A})z - R_\lambda(\mathcal{A}_n)\mathcal{I}_n z|_n \to 0 \quad \text{as } n \to \infty, \quad (4.20)$$

for every $z \in \mathcal{H}$, where $R_\lambda(\mathcal{A}) = (\lambda I - \mathcal{A})^{-1}$ and $R_\lambda(\mathcal{A}_n) = (\lambda I - \mathcal{A}_n)^{-1}$ denote the resolvent operators of $\mathcal{A}$ and $\mathcal{A}_n$, respectively. Then

$$|\mathcal{I}_n\mathcal{T}(t)z - \mathcal{T}_n(t)\mathcal{I}_n z|_n \to 0 \quad \text{as } n \to \infty,$$

in $\mathcal{H}_n$ for every $z \in \mathcal{H}$, uniformly in $t$ on compact $t$-intervals.

Proof. Without loss of generality, we let $\lambda_0 = 0$. As $\mathcal{T}(t)R_\lambda(\mathcal{A})$ and $\mathcal{T}_n(t)R_\lambda(\mathcal{A}_n)$ are both strongly differentiable in $t$, we get

$$\frac{d}{dt}\mathcal{T}(t)R_\lambda(\mathcal{A}) = \mathcal{A}\mathcal{T}(t)R_\lambda(\mathcal{A}) = \mathcal{T}(t)\mathcal{A}R_\lambda(\mathcal{A}) = \mathcal{T}(t)\big(\lambda R_\lambda(\mathcal{A}) - I\big).$$

With that, and an analogous identity for $\mathcal{T}_n(t)R_\lambda(\mathcal{A}_n)$, we get

$$\frac{d}{ds}\Big(\mathcal{T}_n(t-s)R_\lambda(\mathcal{A}_n)\mathcal{I}_n\mathcal{T}(s)R_\lambda(\mathcal{A})\Big) = \mathcal{T}_n(t-s)\big(\mathcal{I}_n R_\lambda(\mathcal{A}) - R_\lambda(\mathcal{A}_n)\mathcal{I}_n\big)\mathcal{T}(s). \quad (4.21)$$

We compute

$$\mathcal{T}_n(t-s)R_\lambda(\mathcal{A}_n)\mathcal{I}_n\mathcal{T}(s)R_\lambda(\mathcal{A})\Big|_{s=0}^{s=t} = R_\lambda(\mathcal{A}_n)\big(\mathcal{I}_n\mathcal{T}(t) - \mathcal{T}_n(t)\mathcal{I}_n\big)R_\lambda(\mathcal{A}), \quad (4.22)$$

and together, (4.21) and (4.22) yield

$$R_\lambda(\mathcal{A}_n)\big(\mathcal{I}_n\mathcal{T}(t) - \mathcal{T}_n(t)\mathcal{I}_n\big)R_\lambda(\mathcal{A}) = \int_0^t \mathcal{T}_n(t-s)\big(\mathcal{I}_n R_\lambda(\mathcal{A}) - R_\lambda(\mathcal{A}_n)\mathcal{I}_n\big)\mathcal{T}(s)\, ds. \quad (4.23)$$

Then, using (4.23) and $|\mathcal{T}_n(t-s)| \leq M$, for any $u \in \mathcal{H}$ we obtain

$$\Big|R_\lambda(\mathcal{A}_n)\big(\mathcal{I}_n\mathcal{T}(t) - \mathcal{T}_n(t)\mathcal{I}_n\big)R_\lambda(\mathcal{A})u\Big|_n \leq M\int_0^t \Big|\big(\mathcal{I}_n R_\lambda(\mathcal{A}) - R_\lambda(\mathcal{A}_n)\mathcal{I}_n\big)\mathcal{T}(s)u\Big|_n ds. \quad (4.24)$$

Using the assumption (4.20), the integrand in (4.24) converges to 0 for fixed $s$. Moreover, it is bounded by $2M^2|u|/\lambda$, and so, by the Dominated Convergence Theorem, the upper bound in (4.24) converges to 0 as $n \to \infty$, and the convergence is uniform in $t$ on compact intervals. Denoting $v = R_\lambda(\mathcal{A})u$, we get, for all $v \in \mathcal{H}$,

$$\Big|R_\lambda(\mathcal{A}_n)\big(\mathcal{I}_n\mathcal{T}(t) - \mathcal{T}_n(t)\mathcal{I}_n\big)v\Big|_n \to 0 \quad \text{as } n \to \infty, \quad (4.25)$$

because $\mathrm{Dom}(\mathcal{A})$ is dense in $\mathcal{H}$.
Using again the assumption (4.20) and the fact that $|\mathcal{T}_n(t)| \leq M$, we have

$$\big|R_\lambda(\mathcal{A}_n)\mathcal{T}_n(t)\mathcal{I}_n v - \mathcal{T}_n(t)\mathcal{I}_n R_\lambda(\mathcal{A})v\big|_n = \big|\mathcal{T}_n(t)\big(R_\lambda(\mathcal{A}_n)\mathcal{I}_n v - \mathcal{I}_n R_\lambda(\mathcal{A})v\big)\big|_n \to 0, \quad (4.26)$$

and analogously, because $|\mathcal{T}(t)| \leq M$, we get

$$\big|R_\lambda(\mathcal{A}_n)\mathcal{I}_n\mathcal{T}(t)v - \mathcal{I}_n\mathcal{T}(t)R_\lambda(\mathcal{A})v\big|_n = \big|\big(R_\lambda(\mathcal{A}_n)\mathcal{I}_n - \mathcal{I}_n R_\lambda(\mathcal{A})\big)\mathcal{T}(t)v\big|_n \to 0. \quad (4.27)$$

The triangle inequality together with (4.26) and (4.27) yields

$$\big|R_\lambda(\mathcal{A}_n)\big(\mathcal{T}_n(t)\mathcal{I}_n - \mathcal{I}_n\mathcal{T}(t)\big)v + \big(\mathcal{I}_n\mathcal{T}(t) - \mathcal{T}_n(t)\mathcal{I}_n\big)R_\lambda(\mathcal{A})v\big|_n \to 0 \quad (4.28)$$

as $n \to \infty$. Invoking (4.25), the triangle inequality then gives

$$\big|\big(\mathcal{I}_n\mathcal{T}(t) - \mathcal{T}_n(t)\mathcal{I}_n\big)R_\lambda(\mathcal{A})v\big|_n \to 0 \quad \text{as } n \to \infty. \quad (4.29)$$

Lastly, for $w = R_\lambda(\mathcal{A})v$, we notice that $w \in \mathrm{Dom}(\mathcal{A}^2)$. Since $\mathrm{Dom}(\mathcal{A}^2)$ is dense in $\mathcal{H}$, we combine the assumption (4.20) with (4.28) and (4.29) to get

$$|\mathcal{T}_n(t)\mathcal{I}_n z - \mathcal{I}_n\mathcal{T}(t)z|_n \to 0 \quad \text{as } n \to \infty,$$

for all $z \in \mathcal{H}$, uniformly in $t$ on compact intervals.

Based on Theorem 2, to prove continuity of the semigroup $\mathcal{T}(t; \theta)$ with respect to $\theta$, we need to establish the resolvent convergence as in (4.20). To this end, we define $\mathcal{A} = \mathcal{A}(\theta)$, $\mathcal{A}_n = \mathcal{A}(\theta_n)$, $\mathcal{T}(t) = \mathcal{T}(t; \theta)$, and $\mathcal{T}_n(t) = \mathcal{T}(t; \theta_n)$.

Theorem 3. Let the assumptions (i)-(iii) hold, and let $\mathcal{A} \in G(M, \lambda_0)$ on $\mathcal{H}$ and $\mathcal{A}_n \in G(M, \lambda_0)$ on $\mathcal{H}_n$. For the four parameter model (4.12)-(4.13), assume that $\mathbf{q} \sim f(\cdot\,; \theta)$ with $\theta \in \Theta$, $\Theta$ a compact subset of Euclidean space, that $f(\cdot\,; \theta)$ has compact support $Q \subset \mathbb{R}^4$, and that the map $\theta \mapsto f(\cdot\,; \theta)$ is continuous from $\Theta$ into $L^\infty(Q)$. Then, if $\{\theta_n\} \subseteq \Theta$ with $\theta_n \to \theta \in \Theta$, we have $|\mathcal{T}(t; \theta_n)\mathcal{I}_n x - \mathcal{I}_n\mathcal{T}(t; \theta)x|_n \to 0$ in $\mathcal{H}_n$ for each $x \in \mathcal{H}$, where $\mathcal{I}_n$ denotes the identity map from $\mathcal{H}$ onto $\mathcal{H}_n$.

Proof. Since $\mathcal{V}$ is dense in $\mathcal{H}$, and $\mathcal{I}_n R_{\lambda_0}(\mathcal{A})$ and $R_{\lambda_0}(\mathcal{A}_n)\mathcal{I}_n$ are uniformly bounded, it suffices to show resolvent convergence for every $z \in \mathcal{V}$, as this will imply resolvent convergence for every $z \in \mathcal{H}$.

Let $z \in \mathcal{V}$ and set $w = R_{\lambda_0}(\mathcal{A})z$ as well as $w_n = R_{\lambda_0}(\mathcal{A}_n)\mathcal{I}_n z$. Define $z_n = w_n - w$. Since $w_n \in \mathcal{V}_n$, we can compute

$$a(\theta_n; w_n, z_n) = \langle -\mathcal{A}_n w_n, z_n\rangle_{\mathcal{H}_n} = \big\langle(\lambda_0 I - \mathcal{A}_n)R_{\lambda_0}(\mathcal{A}_n)\mathcal{I}_n z, z_n\big\rangle_{\mathcal{H}_n} - \lambda_0\langle w_n, z_n\rangle_{\mathcal{H}_n} = \langle \mathcal{I}_n z, z_n\rangle_{\mathcal{H}_n} - \lambda_0\langle w_n, z_n\rangle_{\mathcal{H}_n}. \quad (4.30)$$

Let us denote the Moore-Penrose generalized inverse [75] of $\mathcal{I}_n$ by $\mathcal{I}_n^+$. Observe that for $\psi \in \mathcal{H}_n$, $\mathcal{I}_n^+\psi \in \mathcal{H}$ agrees with $\psi$ on $Q_n$ and is extended by zero on $Q \setminus Q_n$. Then, due to $w \in \mathrm{Dom}(\mathcal{A})$, we get

$$a(\theta; w, \mathcal{I}_n^+ z_n) = \langle -\mathcal{A}w, \mathcal{I}_n^+ z_n\rangle_{\mathcal{H}} = \big\langle(\lambda_0 I - \mathcal{A})R_{\lambda_0}(\mathcal{A})z, \mathcal{I}_n^+ z_n\big\rangle_{\mathcal{H}} - \lambda_0\langle w, \mathcal{I}_n^+ z_n\rangle_{\mathcal{H}} = \langle z, \mathcal{I}_n^+ z_n\rangle_{\mathcal{H}} - \lambda_0\langle w, \mathcal{I}_n^+ z_n\rangle_{\mathcal{H}}. \quad (4.31)$$

Combining (4.30) and (4.31) yields

$$a(\theta_n; w_n, z_n) - a(\theta; w, \mathcal{I}_n^+ z_n) = \langle \mathcal{I}_n z, z_n\rangle_{\mathcal{H}_n} - \lambda_0\langle w_n, z_n\rangle_{\mathcal{H}_n} - \langle z, \mathcal{I}_n^+ z_n\rangle_{\mathcal{H}} + \lambda_0\langle w, \mathcal{I}_n^+ z_n\rangle_{\mathcal{H}}.$$

The resolvent convergence is now based on the boundedness and coercivity of the form $a(\cdot\,; \cdot, \cdot)$, and we denote the respective coefficients by $\tilde{\alpha}_0$, $\tilde{\mu}_0$, and $\tilde{\lambda}_0$. Further, we denote $f(q) = f(q; \theta)$ and $f_n(q) = f(q; \theta_n)$. Together with the Cauchy-Schwarz inequality, we compute

$$\tilde{\mu}_0\|z_n\|^2_{\mathcal{V}_n} \leq a(\theta_n; z_n, z_n) + \tilde{\lambda}_0|z_n|^2_{\mathcal{H}_n} = a(\theta_n; w_n, z_n) - a(\theta_n; w, z_n) + \tilde{\lambda}_0|z_n|^2_{\mathcal{H}_n}$$
$$= a(\theta_n; w_n, z_n) - a(\theta; w, \mathcal{I}_n^+ z_n) + a(\theta; w, \mathcal{I}_n^+ z_n) - a(\theta_n; w, z_n) + \tilde{\lambda}_0|z_n|^2_{\mathcal{H}_n}$$
$$= \int_Q \langle z, z_n\rangle_H\big(f_n(q) - f(q)\big)\,dq + \tilde{\lambda}_0\int_Q \langle w, z_n\rangle_H\big(f(q) - f_n(q)\big)\,dq + \int_Q a(q; w, z_n)\big(f(q) - f_n(q)\big)\,dq$$
$$\leq \int_Q |z|_H|z_n|_H\,|f_n(q) - f(q)|\,dq + \tilde{\lambda}_0\int_Q |w|_H|z_n|_H\,|f(q) - f_n(q)|\,dq + \tilde{\alpha}_0\int_Q \|w\|_V\|z_n\|_V\,|f(q) - f_n(q)|\,dq.$$

Using the Dominated Convergence Theorem and the continuity of the map $\theta \mapsto f(\cdot\,; \theta)$, we obtain $\tilde{\mu}_0\|z_n\|_{\mathcal{V}_n} \to 0$ as $n \to \infty$.
This shows resolvent convergence, and with Theorem 2 the claim is proven.

Using Theorems 2 and 3, we obtain the following two corollaries.

Corollary 1. Under the hypotheses of Theorems 2 and 3, it follows that $|y_k(\theta_n) - y_k(\theta)| \to 0$ for each $k = 1, 2, \ldots, K$, where $y_k(\theta) = \sum_{j=0}^{k-1} h_{k-1-j}(\theta)u_j$ for every sequence $\{u_j\}_{j=0}^\infty$ with $u_j \in \mathbb{R}$.

Proof. Recall that $h(\theta)$ is defined by $h_k(\theta) = \hat{\mathcal{C}}(\theta)\hat{\mathcal{A}}(\theta)^k\hat{\mathcal{B}}(\theta)$. Using Theorems 2 and 3, it follows that $|\hat{\mathcal{A}}(\theta_n)^k\mathcal{I}_n x - \mathcal{I}_n\hat{\mathcal{A}}(\theta)^k x|_n \to 0$ for each $x \in \mathcal{H}$ and each $k = 1, 2, \ldots$, since $\hat{\mathcal{A}}(\theta) = \mathcal{T}(\tau; \theta) = e^{\mathcal{A}(\theta)\tau}$. The operator $\hat{\mathcal{B}}$ can be expressed as

$$\hat{\mathcal{B}}(\theta) = \int_0^\tau \mathcal{T}(s; \theta)\mathcal{B}(\theta)\, ds = \mathcal{A}(\theta)^{-1}\mathcal{T}(s; \theta)\mathcal{B}(\theta)\Big|_0^\tau = \mathcal{A}(\theta)^{-1}\big(\hat{\mathcal{A}}(\theta) - I\big)\mathcal{B}(\theta).$$

By Theorems 2 and 3, $\mathcal{A}$ and $\hat{\mathcal{A}}$ depend continuously on $\theta$, and the resolvent convergence shown in Theorem 3 also establishes the continuity of $\mathcal{A}(\theta)^{-1}$. Recall that $\mathcal{B}$ is defined by $\langle\mathcal{B}u, \hat{\psi}\rangle = \mathbb{E}_\pi\left[\langle B(\mathbf{q})u, \hat{\psi}\rangle_{\mathbf{q}}\right]$, $u \in \mathbb{R}$, and since $\pi(\cdot\,; \theta)$ depends continuously on $\theta$, the Dominated Convergence Theorem yields the continuity of $\mathcal{B}$ in $\theta$. Thus, we have $|\hat{\mathcal{B}}(\theta_n)r - \mathcal{I}_n\hat{\mathcal{B}}(\theta)r|_n \to 0$ for each $r \in \mathbb{R}$. Finally, the operator $\hat{\mathcal{C}}$ is defined by $\hat{\mathcal{C}}\hat{\varphi} = \mathbb{E}_\pi(\hat{C}\hat{\varphi})$; again, the Dominated Convergence Theorem shows that $\hat{\mathcal{C}}$ is continuous in $\theta$. Putting things together, $h_k$ is a composition of operators that are continuous in $\theta$, and so $h_k$ itself depends continuously on $\theta$ for every $k = 1, 2, \ldots$; that is, $|h_k(\theta_n) - h_k(\theta)| \to 0$ for each $k = 1, 2, \ldots$. As $y_k(\theta) = \sum_{j=0}^{k-1} h_{k-1-j}(\theta)u_j$ is a finite sum of terms involving $h(\theta)$, the claim follows.

Corollary 2. Under the hypotheses of Theorems 2 and 3, it follows that $|y^N_k(\theta_n) - y^N_k(\theta)| \to 0$ for each $k = 1, 2, \ldots$, uniformly in $N$, $N = 1, 2, \ldots$, where $y^N_k(\theta) = \sum_{j=0}^{k-1} h^N_{k-1-j}(\theta)u_j$ for every sequence $\{u_j\}_{j=0}^\infty$ with $u_j \in \mathbb{R}$.

Proof. The operators $\hat{\mathcal{A}}^N(\theta)$, $\hat{\mathcal{B}}^N(\theta)$, and $\hat{\mathcal{C}}^N(\theta)$ are essentially the infinite-dimensional operators $\hat{\mathcal{A}}(\theta)$, $\hat{\mathcal{B}}(\theta)$, and $\hat{\mathcal{C}}(\theta)$ restricted to the finite-dimensional spaces $\mathcal{H}^N$. Hence, the continuity of $\hat{\mathcal{A}}(\theta)$, $\hat{\mathcal{B}}(\theta)$, and $\hat{\mathcal{C}}(\theta)$ gives rise to the equicontinuity of the families $\{\hat{\mathcal{A}}^N(\theta)\}_{N=1}^\infty$, $\{\hat{\mathcal{B}}^N(\theta)\}_{N=1}^\infty$, and $\{\hat{\mathcal{C}}^N(\theta)\}_{N=1}^\infty$. Then, the claim follows with Corollary 1.

Theorem 4. For the four parameter model, assume that $\mathbf{q} \sim f(\cdot\,; \theta)$ with $\theta \in \Theta$, $\Theta$ a compact subset of Euclidean space, that $f(\cdot\,; \theta)$ has compact support $Q \subset \mathbb{R}^4$, and that the map $\theta \mapsto f(\cdot\,; \theta)$ is continuous from $\Theta$ into $L^\infty(Q)$. Then, if $\{\theta_n\} \subseteq \Theta$ with $\theta_n \to \theta \in \Theta$, we have

1. $|u^*_k(\theta_n) - u^*_k(\theta)| \to 0$ for each $k = 0, 1, 2, \ldots, K$, where $\{u^*_k(\theta)\}$ denotes the optimal input corresponding to $\theta \in \Theta$ as described in (4.15);

2. $|u^N_k(\theta_n) - u^N_k(\theta)| \to 0$ for each $k = 0, 1, 2, \ldots, K$, uniformly in $N$, $N = 1, 2, \ldots$, where $\{u^N_k(\theta)\}$ denotes the optimal input corresponding to $\theta \in \Theta$ as described in (4.17) for $M$ fixed.

Proof. Observe that by Corollary 1, the objective functional $J(u)$ as given in (4.15) depends continuously on $\theta$. By Corollary 2, the same holds true for the objective functional $J^L(u)$ for fixed $M$. Together with the convexity of $J$ and $J^L$, this immediately yields the claims.

The results so far focused on the continuous dependence of the estimators $y$ and $u^*$ on the parameter $\theta$. However, in practice, the operators $\hat{\mathcal{A}}$, $\hat{\mathcal{B}}$, and $\hat{\mathcal{C}}$ defined on the infinite dimensional spaces $\mathcal{H}$ and $\mathcal{V}$ cannot be used and have to be approximated by the finite dimensional counterparts $\hat{\mathcal{A}}^N$, $\hat{\mathcal{B}}^N$, and $\hat{\mathcal{C}}^N$ as described above. In the following, we focus on convergence results for these approximating operators. These results are based on versions of the Trotter-Kato theorem that can be found in [67].

Theorem 5.
Let $\mathcal{A}^N$ and $\mathcal{A}$ be infinitesimal generators of $C_0$ semigroups $\mathcal{T}^N(t)$ and $\mathcal{T}(t)$ on $\mathcal{H}^N$ and $\mathcal{H}$, respectively, satisfying:

1. There exist constants $\beta$, $M$ such that $|\mathcal{T}^N(t)| \leq Me^{\beta t}$ for each $N$; that is, $\mathcal{A}^N \in G(M, \beta)$ uniformly in $N$.

2. There exists $\lambda \in \rho(\mathcal{A}) \cap \bigcap_{N=1}^\infty \rho(\mathcal{A}^N)$ with $\mathrm{Re}\,\lambda > \beta$ such that $R_\lambda(\mathcal{A}^N)\mathcal{P}^N x \to R_\lambda(\mathcal{A})x$ for each $x \in \mathcal{H}$.

Then, for each $x \in \mathcal{H}$, $\mathcal{T}^N(t)\mathcal{P}^N x \to \mathcal{T}(t)x$ uniformly in $t$ on any compact interval.

The proof can be found in [67]. In order to state the following convergence theorem, we require an additional assumption on the approximating spaces $\mathcal{H}^N$.

(iv) For each $z \in \mathcal{V}$, there exists $\hat{z}^N \in \mathcal{H}^N$ such that $\|z - \hat{z}^N\|_{\mathcal{V}} \to 0$ as $N \to \infty$.

Theorem 6. Let conditions (i), (ii), (iii), and (iv) hold. Then, for $\lambda = \lambda_0$, $R_\lambda(\mathcal{A}^N(\theta))\mathcal{P}^N z \to R_\lambda(\mathcal{A}(\theta))z$ in the $\mathcal{V}$ norm for any $z \in \mathcal{H}$.

Proof. The proof is similar to that of Theorem 3. Note that $\mathcal{A}^N(\theta)$ results from the restriction of $a$ to $\mathcal{H}^N \times \mathcal{H}^N$, and so we can choose $\lambda = \lambda_0$ in (ii) such that $\lambda \in \rho(\mathcal{A}(\theta)) \cap \bigcap_{N=1}^\infty \rho(\mathcal{A}^N(\theta))$ with $\lambda > \beta$. Again, for $z \in \mathcal{H}$, we denote $w = R_\lambda(\mathcal{A}(\theta))z$ and $w^N = R_\lambda(\mathcal{A}^N(\theta))\mathcal{P}^N z$. As $w \in \mathrm{Dom}(\mathcal{A}(\theta)) \subseteq \mathcal{V}$, condition (iv) yields a sequence $\{\hat{w}^N\}$ such that $\|\hat{w}^N - w\|_{\mathcal{V}} \to 0$ as $N \to \infty$. Our aim is to show $\|w^N - w\|_{\mathcal{V}} \to 0$. By the triangle inequality, we have

$$\|w^N - w\|_{\mathcal{V}} \leq \|w^N - \hat{w}^N\|_{\mathcal{V}} + \|\hat{w}^N - w\|_{\mathcal{V}},$$

and so, using condition (iv), it suffices to show $\|w^N - \hat{w}^N\|_{\mathcal{V}} \to 0$. Define $z^N = w^N - \hat{w}^N \in \mathcal{H}^N \subseteq \mathcal{V}$. Note that, unlike in Theorem 3, all the spaces $\mathcal{H}^N$ here are subspaces of $\mathcal{H}$, so the only inner product appearing is $\langle\cdot,\cdot\rangle_{\mathcal{H}}$; for readability, we simply denote it by $\langle\cdot,\cdot\rangle$. With that, we get

$$a(\theta; w, z^N) = \langle -\mathcal{A}(\theta)w, z^N\rangle = \big\langle(\lambda_0 I - \mathcal{A}(\theta))R_{\lambda_0}(\mathcal{A}(\theta))z, z^N\big\rangle - \lambda_0\langle w, z^N\rangle = \langle z, z^N\rangle - \lambda_0\langle w, z^N\rangle \quad (4.32)$$

as well as

$$a(\theta; w^N, z^N) = \langle -\mathcal{A}^N(\theta)w^N, z^N\rangle = \big\langle(\lambda_0 I - \mathcal{A}^N(\theta))R_{\lambda_0}(\mathcal{A}^N(\theta))\mathcal{P}^N z, z^N\big\rangle - \lambda_0\langle w^N, z^N\rangle = \langle z, z^N\rangle - \lambda_0\langle w^N, z^N\rangle. \quad (4.33)$$

Combining (4.32) and (4.33) yields

$$a(\theta; w, z^N) = a(\theta; w^N, z^N) + \lambda_0\langle w^N - w, z^N\rangle.$$

Using the coercivity of $a(\theta; \cdot, \cdot)$, we get

$$\tilde{\mu}_0\|z^N\|^2_{\mathcal{V}} \leq a(\theta; z^N, z^N) + \tilde{\lambda}_0|z^N|^2_{\mathcal{H}} = a(\theta; w^N, z^N) - a(\theta; \hat{w}^N, z^N) + \tilde{\lambda}_0|z^N|^2_{\mathcal{H}}$$
$$= a(\theta; w, z^N) + \lambda_0\langle w - w^N, z^N\rangle - a(\theta; \hat{w}^N, z^N) + \tilde{\lambda}_0|z^N|^2_{\mathcal{H}} = a(\theta; w - \hat{w}^N, z^N) + \tilde{\lambda}_0\big(\langle w - w^N, z^N\rangle + |z^N|^2_{\mathcal{H}}\big).$$

The boundedness of $a(\theta; \cdot, \cdot)$ further gives

$$\tilde{\mu}_0\|z^N\|^2_{\mathcal{V}} \leq \tilde{\alpha}_0\|w - \hat{w}^N\|_{\mathcal{V}}\|z^N\|_{\mathcal{V}} + \tilde{\lambda}_0\big(\langle w - w^N, z^N\rangle + |z^N|^2_{\mathcal{H}}\big). \quad (4.34)$$

Writing

$$\langle w - w^N, z^N\rangle = \langle w - \hat{w}^N, z^N\rangle + \langle \hat{w}^N - w^N, z^N\rangle = \langle w - \hat{w}^N, z^N\rangle - \langle z^N, z^N\rangle,$$

(4.34) turns into

$$\tilde{\mu}_0\|z^N\|^2_{\mathcal{V}} \leq \tilde{\alpha}_0\|w - \hat{w}^N\|_{\mathcal{V}}\|z^N\|_{\mathcal{V}} + \tilde{\lambda}_0\langle w - \hat{w}^N, z^N\rangle \leq \tilde{\alpha}_0\|w - \hat{w}^N\|_{\mathcal{V}}\|z^N\|_{\mathcal{V}} + |\tilde{\lambda}_0|\,|w - \hat{w}^N|_{\mathcal{H}}\,|z^N|_{\mathcal{H}}.$$

Finally, the embedding $|\cdot|_{\mathcal{H}} \leq k\|\cdot\|_{\mathcal{V}}$ results in

$$\tilde{\mu}_0\|z^N\|_{\mathcal{V}} \leq \big(\tilde{\alpha}_0 + |\tilde{\lambda}_0|k^2\big)\|w - \hat{w}^N\|_{\mathcal{V}},$$

whose right-hand side tends to zero by (iv), which concludes the proof.

Corollary 3. For the four parameter model, assume that $\mathbf{q} \sim f(\cdot\,; \theta)$ with $\theta \in \Theta$, $\Theta$ a compact subset of Euclidean space, that $f(\cdot\,; \theta)$ has compact support $Q \subset \mathbb{R}^4$, and that the map $\theta \mapsto f(\cdot\,; \theta)$ is continuous from $\Theta$ into $L^\infty(Q)$. Then, for $\theta \in \Theta$ fixed, we have $|y^N_k(\theta) - y_k(\theta)| \to 0$ for each $k = 1, 2, \ldots$, where $y^N_k(\theta) = \sum_{j=0}^{k-1} h^N_{k-1-j}(\theta)u_j$ for every sequence $\{u_j\}_{j=0}^\infty$ with $u_j \in \mathbb{R}$.

Proof. By Theorems 5 and 6, we have $\hat{\mathcal{A}}^N(\theta)\mathcal{P}^N x - \hat{\mathcal{A}}(\theta)x \to 0$ in $\mathcal{H}$ for each $x \in \mathcal{H}$, where $\mathcal{P}^N$ denotes the orthogonal projection of $\mathcal{H}$ onto $\mathcal{H}^N$. It then immediately follows that $\hat{\mathcal{A}}^N(\theta)^k\mathcal{P}^N x - \hat{\mathcal{A}}(\theta)^k x \to 0$ in $\mathcal{H}$ for each $x \in \mathcal{H}$ and each $k = 1, 2, \ldots$. Further, recalling $\hat{\mathcal{B}}(\theta) = \mathcal{A}(\theta)^{-1}(\hat{\mathcal{A}}(\theta) - I)\mathcal{B}(\theta)$ and $\hat{\mathcal{B}}^N(\theta) = \mathcal{A}^N(\theta)^{-1}(\hat{\mathcal{A}}^N(\theta) - \mathcal{I}^N)\mathcal{P}^N\mathcal{B}(\theta)$, this implies $\hat{\mathcal{B}}^N(\theta)r - \hat{\mathcal{B}}(\theta)r \to 0$ in $\mathcal{H}$ for each $r \in \mathbb{R}$. Finally, keeping in mind $\hat{\mathcal{C}}^N = \hat{\mathcal{C}}\mathcal{P}^N$, we conclude $|h^N_k(\theta) - h_k(\theta)| \to 0$ for each $k = 1, 2, \ldots$, where $h^N_k(\theta) = \hat{\mathcal{C}}^N(\theta)\hat{\mathcal{A}}^N(\theta)^k\hat{\mathcal{B}}^N(\theta)$.
As $y_k(\theta)$ and $y^N_k(\theta)$ are based on $h(\theta)$ and $h^N(\theta)$, respectively, the proof is complete.

Theorem 7. Let the assumptions of Theorem 6 hold. Then, for $\theta \in \Theta$ fixed, we have $|u^L_k(\theta) - u^*_k(\theta)| \to 0$ as $L \to \infty$ for each $k = 0, 1, 2, \ldots$, where $\{u^*_k(\theta)\}$ denotes the optimal input determined as given in (4.15) corresponding to $\theta \in \Theta$, and $\{u^L_k(\theta)\}$ denotes the optimal input as given in (4.17).

Proof. Recall that, for fixed $\theta$, the finite dimensional approximating deconvolution problem is given by

$$u^L = \operatorname*{arg\,min}_{\mathcal{U}^M} J^L(u) = \operatorname*{arg\,min}_{\mathcal{U}^M} \sum_{k=1}^K |y^N_k(u) - \hat{y}_k|^2 + \|u\|^2_{S(0,T)}. \quad (4.35)$$

We show that for any $L$, the problem (4.35) admits a solution $u^L$, and that there exists a subsequence $\{u^{L_k}\} \subseteq \{u^L\}$ with $u^{L_k} \to u^*$ as $k \to \infty$, where $u^*$ is the solution to the infinite-dimensional deconvolution problem as in (4.15). By (4.16), $y^N_k$ depends continuously on the input sequence $u$, and hence $J^L$ is continuous. Since $\mathcal{U}^M$ is assumed to be compact, the existence of a solution for every $L$ is clear. Consider a convergent sequence $\{u^M\}$ in $\mathcal{U} \subseteq S(0,T)$ with $u^M \in \mathcal{U}^M$ and $\|u^M - u\|_{S(0,T)} \to 0$ as $M \to \infty$, for any $u \in \mathcal{U}$. Then, by Corollary 3, it holds that

$$|y^N_k(u^M) - y_k(u)| \to 0 \quad \text{as } L = (N, M) \to \infty, \quad k = 1, 2, \ldots, K.$$

This shows $J^L(u^M) \to J(u)$ as $L \to \infty$. Now, consider a sequence of solutions to the finite dimensional approximating optimization problem (4.35), $\{u^L\} \subseteq \mathcal{U}^M \subseteq \mathcal{U}$. Leveraging the compactness of $\mathcal{U}$, we obtain a subsequence $\{u^{L_k}\} \subseteq \{u^L\}$ with $u^{L_k} \to u^* \in \mathcal{U}$ as $k \to \infty$. Let $u \in \mathcal{U}$ be arbitrary and denote $L_k = (N_k, M_k)$. Since $u^{L_k}$ minimizes $J^{L_k}$ over $\mathcal{U}^{M_k}$ and $u^{M_k} \in \mathcal{U}^{M_k}$, we obtain

$$J(u^*) = J\big(\lim_{k\to\infty} u^{L_k}\big) = \lim_{k\to\infty} J^{L_k}(u^{L_k}) \leq \lim_{k\to\infty} J^{L_k}(u^{M_k}) = J\big(\lim_{k\to\infty} u^{M_k}\big) = J(u).$$

Here, $\{u^{M_k}\} \subseteq \mathcal{U}$ with $u^{M_k} \in \mathcal{U}^{M_k}$ is the sequence of approximations to $u$ whose existence is guaranteed by the approximation assumption (iv). Hence, $u^*$ solves the problem (4.15).

Theorem 8. Let $q_i \sim \mathrm{Beta}(\alpha_i, \beta_i, a_i, b_i)$, $i = 1, 2, 3, 4$, be independent, and set $\theta = [\rho_1, \rho_2, \rho_3, \rho_4]$ with $\rho_i = [\alpha_i, \beta_i, a_i, b_i]$, $i = 1, 2, 3, 4$. Let $\{\hat{\theta}_n\}$ be a sequence of either Method of Moments (MoM) or Maximum Likelihood Estimators (MLE) for $\theta$. Then $h^N_n = h^N(\hat{\theta}_n)$, $y^N_n = y^N(\hat{\theta}_n)$, and $u^N_n = u^N(\hat{\theta}_n)$ are approximating consistent sequences of estimators for $h$, $y$, and $u^*$, respectively, where $h^N_n = \{h^N_k(\hat{\theta}_n)\}_{k=0}^K$, $y^N_n = \{y^N_k(\hat{\theta}_n)\}_{k=0}^K$, and $u^N_n = \{u^N_k(\hat{\theta}_n)\}_{k=0}^K$.

Proof. We apply Theorem 1. First, as shown, for example, in [76], the MoM and MLE estimators are consistent for $\theta$. In Corollary 1 we established that $h(\theta)$ and $y(\theta)$ are continuous in $\theta$, and Theorem 4 shows that the same is true for $u^*(\theta)$. Corollary 2 and Theorem 4 show that this also holds for $h^N$, $y^N$, and $u^N$, uniformly in $N$, $N = 1, 2, \ldots$. Finally, Corollary 3 shows $\lim_{N\to\infty}|h^N_k(\theta) - h_k(\theta)| = 0$ and $\lim_{N\to\infty}|y^N_k(\theta) - y_k(\theta)| = 0$ for every $k = 1, 2, \ldots, K$, and Theorem 7 shows $|u^N_k(\theta) - u^*_k(\theta)| \to 0$ for $k = 1, 2, \ldots, K$ as $N \to \infty$. Thus, all the assumptions of Theorem 1 are satisfied.

4.3 Numerical Results and Clustering

In what follows, the mathematical model under consideration deviates slightly from the model above. As the discretization of the model (4.1)-(4.6) grows exponentially in the number of parameters $q_i$, here we utilize a two-parameter model of the form

$$\frac{\partial x}{\partial t}(t, \eta) = q_1\frac{\partial^2 x}{\partial \eta^2}(t, \eta), \quad 0 < \eta < 1, \ t > 0, \quad (4.36)$$
$$q_1\frac{\partial x}{\partial \eta}(t, 0) = x(t, 0), \quad t > 0, \quad (4.37)$$
$$q_1\frac{\partial x}{\partial \eta}(t, 1) = q_2 u(t), \quad t > 0, \quad (4.38)$$
$$x(0, \eta) = x_0, \quad 0 < \eta < 1, \quad (4.39)$$
$$y(t) = x(t, 0), \quad t > 0. \quad (4.40)$$

Note that the considerations from Sections 4.1 and 4.2 with regard to discretization and convergence also apply to this modified model; compare [22].
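To make the two-parameter model concrete, the following is a minimal finite-difference sketch of the forward map $u \mapsto y$ defined by (4.36)-(4.40), using implicit Euler in time and first-order one-sided differences for the boundary conditions. The function name, grid sizes, and discretization choices are assumptions for illustration and do not reproduce the spline-based Galerkin scheme used in our computations.

```python
import numpy as np

def simulate_tac(u, q1, q2, T, n_eta=50):
    """Sketch: forward-solve (4.36)-(4.40) for BrAC input samples u on a
    uniform time grid over [0, T]; returns the TAC output y(t) = x(t, 0)."""
    n_t = len(u)
    dt, deta = T / (n_t - 1), 1.0 / n_eta
    x = np.zeros(n_eta + 1)              # zero initial condition (4.39)
    y = np.zeros(n_t)
    A = np.zeros((n_eta + 1, n_eta + 1))
    for i in range(1, n_eta):            # implicit Euler for (4.36)
        A[i, i - 1] = A[i, i + 1] = -q1 * dt / deta**2
        A[i, i] = 1 + 2 * q1 * dt / deta**2
    A[0, 0] = q1 / deta + 1.0            # boundary condition (4.37)
    A[0, 1] = -q1 / deta
    A[-1, -1] = q1 / deta                # boundary condition (4.38)
    A[-1, -2] = -q1 / deta
    for k in range(1, n_t):
        b = x.copy()
        b[0], b[-1] = 0.0, q2 * u[k]
        x = np.linalg.solve(A, b)
        y[k] = x[0]                      # output (4.40)
    return y
```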
4.3.1 Fitting the Population Model

Before the population model can be used for the deconvolution of estimated BrAC signals based on TAC signals, we first need to find a suitable parameter distribution to be used in the population model. Here, we follow a two-step process. In the first step, we fit the raw parameters $\mathbf{q} = [q_1, q_2]$ to the individual drinking episodes from the BrAC-TAC dataset (see Chapter 3) using techniques for solving inverse problems for distributed parameter systems that can be found in [12, 16, 18, 67, 74]. Then, in the second step, we remove those episodes for which any of the fitted parameters was larger than 1.5 times the upper boundary of the third quartile for that parameter, and use the remaining parameter values for $\mathbf{q}$ to fit four-parameter beta distributions to $q_1$ and $q_2$ using the method of moments [76]. The probability density function of the four-parameter beta distribution $B(\alpha, \beta, a, c)$ with parameters $\alpha$, $\beta$, $a$, and $c$ is given by

$$f(x; \alpha, \beta, a, c) = \frac{(x - a)^{\alpha - 1}(c - x)^{\beta - 1}}{(c - a)^{\alpha + \beta - 1}B(\alpha, \beta)},$$

where $B(\alpha, \beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha + \beta)}$ is based on the gamma function $\Gamma$. Note that our theory does not require the independence of $q_1$ and $q_2$; however, for simplicity, we assume independence in this treatment. Also note that the choice of a four-parameter beta distribution is not a necessity, but this beta distribution is flexible enough to appropriately model the distribution of the parameters, and it is supported on the compact interval $[a, c]$. Using this process, we obtained the distributions $q_1 \sim \mathrm{Beta}(1.59, 11.00, 0.32, 5.65)$ and $q_2 \sim \mathrm{Beta}(4.41, 9.61, 0.26, 4.11)$. Figure 4.1 shows histograms of the individually fitted parameters $q_i$ and the distribution fit for $i = 1, 2$.

Figure 4.1: Scaled histograms of individually fitted $q_i$ parameters together with probability density functions of the distributional fits ((a) $q_1$; (b) $q_2$).
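As an illustration of the moment-matching step, the sketch below fits a four-parameter beta distribution to a sample of fitted parameter values. It is a simplified sketch: the support $[a, c]$ is taken directly from the sample range, padded by a hypothetical margin, and $\alpha$, $\beta$ are then matched to the first two moments of the rescaled data, whereas a full four-parameter method of moments matches four moments [76].

```python
import numpy as np

def fit_beta4_mom(samples, margin=0.05):
    """Simplified method-of-moments fit of Beta(alpha, beta, a, c):
    support from the padded sample range, alpha/beta from mean and variance
    of the data rescaled to (0, 1)."""
    a = samples.min() - margin
    c = samples.max() + margin
    z = (samples - a) / (c - a)
    m, v = z.mean(), z.var()
    common = m * (1 - m) / v - 1      # standard two-parameter beta MoM
    return m * common, (1 - m) * common, a, c
```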
4.3.2 Numerical Demonstration of Theorem 8

In Theorem 8 above, we show in particular that the BrAC estimates $u^N_n$ form an approximating consistent sequence of estimators for $u^*$. Here, we want to demonstrate the proven behavior numerically. To this end, we use a drinking episode from the dataset and find the best-fitting parameters $q_1$ and $q_2$. Based on that, we define four-parameter beta distributions that are supported in a small neighborhood of these parameters: we set $q_1 \sim \mathrm{Beta}(1.1, 1.1, 0.7, 0.74)$ and $q_2 \sim \mathrm{Beta}(1.1, 1.1, 1.39, 1.43)$. Then, we sample $n$ times from the distributions for $q_1$ and $q_2$ and solve the deconvolution problem with parameters $N = (n_s, m_1, m_2)$, where $n_s$, $m_1$, $m_2$ denote the discretization parameters introduced in Section 4.1. The results are shown in Figure 4.2. We see that for a small number of samples and a coarse discretization, the estimated $u^N_n$, given as a mean signal and 90% credible bands, deviates significantly from the true $u^*$ that is based on the chosen distribution; compare Figure 4.2b. Then, as $N$ and $n$ increase, the estimator $u^N_n$ converges to $u^*$ in the sense of Definition 1. Note that letting $N \to \infty$ is not feasible on a computer; however, already for $N = (7, 7, 7)$ we obtain the desired result.

4.3.3 Clustering Methods

As can be seen in Figure 4.1, the estimated distributions for the parameters $q_i$ naturally exhibit a certain variance. This makes sense, as we consider the parameter distributions to reflect the distribution of the parameters in the population. However, this one-size-fits-all approach, where one distribution is fit to the entire population, leads to inaccurate results in the deconvolution stage: the deconvolved BrAC signal will reflect the average BrAC signal of the population that corresponds to the given TAC signal. Given the observed variance in the distributions, this average signal may differ quite significantly from the true BrAC signal. Hence, we want to stratify the entire population based on a clustering of the parameters $q_i$ that were fit to the individual drinking episodes as explained in the previous section. The idea behind this approach is that the distributions fit to the clustered parameters may exhibit a lower variance and thus better reflect the alcohol metabolism of the respective sub-population. In turn, this would lead to more accurate results for the deconvolved BrAC signals.

We base our clustering of the parameters $\mathbf{q}_i = (q_1^i, q_2^i)$, $i = 1, 2, \ldots, n$, on the k-means algorithm [77]. Given a number of clusters $k$, this algorithm aims at minimizing the error function

$$\sum_{i=1}^k \sum_{\mathbf{q}_j \in C_i} \|\mathbf{q}_j - m_i\|^2, \quad (4.41)$$

where $C_i$ describes the $i$-th cluster, $m_i$ is the mean or center of this cluster, and $\mathbf{q}_j$ is the $j$-th observation of $\mathbf{q}$. Usually, Lloyd's algorithm [77] is used to implement the k-means clustering. Here, to initialize the algorithm, $k$ cluster centers $m_j^{(0)}$, $j = 1, \ldots, k$, are randomly chosen from the data points. Then, every data point $\mathbf{q}_i$ is associated to the cluster whose center is closest to it, that is,

$$C_i^{(t)} = \big\{\mathbf{q}_j : \|\mathbf{q}_j - m_i^{(t)}\| \leq \|\mathbf{q}_j - m_\ell^{(t)}\| \ \forall \ell = 1, \ldots, k\big\},$$

for steps $t = 1, 2, \ldots$. Finally, based on the observations $\mathbf{q}_j$ that are associated to cluster $C_i$, the new center of $C_i$ is reevaluated as

$$m_i^{(t+1)} = \frac{1}{|C_i^{(t)}|}\sum_{\mathbf{q}_j \in C_i^{(t)}} \mathbf{q}_j.$$

The k-means algorithm has some downsides. It is known that the algorithm does not necessarily find the best solution to the objective (4.41), but rather a locally optimal solution. In particular, the k-means algorithm is quite sensitive with respect to the initial positions of the cluster centers that are randomly drawn. A popular remedy is to run the algorithm $M$ times with random initial cluster centers and select the best clustering among those $M$ attempts. Here, we follow the approach in [78], where a modified global k-means algorithm is proposed. As introduced in [79], the global k-means algorithm is a deterministic algorithm that makes use of k-means as a local search procedure. This version of k-means yields optimal or near-optimal solutions. Further, while solving a clustering problem with $L$ clusters, it also solves all clustering problems with $\ell \leq L$ clusters. In this algorithm, first one cluster is found; this corresponds to $k = 1$. Obviously, the optimal cluster center $c_1$ for this problem is the mean of the data points. Then, for $k = 2$, the k-means algorithm is run $n$ times with initial cluster centers $(c_1, \mathbf{q}_i)$, $i = 1, 2, \ldots, n$. Among those $n$ executions, the optimal cluster center $c_2$ is found. For $k = \ell$ clusters, k-means is executed $n$ times with initial centers $(c_1, c_2, \ldots, c_{\ell-1}, \mathbf{q}_i)$, $i = 1, 2, \ldots, n$, and the best cluster center $c_\ell$ is determined; a sketch of this procedure is given below. As can be seen from this procedure, the global k-means algorithm is not suitable for datasets of medium or large size. However, since the dataset considered here is fairly small, we can apply this algorithm in a reasonable amount of time. Sometimes the global k-means algorithm produces singleton clusters that contain only a single data point. As this behavior is unwanted, a modified version of global k-means is designed in [78] to avoid those singleton clusters.
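The sketch below implements the basic global k-means scheme of [79] just described, with plain Lloyd iterations as the local search; it does not include the singleton-avoiding modification of [78], and the helper names and iteration caps are illustrative assumptions.

```python
import numpy as np

def kmeans(X, centers, n_iter=100):
    """Plain Lloyd iteration from given initial centers."""
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        new = np.array([X[labels == i].mean(0) if np.any(labels == i)
                        else centers[i] for i in range(len(centers))])
        if np.allclose(new, centers):
            break
        centers = new
    return centers, ((X - centers[labels]) ** 2).sum()

def global_kmeans(X, k_max):
    """Grow the center set one cluster at a time, trying every data
    point as the candidate position of the new center."""
    centers = X.mean(0, keepdims=True)      # optimal 1-cluster solution
    solutions = {1: centers}
    for k in range(2, k_max + 1):
        best = None
        for x in X:
            c, err = kmeans(X, np.vstack([centers, x]))
            if best is None or err < best[1]:
                best = (c, err)
        centers = best[0]
        solutions[k] = centers              # solves every l <= k_max
    return solutions
```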
See [78] for the precise algorithm; we use this algorithm for our results below.

Using the clustering techniques described above, we establish $k$ population models, where $k$ denotes the number of clusters or subpopulations. Then, for every drinking episode $j$, $j = 1, \ldots, 267$, from the dataset, based on the parameter vector $\mathbf{q}_j$ that was fit to this individual episode, we associate the episode $j$ to one of the $k$ clusters and use the corresponding population model to deconvolve an estimated BrAC signal from the TAC signal. That way, we want to see whether an increase in the number of clusters leads to a better prediction ability of the model. Table 4.1 shows the results for $k = 1, 2, \ldots, 6$ clusters. In each case, we considered k-means clustering on $(q_1, q_2)$. The quantities to compare were chosen to be the average relative area under the curve (AUC) error between the true BrAC signal and the estimated mean BrAC signal, and the average relative $L^2$ error between the true BrAC signal and the estimated mean BrAC signal. For comparison, note that the average relative errors using the deterministic model with individually fit values for $\mathbf{q}$ are 0.0596 in the case of AUC and 0.3253 in the case of $L^2$. Furthermore, based on the population model approach, we compute 90% error bands for the estimated BrAC signal to get lower and upper bounds for the true BrAC signal. In Table 4.1, we also show the average $L^2$ distance between the upper band and the lower band.

Table 4.1: Average relative AUC and L² errors of the mean BrAC estimate, and average L² distance between the 90% error bands, depending on the number of clusters k

 k | AUC    | L²     | L² dist.
 1 | 0.3350 | 0.5073 | 0.2803
 2 | 0.2732 | 0.4925 | 0.3345
 3 | 0.3062 | 0.4961 | 0.3271
 4 | 0.2441 | 0.4426 | 0.2180
 5 | 0.2264 | 0.4379 | 0.2172
 6 | 0.2114 | 0.4295 | 0.1350

As can be seen, the clustering indeed leads to lower average errors in the BrAC prediction. Using six clusters, the average relative AUC error decreases by about 37% compared to one cluster, and the average relative $L^2$ error decreases by about 15%. Note that this error reduction is not monotone in the number of clusters; for $k = 3$ clusters, the results are not as good as for $k = 2$ clusters. While the results for the estimated mean BrAC signal are already promising, it is even more interesting to observe the upper and lower error bands. Here, the $L^2$ distance between the upper and the lower band decreases by 52%. Hence, the BrAC estimate can be located within much narrower error bands, which is a considerable benefit in this application. This observation is demonstrated further in Figure 4.3. On the left side, three drinking episodes from the dataset are shown together with the mean estimated BrAC signal and 90% error bands based on one cluster. On the right side, we show the same episodes and estimates based on $k = 5$ clusters. Notice the substantial narrowing of the error bands.

Our results show that a clustering strategy can be effectively applied to improve the mean BrAC estimates and, more importantly, to shrink the width of the error bands. Here, even relatively low numbers of clusters yield great improvements. What remains, of course, is to predict the cluster without directly observing the parameters $\mathbf{q}$. That way, a new drinking episode with a corresponding TAC signal could be assigned to a cluster for an improved BrAC estimation. A possible strategy here is to model the cluster as the dependent variable in a regression model whose independent variables are features of the TAC signal and covariates from the drinking episode, e.g., body covariates of the individual. Such a regression model is the subject of current research.
Figure 4.2: Numerical consistency result. Panel (a) shows the deconvolved BrAC signal with 90% error bands based on the true distribution and discretization parameters N = (10, 10, 10). Panels (b)-(l) show the corresponding results based on n samples of the true distribution and growing values of N: N = (2, 2, 2) with n = 10 and 100; N = (3, 3, 3) with n = 250 and 500; N = (4, 4, 4) with n = 1,000 and 2,500; N = (5, 5, 5) with n = 5,000 and 10,000; N = (6, 6, 6) with n = 25,000 and 50,000; and N = (7, 7, 7) with n = 100,000.

Figure 4.3: Left (panels (a), (c), (e)): three drinking episodes from the dataset with corresponding mean BrAC estimates and 90% error bands based on one cluster. Right (panels (b), (d), (f)): the same three drinking episodes with corresponding BrAC estimates and 90% error bands based on k = 5 clusters.

Chapter 5

A Physics-Informed Neural Network for the Estimation of the Distribution of Random Parameters

In this chapter, we build upon the mathematical model (4.36)-(4.40) and the aforementioned works [21-24] by considering a nonparametric distribution of the unknown parameters to yield credible bands for the deconvolved BAC/BrAC signal. Here, however, we base our methodology on recent developments in the field of neural networks [80]. Goodfellow et al. [81] introduced generative adversarial networks (GANs) in 2014 as a class of neural networks that is able to generate new data with the same statistics as the training set. In this data-driven approach, large amounts of data are required to train the model. In clinical alcohol research, such data is typically not available: due to the difficulties mentioned in Chapter 1, the acquisition of drinking session data is expensive. Moreover, by the very nature of the problem, only blood/breath alcohol and transdermal alcohol can be measured; data in the domain between the blood vessels and the skin surface is unobservable. To account for this situation, we outline and use a latent variable physics-informed neural network (PINN) model. In [82], the authors propose a framework for uncertainty propagation in physical systems that allow only for small training sets, but where prior information is available in the form of governing physical laws. We adapt this framework to our problem of the conversion from TAC to BrAC. Using the introduced diffusion equation model for the alcohol transport through the epidermal layer of the skin, we train a physics-informed generative adversarial network with available drinking session data to yield estimates for the distribution of the unknown parameters. Then, in a second step, we employ a simple PINN for the deconvolution of the BAC/BrAC signal.

In Section 5.1, we describe the probabilistic formulation and the generative adversarial network used in detail, and in Section 5.2, we propose a physics-informed network for the deconvolution of the BAC/BrAC signal. Then, in Section 5.3, we demonstrate the performance of our approach through numerical studies using generated and real drinking session data.
5.1 Physics-Informed Adversarial Learning

In [82], the authors put forth a probabilistic formulation for propagating uncertainty through physics-informed neural networks using latent variable models of the form

$$p(x \mid t, \eta, z), \quad z \sim p(z), \quad \text{s.t. } x_t + \mathcal{N}_\eta x = 0.$$

Here, $\mathcal{N}_\eta$ is a general differential operator and $x(t, \eta)$ is a random field depending on the location variable $\eta$ and the time variable $t$, represented as $x(t, \eta) = x_\theta(t, \eta)$ by a deep neural network with parameter set $\theta$. The fundamental idea behind this approach is to condense all random factors and uncertainty into a single (possibly multidimensional) latent variable. That way, one can sample from the latent variable and propagate the samples through the neural network to yield samples of the random field that reflect the uncertainty. To this end, the use of a generative adversarial net [81] is proposed in [82]. Fundamentally, a GAN consists of two competing neural nets: the generator net tries to produce new data that is distributed like the training data. This new data is presented to the discriminator net, which classifies each sample as either original data or a generated sample. Hence, the generator aims at fooling the discriminator, and the discriminator tries not to be fooled. In order to make this process work, an appropriate choice of loss functions is required.

5.1.1 Kullback-Leibler Based Training

Based on findings in [83] and [82], we use a learning mechanism for the generator that tries to match the joint distribution of the observed data, $q(t, \eta, x)$, with the joint distribution of the generated data, $p_\theta(t, \eta, x)$ (the subscript $\theta$ denotes the parameters of the generator net). Such a matching can be achieved by minimizing the reverse Kullback-Leibler divergence of $p_\theta(t, \eta, x)$ and $q(t, \eta, x)$. The (reverse) Kullback-Leibler divergence is given as

$$\mathrm{KL}\big(p_\theta(t, \eta, x)\,\|\,q(t, \eta, x)\big) := \mathbb{E}_{p_\theta(t, \eta, x)}\left[\log\frac{p_\theta(t, \eta, x)}{q(t, \eta, x)}\right]. \quad (5.1)$$

This can further be written as

$$\mathrm{KL}\big(p_\theta(t, \eta, x)\,\|\,q(t, \eta, x)\big) = -H\big(p_\theta(t, \eta, x)\big) - \mathbb{E}_{p_\theta(t, \eta, x)}\big[\log q(t, \eta, x)\big], \quad (5.2)$$

where $H(p_\theta(t, \eta, x)) = -\mathbb{E}_{p_\theta(t, \eta, x)}[\log p_\theta(t, \eta, x)]$ denotes the entropy of the generator. In [82], the authors show that minimizing $-\lambda H(p_\theta(t, \eta, x)) - \mathbb{E}_{p_\theta(t, \eta, x)}[\log q(t, \eta, x)]$ with $\lambda > 1$, instead of the pure Kullback-Leibler divergence, introduces an entropic regularization that mitigates the common issue of mode collapse [81].

When minimizing (5.1) with respect to the generator parameters $\theta$, we face the issue that we only have samples from the $p_\theta$ and the $q$ distribution; the distributions themselves remain unknown. A general technique to approximate the density ratio of two distributions given only samples is described in [82]. This approach is based on the discriminator network $T$ that acts as a binary classifier. Given $N$ data points drawn from $p_\theta(t, \eta, x)$ labeled $y = +1$ and $N$ data points drawn from $q(t, \eta, x)$ labeled $y = -1$, the probabilities can be written as conditionals

$$p_\theta(t, \eta, x) = \alpha(t, \eta, x \mid y = 1), \qquad q(t, \eta, x) = \alpha(t, \eta, x \mid y = -1).$$

Now, using Bayes' rule, the density ratio can be computed as

$$\frac{p_\theta(t, \eta, x)}{q(t, \eta, x)} = \frac{\alpha(t, \eta, x \mid y = 1)}{\alpha(t, \eta, x \mid y = -1)} = \frac{\alpha(y = 1 \mid t, \eta, x)\,\alpha(t, \eta, x)/\alpha(y = 1)}{\alpha(y = -1 \mid t, \eta, x)\,\alpha(t, \eta, x)/\alpha(y = -1)} = \frac{\alpha(y = 1 \mid t, \eta, x)}{\alpha(y = -1 \mid t, \eta, x)} = \frac{\alpha(y = 1 \mid t, \eta, x)}{1 - \alpha(y = 1 \mid t, \eta, x)} = \frac{T(t, \eta, x)}{1 - T(t, \eta, x)},$$

where the class priors cancel because both classes contribute the same number $N$ of samples. Another problem that arises when minimizing (5.2) is the computation of $H(p_\theta(t, \eta, x))$, due to the fact that $p_\theta(t, \eta, x)$ is unknown a priori. Hence, in [83], a computable lower bound for the entropy term is derived.
Introducing a variational distribution $q(z \mid t, \eta, x)$ represented by an encoder net $q_\phi(z \mid t, \eta, x)$ (the subscript $\phi$ denotes the parameters of the encoder net), this bound reads

$$H\big(p_\theta(t, \eta, x)\big) \geq H(z) + \mathbb{E}_{p_\theta(t, \eta, x)}\big[\log q_\phi(z \mid t, \eta, x)\big]. \quad (5.3)$$

Using this entropy bound and the method for estimating the density ratio based on samples, the following loss functions for the minimization of the reverse Kullback-Leibler divergence can be found [80, 82, 83]:

$$\mathcal{L}_D(\psi) = \mathbb{E}_{q(t, \eta)p(z)}\Big[\log\sigma\big(T_\psi(t, \eta, x_\theta(t, \eta, z))\big)\Big] + \mathbb{E}_{q(t, \eta, x)}\Big[\log\big(1 - \sigma(T_\psi(t, \eta, x))\big)\Big], \quad (5.4)$$

$$\mathcal{L}_G(\theta, \phi) = \mathbb{E}_{q(t, \eta)p(z)}\Big[T_\psi\big(t, \eta, x_\theta(t, \eta, z)\big) + (1 - \lambda)\log q_\phi\big(z \mid t, \eta, x_\theta(t, \eta, z)\big)\Big]. \quad (5.5)$$

Here, the subscript $D$ denotes the discriminator loss, while $G$ denotes the generator loss; $\sigma(y) = 1/(1 + e^{-y})$ is the logistic sigmoid function, and the subscript $\psi$ denotes the parameters of the discriminator network. The subscripts on the expectation symbolize the corresponding distributions; that is, the subscript $q(t, \eta)p(z)$ means that $t$ and $\eta$ are to be sampled from the empirical data distribution and $z$ is to be sampled from its prior $p(z)$. It is clear that the generator aims to reduce the Kullback-Leibler divergence as much as possible, i.e., it strives for a minimum. The discriminator, on the other hand, tries to maximize its ability to correctly classify data samples and generated samples. That way, the learning of the network is characterized as a minimax game between the two players. It can be shown [81] that this minimax game between the generator and the discriminator has a global optimum if the generated distribution matches the empirical data distribution. Although there is no theoretical guarantee that this optimum is achieved, one typically trains the model by alternating between a minimization of the generator loss over the parameters $\theta$ and $\phi$ and a maximization of the discriminator loss over the parameters $\psi$. In order to adapt the capacity of the generator and the discriminator, one can also fix a ratio of optimizing iterations, i.e., perform $k$ steps of optimizing the generator before $\ell$ steps of optimizing the discriminator.
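A compact TensorFlow sketch may clarify how (5.4) and (5.5) are assembled in practice. The model names, the Gaussian form of the encoder, and the concatenated input layouts are assumptions for illustration; the Gaussian log-density is evaluated only up to an additive constant, which does not affect the optimization.

```python
import tensorflow as tf

def adversarial_losses(generator, encoder, discriminator, t, eta, x_data, lam=1.5):
    """Sketch of the losses (5.4)-(5.5); lam corresponds to lambda > 1."""
    z = tf.random.normal((tf.shape(t)[0], 1))              # z ~ p(z)
    x_fake = generator(tf.concat([t, eta, z], axis=1))     # x_theta(t, eta, z)
    T_fake = discriminator(tf.concat([t, eta, x_fake], axis=1))
    T_real = discriminator(tf.concat([t, eta, x_data], axis=1))
    # discriminator loss (5.4); note log(1 - sigmoid(y)) = log_sigmoid(-y)
    loss_D = tf.reduce_mean(tf.math.log_sigmoid(T_fake)) + \
             tf.reduce_mean(tf.math.log_sigmoid(-T_real))
    # generator/encoder loss (5.5), with a Gaussian encoder q_phi(z | t, eta, x)
    mu, log_var = tf.split(encoder(tf.concat([t, eta, x_fake], axis=1)),
                           num_or_size_splits=2, axis=1)
    log_q = -0.5 * (log_var + (z - mu) ** 2 / tf.exp(log_var))
    loss_G = tf.reduce_mean(T_fake + (1.0 - lam) * log_q)
    return loss_D, loss_G
```

In the alternating scheme described above, `loss_D` would be maximized over the discriminator variables and `loss_G` minimized over the generator and encoder variables.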
5.1.2 Integration of the Physical Model

Up to this point, the proposed method resembles a classical adversarial neural network. Typically, those networks are trained with huge amounts of data. In our case, however, due to the expense of data collection and the unobservability of data in the regime $\eta \neq 0$, we only have a small training dataset available. Thus, the purely data-driven approach of GANs will no longer work. We therefore resort to the idea introduced in [26, 27] and more recently in [28-33, 82] of augmenting the above loss functions with information obtained from the physics of the problem. This is where the model from Chapter 4, or rather (4.36)-(4.40), proves to be of high value: the strong prior knowledge about the problem in the form of a partial differential equation can be used to train the network. That way, a hybrid between purely data-driven approaches and physics-driven methods is created, a physics-informed neural network.

As a first step, we introduce two additional neural nets, $q^1_\mu(z)$ and $q^2_\nu(z)$, i.e., we input the latent variable into these nets to propagate the uncertainty through the estimates of $q_1$ and $q_2$. Now, the physics of the problem can be integrated in the training process by introducing a PDE-related loss function. To this end, we specify $N_{r_i}$ collocation points in the interior of the domain $\{(t, \eta) : t > 0,\ 0 < \eta < 1\}$, $N_{r_{b1}}$ collocation points on the left boundary $\{(t, \eta) : t > 0,\ \eta = 0\}$, and $N_{r_{b2}}$ collocation points on the right boundary $\{(t, \eta) : t > 0,\ \eta = 1\}$. We then compute the residual of the PDE at these collocation points $(t_i, \eta_i)$ in dependence of the parameters $\theta$ of the generative model and the parameters $\mu$ and $\nu$ of the parameter-estimating networks as

$$\mathcal{L}_{PDE}(\theta, \mu, \nu) = \frac{1}{N_{r_i}}\sum_{i=1}^{N_{r_i}}\left[\frac{\partial x_\theta}{\partial t}(t_i, \eta_i) - q^1_\mu\frac{\partial^2 x_\theta}{\partial \eta^2}(t_i, \eta_i)\right]^2 + \frac{1}{N_{r_{b1}}}\sum_{i=1}^{N_{r_{b1}}}\left[q^1_\mu\frac{\partial x_\theta}{\partial \eta}(t_i, 0) - x_\theta(t_i, 0)\right]^2 + \frac{1}{N_{r_{b2}}}\sum_{i=1}^{N_{r_{b2}}}\left[q^1_\mu\frac{\partial x_\theta}{\partial \eta}(t_i, 1) - q^2_\nu u(t_i)\right]^2. \quad (5.6)$$

Note that in this formulation, we treat the residual as a deterministic value, i.e., we set $x_\theta(t, \eta, z) = x_\theta(t, \eta)$ as well as $q^1_\mu(z) = q^1_\mu$ and $q^2_\nu(z) = q^2_\nu$. The gradients appearing in these residuals can be efficiently evaluated thanks to recent advances in automatic differentiation [84]; therefore, no discretization schemes for the differential operators are required. Also note that the initial condition (4.39) could be included in this PDE loss; however, we choose to account for it using the Kullback-Leibler divergence based training process of the generator, see Section 5.3.

Now, we augment the generator loss with a scaled version of the PDE loss as

$$\mathcal{L}_G(\theta, \phi) + \beta\mathcal{L}_{PDE}(\theta, \mu, \nu). \quad (5.7)$$

That way, for $\beta > 0$, the introduced PDE residual acts as a regularization term that leads the generator to create samples that approximately satisfy the diffusion equation model (4.36)-(4.40). The precise choice of $\beta$ balances the influence of the data against that of the physics. As shown in Section 5.3, the value of $\beta$ influences the result, so it has to be chosen experimentally to yield a good balance between data and physics. We also want to emphasize that the augmentation of the generator loss with the physics information is the core element in the estimation of $q_1$ and $q_2$: by minimizing the combined loss, the network parameters $\mu$ and $\nu$ are adjusted such that the obtained estimates $q^1_\mu(z)$ and $q^2_\nu(z)$ match the training data as well as the physics model (4.36)-(4.40) in the best possible way. Putting things together, we ultimately obtain the following minimax problem for the generator and the discriminator:

$$\max_\psi\ \mathcal{L}_D(\psi), \qquad \min_{\theta, \phi, \mu, \nu}\ \mathcal{L}_G(\theta, \phi) + \beta\mathcal{L}_{PDE}(\theta, \mu, \nu). \quad (5.8)$$

Let us comment on how the observable data for TAC and BAC/BrAC enter the training process. In the diffusion model (4.36)-(4.40), the TAC data acts as a Dirichlet output; we can thus directly use the TAC data as training data for the generator. The handling of the BAC/BrAC data, however, is more involved. The BAC/BrAC is a Neumann-type input of (4.36)-(4.40), and so it is not represented by $x_\theta(t, \eta, z)$ for any values of $t$ and $\eta$. Hence, we incorporate the BAC/BrAC data only through the PDE loss as in (5.6). So the model is encouraged to match the distribution of the TAC data by minimizing the Kullback-Leibler divergence, and it is encouraged to match the BAC/BrAC data and to obey the physical model by minimizing the residuals of the equations. The interplay between those two objectives is governed by the tuning parameter $\beta$.
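The following sketch indicates how the residuals in (5.6) can be evaluated with automatic differentiation in TensorFlow, using nested gradient tapes for the second derivative in $\eta$; the helper names, the latent-sampling convention, and the handling of $q^1_\mu$, $q^2_\nu$ as precomputed scalars are illustrative assumptions.

```python
import tensorflow as tf

def pde_residual_loss(generator, q1, q2, t_int, eta_int, t_b1, t_b2, u_b2, sample_z):
    """Sketch of the PDE loss (5.6); generator maps (t, eta, z) to x_theta."""
    # interior residual of (4.36): x_t - q1 * x_etaeta
    z = sample_z(tf.shape(t_int)[0])
    with tf.GradientTape() as outer:
        outer.watch(eta_int)
        with tf.GradientTape(persistent=True) as inner:
            inner.watch([t_int, eta_int])
            x = generator(tf.concat([t_int, eta_int, z], axis=1))
        x_t = inner.gradient(x, t_int)
        x_eta = inner.gradient(x, eta_int)
    x_etaeta = outer.gradient(x_eta, eta_int)
    r_int = tf.reduce_mean((x_t - q1 * x_etaeta) ** 2)

    def value_and_eta_grad(t, eta):
        with tf.GradientTape() as g:
            g.watch(eta)
            xb = generator(tf.concat([t, eta, sample_z(tf.shape(t)[0])], axis=1))
        return xb, g.gradient(xb, eta)

    x0, x0_eta = value_and_eta_grad(t_b1, tf.zeros_like(t_b1))  # (4.37)
    r_b1 = tf.reduce_mean((q1 * x0_eta - x0) ** 2)
    x1, x1_eta = value_and_eta_grad(t_b2, tf.ones_like(t_b2))   # (4.38)
    r_b2 = tf.reduce_mean((q1 * x1_eta - q2 * u_b2) ** 2)
    return r_int + r_b1 + r_b2
```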
5.1.3 Estimating the Parameter Distribution

After training the generative model, it remains to estimate the resulting distribution of $(q_1, q_2)$. By design, $q^1_\mu$ and $q^2_\nu$ depend on the latent variable $z$. Thus, we can sample from the latent variable and pass these samples through the networks for $q_1$ and $q_2$ in order to obtain samples of these parameters. Using a sufficiently large number of samples, we can create an estimate of the distribution of $(q_1, q_2)$. Hence, the presented method not only estimates the distribution of $(q_1, q_2)$, but the trained networks can directly be used to generate samples of this distribution by a simple forward pass. That way, we can avoid the use of sampling algorithms such as Markov chain Monte Carlo [85].

5.1.4 Posterior Distribution of the Latent Variable

As mentioned in Section 5.1.1, the entropic regularization requires the introduction of an additional encoder network $q_\phi(z \mid t, \eta, x)$. While this might seem like a complication of the model, the encoder offers a remarkable advantage [82]. During the training process, the encoder network learns the best, i.e., most likely, latent variables given the data. So, based on the TAC and BAC/BrAC data, the encoder network yields a posterior distribution over the latent variable that corresponds to the given data. Moreover, since the encoder network is involved in the training of the generator, which is physics-informed, the posterior over the latent variable will also be physics-informed. Thus, as a byproduct, we obtain a posterior distribution of the latent variable that is both data- and physics-informed. Especially in the context of our given problem, this mechanism is quite appealing. Instead of direct sampling from the prior over the latent variable and subsequent passing through the parameter networks $q^1_\mu$ and $q^2_\nu$ as described in Section 5.1.3, this allows us to pass all available data through the encoder $q_\phi$ to obtain a posterior distribution of the latent variable that corresponds to the data. This distribution can then be fed to the parameter networks to yield an estimated distribution of $(q_1, q_2)$. We examine the use of this posterior latent distribution in Section 5.3.

5.1.5 Network Design

In [82], the authors use a neural network architecture where the generator, the encoder, and the discriminator are roughly of equal capacity. In our case, however, the situation is quite different. The set-up of the problem only allows for observable data at $\eta = 0$, i.e., the TAC data, and additionally the BAC/BrAC data. Points inside the domain are inherently unobservable, and hence they cannot be used as training data. Thus, the training data is very sparse. A set-up allowing for only few training data favors the discriminator network, i.e., it is easy to classify samples into generated samples and real samples. Then, when the discriminator network is too strong, the generator gains little information from the discriminator, and the ability of the generator network to learn is impaired [81]. To account for this, we use a discriminator network with a low capacity compared to the other networks. This can be achieved in two different ways. First, we can decrease the capacity of the discriminator by choosing a network design that involves fewer hidden layers and fewer neurons per layer. Secondly, we can strengthen the generator by allowing more learning steps in the alternating learning process; that way, we enable the generator to train a certain number of steps for a given discriminator before the discriminator improves further. For the precise choice of the network design, see Section 5.3.
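Before moving on to the deconvolution network, we note that, in code, the sampling procedure of Section 5.1.3 amounts to a single forward pass. The sketch below assumes trained parameter networks q1_net and q2_net (hypothetical names) mapping the latent variable to parameter samples.

```python
import tensorflow as tf

z = tf.random.normal((100_000, 1))        # prior p(z): standard normal
q1_samples = q1_net(z)                    # forward pass through q^1_mu
q2_samples = q2_net(z)                    # forward pass through q^2_nu
q1_mean = tf.reduce_mean(q1_samples)      # empirical summary statistics
q1_std = tf.math.reduce_std(q1_samples)
```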
5.2 Deconvolution of BAC/BrAC from the TAC Signal

After estimating the distribution of $(q_1, q_2)$, the next step is to deconvolve the BAC/BrAC signal from the TAC signal. Here, we employ a simple physics-informed neural network for the deconvolution process. Given the TAC signal, the network outputs $x_\varphi(t, \eta; q_1, q_2)$, the (unobservable) alcohol level in the epidermal layer. To this end, we use the only available training data, the TAC signal consisting of $N_T$ data points, and set up the TAC-related loss function as

$$\mathcal{L}_T(\varphi) = \frac{1}{N_T}\sum_{i=1}^{N_T}\big(x_\varphi(t_i, 0) - y(t_i)\big)^2.$$

This way, the network is encouraged to match the provided TAC signal at $\eta = 0$. Using the penalty approach for the incorporation of the PDE described above, we augment the loss by a PDE-related loss

$$\mathcal{L}_{PDE}(\varphi) = \frac{1}{N_{r_i}}\sum_{i=1}^{N_{r_i}}\left[\frac{\partial x_\varphi}{\partial t}(t_i, \eta_i) - q_1\frac{\partial^2 x_\varphi}{\partial \eta^2}(t_i, \eta_i)\right]^2 + \frac{1}{N_{r_{b1}}}\sum_{i=1}^{N_{r_{b1}}}\left[q_1\frac{\partial x_\varphi}{\partial \eta}(t_i, 0) - x_\varphi(t_i, 0)\right]^2. \quad (5.9)$$

The complete loss is now given as

$$\mathcal{L} = \mathcal{L}_T + \beta\mathcal{L}_{PDE}.$$

Once the network is trained for the specific drinking episode using the TAC signal, the BAC/BrAC signal can then be estimated using equation (4.38) and automatic differentiation [84]. Note that $q_1$ and $q_2$ are inputs of the network and are included in the training process. So, in order to obtain BAC/BrAC estimates for varying values of $q_1$ and $q_2$, only a simple forward pass through the network is required. This enables us to directly use the available sample of the joint distribution of $(q_1, q_2)$ to produce BAC/BrAC estimates based on this sample. Hence, it is easy and time-efficient to come up with credible bands; see Section 5.3.

5.3 Numerical Results

5.3.1 Artificial Training Data

As a first step of our numerical studies, we investigate the performance of the model on generated training data. Although we are using artificial training data, we still want to make the test case realistic in the sense that we use data that resemble naturalistic drinking data. To this end, we consider a BrAC signal that reflects the dynamics of real drinking sessions and choose

$$y(t) = \begin{cases} 0.1\sin\left(\dfrac{2\pi t}{10}\right), & t \leq 5, \\ 0, & t > 5. \end{cases} \quad (5.10)$$

Here, the amplitude of 0.1 is in reasonable agreement with naturalistic drinking episodes, and the sine function represents a sharp rise followed by a sharp decline to zero, after which the signal remains at zero. Figure 5.1a shows this generated BrAC signal together with the corresponding TAC signal for various values of $q_1$ and $q_2$. In Figure 5.1b, we display a typical drinking session recorded in the lab; one can easily see the similarity between generated and real data.

Figure 5.1: Generated drinking episodes for varying parameter values (left) and an example of a real-data drinking episode (right).

5.3.1.1 Estimating Single Parameter Values

First, we consider the performance of the method using single drinking sessions. To this end, we fix deterministic values for $q_1$ and $q_2$ and compute the resulting TAC signal corresponding to the BrAC input as in (5.10). Using this single drinking session, we fit the model and try to recover the correct values for $q_1$ and $q_2$. We note that this deterministic use of the model contradicts its design purpose of handling uncertainty; still, the model should be able to handle this case. When solving the forward model for the TAC signal, we use 1,000 discretization points in both the $\eta$- and $t$-directions. Hence, we have $N_u = 1{,}000$ training data points and augment this by an additional $N_i = 1{,}000$ initial training data points.
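The artificial input (5.10) is easy to reproduce. The sketch below generates the BrAC signal on a uniform time grid and indicates how a forward solver of (4.36)-(4.40), such as the finite-difference sketch in Section 4.3, could (hypothetically) be used to produce the matching TAC training signal; the grid sizes are illustrative.

```python
import numpy as np

t = np.linspace(0.0, 10.0, 1000)     # uniform time grid, 1,000 points
brac = np.where(t <= 5.0,            # the piecewise signal (5.10)
                0.1 * np.sin(2 * np.pi * t / 10.0), 0.0)
# a matching TAC training signal can be generated with any forward
# solver of (4.36)-(4.40), e.g.:
# tac = simulate_tac(brac, q1=1.0, q2=0.5, T=10.0)
```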
Then, we choose $N_{r_i} = N_{r_{b1}} = 50{,}000$ collocation points for the interior and the TAC boundary, and use the aforementioned discretization points of the forward model as our $N_{r_{b2}} = 1{,}000$ collocation points on the BrAC boundary. Once the model has been trained, we draw 100,000 samples from a standard normal distribution for the latent variable $z$ and propagate the samples through the networks for $q_1$ and $q_2$. In Table 5.1, we show the mean values of the samples of $q_1$ and $q_2$ in dependence of the parameter $\beta$. We note that besides the value of $\beta$, other factors could affect the convergence to the true parameter values as well. An example of this is the level of discretization we chose for generating the training data: since the generation of training data involves the forward solution of the model (4.36)-(4.40), the step size of the discretization is of importance. Although using 1,000 points in both the $\eta$- and $t$-directions should yield a very high accuracy, this process might add an error term to the true values of $q_1$ and $q_2$. Factors like these, however, are beyond the scope of this study and have not been considered; instead, we focus on the parameter $\beta$, whose value directly affects the interplay between data and physics.

For the analysis of Table 5.1, let us first consider the case $\beta = 1$. The table shows that using this value for $\beta$ already yields estimates of $q_1$ and $q_2$ that are within a 5% margin of the true values. It also appears that some parameter constellations are easier to recover than others; see especially the case of $q_1 = 0.5$, $q_2 = 1$. In addition to these observations, we see that the estimated values for $q_2$ are much closer to the true ones than those for $q_1$; see Section 5.3.1.2 for a further discussion of this observation. Then, as $\beta$ increases to 5, the accuracy of the estimates increases greatly, and most estimates are within 1% of the true values. For even larger values of $\beta$, i.e., 10 and 25, the data show a slight decrease in the estimation accuracy. This suggests that too large values of $\beta$ place too much emphasis on the enforcement of the PDE model while neglecting the data. Based on this numerical experiment, we conclude that choosing $\beta = 5$ appears to yield a good balance between data and physics.

Table 5.1: Converged mean parameter values (q₁, q₂) for varying values of β; column headers give the true parameter pair

  β | q₁=1, q₂=1       | q₁=0.5, q₂=1     | q₁=1, q₂=0.5     | q₁=0.5, q₂=0.5
  1 | (0.9754, 0.9999) | (0.5028, 0.9988) | (0.9649, 0.4901) | (0.4770, 0.4947)
  2 | (0.9739, 0.9989) | (0.4962, 0.9962) | (0.9633, 0.4907) | (0.4725, 0.4923)
  3 | (0.9745, 0.9978) | (0.4976, 0.9989) | (0.9845, 0.4966) | (0.5039, 0.5035)
  4 | (0.9777, 0.9987) | (0.4953, 0.9971) | (0.9617, 0.4910) | (0.4936, 0.5017)
  5 | (0.9894, 1.0026) | (0.4990, 0.9977) | (0.9967, 0.5019) | (0.4904, 0.4996)
 10 | (0.9860, 0.9980) | (0.4978, 0.9966) | (0.9803, 0.4986) | (0.4954, 0.4991)
 25 | (0.9831, 0.9977) | (0.4978, 0.9991) | (0.9788, 0.4971) | (0.4958, 0.4993)

Table 5.2: Converged parameter variances (q₁, q₂) for varying values of β; column headers give the true parameter pair

  β | q₁=1, q₂=1         | q₁=0.5, q₂=1       | q₁=1, q₂=0.5       | q₁=0.5, q₂=0.5
  1 | (2.58e-4, 1.8e-3)  | (2.06e-4, 1.8e-3)  | (1.7e-3, 9.74e-4)  | (2.15e-4, 7.73e-4)
  2 | (3.42e-5, 1.7e-3)  | (1.70e-5, 1.5e-3)  | (6.23e-4, 9.83e-4) | (9.22e-5, 7.82e-4)
  3 | (3.04e-4, 2.0e-3)  | (2.76e-4, 1.6e-3)  | (6.97e-4, 1.0e-3)  | (1.78e-4, 6.71e-4)
  4 | (4.82e-4, 1.5e-3)  | (9.43e-5, 1.9e-3)  | (2.60e-4, 9.62e-4) | (6.30e-5, 8.48e-4)
  5 | (5.96e-4, 1.5e-3)  | (5.76e-5, 1.4e-3)  | (3.67e-5, 5.00e-4) | (3.07e-6, 6.62e-4)
 10 | (8.88e-5, 1.5e-3)  | (3.21e-5, 9.30e-4) | (6.46e-5, 5.10e-4) | (1.09e-4, 5.26e-4)
 25 | (2.66e-5, 9.34e-4) | (1.48e-6, 1.1e-3)  | (1.96e-6, 7.15e-4) | (2.82e-6, 4.94e-4)
Note that this confirms an earlier observation made in [52].

Besides the mean values of the parameter estimates, we are also interested in the parameter distribution. Since our training data in this section consist of a single drinking episode with deterministic values for $q_1$ and $q_2$, the estimated distribution should ideally be a delta distribution. Table 5.2 displays the variance of the estimated parameters analogously to Table 5.1. Here, two key observations can be made. First, the variance of the estimates overall decreases with increasing $\beta$. This makes sense, as a growing $\beta$ shifts the emphasis of the model towards the PDE constraints, and so there is less variation in the PDE parameters $q_1$ and $q_2$. Secondly, we observe that this is especially true for $q_1$: here, the variance decreases much more significantly than for $q_2$, whose variance appears to be much less affected by the value of $\beta$. Still, we emphasize that the variances are small on an absolute scale, and so the estimates are quite concentrated around the mean. In Figure 5.2, we display histograms of the parameter estimates for $q_1$ (upper row) and $q_2$ (lower row) for increasing values of $\beta$, using the true parameter values $q_1 = 1$, $q_2 = 0.5$. It can be seen that the $q_1$ distribution indeed resembles a delta distribution for growing values of $\beta$, whereas the $q_2$ distribution does not. Moreover, the histograms for $q_2$ resemble those of a normal distribution. This makes sense, as $q_1$ and $q_2$ are modeled as functions of the latent variable $z$, whose distribution is normal. Interestingly, the histograms of $q_1$ do not consistently have a normal shape; instead, some of them appear rather one-sided. See the next section for a more thorough investigation of the distribution of the parameter estimates.

Figure 5.2: Histograms of the parameter values for $q_1$ (panels (a)-(g)) and $q_2$ (panels (h)-(n)) using varying values of $\beta$ ($\beta$ = 1, 2, 3, 4, 5, 10, and 25).

5.3.1.2 Distribution Estimation

In this section, we explore the performance of the model on distributed population data. Here, we again choose the BrAC signal as in (5.10). But other than in the previous section, we sample values for $q_1$ and $q_2$ from a predetermined distribution and use the corresponding TAC-BrAC pairs as training data. More specifically, we use normal distributions for $q_1$ and $q_2$ with a mean of $\mu_1 = \mu_2 = 1$ and varying standard deviations of $\sigma = 0.2$, $\sigma = 0.1$, and $\sigma = 0.05$. Using these distributions, we sample 10,000 values for $q_1$ and $q_2$, compute the corresponding TAC signals using the forward model, and use these 10,000 generated drinking episodes for model training. Drawing on the experience from the previous section, we set $\beta = 5$, as this leads to an appropriate balance between the given data and the physical model. Although the choice of normal distributions for $q_1$ and $q_2$ may appear arbitrary, it is a very natural choice in the context of the alcohol biosensor problem: the underlying distributions of $q_1$ and $q_2$ represent the distribution of alcohol metabolism types in the population of all human beings.

Figure 5.3: BrAC signal together with the mean TAC signal of the 10,000 generated drinking episodes using varying standard deviations for the $q_1$ and $q_2$ distributions ((a) $\sigma = 0.05$; (b) $\sigma = 0.1$; (c) $\sigma = 0.2$).
Thus, using the central limit theorem, an argument can be made that the natural distributions for the parameters are indeed normal with unknown mean and variance. This is why we set up the test case to assess our model's performance on the estimation of the mean and variance of normally distributed parameters. We note, however, that our model does not assume a parametric distribution for the parameters; its non-parametric design allows for all other types of distributions as well.

In Figure 5.3 we depict the range of the 10,000 generated drinking sessions for varying values of $\sigma$. We clearly see that as $\sigma$ grows, the variation of the TAC signals increases greatly. In Table 5.3 we summarize the mean values and the standard deviations of the samples of $q_1$ and $q_2$ after training the model. We observe that in all cases the mean values of the samples are very close to the true mean values of the distributions for $q_1$ and $q_2$. Then, looking at the standard deviation of $q_2$, we see remarkably good agreement with the specified values of $\sigma$; this is true for all three cases under consideration. Figure 5.4 confirms this observation. Here, we depict (normalized) histograms of the estimated $q_2$ samples in blue together with the probability density function of the chosen true distribution in red. Here too, there is very good agreement between the true and approximated distribution. Hence, although the variation in the generated TAC signals can be rather large (compare Figure 5.3c), our method is not only able to accurately recover the mean of the distribution, but the distribution itself.

Table 5.3: Converged parameter values for varying values of $\sigma$

$\sigma$ | mean $q_1$ | std. $q_1$ | mean $q_2$ | std. $q_2$
0.05 | 0.9966 | 0.0107 | 0.9998 | 0.0534
0.1 | 0.9991 | 0.0134 | 1.0031 | 0.0954
0.2 | 0.9905 | 0.0281 | 0.9995 | 0.2170

Figure 5.4: Histograms of the approximated $q_2$ distribution (for $\sigma = 0.05$, $\sigma = 0.1$, and $\sigma = 0.2$) together with the pdf of the true distribution.

To confirm that the approximated distributions for $q_2$ are indeed normal, we take advantage of the fact that $q_2$ is modeled as a function of the latent variable $z$, whose distribution is known. Since $z$ has a standard normal distribution, we can write $q_2$ as
\[ q_2 = \sigma z + 1, \tag{5.11} \]
where $\sigma$ is one of the specific values of our test cases. Figure 5.5 shows the graph of $q_2(z)$ for our various test cases on the domain $z \in [-4, 4]$. One can see that the graphs are indeed of the form (5.11). For $\sigma = 0.05$ and $\sigma = 0.1$, $q_2(z)$ is (almost) a straight line with the correct slope and intercept. For $\sigma = 0.2$, the middle part of the graph in the regime $z \in [-3, 3]$ has the correct shape, while the outer parts exhibit some curvature. It is possible that during training, too few instances of $z$ with $|z| > 3$ were drawn to accurately train the $q_2$ network in this regime. Since 99.7% of samples of $z$ have a smaller magnitude, this explanation appears plausible. In conclusion, at least in terms of $q_2$, we see that our method accurately recovers the true parameter distribution.

Figure 5.5: Graphs of the function $q_2(z)$ for the three test cases.

When looking at $q_1$, the results are quite different. Although Table 5.3 shows very accurate mean values for the approximated $q_1$ distribution, the standard deviation is significantly off the true parameter $\sigma$. Figure 5.6 shows the histograms of the estimated distribution for $q_1$ for the various values of $\sigma$.
Clearly, the histograms in Figures 5.6a and 5.6b are not normally shaped, and the histogram in Figure 5.6c might be normally shaped, but its variance is far too low. Hence, while the model is still able to accurately find the mean values of the $q_1$ distribution, it does not recover the underlying distribution itself. While we cannot fully rule out a problem in our implementation of the model, we want to outline a possible reason for the diminished performance of the model on the distribution for $q_1$. Figure 5.7 shows, analogously to Figure 5.3b, the BrAC signal together with 10,000 generated TAC signals when $q_2$ is fixed and the $q_1$ distribution has a standard deviation of $\sigma = 0.1$. Comparing this figure to Figure 5.3b, where both $q_1$ and $q_2$ have $\sigma = 0.1$, we see that the variation in the $q_1$ value barely affects the resulting TAC signals, whereas the variation in $q_2$ has a large impact on them. In other words, it is much easier to detect a change in $q_2$ than in $q_1$. This creates a possible identifiability issue in the sense that it is numerically hard to identify the underlying distribution of $q_1$. Note that there is no identifiability issue in the underlying continuous PDE model (4.36)-(4.40). However, using finitely many sample points of a discretized model in combination with the low impact of varying $q_1$ might lead to numerical difficulties. Further research at this point is required.

Figure 5.6: Histograms of the approximated $q_1$ distribution for $\sigma = 0.05$, $\sigma = 0.1$, and $\sigma = 0.2$.

Figure 5.7: BrAC signal together with the mean TAC signal of 10,000 generated drinking episodes using the standard deviation $\sigma = 0.1$ for the $q_1$ distribution; $q_2$ has been fixed as $q_2 = 1$. The shaded pink region indicates the range of all drinking episodes corresponding to the samples of $q_1$.

5.3.2 Real Training Data

After the previous investigation of the model on artificial training data, we now provide numerical results based on our real dataset. All algorithms were implemented in TensorFlow [86] and the corresponding computations were executed on an NVIDIA Tesla T4 GPU card. The GAN model was trained for 50,000 iterations using the Adam optimizer [87]. The learning rate was set to $10^{-4}$ and the ratio of generator to discriminator updates was set to 10. For the entropic regularization we used the value $\lambda = 1.5$, which was found to be suitable in [82]. If not stated explicitly, the penalty parameter $\beta = 1$ was used. In some examples, however, we used $\beta = 4$ to reflect the fact that the problem is physics-driven rather than data-driven due to the scarce observability of training data. The dimension of the latent variable space was chosen to be 1. The network topology for the generator and the encoder consists of four hidden layers with 50 neurons each, whereas the discriminator network has only one hidden layer with 20 neurons. As mentioned in Section 5.1.5, this accounts for the small number of available training data. The networks for $q_1$ and $q_2$ each have two hidden layers with 50 neurons. We used $N_b = 126{,}252$ boundary training data together with $N_i = 20{,}000$ initial training data, i.e. $N_u = 146{,}252$, and $N_r^i = N_r^{b_1} = N_r^{b_2} = 50{,}000$ collocation points. In every iteration, a batch of 5,000 training data points and 500 collocation points was randomly chosen to compute the loss functions. Once the model was trained, we sampled 100,000 values of the joint distribution of $(q_1, q_2)$ using a standard normal distribution for the prior of the latent variable.
We also fed the $N_u = 146{,}252$ data points to the encoder network to obtain samples of the posterior distribution of the latent variable. These samples were subsequently fed into the networks for $q_1$ and $q_2$ to produce a posterior joint distribution of $(q_1, q_2)$. The deconvolution network was trained for 30,000 iterations using the Adam optimizer [87]. The network has five hidden layers with 50 neurons each. We used $N_r^i = N_r^{b_1} = 50{,}000$ collocation points, of which 500 were chosen randomly in every iteration. The penalty parameter was set to $\beta = 10$.

Figure 5.8a shows the values of the different parts of the loss function during the training of the GAN model. The curves for the Kullback-Leibler divergence and the encoder loss show many outliers, whereas the PDE loss decreases quite steadily. However, due to stochastic gradient descent using batches of data in every iteration, the convergence is not monotonic. The discriminator loss quickly reaches a maximum value and remains constant over the iterations. This makes sense, as the discriminator loss is to be maximized.

Figure 5.8: (a) Values of the loss functions over the number of iterations and (b) values of $q_1$ and $q_2$ over the number of iterations.

Figure 5.8b shows the values of $q_1$ and $q_2$ over the number of iterations. Here, we pass a 5,000-sample of a standard normal through the networks for $q_1$ and $q_2$ and display the mean value. Comparing this to Figure 5.8a, we see that the parameter values start to converge relatively late. The fluctuating shape of the curves in the converged state is due to the probabilistic nature of taking samples.

5.3.2.1 Distribution of the Unknown Parameters

One of the main goals of this work is to estimate the parameter distribution of $(q_1, q_2)$. Figure 5.9 shows histograms of the joint distribution. In Figure 5.9c we depict the distribution using the full data of 126 drinking sessions, whereas Figure 5.9a shows that distribution for 88 selected drinking episodes that share common features. In both cases, the histogram appears to be a curve in the two-dimensional parameter space. This is quite remarkable, especially as it proves to be true when using a two-dimensional latent variable as well. It is also visible that the histogram using 88 drinking episodes is narrower than the histogram using the full data. Thus, the greater variability of the full data is directly reflected in the estimated parameter distribution. The histogram of the posterior latent variable is given in Figure 5.10. We see that this distribution decays much more rapidly than a standard normal. Hence, using this distribution as input for $q_1$ and $q_2$, we expect a more concentrated distribution for those parameters. Figures 5.9d and 5.9b show this joint distribution using samples of the posterior latent distribution as input. The histograms are much more centered around the most likely parameter values. This supports the idea that the posterior latent variable can indeed be used to yield sharper credible bands for the BAC/BrAC signal; see Section 5.3.2.2. In Table 5.4, we summarize the mean values of the parameters as well as the radius of the 90% credible circle for different settings.

Table 5.4: Estimated Parameter Statistics

episodes $m$ | latent variable | $\beta$ | mean $q_1$ | mean $q_2$ | radius of credible circle
88 | prior | 1 | 1.0935 | 0.8528 | 0.5148
88 | posterior | 1 | 1.0639 | 0.8527 | 0.3256
126 | prior | 1 | 0.9573 | 0.8083 | 0.5766
126 | posterior | 1 | 0.9430 | 0.7915 | 0.4678
126 | prior | 4 | 1.1764 | 1.1582 | 0.6620
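A minimal sketch of the posterior sampling step just described; encoder, q1_net, and q2_net are hypothetical handles for the trained networks, not the actual implementation.

    import numpy as np

    # Feed the N_u training data points through the encoder to obtain samples
    # of the posterior distribution of the latent variable z, then push these
    # through the parameter networks to get a posterior joint sample of
    # (q1, q2).
    def sample_posterior_joint(encoder, q1_net, q2_net, data_points):
        z_post = encoder(data_points, training=False)
        q1 = np.asarray(q1_net(z_post, training=False)).ravel()
        q2 = np.asarray(q2_net(z_post, training=False)).ravel()
        return np.stack([q1, q2], axis=1)    # shape (N_u, 2)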
5.3.2.2 Deconvolution of the BAC/BrAC Signal

Using the estimated joint distribution of $(q_1, q_2)$, we can use the deconvolution network described in Section 5.2 to recover the BAC/BrAC signal and to find credible bands. Taking the samples from the joint distribution, we compute the mean parameter values to get the mean predicted BAC/BrAC signal. We also take a radius around this mean such that 90 percent of the samples fall into this circle. Then, we use these samples to find the BAC/BrAC predictions corresponding to the parameter values. Note that after the deconvolution network has been trained, this only requires forward passes through the network, i.e. we can efficiently generate a large number of predicted signals corresponding to the samples of the joint distribution. At each time, we use the maximal and the minimal value of these predicted signals to form the credible bands.

Figure 5.9: Samples of the estimated joint distribution of $(q_1, q_2)$. (a) shows the distribution using 88 selected drinking episodes with a standard normal distribution for the prior of the latent variable, while (b) shows the distribution using the posterior over the latent variable. (c) displays the distribution for the full set of 126 drinking episodes with a standard normal prior for the latent variable and (d) shows the corresponding distribution using the posterior distribution of the latent variable.

Figure 5.10: Distribution of the posterior latent variable. (a) shows the histogram for 88 drinking sessions used as training data and (b) shows the histogram for 126 drinking sessions used as training data.

Figure 5.11 shows these results for four selected drinking episodes, using the parameter distribution from training the GAN with all 126 drinking episodes. Figures 5.11a and 5.11b show two examples of a situation where the mean prediction matches the real signal quite well. We notice, however, that the method yields a predicted start of the BAC/BrAC curve that is smoother than the real data; the sudden jump in the signal at the beginning of a drinking episode is not well reflected. It is also visible that the credible region appears to be rather large in both cases. This is because the data contain high variability between different subjects and drinking episodes, and the larger credible region is required to capture this variability. Figures 5.11c and 5.11d show situations where the real data lie at the upper and lower bound of the predicted credible region, respectively.

Figure 5.11: Results of the estimation of BAC/BrAC from TAC for different drinking episodes. (a) and (b) show a good match between real data and estimated BAC/BrAC signal. (c) and (d) show a real signal at the upper and lower edge of the predicted credible region, respectively.

Nevertheless, even with the given tendency of the data to vary considerably between subjects, it is desirable to keep the credible band as small as possible. One way to achieve this is to use the posterior distribution of the latent variable in the generation of samples for $(q_1, q_2)$ rather than the prior standard normal. As seen above, the joint parameter distribution becomes narrower in this case. Another way is to tune the penalty parameter $\beta$. The default choice of $\beta = 1$ leads to a balance between the given training data and the physics. As described in Section 5.1.5, the underlying problem is rather driven by physics, and so a higher weight on the PDE residuals might be favorable.
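For reference, the band construction used for Figures 5.11 and 5.12 can be sketched as follows; deconv_net is a hypothetical stand-in for the trained deconvolution network of Section 5.2, mapping a TAC signal and a parameter pair $(q_1, q_2)$ to an eBrAC signal of length $T$.

    import numpy as np

    def credible_bands(deconv_net, tac, q_samples, level=0.9):
        center = q_samples.mean(axis=0)            # mean parameter values
        # Keep the `level` fraction of samples closest to the center, i.e.
        # the samples inside the 90% credible circle around the mean.
        dist = np.linalg.norm(q_samples - center, axis=1)
        keep = q_samples[dist <= np.quantile(dist, level)]
        # One forward pass per retained sample; cheap once trained.
        curves = np.array([deconv_net(tac, q) for q in keep])
        mean_brac = deconv_net(tac, center)
        # Pointwise extremes of the predicted signals form the bands.
        return mean_brac, curves.min(axis=0), curves.max(axis=0)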
Figure 5.12 compares these different approaches for two different drinking sessions. It shows that both approaches, using the posterior latent variable and tuning the penalty parameter $\beta$, lead to narrower credible bands. Note that this does not necessarily improve the quality of the mean prediction: in Figure 5.12d, the default mean prediction matches the real data nicely, whereas the match in Figure 5.12e using the posterior latent variable is worse although the credible region is smaller.

Figure 5.12: Comparison of predicted BAC/BrAC signals using two different drinking episodes. (a) and (d) show the estimated BAC/BrAC curves yielded by a standard normal distribution for the prior of the latent variable and $\beta = 1$. (b) and (e) show the corresponding results using the posterior distribution of the latent variable and $\beta = 1$. (c) and (f) display the corresponding results using a standard normal distribution for the prior of the latent variable together with $\beta = 4$.

Chapter 6

A Physics-Informed Long Short-Term Memory Network for the Estimation and Uncertainty Quantification of BrAC

In this chapter, we maintain the idea of a hybrid model that enriches the data-driven qualities of machine learning with a priori physics information. But unlike in the previous Chapter 5, where the input/output PDE model was considered on the full domain and where a GAN was used, here we make use of recent advances in time series machine learning in combination with an abstract evolutionary, semigroup theory based formulation of the PDE model. To this end, we propose a physics-informed Long Short-Term Memory (LSTM) network. In this model, the data-driven nature of an LSTM is augmented with a first-principles physics model for the diffusion of ethanol through the epidermal layer of the skin. This physical model takes the role of a regularization term in the loss function of the LSTM. That way, while learning, the model is encouraged both to fit the data and to produce physically meaningful outputs. Moreover, due to the high variation observed in the data, we also put forth a mechanism for the uncertainty quantification of the estimates that is based on a recently discovered relation between Monte-Carlo dropout and Bayesian learning.

In Section 6.1, we describe the LSTM network that is used in this work and we show how the physics information is added to the network training to obtain a physics-informed LSTM model. In Section 6.2, we briefly explain the network architecture that we use to handle both time-dependent and time-independent covariates, followed by an outline of the approach used to capture the uncertainty in Section 6.3. Finally, in Section 6.4, we show some examples of our numerical studies.

6.1 A Physics-Informed Long Short-Term Memory Network

The BrAC estimation in this work is based on a Long Short-Term Memory network as developed by Hochreiter and Schmidhuber in [88]. These networks are a class of recurrent neural networks and are especially useful for time-series data, as they are able to process sequences of data and remember temporal features over long time spans. The whole network can be thought of as $T$ LSTM units stacked next to each other, where $T$ denotes the number of time steps. In a single unit, there is a cell state $c$ as well as a forget gate $f$, an input gate $i$, and an output gate $o$ that are used to regulate the flow of information through the network.
Denoting time by the subscript $t$, the equations for these gates are given by
\[ f_t = \sigma(W_f y_t + U_f h_{t-1} + b_f), \]
\[ i_t = \sigma(W_i y_t + U_i h_{t-1} + b_i), \]
\[ o_t = \sigma(W_o y_t + U_o h_{t-1} + b_o). \]
Here, $h_t$ denotes the so-called hidden state that is passed from one time step $t-1$ to $t$, and $y_t$ denotes the input to the network at time $t$; in our case $y = (y_1, \ldots, y_T)$ is a TAC signal or a vector of multiple time-dependent covariates, for example TAC and heart rate. $f_t$ denotes the forget gate. Here, the current input $y_t$ is weighted by a matrix $W_f$ and the previous hidden state $h_{t-1}$ is weighted by a matrix $U_f$. Then, a bias term $b_f$ is added and the sigmoid function $\sigma(x) = 1/(1 + e^{-x})$ is applied to the result. Similar computations apply to the input gate $i_t$ and the output gate $o_t$. Moreover, a new candidate cell state $\tilde{c}_t$ is computed as
\[ \tilde{c}_t = \tanh(W_c y_t + U_c h_{t-1} + b_c), \]
and based on the forget gate and the input gate the cell state is updated according to
\[ c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \]
where $\odot$ denotes an elementwise product. Here, the roles of the input and the forget gate become clear: the forget gate regulates how much of the previous cell state $c_{t-1}$ is kept, and the input gate determines how much of the new candidate state $\tilde{c}_t$ is added. Finally, the new hidden state $h_t$ is computed as $h_t = o_t \odot \tanh(c_t)$. Figure 6.1 below provides a schematic illustration of a single LSTM unit or cell as described above.

Figure 6.1: Schematic illustration of a single LSTM cell. The figure shows the information flow in a cell between the various gates.

In many applications, the output of the LSTM network is based solely on the last hidden state $h_T$. However, in our situation we feed the network with a TAC sequence $y = (y_1, \ldots, y_T)$ and want the output to be the corresponding estimated BrAC sequence $\tilde{u} = (\tilde{u}_1, \ldots, \tilde{u}_T)$. This is known as the sequence-to-sequence setting. The complexity of an LSTM network is mainly governed by the dimensionality of the output $h_t$. Thus, in order to get scalar values $\tilde{u}_t$ based on $h_t$, we apply a feed-forward network to the hidden states $h_t$ whose output is the desired estimated BrAC value at time $t$, $\tilde{u}_t$.

When training such a network, the matrices $W_f$, $W_i$, $W_o$, $W_c$, $U_f$, $U_i$, $U_o$, and $U_c$ as well as the bias terms $b_f$, $b_i$, $b_o$, and $b_c$ are adjusted to minimize a given loss function. Such a loss function typically relies on a difference between the network output and the true output. Hence, a simple loss function could be given by $\mathcal{L} = \mathrm{MSE}(u, \tilde{u})$, where $u$ denotes the true BrAC signal, $\tilde{u}$ denotes the estimated BrAC (eBrAC) signal, and MSE is the mean squared error between the two signals.

As described in Chapter 1, the datasets available to train the network on are quite small for the alcohol biosensor problem. This lack of data may result in several problems. First, the insufficient amount of data may hinder the model from learning appropriately, i.e. the network does not learn a suitable relationship between TAC and BrAC signals; and second, the model may tend to overfit on the training dataset, so that its generalization abilities can be poor. To avoid these issues, we propose here a physics-informed learning scheme in which the a priori physics information derived from the physical model (4.1)-(4.6) is used as a regularization term. Based on the mechanisms developed in [28-30, 33, 52], we can obtain a physics-informed LSTM network by augmenting the loss function with an additional penalty term.
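A minimal NumPy sketch of a single LSTM time step, directly transcribing the gate equations above; this is illustrative only, not the Keras implementation used in the experiments below.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    # One LSTM time step. W, U, b are dicts holding the weight matrices and
    # bias vectors for the gates "f", "i", "o" and the candidate cell state
    # "c".
    def lstm_step(y_t, h_prev, c_prev, W, U, b):
        f_t = sigmoid(W["f"] @ y_t + U["f"] @ h_prev + b["f"])      # forget
        i_t = sigmoid(W["i"] @ y_t + U["i"] @ h_prev + b["i"])      # input
        o_t = sigmoid(W["o"] @ y_t + U["o"] @ h_prev + b["o"])      # output
        c_tilde = np.tanh(W["c"] @ y_t + U["c"] @ h_prev + b["c"])  # candidate
        c_t = f_t * c_prev + i_t * c_tilde     # elementwise cell-state update
        h_t = o_t * np.tanh(c_t)               # new hidden state
        return h_t, c_t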
Moreover, since the natural input and output of LSTM networks are time series (rather than data points $(t, h)$ in the time-space domain), we can make use of the previously described formulation of the system (4.1)-(4.6) as an abstract parabolic system and its discretization in both time and space. In [23], the authors use the minimization problem
\[ u^* = \arg\min_{\tilde{u}} \sum_{i=1}^{T} |\tilde{y}_i(\tilde{u}) - y_i|^2 + \|\tilde{u}\|^2 \tag{6.1} \]
for the input or BrAC estimation. Here, $\|\tilde{u}\|$ denotes a regularization term with a suitable norm $\|\cdot\|$ and $\tilde{y}(\tilde{u})$ denotes the forward solution of the physical model, i.e. the eTAC based on the eBrAC using the discretized forward model as in (4.16). In this work, we want to make use of this idea in the context of the LSTM network. To this end, we propose to augment the loss function of the neural network, which is otherwise fully based on data, by a physics term involving the squared difference between $\tilde{y}(\tilde{u})$ and $y$; i.e. the TAC output of the eBrAC $\tilde{u}$ under the physical model (4.1)-(4.6) is encouraged to match the true TAC output $y$. In this way, the network is physics-informed in the sense that physically meaningless eBrAC signals are penalized, and this additional penalty helps the network to learn, although the available training dataset is rather small. More precisely, we define the loss function as
\[ \mathcal{L} = \mathrm{MSE}(u, \tilde{u}) + \beta \left( \mathrm{MSE}(y, \tilde{y}) + \kappa_1 \|\tilde{u}\|_2^2 + \kappa_2\, \mathrm{SV}(\tilde{u}) \right), \tag{6.2} \]
where MSE denotes the mean squared error, $\|\cdot\|_2$ is the Euclidean norm, and $\mathrm{SV}(x) = \sum_i (x_i - x_{i+1})^2$ is the squared variation of $x$. $\kappa_1$ and $\kappa_2$ denote tuning parameters, and $\beta$ is the parameter that governs how much emphasis is placed on the physical information. Assuming that the number of drinking episodes in the training batch is $N_b$, the loss function can be written as
\[ \mathcal{L} = \frac{1}{N_b} \sum_{j=1}^{N_b} \left( \frac{1}{T} \sum_{t=1}^{T} \left[ (u_{j,t} - \tilde{u}_{j,t})^2 + \beta \left( y_{j,t} - \tilde{y}_{j,t} \right)^2 \right] + \beta \kappa_1 \sum_{t=1}^{T} \tilde{u}_{j,t}^2 + \beta \kappa_2 \sum_{t=1}^{T-1} (\tilde{u}_{j,t} - \tilde{u}_{j,t+1})^2 \right). \]
Note that the deconvolution of BrAC from TAC based on the minimization problem (6.1) as used in [23] is not suitable for real-time BrAC estimation; that is, eBrAC signals can change based on new incoming data. Assume a recorded TAC signal $y = (y_1, \ldots, y_T, y_{T+1})$. Then, the two sequences $(u^*(y_1, \ldots, y_T))_{t=1}^{T}$ and $(u^*(y_1, \ldots, y_T, y_{T+1}))_{t=1}^{T}$ are not necessarily identical. The physics-informed LSTM network proposed here does not have this property: the eBrAC value $\tilde{u}_t$ is based only on $y_1, \ldots, y_t$. This enables our approach to yield eBrAC signals in real time, a feature that is especially interesting for the use of BrAC estimation in wearables.

6.2 Network Architecture

As mentioned in Chapter 1, TAC is not only a function of BrAC. Instead, the expression from BrAC to TAC depends on a variety of other covariates as well. These can roughly be grouped into body covariates, drinking episode level covariates, and behavioral covariates. Body covariates include, but are not limited to, gender, age, body mass index, ethnicity, and resting metabolism. On the drinking episode level, the alcohol dosing pattern, ambient temperature, blood pressure, and heart rate might influence the functional relationship between BrAC and TAC. Finally, behavioral covariates such as the past 28 or 90 day drinking behavior can lead to adaptations in the body and hence affect the diffusion process as well. Accounting for all these covariates in the model might explain the observed variance in the TAC expression and eventually lead to improved BrAC estimates.
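Before turning to how the covariates are handled, a minimal sketch of the batch loss (6.2) defined in Section 6.1; forward_model is a hypothetical stand-in for the discretized physical model (4.16), mapping an eBrAC sequence to the corresponding eTAC sequence.

    import numpy as np

    def pinn_lstm_loss(u, u_hat, y, forward_model, beta, k1, k2):
        # u, u_hat, y: arrays of shape (N_b, T) holding the true BrAC, the
        # eBrAC produced by the network, and the true TAC, respectively.
        y_hat = np.array([forward_model(u_j) for u_j in u_hat])    # eTAC
        data_term = np.mean((u - u_hat) ** 2, axis=1)              # MSE(u, u~)
        physics_term = np.mean((y - y_hat) ** 2, axis=1)           # MSE(y, y~)
        norm_term = np.sum(u_hat ** 2, axis=1)                     # ||u~||^2
        sv_term = np.sum(np.diff(u_hat, axis=1) ** 2, axis=1)      # SV(u~)
        return np.mean(data_term
                       + beta * (physics_term + k1 * norm_term + k2 * sv_term))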
When handling the various covariates, the issue arises that some covariates depend on time, such as TAC, heart rate, and ambient temperature, while others are constant over time, such as the age or the gender of the human subject. Consequently, these covariates should be treated differently within the neural network. Here, we propose a network architecture consisting of three separate networks. Network 1 is a physics-informed LSTM network as described above that processes the time-dependent covariates. Network 2 is a standard feed-forward network that processes the time-independent covariates. Finally, in network 3 the outputs of these networks are combined in a feed-forward network to yield the eBrAC.

Figure 6.2: Structure of the proposed network for the handling of the time-dependent covariates (grouped as "TAC") and the time-independent covariates.

6.3 Uncertainty Quantification using Dropout

Given the high observed variance in BrAC/TAC data pairs, a completely deterministic approach to the BrAC estimation does not appear to be very promising. Therefore, to account for the variability and uncertainty not explained by the model, a quantitative assessment of the uncertainty is key to meaningful estimates. Previous works [23, 89], relying fully on the physical model (4.1)-(4.6), assumed the parameters $q \in Q$ to be random and estimated their distribution. Based on this distribution, values for $q$ were sampled to produce credible bands for the eBrAC. Although the present work also uses the model (4.1)-(4.6), we do not aim at estimating the distribution of $q$; rather, we aim at fitting the network parameters $\omega$, where $\omega = \{W\}$ with
\[ W = (W_f, W_i, W_o, W_c, U_f, U_i, U_o, U_c, b_f, b_i, b_o, b_c) \]
being the set of all the network parameters to fit. In this context, an uncertainty quantification could be based on the idea of a Bayesian neural network. In Bayesian neural networks, the network parameters $\omega$ are assumed to be random, following some prior distribution. The goal is then to come up with a posterior distribution for $\omega$ based on the data.

The main caveat of Bayesian networks is the complexity of these models, so that more sophisticated models quickly become intractable. As a remedy, a common method is to consider an approximation to the full posterior distribution, a process that is known as variational inference [77]. Only recently, a remarkable connection between this variational inference and the common regularization technique of Monte-Carlo dropout was found by Gal in [90]. Here, we briefly describe the uncertainty quantification that is derived from this relation, following the steps taken in [91]. To this end, we consider a simple problem where the input $y$ and output $u$ of the network are scalar; later, we will make the appropriate adaptations for the time series network considered here. We consider the network parameters $\omega$ to be random and we assume they have some prior distribution $p(\omega)$. Further, we need to define a likelihood distribution $p(u \mid y, \omega)$, i.e. the output distribution given the observed variables $y$ and the network parameters $\omega$. Such a likelihood can be given as a Gaussian distribution,
\[ p(u \mid y, \omega) = \mathcal{N}(u;\, f^{\omega}(y), \tau), \]
where $f^{\omega}$ denotes the parameterized function that relates the input $y$ and the network output $u$ by $u = f^{\omega}(y)$. In our case, $f$ will be a physics-informed LSTM network as described above. Moreover, $\tau$ denotes some variance or noise. Let $Y$ and $U$ be some given datasets.
Then, using Bayes' theorem, we can come up with a posterior distribution of $\omega$ as
\[ p(\omega \mid Y, U) = \frac{p(U \mid Y, \omega)\, p(\omega)}{p(U \mid Y)}. \]
The denominator of this expression, $p(U \mid Y)$, is known as the model evidence. Written out, it takes the form of an integral,
\[ p(U \mid Y) = \int p(U \mid Y, \omega)\, p(\omega)\, d\omega. \]
This integration is called marginalization, and it is well known that exact marginalization often becomes infeasible for more complex problems. This is also the case for the LSTM network considered here. Hence, we choose to look for a variational approximation $q_\theta$ to the posterior distribution, $q_\theta(\omega) \approx p(\omega \mid Y, U)$. Here, a distribution $q$, parameterized by $\theta$, is chosen such that it is easy to evaluate. A comprehensive text on the topic of variational approximation can be found in [77]. Ideally, we want the approximation $q_\theta(\omega)$ to be as close to the true posterior distribution $p(\omega \mid Y, U)$ as possible. To achieve this, our goal is to minimize the Kullback-Leibler divergence between the two distributions with respect to the parameter $\theta$. This divergence is given by
\[ \mathrm{KL}(q_\theta(\omega) \,\|\, p(\omega \mid Y, U)) = \int q_\theta(\omega) \log \frac{q_\theta(\omega)}{p(\omega \mid Y, U)}\, d\omega. \]
Minimizing the Kullback-Leibler divergence is equivalent to maximizing the so-called evidence lower bound (ELBO) with respect to the parameterization $\theta$. This ELBO is defined as
\[ \int q_\theta(\omega) \log p(U \mid Y, \omega)\, d\omega - \mathrm{KL}(q_\theta(\omega) \,\|\, p(\omega)). \]
The described method is known as variational inference and is well established in the field of Bayesian modeling. Written out, the ELBO takes the form
\[ \sum_{i=1}^{N} \int q_\theta(\omega) \log p(u_i \mid f^{\omega}(y_i))\, d\omega - \mathrm{KL}(q_\theta(\omega) \,\|\, p(\omega)), \tag{6.3} \]
where $N$ denotes the number of data pairs $(y_i, u_i)$.

Now, let us take a more specific look at the time series problem considered here. In this case, the inputs take the form $y_i = (y_{i,1}, y_{i,2}, \ldots, y_{i,T})$ and the outputs take the form $\tilde{u}_i = (\tilde{u}_{i,1}, \tilde{u}_{i,2}, \ldots, \tilde{u}_{i,T})$. For simplicity, we drop the index $i$ for now and focus on the time index $t$. Then, the LSTM network described in Section 6.1 defines functions $f^{\omega_h}$ that relate the hidden states $h_t$ with the input arguments as $h_t = f^{\omega_h}(y_t, h_{t-1})$ for $t = 1, \ldots, T$. Further, the output $\tilde{u}_t$ depends on the hidden state $h_t$ of the LSTM network and the output $k$ of the time-independent covariate network via $\tilde{u}_t = f^{\omega_u}(h_t, k)$. In that way, the full output sequence $\tilde{u}$ can be expressed as
\[ \tilde{u} = f^{\omega_u}(h_0, \ldots, h_T, y_1, \ldots, y_T, k) = f^{\omega_u}\big(h_0,\, f^{\omega_h}(y_1, h_0),\, \ldots,\, f^{\omega_h}\big(y_T,\, f^{\omega_h}(y_{T-1},\, f^{\omega_h}(\cdots f^{\omega_h}(y_1, h_0)))\big),\, y_1, \ldots, y_T,\, k\big), \]
where $h_0 = 0$. Using this notation, the objective from (6.3) takes the form
\[ \sum_{i=1}^{N} \int q_\theta(\omega) \log p(u_i \mid f^{\omega_u}(h_0, \ldots, h_T, y_1, \ldots, y_T, k))\, d\omega - \mathrm{KL}(q_\theta(\omega) \,\|\, p(\omega)). \tag{6.4} \]
Then, to approximate the integral, we use Monte-Carlo integration and obtain the objective
\[ \sum_{i=1}^{N} \log p(u_i \mid f_u^{\hat{\omega}}(h_0, \ldots, h_T, y_1, \ldots, y_T, k)) - \mathrm{KL}(q_\theta(\omega) \,\|\, p(\omega)), \tag{6.5} \]
which is based on a single sample $\hat{\omega}$ from the variational distribution $q_\theta(\omega)$. It is known that (6.5) is an unbiased estimator for (6.4). For more information on Monte-Carlo integration we refer to [92]. Finally, for the parameterization of $q_\theta(\omega)$, we follow the approach of Hinton/van Camp [93] and factorize over the weight matrices $W_j$ of the network as
\[ q_\theta(\omega) = \prod_j q_{M_j}(W_j), \tag{6.6} \]
where $M_j = \mathrm{mean}(W_j)$ denotes the mean weight matrix and $q_{M_j}(W_j)$ is the distribution of
\[ W_j = M_j Z, \]
where $Z$ is a diagonal matrix whose elements $z_{k,k}$ are i.i.d. Bernoulli distributed as $z_{k,k} \sim \mathrm{Bernoulli}(p)$ with $p \in [0, 1]$ to be defined. This choice for the variational distribution immediately gives rise to the aforementioned technique of dropout [94].
Dropout is a common stochastic regularization technique, usually used to prevent a neural network from overfitting. Using dropout, in each training iteration a randomly selected set of neurons in the network is deactivated. The main idea behind this is that the ever-changing structure of the network prevents it from simply memorizing and reproducing the training data. Also, since the network architecture is different for every training batch, one can consider the final outcome as an average over a whole ensemble of slightly different neural networks. For a more detailed treatment of dropout regularization, we refer the reader to [94]. Usually, a certain percentage $p$ is chosen and then a $p$ fraction of all the neurons is masked randomly in every training step. This description is equivalent to randomly setting columns of the weight matrices $M_j$ to zero, which in turn is equivalent to the multiplication of $M_j$ with the random diagonal matrix $Z$ whose entries are $\mathrm{Bernoulli}(p)$ distributed.

In [90, 91], it is shown that finding the optimal network parameters $\omega$ in a neural network using dropout is essentially the same objective as the variational inference objective in (6.3) with the special choice of variational distribution $q_\theta(\omega)$ as in (6.6). This implies that every neural network trained with dropout can be considered as a special type of Bayesian neural network. It is this observation that enables us to obtain an uncertainty quantification in the sense of a Bayesian neural network within a feasible computational time. In practice, this remarkable connection means that a neural network trained using dropout can be used to assess the uncertainty of the outcomes in the inference phase by simply using dropout in the inference phase as well. Hence, by sampling from the Bernoulli random matrix $Z$ we automatically obtain samples of the approximate posterior $q(\omega)$ that can be used to generate samples of the eBrAC $\tilde{u}$. Thus, based on samples $\hat{\omega}_k$, $k = 1, \ldots, M$, we get a means to produce samples of the eBrAC sequences $\tilde{u}^k = f_u^{\hat{\omega}_k}(y_1, \ldots, y_T)$, $k = 1, \ldots, M$. It is then straightforward to obtain quantities like the mean eBrAC signal,
\[ \bar{u} = \frac{1}{M} \sum_{k=1}^{M} f_u^{\hat{\omega}_k}(y_1, \ldots, y_T), \]
as well as credible regions that contain a certain percentage of all the observed samples. In Section 6.4, we demonstrate this uncertainty quantification technique using real data drinking episodes.

6.4 Numerical Studies

6.4.1 Numerical Results

We split the 256 sets of BrAC-TAC data into a training set containing 192 episodes and a test set with the remaining 64 episodes, with all of each participant's episodes placed either in the training or in the test set to reduce the effects of non-independence of repeated drinking episodes within participants. If not stated otherwise, the only input to the LSTM network is the TAC data; the time-independent covariates are handled according to Section 6.2. All results displayed below show predictions on the test set using an LSTM network that has been trained solely on the training set. The physics-informed LSTM network was implemented in Python, using the deep learning application programming interface Keras. To train the model, we used the Adam optimizer [87]. For the choice of the parameters $q_1, \ldots, q_4$ in the physical model (4.1)-(4.6) we used the population model approach as outlined in Chapter 4 and fit four-parameter beta distributions as described in Section 4.3.1.
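Before turning to the results, a minimal sketch of the Monte-Carlo dropout procedure from Section 6.3, assuming a Keras model trained with dropout layers; calling the model with training=True keeps dropout active at inference time, so each forward pass corresponds to one sample $\hat{\omega}$ from the approximate posterior $q(\omega)$.

    import numpy as np

    def mc_dropout_predict(model, tac, M=1000, level=0.95):
        # M stochastic forward passes through the dropout-trained network.
        samples = np.stack([model(tac, training=True).numpy()
                            for _ in range(M)])
        mean_ebrac = samples.mean(axis=0)              # mean eBrAC signal
        lo = np.percentile(samples, 100 * (1 - level) / 2, axis=0)
        hi = np.percentile(samples, 100 * (1 + level) / 2, axis=0)
        return mean_ebrac, lo, hi    # mean and pointwise credible region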
The particular episodes used to demonstrate our approach were selected because they were generally representative of the particular aspect of our method we were intending to illustrate. Figure 6.3 depicts four drinking episodes from the test set. It shows the true BrAC signal $u$ as well as the eBrAC signal $\tilde{u}$, which is the mean of 1000 sampled eBrAC signals using the method outlined in Section 6.3, together with 95% credible regions for the BrAC signal. Overall, we see that the mean eBrAC signal matches the BrAC signal quite nicely. The magnitude of the peak is not accurately captured in some cases, but this comes as no surprise given the observed variation in the data. Moreover, the shown credible regions for the BrAC signal appear to accurately capture the uncertainty of the prediction: the true BrAC signal is within these bounds most of the time.

Figure 6.3: Performance of the physics-informed LSTM model on four drinking sessions from the test set. The figures show the BrAC data $u$, the predicted mean eBrAC signal $\bar{u}$, and credible regions for the BrAC signal.

6.4.2 Error Evaluation

When evaluating the goodness of fit of the model, four quantities are of particular interest. These are the relative $L^2$ error (rL2); the relative peak error, which is the relative distance between the peak of the true BrAC signal and that of the eBrAC signal (rP); the relative error of the area under the curve (rAUC); and the absolute difference between the peak times of the BrAC signal and the eBrAC signal ($|\Delta T|$). Table 6.1 shows these error values for the trained model on both the training and the test set. As can be seen, the errors on the training set and the test set are very similar. This indicates that the model generalizes fairly well and does not overfit.

Table 6.1: Error values on training and test set

 | rL2 | rP | rAUC | $|\Delta T|$
training | 0.2168 | 0.1650 | 0.1496 | 0.2186
test | 0.2183 | 0.1565 | 0.1590 | 0.2547

The proposed framework allows the time-dependent input data to be vector-valued as well. Hence, it is easy to incorporate multiple time-dependent covariates into the model. In Table 6.2, we show the resulting error values when TAC and the heart rate are used as input signals for the LSTM network. As can be seen, the model including the heart rate performs slightly better on the training set, but no clear statement can be made for the test set. It appears that the heart rate does not significantly improve the performance or the insight of the model. More on the importance of covariates follows in Section 6.4.6.

Table 6.2: Error values on training and test set, heart rate included

 | rL2 | rP | rAUC | $|\Delta T|$
training | 0.2065 | 0.1536 | 0.1287 | 0.2339
test | 0.2295 | 0.1491 | 0.1682 | 0.2538

6.4.3 Impact of the Parameter $\beta$

As described in Section 6.1, we use the parameter $\beta$ to balance the data and the physics information in the loss function (6.2). Intuitively, a lower $\beta$ yields an estimate that aligns best with the training data, while a larger $\beta$ tries to match the computed TAC signal of the estimate, $\tilde{y}(\tilde{u})$, with the true TAC signal $y$. Figures 6.4 and 6.5 show the numerical results for varying values of $\beta$ based on two drinking sessions from the test set. In each figure, we display results for $\beta \in \{0.01, 0.05, 0.1, 0.25\}$. In Figure 6.4, the signals $u$ and $y$ are roughly of the same magnitude. For $\beta = 0.01$, the agreement between $y$ and $\tilde{y}(\tilde{u})$ is rather poor, and $\tilde{u}$ also does not quite match the signal $u$. For $\beta = 0.05$ (Figure 6.4b), the regularization mechanism yields a TAC signal $\tilde{y}(\tilde{u})$ that better matches $y$.
The agreement between $u$ and $\tilde{u}$ is also good. For even larger values of $\beta$, the agreement between $\tilde{y}(\tilde{u})$ and $y$ becomes significantly better. However, the estimated signal $\tilde{u}$ deteriorates and even becomes unphysical, as the negative values in Figure 6.4d show. This indicates that indeed an intermediate value of $\beta$ yields the best balance between data fitting and physical knowledge. How important the physics information actually is can be seen in Figure 6.5. Here, the TAC signal $y$ is of much smaller magnitude than the BrAC signal $u$. Since situations like these are rare in the training data, the model with little to no physics information yields an estimated signal $\tilde{u}$ that is close to zero; compare Figure 6.5a. Then, as $\beta$ increases, the signal $\tilde{u}$ clearly shows a drinking episode. Note that in this case, too, higher values of $\beta$ lead to non-physical behavior. Note further that although the signal $\tilde{u}$ in Figure 6.5b does not match the true signal $u$, the result is much better than in the non physics-informed case, where there appears to be no drinking at all. Hence, the physics-informed mechanism proves especially useful in situations that are not or barely covered in the training data. Thus, this mechanism is important for the extrapolation or generalization capabilities of the model.

Looking at Figures 6.4 and 6.5, the eBrAC signal becomes less smooth as the regularization parameter $\beta$ increases, i.e. it starts to oscillate. This is somewhat counterintuitive, as regularization usually enhances regularity. However, in the context of this work, the physics-informed regularization is not meant to make the eBrAC signal smoother, but to make it physically meaningful. In that sense, the observation is not contradictory. Note that the loss function in (6.2) also contains terms related to the variation of the eBrAC signal. By fine-tuning the parameters $\kappa_1$ and $\kappa_2$ in (6.2), a higher smoothness of the eBrAC signals could be obtained. We do not discuss this matter in the proof of concept reported here.

Figure 6.4: Results of the BrAC estimation using different values of the parameter $\beta$ ($\beta = 0.01$, $0.05$, $0.1$, $0.25$) on a single drinking episode.

6.4.4 Role of the Bias Terms

The LSTM network described in Section 6.1 includes bias terms $b_f$, $b_i$, $b_o$ in the forget, input, and output gates, respectively. These bias terms can be used to encode a general behavior of drinking sessions that is present in most episodes from the training set. A problem, however, arises if the training set is rather small or does not cover the whole range of realistic samples, as in the context of this alcohol biosensor problem. Then, features present in training drinking episodes can mistakenly be generalized to be features of every drinking episode. This results in a tradeoff: strong bias terms yield good results on the training dataset, but might not be applicable to data coming from outside. Without bias terms, on the other hand, the estimated signals might be missing important information.

Figure 6.5: Results of the BrAC estimation using different values of the parameter $\beta$ ($\beta = 0.01$, $0.05$, $0.1$, $0.25$) on a single drinking episode.

Figure 6.6: Results of the BrAC estimation using models with and without bias terms. We display three drinking episodes from the test set.

Figure 6.6 demonstrates this behavior. In Figure 6.6a, the eBrAC signals obtained with or without bias terms are very similar.
The only difference is that the estimate with bias terms correctly predicts the start of the drinking episode, whereas the estimate without bias does not. The reason for this behavior is the delay between the TAC and the BrAC signal. In Figure 6.6b, both estimates predict the start of the episode correctly, but the estimate without bias terms shows some non-physical behavior. Figure 6.6c, however, shows an example where the estimate with bias is far off the true signal, whereas the estimate without bias captures the true signal much better. We can attribute this behavior to a TAC signal that has a lower magnitude than most TAC signals in the training data. These observations suggest that an LSTM model without the bias terms will be advantageous for application to a wide variety of different drinking episodes.

6.4.5 Results on Field Data

So far, the BrAC and TAC for all the presented results were collected in the laboratory under controlled conditions following a prescribed dosing protocol. Now, we want to use our approach for the estimation of BrAC under naturalistic drinking conditions in the field. In [95], it was shown that the direct application of a BrAC-estimating model to field data can be difficult, especially if the naturalistic drinking episodes differ greatly from the training drinking episodes with regard to shape and magnitude. The three naturalistic drinking episodes considered here were collected by one of the authors (S.E.L.) using the WrisTAS 7™ transdermal alcohol biosensor (Giner, Inc., Waltham, Massachusetts). We trained a non physics-informed LSTM model (labeled LSTM) and a physics-informed model (labeled PILSTM) with the training set described earlier, containing only laboratory controlled drinking episodes.

Figure 6.7: Application of the LSTM model to field data drinking sessions. We show the BrAC data, eBrAC signals obtained from the LSTM model without physics (labeled LSTM), and eBrAC signals obtained from the physics-informed LSTM model (labeled PILSTM).

Figure 6.7 shows that the overall performance of the models on the naturalistic drinking data is worse compared to the laboratory controlled drinking sessions. However, the PILSTM yields results that still capture the main dynamics of the true signals with respect to shape and magnitude, whereas the default LSTM shows erratic behavior. This can be seen especially in Figures 6.7a and 6.7c, where the non physics-informed estimates exhibit completely unphysical artifacts towards the end of the episodes. The use of the physics regularization effectively suppresses these artifacts. In all episodes shown in Figure 6.7, the eBrAC signals show a lot of oscillation. The authors think that this can be attributed to the fact that the TAC signals of the field episodes are quite different from those of the training set; hence, the model could not appropriately learn these situations. Of course, a possible remedy would be the use of an extended training set that includes a variety of naturalistic drinking episodes. Unfortunately, as described earlier, data gathering for the problem at hand is expensive and time-consuming. Thus, the physics-informed mechanism appears to be valuable in making up for this lack of training data.

6.4.6 Importance of Covariates using Shapley Values

The proposed BrAC estimation model is one of the first to include a range of covariates, such as body covariates, session level covariates, and behavioral covariates.
Hence, thus far, little is known about the importance of these covariates for the BrAC estimation and how they affect the TAC expression. Here, we want to qualitatively and quantitatively assess the importance of the covariates. To this end, we employ a concept from cooperative game theory, the so-called Shapley values [96]. Named after Lloyd Shapley, Shapley values can be applied to machine learning problems to find each covariate's contribution to the result. Formally, Shapley values can be defined in the following way. Assume a set $D$ of $d$ players together with a function $w: 2^D \to \mathbb{R}$ such that $w(\emptyset) = 0$. The function $w$ maps every subset $S$ of players to the so-called worth of the coalition, i.e. the reward that the players in $S$ can obtain cooperatively. Now, in the setting of machine learning, the players can be thought of as the covariates, and the worth of the coalition is the output of the network when the covariates in $S$ were used as input. Then, based on this definition, Shapley values assign to every single player $i$ its contribution $\phi_i(w)$ according to the formula
\[ \phi_i(w) = \sum_{S \subseteq D \setminus \{i\}} \frac{|S|!\,(d - |S| - 1)!}{d!} \big( w(S \cup \{i\}) - w(S) \big) = \frac{1}{d} \sum_{S \subseteq D \setminus \{i\}} \binom{d-1}{|S|}^{-1} \big( w(S \cup \{i\}) - w(S) \big). \]
To apply this characterization to our problem, we need to come up with a suitable function $w$. We choose $w(S)$ to be the area under the curve (AUC) of the eBrAC signal when the covariates in $S$ have been used for prediction.

Figure 6.8 displays the importance and influence of a selection of the available covariates using Shapley values on our whole dataset. When interpreting these results, one should keep in mind that the data basis is fairly small. Hence, outlier values in the dataset may distort the results. Moreover, the chosen function $w$ only measures effects that are averaged over the drinking episode. If some covariates affect the eBrAC signal at a specific point during the episode, the function $w$ would not be able to measure this accurately. We note that these are serious limitations of this approach.

Figure 6.8: Shapley values of a selection of covariates. For a description of the covariates refer to Appendix A. The order of the covariates is based on the Shapley value importance. Longer bars indicate more data points in that region, while points indicate single data points. The color encoding can be seen as a heat map: high covariate values are red, low values are blue. For 0/1 covariates such as "female", red means "yes" and blue means "no". The impact of a covariate on the BrAC AUC is indicated by the distance from the grey middle bar.

Figure 6.8 shows that especially the body covariates, such as the body mass index, the resting metabolism, the skeletal muscle, the age, and some blood pressure characteristics, are among the most important covariates. The picture drawn here is that younger, physically more active persons with a lower BMI have a higher predicted BrAC AUC than older, physically less active persons with a higher BMI. Unfortunately, little is known about the biology of BrAC to TAC expression; it is hence difficult to check these observations for plausibility. A typical BrAC signal has an AUC value on the order of 0.15. Based on the obtained results, even the most important covariates have an impact of only 0.01 on the estimated signal. This suggests that while the covariates are important to consider in the model, they do not affect the result too greatly.
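For small covariate sets, the Shapley formula above can be evaluated exactly by enumerating coalitions. The following minimal sketch assumes a hypothetical function worth(S) that evaluates the network with only the covariates in S as input and returns the AUC of the resulting eBrAC signal, i.e. the choice of $w(S)$ described above, with worth of the empty coalition equal to zero.

    from itertools import combinations
    from math import factorial

    def shapley_values(players, worth):
        d = len(players)
        phi = {}
        for i in players:
            others = [p for p in players if p != i]
            total = 0.0
            for r in range(d):
                for S in combinations(others, r):
                    S = frozenset(S)
                    # |S|! (d - |S| - 1)! / d!
                    weight = (factorial(len(S)) * factorial(d - len(S) - 1)
                              / factorial(d))
                    total += weight * (worth(S | {i}) - worth(S))
            phi[i] = total
        return phi

Since the number of coalitions grows as $2^d$, exact enumeration of this kind is only practical for a modest selection of covariates; for larger sets, sampling-based approximations are commonly used instead.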
We note that the participants' gender appears to be of importance, with a clear distinction between male and female subjects: for male participants, the BrAC estimations have a larger AUC than for females. However, we also observe a difference in the baseline BrAC AUC between males and females of about 0.029. It is thus unclear whether the observed effect can really be attributed to gender or whether the underlying calibrated alcohol dosing of the experiment led to these differences. This exemplifies the difficulties in interpreting the Shapley values correctly. Similar observations can be made for the ethnicity and the dosing pattern. Further research is required.

Chapter 7

On a Covariate-Dependent Hidden Markov Model and a Physics-Informed Viterbi Algorithm

This chapter deviates from the previous two chapters in that we no longer consider neural networks and deep learning techniques. Instead, we shift gears and consider a covariate-dependent Hidden Markov Model (HMM) to approach the alcohol biosensor problem. In Section 7.1 we discuss HMMs and their associated algorithms for training and prediction, or more precisely, for producing eBrAC based on observations of TAC; the latter is achieved through the definition of a full-body alcohol HMM. In Section 7.2 we discuss how we incorporated covariates into our HMM; then in Section 7.3 we explain how our HMM can be used to quantify the uncertainty in eBrAC through a decoding of the hidden states. In Section 7.4 we present and discuss results from our numerical studies. Then, in Section 7.5 we propose and discuss a physics-informed Viterbi algorithm that makes use of the model from Chapter 4. Using this algorithm, the BrAC estimates are subject to a physical regularization which is meant to encourage physically meaningful behavior. Finally, in Section 7.5.1 we show results produced with this physics-informed algorithm and compare them to the results obtained from the default Viterbi algorithm.

7.1 A Hidden Markov Model for BrAC from TAC

7.1.1 Hidden Markov Models

Let $\mathcal{O} = (o_1, \ldots, o_T)$ be an observation sequence with $d$-dimensional observations $o_t = (o_{t,1}, \ldots, o_{t,d})$, and let $\mathcal{X}$ be a random vector whose values are state sequences $\mathcal{X} = (X_1, \ldots, X_T)$ with $X_i \in \{1, \ldots, S\}$ for all $i$. Assume a $p$-dimensional vector of covariates $Z = (z_1, \ldots, z_p)$. In classical Hidden Markov theory, the process $\mathcal{X}$ is a Markov process of order one, i.e.
\[ P(X_{t+1} = j \mid X_1, \ldots, X_t) = P(X_{t+1} = j \mid X_t). \]
Here we assume the Markov chain is time-homogeneous, that is,
\[ P(X_{t+1} = j \mid X_t = i) = P(X_2 = j \mid X_1 = i) \quad \forall t. \]
In the context of the alcohol biosensor problem, the observations are bivariate (i.e. $d = 2$) with a TAC and a BrAC component. Though not explicitly modeled, it is not unreasonable to consider the hidden states to be the generally unobservable discretized BAC level (and hence the terminology Hidden Markov Model). Such an HMM is described by the parameters

• $\pi = (\pi_1, \ldots, \pi_S)$, $\pi_i = P(X_1 = i)$, i.e. the prior probabilities of the chain $\mathcal{X}$ starting in state $i$;

• $a_{i,j} = P(X_{t+1} = j \mid X_t = i)$ with $a_{i,j} \geq 0$ and $\sum_{j=1}^{S} a_{i,j} = 1$; these parameters describe the transition probabilities between different states;

• $b_i(o_t) = P(o_t \mid X_t = i)$, the probability of observing $o_t$ given that the current state is $i$; these are called emissions.

For the density functions $b_i$, various choices, both discrete and continuous, are possible. Here, we will assume the emission distribution to be Gaussian; more on this to follow. In the sequel we combine all the parameters into a single parameter vector, $\theta$.
7.1.2 Training and Initialization

Training an HMM involves determining the parameters $\theta$ based on sequences of observations $\mathcal{O}$. We consider the following quantities that can be computed efficiently using the Baum-Welch or forward-backward algorithm [58]. Briefly, we denote the probability of the partial observation sequence $(o_1, \ldots, o_{t-1})$ and being in state $i$ at time $t$ given the model parameters $\theta$ by
\[ \alpha_t(i) = P(o_1, \ldots, o_{t-1}, X_t = i \mid \theta), \]
the probability of the partial observation sequence $(o_t, \ldots, o_T)$ given that the state sequence is in state $i$ at time $t$ by
\[ \beta_t(i) = P(o_t, o_{t+1}, \ldots, o_T \mid X_t = i, \theta), \]
the probability of going from state $i$ to state $j$ given the observations $\mathcal{O}$ by
\[ \gamma_t(i,j) = P(X_t = i, X_{t+1} = j \mid \mathcal{O}, \theta) = \frac{\alpha_t(i)\, a_{i,j}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{\sum_{m=1}^{S} \alpha_t(m)\, \beta_t(m)}, \]
and the probability of being in state $i$ at time $t$ given the observations by
\[ \gamma_t(i) = P(X_t = i \mid \mathcal{O}, \theta) = \sum_{j=1}^{S} \gamma_t(i,j). \]
We note that all of these quantities can be efficiently computed recursively [58].

The optimal choice for the parameters, $\theta = \theta^*$, is determined using the Expectation Maximization (EM) algorithm [97]. The log likelihood of the observations with parameter $\theta$ is given by
\[ \ell(\theta \mid \mathcal{O}, \mathcal{X}) = \log L(\theta \mid \mathcal{O}, \mathcal{X}) = \log P(\mathcal{O}, \mathcal{X} \mid \theta). \]
As is standard in the EM literature [97], we define
\[ Q(\theta, \theta^{(k-1)}) = E_{\theta^{(k-1)}}\big( \ell(\theta \mid \mathcal{O}, \mathcal{X}) \big) = \sum_{\mathcal{X}} \log\big( P(\mathcal{O}, \mathcal{X} \mid \theta) \big)\, P(\mathcal{O}, \mathcal{X} \mid \theta^{(k-1)}), \]
where $\theta^{(k-1)}$ denotes the $(k-1)$st iterate of $\theta$. The joint probability of an observation sequence and a state sequence can be written as
\[ P(\mathcal{O}, \mathcal{X} \mid \theta) = \pi_{X_1} b_{X_1}(o_1) \prod_{t=2}^{T} a_{X_{t-1}, X_t} b_{X_t}(o_t), \]
and therefore
\[ Q(\theta, \theta^{(k-1)}) = \sum_{\mathcal{X}} \log\left( \pi_{X_1} b_{X_1}(o_1) \prod_{t=2}^{T} a_{X_{t-1}, X_t} b_{X_t}(o_t) \right) P(\mathcal{O}, \mathcal{X} \mid \theta^{(k-1)}) \]
\[ = \sum_{\mathcal{X}} \log(\pi_{X_1})\, P(\mathcal{O}, \mathcal{X} \mid \theta^{(k-1)}) + \sum_{\mathcal{X}} \left( \sum_{t=2}^{T} \log a_{X_{t-1}, X_t} \right) P(\mathcal{O}, \mathcal{X} \mid \theta^{(k-1)}) + \sum_{\mathcal{X}} \left( \sum_{t=1}^{T} \log(b_{X_t}(o_t)) \right) P(\mathcal{O}, \mathcal{X} \mid \theta^{(k-1)}). \tag{7.1} \]
This expression for $Q$ is maximized by maximizing the three terms individually. For the first term, involving the prior probabilities $\pi_i$, partitioning all sequences $\mathcal{X}$ by their initial state, we get
\[ \sum_{\mathcal{X}} \log(\pi_{X_1})\, P(\mathcal{O}, \mathcal{X} \mid \theta^{(k-1)}) = \sum_{i=1}^{S} \log(\pi_i)\, P(\mathcal{O}, X_1 = i \mid \theta^{(k-1)}). \tag{7.2} \]
Maximizing (7.2) with respect to $\theta$ under the constraint $\sum_{i=1}^{S} \pi_i = 1$ yields
\[ \pi_i^* = \frac{P(\mathcal{O}, X_1 = i \mid \theta^{(k-1)})}{P(\mathcal{O} \mid \theta^{(k-1)})} = \gamma_1(i). \]
For the second term we find that
\[ \sum_{\mathcal{X}} \left( \sum_{t=2}^{T} \log a_{X_{t-1}, X_t} \right) P(\mathcal{O}, \mathcal{X} \mid \theta^{(k-1)}) = \sum_{i=1}^{S} \sum_{j=1}^{S} \sum_{t=1}^{T} \log(a_{i,j})\, P\big( \mathcal{O}, X_{t-1} = i, X_t = j \mid \theta^{(k-1)} \big). \]
For this term, maximizing under the constraint that $\sum_{j=1}^{S} a_{i,j} = 1$ yields
\[ a_{i,j}^* = \frac{\sum_{t=1}^{T} P\big( \mathcal{O}, X_{t-1} = i, X_t = j \mid \theta^{(k-1)} \big)}{\sum_{t=1}^{T} P\big( \mathcal{O}, X_{t-1} = i \mid \theta^{(k-1)} \big)} = \frac{\sum_{t=1}^{T} \gamma_{t-1}(i,j)}{\sum_{t=1}^{T} \gamma_{t-1}(i)}. \tag{7.3} \]
We will return to the maximization of the third term in (7.1) in the next section, when we discuss the incorporation of covariates in our emissions model. In our HMM the emissions are bivariate (i.e. BrAC and TAC). Consequently we fit a bivariate Gaussian for each state. The density is given by
\[ b_i(y) = f(y \mid X = i) = \frac{1}{(2\pi)^{d/2} |\Sigma_i|^{1/2}} \exp\left( -\frac{1}{2} (y - m_i)^T \Sigma_i^{-1} (y - m_i) \right), \tag{7.4} \]
where in (7.4) $y \in \mathbb{R}^2$ denotes a possible value for an emission random vector $o_t$. Since the emissions are known to be non-negative, another reasonable choice here would be, for example, a bivariate log-normal distribution. However, we opted for a normal to take advantage of the ease of parameterization and the fact that the marginal distributions, which are of particular significance to us here, are normal as well. We rely on the training data to produce means $m_i$ and covariance matrices $\Sigma_i$ that make negative values highly unlikely.
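The quantities $\alpha_t$, $\beta_t$, and $\gamma_t$ defined at the beginning of this subsection can be computed with scaled recursions, as in the following minimal sketch. Note that the sketch uses the common convention $\alpha_t(i) = P(o_1, \ldots, o_t, X_t = i \mid \theta)$, which differs slightly from the convention stated above.

    import numpy as np

    # Scaled forward-backward recursions for an HMM with S states. pi are the
    # priors, A the transition matrix, and B[t, i] = b_i(o_t) the emission
    # probabilities of the observed sequence.
    def forward_backward(pi, A, B):
        T, S = B.shape
        alpha = np.zeros((T, S))
        beta = np.ones((T, S))
        c = np.zeros(T)                      # scaling constants
        alpha[0] = pi * B[0]
        c[0] = alpha[0].sum()
        alpha[0] /= c[0]
        for t in range(1, T):                # forward pass
            alpha[t] = (alpha[t - 1] @ A) * B[t]
            c[t] = alpha[t].sum()
            alpha[t] /= c[t]
        for t in range(T - 2, -1, -1):       # backward pass
            beta[t] = (A @ (B[t + 1] * beta[t + 1])) / c[t + 1]
        gamma = alpha * beta                 # state posteriors gamma_t(i)
        gamma /= gamma.sum(axis=1, keepdims=True)
        return alpha, beta, gamma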
Training an HMM is a non-linear optimization problem over a large set of parameters, and the Baum-Welch algorithm is likely to find only local maxima. Hence the initial parameter guesses $\theta_0$ are of great importance. While there are initialization methods for the transition probabilities [98], we focus on the initialization of the means $m_i$ and covariance matrices $\Sigma_i$ of the emission probabilities. One possible initialization of the means $m_i^0$ makes use of a common clustering algorithm, the k-means algorithm. Given a number of clusters $k$, here the number of different states $k = S$, this algorithm minimizes the error function
\[ \sum_{i=1}^{k} \sum_{o_j \in C_i} \| o_j - m_i^0 \|^2, \]
where $C_i$ denotes the $i$-th cluster and $m_i^0$ is the mean or center of this cluster. In this way, the observations are clustered into a number of different levels of BrAC and TAC that corresponds to the number of hidden states $S$. Using this initialization for the means, the hidden states will tend to emit different TAC and BrAC levels. Then, having estimated values for $m_i$, it is straightforward to produce estimates for the covariance matrices $\Sigma_i$. For each cluster $i = 1, \ldots, S$, we can estimate the initial covariance matrix as
\[ \Sigma_i^0 = \frac{1}{|C_i|} \sum_{o_j \in C_i} (o_j - m_i^0)(o_j - m_i^0)^T. \]
In the covariate-dependent emissions case, which is of particular interest to us here and will be discussed in the next section, this initialization process still applies by setting the regression coefficients to zero for all states.

7.2 Introducing Covariates and Using the HMM to Obtain eBrAC from TAC

7.2.1 Covariate-Dependent State Transitions

Making state transitions covariate-dependent is straightforward [99] and it often reveals additional insight [100]. The usual approach for introducing covariates into the transition probabilities is multinomial logistic regression [99]:
\[ a_{i,j}^{\tilde{Z}} = P(X_t = j \mid X_{t-1} = i, \tilde{Z}) = \frac{\exp(\phi_{i,j}^T \tilde{Z})}{\sum_{j=1}^{S} \exp(\phi_{i,j}^T \tilde{Z})}, \tag{7.5} \]
where, so as to include a bias term when identifying the model, $\tilde{Z}$ denotes an augmented vector of covariates, $\tilde{Z} = (1, Z)$. We imposed the additional restriction that $\phi_{i,i} = 0$ for all states $i$. The model (7.5) maintains the assumption of a time-homogeneous Markov process. It is possible to model transition probabilities $a_{i,j}^{\tilde{Z}_t}$ that depend on a time-dependent covariate vector $Z_t$; the resulting Markov chain would not be homogeneous. The covariates of interest to us here are constant with respect to time, and hence our model with covariates will continue to be homogeneous with respect to time. When using the model (7.5) for the transition probabilities, (7.3) is no longer valid. In fact, a closed-form expression for the corresponding maximizer $a_{i,j}^{\tilde{Z},*}$ does not exist. Consequently, a numerical optimization technique is required.

7.2.2 Covariate-Dependent Emissions or Observations

To include covariates in the emissions model, we replace the density given in (7.4) with a bivariate conditional linear Gaussian distribution, that is, a conditional Gaussian with an added regression term to account for the covariates. The density of such a distribution is given by
\[ b_i(y) = f(y \mid Z, X = i) = \frac{1}{(2\pi)^{d/2} |\Sigma_i|^{1/2}} \exp\left( -\frac{1}{2} (y - B_i Z - m_i)^T \Sigma_i^{-1} (y - B_i Z - m_i) \right). \tag{7.6} \]
Once again, by augmenting $Z$ and $B$ as $\tilde{Z} = (1, Z)$ and $\tilde{B}_i = (e, B_i)$, respectively, with $e = (1, \ldots, 1)^T$, we get $y - B_i Z - m_i = y - \tilde{B}_i \tilde{Z}$, i.e. $m_i$ in the above formulation is a covariate-independent bias for the mean of $y \mid (X = i)$. We again note that a time-dependent covariate vector $Z_t$ could also be used here.
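A minimal sketch of the two covariate-dependent components (7.5) and (7.6); the parameterization follows the formulas above, with phi[i, i] = 0 encoding the stated restriction, and Z_aug denoting the augmented covariate vector $(1, Z)$.

    import numpy as np
    from scipy.stats import multivariate_normal

    # Covariate-dependent transition matrix via multinomial logistic
    # regression (7.5). phi has shape (S, S, p+1) with phi[i, i] = 0.
    def transition_matrix(phi, Z_aug):
        logits = phi @ Z_aug                              # shape (S, S)
        expl = np.exp(logits - logits.max(axis=1, keepdims=True))
        return expl / expl.sum(axis=1, keepdims=True)     # rows sum to one

    # Conditional linear Gaussian emission density b_i(y) from (7.6).
    def emission_density(y, Z, B_i, m_i, Sigma_i):
        return multivariate_normal.pdf(y, mean=B_i @ Z + m_i, cov=Sigma_i)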
Now, returning to the maximization of the third term in (7.1), the one dealing with the state-dependent emission probabilities $b_i(o_t)$, we write
$$\sum_{\mathcal{X}} \Big(\sum_{t=1}^{T} \log b_{X_t}(o_t)\Big)\, P(\mathcal{O}, \mathcal{X} \mid \theta^{(k-1)}) = \sum_{i=1}^{S} \sum_{t=1}^{T} \log(b_i(o_t))\, P\big(\mathcal{O}, X_t = i \mid \theta^{(k-1)}\big). \tag{7.7}$$
In [101], it is shown how the maximum likelihood estimators for the conditional linear Gaussian distribution can be computed. In the case of $B_i = 0$ for all $i$, i.e. when there are no regression terms, the MLE for $\mu_i$, $\mu_i^*$, is obtained from (7.7) as
$$\mu_i^* = \frac{\sum_{t=1}^{T} P(X_t = i \mid \mathcal{O}, \theta)\, o_t}{\sum_{t=1}^{T} P(X_t = i \mid \mathcal{O}, \theta)} = \frac{\sum_{t=1}^{T} \gamma_t(i)\, o_t}{\sum_{t=1}^{T} \gamma_t(i)}.$$
In the case $S = 1$, this agrees with the sample mean. In the covariate regression case, $\mu_i$ and $B_i$ have to be estimated simultaneously. Consequently, the maximum likelihood estimator $\tilde{B}_i^*$ for the augmented matrix $\tilde{B}_i$ is given by the solution of a weighted version of the normal equations as
$$\tilde{B}_i^* = \Big(\sum_{t=1}^{T} \gamma_t(i)\, o_t \tilde{Z}^T\Big) \Big(\sum_{t=1}^{T} \gamma_t(i)\, \tilde{Z} \tilde{Z}^T\Big)^{-1}. \tag{7.8}$$
It then remains to find an estimate $\Sigma_i^*$ of the covariance matrices $\Sigma_i$. Using the MLE $\tilde{B}_i^*$ from (7.8), this is found to be
$$\Sigma_i^* = \frac{1}{\sum_{t=1}^{T} \gamma_t(i)} \Big(\sum_{t=1}^{T} \gamma_t(i)\, o_t o_t^T - \tilde{B}_i^* \sum_{t=1}^{T} \gamma_t(i)\, \tilde{Z} o_t^T\Big).$$

7.2.3 Using the HMM to Estimate or Predict BrAC from Observations of TAC

After training the model with the observed BrAC and TAC data, we want to use the HMM to produce, or predict, an estimated BrAC signal, eBrAC, based only on a TAC signal. During training, we provide both TAC and BrAC, so, as noted in the previous section, we fit either a covariate-independent (i.e. (7.4)) or a covariate-dependent (i.e. (7.6)) bivariate Gaussian for every state. In the prediction stage, we use the marginal distribution of $b_i(y_1, y_2)$ with respect to the first variable $y_1$, i.e.
$$b_i(y_1) = f(y_1 \mid Z, X = i) = \frac{1}{\sqrt{2\pi \Sigma_{i,11}}} \exp\Big(-\frac{1}{2\Sigma_{i,11}}\big(y_1 - B_i^1 Z - \mu_{i,1}\big)^2\Big), \tag{7.9}$$
where $B_i^1$ denotes the first row of $B_i$, $\mu_{i,1}$ is the first entry of $\mu_i$, and $\Sigma_{i,11}$ is the variance of $y_1 \mid (X = i)$. Using the TAC observations and the conditional densities $b_i(y_1)$ given in (7.9), we use the Viterbi algorithm [102] to predict the most likely state sequence based on the observation sequence $\mathcal{O}_1 = (o_{1,1}, \ldots, o_{T,1})$. Given the state sequence $\mathcal{X} = (X_1, \ldots, X_T)$, it is then straightforward to evaluate the eBrAC signal, $\tilde{\mathcal{O}}_2$, as $\tilde{\mathcal{O}}_2 = (\mu_{X_1,2}, \ldots, \mu_{X_T,2})$. Note that by providing the Viterbi algorithm with only the marginal densities $b_i(y_1)$, there is some inherent loss of information. However, the training of the HMM with TAC and BrAC should enable the model to yield meaningful predictions even when only observations of TAC are provided.

7.2.4 Model Complexity

When including covariates in the HMM, it is important to consider the induced changes to the model complexity. In the classical case, i.e. a multivariate Gaussian distribution without a regression on the covariates for the emissions and $S$ different states with covariate-independent transitions, we need to estimate $S \cdot d$ parameters for the means $\mu_i$, $S \cdot d(d+1)/2$ parameters for the positive definite symmetric covariance matrices $\Sigma_i$, $S$ priors, and $S(S-1)$ transition probabilities. In the case of covariate-dependent state transitions as in (7.5), instead of the $S(S-1)$ transition probabilities we need to estimate the vectors $\phi_{i,j}$. Since we assumed that $\phi_{i,i} = 0$, there are $S(S-1) \cdot p$ parameters to be estimated for the transitions, where $p$ denotes the number of covariates. In the case of covariate-dependent emissions, we add $S \cdot d \cdot p$ parameters for the regression matrices $B_i$.
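To make the complexity comparison below concrete, the following small sketch tallies these counts (illustrative only; $p$ denotes the number of covariates, as above):

def hmm_param_counts(S, d, p):
    """Parameter counts for the three model variants of Section 7.2.4."""
    base = S * d + S * d * (d + 1) // 2 + S   # means, symmetric covariances, priors
    classical = base + S * (S - 1)            # free transition probabilities
    cov_transitions = base + S * (S - 1) * p  # phi_{i,j} coefficients, with phi_{i,i} = 0
    cov_emissions = classical + S * d * p     # add the regression matrices B_i
    return classical, cov_transitions, cov_emissions

print(hmm_param_counts(20, 2, 3))    # (500, 1260, 620)
print(hmm_param_counts(20, 2, 10))   # (500, 3920, 900)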
Assuming $S = 20$ states and a bivariate emission ($d = 2$), the resulting classical model has 500 parameters to fit. Using 3 covariates, there are 1260 unknown parameters in the covariate-dependent transition case and 620 parameters in the covariate-dependent emission case. For 10 covariates, those numbers go up to 3920 and 900, respectively. It can thus clearly be seen that the dependence on covariates leads to significantly increased model complexity, especially in the case of covariate-dependent transitions. Accurately fitting a model of large complexity typically requires a large training dataset. In the context of the alcohol biosensor problem, data collection with human subjects is an expensive and time-consuming undertaking, and so available datasets are relatively small. Consequently, including a large number of covariates does not necessarily lead to a model with enhanced predictive power. Instead, one should balance the greater flexibility of a more complex model against the scarcity of training data. This can be achieved by including only a small number of highly significant covariates, as we discuss in Section 7.4.

7.3 Decoding the Hidden States and Uncertainty Quantification

Since HMMs are probabilistic models, the most likely path estimation, and hence the BrAC estimation, are subject to statistical considerations. The probabilistic formulation of an HMM can be utilized to gain insight into the uncertainty that comes with the estimation. In this section, we describe how this uncertainty can be quantified and how it leads to credible bands for eBrAC.

7.3.1 Local Decoding and Global Decoding

As described in Section 7.1.1, the Viterbi algorithm can be used to find the most likely state sequence $\mathcal{X}^* = (i_1^*, \ldots, i_T^*)$ given the observations $\mathcal{O}$ as $\mathcal{X}^* = \arg\max_{\mathcal{X}} P(\mathcal{X} \mid \mathcal{O}, \theta)$. This is global decoding, and the result is guaranteed to be a feasible state sequence in the sense that $P(X_{t+1} = i_{t+1}^* \mid X_t = i_t^*) > 0$. In contrast to global decoding, which considers the full sequence of states, the process of local decoding finds the most likely state $i_t^*$ at every time step $t$; that is,
$$i_t^* = \arg\max_{i \in \{1, \ldots, S\}} P(X_t = i \mid \mathcal{O}, \theta) = \arg\max_{i \in \{1, \ldots, S\}} \gamma_t(i).$$
Note that local decoding is a byproduct of the forward-backward algorithm, since it is based on the quantities $\gamma_t(i)$. The main disadvantage of local decoding is that it can potentially yield an unfeasible or impossible state sequence. However, local decoding not only identifies the most likely state at time $t$; it also provides a measure of confidence in the form of $P(X_t = i_t^*) = \gamma_t(i_t^*)$. This probability can be used to inform the uncertainty that comes with the state decoding; we return to this in Section 7.4 below.

7.3.2 Emission Uncertainty

In the previous subsection we examined the uncertainty that arises through the decoding of the states. A second source of uncertainty is the emissions. In our model, we fit bivariate conditional linear Gaussians that are parameterized by a covariate-independent mean $\mu$, a regression matrix $B$, and a covariance matrix $\Sigma$. Hence, given a state $i$, the emission $o$ is subject to the uncertainty introduced by the covariance $\Sigma_i$. Thus, there are two types of uncertainty: a state decoding-related one and an emission-related one. Fortunately, both can be quantified by parameters of the fitted model, which shows that uncertainty quantification is straightforward for the HMM we consider here.
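As a preview of the construction made precise in Section 7.4.3 below, the following sketch combines the state posteriors $\gamma_t(i)$ from local decoding with the per-state emission spread to produce a band for the BrAC component. The array conventions follow the earlier sketches, and the 0.9 coverage level and 1.645 quantile are the choices used later in this chapter.

import numpy as np

def credible_band(gamma, mu, Sigma, coverage=0.9, z=1.645):
    """90% credible band for the BrAC component (column 2 of the emissions).

    gamma : (T, S) state posteriors gamma_t(i)
    mu    : (S, d) emission means; Sigma : (S, d, d) emission covariances
    """
    T = gamma.shape[0]
    lower, upper = np.empty(T), np.empty(T)
    sd = np.sqrt(Sigma[:, 1, 1])                 # BrAC emission standard deviation per state
    for t in range(T):
        order = np.argsort(gamma[t])[::-1]       # states by decreasing posterior
        r = np.searchsorted(np.cumsum(gamma[t][order]), coverage) + 1
        top = order[:r]                          # smallest state set covering >= 90%
        lower[t] = (mu[top, 1] - z * sd[top]).min()
        upper[t] = (mu[top, 1] + z * sd[top]).max()
    return lower, upper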
In Section 7.4 below we present a method that takes both types of uncertainty into account to yield credible bands for the eBrAC signal. There, we also demonstrate how this approach works in practice using actual human subject drinking data.

7.4 Numerical Studies

We split the 256 sets of BrAC-TAC data into a training set containing 150 episodes and a test set with the remaining 106 episodes, with all of each participant's episodes placed either in the training set or in the test set so as to reduce the effects of non-independence of repeated drinking episodes within participants. All results displayed below show estimates on the test set using a model trained solely on the training set. The covariate-dependent HMM was implemented in MATLAB based on the Hidden Markov Model (HMM) Toolbox for MATLAB. Through experimentation with the number of hidden states in our model, we settled on $S = 20$ hidden states, because this number appeared to yield the best representation of the data. The particular episodes used to demonstrate our approach were selected because they were generally representative of the particular aspect of our method we were intending to illustrate.

7.4.1 An HMM without Covariates

In Figure 7.1, the left-hand side shows TAC and BrAC data together with their respective predictions, eBrAC and eTAC, without including covariates, when both TAC and BrAC are inputs to the Viterbi algorithm. The right-hand side shows eBrAC predictions based on the TAC signal alone, without including BrAC or covariates. In the bivariate case, the predicted signals provide accurate approximations of the true data, especially for the BrAC data. This is not surprising insofar as the BrAC signals in the dataset tend to show less variation than the TAC signals. Overall, we observe that the mechanism of using the observations to find the most likely state sequence and then predicting the signal by means of the corresponding emissions works quite well. That is, the trained HMM appears to appropriately model the data and capture the underlying dynamics. For the case of BrAC prediction based only on observations of TAC, it is not surprising that the accuracy decreases somewhat. Nevertheless, the eBrAC signal still shows remarkably good agreement with the true signal with regard to its shape and amplitude.

Table 7.1 shows the median of the relative errors in the eBrAC and eTAC signals, in the peak eBrAC and eTAC values, and in the area underneath the eBrAC and eTAC curves (AUC), where eBrAC and eTAC were estimated using the Viterbi algorithm and either BrAC and TAC or TAC alone. The table also shows the absolute error (in units of hours) for the time at which the peak eBrAC and eTAC values occurred. Errors for eTAC are lower when the Viterbi algorithm is supplied with only TAC, and the eBrAC errors are much lower when both TAC and BrAC are used as the basis for prediction. We observed no clear difference between errors for the test set and for the training set. From this we infer that our model generalizes well, with no evidence of over-fitting. With regard to how the model would perform when used in practice to estimate BrAC based on TAC, we observed that the median relative peak error was just over 20%, the median relative AUC error was just over 25%, and the predicted peak time deviated from the true peak time by about 15 minutes.
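For reference, the per-episode error measures reported in Table 7.1 (and in the remainder of this chapter) can be computed roughly as follows. This is a sketch under the definitions assumed in the earlier chapters: relative $\ell^2$, peak, and AUC errors, and the absolute peak-time error in the units of the time grid.

import numpy as np

def episode_errors(t, true_sig, est_sig):
    """Relative L2, peak, and AUC errors plus absolute peak-time error."""
    rL2 = np.linalg.norm(est_sig - true_sig) / np.linalg.norm(true_sig)
    rP = abs(est_sig.max() - true_sig.max()) / true_sig.max()
    auc, eauc = np.trapz(true_sig, t), np.trapz(est_sig, t)
    rAUC = abs(eauc - auc) / auc
    dT = abs(t[est_sig.argmax()] - t[true_sig.argmax()])
    return rL2, rP, rAUC, dT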
By way of comparison, the results in Table 7.1 are consistent with, though somewhat less accurate than, similar results obtained in [12, 16, 18] using a deterministic diffusion equation model that was fit by nonlinear least squares to a single subject's single laboratory alcohol challenge episode and then tested on that same subject's 10 additional drinking episodes conducted in the field. We note that this is a much smaller and clearly significantly less diverse dataset than the one used to train and test our HMM models here.

Figure 7.1: Performance of the HMM on three drinking episodes from the test set. The figures on the left display TAC and BrAC data together with their predictions when both TAC and BrAC are used as input for the Viterbi algorithm. On the right-hand side, only TAC is used as an input to the Viterbi algorithm to produce the eBrAC signal.

Table 7.1: Median (computed with respect to all training (Tr) and testing (T) datasets) relative error in the eBrAC and eTAC signals (rL2), in the peak values (rP), and in the area underneath the curves (rAUC), together with the absolute error (in hours) in the time of the peak (|ΔT|), with the estimates based on either BrAC and TAC or TAC alone.

                  Estimated from BrAC and TAC       Estimated from TAC only
             rL2     rP      rAUC    |ΔT|       rL2     rP      rAUC    |ΔT|
  eTAC  Tr   .2133   .1427   .0594   .3200      .1381   .1039   .0330   .1900
  eTAC  T    .2127   .1207   .0674   .3850      .1419   .1220   .0350   .3700
  eBrAC Tr   .2215   .0717   .0563   .2850      .4975   .2320   .2659   .3000
  eBrAC T    .2447   .0842   .0735   .2650      .4855   .2175   .2529   .2800

7.4.2 A Covariate-Dependent HMM

To select the covariates most significant for determining eBrAC from TAC for inclusion in our HMM, we introduce the scalar quantity
$$\lambda = \frac{\max_{t \in \{1, \ldots, T\}} o_{t,2}}{\max_{s \in \{1, \ldots, T\}} o_{s,1}}, \tag{7.10}$$
which is the ratio of the maximum BrAC value to the maximum TAC value during a drinking session. We then use this scalar quantity as the dependent or response variable in a standard linear regression with all available covariates as explanatory variables, and we use the resulting fit statistics to determine those most significant for $\lambda$. In this way, we identify a set of the 10 most significant covariates, which we use in the computations to follow.

Figure 7.2 shows eBrAC for given TAC signals for three drinking sessions. We used a covariate-dependent transition matrix as in (7.5) based on the three covariates (scale-derived fitness age, hips, and waist) most significant for $\lambda$ in (7.10).

Figure 7.2: eBrAC signals based on the TAC signal and covariate-dependent transition probabilities for test set episodes. The figures show BrAC data, eBrAC using a covariate-dependent transition matrix, and eBrAC using a covariate-independent transition matrix.

In each case, we display eBrAC for a transition matrix depending on the covariates of the corresponding drinking session and an eBrAC signal obtained using a transition matrix that is independent of the covariates. Figure 7.2a shows only very minor differences between the covariate-dependent and -independent estimates, and we note that this was the case in the vast majority of the test drinking episodes we evaluated. However, Figures 7.2b and 7.2c show examples of eBrAC where the inclusion of covariates leads to significantly different estimates. We attribute the generally negligible influence of the covariates on the transitions to the structure of the fitted transition matrix and the model itself.
If we further assume that the hidden states represent discrete BAC levels or their surrogates, we would expect the states to tend either to remain fixed or to transition to a neighboring state or BAC level. Consequently, the trained transition matrix tends to be either banded or close to the identity, and this overall behavior was unaffected by the covariates.

In the context of the problem at hand, it seems more meaningful to include the covariates in the state-conditional emission densities as in (7.6). Due to the Markovian assumption, a certain level of the unobserved BAC at a given time will be expressed as a particular TAC level. Thus, it is not unreasonable to assume that the TAC level for a given BAC level depends on covariates. Moreover, this dependence may be more readily identifiable in the data than any dependence of the transition probabilities on the covariates. Consequently, in our HMM we considered covariate-dependent means for the emission distributions.

Figure 7.3 shows eBrAC signals for given TAC signals for three drinking sessions using covariate-dependent emission distributions as in (7.6), based on the three covariates most significant for $\lambda$ in (7.10). In every case, we display an eBrAC signal for covariate-dependent emissions and an eBrAC signal for covariate-independent emissions. It can be seen that the differences between the covariate-dependent and -independent estimates are larger than in the case of the transition probabilities in Figure 7.2, and we observed this consistently across the drinking episodes in our test set.

Since introducing covariates into the state transitions improved the fit only negligibly while greatly increasing the complexity of the model, in what follows we focus exclusively on covariate-dependent emissions. In particular, we are interested in quantitatively assessing the degree to which including the pre-selected set of 10 covariates improves the predictive power of our model. For every $k \in \{1, \ldots, 10\}$, we train the HMM with all possible subsets of $k$ covariates and report the log-likelihood of the fitted model together with the average eBrAC error on both the training set and the test set. For these errors, $e$, we use only the TAC signal of a drinking episode to obtain the eBrAC signal and then compute the deviation from the true BrAC signal in the $\ell^2$ norm, that is,
$$e = \| \tilde{\mathcal{O}}_2 - \mathcal{O}_2 \|_2 = \Big( \sum_{t=1}^{T} (\tilde{o}_{t,2} - o_{t,2})^2 \Big)^{1/2}. \tag{7.11}$$

Figure 7.3: eBrAC signals based on the TAC signal in the case of covariate-dependent linear Gaussian emissions for three drinking episodes from the test set. The figures display BrAC data, eBrAC using a covariate-dependent conditional Gaussian emission, and eBrAC using a covariate-independent conditional Gaussian emission.

Figure 7.4: Log-likelihood of the trained model together with the BrAC estimation error (7.11) for the training and test sets versus the number of included covariates. For each number $k$ of covariates, all possible combinations of $k$ covariates out of 10 have been considered and the results averaged.

Our results are shown in Figure 7.4. It is clear that including the covariates increases the log-likelihood of the fitted model, and there appears to be a roughly linear relationship between the number of covariates used and the log-likelihood. More significantly, a similar relation appears when one considers the prediction errors on the training set. Including more covariates yielded a lower training error with up to 9 covariates, but at 10 covariates the training error started to increase.
Apparently, increasing the number of parameters eventually leads to over-fitting that outweighs the information gained from an additional covariate. The same behavior was also observed on the test set: here, the error decreased with up to 8 covariates and then started to increase. Thus, when deciding whether to include additional covariates, it appears to be important to balance the disadvantages of increased model complexity against the benefits to the overall quality of the fit to the data.

7.4.3 Uncertainty Quantification

In light of the discussion in Section 7.3, at each time $t$ we use local decoding to find the probabilities $P(X_t = i \mid \mathcal{O}, \theta)$ for $i \in \{1, \ldots, S\}$ and consider the most likely states $i_1, \ldots, i_r$ such that $P(X_t \in \{i_1, \ldots, i_r\} \mid \mathcal{O}, \theta) \geq 0.9$. That is, to borrow a term from Bayesian statistics, we construct a 90% credible interval for the decoded states. Typically, the number $r$ of states to consider here is very small. For each of the states $k \in \{i_1, \ldots, i_r\}$, we then consider the 90% credible interval
$$I_k = \big(\mu_{k,2} - 1.645\sqrt{\Sigma_{k,22}},\; \mu_{k,2} + 1.645\sqrt{\Sigma_{k,22}}\big)$$
for the emission. Finally, we use the maximal and minimal values of $\bigcup_{k \in \{i_1, \ldots, i_r\}} I_k$ as the upper and lower bounds, respectively, of the credible band. In Figure 7.5, we display the eBrAC signal based on the global decoding using the Viterbi algorithm together with the credible bands constructed as described above. The drinking episodes displayed in Figures 7.5a and 7.5b are the same as those in Figures 7.1b and 7.1f, respectively.

Figure 7.5: Demonstration of the proposed uncertainty quantification for the BrAC estimation for three drinking episodes from the test set. The figures display BrAC data, eBrAC based on the global state decoding using the Viterbi algorithm, and a credible band for the BrAC signal based on the uncertainty quantification.

7.4.4 Results on Field Data

The BrAC and TAC for all of the drinking episodes considered thus far in this section were collected in the laboratory under controlled conditions following a prescribed dosing protocol. We now use our approach to recover eBrAC from TAC collected under naturalistic drinking conditions in the field. The five naturalistic drinking episodes were collected by one of the authors (S.E.L.), with the first four episodes using the WrisTAS 7™ transdermal alcohol biosensor (Giner, Inc., Waltham, Massachusetts) and the fifth episode using the SCRAM-CAM. The HMM was trained with the SCRAM-CAM dataset discussed earlier. Figure 7.6 shows the results from these naturalistic drinking sessions. It can be seen that the eBrAC signal captures most features of the true BrAC signal quite accurately. Although the peak eBrAC level in some cases differs from the true peak BrAC level, the shape of the eBrAC signal matches the true BrAC signal nicely. In particular, the fact that the peak of the BrAC signal appears earlier than the peak of the TAC signal is well captured. Our data-intensive approach only worked well on field data when the natural drinking episodes, to some degree, resembled the drinking episodes on which the HMM was trained. Not included in Figure 7.6 are the naturalistic drinking sessions with very low peak TACs on which we also tested our approach. These single-drink episodes had peak TACs of around .020, which is much lower than the peak TACs of our laboratory-collected training episodes, which were dosed to reach a peak BrAC of .080.
Since there were no drinking episodes with a very low peak TAC present in the training data, the model could not learn which BrAC values are associated with low peak TAC signals and consequently produced false eBrACs for these sessions. In Figure 7.6(e) we see that our HMM also failed to capture the bi-modal nature of this particular drinking episode. We again believe that this may be attributed to the lack of multi-modal drinking sessions in the data available to us for the proof-of-concept study being reported on here. A training set with a broader range of naturalistic drinking sessions, including those with lower, higher, and multi-modal levels of consumption, would test the fit of our HMM under more varied drinking conditions. This is an important direction for future research.

7.4.5 Correlation Between Hidden States and BAC

As indicated in Section 7.1, one could ask whether there is sufficient evidence to identify a correlation between the hidden states and, for example, BAC. Based on the assumption that BAC would be more highly correlated with BrAC than with TAC, we ordered the hidden states by the mean of the BrAC component of the emission distributions. In Figure 7.7, we display the TAC and BrAC signals of two drinking sessions together with the sequence of ordered hidden states based on the TAC signal and the Viterbi algorithm. From these plots, we observe that early in the episode the hidden states quickly reach their peak, at essentially the same time the BrAC signal reaches its peak. In the later part of the episode, however, the hidden states tend to decay more slowly, with the TAC following a parallel but somewhat delayed trajectory. Finally, when the hidden states have reached zero, the TAC follows suit and decays to zero as well. To conclude from the limited evidence provided here that the hidden states are, in fact, BAC, one would have to argue that when BAC drops below approximately .03 percent alcohol, the breath analyzer is no longer able to detect the presence of alcohol, a conclusion that would be at odds with experience. It seems more likely that the hidden states are simultaneously capturing a number of the mechanisms the body uses to metabolize and excrete alcohol, or at least those that can be observed in breath and transdermal data (i.e. possibly in the skin or only in the venous blood). More research is required to better understand the hidden states and their relationship among BAC, BrAC, TAC, and covariates.

Figure 7.6: Application of the fitted HMM model to natural drinking episodes recorded in the field. The figures show recorded TAC and BrAC signals and eBrAC based only on the TAC signal.

Figure 7.7: TAC and BrAC together with the decoded hidden states based on the TAC.

7.5 A Physics-Informed Viterbi Algorithm

So far, the model under investigation makes no use of the mathematical model from Chapter 4. As demonstrated in the previous sections, the model is still able to capture the dynamics with some success. However, as seen in Section 7.4.4, the model does not perform well on field data, i.e. on data that deviates significantly from the training dataset. Hence, we want to introduce a physics-informed regularization mechanism that encourages the model to learn physically meaningful behavior. Let us quickly recall how the inversion process of obtaining an eBrAC signal from a TAC signal is carried out. In the prediction stage, we use the marginal distribution of $b_i(y, u)$ with respect to the first variable $y$, i.e.
$$b_i(y) = f(y \mid Z, X = i) = \frac{1}{\sqrt{2\pi \Sigma_{i,11}}} \exp\Big(-\frac{1}{2\Sigma_{i,11}}\big(y - B_i^1 Z - \mu_{i,1}\big)^2\Big), \tag{7.12}$$
where $B_i^1$ denotes the first row of $B_i$, $\mu_{i,1}$ is the first entry of $\mu_i$, and $\Sigma_{i,11}$ is the variance of $y \mid (X = i)$. Using the observations of $y$ and the conditional densities $b_i(y)$ given in (7.12), we use the Viterbi algorithm [102] to predict the most likely state sequence based on the observation sequence $\mathcal{O}_1 = (o_{1,1}, \ldots, o_{T,1})$. Given the state sequence $\mathcal{X} = (X_1, \ldots, X_T)$, it is then straightforward to evaluate the $\tilde{u}$ signal, $\tilde{\mathcal{O}}_2$, as $\tilde{\mathcal{O}}_2 = (\mu_{X_1,2}, \ldots, \mu_{X_T,2})$. Note that by providing the Viterbi algorithm with only the marginal densities $b_i(y)$, there is some inherent loss of information. This observation serves as the starting point of our proposed idea to incorporate physics information into the HMM or, more specifically, into the Viterbi algorithm. To this end, we want to employ the physical forward model $y = \mathcal{L}u$ from Chapter 4.

There are a few possibilities for the precise way in which we could augment our HMM with such a physics-informed regularization. A first natural idea would be to include the physics-based information in the training phase of the HMM; this would also align with the approach used in physics-informed deep learning [33]. However, introducing physics-based information into the HMM training stage is problematic. This is due to the fact that the feasibility of HMMs and the avoidance of the curse of dimensionality rely on exploiting the Markovian assumption to construct a highly efficient implementation of the Expectation Maximization (EM) algorithm, known as the Baum-Welch algorithm. This implementation is based on the observation that the quantity to be maximized, $Q$, decomposes into three separate terms, see (7.1), which enables the algorithm to maximize the terms involving the prior probabilities, the transition probabilities, and the distribution of the emissions separately. If one were to augment the objective $Q$ with a physics-informed regularization term, this elegant and efficient structure would be disrupted and the performance of the algorithm would deteriorate greatly.

Hence, as an alternative, one can introduce physics-based regularization in the actual estimation or inversion stage. This also appears meaningful since it is in the inversion stage that the loss of information described above occurs, and so a regularization can be especially important here. In light of this, we propose the following process. Once the HMM has been trained using observed data for $y$ and $u$, we employ a physics-informed Viterbi algorithm to predict a path of states through the Markov chain, based on observations of $y$ only, that agrees with the learned HMM and the physical model simultaneously. Note that this changes the objective of the Viterbi algorithm: it no longer finds the most likely path based on the HMM, but rather a path that is consistent with both the data and the physics-based model, satisfying a more complex objective.

To make this precise, we recall that the two key quantities in the classical Viterbi algorithm are the arrays $\xi$ and $\psi$, where $\xi(j,t)$ is the probability of the most likely path $(X_1, X_2, \ldots, X_t)$ with $X_t = j$, $j = 1, 2, \ldots, S$, based on the observed emissions sequence $(o_1, o_2, \ldots, o_t)$, and $\psi(j,t)$ stores the state $X_{t-1}$ of the most likely path $(X_1, X_2, \ldots, X_{t-1}, X_t = j)$, $j = 1, 2, \ldots, S$. Initially, when $t = 1$, for $j = 1, 2, \ldots, S$, we set $\xi(j,1) = \pi_j\, b_j(o_1)$, where $b_j$ is given by (7.12), and $\psi(j,1) = 0$ since
Then, by sequentially filling the arraysx andy, the most likely path is computed. In our physics-informed Viterbi algorithm, we introduce the additional array g( j;t) that stores the predicted y value at time t, based on the best found path (X 1 ;X 2 ;:::;X t1 ;X t = j), j= 1;2;:::;S. To compute the values forg, for j= 1;2;:::;S, we use the ˜ u sequence derived from the best path (X 1 ;X 2 ;:::;X t1 ;X t = j). We compute the corresponding signal for y, ˜ y, using the model ˜ y=L ˜ u and setg( j;t)= ˜ y t , j= 1;2;:::;S. We then augment the problem of finding the next best state with a weighting of the absolute difference between y t2 andg(;t 2). The sequential filling of the arrays is the reason for evaluating at time t2. In this way, the objective of the Viterbi algorithm is no longer to simply find the most likely path through the Markov chain based only on the probabilistic formulation of the HMM, but rather to find a path that best matches both the fitted HMM model and the physics-based model. Algorithm 1 provides pseudocode for a proposed Phyiscs-Informed Viterbi Algorithm. Algorithm 1 Physics-Informed Viterbi function PHYSVITERBI for j= 1 : S do x( j;1)=p j b j (y(1)) y( j;1)= 0 end for for t= 2 : T do for j= 1 : S do ˜ X t1 = j for k= t 2;:::;1 do ˜ X k =y( ˜ X k+1 ;k+ 1) end for ˜ u=(m ˜ X 1 2 ;:::;m ˜ X t1 2 ) ˜ y=L( ˜ u) g( j;t 1)= ˜ y(t 1) y( j;t)= argmax i x(i;t 1) a i; j bjy(t 2)g(i;t 2)j x( j;t)=x(y( j;t);t 1) a y( j;t); j b j (y(t)) end for end for end function 116 (a) (b) (c) Figure 7.8: Performance of the HMM on three drinking episodes from the test set. The figures depict the true BrAC data u as well as eBrAC signals ˜ u that are based on the TAC y only. The black lines show ˜ u obtained from the physics-informed Viterbi algorithm, whereas the red lines show the estimates obtained using the classical Viterbi algorithm. 7.5.1 Results of Numerical Studies Based on the proposed physics-informed Viterbi algorithm, we want to provide numerical ex- amples to compare this physics-informed learning mechanism against the default HMM model described earlier. Figure 7.8 shows the true BrAC signal u(t) for three drinking episodes from the test set together with the corresponding estimates ˜ u(t) (called eBrAC for estimated BrAC) using either the classical Viterbi algorithm or the proposed physics-informed Viterbi algorithm. First, we observe that the outlined approach of fitting an HMM with observations from both y and u and then obtaining estimates for u based on y only using the Viterbi algorithm produces meaning- ful results. The estimated shape, duration, and magnitude of the estimated BrAC curves agrees well with the true data. However, as can be seen in the figure and reported in [95], the signal ˜ u produced by the classical Viterbi algorithm sometimes exhibits non-physical artifacts. This can be seen in figures 7.8a and 7.8b where the signal ˜ u shows a bi-modality that is not present in the data. The estimates obtained from the physics-informed Viterbi algorithm do not show this be- havior. Instead, they exhibit physically meaningful behavior which demonstrates the efficacy of our physics-informed Viterbi algorithm. In the physics-informed Viterbi algorithm proposed here, the parameterb acts as a relative weighting factor between the stochastic HMM and the physical model. A low value forb favors a predicted path that agrees with the training data, while a larger b places more emphasis on the physical model y=Lu. 
We demonstrate this influence in Figure 7.9. For a very low value of $\beta$, we see that the estimate $\tilde{u}$ still shows non-physical artifacts, i.e. the physics-based regularization is not strong enough. If we choose an intermediate value of $\beta$, the artifact disappears while the estimate still agrees well with the data. However, if $\beta$ is chosen too large, the estimated signal no longer matches the true data. This shows the importance of a careful choice of the weighting factor.

Figure 7.9: Impact of the weight $\beta$ on the behavior of the eBrAC signals obtained using the physics-informed Viterbi algorithm; the panels show (a) $\beta = 0.01$, (b) $\beta = 0.1$, and (c) $\beta = 1$.

As mentioned at the beginning of this section, the use case of field data serves as a motivation for the physical regularization mechanism. We saw in Figure 7.6 that the application of the pure HMM model to field data is challenging. In Figure 7.10 below we see that the eBrAC signals obtained from the physics-informed Viterbi algorithm show no artifacts and fit the data slightly better than the eBrAC signals from the classical Viterbi algorithm. Still, the fit on these naturalistic drinking episodes is not as good as on the test dataset.

Figure 7.10: Application of the trained HMM model to naturalistic drinking episodes recorded in the field.

Chapter 8

Conclusion and Ongoing Work

In this thesis, we have examined the alcohol biosensor problem, which so far has mainly been treated in terms of a first-principles physical model, from a hybrid perspective that takes into account both the data and the physics using physics-informed machine learning techniques. In Chapters 5, 6, and 7 we derived and outlined three such techniques and presented numerical studies based on our laboratory-gathered dataset. Naturally, this leads to a discussion and comparison of the three models and their strengths and weaknesses. The following discussion can also be understood as a guideline for the circumstances under which a particular model can or should be used.

The physics-informed GAN model from Chapter 5 is surely the model that aligns most closely with the existing and established treatments of this problem, whose basis is the mathematical model described in Chapter 4. Essentially, it aims at recovering the distribution of the unknown parameters $q_i$ by means of a generative adversarial network. In that sense, it is much like the works in [21–23, 89, 103], but it uses an entirely different methodology to solve the problem. One benefit of the model presented herein is that the distribution of the unknown parameters is non-parametric; that is, we do not assume a certain underlying family of distributions. When using the obtained distribution for the deconvolution of the BrAC signal from the TAC signal, the proposed neural network deviates significantly from the classical approach.

The LSTM model from Chapter 6 is somewhat similar to the GAN model in the way the PDE model is used to augment the loss function of the network. However, the two-step process of first finding the distribution of the parameters $q$ and then deconvolving the BrAC signal is no longer needed in this case. Instead, the PDE model serves as a regularization term that is used to train the network directly in a physics-informed way. This change of paradigm offers the immense advantage that once the model is trained, every BrAC estimation only requires a forward pass of the TAC signal through the network.
Such a forward pass is computationally very cheap, and so even devices with weak computational power, such as wearables, can easily perform this task in milliseconds. This makes the LSTM model especially interesting for application in the field. Moreover, the LSTM model offers a flexibility that is unparalleled by the other described methods. Its ability to account for stationary as well as time-dependent covariates and its applicability to arbitrarily large training datasets make it a valuable tool. In addition, the outlined uncertainty quantification based on Monte-Carlo dropout mimics Bayesian learning and is hence, in some ways, a more rigorous approach to credible bands for the eBrAC signal than the approaches based on the distribution of the parameters $q$.

Lastly, the HMM model in Chapter 7 explores quite a different approach. Here, we model the dynamics of the problem as an unobservable Markov chain with observable emissions, namely BrAC and TAC. In a rather uncommon way, we fit the model with both BrAC and TAC and then use the model to predict BrAC based on TAC alone. One advantage of this approach is the straightforward way in which covariates can be included in the emissions or transitions. In addition, the Baum-Welch algorithm for training the model is computationally cheaper, and hence much faster, than the training of the deep neural networks. However, this advantage no longer exists when the physics-informed Viterbi algorithm is used. A shortcoming of the HMM approach is its rather coarse resolution: the eBrAC signals are not continuous; rather, there are only $S$ discrete breath alcohol levels, where $S$ denotes the number of states. Choosing a large number of states makes the model training harder, and so this does not offer a remedy.

At the end of this discussion, we also want to mention that all three models suffer from the fact that they cannot be as rigorously justified as the classical approach. As the fields of deep learning and physics-informed learning are rapidly evolving, their theoretical justification and properties are not yet fully studied and established.

In Table 8.1 below we assemble the average error values of the three models for comparison. As in the previous chapters, we display the average relative $L^2$ error, the average relative peak error, the average relative area under the curve (AUC) error, and the average absolute peak time error. In the direct comparison, it is easily visible that the GAN model has the weakest performance. This comes as no surprise: although this approach involves machine learning, it can rather be seen as the neural network version of the classical two-parameter model approach, and hence the true strength of neural networks is not leveraged. The HMM approach yields significantly lower average errors. This demonstrates the power of machine learning models whose large number of parameters enables them to capture so-far unmodeled relations. However, as described above, the HMM approach suffers from its coarse resolution. With regard to Table 8.1, the LSTM approach appears to be, quite significantly, the model that yields the best BrAC estimates. Here, a state-of-the-art neural network technique is combined with prior physical information to yield good BrAC estimates on both the training and the test set.
Table 8.1: Comparison of the performance of the three physics-informed methods.

  Model   Dataset     rL2      rP       rAUC     |ΔT|
  GAN     –           0.7761   0.6713   0.6467   0.6764
  HMM     training    0.5401   0.2398   0.3349   0.3744
  HMM     test        0.5174   0.2349   0.3091   0.3958
  LSTM    training    0.2168   0.1650   0.1496   0.2186
  LSTM    test        0.2183   0.1565   0.1590   0.2547

We began this thesis with a discussion of the ways humans learn (knowing that versus knowing how) and how this can be mimicked in the realm of machine learning. Generally speaking, the results and observations in this work confirm the hypothesis that pure machine learning approaches benefit from additional prior knowledge in the form of physical equations. In Chapters 6 and 7 we show numerical results without physics information and contrast them with results from physics-informed models. In either case, we see that the physical information reduces artifacts and leads to better results on data that deviate from the training data. In that sense, the hybrid physics-informed models possess better generalization capability than the purely data-driven models.

Moving forward, a crucial step in validating and improving the models is the availability of more and better data. Here, "more" data refers to a more varied set of drinking episodes that contains naturalistic drinking and multimodal episodes from a diverse set of human subjects. In particular, the power of deep learning networks to train on large datasets could be leveraged in this case. "Better" data refers to data that is more consistent. As seen in Chapter 3, the TAC measurements of a single drinking episode sometimes show significant disagreement between two TAC devices. This hinders the ability of the models to accurately learn the significance of the covariates, as their contribution is overshadowed by the measurement variation. Besides this, another important step would be a more rigorous theoretical investigation of the proposed methods. To date, only a few theoretical results on physics-informed (deep) learning are known; a first study [104] establishing convergence of PINNs for certain types of PDEs was just recently published. As research in this area continues, it appears likely that a theoretical foundation for the methods proposed in this thesis will emerge.

References

1. Caddy, G., Sobell, M. & Sobell, L. Alcohol breath tests: Criterion times for avoiding contamination by "mouth alcohol". Behavior Research Methods & Instrumentation 10, 814–818 (1978).
2. Lister, R. G., Gorenstein, C., Risher-Flowers, D., Weingartner, H. J. & Eckardt, M. J. Dissociation of the acute effects of alcohol on implicit and explicit memory processes. Neuropsychologia 29, 1205–1212. doi:10.1016/0028-3932(91)90034-6 (1991).
3. Sakai, J. T., Mikulich-Gilbertson, S., Long, R. J. & Crowley, T. J. Validity of transdermal alcohol monitoring: Fixed and self-regulated dosing. Alcoholism: Clinical and Experimental Research 30, 26–33. doi:10.1111/j.1530-0277.2006.00004.x (2006).
4. Gamella, M., Campuzano, S., Manso, J., de Rivera, G. G., López-Colino, F., Reviejo, A., et al. A novel non-invasive electrochemical biosensing device for in situ determination of the alcohol content in blood by monitoring ethanol in sweat. Analytica Chimica Acta 806, 1–7. doi:10.1016/j.aca.2013.09.020 (2014).
5. Nyman, E. & Palmlöv, A. The elimination of ethyl alcohol in sweat. Acta Physiologica 74, 155–159. doi:10.1111/j.1748-1716.1936.tb01150.x (1936).
6. Pawan, G. L. S.
Physical exercise and alcohol metabolism in man. Nature 218, 966–967 (1968).
7. Swift, R. M. Transdermal alcohol measurement for estimation of blood alcohol concentration. Alcoholism: Clinical and Experimental Research 24, 422–423. doi:10.1111/j.1530-0277.2000.tb02006.x (2000).
8. Swift, R. M. Transdermal measurement of alcohol consumption. Addiction 88, 1037–1039. doi:10.1111/j.1360-0443.1993.tb02122.x (1993).
9. Swift, R. M., Martin, C. S., Swette, L., LaConti, A. & Kackley, N. Studies on a wearable, electronic, transdermal alcohol sensor. Alcoholism: Clinical and Experimental Research 16, 721–725. doi:10.1111/j.1530-0277.1992.tb00668.x (1992).
10. Labianca, D. A. The chemical basis of the breathalyzer: a critical analysis. J. Chem. Educ. 67, 259–261. doi:10.1021/ed067p259 (1990).
11. Mukherjee, S. AI versus MD: What Happens When Diagnosis Is Automated? New Yorker 3 (2017).
12. Dai, Z., Rosen, I. G., Wang, C., Barnett, N. & Luczak, S. E. Using drinking data and pharmacokinetic modeling to calibrate transport model and blind deconvolution based data analysis software for transdermal alcohol biosensors. Mathematical Biosciences and Engineering 13, 911–934. doi:10.3934/mbe.2016023 (2016).
13. Dougherty, D. M., Charles, N. E., Acheson, A., John, S., Furr, R. M. & Hill-Kapturczak, N. Comparing the detection of transdermal and breath alcohol concentrations during periods of alcohol consumption ranging from moderate drinking to binge drinking. Experimental and Clinical Psychopharmacology 20, 373–381 (2012).
14. Dougherty, D. M., Hill-Kapturczak, N., Liang, Y., Karns, T. E., Cates, S. E., Lake, S. L., et al. Use of continuous transdermal alcohol monitoring during a contingency management procedure to reduce excessive alcohol use. Drug and Alcohol Dependence 142, 301–306 (2014).
15. Dougherty, D. M., Karns, T. E., Mullen, J., Liang, Y., Lake, S. L., Roache, J. D., et al. Transdermal alcohol concentration data collected during a contingency management program to reduce at-risk drinking. Drug and Alcohol Dependence 148, 77–84 (2015).
16. Dumett, M. A., Rosen, I. G., Sabat, J., Shaman, A., Tempelman, L., Wang, C., et al. Deconvolving an estimate of breath measured blood alcohol concentration from biosensor collected transdermal ethanol data. Applied Mathematics and Computation 196, 724–743. doi:10.1016/j.amc.2007.07.026 (2008).
17. Rosen, I. G., Luczak, S. E., Hu, W. W. & Hankin, M. in, 160–167 (2013). doi:10.1137/1.9781611973273.22.
18. Rosen, I. G., Luczak, S. E. & Weiss, J. Blind deconvolution for distributed parameter systems with unbounded input and output and determining blood alcohol concentration from transdermal biosensor data. Applied Mathematics and Computation 231, 357–376. doi:10.1016/j.amc.2013.12.099 (2014).
19. Hill-Kapturczak, N., Roache, J., Liang, Y., Karns-Wright, T., Cates, S. & Dougherty, D. Accounting for sex-related differences in the estimation of breath alcohol levels using transdermal alcohol monitoring. Psychopharmacology 232. doi:10.1007/s00213-014-3644-9 (2014).
20. Karns-Wright, T., Roache, J., Hill-Kapturczak, N., Liang, Y., Mullen, J. & Dougherty, D. Time Delays in Transdermal Alcohol Concentrations Relative to Breath Alcohol Concentrations. Alcohol and Alcoholism 52. doi:10.1093/alcalc/agw058 (2016).
21. Sirlanci, M., Luczak, S. & Rosen, I. Estimation of the Distribution of Random Parameters in Discrete Time Abstract Parabolic Systems with Unbounded Input and Output: Approximation and Convergence.
Communications in Applied Analysis 23, 287–329. doi:10.12732/caa.v23i2.4 (2019).
22. Sirlanci, M. Finite dimensional approximation and convergence in the estimation of the distribution of, and input to, random abstract parabolic systems with application to the deconvolution of blood/breath alcohol concentration from biosensor measured transdermal alcohol level. PhD thesis (University of Southern California, 2018).
23. Sirlanci, M., Rosen, I. G., Luczak, S. E., Fairbairn, C. E., Bresin, K. & Kang, D. Deconvolving the Input to Random Abstract Parabolic Systems: A Population Model-Based Approach to Estimating Blood/Breath Alcohol Concentration from Transdermal Alcohol Biosensor Data. Inverse Problems 34. arXiv:1807.05088v1 (2018).
24. Hawekotte, K., Luczak, S. E. & Rosen, I. G. Deconvolving breath alcohol concentration from biosensor measured transdermal alcohol level under uncertainty: a Bayesian approach. Mathematical Biosciences and Engineering 18, 6739–6770 (2021).
25. Fairbairn, C. E., Kang, D. & Bosch, N. Using machine learning for real-time BAC estimation from a new-generation transdermal biosensor in the laboratory. Drug and Alcohol Dependence 216, 108205. doi:10.1016/j.drugalcdep.2020.108205 (2020).
26. Psichogios, D. C. & Ungar, L. H. A hybrid neural network-first principles approach to process modeling. AIChE Journal 38, 1499–1511. doi:10.1002/aic.690381003 (1992).
27. Lagaris, I. E., Likas, A. & Fotiadis, D. I. Artificial neural networks for solving ordinary and partial differential equations. IEEE Transactions on Neural Networks 9, 987–1000. doi:10.1109/72.712178 (1998).
28. Raissi, M., Perdikaris, P. & Karniadakis, G. E. Physics Informed Deep Learning (Part I): Data-driven Solutions of Nonlinear Partial Differential Equations (2017).
29. Raissi, M., Perdikaris, P. & Karniadakis, G. E. Physics Informed Deep Learning (Part II): Data-driven Discovery of Nonlinear Partial Differential Equations (2017).
30. Raissi, M., Perdikaris, P. & Karniadakis, G. E. Multistep Neural Networks for Data-driven Discovery of Nonlinear Dynamical Systems (2018).
31. Raissi, M. & Karniadakis, G. Hidden Physics Models: Machine Learning of Nonlinear Partial Differential Equations. Journal of Computational Physics 357. doi:10.1016/j.jcp.2017.11.039 (2017).
32. Raissi, M. Deep Hidden Physics Models: Deep Learning of Nonlinear Partial Differential Equations. J. Mach. Learn. Res. 19, 932–955 (2018).
33. Raissi, M., Perdikaris, P. & Karniadakis, G. Physics-Informed Neural Networks: A Deep Learning Framework for Solving Forward and Inverse Problems Involving Nonlinear Partial Differential Equations. Journal of Computational Physics 378. doi:10.1016/j.jcp.2018.10.045 (2018).
34. Kharazmi, E., Zhang, Z. & Karniadakis, G. E. Variational Physics-Informed Neural Networks For Solving Partial Differential Equations. ArXiv abs/1912.00873 (2019).
35. Jagtap, A. D. & Karniadakis, G. E. Extended Physics-informed Neural Networks (XPINNs): A Generalized Space-Time Domain Decomposition based Deep Learning Framework for Nonlinear Partial Differential Equations. Communications in Computational Physics (2021).
36. Meng, X., Li, Z., Zhang, D. & Karniadakis, G. E. PPINN: Parareal Physics-Informed Neural Network for time-dependent PDEs. ArXiv abs/1909.10145 (2019).
37. Zhu, Y., Zabaras, N., Koutsourelakis, P.-S. & Perdikaris, P. Physics-constrained deep learning for high-dimensional surrogate modeling and uncertainty quantification without labeled data.
Journal of Computational Physics 394, 56–81. doi:10.1016/j.jcp.2019.05.024 (2019).
38. Geneva, N. D. & Zabaras, N. Modeling the Dynamics of PDE Systems with Physics-Constrained Deep Auto-Regressive Networks. ArXiv abs/1906.05747 (2020).
39. Sanchez-Gonzalez, A., Godwin, J., Pfaff, T., Ying, R., Leskovec, J. & Battaglia, P. Learning to Simulate Complex Physics with Graph Networks in Proceedings of the 37th International Conference on Machine Learning (eds III, H. D. & Singh, A.) 119 (PMLR, 2020), 8459–8468.
40. Karniadakis, G., Kevrekidis, I., Lu, L., Perdikaris, P., Wang, S. & Yang, L. Physics-informed machine learning. Nat. Rev. Phys. 3, 422–440 (2021).
41. Elbrachter, D., Grohs, P., Jentzen, A. & Schwab, C. DNN Expression Rate Analysis of High-Dimensional PDEs: Application to Option Pricing. Constructive Approximation (2018).
42. Han, J., Jentzen, A. & E, W. Solving high-dimensional partial differential equations using deep learning. Proceedings of the National Academy of Sciences 115, 8505–8510 (2018).
43. Raissi, M., Yazdani, A. & Karniadakis, G. E. Hidden fluid mechanics: Learning velocity and pressure fields from flow visualizations. Science 367, 1026–1030 (2020).
44. Tartakovsky, A. M., Marrero, C. O., Perdikaris, P., Tartakovsky, G. & Barajas-Solano, D. A. Learning Parameters and Constitutive Relationships with Physics Informed Deep Neural Networks. arXiv: Analysis of PDEs (2018).
45. Hennigh, O., Narasimhan, S., Nabian, M. A., Subramaniam, A., Tangsali, K. M., Rietmann, M., et al. NVIDIA SimNet: an AI-accelerated multi-physics simulation framework in ICCS (2021).
46. Cai, S., Wang, Z., Wang, S., Perdikaris, P. & Karniadakis, G. Physics-Informed Neural Networks (PINNs) for Heat Transfer Problems. Journal of Heat Transfer 143. doi:10.1115/1.4050542 (2021).
47. Lu, L., Dao, M., Kumar, P., Ramamurty, U., Karniadakis, G. & Suresh, S. Extraction of mechanical properties of materials through deep learning from instrumented indentation. Proceedings of the National Academy of Sciences 117, 201922210. doi:10.1073/pnas.1922210117 (2020).
48. Chen, Y., Lu, L., Karniadakis, G. E. & Negro, L. D. Physics-informed neural networks for inverse problems in nano-optics and metamaterials. Opt. Express 28, 11618–11633. doi:10.1364/OE.384875 (2020).
49. Goswami, S., Anitescu, C., Chakraborty, S. L. & Rabczuk, T. Transfer learning enhanced physics informed neural network for phase-field modeling of fracture (2019).
50. Kissas, G., Yang, Y., Hwuang, E., Witschey, W. R., Detre, J. A. & Perdikaris, P. Machine learning in cardiovascular flows modeling: Predicting arterial blood pressure from non-invasive 4D flow MRI data using physics-informed neural networks. Computer Methods in Applied Mechanics and Engineering 358, 112623. doi:10.1016/j.cma.2019.112623 (2020).
51. Costabal, F. S., Yang, Y., Perdikaris, P., Hurtado, D. E. & Kuhl, E. Physics-Informed Neural Networks for Cardiac Activation Mapping in Frontiers in Physics (2020).
52. Oszkinat, C., Luczak, S. E. & Rosen, I. G. Uncertainty Quantification in the Estimation of Blood Alcohol Concentration Using Physics-Informed Neural Networks. IEEE Trans. Neural Netw. Learn. Syst. (2021).
53. Baum, L. E. & Petrie, T. Statistical Inference for Probabilistic Functions of Finite State Markov Chains. Ann. Math. Statist. 37, 1554–1563 (1966).
54. Baum, L. E. & Eagon, J. A. An inequality with applications to statistical estimation for probabilistic functions of Markov processes and to a model for ecology. Bull. Amer.
Math. Soc. 73, 360–363 (1967).
55. Baum, L. E. & Sell, G. R. Growth transformations for functions on manifolds. Pacific J. Math. 27, 211–227 (1968).
56. Baum, L. E., Petrie, T., Soules, G. & Weiss, N. A Maximization Technique Occurring in the Statistical Analysis of Probabilistic Functions of Markov Chains. Ann. Math. Statist. 41, 164–171 (1970).
57. Baum, L. E. An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes. Inequalities 3, 1–8 (1972).
58. Rabiner, L. R. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 77, 257–286. doi:10.1109/5.18626 (1989).
59. Juang, B. H. & Rabiner, L. R. Hidden Markov Models for Speech Recognition. Technometrics 33, 251–272 (1991).
60. Rabiner, L. R., Levinson, S. E. & Sondhi, M. M. On the application of vector quantization and hidden Markov models to speaker-independent, isolated word recognition. The Bell System Technical Journal 62, 1075–1105 (1983).
61. Schwartz, R., Chow, Y., Roucos, S., Krasner, M. & Makhoul, J. Improved hidden Markov modeling of phonemes for continuous speech recognition in ICASSP '84, IEEE International Conference on Acoustics, Speech, and Signal Processing (1984), 21–24.
62. Mor, B., Garhwal, S. & Kumar, A. A systematic review of hidden Markov models and their applications. Arch. Comput. Methods Eng. 28, 1429–1448 (2021).
63. Cappe, O. Ten years of HMMs (2001).
64. Cappe, O., Moulines, E. & Ryden, T. Inference in Hidden Markov Models (Springer, New York, 2005).
65. Liu, W., Lai, Z., Bacsa, K. & Chatzi, E. Physics-guided Deep Markov Models for Learning Nonlinear Dynamical Systems with Uncertainty (2021).
66. Saldich, E. B., Wang, C., Rosen, I. G., Goldstein, L., Bartroff, J., Swift, R. M., et al. Obtaining high-resolution multi-biosensor data for modeling transdermal alcohol concentration data. Alcoholism: Clinical and Experimental Research 44, Suppl. 1 (2020).
67. Banks, H. T. & Ito, K. A unified framework for approximation in inverse problems for distributed parameter systems. Control Theory Advanced Technology 4, 73–90 (1988).
68. Tanabe, H. Equations of Evolution (Pitman, 1979).
69. Banks, H. T. & Kunisch, K. Estimation Techniques for Distributed Parameter Systems (Springer Science & Business Media, 2012).
70. Schultz, M. H. Spline Analysis (Pearson Education, Limited, 1972).
71. Yao, M., Luczak, S. E. & Rosen, I. G. A population model-based LQG compensator for the control of intravenously-infused alcohol studies with transdermal sensing (2021).
72. Kato, T. Perturbation Theory for Linear Operators (Springer Science & Business Media, 1995).
73. Pazy, A. Semigroups of Linear Operators and Applications to Partial Differential Equations (Springer, 1983).
74. Banks, H. T., Burns, J. A. & Cliff, E. M. Parameter estimation and identification for systems with delays. SIAM Journal on Control and Optimization 19, 791–828. doi:10.1137/0319051 (1981).
75. Campbell, S. L. & Meyer, C. D. Generalized Inverses of Linear Transformations (SIAM, 2009).
76. Casella, G. & Berger, R. L. Statistical Inference (Duxbury, Pacific Grove, CA, 2002).
77. MacKay, D. J. C. Information Theory, Inference and Learning Algorithms (Cambridge University Press, 2003).
78. Wang, X. & Bai, Y. The global Minmax k-means algorithm. SpringerPlus 5 (2016).
79. Likas, A., Vlassis, N. A. & Verbeek, J. J. The global k-means clustering algorithm. Pattern Recognit. 36, 451–461 (2003).
80. Goodfellow, I., Bengio, Y. & Courville, A.
Deep Learning (MIT Press, 2016).
81. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., et al. Generative Adversarial Nets in Advances in Neural Information Processing Systems (eds Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N. & Weinberger, K. Q.) 27 (Curran Associates, Inc., 2014).
82. Yang, Y. & Perdikaris, P. Adversarial uncertainty quantification in physics-informed neural networks. Journal of Computational Physics 394, 136–152. doi:10.1016/j.jcp.2019.05.027 (2019).
83. Li, C., Li, J., Wang, G. & Carin, L. Learning to sample with adversarially learned likelihood-ratio. Submitted for ICLR 2018.
84. Baydin, A. G., Pearlmutter, B. A., Radul, A. A. & Siskind, J. M. Automatic differentiation in machine learning: a survey. arXiv:1502.05767 (2015).
85. Hastings, W. K. Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57, 97–109. doi:10.1093/biomet/57.1.97 (1970).
86. Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., et al. TensorFlow: A System for Large-Scale Machine Learning in (USENIX Association, Savannah, GA, USA, 2016), 265–283.
87. Kingma, D. & Ba, J. Adam: A Method for Stochastic Optimization. International Conference on Learning Representations (2014).
88. Hochreiter, S. & Schmidhuber, J. Long Short-term Memory. Neural Computation 9, 1735–1780. doi:10.1162/neco.1997.9.8.1735 (1997).
89. Sirlanci, M., Luczak, S. E., Fairbairn, C. E., Kang, D., Pan, R., Yu, X., et al. Estimating the distribution of random parameters in a diffusion equation forward model for a transdermal alcohol biosensor. Automatica, to appear. arXiv:1808.04058 (2018).
90. Gal, Y. & Ghahramani, Z. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. arXiv:1506.02142 (2015).
91. Gal, Y. Uncertainty in Deep Learning. PhD thesis (University of Cambridge, 2016).
92. Caflisch, R. Monte Carlo and quasi-Monte Carlo methods. Acta Numerica 7, 1–49. doi:10.1017/S0962492900002804 (1998).
93. Hinton, G. E. & van Camp, D. Keeping the Neural Networks Simple by Minimizing the Description Length of the Weights in Proceedings of the Sixth Annual Conference on Computational Learning Theory (Association for Computing Machinery, Santa Cruz, California, USA, 1993), 5–13. doi:10.1145/168304.168306.
94. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. & Salakhutdinov, R. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research 15, 1929–1958 (2014).
95. Oszkinat, C., Shao, T., Wang, C., Rosen, I. G., Rosen, A. D., Saldich, E. B., et al. Obtaining Blood and Breath Alcohol Concentration from Biosensor Measured Transdermal Alcohol Level: Estimation and Uncertainty Quantification Based on a Covariate-Dependent Hidden Markov Model (2021).
96. The Shapley Value: Essays in Honor of Lloyd S. Shapley. doi:10.1017/CBO9780511528446 (Cambridge University Press, 1988).
97. Dempster, A. P., Laird, N. M. & Rubin, D. B. Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, Series B (Methodological) 39, 1–38 (1977).
98. Liu, T., Lemeire, J. & Yang, L. Proper initialization of Hidden Markov models for industrial applications in 2014 IEEE China Summit International Conference on Signal and Information Processing (ChinaSIP) (2014), 490–494. doi:10.1109/ChinaSIP.2014.6889291.
99. Zucchini, W., Macdonald, I. & Langrock, R.
Hidden Markov Models for Time Series - An Introduction Using R doi:10.1201/b20790 (2016). 100. Shirley, K. E., Small, D. S., Lynch, K. G., Maisto, S. A. & Oslin, D. W. Hidden Markov models for alcoholism treatment trial data. The Annals of Applied Statistics 4. doi:10.1214/ 09-aoas282 (2010). 101. Murphy, K. P. Fitting a conditional linear Gaussian distribution 2000. 102. Viterbi, A. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory 13, 260–269. doi:10.1109/TIT. 1967.1054010 (1967). 103. Sirlanci, M., Luczak, S. & Rosen, I. G. Approximation and convergence in the estimation of random parameters in linear holomorphic semigroups generated by regularly dissipa- tive operators in American Control Conference (ACC), 2017 (2017), 3171–3176. doi:10. 23919/ACC.2017.7963435. 104. Shin, Y ., Darbon, J. & Karniadakis, G. E. On the Convergence of Physics Informed Neural Networks for Linear Second-Order Elliptic and Parabolic Type PDEs. Communications in Computational Physics 28, 2042–2074. doi:10.4208/cicp.oa-2020-0193 (2020). 129
Abstract
In this work, we consider the problem of estimating breath alcohol concentration (BrAC) from transdermal alcohol concentration (TAC) biosensor data. Since classical ways of obtaining BrAC suffer from the need for user participation and other disadvantages, TAC measurements provide a promising alternative for alcohol researchers and clinicians. While this problem has been studied extensively in the past, here we advance the research by developing and investigating physics-informed machine learning techniques. To this end, we first establish and discuss a diffusion partial differential equation (PDE) model with unknown parameters for the transport of ethanol through the epidermal layer of the skin. We show how the unknown parameters can be modeled as random variables and how this naturally leads to a population model in an abstract parabolic framework. We discretize the model in both time and space and, using results and techniques from linear semigroup theory, prove convergence results for the discretized model. Further, we show that the model depends continuously on the random parameters and thus prove consistency of the parameter estimators.
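For orientation, a minimal sketch of the kind of one-dimensional diffusion model used in this line of work follows; the exact boundary conditions, scaling, and parameterization in the dissertation may differ. The epidermal layer is normalized to the unit interval, u(t) denotes the BrAC input at the dermal boundary, y(t) the TAC output at the skin surface, and q = (q_1, q_2) the unknown (random) parameters:

```latex
\begin{aligned}
  &\frac{\partial x}{\partial t}(t,\eta) = q_1 \frac{\partial^2 x}{\partial \eta^2}(t,\eta),
    && 0 < \eta < 1,\ t > 0, \\
  &q_1 \frac{\partial x}{\partial \eta}(t,0) = x(t,0), \qquad
    q_1 \frac{\partial x}{\partial \eta}(t,1) = q_2\, u(t), \\
  &x(0,\eta) = 0, \qquad y(t) = x(t,0).
\end{aligned}
```

Treating q as a random vector and averaging over its distribution is what turns this single-subject model into the population model referred to above.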
Based on this model, we consider three machine learning models and propose ways to introduce prior knowledge into them in the form of the above-mentioned population model. Combining the data-driven machine learning models with a first-principles physical model in this way yields hybrid, physics-informed models. The application of modern machine learning techniques offers many advantages: these models can train efficiently on larger data sets, their flexibility allows for the inclusion of other covariates, and their complexity can capture so-far unmodeled aspects of the relationship between BrAC and TAC. At the same time, the physics information acts as a regularizing term that encourages physically meaningful behavior of these models.
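As a schematic of how the physics enters training, here is a minimal sketch in Python. It assumes the network also predicts the concentration field on a regular (time, depth) grid so that the diffusion residual can be penalized with finite differences; all names (physics_informed_loss, x_grid_pred, lam) are illustrative rather than the dissertation's code, and in a true physics-informed network the derivatives would typically come from automatic differentiation rather than finite differences.

```python
import numpy as np

def physics_informed_loss(y_pred, y_obs, x_grid_pred, dt, dh, q1, lam=1.0):
    """Composite loss: data misfit plus a PDE-residual penalty.

    x_grid_pred: hypothetical (n_times, n_depths) prediction of the
    concentration field, used to check the residual of x_t = q1 * x_hh.
    """
    # data term: mean squared error on the observed signal
    data_loss = np.mean((y_pred - y_obs) ** 2)
    # central differences for x_t and x_hh on interior grid points
    x_t = (x_grid_pred[2:, 1:-1] - x_grid_pred[:-2, 1:-1]) / (2.0 * dt)
    x_hh = (x_grid_pred[1:-1, 2:] - 2.0 * x_grid_pred[1:-1, 1:-1]
            + x_grid_pred[1:-1, :-2]) / dh**2
    physics_loss = np.mean((x_t - q1 * x_hh) ** 2)
    # the weight lam trades data fidelity against physical consistency
    return data_loss + lam * physics_loss
```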
In the first of these models, we use a generative adversarial network (GAN) with a residual-augmented loss function to estimate the distribution of the random parameters of the diffusion model. We then design another physics-informed neural network for the deconvolution of the BrAC signal from the TAC signal. Based on the estimated distribution of the random parameters, this network estimates the BrAC signal and quantifies the uncertainty in the form of conservative error bands. Moreover, we show how a posterior latent variable can be used to sharpen these conservative error bands. We apply the techniques to an extensive data set of drinking episodes and demonstrate the advantages and shortcomings of this approach.
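One generic way such bands can be produced from an estimated parameter distribution is by propagating parameter draws through the deconvolution and taking pointwise quantiles; the sketch below illustrates the idea only, and deconvolve, q_samples, and conservative_brac_band are placeholders rather than the dissertation's API.

```python
import numpy as np

def conservative_brac_band(tac, q_samples, deconvolve, coverage=0.90):
    """Pointwise uncertainty band for the deconvolved BrAC signal.

    deconvolve(tac, q): stand-in for the model-based deconvolution of
    one BrAC curve given one parameter draw q; q_samples: draws from
    the learned parameter distribution (e.g. from a trained generator).
    """
    curves = np.stack([deconvolve(tac, q) for q in q_samples])  # (n, T)
    alpha = 1.0 - coverage
    lower = np.quantile(curves, alpha / 2.0, axis=0)
    upper = np.quantile(curves, 1.0 - alpha / 2.0, axis=0)
    return curves.mean(axis=0), lower, upper
```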
The second model under consideration is a long short-term memory (LSTM) network. Here, we design the network so that, besides the TAC signal, both time-dependent and time-independent covariates can be processed to estimate BrAC. We use the mathematical population model to add physics information to the training stage of the model. In addition, we use a recently discovered relation between Bayesian learning and the stochastic regularization technique dropout to quantify the uncertainty of the estimates. We employ Shapley values to measure the individual importance of the covariates, and demonstrate the proposed model with numerical studies.
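The dropout-based uncertainty quantification (Monte Carlo dropout in the sense of Gal and Ghahramani) can be sketched in a few lines of TensorFlow; the toy architecture below is illustrative only, and the dissertation's network and its handling of time-independent covariates are more elaborate.

```python
import numpy as np
import tensorflow as tf

def build_lstm(n_steps, n_features, p_drop=0.2):
    # Toy LSTM regressor mapping a covariate-augmented TAC sequence
    # to a BrAC estimate at every time step.
    return tf.keras.Sequential([
        tf.keras.Input(shape=(n_steps, n_features)),
        tf.keras.layers.LSTM(64, return_sequences=True),
        tf.keras.layers.Dropout(p_drop),
        tf.keras.layers.Dense(1),
    ])

def mc_dropout_predict(model, x, n_samples=100):
    # Keep dropout active at prediction time (training=True) and sample
    # repeatedly; the mean and spread of the draws approximate the
    # Bayesian predictive distribution.
    draws = np.stack([model(x, training=True).numpy()
                      for _ in range(n_samples)])
    return draws.mean(axis=0), draws.std(axis=0)
```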
Finally, we develop a novel approach to estimating BrAC from TAC based on covariate-dependent, physics-informed hidden Markov models with two emissions. The hidden Markov chain serves as a forward full-body alcohol model, with the two emissions, BrAC and TAC, assumed to follow a bivariate normal distribution that depends on the hidden Markovian states and on person-level and session-level covariates via built-in regression models. An extension of hidden Markov modeling is developed wherein the hidden Markov model framework is regularized by the first-principles PDE model to yield a hybrid model. We train the model via the Baum-Welch algorithm on 256 sets of BrAC and TAC signals and covariate measurements collected in the laboratory. Forward filtering of TAC to obtain estimated BrAC is achieved via a new physics-informed regularized Viterbi algorithm that determines the most likely path through the hidden Markov chain using TAC alone. The Markovian states are decoded and used to yield estimates of BrAC and to quantify the uncertainty in those estimates. We present and discuss numerical studies.
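For reference, plain (unregularized) Viterbi decoding looks as follows; the dissertation's physics-informed variant additionally regularizes the path scores with the PDE model, which is not shown here.

```python
import numpy as np

def viterbi(log_pi, log_A, log_B):
    """Most likely state path of a K-state HMM.

    log_pi: (K,) initial log-probabilities; log_A: (K, K) transition
    log-probabilities; log_B: (T, K) emission log-likelihoods of the
    observed TAC sequence under each hidden state.
    """
    T, K = log_B.shape
    delta = log_pi + log_B[0]          # best log-score ending in each state
    psi = np.zeros((T, K), dtype=int)  # back-pointers
    for t in range(1, T):
        scores = delta[:, None] + log_A   # (from-state, to-state)
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_B[t]
    path = np.empty(T, dtype=int)
    path[-1] = int(delta.argmax())
    for t in range(T - 2, -1, -1):       # backtrack along the pointers
        path[t] = psi[t + 1, path[t + 1]]
    return path
```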
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
Obtaining breath alcohol concentration from transdermal alcohol concentration using Bayesian approaches
Population modeling and Bayesian estimation for the deconvolution of blood alcohol concentration from transdermal alcohol biosensor data
An abstract hyperbolic population model for the transdermal transport of ethanol in humans: estimating the distribution of random parameters and the deconvolution of breath alcohol concentration
Distributed parameter model based system identification and filtering in the estimation of blood alcohol concentration from transdermal alcohol biosensor data
A nonlinear pharmacokinetic model used in calibrating a transdermal alcohol transport concentration biosensor data analysis software
M-estimation and non-parametric estimation of a random diffusion equation-based population model for the transdermal transport of ethanol: deconvolution and uncertainty quantification
A Bayesian approach for estimating breath from transdermal alcohol concentration
Determining blood alcohol concentration from transdermal alcohol data: calibrating a mathematical model using a drinking diary
Prohorov Metric-Based Nonparametric Estimation of the Distribution of Random Parameters in Abstract Parabolic Systems with Application to the Transdermal Transport of Alcohol
Finite dimensional approximation and convergence in the estimation of the distribution of, and input to, random abstract parabolic systems with application to the deconvolution of blood/breath al...
Linear quadratic control, estimation, and tracking for random abstract parabolic systems with application to transdermal alcohol biosensing
Delta Method confidence bands for parameter-dependent impulse response functions, convolutions, and deconvolutions arising from evolution systems described by…
Estimation of random input to semi-linear abstract parabolic systems with application to quantitative description of drinking behavior based on trans-dermal alcohol concentration
Large scale inference with structural information
Building better species distribution models with machine learning: assessing the role of covariate scale and tuning in Maxent models
Designing data-effective machine learning pipeline in application to physics and material science
Nonlinear modeling and machine learning methods for environmental epidemiology
Mutual information estimation and its applications to machine learning
Theoretical foundations for dealing with data scarcity and distributed computing in modern machine learning
Shrinkage methods for big and complex data analysis
Asset Metadata
Creator
Oszkinat, Clemens Bastian (author)
Core Title
Physics-informed machine learning techniques for the estimation and uncertainty quantification of breath alcohol concentration from transdermal alcohol biosensor data
School
College of Letters, Arts and Sciences
Degree
Doctor of Philosophy
Degree Program
Applied Mathematics
Degree Conferral Date
2022-08
Publication Date
05/06/2022
Defense Date
04/18/2022
Publisher
University of Southern California (original); University of Southern California. Libraries (digital)
Tag
biosensors,distribution estimation,hidden Markov models,LSTM,OAI-PMH Harvest,physics-informed machine learning,transdermal alcohol concentration
Format
application/pdf (imt)
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Rosen, Gary (committee chair)
Creator Email
clemens_oszkinat@hotmail.de,oszkinat@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC111271663
Unique identifier
UC111271663
Document Type
Dissertation
Rights
Oszkinat, Clemens Bastian
Type
texts
Source
20220506-usctheses-batch-939 (batch); University of Southern California (contributing entity); University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright. The original signature page accompanying the original submission of the work to the USC Libraries is retained by the USC Libraries and a copy of it may be obtained by authorized requesters contacting the repository e-mail address given.
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
cisadmin@lib.usc.edu