EXPLORATIONS IN THE USE OF QUANTUM ANNEALERS FOR MACHINE LEARNING

by

Richard Li

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(CHEMISTRY (CHEMICAL PHYSICS))

August 2019

Copyright 2019 Richard Li

Acknowledgements

I am quite indebted to a large group of people without whom this PhD would not have been possible, but I would like to mention a few specifically here. First, I would like to thank my advisors, Prof. Daniel Lidar and Prof. Remo Rohs, for their mentorship during these years. Both provided an environment conducive to asking questions and never made me feel like my ideas (though probably naive at times) were unimportant. I appreciate immensely the trust they placed in me as a student, giving me the freedom for independent exploration yet also being available when I got stuck. They have taught me to be critical yet optimistic, when to dive into the details and when to let things go. Beyond the scientific discussions, I have benefited immensely from their perspective when things did not quite go as expected – how to see things through and appreciate the process of discovery as much as the result.

I would also like to thank all those who were with me in the weeds, so to speak. The friendship of group-mates, including Siddharth, Dr. (soon-to-be Assistant Prof.) Tameem Albash, Tsu-Pei, Beibei, and Satya, helped me get through my PhD existential crises and was always a welcome source of interesting conversation, often about unscientific subjects. Thank you to those outside of USC for reminding me of life outside the bubble of my PhD. Thanks to all the members of my church, who may not know what I have been doing but have cared for me and fed me nonetheless.

Lastly, I would like to thank my family. Much gratitude to my parents for giving me life and raising me. Thanks to my sister for always being my friend and taking fun vacations with me. And a special thanks to my wife, Kristen, who joined me at the end of my PhD but made the last part of it particularly memorable.

Table of Contents

Acknowledgements
List Of Tables
List Of Figures
Abstract

Chapter 1: Introduction
  1.1 Quantum annealing with D-Wave processors
    1.1.1 Quantum annealing
    1.1.2 D-Wave processors
    1.1.3 Embedding
  1.2 Machine learning
  1.3 Summary of results

Chapter 2: Transcription Factor - DNA Binding
  2.1 Introduction
  2.2 Results
    2.2.1 Performance on gcPBM Data
    2.2.2 Weight Logos from Feature Weights
    2.2.3 Performance on HT-SELEX Data
  2.3 Discussion
  2.4 Methods
    2.4.1 QUBO mapping of TF-DNA binding problem
    2.4.2 Technical details of algorithms
    2.4.3 Data processing and availability
    2.4.4 Using Excited State Solutions and Combining Weights (DW, SA and SQA)
      2.4.4.1 Iterative averaging
      2.4.4.2 Direct averaging
    2.4.5 Binarizing the binding affinities

Chapter 3: Cancer Classification
  3.1 Introduction
  3.2 Methods
    3.2.1 Dataset and dimensionality reduction
    3.2.2 Standard machine learning approaches
    3.2.3 D-Wave
    3.2.4 Simulated Annealing
    3.2.5 Field
    3.2.6 Random
    3.2.7 Formulating a multi-class classification problem on a quantum annealer
    3.2.8 Cross-validation
  3.3 Results
    3.3.1 Binary Classification
    3.3.2 Reducing Amount of Training Data
    3.3.3 Further study of Random's performance
      3.3.3.1 Performance metrics versus energy
      3.3.3.2 Varying the number of weights
    3.3.4 Multiclass Classification
  3.4 Discussion

Chapter 4: Nested Quantum Annealing Correction
  4.1 Technical Background
    4.1.1 NQAC
    4.1.2 Boltzmann machines
  4.2 Methods
    4.2.1 Datasets and performance metrics
      4.2.1.1 Supervised machine learning of MNIST
      4.2.1.2 Unsupervised machine learning of Bars and Stripes
    4.2.2 Temperature estimation
    4.2.3 Distance from data
    4.2.4 Training procedure
    4.2.5 Classical repetition vs NQAC
    4.2.6 Technical Details of D-Wave
    4.2.7 Ramping
    4.2.8 SQA and SVMC Simulations
  4.3 Results
    4.3.1 Performance as a function of the rescaling parameter
    4.3.2 Results for the BAS dataset at different anneal times
    4.3.3 Results for the BAS dataset at different ramp points
    4.3.4 Comparison to SQA and SVMC at C = 3
  4.4 Discussion

Chapter 5: Conclusion

Reference List

Appendix A: Classification measures

List Of Tables

2.1 Value of the hyper-parameter λ used for the AUPRC. Parentheses indicate the method of combining solutions used: (it) refers to iterative averaging and (dir) refers to direct averaging (see Sec. 2.4.4). Since MLR has a closed-form solution, there is no need to combine solutions. Max* refers to the gcPBM data, and Max_y to the HT-SELEX dataset.

2.2 Value of the hyper-parameter λ used for Kendall's τ. Otherwise the same as Table 2.1.

3.1 Size of the datasets used for this work.

4.1 Number of sweeps used in SQA simulations for each final Hamiltonian (anneal time) and at different ramp points during the anneal (s).

4.2 Number of sweeps used in SVMC simulations for each final Hamiltonian (anneal time) and at different ramp points during the anneal (s).

A.1 Confusion Matrix

List Of Figures

1.1 Illustration of the principles and purpose of this work. a: quantum versus classical adiabatic annealing processes. With quantum annealing there is the possibility for the system state (red) to tunnel through a changing barrier (black) and arrive at the ground state; for classical annealing, the system must rely on thermal fluctuations (temperature T > 0) to overcome any energy barriers. b: typical spectrum of instantaneous energy eigenvalues during adiabatic quantum optimization. The ground state energy at s = 0 has a significant gap to the next energy level. The speed at which the optimization can take place depends on the size of the minimum gap.

1.2 Chimera graph of NASA's D-Wave 2000Q processor. Green qubits indicate active qubits, gray qubits are inactive qubits, and active physical couplers are indicated by black lines.

2.1 Illustration of the principles and purpose of this work. (b) Using simplified datasets of a small number of sequences derived from actual binding affinity experiments, we use D-Wave, a commercially available quantum annealer, to classify and rank binding affinity preferences of a TF to DNA.

2.2 Schematic overview of the data handling procedure. About 10% of the data were held out for testing (D_TEST) and were unseen during the calibration and training phases. The remaining 90% of the data were used for calibration and training (D_TRAIN). For DW, SA, and MLR, in the calibration step, the hyper-parameter λ was tuned using a 100-fold Monte Carlo cross-validation. For XGB, several other hyper-parameters were tuned (see Methods for a list). During the training step, a procedure similar to bagging (bootstrap aggregating) was used by randomly sampling, with replacement, 2% and 10% of the data 50 times to give 50 training instances. In the testing step, the area under the precision-recall curve (AUPRC) of the best-performing weights for each of the 50 training instances was evaluated on D_TEST, which was unseen during training and calibration, to evaluate generalization of classification performance. The calibration, training, and testing procedure was identical for ranking tasks, with the exception that Kendall's τ was used as the metric of performance instead of the AUPRC.

2.3 Quantitative performance on the held-out experimental test dataset for two different types of tasks on three high-quality gcPBM datasets. (a) The mean AUPRC for Mad, Max, and Myc plotted versus threshold at certain threshold percentiles of the data, when training with 2% of the data (left) and 10% of the data (right). In both cases 50 instances were randomly selected for training, and the performance of the 50 trained weights is evaluated on the same held-out test set. Error bars are the standard deviation. (b) Boxplot of Kendall's τ on the held-out test dataset. Red '+' symbols indicate outliers; the gray line represents the median. The bottom and top edges of the box represent the 25th and 75th percentiles, respectively.

2.4 Comparison of feature weights visualized as weight logos for DW, SA, and MLR. Weights represent the relative importance of a nucleotide at each position for the binding affinity. These weight logos were obtained using the Mad, Max, and Myc gcPBM datasets when training with the AUPRC as the objective.

2.5 Summary of results for HT-SELEX data. (a) Comparison of the AUPRC (top) and Kendall's τ (bottom) when training with 1% of the data (left) and 5% of the data (right) for Max. Error bars are the standard deviation over 50 instances. (b) Comparison of the AUPRC (top) and Kendall's τ (bottom) when training with 2% of the data (left) and 5% of the data (right) for TCF4. (c) Weight logos for Max and TCF4 from training with the AUPRC as the objective.

2.6 Percent change in objective function versus training size and number of solutions included. Both DW's (left) and SA's (right) performance improves with an increasing number of solutions included. SQA was not included but performed similarly. Error bars represent the standard deviation.

3.1 Overview of the data pipeline. (a) Whole exome sequencing, RNA-Seq, miRNA-Seq, DNA methylation array, and genotyping array data were retrieved from The Cancer Genome Atlas (TCGA) for cancer classification. The data were concatenated into a single matrix, which was then repeatedly split into training cuts of 80% of the data and testing cuts of 20% of the data. Principal component analysis (PCA) was performed on each training cut of the data, and the test data were projected using the same loadings. We compared several standard classical machine learning algorithms with quantum and quantum-inspired algorithms (see Methods for details of the algorithms). After training, performance was evaluated on the corresponding test dataset for a variety of metrics, including balanced accuracy, AUC, and the F1 score. Averages were taken across the 100 cuts of the data. (b) The 6 cancer types used for the 6-cancer classification.

3.2 Comparison of classification approaches on the five binary classification datasets. Algorithms are sorted based on their performance in terms of the balanced accuracy. Different performance metrics are represented in different colors. Error bars represent two standard errors of the mean.

3.3 Comparison of classification approaches on the lumAB dataset using the top 44 genes from PC1. Error bars represent two standard errors of the mean.

3.4 The test set balanced accuracy for the Luminal A vs B binary classification with different amounts of training data. Error bars are two standard errors of the mean.

3.5 Evaluation of the performance of DW, SA at various final inverse temperatures (β₁ in the legend), and Random when using a sliding window of energies. Average balanced accuracy (top) and average logistic loss (bottom) across the 100 cuts on the test dataset versus the average Ising energy of the post-processed weights. The shaded region represents 2 standard deviations.

3.6 Heatmaps showing the effect of changing the total number of solutions (y-axis) and the number of solutions used in the post-processing procedure (x-axis) on the average balanced accuracy on the test datasets for the five binary classification datasets. SA was run with a final inverse temperature of β₁ = 0.03 and DW was run with a J_c of 8.0. A higher balanced accuracy indicates better performance.

3.7 Heatmaps showing the effect of changing the total number of solutions (y-axis) and the number of solutions used in the post-processing procedure (x-axis) on the average balanced accuracy on the test datasets for the five binary classification datasets. SA was run with a final inverse temperature of β₁ = 0.03 and DW was run with a J_c of 8.0. A higher balanced accuracy indicates better performance.

3.8 Heatmaps showing the effect of changing the total number of solutions (y-axis) and the number of solutions used in the post-processing procedure (x-axis) on the average balanced accuracy on the test datasets for the five binary classification datasets. SA was run with a final inverse temperature of β₁ = 0.03 and DW was run with a J_c of 8.0. A higher balanced accuracy indicates better performance.

3.9 Comparison of results for classification of six cancers. The same classical algorithms were used as for the binary classifications. Quantum annealing and simulated annealing perform worse than the classical methods for this dataset, though they still do better than randomly generated bit strings sorted according to their performance on the DW loss function.

4.1 Illustration of NQAC. The left panel represents the logical problem, the middle panels represent the encoded Hamiltonian with nesting level C = 2, 4, and the right panel represents the encoded physical problem embedded onto the D-Wave Chimera graph. Thick bright red lines in the middle and right panels represent γ₁, the ferromagnetic coupling between code qubits in the nested Hamiltonian. Thick brown lines in the right panel represent γ₂, the ferromagnetic coupling between physical qubits in the encoded problem.

4.2 Some representative images of the training datasets: BAS (left) and coarse-grained MNIST (right). Starting from the upper left and going across the rows and down the columns, the digits are {3, 6, 1, 5}, {2, 6, 7, 8}, {3, 1, 6, 4}, {0, 6, 8, 7}.

4.3 Performance using NQAC on the BAS dataset. A small advantage with increasing nesting level C is noticeable for the BAS dataset at intermediate values of the rescaling parameter. All runs were done with an anneal time t_f of 10 μs. Here and in all other plots, unless noted otherwise, error bars represent one standard deviation. The maximum possible value of the empirical log-likelihood for the BAS dataset used is −3.41. The empirical log-likelihood is computed using Eq. (4.9). The middle and left panels compare the performance of a classical repetition versus NQAC (see Sec. 4.2.5 for a description).

4.4 Detailed performance at the end of each training epoch for different values of the rescaling parameter for the BAS dataset. Here and in the subsequent figures, the effective inverse temperature (middle panel) is computed using Eq. (4.11), and the distribution distance from Gibbs (bottom panel) using Eq. (4.10).

4.5 Performance using NQAC on the coarse-grained MNIST dataset. There is a small but significant advantage to using NQAC at intermediate values of the rescaling parameter for C = 3. All runs were done with t_f = 10 μs.

4.6 Effect of using NQAC with different anneal times (measured in microseconds) for the BAS dataset, at a rescaling parameter of 0.03. The maximum possible value of the empirical log-likelihood in this case is −3.41. Both the empirical log-likelihood and the inverse temperature exhibit statistically significant growth with anneal time t_f and nesting level C. The distance from Gibbs grows with C and also with t_f for C = 2, 3.

4.7 Comparison of nesting levels at different anneal times with a ramp at various intermediate points in the anneal, for the BAS dataset. A 1 μs ramp was used throughout; the rescaling parameter was fixed at 0.03. Here error bars represent only one standard deviation to improve readability. For all C values there is a transition that becomes sharper with increasing anneal time. The transition point shifts to earlier in the anneal with increasing C; after the transition all quantities freeze, except at the largest anneal times. The trend with increasing anneal time is consistent throughout: the empirical log-likelihood, the fidelity with data (1 − distance), and the inverse temperature all increase. In contrast, the distance from Gibbs is at first smaller as a function of ramp point for larger t_f, then becomes larger for larger t_f past the freeze-out point.

4.8 As in Fig. 4.7, but with a fixed logical problem Hamiltonian. The trends are similar: freezing occurs earlier with higher nesting level. For C = 3, the distribution distance from Gibbs increases with longer anneal times. In addition, at longer anneal times, the distribution distance increases at later ramp points. This is probably due to 1/f noise.

4.9 SQA and SVMC simulations versus ramp point. Error bars are one standard deviation. The "anneal time" in the legend corresponds to the final physical problem Hamiltonian with C = 3 trained by DW at the specified anneal time. The corresponding number of sweeps for SQA and SVMC was selected such that the effective inverse temperature of the distribution was closest to that found by DW at the specified anneal time. See Tables 4.1 and 4.2 for the number of sweeps.

4.10 Direct comparison between DW, SQA, and SVMC for C = 3. Each column represents the final logical Hamiltonian after training with a particular anneal time (i.e., each column is a different color curve in Fig. 4.7).

Abstract

Quantum computing is an emerging field with potentially far-reaching implications for a diverse set of fields, from chemical simulation and materials science, to computer science, to finance and cryptography. Demonstrations of simple quantum algorithms have been carried out on a very small scale, and theoretical work has established how to scale up algorithms in the presence of noise. However, little is known in the intermediate regime of noisy quantum devices. The subject of this dissertation is the study of commercially available quantum annealers of a non-trivial size (namely, those manufactured by D-Wave) applied to practical problems in machine learning. D-Wave devices only admit problems formulated as a two-body Ising Hamiltonian, and underlying all the applications considered is the mapping of various problems of interest to an Ising form. First, we apply quantum annealing to model the binding preferences of transcription factors for different sequences of DNA. We show that we can map this problem to an Ising problem, that solving the Ising problem is advantageous when the amount of training data is scarce, and that the mapping generates weights that are consistent with known biological information. Second, we develop a different mapping of a classification problem to an Ising problem, based on an approximation to the negative log-likelihood, and apply our mapping to classify different classes of cancer. Our results show an advantage for quantum annealing, but the performance of QA can be matched by simple classical algorithms operating on the same Ising Hamiltonian. These simple algorithms are able to outperform standard classical machine learning approaches with a small amount of data. We also consider the effect of the approximations used in the mapping on the final machine learning performance.
Third, we apply an error-correction scheme to slightly improve the supervised and unsupervised learning performance of the D-Wave devices. Through comparison to classical simulations, we obtain a better understanding of the physics of the devices, which for these problems does not appear to be quantum. Overall, although we have not found a case where D-Wave outperforms all classical algorithms, our research has shown the power of formulating problems as an Ising model and gives a better understanding of the device.

Chapter 1

Introduction

Quantum computing seeks to harness uniquely quantum-mechanical phenomena such as entanglement, tunneling, and superposition to perform computation. In the past twenty or so years, quantum computing has blossomed from a rather obscure academic discipline into a subject that (for better or worse) has begun to seep into the consciousness of the general public. Today a significant amount of research and resources is being poured into preparing for an expected, imminent "quantum (information) revolution" at nearly all levels of the computational stack – from hardware [1, 2, 3] to programming languages [4, 5, 6, 7]. Not only are tech giants like Google, IBM, and Microsoft participating; a plethora of startups hoping to commercialize quantum algorithms have emerged, and the United States government has signed into law a "National Quantum Initiative" [8] meant in part to prepare a quantum workforce.

This most recent surge of interest in quantum computing was perhaps precipitated by Shor's factoring algorithm for integers [9], which promises to factor large integers exponentially faster than any known classical algorithm, but one of the earliest proposals for quantum computers was based on simulating quantum systems [10]. Research along the lines of using quantum computers for quantum simulation [11, 12, 13, 14] has been particularly fruitful, with applications in materials simulation [15] and in quantum chemistry to calculate ground state molecular energies, determine chemical dynamics, prepare initial guesses for wavefunctions, and rediscover molecular properties and reaction mechanisms [16, 17, 18, 19, 20, 21, 22, 23].

In parallel to research on quantum simulation, there has been much interest in the potential of using quantum computers to assist or replace classical machine learning routines. Machine learning has seen enormous success in many areas, including (but not limited to) image recognition, speech recognition, self-driving cars, medicine, and recommender systems for e-commerce [24, 25, 26, 27, 28]. This widespread success is in part due to the exponential growth in computational power and in the amount of available data. In essence, machine learning aims to extract patterns and make generalizations from a given set of training data without being explicitly told what the possible patterns are. Since quantum systems produce patterns that are thought to be difficult for classical systems to produce efficiently, it seems reasonable that quantum computers may outperform classical computers on certain machine learning tasks. As such, quantum versions of boosting, support vector machines, principal component analysis, and Boltzmann machines have been developed in the context of improving both supervised and unsupervised learning [29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43].

Despite the immense progress and excitement, substantial challenges remain before quantum computing becomes broadly applicable.
As noted in Refs. [38, 44] and elsewhere, there are several caveats that need to be addressed before claiming proof of quantum advantage for many of the above-mentioned quantum algorithms. These involve difficulties in rapidly preparing quantum states and the speed at which quantum operations can be performed. Taking into account the restrictions that exist for quantum computers may remove the advantages entirely, or otherwise lead to the discovery of classical algorithms that perform as well under the same restrictions. In addition, even if all algorithmic caveats could be addressed, errors that arise from the coupling of quantum systems to their environments cannot be disregarded. In the long term, assuming large amounts of quantum resources and modest assumptions about error rates, there are rigorous proofs concerning "fault-tolerant" quantum computing [45]; i.e., the ability to still perform meaningful quantum computation in the presence of noise by using additional physical resources. However, it is becoming increasingly apparent that the engineering challenges associated with scaling up quantum systems are not trivial. For example, maintaining sufficient precision of control over a growing number of qubits while minimizing cross-talk is no mean feat.

In the current state of quantum technology, sometimes referred to as the noisy intermediate-scale quantum (NISQ) era [44], quantum computers are approaching a size that is infeasible for classical computers to simulate, but are not yet large enough to make use of fault-tolerance results and still solve problems of interest. In this regime and for practical problems, relatively little is known about the performance of quantum algorithms in comparison to classical algorithms. This dissertation explores the readiness and feasibility of applying current quantum hardware, which is demonstrably quantum but also noisy, to practical problems of interest, focusing on applications to machine learning. The particular hardware used in this work is the family of D-Wave processors [46, 1, 47], which are special-purpose quantum annealers. The remainder of this chapter is organized as follows: first, the model of quantum computation used in this work (quantum annealing) is presented, along with some technical details related to the D-Wave processors; then, a brief introduction to machine learning follows; finally, a summary of the results is given.

1.1 Quantum annealing with D-Wave processors

Although different implementations and models of quantum computing are still in development, promising theoretical and experimental research indicates that quantum annealing (QA) [48], or adiabatic quantum optimization [49], may be capable of providing advantages in solving classically hard problems that are of practical interest. QA is the only paradigm of quantum computation that currently offers physical implementations of a non-trivial size, namely the D-Wave processors. In this section we provide an overview of quantum annealing before mentioning some specifics related to the D-Wave processors and some general techniques needed to apply D-Wave to the problems studied in this work.

1.1.1 Quantum annealing

QA can be considered a special case of adiabatic quantum computation (AQC) (for a review see Ref. [50]).
The adiabatic theorem of quantum mechanics implies that a physical system will remain in the ground state if a given perturbation acts slowly enough and if there is a gap between the ground state and the rest of the system's energy spectrum [51] (Fig. 1.1). To use the adiabatic theorem for computation, we can specify an initial Hamiltonian, H_B, whose ground state is easy to find (typically a transverse field), and a problem Hamiltonian, H_P, that does not commute with H_B and whose ground state encodes the solution to the problem we are seeking to optimize [52]. We then interpolate from H_B to H_P by defining the combined Hamiltonian H(s) = A(s)H_B + B(s)H_P, with 0 ≤ s = t/t_f ≤ 1, where A(s) and B(s) are, respectively, smoothly and monotonically decreasing and increasing, t is time, and t_f is the total evolution, or annealing, time. In the closed-system setting with no coupling to an external environment, the adiabatic theorem ensures that the ground state of the system at t = t_f will give the desired solution to the problem, provided the interpolation is sufficiently slow, i.e., t_f is large compared to the timescale set by the inverse of the smallest ground state gap of H(s) and by dH(s)/ds [53], so that all non-adiabatic transitions are avoided (Fig. 1.1). In this setting, errors arise only from non-adiabatic transitions and any control errors (e.g., in the annealing schedule or in the initial and final Hamiltonians). In QA, instead of running the computation once at a sufficiently long anneal time such that the adiabatic theorem is obeyed, one may choose to run the computation multiple times at a shorter anneal time, such that the time it takes to find the ground state with high probability is minimized [54]. When QA is implemented in a physical device, temperature and other noise effects play an important role; thermal excitation and relaxation cannot be neglected and affect performance [55, 56, 57].

Figure 1.1: Illustration of the principles and purpose of this work. a: quantum versus classical adiabatic annealing processes. With quantum annealing there is the possibility for the system state (red) to tunnel through a changing barrier (black) and arrive at the ground state; for classical annealing, the system must rely on thermal fluctuations (temperature T > 0) to overcome any energy barriers. b: typical spectrum of instantaneous energy eigenvalues during adiabatic quantum optimization. The ground state energy at s = 0 has a significant gap to the next energy level. The speed at which the optimization can take place depends on the size of the minimum gap.
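To make the role of the minimum gap concrete, the following minimal sketch (not from the dissertation) diagonalizes H(s) for a toy two-qubit Ising problem; the linear schedule A(s) = 1 − s, B(s) = s is an illustrative assumption, not the actual D-Wave annealing schedule.

```python
# Instantaneous spectrum of H(s) = A(s)*H_B + B(s)*H_P for a toy 2-qubit
# problem, with an assumed linear schedule A(s) = 1 - s, B(s) = s.
import numpy as np

sx = np.array([[0., 1.], [1., 0.]])   # Pauli x
sz = np.array([[1., 0.], [0., -1.]])  # Pauli z
I2 = np.eye(2)

# Toy Ising instance: two local fields and one ferromagnetic coupling.
h = [0.2, -0.5]
J01 = -1.0

H_B = -(np.kron(sx, I2) + np.kron(I2, sx))              # transverse field
H_P = (h[0] * np.kron(sz, I2) + h[1] * np.kron(I2, sz)
       + J01 * np.kron(sz, sz))                         # form of Eq. (1.1)

gaps = []
for s in np.linspace(0.0, 1.0, 201):
    evals = np.linalg.eigvalsh((1 - s) * H_B + s * H_P)  # ascending order
    gaps.append(evals[1] - evals[0])                     # gap to 1st excited

print(f"minimum spectral gap during the anneal: {min(gaps):.4f}")
```

The smaller this minimum gap, the larger t_f must be for the adiabatic theorem to apply.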
1.1.2 D-Wave processors

Quantum annealing was implemented on the processors manufactured by D-Wave. The basic computational unit in quantum computing is a qubit – a quantum bit, or a two-level quantum-mechanical system. D-Wave uses superconducting flux qubits, treating the lowest two energy states (clockwise or anticlockwise circulating current loops) as a two-level system [1]. The problem Hamiltonians that are used for D-Wave (DW) can be described as Ising spin models with tunable parameters [58], where the spins may take one of two values, up or down (that is, they are binary variables). The Ising model assumes a graph G = (V, E) composed of a set of vertices, V, and edges, E. Each of the N spins is a binary variable located at a unique vertex. The problem, or Ising, Hamiltonian for this system can be written as

H_P = \sum_{i \in V} h_i \sigma_i^z + \sum_{(i,j) \in E} J_{ij} \sigma_i^z \sigma_j^z,    (1.1)

where the local fields {h_i} and couplings {J_ij} define a problem instance and are programmable to within a few percent Gaussian-distributed error. The σ_i^z represent both the binary variables taking on values ±1 and the Pauli z-matrices acting on qubit i. At the end of the anneal (at s = 1), a readout of the final states is performed in the z basis, and the result of a single anneal is a spin configuration {σ_i^z} ∈ {−1, +1}^N. In quantum annealing the anneal is repeated multiple times, generating a distribution of such spin configurations. Given a spin configuration, H_P is the total energy of the system. Note that in general it is difficult to find the minimum of H_P but easy to calculate H_P given a set of spins [58]. The initial Hamiltonian, H_B, is a transverse magnetic field: H_B = −\sum_i \sigma_i^x, where σ_i^x is the Pauli x-matrix acting on qubit i.

In this work, we used one of three D-Wave processors: (1) the DW2X processor housed at the Information Sciences Institute (ISI) of the University of Southern California; (2) the DW2000Q processor located at Moffett Field and managed by NASA-Ames; and (3) the DW2000Q processor located in Burnaby, BC, Canada and owned by D-Wave. The DW2X has N = 1098 active qubits, NASA's DW2000Q has N = 2031 active qubits, and D-Wave's DW2000Q has 2038 active qubits. None of the processors have couplings between arbitrary pairs of qubits; i.e., the graph G is not a complete graph. Instead, all have a "Chimera" graph, an example of which is shown in Fig. 1.2.

Figure 1.2: Chimera graph of NASA's D-Wave 2000Q processor. Green qubits indicate active qubits, gray qubits are inactive qubits, and active physical couplers are indicated by black lines.

Problems submitted to DW are automatically scaled so that all h_i and J_ij values lie between −1 and 1, and DW returns a set of spin values {σ_i^z = ±1} that attempts to minimize the energy given by Eq. (1.1) (a lower energy indicates better optimization). Depending on the context, this set of spin values returned by D-Wave will sometimes be referred to as a "spin configuration", as "solutions" (i.e., candidate solutions minimizing the Ising Hamiltonian), or as "weights" (in the context of machine learning). The Ising Hamiltonian can be easily transformed into a quadratic unconstrained binary optimization (QUBO) problem by applying the transformation w_i = (σ_i^z + 1)/2 ∈ {0, 1}; i.e., the binary variables are mapped from {−1, 1} in the Ising Hamiltonian to {0, 1} in a QUBO. The Ising and QUBO forms are equivalent, but optimization problems and machine learning models with binary variables are often formulated with weights in {0, 1}.
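The Ising-QUBO correspondence is mechanical enough to verify in a few lines. The sketch below (illustrative code, not the D-Wave Ocean API) applies w_i = (σ_i^z + 1)/2 and checks that the two energies agree up to a constant offset.

```python
# A minimal sketch of the Ising <-> QUBO correspondence via w = (s + 1)/2.
import numpy as np

def ising_energy(s, h, J):
    """Eq. (1.1): E = sum_i h_i s_i + sum_{(i,j)} J_ij s_i s_j, s_i in {-1,+1}."""
    return (sum(h[i] * s[i] for i in range(len(h)))
            + sum(Jij * s[i] * s[j] for (i, j), Jij in J.items()))

def ising_to_qubo(h, J):
    """Return (Q, offset) with E_ising(s) = w^T Q w + offset, w = (s + 1)/2."""
    n = len(h)
    Q = np.zeros((n, n))
    offset = -sum(h) + sum(J.values())
    for i in range(n):
        Q[i, i] = 2 * h[i]
    for (i, j), Jij in J.items():
        Q[i, j] += 4 * Jij          # quadratic term
        Q[i, i] -= 2 * Jij          # linear corrections from expanding
        Q[j, j] -= 2 * Jij          # (2w_i - 1)(2w_j - 1)
    return Q, offset

h = [0.5, -1.0, 0.25]
J = {(0, 1): -1.0, (1, 2): 0.5}
Q, offset = ising_to_qubo(h, J)

s = np.array([1, 1, -1])
w = (s + 1) // 2
assert np.isclose(ising_energy(s, h, J), w @ Q @ w + offset)
```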
1.1.3 Embedding

As a final practical note, the Chimera graph depicted in Fig. 1.2 has a maximum degree of 6, where the degree of a graph refers to the maximum number of edges incident on any single vertex. However, real-world problems will often have an H_P that requires couplings between arbitrary pairs of qubits; i.e., they necessitate a complete graph. In order to use DW for these problems, the original "logical problem" – the problem we would like to solve – needs to be embedded on the physical hardware by finding a corresponding "physical problem." This introduces the need to find a "minor embedding" into the hardware graph [59, 60]. A graph minor G_emb of a graph G is constructed by deleting edges and vertices and by contracting edges of G. By grouping several of the physical qubits, we can define a graph on a smaller number of "logical" qubits with a higher connectivity [61]. A minor embedding maps a logical problem qubit to a set of physical qubits such that for every coupling between a pair of logical qubits in the logical problem there exists at least one physical coupling between the corresponding sets of physical qubits. Depending on the size of the hardware graph and the desired target graph, multiple embeddings may exist. In order to ensure that the physical qubits are aligned and act as a single logical qubit (or "chain"), a strong coupling term, J_c, is introduced between the physical qubits that comprise a logical qubit. Then, for a fixed embedding, the way the values of the couplings and local fields of the logical problem are distributed among the physical qubits is known as "parameter setting." A built-in function provided by D-Wave [60] was used for minor embedding and parameter setting. Through the embedding procedure and parameter setting, logical problems are transformed into physical problems. Note that for one logical problem there may be many physical problems, depending both on the embedding and on the parameter setting. In summary, the physical Ising problem that is actually run on D-Wave consists of h_i^phys and J_ij^phys (the local fields and couplings derived from the logical problem) and J_c (the coupling strength used to align the physical qubits that make up a logical qubit).

Ideally, once the strength of the coupling within each logical qubit is set, all solutions returned by DW would correspond to valid logical solutions; i.e., all the physical qubits within a logical qubit would have the same spin (there would be no "broken chains"). However, due to the probabilistic nature of quantum annealing, as well as noise from different sources, there is often some percentage of solutions that have broken chains. Unless otherwise indicated, we used majority-vote decoding to go from physical spins to logical spins: the final value of the logical spin is assigned based on the value of the majority of the physical spins (in the event of an equal number of +1 and −1, the value of the spin is selected at random), as in the sketch below. The likelihood of having broken chains can be roughly adjusted by controlling the parameter J_c, the value of the strong coupling between physical qubits within a logical qubit; the larger the magnitude of J_c, the more likely it is for the physical qubits within a logical qubit to have the same spin. However, because of limitations in precision (recall that D-Wave scales the problem parameters to lie between −1 and 1) and the presence of noise on actual hardware, increasing the strength of J_c may cause some of the h_i^phys and J_ij^phys to be washed out by the noise. In this work, we usually try several values of J_c and select the one that gives the best performance.
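A minimal sketch of the majority-vote decoding follows; `embedding` (a hypothetical dict mapping each logical qubit to its chain of physical qubits) and the function name are illustrative rather than taken from D-Wave's tooling.

```python
# Majority-vote decoding of chains: map physical spins {-1,+1} back to
# logical spins, breaking ties at random, as described in the text.
import random

def decode_chains(physical_spins, embedding):
    logical = {}
    for logical_qubit, chain in embedding.items():
        total = sum(physical_spins[q] for q in chain)
        if total == 0:                                   # broken chain, tie
            logical[logical_qubit] = random.choice([-1, 1])
        else:
            logical[logical_qubit] = 1 if total > 0 else -1
    return logical

embedding = {0: [0, 4, 8], 1: [1, 5]}                    # hypothetical chains
physical_spins = {0: 1, 4: 1, 8: -1, 1: -1, 5: -1}       # chain 0 is broken
print(decode_chains(physical_spins, embedding))          # {0: 1, 1: -1}
```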
1.2 Machine learning

Our target application for assessing the usefulness of quantum annealing is machine learning. Broadly speaking, machine learning works by allowing computers to extract and generalize patterns in data without explicitly defining "rules" for the patterns. Generally, we assume we are given some training data D_TRAIN from which we wish to learn patterns, and separate test data D_TEST on which to evaluate performance. In this work we experiment with two classes of machine learning: supervised and unsupervised learning. In supervised learning, data is provided as input-output pairs, with the goal being to accurately predict the output from the input; i.e., D_TRAIN = {(x_n, y_n)}_{n=1}^N, where x_n is the nth training input, y_n is the nth training output, and N is the total number of training examples. For example, the task may be to predict whether an email is spam. In this case, x would be the text of the email, and y would be whether the email is spam or not (i.e., a binary classification). In unsupervised learning, no output labels are given and the goal is to identify hidden patterns, often by modeling a probability distribution over the inputs; i.e., D_TRAIN = {x_n}_{n=1}^N. Unsupervised learning may be of use in cases where the patterns are unknown, and can be used to learn useful representations of high-dimensional data [62, 63].

In supervised learning, we assume there is some function f that maps input to predicted output ŷ; i.e., ŷ = f(x). The exact form of f depends on the choice of model, but D-Wave and many classical models may be considered generalized linear models in the input: f_w = f(w^T x) (note that f itself need not be a linear function of its argument). The form of f and the mapping of the learning problem to an Ising problem are given in the respective chapters. The goal of the learning is then to find the set of weights w such that the predicted output ŷ best matches the actual output y. To do so, we define an objective function to be optimized. In general, the objective function consists of two parts: a training loss function and a regularization term that helps avoid overfitting. We may write the objective function as

Obj(w) = R(w) + λΩ(w),    (1.2)

where R is the training loss, Ω is the regularization term, λ is a hyper-parameter setting the strength of the regularization, and w is the set of feature weights to be determined. As an example, the training loss function may be the mean-squared error, R(w) = \sum_n (y_n − ŷ_n)^2, and the regularization term may be the L1 or L2 norm of the weights w (i.e., Ω(w) = ||w||). The objective should be minimized, and the regularization term generally controls model complexity by penalizing complicated models. Generally, some form of cross-validation is needed to select the optimal values of the hyper-parameters. Cross-validation usually consists of repeatedly splitting the training data into training and validation sub-datasets to tune and select model hyper-parameters, before finally evaluating performance on a test dataset; a minimal example is sketched below. The specific approach to cross-validation is described in the respective chapters.
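As a concrete illustration of Monte Carlo (shuffle-and-split) cross-validation – the protocol used in Chap. 2, where λ is varied over 2^{-3} to 2^{6} – the sketch below tunes λ on repeated random splits. It uses synthetic data, and scikit-learn's Lasso stands in for the annealing-based learner.

```python
# Monte Carlo cross-validation for the regularization strength lambda,
# using synthetic data and Lasso as a stand-in learner (illustration only).
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import ShuffleSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(1500, 40))                    # stand-in feature matrix
y = X @ rng.normal(size=40) + rng.normal(size=1500)

# 100 random splits, each training on ~2% of the data (about 30 samples).
splitter = ShuffleSplit(n_splits=100, train_size=0.02, test_size=0.2,
                        random_state=0)
mean_mse = {}
for lam in [2.0 ** k for k in range(-3, 7)]:
    errs = []
    for tr, va in splitter.split(X):
        model = Lasso(alpha=lam, max_iter=5000).fit(X[tr], y[tr])
        errs.append(np.mean((model.predict(X[va]) - y[va]) ** 2))
    mean_mse[lam] = np.mean(errs)

best_lam = min(mean_mse, key=mean_mse.get)
print(f"selected lambda = {best_lam}")
```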
Many unsupervised learning algorithms, including the ones studied in this work, assume that the input data can be modeled by some probability distribution that depends on a set of parameters {θ}, and the goal of the learning procedure is to learn the parameters that best model the input data. In other words, one can define a probability distribution p(x) to be trained and perform maximum likelihood estimation to find the optimal values of the θ's, where the likelihood is

L = \sum_{x_i \in D_TRAIN} p(x_i; {θ}).    (1.3)

If there is an analytic expression for the probability distribution in terms of the parameters, gradient descent can be used to update the parameters such that the log-likelihood is maximized.

1.3 Summary of results

In Chap. 2 we present our first foray into machine learning with quantum annealing: modeling the binding of transcription factors (TFs) to DNA. We demonstrate the feasibility of using actual quantum annealers to classify and rank sequences of DNA based on the strength of their binding to several transcription factors. In particular, D-Wave and analogous classical approaches all perform very well when the amount of training data is relatively small. Moreover, we find that the models trained using D-Wave recover existing biological knowledge. In Chap. 3 we develop an alternative formulation of classification as an Ising Hamiltonian and apply our new method to cancer classification, once again demonstrating the success of QA in comparison to standard machine learning approaches with small amounts of training data. We also make a more careful comparison to classical approaches that see the same Ising problem; these approaches do as well as, if not better than, D-Wave, and we explore the reasons why. Whereas Chaps. 2 and 3 consider only supervised approaches, in Chap. 4 we use D-Wave to train both a supervised model and a simple unsupervised model. We show that we are able to improve the learning performance of D-Wave by using more quantum resources, yet a study of the underlying physics through simulations indicates that D-Wave may essentially be acting classically. Finally, in Chap. 5 we present some concluding remarks and future outlook.

Chapter 2

Transcription Factor - DNA Binding

2.1 Introduction

Much attention has been paid to whether the D-Wave devices are capable of delivering quantum speedups [54, 64, 65, 66, 67]. Here we sidestep this question and instead use the DW2X processor at ISI as a physical device that implements quantum annealing for the purpose of solving a problem in machine learning, focusing on performance measures other than speedup. The topic of interest in this chapter is a simplified formulation of a biologically relevant problem: transcription factor (TF)-DNA binding (see Fig. 2.1). TFs are a key component in the regulation of gene expression, yet the mechanisms by which they recognize their functional binding sites in a cell and thereby activate or repress transcription of target genes are incompletely understood. Nucleotide sequence, flexibility of both TFs and binding sites, the presence of cofactors, cooperativity, and chromatin accessibility are all hallmarks that affect the binding specificity of TFs in vivo [68, 69]. As a first step toward gaining insight into TF binding, it is valuable to understand the intrinsic binding specificity of TFs for DNA, which can be gained from in vitro data. Widely used methods to gain such an understanding and to represent the DNA sequence preferences of TFs are based on position weight matrices (PWMs) or PWM-like models [70]. In the simplest of these models, the binding preference of a TF for each of the four nucleotides of the DNA alphabet {A, C, G, T} at each position of a sequence of length L is represented as a 4×L matrix. Such models implicitly treat each position in the DNA sequence as being independent, so that each element of the matrix can be thought of as the contribution of a nucleotide at the corresponding position to the overall binding affinity.

Figure 2.1: Illustration of the principles and purpose of this work. (b) Using simplified datasets of a small number of sequences derived from actual binding affinity experiments, we use D-Wave, a commercially available quantum annealer, to classify and rank binding affinity preferences of a TF to DNA.
Since the independence of the nucleotide positions is in many cases a valid approximation, and also because of current restrictions on the size of the DW processors, in this chapter we have used a model consisting of single-nucleotide sequence features to show a proof of principle of the use of machine learning via quantum annealing in biology. Despite the technological limitations of emerging quantum technology, we concurrently demonstrate cases in which this form of machine learning using quantum annealing outperforms classical machine learning when training with small datasets. This is among the very first successful applications of quantum hardware to a realistic, though simplified, problem in computational biology. The results in this chapter have been published in Ref. [71].

2.2 Results

Experimental datasets on TF-DNA binding for a specific TF consist of N sequences of fixed length L and N values that express a measure of the binding affinity of the chosen TF to each sequence: {(x_n, y_n)}_{n=1}^N. In other words, the nth sequence is represented by the vector x_n = (x_{n,1}, x_{n,2}, ..., x_{n,L}) with x_{n,j} ∈ {A, C, G, T} for j = 1, ..., L, and y_n is the corresponding measure of binding affinity. For instance, x_n may be ACAACTAA, with y_n = 4.95. In this work we used binding data from three genomic-context protein binding microarray (gcPBM) experiments [72], which use fluorescence intensity as a measure of binding affinity, and two high-throughput systematic evolution of ligands by exponential enrichment (HT-SELEX) experiments [73, 74, 75], which report relative binding affinity. After preprocessing, the three gcPBM datasets consisted of N ≈ 1600 sequences of L = 10 base pairs. The two HT-SELEX datasets consisted of N ≈ 3200 and 1800 sequences of length L = 12 after preprocessing (see Methods for a brief description of the preprocessing procedure). We used the following one-hot encoding to represent each sequence as a vector of binary variables: A = 1000, C = 0100, G = 0010, T = 0001, thus transforming x_n into a feature vector φ_n = (φ_{n,1}, ..., φ_{n,4L})^T. This encoding scheme [76] was used so that all combinations of inclusion and exclusion of the four nucleotides may be identified.
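A minimal sketch of this encoding (the helper name is illustrative, and the example sequence is the one quoted above):

```python
# One-hot encoding of a DNA sequence into a 4L-dimensional binary vector,
# using the mapping A=1000, C=0100, G=0010, T=0001 described in the text.
import numpy as np

ONE_HOT = {"A": (1, 0, 0, 0), "C": (0, 1, 0, 0),
           "G": (0, 0, 1, 0), "T": (0, 0, 0, 1)}

def encode(seq):
    return np.array([bit for base in seq for bit in ONE_HOT[base]])

phi = encode("ACAACTAA")   # example x_n from the text (L = 8 here)
print(phi.shape)           # (32,), i.e., 4L
```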
Similar to previous studies [77, 76, 78, 79], the goal of the present work is to identify patterns within the data in order to qualitatively assess whether the strength of a TF binding to a particular unseen sequence is above a certain threshold (classification), or to rank sequences in terms of binding affinity (ranking). To identify conditions in which machine learning with existing quantum annealing devices may be of use for studying a simplified biological problem, we report results obtained by solving a learning protocol with six different strategies: (i) an adiabatic quantum machine learning approach formulated in Refs. [29, 30] (DW); (ii) simulated annealing [80] (SA), using the implementation given in Ref. [81]; (iii) simulated quantum annealing [82] (SQA), a classical algorithm that can represent the potential of a noiseless thermal quantum annealer; (iv) L2-regularized multiple linear regression (MLR); (v) Lasso [83]; and (vi) a scalable machine learning tool known as XGBoost (XGB) [84]. DW, SA and SQA are probabilistic approaches. SQA is a (classical) path-integral Monte Carlo method that has performed very similarly to quantum annealing and captures some of its main advantages [85]. MLR is a deterministic method with a closed-form solution that returns the weights that best minimize the objective function (defined below). Lasso is a linear regression method that uses an L1 norm (see the description of the objective function below for more details). XGB uses boosted trees and has been applied to a variety of machine learning tasks in physics, natural language processing, and ad-click prediction (e.g., Ref. [86]).

Given a transformed feature vector φ_n that represents a DNA sequence, the goal of each method is to compute a predicted binding score f(φ_n) that best matches the actual binding score. To carry out the task, we optimize the objective function given in Eq. (1.2), which we reproduce here for ease of reference: Obj(w) = R(w) + λΩ(w). R is the training loss, Ω is the regularization term, and w is the set of feature weights to be determined by the six learning algorithms: DW, SA, SQA, MLR, Lasso and XGB. The mean squared error was used as the loss function for all six methods; namely, R(w) = \sum_n (y_n − f_w(φ_n))^2, where y_n is the actual binding score of the nth sequence and f_w(φ_n) is the predicted binding score. The regularization term was λΩ(w) = λ||w||_1 for DW, SA, SQA and Lasso; λΩ(w) = λ||w||_2^2 for MLR; and Ω(w) = γS + (λ/2)\sum_{j=1}^S w_j^2 for XGB, where the ||·||_1 norm counts the number of 1's (the Hamming weight), the ||·||_2^2 norm is the square of the Euclidean norm, and S is the number of leaves [84]. The calibration of the hyper-parameters λ, γ and S is discussed below. The loss function should be minimized, and the regularization term generally controls model complexity by penalizing complicated models; the strength of the regularization was determined using a 100-fold Monte Carlo cross-validation. Except for XGB, the methods assume a linear model for the predicted binding affinity, i.e., f_w(φ_n) = w^T φ_n = \sum_j w_j φ_{n,j}. DW, SA and SQA return binary weights and are probabilistic methods; that is, they return a distribution of weights with different energies [values of the Hamiltonian in Eq. (1.1)]. In order to utilize the distribution of weights returned, while not sacrificing the discrete nature of the QUBO approach, up to twenty of the best weights were averaged (see Sec. 2.4.4 for a description of how excited-state solutions were included).
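For intuition, the sketch below shows one natural way the MSE-plus-L1 objective over binary weights reduces to a QUBO matrix, obtained by expanding the square and using w_i² = w_i for w_i ∈ {0, 1}. The dissertation's exact mapping to the h_i and J_ij is given in Sec. 2.4.1, so this is an illustration under that derivation's standard assumptions rather than the method as implemented.

```python
# Reduce sum_n (y_n - Phi_n . w)^2 + lam * ||w||_1 to w^T Q w + const for
# binary w, by expanding the square and using w_i^2 = w_i.
import numpy as np

def objective_to_qubo(Phi, y, lam):
    G = Phi.T @ Phi                            # Gram matrix of the features
    c = Phi.T @ y
    Q = np.triu(2.0 * G, k=1)                  # off-diagonal: 2 * G_ij, i < j
    np.fill_diagonal(Q, np.diag(G) - 2.0 * c + lam)   # linear terms on diag
    return Q                                   # constant sum_n y_n^2 dropped

rng = np.random.default_rng(1)
Phi = rng.integers(0, 2, size=(30, 40)).astype(float)  # 30 one-hot-style rows
y = rng.normal(size=30)
lam = 2.0
Q = objective_to_qubo(Phi, y, lam)

w = rng.integers(0, 2, size=40)                        # a random binary w
lhs = np.sum((y - Phi @ w) ** 2) + lam * w.sum()
rhs = w @ Q @ w + np.sum(y ** 2)
assert np.isclose(lhs, rhs)
```

The resulting Q can then be converted to the Ising h_i and J_ij of Eq. (1.1) by the change of variables given in Sec. 1.1.2.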
Our computational procedure consisted of three main phases: (1) calibration of hyper-parameters, (2) training, and (3) testing (Fig. 2.2). About 10% of the data were held out for testing during the testing phase ("test data" or D_TEST); this test data was not seen during the calibration and training stages. Calibration and training were carried out using the remaining 90% of the data ("training data" or D_TRAIN). Due to the discrete nature of the weights returned in the QUBO approach, as well as technological limitations of the DW2X device, calibration of the hyper-parameter λ was carried out by repeatedly sampling a small number of sequences, about 2% and 10% of D_TRAIN, corresponding to about 30 and 150 sequences, respectively. In particular, in the calibration phase we determined the hyper-parameters by using 100-fold Monte Carlo (or shuffle-and-split) cross-validation with training splits of 2% and 10% of the training data, varying λ from 2^{-3} to 2^{6}. Monte Carlo cross-validation was used so that hyper-parameters would be tuned on a similar number of sequences as used in the training phase (in contrast, n-fold cross-validation trains on (n−1)/n × 100% of the data). The same calibration procedure was applied to tune λ for SA, SQA, MLR and Lasso. In order to demonstrate good performance for XGB, γ, λ, S, and several additional parameters needed to be tuned (see Methods).

In the training phase we used a bagging (bootstrap aggregating) procedure [87], randomly sampling with replacement 2% and 10% of the training data, namely about 30 and 150 sequences. Each subset of about 30 or 150 sequences formed a training "instance", and the mapping of a subset of data to the h_i and J_ij seen by DW and SA is given in the Methods. Each learning approach (DW, SA, SQA, MLR, Lasso and XGB) was trained on the same set of instances. To collect statistics, 50 instances were randomly selected with replacement for each training size.

Figure 2.2: Schematic overview of the data handling procedure. About 10% of the data were held out for testing (D_TEST) and were unseen during the calibration and training phases. The remaining 90% of the data were used for calibration and training (D_TRAIN). For DW, SA, and MLR, in the calibration step, the hyper-parameter λ was tuned using a 100-fold Monte Carlo cross-validation. For XGB, several other hyper-parameters were tuned (see Methods for a list). During the training step, a procedure similar to bagging (bootstrap aggregating) was used by randomly sampling, with replacement, 2% and 10% of the data 50 times to give 50 training instances. In the testing step, the area under the precision-recall curve (AUPRC) of the best-performing weights for each of the 50 training instances was evaluated on D_TEST, which was unseen during training and calibration, to evaluate generalization of classification performance. The calibration, training, and testing procedure was identical for ranking tasks, with the exception that Kendall's τ was used as the metric of performance instead of the AUPRC.

In the testing phase, the predictive power was assessed in terms of classification performance (the mean area under the precision-recall curve, or AUPRC) and ranking performance (the median Kendall's τ) on the test data unseen during the calibration and training phases. The AUPRC is a measure of classification performance that may help discern between similar algorithms when there is a high degree of class imbalance, i.e., when the data set contains many more false labels than true labels [88]. Kendall's τ is a rank correlation coefficient that counts all mismatches equally [89]; see Appendix A for more details on performance measures.
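Both test metrics are available off the shelf. The sketch below (synthetic data) binarizes affinities at the 80th percentile and computes average precision – a standard estimator of the AUPRC – together with Kendall's τ.

```python
# The two test metrics: AUPRC (via average precision) on affinities binarized
# at a percentile threshold, and Kendall's tau between predicted and measured
# affinities; data here are synthetic.
import numpy as np
from scipy.stats import kendalltau
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(2)
y_true = rng.normal(size=200)                          # measured affinities
y_pred = y_true + rng.normal(scale=0.5, size=200)      # model predictions

# Threshold at the 80th percentile: top 20% of affinities are "true".
labels = y_true > np.percentile(y_true, 80)
auprc = average_precision_score(labels, y_pred)
tau, _ = kendalltau(y_pred, y_true)
print(f"AUPRC = {auprc:.3f}, Kendall tau = {tau:.3f}")
```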
Figure 2.3: Quantitative performance on the held-out experimental test dataset for two different types of tasks on three high-quality gcPBM datasets. (a) The mean AUPRC for Mad, Max, and Myc plotted versus threshold at certain threshold percentiles of the data, when training with 2% of the data (left) and 10% of the data (right). In both cases 50 instances were randomly selected for training, and the performance of the 50 trained weights is evaluated on the same held-out test set. Error bars are the standard deviation. (b) Boxplot of Kendall's $\tau$ on the held-out test dataset. Red '+' indicate outliers; the gray line represents the median. The bottom and top edges of the box represent the 25th and 75th percentiles, respectively.

In Fig. 2.3 we present the AUPRC and Kendall's $\tau$ obtained with the different algorithms when training with about 30 (2%) and 150 (10%) sequences. To compute the AUPRC, a threshold of the data was introduced: for a threshold at the $p$th percentile of the data, $p\%$ of the total number of sequences have binding affinities below the threshold and were set as negatives ("false"), and the $(100-p)\%$ of the sequences that have binding affinities above the threshold were set as positives ("true"); see Sec. 2.4.5 for a more detailed explanation of the procedure used to threshold the data. During the calibration phase, we tuned hyper-parameters with a single threshold at the 80th percentile of the data, and during the testing phase we evaluated performance between the 70th and the 99th percentiles of the data. Kendall's $\tau$ was evaluated between the predicted and measured binding affinities. A higher AUPRC indicates a better ability to correctly classify sequences that would be strongly bound by a TF, and a higher $\tau$ indicates a better ability to accurately rank the binding affinities of different sequences.

For the AUPRC, when training on instances with 2% of the data (left column in Fig. 2.3a), DW, SA, and SQA perform very similarly, with DW slightly outperforming SA on the Myc data, and are somewhat better than MLR at the 70th and 80th percentiles. MLR tends to do better at the higher thresholds; this behavior could be affected by the fact that, during the calibration phase, we selected the $\lambda$ that gave the best performance at the 80th percentile. Lasso, which uses the same $L_1$ norm as DW, SA, and SQA, performs better than XGB but worse than the other methods. XGB, which has been successfully applied to a growing number of learning tasks, does poorly with small training sizes. When training with 10% of the data (right column in Fig. 2.3a), the trends in relative classification performance are quite different. XGB and MLR perform very similarly, though XGB does slightly better on the Max dataset. DW tends to perform better than SA and SQA, especially at higher thresholds. DW's mean AUPRC is normally worse than MLR's and XGB's, though there is overlap between the error bars. SA and SQA generally perform worse than the other methods, but not conspicuously so. Lasso's performance is in general comparable to DW, SA, and SQA, and it generally seems to perform the worst with 10% of the training data.

For Kendall's $\tau$ (Fig. 2.3b), Lasso and XGB perform the worst when training with 2% of the data. SQA generally performs the best over the three TFs, though MLR's median is marginally greater than SQA's for Mad. SA's performance is very close to SQA's, and DW's performance is slightly worse than the other two annealing schemes, though generally better than the typical machine learning algorithms. With 10% of the data, DW performs the worst; SA and SQA perform very similarly, with SQA being slightly better on two of the three datasets; MLR and Lasso perform very similarly, though MLR looks slightly better; and XGB performs the best.
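Both performance metrics can be computed with standard libraries; a sketch, assuming arrays `y_true` and `y_pred` of measured and predicted binding scores on the held-out test set (scikit-learn's average precision is used here as a common estimator of the AUPRC):

```python
import numpy as np
from sklearn.metrics import average_precision_score
from scipy.stats import kendalltau

def auprc_at_percentile(y_true, y_pred, p):
    """Binarize the measured scores at the p-th percentile (Sec. 2.4.5):
    the top (100-p)% are positives; then estimate the AUPRC."""
    threshold = np.percentile(y_true, p)
    labels = (y_true >= threshold).astype(int)
    return average_precision_score(labels, y_pred)

# Classification performance at the thresholds used in the text:
# for p in (70, 80, 90, 99):
#     print(p, auprc_at_percentile(y_true, y_pred, p))

# Ranking performance:
# tau, _ = kendalltau(y_true, y_pred)
```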
The fact that for Mad and Max with 2% of the training data there is very little variation in Kendall's $\tau$ for SA and SQA (and, to a lesser extent, DW) is a consequence of the choice of hyper-parameters. The specific values of the hyper-parameters that gave the optimal value of Kendall's $\tau$ during the calibration phase are shown in Table 2.2, but we note here that the value of $\lambda$ is quite high. $\lambda$ controls the model complexity and is closely related to the bias-variance tradeoff, which states that it is impossible to simultaneously minimize errors from both bias and variance. A large value of $\lambda$ introduces a large bias [91]; consequently, for the cases where there is little or no variance, SA and SQA are essentially extracting the same pattern from all the training data. For the ranking tasks shown here, with training on about 30 sequences, this gives the best performance for SA and SQA. It may be unsurprising, however, that a large value of $\lambda$ is appropriate for small datasets; over-fitting may be a greater concern with smaller amounts of data.

The results presented in Fig. 2.3 suggest a precise case where current quantum technology may offer slight performance advantages relative to classical computational approaches: when there is only a small amount of experimental training data available (about 30 sequences in our specific cases). In both classification and ranking tasks, DW performs comparably to SA and SQA and better than Lasso and XGB. MLR performs comparably with the annealing methods, but its error bars are much larger, indicating that its performance is less stable and more dependent on the training data. Moreover, the similarity between DW and SQA suggests that for small training sizes DW is functioning very nearly like a noiseless quantum annealer, as captured by quantum Monte Carlo simulations. With a larger training size, DW's performance decreases relative to the classical approaches for all three TFs, though the results are still competitive. The decrease in the performance of all annealing methods (DW, SA, and SQA) seems to indicate a limitation of using methods with discrete weights, which enforce simpler models. Such models may be more advantageous with a small number of training samples because they prevent overfitting. However, with larger amounts of training data, a simpler model may not adequately learn the variation within the data and hence suffers worse performance. Nevertheless, the fact that Lasso uses the same $L_1$ norm as the annealing methods (i.e., DW, SA, and SQA), yet does not perform as well, indicates an advantage of such annealing methods when training with a small number of sequences. This is consistent with the finding reported in Ref. [43].

2.2.2 Weight Logos from Feature Weights

Since the one-hot encoding was used with a linear model to represent DNA sequence, the weights returned by DW, SA, and MLR reflect the relative importance of the corresponding nucleotide at a position in the sequence for the binding score. The magnitudes of these feature weights for DW, SA, and MLR can be visualized as a "weight logo" and are presented in Fig. 2.4 for the Mad, Max, and Myc gcPBM datasets. XGB, which finds an ensemble of trees, does not assign weights to individual nucleotides and hence does not easily lend itself to visualization. The weight logos show the contribution of nucleotides at particular positions to the strength of binding. The contribution of a nucleotide at a particular position in the sequence is represented by its height; nucleotides with the smallest weights are at the bottom and those with the largest weights are at the top. The weight logos in Fig. 2.4 were obtained by averaging the weights from the 50 training instances of the same number of sequences with the AUPRC as the objective. In other words, each logo represents the average of the weights that give the AUPRCs shown in Fig. 2.3a. DW, SA, and MLR all perform very similarly and give weight logos that are in good agreement with the expected consensus sequence, CANNTG. This demonstrates that all methods are able to capture biologically relevant information.

Figure 2.4: Comparison of feature weights visualized as weight logos for DW, SA, and MLR. Weights represent the relative importance of a nucleotide at each position for the binding affinity. These weight logos were obtained using the Mad, Max, and Myc gcPBM datasets when training with the AUPRC as the objective.
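A rough sketch of how such a logo can be assembled from the averaged weights, assuming `weights` is a list of the 50 binary weight vectors of length 4L; stacked bars are used here as a simplification of true letter glyphs:

```python
import numpy as np
import matplotlib.pyplot as plt

def weight_logo(weights, seq_len=10):
    """Average the per-instance weight vectors and reshape to
    (position, nucleotide) for a stacked-bar visualization."""
    avg = np.mean(weights, axis=0).reshape(seq_len, 4)  # columns: A, C, G, T
    bottom = np.zeros(seq_len)
    for j, base in enumerate("ACGT"):
        # stack each nucleotide's average contribution at every position
        plt.bar(np.arange(1, seq_len + 1), avg[:, j], bottom=bottom, label=base)
        bottom += avg[:, j]
    plt.xlabel("Position")
    plt.ylabel("Average weight")
    plt.legend()
    plt.show()

# weight_logo(list_of_50_weight_vectors)  # the averages behind Fig. 2.4
```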
2.2.3 Performance on HT-SELEX Data

HT-SELEX [73, 75] is a method for investigating the relative binding affinity of a TF for a particular sequence of DNA, an in vitro technique complementary to PBM. We present results for the Max homodimer and TCF4, another member of the bHLH family with consensus sequence CANNTG, using data from HT-SELEX experiments [79]. The Max dataset consisted of 3200 sequences of 12 bp in length, and the TCF4 dataset was modified to contain 2000 sequences of 12 bp in length. The procedure for splitting each dataset into test and training data was similar to that described earlier for the gcPBM datasets (see Fig. 2.2). There was no overlap between training and testing data. The quantitative results for the classification and ranking performance of the six different machine learning approaches are summarized in Figs. 2.5a and 2.5b, and the weight logos for DW, SA, and MLR in Fig. 2.5c. As with the gcPBM data, when training with about 30 sequences (1% of the training data for Max and 2% of the training data for TCF4), DW, SA, and SQA exhibit the best performance on the test dataset. MLR matches the annealing protocols with a threshold at the 70th and 80th percentiles of the data, but does worse at the higher percentiles (left column in Figs. 2.5a and 2.5b). Lasso and XGB have the poorest performance. When training with about 150 sequences (5% of the training data for Max and 10% of the training data for TCF4), XGB performs very well, as on the gcPBM datasets with more training data, and MLR does well on the Max dataset but rather poorly on the TCF4 dataset; Lasso is comparable to MLR. DW's performance is worse than that of the best performing method (XGB), but comparable to the other methods (right column in Figs. 2.5a and 2.5b). XGB's performance on the TCF4 dataset is much better than the other methods, except when thresholding at the 99th percentile. In terms of Kendall's $\tau$ (Figs. 2.5a and 2.5b, bottom), all methods have similar performance when training with about 30 sequences, with the exception of XGB, which does not do as well on the Max dataset. When training with about 150 sequences, XGB gives the best ranking performance, as it did with the gcPBM data, and the other methods all perform similarly.
Figure 2.5: Summary of results for HT-SELEX data. (a) Comparison of the AUPRC (top) and Kendall's $\tau$ (bottom) when training with 1% of the data (left) and 5% of the data (right) for Max. Error bars are the standard deviation over 50 instances. (b) Comparison of the AUPRC (top) and Kendall's $\tau$ (bottom) when training with 2% of the data (left) and 5% of the data (right) for TCF4. (c) Weight logos for Max and TCF4 from training with the AUPRC as the objective.

Finally, the weight logos in Fig. 2.5c indicate that DW, SA, and MLR capture patterns in the data that are in good agreement with the expected consensus sequence.

2.3 Discussion

In this work we have explored the possibility of using a machine learning algorithm based on quantum annealing to solve a simplified but real biological problem: the classification and ranking of TF-DNA binding events. This is the first application of quantum annealing to real biological data. We have shown that DW performs comparably to or slightly better than its classical counterparts for classification when the training size is small, and competitively for ranking tasks. Moreover, these results are consistent with a similar approach to the Higgs particle classification problem [43], where DW and SA both outperformed XGB with small training sizes, with a slight occasional advantage for DW over SA. This robustness across completely different application domains suggests that these findings represent real present-day advantages of annealing approaches over traditional machine learning in the setting of small-size training data. In areas of research where datasets with a small number of relevant samples are more common, a QUBO approach such as quantum annealing realized via DW may be the algorithm of choice. On the other hand, when data is plentiful, some of the other state-of-the-art classical algorithms may be a better choice. We have also demonstrated that the feature weights obtained by DW reflect the biological data; the weight logos for the TF-DNA binding data from gcPBM and HT-SELEX are consistent with the consensus binding site. This gives some confidence that quantum annealing is learning relevant biological patterns from the data.

Yet the approach is not without limitations. One limitation comes from the use of a single-nucleotide model to encode the DNA binding sites. In effect, we implicitly used a simple model that assumes independence between positions in the sequence. This is not always a valid approximation; higher-order "k-mer" features or other "shape" features that account for interdependencies between nucleotide positions may enhance model precision [92, 93, 77, 76, 79]. We are limited to this simple model because of major technological constraints on the number of available qubits, which limits the number of features that can be used and thus the length of sequences that can be examined. The DW2X processor used for this study has 1098 functional qubits, but because of the sparse connectivity between qubits in the DW Chimera graph (see Sec. 1.1.3), only about 40 features could actually be implemented on the device in our study. Another serious limitation is the use of discrete weights. Discrete weights seem to be advantageous with a small number of training samples, as they enforce simpler models and are less prone to overfitting. However, as the amount of training data increases, these simpler models do not fare as well as some of the classical methods, which allow for greater numerical precision in the weights. Despite these limitations, it is encouraging to see competitive performance on the simplified problem we have studied here.
Although the fact that the performance advantage extends to the classical annealing-type optimizers makes it difficult to attribute the performance solely to quantumness, this work may inspire future investigations into the power of quantum annealing devices. As quantum technology continues to develop and advance, it is possible that some of the practical limitations will be addressed and that the range of problems that can be explored will expand.

2.4 Methods

2.4.1 QUBO mapping of the TF-DNA binding problem

After processing the experimental datasets of $N$ sequences of fixed length $L$ with a measure of the binding affinity, we obtained the restricted datasets to which we applied the six different machine learning strategies. Datasets were formulated as $\{(\vec{\phi}_n, y_n)\}_{n=1}^{N}$, where $\vec{\phi}_n \equiv (\phi_{n,1}, \ldots, \phi_{n,4L})^\top$ is the transformed feature vector and $y_n$ is the binding affinity. Solving for the simplest model is equivalent to finding a vector of binary weights $w = (w_1, \ldots, w_{4L})$, where $w_i \in \{0, 1\}$, such that the quantity

$\epsilon = \sum_{n=1}^{N} \left( y_n - w^\top \vec{\phi}_n \right)^2$   (2.1)

is minimized. The problem can then be specified as finding a $w_{\mathrm{opt}}$ such that

$w_{\mathrm{opt}} = \arg\min_w \sum_{n=1}^{N} \left( y_n - w^\top \vec{\phi}_n \right)^2 + \lambda \|w\|_1$,   (2.2)

where $\lambda$ is a regularization (penalty) term included to prevent overfitting and $\|w\|_1 = \sum_m w_m$ is the number of non-zero weights. To represent the above as an Ising problem, note that we can rewrite Eq. (2.2) as follows:

$w_{\mathrm{opt}} = \arg\min_w \sum_{n=1}^{N} \left( y_n - w^\top \vec{\phi}_n \right)^2 + \lambda \sum_{m=1}^{4L} w_m$   (2.3)
$\quad = \arg\min_w \sum_n \left( y_n^2 - 2 y_n w^\top \vec{\phi}_n + w^\top \vec{\phi}_n \vec{\phi}_n^\top w \right) + \lambda \sum_{m=1}^{4L} w_m$
$\quad = \arg\min_w \left( w^\top Q w + w^\top k \right)$,

where

$Q = \sum_n \vec{\phi}_n \vec{\phi}_n^\top, \qquad Q_{i,j} = \sum_n \phi_{n,i}\, \phi_{n,j}$,   (2.4)
$k = \lambda \mathbf{1} - 2 \sum_n y_n \vec{\phi}_n, \qquad k_i = \lambda - 2 \sum_n y_n \phi_{n,i}$.

Constants that do not affect the optimization are dropped in the last step. This procedure demonstrates that the problem of TF-DNA binding can be formulated as a QUBO problem, which in turn can easily be transformed into an Ising Hamiltonian of the form in Eq. (1.1) and passed to D-Wave.
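The mapping above is a few lines of linear algebra; a sketch, with `Phi` the N x 4L one-hot feature matrix and `y` the vector of binding scores as before:

```python
import numpy as np

def build_qubo(Phi, y, lam):
    """Construct the QUBO of Eq. (2.4): minimize w^T Q w + w^T k
    over w in {0, 1}^(4L)."""
    Q = Phi.T @ Phi                                  # Q_ij = sum_n phi_ni * phi_nj
    k = lam * np.ones(Phi.shape[1]) - 2.0 * (Phi.T @ y)
    return Q, k

def qubo_energy(w, Q, k):
    """Objective value of a candidate binary weight vector (constants dropped)."""
    return w @ Q @ w + w @ k

# The QUBO can be converted to an Ising problem via w = (1 + s) / 2, s in {-1, +1},
# which yields the h_i and J_ij that are passed to the annealer.
```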
2.4.2 Technical details of the algorithms

DW, SA, SQA, MLR, Lasso, and XGB were run on the same set of instances for assessment of the quantum annealer on the chosen problem. The experimental quantum processor, DW2X, was designed and built by D-Wave Systems, Inc. For each instance a total of 10000 anneals ("runs") were collected from the processor, run with an annealing time of 20 $\mu$s. SA and SQA are classical analogues of quantum annealing that perform annealing on a classical and a path-integral Monte Carlo simulation of the Ising spin glass, respectively. SA and SQA were run with 10000 sweeps (each sweep is an update of all spins) per repetition (or "anneal"), with an initial inverse temperature of 0.1 and a final inverse temperature of 3, for a total of 10000 repetitions. The SA code was adapted from Ref. [81], and an in-house version of SQA was used. MLR is a widely used technique to minimize the loss function shown in Eq. (2.2), with the convex penalty term $\lambda \|w\|_2^2$ instead of the linear penalty term. Lasso has the linear penalty term [83]; XGB uses boosted trees [84]. The weights $w$ returned by MLR, Lasso, and XGB are real-valued, whereas the weights returned by DW, SA, and SQA (which solve a QUBO/Ising problem) are binary. In addition, DW, SA, and SQA are probabilistic, meaning that a distribution of weights with different energies [the value of $H_P$ in Eq. (1.1)] is returned. Up to 20 of the lowest-energy weights were included for DW, SA, and SQA (see Sec. 2.4.4 for more details). The lower the energy, the better the particular solution is at minimizing Eq. (2.2). In contrast, MLR, Lasso, and XGB are deterministic and return a single solution.

In the calibration phase, only one hyper-parameter, $\lambda$, was tuned for DW, SA, SQA, MLR, and Lasso. All five methods were tuned separately for the classification and ranking tasks, resulting in a different optimal $\lambda$ for each method (see Tables 2.1 and 2.2 for the final values of $\lambda$). With an older dataset we varied both the number of sweeps for SA and the value of $\lambda$, but the results were not significantly different; hence, here we only vary $\lambda$ for SA. SA also has various other parameters that are related to the algorithm itself, including the number of runs, the initial and final temperatures, and the cooling schedule, all of which affect the optimization performance. These parameters were not tuned. Similar additional parameters for DW, including the annealing time and number of runs, were also not tuned. XGB's performance depends on several hyper-parameters, and more careful tuning was necessary in order to obtain competitive performance. The XGB parameters [84] that were considered include $\gamma$, max_depth, and min_child_weight (all of which control model complexity), subsample and colsample_bytree (which add randomness to make training robust to noise), as well as the learning rate, eta. Rather than doing a full grid search over all these parameters, the parameters were tuned sequentially; i.e., one value of eta was fixed, then the best values of max_depth and min_child_weight were found. The optimal $\gamma$ for those values was then found, and finally subsample and colsample_bytree were tuned. eta was then varied and the process repeated. eta was varied from 0.05 to 0.3, max_depth from 3 to 20, min_child_weight from 1 to 20, $\gamma$ from 0 to 1, and subsample and colsample_bytree both from 0.6 to 1.

In the testing phase, we evaluated performance based on two metrics: the AUPRC for classification performance and Kendall's $\tau$ for ranking performance. For the AUPRC, we report mean values with standard deviations as error bars, whereas for Kendall's $\tau$ the median value is presented.

2.4.3 Data processing and availability

The original probes for the gcPBM [72] data contained 16000-18000 sequences of 36 bp in length, with the fluorescence intensity as a measure of binding affinity. The same data is used in [76] and may be downloaded from GEO (https://www.ncbi.nlm.nih.gov/geo/) under accession number GSE59845. Because of current limitations of the architecture of the DW device, which restrict the number of features that may be used, the data was truncated to the central 10 bp. For each unique sequence of 10 bp, we calculated its average gcPBM signal; in other words, all sequences in the final datasets were unique. The final Mad, Max, and Myc datasets consisted of 1655, 1642, and 1584 sequences, respectively, of length 10 bp, and the logarithm base 2 of the fluorescence intensities was used. The HT-SELEX data came from mammalian TFs [74] that were re-sequenced with, on average, a 10-fold increase in sequencing depth [79]. The sequencing data is available at the European Nucleotide Archive (ENA, https://www.ebi.ac.uk/ena; study identifier PRJEB14744) and was pre-processed following the protocol in [79]. After this first step of pre-processing, the Max and TCF4 datasets consisted of 3209 and 15556 sequences of length 12 bp and 14 bp, respectively. The Max dataset did not require further truncation, but one bp on each of the left and right flanks was trimmed for the TCF4 dataset, giving a modified dataset of 1826 sequences of length 12 bp. As with the gcPBM data, the relative binding affinities were averaged for each truncated sequence.
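A sketch of the gcPBM pre-processing just described, assuming a pandas DataFrame `df` with placeholder columns `seq` (36-bp probe sequences) and `intensity` (fluorescence signal):

```python
import numpy as np
import pandas as pd

def preprocess_gcpbm(df, keep=10):
    """Truncate probes to the central `keep` bp, average the signal over each
    unique truncated sequence, and take log2 of the intensity."""
    probe_len = len(df["seq"].iloc[0])
    start = (probe_len - keep) // 2
    df = df.assign(core=df["seq"].str[start:start + keep])
    grouped = df.groupby("core")["intensity"].mean()  # unique sequences only
    return np.log2(grouped)

# y = preprocess_gcpbm(df)  # one binding score per unique 10-bp sequence
```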
2.4.4 Using Excited-State Solutions and Combining Weights (DW, SA, and SQA)

We used one of two heuristics to take advantage of the distribution of solutions returned by DW, SA, and SQA. The first is an iterative averaging that decreases the value of the objective function, and the other is a direct averaging of the best 20 solutions. In the training phase, all algorithms aim to minimize an objective function $\mathrm{Obj}(w) = R(w) + \Omega(w)$. Whether iterative averaging or direct averaging was used depended on which one gave the higher metric (either the AUPRC for classification tasks or Kendall's $\tau$ for ranking tasks) during the calibration phase. A list of the values of $\lambda$ and the method used is given in Tables 2.1 and 2.2.

2.4.4.1 Iterative averaging

Unlike MLR, Lasso, and XGB, DW, SA, and SQA are probabilistic algorithms that are run a number of times to return a distribution of binary weights with different energies; as such, it is of interest to see whether the higher-energy (or excited-state) solutions that DW, SA, and SQA return may still be useful. Note that by "energy" we refer to the energy of the Hamiltonian in Eq. (1.1), which is a measure of the optimization performance and is different from the free energy of TF-DNA binding; see the Methods for the mapping of the problem inputs to the Ising Hamiltonian. Fig. 2.6 shows the change in the objective value of the training instances for DW and SA upon inclusion of the first few excited (higher-energy) state solutions ("weights" and "solutions" are used interchangeably). To facilitate comparisons, both DW and SA have the same value of the regularizer $\lambda$, so that values of the objective function are on the same scale (SQA was not included in Fig. 2.6 but performed similarly). We start with the lowest-energy solution found. Higher-energy solutions were then iteratively included if doing so yielded a lower value of the objective function. Namely, given a solution averaged over $k$ solutions, $\bar{w}^{(k)} = \frac{1}{k} \sum_{i=1}^{k} w_i$, the $(k+1)$th solution was included if $\bar{w}^{(k+1)} = \bar{w}^{(k)} + \frac{1}{k+1}\left(w_{k+1} - \bar{w}^{(k)}\right)$ yielded a lower value of the objective function, i.e., if $\mathrm{Obj}(\bar{w}^{(k+1)}) < \mathrm{Obj}(\bar{w}^{(k)})$; otherwise it was discarded and the next-higher-energy solution was examined in the same fashion. Up to 20 solutions were considered; beyond that, changes to the objective function value are not significant, and going further might sacrifice the discrete nature of the weights. Although 20 solutions were examined, typically only about 4-7 excited-state solutions were actually included. This way of including excited-state solutions led to a monotonically decreasing objective function with an increasing number of excited states. As can be seen from Fig. 2.6, including higher-energy solutions generally improved optimization performance for DW, SA, and SQA. The improvement was not only on the training data, but also on the held-out testing data; hence, all the results in the main body of the text are for weights that have been iteratively averaged as described above.

Figure 2.6: Percent change in the objective function versus training size and number of solutions included. The performance of both DW (left) and SA (right) improves as the number of solutions included increases. SQA was not included but performed similarly. Error bars represent the standard deviation.

2.4.4.2 Direct averaging

The second way of including excited-state solutions is to simply average the 20 lowest-energy solutions: $\bar{w} = \frac{1}{20} \sum_{i=1}^{20} w_i$, where $w_i$ is the solution corresponding to the $i$th lowest energy found and $\bar{w}$ is the final, directly averaged solution used.
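Both heuristics are straightforward to express in code; a sketch, assuming `solutions` is a list of binary weight vectors already sorted by Ising energy and `obj` is the objective function defined earlier:

```python
import numpy as np

def iterative_average(solutions, obj, max_solutions=20):
    """Greedily include higher-energy solutions in a running average only if
    doing so lowers the objective (Sec. 2.4.4.1)."""
    w_bar = solutions[0].astype(float)
    k = 1
    for w in solutions[1:max_solutions]:
        candidate = w_bar + (w - w_bar) / (k + 1)
        if obj(candidate) < obj(w_bar):
            w_bar, k = candidate, k + 1
    return w_bar

def direct_average(solutions, max_solutions=20):
    """Average the lowest-energy solutions outright (Sec. 2.4.4.2)."""
    return np.mean(solutions[:max_solutions], axis=0)
```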
For some of the training instances, SA returned fewer than 20 unique solutions, and hence we averaged only over the actual solutions returned. Whether iterative averaging or direct averaging was used depended on which gave the higher metric (either the AUPRC for classification tasks or Kendall's $\tau$ for ranking tasks) during the calibration phase. A list of the values of $\lambda$ and the method used is given in Tables 2.1 and 2.2.

Table 2.1: Values of the hyper-parameter $\lambda$ used for the AUPRC. Parentheses indicate the method used to combine solutions: (it) refers to iterative averaging and (dir) refers to direct averaging (see Sec. 2.4.4). Since MLR has a closed-form solution, there is no need to combine solutions. Max* refers to the gcPBM data, and Max† to the HT-SELEX dataset.

            | 2% Training                             | 10% Training
Dataset     | DW        SA       SQA       MLR  Lasso | DW        SA        SQA       MLR  Lasso
Mad         | 64 (it)   64 (it)  64 (it)   8    1     | 192 (dir) 256 (it)  256 (it)  32   1/32
Max*        | 96 (it)   96 (it)  96 (it)   12   0.5   | 384 (dir) 384 (it)  384 (it)  48   1/64
Myc         | 32 (dir)  48 (it)  24 (as)   12   1     | 256 (dir) 16 (dir)  96 (dir)  48   1/256
TCF4        | 192 (it)  192 (it) 192 (dir) 6    0.5   | 2 (it)    512 (it)  512 (it)  32   0.25
Max†        | 256 (dir) 128 (it) 128 (it)  6    1     | 192 (dir) 64 (it)   64 (dir)  0.5  1/256

Table 2.2: Values of the hyper-parameter $\lambda$ used for Kendall's $\tau$. Otherwise the same as in Table 2.1.

            | 2% Training                             | 10% Training
Dataset     | DW        SA       SQA       MLR  Lasso | DW        SA        SQA       MLR  Lasso
Mad         | 64 (it)   64 (it)  64 (it)   4    1/32  | 256 (it)  3 (dir)   6 (it)    12   1/64
Max*        | 96 (it)   96 (it)  96 (it)   12   1/32  | 2 (it)    4 (it)    4 (dir)   4    1/128
Myc         | 32 (as)   32 (it)  32 (dir)  48   1/64  | 256 (as)  4 (it)    16 (it)   8    1/64
TCF4        | 0.25 (it) 4 (it)   1 (it)    6    1/128 | 0.25 (it) 16 (it)   3 (it)    32   1/256
Max†        | 128 (it)  128 (it) 128 (it)  6    1/64  | 6 (it)    8 (it)    512 (it)  0.5  1/128

2.4.5 Binarizing the binding affinities

The data used in this chapter came from gcPBM and HT-SELEX experiments, both of which return a measure $y$ of binding affinity (fluorescence intensity for the gcPBM data and relative binding affinity for HT-SELEX). In order to treat the data as a classification problem, we first need to classify the data based on the binding scores $y_n$. To do so, we introduced a threshold $\theta$ and classified each $y_n \in D_{\mathrm{TEST}}$ such that

$\hat{y}_n = \begin{cases} 0 & \text{if } y_n < \theta \\ 1 & \text{if } y_n \geq \theta \end{cases}$, for $n = 1, \ldots, |D_{\mathrm{TEST}}|$,   (2.5)

where $\hat{y}_n$ is the class, $D_{\mathrm{TEST}}$ refers to the test dataset, and $|D_{\mathrm{TEST}}|$ is the size of the test dataset. In other words, all $y$'s above the threshold were given the actual label of positive ("strongly binding") and all other values were negative ("weakly binding"). Changing the threshold clearly changes the number of positive and negative labels. Since picking the threshold value is somewhat arbitrary, in order to evaluate performance at different proportions of class labels we picked thresholds at fixed percentiles of the data (recall that the $y$-values were originally real-valued, hence the need to classify them). We picked the $p$th percentile of the data such that $p\%$ of the data was classified as negative and $(100-p)\%$ of the data was classified as positive. In the main text we examined classification performance with relatively high thresholds at the 70th, 80th, 90th, and 99th percentiles of the data. Thresholds at these percentiles can be thought of as properly selecting strong binding sites. As supporting material we also present performance at lower and medium thresholds.
In total, 11 different thresholds were chosen, corresponding to the percentiles at 1%, 10%, 20%, ..., 90%, and 99%.

Chapter 3: Cancer Classification

3.1 Introduction

In this chapter we apply quantum machine learning (qML) algorithms via quantum annealing to high-dimensional, multi-omic cancer data from The Cancer Genome Atlas (TCGA) and compare their performance to several standard machine learning algorithms and classical analogues of quantum annealing. We developed a novel mapping of the classification problem to an Ising formulation that can be run on D-Wave. The aim of this chapter is two-fold: first, to assess the applicability of current quantum hardware and algorithms to real biological data, and second, to understand under what circumstances, if any, quantum algorithms perform better than existing classical algorithms. To do so, we compared the quantum algorithms to a variety of standard classical machine learning algorithms and classical analogues of quantum annealing for both binary and multi-class classification. Our results show that in most cases, when using all the training data available, the quantum algorithms on average do slightly worse than classical machine learning, though the difference is not statistically significant. For smaller amounts of training data, quantum annealing and a classical analogue perform better than the classical machine learning algorithms with statistical significance. We explored in detail the performance of the algorithms that see the Ising problem, both quantum and classical, to understand the differences in performance. Our results show that while it is possible for current quantum hardware to be used in the analysis of high-throughput genomics data, other simple classical algorithms perform comparably. The main contribution of this work may be the development of an Ising model that allows for very good machine learning performance, despite the complexity of the underlying data.

3.2 Methods

3.2.1 Dataset and dimensionality reduction

Genomic data from The Cancer Genome Atlas (TCGA) was retrieved, pre-processed, and normalized. An overview of our data pipeline is depicted in Fig. 3.1. Briefly, we retrieved whole exome sequencing, RNA-Seq, miRNA-Seq, DNA methylation array, and genotyping array data for binary cancer classification (breast cancer vs. normal, estrogen receptor positive vs. negative, kidney renal clear cell vs. papillary cell carcinoma, lung adenocarcinoma vs. squamous cell carcinoma, and the two breast cancer subtypes luminal A vs. luminal B) as well as 6-cancer multiclass classification (breast vs. colorectal vs. lung vs. kidney vs. brain vs. liver cancer). These were retrieved from either the Genome Data Commons (GDC) data portal (https://portal.gdc.cancer.gov/ - data release 4.0) or cBioPortal (http://www.cbioportal.org/) [94, 95]. All five data types (mRNA, miRNA, CNV, DNA methylation, and somatic tumor variants) were concatenated into a single data matrix. Each feature was standardized to zero mean and unit variance (z-score). In order to collect statistics in our downstream analysis, we took 100 random cuts of training (80%) and testing (20%) data. Furthermore, given that the data has over 79,000 features, dimensionality reduction was conducted in order to make comparisons with existing quantum hardware and simulators possible. As such, we performed principal component analysis (PCA) on each cut of the feature data, retaining the top 44 principal components for the binary classification datasets and 13 principal components for the 6-cancer dataset, where the number of principal components was chosen based on the largest number of features that could be accommodated on existing quantum annealing hardware (see Sec. 3.2.7, on formulating the classification problem as an Ising Hamiltonian, for a brief discussion of the relationship between the number of features and the number of classes that can be used in the classification).
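A sketch of this reduction with scikit-learn (variable names are placeholders); note that the PCA loadings are fit on each training cut only, and the corresponding test cut is projected with the same loadings:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def random_cut(X, y, n_components=44, seed=0):
    """One 80/20 cut: standardize, fit PCA on the training portion,
    and project both portions onto the top principal components."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed)
    scaler = StandardScaler().fit(X_tr)
    pca = PCA(n_components=n_components).fit(scaler.transform(X_tr))
    return (pca.transform(scaler.transform(X_tr)), y_tr,
            pca.transform(scaler.transform(X_te)), y_te)

# cuts = [random_cut(X, y, seed=s) for s in range(100)]
```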
3.2.2 Standard machine learning approaches

We compared the performance of the quantum annealers to several standard machine learning approaches: LASSO, Ridge regression, support vector machines (SVM), random forest, and Naive Bayes. All approaches were implemented in R using the 'caret' package [96], with hyper-parameters chosen using 10-fold cross-validation.

3.2.3 D-Wave

For the results in this chapter we used two different D-Wave processors. For the binary classification comparisons we used the DW2X processor housed at the USC Information Sciences Institute (ISI). This processor has a total of 1098 active qubits. For the multiclass classification, we used the DW 2000Q processor owned by D-Wave and located in Burnaby, BC, Canada. This processor has a total of 2038 qubits. As mentioned in Sec. 1.1.3, we need to embed the logical problem of interest onto the physical hardware. This is done by finding a minor embedding, which maps each logical qubit to a group of physical qubits. An important factor that affects D-Wave's performance is $J_c$, the ferromagnetic coupling between the physical qubits that make up a logical qubit. In this chapter, we treat $J_c$ as a hyper-parameter with values in the set $\{0.5, 1, 3, 8, 16, 32\}$ and select the one that gave the best performance. We set the annealing time to 5 $\mu$s and repeated the anneal 1000 times with 10 spin-reversal transformations (or "gauges" [97]) to mitigate parameter misspecification on the machine. We selected the 20 spin configurations with the lowest Ising energy and performed some quick classical post-processing to combine the weights and improve performance.

3.2.4 Simulated Annealing

Similar to quantum annealing, simulated annealing (SA) accepts problems formulated as an Ising problem, as defined in Eq. 1.1, and returns binary variables. For this work we used the implementation in [98]. There are several important parameters that affect its overall performance: the number of sweeps, the type of schedule (linear or exponential), and the initial and final temperatures. For our purposes, we fixed the number of sweeps, which is analogous to the annealing time of quantum annealing, at 1000 and selected a linear schedule with an initial inverse temperature of 0.01. We treated the final inverse temperature, $b_1$, as a tunable hyper-parameter with values in the set $\{0.03, 0.1, 0.3, 1, 3\}$ and repeated the anneal 1000 times. Results in the main text are plotted for the hyper-parameter that gave the best performance during cross-validation. We used the same classical post-processing procedure that was used with D-Wave to combine the 20 spin configurations with the lowest energy, rather than using only the single configuration with the lowest Ising energy.
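For illustration, an equivalent run can be sketched with the open-source dwave-neal sampler; this is not the implementation of Ref. [98] used in this work, and the parameter names follow the neal API:

```python
import neal

def run_sa(h, J, beta_final=0.3, num_reads=1000, num_sweeps=1000):
    """Simulated annealing on an Ising problem (h: local fields, J: couplers),
    with a linear inverse-temperature schedule as described in the text."""
    sampler = neal.SimulatedAnnealingSampler()
    result = sampler.sample_ising(
        h, J,
        num_reads=num_reads,
        num_sweeps=num_sweeps,
        beta_range=(0.01, beta_final),
        beta_schedule_type="linear",
    )
    # keep the 20 lowest-energy spin configurations for post-processing
    return list(result.samples())[:20]

# samples = run_sa({0: 0.5, 1: -0.2}, {(0, 1): -1.0})
```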
3.2.5 Field

As another approach to explore the usefulness of formulating the classification task as an Ising problem, we implemented a very simple algorithm that takes into account only the values of the local fields, $\{h_i\}$. Once the $h_i$ have been determined based on the training data, we choose the weights to have the opposite sign of the fields; i.e., $w_i^{\mathrm{field}} = -\mathrm{sign}(h_i)$. This amounts to an optimization of Eq. 1.1 without any of the coupling terms $J_{ij}$. This almost trivial approach was used to determine whether the Ising Hamiltonian makes the problems too easy.

3.2.6 Random

As a sanity check, we generated random weights uniformly on the interval $[0, 1)$. Values below 0.5 were set to $-1$ and those above 0.5 were set to $+1$, thereby generating spin configurations of the same form as those returned by DW and SA. We then sorted the spin configurations according to their Ising energy (see Eq. (1.1)). As with DW and SA, we generated 1000 such random spin configurations and used the same classical post-processing procedure to combine the 20 weights with the lowest energy into a final set of weights.

3.2.7 Formulating a multi-class classification problem on a quantum annealer

We consider a multi-class classification problem. Let $x_i^\top$ be the $i$th sample (or row of the data matrix). For the datasets used in this work, the original data matrix $X$ consisted of about 100 to about 800 samples of around 70,000 genes. In order to use quantum annealing on D-Wave, we applied PCA to reduce the data to the number of components that can fit on D-Wave (44 for the binary classification problems). Assuming we have performed PCA, the transformed and projected data matrix $D_p$ (first $p$ PCs only) can be written as

$D_p = \begin{pmatrix} x_1^\top \\ x_2^\top \\ \vdots \\ x_N^\top \end{pmatrix}$,   (3.1)

where $N$ is the size of the training dataset. Let $y_i$ be the pre-assigned class of the $i$th sample (we are doing supervised learning). Let there be $K$ classes, so that $y_i \in \{1, \ldots, K\}$ (in this work, the classes correspond to different types of cancer, but they could be any set of labels, e.g., images with labels such as cat, dog, plane, etc.). Our goal is to learn the weight vectors $\{w_k\}_{k=1}^{K}$, with entries in $\pm 1$ (Ising variables), to be assigned to each class. We do so by formulating the problem as a minimization of the negative log-likelihood, defined below. Given $K$ unique classes, we can define the probability of each class using the SOFTMAX function as

$\Pr(y_i = k') = \frac{\exp(w_{k'}^\top x_i)}{\sum_{k=1}^{K} \exp(w_k^\top x_i)}$,   (3.2)

where each $x_i$ and $w_{k'}$ is a column vector whose length is equal to the number of features or PCs that we retain, $p$; i.e., each weight vector $w_k$ has elements $w_{k,l} \in \pm 1$, where $l \in \{1, \ldots, p\}$. Since we want the $\Pr(y_i = k')$ to be probabilities (which must sum to 1), we can reduce the number of weights we have to train from $K$ to $K-1$ and define the first $K-1$ probabilities as

$\Pr(y_i = k') = \frac{\exp(w_{k'}^\top x_i)}{1 + \sum_{k=1}^{K-1} \exp(w_k^\top x_i)}$,   (3.3)

with the probability of the $K$th class as

$\Pr(y_i = K) = \frac{1}{1 + \sum_{k=1}^{K-1} \exp(w_k^\top x_i)}$.   (3.4)

Our goal is to maximize the probability given the classes in the dataset, or equivalently to minimize the negative log-likelihood, defined as

$L = -\log \prod_i \Pr(y_i) = -\sum_{i=1}^{N} \log \Pr(y_i)$.   (3.5)

As stated, the negative log-likelihood is not formulated as an Ising problem, so we need to bi-linearize it, which we can do using a second-order Taylor expansion, as explained below. Note that the original problem, as formulated in terms of the log-likelihood, is convex, while its Ising version is in general NP-hard, so that the training time typically grows exponentially with the number of variables. This type of tradeoff has been discussed, e.g., in [43]. The main argument in favor of the Ising version is that it reduces sensitivity to modeling errors and, due to the binary weights, guards against overtraining. In this Ising version the model also provides a straightforward interpretation of the criteria used to classify the events. The exponential cost is a price sometimes worth paying in return for robustness in the presence of label noise.
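A sketch of the model of Eqs. (3.3) and (3.4), computing class probabilities and the negative log-likelihood for candidate Ising-valued weight vectors (all names are placeholders):

```python
import numpy as np

def class_probabilities(W, x):
    """Softmax with the K-th class as reference: W has shape (K-1, p) with
    entries +/-1, x has shape (p,); returns a length-K probability vector."""
    scores = np.exp(W @ x)            # exp(w_k . x) for k = 1, ..., K-1
    denom = 1.0 + scores.sum()
    return np.append(scores / denom, 1.0 / denom)

def negative_log_likelihood(W, X, y):
    """Eq. (3.5): y uses labels 1..K, with class K the reference class."""
    return -sum(np.log(class_probabilities(W, x)[label - 1])
                for x, label in zip(X, y))
```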
For simplicity, let $z_i^{(k)} = w_k^\top x_i$. We split the sum into terms over the first $K-1$ classes and the $K$th class; i.e.,

$L = -\sum_{i: y_i \in \{1 \ldots K-1\}} \log \Pr(y_i) - \sum_{i: y_i = K} \log \Pr(y_i = K)$
$\quad = \sum_{i: y_i \in \{1 \ldots K-1\}} \left[ -z_i^{(y_i)} + \log\left(1 + \sum_{k=1}^{K-1} \exp z_i^{(k)}\right) \right] + \sum_{i: y_i = K} \log\left(1 + \sum_{k=1}^{K-1} \exp z_i^{(k)}\right)$
$\quad = -\sum_{i: y_i \in \{1 \ldots K-1\}} z_i^{(y_i)} + \sum_i \log\left(1 + \sum_{k=1}^{K-1} \exp z_i^{(k)}\right)$.   (3.6)

Now Taylor expand the second summation around 0 in the $z_i^{(k)}$. For simplicity, write $z_k$ instead of $z_i^{(k)}$ and $z = (z_1, \ldots, z_{K-1})$. Then

$f(z) = \log\left(1 + \sum_{k=1}^{K-1} \exp z_k\right) \approx f(0) + \nabla f(0) \cdot z + \frac{1}{2} z^\top H f(0)\, z$,   (3.7)

where $\nabla$ is the gradient and $H$ is the Hessian:

$\frac{\partial f}{\partial z_n} = \frac{\exp z_n}{1 + \sum_{k=1}^{K-1} \exp z_k}$,   (3.8a)
$\frac{\partial^2 f}{\partial z_m \partial z_n} = \delta_{m,n} \frac{\exp z_n}{1 + \sum_{k=1}^{K-1} \exp z_k} - \frac{\exp z_m \exp z_n}{\left(1 + \sum_{k=1}^{K-1} \exp z_k\right)^2}$.   (3.8b)

Therefore,

$f(z) \approx \log K + \frac{1}{K} \sum_k z_k + \frac{1}{2} \frac{K-1}{K^2} \sum_k z_k^2 - \frac{1}{2} \frac{1}{K^2} \sum_{j \neq k} z_j z_k$.   (3.9)

With this we can complete our approximation of the log-likelihood (where we drop the sum over $\log K$, as it is constant):

$L \approx -\sum_{i: y_i \in \{1 \ldots K-1\}} z_i^{(y_i)} + \sum_i \left[ \frac{1}{K} \sum_k z_i^{(k)} + \frac{1}{2} \frac{K-1}{K^2} \sum_k \left(z_i^{(k)}\right)^2 - \frac{1}{2K^2} \sum_{j \neq k} z_i^{(j)} z_i^{(k)} \right]$
$\quad = -\sum_{i: y_i \in \{1 \ldots K-1\}} w_{y_i}^\top x_i + \sum_{i,k} \left[ \frac{1}{K} w_k^\top x_i + \frac{K-1}{2K^2} w_k^\top x_i x_i^\top w_k \right] - \frac{1}{2K^2} \sum_{i,\, j \neq k} w_j^\top x_i x_i^\top w_k$
$\quad = \sum_{k=1}^{K-1} w_k^\top \left( -\sum_{i: y_i = k} x_i + \frac{1}{K} \sum_i x_i \right) + \sum_{k=1}^{K-1} \left[ w_k^\top \left( \frac{K-1}{2K^2} \sum_i x_i x_i^\top \right) w_k - \sum_{j \neq k} w_j^\top \left( \frac{1}{2K^2} \sum_i x_i x_i^\top \right) w_k \right]$.   (3.10)

Let

$b_k = -\sum_{i: y_i = k} x_i, \qquad h' = \frac{1}{K} \sum_i x_i, \qquad J' = \frac{K-1}{2K^2} \sum_i x_i x_i^\top, \qquad J'' = \frac{1}{2K^2} \sum_i x_i x_i^\top$.   (3.11)

Then

$L \approx \sum_{k=1}^{K-1} w_k^\top (b_k + h') + \sum_{k=1}^{K-1} w_k^\top J' w_k - \sum_{k=1}^{K-1} \sum_{j \neq k} w_j^\top J'' w_k$.   (3.12)

This is the Hamiltonian to be implemented on quantum annealing hardware. In general, this formulation requires arbitrary inter-weight couplings (i.e., $J''$: couplings between $w_k$ and $w_j$, where $k$ and $j$ represent the weights for the $k$th and $j$th classes) and intra-weight couplings (i.e., $J'$: couplings between $w_{k,n}$ and $w_{k,m}$, where $n$ and $m$ are the indices of the weights assigned to the $n$th and $m$th features for the $k$th class). This imposes constraints on the number of classes and the number of features that can be run on a particular hardware graph. For a dataset with $M$ features and $K$ classes, this approach requires $M \times (K-1)$ logical variables. On the D-Wave 2000Q, the largest complete graph that can be embedded has around 66 logical variables; i.e., $M \times (K-1)$ must be less than 66. For our purposes, we chose $K = 6$ cancer types, which limits the number of features we can use to 13. The largest complete graph that can be embedded onto the DW2X processor at ISI consists of 45 logical variables, and for the binary classification datasets we chose a total of 44 features.
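The quantities in Eq. (3.11) can be assembled directly from a training cut; a sketch, with `X` the N x p PCA-projected matrix and `y` integer labels in 1..K (flattening the K-1 weight vectors into a single spin vector for the hardware is left implicit):

```python
import numpy as np

def multiclass_ising_terms(X, y, K):
    """Compute b_k, h', J', J'' of Eq. (3.11) for the approximate
    negative log-likelihood of Eq. (3.12)."""
    b = [-X[y == k].sum(axis=0) for k in range(1, K)]  # one b_k per class 1..K-1
    h_prime = X.sum(axis=0) / K
    second_moment = X.T @ X                            # sum_i x_i x_i^T
    J_prime = (K - 1) / (2 * K**2) * second_moment     # intra-weight couplings
    J_dprime = 1 / (2 * K**2) * second_moment          # inter-weight couplings
    return b, h_prime, J_prime, J_dprime

def approx_nll(W, b, h_prime, J_prime, J_dprime):
    """Evaluate Eq. (3.12) for W of shape (K-1, p) with +/-1 entries."""
    linear = sum(w @ (bk + h_prime) for w, bk in zip(W, b))
    quad = sum(w @ J_prime @ w for w in W)
    cross = sum(W[j] @ J_dprime @ W[k]
                for j in range(len(W)) for k in range(len(W)) if j != k)
    return linear + quad - cross
```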
In examining the derivation, one may ask whether it is reasonable to take an expansion to second order near $z_i^{(k)} = 0$. We offer two brief arguments in response. The first is that a second-order approximation has been used to great success in other algorithms, such as XGBoost. XGBoost is a gradient-boosted algorithm that has recently seen much success in a variety of machine learning tasks; to speed up calculations, it uses a second-order approximation to its objective function in a general setting. It is important to note, however, that for XGBoost (and other gradient-based methods) the weights are updated iteratively, whereas here we use a quantum annealer to directly optimize the approximated loss function. A second argument is that we are looking for a set of self-consistent solutions: we take the second-order approximation around 0, and if the optimization works properly, we will obtain results for which the approximation is valid. Perhaps a more serious concern is that this expansion is not formally within the radius of convergence of the natural logarithm. Given this concern, care should be taken to make sure that the approximation does not differ too greatly from the original function. One simple way to check this is to see whether there is a clear correlation between the energies (the approximation) and the original function we are trying to optimize (the log-likelihood). As long as there is good correlation, the approximation is reasonable. The correlation between the negative log-likelihood and the energy for the five binary classification datasets is shown in Figure 3.5. For binary classification, the negative log-likelihood is equivalent to the logistic loss, $l_{\mathrm{ll}} = \sum_n \ln(1 + \exp[-y_n w^\top x_n])$, if we use the label convention $y_n \in \{-1, 1\}$, or to the binary cross-entropy loss, $l_{\mathrm{CE}} = \sum_n \left( -y_n \ln \sigma(w^\top x_n) - (1 - y_n) \ln(1 - \sigma(w^\top x_n)) \right)$, where $\sigma(z) = \frac{1}{1 + \exp(-z)}$ is what the softmax function reduces to for binary classes, if we use the convention $y_n \in \{0, 1\}$. As a comment, the binary cross-entropy loss is used frequently in training neural networks.

3.2.8 Cross-validation

Note that the coupling terms $J', J''$ in Eq. (3.11) are proportional to $\sum_{i=1}^{N} x_i x_i^\top$, where $x_i^\top$ is a row in the data matrix. Let us express $D_p$ in terms of its columns:

$D_p = \left( c_1, c_2, \ldots, c_p \right)$,   (3.13)

where, by construction from PCA, column $c_i$ is orthogonal to the other columns (i.e., $c_i^\top c_j = 0$ for $j \neq i$). Then, using Eq. (3.1),

$J', J'' \propto \sum_{i=1}^{N} x_i x_i^\top = \left( x_1, x_2, \ldots, x_N \right) \begin{pmatrix} x_1^\top \\ x_2^\top \\ \vdots \\ x_N^\top \end{pmatrix} = D_p^\top D_p$   (3.14a)
$\quad = \begin{pmatrix} c_1^\top \\ c_2^\top \\ \vdots \\ c_p^\top \end{pmatrix} \left( c_1, c_2, \ldots, c_p \right) = \begin{pmatrix} c_1^\top c_1 & \cdots & c_1^\top c_p \\ \vdots & \ddots & \vdots \\ c_p^\top c_1 & \cdots & c_p^\top c_p \end{pmatrix}$.   (3.14b)

Because the $c_i$ are orthogonal, only the diagonal survives in $J'$ and $J''$. To circumvent this issue one may use cross-validation, a procedure in which each time we take a different subset of the sum (e.g., $\sum_{i=1}^{N'} x_i x_i^\top$, with $N' < N$), and thus some of the off-diagonal elements remain. More specifically, cross-validation is a technique to split the data into training and testing "folds." With $k$-fold cross-validation, the first $1/k$ of the data is used as the testing fold, with the remainder used for training. Then the next $1/k$ of the data is used for testing and the rest for training. This is repeated $k$ times, such that each sample in the data matrix is trained on $k-1$ times. Thus $N' = N(k-1)/k$. In our work we typically used $k = 3$ folds. Cross-validation is a common (indeed necessary) procedure used in machine learning for the selection of hyper-parameters, and here it has the added benefit of avoiding the orthogonality of the PCs: it can be used both to select relevant hyper-parameters and to decrease the variance in estimates of performance.
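The orthogonality effect is easy to verify numerically; a small sketch with synthetic data (the array shapes are placeholders):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
X_pca = PCA(n_components=5).fit_transform(X)

full = X_pca.T @ X_pca               # sum over all samples: off-diagonals ~ 0
fold = X_pca[:200].T @ X_pca[:200]   # a 2/3 training fold: off-diagonals survive

off_diag = lambda M: np.abs(M - np.diag(np.diag(M))).max()
print(off_diag(full), off_diag(fold))  # first is ~1e-13, second is O(1)
```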
3.3 Results

We assessed the performance of the quantum and Ising-type algorithms on several cancer datasets in an effort to understand the readiness of applying current quantum hardware to cancer classification, and we compared their performance to several standard classical machine learning algorithms. Genomic data from The Cancer Genome Atlas (TCGA) was retrieved, pre-processed, and normalized before dimensionality reduction with principal component analysis was performed. An overview of our data pipeline is depicted in Fig. 3.1.

Figure 3.1: Overview of the data pipeline. (a) Whole exome sequencing, RNA-Seq, miRNA-Seq, DNA methylation array, and genotyping array data were retrieved from The Cancer Genome Atlas (TCGA) for cancer classification. The data were concatenated into a single matrix, which was then repeatedly split into training cuts of 80% of the data and testing cuts of 20% of the data. Principal component analysis (PCA) was performed on each training cut of the data, and the test data was projected using the same loadings. We compared several standard classical machine learning algorithms with quantum and quantum-inspired algorithms (see Methods for details of the algorithms). After training, performance was evaluated on the corresponding test dataset for a variety of metrics, including balanced accuracy, AUC, and the F1 score. Averages were taken across the 100 cuts of the data. (b) The 6 cancer types used for the 6-cancer classification: breast (1006 samples), liver (358), kidney (611), colon and rectum (551), brain (499), and lung (962).

Table 3.1: Sizes of the datasets used in this work.

Dataset           No. of training samples   No. of test samples
KIRC vs. KIRP     490                       121
LUAD vs. LUSC     770                       192
BRCA vs. Normal   118                       28
ERpos vs. ERneg   768                       191
LumA vs. LumB     250                       61
6 cancer          3193                      794

3.3.1 Binary Classification

In this section we present results for the five binary classification datasets. The comparisons are Kidney Renal Clear Cell Carcinoma (KIRC) versus Kidney Renal Papillary Cell Carcinoma (KIRP); Lung Adenocarcinoma (LUAD) versus Lung Squamous Cell Carcinoma (LUSC); Breast Invasive Carcinoma (BRCA) versus matched normal; the BRCA molecular subtypes ER positive versus ER negative; and luminal A versus luminal B. See Table 3.1 for the size of each dataset. We ran five classical machine learning models (LASSO, Ridge, Random Forest, Naive Bayes, and SVM), one quantum algorithm (D-Wave), and three other classical "Ising-type" algorithms (simulated annealing (SA), Random, and Field). Here, we first provide a brief overview of the quantum and classical Ising-type methods (later we will refer to D-Wave, SA, Random, and Field collectively as Ising-type algorithms).
Other approaches to machine learning using the D-Wave devices have either needed to construct weak classifiers based on the data [43] or introduced a threshold on the (real-valued) response values [71]. In this chapter, we developed what we believe to be a novel classification approach that can be used to solve classification problems directly. Our approach is inspired by multinomial regression, which reduces to logistic regression when there are two classes (see Sec. 3.2.7 for the mapping to an Ising problem). We compared performance to several classical approaches that use the same objective function as D-Wave (problems formulated as an Ising problem): simulated annealing (SA), Random, and Field. Simulated annealing [80] is a heuristic optimization algorithm that uses Metropolis updates to optimize a target objective function. For Random, we randomly generated candidate weights, sorted them by their Ising energy, and selected the best performing weights. For Field, we disregarded the coupling terms $J$ and performed an optimization only over the local fields $h$. See Methods for additional technical details. For DW, SA, and Random, we generated 1000 weights and combined the 20 sets of weights with the lowest Ising energy in a simple post-processing procedure, using the same procedure described in the previous chapter in Sec. 2.4.4.

Figure 3.2: Comparison of classification approaches on the five binary classification datasets. Algorithms are sorted based on their performance in terms of the balanced accuracy. Different performance metrics are represented in different colors. Error bars represent two standard errors of the mean.

Fig. 3.2 shows comparisons of all the approaches on the five binary classification datasets for four metrics of performance: accuracy, balanced accuracy, AUC, and the F1 score. See the Appendix for details on the performance metrics. Perhaps unsurprisingly, one of the standard classical machine learning approaches usually performs the best across all metrics of performance. However, for several of the datasets, one of the Ising-type algorithms performs nearly as well as the best classical methods on some of the performance metrics. For BRCA versus normal, Random Forest slightly outperforms all other methods, but Ridge regression, Random, SVM, SA, LASSO, and D-Wave are almost indistinguishable. Similarly, on the luminal A versus luminal B comparison, LASSO performs slightly better than the others, but SVM, Random, D-Wave, SA, Ridge, and Field are nearly identical in terms of their balanced accuracy (though the classical methods tend to do better in terms of the F1 score). For the other datasets, the Ising-type algorithms do not perform as well as the best classical algorithms, but their performance is still fairly close. The ability of the Ising-type algorithms to give performance that is competitive with the classical methods shows the feasibility of phrasing the classification problem as an Ising problem. Field is usually one of the worst methods, but it does surprisingly well on the luminal A versus luminal B dataset.
These overall trends appear to be the same even when using the original features, i.e., the features before performing PCA (Fig. 3.3). Though the quantum algorithms generally perform worse than the classical algorithms, for luminal A versus luminal B (which we will refer to as "lumAB"), the Ising-type algorithms perform better than most of the classical methods, though not outside of the error bars. Like the classical methods, the most informative weight among the D-Wave weights is the first principal component, indicating that D-Wave also assigns the greatest weight to the most biologically relevant information. This is consistent with previous results, where D-Wave was able to extract a motif for protein-DNA binding that agreed with classical results [71].

Figure 3.3: Comparison of classification approaches on the lumAB dataset using the top 44 genes from PC1. Error bars represent two standard errors of the mean.

3.3.2 Reducing the Amount of Training Data

Based on other work that showed an advantage for Ising-type approaches over classical machine learning approaches with smaller training sizes [43, 71], we successively reduced the size of the training data. To do this, we reduced the amount of data for the lumAB dataset to see whether the Ising-type algorithms perform better than the classical algorithms. We fixed one cut of the training data (80% of the entire dataset). From this we repeatedly selected smaller and smaller portions of the training data and evaluated performance on the original test data (20% of the entire dataset). The results are presented in Fig. 3.4. At very small training sizes, the Ising-type approaches (D-Wave, SA, Random, and Field) do much better than nearly all the classical algorithms. SVM does very well at 18% of the original training data, but this seems to be an outlier, as its performance decreases after slightly increasing the amount of training data. With between 20% and 50% of the training data, the performance of the Ising-type algorithms is statistically significantly better than that of all the classical algorithms. After increasing the amount of training data, LASSO and Ridge regression improve significantly and reach the same level of performance as SVM and the Ising-type methods. Random Forest and Naive Bayes do poorly on this dataset, though their performance also improves with increased training data.

Figure 3.4: The test set balanced accuracy for the luminal A vs. B binary classification with different amounts of training data. Error bars are two standard errors of the mean.

3.3.3 Further study of Random's performance

It may come as a surprise that Random does so well in the results presented thus far. For several of the binary classification datasets, Random does better than D-Wave and SA, and when reducing the amount of training data, it is often not significantly different from D-Wave and SA. To be fair, Random's performance is not truly random: although candidate weights are randomly generated, they are sorted according to their Ising energy, and only the best performing weights are used. Thus, Random's performance should be viewed more as a measure of the utility of formulating the classification task as an Ising problem.
Nevertheless, in an effort to understand the behavior of Random in comparison to DW and SA, in this section we present some additional details about the relative performance of the various probabilistic Ising-type methods that optimize the Ising energy (i.e., DW, SA, and Random; Field is an Ising-type method but is not probabilistic). We first show that, for the datasets we examined, the Ising energy is a reasonable approximation to the negative log-likelihood and that Random's performance comes from a mismatch between the Ising energy and the balanced accuracy. We also explore the effect of changing the number of weights, both the total number of weights and the number of weights used in the post-processing approach described in Sec. 2.4.4.

3.3.3.1 Performance metrics versus energy

As a way to systematically gain some insight into the machine learning performance with respect to the Ising energy, we considered the effect of using "higher-energy excited-state" weights (the weights that have a higher energy according to the Ising Hamiltonian defined in Eq. 1.1). To do so, we first generated 1000 sets of weights for each method (DW, SA, and Random). The weights were then sorted according to their Ising energy. Next, we used a sliding window of energies; i.e., we applied our post-processing averaging procedure to 20 weights at a time, starting from the 20 weights with the lowest Ising energy, then applying the averaging procedure to the 20 weights with the next-lowest Ising energies, and so on. This procedure was repeated for all 100 cuts of the data for each dataset, and the results for the balanced accuracy and the negative log-likelihood (referred to as the logistic loss; these are equivalent when using the label convention $y_n \in \{-1, 1\}$) are presented in Figure 3.5.

The top row of Figure 3.5 shows a maximum in the balanced accuracy versus the Ising energy for most of the datasets. This maximum indicates the presence of a mismatch between the optimized objective function (the Ising energy) and the performance metric (the balanced accuracy): some weights that perform worse in terms of the Ising energy perform better in terms of the balanced accuracy. This is in part due to the nature of the logistic loss, which is somewhat sensitive to outliers. To understand this mismatch, note that to calculate the balanced accuracy we must first threshold the predicted probabilities, assigning classes based on whether the probability for that class is greater than 0.5. The balanced accuracy does not take into account how far off a prediction was (i.e., how far away from 1 the probability was), whereas the logistic loss heavily penalizes predictions that are made incorrectly with high confidence. In other words, the lower-energy weights may correspond to those that make more inaccurate guesses, but do so with lower confidence, thereby giving a better logistic loss but a worse balanced accuracy, whereas the higher-energy weights may correspond to those that predict the incorrect class less often, but predict the incorrect class with relatively high confidence.

Figure 3.5: Evaluation of the performance of DW, SA at various final inverse temperatures ($b_1$ in the legend), and Random when using a sliding window of energies. Average balanced accuracy (top) and average logistic loss (bottom) across the 100 cuts on the test dataset, versus the average Ising energy of the post-processed weights. The shaded region represents 2 standard deviations.
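A sketch of the sliding-window analysis, assuming `solutions` and `energies` from one of the samplers and `score` any metric (such as the balanced accuracy on the test set):

```python
import numpy as np

def sliding_window_performance(solutions, energies, score, window=20):
    """Post-process each consecutive window of energy-sorted weights and
    record (mean window energy, metric of the averaged weights)."""
    order = np.argsort(energies)
    solutions = np.asarray(solutions)[order]
    energies = np.asarray(energies)[order]
    results = []
    for start in range(0, len(solutions) - window + 1, window):
        w_avg = solutions[start:start + window].mean(axis=0)
        results.append((energies[start:start + window].mean(), score(w_avg)))
    return results
```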
By comparing the SA results at different final temperatures we are able to control the weights found to lie within a particular energy range; a higher final temperature (smaller $\beta^{-1}$ in Fig. 3.5) shifts the energies to the right. Note that while it is possible to push SA to find higher energy weights by increasing the final temperature, there is no single parameter we can use to decrease the temperature of the weights found by the Random method, other than increasing the number of weights we randomly generate (for D-Wave we can essentially control the temperature by scaling the $h$ and $J$ values). It seems reasonable to expect that with a growing number of features, Random will become less and less likely to find weights low enough in energy to be useful; i.e., we may expect its weights to lie to the right of the maximum. DW and SA are generally to the left of the maximum, and we can effectively "increase the temperature" such that they find weights that are near the maximum. In addition, within the same energy window, performance in terms of the balanced accuracy is very similar between the respective methods. A priori one might hope that even if we tune DW and SA to find energies within the same energy region as Random, the quality of the weights might be different. This does not seem to be the case; moving SA to a higher temperature essentially causes its balanced-accuracy curve to overlap with D-Wave's, and then with Random's. In fact, D-Wave can hardly be seen, suggesting that within a certain energy range, all methods return similar weights.

The bottom row of Fig. 3.5 shows the averaged negative log-likelihood across the 100 cuts of the data versus the average Ising energy. Though not always linear, there is a clear correlation between the Ising energy and the logistic loss. This indicates that the approximations we used to generate the Ising problem from the negative log-likelihood, though perhaps not perfect, are good enough that we see excellent correlations between the two. Had the approximation not been valid, we would expect to see very poor correlation, or no correlation at all, between the two; Random's better performance therefore cannot be attributed to an incorrect approximation.

3.3.3.2 Varying the number of weights

Because DW, SA, and Random are all probabilistic, returning a distribution of weights, finding weights with low Ising energy is somewhat dependent on the total number of weights that are generated. We note that the methods differ in their ability to find low energy weights; as seen in Figure 3.5, when allowing the three methods to generate 1000 weights, Random does not find weights that are as low in Ising energy as DW and SA. We might expect to need to randomly generate on the order of $2^{44}$ weights (recall that we used 44 PCs for the binary classification datasets) in order for Random to match the Ising energy of the weights returned by DW and SA.

In addition to the total number of weights, performance also depends on the number of weights we include in our post-processing procedure, described in Sec. 2.4.4. Our post-processing procedure is designed to monotonically improve performance on the training datasets, and thus we might expect including more weights to improve performance.
We in general did not consider using all weights, for several reasons: first, doing so increases the amount of time needed to generate final candidate weights; second, by including many of the weights, we begin to lose some of the discrete nature of the weights and therefore some of the robustness associated with them; finally, monotonically improving performance on the training dataset may lead to overfitting, and therefore using a smaller number of weights is somewhat analogous to an early-stopping regularization scheme.

Figure 3.6: Heatmaps showing the effect of changing the total number of solutions (y-axis) and the number of solutions used in the post-processing procedure (x-axis) on the average balanced accuracy on the test datasets for the five binary classification datasets. SA was run with a final inverse temperature of $\beta^{-1} = 0.03$ and DW was run with a $J_c$ of 8.0. A higher balanced accuracy indicates better performance.

The choices of 1000 total weights and 20 weights for post-processing are somewhat arbitrary; hence, to systematically determine the effect of the number of weights, we varied the total number of weights from 1 to 1000 and the number of best performing weights from 1 to 1000 for each of the five binary classification datasets. As before, we used the same 100 cuts of the data described in the main text for each of the datasets. The average test balanced accuracy, average test logistic loss, and average training negative Ising energy are shown in Figures 3.6, 3.7, and 3.8, respectively. For each performance metric (balanced accuracy, logistic loss, and Ising energy), we chose the final inverse temperature for SA and the $J_c$ for DW (see Sec. 1.1.3) that gave the best performance on that measure.

Figure 3.7: Heatmaps showing the effect of changing the total number of solutions (y-axis) and the number of solutions used in the post-processing procedure (x-axis) on the average logistic loss on the test datasets for the five binary classification datasets. SA was run with a final inverse temperature of $\beta^{-1} = 0.03$ and DW was run with a $J_c$ of 8.0. A lower logistic loss indicates better performance.

Figures 3.6-3.8 all show that when using a very small number of total weights, Random does worse than DW and SA on all metrics. However, with as few as 50 total weights and 5 best performing weights, Random performs nearly the same as DW and SA in terms of the balanced accuracy. Further increasing the total number of weights and the number of weights used in the post-processing procedure improves the balanced accuracy for DW, SA, and Random. For ER positive versus ER negative (ERpn) and LUAD vs. LUSC (luadlusc), using all 1000 out of 1000 weights gives the best performance, but for the other datasets, using a smaller number (around 20 or 50) of the total weights gives nearly equal performance to using all 1000. For Random, using all 1000 weights for the lumAB dataset is worse than using 50 weights, indicating that for this dataset early stopping may help improve performance in terms of the balanced accuracy. Figures 3.7 and 3.8 show that Random does not find weights that are as low in Ising energy or with as low a logistic loss as DW and SA, confirming what was shown in Figure 3.5.

Figure 3.8: Heatmaps showing the effect of changing the total number of solutions (y-axis) and the number of solutions used in the post-processing procedure (x-axis) on the average training negative Ising energy for the five binary classification datasets.
SA was run with a final inverse temperature of $\beta^{-1} = 0.03$ and DW was run with a $J_c$ of 8.0.

These additional results show that there are indeed combinations of the total number of weights returned and the number of weights used in the post-processing procedure for which Random does worse than DW and SA. However, even when exploring only a small subset of the total search space (1000 out of $2^{44}$ possible weights), Random is able to give very good machine learning performance. This suggests that perhaps there are only a small number of informative features; getting several of these features "right" is sufficient for good performance.

3.3.4 Multiclass Classification

We also evaluated the performance of the classical approaches and the annealing approaches on a 6-cancer dataset; results are shown in Fig. 3.9. The six cancer types include brain, breast, kidney, lung, liver, and colorectal cancer. For this larger dataset, classical approaches are superior to the annealing approaches.

Figure 3.9: Comparison of results for classification of six cancers. The same classical algorithms were used as for the binary classifications. Quantum annealing and simulated annealing perform worse than the classical methods for this dataset, though they still do better than randomly generated bit strings sorted according to their value of the DW loss function.

3.4 Discussion

Our results are among the first demonstrations of quantum algorithms applied to real, complex multi-omics biological data, specifically for cancer classification. We have shown that classification with quantum annealing and Ising-type algorithms can nearly match the performance of classical approaches on several datasets. However, the benefit from using quantum annealing cannot be attributed primarily to inherent quantum behavior, as SA and Random perform similarly to, if not better than, D-Wave. It may come as a surprise that randomly generating bit strings and sorting them by their Ising energy achieves performance nearly equal to standard machine learning techniques, and usually does better than both D-Wave and SA. We have explored this surprising fact and can understand Random's performance as a mismatch between the objective function for the Ising-type approaches, which is an approximation for the negative log-likelihood, and the performance metrics presented (accuracy, balanced accuracy, F1 score, AUC).

Random's overall excellent performance notwithstanding, our results have shown the utility of formulating the classification problem as an Ising problem. This advantage is even more prominent with a smaller amount of training data. Field, which is an almost trivial algorithm once the Ising problem is formulated, does extremely well with very small amounts of data on the lumAB dataset (though we note that its performance is almost always the worst for the other datasets examined).

The annealing approaches had less success than the classical methods with multiclass classification. The annealing approaches may be doing worse in comparison to the classical approaches due to the amount of training data used.
As shown when reducing the amount of training data for the lumAB dataset, the quantum annealing approaches are able to do very well with a small amount of data but do not benefit as much from increasing the amount of data as the classical algorithms do. The 6-cancer dataset has about 12x more data than the full lumAB dataset (see Table 1), and thus the classical algorithms are most likely able to learn more from the additional data, whereas the annealing approaches do not seem to be able to learn as well. Another reason the Ising-type approaches seem to do worse may be related to the approximations used to formulate the classification problem as an Ising problem (see Sec. 3.2.7). The approximation may be valid for a smaller number of classes but could break down with a larger number of classes.

Though practical quantum computing architectures are still in development, the demonstration of nearly competitive performance with powerful classical methods on these datasets is encouraging. As technology improves and new algorithms are introduced, we are cautiously optimistic that quantum or quantum-inspired algorithms may generate new biological insights and drive the discovery of novel approaches to solving complex biological problems.

Chapter 4

Nested Quantum Annealing Correction

In the previous two chapters we have shown the feasibility of mapping supervised machine learning tasks to Ising problems that can be run on current quantum annealers made by D-Wave. In this chapter, we again consider machine learning metrics as a means of assessing performance, but the focus will be on improving the performance of quantum annealers relative to themselves and understanding the physics of the device.

For a brief review of closed-system QA, see Sec. 1.1.1. Practical implementations of QA are coupled to the environment, which may introduce additional sources of errors, such as dephasing errors and thermal excitation errors [55, 99, 57]. Theoretical and experimental studies have indicated that due to strong coupling to a thermal bath, current quantum annealing devices operate in a quasi-static regime [67, 100, 101, 102, 103]. In this regime there is an initial phase of quasi-static evolution in which thermalization times are much shorter than the anneal time, and thus the system closely matches a Gibbs distribution of the instantaneous Hamiltonian. Towards the end of the anneal, thermalization times grow and eventually become longer than the anneal time, and the system enters a regime in which the dynamics are frozen. The states returned by a quantum annealer operating in this regime therefore match a Gibbs distribution not of the final Hamiltonian, but of the Hamiltonian at the freezing point.

The fact that open-system QA prepares a Gibbs state may be a bug for optimization problems, but it could be a feature for sampling applications. Recently, there has been interest in using QA to sample from classical or quantum Gibbs distributions (see Ref. [104] for a review), and in whether QA can prepare such distributions faster than Monte Carlo methods. One application where sampling plays an important role is the training of Boltzmann machines. Boltzmann machines (BMs), which are a class of probabilistic energy-based graphical models, are the basis for powerful deep learning models that have been used for both supervised (with labels) and unsupervised (without labels) learning tasks [105, 106].
As its name suggests, a Boltzmann machine assumes that the target dataset is generated by some underlying probability distribution that can be modeled by a Boltzmann or Gibbs distribution. Whereas the temperature does not play a large role for classical methods, since the values of the model parameters can simply be scaled as needed, physical constraints on current (and future) quantum annealers limit the range of model parameters that can be programmed. As such, an important parameter for a physical quantum annealer is the effective freezing temperature $T_e$, the temperature of the approximate thermal distribution at the freezing point. The magnitude of this temperature can be viewed as a rough metric of the performance of a quasi-static quantum annealer. However, based on results for a "temperature scaling law" [103] for quantum annealers, we should expect the effective freezing temperature to grow with problem size. In the limit of very high temperatures, the samples drawn from a quantum annealer with a limited parameter range would be nearly random, which would make training a Boltzmann machine with a quantum annealer nearly impossible.

Nested quantum annealing correction (NQAC) [107] is a potentially scalable form of quantum annealing correction (QAC) [108, 109, 110, 111] tailored for use on quantum annealing devices, including commercially available ones, that was developed to address some of these concerns. NQAC achieves error correction by introducing an effective temperature reduction [107, 112, 113], and previous work has shown that NQAC can be used to improve optimization performance and obtain more accurate estimates of the gradient [112]. In this work we apply NQAC to an entire training procedure of fully-visible Boltzmann machines. We demonstrate an improvement for both supervised and unsupervised machine learning tasks using NQAC, explore the effects of increased anneal times, and make comparisons to SQA and SVMC to probe the underlying physics of using a D-Wave quantum annealer as a sampler.

The remainder of the chapter is structured as follows. We provide more technical background on both Boltzmann machines and the NQAC construction in Sec. 4.1. In Sec. 4.2 we give an overview of the methods, and results are presented in Sec. 4.3. We conclude in Sec. 4.4.

4.1 Technical Background

4.1.1 NQAC

NQAC is a form of quantum annealing correction that can encode problems with arbitrary connectivity, allows for a variable code size (and is thus potentially scalable), and can be implemented on a generic quantum annealing device [107]. In general, to implement QAC we encode the original (or "logical") quantum annealing Hamiltonian $H(s)$ in an "encoded physical Hamiltonian" $\bar{H}(s)$, using a repetition code [108, 109, 114]:

$\bar{H}(s) = A(s) H_X + B(s) \bar{H}_P, \quad s \in [0, 1],$   (4.1)

where $\bar{H}_P$ is the "encoded physical problem Hamiltonian" and all terms in $\bar{H}_P$ are defined over a set of physical qubits that is larger than the number of logical qubits in the original unencoded problem. The states of the logical problem Hamiltonian $H_P$ can then be recovered by properly decoding the states of $\bar{H}_P$. Encoding the driver Hamiltonian as well would further improve performance with error correction, since it would enable the implementation of fully encoded adiabatic quantum computation, for which rigorous error suppression results have been proven [115, 116, 117, 118, 119]. Unfortunately, this is not possible with present implementations of quantum annealers; i.e., only $H_P$ is encoded in QAC.
NQAC encodes the logical Hamiltonian in two steps: (1) an "encoding" step, which maps logical qubits to code qubits, and (2) an "embedding" step, which maps code qubits to physical qubits, e.g., those on the Chimera graph of the D-Wave quantum annealer. These two steps are depicted in Fig. 4.1 for a fully connected problem with 4 vertices. For the encoding step, the repetition code has length $C$, which we also refer to as the "nesting level" for reasons apparent from Fig. 4.1. $C$ determines the amount of hardware resources (qubits, couplers, etc.) used and allows the error correction method to be scaled, providing protection against thermal and control errors. Each logical qubit $i$ ($i = 1, \ldots, N$) in $H_P$ is represented by a $C$-tuple of code qubits $(i, c)$, with $c = 1, \ldots, C$.

Figure 4.1: Illustration of NQAC. The left panel represents the logical problem, the middle panels represent the encoded Hamiltonian with nesting level $C = 2, 4$, and the right panel represents the encoded physical problem embedded onto the D-Wave Chimera graph. Thick bright red lines in the middle and right panels represent $\gamma_1$, the ferromagnetic coupling between code qubits in the nested Hamiltonian. Thick brown lines in the right panel represent $\gamma_2$, the ferromagnetic coupling between physical qubits in the encoded problem. The logical, encoded, and physical problems use $N$, $C \times N$, and $C \times N \times L$ qubits, respectively, and $N(N-1)/2$, $CN(CN-1)/2$, and $CN(CN-1)/2 + CN(L-1)$ couplers, respectively, where $L$ is the chain length of the embedding.

The values of the "nested couplers" $\tilde{J}_{(i,c),(j,c')}$ and "nested local fields" $\tilde{h}_{(i,c)}$ in terms of the logical problem parameters and $C$ are given as follows:

$\tilde{J}_{(i,c),(j,c')} = J_{ij}, \quad \forall c, c'; \; i \neq j,$   (4.2a)
$\tilde{h}_{(i,c)} = C h_i, \quad \forall c, i,$   (4.2b)
$\tilde{J}_{(i,c),(i,c')} = -\gamma_1, \quad \forall c \neq c'.$   (4.2c)

Code qubits $c$ and $c'$ that belong to different logical qubits $i$ and $j$ are coupled with the strength of the original logical problem couplings $J_{ij}$. Since each logical qubit is encoded in $C$ code qubits, there are a total of $C^2$ nested couplers $\tilde{J}_{(i,c),(j,c')}$ between all the $C$ code qubits that comprise two logical qubits $i$ and $j$. In order to keep the overall energy scale constant, each nested local field $\tilde{h}_{(i,c)}$ is set to be $C$ times as large as the logical problem local field. Finally, in order to facilitate alignment of the $C$ code qubits that make up a logical qubit $i$, $C(C-1)/2$ ferromagnetic couplings $\tilde{J}_{(i,c),(i,c')}$ (couplings between code qubits corresponding to the same logical qubit $i$) are introduced, the strength of which is given by the penalty $\gamma_1 > 0$ (a sketch of this construction is given below).

The encoded problem must then be implemented on physical QA hardware, which typically has less than full connectivity. Because we collected the results in this work on the D-Wave quantum annealers (described in Sec. 4.2.6), we used minor embedding [120, 121, 60] onto the Chimera hardware graph. The minor embedding step replaces each code qubit by a ferromagnetically coupled chain of physical qubits, with the intra-chain coupling for physical qubits given by another penalty term $\gamma_2 > 0$, which helps keep physical qubits aligned. Finally, in order to extract states in terms of the original logical problem, a proper decoding scheme must be used.
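Before turning to decoding, the encoding step of Eqs. (4.2a)-(4.2c) can be made concrete with a short sketch (our own illustrative code, not the code used to program the device; the negative sign of the penalty term assumes the ferromagnetic convention of Eq. (1.1)):

```python
import numpy as np

def nest_problem(h, J, C, gamma1):
    """Map a logical Ising problem (h, J) to the nested problem at nesting
    level C per Eqs. (4.2a)-(4.2c). Code qubit (i, c) gets index i*C + c;
    J is an N x N array of logical couplings."""
    N = len(h)
    h_nest = np.repeat(C * np.asarray(h, dtype=float), C)  # h~_(i,c) = C * h_i
    J_nest = np.zeros((N * C, N * C))
    for i in range(N):
        for j in range(N):
            for c in range(C):
                for cp in range(C):
                    a, b = i * C + c, j * C + cp
                    if i != j:
                        J_nest[a, b] = J[i][j]   # J~_(i,c),(j,c') = J_ij
                    elif c != cp:
                        J_nest[a, b] = -gamma1   # ferromagnetic penalty
    return h_nest, J_nest
```

The subsequent minor-embedding step would further replace each code qubit by a chain of physical qubits coupled with strength $\gamma_2$.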
In this work, we used a majority vote to go from physical qubits to code qubits, and another majority vote to go from the code qubit states to the logical problem states. Other decoding strategies, such as local energy minimization, are also possible [114].

4.1.2 Boltzmann machines

We combined NQAC with the training of a Boltzmann machine. Boltzmann machines are a class of probabilistic graphical models that can be stacked together as deep belief networks or as deep Boltzmann machines [122, 123]. Deep learning methods such as these have the potential to learn complex representations of the data, which is useful for applications such as image processing or speech recognition. Typically, Boltzmann machines include both hidden and visible units, which give greater modeling power. However, we restrict ourselves to fully-visible Boltzmann machines, which are less powerful, but allow us to demonstrate the effect of NQAC.

A Boltzmann machine is defined on a graph $G = (V, E)$ with binary variables on the vertices $V$. Let $N = |V|$ be the number of vertices. In exact analogy to Eq. (1.1), the energy of a particular configuration $\mathbf{x} = (x_1, \ldots, x_N)$ is

$E(\mathbf{x}) = \sum_{i \in V} b_i x_i + \sum_{(i,j) \in E} w_{ij} x_i x_j,$   (4.3)

where $b_i$ (the biases) and $w_{ij}$ (the weights) are the parameters of the model to be learned from training. Each $x_i$ is a binary variable, and to be consistent with quantum annealing conventions, we define $x_i = \pm 1$ (spins) instead of 0 or 1, so that $\mathbf{x} \in \{-1, 1\}^N$. The associated Boltzmann or Gibbs distribution is

$P(\mathbf{x}) = \frac{\exp[-\beta E(\mathbf{x})]}{Z}, \quad Z = \sum_{\mathbf{x}} \exp[-\beta E(\mathbf{x})],$   (4.4)

where $\beta = 1/T$ is the inverse temperature and $Z$ is the partition function. To use a Boltzmann machine for machine learning, we assume that a given training dataset is generated according to some probability distribution. The goal of the machine learning procedure is to find the set of parameters (i.e., the $b_i$'s and $w_{ij}$'s) that best models this distribution. Typically this is done by maximizing the log-likelihood over the data distribution (or minimizing the negative log-likelihood):

$\mathcal{L} = \sum_{\mathbf{x} \in \mathcal{D}} \log P(\mathbf{x}),$   (4.5)

where $\mathcal{D}$ represents the target training data distribution. Taking gradients with respect to the model parameters yields the following update rule:

$b_i(t+1) = b_i(t) + \eta \, \frac{\partial \mathcal{L}}{\partial b_i},$   (4.6a)
$w_{ij}(t+1) = w_{ij}(t) + \eta \, \frac{\partial \mathcal{L}}{\partial w_{ij}},$   (4.6b)

where $\eta$ is the learning rate that controls the size of the update, and the gradients are

$\frac{\partial \mathcal{L}}{\partial b_i} = \langle x_i \rangle_{\mathcal{D}} - \langle x_i \rangle_{\mathcal{M}},$   (4.7a)
$\frac{\partial \mathcal{L}}{\partial w_{ij}} = \langle x_i x_j \rangle_{\mathcal{D}} - \langle x_i x_j \rangle_{\mathcal{M}},$   (4.7b)

where $\langle \cdot \rangle$ denotes the expectation value with respect to the appropriate distribution ($\mathcal{M}$ denotes the current model). In practice, rather than updating the model parameters based on averages over the entire dataset, "mini-batches" of a fixed number of training samples are used to reduce computational cost and insert some randomness. One pass through all the training samples is referred to as an epoch. In general, training proceeds as follows (see the sketch below): (1) initialize the problem parameters ($b_i$ and $w_{ij}$) to some random values; (2) obtain averages by sampling from the current problem parameters; (3) update the parameters based on Eqs. (4.7a) and (4.7b); (4) repeat steps (2)-(3) until convergence or until a certain number of epochs is reached.

The first term in the gradient step, $\langle \cdot \rangle_{\mathcal{D}}$, is referred to as the positive phase, and is an average over the training data distribution; the second term, $\langle \cdot \rangle_{\mathcal{M}}$, is referred to as the negative phase and is an average over the current model. The positive phase serves to move the probability density towards the training data distribution, and the negative phase moves probability away from states that do not match the data distribution.
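A minimal training loop implementing steps (1)-(4) might look as follows (a sketch: `sample_model` is an assumed stand-in for whatever sampler, e.g. a quantum annealer or an MCMC routine, returns configurations from the current Gibbs distribution; mini-batching is omitted for brevity):

```python
import numpy as np

def train_bm(data, sample_model, n_epochs=10, eta=0.01, rng=None):
    """Gradient-ascent training of a fully visible Boltzmann machine
    following Eqs. (4.6)-(4.7). `data` has shape (S, N) with entries +/-1;
    `sample_model(b, W)` must return an (M, N) array of model samples."""
    rng = rng if rng is not None else np.random.default_rng()
    S, N = data.shape
    b = rng.normal(scale=0.1, size=N)                    # biases, step (1)
    W = np.triu(rng.normal(scale=0.1, size=(N, N)), 1)   # upper-tri weights
    for epoch in range(n_epochs):
        model = sample_model(b, W)                       # step (2): sample
        x_d, xx_d = data.mean(axis=0), (data.T @ data) / S        # positive phase
        x_m, xx_m = model.mean(axis=0), (model.T @ model) / len(model)  # negative phase
        b += eta * (x_d - x_m)                # step (3), Eqs. (4.6a)/(4.7a)
        W += eta * np.triu(xx_d - xx_m, 1)    # step (3), Eqs. (4.6b)/(4.7b)
    return b, W
```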
Calculating exact averages for the negative phase becomes intractable as the number of possible states grows exponentially. Nevertheless, the averages can still be estimated if there is an efficient way to perform sampling. Classically, this is done by Markov chain Monte Carlo methods, or by replacing the likelihood function altogether with a different objective function called contrastive divergence (whose derivatives with respect to the parameters can be approximated accurately and efficiently) [124]. Using a quantum annealer to prepare a Boltzmann distribution may also be a viable option [40]. However, using a quantum annealer introduces another obstacle: the temperature of the distribution. In the classical approach, $\beta$ in Eq. (4.4) is set to 1, as it amounts to an overall change in the energy scale, which can easily be rescaled. For quantum annealers, the magnitude of the problem parameters depends on physical quantities, such as a biasing current; hence, the model parameters cannot become arbitrarily large, and $\beta$ plays an important role. Moreover, the effective temperature of the sampled distribution is expected to grow as the problem size grows [103]. Because of these factors, one might expect to find a sweet spot in effective temperature for quantum annealers. When the effective temperature of a sampler is too high, the samples drawn will look essentially random and will not provide any meaningful training. When the effective temperature is too low, the samples will resemble delta functions on the training examples, and thus the model may fail to generalize to data outside the training set. We regard NQAC as a method that offers more control over the effective temperature and thus improves performance in certain temperature regimes.

4.2 Methods

4.2.1 Datasets and performance metrics

In this work we explored both supervised and unsupervised machine learning with two different datasets.

4.2.1.1 Supervised machine learning of MNIST

First, we trained a Boltzmann machine to do supervised machine learning with a coarse-grained version of the MNIST dataset, a set of hand-written digits. In supervised machine learning, the dataset consists of input states and response variables (labels), i.e., $\mathcal{D}_{\mathrm{TRAIN}} = \{(\mathbf{x}_n, y_n)\}_{n=1}^{S}$, where $S$ is the size of the training dataset. The goal of the supervised machine learning task is to learn some function $f$ that maps inputs to response variables such that $y_n = f(\mathbf{x}_n)$. For our purposes $\mathbf{x}_n$ corresponds to the $n$th coarse-grained image and $y_n$ is the corresponding label of the image (i.e., an integer digit). Our metric of performance for the MNIST dataset is classification accuracy. Due to limitations on the number of qubits available on the DW processor, we coarse-grained the original MNIST digits, which are 28 x 28 pixels, to 4 x 4 images (see Fig. 4.2 for some examples of coarse-grained images). We then removed the corners of the coarse-grained images and only selected images that were labeled 1-4, adding four "label" bits to each image (i.e., the last four bits set to 1000 means the image is labeled "1", 0100 represents "2", 0010 represents "3", and 0001 represents "4"). After this, each $\mathbf{x}_n$ is a vector of length 16. Of the 50,000 training images in the original dataset, we used 5000 samples for training and about 2500 for testing. A sketch of this preprocessing is given below.
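The preprocessing just described can be reproduced with simple block averaging (a sketch of one plausible implementation; the exact down-sampling method and the spin convention for the label bits are our assumptions):

```python
import numpy as np

def coarse_grain(img28, threshold=0.5):
    """Reduce a 28x28 grayscale MNIST image (values in [0, 1]) to a 4x4
    binary +/-1 image by averaging 7x7 blocks and thresholding."""
    blocks = img28.reshape(4, 7, 4, 7).mean(axis=(1, 3))  # 4x4 block means
    return np.where(blocks > threshold, 1, -1)

def to_input_vector(img28, label):
    """Flatten the coarse-grained image, drop the four corner pixels, and
    append a four-bit one-hot encoding of labels 1-4, giving a length-16 x_n."""
    x = coarse_grain(img28).flatten()
    x = np.delete(x, [0, 3, 12, 15])     # corners of the flattened 4x4 image
    one_hot = -np.ones(4)
    one_hot[label - 1] = 1               # e.g. label 1 -> (1, -1, -1, -1)
    return np.concatenate([x, one_hot])
```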
Classification accuracies were evaluated on the held-out test dataset, which was unseen during training. Labels were predicted by "clamping" the values in the logical problem Hamiltonian to the input [40, 42] and sampling from the clamped Hamiltonian $H_{\mathrm{clamp}} = H_P(\sigma^z_i = x_i, \sigma^z_j = x_j)$, where $i, j \in \{1, \ldots, 12\}$ and $x_i, x_j$ are the image pixels (recall that the last four bits were used as label bits). Because we obtain $H_{\mathrm{clamp}}$ by setting the values $\sigma^z_i$ and $\sigma^z_j$ of $H_P$ [Eq. (1.1)] to the values of the pixels in the image, $H_{\mathrm{clamp}}$ is only defined on the last four label qubits, and the effect of clamping is to add a local field to the remaining unclamped label qubits. In order to determine the predicted label $\hat{y}_n$ of the $n$th image, we then sampled from the four unclamped label qubits, as follows. We acquired $K = 100$ anneals from DW per image, with the $k$th anneal producing a vector $\mathbf{l}_k$ of length four, corresponding to the four possible labels. Note that in principle, depending on the value of $H_{\mathrm{clamp}}$ and due to the probabilistic nature of quantum annealing, $\mathbf{l}_k$ could have more than one non-zero entry. We assigned the predicted label via $\hat{y}_n = \arg\max \frac{1}{K} \sum_{k=1}^{K} \mathbf{l}_k$, where the arg max is over the four indices of $\mathbf{l}_k$; i.e., we selected the label qubit with the largest appearance frequency. Once we obtained the predicted class label, we compared it to the true label and calculated the classification accuracy:

$\mathrm{acc} = \sum_{n=1}^{S'} \frac{1}{S'} \, \mathbb{1}(y_n = \hat{y}_n),$   (4.8)

where $\mathbb{1}$ is the indicator function of whether the predicted label is the same as the true label, and $S'$ is the size of the test dataset.

4.2.1.2 Unsupervised machine learning of Bars and Stripes

We also trained a Boltzmann machine to do unsupervised learning on a "bars and stripes" (BAS) dataset [125], which has a very simple underlying probability distribution. In unsupervised machine learning, the data is provided without a response variable (i.e., $\mathcal{D} = \{\mathbf{x}_n\}_{n=1}^{S}$), and the goal is to learn the underlying structure within the data. A BAS image is generated by first selecting an orientation (rows or columns), and then setting all pixels in each row or column to 1 with probability 1/2 (see Figure 4.2 for some examples). We constructed a dataset of 4 x 4 images, and our metric of performance for BAS is the "empirical" log-likelihood, i.e., the log-likelihood of the distribution returned by drawing actual samples from DW.

Figure 4.2: Some representative images of the training datasets: BAS (left) and coarse-grained MNIST (right). Starting from the upper left and going across the rows and down the columns, the digits are {3, 6, 1, 5}, {2, 6, 7, 8}, {3, 1, 6, 4}, {0, 6, 8, 7}.

More formally, let $Q(\mathbf{x})$ be the empirical probability of an image (or state) $\mathbf{x}$ in the distribution returned from a quantum annealer. We define the empirical log-likelihood as

$\tilde{\mathcal{L}} = \sum_{\mathbf{x} \in \mathcal{D}} \log Q(\mathbf{x}).$   (4.9)

Note that the "exact" log-likelihood in Eq. (4.5) is defined in terms of the exact Boltzmann distribution for the current model parameters, whereas the empirical log-likelihood defined above is defined in terms of the empirical probability distribution returned by a quantum annealer. To calculate the empirical log-likelihood, we sampled from DW $10^4$ times; i.e., $Q(\mathbf{x})$ was determined by finding the frequencies of the images of the training distribution among the $10^4$ samples returned by DW.
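Both the BAS generation and the empirical log-likelihood of Eq. (4.9) are simple to express in code (an illustrative sketch; the pseudo-count for images never returned by the sampler is our assumption, since zero counts would otherwise give log 0):

```python
import numpy as np
from collections import Counter

def bas_image(n=4, rng=None):
    """One n x n bars-and-stripes image with +/-1 pixels: each row (or,
    after a random transpose, each column) is on with probability 1/2."""
    rng = rng if rng is not None else np.random.default_rng()
    line = rng.choice([-1, 1], size=n)
    img = np.tile(line[:, None], (1, n))          # constant rows: bars
    return img.T if rng.random() < 0.5 else img   # transpose -> stripes

def empirical_log_likelihood(data, samples):
    """Eq. (4.9): sum over training images of the log of their empirical
    frequency among the sampler's outputs (both as flattened +/-1 arrays)."""
    counts = Counter(map(tuple, samples))
    M = len(samples)
    # images never sampled get a pseudo-count of 1 to avoid log(0)
    return sum(np.log(max(counts[tuple(x)], 1) / M) for x in data)
```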
4.2.2 Temperature estimation

Based on the intuition that a higher nesting level should result in a lower effective temperature, we estimated the temperature by comparing the distribution of energies of the decoded logical problem to an exact Gibbs distribution on the logical problem Hamiltonian. To do so, we found the value of $\beta$ such that the total variation distance

$d(\beta) = \frac{1}{2} \sum_{E_i} |Q(E_i) - P(E_i; \beta)|$   (4.10)

between the empirical probability distribution $Q$ and the exact Gibbs distribution $P$ [Eq. (4.4)] is minimized, i.e.,

$\beta_{\mathrm{eff}} = \arg\min_{\beta} d(\beta).$   (4.11)

We calculated the distribution distance in terms of energy levels, not in terms of distinct states, in order to speed up the calculation. We also note that this method of estimating the temperature can only be used for sufficiently small problems, because of the challenge of computing the partition function. The data we used were composed of 16 pixels, and thus calculating the exact Boltzmann probabilities is still feasible. (A sketch of this estimation procedure is given at the end of Sec. 4.2.4.)

4.2.3 Distance from data

For some of our results, we also plot the distance to the target data distribution, instead of the distance from a Gibbs distribution at a given $\beta$. This quantity is defined as follows:

$d_{\mathrm{data}} = \frac{1}{2} \sum_{E_i} |Q(E_i) - R(E_i)|,$   (4.12)

where $R(E_i)$ is the frequency of finding a state with energy $E_i$ in the data distribution; i.e., we first calculated the energies of every state that appeared in the data distribution under the current model parameters. Unlike the distance from Gibbs, the distance from data does not depend on a value of the effective temperature; we calculated the distance this way to be consistent with how the distribution distance from Gibbs is calculated. The distance to the target data distribution gives more information about how well an algorithm is learning the data distribution (and thus has a similar interpretation to the empirical log-likelihood), whereas the distance from a Gibbs distribution provides more information about the quality of the samples drawn. With a perfect noiseless sampler and the assumption that the data can be modeled by a perfect Gibbs distribution, these two quantities should be perfectly correlated. However, as will be seen in our results, this is not always the case when dealing with imperfect samplers; the sampled distributions may be somewhat far from Gibbs and yet closely resemble the target data distribution.

4.2.4 Training procedure

For the sake of comparison, we used a fixed set of initial weights and biases (the model parameters). First, to compare the best performance at each nesting level $C$, we need to find the optimal values of the penalty terms $\gamma_1$ and $\gamma_2$. To do so, we performed a grid search from 0.2 to 1.0 in steps of 0.2 to find the combination of $\gamma_1$ and $\gamma_2$ that gave the best empirical log-likelihood for the BAS dataset and the best classification accuracy for MNIST, at each nesting level ($\gamma_1$ and $\gamma_2$ are upper bounded by a value of 1 due to hard constraints on the physical processor). The optimal $\gamma_1$ and $\gamma_2$ may be different at different nesting levels. After finding the optimal $\gamma_1(C)$ and $\gamma_2(C)$, we reinitialized the model parameters and trained for 10 epochs. All runs with DW were done with an anneal time $t_f$ of 10 µs, except when we studied the dependence on $t_f$, and the learning rate $\eta$ [Eq. (4.6)] was set to 0.01. In order to mimic the effect of using NQAC at different temperatures, we introduced a scaling parameter $0 < \alpha \leq 1$ for the logical problem Hamiltonian, i.e., $H_P \mapsto \alpha H_P$. A smaller $\alpha$ emulates a higher bath temperature or more thermal noise. In practice, this rescaling is implemented on the final physical Hamiltonian.
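As referenced in Sec. 4.2.2, the effective-temperature estimate of Eqs. (4.10)-(4.11) can be implemented as a simple scan over $\beta$ (a sketch; exhaustive enumeration of all $2^N$ states is only feasible for small problems, as noted above):

```python
import numpy as np
from itertools import product

def beta_eff(sample_energies, h, J, betas=np.linspace(0.01, 5.0, 500)):
    """Return the beta minimizing the total variation distance of Eq. (4.10)
    between the empirical energy histogram and the exact Gibbs distribution.
    J is assumed upper-triangular, as in the sketches above."""
    N = len(h)
    states = np.array(list(product([-1, 1], repeat=N)))   # all 2^N states
    E_all = states @ np.asarray(h) + np.einsum('si,ij,sj->s', states, np.asarray(J), states)
    levels, idx = np.unique(np.round(E_all, 10), return_inverse=True)
    q = np.zeros(len(levels))                 # empirical distribution Q(E_i)
    for E in sample_energies:
        q[np.argmin(np.abs(levels - E))] += 1
    q /= q.sum()
    best_beta, best_d = None, np.inf
    for beta in betas:
        w = np.exp(-beta * (E_all - E_all.min()))  # shifted to avoid overflow
        p = np.bincount(idx, weights=w)
        p /= p.sum()                          # Gibbs distribution P(E_i; beta)
        d = 0.5 * np.abs(q - p).sum()         # total variation distance
        if d < best_d:
            best_beta, best_d = beta, d
    return best_beta
```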
4.2.5 Classical repetition vs NQAC

Because higher nesting levels use more qubits than $C = 1$, it may be fairer to compare the performance of NQAC at a nesting level $C > 1$ to the $C = 1$ problem replicated to use approximately the same amount of physical resources (recall from Fig. 4.1 that the physical problem uses $C \times N \times L$ qubits and at least $CN(CN-1)/2 + CN(L-1)$ couplers, where $C$ is the nesting level, $N$ is the size of the logical problem, and $L$ is the length of the chain needed for the embedding). To implement this, we created $M_C$ replicas, where $M_C$ is the closest integer multiple of the resources needed for nesting level $C$ compared to $C = 1$. $C = 2$ uses about 4x the number of physical qubits of $C = 1$, and $C = 3$ uses about 8x the number of physical qubits of $C = 1$; thus $M_2 = 4$ and $M_3 = 8$. Then, at each training iteration, we generated $M_C$ replicas of the $C = 1$ encoded physical problem and performed the update of the parameters according to Eqs. (4.7a) and (4.7b). We then sampled from the updated parameters, selected the replica whose parameters gave the best empirical log-likelihood, and set the parameters of the remaining replicas to those of the best-performing replica. We repeated this process until convergence was reached.

4.2.6 Technical Details of D-Wave

For the results in this chapter we used the DW2000Q managed by NASA Ames, housed in Moffett Field. This processor has a total of 2031 qubits and a Chimera graph as in Fig. 1.2.

4.2.7 Ramping

In addition to evaluating the effect of NQAC on machine learning performance, we also explored various control parameters affecting the anneal. Specifically, we used the "ramp" feature of the DW2000Q devices, which allows for a fast quench of the annealing schedule at intermediate points in the anneal. The ramp allows us to rapidly change the annealing parameters $A(s_{\mathrm{int}})$ and $B(s_{\mathrm{int}})$ at some intermediate time $0 < s_{\mathrm{int}} < 1$ to their final values at $s = 1$. The annealing parameters cannot be changed instantaneously, so ramping takes at least 1 µs. To probe the distributions at intermediate points in the anneal, we ramped between $s_{\mathrm{int}} = 0.1$ and 0.9 in steps of 0.1 with a 1 µs ramp. To collect statistics, for each ramp point we repeated the anneal 20 times. The results are presented in Figures 4.7 and 4.8.

4.2.8 SQA and SVMC Simulations

We also compared performance to simulated quantum annealing (SQA) [82] and spin-vector Monte Carlo (SVMC) [126] at intermediate points in the anneal. SQA is a Monte Carlo method that is unable to simulate quantum dynamics but can capture thermal averages at intermediate points in the anneal. SVMC also uses Monte Carlo updates, but replaces the qubits with classical angles; it can be considered the semi-classical limit of quantum annealing. For both SQA and SVMC, we used a temperature of 15 mK. To simulate the effect of the ramping described above, we used 500 sweeps to change the annealing parameters from their intermediate values to their final values. For each Hamiltonian at each ramp point, we ran both SQA and SVMC with between 2500 and $10^7$ sweeps (varied on a logarithmic scale) on the encoded physical problem Hamiltonian sent to DW. We then selected the number of sweeps that gave an effective inverse temperature on the logical problem distribution closest to DW's for each Hamiltonian at each ramp point.
Because each of the (noisy) anneals of D-Wave produces a slightly different distribution, resulting in a slightly different effective temperature and distance from Gibbs, instead of adding noise to the SQA and SVMC simulations we ran SQA and SVMC once at each ramp point for each final Hamiltonian and selected the number of sweeps that gave the closest effective temperature for each of the 20 noisy realizations of D-Wave.

4.3 Results

In this section we present our results testing NQAC's ability to improve machine learning performance on the BAS dataset and the coarse-grained MNIST dataset. We consider performance as a function of the rescaling parameter $\alpha$ and the anneal time $t_f$, for different NQAC nesting levels. We also study the dependence on the ramp point, and compare to SQA and SVMC simulations.

4.3.1 Performance as a function of the rescaling parameter

Figure 4.3: Performance using NQAC on the BAS dataset. A small advantage with increasing nesting level $C$ is noticeable for the BAS dataset at intermediate values of the rescaling parameter $\alpha$. All runs were done with an anneal time $t_f$ of 10 µs. Here and in all other plots, unless noted otherwise, error bars represent one standard deviation. The maximum possible value of the empirical log-likelihood for the BAS dataset used is -3.41. The empirical log-likelihood is computed using Eq. (4.9). The middle and right panels compare the performance of a classical repetition versus NQAC (see Sec. 4.2.5 for a description).

Figure 4.3 shows the empirical log-likelihood on the BAS dataset versus the scaling parameter $\alpha$, defined via $H_P \mapsto \alpha H_P$ (see Sec. 4.2.4), for different nesting levels $C$. As shown in the left panel of Fig. 4.3, the empirical log-likelihood for the BAS dataset exhibits a very small but statistically significant improvement at intermediate values of $\alpha$. At $\alpha = 10^{-3}$ there is no improvement in using NQAC; this indicates that when the effective temperature is too high, NQAC will not provide much benefit. At intermediate values of $\alpha$, there is a very modest but noticeable advantage in using NQAC, though the improvement seems to decrease as $\alpha$ continues to increase. Note, however, that the error bars also shrink with $\alpha$. At $\alpha = 1$ nesting is not helpful. This may be due to entering the "penalty-limited regime", as was also the case in previous NQAC work [112]. In this regime, the optimal strength of $\gamma_1$ and $\gamma_2$ is greater than 1, but limitations of the physical annealer prevent the penalties from reaching their optimal values; increasing $\gamma_1$ and $\gamma_2$ would amount to reducing $\alpha$.

The slight improvement for higher nesting level $C$ remains when comparing to a classical repetition (middle and right panels of Fig. 4.3) at intermediate values of $\alpha$. In the repetition-code case we simply used $C = 1$ multiple times and took the best of these repetitions, as in previous work [108, 109, 107] (see Sec. 4.2.5 for details of how we implemented the classical repetition). At very low values of $\alpha$, the classical repetition code performs better than NQAC. However, there is a trend for greater benefit in using NQAC versus a simple repetition at higher nesting level: the difference between the mean performance of NQAC and classical repetition increases, and the variance decreases, for $C = 3$ compared to $C = 2$.
For $C = 3$ there is a small but statistically significant advantage in using NQAC at intermediate values of $\alpha$.

The small improvement in performance for the BAS dataset at intermediate $\alpha$ values can be associated with a decrease in the effective temperature. Figure 4.4 shows the empirical log-likelihood, closest inverse temperature, and distance from a Gibbs distribution at each training epoch (the results shown in Fig. 4.3 are for the last epoch); improvement in log-likelihood correlates very well with an increase in the effective inverse temperature. We observe three regimes: (i) very small $\alpha$ (two leftmost columns), where the differences between results for different levels of nesting $C$ are not statistically distinguishable; (ii) intermediate $\alpha$ (two middle columns), where different $C$ values separate clearly and improvement is monotonic with $C$; and (iii) larger $\alpha$ (two rightmost columns), where the ordering with $C$ is flipped. A consistent trend across all values of $\alpha$ is the increase in distribution distance from Gibbs with epoch number; we investigate this in a later section. At higher $C$, performance is poor at large $\alpha$ since the accessible penalties are not strong enough. Across other values of $\alpha$, the effective inverse temperature increases with $\alpha$, but at $\alpha = 1$, the effective inverse temperature for $C = 2$ and $C = 3$ is lower than its value at $\alpha = 0.1$.

Figure 4.4: Detailed performance at the end of each training epoch for different values of the rescaling parameter $\alpha$ for the BAS dataset. Here and in the subsequent figures the effective inverse temperature (middle panel) is computed using Eq. (4.11), and the distribution distance from Gibbs (bottom panel) using Eq. (4.10).

For the MNIST data (Fig. 4.5), we see a slight improvement in classification performance at small and intermediate values of $\alpha$, i.e., when the effect of noise should be very high. As $\alpha$ increases, NQAC becomes less and less effective and eventually $C = 1$ outperforms higher nesting levels. Note that simple out-of-the-box classical methods like logistic regression or a nearest-neighbor classifier achieve accuracies of more than 90%. These results are not aimed at comparing to classical methods, but at demonstrating the use of NQAC in improving machine learning performance with a quantum annealer.

4.3.2 Results for the BAS dataset at different anneal times

Next, we present results on the BAS dataset at different anneal times, in part motivated by previous benchmarking results on optimization problems [127] that have shown the importance of finding an optimal anneal time, though we do not find such optimal times in this work. For each anneal time, we repeated the same procedure as in the previous subsection; i.e., we did a grid search in steps of 0.2 for the values of $\gamma_1$ and $\gamma_2$ and selected the values that gave the best performance at the end of 5 epochs. We then took those optimal values of the penalties and trained on the entire dataset for 10 epochs.

Figure 4.5: Performance using NQAC on the coarse-grained MNIST dataset. There is a small but significant advantage for using NQAC at intermediate values of $\alpha$ for $C = 3$. All runs were done with $t_f = 10$ µs.

Figure 4.6 shows the results at the end of training at a fixed $\alpha = 0.03$, an intermediate value for which we saw in Sec. 4.3.1 that NQAC improves performance with increasing $C$.
At $C = 1$, as seen in Fig. 4.6, trends are generally in accordance with expectations: increasing the anneal time results in a higher log-likelihood, lower distribution distance from the data, lower effective temperature (in accordance with Refs. [112, 113]), and generally lower distance from the Gibbs distribution. As $C$ increases, the trends for the log-likelihood, distance from data, and effective inverse temperature are the same, but the distance from Gibbs increases. In other words, despite the increase in distance from the Gibbs distribution with $C$, the measures of machine learning performance improve with longer anneal times, indicating that some meaningful learning is still taking place. In addition, there is a greater performance improvement at higher nesting levels, i.e., there is a greater difference in performance between $C = 1$ and higher $C$ at longer anneal times.

Figure 4.6: Effect of using NQAC with different anneal times (measured in microseconds) for the BAS dataset, for $\alpha = 0.03$. The maximum possible value of the empirical log-likelihood in this case is -3.41. Both the empirical log-likelihood and inverse temperature exhibit statistically significant growth with anneal time $t_f$ and nesting level $C$. The distance from Gibbs grows with $C$, and also with $t_f$ for $C = 2, 3$.

It should be noted that, prima facie, this is somewhat surprising, and not only because we expect closeness to the Gibbs distribution to result in better estimates of the negative-phase terms in Eq. (4.7). The adiabatic theorem for open quantum systems implies that longer anneal times will allow the system to thermalize better and therefore be closer to the quantum Gibbs distribution [100]. This should hold in particular at the end of the anneal. Thus, not only do we observe better performance (larger log-likelihood) despite a larger deviation from the Gibbs distribution, the deviation itself is unexpected in light of the adiabatic theorem. However, these apparent surprises can in fact be attributed to the quasi-static regime effect [67, 100, 101, 102, 103], whereby the distribution that DW reaches at an intermediate point in the anneal is not necessarily the same as the Gibbs distribution on the final logical problem. Namely, if the system reaches a Gibbs state at an intermediate point in the anneal and freezes there, then we should not expect agreement with the Gibbs state at the end of the anneal (which would correspond to the Gibbs distribution on the logical problem). In an effort to determine whether the dynamics are indeed freezing, we next discuss the results of probing the distribution during the anneal.

4.3.3 Results for the BAS dataset at different ramp points

In this subsection we present results obtained using the ramp feature of the DW2000Q devices (see Sec. 4.2.7 for more details). Figure 4.7 shows the effect of ramping at intervals of 0.1. For these results we did not repeat the machine learning procedure of Sec. 4.3.2 (i.e., starting from a random set of weights and biases and training for 10 epochs); instead, we used the final problem Hamiltonian obtained after training for 10 epochs with a fixed anneal time $t_f$. That is, each colored curve in Fig. 4.7 at a different $t_f$ corresponds to a different logical problem Hamiltonian; for results with a fixed Hamiltonian, see Fig. 4.8. Statistics were then obtained by sampling 20 times from each trained Hamiltonian at its corresponding $t_f$.

Figure 4.7: Comparison of nesting levels at different anneal times, with a ramp at various intermediate points in the anneal, for the BAS dataset.
A 1 µs ramp was used throughout; $\alpha$ was fixed at 0.03. Here error bars represent only one standard deviation, to improve readability. For all $C$ values there is a transition that becomes sharper with increasing anneal time. The transition point shifts to earlier in the anneal with increasing $C$; after the transition all quantities freeze, except at the largest anneal times. The trend with increasing anneal time is consistent throughout: the empirical log-likelihood, fidelity with data (1 - distance), and inverse temperature all increase. In contrast, the distance from Gibbs is at first smaller as a function of ramp point for larger $t_f$, then becomes larger for larger $t_f$ past the freeze-out point.

At all nesting levels there is a clear transition point, after which all of the plotted quantities remain nearly constant. Hence we call the transition point a "freeze-out" point, or a freezing transition. Beyond this point, the empirical log-likelihood [Eq. (4.9)], the distance from data [Eq. (4.12)], the distance from the Gibbs distribution [Eq. (4.10)], and the effective inverse temperature [Eq. (4.11)] do not change significantly for most anneal times. Note that this freeze-out point occurs earlier in the anneal with increasing $C$. This may happen because we are only able to encode the final Hamiltonian $\bar{H}_P$, which corresponds to increasing the scale of $B(s)$ relative to $A(s)$ [see Eq. (4.1)] [107]. As a result, the quantum critical point may be moved to an earlier point in the anneal. Note, though, that while the freeze-out point on the physical problem occurs earlier with increasing nesting level, the effective temperature of the logical problem is still decreasing.

Figure 4.8: As in Fig. 4.7, but with a fixed logical problem Hamiltonian. The trends are similar: freezing occurs earlier at higher nesting level. For $C = 3$, the distribution distance from Gibbs increases with longer anneal times. In addition, at longer anneal times, the distribution distance increases at later ramp points. This is probably due to 1/f noise.

This shift to an earlier freeze-out point can explain why we generally see increases in the distance from the Gibbs distribution with nesting level. At earlier freeze-out points the transverse field is larger, and with more qubits due to higher $C$, clusters of different sizes can form, such that instead of a freezing point what emerges is a freezing interval. In this case, the final distribution will depend on the dynamics in this interval and may not correspond to a Gibbs distribution at the end or at any particular point [40].

Changes in the freezing point notwithstanding, the behavior after the freeze-out point is suggestive of a tunneling event. When the transverse field is very strong, the spins are mostly random and there is very poor agreement with the target data distribution (second row of Fig. 4.7). However, the sharp change in distribution distances at the freezing point is consistent with a multi-spin flip event, in which spins flip such that the distribution more closely resembles the data distribution. This trend is consistent across nesting levels.

4.3.4 Comparison to SQA and SVMC at $C = 3$

Figures 4.7 and 4.8 show that in addition to the distribution distance increasing with anneal time, the distribution distance also increases slightly even after the freeze-out point, most noticeably at $C = 3$ and for the largest $t_f$ values.
In order to gain more insight into the behavior of DW at $C = 3$, and to explore whether tunneling plays a role in DW's performance, we compare our results to simulated quantum annealing (SQA) and spin-vector Monte Carlo (SVMC) simulations in Fig. 4.9 (see Sec. 4.2.8 for a description of how the SQA and SVMC simulations were run and compared to DW). The Hamiltonian changes are explained in the caption, so the proper comparison is to Fig. 4.7. Tables 4.1 and 4.2 show the mode of the number of sweeps at each ramp point (recall that we compared SQA and SVMC at each ramp point for each instance to 20 noisy realizations of D-Wave, and thus the closest $\beta_{\mathrm{eff}}$ could vary slightly from realization to realization; see Sec. 4.2.8). Before $s = 0.5$ the number of sweeps for both SQA and SVMC varies significantly, with no discernible trend for the longer anneal times. This is unsurprising, as freezing and the minimum gap likely occur before this point, and the contribution from the transverse field is still large. D-Wave probably does not ramp fast enough, so the trends in the number of sweeps in the early stages of the anneal are not consistent (as can be seen in Fig. 4.10). At later points in the anneal, the trends in the number of sweeps are reasonable: shorter anneal times require fewer sweeps.

For the most part, both SQA and SVMC exhibit remarkably similar behavior to DW at $C = 3$. There is a transition between $s = 0.3$ and $s = 0.4$, and the trends with anneal time of the empirical log-likelihood, distance from data, and effective inverse temperature all echo the DW trends. The distance from the Gibbs distribution generally increases with the final annealed Hamiltonian, but not as consistently as in the DW results. This is probably because we did not fine-tune the number of SQA and SVMC sweeps to match DW's results.

Figure 4.9: SQA and SVMC simulations versus ramp point. Error bars are one standard deviation. The "anneal time" in the legend corresponds to the final physical problem Hamiltonian with $C = 3$ trained by DW at the specified anneal time. The corresponding number of sweeps for SQA and SVMC was selected such that the effective inverse temperature of the distribution was closest to that found by DW at the specified anneal time. See Tables 4.1 and 4.2 for the numbers of sweeps.

A direct comparison of the DW, SQA, and SVMC results is presented in Fig. 4.10. There is very clear agreement between all three in almost every respect, except at early points in the anneal and at late points in the anneal for the Hamiltonians trained at longer anneal times (the rightmost columns in Fig. 4.10). At $s = 0.3$, DW differs significantly from SQA and SVMC: SQA and SVMC tend to have much lower empirical log-likelihoods and much smaller $\beta_{\mathrm{eff}}$. The reason for this may be an artifact of the length of the ramp. For DW, the shortest time in which the ramp can take place is 1 µs, and for SQA and SVMC we set the ramp at 500 sweeps. For the Hamiltonian trained at 5 µs, ramping at $s = 0.3$ means that the total time the system has been allowed to evolve is 1.5 µs, whereas the ramp is 1 µs. For the Hamiltonian trained with a total annealing time of 2000 µs, ramping at $s = 0.3$ means that the system is allowed to evolve for 600 µs before a 1 µs ramp.
In other words, if the ramp is too 76 Anneal time s = 0:3 s = 0:4 s = 0:5 s = 0:6 s = 0:7 s = 0:8 s = 0:9 s = 1:0 5 69751 18421 18421 18421 18421 18421 18421 18421 10 264105 35846 35846 35846 35846 35846 35846 35846 15 1000000 35846 35846 35846 35846 35846 35846 35846 20 1000000 35846 35846 35846 35846 35846 35846 35846 30 1000000 69751 69751 69751 69751 69751 69751 69751 40 18421 35846 35846 35846 35846 35846 35846 35846 50 135727 69751 69751 69751 69751 69751 69751 69751 100 35846 69751 69751 135727 69751 135727 135727 135727 300 9467 69751 69751 69751 135727 135727 135727 135727 1000 135727 264105 264105 264105 264105 264105 513912 513912 2000 4865 513912 513912 513912 513912 513912 513912 513912 Table 4.1: Number of sweeps used in SQA simulations for each final Hamiltonian (anneal time) and at different ramp points during the anneal (s). Anneal time s = 0:3 s = 0:4 s = 0:5 s = 0:6 s = 0:7 s = 0:8 s = 0:9 s = 1:0 5 1290000 5700 8600 8600 12900 12900 12900 12900 10 380000 8600 12900 12900 12900 19400 19400 19400 15 860000 12900 12900 12900 19400 19400 29200 29200 20 380000 12900 12900 19400 19400 19400 29200 29200 30 380000 19400 19400 29200 29200 29200 29200 44100 40 8600 66400 12900 12900 19400 19400 29200 29200 50 380000 29200 29200 29200 44100 44100 44100 66400 100 570000 44100 29200 44100 44100 66400 66400 100000 300 570000 570000 44100 44100 66400 66400 100000 100000 1000 3800 250000 100000 100000 250000 250000 380000 380000 2000 8600 570000 100000 250000 250000 250000 380000 380000 Table 4.2: Number of sweeps used in SVMC simulations for each final Hamiltonian (anneal time) and at different ramp points during the anneal (s). long, we may not have an accurate representation of the state of the system in the middle of the anneal. The ramp may still be too long for DW, such that even ats = 0:3 across all anneal times, the system is able to thermalize as the ramp is taking place. For long anneal times at later points in the anneal, the distance from Gibbs does not inrease as noticeably with ramp point for SQA and SVMC as it does with DW. For both SQA and SVMC, the distance from a Gibbs distribu- tion stays approximately the same after the “freeze-out” point, whereas DW’s distance from Gibbs increases. We attribute this to the presence of 1=f noise, which becomes more noticeable at longer anneals. The fact that SVMC and SQA gives nearly the same trends as DW suggests that for the most part DW is performing similarly to both SQA and SVMC, including in having larger distribution distances from Gibbs with 77 Figure 4.10: Direct comparison between DW, SQA, and SVMC for C = 3. Each column represents the final logical Hamiltonian after training with a particular anneal time (i.e., each column is a different color curve in Fig. 4.7). longer anneal times. Inasmuch as we trust SQA and SVMC as realistic simulations of the underlying physics, this increase in distribution distance is not an artifact of DW but consistent with other physicals models. We offer two possible explanations for the increase in distance from Gibbs. The first is that longer anneal times is allowing better thermalization, but to the physical problem. The embedding procedure preserves the structure of low-lying states but may not guarantee the ordering of higher-energy states. 
We offer two possible explanations for the increase in distance from Gibbs. The first is that longer anneal times allow better thermalization, but with respect to the physical problem. The embedding procedure preserves the structure of low-lying states but may not guarantee the ordering of higher-energy states. In the process of decoding the physical problem states to the encoded problem states and then to the logical problem states, the lowest part of the energy spectrum may be preserved, but the ordering of the higher-energy states may not be. As such, β_eff increases because more probability lies on the lower-energy states, yet the distribution in terms of the logical problem appears less Gibbs-like. The fact that the increase in distribution distance grows with larger C supports this idea: with larger C there are more physical qubits and hence more energy levels in the physical problem, but the same number of states in the logical problem. The second, less likely, explanation is that the system is thermalizing to a classical Gibbs state at some intermediate point in the anneal. The distance from Gibbs is calculated with reference to the final classical Hamiltonian, but if DW, SQA, and SVMC are thermalizing to a different but nearby Gibbs distribution, the distance may increase.

However, the overall excellent agreement between DW and SVMC does suggest that DW is behaving mostly classically. The fact that DW's performance can be simulated by SVMC suggests that tunneling is not playing a role, but this does not necessarily mean that tunneling cannot play a role at lower device temperatures. In Ref. [127], simulations demonstrated that at high temperature SQA and SVMC performed similarly, but at lower temperatures the two performed differently, most likely because SQA is able to simulate tunneling events, whereas SVMC can only make use of thermal hopping. The fourth row of Fig. 4.10 shows that SQA and SVMC are essentially finding the same states at all points in the anneal, which may be an indication that the temperature is too high to see a difference in performance.
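The "distance from Gibbs" used throughout this analysis can be illustrated with a short sketch. We assume here the total variation distance between the empirical distribution over decoded logical states and the Gibbs distribution of the final classical Hamiltonian; the distance measure actually used in this chapter may differ.

    import numpy as np
    from collections import Counter

    def distance_from_gibbs(samples, energies, beta):
        # Total variation distance between the empirical distribution of
        # decoded logical states (given as state indices) and the Gibbs
        # distribution at inverse temperature beta over the same states.
        energies = np.asarray(energies, dtype=float)
        p_emp = np.zeros(len(energies))
        for state, count in Counter(samples).items():
            p_emp[state] = count / len(samples)
        w = np.exp(-beta * (energies - energies.min()))
        p_gibbs = w / w.sum()
        return 0.5 * np.abs(p_emp - p_gibbs).sum()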
4.4 Discussion

Previous work has demonstrated that NQAC can offer an effective temperature reduction by trading more physical qubits for colder code qubits in optimization problems and in the estimation of gradients [112]. That work used a fixed Hamiltonian for a given problem size. Here, we have incorporated NQAC into the training of a machine learning model, in which the parameters of the Hamiltonian are iteratively adjusted to match some target distribution. We have shown, using both the BAS and MNIST datasets, that for intermediate values of the Ising Hamiltonian rescaling parameter, NQAC offers an increase in training performance with higher nesting level C (in agreement with previous results; when the rescaling parameter is large, encoding is limited by the strength of the penalties and does not provide an advantage). For these intermediate values of the rescaling parameter, we tested whether NQAC outperforms classical repetition for the BAS dataset, and found this to be the case at the largest accessible nesting level (C = 3), though the effect is small. However, we have also shown that the advantage of using NQAC increases with longer anneal times, though the behavior of the device becomes more classical with larger C.

In the unsupervised case we also studied how the improvement in performance correlates with the effective temperature of the associated Gibbs distribution; our results showed that the improved learning is associated with a decrease in the effective temperature. In addition, we found that longer anneal times increase machine learning performance across all nesting levels, but, counterintuitively, the distance of the sampled distribution from a Gibbs distribution on the final Hamiltonian increased with anneal time. To understand why, we used the ramp feature of the DW2000Q device to probe the state of the system at intermediate points in the anneal, and comparisons to SQA and SVMC to gain some physical insight into the sampling process. Overall, our results agree quite well with the previous understanding of the dynamics of a quantum annealer in the quasi-static regime: at some intermediate point in the anneal, the dynamics freeze, and the distribution (and various expectation values of the distribution) do not change significantly after this freezing point; the increase at higher levels of C with annealing time may be related to decoding the physical problem to the logical problem; and the increase in distance from Gibbs at later points in the anneal is probably due to 1/f noise. Excellent agreement with SQA and SVMC indicates that DW is sampling from the same semiclassical distribution at an intermediate point in the anneal, even at longer anneal times. For the BAS dataset this still improves the log-likelihood, but whether this will generally be the case for other data distributions is an open question.

The results presented here may not generalize to problems of arbitrary size, and it is also unclear whether increasing the nesting level will continue to provide an advantage. If the freeze-out point tends to shift earlier in the anneal with nesting level, at some point the structure of the final logical Hamiltonian may be lost. Moreover, as the freeze-out point moves earlier in the anneal, or if individual qubits do not freeze at the same time (which seems more likely with larger problems and longer chains), the final state distribution may not resemble a Gibbs distribution at any particular point (though some of these concerns may be addressed with control over individual qubit annealing schedules). However, limitations notwithstanding, we are hopeful that NQAC may help in finding a quantum advantage for noisy intermediate-scale quantum annealers.

Chapter 5

Conclusion

In this dissertation we have mainly studied the D-Wave quantum annealers, which accept problems formulated as a two-body Ising Hamiltonian, to assess the applicability of current quantum devices to real-world problems. We have shown that quantum annealing on physical hardware (not simulations) can be used to solve a variety of interesting, albeit usually simplified, problems.

In Chap. 2 we developed a machine learning model with quantum annealing to classify and rank the binding of transcription factors to different sequences of DNA. We also identified a situation where quantum annealing performs better relative to existing classical algorithms: when the amount of training data is small. Although this runs counter to the trend in machine learning, where classical models benefit from ever larger amounts of data, for applications with a small number of relevant samples, quantum annealing may have an advantage over standard machine learning approaches. However, DW's performance was nearly the same as that of both SA and SQA, classical algorithms that can be used to minimize the same Ising Hamiltonian. As such, we cannot ascribe the performance benefit to quantum annealing per se, but rather to formulating a regression problem as an Ising problem. Finally, we also demonstrated that the weights learned by the quantum annealer adequately capture previously known biological information about which nucleotides are important for binding.
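To make "formulating a regression problem as an Ising problem" concrete, consider the simplest version of such a mapping: least-squares regression restricted to binary weights w ∈ {0,1}^n is exactly a QUBO, which is equivalent to an Ising problem under the change of variables w_i = (1 + σ_i)/2. The sketch below is illustrative only; the formulation actually used in Chap. 2 also involves additional ingredients such as weight scaling and regularization.

    import numpy as np

    def regression_qubo(X, y):
        # Build Q so that w^T Q w = ||X w - y||^2 - ||y||^2 for w in {0,1}^n.
        # Expanding the square gives w^T (X^T X) w - 2 (X^T y)^T w; since
        # w_i^2 = w_i for binary variables, the linear term folds onto the
        # diagonal of Q.
        Q = (X.T @ X).astype(float)
        Q[np.diag_indices_from(Q)] -= 2.0 * (X.T @ y)
        return Q

    # Brute-force check on a tiny hypothetical instance.
    rng = np.random.default_rng(1)
    X = rng.normal(size=(10, 4))
    w_true = np.array([1, 0, 1, 1])
    y = X @ w_true  # zero-loss solution exists, so w_true minimizes the QUBO
    Q = regression_qubo(X, y)
    best_w = min((np.array([(b >> i) & 1 for i in range(4)]) for b in range(16)),
                 key=lambda w: w @ Q @ w)
    assert np.array_equal(best_w, w_true)

On hardware, Q would instead be handed to a sampler after conversion to Ising form; the point is that the discreteness of w is what the Ising formulation buys, which, as argued below, is plausibly the source of the robustness at small training-set sizes.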
In Chap. 3, we used PCA on a large "multi-omics" dataset to reduce its dimensionality to a size that could be handled by the D-Wave quantum annealers. We developed a new formulation of the Ising problem that solves the classification problem directly. Using this real biological data, we demonstrated that Ising-type approaches (including both DW and various classical solvers that solve the Ising problem) perform nearly the same as standard classical machine learning approaches for a variety of cancer types. In addition, as in Chap. 2, we demonstrated that the Ising-type algorithms perform better than classical methods when less training data is available. The fact that simple classical algorithms that use the Ising formulation are able to match the performance of the D-Wave quantum annealer may be disappointing for quantum annealing, but it nonetheless points to the utility of formulating machine learning tasks as an Ising problem for small problems. A possible reason for the better relative performance of the Ising-type algorithms with small amounts of training data is that the weights returned are inherently more robust to noise because they are discrete. The Ising-type approaches perform less well for multi-class classification, but this may be due to the larger amount of training data used, and possibly a breakdown of the approximation used in mapping the classification problem to an Ising.

In Chap. 4 we showed that quantum annealing with D-Wave can accurately predict the labels of blurry integers and (somewhat) reproduce simple distributions. The model that we used for our machine learning in this chapter depends on an effective temperature. We showed that we are able to slightly improve the performance of quantum annealing at certain levels of simulated noise by using a simple error correction scheme, nested quantum annealing correction, to trade qubits for a decrease in temperature. By comparing results to SQA and SVMC, we also gained a better understanding of the underlying physics of the machine at intermediate points in the anneal. For the particular problems studied, the D-Wave quantum annealer does not behave quantumly, or if it does, its performance cannot be distinguished from classical algorithms that do not account for any quantum effects (SVMC). Nevertheless, from our study of intermediate points in the anneal, DW seems to be preparing a Gibbs state on the logical problem, which then becomes frozen until the end of the anneal. This is, in general, not the Gibbs state of the programmed, final Hamiltonian, but the (classical) learning process is able to update the parameters of the final Hamiltonian such that the Gibbs state at the intermediate point in the anneal begins to closely resemble the desired target distribution.

In summary, we have demonstrated several examples of how to map real-world problems to existing quantum hardware. Although this work has not shown a definite quantum advantage, in the process we have learned a somewhat surprising lesson: by introducing constraints (a limited number of qubits and a two-body Ising problem) which actually make the computational problem more difficult theoretically, heuristic solvers on these more difficult problems give better performance under certain conditions; namely, when the amount of data is small. Similar benefits may be found in other areas of machine learning by mapping other problems to an Ising.
In Chap. 2 we mapped a regression problem to an Ising, and in Chap. 3 we mapped a classification problem to an Ising. There may be other classical algorithms for which making approximations and mapping to an Ising can lead to simple but better algorithms, quantum or not. As such, we are hopeful that continuing research with NISQ devices may generate useful classical algorithms. In addition, it may be that current devices are still too noisy to demonstrate quantum advantages. Innovation on the hardware side to reduce sources of noise, give better control of the temperature and precision of the machines, and allow for higher connectivity between physical qubits could make it easier to harness quantumness for a computational advantage. All things considered, our foray into the unknown should leave us not discouraged, but rather inspired to explore what quantum computing can do.

Reference List

[1] Johnson, M. W., et al. Quantum annealing with manufactured spins. Nature, 473, 194–198, (2011).
[2] Kelly, J. A preview of Bristlecone, Google's new quantum processor, (2018).
[3] Gil, D. The future is quantum, (2017).
[4] Cross, A. W., Bishop, L. S., Smolin, J. A., and Gambetta, J. M. Open Quantum Assembly Language. arXiv:1707.03429.
[5] Steiger, D. S., Häner, T., and Troyer, M. ProjectQ: An open source software framework for quantum computing. (2016).
[6] Wecker, D. and Svore, K. M. LIQUi|⟩: A software design architecture and domain-specific language for quantum computing, (2014).
[7] Svore, K., et al. Q#: Enabling scalable quantum computing and development with a high-level DSL. In Proceedings of the Real World Domain Specific Languages Workshop 2018, RWDSL2018, pages 7:1–7:10, New York, NY, USA, (2018). ACM.
[8] National Quantum Initiative, H.R. 6227, 115th Cong., (2018).
[9] Shor, P. W. Polynomial-time algorithms for prime factorization and discrete logarithms on a quantum computer. SIAM J. Comput., 26, 1484, (1997).
[10] Feynman, R. P. Simulating physics with computers. Intl. J. Theor. Phys., 21, 467, (1982).
[11] Feynman, R. P. Quantum mechanical computers. Optics News, 11, 11–20, (1985).
[12] Lloyd, S. Universal quantum simulators. Science, 273, 1073–1078, (1996).
[13] Abrams, D. S. and Lloyd, S. Simulation of many-body Fermi systems on a universal quantum computer. Phys. Rev. Lett., 79, 2586, (1997).
[14] Ortiz, G., Gubernatis, J. E., Knill, E., and Laflamme, R. Quantum algorithms for fermionic simulations. Phys. Rev. A, 64, 022319, (2001).
[15] Babbush, R., et al. Low-depth quantum simulation of materials. Physical Review X, 8, 011044, (2018).
[16] Aspuru-Guzik, A., Dutoi, A. D., Love, P. J., and Head-Gordon, M. Simulated quantum computation of molecular energies. Science, 309, 1704, (2005).
[17] Kassal, I., Jordan, S. P., Love, P. J., Mohseni, M., and Aspuru-Guzik, A. Polynomial-time quantum algorithm for the simulation of chemical dynamics. Proceedings of the National Academy of Sciences, 105, 18681–18686, (2008).
[18] Wang, H., Kais, S., Aspuru-Guzik, A., and Hoffmann, M. R. Quantum algorithm for obtaining the energy spectrum of molecular systems. Phys. Chem. Chem. Phys., 10, 5388–5393, (2008).
[19] Kassal, I. and Aspuru-Guzik, A. Quantum algorithm for molecular properties and geometry optimization. The Journal of Chemical Physics, 131, 224102, (2009).
[20] Reiher, M., Wiebe, N., Svore, K. M., Wecker, D., and Troyer, M. Elucidating reaction mechanisms on quantum computers. Proceedings of the National Academy of Sciences, 114, 7555–7560, (2017).
[21] Tranter, A., et al. The Bravyi–Kitaev transformation: Properties and applications.
International Journal of Quantum Chemistry, 115, 1431–1441, (2015).
[22] Sugisaki, K., et al. Quantum chemistry on quantum computers: A polynomial-time quantum algorithm for constructing the wave functions of open-shell molecules. The Journal of Physical Chemistry A, 120, 6459–6466, (2016). PMID: 27499026.
[23] Veis, L. and Pittner, J. Adiabatic state preparation study of methylene. The Journal of Chemical Physics, 140, 214111, (2014).
[24] Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. Neural Information Processing Systems, 25, (2012).
[25] Hinton, G., et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29, 82–97, (2012).
[26] Levinson, J., et al. Towards fully autonomous driving: Systems and algorithms. In 2011 IEEE Intelligent Vehicles Symposium (IV), pages 163–168, (2011).
[27] Esteva, A., et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature, 542, 115, (2017).
[28] Schafer, J. B., Konstan, J. A., and Riedl, J. E-commerce recommendation applications. Data Mining and Knowledge Discovery, 5, 115–153, (2001).
[29] Neven, H., Denchev, V. S., Rose, G., and Macready, W. G. Training a binary classifier with the quantum adiabatic algorithm. arXiv:0811.0416, (2008).
[30] Pudenz, K. L. and Lidar, D. A. Quantum adiabatic machine learning. Quantum Information Processing, 12, 2027–2070, (2013).
[31] Lloyd, S., Mohseni, M., and Rebentrost, P. Quantum algorithms for supervised and unsupervised machine learning. arXiv:1307.0411, (2013).
[32] Wittek, P. Quantum Machine Learning: What Quantum Computing Means to Data Mining. Elsevier, AP, (2014).
[33] Lloyd, S., Mohseni, M., and Rebentrost, P. Quantum principal component analysis. Nat. Phys., 10, 631–633, (2014).
[34] Rebentrost, P., Mohseni, M., and Lloyd, S. Quantum support vector machine for big data classification. Phys. Rev. Lett., 113, 130503, (2014).
[35] Wiebe, N., Kapoor, A., and Svore, K. M. Quantum deep learning. arXiv:1412.3489, (2014).
[36] Schuld, M., Sinayskiy, I., and Petruccione, F. An introduction to quantum machine learning. Contemporary Physics, 56, 172–185, (2015).
[37] Wiebe, N., Kapoor, A., and Svore, K. Quantum algorithms for nearest-neighbor methods for supervised and unsupervised learning. Quantum Information & Computation, 15, 0318–0358, (2015).
[38] Aaronson, S. Read the fine print. Nat. Phys., 11, 291–293, (2015).
[39] Adachi, S. H. and Henderson, M. P. Application of quantum annealing to training of deep neural networks. arXiv:1510.06356, (2015).
[40] Amin, M. H., Andriyash, E., Rolfe, J., Kulchytskyy, B., and Melko, R. Quantum Boltzmann machine. Physical Review X, 8, 021050, (2018).
[41] Biamonte, J., et al. Quantum machine learning. Nature, 549, 195–202, (2017).
[42] Benedetti, M., Realpe-Gómez, J., Biswas, R., and Perdomo-Ortiz, A. Quantum-assisted learning of graphical models with arbitrary pairwise connectivity. arXiv:1609.02542, (2016).
[43] Mott, A., Job, J., Vlimant, J.-R., Lidar, D., and Spiropulu, M. Solving a Higgs optimization problem with quantum annealing for machine learning. Nature, 550, 375, (2017).
[44] Preskill, J. Quantum computing in the NISQ era and beyond. Quantum, 2, 79, (2018).
[45] Shor, P. W. Fault-tolerant quantum computation. Proceedings of 37th Conference on Foundations of Computer Science, pages 56–65, (1996).
[46] Harris, R., et al.
Experimental investigation of an eight-qubit unit cell in a superconducting optimization processor. Phys. Rev. B, 82, 024511, (2010).
[47] Bunyk, P. I., et al. Architectural considerations in the design of a superconducting quantum annealing processor. IEEE Transactions on Applied Superconductivity, 24, 1–10, (2014).
[48] Kadowaki, T. and Nishimori, H. Quantum annealing in the transverse Ising model. Phys. Rev. E, 58, 5355, (1998).
[49] Farhi, E., et al. A quantum adiabatic evolution algorithm applied to random instances of an NP-complete problem. Science, 292, 472–475, (2001).
[50] Albash, T. and Lidar, D. A. Adiabatic quantum computation. Reviews of Modern Physics, 90, 015002, (2018).
[51] Kato, T. On the adiabatic theorem of quantum mechanics. J. Phys. Soc. Jap., 5, 435, (1950).
[52] Farhi, E., Goldstone, J., Gutmann, S., and Sipser, M. Quantum computation by adiabatic evolution. arXiv:quant-ph/0001106, (2000).
[53] Jansen, S., Ruskai, M.-B., and Seiler, R. Bounds for the adiabatic approximation with applications to quantum computation. J. Math. Phys., 48, 102111, (2007).
[54] Rønnow, T. F., et al. Defining and detecting quantum speedup. Science, 345, 420–424, (2014).
[55] Childs, A. M., Farhi, E., and Preskill, J. Robustness of adiabatic quantum computation. Phys. Rev. A, 65, 012322, (2001).
[56] Amin, M. H. S., Averin, D. V., and Nesteroff, J. A. Decoherence in adiabatic quantum computation. Phys. Rev. A, 79, 022107, (2009).
[57] Albash, T. and Lidar, D. A. Decoherence in adiabatic quantum computation. Phys. Rev. A, 91, 062320, (2015).
[58] Barahona, F. On the computational complexity of Ising spin glass models. J. Phys. A: Math. Gen, 15, 3241, (1982).
[59] Choi, V. Minor-embedding in adiabatic quantum computation: I. The parameter setting problem. Quant. Inf. Proc., 7, 193–209, (2008).
[60] Cai, J., Macready, W. G., and Roy, A. A practical heuristic for finding graph minors. arXiv:1406.2741, (2014).
[61] Robertson, N. and Seymour, P. Graph minors. III. Planar tree-width. Journal of Combinatorial Theory, Series B, 36, 49–64, (1984).
[62] Bengio, Y., Courville, A., and Vincent, P. Representation learning: A review and new perspectives, (2012).
[63] Erhan, D., et al. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11, 625–660, (2010).
[64] Katzgraber, H. G., Hamze, F., and Andrist, R. S. Glassy chimeras could be blind to quantum speedup: Designing better benchmarks for quantum annealing machines. Phys. Rev. X, 4, 021008, (2014).
[65] Venturelli, D., et al. Quantum optimization of fully connected spin glasses. Phys. Rev. X, 5, 031040, (2015).
[66] Hen, I., et al. Probing for quantum speedup in spin-glass problems with planted solutions. Phys. Rev. A, 92, 042325, (2015).
[67] Amin, M. H. Searching for quantum speedup in quasistatic quantum annealers. Physical Review A, 92, 052323, (2015).
[68] Slattery, M., et al. Absence of a simple code: how transcription factors read the genome. Trends in Biochemical Sciences, 39, 381–399, (2014).
[69] Shlyueva, D., Stampfel, G., and Stark, A. Transcriptional enhancers: from properties to genome-wide predictions. Nat Rev Genet, 15, 272–286, (2014).
[70] Stormo, G. and Zhao, Y. Determining the specificity of DNA-protein interactions. Nature Reviews Genetics, 11, 751–760, (2010).
[71] Li, R. Y., Di Felice, R., Rohs, R., and Lidar, D. A. Quantum annealing versus classical machine learning applied to a simplified computational biology problem. npj Quantum Information, 4, 14, (2018).
[72] Gordân, R., et al. Genomic regions flanking E-box binding sites influence DNA binding specificity of bHLH transcription factors through DNA shape. Cell Reports, 3, 1093–1104, (2013).
[73] Jolma, A., et al. Multiplexed massively parallel SELEX for characterization of human transcription factor binding specificities. Genome Research, 20, 861–873, (2010).
[74] Jolma, A., et al. DNA-binding specificities of human transcription factors. Cell, 152, 327–339, (2013).
[75] Liu, J. and Stormo, G. D. Combining SELEX with quantitative assays to rapidly obtain accurate models of protein-DNA interactions. Nucleic Acids Res, 33, e141, (2005).
[76] Zhou, T., et al. Quantitative modeling of transcription factor binding specificities using DNA shape. Proceedings of the National Academy of Sciences, 112, 4654–4659, (2015).
[77] Yang, L., et al. TFBSshape: a motif database for DNA shape features of transcription factor binding sites. Nucleic Acids Research, 42, D148–D155, (2014).
[78] Abe, N., et al. Deconvolving the recognition of DNA shape from sequence. Cell, 161, 307–318, (2015).
[79] Yang, L., et al. Transcription factor family-specific DNA shape readout revealed by quantitative specificity models. Molecular Systems Biology, 13, (2017).
[80] Kirkpatrick, S., Gelatt, C. D., and Vecchi, M. P. Optimization by simulated annealing. Science, 220, 671–680, (1983).
[81] Isakov, S., Zintchenko, I., Rønnow, T., and Troyer, M. Optimized simulated annealing for Ising spin glasses. arXiv:1401.1084, (2014).
[82] Santoro, G. E., Martoňák, R., Tosatti, E., and Car, R. Theory of quantum annealing of an Ising spin glass. Science, 295, 2427–2430, (2002).
[83] Tibshirani, R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, (1996).
[84] Chen, T. and Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, pages 785–794, New York, NY, USA, (2016). ACM.
[85] Crosson, E. and Harrow, A. Simulated quantum annealing can be exponentially faster than classical simulated annealing. arXiv preprint arXiv:1601.03030, (2016).
[86] Chen, T. and He, T. Higgs boson discovery with boosted trees. In Proceedings of the 2014 International Conference on High-Energy Physics and Machine Learning - Volume 42, HEPML'14, pages 69–80. JMLR.org, (2014).
[87] Breiman, L. Bagging predictors. Machine Learning, 24, 123–140, (1996).
[88] Davis, J. and Goadrich, M. The relationship between precision-recall and ROC curves. In Proceedings of the 23rd International Conference on Machine Learning, ICML '06, pages 233–240, New York, NY, USA, (2006). ACM.
[89] Kendall, M. A new measure of rank correlation. Biometrika, 30, 81–93, (1938).
[90] Grandori, C., Cowley, S., James, L., and Eisenman, R. The Myc/Max/Mad network and the transcriptional control of cell behavior. Annual Review of Cell and Developmental Biology, 16, 653–699, (2000).
[91] Cucker, F. and Smale, S. Best choices for regularization parameters in learning theory: on the bias-variance problem. Foundations of Computational Mathematics, 2, 413–428, (2002).
[92] Mordelet, F., Horton, J., Hartemink, A. J., Engelhardt, B. E., and Gordân, R. Stability selection for regression-based models of transcription factor–DNA binding specificity. Bioinformatics, 29, i117–i125, (2013).
[93] Zhou, T., et al.
DNAshape: a method for the high-throughput prediction of DNA structural features on a genomic scale. Nucleic Acids Research, 41, W56–W62, (2013).
[94] Gao, J., et al. Integrative analysis of complex cancer genomics and clinical profiles using the cBioPortal. Science Signaling, 6, pl1, (2013).
[95] Cerami, E., et al. The cBio Cancer Genomics Portal: An open platform for exploring multidimensional cancer genomics data. Cancer Discovery, 2, 401–404, (2012).
[96] Kuhn, M. Building predictive models in R using the caret package. Journal of Statistical Software, Articles, 28, 1–26, (2008).
[97] Boixo, S., Albash, T., Spedalieri, F. M., Chancellor, N., and Lidar, D. A. Experimental signature of programmable quantum annealing. Nat. Commun., 4, 2067, (2013).
[98] Isakov, S. V., Zintchenko, I. N., Rønnow, T. F., and Troyer, M. Optimised simulated annealing for Ising spin glasses. Computer Physics Communications, 192, 265–271, (2015).
[99] Ashhab, S., Johansson, J. R., and Nori, F. Decoherence in a scalable adiabatic quantum computer. Phys. Rev. A, 74, 052330, (2006).
[100] Venuti, L. C., Albash, T., Lidar, D. A., and Zanardi, P. Adiabaticity in open quantum systems. Physical Review A, 93, 032118, (2016).
[101] Marshall, J., Rieffel, E. G., and Hen, I. Thermalization, freeze-out and noise: deciphering experimental quantum annealers. arXiv:1703.03902, (2017).
[102] Chancellor, N., Szoke, S., Vinci, W., Aeppli, G., and Warburton, P. A. Maximum-entropy inference with a programmable annealer. Scientific Reports, 6, 22318, (2016).
[103] Albash, T., Martin-Mayor, V., and Hen, I. Temperature scaling law for quantum annealing optimizers. Physical Review Letters, 119, 110502, (2017).
[104] Perdomo-Ortiz, A., Benedetti, M., Realpe-Gómez, J., and Biswas, R. Opportunities and challenges for quantum-assisted machine learning in near-term quantum computers. (2017).
[105] Salakhutdinov, R. and Hinton, G. An efficient learning procedure for deep Boltzmann machines. Neural Comput., 24, 1967–2006, (2012).
[106] Tang, Y., Salakhutdinov, R., and Hinton, G. Robust Boltzmann machines for recognition and denoising. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 2264–2271, (2012).
[107] Vinci, W., Albash, T., and Lidar, D. A. Nested quantum annealing correction. npj Quantum Information, 2, 16017, (2016).
[108] Pudenz, K. L., Albash, T., and Lidar, D. A. Error-corrected quantum annealing with hundreds of qubits. Nat. Commun., 5, 3243, (2014).
[109] Pudenz, K. L., Albash, T., and Lidar, D. A. Quantum annealing correction for random Ising problems. Phys. Rev. A, 91, 042302, (2015).
[110] Matsuura, S., Nishimori, H., Albash, T., and Lidar, D. A. Mean field analysis of quantum annealing correction. Physical Review Letters, 116, 220501, (2016).
[111] Mishra, A., Albash, T., and Lidar, D. A. Performance of two different quantum annealing correction codes. Quant. Inf. Proc., 15, 609–636, (2015).
[112] Vinci, W. and Lidar, D. A. Scalable effective-temperature reduction for quantum annealers via nested quantum annealing correction. Physical Review A, 97, 022308, (2018).
[113] Matsuura, S., Nishimori, H., Vinci, W., and Lidar, D. A. Nested quantum annealing correction at finite temperature: p-spin models. arXiv:1803.01492, (2018).
[114] Vinci, W., Albash, T., Paz-Silva, G., Hen, I., and Lidar, D. A. Quantum annealing correction with minor embedding. Phys. Rev. A, 92, 042310, (2015).
[115] Jordan, S. P., Farhi, E., and Shor, P. W.
Error-correcting codes for adiabatic quantum computation. Phys. Rev. A, 74, 052322, (2006).
[116] Bookatz, A. D., Farhi, E., and Zhou, L. Error suppression in Hamiltonian-based quantum computation using energy penalties. Physical Review A, 92, 022317, (2015).
[117] Jiang, Z. and Rieffel, E. G. Non-commuting two-local Hamiltonians for quantum error suppression. Quantum Information Processing, 16, 89, (2017).
[118] Marvian, M. and Lidar, D. A. Error suppression for Hamiltonian-based quantum computation using subsystem codes. Physical Review Letters, 118, 030504, (2017).
[119] Marvian, M. and Lidar, D. A. Error suppression for Hamiltonian quantum computing in Markovian environments. Physical Review A, 95, 032302, (2017).
[120] Choi, V. Minor-embedding in adiabatic quantum computation: II. Minor-universal graph design. Quant. Inf. Proc., 10, 343–353, (2011).
[121] Klymko, C., Sullivan, B. D., and Humble, T. S. Adiabatic quantum programming: minor embedding with hard faults. Quant. Inf. Proc., 13, 709–729, (2014).
[122] Hinton, G. E., Osindero, S., and Teh, Y.-W. A fast learning algorithm for deep belief nets. Neural Computation, 18, 1527–1554, (2006).
[123] Salakhutdinov, R. and Larochelle, H. Efficient learning of deep Boltzmann machines. In Teh, Y. W. and Titterington, M., editors, Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pages 693–700, Chia Laguna Resort, Sardinia, Italy, (2010). PMLR.
[124] Hinton, G. E. Training products of experts by minimizing contrastive divergence. Neural Computation, 14, 1771–1800, (2002).
[125] MacKay, D. J. Information Theory, Inference, and Learning Algorithms. Cambridge University Press, (2003).
[126] Shin, S. W., Smith, G., Smolin, J. A., and Vazirani, U. How "quantum" is the D-Wave machine? arXiv:1401.7087, (2014).
[127] Albash, T. and Lidar, D. A. Demonstration of a scaling advantage for a quantum annealer over simulated annealing. Physical Review X, 8, 031016, (2018).

Appendix A

Classification measures

Here we explain the various classification measures used in this dissertation. In a binary classification problem, an algorithm must predict either positive or negative for a given sample of data, based on some cut-off value of the prediction. The simplest metric of performance is the accuracy: the fraction of predictions made correctly; i.e., given a predicted label $\tilde{y}_i$ and true label $y_i$, the accuracy is

\[ \text{Accuracy} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}(y_i = \tilde{y}_i). \tag{A.1} \]

While helpful as a simple metric, the accuracy may hide some details of the performance. For example, if 95 out of 100 samples were labeled as positive, a classifier that assigned a positive label to every sample would have an accuracy of 95%, but such a classifier clearly performs poorly for the 5% negative examples. For a more thorough analysis of classification performance, a confusion matrix may be defined (see Table A.1), and the relationships between the different quantities in the confusion matrix may give more insight into the classification performance. In a confusion matrix, true positives (TP) are samples correctly predicted as being positive. False positives (FP) are negative samples incorrectly labeled as positive. False negatives (FN) are positive samples incorrectly labeled as negative. True negatives (TN) are negative samples correctly labeled as such.
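These definitions translate directly into code; the following minimal sketch (with variable names of our own choosing) tallies the entries of Table A.1 for binary labels.

    import numpy as np

    def confusion_counts(y_true, y_pred):
        # Entries of the confusion matrix (Table A.1).
        y_true = np.asarray(y_true, dtype=bool)
        y_pred = np.asarray(y_pred, dtype=bool)
        tp = int(np.sum(y_pred & y_true))    # predicted positive, actually positive
        fp = int(np.sum(y_pred & ~y_true))   # predicted positive, actually negative
        fn = int(np.sum(~y_pred & y_true))   # predicted negative, actually positive
        tn = int(np.sum(~y_pred & ~y_true))  # predicted negative, actually negative
        return tp, fp, fn, tn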
Given the confusion matrix, an appropriate metric may then be constructed, based on the class imbalance. We define some commonly used terms before explaining the other metrics used in this work:

\[ \text{Recall} = \text{True Positive Rate} = \frac{\#TP}{\#TP + \#FN} \tag{A.2a} \]
\[ \text{Precision} = \frac{\#TP}{\#TP + \#FP} \tag{A.2b} \]
\[ \text{False Positive Rate} = \frac{\#FP}{\#FP + \#TN} \tag{A.2c} \]
\[ \text{Specificity} = \text{True Negative Rate} = \frac{\#TN}{\#FP + \#TN} \tag{A.2d} \]

                      predicted positive    predicted negative
actual positive               TP                    FN
actual negative               FP                    TN

Table A.1: Confusion matrix.

In terms of the confusion matrix, we can define the following quantities:

\[ \text{Accuracy} = \frac{\#TP + \#TN}{\#TP + \#FP + \#TN + \#FN} \tag{A.3} \]
\[ \text{Balanced Accuracy} = \frac{\text{Recall} + \text{Specificity}}{2} \tag{A.4} \]
\[ \text{F1 score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{A.5} \]

For classifiers that predict a single label, the metrics defined above may be sufficient. However, many classifiers do not directly output binary results, but instead predict either a score or a probability; labels may then be generated by thresholding the predictions. To understand the behavior of such classifiers across a variety of thresholds, various curves are sometimes plotted. We used two such curves in this work: the receiver operating characteristic (ROC) and the precision-recall (PR) curves. ROC curves plot the true positive rate versus the false positive rate, whereas PR curves plot precision versus recall. For distributions that have a small number of actual positive labels, the PR curve may help to better distinguish between similar classifiers, as it accounts for false positives in the denominator [88]. Curves may then be generated by varying a critical value, i.e., the value above which the algorithm predicts positive labels. More formally, we picked a critical value $c$ and determined the predicted class $\tilde{y}$ such that

\[ \tilde{y}_n = \begin{cases} 0 & \text{if } f(x_n) < c \\ 1 & \text{if } f(x_n) \geq c \end{cases}, \qquad \text{for } n = 1, \ldots, |D_{\text{TEST}}|, \tag{A.6} \]

where $f(x_n)$ is the predicted output after training. For example, in this work $f(x_n) = \mathbf{w} \cdot \mathbf{x}_n$ for DW, SA, LASSO, and others in Chap. 2. For each critical value, we can generate a point in ROC space or PR space according to the confusion matrix. The point was generated by comparing $\tilde{y}_n$ to $y_n$ for a given percentile $p$ and critical value $c$. If $\tilde{y}_n = 1$ and also $y_n = 1$, case $n$ was counted as a TP. If $\tilde{y}_n = 0$ and also $y_n = 0$, case $n$ was counted as a TN. If $\tilde{y}_n = 1$ while $y_n = 0$, case $n$ was counted as a FP. And if $\tilde{y}_n = 0$ while $y_n = 1$, case $n$ was counted as a FN. The number of such cases, for fixed $p$, was tallied for $n = 1, \ldots, N$ and defined the number of TP, FP, TN, and FN used in Eq. (A.2); i.e., #TP = (number of cases such that $y_n = \tilde{y}_n = 1$), etc.
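A minimal sketch of the curve-generation procedure just described: sweep the critical value $c$, apply Eq. (A.6), and emit one ROC point and one PR point per threshold. The convention of assigning precision 1 when no positives are predicted is our own assumption.

    import numpy as np

    def roc_pr_points(scores, y_true, thresholds):
        # For each critical value c, predict 1 when f(x_n) >= c (Eq. (A.6))
        # and compute (FPR, TPR) for the ROC curve and (recall, precision)
        # for the PR curve from the resulting confusion matrix.
        scores = np.asarray(scores, dtype=float)
        y_true = np.asarray(y_true, dtype=bool)
        roc, pr = [], []
        for c in thresholds:
            y_pred = scores >= c
            tp = np.sum(y_pred & y_true)
            fp = np.sum(y_pred & ~y_true)
            fn = np.sum(~y_pred & y_true)
            tn = np.sum(~y_pred & ~y_true)
            tpr = tp / (tp + fn) if tp + fn > 0 else 0.0
            fpr = fp / (fp + tn) if fp + tn > 0 else 0.0
            precision = tp / (tp + fp) if tp + fp > 0 else 1.0
            roc.append((fpr, tpr))
            pr.append((tpr, precision))  # recall is the same as TPR
        return roc, pr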
Abstract
Quantum computing is an emerging field with potentially far-reaching implications for a diverse set of fields, from chemical simulation and material science, to computer science, to finance and cryptography. Demonstrations of simple quantum algorithms have been carried out on a very small scale, and theoretical work has established how to scale up algorithms in the presence of noise. However, little is known in the intermediate regime of noisy quantum devices. The subject of this dissertation is the study of commercially available quantum annealers of a non-trivial size (namely, those manufactured by D-Wave) applied to practical problems in machine learning. D-Wave devices only admit problems formulated as a two-body Ising Hamiltonian, and underlying all of the applications considered is the mapping of various problems of interest to an Ising problem. First, we apply quantum annealing to model the binding preferences of transcription factors for different sequences of DNA. We show that this problem can be mapped to an Ising problem, that solving the Ising problem is advantageous when the amount of training data is scarce, and that the mapping generates weights consistent with known biological information. Second, we develop a different mapping of a classification problem to an Ising problem, based on an approximation to the negative log-likelihood, and apply our mapping to classify different classes of cancer. Our results show an advantage for quantum annealing, but the performance of QA can be matched by simple classical algorithms on an Ising Hamiltonian. These simple algorithms are able to outperform standard classical machine learning approaches with a small amount of data. We consider the effect of the approximations used in the mapping on the final machine learning performance. Third, we apply an error-correction scheme to slightly improve supervised and unsupervised learning on the D-Wave devices. In comparison to classical simulations, we obtain a better understanding of the physics of the devices, which, for these problems, does not appear to be quantum. Overall, although we have not found a case where D-Wave outperforms any classical algorithm, our research has shown the power of formulating problems as an Ising and gives a better understanding of the device.