SCALABLE EXACT INFERENCE IN PROBABILISTIC GRAPHICAL MODELS ON MULTI-CORE PLATFORMS

by

Nam Ma

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)

May 2014

Copyright 2014 Nam Ma

Dedication

To my parents and my wife.

Acknowledgments

First, I would like to thank my advisor, Professor Viktor Prasanna, who patiently and supportively guided me through many research challenges to the completion of my Ph.D. This work would not have been possible without him. His patience and willingness to help were surpassed only by his astounding knowledge of all the subject matter in question. In addition, under his guidance, I have learned how to identify important problems and abstract them, and how to communicate ideas more effectively in both writing and presentation. These skills have not only allowed me to be where I am today, but will also assist me in my future endeavors.

Much of my work has been done with Yinglong Xia, and I am grateful to have had his expertise for my research. I am also really grateful to Prof. Monte Ung and Hoang Le for their support since my first day at USC. Additionally, I would like to thank my committee members, Prof. Aiichiro Nakano, Prof. C.R. Raghavendra, and Prof. Yogesh Simmhan, for their insightful feedback on my work. I would also like to express my gratitude to Prof. Tru Cao, my undergraduate research advisor, for inspiring and preparing me to pursue Ph.D. study.

I was fortunate to be part of an amazing research group. I really enjoyed being a collaborator, and at the same time a friend, to so many of them: Mike Giakkoupis, Thilan Ganegedara, Lucia Sun, Edward Yang, Andrea Sanny, Hsuan-Yi Chu, Alok Kumbhare, to name a few. They certainly made my Ph.D. student life memorable.
I would also like to thank Janice Thompson, Lizsl De Leon, Margery Berti, and Estela Lopez for their tremendous support with administrative work.

I am grateful to the VEF Fellowship Program for supporting my Ph.D. study. I would also like to acknowledge the support from the Intel Manycore Testing Lab, USC HPCC, NSF Grant CNS-0613376, NSF Grant CNS-1018801, and the DARPA XDATA program.

Finally, I deeply thank my parents, Ma Toan and Ngo Thi Binh, for what they have done for me throughout my life. Their endless support and encouragement have been crucial to everything I have accomplished. My wife, Linh Phan, with her unconditional love and belief in me, has been the strength and motivation for me to persevere. I am indebted to all of them.

Table of Contents

Dedication
Acknowledgments
List of Tables
List of Figures
Abstract

Chapter 1: Introduction
  1.1 Probabilistic Graphical Models
  1.2 Motivation
  1.3 Contributions
  1.4 Organization

Chapter 2: Background and Related Work
  2.1 Exact Inference in Graphical Models
    2.1.1 Bayesian Networks
    2.1.2 Factor Graphs
  2.2 The MapReduce Programming Model
  2.3 Related Work

Chapter 3: Data Parallelism for Exact Inference in Factor Graphs
  3.1 Prior Work
  3.2 Data Parallelism in Table Operations
    3.2.1 Representation of Distribution Tables
    3.2.2 Realization of Data Parallelism in Table Operations
  3.3 Data Parallel Algorithms for Exact Inference
    3.3.1 Parallel Algorithms for the Table Operations
    3.3.2 Complete Algorithm for Belief Propagation
  3.4 Experiments
    3.4.1 Experimental Setup
    3.4.2 NUMA-Aware Implementations
    3.4.3 Experimental Results with Various Table Sizes

Chapter 4: Task Parallelism for Exact Inference in Factor Graphs
  4.1 Prior Work
  4.2 Task Dependency Graph for Belief Propagation
    4.2.1 Task Definition
    4.2.2 Task Graph Construction
  4.3 Task Scheduling for Belief Propagation
  4.4 Experiments
    4.4.1 Experimental Setup
    4.4.2 Experimental Results

Chapter 5: DAG Scheduling with Weak Dependency
  5.1 Prior Work
  5.2 Weak Dependency in Task Dependency Graphs
    5.2.1 Definition of Weak Dependency
    5.2.2 A Motivating Example
  5.3 DAG Scheduling with Weak Dependencies
    5.3.1 Components of the Scheduler
    5.3.2 A Weak Dependency Based Scheduler
  5.4 Experiments
    5.4.1 Experimental Setup
    5.4.2 Evaluation Results

Chapter 6: Pointer Jumping for Parallel Exact Inference
  6.1 Pointer Jumping Review
  6.2 Prior Work
  6.3 Pointer Jumping for Evidence Propagation
    6.3.1 Evidence Propagation in a Chain of Cliques
    6.3.2 Pointer Jumping Based Evidence Propagation
  6.4 Experiments
    6.4.1 Experimental Setup
    6.4.2 Results and Analysis

Chapter 7: Parallel Exact Inference Using MapReduce
  7.1 Prior Work
  7.2 MapReduce for Exact Inference
    7.2.1 Task Definition
    7.2.2 Depth-First Search Based Iterative MapReduce
    7.2.3 Level Based Iterative MapReduce
  7.3 Experiments
    7.3.1 Experimental Setup
    7.3.2 Experimental Results

Chapter 8: Conclusion and Future Work

Bibliography

List of Tables

4.1 Overall complexity and data size for slim-tree factor graphs with n = 2047, r = 2, d = 14, and various values of b.
4.2 Overall complexity and data size for balanced-tree factor graphs with n = 2000, r = 2, d = 14, and various values of c.
4.3 Task size, overall complexity, and data size for balanced-tree factor graphs with n = 2000, c = 4, and various values of r and d.
4.4 Overall complexity and data size for random-tree factor graphs with n = 2000, r = 2, d = 14, and various values of h.

List of Figures

2.1 (a) A sample Bayesian network and (b) corresponding junction tree.
2.2 Illustration of evidence collection and evidence distribution in a junction tree. The cliques in boxes are processed in the corresponding step. The shaded cliques have been processed.
2.3 A factor graph for the factorization f_1(x_1) f_2(x_2) f_3(x_1, x_2, x_3).
3.1 Index mapping between the potential table F of factor node f(x_1, x_2, x_3, x_4, x_5) and the message X_1 of variable x_1. Only potential values in the shaded column are actually stored.
3.2 Architectures of (a) the 10-core Intel Westmere-EX processor and (b) the 4-socket Westmere-EX system with 4 processors fully connected via QPI links.
3.3 Execution time of the OpenMP based implementation for belief propagation in factor graphs with r = 2, d = 20, m = 100 on a 10-core Westmere-EX processor.
3.4 NUMA effects on table multiplication with two different datasets on the 4-socket Intel Westmere-EX system.
3.5 NUMA effects on table marginalization with two different datasets on the 4-socket Intel Westmere-EX system.
3.6 NUMA effects on message computation with two different datasets on the 4-socket Intel Westmere-EX system.
3.7 NUMA effects on the execution time of the table operations and the message computation with two different datasets when 40 threads are used.
3.8 Performance of table multiplication on the 40-core Intel Westmere-EX system with (a) d = 3 and various numbers of states of the variables, (b) r = 2 and various node degrees.
3.9 Performance of table marginalization on the 40-core Intel Westmere-EX system with (a) d = 3 and various numbers of states of the variables, (b) r = 2 and various node degrees.
3.10 Overall performance of belief propagation on the 40-core Intel Westmere-EX system with (a) d = 3 and various numbers of states of the variables, (b) r = 2 and various node degrees.
4.1 (a) A factor graph with its skeleton formed by the non-leaf nodes, and (b) the corresponding task dependency graph.
4.2 An alternative way to construct a task dependency graph from a factor graph.
4.3 The slim-tree skeleton of factor graphs.
4.4 Speedups of belief propagation using various methods on a 10-core Westmere-EX processor for (a) the balanced-tree factor graph and (b) the random-tree factor graph.
4.5 Performance of belief propagation in slim-tree factor graphs with n = 2047, r = 2, d = 14, and various values of b on the 40-core Intel Westmere-EX system.
4.6 Performance of belief propagation in balanced-tree factor graphs with n = 2000, r = 2, d = 14, and various values of c on the 40-core Intel Westmere-EX system.
4.7 Performance of belief propagation in balanced-tree factor graphs with n = 2000, c = 4, r = 2, and various values of d on the 40-core Intel Westmere-EX system.
4.8 Performance of belief propagation in balanced-tree factor graphs with n = 2000, c = 4, d = 5, and various values of r on the 40-core Intel Westmere-EX system.
4.9 Performance of belief propagation in random-tree factor graphs with n = 2000, r = 2, d = 14, and various values of h on the 40-core Intel Westmere-EX system.
5.1 A fragment of a task dependency graph with weak dependencies.
5.2 Evidence collection using (a) the traditional scheduling method, (b) the weak dependency based scheduling method.
5.3 (a) A portion of a task dependency graph, (b) the corresponding representation of the global task list (GL), (c) the data of task T_i in the GL.
5.4 A Pine Tree and a Multipine Tree.
5.5 Execution timestamps on 8 cores for a Pine Tree with N = 32 nodes and node degree d = 8 using (a) the proposed scheduler and (b) the baseline.
5.6 Execution time for a Pine Tree with N = 1024 nodes and various node degrees d using (a) the proposed scheduler, and (b) the baseline.
5.7 Speedups of the two schedulers for a Pine Tree with N = 1024 nodes and node degree (a) d = 2, (b) d = 4, (c) d = 8, (d) d = 16.
5.8 Speedup of the two schedulers using two threads per core for a Pine Tree with N = 1024 nodes and node degree d = 16.
5.9 Speedups of the two schedulers for a Multipine Tree with N = 1024 nodes, node degree d = 4, and the number of branches (a) b = 2, (b) b = 4, (c) b = 8.
5.10 Speedups of the two schedulers for arbitrary trees with N = 1024 nodes, maximum node degree D_m = 16, and tree height (a) H = 10, (b) H = 100, (c) H = 500.
5.11 Speedups of the two schedulers for arbitrary trees with N = 1024 nodes, maximum node degree D_m = 6, and tree height (a) H = 10, (b) H = 100, (c) H = 500.
5.12 Speedups of the two schedulers for balanced trees with N = 1024 nodes and node degree (a) d = 2, (b) d = 4, (c) d = 8.
6.1 Pointer jumping on a list.
6.2 Three types of slim junction trees.
6.3 Comparison between the proposed algorithm (left) and SEI (right) for chains on the 32-core Intel Nehalem-EX based system.
6.4 Comparison between the proposed algorithm (left) and SEI (right) for balanced trees on the 32-core Intel Nehalem-EX based system.
6.5 Comparison between the proposed algorithm and the scheduling based exact inference (SEI) for slim trees with N = 4096 cliques on the 32-core Intel Nehalem-EX based system.
6.6 The speedup of the pointer jumping based exact inference with respect to the sequential exact inference on the 32-core Intel Nehalem-EX based system.
7.1 Map tasks and Reduce tasks in the post-order sequence of MapReduce invocations for evidence collection.
7.2 Map tasks and Reduce tasks in the level-based sequence of MapReduce invocations for evidence collection.
7.3 Scalability of the MapReduce based approaches compared with the baselines.
7.4 Speedup of the level-based MapReduce approach on the 40-core Intel Westmere-EX based system for balanced junction trees with respect to various parameters.
7.5 Speedup of the level-based MapReduce approach on the 40-core Intel Westmere-EX based system for random junction trees.

Abstract

The recent switch to multi-core computing and the emergence of machine learning applications have offered many opportunities for parallelization. However, achieving scalability with respect to the number of cores for these applications remains a fundamental challenge, especially for machine learning problems that have graph computational structures.

This thesis explores parallelism for exact inference in probabilistic graphical models to achieve scalable performance on multi-core platforms. Exact inference is a key problem in probabilistic reasoning and machine learning. It also represents a large class of graph computations that have sophisticated computational patterns. In this thesis, we present techniques to extract and exploit parallelism efficiently at different levels.

We first exploit parallelism available from the input graph using multithreading and task scheduling. At the node level, we explore data parallelism for the computational operations within a node; a data layout and data parallel algorithms are proposed for such node level computations. At the graph level, task parallelism is explored using a directed acyclic graph (DAG) model, and DAG scheduling is employed to efficiently map the tasks in the DAG to the hardware cores.

In many cases, the input graph provides insufficient parallelism. To expose more parallelism, we study a relationship called weak dependency between the tasks in a DAG, and develop a novel DAG scheduling scheme to exploit weak dependency for parallelism. In addition, the pointer jumping technique is employed for exact inference when the input graph offers very limited parallelism due to its chain-like structure. Given fixed-size problems, the proposed techniques can still achieve high scalability with an increasing number of cores.
To avoid the implementation complexity of many parallel techniques, we study the use of MapReduce as a higher-level programming model for exact inference. Our MapReduce-based techniques not only provide ease of use but also achieve good performance for exact inference on general purpose multi-core systems. With their high level of abstraction, our techniques can be applied to a class of graph computations with data dependency.

We implement and evaluate our techniques on state-of-the-art multi-core systems, and demonstrate their high scalability for a variety of input graphs.

Chapter 1: Introduction

1.1 Probabilistic Graphical Models

Many real life problems require reasoning tasks that take available information and reach conclusions. For example, a doctor takes information about symptoms and test results from a patient to diagnose possible diseases. Image-denoising software uses the available clean pixels to recover the contaminated ones. A robot needs to synthesize all data from its sensors and cameras to decide its next actions. In most of these real life scenarios, uncertainty is heavily involved. A joint probability distribution over a set of random variables can be used to model such scenarios. However, a joint probability distribution over the entire set of random variables becomes intractable in complex systems. Probabilistic graphical models [KF09] provide a framework for effectively modeling and reasoning about complex systems under uncertainty.

Probabilistic graphical models have become essential tools for probabilistic reasoning and machine learning. A probabilistic graphical model is a graph-based representation that compactly encodes a joint probability distribution of random variables based on the conditional independence relationships among them. One of the earliest types of graphical models is the Bayesian network, which has been used in artificial intelligence since the 1960s.
Recently, the factor graph [KFL01] has emerged as a unified model for directed graphical models (e.g., Bayesian networks) and undirected graphical models (e.g., Markov networks). Probabilistic graphical models have found applications in a variety of domains such as medical diagnosis, data mining, image processing, and the error-control decoding used in digital communications [STG+01, Hec97, SF08, Loe04].

The reasoning task performed over probabilistic graphical models is called inference. Inference is the problem of computing the posterior probability distribution of certain variables given some value-observed variables as evidence. Inference can be exact or approximate. In general, exact inference is NP-hard [Coo90]. By exploiting network structure, many algorithms have been proposed to make exact inference practical for a wide range of applications. In Bayesian networks, the most popular exact inference algorithm, proposed by Lauritzen and Spiegelhalter [LS88], converts a Bayesian network into a junction tree and then performs evidence propagation in the junction tree. In factor graphs, inference proceeds via the well-known belief propagation algorithm [KFL01]. Belief propagation leads to exact inference in acyclic factor graphs and to approximate inference in cyclic factor graphs. Approximate inference is an embarrassingly parallel problem; exploring parallelism in exact inference is more challenging and is our focus in this thesis.

Exact inference in graphical models is based on an ordered message passing process along the edges of the graph to update the probability distributions of all the nodes. A node, after receiving updated messages from its neighbors, updates its own probability distribution and then passes its updated distribution on to its other neighbors. Updating a node essentially requires local computations over the probability distribution of a subset of the random variables. With discrete random variables, such distributions are represented by potential tables.
The size of the potential table for a set of variables grows exponentially with the number of variables in the set. The complexity of exact inference thus increases dramatically with the number of states of the variables, the number of variables, and the density of the graph.

1.2 Motivation

Over the last few years, large scale graphical models have been increasingly used in practical applications, and in many applications inference must be performed in real time. For example, belief propagation in factor graphs is employed in Turbo codes and low-density parity-check (LDPC) codes for real-time data transmission on wireless networks [TR08]. Inference in graphical models can be used for complex recommendation systems at Amazon or Netflix [SHG09, KYP10]. Belief propagation has also been used for fraud detection in online auctions [CPF06] and in general ledger data in the financial domain [MBA+09]. In such applications, it is essential to accelerate the execution of inference to produce results as fast as possible.

In the past, efforts to improve the performance of exact inference focused on sequential algorithms, and increasing processor clock rates played a big role in improving these sequential algorithms. With the advent of multi-core processors, the computing industry has shifted to multi-core computing. General purpose multi-core processors have been widely used in supercomputers, servers, personal computers, and even cell phones. The number of cores has been increasing, bringing massive hardware parallelism to these systems. Early 2-core processors have been replaced by systems that can hold up to hundreds of cores. For example, the recent Dell PowerEdge C6145 server can accommodate 128 AMD cores in a single server, and the Sun UltraSPARC T3-4 server can integrate 64 cores, each supporting 8 threads, to produce a 512-thread machine.
The Intel Manycore Testing Lab has extensively employed 40-core Intel Westmere-EX based servers for various research and teaching activities. In the future, servers with even higher core counts can be expected.

Parallel computation for exact inference is therefore required to utilize such parallel machines and to accelerate the execution of the computation. It is desirable to achieve scalable performance for exact inference given the increasing number of cores in general purpose multi-core systems. Essentially, exact inference is based on an ordered message passing process along the edges of the graph, and the computation at each vertex requires sophisticated operations over probability distributions. Therefore, to achieve high scalability for exact inference, parallelism must be efficiently extracted and exploited at both the graph level and the node level. In the past, a few parallel techniques were proposed for exact inference. However, these techniques either scaled well only for a limited set of graph structures [DFL92, KS94, Pen98] or were developed for specialized computer architectures [XP10b, JXP10]. In this thesis, we aim to achieve scalable performance for exact inference with various structures of input graphical models on general purpose multi-core processors.

Furthermore, exact inference is an example of vertex-centric graph computation, which has emerged as a common computational pattern in many practical problems, such as PageRank, linear regression, k-means, and betweenness centrality [MAB+10, LS10]. Parallel graph computation is known to be a difficult problem due to its data-driven computations, unstructured nature, and poor locality [LGHB07]. In addition to these issues, exact inference exhibits precedence constraints between the nodes and complex computation within a node. The DAG scheduling needed here has long been a critical problem in parallel computing.
Thus, our proposed techniques are not only useful for exact inference but can also have an impact on a large class of graph computations. It is challenging yet fascinating to investigate parallel techniques for such DAG computational problems.

From a higher level point of view, our research addresses three main challenges in parallel computing: (1) the exploration of software parallelism for the increasing number of cores in parallel machines, (2) the mapping of that parallelism to the hardware cores to optimize performance, and (3) the use of a higher-level parallel programming model for a problem with a sophisticated computational pattern.

1.3 Contributions

In this thesis, we propose parallel algorithms to achieve scalable performance for exact inference on general purpose multi-core platforms. We explore parallelism for exact inference with respect to various types of input graphical models. Many techniques proposed for exact inference in this thesis can also be utilized for a class of vertex-centric graph computations with data dependencies. The major contributions of the thesis are summarized as follows:

Data Parallelism for Exact Inference: We explore data parallelism in node level computation for belief propagation in acyclic factor graphs. We identify basic operations, called table operations, for updating the potential table of a node in a factor graph. A data layout for storing potential tables is proposed such that data parallelism is exploited effectively in parallel algorithms for the table operations. A complete belief propagation algorithm is developed using these table operations. Non-Uniform Memory Access (NUMA) aware techniques are also developed for optimizing performance on state-of-the-art multi-socket multi-core systems. On a NUMA system with 40 cores, we achieve a 39.5× speedup for the primitives and a 39× speedup for the complete algorithm using factor graphs with large distribution tables.
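The idea of data parallelism over a potential table can be sketched as follows. This is a minimal illustration in Python/NumPy, not the dissertation's OpenMP implementation; the variable names and the chunking scheme are assumptions made for the example. The potential table of w binary variables is stored as a flat array of 2^w entries, with x1 encoded as the most significant bit of the index, so a marginalization reduces contiguous blocks of the array and the block sums can be computed in parallel.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

# Potential table of w = 20 binary variables, stored flat: the entry index
# encodes the variable states, with x1 as the most significant bit.
w = 20
table = np.random.default_rng(1).random(2 ** w)

def block_sum(bounds):
    lo, hi = bounds
    return table[lo:hi].sum()

# Split the table into equal chunks and reduce the chunks in parallel.
n_chunks = 8
step = len(table) // n_chunks
bounds = [(i * step, (i + 1) * step) for i in range(n_chunks)]
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(block_sum, bounds))

# Marginal of x1: the first half of the flat table holds x1 = 0,
# the second half holds x1 = 1.
marginal_x1 = [sum(partials[: n_chunks // 2]), sum(partials[n_chunks // 2:])]
```

Because the layout keeps each variable's states in regular strides, the same chunking applies to the other table operations (multiplication, division) as elementwise passes over disjoint ranges.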
Task Parallelism for Exact Inference: The data parallel technique has been shown to be very efficient when potential tables are large. For factor graphs with small potential tables that do not offer sufficient data parallelism, it is necessary to explore task parallelism at the graph level. Task parallelism is based on the computational independence among nodes that have no precedence relationship. Belief propagation in acyclic factor graphs can be expressed as a directed acyclic graph (DAG), where a node represents a task and a directed edge represents the dependency between two tasks. Task scheduling is used to exploit the task parallelism offered by a DAG. Our approach consists of building a task dependency graph based on the input factor graph and then using a dynamic task scheduler to exploit task parallelism. The experimental results show the efficiency and scalability of the proposed approach, with a 37× speedup on a general purpose multi-core system with 40 cores.

DAG Scheduling with Weak Dependencies: It is challenging when both data and task parallelism are insufficient in the input factor graph. In traditional scheduling approaches, a task in a DAG is started only after all its preceding tasks have completed. However, we observe that in a DAG representing belief propagation, a task can be partially executed with respect to each preceding task. Hence, a task can start execution after any subset of its preceding tasks has completed. We call such a relationship between tasks a weak dependency. We show that exploiting weak dependency can significantly increase task parallelism, and we design a DAG scheduling method to exploit the parallelism given by weak dependencies. Our results on a general purpose multi-core system show that the weak dependency based scheduler runs 4× faster than a traditional scheduler. Weak dependency also exists in other computational problems, such as LU factorization in linear algebra.
Thus, the proposed technique can have a significant impact on DAG scheduling for parallel processing.

Pointer Jumping for Junction Tree Algorithms: We study the pointer jumping technique on manycore systems for exact inference in junction trees that offer very limited parallelism. Existing techniques exhibit limited scalability when the input junction trees offer limited data and task parallelism. By using pointer jumping, our proposed method can process such junction trees efficiently. We implement the proposed method on state-of-the-art manycore systems. Experimental results show that, for junction trees with limited data and structural parallelism, pointer jumping is well suited to accelerating exact inference on manycore systems.

Exact Inference Using MapReduce: The existing parallel techniques achieve high performance for exact inference, but their implementation complexity has limited their applicability, especially in application domains where the experts may not be familiar with parallel computing techniques and tools. The increasing number of data-intensive applications requires new parallel programming models that enable easy parallelization and distribution. In recent years, MapReduce has emerged as a programming model for large-scale data processing, and many applications with data parallelism have achieved good scalability with it. However, there is still a very limited number of works using MapReduce for applications with data dependency constraints, such as exact inference. We study the use of MapReduce for such applications by exploiting task parallelism for exact inference with graph traversal methods. Our experimental results show that, by using an existing MapReduce runtime on multi-core processors, our techniques can achieve good scalability for exact inference.

1.4 Organization

The rest of the thesis is organized as follows.
In Chapter 2, we provide an overview of exact inference in graphical models, the MapReduce model, and the pointer jumping technique; related work on parallel exact inference is also included in that chapter. Chapter 3 presents our techniques for exploiting data parallelism in table operations at the node level. Task parallelism for exact inference is given in Chapter 4. In Chapter 5, the exploration of task parallelism based on weak dependency in DAG scheduling is presented. Chapter 6 introduces our proposed pointer jumping technique for exact inference. Chapter 7 presents our study of parallel exact inference using MapReduce. Our conclusion and future work are given in Chapter 8.

Chapter 2: Background and Related Work

2.1 Exact Inference in Graphical Models

2.1.1 Bayesian Networks

A Bayesian network is a probabilistic graphical model that exploits conditional independence to compactly represent a joint distribution. Figure 2.1(a) shows a sample Bayesian network, where each node represents a random variable and each edge indicates a probabilistic dependence relationship between two random variables. Since these edges cannot form directed cycles, the structure of a Bayesian network is a directed acyclic graph (DAG). The evidence in a Bayesian network is the set of variables that have been instantiated.

Traditional exact inference using Bayes' theorem fails for networks with undirected cycles [LS88]. Most inference methods for such networks convert the network to a cycle-free hypergraph called a junction tree. Figure 2.1(b) illustrates a junction tree converted from the Bayesian network in Figure 2.1(a), where all undirected cycles are eliminated. Each vertex in Figure 2.1(b) contains multiple random variables from the Bayesian network. For the sake of exploring evidence propagation in a junction tree, we use the following notation.
A junction tree is defined as J = (T, P̂), where T represents a tree and P̂ denotes the parameter of the tree. Each vertex C_i, known as a clique of J, is a set of random variables. Assuming C_i and C_j are adjacent, the separator between them is defined as C_i ∩ C_j. P̂ is a set of potential tables. The potential table of C_i, denoted ψ_{C_i}, can be viewed as the joint distribution of the random variables in C_i. For a clique with w variables, each having r states, the number of entries in ψ_{C_i} is r^w.

Figure 2.1: (a) A sample Bayesian network and (b) corresponding junction tree.

In a junction tree, exact inference is performed as follows. Assuming the evidence is E = {A_i = a} and A_i ∈ C_Y, E is absorbed at C_Y by instantiating the variable A_i and renormalizing the remaining variables of the clique. The evidence is then propagated from C_Y to an adjacent clique C_X. Let ψ*_Y denote the potential table of C_Y after E is absorbed, and ψ_X the potential table of C_X. Mathematically, evidence propagation is represented as [LS88]:

ψ*_S = Σ_{Y\S} ψ*_Y    (2.1)

ψ*_X = ψ_X · ψ*_S / ψ_S    (2.2)

where S is the separator between cliques X and Y; ψ_S (ψ*_S) denotes the original (updated) potential table of S; and ψ*_X is the updated potential table of C_X. Hence, propagating evidence from C_Y to C_X includes computing the marginal ψ*_S from ψ*_Y with Eq. 2.1 and then updating ψ_X with ψ*_S using Eq. 2.2.

A two-stage method ensures that evidence at any clique in a junction tree can be propagated to all the other cliques [LS88]. The first stage is called evidence collection, where evidence is propagated from the leaf cliques to the root; the second stage is called evidence distribution, where the evidence is propagated from the root to the leaf cliques. Figure 2.2 illustrates the first three steps of evidence collection and distribution in a sample junction tree.
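The clique-to-clique update of Eqs. 2.1 and 2.2 can be sketched as follows. This is a minimal Python illustration, not the thesis's implementation: the dict-based table layout, the variable names, and the example numbers are all assumptions made for the sketch.

```python
from itertools import product

def marginalize(table, vars_, sep_vars):
    """Eq. 2.1: sum a clique's table over all variables not in the separator."""
    idx = [vars_.index(v) for v in sep_vars]
    out = {}
    for state, val in table.items():
        key = tuple(state[i] for i in idx)
        out[key] = out.get(key, 0.0) + val
    return out

def update(table, vars_, sep_vars, sep_old, sep_new):
    """Eq. 2.2: rescale each entry by the ratio of updated to original separator values."""
    idx = [vars_.index(v) for v in sep_vars]
    return {state: val * sep_new[tuple(state[i] for i in idx)]
                       / sep_old[tuple(state[i] for i in idx)]
            for state, val in table.items()}

# Cliques C_Y = {a, b} and C_X = {b, c} with separator S = {b}; binary states.
psi_Y = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}  # after absorbing E
psi_X = {s: 0.25 for s in product([0, 1], repeat=2)}
psi_S_old = marginalize(psi_X, ['b', 'c'], ['b'])   # original separator table
psi_S_new = marginalize(psi_Y, ['a', 'b'], ['b'])   # Eq. 2.1
psi_X_new = update(psi_X, ['b', 'c'], ['b'], psi_S_old, psi_S_new)  # Eq. 2.2
```

After the update, the mass of ψ*_X follows the separator marginal received from C_Y, which is exactly the propagation step performed on every tree edge during the two-stage method.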
Figure 2.2: Illustration of evidence collection and evidence distribution in a junction tree. The cliques in boxes are processed in the corresponding step. The shaded cliques have been processed.

2.1.2 Factor Graphs

Given a set of random variables X = {x_1, x_2, ..., x_n}, a joint probability distribution of X can be written as a factorized function [KF09]:

P(X) ∝ Π_{j=1}^{m} f_j(X_j)    (2.3)

where ∝ denotes "proportional to"; X_j is a subset of {x_1, ..., x_n}; factor f_j(X_j) is called a local function of X_j; and m is the number of factors. A local function defines an unnormalized probability distribution over a subset of the variables in X.

Figure 2.3: A factor graph for the factorization f_1(x_1) f_2(x_2) f_3(x_1, x_2, x_3).

A factor graph is a type of probabilistic graphical model that naturally represents such factorized distributions [KFL01]. A factor graph is defined as F = (G, P), where G is the graph structure and P is the parameter of the factor graph. G is a bipartite graph G = ({X, F}, E), where X and F are nodes representing variables and factors, respectively. E is a set of edges, each connecting a factor f_j and a variable x ∈ X_j of f_j. Figure 2.3 shows a factor graph corresponding to the factorization g(x_1, x_2, x_3) = f_1(x_1) f_2(x_2) f_3(x_1, x_2, x_3). Parameter P is a set of factor potentials, given by the definitions of the local functions f_j(X_j). For discrete random variables, a factor potential is represented by a table in which each entry corresponds to a state of the set of variables of the factor. We focus on discrete random variables in this work.

Evidence in a factor graph is a set of variables with observed values, for example, E = {x_{e_1} = a_{e_1}, ..., x_{e_k} = a_{e_k}}, e_k ∈ {1, 2, ..., n}. Given evidence, the updated marginal distribution of any other variable can be queried. Inference is the process of computing the posterior marginals of variables, given evidence.
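The factor graph of Figure 2.3 and the factorization in Eq. 2.3 can be sketched in a few lines of Python. The adjacency representation and the made-up binary potential values below are illustrative assumptions, not part of the thesis.

```python
from itertools import product

# Bipartite structure of Figure 2.3: each factor lists its variable neighbors.
neighbors = {'f1': ['x1'], 'f2': ['x2'], 'f3': ['x1', 'x2', 'x3']}

# Factor potentials P: one table entry per joint state of a factor's variables.
potentials = {
    'f1': {(0,): 0.4, (1,): 0.6},
    'f2': {(0,): 0.7, (1,): 0.3},
    'f3': {s: 0.125 for s in product([0, 1], repeat=3)},
}

def unnormalized_joint(assignment):
    """Evaluate the right-hand side of Eq. 2.3 for one assignment of x1..x3."""
    p = 1.0
    for f, vars_ in neighbors.items():
        state = tuple(assignment[v] for v in vars_)
        p *= potentials[f][state]
    return p

# g(x1=1, x2=0, x3=1) = f1(1) * f2(0) * f3(1, 0, 1) = 0.6 * 0.7 * 0.125
joint = unnormalized_joint({'x1': 1, 'x2': 0, 'x3': 1})
```

Normalizing such products over all assignments yields P(X), which is exactly what belief propagation computes efficiently without enumerating every joint state.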
Belief propagation is a well-known inference algorithm introduced in [Pea88, LS88] and later formalized for factor graphs in [KFL01]. After evidence E is absorbed at the involved variables, belief propagation is performed. Belief propagation is based on a message passing process in which messages are computed locally and sent from a node to its neighbors. Two types of messages are given in Eqs. (2.4) and (2.5), one sent from a variable node x to a factor node f, and the other sent from a factor node f to a variable node x [KFL01]:

μ_{x→f}(x) ∝ Π_{h ∈ N_x \ {f}} μ_{h→x}(x)    (2.4)

μ_{f→x}(x) ∝ Σ_{N_f \ {x}} ( f(X_f) Π_{y ∈ N_f \ {x}} μ_{y→f}(y) )    (2.5)

where x and y are variables; f and h are factors; N_f and N_x are the sets of neighbors of f and x, respectively; Σ denotes marginalization over a potential table; and Π denotes products of the potential tables. Note that a message is a distribution of a random variable. Both message computations require the product of the incoming messages from all the neighbors excluding the one to which the message will be sent. Computing a message from f to x additionally requires a marginalization for x.

We assume that each variable has at most r states and each node has at most d neighbors. Thus, the size of the potential table of a factor f is at most r^d, and the size of a message for a variable x is r. The serial complexity of computing μ_{x→f}(x) is O(dr), and that of computing μ_{f→x}(x) is O(d r^d). It can be seen that the complexity of computing μ_{f→x}(x) dominates the complexity of computing μ_{x→f}(x).

In cycle-free factor graphs, message passing is initiated at the leaves. Each node starts computing a message and sends it to a neighbor after receiving messages from all the other neighbors. The process terminates when two messages have been passed on every edge, one in each direction. Therefore, |E| messages in Eq. (2.4) and |E| messages in Eq. (2.5) are computed during the execution of belief propagation in cycle-free factor graphs.
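The dominant computation, the factor-to-variable message of Eq. 2.5, can be sketched serially as follows. This is an illustrative Python sketch with made-up table values, not libDAI's implementation; the dict-based layout is an assumption of the sketch.

```python
def factor_to_variable(f_table, f_vars, target, incoming):
    """Compute mu_{f->x}(target) per Eq. 2.5: multiply the factor table by all
    incoming variable-to-factor messages except target's, then marginalize."""
    t = f_vars.index(target)
    msg = {}
    for state, val in f_table.items():
        for k, v in enumerate(f_vars):
            if v != target:
                val *= incoming[v][state[k]]
        msg[state[t]] = msg.get(state[t], 0.0) + val
    total = sum(msg.values())            # normalize (the proportionality in Eq. 2.5)
    return {s: p / total for s, p in msg.items()}

# A factor f(x1, x2, x3) over binary variables and two incoming messages.
f_table = {(0, 0, 0): 1, (0, 0, 1): 2, (0, 1, 0): 3, (0, 1, 1): 4,
           (1, 0, 0): 4, (1, 0, 1): 3, (1, 1, 0): 2, (1, 1, 1): 1}
incoming = {'x2': {0: 0.9, 1: 0.1}, 'x3': {0: 0.5, 1: 0.5}}
mu = factor_to_variable(f_table, ['x1', 'x2', 'x3'], 'x1', incoming)
```

The double loop over the r^d table entries and the d neighbors makes the O(d r^d) cost of Eq. 2.5 visible, which is why Chapter 3 parallelizes exactly this computation.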
The overall serial complexity of belief propagation is O(|E|dr + |E|d r^d) = O(|E|d r^d) = O(m d² r^d), where m is the number of factor nodes and |E| = md. In this case, belief propagation leads to exact inference, since all the final results are guaranteed to be exact [KFL01].

In cyclic factor graphs, belief propagation must be performed as an iterative message passing process. All messages are computed simultaneously using the messages from the previous iteration [KFL01]. The process is terminated when the changes of the messages between two consecutive iterations are less than a threshold. The approximate final results imply approximate inference for cyclic factor graphs. The parallel version of belief propagation in cyclic factor graphs is known to be an embarrassingly parallel algorithm.

In our work, we focus on belief propagation in cycle-free factor graphs only, where exploring parallelism is a challenge. Our techniques for exploiting data parallelism in computing a message are beneficial for belief propagation in cycle-free factor graphs. However, since computing a message is identical in both types of graphs, our techniques can also be applied to belief propagation in cyclic factor graphs.

Finally, once belief propagation completes, the posterior marginals of the variables are computed by Eqs. 2.6 and 2.7. The exact P(x) and P(X_f) are obtained by normalizing the right-hand sides of the equations.

P(x) ∝ Π_{f ∈ N_x} μ_{f→x}(x)    (2.6)

P(X_f) ∝ f(X_f) Π_{x ∈ N_f} μ_{x→f}(x)    (2.7)

2.2 The MapReduce Programming Model

MapReduce is a parallel programming model inspired by functional programming concepts and proposed for data-intensive applications [DG04].
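The model can be illustrated with a toy word count expressed as Map and Reduce functions. This is a minimal Python sketch; the serial driver standing in for the runtime system, and all names in it, are illustrative assumptions rather than a real MapReduce API.

```python
from collections import defaultdict
from itertools import chain

def map_fn(key, value):
    """Map: input pair (document id, text) -> intermediate (word, 1) pairs."""
    return [(word, 1) for word in value.split()]

def reduce_fn(key, values):
    """Reduce: aggregate all counts associated with one word."""
    return (key, sum(values))

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Map phase: every input pair could be processed in parallel.
    intermediate = chain.from_iterable(map_fn(k, v) for k, v in inputs)
    # Grouping: collect all values associated with the same intermediate key.
    groups = defaultdict(list)
    for k, v in intermediate:
        groups[k].append(v)
    # Reduce phase: every key group could also be processed in parallel.
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

counts = run_mapreduce([(1, "a b a"), (2, "b c")], map_fn, reduce_fn)
print(counts)  # {'a': 2, 'b': 2, 'c': 1}
```

The user writes only map_fn and reduce_fn; in a real MapReduce system the driver's three phases are executed by the runtime across the available processing units.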
A MapReduce based computation takes a set of input <key,value> pairs and produces a set of output <key,value> pairs. The user of MapReduce expresses the computation using two functions: Map and Reduce.

1. The Map function takes an input pair, performs computation on it, and produces a set of intermediate <key,value> pairs. All input pairs can be processed by the Map function simultaneously in the Map phase. At the end of the Map phase, all intermediate values associated with the same intermediate key are grouped together and passed to the Reduce function.

2. The Reduce function receives an intermediate key and a set of values associated with that key. These values are merged with some aggregation operations to form a possibly smaller set of values. All intermediate keys and their associated sets of values can also be processed by the Reduce function simultaneously.

Simplicity is the main benefit of the MapReduce programming model. The user only needs to provide a simple description of the algorithm using the Map and Reduce functions, and leaves all parallelization and concurrency control to the runtime system. The MapReduce runtime system is responsible for distributing the work to the available processing units. In many applications, there must be a sequence of MapReduce invocations during the entire execution. The term Iterative MapReduce [ELZ+10] is used to refer to such repetitive invocations of MapReduce.

2.3 Related Work

Many parallel techniques for inference in graphical models have been proposed. In [MSLB07, GLGO09], the authors studied parallelism for belief propagation in factor graphs. In [PM11], MapReduce was used for parallelizing inference in Conditional Random Fields and in general undirected graphical models. In digital communications, parallel belief propagation was implemented on FPGAs, GPUs, and multi-core processors [FSS11].
However, these techniques were proposed for general cyclic graphs, which lead to approximate inference, and were based on embarrassingly parallel computations. We focus on acyclic graphs for exact inference, in which data dependency with precedence constraints needs to be addressed.

In the early studies of parallel exact inference [KS94, DFL92], structural parallelism was explored in Bayesian networks on clusters. However, the scalability is limited for some structures of the graph. In [XP10d, XP10c], the authors investigated data and structural parallelism for exact inference in Bayesian networks. In our work, we focus on the parallelism in factor graphs, which requires different optimization techniques. For example, in node level computations, we arrange the table data layout to achieve constant-time index mapping between a factor potential and a message. Compared with that in [XP10d], the mapping not only eliminates a space-expensive table extension operation, but also reduces the execution time of potential table operations. In addition, our algorithms are developed for general-purpose multi-socket multi-core systems instead of for a specific architecture, as those in [XP10b, JXP10] are.

Exact inference can be represented by a task dependency graph, i.e., a directed acyclic graph (DAG) in which each node represents a task and each edge corresponds to a precedence constraint among the tasks. Task scheduling is known to be an effective tool for mapping task parallelism onto hardware processors. The scheduling problem has been extensively studied on parallel systems for several decades [ARK08, KA99, THW02, SYD09]. Scheduling techniques are also supported by several emerging runtime systems, such as Cilk [BJK+96], Intel Threading Building Blocks (TBB) [Int], OpenMP [Ope], Charm++ [Cha], and MPI micro-tasking [OIS+06]. However, those schedulers were optimized for specific structures of task dependency graphs, such as a tree or a fork-join graph.
In addition, these schedulers follow the traditional precedence constraints in a DAG: a task is ready only after all its predecessors have completed. This restriction may limit the available parallelism when the graph has a narrow form. We propose techniques using DAG scheduling for arbitrary task graphs in exact inference on multi-core processors. In addition, we adapt the scheduling method to exploit weak dependency, exposing more parallelism and improving the scalability of exact inference for input graphs with limited available parallelism.

Chapter 3 Data Parallelism for Exact Inference in Factor Graphs

Data parallelism is a key technique in parallel computing [Já92]. With its simplicity yet efficiency, the data parallel technique has been used extensively in a variety of applications. In [BJC95], scalable data parallel algorithms are proposed for image processing, with a focus on the Gibbs and Markov Random Field model representations for textures. In [SMV10], data parallelism is supported in techniques for data orchestration and tuning on OpenCL devices. In [CDMC+05], the authors discuss data parallelism in global address space programming. Data parallel techniques are also discussed in [PL10, KBD10, Buy99] for various computation intensive applications on multi-core processors, accelerators, and grids.

This chapter presents our data parallel techniques for belief propagation in factor graphs. Our target platform is state-of-the-art multi-socket multi-core systems. In these systems, multi-core processors are interconnected to form a non-uniform memory access (NUMA) machine with a large number of cores. Data parallelism is explored in the operations over the potential tables, including table multiplication and table marginalization. Such operations are referred to as table operations. Table operations are parallelized by dividing the potential tables into small chunks, each processed by a core.
For factor graphs with large potential tables, data parallelism is an efficient approach to accelerate belief propagation. Our contributions in this chapter include:

- We organize potential tables such that data parallelism for table multiplication and marginalization can be identified and explored efficiently.
- We develop data parallel algorithms for these two table operations on shared memory platforms. The two table operations are combined into a complete parallel belief propagation algorithm for factor graphs.
- We implement the proposed algorithms on a multi-socket multi-core system. NUMA-aware optimizations with data placement are also implemented. Our implementation shows almost linear speedup with the number of cores.

3.1 Prior Work

Many parallel techniques for inference in general graphical models have been proposed. In [KS94], parallelism for inference is explored in Bayesian networks. In [PM11, SSPM10], MapReduce is used for parallelizing inference in Conditional Random Fields and in general undirected graphical models. For factor graphs, several parallel techniques for belief propagation have been discussed in [MSLB07, GLGO09]. However, these techniques are proposed for general factor graphs, which lead to approximate inference, while we focus on acyclic factor graphs for exact inference. In addition, those authors focus only on the structural parallelism given by the factor graph topology. In digital communications, parallel belief propagation is implemented on FPGAs, GPUs, and multi-core processors [KC04, FSS11]. These techniques apply to cyclic factor graphs with the employment of embarrassingly parallel message passing algorithms.

There are a few earlier works studying data parallelism in graphical models. In [XP10d], the authors investigate data parallelism for exact inference with respect to Bayesian networks, where a Bayesian network must first be converted into a junction tree.
Parallel primitives are proposed for evidence propagation in junction trees. The primitives, including table marginalization, table extension, and table multiplication/division, are optimized for distributed memory platforms. In [XP10c], data parallelism in exact inference in junction trees is explored on Cell BE processors. Although all synergistic processing elements (SPEs) in a Cell BE processor access a shared memory, data must be explicitly moved to the local store of an SPE so as to be processed.

In this work, we adapt the data layout for potential tables and the node level computations for belief propagation in acyclic factor graphs. For the node level computations, we optimize the table operations by taking advantage of the fact that only the relationship between a factor node and a single variable node needs to be considered. This leads to constant-time index mapping between a factor potential and a message. The mapping not only eliminates a space-expensive table extension operation, but also reduces the execution time of table multiplication and table marginalization. In addition, our algorithms are developed for shared memory platforms, with a focus on general-purpose multi-socket multi-core systems. Thus, unlike [XP10c], no explicit data copy is needed during execution. To the best of our knowledge, no data parallel techniques have previously been discussed for belief propagation in factor graphs.

3.2 Data Parallelism in Table Operations

3.2.1 Representation of Distribution Tables

In a factor graph, each factor node f represents a local function of a set X_f of random variables. As given in Section 2.1.2, this local function describes a probability distribution of the variables. For discrete random variables, the local function is defined by a potential table in which each entry is a non-negative real number corresponding to the probability of a state of X_f.
In our potential table representation, instead of storing the states of X_f along with their potential values, we store only the potential values in a one-dimensional table F. The potential values are stored in an order such that the index i of a potential value in F is determined by the corresponding state of X_f, as described below. Note that a message is also a potential table, with respect to a single variable.

Consider a factor f(X_f) with d variables. Let X_f = {x_1, x_2, ..., x_d}, where x_k is the k-th variable of X_f. Without loss of generality, we assume that the state of random variable x_k is taken from D(x_k) = {0, 1, ..., r_k − 1}, where r_k = |D(x_k)|. The state of X_f is determined by the states of x_1, x_2, ..., x_d. The size of the potential table F of f is |F| = Π_{k=1}^{d} r_k. Similarly, the message at f to/from its variable x_k is stored in a potential table X_k, with |X_k| = r_k.

The index i of an entry in F is determined by the corresponding state of X_f as given in Eq. 3.1:

i = Σ_{k=1}^{d} x_k Π_{j=1}^{k−1} r_j    (3.1)

Thus, given any index i in the potential table F of f, we determine the corresponding state i* of variable x_k of f by:

i* = ⌊i / os_k⌋ mod r_k    (3.2)

where os_k = Π_{j=1}^{k−1} r_j, with os_1 = 1. os_k is precomputed for each variable x_k of f. Thus, mapping an index i in F to the corresponding index i* in X_k takes only O(1) time.

Figure 3.1 illustrates the organization of a potential table F of a factor f with X_f = {x_1, x_2, x_3, x_4, x_5}. Each variable has r = 2 states. Only the potential values of X_f, which are in the shaded column, are actually stored. As given by Eqs. 3.1 and 3.2, there is a one-to-one mapping between the indices of entries in F and the states of X_f. Thus, the state of the variables at an entry is determined by the index of the entry. For example, the entry at index i = 13 corresponds to state (x_1, x_2, x_3, x_4, x_5) = (1, 0, 1, 1, 0) of X_f. Since the state of x_1 is 1, the corresponding index i* in X_1 is also 1.
Figure 3.1: Index mapping between the potential table F of factor node f(x_1, x_2, x_3, x_4, x_5) and the message X_1 of variable x_1. Only potential values in the shaded column are actually stored.

3.2.2 Realization of Data Parallelism in Table Operations

As given in Section 2.1.2, the complexity of computing a message from a factor node to a variable node dominates the complexity of computing a message from a variable node to a factor node. Thus, we focus on parallelizing the computation at factor nodes. The computation at factor node f to produce a message X_k sent to x_k, 1 ≤ k ≤ d, is derived from Eq. 2.5 as follows:

X_k = Σ_{X_f \ {x_k}} ( F · Π_{j≠k} X_j )    (3.3)

Eq. 3.3 is performed in two phases: (1) multiplying the factor potential table F with the (d − 1) incoming messages to obtain the resulting potential table F*, and (2) marginalizing the resulting F* for x_k to obtain the message X_k. Hence, we have two potential table operations for computing a new message:

Table multiplication F* = F · X_j, realized by:

F*[i] = F[i] · X_j[i*],  ∀i ∈ {0, 1, ..., |F| − 1}    (3.4)

Table marginalization X_k = Σ_{X_f \ {x_k}} F*, realized by:

X_k[i*] = X_k[i*] + F*[i],  ∀i ∈ {0, 1, ..., |F| − 1}    (3.5)

where the relationship between i and i* is determined by Eq. 3.2. For factors with large potential tables, these two table operations can be parallelized to accelerate execution. A large potential table F is divided into small chunks that can be processed concurrently. Detailed algorithms for these two table operations are given in the following section.
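The flat layout and the two table operations can be sketched serially before parallelization. The following Python sketch follows the notation of Eqs. 3.1-3.5 for the five binary variables of Figure 3.1; the table values and the choice of message variable are illustrative assumptions.

```python
import math

r = [2, 2, 2, 2, 2]            # r_k: number of states of x_k
os_ = [1]                      # os_k = r_1 * ... * r_{k-1}, with os_1 = 1
for rk in r[:-1]:
    os_.append(os_[-1] * rk)
size = math.prod(r)            # |F| = 32

def flat_index(state):         # Eq. 3.1: state of X_f -> index i in F
    return sum(s * o for s, o in zip(state, os_))

def index_mapping(i, k):       # Eq. 3.2: index i in F -> state i* of x_k
    return (i // os_[k]) % r[k]

assert flat_index((1, 0, 1, 1, 0)) == 13   # the worked example of Figure 3.1
assert index_mapping(13, 0) == 1           # state of x_1 at entry 13

F = [1.0] * size                           # factor potential table (illustrative)
X3 = [0.3, 0.7]                            # incoming message for x_3 (k = 2)
Fs = [F[i] * X3[index_mapping(i, 2)] for i in range(size)]   # Eq. 3.4
Xk = [0.0] * r[0]
for i in range(size):                      # Eq. 3.5: marginalize F* onto x_1
    Xk[index_mapping(i, 0)] += Fs[i]
```

Both loops touch each of the |F| entries exactly once and use only the O(1) mapping of Eq. 3.2, which is what makes the chunk-wise parallelization of the next section straightforward.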
3.3 Data Parallel Algorithms for Exact Inference

3.3.1 Parallel Algorithms for the Table Operations

On a shared memory parallel system with P threads, the potential table F is divided into P chunks. In table multiplication, each thread updates its own chunk. In table marginalization, the output message needs to be updated using the potential values from all chunks. Thus, when computing the marginal, we let each thread keep a local resulting message computed from its own chunk. Then, these local messages are aggregated to generate the final message. The parallel algorithms for table multiplication and table marginalization are given in Algorithm 1 and Algorithm 2, respectively.

Algorithm 1 Table multiplication
Input: factor f, variable x_k of f, potential table F, message X_k from x_k to f, number of threads P
Output: resulting potential table F*
1: for p = 0 to P − 1 pardo   {local computation}
2:   for i = (|F|/P)·p to (|F|/P)·(p + 1) − 1 do
3:     i* = index_mapping(i, f, x_k) using Eq. 3.2
4:     F*[i] = F[i] · X_k[i*]
5:   end for
6: end for

In Algorithm 1, the input includes factor f, the k-th variable x_k of f, a potential table F w.r.t. f, a message X_k sent from x_k to f, and the number of threads P. The output is the product of F and X_k, which is another potential table F* of the same size as F. Line 1 launches P-way parallelism and assigns an ID p ∈ {0, ..., P − 1} to each thread. Lines 2-5 perform the local computations of a thread on its own chunk of data, which consists of |F|/P contiguous entries. Lines 3-4 update the value of the i-th entry in F* using the corresponding i*-th entry in X_k. Function index_mapping(i, f, x_k) computes i* using Eq. 3.2, given that r_k and os_k are available for each variable x_k of factor f.

In Algorithm 2, the input includes factor f, the k-th variable x_k of f, a potential table F, and the number of threads P. The output is the marginal X_k computed from F for x_k. Line 2 initializes a local message for each thread.
Algorithm 2 Table marginalization
Input: factor f, variable x_k of f, potential table F, number of threads P
Output: marginal X_k for x_k from F
1: for p = 0 to P − 1 pardo   {local computation}
2:   X_k^(p) = 0
3:   for i = (|F|/P)·p to (|F|/P)·(p + 1) − 1 do
4:     i* = index_mapping(i, f, x_k) using Eq. 3.2
5:     X_k^(p)[i*] = X_k^(p)[i*] + F[i]
6:   end for
7: end for
{global update for X_k from the local results}
8: X_k = Σ_p X_k^(p)

Lines 3-6 perform the local computations of a thread on its own chunk of data to update its local message. Note that these locally updated messages can be read by all the threads. In Lines 4-5, the value of the i-th entry in F is used to update the corresponding i*-th entry in the local X_k^(p) of thread p. After Line 7, all the threads have completed their local computations. Finally, Line 8 sums all the local messages to form the final output message X_k. Note that all |X_k| entries of X_k can be processed in parallel.

We analyze the complexity of the algorithms using the concurrent read exclusive write parallel random access machine (CREW-PRAM) model [Já92] with P processors. Each thread runs on a separate processor. In Algorithm 1, the mapping function at Line 3 takes O(1) time to complete, as given in Section 3.2.1. Line 4 also takes O(1) time. Therefore, for |F| ≥ P, the complexity of Algorithm 1 is O(|F|/P).

In Algorithm 2, Lines 1-7 take O(|F|/P) time. Line 8 essentially computes the sum of P vectors, where each vector has |X_k| entries (|X_k|-by-P computations). If |X_k| ≥ P, we use P processors to compute P rows at a time step, with one processor for each row. In this case, Line 8 takes O(⌈|X_k|/P⌉ · P) = O(|X_k|) time. If log P ≤ |X_k| < P, we use P processors to compute log P rows at a time step, with P/log P processors for each row. Line 8 also takes O(|X_k|) time in this case. If |X_k| < log P, we use P processors to compute one row at a time step. The sum of P elements can be performed on P processors in O(log P) time using the well-known all-reduce algorithm [Já92].
In this case, Line 8 25 takes O(jX k jlogP) time. Thus, forjX k j logP , the complexity of Algorithm 2 is O( jFj P +jX k j). 3.3.2 Complete Algorithm for Belief Propagation According to Eq. 3.3, computing a message at a factor node requires a series of table operations: (d1) table multiplications and one table marginalization, whered is the node-degree of the factor node. The table operations discussed above can be utilized here. The incoming messagesX j from neighbor nodes are the inputs to the table oper- ations. The computation process to produce a message at a factor node is described in Algorithm 3. Algorithm 3 Message computation kernel Input: factorf, variablex k off, potential tableF off, number of threadsP Output: messageX k sent fromf tox k 1:F =F 2: forp = 0 toP1 pardo flocal computation for the product ofF withX j g 3: forj = 1 tod,j6=k do 4: ComputeX j 5: fori = jFj P p to jFj P (p+1)1 do 6: i =index mapping(i;f;x j ) using Eq. 3.2 7: F [i] =F [i]X j [i] 8: end for 9: end for flocal computation for the marginalX k fromF g 10: X (p) k = ! 0 11: fori = jFj P p to jFj P (p+1)1 do 12: i =index mapping(i;f;x k ) using Eq. 3.2 13: X (p) k [i] =X (p) k [i]+F [i] 14: end for 15: end for fglobal update forX k from the local resultsg 16: X k = P p X (p) k 26 In Algorithm 3, the input includes factorf, thek th variable (x k ) off, the original potential tableF of f, and the number of threads P . The output is the messageX k to send from factor node f to variable node x k . Line 1 initiates a copyF ofF that will be used as an intermediate result. Lines 3-9 perform (d1) table multiplications betweenF with (d1) incoming messages. Line 4 computes the incoming message X j using Eq. 2.4. Lines 10-15 perform table marginalization on the finalF to generate the output messageX k forx k . Line 16 combines all the local results to form the final messageX k . In the CREW-PRAM model, the complexity of Algorithm 3 isO( djFj P + d P j6=k jX j j+jX k j),1PjFj. 
The process of belief propagation consists of a sequence of local computations and message passing. The order of passing the messages guarantees that each message at a node is computed only after all of the other messages have arrived at the node. A common strategy is to order the nodes by breadth-first search (BFS) with respect to an arbitrarily selected root [KFL01]. Based on this order, a queue of directed edges Q = {(f, x)} from factors to variables is formed as the sequence of messages to be computed.

Algorithm 4 Belief propagation using message computation kernels
Input: factor graph F = (G, P), edge queue Q, number of threads P
Output: updated messages on all edges in both directions
1: for p = 0 to P − 1 pardo
2:   for i = 1 to |Q| do
3:     Let (f, x) = Q(i)
4:     Let F = P(f)
5:     ComputeMessage(f, x, F)
6:   end for
7: end for

Given a queue of edges, belief propagation in factor graph F is performed as shown in Algorithm 4. For each directed edge from factor f to variable x, it retrieves the potential table F of f from the parameter set P. Then, it applies the message computation kernel given in Algorithm 3. A synchronization barrier is implied at the end of each iteration, before moving to the next edge. At the end of the algorithm, the messages in both directions on all the edges have been updated.

3.4 Experiments

3.4.1 Experimental Setup

Platforms and Implementation Setup

Figure 3.2: Architectures of (a) the 10-core Intel Westmere-EX processor and (b) the 4-socket Westmere-EX system with 4 processors fully connected via QPI links.

We conducted experiments on a 4-socket Intel Westmere-EX system as a representative state-of-the-art multi-socket multi-core system. The system consists of four Xeon E7-4860 processors fully connected through 6.4 GT/s QPI links. Each processor has 10 cores running at 2.26 GHz and 16 GB of DDR3 memory.
Thus, the system has 64 GB of shared memory in total. Figure 3.2, adapted from [VAB10], shows the detailed architecture of the Xeon E7-4860 and the 4-socket system.

The algorithms were implemented using Pthreads. Hyperthreading was disabled. In our initial experiments, we noted that running multiple threads on each core showed no improvement. Thus, we initiated as many threads as the number of hardware cores and bound each thread to a separate core. These threads persisted over the entire program execution and communicated with each other through the shared memory. To evaluate the scalability of the algorithms, various numbers of threads were chosen from 1, 2, 4, 6, 8, ..., 36, 38, 40.

Datasets and Performance Metrics

We generated acyclic factor graphs using libDAI, an open source library for probabilistic inference in factor graphs [Moo10]. The factor graphs are generated with modifiable parameters: the number of factor nodes m, the degree d of each node, and the number of states r of each variable. We examined the complexity and the scalability of our algorithms with regard to r and d by changing one of them at a time. In the experiments, we kept m = 100. With node degree d = 3, we conducted experiments using various numbers of states of the variables: r = 60, r = 100, and r = 140. With binary variables (r = 2), we conducted the experiments using d = 14, d = 18, and d = 22. Factor graphs with binary variables are the most popular datasets, while factor graphs with a large number of variable states can be found in the bioinformatics and computer vision domains.

The metrics for evaluating the performance of our method are execution time and speedup. Speedup is defined as the ratio of the execution time on a single core to the execution time on P cores. The serial code for belief propagation in factor graphs was obtained from libDAI. Our parallel code was developed based on this libDAI serial code.
Experimental results show that the execution time of our parallel code running on one core is almost the same as that of the serial code. We denote milliseconds by ms and seconds by s for the execution time.

Figure 3.3: Execution time of the OpenMP based implementation for belief propagation in factor graphs with r = 2, d = 20, m = 100 on a 10-core Westmere-EX processor.

An OpenMP Baseline

An OpenMP implementation of belief propagation was developed based on the libDAI serial code. In this OpenMP implementation, we used the directive #pragma omp parallel for in front of the loop for table multiplication in Eq. 3.4. For table marginalization in Eq. 3.5, we used #pragma omp for reduction for each item in table X_k and enabled nested loops with omp_set_nested. These directives guide the runtime system to automatically parallelize the computations within the loops using the given number of threads. We also used additional supporting options such as static scheduling. The dataset used in this experiment is a factor graph with m = 100, r = 2, d = 20.

Figure 3.3 shows the experimental results for the OpenMP baseline and our proposed method on a 10-core Xeon E7-4860 processor. It can be seen that the OpenMP baseline has limited scalability when the number of threads is increased beyond 4. In contrast, our proposed method scales linearly with the number of threads in this experiment. The poor scalability of the OpenMP baseline arises because the OpenMP runtime does not offer efficient support for the irregular data accesses required by the two table operations. In addition, false sharing is another issue in the OpenMP implementation, because a potential table, stored as a contiguous array, is updated frequently by multiple threads. In our proposed method, each thread is assigned a separate chunk of data that is big enough to avoid such false sharing.
The results show that the use of OpenMP directives for automatic parallelization is not suitable for belief propagation on multi-core platforms.

3.4.2 NUMA-Aware Implementations

On a single processor with 10 cores, our proposed technique shows linear speedup. However, since each table operation performs roughly one FLOP per memory access, a NUMA platform may affect the performance of the table operations significantly. In our non-NUMA-aware implementation, we let the master thread allocate all the data, including the entire potential tables. When executing the table operations, a thread needs to access its chunk of data, which may be located in a remote memory. We address this problem by using the first-touch data allocation policy of the operating system [VDGR96] and modifying the code for the table operations correspondingly. We evaluate the effect of the NUMA-aware implementation using two different datasets: (1) r = 100, d = 3 and (2) r = 2, d = 20, with the same number of factors m = 100. These two datasets have similar factor potential table sizes and also similar complexities of the table operations (100^3 vs. 2^20). However, computing a message for the second dataset is more computationally expensive (20·2^20 vs. 3·100^3).

Potential Table Operations

In Algorithm 1 for table multiplication, thread p updates its own chunk of the potential table F, from entry floor(|F|/P)·p to entry floor(|F|/P)·(p+1) − 1. In the NUMA-aware implementation, at the beginning, each chunk of a potential table is separately allocated in the memory of the processor of the corresponding thread. Thus, when executing the table operations, each thread accesses the data in its local memory. The original index i of each entry is maintained for the mapping to the corresponding index in message X_k. Figure 3.4 shows the effect of data placement on table multiplication on the 4-socket Westmere-EX system. For both datasets, the NUMA-aware implementation shows much better speedup compared with the non-NUMA-aware implementation.
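The chunking just described can be sketched as follows (Python, serial simulation for exposition; in the actual Pthreads code each loop over p runs concurrently in thread p, and under the first-touch policy thread p also initializes its own chunk so the OS places those pages in p's local memory; function names are illustrative):

```python
def chunk_bounds(n, P, p):
    """Entry range [lo, hi) owned by thread p out of P; the last thread
    takes any remainder."""
    lo = (n // P) * p
    hi = (n // P) * (p + 1) if p < P - 1 else n
    return lo, hi

def chunked_multiply(F, msg, k, r, P):
    """Table multiplication partitioned into P chunks.  The global index i
    is kept so the mapping into message X_k stays correct per entry."""
    for p in range(P):                       # each iteration = one thread's work
        lo, hi = chunk_bounds(len(F), P, p)
        for i in range(lo, hi):
            F[i] *= msg[(i // r**k) % r]
    return F
```

Because each thread writes only its own contiguous chunk, the threads never share a cache line of F, which is also what avoids the false sharing seen in the OpenMP baseline.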
Using 40 cores, the speedups of the NUMA-aware and non-NUMA-aware implementations are about 37 and 31.2, respectively.

Figure 3.4: NUMA effects on table multiplication with two different datasets, (a) r = 100, d = 3 and (b) r = 2, d = 20, on the 4-socket Intel Westmere-EX system.

Figure 3.5: NUMA effects on table marginalization with two different datasets, (a) r = 100, d = 3 and (b) r = 2, d = 20, on the 4-socket Intel Westmere-EX system.

In Algorithm 1 for table marginalization, thread p reads its own chunk of the potential table F, from entry floor(|F|/P)·p to entry floor(|F|/P)·(p+1) − 1, and updates its own message copy X_k^(p). In the NUMA-aware implementation, in addition to the data placement of the potential table's chunks, the message copies are also allocated separately for the threads. Thus, when executing table marginalization, even though the potential tables have already been accessed by the prior table multiplications, accessing the message copy still affects the performance of the non-NUMA-aware implementation. As shown in Figure 3.5, the speedup of the non-NUMA-aware code for dataset r = 100, d = 3 significantly tapers off as the number of threads exceeds 10, 20, and 30, i.e., at the socket boundaries. For dataset r = 2, d = 20, it is only slightly affected because the message size is small (r = 2). The speedups of the non-NUMA-aware code for these two datasets are about 30 and 33.4, respectively. The NUMA-aware implementation achieves good scalability for both datasets, with a speedup of 36.8 using 40 cores.

Message Computation Kernel

Data placement has shown significant improvement in the performance of the individual table operations.
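The per-thread message copies X_k^(p) and their final accumulation can be sketched as below (Python, serial simulation; in the real code each p-iteration is a thread and each copy is allocated in that thread's local memory; the helper name is illustrative):

```python
def chunked_marginalize(F, k, r, P):
    """Each thread p reduces its chunk of F into its own message copy
    X_k^(p); the P copies are then accumulated into the final message X_k."""
    n = len(F)
    copies = [[0.0] * r for _ in range(P)]   # X_k^(p), one per thread
    for p in range(P):                        # each iteration = one thread's work
        lo = (n // P) * p
        hi = (n // P) * (p + 1) if p < P - 1 else n
        for i in range(lo, hi):
            copies[p][(i // r**k) % r] += F[i]
    # final accumulation of the P message copies
    return [sum(c[s] for c in copies) for s in range(r)]
```

The accumulation step touches only P·r values, which is why it stays cheap whenever |F| is much larger than |X_k|.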
As given in Algorithm 3, computing a message at a factor node f consists of a series of table operations: (d − 1) table multiplications and one table marginalization. That is, computing a message requires O(d·r^d) FLOPs on a potential table of size r^d. Only the first table multiplication needs to access the potential table from the remote memory; the remaining table operations in the series work on the already loaded chunks of the potential table. Table marginalization only accesses a remote memory for the message copy. Thus, even with the non-NUMA-aware implementation, the performance of computing a message is not significantly affected by NUMA platforms for factor graphs with large node degree d. Figure 3.6 compares the speedups of message computation between the NUMA-aware and non-NUMA-aware implementations. The improvement of the NUMA-aware implementation is very clear for dataset r = 100, d = 3 but not noticeable for dataset r = 2, d = 20.

Figure 3.6: NUMA effects on message computation with two different datasets, (a) r = 100, d = 3 and (b) r = 2, d = 20, on the 4-socket Intel Westmere-EX system.

Figure 3.7: NUMA effects on the execution time of the table operations (mult, marg) and the message computation (msg) with two different datasets when 40 threads are used.

Figure 3.7 summarizes the effect of data placement on the table operations and the message computation kernel. With dataset r = 100, d = 3, using 40 cores, the NUMA-aware code is significantly faster than the non-NUMA-aware code for the table operations (about 20% faster) and for the message computation (about 12% faster).
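The composition of the kernel, (d − 1) multiplications followed by one marginalization, can be sketched as a single routine (Python, for exposition; the flat-index layout and the function name are assumptions, and each pass is parallelized over chunks in the actual code):

```python
def compute_message(F, in_msgs, k, r):
    """Message from a factor node to variable x_k.  in_msgs maps each
    neighboring variable index j != k to its incoming message; one pass over
    the r**d-entry table per multiplication, plus one marginalization pass,
    giving the O(d * r**d) FLOP count stated in the text."""
    F = F[:]                                   # work on a private copy
    for j, msg in in_msgs.items():             # the (d - 1) multiplications
        if j != k:
            for i in range(len(F)):
                F[i] *= msg[(i // r**j) % r]
    X = [0.0] * r                              # the single marginalization
    for i, f in enumerate(F):
        X[(i // r**k) % r] += f
    return X
```

After the first multiplication pulls each chunk into the owning core's cache and memory, the later passes reuse it, which is why only the first pass pays the remote-access cost.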
With dataset r = 2, d = 20, the improvement of the NUMA-aware code is about 20% for the first table multiplication, about 10% for the table marginalization, and about 1.2% for the message computation.

3.4.3 Experimental Results with Various Table Sizes

Using the NUMA-aware implementations, we evaluated the impact of table size on the performance of each table operation and of the complete belief propagation process. Table size directly reflects the amount of parallelism available in belief propagation. From Section 3.3.1, the number of entries of a factor potential table is |F| = prod_{k=1}^{d} r_k = r^d, assuming that r_k = r for k = 1, ..., d. The number of entries of a message to/from x_k is |X_k| = r. Double-precision floating-point numbers were used for the potential values. Thus, with d = 3 and r = 60, 100, 140, the sizes of the potential tables are 1.6 MB, 7.6 MB, and 20.9 MB, respectively. With r = 2 and d = 14, 18, 22, the sizes of the potential tables are 128 KB, 2 MB, and 32 MB, respectively.

Figure 3.8: Performance of table multiplication on the 40-core Intel Westmere-EX system with (a) d = 3 and various numbers of states of the variables, (b) r = 2 and various node degrees.

Figure 3.8 illustrates the execution time of table multiplication. In Figure 3.8a, with r = 140, table multiplication scales almost linearly with the number of threads. It achieves a speedup of about 39 when 40 threads are used. With r = 100, table multiplication also shows very good scalability, achieving a speedup of about 37.2 when 40 threads are used. With r = 60, the speedup of table multiplication tapers off when the number of threads increases beyond 10. The speedup is about 28 when 40 threads are used. In Figure 3.8b, with r = 2 and d = 14, 18, 22, the range of the potential table size is wider.
Similarly, the larger the potential tables, the higher the achieved speedup. When 40 threads are used, the speedups for table multiplication with d = 22 and d = 18 are 39.5 and 30.7, respectively. With r = 2, d = 14, the scalability of table multiplication is poor, achieving a speedup of only about 3.2 on 40 threads, because the potential tables are too small to offer sufficient parallelism.

Figure 3.9: Performance of table marginalization on the 40-core Intel Westmere-EX system with (a) d = 3 and various numbers of states of the variables, (b) r = 2 and various node degrees.

The execution time of table marginalization is illustrated in Figure 3.9. Although table marginalization requires the accumulation of the resulting message copies at the end, this table operation still scales very well with the number of threads for large potential tables. This is because, in the experiments, |F| >> |X_k|, making the time for accumulating the message copies very small compared with the entire time for table marginalization. Similar to the results for table multiplication, when 40 threads are used, with d = 3 and r = 140, 100, 60, the speedups for table marginalization are 38.8, 36.1, and 28, respectively. With r = 2 and d = 22, 18, 14, the speedups are 39.3, 30.4, and 2.9, respectively.

Figure 3.10: Overall performance of belief propagation on the 40-core Intel Westmere-EX system with (a) d = 3 and various numbers of states of the variables, (b) r = 2 and various node degrees.

The overall execution time for belief propagation is shown in Figure 3.10.
The results are consistent with the performance of the table operations. This is expected, since belief propagation performs a series of message computations, each composed of a set of table operations. Since the table operations scale well, the complete belief propagation also scales well. However, due to the synchronization required between the table operations, the overall speedup is slightly lower than the speedup of the table operations. Using 40 threads, we observed a speedup of about 39 for the factor graphs with r = 2, d = 22 or r = 140, d = 3.

Chapter 4

Task Parallelism for Exact Inference in Factor Graphs

For factor graphs with small potential tables, data parallel techniques exhibit limited scalability due to the lack of data parallelism. Task parallelism is provided by the computational independence across nodes that have no precedence relationship in the graph. These nodes can be updated in parallel when they are ready. This chapter presents our proposed task parallel techniques for belief propagation in factor graphs.

Many task parallel techniques have been proposed for belief propagation in factor graphs. However, most of these techniques are developed for loopy belief propagation in cyclic factor graphs, employing embarrassingly parallel algorithms [MSLB07, FSS11] or requiring graph partitioning [GLGO09]. Parallelizing belief propagation in acyclic factor graphs still remains a challenging problem due to the precedence constraints among the nodes in the graphs. Meanwhile, task scheduling has been extensively studied and used in parallel computing [Sin07, KA99, THW02, ARK08]. In [SYD09], task scheduling is shown to be an efficient tool for a class of linear algebra problems, known as regular applications, on general-purpose multi-core processors. Since belief propagation in acyclic factor graphs is an irregular application, task scheduling is an even more suitable tool for parallelizing it.
Thus, we approach this problem by defining a task dependency graph for belief propagation and then using a dynamic task scheduler [XP10a] to exploit the task parallelism available in the task dependency graph.

Our contributions in this chapter include:

- We define task dependency graphs for belief propagation in an acyclic factor graph based on the topology of the factor graph.

- We use a dynamic task scheduler to allocate tasks to cores. By using the dynamic task scheduling method with a work-sharing approach, we maximize task parallelism and optimize load balancing across the cores.

- We implement our techniques on a state-of-the-art multi-core system consisting of 40 cores. The experimental results show the efficiency of our techniques in accelerating belief propagation.

4.1 Prior Work

Many parallel techniques for inference in general graphical models have been proposed. In [KS94], parallelism for inference is explored in Bayesian networks. In [PM11, SSPM10], map-reduce is used for parallelizing inference in Conditional Random Fields and in general undirected graphical models. In [XP10d], the authors investigate data parallelism for exact inference with respect to Bayesian networks, where a Bayesian network must first be converted into a junction tree. Node-level primitives are proposed for evidence propagation in junction trees. The primitives, including table marginalization, table extension, and table multiplication/division, are optimized for distributed-memory platforms. In [XP10c], data parallelism in exact inference in junction trees is explored on Cell BE processors. Although all synergistic processing elements (SPEs) in a Cell BE processor access a shared memory, data must be explicitly moved to the local store of an SPE in order to be processed.

There are a few earlier papers studying parallelism for belief propagation in factor graphs [MSLB07, GLGO09].
However, while those techniques are proposed for general factor graphs, which leads to approximate inference, we focus on acyclic factor graphs for exact inference. In digital communications, parallel belief propagation has been implemented on FPGAs, GPUs, and multi-core processors [KC04, FSS11]. Similar to [MSLB07], these techniques apply to cyclic factor graphs and employ embarrassingly parallel algorithms. In [MXP11], data parallelism is explored for belief propagation at the node-level computation. This approach is suitable only if the node-level computation offers sufficient parallelism. Our approach explores the structural parallelism offered by acyclic factor graphs, based on the computational independence among nodes that do not have any precedence relationship.

The scheduling problem has been extensively studied on parallel systems for several decades [ARK08, KA99, THW02, SYD09]. Early algorithms optimized scheduling with respect to the specific structure of task dependency graphs [Cof76], such as a tree or a fork-join graph. In general, however, programs come in a variety of structures [KA99]. Scheduling techniques have been proposed by several emerging programming systems such as Cilk [BJK+96], Intel Threading Building Blocks (TBB) [Int], OpenMP [Ope], Charm++ [Cha], and MPI micro-tasking [OIS+06]. All these systems rely on a set of extensions to common imperative programming languages and involve a compilation stage and runtime libraries. These systems are not optimized specifically for scheduling DAGs on many-core processors. For example, the authors in [KD09] show that Cilk is not efficient for scheduling workloads in dense linear algebra problems on multi-core platforms. In [SYD09], a dynamic task scheduler is proposed as an optimized method to exploit parallelism for a class of linear algebra problems on general-purpose multi-core processors.
Because belief propagation in acyclic factor graphs is an irregular application with a DAG-structured computation, using a task scheduler is a suitable method for its parallel execution. Thus, in our approach, we use a task scheduler to exploit the task parallelism in the task dependency graph constructed for belief propagation in an acyclic factor graph.

4.2 Task Dependency Graph for Belief Propagation

4.2.1 Task Definition

Belief propagation is a process of passing messages along the edges. As given in Section 2.1.2, in an acyclic factor graph, a node computes and sends a message to its neighbor after receiving messages from all the other neighbors. The process proceeds by first converting the factor graph into a rooted tree with a node chosen to be the root. Then, in this rooted tree, messages are propagated from the leaves to the root and then from the root to the leaves. Note that a leaf node is connected to only a single node. Thus, by the definition of message calculation (Eqs. 2.4 and 2.5), only non-leaf nodes in the factor graph need to compute new messages based on incoming messages. Figure 4.1a shows a factor graph whose non-leaf nodes form the skeleton of the factor graph.

4.2.2 Task Graph Construction

It is natural to construct a task dependency graph for belief propagation in which the computation at each node in the skeleton becomes a task. We denote this as the First method. Tasks are defined as follows. On the way up from the leaves to the root of the skeleton, the task at each node is the computation of the message sent to the parent. The computation of a message at a variable node and at a factor node is based on Eqs. 2.4 and 2.5, respectively. On the way down from the root to the leaves of the skeleton, the task at each node is the computation of all the messages sent to the children.
Figure 4.1: (a) A factor graph with its skeleton formed by the non-leaf nodes (leaf and non-leaf variable and factor nodes, with the root marked), and (b) the corresponding task dependency graph.

These tasks have different task sizes depending on the complexity of the computations. Let d denote the node degree and let r denote the number of states of each variable. On the way up, the task complexity is w_uf = d·r^d at a factor node and w_uv = d·r·d at a variable node. On the way down, the task complexity at a factor node is w_df = d^2·r^d, and at a variable node it is w_dv = d^2·r·d. A detailed implementation of the message computation can be found in [MXP11], with the assumption that each message is computed by only one thread.

The task dependency graph is constructed from the skeleton of a factor graph as illustrated in Figure 4.1. The top half of the task dependency graph corresponds to the process of propagating messages from the leaves to the root; the bottom half corresponds to the process of propagating messages from the root to the leaves of the skeleton. It can be seen that the task parallelism is affected by the number of children of each node in the skeleton and by the total number of nodes.

Figure 4.2: An alternative way to construct a task dependency graph from a factor graph.

In the task dependency graph given by Figure 4.1b, each task in the bottom half is the computation of all the messages sent to the children of the corresponding node in the factor graph. Alternatively, we can define a task as the computation of only one message. The new tasks then correspond to the edges in the factor graph. A task in the bottom half of the task dependency graph in Figure 4.1b becomes several smaller independent tasks in the resulting task dependency graph. Hence, this task dependency graph has more parallelism and smaller tasks. This alternative construction of the task graph is illustrated in Figure 4.2.
In this work, we use the First method to construct the task dependency graph, as given in Figure 4.1.

4.3 Task Scheduling for Belief Propagation

The task dependency graph constructed above ensures the precedence order of propagating messages in the input factor graph. It also exhibits the task parallelism available in belief propagation: ready tasks that have no precedence constraints can be executed in parallel. To exploit this task parallelism, we use a task scheduling method that dynamically distributes ready tasks to threads based on the current workload of the threads [XP10a]. The scheduling method is briefly described as follows.

The input task dependency graph is represented by a list called the Global task list (GL), which all the threads have access to. Each element of the GL stores a task and the related information, such as the dependency degree, the task weight, the successors, and the task meta data (e.g., application-specific parameters). The dependency degree of a task is initially set to the number of incoming edges of the task. We keep the successors, which are pointers to the elements of the succeeding tasks, along with each task. Each element of the GL also stores task meta data, such as the task type and pointers to the data buffer of the task. A shared Completed task buffer is used to store the IDs of tasks that have completed execution.

The Allocate module in each thread allocates ready tasks from the GL to the threads for execution. A task in the GL is ready when its dependency degree becomes zero. The Allocate module pops each task from the Completed task buffer and decrements the dependency degree of each successor of the task. If the dependency degree of a successor reaches zero, it is allocated by the Allocate module to the thread that has the smallest workload. Each thread has a Local task list (LL) to store the ready tasks that will be executed in this thread.
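The allocation logic, dependency-degree bookkeeping plus least-loaded placement, can be sketched as a serial simulation (Python, for exposition; the actual scheduler runs these steps concurrently in Pthreads, and the names GL/LL follow the text while the function name is illustrative):

```python
from collections import deque

def simulate_schedule(tasks, P):
    """tasks maps task id -> (weight, list of successor ids), i.e. the GL.
    Returns the per-thread local lists (LLs) and the execution order."""
    dep = {t: 0 for t in tasks}               # dependency degrees
    for _, succs in tasks.values():
        for s in succs:
            dep[s] += 1
    load = [0.0] * P                          # weight counter of each LL
    local = [[] for _ in range(P)]            # the local task lists
    ready = deque(t for t in tasks if dep[t] == 0)
    order = []
    while ready:
        t = ready.popleft()
        p = min(range(P), key=lambda q: load[q])   # least-loaded thread
        local[p].append(t)
        load[p] += tasks[t][0]
        order.append(t)                       # task t "executes" here
        for s in tasks[t][1]:                 # completion: notify successors
            dep[s] -= 1
            if dep[s] == 0:
                ready.append(s)
    return local, order
```

On a diamond-shaped graph A -> {B, C} -> D, for example, the simulation executes A first and D last, and splits B and C across the two least-loaded threads.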
Any thread can add tasks to an LL, but only the owner thread of that LL can remove its tasks. This organization reduces the access conflicts among the threads, hence reducing the overhead of scheduling. Each LL has a weight counter to keep track of the current workload of the thread. When a task is added to an LL, the LL's weight counter is updated by the task weight. Each thread has an Execute module that picks the ready tasks in its own LL for execution. This module performs the task based on the meta data associated with the task. When completing the execution of a task, this module stores the ID of the task in the Completed task buffer and sends a notice to the Allocate module in its thread. The weight counter of its LL is also updated. Hence, the task scheduler helps maximize the task parallelism of the task graph, since all ready tasks are allocated to a thread for parallel execution. Because the scheduler allocates tasks to the thread that has the smallest workload, load balancing is maintained during execution.

4.4 Experiments

4.4.1 Experimental Setup

Platforms and Implementation Setup

We conducted experiments on the 40-core Intel Westmere-EX system shown in Figure 3.2. The system consists of four Xeon E7-4860 processors fully connected through 6.4 GT/s QPI links. Each processor has 10 cores running at 2.26 GHz, 24 MB of L3 cache, and 16 GB of DDR3 memory. We used Pthreads for the task scheduler. Hyperthreading was disabled. In our initial experiments, we noted that running multiple threads on each core showed no improvement. Thus, we initiated as many threads as the number of hardware cores and bound each thread to a separate core. These threads persisted over the entire program execution and communicated with each other using the shared memory. We evaluated the scalability of our proposed method using various numbers of threads up to 40.
Datasets

To examine the impact of the task size and the amount of task parallelism on the execution time, we generated synthetic factor graphs with various skeleton structures. A skeleton was generated as a rooted tree in which the root is a factor node. The three types of skeletons used in our experiments are described below.

A Balanced-tree skeleton was generated as a tree in which every node, except the leaf nodes, has an identical number of children c. A balanced-tree skeleton offers sufficient parallelism to achieve high speedup.

A Slim-tree skeleton, as shown in Figure 4.3, was formed by first creating a balanced binary tree with b (b > 1) leaf nodes. Each of these leaf nodes was then connected to a chain consisting of ceil((n − (2b−1))/b) nodes. Increasing b increases the amount of structural parallelism.

Figure 4.3: The slim-tree skeleton of factor graphs, with b chains.

A Random-tree skeleton was generated based on a random graph generator described in [THW02]. The parameters used to specify a random tree include: (1) the number of nodes n, (2) the tree height h, and (3) the maximum number of children of each node c_m. In the experiments, we kept n = 2000 and varied h and c_m to get various tree shapes. For each pair of h and c_m, we generated 20 random trees, from which we computed the average m and the average c and measured the average execution time.

Performance Metrics and Baselines

We used execution time and speedup to evaluate the performance of our method. The speedup on P cores is calculated as the ratio of the execution time on a single core to the execution time on P cores. We compared our proposed method with two parallel baselines. The first baseline used OpenMP directives to parallelize the message computation at the node level. The second baseline used the data parallel technique for the message computation [MXP11]; it has shown very good results when the potential tables are large. The comparison of the performance of our method with these two baselines is given in the next section.
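The slim-tree construction described above can be sketched as follows (Python; the function name and the heap-style node numbering are assumptions for exposition, with the b leaves of the binary tree at indices b−1 .. 2b−2):

```python
import math

def slim_tree(n, b):
    """Children lists for a slim-tree skeleton: a balanced binary tree with
    b leaves (2b - 1 nodes), each leaf extended by a chain of
    ceil((n - (2b - 1)) / b) nodes."""
    children = {i: [] for i in range(2 * b - 1)}
    for i in range(b - 1):                 # internal nodes of the binary tree
        children[i] = [2 * i + 1, 2 * i + 2]
    chain_len = math.ceil((n - (2 * b - 1)) / b)
    nxt = 2 * b - 1
    for leaf in range(b - 1, 2 * b - 1):   # attach one chain per leaf
        tail = leaf
        for _ in range(chain_len):
            children[tail].append(nxt)
            children[nxt] = []
            tail = nxt
            nxt += 1
    return children
```

With n = 2047 and b a power of two, the counts work out exactly, e.g. b = 32 gives 63 binary-tree nodes plus 32 chains of 62 nodes, for 2047 in total.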
4.4.2 Experimental Results

Notation Summary

We summarize the notation used in our experimental results as follows:

- n: number of nodes in the skeleton of a factor graph (i.e., the number of non-leaf nodes in the factor graph).
- m: number of factor nodes in the skeleton.
- c: number of children of each node in the skeleton.
- d: degree of a node (i.e., the number of neighbors of a node) in the factor graph.
- r: number of states of a variable.
- b: number of chains of the slim-tree skeleton.
- h: height of the skeleton.
- P: number of threads (cores).

Comparison with Baselines

We compare our method with the two baselines using a balanced-tree dataset and a random-tree dataset. The balanced-tree dataset has r = 2, d = 12, n = 2000, c = 4, leading to m = 908. The random-tree dataset has r = 2, d = 14, n = 1024, h = 15; the generated factor graph has an average number of children c = 4.4 and m = 723. Figure 4.4 shows the experimental results for the three methods using these two datasets on a 10-core Intel Westmere-EX processor. In both datasets, the proposed method scales much better than the two baselines. Especially for the balanced-tree dataset, the proposed method achieves almost linear speedup with the number of threads, while the two baselines have very limited speedup.

Figure 4.4: Speedups of belief propagation using various methods on a 10-core Westmere-EX processor for (a) the balanced-tree factor graph and (b) the random-tree factor graph.

Table 4.1: Overall complexity and data size for slim-tree factor graphs with n = 2047, r = 2, d = 14, and various values of b.

  b     m      overall complexity (m·d^2·r^d)   data size in MB (8·m·r^d)
  16    1029   3.3E+09                          128
  32    1013   3.3E+09                          126
  64    1045   3.4E+09                          130
  128   981    3.2E+09                          122
From the two datasets, it can be seen that the balanced-tree dataset has a smaller task size yet more task parallelism compared to the random-tree dataset.

Table 4.2: Overall complexity and data size for balanced-tree factor graphs with n = 2000, r = 2, d = 14, and various values of c.

  c   m      overall complexity (m·d^2·r^d)   data size in MB (8·m·r^d)
  2   1318   4.2E+09                          165
  4   908    2.9E+09                          114
  6   1334   4.3E+09                          167

Impact of Task Parallelism

In factor graphs with a slim-tree skeleton, the amount of task parallelism varies with b: at any time, at most b tasks can be executed in parallel. In this experiment, slim-tree factor graphs with n = 2047, r = 2, d = 14, and b = 32, 64, 128 were used. The task size is fixed by r and d; the size of the potential table at each factor node is r^d = 2^14 entries. The number m of factor nodes in the skeleton varies with b. The overall serial complexity and the data size are given in Table 4.1. The table shows that the datasets have similar overall complexities. The total size of the data is the number of entries in all the potential tables multiplied by 8 bytes (a double-precision value for each potential value). The data size is larger than the total cache size, guaranteeing that there was no cache advantage in the experiments.

Figure 4.5 shows the execution times and speedups of the proposed method for the slim-tree factor graphs on the 40-core Westmere-EX system. The execution times are approximately identical when 1 thread is used. As the number of threads increases, belief propagation in factor graphs with larger b achieves higher speedup. When b is smaller than P, the speedup is limited by b no matter how many threads are used. The achieved speedup is actually smaller than b due to the overhead of the scheduler and the limited task parallelism in the balanced-tree part of the graph. For b = 128, the proposed method achieves a speedup of 31 using 40 threads.
Figure 4.5: Performance of belief propagation in slim-tree factor graphs with n = 2047, r = 2, d = 14, and various values of b on the 40-core Intel Westmere-EX system.

Figure 4.6: Performance of belief propagation in balanced-tree factor graphs with n = 2000, r = 2, d = 14, and various values of c on the 40-core Intel Westmere-EX system.

For the balanced-tree datasets, the greater the number c of children of each node, the larger the amount of task parallelism. The datasets have n = 2000, r = 2, d = 14, and various values of c, which yield various values of m. The task sizes and potential table sizes were again fixed by r and d. The overall complexity and data size are given in Table 4.2. With c = 4, the generated factor graph has a smaller number m of non-leaf factor nodes than with c = 2 or c = 6. Thus, as can be seen in Figure 4.6, the dataset with c = 4 has a smaller execution time than the datasets with c = 2 and c = 6. Regarding scalability, the greater the value of c, the higher the achieved speedup.

Table 4.3: Task size, overall complexity, and data size for balanced-tree factor graphs with n = 2000, c = 4, and various values of r and d.

  r    d    m     w_uf      w_df       w_uv      w_dv       overall complexity   data size in MB
             (d·r^d)   (d^2·r^d)  (d·r·d)   (d^2·r·d)  (m·d^2·r^d)          (8·m·r^d)
  2    16   908   1.0E+06   1.7E+07    5.1E+02   8.2E+03    1.5E+10              454
  2    14   908   2.3E+05   3.2E+06    3.9E+02   5.5E+03    2.9E+09              113
  2    12   908   4.9E+04   5.9E+05    2.9E+02   3.5E+03    5.4E+08              28
  2    10   908   1.0E+04   1.0E+05    2.0E+02   2.0E+03    9.3E+07              7
  12   5    908   1.2E+06   6.2E+06    3.0E+02   1.5E+03    5.6E+09              1723
  8    5    908   1.6E+05   8.2E+05    2.0E+02   1.0E+03    7.4E+08              227
  4    5    908   5.1E+03   2.6E+04    1.0E+02   5.0E+02    2.3E+07              7
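The entries of Table 4.3 follow from the task-weight formulas of Section 4.2.2; a few rows can be recomputed directly (Python sketch, function name illustrative):

```python
def task_weights(r, d, m=908):
    """w_uf = d*r^d, w_df = d^2*r^d, overall complexity m*d^2*r^d, and
    data size 8*m*r^d bytes (reported in MB), as in Table 4.3."""
    w_uf = d * r**d
    w_df = d**2 * r**d
    total = m * d**2 * r**d
    mb = 8 * m * r**d / 2**20
    return w_uf, w_df, total, mb
```

For r = 2, d = 14 this gives w_uf about 2.3E+05, w_df about 3.2E+06, an overall complexity near 2.9E+09, and roughly 113 MB of data, matching the table.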
For c = 6, the proposed method scales almost linearly with the number of threads, achieving a speedup of 37 using 40 threads. The speedups for c = 2 and c = 4 are also very good, much higher than the speedups achieved for the slim-tree factor graphs. The reason is that the scheduler kept all cores busy because of the sufficient parallel activity provided by the balanced-tree factor graphs.

Impact of Task Size

In this experiment, we used balanced-tree datasets with a fixed amount of task parallelism. Various values of r and d were used to examine the impact of task sizes on the performance of our method. The datasets have n = 2000 and c = 4, resulting in m = 908. The four task types, given in Section 4.2.2, have different task sizes (i.e., task weights). Table 4.3 provides the task sizes, the overall complexity, and the total data size for the datasets with various values of r and d. As can be seen in this table, tasks at factor nodes are far larger than tasks at variable nodes (d·r^d vs. d·r·d). Also, tasks performed during the downward propagation process are larger than tasks performed during the upward propagation process by a factor of d. When d is fixed and r is varied, the task sizes and the potential table sizes vary greatly.

The impact of task size on the performance of the proposed method is illustrated in Figures 4.7 and 4.8. The results show that the larger the tasks, the better the achieved speedup. For the datasets with r = 2 and various values of d, the achieved speedups vary only slightly. Using 40 threads, the proposed method achieves a speedup of 33 for d = 10 and a speedup of 35 for d = 16. For the datasets with d = 5 and various values of r, the achieved speedups vary more widely. Using 40 threads, the proposed method achieves a speedup of 28 for r = 8 and a speedup of 35 for r = 12. The results are expected, since larger tasks help reduce the ratio of the scheduler overhead to the overall execution time.
The results also show that our method can achieve good scalability even for factor graphs with a small amount of computation at the node level.

Results on Random Trees

In the random-tree datasets, the number c of children of each node in the skeleton was generated randomly from a uniform distribution over [0, c_m]. We fixed the number n of nodes and varied the tree height h and c_m to get different shapes for the skeleton. In particular, the datasets have n = 2000 and h chosen from {5, 15, 25}. The value of c_m was set to 20 for h = 5, 15 for h = 15, and 10 for h = 25. For each value of h, we generated a set of 20 random trees. The smaller the tree height h, the more "bushy" the tree is, and hence the more task parallelism is available. In this experiment, we kept r = 2 and d = 14. Table 4.4 shows the average values of m and c, the overall complexity, and the data size for each value of h. As expected, the average value of c decreases as h increases. The average value of m also slightly decreases as h increases.

Table 4.4: Overall complexity and data size for random-tree factor graphs with n = 2000, r = 2, d = 14, and various values of h.

h    average m    average c    overall complexity    data size (MB)
5    1701         11.4         5.46E+09              213
15   1634         8.7          5.25E+09              204
25   1505         6.3          4.83E+09              188

Figure 4.7: Performance of belief propagation in balanced-tree factor graphs with n = 2000, c = 4, r = 2, and various values of d on the 40-core Intel Westmere-EX system. (a) Execution Time; (b) Speedup.

Figure 4.9 shows the performance of our method on the generated random-tree factor graphs. We measured the average execution time for each set of factor graphs corresponding to h. The execution times reflect the complexity and scalability of the proposed method for each set of factor graphs.
Using a single thread, as h increases, the execution time slightly decreases, corresponding to the average value of m. As expected, the method shows the best scalability for factor graphs with h = 5, achieving a speedup of 36.6 using 40 threads. The smaller the tree height, the better the speedup. The speedups for random-tree factor graphs are almost as good as the speedups for balanced-tree factor graphs.

Figure 4.8: Performance of belief propagation in balanced-tree factor graphs with n = 2000, c = 4, d = 5, and various values of r on the 40-core Intel Westmere-EX system. (a) Execution Time; (b) Speedup.

Figure 4.9: Performance of belief propagation in random-tree factor graphs with n = 2000, r = 2, d = 14, and various values of h on the 40-core Intel Westmere-EX system. (a) Execution Time; (b) Speedup.

Chapter 5
DAG Scheduling with Weak Dependency

General-purpose multi-core processors such as the AMD Opteron and Intel Xeon have become prevalent in servers, workstations, and even personal computers. Parallel programs executed on those platforms can be represented as a task dependency graph, i.e., a directed acyclic graph (DAG) with weighted nodes. Nodes represent code segments. An edge from node v to node v' denotes that the output from the code segment at v is an input to the code segment at v'. The weight of a node represents the (estimated) execution time of the corresponding code segment. Computations that can be represented as task dependency graphs are called DAG structured computations [ARK08, KA99]. Exact inference in graphical models, as shown in Chapter 4, is an example of DAG structured computations.
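As a small, self-contained illustration of such a graph (the node names and weights below are invented for the example), a weighted DAG can be stored as an adjacency structure, and its longest weighted path gives a lower bound on the parallel execution time under any scheduler:

```python
# A minimal weighted-DAG representation of a task dependency graph.
import functools

graph = {                     # node -> list of successors
    "A": ["C"], "B": ["C"], "C": ["D"], "D": [],
}
weight = {"A": 2, "B": 1, "C": 3, "D": 1}   # estimated execution times

def critical_path(graph, weight):
    """Longest weighted path: a lower bound on parallel execution time."""
    @functools.lru_cache(maxsize=None)
    def longest_from(v):
        return weight[v] + max((longest_from(s) for s in graph[v]), default=0)
    return max(longest_from(v) for v in graph)

print(critical_path(graph, weight))  # -> 6 (the path A -> C -> D)
```

No scheduler can finish faster than this bound, which is why graph shape (and, later in this chapter, weak dependency) matters as much as raw core count.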
The overall execution time of a program is sensitive to scheduling [CK00, DVKT06, TG99]. It remains a fundamental challenge in parallel computing to dynamically schedule tasks in a task dependency graph. The traditional approach for scheduling tasks in a DAG is to process a task only after all its preceding tasks complete, so as to preserve precedence constraints among tasks. However, we observe that for a class of DAGs, such as exact inference in junction trees or LU factorization in linear algebra, a task can be partially executed with respect to each preceding task. Hence, a task can start execution after any subset of its preceding tasks has completed. We call such a relationship in a DAG weak dependency. We show that weak dependency can significantly increase task parallelism. This chapter presents our task scheduling method to exploit the parallelism given by weak dependencies in a DAG. Our contributions in this chapter include:

- We discuss the impact of weak dependency on scheduling DAG structured computations on parallel computing platforms. We show that a scheduling method exploiting weak dependency can explore more parallelism and can reduce the overall execution time.

- We propose a scheduler that can take advantage of weak dependency. We discuss the scheduler in a modular fashion, so that it can be easily adapted to various DAG structured computations by using plug-and-play modules specific to an application.

- We implement the proposed scheduling method on a general-purpose multi-core platform. We use exact inference, a classic AI technique, to evaluate our scheduling method. We show that the proposed scheduler outperforms the baseline for various task graphs, especially those having limited structural parallelism and tasks with large indegree.

5.1 Prior Work

A parallel program can be represented by a directed acyclic graph (DAG), where each node represents a task and each edge corresponds to precedence constraints among the tasks.
Each task in the DAG is associated with a weight, which is the estimated execution time of the task. Traditionally, a task can start execution only if all of its predecessors have completed [AKW96, Sin07]. The task scheduling problem is to map the tasks onto the threads of a parallel computing system in order to minimize the overall execution time. Task scheduling is in general an NP-complete problem [GJ90, PY88]. We consider scheduling an arbitrary DAG with given task weights and perform the mapping and scheduling of tasks on-the-fly. The goal of such dynamic scheduling includes not only the minimization of the overall execution time, but also the minimization of the scheduling overhead [KA99].

The scheduling problem has been extensively studied for several decades [ARK08, BHR09, KA99, THW02, SYD09]. Early algorithms optimized scheduling with respect to specific structures of task dependency graphs [Cof76], such as a tree or a fork-join graph. In general, however, programs come in a variety of structures [KA99]. Karamcheti and Chien studied a hierarchical load balancing framework for multithreaded computations, employing various scheduling policies for a system [KC98]. Dongarra et al. proposed dynamic schedulers optimized for a class of linear algebra problems on general-purpose multi-core processors [SYD09]. Scheduling techniques have also been proposed by several emerging programming systems such as Cilk [BJK+96], Intel Threading Building Blocks (TBB) [Int], OpenMP [Ope], Charm++ [Cha], and MPI micro-tasking [OIS+06]. All these systems rely on a set of extensions to common imperative programming languages, and involve a compilation stage and runtime libraries. All of these scheduling techniques follow the traditional precedence constraints. In our work, we focus on scheduling tasks in DAGs with weak dependencies on general-purpose multi-core processors.
5.2 Weak Dependency in Task Dependency Graphs

5.2.1 Definition of Weak Dependency

Weak dependency in task dependency graphs is defined as follows:

Definition 1. Given a task dependency graph as a DAG, performing a task T in the DAG requires input data X = {x_0, x_1, ..., x_{d-1}} from its immediate predecessors T_0, T_1, ..., T_{d-1}, where d is the number of immediate predecessors of T, and x_i is the output of T_i, 0 <= i < d. Assume that task T computes a function f(X). The dependency between T and its immediate preceding tasks is called weak dependency if the function f(X) is defined recursively by:

    ∃ X* ⊆ X,  f(X) = f( f(X*), f(X \ X*) )        (5.1)

where X* ≠ ∅ is the output data of the corresponding subset T* of preceding tasks, and all data in X is required for performing T. Note that f(X*) can be computed without the availability of X \ X*.

Figure 5.1: A fragment of a task dependency graph with weak dependencies.

From the definition of weak dependency, the set of all the predecessors of a task can be divided into mutually exclusive subsets of tasks, as illustrated in Figure 5.1. Each subset is used for executing T independently. Thus, as soon as any subset T* completes execution, T can start executing using the corresponding data X*. Note that for each subset, T needs to wait until all the tasks in the subset complete before it can proceed. A special case of weak dependency is when X* = X, that is, T must wait until all its predecessors complete execution. This is the traditional approach in DAG scheduling. We refer to this case as "strong" dependency.

Figure 5.2: Evidence collection using (a) the traditional scheduling method, (b) the weak dependency based scheduling method.
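A minimal sketch of the decomposition in Eq. 5.1, with a toy combine function standing in for the task computation (the product used here is illustrative only; it is not the clique update used by the application):

```python
# Toy weak-dependency example: f multiplies together the messages from all
# predecessors. Because the combine operation is associative and
# commutative, f(X) = f(f(X*), f(X \ X*)) holds for any nonempty subset X*,
# so a task may consume predecessor outputs as they become available.
from math import prod

def f(messages):
    return prod(messages)

X = [2.0, 3.0, 5.0]           # outputs of d = 3 predecessors
X_star, rest = X[:1], X[1:]   # any split into nonempty subsets works
assert f(X) == f([f(X_star), f(rest)])
print(f(X))  # -> 30.0
```

The assertion holds for every split point, which is exactly the freedom a weak-dependency scheduler exploits.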
Another special case of weak dependency is when every subset has exactly one task. In this case, T has "fully" weak dependencies with all of its predecessors: T is allowed to execute using any x_i as soon as x_i becomes available, even if x_j, j ≠ i, is not yet available.

5.2.2 A Motivating Example

Consider evidence collection in a junction tree with 9 cliques, as given in Figure 5.2. Note that in evidence collection, the evidence is propagated from the leaves (C_1, C_3, C_4, C_6, C_7, C_8) to the root (C_0). Nodes marked with red boxes represent the cliques being updated. Nodes marked with dark shadow represent the already-updated cliques.

Using the traditional scheduling method, evidence collection in the junction tree proceeds as illustrated in Figure 5.2a. Accordingly, clique C_2 needs to wait until C_4, C_5, and C_6 complete execution. Then, C_2 is updated sequentially from the inputs C_4, C_5, and C_6. Similarly, C_0 needs to wait for, and is then updated from, C_1, C_2, and C_3.

Using the weak dependency based scheduling method, evidence collection in the junction tree proceeds as illustrated in Figure 5.2b. Accordingly, C_2 does not have to wait until C_4, C_5, and C_6 complete; it can start to be updated right after any of them completes execution. Similarly, C_0 can start to be updated when any of C_1, C_2, and C_3 completes. Thus, during the execution, C_0, C_2, C_5 can be updated in parallel using C_3, C_6, C_8, and then using C_1, C_4, C_7, respectively.

Assume that updating a leaf clique or updating a clique using another clique takes unit time.
On a parallel random access machine (PRAM) with a sufficient number of processors, the traditional scheduling method requires one unit of time to update the leaf cliques {C_1, C_3, C_4, C_6, C_7, C_8}, two units to update C_5 from {C_7, C_8}, three units to update C_2 from {C_4, C_5, C_6}, and three units to update C_0 from {C_1, C_2, C_3}, resulting in nine units of time in total. The weak dependency based scheduling method requires one unit of time to update {C_1, C_3, C_4, C_6, C_7, C_8}, one unit to update {C_0, C_2, C_5} from {C_3, C_6, C_8}, one unit to update {C_0, C_2, C_5} from {C_1, C_4, C_7}, one unit to update C_2 from C_5, and one unit to update C_0 from C_2, resulting in five units of time in total.

5.3 DAG Scheduling with Weak Dependencies

5.3.1 Components of the Scheduler

We exploit weak dependencies in DAGs using a straightforward scheduling scheme that consists of a single task allocator allocating ready tasks to threads for execution. Each thread is bound to a core and persists over the entire program execution. Locks are used to protect shared data.

Figure 5.3: (a) A portion of a task dependency graph; (b) the corresponding representation of the global task list (GL); (c) the data of task T_i in the GL.

The detailed modules of the scheduler are discussed as follows. The input task dependency graph is represented by a list called the global task list (GL), accessible by all the threads. Figure 5.3 shows a portion of a task dependency graph, the corresponding part of the GL, and the data of element T_i in the GL. Each element in the GL consists of the task ID, dependency degree, task weight, successors, thread ID, and task meta data (e.g., application-specific parameters). The task ID is the unique identity of a task. The dependency degree of a task is initially set to the number of incoming edges of the task.
The task weight is the estimated execution time of the task. We keep the successors, which are pointers to the elements of the succeeding tasks, along with each task. The thread ID shows the thread in which the task is executing; it is used for allocating task copies as described below. Each element also holds task meta data, such as the task type and pointers to the data buffers of the task.

Each thread has a local task list (LL) for tasks that are ready to be processed. However, to exploit weak dependencies in a task graph, each element in the LL is not the real task as in the GL but a copy of the task. Each copy of a task consists of a pointer to the corresponding task in the GL and a pointer to the preceding task in the GL. The number of copies of a task is equal to the number of its incoming edges (node degree). When a copy of a task completes execution, the dependency degree of the corresponding task in the GL is decreased by 1. Thus, when all the copies of a task finish execution, its dependency degree becomes zero, and its successors become ready to execute. Then, a copy of each successor of the task is created and inserted into one of the LLs. Each LL has a weight counter associated with it to record the total workload of the tasks currently in the LL. Once the first copy of a task is inserted into an LL or the last copy of a task is fetched from an LL, the weight counter is updated.

The Allocate module of the scheduler creates a copy of a task and inserts it into one of the LLs. The Completed Task ID buffer of the scheduler stores the IDs of tasks that have completed execution. Initially, all the Completed Task ID buffers are empty. For each task in the Completed Task ID buffer, the Allocate module creates a copy for each of its successors and inserts that copy into an LL. Each such copy records that completed task as its preceding task.
Since all the copies of a task need to be processed one after another, we assign all of them to the same thread for data locality. Accordingly, when the first copy of a task is assigned to a thread, the thread ID is recorded in the corresponding task, and the later copies of the task are assigned to the thread given by the recorded thread ID. Heuristics can be used for task allocation to balance the workload across the threads. In our design, we allocate the first copy of a task to the thread with the smallest workload at the time the task becomes ready.

Each thread has a Fetch module that takes a copy of a task from its LL and sends the task-copy to the Execute module in the same thread for execution. Heuristics can be used by the Fetch module to select elements from the LL; for example, tasks with more children can be given higher priority for execution [KA99]. We use a straightforward method that selects the element at the head of the LL. Once a task-copy is fetched for execution, the dependency degree of the corresponding task is decremented. Once the dependency degree becomes 0, i.e., the last copy of the task has been processed, the Fetch module sends the ID of the task to the Completed Task ID buffer, so that the Allocate module can accordingly create and schedule copies of the successors of the task.

5.3.2 A Weak Dependency Based Scheduler

Based on the scheduling framework of Section 5.3.1, we present a sample implementation of the weak dependency based scheduling method in Algorithm 5. The notations used in the algorithm are given as follows. N denotes the total number of tasks in the GL. LL_t denotes the local task list in Thread t, 0 <= t < P. tid_T, d_T, and w_T denote the thread ID, dependency degree, and weight of task T, respectively. T_i^j denotes a copy of task T_i that will be executed based on the preceding task T_j. W_t is the weight counter of Thread t, i.e.
the total weight (estimated execution time) of the tasks currently in LL_t. C_i denotes the clique corresponding to task T_i.

Lines 1-3 in Algorithm 5 initialize the thread IDs of the tasks and the local task lists. The thread ID of a task is initialized as NIL, and is updated when the first copy of the task is allocated; the later copies of the task are allocated to the same thread. In Lines 4-31, the algorithm performs task scheduling iteratively until all the tasks are processed. Lines 6-12 correspond to the Fetch module, which fetches a task-copy from the head of the local task list and decrements the dependency degree of the corresponding task (also the number of unprocessed copies of the task). Lines 9-11 add the ID of the fetched task to the Completed Task ID buffer and update W_t accordingly when the last copy of the task is processed. Lines 14-29 correspond to the Allocate module, which creates copies of the tasks in the Completed Task ID buffer (Line 17), then allocates those copies to a target thread (Line 20 or 24). Only thread 0 is responsible for allocating tasks to all the threads (Line 14). We choose the target thread for the first copy of a task to be the one with the smallest workload (Line 21), although alternate heuristics can be used. Line 13 corresponds to the Execute module, where the fetched task-copy T_i^p is executed. We include the execution of evidence collection as an illustrative application: in the execution of T_i^p for evidence collection, clique C_i is updated from its preceding clique C_p according to Equation 2.2. Other DAG structured computations can easily be plugged in.

Algorithm 5: An Implementation of Weak Dependency Based Scheduling for Evidence Collection in a Junction Tree

{Initialization}
 1: let tid_i = NIL, 0 <= i < N
 2: let S = {T_i | d_i = 0}, 0 <= i < N
 3: evenly distribute a task-copy T_i^NIL of each T_i in S to LL_t, 0 <= t < P
{Scheduling}
 4: for Thread t (t = 0 ... P-1) pardo
 5:   while GL ≠ ∅ or LL_t ≠ ∅ do
      {Fetch module}
 6:     fetch a task-copy T_i^p from the head of LL_t
 7:     d_{T_i} = d_{T_i} - 1
 8:     if d_{T_i} = 0 then
 9:       remove task T_i from GL
10:       add the ID of T_i to the Completed Task ID buffer
11:       W_t = W_t - w_{T_i}
12:     end if
      {Execute module}
13:     execute T_i^p: update C_i from C_p according to Eq. 2.2
      {Allocate module at Thread 0}
14:     if t = 0 then
15:       for all T_i in the Completed Task ID buffer do
16:         for all T_s in {successors of T_i} do
17:           create a task-copy T_s^i of T_s
18:           if tid_{T_s} = NIL then
19:             let j = argmin_{t=1...P} (W_t)
20:             append T_s^i to LL_j
21:             let tid_{T_s} = j
22:             W_j = W_j + w_{T_s}
23:           else
24:             append T_s^i to LL_{tid_{T_s}}
25:           end if
26:         end for
27:       end for
28:       Completed Task ID buffer = ∅
29:     end if
30:   end while
31: end for

5.4 Experiments

5.4.1 Experimental Setup

Platforms

We conducted experiments on an 8-core AMD Opteron based system. The system consists of dual AMD Opteron 2350 (Barcelona) quad-core processors running at 2.0 GHz, and 16 GB of DDR2 memory shared by all the cores. The operating system was CentOS version 5 Linux. We used the GCC 4.1.2 compiler on this platform. Since the AMD Opteron does not support more than one hardware thread per core, we mostly conducted experiments using one thread on each core.

Baseline Implementation

To compare with the proposed scheduler, we implemented a baseline scheduler using the traditional scheduling approach. The baseline scheduler is similar to the proposed scheduler, except for the handling of the precedence constraints in the DAG. The baseline scheduler allowed a task to be executed only when all its preceding tasks had completed.

Figure 5.4: A Pine Tree and a Multipine Tree.
Thus, there is no need for task copies. The local task lists contained the elements of task data as in the global task list. The Allocate module must keep track of the number of completed preceding tasks of a task. When a task completed execution, the Allocate module decremented the dependency degree of each successor of that task. When the dependency degree of a task became 0, the Allocate module assigned the task to the LL with the lightest workload and updated the corresponding weight counter. The Fetch module fetched a task from the LL, updated the weight counter, and placed the ID of the task into the Completed Task ID buffer. In the Execute module, a task updated its clique from all its preceding cliques sequentially. Spinlocks were used for the ready task lists and the Completed Task ID buffer.

Datasets

We evaluated our proposed scheduling method using evidence collection for various junction trees. They can be grouped into three types of trees: Pine Tree, Arbitrary Tree, and Balanced Tree.

- A Pine Tree, as illustrated in Figure 5.4, was formed by first creating a chain of cliques; each clique on this chain was then connected to (d-1) cliques. Thus, nodes on the principal chain have the same node degree (d). As shown in Section 5.2.2, this type of graph can significantly limit the performance of the baseline scheduler, since the root node has to wait for a long time for the cliques on the principal chain to complete. In contrast, the weak dependency based scheduler can still execute cliques on the principal chain in parallel.

- A Multipine Tree was formed by connecting b Pine Trees together to obtain more parallelism. One Pine Tree was used as the main branch, in which some leaf nodes became the roots of the (b-1) remaining Pine Trees, distributed evenly on the main branch.

- An Arbitrary Tree was generated based on the parametric random graph described in [THW02].
Parameters used to specify an arbitrary tree include: (1) the number of nodes (N), (2) the maximum node degree (D_m), and (3) the maximum tree height (H_m). A generated tree had an average node degree (d) and tree height (H). If H is high, the tree may have a very low average node degree, resulting in a limited amount of parallelism. Depending on D_m, the tree could be more balanced (with low D_m) or unbalanced (with high D_m).

- A Balanced Tree was generated as a traditional balanced tree: every node, except the leaf nodes, had an identical node degree (d). Balanced Trees offered sufficient parallelism to achieve high speedup. A high node degree may limit the performance of the baseline. The binary Balanced Tree, with the lowest node degree (d = 2), is considered the best case for the baseline scheduler.

The specific parameters of the junction trees, including the clique size, are given in Section 5.4.2. In all the experiments, the random variables were set as binary, i.e., r = 2.

Performance Metrics

We used execution time and speedup to present our experimental results. We calculated the speedup on P cores of each scheduler as the ratio of the execution time on 1 core to the execution time on P cores.

5.4.2 Evaluation Results

Illustration of the Execution Time of Tasks

To illustrate how the weak dependency based scheduling method improves the performance of evidence collection, we recorded the execution timestamps of all the tasks for a small Pine Tree that had N = 32 cliques and node degree d = 8. The clique size was set at 18, resulting in a POT of size 2^18 for each clique. We used 8 cores with one thread in each core. The results of evidence collection in this junction tree are given in Figure 5.5. Each entry on the vertical axis represents a task (TaskID) for the baseline, and a task-copy (TaskID - Preceding TaskID) for the proposed scheduler. We observe that the 28 leaf nodes of the tree were distributed evenly to 8 cores by both schedulers.
However, as expected, the two schedulers behaved differently on the 4 nodes of the principal path. In the baseline, these nodes were executed one after another; each node was processed from its preceding nodes sequentially. In the proposed scheduler, these 4 nodes were processed in parallel, each from a corresponding preceding node. Thus, the amount of parallelism increased, and the execution time was significantly reduced.

Figure 5.5: Execution timestamps on 8 cores for a Pine Tree with N = 32 nodes and node degree d = 8 using (a) the proposed scheduler and (b) the baseline.

Pine Tree

The Pine Tree is considered the special case where the baseline performs worst compared with the proposed scheduler. We used Pine Trees consisting of N = 1024 cliques with various node degrees (d) to observe the impact of node degree on the performance of the two schedulers. The clique size was set at 15. We used one thread in each core. The results, including the execution time and the speedup for various numbers of cores, are given in Figures 5.6 and 5.7, respectively. For any d, the proposed scheduler ran much faster than the baseline. When d increased, the speedup of the proposed scheduler improved significantly, while the speedup of the baseline increased only slightly. We observed that when d = 16, using 8 cores, the speedup of the proposed scheduler was about 4 times higher than that of the baseline (5.43 vs. 1.37).
This speedup can be made higher by using more processors or by increasing the node degree. For the Pine Tree, we briefly analyze the impact of the node degree on the speedup of the two scheduling methods as follows.

Consider a junction tree of N cliques with node degree d, and a PRAM with P processors. Assume that executing a task takes unit time, and let H = N/d. The total execution time comprises the sequential runtime and the parallel runtime. Hence, the execution time for

- the sequential execution is T_0 = H(d-1) + Hd = H(2d-1);
- the baseline scheduler is T_1 = T_1^(s) + T_1^(p) = Hd + H(d-1)/P = H(d + (d-1)/P);
- the proposed scheduler is T_2 = T_2^(s) + T_2^(p) = H + 2H(d-1)/P = H(1 + 2(d-1)/P).

Therefore, the speedup achieved by

- the baseline scheduler is SP_1 = T_0/T_1 = (2d-1) / (d + (d-1)/P);
- the proposed scheduler is SP_2 = T_0/T_2 = (2d-1) / (1 + 2(d-1)/P).

When P → ∞, SP_1 → 2 - 1/d, SP_2 → (2d-1), and SP_2/SP_1 → d. Hence, the speedup of the baseline is always less than a small constant regardless of the number of processors and the node degree. In contrast, the speedup of the proposed scheduling method can be up to (2d-1) using a sufficient number of processors. It also shows that for the Pine Tree, the proposed scheduler can run up to d times faster than the baseline.

We also conducted experiments using more than one thread on each core. Figure 5.8 shows the speedup of the two schedulers using two threads per core for a Pine Tree with N = 1024 and d = 16, the same dataset used in Figure 5.7(d). The proposed scheduler still achieved far higher speedup than the baseline. However, compared with the results using one thread per core given in Figure 5.7(d), the speedup achieved using two threads per core was slightly lower. Note that the allocator is centralized; thus we incur higher overheads as the number of threads increases. In all the experiments presented hereafter, we used one thread per core.
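The closed-form speedups derived above can be evaluated directly. A minimal sketch of the model (unit-time tasks and H = N/d, as in the analysis):

```python
# Pine Tree speedup model: SP1 for the baseline scheduler and SP2 for the
# weak dependency based scheduler, for node degree d and P processors.
def speedups(d, P):
    sp1 = (2*d - 1) / (d + (d - 1)/P)       # baseline
    sp2 = (2*d - 1) / (1 + 2*(d - 1)/P)     # weak dependency based
    return sp1, sp2

print(speedups(16, 8))      # finite P: the gap is already roughly 4x
print(speedups(16, 10**9))  # P -> infinity: approaches (2 - 1/d, 2d - 1)
```

For d = 16 and P = 8 the model predicts a gap of roughly a factor of 4 between the two schedulers, consistent with the measured 5.43 vs. 1.37 above.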
Figure 5.6: Execution time for a Pine Tree with N = 1024 nodes and various node degrees d using (a) the proposed scheduler and (b) the baseline.

Figure 5.7: Speedups of the two schedulers for a Pine Tree with N = 1024 nodes and node degree (a) d = 2, (b) d = 4, (c) d = 8, (d) d = 16.

Figure 5.8: Speedup of the two schedulers using two threads per core for a Pine Tree with N = 1024 nodes and node degree d = 16.

Figure 5.9: Speedups of the two schedulers for a Multipine Tree with N = 1024 nodes, node degree d = 4, and number of branches (a) b = 2, (b) b = 4, (c) b = 8.

As a generalization of the Pine Tree, a Multipine Tree was created by connecting multiple Pine Tree branches. The greater the number of Pine Tree branches, the more parallelism was provided in the input junction tree. Therefore, the speedups of the two schedulers increased with the number of branches b, as shown in Figure 5.9. However, the speedup of the proposed scheduler was still significantly higher than that of the baseline.

Arbitrary Tree

We generated arbitrary trees of 1024 cliques using the maximum node degrees D_m = 16 (less balanced trees) and D_m = 6 (more balanced trees). For each D_m, we set the tree height H = 10 (a very "bushy" tree), H = 100, and H = 500 (a very slim tree). Clique size was set randomly between 14 and 16.
The speedups of the two schedulers are given in Figures 5.10 and 5.11. As expected, the two schedulers achieved high speedups when H = 10, and low speedups when H = 500. In any case, the proposed scheduler consistently outperformed the baseline scheduler, with an average improvement of 30%. For lower D_m, the gap between the two schedulers became smaller. This is because with smaller D_m the execution times of the preceding tasks became more uniform, and the impact of exploiting weak dependencies became smaller.

Figure 5.10: Speedups of the two schedulers for arbitrary trees with N = 1024 nodes, maximum node degree D_m = 16, and tree height (a) H = 10, (b) H = 100, (c) H = 500.

Figure 5.11: Speedups of the two schedulers for arbitrary trees with N = 1024 nodes, maximum node degree D_m = 6, and tree height (a) H = 10, (b) H = 100, (c) H = 500.

Balanced Tree

Balanced trees provide plenty of parallelism for the two schedulers. For balanced trees, the impact of the weak dependency based scheduling method is minimal, since all the preceding tasks of a task finish their execution at almost the same time. The binary balanced tree is the best case for the baseline, where the node degree is lowest (only 2). Figure 5.12 shows the speedups of the two schedulers for balanced trees with N = 1024 cliques and various node degrees d. Clique size was set at 15 in this experiment.
Figure 5.12: Speedups of the two schedulers for balanced trees with N = 1024 nodes and node degree (a) d = 2, (b) d = 4, (c) d = 8.

Since the parallelism was sufficient, we achieved almost linear speedup with both schedulers. In addition, we observed that the proposed scheduler continued to achieve better speedups than the baseline. Even in the best case for the baseline, the balanced binary tree, the proposed scheduler still slightly outperformed it.

Chapter 7 of this dissertation begins below.

Chapter 6
Pointer Jumping for Parallel Exact Inference

Many parallel techniques have been developed to accelerate exact inference in Bayesian networks. In [KS94][XFP09], the authors exploit structural parallelism by parallelizing independent computations across cliques that are not related by an ancestor-descendant relationship in the network. In [KS94][DFL92][XP10d], the authors exploit data parallelism by parallelizing independent computations within a clique. These algorithms yield limited performance if the input dataset offers limited data and task parallelism, such as a chain-like graph with small potential tables. In [Pen98], the authors propose a pointer jumping based method for exact inference in Bayesian networks that achieves logarithmic execution time regardless of the network topology. Based on this method, we propose a pointer jumping based approach to exploit an alternative form of task parallelism for exact inference in junction trees. In this chapter, we make the following contributions:

We adapt the pointer jumping technique for both evidence collection and evidence distribution in junction trees. The proposed method is suitable for junction trees that offer limited data and structural parallelism.
We implement the proposed method using Pthreads on two state-of-the-art manycore systems: Intel Nehalem-EX and AMD Magny-Cours.

We evaluate our implementation and compare it with our baseline method, a parallel implementation that exploits structural parallelism. Our experimental results show that the technique is suitable for a class of junction trees.

6.1 Pointer Jumping Review

The pointer jumping technique is used in many parallel algorithms operating on lists [CLR90][JáJá92]. Given a linked list of N elements, each element i has a successor $s(i)$, initialized as $s(i) = i+1$ for $0 \le i \le N-2$. Each element sends data to its successor. Pointer jumping conceptually performs the data transfer from all the elements to the last node of the list, node $N-1$, in $\log N$ iterations. (All logarithms are to base 2.) Each iteration of pointer jumping performs $s(i) = s(s(i))$ for $0 \le i \le N-2$, unless $s(i) = N-1$. In other words, the successor of each element is replaced by that successor's successor. In each iteration, the data of element i is transferred to the updated successor of i independently of the operations performed by the other elements. Figure 6.1 illustrates the iterations of pointer jumping on a list of N elements. After $\log N$ iterations, all the elements have transferred their data to node $N-1$.

Figure 6.1: Pointer jumping on a list.

On a concurrent-read exclusive-write parallel random access machine (CREW PRAM) [JáJá92] with P processors, $1 \le P \le N$, each element of the list is assigned to a processor. Note that the update operations are implicitly synchronized at the end of each pointer jumping iteration. Each iteration takes $O(N/P)$ time to complete $O(N)$ update operations. Thus, the pointer jumping based algorithm runs in $O((N \log N)/P)$ time using a total of $O(N \log N)$ update operations.

6.2 Prior Work

There are several works on parallel exact inference.
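Before surveying prior work, the successor-doubling loop reviewed in Section 6.1 can be simulated in a few lines. The sketch below is illustrative only; the function name and the termination test are ours:

```python
def pointer_jump(n):
    """Successor doubling on a list of n elements (illustrative sketch).

    succ[i] starts as i + 1; each iteration replaces succ[i] by
    succ[succ[i]] unless succ[i] is already the last node.  The loop
    terminates after ceil(log2(n - 1)) iterations, when every element
    points directly at node n - 1.
    """
    succ = list(range(1, n)) + [n - 1]          # last node points to itself
    iterations = 0
    while any(s != n - 1 for s in succ[:-1]):
        # The comprehension reads the previous iteration's succ values,
        # mirroring the implicit PRAM synchronization between iterations.
        succ = [s if s == n - 1 else succ[s] for s in succ]
        iterations += 1
    return iterations
```

For example, a list of 1024 elements finishes in 10 iterations, matching the $\log N$ bound.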
In [KS94], the authors explore structural parallelism in exact inference by assigning independent cliques to separate processors for concurrent processing. In [XFP09], structural parallelism is exploited by a dynamic task scheduling method, where the evidence propagation in each clique is viewed as a task. These methods are suitable for "bushy" junction trees, which offer parallel activities at the structural level. Unlike the above techniques, data parallelism in exact inference is explored in [DFL92][KS94][XP10d]. Since these solutions partition the potential tables in a clique and update each part in parallel, they are suitable for junction trees with large potential tables. If the input junction tree offers limited structural and data parallelism, the performance of the above solutions is adversely affected.

Pointer jumping has been used in many algorithms, such as list ranking and parallel prefix [JáJá92]. In [BC05], the authors use pointer jumping to develop a new spanning tree algorithm for symmetric multiprocessors that achieves parallel speedups on arbitrary graphs. In [Pen98], the authors propose a pointer jumping based method for exact inference in Bayesian networks. The method achieves logarithmic execution time regardless of the structural or data parallelism in the networks. However, the method in [Pen98] exhibits limited performance for multiple evidence inputs, and no experimental study of the method has been provided. Our proposed algorithm combines the pointer jumping technique with Lauritzen and Spiegelhalter's two-pass algorithm, which is efficient for multiple evidence inputs. Our proposed algorithm is empirically evaluated using a class of junction trees on state-of-the-art manycore processors.

6.3 Pointer Jumping for Evidence Propagation

6.3.1 Evidence Propagation in a Chain of Cliques

Given a chain consisting of N cliques $C_0 \rightarrow C_1 \rightarrow$
$\cdots \rightarrow C_{N-1}$, in which evidence arrives at $C_0$, we use the pointer jumping technique to propagate the evidence from $C_0$ to the remaining cliques as follows. Let $pa(C_i)$ denote the parent of $C_i$, i.e., $C_{i-1}$, $0 < i < N$. In the first iteration, we propagate the evidence from $pa(C_i)$ to $C_i$. Then, we propagate the evidence from $pa^2(C_i) = pa(pa(C_i))$ to $C_i$. We repeat such operations for $\log N$ iterations, so that the evidence is propagated to all the cliques.

We formulate evidence propagation in a chain of cliques as follows. Let $C_i = \{S_{i+1,i}, V_i, S_{i,i-1}\}$, where $S_{i+1,i} = C_{i+1} \cap C_i$, $S_{i,i-1} = C_i \cap C_{i-1}$, and $V_i = C_i \setminus \{S_{i+1,i} \cup S_{i,i-1}\}$, $0 < i < N$. Indeed, $S_{i+1,i}$ is the separator between $C_{i+1}$ and $C_i$. For the sake of simplicity, we let $S_{0,-1} = S_{N,N-1} = \emptyset$ and write $P(C_i) = P(S_{i+1,i}, V_i, S_{i,i-1})$. We rewrite $P(C_i)$ as a conditional distribution:

$$P(S_{i+1,i}, V_i \mid S_{i,i-1}) = \frac{P(S_{i+1,i}, V_i, S_{i,i-1})}{P(S_{i,i-1})} = \frac{P(C_i)}{\sum_{S_{i+1,i} \cup V_i} P(C_i)} \quad (6.1)$$

$$P(S_{i+1,i} \mid S_{i,i-1}) = \frac{\sum_{V_i} P(S_{i+1,i}, V_i, S_{i,i-1})}{P(S_{i,i-1})} = \frac{\sum_{V_i} P(C_i)}{\sum_{S_{i+1,i} \cup V_i} P(C_i)} \quad (6.2)$$

Using Eqs. 6.1 and 6.2, in iteration 0 we propagate the evidence from $pa(C_i)$ to $C_i$ by:

$$P(S_{i+1,i}, V_i \mid S_{i-1,i-2}) = \sum_{S_{i,i-1}} P(S_{i+1,i}, V_i \mid S_{i,i-1}) \, P(S_{i,i-1} \mid S_{i-1,i-2}) \quad (6.3)$$

$$P(S_{i+1,i} \mid S_{i-1,i-2}) = \sum_{V_i} P(S_{i+1,i}, V_i \mid S_{i-1,i-2}) \quad (6.4)$$

In iteration $k$, we propagate evidence from $pa^{2^k}(C_i)$ to $C_i$ by:

$$P(S_{i+1,i}, V_i \mid S_{i-2^k,\,i-2^k-1}) = \sum_{S_{i-2^{k-1}},\,{i-2^{k-1}-1}} P(S_{i+1,i}, V_i \mid S_{i-2^{k-1},\,i-2^{k-1}-1}) \, P(S_{i-2^{k-1},\,i-2^{k-1}-1} \mid S_{i-2^k,\,i-2^k-1}) \quad (6.5)$$

$$P(S_{i+1,i} \mid S_{i-2^k,\,i-2^k-1}) = \sum_{V_i} P(S_{i+1,i}, V_i \mid S_{i-2^k,\,i-2^k-1}) \quad (6.6)$$

Performing Eqs. 6.5 and 6.6 for each $k$, $0 \le k \le \log N - 1$, propagates the evidence from $C_0$ to all the cliques in a chain of $N$ cliques. For the rest of this chapter, we define evidence propagation as the computation using Eqs. 6.5 and 6.6 to propagate evidence from one clique to another.
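When all variables are binary, the conditional $P(S_{i+1,i} \mid S_{i,i-1})$ of Eq. 6.2 is a 2x2 matrix, and the composition in Eqs. 6.5 and 6.6 reduces to a matrix product over the intermediate separator; the pointer jumping iterations then compute all prefix products along the chain. The sequential simulation below is a sketch under that assumption, with our own function names:

```python
def matmul(A, B):
    # 2x2 matrix product: row indexes the output state, column the input state.
    return [[sum(A[r][m] * B[m][c] for m in range(2)) for c in range(2)]
            for r in range(2)]

def chain_pointer_jumping(conds):
    """Pointer jumping composition of separator conditionals on a chain.

    conds[i] is the matrix P(S_{i+1,i} | S_{i,i-1}) of Eq. 6.2 (binary
    variables assumed).  Each doubling step applies the composition of
    Eqs. 6.5/6.6 over the intermediate separator, so after ceil(log2 N)
    steps conds[i] holds P(S_{i+1,i} | S_{0,-1}): the evidence entering
    at C_0 has reached every clique.
    """
    n = len(conds)
    jump = 1
    while jump < n:
        prev = list(conds)          # previous iteration's values (implicit sync)
        for i in range(n):
            if i - jump >= 0:       # pa^{2^k}(C_i) exists
                conds[i] = matmul(prev[i], prev[i - jump])
        jump *= 2
    return conds
```

With identical "swap" matrices as a toy input, the result alternates between the swap and the identity, i.e. conds[i] ends up equal to the (i+1)-fold product of the edge conditional, as expected for a prefix product.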
6.3.2 Pointer Jumping Based Evidence Propagation

Algorithm 6 shows exact inference in a junction tree using pointer jumping. The algorithm is based on Lauritzen and Spiegelhalter's two-pass algorithm described in Section 2.1.1. Given a junction tree $J$, let $ch(C_i)$ denote the set of children of $C_i \in J$, i.e., $ch(C_i) = \{C_j \mid pa(C_j) = C_i\}$. We define $ch^d(C_i) = \{C_j \mid pa^d(C_j) = C_i\}$, where $pa^d(C_j) = pa(pa^{d-1}(C_j))$, $d \ge 1$. We denote by $H$ the height of $J$. Updating a clique means performing all the evidence propagations from its predecessors, which are its children in evidence collection and its parent in evidence distribution.

Algorithm 6 Pointer Jumping Based Exact Inference in a Junction Tree
Input: Junction tree $J$ with height $H$
Output: $J$ with updated potential tables
{evidence collection}
1: for $k = 0, 1, \ldots, \log H - 1$ do
2:   for $C_i \in J$ par-do
3:     for all $C_j \in ch^{2^k}(C_i)$ do
4:       Propagate evidence from $C_j$ to $C_i$ according to Eqs. 6.5 and 6.6
5:     end for
6:   end for
7: end for
{evidence distribution}
8: for $k = 0, 1, \ldots, \log H - 1$ do
9:   for $C_i \in J$ par-do
10:    $C_j = pa^{2^k}(C_i)$
11:    Propagate evidence from $C_j$ to $C_i$ according to Eqs. 6.5 and 6.6
12:  end for
13: end for

In Algorithm 6, Lines 1-7 and 8-13 correspond to evidence collection and evidence distribution, respectively. Lines 3 and 10 show how the pointer jumping technique is used in the algorithm. In each iteration, all the cliques can be updated independently. In evidence collection, in iteration $k$, a clique $C_i \in J$ is updated by performing all the evidence propagations from the set $ch^{2^k}(C_i)$. In evidence distribution, in iteration $k$, a clique $C_i \in J$ is updated by performing the single evidence propagation from $pa^{2^k}(C_i)$.

Assume that P processors, $1 \le P \le N$, are employed. Let $W = \max_i \{W_i\}$, where $W_i$ is the number of cliques at level $i$ of the junction tree, $0 \le i < H$.
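The parent pointer doubling in Lines 8-13 of Algorithm 6 can be simulated directly on an array of parent links. This is a sketch with our own names; the root is assumed to point to itself:

```python
def parent_jump_rounds(parent, root=0):
    """Record the pa^{2^k} pointer of every clique for each iteration k.

    parent[i] is pa(C_i), with the root pointing to itself.  In evidence
    distribution, iteration k propagates evidence from rounds[k][i] to
    clique i; the loop ends after ceil(log2 H) rounds, when every clique
    points directly at the root.
    """
    rounds = []
    p = list(parent)
    while any(x != root for x in p):
        rounds.append(list(p))        # pa^{2^k} links for this iteration
        p = [p[x] for x in p]         # pointer doubling: pa becomes pa of pa
    return rounds
```

For a chain of 8 cliques (height 7), three rounds suffice, and round k holds the $pa^{2^k}$ links of all cliques.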
Assuming each evidence propagation takes unit time, updating a clique takes $O(W)$ time during evidence collection, since the clique must be updated by sequentially performing $O(W)$ evidence propagations from its children. In contrast, updating a clique takes $O(1)$ time during evidence distribution, since only one evidence propagation from its parent is needed. In each iteration, there are at most $O(N)$ clique updates performed by the P processors. The algorithm performs $\log H$ iterations for each of evidence collection and evidence distribution. Thus, the complexity of the algorithm is $O((WN \log H)/P)$ for evidence collection and $O((N \log H)/P)$ for evidence distribution; the overall complexity is $O((WN \log H)/P)$.

The algorithm is attractive if the input junction tree offers limited structural parallelism. For example, for a chain of N cliques, we have $W = 1$ and $H = N$, so the complexity of the algorithm is $O((N \log N)/P)$. This is superior to the complexity of serial exact inference when $P > \log N$.

6.4 Experiments

6.4.1 Experimental Setup

Platforms

We conducted experiments on two state-of-the-art manycore platforms:
GivenW = max i fW i g for a junction tree, whereW i is the number of cliques at leveli of the junction tree, we define a slim tree as a junction tree satisfyingW <H, whereH is the height of the tree. Note thatW is an upper bound on the number of sequential evidence propagations required for updating a clique. H determines the number of pointer jumping iterations. Using slim trees, we can observe the performance of the proposed algorithm, even though there exists some amount of structural parallelism in such junction trees. The clique size, i.e. the number of variables in the clique, wasW d = 12 in our experiments. This offered limited amount of data parallelism. The random variables were set as binary, i.e.r = 2, in our experiments. LetN denote the number of cliques in a slim tree. Three different types of slim trees were generated for the experiments. As illustrated in Figure 6.2, these are: 82 b (a) Slim Tree 1 b (b) Slim Tree 2 b (c) Slim Tree 3 Figure 6.2: Three types of slim junction trees Slim Tree 1 was formed by first creating a balanced binary tree withb leaf nodes. Then each of the leaf nodes was connected to a chain consisting ofd N(2b1) b e nodes. The height of the tree is(dlogbe+d N(2b1) b e). Increasingb increases the amount of structural parallelism. Slim Tree 2 was formed by connectingb cliques to the head of a chain consisting of(Nb) nodes. The height of the tree is(Nb+1). Increasingb increases the amount of structural parallelism. Largerb also results in more sequential opera- tions to update a clique during evidence collection. Slim Tree 3 was formed by first creating a chain of lengthd N b e. Then, each clique in the chain was connected to another chain consisting of (b1) cliques, except that the last chain connected to the root may have less than (b1) cliques. The height of the tree is(d N b e+b1). Increasingb reduces the amount of structural parallelism. 
Slim Tree 2 was formed by connecting b cliques to the head of a chain consisting of $N-b$ nodes. The height of the tree is $N-b+1$. Increasing b increases the amount of structural parallelism; a larger b also results in more sequential operations to update a clique during evidence collection.

Slim Tree 3 was formed by first creating a chain of length $\lceil N/b \rceil$, and then connecting each clique in the chain to another chain consisting of $b-1$ cliques, except that the last chain, connected to the root, may have fewer than $b-1$ cliques. The height of the tree is $\lceil N/b \rceil + b - 1$. Increasing b reduces the amount of structural parallelism.

In the experiments, for each given number of cliques N, we varied b to change the slim tree topology.

Baseline Implementation

We used a parallel implementation as the baseline to evaluate our proposed method. This implementation exploits the structural parallelism offered by junction trees. A collaborative scheduler proposed in [XP10a] was employed to schedule tasks to threads, where each thread was bound to a separate core. We call this baseline implementation scheduling based exact inference (SEI). In this implementation, we represented a junction tree as an adjacency array, each element representing a clique and the links to the children of the clique. Each task had an attribute called the dependency degree, initially set to the in-degree of the corresponding clique. The scheduling activities were distributed across the threads, and the adjacency array was shared by all the threads. In addition to the shared adjacency array, each thread had a ready task list to store the assigned tasks that were ready to be executed. Each thread fetched a task from its ready task list and executed it locally. After the task completed, the thread decremented the dependency degree of the children of the clique. If the dependency degree of a child became 0, the child clique was assigned to the ready task list of the thread with the lightest workload.

Implementation Details

We implemented our proposed algorithm using Pthreads on manycore systems. We initiated as many threads as the number of hardware cores on the target platform, and each thread was bound to a separate core. These threads persisted over the entire program execution and communicated with each other through the shared memory. The input junction tree was stored as an adjacency array in memory. As in SEI, the adjacency array consists of a list of nodes, each corresponding to a clique in the given junction tree. Each node has links to its children and a link to its parent.
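The adjacency array layout shared by SEI and our implementation can be sketched as follows. The field and function names are ours and the potential tables are omitted; the actual implementation is written in C with Pthreads:

```python
from dataclasses import dataclass, field

@dataclass
class CliqueNode:
    # One adjacency array entry; links are indices into the same array,
    # which lets pointer jumping rewrite them in place.
    parent: int
    children: list = field(default_factory=list)
    dependency_degree: int = 0      # in-degree used by the SEI scheduler

def build_adjacency(parents, root=0):
    """Build the shared adjacency array from an array of parent links (sketch)."""
    nodes = [CliqueNode(parent=p) for p in parents]
    for i, p in enumerate(parents):
        if i != root:
            nodes[p].children.append(i)
    for node in nodes:
        # During evidence collection a clique waits on all of its children.
        node.dependency_degree = len(node.children)
    return nodes
```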
Note that the links are dynamically updated due to the nature of pointer jumping. We define a task as the computation of node level primitives [XP10d] for updating a clique by a thread. The input to a task is the clique to be updated and a set of cliques propagating evidence to it. The output of a task is the clique with an updated potential table. In the $k$-th iteration of pointer jumping in evidence collection, the computation for each task corresponds to Lines 3-5 in Algorithm 6, which update $C_i$ using the cliques in $ch^{2^k}(C_i)$. In the $k$-th iteration in evidence distribution, the computation for each task corresponds to Lines 10-11 in Algorithm 6, where clique $C_i$ is updated using clique $pa^{2^k}(C_i)$. In each iteration of pointer jumping, all the tasks can be executed in parallel for both evidence collection and distribution. The tasks were statically mapped to the cores by a straightforward scheme: the task updating clique $C_i$ is assigned to thread $(i \bmod P)$, where P is the number of threads. Within an iteration of pointer jumping, the execution times of the tasks can vary due to the differing numbers of cliques propagating evidence to each task. In the PRAM model, synchronization across processors is implicit. In our implementation, however, we explicitly synchronized the update operations across threads at the end of each iteration of pointer jumping using a barrier.

6.4.2 Results and Analysis

Figures 6.3, 6.4, and 6.5 illustrate the performance of the proposed algorithm and the scheduling based exact inference (SEI) for various junction trees. We consider the execution time and the scalability with respect to the number of processors for each algorithm.
Figure 6.3: Comparison between the proposed algorithm (left) and SEI (right) for chains on the 32-core Intel Nehalem-EX based system.

Figure 6.4: Comparison between the proposed algorithm (left) and SEI (right) for balanced trees on the 32-core Intel Nehalem-EX based system.

For chains, as shown in Figure 6.3, SEI showed limited scalability; in contrast, the proposed algorithm scaled very well with the number of cores. This is because for chains SEI has no structural parallelism to exploit. For a chain of 4096 cliques, on a single core the proposed algorithm ran much slower than SEI; however, when the number of cores exceeded 16, the proposed algorithm ran faster than SEI.

For balanced trees, as shown in Figure 6.4, both methods scaled well due to the available structural parallelism. The scalability of SEI was even better than that of the proposed algorithm. This is because for balanced trees, during evidence collection in the proposed algorithm, some cliques are updated by sequentially performing a large number of evidence propagations from their children. Load imbalance can occur due to the straightforward scheduling scheme we employed. By employing the work-sharing scheduler, SEI had better load balance. Note that SEI ran much faster than the proposed algorithm.

To better understand the impact of junction tree topology on scalability, we used slim trees for the experiments. The results are shown in Figure 6.5. The number of cliques (N) was 4096.
We varied the amount of structural parallelism offered by the junction trees by changing the parameter b (see Section 6.4.1) for each type of slim tree.

For Slim Tree 1, we observed that SEI did not scale once the number of cores P exceeded b, while the proposed algorithm still scaled well. The scalability of the proposed algorithm is not significantly affected by b. Thus, when b was small and P was large, e.g. b = 4 and P = 32, the proposed algorithm ran faster than SEI.

For Slim Tree 2, increasing b made SEI scale better. In contrast, the proposed algorithm did not scale well when b increased, because of load imbalance. Load imbalance occurs because in each iteration of pointer jumping in evidence collection, some cliques are updated by sequentially performing b evidence propagations, while the other cliques perform only one evidence propagation. Thus, when b was large, e.g. 256 or 1024, SEI ran faster than the proposed algorithm for the numbers of cores used in the experiments. When b was small and P was large, e.g. b = 16 and P = 32, the proposed algorithm ran faster than SEI.

For Slim Tree 3, although increasing b reduces the amount of structural parallelism, it also reduces the length of the sequential path, a chain of $\lceil N/b \rceil$ cliques. Hence, as long as N/b is greater than the number of cores P, increasing b makes SEI scale better. If N/b is less than P, then the processors are not fully utilized by SEI. We observed that for N = 4096 and b = 256, SEI no longer scaled when the number of
cores P exceeded 16. The proposed algorithm scaled best when b = 4; when b = 4 and P = 32, the proposed algorithm ran faster than SEI.

Figure 6.5: Comparison between the proposed algorithm (left) and SEI (right) for slim trees with N = 4096 cliques on the 32-core Intel Nehalem-EX based system: (a) Slim Tree 1, (b) Slim Tree 2, (c) Slim Tree 3.

Examining the execution times of the proposed algorithm and SEI in all the cases, we can conclude that when SEI does not scale with the number of cores due to a lack of structural parallelism, the proposed algorithm can offer superior performance by using a large number of cores.

Figure 6.6: The speedup of the pointer jumping based exact inference with respect to sequential exact inference on the 32-core Intel Nehalem-EX based system.

As observed, only when the junction trees offered very limited data and structural parallelism was the proposed algorithm able to outperform SEI using a large number of cores. Consider the best case for the pointer jumping based algorithm, when the input junction tree is a chain. The proposed algorithm achieved speedup only when the number of cores was greater than $\log N$, as given in our performance analysis and verified by our experimental results. Therefore, a large number of cliques requires a large number of cores for the algorithm to be effective.
With the above observation, we consider the speedup of the proposed algorithm with respect to the serial baseline implementation when the input junction tree is a chain of cliques. Figure 6.6 shows how the speedup changes with the number of cliques N, given a fixed number of processors P. As observed, the speedup increases quickly with N when N is relatively small compared with P, and then gradually decreases. Hence, N must be smaller than a certain threshold to achieve a speedup greater than 1. As given in Section 6.3.2, the complexity of the algorithm for a chain of N cliques is $O((N \log N)/P)$; thus, given that the complexity of the sequential code is $O(N)$, the speedup is expected to be $P/\log N$. Hence, N is expected to be smaller than $2^P$ to achieve speedup. In Figure 6.6, for P = 8, the speedup is greater than 1 only when N is smaller than 256, which equals $2^8$.

Chapter 7
Parallel Exact Inference Using MapReduce

As we have seen in the previous chapters, parallel techniques have been developed to accelerate exact inference. Although these techniques exhibit good speedup, their implementation complexity has limited their applicability, especially in application domains where the experts may not be familiar with parallel computing techniques and tools. In recent years, MapReduce [DG04] has emerged as a programming model for large-scale data processing. MapReduce provides a simple and powerful abstraction that hides the details of low-level parallelism, data distribution, and load balancing. It enables easy parallelization with well-known supporting runtimes such as Hadoop and Twister for clusters, or Phoenix for multi-core [had, ELZ+10, RRP+07]. Many applications with data parallelism have achieved good scalability using MapReduce [CKL+06, PHBB09, TSWY09, GSZ11]. However, there are very few MapReduce implementations of applications with data dependency constraints such as exact inference.
In this chapter, we present our approach to using MapReduce for exact inference in Bayesian networks on multi-core platforms. Specifically, our contributions include:

We propose algorithms for evidence propagation in junction trees using MapReduce. Our methods exploit task parallelism and resolve data dependency constraints based on tree traversal methods.

We implement the algorithms on state-of-the-art multi-core machines using Phoenix as the underlying MapReduce runtime. Our methods demonstrate a framework for using MapReduce to parallelize applications with data dependencies using existing runtimes.

We conduct experiments with various datasets. Our experimental results show a 20x speedup on a 40-core Intel Westmere-EX based system.

7.1 Prior Work

There are several works on parallel exact inference in graphical models. In [KS94], structural parallelism in exact inference is explored by assigning independent cliques of the junction tree to separate processors for concurrent processing. In [XFP09], structural parallelism is also exploited by a dynamic task scheduling method, where the evidence propagation in each clique is viewed as a task. In [DFL92, XP10d], data parallelism is explored in node level computations by updating parts of a potential table in parallel. These techniques require complex implementation and optimization using low-level programming models such as Pthreads or MPI.

MapReduce was originally proposed for large-scale data processing applications on clusters [DG04]. It has since evolved into a computational model for a broader class of problems on a range of computing platforms [KSV10, RRP+07, CKL+06, HFL+08, YTT+08]. In [CKL+06], MapReduce is used for a class of machine learning algorithms on multi-core platforms. These algorithms are written in a summation form that fits easily into the MapReduce model. MapReduce has also been used for loopy belief propagation in graphical models [GLG09, UK10].
However, these problems employ embarrassingly parallel algorithms that have no data dependencies; each MapReduce iteration scans the same entire dataset.

The use of MapReduce has been introduced for some graph algorithms such as minimum spanning tree, shortest path, and tree-based prefix sum [GSZ11, KSV10]. Design patterns for graph algorithms in MapReduce are discussed in [LS10], with the PageRank algorithm as an illustration. There, it is also assumed that the computations occur at every node of the graph. For exact inference, there exist data dependencies given by the graph; only the potential tables at some nodes of the graph are ready in a given MapReduce invocation. Graph traversal methods are used in our approach to handle these data dependency constraints.

Several runtime systems have been developed to support the MapReduce model. Hadoop [had] is a well-known MapReduce infrastructure used intensively in industry. Twister [ELZ+10] was introduced to support iterative MapReduce; it aims to minimize the overhead between iterations of MapReduce. These runtime systems focus on handling data-intensive applications with very large data files on distributed platforms. On multi-core platforms, the Phoenix runtime [RRP+07] allows easy implementation of MapReduce based applications by handling the low-level parallelism, load balancing, and locality. In this work, we use Phoenix because it maintains the basic MapReduce model and is developed for multi-core platforms.

7.2 MapReduce for Exact Inference

7.2.1 Task Definition

As given in Section 2.1.1, exact inference in a junction tree proceeds in two stages: evidence collection and evidence distribution. In evidence collection, evidence is propagated from the leaves to the root. When a clique has been fully updated from all of its children, it is ready to compute the marginal of the separator connecting it to its parent. The update of a clique from the separators connecting to its children can be viewed as an aggregation operation. Thus, it is natural to define the computation of the marginal of a separator as a
Thus, it is natural to define the computation of marginal of a separator as a 93 Algorithm 7 Map and Reduce tasks for evidence collection 1: procedure MAP(idi, POT Y ) 2: Letj =pa(i) be the identifier of the parent ofC i 3: LetS =sep(i;j) be the separator ofC i andC j 4: Compute marginal S ofS from Y according to Eq. 2.1 5: EMIT(j, S ) 6: end procedure 7: procedure REDUCE(idj, list[ S 1 ; S 2 ;:::]) 8: Denote X as the POT of cliqueC j 9: for all S 2 [ S 1 ; S 2 ;:::; S d ] do 10: Update X from S according to Eq. 2.2 11: end for 12: EMIT(j, X ) 13: end procedure map task, and the update of a clique from all of its children as a reduce task. Details of the map task and reduce task for evidence collection are given in Algorithm 7. In the map task, defined by procedure MAP, the keyi is the identifier of the clique, and the value Y is the potential table (POT) of cliqueC i . Line 2 assigns the identifier of the parent clique ofC i toj. The marginal S of the separator connectingC i with its parent is computed in Line 3. In Line 4, this map task producesj as the intermediate key and S as the intermediate value. In the reduce task, defined by procedure REDUCE, the inputs include identifier j as the intermediate key and a list of the associated intermediate values [ S 1 ; S 2 ;:::] produced by the map tasks. CliqueC j has potential table X which will be updated from all child cliques via the separators. Lines 9-11 perform the update of X from each new S of its children. Computing marginal of a separator and updating a clique can be performed by node level primitives given in [XP10d]. At a clique withw variables, each havingr states, andd children, the complexity of a map task isO(wr w ) and the complexity of a reduce task isO(dwr w ). 
Algorithm 8 Map and Reduce tasks for evidence distribution
1: procedure MAP(id $i$, POT $\phi_Y$)
2:   Let $L = ch(i)$ be the identifiers of all children of $C_i$
3:   for all $j \in L$ do
4:     Let $S = sep(i,j)$ be the separator of $C_i$ and $C_j$
5:     Compute the marginal $\phi_S$ of $S$ from $\phi_Y$ according to Eq. 2.1
6:     EMIT($j$, $\phi_S$)
7:   end for
8: end procedure
9: procedure REDUCE(id $j$, list $[\phi_{S_1}, \phi_{S_2}, \ldots]$)
10:  Denote by $\phi_X$ the POT of clique $C_j$
11:  for all $\phi_S \in [\phi_{S_1}, \phi_{S_2}, \ldots]$ do
12:    Update $\phi_X$ from $\phi_S$ according to Eq. 2.2
13:  end for
14:  EMIT($j$, $\phi_X$)
15: end procedure

In evidence distribution, evidence is propagated from the root to the leaves. A clique is updated from its single parent; the clique then computes the marginal of the separator for each of its child cliques. We define the map task as the computation of these marginals and the reduce task as the update of a clique. Details of the map task and the reduce task for evidence distribution are given in Algorithm 8. In the algorithm, the map task computes the marginals for all of the clique's children, producing a list of intermediate key-value pairs, one per separator connecting to a child clique. The reduce task is exactly the same as in evidence collection; however, the list of intermediate values $[\phi_{S_1}, \phi_{S_2}, \ldots]$ now contains only one element, corresponding to the single parent. For evidence distribution, the complexity of a map task is $O(dwr^w)$ and the complexity of a reduce task is $O(wr^w)$.

7.2.2 Depth-First Search Based Iterative MapReduce

Figure 7.1: Map tasks and Reduce tasks in the post-order sequence of MapReduce invocations for evidence collection.
Algorithm 9 Evidence collection using MapReduce with post-order traversal
 1: procedure MR_Postorder_Propagation(JTree J, id i)
 2:   Let L = ch(i) be the identifiers of all children of C_i in J
 3:   if L ≠ ∅ then
 4:     for all j ∈ L do
 5:       MR_Postorder_Propagation(J, j)
 6:     end for
 7:     Set L as input data for the map phase
 8:     Perform MapReduce invocation on L
 9:   end if
10: end procedure

Iterative MapReduce is used to complete evidence propagation in junction trees. In each iteration, there is one MapReduce invocation in which the map tasks are executed in parallel, and so are the reduce tasks. It is essential to determine the input data for each MapReduce invocation so that data dependency constraints are preserved. In evidence collection, a clique is ready to compute the marginal of the separator connecting to its parent only after it has been updated from all the separators connecting to its children. In evidence distribution, once a clique is updated from its parent, it is ready to compute the separator marginals for all of its children.

Our first approach to preserving the data dependency constraints uses depth-first search (DFS) traversal to order the sequence of MapReduce invocations. We call this the DFS-based approach.

Algorithm 10 Evidence distribution using MapReduce with pre-order traversal
 1: procedure MR_Preorder_Propagation(JTree J, id i)
 2:   Let L = ch(i) be the identifiers of all children of C_i in J
 3:   if L ≠ ∅ then
 4:     Set {i} as input data for the map phase
 5:     Perform MapReduce invocation on {i}
 6:     for all j ∈ L do
 7:       MR_Preorder_Propagation(J, j)
 8:     end for
 9:   end if
10: end procedure

The DFS-based approach employs post-order traversal in evidence collection and pre-order traversal in evidence distribution. Data input for each MapReduce invocation is set up accordingly. This approach is illustrated in Algorithms 9 and 10. In Algorithm 9, the post-order sequence of MapReduce invocations is produced by recursive procedure calls.
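The two recursive drivers of Algorithms 9 and 10 can be sketched in Python as follows (an illustration, not the thesis code): a toy junction tree is an adjacency dictionary, and each "MapReduce invocation" is replaced by appending its map-phase input to a log, so the invocation order can be inspected directly.

```python
def mr_postorder(tree, i, log):
    """Post-order sequence for evidence collection (cf. Algorithm 9):
    recurse into the children first, then run one MapReduce invocation
    whose map-phase input is the list of children of C_i."""
    children = tree.get(i, [])
    if children:
        for j in children:
            mr_postorder(tree, j, log)
        log.append(list(children))  # one invocation per non-leaf clique

def mr_preorder(tree, i, log):
    """Pre-order sequence for evidence distribution (cf. Algorithm 10):
    run the invocation on {i} first, then recurse into the children."""
    children = tree.get(i, [])
    if children:
        log.append([i])
        for j in children:
            mr_preorder(tree, j, log)

# Root 0 with children 1, 2; clique 1 has children 3, 4; cliques 2, 3, 4 are leaves.
tree = {0: [1, 2], 1: [3, 4]}
col, dist = [], []
mr_postorder(tree, 0, col)
mr_preorder(tree, 0, dist)
# col == [[3, 4], [1, 2]]  -- one invocation per non-leaf clique
# dist == [[0], [1]]
```

As the log shows, the number of invocations in either traversal equals the number of non-leaf cliques, which is the observation exploited in the analysis below.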
The initial value of the input i is the identifier of the root. Lines 7-8 perform the MapReduce invocation with the defined data input for its map phase. By the definition of the map task and reduce task, each MapReduce invocation has one reduce task and multiple map tasks corresponding to the child nodes of the reduce node. The map tasks at the child nodes are executed in parallel in the map phase. Figure 7.1 illustrates the map tasks and reduce tasks in the post-order sequence of MapReduce invocations for evidence collection in a sample junction tree. Only non-leaf cliques become reduce tasks. It can be seen that the number of MapReduce invocations is equal to the number of non-leaf cliques in the junction tree.

In Algorithm 10, the pre-order traversal for the sequence of MapReduce invocations is produced. The initial value of the input i is also the identifier of the root. The MapReduce invocation is performed before the procedure is called recursively on the current clique's children. In this case, the data input for the map phase includes only the current clique. Thus, in each MapReduce invocation, there is one map task and multiple reduce tasks that correspond to the children of the map clique.

[Figure 7.2: Map tasks and Reduce tasks in the level-based sequence of MapReduce invocations for evidence collection.]

7.2.3 Level based Iterative MapReduce

The DFS-based approach provides a straightforward way to preserve the data dependencies. However, parallelism is available only in the map phase for evidence collection or in the reduce phase for evidence distribution. In addition, parallelism is limited by the number of child nodes of a clique. We propose a level-based approach that still preserves data dependencies but exploits more parallelism. The level of a clique is defined as the number of edges on the path from it to the root of the junction tree.
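The level grouping that drives the level-based invocations can be sketched as follows (again an illustration on a toy adjacency-dictionary tree, not the thesis code): a BFS assigns each clique its level, and the per-level groups are then consumed in descending order for collection and ascending order for distribution.

```python
def levels(tree, root):
    """Group clique identifiers by level, i.e. the number of edges
    on the path to the root (computed with a breadth-first sweep)."""
    by_level, frontier, l = {}, [root], 0
    while frontier:
        by_level[l] = frontier
        frontier = [j for i in frontier for j in tree.get(i, [])]
        l += 1
    return by_level

# Toy tree: root 0 -> {1, 2}, clique 1 -> {3, 4}.
tree = {0: [1, 2], 1: [3, 4]}
lv = levels(tree, 0)                                 # {0: [0], 1: [1, 2], 2: [3, 4]}
H = max(lv)                                          # tree height
collection_order = [lv[l] for l in range(H, 0, -1)]  # map inputs, level H down to 1
distribution_order = [lv[l] for l in range(0, H)]    # map inputs, level 0 up to H-1
# collection_order == [[3, 4], [1, 2]]; distribution_order == [[0], [1, 2]]
```

Compared with the DFS-based log shown earlier, all cliques at a level are now mapped in the same invocation, which is the source of the additional parallelism.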
In this approach, each MapReduce invocation receives all the nodes at a certain level as the data input for the map phase. The sequence of MapReduce invocations proceeds in descending-level order in evidence collection and in ascending-level order in evidence distribution. Hence, the data dependencies are preserved by the order of map tasks and reduce tasks in the sequence of MapReduce invocations. This approach is illustrated in Algorithms 11 and 12.

Algorithm 11 Level-based MapReduce for evidence collection
1: procedure MR_Level-based_EviCol(JTree J)
2:   Let H = height(J)
3:   for l = H, H-1, ..., 1 do
4:     Let L = nodes_at_level(J, l)
5:     Set L as input data for the map phase
6:     Perform MapReduce invocation on L
7:   end for
8: end procedure

In Algorithm 11, the data input for the map phase of each MapReduce invocation includes all the cliques at the corresponding level. The map task and reduce task for evidence collection are those defined in Algorithm 7. The map tasks at level l produce a set of intermediate keys that are identifiers of some parent cliques at level (l-1). The intermediate values for an intermediate key are grouped by the MapReduce invocation, giving the separator input to the clique update performed by a reduce task. Note that only non-leaf cliques at level (l-1) correspond to reduce tasks. The MapReduce invocations are performed in descending order of the level l, from H to 1. The execution of Algorithm 11 on a sample junction tree is illustrated in Figure 7.2.

We analyze the complexity of the algorithm using the concurrent read exclusive write parallel random access machine (CREW-PRAM) model [JáJ92]. We assume that each non-leaf node has d children and w variables, with each variable having r states. Let N denote the number of nodes, H the tree height, and P the number of processors. Assume that n_l is the total number of nodes and m_l is the number of non-leaf nodes at level l. Note that n_l = d·m_{l-1}.
In the MapReduce invocation at tree level l, H ≥ l ≥ 1, there are n_l map tasks executed in parallel in the map phase, and m_{l-1} reduce tasks executed in parallel in the reduce phase. Thus, the execution time of the MapReduce invocation is t_M^(l) = wr^w·⌈n_l/P⌉ ≤ wr^w·(⌊n_l/P⌋ + 1) for the map phase, and t_R^(l) = dwr^w·⌈m_{l-1}/P⌉ ≤ dwr^w·(⌊m_{l-1}/P⌋ + 1) for the reduce phase. The number of MapReduce invocations is H. Thus, the time complexity of Algorithm 11 is t = Σ_{l=H}^{1} (t_M^(l) + t_R^(l)) = O(dwr^w(N/P + H)).

Algorithm 12 Level-based MapReduce for evidence distribution
1: procedure MR_Level-based_EviDist(JTree J)
2:   Let H = height(J)
3:   for l = 0, 1, ..., H-1 do
4:     Let L = nodes_at_level(J, l)
5:     Set L as input data for the map phase
6:     Perform MapReduce invocation on L
7:   end for
8: end procedure

Algorithm 12 shows the level-based approach to using MapReduce for evidence distribution. In this algorithm, the sequence of MapReduce invocations follows the ascending order of the level l, from 0 to (H-1). Also note that the map task and reduce task in evidence distribution, defined in Algorithm 8, are different from those in evidence collection. A similar analysis shows that the time complexity of Algorithm 12 is t = O(dwr^w(N/P + H)), the same as that of Algorithm 11.

7.3 Experiments

7.3.1 Experimental Setup

Facilities

We conducted experiments on a 40-core Intel Westmere-EX based system as a representative state-of-the-art multi-core system. The system consists of four Xeon E7-4860 processors fully connected through 6.4 GT/s QPI links. Each processor has 10 cores running at 2.26 GHz and sharing a 24 MB L3 cache. The system has 64 GB of DDR3 shared memory in total. The operating system is Red Hat Linux 4.1.2-45. We used gcc-4.1.2 with optimization flag -O3 to compile our programs.

Phoenix-2 [YRK09] was used as the MapReduce runtime for our MapReduce invocations. The runtime allows us to define the map task and reduce task, and to assign the data input for the map phase by defining the function splitter for each MapReduce invocation. The runtime is responsible for scheduling all the map tasks and all the reduce tasks onto the available cores. We evaluated the scalability of our proposed method using various numbers of cores, up to 40.

Datasets

To evaluate the performance of our proposed approaches, we used a set of balanced junction trees generated synthetically and a set of random junction trees converted from real Bayesian networks.

Balanced junction trees were generated as d-ary trees, in which each node except the leaf nodes and the last non-leaf node has d children. The impact of the tree structure and the task size on performance was evaluated using various parameters, such as the number of children of each node (d), the number of cliques (N), and the clique width (w), i.e., the number of variables in a clique.

Random junction trees were obtained by extracting parts of the junction trees converted from real Bayesian networks in the Quick Medical Reference knowledge base [MSH+91]. The junction trees were extracted with parameters including the number of cliques (N) and the tree height (H). For each pair of N and H, we obtained 20 random trees and report the average execution time.

Performance Metrics and Baselines

Execution time and speedup are used to evaluate the performance of our method. The speedup on P cores is calculated as the ratio of the execution time on a single core to the execution time on P cores.

We compare our MapReduce based approaches with serial, data parallel, and OpenMP implementations of evidence propagation. Serial is a sequential implementation of evidence propagation in which cliques are updated one by one in BFS order. Data parallel is a parallel technique that exploits data parallelism in node level computation [XP10d].
The cliques are updated one by one in BFS order, but in each node level primitive the potential table is divided and distributed to the cores so that different parts of the potential table can be updated in parallel. The OpenMP baseline uses the available OpenMP directives to parallelize the node level computation. The ease of implementation of the two parallel baselines is comparable to that of the MapReduce based approaches.

7.3.2 Experimental Results

[Figure 7.3: Scalability of the MapReduce based approaches compared with the baselines (execution time in seconds versus number of threads, for Serial, DFS-based, OpenMP, Data parallel, and Level-based).]

Figure 7.3 illustrates the execution time of the proposed approaches compared with the two parallel baselines. The experiments were conducted using a balanced junction tree with N = 2047, r = 2, w = 15, and d = 2, except that d = 40 was used for the DFS-based MapReduce approach. With r = 2 and w = 15, data parallelism is sufficient for the two baselines to scale well when fewer than 20 cores are used. With low synchronization overhead, the data parallel technique exhibits even better scalability than the level-based MapReduce approach when the number of cores is less than 10. However, as the number of cores increases beyond 20, the two baselines do not scale with the number of cores. The scalability of the OpenMP baseline is worse because the OpenMP runtime does not offer efficient support for the irregular data accesses required by the node level computation. In addition, false sharing is another issue in the OpenMP implementation when a potential table, stored as a contiguous array, is updated frequently by multiple threads. The level-based MapReduce approach shows good scalability with the number of cores.
When 40 cores are used, the level-based MapReduce approach improves the execution time by 50.3% compared with the OpenMP baseline and by 39.6% compared with the data parallel baseline.

For the DFS-based MapReduce approach, because there are only d map tasks in one MapReduce invocation, we set d = 40 to provide sufficient parallelism. This approach shows very limited scalability in Figure 7.3. However, this result is expected because, as analyzed in Section 7.2.2, the execution time cannot be reduced by more than 50% compared with the sequential execution time, regardless of the number of cores used. Thus, the DFS-based approach does not seem to be suitable for exact inference in junction trees. For other computational problems, however, the DFS-based MapReduce approach may still appeal as an effective method due to its simplicity. In the following paragraphs, we present the performance of the level-based MapReduce approach.

The impact of tree structure and task size on the performance of the level-based MapReduce approach is illustrated in Figure 7.4. The default values of the parameters are N = 2047, r = 2, w = 15, and d = 2. Figure 7.4a shows that for larger values of N, i.e. for larger trees, the proposed approach achieves better speedups. This result is expected because larger trees offer more parallelism and also reduce the impact on the overall execution time of the limited parallelism available at the root clique.

[Figure 7.4: Speedup of the level-based MapReduce approach on the 40-core Intel Westmere-EX based system for balanced junction trees with respect to various parameters: (a) N = 1023, 2047, 4095; (b) d = 2, 4, 8; (c) w = 10, 15, 20.]

In Figure 7.4b, changing the value of d does not have a significant impact on the speedup of the approach. The speedups achieved are almost identical for d = 2, d = 4, and d = 8.
In Figure 7.4c, the clique width w, which determines the task size wr^w, shows a significant impact on the speedup of the approach. The larger the value of w, i.e. the larger the task size, the better the speedup the level-based MapReduce approach achieves. A larger task size helps reduce the impact of the overhead incurred by the Phoenix system in managing the parallelization.

Finally, we evaluated the level-based MapReduce approach using the random junction trees based on real Bayesian networks in the Quick Medical Reference knowledge base. We obtained junction trees with N = 2000, r = 2, and w = 12 on average. We varied the density of the junction trees by setting various values for the tree height H. The smaller the tree height, the more "bushy" the tree is.

[Figure 7.5: Speedup of the level-based MapReduce approach on the 40-core Intel Westmere-EX based system for random junction trees jtree1, jtree2, jtree3.]

Figure 7.5 shows the speedup of the proposed approach for three junction trees, named jtree1, jtree2, and jtree3, corresponding to H = 5, H = 15, and H = 25, respectively. As expected, the best speedup is achieved with jtree1, followed by jtree2 and then jtree3. Although the average clique width w is quite small (w = 12), the level-based MapReduce method still achieves good scalability with jtree1. Compared with balanced trees, these random trees have a more even distribution of nodes across levels. Thus, for moderate numbers of cores (from 10 to 30), the speedup tends to be better. However, when the number of cores exceeds 30, the speedup tapers off, because the number of nodes at each level becomes relatively small compared with the available parallelism.

Chapter 8

Conclusion and Future Work

Graphical models have been essential tools for probabilistic reasoning and machine learning.
With the emergence of multi-core processors and the need for large-scale graphical models in real-time systems, parallelism in graphical models must be exploited efficiently. In this thesis, we investigated parallel techniques to achieve scalable performance for exact inference in graphical models on multi-core platforms. We exploited parallelism at multiple levels to extract the most parallelism from the given input. Various techniques, such as multithreading and task scheduling, were employed to map the parallelism onto the hardware cores efficiently. The exploitation of weak dependency helped significantly increase the degree of parallelism for a given graph. We also studied the use of the MapReduce programming model for exact inference, which is not a conventional application of MapReduce. As exact inference in graphical models represents a large class of graph computations, our techniques can also be applied to these graph computational problems. A summary of our proposed techniques and future work is given as follows.

We explored data parallel techniques for belief propagation in factor graphs. We proposed algorithms for node level computation to exploit data parallelism in the table operations. An algorithm for belief propagation was then developed using these table operations. We implemented the algorithms and conducted experiments on state-of-the-art multi-socket multi-core systems. Data placement was used to address NUMA effects in the system. For factor graphs with large potential tables, the experimental results exhibited almost linear speedup. In addition, when data parallelism is insufficient, task parallelism is explored for exact inference in acyclic factor graphs. Our approach consists of constructing a task dependency graph for the input factor graph and then using a task scheduler to allocate tasks to the cores for parallel execution.
Our experimental results showed that this is an efficient method for parallelizing belief propagation in factor graphs. For factor graphs with sufficient task parallelism, the proposed method achieved a 37× speedup on a general purpose 40-core system.

When the available data and task parallelism is insufficient, we exploited weak dependencies in DAG scheduling to improve the performance on multi-core processors. We designed and implemented a scheduling method by adapting a traditional scheduling scheme, so as to allow a task to execute with respect to each of its preceding tasks separately. We used a typical example, evidence collection in junction trees, to evaluate the proposed scheduling method. We showed that the proposed scheduling method significantly improved the performance compared with the baseline for a wide variety of junction trees. For arbitrary trees, on a general purpose 8-core system, the proposed scheduler ran faster than the baseline by 30% on average. For some types of junction trees, the proposed method ran 4× faster than the baseline.

We adapted a pointer jumping based method to explore exact inference in junction trees with very limited data and task parallelism on manycore systems. The proposed method is employed for both evidence collection and distribution. In evidence collection, as each clique is updated from its children sequentially, the performance of the pointer jumping based evidence collection depends on the topology of the input junction tree. The less slim the tree is, the better the improvement the technique can achieve. Our experimental results show that the proposed algorithm is well suited for junction trees with limited parallelism on manycore platforms.

To avoid the implementation complexity of the proposed techniques, we studied the use of the MapReduce programming model to exploit task parallelism for exact inference in Bayesian networks.
We defined map tasks, reduce tasks, and methods to utilize the MapReduce framework for the evidence propagation process with respect to data dependency constraints. Our approaches employed depth-first search (DFS) traversal methods and a level-based method to order the sequence of MapReduce invocations for handling the data dependencies. Although it has limited scalability for evidence propagation, the DFS-based approach can be efficient for other computational problems. The level-based approach was shown to be efficient for evidence propagation in junction trees. Compared with data parallel baselines that have a similar level of implementation complexity, the level-based MapReduce approach achieved better scalability with the number of cores, especially when the data parallelism is limited. Using an existing MapReduce runtime called Phoenix, the level-based approach achieved a 20× speedup for exact inference on a 40-core Intel Westmere-EX based system.

While the proposed techniques have been shown to be efficient for exact inference, there are still some interesting problems to study in the future.

One problem to explore is the automatic combination of data parallelism and task parallelism to improve load balancing and to reduce the overhead of coordinating threads. When the number of tasks is small and there are some very large tasks, the task scheduler can exploit data parallelism within these large tasks to distribute the load more evenly. When there are too many small tasks, the scheduler can group them together to reduce the scheduling overhead.

Another interesting problem is how to generalize the exploitation of weak dependency to different scheduling methods, including both work-sharing and work-stealing approaches, on large scale systems. Other graph computations can be used to evaluate this method. Data locality and communication overhead are also challenging issues to address when studying DAG scheduling problems.
It is important to study additional optimization techniques when using MapReduce for graph computations, such as embedding map tasks into reduce tasks or overlapping map tasks with reduce tasks to increase task parallelism. The use of MapReduce for data parallelism in the computation within a clique node is another problem to study. Finally, modifying the MapReduce execution model to exploit weak dependencies in graph computations is a fascinating problem to investigate.

Bibliography

[AKW96] I. Ahmad, Yu-Kwong Kwok, and Min-You Wu. Analysis, evaluation, and comparison of algorithms for scheduling task graphs on parallel processors. In Proceedings of the 1996 International Symposium on Parallel Architectures, Algorithms and Networks, pages 207–213, 1996.

[ARK08] I. Ahmad, S. Ranka, and S.U. Khan. Using game theory for scheduling tasks on multi-core processors for simultaneous optimization of performance and energy. In Intl. Sym. on Parallel Dist. Proc., pages 1–6, 2008.

[BC05] David A. Bader and Guojing Cong. A fast, parallel spanning tree algorithm for symmetric multiprocessors (SMPs). Parallel and Distributed Computing, 65(9):994–1006, 2005.

[BHR09] Anne Benoit, Mourad Hakem, and Yves Robert. Contention awareness and fault-tolerant scheduling for precedence constrained tasks in heterogeneous systems. Parallel Computing, 35(2):83–108, 2009.

[BJC95] David A. Bader, Joseph JáJá, and Rama Chellappa. Scalable data parallel algorithms for texture synthesis and compression using Gibbs Random Fields. IEEE Transactions on Image Processing, 4(10):1456–1460, 1995.

[BJK+96] R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou. Cilk: An efficient multithreaded runtime system. Technical report, Cambridge, 1996.

[Buy99] Rajkumar Buyya. High Performance Cluster Computing: Architectures and Systems, Vol. 1. Prentice Hall, 1999.
[CDMC+05] Cristian Coarfa, Yuri Dotsenko, John Mellor-Crummey, François Cantonnet, Tarek El-Ghazawi, Ashrujit Mohanti, Yiyi Yao, and Daniel Chavarría-Miranda. An evaluation of global address space languages: co-array Fortran and unified parallel C. In Proceedings of the tenth ACM SIGPLAN symposium on Principles and practice of parallel programming, PPoPP '05, pages 36–47, New York, NY, USA, 2005. ACM.

[Cha] Charm++ programming system. http://charm.cs.uiuc.edu/research/charm/.

[CK00] Huey-Ling Chen and Chung-Ta King. Eager scheduling with lazy retry in multiprocessors. Future Gener. Comput. Syst., 17(3):215–226, 2000.

[CKL+06] Cheng T. Chu, Sang K. Kim, Yi A. Lin, Yuanyuan Yu, Gary R. Bradski, Andrew Y. Ng, and Kunle Olukotun. Map-reduce for machine learning on multicore. In NIPS, pages 281–288. MIT Press, 2006.

[CLR90] Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest. Introduction to Algorithms. The MIT Press and McGraw-Hill Book Company, 1990.

[Cof76] E. G. Coffman. Computer and Job-Shop Scheduling Theory. John Wiley and Sons, New York, NY, 1976.

[Coo90] Gregory F. Cooper. The computational complexity of probabilistic inference using Bayesian belief networks. Artificial Intelligence, 42(2-3):393–405, 1990.

[CPF06] Duen Horng Chau, Shashank Pandit, and Christos Faloutsos. Detecting fraudulent personalities in networks of online auctioneers. In Proceedings of the 10th European conference on Principle and Practice of Knowledge Discovery in Databases, PKDD'06, pages 103–114. Springer-Verlag, 2006.

[DFL92] Bruce D'Ambrosio, Tony Fountain, and Zhaoyu Li. Parallelizing probabilistic inference: Some early explorations. In UAI, pages 59–66, 1992.

[DG04] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, pages 137–150, 2004.

[DVKT06] M. De Vuyst, R. Kumar, and D.M. Tullsen. Exploiting unbalanced thread scheduling for energy and performance on a CMP of SMT processors. In Intl. Sym. on Parallel Dist. Proc., pages 1–6, 2006.

[ELZ+10] Jaliya Ekanayake, Hui Li, Bingjing Zhang, Thilina Gunarathne, Seung-Hee Bae, Judy Qiu, and Geoffrey Fox. Twister: a runtime for iterative MapReduce. In Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, pages 810–818, 2010.

[FSS11] G. Falcão, L. Sousa, and V. Silva. Massively LDPC decoding on multicore architectures. IEEE Transactions on Parallel and Distributed Systems, 22(2):309–322, 2011.

[GJ90] Michael R. Garey and David S. Johnson. Computers and Intractability; A Guide to the Theory of NP-Completeness. W. H. Freeman & Co., New York, NY, USA, 1990.

[GLG09] Joseph Gonzalez, Yucheng Low, and Carlos Guestrin. Residual splash for optimally parallelizing belief propagation. Journal of Machine Learning Research - Proceedings Track, 5:177–184, 2009.

[GLGO09] Joseph Gonzalez, Yucheng Low, Carlos Guestrin, and David O'Hallaron. Distributed parallel inference on large factor graphs. In Conference on Uncertainty in Artificial Intelligence (UAI), Montreal, Canada, July 2009.

[GSZ11] Michael T. Goodrich, Nodari Sitchinava, and Qin Zhang. Sorting, searching, and simulation in the MapReduce framework. In Proceedings of the 22nd international conference on Algorithms and Computation, pages 374–383, 2011.

[had] Hadoop wiki - powered by.

[Hec97] David Heckerman. Bayesian networks for data mining. Data Mining and Knowledge Discovery, 1(1):79–119, 1997.

[HFL+08] Bingsheng He, Wenbin Fang, Qiong Luo, Naga K. Govindaraju, and Tuyong Wang. Mars: a MapReduce framework on graphics processors. In Proceedings of the 17th international conference on Parallel architectures and compilation techniques, PACT '08, pages 260–269, 2008.

[Int] Intel Threading Building Blocks. http://www.threadingbuldingblocks.org/.

[JáJ92] Joseph JáJá. An Introduction to Parallel Algorithms. Addison-Wesley, Reading, MA, USA, 1992.

[JXP10] Hyran Jeon, Yinglong Xia, and Viktor K. Prasanna. Parallel exact inference on a CPU-GPGPU heterogeneous system. In International Conference on Parallel Processing (ICPP), pages 1–10, 2010.

[KA99] Yu-Kwong Kwok and Ishfaq Ahmad. Static scheduling algorithms for allocating directed task graphs to multiprocessors. ACM Computing Surveys, 31(4):406–471, 1999.

[KBD10] Jakub Kurzak, David A. Bader, and Jack Dongarra. Scientific Computing with Multicore and Accelerators. Chapman & Hall / CRC Press, 2010.

[KC98] V. Karamcheti and A.A. Chien. A hierarchical load-balancing framework for dynamic multithreaded computations. In Proceedings of the ACM/IEEE Conference on Supercomputing, pages 1–17, 1998.

[KC04] M. Karkooti and J.R. Cavallaro. Semi-parallel reconfigurable architectures for real-time LDPC decoding. In Proceedings of International Conference on Information Technology: Coding and Computing, 2004.

[KD09] Jakub Kurzak and Jack Dongarra. Fully dynamic scheduler for numerical computing on multicore processors. Technical report, 2009.

[KF09] Daphne Koller and Nir Friedman. Probabilistic Graphical Models: Principles and Techniques. The MIT Press, 2009.

[KFL01] Frank R. Kschischang, Brendan J. Frey, and Hans-Andrea Loeliger. Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2):498–519, 2001.

[KS94] Alexander V. Kozlov and Jaswinder Pal Singh. A parallel Lauritzen-Spiegelhalter algorithm for probabilistic inference. In Proceedings of Supercomputing, pages 320–329, 1994.

[KSV10] Howard J. Karloff, Siddharth Suri, and Sergei Vassilvitskii. A model of computation for MapReduce. In SODA, pages 938–948, 2010.

[KYP10] Byung-Hak Kim, Arvind Yedla, and Henry D. Pfister. Message-passing inference on a factor graph for collaborative filtering. CoRR, abs/1004.1003, 2010.

[LGHB07] Andrew Lumsdaine, Douglas Gregor, Bruce Hendrickson, and Jonathan W. Berry. Challenges in parallel graph processing. Parallel Processing Letters, 17(1):5–20, 2007.

[Loe04] H. Loeliger. An introduction to factor graphs. IEEE Signal Processing Magazine, 21(1):28–41, 2004.

[LS88] S. L. Lauritzen and D. J. Spiegelhalter. Local computations with probabilities on graphical structures and their application to expert systems. Royal Statistical Society B, 50:157–224, 1988.

[LS10] Jimmy Lin and Michael Schatz. Design patterns for efficient graph algorithms in MapReduce. In Proceedings of the Eighth Workshop on Mining and Learning with Graphs, MLG '10, pages 78–85, 2010.

[MAB+10] Grzegorz Malewicz, Matthew H. Austern, Aart J.C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. Pregel: a system for large-scale graph processing. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, SIGMOD '10, pages 135–146. ACM, 2010.

[MBA+09] Mary McGlohon, Stephen Bay, Markus G. Anderle, David M. Steier, and Christos Faloutsos. SNARE: a link analytic system for graph labeling and risk detection. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD '09, pages 1265–1274, New York, NY, USA, 2009. ACM.

[Moo10] Joris M. Mooij. libDAI: A free and open source C++ library for discrete approximate inference in graphical models. Journal of Machine Learning Research, 11:2169–2173, August 2010.

[MSH+91] Blackford Middleton, Michael Shwe, David Heckerman, Harold Lehmann, and Gregory Cooper. Probabilistic diagnosis using a reformulation of the INTERNIST-1/QMR knowledge base. Methods of Information in Medicine, 30:241–255, 1991.

[MSLB07] Alexander Mendiburu, Roberto Santana, Jose Antonio Lozano, and Endika Bengoetxea. A parallel framework for loopy belief propagation. In GECCO (Companion), pages 2843–2850, 2007.

[MXP11] Nam Ma, Yinglong Xia, and Viktor K. Prasanna. Data parallelism for belief propagation in factor graphs. In Proceedings of the 23rd International Symposium on Computer Architecture and High Performance Computing, pages 56–63, 2011.

[OIS+06] M. Ohara, H. Inoue, Y. Sohda, H. Komatsu, and T. Nakatani. MPI microtask for programming the Cell Broadband Engine processor. IBM Systems Journal, 45(1):85–102, 2006.

[Ope] OpenMP Application Programming Interface. http://www.openmp.org/.

[Pea88] Judea Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1988.

[Pen98] David Pennock. Logarithmic time parallel Bayesian inference. In Proceedings of the 14th Annual Conference on Uncertainty in Artificial Intelligence, pages 431–438, 1998.

[PHBB09] Biswanath Panda, Joshua S. Herbach, Sugato Basu, and Roberto J. Bayardo. PLANET: massively parallel learning of tree ensembles with MapReduce. Proc. VLDB Endow., 2(2):1426–1437, 2009.

[PL10] M. Parashar and X. Li. Advanced Computational Infrastructures for Parallel and Distributed Applications, chapter Enabling Large-Scale Computational Science: Motivations, Requirements and Challenges, pages 1–7. 2010.

[PM11] Nico Piatkowski and Katharina Morik. Parallel inference on structured data with CRFs on GPUs. In International Workshop at ECML PKDD on Collective Learning and Inference on Structured Data (COLISD 2011), 2011.

[PY88] Christos Papadimitriou and Mihalis Yannakakis. Towards an architecture-independent analysis of parallel algorithms. In Proceedings of the twentieth annual ACM symposium on Theory of computing, pages 510–513, 1988.

[RRP+07] Colby Ranger, Ramanan Raghuraman, Arun Penmetsa, Gary Bradski, and Christos Kozyrakis. Evaluating MapReduce for multi-core and multiprocessor systems. In Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture, HPCA '07, pages 13–24, 2007.

[SF08] E. Sudderth and W. T. Freeman. Signal and image processing with belief propagation. IEEE Signal Processing Magazine, 25, March 2008.

[SHG09] David H. Stern, Ralf Herbrich, and Thore Graepel. Matchbox: large scale online Bayesian recommendations. In Juan Quemada, Gonzalo León, Yoëlle S. Maarek, and Wolfgang Nejdl, editors, WWW, pages 111–120. ACM, 2009.

[Sin07] Oliver Sinnen. Task Scheduling for Parallel Systems (Wiley Series on Parallel and Distributed Computing). Wiley-Interscience, 2007.

[SMV10] Kyle Spafford, Jeremy Meredith, and Jeffrey Vetter. Maestro: data orchestration and tuning for OpenCL devices. In Proceedings of the 16th international Euro-Par conference on Parallel processing: Part II, Euro-Par'10, pages 275–286, Berlin, Heidelberg, 2010.

[SSPM10] Sameer Singh, Amarnag Subramanya, Fernando Pereira, and Andrew McCallum. Distributed MAP inference for undirected graphical models. In Neural Information Processing Systems (NIPS), Workshop on Learning on Cores, Clusters and Clouds, 2010.

[STG+01] E. Segal, B. Taskar, A. Gasch, N. Friedman, and D. Koller. Rich probabilistic models for gene expression. In 9th International Conference on Intelligent Systems for Molecular Biology, pages 243–252, 2001.

[SYD09] Fengguang Song, Asim YarKhan, and Jack Dongarra. Dynamic task scheduling for linear algebra algorithms on distributed-memory multicore systems. In International Conference for High Performance Computing, Networking, Storage and Analysis, 2009.

[TG99] Xinan Tang and Guang R. Gao. Automatically partitioning threads for multithreaded architectures. J. Parallel Distrib. Comput., 58(2):159–189, 1999.

[THW02] Haluk Topcuoglu, Salim Hariri, and Min-You Wu. Performance-effective and low-complexity task scheduling for heterogeneous computing. IEEE Trans. Parallel Distrib. Syst., 13:260–274, March 2002.

[TR08] Tom Richardson and Ruediger Urbanke. Modern Coding Theory. Cambridge University Press, 2008.

[TSWY09] Jie Tang, Jimeng Sun, Chi Wang, and Zi Yang. Social influence analysis in large-scale networks. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD '09, pages 807–816, 2009.

[UK10] U Kang, Duen Horng (Polo) Chau, and Christos Faloutsos. Inference of beliefs on billion-scale graphs. In The 2nd Workshop on Large-scale Data Mining: Theory and Applications (LDMTA 2010), July 2010.

[VAB10] V. Agarwal, F. Petrini, D. Pasetto, and D.A. Bader. Scalable graph exploration on multicore processors. In Proceedings of the 22nd IEEE and ACM Supercomputing Conference (SC10), 2010.

[VDGR96] Ben Verghese, Scott Devine, Anoop Gupta, and Mendel Rosenblum. Operating system support for improving data locality on CC-NUMA compute servers. In Proceedings of the seventh international conference on Architectural support for programming languages and operating systems, ASPLOS-VII, pages 279–289, 1996.

[XFP09] Yinglong Xia, Xiaojun Feng, and Viktor K. Prasanna. Parallel evidence propagation on multicore processors. In PaCT, pages 377–391, 2009.

[XP10a] Yinglong Xia and Viktor K. Prasanna. Collaborative scheduling of DAG structured computations on multicore processors. In Conf. Computing Frontiers, pages 63–72, 2010.

[XP10b] Yinglong Xia and Viktor K. Prasanna. Parallel exact inference on the Cell Broadband Engine processor. Journal of Parallel and Distributed Computing, 70:558–572, 2010.

[XP10c] Yinglong Xia and Viktor K. Prasanna. Parallel evidence propagation on multicore processors. The Journal of Supercomputing, 2010.

[XP10d] Yinglong Xia and Viktor K. Prasanna. Scalable node level computation kernels for parallel exact inference. IEEE Transactions on Computers, 59(1):103–115, 2010.

[YRK09] Richard M. Yoo, Anthony Romano, and Christos Kozyrakis. Phoenix rebirth: Scalable MapReduce on a large-scale shared-memory system. In Proceedings of the 2009 IEEE International Symposium on Workload Characterization (IISWC), IISWC '09, pages 198–207, 2009.

[YTT+08] Jackson H. C. Yeung, C. C. Tsang, K. H. Tsoi, Bill S. H. Kwan, Chris C. C. Cheung, Anthony P. C. Chan, and Philip H. W. Leong. Map-reduce as a programming model for custom computing machines.
In Proceed- ings of the 2008 16th International Symposium on Field-Programmable Custom Computing Machines, FCCM ’08, pages 149–159, 2008. 117