EXPLORATION OF PARALLELISM FOR PROBABILISTIC GRAPHICAL MODELS

by

Yinglong Xia

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)

December 2010

Copyright 2010 Yinglong Xia

Dedication

To my parents

Acknowledgments

First, I must thank my advisor, Prof. Viktor K. Prasanna. His guidance and support have been instrumental in my success. This work would not have been possible without him. I would also like to express my gratitude to my committee members, Prof. Michel Dubois and Prof. Aiichiro Nakano. They have provided many useful insights during this process. Additionally, Prof. Aiichiro Nakano, Prof. Shuya You, Prof. Monte Ung, and Prof. B. Keith Jenkins provided helpful feedback at the time of my qualifying exam. I would also like to thank my Master advisors, Dr. Nanyuan Zhao and Dr. Changshui Zhang, for their inspiration and guidance.

I have had the benefit of working with a wonderful research group. I would like to acknowledge Prof. Mark Redekopp, Animesh Pathak, Yi-Hua Edward Yang, Hyeran Jeon, Nam Ma, and Hoang Le. I would especially like to thank Prof. Bo Hong, an early member of P-group. I also thank Janice Thompson, Estela Lopez, and Lizsl De Leon for their administrative assistance. Thanks also to the other P-group members and alumni for sharing their interesting ideas and perspectives. It has also been a lot of fun to have them around and I will miss them dearly.

Finally, my family deserves special recognition for all that they have done for me, not only during this process, but throughout my life. I could not have reached this point without their love and support. Thank you.

Table of Contents

Dedication
Acknowledgments
List of Figures
Abstract
Chapter 1: Introduction
  1.1 Probabilistic Graphical Models
  1.2 Motivation
  1.3 Challenges
  1.4 Contributions
  1.5 Organization
Chapter 2: Background and Related Work
  2.1 Bayesian Network and Junction Tree
  2.2 Sequential Exact Inference
  2.3 Parallel Implementation of Exact Inference
  2.4 Platforms and Challenges
Chapter 3: Parallel Conversion of a Bayesian Network into a Junction Tree
  3.1 Parallel Conversion
    3.1.1 Structure Conversion
    3.1.2 Potential Table Construction
    3.1.3 Complexity Analysis
  3.2 Experiments
    3.2.1 Computing Facilities
    3.2.2 Input Bayesian Networks
    3.2.3 Experimental Results
Chapter 4: Pointer Jumping and Rerooting for Exact Inference
  4.1 Pointer Jumping Based Evidence Propagation
    4.1.1 Pointer Jumping in a Chain of Cliques
    4.1.2 Pointer Jumping in a Junction Tree
    4.1.3 Speedup Analysis
    4.1.4 Experiments
  4.2 Junction Tree Rerooting
    4.2.1 Minimizing Critical Path by Rerooting
    4.2.2 A Straightforward Rerooting Method
    4.2.3 An Efficient Junction Tree Rerooting Method
Chapter 5: Data Parallelism in Exact Inference
  5.1 Potential Table Organization
    5.1.1 Potential Table Representation
    5.1.2 Separators Between Cliques
  5.2 Node Level Primitives
    5.2.1 Table Marginalization
    5.2.2 Table Extension
    5.2.3 Table Multiplication and Table Division
    5.2.4 Optimized Computation Kernels for Evidence Propagation
  5.3 Exact Inference with Node Level Primitives
    5.3.1 Data Layout
    5.3.2 Complete Algorithm
  5.4 Experiments
    5.4.1 Computing Facilities
    5.4.2 Experiments using PNL
    5.4.3 Experimental Results
Chapter 6: Junction Tree Decomposition
  6.1 Evidence Propagation Based on Decomposition
    6.1.1 Overview
    6.1.2 Junction Tree Decomposition
    6.1.3 Junction Tree Merging
    6.1.4 Tradeoff between Startup Latency and Bandwidth Utilization Efficiency
    6.1.5 Distributed Evidence Propagation Algorithm
  6.2 Experiments
    6.2.1 Facilities
    6.2.2 Datasets
    6.2.3 Baseline Methods
    6.2.4 Experimental Results
Chapter 7: Exact Inference on Heterogeneous Processors
  7.1 Centralized Scheduling for Exact Inference on Cell BE Processors
    7.1.1 Tasks and Task Partitioning
    7.1.2 Task Dependency Graph
    7.1.3 Dynamic Partitioning and Scheduling
      7.1.3.1 Potential Table Organization and Efficient Primitives
      7.1.3.2 Data Layout for Optimizing Cell Performance
      7.1.3.3 Complete Algorithm for Parallel Exact Inference
    7.1.4 Experiments
      7.1.4.1 Facilities
      7.1.4.2 Datasets
    7.1.5 Performance Evaluation of the Proposed Method
Chapter 8: Scheduling DAG Structured Computations
  8.1 Lock-free Collaborative Scheduling
    8.1.1 Components
    8.1.2 An Implementation of Collaborative Scheduler
    8.1.3 Lock-free Data Structures
    8.1.4 Correctness
    8.1.5 Experiments
      8.1.5.1 Baseline Schedulers
      8.1.5.2 Datasets and Task Types
      8.1.5.3 Experimental Results
  8.2 Hierarchical Scheduling on Manycore Processors with Dynamic Thread Grouping
    8.2.1 Organization
    8.2.2 Dynamic Thread Grouping
    8.2.3 Hierarchical Scheduling
    8.2.4 Scheduling Algorithm and Analysis
    8.2.5 Experiments
      8.2.5.1 Baseline
      8.2.5.2 Datasets and Data Layout
      8.2.5.3 Results
Chapter 9: Exact Inference Using Task Schedulers
  9.1 Collaborative Scheduling for Exact Inference
    9.1.1 Collaborative Scheduling
    9.1.2 Experiments
  9.2 Hierarchical Scheduling for Exact Inference
    9.2.1 Exact Inference Using Hierarchical Schedulers
    9.2.2 Experiments
      9.2.2.1 Computing Facilities
      9.2.2.2 Datasets
      9.2.2.3 Baseline Implementations
      9.2.2.4 Results
Chapter 10: Conclusion and Future Work
  10.1 Summary of Contributions
  10.2 Future Work
References

List of Figures

2.1 (a) A sample Bayesian network and (b) corresponding junction tree.
2.2 Sequential Exact Inference
2.3 Illustration of evidence collection and evidence distribution in a junction tree. The cliques in boxes are under processing. The shaded cliques have been processed.
2.4 The scalability of exact inference in various junction trees using the PNL library.
3.1 An example of (a) Bayesian network; (b) Moralized graph; (c) Triangulated graph; (d) Junction tree.
3.2 An example of (a) Random Bayesian network; (b) Linear Bayesian network; (c) Balanced Bayesian network.
3.3 The execution time for exact inference on DataStar.
3.4 The execution time for exact inference on HPCC.
4.1 Pointer jumping for exact inference.
4.2 Pointer Jumping Based Exact Inference in a Junction Tree
4.3 Comparison between pointer jumping and scheduling based exact inference in chains.
4.4 Comparison between pointer jumping and scheduling based exact inference in slim trees.
4.5 Comparison between pointer jumping and scheduling based exact inference in balanced trees.
4.6 Speedup of pointer jumping based exact inference with respect to serial implementation.
4.7 (a) A sample junction tree where Clique 0 is the root. The length of the critical path is 5. (b) A new junction tree after rerooting, where Clique 2 is the root. The length of the critical path is 4.
4.8 Straightforward root selection for minimizing critical path
4.9 Root selection for minimizing critical path
5.1 (a) A sample clique and its variable vector; (b) The relationship among the potential table (POT) index, state string and potential table, assuming all random variables are binary; (c) The relationship among the array index, state string and potential table, assuming a, c, e are binary while b, d, f are ternary.
5.2 Illustration of input and output separators of clique C with respect to evidence collection and evidence distribution.
5.3 Illustration of using table marginalization to obtain separator ψ_{S_{ch_i}(C)} from clique potential table ψ_C.
5.4 Marginalization for separator ψ_{S_{ch_i}(C)}
5.5 (a) The mapping relationship between separator ψ_{S(C)} and clique potential table ψ_C; (b) The mapping relationship between extended ψ_{S(C)} and ψ_C, where every entry in ψ_{S(C)} corresponds to the entry in ψ_C with the same index.
5.6 Table extension for separator ψ_{S(C)}
5.7 Table multiplication
5.8 Table division
5.9 Illustration of using the primitive of table marginalization to obtain separator ψ_{S_{pa}(C)} from clique potential table ψ_C.
5.10 Computation kernel of evidence collection
5.11 Computation kernel of evidence distribution
5.12 (a) Sample junction tree where the size of the nodes indicates the size of potential tables. (b) The layout of potential tables. Each row consists of segments of one or more potential tables, while each column corresponds to a processor.
5.13 Exact inference using computation kernels
5.14 Scalability of PNL based parallel exact inference.
5.15 Scalability of table marginalization with respect to various parameters.
5.16 Scalability of the computation kernel for evidence collection with respect to various parameters.
5.17 Scalability of the computation kernel for evidence distribution with respect to various parameters.
5.18 Scalability of exact inference with respect to various parameters.
6.1 A junction tree in (a) can be decomposed into subtrees of various granularities shown in (b), (c) and (d). The arrows in the graph indicate evidence collection direction.
6.2 Junction Tree Decomposition into Subtrees
6.3 (a) A sample bitmap; (b) a reduced bitmap; (c) partitioned bitmaps.
6.4 Distributed Evidence Propagation
6.5 Scalability of the proposed technique compared with baseline methods.
6.6 Execution time of various steps.
6.7 Normalized execution time of various steps.
6.8 Impact of various parameters of junction tree on overall execution time.
7.1 (a) Illustration of the relationship between a clique C and the two tasks related to C; (b) Partitioning of a large task into a series of small tasks.
7.2 (a) An example of a task dependency graph for a junction tree and the corresponding task dependency graph with partitioned tasks. (b) Illustration of the scheduling scheme for exact inference.
7.3 (a) Data layout for junction tree in main memory; (b) Data layout for each clique; (c) Relationship between the layout and partial junction tree structure.
7.4 PPE: Task dependency graph scheduling
7.5 SPE: process tasks using double buffering
7.6 Computation kernel of evidence collection
7.7 Computation kernel of evidence distribution
7.8 Normalized computational complexity of the critical path for various junction trees.
7.9 Execution time of exact inference on IBM QS20 for various junction trees using various numbers of SPEs.
7.10 Observed speedup of exact inference on IBM QS20 for various junction trees.
7.11 The workload of each SPE for various junction trees (a) without load balancing and (b) with load balancing.
7.12 Execution time of the Cell versus that of other processors.
8.1 Components of the collaborative scheduler.
8.2 (a) A portion of a task dependency graph. (b) The corresponding representation of the global task list (GL). (c) The data of element T_i in the GL.
8.3 A Sample Implementation of Collaborative Scheduling
8.4 Lock-free organization of the local task list (LL_i) and a set of weight counters in Thread i, and their interaction with respect to each thread.
8.5 Comparison of the proposed scheduler with work stealing based schedulers.
8.6 Experimental results on dual Intel Xeon 5335 (Clovertown) quadcore platform.
8.7 Experimental results on quad AMD Opteron 8358 (Barcelona) quadcore platform.
8.8 Components of the hierarchical scheduler.
8.9 (a) Merge Group_i and Group_j. (b) Merge circular lists List_i and List_j. The head (tail) points to the first (last) task stored in the list. The blank elements have no task stored yet.
8.10 Group merge
8.11 The hierarchical relationship between the supermanager, managers and workers, and the corresponding scheduling schemes.
8.12 A Sample Implementation of Hierarchical Scheduler
8.13 Meta-Level Scheduling for Supermanager
8.14 Group Level Scheduling for the Manager of Group_i
8.15 Within-Group Level Scheduling for a Worker of Group_i
8.16 Comparison of average execution time with existing parallel programming systems.
8.17 Comparison with baseline scheduling methods using task graphs of various task sizes.
8.18 Performance achieved by the proposed method without dynamically adjusting the scheduler group size (number of threads per group, thds/grp) with respect to task graphs of various task sizes.
8.19 Impact of various properties of task dependency graphs on speedup achieved by the proposed method.
9.1 Components of the collaborative scheduler.
9.2 Collaborative Task Scheduling
9.3 Scalability of exact inference using various methods on (a) Intel Xeon and (b) AMD Opteron.
9.4 (a) Load balance across the threads; (b) Computation time ratio for each thread.
9.5 Speedups of exact inference on multicore systems with respect to various junction tree parameters.
9.6 Exact inference using the hierarchical scheduler
9.7 Comparison with existing parallel programming systems.
9.8 Comparison with baseline scheduling methods using junction trees of various potential table (POT) sizes.
9.9 Scalability of the proposed method with respect to various parameters of junction trees.
9.10 Load balance with respect to number of available cores.

Abstract

Probabilistic graphical models such as Bayesian networks and junction trees are widely used to compactly represent joint probability distributions. They have found applications in a number of domains, including medical diagnosis, credit assessment, and genetics, among others. The computational complexity of exact inference, a key problem in exploring probabilistic graphical models, increases dramatically with the density of the network, the clique width and the number of states of random variables. In many cases, exact inference must be performed in real time.

In this work, we explore parallelism for exact inference at various granularities on state-of-the-art high performance computing platforms. We first study parallel techniques for converting an arbitrary Bayesian network into a junction tree. Then, at a fine grained level, we explore data parallelism in node level primitives for exact inference in junction trees. Based on the node level primitives, we develop computation kernels for evidence collection and distribution on both clusters and multicore processors. In addition, we propose a junction tree decomposition approach for exact inference on a cluster of processors to explore structural parallelism at a coarse grained level. To utilize structural parallelism dynamically, we also develop various schedulers for exact inference. Specifically, we develop a centralized scheduler for heterogeneous processors, a lock-free collaborative scheduler for multicore processors, and a hierarchical scheduler with dynamic thread grouping for manycore processors. The schedulers balance the workload across the cores and partition large tasks at runtime to adapt to the processor architecture. Finally, for junction trees offering limited parallelism at both the data and structural levels, we propose a pointer jumping based method for exact inference to accelerate evidence propagation.
We implemented our proposed methods using Pthreads and the Message Passing Interface (MPI) on various platforms, including clusters, general-purpose multicore processors, heterogeneous multicore processors, and manycore processors. Compared with various baseline algorithms using a representative set of junction trees, our proposed methods exhibit superior performance.

Chapter 1
Introduction

1.1 Probabilistic Graphical Models

Many real-life systems can be modeled using a set of random variables. A probabilistic graphical model is a graph based representation for encoding a joint probability distribution over a set of random variables. Using graphs to represent the dependency relationships among the random variables makes such a representation compact [25]. A typical example of probabilistic graphical models is the Bayesian network, which has found applications in a number of domains, including medical diagnosis, credit assessment and genetics [32].

A Bayesian network represents a joint probability distribution over a set of random variables as follows. Each random variable is represented by a node in the network. If a random variable is conditionally dependent upon another variable, an edge is introduced from one node to the other to show the dependency between the two corresponding variables. Note that a Bayesian network requires that there are no directed cycles in the graph [25]; such a graph is known as a directed acyclic graph (DAG). Along with each node, we store the conditional probability distribution of the corresponding random variable. Therefore, a joint probability distribution is represented by the topology of the DAG and the locally stored conditional probability distributions.

One of the most important operations for exploring a Bayesian network is called inference. Given an observed conditional probability distribution of some random variable in a Bayesian network, it is possible that this observation differs from the original conditional probability distribution. That is, we observe some changes in the system. Such changes can be propagated to all the rest of the nodes in the Bayesian network through the edges. Specifically, according to the newly observed distribution, we update the conditional probability distributions of the children of the observed node using Bayes' theorem. Then, we update the grandchildren of the node using the updated distributions of the children. We repeat this process until the conditional distributions of all the random variables are updated. Such a process is called inference. Inference can be either exact or approximate. Exact inference is known to be NP hard [25]. In this thesis, we focus on the parallelization of exact inference.

However, previous research has pointed out that Bayes' theorem cannot be directly used for Bayesian networks with undirected cycles [25]. By undirected cycles, we mean cycles that remain in the graph once all edge directions are removed. The most popular approach for solving this problem is to convert the Bayesian network into a junction tree, and then perform exact inference on the resulting junction tree. A junction tree is a hypergraph of the corresponding Bayesian network. Each node in a junction tree (also known as a clique) corresponds to multiple random variables in the corresponding Bayesian network. Similar to the conditional probability distribution for each node in a Bayesian network, each clique in a junction tree has a potential table, obtained by combining the conditional probability distributions corresponding to the random variables in the clique. Thus, the size of a potential table is typically much larger than that of the individual conditional probability distributions. After a Bayesian network is converted into a junction tree, exact inference in the junction tree becomes a set of computations between the potential tables.
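To make the compactness concrete, consider a small hypothetical network (not an example from the thesis) over five binary variables A, B, C, D, E, in which B and C are children of A, and D and E are children of C. The joint distribution factors as

    P(A, B, C, D, E) = P(A) \, P(B \mid A) \, P(C \mid A) \, P(D \mid C) \, P(E \mid C).

A full joint table over five binary variables has 2^5 − 1 = 31 free parameters, whereas the factored form stores only 1 + 2 + 2 + 2 + 2 = 9 (one free parameter per row of each conditional probability table).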
1.2 Motivation

Parallel techniques are required to accelerate exact inference in large scale probabilistic graphical models. Many real-life applications lead to Bayesian networks with a large number of nodes and edges. For example, in bioinformatics, a gene regulation network can contain hundreds or thousands of genes, which requires a large number of random variables to capture the relationships among the genes. Thus, the number of nodes in the corresponding Bayesian network is also large. Converting such a Bayesian network into a junction tree can result in cliques consisting of many random variables. Note that the size of the potential table of a clique (i.e., the number of entries in the potential table) increases dramatically as the number of random variables in the clique and the number of states of these random variables increase. Since exact inference in a junction tree is essentially a set of computations between the potential tables, the execution time for exact inference in junction trees with large potential tables can be very high. However, in many cases, exact inference must be performed in real time. Therefore, parallel techniques must be developed to accelerate evidence propagation. In our research, we explore parallelism for exact inference at multiple granularities on various parallel computing platforms.

State-of-the-art platforms provide parallel processing capability at various granularity levels. Thus, to fully utilize the parallel computing capacity provided by the hardware, we must study methodologies for mapping real-life applications, such as exact inference in probabilistic graphical models, onto parallel computing architectures. Since Moore's Law is expected to remain valid, manufacturers are integrating more and more cores into a chip. Multicore processors have been widely used in servers, workstations and even personal computers. Several manycore processors have been produced, such as various GPGPUs, Sun UltraSPARC processors, AMD Magny-Cours and the recent Intel 48-core processors [34]. In addition to multicore/manycore processors, supercomputers consisting of a cluster of processors with high-bandwidth interconnects have been used to solve many large scale problems. The compute nodes of a cluster also offer parallel computing capacity. Exact inference in probabilistic graphical models is a complex real-life problem offering parallelism at multiple levels, such as the task level and the data level. Therefore, we are motivated to study parallel exact inference to explore the potential of modern parallel architectures.

In addition, exact inference in probabilistic graphical models is an appropriate example for studying parallel techniques for a class of complex applications with respect to various parallel computing architectures. Exact inference offers parallelism at multiple levels. At the data level, the node level primitives are similar to the calculations in many other statistical machine learning problems. At the structural level, like many other problems, exact inference can be represented as a DAG structured computation. Thus, exploring structural parallelism in exact inference helps us understand scheduling techniques for a class of DAG structured computations.
1.3 Challenges

Although there is some early work on parallel exact inference, such as Pennock [32], Kozlov and Singh [23], and Szolovits [33], parallelizing exact inference remains a fundamental challenge in parallel computing. The performance of some of those methods, such as [23], is dependent upon the structure of the Bayesian network. Others, such as [32], exhibit limited performance for multiple evidence inputs, since the evidence is assumed to be at the root of the junction tree. Rerooting techniques are employed to deal with the cases where the evidence appears at more than one clique. Some other works address only certain individual steps of exact inference. Reference [29] discusses the structure conversion from a Bayesian network to a junction tree. Reference [37] introduces node level primitives.

Another challenge comes from the varied architectures of computing platforms. In recent years, a whole spectrum of different multicore processors has been designed, manufactured and made commercially available. At one end are traditional general-purpose multicore processors containing a handful of cache-coherent and powerful cores attached to either a switching fabric or a shared bus. Examples are the quadcore Opteron from AMD and the quadcore Xeon from Intel [1, 6]. At the other end are highly parallel processors such as IBM's Cell BE and NVIDIA's Tesla [4], where a number of accelerators are controlled explicitly by a specific core or by software. There are also various special-purpose network/content processors [14]. Multicore processor architectures with different properties require different optimization techniques to implement an algorithm efficiently. Thus, porting an application from one parallel computing platform to another generally requires algorithmic modifications. Sometimes we even need to redesign the algorithm to adapt to the platform architecture.

We use a real example to show the challenges of parallelizing exact inference on real parallel computing platforms. An OpenMP based implementation of exact inference is available in Intel's Open-Source Probabilistic Networks Library (PNL) [5], a full function, free, open source graphical model library released under a Berkeley Software Distribution (BSD) style license. The PNL library includes parallel implementations for shared and distributed memory parallel computing platforms. For shared memory platforms, OpenMP pragmas [31] were used so that the compiler can parallelize loops in the code. Such parallelization is based entirely on the sequential implementation; there is no change to the algorithm at all. Similarly, MPI is used to parallelize the sequential code. Although such parallelization is straightforward and productive, the scalability of exact inference using PNL is not satisfactory. In our experiments, we observed that PNL based exact inference does not scale when the number of processors exceeds 4: for various junction trees, the execution time increases when more than four processors are used. We introduce the details of these experiments in later sections. Therefore, given multicore/manycore processors, straightforward modifications to a sequential implementation may not fully exploit the platform. In addition, as far as we know, PNL cannot support some parallel platforms such as the Cell BE and GPGPUs. Therefore, it remains challenging to achieve scalability on parallel computing platforms by simply reusing existing sequential implementations.
In many cases, we may need to redesign the algorithm with respect to the given platform architecture.

1.4 Contributions

In this thesis, we explore parallelism for exact inference at multiple granularities with respect to various parallel computing platforms. We propose efficient parallel algorithms for exact inference in probabilistic graphical models with respect to various architectures. Many techniques proposed for exact inference in this thesis can also be utilized for a class of DAG structured computations. We list the major contributions of the thesis below:

Parallel Bayesian network conversion: Many exact inference algorithms convert a Bayesian network into a junction tree, so that we can perform exact inference regardless of the topology of the input Bayesian network. Thus, we investigate parallel techniques for such a conversion. We focus on parallelization of the most popular heuristic method, the Lauritzen-Spiegelhalter algorithm, which consists of several steps to convert a Bayesian network into a junction tree, such as network normalization, clique identification, etc. We propose parallel algorithms for these steps and implement them on a cluster using message passing. The implementation scales well over a range of processors. Since a junction tree is a hypergraph representation of the corresponding Bayesian network, our proposed parallel algorithms are also useful for other graphical problems involving hypergraph representations.

Junction tree decomposition based exact inference: Traditional exact inference in junction trees requires a series of communication steps to propagate evidence throughout an entire junction tree. These communication steps lead to significant overhead due to startup latency in communication on clusters. Thus, we propose a junction tree decomposition based method for exact inference on clusters, which reduces the overhead due to startup latency in communication. This method decomposes a junction tree into a set of subtrees, each being updated on a separate processor. We derive formulas for combining the decomposed subtrees, so that all the cliques can be fully updated. The junction tree decomposition based method can complete exact inference using only a single all-to-all communication. Considering the tradeoff between the overhead due to startup latency and bandwidth utilization efficiency, we propose heuristics that allow several all-to-all communication steps in exact inference so as to minimize overall execution time. This method explores the structural parallelism offered by the topology of junction trees.

Parallel node level primitives: Exact inference in a junction tree is essentially a set of computations between potential tables. The size of the potential tables increases dramatically as the number of variables in a clique increases. For junction trees with large potential tables, exact inference can be very slow. We summarize the computations for exact inference in a junction tree into four node level primitives. For each primitive, we explore data parallelism and develop a parallel algorithm. We optimize the primitives with respect to various architectures. The optimization for exploring data parallelism includes a memory efficient organization for potential tables and the construction of two kernels for evidence collection and distribution. We develop parallel primitives on both clusters and shared memory multicore processors. The implementations scale almost linearly for many input junction trees on both platforms. This method explores the data parallelism offered by junction trees.

Pointer jumping based exact inference: We also study a pointer jumping based method for exact inference.
This method is suitable for a class of junction trees that offer limited data parallelism and structural parallelism. We analyze the speedup of pointer jumping based exact inference with respect to the serial exact inference algorithm. Our analysis shows the relationship between the performance and the number of concurrent threads in a parallel computing system. The proposed method achieves superior performance on manycore processors for junction trees with limited data parallelism and structural parallelism.

Scheduling techniques for DAG structured computations: We investigate structural parallelism using another approach, applicable to exact inference as well as other DAG structured computations. Specifically, we represent exact inference as a DAG structured computation. According to the topology of a junction tree, we build a task graph for exact inference, where a task corresponds to a series of node level primitives for updating a clique, and the edges indicate precedence relationships. We propose several schedulers to assign tasks onto the parallel processing units of various systems. In particular, we build a lock-free collaborative scheduler for multicore processors to balance the workload across the cores and reduce coordination overhead. We also extend the collaborative scheduler to manycore processors. Considering the large number of concurrent threads and the relatively limited caches on manycore processors, we design a hierarchical scheduler with dynamic thread grouping. The scheduling methods proposed in this thesis are all frameworks, so various heuristics can be utilized to improve performance. We develop various schedulers for different architectures, including the Cell BE processor and multicore/manycore processors. The proposed schedulers can also be used for a class of DAG structured computations.

1.5 Organization

The dissertation is organized as follows. We discuss the background and related work in Chapter 2. In Chapter 3, we propose a parallel solution for converting an arbitrary Bayesian network into a junction tree. In Chapter 4, we study pointer jumping and junction tree rerooting for exact inference. The data parallelism in junction trees is explored in Chapter 5. In Chapter 6, we explore structural parallelism by decomposing a junction tree into subtrees. A centralized scheduler is proposed for exact inference on Cell BE processors in Chapter 7. In Chapters 8 and 9, we investigate various schedulers for DAG structured computations. These schedulers are utilized for exact inference in Chapter 9. Finally, we conclude the dissertation in Chapter 10.

Chapter 2
Background and Related Work

2.1 Bayesian Network and Junction Tree

A Bayesian network is a probabilistic graphical model that exploits conditional independence to represent a joint distribution compactly. Figure 2.1 (a) shows a sample Bayesian network. In Figure 2.1 (a), each node represents a random variable. The edges indicate the probabilistic dependence relationships between two random variables. Notice that these edges cannot form any directed cycles. Each random variable in the Bayesian network has a discrete (conditional) probability distribution. We use the following notation to formulate a Bayesian network and its properties. A Bayesian network is defined as B = (G, P), where G is a directed acyclic graph (DAG) and P is the parameter of the network. The graph G is denoted G = (V, E), where V = {A_1, A_2, ..., A_n} is the node set and E is the edge set. Each node A_i represents a random variable. If there exists an edge from A_i to A_j, i.e. (A_i, A_j) ∈ E, then A_i is called a parent of A_j. pa(A_j) denotes the set of all parents of A_j. Given the value of pa(A_j), A_j is conditionally independent of all other preceding variables. The parameter P represents a group of conditional probability tables, defined as the conditional probability P(A_j | pa(A_j)) for each random variable A_j. Given a Bayesian network, the joint distribution is given by [25]:

    P(V) = \prod_{j=1}^{n} P(A_j \mid pa(A_j)), \quad A_j \in V

Figure 2.1: (a) A sample Bayesian network and (b) corresponding junction tree.
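To ground these definitions, the following sketch (ours, not from the thesis; the network, CPT values and function names are hypothetical) stores a small discrete Bayesian network as a map from each node to its parents and CPT, and evaluates the factored joint distribution:

    import itertools

    # Each node has parents and a CPT mapping (parent state tuple) ->
    # [P(X=0), P(X=1)] for a binary variable X. Values are illustrative.
    network = {
        "A": {"parents": [], "cpt": {(): [0.6, 0.4]}},
        "B": {"parents": ["A"], "cpt": {(0,): [0.7, 0.3], (1,): [0.2, 0.8]}},
        "C": {"parents": ["A"], "cpt": {(0,): [0.9, 0.1], (1,): [0.5, 0.5]}},
    }

    def joint(assignment):
        """P(V) = prod_j P(A_j | pa(A_j)) for a full assignment {name: state}."""
        p = 1.0
        for name, node in network.items():
            parent_states = tuple(assignment[q] for q in node["parents"])
            p *= node["cpt"][parent_states][assignment[name]]
        return p

    # Sanity check: the joint probabilities over all assignments sum to 1.
    total = sum(joint(dict(zip("ABC", states)))
                for states in itertools.product((0, 1), repeat=3))
    assert abs(total - 1.0) < 1e-12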
The evidence in a Bayesian network is the set of variables that have been instantiated with values, e.g. E = {A_{e_1} = a_{e_1}, ..., A_{e_c} = a_{e_c}}, e_k ∈ {1, 2, ..., n}. The evidence variables are the observable nodes in a Bayesian network. In an observation, we obtain the real values of these random variables. The observed values are the evidence, also known as belief, given to the Bayesian network. We use the observed values to update the prior (conditional) distributions. This is called evidence absorption. As the evidence variables are probabilistically dependent upon some other random variables, the evidence can be absorbed and propagated according to Bayes' theorem. Propagating the evidence throughout the Bayesian network changes the distributions of other variables accordingly. The variables for which we compute the updated distribution are called query variables. The process of inference involves propagating the evidence and computing the updated distribution of the query variables. There are two categories of inference algorithms: exact inference and approximate inference. Exact inference is proven to be NP hard [32]. The computational complexity of exact inference increases dramatically with the density of the network and the number of states of the random variables.

Traditional exact inference using Bayes' theorem fails for networks with undirected cycles [25]. Most inference methods for networks with undirected cycles convert the network into a cycle-free hypergraph called a junction tree. Figure 2.1 (b) illustrates a junction tree converted from the Bayesian network in Figure 2.1 (a). All undirected cycles in Figure 2.1 (a) are eliminated in Figure 2.1 (b). Each vertex in Figure 2.1 (b) contains multiple random variables. The numbers in each vertex indicate which random variables of the Bayesian network the vertex consists of. Notice that adjacent vertices always share one or more random variables. All junction trees satisfy the running intersection property (RIP) [25]. The RIP requires that the shared random variables of any two vertices in a junction tree appear in all vertices on the path between the two vertices. The RIP ensures that evidence observed at any random variable can be propagated from one vertex to another. For the sake of exploring evidence propagation in a junction tree, we use the following notation to formulate a junction tree. A junction tree is defined as J = (T, P̂), where T represents a tree and P̂ denotes the parameter of the tree. Each vertex C_i, known as a clique of J, is a set of random variables. Assuming C_i and C_j are adjacent, the separator between them is defined as C_i ∩ C_j. P̂ is a group of potential tables. The potential table of C_i, denoted ψ_{C_i}, can be viewed as the joint distribution of the random variables in C_i. For a clique with w variables, each taking r different values, the number of entries in the potential table is r^w.

In a junction tree, exact inference proceeds as follows. Assuming evidence is E = {A_i = a} and A_i ∈ C_Y, E is absorbed at C_Y by instantiating the variable A_i and renormalizing the remaining constituents of the clique. The evidence is then propagated from C_Y to all other cliques. Mathematically, the evidence propagation from clique Y to clique X is represented as [25]:

    \psi_S^* = \sum_{Y \setminus S} \psi_Y^*, \qquad \psi_X^* = \psi_X \, \frac{\psi_S^*}{\psi_S}    (2.1)

where S is a separator between cliques X and Y, and ψ^* denotes an updated potential table. After all cliques are updated, the distribution of a query variable Q ∈ C_y is obtained by summing up all entries with respect to Q = q, for each possible q, in ψ_{C_y}.
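The two operations in Eq. (2.1), marginalizing a clique table onto a separator and rescaling a neighboring clique, can be sketched with multidimensional arrays. This is an illustrative rendering under assumed numpy conventions (one array axis per variable), not the thesis implementation, which stores potential tables as flat arrays (see Chapter 5):

    import numpy as np

    # Minimal sketch of Eq. (2.1) for cliques Y = {a, b} and X = {b, c} with
    # separator S = {b}; all variables binary. Tables are hypothetical.
    rng = np.random.default_rng(0)
    psi_Y = rng.random((2, 2))                 # axes (a, b); Y already updated
    psi_S = np.ones(2)                         # current separator table over b
    psi_X = rng.random((2, 2))                 # axes (b, c)
    psi_X /= psi_X.sum(axis=1, keepdims=True)  # calibrate: X's b-marginal is psi_S

    # psi_S* = sum over Y \ S of psi_Y*: sum out a, keep the separator axis b.
    psi_S_new = psi_Y.sum(axis=0)

    # psi_X* = psi_X * psi_S* / psi_S, broadcasting the separator over axis c.
    psi_X_new = psi_X * (psi_S_new / psi_S)[:, np.newaxis]

    # After the update, X agrees with Y on the shared variable b.
    assert np.allclose(psi_X_new.sum(axis=1), psi_S_new)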
2.2 Sequential Exact Inference

For the sake of completeness, we present a serial exact inference algorithm in Algorithm 2.2. The input consists of a junction tree converted from an arbitrary Bayesian network, the evidence and the query variables. All cliques in the junction tree are numbered according to the breadth first search (BFS) order. The output is the posterior distribution of the query variables.

Input: Junction tree J = (T, P̂), evidence E and query variables Q; BFS order α of the cliques.
Output: Probability distribution of Q.
 1: Every clique absorbs evidence: ψ_{C_i} = ψ_{C_i} δ(E = e), ∀C_i
    {Evidence collection}
 2: for i = N − 1 to 1 by −1 do
 3:   for each C_j ∈ {C_j | pa(C_j) = C_{α_i}} do
 4:     Compute the separator potential table ψ*_S = Σ_{C_j \ C_{α_i}} ψ_{C_j}, where S = C_{α_i} ∩ C_j
 5:     Update clique C_{α_i} using ψ_{C_{α_i}} = ψ_{C_{α_i}} ψ*_S / ψ_S, where ψ_S is the separator table prior to the update
 6:   end for
 7: end for
    {Evidence distribution}
 8: for i = 2 to N do
 9:   Compute the separator potential table ψ*_S = Σ_{C_{pa(α_i)} \ C_{α_i}} ψ_{C_{pa(α_i)}}, where S = C_{α_i} ∩ pa(C_{α_i})
10:   Update ψ_{C_{α_i}} using ψ_{C_{α_i}} = ψ_{C_{α_i}} ψ*_S / ψ_S
11: end for
12: Compute the query Q from ψ_{C_i}, for any C_i with Q ∩ C_i ≠ ∅, using p(Q) = (1/Z) Σ_{C_i \ Q} ψ_{C_i}

Figure 2.2: Sequential Exact Inference

In Algorithm 2.2, Line 1 is evidence absorption, where the evidence E is absorbed by the cliques. Lines 2-7 perform evidence collection, which propagates the evidence from the leaf cliques to the root (bottom up). Lines 8-11 perform evidence distribution, which propagates the evidence from the root to the leaf cliques (top down). Evidence collection and evidence distribution are the two major phases of exact inference; they update the cliques in reverse BFS order and in the original BFS order, respectively. Updating each clique involves a series of computations on potential tables (see Lines 4, 5, 9 and 10 in Algorithm 2.2). Line 12 computes the distribution of the query variables. The parallelism in Lines 1 and 12 is trivial. Therefore, we focus on the parallelism in evidence propagation, i.e., evidence collection and distribution. Figure 2.3 illustrates the first three steps of evidence collection and distribution in a sample junction tree.

Figure 2.3: Illustration of evidence collection and evidence distribution in a junction tree. The cliques in boxes are under processing. The shaded cliques have been processed.

We analyze the computation time for evidence collection and distribution (Lines 2-11) in Algorithm 2.2. Lines 2 and 3 introduce two loops with O(kN) iterations, where k is the maximum number of children of a clique and N is the number of cliques in the junction tree. Line 4 converts a clique potential table ψ_{C_j} into a separator potential table ψ_S. Such a conversion visits each entry in ψ_{C_j} and sums the entries corresponding to the same entry in ψ_S. The correspondence relationship is identified by comparing the states of the random variables in C_j and S. Thus, Line 4 takes O(|ψ_{C_j}| w) time, where |ψ_{C_j}| = \prod_{i=1}^{w_{C_j}} r_i is the size of ψ_{C_j}, and w and w_S are the maximum clique widths for cliques and separators, respectively. Line 5 consists of a multiplication and a division operation, which also takes O(|ψ_{C_j}| w) time. Lines 8-11 perform evidence distribution using a loop with (N − 1) iterations. Lines 9 and 10 correspond to Lines 4 and 5. Thus, the computation complexity is O(|ψ_{C_j}| w) for both Lines 9 and 10. Based on the above analysis, the total computation complexity of Algorithm 2.2 is given by:

    O\left( \sum_{i=1}^{N} k_i \, w_{C_i} \prod_{j=1}^{w_{C_i}} r_j \right) = O(N k w r^w)    (2.2)

where k, w and r are the maximum number of children of a clique, the maximum clique width and the maximum number of states of a random variable, respectively. Eq. (2.2) implies that, when r and w are moderately large, evidence propagation is dominated by r^w.
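The control flow of Algorithm 2.2, collection in reverse BFS order followed by distribution in BFS order, can be sketched on a toy three-clique junction tree. The tree shape, tables and helper names are hypothetical, and for brevity every separator here is the single variable b:

    import numpy as np

    # Toy junction tree: root clique 0 = {a, b} with children 1 = {b, c}
    # and 2 = {b, d}; one numpy axis per variable, all variables binary.
    cliques = {0: ("a", "b"), 1: ("b", "c"), 2: ("b", "d")}
    parent = {1: 0, 2: 0}
    bfs = [0, 1, 2]                       # BFS order alpha of Algorithm 2.2
    rng = np.random.default_rng(1)
    psi = {i: rng.random((2,) * len(v)) for i, v in cliques.items()}
    sep = {1: np.ones(2), 2: np.ones(2)}  # separator tables, initially unity

    def marginal(i, s_var):
        """Sum psi[i] down to separator variable s_var (Lines 4 and 9)."""
        axes = tuple(k for k, v in enumerate(cliques[i]) if v != s_var)
        return psi[i].sum(axis=axes)

    def rescale(i, ratio, s_var):
        """Multiply psi[i] by psi_S*/psi_S along the separator axis (Lines 5, 10)."""
        shape = [ratio.size if v == s_var else 1 for v in cliques[i]]
        psi[i] = psi[i] * ratio.reshape(shape)

    # Evidence collection (Lines 2-7): children update parents, reverse BFS.
    for i in reversed(bfs):
        for j in (c for c, p in parent.items() if p == i):
            new_sep = marginal(j, "b")
            rescale(i, new_sep / sep[j], "b")
            sep[j] = new_sep

    # Evidence distribution (Lines 8-11): parents update children, BFS order.
    for j in bfs[1:]:
        new_sep = marginal(parent[j], "b")
        rescale(j, new_sep / sep[j], "b")
        sep[j] = new_sep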
2.3 Parallel Implementation of Exact Inference

There are several works on parallel exact inference, such as Pennock [32], Kozlov and Singh [23], and Szolovits [33]. However, the performance of some of those methods, such as [23], is dependent upon the structure of the Bayesian network. Others, such as [32], exhibit limited performance for multiple evidence inputs, since the evidence is assumed to be at the root of the junction tree. Rerooting techniques are employed to deal with the cases where the evidence appears at more than one clique. Some other works address certain individual steps of exact inference. Reference [29] discusses the structure conversion from Bayesian networks to junction trees.

An OpenMP based implementation of exact inference is available in Intel's Open-Source Probabilistic Networks Library (PNL) [5], a full function, free, open source graphical model library released under a Berkeley Software Distribution (BSD) style license. PNL provides an implementation of junction tree inference with discrete parameters. The scalability of exact inference using PNL is shown in Figure 2.4. The results in Figure 2.4 were obtained on IBM P655 multiprocessor systems [30]. We can see from Figure 2.4 that, for all three junction trees, the execution time increased when the number of processors was greater than 4. In addition, as far as we know, PNL cannot support some parallel platforms such as the Cell BE.

Figure 2.4: The scalability of exact inference in various junction trees using the PNL library.

2.4 Platforms and Challenges

Multicore processors are designed to process simultaneous threads to achieve high performance. Multicore processors can be homogeneous or heterogeneous. A homogeneous multicore processor (a.k.a. general-purpose multicore processor) consists of identical computing cores, although the memory access mode can be either uniform (UMA) or nonuniform (NUMA). Typical homogeneous multicore architectures include the quad core AMD Opteron and Intel Xeon series processors. For example, an AMD Opteron 2350 (Barcelona) processor has 4 homogeneous cores running at 2.0 GHz, each having an exclusive 64 KB L1 data cache and a 512 KB L2 cache [1]. All the cores share a 2 MB L3 cache, a DDR2 memory controller and a HyperTransport link working at 2 GT/s. A motherboard provides up to 4 sockets for such processors. Each processor can access the shared memory through either its memory controller or HyperTransport, depending on whether the target memory bank is physically connected to the local socket or a remote socket. Thus, the memory access mode for the AMD Opteron 2350 is NUMA. Although an Intel Xeon 5335 (Clovertown) processor also has four cores, unlike the AMD Opteron 2350 there is no cache shared by all of them [6]. Each core in the Intel Xeon 5335 has a separate 32 KB L1 data cache, and every two cores share a 4 MB L2 cache. Because all the cores access the shared memory through the Front Side Bus interface working at 1.33 GT/s, the memory access mode is UMA.

A heterogeneous multicore processor consists of different cores.
For example, the Sony/Toshiba/IBM Cell Broadband Engine (Cell BE) processor consists of a PowerPC element (PPE) coupled with eight independent synergistic processing elements (SPEs) [4]. The PPE is a general-purpose core with a 512 KB L2 cache. The SPEs are in-order, dual-issue, statically scheduled architectures, in which two SIMD instructions can be issued per cycle. Thus, they are not designed to efficiently perform branches during execution. There is no cache on the SPEs. Instead, each SPE has a 256 KB local store, where the user fully controls the data. To allow an SPE to utilize some data, the user must load the data from main memory into the local store via DMA transfers. The cores can also communicate with each other using hardware features such as signals or the mailbox, where the data is transferred through the element interconnect bus (EIB). Since the clock speed of the Cell is 3.2 GHz, its theoretical peak performance is 204.8 GFlops.

The trend in architecture design is to integrate more and more cores onto a single chip to achieve higher performance. Such architectures are known as manycore processors. Examples of existing manycore processors include the Sun UltraSPARC T1 (Niagara) and T2 (Niagara 2) [34], which support up to 32 and 64 concurrent threads, respectively. The NVIDIA Tesla and Tilera TILE64 are also available, and more manycore processors are emerging, such as the Sun Rainbow Falls and IBM Cyclops64. The Sun UltraSPARC T2 (Niagara 2) platform has 8 hardware multithreading cores, each running at 1.4 GHz. Each core supports 8 hardware threads with 2 shared pipelines, so there are 64 hardware threads in total. Each core has its own L1 cache, shared by the threads within the core. The L2 cache size is 4 MB, shared by all hardware threads. For such processors, the question of interest is how many tasks from a DAG can be completed efficiently over a period of time, rather than how quickly an individual task can be completed.

Unlike the above multicore processors, a cluster consists of networked compute nodes with distributed memory, where each compute node has one or more processors. The HPCC cluster at USC is a typical example, which employs a diverse mix of computing and data resources, including 256 dual quad-core AMD Opteron nodes running at 2.3 GHz with 16 GB memory [2]. This machine uses a 10 Gb low-latency Myrinet backbone to connect the compute nodes. The cluster achieved 44.19 teraflops in fall 2008, ranking 7th among supercomputers in an academic setting in the United States. The cluster runs USCLinux, a customized distribution of the RedHat Enterprise Advanced Server 3.0 (RHE3) Linux distribution. The Portable Batch System (PBS) is used to allocate nodes for a job.

There are various challenges in achieving high sustainable performance on parallel computing platforms. To maximize the capacity of the processors or clusters, algorithms must be parallelized or even redesigned to adapt to the specific architecture. For example, on general-purpose multicore processors, the cache based architecture results in relatively high coordination overhead. The contention caused by accessing shared resources increases dramatically as the number of cores increases. This is especially true for manycore processors. In addition, manycore processors generally have limited caches and a simpler core design. The Cell processor has a limited local store and heterogeneous cores, which results in low productivity. For clusters, shared data must be sent through the network by message passing, which leads to significant overhead due to the startup latency in communication.
To maximize the potential of clusters, we must reduce the communication cost, even at the price of moderately increasing the local computation.

Chapter 3
Parallel Conversion of a Bayesian Network into a Junction Tree

Since traditional exact inference using Bayes' theorem fails for networks with undirected cycles, most inference methods convert a Bayesian network into a cycle-free hypergraph called a junction tree. In this chapter, we discuss the parallelization of the Lauritzen-Spiegelhalter algorithm [25], the most popular heuristic for converting an arbitrary Bayesian network into a junction tree.

3.1 Parallel Conversion

Given an arbitrary Bayesian network B = (G, P), where G is the network structure (i.e. a DAG) and P is the set of conditional probability tables, B can be converted into a junction tree J in five steps: moralization, triangulation, clique identification, junction tree construction and potential table construction. In this section, we discuss the parallel algorithms used in our implementation.

Figure 3.1: An example of (a) Bayesian network; (b) Moralized graph; (c) Triangulated graph; (d) Junction tree.

3.1.1 Structure Conversion

In parallel moralization, additional edges are inserted so that all parents of each node are pairwise connected. The input to our parallel moralization algorithm is a DAG G = (V, E). The output is a moralized graph G_m = (V, E_m). The DAG G is represented as an adjacency array. Each processor is in charge of a segment of the adjacency array. Assuming node v_i is assigned to processor p_i, p_i generates the set pa(v_i) of v_i's parents and sends pa(v_i) to the processors to which v_i's parents are assigned. All processors work in parallel. Then, every processor p_i receives a set pa_k from its k-th child. If v_i is not connected to a node v_j ∈ pa_k, they are connected, so that v_i is moralized.

In triangulation, edges are inserted into E_m so that in the moralized graph all cycles of size larger than or equal to 4 have chords. A chord is defined as an edge connecting two nonadjacent nodes of a cycle. The input to our triangulation algorithm is a moralized graph G_m = (V, E_m). The output is a triangulated graph G_t = (V, E_t). E_t is the union of the newly added edges and E_m. An optimal triangulation minimizes the maximal clique width in the resulting graph, which is known to be NP hard [22]. In triangulation, each processor is in charge of a subset of G_m. The cycles of size equal to or larger than 4 are detected and chorded by inserting edges.

In parallel clique identification, the cliques of the junction tree are computed on each processor. The input is a triangulated graph G_t = (V, E_t) and the output is a group of cliques {C_1, ..., C_N}. G_t is also represented as an adjacency array. Each processor obtains a segment of G_t and forms a candidate clique for each node v_i in parallel. Each processor then uses the details of all the cliques to verify whether there exist C_i and C_j satisfying C_i ⊆ C_j. If so, the candidate clique C_i is removed. Each processor performs the candidate clique verification in parallel. The surviving cliques form the output.

Parallel junction tree construction assigns a parent to each clique identified in the previous step. The input to our parallel junction tree construction is a group of cliques {C_1, ..., C_N}. The output is the junction tree structure T, an adjacency array representing the connections between cliques. To satisfy the RIP, we need to compute the union U_j = C_1 ∪ C_2 ∪ ··· ∪ C_{j−1} for each C_j, j = 1, ..., N. We use the well-known parallel technique of pointer jumping [21] to perform this computation in O(log N) time for N cliques. Then, each processor computes in parallel the intersection I_j = C_j ∩ U_j and finds C_i ⊇ I_j from C_1, ..., C_{j−1}, so that C_i can be assigned as the parent of C_j.
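The following sketch simulates the prefix-union step sequentially by recursive doubling and then performs the parent selection just described; the clique contents are hypothetical, and each round of the while loop corresponds to one parallel pointer jumping step:

    # Prefix unions U_j = C_1 ∪ ... ∪ C_{j-1} by recursive doubling, then
    # parent selection. Clique contents are illustrative only.
    cliques = [{0, 1}, {1, 2}, {1, 3}, {3, 4}]     # C_1, ..., C_N
    N = len(cliques)

    prefix = [set(c) for c in cliques]             # prefix[j] -> C_1 ∪ ... ∪ C_{j+1}
    step = 1
    while step < N:
        prefix = [(prefix[j] | prefix[j - step]) if j >= step else prefix[j]
                  for j in range(N)]               # all j update concurrently
        step *= 2

    parent = {}
    for j in range(1, N):
        I_j = cliques[j] & prefix[j - 1]           # I_j = C_j ∩ U_j
        parent[j] = next(i for i in range(j) if I_j <= cliques[i])

    print(parent)                                  # {1: 0, 2: 0, 3: 2}

The doubling loop performs ⌈log2 N⌉ rounds; with one processor responsible for each clique, every round takes constant time per clique, which is the O(log N) bound cited above.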
3.1.2 Potential Table Construction

Potential table construction initializes a POT for each clique of the junction tree using the CPTs of the Bayesian network from which the junction tree is converted, and then converts the initial POTs to chain set POTs. The input to the parallel potential table construction is the Bayesian network B = (G, P) and the structure of the junction tree T. The output is P̂, a group of POTs for the cliques of the junction tree J = (T, P̂).

In parallel potential table construction, each processor is in charge of one or several cliques. The processors work in parallel to identify the nodes A that satisfy A ∪ pa(A) ⊆ C_i for each clique C_i. Then, the CPTs of these identified nodes are multiplied in parallel to produce the initial POT for each clique. Each processor then converts the initial POTs to chain set POTs in parallel. The conversion process is the same as inference in junction trees except for the absence of evidence variables. We use a pointer jumping based technique to parallelize the conversion from initial POTs to chain set POTs.

3.1.3 Complexity Analysis

The analysis of the execution time is based on the Concurrent Read Exclusive Write Parallel Random Access Machine (CREW PRAM) model, i.e., concurrent read access to the same data by different processors is allowed, but concurrent write is not.

The execution time for moralization is O(nk²/p), where n is the number of nodes, k is the maximal number of neighbors in G, and p is the number of processors. Triangulation and clique identification take O(k_m² n + wn²/p) time, where w is the clique width and k_m is the maximal number of neighbors in G_m. The execution time for junction tree construction is O((wN log n + wN²)/p), where N is the number of cliques of the junction tree. Potential table construction takes O(r^w N(w + log N)/p) time, where r is the number of states of a variable. Both moralization and clique identification are scalable over 1 ≤ p ≤ n, while junction tree construction and potential table initialization are scalable over 1 ≤ p ≤ N.

3.2 Experiments

3.2.1 Computing Facilities

We ran our implementations on the DataStar Cluster at the San Diego Supercomputer Center (SDSC) [7] and on the clusters at the USC Center for High-Performance Computing and Communications (HPCC) [2]. The DataStar Cluster at SDSC employs IBM P655 nodes running at 1.5 GHz with 2 GB of memory per processor. This machine uses a Federation interconnect and has a theoretical peak performance of 15 TeraFlops. Furthermore, each node is connected to GPFS (a parallel file system) through a fiber channel. The DataStar Cluster runs Unix with MPICH. IBM LoadLeveler was used to submit jobs to a batch queue. The HPCC cluster is discussed in Chapter 2.4.

3.2.2 Input Bayesian Networks

We experimented with three types of Bayesian networks as input: random, linear and balanced. Each Bayesian network has 1024 × 128 nodes and each node is a binary variable (r = 2). We experimented with 1, 2, 4, 8, 16, 32, 64 and 128 processors for each of the three networks.

Figure 3.2: An example of (a) Random Bayesian network; (b) Linear Bayesian network; (c) Balanced Bayesian network.

The random Bayesian network (Figure 3.2 (a)) was initially generated as a random graph. Representing the initial graph as an adjacency matrix, we modified the matrix to ensure the following constraints: (1) it is an upper-triangular matrix; (2) the graph is connected; (3) the number of 1s in each row and column is limited by a given constant known as the node degree. These constraints lead to a legal Bayesian network structure. Assuming all nodes are binary variables, the CPT for a node A was generated as a table of non-negative floating point numbers with 2 rows and 2^{|pa(A)|} columns, where |pa(A)| is the number of parents of A. In our experiment, the clique width was w = 13.
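The sketch below reproduces this generation procedure as we read it; the edge density and the exact repair policies for constraints (2) and (3) are not spelled out in the text, so those details are assumptions:

    import numpy as np

    def random_bayesian_network(n, degree, rng):
        adj = np.triu(rng.random((n, n)) < 0.1, k=1)   # (1) upper-triangular
        for v in range(1, n):                          # (2) connectivity: give
            if not adj[:v, v].any():                   # parentless nodes one
                adj[rng.integers(0, v), v] = True      # incoming edge
        for v in range(n):                             # (3) cap the number of
            for line in (adj[v, :], adj[:, v]):        # 1s per row and column;
                ones = np.flatnonzero(line)            # a full version would
                line[ones[degree:]] = False            # re-check (2) after this
        return adj

    rng = np.random.default_rng(42)
    adj = random_bayesian_network(64, degree=4, rng=rng)
    # One CPT per node A: 2 rows and 2^{|pa(A)|} columns of non-negative reals.
    cpts = [rng.random((2, 2 ** int(adj[:, v].sum()))) for v in range(64)]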
The random Bayesian network (Figure 3.2 (a)) was initially generated as a random graph. Representing the initial graph as an adjacency matrix, we modified the matrix to ensure the following constraints: (1) it is an upper-triangular matrix; (2) the graph is connected; (3) the number of 1s in each row and column is limited by a given constant known as the node degree. These constraints lead to a legal Bayesian network structure. Assuming all nodes were binary variables, the CPT for a node $A$ was generated as a table of non-negative floating point numbers with 2 rows and $2^{|pa(A)|}$ columns, where $|pa(A)|$ is the number of parents of $A$. In our experiments, the clique width was $w = 13$.

The linear Bayesian network (Figure 3.2 (b)) is a chain: each node except the terminal ones has exactly one incoming edge and one outgoing edge. Since $|pa(v)| \le 1$ for all $v$, and $w = 2$, the CPT size is no larger than 4. The balanced Bayesian network (Figure 3.2 (c)) was generated as a tree, where the in-degree of each node except the root was 1, and the out-degree of each node except the leaves and the last non-leaf node was a constant. Therefore, we had $|pa(v)| \le 1$ and $w = 2$.
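As an illustration of the CPT layout described above (2 rows, one column per combination of parent states), the sketch below generates a random CPT for a binary node with np binary parents and normalizes each column into a conditional distribution. This is a hypothetical generator, not the exact one used in our experiments.

    #include <stdlib.h>

    /* Returns a CPT with 2 rows and 2^np columns, stored column-major as
       cpt[2*col + row]; each column sums to 1, i.e., it is P(A | pa(A)). */
    double *random_cpt(int np)
    {
        long cols = 1L << np;
        double *cpt = malloc(2 * cols * sizeof(double));
        for (long c = 0; c < cols; c++) {
            double p0 = (double)rand() / RAND_MAX;  /* P(A = 0 | column c) */
            cpt[2*c]     = p0;
            cpt[2*c + 1] = 1.0 - p0;
        }
        return cpt;
    }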
Figure 3.4: The execution time for exact inference on HPCC.

3.2.3 Experimental Results

In our experiments, we recorded the execution time for parallel Bayesian network conversion and for potential table construction in junction trees, and added them together to obtain the total execution time. The results from the DataStar Cluster are shown in Figure 3.3 and the results from HPCC in Figure 3.4. We are concerned with the scalability of the implementation. From these figures, we can see that all the individual steps scaled well for all three types of Bayesian networks. The execution time for random networks was higher than for the others because the maximum clique width of the junction tree created from a random Bayesian network was larger. The experimental results confirm the scalability of our implementation: communication does not overshadow computation over a significant range of $p$.

Chapter 4
Pointer Jumping and Rerooting for Exact Inference

Although many junction trees offer data parallelism and structural parallelism (see Chapters 5, 8 and 9), some junction trees offer only limited amounts of both. In this chapter, we utilize the pointer jumping technique for evidence propagation in such junction trees. In contrast to the pointer jumping algorithm for exact inference proposed by Pennock [32], our algorithm allows evidence on multiple cliques and performs evidence propagation in two stages, i.e., evidence collection and evidence distribution. For junction trees with evidence on multiple cliques, Pennock's algorithm must reroot the junction tree with respect to each evidence clique sequentially, which adversely affects the overall performance. In addition, we analyze the relationship between the depth of a junction tree and the number of processors. We also discuss a technique called junction tree rerooting, which can increase the structural parallelism of many junction trees. If the rerooted junction tree still offers insufficient parallelism, the pointer jumping technique should be considered to accelerate exact inference.

4.1 Pointer Jumping Based Evidence Propagation

Unlike traditional exact inference algorithms, which propagate evidence through the separators between adjacent cliques, we derive formulas for propagating evidence between two cliques that are not adjacent to each other. For the sake of simplicity, we first discuss pointer jumping based exact inference in a chain of cliques, and then extend the technique to a junction tree. In addition, by analyzing the speedup of the proposed algorithm, we point out that, for a platform with a given number of processors (cores), the speedup decreases as the depth of the junction tree increases.

4.1.1 Pointer Jumping in a Chain of Cliques

A chain is a special tree without any branches. No structural parallelism is available for chains, since no cliques can be processed in parallel due to the precedence constraints. Serial exact inference requires $N$ steps to update a chain of $N$ cliques. We consider a technique called pointer jumping to accelerate exact inference in chains [32]. Pointer jumping in a chain is illustrated in Figure 4.1 (a). Initially, clique $A_0$ has been updated. Let $pa(A_i)$ denote the parent of $A_i$, i.e., $A_{i-1}$, $0 < i \le N-1$. We propagate the evidence in $A_0$ to the remaining cliques as follows: in the first step, we update each $A_i$ using $pa(A_i)$; then, we update each clique again using $pa^2(A_i) = pa(pa(A_i))$; and so on. After $\log N$ steps, all the cliques are updated, as shown in Figure 4.1 (a).

We derive formulas for propagating evidence in a chain of cliques based on Pennock's algorithm [32]. We denote each clique $C_i = \{S_{i+1,i}, V_i, S_{i,i-1}\}$, where $S_{i+1,i} = C_i \cap C_{i+1}$, $S_{i,i-1} = C_i \cap C_{i-1}$ and $V_i = C_i \setminus \{S_{i+1,i} \cup S_{i,i-1}\}$, $0 \le i < N$. We let $S_{0,-1} = S_{N,N-1} = \emptyset$ and $P(C_i) = P(S_{i+1,i}, V_i, S_{i,i-1}) = \psi_{C_i}$. We rewrite $P(C_i)$ as a conditional distribution:

$$ P(S_{i+1,i}, V_i \mid S_{i,i-1}) = \frac{P(S_{i+1,i}, V_i, S_{i,i-1})}{P(S_{i,i-1})} = \frac{P(C_i)}{\sum_{S_{i+1,i} \cup V_i} P(C_i)} \qquad (4.1) $$

and

$$ P(S_{i+1,i} \mid S_{i,i-1}) = \frac{\sum_{V_i} P(S_{i+1,i}, V_i, S_{i,i-1})}{P(S_{i,i-1})} = \sum_{V_i} \frac{P(C_i)}{\sum_{S_{i+1,i} \cup V_i} P(C_i)} \qquad (4.2) $$

Figure 4.1: Pointer jumping for exact inference: (a) pointer jumping in a chain; (b) pointer jumping in a tree.

According to Eqs. (4.1) and (4.2), after the first step of pointer jumping, we have:

$$ P(S_{i+1,i}, V_i \mid S_{i-1,i-2}) = \sum_{S_{i,i-1}} P(S_{i+1,i}, V_i \mid S_{i,i-1}) \cdot P(S_{i,i-1} \mid S_{i-1,i-2}) \qquad (4.3) $$

and $P(S_{i+1,i} \mid S_{i-1,i-2}) = \sum_{V_i} P(S_{i+1,i}, V_i \mid S_{i-1,i-2})$. Thus, after the $k$-th step, $1 \le k \le \log N$, we have:

$$ P(S_{i+1,i}, V_i \mid S_{i-2^{k-1}, i-2^{k-1}-1}) = \sum_{S_{i-2^{k-2}, i-2^{k-2}-1}} P(S_{i+1,i}, V_i \mid S_{i-2^{k-2}, i-2^{k-2}-1}) \cdot P(S_{i-2^{k-2}, i-2^{k-2}-1} \mid S_{i-2^{k-1}, i-2^{k-1}-1}) \qquad (4.4) $$

$$ P(S_{i+1,i} \mid S_{i-2^{k-1}, i-2^{k-1}-1}) = \sum_{V_i} P(S_{i+1,i}, V_i \mid S_{i-2^{k-1}, i-2^{k-1}-1}) \qquad (4.5) $$

Performing Eqs. (4.4) and (4.5) at each step, we propagate evidence throughout all the cliques.

4.1.2 Pointer Jumping in a Junction Tree

Performing pointer jumping in a junction tree is equivalent to performing pointer jumping on all the leaf-root paths of the junction tree (Figure 4.1 (b)). Evidence is first propagated from the leaf cliques to the root (evidence collection) and then from the root back to the leaf cliques (evidence distribution) along each path. When evidence collection is performed using pointer jumping, a clique must be updated sequentially with respect to its children. Algorithm 4.2 shows the pointer jumping based exact inference algorithm for a junction tree. Given an input junction tree $J$, let $ch(C_i)$ denote the set of children of $C_i \in J$, i.e., $ch(C_i) = \{C_j \mid pa(C_j) = C_i\}$. We define $ch^k(C_i) = \{C_j \mid pa^k(C_j) = C_i\}$ as the children of $C_i$ in step $k$, where $k \ge 1$ and $pa^k(C_j) = pa(pa^{k-1}(C_j)) = pa(\cdots(pa(C_j)))$ is the parent of $C_j$ in step $k$. We denote by $D$ the depth of the junction tree $J$; the depth of a tree is also known as its height.
Input: junction tree $J$, depth $D$
Output: updated $J$
{evidence collection}
1: for $k = 0, 1, \cdots, \log(D+1) - 1$ do
{update cliques concurrently}
2: for $C_i \in J$ pardo
{each clique is updated by the cliques in $ch^{2^k}(C_i)$ sequentially}
3: for all $C_j \in ch^{2^k}(C_i)$ do
4: update $C_i$ using $C_j$ according to Eqs. (4.4) and (4.5)
5: end for
6: end for
7: end for
{evidence distribution}
8: for $k = 0, 1, \cdots, \log(D+1) - 1$ do
9: for $C_i \in J$ pardo
10: $C_j = pa^{2^k}(C_i)$
11: update $C_i$ using $C_j$ according to Eqs. (4.4) and (4.5)
12: end for
13: end for
Figure 4.2: Pointer jumping based exact inference in a junction tree

Before we analyze the complexity of Algorithm 4.2, we prove the following lemma, where we define the depth of a tree as the length of the longest path from the root to a leaf, i.e., the critical path of the tree.

Lemma 1: Consider pointer jumping in a tree consisting of $N$ nodes. If the depth of the tree is $D$, then the maximum number of children of any node at any step is $(N - D)$.

Proof: We prove the lemma by contradiction. Assume a node $A$ in the tree has $(N - D + 1)$ children at some step of pointer jumping. Since the depth is $D$, there are $(D + 1)$ nodes on the critical path of the tree, and $A$ is either on the critical path or not. If $A$ is on the critical path, there are at least $(N - D)$ children not on the critical path; thus, there are at least $(D+1) + (N-D) = N + 1$ nodes in the tree, which contradicts the fact that the tree has only $N$ nodes. Hence $A$ cannot be on the critical path. If $A$ is not on the critical path, then there are at least $(D+1) + 1 + (N-D+1) = N + 3$ nodes in the tree; thus $A$ cannot be off the critical path either. Therefore, a node with $(N - D + 1)$ children at any step of pointer jumping does not exist; similarly, a node with $(N - D + i)$ children, $i \ge 1$, does not exist. Therefore, the maximum number of children of any node at any step is $(N - D)$.

We use the concurrent read exclusive write parallel random access machine (CREW PRAM) model [21] to analyze Algorithm 4.2. Lines 1-7 and Lines 8-13 perform evidence collection and distribution, respectively; each takes $\log(D+1)$ iterations. Lines 2 and 9 are parallel for-loops, in which the cliques are assigned to the processors in a round robin fashion; note that we bind a separate thread to each processor. Given $P$ processors and $N$ cliques, each processor updates $N/P$ cliques, $1 \le P \le N$. Lines 4 and 11 update potential tables; for the sake of simplicity, we assume that all cliques have the same size, so updating a clique takes constant time. Line 3 shows that each clique must be updated sequentially with respect to all of its children in step $k$. According to Lemma 1, since the maximum number of children of a clique is $(N - D)$, the overall complexity of Algorithm 4.2 is:

$$ O\big((N - D) \cdot N \cdot \log(D+1) / P\big) \qquad (4.6) $$

We discuss this complexity in two cases. (1) If the depth of the junction tree is small, i.e., $1 \le D < C \ll N$ for some constant $C$, the overall complexity becomes $O(N^2/P)$. Since $P \le N$, comparing with the $O(N)$ complexity of serial exact inference, we see that Algorithm 4.2 cannot accelerate evidence propagation in such junction trees. A special case is the star graph, where $D = 1$: according to Eq. (4.6), we have $O((N-1) \cdot N \cdot \log(1+1)/P) = O(N-1) = O(N)$. Note that such junction trees actually offer ample structural parallelism; we therefore propose the algorithms in Chapters 6 and 9 to process them.

(2) If the depth of the junction tree is large, i.e., $1 \ll D \approx N - 1$, the overall complexity is $O(N \log(D+1)/P)$. We call such junction trees slim trees. A special case of a slim tree is a chain, where $D = N - 1$: according to Eq. (4.6), we have $O((N - (N-1)) \cdot N \cdot \log((N-1)+1)/P) = O(N \log N / P)$. When sufficiently many processors are available ($P > \log N$), this is superior to the $O(N)$ complexity of serial exact inference; therefore, for slim trees, Algorithm 4.2 can accelerate evidence propagation.
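To make the schedule of Algorithm 4.2 concrete for the chain case, the following serial sketch performs the pointer-jumping rounds, with the clique update of Eqs. (4.4) and (4.5) abstracted as a callback; in the parallel version, the two inner loops are split across the $P$ processors. The names are illustrative.

    #include <stdlib.h>
    #include <string.h>

    /* Pointer jumping over a chain of n cliques.  pa[i] is the current
       parent pointer of clique i (initially i-1, with pa[0] = -1, since
       clique 0 already holds the evidence).  update(i, j) stands for
       updating clique i using clique j via Eqs. (4.4) and (4.5).  After
       about log2(n) rounds, every clique has been fully updated. */
    void jump_chain(int n, int *pa, void (*update)(int, int))
    {
        int *next = malloc(n * sizeof(int));
        int active = 1;
        while (active) {
            active = 0;
            for (int i = 0; i < n; i++)              /* update round       */
                if (pa[i] >= 0) update(i, pa[i]);
            for (int i = 0; i < n; i++) {            /* jump: pa <- pa(pa) */
                next[i] = (pa[i] >= 0) ? pa[pa[i]] : -1;
                if (next[i] >= 0) active = 1;
            }
            memcpy(pa, next, n * sizeof(int));
        }
        free(next);
    }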
However, according to case (2), we cannot guarantee that Algorithm 4.2 is suitable for all slim trees. For example, if $P = N$, the overall complexity becomes $O(\log(D+1))$, which outperforms the serial exact inference discussed in Chapter 2.2; but if $P = 1$, the complexity is $O(N\log(D+1))$, which is greater than $O(N)$, the complexity of the sequential exact inference algorithm.

In the following subsection, we analyze the speedup of Algorithm 4.2 with respect to the serial exact inference discussed in Chapter 2.2. Based on this analysis, we can determine whether Algorithm 4.2 is suitable for a given junction tree on a platform with a given number of processors (cores).

4.1.3 Speedup Analysis

For the sake of simplicity, we first discuss the speedup for a chain of cliques. Note that any junction tree can be viewed as a chain of hyper nodes, if we view the cliques at the same level with respect to the root as one hyper node. Assume that we have a chain of $N$ cliques, each of the same size, and that we start from $C_0$ to update the remaining $(N-1)$ cliques. Let $t$ denote the execution time for updating a single clique. The sequential execution time for updating the chain, denoted $t_s$, is then given by:

$$ t_s = (N-1) \cdot t \qquad (4.7) $$

Using pointer jumping, we update the chain in $\log N$ steps. In the first step, each clique except $C_0$ is updated in parallel; thus, with $P$ processors (we assume $P \le N$), the parallel execution time is $(N-1) \cdot t / P$. In the second step, since both $C_0$ and $C_1$ have already been updated, the parallel execution time is $(N-2) \cdot t / P$. In general, the parallel execution time of the $j$-th step is:

$$ t_p^{(j)} = (N - 2^{j-1}) \cdot t / P \qquad (4.8) $$

Therefore, the overall parallel execution time is given by:

$$ t_p = \sum_{j=1}^{\log N} t_p^{(j)} = \sum_{j=1}^{\log N} (N - 2^{j-1}) \cdot t / P = (N\log N - N + 1) \cdot t / P \qquad (4.9) $$

According to Eqs. (4.7) and (4.9), the speedup is:

$$ Sp(N, P) = t_s / t_p = \frac{(N-1) \cdot P}{N\log N - N + 1} \qquad (4.10) $$
$$ = \frac{(1 - 1/N) \cdot P}{\log N - 1 + 1/N} \qquad (4.11) $$

When $N \le P$, we can use only $N$ processors; thus $Sp(N, P) = (N^2 - N)/(N\log N - N + 1) \approx N/\log N$. When $N$ is much greater than $P$, we have $\lim_{N\to\infty} Sp(N, P) = 0$. We now show that the speedup $Sp(N, P)$ is monotonically decreasing in $N$ when $P \le N$ is fixed:

$$ \frac{\partial\, Sp(N,P)}{\partial N} = \frac{P \cdot (N\log N - N + 1) - P(N-1) \cdot \log N}{(N\log N - N + 1)^2} = \frac{(\log N + 1) - N}{(N\log N - N + 1)^2} \cdot P < 0 \qquad (4.12) $$

Therefore, the maximum speedup is achieved when $N = P$; as $N$ increases beyond $P$, the speedup monotonically decreases.

Using Eq. (4.10), we also conclude that, to achieve speedup using pointer jumping, the range of $N$ is $P \le N < \tilde{N}$, where $\tilde{N}$ satisfies $P(\tilde{N} - 1) = \tilde{N}\log\tilde{N} - \tilde{N} + 1$. There is no explicit root for $\tilde{N}$, since the equation is transcendental; when $N \gg P$, the approximate solution is $\tilde{N} = 2^{P+1}$.

Claim: Given $P$ processors with a shared memory, the pointer jumping based exact inference algorithm shown in Algorithm 4.2 accelerates evidence propagation for chains with fewer than $2^{P+1}$ cliques; however, for chains with more than $2^{P+1}$ cliques, the serial exact inference algorithm shown in Algorithm 2.2 leads to a lower execution time.

The above claim implies a range of $P$, with respect to a given chain, in which the pointer jumping based exact inference algorithm leads to a lower execution time than the serial exact inference algorithm. For the slim trees defined in Section 4.1.2, if the depth of the tree is less than $2^{P+1}$, the claim gives an approximate estimate of the range in which Algorithm 4.2 shows superior performance compared with the serial exact inference algorithm. Note that the range for the number of cliques (or the depth) in the above claim can shrink if the workload is not balanced across the processors during exact inference.
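The threshold in the claim can be checked numerically. The short program below evaluates Eq. (4.10) and finds the largest chain length for which $Sp(N, P) > 1$; it assumes base-2 logarithms and is purely illustrative.

    #include <math.h>
    #include <stdio.h>

    /* Sp(N, P) = (N - 1) P / (N log N - N + 1), Eq. (4.10). */
    static double speedup(double n, double p)
    {
        return (n - 1.0) * p / (n * log2(n) - n + 1.0);
    }

    int main(void)
    {
        int P = 8;
        long n = P;
        /* Sp is decreasing for N >= P, so scan up to the break-even point;
           the analysis predicts roughly N_tilde = 2^(P+1) when N >> P. */
        while (speedup((double)n, (double)P) > 1.0)
            n++;
        printf("P = %d: pointer jumping wins for N < %ld (2^(P+1) = %ld)\n",
               P, n, 1L << (P + 1));
        return 0;
    }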
4.1.4 Experiments

We conducted experiments on a server with two AMD Opteron 6164 HE processors (Magny-Cours). Each processor has a 12 MB L3 cache and 12 cores running at 1.70 GHz; all the cores share 64 GB of DDR3 memory. Each core is viewed as a processor in the analysis above. The operating system on the server is CentOS release 5.3. We implemented the pointer jumping based exact inference using Pthreads.

We used a dynamic scheduling based exact inference implementation as the baseline. The baseline method uses the collaborative scheduler proposed in [40], which schedules tasks in a work sharing approach: once a clique is ready to be updated, i.e., all of its preceding cliques have been processed, the scheduler allocates the clique to the thread with the lightest workload, keeping the workload as balanced as possible across the threads. The details of the baseline method are discussed in Chapter 9.1.

We evaluated the performance of pointer jumping based exact inference using junction trees with various parameters. We used three types of topology: a chain, a balanced tree and a slim tree. Specifically, we generated the slim trees as follows: given a constant $b \le N/4$, we built a balanced binary tree of $N/2$ cliques, generated $b$ chains of $N/(2b)$ cliques each, and connected each chain to a leaf of the tree. For each type of topology, we experimented with various numbers of cliques and various clique widths: $N$ between 64 and 4096, clique width $W_C$ between 8 and 12, and clique degree $d$ between 1 and 4. All random variables are binary, i.e., $r = 2$.

Figure 4.3: Comparison between pointer jumping and scheduling based exact inference in chains: (a) pointer jumping based exact inference; (b) scheduling based exact inference.

Figure 4.4: Comparison between pointer jumping and scheduling based exact inference in slim trees: (a) pointer jumping based exact inference; (b) scheduling based exact inference.

Figure 4.5: Comparison between pointer jumping and scheduling based exact inference in balanced trees: (a) pointer jumping based exact inference; (b) scheduling based exact inference.

In Figures 4.3, 4.4 and 4.5, we illustrate the scalability of pointer jumping based exact inference compared with scheduling based exact inference on the various junction trees. In Figure 4.3, scheduling based exact inference shows no scalability, since a chain offers no structural parallelism; the pointer jumping based method, however, worked very well. Note that, on a single core, the pointer jumping based method took much more time, due to the additional workload introduced by pointer jumping. In Figure 4.4, the scalability of the scheduling based method improves as $b$ increases, since a larger $b$ results in higher structural parallelism; pointer jumping is not affected by the value of $b$, because although a larger $b$ reduces the depth, the number of cliques at each level increases. Both methods showed scalability on balanced trees (Figure 4.5), but the execution times differed because of the additional workload of the pointer jumping based method.

Figure 4.6: Speedup of pointer jumping based exact inference with respect to the serial implementation.

Since pointer jumping based exact inference involves additional work, it may not outperform the scheduling based method. In Figure 4.6, we show the speedup of pointer jumping based exact inference with respect to the serial implementation, for various numbers of cores. As we can see, the speedup increased quickly while $N$ was smaller than $P$, and then gradually decreased. When $P$ is small, we observed speedups lower than 1. These observations are consistent with our analysis in Chapter 4.1.3: speedup is achieved when the number of cliques in a chain (or the depth of a tree) lies in the range $[1, 2^{P+1})$; otherwise, the performance of pointer jumping can be adversely affected.
4.2 Junction Tree Rerooting

As discussed in Chapter 4.1.1, if the depth of the input junction tree is much greater than the number of processors, the performance of pointer jumping based exact inference can be adversely affected. In this section, we propose an efficient method to minimize the critical path of a given junction tree.

4.2.1 Minimizing the Critical Path by Rerooting

A junction tree can be rerooted at any clique [32]. Consider rerooting a junction tree at clique $C$. Let $\alpha$ be a preorder walk of the underlying undirected tree, starting from $C$. Then $\alpha$ encodes the desired directions of the edges: by desired direction, we mean that an edge can point from $C_{\alpha_i}$ to $C_{\alpha_j}$ if and only if $\alpha_i < \alpha_j$. In the rerooting procedure, we check the edges of the given junction tree and reverse any edge inconsistent with $\alpha$. The result is a new junction tree rooted at $C$, with the same underlying undirected topology as the original tree.

Rerooting a junction tree at certain cliques leads to an acceleration of evidence propagation on parallel computing systems. For an arbitrary junction tree, the critical path (CP) is defined as the longest weighted path from the root to a leaf clique, where the path weight is the sum of the computational complexities of the cliques in the path. Rerooting a junction tree at various cliques can result in different critical paths; the root with the minimum critical path leads to the fastest evidence propagation on platforms with sufficiently many parallel processing units. For example, assuming each clique in Figure 4.7 (a) has a unit of computational complexity, the complexity of the critical path is 5, since there are 5 cliques in the longest path. However, if we reroot the junction tree at Clique 2 (Figure 4.7 (b)), the complexity of the critical path becomes 4. If enough parallel processing units are available, we can update the cliques at the same level in parallel: Cliques 2 and 3 in Figure 4.7 (a) can be updated in parallel, as can Cliques 0, 3, 6 and 7 in Figure 4.7 (b). Thus, when a sufficient number of parallel processing units is available, evidence propagation in Figure 4.7 (b) is faster than in Figure 4.7 (a).

Figure 4.7: (a) A sample junction tree where Clique 0 is the root; the length of the critical path is 5. (b) The junction tree after rerooting, where Clique 2 is the root; the length of the critical path is 4.

4.2.2 A Straightforward Rerooting Method

We introduce several notations for junction tree rerooting. According to Eq. (2.2) and the analysis of Algorithm 2.2, the clique complexity $L_{C_i}$ of $C_i$ is given by:

$$ L_{C_i} = k_i\, w_{C_i} \prod_{j=1}^{w_{C_i}} r_j \qquad (4.13) $$

where $w_{C_i}$ is the clique width, $k_i$ is the number of children of $C_i$, and $r_j$ is the number of states of the $j$-th random variable in clique $C_i$. Let $P(C_i, C_j) = C_i, C_{i_1}, C_{i_2}, \ldots, C_j$ denote a path in a junction tree, starting at clique $C_i$ and terminating at $C_j$. Based on Eq. (4.13), the path complexity of $P(C_i, C_j)$, denoted $L_{(C_i,C_j)}$, is defined as the sum of the clique complexities of all the cliques in the path, that is,

$$ L_{(C_i,C_j)} = \sum_{C_t \in P(C_i,C_j)} L_{C_t} = \sum_{C_t \in P(C_i,C_j)} k_t\, w_{C_t} \prod_{l=1}^{w_{C_t}} r_l \qquad (4.14) $$

where $w_{C_t}$ is the clique width and $r_l$ is the number of states of the $l$-th random variable in clique $C_t$.
As a comparison with the efficient junction tree rerooting method of the next section, we present a straightforward rerooting method in Algorithm 4.8.

Input: junction tree $J$ with $N$ cliques
Output: new root $C_r$
1: for all $C_i \in J$, $\forall i = 1, 2, \cdots, N$ do
2: let $\alpha = (\alpha_1, \alpha_2, \cdots, \alpha_N)$ be a preorder walk of the underlying undirected tree of $J$ starting at $C_i$
3: for $j = N$ to 1 do
4: $W^{(i)}_{\alpha_j} = L_{C_{\alpha_j}} + \max_k W^{(i)}_{\alpha_k}$, $\forall C_{\alpha_k} \in Child(C_{\alpha_j})$
5: end for
6: end for
7: select the new root $C_r$ s.t. $W^{(r)}_{\alpha_1} = \min_i W^{(i)}_{\alpha_1}$, $\forall i = 1, 2, \cdots, N$
Figure 4.8: Straightforward root selection for minimizing the critical path

Line 2 obtains $\alpha$ by performing a breadth first search (BFS) starting at $C_i$ in the underlying undirected tree; $\alpha$ encodes the desired new edge directions of a junction tree rooted at $C_i$. Lines 3-5 compute the path complexity of the critical path in each rerooted tree, where $W^{(i)}_{\alpha_j}$ denotes the path complexity of a critical path in the subtree rooted at $C_{\alpha_j}$; thus, $W^{(i)}_{\alpha_1}$ gives the path complexity of a critical path of the entire tree. Notice that, when $C_{\alpha_j}$ is a leaf clique, $\max_k W^{(i)}_{\alpha_k}$ returns 0. Line 7 selects the clique corresponding to the rerooted junction tree with the minimum critical path.

We briefly analyze the serial complexity of Algorithm 4.8. Both for-loops involve $N$ iterations. Line 2 performs BFS in $O(N)$ time, and Line 4 takes $O(w_{C_{\alpha_j}})$ time to compute $L_{C_{\alpha_j}}$ and $O(d_{C_{\alpha_j}})$ time to compute $\max_k W^{(i)}_{\alpha_k}$, where $d_{C_{\alpha_j}}$ is the number of children of $C_{\alpha_j}$; note that $d_{C_{\alpha_j}} \le w_{C_{\alpha_j}}$ for any junction tree. Line 7 takes $N$ comparisons to select the new root. Therefore, the computational complexity of Algorithm 4.8 is $O(N^2 w_C)$. When $N$ is large, the straightforward rerooting method is not efficient.

4.2.3 An Efficient Junction Tree Rerooting Method

The key step of junction tree rerooting is to select as the new root a clique that leads to a junction tree with the minimum critical path. Let $P(C_x, C_y)$ denote a path from $C_x$ to $C_y$, and $L_{(C_x,C_y)}$ the sum of the clique complexities of all the cliques in the path $P(C_x, C_y)$ (see Eq. (4.14)). We develop an efficient root selection algorithm based on the following lemma:

Lemma 1: Suppose that $P(C_x, C_y)$ is the longest weighted path in a given junction tree, where $C_x$ and $C_y$ are leaf cliques and $L_{(C_r,C_x)} \ge L_{(C_r,C_y)}$, where $C_r$ is the root. Then $P(C_r, C_x)$ is a critical path of the given junction tree.

Proof Sketch: Assume the critical path is $P(C_r, C_z)$ with $C_z \ne C_x$. Let $P(C_r, C_{b_1})$ denote the longest common subpath of $P(C_r, C_x)$ and $P(C_r, C_y)$, where a common subpath means that $\forall\, C_i \in P(C_r, C_{b_1})$ we have $C_i \in P(C_r, C_x)$ and $C_i \in P(C_r, C_y)$. Let $P(C_r, C_{b_2})$ be the longest common subpath of $P(C_r, C_x)$ and $P(C_r, C_z)$. Without loss of generality, we assume $C_{b_2} \in P(C_r, C_{b_1})$. Since $P(C_r, C_z)$ is a critical path, we have $L_{(C_r,C_z)} \ge L_{(C_r,C_x)}$. Because $L_{(C_r,C_z)} = L_{(C_r,C_{b_2})} + L_{(C_{b_2},C_z)}$ and $L_{(C_r,C_x)} = L_{(C_r,C_{b_2})} + L_{(C_{b_2},C_{b_1})} + L_{(C_{b_1},C_x)}$, we have $L_{(C_{b_2},C_z)} \ge L_{(C_{b_2},C_{b_1})} + L_{(C_{b_1},C_x)} > L_{(C_{b_1},C_x)}$.
Thus, we can find pathP(C z ,C y ) = P(C z ,C b2 )P(C b2 ,C b1 )P(C b1 ,C y ), which leadsto: L (Cz,Cy) =L (Cz,C b2 ) +L (C b2 ,C b1 ) +L (C b1 ,Cy) >L (C b1 ,Cx) +L (C b2 ,C b1 ) +L (C b1 ,Cy) >L (Cx,C b1 ) +L (C b1 ,Cy) =L (Cx,Cy) (4.15) TheaboveinequalitycontradictstheassumptionthatP(C x ,C y )isthelongestweighted pathinthegivenjunctiontree. Thus,P(C r ,C x )mustbeacriticalpath. Input: JunctiontreeJ Output: NewrootC r 1: initializeatuplehv i ,p i ,q i i =hk i w C i Q w C i j=1 r j ,0,0iforeachC i inJ 2: fori =N downto1do 3: p i = arg j max(v j ),∀pa(C j ) =C i 4: q i = arg j max(v j ),∀pa(C j ) =C i andj 6=p i 5: v i =v i +v p i 6: endfor 7: selectC m wherem = arg i max(v i +v q i ),∀i 8: initializepathP ={C m };i =m 9: whileC i isnotaleafcliquedo 10: i =p i ;P ={C i }∪P 11: endwhile 12: P =P ∪C qm ;i =m 13: whileC i isnotaleafnodedo 14: i =p i ;P =P ∪{C i } 15: endwhile 16: denoteC x andC y thetwoendcliquesofpathP 17: selectnewrootC r = arg i min|L (Cx,C i ) −L (C i ,Cy) |∀C i ∈P(C x ,C y ) Figure4.9: Rootselectionforminimizingcriticalpath According to Lemma 1, the new root can be found once we identify the longest weighted path between two leaves in the given junction tree. We introduce a tuple hv i ,p i ,q i i for each cliqueC i to find the longest weighted leaf-to-leaf path (Lines 1-6, Algorithm 4.9), wherev i denotes the path complexity of a critical path of the subtree 44 rooted atC i ;p i is the index ofC p i , a child ofC i ; andq i is the index of another child of C i . The path fromC p i to a certain leaf clique in the subtree rooted atC i is the longest weighted path among all paths from a child ofC i to a leaf clique, while the path from C q i to a certain leaf clique in the subtree rooted atC i is the second longest weighted path. The two paths are concatenated atC i and form a leaf-to-leaf path in the original junction tree. In Algorithm 4.9, the tuples are initialized in Line 1 and updated in the for-loop(Lines2-6). Noticethattheindicesofcliquesareconsistentwiththepreorder walk of the tree starting from the root, i.e. an edge points from C i toC j if and only if i < j. In Lines 3 and 4, arg j max(v j ) stands for the value of the given argument (parameter)j forwhichthevalueofthegivenexpressionv j attainsitsmaximumvalue. In Line 7, we detect a cliqueC m on the longest weighted path and identify the path in Lines8-15accordingly. ThenewrootisthenselectedinLine17. We analyze the serial complexity of Algorithm 4.9. Line 1 takes O(w C N) com- putation time for initialization, wherew C is the clique width andN is the number of cliques. The loop in Line 2 has N iterations. Lines 3 and 4 each take O(k) time, wherek is the maximum number of children of a clique. Line 7 takesO(N) time, as do Lines 8-15, since a path consists of at mostN cliques. Lines 16-17 can be com- pleted in O(N) time. Since k < w C , the complexity of Algorithm 4.9 is O(w C N), comparedtoO(w C N 2 ),thecomplexityofthestraightforwardapproach. Algorithm4.9canbeparallelizedusingtechniquessuchaspointerjumping. How- ever, the parallelized version involves many branches and limited computation. Due tothe branchoverheadinSPEs, we perform rerooting sequentiallyonthe PPE.When r and w C are large, the execution time for rerooting is very small compared to the evidencepropagationdiscussedinSections7.1.3.1and7.1.3.2. Wewouldliketoemphasizethat,althoughjunctiontreererootingcanprovidemore parallelism for exact inference, it is an optional step for parallel exact inference. 
Chapter 5
Data Parallelism in Exact Inference

5.1 Potential Table Organization

5.1.1 Potential Table Representation

Each node of a junction tree denotes a clique, i.e., a set of random variables. For each clique $C$, there is a potential function $\psi_C$ that describes the joint distribution of the random variables in the clique [25]. The discrete form of the potential function is called a potential table. A potential table is a list of non-negative real numbers, each corresponding to a probability of the joint distribution. The straightforward representation of a potential table stores the state string along with each entry [30]; in such a representation, a potential value can be stored in any entry of the potential table. However, for large potential tables, this representation occupies a large amount of memory, and frequently checking the state string of an entry adversely affects performance. To enhance the performance of the computation, we carefully organize the potential tables. The advantages of our representation are: 1) a reduced memory requirement; and 2) direct access to potential values based on the mapping relationship between indices and state strings.

We define some terms to explain the potential table organization. We assign an order to the random variables in a clique to form a variable vector; we discuss how this order is determined in Section 5.1.2. For a clique $C$, the variable vector is denoted $V_C$. A combination of states of the variables in the variable vector forms a state string, denoted $S = (s_1 s_2 \cdots s_w)$, where $w$ is the clique width and $s_i \in \{0, 1, \cdots, r_i - 1\}$ is the state of the $i$-th variable in the variable vector. Thus, there are $\prod_{i=1}^{w} r_i$ state strings with respect to $V_C$; and since each state string has a corresponding potential (probability), we need an array of $\prod_{i=1}^{w} r_i$ entries to store the potential table.

Traditionally, the state strings are stored with the potential table, because both the state strings and the potentials are required in the computation of evidence propagation. As each potential corresponds to a state string, this needs $O(w \prod_{i=1}^{w} r_i)$ memory to store the state strings, $w$ times more than storing the potential table itself. In addition, this approach leads to large message sizes in the communication among processors.

We optimize the potential table representation by exploiting the relationship between array indices and state strings. That is, we encode a given state string $S = (s_1 s_2 \cdots s_w)$, where $s_i \in \{0, 1, \cdots, r_i - 1\}$, as an integer $t \in \{1, 2, \cdots, \prod_{i=1}^{w} r_i\}$:

$$ t = 1 + \sum_{i=1}^{w} s_i \prod_{j=1}^{i-1} r_j \qquad (5.1) $$

The formula that maps an index $t$ back to a state string $S = (s_1 s_2 \cdots s_w)$, $1 \le i \le w$, is given by:

$$ s_i = \left\lfloor \frac{t-1}{\prod_{j=1}^{i-1} r_j} \right\rfloor \,\%\, r_i \qquad (5.2) $$

where $\%$ is the modulo operator.
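The two mappings translate directly into code. The sketch below uses 0-based indices (i.e., $t - 1$ of Eqs. (5.1) and (5.2)), as is natural in C; it is a direct transcription, with names chosen for illustration.

    /* Encode a state string s[0..w-1] as a 0-based table index: this is
       Eq. (5.1) with the leading 1 dropped.  r[i] is the number of states
       of the i-th variable in the variable vector. */
    long encode(int w, const int *r, const int *s)
    {
        long t = 0, stride = 1;
        for (int i = 0; i < w; i++) {
            t += (long)s[i] * stride;   /* s_i * prod_{j<i} r_j */
            stride *= r[i];
        }
        return t;
    }

    /* Decode a 0-based index t back into a state string: Eq. (5.2). */
    void decode(int w, const int *r, long t, int *s)
    {
        for (int i = 0; i < w; i++) {
            s[i] = (int)(t % r[i]);     /* floor(t / prod_{j<i} r_j) % r_i */
            t /= r[i];
        }
    }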
To demonstrate the correctness of Eq. (5.1), we briefly prove that, for an arbitrary state string $S$, $t$ is a legal index of the potential table, i.e., $t$ is an integer with $1 \le t \le \prod_{i=1}^{w} r_i$. Because the $s_i$ and $r_j$ are integers for all $i$ and $j$, $t$ is also an integer; and since the $s_i$ and $r_j$ are nonnegative, $\sum_{i=1}^{w} s_i \prod_{j=1}^{i-1} r_j$ is nonnegative, so $t \ge 1$ is satisfied. To prove $t \le \prod_{i=1}^{w} r_i$, notice that $s_i \in \{0, 1, \cdots, r_i - 1\}$, i.e., $s_i \le r_i - 1$, so the sum in Eq. (5.1) telescopes:

$$ t = 1 + \sum_{i=1}^{w} s_i \prod_{j=1}^{i-1} r_j \le 1 + \sum_{i=1}^{w} (r_i - 1)\prod_{j=1}^{i-1} r_j = 1 + \sum_{i=1}^{w} \left( \prod_{j=1}^{i} r_j - \prod_{j=1}^{i-1} r_j \right) = \prod_{i=1}^{w} r_i \qquad (5.3) $$

Therefore, for any given state string $S$, Eq. (5.1) always maps $S$ to a legal index of the potential table.

To demonstrate the correctness of Eq. (5.2), we prove that, using the expression for $t$ given in Eq. (5.1), $\lfloor (t-1)/\prod_{j=1}^{i-1} r_j \rfloor \,\%\, r_i$ in Eq. (5.2) yields $s_i$, the $i$-th element of $S$; recall that $\lfloor x \rfloor$ rounds $x$ down and $\%$ is the modulo operator. We have:

$$ \left\lfloor \frac{t-1}{\prod_{j=1}^{i-1} r_j} \right\rfloor \%\, r_i = \left\lfloor \frac{\sum_{k=1}^{w} s_k \prod_{j=1}^{k-1} r_j}{\prod_{j=1}^{i-1} r_j} \right\rfloor \%\, r_i = \left( \sum_{k=i}^{w} s_k \prod_{j=i}^{k-1} r_j + \left\lfloor \sum_{k=1}^{i-1} \frac{s_k}{\prod_{j=k}^{i-1} r_j} \right\rfloor \right) \%\, r_i = \left( \sum_{k=i+1}^{w} s_k \prod_{j=i}^{k-1} r_j + s_i \right) \%\, r_i = s_i \qquad (5.4) $$

In Eq. (5.4), note that $\sum_{k=i}^{w} s_k \prod_{j=i}^{k-1} r_j$ is an integer and that $0 \le \sum_{k=1}^{i-1} s_k / \prod_{j=k}^{i-1} r_j < 1$, so the fractional part vanishes under the floor; moreover, every term of $\sum_{k=i+1}^{w} s_k \prod_{j=i}^{k-1} r_j$ contains the factor $r_i$ and is therefore eliminated by the modulo. Hence Eq. (5.2) correctly produces $s_i$ for any $i \in \{1, 2, \cdots, w\}$.

Figure 5.1: (a) A sample clique and its variable vector; (b) the relationship among the potential table (POT) index, the state string and the potential table, assuming all random variables are binary; (c) the same relationship, assuming $a, c, e$ are binary while $b, d, f$ are ternary.

We organize the potential table using Eqs. (5.1) and (5.2). For each entry index $t$ ($t = 1, 2, \cdots, \prod_i r_i$), we convert $t$ into a state string $S$ using Eq. (5.2); for a given state string $S$, we store the corresponding potential in the $t$-th entry of the potential table, where $t$ is obtained from $S$ according to Eq. (5.1). As an example, we show a segment of a sample potential table in Figure 5.1. Figure 5.1 (a) shows a sample clique $C$ with variable vector $(a, b, c, d, e, f)$; a segment of the potential table $\psi_C$ for $C$ is given in Figure 5.1 (b), along with the corresponding state string and index of each entry. For the sake of illustration, we assume all random variables are binary in Figure 5.1 (b). In Figure 5.1 (c), however, we assume the random variables $a, c, e$ are binary, while $b, d, f$ are ternary; using Eqs. (5.1) and (5.2), we obtain the relationship among indices, state strings and entries of the potential table shown in Figure 5.1 (c).

5.1.2 Separators Between Cliques

A separator is the intersection of two adjacent cliques; hence, for each edge of a junction tree there is a separator with respect to the two cliques connected by the edge. In Figure 5.1 (a), we show the three separators related to clique $C$: $S_{pa}(C)$ is the separator between $C$ and its parent $pa(C)$, while $S_{ch_1}(C)$ and $S_{ch_2}(C)$ are the two separators between $C$ and its two children, $ch_1(C)$ and $ch_2(C)$. The terms defined in the previous section, such as variable vector and state string, also apply to separators; using Eqs. (5.1) and (5.2), we organize the potential tables of separators as we do for cliques.

For each clique $C$ in the given junction tree, we impose the following requirement on the variable vectors of $C$ and $S_{pa}(C)$: the variable vector of $C$ is ordered as $V_C = (V_{C \setminus S_{pa}(C)}, V_{S_{pa}(C)})$, where $V_{S_{pa}(C)}$ is the variable vector of $S_{pa}(C)$ and $V_{C \setminus S_{pa}(C)}$ consists of the random variables in $V_C$ but not in $V_{S_{pa}(C)}$. The order of the variables inside $V_{C \setminus S_{pa}(C)}$ and $V_{S_{pa}(C)}$ is arbitrary. The advantage of this requirement is a simplification of the computation in evidence propagation; details are given in Section 5.2.4.

In evidence propagation, we propagate evidence from the input separators to $C$, and then use the updated $\psi_C$ to renew the output separators. We define the input separators of clique $C$ as the separators that carry evidence information before $C$ is updated, and the output separators as the separators to be updated afterwards. Taking Figure 5.2 as an example, in evidence collection, $S_{ch_1}(C)$ and $S_{ch_2}(C)$ are the input separators while $S_{pa}(C)$ is the output separator; in evidence distribution, $S_{pa}(C)$ becomes the input separator while $S_{ch_1}(C)$ and $S_{ch_2}(C)$ become the output separators.

Figure 5.2: Illustration of the input and output separators of clique $C$ with respect to evidence collection and evidence distribution.
5.2 Node Level Primitives

Evidence propagation is a series of computations on clique potential tables and separators. Based on the potential table representation addressed in Section 5.1, we introduce four node level primitives for evidence propagation: table marginalization, table extension, table multiplication, and table division.

5.2.1 Table Marginalization

Table marginalization obtains separator potential tables from a given clique potential table. For example, in Figure 5.1 (a), we can marginalize $\psi_C$ to obtain $\psi_{S_{ch_1}(C)}$, $\psi_{S_{ch_2}(C)}$ and $\psi_{S_{pa}(C)}$.

To obtain the potential table of the $i$-th child separator of a given clique $C$, i.e., $\psi_{S_{ch_i}(C)}$, the marginalization must identify the mapping relationship between $V_{S_{ch_i}(C)}$ and $V_C$. We define the mapping vector $M_{ch_i(C)} = (m_1 m_2 \cdots m_{w_{S_{ch_i}}} \mid m_j \in \{1, \cdots, w\})$ to represent the mapping relationship from $V_C$ to $V_{S_{ch_i}(C)}$, where $w$ is the width of clique $C$ and $w_{S_{ch_i}}$ is the length of $V_{S_{ch_i}(C)}$; notice that $V_{S_{ch_i}(C)} \subset V_C$. The value of $m_j$ is determined by the requirement that the $m_j$-th variable in $V_C$ be the same as the $j$-th variable in $V_{S_{ch_i}(C)}$. Using the mapping vector $M_{ch_i(C)}$, we identify the relationship between $\psi_C$ and $\psi_{S_{ch_i}(C)}$: given an entry $\psi_C(t)$ of a potential table $\psi_C$, we convert the index $t$ into a state string $S = (s_1 s_2 \cdots s_w)$, construct a new state string $\tilde{S} = (\tilde{s}_1 \tilde{s}_2 \cdots \tilde{s}_{w_{S_{ch_i}}})$ by letting $\tilde{s}_j = s_{m_j}$, and convert $\tilde{S}$ back into an index $\tilde{t}$. Thus, $\psi_C(t)$ corresponds to $\psi_{S_{ch_i}(C)}(\tilde{t})$. To compute $\psi_{S_{ch_i}(C)}$ from $\psi_C$, we identify this relationship for each $t$ and accumulate $\psi_C(t)$ into $\psi_{S_{ch_i}(C)}(\tilde{t})$. Table marginalization for obtaining a separator $\psi_{S_{ch_i}(C)}$ from a given clique potential table $\psi_C$ is illustrated in Figure 5.3.

Figure 5.3: Illustration of using table marginalization to obtain the separator $\psi_{S_{ch_i}(C)}$ from the clique potential table $\psi_C$.

Algorithm 5.4 describes table marginalization for obtaining $\psi_{S_{ch_i}(C)}$ from $\psi_C$. Let $w$ and $w_{out}$ denote the clique width and the separator width, respectively. The input to table marginalization consists of the clique potential table $\psi_C$, a mapping vector $M_C = (m_1, m_2, \cdots, m_{w_{out}})$, and the number of processors $P$; the output is the separator potential table $\psi_{out}$, i.e., $\psi_{S_{ch_i}(C)}$. Each processor is in charge of a segment of $\psi_C$, consisting of $|\psi_C|/P$ contiguous entries.
Line2 53 Input: cliquepotentialtableψ C ,mappingvectorM C ,numberofprocessorsP Output: separatorpotentialtableψ out 1: forp = 1toP pardo {local computation} 2: ψ (p) out (1 :|ψ out |) = ~ 0 3: fort = |ψ C | P (p−1)to |ψ C | P p−1do 4: converttintoS = (s 1 s 2 ···s w ) 5: construct ˜ S = (˜ s 1 ˜ s 2 ···˜ s wout )fromS usingmappingvectorM C 6: convert ˜ S into ˜ t 7: ψ (p) out ( ˜ t) =ψ (p) out ( ˜ t)+ψ C (t) 8: endfor {global communication} 9: broadcastψ (p) out andreceiveψ (j) out fromProcessorj (j = 1,··· ,P andj 6=p) {local computation} 10: ψ out = P P j=1 ψ (j) out 11: endfor Figure5.4: Marginalizationforseparatorψ S ch i (C) inAlgorithm5.4initializestheoutputonthelocalprocessor,denotedψ (p) out . InLines4- 7, each processor updates ψ (p) out . Notice that Line 3 does not reallocate the potential table, but converts the indices of the partitioned table into the corresponding indices of the original potential table. Then, in Line 4, the index scalar is converted into a state string using Eq. (5.2). Line 5 transforms the state string by assigning ˜ s i = s m i for i = 1,2,··· ,w out . The resulting state string is converted back into an index in Line 6. Line 9 is a global communication round, where the all-to-all communication is performed to broadcast the local resultψ (p) out . Each processor receivesψ (j) out from the j-th processor (j = 1,2,··· ,P andj 6= p) and accumulates (Line 10)ψ (j) out to its lo- cal result. The sum calculated in Line 10 gives the updated separator potential table ψ S ch i (C) . For the sake of illustrating the scalability of the algorithm, we analyze Algo- rithm 5.4 using the coarse grained multicomputer (CGM) model [12].Line 1 takes 54 constant time to initialize. Let w C denote the clique width of C. Line 4 takes 3w C time, since Eq. (5.2) computes w C elements, each involving three scalar operations: a division, a modulo and the increase of the product of r j . Line 5 takes w out time and Line 6 takes 3w out time, since Eq. (5.1) also involves three scalar operations. Note that w out < w C . Line 7 takes constant time. Line 9 performs all-to-all com- munication, which is the only global communication round for Algorithm 5.4. The operation of Lines 9 and 10 is known as all-reduce [19]. By organizing the pro- cessors into a spanning tree with logarithmic depth, all-reduce takes O(|ψ out |logP) time[15]. Since|ψ out |≪|ψ C |inourcontext,thelocalcomputationtimeisO(|ψ C |w C / P),1≤P ≤|ψ C |. 5.2.2 TableExtension Table extension identifies the mapping relationship between two potential tables and equalizes the size of the two tables. Table extension is an inverse mapping of table marginalization,sincetheformerexpandsasmalltable(separatorpotentialtable)toa largetable(cliquepotentialtable),whilethelattershrinksalargetabletoasmalltable. Tableextensionisutilizedtosimplifytablemultiplicationandtabledivision. Figure5.5illustratestableextensionusingthesamplecliqueandseparatorgivenin Figure5.3. Thecliqueconsistsof6random variableswherethe numbers ofstatesare r a =r c =r e = 2andr b =r d =r f = 3. Theseparatorconsistsof2randomvariables r d andr f . In Figure 5.5 (a), we illustrate the state strings ofψ S(C) , the corresponding state strings of ψ C , and the indices. We can see from Figure 5.5 (a) that an entry in ψ S(C) corresponds to multiple entries inψ C . In addition, the corresponding entries in ψ C may not be contiguous. However, after applying table extension, each entry in ψ S(C) correspondstotheentrywiththesameindexinψ C (seeFigure5.5(b)). 
5.2.2 Table Extension

Table extension identifies the mapping relationship between two potential tables and equalizes their sizes. Table extension is an inverse mapping of table marginalization: the former expands a small table (a separator potential table) to a large table (a clique potential table), while the latter shrinks a large table to a small one. Table extension is used to simplify table multiplication and table division.

Figure 5.5 illustrates table extension using the sample clique and separator of Figure 5.3. The clique consists of 6 random variables with numbers of states $r_a = r_c = r_e = 2$ and $r_b = r_d = r_f = 3$; the separator consists of the 2 random variables $d$ and $f$. In Figure 5.5 (a), we illustrate the state strings of $\psi_{S(C)}$, the corresponding state strings of $\psi_C$, and the indices. As Figure 5.5 (a) shows, an entry of $\psi_{S(C)}$ corresponds to multiple entries of $\psi_C$, and the corresponding entries of $\psi_C$ need not be contiguous. After applying table extension, however, each entry of $\psi_{S(C)}$ corresponds to the entry with the same index in $\psi_C$ (see Figure 5.5 (b)).

Figure 5.5: (a) The mapping relationship between the separator $\psi_{S(C)}$ and the clique potential table $\psi_C$; (b) the mapping relationship between the extended $\psi_{S(C)}$ and $\psi_C$, where every entry of $\psi_{S(C)}$ corresponds to the entry of $\psi_C$ with the same index.

The parallel algorithm for the table extension primitive is shown in Algorithm 5.6. The input consists of the original separator $\psi_S(C)$, the size $|\psi_C|$ of the corresponding clique potential table, the mapping vector $M_C = (m_1, m_2, \cdots, m_{w_{out}})$, and the number of processors $P$; the output is the extended separator table $\psi_{ext}$. Regarding the data layout, we assume that a segment of $\psi_{ext}$ ($|\psi_C|/P$ entries) is maintained in each local memory, while the other inputs are stored in every local memory.

Input: separator potential table $\psi_S$, size of the clique potential table $|\psi_C|$, mapping vector $M_C$, number of processors $P$
Output: extended separator potential table $\psi_{ext}$
1: for $p = 1$ to $P$ pardo
{local computation}
2: allocate memory of $|\psi_C|/P$ entries for $\psi_{ext}$
3: for $t = \frac{|\psi_C|}{P}(p-1)$ to $\frac{|\psi_C|}{P}p - 1$ do
4: convert $t$ into $S = (s_1 s_2 \cdots s_w)$
5: construct $\tilde{S} = (\tilde{s}_1 \tilde{s}_2 \cdots \tilde{s}_{w_{out}})$ from $S$ using $M_C$
6: convert $\tilde{S}$ into $\tilde{t}$
7: $\psi_{ext}(t) = \psi_S(\tilde{t})$
8: end for
9: end for
Figure 5.6: Table extension for separator $\psi_S(C)$

Line 2 of Algorithm 5.6 initializes the output by allocating memory for $|\psi_C|$ entries. In Lines 3-7, each processor assigns values to $\psi_{ext}$ in parallel: Line 4 converts an index scalar into a state string using Eq. (5.2); Line 5 transforms the state string by assigning $\tilde{s}_i = s_{m_i}$ for $i = 1, 2, \cdots, w_{out}$; the resulting state string is converted back into an index in Line 6; and Line 7 copies the value of the identified entry into $\psi_{ext}(t)$. No communication is needed in table extension.

Using the CGM model, we analyze the complexity of Algorithm 5.6. Line 2 takes constant time for memory allocation. As in the analysis of Algorithm 5.4, Line 4 takes $3w_C$ time of local computation; Line 5 takes $w_S$ time, where $w_S$ is the width of separator $S$; and Line 6 requires $3w_S$ time, since three scalar operations are needed per element of $\tilde{S}$. Therefore, the local computation time is $O(|\psi_C| w_C / P)$, where $1 \le P \le |\psi_C|$.

5.2.3 Table Multiplication and Table Division

In exact inference, table multiplication occurs between a clique potential table and its separators. For each entry of a separator, table multiplication multiplies the potential $\psi_S(t)$ in the separator with the potential $\psi_C(\tilde{t})$ in the clique potential table for which the random variables shared by the separator $S$ and the clique $C$ have identical states. Table multiplication thus requires identifying the relationship between entries of the separator and entries of the clique potential table; we use table extension to identify this relationship.

We present the algorithm for table multiplication in Algorithm 5.7. The input includes the separator potential table $\psi_S$, the clique potential table $\psi_C$, the mapping vector $M_C$, and the number of processors $P$; the output is the updated potential table $\psi^*_C$.
Input: separator potential table $\psi_S$, clique potential table $\psi_C$, mapping vector $M_C$, number of processors $P$
Output: resulting potential table $\psi^*_C$
1: $\psi_{ext}$ = extend $\psi_S$ to size $|\psi_C|$ using Algorithm 5.6
2: for $p = 1$ to $P$ pardo
{local computation}
3: for $t = \frac{|\psi_C|}{P}(p-1)$ to $\frac{|\psi_C|}{P}p - 1$ do
4: $\psi^*_C(t) = \psi_C(t) * \psi_{ext}(t)$
5: end for
6: end for
Figure 5.7: Table multiplication

Line 1 extends the separator $\psi_S$ using Algorithm 5.6. Lines 3-5 update $\psi_C$ by multiplying the extended separator with the clique potential table; each processor is in charge of a segment of $\psi^*_C$. Since Line 1 uses Algorithm 5.6, its local computation time is $O(|\psi_C| w_C / P)$. Lines 3-5 consist of $|\psi_C|/P$ iterations, each taking $O(1)$ local computation time. Therefore, the total computation time for Algorithm 5.7 is also $O(|\psi_C| w_C / P)$, where $1 \le P \le |\psi_C|$; notice that $|\psi_C| = \prod_{i=1}^{w_C} r_i$.

Table division is very similar to table multiplication. According to Eq. (2.1), table division occurs between two separator potential tables $\psi^*_S$ and $\psi_S$; Algorithm 5.8 shows the parallel table division primitive. By an analysis similar to that for table multiplication, the total computation time for Algorithm 5.8 is $O(|\psi_C| w_C / P)$, where $1 \le P \le |\psi_C|$.

Input: separator potential tables $\psi^*_S$ and $\psi_S$, mapping vector $M_S$, number of processors $P$
Output: resulting potential table $\psi^\dagger_S$
1: $\psi_{ext}$ = extend $\psi_S$ to size $|\psi_S|$ using Algorithm 5.6
2: for $p = 1$ to $P$ pardo
{local computation}
3: for $t = \frac{|\psi^*_S|}{P}(p-1)$ to $\frac{|\psi^*_S|}{P}p - 1$ do
4: if $\psi_S(t) \ne 0$ then
5: $\psi^\dagger_S(t) = \psi^*_S(t) / \psi_{ext}(t)$
6: else
7: $\psi^\dagger_S(t) = 0$
8: end if
9: end for
10: end for
Figure 5.8: Table division

5.2.4 Optimized Computation Kernels for Evidence Propagation

The node level primitives discussed above can be used to implement exact inference. In this section, we optimize the primitives with respect to evidence collection and evidence distribution separately; the optimized primitives form the computation kernels for evidence collection and distribution.

Table marginalization for obtaining the separator between a clique $C$ and its parent $pa(C)$ can be simplified by avoiding the conversions between indices and state strings. To obtain the potential table $\psi_{S_{pa}(C)}$, we also marginalize the potential table $\psi_C$. However, recall from Section 5.1.2 that the variable vector of clique $C$ is required to be ordered as $V_C = (V_{C \setminus S_{pa}(C)}, V_{S_{pa}(C)})$, where $V_{S_{pa}(C)}$ is the variable vector of $S_{pa}(C)$ and $V_{C \setminus S_{pa}(C)}$ consists of the random variables in $V_C$ but not in $V_{S_{pa}(C)}$. Using this relationship between $V_C$ and $V_{S_{pa}(C)}$, we simplify the table marginalization for obtaining $\psi_{S_{pa}(C)}$ from $\psi_C$: since $V_{S_{pa}(C)}$ is the lower part of $V_C$, the relationship between entries of $\psi_{S_{pa}(C)}$ and entries of $\psi_C$ is straightforward (see Figure 5.9). Segment $i$ of $\psi_C$ denotes $\psi_C\big((i-1)|\psi_{S_{pa}(C)}| : i|\psi_{S_{pa}(C)}| - 1\big)$, i.e., the array consisting of the $(i-1)|\psi_{S_{pa}(C)}|$-th through the $(i|\psi_{S_{pa}(C)}| - 1)$-th entries of $\psi_C$. This marginalization can therefore be implemented by accumulating all segments element-wise, without checking the variable states of each entry of the potential table.

Figure 5.9: Illustration of using the table marginalization primitive to obtain the separator $\psi_{S_{pa}(C)}$ from the clique potential table $\psi_C$.
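In code, the simplification collapses to a modulo: assuming, as in Line 21 of Algorithm 5.10, that the parent-separator index of clique entry $t$ is $t \bmod |\psi_{S_{pa}(C)}|$, no state strings are examined at all. A minimal serial sketch:

    /* Simplified marginalization onto the parent separator: segment i of
       pot_c maps element-wise onto the whole separator table, so the
       separator entry of clique entry t is simply t % size_s
       (cf. Line 21 of Algorithm 5.10). */
    void marginalize_to_parent(const double *pot_c, long size_c,
                               double *pot_s, long size_s)
    {
        for (long t = 0; t < size_s; t++)
            pot_s[t] = 0.0;
        for (long t = 0; t < size_c; t++)   /* split across P processors */
            pot_s[t % size_s] += pot_c[t];
    }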
Since multiplication and division have the same priority in Eq. (2.1), we can perform either first. However, if we perform the multiplication first, we must perform table extension twice (once for the multiplication and once for the division); if we perform the division first, table extension is needed only once, because $\psi^*_S$ and $\psi_S$ in Eq. (2.1) have the same size, so no extension is needed for the division. We therefore always perform the division between the two separators first. Notice that the variable vectors of the two separators are identical; we thus eliminate table extension (Line 1) from Algorithm 5.8, and table division becomes an element-wise division between two arrays.

Table multiplication can also be optimized, by combining table extension and multiplication. According to Algorithm 5.7, the extended table has the same size as $\psi_C$. Since clique potential tables can be very large, allocating memory for the extended table is expensive. To avoid this allocation, we combine table extension and multiplication: once we obtain the value of $\psi_{ext}(t)$, where $\psi_{ext}$ is the extended table and $t$ is the index, we immediately multiply $\psi_{ext}(t)$ by $\psi_C(t)$ instead of storing $\psi_{ext}(t)$ in memory; the product replaces the original datum in $\psi_C(t)$.
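A serial sketch of the fused extension-multiplication: each extended value is computed on the fly and multiplied into $\psi_C$ in place, so no $O(|\psi_C|)$ temporary is allocated. It reuses the encode/decode sketches of Section 5.1.1; the names and the width bound are illustrative.

    long encode(int w, const int *r, const int *s);
    void decode(int w, const int *r, long t, int *s);

    /* Fused table extension + multiplication: for each clique entry t,
       compute on the fly the separator value that psi_ext(t) would hold
       and multiply it into pot_c[t] in place. */
    void multiply_in_place(int w, const int *r, double *pot_c, long size_c,
                           int w_s, const int *map, const double *pot_s)
    {
        int s[64], s_out[64], r_out[64];        /* assumes w <= 64 */
        for (int j = 0; j < w_s; j++) r_out[j] = r[map[j]];
        for (long t = 0; t < size_c; t++) {     /* split across P procs */
            decode(w, r, t, s);
            for (int j = 0; j < w_s; j++) s_out[j] = s[map[j]];
            pot_c[t] *= pot_s[encode(w_s, r_out, s_out)];
        }
    }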
The resulting computation kernels for evidence collection and evidence distribution are given in Algorithms 5.10 and 5.11, respectively.

Input: input separators $\psi^*_{in_i}$, clique potential table $\psi_C$, mapping vectors $M_{in_i}$ ($i = 1, 2, \cdots, d_C$)
Output: updated clique potential table $\psi_C$ and the output separator $\psi_{out}$
1: for $p = 1$ to $P$ pardo
{local computation}
2: for $i = 1$ to $d_C$ do
3: $\psi^{(p)}_{in_i}(1:|\psi^*_{in_i}|) = \vec{0}$
4: for $j = \frac{|\psi_C|}{P}(p-1)$ to $\frac{|\psi_C|}{P}p - 1$ do
5: convert $j$ into $S = (s_1 s_2 \cdots s_w)$
6: construct $\tilde{S} = (\tilde{s}_1 \tilde{s}_2 \cdots \tilde{s}_{|in|})$ from $S$ using $M_{in_i}$
7: convert $\tilde{S}$ into $\tilde{t}_j$
8: $\psi^{(p)}_{in_i}(\tilde{t}_j) = \psi^{(p)}_{in_i}(\tilde{t}_j) + \psi_C(j)$
9: end for
10: end for
{global communication}
11: broadcast $\psi^{(p)}_{in_i}$ and receive $\psi^{(k)}_{in_i}$ from processor $k$ ($i = 1, \cdots, d_C$; $k = 1, \cdots, P$ and $k \ne p$)
{local computation}
12: for $i = 1$ to $d_C$ do
13: $\psi_{in_i} = \sum_{j=1}^{P} \psi^{(j)}_{in_i}$
14: $\psi_{in_i}(1:|\psi^*_{in_i}|) = \psi^*_{in_i}(1:|\psi^*_{in_i}|) / \psi_{in_i}(1:|\psi^*_{in_i}|)$ for the non-zero elements of $\psi_{in_i}$
15: for $j = \frac{|\psi_C|}{P}(p-1)$ to $\frac{|\psi_C|}{P}p - 1$ do
16: $\psi_C(j) = \psi_C(j) * \psi_{in_i}(\tilde{t}_j)$
17: end for
18: end for
19: $\psi^{(p)}_{out}(1:|\psi_{out}|) = \vec{0}$
20: for $j = \frac{|\psi_C|}{P}(p-1)$ to $\frac{|\psi_C|}{P}p - 1$ do
21: $\psi^{(p)}_{out}(j \,\%\, |\psi_{out}|) = \psi^{(p)}_{out}(j \,\%\, |\psi_{out}|) + \psi_C(j)$
22: end for
{global communication}
23: broadcast $\psi^{(p)}_{out}$ and receive $\psi^{(k)}_{out}$ from processor $k$ ($k = 1, \cdots, P$ and $k \ne p$)
{local computation}
24: $\psi_{out} = \sum_{j=1}^{P} \psi^{(j)}_{out}$
25: end for
Figure 5.10: Computation kernel of evidence collection

Input: input separator $\psi^*_{in}$, clique potential table $\psi_C$, mapping vectors $M_{out_i}$ ($i = 1, 2, \cdots, d_C$)
Output: updated clique potential table $\psi_C$ and the output separators $\psi_{out_i}$
1: for $p = 1$ to $P$ pardo
{local computation}
2: $\psi^{(p)}_{in}(1:|\psi^*_{in}|) = \vec{0}$
3: for $i = \frac{|\psi_C|}{P}(p-1)$ to $\frac{|\psi_C|}{P}p - 1$ do
4: $\psi^{(p)}_{in}(i \,\%\, |\psi^*_{in}|) = \psi^{(p)}_{in}(i \,\%\, |\psi^*_{in}|) + \psi_C(i)$
5: end for
{global communication}
6: broadcast $\psi^{(p)}_{in}$ and receive $\psi^{(k)}_{in}$ from processor $k$ ($k = 1, \cdots, P$ and $k \ne p$)
{local computation}
7: $\psi_{in} = \sum_{j=1}^{P} \psi^{(j)}_{in}$
8: $\psi_{in}(1:|\psi_{in}|) = \psi^*_{in}(1:|\psi^*_{in}|) / \psi_{in}(1:|\psi_{in}|)$ for the non-zero elements of $\psi_{in}$
9: for $i = \frac{|\psi_C|}{P}(p-1)$ to $\frac{|\psi_C|}{P}p - 1$ do
10: $\psi_C(i) = \psi_C(i) * \psi_{in}(i \,\%\, |\psi_{in}|)$
11: end for
12: for $i = 1$ to $d_C$ do
13: $\psi^{(p)}_{out_i}(1:|\psi_{out_i}|) = \vec{0}$
14: for $t = \frac{|\psi_C|}{P}(p-1)$ to $\frac{|\psi_C|}{P}p - 1$ do
15: convert $t + t_\delta$ into $S = (s_1 s_2 \cdots s_w)$
16: construct $\tilde{S} = (\tilde{s}_1 \tilde{s}_2 \cdots \tilde{s}_{|out_i|})$ from $S$ using $M_{out_i}$
17: convert $\tilde{S}$ into $\tilde{t}$
18: $\psi^{(p)}_{out_i}(\tilde{t}) = \psi^{(p)}_{out_i}(\tilde{t}) + \psi_C(t)$
19: end for
20: end for
{global communication}
21: broadcast $\psi^{(p)}_{out_i}$ and receive $\psi^{(k)}_{out_i}$ from processor $k$ ($i = 1, \cdots, d_C$; $k = 1, \cdots, P$ and $k \ne p$)
{local computation}
22: $\psi_{out_i} = \sum_{j=1}^{P} \psi^{(j)}_{out_i}$ ($i = 1, \cdots, d_C$)
23: end for
Figure 5.11: Computation kernel of evidence distribution

We analyze the complexity of Algorithms 5.10 and 5.11 based on the analysis of the node level primitives. Line 1 of Algorithm 5.10 launches $P$-way parallelism. Lines 2-11, together with Line 13, perform table marginalization for each child of $C$; Lines 2-11 take $O(d_C |\psi_C| w_C / P)$ local computation time, where $d_C$ is the number of children of clique $C$, and Line 11 is a global communication round. Line 14 takes $O(|\psi^*_{in_i}|)$ local computation time to perform the element-wise table division. Notice that we do not parallelize the table division: since $|\psi_{in_i}|$ is very small in our context, the communication overhead of parallelizing the division would exceed the local computation time. Without loss of generality, we assume $|\psi_{out}|/P \ge |\psi_{in_i}| \approx |\psi_{out}|$ in this analysis. Lines 15-17 perform table multiplication on $d_C |\psi_C|/P$ pairs of entries. Using the $\tilde{t}_j$ computed in Lines 5-7, Lines 19-22 perform the simplified marginalization, whose local computation time is $O(|\psi_C|/P)$; Line 23 is the second global communication round. Therefore, the local computation time of Algorithm 5.10 is $O(d_C |\psi_C| w_C / P)$, $1 \le P \le |\psi_{out}|$, and the number of global communication rounds is 2. Similarly, the local computation time of Algorithm 5.11 is $O(d_C |\psi_C| w_C / P)$, and its number of global communication rounds is also 2.

5.3 Exact Inference with Node Level Primitives

5.3.1 Data Layout

For an arbitrary junction tree, we distribute the data as follows. Let $P$ denote the number of processors. Instead of simply splitting every potential table $\psi_C$ into $P$ segments, we compare $|\psi_C|/P$ with $|\psi_S|$, where $\psi_S$ is the largest potential table of the separators adjacent to $C$. If $|\psi_C|/P \ge |\psi_S|$, each processor stores $|\psi_C|/P$ entries of the potential table. Otherwise, we find another clique $\tilde{C}$ that can be processed in parallel with $C$ (e.g., $C$ and $\tilde{C}$ share the same parent clique), and distribute $\psi_C$ to $\lfloor P|\psi_C|/(|\psi_C| + |\psi_{\tilde{C}}|) \rfloor$ processors and $\psi_{\tilde{C}}$ to the rest (see Figure 5.12). Similarly, if $(|\psi_C| + |\psi_{\tilde{C}}|)/P \le |\psi_S|$ and $k$ parallel cliques exist ($k > 2$), each having a small potential table, then we allocate the processors among the $k$ cliques: letting the $k$ cliques be denoted $C_1, C_2, \cdots, C_k$, each $\psi_{C_i}$, $\forall i \in [1,k]$, is partitioned into $P_i = P|\psi_{C_i}| / \sum_{j=1}^{k} |\psi_{C_j}|$ segments, each stored by one processor. The allocation information of the potential tables is maintained in each processor using a queue called the clique queue: let $\mathcal{C}_i$ denote the clique queue maintained in Processor $i$; $\mathcal{C}_i$ gives the order in which Processor $i$ updates the cliques. This order must be consistent with the breadth first search (BFS) order of the junction tree.
Each element in $\mathcal{C}_i$ consists of a clique ID, denoted $C$, the index of the segment of $\psi_C$ stored in Processor $i$, and the IDs of the processors where the other segments of $\psi_C$ are processed. In Figure 5.12 (b), each column corresponds to a clique queue. In addition to the potential tables, each processor maintains $O(d_C |\psi_S|)$ memory to store the adjacent separators during evidence propagation.

Figure 5.12: (a) A sample junction tree, where the size of a node indicates the size of its potential table. (b) The layout of the potential tables: each row consists of segments of one or more potential tables, while each column corresponds to a processor.

5.3.2 Complete Algorithm

The process of exact inference with node level primitives is given in Algorithm 5.13. The input to this algorithm includes the number of processors $P$, the clique queue $\mathcal{C}_p$ for each processor, the structure of the given junction tree $J$, and the mapping vectors of all cliques with respect to their adjacent separators; the output is the updated potential tables. Notice that the complete process of exact inference also includes local evidence absorption and query computation, whose parallelization is straightforward [37]; Algorithm 5.13 therefore focuses on evidence propagation only, i.e., evidence collection and evidence distribution.

Input: number of processors $P$, clique queues $\mathcal{C}_1, \cdots, \mathcal{C}_P$, junction tree $J$, mapping vectors $M_S$ for all cliques with respect to their adjacent separators
Output: updated clique potential tables for all cliques
1: for $p = 1$ to $P$ pardo
{Evidence collection}
2: for $i = |\mathcal{C}_p|$ down to 1 do
3: let $C = \mathcal{C}_p(i)$; $\psi^*_{in_j} = \psi_{ch_j(C)}$; $M_{in_j} = M_{ch_j(C)}$; $\psi_{out} = \psi_{pa(C)}$, $\forall j = 1, \cdots, d_C$
4: wait until the nonempty $\psi^*_{in_j}$, $\forall j = 1, \cdots, d_C$, are updated
5: EvidenceCollect($C$, $\psi^*_{in_j}$, $\psi_{out}$, $M_{in_j}$)
6: mark $\psi_{out}$ as a separator updated with respect to evidence collection
7: end for
{Evidence distribution}
8: for $i = 1$ to $|\mathcal{C}_p|$ do
9: let $C = \mathcal{C}_p(i)$; $\psi^*_{in} = \psi_{pa(C)}$; $M_{out_j} = M_{ch_j(C)}$; $\psi_{out_j} = \psi_{ch_j(C)}$, $\forall j = 1, \cdots, d_C$
10: wait until the nonempty $\psi^*_{in}$ is updated
11: EvidenceDistribute($\psi_C$, $\psi^*_{in}$, $\psi_{out_j}$, $M_{out_j}$)
12: mark $\psi_{out_j}$, $\forall j = 1, \cdots, d_C$, as separators updated with respect to evidence distribution
13: end for
14: end for
Figure 5.13: Exact inference using computation kernels

In Algorithm 5.13, Lines 2-7 perform evidence collection in the given junction tree. In evidence collection, each processor updates cliques in the reverse of the order given by $\mathcal{C}_p$; therefore, a processor first processes leaf cliques. Line 3 defines some notation. Since the input separators are always empty for leaf cliques during evidence collection, Line 4 is not executed for leaf cliques; thus, Line 4 cannot suspend all processors. For non-leaf cliques, Line 4 ensures that all input separators have been updated before $C$ is processed. Line 5 applies Algorithm 5.10 to perform evidence collection, and Line 6 declares that $\psi_{out}$ has been updated. Lines 8-13 perform evidence distribution: Line 10 ensures that the input is ready, Line 11 performs evidence distribution at $C$ using Algorithm 5.11, and the output separators are updated in Line 12.

The computation in Algorithm 5.13 occurs in Lines 5 and 11, whose local computation time and number of global communication rounds were discussed in Section 5.2.4; Lines 4 and 10 introduce two more communication rounds. Based on the analysis in Section 5.2.4, the local computation time of Algorithm 5.13 is $O(N d_C |\psi_C| w_C / P)$. Since we perform six communications per clique, the number of global communication rounds is $O(N)$.
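Schematically, the per-processor driver of Algorithm 5.13 is a pair of loops over the clique queue. In the sketch below, the kernel invocations and the readiness test are hidden behind hypothetical helper functions; they are not part of our implementation, which synchronizes through MPI messages.

    /* Hypothetical helpers -- illustrative only. */
    int  queue_length(int p);
    int  queue_at(int p, int i);            /* i-th clique ID in queue C_p  */
    void wait_for_input_separators(int c);  /* block until inputs updated   */
    void evidence_collect(int c);           /* the kernel of Algorithm 5.10 */
    void evidence_distribute(int c);        /* the kernel of Algorithm 5.11 */
    void mark_outputs_updated(int c);

    /* Evidence propagation on processor p: reverse queue order for
       collection (leaves first), forward order for distribution. */
    void propagate(int p)
    {
        int n = queue_length(p);
        for (int i = n - 1; i >= 0; i--) {      /* evidence collection   */
            int c = queue_at(p, i);
            wait_for_input_separators(c);       /* no-op for leaf cliques */
            evidence_collect(c);
            mark_outputs_updated(c);
        }
        for (int i = 0; i < n; i++) {           /* evidence distribution */
            int c = queue_at(p, i);
            wait_for_input_separators(c);       /* no-op for the root    */
            evidence_distribute(c);
            mark_outputs_updated(c);
        }
    }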
5.4 Experiments

5.4.1 Computing Facilities

We implemented the node level primitives using the Message Passing Interface (MPI) on the Linux cluster at the Center for High-Performance Computing and Communications (HPCC) at the University of Southern California (USC) [2]. The configuration of the HPCC cluster is discussed in Chapter 2.4.

5.4.2 Experiments using PNL

The Intel Open Source Probabilistic Networks Library (PNL) is a full-function, free, graphical model library [5]. A parallel version of PNL is now available, developed by the Intel Russia Research Center and the University of Nizhni Novgorod; the Intel China Research Center and the Intel Architecture Laboratory were also involved in the development.

We conducted experiments to explore the scalability of the exact inference algorithm in the PNL library [30]. The input graphs are denoted J1, J2 and J3: each clique in J1 has at most one child; all cliques in J2, except the leaves, have equal numbers of children; and the cliques in J3 have random numbers of children. All graphs have 128 cliques, each random variable having 2 states. We show the execution time in Figure 5.14 (a). Although the execution time was reduced when we used 2 and 4 processors, we failed to achieve a reduced execution time with 8 processors. We also measured the time taken by table operations and non-table operations in exact inference implemented using the PNL library, and observed that the table operations dominate the overall execution time: with clique width $w_C = 20$ and number of states $r = 2$, table operations take more than 99% of the overall execution time. This result demonstrates that parallelizing the node level primitives can lead to high performance.

We also conducted an experiment using a Bayesian network from a real application: the Quick Medical Reference decision theoretic version (QMR-DT), which is used in a microcomputer-based decision support tool for diagnosis in internal medicine [27]. The network has 1000 nodes, which form two layers, one representing diseases and the other symptoms; each disease has one or more edges pointing to the corresponding symptoms. All random variables (nodes) are binary.
The set contained potential tables with various parameters: The clique width w_C was selected to be 20, 25 or 30. The number of states r was selected to be 2 for some random variables and 3 for the others. As double precision floating point numbers were used, the size of the potential tables varied from 8 MB (w_C = 20, r = 2) to 8 GB (w_C = 30, r = 2). The separator widths w_S were chosen to be 1, 2 or 4. The clique degree d (i.e., the maximum number of children) was selected to be 1, 2 or 4. The mapping vectors M for the junction trees were constructed offline; the construction of the mapping vectors is explained in Section 5.2.1. We performed the node level primitives on these potential tables using various numbers of processors (P was chosen from 1, 2, 4, 8, 32, 64 and 128). Each processor processed 1/P of the total entries in a given potential table.

The experimental results for table marginalization are shown in Figure 5.15. The other primitives showed similar results [39]. The default parameter values for the experiments were w_C = 25, w_S = 2, r = 2 and clique degree d = 2. However, we let w_C = 15 for Panel (b), since the size of the potential table with w_C = 25 and r = 3 is about 6 TB, which exceeds our quota. For Panel (b), the label r = 2 & 3 means that the number of states for half of the random variables is 2, while for the other half it is 3. In each panel, we varied one parameter at a time to show the influence of that specific parameter. From the experimental results, we can see that the primitives exhibited almost linear scalability using 128 processors, regardless of the input parameters.

Figure 5.15: Scalability of table marginalization with respect to various parameters: (a) clique width, (b) number of states, (c) separator width, (d) clique degree.

We also evaluated the performance of the computation kernels for evidence collection and distribution. Notice that the computation kernels given in Algorithms 5.10 and 5.11 update a given clique potential table as well as its related separators. We used the same input data used to evaluate the node level primitives. We conducted experiments for the two computation kernels separately; the results are shown in Figures 5.16 and 5.17.

Figure 5.16: Scalability of the computation kernel for evidence collection with respect to various parameters: (a) clique width, (b) number of states, (c) separator width, (d) clique degree.

Figure 5.17: Scalability of the computation kernel for evidence distribution with respect to various parameters: (a) clique width, (b) number of states, (c) separator width, (d) clique degree.
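To illustrate the kind of data parallelism these primitives exploit, here is a minimal sketch of a table-marginalization step in C. The index arithmetic follows the mapping-vector idea of Section 5.2.1, but the flat array layout, binary variables and all names are illustrative assumptions, not the actual interface.

#include <stddef.h>

/* Sketch: marginalize entries [lo, hi) of a clique potential table onto a
 * separator table, assuming binary variables and a mapping vector map[]
 * giving, for each separator variable, its position among the clique
 * variables. psi_s must be zero-initialized by the caller; each processor
 * produces partial sums over its own 1/P chunk of the clique table. */
void marginalize_chunk(const double *psi_c, size_t lo, size_t hi,
                       double *psi_s, const int *map, int w_s)
{
    for (size_t t = lo; t < hi; t++) {
        size_t s = 0;                    /* project clique index t onto */
        for (int j = 0; j < w_s; j++)    /* a separator index s         */
            s |= ((t >> map[j]) & 1u) << j;
        psi_s[s] += psi_c[t];
    }
}

In a distributed run, the per-processor partial separator tables would then be combined with a collective reduction (for instance MPI_Allreduce), which is one of the global communication rounds accounted for in the analysis above.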
Finally, we evaluated the performance of the complete parallel exact inference (Algorithm 5.13 in Section 5.3.2). Unlike the experiments conducted above, the input to Algorithm 5.13 included an entire junction tree consisting of a potential table for each clique, the related separators, and a clique queue for each processor. The data layout for the input junction tree is addressed in Section 5.3.1. The clique queues generated according to the layout were sent to each processor separately. That is, each processor kept partial potential tables, the related separators and a clique queue in its local memory. In our experiments, we generated several junction trees with the number of cliques N = 1024. For the cliques and separators in the junction tree, as for the evaluation of the primitives and kernels, we selected various values for the clique width w_C, the number of states r, the separator width w_S, and the clique degree d_C. Notice that, by varying these parameters, we obtained various types of junction trees. For example, d_C = 1 implies a chain of cliques, while an arbitrary d_C gives a junction tree with random numbers of branches. The results are shown in Figure 5.18. In each panel of Figure 5.18, we varied the value of one parameter and conducted experiments with respect to various numbers of processors.

Figure 5.18: Scalability of exact inference with respect to various parameters: (a) clique width, (b) number of states, (c) separator width, (d) clique degree.

According to the results shown in Figure 5.18, the proposed method exhibits scalability using 128 processors for various input parameters. For the junction tree with N = 1024, w_C = 25, r = 2 and d = 2, the parallel exact inference achieved a speedup of 98.4 times using 128 processors. Thus, the execution time was reduced to about two minutes from approximately three hours on a single processor. Compared to the experiment in Section 5.4.2, our method performed exact inference in junction trees with large potential tables. In addition, the proposed exact inference algorithm exhibited scalability over a much larger range. We observed almost linear speedup in our experiments. According to the algorithm analysis based on the CGM model in Section 5.3.2, the computation complexity is inversely proportional to the number of processors, which matches the experimental results.

Chapter 6

Junction Tree Decomposition

In addition to the data parallelism discussed in Chapter 5, there is another level of parallelism for exact inference in junction trees. Considering cliques without any dependency constraint among them, we can update these cliques in parallel once their respective preceding cliques have been updated. Because such parallelism is offered by the topology of junction trees, we call it structural parallelism in this chapter.

6.1 Evidence Propagation Based on Decomposition

6.1.1 Overview

We decompose a junction tree into a series of subtrees, each consisting of one or more root-leaf paths in the input junction tree (see Figure 6.1). Note that this decomposition is different from conventional tree partition, which divides a given tree without duplicating any node [17]. A junction tree can be decomposed at various granularity levels to generate different numbers of subtrees. Given the number of processors P in a cluster, we decompose a given junction tree into P subtrees of approximately equal task weights, so that each processor can host a distinct subtree. We assume the number of leaf cliques of the input junction tree is greater than the number of available processors.

Figure 6.1: A junction tree in (a) can be decomposed into subtrees of various granularities shown in (b), (c) and (d). The arrows in the graph indicate the evidence collection direction.

Since cliques may be duplicated in several subtrees, we must merge partially updated cliques in the subtrees to obtain fully updated cliques. For example, in Figure 6.1(c), the root exists in all three subtrees. After evidence collection, the root in each subtree is partially updated, since the root in the first subtree cannot collect evidence from Cliques 3, 5, 6, 7 and 10. Thus, we must merge the duplicated cliques.
Merging cliques requires communication among the processors. Our proposed method guarantees that all the duplicated cliques can be merged in parallel. After clique merging, we perform evidence distribution in each subtree in parallel and obtain the final output of evidence propagation. Note that we do not need to merge cliques after evidence distribution.

6.1.2 Junction Tree Decomposition

We show our proposed heuristic for junction tree decomposition in Algorithm 6.2. This algorithm decomposes the input junction tree into P, P ≥ 1, subtrees of approximately equal workload. By workload, we mean the overall execution time for processing the cliques in a subtree.

In Algorithm 6.2, we use the following notations. The weight V_C is the estimated execution time for updating a clique C in junction tree J. Given the clique width w_C, the number of children d_C and the number of states of the random variables r, we have V_C = d_C · w_C^2 · r^{w_C + 1} [41]. The tolerance factor Δ ∈ [0, 1] is a threshold that controls the load balance among the subtrees. A small Δ results in better load balance, but can lead to a longer execution time for Algorithm 6.2. ch(C) represents the children of C. We use Path(r̃, C) to denote the path from the root r̃ to the parent of a clique C, and Subtree(C) the subtree rooted at C. We use R_C to represent the overall weight of the cliques in Path(r̃, C) and S_C the overall weight of the cliques in Subtree(C). Thus, (S_C + R_C) gives the estimated execution time for updating a decomposed junction tree rooted at r̃. We assume there are P processors and the decomposed junction tree hosted by the i-th processor is J_i.

Input: Junction tree J, clique weights V_C, number of processors P, tolerance factor Δ
Output: Subtrees J_i, i = 0, 1, ..., P − 1
1: let R_C = 0, S_C = 0, ∀C ∈ J; let r̃ be the root of J
{Find the weight of the path from r̃ to the parent of C, ∀C ∈ J}
2: Q = {r̃}
3: while Q ≠ ∅ do
4:   for all C ∈ Q do
5:     R_{C'} = R_C + V_C, ∀C' ∈ ch(C)
6:   end for
7:   Q = {C' : C' ∈ ch(C) and C ∈ Q}
8: end while
{Find the weight of the subtree rooted at C, ∀C ∈ J}
9: Q = {leaf cliques in J}; S_C = V_C, ∀C ∈ Q
10: while Q ≠ ∅ do
11:   for all C ∈ Q do
12:     S_C = V_C + Σ_{C' ∈ ch(C)} S_{C'}
13:   end for
14:   Q = {C : S_C = 0 and S_{C'} > 0, ∀C' ∈ ch(C)}
15: end while
{Decomposition}
16: G = {r̃}
17: repeat
18:   C_m = argmax_{C ∈ G, ch(C) ≠ ∅} (R_C + S_C)
19:   G = (G ∪ {C' : C' ∈ ch(C_m)}) \ {C_m}
20:   let G' = G, K_0 = K_1 = ··· = K_{P−1} = ∅
21:   while G' ≠ ∅ do
22:     let j = argmin_{i ∈ [0, P−1]} Σ_{C' ∈ K_i} (S_{C'} + R_{C'})
23:     C = argmax_{C ∈ G'} (R_C + S_C); K_j = K_j ∪ {C}; G' = G' \ {C}
24:   end while
25:   K_max = max_{i ∈ [0, P−1]} Σ_{C' ∈ K_i} (S_{C'} + R_{C'}); K_min = min_{i ∈ [0, P−1]} Σ_{C' ∈ K_i} (S_{C'} + R_{C'})
26: until G = {leaf cliques in J} or (K_max − K_min)/(K_max + K_min) < Δ
27: for all i = 0, 1, ..., P − 1 do
28:   J_i = J ∩ (Path(r̃, C) ∪ Subtree(C)), ∀C ∈ K_i
29: end for

Figure 6.2: Junction tree decomposition into subtrees

Algorithm 6.2 consists of three parts: the first part (Lines 2-8) populates R_C for each clique; the second part (Lines 9-15) populates S_C. In the third part, we use a heuristic to decompose the input junction tree into P subtrees of approximately equal weight. Note that (S_C + R_C) is monotone nonincreasing from the root to the leaf cliques. As the loop in Lines 17-26 iterates, more and more cliques are added to G in Line 19. In Lines 20-24, we assign the cliques to P groups using a heuristic, so that the overall weights of the decomposed junction trees are approximately equal. If the load balance is acceptable (Lines 25-26), the decomposed junction trees are generated using the cliques in Path(r̃, C) and Subtree(C). The complexity is O(1) for Lines 1-2, and O(N) for Lines 3-8 and for Lines 9-15. The time taken by Lines 16-26 depends on Δ and the input junction tree. In the worst case, each nonleaf clique is inserted into the set G once; thus, the complexity is also O(N). Lines 27-28 take O(P) time. Thus, the serial execution time of Algorithm 6.2 is O(N).
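As a concrete illustration of the first two parts of Algorithm 6.2, the sketch below computes R_C top-down and S_C bottom-up for a junction tree stored in arrays. The array-based tree representation and the names are assumptions made for this sketch only.

/* Sketch: compute path weights R[] (root to parent) and subtree weights
 * S[] for a junction tree whose cliques 0..n-1 are numbered in BFS
 * (topological) order, so parent[c] < c for every non-root clique c. */
void path_and_subtree_weights(int n, const int *parent, const double *V,
                              double *R, double *S)
{
    /* Top-down: R[c] = sum of V over cliques on the path root..parent(c). */
    R[0] = 0.0;                          /* clique 0 is the root r~ */
    for (int c = 1; c < n; c++)
        R[c] = R[parent[c]] + V[parent[c]];

    /* Bottom-up: S[c] = V[c] + sum of S over the children of c. */
    for (int c = 0; c < n; c++)
        S[c] = V[c];
    for (int c = n - 1; c >= 1; c--)
        S[parent[c]] += S[c];
}

Iterating in decreasing clique index realizes the leaf-to-root sweep of Lines 9-15 without maintaining the explicit frontier set Q, which is convenient when the BFS numbering is already available.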
6.1.3 Junction Tree Merging

A straightforward approach to junction tree merging is as follows: for each clique C, access the partially updated potential tables ψ_C from all the processors where C exists, and then combine them through a series of computations. Note that the potential table of a clique is generally much larger than the potential table of a separator. Thus, although this approach is feasible, it requires intensive data transfer among the processors, which can adversely affect performance.

An alternative approach to merging cliques is to utilize the separator potential tables instead of the large clique potential tables. Consider the root clique in Figure 6.1(c) as an example. Note that evidence flows into the root only through the separator between the root (i.e., Clique 0) and Clique 1. Thus, if we want to fully update ψ_C, we simply transfer the potential tables of all the separators adjacent to C. Note that the transferred data is much smaller than ψ_C. Let ψ*_C denote the fully updated potential table for C and ψ^i_C the partially updated potential table in the i-th subtree. ψ^i_{S_C(j)} represents the j-th separator of C in the i-th subtree. Let S̃^k_C denote all the separator potential tables required for obtaining ψ*_C on the k-th processor:

S̃^k_C = {ψ^i_{S_C(j)}},  0 ≤ i < P, i ≠ k,  0 ≤ j < d_C(i)     (6.1)

where d_C(i) is the number of children of C in subtree J_i. Let ψ̂^i_{S_C(j)} represent the potential table of separator S_C(j) obtained by marginalizing ψ_C before ψ^i_{S_C(j)} is used to update ψ_C. We have the following formula M(·) for merging any duplicated clique C in any subtree J_k:

ψ*_C = M(ψ^k_C, S̃^k_C) = ψ^k_C · Π_{i,j} ( ψ^i_{S_C(j)} / ψ̂^i_{S_C(j)} )     (6.2)

where 0 ≤ i < P, i ≠ k, 0 ≤ j < d_C(i).

During evidence collection, each processor checks whether any local clique has been duplicated onto other processors. For each duplicated clique, we send the local separators corresponding to the clique to the other processors, and likewise receive separators from them. Then, we use Eq. 6.2 to obtain the fully updated cliques. Note that we do not need to merge junction trees after evidence distribution.
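The merge in Eq. 6.2 reduces to elementwise updates on the clique table: each received separator is divided by the locally marginalized version of that separator, and the ratio is multiplied into the matching clique entries. A minimal sketch follows, assuming binary variables, single precision tables as in the experiments below, and an illustrative mapping vector in the style of Chapter 5.

/* Sketch of one factor of Eq. 6.2: psi_c is the local partial clique table
 * (2^w_c entries), psi_s a separator table received from another subtree,
 * psi_s_hat the same separator obtained by marginalizing the local psi_c,
 * and map[] the positions of the separator variables among the clique
 * variables. Names and layout are illustrative. */
void merge_one_separator(float *psi_c, int w_c,
                         const float *psi_s, const float *psi_s_hat,
                         const int *map, int w_s)
{
    size_t n = (size_t)1 << w_c;
    for (size_t t = 0; t < n; t++) {
        size_t s = 0;
        for (int j = 0; j < w_s; j++)
            s |= ((t >> map[j]) & 1u) << j;
        if (psi_s_hat[s] != 0.0f)        /* guard the table division */
            psi_c[t] *= psi_s[s] / psi_s_hat[s];
    }
}

Applying this routine once for every separator in S̃^k_C realizes the product in Eq. 6.2 for a given duplicated clique.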
6.1.4 Tradeoff between Startup Latency and Bandwidth Utilization Efficiency

During junction tree merging, the processors exchange locally updated separators by message passing. Assume the total amount of data transferred for junction tree merging is s and there are k communication steps executed in sequence. Given the aggregate network bandwidth B and the startup latency l for sending a message, an optimization approach for exchanging separators must minimize t_comm, where

t_comm = k · l + s/B     (6.3)

Since l and B are constants for a given system, we must minimize k and s. We consider the following two cases for message communication.

(1) For each duplicated clique C, a processor broadcasts the local separator potential tables of C to the other processors containing C, and receives the separator potential tables from those processors. Such an operation is known as all-gather in MPI [16]. In this case, all the transferred separators are necessary for junction tree merging. Thus, s is minimized. However, since a communication step is required for each duplicated clique, a decomposed junction tree with a large number of duplicated cliques can result in a large k in Eq. 6.3. For any junction tree with multiple leaf cliques, we always have k > 1.

(2) Each processor packages its local data into an array ordered by the target processor IDs and then, in a single communication step, sends the data in this array to the respective target processors. This operation is known as all-to-all in MPI [16, 26]. In this case, all the processors exchange data in a single communication step, i.e., k = 1. Thus, k is minimized. Note that each processor sends its local separators to all the other processors. At a specific receiver, the separators for merging cliques not hosted by that processor are not used; such separators are called invalid data. Therefore, due to the transfer of invalid data, s in Eq. 6.3 is not minimized.

We develop a technique to explore the tradeoff between s and k. First, we represent the clique allocation and duplication using a bitmap, i.e., a boolean matrix where each row corresponds to a subtree and each column corresponds to a clique. We label the cliques using a breadth-first search (BFS) order starting from the root of the junction tree. Figure 6.3(a) illustrates a sample bitmap for the subtrees shown in Figure 6.1(c). Since we are only concerned with the duplicated cliques during junction tree merging, we remove the columns corresponding to the unduplicated cliques from the bitmap. The result is called a reduced bitmap (see Figure 6.3(b)).

Figure 6.3: (a) A sample bitmap; (b) a reduced bitmap; (c) partitioned bitmaps.

Second, since the 0s in a bitmap correspond to invalid data transferred between processors, we partition a sparse bitmap into a set of smaller bitmaps using a heuristic, as shown in Figure 6.3(c). Each resulting bitmap is called a sub-bitmap. Any sub-bitmap with all entries equal to 0 can be eliminated. We define the density of a sub-bitmap as the percentage of its entries equal to 1. Given a sub-bitmap, the corresponding processors exchange data using an all-to-all communication step. The number of communication steps k is given by the number of sub-bitmaps. Denser sub-bitmaps lead to higher bandwidth utilization efficiency in all-to-all communication. Let d^i_j denote the clique degree of C_j in J_i, and |ψ^i_{S_j}| the size of a separator potential table in J_i. Assuming P_m processors are involved in the m-th sub-bitmap, the communication time is given by:

t_comm = k · l + (1/B) · Σ_{m=0}^{k−1} (P_m − 1) · max_{0 ≤ i < P_m} Σ_{j=0}^{|J_i|} ( d^i_j · |ψ^i_{S_j}| )     (6.4)

There are several ways to partition a bitmap. For example, we can construct a bipartite graph from the reduced bitmap and then identify the densely connected subgraphs by graph clustering [42]. A constraint on such clustering is that the duplicates of a clique must be assigned to the same sub-bitmap, so that we can obtain fully updated cliques. To the best of our knowledge, there is no known algorithm for graph clustering with such a constraint. We therefore design a heuristic using the characteristics of the junction tree decomposition discussed in Section 6.1.2. The heuristic partitions a given reduced bitmap into a set of sub-bitmaps while satisfying the above constraint.

We briefly discuss the proposed heuristic for bitmap partitioning. Since a bitmap represents subtrees decomposed from a given junction tree, a characteristic of the decomposition is that the root of the junction tree is duplicated in all subtrees. Thus, the first column in the corresponding reduced bitmap must be all 1s. The cliques close to the root are likely to be duplicated in many subtrees. Therefore, we can start from the first column of a bitmap to find a dense sub-bitmap. Given the size of a separator potential table |ψ_S| and the number of children of a clique d, we use the following metric to determine whether a sub-bitmap is dense enough:

(z · d · |ψ_S|) / (η · B) < l     (6.5)

where z is the number of entries equal to 0 and η, η > 0, is a constant called the tradeoff factor, with default value 1. We use the default value. Users can reduce η to improve the bandwidth utilization efficiency, or increase it to reduce the overhead due to startup latency. Note that (z · d · |ψ_S|)/B is an estimate of the time taken to transfer the invalid data; we compare this time with the latency l to determine whether the sub-bitmap is dense or not.

Using Eq. 6.5, we identify a dense sub-bitmap starting from the first column of a reduced bitmap Λ. Assume the sub-bitmap consists of the first j columns of Λ. The rest of the bitmap is partitioned into two sub-bitmaps, one consisting of the rows where the j-th column is 1, the other consisting of the remaining rows. We eliminate the columns and rows with all 0s from the resulting sub-bitmaps. If any sub-bitmap is not dense enough according to Eq. 6.5, we recursively partition it, until all sub-bitmaps are dense.
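The density test of Eq. 6.5 is simple enough to write out directly; the sketch below is an illustration, with units and parameter names chosen for clarity rather than taken from the implementation.

/* Sketch: decide whether a sub-bitmap is dense enough per Eq. 6.5.
 * zeros     - number of 0 entries in the sub-bitmap (z)
 * d         - clique degree; sep_size - separator table size in bytes
 * bandwidth - aggregate bandwidth B (bytes/s); latency - startup l (s)
 * eta       - tradeoff factor, default 1.0                             */
int is_dense(long zeros, int d, long sep_size,
             double bandwidth, double latency, double eta)
{
    double invalid_transfer_time =
        (double)zeros * d * sep_size / (eta * bandwidth);
    return invalid_transfer_time < latency;
}

When the test fails, the sub-bitmap is split as described above and the test is applied recursively to each resulting piece.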
6.1.5 Distributed Evidence Propagation Algorithm

Based on the techniques discussed in Sections 6.1.2, 6.1.3 and 6.1.4, we present the complete algorithm for distributed evidence propagation in junction trees in Algorithm 6.4. The input includes an arbitrary junction tree J, the clique weights (i.e., estimated execution times) V_C, system parameters such as the bandwidth B and the startup latency l, and a user threshold Δ ∈ [0, 1] for controlling the junction tree decomposition. We let Δ = 0.1 in our experiments. The output is the updated junction tree J, where the evidence originally on any clique has been propagated to the rest of the cliques.

Input: Junction tree J, clique weights V_C, ∀C ∈ J, number of processors P, tolerance factor Δ, bandwidth B, startup latency l, tradeoff factor η
Output: Updated junction tree J
{Initialization}
1: (J_0, ..., J_{P−1}) = Decompose(J, V_C, P, Δ) // Algorithm 6.2
2: Λ̃ = reduced bitmap for {J_0, J_1, ..., J_{P−1}}
3: M = BitmapPartition(Λ̃, B, l, η, J)
4: assign J_i and M to processor p_i, ∀ 0 ≤ i < P
{Parallel Evidence Propagation}
5: for processor p_i, i = 0, 1, ..., P − 1 pardo
6:   EvidenceCollect(J_i)
7:   for all Λ ∈ M do
8:     S_i = ∅
9:     for all cliques C in both J_i and Λ do
10:      S_i = S_i ∪ {ψ^i_{S_C(j)}}, 0 ≤ j < d_C(i)
11:    end for
12:    All-to-all: send S_i to p_j and receive Ŝ_j, ∀ 0 ≤ j < P, j ≠ i
13:    Obtain fully updated cliques using Ŝ_j // Eq. (6.2)
14:  end for
15:  EvidenceDistribute(J_i)
16: end for

Figure 6.4: Distributed Evidence Propagation

Algorithm 6.4 consists of initialization and parallel evidence propagation. The initialization can be done offline; thus, we only provide a sequential algorithm for these steps. In Line 1, we invoke Algorithm 6.2 from Section 6.1.2 to decompose J into P subtrees. The corresponding bitmap is created in Line 2 and partitioned in Line 3. Line 4 assigns each subtree to a separate processor and broadcasts M. In Lines 5-16, all processors run in parallel to process the subtrees. In Lines 6 and 15, each processor performs evidence collection and distribution in its local subtree using the sequential algorithm proposed in [25]. In Lines 7-14, for each sub-bitmap Λ ∈ M, we perform all-to-all communication among the processors corresponding to the sub-bitmap. In Line 12, each processor sends the relevant separators in S_i and receives the corresponding separators from the other processors. The received data is used to obtain fully updated cliques in Line 13.

6.2 Experiments

6.2.1 Facilities

We implemented the proposed method using the Message Passing Interface (MPI) on the High-Performance Computing and Communications (HPCC) Linux cluster at the University of Southern California (USC) [2]. The configuration of the HPCC cluster is discussed in Section 2.4.
6.2.2 Datasets

To evaluate the performance of our proposed method, we generated various junction trees by mimicking the junction trees converted from a real Bayesian network called the Quick Medical Reference decision theoretic version (QMR-DT) [27]. The number of cliques N in these junction trees was in the range 32768 to 65536. The clique degree d, i.e., the number of children of a clique, was in the range 1 to 8, except for the leaf cliques. The clique width W_C, i.e., the number of random variables per clique, varied from 8 to 20. The separator width W_S was from 2 to 5. The random variables in the junction trees were either binary or ternary (r = 2 or 3). When binary variables were used, the number of entries per clique potential table was between 256 and 1048576, while the number of entries per separator potential table was between 4 and 32. We used single precision floating point data for the potential tables. All the potential tables were aligned in memory to improve performance. We conducted experiments with 1 to 128 processors. For the data layout and the potential table organization, we followed the approach used in [41].

6.2.3 Baseline Methods

Serial is a sequential implementation of evidence propagation as discussed in Section 2.2. For a given junction tree, we start from the root to obtain the breadth-first search (BFS) order of the cliques. During evidence collection, we utilize Eq. (2.1) to update each clique according to the reversed BFS order. Then, during evidence distribution, the processor updates all the cliques again according to the BFS order.

Clique allocate is an asynchronous, message driven parallel implementation of evidence propagation. We mapped the cliques at each level of the input junction tree onto distinct processors so as to maximize the number of cliques that can be updated in parallel. We started with the leaf cliques during evidence collection. Once a clique had obtained messages from all its children, the clique was updated by the host processor. Similarly, we started with the root and updated all the cliques again during evidence distribution.

Data parallel is a parallel baseline method exploring data parallelism in exact inference. We traversed the input junction tree according to the BFS order and updated the cliques one by one. For each clique, the potential table was partitioned and each processor updated a part of the potential table using the parallel node level primitives proposed in [41]. The processors were synchronized after each clique was updated.

Chains is a parallel baseline method which decomposes a junction tree into a set of chains, each corresponding to a path from a leaf clique to the root [38]. We manually assigned the chains to the processors so that the workload was evenly distributed. The processors first updated the assigned chains in parallel and then exchanged the duplicated cliques in a global communication step. The junction tree was fully updated after the duplicated cliques were merged.
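For reference, the traversal performed by the serial baseline can be sketched as follows; the precomputed BFS ordering and the two update routines are assumed helpers standing in for the Eq. (2.1) updates, not the actual baseline code.

/* Sketch of the serial baseline: one pass in reversed BFS order for
 * evidence collection, one pass in BFS order for distribution. */
extern void update_collect(int clique);     /* Eq. (2.1), child -> parent */
extern void update_distribute(int clique);  /* Eq. (2.1), parent -> child */

void serial_propagate(const int *bfs, int n)   /* bfs[0] is the root */
{
    for (int i = n - 1; i >= 0; i--)   /* evidence collection  */
        update_collect(bfs[i]);
    for (int i = 0; i < n; i++)        /* evidence distribution */
        update_distribute(bfs[i]);
}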
6.2.4 Experimental Results

In Figure 6.5, we illustrate the experimental results of the proposed distributed evidence propagation method and the baseline methods discussed in Section 6.2.3 with respect to a junction tree with the following parameters: N = 65536, W_C = 15, r = 2, d = 4, W_S = 4. We obtained consistent results for other junction trees. We ran each experiment 10 times and calculated the average execution time and its standard deviation. As shown in Figure 6.5, the overhead due to parallelization was negligible for the proposed method, since its execution time on a single processor was almost the same as that of the baseline called serial. In contrast, the chains baseline showed significant overhead. Clique allocate showed higher overhead than our proposed method, since it involves more coordination among the processors. Since the potential tables in the input junction trees are much smaller than those in [41], the data parallel baseline method showed limited scalability.

From Figure 6.5(a), the improvement of the proposed distributed evidence propagation method over all the baseline methods was more than 28% when 128 processors were used. In Figure 6.5(b), we illustrate the execution time when more than 32 processors were used, so that we can take a closer look at the improvement of the proposed method.

Figure 6.5: Scalability of the proposed technique compared with baseline methods.

We illustrate the scalability of each stage of the proposed parallel evidence propagation algorithm (Lines 5-16 in Algorithm 6.4), i.e., evidence collection (Collect), all-to-all communication (AllToAll), junction tree merging (Merge) and evidence distribution (Distribute), in Figure 6.6. Figure 6.6(a) shows the results where we used a single bitmap; thus, there was only one all-to-all communication step. In Figure 6.6(b), we divided the bitmap into multiple sub-bitmaps; thus, we had multiple all-to-all communication steps, but the bandwidth utilization efficiency was improved. We controlled the number of sub-bitmaps by altering the tradeoff factor η in Eq. 6.5.

Figures 6.6(a) and (b) show that the impact of η was very limited when a small number of processors was used. As the number of processors increased, we observed the impact of η on the execution time of each stage. In Figure 6.7, we can observe that using a single all-to-all communication step results in less overhead due to startup latency.

Figure 6.6: Execution time of the various stages: (a) evidence propagation with a single bitmap; (b) evidence propagation with a set of sub-bitmaps.

In Figure 6.7, we show the normalized execution time for the results in Figures 6.6(a) and (b) when 128 processors were used. When a single bitmap was used, the all-to-all communication took a small percentage of the execution time, since we sent only one message; the overhead due to startup latency was thus minimized. However, since invalid data were transferred in the all-to-all communication, junction tree merging took additional time to identify the useful separators in the received data, as shown in Figure 6.7. When multiple sub-bitmaps were used, we improved the efficiency of junction tree merging, but the communication time was longer because of the increased overhead due to startup latency.

Figure 6.7: Normalized execution time of the various stages.

Finally, we evaluated our proposed method using various junction trees. In Figure 6.8, the label Original JTree represents the junction tree with the following parameters: N = 65536, W_C = 15, r = 2, d = 4, W_S = 4. We varied one parameter at a time, as shown in the labels, to illustrate the impact of each parameter on the execution time of our method. When we reduced the number of cliques in the junction tree from 65536 to 32768, the new junction tree exhibited scalability similar to the original tree, and the execution time was almost half of that for the original junction tree. When we used the junction tree with W_C = 10, the execution time was much smaller than that for the original junction tree. For the original junction tree, we must update 2^15 entries for each clique potential table.
However, for the junction tree with W_C = 10, we update only 2^10 entries, 1/32 of that for the original junction tree. Thus, it is reasonable that the execution time was much lower. When W_S was reduced from 4 to 2, the impact was negligible. When we increased d from 2 to 4, we observed in Figure 6.8 that the scalability was adversely affected: the execution time increased slightly as the number of processors increased. The reason is that a large d results in more separators being transferred.

The experimental results in this section show that the proposed method scales well with respect to almost all the junction trees. In addition, compared with several baseline algorithms for evidence propagation, our method shows superior performance in terms of total execution time.

Figure 6.8: Impact of various junction tree parameters on overall execution time.

Chapter 7

Exact Inference on Heterogeneous Processors

Multicore/manycore processors can be either homogeneous or heterogeneous. The Cell Broadband Engine (Cell BE) processor, jointly developed by IBM, Sony and Toshiba, is a typical heterogeneous chip with one PowerPC control element (PPE) coupled with eight independent synergistic processing elements (SPEs) [4]. In this chapter, we investigate techniques for exact inference on the Cell and analyze their complexity using a model of computation [9], which can also be used for other heterogeneous computing systems.

7.1 Centralized Scheduling for Exact Inference on Cell BE Processors

7.1.1 Tasks and Task Partitioning

In our context, a task (denoted T) is defined as the computation that updates a clique using the input separators and then generates the output separators. Each clique in the junction tree is related to two tasks, one for evidence collection and the other for evidence distribution. The data required by a task consist of the input separators, a (partial) clique potential table and the output separators (see Figure 7.1 (a)). In evidence collection, the input separators for clique C are the separators between C and its children, i.e., S_{ch_i(C)} for all i. The output separator is the separator between C and its parent, S_{pa(C)}. In evidence distribution, the input and output separators are switched, as shown in Figure 7.1.

Figure 7.1: (a) Illustration of the relationship between a clique C and the two tasks related to C; (b) partitioning of a large task into a series of small tasks.

We propose a scheduler that checks the size of the tasks to be executed. The scheduler partitions each large task into a series of small tasks that fit in the local store. Assume task T involves a large potential table ψ_C that is too large to fit in the local store. We evenly partition ψ_C into k portions (k > 1) and ensure that each portion can be completely loaded into a single SPE. For each portion, the scheduler creates two small tasks. The reason why we need two tasks to process a portion is addressed in Section 7.1.3.1. Therefore, the scheduler partitions task T into two sets of small tasks, each set having k small tasks. Every small task contains n/k entries of ψ_C, where n is the size of ψ_C. We denote the 2k small tasks T_1, T_2, ..., T_{2k}. We call T_1, T_2, ..., T_k the type-a tasks of the original task T; T_{k+1}, T_{k+2}, ..., T_{2k} are called the type-b tasks of T (see Figure 7.1 (b)). The original task T is called a regular task. The input separators of the type-a tasks are identical to those of the original task T. The output separators of the type-a tasks are called temporary separators. The temporary separators of all type-a tasks must be accumulated to form the input separators of each type-b task.
Thus, no type-b task can be scheduled for execution until all the corresponding type-a tasks have been processed. Separator accumulation is defined as calculating the sum of the corresponding entries of all input separators. The output separators of all the type-b tasks are also accumulated to form the final output separators (the output separators of the original task T). Figure 7.1 (b) shows a sample large task T and the 2k small tasks partitioned from T.

7.1.2 Task Dependency Graph

To schedule tasks for both evidence collection and distribution, we create a task dependency graph G according to the given junction tree (see the left-hand side of Figure 7.2 (a)). Each node in G denotes a task. Each edge in G indicates the dependency between two tasks. A task cannot be scheduled for execution until all the tasks it depends on have been processed. Note that there are two subgraphs in G, which are symmetric with respect to the dashed line in Figure 7.2 (a): the upper subgraph is the given junction tree with all edges directed from children to their parents; the lower subgraph is the given junction tree with all edges directed from parents to their children. The upper and lower subgraphs correspond to evidence collection and evidence distribution, respectively. Thus, each clique except the root in the junction tree is related to two nodes in the task dependency graph. Note that we do not duplicate potential tables during the construction of the task dependency graph, even though each clique corresponds to two tasks; we only store the location of the potential table in each task.

Figure 7.2: (a) An example of a task dependency graph for a junction tree and the corresponding task dependency graph with partitioned tasks. (b) Illustration of the scheduling scheme for exact inference.

As some tasks involving large potential tables are partitioned into series of small tasks, the task dependency graph G is modified accordingly. The scheduler is in charge of task partitioning and task dependency graph modification at runtime. We assume the bold nodes in G (see the left-hand side of Figure 7.2 (a)) involve large potential tables. After task partitioning, the bold nodes are replaced by the sets of small nodes shown in the dashed boxes (see the right-hand side of Figure 7.2 (a)). Each node in a dashed box denotes a small task partitioned from the original task. All the type-a tasks are connected to the parent of the original task. Each type-b task depends on all the type-a tasks. The children of the original task depend on all the type-b tasks.
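The two mirrored subgraphs can be built mechanically from the parent array of the junction tree. The sketch below constructs the unpartitioned graph (before any runtime task partitioning), under the assumptions that cliques are numbered 0..N-1 with clique 0 as the root and that add_edge() is a helper backed by whatever graph structure the scheduler uses; both assumptions are illustrative.

/* Sketch: build the task dependency graph of Section 7.1.2 from a junction
 * tree given as a parent array (parent[0] = -1 for the root).
 * Task IDs: 0..n-1 are collection tasks; n..2n-2 are distribution tasks
 * for cliques 1..n-1 (the root needs no separate distribution task).
 * add_edge(u, v) records that task v depends on task u. */
extern void add_edge(int from, int to);

void build_task_graph(int n, const int *parent)
{
    for (int c = 1; c < n; c++) {
        int coll_c = c;             /* collection task of clique c    */
        int coll_p = parent[c];     /* collection task of its parent  */
        int dist_c = n - 1 + c;     /* distribution task of clique c  */

        add_edge(coll_c, coll_p);   /* parent collects after child    */

        /* distribution of c waits for distribution of its parent,
         * or for the root's collection task if the parent is the root */
        int dep = (parent[c] == 0) ? 0 : n - 1 + parent[c];
        add_edge(dep, dist_c);
    }
}

This yields the 2N − 1 task nodes described above, with the root's collection task acting as the junction between the collection and distribution subgraphs.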
7.1.3 Dynamic Partitioning and Scheduling

Scheduling task dependency graphs has been extensively studied (see, for example, [21]). In this work, we use an intuitive but efficient method to schedule tasks to the SPEs. The input to the scheduler is an arbitrary junction tree. Initially, the scheduler constructs a task dependency graph G according to the given junction tree. As discussed in Section 7.1.2, it is straightforward to construct G from the structure of the given junction tree.

The components of the scheduler are shown in Figure 7.2 (b). The issue set S_I is a set of tasks that are ready to be processed, i.e., for any task in S_I, the tasks it depends on in G have been processed. The preload list S_L consists of all the tasks that are not yet ready to be processed. The partitioner is a crucial component of the proposed scheduler, as it dynamically exploits parallelism at a finer granularity; its function is to check and partition tasks. In Figure 7.2 (b), P_1, P_2, ..., P_P denote the processing units, i.e., the SPEs in the Cell. The issuer selects a task from S_I and allocates an SPE to it. Both the solid arrows and the dashed arrows in Figure 7.2 (b) illustrate the data flow paths in the scheduler.

The scheduler issues tasks to the processing units as follows. Initially, S_I contains the tasks related to the leaf cliques of the junction tree; these tasks have no parent in the task dependency graph G. All other tasks in G are placed in S_L. Each task in S_L has a property called the dependency degree, which is defined as the number of parents of the task in G. The issuer selects a task from S_I and issues it to an available processing unit, repeating as long as idle processing units exist. Several strategies for selecting tasks have been studied [10, 18]. We use a straightforward strategy where the task in S_I with the largest number of children is selected. When a processing unit P_i is assigned task T, it loads the data relevant to T (the input separators, the clique potential table and the output separators) from the main memory. Once P_i completes task T, it notifies S_L and waits for the next task. S_L receives the notification from P_i and decreases the dependency degree of all tasks that directly depend on T. If T̃ depends on T, and the dependency degree of T̃ becomes 0 after T is processed, then T̃ is moved from S_L to S_I. When T̃ is moved from S_L to S_I, the partitioner checks whether the data of T̃ fit in the local store, and partitions T̃ into a set of smaller tasks if necessary. If T̃ is partitioned, all the type-a tasks generated from T̃ are assigned to S_I, while the type-b tasks are placed in S_L.

A feature of the proposed scheduler is that it dynamically exploits parallelism at finer granularity. The scheduler tracks the status of the processing units and the size of S_I. If several processing units are idle and the size of S_I is smaller than a given threshold, the partitioner picks the largest regular task in S_I and partitions it into smaller tasks, even if the task would fit in the local store. The reason is that a small S_I cannot provide enough parallel tasks to keep all the SPEs busy, and idle SPEs adversely affect performance. In Figure 7.2 (b), the dashed arrows illustrate that a regular task is taken from S_I and partitioned at runtime. After partitioning the task, the type-a tasks are sent back to S_I while the type-b tasks are placed in S_L.

7.1.3.1 Potential Table Organization and Efficient Primitives

We utilize the potential table organization discussed in Section 5.1 on the Cell BE platform. Note that the size of a potential table increases dramatically with the clique width and the number of states of the variables. However, the local store of each SPE in the Cell is limited to 256 KB. In addition, using double buffering to overlap data transfer and computation makes the available local store even more limited. For example, assuming single precision and binary random variables, the width of a clique that fits in the local store can be at most 14 (14 = ⌊log_2(128 KB / 4)⌋). If the potential table of a clique is too large to fit in the local store, the scheduler presented in Section 7.1 partitions the clique into n/k portions, where n is the size of the potential table and k is the size of a portion that fits in the local store. Each portion is processed by an SPE. However, as marginalization must sum up entries that may be distributed across all portions, the partial marginalization results from each portion must be accumulated. According to Algorithm 2.2, accumulation is needed after Lines 5 and 10. Thus, for each portion of the potential table, we create two small tasks (a type-a and a type-b task). Accumulation is applied after the type-a tasks and again after the type-b tasks are processed.
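The separator accumulation defined in Section 7.1.1 is an elementwise sum over the temporary separators; a minimal sketch, with illustrative names and single precision data as used on the Cell, follows.

/* Sketch: accumulate the temporary separators produced by the k type-a
 * tasks into one input separator for the type-b tasks (elementwise sum,
 * per the definition of separator accumulation in Section 7.1.1). */
void accumulate_separators(float *acc, float *const *partial,
                           int k, int sep_entries)
{
    for (int e = 0; e < sep_entries; e++) {
        float sum = 0.0f;
        for (int i = 0; i < k; i++)
            sum += partial[i][e];
        acc[e] = sum;
    }
}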
7.1.3.2 Data Layout for Optimizing Cell Performance

To perform exact inference in a junction tree, we must store the following data in memory: the structure of the junction tree, the clique potential table for each clique, the separator on each edge, and the properties of the cliques and separators (such as the clique width, table size, etc.). For the Cell, an optimized data layout helps both the data transfer between main memory and the local stores, and the vectorized computation in the SPEs.

The data that an SPE must transfer between its local store and the main memory depend on the direction of evidence propagation. Figure 7.3 (c) shows the components related to evidence propagation in a clique. In evidence collection, cliques ch_1(C) and ch_2(C) update separators S_{ch_1(C)} and S_{ch_2(C)}, respectively. Then, clique C is updated using S_{ch_1(C)} and S_{ch_2(C)}. Lastly, C updates S_{pa(C)}. Notice that, according to Eq. (2.1), updating the clique potential table ψ_C needs both ψ_C and the updated separator, while updating the separator potential table ψ_{S_{pa(C)}} needs only the updated ψ_C. In evidence distribution, ψ_C is updated using ψ_{S_{pa(C)}}, and then the updated ψ_C is used to propagate evidence to S_{ch_1(C)} and S_{ch_2(C)}.

Figure 7.3: (a) Data layout for the junction tree in main memory; (b) data layout for each clique; (c) relationship between the layout and the partial junction tree structure.

We store the junction tree data in the main memory (Figure 7.3). The SPE issues DMA commands to transfer data between the local store and main memory. To decrease the DMA transfer overhead, we minimize the number of DMAs issued by each SPE. If all the data required for processing a clique are stored in contiguous locations in main memory, the SPE needs to issue only one DMA command (or a DMA list, if the data size is larger than 16 KB) to load or store all the data. However, note that a separator is shared by two cliques. If all the data needed for processing a clique were stored in contiguous locations, the separators would have to be duplicated; in addition, maintaining the consistency of the copies of the separators would cause extra memory accesses. Considering that a clique has only one parent but possibly several children, we store a clique together with all its child separators, and leave the parent separator with the parent of the clique. Each clique corresponds to a block in Figure 7.3 (a), and each block consists of a clique potential table, the child separators, and the properties of the clique and its separators (Figure 7.3 (b)). All blocks are 16-byte aligned for DMA transfer. We also insert padding to ensure that each potential table within a block is 16-byte aligned. Such alignment benefits the computation in the SPEs [4]. When the PPE schedules a clique C to an SPE, the PPE sends the starting address of the data block of clique C and the starting address of the parent separator to the SPE.
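The block layout in Figure 7.3 (b) suggests a structure like the following; the field names and the fixed-size header are illustrative assumptions, with the real code computing offsets so that every table within a block is 16-byte aligned for DMA and SIMD use.

#include <stdint.h>

/* Sketch of one clique block in main memory (Figure 7.3 (b)): a header of
 * clique properties followed by the clique potential table and the child
 * separator tables, all 16-byte aligned so an SPE can fetch the block
 * with a single DMA command (or a DMA list if it exceeds 16 KB). */
typedef struct {
    uint32_t clique_width;    /* w_C                                 */
    uint32_t num_children;    /* d_C                                 */
    uint32_t table_entries;   /* entries in the clique potential table */
    uint32_t sep_entries;     /* entries per child separator table   */
    /* padding to a 16-byte boundary, then:
     *   float psi_c[table_entries];               clique potential table
     *   float psi_sep[num_children][sep_entries]; child separators
     * The parent separator lives in the parent clique's block. */
} CliqueBlockHeader;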
7.1.3.3 Complete Algorithm for Parallel Exact Inference

The proposed parallel exact inference algorithm consists of task dependency graph scheduling (Section 7.1) and potential table updating (Section 7.1.3.1). On the Cell, the PPE is responsible for the task scheduling, and the SPEs perform the potential table updates.

Using the scheduling definitions and notations in Section 7.1, we present the task dependency graph scheduling in Algorithm 7.4.

Input: Junction tree J, sizes of clique potential tables and separators, thresholds δ_T, δ_S, number of SPEs P
Output: Scheduling sequence for exact inference
1: Generate task dependency graph G from (rerooted) J
2: S_I = ∅; S_L = {all tasks in G}
3: while S_I ∪ S_L ≠ ∅ do
4:   for i = 1 to min(|S_I|, P) do
5:     if SPE_i is idle then
6:       T = get a task from S_I; S_I = S_I \ T
7:       assign task T to SPE_i
8:     end if
9:   end for
10:  if |S_I| < δ_S and S_I contains regular tasks then
11:    T = the largest regular task in S_I
12:    partition T into small task sets T_type-a, T_type-b
13:    S_I = S_I ∪ T_type-a; S_L = S_L ∪ T_type-b
14:  end if
15:  for T ∈ {completed tasks in all SPEs} do
16:    for T̃ ∈ {children of T} do
17:      decrease the dependency degree of T̃ by 1
18:      if dependency degree of T̃ = 0 then
19:        if |T̃| < δ_T then
20:          S_L = S_L \ {T̃}; S_I = S_I ∪ {T̃}
21:        else
22:          partition T̃ into small task sets T̃_type-a, T̃_type-b
23:          S_I = S_I ∪ T̃_type-a; S_L = S_L ∪ T̃_type-b
24:        end if
25:      end if
26:    end for
27:  end for
28: end while

Figure 7.4: PPE: task dependency graph scheduling

Lines 1 and 2 of Algorithm 7.4 are initialization steps. Since all tasks in S_I are ready to be processed, Lines 4-9 issue the tasks to the available SPEs. Lines 10-14 check whether the size of S_I is less than a given threshold δ_S, which is a small constant. If |S_I| < δ_S, some SPEs may remain idle, since there are not enough parallel tasks. In Section 7.1.4, we let δ_S be the number of SPEs. If |S_I| < δ_S, the largest regular task in S_I is partitioned into a set of small tasks, so that there are enough parallel tasks for the SPEs. For each completed task in the SPEs, Line 15 adjusts the dependency degrees of the child tasks. If the dependency degree of a child task becomes 0, the task is moved from S_L to S_I (Lines 19-24). In Line 19, δ_T is a constant ensuring that any task smaller than δ_T fits in the local store. If a task to be moved is too large to fit in the local store, the scheduler partitions it (Lines 22-23).

We analyze the complexity of Algorithm 7.4 using the model of computation presented in [9], where the computation complexity for the PPE is denoted T_C^(PPE). In Line 1 of Algorithm 7.4, rerooting J is an optional preprocessing step; its complexity is addressed in Section 4.2. Line 1 first takes O(N) time to copy the junction tree structure and reverse the direction of the edges. Note that there are (N − 1) edges in a junction tree with N cliques. Then, Line 1 connects the original structure of the junction tree and the structure with reversed edges to form the task dependency graph G, which takes O(1) time. Line 2 takes O(1) time to initialize S_I and O(N) time to place the 2N − 1 tasks in S_L. In Line 3, the size of S_L ∪ S_I is bounded by N r^w / δ_m. Lines 4-9 form a loop with O(P) iterations, where P is the number of SPEs. In each iteration, Lines 6 and 7 each take constant time for computation; however, data transfer is invoked by Line 7, which is addressed in the analysis of Algorithm 7.5. Line 10 takes O(1) time to check whether the condition is satisfied. Selecting the largest regular task in Line 11 takes O(1) time if the tasks are organized as a heap according to their sizes. Assume tasks smaller than δ_m cannot be further decomposed (δ_m ≤ δ_T); then a task T is split into at most 2|T|/δ_m small tasks. Thus, Line 12 takes O(|T|/δ_m) = O(r^w/δ_m) time to decompose T into the small tasks, and Line 13 takes O(|T|/δ_m) time to insert the small tasks into the task dependency graph. Line 15 checks whether the current tasks of the P SPEs have been completed. Assuming each task has at most k children, Lines 15-27 consist of O(Pk) iterations. For each completed task, the dependency degree of each child is decreased by 1 in constant time. Line 20 takes log|S_L| time to remove a task from S_L, and log|S_I| time to insert a task into S_I. The size of S_L is bounded by O(N).
O( Σ_{i=1}^{N} Π_{j=1}^{w_{C_i}} r_{i,j} / δ_m ) = O(N r^w / δ_m) is a loose upper bound on the size of S_I, where r is the maximum number of states of the random variables and w is the maximum clique width. Line 22 takes O(|T̃|/δ_T) time to decompose T̃; note that |T̃| ≤ r^w. Similar to Line 13, Line 23 takes O(|T̃|/δ_T) = O(r^w/δ_T) time to insert the small tasks into the task dependency graph. Let t denote the maximum number of small tasks partitioned from a regular task; an upper bound on t is given by O(r^w/δ_m). Based on the above analysis, the computation complexity of Algorithm 7.4 is given by:

T_C^(PPE) = O( 2N + (N r^w/δ_m) · ( P(r^w/δ_m + 1) + 2r^w/δ_m + Pk(1 + r^w/δ_T + log(N r^w/δ_T)) ) )
          = O( Nt(Pt + 2t + Pk(t + log Nt)) )
          = O( NPkt(t + log N) ) ≈ O( Pkt N log N )     (7.1)

where we assume N ≫ t for the sake of simplification.

Using the data layout in Section 7.1.3.2, we present an algorithm for updating potential tables (Algorithm 7.5). Two computation kernels for potential table updating are shown in Algorithms 7.6 and 7.7. Double buffering is used in Algorithm 7.5 to overlap computation and data transfer: while loading the data relevant to T̃, we perform computations for task T. Lines 5-7 compute ψ_temp for regular and type-a tasks; ψ_temp is the output of a type-a task, but an intermediate result for a regular task (see Lines 8-12 of Algorithm 7.5). Lines 13-19 process T using one of the two computation kernels, depending on the status of clique C. Line 21 notifies the PPE to prepare a new task for this SPE.

Input: task T and its relevant data ψ_in, ψ_C, ψ_out; task T̃ and its relevant data
Output: updated data for T and T̃
1: T = receive a task from the PPE
2: while T ≠ ∅ do
3:   T̃ = receive a new task from the PPE
4:   wait for the loading of ψ_in and ψ_C for T to complete
5:   if T is a regular task or a type-a task then
6:     marginalize ψ_C to obtain ψ_temp
7:   end if
8:   if T is a type-a task then
9:     ψ_out = ψ_temp
10:  else if T is a type-b task then
11:    ψ_temp = ψ_in
12:  end if
13:  if T is a regular task or a type-b task then
14:    if clique C in T has not been updated then
15:      update ψ_C and ψ_out using the computation kernel for evidence collection (Algorithm 7.6)
16:    else
17:      update ψ_C and ψ_out using the computation kernel for evidence distribution (Algorithm 7.7)
18:    end if
19:  end if
20:  store the updated ψ_C and ψ_out to the main memory
21:  notify the PPE that task T is done
22:  let T = T̃
23: end while

Figure 7.5: SPE: process tasks using double buffering

In Algorithm 7.5, assume a regular task is decomposed into t small tasks on average and all tasks are distributed evenly to the SPEs. Therefore, the size of each task can be represented by O( Σ_{i=1}^{N} Π_{j=1}^{w_{C_i}} r_{i,j} / (Nt) ) = O(r^w/t), and each SPE processes Nt/P tasks, where w is the maximum clique width and r is the maximum number of states of the random variables. Line 1 transfers O(r^w/t) data from main memory to the local store of each SPE. Lines 2-23 form the main body of the algorithm, which consists of a loop with (Nt/P − 1) iterations. Line 3 starts the data transfer for the next task to be processed on the SPE; the data size for T̃ is also O(r^w/t). Line 6 marginalizes a (partial) potential table ψ_C if the condition in Line 5 is satisfied. Using the analysis of the computation complexity of the node level primitives in Section 2.2, the computation complexity of Line 6 is O(|ψ_C| w) = O(r^w w / t). Regardless of the condition in Line 8, performing either Line 9 or Line 11 takes constant time. Lines 15 and 17 invoke the computation kernels in Algorithms 7.6 and 7.7; we analyze their complexity later in this section. Line 20 transfers O(r^w/t) data from the local store to the main memory. Note that the size of ψ_S is ignored because it is much smaller than that of ψ_C. Lines 21 and 22 take constant time.
Therefore, the number of DMA requests is T_D = O(2Nt/P + 1) = O(Nt/P) and the size of the transferred data is O(N r^w / P). The total computation time for Algorithm 7.5 is given by:

T_C^(SPE) = O( (Nt/P) (|ψ_C| w + 1 + |ψ_C| k w) ) = O( (Nt/P) · (r^w/t) · k w ) = O( N k w r^w / P )     (7.2)

Two computation kernels (Algorithms 7.6 and 7.7) are used for evidence collection and evidence distribution, respectively. The two kernels apply Eq. 2.1 to the given task. Lines 1-9 of Algorithm 7.6 absorb evidence from each child of clique C and update ψ_C. Lines 4-7 implement potential table multiplication using mapping vectors (see details in Section 7.1.3.1). Lines 10-13 marginalize the updated ψ_C. Benefiting from the data organization presented in Section 7.1.3.1, potential table division (Line 2) and marginalization (Line 12) are simplified to vectorized algebraic operations, which perform efficiently on SPEs as well as on other SIMD machines. Algorithm 7.7 is similar to Algorithm 7.6, where potential table division (Line 1) and multiplication (Line 3) are simplified to vectorized algebraic operations. Mapping vectors are utilized to implement potential table marginalization (Lines 6-12).

Input: input separators ψ_{in_i}, (partial) clique potential table ψ_C, output separator ψ_out, temporary separator ψ_temp, index offset t_δ, mapping vectors M_{in_i}
Output: updated ψ_C and ψ_out
1: for i = 1 to (number of children of C) do
2:   ψ_{in_i}(1 : |ψ_{in_i}|) = ψ_{in_i}(1 : |ψ_{in_i}|) / ψ_temp(1 : |ψ_{in_i}|)
3:   for t = 0 to |ψ_C| − 1 do
4:     Convert t + t_δ to S = (s_1 s_2 ··· s_w)
5:     Construct S̃ = (s̃_1 s̃_2 ··· s̃_{|in|}) from S using mapping vector M_{in_i}
6:     Convert S̃ to t̃
7:     ψ_C(t) = ψ_C(t) ∗ ψ_{in_i}(t̃)
8:   end for
9: end for
10: ψ_out(1 : |ψ_out|) = 0⃗
11: for i = 1 to |ψ_C| step |ψ_out| do
12:   ψ_out(0 : |ψ_out|) = ψ_out(0 : |ψ_out|) + ψ_C(i : i + |ψ_out|)
13: end for

Figure 7.6: Computation kernel of evidence collection

Input: input separator ψ_in, (partial) clique potential table ψ_C, output separators ψ_{out_i}, temporary separator ψ_temp, index offset t_δ, mapping vectors M_{out_i}
Output: updated ψ_C and ψ_{out_i}
1: ψ_in(1 : |ψ_in|) = ψ_in(1 : |ψ_in|) / ψ_temp(1 : |ψ_in|)
2: for i = 1 to |ψ_C| step |ψ_in| do
3:   ψ_C(i : i + |ψ_in|) = ψ_C(i : i + |ψ_in|) ∗ ψ_in(1 : |ψ_in|)
4: end for
5: for i = 1 to (number of children of C) do
6:   ψ_{out_i}(1 : |ψ_{out_i}|) = 0⃗
7:   for t = 0 to |ψ_C| − 1 do
8:     Convert t + t_δ to S = (s_1 s_2 ··· s_w)
9:     Construct S̃ = (s̃_1 s̃_2 ··· s̃_{|out_i|}) from S using mapping vector M_{out_i}
10:    Convert S̃ to t̃
11:    ψ_{out_i}(t̃) = ψ_{out_i}(t̃) + ψ_C(t)
12:  end for
13: end for

Figure 7.7: Computation kernel of evidence distribution
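The index conversions in Lines 4-6 of Algorithm 7.6 (and Lines 8-10 of Algorithm 7.7) can be written out as follows; this is a sketch for variables with r_j states each, with illustrative names, where t would be the offset-adjusted index t + t_δ when operating on a partial table.

/* Sketch of the index conversion used by the kernels: expand a linear
 * clique-table index t into the state string S = (s_1 ... s_w), pick out
 * the separator variables through the mapping vector, and re-linearize.
 * r[] holds the number of states of each clique variable; map[] holds
 * the positions of the separator variables among the clique variables. */
long clique_index_to_sep_index(long t, const int *r, int w,
                               const int *map, int w_s)
{
    int s[64];                     /* state string; assumes w <= 64 */
    for (int j = 0; j < w; j++) {  /* linear index -> state string  */
        s[j] = (int)(t % r[j]);
        t /= r[j];
    }
    long t_sep = 0, stride = 1;    /* state string -> separator index */
    for (int j = 0; j < w_s; j++) {
        t_sep += s[map[j]] * stride;
        stride *= r[map[j]];
    }
    return t_sep;
}

Each of the two conversions is a mixed-radix decode or encode, which is the source of the 3w-per-index cost counted in the analysis below.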
We analyze Algorithm 7.6 using the model proposed in [9]. Line 1 is a for-loop with O(k) iterations, where k is the maximum number of children of a clique. Line 2 implements table division as a vector division, taking O(|ψ_S|) = O(r^{w_S}) time, where w_S is the maximum separator width and r is the maximum number of states of the random variables. Using SIMD operations, the execution time of Line 2 can be reduced by a factor of 4; however, this does not affect the complexity. Lines 3-8 form a nested loop with O(|ψ_C|) = O( Π_{j=1}^{w_C} r_j ) = O(r^w) iterations, where w is the maximum clique width. In each iteration, Line 4 takes 3w time to convert an index to a state string, since it computes w_C elements, each involving three scalar operations: a division, a modulo and an update of the running product ( Π_i r_i ) (see [37]). Line 5 takes O(w_S) time to obtain S̃ from S. Line 6 is the reverse operation of Line 4, which also takes 3w time. Line 7 takes O(1) time to perform a multiplication of two scalars. Line 10 takes constant time to initialize a vector. Lines 11-13 form another loop with O(|ψ_C|/|ψ_out|) = O(r^w / r^{w_S}) = O(r^{w − w_S}) iterations; the computation complexity of Lines 11-13 is O(r^w). Thus, the total computation complexity of Algorithm 7.6 is O(k r^w w).

For Algorithm 7.7, Line 1 takes O(|ψ_S|) = O(r^{w_S}) time to perform the vector division. Lines 2-4 take O(|ψ_C|) = O(r^w) time to perform the vector multiplication. The loop starting at Line 5 has k iterations. Line 6 takes constant time to initialize ψ_{out_i}. Lines 7-12 form a loop with O(|ψ_C|) = O(r^w) iterations. Line 8 takes 3w time, similar to Line 4 of Algorithm 7.6. Line 9 takes O(w_S) time to obtain S̃. Line 10 takes O(w_S) time, and Line 11 takes O(1) time to perform a scalar addition. Based on the above analysis, the total computation complexity of Algorithm 7.7 is O(k r^w w), the same as that of Algorithm 7.6.

7.1.4 Experiments

7.1.4.1 Facilities

We conducted experiments on an IBM BladeCenter QS20 [3]. The IBM BladeCenter QS20 has two 3.2 GHz Cell BE processors, introduced in Section 2.4. The two processors share a 1 GB main memory. We used one Cell processor to measure the performance. The IBM BladeCenter QS20 ran Fedora Core 7, an RPM-based general purpose Linux distribution developed by the community-supported Fedora Project and sponsored by Red Hat. The Cell BE SDK 3.0 developer package was installed for our experiments. We compiled the code using the gcc provided with the SDK, with level 3 optimization.

7.1.4.2 Datasets

In our experiments, we used junction trees of various sizes to analyze and evaluate the performance of our method. The junction trees were generated using the Bayes Net Toolbox [28]. The first junction tree (J1) had 1024 cliques. The average clique width was 10. The average degree of each clique was 3. Thus, each clique potential table had 1024 entries, and the size of the separators varied from 2 to 32 entries. The second junction tree (J2) had 512 cliques and an average clique width of 8; the average clique degree was also 3. Thus, each clique potential table had 256 entries. The third junction tree (J3) had 1024 cliques. Each clique contained quaternary variables, and the average clique width was 5. Therefore, each potential table had approximately 1024 entries. In our experiments, we used single precision floating point numbers to represent the probabilities and potentials.

7.1.5 Performance Evaluation of the Proposed Method

In Figure 7.9, we show the measured execution time for exact inference on the Cell BE processor, using the junction trees J1, J2 and J3 defined above; the parameters of these junction trees are given in Figure 7.9. In order to show the scalability of the proposed method with respect to the various junction trees, we normalized the execution time so that the sum of all the bars for each junction tree is 1. Using the results in Figure 7.9, we calculated the speedup of exact inference for all the input junction trees. The observed and ideal speedups are given in Figure 7.10. The speedups for the various junction trees were very similar. The ideal speedup in Figure 7.10 increases linearly with the number of SPEs and is the theoretical upper bound on the observed speedup. To illustrate the load balancing achieved by the proposed algorithm, we measured the workload of each SPE (Figure 7.11). Unlike Figure 7.9, where the bars represent normalized execution time, each bar in Figure 7.11 indicates the normalized workload. Figure 7.11 (a) shows the results without load balancing (see the last paragraph of Section 7.1.3).
Figure 7.11 (b) shows the results with load balancing. In each subfigure, for ease of comparison, we normalized the heaviest workload to 1. In our implementation, the overhead of scheduling was much lower (less than 1%) than that of the potential table computation. The scheduling algorithm was executed on the PPE, while the potential table computation was performed on one or more SPEs. Thus, the time taken for scheduling was partially overlapped with that for potential table computation.

The experimental results shown in this section illustrate the superior performance of the proposed algorithm and implementation on the Cell. Compared with the results shown in Figure 5.14, our experimental results exhibited linear speedup and a larger scalability range.

Figure 7.8: Normalized computational complexity of the critical path for various junction trees.

(N, w, r)       1 SPE   2 SPEs   4 SPEs   6 SPEs   8 SPEs
(512, 8, 2)     1       1.994    3.937    5.742    6.670
(1024, 10, 2)   1       1.992    3.925    5.733    6.984
(1024, 5, 4)    1       1.987    3.912    5.644    6.895

Table 7.1: Speedup with respect to various numbers of SPEs

From Table 7.1 and Figure 7.9, we observed that, even though we used junction trees with various parameters, the experimental results exhibited very stable speedup for all types of input. We can see from Figure 7.10 that, for input junction trees with various parameters, the speedups achieved in our experiments were nearly equal, suggesting that the speedup obtained was not sensitive to the parameters of the input junction trees. The proposed method exhibited similar performance in terms of speedup for all types of inputs. In addition, all the speedup curves in Figure 7.10 were close to the ideal speedup, which is the theoretical upper bound on the observed speedup. When the number of SPEs used in the experiments was less than 5, the observed speedup was almost the same as the ideal speedup; when 6 or more SPEs were used, the speedup was still very close to the upper bound. The stability of the experimental results shows that the proposed method is useful for exact inference in arbitrary junction trees.

Figure 7.9: Execution time of exact inference on the IBM QS20 for various junction trees using various numbers of SPEs.

The experimental results in Figure 7.11 show that the dynamic scheduler proposed in Section 7.1 evenly distributed the workload across all the SPEs, regardless of the number of cliques in the junction tree or the size of the clique potential tables. The scheduler dynamically exploits parallelism at multiple granularity levels, leading to load balance; thus, the SPE resources were used effectively. In Figure 7.11 (b), the workloads were nearly equal for the various input junction trees. The bars for the junction tree with 512 cliques are much shorter than those for the other two junction trees, because the workload of the junction tree with 512 cliques is much lighter. In Figure 7.11 (a), the scheduler randomly distributed the workload across all the available SPEs without considering their current workloads; as a result, the workload was not evenly distributed. Although the scheduler needed to keep track of multiple data structures, it was efficient and the scheduling overhead was very small. The time taken for scheduling (e.g., checking task sizes and partitioning large tasks) was a tiny fraction of the entire execution time: this overhead was no

Figure 7.10: Observed speedup of exact inference on the IBM QS20 for various junction trees.

Figure 7.11: The workload of each SPE for various junction trees (a) without load balancing and (b) with load balancing.
Notice that the scheduling algorithm was performed in the PPE, while the potential table computation was executed in the SPEs. The potential table computation, which took a large portion of the overall execution time, was parallelized. The time taken for scheduling, although not parallelized, was partially overlapped with the potential table computation.

For the sake of illustrating the capability of the Cell for exact inference, we implemented a serial code for exact inference using the same potential table organization and data layout mentioned in Section 7.1.3.2. We compiled the code using gcc with level 3 optimization as well, and executed the code on various platforms, including the Intel Pentium 4 (3.0 GHz, L1 16 KB, L2 2 MB) and IBM Power 4 (1.5 GHz, L1 128 KB + 64 KB, L2 1.4 MB, L3 32 MB). We also parallelized the potential table computation of the serial code using Pthreads and executed the code on the AMD Opteron 270 (2.0 GHz, L1 64 KB, L2 1 MB) and Intel Xeon (2.0 GHz, L1 128 KB, L2 8 MB). The results are shown in Figure 7.12.

Figure 7.12: Execution time of the Cell versus that of other processors.

Using the 8 SPEs in the Cell, our implementation for exact inference on the Cell achieved speedups of 3.0, 6.2, 2.7 and 9.9 over the Opteron, Pentium 4, Xeon and Power 4, respectively. For the sake of comparison, we also show the normalized execution time of the proposed method using only one SPE. In Figure 7.12, the execution time of exact inference on the Cell with 8 SPEs was the lowest. The deviations in the execution time among these processors were caused by several factors, such as the clock rate, cache configuration and processor architecture. For example, the execution time on the Intel Pentium 4 was higher than that on the Xeon, because the Pentium processor had only a single core. The IBM Power 4 processor was equipped with a three-level cache, and the size of the L3 cache was 32 MB, which is much larger than that of the other processors, but it had a lower clock frequency. Note that both the Intel Xeon and the AMD Opteron systems contained two quadcore processors with a shared memory. We used Pthreads to distribute the computation to all the cores. However, the parallelized version on the Intel or AMD processors still took more time than the Cell to perform exact inference. In conclusion, compared with these state-of-the-art processors, the Cell exhibited superior performance for parallel exact inference. Cell BE processors have some features that are especially suitable for exact inference and some other DAG structured computations. For example, the SPE has no cache but a user controlled local store, so we can avoid the issues caused by caches. In addition, the superior computation capacity and high bandwidth are also suitable for our application.

Chapter 8

Scheduling DAG Structured Computations

8.1 Lock-free Collaborative Scheduling

In this chapter, we propose a lightweight scheduling framework for DAG structured computations on general-purpose multicore processors. We distribute the scheduling activities across the cores and let the schedulers collaborate with each other to balance the workload. The scheduling framework consists of multiple modules, where various heuristics can be employed for each module to improve the performance. In addition, we develop a lock-free local task list for the scheduler to reduce the scheduling overhead on multicore processors.

8.1.1 Components

We modularize the collaborative scheduling method and show the components in Figure 8.1, where each thread hosts a scheduler consisting of several modules. The input task dependency graph is shared by all threads.
The task dependency graph is represented by a list called the global task list (GL) on the left-hand side of Figure 8.1, which is accessed by all the threads shown on the right-hand side. Unlike the global task list, the other modules are hosted by each thread.

Figure 8.1: Components of the collaborative scheduler.

The global task list (GL) represents the given task dependency graph. Figure 8.2 (a) shows a portion of the task dependency graph. Figure 8.2 (b) shows the corresponding part of the GL. As shown in Figure 8.2 (c), each element in the GL consists of the task ID, dependency degree, task weight, successors and the task meta data (e.g., application specific parameters). The task ID is the unique identity of a task. The dependency degree of a task is initially set to the number of incoming edges of the task. During the scheduling process, we decrease the dependency degree of a task once a predecessor of the task is processed. The task weight is the estimated execution time of the task. We keep the task IDs of the successors along with each task to preserve the precedence constraints of the task dependency graph. When we process a task $T_i$, we can locate its successors directly using the successor IDs, instead of traversing the global task list. In each element, there is task meta data, such as the task type and pointers to the data buffer of the task, etc. The GL is shared by all the threads and we use locks to protect the dependency degrees $d_i$, $0 \le i < N$.

Every thread has a Completed Task ID buffer and an Allocate module (Figure 8.1). The Completed Task ID buffer in each thread stores the IDs of tasks that are processed by the thread. Initially, all the Completed Task ID buffers are empty. During scheduling, when a thread completes the execution of an assigned task, the ID of the task is inserted into the Completed Task ID buffer in that thread. For each task in the Completed Task ID buffer, the Allocate module decrements the task dependency degree of the successors of the task. For example, if the ID of $T_i$ in Figure 8.2 was fetched by Thread $l$ from the Completed Task ID buffer, then Thread $l$ locates the successors of $T_i$ and decreases their task dependency degrees. If the task dependency degree of a successor $T_j$ becomes 0, the Allocate module allocates $T_j$ to a thread. Heuristics can be used in task allocation to balance the workload across the threads. Our framework permits plug-and-play insertion of such modules. In this section, we allocate a task to the thread with the smallest workload at the time of completion of the task execution.

The local task list (LL) in each thread stores the tasks allocated to the thread. For load balance, the Allocate module of a thread can allocate tasks to any thread. Thus, the LLs are actually shared by all threads. Each LL has a weight counter associated with it to record the total workload of the tasks currently in the LL. Once a task is inserted into (or fetched from) the LL, the weight counter is updated.

The Fetch module takes a task from the LL and sends it to the Execute module in the same thread for execution. Heuristics can be used by the Fetch module to select tasks from the LL. For example, tasks with more children can have higher priority for execution [24]. In this paper, we use a straightforward method, where the task at the head of the LL is selected by the Fetch module. Once the execution of the task is completed, the Execute module sends the ID of the task to the Completed Task ID buffer, so that the Allocate module can accordingly decrease the dependency degree of the successors of the task.
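For concreteness, a GL element as described above might be declared as follows in C (a minimal sketch; the dissertation specifies these fields but not an exact layout, so the type and field names here are our own):

typedef struct task_entry {
    int     id;              /* unique task ID */
    int     dep_degree;      /* unprocessed predecessors; protected by a lock */
    double  weight;          /* estimated execution time of the task */
    int     num_successors;  /* number of successor tasks */
    int    *successors;      /* successor IDs, so no GL traversal is needed */
    void   *meta;            /* task meta data, e.g., task type, data buffers */
} task_entry;

Keeping the successor IDs inside each entry is what lets the Allocate module reach the successors of a completed task directly, instead of scanning the whole GL.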
We emphasize that Figure 8.1 shows the framework of collaborative scheduling, where various heuristics can be used for the components. For example, genetic algorithms or randomization techniques can be used for the Allocate and Fetch modules [24]. In this paper, we focus on the reduction of the overhead of collaborative scheduling with respect to straightforward implementations of the components.

Figure 8.2: (a) A portion of a task dependency graph. (b) The corresponding representation of the global task list (GL). (c) The data of element $T_i$ in the GL.

8.1.2 An Implementation of Collaborative Scheduler

Based on the framework of collaborative scheduling in Section 8.1.1, we present a sample implementation of collaborative scheduling in Algorithm 8.3. We use the following notations in the algorithm: GL denotes the global task list. $LL_i$ denotes the local task list in Thread $i$, $0 \le i < P$. $d_T$ and $w_T$ denote the dependency degree and the weight of task $T$, respectively. $W_i$ is the weight counter of Thread $i$, i.e., the total weight (estimated execution time) of the tasks currently in $LL_i$. $\delta_M$ is a given threshold. The statements in boxes involve shared variable accesses.

{Initialization}
1: Let $S = \{T_i \mid d_i = 0\}$, $0 \le i < N$
2: Evenly distribute the tasks in $S$ to $LL_i$, $0 \le i < P$
{Scheduling}
3: for Thread $i$ ($i = 0, \cdots, P-1$) pardo
4:   while $GL \cup LL_i \ne \emptyset$ do
     {Completed Task ID buffer & Allocate module}
5:     if |Completed Task ID buffer| $> \delta_M$ or $LL_i = \emptyset$ then
6:       for all $T \in$ {successors of tasks in the Completed Task ID buffer of Thread $i$} do
7:         $d_T = d_T - 1$
8:         if $d_T = 0$ then
9:           fetch $T$ from GL and append it to $LL_j$, where $j = \arg\min_{0 \le t < P}(W_t)$
10:          $W_j = W_j + w_T$
11:        end if
12:      end for
13:      Completed Task ID buffer $= \emptyset$
14:    end if
     {Fetch module}
15:    fetch a task $T'$ from the head of $LL_i$
16:    $W_i = W_i - w_{T'}$
     {Execute module}
17:    execute $T'$ and place the task ID of $T'$ into the Completed Task ID buffer of Thread $i$
18:  end while
19: end for

Figure 8.3: A Sample Implementation of Collaborative Scheduling

Lines 1 and 2 in Algorithm 8.3 initialize the local task lists. In Lines 3-19, the algorithm performs task scheduling iteratively until all the tasks are processed. Lines 5-14 correspond to the Allocate module, where the algorithm decreases the dependency degree of the successors of tasks in the Completed Task ID buffer (Line 7), then allocates the successors with dependency degree equal to 0 to a target thread (Line 9). Line 5 determines when the Allocate module should work. When the number of tasks in the Completed Task ID buffer is greater than the threshold $\delta_M$, the Allocate module fetches tasks from the GL. The motivation is that accessing the shared GL less frequently can reduce the lock overhead. We choose the target thread as the one with the smallest workload (Line 9), although alternative heuristics can be used. Line 15 corresponds to the Fetch module, where we fetch a task from the head of the local task list. Line 16 updates $W_i$ according to the fetched task. Finally, in Line 17, $T'$ is executed in the Execute module, and its ID is placed into the Completed Task ID buffer.

Due to the collaboration among threads (Line 9), all local task lists and weight counters are shared by all threads. Thus, in addition to the GL, we must avoid concurrent writes to $LL_i$ and $W_i$, $0 \le i < P$, in Lines 9, 10, 15, and 16.

8.1.3 Lock-free Data Structures

In Algorithm 8.3, collaboration among the threads requires locks for the weight counters and local task lists to avoid concurrent writes. For example, without locks, multiple threads can concurrently insert tasks into local task list $LL_j$, for some $j$, $0 \le j < P$ (Line 9, Algorithm 8.3). In such a scenario, the tasks inserted by a thread can be overwritten by those inserted by another thread.
Similarly, without locks, a data race can occur on weight counter $W_j$ (Line 10, Algorithm 8.3). Therefore, locks must be used for the weight counters and local task lists. Locks serialize the execution and incur increasing overheads as $P$ increases. Thus, we focus on eliminating the locks for local task lists and weight counters.

We propose a lock-free organization for the local task lists and weight counters. We substitute $P$ weight counters for each original weight counter and $P$ circular lists for each local task list, so that each weight counter or circular list is updated exclusively during scheduling. Thus, there are $P^2$ lock-free weight counters and $P^2$ lock-free circular lists. The lock-free organization is shown in Figure 8.4, where the dashed box on the left-hand side shows the organization of local task list $LL_i$ and weight counter $W_i$ in Thread $i$ for some $i$, $0 \le i < P$; the dashed circles on the right-hand side represent the $P$ threads. Although there are more queues and counters than shown in Figure 8.1, this organization does not stress the memory. Each queue can be maintained by only three variables: head, tail and counter. If each variable is an integer, then the total size of these variables is less than 1 KB, which is negligible compared with the cache size of almost all multicore platforms.

The lock-free local task list in Thread $i$, $0 \le i < P$, denoted $LL_i$, consists of $P$ circular lists $Q_i^0, Q_i^1, \cdots, Q_i^{P-1}$. The $j$-th circular list $Q_i^j$, $0 \le j < P$, corresponds to the $j$-th portion of $LL_i$. Each circular list $Q_i^j$ has two pointers, $head_i^j$ and $tail_i^j$, pointing to the first and last task, respectively. Tasks are fetched from (inserted into) $Q_i^j$ at the location pointed to by $head_i^j$ ($tail_i^j$).

The solid arrows in Figure 8.4 connect the heads of the circular lists in $LL_i$ to Thread $i$, which shows that Thread $i$ can fetch tasks from the head of any circular list in $LL_i$. Corresponding to Line 15 of Algorithm 8.3, we let Thread $i$ fetch tasks from the circular lists in $LL_i$ in a round-robin fashion. The dash-dotted arrows connecting Thread $j$, $0 \le j < P$, to the tail of $Q_i^j$ imply that Thread $j$ allocates tasks to $LL_i$ by inserting the tasks at the tail of circular list $Q_i^j$. If $Q_i^j$ is full, Thread $j$ inserts tasks into $Q_k^j$, $0 \le k < P$ and $k \ne i$, in a round-robin fashion. According to Figure 8.4, each head or tail is updated by one thread only.

We use $P$ weight counters in Thread $i$ to track $W_i$, the workload of $LL_i$. The $P$ weight counters are denoted $W_i^0, W_i^1, \cdots, W_i^{P-1}$ in Figure 8.4, and $W_i$ is given by $W_i = \sum_{j=0}^{P-1} W_i^j$. Note that all the solid arrows pass through weight counter $W_i^i$ in Figure 8.4, which means that Thread $i$ must update $W_i^i$ when it fetches a task from any circular list. The dash-dotted arrow passing through $W_i^j$ for some $j$, $0 \le j < P$, indicates that Thread $j$ must update $W_i^j$ when it inserts a task into $Q_i^j$. Note that the value of $W_i^j$ is not the total weight of tasks in $Q_i^j$. Therefore, each weight counter is updated by one thread only during scheduling.

Figure 8.4: Lock-free organization of the local task list ($LL_i$) and a set of weight counters in Thread $i$, and their interaction with respect to each thread.

8.1.4 Correctness

When a thread is writing data to shared variables, e.g., the weight counters or the heads and tails of the circular lists, another thread may read stale values of these shared variables if we eliminate locks. However, such a stale value does not impact the correctness. By correctness, we mean that all the precedence constraints are satisfied and each of the tasks in the DAG is scheduled onto some thread.
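Since each head is written only by the thread that owns the list and each tail only by the thread that inserts into it, one circular list can be sketched in C as follows (an illustrative sketch with our own names; for clarity it ignores memory-ordering details, which a production version would handle with C11 atomics or fences):

#define Q_SIZE 256                     /* assumed capacity of one circular list */

typedef struct {
    volatile int head;                 /* written only by the owning (fetching) thread */
    volatile int tail;                 /* written only by the inserting thread */
    void *slot[Q_SIZE];                /* pointers to tasks */
} circ_list;

/* Insert side (Line 9 of Algorithm 8.3): returns 0 if the list appears full;
 * a stale head can only make a non-full list appear full. */
int circ_push(circ_list *q, void *task) {
    int next = (q->tail + 1) % Q_SIZE;
    if (next == q->head) return 0;     /* caller tries the next circular list */
    q->slot[q->tail] = task;
    q->tail = next;                    /* single writer of tail */
    return 1;
}

/* Fetch side (Line 15 of Algorithm 8.3): returns NULL if the list appears
 * empty; a stale tail can only make a nonempty list appear empty. */
void *circ_pop(circ_list *q) {
    if (q->head == q->tail) return NULL;
    void *task = q->slot[q->head];
    q->head = (q->head + 1) % Q_SIZE;  /* single writer of head */
    return task;
}

Either stale view is conservative: an insert or fetch may be deferred to another list or to the next iteration, but never corrupted, which is exactly the property the correctness argument below relies on.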
All the precedence constraints of a given DAG structured computation are satisfied because of the locks for the global task list (GL). For each task $T$ in the GL, the task dependency degree $d_T$ is protected by a lock to avoid concurrent writes. Thus, according to Lines 7 and 8 in Algorithm 8.3, $T$ can be allocated only if all the predecessor tasks of $T$ are processed. Therefore, the precedence constraints are satisfied regardless of potentially stale values.

When a stale value of a weight counter $W_i^j$ is read by a thread, an error is introduced into the workload estimate for $LL_i$, i.e., $W_i = \sum_{j=0}^{P-1} W_i^j$. Due to the inaccurate workload estimate, the thread may allocate a new task to a different thread, compared with the case in which the exact workload is known. This does not affect the correctness, but may lead to an unbalanced load. However, since the weight counters are checked in every iteration of scheduling (Lines 4-18, Algorithm 8.3), the updated value is taken into consideration in the next iteration of some thread. Therefore, except for a temporary influence on load balance, the stale values of the weight counters have no impact on correctness.

We analyze the impact of a stale value of $head_j^i$, $0 \le i,j < P$, by considering Line 9 in Algorithm 8.3, where Thread $i$ allocates task $T$ to Thread $j$ by appending $T$ to $LL_j$. To append $T$ to the lock-free local task list $LL_j$, Thread $i$ first checks if $Q_j^i$ is full. If not, $T$ is appended at $tail_j^i$ in $Q_j^i$. A circular list $Q_j^i$ is full if and only if $head_j^i = tail_j^i + 1$. Consider the following scenario: assume $Q_j^i$ is full and Thread $j$ fetches a task from $Q_j^i$. Then, Thread $j$ updates $head_j^i$ and Thread $i$ allocates task $T$ to Thread $j$. In such a scenario, $T$ can be appended to $Q_j^i$, since Thread $j$ has taken a task away. However, if Thread $i$ reads the stale value of $head_j^i$, then $Q_j^i$ still appears to be full. Thus, Thread $i$ finds another circular list $Q_k^i$, $k \ne j$, in a round-robin fashion for task allocation, where $Q_k^i$ is not full. Therefore, the stale value of $head_j^i$ does not affect the correctness, but may lead to an unbalanced load.

Similarly, we analyze the impact of a stale value of $tail_i^j$, $0 \le i,j < P$, by considering Line 15 in Algorithm 8.3, where Thread $i$ fetches task $T'$ from $LL_i$. To fetch a task from a lock-free local task list $LL_i$, Thread $i$ first checks if $Q_i^j$ is empty, where $j$ is chosen in a round-robin fashion. If not empty, the task at $head_i^j$ in $Q_i^j$ is fetched. Otherwise, Thread $i$ fetches a task from the next nonempty circular list in $LL_i$. Consider the following scenario: assume $Q_i^j$ is empty and Thread $j$ appends task $T'$ to $Q_i^j$. Then, Thread $j$ updates $tail_i^j$ and Thread $i$ fetches a task from $Q_i^j$. In such a scenario, Thread $i$ can fetch a task from $Q_i^j$, since Thread $j$ has inserted $T'$ into $Q_i^j$. However, if Thread $i$ reads the stale value of $tail_i^j$, then $Q_i^j$ still appears to be empty. Thus, Thread $i$ fetches a task from another circular list. The updated value of $tail_i^j$ can be read in the next iteration of scheduling, and Thread $i$ can then fetch $T'$ for execution. Therefore, the stale value of $tail_i^j$ does not affect the correctness.

8.1.5 Experiments

We conducted experiments on three state-of-the-art homogeneous multicore systems: the dual Intel Xeon 5335 (Clovertown) quadcore platform, the dual AMD Opteron 2335 (Barcelona) quadcore platform and the quad AMD Opteron 8358 (Barcelona) quadcore platform. The architectures of these processors are introduced in Section 2.4. These systems each had 64 GB shared memory, and the operating system was Red Hat Enterprise Linux Server Release 5.3 (Tikanga). We used the GCC 4.1.2 compiler.
8.1.5.1 Baseline Schedulers

To evaluate the proposed method, we implemented six baseline methods and compared them, along with the proposed one, using the same input task dependency graphs on various multicore platforms.

(1) Centralized scheduling with shared core (Cent_shared): This scheduling method consisted of a scheduler thread and several execution threads. Each execution thread was bound to a core, while the scheduler thread could execute on any of these cores. In this scheduling method, the input DAG was local to the scheduler thread. Each execution thread had a ready task list shared with the scheduler thread. In addition, there was a Completed Task ID buffer shared by all the threads. The scheduler thread was in charge of all the activities related to scheduling, including updating task dependency degrees, fetching ready-to-execute tasks and evenly distributing tasks to the ready task lists. Each execution thread fetched tasks from its ready task list for execution. The IDs of completed tasks were inserted into the Completed Task ID buffer, so that the scheduler thread could fetch new ready-to-execute tasks from the successors of tasks in the Completed Task ID buffer. After a given number (i.e., $\delta_M$ in Algorithm 8.3) of tasks in the ready task list were processed, the execution thread invoked the scheduler thread and then went to sleep. When a task was allocated to the ready task list of a sleeping thread, the scheduler invoked the corresponding execution thread. Spinlocks were used for the ready task lists and the Completed Task ID buffer.

(2) Centralized scheduling with dedicated core (Cent_ded): This scheduling method was adapted from the centralized scheduling with shared core. The only difference was that a core was dedicated to the scheduler thread, i.e., each thread was bound to a separate core. Similar to Cent_shared, the input DAG was local to the scheduler. Each execution thread had a ready task list shared with the scheduler thread. There was a Completed Task ID buffer shared by all the threads. The scheduler thread was also in charge of all the activities related to scheduling, and the execution threads executed assigned tasks only. Spinlocks were used for the ready task lists and the Completed Task ID buffer.

(3) Distributed scheduling with shared ready task list (Dist_shared): In this method, we distributed the scheduling activities across the threads. This method had a shared global task list and a shared ready task list. Each thread had a local Completed Task ID buffer. The scheduler integrated into each thread fetched ready-to-execute tasks from the global task list and inserted the tasks into the shared ready task list. If the ready task list was not empty, each thread fetched tasks from the ready task list for execution. Each thread inserted the IDs of completed tasks into its Completed Task ID buffer. Then, the scheduler in each thread updated the dependency degree of the successors of tasks in the Completed Task ID buffer, and fetched the tasks with dependency degree equal to 0 for allocation. Pthreads spinlocks were used for the global task list and the ready task list.

(4) Collaborative scheduling with lock-based local task list (ColSch_lock): This was the collaborative scheduling (Section 8.1.1) without any of the optimizations discussed in Section 8.1.3. Rather than using the lock-free data structures, we used mutex locks to avoid concurrent writes to the local task lists and weight counters. Spinlocks were used to protect the task dependency degrees in the shared global task list.
(5) Collaborative scheduling with lock-free local task list (ColSch_lockfree): This was not a baseline, but the proposed method with lock-free weight counters and local task lists (Section 8.1.3). We used spinlocks to protect the task dependency degrees in the shared global task list.

(6) DAG scheduling using Cilk (Cilk): This baseline scheduler performed work stealing based scheduling using the Cilk runtime system. Unlike the above scheduling methods, where we bound a thread to a core of a multicore processor and allocated tasks to the threads, we dynamically created a thread for each ready-to-execute task and then let the Cilk runtime system schedule the threads onto cores. Although Cilk can generate a DAG dynamically, we used a given task dependency graph stored in a shared global list for the sake of fair comparison. Once a task completed, the corresponding thread reduced the dependency degree of the successors of the task and created new threads for the successors with resulting dependency degree equal to 0. We used spinlocks for the dependency degrees to prevent concurrent writes.

(7) Our implementation of a work stealing based scheduler (Stealing): Although the above baseline Cilk is also a work stealing scheduler, it used the Cilk runtime system to schedule the threads, each corresponding to a task. On the one hand, the Cilk runtime system has various additional optimizations; on the other hand, scheduling the threads onto cores incurs overhead due to context switching. Therefore, for the sake of fair comparison, we implemented the Stealing baseline: we distributed the scheduling activities across the threads, each having a shared ready task list. The global task list was shared by all the threads. If the ready task list of a thread was not empty, the thread fetched a task from its top for execution and, upon completion, updated the dependency degree of the successors of the task. Tasks with dependency degree equal to 0 were placed at the top of the thread's ready task list. When a thread ran out of tasks to execute, it randomly chose a ready list to steal a task from its bottom, unless all tasks were completed. The data for randomization were generated offline to reduce possible overhead due to a random number generator. Pthreads spinlocks were used for the ready task lists and the global task list.

8.1.5.2 Datasets and Task Types

We used both synthetic and real datasets to evaluate the proposed scheduling methods. We built a task dependency graph for exact inference in a Bayesian network used by a real application called the Quick Medical Reference decision theoretic version (QMR-DT), a microcomputer-based decision support tool for diagnosis in internal medicine [20]. There were 1000 nodes in this network of two layers, one representing diseases and the other symptoms. Each disease has one or more edges pointing to the corresponding symptoms. All random variables (nodes) were binary. The resulting task dependency graph for exact inference had 228 tasks, each corresponding to a series of computations called node level primitives. These primitives consist of addition, multiplication and division operations among single precision floating point data in one or two tables [41]. Given the number of random variables involved in each task and the number of states of the random variables, denoted $W_c$ and $r$, respectively, the table size is $4r^{W_c}$ bytes. Each table entry was a 4-byte single precision floating point number. In this dataset, $r$ was 2 and the average value of $W_c$ was 10. The average number of successors $d$ for each task was also 2. The task weights were estimated using $W_c$, $r$ and $d$ (see [41]).
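As a quick check of the table size formula, consider the following hypothetical helper (the full weight model is given in [41] and is not reproduced here):

#include <stdio.h>

/* Bytes occupied by a potential table over W_c variables with r states each,
 * at 4 bytes per single precision entry: 4 * r^{W_c}. */
static unsigned long table_bytes(unsigned r, unsigned w_c) {
    unsigned long entries = 1;
    while (w_c-- > 0) entries *= r;    /* r^{W_c} entries */
    return 4UL * entries;
}

int main(void) {
    /* The QMR-DT setting above: r = 2 and an average W_c of 10 */
    printf("%lu bytes\n", table_bytes(2, 10));   /* prints 4096, i.e., 4 KB */
    return 0;
}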
We used extensive synthetic datasets to investigate the impact of various parameters of task dependency graphs on the scheduling performance. Our graph generator synthesized DAGs to simulate the structure of programs: each node represented a code segment, and the edges showed the data dependencies between the code segments. We assumed that the loops in a program must either be inside a code segment or unrolled. Let $N$ denote the number of nodes in the DAG and $d$ the average node degree. Let $\alpha_0, \alpha_1, \cdots, \alpha_{N-1}$ denote the nodes. We started with an empty set of edges. When node $\alpha_i$ was visited, the generator spawned $\max(0, d - d_{in} + \delta)$ outgoing edges for $\alpha_i$, where $d_{in}$ is the number of incoming edges and $\delta \in (\lfloor -d/2 \rfloor, \lceil d/2 \rceil)$ is a random integer. The outgoing edges were connected to nodes selected randomly from $\alpha_{i+1}$ to $\alpha_{N-1}$: the probability that $\alpha_j$, $i < j < N$, was selected is $P(\alpha_i \to \alpha_j) = e^{-1/(j-i)}$. That is, for a given node $\alpha_i$, the nodes close to $\alpha_i$ had higher probability of selection. Finally, all the nodes with no parent, except for $\alpha_0$, were connected to $\alpha_0$. We generated DAGs with 10000 nodes. The average degree of the nodes was 8. The threshold for the Task ID buffer, $\delta_M$, was 5.

We experimented with three types of tasks for each node of the generated DAGs: (1) Dummy task: a dummy task was actually a timer, which allowed us to assign the execution time of a task. Dummy tasks also helped us analyze the scheduling overhead, since we could easily calculate the task execution time. (2) Computation intensive task: this simulated a number of real tasks. We performed the matrix multiplication $Y = X^T X$, where $X$ was a $256 \times 256$ matrix. (3) Memory access intensive task: this simulated another common type of real task, which performs irregular memory accesses. In our experiments, we updated each element $x[i]$ of an array with 128K entries by performing $x[i] = x[y[i]]$, where $y$ was an index array for $x$.

8.1.5.3 Experimental Results

We first compared the proposed scheduling method (i.e., ColSch_lockfree) with the two work stealing based methods, Stealing and Cilk, using the task dependency graph for exact inference in the QMR-DT Bayesian network on the platform with dual AMD Opteron 2335 quadcore processors. In each comparison, the input task dependency graphs were identical for all the methods. In Figure 8.5(a), we show the execution time for scheduling exact inference using the task dependency graph. The corresponding speedup is shown in Figure 8.5(b). According to Figure 8.5(a), when the number of threads is less than four, almost no difference in execution time can be observed. However, as the number of threads increases, we can observe that the proposed method led to less execution time than the two baseline methods. The difference is clear in Figure 8.5(b), where we converted the execution time into speedup. A reason for the difference in the speedups is that the proposed scheduling method balanced the workload across the threads according to the task weights, whereas both baseline methods did not consider the task weights. Instead, they stole tasks from other threads to feed the underutilized threads. Since the DAG had 228 nodes and the number of threads was 8, the number of tasks per thread was 28.5 on average, and the number of available tasks for stealing at any particular time was even less. In this scenario, there is no guarantee that work stealing achieves load balance across the threads. Note that, although the results of Stealing and Cilk were similar in Figure 8.5(b), the performance of Stealing was slightly better than that of Cilk.
One reason for the performance difference was the overhead due to thread creation and context switching. For Cilk, we dynamically created threads for the tasks, which incurs thread creation overhead. In addition, we let the runtime system swap the threads in the cores, which resulted in context switch overhead. However, for the other methods, we bound the threads to separate cores and allocated tasks to the threads in user space, where no thread creation or swap in/out occurred.

Figure 8.5: Comparison of the proposed scheduler with work stealing based schedulers: (a) execution time and (b) speedup.

We conducted the following four experiments using synthetic datasets on multicore platforms to investigate the impact of various parameters of the input task dependency graphs on scheduling performance. The first experiment compared the proposed scheduling method and the various baseline methods. We used our graph generator to construct a DAG with 10000 nodes, where each node was assigned a dummy task taking 50 microseconds. We executed the DAG using the five task sharing scheduling methods on a dual Intel Xeon 5335 (Clovertown) quadcore platform and a quad AMD Opteron 8358 (Barcelona) quadcore platform, where up to 8 and 16 concurrent threads were supported, respectively. Given various numbers of available threads, we measured the overall execution time on both platforms and converted it into speedups in Figures 8.6(a) and 8.7(a), where the speedup is defined as the serial execution time over the parallel execution time.

Figure 8.6: Experimental results on the dual Intel Xeon 5335 (Clovertown) quadcore platform: (a) comparison with baseline scheduling methods; (b) impact of the size of tasks on the speedup achieved by ColSch_lockfree; (c) impact of error in the estimated task execution time on the speedup achieved by ColSch_lockfree; (d) performance of ColSch_lockfree with respect to task types.

The second experiment investigated the overhead due to the proposed lock-free collaborative scheduling method with respect to the number of threads and the task size on various platforms. As in the above experiment, we used a DAG with 10000 nodes and dummy tasks. However, we varied the time delay for each dummy task. We show the ideal speedup and the experimental results with dummy tasks taking 50, 100, 500 and 1000 microseconds in Figures 8.6(b) and 8.7(b).

Figure 8.7: Experimental results on the quad AMD Opteron 8358 (Barcelona) quadcore platform: (a) comparison with baseline scheduling methods; (b) impact of the size of tasks on the speedup achieved by ColSch_lockfree; (c) impact of error in the estimated task execution time on the speedup achieved by ColSch_lockfree; (d) performance of ColSch_lockfree with respect to task types.

The third experiment showed the impact of the error in the estimated task execution time. We again used a DAG with 10000 dummy task nodes. However, instead of using the real execution time as the task weight, we used estimated execution times with errors. Let $t$ denote the real task execution time and $t'$ the estimated execution time. The real task execution time $t$ was randomly selected from the range
In the experiments, we used dummy tasks, computation intensive tasks and memory access intensive tasks, discussedinSection8.1.5.2. TheresultsareshowninFigures8.6(d)and8.7(d). According to Figures 8.6 and 8.7, the proposed method showed superior perfor- mance compared with baseline methods on various platforms as number of cores in- creased. Forexample,inFigure8.7(a),when16coreswereused,thespeedupachieved by our method (15.12×) was significantly higher than others methods (up to 8.77×). This was because more cores lead to higher conflict in accessing the task lists. The proposed method allowed such concurrent access, but the baseline methods serialized theaccesses. ThebaselineCent sharedshowedrelativelylowperformancedueto the overhead of frequent context switch between the scheduler thread and the execu- tion thread. Cent ded also showed limited speedup, since the computing capability of the core dedicated to the scheduler thread was not used efficiently. The proposed method exhibited good scalability for various task weights. The speedups were close totheidealspeedupforlargetaskweights. Theresultsshowedthatourtechniquewas tolerant to the error in estimation of task weights. This was because the errors in es- timation of task weights can counteract each other. In addition, we observed that the schedulingoverheadwaslessthan1%fortheproposedmethodinalltheexperiments. Finally,theproposedmethodachievedalmostlinearspeedupfortasksofvarioustypes, including computation intensive tasks and memory access intensive tasks, which im- pliedthattheproposedschedulercanworkwellinreallifescenarios. 131 8.2 Hierarchical Scheduling on Manycore Processors withDynamicThreadGrouping In order to improve scheduling performance on manycore processors, we propose a hierarchicalschedulingmethodwithdynamicthreadgrouping,whichschedulesDAG structuredcomputationsatthreedifferentlevels. Atthetoplevel,asupermanagersep- arates threads into groups, each consisting of a manager thread and several worker threads. Thesupermanagerdynamicallymergesandpartitionsthegroupstoadaptthe schedulertotheinputtaskdependencygraphs. Throughgroupmergingandpartition- ing, the proposed scheduler can dynamically adjust to become a centralized sched- uler, a distributed scheduler or somewhere in between, depending on the input graph. At the group level, managers collaboratively schedule tasks for their workers. At the within-grouplevel,workersperformself-schedulingwithintheirrespectivegroupsand executetasks. 8.2.1 Organization We illustrate the components of the hierarchical scheduler in Figure 8.8. The boxes with rounded corners represent thread groups. Each group consists of a manager thread and several worker threads. The manager in Group 0 is also the supermanager. Thecomponentsinsideofaboxareprivatetothegroup;whilethecomponentsoutof theboxesaresharedbyallgroups. The representation of the global list (GL) is discussed in Chapter 8.1.1 (see Fig- ure8.1). Theglobalreadylist (GRL)inFigure8.8storestheIDsoftaskswithdepen- dency degree equal to 0. These tasks are ready to be executed. During the scheduling 132 process,ataskisputintothislistbyamanagerthreadoncethedependencydegreeof thetaskbecomesto0. Figure8.8: Componentsofthehierarchicalscheduler. The local ready list (LRL) in each group stores the IDs of tasks allocated to the group by the manager of the group. The workers in the group fetch tasks from LRL for execution. Each LRL is associated with a workload indicator (WI) to record the overall workload of the tasks currently in the LRL. 
Once a task is inserted into (or fetched from) the LRL, the indicator is updated.

The local completed task list (LCL) in each group stores the IDs of tasks completed by a worker thread in the group. The list is read by the manager thread in the group for decreasing the dependency degree of the successors of the tasks in the list.

The arrows in Figure 8.8 illustrate how each thread accesses a component (read or write). As we can see, the GL and GRL are shared by all the managers for both read and write. For each group, the LRL is write-only for the manager and read-only for the workers, while the LCL is write-only for the workers and read-only for the manager. The WI is local to the manager of the respective group only.

8.2.2 Dynamic Thread Grouping

The scheduler organization shown in Figure 8.8 supports dynamic thread grouping, which means that the number of threads in a group can be adjusted at runtime. We adjust groups by either merging two groups or partitioning a group. The proposed organization ensures efficient group merging and partitioning.

Figure 8.9(a) illustrates the merging of Group$_i$ and Group$_j$, $i < j$. The two groups are merged by converting all threads of Group$_j$ into workers of Group$_i$ and merging the WIs, LCLs and LRLs accordingly. Converting the threads of Group$_j$ into workers of Group$_i$ is straightforward: Manager$_j$ stops allocating tasks to Group$_j$ and instead performs self-scheduling as a worker thread. Then, all the threads in Group$_j$ access tasks from the merged LRL and LCL. To combine WI$_i$ and WI$_j$, we add the value of WI$_j$ to WI$_i$. Although WI$_j$ is not used after merging, we still keep it updated for the sake of possible future group partitioning. Merging the lists, i.e., the LCLs and LRLs, is efficient. Note that both the LCL and LRL are circular lists, each having a head and a tail pointer to indicate the first and last tasks stored in the list, respectively. Figure 8.9(b) illustrates the approach to merging two circular lists. We need to update two links only, i.e., the bold arrows shown in Figure 8.9(b). None of the tasks stored in the lists are moved or duplicated. The head and tail of the merged list are Head$_i$ and Tail$_j$, respectively. Note that two merged groups can be further merged into a larger group.

We summarize the procedure in Algorithm 8.10. Since the queues and weight indicators are shared by several threads, locks must be used to avoid concurrent writes. For example, we lock LRL$_i$ and LRL$_j$ immediately before Line 1 and unlock them after Line 3. Algorithm 8.10 does not explicitly assign the threads in Group$_i$ and Group$_j$ to Group$_k$, since this algorithm is executed only by the supermanager. Each thread dynamically updates its group information and decides if it should be a manager or a worker (see Algorithm 8.12).

Figure 8.9: (a) Merging Group$_i$ and Group$_j$. (b) Merging circular lists List$_i$ and List$_j$. The head (tail) points to the first (last) task stored in the list. The blank elements have no tasks stored yet.

Input: Group$_i$ and Group$_j$.
Output: Group$_k$ = Group$_i$ + Group$_j$
{Merge LRL$_i$ and LRL$_j$}
1: Let LRL$_j$.Head.Predecessor point to LRL$_i$.Tail.Successor
2: Let LRL$_i$.Tail.Successor point to LRL$_j$.Head
3: LRL$_k$.Head = LRL$_i$.Head, LRL$_k$.Tail = LRL$_j$.Tail
{Merge LCL$_i$ and LCL$_j$}
4: Let LCL$_j$.Head.Predecessor point to LCL$_i$.Tail.Successor
5: Let LCL$_i$.Tail.Successor point to LCL$_j$.Head
6: LCL$_k$.Head = LCL$_i$.Head, LCL$_k$.Tail = LCL$_j$.Tail
{Merge WI$_i$ and WI$_j$}
7: WI$_k$ = WI$_i$ + WI$_j$

Figure 8.10: Group merge

Group$_i$ and Group$_j$ can be restored from the merged group by partitioning. As a reverse process of group merging, group partitioning is also straightforward and efficient. Due to space limitations, we do not elaborate on it here.
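In C, the $O(1)$ splice of Figure 8.9(b) can be sketched as follows (illustrative types of our own; per the discussion above, the supermanager must hold the locks on both lists while the two links are rewritten):

typedef struct node {
    struct node *pred, *succ;   /* doubly linked circular chain */
    int task_id;
} node;

typedef struct { node *head, *tail; } clist;

/* Splice src after dst's tail: only two links change, and no task is
 * moved or duplicated. The merged list runs from dst->head (Head_i)
 * to src->tail (Tail_j). */
void clist_merge(clist *dst, const clist *src) {
    dst->tail->succ = src->head;    /* first updated link */
    src->head->pred = dst->tail;    /* second updated link */
    dst->tail = src->tail;
}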
Group merging and partitioning can be used for groups with an arbitrary number of threads. We assume hereinafter that the number of threads per group is a power of two, for the sake of simplicity.

8.2.3 Hierarchical Scheduling

Using the proposed data organization, we schedule a given DAG structured computation at three levels. The top level is called the meta-level, where we have a supermanager to control group merging/partitioning. At this level, we are not scheduling tasks directly, but reconfiguring the scheduler according to the characteristics of the input tasks. Such a process is called meta-scheduling. The supermanager is hosted along with the manager of Group 0 by Thread 0. Note that Manager$_0$ can never become a worker, as discussed in Section 8.2.2.

The middle level is called the group level, where the managers in the groups collaborate with each other and allocate tasks to the workers in their groups. The purpose of the collaboration between managers is to improve the load balance across the groups. Specifically, the managers ensure that the workload in the local ready lists is roughly equal for all groups. A manager is hosted by the first thread in a group.

The bottom level is called the within-group level, where the workers in each group perform self-scheduling. That is, once a worker finishes a task execution and updates the LCL in its group, it immediately fetches a new task, if any, from the LRL. Self-scheduling keeps all workers busy, unless the LRL is empty. Each worker is hosted by a separate thread.

Figure 8.11: The hierarchical relationship between the supermanager, managers and workers, and the corresponding scheduling schemes.

The hierarchical scheduler behaves between centralized and distributed schedulers, so that it can adapt to the input task graph. Note that each group consists of a manager thread and several worker threads. When all the groups are merged into a single group, the proposed method becomes a centralized scheduler; when multiple groups exist, the proposed method behaves as a distributed scheduler.

8.2.4 Scheduling Algorithm and Analysis

We propose a sample implementation of the hierarchical scheduler presented in Section 8.2.3. Based on the organization shown in Section 8.2.1, we use the following notations to describe the implementation: Assume there are $P$ threads, each bound to a core. The threads are divided into groups, each consisting of a manager and several workers. GL and GRL denote the global task list and global ready list, respectively. LRL$_i$ and LCL$_i$ denote the local ready list and local completed task list of Group$_i$, $0 \le i < P$. $d_T$ and $w_T$ represent the dependency degree and the weight of task $T$, respectively. WI$_i$ is the workload indicator of Thread$_i$. Parameters $\delta_M$, $\delta_+$ and $\delta_-$ are given thresholds. The boxes show the statements that access variables shared by all groups.

Algorithm 8.12 illustrates the framework of the hierarchical scheduler. In Lines 1-3, thread groups are initialized, each with a manager and a worker, along with a set of ready-to-execute tasks stored in LRL$_j$, where the overall task weight is recorded in WI$_j$. A boolean flag $f_{exit}$ in Line 3 notifies whether the threads can exit the scheduling iteration (Lines 5-15). $rank$ controls the size of the groups: increasing $rank$ leads to the merging of two adjacent groups, while decreasing $rank$ leads to the partitioning of the current groups. $rank = 1$ corresponds to the minimum group size, i.e., two threads per group. Thus, we have $1 \le rank \le \log P$. The group size $Q$ is therefore given by:

$$Q = \frac{P}{2^{\log P - rank}} = 2^{rank} \qquad (8.1)$$

Line 4 in Algorithm 8.12 starts all the threads in parallel. The threads perform various scheduling schemes according to their thread IDs.
The first thread in each group is a manager (Line 8). In addition, the first thread in Group 0, i.e., Thread 0, also performs as the supermanager (Line 10). The rest of the threads are workers (Line 13). Given thread ID $i$, the corresponding group is $\lfloor i/Q \rfloor$.

Input: $P$ threads; task dependency graph stored in GL; thresholds $\delta_M$, $\delta_+$ and $\delta_-$.
Output: Assign each task to a worker thread
{Initialization}
1: Group$_j$ = {Manager: Thread$_{2j}$, Worker: Thread$_{2j+1}$}, $j = 0, 1, \cdots, P/2-1$
2: Evenly distribute the tasks $\{T_i \mid T_i \in GL$ and $d_i = 0\}$ across LRL$_j$; WI$_j = \sum_{T \in LRL_j} w_T$, $\forall j = 0, 1, \cdots, P/2-1$
3: $f_{exit}$ = false, $rank$ = 1
{Scheduling}
4: for Thread $i = 0, 1, \cdots, P-1$ pardo
5:   while $f_{exit}$ = false do
6:     $Q = 2^{rank}$
7:     if $i \bmod Q = 0$ then
       {Manager thread}
8:       Group level scheduling at Group$_{\lfloor i/Q \rfloor}$ (Algorithm 8.14)
9:       if $i = 0$ then
         {Supermanager thread}
10:        Meta-level scheduling (Algorithm 8.13)
11:      end if
12:    else
       {Worker thread}
13:      Within-group level scheduling at Group$_{\lfloor i/Q \rfloor}$ (Algorithm 8.15)
14:    end if
15:  end while
16: end for
17: if GL = $\emptyset$ then $f_{exit}$ = true

Figure 8.12: A Sample Implementation of the Hierarchical Scheduler

Algorithm 8.13 shows the meta-scheduling method for the supermanager. The algorithm consists of two parts: updating $rank$ (Lines 1-2) and re-grouping (Lines 3-11). We use a heuristic to update $rank$: note that WI$_j$ is the computational workload for Group$_j$. A large WI$_j$ requires more workers for task execution. $|LCL_j|$ is the number of completed tasks and $d$ is the average number of successive tasks. For each completed task, the manager reduces the dependency degree of the successive tasks and moves ready-to-execute tasks to LRL$_j$. Thus, $(|LCL_j| \cdot d)$ represents the workload for the scheduler. A larger $(|LCL_j| \cdot d)$ requires more managers for task scheduling. In Line 1, the ratio $r$ tells us whether we need more managers or more workers. If more workers are needed, we increase $rank$ in Line 2. In this case, groups are merged to provide more workers per manager. Otherwise, $rank$ decreases. Line 2 also ensures that $rank$ is valid by checking the boundary values. $d$, $\delta_+$ and $\delta_-$ are given as inputs. The re-grouping depends on the value of $rank$. If $rank$ increases, two groups merge (Line 5); if $rank$ decreases, the merged group is partitioned (Line 9). The two operators Merge($\cdot$) and Partition($\cdot$) are discussed in Section 8.2.2. Line 12 flips $f_{exit}$ if no task remains in GL. This notifies all of the threads to terminate (Line 5 in Algorithm 8.12).

Algorithm 8.14 shows an iteration of the group level scheduling for managers. Each iteration consists of three parts: updating WI$_i$ (Lines 1-2 and 15), maintaining precedence relationships (Lines 3-8) and allocating tasks (Lines 9-14). Lines 3-8 check the successors of all tasks in LCL$_i$ in batch mode to reduce synchronization overhead. Let $m = 2^{rank} - 1$ denote the number of workers per group. In the batch task allocation part (Lines 9-14), we first fetch $m$ tasks from GRL. Line 12 is an adaptive step of this algorithm. If the overall workload of the $m$ tasks is too light ($\sum_{T \in S'} w_T < \Delta W$) or the current tasks in LRL$_i$ are not enough to keep the workers busy (WI$_i < \delta_M$), more tasks are fetched for the next iteration. This dynamically adjusts the workload distribution and prevents possible starvation for any group. In Line 10, the manager inspects a set of tasks and selects the $m$ tasks with relatively more successors. This is a widely used heuristic for scheduling [24]. Several statements in Algorithm 8.14 are put into boxes, where the managers access shared components across the groups. The synchronization cost of these statements varies as the number of groups changes.
{Update rank}
1: $r = \sum_{j=0}^{P/Q} (WI_j / (|LCL_j| \cdot d))$, $rank_{old} = rank$
2: $rank = \min(rank+1, \log P)$ if $r > \delta_+$; $rank = \max(rank-1, 1)$ if $r < \delta_-$
{Re-grouping}
3: if $rank_{old} < rank$ then
   {Combine groups}
4:   for $j = 0$ to $P/(2 \cdot Q) - 1$ do
5:     Group$_j$ = Merge(Group$_{2j}$, Group$_{2j+1}$)
6:   end for
7: else if $rank_{old} > rank$ then
   {Partition groups}
8:   for $j = P/Q - 1$ downto 0 do
9:     (Group$_{2j}$, Group$_{2j+1}$) = Partition(Group$_j$)
10:  end for
11: end if

Figure 8.13: Meta-Level Scheduling for the Supermanager

The workers schedule tasks assigned by their manager (Algorithm 8.15). This algorithm is a straightforward self-scheduling scheme, where each idle worker fetches a task from LRL$_i$ and then puts the task into LCL$_i$ after execution. Although LRL$_i$ and LCL$_i$ are shared by the manager and worker threads in the same group, no worker accesses any variables shared between groups.

8.2.5 Experiments

The Sun UltraSPARC T2 (Niagara 2) platform was a Sun Fire T2000 server with a Sun UltraSPARC T2 multithreading processor [34]. There are 64 hardware threads and 32 GB of DDR2 memory shared by all the cores. The operating system was Sun Solaris 11 and we used Sun Studio CC with Level 4 optimization (-xO4) to compile the code.
Weusedspinlocksforthedependencydegreestopreventconcurrentwrite. (c)SchedulingDAGstructuredcomputationusingOpenMP(OpenMP):Thisbase- line initially inserted all tasks with dependency degree equal to 0 into a ready queue. Then, using the OpenMP pragma directives, we created threads to execute these tasks in parallel. While executing the tasks in the ready queue, we inserted new ready-to- executetasksintoanotherreadyqueueforparallelexecutioninthenextiteration. Note that the number of tasks in the ready queue can be much greater than the number of 142 cores. We let the OpenMP runtime system to dynamically schedule tasks to underuti- lizedcores. (d)Centralizedschedulingwithdedicatedcore(Cent ded),(e)Distributedschedul- ing with shared ready task list (Dist shared) and (f) Task stealing based schedul- ing with distributed ready task list (Steal) are discussed in Items (2), (3) and (7) in Chapter8.1.5.1,respectively. 8.2.5.2 DatasetsandDataLayout We experimented with both synthetic and real datasets to evaluate the performance of the proposed scheduler. For the synthetic datasets, we varied the task dependency graphs so that we can evaluate our scheduling method using task dependency graphs with various graph topologies, sizes, task workload, task types and accuracies in esti- matingtaskweights. Fortherealdatasets,weusedtaskdependencygraphsforblocked matrix multiplication (BMM), LU and Cholesky decomposition. In addition, we also used the task dependency graph for exact inference, a classic problem in artificial in- telligence, where each task consists of data intensive computations between a set of probabilisticdistributiontables(alsoknownaspotentialtables)involvingbothregular andirregulardataaccesses[36]. Weusedthefollowingdatalayoutintheexperiments: Thetaskdependencygraph wasstoredasanarrayinthememory,whereeachelementrepresentsataskwithatask ID,weight,numberofsuccessors,apointertothesuccessorarrayandapointertothe taskmetadata. Thus,eachelementtook32Bytes,regardlessofwhatthetaskconsisted of. The task meta data was the data used for task execution. For LU decomposition, the task meta data is a matrix block; for exact inference, it is a set of potential tables. The lists used by the scheduler, such as GRL, LRLs and LCLs, were circular lists, 143 each having a head and a tail pointer. In case any list was full during scheduling, new elementswereinsertedon-the-fly. 8.2.5.3 Results We compared the performance of the proposed scheduling method with two state-of- the-art parallel programming systems i.e. Charm++[13], Cilk [11] and OpenMP [31]. We used a task dependency graph for which the structure was a random DAG with 10,000 tasks and there was an average of 8 successors for each task. Each task was a dense operation, e.g., multiplication of two 30× 30 matrices. For each schedul- ing method, we varied the number of available threads, so that we could observe the achieved scalability. The results are shown in Figure 8.16. Similar results were ob- served for other tasks. Given the number of available threads, we repeatedthe experi- mentsfivetimes. Theresultswereconsistent;thestandarddeviationoftheresultswere almost within 5% of the execution time. In Figure 8.16(a), all the methods exhibited scalability,thoughCharm++showedarelativelylargeoverhead. Areasonforthesig- nificantoverheadofCharm++comparedwithothermethodsisthatCharm++runtime systememploysamessagepassingbasedmechanismtomigratetasksforloadbalanc- ing(seeSection8.2.5.1). Thisincreasedtheamountofdatatransferringonthesystem bus. 
Note that the proposed method required at least two threads to form a group. In Figure8.16(b)wheremorethreadswereused,ourproposedmethodstillshowedgood scalability; while the performance of the OpenMP and Charm++ degraded signifi- cantly. As the number of threads increased, the Charm+ required frequent message passingbasedtaskmigrationtobalancetheworkload. Thisstessedthesystembusand caused the performance degradation. The performance of OpenMP degraded as the number of threads increase, because it can only schedule the tasks in the ready queue 144 (see Section 8.2.5.1), which limits the parallelism. Cilk showed scalability close to theproposedmethod,buttheexecutiontimewashigher. (a)Scalabilitywithrespectto1-8threads (b)Scalabilitywithrespectto8-64threads Figure 8.16: Comparison of average execution time with existing parallel program- mingsystems. We compared the proposed scheduling method with three typical schedulers, a centralized scheduler, a distributed scheduler and a task-stealing based scheduler ad- dressedinSection8.2.5.1. Weusedthesamedatasetasinthepreviousexperiment,but thematrixsizeswere50×50(large)and10×10(small)forFigures8.17(a)and(b), respectively. We normalized the throughput of each experiment for comparison. We divided the throughput of each experiment by the throughput of the proposed method using 8 threads. The results exhibited inconsistencies for the two baseline methods: Cent ded achieved much better performance than Dist shared with respect to large tasks, but significantly poorer performance with respect to small tasks. Such in- consistenciesimpliedthattheimpactoftheinputtaskdependencygraphsonschedul- ingperformancecanbesignificant. Anexplanationtothisobservationisthatthelarge tasksrequiredmoreresourcesfortaskexecution,butDist shareddedicatedmany threadstoscheduling,whichlimitstheresourcesfortaskexecution. Inaddition,many schedulersfrequentlyaccessingshareddataledtosignificantoverheadsduetocoordi- nation. Thus, the throughput decreased forDist shared as the number of threads 145 increased. When scheduling small tasks, the workers completed the assigned tasks quickly,butthesingleschedulerofCent dedcouldnotprocessthecompletedtasks andallocatenewtaskstoalltheworkersintime. Therefore,Dist sharedachieved higherthroughputthanCent dedinthiscase. Whenschedulinglargetasks,thepro- posed method dynamically merged all the groups and therefore became the same as Cent ded (Figure 8.17(a)). When scheduling small tasks, the proposed scheduler becameadistributedschedulerbykeepingeachcore(8threads)asagroup. Compared withDist shared,8threadspergroupledtothebestthroughput(Figure8.17(b)). Stealexhibitedincreasingthroughputwithrespecttothenumberofthreadsforlarge tasks. However, the performance tapered off when more than 48 threads were used. Onereasonforthisobservationisthat,asthenumberofthreadincreases,thechanceof stealing tasks also increases. Since a thread must access shared variables when steal- ing tasks, the coordination overhead increased accordingly. For small tasks, Steal showed limited performance compared with the proposed method. As the number of threads increases, the throughput was adversely affected. The proposed method dy- namically changedthe group sizeandmerged all the groups for the large tasks. Thus, the proposed method becomesCent ded except for the overhead of grouping. The proposed scheduler kept each core (8 threads) as a group when scheduling the small tasks. 
Thus, the proposed method achieved almost the same performance as Cent dedinFigure8.17(a)andthebestperformanceinFigure8.17(b). We experimentally show the importance of adapting the group size to the task de- pendency graphs in Figure 8.18. In this experiment, we modified the proposed sched- uler by fixing the group size. For each fixed group size, we used the same dataset in the previous experiment and measured the performance as the number of threads increases. According to Figure 8.18, larger group size led to better performance for large tasks; while for the small tasks, the best performance was achieved when the 146 (a) Performance with respect to large tasks (50×50matrixmultiplicationforeachtask) (b) Performance with respect to small tasks (10×10matrixmultiplicationforeachtask) Figure8.17: Comparisonwithbaselineschedulingmethodsusingtaskgraphsofvari- oustasksizes. group size was 4 or 8. Since the optimized group size varied according to the in- put task dependency graphs, it is necessary to adapt the group size to the input task dependencygraph. (a) Performance with respect to large tasks (50×50matrixmultiplicationforeachtask) (b) Performance with respect to small tasks (10×10matrixmultiplicationforeachtask) Figure 8.18: Performance achieved by the proposed method without dynamically ad- justing the scheduler group size (number of threads per group, thds/grp) with respect totaskgraphsofvarioustasksizes. In Figure 8.19, we illustrated the impact of various properties of task dependency graphs on the performance of the proposed scheduler. We studied the impact of the topology of the graph structure, the number of tasks in the graph, the number of suc- cessors and the size of the tasks. We modified these parameters of the dataset used in 147 (a)Taskgraphtopology (b)Numberoftasksintaskgraph (c)Numberofsuccessorsofeachtask (d)Tasksize (e)Impactoftasktypes (f)Loadbalance (g)Impactoferrorinestimatedtaskweight (h)Overheadoftheproposedmethod Figure 8.19: Impact of various properties of task dependency graphs on speedup achievedbytheproposedmethod. 148 the previous experiments. The topologies used in Figure 8.19(a) included a random graph (Rand), a 8-dimensional grid graph (8D-grid) and the task graph of blocked matrixmultiplication (BMM).Notethatweonlyusedthe topologyofthetaskdepen- dency graph for BMM in this experiment. Each task in the graph was replaced by a matrixmultiplication. Accordingtotheresults,formostofthescenarios,theproposed schedulerachievedalmostlinearspeedup. Notethatthespeedupfora10×10tasksize wasrelativelylowerthanothers. Thiswasbecausesynchronizationinschedulingwas relativelylargeforthetaskdependencygraphwithsmalltasksizes. Notethatweused the speedup as the metric in Figure 8.19. By speedup, we mean the serial execution time over the parallel execution time, when all the parameters of the task dependency grapharegiven. In Figure 8.19(e), we investigated the impact of task types on scheduling perfor- mance. Thecomputationintensivetasks(LabelComputation)werematrixmultiplica- tions,forwhichthecomplexitywasO(N 3 ),assumingthematrixsizewasN ×N. In our experiments, we hadN = 50. The memory access intensive tasks (Mem Access) summed an array ofN 2 elements usingO(N 2 ) time. For the last task type (Mixed), we let all the tasks with aneven ID perform matrix multiplication and the rest sum an array. We achieved speedup with respect to all task types. The speedup for memory accessintensivetaskswasrelativelylowerduetothelatencyofmemoryaccess. 
Figure 8.19(f) reflects the efficiency of the proposed scheduler. We measured the execution time of each thread to check whether the workload was evenly distributed, and normalized the execution time of each thread for the sake of comparison. The underlying graph was a random graph. We also limited the number of available cores in this experiment to observe the load balance in various scenarios. Each core had 8 threads. As the number of cores increased, there was a minor imbalance across the threads. However, the percentage of the imbalanced work was very small compared with the entire execution time.

For real applications, it is generally difficult to estimate the task weights accurately. To study the impact of error in the estimated task weight, we intentionally added noise to the estimated task weights in our experiments. We included noise that added 5%, 10% and 15% of the real task execution time. The noise was drawn from a uniform distribution using the POSIX math library. According to the results in Figure 8.19(g), the impact was not significant.

In Figure 8.19(h), we investigated the overhead of the proposed scheduler. Using the same dataset as in the previous experiment, we first performed hierarchical scheduling and recorded the thread to which each task was allocated. According to this allocation information, we then performed static scheduling to eliminate the overhead due to the proposed dynamic scheduler. We illustrate the execution time in Figure 8.19(h). Unlike the previous experiments, we show the results with respect to execution time to compare both the scalability and the scheduling overhead for a given number of threads. As we can see, the overhead due to dynamic scheduling was very small.

Chapter 9

Exact Inference Using Task Schedulers

9.1 Collaborative Scheduling for Exact Inference

9.1.1 Collaborative Scheduling

We utilize the collaborative scheduler proposed in Section 8.1 to allocate the tasks in the task dependency graph G to the cores. We assume that there are P cores in the system. The global task list (GL) in Figure 8.2 stores the tasks in the task dependency graph that is converted from the input junction tree. Each entry of the list stores a task and its related data, such as the task size, the task dependency degree, and the links to its succeeding tasks. Each thread also has a local Execute module where the node level primitive is performed for the current task in the module. The details of the other modules are given in Section 8.1.

Unlike the generic collaborative scheduler shown in Figure 8.1, we have a new module called the Partition module in Figure 9.1. This is a plug-and-play module designed specifically for exact inference. The Partition module checks the workload of the fetched task; tasks with heavy workloads are partitioned for load balancing. A property of the primitives is that the potential table of a clique can be partitioned easily. Thus, a task T can be partitioned into subtasks T̂_1, T̂_2, ..., T̂_n, each processing a part of the potential table related to task T. Each subtask inherits the parents of T in the task dependency graph G. However, we let T̂_n be the successor of T̂_1, ..., T̂_{n−1}, and only T̂_n inherits the successors of T. Therefore, the results from the subtasks can be concatenated or added by T̂_n. The Partition module replaces T in the GL with T̂_n and appends the other subtasks to the GL. T̂_1 is sent to the Execute module, and T̂_2, ..., T̂_{n−1} are evenly distributed to the local lists, so that these subtasks can be executed by several threads.

Figure 9.1: Components of the collaborative scheduler.
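The following is a minimal sketch of the graph rewiring performed by the Partition module, assuming a simple adjacency-list task representation; the struct fields and the function name are illustrative placeholders, not the actual implementation.

#include <algorithm>
#include <vector>

struct Task {
    int pot_begin, pot_end;       // slice of the potential table this task covers
    int dep_degree;               // number of unfinished predecessors
    std::vector<int> successors;  // indices of succeeding tasks in the GL
};

// Split task t into n subtasks over roughly equal slices of its potential
// table. Every subtask inherits t's parents (same dependency degree); only
// the last subtask inherits t's successors, and it additionally waits for
// the other n-1 subtasks so that it can concatenate or add their results.
std::vector<Task> partition(const Task& t, int n) {
    std::vector<Task> subs(n);
    int len = (t.pot_end - t.pot_begin + n - 1) / n;
    for (int j = 0; j < n; ++j) {
        subs[j].pot_begin = t.pot_begin + j * len;
        subs[j].pot_end = std::min(t.pot_end, subs[j].pot_begin + len);
        subs[j].dep_degree = t.dep_degree;
    }
    subs[n - 1].successors = t.successors;  // T̂_n takes over T's outgoing edges
    subs[n - 1].dep_degree += n - 1;        // ... and also depends on T̂_1, ..., T̂_{n-1}
    // In the scheduler, T̂_n replaces T in the GL and the remaining subtasks
    // are appended, each with an edge to T̂_n.
    return subs;
}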
The collaborative scheduling algorithm is shown in Algorithm 9.2. We use the following notation in the algorithm: GL is the global list; LL_i is the local ready list in Thread i; d_T and w_T denote the dependency degree and the weight of task T, respectively; W_i is the total weight of the tasks in LL_i; and δ is the threshold on the size of a potential table. Any potential table larger than δ is partitioned.

1: ∀T s.t. d_T = 0, evenly distribute the ID of T to the Task ID buffers
2: for Thread i (i = 1, ..., P) in parallel do
3:   while GL ∪ LL_i ≠ ∅ do
4:     for T ∈ {successors of tasks in the i-th Task ID buffer} do
5:       d_T = d_T − 1
6:       if d_T = 0 then
7:         allocate T to LL_j where j = argmin_t W_t, t = 1, ..., P
8:         W_j = W_j + w_T
9:       end if
10:     end for
11:     fetch task T′ from LL_i
12:     if the size of the potential table ψ_T′ > δ then
13:       partition T′ into subtasks T̂′_1, T̂′_2, ..., T̂′_n s.t. ψ_T̂′_j ≤ δ, j = 1, ..., n
14:       replace T′ in GL with T̂′_n, and allocate T̂′_1, ..., T̂′_{n−1} to local lists
15:       execute T̂′_1 and place the task ID of T̂′_1 into the i-th Task ID buffer
16:     else
17:       execute T′ and place the task ID of T′ into the i-th Task ID buffer
18:     end if
19:   end while
20: end for

Figure 9.2: Collaborative task scheduling.

Line 1 in Algorithm 9.2 initializes the Task ID buffers. As shown in Line 3, the scheduler keeps working until all tasks are processed. Lines 4-10 correspond to the Allocate module, Line 11 is the Fetch module, and Lines 12-18 correspond to the Partition module and the Execute module.

Algorithm 9.2 achieves load balancing by two means: first, the Allocate module ensures that new tasks are allocated to the thread whose local list has the lowest total workload; second, the Partition module guarantees that each single large task can be processed in parallel.
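A minimal sketch of the Allocate step (Lines 4-10 above) is given below, assuming a simplified shared state and processing one completed task at a time. In the actual scheduler the lists are concurrent structures; the data structure and function names here are illustrative only.

#include <vector>

struct SchedState {
    std::vector<double> W;                     // W_t: total weight queued at thread t
    std::vector<std::vector<int>> LL;          // LL_t: local ready list of thread t
    std::vector<int> dep_degree;               // d_T for every task T
    std::vector<double> weight;                // w_T for every task T
    std::vector<std::vector<int>> successors;  // edges of the task dependency graph
};

// When a task completes, decrement the dependency degree of each successor;
// a successor that becomes ready goes to the least loaded local list.
void allocate_successors(SchedState& s, int finished, int P) {
    for (int succ : s.successors[finished]) {
        if (--s.dep_degree[succ] == 0) {
            int j = 0;  // j = argmin_t W_t
            for (int t = 1; t < P; ++t)
                if (s.W[t] < s.W[j]) j = t;
            s.LL[j].push_back(succ);
            s.W[j] += s.weight[succ];
        }
    }
}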
9.1.2 Experiments

We generated junction trees of various sizes to analyze and evaluate the performance of the proposed evidence propagation method. The junction trees were generated using the Bayes Net Toolbox [28]. The first junction tree (Junction tree 1) had 512 cliques and an average clique width of 20; the average degree of each clique was 4, and all random variables were binary. The second junction tree (Junction tree 2) had 256 cliques and an average clique width of 15; the number of states of the random variables was 3 and the average clique degree was 4. The third junction tree (Junction tree 3) had 128 cliques; the clique width was 10, the number of states of the random variables was 3, and the average clique degree was 2. All three junction trees were rerooted using Algorithm 4.9. In our experiments, we used double precision floating point numbers to represent the probabilities and potentials.

We compared three parallel methods for evidence propagation on both Intel Xeon and AMD Opteron in Figure 9.3. The first two methods were the baselines. The first was the OpenMP based method, where OpenMP directives were used to parallelize the sequential code; we used ICC to compile the code on the Xeon and GCC on the Opteron. The second is called the data parallel method, where we created multiple threads for each node level primitive; that is, the node level primitives were parallelized every time they were performed. The data parallel method is similar to the task partitioning mechanism in our collaborative scheduler, but its overheads were large. The third parallel method was the proposed method. Using Junction trees 1-3 introduced above, we conducted experiments on both platforms. For all three methods, we used level 3 optimization (-O3). The OpenMP based method also benefited from the SSE3 optimization (-msse3).

Figure 9.3: Scalability of exact inference using various methods on (a) Intel Xeon and (b) AMD Opteron.

We show the speedups in Figure 9.3. The results show that the proposed method exhibited linear speedup and was superior to the baseline methods on both platforms. Running the proposed method on 8 cores, we observed a speedup of 7.4 on Intel Xeon and 7.1 on AMD Opteron. Compared to the OpenMP based method, our approach achieved a speedup of 2.1 when 8 cores were used. Compared to the data parallel method, we achieved a speedup of 1.8 on AMD Opteron.

To show the load balance we achieved and the overhead of the collaborative scheduler, we measured the computation time for each thread. In our context, the computation time of a thread is the total time taken by the thread to perform node level primitives; the time taken to fetch tasks, allocate tasks and maintain the local ready list is not included. The computation time therefore reflects the workload of each thread. We show the results in Figure 9.4(a), which were obtained on the Opteron using Junction tree 1 defined above. We observed very similar results for Junction trees 2 and 3; due to space constraints, we show results for Junction tree 1 only. We also calculated the ratio of the computation time to the total parallel execution time. This ratio illustrates the quality of the scheduler. From Figure 9.4(b), we can see that, although the scheduling time increased slightly as the number of threads increased, it did not exceed 0.9% of the execution time for any thread.

Figure 9.4: (a) Load balance across the threads; (b) computation time ratio for each thread.

Finally, we modified the parameters of Junction tree 1 to obtain a dozen junction trees and applied the proposed method to them to observe its performance in various situations. We varied the number of cliques N, the clique width w_C, the number of states r and the average number of children k. We obtained almost linear speedup in all cases. From the results in Figure 9.5(a), we observe that the speedups achieved in the experiments with various values of N were all above 7, and all exhibited linear speedup. In Figures 9.5(b) and (c), all results showed linear speedup except those with w_C = 10 and r = 2. The reason is that the size of the potential table was small: for w_C = 10 and r = 2, the potential table had 1024 entries, about 1/1000 of the number of entries in a potential table with w_C = 20. However, since N and the junction tree structure were the same, the scheduling required approximately the same time for junction trees with small potential tables. Thus, the overheads became relatively large. In Figure 9.5(d), all the results showed similar performance when k was varied; all achieved speedups of more than 7 using 8 cores.

Figure 9.5: Speedups of exact inference on multicore systems with respect to various junction tree parameters.

9.2 Hierarchical Scheduling for Exact Inference

In real applications, the size of the potential tables in a junction tree can vary dramatically. Thus, achieving scalability over dozens of threads remains a fundamental challenge for exact inference on manycore processors. In this section, we utilize the hierarchical scheduler discussed in Section 8.2 for exact inference.

We used the same method proposed in Section 8.2 to convert an input junction tree into a task graph. The task graph is also stored as a list called the global task list (GL). The other components of the hierarchical scheduler, and an important feature called dynamic thread grouping, are discussed in Section 8.2. In this section, we focus on applying the hierarchical scheduler to exact inference.
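To make the range of task sizes concrete: a clique of width w over r-state variables has a potential table with r^w entries, so table sizes grow exponentially with both parameters. The snippet below is a back-of-the-envelope check for the junction tree configurations used in the experiments, assuming double precision entries (8 bytes) as in the Section 9.1 experiments; it is illustrative only.

#include <cmath>
#include <cstdio>

int main() {
    // (clique width, number of states, number of cliques) for Junction trees 3, 2, 1.
    struct { int w, r, cliques; } cfg[] = {
        {10, 2, 128},  // 2^10 = 1024 entries per table
        {15, 3, 256},  // 3^15 ~ 14.3 million entries per table
        {20, 2, 512},  // 2^20 ~ 1 million entries per table
    };
    for (const auto& c : cfg) {
        double entries = std::pow(static_cast<double>(c.r), c.w);
        double total_bytes = entries * 8.0 * c.cliques;  // 8 bytes per double
        std::printf("w=%2d r=%d: %.3g entries/table, %.3g GB in total\n",
                    c.w, c.r, entries, total_bytes / (1 << 30));
    }
    return 0;
}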
9.2.1 Exact Inference Using Hierarchical Schedulers

Based on the organization shown in Section 8.2, we use the following notation to describe the implementation. Assume there are P threads, each bound to a core. The threads are divided into groups, each consisting of a manager and several workers. GL and GRL denote the global task list and the global ready list, respectively. LRL_i and LCL_i denote the local ready list and the local completed task list of Group i, 0 ≤ i < P. d_T and w_T represent the dependency degree and the weight of task T, respectively. WI_i is the workload indicator of Thread i. Parameters δ_M, δ+ and δ− are constant thresholds. Empirically, we set the three thresholds to 3·Σ_{T∈GL} d_T/|GL|, 1.25·Σ_{T∈GL} d_T/|GL| and 0.75·Σ_{T∈GL} d_T/|GL|, respectively; these thresholds can be tuned with respect to the platform. We use a variable rank to control the merging and partitioning of the groups, maintaining 2^rank threads per group during scheduling. We let β denote the average number of successors of a task.
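As a hedged sketch of how these quantities drive the regrouping decision in Algorithm 9.6 below (Lines 21-29): the manager of Group 0 compares the worker-to-manager workload ratio r against δ+ and δ−, and raises or lowers rank accordingly, merging groups when the workers are the bottleneck and partitioning them when the managers are. The function and variable names are our own, and the zero guard on |LCL_j| is an assumption for illustration.

#include <algorithm>
#include <cmath>
#include <vector>

// Returns the updated rank; each group has Q = 2^rank threads, so raising
// rank merges pairs of groups and lowering it partitions each group in two.
int update_rank(int rank, int P, double beta,
                const std::vector<double>& WI,    // WI_j: workload of Group j's workers
                const std::vector<int>& LCL_len,  // |LCL_j|: tasks awaiting Group j's manager
                double delta_plus, double delta_minus) {
    int Q = 1 << rank;        // threads per group
    int num_groups = P / Q;
    double r = 0.0;
    for (int j = 0; j < num_groups; ++j)
        r += WI[j] / (std::max(1, LCL_len[j]) * beta);  // guard against empty LCL_j
    int logP = static_cast<int>(std::lround(std::log2(static_cast<double>(P))));
    if (r > delta_plus)
        return std::min(rank + 1, logP);  // workers overloaded: merge groups
    if (r < delta_minus)
        return std::max(rank - 1, 1);     // managers overloaded: partition groups
    return rank;
}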
The complete algorithm is shown in Algorithm 9.6.

Input: task graph GL converted from junction tree J; number of threads P
Output: updated potential tables in J
1: distribute the tasks T_i ∈ GL with d_i = 0 across LRL_j, 0 ≤ j < P
2: f_exit = false, rank = 1
3: for Thread i = 0, 1, ..., P−1 in parallel do
4:   while f_exit = false do
5:     Q = 2^rank, m = ⌊i/Q⌋
6:     if i mod Q = 0 then
7:       ΔW = Σ_{T̃∈LCL_m} w_T̃, WI_m = WI_m − ΔW
8:       for all T ∈ {successors of T̃ ∈ LCL_m} do
9:         d_T = d_T − 1
10:        if d_T = 0 then
11:          GRL = GRL ∪ {T}; GL = GL \ {T}
12:        end if
13:      end for
14:      if LRL_m is not full then
15:        S′ ⇐ fetch (Q−1) tasks from GRL
16:        if Σ_{T∈S′} w_T < ΔW or WI_m < δ_M then
17:          fetch tasks from GRL into S′ so that Σ_{T∈S′} w_T ≈ ΔW + δ_M
18:        end if
19:        LRL_m = LRL_m ∪ S′, WI_m = WI_m + Σ_{T∈S′} w_T
20:      end if
21:      if i = 0 then
22:        r = Σ_{j=0}^{P/Q−1} WI_j/(|LCL_j|·β), rank_old = rank
23:        rank = min(rank+1, log P) if r > δ+; rank = max(rank−1, 1) if r < δ−
24:        if rank_old < rank then
25:          Group_j = Merge(Group_{2j}, Group_{2j+1}), ∀ 0 ≤ j ≤ P/(2Q) − 1
26:        else if rank_old > rank then
27:          (Group_{2j}, Group_{2j+1}) = Partition(Group_j), j = P/Q − 1, ..., 0
28:        end if
29:      end if
30:      if GL = ∅, then let f_exit = true
31:    else
32:      if LRL_m is not empty then
33:        fetch T from LRL_m and execute T, LCL_m = LCL_m ∪ {T}
34:      end if
35:    end if
36:  end while
37: end for

Figure 9.6: Exact inference using the hierarchical scheduler.

Lines 1-2 are initial steps. Lines 3-37 are the main body of the proposed algorithm. In Line 5, Q is the number of threads in a group and m is the group to which Thread i belongs. Line 6 selects the first thread in each group to be the manager; the others are workers. Lines 7-30 describe the work of the manager in Group m. New ready-to-execute tasks are added to GRL (Lines 8-13), and a portion of the ready-to-execute tasks are moved to LRL_m accordingly (Lines 14-20). The manager in Group 0 is also in charge of group merging/partitioning (Lines 21-29). Line 22 computes the ratio used to decide whether the groups need to be merged (Line 25) or partitioned (Line 27), where β is the average number of successors per task. Note that, in Line 22, WI_j gives the workload of the workers in Group j, while |LCL_j|·β reflects the workload of the manager. Thus, by comparing the ratio r with the parameters δ+ and δ−, we can tell whether we need more managers or more workers (Line 23). Line 23 also bounds the range of rank, i.e., 1 ≤ rank ≤ log P. Group partitioning increases the number of managers, while group merging increases the number of workers. The two functions Merge(·,·) and Partition(·) are discussed in Section 8.2.2. When all tasks are completed, the manager in Group 0 notifies all threads to quit by flipping f_exit (Line 30). Lines 32-34 describe the algorithm for the workers: each worker fetches tasks allocated by its manager (Line 33) and executes them. The completed tasks in Group m are added to LCL_m, so that the manager can process them.

9.2.2 Experiments

9.2.2.1 Computing Facilities

The Sun UltraSPARC T2 (Niagara 2) platform was a Sunfire T2000 server with a Sun UltraSPARC T2 multithreading processor [34]. The UltraSPARC T2 has 8 hardware multithreading cores, each running at 1.4 GHz. Each core supports up to 8 hardware threads with 2 shared pipelines, so there are 64 hardware threads in total. Each core has its own L1 cache, shared by the threads within the core. The L2 cache size is 4 MB, shared by all hardware threads. The platform had 32 GB of DDR2 memory shared by all the cores. The operating system was Sun Solaris 11, and we used Sun Studio CC with level 4 optimization (-xO4) to compile the code.

9.2.2.2 Datasets

In our experiments, we used various junction trees to analyze and evaluate the performance of our method. All the junction trees were generated using the Bayes Net Toolbox [28]. The number of cliques N of these junction trees varied from 512 to 2048. The clique width w was selected from 8 to 20. The separator width w_s ranged from 1 to 4. The number of states r of the random variables varied from 2 to 4. The number of successors d of the cliques was selected from 2 to 16.

We stored the data of each junction tree in two parts in memory. The first part was a list representing the structure of the junction tree (see Figure 8.2). The other part was an array, where each element was a structured variable consisting of all the data related to a clique, such as the clique potential table and the separators between the clique and its successors. All the potential tables were represented using fixed point data, since the floating point units are shared by all the hardware threads in a core. C. Shi has studied the accuracy of fixed point data compared with floating point data [35].

9.2.2.3 Baseline Implementations

In order to evaluate the performance of the proposed algorithm, we implemented exact inference using five baseline methods: (1) Charm++ [13], (2) OpenMP [31] and (3) Cilk [11]. (4) Cent_ded dedicated a thread as the manager to monitor all the remaining threads. The manager constructed the task dependency graph from a given junction tree and allocated ready-to-execute tasks to the threads; the dedicated thread was also in charge of maintaining the precedence of the tasks. (5) Dist_shared let each thread allocate tasks for itself and put all ready-to-execute tasks into a shared list. Each thread locked the shared list to fetch tasks and to add new ready-to-execute tasks; we used a mutex lock in the implementation. We evaluated the baseline methods along with the proposed scheduler using the same input task dependency graphs.
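A minimal sketch of the Dist_shared baseline follows: a single mutex-guarded ready list that every thread pushes to and pops from. This is an illustrative reconstruction rather than the thesis code; the class and member names are our own.

#include <mutex>
#include <optional>
#include <vector>

class SharedReadyList {
public:
    // Add a task that has become ready to execute.
    void push(int task_id) {
        std::lock_guard<std::mutex> g(m_);
        ready_.push_back(task_id);
    }
    // Fetch a task to execute, or nothing if the list is empty.
    std::optional<int> pop() {
        std::lock_guard<std::mutex> g(m_);
        if (ready_.empty()) return std::nullopt;
        int t = ready_.back();
        ready_.pop_back();
        return t;
    }
private:
    // Every fetch and insert serializes on this mutex; this serialization
    // is the coordination overhead observed for Dist_shared above.
    std::mutex m_;
    std::vector<int> ready_;
};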
9.2.2.4 Results

Figure 9.7 compares the performance of the proposed method with three state-of-the-art parallel runtime systems: OpenMP, Charm++ and Cilk. The input was a junction tree with 1024 cliques, each consisting of 15 binary variables; the average number of successors of the cliques was 2 and the separator width was 4. We observed similar results using other junction trees. In Figure 9.7(a), all the methods exhibited scalability, although the Charm++ implementation showed relatively large overhead due to its message passing based synchronization mechanism. Charm++ had similar execution times when using one or two threads, since no message passing was involved in the single-thread scenario. For Cilk, we generated the DAG offline and then dynamically spawned a thread for each task, letting the Cilk runtime system schedule the threads; this ensured that Cilk used the same DAG for scheduling. The Cilk implementation scaled well, since the randomization technique used for work stealing in Cilk reduces synchronization overhead, but its absolute execution time was higher than that of the proposed method. In Figure 9.7(b), the proposed method again showed the best performance. The OpenMP based implementation exhibited limited scalability, since it mainly parallelized the loops in the node level primitives (see [37]) during exact inference. When 64 threads were used, the speedup of the proposed method was 13.7 and 9.4 compared with the OpenMP and Charm++ based implementations, respectively.

Figure 9.7: Comparison with existing parallel programming systems. (a) Scalability with respect to 1-8 threads; (b) scalability with respect to 8-64 threads.

Figure 9.8: Comparison with baseline scheduling methods using junction trees of various potential table (POT) sizes. (a) Speedup with respect to small POTs; (b) speedup with respect to large POTs.

We compared the proposed method with two typical schedulers for exact inference, a centralized scheduler and a distributed scheduler, as discussed in Section 9.2.2.3. We used the same junction tree as in the previous experiment, but the clique width was set to 8 and 15 for Figures 9.8(a) and (b), respectively, to show the impact of task sizes. Thus, the size of each potential table (POT) in Figure 9.8(b) was 128 times as large as that in Figure 9.8(a). The results were inconsistent for the two baseline methods: Cent_ded achieved better performance than Dist_shared for the junction tree with large POTs, but poor performance for the junction tree with small POTs. An explanation for this observation is that large POTs provide more parallelism, but Dist_shared dedicates many threads to scheduling, so the number of threads available for processing POTs was limited. When scheduling small POTs, many threads completed the assigned POTs quickly, but the single scheduler of Cent_ded was too busy to process the completed POTs and allocate new POTs in time. The proposed method dynamically adjusted the number of threads used for scheduling versus execution, and thus achieved superior performance.

We modified the parameters of the junction tree used in the previous two experiments to observe the performance of our method in various situations. We varied the number of cliques N, the clique width W_C, the number of states r of the random variables, and the average number of children d of the cliques. For each set of parameters, we repeated the experiment five times. The average of the results is shown in Figure 9.9.
The standard deviation was within 5% of the execution time. In Figure 9.9(a), the execution time was proportional to the number of cliques N; regardless of N, the execution time scaled well for all the junction trees. In Figure 9.9(b), the execution time for W_C = 15 was much higher than in the other two situations, since the potential table size increases exponentially with W_C. In Figure 9.9(c), we reduced W_C to 10 to evaluate the impact of r, since the junction tree with 1024 cliques is too large to be stored when r = 3 and W_C = 15. In this figure, the label r = 2&3 corresponds to the junction tree where half of the random variables were binary and the others were ternary. As shown in Figure 9.9(d), the impact of d on the execution time was minor. Based on the observations in Figure 9.9, we conclude that the proposed method scales well over a large range of junction tree parameters.

Figure 9.9: Scalability of the proposed method with respect to various parameters of junction trees. (a) Execution time with respect to the number of cliques; (b) execution time with respect to the clique width; (c) execution time with respect to the number of states of the random variables; (d) execution time with respect to the number of successors of each clique.

Finally, we investigated the efficiency of the proposed scheduler for exact inference on manycore processors. We measured the execution time of each thread to check whether the workload was evenly distributed, and normalized the execution time of each thread for the sake of comparison. We also limited the number of available cores in this experiment to observe the load balance in various scenarios. As the number of cores increased, although there was a minor imbalance across the threads, the percentage of imbalanced work was very small compared with the total execution time. In Figure 9.10, we show one of the results, obtained using a junction tree with N = 1024, W_C = 15, r = 2 and d = 4. We had consistent results for the junction trees described in Section 9.2.2.2.

Figure 9.10: Load balance with respect to the number of available cores.

Chapter 10

Conclusion and Future Work

In this chapter, we summarize our contributions in this thesis and discuss open problems for future research on the parallelization of various machine learning applications on state-of-the-art parallel computing architectures. In this thesis, we focused on the parallelization of probabilistic graphical models at multiple levels on various parallel computing platforms. We demonstrated that, although many challenges exist in mapping such a problem onto parallel architectures, these challenges can be overcome by exploring parallelism at multiple levels. To illustrate our ideas, we proposed parallel algorithms for exact inference in probabilistic graphical models on various parallel computing platforms, including clusters of processors, homogeneous/heterogeneous multicore processors, and manycore processors. Our work in this thesis allows many other graphical model based machine learning applications to take advantage of our proposed techniques, i.e., node level primitives and dynamic scheduling methods. The scheduling techniques can also be used for a broader class of DAG structured computations. In addition, we have pointed out that a set of optimization techniques can be utilized in our proposed scheduling frameworks, so that the schedulers can adapt to various input task graphs and underlying platforms.
All of these techniques can be used not only by the machine learning community to deal with large scale statistical machine learning applications, but also by system architects to design advanced parallel computing architectures for complex real-life applications. In the following section, we summarize our contributions in four areas.

10.1 Summary of Contributions

Efficient organization of potential tables and parallel node level primitives: Exact inference is essentially a series of computations among a set of potential tables. The size of a potential table increases dramatically as the clique width and the number of states of the random variables in the clique increase. Thus, an efficient organization of potential tables not only saves system memory, but also improves the efficiency of computation. We have proposed an efficient organization for potential tables with two merits. First, for each entry in a potential table, we do not have to store the corresponding state of each random variable in the clique. Instead, we convert the states of the random variables into an index and store the value in the entry indicated by that index; for each entry, the corresponding states of the random variables can be recovered dynamically from the index. Second, the proposed organization allows arbitrary dynamic partitioning of potential tables. Such partitioning helped us to explore data parallelism when performing computations between potential tables.
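A minimal sketch of this index based organization follows: the joint state (s_1, ..., s_w) of the clique variables is encoded as a mixed-radix index into a flat table, and the inverse map recovers the states from an index. The function names are illustrative, not those of the thesis code.

#include <cstddef>
#include <vector>

// Encode variable states into a flat table index; num_states[k] is the
// number of states r_k of the k-th variable in the clique.
long long states_to_index(const std::vector<int>& states,
                          const std::vector<int>& num_states) {
    long long idx = 0;
    for (std::size_t k = 0; k < states.size(); ++k)
        idx = idx * num_states[k] + states[k];
    return idx;
}

// Recover the variable states from a flat index (the dynamic inverse map).
std::vector<int> index_to_states(long long idx,
                                 const std::vector<int>& num_states) {
    std::vector<int> states(num_states.size());
    for (int k = static_cast<int>(num_states.size()) - 1; k >= 0; --k) {
        states[k] = static_cast<int>(idx % num_states[k]);
        idx /= num_states[k];
    }
    return states;
}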
Based on the potential table organization discussed above, we summarized four node level primitives for exact inference. The primitives can be used as building blocks for exploring data parallelism in computations among potential tables. We also constructed two kernels for exact inference, one for evidence collection and the other for evidence distribution. Our proposed methods achieve linear speedup, according to our analysis using the coarse grained multicomputer (CGM) model [12]. We implemented the node level primitives using message passing on state-of-the-art clusters and observed that the implementation scales well to over a hundred processors.

Representation of exact inference using DAG structured computations: Many applications can be represented as directed acyclic graph (DAG) structured computations, where each node in the DAG corresponds to a series of computations. Scheduling a DAG structured computation onto a set of processors has been widely studied. Therefore, converting exact inference into a DAG structured computation allowed us to utilize various existing approaches for scheduling DAG structured computations. In this thesis, the tasks in a DAG can be dynamically partitioned so as to fit into the caches or local stores of various architectures; thus, our DAGs can be modified at runtime by the scheduler. In addition, we proposed data structures for representing a DAG efficiently. We stored all the tasks of a DAG in a list, where each element is a structured variable. The links to the child tasks are also stored in the ready list, so that we can preserve the topology of the input DAG. By exploiting characteristics of junction trees, we proposed an efficient method for rerooting a junction tree, which results in the task graph with the shortest critical path.

Scheduling frameworks for DAG structured computations: We have developed various schedulers for exact inference, each optimized for a class of architectures. For homogeneous multicore processors, we proposed a lock-free collaborative scheduler, which is a lightweight work sharing scheduler. It ensures that the workload is balanced across the threads. Since we observed that the coordination overhead due to locks is significant for collaborative schedulers on multicore processors, we proposed a lock-free data structure for the shared ready task lists. This is unlike the traditional atomic operation based approach, since atomic operations also involve high overheads. Our proposed lock-free data structure accelerates task scheduling on multicore processors without adversely impacting the correctness of scheduling. However, the lock-free data structure requires auxiliary variables, which can stress the cache on manycore processors. Therefore, we developed another scheduling method called hierarchical scheduling. By dynamically grouping the threads in a system, the hierarchical scheduling method explores the tradeoff between lock-based scheduling and lock-free scheduling. Compared with various baseline methods, the hierarchical scheduling method consistently achieved superior performance.

All the scheduling methods developed in this thesis are modularized scheduling frameworks. Each module in the schedulers can be viewed as a plug-and-play component. Thus, we can utilize various existing optimization techniques for each component. For example, for the component in charge of fetching tasks from a ready task list, a heuristic that selects the task with the maximum number of children can be used. Various other off-the-shelf optimization techniques can also be used. Such flexibility helps users adapt our proposed methods to specific architectures.

Various parallel algorithms for exact inference: In addition to the above scheduling based approaches, we developed various other techniques. We investigated a junction tree decomposition technique for exact inference. This method is efficient for exact inference on distributed memory platforms, such as clusters of processors, where the startup latency in communication is significant. We also studied other techniques, such as the pointer jumping based method for exact inference in junction trees. We analyzed the performance of exact inference by building a performance model for pointer jumping. According to our analysis, when the depth of a junction tree is less than the square of the number of cores, pointer jumping based exact inference always achieves acceleration compared with an optimized sequential implementation. When the junction tree offers limited structural and data parallelism, the pointer jumping based method can outperform other existing exact inference approaches, such as the scheduling based methods or those based on parallel node level primitives.

10.2 Future Work

The trend in high performance computing is to integrate processors of various architectures into a heterogeneous system for efficiently executing applications consisting of various types of tasks. For example, the two supercomputers Nebulae and Tianhe, ranked 2nd and 7th respectively in the latest TOP500 list [8], employ Intel Xeon/AMD Opteron multicore processors along with NVIDIA Tesla/ATI Radeon GPGPUs on each compute node to deliver superior performance. Despite the outstanding peak performance of heterogeneous clusters, it is very challenging to achieve high sustained performance on such platforms, especially for many real-life applications. Therefore, we plan to take exact inference as an example to explore task scheduling techniques on heterogeneous computing platforms.

CPU-GPGPU parallel computing systems: In this thesis, we have studied parallel algorithm design for exact inference on various parallel computing platforms.
The next goal is to study the coordination and task allocation across these platforms of different architectures. Note that the heterogeneity comes from various levels. For example, the compute nodes in a cluster can be installed with different types of processors; this forms one level of heterogeneity. The next level of heterogeneity exists within each compute node. If a compute node consists of both a CPU and a GPGPU, we must carefully assign tasks to the two distinct architectures. It is known that a GPGPU provides massively concurrent threads, but only very limited synchronization mechanisms can be supported among the threads. This requires that the scheduler be able to identify the type of a given task. If a task involves sufficient data parallelism and the execution over these data is independent, then the scheduler should allocate the task to the GPGPU instead of the CPU. In addition, when a task is assigned to the GPGPU, it generally involves data transfers. This is another challenge for exact inference on GPGPUs, since the bandwidth between the main memory and the device memory is limited. Thus, several techniques should be examined for such a compute node, e.g., asynchronous data transfers, host memory mapping, etc. In summary, it is promising to study task scheduling on CPU-GPGPU platforms and then extend such techniques to various heterogeneous computing systems.

Adaptive data organization with dynamic remapping: Optimization of data organization leads to efficient task execution. However, even for the same application, the optimized data organization can differ across architectures. This remains a challenge for scheduling tasks on heterogeneous platforms, such as CPU-GPGPU systems. In many scenarios, we must allocate tasks to either the CPU or the GPGPU dynamically. Therefore, it is very difficult, if not impossible, to organize task data efficiently before the tasks are allocated to a specific processor. A possible approach to this problem is adaptive data organization, in which the data associated with a task is stored in an intermediate format. During task scheduling, once the scheduler allocates a task to a processor (e.g., the CPU or GPGPU), the intermediate organization of the task data is translated into a new format optimized for the target architecture. Such a translation must be very efficient and, ideally, overlapped with the execution of other tasks. For future research, it is important to design such an intermediate data organization for heterogeneous systems. It is also critical to study the translation of such an intermediate organization into formats optimized for specific architectures.

Task scheduling on heterogeneous clusters: In this thesis, the schedulers worked on multicore/manycore processors. A higher level task scheduler is required to assign tasks across the various heterogeneous compute nodes in a cluster. The motivation is straightforward: since we have studied parallel algorithm design for exact inference on various parallel computing platforms, we need a task scheduler that automatically assigns tasks to compute nodes of different types, based on the characteristics of the tasks. This scheduler is different from those we have discussed in this thesis, since it works in a distributed memory environment, where the startup latency in communication among compute nodes is significantly higher.
In addition, because of the possible performance penalty, the scheduler on a cluster should not cause much potential table transfer, which results in a challenge in load balancing. To keep the load balanced across the compute nodes, an advanced scheduler for heterogeneous clusters should identify the characteristics of a task and assign it to a compute node on which the task can be performed efficiently. For example, a task with rich data parallelism can be assigned to a compute node with GPGPUs. Therefore, various techniques for task scheduling on heterogeneous clusters should be studied in the future. Such techniques will play a significant role in performing DAG structured computations on modern supercomputers consisting of various types of processors.

References

[1] AMD Opteron processors, http://www.amd.com/amd64ecosystem/.
[2] High Performance Computing and Communications, http://www.usc.edu/hpcc/.
[3] IBM BladeCenter QS20, http://www.ibm.com/technology/splash/qs20/.
[4] IBM Cell BE programming tutorial, http://www.ibm.com/developer/power/cell/.
[5] Intel Open Source Probabilistic Networks Library, http://www.intel.com/technology/computing/pnl/.
[6] Intel Xeon processors, http://www.intel.com/.
[7] San Diego Supercomputer Center, http://www.sdsc.edu/.
[8] Top500 Supercomputer Sites, http://www.top500.org/.
[9] David Bader, Virat Agarwal, Kamesh Madduri, and Seunghwa Kang. High performance combinatorial algorithm design on the Cell Broadband Engine processor. Parallel Computing, 33:720-740, 2007.
[10] Pieter Bellens, Josep Perez, Rosa Badia, and Jesus Labarta. CellSs: a programming model for the Cell BE architecture. In Proceedings of the ACM/IEEE Supercomputing 2006 Conference, 2006.
[11] R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou. Cilk: An efficient multithreaded runtime system. Technical report, Cambridge, 1996.
[12] Albert Chan, Frank Dehne, Prosenjit Bose, and Markus Latzel. Coarse grained parallel algorithms for graph matching. Parallel Computing, 34(1):47-62, 2008.
[13] Charm++ programming system. http://charm.cs.uiuc.edu/research/charm/.
[14] D. Comer and L. Peterson. Network Systems Design Using Network Processors. Prentice Hall, Upper Saddle River, NJ, USA, 2003.
[15] Ananth Grama, George Karypis, Vipin Kumar, and Anshul Gupta. Introduction to Parallel Computing (2nd Edition). Addison Wesley, January 2003.
[16] Message Passing Interface. http://www.mcs.anl.gov/mpi/.
[17] Takehiro Ito, Takeaki Uno, Xiao Zhou, and Takao Nishizeki. Partitioning a weighted tree to subtrees of almost uniform size. In Proceedings of the 19th International Symposium on Algorithms and Computation, pages 196-207, 2008.
[18] Michael Iverson and Fusun Ozguner. Dynamic, competitive scheduling of multiple DAGs in a distributed heterogeneous environment. In HCW '98: Proceedings of the Seventh Heterogeneous Computing Workshop, page 70, 1998.
[19] J. M. Bjørndalen, O. J. Anshus, B. Vinter, and T. Larsen. Configurable collective communication in LAM-MPI. In Proceedings of Communicating Process Architectures, pages 133-143, 2002.
[20] Tommi S. Jaakkola and Michael I. Jordan. Variational probabilistic inference and the QMR-DT network. Journal of Artificial Intelligence Research, 10(1):291-322, 1999.
[21] Joseph JáJá. An Introduction to Parallel Algorithms. Addison-Wesley, Reading, MA, USA, 1992.
[22] T. Kloks. Treewidth: Computations and Approximations. Springer-Verlag, Berlin, 1994.
[23] Alexander V. Kozlov and Jaswinder Pal Singh. A parallel Lauritzen-Spiegelhalter algorithm for probabilistic inference. In Supercomputing, pages 320-329, 1994.
[24] Yu-Kwong Kwok and Ishfaq Ahmad.
Static scheduling algorithms for allocating directed task graphs to multiprocessors. ACM Computing Surveys, 31(4):406-471, 1999.
[25] S. L. Lauritzen and D. J. Spiegelhalter. Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society B, 50:157-224, 1988.
[26] Wenheng Liu, Cho-Li Wang, and Viktor K. Prasanna. Portable and scalable algorithm for irregular all-to-all communication. Journal of Parallel and Distributed Computing, 62(10):1493-1526, 2002.
[27] Blackford Middleton, Michael Shwe, David Heckerman, Harold Lehmann, and Gregory Cooper. Probabilistic diagnosis using a reformulation of the INTERNIST-1/QMR knowledge base. Methods of Information in Medicine, 30:241-255, 1991.
[28] Kevin Murphy. Bayes Net Toolbox, http://www.cs.ubc.ca/~murphyk/software/BNT/bnt.html.
[29] Vasanth Krishna Namasivayam, Animesh Pathak, and Viktor K. Prasanna. Scalable parallel implementation of Bayesian network to junction tree conversion for exact inference. In Proceedings of the 18th International Symposium on Computer Architecture and High Performance Computing, pages 167-176, 2006.
[30] Vasanth Krishna Namasivayam and Viktor K. Prasanna. Scalable parallel implementation of exact inference in Bayesian networks. In Proceedings of the 12th International Conference on Parallel and Distributed Systems, pages 143-150, 2006.
[31] OpenMP Application Programming Interface. http://www.openmp.org/.
[32] David Pennock. Logarithmic time parallel Bayesian inference. In Proceedings of the 14th Annual Conference on Uncertainty in Artificial Intelligence, pages 431-438, 1998.
[33] Ross D. Shachter, Stig K. Andersen, and Peter Szolovits. Global conditioning for probabilistic inference in belief networks. In Proceedings of the Tenth Conference on Uncertainty in Artificial Intelligence, pages 514-522, 1994.
[34] Denis Sheahan. Developing and tuning applications on UltraSPARC T1 chip multithreading systems. Technical report, 2007.
[35] Changchun Shi. Floating-point to fixed-point conversion. PhD thesis, University of California, Berkeley, CA, USA, 2004.
[36] Yinglong Xia, Xiaojun Feng, and Viktor K. Prasanna. Parallel evidence propagation on multicore processors. In The 10th International Conference on Parallel Computing Technologies, pages 377-391, 2009.
[37] Yinglong Xia and Viktor K. Prasanna. Node level primitives for parallel exact inference. In Proceedings of the 19th International Symposium on Computer Architecture and High Performance Computing, pages 221-228, 2007.
[38] Yinglong Xia and Viktor K. Prasanna. Junction tree decomposition for parallel exact inference. In IEEE International Parallel & Distributed Processing Symposium (IPDPS), 2008.
[39] Yinglong Xia and Viktor K. Prasanna. Node level computation kernels for parallel exact inference. Technical Report CENG-2009-7, University of Southern California, 2009.
[40] Yinglong Xia and Viktor K. Prasanna. Collaborative scheduling of DAG structured computations on multicore processors. In CF '10: Proceedings of the 7th ACM International Conference on Computing Frontiers, pages 63-72, 2010.
[41] Yinglong Xia and Viktor K. Prasanna. Scalable node-level computation kernels for parallel exact inference. IEEE Transactions on Computers, 59(1):103-115, 2010.
[42] Hongyuan Zha, Xiaofeng He, Chris Ding, Horst Simon, and Ming Gu. Bipartite graph partitioning and data clustering. In Proceedings of the 10th International Conference on Information and Knowledge Management, pages 25-32, 2001.
Abstract
Probabilistic graphical models such as Bayesian networks and junction trees are widely used to compactly represent joint probability distributions. They have found applications in a number of domains, including medical diagnosis, credit assessment and genetics. The computational complexity of exact inference, a key problem in exploring probabilistic graphical models, increases dramatically with the density of the network, the clique width and the number of states of the random variables. In many cases, exact inference must be performed in real time.