MARKER PROPAGATION NETWORKS: A PRACTICAL PARALLEL PROCESSING APPROACH

by

Changhwa Lin

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Computer Engineering)

August 1994

Copyright 1994 Changhwa Lin

UMI Number: DP22889

All rights reserved

INFORMATION TO ALL USERS
The quality of this reproduction is dependent upon the quality of the copy submitted. In the unlikely event that the author did not send a complete manuscript and there are missing pages, these will be noted. Also, if material had to be removed, a note will indicate the deletion.

Dissertation Publishing

UMI DP22889
Published by ProQuest LLC (2014). Copyright in the Dissertation held by the Author.
Microform Edition © ProQuest LLC. All rights reserved.
This work is protected against unauthorized copying under Title 17, United States Code.

ProQuest LLC
789 East Eisenhower Parkway
P.O. Box 1346
Ann Arbor, MI 48106-1346

UNIVERSITY OF SOUTHERN CALIFORNIA
THE GRADUATE SCHOOL
UNIVERSITY PARK
LOS ANGELES, CALIFORNIA 90007

This dissertation, written by CHANGHWA LIN under the direction of his Dissertation Committee, and approved by all its members, has been presented to and accepted by The Graduate School, in partial fulfillment of requirements for the degree of DOCTOR OF PHILOSOPHY.

Dean of Graduate Studies

Date

DISSERTATION COMMITTEE

Chairperson

Dedication

To my family and my teachers, who teach me to learn from life.

Acknowledgements

I am deeply grateful to my advisor Professor Dan Moldovan for his guidance, support and encouragement throughout my graduate studies at the University of Southern California.
He has been a constant source of the ideas and insights on which this dissertation rests; his incessant enthusiasm for research and learning, along with his friendship and advice, have made my studies enjoyable. I thank Professor Michael Arbib and Professor Michel Dubois for serving on my defense committee. I sincerely appreciate the time and guidance they provided in the completion of my dissertation. I also thank Professor George Bekey, Professor Jean-Luc Gaudiot and Professor Victor Prasanna for being on my guidance committee. Their advice and criticism have greatly strengthened this work. I wish to thank Ron Demara, Steve Kowalski, Steve Kuo, Wing Lee, Chinyew Lin and Adrian Moga for their insightful comments. Many of the ideas originated through long discussions with them. I thank Hiroaki Kitano for providing valuable feedback during the early stages of SNAP. I would like to thank Min-hwa Chung and Sang-Hwa Chung for lending me their applications to study and benchmark. I also thank them and Ron Demara for providing me with results from the SNAP-1 prototype. Many thanks are due to everyone who contributed to the SNAP-2 prototype: Hasan Akhter, Chinyew Lin, Adrian Moga, Traian Mitrache and Mihai Petrescu. I would like to express my gratitude to Dawn Ernst and Lucille Stivers for their patient proofreading. I would also like to mention my friends Rong-Feng Chung, Kung-Shiuh Huang, Jih-Cheng Liu, Weihua Mao, Tara Poppen, Hirendu Vaishnav and Hwang-Cheng Wang, who have made my life enjoyable during my studies. I would like to acknowledge the continuous support of National Science Foundation grants MIP-89-02426 and MIP-90-09109. Most of all, I thank my parents Chun-Mei and Tse-Long Lin. All my life they have instilled in me the value of education. They encouraged me to work hard and to study. They were always there when I needed them. I thank my wife Huiling for her unwavering patience and support. To them, I dedicate this thesis.
Contents

Dedication
Acknowledgements
List Of Tables
List Of Figures
Abstract

1 Introduction
  1.1 Motivation
  1.2 Toward a Practical Parallel Processing Model for Marker Propagation Approach
  1.3 Contribution
  1.4 Dissertation Organization

2 Marker Propagation Model
  2.1 Memory Model and Spreading Activation
  2.2 Knowledge Base
    2.2.1 Knowledge Representation
    2.2.2 Concept Sequence
  2.3 Inference with Marker Propagation
  2.4 Related Works
    2.4.1 Marker Propagation Systems
    2.4.2 Hardware for Marker Propagation Systems

3 Parallel Processing Applied to Marker Propagation
  3.1 Parallelizing Marker Propagation with the SIMD Model
    3.1.1 Knowledge Base Partitioning
    3.1.2 Global Operations
    3.1.3 Parallelizing Marker Propagation
    3.1.4 SNAP-1 Marker Propagation System
  3.2 Parallelism and Data Dependency Analysis 36
  3.3 Parallel Marker Propagation Analysis 43
  3.4 Summary 45

4 Marker Propagation Networks 46
  4.1 Requirements for a Practical Parallel Marker Propagation Model 46
  4.2 Marker Propagation Networks 52
    4.2.1 Process Flow 52
    4.2.2 Marker Propagation with Process Flow 53
  4.3 Implementation Techniques 56
    4.3.1 Knowledge Base 57
    4.3.2 Process Flow Marker Propagation Implementations 59
    4.3.3 Issues on Synchronization 65

5 SNAP-2 Parallel Marker Propagation Networks Prototype 71
  5.1 System Overview 71
    5.1.1 Prototype Design Goals 73
    5.1.2 Knowledge Base Capacity 75
    5.1.3 Processing Models 75
    5.1.4 Hardware for Global Synchronization Detection 76
    5.1.5 Interconnection Network Topologies 78
  5.2 SNAP-2 Prototype Hardware Design 79
    5.2.1 System Host 79
    5.2.2 Processing Elements 81
    5.2.3 Interconnection Network Implementation 83
      5.2.3.1 Host-Array Communication Network 83
      5.2.3.2 Synchronization Network 84
      5.2.3.3 Inter-Processor Communication Network 86
  5.3 Summary 88

6 Software Environments for SNAP-2 89
  6.1 Software Architecture Overview 89
  6.2 System Software with Runtime Optimization 92
    6.2.1 Memory and Knowledge Base Management 92
    6.2.2 Process Management 98
  6.3 Programming Tools 101
    6.3.1 SNAP-2 Programming Environment 101
    6.3.2 Performance Data Gathering 104
    6.3.3 Parallel Debugger Implementation 106
  6.4 Summary 109

7 Performance Studies 110
  7.1 Synchronization 110
  7.2 Memory Buffering 113
  7.3 Performance Results with SNAP-2 System 115

8 Conclusions 119
  8.1 Major Findings 119
  8.2 Future Research 122

Appendix A
  SNAP-2 Prototype Processing Element Card 125

Appendix B
  SNAP-2 Hardware and Software Availability 131

List Of Tables

4.1 Execution time distribution with SNAP-1 processing model 48
4.2 Path-rules for one-hop propagation only 61
4.3 Multi-hop path-rules provided by marker propagation networks 62
5.1 SNAP-2 parallel processor prototype design objectives 74
5.2 SNAP-2 Specifications 80
7.1 Number of messages required for the different synchronization protocols, and the reduction in message traffic using tiered reporting 111

List Of Figures

2.1 Semantic networks knowledge object example 10
2.2 Concept sequence example: seeing-event 11
2.3 The knowledge base of a party 13
2.4 The three main phases of the marker propagation algorithm 13
2.5 Translating the sentence "John is at USC" into semantic networks objects 14
2.6 The semantic networks for Clyde 17
3.1 Moving a semantic node from one processing element to another 23
3.2 Simplified marker table and class table 24
3.3 Example of parallel relation creation 28
3.4 Example of parallel relation deletion 29
3.5 SNAP-1 Prototype 33
3.6 The marker propagation algorithm program flow 37
3.7 Knowledge base distribution for speedup evaluation 43
4.1 Parallel execution time of marker propagation applications on the SIMD machine and SNAP-1 system 47
4.2 Propagation path examples 49
4.3 Augmented propagation path example 51
4.4 a = (b + c) * (d - e) with control flow approach 53
4.5 a = (b + c) * (d - e) with data flow approach 54
4.6 a = (b + c) * (d - e) with process flow approach 54
4.7 Marker propagation critical path 57
4.8 Semantic networks knowledge base after height reduction 58
4.9 Dynamic Process Creation 69
5.1 The SNAP-2 system organization 72
5.2 Hardware synchronization mechanism for nondeterministic algorithm 77
5.3 The SNAP-2 parallel processor prototype 80
5.4 The block diagram of the Processing Element 82
5.5 The SNAP-2 host-array communication network 85
5.6 Synchronization wired-AND tree in SNAP-2 prototype 86
5.7 Tiered synchronization protocol for SNAP-2 prototype 87
6.1 SNAP-2 Programming Environment Overview 91
6.2 The memory buffer for fixed size memory request call 94
6.3 Knowledge objects structures 96
6.4 Knowledge object membership interlinks 97
6.5 The Priority Queues for SNAP-2 processes 100
6.6 The marker propagation networks application compiled with the SNAP-C compiler 103
6.7 The SNAP-2 parallel debugger 108
7.1 Benchmark #1: Teaching Assistant (TA) 111
7.2 Benchmark #2: Inheritance Mesh 112
7.3 Local synchronization message growth 112
7.4 The performance with memory buffering scheme 114
7.5 The performance for memory buffering with scale-up knowledge base 114
7.6 The execution time of PASS on the SNAP-2 prototype 116
7.7 The speedup for the PASS program on SNAP-2 prototype 116
7.8 Input message example for PARALLEL with MUC5 domain 117
7.9 The execution time of PARALLEL on SNAP-2 prototype 118
A.1 The SNAP-2 Processor Board: Static RAM Interface 126
A.2 The SNAP-2 Processor Board: Dynamic RAM Interface and Global Bus 127
A.3 The SNAP-2 Processor Board: Communication Ports 128
A.4 The SNAP-2 Processor Board: Connectors and Emulator Interface 129
A.5 The SNAP-2 Processor Board: Global Bus, Interrupts and Handshake Signals 130

Abstract

The marker propagation paradigm was developed for artificial intelligence applications with a semantic networks knowledge base. It is a promising parallel approach for many artificial intelligence applications. However, marker propagation models have not yet been widely accepted due to their ill-defined parallel processing model and sequential machine implementations. The goal of this research is to develop a parallel marker propagation model from a practical parallel processing perspective. The approach begins with an analysis of the parallelism in the operations required for a marker propagation program. Based on the analysis, we develop a parallel marker propagation model, namely marker propagation networks. We study the major factors that affect the performance of the marker propagation networks, and propose practical solutions. A parallel processor prototype is implemented based on the requirements for the marker propagation networks. The proposed solutions are built into the prototype to prove their correctness.
We also implemented a set of programming tools, including performance data gathering tools and a parallel program debugger, to make the prototype a complete marker propagation networks application development system.

Chapter 1

Introduction

1.1 Motivation

This research is stimulated by the search for a parallel marker propagation model which is simple in concept to map onto a parallel processor architecture, yet efficient in execution to utilize the extensive computation power. Therefore, we first study the requirements for a parallel marker propagation implementation, then develop a parallel marker propagation model from a practical parallel processing perspective. We also implement a parallel processor prototype to validate the proposed model.

1.2 Toward a Practical Parallel Processing Model for the Marker Propagation Approach

Current research in artificial intelligence has focused on solving problems in specific domains. The slow execution problem exists in all artificial intelligence applications due, in part, to the complex nature of artificial intelligence and the size of the knowledge base required for solving the problem. Part of the problem is their sequential processing model and the usage of a sequential computer. Applications, such as natural language understanding, require huge knowledge bases and extensive computational power. The solution is to employ state-of-the-art parallel processor systems that can offer sufficient memory space and processing power. The marker propagation paradigm occupies a prominent place in parallel artificial intelligence applications with semantic networks. Marker propagation algorithms have been developed for the areas of knowledge classification [18], graph unification [19], planning [13] [10], robotics, and natural language processing [20] [37] [33].
However, most of the research has focused on problem solving with the marker propagation approach. To date, most implementations of marker propagation algorithms have employed simulation programs on sequential machines [39]. The implementations of the marker propagation algorithms on sequential machines are slow. For example, an experimental marker propagation program for the real-time speech translation system built at CMU, called ΦDmDialog, took 30 seconds to process a simple sentence. Clearly, for marker propagation applications to be widely used, the processing time must be reduced. Several attempts were made to put marker propagation on parallel machines, such as the original Connection Machine [8] and IXM2 [22], but these attempts have not been successful. One major problem with these implementations is that the hardware does not meet the requirements of the marker propagation approach. For example, with the original Connection Machine, the local memory in each node is very small. For a knowledge base with a large node structure, the processor local memory may not be enough to hold the entire information of a single node. Since the knowledge node to processor element mapping is one to one, the communication overhead is too high to take advantage of the parallel processor power. This is because the system design ignores the factors that affect performance, such as synchronization and communication overheads. Most of all, the problem with the marker propagation approach is that the existing models are not defined from a practical parallel processing perspective. For example, most of the current marker propagation models do not limit marker propagation satisfactorily. In Spreading Activation [36], markers propagate along a preset number of links. However, this semantic distance is chosen arbitrarily and does not guarantee that the correct concepts are marked.
In addition, markers are sent along many unnecessary paths, thus increasing both message bandwidth and processing requirements. Others have proposed modifications to spreading activation. Charniak [2] allowed activation to decay over time. Norvig [33] suggested assigning lower activation strengths to those markers farther away from the source. Wu [44] proposed another approach that applied connectionist techniques to marker propagation by restricting marker movement according to probabilities and thresholds. However, arriving at the proper probability and threshold values for a large knowledge base is a difficult task. Moldovan [28] created several propagation rules that the user could use to direct marker propagation. But more elaborate and complex marker movements had to be built using the basic rules, resulting in a loss of parallelism. Yu [45] proposed a simple marker propagation programming model with marker tables, where marker movement is controlled by the tables in each semantic node that direct the future propagation of the marker. However, the marker table allows the definition of only one path for a marker type and is too restrictive to permit the easy creation and modification of marker propagation methods. Another problem with the current marker propagation models is that they provide only simple operations, such as tagging a node with a marker, for the marker propagation process. The processing time per operation is too small to utilize the parallel processor system because of the parallel processing overhead. One of the major problems in running marker propagation algorithms on parallel machines is that they generate a huge amount of markers. This is called the marker overrun problem. In DMAP and ΦDmDialog, markers need to be sent along the semantic nodes that make up all of the possible interpretations of a sentence.
In general, at the beginning of input data processing, many hypotheses in the semantic networks are possible interpretations. Therefore, with a very large knowledge base, many markers must be passed to mark the possible interpretations. Without a good resource manager, the communication channels on a parallel computer may quickly become saturated. Similarly, both unification and classification require comparing a subgraph with every other subgraph in the semantic networks, generating markers for most of the nodes in the system.

The problems with current parallel marker propagation approaches can be summarized as follows:

• the current model was defined without a good parallel processing perspective,
• the requirements for a practical parallel approach were not identified,
• the current model makes wrong assumptions about the parallel processing overhead,
• the knowledge base and inference mechanism implementations, such as marker propagation, were not efficient and were hard to maintain,
• the programming model was not formalized and each marker propagation system has its own implementation.

In this dissertation, we present a parallel marker propagation model called the marker propagation networks (MPN) to solve the problems described. The MPN is derived from a practical parallel processing perspective and can be mapped onto advanced general-purpose parallel processor systems, such as the CM5 and Paragon. We started by studying the processing requirements for marker propagation applications. The bottlenecks of the current marker propagation model were then identified. The marker propagation model was then modified to eliminate some bottlenecks, and other solutions were proposed to solve the rest of the problems. We also propose a parallel processor system, SNAP-2, and implement a prototype, which is also provided as an MPN application development platform. The design tradeoffs for the SNAP-2 prototype are also discussed.
We also present some special hardware implementations for supporting fast synchronization and broadcasting. The runtime support and some implementation issues for the programming tools are then described. For the performance comparison, we present the performance improvement obtained with the solutions for some critical implementation problems with the marker propagation program. We then show the application performance improvements with the parallel processing approach.

1.3 Contribution

The contribution of this dissertation is summarized as follows:

• Define a parallel marker propagation model, namely the Marker Propagation Networks (MPN), from a practical parallel processing perspective.
• Define the programming and execution model for the MPN model.
• Identify the requirements for parallel marker propagation.
• Isolate the bottlenecks of the parallel marker propagation approach.
• Propose practical solutions for the bottlenecks.
• Implement a parallel processor system prototype to validate the solutions.
• Provide the programming tools for MPN program development.
• Develop parallel function libraries and runtime optimization support.
• Devise performance data gathering tools for parallel program tuning.

1.4 Dissertation Organization

This thesis is organized as follows: Chapter 2 provides an introduction to the formal marker propagation model and its implementations. Chapter 3 discusses the SIMD parallel approaches to the marker propagation model, along with the parallelism and data dependency analysis as well as a speedup study. This chapter concludes with a summary of problems with the SIMD marker propagation approaches. Chapter 4 presents the key issues for improving performance and the marker propagation networks. Chapter 5 describes the SNAP-2 parallel processor system and the design and implementation of the SNAP-2 prototype, which was designed to be a marker propagation application development platform.
The architectural features developed and the design decisions based on the results of Chapter 3 and Chapter 4 are discussed. Chapter 6 presents the software environment for SNAP-2, including runtime support, performance data gathering tools and a parallel debugger. Chapter 7 shows some performance studies and comparisons. Chapter 8 concludes this dissertation with some major findings and an outline for future work.

Chapter 2

Marker Propagation Model

A marker propagation application requires a knowledge base and an inference mechanism. Generally, the knowledge base contains only specific domain knowledge and is configured in a hierarchical manner. The main inference mechanism in a marker propagation application is to propagate markers through the knowledge base. The input data is translated into knowledge objects where the marker propagation starts; the markers are then moved around the knowledge base. New objects are generated after each wave of marker propagation, and valid results are extracted from the final wave of propagation. This chapter first presents the basic marker propagation model, then shows how other researchers have approached the implementation.

2.1 Memory Model and Spreading Activation

The marker propagation model was derived for knowledge processing with a semantic networks knowledge base. The idea of passing markers as an inference mechanism was first developed by Quillian [36]. There are two major components required for Quillian's Spreading Activation approach: a memory model, which is a semantic networks knowledge base, and a simulation program, which is a marker propagation program. The purpose of the memory model is to formulate human memory with semantic networks, while the marker propagation program replies to the input data by retrieving the data stored in the memory.
The semantic networks memory model consists basically of a mass of nodes interconnected by different kinds of links. A word concept, such as plant, is composed of a set of nodes and links. For example, the plant concept contains a node, namely food, which connects with other concepts showing that the plant requires air, water, and earth for food. In this example, air, water, and earth are nodes and may themselves also be concepts. In the memory model, these concepts comprise planes in the memory. In spreading activation, the marker propagation program provides an inference mechanism to compare and to contrast word pairs. The program requires three steps of processing. First, the two words are translated into memory objects and are placed in the semantic networks memory. The program then simulates the gradual activation of each concept outward through the links originating from each input data node. By moving out along the links, the program tags each node encountered with a special two-part tag, namely an activation tag, which has to be filled by both propagations to indicate an intersection between the two propagations. Upon detecting an intersection, the propagation path is retrieved. The last step is to process the retrieved paths to produce output. The idea of spreading activation has also been employed in other fields, such as psychology [6]. However, there was no research to implement spreading activation in parallel hardware until Fahlman [9] proposed a modified marker propagation mechanism, which will be described in this chapter. In the next two sections, we elaborate on the two modules required for the marker propagation approach as well as formal research on improving these modules.

2.2 Knowledge Base

The design of the knowledge base in the marker propagation algorithm is very important, as the knowledge base affects the efficiency, accuracy and robustness of the algorithm.
We like to call the knowledge base the meta-program, as the inference algorithm has to be incorporated with the knowledge base structure.

2.2.1 Knowledge Representation

The semantic networks knowledge representation has two important features:

1. the problem domain can be adequately represented by the semantic networks, and
2. the semantic networks knowledge base can be easily mapped onto parallel hardware.

The semantic networks knowledge base is constructed as a directed graph. The nodes of the graph represent concepts, while the links, or relations, represent relationships between the concepts. Besides relations, groups of nodes can be classified according to their special features. Such groups can be represented by a class notation. At runtime, a marker can be used for temporarily identifying a group of special features. The selection of knowledge object granularity and the knowledge base construction, in general, affect the overall performance. Figure 2.1 depicts the concept elephant. The node elephant is related to other nodes via various types of links, namely relations. The is-a relation in this example shows the inheritance relation between the elephant and the mammal. The inheritance relation defines that an elephant has all the features a generic mammal has. With the concept of inheritance, redundant data can be eliminated.

2.2.2 Concept Sequence

The knowledge base must include real-world information for data searching and retrieving. However, to process real-world problems, certain processing algorithms have to be provided for data processing. In the traditional programming model, the processing algorithms are coded into the main program. For a huge knowledge base, such an approach is not appropriate because the code will grow as the knowledge base size increases. The memory-based processing approach is a programming model that encodes the processing procedure into the knowledge base.
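The directed-graph representation of Section 2.2.1 can be sketched in a few lines of code. This is a minimal illustration, not the dissertation's actual data structures: the node and relation names follow Figure 2.1, and the features query shows how the is-a inheritance relation lets the elephant node omit data already stored at mammal.

```python
class Node:
    """One concept in the semantic network. `links` maps a relation
    type (e.g. "is-a", "has-part") to a list of target nodes."""
    def __init__(self, name):
        self.name = name
        self.links = {}

    def add_link(self, relation, target):
        self.links.setdefault(relation, []).append(target)

    def features(self, relation="has-part"):
        """Collect targets of `relation`, inheriting along is-a links."""
        found = [t.name for t in self.links.get(relation, [])]
        for parent in self.links.get("is-a", []):
            found += parent.features(relation)
        return found

# Build the fragment of Figure 2.1 around elephant and mammal.
mammal, elephant = Node("mammal"), Node("elephant")
for part in ("eyes", "teeth"):
    mammal.add_link("has-part", Node(part))
elephant.add_link("is-a", mammal)
elephant.add_link("has-part", Node("tusk"))

print(elephant.features())   # -> ['tusk', 'eyes', 'teeth']
```

The eyes and teeth parts are stored only once, at mammal, yet the query on elephant recovers them through the is-a link, which is the redundancy elimination described above.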
Figure 2.1: Semantic networks knowledge object example.

Figure 2.2 depicts a program module called a concept sequence, which is designed for Natural Language Processing. The concept sequence used in memory based processing was introduced in DMAP [37]. The knowledge base is represented as sentence fragments that are coded as concept sequences and stored in the semantic networks. Figure 2.2 shows a seeing-event concept sequence. The knowledge base is divided into several layers. At the top is the concept sequence root (CSR), which identifies the type of interpretation being represented, in this case, a seeing-event. Below the root are the concept sequence elements (CSE), which identify the components of the concept sequence. The concept sequence root and concept sequence elements form a concept sequence, which is a phrase pattern designed to fit several input sentences. In our example, the concept sequence for a seeing-event is defined as an experiencer, SEE, a determiner, and an object.

Figure 2.2: Concept sequence example: seeing-event.

The next level is a semantic networks of concepts which serves as a connecting layer that bridges the lexical entries with the concept sequence elements. It consists of a knowledge hierarchy formed by is-a relations.
For example, in Figure 2.2, this layer connects “john” to experiencer via the nodes c-john, person and animate. The lexical layer consists of the individual words that comprise the sentence. The final layer is the concept sequence instance (CSI) and contains the result of parsing. In the example “John saw a log.” it shows that “john” is an “experiencer” and “a log” is an object of a seeing-event.

2.3 Inference with Marker Propagation

Marker passing is the major operation in the marker propagation model. Formal marker propagation approaches treat markers as data patterns associated with knowledge objects. They represent properties of objects, and membership in different classes. In the memory based processing model, the markers reflect the state of the processing as they travel through the concept sequence. The basic inference using marker propagation is to detect the set intersection between a given pattern and a knowledge base. For example, given the semantic networks shown in Figure 2.3(a), the query is to find who is a female and is a party goer. Two different markers will propagate from the female and human-in-party nodes. The nodes which receive both markers are candidates for the query. The paths will then be evaluated to decide the correct answer. The procedure is discussed below. In general, a marker propagation algorithm consists of three different inference tasks: input translation, marker propagation and evaluation of results, as shown in Figure 2.4.

Figure 2.3: The knowledge base of a party.

Figure 2.4: The three main phases of the marker propagation algorithm.

Input Translation

The input translation is normally a sequential process, which translates the input data to a semantic networks representation. The processing time is assumed to be very small in comparison with the other two tasks. A simple example of input translation is shown in Figure 2.5: a sentence is given as input and the translation creates a virtual copy of the concept of each input word. In the party example, the input translation is omitted.

Figure 2.5: Translating the sentence John is at USC into semantic networks objects.

Marker Propagation

There are two different propagation schemes: spreading activation and path-directed marker propagation.

• Spreading Activation [36] Spreading activation assumes that the marker propagation should propagate along all paths. For each node with a marker intersection from different source nodes, the propagation paths from both source nodes to the intersection point are extracted to form a marker collision path. The path evaluator then evaluates the marker collision paths to determine valid paths for producing results. Most of the path evaluators for spreading activation use first order predicate logic for path evaluation. Probability [44] or strength propagation has been used in marker propagation to reduce the spreading area. Therefore, the a priori model dominates the size of a propagation area. For the party example, the marker from female will be propagated through r-isa to Mary, Penny, Susan, and Linda. Then the markers received by Linda and Susan will be propagated through r-role to human-in-party, and a marker will go from Linda to Joe, which is a wrong answer.
There is a cycle problem, which occurs when the marker received by human-in-party tries to mark Susan, Linda and Joe. The result of propagation is shown in Figure 2.3(b).

• Path-Directed Marker Propagation [9] Path-directed marker propagation differs from spreading activation in the sense that certain patterns of the marker path are imposed because of the hierarchical semantic networks. Therefore, for certain inferences, an abstract marker propagation path can be pre-determined. In this way, the marker will cover a much smaller area than spreading activation does. Another advantage of this approach is that the complexity of the path evaluator is reduced. In the party example, for the path-directed approach, we can set up the path for both markers to just reach the collision node. The result of propagation is shown in Figure 2.3(c).

Evaluation

The intersections of the markers indicate the possibility of valid solutions. However, to prove that solutions are valid, it is important to show that the paths of propagation are meaningful. The evaluation phase is a sequential process in early marker propagation systems. In the party example, the evaluation is not necessary; we just pick up the intersection. For the spreading activation approach, the path for the marker from female should go through r-isa, and therefore the Joe concept marked through the friend relation is not a correct answer. Several iterations of the marker propagation and evaluation phases may be needed to get a correct answer. For example, in the spreading activation approach with strength control, if the strength is not enough to reach a solution, another propagation would have to be started from the current state. In the path-directed approach, some algorithms require hypothetical creation of temporary results in the knowledge base and then repeating the propagation and evaluation phases again.
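The party query can be sketched in path-directed form: each marker is allowed to travel only along a prescribed relation sequence, so the friend link is never followed and no path evaluator is needed. The reverse-link encoding (r-isa, r-role pointing from a concept down to its members) is an illustrative assumption, not the dissertation's data structure.

```python
# Hypothetical encoding of the party knowledge base of Figure 2.3.
# Reverse links (r-isa, r-role) point from a concept down to its members.
kb = {
    "female":         {"r-isa": ["Mary", "Penny", "Susan", "Linda"]},
    "human-in-party": {"r-role": ["Susan", "Linda", "Joe"]},
    "Linda":          {"friend": ["Joe"]},
}

def propagate(source, path, marks, marker):
    """Path-directed propagation: the marker may only travel along the
    prescribed relation sequence, one relation type per hop."""
    frontier = [source]
    for relation in path:
        frontier = [dst for n in frontier
                        for dst in kb.get(n, {}).get(relation, [])]
        for n in frontier:
            marks.setdefault(n, set()).add(marker)

marks = {}
propagate("female", ["r-isa"], marks, "M1")
propagate("human-in-party", ["r-role"], marks, "M2")
answers = sorted(n for n, m in marks.items() if m == {"M1", "M2"})
print(answers)   # Joe receives only M2, so it is excluded without evaluation
```

Because the friend relation is not in either prescribed path, Joe never collects both markers, which is the path-directed analogue of rejecting the wrong answer of Figure 2.3(b).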
2.4 Related Works

2.4.1 Marker Propagation Systems

The marker propagation approach for solving knowledge processing problems was first proposed in Quillian's work [36] on semantic memory processing in the sixties. In his system, the spreading activation propagation was employed, and a path evaluator was used for checking valid paths. In this section, we describe several systems that employ the marker propagation model.

NETL

In the seventies, Fahlman [9] used marker propagation as an inference algorithm for the NETL system. The NETL system targets human knowledge representation using a semantic networks. The propagation scheme in the NETL system is the origin of path-directed marker propagation. For example, to get the role set of the node Clyde, the marker will go up through is-a links and go down one hop of the has-part link. This is shown in Figure 2.6.

WIMP

The Wimp system, developed by Charniak [2], is a spreading activation system for language comprehension. The path evaluation scheme is implemented using first-order predicate calculus. The propagation phase is improved by adding a zorch to the marker. The zorch is a marker datum which indicates the length of propagation. At the beginning of propagation, the zorch is assigned a pre-determined value, and the value is decreased for each hop of propagation. When the zorch falls below one, the marker will not be propagated any more. Path aggregation is done by associating the marker with a list of node-relation pairs of the traversed path.

Figure 2.6: The semantic networks for Clyde.

TRUE

Waltz [43] proposed using cooperative computation with spreading activation for text understanding of a static (no creation nor deletion of nodes) semantic networks.
The TRUE system, developed by Yu [45], employs a similar idea. The TRUE system is a path-directed marker propagation system for language understanding. The path evaluation phase of the TRUE system is associated with the marker propagation phase. Although the marker propagation scheme is used for the value passing, the evaluation scheme is the cooperative computation approach of converging the semantic networks to a stable state.

FAUSTUS

The FAUSTUS system, developed by Norvig [33], is a spreading activation system for story comprehension. The path evaluation scheme is achieved by providing a set of rules (an agenda of suggestions) of possible paths that represent correct forms of propagation paths. The marker paths are then evaluated by these rules.

SCRAPS

The SCRAPS system, developed by Hendler [13], uses spreading activation to find all of the paths in the semantic networks, and then evaluates each path to find the most appropriate one. The propagation phase and evaluation scheme are similar to the Wimp approach.

TRUCKER and RUNNER

The TRUCKER and RUNNER systems, developed by Hammond [10], employ path-directed marker propagation techniques to directly find the best path and the concept sequences encapsulating the existing plans for case-based reasoning.

Unification

A graph unification algorithm, developed by Kitano [21], finds the most general common subgraph that matches two distinct subgraphs. This is done by starting a marker in each of the subgraphs. As the markers visit the other nodes in the graph, node and relation information is continually attached to the marker. At the end, the marker information from each subgraph is sent to a common node, where the comparison is performed.
PARKA

In PARKA [18], the goal is to find the best place to insert a new concept in a hierarchical knowledge base. This is done by comparing the properties of the new concept with the properties of every other concept in the knowledge base. Markers are used to flag the property nodes of each concept. In addition, classification requires knowledge of the type of relation attached to each property node. This is done by assigning a different marker type to each of the relation types of the node and constraining those marker types to travel only on those relations.

DMAP

In the DMAP [37] and ΦDmDialog [20] style of natural language processing, path-directed marker propagation is employed. The concept sequence is used to encapsulate one possible interpretation of a sentence. As words come into the system, the concept sequences are used to guide the movement of markers. Additionally, in ΦDmDialog, arithmetic operations need to be performed at the destination node for the cost calculation used in sentence disambiguation.

2.4.2 Hardware for Marker Propagation Systems

In the eighties, several machines were implemented for marker propagation on semantic networks. The most famous one is the original Connection Machine. Another marker propagation machine, implemented in Japan, is called the IXM2.

Connection Machine as a Marker Propagation System

The Connection Machine was originally developed by Hillis [16] as an implementation of Fahlman's NETL. The Connection Machine is a fine-grained array processor with programmable connections between nodes. It consists of 64K single-bit processors, with each processor having 4K bits of memory and a serial ALU. The processors operate in SIMD fashion, with messages being the method of communication. The movement of markers in the Connection Machine is under user control. The Connection Machine executes instructions to process markers.
The drawback with this approach is that marker propagation must occur in the foreground. Consequently, the Connection Machine cannot perform any other functions during this time. Since it is a true SIMD system, all the nodes must execute the same instructions. In the majority of cases, only a small portion of the semantic networks participates in marker propagation, and the rest of the knowledge base is idle. In addition, because marker processing in the Connection Machine is software based, the Connection Machine can propagate only one type of marker at a time. Therefore, different types of markers require different sets of software functions which cannot be executed at the same time as the other marker operations. Another limitation of the Connection Machine is the single-bit processor. The information that is codified in the knowledge base requires many bits of data, and the processor handles both the processing and the communication serially. This is a serious bottleneck for the performance of the system. The new generation of Connection Machine, the CM5, has eliminated all the problems described above by providing a powerful processor and fast communication channels for each processing node.

IXM2

IXM2 is a massively parallel associative processor system developed by Higuchi [14] at the Electrotechnical Laboratory in Japan. IXM2 consists of 64 associative processors based on the IMS T800 transputer. Each processor has access to an associative memory of 4K x 40 bits and is connected to other processors through a network processor. Each network processor supports 8 associative processors in a star configuration. The 8 network processors are fully interconnected to permit two processors to communicate using a maximum of only 2 network processors. Processing on IXM2 is performed using single-bit markers and set operations. The associative memory in IXM2 provides fast bit marker operations.
A special marker propagation scheme, composed of a sequence of marker set operations, is also employed to exploit the associative memory. However, the marker propagation requires extensive control from the host and therefore greatly reduces the speedup obtained with the associative memory. Also, the IXM2 provides only bit operations for marker intersection, while some of the marker propagation algorithms, such as ΦDmDialog [22], require numeric calculation upon marker collision. Therefore, implementations of those marker propagation applications on the IXM2 are not possible.

Chapter 3

Parallel Processing Applied to Marker Propagation

In this chapter, we discuss several general parallel processing techniques applied to the marker propagation algorithm. We start with the SIMD approach because most of the parallel marker propagation algorithms are implemented based on the SIMD model. First, we identify the operations required for the marker propagation program and their parallel implementations. Next, through an example, we analyze the data dependency and parallelism in marker propagation and seek possible parallel compiler optimization techniques. We then show the speedup provided by the parallel marker propagation approach. The results of this chapter help us identify the requirements for implementing an efficient parallel marker propagation model. In the next chapter, a parallel marker propagation model named marker propagation networks is defined to encapsulate the required parallel constructs.

3.1 Parallelizing Marker Propagation with the SIMD Model

As described in the previous chapter, a marker propagation algorithm requires a knowledge base and an inference program.
To parallelize the marker propagation program, the knowledge base has to be loaded into the parallel processor system, and the parallel inference mechanism must be well designed to utilize the parallel processor computation power. In this section, we first describe some basic requirements to parallelize the marker propagation algorithm. Then we discuss the inherent parallelism that can be exploited with the SIMD approach.

3.1.1 Knowledge Base Partitioning

The formal marker propagation model assumes the semantic knowledge base contains the following objects: nodes, relations, classes and markers. There are relations between nodes. A set of nodes with similar features can be given a class for the common features. Even though the marker is a temporary memory, the marker can be part of the knowledge base to reduce the runtime translation. To load the knowledge base into the parallel processor, the knowledge base is simply partitioned into pieces, which are then loaded into the processor nodes. With the SIMD approach, the main program which provides the inference mechanism resides in the host. All processors then operate under the host's command.

Node and Relation

Nodes and relations are the basic constructs of a semantic network knowledge base. In the formal marker propagation model, the node is treated as concept data. During the marker propagation phase, when a node receives a marker, the processor will change the status of the node to reflect the marker activity. A node has some memory for storing information regarding the status of the node. Some of the memory is used for storing relations connected to or from other nodes. The semantic relations link pairs of nodes of the semantic networks. The relations in the formal marker propagation model provide bridges to construct the marker propagation path.
In the path-directed propagation approach, the relation is provided as a guideline to restrict marker spreading activity. One of the major problems with knowledge base partitioning is to define the mapping between semantic nodes and physical processors. For the sequential approach and the static node-processor mapping parallel approach, the implementation is simple. For a static mapping, the node-processor mapping information can be encoded into the node identifier. For example, considering a 32-bit node identifier, 8 bits are assigned to the processor identifier and the remaining 24 bits are provided as a key to search for the node within a processor. For a sequential machine, the processor identifier field has 0 bits. In a parallel processor system with load balancing, the nodes may have to be migrated between processors. When a node has to be moved from one processor to another for load balance, all the nodes connected to the node being moved have to be notified about the move. Consider the example in Figure 3.1. The knowledge base is distributed onto two processors. If node B has to be moved from processor 2 to processor 1, the physical to logical processor mapping changes after the move. The relations R1 and R2, which are associated with node B, must be modified according to the change. From a practical point of view, the simple static partition and allocation are suitable only for a small knowledge base.

Figure 3.1: Moving a semantic node from one processing element to another.

Classes and Markers

Classes and markers are both used to identify groups of nodes with the same features. The class and the marker differ mainly in their characteristics: the class is statically stored in the knowledge base, while the marker is normally assigned at runtime. We can view the class, marker and node associations as two big matrices, as shown in Figure 3.2.
For nodes which are members of the class or marker, an “x” is marked in the table.

Figure 3.2: Simplified marker table and class table.

Intuitively there are two ways to partition these two tables. The first is vertical partitioning, where the class table or marker table is partitioned into several vertical bars. Each bar will be allocated to one processor node. The second partition strategy is horizontal partitioning, where the tables are cut into horizontal bars. The table allocation then depends on the node allocation instead of being simply allocated to processor nodes.

Mapping Between External and Internal Representations

Before the external data can be processed by the processor elements, it has to be converted into the internal representation. For example, the node name “clyde” stored in a processing element has to be converted to a tuple (processor_id, node_id). In general, the mapping is one to one; therefore, the node name must be unique. For example, “clyde” is a node name; for a second instance with the same clyde node name, it is suggested to use clyde-1. To convert a node from the external representation into the internal representation, a name table is required. The translation is a simple search operation. With a hash table implementation, the search can be done with minimum processing time overhead. The main concern with the name table implementation is the required memory space. Consider a million-node knowledge base, where each node needs a name. For an average of ten characters per node, the name table requires 10MB of memory, and the pointers (pointer to next entry) and hash table require another 5MB. For the internal form, the (processor_id, node_id) tuples, 8MB is needed. Therefore, just for the name table, it easily takes up to 23MB of memory.
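The external-to-internal mapping can be sketched as a hash lookup that returns a packed 32-bit identifier, with the 8-bit processor id in the high byte and the 24-bit per-processor key in the low bytes, following the static-mapping example above. The dictionary below is an illustrative stand-in for the name table; field widths and function names are assumptions.

```python
# Name table: external node name -> packed 32-bit internal identifier.
# Layout follows the static-mapping example: 8-bit processor id,
# 24-bit per-processor node key.
name_table = {}

def intern(name, processor_id, node_key):
    assert processor_id < (1 << 8) and node_key < (1 << 24)
    name_table[name] = (processor_id << 24) | node_key
    return name_table[name]

def lookup(name):
    packed = name_table[name]          # hash lookup, near-constant time
    return packed >> 24, packed & 0xFFFFFF

intern("clyde", 3, 17)
print(lookup("clyde"))                 # -> (3, 17)
```

On a sequential machine the processor-id field would simply be zero, matching the 0-bit field described above.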
The big table can be stored at the host or partitioned into pieces and stored in the processing elements. Practically, the first approach is better if the host has sufficient memory, since searching by the processing elements incurs a data communication time overhead, and the speedup from table partitioning and parallel search does not overcome that overhead.

3.1.2 Global Operations

After the knowledge base partitioning, there exists inherent data parallelism due to the partitioning; several operations required by the marker propagation algorithm can therefore utilize such parallelism. The global operations include search, set, knowledge base maintenance, and data retrieving operations. Not all global operations can be parallelized to reduce the processing time. For example, the “create_node” operation requires only one processor to perform the operation; therefore it cannot be parallelized. For the parallel global operations, in general, no result will be provided, because result collection on a parallel machine is an expensive operation. However, a data retrieving function has to be provided for this purpose.

Search Operations

Search is one of the most important operations for a knowledge processing algorithm. In general, basic search operations have to be applied to all knowledge objects, i.e., node, relation, class and marker. The basic operations include: search nodes with relation R, search nodes with class C, and search nodes with marker M. Complex search operations deal with combined object lookup, for example, searching for all nodes that have relation R1 and are in class C1. Hard search problems, such as the traveling-salesman problem, require sophisticated algorithms; several research efforts on parallelizing those problems have had good results. The search operations required for the marker propagation algorithm are considered simple search operations.
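With the knowledge base partitioned, a basic search such as "search nodes with relation R" can be sketched as the host broadcasting one command that every partition executes over its own node table. The partitioning and relation encoding below are illustrative assumptions, not the dissertation's implementation; the loop over partitions stands in for conceptually concurrent processing elements.

```python
# Each "processor" holds a horizontal partition of the node table:
# node name -> set of relation types attached to that node (illustrative).
partitions = [
    {"elephant": {"is-a", "has-part"}, "mammal": {"has-part"}},
    {"dumbo": {"is-a"}, "trunk": set()},
]

def broadcast_search(relation):
    """Host broadcasts one search command; every partition scans its own
    table at the same time (simulated here by a sequential loop)."""
    hits = []
    for table in partitions:                        # conceptually concurrent
        hits.append({n for n, rels in table.items() if relation in rels})
    return hits                                     # one result set per PE

print(broadcast_search("is-a"))
```

Note that no single result list is assembled: each partition keeps its own hit set, reflecting the point above that result collection on a parallel machine is expensive and is deferred to a separate data retrieving function.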
Complex search operations are, in general, provided by various set operations. Basic search operations, such as “search nodes with relation R,” are inherently parallel operations once the knowledge base is partitioned. The central controller or host broadcasts the search command; the processors can then search through their node tables at the same time. In Chapter 5, we will discuss the use of CAM (content addressable memory) for the search operations.

Set Operations

Set operations are useful in knowledge processing applications to identify objects that have similar features and features common to several objects. The set operations are also needed for generalizing or restricting the membership requirement of a group. Basic set operations include AND, OR, EX-OR, and NOT. The set operations provide supplemental operations for the search operations and marker propagation operations. For example, a combined search, such as searching for nodes which have relation R1 and are in class C1, can be accomplished by the following code:

search_node_RC (class C, relation R, marker M)
{
    marker M1, M2;                  % declare variables %
    search_node_with_C (C, M1);     % mark nodes with class C %
    search_node_with_R (R, M2);     % mark nodes with relation R %
    and_marker (M1, M2, M);         % mark the intersection %
}

The pseudo code marks the nodes having both class C and relation R with marker M. The set operations also take advantage of the partitioned knowledge base for the inherent parallelism. The central controller or host broadcasts the set operation request; the processors can then perform the set operation on the partitioned knowledge base at the same time.

Knowledge Base Maintenance Operations

Most of the knowledge base maintenance operations cannot be parallelized due to the required external to internal translation process. For example, to create a node with “create_node N,” several operations are required.
Before the creation, N must be translated into the internal node representation form. A free node space is allocated for the new node. Then the controller assigns the node to a processing element and asks it to create the node. The required external to internal data translation operation, as described in the previous section, is sequential. The free node table is similar to the name table and should be implemented the same way. For relation creation, if the relation is unique (for example, a knowledge base may have only one is-a relation, and other is-a relations must be given different names), the procedure to create a relation is similar to creating a node. However, if the relation is provided as a generic type, then the parallel operations shown below can be implemented. The effects of such operations are shown in Figure 3.3 and Figure 3.4.

• create relation by marker (marker_create): for all nodes which have the marker, create a relation to a given node.

• delete relation by marker (marker_delete): for all nodes which have the marker, delete the relation to a given node.

Knowledge base operations involving classes and markers are, in general, conversions between the two. For example, given a set of nodes which have marker M, assign them class C. The implementation required for such an operation is to find the nodes which have the marker and set the class for those nodes. With a partitioned knowledge base, these operations can utilize the inherent parallelism.

Data Retrieving Functions

The data retrieving operations require the processing elements to retrieve the requested data and send it back to the host or central controller. Intuitively, the parallel approach is slower than the sequential operation. For example, consider retrieving a list of nodes, marked with marker M, as candidates of possible solutions.
One can issue a collect_node_M (marker M) instruction to request the processing elements to send back the list of nodes which have marker M. With the sequential approach, the list is constructed at runtime, and the retrieve function simply returns the pointer to the list as the result. With the parallel approach, the list has to be reconstructed from the partitioned lists if the partitioning scheme is horizontal, or the entire list has to be sent back to the host from the processing element which owns the list if the vertical partitioning scheme is employed. Either way, the operation is much slower than the pointer-returning operation on a sequential machine. In general, data retrieving is considered an overhead of parallel marker propagation, since it requires more processing time than the sequential approach.

Figure 3.3: Example of parallel relation creation.

Figure 3.4: Example of parallel relation deletion.

Synchronization for Global Operations

If the search and set operations are programmed with the parallel approaches described in this section, the operations may require synchronization at the end of the operation. The search and set operations on each processor require different processing times, because the knowledge base may not be equally partitioned. If a processing element can only execute broadcast commands one by one from the broadcast command queue, then synchronization is not necessary even if there is data dependency between operations, since no interaction between processors is required.
With marker propagation operations mixed with global operations, a synchronization is required between the marker propagation and the global operation if there is data dependency. Synchronization is another overhead of the parallel marker propagation approach.

3.1.3 Parallelizing Marker Propagation

In formal marker propagation approaches, a single propagation performs a search and tracing along several paths. The goal of the propagation is to find associations between knowledge objects by specifying guidelines for the searching process in the form of a propagation path. When combining the results of multiple propagations, an intersection search can be performed without the employment of complex heuristics.

A Sequential Marker Propagation Function

To parallelize the marker propagation operation, one has to understand the marker propagation in sequential form. A generic marker propagation function can be defined by the following code:

propagate (marker C, marker M, path P, function F)
{
    for all nodes which have the marker C,
    send the marker M through the path P,
    and execute the function F at the nodes on the path.
}

From the definition, the propagation operation will be initiated from the nodes which have marker C. Assuming the path P is R1, and node N has marker C, the function will search through the relations of node N and send the marker M to the destinations pointed to by the relations of type R1. Whenever the marker M reaches a destination, the function F will be executed.

A Parallel Marker Propagation Function

The initialization of the marker propagation must search for the nodes which have marker C. With a sequential approach, this can be achieved by setting up, at runtime, a linked list of the nodes which have marker C. Upon request for the search of nodes with such a marker, the linked list is retrieved.
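The generic propagate function can be sketched sequentially as follows. The graph encoding, marker sets, and the treatment of the path as a sequence of relation types are illustrative assumptions for the sketch, not the dissertation's data structures.

```python
# Sequential sketch of propagate(C, M, P, F): start at every node carrying
# marker C, push marker M along the relation types in path P, and run F at
# each node the marker reaches.
def propagate(graph, marks, C, M, P, F):
    frontier = [n for n, ms in marks.items() if C in ms]
    for relation in P:
        nxt = []
        for node in frontier:
            for rel, dst in graph.get(node, []):
                if rel == relation:
                    marks.setdefault(dst, set()).add(M)
                    F(dst)             # executed at each node on the path
                    nxt.append(dst)
        frontier = nxt

graph = {"N": [("R1", "A"), ("R2", "B")], "A": [("R1", "C")]}
marks = {"N": {"C0"}}
visited = []
propagate(graph, marks, "C0", "M0", ["R1", "R1"], visited.append)
print(visited)        # marker M0 follows R1 twice: N -> A -> C
```

The two nested loops here are "while"-like in spirit: the frontier size at each hop is unknown until runtime, which is exactly why conventional "for"-loop parallelization techniques do not apply, as discussed below.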
For a parallel machine, as described in the previous section, the knowledge base must be partitioned to exploit the parallelism. If the vertical partitioning scheme is employed, the information regarding the node list is stored in a single processor, and any further propagation operation will be bounded by the throughput of that processor. With horizontal partitioning, each processor holds part of the marked node list; therefore, several processors can start the propagation at the same time, and the initialization performs better with this approach.

Some parallel processing techniques have been developed to parallelize the loop operations of sequential programs. Those techniques are not suitable for parallelizing the loops inside the marker propagation algorithm, because they were developed for deterministic loops, namely "for" loops. The loops inside the marker propagation algorithm are "while" loops, since the number of items is unknown at compile time.

For a message-passing machine, the propagation paths are partitioned along with the knowledge base. Therefore, whenever a propagation path terminates at the current processor but continues at another processor, a remote procedure call is needed to continue the propagation remotely. However, this approach cannot be implemented on a generic SIMD machine such as the CM2. The main problem with the SIMD approach is that communication between the processing elements is controlled by the central controller. During the communication phase, the central controller determines the direction of communication between processors, and then commands all processors to communicate with each other.
Consider a propagation going through a path which has only one hop R. If all the destinations are at local processing elements, no inter-processor routing is needed, and the function can be accomplished with a single broadcast command. If some of the destinations require communication with other processing elements, the routing operation must be issued by the central controller. In general, the propagation paths require the processors to perform communications in different directions. The number of broadcast commands required for a marker propagation can be calculated with the following equation:

    B = Σ_{i=1}^{L} (2 + 2·C_i)

where B is the number of broadcast commands required, the "2" indicates that a propagation hop without communication requires two broadcast commands (for processing and for checking the communication requirement), L is the length of the longest propagation path, and C_i is the number of communications required at propagation hop i. At a propagation hop that requires communication, four broadcast commands are needed: setting up the communication direction, communicating, processing, and checking the communication requirement. In general, each routing step requires a synchronization to make sure that all processing elements finish their job before going further; then the results have to be returned to the host to verify that no further propagation is required. The implementation of a parallel marker propagation system on the Connection Machine CM2 [4] requires this approach.

3.1.4 SNAP-1 Marker Propagation System

The SNAP-1 system was designed to provide the parallel implementations described previously and to overcome the problems of parallel marker propagation with the SIMD approach. The parallel processing mode of SNAP-1 is a cross between a SIMD and a MIMD computer: it is SIMD in terms of global operation execution, but MIMD in terms of marker propagation processing.
A detailed description of the SNAP-1 model can be found in [27]. The SNAP-1 prototype was built to demonstrate the feasibility of the SNAP-1 marker propagation system for solving natural language problems in real time. The SNAP-1 prototype consists of nine 9U-sized Eurocards attached to a Sun 3/280 (see Figure 3.5). Eight of the boards provide the SNAP array processors, which hold the nodes, relations and markers that make up the knowledge base. The ninth board contains the SNAP-1 controller, which runs the application program, broadcasts instructions to the array, and collects results from the array.

The SNAP-1 prototype consists of 32 processing clusters, with 4 clusters per board. Each cluster consists of five Texas Instruments TMS320C30 digital signal processing chips. Within a cluster, the five processors are connected via two four-port memories, while between clusters a spanning-bus hypercube, implemented using four-port memories, is used. SNAP-1 runs at 12 MHz, and its 160 TMS320C30s give it a peak throughput of almost 2 giga-operations per second. More information about the SNAP-1 prototype hardware design can be found in [7].

The SNAP-1 prototype can support a semantic network knowledge base of 16K nodes, each node having an average of 10 relations. Besides the nodes and relations, SNAP-1 supports markers and colors (classes). The markers and colors are treated as data and are part of the knowledge base.
[Figure 3.5: SNAP-1 prototype. Software environment: program development in the SNAP instruction set; compiled SNAP code; SNAP primitive instructions. Hardware environment: host Sun 3/280; SNAP-1 controller with VME interface (one 9U-size board); SNAP-1 array and shared memory network (eight 9U-size boards) on a custom backplane.]

For each node, the SNAP-1 prototype supports a set of 32 complex markers, each consisting of a marker bit, a 16-bit value and a 16-bit propagation origin node address, and 64 bit-markers which have only the marker bit. The programming language for the SNAP-1 system is standard C with special marker operation/propagation primitives.

In the SNAP-1 processing model [30][31], the problem with SIMD marker propagation is solved by incorporating the MIMD processing model into the marker propagation function. During the propagation, the initialization is done by broadcasting the propagation command as in the SIMD model. Each processor, after each step of processing, generates a message according to the result of the current processing to invoke the next step of processing, either in the local processor or in another processor. The processors do not report results to the host until the end of the propagation. The propagation function of SNAP-1 is shown below:

p-propagate(marker C, marker M, path P, function F)

The function initiates the propagation from the nodes which have marker C. They send out marker M through path P. Whenever a node sends a marker M, the function F is performed. In the actual implementation, the path has at most two types of relation. The basic propagation patterns supported by SNAP-1 are listed below.

1. SEQ(R1, R2): the SEQuence pattern propagates the marker through R1 once, then through R2 once.

2. SPREAD(R1, R2): the SPREAD pattern allows the marker to traverse a chain of R1 links.
For each node on the R1 path, if any R2 link exists, the marker switches to the R2 link and continues to propagate until the end of the R2 chain.

3. COMB(R1, R2): the COMBine pattern propagates the marker through all R1 and R2 links without limitation.

For a propagation pattern longer than two relations, a sequence of propagation functions is needed to accomplish the task, because the SNAP-1 programming language does not support a closure notation to extend the propagation pattern. This, however, becomes one of the problems with the SNAP-1 processing model, because the synchronization between propagations must be enforced to ensure a correct result.

To implement the MIMD parallel marker propagation approach in SNAP-1, we found that the following two problems have to be solved.

End of Propagation Detection Problem

The end of propagation detection is similar to barrier synchronization in parallel programs. During marker propagation, a propagation spawns several subpropagations recursively. Each subpropagation stops whenever the marker reaches the end of the propagation path. The end of propagation detection is to detect such a condition. In parallel Prolog programming, this is called termination detection. Global end of propagation detection is implemented in the SNAP-1 processing model to assure data correctness under parallel processing. In a MIMD parallel architecture, implementing global end of propagation detection with a distributed algorithm is very difficult and may produce a phantom end of propagation. For the SNAP-1 processing model with its dedicated controller, the global end of propagation detection can be implemented in hardware.
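As described below, SNAP-1 detects the global end of propagation with two global message counters and per-processor idle signals. A minimal software sketch of that check (the function and variable names are ours, not SNAP-1's):

```python
# Global end-of-propagation holds only when every processor is idle AND
# the number of messages produced equals the number consumed; idleness
# alone is not enough, since an in-flight message can wake a processor
# (the "phantom" end of propagation mentioned above).

def propagation_ended(idle_flags, produced, consumed):
    return all(idle_flags) and produced == consumed

# All idle, but one message still in flight -> not finished:
assert not propagation_ended([True, True], produced=5, consumed=4)
# Counters match, but a processor is still busy -> not finished:
assert not propagation_ended([True, False], produced=5, consumed=5)
# Both conditions hold -> genuine (non-phantom) end of propagation:
assert propagation_ended([True, True], produced=5, consumed=5)
```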
Assuming that the marker message processing is distributed over the multiprocessor, termination detection requires a pair of globally shared variables, visible to all processors, recording the number of messages produced and the number of messages consumed. If the produce-consume sequence is maintained, termination can be detected correctly. Termination detection is implemented by two adder trees connected to the array processors: one for accumulating the number of messages produced and the other for the number of messages consumed. The global end of propagation is true under the following circumstances:

• all processors inform the controller with an idle signal, and
• the number of messages produced is equal to the number of messages consumed.

We will present the detailed solution for global termination detection in the next chapter.

Marker Propagation Overrun

The memory requirement for parallelizing the marker propagation algorithm depends on the size of the knowledge base and the algorithm used. For spreading activation, the memory requirement increases exponentially with the number of links per node. For example, if there are on average 10 relations per node, the memory required for temporary message storage is 9^10 messages after 10 steps of propagation (the 9 comes from subtracting 1 from 10 for the incoming message). For path-directed propagation, the exponential factor is still present; the average number of relations has no direct effect on the memory requirement, but the user program that selects the path does. This is because the number of markers generated depends on the fanout, and path selection reduces the fanout. However, restricting the propagation path cannot always solve the problem.
For example, if the knowledge base is constructed using only a few types of relations, the chance of a marker overrun problem is very high, since the fanout is high. In the next chapter we will discuss the marker propagation networks model; a better approach to controlling the marker overrun problem in that model will be presented in Chapter 6.

3.2 Parallelism and Data Dependency Analysis

In this section, we discuss the data dependency analysis for the marker propagation algorithm as the key to parallelization.

As described in Chapter 2, a marker propagation algorithm contains three phases of processing. However, formal implementations of the marker propagation algorithm show that there is a loop between the propagation and evaluation phases, as shown in Figure 3.6(a). If we unroll the loop as shown in Figure 3.6(b), the critical path of the algorithm becomes much longer than before. By parallelizing the loop we can obtain better performance, as shown in Figure 3.6(c). While many parallelism detection techniques look for data dependencies between instructions, parallelism detection for marker propagation checks the data dependencies between operations or functions. In the following, a marker propagation algorithm is described and provided as an example for data dependency analysis. The original program assumes a sequential model.

An Example: Marker Propagation Algorithm for Subsumption Test

The subsumption test is an algorithm to infer the superconcept-subconcept relation between two concepts. The subsumption algorithm is the first step of the knowledge classification process, which is a very important procedure for knowledge acquisition and reasoning on hierarchical knowledge bases. The work on this marker propagation algorithm was done by Kim [18]. The subsumption algorithm is described below.
Let C be a new concept which has superconcepts S1, S2, ..., Sn, and roleset relations R1, R2, ..., Rm with value descriptions V1, V2, ..., Vm. Let Sup(C) be the set of all superconcepts of C, SR(C) be the set of all superconcepts of the roleset relations of C, and SV(C) be the set of all superconcepts of the value descriptions of C's rolesets. ('sub' refers to the subsumption relation, 'role' refers to the roleset relation, and f- and r- denote the forward and reverse direction of marker propagation on a relation link.)

[Figure 3.6: The marker propagation algorithm program flow (translate, propagate, evaluation): (a) with a loop between propagation and evaluation; (b) the loop unrolled; (c) the loop parallelized.]

The algorithm for finding all subsumers of C is as follows:

Phase 1. For some concept C' to be a subsumer of C, all the superconcepts that subsume C' must also subsume C, i.e., Sup(C') ⊆ Sup(C). In Phase 1, we filter out the concepts which violate this condition.

1. While Si exists do
2.     set sub-marker on node Si;
3.     propagate sub-marker :propagation path (f-sub)*;
4.     set ind-marker on node Si;
5.     propagate ind-marker :propagation path (r-sub)*;
6. For all nodes
7.     if (not (or sub-marker ind-marker)) set cancel-marker;
8. propagate cancel-marker :propagation path (r-sub)*;
9. For all nodes
10.    if (exist cancel-marker) delete ind-marker;

After Phase 1, all the nodes which do not satisfy the above condition are filtered out by the cancel-marker. The nodes which remain marked by ind-markers are the possible subsumers of C.

Phase 2. Among the concepts marked in Phase 1, those which have rolesets that are not in SR(C) must also be filtered out, as must those which have value descriptions that are not in SV(C). To do this we first mark all the nodes which are members of the sets SR(C) or SV(C).
Since there are rolesets not only explicitly given by the description, but also implicitly given as rolesets of the given superconcepts, we must also propagate markers from the superconcepts of C. By doing this, all the rolesets which must be inherited by concept C can be marked.

11. While Si exists do
12.     set sub-marker on node Si;
13.     propagate sub-marker :propagation path ((f-sub)|(f-role))*;
14. While Ri exists do
15.     set sub-marker on node Ri;
16.     propagate sub-marker :propagation path (f-sub)*;
17. While Vi exists do
18.     set sub-marker on node Vi;
19.     propagate sub-marker :propagation path (f-sub)*;

After Phase 2, all the nodes marked by the sub-marker are one of the following: (1) explicitly given superconcepts of C or their subsumers, (2) explicitly given rolesets and value restrictions of C or their subsumers, and (3) rolesets and value descriptions of C which are found by inference. These are the properties inherited from the superconcepts of C.

Phase 3. In this phase, we filter out those concepts which have SR or SV such that SR ⊄ SR(C) or SV ⊄ SV(C), by propagating a cancel-marker.

20. For all nodes
21.     if (not (or sub-marker ind-marker)) set cancel-marker;
22. propagate cancel-marker :propagation path ((r-sub)|(r-role))*;
23. For all nodes
24.     if (exist cancel-marker) delete ind-marker;

After Phase 3, the remaining nodes marked by ind-marker are the actual subsumers of concept C.

Data Dependency Analysis

Data dependencies in marker propagation applications must be analyzed in terms of sets of knowledge base objects. The membership of a node in an aggregation is specified by a corresponding marker. There are set operations which take two input sets and create a new set, or which add or remove members of an already existing set. In general, the marker propagation and global operations dominate the processing time.
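The analysis that follows compares marker operations by the sets of markers they read and write (Bernstein's conditions). A sketch of that test, with an illustrative set-based encoding of operations (not SNAP-1 code):

```python
# Two marker operations, modeled as (reads, writes) pairs of marker
# names, can run in parallel iff none of Bernstein's conditions holds:
# W1 ∩ R2 (read-after-write), R1 ∩ W2 (write-after-read),
# W1 ∩ W2 (write-after-write).

def independent(op1, op2):
    r1, w1 = op1
    r2, w2 = op2
    return not ((w1 & r2) or (r1 & w2) or (w1 & w2))

# The two propagations of the subsumption algorithm's first loop touch
# disjoint markers (sub-marker vs ind-marker), so they may run in parallel:
prop_sub = ({"sub"}, {"sub"})
prop_ind = ({"ind"}, {"ind"})
assert independent(prop_sub, prop_ind)

# An operation that reads both markers (e.g. an intersection) must wait:
intersect = ({"sub", "ind"}, {"both"})
assert not independent(prop_sub, intersect)  # read-after-write on "sub"
```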
However, the parallel architecture should allow more than one marker operation at a time; by parallelizing the marker operations, a speedup is expected. The basic approach to parallelism detection is to investigate the dependencies between marker operations. For example, the detection examines a single iteration of a loop program. A dependency between two marker propagations occurs when one of the propagations is initiated from the intersection area of another propagation.

The marker propagation operation can, in general, affect multiple sets by both adding and removing nodes. In the most common case, however, it builds or expands a single set, starting from a pre-existing activation set. Dependency detection can be done using Bernstein's rules: (1) read-after-write, (2) write-after-read, and (3) write-after-write. A read-after-write dependency occurs when the intersection area must be read after the propagation. A write-after-read may occur when a propagation path goes through newly created concepts and relations. A write-after-write can occur when the area marked by a propagation is modified. The dependencies between propagations fall into the read-after-write category.

For the subsumption example, we can see that there are several looping procedures in the algorithm. The first loop (line1 - line5) contains two propagation operations. However, there is no dependency between the two propagations, so they can be propagated at the same time. By reordering the instructions, using SNAP-1's p-propagate for the parallel propagate operation described previously in this chapter, the following code is obtained:

1. While Si exists do {
2.     set sub-marker on node Si;
3.     set ind-marker on node Si;
4.     p-propagate sub-marker :propagation path (f-sub)*;
5.
    p-propagate ind-marker :propagation path (r-sub)*;
}

Data Dependencies Between Iterations

Loop parallelization investigates the dependencies between iterations. For the first loop, as shown above, there is a data dependency between iterations because all iterations use the same marker to mark nodes: if there is an intersection between the different propagation paths, then a data dependency exists. However, further investigation shows that the marker propagation in this program performs only SET-MARKER when the marker reaches a node. Since setting a marker twice does not affect the correctness of the result, the data dependency constraint is relaxed. Assuming that the propagation function is implemented with the SNAP-1 parallel function, the program can be improved as follows:

1. While Si exists do {
2.     set sub-marker on node Si;
3.     set ind-marker on node Si;
4.     snap-propagate sub-marker :propagation path (f-sub)*;
5.     snap-propagate ind-marker :propagation path (r-sub)*;
6. }

Therefore, only two propagations have to be initiated. This alteration saves synchronization time and increases the parallelism compared with the previous optimization.

Data Dependencies Between Loops

If a single loop is viewed as a single propagation, then loops can be processed in parallel. For example, there is a read-after-write dependency between the first loop and the second loop; therefore, they cannot be parallelized. However, there is no data dependency among loops 4, 5 and 6 (line11-line19), so they can be propagated in parallel. Again, the marker operation is assumed to be only SET-MARKER during propagation. The program can be rewritten as the following code:

11. While Si exists do
12.     set sub-marker on node Si;
13. While Ri exists do
14.     set sub-marker on node Ri;
15. While Vi exists do
16.
    set sub-marker on node Vi;
17. snap-propagate sub-marker :propagation path ((f-sub)|(f-role))*;
18. snap-propagate sub-marker :propagation path (f-sub)*;
19. snap-propagate sub-marker :propagation path (f-sub)*;

Data Dependencies Between Different Inputs

For different inputs, the dependencies between instructions are based on the side effects of the previous input. For example, if some semantic concepts are created or deleted at the end of processing, the propagations in the next iteration which involve the changed part of the semantic network cannot be executed in parallel. The detection of data dependencies between semantic network changes and a propagation is done by checking the dependencies between the changed relations and the propagation path. For example, when a new isa relation is created between two nodes and the propagation path traverses isa relations, then a data dependency exists. For the subsumption example, the semantic modification code is not shown, but the created relations include all the propagation paths described in the program; therefore there is no possibility of parallel execution between inputs.

Problem with Parallel Compiler Optimization

One of the major compiler optimization techniques is to reorder the instruction sequence to improve performance. The same technique can be applied to parallel marker propagation; however, it does not always improve the performance. Consider the following pseudocode:

1. propagate marker#1
2. propagate marker#2
3. and-marker marker#1, marker#2, marker#3
4. propagate marker#3
5. propagate marker#4

[Figure 3.7: Knowledge base distribution for speedup evaluation.]

After reordering the functions, statement 5 may be executed in parallel with statements 1 and 2.
But if the propagation time for statement 4 is larger than that of statement 5, it is not necessary to reorder the propagations.

3.3 Parallel Marker Propagation Analysis

The power of marker propagation is due to the massive parallelism inherent in the knowledge base structure and the knowledge partitioning. To get good processor utilization, the propagation volume must be high and well distributed.

To evaluate the parallel processing speedup, we first consider the knowledge base partitioning shown in Figure 3.7. Each node is connected to an average of N other nodes; KN of these are assigned to the same processor and the remaining (1-K)N are assigned to other processors, where K < 1. The processing time for one propagation step is

    T_t = P·N·Max(k_i) + C·(1-K)·N

where P is the processing time for a marker arriving at a node, k_i is the fraction of the destinations located at processor i, and C is the average communication time to send a marker to another processor. Assume that spreading activation is used and that the knowledge base is a tree structure extending the small tree shown in Figure 3.7. Then the total execution time for the propagation can be calculated with the following equation:

    T_p1 = M · Σ_{i=0}^{l-1} N^i · (P·N·Max(k_i) + C·(1-K)·N)

where l is the length of the propagation and M is the number of initializations. If the knowledge partitioning is evenly distributed, the equation becomes

    T_p1 = M · Σ_{i=0}^{l-1} N^i · (P·N·(1/n) + C·(1 - 1/n)·N)

where n is the number of processors. From this equation it is obvious that, for spreading activation, a great deal of processing is required to finish the computation. To calculate the speedup, we obtain the sequential processing time by setting n = 1:

    T_s = M · Σ_{i=0}^{l-1} N^i · P·N

Therefore the speedup is

    Speedup = n·P / (P + (n-1)·C)

From the equation, if the processing time is equal to the communication time, then the speedup is always one.
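Plugging sample numbers into Speedup = nP/(P + (n-1)C) shows how strongly the result depends on the ratio of processing time P to communication time C (the values of n, P and C below are made up):

```python
# Speedup of parallel marker propagation: Speedup = n*P / (P + (n-1)*C).

def speedup(n, P, C):
    return n * P / (P + (n - 1) * C)

print(speedup(32, 1.0, 1.0))   # P == C  -> 1.0: no speedup at all
print(speedup(32, 10.0, 1.0))  # P = 10C -> only ~7.8 on 32 processors
print(speedup(32, 1.0, 0.0))   # C = 0   -> ideal speedup n = 32.0
```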
Therefore, the speedup is heavily dependent on the ratio of processing time to communication time.

Let us consider the effect of M. If the knowledge base is unevenly distributed, each propagation produces sub-processes at the local processor only. If, however, the M initializations are evenly distributed throughout the processors, the parallel processing time becomes

    T_p2 = (M/n) · Σ_{i=0}^{l-1} N^i · (P·N)

The sequential processing time is still the same; therefore the speedup becomes n.

The first parallel processing time gives a speedup based on the MIMD model; the second gives a speedup based on the SPMD model. From these equations, we conclude that both speedups cannot be achieved at the same time. In general, the knowledge base cannot be evenly distributed, so, to gain better speedup from a parallel processor system, the SPMD model is the right way to parallelize the marker propagation system. The initial propagation volume M therefore becomes very important for good speedup: as M gets larger, the task initializations are more evenly distributed.

We have profiled several marker propagation programs, including natural language understanding and speech understanding, running on the SNAP-1 model; most of the propagations are of short length and small volume. This is due to the SIMD processing model, and the programming model cannot merge the small-volume propagations into a large-volume propagation, as described earlier in this section.

3.4 Summary

The key requirements that affect the performance of parallel marker propagation programs have been identified. With a distributed-memory parallel architecture, the knowledge base is partitioned and distributed throughout the processors. The required operations must be designed to utilize the inherent parallelism, and we have identified the problems with parallelizing the required operations.
In the next chapter, we present a parallel marker propagation approach to solve those problems and to achieve better performance.

Chapter 4

Marker Propagation Networks

In this chapter, we start with the major findings from the programming experience with the SNAP-1 model; as the results indicate, there is more room for performance improvement. Based on these results, we define a parallel marker propagation model, namely Marker Propagation Networks. The marker propagation networks model is also designed to solve the problems described in the previous chapter.

4.1 Requirements for a Practical Parallel Marker Propagation Model

As shown in the previous chapter, of the three major phases of a marker propagation application, most of the parallelism exists in the marker propagation phase. However, with the SIMD processing mode, the parallelism gained from the marker propagation phase is limited because of the centrally controlled program flow [4]. Figure 4.1 shows an example of the parallel execution time of a SIMD parallel marker propagation program. The synchronization points are also increased because of the centrally controlled program flow. The SNAP-1 model improves on the SIMD parallel processing model by switching to MIMD mode during the marker propagation phase. The marker-passing activity is controlled by the marker itself.

[Figure 4.1: Parallel execution time of marker propagation applications on the SIMD machine and the SNAP-1 system.]

The synchronization point is needed only at the end of the propagation. Also, SNAP-1 reduces the synchronization requirement for the global operations.
Though the improvements of SNAP-1 reduce the synchronization overhead and improve processor utilization, the results of the programming study show there is more room for improving the performance. As shown in Table 4.1, PASS [5] with the ATC (Air Traffic Control) domain knowledge base spends most of its execution time on parallel execution. However, due to the limited SNAP-1 programming model, most of the parallel executions must be synchronized at the end of a single parallel operation. Therefore, the advantage of the SNAP-1 parallel processing model is reduced to the parallel propagation with multiple levels, which accounts for one twentieth of the total processing time.

From the results described, we can improve marker propagation applications by reducing the synchronization requirements, which also improves processor utilization, as depicted in Figure 4.1. To achieve this goal, the processing model has to incorporate the global operations into the marker propagation. Therefore, from a practical point of view, the parallel marker propagation processing model must provide continuous propagation paths, and the marker function must include the global operations.

Table 4.1: Execution time distribution with the SNAP-1 processing model

                                        PARALLEL with MUC4      PASS with ATC
                                        domain knowledge base   domain knowledge base
  Initialization                        8%                      10%
  Sequential Time                       22%                     4%
  Global Operations                     44%                     57%
  Parallel Propagation: single level    11%                     13%
  Parallel Propagation: multiple level  6%                      5%
  Data Retrieving                       9%                      11%
  Maximum Speedup                       1/(0.61/N + 0.39)       1/(0.75/N + 0.25)

  (N is the number of processing elements.)

Propagation Path

Spreading activation does not limit the marker propagation in a satisfactory way, and its communication overhead is too high for a parallel approach; therefore, only path-directed propagation is considered for the parallel approach. To reduce the propagation time, the overall propagation path should be short.
To reduce the synchronization overhead, it is desirable that the propagation path continuity is not interrupted. The marker propagation path can be viewed as an augmented regular expression over the alphabet of link types. Figure 4.2 shows several simple propagation paths. These propagation paths fit the model of regular expressions and are provided in the SNAP-1 prototype. The relative simplicity of these rules, and the fact that the programmer had to decompose the algorithm-dependent propagation patterns into these three patterns, require a compile-time optimization to combine several propagation functions into a single propagation function to reduce the synchronization overhead.

[Figure 4.2: Propagation path examples: (a) R1R2, (b) R1+R2+, (c) (R1|R2)*.]

To maintain path continuity, it is desirable that the basic patterns and their combination by the compiler be able to capture the complex nature of the marker propagation program. However, with the limited programming model provided by SNAP-1, most propagation patterns cannot be easily described with the SNAP-1 propagation rules. For example, programming a propagation path with the pattern (IS-A.HAS-PART.FEATURES)* is very hard with the SNAP-1 programming model, because the SNAP-1 programming language does not support the closure notation to extend the propagation pattern.

Marker Function

Along the path of propagation, there are nodes and relations. In the formal marker propagation approach, only a set-marker function is required whenever a marker reaches a node. To include the global operations in the marker function, a better approach is to provide the programmer with the ability to specify an action which will be invoked based on the marker propagation state, by attaching the action to the state.

To support the global operations with the marker function, a local synchronization mechanism is required.
In other words, the global synchronization required for the global operation is replaced by local synchronization. Intuitively, the required local synchronization is similar to dataflow's token matching. However, dataflow is not a feasible processing model for the marker propagation approach, because it cannot support the huge static data required for a semantic networks knowledge base.

In addition to control over the data processing on the propagation path, a user-defined function should be provided for path selection and propagation filtering. For the regular-expression path specification, some of the irrelevant propagation paths, too generally specified by the regular expressions, should be pruned based on local conditions at the node level during the propagation. Augmenting the regular expressions with conditional transitions leads us to the specification of propagation rules in the form of algorithmic state machines (ASM). One such possible propagation rule is shown in Figure 4.3; the selection code must be provided as part of the propagation program, then at each step of propagation the condition is checked to decide the next move. Assuming identical initial conditions and semantic networks topology, the propagation graph generated by this rule is actually a subgraph of the one obtained by applying the path of Figure 4.2(b).

Besides the path selection, the marker propagation may have overlap problems due to the knowledge base design. There are several possible overlapping propagations, as shown below.

• Node N has marker M. During propagation, node N receives the marker M again.
• Node N receives several markers M from different source nodes during propagation.
• Node N receives several markers M from the same source node during propagation.

To reduce the propagation traffic, the filtering function should prevent the marker propagation from propagating further if the condition is not valid.
For example, the

Figure 4.3: Augmented propagation path example: R1+, with a branch to R2* if M1.value > 0 (if the marker value is positive, switch to R2*)

filter function can be designed to check the maximum value of the marker. Therefore, only markers larger than the current value can be passed on.

From the observations described in this section, we can conclude that the requirements for a practical parallel marker propagation model are the following:

• supporting the SIMD and SNAP-1 parallel approach, as described in the previous chapter,
• providing continuous propagation paths,
• incorporating the global operation into propagation with local synchronization, and
• optimized implementation of all required mechanisms.

In the rest of this chapter, we will present a parallel marker propagation model which supports these requirements, and the implementation techniques of the parallel model.

4.2 Marker Propagation Networks

The marker propagation networks is designed to improve the performance over SNAP-1 and other SIMD parallel marker propagation models, by improving the marker propagation scheme and providing guidelines for knowledge base construction.

4.2.1 Process Flow

To provide the requirements described in the previous section, the marker propagation networks model is designed with the concept of process-driven computation. Computer system designs normally fall into one of two categories: data flow and control flow. The marker propagation networks provides a third mode of processing which is a combination of data and control flow, which we call process flow. The program execution with control flow is controlled by the main program. The marker propagation networks is different from control flow processing because the knowledge base data controls the program flow. In the dataflow model, the program execution is dependent upon data availability.
In the marker propagation networks, by contrast, the process activation on a knowledge object depends on the process availability.

To show the difference among the process flow, control flow and data flow concepts, given an a = (b + c) * (d - e) example, the three processing models are depicted in Figure 4.4 through Figure 4.6. Figure 4.4 shows a sequential control flow processing sequence. Assuming an Intel 80x86 processor is used, the compiled object code is stored in the program memory, as shown in Figure 4.4. As the program counter advances, the instructions are executed one by one. The data is retrieved from the data memory and the result is stored in the memory after the last instruction is finished.

    movw _b, %ax
    movw _d, %cx
    movw _c, %bx
    movw _e, %dx
    addw %bx, %ax
    subw %dx, %cx
    mulw %cx
    movw %ax, _a

Figure 4.4: a = (b + c) * (d - e) with the control flow approach

For the dataflow model, a dataflow graph is constructed, as shown in Figure 4.5. Each node in the graph represents a function node and is activated when all required inputs are available. The data flow model does not require any barrier synchronization. However, the synchronization is hidden in the function boxes, as a function node cannot be activated until all required inputs are available. The process flow approach is shown in Figure 4.6. The program starts with two nodes, b and d; the processes passed among the nodes are programmed to follow the NEXT links, execute the node function and send the results to the next node. The program flow in this example is controlled by both the program and the data.

4.2.2 Marker Propagation with Process Flow

To map the marker propagation algorithm onto the marker propagation networks, the knowledge base has to be created with the program encapsulated in the semantic networks hierarchy.
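Before turning to the mapping, the process-flow evaluation of the a = (b + c) * (d - e) example can be sketched in present-day Python. This is an illustrative sketch only (the node and marker names are hypothetical): the real model runs the two marker chains in parallel, while this sketch walks them sequentially.

```python
# Sketch of process flow: each node stores a small function and a NEXT link.
# A marker (process) follows the links, applying each node function to its
# carried value; the two marker chains meet and are combined at node 'a'.

def run_process_flow(b, c, d, e):
    # node table: name -> (function applied to the marker value, next node)
    nodes = {
        "b": (lambda m: b,     "c"),   # start marker M2 with value b
        "c": (lambda m: m + c, "a"),   # M2 = M2 + c
        "d": (lambda m: d,     "e"),   # start marker M1 with value d
        "e": (lambda m: m - e, "a"),   # M1 = M1 - e
    }
    def propagate(start):
        value, node = None, start
        while node != "a":
            fn, nxt = nodes[node]
            value = fn(value)          # execute the node function
            node = nxt                 # follow the NEXT link
        return value
    m2 = propagate("b")                # left chain computes b + c
    m1 = propagate("d")                # right chain computes d - e
    return m1 * m2                     # node 'a' combines both markers

print(run_process_flow(2, 3, 7, 4))    # (2 + 3) * (7 - 4) -> 15
```

Note that, as in the figure, the control resides in the node table (the knowledge base data) and in the marker, not in a central main program.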
A good example is the knowledge base with memory-based processing as described in Chapter 2. The concept sequence is similar to a small program built into the knowledge base. With such an approach, marker propagation effectively implements a simplified distributed bottom-up parsing process for the context-free grammar induced by the regular expression [1]. The set of inputs consists of the chains of links that can be followed from the activation set of nodes.

Figure 4.5: a = (b + c) * (d - e) with the data flow approach

Figure 4.6: a = (b + c) * (d - e) with the process flow approach (the process flow path is NEXT, NEXT)

Every marker propagation step over a link is equivalent to a shift action in the LR(1) parser. At every propagation step (semantic node), the parser is presented with multiple inputs (outgoing links), possibly asking for different actions and next states. The parsing stack, in fact just a state, is stored in the node data and identified by the marker. With the program-encapsulated semantic networks knowledge base, the process required for process flow is mapped as follows:

map the markers into processes

Therefore, during the marker propagation, each marker generated is equivalent to creating a sub-process.

As described in the previous chapter, a large number of marker propagations running simultaneously produces better parallelism. However, the number of sub-processes created is also increased. To reduce the overhead of process management, the marker process should be implemented with lightweight processes, or threads. Implementation issues will deal with synchronization, communication and memory management support for the execution of the threads.
Every thread runs in its own context, as defined by the node, link and marker identifiers, the state of the marker, the propagation path, etc. Conceptually, however, all the threads access a global semantic networks in a shared space. Threads are easy to start, having a small context; they have a rather short life span and, once started, run to completion, involving no context switches. For these reasons we refer to the markers as threads, rather than regular processes. On the other hand, the name may be a little confusing because threads are not specified at user level, but by the run-time environment, transparent to the programmer.

Processing Model

For marker propagation with process flow, we have to define the virtual computational unit, which is scheduled when it receives the process. We also have to specify how these units cooperate and compete. The choice of a computation unit will affect the specification of the knowledge base construction.

Node-Oriented Processing Model is the more natural processing model, in which the focus is on the semantic nodes. In object-oriented terms, the node is defined as an object and the threads are actually activations of the unique method defined for the node object: processing an incoming marker. The computation is message-driven: the reception of a marker, encapsulated in a message, sets a new thread for execution. The execution of a thread comprises two parts: a user-specified computation and a routing algorithm which is used to determine the continuations for the marker propagation. Although the processing model makes no assumption of a certain allocation of the semantic networks, some points in the following discussion will be easier to understand if we consider that nodes are statically assigned to the processors.
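The node-oriented model can be sketched as follows (a minimal illustration with hypothetical names, not SNAP code): each node has a single method for processing an incoming marker, the marker carries one finite-state-machine state summarizing its routing history, and every outgoing link of the receiving node is tried as a candidate for a routing match.

```python
# Sketch: message-driven node objects routed by an FSM for the path R1* R2*.
from collections import deque

# FSM transition table: (state, link_type) -> next state
FSM = {(0, "R1"): 0, (0, "R2"): 1, (1, "R2"): 1}

class Node:
    def __init__(self, name, links):
        self.name, self.links = name, links   # links: [(link_type, dest)]
        self.marked = False
    def on_marker(self, state, queue):
        self.marked = True                    # user-specified computation
        for link_type, dest in self.links:    # routing: try every out-link
            if (state, link_type) in FSM:
                queue.append((dest, FSM[(state, link_type)]))

def propagate(nodes, start):
    queue = deque([(start, 0)])
    while queue:
        name, state = queue.popleft()         # one "thread" per marker
        nodes[name].on_marker(state, queue)
    return sorted(n.name for n in nodes.values() if n.marked)

nodes = {
    "A": Node("A", [("R1", "B"), ("R2", "C")]),
    "B": Node("B", [("R2", "C")]),
    "C": Node("C", []),
}
print(propagate(nodes, "A"))   # all three nodes lie on some R1*R2* path
```

The queue here stands in for the message network: popping a message and calling `on_marker` corresponds to starting a short-lived thread that runs to completion.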
The context of the thread is given by the marker number, the address of the node that must receive the marker, the propagation path, the state of the marker reflecting previous routing decisions, and the marker data. The propagation path could vary in complexity from very simple, such as a context-independent "propagate only along" rule, to adaptive routing rules which take into account the data of the marker and the data of the relations. It is however assumed that the past history of the marker routing decisions can be encoded into a single value. This is equivalent to saying that the routing rule corresponds to a finite state machine. The strategy in this model is to exhaustively consider the outgoing links from the receiving node as possible candidates for a match in the routing decision. If a match is found, a new marker is sent to the destination of the link, where an identical thread is started. Unlike the RPC in the Unix system, the subprocess spawned does not have to report to its parent when the process is finished unless the program specifies it.

4.3 Implementation Techniques

In general, the marker propagation networks approach needs the basic parallel marker propagation requirements, as described in the previous chapter. In this section, we discuss some issues required for an optimal marker propagation networks application implementation. Some of the techniques are provided as solutions for the problems in the last chapter.

Figure 4.7: Marker propagation critical path (marker PATH is R1*R2*)

4.3.1 Knowledge Base

Knowledge Base Construction

From the parallel processing point of view, the critical path of the sequential part of the program dominates the processing time. The same concept can be applied to the knowledge base of the marker propagation model.
Consider the knowledge base shown in Figure 4.7 with the propagation path defined as R1*R2*. The marker will be propagated through two paths: R1R2R2 and R1R1R1R2. Assuming the operation at each node requires the same processing time, the path R1R1R1R2 becomes the critical path for the propagation.

In general, a knowledge base is constructed with a hierarchical structure, such as a tree. The entry point for starting propagation is normally from the tree leaves. To reduce the critical path, one way is to reduce the height of the tree. This approach requires eliminating some of the unused intermediate nodes. For example, for the elephant semantic networks shown in Figure 2.1, we can reduce the height by removing the CIRCUS-ELEPHANT concept and adding it to the node as a feature, as shown in Figure 4.8. Another improvement can be made by adding redundant relations. For example, with the elephant semantic networks, a new link can be created between DUMBO and ELEPHANT. However, the link must not be the same as IS-A, to

Figure 4.8: Semantic networks knowledge base after height reduction

avoid the overlap problem. In conclusion, for better performance, the knowledge base construction must be shallow from the marker propagation point of view, while the knowledge base integrity must be preserved.

Knowledge Base Loading

Marker propagation networks requires a huge knowledge base for inferring through propagating markers. The knowledge base can be provided in two forms for downloading into the system:

• Canonical Form: The canonical form is a list of knowledge base entries which will be executed to create the knowledge base at runtime.
• Memory Image: The memory image is the result of saved knowledge base data.
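Returning for a moment to knowledge base construction, the effect of height reduction on the critical path can be made concrete with a small sketch. This is illustrative only; the concept names follow the elephant example, and the fold of the intermediate concept into a feature is assumed to preserve the IS-A reachability.

```python
# Sketch: the propagation critical path is the longest chain of links a marker
# can follow; flattening an unused intermediate concept shortens that chain.

def longest_path(edges, start):
    # edges: node -> list of successor nodes; returns max hop count from start
    succs = edges.get(start, [])
    return 0 if not succs else 1 + max(longest_path(edges, s) for s in succs)

before = {"DUMBO": ["CIRCUS-ELEPHANT"],
          "CIRCUS-ELEPHANT": ["ELEPHANT"],
          "ELEPHANT": ["MAMMAL"]}

after = {"DUMBO": ["ELEPHANT"],   # CIRCUS-ELEPHANT folded into a node feature
         "ELEPHANT": ["MAMMAL"]}

print(longest_path(before, "DUMBO"), longest_path(after, "DUMBO"))  # 3 2
```

One hop fewer on the critical path saves one full propagation step for every marker that climbs this hierarchy.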
A good example of loading a canonical knowledge base is the loading of a LISP program. The knowledge base is created after the knowledge base LISP program is loaded. Intuitively, the loading of such a knowledge base must be very slow because of the time required for memory allocation, but this provides the flexibility of allocating the nodes to a physical processor at runtime. Loading a memory image is much faster than loading the canonical form. However, such a loading mechanism is less flexible in the parallel processor system. Consider the saved LISP program image after the creation of the knowledge base: if the required number of processors is not available or relocation is needed, the statically allocated memory image is not usable, because the relocation of the memory image requires major changes to the memory data indexes.

For a marker propagation algorithm, a large knowledge base is needed, and the time for loading a knowledge base from its canonical form is unacceptable. Therefore, the knowledge base must be converted into the memory image format. To provide the knowledge base memory image for a parallel processor system, the memory image must be treated so it can be mapped onto different parallel processor configurations. This can be achieved by block partitioning. Block partitioning treats the parallel processor system as a virtual processor system with a large number of processors. When the knowledge base is first created, it is partitioned according to the predefined number of virtual processors. When loading the knowledge base, several knowledge base blocks are loaded into a processor. A block map is created at every physical processor to record the virtual processor to physical processor association.
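The block partitioning scheme described above can be sketched as follows. The round-robin assignment and the node-to-block hashing are assumptions for illustration; the point is that the same memory image loads unchanged onto any processor count, with only the block map differing.

```python
# Sketch: a knowledge base partitioned into blocks for V virtual processors,
# dealt out to P physical processors at load time via a block map.

def make_block_map(num_virtual, num_physical):
    # virtual processor v -> physical processor hosting its block
    # (round-robin assignment, chosen here for illustration)
    return {v: v % num_physical for v in range(num_virtual)}

def locate(node_id, num_virtual, block_map):
    v = node_id % num_virtual     # which virtual processor owns the node
    return block_map[v]           # which physical processor holds that block

bmap = make_block_map(num_virtual=8, num_physical=3)
# The 8-block image loads on 3 processors here, or on any other count,
# without re-indexing the image itself: only the block map changes.
print(locate(13, 8, bmap))        # node 13 -> virtual proc 5 -> physical proc 2
```

Because all inter-node references stay in virtual-processor coordinates, relocation never touches the memory data indexes, which is exactly the problem the text identifies with a flat memory image.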
4.3.2 Process Flow Marker Propagation Implementations

Propagation Path Format

The marker propagation networks programming is provided with a C-like language with parallel extensions. Some of the extensions are used for specifying the propagation path, namely propagation rules. The extensions are handled by a preprocessor which outputs a tabular representation of the propagation paths in a format known to the library routines and the kernel. This table contains information about marker state transitions, transition conditions and state-dependent actions. The application programmer specifies the propagation path using the pattern keyword, followed by a regular expression composed of path-rule names and the usual operators: concatenation, union and closure. Parentheses can be used to modify the order in which operators are applied. The path-rule is in C-language function format. For different path-rules, different destination objects have to be specified in the argument field. For example, "PN(M,N)" means the propagation to a node, while "PRN(M,R)" means the propagation from a current node to its neighbor nodes through the relation R, where M is the marker. However, there is no restriction on the destination. Some of the path-rules have actions which are executed during propagation. Currently, only one action can be specified for any knowledge object on the path. For example, the propagation path PRNF(M,R1,f)PRNF(M,R2,g) indicates that the marker M is propagated from the current node via relation R1 to some destination. The function f is then executed at the destination nodes. Markers are then sent along one R2 link, and function g is executed at the destinations. In particular, the action may be executed in every iteration of the closure, in the first one, or in the very last. Some combinations of the closure and action path-rules
may lead to ambiguities regarding the timing of actions. For example, the rule PRNF(M,R1,f)* PRNF(M,R2,g)* makes it legal to execute either action f or action g at the origination node. We will further analyze rule ambiguities later in this section.

In the following, we describe some of the path-rules currently supported by the marker propagation networks model. We first look at the following propagation rule example, which is for detecting the most specific components of a composite object. Any component part can be reached through a sequence of IS-A and HAS-PART links, a classic case of inheritance of properties.

1. Propagation Rule inheritance (M,N) {
2.   PN(M,N);
3.   PRN(M,ISA)*;
4.   PRNL(M,HAS-PART,SET_MARKER)+;
5.   END
6. }

The propagation only performs the SET_MARKER function at the end of propagation. The path-rule PN(M,N) at line 2 is one of the one-hop rules, as shown in Table 4.2.

path-rule       propagation pattern and action
PN(M1,N)        Propagate marker M1 to node N
PC(M1,C)        Propagate marker M1 to the node group identified by class C
PM(M1,M2)       Propagate marker M1 to the node group identified by marker M2
PNF(M1,N,f)     Propagate marker M1 to node N and execute function f at N
PCF(M1,C,f)     Propagate marker M1 to the node group identified by class C and execute function f at the nodes
PMF(M1,M2,f)    Propagate marker M1 to the node group identified by marker M2 and execute function f at the nodes
PRF(M1,R,f)     Propagate marker M1 to the node group which has relation R and execute function f at the nodes

Table 4.2: Path-rules for one-hop propagation only

This set of rules, in general, is used for the first path-rule of the propagation rule, which is called by the main program, but they can be applied to the body of the propagation rule, too.

The path-rules at line 3 and line 4 are multi-hop path rules which can be used with closure notation in the propagation rule program.
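What the inheritance rule computes can be sketched in plain Python (not the path-rule DSL; graph and concept names are illustrative): follow IS-A links any number of times, then follow HAS-PART links at least once, and set the marker at the end of each HAS-PART chain, as PRNL does at the last destination nodes.

```python
# Sketch of inheritance: PRN(M,ISA)* then PRNL(M,HAS-PART,SET_MARKER)+.

def inheritance(isa, has_part, start):
    # closure over IS-A: the start node plus everything reachable via IS-A
    frontier, ancestors = [start], {start}
    while frontier:
        n = frontier.pop()
        for m in isa.get(n, []):
            if m not in ancestors:
                ancestors.add(m)
                frontier.append(m)
    # HAS-PART+ : at least one hop, then keep following to the chain ends
    marked = set()
    def walk(n):
        parts = has_part.get(n, [])
        if not parts:
            marked.add(n)          # SET_MARKER at the last node of the chain
        for p in parts:
            walk(p)
    for a in ancestors:
        for p in has_part.get(a, []):
            walk(p)
    return marked

isa = {"DUMBO": ["ELEPHANT"], "ELEPHANT": ["MAMMAL"]}
has_part = {"ELEPHANT": ["TUSK"], "MAMMAL": ["LEGS"], "LEGS": ["FEET"]}
print(sorted(inheritance(isa, has_part, "DUMBO")))   # ['FEET', 'TUSK']
```

Starting from DUMBO, the marker climbs the IS-A chain and ends up marking the most specific component parts inherited from every ancestor.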
The path-rule set which supports multi-hop is shown in Table 4.3. In line 3, there is a closure following the PRN(M,IS-A). This means the propagation will go through chained IS-A links from the origination node. Also, it means that line 4 has to be evaluated at every hop on the path. But it is not equal to the following code:

1. Propagation Rule rule2 (M,N) {
2.   PN(M,N);
3.   PRN(M,ISA)* | PRNL(M,HAS-PART,SET_MARKER)+;
4.   END
5. }

path-rule       propagation pattern and action
PRN(M1,R)       Propagate marker M1 to the node group connected with the current node by relation R
PRNA(M1,R,f)    Propagate marker M1 to the node group connected with the current node by relation R and execute function f at the destination nodes
PRNF(M1,R,f)    Propagate marker M1 to the node group connected with the current node by relation R and execute function f at the first destination nodes
PRNL(M1,R,f)    Propagate marker M1 to the node group connected with the current node by relation R and execute function f at the last destination nodes

Table 4.3: Multi-hop path-rules provided by marker propagation networks

Here, the inheritance rule propagating with line 4 will not evaluate line 3 again, while rule2 still evaluates both rules at the same time. The last path-rule in the program asks that the marker propagate at least one hop, then it keeps on the HAS-PART path to the end. In general, for the path-rules without functions, the following notation can be applied for adding the function to the path-rule and is equivalent to the predefined path-rules with the function argument:

path-rule, f(destination node data);

Conditional Propagation Path

It is desirable to provide a conditional branch for propagation rule programming. For the example in Figure 4.3, at the R1+ state, there is a conditional branch to the R2* state.
To handle the branch, the programmer must understand that the data available to the function is the data at the destination node. With the node data, we can add the following syntax to the programming language:

1. path-rule1+,
2. if (f(destination node data) > 0) then {
3.   path-rule2;
4. }

The propagation rule is for the example in Figure 4.3, where path-rule1 is PRN(M1,R1) and path-rule2 is PRN(M1,R2). The conditional branch and the R1+ are separated by a comma, while the semicolon tells the compiler it is the end of the current statement. The evaluation only takes place at the destination node; therefore, if the evaluation has to be from the first origination node, the closure at line 1 must be *.

After the branch, path-rule2 need not be a single "path-rule" but can be a complete propagation rule. In the following propagation rules, the first branch switches to a totally different rule, while the second branch resumes when the branch path is finished. This is determined by the keyword CONT or END.

1. Propagation Rule rule1 (M,N) {
2.   PN(M,N);
3.   END
4. }
5. Propagation Rule rule2 (M,R) {
6.   PRN(M,N);
7.   CONT
8. }
9. Propagation Rule rule3 (M,N1,N2,R1,R2) {
10.   PN(N1),
11.   if (f(N1) > 0) then {
12.     rule1 (M,N2);}
13.   else {
14.     rule2 (M,R1);};
15.   PRN(ISA)*;
16.   END}

Runtime Propagation Path Representation

For the example shown in Figure 4.3, we can use a deterministic finite automaton (DFA) representation of the regular expression. This representation is used at run-time to keep track of the succession of links that the marker has already traversed, in order to compute the appropriate continuations. The use of DFAs is primarily for efficiency purposes, because every state transition is physically carried out by propagating a marker.
In general, more possible transitions means that more markers must be sent out. Also, using DFAs makes the computation and marker management a lot simpler. The propagation rule compilation starts by transforming the user-level propagation rule into a parse tree. Ambiguities in the specification of actions are analyzed using this intermediate format. In particular, constructs of the type RE1* RE2* should be checked for data dependency between the actions provided by RE1 and RE2 when both RE1 and RE2 have closure. This means that the path in both directions can be zero, which implies the origination node; therefore, both functions have to be applied at the origination node. Under these circumstances, if both actions conflict with each other, the result of the operations may not be correct. To solve this problem, we define the execution path for such a propagation rule to be top-down and left-to-right. The propagation, however, is to split the marker and effectively pursue both patterns in parallel. For a pattern like RE1* RE2+, as described previously in the example, both RE1 and RE2 are propagated in parallel. The action to be performed at a destination is dependent upon the path and is recorded in the marker message. Therefore, there is no ambiguity with such a pattern. The second transformation step generates a DFA from the parse tree. Finally, routing and action tables are computed based on the DFA specification. The routing table contains fields for the current state, knowledge object type, knowledge object name, next state, source state, distance, and closure, as well as a pointer to the list of functions to be executed when the marker reaches the destination. Some of the functions are conditional boolean functions that are useful for the conditional branch of propagation.
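A minimal sketch of the runtime representation described above follows. The routing/action table is written by hand here for the two-step path PRNF(M,R1,f)PRNF(M,R2,g); in the real system a compiler would generate it from the parse tree, and the table would carry the additional fields named in the text.

```python
# Sketch: a DFA-derived routing table driving one propagation step.
# (current_state, link_type) -> (next_state, action to run at the destination)
ROUTING = {
    (0, "R1"): (1, "f"),
    (1, "R2"): (2, "g"),
}

def step(state, out_links):
    # exhaustively consider every outgoing link as a routing-match candidate;
    # each match yields (destination, next_state, action) for a new marker
    return [(dest, ROUTING[(state, lt)][0], ROUTING[(state, lt)][1])
            for lt, dest in out_links if (state, lt) in ROUTING]

# A node with three outgoing links; only the R1 link matches in state 0.
print(step(0, [("R1", "N1"), ("R2", "N2"), ("R3", "N3")]))
```

Because the table is deterministic, each arriving marker produces at most one continuation per matching link, which is the efficiency argument made above: fewer transitions means fewer markers sent.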
4.3.3 Issues on Synchronization

An efficient synchronization scheme is needed for parallel programs. To date, most synchronization algorithms have been designed for deterministic algorithms in which the number of subprocesses created can be determined at compile time. Nondeterministic parallel algorithms, including Prolog and marker propagation networks, contain loop functions similar to a while loop. For a parallel approach, the number of subprocesses created for the while loop depends on the input data and program state. Several synchronization algorithms have been developed for parallel Prolog. However, these synchronization schemes either have too much overhead or may detect false synchronization conditions. This section describes a fail-proof synchronization for the marker propagation algorithm.

Global Propagation Termination Detection

Ichiyoshi [17] developed a synchronization detection protocol for global termination detection, which is based on a process count scheme. In WTC [38], the process count is extended to a weight and the synchronization protocol is enhanced to handle abortion efficiently. These synchronization mechanisms require that each processor maintain the process creation/termination counts at runtime. The synchronization information is then reported to a central process (in a processor or central controller) to ensure that each created process has terminated. Obviously, it is inefficient to inform the central process every time a single process is created or terminated. The subpool is created for counting the number of subprocesses created and processed. The processors report only the status of the subpool to the central process at subpool creation or termination, or when requested by the central process.
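The weighting idea behind WTC can be illustrated with a toy sketch (this is an illustration of the general weighted-count principle, not the published protocol): a process carries a weight, spawning a child hands part of the weight over, and a terminating process returns its weight to the controller. No per-process acknowledgements are needed; all work is finished exactly when the controller has received the full initial weight back.

```python
# Sketch: weight-based termination counting with a central controller.
TOTAL = 1024     # initial weight handed to the root process
returned = 0     # weight accumulated back at the controller

def terminate(weight):
    global returned
    returned += weight            # weight flows back to the controller

def spawn(parent_weight):
    half = parent_weight // 2     # split the weight with the new child
    return parent_weight - half, half

w_root = TOTAL
w_root, w_child = spawn(w_root)   # root spawns one subprocess
terminate(w_child)                # the child finishes first
terminate(w_root)                 # then the root
print(returned == TOTAL)          # True: global termination detected
```

The integer split limits how deep spawning can go before the weight runs out; practical schemes handle weight exhaustion separately, which is outside this sketch.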
With Ichiyoshi's approach, the synchronization overhead is the extra messages generated for acknowledgement, which is almost equal to the number of thrown processes created. The WTC approach saves the acknowledgement message overhead with a weighting scheme. But the number of subpools increases as the level of the subprocesses increases. Also, the messages to terminate subpools increase.

Taylor [42] proposed another scheme to detect global termination. He first identified that the global termination condition is reached when 1) all messages for subprocess creation are consumed and 2) all processing units are idle. The first condition indicates all subprocesses are processed. This can be achieved by calculating the total number of messages created for spawning sub-processes, which must be equal to the number of processes processed. For detecting both conditions, a token is passed from a start node (or the controller) to the other processors to accumulate the message count and processor status. When the start node observes two successive token cycles indicating all processors are idle and the message count is 0, the processing is terminated.

Taylor's approach greatly reduces the message traffic by grouping subpool information into a single counter in a processing element. The message overhead becomes just the token passing between processors. However, his approach may falsely detect the end of processing. Consider the following scenario for a two-processor system:

1. P1 sends P2 a process-create message.
2. P2 sends P1 a process-create message.
3. Before P2 receives the message from P1, it reports to the token: produce - consume = 1 - 0 = 1, and status = idle.
4. After processing the message from P1, P2 sends P1 two process-create messages, and one process-create message to itself.
5.
After processing the messages from P2, P1 reports to the token: produce - consume = 1 - 2 = -1, and status = idle.

The token detects a false end of processing because it sees two processes created, two processes terminated, and both processors idle. But actually, there is still one process in P2 waiting for processing. To prevent this situation, once the idle status is set, the processor must be disabled until the detection procedure ends. In this case, P2 must confirm with the controller before processing the creation message from P1. However, with a message-passing multiprocessor, the turnaround time could be long. The number of messages required becomes p x (i + 2c), where c is the total number of confirmation requests. Normally, c is equal to i. Therefore the message traffic is tripled. The following scheme is thus introduced to reduce the message passing requirement.

Tiered Propagation Termination Detection

To investigate the synchronization requirements for TSP, consider the parallel program shown below. In P0, the two processes P1 and P2 are parallel nondeterministic processes, and synchronization is needed at the end of their processing before proceeding to P3.

P0() {
  cobegin;
    P1;
    P2;
  coend;
  P3;
}

For deterministic processes, synchronization of P1 and P2 can be detected at compile time. In a nondeterministic algorithm, P1 or P2 will generate a number of subprocesses according to the input data; P0 would have to wait until all subprocesses of P1 and P2 are terminated, as shown in Figure 4.9. The processes P1 and P2 correspond to the processes P_1^{1,1} and P_2^{1,1} respectively. At the end of P1 and P2 processing, the number of active subprocesses belonging to the current thread must converge to 0. At this state, no more subprocesses will be created; therefore, no thrown process message will be sent out.
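Returning for a moment to Taylor's two-processor scenario, the false detection can be reproduced with plain counters (an illustrative simulation only): the token sums produce/consume snapshots taken at different instants, so it can read "all created equals all consumed, everyone idle" while a message is still in flight.

```python
# Sketch: snapshots reported to the token in the scenario above.

# P2's report at step 3, taken before it consumes P1's message:
p2_report = {"produced": 1, "consumed": 0, "idle": True}
# P2 then consumes P1's message and produces 3 more messages (2 to P1,
# 1 to itself), but none of that activity is in its earlier snapshot.

# P1's report at step 5, after consuming two of P2's messages:
p1_report = {"produced": 1, "consumed": 2, "idle": True}

token_count = (p1_report["produced"] - p1_report["consumed"]
               + p2_report["produced"] - p2_report["consumed"])
all_idle = p1_report["idle"] and p2_report["idle"]
token_says_done = (token_count == 0) and all_idle

pending_in_p2 = 1   # P2's self-addressed process-create message, unreported
print(token_says_done, pending_in_p2 > 0)   # True True: a false detection
```

The inconsistency comes purely from the non-simultaneous snapshots, which is why the text requires either disabling an idle processor during detection or the tiered scheme that follows.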
The synchronization detection with TSP is also based on the subprocess count. The subprocess count for TSP is maintained with tiered information. The tiered information is used to distinguish the different levels of subprocess production and consumption. That is, for the kth level,

  Σ_i n_i^spawned = Σ_i n_i^terminated.

For example, the global sum of first-level processes created must be equal to the number of first-level processes consumed. The synchronization conditions for TSP are

• detecting a level of creation with 0 subprocesses, and
• the processes created being equal to the processes processed at all levels.

The first condition is equivalent to the two conditions proposed by Taylor. The second condition provides the information to prevent the false detection. Also, such information can be used to distinguish between different parallel nondeterministic processes when they are processed concurrently and synchronized at different points. To avoid reporting separate spawn/terminate counts for different levels of subprocess processing, each processor maintains a set of counters for each parallel nondeterministic process and reports to the central process only when it is necessary.

To provide the TSP with correct tiered information, the process created/processed count at each level should be well maintained. For the process-processed count, the count is increased according to a terminated subprocess. However, for the process-created count, the following two forms of subprocess creation have to be considered.

1. Subprocess Creation with Thrown Process Message: For subprocesses created with thrown process messages, the process-created count can be easily maintained. For each parent process of a level, the process-created count is increased whenever a parent process of the level requests a subprocess creation.

2.
Subprocess Creation with Broadcasting Message: For subprocesses created with a broadcasting message, the tiered information has to be interpreted in a different way.

For example, as depicted in Figure 4.9, execution is initiated by the m = 2 parent processes P1^{1,1} and P2^{1,1}. P1^{1,1} spawns n = 2 subprocesses.

Figure 4.9: Dynamic Process Creation.

Synchronization on Marker Collision

With the marker propagation networks, several markers can be propagated on the knowledge base at the same time. The markers may propagate with the same propagation rule or with different rules. In general, some of the propagation paths may intersect with each other. Under such circumstances, the programmer may ask the process to do some computation. Synchronization is necessary to assure that the markers have met before the process is executed. Unlike global synchronization detection, local synchronization can be easily implemented with a software approach, such as a semaphore.

Consider the example below:

1.  Propagation Rule rule1 (M, N) {
2.    extern Ma;
3.    PN(M, N), f(M, Ma);
4.    RETURN NODE
5.  }
6.  Propagation Rule rule2 (M, C) {
7.    PC(M, C);
8.    END
9.  }
10. main() {
11.   Marker M1, M2; Node N;
12.   rule2(M2, "Human");
13.   N = rule1(M1, "John");
14.   BIND f(M1, M2)
15. }

The program propagates a marker M1 to the node John, and propagates the marker M2 to the node group identified by the class Human. At the end of propagation, the nodes marked with marker M1 must be returned to the host. The function f requires two marker arguments. One of them is modified with the extern keyword, meaning the marker Ma will be specified later by the BIND. In this example, the function f must wait for both M1 and M2 before execution. The waiting mechanism is provided by a semaphore-like mechanism.
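The semaphore-like waiting just described can be sketched as below: a marker arriving at a node either finds its bound partner already there, in which case f fires, or parks until the partner arrives. The structure and function names are assumptions for illustration, not the SNAP-2 API.

```c
#include <stdbool.h>

/* Per-node wait state for a single BIND f(M1, M2) pairing. */
struct node_wait {
    bool m1_here;   /* marker M1 has arrived at this node */
    bool m2_here;   /* marker M2 has arrived at this node */
    int  fired;     /* number of times f executed at this node */
};

/* Called on marker arrival; which == 1 for M1, which == 2 for M2.
 * Returns true if f was executed on this arrival. */
bool marker_arrive(struct node_wait *n, int which)
{
    if (which == 1) n->m1_here = true;
    else            n->m2_here = true;

    if (n->m1_here && n->m2_here) {
        n->fired++;                       /* both markers present: run f */
        n->m1_here = n->m2_here = false;  /* consume the pair */
        return true;
    }
    return false;                         /* partner missing: marker waits */
}
```

Whichever marker arrives first is simply recorded and left in the waiting state; the second arrival triggers the computation, matching the check-then-wait behavior described in the text.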
When the BIND is read by the compiler, the process control for marker M1 is set up to wait for M2 before executing function f. When a marker M1 arrives, the process first checks whether M2 has arrived. If M2 is at the current node, the function will be executed; otherwise M1 is put into the waiting state and will not be activated until an M2 arrives at the same node. Since statement 14 requests a return from the propagation, a global synchronization is required for M1. Since M2 is bound with M1, the results also have to wait for the global termination of M2.

Chapter 5

SNAP-2 Parallel Marker Propagation Networks Prototype

The design of the SNAP-2 parallel marker propagation system architecture is based on the Marker Propagation Networks model described in the previous chapters, programming experience with the SNAP-1 system, and processing speed and scalability criteria. We designed and built a SNAP-2 system prototype based on the SNAP-2 architecture. To speed up the completion of a usable system, the SNAP-2 prototype is implemented with only off-the-shelf components. While there are some constraints and compromises imposed by the given hardware, the SNAP-2 parallel processor system prototype still makes an interesting research vehicle. As some of the function modules are not available as off-the-shelf components, such functions are accomplished with a set of software routines inside the processing element kernel. In this chapter, we describe the SNAP-2 system architecture and the design tradeoffs and hardware implementation issues of the SNAP-2 prototype.

5.1 System Overview

The SNAP-2 parallel processor system is based on the message passing architecture.
As shown in Figure 5.1, the SNAP-2 system consists of a set of PEs (processing elements) to store and process the semantic networks, and a control hub to relay the global data between host and PEs.

Figure 5.1: The SNAP-2 system organization

There are two sets of networks associated with the SNAP-2 system: the host-array communication network and the inter-processor communication network. As shown in Figure 5.1, the host-array communication is composed of three networks. The synchronization network is provided for global synchronization among processors and is designed to reduce the message requirement for synchronization. The global communication network is provided for broadcast operations and global data collection. The host to processing elements communication network provides host to individual PE communication. The controller connects the host with the global data bus and the synchronization data bus, as certain data duplication/combination operations are required for global data communication and synchronization detection. The inter-processor communication network is provided for communication between PEs during processing. There are three advantages to providing different sets of networks for different types of communication: (1) the network topology can differ between the two networks to meet different communication requirements; (2) any hardware modification on one network has no effect on the other networks; (3) better communication bandwidth. For the prototype system implementation, the inter-processor communication occurs over a high speed detachable backplane which can be easily changed for different network topologies.
The host-nodes communication network is connected through the EISA bus to simplify the hardware design.

For a marker propagation application running on the SNAP-2, the semantic networks are stored as a distributed knowledge base. Knowledge base partitioning functions can be applied to divide the knowledge base into node blocks. Each node block is allocated to a cluster which processes all of its nodes, relations, classes, and markers. The operations are initiated by requests from the host. Most of the processing is performed in the PEs. As input data is read from a file, the application on the host requests the PEs, by sending or broadcasting a remote procedure call, to start the processing on the corresponding concepts. Markers are then propagated through the knowledge base, invoking new propagations according to the processing context. As the markers move, they perform constraint checks and activate suitable concepts. Markers are sent in parallel by the PEs. When the propagation terminates, the host retrieves the results from the PEs and returns the results to the program.

5.1.1 Prototype Design Goals

The design requirements and constraints for the SNAP-2 prototype are shown in Table 5.1. The primary objective was to design a machine that meets the requirements of the Marker Propagation Networks model and can serve as a research vehicle for marker propagation application development and optimization. The system is also carefully designed to be a scalable high performance system for AI applications using the marker propagation approach. In the rest of this section we discuss some architectural features of the SNAP-2 prototype.
Functionality: provide an MPN development platform for programming, executing and analyzing marker propagation applications; provide SNAP-1 compatible functions.
Capacity: store a quarter-million-node semantic networks knowledge base; provide an average of 10 relations per node.
Programming: parallel compiler; a programming model that isolates the programmer from data allocation and low level communication issues; provide a parallel program debugger.
Instrumentation: support system performance experiments to conduct design tradeoffs for the full scale SNAP-2; support program profiling for application tuning and MPN model refinement.
Hardware: use only off-the-shelf components.
Connectivity: support for two interconnection networks; support for different topologies for the inter-processor communication network with a detachable communication backplane.
Complexity: maintain a low part count to reduce development time.

Table 5.1: SNAP-2 parallel processor prototype design objectives.

5.1.2 Knowledge Base Capacity

It is necessary to provide extensive semantic networks for real world AI applications. As the knowledge size grows, processing times for search operations, such as searching for semantic objects, searching for free semantic objects for allocation, and searching the memory heap when deallocating freed semantic objects, increase accordingly. Historically, search is a difficult function to perform on general purpose computers. In practical approaches, heuristics, hashing, and other techniques have been developed to improve search time. Ideally, the use of content addressable memory (CAM) is the optimal solution for the search problem. The CAM differs from a conventional memory module in that its data retrieval key is based on the content of the memory instead of the address. Basic CAMs have been widely used for cache and translation lookaside buffer implementations.
Advanced CAMs can provide both fast search and simple operations on the memory content. However, CAM fabrication requires a much higher transistor count than regular memory. For commercially available chips, the physical chip size of a 16KB CAM is the same as that of a 16MB dynamic RAM. Storing a large knowledge base with CAM is not feasible.

The SNAP-2 architecture is designed to support millions of concepts. The tradeoff was made to design the SNAP-2 prototype without CAM, since its density is too low to make a useful system of reasonable physical size. The search operation in SNAP-2 is implemented with hash tables and linked lists. Each SNAP-2 PE can be configured with 16MB to 64MB of memory. From the SNAP-2 simulation results, a semantic node requires roughly 512 bytes. Therefore, a SNAP-2 PE can handle up to 32K knowledge nodes, and 32 PEs are needed to handle one million nodes.

5.1.3 Processing Models

The prototype is designed to support the SIMD, MIMD and SPMD processing models. For the SIMD mode, the prototype is upward compatible with the SNAP-1 programming model. There are several SNAP-1 applications available for evaluating the system performance. In this mode of operation, the main program runs on the host and controls the PEs through the global data bus. The instructions from the host are broadcast to the PEs. The data collected from the PEs is queued and read by the host sequentially. The PEs are configured with the same program, which responds to the broadcast instructions, and are synchronized at the instruction level. With the MIMD mode, the marker propagation application must be programmed with the Marker Propagation Networks model. The application is then divided into two parts: a host program for sequential control and a PE program for parallel execution.
The main sequential program resides in the host, and the parallel execution code is downloaded into the processing elements. The processors are first configured with a small kernel and then with the user's parallel function code. During processing, the host sends different instructions to different PEs or broadcasts instructions to all PEs. For the latter operation, the SNAP-1 SIMD function set can be used by the application. The PEs are synchronized at predetermined barriers. At the end of processing, the processors send results back to the host program. For SPMD processing, the PEs are set up with the same program. Each PE has a copy of the knowledge base instead of a partitioned knowledge base. During processing, the PEs provide the service in round-robin fashion but execute in parallel.

5.1.4 Hardware for Global Synchronization Detection

The problem with synchronization in parallel processor systems is the lack of a global view of processor activity. Several hardware synchronization mechanisms have been proposed for parallel processor systems [15] [34] [35]. However, they are designed for deterministic algorithms, in which the synchronization points can be detected at compile time. As described in Chapter 4, the synchronization condition for a nondeterministic algorithm is that

• all processors are idle, with no sub-process in the queue, and
• the message queue length is zero.

This can be done with the circuit shown in Figure 5.2.

Figure 5.2: Hardware synchronization mechanism for nondeterministic algorithm

However, this circuit is only useful for detecting global synchronization. If only some of the processes are required to participate in the synchronization, the detection has to identify the processes in the process queue and the message queue.
This is not suitable for hardware implementation. Another condition, equivalent to the above, is that all the subprocesses created from a root parent are equal to the subprocesses terminated. This simple condition is not suitable for a message passing protocol, since we may detect false synchronization points, as described in Chapter 4. This is due to the communication latency: if the synchronization information collection is slower than the process creation message passing, then the simple process count condition cannot be used for detecting the synchronization condition.

To reduce the hardware complexity, the synchronization approach employed in the SNAP-2 prototype is based on the tiered synchronization scheme developed for nondeterministic process creation, as described in Chapter 4. The hardware synchronization support for the SNAP-2 prototype has two purposes. The first is to provide the SIMD mode with a lock step execution mode. This feature can be used for parallel debugging in the single step debugging mode. The second purpose is to reduce the messages required for reporting the synchronization information to the host.

5.1.5 Interconnection Network Topologies

The SNAP-2 system is designed with two sets of communication networks. One of the networks is dedicated to host to processing element communication. The other network provides the inter-processor communication required by a generic message passing parallel processor machine. In this section, we discuss the communication network topologies provided by the SNAP-2 prototype.

Host-Node Communication Network

The host-array network provides three different functions: global data transfer, synchronization detection, and individual communication between processing elements and the host.
The key design issues for the host-array communication network are topology, capacity, and latency. With state of the art inter-processor communication technology, different communication topologies can emulate each other with little overhead. However, our approach still takes the traditional path of designing the host-array communication network so that the communication pattern matches the communication topology. We evaluated the required communication pattern with the SNAP-1 system, which showed that the communication topology between the host and the processing elements should be a tree interconnect. In general, a communication pattern such as broadcasting amounts to sending a message to all leaves of the tree. Data collection sometimes requires a reduction operation to merge the data from all leaves. The host-node communication network is therefore designed with a fat tree topology.

Inter-Processor Communication Network

The inter-processor communication network is one of the major factors affecting the marker propagation performance. The topology must accommodate the required number of PEs within physical constraints such as wiring. The study of marker propagation applications with the SNAP-1 system reveals that marker propagation requires passing a massive number of processes. According to our experiments [26] [23], the best choice is the crossbar network, due to its low latency and high communication capacity. The drawback of the crossbar, of course, is that its implementation cost and complexity are high. The hypercube topology, which has lower cost and complexity, has lower bandwidth and higher latency than the crossbar, but is the best alternative and is implemented in the SNAP-1 system.
Though the SNAP-2 system is designed to be a scalable system, our choice of network topology is not limited. State of the art computer system design hides the physical topology from the programming perspective. Therefore, a change of physical topology only varies the network performance, not the programmer's logical view. Based on this concept, we have decided to support SNAP-2 systems with different topologies, chosen according to the communication technology, implementation cost and complexity (processor number), and physical constraints. For a small number of PEs, such as the SNAP-2 prototype system, the point to point topology is employed. For systems with up to 64 processing elements, we have decided upon the hypercube topology.

5.2 SNAP-2 Prototype Hardware Design

The prototype is implemented with only off-the-shelf components. No custom VLSI chip is used, to reduce the development time. Glue logic is designed with PALs to reduce the physical size. The design and implementation of the PE, interconnection network and interface to the host are described in this section. The SNAP-2 prototype specification is shown in Table 5.2.

5.2.1 System Host

As shown in Figure 5.3, the complete SNAP-2 system is installed in a single chassis with the host, peripherals and processor boards built in. The host is an Intel 486 based IBM-PC with the EISA (Extended Industry Standard Architecture) bus. A minimum system consists of four full size IBM-PC AT peripheral circuit boards. With a special IBM-PC motherboard, the SNAP-2 prototype can have up to eight processor boards.
All processor boards are identical; each contains a processor, local memory, the inter-processor communication circuit, and the host-array communication circuits.

Throughput: peak aggregate 2.2 GOPS / 400 MFLOPS; per processor 275 MOPS / 50 MFLOPS.
Memory: system capacity 64-256 MByte; per node 16-64 MByte.
Interconnection network: host-array — fat tree, 33MB/sec host to processor, 264MB/sec broadcast; inter-processor — point to point for 4-7 processors, hypercube for 8 processors, 20MB/sec per channel with communication coprocessor.
Word size: 32 bits.

Table 5.2: SNAP-2 Specifications.

Figure 5.3: The SNAP-2 parallel processor prototype.

The physical connections for the host-array communication network are built upon the EISA bus. The physical connection for inter-processor communication is provided with a detachable printed circuit board. Currently, only the fully connected configuration is provided. The software for the SNAP-2 is also shown in Figure 5.3. We have chosen the IBM-PC instead of a more powerful workstation as the host for the following reasons:

• Low cost, easy to maintain.
• The IBM-PC with the Intel DX2/66 processor provides better processing power than a SUN/4.
• The EISA architecture is well defined and easy to interface, with sufficient throughput for the prototype experiment.

The EISA bus provides a maximum 33MB/sec throughput. On the broadcasting channel, the maximum throughput is 264MB/sec with eight processing element boards installed.
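The broadcast figure follows directly from the per-board rate: each of the eight boards receives the same 33MB/sec EISA stream, so the aggregate delivered bandwidth is 8 × 33 = 264MB/sec. A trivial sketch of the arithmetic (the function name is illustrative):

```c
/* Aggregate broadcast bandwidth: every board receives the same
 * stream, so delivered bandwidth scales with the board count. */
int broadcast_throughput_mb(int per_board_mb, int boards)
{
    return per_board_mb * boards;
}
```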
5.2.2 Processing Elements

The choice of microprocessor was based on the requirement for a high performance processor with a floating point coprocessor, which is needed for applications such as speech understanding. The chip family must support fast, multiple communication ports for a message passing architecture. The CPU chip must be able to handle a large amount of physical memory, since we need a large knowledge base. We have chosen the Texas Instruments TMS320C40 for designing the processing elements. The features provided by the TMS320C40 fit our requirements, including large memory addressing capability, the floating point coprocessor and, most of all, six channels of two-way parallel communication with a communication coprocessor to free the CPU from the low level communication tasks and reduce the number of components. Maintaining a low chip count is very important for the prototype implementation, since the maximum size of a standard IBM-PC AT interface card is about one quarter of a 9U VME board.

Figure 5.4: The block diagram of the Processing Element.

The block diagram of the processing element is shown in Figure 5.4.
Each PE is implemented with the TMS320C40 processor, which is the main processing unit and also provides the communication channels. The PE is physically connected to two buses: EISA and the backplane. The EISA bus provides the functionality of the host-array communication network. The communication function is built with the host to processing element interface circuit, along with a dual port shared memory, as shown in Figure 5.4. The global data bus and the host to PE bus are emulated with the shared memory. Two blocks of memory are reserved for the two communication channels. From the array view, the memory addresses are fixed. From the host side, the address of the host-PE channel is fixed, while the global data channel is reconfigurable. The communication flow control between the host and the processing element is controlled by a handshake signal, which is provided in the host to processing element interface circuit. The synchronization signals are controlled by two of the TMS320C40 I/O pins and are connected to the synchronization network via the host-PE interface. There are two buses on the backplane. The inter-processor communication network provides the physical connections among processors, while the JTAG network is implemented to utilize the TI TMS320C40 processor emulator. The processing element is designed with two sets of memory. The data memory, which is a dual port shared memory system, provides the storage for the major program data such as the knowledge base, communication buffers and stacks. The user program and kernel are stored in the second memory, which is the program memory. The system is designed without a second level cache to reduce the board complexity.
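The handshake-controlled, memory-mapped channel described above can be sketched as a simple request/acknowledge exchange over the shared buffer. This is a minimal illustration, not the SNAP-2 register map: the flag layout, buffer size, and function names are assumptions.

```c
#include <stdint.h>

/* One memory-mapped communication channel in the dual-port memory. */
struct channel {
    volatile uint32_t req;  /* host raises when the buffer holds a message */
    volatile uint32_t ack;  /* PE raises after consuming the message       */
    uint32_t buf[64];       /* fixed-address message buffer                */
};

/* Host side: wait for the previous message to drain, then post. */
void host_send(struct channel *ch, const uint32_t *msg, int words)
{
    while (ch->req)
        ;                       /* previous message not yet consumed */
    for (int i = 0; i < words; i++)
        ch->buf[i] = msg[i];
    ch->ack = 0;
    ch->req = 1;                /* handshake: message ready */
}

/* PE side: returns the number of words copied out, 0 if none pending. */
int pe_receive(struct channel *ch, uint32_t *out, int words)
{
    if (!ch->req)
        return 0;
    for (int i = 0; i < words; i++)
        out[i] = ch->buf[i];
    ch->req = 0;
    ch->ack = 1;                /* handshake: buffer free again */
    return words;
}
```

The `volatile` qualifiers reflect that both sides see the same physical dual-port memory, so neither may cache the flag values.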
5.2.3 Interconnection Network Implementation

5.2.3.1 Host-Array Communication Network

The SNAP-2 host-array network implementation combines the three subnetworks to reduce the hardware complexity. A 32-bit data bus is provided for communication between the processing elements and the host. The data bus is shared by the global data network and the individual communication bus. The host-array communication network for the SNAP-2 prototype is shown in Figure 5.5. From the host viewpoint, the processing elements are connected to the host EISA bus via memory mapped communication channels. Some of the communication channels are assigned for host to processing element communications; others are assigned as broadcasting channels. During system bootup, whenever a processing element is found by EISA probing, a host to processing element communication channel is assigned. The processing element is presented to the programmer as a UNIX device. At the beginning of application execution, the program must request a set of processing elements. If the requested number of processors is available, a broadcasting channel is assigned to the requested processing elements. The processing elements are then configured to communicate with the broadcasting channel. Figure 5.5 shows three clusters of processing elements using three broadcasting channels. From the processing element view, the memory mapped broadcasting communication channel and the host to processing element communication channel are at fixed memory addresses.

To reduce the hardware complexity, we have decided to implement the host-array communication network without the control hub, because the prototype has only one level of tree. The functions provided by the control hub are therefore simulated by software.

5.2.3.2 Synchronization Network
As described in Chapter 4, the basic approach to detecting synchronization in a message passing system is to have all the child processes return a terminated message to their parents. When the root parent receives terminated signals from all children, the marker propagation activity is terminated. This approach requires a great deal of message bandwidth. For a marker propagation program with a distributed knowledge base, this high overhead is not acceptable. We introduced a synchronization protocol, namely the Tiered Synchronization Protocol described in Chapter 4, which reduces the number of synchronization messages. The TSP maintains a set of counters that keep information regarding the marker creation and termination at each level of propagation.

Figure 5.5: The SNAP-2 host-array communication network.

The TSP also checks a marker status to see if any more processes for the marker exist. Upon a request for synchronization, the PEs send the synchronization information for checking. In general, we want to reduce the traffic on the global data bus. The SNAP-2 system provides a hardware synchronization network to reduce the overhead. As shown in Figure 5.6, the synchronization network consists of an AND-gate tree which detects whether all PEs have reached the condition to send the synchronization information. When all processors respond with an empty marker process status, the AND-gate informs the host and the PEs to start the TSP data collection. The control hub in the SNAP-2 system can provide a reduction operation on the collected TSP data and send the result back to the host.
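Once the wired-AND indicates that every PE has an empty marker-process queue, the collected tiered counters are checked for the zero-sum condition at every level. The sketch below is illustrative only; the report layout, level limit, and names are assumptions, not the SNAP-2 data structures.

```c
#include <stdbool.h>

enum { MAXLEVEL = 8 };

/* One PE's TSP report: a created/terminated pair per propagation level. */
struct tsp_report {
    int created[MAXLEVEL];      /* subprocesses spawned, per level  */
    int terminated[MAXLEVEL];   /* subprocesses finished, per level */
};

/* Synchronization is detected when, at every level, the global count
 * of processes created equals the global count processed. */
bool tsp_synchronized(const struct tsp_report r[], int npe)
{
    for (int k = 0; k < MAXLEVEL; k++) {
        int sum = 0;
        for (int i = 0; i < npe; i++)
            sum += r[i].created[k] - r[i].terminated[k];
        if (sum != 0)
            return false;       /* this level still has live subprocesses */
    }
    return true;
}
```

The per-level sums are exactly the data the control hub (or, in the prototype, the host) reduces after the AND-gate fires; a nonzero sum at any level vetoes the tentative detection and processing resumes.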
In the SNAP-2 prototype, the control hub function is performed by the host. The TSP employed for the SNAP-2 prototype is shown in Figure 5.7. The processing elements do not report the information until all processing elements set the AND-gate.

Figure 5.6: Synchronization wired-AND tree in SNAP-2 prototype.

The tiered synchronization information is then examined by the host. If the synchronization condition is detected, the host informs the array to remove the terminated process.

5.2.3.3 Inter-Processor Communication Network

As described earlier in this chapter, the marker propagation application requires a high bandwidth communication network. In the SNAP-2 prototype, each processing element has six communication channels. The communication port provided with the TMS320C40 is driven at 20MHz. As shown in Figure 5.3, the processor boards are plugged into the host's main circuit board. The inter-processor network is provided with a detachable backplane, which connects the processor boards on the other edge. For a seven-board configuration, since the connection is point to point, the maximum physical distance between two communication ports is about 7cm. The TMS320C40 is capable of communicating at 20MB/sec on each channel. Since each PE has six communication channels, we designed a backplane with 7 connectors. The backplane provides a fully connected point to point communication topology for up to 7 processors with the six communication channels. For an eight board system, the backplane is designed with hypercube topology.
In fact, with six communication channels, the SNAP-2 prototype can be scaled up to 64 processors with a hypercube topology.

Figure 5.7: Tiered synchronization protocol for SNAP-2 prototype

On the backplane, a JTAG debugging network is also provided for the TMS320C40 processor emulator connection.

5.3 Summary

With the SNAP-1 prototype implementation experience, the design and implementation of SNAP-2 went quickly and accurately. The complete SNAP-2 hardware was implemented and debugged in four months. The detailed schematics of the SNAP-2 processor board are given in Appendix A. The SNAP-2 prototype implementation notes and PAL equations for the SNAP-2 processor board are available from an FTP site which is listed in Appendix B.

Chapter 6

Software Environments for SNAP-2

Computer hardware is the mechanism, which we often see as a physical entity. Software provides a set of policies executed by that mechanism. It is essential to provide these complementary functions to make a good system.

The runtime environment, programming tools and interfaces designed for the SNAP-2 parallel processor system are described in this chapter.
Most of the SNAP-2 software designs are adapted from standard UNIX environments and programming tools with parallel extensions. Therefore, the programmer can easily adapt to the SNAP-2 software environments and speed up application development.

6.1 Software Architecture Overview

The design goal of the system software, in general, is to bring out the maximum performance of the hardware system. However, an easy to use programming platform and program productivity are sometimes more important in the early stage of system software development.

Since providing an easy programming environment for the programmer is one of the major goals, the general approach is to hide the hardware under a set of software interfaces. From the programmer's viewpoint, the hardware interfaces become encapsulated software functions. The SNAP-2 has two major parts of hardware: the host and the processing elements. On the host, a UNIX operating system provides the programmer with the standard user interface. As shown in Figure 6.1, the host interfaces with the array processors through a "device driver." The device driver provides a set of functions to boot up the processing elements, check the processing elements, and interface with the processing elements. The device driver is compiled as part of the host operating system. The rest of the host operating system is adapted from the Berkeley 4.4 Lite BSD version of the UNIX operating system. On top of the host operating system are programming tools such as parallel function libraries, the MPN program compiler, a C compiler, a parallel program debugger, performance data gathering tools, etc. For the processing elements, a small kernel provides a set of system functions to interface with the host. The kernel is also in charge of the performance data gathering and parallel program debugging control. There are several protocols defined for the SNAP-2 system.
As shown in Figure 6.1, the synchronization protocol implements the Tiered Synchronization Protocol; the broadcast protocol carries broadcast messages from the host to the array; the host-processor protocol handles communication between the host and a single processing element; the processor-processor protocol handles communication between processing elements; the knowledge base download protocol downloads knowledge from the host to the processing elements; and the sanity protocol checks processing element functionality.

When the system is powered on, the operating system calls several routines in the device driver to determine the number of processing elements available in the system. The sanity check procedure then verifies the functionality of each processing element, including the processor, memory, interrupts, handshake signals, and communication ports. When an application program starts, it has to request processing elements by querying the device driver. At the end of the query function, a mapping table is constructed to inform the processors of the logical-to-physical processor number mapping. The semantic networks are then downloaded to the processing elements.

The marker propagation processes are initiated by requests from the host. As input data is read from a file, the application on the host requests the processing elements, by sending or broadcasting a remote procedure call, to start processing on the concepts corresponding to the input data. Markers are then propagated through the knowledge base, invoking new propagations according to the processing context. As the markers move, they perform constraint checks, activate suitable nodes, and create subprocesses. Markers are sent in parallel by the processing nodes. When the propagation terminates, either the marker result is sent directly from the processing elements to the host, or the host retrieves the results from the processing nodes through a collect-data request.

Figure 6.1: SNAP-2 Programming Environment Overview.

6.2 System Software with Runtime Optimization

Runtime modules are designed to optimize the execution of marker propagation applications. Some of these modules require modification of the system software, including the host operating system and the processing element kernel; others can be provided as separate server processes waiting for service calls. Several runtime modules are implemented in the SNAP-2 software, including an object name translation module, knowledge base and memory management modules, and a lightweight process management and migration module.

6.2.1 Memory and Knowledge Base Management

An efficient memory and knowledge base management system is very important for a high-performance marker propagation application, since we take the thread approach instead of handcrafted message passing programs. During program execution, many threads will be generated; in other words, many markers will be generated. Without the CAM, knowledge base object creation and deletion must also access conventional memory. We implement a memory buffering scheme in the processing element kernel to reduce the time spent on memory resource management. As described in the previous chapter, knowledge base loading and saving is important for reducing program initialization time. For example, the creation of the MUC5 [32] knowledge base takes about 60 minutes; it is therefore not acceptable to load the knowledge base in canonical form at run time. In the SNAP-2 approach, we apply the block partitioning scheme to the knowledge base. The partitioned knowledge base is then downloaded to the processing elements through the global data network at the beginning of application execution. As described in Chapter 3, a marker propagation program requires an external-to-internal representation translation. With the name table scheme proposed in Chapter 3, a user application can easily take up to 16 MB of memory; a shared name table scheme is provided to reduce memory usage with a minimal processing time penalty.

Memory Buffering

We propose a memory buffering technique to reduce the time spent on memory resource management. Generic buffering techniques have been applied to many operating system functions, including paging, the file system, communication, and other applications. The basic approach is to provide a set of fixed-size memory buffers for a specific data structure. For example, to implement the communication buffer, a set of memory blocks is allocated from system memory during system startup. The size of each block is fixed and equal to the predefined message data structure size.
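As an illustration, a minimal free-list buffer pool of this kind might look as follows; the sizes and names here are assumptions for the sketch, not the prototype's actual code.

```c
#include <stddef.h>

/* Fixed-size communication buffers allocated once at startup.
   Free buffers are chained through their first word, so no
   separate bookkeeping structure is needed. */
#define MSG_SIZE   64        /* assumed message structure size */
#define POOL_COUNT 128       /* assumed number of buffers */

typedef union buf {
    union buf *next;         /* valid only while the buffer is free */
    char payload[MSG_SIZE];
} buf_t;

static buf_t pool[POOL_COUNT];
static buf_t *free_list;

void pool_init(void)
{
    /* Link every buffer into the free list at system startup. */
    free_list = NULL;
    for (int i = 0; i < POOL_COUNT; i++) {
        pool[i].next = free_list;
        free_list = &pool[i];
    }
}

void *buf_alloc(void)
{
    /* O(1): detach the head of the free list; fail when exhausted. */
    if (!free_list)
        return NULL;
    buf_t *b = free_list;
    free_list = b->next;
    return b;
}

void buf_free(void *p)
{
    /* O(1): push the buffer back onto the free list. */
    buf_t *b = (buf_t *)p;
    b->next = free_list;
    free_list = b;
}
```

The constant-time detach and push operations are what make this scheme faster than a best-fit heap search.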
A buffer manager is provided to handle all buffer requests and reclaims; if the system runs out of buffers, the request call fails. The main advantage of buffering is the fast allocation and free time. When the buffers are first allocated, they are provided as a linked list of objects: a memory request detaches one buffer from the list, while a memory free call puts the buffer back on the list. With a generic memory allocation algorithm, on each request the memory manager must search the memory heap for the best fit, and on each free it must search for contiguous memory blocks to merge with the reclaimed one. The buffering scheme is therefore obviously much faster than a conventional memory management scheme. The problems with buffering are the fixed buffer size and the small fixed number of buffers.

By investigating the process flow of a marker propagation program, the types of memory requests can be listed as follows:

1. the memory request for a lightweight process or marker,

2. the memory request for a knowledge object, which includes node, relation, and class,

3. the memory request for external representation storage,

4. a conventional memory request, and

5. the request for a communication buffer.

Figure 6.2: The memory buffer for fixed-size memory request calls (per-size free lists, backed by the memory heap for other malloc and free requests).

Our memory operation optimization reduces the memory requests for the first and second kinds of object. The communication buffer is implemented with the conventional buffering scheme, as described previously. As shown in Figure 6.2, the
memory management is a two-level function call. The first level serves the memory requests for LWPs and knowledge objects; any other memory request is passed down to a conventional memory manager.

The implementation of the memory buffer management is based on memory size classification. For example, the lightweight process descriptor is a fixed-size data structure, and the knowledge base object descriptors are likewise fixed-size memory objects. The memory buffering scheme is depicted in Figure 6.2. In general, the object data sizes are fixed as well. The memory buffer implementation therefore consists of an array of memory buffer free lists. Each list corresponds to one fixed memory request size and is implemented with the same strategy as a regular buffer. However, unlike a regular buffer implementation with a fixed number of buffers, the number of buffers is expanded when a list is exhausted, and is reduced whenever there are too many free cells and other memory requests need more resources. Since the sizes of the memory objects are application dependent, the memory buffer table must be constructed according to the application's memory request pattern. This can be done at compile time by checking the application's memory request calls and recording the request sizes in a memory request table. During application initialization, the memory request table is retrieved to construct the memory buffer table.

Knowledge Base Construction

The knowledge base structure designed for the SNAP-2 prototype is based on the criteria described in Chapter 3 and Chapter 4. The following list shows the requirements for the knowledge base data structure design:
• support for marker, class, relation, and node knowledge objects,

• expandable object data,

• support for knowledge base partitioning, dynamic allocation, and download, and

• facilities for knowledge object search.

The simplified knowledge object structures provided by the SNAP-2 prototype are shown in Figure 6.3. A single knowledge object requires an object descriptor, which provides information regarding the object status, a linked list of data objects, and the links to other objects. The node structure provides information regarding the node identifier and data. The class, relation, and marker descriptor structures each have their own data entry for storing generic type information.

Figure 6.3: Knowledge object structures (ms-list: membership list; mb-list: member list).

There are several search operations provided by the SNAP-2 prototype. The basic search operations include searching for a node, for a relation type, for a class, and for a marker. For the basic search operations, a table lookup scheme is employed, and the object identifier serves as the index for the search. The complex search operations include searching for a node and marker combination, for a node and relation type combination, and for a node and class type combination.
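As a rough C sketch of the descriptor and search machinery, the layout might look as follows; the field names, types, and table size are assumptions based on Figure 6.3, not the prototype's actual definitions.

```c
/* Sketch of a node descriptor with its membership lists, a table
   lookup for the basic search, and a list walk for one of the
   combination searches. All names here are illustrative. */
#define MAX_NODES 1024

typedef struct list_entry {
    int owner_id;                   /* object this entry represents */
    struct list_entry *next;
} list_entry_t;

typedef struct node_desc {
    int node_id;
    list_entry_t *class_members;    /* class membership list */
    list_entry_t *marker_members;   /* markers present on this node */
    list_entry_t *relation_members; /* relations attached to this node */
    void *data;                     /* expandable object data */
} node_desc_t;

static node_desc_t *node_table[MAX_NODES];

void register_node(node_desc_t *n)
{
    node_table[n->node_id] = n;
}

/* Basic search: table lookup with the identifier as the index. */
node_desc_t *find_node(int id)
{
    return (id >= 0 && id < MAX_NODES) ? node_table[id] : 0;
}

/* Combination search (node + marker): walk the node's marker
   membership list instead of scanning every marker in the system. */
list_entry_t *find_marker_on_node(node_desc_t *n, int marker_id)
{
    for (list_entry_t *e = n ? n->marker_members : 0; e; e = e->next)
        if (e->owner_id == marker_id)
            return e;
    return 0;
}
```

The point of the per-node lists is that a combination search touches only the entries linked to one node, not the whole descriptor table.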
To speed up the combination searches, several member lists are maintained by the knowledge base manager. The list entries inside the descriptor structures, together with the list entry structures shown in Figure 6.3, are used to construct the member lists. The marker member lists serve node and marker combination searches; the class member lists serve class and node combination searches; and the relation member lists serve relation and node combination searches. Examples of the interlinks between the member lists and the knowledge objects are shown in Figure 6.4. As shown in Figure 6.3, several list structures have data fields. The data field stores individual data for the owner: for example, the data for a relation can be a value or weight, and the data for a marker can be the thread sender address. The relation also has a destination node field.

Figure 6.4: Knowledge object membership interlinks.

Knowledge Base Loading and Saving

The knowledge base transfer mechanism is designed to reduce the program initialization time. We employ the partitioning scheme described in Chapter 4 for knowledge base construction. We prefer node-oriented knowledge base creation for the SNAP-2, and partitioning is based on the node objects. Currently there is no programming tool for the user to construct the knowledge base; instead, a set of knowledge base functions is provided for the programmer to read a user-defined knowledge base, create the knowledge base in memory, and then save it as a knowledge base image.
During creation, the knowledge base creation process assumes that the canonical knowledge base has been processed with a topological sort, and it fills up the knowledge base blocks one by one. A new_block function is provided to force the creation process to use a new block for the next creation; in general, this function is used to put certain objects in the same block. At the end of knowledge base creation, the knowledge base image can be saved to a file by calling a save_kb function. To use the saved knowledge base image with an application, the program must call a load_kb function at the beginning of the program.

6.2.2 Process Management

For the application program that resides on the host, the program is simply in sequential form, and process management is therefore controlled by the generic UNIX operating system. For the programs in the processing elements, subprocesses, namely markers, are generated to be processed locally or at remote processors. The process management provided by the kernel in the processing element is designed to exploit this parallelism; therefore, thread creation at the local processor and at a remote processor receives different treatment.

Solution for the marker overrun problem

In a marker propagation program, after a node has processed a marker, subprocesses are created according to the propagation rule and the availability of outgoing relations. Some of them are for the current processor and are created with the internal propagation procedure; the rest are for other processors and are handled with the external propagation procedure. Since these subprocesses must be created at remote processors, this is achieved by sending a remote process call to the destination processor.
Practically, to obtain better parallelism, a good process management strategy is to create the remote processes first, so that the processors work together for maximum overlap of processing time. This can be accomplished by assigning a higher priority to processing the outgoing processes. However, before implementing this process management strategy, we have to solve the process overrun problem, as described in Chapter 3. In general, the process overrun problem occurs when too many waiting processes reside in the processing element and there is not enough memory for further processing, which may create even more processes. To solve this problem, we investigate process creation during marker propagation. After processing an incoming process, if all subprocesses at the next level are created before any subprocess is processed, this is the breadth-first approach. To reduce subprocess creation, the depth-first approach allows only one process to be created at a time, but this reduces the parallelism significantly. However, it is reasonable to traverse the internal propagation sequentially; therefore, for the internal propagation, it is not necessary to create all subprocesses at once. To optimize the external propagation, our approach is to classify the subprocess category by checking the residential status of the destination node. This is achieved by the knowledge base loading procedure: when the knowledge base is first loaded, the relations of a node are separated into an external category and an internal category. With such information, it is easier to schedule processes in a depth-first manner while maintaining high parallelism, using the following approach.
With the external and internal propagation information, a ready node (a process in the ready queue) can be classified as one of the following: it has both external and internal propagation, only external propagation, or only internal propagation. Since we want to maximize parallelism, when the communication resource is available, the ready nodes with external propagation must be assigned a higher priority. However, when no communication resource is available, the nodes with only external propagation must be assigned a lower priority; at that point, the nodes with only internal propagation and the nodes with both categories must be assigned a higher priority. In the SNAP-2, the nodes with both kinds of propagation are assigned the highest priority, because if such a node is in the run state and the communication resource runs out, no task switch is needed, since the node also has internal propagation.

Figure 6.5: The priority queues for SNAP-2 processes (I: queue for processes with only internal propagation; B: queue for processes with both external and internal propagation; E: queue for processes with only external propagation).

Priority Queue Arrangement

We have chosen the depth-first approach to prevent marker overrun. However, some internal propagations will eventually create external propagations while the communication resource is unavailable. Intuitively, a child process must be assigned a lower priority than its parent process. Therefore, when a process is created, it is assigned a priority number, and whenever it creates a subprocess, the child process has a priority one level lower.
The processing resource is always assigned to the highest priority queues. Combining both criteria, the priority queues for the SNAP-2 are shown in Figure 6.5. When the resource is scheduled, the processes in the B queue of the highest level get the resource first. When a process in the B queue runs out of either external or internal propagation, it is moved to the I queue or the E queue accordingly. If the B queue is empty and communication resources are available, the resources are assigned to the processes in the E queue. When the communication resource is not available but the processing resource is, and the B queue is empty, the processing resource is assigned to the processes in the I queue. A process in either the E or the I queue is removed from the queue when it finishes. When the resource is available and the queues of the current level are empty, the resource is assigned to the processes in the queues at the next level. Whenever communication resources are available, the processes in the B and E queues have higher priority to be scheduled for the next execution.

6.3 Programming Tools

The primary motivation for the programming tool implementation is to provide a development environment for parallel marker propagation applications.

6.3.1 SNAP-2 Programming Environment

It is a known fact that even though parallel processor systems have been available for years, users are still reluctant to use them. The main reason is that parallel machines lack software systems that would make them easy to use. In general, the application developer wants to program in a standard programming environment, including a standard programming language and programming tools. To achieve this goal, the SNAP-2 programming environment is built on the UNIX tradition, embodying a philosophy of software and application engineering based on rapid prototyping.
Rapid prototyping encourages the use of small modules that can easily be linked into discrete programs to perform new functions. The SNAP-2 programming environment includes the SNAP-C compiler and the parallel function libraries. The system interfaces are designed for easy access by programmers; layered modularity provides access to the SNAP-2 hardware at a variety of levels.

Marker Propagation Networks Application Programming

As we stated earlier in this dissertation, marker propagation can deal with parallelism explicitly; that is the goal of the marker propagation networks model. With the combination of the propagation rule program and the C language, we have a parallel programming language similar to a Data Parallel Language [11] [12]. From the programmer's view, our programming model exhibits the following characteristics: an imperative style of programming with explicit parallelism and explicit communication control. The programming model also looks like a single thread of control, though the propagations are actually processed in parallel with different propagation rules, which differs from generic data parallel programming. Propagation rule programming exhibits a more object-oriented approach than the data parallel approach.

Since the SNAP-2 system software is built on a UNIX operating system, we chose the C language as the base of the marker propagation programming language: C is the systems language that "belongs" to UNIX, and it is therefore the most natural choice for the implementation of the SNAP-2 programming language. From a practical point of view, using C as the programming language base has the following advantages:

• A C object code compiler is available for most commercial chips.
• The C language handles low-level operations elegantly, and is therefore well suited to the processing element kernel design and the library interfaces.

In fact, the trend in AI programming has made C one of the most popular programming languages other than LISP.

An MPN application consists of three parts: the knowledge base, the main program, and the propagation rules. The application program, which consists of the main program and the propagation rules, is written in a C-like language. The knowledge base creation procedure was described previously in this chapter. The application program has to be compiled into two executable object codes with the SNAP-C compiler. The SNAP-C compiler is similar to a preprocessor: during compilation, the application program is divided into two sets of programs, a host program and a processing element program, as shown in Figure 6.6.

Figure 6.6: A marker propagation networks application compiled with the SNAP-C compiler.

The guideline for dividing the application program is the propagation rules, which are transformed into a set of C functions. For the host program, a set of functions corresponding to the propagation rule function set is created, so that the host program can call the propagation rules at the processing elements remotely. The host program is then compiled with the compiler for the host architecture and linked with the parallel function library to produce the executable code. All the subroutines called by the propagation rules are also included in the processing element program.
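As an illustration of this split, a single propagation rule might compile into a PE-side function plus a host-side stub; everything below (the names, the table layout, and the RPC packing) is hypothetical, since the actual SNAP-C output is not shown here.

```c
/* Hypothetical sketch of the SNAP-C split: the rule body runs on
   the processing element, indexed through a generated rule table;
   the host gets a stub of the same shape that only ships the rule
   index and arguments as a remote procedure call. */
typedef void (*rule_fn)(int node_id, int marker_id);

int last_rule_node, last_rule_marker;   /* observation points only */

/* PE side: the propagation rule transformed into a C function.
   Here the body merely records its arguments. */
static void rule_spread_activation(int node_id, int marker_id)
{
    last_rule_node = node_id;
    last_rule_marker = marker_id;
}

/* PE side: rule table generated for fast rule dispatch. */
static const rule_fn rule_table[] = {
    rule_spread_activation,
};

void pe_dispatch(int rule_index, int node_id, int marker_id)
{
    rule_table[rule_index](node_id, marker_id);
}

/* Host side: generated stub that packs the call for the array.
   The actual send (e.g. a broadcast over the global network) is
   elided; a counter stands in for the RPC layer. */
struct rule_call { int rule_index, node_id, marker_id; };
int rpc_calls_sent;

void host_spread_activation(int node_id, int marker_id)
{
    struct rule_call c = { 0, node_id, marker_id };
    (void)c;                 /* the RPC send would go here */
    rpc_calls_sent++;
}
```

The same source-level call thus resolves to a local table dispatch on the array and a message send on the host.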
A pseudo main program is created for the processing element program to include all propagation rules and the required subroutines. A propagation rule table and a function table are also generated for fast propagation rule and function calls. The processing element program is then compiled with a compiler for the processing element architecture and linked with the parallel function library to produce the executable code.

Parallel Function Library

The parallel function libraries provide a set of pre-designed parallel program functions to speed up application program development. There are two sets of libraries, one for the host program and the other for the processing elements.

The function library for the host provides the essential functions for starting up a marker propagation application: functions to download the knowledge base, to save the knowledge base, and to initialize the processors. We also support the SIMD SNAP-1 function set for upward compatibility; therefore, application programs written in the SNAP-1 model can still run on the SNAP-2 system. Some low-level function calls are also provided for lower-level control of the processing element hardware. For the processing elements, the parallel library supports the propagation path-rule library, the kernel function calls, and low-level functions for inter-processor communication.

6.3.2 Performance Data Gathering

The SNAP-2 performance data gathering tools are designed to provide the programmer with detailed parallel program performance information, including the processing element CPU and resource utilization and the application program execution profile. A detailed application performance profile can help the programmer tune the parallel program by showing the bottlenecks of the program execution.
Application Performance Data Gathering

The application program profile data gathering tools collect several kinds of runtime data. To reduce the hardware complexity, we did not include a performance data sampling circuit in the SNAP-2 prototype. In general, without hardware support for runtime data sampling, the resolution is set at the function call level to limit the impact on overall execution. To sample the runtime data at the function level, the profiled application program must be modified with additional profiling code. The additional profiling code is inserted by the compiler at compile time and is activated when the profile option is enabled. As described in the previous section, the compilation of a SNAP-2 application generates two executable files: one for the host and the other for the processing elements. The host program is compiled with the GNU C compiler, which supports the profile option. The compiler for the processing element code, which is provided by TI and designed for TI hardware profiling, has no profile option. Our solution is to insert the required performance data gathering code into the functions of the processing element program at the first stage of compilation, as described previously.

The additional code inserted into the functions is a system function call. In the UNIX host operating system, the system call is "monitor." Due to the time-sharing nature of the UNIX system, the monitor routine cannot rely only on the real-time clock for recording function execution time, and special care has been taken to remove the task switching error. We provide a similar routine in the kernel for the processing elements. The kernel process management uses a non-preemptive policy and can be interrupted only by hardware interrupts; therefore, the error is compensated for by subtracting the interrupt service time.
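The compensation idea might be sketched as follows; the hook names, the clock representation, and the entry layout are assumptions, not the actual kernel interface.

```c
/* Sketch of a per-function profiling hook: the kernel accumulates
   interrupt service time in irq_time, and a profiled call charges
   itself elapsed time minus whatever interrupt time fell inside
   it, matching the compensation described in the text. */
typedef struct {
    unsigned long total_time;   /* accumulated execution time */
    unsigned long calls;        /* call frequency counter */
} prof_entry_t;

unsigned long clock_now;        /* stand-in for the real-time clock */
unsigned long irq_time;         /* total interrupt service time */

void prof_enter(unsigned long *t0, unsigned long *irq0)
{
    /* Snapshot both clocks at function entry. */
    *t0 = clock_now;
    *irq0 = irq_time;
}

void prof_exit(prof_entry_t *e, unsigned long t0, unsigned long irq0)
{
    /* Elapsed time, minus interrupt time accrued during the call. */
    e->total_time += (clock_now - t0) - (irq_time - irq0);
    e->calls++;
}

unsigned long prof_average(const prof_entry_t *e)
{
    /* Average execution time per iteration of the function. */
    return e->calls ? e->total_time / e->calls : 0;
}
```

Because the PE kernel is non-preemptive, interrupts are the only source of stolen time, so this single subtraction is sufficient there; on the time-sharing host, task-switch time has to be removed as well.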
The primary runtime datum to be recorded is the execution time. For each function, there is a time memory for storing the accumulated execution time of the function. To calculate the average execution time of a single iteration of the function, we maintain another counter per function, storing the function's call frequency. We are also interested in constructing the dynamic call graph of the application program's execution. This requires that the data sampling code record the arcs of the dynamic call graph that are traversed during execution; the frequency count is then associated with the call arcs instead of the function time. When the application program is compiled with the profile option, the main program is also modified to write the performance data into a file at the end of program execution.

Hardware Runtime Data Gathering

The hardware runtime data profile is provided to study the feasibility of the hardware design for the given programming-model-to-hardware mapping. In our case, it provides the guideline for a full-scale SNAP-2 system implementation. For the hardware utilization, we are interested in the processor load and the communication overhead. Unlike software runtime data gathering, which requires modification of the user program at compile time, the required hardware performance data gathering routine is built into the processing element kernel. The processor load sampling is invoked by the real-time clock at a predefined time interval. There are several approaches to calculating processor load. In the SNAP-2 prototype, we calculate the processor load PL from the length of the process queue PQ against a predefined process count C, as shown in the following equation:

PL = (PQ - S) / C

where S is the number of required system processes.
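The load formula, which appears to read PL = (PQ - S) / C, can be sketched as the periodic sampler the real-time clock would invoke; the constant values and the queue-length accounting below are assumptions for illustration.

```c
/* Sketch of the processor load sample PL = (PQ - S) / C.
   LOAD_SCALE stands in for C and SYS_PROCS for S; the scheduler
   is assumed to keep process_queue_len (PQ) up to date. */
#define LOAD_SCALE 100          /* assumed reference process count C */
#define SYS_PROCS  4            /* assumed required system processes S */

int process_queue_len;          /* PQ, maintained by the scheduler */

float sample_processor_load(void)
{
    int user = process_queue_len - SYS_PROCS;   /* PQ - S */
    if (user < 0)
        user = 0;               /* only system processes are queued */
    return (float)user / LOAD_SCALE;            /* divide by C */
}
```

Subtracting S first means an otherwise idle node, running only its required system processes, reports a load of zero.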
Two kernel system calls are provided to adjust the kernel sampling interval and the sampling data buffer size. The philosophy behind adjusting the sampling interval is that for programs with longer execution times, the interval can be larger to reduce the data recorded, while adjusting the buffer size allows the entire execution trace to be recorded for programs with short execution times.

The communication channel can be monitored through the communication system calls. In the SNAP-2 prototype, the inter-processor communication channel is handled by a communication co-processor, and the communication data is transferred by DMA. Without a hardware sampling tool, the communication data gathering focuses on communication queue usage and traffic statistics. A set of functions is provided to start and stop the sampling and to save the results into a file.

6.3.3 Parallel Debugger Implementation

For a parallel processor system, a good parallel debugger is required to find bugs that cannot be seen on a sequential system. In general, a parallel debugger implementation requires hardware support: the hardware is provided to synchronize the CPU operations on all processors. A basic approach is to provide a global CPU clock; however, this approach suffers from transmission-line
A low-speed clock is provided in the JTAG network as a synchronized CPU clock. When the debugger is in control, the CPU runs at the debugging clock speed. The debugging clock can also be disengaged to allow the CPU to process at full speed. During debugging, the debugger program runs on an IBM PC compatible with the OS/2 operating system. The debugger controls the CPUs through the proprietary interface to the JTAG network. This TI parallel debugger is provided as a development tool for systems developed with the TI TMS320C40 processor. There are several problems with using this package as the standard parallel debugger:

• The TI parallel debugger can be used to debug the C40 program only; since the host is built with the Intel 80x86 family, only the program in the processing element can be debugged.
• The TI parallel debugging package is very expensive compared to the system cost.
• The TI debugger allows only a single user interface and requires a separate IBM PC system for debugging.
• The TI debugger program is not portable and is not available for the SNAP-2 host operating system.

In the SNAP-2 hardware, we have implemented a hardwired synchronization network which can synchronize all processors to the same pace during debugging. Also, the broadcasting bus can command all processors to configure their timers and to start processing. Therefore, to implement the SNAP-2 parallel debugger, we employed the available hardware with only a software protocol added on. The broadcasting network available in the SNAP-2 hardware allows the easy implementation of a single-step debugging mode, while the synchronization hardware supports a free-running debugging mode.

Figure 6.7: The SNAP-2 parallel debugger (host debugger using ptrace, processing element debugger using sptrace through a device driver, with the Unix kernel, processing element kernel, and hardware/software links).

The standard Unix operating system provides a "ptrace" system call for process tracing. The SNAP-2 debugger is derived from the GNU debugger, gdb, which also uses ptrace for the debugging process on Unix systems. The ptrace provided by the Unix system, however, cannot trace the remote processes at the processing elements. To debug a parallel processor program, a good implementation should provide user interface windows for the processors. As shown in Figure 6.7, the SNAP-2 debugger is provided with two debugger sets. One debugger is designed for the host program and calls ptrace for program tracing. The other debugger calls sptrace to trace the program at the processing element. The sptrace is implemented as a set of functions compatible with ptrace and is interfaced with the processing element through a device driver. The sptrace is provided in one of the host parallel function libraries. The sptrace sends remote procedure calls to the processing element kernel for debugging processing. In the processing element, a set of ptrace-compatible kernel functions is provided for servicing the remote procedure calls. During debugging, the host takes over control of the synchronization acknowledgment signal. The processing elements are set up to respond to the synchronization signal right away. The TSP protocol is also switched to the mode without hardware support. This, however, will increase some traffic between the host and the processing elements.

6.4 Summary

Most of the software packages that we developed for the SNAP-2 prototype are derived from public-domain software packages and were modified for the parallel system.
These software packages include the parallel debugger, which is derived from the GNU debugger; the host operating system, which is derived from NetBSD; and the object code compilers, which are the GNU C compiler and the TI TMS320C40 C compiler. The software packages built from scratch are the parallel function library, the processing element kernel, and the SNAP-C compiler preprocessor. In this chapter, we have described the functions of the SNAP-2 software packages from a high-level perspective. The detailed documents and the source code for the software packages described in this chapter are available from an FTP site which is listed in Appendix B.

Chapter 7
Performance Studies

To study the improvement of the marker propagation networks, we have experimented with marker propagation network subsystems, such as the memory buffering system and the synchronization mechanism. To evaluate the SNAP-2 parallel processor system, we have run several marker propagation applications to gather performance results.

7.1 Synchronization

The tiered synchronization protocol is implemented in a simplified parallel marker propagation networks simulator on the iPSC/2. The TSP simulation provides information regarding the improvement over the basic synchronization approach. Two application benchmarks were executed with different simulation configurations to measure synchronization overhead using the tiered protocol. The unit of synchronization overhead is the number of communication messages transferred between processors. The first benchmark, TA, is depicted in Figure 7.1. It is a fragment of a knowledge base used to answer the query: Name all students who are single and live on campus. The second benchmark, 2D-mesh, is depicted in Figure 7.2. It is a 2-dimensional inheritance mesh where more specific concepts inherit properties from more general concepts.
A search algorithm is then applied to answer the query: Does concept number 0 have property number 99? With this query, the processing originates at the physical processor which holds node 0 of the mesh. Processes are then recursively spawned until node 99 is reached.

Figure 7.1: Benchmark #1: Teaching Assistant (TA)

Table 7.1: Number of messages required for the different synchronization protocols, and the reduction in message traffic using tiered reporting.

                      2-D Mesh                       TA
  # processors  Basic  Tiered  %reduction   Basic  Tiered  %reduction
       2          17      2      88.2%       120      4      96.7%
       4          49      8      83.7%       186     11      94.1%
       8          90     16      82.2%       222     22      90.1%

Figure 7.3 shows that the number of synchronization messages grows linearly as the number of processors is increased. We count only the messages communicated between processors. Table 7.1 compares these results with the basic synchronization protocol described in Chapter 4. Synchronization message traffic is significantly reduced. The results show that the tiered synchronization protocol is suitable for marker propagation applications and other parallel nondeterministic programs.
Figure 7.2: Benchmark #2: Inheritance Mesh (a 10×10 grid of concepts, numbered 0-99, connected by has-part links within a row and is-a links between rows).

Figure 7.3: Local synchronization message growth (number of messages versus number of PEs for the 2-D and TA benchmarks under the basic and tiered protocols).

7.2 Memory Buffering

It is essential to provide a good memory management system to support an efficient knowledge base subsystem. The main technique used in the SNAP-2 is to reduce the memory allocation time with a memory buffer. To study the characteristics of the SNAP-2 memory buffer scheme, we gathered memory access pattern data by running the program on a single-processor configuration with different input data. Intuitively, to study the memory access pattern, one should vary the size of the knowledge base; the experimental results obtained from different knowledge base sizes could then be used to plot a curve showing how the memory buffer mechanism responds to the size of the knowledge base. However, providing a consistent knowledge base with a variable size is not an easy task. Our approach, instead of providing a variable-size knowledge base, is to change the input size. This scheme is based on the concept that when processing with a semantic network, a virtual copy of the object must be provided for the processing instead of changing the one and only original copy. In this experiment, we employed the MUC5 NLU parser program for performance data sampling because it has the largest knowledge base we have. As shown in Figure 7.4, the use of the buffering scheme greatly reduces the number of "malloc" and "free" function calls invoked by the knowledge management requests.
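The buffering idea described here — satisfying knowledge base object allocations from a pool of recycled blocks so that the system malloc/free are rarely reached — can be sketched as a minimal free list. The names, the fixed block size, and the counter are illustrative assumptions, not the SNAP-2 source.

```c
#include <stdlib.h>

/* A minimal fixed-size-object buffer: freed blocks go on a free list
 * and are reused before asking the system allocator again. */
typedef struct block { struct block *next; } block_t;

static block_t *free_list = NULL;
static size_t block_size = 64;     /* illustrative object size */
static long system_mallocs = 0;    /* counts calls that reach malloc() */

void *kb_alloc(void)
{
    if (free_list) {                    /* reuse a recycled block */
        block_t *b = free_list;
        free_list = b->next;
        return b;
    }
    system_mallocs++;                   /* only now touch the system */
    return malloc(block_size);
}

void kb_free(void *p)
{
    block_t *b = (block_t *)p;          /* recycle instead of free() */
    b->next = free_list;
    free_list = b;
}
```

After a warm-up phase, the free list absorbs further allocation and deletion traffic, matching the saturation behavior reported below for the knowledge base manager.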
Because other functions also request memory management and are effectively filtered by the memory buffer scheme, the total free and malloc function calls are much higher without the memory buffer system. Figure 7.5 shows that the memory buffering for the knowledge base management is very good. Though the request activity is high, the number of requests that reach the system memory function calls stays in about the same range. This also indicates that, for a given knowledge base, the demand for the creation of new knowledge base objects will saturate. Eventually, the knowledge base manager will not ask for more memory from the system.

Figure 7.4: The performance with the memory buffering scheme (system memory function calls — malloc and free, with and without the buffer — for the total program and for the knowledge base manager).

Figure 7.5: The performance of memory buffering with a scaled-up knowledge base (malloc and free calls, with and without the buffer, versus knowledge base management function calls).

7.3 Performance Results with the SNAP-2 System

We ran two experiments with the SNAP-2 system: Speech Recognition and Natural Language Understanding. The first experiment employed the application developed by S. Chung [5], which is a parallel speech recognition program designed for the air traffic control domain. The second application was developed by M. Chung [3]; it is a parallel NLU parser designed for the MUC5 [32] domain.

PASS: Speech Understanding with the ATC Domain

PASS, the speech program, was developed with the SNAP-1 marker propagation model.
It is based on the memory parsing model, where the knowledge base is constructed with a set of concept sequences for the Air Traffic Control (ATC) domain. In our experiment, the speech program running on the SNAP-2 prototype is linked with an upward-compatible SNAP-1 parallel function library. There are two problems with this approach: (1) the synchronization points cannot be removed without affecting the correctness of the program, and (2) the advantage of the MPN model, which allows multi-propagation, cannot be utilized.

In our experiment, the ATC domain consists of 9,033 semantic nodes with 20,840 semantic relations. Due to the limited availability of the speech data, we were not able to vary the size of the knowledge base or the size of the input.

Figure 7.6 shows the execution time of the PASS program on the ATC domain for different numbers of PEs. The parallel processor configurations provide an overall speedup of less than 2, as shown in Figure 7.7.

Figure 7.6: The execution time of PASS on the SNAP-2 prototype (execution time versus number of PEs for two inputs).

Figure 7.7: The speedup of the PASS program on the SNAP-2 prototype (speedup versus number of PEs for two inputs).

PARALLEL: Natural Language Processing with the MUC5 Domain

The PARALLEL system was developed for message understanding. The complete PARALLEL system consists of a phrasal parser and a memory-based parser. However, the underlying programming model for the memory-based parser, again, is the SNAP-1 programming model. Due to the size of the knowledge base and the accuracy criteria, our experiments cannot vary the size of the knowledge base. The input data size, however, can be changed. An input message example is shown in Figure 7.8. Another restriction for the experiment on PARALLEL with the MUC5 domain is that the knowledge base is around 25K nodes, which requires four SNAP-2 processor boards with 16MB of memory per board. The experimental results are shown in Figure 7.9.

Figure 7.8: Input message example for PARALLEL with the MUC5 domain (a January 31, 1991 Kyodo News story describing Iwatani International Corp. forming a joint venture, Shenzhen Iwatani Gas Machinery Co., with the state-run Shenzhen Petroleum Co.).

Figure 7.9: The execution time of PARALLEL on the SNAP-2 prototype (parallel, serial, and total time versus knowledge management function calls).

Chapter 8
Conclusions

In this thesis we have developed a new parallel processing model for the marker propagation paradigm. Solutions to key computational issues for the parallel model have been analyzed, designed, and evaluated. In the following, we summarize the main results of the thesis and discuss directions for future research.

8.1 Major Findings

The SIMD parallel processing approach does not exploit the potential of the marker propagation model

The SIMD parallel marker propagation approach and its variations, such as SNAP-1, reduce the potential parallelism at the inter-task level, since the sequential program flow prohibits more than one task running at a time. The study also shows that the SIMD parallel marker propagation model requires more synchronization points. From the speedup study, the SIMD configuration cannot be scaled up because the process time is low, the knowledge base partition produces more communication overhead, and the approach has no capability of exploiting inter-task level parallelism.

Parallel Marker Propagation Model

The marker propagation networks model is developed based on practical parallel processing techniques. The programming model with the marker propagation networks suggests separating the data-parallel execution from the main sequential control-flow program. The data-parallel execution is then provided by a program-embedded knowledge base, where the programs are defined with a set of propagation rules which match the knowledge base structure and a set of propagation functions invoked during propagation. The main program interfaces with the knowledge base program through calls to propagation rules, which spawn threads of processing in the knowledge base. The threads then traverse the knowledge base, guided by the propagation patterns associated with the propagation rules; this is a process-flow approach.

During marker propagation, better processor utilization and lower synchronization overhead can be achieved by continuous propagation. In the Marker Propagation Networks model, the global operations are decomposed and combined with the marker propagation. Thus the propagation can continue until the result is produced. Better inter-task level parallelism can be achieved by initiating several threads according to different inputs. From the speedup study, the Marker Propagation Networks approach supports better scale-up due to its capability of handling both inter- and intra-task level parallelism.

The SNAP-2 Parallel Processor System Prototype

A small-scale parallel processor system for the marker propagation networks model was designed with eight 32-bit processing elements. The machine was constructed in its full configuration.
Though it is a research prototype implementation, a full-featured, commercial-quality system hardware and software development environment is provided as a stable and easy-to-use application development platform. Novel designs were completed for the tiered synchronization protocol, knowledge base management, and parallelism-oriented thread scheduling. The special hardware proved to be quite valuable, especially for synchronization and for separating the global network from the processor interconnection network. The special hardware provided by the SNAP-2 prototype is not available on existing machines.

Support for Massive Knowledge Bases for Practical AI Applications Can Be Achieved

Work at MCC [24] has shown that a knowledge base can grow to millions of semantic nodes. Building a machine to handle such a huge knowledge base was not economically feasible in the past, but it can be achieved now. A good knowledge base subsystem must provide a fast data retrieval mechanism and a scalable knowledge base. Our approach is to provide a good object structure with knowledge base management which maintains the information for fast search operations, and a matched memory buffering scheme for fast knowledge base object creation and deletion. With only eight processing elements, the SNAP-2 prototype system is capable of supporting a quarter million nodes of semantic network knowledge base. We expect to expand the machine up to 64 processors in the near future; the system can then support up to two million semantic nodes.

Trend of the Parallel Processor System

Experience with parallel processor system implementation and programming indicates that for parallel architectures to achieve widespread usage, it is important that they efficiently run a wide variety of applications without excessive programming difficulty. To maximize both high performance and wide applicability, we believe a parallel architecture should provide the following features:

• the system is scalable, with support from a few processors to thousands of processors,
• high-performance processors for the processing elements,
• large local memory,
• high-bandwidth communication between processors,
• a kernel for each processor node,
• performance monitoring tools, and
• good parallel program debugging tools.

8.2 Future Research

Optimizing Parallel Compiler

The compiler provided for the marker propagation networks application program is merely a preprocessor. It converts the propagation rules into a program, generates the remote process call pairs, and separates the application program into two pieces. No optimization takes place at compile time. Future research should address a parallel compiler which can optimize the user programs so that maximum performance can be achieved. The problems to be addressed include propagation path dependency analysis, which yields propagation path reduction; determining the main program control flow and dependencies for combining marker propagation threads for maximum parallelism; and maintaining program correctness.

Parallel Processor Kernel

The processing element kernel has been designed with support only for marker propagation networks applications. However, the underlying architecture of the SNAP-2 system is a generic message-passing architecture with special hardware support for broadcasting and synchronization. To make the SNAP-2 system a general-purpose machine, the processing element kernel should adopt a full-scale multiprocessor kernel such as MACH. The programming interface would then be extended to support the message-passing programming model.
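A general-purpose message-passing interface of the kind this extension implies might look like the following minimal sketch; the function names and semantics are illustrative assumptions, not an existing SNAP-2 or MACH API.

```c
#include <string.h>

/* A minimal one-slot mailbox illustrating a send/receive interface a
 * message-passing kernel could expose to processing elements. A real
 * kernel would route over the interconnect; this sketch delivers
 * within a single address space. */
#define NUM_PE  8
#define MAX_MSG 256

typedef struct {
    int  len;                 /* 0 means the mailbox is empty */
    char data[MAX_MSG];
} mailbox_t;

static mailbox_t boxes[NUM_PE];

int pe_send(int dest, const void *buf, int len)
{
    if (dest < 0 || dest >= NUM_PE || len <= 0 || len > MAX_MSG)
        return -1;
    memcpy(boxes[dest].data, buf, (size_t)len);
    boxes[dest].len = len;
    return 0;
}

int pe_recv(int self, void *buf, int maxlen)
{
    int n = boxes[self].len;
    if (n == 0 || n > maxlen)
        return -1;
    memcpy(buf, boxes[self].data, (size_t)n);
    boxes[self].len = 0;      /* consume the message */
    return n;
}
```

The point of such an interface is that marker propagation, broadcasting, and synchronization could all be layered on top of the same primitive send/receive pair.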
Knowledge Base Partitioning and Load Balancing

A proper knowledge base partitioning strategy can reduce the communication overhead and provide sufficient parallelism for marker propagation. The general idea is to reduce the interconnections between nodes that span processors. The research should evaluate criteria such as the topology of the knowledge structure, the connectivity of nodes, and the retrieval frequency along a link. As the partitioning provides the guideline for initial placement, dynamic load balancing requires following the same criteria to migrate processes and knowledge blocks among the processing elements. The research problem is to accumulate knowledge regarding the distributed processor load, and to build the protocol and criteria for process migration.

Marker Propagation Approach with a Shared Memory System

Until now, the marker propagation model has been implemented only on distributed memory systems. A shared memory system offers the potential for load balancing by providing the knowledge base in shared memory, while each processor keeps a local copy of the part of the knowledge base required for its processing. With the marker propagation networks approach, threads are created and assigned to an underloaded processor for processing. A shared memory implementation can therefore obtain the maximum speedup provided by inter-task parallelism. However, for intra-task parallelism, further study of knowledge base coherence considerations and process arbitration criteria has to be carried out.

Hardware Support for Parallel Processor System Debugging and Performance Monitoring

The debugging scheme of the SNAP-2 depends heavily on the broadcasting bus and synchronization network. This approach, of course, is slow in response during debugging. Most of all, the synchronization between processors is very loosely coupled, so we are not able to display the parallel activity correctly. State-of-the-art parallel processor debugging systems, such as JTAG scan-type hardware debuggers, provide parallel processor system debugging with a synchronized clock and with the aid of the CPU's built-in debugging support. This approach has the drawback of limiting debugging to processors designed with a built-in debugging interface, and it allows only one user to debug at a time. In general, this approach requires a separate system for the debugging host, and the communication bandwidth between the debugging host and the target system is low. To develop a marker propagation program, a better debugging tool must be capable of displaying the host source code screen, the processing element source code screens, and the activity on the knowledge base. Better yet, the execution speed should be controllable by the user, to show an animation of the marker propagation activity and to release the "brake" to allow the processors to run at full speed. This requires synchronizing all processors at a variable CPU clock and high-speed transfer between the processing elements and the host.

Future development of hardware support for debugging must include a synchronized CPU clock, hardware for CPU execution control, event monitors for breakpoints, and use of the communication channel between the host and the array for data transfer. A debugging network should be provided for the debugging hardware on the processing elements and connected to the host. Users should be able to communicate with the debugging network of their assigned processors at the same time. The bonus of hardware debugger support is detailed system performance monitoring.
For performance monitoring with the SNAP-2, extra code must be inserted into the application program for performance data handling. With this approach, the performance data resolution is set to the function level to reduce the execution overhead. The added code, which invokes function calls for execution monitoring, sometimes affects the parallel program execution pattern (e.g., global synchronization) and results in performance data errors. With hardware event monitoring, detailed events can be sampled, but additional storage is required for storing the sampled data.

Appendix A
SNAP-2 Prototype Processing Element Card

Appendix A contains the physical design and schematics for the SNAP-2 system. All figures are described from left to right. Figure A.1 shows the SNAP-2 static RAM interface. Figure A.2 shows the dynamic RAM interface, including the DRAM refresh circuits, dual-port access control, and host-to-processing-element communication data channel mapping control. Figure A.3 shows the inter-processor communication channel interface. Figure A.4 shows the TI TMS320C40 JTAG emulator interface and the processor board to communication backplane connection. Figure A.5 shows the global bus interface, interrupt connections, and some handshake signals.

Figure A.1: The SNAP-2 Processor Board: Static RAM Interface
Figure A.2: The SNAP-2 Processor Board: Dynamic RAM Interface and Global Bus
Figure A.3: The SNAP-2 Processor Board: Communication Ports
Figure A.4: The SNAP-2 Processor Board: Connectors and Emulator Interface
Figure A.5: The SNAP-2 Processor Board: Global Bus, Interrupts and Handshake Signals

Appendix B
SNAP-2 Hardware and Software Availability

The detailed documentation regarding the SNAP-2 prototype hardware design, software tools design, and SNAP-2 system user manuals is available in PostScript format. The PostScript files are available at the FTP site lares.seas.smu.edu with anonymous user login. At the same FTP site, the source code files for the software described in this thesis are also available. The files available at the FTP site are organized as follows:

• /SNAP-2/prototype/: This directory contains files including the SNAP-2 prototype hardware description document, the detailed system schematics, and the PAL equations for the PALs required for the SNAP-2 processor board.
• /SNAP-2/manual/: This directory contains the manual for programming a marker propagation network application and the manual for using the programming tools.
• /SNAP-2/software/: This directory contains the software source code and implementation documentation for the software tools described in Chapter 6.
• /SNAP-2/simulator/: This directory contains a SNAP-2 simulator package and the manual for using the simulator.

An account on the SNAP-2 prototype is also available to the public. Please email snap-2@snap.seas.smu.edu to request a new account.

Reference List

[1] A. V. Aho, R. Sethi and J. D. Ullman, Compilers: Principles, Techniques, and Tools, Addison-Wesley, 1988.
[2] E. Charniak, "A Neat Theory of Marker-Passing," Proceedings of the Conference of the American Association for Artificial Intelligence, 1986.
[3] M. Chung and D. Moldovan, "PARALLEL: Applying Parallel Processing to NLP," Technical Report PKP Lab 92-7, Dept. of EE-Systems, University of Southern California, Los Angeles, California, U.S.A., December 1992, pp. 1-21.
[4] S. Chung and D. Moldovan, "Modeling Semantic Networks on the Connection Machine," Technical Report PKPLab-90-2, Department of Electrical Engineering-Systems, University of Southern California, 1990.
[5] S. Chung, D. Moldovan and R. DeMara, "A Parallel Computational Model for Integrated Speech and Natural Language Understanding," Technical Report PKP Lab 92-2, Dept. of EE-Systems, University of Southern California, Los Angeles, California, U.S.A., 1992.
[6] A. Collins and E. Loftus, "A Spreading Activation Theory of Semantic Processing," Psychological Review, 1975.
[7] R. F. DeMara and D. I. Moldovan, "The SNAP-1 Parallel AI Prototype," IEEE Transactions on Parallel and Distributed Systems, to appear.
[8] M. Evett, J. Hendler and L. Spector, "PARKA: Parallel Knowledge Representation on the Connection Machine," Technical Report UMIACS-TR-90-22, University of Maryland, College Park, Maryland, U.S.A., 1990.
[9] S. E. Fahlman, NETL: A System for Representing and Using Real-World Knowledge, The MIT Press, Cambridge, MA, U.S.A., 1979.
[10] K. Hammond, T. Converse, and C. Martin, "Integrating Planning and Acting in a Case-Based Framework," Proceedings of the National Conference on Artificial Intelligence, 1990.
[11] P. J. Hatcher and M. J. Quinn, Data Parallel Programming on MIMD Computers, MIT Press, 1991.
[12] P. J. Hatcher, M. J. Quinn and B. K. Seevers, "Implementing a Data Parallel Language on a Tightly Coupled Multiprocessor," Proceedings of the Third Workshop on Programming Languages and Compilers for Parallel Computers, 1991.
[13] J. A. Hendler, Integrating Marker-Passing and Problem Solving: A Spreading Activation Approach to Improved Choice in Planning, Lawrence Erlbaum Associates, 1988.
[14] T. Higuchi, T. Furuya, H. Kusumoto, K. Handa, and A. Kokuba, "The Prototype of a Semantic Network Machine IXM," Proceedings of the 1989 International Conference on Parallel Processing, 1989, pp. 217-224.
[15] K. Hwang and S. Shang, "Wired-NOR Barrier Synchronization for Designing Large Shared-Memory Multiprocessors," Proceedings of the 1991 International Conference on Parallel Processing, 1991.
[16] W. D. Hillis, The Connection Machine, The MIT Press, Cambridge, MA, U.S.A., 1985.
[17] N. Ichiyoshi, T. Miyazaki and K. Taki, "A Distributed Implementation of Flat GHC on the Multi-PSI," Proceedings of the Fourth International Conference on Logic Programming, 1987.
[18] J. Kim and D. Moldovan, "Parallel Knowledge Classification on SNAP," Proceedings of the 1990 International Conference on Parallel Processing, 1990.
[19] H. Kitano, "Unification Algorithm for Massively Parallel Computer," Technical Report, Carnegie Mellon University, 1990.
[20] H. Kitano, "DM-Dialog: An Experimental Speech-to-Speech Dialog Translation System," IEEE Computer, 24(6), pp. 36-50, June 1991.
[21] H. Kitano, D. Moldovan and S. Cha, "High Performance Natural Language Understanding on the Semantic Network Array Processor," Proceedings of the International Joint Conference on Artificial Intelligence, Sydney, Australia, 1991.
[22] H. Kitano and T. Higuchi, "High Performance Memory-Based Translation on IXM2 Massively Parallel Associative Memory Processor," Proceedings of the National Conference on Artificial Intelligence, 1991.
[23] W. Lee, "A Parallel Marker-Passing Computer for Knowledge Processing," Technical Report PKPL 92-1.
[24] D. Lenat, R. Guha, K. Pittman, D. Pratt and M. Shepherd, "CYC: Toward Programs with Common Sense," Communications of the ACM, August 1990.
[25] E. C. Lin, R. F. DeMara, S. H. Kuo, and D. I. Moldovan, "Distributed Process Synchronization for Nondeterministic Execution," Technical Report PKP Lab 91-8, Dept. of EE-Systems, University of Southern California, Los Angeles, California, U.S.A., 1991.
[26] C. Lin and D. Moldovan, "SNAP Simulator Result," Technical Report PKP Lab 90-5, Dept. of EE-Systems, University of Southern California, Los Angeles, California, U.S.A., 1990.
[27] D. Moldovan, W. Lee and C. Lin, "Parallel Knowledge Processing on SNAP," Proceedings of the International Conference on Parallel Processing, 1990.
[28] D. Moldovan, S. Cha, M. Chung, K. Hendrickson, J. Kim, and S. Kowalski, "Description of SNAP System Used for MUC-4," Proceedings of the Fourth Message Understanding Conference, Morgan Kaufmann Publishers, San Mateo, CA, 1992.
[29] D. Moldovan, S. Cha, I. T. Um, R. DeMara, and J. T. Kim, "Direct Memory Access Translation on SNAP," Technical Report PKP Lab 90-9, Dept. of EE-Systems, University of Southern California, Los Angeles, CA, U.S.A., 1990, pp. 1-31.
[30] D. Moldovan, W. Lee, C. Lin, and M. Chung, "SNAP: Parallel Processing Applied to AI," IEEE Computer, May 1992, Vol. 25, No. 5, pp. 39-50.
[31] D. Moldovan, W. Lee and C. Lin, "Parallel Knowledge Processing on SNAP," IEEE Transactions on Knowledge and Data Engineering, to appear.
[32] D. Moldovan, S. Cha, M. Chung, K. Hendrickson, and J. Kim, "Description of SNAP System Used for MUC-5," Proceedings of the Fifth Message Understanding Conference, Morgan Kaufmann Publishers, San Mateo, CA, 1993.
[33] P. Norvig, "Unified Theory of Inference for Text Understanding," PhD Dissertation, University of California, Berkeley, California, U.S.A., 1987.
[34] M. T. O'Keefe and H. G. Dietz, "Hardware Barrier Synchronization: Dynamic Barrier MIMD (DBM)," Proceedings of the 1990 International Conference on Parallel Processing, 1990.
[35] M. T. O'Keefe and H. G. Dietz, "Hardware Barrier Synchronization: Static Barrier MIMD (SBM)," Proceedings of the 1990 International Conference on Parallel Processing, 1990.
[36] M. R. Quillian, "Semantic Memory," Ph.D. Dissertation, Carnegie Institute of Technology (Carnegie Mellon University), Pittsburgh, Pennsylvania, U.S.A., 1966.
[37] C. Riesbeck and C. Martin, "Direct Memory Access Parsing," Yale University Report No. 354, 1985.
[38] K. Rokusawa et al., "An Efficient Termination Detection and Abortion Algorithm for Distributed Processing Systems," International Conference on Parallel Processing, Vol. 1, 1988.
[39] H. Schneider, "Simulation of parallel marker propagation systems," Proceedings of the 1987 Summer Computer Simulation Conference, Nineteenth Annual Edition, July 27-30, 1987.
[40] R. C. Schank, Conceptual Information Processing, American Elsevier Publishing Co., New York, 1975.
[41] L. Spector, J. A. Hendler and M. P. Evett, "Knowledge Representation in PARKA," Dept. of Computer Science, University of Maryland, College Park, Maryland, U.S.A., UMIACS-TR-90-23, February 1990.
[42] S. Taylor, Parallel Logic Programming Techniques, Prentice Hall, 1989.
[43] D. Waltz, "Massively Parallel AI," Proceedings of AAAI-90, 1990.
[44] D. Wu, "A Probabilistic Approach to Marker Propagation," Proceedings of the Eleventh International Joint Conference on Artificial Intelligence, Vol. 1, August 20-25, 1989.
[45] Y. Yu and R. F. Simmons, "Truly Parallel Understanding of Text," Proceedings of AAAI, 1990.