ARCHITECTURAL SUPPORT FOR NETWORK-BASED COMPUTING

by

Chung-Ta Cheng

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER ENGINEERING)

May 2000

Copyright 2000 Chung-Ta Cheng

UMI Microform Number: 3017993

This dissertation, written by Chung-Ta Cheng under the direction of ..... Dissertation Committee, and approved by all its members, has been presented to and accepted by The Graduate School, in partial fulfillment of requirements for the degree of DOCTOR OF PHILOSOPHY. Dean of Graduate Studies, February 24, 2000.

Dedication

To my wife Pei-Ying.

Acknowledgments

I am indebted to my advisor, Professor Jean-Luc Gaudiot, for his inspiration, guidance and support. It was a privilege to have been one of his students. I would also like to express my gratitude to Professors Sandeep Gupta and Douglas Ierardi for serving on my dissertation committee. Their inquisitive questions have stimulated my research.
I thank Professors Timothy Pinkston and Massoud Pedram for being on my Ph.D. guidance committee. I would like to thank my colleague and very good friend Wen-Yen Lin for helping me a great deal over the years. I would like to thank my former group members Dr. Hung-Yu Tseng and Dr. Yung-Syau Chen for giving me advice and encouragement. I also acknowledge group members Chulho Shin, Seongwon Lee, Jungyup Kang, Jongwook Woo, James Burns, Steve Jenks and Dongsoo Kang. I must also thank Maxime Delanis for his nice supporting work. I also thank my very good friend, Yi-Yen Wu, for his valuable advice and warm friendship. My sincere gratitude goes to my parents, Shan-Gen Cheng and Yu-Lien Lin, my brother Meng-Jia and my sister Ya-Ling for their love and support. I must thank my wife Pei-Ying for her understanding and love, for which I can never thank her enough. Without her continuous support, I would never have been able to finish the program. Finally, I would like to thank my unborn child, who brings me strength and luck during this last stage of my study.

Contents

Dedication
Acknowledgments
List of Tables
List of Figures
Abstract
1 Introduction
2 Background
2.1 Software Issues
2.1.1 User-Level NI Accesses
2.1.2 Lightweight Communication Protocols
2.1.3 Support of Collective Communication
2.2 Hardware Issues
2.2.1 Network Interface Connection Point
2.2.2 Intelligence of Network Interface
2.2.3 Communication Buffers
2.2.4 Notification of Message Arrival
2.3 Related Work
2.3.1 Fast Messaging Software
2.3.2 Network Interfaces
2.3.3 I/O Subsystems
3 Communication Architecture
3.1 Network Interface
3.2 Communication Protocols
3.3 Communication Primitives and Operations
3.3.1 Buffer Allocation
3.3.2 Send
3.3.3 Receive
3.4 Blocking and Non-blocking Modes
3.5 Collective Communication
3.6 Application Programming Interface
3.7 Quantitative Analysis
4 Methodology
4.1 Design Validation
4.2 Simulator
4.3 Tasks
4.4 Benchmarks and Performance Metrics
5 Performance Evaluation
5.1 Simulation Environment
5.2 Point-to-Point Communication
5.3 Collective Communication
5.4 Macro-benchmarks
5.4.1 Benchmark Description
5.4.2 Strategy
5.4.3 Multigrid
5.4.4 Integer Sort
6 Conclusion
6.1 Summary
6.2 Future Work
Appendix A: Implementation of Our Communication Architecture in Paint
A.1 Data Structure
A.2 Operation

List of Tables

3.1 Cost of sending a unicast message
3.2 Cost of sending a multicast message
3.3 Cost of receiving a message
4.1 Benchmarks
5.1 Architectural Parameters
5.2 Timing Expressions for Collective Communications on SP2
5.3 Timing Expressions for Collective Communications on T3D
5.4 Timing Expressions for Collective Communications on Paragon
5.5 Overhead of MPI Layer in Microseconds
5.6 Overhead of MPI Layer in Percentages
A.1 Scheduling of NI task
List of Figures

2.1 Message Passing System Architecture
2.2 Communication Path for PVM
2.3 NI on the I/O Bus
2.4 NI on the Memory Bus
2.5 NI on the Cache Bus
2.6 NI on the Host Processor
2.7 Kernel-level Comm, I/O Bus-based NI, Three Passes
2.8 User-level Comm, I/O Bus-based NI, Two Passes
2.9 User-level Comm, Memory Bus-based NI, Two Passes
2.10 User-level Comm, Memory Bus-based NI, One Pass
2.11 A Switched I/O Subsystem
3.1 Communication Architecture
3.2 Organization of Network Interface
3.3 One-Way and Three-Way Network Transaction Protocols
3.4 Address Mapping for NI Memory
3.5 Structure of Message Buffer
3.6 Logical View of NI
3.7 Send Operation
3.8 Receive Operation - Late Arrival
3.9 Receive Operation - Early Arrival
3.10 Synchronous and Asynchronous Blocking Send
3.11 Barrier Operation
3.12 Reduce Operation
3.13 Broadcast Operation
4.1 Program-Driven Simulation
4.2 Event Generation
4.3 Time Wheel Mechanism in Paint
4.4 Simulation Modules
5.1 Receive Latency (Opt-Receive)
5.2 Receive Latency (Equ-Receive)
5.3 Round-Trip Latency (Opt-Receive)
5.4 Round-Trip Latency (Equ-Receive)
5.5 One-Way Latency
5.6 Deliverable Bandwidth
5.7 Barrier Latency
5.8 Broadcast Latency
5.9 Gather Latency
5.10 Scatter Latency
5.11 Reduce Latency
5.12 MPI Overhead in Microseconds
5.13 MPI Overhead in Percentages
5.14 MG Speedup
5.15 MG Execution Time
5.16 MG Communication Ratio
5.17 MG Message Count
5.18 MG Average Message Size
5.19 MG Distribution of Message Size
5.20 Impact of Overhead on Comm Time (MG)
5.21 Impact of Overhead on Run Time (MG)
5.22 Impact of Uncacheable NI on Comm Time (MG)
5.23 Impact of Uncacheable NI on Run Time (MG)
5.24 Impact of Memory Speed on Comm Time (MG)
5.25 Impact of Memory Speed on Run Time (MG)
5.26 IS Speedup
5.27 IS Execution Time
5.28 IS Communication Ratio
5.29 IS Message Count
5.30 IS Average Message Size
5.31 IS Distribution of Message Size
5.32 Impact of Overhead on Comm Time (IS)
5.33 Impact of Overhead on Run Time (IS)
5.34 Impact of Uncacheable NI on Comm Time (IS)
5.35 Impact of Uncacheable NI on Run Time (IS)
5.36 Impact of Memory Speed on Comm Time (IS)
5.37 Impact of Memory Speed on Run Time (IS)

Abstract

For networks of workstations to become a viable alternative approach to supercomputing, we believe some amount of architectural support is necessary. Based upon a careful analysis of the various bottlenecks and design inefficiencies, we propose an intelligent network interface based communication architecture to improve communication performance. The network interface is connected to the memory bus and serves as the home of the communication buffers. On top of this hardware platform, an efficient messaging layer is designed. The architecture is validated through simulation, and three groups of benchmarks, covering point-to-point communication, collective communication and the NAS benchmarks, are employed. The simulation results show that the point-to-point communication latency can be significantly reduced and that collective communication can be supported more efficiently. Further, the macro-benchmarking demonstrates that reducing communication latency and caching communication buffers can effectively improve the overall run time of parallel applications.

Chapter 1 Introduction

Although technology advances in computers have, in recent years, satisfied some of the thirst for faster machines, there is still much demand for even more computing power. Several directions have been pursued to increase computing power; examples are vector processors, massively parallel processing (MPP) and networks of workstations (NOW).

Vector processors[20, 15, 31, 32, 52, 10, 70] use vector registers, superpipelined functional units and highly interleaved memory modules to achieve high-performance scientific computing. These machines require fully customized processors and system architectures and thus incur high development cost and a short life cycle. MPP[16, 42, 18, 2, 55, 35, 34], on the other hand, utilizes commodity parts as building blocks to leverage the state of the art at a much lower cost. Today's most powerful computers, such as the Intel ASCI Red[47] and the SGI/Cray T3E[60, 61], are built in this way. MPP has demonstrated a more cost-effective approach to supercomputing than vector processors.

The development of MPP machines reveals a trend: more and more commodity parts are used as the building blocks, i.e., from processors to entire processor boards. For example, the IBM SP2[3, 63] uses the processor board of the RS/6000 model 590 workstation as its building block, and Intel ASCI Red is built from mainboards housing two compute nodes, each consisting of two Pentium Pro processors. For current MPPs, the remaining customized part of the system architecture is the interconnection network.

NOW takes one step further than MPP: general-purpose PCs/workstations connected by a general-purpose network function as a large-scale distributed system and provide a transparent interface to the user.
NOW offers a more economical way to achieve supercomputing than MPP. Besides, NOW provides a mature software development environment and is available for general-purpose use. Driven by a widely growing user base, workstation performance has consistently improved at a fast pace. By leveraging the state of the art, the aggregate computing power of a network of workstations can be significant.

However, the traditional design of computers has not considered the communication requirement as its first priority: communication between workstations incurs an extremely large software/hardware overhead. Without any optimization, the end-to-end communication latency is on the order of milliseconds[44, 53]. Compared to Massively Parallel Processors, where the communication latency can be as low as a few microseconds[21], network-based computing is therefore only suitable for coarse-grain applications.

Although technology advances have kept processor speed increasing at an incredible pace, communication latency has not improved at the same rate. Consequently, the communication cost, in terms of processor cycles, may in fact increase with faster processors. On the other hand, computers nowadays are not only used for standalone computation but actually communicate frequently with each other to perform more useful jobs. The increasing gap between computation and communication performance and the increasing ratio of communication to computation make improving communication performance an essential and crucial task.

Recent high-speed networks, such as ATM and Myrinet[9], dramatically increase the network bandwidth, but the per-message software/hardware overhead incurred at the host communication interface is not significantly reduced. Indeed, a high-bandwidth network benefits long messages only, while studies have shown that most messages on the network are small. This is also true for message-passing parallel computing, in which communication messages, mostly for synchronization and exchange of computation results, are usually small. Therefore, it is crucial to identify and minimize the communication overhead to make network-based computing applicable to broader applications, including finer-grain ones.

Although quite a few research projects have focused on improving the communication performance of NOWs, most of them are software-based approaches. Our argument is that the performance gain from pure software approaches is limited by the underlying hardware. If the underlying hardware does not efficiently support the software functions, the performance will be severely limited. Therefore, to improve communication performance to the maximum extent possible, a redesign of the underlying hardware, mainly the network interface, must be considered. The low-level software messaging layer can then take advantage of the efficient hardware to achieve better communication performance.

In this dissertation, we propose a communication architecture comprising an intelligent network interface, a low-level messaging layer and a set of application programming interfaces. This architecture is evaluated through program-driven simulation, and the measured performance numbers are compared with others. The remainder of this dissertation is organized as follows.
In chapter 2, we analyze communication bottlenecks and possible solutions for NOW and also discuss and compare related research work. Our proposed communication architecture is described in chapter 3. Chapter 4 presents our methodology. Performance evaluation is presented in chapter 5 and, finally, conclusions and future research areas are given in chapter 6.

Chapter 2 Background

Although both the shared-memory and the message passing programming models can be supported by networks of workstations, the shared-memory programming model is usually built on top of a message passing mechanism because of the inherently loosely coupled architecture, and is thus not beneficial in terms of performance. The architecture of a message passing system is shown in figure 2.1.

Figure 2.1: Message Passing System Architecture

The synchronization of message-passing parallel programs is constructed from application programming interfaces (API), which are in turn built on top of a few communication primitives. These primitives are implemented in two areas: one is message handling inside a node and the other is the network transaction protocol. The underlying hardware provides support for implementing these operations.

For a message passing machine designed from scratch, the architecture can be seamlessly matched to the needed functionality at each layer to achieve maximum performance. However, the architecture of workstations was not originally designed for such a purpose, and thus the functional mapping between a message passing system and a network of workstations raises a performance issue.

Historically, computation performance has been considered the first priority for workstations, while comparatively less attention has been paid to improving communication performance. Communication performance is determined by the communication architecture, and the currently unsatisfactory communication performance is due to the following software and hardware issues.

2.1 Software Issues

In communication between off-the-shelf workstations, the major software overheads come from the following facts.

• I/O operations are performed by the kernel. System calls are used to send or receive messages. This introduces overhead for context switches between user mode and kernel mode and for memory copies of message data between user space and kernel space.

• Heavyweight communication protocols such as TCP/IP[13] are used. TCP/IP is designed for internetworking and assumes the operating environment is vulnerable. Therefore, some functionality in the protocol stack is designed to ensure data integrity and to enforce flow control and congestion control. Besides, the multiple-layer design incurs overhead, as the same message data needs to be processed in multiple passes. Buffer management also contributes a significant part of the communication overhead.

• Collective communication is usually emulated by point-to-point communication.
Since collective communication is often employed in parallel computing, this point-to-point emulation makes collective communication performance less scalable.

For example, to send out a message, the user process needs to invoke a system call, which involves a context switch and a memory copy of the message data from user space to kernel space, since the communication subsystem is managed by the kernel only. After the kernel finishes sending the message, another context switch occurs to return to user mode and the user process continues. For a receive operation, the user process also invokes a system call, which involves a context switch to kernel mode. When the message arrives, it is initially received in kernel space. The kernel then performs a memory copy from kernel space to user space and notifies the user process of message arrival.

Parallel Virtual Machine (PVM)[36, 66], though providing a convenient tool for distributed computing on a network of workstations, suffers from inefficient communication performance. PVM is built on top of the socket layer, which is implemented on the TCP/IP protocol suite. In PVM, a PVM daemon runs on each configured node and a TCP connection is established between two PVM daemons for inter-node communication. As shown in figure 2.2, process communication using PVM needs to go through the TCP/IP protocol stack and also encounters multiple context switches between user and kernel space. In a measurement of PVM performance done in [44], the one-way trip time for sending a 4-byte message through a 100 Mbps ATM switch between two workstations is 2766 microseconds.

Figure 2.2: Communication Path for PVM

These software overheads can be reduced by several approaches, including user-level NI accesses, lightweight communication protocols and better support of collective communication.

2.1.1 User-Level NI Accesses

Since kernel intervention introduces such a significant amount of communication overhead, it is essential to remove it from the critical communication path. User processes should be allowed to access NI control registers and communication buffers directly. Obstacles to directly accessing the NI include providing protection and fair resource sharing between different users/processes, which are originally enforced by the kernel. The usual way to provide user-level protection is through the virtual memory system, where access control information can be stored in the page tables. In general, system calls are used to set up the NI access and protection information initially. A fair resource sharing policy can also be enforced inside these system calls.

2.1.2 Lightweight Communication Protocols

The preferred environment for network-based parallel computing is a local area network or system area network, which is usually administered and controlled by a single entity. Since these networks are more reliable than the Internet, heavyweight communication protocols are not required. In addition, when user-level NI accesses are implemented, much of the communication functionality will also be moved to the user level. The communication protocol can thus be further streamlined and made more efficient.
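To make the contrast with the kernel-mediated path concrete, the following C sketch illustrates the user-level access style described in section 2.1.1: the NI's buffer memory and a doorbell register are mapped into the process once, through a privileged setup step, after which a send is just a memory copy into the mapped region and a store to the doorbell. The device path, register offsets and doorbell encoding are illustrative assumptions made for this sketch, not the interface proposed in this dissertation (which is introduced in chapter 3).

/* Illustrative sketch only: the device path, register offsets and doorbell
 * format are assumptions, not the dissertation's actual interface. */
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define NI_MAP_SIZE   (64 * 1024)   /* assumed size of the mapped NI region     */
#define NI_DOORBELL   0x0           /* assumed offset of the doorbell register  */
#define NI_BUF_OFFSET 0x1000        /* assumed offset of the send buffer        */

static volatile uint32_t *ni_doorbell;  /* memory-mapped NI control register */
static uint8_t *ni_sendbuf;             /* memory-mapped NI send buffer      */

/* One-time, kernel-mediated setup: protection and the address mapping are
 * established here, so that later sends need no system call at all. */
int ni_open(void)
{
    int fd = open("/dev/ni0", O_RDWR);              /* hypothetical NI device */
    if (fd < 0)
        return -1;
    uint8_t *base = mmap(NULL, NI_MAP_SIZE, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
    close(fd);                                      /* mapping remains valid  */
    if (base == MAP_FAILED)
        return -1;
    ni_doorbell = (volatile uint32_t *)(base + NI_DOORBELL);
    ni_sendbuf  = base + NI_BUF_OFFSET;
    return 0;
}

/* User-level send: copy the payload into NI-accessible memory and ring the
 * doorbell; no context switch or kernel copy is on the critical path. */
void ni_send(int dest_node, const void *msg, uint32_t len)
{
    memcpy(ni_sendbuf, msg, len);
    *ni_doorbell = ((uint32_t)dest_node << 16) | len;   /* assumed encoding */
}

In such a scheme the setup call is also where access rights and fair-sharing quotas would be recorded, so that the page-table mapping, rather than per-message kernel checks, enforces protection, which is precisely the role the virtual memory system plays in the approach described above.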
2.1.3 Support of Collective Communication

Parallel computing often involves synchronization and communication among a set of processes, which is done through collective communication, where a group of processes join the same operation. However, collective communication is usually implemented as multiple point-to-point communications at the software layer. This approach incurs a large amount of communication latency and many host processing cycles. Better support of collective communication is needed, which can reduce the communication overhead and free the host processor for computational tasks.

2.2 Hardware Issues

At the same time, the major hardware inefficiencies for communication performance are:

• The network interface is usually tied to the I/O bus, which is slow and far away from the CPU.

• The network interface is usually a passive device which relies on the host processor to handle message processing and delivery.

• The communication buffers are usually located in the host main memory instead of the network interface.

• The notification cost of message arrival is high.

These inefficiencies can be partly overcome by changing the network interface connection point, adding "intelligence" to the network interface, allocating communication buffers in the network interface and providing a flexible and inexpensive mechanism for message arrival notification.

2.2.1 Network Interface Connection Point

I/O Bus In off-the-shelf workstations, the network interface conventionally resides on the I/O bus, as shown in figure 2.3, for the following reasons:

• I/O operations are not considered as important as memory accesses.

• I/O bus specifications are usually open standards adopted by many different machines.

• The same I/O bus specification is usually employed on the same machine model line for several generations. Putting the network adaptor on the I/O bus increases the life cycle of the network adaptor.

Figure 2.3: NI on the I/O Bus

However, this design decision significantly limits communication performance. As the demand for high-performance communication is increasing and I/O operations are sometimes the first priority for some applications, alternative NI connection points need to be considered and evaluated. The alternatives include the memory bus, the cache bus and embedding the NI inside the host processor. Intuitively, it seems that the closer to the CPU the NI is connected, the better the communication performance. However, several factors interact in a complex way, so the conclusion may not be so simple to draw.

Memory Bus The memory bus is one step closer to the CPU than the I/O bus. With a wider data path and a faster clock rate, the memory bus bandwidth is much higher than the I/O bus bandwidth. Further, the memory bus can support cache coherence protocols, so that coherence of I/O data is automatically enforced. Another advantage is that I/O operations can be implemented in a fashion similar to main memory operations, which facilitates the implementation of user-level I/O functions. In addition, for the increasingly popular Symmetric Multiprocessors (SMP), a memory bus-based NI is a better choice because the memory bus is shared by all processors and the memory bus-based NI provides a consistent view of memory to all processors.

Figure 2.4: NI on the Memory Bus

Although memory bus specifications are not usually open standards, the same memory bus is usually used for several machine generations. In terms of development cost and life cycle, a memory bus-based network adaptor is still an acceptable solution. Several projects adopt this approach, including Shrimp[7, 22], TMC CM-5[41], the Coherent Network Interface[50] and WARPmemory[30, 62].

Cache Bus Connecting the NI to the off-chip cache bus is another possibility. The NI and the CPU can communicate with each other directly and share the same cache. This design requires the NI to have performance comparable to the processor and to support cache coherence protocols if the on-chip cache is employed. The advantage of this design is that the communication latency may be decreased further. However, since the cache interface of a processor is usually proprietary and depends highly on the processor design, this option may be expensive due to its short life cycle. In addition, because the cache is shared, the I/O data may pollute the cache lines and thus destroy the data locality of the computing processor. Examples of this design include MIT Alewife[1] and StarT-NG[12].

Figure 2.5: NI on the Cache Bus

Host Processor The most aggressive design option is to incorporate the NI into the same chip as the processor. This would minimize the communication latency and thus sounds attractive in terms of performance. However, a fully customized chip is required to realize the functionality of an integrated processor and network interface pair. Potential problems for this design include the high development cost and the flexibility/capability of the communication mechanisms. For a fully functional communication interface with sufficient buffering for I/O data, a significant chip area needs to be allocated for communication, which may leave insufficient space for the processor. Examples of this design include the MIT J-Machine[19] and the Stanford M-Machine[29].

Figure 2.6: NI on the Host Processor

2.2.2 Intelligence of Network Interface

The intelligence of the network interface also affects communication performance. Different degrees of intelligence can be included to enhance its capability.

Dumb NI For a purely passive network interface, send and receive operations are initiated and performed by the host processor. The host processor needs to switch between computation and communication modes. This design may thus deprive the host processor of too many CPU cycles.

DMA A more intelligent design includes DMA engines in the network interface, so that the host processor only needs to initiate send or receive operations. The DMA engine can perform the send and receive operations while the host processor is doing computation. However, DMA operations are only advantageous for long messages.
For short messages, the high initiation overhead of DMA operations may outweigh the advantage of off-loading the host processor.

NI Processor The most intelligent design incorporates a processor into the network interface. The NI processor can independently handle send and receive operations and directly deliver a received message to its destination. Further, the NI processor can also handle complex operations such as collective communication without host processor intervention. For such a design, the performance of the NI processor is critical: if a slow NI processor is employed, the communication performance may be worse than using the much faster host processor to handle communication operations and protocol processing.

2.2.3 Communication Buffers

Traditionally, the communication buffers are located in the host main memory instead of the NI because the amount of memory on the NI is too small to efficiently support the communication needs of multiple user processes. However, this design incurs the communication overhead of moving data between the host main memory and the NI. If the NI is connected to the I/O bus, both memory bus and I/O bus transactions are involved in communication.
However, depending on the location of buffers, one or two passes of operations are necessary for sending out a message. As shown in figure 2.9, since the communication buffers are in the host main memory, a second pass is required to move communication data to the NI. However, this is still better than the I/O bus-based NI with two-pass communication as the second pass involves only memory bus transaction. If the communication buffers are located in a memory bus-based NI, only one pass is needed for user-level communication. This is shown in figure 2.10, where only one memory bus transaction is involved. Memory NI I/O Bus Bus I I I I Slow I/O devices Memory NI I (D I emoiyBus I/O Bus | j __ | Slow I/O devices Figure 2.9: User-level Comm. Memory Figure 2.10: User-level Comm. Memory Bus-based NI. Two Passes Bus-based NI. One Pass 2.2.4 Notification of M essage Arrival When a message is received by a network interface, the host processor needs to be notified to process the message data. It is important to minimize the notification overhead of message arrival in order to reduce the communication latency. One notification scheme is interrupt. The network interface sends an interrupt signal to the host processor and the interrupt service routine handles the message 14 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. properly. This approach is simple and prompt but may incur too much overhead since interrupts are expensive. This approach is used in Active Messages[25]. Another approach is polling. The host processor periodically polls some NI status registers to determine if there is a message arrival. The decision of the polling period is an issue. If the period is too short, most of the time the processor finds no message and wastes a lot of CPU cycles. If the period is too long, messages may accumulate in the receive buffer and the buffer will overflow eventually. This approach is employed in Illinois Fast Messages(56]. One alternative is to use a hybrid method, in which interrupt and polling are both used. For example, in TMC CM-5[69, 41], an interrupt is generated upon a message arrival but inside the interrupt service routine polling is used to avoid excessive interrupt overhead. EARTH-MANNA[45] uses a polling-based approach but an interrupt may be generated if the waiting time for an arrived message exceeds some preset limit before the next scheduled polling occurs. Another way is to classify messages into different priority classes. Messages with higher priority will cause an interrupt upon arrival but lower priority ones are left for polling. 2.3 Related Work Many research projects have been focusing on improving the communication perfor mance of networks of workstations. They adopt either software approaches, mainly fast messaging layers, or hardware approaches in designing efficient network inter faces. 2.3.1 Fast M essaging Softwares Active Messages[25, 26, 27, 17, 58, 46] and Fast Messages[56, 39] integrate com munication and computation by including a function handler in the message. The function is executed at the receiver upon message arrival. PM[68] is a communication library designed to support multi-user parallel processing environments. B IP[57] is an optimized and streamlined messaging protocol with a limited ca pability. 
The design of a low-level software messaging layer greatly depends on the underlying NI since the messaging layer specifies the interaction between host 15 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. and NI. These messaging layers have been designed to work with the 10 bus-based Myrinet[9] NI. U-Net[24] removes the kernel from the communication critical path and moves all buffer management and communication protocol processing to the user-level. U-Net has been implemented with Fast Ethernet and ATM. These messaging layers do not need custom hardware or even a modification of the kernel because one of their goals is to leverage off-the-shelf communication hardware. However, without hardware support, the performance of the messaging software is limited by the underlying hardware capability, as discussed in section 2.2 of chapter 2. The goal of the Beowulf project[5, 59, 64] is to dem onstrate the feasibility of achieving parallel supercomputing with commodity components. The hardware sys tems include Personal Computers. Fast Ethernet adaptors and Fast Ethernet switch ing networks. The software systems contain Linux OS, MPICH, PVM or BSP for message passing. The round-trip end-to-end communication latency is in the range of 55 to 500 microseconds, depending on the employed protocol. This communication latency may not be good enough for fine-grain applications. Basically, the commu nication latency for Beowulf can not be reduced significantly due to the limitation of the underlying hardware/software commodity components. 2.3.2 Network Interfaces The Shrimp[7, 22, 23, 28, 8] project develops Virtual Memory Mapped Communi cation for NOW communication. In VMMC, the virtual memory space of a sending process can be mapped to that of a receiving process. A send operation is divided into two steps: first, a virtual-physical address mapping is created on both the sender and the receiver sides. The mapping information is stored in the network interface. Then, to send out a message, the sender just writes to the mapped-out physical page and the sender NI will capture the data through a bus-snooping mechanism and deliver it to the destination node. The receiving NI uses the stored mapping information to determine the final location of the message. VMMC has been imple mented on two platforms: one is a memory bus-based NI with a Paragon backplane and the other is a PCI bus-based NI with Myrinet. 16 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. DEC MEMORY CHANNEL[40] uses a similar memory mapping scheme as VMMC to automatically deliver data to a remote node. However, the DEC MEMORY CHANNEL interconnect is a proprietary design and the DEC MEMORY CHAN NEL cluster supports a global address space into which each adaptor can map regions of local virtual address space. Besides, the NI connection point for DEC MEMORY CHANNEL is the PCI bus instead of the memory bus. The VMMC approach does not efficiently support collective communication as the the address mapping is implemented in one-to-one fashion. For example, to broadcast a message to N destinations, it is required to write to N different mapped- out pages. The cost is N times of a unicast operation. WARP memory [30, 62] is a device that makes shared memory physically avail able to a number of workstations. It contains a processor, a multiport memory module, fiber optic links and memory bus-based plug-in cards for workstations. 
Each workstation addresses the multiport memory as a part of its own address space. A directory-based write-invalidate protocol is used to ensure cache coherence between workstations. The WARPmemory is intended to efficiently support the shared-memory programming model but scalability is an issue. Coherent Network Interface[50, 51] is a network interface supporting cache co herence protocols. Traditionally, the network interface status information is marked as uncacheable to avoid any data consistency problem. Coherent Network Interfaces can communicate with the processor through cacheable, coherent memory opera tions. The access latency to network devices can then be reduced to the cache hit time at best. Different NI connection points, including the I/O bus and the memory bus, and different locations of communication buffers are evaluated through simula tion for performance comparison. CNI does not include a NI processor so that the send and receive operations require the intervention of the host processor. Similarly, collective communication is not efficiently supported here. HP Hamlyn[ll] uses sender-based memory management to eliminate receiver buffer overruns by determining a message’s final destination in the receiving host’s memory before sending it. An I/O bus-based prototyped Hamlyn interface has been built to work with Myrinet. As the communication buffers are constrained in the host main memory and the NI sites on the I/O bus, the performance is limited as we discussed in the previous chapter. 17 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Virtual Interface Architecture[14] is set to become the industry standard for a high performance network interface architecture. VIA incorporates features from several research projects, including U-Net, and supports user-level NI accesses. In VIA, each user process is provided with a protected, directly accessible interface to the network hardware - a Virtual Interface. Each VI represents a communication endpoint. A network adaptor performs the endpoint virtualization directly and subsumes the tasks of multiplexing, de-multiplexing, and data transfer scheduling. Although VIA does not specify the physical design of network interface and the NI connection point, the communication buffers in VIA are most likely located within the host main memory. One disadvantage of VIA is the long small message latency. To send out a message, three memory bus and I/O bus transactions are required for doorbell notification, descriptor read and data read respectively. Although memory-bus based network interfaces exist in MPPs. such as TMC CM-5[69, 41], there are two major differences between the design of MPP NI and workstation NI. First, the MPP NI usually does not support multi-programming and a number of nodes are exclusively allocated for executing a designating program. Secondly, there is only one operating system in M PP while a network of workstations has one operating system on each node. 2.3.3 I/O Subsystems The network interface is an essential part of the I/O subsystem in computers nowa days. Due to the increasing demands of I/O performance, the computer industry also finds the current I/O axchitecture has reached its limit and is proposing new I/O subsystems for better performance. PCI-X, for example, is an enhancement of the popular PCI bus. PCI is operated at 33 MHz with the bus width of 32 or 64 bits. The PCI-X will be operated at 133 MHz with 64-bit bus width. 
The I/O peak bandwidth can thus be increased to 1066 MBytes/sec using PCI-X. However, the PCI-X is still a bus architecture which is shared by all I/O devices no matter they are fast or slow. Furthermore, the communication latency incurred due to the software overhead remains the same. There are more aggressive proposals for I/O subsystems, including Next Gener ation I/O (NGIO)[54] and Future I/O (FIO)[33]. Both of them are backed up by a 18 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Cache Cache Cache Memory Bus H C A SW! SW2 Figure 2.11: A Switched I/O Subsystem group of computer or networking companies. NGIO and FIO target the server I/O architecture which requires high I/O performance. Instead of the traditional bus ar chitecture, the NGIO and FIO use a channel-oriented, switched fabric architecture. The switched I/O subsystem directly attach to the host main memory controller to facilitate the data moving between host memory and I/O devices. Each link of the switch can connect to an I/O device or another switch. The switched architecture provides a dedicated communication channel for each I/O device and better scala bility for system expansion. Figure 2.11 shows the architecture of such a switched I/O subsystem, e.g. NGIO. The Host Channel Adapter(HCA) is an interface into and out of the host memory controller. Two switches are interconnected to allow more device connections. Although the proposed I/O architectures may significantly improve the I/O band width, the communication latency still depends on how the messaging layer is de signed. If the communication buffers are still in the host main memory, multiple transactions axe still required for sending a network message. Recently, the NGIO and FIO group were merged to provide an unified I/O subsystem which would adopt technologies and concepts from both NGIO and FIO efforts. It is not clear how the final architecture will be at this moment. 19 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Chapter 3 Communication Architecture Based on the above analysis of communication performance issues and observation from the other projects, the optimal NI location and the required architectural fea tures have been identified. Therefore, we propose here a communication architecture, which consists of several layers, as shown in figure 3.1, and each layer brings specific new solutions to the problems. The scope of our communication architecture includes both hardware and soft ware layers. On the hardware side, a network interface card is designed to support inter-node communication. There are two types of operations to be performed by the network interface: interaction with a remote node and interaction with the local host. The interaction with a remote note is specified by the end-to-end communication protocols. The interaction with the local host is described through communication primitives, the basic interface between the processor and the network interface. To provide a more user-friendly programming environment, a higher level programming interface is required. Besides, a library consists of frequently used functions is needed to relieve the programmer’s burden. As discussed in chapter 2, communication performance is determined by the combined effect of hardware and software characteristics. 
By carefully designing the hardware mechanisms to seamlessly support the desired software functions, we can build a high-performance communication architecture. In the following, each layer of the communication architecture will be described in detail. 20 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Applications APIs Library Communication Primitives Communication Protocols Network Interface Figure 3.1: Communication Architecture 3.1 Network Interface The core of our communication architecture is a memory bus-based intelligent NT, as shown in figure 3.2. As discussed in chapter 2, the memory bus is a better NI connection point since it can provide better communication performance while re taining the inter-operability and expandability. The key design issue is to make the NI efficiently support user-level accesses and collective communication. To reduce the overhead of unnecessary memory copies and bus transactions, the NI provides a significant amount of memory, directly accessible by user processes, for communi cation buffers. As discussed previously, the Nl-holding communication buffers can reduce the communication procedure to an one-pass operation. The NI processor can offload the communication workload from the host pro cessor and directly support the communication protocols and APIs. The command buffers accept communication commands issued from the host and provide memory- mapped addresses for user accesses. The command buffers provide an efficient com munication channel because communication requests from different processes are hold in the same place to facilitate the processing by the NI processor. The NI local memory is the working area for the NI processor and accommodates tables for send/receive descriptors and temporary message buffers. The NI memory management unit contains address mappings for converting vir tual memory addresses to physical locations. The memory bus-based NI supports 21 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Memory Bus command buffers shared memory (proc) MMU - - ; d m a I I L ______ J local memory memory bus interface Transmit network buffers Receive network interface Network Figure 3.2: Organization of Network Interface the cache coherence protocol employed to ensure data consistency. A DMA engine can be integrated to transfer bulk data between the host memory and the NI shared memory. Packetization and reassembly mechanisms may be required, depending on the underlying interconnection network. Two FIFO network buffers are provided for input and output network messages. The NI design focuses on its interaction with the host and is not tailored to a particular communication network. A general mechanism is assumed here to send/receive messages from the network. 3.2 Communication Protocols Network Transaction Protocols Two network transaction protocols are used. One is a one-way protocol, in which a sender just sends out a message without knowing whether the corresponding receive command at the receiver side has been posted. This protocol assumes that the receive buffer is usually available and has the least overhead. In case of a receive buffer overflow, a simple flow control scheme 22 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. such as return-to-sender can be employed. The one-way protocol is suitable for short messages. 
The other is a three-way handshaking protocol, in which a sender does not send out the data message until it receives a ready signal from the receiver. This protocol is suitable for transferring bulk data, as it avoids receive buffer overflow. Figure 3.3 shows the two network transaction protocols.

Figure 3.3: One-Way and Three-Way Network Transaction Protocols

Reliable Delivery. For parallel computing, reliable delivery must be guaranteed. Although the environment we assume for network-based computing is more robust than the Internet, we still need to handle the case of corrupted or lost messages; otherwise, the execution of parallel programs may either halt or proceed incorrectly. Corrupted messages can usually be detected by the hardware-generated checksum information. To ensure that a message is received by the destined node, an acknowledgment is sent back to the source node. The source NI cannot free a message buffer until the corresponding acknowledgment is received. The NI processor can handle the generation and processing of acknowledgment messages.

3.3 Communication Primitives and Operations

Efficient send and receive operations are achieved by reducing the bus transaction and memory copy overhead, since communication buffers are located in the NI. Communication protocol overhead such as reliable delivery is not exposed to the host processor because of the NI intelligence. The basic send and receive operations are described below. Communication buffer allocation is required before these operations.

3.3.1 Buffer Allocation

A user process needs to reserve a chunk of NI memory as the home for its send/receive operations. To realize user-level NI accesses without compromising protection or fair resource sharing, the allocation of communication buffers is implemented as a system call:

Malloc_Comm(chunk_size)

The kernel allocates a contiguous region of NI memory and maps it to the process' virtual address space through page tables. The returned value is a pointer to the reserved NI memory chunk. The virtual-to-physical address mapping is also stored in the NI mapping table. Figure 3.4 shows the mapping of NI memory.

Figure 3.4: Address Mapping for NI Memory (virtual address space mapped onto NI memory and host memory in the physical address space)

The allocated NI memory pages are marked as write-through so that a modification to the cached NI communication data from the host processor will propagate to the NI memory. Conversely, a write to the NI memory from the NI processor will generate an invalidation signal to the host cache to avoid obsolete data remaining in the cache. Protection is ensured through the page tables, so that a process cannot access an NI memory page which does not belong to it. In addition, quotas can be set for each user/process to prevent monopolization of the NI memory.

The granted NI memory is initialized into a free buffer list, with a user-defined buffer size. Different buffer sizes can be chosen for the same or for different user processes; the idea is to allow the user to maximize the utilization of the communication buffers. This initialization is implemented as a library function. Each buffer contains a header and a message body, as shown in figure 3.5.
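A minimal C sketch of the allocation and free-list initialization just described is given below. Malloc_Comm is the system call introduced above; the helper comm_buf_init and the partial msg_buffer layout are hypothetical stand-ins for the library function and the buffer structure of figure 3.5.

#include <stddef.h>

/* Hypothetical buffer header; the full layout follows figure 3.5. */
typedef struct msg_buffer {
    struct msg_buffer *next;    /* queue link / next message                  */
    volatile int       flag;    /* message status, polled by the user process */
    size_t             bsize;   /* physical buffer size                       */
    void              *data;    /* pointer to the message body                */
    /* next_buf, count, mlen and blen omitted for brevity                     */
} msg_buffer;

extern void *Malloc_Comm(size_t chunk_size);   /* system call described above */

/* Library-style initialization: carve the NI chunk into fixed-size buffers
 * and chain them into a free list (assumption: headers are stored in place). */
static msg_buffer *comm_buf_init(void *chunk, size_t chunk_size, size_t buf_size)
{
    msg_buffer *free_list = NULL;
    char *base = (char *)chunk;
    for (size_t off = 0; off + buf_size <= chunk_size; off += buf_size) {
        msg_buffer *b = (msg_buffer *)(base + off);
        b->bsize = buf_size;
        b->data  = (char *)b + sizeof(msg_buffer);  /* body follows the header */
        b->flag  = 0;
        b->next  = free_list;                       /* push onto the free list */
        free_list = b;
    }
    return free_list;
}

/* Usage: reserve 64 KB of NI memory and split it into 256-byte buffers. */
void example_setup(void)
{
    void *chunk = Malloc_Comm(64 * 1024);
    msg_buffer *free_q = comm_buf_init(chunk, 64 * 1024, 256);
    (void)free_q;
}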
The header is mainly used for buffer management and the message body holds the actual message data.

Figure 3.5: Structure of Message Buffer. The header fields are: next (next message), next_buf (next buffer for the message), flag (message status for polling), count (message count for collective communication), mlen (total message size), blen (message data size in this buffer), bsize (physical buffer size) and data (pointer to the message body).

In the header, next_buf points to the next buffer if the message size is larger than one buffer. flag indicates the status of the message: for a send, it shows whether the message has been delivered and the buffer can be reused; for a receive, it indicates whether the message has arrived. The flag is designed to support user polling for probing the communication status. data points to the actual message body, which is usually the next memory location; to support bulk data transfer, data can also point to a location in host memory. count is used for collective communication to indicate the remaining number of expected messages. bsize is set by the initialization library function, which takes the desired buffer size as an argument. Different buffer sizes are supported for flexibility.

Figure 3.6 shows the logical view of the NI. In the figure, the left box, representing the NI shared memory, contains two queues: the Free Queue and the Complete Queue. The former is the free buffer list and the latter is the recycled buffer list. When a buffer is no longer in use, it can be put into the Complete Queue by the NI. The separation into two queues reduces the synchronization cost of buffer management. Without separate queues, a lock mechanism would be needed, since both the user process and the NI can simultaneously access and update the links of the queue. This would increase the buffer management cost: every time the user process needed a free buffer, it would have to lock the queue and then unlock it after acquisition, and the NI would likewise have to lock and unlock the queue to put back recycled buffers. With two queues, the synchronization cost is reduced: the Free Queue is accessed only by the user process, and most of the time the Complete Queue is accessed only by the NI. When the Free Queue becomes empty, the Complete Queue can be appended to the Free Queue; only at this time does the Complete Queue need to be locked. Note that when a message buffer is used by a communication operation, the buffer is not in any queue; the send/receive descriptors, which will be described shortly, contain pointers to those message buffers in use. A Free Queue and a Complete Queue are created for each process involved in communication.

Figure 3.6: Logical View of NI (NI shared memory: User Commands, Free Queue, Complete Queue; NI local memory: Send Descriptors, Receive Descriptors, Early Arrived Msgs; Transmit Buffer, Receive Buffer)

The right box represents the NI local memory and holds several types of tables and a message queue. The Send Descriptors Table and Receive Descriptors Table store the descriptors for send and receive operations for each process.
An early arrived message queue stores the messages which arrive before the local receive command is posted. There is another type of table not shown in the figure: the process control table. Since our communication architecture supports multi-processing, a process description entry is allocated for each process which has allocated any NI buffer space. The process description entry contains information about the process, including the mapping table, pointers to the associated Free Queue, Complete Queue, Send Descriptors Table and Receive Descriptors Table, and group membership information for collective communication. Note that the early arrived message queue can be a global queue, i.e. shared by all communicating processes.

Arrows in figure 3.6 represent the directions of possible data/block movement, which will be explained in detail in the following subsections.

3.3.2 Send

Send is a user-level command with several parameters:

SEND(destination_id, msg_ptr, size, tag, cmdbuf_ptr)

Before issuing a send command, the user process acquires free data buffers and fills them using ordinary memory writes. The user process also needs to acquire a free command buffer, cmdbuf_ptr. Since the command buffers are a resource shared between different user processes, an increment-after-read NI control register is provided to avoid race conditions. Execution of a send command actually writes the command information into a command buffer. The NI processor processes the command and sends the message to the network output buffer. Note that msg_ptr is a virtual address into NI memory; the NI processor accesses it through the NI Memory Management Unit, which translates it to the corresponding physical address.

To support reliable delivery, a send descriptor is created based on information from the send command. The descriptor includes a pointer to the message buffer and a time stamp. When the corresponding ACK is received, the descriptor is removed and the message buffer flag field is marked as "FREE". To expedite the matching of ACKs and send descriptors, the descriptors are stored in hash queues, where the hash function is computed from the tag. If the corresponding acknowledgment is not received within a pre-specified time limit, either the message or the acknowledgment was lost; in either case, the NI needs to resend the message. Reconstructing the message can easily be done with the send descriptor, because the message data is still in the NI memory. In the case of a lost ACK, the destination node will receive duplicate messages; this is not harmful, however, because the duplicate message will eventually be discarded. Figure 3.7 shows the steps of a send operation in order.

Figure 3.7: Send Operation

3.3.3 Receive

A receive command is shown below:

RECEIVE(source_id, msg_ptr, size, tag, cmdbuf_ptr)

Similar to a send, the receive operation needs to reserve free data buffers for the expected message and acquire a free command buffer to submit the receive request.
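To make the host-side usage concrete, here is a minimal C sketch of how a user process might issue these commands. It only illustrates the interface described above; names such as ni_next_cmdbuf (the increment-after-read register access) and the send_cmd layout are hypothetical, and the data buffer is assumed to be a msg_buffer from the earlier sketch.

/* Hypothetical layout of a command buffer entry. */
typedef struct send_cmd {
    int   op;              /* SEND_OP or RECV_OP                    */
    int   peer_id;         /* destination_id or source_id           */
    void *msg_ptr;         /* virtual address of the NI data buffer */
    int   size;
    int   tag;
} send_cmd;

enum { SEND_OP = 1, RECV_OP = 2 };

/* Reading this register returns a free command-buffer slot and
 * atomically advances to the next one (increment-after-read). */
extern volatile send_cmd *ni_next_cmdbuf(void);

static void dmca_send(int dest, void *msg, int size, int tag)
{
    volatile send_cmd *cmd = ni_next_cmdbuf();  /* acquire a command buffer */
    cmd->peer_id = dest;
    cmd->msg_ptr = msg;                         /* buffer already filled by the user  */
    cmd->size    = size;
    cmd->tag     = tag;
    cmd->op      = SEND_OP;                     /* writing op last hands it to the NI */
}

static void dmca_receive(int src, void *msg, int size, int tag)
{
    volatile send_cmd *cmd = ni_next_cmdbuf();
    cmd->peer_id = src;
    cmd->msg_ptr = msg;                         /* reserved buffer for the incoming data */
    cmd->size    = size;
    cmd->tag     = tag;
    cmd->op      = RECV_OP;
}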
In the one-way communication protocol, a process just sends out a message without knowing whether the receiving process has posted the receive command. There are therefore two cases when a receive command is posted: the message has either already arrived or it has not.

If the message has not arrived, a receive descriptor is created and stored in the hash queue. A field in the descriptor points to the message buffer. When the message later arrives, the matching descriptor is found and the message data is delivered directly to the reserved buffers. This situation is shown in figure 3.8, where the numbers indicate the steps in sequence.

Figure 3.8: Receive Operation - Late Arrival

If the message has arrived before the receive command is posted, the NI processor also creates a receive descriptor, which it marks as "RECEIVED", and stores the message data in a temporary buffer in the NI local memory. The temporary message buffer pointer is stored in the receive descriptor. When the receive command is posted later, the matching receive descriptor is found and the message data is moved from the temporary buffer to its destination. This procedure is shown in figure 3.9. After a received message is delivered to its reserved data buffers, the NI sets the flag to "RECEIVED".

Figure 3.9: Receive Operation - Early Arrival

3.4 Blocking and Non-blocking Modes

From a user process' point of view, the above send/receive operations are essentially non-blocking. The user process returns immediately after writing to the command buffer, and the NI processor takes over the rest. This non-blocking model is designed for better performance. A user-level probe function can easily be implemented by polling the flag field to learn whether the message has arrived or been successfully delivered.

To provide more flexible application programming interfaces, the blocking mode of communication is implemented on top of the non-blocking mode. There are two types of blocking send: synchronous and asynchronous. For a synchronous blocking send, the receiver NI does not send the ACK back until the corresponding receive command is posted. For an asynchronous blocking send, the receiver NI sends the ACK back immediately after receiving the message, regardless of whether the receive command has been posted. The blocking send function returns when the ACK has been received, as indicated by the flag. The blocking receive function returns when the flag is set to "RECEIVED" by the NI.

Figure 3.10: Synchronous and Asynchronous Blocking Send

Figure 3.10 shows the operations of blocking mode communication. On the left side, the sender performs a synchronous blocking send operation and the receiver performs a blocking receive operation. On the right side, the sender performs an asynchronous blocking send operation. Note that the ACK is returned at different times.
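A minimal sketch of how the blocking functions could be layered on the non-blocking primitives and the flag field is given below. It reuses the msg_buffer header and the dmca_send/dmca_receive helpers from the earlier sketches; the flag values and the busy-wait loops are illustrative assumptions, not the exact implementation.

/* Assumed flag encodings; the text only names "FREE" and "RECEIVED". */
enum { FLAG_PENDING = 0, FLAG_FREE = 1, FLAG_RECEIVED = 2 };

static void blocking_send(int dest, msg_buffer *buf, int size, int tag)
{
    buf->flag = FLAG_PENDING;
    dmca_send(dest, buf, size, tag);       /* non-blocking primitive            */
    while (buf->flag != FLAG_FREE)         /* NI sets FREE when the ACK arrives */
        ;                                  /* spin; an interrupt could be used instead */
}

static void blocking_receive(int src, msg_buffer *buf, int size, int tag)
{
    buf->flag = FLAG_PENDING;
    dmca_receive(src, buf, size, tag);
    while (buf->flag != FLAG_RECEIVED)     /* NI sets RECEIVED on delivery      */
        ;
}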
In addition to polling, interrupts can also be supported for the blocking mode of communication. Instead of setting the flag, the NI processor generates an interrupt to the host processor to announce the message delivery or arrival. The choice of the polling or interrupt style can be encoded in the send/receive command and descriptors.

3.5 Collective Communication

Collective communication[48, 49] can be directly supported by the intelligent NI in the same way as the described send/receive operations, except that the source_id/destination_id is replaced with a group_id. Each group_id identifies a preset group of members joining the collective communication. User functions are provided to set and remove the group membership information in the NI. For collective operations with a single originating or receiving point, such as broadcast and gather, a "root" member is specified.

Barrier is frequently used in parallel computing to synchronize a set of processes. For the root member, a barrier operation is similar to a blocking receive operation; for non-root members, it is like a synchronous blocking send operation. The root NI creates a single receive descriptor and sets the count field to the number of non-root members. For each barrier message received from a non-root member, the count is decremented by 1. When the count field reaches 0, the root NI sends an acknowledgment to each non-root member and marks the flag field as "COMPLETED". Figure 3.11 shows the barrier operations for both root and non-root nodes.

Figure 3.11: Barrier Operation

Gather and reduce are similar to barrier in having a single receiving point. For a gather operation, the root NI can deliver the data received from the group members into the right places in the receive buffers according to the stored group information. For a reduce operation, the root NI can directly perform the specified type of computation and store the final results in the receive buffers. The reduce operation is shown in figure 3.12.

Figure 3.12: Reduce Operation

For barrier, gather and reduce operations, the root NI may receive one or more messages before the local command is posted. As in the receive operation, the first arriving message makes the root NI create a receive descriptor with a pointer to the temporary message buffer, and the descriptor is marked as "RECEIVED". Later-arriving messages find the tag-matched descriptor and attach themselves to the previously arrived messages. When the corresponding command is posted, the matching descriptor is found and all the early arrived messages can be located. At this moment, two approaches can be adopted. One is a lazy approach: the root NI waits until all expected messages have arrived and then performs the corresponding operations. The other is an eager approach: the root NI processes all the early arrived messages and then immediately processes each message received thereafter.
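As a sketch of the root-side bookkeeping common to these single-receiving-point operations, the following C fragment shows count-based completion tracking with an eager combining hook. The descriptor fields and the helper names (send_ack, eager_combine) are hypothetical.

/* Hypothetical receive descriptor kept by the root NI for a collective. */
typedef struct coll_desc {
    int tag;
    int count;         /* remaining messages expected from non-root members */
    int members[64];   /* group membership (bounded here only for the sketch) */
    int nmembers;
} coll_desc;

extern void send_ack(int member_id, int tag);              /* assumed NI-internal helper */
extern void eager_combine(coll_desc *d, const void *msg);  /* e.g. a partial reduce      */

/* Called by the NI processor for every arriving collective message. */
static void root_collective_arrival(coll_desc *d, int from, const void *msg)
{
    eager_combine(d, msg);      /* eager approach: fold the message in immediately */
    d->count--;                 /* one fewer message outstanding                   */
    if (d->count == 0) {        /* all non-root members have checked in            */
        for (int i = 0; i < d->nmembers; i++)
            send_ack(d->members[i], d->tag);   /* release every non-root member */
        /* the user-visible flag would be marked COMPLETED here */
    }
    (void)from;
}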
In fact, it is also possible to start the eager processing right after the first early arrived message. For example, the reduce operation can be performed between the first and the second early arrived messages, with the result stored in the first message buffer. For each early arrived message thereafter, the reduce operation is performed between the current message and the previous result. Consequently, only one temporary message buffer needs to be allocated, which reduces the resource usage.

For a broadcast operation, the root NI constructs an outgoing message with a send descriptor for each group member, and each group member performs a corresponding receive operation. The count field keeps track of the number of received ACKs, and the send message buffers are marked as "FREE" when the count reaches 0. Figure 3.13 shows the broadcast operation. Scatter is similar to broadcast except that each group member receives a different portion of the data in the root's send buffers.

Figure 3.13: Broadcast Operation

Except for barrier, the described collective communication operations are supported in both blocking and non-blocking modes. In terms of host processor cycles, the cost of a collective communication operation is similar to that of a simple send/receive primitive. Reliable delivery is still guaranteed for collective communication.

3.6 Application Programming Interface

Although the described point-to-point and collective communication primitives can be used directly by applications, they are not convenient for programming. On the other hand, the Message Passing Interface[48, 49] has been widely used as the application programming interface for various parallel machines, including distributed-memory multiprocessors and networks of workstations. We therefore choose the Message Passing Interface as the API for our communication architecture. Our goal is to demonstrate how an application programming interface such as MPI can be efficiently supported by our design; it does not imply that our communication architecture can only support MPI.

MPI is designed to provide an efficient message-passing standard for writing message-passing programs, with the advantages of portability and ease of use. MPI only defines the specification of the programming interface; the implementation of these functions depends on the capability of the lower-level communication subsystem of the target system. Since MPI contains a large number of functions and a full implementation would take a significant amount of time, we implemented only a selected set of MPI functions, sufficient for programming the target benchmarks.

We find that the implementation of MPI on top of our communication primitives is simple and straightforward. Much of the functionality described in MPI is already provided by our communication primitives; for example, the tag matching mechanism and the blocking and non-blocking modes are directly supported. Furthermore, the intelligent NI can support the Contexts and Groups concepts without host processor intervention. The result is a light-weight MPI layer which does not incur much overhead beyond our communication primitives. The analysis of the MPI overhead is presented in chapter 5.
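As a rough illustration of how thin such an MPI layer can be, the sketch below maps MPI_Send and MPI_Recv onto the blocking primitives from the earlier sketches. The simplified MPI type stand-ins, the rank-to-node translation and the helper names are assumptions for the sketch, not the actual implementation.

#include <string.h>

/* Simplified stand-ins for the MPI types; a real implementation would take
 * these from mpi.h. msg_buffer, blocking_send and blocking_receive are the
 * (hypothetical) pieces sketched earlier. */
typedef int MPI_Comm;
typedef int MPI_Datatype;
typedef struct { int source, tag, error; } MPI_Status;
#define MPI_SUCCESS 0

extern int         dmca_type_size(MPI_Datatype dt);        /* assumed helper          */
extern int         rank_to_node(int rank, MPI_Comm comm);  /* assumed rank lookup     */
extern msg_buffer *get_free_buffer(int bytes);             /* pops the Free Queue     */
extern void        blocking_send(int node, msg_buffer *b, int size, int tag);
extern void        blocking_receive(int node, msg_buffer *b, int size, int tag);

int MPI_Send(const void *buf, int count, MPI_Datatype dt,
             int dest, int tag, MPI_Comm comm)
{
    int bytes = count * dmca_type_size(dt);
    msg_buffer *b = get_free_buffer(bytes);        /* NI-resident buffer     */
    memcpy(b->data, buf, (size_t)bytes);           /* fill the message body  */
    blocking_send(rank_to_node(dest, comm), b, bytes, tag);
    return MPI_SUCCESS;
}

int MPI_Recv(void *buf, int count, MPI_Datatype dt, int src,
             int tag, MPI_Comm comm, MPI_Status *status)
{
    int bytes = count * dmca_type_size(dt);
    msg_buffer *b = get_free_buffer(bytes);        /* reserved receive buffer     */
    blocking_receive(rank_to_node(src, comm), b, bytes, tag);
    memcpy(buf, b->data, (size_t)bytes);           /* deliver to the user buffer  */
    if (status) { status->source = src; status->tag = tag; status->error = 0; }
    return MPI_SUCCESS;
}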
3.7 Quantitative Analysis

In this section, our design is compared with two other designs in terms of three metrics: the number of memory copies required, the memory bus traffic and the I/O bus traffic. One compared design is a conventional NI sitting on the I/O bus, in which the send/receive buffers must be allocated in host memory. The other is a memory bus-based NI whose memory is not mapped into the host processor's physical address space, so that the home of the send/receive buffers is again in host memory.

The number of memory copies for a send operation is counted from the time the message data is ready until it is sent to the network output queue. The unit of memory bus and I/O bus traffic is the transfer of one message's worth of data over the bus. Bus traffic is counted from the time the message buffer starts to be written.

Table 3.1 shows the cost of sending a unicast message. The two compared designs need one extra copy of the message data from host memory to the NI and two units of memory bus traffic: from the processor to host memory and from host memory to the NI. The I/O bus-based NI also incurs I/O bus traffic when the message is copied from host memory to the NI. In our design, since the message buffer is in the NI, there is no memory copy and only one unit of memory bus traffic, generated when the message buffer is filled.

NI Design                                   Memory Copies   Memory Bus Traffic   I/O Bus Traffic
I/O bus-based NI                            1               2                    1
Memory bus-based NI, message home in MM     1               2                    0
Our design                                  0               1                    0

Table 3.1: Cost of sending a unicast message

Our design also shows a significant advantage over the others when sending a multicast message. Table 3.2 gives the cost of sending a multicast message with N destinations. Since the header and the message body are separated in our design, there is no need for the host processor to replicate N copies of the message data. For the compared designs, N complete messages must be presented to the NI, so there are N memory copies from host memory to the NI. Their memory bus traffic is either N+1, if the message data and headers are separated in host memory, or 2N, if the whole message is built in host memory.

NI Design                                   Memory Copies   Memory Bus Traffic   I/O Bus Traffic
I/O bus-based NI                            N               2N (N+1)             N
Memory bus-based NI, message home in MM     N               2N (N+1)             0
Our design                                  0               1                    0

Table 3.2: Cost of sending a multicast message

Table 3.3 shows the cost of receiving a message. The number of memory copies for a receive operation is counted from the time the message is received, after reassembly, until the message data is delivered to its destination. Since the receive buffer is in the NI, there is no memory bus traffic in our design, whereas the other designs require one memory copy and one unit of memory bus traffic. Our design sometimes requires one memory copy from the temporary buffer to the receive buffer, namely when a message is received before the corresponding receive command is posted.

NI Design                                   Memory Copies   Memory Bus Traffic   I/O Bus Traffic
I/O bus-based NI                            1               1                    1
Memory bus-based NI, message home in MM     1               1                    0
Our design                                  0/1             0                    0

Table 3.3: Cost of receiving a message

The above analysis shows that our design reduces the number of memory copies and the amount of memory bus and I/O bus traffic. In particular, the cost of sending a multicast message is significantly reduced.
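As a concrete illustration of the multicast case, consider N = 8 destinations under the counts of table 3.2: the I/O bus-based NI pays 8 memory copies, 16 (or 9) memory bus transfers and 8 I/O bus transfers, and the memory bus-based NI with the message home in main memory pays 8 copies and 16 (or 9) memory bus transfers, whereas our design still pays no copy and a single memory bus transfer to fill the shared message body.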
Chapter 4

Methodology

4.1 Design Validation

There are several ways to validate the design of a new architecture. Traditional approaches include analytical modeling, direct implementation and simulation. Each approach has its strengths and weaknesses.

Analytical modeling uses a mathematical model to describe the target system. Its main advantage is that the system behavior can be predicted easily by direct calculation. The main challenge is to derive an accurate model which describes the target system with a selected set of parameters. Since the communication architecture involves several layers of components, and the operations inside the NI and its interactions with the host processor and the network are complex and correlated, it is very difficult to derive an accurate analytical model for our architecture. Moreover, our communication architecture is part of a workstation architecture which is supposed to execute a set of applications and produce correct results; to evaluate its performance, a model of the host would also have to be built, which is an even more difficult task for analytical modeling. The analytical modeling approach is therefore not practical.

Directly implementing the proposed network interface and the supporting software would provide the most accurate performance measurements. However, the process is costly and time-consuming. In addition, the communication architecture must be evaluated with different design parameters, so several versions of the implementation would be necessary, further increasing the overall cost. Given our limited resources and manpower, this approach cannot be adopted either.

Simulation, on the other hand, provides a better trade-off among accuracy, cost and flexibility. This approach simulates the operations of an architecture at the desired level: the complexity of a simulation model can be chosen according to the required degree of accuracy. Since simulation models are parameterized, architectural changes can easily be made and their impact on system performance evaluated and compared.

Common simulation techniques include trace-driven, execution-driven and program-driven simulation. A trace-driven simulator takes the trace of a previous program execution as input. For a parallel program, the trace already fixes the execution order of synchronization events, whereas the simulator may not observe the original execution order because of the dynamic nature of process interaction; consequently, trace-driven simulation of a multiprocessor system may produce incorrect results. Execution-driven simulation requires modifying and augmenting the benchmark program so that the simulation modules are integrated into the benchmark object code. This approach achieves a better slow-down factor at the cost of preprocessing and recompiling the benchmark program. Program-driven simulation accepts the unchanged binary as input. A program-driven simulator contains two parts: a front end acting as the event generator and a back end acting as the target system simulator. While executing the input binary, the front end sends events of interest to the back end for processing, and the back end sends signals back to the front end for process control.
Synchronization operations can be simulated correctly through the interaction of the two modules. Figure 4.1 shows the structure of a program-driven simulator.

Figure 4.1: Program-Driven Simulation (front end: event generator; back end: target system simulator; events flow to the back end, process control flows back)

4.2 Simulator

Since our communication architecture is a subsystem of a computer architecture and needs to interact with other components as well, we need a simulator of the whole system in order to evaluate the communication architecture. Fortunately, we do not need to develop the simulator from scratch, as a few such simulators are publicly available with source code.

The simulator chosen for our design is Paint[65], which is developed from Mint[72] with some extensions. The main purpose of Mint is to facilitate the design of memory hierarchies in shared-memory multiprocessors. Mint adopts the program-driven simulation scheme and includes two parts: a MIPS instruction set interpreter and a simulation library. The MIPS interpreter models the execution of an application program on some number of processors and serves as the memory reference generator. The simulation library provides functions for event scheduling and process control, which can be used to construct the target system of interest.

Paint modifies Mint into an HP PA-RISC[37] instruction set interpreter. Paint was originally developed to support the Avalanche Scalable Computing Project[67]. Its extensions include a new process model which allows multiple programs to run on each processor, and the ability to model both kernel and user code on each processor; the kernel part is a subset of the BSD 4.4 Unix kernel. Since Paint is close to the model of a network of workstations, and the underlying program-driven model allows us to simulate process synchronization through message passing correctly, it is adopted as the base simulator for our communication architecture.

In Paint, each simulated processor is represented by a single thread. The kernel is the first program to run within that thread, followed by one or more user-level programs. As in a real machine, the kernel performs context switching between user programs. When a program is loaded into Paint to be run, the binary file is converted into a linked list of instruction code structures. For a branch instruction, both the sequential path and the target path are linked to the branch code structure. Each instruction code represents a real instruction and stores information about how to interpret the instruction, usually as a function pointer. Program execution in Paint is thus interpreted by traversing the instruction code list.

Events are generated while Paint traverses the instruction code list. The purpose of an event is to notify the back end that something of interest has occurred and to hand control over to the back end. The back end is responsible for the more detailed simulation of selected architectural features, e.g. the memory hierarchy or the network interface. Events are specified and inserted by Paint when an input binary is loaded and converted to a linked list of instruction code structures.
Each event is implemented as an instruction code structure with a pointer to an event function in the back end and with zero cycles of execution time, so that the overall execution time is not affected by interpreting the event code structure itself. The execution of the event function, however, can move the system clock forward, depending on the characteristics of the simulated architectural feature.

There are two types of event generation: one occurs before the instruction and the other after it, depending on the instruction type. For example, when a memory reference instruction is encountered, an event is generated before the instruction so that the cache module in the back end can be examined first: for a cache hit, the cost of the event function is zero, and for a cache miss, the cost is the penalty of loading a cache block. When the event function returns and the back end signals the front end that execution can continue, the original instruction structure is executed to effect the changes in processor state required by that instruction. Figure 4.2 shows the two types of event insertion. The original instruction sequence is shown in (1), where the second instruction I2 is the instruction to be simulated in more detail. Paint may insert an event E2 before I2, as shown in (2), or after I2, as shown in (3), depending on the type of I2.

Figure 4.2: Event Generation

In a parallel system such as a network of workstations, multiple operations occur at any given time instant. Paint simulates parallel tasks through a time wheel mechanism. The time wheel consists of an array of head nodes of linked lists, indexed by the time value of a task. Each task is appended to the linked list of the time wheel entry at which it is supposed to be executed. Tasks to be executed at the same time are therefore in the same list and will be scheduled to run sequentially. The internal Paint clock advances only when all tasks in the currently processed time wheel entry are done.

Because the size of the time wheel array is fixed, a future queue is included to accommodate all tasks whose due times are beyond the last time wheel entry. A reload task is always enqueued in the last time wheel entry. When the Paint clock reaches the last time wheel entry, the reload task is executed: the future queue is examined and its tasks are distributed to the appropriate time wheel entries for another pass of time wheel processing. Figure 4.3 shows the time wheel mechanism in Paint, where the currently processed entry is 2. The current time of the simulator is computed by adding the time wheel entry index to a time base variable, which is updated each time the time wheel pointer wraps around.

4.3 Tasks

Since Paint includes only the kernel module and the front end module, we need to implement the back end modules, including the cache, main memory, network interface and communication network, as shown in figure 4.4. The network interface module is the key module, because we need to simulate in detail the operations described in chapter 3 that occur inside the network interface, as well as its interaction with the host processor and the communication network.
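To illustrate how such a back-end module plugs into the event mechanism described above, here is a minimal C sketch of a cache event function of the kind a memory reference event might invoke. The function and field names are hypothetical and do not reproduce Paint's actual interfaces.

/* Hypothetical back-end state for a simple direct-mapped cache model. */
typedef struct cache_model {
    unsigned long tags[32768];     /* one tag per cache line           */
    unsigned long line_size;       /* bytes per line, e.g. 32          */
    unsigned long miss_penalty;    /* host cycles charged on a miss    */
} cache_model;

/* Event function: returns the cost (in cycles) of this memory reference.
 * The front end adds the returned cost to the simulated clock before
 * letting the original instruction take effect. */
static unsigned long cache_event(cache_model *c, unsigned long addr)
{
    unsigned long line  = addr / c->line_size;
    unsigned long index = line % 32768;
    if (c->tags[index] == line)
        return 0;                  /* hit: the event costs nothing      */
    c->tags[index] = line;         /* miss: install the line            */
    return c->miss_penalty;        /* and charge the load penalty       */
}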
Figure 4.3: Time Wheel Mechanism in Paint (time wheel entries 0 through N with a list of tasks per entry, a future queue and a reload task; the current time points at entry 2)

Figure 4.4: Simulation Modules (front end; back end: cache, main memory, network interface, communication network)

In addition to the above hardware modules, we also need to implement the software layers on top of them. These layers include the messaging layer and the Message Passing Interface. Since the buffer allocation function is implemented as a system call, the Paint kernel module is updated to support it. A set of user-level library functions, including communication buffer management and polling of the communication status, is also needed to better support the benchmarking.

4.4 Benchmarks and Performance Metrics

We choose three types of benchmark programs to evaluate the performance of our communication architecture; they represent different levels of communication performance. The first type measures point-to-point communication performance, which stands for the base communication performance. The second type measures collective communication performance, which is important for parallel computing. The third type measures application performance and shows the impact of our design on parallel applications with a typical mix of computation and communication. The three types of benchmarks are listed in table 4.1 with the names of the programs; a detailed description is given in chapter 5.

Type   Purpose                          Programs
1      Point-to-Point Communication     Opt-Receive, Equ-Receive, BW-Test
2      Collective Communication         Barrier, Gather, Scatter, Broadcast, Reduce
3      Macro-benchmarks                 Multigrid, Integer Sort

Table 4.1: Benchmarks

The main performance metric for the micro-benchmarks (type 1 and type 2) is the communication latency. For the type 3 benchmarks, the communication time and the overall run time are collected. The communication time is the accumulated time the host processor spends in communication operations during the execution of the application. Since the communication time and run time vary from node to node in a parallel system, the critical path, i.e. the node with the longest run time, is selected to represent the system performance.

Chapter 5

Performance Evaluation

5.1 Simulation Environment

The simulation is performed by feeding the benchmark programs into the extended Paint simulator. Since Paint takes unchanged binary files, i.e. binaries compiled for the native machine, as input, no code augmentation is needed. This simplifies the task of benchmark development. In addition, the performance measurement operations are done inside Paint and thus do not affect the measured performance numbers.

The platforms for conducting the simulation are three HP 750/755 workstations running HP-UX 9.X. Although the Paint simulator can only be run on a single machine, using multiple machines concurrently for different benchmarks and simulation parameters still shortens the overall simulation time.

The major architectural parameters of our proposed architecture are given in table 5.1. The NI command buffers and NI local memory are assumed to be implemented in SRAM technology, with a speed of 10ns.
Since the NI shared memory speed is a crucial factor for communication performance, several values have been experimented with for comparison: 10ns and 20ns for SRAM, 30ns for SDRAM and 60ns for DRAM technology. The interconnection network is a shared bus, and bus contention is simulated. The user-defined communication data buffer size is set to 64 bytes for the message body. The host processor speed is set to 400 MHz; although higher-speed processors are available nowadays, we choose a moderate speed to show the capability of our communication architecture in a general system.

Host Processor (HP)                     400 MHz
Cache Size                              1M Bytes
Cache Block Size                        32 Bytes
Host Cache Miss Penalty                 24 HP cycles
NI Processor (NIP)                      100 MHz
NI Shared Memory Size                   1M Bytes
NI Shared Memory Access Time by NI      1 to 6 NIP cycles
NI Local Memory Access Time             1 NIP cycle
NI Command Buffers Access Time by NI    1 NIP cycle
Network Speed (default)                 1 Gbps

Table 5.1: Architectural Parameters

The NI processor speed is set to 100 MHz, only one fourth of the host processor speed. This reflects cost considerations: the network interface is only a peripheral, and its cost should not be comparable to that of the host processor. In addition, a general-purpose DSP processor can be deployed as the NI processor, since floating-point calculation is seldom used.

5.2 Point-to-Point Communication

The benchmark for measuring communication latency is simple: a message is sent from node A to node B. Because of the one-way protocol employed, a message may arrive before the corresponding receive command is posted. Therefore, two test programs are implemented:

Opt-Receive: the send command is issued early enough that the message arrives at the receiver before the receive command is posted.
Equ-Receive: the send and receive commands are issued at the same time.

Communication latency is measured from the time the receive command is issued by the user process until the message is delivered to its final destination, i.e. from the time the data is requested until the data is available. Note that this differs from the request-reply communication model, in which the communication latency is measured from the time the request command is issued until the requested data message comes back; the latency of the request-reply model is essentially equal to the round-trip latency. Opt-Receive measures the optimal receive latency and Equ-Receive reveals the one-way latency. Different message sizes, ranging from 4 bytes to 8K bytes, are tested.

Figure 5.1: Receive Latency (Opt-Receive)
Figure 5.2: Receive Latency (Equ-Receive)

Figure 5.1 shows the optimal receive latency for various message sizes. The four curves represent different speeds of the NI shared memory. Since the message arrives at the NI earlier than the receive command, the latency includes processing the receive command, moving the message from the temporary buffer to the NI shared memory, and setting the flag to notify the user process. As expected, for the same message size, the latency increases with the NI shared memory access time. The optimal latency is 1.2 μs for a 32-byte message and about 11 μs or less for messages up to 512 bytes.
For a 4-byte message, the latency is less than 1 μs. Figure 5.2 shows the receive latency for Equ-Receive, i.e. the traditional one-way latency. For a 32-byte message, the latency is under 3.5 μs; for messages up to 128 bytes, it is just under 8 μs.

Figures 5.3 and 5.4 show the round-trip latency. This latency is measured from the time the send command is issued until the corresponding acknowledgment is received; it therefore also indicates how long a send buffer remains occupied. The shorter this latency, the sooner the send data buffer can be recycled. The send commands in both Opt-Receive and Equ-Receive are in the asynchronous mode, where the acknowledgment is generated by the receiver NI regardless of whether the receive command has been posted. The two figures show similar results because the acknowledgment is generated after the message is stored in either a temporary buffer (Opt-Receive) or a reserved buffer (Equ-Receive). Note that the round-trip latency is much less than twice the receive latency for Equ-Receive, because the ACK is handled internally by the NI. The round-trip latency for a 32-byte message is 5 μs or less.

Figure 5.3: Round-Trip Latency (Opt-Receive)
Figure 5.4: Round-Trip Latency (Equ-Receive)

Figure 5.5: One-Way Latency

Figure 5.5 compares our results with MPI-GAM[6], MPI-BIP[43] and MPI-FM[71], which are MPI implementations based on Generic Active Messages, BIP and Fast Messages, respectively. The performance of our communication primitives is marked DMCA, for Direct Mapped Communication Architecture; MPI-DMCA represents the performance of our MPI implementation on top of DMCA. The NI shared memory access time is set to 3 NIP cycles. Our results are better than the others for messages up to 512 bytes. For larger messages, the advantage gradually disappears, because the data movement is performed by the NI processor at a rate of 4 bytes at a time, so for large messages the data moving time contributes a significant portion of the overall latency. This could be remedied by incorporating a DMA engine for transferring large messages. Furthermore, as stated earlier, the performance of transferring small messages is more important, as they occur more frequently.

The one-way latency of the Shrimp project (VMMC) is 3.7 microseconds for automatic update and 7.6 microseconds for deliberate update[28]. For automatic update, the binding of a local memory page to a remote memory page is made at the time a mapping is created; for deliberate update, the binding does not occur until an explicit send command is issued. Although automatic update has less overhead, its capability is limited and less scalable, as collective communication such as broadcast cannot be supported efficiently.

Figure 5.6: Deliverable Bandwidth (Synchronous Blocking Send)

Figure 5.6 shows the deliverable bandwidth between two nodes.
The test program is as follows: node A iteratively executes the synchronous blocking send command and node B iteratively executes the matching receive command. The NI shared memory access time is set to 3 NIP cycles. Note that a synchronous blocking send is blocked until the corresponding receive is posted; this ensures that a message is delivered to its destination before the next one is sent out. Three network speeds are experimented with. The results show that the deliverable bandwidth starts to saturate for message sizes larger than 1K bytes. The performance does not improve significantly when the network speed increases, which is to be expected since the bottleneck is not the network bandwidth.

5.3 Collective Communication

Performance numbers for collective communication, including barrier, broadcast, gather, scatter and reduce, are derived for both the collective communication primitives (DMCA) and the corresponding MPI implementation (MPI-DMCA). The test programs are designed so that all nodes start the same operation at the same time instant; this synchronization can easily be achieved in our simulation environment. The number of participating nodes ranges from 2 to 64. The size of the message data transferred between the root member and each non-root member is 32 bytes for all operations except barrier, for which the message data is 0 bytes. The NI shared memory access time is set to 3 NIP cycles.

Our results are compared with published performance numbers measured on three MPPs: the IBM SP2, Cray T3D and Intel Paragon[38]. In [38], the measured performance numbers were expressed as curve-fitted timing formulas, parameterized by the message size in bytes (m) and the system size (P). The timing expressions for collective communications on the SP2, T3D and Paragon are shown in tables 5.2, 5.3 and 5.4, respectively; the unit of the expressions is the microsecond.

Barrier     123logP-90
Broadcast   (55logP+30)+(0.014logP+0.053)m
Reduce      (63logP+26)+(0.016logP+0.071)m
Gather      (3.7P+128)+(0.022P-0.011)m
Scatter     (5.8P+77)+(0.039P-0.12)m

Table 5.2: Timing Expressions for Collective Communications on SP2

Barrier     0.011logP+3
Broadcast   (23logP+12)+(0.013logP-0.0071)m
Reduce      (34logP+49)+(0.061logP-0.00035)m
Gather      (5.3P+30)+(0.0047P+0.0084)m
Scatter     (4.3P+67)+(0.0057P+0.16)m

Table 5.3: Timing Expressions for Collective Communications on T3D

Barrier     147logP-66
Broadcast   (52logP+15)+(0.019logP-0.022)m
Reduce      (77logP+3.6)+(0.16logP-0.028)m
Gather      (48P+15)+(0.0081P+0.039)m
Scatter     (18P+78)+(0.0031P+0.039)m

Table 5.4: Timing Expressions for Collective Communications on Paragon

Figure 5.7 shows the latency of barrier. For a barrier operation, the root member has the shortest latency and the non-root members have longer and different latencies, depending on the order of the ACKs. The DMCA latency shown is calculated by averaging the barrier latencies over all group members, in the same way the latency
Our communication architecture performs better than all three MPPs. Note that a significant amount of startup overhead is present in MPPs for initializing the collective communication operations. This overhead is minimized in our architecture because our intelligent NI can directly and efficiently support the collective communication operations. For example, even for a two-node system, T3D takes 32 (is and Paragon and SP2 take 67 (is and 87 (is respectively. However, DMCA only takes 4.2 (is and MPI-DMCA, 5.6 (is. Besides, the insignificant overhead of adding the MPI layer to DMCA is shown since most part of the MPI functionality has been realized in the DMCA collective communication primitives. Figure 5.9, 5.10 and 5.11 show the compared performance numbers for gather, scatter and reduce operations respectively. The phenomena are similar to broadcast: the three MPPs present a much higher latency than MPI-DMCA. Barrier Latency Broadcast Latency 10000 1000 i 5 I 3 100 10000 J — 0 _ _ 0 S P 2 a - — -J P«r«oon O OT30 A— AMPI-OMCVi 1000 ^ <°MCA ^ : S 2 S 100 3 Numt*r of NodM Figure 5.7: Barrier Latency Figure 5.8: Broadcast Latency To examine the overhead of MPI layer over DMCA more closely, the MPI over head is derived. Table 5.5 shows the overhead of MPI layer in microseconds and 52 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Latency m M cii Gather Latency Scatter Latency 10000 1000 1 0 0 0 — 0 SP2 3— c Paragon O OT30 A— AMPUOMCn < ODMCA Number o < Nodes 10000 1000 100 0 — OSP2 o— 3 Paragon < > OT3D A— AMPI-OMCA * — <DMCA 10 t 0 16 32 48 64 Number of Nodes Figure 5.9: Gather Latency Figure 5.10: Scatter Latency Reduce Latency 10000 G GSP2 G— G Paragon O OT3D A— dMPt-OMCA ■ d — <DMCA 1000 i s 1 0 0 Number of Nodes Figure 5.11: Reduce Latency 53 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. System Size 2 4 8 16 32 64 Barrier 0.1 0.4 0.3 0.3 0.4 0.2 Broadcast 1.4 1.3 1.3 1.3 1.3 1.3 Reduce 1.4 1.3 1.3 1.3 1.4 1.4 Gather 1.5 1.9 3.2 6.4 10.1 18.7 Scatter 1.8 2.6 3.5 5.6 9.8 18.1 Table 5.5: Overhead of MPI Layer in Microseconds System Size 2 4 8 16 32 64 Barrier 4.2 9.5 4.1 2.2 1.6 0.4 Broadcast 33.3 17.1 9.0 4.6 2.4 1.2 Reduce 51.9 23.6 12.4 6.4 3.5 1.8 Gather 60.0 38.0 32.0 29.5 19.0 12.6 Scatter 42.9 34.2 24.1 19.8 17.6 16.3 Table 5.6: Overhead of MPI Layer in Percentages MPI Overhead of Collective Communication MPI Overhead of Collective Communication 100 O— 0 8 am«f H l: — u Broadcast I Raduca £Gafftar t I 48 Numdtr d Processors . Broadcast ! Reduce la— dSaiher i<— < Scatter ts 10 s & it 16 32 48 Figure 5.12: M PI Overhead in Mi- Figure 5.13: M PI Overhead in Percent croseconds ages Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. table 5.6 shows the MPI overhead expressed in the ratio of corresponding DMCA collective communication latency, i.e. M PI.O verhead = (M P I -D M C A - DMCA) x 100/DM C.4 Graphs in figure 5.12 and 5.13 show the data from table 5.5 and 5.6. In fig ure 5.12, it is found that the absolute MPI overhead is almost fixed for barrier, broadcast and reduce operations regardless of the system size. This reflects the fact of directly supporting collective communication operations by NI. In addition, the MPI overhead for barrier is less than broadcast and reduce operations. This is because barrier is a synchronization operation and does not need any user data. 
While for broadcast and reduce operations, 32 bytes of message data is involved. On the other hand, the absolute MPI overhead for gather and scatter operations grows almost linearly with the system size. This is because the size of message data which the root of gather and scatter operations needs to process is proportional to the system size. If the ratio of MPI overhead over the native DMCA collective communication latency is examined, the MPI overhead ratio decreases with the system size, as shown in figure 5.13. This is due to the fact that the DMCA collective communication latency increases with the system size. 5.4 Macro-benchmarks To study the impact of our communication architecture on real world applications, two NAS benchmarks[4], including Multigrid(MG) and Integer Sort(IS). have been implemented. These benchmarks are programmed with MPI functions for inter processor communication. 5.4.1 Benchmark Description In the Multigrid benchmark, a three dimensional Poisson partial differential equation V 2u = v 55 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. is solved using the multigrid method. Due to the memory size constrain of our machine, the problem size is set to 32 by 32 by 32, i.e. 32K grid points in total. The algorithm of partitioning the full grid onto some power of 2 number of processors is to start by splitting the last dimension of the grid (z dimension) in 2: the problem is now partitioned onto 2 processors. Next the middle dimension (y dimension) is split in 2: the problem is now partitioned onto 4 processors. Next, the first dimension (x dimension) is split in 2: the problem is now partitioned onto 8 processors. Next, the last dimension (z dimension) is split again in 2: the problem is now partitioned onto 16 processors. This partitioning is repeated until all of the power of 2 processors have been allocated. Initially, v is set to 0 except at twenty grid points of which ten points are set to + 1 and the other ten are set to -1. These 20 points are selected by a random number generator, which generates 32x32x32 random numbers and the locations of the ten largest numbers and the ten smallest values are set to +1 and -1 respectively. The PDE is solved by four iterations of the following two step, beginning with u=0: r = v — Au u = v -j- M kr where the first step is to evaluate residual and the second step applies correction to the solution. The solution is verified through its L 2 norm, which should agrees with the reference value of 0.530770700573 x 10“4 within an absolute tolerance of 10-8 . In the Integer Sort benchmark, a number of keys are sorted in parallel using the bucket sort algorithm. The keys are generated by a sequential key generation algorithm. Initially, each node generates a number of keys and then exchanges part of its keys with others based on the bucket sort algorithm. Each node then sorts the portion of keys it owns. The result is verified by comparing the smallest key and the largest key of a node with its neighbors. The largest key of a node should be no larger than the smallest key of the node with a larger id. The problem size of this benchmark is set to 64K, which is the largest problem size our machine can handle. 56 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 5.4.2 Strategy For these macro-benchmarks, vve would like to evaluate impact of our communication architecture on the application performance. 
Since our communication architecture is a overall solution which includes various features for optimizing communication performance, we use a “reverse” approach to evaluate the impact of individual com ponents. The “reverse” approach means we either increase the overhead or disable a feature to study its performance impact and compare it with our base configura tion, with all features included without imposing any overhead. Note that with this approach, the performance improvement made by our communication architecture is indeed the aggregate impact of the individual components. In the previous sections, we have shown the advantage of our communication architecture in improving the latency of point-to-point and collective communication. We would like to see the impact of this improvement on the overall application performance. In addition, our architecture makes the NI buffers cacheable in the host cache, we want to know the impact of this caching feature. Finally, we have observed the impact of the NI shared memory speed on the point-to-point communication performance and would like to look into its impact on the application performance. To sum up, we want to study the impact of the following parameters/features on the application performance: • Communication Latency • Cacheable NI Buffers • NI memory speed The corresponding approaches are • Impose additional overhead for each communication operation. • Enable/Disable caching of NI buffers. • Vary the NI memory speed. and then observe their impact on the application performance. During the execution of a program, two modes are distinguished: computation and communication. In the communication mode, the host processor executes the 57 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. communication functions. The processor idle time for a blocking communication operation is also included in the communication time. In the computation mode, the host processor does the local computation. These two modes interleave during the execution and the overall run tim e is the sum of both. The measurement includes the communication time, the overall run time and the statistics of communication messages. The number of simulated nodes ranges from 1 to 32. 5.4.3 Multigrid The multigrid benchmark scales well for nodes from 2 to 16 and does not improve too much from node 16 to node 32. Figure 5.14 shows the speedup of MG. Speedup Bancfenark M uT O ghd 16 1 2 8 4 0 8 24 32 0 40 Figure 5.14: MG Speedup Figure 5.15 shows the overall run time and the corresponding communication time for each configuration. The communication time is the time the host processor spends in communication operations, i.e. the MPI functions, and is shown in the lower part of the bar chart. It is observed that the ratio of communication time over the overall run time increases with the system size, as shown in figure 5.16. This phenomenon is due to the fact that when the system size grows, the synchronization time increases and the number of messages to be exchanged also increases but the computation tim e decreases as the computation is shared by more nodes. 58 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 
5.4.3 Multigrid

The Multigrid benchmark scales well from 2 to 16 nodes and does not improve much from 16 to 32 nodes. Figure 5.14 shows the speedup of MG.

[Figure 5.14: MG Speedup]

Figure 5.15 shows the overall run time and the corresponding communication time for each configuration. The communication time is the time the host processor spends in communication operations, i.e., the MPI functions, and is shown in the lower part of the bar chart. It is observed that the ratio of communication time to overall run time increases with the system size, as shown in figure 5.16. This is because, as the system size grows, the synchronization time increases and the number of messages to be exchanged also increases, while the computation time decreases as the computation is shared by more nodes.

[Figure 5.15: MG Execution Time]  [Figure 5.16: MG Communication Ratio]

The previous explanation is verified through the collected communication statistics shown in figure 5.17, which gives the message count for each node. There are two types of nodes: root and non-root. The designated root node performs synchronization tasks such as barrier and reduce as well as the sequential portion of the benchmark; all other nodes are non-root nodes. Figure 5.17 shows that the message count grows almost linearly with the system size for the root node. However, the message count is nearly the same for non-root nodes in all configurations, as the communication pattern is determined by the partitioning algorithm. The average message size decreases with the system size, since the amount of data allocated to each node decreases when the system size increases.

[Figure 5.17: MG Message Count]  [Figure 5.18: MG Average Message Size]

To look at the communication statistics more closely, the distribution of message sizes for an 8-node configuration is shown in figure 5.19, where message size is binned in steps of 64 bytes. For example, a data message of size 0 to 63 bytes is shown as 0 and one of 64 to 127 bytes is shown as 64. The top graph shows the distribution of message sizes for the root node and the bottom one for non-root nodes. For the root node, 83 percent of messages are less than 64 bytes; for non-root nodes, 45 percent. This validates our argument that the latency for small messages is important, as a significant portion of the data messages are small.

[Figure 5.19: MG Distribution of Message Size]

To illustrate the impact of the processing overhead of communication operations, an additional overhead of 1000, 2000 or 3000 host processor cycles is imposed for each communication operation. The result is compared with the base model, where no additional communication overhead is imposed. Figure 5.20 shows the impact on the communication time, where the y-axis indicates the percentage change of the communication time with respect to the base model. For example, the total communication time increases by about 10 percent for a 2-node system when a 2000-cycle overhead is imposed. For each setting, the impact on communication performance increases with the amount of imposed overhead. For the same overhead, the impact is not correlated with the system size: for the 2000-cycle overhead, a 17 percent slowdown is seen for 8 nodes but only a 12 percent slowdown for 16 nodes. This is partially because the message count for non-root nodes does not change much across system sizes.
This graph shows that communication latency does have a significant impact on the overall communication time: each additional 1000 cycles of communication overhead increases the communication time by 3 to 9 percent, depending on the system size.

Figure 5.21 shows the impact of communication overhead on the overall run time. The impact on the overall run time is essentially the product of the impact on the communication time and the communication ratio; the computation time, during which the host processor works on local computation, is not affected at all. Therefore, the impact on the overall run time greatly depends on the communication ratio. Since the communication ratio increases with the system size, so does the impact on the overall run time.

[Figure 5.20: Impact of Overhead on Comm Time (MG)]  [Figure 5.21: Impact of Overhead on Run Time (MG)]

To evaluate the performance gain of caching data from NI memory, we measure the performance when the NI data is not cacheable and compare it with our base model, a cacheable configuration. The impact of the uncacheable NI is shown in figure 5.22, where the y-axis indicates the change in the normalized communication time. The figure shows that caching NI data has more impact for small systems, as a larger amount of data is exchanged. For a 2-node system, the communication performance loss is about 45 percent, while for a 32-node system the loss is less than 10 percent.

Unlike the imposed-overhead experiments, the computation time is affected by caching/uncaching NI data, since the computation data and the communication data share the same cache. It is found that uncaching NI data slightly improves the computation time, as cache block conflicts from NI data no longer occur and thus the cache hit ratio is improved. However, the improvement is small, since the amount of communication data is much less than the amount of computation data. The impact of uncaching NI data on the overall performance is shown in figure 5.23. The top curve shows the overall performance loss due to increased communication time and the bottom curve shows the overall performance gain due to decreased computation time; the net effect is shown in the middle curve. Although caching NI data has more impact on the communication performance for small systems, it does not have the same effect on the overall performance, since small systems have a smaller communication ratio.

Figures 5.24 and 5.25 show the impact of NI memory speed on the communication time and the overall run time, respectively. The times are normalized to the configuration with NIP = 3. For communication performance, more impact is seen on smaller systems. However, the impact on the overall performance still depends on the product of the communication impact and the communication ratio.
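To make this relationship explicit (our formulation; the text states it only qualitatively), let T_comm and T_comp be the communication and computation times, T_total = T_comm + T_comp, and let the communication ratio be R = T_comm / T_total. If an experiment slows communication down by a relative amount d while leaving computation untouched, the relative increase in overall run time is

    dT_total / T_total = (d x T_comm) / T_total = d x R,

i.e., the communication slowdown scaled by the communication ratio. As a purely illustrative example with assumed numbers, a 17 percent communication slowdown at a communication ratio of 0.3 would translate into roughly a 5 percent increase in overall run time.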
[Figure 5.22: Impact of Uncacheable NI on Comm Time (MG)]  [Figure 5.23: Impact of Uncacheable NI on Run Time (MG)]
[Figure 5.24: Impact of Memory Speed on Comm Time (MG)]  [Figure 5.25: Impact of Memory Speed on Run Time (MG)]

5.4.4 Integer Sort

Figure 5.26 shows the speedup of Integer Sort. The speedup declines from 16 to 32 nodes because the cost of the increased communication load outweighs the benefit of sharing the computation. The overall communication time and run time are shown in figure 5.27. As with Multigrid, the ratio of communication time to total run time increases with the system size, as shown in figure 5.28.

[Figure 5.26: IS Speedup]
[Figure 5.27: IS Execution Time]  [Figure 5.28: IS Communication Ratio]

The message statistics are shown in figures 5.29 and 5.30. The message counts for both the root and non-root nodes increase linearly with the system size, and the average message size decreases with the system size. This is because the communication pattern in Integer Sort is quite regular and balanced: when more nodes join the sorting, the amount of data each node is responsible for decreases, but more messages are distributed to the others.

[Figure 5.29: IS Message Count]  [Figure 5.30: IS Average Message Size]

The communication pattern in Integer Sort is simpler than in Multigrid. In Integer Sort, each node needs to send a portion of its generated keys to every other node, so a portion of the exchanged messages are large. This is revealed in figure 5.31, where the distribution of message sizes for an 8-node configuration is shown. However, small messages still take a significant share: for example, 47 percent of messages for non-root nodes are less than 64 bytes, and 34 percent for the root. The importance of small-message latency is confirmed again.

[Figure 5.31: IS Distribution of Message Size]

The impact of communication processing overhead is shown in figures 5.32 and 5.33. The impact on communication time increases with the system size due to the increasing amount of communication.
The impact on the overall run time still depends on the communication ratio. Since the communication ratio is higher for Integer Sort than for Multigrid, a larger impact on the overall run time is seen here.

The impact of uncacheable NI data is shown in figures 5.34 and 5.35. Similar to the Multigrid results, the smaller the system, the larger the slowdown in communication time, because a larger amount of data is exchanged in smaller systems. The overall run time results also show that a small improvement is obtained with the uncacheable NI due to the improved computation time. However, the loss due to communication far exceeds the gain from computation, and the net effect is still a performance loss.

[Figure 5.32: Impact of Overhead on Comm Time (IS)]  [Figure 5.33: Impact of Overhead on Run Time (IS)]
[Figure 5.34: Impact of Uncacheable NI on Comm Time (IS)]  [Figure 5.35: Impact of Uncacheable NI on Run Time (IS)]

The impact of NI memory speed on the communication time is shown in figure 5.36. The impact is found to be larger at both ends of the range of system sizes. This is because small systems exchange a small number of large messages while large systems exchange a large number of small messages. The NI shared memory holds both message data and headers, and the access time to both is affected by the NI shared memory speed: for small systems the data processing time dominates, while for large systems the non-data processing time dominates. The impact on the overall run time, shown in figure 5.37, is the combined effect of the impact on communication time and the communication ratio.

[Figure 5.36: Impact of Memory Speed on Comm Time (IS)]  [Figure 5.37: Impact of Memory Speed on Run Time (IS)]

Chapter 6

Conclusion

6.1 Summary

We have analyzed the inefficiencies of traditional communication architectures for network-based computing and proposed an intelligent-NI-based communication architecture as a solution. The memory bus is chosen as the NI connection point because it provides better communication performance and facilitates the implementation of a high-performance software messaging layer. The intelligent network interface is designed to handle communication operations independently as much as possible, to alleviate the burden on the host processor.
Our design can efficiently realize user-level NI accesses and effectively reduce the overhead from memory copies and bus transactions. Reliable message delivery can be supported efficiently without host intervention. In addition, the one-way communication protocol employed allows compiler optimization techniques to be applied so that the send command can be scheduled well before the corresponding receive command to achieve optimal receive latency.

Collective communication is frequently used in parallel computing but is usually supported on top of the software messaging layer. Such a design may result in poor collective communication performance and consume too many host processor cycles. Our intelligent NI can fully and efficiently handle collective communication operations, sparing the host processor for more computation work.

The results from macro-benchmarking show that communication latency does have a significant impact on the communication time of applications. The NI memory speed is an important factor in determining communication performance. In addition, caching NI data does help to improve the overall run time. The overall simulation results support our argument that architectural support is indeed necessary for further reducing the communication latency and making it more comparable with that of MPPs. In particular, putting the NI on the memory bus and allocating the communication buffers in the NI memory provide a better architectural platform on which a more efficient messaging layer can be designed and supported.

Note that our target platform, a network of workstations, is essentially a multiprogramming environment in which multiple processes from different users usually run concurrently. Our communication architecture not only improves the performance of applications requiring fast communication but also releases more host processor cycles for other processes. Therefore, applications without communication also benefit from this efficient and independent communication architecture, and the overall system performance is improved as well.

6.2 Future Work

Although we have presented a communication architecture that achieves better communication performance, there are still areas worth exploring and studying further.

Network Module. In our architecture, we focus on the interaction between the NI and the host processor, and the communication network is assumed to be a shared bus. Since a bus network introduces more contention than a switched network, the performance numbers we have derived can be considered conservative. Alternatively, switched networks such as ATM or Myrinet could be employed. Although a more complex network model slows down the simulation, better communication performance can be expected due to reduced network contention.

Cache Coherence Protocol. We use an invalidation scheme for ensuring coherent NI data. It is also possible to employ an update protocol which propagates updated NI data to the host cache whenever data in NI memory is modified by the NI processor. Although the update scheme may introduce more bus traffic, the host processor might be able to get the updated NI data directly from its cache.
DMA. For large messages, a DMA engine is useful for transferring data between NI local memory and NI shared memory, or between NI shared memory and host main memory, which can relieve the workload of the NI processor. Therefore, a DMA engine could be included in the NI architecture and its impact on performance studied further.

Appendix A

Implementation of Our Communication Architecture in Paint

This appendix describes the implementation of our communication architecture as a simulation module for the Paint simulator [65].

A.1 Data Structure

/* add description for these data structures */

/* Message Buffer */
typedef struct dbuf_type {
    dbuf_ptr next;
    dbuf_ptr prev;
    dbuf_ptr next_buf;       /* pointer to next buffer */
    int nbuf;                /* number of FILLED buffers for this msg */
    unsigned flag;
    int count;               /* ... */
    int mlen;                /* ... */
    int blen;                /* ... */
    int bsize;               /* ... */
    char *data;              /* ... */
} dbuf_t;

/* Send Descriptor Table */
typedef struct send_table_entry *stable_entry_ptr;
typedef struct send_table_entry {
    stable_entry_ptr next;
    stable_entry_ptr prev;
    unsigned cmdtype;        /* command type */
    unsigned type;           /* message type */
    unsigned dstid;
    dbuf_ptr vdbuf;          /* virtual address */
    dbuf_ptr pdbuf;          /* physical address */
    int disp;                /* displacement, used for scatter */
    unsigned size;           /* size in byte */
    unsigned tag;
    int valid;
    int mcast;
    int time;                /* timer */
    mint_time_t issue_time;  /* cmd is ready in cmd buf */
    mint_time_t ack_time;    /* ack is received */
} stable_entry_t;

typedef struct receive_table_entry *rtable_entry_ptr;
typedef struct network_message *net_msg_ptr;

/* Receive Descriptor Table */
typedef struct receive_table_entry {
    rtable_entry_ptr next;   /* next entry in the desc. table */
    rtable_entry_ptr prev;   /* previous entry in the desc. table */
    rtable_entry_ptr hnext;  /* next entry in the hash queue */
    rtable_entry_ptr hprev;  /* prev entry in the hash queue */
    unsigned cmdtype;        /* command type */
    unsigned type;           /* message type */
    unsigned srcid;          /* source node id */
    dbuf_ptr vdbuf;          /* virtual address */
    dbuf_ptr pdbuf;          /* physical address */
    unsigned size;           /* size in byte */
    unsigned tag;            /* tag */
    int valid;               /* validity of the entry */
    int received;            /* RECEIVED flag */
    unsigned vpid;           /* the receiver's process id */
    mint_time_t issue_time;  /* cmd ready in cmd buf */
    mint_time_t recv_time;   /* data in the receive buffer */
    net_msg_ptr mp;          /* pointer for early arriving message */
    int gsize;               /* group size */
    unsigned long bitmap[2]; /* bit map for collective communication */
} rtable_entry_t;

/* Network Message */
typedef struct network_message {
    net_msg_ptr next;        /* next message */
    net_msg_ptr prev;        /* prev message */
    unsigned srcid;          /* source node id */
    unsigned srcvpid;        /* source process id */
    unsigned dstid;          /* dst node id */
    unsigned dstvpid;        /* destination process id */
    unsigned cmdtype;        /* command type */
    unsigned type;           /* message type */
    mint_time_t itime;       /* cmd issue time */
    mint_time_t dtime;       /* departure timestamp - in tx buffer */
    mint_time_t stime;       /* ready to send to network */
    mint_time_t atime;       /* arrival time in dst NI recv buffer */
    unsigned tag;
    unsigned len;
    dbuf_ptr vdbuf;          /* virtual address of the message data */
    dbuf_ptr pdbuf;          /* physical addr */
    int disp;                /* displacement */
} net_msg_t;

/* Address Mapping Table */
typedef struct addr_map *addr_map_ptr;
typedef struct addr_map {
    addr_map_ptr next;       /* next entry */
    addr_map_ptr prev;       /* prev entry */
    unsigned vpage;          /* virtual page number */
    unsigned ppage;          /* physical page number */
    unsigned size;           /* size */
} addr_map_t;

/* Process Control Table used by NI */
typedef struct pcb_entry *pcb_ptr;
typedef struct pcb_entry {
    pcb_ptr next;            /* next entry */
    pcb_ptr prev;            /* prev entry */
    unsigned vpid;           /* process id */
    addr_map_ptr maptable;   /* mapping table */
    stable_entry_ptr stable; /* send descriptor table */
    rtable_entry_ptr rtable; /* receive descriptor table */
    int slen;                /* size of stable */
    int rlen;                /* size of rtable */
    dbuf_ptr vslist;         /* sent dbuf list (virtual addr) */
    dbuf_ptr pslist;         /* sent dbuf list (physical addr) */
    dbuf_ptr sim_slist;      /* sent dbuf list (paint address) */
    dbuf_ptr vrlist;         /* received dbuf list (virtual addr) */
    dbuf_ptr prlist;         /* received dbuf list (physical addr) */
    dbuf_ptr sim_rlist;      /* received dbuf list (paint address) */
    dbuf_ptr vflist;         /* free dbuf list (virtual addr) */
    dbuf_ptr pflist;         /* free dbuf list (paint address) */
    comm_group_t comm_group[MAX_GROUP]; /* group table */
} pcb_t;

/* Network Interface */
typedef struct net_interface {
    cmd_t cmdbuf[CMDBUFSIZE];   /* command buffer */
    unsigned lock;              /* access control */
    net_msg_ptr outqueue;       /* output queue */
    net_msg_ptr inqueue;        /* input queue */
    unsigned outqstate;         /* state of output queue */
    unsigned inqstate;          /* state of input queue */
    net_msg_ptr txbuf;          /* transmit buffer */
    net_msg_ptr rxbuf;          /* receive buffer */
    int outqlen;                /* len of output queue */
    mint_time_t txbufbusyuntil; /* time for tx buffer to be idle */
    int inqlen;                 /* len of input queue */
    unsigned cmdstart;          /* start of command buffer */
    unsigned cmdend;            /* end of command buffer */
    stable_entry_ptr free_send; /* free list of send desc entries */
    rtable_entry_ptr free_recv; /* free list of recv desc entries */
    ni_hash_t recvhash[NI_HASHTABLE_SIZE]; /* hash table for rx desc */
    pcb_ptr pcbtable;           /* pointer to process control table */
    unsigned state;             /* status of the NI. 1: active 0: idle */
    int dpramfree;              /* available shared memory size */
    unsigned dpramtop;          /* start addr of available NI memory */
    /* below are for benchmarking only */
    int send_ack_count;         /* number of ACKs sent */
    int send_small_count[16];   /* number of short messages (<= 1K bytes) */
    int send_big_count[65];     /* number of long messages (> 1K bytes) */
    int send_total_msg;         /* total number of messages sent */
    double send_total_msgdata_size; /* total size of data sent */
    int recv_ack_count;         /* number of ACKs rcvd */
    int recv_small_count[16];   /* number of short messages (<= 1K bytes) */
    int recv_big_count[65];     /* number of long messages (> 1K bytes) */
    int recv_total_msg;         /* total number of messages rcvd */
    double recv_total_msgdata_size; /* total size of data received */
} net_if_t, *net_if_ptr;

A.2 Operation

This section describes the operation of the simulated network interface module. The NI is implemented as a task in Paint. Although the NI task could be active all the time, that approach would consume too many CPU cycles and thus increase the simulation time. To improve simulation performance, the NI task is scheduled to run only when necessary. There are several situations in which the NI task needs to be scheduled:

    Case | Condition                                              | Scheduled by
    1    | when a command is submitted to the NI command buffer   | user process
    2    | when a message is received from the network            | network task
    3    | when there are unfinished jobs for the NI              | NI itself

    Table A.1: Scheduling of the NI task

A flag state is used to indicate whether the NI is currently active or not.
If the NI is inactive when a new job is submitted to it, the NI will be scheduled to run at the specified time and then become active. If the NI is currently active when a new job arrives, it will not be scheduled again, because it will finish all the queued jobs before it becomes inactive. However, since there are many tasks running in parallel in our simulation environment, we need to allocate a fair amount of system resources to each task routinely in order to capture the real system parallelism. Therefore, although the NI will finish all the queued jobs before going inactive, it only performs part of the queued jobs at a time and reschedules itself at a later time to continue processing. In table A.1, cases 1 and 2 may or may not trigger the execution of the NI task, depending on the state of the NI. For case 3, the NI remains active between reschedulings of itself. The source code of the NI main task is shown below. Note that the function is only part of the extension to Paint, and a basic understanding of the task scheduling scheme in Paint is required to understand the code.

/* NI task module

   INPUTS:
     ptask->ival1: node id
     ptask->ival2: scheduling type
         BY_NI      - called by itself
         BY_SENDCMD - called by processor for sendcmd
         BY_BUS     - called by bus for an arriving msg
     ptask->ival3: if ival2 == BY_NI, this value indicates what was done
                   before the task was rescheduled.
         PROC_CMD   - process command
         PROC_OUTQ1 - process output queue (for data msg)
         PROC_INQ   - process input queue
         PREP_ACK   - prepare an ACK
         PROC_OUTQ2 - process output queue (for outgoing ACK)
         PREP_BACK  - prepare a delayed ACK (synchronous mode)
     ptask->tval1: elapsed time since invocation

   OUTPUT:
     flag to signal the front end what to do next.
*/
int ni_operation(task_ptr ptask)
{
    int nid, flag;
    net_if_ptr nip;
    int ret1, ret2, ret3;

    ptask->tval1 = 0;
    nid = ptask->ival1;
    nip = net_if[nid];

    if (ptask->ival2 != BY_NI) {
        if (nip->state == ACTIVE) {        /* no duplicate invocation */
            return T_FREE;
        }
        else {
            if (ptask->ival2 == BY_SENDCMD)
                ptask->ival3 = PROC_CMD;   /* process command first */
            else if (ptask->ival2 == BY_BUS)
                ptask->ival3 = PROC_INQ;   /* process input queue first */
            else
                fatal("invalid ival2 value\n");
            nip->state = ACTIVE;           /* change state to ACTIVE */
        }
    }

    ret1 = ret2 = ret3 = 1;                /* subtask status flags */

start:
    switch (ptask->ival3) {
    case PROC_CMD:
        if ((ret1 = proc_cmd(ptask)) != 0) {      /* process command */
            ptask->time += CONVERT_CYCLE(ptask->tval1);
            ptask->ival2 = BY_NI;
            ptask->ival3 = ret1;   /* next step: PROC_OUTQ1 or PREP_BACK */
            schedule_task(ptask);
            return T_YIELD;
        }
    case PROC_OUTQ1:
        if ((ret2 = proc_outq(ptask)) != 0) {     /* process output queue */
            ptask->time += CONVERT_CYCLE(ptask->tval1);
            ptask->ival2 = BY_NI;
            ptask->ival3 = PROC_INQ;
            schedule_task(ptask);
            return T_YIELD;
        }
    case PROC_INQ:
        ret3 = proc_inq(ptask);                   /* process input queue */
        switch (ret3) {
        case 0:
            ptask->ival2 = BY_NI;
            ptask->ival3 = PROC_CMD;
            break;
        case 1:
            ptask->time += CONVERT_CYCLE(ptask->tval1);
            ptask->ival2 = BY_NI;
            ptask->ival3 = PROC_CMD;
            schedule_task(ptask);
            return T_YIELD;
        case 2:
            ptask->time += CONVERT_CYCLE(ptask->tval1);
            ptask->ival2 = BY_NI;
            ptask->ival3 = PREP_ACK;
            schedule_task(ptask);
            return T_YIELD;
        default:
            fatal("NI-%d proc_inq: error return value!\n", nid);
        }
        break;
    case PREP_ACK:
        prep_ack(ptask);
        ptask->time += CONVERT_CYCLE(ptask->tval1);
        ptask->ival2 = BY_NI;
        ptask->ival3 = PROC_OUTQ2;
        schedule_task(ptask);
        return T_YIELD;
    case PROC_OUTQ2:
        proc_outq(ptask);
        ptask->time += CONVERT_CYCLE(ptask->tval1);
        ptask->ival2 = BY_NI;
        ptask->ival3 = PROC_CMD;
        schedule_task(ptask);
        return T_YIELD;
    case PREP_BACK:
        prep_ack(ptask);
        ptask->time += CONVERT_CYCLE(ptask->tval1);
        ptask->ival2 = BY_NI;
        ptask->ival3 = PROC_OUTQ1;
        schedule_task(ptask);
        return T_YIELD;
    default:
        fatal("invalid ptask->ival3 value\n");
    }

    if (!ret1 && !ret2 && !ret3) {   /* no more jobs to do */
        nip->state = INACTIVE;
        return T_FREE;
    }
    else {
        goto start;
    }
}

Reference List

[1] A. Agarwal, R. Bianchini, D. Chaiken, K. L. Johnson, D. Kranz, J. Kubiatowicz, B. Lim, K. Mackenzie, and D. Yeung. The MIT Alewife Machine: Architecture and Performance. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, June 1995.

[2] A. Agarwal, D. Chaiken, G. D'Souza, K. Johnson, D. Kranz, J. Kubiatowicz, K. Kurihara, B. H. Lim, G. Maa, D. Nussbaum, M. Parkin, and D. A. Yeung. The MIT Alewife Machine: A Large-Scale Distributed-Memory Multiprocessor. In Proceedings of Workshop on Multithreaded Computers, Supercomputing 91, 1991.

[3] T. Agerwala, J. L. Martin, J. H. Mirza, D. C. Sadler, D. M. Dias, and M. Snir. SP2 System Architecture. IBM Systems Journal, 34(2):152-184, 1995.

[4] David Bailey, Tim Harris, William Saphir, Rob van der Wijngaart, Alex Woo, and Maurice Yarrow. The NAS Parallel Benchmarks 2.0. Technical Report NAS-95-020, NASA Ames Research Center, December 1995.

[5] D. J. Becker, T. Sterling, D. Savarese, E. Dorband, U. A. Ranawake, and C. V. Packer. BEOWULF: A Parallel Workstation for Scientific Computation. In Proceedings of the 1995 International Conference on Parallel Processing, 1995.

[6] Berkeley NOW Project. Performance of MPI on Generic Active Messages - Point to Point. http://now.CS.Berkeley.EDU/Fastcomm/MPI/performance.

[7] M. A. Blumrich, K. Li, R. Alpert, C. Dubnicki, and E. Felten. Virtual Memory Mapped Network Interface for the SHRIMP Multicomputer.
In Proceedings of the 21st Annual International Symposium on Computer Architecture, 1994.

[8] Matthias A. Blumrich, Cezary Dubnicki, Edward W. Felten, and Kai Li. Protected, User-level DMA for the SHRIMP Network Interface. In Proceedings of the Second International Symposium on High-Performance Computer Architecture, February 1996.

[9] N. J. Boden, D. Cohen, R. E. Felderman, A. E. Kulawik, C. L. Seitz, J. N. Seizovic, and W. Su. Myrinet: A Gigabit-per-Second Local Area Network. IEEE Micro, February 1995.

[10] R. A. Brunner, D. P. Bhandarkar, F. X. McKeen, B. Patel, W. J. Rogers, and G. L. Yoder. Vector Processing on the VAX 9000 System. Digital Technical Journal, 2(4):61-79, 1990.

[11] Greg Buzzard, David Jacobson, Milon Mackey, Scott Marovich, and John Wilkes. An Implementation of the Hamlyn Sender-managed Interface Architecture. In Proceedings of the Second Symposium on Operating Systems Design and Implementation (OSDI'96), October 1996.

[12] D. Chiou, B. S. Ang, Arvind, M. J. Beckerle, A. Boughton, R. Greiner, J. E. Hicks, and J. C. Hoe. StarT-NG: Delivering Seamless Parallel Computing. Technical Report CSG Memo 371, Laboratory for Computer Science, MIT, February 6, 1995.

[13] D. Clark, V. Jacobson, J. Romkey, and H. Salwen. An Analysis of TCP Processing Overhead. IEEE Communications Magazine, 27(6), June 1989.

[14] Compaq, Intel and Microsoft. Virtual Interface Architecture Specification Version 1.0, December 1997.

[15] Cray. The Cray Y/MP C-90 Supercomputer System. Eagan, MN, 1991.

[16] Cray. The Cray/MPP Announcement. Eagan, MN, 1992.

[17] D. E. Culler, A. Arpaci-Dusseau, R. Arpaci-Dusseau, B. Chen, S. Lumetta, A. Mainwaring, R. Martin, C. Yoshikawa, and F. Wong. Parallel Computing on the Berkeley NOW. In 9th Joint Symposium on Parallel Processing, 1997.

[18] W. D. Dally, J. Fiske, J. Keen, R. Lethin, M. Noakes, P. Nuth, R. Davison, and G. Fyler. The Message-Driven Processor: A Multicomputer Processing Node with Efficient Mechanisms. IEEE Micro, 12(2):23-39, April 1992.

[19] W. J. Dally, J. Fiske, J. Keen, R. Lethin, M. Noakes, P. Nuth, R. Davison, and G. Fyler. The Message-Driven Processor: A Multicomputer Processing Node with Efficient Mechanisms. IEEE Micro, April 1992.

[20] J. Dongarra and A. Hinds. Comparison of the Cray X/MP-4, Fujitsu VP-200, and Hitachi S-810/20. In Hwang and DeGroot, editors, Parallel Processing for Supercomputing and Artificial Intelligence, pages 289-323. McGraw-Hill, New York, 1989.

[21] Jack J. Dongarra and Tom Dunigan. Message-Passing Performance of Various Computers. In MPI Developers Conference, University of Notre Dame, June 1995.

[22] C. Dubnicki, A. Bilas, Y. Chen, S. Damianakis, and K. Li. VMMC-2: Efficient Support for Reliable, Connection-Oriented Communication. In Proceedings of the Hot Interconnects Symposium V, August 1997.

[23] Cezary Dubnicki, Liviu Iftode, Edward W. Felten, and Kai Li. Software Support for Virtual Memory-Mapped Communication. In Proceedings of the 10th International Parallel Processing Symposium, 1996.

[24] T. v. Eicken, A. Basu, V. Buch, and W. Vogels. U-Net: A User-Level Network Interface for Parallel and Distributed Computing. In Proceedings of the 15th ACM Symposium on Operating Systems Principles, December 1995.

[25] T. v. Eicken, D. E. Culler, S. C. Goldstein, and K. E. Schauser.
Active Messages: a Mechanism for Integrated Communication and Computation. In Proceedings of the 19th Annual International Symposium on Computer Architecture, May 1992.

[26] T. v. Eicken, D. E. Culler, S. C. Goldstein, and K. E. Schauser. LogP Quantified: The Case for Low-Overhead Local Area Networks. In Hot Interconnects III: A Symposium on High Performance Interconnects, 1995.

[27] T. v. Eicken, D. E. Culler, S. C. Goldstein, and K. E. Schauser. LogP Performance Assessment of Fast Network Interfaces. IEEE Micro, February 1996.

[28] Edward W. Felten, Richard D. Alpert, Angelos Bilas, Matthias A. Blumrich, Douglas W. Clark, Stefanos N. Damianakis, Cezary Dubnicki, Liviu Iftode, and Kai Li. Early Experience with Message-Passing on the SHRIMP Multicomputer. In Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996.

[29] M. Fillo, S. W. Keckler, W. J. Dally, N. P. Carter, A. Chang, Y. Gurevich, and W. S. Lee. The M-Machine Multicomputer. In Proceedings of MICRO-28, 1995.

[30] U. Finger, C. O'Donnell, and C. Siegelin. Boosting the Performance of Workstations through WARPmemory. In EURO-PAR'95, 1995.

[31] Fujitsu. VP2000 Series Supercomputers. Japan, 1990.

[32] Fujitsu. VPP500 Vector Parallel Processor. San Jose, CA, 1992.

[33] Future I/O Developers Forum. Future I/O. http://www.futureio.org.

[34] Guang R. Gao, Lubomir Bic, and Jean-Luc Gaudiot. Advanced Topics in Dataflow Computing and Multithreading. IEEE Computer Society, 1995.

[35] Jean-Luc Gaudiot and Lubomir Bic. Advanced Topics in Dataflow Computing. Prentice-Hall, Englewood Cliffs, NJ, 1991.

[36] Al Geist, Adam Beguelin, Jack Dongarra, Weicheng Jiang, Robert Manchek, and Vaidy Sunderam. PVM: Parallel Virtual Machine. MIT Press, 1994.

[37] Hewlett-Packard Co. PA-RISC 1.1 Architecture and Instruction Set Reference Manual, February 1994.

[38] Kai Hwang, Choming Wang, and Cho-Li Wang. Evaluating MPI Collective Communication on the SP2, T3D and Paragon Multicomputers. In Proceedings of the Third Symposium on High-Performance Computer Architecture, 1997.

[39] M. Lauria, S. Pakin, and A. Chien. Efficient Layering for High Speed Communication: Fast Messages 2.x. In Proceedings of the 7th High Performance Distributed Computing Conference, July 1998.

[40] James V. Lawton, John J. Brosnan, Morgan P. Doyle, Seosamh D. O Riordain, and Timothy G. Reddin. Building a High-Performance Message-passing System for MEMORY CHANNEL Clusters. Digital Technical Journal, 8(2), 1996.

[41] C. E. Leiserson, Z. S. Abuhamdeh, D. C. Douglas, C. R. Feynman, M. N. Ganmukhi, J. V. Hill, W. D. Hillis, B. C. Kuszmaul, M. A. St. Pierre, D. S. Wells, M. C. Wong-Chan, S. Yang, and R. Zak. The Network Architecture of the Connection Machine CM-5. In ACM Symposium on Parallel Algorithms and Architectures, June 1992.

[42] D. Lenoski, J. Laudon, K. Gharachorloo, W. D. Weber, A. Gupta, J. Hennessy, M. Horowitz, and M. Lam. The Stanford DASH Multiprocessor. IEEE Computer, pages 63-79, March 1992.

[43] LHPC and INRIA ReMaP. MPI-BIP: An Implementation of MPI over Myrinet. http://lhpca.univ-lyon1.fr/mpibip.html.

[44] M. Lin, J. Hsieh, D. Du, J. P. Thomas, and J. A. MacDonald. Distributed Network Computing over Local ATM Networks. IEEE Journal on Selected Areas in Communications, 13(4), May 1995.

[45] O. Maquelin, G. R.
Gao, H. H. J. Hum, K. B. Theobald, and X. M. Tian. Polling Watchdog: Combining Polling and Interrupts for Efficient Message Handling. In Proceedings of the 23rd Annual International Symposium on Computer Architecture, 1996.

[46] Richard P. Martin, Amin M. Vahdat, David Culler, and Thomas E. Anderson. Effects of Communication Latency, Overhead, and Bandwidth in a Cluster Architecture. In Proceedings of the 24th Annual International Symposium on Computer Architecture, June 1997.

[47] T. G. Mattson, D. Scott, and S. Wheat. A TeraFLOPS Supercomputer in 1996: The ASCI TFLOPS System. In Proceedings of the 6th International Parallel Processing Symposium, pages 84-93, 1996.

[48] Message Passing Interface Forum. MPI: A Message-Passing Interface Standard, 1995.

[49] MPI-2 Forum. MPI-2: Extensions to the Message-Passing Interface, 1997.

[50] Shubhendu S. Mukherjee, B. Falsafi, M. D. Hill, and D. A. Wood. Coherent Network Interfaces for Fine-Grain Communication. In Proceedings of the 23rd Annual International Symposium on Computer Architecture, 1996.

[51] Shubhendu S. Mukherjee and Mark D. Hill. The Impact of Data Transfer and Buffering Alternatives on Network Interface Design. In Proceedings of the Fourth International Symposium on High-Performance Computer Architecture, February 1998.

[52] NEC. SX-X Series HNEX. Japan, 1990.

[53] Nick Nevin. The Performance of LAM 6.0 and MPICH 1.0.12 on a Workstation Cluster. Technical Report OSC-TR-1996-4, Ohio Supercomputer Center, March 1996.

[54] NGIO Forum. Next Generation I/O. http://www.ngioforum.org.

[55] R. S. Nikhil and G. M. Papadopoulos. *T: A Multithreaded Massively Parallel Architecture. In Proceedings of the 19th Annual International Symposium on Computer Architecture, 1991.

[56] S. Pakin, M. Lauria, and A. Chien. High Performance Messaging on Workstations: Illinois Fast Messages (FM) for Myrinet. In Supercomputing'95, 1995.

[57] Loic Prylli and Bernard Tourancheau. BIP: a new protocol designed for high performance networking on Myrinet. In Workshop of PC-NOW, IPPS/SPDP'98, 1998.

[58] Steve Rodrigues, Tom Anderson, and David Culler. High-Performance Local-Area Communication Using Fast Sockets. In USENIX'97, 1997.

[59] J. Salmon, C. Stein, and T. Sterling. Scaling of Beowulf-Class Distributed Systems. In Proceedings of Supercomputing 98, November 1998.

[60] S. L. Scott. Synchronization and Communication in the T3E Multiprocessor. In Proceedings of ASPLOS 7, pages 26-36, October 1996.

[61] S. L. Scott and G. Thorson. The Cray T3E Network: Adaptive Routing in a High Performance 3D Torus. In Hot Interconnects IV, August 1996.

[62] C. Siegelin, U. Finger, and V. Fietze. WARPmemory: Multiprocessor Workgroups over ATM. In 2nd International Workshop on High-Speed Network Computing (HiNet'96), 1996.

[63] M. Snir, P. Hochschild, D. D. Frye, and K. J. Gildea. The Communication Software and Parallel Environment of the IBM SP2. IBM Systems Journal, 34(2):205-221, 1995.

[64] T. Sterling, J. Salmon, D. Becker, and D. F. Savarese. How to Build a Beowulf. MIT Press, March 1999.

[65] L. B. Stoller, M. R. Swanson, and R. Kuramkote. Paint: PA Instruction Set Interpreter. Technical Report UUCS-96-009, Department of Computer Science, University of Utah, September 1996.

[66] V. S. Sunderam. PVM: A Framework for Parallel Distributed Computing.
Concurrency: Practice and Experience, pages 315-339, 1990.

[67] M. R. Swanson, R. Kuramkote, and L. B. Stoller. Message Passing Support in the Avalanche Widget. Technical Report UUCS-96-002, Department of Computer Science, University of Utah, March 1996.

[68] Hiroshi Tezuka, Atsushi Hori, Yutaka Ishikawa, and Mitsuhisa Sato. PM: An Operating System Coordinated High Performance Communication Library. In High-Performance Computing and Networking, Volume 1225 of LNCS, April 1997.

[69] Thinking Machines Corporation. The Connection Machine CM-5 Technical Summary, October 1991.

[70] A. Trew and G. Wilson, editors. Past, Present, Parallel: A Survey of Available Parallel Computer Systems. Springer-Verlag, London, 1991.

[71] UCSD Concurrent Systems Architecture Group. MPI-FM on Myrinet Performance. http://www-csag.ucsd.edu/projects/comm/mpi-fm-perf.html.

[72] J. E. Veenstra and R. J. Fowler. MINT Tutorial and User Manual. Technical Report 452, Computer Science Department, The University of Rochester, June 1993.