Parallel STAP Benchmarks and Their Performance on the IBM SP2

by

Masahiro Arakawa

A Thesis Presented to the
FACULTY OF THE SCHOOL OF ENGINEERING
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree of
MASTER OF SCIENCE (COMPUTER ENGINEERING)

August 1995

Copyright 1995 Masahiro Arakawa

This thesis, written by Masahiro Arakawa under the guidance of his Faculty Committee and approved by all its members, has been presented to and accepted by the School of Engineering in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE (COMPUTER ENGINEERING).

Acknowledgments

This project was not the work of one man; endeavors of this magnitude rarely are. I would like to thank the many people who made my success in this effort possible. I apologize right now to those people I forgot to mention here by name.

Prof. Kai Hwang, for the opportunity to work on this project, and for his guidance throughout my academic career at the University of Southern California: working with Prof. Hwang on this project was a very valuable experience.

Dr. Zhiwei Xu, for his collaboration on this project: analysis of the IBM SP2 and the STAP benchmarks, benchmark parallelization, and the dynamically updatable table in the demonstrations performed at our Adaptive Sensor Array Processing (ASAP) Workshop 1995 poster session.

David Martinez, for making sure we had the resources necessary to make this project a success: it couldn't possibly be easy to get dedicated use of nearly all of a $15 million supercomputer for days at a time, when it is usually shared by hundreds of users.

Tim Fahey, Lon Waters, and the rest of the User Support Group at the Maui High Performance Computing Center, for always answering our incessant stream of questions with a smile, and for processing our many requests for dedicated use of the SP2 so quickly: one couldn't ask for a better technical support team.

Regina Morton, for helping to make my tenure as a research assistant for Prof. Hwang go smoothly: who knew it would be so hard for me to pick up a paycheck?

Mark and Sally Wang, for helping me maintain my sanity during the very busy and stressful nine months this project lasted: do not doubt the therapeutic value of a pizza from Round Table.

Contents

Acknowledgments
List of Tables
List of Figures
Abstract
1 Introduction
2 Evaluation of the IBM SP2
  2.1 Hardware Description
    2.1.1 The POWER2 Microprocessor
    2.1.2 The High-Performance Switch
  2.2 Software Support
  2.3 Machine Organization
3 Parallel APT Benchmark Program
  3.1 Sequential Program
  3.2 Parallel Program Development
  3.3 Parallel Program User's Guide
    3.3.1 The Parallel APT Code
    3.3.2 The APT Data Files
    3.3.3 Compiling the Parallel APT Program
    3.3.4 Running the Parallel APT Program
4 Parallel HO-PD Benchmark Program
  4.1 Sequential Program
  4.2 Parallel Program Development
  4.3 Parallel Program User's Guide
    4.3.1 The Parallel HO-PD Code
    4.3.2 The HO-PD Data Files
    4.3.3 Compiling the Parallel HO-PD Program
    4.3.4 Running the Parallel HO-PD Program
5 Parallel General Benchmark Program
  5.1 Sequential Program
    5.1.1 Sorting Subprogram
    5.1.2 FFT Subprogram
    5.1.3 Vector Multiplication Subprogram
    5.1.4 Linear Algebra Subprogram
  5.2 Parallel Program Development
    5.2.1 Sorting Subprogram
    5.2.2 FFT Subprogram
    5.2.3 Vector Multiplication Subprogram
    5.2.4 Linear Algebra Subprogram
  5.3 Parallel Program User's Guide
    5.3.1 The Parallel General Code
    5.3.2 The General Data Files
    5.3.3 Compiling the Parallel General Program
    5.3.4 Running the Parallel General Program
6 Parallel STAP Performance Results
  6.1 Experimental Setup
  6.2 Performance of the APT Benchmark
  6.3 Performance of the HO-PD Benchmark
  6.4 Performance of the General Benchmark
    6.4.1 Sorting Subprogram
    6.4.2 FFT Subprogram
    6.4.3 Vector Multiply Subprogram
    6.4.4 Linear Algebra Subprogram
  6.5 Scalability Analysis
    6.5.1 Scalability with Respect to Machine Size
    6.5.2 Scalability with Respect to Problem Size
    6.5.3 Isoefficiency
7 Conclusions
Bibliography
Appendix A Parallel APT Code
  A.1 bench_mark_APT.c
  A.2 cell_avg_cfar.c
  A.3 cmd_line.c
  A.4 fft.c
  A.5 fft_APT.c
  A.6 forback.c
  A.7 house.c
  A.8 read_input_APT.c
  A.9 step1_beams.c
  A.10 step2_beams.c
  A.11 defs.h
  A.12 compile_apt
  A.13 run.256
Appendix B Parallel HO-PD Code
  B.1 bench_mark_STAP.c
  B.2 cell_avg_cfar.c
  B.3 cmd_line.c
  B.4 compute_beams.c
  B.5 compute_weights.c
  B.6 fft.c
  B.7 fft_STAP.c
  B.8 forback.c
  B.9 form_beams.c
  B.10 form_str_vecs.c
  B.11 house.c
  B.12 read_input_STAP.c
  B.13 defs.h
  B.14 compile_hopd
  B.15 run.256
Appendix C Parallel General Code
  C.1 Sorting and FFT Subprogram
    C.1.1 bench_mark_SORT.c
    C.1.2 bench_mark_SORT_FFT.c
    C.1.3 bubble_sort.c
    C.1.4 cmd_line.c
    C.1.5 fft.c
    C.1.6 read_input_SORT_FFT.c
    C.1.7 defs.h
    C.1.8 compile_sort
    C.1.9 run.128
  C.2 Vector Multiply Subprogram
    C.2.1 bench_mark_VEC.c
    C.2.2 cmd_line.c
    C.2.3 read_input_VEC.c
    C.2.4 defs.h
    C.2.5 compile_vec
    C.2.6 run.128
  C.3 Linear Algebra Subprogram
    C.3.1 bench_mark_LIN.c
    C.3.2 cmd_line.c
    C.3.3 forback.c
    C.3.4 house.c
    C.3.5 read_input_LIN.c
    C.3.6 defs.h
    C.3.7 compile_lin
    C.3.8 run.032

List of Tables

3.1 Breakdown of Computational Workload in the APT Benchmark
4.1 Breakdown of Computational Workload in the HO-PD Benchmark
6.1 Performance of the Parallel APT Program
6.2 Breakdown of Parallel APT Execution Time in Seconds
6.3 Performance of the Parallel HO-PD Program
6.4 Breakdown of Parallel HO-PD Execution Time in Seconds
6.5 Performance of the Parallel Sorting Subprogram
6.6 Breakdown of Parallel Sorting Subprogram Execution Time in Seconds
6.7 Performance of the Parallel FFT Subprogram
6.8 Breakdown of Parallel FFT Subprogram Execution Time in Seconds
6.9 Performance of the Parallel Vector Multiply Subprogram
6.10 Breakdown of the Parallel Vector Multiply Execution Time in Seconds
6.11 Performance of the Parallel Linear Algebra Subprogram
7.1 Computation-to-Communication Ratio vs. System Efficiency

List of Figures

2.1 Layout of the IBM SP2
3.1 APT benchmark data flow diagram
3.2 Sequential APT program skeleton
3.3 Packed binary format to floating-point format conversion
3.4 Dimension along which the FFTs are performed in the APT benchmark
3.5 Early performance prediction based on workload and overhead characterization
3.6 Mapping of the parallel algorithm and data set onto the SP2
3.7 Parallel APT program skeleton
3.8 One task's slice of the data cube
3.9 Forming the Householder matrix from the data cube
3.10 Target report merging process
3.11 Calling relationship between the procedures in the parallel APT program
3.12 Target list generated by the parallel APT program
3.13 Timing report generated by the parallel APT program
4.1 HO-PD benchmark data flow diagram
4.2 Sequential HO-PD program skeleton
4.3 Mapping of the parallel algorithm and data set onto the SP2
4.4 Parallel HO-PD program skeleton
4.5 Circular shift operation in the parallel HO-PD program
4.6 Calling relationship between the procedures in the parallel HO-PD program
4.7 Target list generated by the parallel HO-PD program
4.8 Timing report generated by the parallel HO-PD program
5.1 Sequential sorting subprogram skeleton
5.2 Sequential FFT subprogram skeleton
5.3 Sequential vector multiply subprogram skeleton
5.4 Sequential linear algebra subprogram skeleton
5.5 Mapping of the parallel sorting subprogram and data set onto the SP2
5.6 Parallel sorting subprogram skeleton
5.7 Mapping of the parallel FFT subprogram and data set onto the SP2
5.8 Parallel FFT subprogram skeleton
5.9 Mapping of the parallel vector multiply subprogram and data set onto the SP2
5.10 Parallel vector multiply subprogram skeleton
5.11 Calling relationship between the procedures in the parallel sorting subprogram
5.12 Calling relationship between the procedures in the parallel sorting and FFT subprogram
5.13 Calling relationship between the procedures in the parallel vector multiply subprogram
5.14 Calling relationship between the procedures in the parallel linear algebra subprogram
5.15 Timing report from the parallel sort and FFT subprogram
5.16 Timing report from the parallel vector multiply subprogram
5.17 Timing report from the parallel linear algebra subprogram
6.1 Overall execution time of the parallel APT program
6.2 Breakdown of the computation time of the parallel APT program
6.3 Breakdown of the overhead of the parallel APT program
6.4 Sustained processing rate of the SP2 on the parallel APT program
6.5 System efficiency of the SP2 on the parallel APT program
6.6 Overall execution time of the parallel HO-PD program
6.7 Breakdown of the computation time of the parallel HO-PD program
6.8 Breakdown of the overhead of the parallel HO-PD program
6.9 Sustained processing rate of the SP2 on the parallel HO-PD program
6.10 System efficiency of the SP2 on the parallel HO-PD program
6.11 Breakdown of the execution time of the sorting subprogram
6.12 Sustained processing rate of the SP2 on the parallel sorting subprogram
6.13 System efficiency of the SP2 on the parallel sorting subprogram
6.14 Breakdown of the execution time of the FFT subprogram
6.15 Sustained processing rate of the SP2 on the parallel FFT subprogram
6.16 System efficiency of the SP2 on the parallel FFT subprogram
6.17 Breakdown of the execution time of the vector multiply subprogram
6.18 Sustained processing rate of the SP2 on the parallel vector multiply subprogram
6.19 System efficiency of the SP2 on the parallel vector multiply subprogram
6.20 Overall execution time of the linear algebra subprogram
6.21 Sustained processing rate of the SP2 on the parallel linear algebra subprogram
6.22 System efficiency of the SP2 on the parallel linear algebra subprogram

Abstract

This thesis is the outgrowth of my involvement in the STAP benchmark experiments on the IBM SP2 massively parallel processor located at the Maui High Performance Computing Center. The experiments were conducted by the research team at the University of Southern California. The benchmark parallelization strategies were mainly developed by my colleague, Dr. Zhiwei Xu. Dr. Xu and I shared the code implementation and performance measurement responsibilities, while I performed the bulk of the debugging and program documentation, under the guidance of Prof. Kai Hwang. After introducing the SP2 hardware/software environment, I describe the structure of the parallel STAP programs and provide user's guides for porting and executing them on the SP2. I then present the STAP benchmark performance results along with a scalability analysis. The best STAP performance obtained is a sustained 23 GFLOPS on 256 nodes, for a 34% system efficiency. The documented parallel STAP source code is attached in the appendices.

1 Introduction

Massively parallel processing has finally reached the cost-to-performance ratio necessary to compete with traditional vector supercomputers.
In this thesis, I provide an overview of the parallelization of the Space-Time Adaptive Processing (STAP) benchmarks for the IBM SP2 massively parallel processor (MPP), and summarize the parallel benchmark performance results from our experiments on the SP2 located at the Maui High Performance Computing Center (MHPCC). This project was conducted between August 1994 and May 1995 by the research team at the University of Southern California, consisting of Prof. Kai Hwang, Dr. Zhiwei Xu, and myself. Prof. Hwang served as the principal investigator of the entire project. Dr. Xu developed the parallelization strategies for all the STAP benchmarks. Dr. Xu and I jointly developed the parallel code and collected performance results. I documented the parallel benchmarks reported here. This project was funded by the ARPA Mountaintop Program as a subcontract from MIT Lincoln Laboratory.

These benchmarks allow us to evaluate the performance of a particular parallel computer for real-time digital radar signal processing by simulating the computations necessary for STAP. Digital adaptive signal processing is the one technology with the potential to achieve thermal-noise-limited performance in the presence of jamming (electronic countermeasures) and clutter (signal returns from land) [Titi94]. This thesis does not elaborate on the details of radar technology; instead, it emphasizes parallel processing technology and the performance evaluation of MPPs.

The suite of benchmark programs we ran on the SP2 at the MHPCC consists of three C programs: the Adaptive Processing Testbed (APT) program, the Higher-Order Post-Doppler (HO-PD) program, and the General benchmark, which itself is made up of four independent subprograms: sorting, fast Fourier transform (FFT), vector multiplication, and linear algebra.

The IBM SP2 is a state-of-the-art message-passing MPP, with a currently installed base of about 400 machines. But why use massively parallel processing, when MPPs are potentially difficult to program? By using many off-the-shelf processors instead of a handful of powerful but expensive vector processors, massively parallel processing offers supercomputer-class processing power with a better cost-to-performance ratio. The 400-node IBM SP2 used in our experiments has a peak performance of over 100 GFLOPS (billions of floating-point operations per second) and costs about $15 million. For comparison, a 32-node Cray T90, a vector supercomputer, has a peak performance of 64 GFLOPS and costs $35 million.

We used a single-program, multiple-data-stream (SPMD) approach to parallelizing the STAP benchmarks, directing each task to perform the same algorithm on independent portions of the data set. We added message-passing operations, such as broadcast and total exchange, to redistribute data. Because of the high message-passing overhead associated with the SP2, we adopted a coarse-grain strategy, minimizing the number of internode communication operations in our parallel programs.

In Chapters 3 through 5, I describe the benchmark parallelization process and provide user's guides to the parallel programs, listing the files and procedures making up the programs, as well as directions regarding the compilation and execution of the parallel benchmarks. Chapter 6 shows the benchmark experimental results.

The SP2 is capable of very high sustained processing rates on the STAP benchmarks: on the HO-PD program, we were able to coax 23 GFLOPS out of the SP2, which represents 34% of the 68 GFLOPS peak performance on 256 nodes. The nature of the parallel execution overhead on the SP2 makes further improvements difficult. The performance of the STAP benchmarks was degraded to varying degrees by the high message-passing overhead associated with the SP2.
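As a quick check on these headline numbers, using the 266 MFLOPS per-node peak rate quoted in Chapter 2:

\[
P_{\mathrm{peak}} = 256 \times 266~\mathrm{MFLOPS} \approx 68~\mathrm{GFLOPS},
\qquad
E = \frac{P_{\mathrm{sustained}}}{P_{\mathrm{peak}}} = \frac{23~\mathrm{GFLOPS}}{68~\mathrm{GFLOPS}} \approx 0.34 .
\]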
This thesis represents the first complete program documentation of the parallel STAP benchmarks. An overview of the Mountaintop Program can be found in [Titi94]. Descriptions of SP2 message-passing operations can be found in [IBM94b]. More information about the message-passing performance of the SP2 can be found in [Xu95a]. Details of parallel application code development can be found in [Xu95b]. [Hwan95c] presents a comprehensive report of the STAP benchmark performance results on the SP2. The performance results were obtained jointly with the entire USC research team. My contributions in the parallel code development and performance measurements are reflected in this thesis.

2 Evaluation of the IBM SP2

The IBM SP2 is a message-passing MPP. In this chapter, I present a brief overview of the hardware and software characteristics of this computer.

2.1 Hardware Description

Each processing node has one POWER2 processor and between 64 and 2,048 MB of private local memory. The nodes are interconnected by the High-Performance Switch (HPS).

2.1.1 The POWER2 Microprocessor

The POWER2 microprocessor is the high floating-point performance implementation of the IBM RS/6000 architecture, succeeding the POWER microprocessor. Like its predecessor, the POWER2 is implemented as a multi-chip module. IBM does not plan to build a successor "POWER3" microprocessor based on this architecture; instead, the next high-performance floating-point processor will be based on the PowerPC architecture.

The POWER2 has two floating-point pipelines, each capable of accepting a new instruction on every cycle. Because each pipeline has a separate multiplier and adder, the POWER2 can actually issue up to four floating-point operations per cycle. With a clock rate of 66.7 MHz, the POWER2's peak floating-point processing rate is 266 MFLOPS.

The SP2 nodes are available in two versions: "thin" and "wide" nodes. Both versions are capable of a 266 MFLOPS peak at 66.7 MHz, and both have 32 KB instruction caches. The thin nodes have a 64-bit-wide memory bus between the processor and main memory, as well as a 64 KB data cache. The wide nodes have a 256-bit-wide memory bus and a 256 KB data cache.

2.1.2 The High-Performance Switch

The HPS is a multistage buffered packet-switched Omega network. To improve utilization and reduce communication time, the HPS also uses cut-through wormhole routing. The HPS provides a minimum of four paths between any pair of nodes, providing redundancy and reducing the likelihood of a connection being blocked.

The HPS has a hardware switching time of 500 ns. However, the software startup latency for a point-to-point message is about 40 µs. Once a connection has been established, the HPS can support a point-to-point bandwidth of 35 MB/s. Because the HPS provides the connectivity of a crossbar switch, the startup latency and the bandwidth of the HPS are constant for all node pairs: the concept of neighboring or remote nodes has no meaning for the SP2.

Each node has an adapter card which provides an interface between the processor and the HPS.
This adapter has a separate on-board communication co-processor (an i860 microprocessor) to offload message-passing-related functions from the main node processor. The co-processor provides functions such as cyclic redundancy checking for error detection, partition checking to ensure security and system reliability, and support for direct memory access. The co-processor can support a peak bandwidth of 80 MB/s (40 MB/s in each direction simultaneously). Figure 2.1 shows the organization of these components of the SP2.

Figure 2.1: Layout of the IBM SP2 (each of the 400 nodes couples a POWER2 processor and its main memory to the HPS through an adapter card)

The message-passing performance of the SP2 has been studied in detail and reported in [Xu95a]. Based on the results therein, the SP2 should be programmed in a coarse-grained fashion: the startup latency for a point-to-point communication is 40 µs; in this time each node can perform over 10,000 floating-point operations.
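The 10,000-operation figure follows directly from the per-node peak rate given above:

\[
40~\mu\mathrm{s} \times 266~\mathrm{MFLOPS}
= 40 \times 10^{-6}~\mathrm{s} \times 266 \times 10^{6}~\mathrm{flop/s}
\approx 1.06 \times 10^{4}~\mathrm{flop}.
\]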
2.2 Software Support

The SP2 runs the AIX operating system, which is IBM's implementation of UNIX. IBM has also provided additional software tools to facilitate parallel programming. IBM's Parallel Operating Environment (POE) supports multi-user time-sharing use, as well as single-user dedicated use, of the SP2. POE also supports both the single-program, multiple-data-stream (SPMD) and multiple-program, multiple-data-stream (MPMD) styles of parallel programming. IBM's Message-Passing Library (MPL) handles internode communication through explicit message passing. MPL calls are available in C and Fortran. IBM has also developed the Visualization Tool to facilitate performance monitoring, as well as the ESSL scientific libraries. Finally, IBM's LoadLeveler can be used to support load balancing between multiple batch-mode users. The public-domain tools PVM and MPI, along with the third-party tools Linda, Express, and FORGE, are also available for the SP2.

2.3 Machine Organization

The nodes in the SP2 are organized in two different ways. Physically, nodes are grouped into frames. Each frame contains either eight wide nodes or sixteen thin nodes. Logically, nodes can be grouped into pools. All the nodes in the machine can be arbitrarily divided into resource pools. Each pool can have its own access restrictions, allowing one user to have dedicated use of the nodes in the pool, or several users to share the pool. Also, the nodes to be used to execute a parallel program can be specified by pool number, instead of individually listing each node to be used. Each node can run several jobs simultaneously: each program shares the processor in a time-sharing fashion. Also, each physical node can support several virtual nodes, so it is possible to use one node to run a multiple-node job.

3 The Parallel APT Benchmark Program

In this chapter, I describe the parallelization of the Adaptive Processing Testbed (APT) benchmark. The APT benchmark performs a simplified version of the processing performed by the Adaptive Processing Testbed, a space-time adaptive signal processor recently developed at MIT Lincoln Laboratory [LL94].

3.1 Sequential Program

We generated the parallel APT program by modifying the original sequential version. Below, Figure 3.1 shows the data flow of the APT benchmark.

Figure 3.1: APT benchmark data flow diagram (the data cube passes through Doppler Processing, Adaptive Weight Computation using the steering vectors and T matrix, Beamforming, and threshold CFAR detection to produce the Target Report)

While analyzing the original sequential program, we developed a higher-level program skeleton to help us visualize the structure of the benchmark. Program skeletons are much shorter and simpler to understand than the complete code, while still displaying all the control and data dependence information needed for parallelization. The keywords in, out, and inout are similar to those used in the Ada programming language, and indicate that a subroutine parameter is read-only, write-only, or read/write, respectively. The array index notation [.] is used to indicate that all array elements along that dimension are to be used. The program skeletons were adapted from [Hwan95c]. Below, Figure 3.2 shows the program skeleton for the sequential APT program.

COMPLEX data_cube[PRI][RNG][EL], detect_cube[PRI][BEAM][RNG], twiddle[PRI];

/* Doppler Processing */
compute_twiddle_factor (out twiddle, in PRI);
for (j = 0; j < RNG; j++)
    for (k = 0; k < EL; k++)
        fft (inout data_cube[.][j][k], in twiddle, in PRI);

/* Beamforming */
for (i = 0; i < PRI; i++)
    beam_form (in data_cube[i][.][.], in data_cube[i-1 % PRI][.][.],
               in data_cube[i+1 % PRI][.][.], out detect_cube[i][.][.]);

/* Target Detection / Cell Averaging CFAR */
for (i = 0; i < PRI; i++)
    for (j = 0; j < BEAM; j++)
        compute_target (inout detect_cube[i][j][.]);

/* Target Reporting */
m = 0;
for (k = 0; k < RNG; k++)
    for (i = 0; i < PRI; i++)
        for (j = 0; j < BEAM; j++)
            if (IsTarget (in detect_cube[i][j][k])) {
                target_report[m].pri   = i;
                target_report[m].beam  = j;
                target_report[m].rng   = k;
                target_report[m].power = detect_cube[i][j][k].real;
                m = m + 1;
                if (m > 25) goto finished;
            }
finished:

Figure 3.2: Sequential APT program skeleton

Not included in this program skeleton, but implicit to the program, is the data file access step. In this step, the program reads the input data cube and the steering vectors from disk. The input data cube is stored in a packed binary format; prior to use by the program, it must be converted into floating-point format (see Figure 3.3). The steering vectors consist of both primary and auxiliary steering vectors, and are stored on disk in ASCII format.

Figure 3.3: Packed binary format to floating-point format conversion (each 32-bit packed word read from disk holds a 16-bit imaginary portion and a 16-bit real portion, which are converted into the floating-point fields data.imag and data.real)
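As an illustration of this conversion, the hypothetical routine below unpacks one 32-bit word into a COMPLEX value. The bit layout (imaginary half-word in the upper 16 bits, real half-word in the lower 16 bits) follows Figure 3.3, but the exact ordering and any scaling are assumptions; read_input_APT.c in Appendix A is authoritative.

    /* Hypothetical unpacking of one 32-bit packed sample into a COMPLEX value. */
    typedef struct { float real, imag; } COMPLEX;

    COMPLEX unpack_sample(unsigned int packed)
    {
        COMPLEX c;
        /* assumed layout: upper 16 bits -> imaginary part, lower 16 bits -> real part */
        c.imag = (float)(short)((packed >> 16) & 0xFFFFu);
        c.real = (float)(short)(packed & 0xFFFFu);
        return c;
    }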
The program first transforms the input data cube into the Doppler domain by performing fast Fourier transforms (FFTs) on the data. In this Doppler processing step, an FFT is performed along the PRI dimension on each column of constant EL and RNG (see Figure 3.4).

Figure 3.4: Dimension along which the FFTs are performed in the APT benchmark

The beamforming step consists of two adaptive nulling substeps. In the first substep, the program computes a set of adaptive weight vectors (the T matrix) designed to optimally null jamming interference while maintaining gain in the directions specified by the steering vectors. The principal steering vectors specify the directions of interest, while the auxiliary steering vectors provide additional degrees of freedom to assist in the adaptive nulling. The jamming training set is computed from data extracted from the highest and lowest Doppler bins, under the assumption that these bins contain jamming interference but no signals (targets) or clutter. A Householder transformation is performed on this training set to get a lower triangular Cholesky factor. Then, the program forward and back solves the training set with the steering vectors to form the T matrix. The weight vectors are then applied to the Doppler data cube to compute a set of jammer-nulled beams.

In the second beamforming substep, the program computes a set of adaptive weight vectors designed to optimally null clutter while maintaining gain in the directions specified by the principal steering vectors. The range extent is divided into a series of range segments, and weight vectors are computed for each range segment. The training sets are computed for each steering vector and for each Doppler and range segment by extracting samples from the nine most powerful auxiliary beams (i.e., beams corresponding to the auxiliary steering vectors) and the principal beams under consideration. The principal steering vectors must be transformed by the same transformation that produced the set of jammer-nulled beams from which the training set is chosen. This entails transforming the steering vector using an appropriate subset of the T matrix. Finally, the weight vectors are applied to the data cube to compute a set of clutter- (and jammer-) nulled beams.

In the target detection step, a cell-averaging, constant false alarm rate (CFAR) detection process is performed to find those signals whose power exceeds a given threshold. The target report generated at the end of the program consists of targets ordered by range. If more than 25 targets were found, only the 25 closest are reported.

Table 3.1 below lists the number of floating-point operations performed in each of the four main computational steps in the APT benchmark.

Table 3.1: Breakdown of Computational Workload in the APT Benchmark

    Computational Step             Number of floating-point operations (x 10^6)
    Doppler Processing             83.89
    First Beamforming Substep      2.88
    Second Beamforming Substep     1,313.67
    Cell Averaging CFAR            46.45
    Total                          1,446.89

3.2 Parallel Program Development

In parallelizing the APT program, we adopted the method described in [Xu95b]. This method (see Figure 3.5), which entails an early MPP performance prediction scheme, significantly reduced the parallel software development cost.

Figure 3.5: Early performance prediction based on workload and overhead characterization (the development cycle runs from application specifications through algorithm design, workload characterization, overhead estimation, and performance prediction, then through coding, debugging, and performance measurement on the MPP hardware and software platform, and ends with parallel code documentation)

During the algorithm design phase of the parallel program development, we considered several parallelization strategies [Hwan95c]. The nature of the algorithm in the APT program made the simple compute-interact paradigm sufficient. In this paradigm, the parallel program is written as a sequence of alternating computation and interaction steps. During a computation step, each task performs computations independently. During an interaction step, the tasks exchange data or synchronize. The parallel APT program can be described as the following sequence of steps (a message-passing sketch of this structure follows the list):

    Compute:  Doppler Processing
    Interact: Total Exchange
              Broadcast
    Compute:  Beamforming
              Cell Averaging CFAR
    Interact: Target List Reduction
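The thesis implements the interaction steps with IBM's MPL (Section 2.2), for example mpc_index for the total exchange. Purely to illustrate the compute-interact structure, the sketch below uses the equivalent MPI collectives, which Section 2.2 notes are also available on the SP2. The function names, buffer names, and extents are placeholders standing in for the routines in Appendix A, not the actual code.

    #include <mpi.h>

    /* Placeholders for the real computation routines in Appendix A
       (fft_APT.c, step1_beams.c, step2_beams.c, cell_avg_cfar.c). */
    void doppler_processing(float *slice, int n);
    void beamforming_and_cfar(float *slice, int n);
    void merge_target_reports(int rank, int ntask);

    void apt_spmd(float *slice, float *resliced, int elems_per_task,
                  float *house_matrix, int house_elems)
    {
        int rank, ntask;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &ntask);

        /* Compute: Doppler processing on this task's slice */
        doppler_processing(slice, elems_per_task);

        /* Interact: total exchange to reslice the cube, then a broadcast */
        MPI_Alltoall(slice,    elems_per_task / ntask, MPI_FLOAT,
                     resliced, elems_per_task / ntask, MPI_FLOAT,
                     MPI_COMM_WORLD);
        MPI_Bcast(house_matrix, house_elems, MPI_FLOAT, 0, MPI_COMM_WORLD);

        /* Compute: beamforming and cell-averaging CFAR on the resliced data */
        beamforming_and_cfar(resliced, elems_per_task);

        /* Interact: reduce the per-task target lists to one final report */
        merge_target_reports(rank, ntask);
    }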
Because the same algorithm is applied to the entire data cube, we were able to use the SPMD (single-program, multiple-data-stream) paradigm. We exploited the parallelism inherent in the input data set: in this data-parallel approach, each task performed the same algorithm on its portion of the input data set. The mapping of the parallel algorithm and the data set onto the SP2 is shown in Figure 3.6.

Figure 3.6: Mapping of the parallel algorithm and data set onto the SP2 (each of Nodes 0 through 255 runs Doppler Processing (DP), Beamforming (BF), and Cell Averaging/CFAR on its slice of the data cube, with a total exchange and broadcast between DP and BF and a target list reduction producing the Target Report)

During the coding step, we started with a program skeleton to help us visualize the structure of the benchmark. I have included the program skeleton shown below in Figure 3.7.

COMPLEX data_cube[PRI][RNG][EL], detect_cube[PRI][BEAM][RNG],
        house_matrix[EL][320], t_matrix[EL][EL];

/* Doppler Processing (DP) */
parfor (j = 0; j < RNG; j++) {
    COMPLEX twiddle[PRI];
    compute_twiddle_factor (out twiddle, in PRI);
    for (k = 0; k < EL; k++)
        fft (inout data_cube[.][j][k], in twiddle, in PRI);
}

total_exchange data_cube[.][j][.] to data_cube[i][.][.];

house_matrix[.][0..159] = data_cube[0..1][RNG-80..RNG-1][.];
Task LAST sends data_cube[PRI-2..PRI-1][RNG-80..RNG-1][.],
    and task 0 receives into house_matrix[.][160..319];
broadcast house_matrix to all tasks;

/* Beamforming Step 1 (BF1), performed sequentially by all tasks */
bf1 (in house_matrix, in str_vecs, out t_matrix);

parfor (i = 0; i < PRI; i++) {
    /* Beamforming Step 2 (BF2) */
    bf2 (in data_cube[i][.][.], in t_matrix, out detect_cube[i][.][.]);
    /* Target Detection and Local Target Report Generation */
}

reduce local_reports to target_report in task 0;

Figure 3.7: Parallel APT program skeleton

At the beginning of the parallel program, each task reads its portion of the data cube from disk in parallel. Because each FFT performed in the Doppler Processing step requires all the data along the PRI dimension, but is independent of the other FFTs along the RNG (as well as the EL) dimension (see Figure 3.4), we divide the data cube equally along the RNG dimension (see Figure 3.8a).

Figure 3.8: One task's slice of the data cube, (a) before and (b) after the total exchange

In order to improve disk access performance, we modified the read_input_APT procedure (see Appendix A) so each task reads its entire slice of the data cube from disk before converting the data from a packed binary format (in which the data is stored on disk) to floating-point format (in which computations are done). In the sequential version, each complex number is read and then converted immediately. Because the execution time of this step is not included in the total execution time of the parallel program (this step would not exist in the final implementation as part of an airborne radar processor), the performance of this step is not critical: the timer is started after the data is loaded and converted.

Each task performs FFTs on its slice of the data cube in parallel. In order to improve the performance of the FFTs, we recoded the FFT routine. Our version, a hybrid between the one in the original APT program and a routine suggested by MHPCC, is a more efficient implementation of the same in-place FFT algorithm with bit reversal. By replacing two separate arrays of floating-point numbers accessed by pointers with a single array of COMPLEX (a user-defined data structure) numbers, we were able to improve the processing rate of this routine from 15 MFLOPS per node to as high as 24 MFLOPS per node.
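The recoded routine itself is listed in Appendix A (fft.c). The following is only a minimal sketch of the technique described above, under the stated assumptions: a single array of COMPLEX values transformed in place by a decimation-in-time radix-2 FFT, with n/2 precomputed twiddle factors and a simple bit-reversal pass. It is not the thesis code.

    #include <math.h>

    #ifndef M_PI
    #define M_PI 3.14159265358979323846
    #endif

    typedef struct { float real, imag; } COMPLEX;

    /* twiddle[k] = exp(-2*pi*i*k/n), for k = 0 .. n/2 - 1 */
    void compute_twiddle_factor(COMPLEX *twiddle, int n)
    {
        for (int k = 0; k < n / 2; k++) {
            twiddle[k].real = (float) cos(2.0 * M_PI * k / n);
            twiddle[k].imag = (float)-sin(2.0 * M_PI * k / n);
        }
    }

    /* simple bit-reversal permutation of the n-element array */
    static void bit_reverse(COMPLEX *x, int n)
    {
        int j = 0;
        for (int i = 0; i < n; i++) {
            if (i < j) { COMPLEX t = x[i]; x[i] = x[j]; x[j] = t; }
            int m = n >> 1;
            while (m >= 1 && j >= m) { j -= m; m >>= 1; }
            j += m;
        }
    }

    /* in-place decimation-in-time radix-2 complex FFT; n is a power of two */
    void fft(COMPLEX *x, const COMPLEX *twiddle, int n)
    {
        bit_reverse(x, n);
        for (int len = 2; len <= n; len <<= 1) {      /* butterfly span     */
            int step = n / len;                       /* twiddle stride     */
            for (int i = 0; i < n; i += len) {
                for (int k = 0; k < len / 2; k++) {
                    COMPLEX w = twiddle[k * step];
                    COMPLEX a = x[i + k];
                    COMPLEX b = x[i + k + len / 2];
                    COMPLEX t = { w.real * b.real - w.imag * b.imag,
                                  w.real * b.imag + w.imag * b.real };
                    x[i + k].real           = a.real + t.real;
                    x[i + k].imag           = a.imag + t.imag;
                    x[i + k + len / 2].real = a.real - t.real;
                    x[i + k + len / 2].imag = a.imag - t.imag;
                }
            }
        }
    }

Keeping the real and imaginary parts adjacent in one COMPLEX array, as above, is the change the text credits with improving cache behavior relative to two pointer-accessed float arrays.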
The remaining computational steps of the APT benchmark require all the data along the RNG dimension, but have no data dependencies along the PRI and EL dimensions. Therefore, before continuing with the computation, the APT program executes a total exchange operation to redistribute the data cube so that each task holds all range gates for its share of the PRI dimension (see Figure 3.8b). Because the MPL command mpc_index (which implements the total exchange operation) did not work quite to our specification, we needed to adjust the output of this operation. The output of the mpc_index operation is a 1-D vector, not the 3-D data cube slice the program needs. Therefore, we added the rewind step, which rewinds the 1-D vector back into a 3-D data cube slice.

So that we can give each task an equal slice of the data cube after the total exchange operation, we reduced the number of range gates in the data set from its nominal value of 280 to 256. With 256 range gates, the data can be divided evenly along this dimension by any power of two from 1 to 256. This adjustment affects the power levels of the targets slightly, but has no impact on the correctness of the parallel program. We confirmed this correctness by comparing the output of the sequential program using a data set with 256 range gates to the output of the parallel program using the same data set.

Following the total exchange step, the APT program forms the Householder matrix. In the sequential APT program, this step was part of the first beamforming step. In the parallel program, the portions of the data cube needed to form the Householder matrix are located in several different tasks. These tasks copy the data into sections of the Householder matrix (see Figure 3.9), then broadcast these sections to all tasks. At the end of the broadcasts, each task has a complete copy of the Householder matrix.

Figure 3.9: Forming the Householder matrix from the data cube (the EL x 320 Householder matrix is assembled from 80 range gates of the highest and lowest PRI slices of the data cube)

The first beamforming substep cannot be efficiently parallelized, and is therefore executed sequentially. Instead of performing this step on one task and broadcasting the results to the remaining tasks, each task performs this step in its entirety. Because the algorithm in the second beamforming substep displays data independence along the PRI and EL dimensions, each task can perform this step on its own portion of the data cube in parallel. The cell averaging CFAR step has been parallelized in a similar fashion.

Each task generates its own target report, consisting of the closest targets found in its data cube slice, up to 25 targets. In the target report reduction step, the parallel APT program reduces the individual target reports into one final target report, consisting of the closest targets in the entire data cube, up to 25 targets. Several algorithms were studied, but only one was time-efficient [Hwan95c]. In this algorithm, the tasks pair off, then merge their two target lists, sorting by range and keeping up to 25 targets. The tasks holding the merged target lists then pair off again, repeating this process until one task holds the target list for the entire data cube (see Figure 3.10).

Figure 3.10: Target report merging process
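A minimal sketch of the pairwise merge used in each round of this reduction is shown below. It assumes the target fields used in the program skeletons (pri, beam, rng, power) and input lists already sorted by range; the exchange of lists between paired tasks, and the real code in Appendix A, are omitted.

    #define MAX_TARGETS 25

    typedef struct { int pri, beam, rng; float power; } TARGET;

    /* Merge two range-sorted target lists, keeping at most the 25 closest. */
    int merge_reports(const TARGET *a, int na,
                      const TARGET *b, int nb, TARGET *out)
    {
        int i = 0, j = 0, m = 0;
        while (m < MAX_TARGETS && (i < na || j < nb)) {
            if (j >= nb || (i < na && a[i].rng <= b[j].rng))
                out[m++] = a[i++];
            else
                out[m++] = b[j++];
        }
        return m;   /* number of targets kept */
    }

With 256 tasks, eight such pairing rounds suffice to leave the complete list on one task.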
The tasks holding the merged target 18 lists then pair off, repeating this process until one task holds the target list for the entire data cube (see Figure 3.10). Once the final target list is generated, the timer is stopped. The remainder of the program displays the final target list and collects execution time information. In order to improve the resolution and accuracy of the execution time measurements, we used two different timing calls: tim es and g e ttim e o f day. The timing call tim es measures only CPU time used by the program, but has a resolution of only 10 ms. The timing call g e ttim e o f day measures wall clock time (which may include time not spent on the APT program), but has a resolution of 1 ps. 3 3 Parallel Program User’s Guide In this section, I describe the files and procedures in the parallel program, and how to compile and run the parallel APT program. A brief description of the input data file is also included. individual target reports final target report Figure 3.10: Target report merging process 19 3.3.1 The Parallel APT Code The parallel program has been divided into the following files: bench_jnar k_APT. c cell_avg_cfar.c cmd_lina.c f ft .c fft_APT.c forback.c house.c read_input_APT.c stepl_beams.c step2_beams.c main () cell_avg_cfar () cmd_line () fft (), bit_reverse {) fft_APT () forback () house {) read_input_APT () stepl_beams () step2_beams () The purpose of each procedure is briefly described below: main () This procedure represents the main body of the parallel APT program. ceii_avg_cfar () This procedure performs the cell averaging CFAR target detection on the slice of the detect cube local to a node. cmd_iine () This procedure extracts the names of the data file and the steering vector file from the command line. f f t o This procedure performs a single n-point in-place decimation-in-time complex FFT using n/2 complex twiddle factors. The implementation used in this procedure is a hybrid between the implementation in the original sequential version of this program and the implementation suggested by MHPCC. This modification was made to improve the performance of the FFT on the SP2. bit_raverse () This procedure performs a simple but somewhat inefficient bit reversal. This procedure is used by f f t o . ___ f ft_ A P T o This procedure performs an FFT along the PRI dimension for each column of data by calling fft o (there are a total of RNQxEL such columns). 20 forback (> This procedure performs a forward and back substitution on an input array using the steering vectors, and normalizes the returned solution vector. house () This procedure performs an in-place Householder transformation on a complex N xM input array, where M £ N. The results are returned in the same matrix as the input data. read_input_APT () This procedure reads the node’s slice of the data cube from disk, then converts the data from the packed binary integer form in which it is stored on disk into floating-point format. stepi_beams (} This procedure performs the first beamforming substep on the local slice of the data cube. step2_beams () This procedure performs the second beamforming substep on the local slice of the data cube. Figure 3.11 shows the calling relationship between the procedures. f bench_mark_APT ^ line Q D h o u a t Figure 3.11: Calling relationship between the procedures in the parallel APT program The complete code listings for the IBM SP2 can be found in Appendix A. 21 3.3.2 The APT Data Files The input data cube file n e w _ d a ta . 
3.3.2 The APT Data Files

The input data cube file new_data.dat is a modified version of the original data cube file apt_data.dat. We reordered the data in new_data.dat so that each task can read its slice of the data cube from one contiguous location in the file. The modified data cube file still has PRI x RNG x EL complex numbers stored in 32-bit binary integer packed format. In this nominal data set, PRI = 256, RNG = 256, and EL = 32. On disk, this data file is 8 MB.

The steering vectors file new_data.str is identical to the original steering vector file apt_data.str. The steering vectors file contains the number of PRIs in the data cube, the power threshold used in target detection, and the complex values for the steering vectors.

3.3.3 Compiling the Parallel APT Program

The parallel APT benchmark is compiled by calling the script file compile_apt. This script file executes the following shell command:

    mpcc -qarch=pwr2 -O3 -DAPT -DIBM -o apt bench_mark_APT.c \
        cell_avg_cfar.c fft.c fft_APT.c forback.c cmd_line.c house.c \
        read_input_APT.c step1_beams.c step2_beams.c -lm

The -qarch=pwr2 option directs the compiler to generate POWER2-efficient code. The -O3 option selects the highest level of compiler optimization available with the mpcc parallel C compiler. The define flag -DAPT must be included so that the compiler selects the proper data cube dimensions in the defs.h header file. The define flag -DIBM must be included so that the compiler can set the number of clock ticks per second (100 ticks per second on IBM machines, 60 ticks per second on SUN machines). To compile the program to use double-precision floating-point numbers instead of the default single precision, add the define flag -DDBLE.

Prior to compiling the program, the user must set the number of nodes on which the program will be run. This number is defined as NN in the header file defs.h. This step is necessary because it is not possible to statically allocate arrays with their sizes defined by a variable in C. We have provided a set of header files with NN set to powers of two from 1 to 256. These header files are called defs.001 through defs.256.
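The fragment below is an illustration only (an assumed layout, not the real defs.h) of why NN must be fixed before compilation: C, as used here, cannot size a static array with a run-time variable. The dimensions match the nominal APT data set and the initial slicing along RNG described in Section 3.2.

    /* illustrative defs.h-style constants; not the actual header */
    typedef struct { float real, imag; } COMPLEX;

    #define NN  256                 /* number of nodes; edit before compiling */
    #define PRI 256
    #define RNG 256
    #define EL  32

    /* each task's statically allocated slice of the data cube (before the
       total exchange, the cube is divided along the RNG dimension) */
    static COMPLEX local_cube[PRI][RNG / NN][EL];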
3.3.4 Running the Parallel APT Program

The parallel APT benchmark is invoked by calling the script file run.###, where ### is the number of nodes on which the program will be run (and is equal to NN in defs.h). This script file executes the following shell command:

    poe apt data_filename -procs ### -us

The command poe distributes the parallel executable file to all nodes and invokes the parallel program. The command line argument data_filename is the name of the files containing the input data cube and the steering vectors. The input data cube file must have the extension .dat, while the steering vectors file must have the extension .str. The command line arguments -procs and -us are flags to poe. The flag -procs sets the number of nodes on which the program will be run. The flag -us directs poe to use the User Space communication library. This library gives the best communication performance. We have provided a set of script files to run the APT benchmark, with the number of nodes equal to powers of two from 1 to 256. These files are called run.001 through run.256.

Once the parallel program has begun and the timer has been started, the program will generate a "Running..." message. Once the program has finished and the timer has been stopped, it will generate a "...done." message.

The output of the program has the format shown in Figures 3.12 and 3.13. Figure 3.12 shows the target list, consisting of the targets found in the data cube, up to 25 targets. This number of targets was set as a parameter to the APT benchmark. Figure 3.13 shows the timing report from the parallel APT benchmark.

    0 ENTRY RANGE BEAM DOPPLER POWER
    0  00    019   00    151    0.705780
    0  01    019   01    151    0.713037
    0  02    039   02    138    0.747974
    0  03    059   04    182    0.729151
    0  04    099   06    130    0.763610
    0  05    119   08    075    0.847957
    0  06    119   07    075    0.785197
    0  07    139   10    147    0.745536
    0  08    139   09    147    0.840084
    0  09    159   11    116    0.792221
    0  10    000   00    000    0.000000
    0  11    000   00    000    0.000000
    0  12    000   00    000    0.000000
    0  13    000   00    000    0.000000
    0  14    000   00    000    0.000000
    0  15    000   00    000    0.000000
    0  16    000   00    000    0.000000
    0  17    000   00    000    0.000000
    0  18    000   00    000    0.000000
    0  19    000   00    000    0.000000
    0  20    000   00    000    0.000000
    0  21    000   00    000    0.000000
    0  22    000   00    000    0.000000
    0  23    000   00    000    0.000000
    0  24    000   00    000    0.000000

Figure 3.12: Target list generated by the parallel APT program

In Figure 3.12, the ENTRY column counts the number of targets detected, starting with target #0. The RANGE column indicates the range gate in which the target was found, and is proportional to the distance to the target. The BEAM column indicates the direction to the target. The DOPPLER column provides information regarding the speed of the target. The POWER column represents the power of the signal returned by the target. The targets listed in Figure 3.12 are those targets placed in the input data cube when the data cube was generated. We confirmed the correctness of our parallel APT program by comparing this target list to the target list generated by the original sequential APT benchmark (see [LL94]).

    0: *** Timing information
    0: numtask = 128
    0: all_user_max     = 2.17 s,  all_sys_max     = 0.03 s
    0: disk_user_max    = 0.14 s,  disk_sys_max    = 0.02 s
    0: fft_user_max     = 0.04 s,  fft_sys_max     = 0.00 s
    0: index_user_max   = 0.06 s,  index_sys_max   = 0.00 s
    0: rewind_user_max  = 0.01 s,  rewind_sys_max  = 0.00 s
    0: partial_user_max = 0.00 s,  partial_sys_max = 0.00 s
    0: bcast_user_max   = 0.05 s,  bcast_sys_max   = 0.00 s
    0: step1_user_max   = 0.04 s,  step1_sys_max   = 0.00 s
    0: step2_user_max   = 0.09 s,  step2_sys_max   = 0.00 s
    0: cfar_user_max    = 0.01 s,  cfar_sys_max    = 0.00 s
    0: report_user_max  = 0.08 s,  report_sys_max  = 0.00 s
    0:
    0: Wall clock timing -
    0: all_clock_time     = 2.339890
    0: disk_clock_time    = 0.346691
    0: fft_clock_time     = 0.018968
    0: index_clock_time   = 0.048371
    0: rewind_clock_time  = 0.001918
    0: partial_clock_time = 0.000653
    0: bcast_clock_time   = 0.023326
    0: step1_clock_time   = 0.037597
    0: step2_clock_time   = 0.076094
    0: cfar_clock_time    = 0.003508
    0: report_clock_time  = 0.075713

Figure 3.13: Timing report generated by the parallel APT program

In Figure 3.13, the timing entries ending with _user_max represent the largest amount of user CPU time spent by a node while performing that step, while the timing entries ending with _sys_max represent the largest amount of system CPU time spent by a node while performing that step. The Wall clock timing entries indicate the amount of wall clock time spent on a given step, regardless of whether the time was spent by the program or by other programs.
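A minimal sketch of how one step can be timed with both interfaces follows; it is not the code in Appendix A, and the function and variable names are assumptions. times() supplies the user and system CPU times (10 ms resolution), while gettimeofday() supplies the wall clock time (1 µs resolution).

    #include <stdio.h>
    #include <sys/times.h>
    #include <sys/time.h>
    #include <unistd.h>

    void time_step(void (*step)(void))
    {
        struct tms cpu0, cpu1;
        struct timeval wall0, wall1;
        long ticks = sysconf(_SC_CLK_TCK);      /* clock ticks per second */

        times(&cpu0);
        gettimeofday(&wall0, NULL);
        step();                                 /* the step being measured */
        gettimeofday(&wall1, NULL);
        times(&cpu1);

        double user = (double)(cpu1.tms_utime - cpu0.tms_utime) / ticks;
        double sys  = (double)(cpu1.tms_stime - cpu0.tms_stime) / ticks;
        double wall = (wall1.tv_sec - wall0.tv_sec)
                    + (wall1.tv_usec - wall0.tv_usec) / 1e6;
        printf("user %.2f s, sys %.2f s, wall %.6f s\n", user, sys, wall);
    }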
This time is not equal to the sum of the times of the other steps, because this all time includes time spent waiting for synchronization that is not included in the individual steps.
disk: This component is the time to read the node's portion of the data cube from disk.
fft: This component is the time spent performing FFTs on this node's portion of the data cube.
index: This component is the time spent performing the total exchange operation.
rewind: This component is the time spent rewinding the 1-D output of the total exchange operation back into a 3-D data cube slice.
partial: This component is the time spent forming the partial_temp matrix.
bcast: This component is the time spent broadcasting the partial_temp matrix to all nodes.
step1: This component is the time spent performing the first beamforming substep.
step2: This component is the time spent performing the second beamforming substep.
cfar: This component is the time spent performing the cell averaging CFAR step.
report: This component is the time spent combining the individual target reports from all nodes into one final target report.

The times spent in these steps ($T_{fft}$, $T_{index}$, etc.) are combined to determine the total execution time of the parallel APT benchmark (see Equation 6.1 in Chapter 6). When determining the time spent for a particular step, we compared the sum of the user and system CPU times to the wall clock time, and took the smaller of the two. This time most accurately represents the time spent on the step. If the wall clock time is smaller than the sum of the user and system CPU times, the wall clock time includes only time spent on the benchmark, and we use this time, which has greater resolution. If the wall clock time is greater than the sum of the user and system CPU times, the wall clock time includes time spent on non-benchmark activities, and is not representative of the execution time for that step.

The timing results vary with machine size and from run to run, but the target list should be identical between runs and independent of machine size. Both the target list and timing report are printed to the screen. To write these reports to a file, use standard UNIX output redirection:

run.### > out_filename

4 The Parallel HO-PD Benchmark Program

In this chapter, I describe the parallelization of the Higher-Order Post-Doppler (HO-PD) benchmark. The HO-PD benchmark performs a reduced-dimension space-time adaptive algorithm [LL94]. Unlike the APT benchmark's two-step beamforming process, the HO-PD benchmark's nulling process is done in one step. Furthermore, the HO-PD benchmark performs a smaller number of larger Cholesky factorizations than the APT benchmark: 256 factorizations with 144 degrees of freedom for the HO-PD, versus 1792 factorizations with 10 degrees of freedom for the APT.

4.1 Sequential Program

We generated the parallel HO-PD program by modifying the original sequential version. Figure 4.1 shows the data flow of the HO-PD benchmark.

Figure 4.1: HO-PD benchmark data flow diagram

While analyzing the original HO-PD benchmark, we developed a program skeleton of the sequential program. Below, Figure 4.2 shows the program skeleton for the sequential HO-PD program.
COMPLEX data_cube[PRI][RNG][EL], detect_cube[PRI][BEAM][RNG], twiddle[PRI}; /* Doppler Processing (DP) */ computo_twiddle_factor (out twiddle, In PRI); for (j = 0; j < RNG; j++) fo* (k = 0; k < EL; k++) fft (lnout data_cube[ . ] [ j ] [k] , in twiddle, In PRI); /* Beamforming (BF) */ for (i * 0; i < PRI; i++) beam_form (In data_cube[i][.][.], in data_cube[i-1 % PRI][.][.], In data_cube[i+1 % PRI][.][.], out detect_cube[i][.][.]); /* Target Detection (CFAR) */ for (i = 0; i < PRI; i++) for (j = 0; j < BEAM; j++) compute_target {lnout detect_cube[i][j][.]); /* Target Report */ m = 0; for (k = 0; k < RNG; k++) for (i = 0; i < PRI; i++) for (j = 0; j < BEAM; j++) If (IsTarget (In detect_cube[i][j][k])) t target_report[m].pri = i; targetsreport[m].beam = j; target_report[m].rng = k; target_report[m].power = detect_cube[i][j][k].real; m = m + 1; If (m > 25) goto finished; J finished: Figure 4.2: Sequential HO-PD program skeleton Not included in this program skeleton, but implicit to the program, is the input data file access step. Like in the APT benchmark, the program reads the input data cube and the steering vectors from disk before any computation is done. 29 The program first transforms the input data cube into the Doppler domain by performing FFTs on the data. In this Doppler processing step, an FFT is performed along the PRI dimension on each column of constant EL and RNG. The spatial steering vectors need to be similarly transformed to produce space-time steering vectors. After the FFTs, the HO-PD performs the beamforming step. Each Doppler of interest is processed with its two adjacent (the preceding and succeeding) Dopplers to make 3 x EL = DOF number of rows and RNG_S range gates. Every other range gate (i.e. 0, 2,4,6, etc.) to RNG_S (nominally 288) number of range gates is used. A Householder transformation is performed on all DOF number of rows to lower triangularize all rows. Since there are NUMSEG segments of range gates, a Cholesky factorization and a forward back solve using the space-time steering vector on a segment for later application of weights to the segment adjacent to it are performed. The program starts at range gate 0 for the Cholesky factorization of the first segment, and applies the weights to the second segment starting at 0+RNGSEG. For the second or higher segments, the program starts at range gate = NUMSEG xRNGSEG for the Cholesky factorization, and applies the weights to range gates starting at (NUMSEG-1) xRNGSEG segment (the previous segment). Finally the program performs the cell averaging CFAR target detection step. In this step, the program calculates the power of each range cell in a segment, as well as the cell average power of the range cells at least three range gates away from the cell of interest The program compares these two powers, and if the cell power is above the 30 threshold, then the cell is considered as having a target and added to the target list. This step is started at the lowest range so closer targets will be reported first. Table 4.1 below Usts the number of floating-point operations performed in each of the four main computational steps in the HO-PD benchmark. 
Table 4.1: Breakdown of Computational Workload in the HO-PD Benchmark Computational Step Number of floating-point operations (x 10°) Doppler Processing 220.00 Adaptive Beamforming 12,618.00 Cell Averaging CFAR 14.00 Total 12352.00 4.2 Parallel Program Development Similar to the APT program, in parallelizing the HO-PD program, we once again followed the early performance prediction method outlined in [Xu95b]. The HO-PD program can also be parallelized using a compute-interact paradigm: Compute: Doppler Processing Interact: Total Exchange Circular Shift Compute: Beamforming Cell Averaging CFAR Interact: Target List Reduction Because the same algorithm is applied to the entire data cube, we were able to use the SPMD paradigm. We exploited the parallelism inherent in the data set: in this parallel 31 approach, each task performed the same algorithm on its portion of the input data set The mapping of the parallel algorithm and the data set onto the SP2 is shown in Figure 4.3. data cube DP Doppler Processing BF Beamforming CFAR Cell Averaging/CFAR DP DP total exchange circular shift CFAR CFAR target list reduction Target Report Figure 4.3: Mapping of the parallel algorithm and data set onto the SP2 During the coding step in the parallel software development process, we started with a program skeleton to help us visualize the structure of the benchmark. I have included the program skeleton in Figure 4.4. At the beginning of the parallel program, each task reads its portion of the data cube from disk in parallel. Because each FFT performed in the Doppler Processing step 32 COMPLEX data_cube[PRI ][RNG][EL], detect_cube[PRI][BEAM][RNG]; /* Doppler Processing (DP) */ perfor {j = 0; j < RNG; j++) /* There are RNG tasks */ { COMPLEX twiddle[PRIJ; /* Local variable */ compute_twiddle_£actor (out twiddle, ia PRI); /* Each task computes its own twiddle factor */ for (k = 0; k < EL; k++) /* Each task computes EL FFTs */ fft (iaout data_cube[.][j][k], la twiddle, la PRI); ) barrier; total_exchsnge data_cube[.][j][. ] to data.cube[i][.][.]; shift data_cube[i][.][.] to data_cube[i-1 % PRI]{.][.]; shift data_cube[i][.][.] to data_cube[i+1 % PRI][-][-]; barrier; perfor (i = 0; i < PRI; i++) ( /* Beamforming (BF) V beam_form ( ia data_cube[i][.][.] , ia data_cube[i-1 % PRI][.][.], ia data_cube[i+1 % PRI][*][.], out detect_cube[i][.][.]); barrier; /* Target Detection (CFAR) */ foe (j = 0; j < BEAM; j++) compute_target (iaout detect_cube[i][j][.]); /* Target Report */ m = 0; for (k = 0; k < RNG; k++) for (j = 0; j < BEAM; j++) if (IsTarget (ia detect_cube[i][j][k])) { local_report[m].pri = i; local_report[m].beam = j; local_report[m].rng = k; local_report[m].power = detect_cube[i][j][k].real; m = m + 1 ; if (m > 2 5) goto finished; > finished: ) reduoe local„reports to target_report; Figure 4.4: Parallel HO-PD program skeleton requires all the data along the PRI dimension, but is independent of the other FFTs along the RNG (as well as the EL) dimension, we divide the data cube equally along the RNG dimension (see Figure 3.8a). In order to improve disk access performance, we modified the r e a d _ in p u t_ S T A P procedure (see Appendix B) so each task reads its entire slice 33 of the data cube from disk before converting the data from a packed binary format (in which the data is stored on disk) to floating-point format (in which computations are done). In the sequential version, each complex number is read then converted immediately. 
Because the execution time of this step is not included in the total execution time of the parallel program (this step would not exist in the final implementation as part of an airborne radar processor), the performance of this step is not critical: the timer is started after the data is loaded and converted. Each task performs FFTs on its slice of the data cube in parallel. The FFT routine used in the parallel HO-PD program is the same, modified routine used in the parallel APT program. The remaining computational steps of the HO-PD benchmark require all the data along the RNG dimension, but have no data dependencies along the PRI and EL dimensions. Therefore, before continuing with the computation, the HO-PD program executes a total exchange operation to redistribute the data cube so that it is resliced along the RNG dimension (see Figure 3.8b). We had the same difficulties with the m pc_index command as in the parallel APT program, and thus had to include a rewind step to complete the data redistribution. So that we can give each task an equal slice of the data cube after the total exchange operation, we changed the number of range gates in the data set from its nominal value of 1230 to 1024. With 1024 range gates, the data can be divided along this dimension evenly by any power of two from 1 to 236. This adjustment affects the power levels of the targets slightly, but has no impact on the correctness of the parallel program. 34 We confirmed this correctness by comparing the output of the sequential program using a data set with 1024 range gates to the output of the parallel program using the same data set The beamforming algorithm in the HO-PD program requires the preceding and succeeding Dopplers with each Doppler. Because these preceding and succeeding Dopplers are located in the data cube slices of other tasks, they must be sent via message passing. We used two circular shifts to move the appropriate data (see Figure 4.5). What each task has after the total exchange What each task needs before beamforming After the first circular shift After the second circular shift Figure 4.5: Circular shift operation in the parallel HO-PD program Following the total exchange and circular shift operations, the HO-PD program performs the beamforming step. Because this algorithm displays data independence along PRI 4 ^ * One doppler ' / / / / / / / / z Y / / / / / / / / S ' 35 the PRI and EL dimensions, each task can perform this step on its own portion of the data cube in parallel. The cell averaging CFAR step has been parallelized in a similar fashion. Each task generates its own target report, consisting of the closest targets found in its data cube slice, up to 25 targets. The target reduction step in the parallel HO-PD program is identical to the one in the parallel APT program. Once the final target list is generated, the timer is stopped. The remainder of the program displays the final target list and collects execution time information. In order to improve the resolution and accuracy of the execution time measurements, we used both tim es and g e ttim e o f day in the HO-PD program. 4 3 Parallel Program User’s Guide In this section, I describe the files and procedures in the parallel program, and how to compile and run the parallel HO-PD program. A brief description of the input data file is also included. 43.1 The Parallel HO-PD Code The parallel program has been divided into the following files: bench_mark_STAP. c c e ll_ a v g _ c f a r . c crnd_l in * . c c£xnputa_beams. 
c
compute_weights.c
fft.c
fft_STAP.c
form_beams.c
form_str_vecs.c
forback.c
house.c
read_input_STAP.c

Together with the files listed above, these files contain the procedures main(), cell_avg_cfar(), cmd_line(), compute_beams(), compute_weights(), fft(), bit_reverse(), fft_STAP(), form_beams(), form_str_vecs(), forback(), house(), and read_input_STAP(). The purpose of each procedure is briefly described below:

main(): This procedure represents the body of the parallel HO-PD program.
cell_avg_cfar(): This procedure performs the cell averaging CFAR target detection on the slice of the detect cube local to a node.
cmd_line(): This procedure extracts the names of the data file and the steering vector file from the command line.
compute_beams(): This procedure performs beamforming on the node's slice of the data cube on a per-Doppler basis.
compute_weights(): This procedure computes the weights prior to beamforming. It also generates a set of weights, to be used in the beam computation routine, from the modified space-time steering vectors.
fft(): This procedure performs a single n-point in-place decimation-in-time FFT using n/2 complex twiddle factors. The implementation used in this procedure is a hybrid between the implementation used in the original sequential version of this program and the implementation suggested by MHPCC. This modification was made to improve the performance of the FFT on the SP2.
bit_reverse(): This procedure performs a simple but somewhat inefficient bit reversal. This procedure is used by fft().
fft_STAP(): This procedure performs an FFT along the PRI dimension for each column of data by calling fft() (there are RNG x EL such columns).
forback(): This procedure performs a forward and back substitution on an input array using the steering vectors, and normalizes the returned solution vector.
form_beams(): This procedure performs the beamforming step by calling compute_weights() and compute_beams() on each Doppler.
form_str_vecs(): This procedure forms space-time steering vectors from the input spatial steering vectors.
house(): This procedure performs an in-place Householder transform on a complex N x M input array, where M >= N. The results are returned in the same matrix as the input data.
read_input_STAP(): This procedure reads the node's slice of the data cube from disk, then converts the data from the packed binary integer form in which it is stored on disk into floating-point format.

Figure 4.6 shows the calling relationship between the procedures.

Figure 4.6: Calling relationship between the procedures in the parallel HO-PD program

The complete code listings for the IBM SP2 can be found in Appendix B.

4.3.2 The HO-PD Data Files

The input data cube file new_stap.dat is a modified version of the original data cube file stap_data.dat. We reordered the data in new_stap.dat so that each task can read its slice of the data cube from one contiguous location in the file. The modified data cube file still has PRI x RNG x EL complex numbers stored in 32-bit binary integer packed format. In this nominal data set, PRI = 128, RNG = 1024, and EL = 48. On disk, this data file is 48 MB. The steering vectors file new_stap.str is identical to the original steering vectors file stap_data.str.
The steering vectors file contains the number of PRIs in the data cube, the power threshold used in target detection, and the complex values for the steering vectors.

4.3.3 Compiling the Parallel HO-PD Program

The parallel HO-PD benchmark is compiled by calling the script file compile_hopd. This script file executes the following shell command:

mpcc -qarch=pwr2 -O3 -DSTAP -DIBM -o stap bench_mark_STAP.c cell_avg_cfar.c fft.c fft_STAP.c forback.c cmd_line.c house.c read_input_STAP.c form_beams.c compute_weights.c compute_beams.c form_str_vecs.c -lm

The -qarch=pwr2 option directs the compiler to generate POWER2-efficient code. The -O3 option selects the highest level of compiler optimization available on the mpcc parallel C code compiler. The define flag -DSTAP must be included so that the compiler selects the proper data cube dimensions in the defs.h header file. The define flag -DIBM must be included so that the compiler can set the number of clock ticks per second (100 ticks per second on IBM machines, 60 ticks per second on SUN machines). To compile the program to use double-precision floating-point numbers instead of the default single-precision, add the define flag -DDBLE.

Prior to compiling the program, the user must set the number of nodes on which the program will be run. This number is defined as NN in the header file defs.h. This step is necessary because it is not possible to statically allocate arrays with their sizes defined by a variable in C. We have provided a set of header files with NN set to powers of two from 1 to 256. These header files are called defs.001 through defs.256.

4.3.4 Running the Parallel HO-PD Program

The parallel HO-PD benchmark is invoked by calling the script file run.###, where ### is the number of nodes on which the program will be run (and is equal to NN in defs.h). This script file executes the following shell command:

poe stap data_filename -procs ### -us

The command poe distributes the parallel executable file to all nodes and invokes the parallel program. The command line argument data_filename is the name of the files containing the input data cube and the steering vectors. The input data cube file must have the extension .dat, while the steering vectors file must have the extension .str. The command line arguments -procs and -us are flags to poe. The flag -procs sets the number of nodes on which the program will be run. The flag -us directs poe to use the User Space communication library. This library gives the best communication performance. We have provided a set of script files to run the HO-PD benchmark, with the number of nodes equal to powers of two from 1 to 256. These files are called run.001 through run.256.

Once the parallel program has begun and the timer has been started, the program will generate a “Running...” message. Once the program has finished and the timer has been stopped, it will generate a “...done.” message.

The output of the program has the format shown in Figures 4.7 and 4.8. Figure 4.7 shows the target list, consisting of the targets found in the data cube, up to 25 targets. This number of targets was set as a parameter to the HO-PD benchmark. Figure 4.8 shows the timing report from the parallel HO-PD benchmark.
Figure 4.7: Target list generated by the parallel HO-PD program

In Figure 4.7, the ENTRY column counts the number of targets detected, starting with target #0. The RANGE column indicates the range gate in which the target was found, and is proportional to the distance to the target. The BEAM column indicates the direction to the target. The DOPPLER column provides information regarding the speed of the target. The POWER column represents the power of the signal returned by the target. The targets listed in Figure 4.7 are those targets placed in the input data cube when the data cube was generated. We confirmed the correctness of our parallel HO-PD program by comparing this target list to the target list generated by the original sequential HO-PD benchmark (see [LL94]).

Figure 4.8: Timing report generated by the parallel HO-PD program

In Figure 4.8, the timing entries ending with _user_max represent the largest amount of user CPU time spent by a node while performing that step, while the timing entries ending with _sys_max represent the largest amount of system CPU time spent by a node while performing that step. The Wall clock timing entries indicate the amount of wall clock time spent on a given step, regardless of whether the time was spent by the program or by other programs. The steps we timed are listed below:

all: This component is the end-to-end execution time of the entire program. This time is not equal to the sum of the times of the other steps, because this all time includes time spent waiting for synchronization that is not included in the individual steps.
disk: This component is the time to read the node's portion of the data cube from disk.
fft: This component is the time spent performing FFTs on this node's portion of the data cube.
index: This component is the time spent performing the total exchange operation.
rewind: This component is the time spent rewinding the 1-D output of the total exchange operation into a 3-D data cube slice.
shift This component is the time spent performing the two circular shift operations. str This component is the time spent forming the space-time steering vectors. beam This component is the time spent performing the bearaforming step. cfar This component is the time spent performing the cell averaging CFAR step. report This component is the time spent combining the individual target reports from all nodes into one final target report. The time spent in these steps ( Tm , etc.) are combined to determine the total execution time of the parallel HO-PD benchmark (see Equation 6.4 in Chapter 6). 43 As was the case in the APT benchmark, we used both the CPU time and the wall clock time to more accurately measure the execution time of each step. The timing results vary with machine size and from run to run, but the target list should be identical between runs and independent of machine size. To write these reports to a file, use standard UNIX output redirection: run.### > out_filename 44 5 The Parallel General Benchmark Program In this chapter, I describe the parallelization of the General benchmark. The General benchmark is designed to stress general signal-processing computations, as well as test the data communication capabilities of the machine. This benchmark performs the following steps: 1. Search and sort processing 2. FFT processing across each dimension of the data cube 3. Point-by-point vector multiplication along each dimension of the data cube. 4. Miscellaneous linear algebra processing, including Cholesky factorization and back substitution to solve a set of linear equations, and subsequent beamforming (inner product). Each of these steps is independent. We parallelized each of these steps in a separate subprogram, and will therefore analyze each one separately. 5.1 Sequential Program 5.1.1 Sorting Subprogram We generated the parallel sorting subprogram by modifying the original sequential sorting subprogram. While analyzing the original subprogram, we developed a program skeleton, shown below in Figure 3.1. r 45 COMPLEX data_cube t DIM1][DIM2] [DIM3], max_elament; flo a t total[DIM1], average[DIM1]; /* the SRCH substep */ for (i = 0; i < D1M1; i++) for (j = 0; j < DIM2; j++) for (k = 0; k < DIM3; k++) find_max(lnout max_element, In data_cube[ i ] [ j ] [ k]) r /* the SORT substep */ for (j = 0; j < DIM2; j++) for (k = 0; k < DIM3 / 3; k++) sort_and_total{lnout total 1.1, In data_cube[.][j][k]); for (i = 0 j i < DIM1; i++) averagefi] = total[i] / (DIM2 * DIM3 / 3); Figure 5.1: Sequential sorting subprogram skeleton Not included in this program skeleton, but implicit to the program, is the input data file access step. The program reads the input data cube and the vectors from disk before any computation is done. The subprogram first searches the entire data cube for the element with the largest power magnitude. After this searching step, the subprogram performs DIM2xDIM3-*-3 DIM 1-element bubble sorts. Also during this step, the subprogram calculates DIM1 averages. The sorting subprogram requires 1,183.00 x 106 floating-point operations. 5.1.2 FFT Subprogram We developed a program skeleton of the sequential subprogram while analyzing the original FFT benchmark, shown below in Figure 5.2. 
46 COMPLEX data_cube[DIM1][DIM2][DIM3], doppler_cube[DIM1][DIM2][DIM3], twiddle; flo a t max_el«mant; computa_twiddla_factor{out twiddle, la DOP_GEN); /* the FFT1 subetep */ for (j = 0; j < DIM2; j++) for (k = 0; k < D1M3; k++) fft (la data_cube[.][j][k] , in twiddle, out doppler_cube[.][j}[k]); /* the FFT2 substep */ for (i = 0; i < DIM1; i++) for (k = 0 j k < DIM3; k+ +) fft {lnout doppler_cube[i][.][k], la twiddle); /* the FFT3 substep */ for (i * 0; i < DIM1; i++) for (j = 0; j < DIM2; j++) fft (laout doppler_cube[i][j][.], la twiddle); /* the SRCH substep */ for (i = 0; i < DIM1; i++) for (j * 0; j < DIM2; j++) for (k * 0; k < DOP_GEN; k++) find_max(lnout max^element, la data_cube[i][j][k]); Figure 5.2; Sequential FFT subprogram skeleton Not included in this program skeleton, but implicit to the program, is the input data file access step. The program reads the input data cube and the vectors from disk before any computation is done. This subprogram performs FFTs along each of the three dimensions of the data cube: DIM2xDIM3 DIM 1-point FFTs, then DIM1 xDIM3 DlM2-point FFTs, then DIM1 x DIM2 DIM3-point FFTs. After the FFTs, the subprogram searches the entire data cube for the element with the largest power magnitude. The FFT subprogram requires 1,909.00 x 106 floating-point operations. 47 5.13 Vector Multiply Subprogram We developed a program skeleton of the sequential subprogram while analyzing the original vector multiply benchmark, shown below in Figure 5.3. COMPLEX data_cub»[DIMl][DIM2J[DIM3]f vlfDIMl], v2[DIM2], v3[DIM3J; float max_alementl, max_element2, max_elament3; /* the VEC1 subatep */ for {j = 0; j < DIM2; j + +) for (k = 0; k < DIM3; k++) vec_mult iply_and_max{In data_cube[.][j][k], In V I, lnout max_elementl); /* the VEC2 substep */ for (i = 0; i < DIM1; i++) for (k = 0; k < DIM3; k++) vec_multiply_and_jnax(In data„cube[i] [. ] [k], In V2, lnout max_element2>; /* the VEC3 substep */ for (i = 0; i < DIM1; i++i for (j = 0; j < DIM2; j++) vec_multiply_and_max(In data_cube[i][j][. ] , lnV3, lnout max_element3); Figure 5.3: Sequential vector multiply subprogram skeleton Not included in this program skeleton, but implicit to the program, is the input data file access step. The program reads the input data cube and the vectors from disk before any computation is done. The subprogram performs three sets of vector multiplications. First, the subprogram multiplies vectors along the DIM1 dimension of the input data cube by the input vector VI. Then, the subprogram multiplies vectors along the DIM2 dimension of the input data cube by the input vector V2. Finally, the subprogram multiplies vectors along the DIM3 dimension of the input data cube by the input vector V3. During each set 48 of multiplications, the subprogram searches for the element with the largest power magnitude. The vector multiply subprogram requires 603.00 x 106 floating-point operations. 5.1.4 Linear Algebra Subprogram We developed a program skeleton of the sequential subprogram while analyzing the original linear algebra benchmark, shown below in Figure 5.4. 
COMPLEX data_cube[DIM1][DIM2][DIM3], temp_mat[V4][COLS], weight_vec[V4], beams[NUM_MAT][DIM3]; float max_elament; for (j = 0; j < N U h L M A T ; j++J { CHOL { in data_cube[.][j][.}, out temp_mat}; SUB (in temp_mat, out weight_vec); BF (la weight_vec, In data.cube[.][j][.], out beams[.](j][.]); SRCH (in beams[.][j][.], out max_element); ) Figure 5.4: Sequential linear algebra subprogram skeleton Not included in this program skeleton, but implicit to the program, is the input data file access step. The program reads the input data cube and the vectors from disk before any computation is done. The subprogram first computes NUM_MAT number of Cholesky factors using the first NUM.MAT number of DIM1 xDIM3 matrices from the input data set Then, the subprogram forward and back solves these Cholesky factors using the vector V4 as a steering vector to generate NUM_MAT number of weight vectors. These weight vectors 49 are applied to their corresponding DIM1 xDIM3 matrices from the input data cube to generate NUM_MAT number of output beams of length DIM3. The subprogram also searches for the element with the largest magnitude in each output beam. The linear algebra subprogram requires 1,603.00 x 106 floating-point operations. 5.2 Parallel Program Development Similar to the APT and HO-PD programs, in parallelizing the General program, we once again followed the early performance prediction method outlined in [Xu95b], Each of the subprograms in the General program can also be parallelized in an SPMD mode following the compute-interact paradigm. 5.2.1 Sorting Subprogram The sorting subprogram can be described as the following sequence of compute- interact steps: Compute: Search data cube slice for local maximum Interact: Reduce to And global maximum Compute: Bubble sort data cube slice Search data cube slice for local maximum Interact: Reduce to find global maximum The mapping of the parallel algorithm and the data set onto the SP2 is shown in Figure 5.3. 50 data cube search search for local maximum search sort reduce to find global maximum bubble sort sort sort ^ search ^ search search reduce to find global maximum Figure 5.5: Mapping of the parallel sorting subprogram and data set onto the SP2 To help illustrate the structure of the parallel sorting subprogram, I have included the program skeleton in Figure 5.6. At the beginning of the parallel subprogram, each task reads its portion of the data cube from disk in parallel. In order to improve disk access performance, we modified the re a d _ in p u t_ S O R T _ F F T procedure (see Appendix C) so each task reads its entire slice of the data cube from disk before converting the data from a packed binary format (in which the data is stored on disk) to floating-point format (in which computations are done). In the sequential version, each complex number is read then converted immediately. 
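The bulk-read-then-convert structure of the modified read_input_SORT_FFT() can be sketched as follows. The function name, arguments, and the assumed on-disk layout (each complex value stored as a pair of 32-bit binary integers, with each task's slice stored contiguously) are illustrative only and do not reproduce the benchmark's actual routine.

/* Sketch: read a task's contiguous slice in one pass, convert in a second. */
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

int read_slice(const char *path, int task_id, long n_complex, float *out)
{
    long n_ints = 2 * n_complex;           /* real + imaginary per element */
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;

    int32_t *packed = malloc(n_ints * sizeof(int32_t));
    if (packed == NULL) { close(fd); return -1; }

    /* One contiguous read of this task's slice, instead of one small read
     * (and immediate conversion) per complex number as in the sequential code. */
    lseek(fd, (off_t)task_id * n_ints * sizeof(int32_t), SEEK_SET);
    if (read(fd, packed, n_ints * sizeof(int32_t)) !=
        (ssize_t)(n_ints * sizeof(int32_t))) {
        free(packed);
        close(fd);
        return -1;
    }

    /* Convert the whole slice to floating point in a second pass. */
    for (long i = 0; i < n_ints; i++)
        out[i] = (float)packed[i];

    free(packed);
    close(fd);
    return 0;
}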
Because the execution time of this step is not included in the total execution 51 COMPLEX data.cube[DIM1][DIM2][DIM3]; flo a t total[DIM1], average[DIM1]; atruot { flo a t power; int loci, loc2, loc3; } max_element; pazfor (p = 0; p < n; p++) ( COMPLEX local_jnax; flo a t local.total[DIM1]; /* the SRCH aubstep */ for (i = 0; i < DIM1; i++) for (j = 0; j < DIM2; j++) for (k = p; k < DIM3; k = k + n) find_max (lnout local_jnax, In data_cube[i][j][k]); reduce local_jnaxs to max_element ; /* the SORT subatep */ for (j = 0; j < DIM2; j + +) for (k = p; k < DIM3 / 3; k = k + n) sort_and_total (lnout local.totalt.], In data.cube(.][j][kj); ) reduce local.totala to total; Figure 5.6: Parallel sorting subprogram skeleton time of the parallel program, the performance of this step is not critical: the timer is started after the data is loaded and converted. First, each task searches its slice of the data cube to find the element with the largest power magnitude. To find the global maximum, we perform the MPL call m pc_reduce. Because we need to keep track of the location of the element along with its power, we created the indexed_max data structure. By placing the coordinates of the maximum element in a single data structure along with the power, we can let the m pc_reduce call handle the entire global maximum search operation, thus improving program performance. 52 Each task then bubble sorts its slice of the data cube along the DIM 1 dimension. The subprogram also accumulates the local total during the bubble sorts. To find the global totals, we again use the MPL m pc_reduce call. In our implementation, we combined the sorting and FFT subprograms into one subprogram. 5.2.2 FFT Subprogram The FFT subprogram can be described as the following sequence of compute- interact steps: Compute: First and second set of FFTs Interact: Total Exchange Compute: Third set of FFTs Search data cube slice for local maximum Interact: Reduce to find global maximum The mapping of the parallel algorithm and the data set onto the SP2 is shown in Figure 5.7. To help illustrate the structure of the parallel FFT subprogram, I have included the program skeleton in Figure 5.8. Like the sorting subprogram, the FFT subprogram loads the data cube from disk in parallel prior to any computation. 53 data cube ---------- 71 / * ♦ ♦ NodeO Node 1 Node 253 search search for local maximum f F F T l J Q FFT2 J ( FFT1 ) C f f t i J I Q FFT2 j C F F T 2 J I I total exchange J f f f ( FFT3 j ^ search j ( FFT3 ^ Q f t o J f ^ search ) ( j o r e h J f r reduce to find global maximum J ---------- n ----------- 1 -------------------1 — — Figure 5.7: Mapping of the parallel FFT subprogram and data set onto the SP2 The FFT subprogram begins by performing DIM2xDIM3 DIM 1-point FFTs. Each task performs these FFTs along the DIM1 dimension on its slice of the data cube in parallel. Next, each task performs FFTs along the DIM2 dimension in parallel. Before the third set of FFTs can be performed, the subprogram must perform a total exchange operation. This data redistribution is necessary because each task loads a slice of data from the disk which has all the elements along the DIM1 and DIM2 dimensions but only a fraction of the elements along the DIM3 dimension, and this last set of FFTs is performed along the DIM3 dimension. 
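This redistribution can be pictured with the short sketch below. The benchmark itself uses the MPL mpc_index call; since that call's exact signature is not reproduced here, the sketch uses MPI_Alltoall as a functional stand-in, and the block-size arithmetic assumes (as the benchmarks arrange) that DIM2 and DIM3 divide evenly by the number of tasks. The local packing and the subsequent rewind pass are omitted.

/* Stand-in sketch of the total exchange before the third set of FFTs. */
#include <mpi.h>

void redistribute(const float *send, float *recv, int dim1, int dim2, int dim3)
{
    int n;
    MPI_Comm_size(MPI_COMM_WORLD, &n);

    /* Each complex element is 2 floats.  Before the exchange a task holds all
     * of DIM1 and DIM2 but only DIM3/n planes; afterwards it holds all of
     * DIM3 but only DIM2/n planes, so each pair of tasks swaps equal blocks. */
    int block = 2 * dim1 * (dim2 / n) * (dim3 / n);

    MPI_Alltoall((void *)send, block, MPI_FLOAT,
                 recv, block, MPI_FLOAT, MPI_COMM_WORLD);

    /* As in the benchmark, a local "rewind" pass would follow to reorder the
     * received 1-D buffer into the task's new 3-D data cube slice. */
}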
54 COMPLEX data_cube[DIM1][DIM2][DIM3], doppler_cube[DIM1][DIM2][DIM3], twiddle[DOP_GEN]; flo a t local_max, max_element; parfor (k = 0; k < DIM3; k++) ( compute_twiddle_factor (out twiddle, in DOP_GEN); /* the FFT1 subetep */ for (j = 0; j < DIM2; j++) fft (in data_cube[.J[jj[k], In twiddle, out doppler_cube [. ] [ j ] [k]) ; /* the FFT2 substep */ for (i = 0; i < DIM1; i++) fft (lnout doppler_cube[i][.][k] , In twiddle); ) total_exchanae doppler_cube[.][.][k] to doppler_cube[. ] [ j ] [. ] ; parfor (j =0; j < DIM2; j++) { /* the FFT3 substep */ for (i = 0; i < DIM1; i++) fft (lnout doppler_cube[i][j]{ .], In twiddle); > /* the SRCH substep */ parfor (k = 0; k < DOP_GEN; k++) { COMPLEX localjnax; for (i = 0; i < DIM1; i++) for (j = 0; j < DIM2; j++) find_max (lnout local_max. In doppler_cube[i][j] [k]) ; > reduce local_maxs to max_element; Figure 5.8: Parallel FFT subprogram skeleton After the last set of FFTs are performed, each task searches its data cube slice for the element with the largest magnitude. As was done in the sorting subprogram, an m pc_reduce call is used to find the global maximum. 55 5.23 Vector Multiply Subprogram The vector multiply subprogram can be described as the following sequence of compute-interact steps: Compute: First set of vector multiplications Search data cube slice for local maximum Interact: Reduce to find global maximum Compute: Second set of vector multiplications Search data cube slice for local maximum Interact: Reduce to find global maximum Total exchange Compute: Third set of vector multiplications Search data cube slice for local maximum Interact: Reduce to find global maximum The mapping of the parallel algorithm and the data set onto the SP2 is shown in Figure 5.9. To help illustrate the structure of the parallel vector multiply subprogram, I have included the program skeleton in Figure 5.10. Like the sorting and FFT subprograms, the vector multiply subprogram loads the data cube from disk in parallel prior to any computation. 56 data cube Node 255 C ) search search search reduce to find global maximum ' I — T ...... C ) C VM2 J ( VM2 ) f | » Q search J C - r l O - - * £ search } -I____ r f c c T N T reduce to find global maximum " I--------- total exchange ) reduce to find global maximum search VM vector multiply search search for local maximum Figure 5.9: Mapping of the parallel vector multiply subprogram and data set onto the SP2 First, the subprogram performs DIM2xDIM3 DIM 1-element vector multiplications. After these multiplications, each task searches its data cube slice for the element with the largest magnitude, then aggregates the local maximums into a global maximum using the m pc_reduce operation. Next, the subprogram performs vector 57 parfor (k = 0; k < DIM3; k++> { /* th e VEC1 su b stsp * / fo r {j = 0; j < DIM2; j++) v ecjn u ltip ly_an d _jn ax( in d a ta _ c u b e [. ] [ j] [k], In VI, lnout lo c a lj n a x ) ; rodueo local_jnaxs to roax_elament1; /* th e VEC2 eu b step */ for ( i = 0; i < DIH1; i++) vec_jnult iply_and_jnax (In data_cube [ i ] [. ] [k], inV2, lnout lo ca l_ m a x ); rodueo local_m axs to raax_element2; total_orohango data_cube[.][.][k] to doppler_cube[. ] [ j ] [. ]; parfor (j = 0; j < DIM2; j++) { /* the VEC3 substep */ for (i = 0; i < DIM1; i++) vecjnultiply_and_jnax(In data_cube[ i] [ j ] [. ], lnV3, lnout local_max) ; reduce local_jnaxs to max_element3; Figure 5.10: Parallel vector multiply subprogram skeleton multiplications along the DLM2 dimension, and finds the global maximum again. 
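A minimal sketch of the per-vector computation in this subprogram is given below. The type and function names are illustrative, and the choice not to store the products back is an assumption; the benchmark's actual vec_multiply_and_max() may differ in these details.

typedef struct { float re, im; } cplx;    /* illustrative complex type */

/* Multiply vec (length n) element-wise by weights and update *max_power
 * with the largest |product|^2 encountered. */
static void vec_multiply_and_max(const cplx *vec, const cplx *weights, int n,
                                 float *max_power)
{
    for (int i = 0; i < n; i++) {
        float re = vec[i].re * weights[i].re - vec[i].im * weights[i].im;
        float im = vec[i].re * weights[i].im + vec[i].im * weights[i].re;
        float power = re * re + im * im;   /* power magnitude of the product */
        if (power > *max_power)
            *max_power = power;
    }
}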
Before vector multiplications can be done along the DIM3 dimension, the subprogram must perform a total exchange operation to redistribute the data cube. This operation is necessary because, prior to the total exchange, each task only has part of the data cube along the DIM3 dimension. After this data redistribution, each task can perform the vector multiplications on its portion of the data cube along the DIM3 dimension, and find the global maximum. 58 5.2.4 Linear Algebra Subprogram The linear algebra subprogram consists of only one computation step, and no interaction steps. As such, the only difference between the sequential and parallel subprograms is the fact that each task in the parallel subprogram performs the linear algebra steps on only a slice of the data cube. 53 Parallel Program User’s Guide In this section, 1 describe the files and procedures in the parallel program, and how to compile the parallel General program. A brief description of the input data file is also included. 53.1 The Parallel General Code The parallel program has been divided into the following files, listed below by subprogram: Sorting & FFT bench_fnarkwSORT. c bench_mark_SORT_FFT. c b u b b l« _ a o rt. c cm d _ lin e.c f f t .c read_input_SORT_FFT. c Vector Multiply bench_jnark_VEC. c cm d _lin*. c r*ad_input_VEC. c main o for sorting only___ main < ) for sorting and FFT b u b b le_ so rt () cm cLline () f f t ( ) , b it_ r e v * r s e () read_input_SORT_FFT () main { } for vector multiply cm d_Iin* () read_input_VEC () 59 Linear Algebra b®nch_jnark_LiN. c main o for linear algebra c iw i_ lin « .c crod_line () fo rb a ck . c forback (} h o u se. c house () read_input_L IN . c read_input_LIN {) The purpose of each procedure is briefly described below: Sorting & FFT bench_mark_SORT () bench_mark_SORT_FFT () b u b b le_ so r t () cm d _iin e () f f t () b it _ r e v e r s e () read_input_SORT_FFT {) This procedure represents the body of the parallel sorting subprogram. This procedure represents the body of the parallel sorting and FFT subprograms. This procedure performs a bubble sort, and also calculates the average value of the sorts. The data is sorted by the magnitude of the power. This procedure extracts the names of the data file and the vector file from the command line This procedure performs a single n-point in- place decimation-in-time complex FFT using n/2 complex twiddle factors. The implementation used in this procedure is a hybrid between the implementation in the original sequential version of this program and the implementation suggested by MHPCC. This modification was made to improve the performance of the FFT on the SP2. This procedure performs a simple but somewhat inefficient bit reversal. This procedure is used by fft (). This procedure reads the node’s slice of the data cube from disk, then converts the data from the packed binary integer form in which it is stored on disk into floating-point format. Vector Multiply benchmarkVEC O This procedure represents the body of parallel vector multiply subprogram. 60 cm d _lin e { ) read_input_VEC () Linear Algebra bench_mark_LIN () cm d _lin e (} forb ack () h o u se () read_input_L IN () This procedure extracts the names of the data file and the vector file from the command line. This procedure reads the node’s slice of the data cube from disk, then converts the data from the packed binary integer form in which it is stored on disk into floating-point format. This procedure represents the body of parallel linear algebra subprogram. 
This procedure extracts the names of the data file and the vectors file from the command line. This procedure performs a forward and back substitution on an input array using the steering vectors, and normalizes the returned solution vector. This procedure performs an in-place Householder transformation on a complex N x M input array, where M >= N. The results are returned in the same matrix as the input data. This procedure reads the node's slice of the data cube from disk, then converts the data from the packed binary integer form in which it is stored on disk into floating-point format.

Figures 5.11 through 5.14 show the calling relationship between the procedures for the subprograms.

Figure 5.11: Calling relationship between the procedures in the parallel sorting subprogram

Figure 5.12: Calling relationship between the procedures in the parallel sorting and FFT subprogram

Figure 5.13: Calling relationship between the procedures in the parallel vector multiply subprogram

Figure 5.14: Calling relationship between the procedures in the parallel linear algebra subprogram

The complete code listings for the IBM SP2 can be found in Appendix C.

5.3.2 The General Data Files

The data set for the General benchmark is a DIM1 x DIM2 x DIM3 cube. Nominally, DIM1 = 64, DIM2 = 128, and DIM3 = 1536. On disk, this data file is 96 MB. The original input data cube file gen_data.dat has been reordered and saved in several versions. Each of these reorderings is designed to optimize the process of reading the data cube from disk in parallel. The parallel vectors file is identical to the original vectors file gen_data.str.

5.3.3 Compiling the Parallel General Program

The individual subprograms which comprise the parallel General benchmark (sorting and FFT, vector multiply, and linear algebra) are compiled by calling the script files compile_SORT_FFT, compile_VEC, and compile_LIN, respectively. The script files execute the following shell commands:

compile_SORT_FFT:
mpcc -O3 -qarch=pwr2 -DGEN -DIBM -o gen bench_mark_SORT_FFT.c bubble_sort.c cmd_line.c fft.c read_input_SORT_FFT.c -lm

compile_VEC:
mpcc -O3 -qarch=pwr2 -DGEN -DIBM -o vec bench_mark_VEC.c cmd_line.c read_input_VEC.c -lm

compile_LIN:
mpcc -O3 -qarch=pwr2 -DGEN -DIBM -o lin bench_mark_LIN.c cmd_line.c forback.c house.c read_input_LIN.c -lm

The -qarch=pwr2 option directs the compiler to generate POWER2-efficient code. The -O3 option selects the highest level of compiler optimization available on the mpcc
This step is necessary because it is not possible to statically allocate arrays with their sizes defined by a variable in C. We have provided a set of header files with NN set to powers of two from 1 to 256. These header files are called d ef s . 001 through d e f s .256. 5J.4 Running the Parallel General Program The parallel General benchmark is invoked by calling the script file r u n . ###, where ### is the number of nodes on which the program will be run (and is equal to NN in d ef s . h). This script file executes the following shell command: po« prograni_nam« data_filename -procs ### -ua The argument program _nam e is the name of the subprogram to be run: gen (for the sorting and FFT subprogram), v ec (for the vector multiply subprogram), and l i n (for the linear algebra subprogram). The command poe distributes the parallel executable file to all nodes and invokes the parallel program. The command line argument d a ta _ f i lenam e is the name of the files containing the input data cube and the vectors. 64 The input data cube ftle must have the extension . d a t , while the vectors file must have the extension . s tr . The command line arguments -p ro c s and -u s are flags to poe. The flag -procs sets the number of nodes on which the program will be run. The flag -u s directs poe to use the User Space communication library. This library gives the best communication performance. We have provided a set of script flies to run the General benchmark, with the number of nodes equal to powers of two from 1 to 256. These files are called run. 001 through run. 256. The subprograms in the General benchmark do not have any real outputs; however, to confirm the correctness of the parallel algorithms, the program displays intermediate results on the screen. In addition, timing reports are displayed on the screen. Timing reports from the sort and FFT subprogram, the vector multiply subprogram, and the linear algebra subprogram are shown in Figures 5.15 through 5.17, respectively. In Figures 5.15 through 5.17, the timing entries ending with _user_m ax represent the largest amount of user CPU time spent by a node while performing that step, while the timing entries ending with _sys_m ax represent the largest amount of system CPU time spent by a node while performing that step. The Wall clo ck t iming entries indicate the amount of wall clock time spent on a given step, regardless of whether the time was spent by the program or by other programs. 65 0:*** CPU Timing information - numtask * 128 0: 0: a1l_us•r_max _ 2.38 s. all_sys_max = 0.06 s 0: d i s k_u s« r_jmax = 0.10 s. disk_sys_max = 0.02 s 0: maxl_ussr_max = 0.02 s. maxl_ays_max = 0.00 s 0: bafore_usar_jnax = 0.01 s, b*fore_By*_jnax = O.i00 0: sort_user_max = 0.75 S, 8ort_Bys_max = 0.00 s 0: sort_total_us«r_jnax = 0.02 s, sort_total_sys_rnax 0: f f t l_us*r_m*x 0.10 s, fftl_sys_m*x = 0.00 s 0: fft2_us*r_max = 0.19 s. 
f ft2_8ys_max * 0.01 s 0: f f 13 _us «rjsax S 0.19 s, f f 13_sys_max = 0.01 s Ot index_user_max = 0.17 s, index_sys_max = 0.00 s 0 : max2_user_max = 0.02 s, max2_sys_max = 0.01 a 0: 0: after_us«r _jnax B 0.01 s, after_syB_max = 0.00 s 0:*** Wall Clock Timing information - numtask = 128 u: 0: all_wall 2529295 us 0: task_wal1 = 502 us 0: disk_wall = 1631067 us 0: maxl_wall = 32853 us 0: b«fore_wall = 3375 us 0: sort_wall = 165641 us 0: Bort_total_wall = 12129 us 0: fftl_wall b 110256 us 0: fft2_wall s 199256 us 0: fft3_wall S 215477 us 0: index_wal1 B 171785 us 0: max2_wall = 32516 us 0: aft«r_wall = 10493 us Figure 5.15: Timing report from the parallel sort and FFT subprogram 0:*** CPU Timing information - numtask = 128 0: 0: all_user_max = 1.96 s, all_sys_roax = 0.05 s 0: disk_user_max = 0.29 a, disk_syB_max = 0.05 a 0; vbc l_u b e r_jmax = 8084644.00 a, v«cl_sys_max = 8084547 0: vbc2_u b « r_max = 0.77 a, v«c2_sys_max = 0.00 8 0! vbc3 _ u b e r_max = 0.06 a, v*c3_sy0_max = 0.00 s 0: ft . ind«x_user_max = 0.17 a, ind«x_sy8_max = 0.00 a 0t*** Wall Clock Timing information - numtask A . = 128 0: all_wall = 2156784 us 0: task_wall = 507 us Ot disk_wall = 1876482 us Ot v«cl_wall = 1450517 us Ot vsc2_wall = 58877 us 0: v«c3_wall = 50551 us Ot ind«x_wall = 182361 us Figure 5.16: Timing report from the parallel vector multiply program .00 s 00 s 66 0:*** CPU Timing information - numtask = 16 0 : Os all_user_jnax = 7.13 s, all_sys_max = 0.13 s 0: disk_user_max = 0.36 s, disk_sys_max = 0.13 a 0: lin_ussr_jnax = 6.82 a, lin_sys_max = 0.01 a 0i 0:*** Wall Clock Timing information - numtask = 16 0 : 0s all_wall * 7440370 us 0: task_wall = 30867 us 0s disk_wall = 1766285 us 0: lin_wall = 5566332 us Figure 5.17: Timing report from the parallel linear algebra program The steps we timed are listed below: sort and FFT subprogram: all disk maxi before sort sorttotal fftl f ft2 f ft3 index max2 after This component is the end-to-end execution time of the entire subprogram. This time is not equal to the sum of the times of the other steps, because this all time includes time spent waiting for synchronization not included in the individual steps. This component is the time to read the node's portion of the data cube from disk. This component is the time spent searching the local portion of the data cube for the largest element. This component is the time spent performing a reduction operation to find the global maximum. This component is the time spent performing bubble sorts on the local portion of the data cube. This component is the time spent performing a reduction operation to collect the totals from all the nodes. This component is the time spent performing the first FFT step. This component is the time spent performing the second FFT step. ___ This component is the time spent performing the third FFT step. This component is the time spent performing the total exchange operation. This component is the time spent searching the local portion of the data cube for the largest element after the FFTs. This component is the time spent performing a reduction operation to find the global maximum. 67 vector multiply subprogram: all This component is the end-to-end execution time of the entire subprogram. 
This time is not equal to the sum of the times of the other steps, because this all time includes time spent waiting for synchronization not included in the individual steps, disk This component is the time to read the node's portion of the data cube from disk, veci This component is the time spent performing the first vector multiply step. vec2 This component is the time spent performing the second vector multiply step. vec3 This component is the time spent performing the third vector multiply step, index This component is the time spent performing the total exchange operation. linear algebra subprogram: all This component is the end-to-end execution time of the entire subprogram. This time is not equal to the sum of the times of the other steps, because this all time includes time spent waiting for synchronization not included in the individual steps, disk This component is the time to read the node's portion of the data cube from disk, lin This component is the time spent performing all the linear algebra computation steps. The time spent in these steps etc.) are combined to determine the total execution time of the parallel General benchmark (see Equations 6.5 through 6.7 in Chapter 6). As was the case in the APT and HO-PD programs, we used both the CPU time and the wall clock time to more accurately measure the execution time of each step. To write these reports to a file, use standard UNIX output redirection: run.### > out_filename 68 6 Parallel STAP Performance Results In this chapter, I describe our experimental setup and summarize the performance of the STAP benchmarks on the SP2. Details of the STAP benchmark performance results on the SP2 weie reported in [Hwan95a, Hwan95b, Hwan95c], 6.1 Experimental Setup on the SP2 In our project, we used the 400-node IBM SP2 located at the Maui High Performance Computing Center (MHPCC). We accessed this machine from our location at the University of Southern California (Los Angeles, California), by using the UNIX commands te ln e t and ftp. The bulk of our benchmark experiments were conducted between December 9 and December 13, 1994. During this period, we were given dedicated use of 268 nodes on the MHPCC SP2: 76 wide nodes and 192 thin nodes. The wide nodes had between 64 and 256 MB of main memory each, while the thin nodes had 64 MB of main memory each. It is not possible to give exclusive use of the HPS to any one user (without giving that user the entire 400-node machine), and we therefore had to share the HPS with other users. Because there were only 132 nodes remaining, we did not expect the shared status of the HPS to be a serious problem. We encountered many technical problems during our benchmarking experiments. We found that the individual nodes are somewhat unreliable. During the four days we had dedicated use of 268 nodes, several nodes were not operational. Interference from other 69 users resulted in an appreciable degradation of the HPS' performance on several occasions. Routers on the link between Maui and Los Angeles were not infallible: a router on one of the two paths between Maui and Los Angeles failed for two days during the middle of our dedicated runs, reducing the data transfer rate from a peak of 100 Mbits/sec to 10 Kbits/sec. We also found that the performance of the HPS and communication operations was greatly improved if we “primed" the switch by running our programs several times in rapid succession. 
In collecting our performance data, we executed each program at least ten times, then used the smallest execution time in our analysis. The run with the least interference from the operating system and other users results in the lowest execution time, and is therefore most representative of the true performance of the SP2. 6,2 Performance of the APT Benchmark In this section, I present and analyze the performance results from the parallel APT benchmark runs on the SP2. Figure 6.1 shows the overall execution time of the program as a function of the machine size (the machine size is equal to the number of nodes used). This overall execution time is the sum of the execution times of the component steps: L h * . ~ T g t + T *kM + + T r - r f i + T bc~i + T .»p\ + T . * , 2 + T c fv + T ^ r o n (6.1) 70 The computation portion of the execution time is the sum of the execution times for the FFTs (7^), the beamforming steps ( + TtH pl), and the cell-averaging CFAR step ( T ^r) (see Figure 6.2). The ovethead portion of the execution time is the sum of the execution times of the total exchange operation (7 ^ .), the rewind step ( Trrnini). the formation of the Householder matrix {TpvtU l), the broadcast operations ( 7 ^ ) , and the target report reduction step ( Tnrort) (see Figure 6.3). We were able to reduce the overall execution time of the APT program to 0.16 seconds on 256 nodes. ■ overhead ■ computation to - I 128 2 32 6 4 256 1 4 8 16 Machine Size Figure 6.1: Overall execution time of the parallel APT program Because most of the computational workload in the APT program is fully parallelizable, doubling the number of nodes halved the computation time. Only the execution time for the first beamforming step (BF 1) remained constant, as it was the only computation step which was not parallelized. On 256 nodes, the computation took 0.09 seconds. 71 0 FFT BF 1 BF 2 CFAR 8 16 32 64 128 256 Machine Size Figure 6.2: Breakdown of the computation time of the parallel APT program 0.6 0.5 3 0.4 H 1 P 0.3 H □ total exchange rewind ■ matrix forming ■ broadcast ■ target report reduction Machine Size Figure 6.3: Breakdown of the overhead of the parallel APT program 72 The total exchange operation constitutes most of the overhead. As the total exchange time itself is dominated by the time spent actually sending the messages, this portion of the overhead decreased with increasing machine size. The size of the individual messages sent during this operation is inversely proportional to the square of the machine size. In the rewind step, the amount of data to be rewound is equal to one task’s slice of the data cube, and is therefore inversely proportional to the machine size. Thus, doubling the number of nodes halved the rewind time. The broadcast time increased with machine size, as the size of the message being broadcast remained constant, and the time required per byte broadcast increased. On 256 nodes, the overhead was nearly as large as the computation time. The APT program’s workload consists of 1.45 billion floating-point operations. By dividing this workload by the total execution time, we can calculate the sustained floating-point processing rate: , „ » Floating-point workload ___ Sustained Processing Rate = ------------ (6.2) ' U fC U fftO * Figure 6.4 shows the sustained processing rate of the SP2 on the parallel APT program as a function of machine size. 
We define the system efficiency as the ratio between the sustained processing rate and the peak processing rate:

System Efficiency = Sustained Processing Rate / Peak Processing Rate    (6.3)

Figure 6.4: Sustained processing rate of the SP2 on the parallel APT program (GFLOPS vs. machine size)

The peak processing rate is the product of the number of nodes and the peak processing rate of a single node, which, for the SP2, is 266 MFLOPS. Figure 6.5 shows the system efficiency achieved by the SP2 on the parallel APT program as a function of machine size. Because the overhead decreases only up to 64 nodes, and actually increases for 128 and 256 nodes, the sustained processing rate and the system efficiency suffer. The system efficiency is at or above 30% for up to 64 nodes, but falls sharply beyond that machine size.

Figure 6.5: System efficiency of the SP2 on the parallel APT program (percent of peak vs. machine size)

Numerical tabulations of these performance results are given in Tables 6.1 and 6.2. In Table 6.1, the Wall Clock Time is the end-to-end execution time of the program, where the timer is started immediately after the data set is loaded from disk, and the timer is stopped after the final target report is produced. The Total Time is the sum of the times of all individual computation and communication steps. In Table 6.2, the DP step is the Doppler Processing or FFT step, BF 1 and BF 2 are the first and second beamforming steps respectively, the Index step corresponds to the total exchange operation, the Bcast step is the broadcast operation, and the Reduce step is the target list reduction operation.

Table 6.1: Performance of the Parallel APT Program

Machine Size | Wall Clock Time (seconds) | Total Time (seconds) | Sustained Processing Rate (GFLOPS) | Efficiency (%)
1   | 14.68 | 14.09 | 0.10 | 38.5
2   |  7.68 |  7.66 | 0.19 | 35.4
4   |  3.80 |  3.79 | 0.38 | 35.8
8   |  1.93 |  1.91 | 0.76 | 35.5
16  |  1.22 |  1.00 | 1.44 | 33.9
32  |  0.56 |  0.53 | 2.71 | 31.9
64  |  0.33 |  0.29 | 5.05 | 29.7
128 |  0.25 |  0.21 | 7.03 | 20.7
256 |  0.22 |  0.16 | 8.95 | 13.1

Table 6.2: Breakdown of Parallel APT Execution Time in Seconds

Machine Size | DP | BF 1 | BF 2 | CFAR | Index | Bcast | Reduce
1   | 4.05   | 0.038 | 9.44  | 0.56   | 0.00  | 0.00   | 0.00
2   | 2.01   | 0.038 | 4.77  | 0.28   | 0.56  | 0.0035 | 0.00035
4   | 1.00   | 0.038 | 2.39  | 0.13   | 0.22  | 0.0065 | 0.00067
8   | 0.49   | 0.038 | 1.19  | 0.066  | 0.12  | 0.0092 | 0.00096
16  | 0.24   | 0.637 | 0.60  | 0.034  | 0.077 | 0.0097 | 0.0012
32  | 0.12   | 0.038 | 0.30  | 0.013  | 0.048 | 0.011  | 0.0015
64  | 0.036  | 0.038 | 0.15  | 0.0067 | 0.035 | 0.019  | 0.0019
128 | 0.019  | 0.038 | 0.076 | 0.0035 | 0.042 | 0.024  | 0.0034
256 | 0.0098 | 0.040 | 0.039 | 0.0019 | 0.039 | 0.028  | 0.0042

6.3 Performance of the HO-PD Benchmark

In this section, I present and analyze the performance results from the parallel HO-PD benchmark runs on the SP2. Figure 6.6 shows the overall execution time of the program as a function of machine size. The overall execution time is the sum of the execution times of the component steps:

T_{execution} = T_{fft} + T_{index} + T_{rewind} + T_{shift} + T_{steer} + T_{bf} + T_{cfar} + T_{report}    (6.4)

The computation portion of the execution time is the sum of the execution times for the FFTs (T_{fft}), the steering vector formation step (T_{steer}), the beamforming step (T_{bf}), and the cell-averaging CFAR step (T_{cfar}). This breakdown of the computation time is shown in Figure 6.7. The overhead portion of the execution time is the sum of the execution times of the total exchange operation (T_{index}), the rewind step (T_{rewind}),
the circular shift operation (T_{shift}), and the target report reduction step (T_{report}). This breakdown of the overhead is shown in Figure 6.8. We were able to reduce the overall execution time of the HO-PD program to 0.56 seconds on 256 nodes.

Figure 6.6: Overall execution time of the parallel HO-PD program (computation and overhead vs. machine size)

Figure 6.7: Breakdown of the computation time of the parallel HO-PD program (FFT, steering vector formation, BF, and CFAR vs. machine size)

Figure 6.8: Breakdown of the overhead of the parallel HO-PD program (total exchange, rewind, circular shift, and target report reduction vs. machine size)

All of the computational workload in the HO-PD program is fully parallelizable. Therefore, doubling the machine size halved the computation time. On 256 nodes, the computation took 0.44 seconds. As in the parallel APT program, the overhead in the HO-PD program is dominated by the total exchange operation. Once again, the size of the messages sent is inversely proportional to the square of the machine size, and this is the primary reason why the total exchange time decreases with increasing machine size. The amount of data to be rewound in the rewind step is inversely proportional to the machine size, and the rewind time decreases accordingly. In the circular shift operation, the message size remains constant. The time required per byte shifted increases only marginally with machine size, and thus the circular shift time remained relatively constant.

The HO-PD program's workload consists of 12.85 billion floating-point operations. From this workload, we can calculate the sustained floating-point processing rate, as well as the system efficiency (see Equations 6.2 and 6.3). The sustained floating-point processing rate is plotted as a function of machine size in Figure 6.9. The system efficiency is plotted in Figure 6.10. Because the overhead represents such a small fraction of the total execution time (between 1% and 21%), the sustained processing rate grows nearly linearly with machine size, and the system efficiency remains almost constant.

Figure 6.9: Sustained processing rate of the SP2 on the parallel HO-PD program (GFLOPS vs. machine size)

Figure 6.10: System efficiency of the SP2 on the parallel HO-PD program (percent of peak vs. machine size)

Numerical tabulations of these performance results are given in Tables 6.3 and 6.4. In Table 6.4, the Steer step is the formation of the space-time steering vectors, and Shift corresponds to the circular shift operations.
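Equations 6.2 and 6.3 are applied in the same way to every program and subprogram in this chapter. The helper below is only an illustration of that arithmetic; the function names are ours, and the 266 MFLOPS per-node peak is the figure used in Equation 6.3. The example numbers are the HO-PD workload and 256-node total time quoted above, so any small difference from Table 6.3 is due to rounding of the inputs.

#include <stdio.h>

#define PEAK_MFLOPS_PER_NODE 266.0   /* SP2 per-node peak used in Equation 6.3 */

/* Equation 6.2: sustained rate in GFLOPS from workload (operations) and time (s). */
static double sustained_gflops (double workload_flop, double exec_time_s)
{
    return workload_flop / exec_time_s / 1.0e9;
}

/* Equation 6.3: system efficiency in percent for a given machine size. */
static double system_efficiency (double gflops, int nodes)
{
    return 100.0 * gflops / (nodes * PEAK_MFLOPS_PER_NODE / 1000.0);
}

int main (void)
{
    /* HO-PD: 12.85 billion operations in 0.56 s on 256 nodes (Table 6.3). */
    double rate = sustained_gflops (12.85e9, 0.56);

    printf ("rate = %.2f GFLOPS, efficiency = %.1f%%\n",
            rate, system_efficiency (rate, 256));
    return 0;
}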
Table 6.3: Performance of the Parallel HO-PD Program

Machine Size | Wall Clock Time (seconds) | Total Time (seconds) | Sustained Processing Rate (GFLOPS) | System Efficiency (%)
1   | 130.95 | 129.36 | 0.10  | 37.4
2   |  65.85 |  65.67 | 0.20  | 36.8
4   |  33.09 |  33.08 | 0.39  | 36.5
8   |  16.60 |  16.56 | 0.78  | 36.5
16  |   8.43 |   8.38 | 1.54  | 36.1
32  |   4.35 |   4.25 | 3.03  | 35.6
64  |   2.36 |   2.11 | 6.10  | 35.8
128 |   1.39 |   1.14 | 11.30 | 33.2
256 |   0.71 |   0.56 | 23.00 | 33.8

Table 6.4: Breakdown of Parallel HO-PD Execution Time in Seconds

Machine Size | DP | Steer | BF | CFAR | Index | Shift | Reduce
1   | 11.53 | 0.0037 | 117.66 | 0.17  | 0.00 | 0.00 | 0.00
2   |  5.7^ | 0.0037 |  58.75 | 0.08  | 1.07 | 0.05 | 0.0003
4   |  2.87 | 0.0037 |  29.46 | 0.04  | 0.65 | 0.05 | 0.0006
8   |  1.38 | 0.0037 |  14.75 | 0.017 | 0.36 | 0.05 | 0.0009
16  |  0.69 | 0.0037 |   7.40 | 0.009 | 0.22 | 0.05 | 0.0012
32  |  0.35 | 0.0037 |   3.71 | 0.005 | 0.13 | 0.05 | 0.0015
64  |  0.10 | 0.0037 |   1.85 | 0.002 | 0.09 | 0.06 | 0.0018
128 |  0.05 | 0.0037 |   0.93 | 0.001 | 0.07 | 0.07 | 0.0021
256 |  0.03 | 0.0037 |   0.41 | o.b66 | 0.07 | 0.04 | 0.0044

6.4 Performance of the General Benchmark

In this section, I present and analyze the performance results from the parallel General benchmark runs on the SP2. The four subprograms (sorting, FFT, vector multiply, and linear algebra) will be treated separately.

6.4.1 Sorting Subprogram

Figure 6.11 shows the overall execution time of this subprogram as a function of the machine size. The overall execution time is the sum of the execution times of the component steps. The steps in the sorting subprogram are search, sort, reduce 1, and reduce 2:

T_{execution} = T_{search} + T_{sort} + T_{reduce1} + T_{reduce2}    (6.5)

We were able to reduce the execution time of this subprogram down to 0.12 seconds on 256 nodes. The sorting subprogram's workload consists of 1.18 billion floating-point operations. From this workload, we can calculate the sustained floating-point processing rate, as well as the system efficiency. The sustained floating-point processing rate is plotted as a function of machine size in Figure 6.12. The system efficiency is plotted in Figure 6.13.

Figure 6.11: Breakdown of the execution time of the sorting subprogram (search, sorting, reduce 1, and reduce 2 vs. machine size)

Figure 6.12: Sustained processing rate of the SP2 on the parallel sorting subprogram

Figure 6.13: System efficiency of the SP2 on the parallel sorting subprogram

The time to execute the two reduction steps is small relative to the total execution time. This low overhead leads to a nearly constant system efficiency and linear growth of the sustained processing rate for up to 16 nodes. For larger machine sizes, the overhead constitutes a growing proportion of the total execution time, and the system efficiency suffers somewhat. Numerical tabulations of these performance results are given in Tables 6.5 and 6.6. In Table 6.6, the Search step corresponds to the local data set search, and the Reduce component is the sum of the two reduction operations to find the global maximum and global total.
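The global-maximum and global-total reductions referred to above use the aggregated MPL call mpc_reduce, the same call the timing code in Appendix A uses to collect per-step maxima across nodes. The sketch below shows only the global-maximum half, with the predefined s_vmax combining function; the function name global_max and its error handling are ours, and allgrp is the all-tasks group handle obtained as in Appendix A.

#include <stdio.h>
#include <stdlib.h>
#include <mpproto.h>

/*
 * Reduce each node's partial maximum to a global maximum on task 0, using
 * mpc_reduce () with the predefined single-precision vector-maximum function
 * s_vmax, exactly as the timing code in Appendix A does.
 */
float global_max (float local_max, int allgrp)
{
    float result = local_max;

    if (mpc_reduce (&local_max, &result, sizeof (float), 0, s_vmax, allgrp) == -1) {
        printf ("Error - unable to call mpc_reduce.\n");
        exit (-1);
    }
    return result;   /* valid on task 0, the root of the reduction */
}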
Table 6.5: Performance of the Parallel Sorting Subprogram

Machine Size | Total Time (seconds) | Sustained Processing Rate (GFLOPS) | System Efficiency (%)
1   | 22.8 | 0.05 | 19.5
2   | 11.2 | 0.12 | 21.8
4   | 5.11 | 0.23 | 21.8
8   | 2.56 | 0.46 | 21.7
16  | 1.28 | 0.92 | i n
32  | 0.66 | 1.79 | 21.0
64  | 0.38 | 3.31 | 19.5
128 | 0.19 | 6.44 | 18.6
256 | 0.12 | 10.2 | 14.9

Table 6.6: Breakdown of Parallel Sorting Subprogram Execution Time in Seconds

Machine Size | Search | Sort | Reduce
2   | 0.695 | 9.490 | 0.000
4   | 0.350 | 4.760 | 0.000
8   | 0.180 | 2.380 | 0.001
16  | 0.091 | 1.190 | 0.001
32  | 0.050 | 0.610 | 0.001
64  | 0.027 | 0.329 | 0.001
128 | 0.019 | 0.167 | 0.002
256 | 0.010 | 0.092 | 0.013

6.4.2 FFT Subprogram

Figure 6.14 shows the overall execution time of this subprogram as a function of the machine size. The overall execution time is the sum of the execution times of the component steps. The steps in the FFT subprogram are FFT, total exchange, search, and reduction:

T_{execution} = T_{fft1} + T_{fft2} + T_{fft3} + T_{index} + T_{search} + T_{reduce}    (6.6)

We were able to reduce the execution time of this subprogram down to 0.42 seconds on 256 nodes.

Figure 6.14: Breakdown of the execution time of the FFT subprogram (FFT, total exchange, search, and reduction vs. machine size)

The FFT subprogram's workload consists of 1.91 billion floating-point operations. From this workload, we can calculate the sustained floating-point processing rate, as well as the system efficiency (see Equations 6.2 and 6.3). The sustained floating-point processing rate is plotted as a function of machine size in Figure 6.15. The system efficiency is plotted in Figure 6.16. In an absolute sense, both the sustained processing rate and the system efficiency are poor. However, this lack of performance is not a result of an ineffective parallelization, but is due to the poor performance of the original sequential subprogram we parallelized. As Figure 6.14 shows, the total execution time continues to decrease with increasing machine size, almost linearly for smaller machine sizes. From this observation, we can conclude that our parallelization was effective.

Figure 6.15: Sustained processing rate of the SP2 on the parallel FFT subprogram

Figure 6.16: System efficiency of the SP2 on the parallel FFT subprogram

Numerical tabulations of these performance results are given in Tables 6.7 and 6.8. In Table 6.8, the FFT component is the sum of the execution times for all three FFT steps.

Table 6.7: Performance of the Parallel FFT Subprogram

Machine Size | Total Time (seconds) | Sustained Processing Rate (GFLOPS) | System Efficiency (%)
1   | 79.1 | 0.02 | 9.06
2   | 32.1 | 0.06 | 11.2
4   | 16.3 | 0.12 | 11.0
8   | 8.22 | 0.23 | 10.9
16  | 4.21 | 0.45 | 10.7
32  | 2.16 | 0.88 | 10.4
64  | 1.15 | 1.66 | 9.74
128 | 0.67 | 2.86 | 8.40
256 | 0.42 | 4.54 | 6.67

Table 6.8: Breakdown of Parallel FFT Subprogram Execution Time in Seconds

Machine Size | FFT | Search | Index | Reduce
2   | 28.80 | 0.92 | 2.^5 | 0.0001
4   | 14.40 | 0.46 | 1.43 | 0.0002
8   |  7.21 | 0.23 | 0.78 | 0.0003
16  |  3.61 | 0.12 | 0.48 | 0.0004
32  |  1.83 | 0.06 | 0.27 | 0.0005
64  |  0.94 | 0.03 | 0.18 | 0.0006
128 |  0.48 | 0.02 | 0.17 | 0.0008
256 |  0.25 | 0.02 | 0.15 | 0.0047

6.4.3 Vector Multiply Subprogram

Figure 6.17 shows the overall execution time of this subprogram as a function of the machine size. The overall execution time is the sum of the execution times of the component steps. The steps in the vector multiply subprogram are vector multiply and total exchange:

T_{execution} = T_{vec1} + T_{vec2} + T_{vec3} + T_{index}    (6.7)

We were able to reduce the execution time of this subprogram down to 0.25 seconds on 256 nodes.
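For reference, a fully data-parallel vector multiply step over a node's local slice has the shape sketched below. Whether the benchmark's three vector multiply steps are element-wise complex products is an assumption on our part; the sketch is only meant to show a step that involves no communication until the total exchange. The COMPLEX structure (a real and an imaginary float) matches the one used throughout Appendix A.

/* Illustrative element-wise complex multiply, c[i] = a[i] * b[i], over a
 * node's local slice of n samples.  No message passing is involved here;
 * the total exchange in Equation 6.7 is the only communication step. */
typedef struct { float real, imag; } COMPLEX;

void vector_multiply_step (const COMPLEX *a, const COMPLEX *b, COMPLEX *c, int n)
{
    int i;

    for (i = 0; i < n; i++) {
        c[i].real = a[i].real * b[i].real - a[i].imag * b[i].imag;
        c[i].imag = a[i].real * b[i].imag + a[i].imag * b[i].real;
    }
}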
The vector multiply subprogram's workload consists of 0.60 billion floating-point operations. From this workload, we can calculate the sustained floating-point processing rate, as well as the system efficiency (see Equations 6.2 and 6.3). The sustained floating-point processing rate is plotted as a function of machine size in Figure 6.18. The system efficiency is plotted in Figure 6.19. Like the FFT subprogram, the sustained processing rate and the system efficiency of the vector multiply subprogram are poor because the original sequential version was slow.

Figure 6.17: Breakdown of the execution time of the vector multiply subprogram (vector multiply and total exchange vs. machine size)

Figure 6.18: Sustained processing rate of the SP2 on the parallel vector multiply subprogram

Figure 6.19: System efficiency of the SP2 on the parallel vector multiply subprogram

Numerical tabulations of these performance results are given in Tables 6.9 and 6.10. In Table 6.10, the Vector Multiply component is the sum of the three vector multiply steps.

Table 6.9: Performance of the Parallel Vector Multiply Subprogram

Machine Size | Total Time (seconds) | Sustained Processing Rate (GFLOPS) | System Efficiency (%)
1   | 19.1 | 0.03 | 11.9
2   | 8.18 | 0.07 | 13.9
4   | 4.34 | 0.14 | 13.1
8   | 2.25 | 0.27 | 12.6
16  | 1.21 | 0.50 | 11.7
32  | 0.66 | 0.91 | 10.7
64  | 0.47 | 1.29 | 7.55
128 | 0.29 | 2.06 | 6.06
256 | 0.25 | 2.45 | 3.60

Table 6.10: Breakdown of Parallel Vector Multiply Subprogram Execution Time in Seconds

Machine Size | Vector Multiply | Index
2   | 5.96 | 2.22
4   | 2.97 | 1.37
8   | 1.51 | 0.74
16  | 0.75 | 0.46
32  | 0.41 | 0.25
64  | 0.30 | 0.17
128 | 0.12 | 0.17
256 | 0.09 | 0.15

6.4.4 Linear Algebra Subprogram

Figure 6.20 shows the overall execution time of this subprogram as a function of the machine size. The overall execution time is the sum of the execution times of the component steps. As indicated in Chapter 5, this subprogram consists of only the single linear algebra computation step. Because we exploited data parallelism in this subprogram, we are limited by the degree of parallelism (DOP) available in the input data set. For this subprogram running on the nominal data set, the maximum DOP is 32. We were able to reduce the execution time of this subprogram down to 0.61 seconds on 32 nodes.

Figure 6.20: Overall execution time of the linear algebra subprogram

The linear algebra subprogram's workload consists of 1.60 billion floating-point operations. From this workload, we can calculate the sustained floating-point processing rate, as well as the system efficiency (see Equations 6.2 and 6.3). The sustained floating-point processing rate is plotted as a function of machine size in Figure 6.21. The system efficiency is plotted in Figure 6.22. Because this entire subprogram is fully parallelizable, with no overhead, we expect the execution time to be halved as the machine size is doubled. Figure 6.22 clearly indicates that the total execution time is inversely proportional to machine size, as the system efficiency is nearly perfectly constant.

Figure 6.21: Sustained processing rate of the SP2 on the parallel linear algebra subprogram

Figure 6.22: System efficiency of the SP2 on the parallel linear algebra subprogram

Numerical tabulations of these performance results are given in Table 6.11.
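Written out with the definitions of Equations 6.2 and 6.3, and assuming zero overhead as stated above (with W denoting the 1.60-billion-operation workload), the constant efficiency follows directly:

T_{lin}(n) ≈ T_{lin}(1) / n,  so  Efficiency(n) = [W / T_{lin}(n)] / (n × 266 MFLOPS) ≈ W / (T_{lin}(1) × 266 MFLOPS) = constant,  for 1 ≤ n ≤ 32.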
Table 6.11: Performance of the Parallel Linear Algebra Subprogram

Machine Size | Total Time (seconds) | Sustained Processing Rate (GFLOPS) | System Efficiency (%)
1  | 21.2 | 0.08 | 31.0
2  | 9.73 | 0.17 | 32.2
4  | 4.84 | 0.34 | 31.4
8  | 2.43 | 0.69 | 31.2
16 | 1.22 | 1.37 | 32.2
32 | 0.61 | 2.75 | 32.3

6.5 Scalability Analysis

The scalability of an algorithm is an important issue when studying a problem. In this section, I informally consider the scalability of the STAP benchmarks on the SP2 from three angles: scalability with respect to machine size, scalability with respect to problem size, and isoefficiency.

6.5.1 Scalability with Respect to Machine Size

The STAP benchmarks are not very scalable with respect to machine size on the SP2. The maximum number of nodes we can use is limited by the size of the data set, because we used a data-parallel approach. Some steps may exhibit a greater maximum degree of parallelism due to their greater degree of data independence. For example, in the APT benchmark running on the nominal data set, the Doppler processing step has a maximum degree of parallelism of 8,192 (256 x 32), but the second beamforming substep and the cell-averaging CFAR step have a maximum degree of parallelism of 256. The maximum degree of parallelism we can exploit is further constrained by the high message-passing overhead associated with the SP2: this high overhead limits us to a coarse-grained parallelization approach, and we cannot divide the computational portion of the program into finer-grained steps. In the APT benchmark, our parallelization strategy and the size of the data set limit the maximum number of nodes which can be used to 256. Larger data sets may allow more nodes to be used simultaneously.

6.5.2 Scalability with Respect to Problem Size

The STAP benchmarks are scalable with respect to problem size. Most of the computation performed in the STAP benchmarks is done in parallel, and increasing the dimensions of the data cube leads directly to an increase in the maximum degree of parallelism. The APT benchmark is the only program with a sequential component: the first beamforming substep. The workload in this sequential step is proportional to the square of the number of channels in the data set, but independent of the number of pulse repetition intervals or range gates. Furthermore, the number of floating-point operations performed in the first beamforming substep is nearly two orders of magnitude smaller than the total number of floating-point operations performed by the APT benchmark on a nominal data set, and is therefore relatively insignificant.

The time spent on message-passing operations also grows no faster than the problem size. In the total exchange operation performed by all three parallel benchmarks, the message size is proportional to the problem size, and the total exchange time is proportional to the message size. This property is true of the broadcast operation (performed by the APT benchmark) and the circular shift operation (performed by the HO-PD benchmark).

6.5.3 Isoefficiency

In [Kuma94], Kumar et al. define a scalable parallel system (consisting of the program and the computer) as one in which the efficiency can be kept fixed as the number of processors is increased, provided that the problem size is increased. The isoefficiency function of a parallel system defines how quickly the problem size must grow with respect to machine size in order to maintain constant efficiency. If the isoefficiency function is small (e.g.
Θ(n log n), where n is the number of processors), then the system is considered scalable, whereas a large isoefficiency function (e.g. Θ(n^4)) indicates a poorly scalable system. The problem size is defined as the number of basic computation operations necessary to solve a problem. By normalizing the time to perform one basic computation to one unit, the problem size is equal to the fastest sequential execution time. The overhead function of a parallel system is defined as the part of its cost, or processor-time product, that is not incurred by the fastest known serial algorithm. It is equal to the total time collectively spent by all the processors in addition to that required by the fastest known sequential algorithm for solving the same problem on a single processor.

Because the isoefficiency of a system is determined by considering only dominant terms as the problem size grows to infinity, it is not appropriate for analyzing the STAP benchmarks. Preliminary isoefficiency studies of the STAP benchmarks have proven useless because the dominant terms do not dominate the entire expression until the problem size grows well outside the range of real problem sizes. The problem size of one component of a benchmark may be masked by the problem size of another component because the latter component's problem size may have a higher-order term but, for our range of parameters, is numerically smaller.

7 Conclusions

The STAP benchmarks performed very well on the IBM SP2. On the parallel HO-PD program, the SP2 achieved 23 GFLOPS over nearly 13 billion floating-point operations. However, the sustained processing rates and the system efficiencies varied quite a bit. The overall performance of the SP2 showed a strong dependence on the computation-to-communication ratio of the program being run. Dr. Xu first discovered the correlation between the computation-to-communication ratio and the system efficiency of the program. Programs with a large amount of communication to be performed per unit of computation tended to have worse system efficiencies, as shown in Table 7.1 below.

Table 7.1: Computation-to-Communication Ratio vs. System Efficiency

Program | Computation Workload (x10^6 floating-point operations) | Aggregated Message Length in the Total Exchange Operation (MB) | Computation-to-Communication Ratio (10^6 floating-point operations per MB) | Efficiency using 256 nodes (%)
APT                      |  1,443 |  16.78 | 86  | 13
HO-PD                    | 12,852 |  50.33 | 255 | 34
General: FFT             |  1,610 | 100.66 | 16  | 7
General: Vector Multiply |    604 | 100.66 | 6   | 4

These results confirmed our belief that this message-passing overhead will be the primary bottleneck for parallel programs running on the SP2. The startup latency for a total exchange operation on 256 nodes (640 microseconds) alone is enough time for those 256 nodes to perform up to 43.6 million floating-point operations.

The SP2 has the potential to be a real-time processor, as its processing rate is high and its performance is rather consistent. However, the message-passing overheads must be improved. One possible way is to replace the IBM AIX operating system with a real-time operating system. AIX is loaded onto every node of the SP2. Because AIX is a general-purpose time-sharing operating system with many features that are unnecessary for, or detrimental to, the performance of a real-time system, it is very large: approximately 28 MB. In a final implementation of the SP2 as a real-time computer, the real-time application will have dedicated use of the machine. For many supercomputer applications with relatively small communication requirements (e.g.
Monte Carlo simulations), the SP2's very high floating-point processing rate will provide excellent performance despite the high message-passing overhead.

Our experiences and initial difficulties in parallelizing the STAP benchmarks led us to the parallel software development methodology based on early performance prediction outlined in [Xu95b]. This project will be extended to port these benchmarks to other massively parallel processors. At this time, the team is porting the parallel STAP benchmarks onto the message-passing Intel Paragon. The same benchmark suite will be modified later for porting onto the shared-memory Cray T3D/T3E. These follow-up research efforts are not within the scope of the research results reported in this thesis.

Bibliography

[Hwan95a] K. Hwang, Z. Xu, M. Arakawa, "STAP Benchmark Performance on the IBM SP2 Massively Parallel Processor", Proceedings of the Adaptive Sensor Array Processing Workshop 1995, MIT Lincoln Laboratory, March 15-17, 1995, pp. 75-91.
[Hwan95b] K. Hwang, Z. Xu, M. Arakawa, "STAP Benchmark Performance of the IBM SP2 for Real-Time Signal Processing", submitted to ACM/IEEE Supercomputing Conference '95, San Diego, April 1, 1995.
[Hwan95c] K. Hwang, Z. Xu, M. Arakawa, "Benchmark Evaluation of the IBM SP2 for Parallel Signal Processing", submitted to IEEE Transactions on Parallel and Distributed Systems, May 11, 1995.
[IBM94a] IBM Corp., AIX Parallel Environment: Programming Primer, Release 2.0, Pub. No. SH26-7223, IBM Corp., June 1994.
[IBM94b] IBM Corp., IBM AIX Parallel Environment: Parallel Programming Subroutine Reference, Release 2.0, Pub. No. SH26-7228-01, IBM Corp., June 1994.
[Kuma94] V. Kumar, A. Grama, A. Gupta, and G. Karypis, Introduction to Parallel Computing: Design and Analysis of Algorithms, The Benjamin/Cummings Publishing Company, Inc., Redwood City, CA, 1994.
[LL94] MIT Lincoln Laboratory, Commercial Programmable Processor Benchmarks, MIT Lincoln Laboratory, February 28, 1994.
[Stun94] C. B. Stunkel, D. G. Shea, B. Abali, M. Atkins, C. A. Bender, D. G. Grice, P. H. Hochschild, D. J. Joseph, B. J. Nathanson, R. A. Swetz, R. F. Stucke, M. Tsao, P. R. Varker, "The SP2 Communication Subsystem", Technical Report, IBM Thomas J. Watson Research Center & IBM Highly Parallel Supercomputing Systems Laboratory, August 22, 1994.
[Titi94] G. W. Titi, "An Overview of the ARPA/NAVY Mountaintop Program", IEEE Adaptive Antenna Systems Symposium, November 7-8, 1994.
[Xu95a] Z. Xu, K. Hwang, "Modeling Communication Overhead: MPI and MPL Performance on the IBM SP2 Multicomputer", submitted to IEEE Parallel & Distributed Technology, January 10, 1995.
[Xu95b] Z. Xu, K. Hwang, "Early Prediction of MPP Performance by Workload and Overhead Quantification: A Case Study of the IBM SP2 System", submitted to Parallel Computing, April 1995.

Appendix A Parallel APT Code

The parallel code for the APT benchmark, along with the script to compile and run the parallel APT program, is given in this appendix.

A.1 bench_mark_APT.c

/* bench_mark_APT.c

Parallel APT Benchmark Program for the IBM SP2

This parallel APT benchmark program was written for the IBM SP2 by the STAP benchmark parallelization team at the University of Southern California (Prof. Kai Hwang, Dr. Zhiwei Xu, and Masahiro Arakawa), as part of the ARPA Mountaintop program. The sequential APT benchmark program was originally written by Tony Adams on 7/12/93. This file contains the procedure main (), and represents the body of the APT benchmark. The program's overall structure is as follows: 1.
Load data in parallel. 2. Perform FFTs in parallel. 3. Redistribute the data using the total exchange operation. 4. Form the Householder matrix. 5. Broadcast the Householder matrix to all nodes. 6. Perform beamforming 1 sequentially. 7. Perform beamforming 2 in parallel. 8. Perform CFAR/cell averaging in parallel. 9. Gather individual target reports from each node. 10. Sort the target reports, and display the closest targets (up to 25) . This program can be compiled by calling the shell script compile_apt. Because of the nature of the program, before the program is compiled, the header file defs.h must be adjusted so that NN is set to the number of nodes on which the program will be run. We have provided a defs.h file for each power of 2 node size from 1 to 256, called defs.001 to defs.256. This program can be run by calling the shell script run.##*, where ### is the number of nodes on which the program is being run. Unlike the original sequential version of this program, the parallel APT program does not support command-line arguments specifying the number of times the program body should be executed, nor whether or not timing information should be displayed. The program body will be executed once, and the timing information will be displayed at the end of execution. The input data file on disk is stored in a packed format: Each complex number is stored as a 32-bit binary integer; the 16 LSBs are the real 1 0 2 * half of the number, and the 16 MSBs are the imaginary half of the number. * These numbers are converted into floating point numbers usable by this * benchmark program as they are loaded from disk. * * The steering vectors file has the data stored in a different fashion. All ■ * data in this file is stored in ASCII format. The first two numbers in this * file are the number of PRIs in the data set and the threshold level, * respectively. Then, the remaining data are the steering vectors, with * alternating real and imaginary numbers. * / #include “stdio.h* #include “math.h" #include “defs.h" #include “math.h" #include <sys/types.h> #include <sys/times.h> #include <mpproto.h> #include <sys/time.h> #include / * * Global variables * * output_time: a flag; 1 = output timing information, 0 = don’t output timing * information * output_reportt a flag; 1 = write data to disk, 0 = don’t write data to disk * pricnt: the number of points in the FFT * threshold: minimum power return before a signal is considered a target * t^matrix: the T matrix V int output_time = FALSE; int output_report = FALSE; int pricnt = PRI; float threshold; COMPLEX t__matrix [BEAM] [EL] ; / * * ma i n () * inputs: int argc; char *argv[]; * outputs: none * * This procedure is the body of the program. 
*/ main (argc, argv} int argc; char *argv[]; ( /* main */ / * * Original variables (variables which were in the sequential version of * the APT program) * * target_report: an array holding data for the closest targets * targets: a loop counter * r: a loop counter * el: the maximum number of elements in each PRI * beams: the maximum number of beams * det_bms: the number of beams in the detection data * rng: the maximum number of range gates in each PRI * chol_rng: the number of range gates in each Cholesky segment 103 * appl_rng: the number of range gates in each weight apply segment * £_temp: a buffer for the binary floating point output data * i, j, X: loop counters * row: a loop counter * col: a loop counter * cube_row: a loop counter for the row of in the input data cube * dopplers: the number of dopplers after the FFT * xrealptr: a pointer to the real part of the FFT vector * ximagptr: a pointer to the imaginary part of the FFT vector * str_vecs: steering vectors with EL COMPLEX elements each * weight_vec: storage for the weight solution vectors * make_t: a flag; 1 = generate the T matrix from the solution weight vectors, * 0 - don’t generate the T matrix but instead pass back the weight vector * num: a holder for the EL complex input data elements * tmp_ptr: a pointer to the next piece of data * real_ptr: a pointer to the next piece of real data * imag_ptr: a pointer to the next piece of imaginary data * sum: a variable to hold the sum of complex data * str_name: the name of the input steering vectors file * data_name: the name of the input data file * fopen: a pointer to the file open procedure * f_tmatrix, f_dop: pointers to the output T matrix and Doppler files */ static float target_report|25][4]; int targets; int r; int el = EL; int beams = BEAM; int det_bms = MBEAM; int rng = RNG; int chol_rng ^ RNGSEG; int appl_rng = RNGSEG; float f_temp[1]; int i, j, k; int row; int col; int cube_row; int dopplers; int logpoints; float ‘xrealptr; float ‘ximagptr; COMPLEX str_vecs(BEAM][DOF]; COMPLEX weight_vec[DOF]; int make_t; float num[2 • EL); float *tmp_ptr; float *real_ptr; float *imag_ptr; COMPLEX sum; char str_name[LINEJ4AX]; char data_name[LINE_MAX]; FILE *fopen(); FILE *f_tmatrix, *f_dop; /* * Variables we added or modified to parallelize the code * * fft_out: the local portion of the data cube * local_cube: the local portion of the data cube after the total exchange * operation * detect_cube: the local portion of the detect cube * partial_temp: the local portion of the tempjtiatrix * total_report: a target list containing all the targets found by all * nodes * blklen: the length of the message being sent * re: the return code from an MPL call * taskid: this task’s unique identifier 104 * numtask: the total number of tasks or nodes currently running this program * nbuf: an array used as part of the MPL call mpc_task_query; used to get * the value of allgrp * source, dest, type, nbytes: variables used for point-to-point messages */ static COMPLEX fft_OUt[PRI][RNG / NN](EL]; static COMPLEX local_cube[PRI_CHUNK][RNG][EL]; Static COMPLEX detect_cube[MBEAM][PRI_CHUNK][RNG]; static COMPLEX partial_temp[320][EL]; static float total_report[25 * NN][4]; int blklen; int rc; int taskid; int numtask; int nbuf[4]; int allgrp; int source, dest, type, nbytes; Timing variables *_start and *_end: the start and end CPU times for... *_user and *_sys: the user and system time breakdowns for the time spent for... *_user_max and *_sys_max: the largest user and system times for... 
*_clock_start and *_clock_end: the start and end wall clock times for... *_clock_time: the net wall clock time spent for... all: the entire program disk: reading the data from disk fft: the FFTs index: the MPL command mpc_index rewind: rewinding the 1-D output of the mpc_index operation into a 3-D data cube gather: the gather operation; the mpc_gather operation is no longer used to redistribute the data cube, as it is replaced by the mpc_index operation followed by the rewind operation reorder: the reorder operation, which is performed between the gather and scatter sequence to redistribute the data cube; this reorder operation is no longer used as this whole step is replaced by the mpc_index operation followed by the rewind operation scatter: the scatter operation; the mpc_scatter operation is no longer used to redistribute the data cube, as it is replaced by the mpc_index operation followed by the rewind operation partial: copying the appropriate portions of the data cube into the partial_temp matrix beast: the broadcast operation stepl: the first beamforming step step2: the second beamforming step cfar: the start and end time of the CFAR/cell averaging step report: the target reporting step st ruct tms struct tms struct tms struct tms st ruct tms struct tms struct tms struct tms st ruct tms struct tms struct tms struct tms struct tms struct tms all_start, all_end; disk_start, disk_end; fft_start, fft_end; index_start, index_end; rewind_start, rewind_end; gather_start, gather_end; reorder_start, reorder_end; scatter_start, scatter_end; partial_start, partial_end; bcast_start, bcast_end; stepl_start, stepl_end; step2_start, step2_end; cfar_start, cfar_end; report_start, report_end; 105 loat all_user, all_sys; loat disk_user, disk_sys; loat f£t_user, fft_sys; loat index_user, index_sys; loat rewind_user, rewind_sys; loat partial_user, partial_sys; loat bcast_user, bcast_sys; loat stepl_user, stepl_sys; loat step2_user, step2_sys; loat cfar_user, cfar_sys; loat report_user, report_sys; loat all_user_max, all_sys_max; loat disk_user_max, disk_sys_max; loat fft_userjnax, fft_sys_max; loat index_user_max, index_sys_max; loat rewind_user_max, rewind_sys_max; loat partial_user_max, partial_sys_max; loat bcast_user_^iax, bcast_sys_max; loat stepl_userjnax, stepl_sys_max; loat step2_userjnax, step2_sys_max; loat cf ar_user_^iax, cfar_sys_max; loat report_user_max, report_sys_max; struct timeval all_clock_start, all_clock_end; struct tlmeval disk_clock_start, disk_clock_end; struct tlmeval fft_clock_start, EEt_clock_end; struct tlmeval index_clock_start, index_clock_end; struct tlmeval rewind_clock_start, rewind_clock_end; struct timeval partial_clock_start, partial_clock_end; struct timeval bcast_clock_start, bcast_clock_end; struct timeval stepl_clock_start, stepl_clock_end; struct timeval step2_clock_start, step2_clock_end; struct timeval cfar_clock_start, cfar_clock_end; struct timeval report_clock_start, report_clock_end; El oat all_clock_time; Eloat disk_clock_tlme; Eloat EEt_clock_time; Eloat index_clock_time; Eloat rewind_clock_time; Eloat partial_clock_time; Eloat bcast_clock_time; float stepl_clock_time; Eloat step2_clock_time; Eloat cfar_clock_time; Eloat report_clock_time; /* * Variables added for gather/scatter (to replace index) * * The gather and scatter operation is no longer used to redistribute the * data cube. However, these variables are still used by the index and * rewind steps of this program, and therefore should not be removed. 
V static COMPLEX fft_vector[PRI * RNG * EL / NN]; int offset; int n; / * * Temporary variables for diagnostic purposes * * We used these variables in debugging this parallel version of the * APT program. */ 1 0 6 FILE *fpl, *fp2; int p, e, c, b; / * * Externally defined functions */ extern void cmd_line(>; extern void fft_APT(>; extern void stepl_beams(); extern void step2_beams(); extern void cell_avg_cfar(); / * * Begin function body: main () * * Initialize for parallel processing: Here, each task or node determines * its task number (taskid) and the total number of tasks or nodes running * this program (numtask) by using the MPL call mpc_environ. Then, each * task determines the identifier for the group which encompasses all tasks * or nodes running this program. This identifier (allgrp) is used in * collective communication or aggregated computation operations, such as * mpc_index. * 1 rc = mpc_environ (tnumtask, ttaskid); if (rc == -1) { printf ("Error - unable to call mpc_environ.\n"); exit (~1); } if (numtask != NN) ( if (taskid =- 0) { printf ("Error - task number mismatch... check defs.h.\n"); ex i t (-1),- ) ) rc = mpc_task_query (nbuf, 4, 3); if (rc = = -1) ( printf ("Error - unable to call mpc_task_query.\n*); exit (-1); ) allgrp = nbuf[3]; if (taskid == 0) ( printf ("Running...\n"); ) gettimeofday (tall_clock_start, (struct timeval*) 0); times (&all_start); / * * Get arguments from the command line. In the sequential version of the * program, the following procedure was used to extract the number of times * the main computational body (after the FFT) was to be repeated, and flags * regarding the amount of reporting to be done during and after the program * was run. In this parallel program, there are no command line arguments to * be extracted except for the name of the file containing the data cube. * / cmd_line (argc, argv, str_name, data_name); 107 /* * Read input files. In this section, each task loads its portion of the * data cube from the data file. */ i* if (taskid == 0) I printf (" loading data...\n*); ) */ mpc_sync (allgrp); gettimeofday (tdisk_clock_start, (struct timeval*) 0); times (&disk_start) ; read_input^APT (data_name, str„name, str_vecs, fft_out); times (idisk_end); gettimeofday (&disk_clock_end, (struct timeval*) 0); /* * FFT: In this section, each task performs FFTs along the PRI dimension * on its portion of the data cube. The FFT implementation used in this * program is a hybrid between the original implementation found in the * sequential version of this program and a suggestion given to us by MHPCC. * This change in implementation was done to improve the performance of the * FFT on the SP2. */ /* * if (taskid == 0) < * printf (* running parallel FFT...\n*); ) */ mpc_sync (allgrp); gettimeofday (&fft_clock_start, (struct timeval*) 0) , - times (ifft_start) ; fft_APT (f ft_out); times (&fft_end); gettimeofday (ifft_clock_end, (struct timeval*) 0) ; /* * Perform index operation to redistribute the data cube. Before the index * operation, the data cube was sliced along the RNG dimension, so each * task got all the PRIs. After the index operation, the data cube is sliced * along the PRI dimension, so each task gets all the RNGs. * * Because the MPL command mpc_index doesn't work to our specifications, * we need to rewind the 1-D matrix which is the output of the mpc_index * operation back into a 3-D data cube slice. 
*/ /* * if (taskid = = 0) ( * printf (■ indexing data cube...\n"); } V mpc_sync (allgrp); gettimeofday <&index_clock_start, (struct timeval*) 0); times (&index_start); rc = mpc_index (fft_out, fft_vector, PRI * RNG * EL * sizeof (COMPLEX) / (NN * NN>, allgrp); times (&index_end); gettimeofday (tindex clock_end, (struct timeval*) 0); if (rc == -1) 108 { printf ("Error - unable to call mpc_index.\n*); exit (-1); } /• * if (taskid == 0) { * printf (" rewinding data cube..,\n*); > »/ mpc_sync (allgrp); gettimeofday (&rewind_clock_start, (struct timeval*) 0); times (trewind_start); offset = 0 for (n = 0; n < NN; n++) { for (p = 0; p < PRI / NN; pt+) ( for (r = n * RNG / NN; r < (n + 1) * RNG / NN; r+ + ) { for (e = 0; e < EL; e++) I local_cube(p][rIle].real = fft_vector[offset].real; local_cube[p][rj [ej.imag = fft_vector[offset].imag; offset++; ) ) ) ) times (&rewind_end), - gettimeofday (&rewind_clock_end, (struct timeval*) 0); /* * In this step, the partial_temp matrix is generated. If there are less than * 2 56 nodes, then the data which goes into the partial_temp matrix can be * found in tasks 0 and (NN - 1), where NN is the total number of tasks running * this program. If there are 256 nodes, then the data which goes into the * partial_temp matrix can be found in tasks 0, 1, 254, and 255. * * In the sequential version of the program, the partial_temp matrix was * formed in the first beamforming step (stepl_beams ()). * * The Householder matrix which is being formed here is generated by taking * range gates 201-280 from dopplers 1, 2, doppler-1, and doppler into an * EL by RNG_S temp matrix. Nominally, EL = 32 elements and RNG_S = 320 * samples. In the parallel version, because there are only 256 range gates, * the Householder matrix is generated by taking range gates 177-256. */ if (NN < 256) ( /* if (NN < 256) *! 
if (taskid == 0) ( printf (" generating partial_temp...\n"); ) / mpc_sync (allgrp); gettimeofday (ipartial_clock_start, (struct timeval*) 0); times (tpartial_start); if (taskid == 0) [ for (j = RNG - 80, i = 0; j < RNG; j++, i++) 109 ( for (k = 0; k < el; k++) { partial_temp[i](k].real = local_cube(0][j][k]-real; partial_temp[ij|k].imag - local_cube[0][j][k].imag; partial_temp[i + 80][k].real = local_cube[1][j)t k] .real; partial_temp[i + 80][k].imag = local_cube[1j [j][kj.imag; } ] ] if (taskid == (NN - 1)) I for (j = RNG - 80, i = 0; j < RNG; j ++•, i++) I for (k = 0; k < el; k++) { partial_temp|i + 160][k].real = 1ocal_cube[PRI_CHUNK - 2][j][k].real; partial_tempIi + 160][k].imag = local_cube[PRI_CHUNK - 2][j][k].imag; partial_tempIi + 240][k].real = local_cube[PRI_CHUNK - 1][j][k].real; partial_temp(i + 240][k].imag = local_cube[PRI_CHUNK 1 ] [ j] [k].imag; ) I ) times (&partial_end); gettimeofday (tpartial_clock_end, (struct timeval*) 0); } /* if (NN < 256) V ©1 s© < /* if (NN == 256) V if (taskid -- 0) ( printf (• generating partial_temp...\n"); } mpc_sync (allgrp); gettimeofday (tpartial_clock_start, (struct timeval*) 0); times (&partial_start); if (taskid == 0) { for (j = RNG - 80, i = 0; j < RNG; j++, i++) { for (k = 0; k < el; k++) ( partial_temp[i][k].real = local_cube[0][j ][k].real; partial_temp[i)[kj.imag = local_cube[0j]j](k).imag; ) ) ) if (taskid == 1) ( for (j = RNG - 80, i = 0; j < RNG; j++, i++) { for (k - 0; k < el; k++) ( partial_temp[i + 80][k].real = local_cube(l][j][k].real; partial_templi + 80][kj.imag = local_cube[l][j ][kj .imag; ) ) > 110 if (taskid == 254) { for (j = RNG -SO, i = 0; j < RNG; j + + , i + + ) ( for (k - 0; k < el; k++) ( partial_temp(i + 160][k]-real = local_cube[PRI_CHUNK - 2][jI[k].real; partial_temp [ i + ■ 160 ][ k] . imag = local_cube[PRI_CHUNK - 2][j][k].imag; > ) ) if (taskid == 255) I for (j = RNG - 80, i = 0; j < RNG; j++, i++) I for (k = 0; k < el; k + +) { partial_temp!i + 240][k].real = local_cube[PRI_CHUNK - 1 ] I j ] [k].real; partial_temp(i + 240][k].imag = local_cube1PRI_CHUNK - 1](j][k].imag; } ) ) gettimeofday <SLpartial_clock_end, (struct timeval*) 0); times (tpartial_end); ] /* if (NN == 256) */ /* * Once the appropriate tasks copy portions of their data cube slices into * sections of partial_temp, these tasks then broadcast these sections to * all the nodes. After these broadcasts are done, each task will have * identical copies of the complete partial_temp matrix. 
*/ if (NN < 256) { /* if (NN < 256) */ blklen = 160 * EL * sizeof (COMPLEX); mpc_sync (allgrp); gettimeofday (&bcast_clock_start, (struct timeval*) 0); times (tbcast_start); rc = mpc_bcast (partial_temp, blklen, 0, allgrp); times (&bcast_end); if (rc == -l) ( printf (“Error - unable to call mpc_bcast.\n"); exi t(-1); } bcast_user = bcast_end.tms_utime - bcast_start.tms_utime; bcast_sys = bcast_end.tms_stime - bcast_start.tms_stime; times (tbcast_start); rc = mpc_bcast (partial_temp + 160, blklen, NN - 1, allgrp); times (&bcast_end) , • gettimeofday (fcbcast_clock_end, (struct timeval*) 0); if (rc == -1) ( printf (“Error - unable to call mpc_bcast.\n“); exi t(-1); ) bcast_user += bcast_end.tms_utime - bcast_start.tms_utime; 111 bcast_sys += bcast_end.tms_stime - bcast_start.tms_stime; ) /* if (NN < 256) «/ else ( /* if (NN == 256) */ blklen = 80 * EL * sizeof (COMPLEX); mpc_sync (allgrp); gettimeofday (&bcast_clock_start, (struct timeval*) 0); times (Sibcast_start) ; rc = mpc_bcast (partial_temp, blklen, 0, allgrp); times (&bcast_end>; if (rc == -l) ( printf ("Error - unable to call mpc_bcast.\n*); exit(-1) ; ) bcast_user = bcast_end.tms_utime - bcast^start.tms_utime; bcast_sys = bcast_end.tms_stime bcast_start.tms_stime; times (&bcast_start); rc = mpc_bcast (partial_temp + 80, blklen, 1, allgrp); times (&bcast_end); if (rc == -1) { printf ("Error - unable to call mpc_bcast.\n*); exit(-1)i ) bcast_user +- bcast_end.tms_utime - bcast_start.tms_utime; bcast_sys += bcast_end.tms_stime - bcast_start.tms_stime; times (&bcast_start); rc = mpc_bcast (partial_temp + 160, blklen, 254, allgrp); times (&bcast_end); if (rc == -1) ( printf ("Error - unable to call mpc_bcast.\n"); exit(- 1) ; ) bcast_user += bcast_end.tms_utime - bcast_start.tms_utime; bcast_sys += bcast_end.tms_stime - bcast_start.tms_stime; times (s.bcast_start) ; rc = mpc_bcast (partial_temp + 240, blklen, 255, allgrp); times (tbcast_end>; gettimeofday (&bcast_clock_end, (struct timeval*) 0); if (rc == -1) I printf ("Error - unable to call mpc_bcast.\n"); exit(-1) ; ) bcast_user += bcast_end.tms_utime - bcast_start.tms_utime; bcast_sys += bcast_end,tms_stime - bcast_start.tms_stime; ) /* if (NN == 256) */ /* * Step 1 Beamforming: This step cannot be parallelized efficiently, and was * left to run sequentially. In order to improve performance, we chose to have * each task execute this step. The alternative, having one task execute this * step, then broadcast the results, takes much longer. */ * if (taskid == 0) 112 1 * printf (• performing seep 1 beamforming,..\n*); } */ dopplers = PRI_CHUNK; mpc_sync (allgrp); gettimeofday (tstepl_clock_start, (struct timeval*) 0); times (&stepl_start); stepl_beams (str_vecs, partial_temp); times (&stepl_end); gettimeofday (fcstepl_clock_end, (struct timeval*) 0); /* * Step 2 Beamforming: Each task performs this beamforming step on its slice * of the the data cube. */ /* * if (taskid == 0) ( * printf (* performing step 2 beamforming...\n") ; } V mpc_sync (allgrp); gettimeofday (&step2_clock_start, (struct timeval*) 0); times (&step2_start); step2_beams (dopplers, beams, rng, str_vecs, local_cube, detect_cube); times (fcstep2_end); gettimeofday (&step2_clock_end, (struct timeval*) 0); /* * Cell averaging CFAR and target detection: Each task performs this step on * its slice of the data cube in parallel. 
*/ /* * if (taskid == 0) { * printf (" performing cell_avg_cfar...\n"); ) */ mpc_sync (allgrp); gettimeofday (icfar_clock_start, (struct timeval*) 0); times (tcfar_start) ; cell_avg_cfar (threshold, dopplers, det_bms, rng, target_report, detect_cube); times (&cfar_end); gettimeofday (fccfar_clock_end, (struct timeval*) 0); /* * Gather target reports: In this step, the tasks collect the closest 25 * targets. This target sorting is performed in the following manner. The tasks * pair off. The two tasks in this pair combine their target lists (i.e. the * targets found in their own slice of the data cube). Then, one of these * two tasks sorts the list and keeps the closest targets (up to 25). Then, * the tasks with these new, combined target lists pair off, and this * process is repeated again. At the end, one task will have the final target * list, which contains the closest targets in the entire data cube (up to * 25) . * * The target list combining is done by matched blocking-sends and blocking- * receives. */ /* 113 * if (taskid == 0) ( * printf (* gathering target reports...\n"}; } V mpc_sync (allgrp); gettimeofday (treport_clock_start, (struct timeval*) 0) ; times (&report_start); for (i = 1; i < NN; i = 2 * i) ( /* for (i..-> */ blklen = 25 * 4 * sizeof (float); source = taskid + i; dest - taskid - i; type = i; if ((taskid % (2 * i)) == 0 && (NN != 1)) { rc = mpc_brecv (target_report + 25, blklen, isource, ttype, inbytes); if (rc == -1) ( printf ("Error - unable to call mpc_brecv.\n"); exit(-1); } ) if ((taskid % (2 * i|) == i it (NN ’ = 1)) { rc = mpc_bsend (target_report, blklen, dest, type); if (rc = = -1) { exit(-1) ; ) ) /* * Sort combined target list. */ for (targets = 0; targets < 50; targets++) ( /* for (targets...) */ for (r = targets + 1 ; r < 50 ; r ++) ( /* for (r...) */ if (((target_report[targetsJ[0] > target_report[r][01) it (target_report [r] [01 > 0.0)) II ((target_report[targets][0] < target_report[r][0]) it (target_report[targets][0] == 0.0))) { /* if (...) */ float tmp; tmp = target_report[r][0]; target_report[r][0] = target_report[targetsJ[0]; target_report[targets][0] = tmp; tmp = target_report[r][1]; target_report[r][1] = target_report[targets][1); target_report[targets][1] = tmp; tmp = target_report[r][2]; target_report[r][2] = target_report[targets)[2]; target_report[targets][2] = tmp; tmp = target_reportlr][3]; target_report [r] [3 ] = target_report [targets] [3 ] , - target_report[targets][3] = tmp; } /* if (...) */ ) /* for (r...) */ } /* for (targets...) */ ) /* for (i...) */ 114 times (&report_end); times (Stall_end) ; gettimeofday (treport_clock_end, (struct timeval*) 0); gettimeofday (tall_clock_end, (struct timeval*) 0); if (taskid == 0) { printf ("... done.in’); ) /* * Now, the program is done with the computation. All that remains to be done * is to report the targets that were found, and to report the amount of time * each step took. The target list should be identical from run to run, since * the program started with the same input data cube. This target list and * execution time data reporting is performed by task 0. */ if (taskid == 0) if (output_report == FALSE) ( printf (*ENTRY RANGE BEAM DOPPLER POWERNn*); for (i = 0; i < 25; i++) printf (■ %.02d %.03d %.02d t.03d (int)target_report[i]|0], (int)target_report[i] (int)target_report[ij[2], target_report[i][3] ) J > /* * Collect timing information. First, calculate the number of seconds of CPU * time spent in each step (user and system time separately). 
*/ all_user = (all_end.tms_utime - all_start.tms_utime) / CLK_TICK; ali_sys = t (all_end.tms_stime all_start.tms_stime) / CLK_TICK; disk_user = (disk_end.tms_utime - disk_start.tms_utime) / CLK_TICK; disk_sys = (disk_end.tms_stime - disk_start.tms_stime) / CLK_TICK; fft_user = (fft_end.tms_utime - fft_start.tms_utime) / CLK_TICK; fft_sys = (fft_end.tms_stime - fft_start.tms_stime) / CLK_T1CK; index_user = (index_end.tms_utime - index_start.tms_utime) / CLK_TICK; index_sys = (index_end.tms_stime - index_start.tms_stime) / CLK_TICK; rewind_user = (rewind_end.tms_utime rewind_start.tms_utime) ! CLK_TICK; rewind_sys = (rewind_end.tms_stime - rewind„start.tms_stime) t CLK_TICK; partial_user = (partial_end.tms_utime - partial_start.tms_utime) / CLK_TICK; partial_sys = (partial_end.tms_stime - partial_start.tms_stime) / CLK_TICK; bcast_user = bcast_user / CLK_TICK; bcast_sys = bcast_sys / CLK_TICK; stepl_user = (stepl_end.tms_utime - stepl_start.tms_utime) / CLK_TICK; stepl_sys = (stepl_end.tms_stime * stepl_start.tms_stime) / CLK_TICK; step2_user = <step2_end.tms_utime - step2_start.tms_utime) / CLK_TICK; step2_sys = (step2_end.tms_stime - step2_start.tms_stime) / CLK_T1CK; cfar_user = (cfar_end.tms_utime - cfar_start.tms_utime) / CLK_TICK; cfar_sys = (cfar_end.tms_stime - cfar_start.tms_stime) / CLK_TICK; report_user = (report_end.tms_utime - report_start.tms_utime) / CLK_TICK; report_sys = (report_end.tms_stime - report_start.tms_stime) / CLK_TICK; /* * Calculate the number of wall clock seconds spent on each section. */ all_clock_time = (float) (all_clock_end.tv_sec - all_clock_start.tv_sec) + (float) ((all_clock_end.tv_usec - all_clock_start.tv_usec) I 1000000.0); disk_clock_time = ifloat) (disk_clock_end.tv_sec %f\n*, i, 11] , 115 - disk_clock_start.tv_sec) + (float) (<disk_clock_end.tv_usec - disk_clock_start.tv_usec) / 1000000.0); fft_clock_time = (float) (fft_clock_end.tv_sec - fft_clock_start.tv_sec) * ■ (float) ((fft_clock_end.tv_usec - fft_clock_start-tv_usec) I 1000000.0); index_clock_time = (float) (index_clock_end.tv_sec - index_clock_start.tv_sec) + (float) ((index_clock_end.tv_usec - index_clock_start.tv_usec) / 1000000.0); rewind_clock_tlme = (float) (rewind_clock_end.tv_sec - rewind_clock_start.tv_sec) + (float) ((rewind_clock_end.tv_usec - rewind_clock_start.tv_usec) / 1000000.0); partial_clock_time = (float) (partial_clock_end.tv_sec - partial_clock_start,tv_sec) + (float) ((partial_clock_end.tv_usec - partial_clock_start.tv_usec) / 1000000.0); bcast_clock_time = (float) (bcast_clock_end.tv_sec - bcast_clock_start.tv_sec) + (float) ((bcast_clock_end.tv_usec - bcast_clock_start.tv_usec) / 1000000.0); stepl_clock_tiine = (float) (stepl_clock_end.tv_sec - stepl_clock_start.tv_sec) + (float) ((stepl_clock_end.tv_usec - stepl_clock_start.tv_usec) I 1000000.0); step2_clock_time = (float) (step2_clock_end.tv_sec - step2_clock_start.tv_sec) + (float) ((step2_clock_end.tv_usec - step2_clock_start.tv_usec) / 1000000.0); ofar_clock_time = (float) (cfar_clock_end.tv_sec - cfar_clock_start.tv_sec) + (float) ((cfar_clock_end.tv_usec - cfar_clock_start.tv_usec) / 1000000.0); report_clock_time = (float) (report_clock_end.tv_sec - report_clock_start.tv_sec) + (float) ((report_clock_end.tv_usec - report_clock_start.tv_usec) / 1000000.0); /* * Use the mpc_reduce operation to find the largest CPU time in each section * (user and system time separately). 
/*
 * Use the mpc_reduce operation to find the largest CPU time in each section
 * (user and system time separately).
 */
   rc = mpc_reduce (&all_user, &all_user_max, sizeof (float), 0, s_vmax, allgrp);
   if (rc == -1) { printf ("Error - unable to call mpc_reduce.\n"); exit (-1); }
   rc = mpc_reduce (&all_sys, &all_sys_max, sizeof (float), 0, s_vmax, allgrp);
   if (rc == -1) { printf ("Error - unable to call mpc_reduce.\n"); exit (-1); }
   rc = mpc_reduce (&disk_user, &disk_user_max, sizeof (float), 0, s_vmax, allgrp);
   if (rc == -1) { printf ("Error - unable to call mpc_reduce.\n"); exit (-1); }
   rc = mpc_reduce (&disk_sys, &disk_sys_max, sizeof (float), 0, s_vmax, allgrp);
   if (rc == -1) { printf ("Error - unable to call mpc_reduce.\n"); exit (-1); }
   rc = mpc_reduce (&fft_user, &fft_user_max, sizeof (float), 0, s_vmax, allgrp);
   if (rc == -1) { printf ("Error - unable to call mpc_reduce.\n"); exit (-1); }
   rc = mpc_reduce (&fft_sys, &fft_sys_max, sizeof (float), 0, s_vmax, allgrp);
   if (rc == -1) { printf ("Error - unable to call mpc_reduce.\n"); exit (-1); }
   rc = mpc_reduce (&index_user, &index_user_max, sizeof (float), 0, s_vmax, allgrp);
   if (rc == -1) { printf ("Error - unable to call mpc_reduce.\n"); exit (-1); }
   rc = mpc_reduce (&index_sys, &index_sys_max, sizeof (float), 0, s_vmax, allgrp);
   if (rc == -1) { printf ("Error - unable to call mpc_reduce.\n"); exit (-1); }
   rc = mpc_reduce (&rewind_user, &rewind_user_max, sizeof (float), 0, s_vmax, allgrp);
   if (rc == -1) { printf ("Error - unable to call mpc_reduce.\n"); exit (-1); }
   rc = mpc_reduce (&rewind_sys, &rewind_sys_max, sizeof (float), 0, s_vmax, allgrp);
   if (rc == -1) { printf ("Error - unable to call mpc_reduce.\n"); exit (-1); }
   rc = mpc_reduce (&partial_user, &partial_user_max, sizeof (float), 0, s_vmax, allgrp);
   if (rc == -1) { printf ("Error - unable to call mpc_reduce.\n"); exit (-1); }
   rc = mpc_reduce (&partial_sys, &partial_sys_max, sizeof (float), 0, s_vmax, allgrp);
   if (rc == -1) { printf ("Error - unable to call mpc_reduce.\n"); exit (-1); }
   rc = mpc_reduce (&bcast_user, &bcast_user_max, sizeof (float), 0, s_vmax, allgrp);
   if (rc == -1) { printf ("Error - unable to call mpc_reduce.\n"); exit (-1); }
   rc = mpc_reduce (&bcast_sys, &bcast_sys_max, sizeof (float), 0, s_vmax, allgrp);
   if (rc == -1) { printf ("Error - unable to call mpc_reduce.\n"); exit (-1); }
   rc = mpc_reduce (&step1_user, &step1_user_max, sizeof (float), 0, s_vmax, allgrp);
   if (rc == -1) { printf ("Error - unable to call mpc_reduce.\n"); exit (-1); }
   rc = mpc_reduce (&step1_sys, &step1_sys_max, sizeof (float), 0, s_vmax, allgrp);
   if (rc == -1) { printf ("Error - unable to call mpc_reduce.\n"); exit (-1); }
   rc = mpc_reduce (&step2_user, &step2_user_max, sizeof (float), 0, s_vmax, allgrp);
   if (rc == -1) { printf ("Error - unable to call mpc_reduce.\n"); exit (-1); }
   rc = mpc_reduce (&step2_sys, &step2_sys_max, sizeof (float), 0, s_vmax, allgrp);
   if (rc == -1) { printf ("Error - unable to call mpc_reduce.\n"); exit (-1); }
   rc = mpc_reduce (&cfar_user, &cfar_user_max, sizeof (float), 0, s_vmax, allgrp);
   if (rc == -1) { printf ("Error - unable to call mpc_reduce.\n"); exit (-1); }
   rc = mpc_reduce (&cfar_sys, &cfar_sys_max, sizeof (float), 0, s_vmax, allgrp);
   if (rc == -1) { printf ("Error - unable to call mpc_reduce.\n"); exit (-1); }
   rc = mpc_reduce (&report_user, &report_user_max, sizeof (float), 0, s_vmax, allgrp);
   if (rc == -1) { printf ("Error - unable to call mpc_reduce.\n"); exit (-1); }
   rc = mpc_reduce (&report_sys, &report_sys_max, sizeof (float), 0, s_vmax, allgrp);
   if (rc == -1) { printf ("Error - unable to call mpc_reduce.\n"); exit (-1); }

/*
 * Display timing information.
 */
   if (taskid == 0)
   {
      printf ("\n\n\n*** Timing information - numtask = %d\n\n", NN);
      printf ("   all_user_max = %.2f s,   all_sys_max = %.2f s\n",
              all_user_max, all_sys_max);
      printf ("   disk_user_max = %.2f s,   disk_sys_max = %.2f s\n",
              disk_user_max, disk_sys_max);
      printf ("   fft_user_max = %.2f s,   fft_sys_max = %.2f s\n",
              fft_user_max, fft_sys_max);
      printf ("   index_user_max = %.2f s,   index_sys_max = %.2f s\n",
              index_user_max, index_sys_max);
      printf ("   rewind_user_max = %.2f s,   rewind_sys_max = %.2f s\n",
              rewind_user_max, rewind_sys_max);
      printf ("   partial_user_max = %.2f s,   partial_sys_max = %.2f s\n",
              partial_user_max, partial_sys_max);
      printf ("   bcast_user_max = %.2f s,   bcast_sys_max = %.2f s\n",
              bcast_user_max, bcast_sys_max);
      printf ("   step1_user_max = %.2f s,   step1_sys_max = %.2f s\n",
              step1_user_max, step1_sys_max);
      printf ("   step2_user_max = %.2f s,   step2_sys_max = %.2f s\n",
              step2_user_max, step2_sys_max);
      printf ("   cfar_user_max = %.2f s,   cfar_sys_max = %.2f s\n",
              cfar_user_max, cfar_sys_max);
      printf ("   report_user_max = %.2f s,   report_sys_max = %.2f s\n",
              report_user_max, report_sys_max);

      printf ("\nWall clock timing -\n");
      printf ("   all_clock_time = %f\n", all_clock_time);
      printf ("   disk_clock_time = %f\n", disk_clock_time);
      printf ("   fft_clock_time = %f\n", fft_clock_time);
      printf ("   index_clock_time = %f\n", index_clock_time);
      printf ("   rewind_clock_time = %f\n", rewind_clock_time);
      printf ("   partial_clock_time = %f\n", partial_clock_time);
      printf ("   bcast_clock_time = %f\n", bcast_clock_time);
      printf ("   step1_clock_time = %f\n", step1_clock_time);
      printf ("   step2_clock_time = %f\n", step2_clock_time);
      printf ("   cfar_clock_time = %f\n", cfar_clock_time);
      printf ("   report_clock_time = %f\n", report_clock_time);
   }

   return;

}  /* main */

A.2 cell_avg_cfar.c

/*
 * cell_avg_cfar.c
 */

/*
 * This file contains the procedure cell_avg_cfar (), and is part of the
 * parallel APT benchmark program written for the IBM SP2 by the STAP
 * benchmark parallelization team at the University of Southern California
 * (Prof. Kai Hwang, Dr. Zhiwei Xu, and Masahiro Arakawa), as part of the
 * ARPA Mountaintop program.
 *
 * The sequential APT benchmark program was originally written by Tony Adams
 * on 7/12/93.
 *
 * This procedure was largely untouched during our parallelization effort,
 * and is therefore almost identical to the sequential version. Modifications
 * were made to perform the cell averaging and CFAR for only a slice of the
 * data cube, instead of the entire data cube (as was the case in the
 * sequential version of the program). As such, most, if not all, of the
 * comments in this procedure are taken directly from the sequential version
 * of this program.
*/ #include "stdio.h" iinclude ’math.h" #include "defs.h* #include <sys/types,h> #include <sys/time.h> * cell_avg_cfar () * inputs: threshold, dopplers, beams, rng, detect_cube * outputs: target_report * * threshold: when the power at a certain cell exceeds this threshold, * we consider it as having a target * dopplers: the number of dopplers in the input data * beams: the number of beams (rows) in the input data * rng: the number of range gates (columns) in the input matrix * target_report: a 25-element target report matrix (each element consisting * of four floating point numbers: RANG, BEAM, DOPPLER, and POWER), to * output the 25 closest targets found in this slice of detect_cube * detect_cube: input beam/doppler/rnggate matrix */ cell_avg_cfar (threshold, dopplers, beams, rng, target_report, detect_cube) float threshold: int dopplers; int beams; int rng; float target_report[][4); COMPLEX detect_cube[][DOP/NN][RNG]; /* * Variables: Most, if not all, of the variable comments were taken directly * from the sequential version of this program. * * sum: a holder for the cell sum * num_seg: number of range segments 120 * rng_seg: number of range gates per segment * targets: holder for the number of targets to include in the target report * num_targets: the number of targets to include in the target report * N: holder for the number of range gates per range segment * guard_range: number of cells away from cell of interest * rseg, r, i: loop variables * beam, dop: loop variables * rng_start: address of the first range gate in each range segment * rng_end: address of the last range gate in each range segment * ncells: holder for cells: guard_range per range segment * taskid: the identifier for this task (added for parallel execution) */ float sum; int nun\_seg = NUMSEG; int rng_seg = RNGSEG; int targets; int num_targets = 25; int N; int guard_range - 3; int rseg, r, i; int beam, dop; int rng_start; int rng_end; int ncells; extern int taskid; /* * Begin function body: cell_avg_cfar () * * Process cell averaging cfar by range segments: do cell averaging in ranges. */ N = rng_seg; /* Number of range cells per range segment */ targets =0; /* Start with 0 targets in target report number */ for (i = 0; i < num_targets; i++) ( target_report[i][0] = 0.0; target_report[ij [1] = 0.0; target_report[i][2] = 0.0; target_report[i][3] = 0.0; ) /* * Do the entire cell average CFAR algorithm starting from the first range * segment and continuing to the last range segment. This ensures getting * 25 least-range targets first, so the process can be stopped without * looking further into longer ranges. */ for (rseg = 0; rseg < num_seg; rseg++) ( /* for (rseg...) */ /* * Get start and end range gates for each of the range segments: starts at * low ranges and increments to higher ranges. */ rng_start = rseg * N; rng_end = rng_start + N - 1; /* * 1st get the range cell power and store it in detect_cube[][][].real. For * each beam and each doppler, get the power for each range cell in the * current range segment. */ 121 Eor (beam = 0; beam < beams; beam++) { /* for (beam...) */ for (dop = 0; dop < dopplers; dop++) ( /* for (dop...) */ for (r = rng_etart ; r <= rng_end; r++) { /* for (r. ..) V sum = detect_cube[beam][dop][r].real * detect_cube[beam][dop][r].real + detect_cube[beam] [dop] [r] . itnag * detect_cube[beam][dop][r|.imag; detect_cube[beam][dop][r].real = sum; } /* for (r...) */ } /* for (dop...) * I } /* for (beam...) 
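 *
 * Illustration (added; not part of the original program): the averaging pass
 * below keeps a running sum over each range segment, excluding a guard band
 * of guard_range = 3 cells around the cell under test.  For a segment of N
 * range gates, the first cell's average is taken over ncells = N - 4 cells
 * (N - guard_range - 1); as the window slides, single terms are added to and
 * subtracted from the running sum instead of re-summing the whole segment,
 * so each average costs O(1) work rather than O(N).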
*/ /* * Now, get the range cell's average power, using cell averaging CFAR described * above, and store it in detect_cube[][][].imag. For each beam and each * doppler, get the average power for each range cell in the current range * segment. */ for (beam = 0; beam < beams; beam++) ( /* for (beam...) */ for (dop = 0; dop < dopplers; dop++) ( /* for (dop...) * I sum = 0.0; ncells = N - guard_range - 1; /* * Do a summation loop for the first range cell at the start of a range * segment. V for (r = rng start + guard_range +1; r <= rng_end; r++) ( sum +- detect_cube[beam][dop][r].real; } r * The average power is the sum divided by the number of cells in the * summation. */ detect_cube[beam][dop][rng_start].imag = sum / ncells; /* * Now, perform loops until the guard band is fully involved in the data. */ for (r = rng_start +1; r <= rng_start + guard_range; r++) ( /* for (r...) *f sum = sum - detect_cube[beam][dop![r + guard_range].real; --ncells; detect_cube[beam][dop][r].imag = sum / ncells; } /* for (r...) */ /* * Now, perform loops until the guard band reaches the range segment border. */ for (r = rng_start + guard_range + 1; r <= rng end - guard_range; r++) ( sum = sum + detect_cube[beam][dop][r-guard_range-l].real - detect_cube[beam][dop][r + guard_range].real; detect_cube[beam](dop)[r].imag = sum / ncells; 122 > /* * Now, perform loops to the end of the range segment. * / for (r = rng_end - guard_range + 1; r <= rng_end; r++) ( sum += detect_cube(beam][dop][r - guard_range - l].reai; + +ncelIs; detect_cube[beam][dop][r].imag = sum / ncells; } ) /* for (dop...) V } /* for (beam...) */ /* * Compare the range cell power to the range cell average power. Start at * the minimum range and increase the range until 25 targets are found. Put * the target's RANGE, BEAM, DOPPLER, and POWER level in a matrix called * "target_report■. Store all values as floating point numbers. Some can get * converted to integers on printout later as required. */ for (r = rng_start ; r <= rng_end; r + +) { /* for (r...) */ for (beam = 0; beam < beams; beam++) { /* for (beam...) */ for (dop = 0; dop < dopplers; dop++) ( /* for (dop,..) V if (((detect_cube[beam][dop)[r].real detect_cube[beam)[dop][r].imag) > threshold] && (targets <- num_targets)) I /* if (((...))) */ target_report[targets][0] = (float) r; target_report[targets][1] = (float) beam; target_report[targets][2] = (float) dop + taskid * PRI ! NN; target_report[targets][3] = detect_cube[beam)[dop)[r].real; targets +- 1; if (targets >= num targets) ( goto quit_report; 1 ) /* if (((...))) V ] /* for (dop...) */ ) /* for (beam...) */ ) /* for (r...) */ ) /* for (rseg..) */ quit_report: ; return; ) A 3 cmd_line.c /* * cmd_line.c */ /* * This file contains the procedure cmd_llne (), and is part of the parallel * APT benchmark program written for the IBM SP2 by the STAP benchmark * parallelization team at the University of Southern California (Prof. Kai 123 * Hwang, Dr. Zhiwei Xu, and Masahiro Arakawa), as part of the ARPA * Mountaintop program. I t * The sequential APT benchmark program was originally written by Tony Adams * on 7/12/93. * * The procedure cmd_line () extracts the name of the files from which the * input data cube and the steering vector data should be loaded. The * function of this parallel version of cmd_line (} is different from that * of the sequential version, because the sequential version also extracted * the number of iterations the program should run and some reporting * options. 
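 *
 * Usage illustration (the file name "testcase" is only an example): when the
 * benchmark is invoked as "apt testcase", this routine returns
 * "testcase.str" as the steering vector file name and "testcase.dat" as the
 * data cube file name.  The run.256 script, for instance, passes
 * /scratch2/zxu/new_data on the poe command line, so the files
 * /scratch2/zxu/new_data.str and /scratch2/zxu/new_data.dat are loaded.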
*/ ((include ■stdio.h’ ((include "math.h" ((include "defs.h" ((include <sys/types .h> ((include <sys/time.h> * cmd_li ne () * inputs: argc, argv * outputs: str_name, data_name * * argc, argv: these are used to get data from the command line arguments * str_name: a holder for the name of the input steering vectors file * data_name: a holder for the name of the input data file */ cmd_line (argc, argv, str_name, data_name) int argc; char *argv[]; char str_name[LINE_MAX]; char data_name(LINE_MAX]; /* * Begin function body: cmd_line () * I strcpy (str_name, argv[l](; strcat (str_name, ".str"); strcpy (data_name, argv[l]); strcat (data_name, ■,dat"|; return; A.4 fftc /* * f f t. c V /* * This file contains the procedures fft () and bit_reverse (), and is part * of the parallel APT benchmark program written for the IBM SP2 by the STAP * benchmark parallelization team at the University of Southern California * (Prof. Kai Hwang, Dr. Zhiwei Xu, and Masahiro Arakawal, as part of the ARPA * Mountaintop program. * The sequential APT benchmark program was originally written by Tony Adams 124 * on 7/12/93. » * The procedure fft () implements an n-point in-place decimation-in-time * FFT of complex vector 'data* using the n/2 complex twiddle factors in * "w_common*. The implementation used in this procedure is a hybrid between * the implementation in the original sequential version of this program and * the implementation suggested by MHPCC. This modification was made to * improve the performance of the FFT on the SP2. * * The procedure bit_reverse () implements a simple {but somewhat inefficient) * bit reversal, */ ((include ’defs.h* /* * fft () * inputs: data, w_common, n, logn * outputs : data */ void fft (data, w_common, n, logn) COMPLEX *data, *w_common; int n, logn; < /* fft */ int incrvec, iO, il, i2, nx, tl, t2, 13; float fO, fl; void bit_reverse () ; /* * Begin function body: fft () * * Bit-reverse the input vector. */ (void) bit_reverse (data, n); /* * Do the first (log n) - 1 stages of the FFT, */ i2 = logn; for (incrvec = 2; incrvec < n; incrvec <<= 1) { /* for (incrvec...) */ i 2 - - ; for (iO = 0; iO < incrvec » 1; i0++) ( /* for (iO...) */ for (il = 0; il < nj il += incrvec) ( /* for (il. . .) */ tl = iO + il + incrvec / 2; 12 = iO « 12; t3 = iO + il; fO = data (tl] . real * w_convmon[t2) .real - data[tl].imag * w_common[t2].imag; fl = dataC11].real * w_common[t2).imag + data[tl].imag * w_common[t2].real; data[tl].real = data[13].real fO; data{11].imag = data[13j .imag - fl; data{13j.real = data|t3j .real + fO; data[13].imag = data|t3].imag + fl; ) /* for (il...» */ ) /* for (iO...) */ ) /* for (incrvec...) */ /* * Do the last stage of the FFT. */ 125 for (iO =0; iO < n / 2; i0++) ( /* for (iO. . .) */ tl = iO + ■ n / 2; fO = data[tl).real * w_common[iO).real - datattlJ.imag * w_common[10].imag; fl = data[tl).real * w_common[iO].imag + data[tl].imag * w_common[iO].real; data[tl].real = data[iO].real - fO; data[tlj.imag = data[iO].imag - fl; data[iO].real = data[iO].real + fO; data[iO].imag = data[iO].imag + fl; } /* for (iO. . .) 
*/ ) /* fft V /* * bit_reverse () * inputs: a, n * outputs: a */ void bit_reverse (a, n) COMPLEX *a; int n; { int i, j, k; /* * Begin function body: bit_reverse (> */ j = 0; for (i = 0; i < n - 2; i++) < if (i < j) SWAP(a [j], a(i]); k = n » 1; while (k <= j) { j -= k; k >>= 1; ) j += k; } ) A S ffl_APT.c /* * fft^APT.C */ /* * This file contains the procedure fft_APT (), and is part of the parallel * APT benchmark program written for the IBM SP2 by the STAP benchmark * parallelization team at the University of Southern California (Prof. Kai * Hwang, Dr. Zhiwei Xu, and Masahiro Arakawa), as part of the ARPA * Mountaintop program. * * The sequential APT benchmark program was originally written by Tony Adams * on 7/12/93. * 126 * The procedure fft^APT () performs an FFT along the PRI dimension for * each column of data (there are a total of RNG*EL such columns) by calling * the procedure fft (). * / ♦include "stdio.h* ♦include "math.h* ♦include "defs.h" ♦include <sys/types.h> ♦include <sys/time.h> /* * fft_APT () * inputs: fft_out * outputs: fft_out * * fft_out: the FFT output data cube * / fft_APT (fft_out) COMPLEX fft_OUt[] [RNG / NN] [EL] ; /* * Variables * * pricnt: pricnt globally defined in main * output_report: not used in parallel version * fft (): a pointer to the FFT function (which performs a radix 2 * calculation) * taskid: an identifier for this task (included for parallel execution) * el: the maximum number of elements in each PRI * beams: the maximum number of beams * rng: the maximum number of range gates in each PRI, in this data cube slice * (modified for parallel execution) * i, j, k: loop counters * logpoints: log base 2 of the number of points in the FFT * pix2, pi: 2*pi and pi * x: storage for the temporary FFT vector * w: storage for the twiddle factors table */ extern int pricnt; extern int output_report; extern void fft (); extern int taskid; int el = EL; int beams = BEAM; int rng = RNG / NN; int i, j, k; int logpoints; float pix2, pi; static COMPLEX x[PRI]; static COMPLEX w[PRIj; /* * Begin function body: fft_APT () * * Generate twiddle factors table w. */ logpoints = log2 ((float) pricnt) + 0.1; pi = 3.14159265358979; pix2 ~ 2.0 * pi; 127 for (i - 0; i < PRI; i++) < w [i|.imag = -sin (pix2 * (float) i / (float) PRI); w[i).real = cos <pix2 * (float) i / (float) PRI); ) /" * For each range gate and each element, move all the PRIs into real and * imaginary vectors for the FFT. */ for (i = 0; i < rng; i++) ( for (k = 0; k < el; k++) ( for (j = 0; j < pricnt; j++) { x[j].real - fft_out[j][i][k].real; x [j).imag = fft_out[j)[i][k].imag; ) /* * Perform the FFTs. */ fft (x, w, pricnt, logpoints); /* * Move the FFT’ed data back into fft_out. */ for (j = 0; j < pricnt; j++) ( fft_outIj][i)[k].real = x[j).real; fft_out[j][i) tk].imag = x[j).imag; ) ) ) return; ) A.6 forback.c /* * forback.c */ / * * This file contains the procedure forback (>, and is part of the parallel * APT benchmark program written for the IBM SP2 by the STAP benchmark * parallelization team at the University of Southern California (Prof. Kai * Hwang, Dr. Zhiwei Xu, and Masahiro Arakawa), as part of the ARPA * Mountaintop program. * * The sequential APT benchmark program was originally written by Tony Adams * on 7/12/93. * * The procedure forback () performs a forward and back substitution on an * input temp_mat array, using the steering vectors, str_vecs, and * normalizes the solution returned in the vector "weight_vec". 
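 *
 * In matrix terms (an added summary, using the notation of the code below):
 * with the lower triangular Cholesky factor L held in temp_mat and a
 * steering vector s, the routine solves L y = s by forward elimination,
 * then solves L^H w = y by back substitution, and finally returns
 * w / ||y||, where ||y|| = sqrt(sum |y_i|^2) is the Adaptive Matched Filter
 * normalization factor (wt_factor in the code).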
* * If the input variable ■make_t1 1 = 1, then the T matrix is generated in * this routine by doing BEAM forward and back solution vectors. Nominally, 128 BEAM = 3 2 . * The conjugate transpose of each weight vector, to get the Hemitian, i * put into the T matrix as row vectors. The input, temp_mat, is a square * lower triangular array, with the number of rows and columns = num_rows * * This routine requires 5 input parameters as follows; * num_rowe: the number of COMPLEX elements, M, in temp_mat * str_vecs: a pointer to the start of the steering vector array * weight_vec: a pointer to the start of the COMPLEX weight vector * make_t: 1 = make T matrix; 0 = don't make T matrix but pass back the * weight vector * temp_mat11 I]; a temporary matrix holding various data * * This procedure was untouched during our parallelization effort, and * therefore is virtually identical to the sequential version. Also, most * if not all, of the comments were taken verbatim from the sequential * version. V Hinclude "stdio.h" ((include "math.h" ((include "defs.h" * forback () * inputs: num_rows, str_vecs, weight_vec, make_t, temp_mat * outputs; weight_vec « * num_rows: the number of elements, M, in the input data * str_vecs: a pointer to the start of the steering vector array * weight_vec: a pointer to the start of the COMPLEX weight vector * make_t: 1 = make T matrix; 0 - don’t make T matrix but pass back the * weight vector * temp_mat: MAX temprary matrix holding various data * I forback (num_rows, str_vecs, weight_vec, make_t, temp_jnat) int num_rows; COMPLEX str_vecs[][DOF]; COMPLEX weight_vec[]; int make_t; COMPLEX temp_mat[][COLS]; /* * Variables * * i, j, k; loop counters * last; the last row or element in the matrices or vectors * beams; loop counter * num_beams: loop counter * sum_sq; a holder for the sum squared of the elements in a row * sum: a holder for the sum of the COMPLEX elements in a row * temp: a holder for temporary storage of a COMPLEX element * abs_jnag: the absolute magnitude of a complex element * wt_factor: a holder for the weight normalization factor, according to * Adaptive Matched Filter Normalization * steer_vec: one steering vector with DOF COMPLEX elements * vec: a holder for the complex solution intermediate vector * tmpjnat: a pointer to the start of the COMPLEX temp matrix */ int i, j, k; int last; int beams; int num_beams; float sum_sq; COMPLEX sum; COMPLEX temp; float abs_mag; float wt_factor; COMPLEX steer_vecIDOF); COMPLEX veC(DOF); COMPLEX tmp_mat[DOF][DOF]; /* * Begin function body; forback () * * The tempjnat matrix contains a lower triangular COMPLEX matrix as the * first MxM rows. Do a forward back solution BEAM times with a different * steering vector. If make_t = 1, then make the T matrix. */ if (make_t) { num_beams = num_rows; /* Do M forward back solution vectors. Put */ /* them else into T matrix. */ J else ( num_beams =1; /* Do only 1 solution weight vector */ } for (beams = 0; beams < num_beams; beams++) ( /* for (beams...) */ for (j = 0; j < num_rows; j++) { steer_vec[j].real = str_vecs[beams][j].real; steer_vec(jj.imag - str_vecs[beamsj[jj.imag; ) / * * step 1: Do forward elimination. Also, get the weight factor = square root * of the sum squared of the solution vector. Used to divide back substitution * solution to get the weight vector. Divide the first element of the COMPLEX * steering vector by the first COMPLEX diagonal to get the first element of * the COMPLEX solution vector. 
First, get the absolute magnitude of the first * lower triangular diagonal. */ abs_mag = temp_mat[0)[0).real * tempjnat 10] [0 ] . real + temp_mat[0][0].imag » temp_mat(0]10].imag; / * * Solve for the first element of the solution vector. */ vec[0].real = Itemp_mat[0][0].real * steer_vec[0].real + temp_mat[0][0].imag * steer_vec[0].imag) / abs_mag; vec[0].imag = (temp_mat[0][0].real * steer_vec[0].imag - temp_mat[0][0].imag * steer_vec[0].real) / abs_mag; / * * Start Bumming the square of the solution vector. */ sum_sq = vec[0].real * vec[0].real + vec[0].imag * vec[0].imag; / * * Now solve for the remaining elements of the solution vector. V for (i = 1; i < num_rows; i++) 130 { /* for (i . ..) */ sum. real - 0.0; sum.imag = 0.0; for (k = 0; k < i; k++) { /* for (k...j */ sum.real += (temp_jnat[i][k].real * vec[k).real - temp_mat[i][k].imag * vec[k].imag); sum.imag += (tempjnat[i|[kj.imag * vec[k).real + temp_mat[i][k].real * vec(k1.imag); ) /* for (k...) */ /* * Now subtract the sum from the next element of the steering vector. •/ temp.real = steer_vec[i].real - sum.real; temp.imag = steer_vec[i].imag - sum.imag; /' * Get the absolute magnitude of the next diagonal. * / abs_mag = temp_mat[i][i].rea1 * temp_jnat [ i ] [ i ] . real + ■ temp_mat [ i ] [ i J . imag * temp_mat[i][i|.imag; I* * Solve for the next element of the solution vector. */ vec[i],real = (temp_mat[i][iJ.real * temp.real + temp_jnatli][i).imag * temp.imag) / abs_mag; vec[i].imag = (temp_mat[i]Ii].real * temp.imag - temp_mat|i][i).imag * temp.real) / abs_mag; /* * Sum the square of the solution vector. * ! £um_sq += (vecli).real * vec[i].real + vec(i).imag * vec[i].imag); } /* for (i . . . ) */ wt_factor - sqrt ((double) sum_sq); /* * step 2; Take the conjugate transpose of the lower triangular matrix to * form an upper triangular matrix. */ for (i = 0; i < num_rows; i++) ( /* for {i...» */ for (j = 0; j < num_rows; j++> { /* for (j . . .) •/ tmpjnat [ i ] [ j ] . real = temp_jnat I j ] [ i J . real; tmpjnat [ i ] [ j j . imag = - temp_jnat [ j ] [ 1] . imag; ) /* for (j . . .) */ ) /* for (i . . .) * / /* * Step 3: Do a back substitution. */ last = nutn_rows - 1; / * * Get the absolute magnitude of the last upper triangular diagonal. */ absjnag = tmpjnat[last][last).real * tmp_mat[last][last].real + tmp_jnat[last][last].imag * tmpjnat[last][last].imag; 131 / * * Solve for the last element of the weight solution vector. */ weight_vec[last].real = (tmp_mattlast]1 last].real * vec[last].real + tmp_mat[last][last].imag * vec[last].imag) / absjnag; weight_vec[last].imag = (tmpjnat[last][last].real * vec[last].imag - tmp_mat[last][last].imag * vec[last].real) / abs_mag; /* * Now solve for the remaining elements of the weight solution vector from * the next to last element up to the first element. V for li = last - 1; i >= 0; i--) [ /* for { i...) ’/ sum.rea1 = 0.0 sum.imag = 0.0 for (k = i + 1; k <= last; k++) { /* for (k...) */ sum.real += (tmp_mat[i][k].real * weight_vec[k].real - tmp_jnat [ i ] [k] . imag * weight_vec[k].imag); sum.imag += (tmp_mat[i][k].imag * weight_vec[k].real + tmp_mat[i][k].real * weight_vec[k].imag); } /* for (k...) *i / * * Subtract the sum from the next element up of the forward solution vector. V temp.real = vec[i].real - sum.real; temp.imag = vec[ij.imag - sum.imag; /* * Get the absolute magnitude of the next diagonal up. V abs_mag = tmp_mat[i][i].rea1 * tmp_mat[i][i].real + tmp_mat[i][i].imag * tmp_mat[i][i].imag; /* * Solve for the next element up of the weight solution vector. 
V weight_vec(i].real = (tmp_jnat[i][i].real * temp.real * tmp_mat(i)[i).imag * temp.imag) / abs_mag; weight_vec[i].imag = (tmp_mat[i][i].real * temp.imag - tmp_mat[i1[i)•imag * temp.real) / abs_mag; ) /* for (i. . .) */ /* * Step 4: Divide the solution weight_vector by the weight factor. * / for (i = 0; i < nun\_rows; i + + ) ( weight_vec[i].real /= wt_factor; weight_vec[ij.imag /= wt_factor; ) tifdef APT /* * If make_t = 1, make the T matrix. * / 132 if (make_t) ( /* if (make_t) */ /* * Conjugate transpose the weight vector to get the Hermitian. Put each * weight vector into the T matrix as row vectors. * / for (j = 0 ; j < num_rows; j++) I tjnatrix[beams][j].real = weight_vec[j].real; t_matrix[beams][jj.imag = - weight_vec[j].imag; } ) /* if <make_t) */ •endif } / * for (beams...) * / return; ) A.7 house.c /* * house.c */ / * * This file contains the procedure house (), and is part of the parallel APT * benchmark written for the IBM SP2 by the STAP benchmark paralleltzation * team at the University of Southern California (Prof. Kai Hwang, Dr. Zhiwei * Xu, and Masahiro Arakawa], as part of the ARPA Mountaintop Program. * * The sequential APT benchmark program was originally written by Tony Adams * on 7/12/93. * * The procedure house (] performs the Householder transform, in place, on * an N by M complex imput matrix, where M >= N. It returns the results in * the same location as the input data. * * This routine requries 5 input parameters as follows: * num_rows; number of elements in the temp_jnat * num_cols: number of range gates in the temp_mat * lower_triangular_rows: the number of rows in the output temp_mat that * have been lower triangularized * start_row: the number of the row on which to start the Householder * temp_mat[][]: a temporary matrix holding various data * * This procedure was untouched during our parallelization effort, and * therefore is virtually identical to the sequential version. Also, most, * if not all, of the comments were taken verbatim from the sequential * version. * / #include "stdio.h* •include "math.h* •include "defs.h* /* * house () * inputs: num_rows, num_cols, lower_triangular_rows, start_row, temp_jnat * outputs: temp_mat * 133 * num_rows: number of elements, N, in the input data * num_cols: number of range gates, M, in the input data * lower_triangular_rows: the number of rows to be lower triangular!zed * start_row: the row number of which to start the Householder * tempjnat: temporary matrix holding various data V house (num_rows, num_cols, lower_triangular_rows, start_row, temp_mat) int num_rows; int num_cols; int lower_triangular_rows,- int start_row; COMPLEX temp_mat[1[COLS]; I* * Variables * * i, j, k: loop counters * rtemp: a holder for temporary scalar data * x_square: a holder for the absolute square of complex variables * xmax_sq: a holder for the maximum of the complex absolute of variables * vec: a holder for the maximum complex vector 2 * num_cols max * sigma: a holder for a complex variable used in the Householder * gscal: a holder for a complex variable used in the Householder * alpha: a holder for a scalar variable used in the Householder * beta: a holder for a scalar variable used in the Householder V int i, j, k; float rtemp; float x_square; float xmax_sq; COMPLEX vec[2*C0LS]; COMPLEX sigma; COMPLEX gscal; float alpha; f 1 oa t bet a; /* * Begin function body: house () * * Loop through temp_mat for number of rows = lower_triangular_rows. Start * the row number indicated by the start_row input variable. 
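 *
 * Added summary (not in the original comments): steps 1 through 5 below form
 * the standard complex Householder reflector.  With v built from the
 * conjugated row scaled by its largest magnitude, and its i-th entry shifted
 * by sigma*alpha, the quantity v^H v equals 2*alpha*(alpha + |v_i|), where
 * |v_i| is the magnitude before the shift (rtemp in the code).  The scalar
 * beta = 1/(alpha*(alpha + |v_i|)) therefore makes the per-row update
 *    a <- a - beta*(a . v)*conj(v)
 * equivalent to applying H = I - 2*v*v^H/(v^H v), which zeroes the elements
 * of row i to the right of the diagonal.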
* / for (i = start_row; i < lower_triangular_rows; i++) [ /* for (i. .. ) */ /• * Step 1: Find the maximum absolute element for each row of temp_mat, * starting at the diagonal element of each row. */ xmax_sq = 0 .0 ; for (j = i; j < num_cols; j++) { /* for (j . . . ) */ x_square = tempjnat [ i) [ j ] . real * temp_piat [ i ] [j ] . real + temp_mat[i][j].imag * temp_mat[i][j].imag; if (xmax_sq < x square) { xmax_sq = x_square; } } /* for I j. . .) V /* * Step 2: Normalize the row by the maximum value and generate the complex * transpose vector of the row in order to calculate alpha = square root of * the sum square of all the elements in the row. */ xmax_sq = (float) sqrt ((double) xmax_sq); alpha = 0 .0 ; for (j = i; j < nun\_cols; j++) ( /* for (j . . . ) */ vec[j].real = temp_jnat[i][j].real / xmax_sq; vec[jj.imag = - temp_mat[I)[j J.imag / xmax_sq; alpha += (vec[j].real * vecljj.real + vec[j],imag * vec[j].imag); } /* for (j . . . I V alpha = (float) sqrt ((double) alpha); /* * Step 3: Find beta = 2 / (b (transpose) * b). Find sigma of the relevant * element = x(i) / lx(i)l. V rtemp = vec[i).real * vec[i).real + vec[1J.imag * vec(i).imag; rtemp = (float) sqrt ((double) rtemp); beta = 1 . 0 / (alpha * (alpha + rtemp)); if (rtemp >= 1.0E-16) ( sigma.real = vec[i].real / rtemp; sigma.imag = vec[i].imag / rtemp; ) else { sigma.real = 1,0 ; sigma.imag = 0 .0 ; ) /* * step 4: Calculate the vector operator for the relevent element. V vec[i].real += sigma.real * alpha; vec[ij.imag += sigma.imag * alpha; /* * Step 5: Apply the Householder vector to all the rows of temp_jnat. */ for (k = i; k < num_rows; k++) ( /* for (k...) */ /* * Find the scalar for finding g. V gscal.real = 0 .0 ; gscal.imag = 0 .0 ; for (j = i; j < num_cols; j++) ( /* for (j . . . ) V gscal.real += (temp_mat[k][jJ.real * vecljj.real - temp_matIkJ [j].imag * vec[j].imag); gscal.imag += (temp_matIk]Ij).real • vec[j).imag + temp^mat[k](j].imag * vec[j].rea1); ) /* for (j . . .) */ gscal.real *= beta; gscal.imag *- beta; /* * Modify only the necessary elements of the temp_mat, subtracting gscal * * conjg (vec) from temp_jnat elements. 135 for (j = i; j < num_cols; j++) { /* for (j . . .) V temp_mat(k][j].rea1 temp_mat(k](j].imag ) /* for (j. . . 
) * I ) /* for (k...> */ } /* for •include <sys/time.h> •include <mpproto,h> •ifdef DBLE char *fmt = *%lf*; •else char *fmt = *%f"; •endi f /* 136 * Externally defined variables (needed for parallel execution) * * taskid: an identifier for this task * numtask: the total number of tasks running this program * allgrp: an identifier for all the tasks running this program (used for * collective communications and aggregated computations) */ extern int taskid, numtask, allgrp; * read_input_APT () * inputs: data_name, str_name * outputs: str_vecs, fft_out * * data_name: the name of the file containing the data cube * str_name: the name of the file containing the steering vectors * str_vecs: a matrix holding the steering vectors * fft out: this task's slice of the data cube */ read_input_APT (data_name, str_name, str_vecs, fft_out) char data_name[]; char str_name[); COMPLEX str_vecs[][DOF]; COMPLEX fft_0Ut[][RNG / NN][EL]; * Variables * * el: the maximum number of elements in each PRI * beams: the maximum number of beams * rng: the maximum number of range gates * temp: a buffer for binary input integer data * tempi, temp2 : holders for integer data * i, j, k: loop counters * fopen (); a pointer to the file open function * f_str, f_dat: pointers to the input files * local_int_cube: the local portion of the data cube * pricnt: the pricnt variable globally defined in main () * threshold: the threshold variable globally defined in main () * blklen, rc: not used */ int el ^ EL; int beams = BEAM; int rng = RNG / NN; unsigned int temp[l]; long int tempi, temp2; int i, j, k; FILE *fopen(); FILE *f_str, * f_dat; unsigned int local_int_cube[RNG/NN][PRI][EL]; extern int pricnt; extern float threshold; long blklen, rc; /* * Begin function body: read_input_APT () * * Every task: load the steering vector file. */ if ((f_str = fopen (str_name, "r")) == NULL) [ printf ("Error - task %d unable to open steering vector file.Nn", 137 taskid); exit (-1); > /• * The first item in the steering vector file is the number of PRIs. The * second item in the steering vector file is the target detection threshold, V fscanf (f_str, *%d*, fcpricnt); fscanf (f_str, fmt, ^threshold); /* * Read in the rest of the steering vector file. */ for (i = 0 ; i < beams; i++) I for (j = 0 ; j < el; j++) ( fscanf (f_str, fmt, istr_vecs [ i ] [ j ] . real ) fscanf (f_str, fmt, tstr_vecs[ij(jj.imag); > ) fclose (f_str); /* * Read in the data file in parallel. */ if ((f_dat = fopen (data_name, “r“>) == NULL) I printf ("Error - task %d unable to open data file.Nn", taskid); exit (-1 ); ) fseek (f_dat, taskid * rng * pricnt * el * sizeof (unsigned int), 0 ); fread (local_int_cube, sizeof (unsigned int), pricnt * rng * el, f_dat); fclose (f_dat); /* * Convert data from unsigned int format to floating point format. */ for (i = 0 ; i < rng; i++) ( /* for (i...» */ for (j = 0 ; j < pricnt; j++) ( /* for (j. . .) */ for (k = 0 ; k < el; k+ + ) ( /* for (k...) V temp(0 ] = local_int_cube[i](j)(k); tempi = OxOOOOFFFF & temp[0]; tempi = (tempi & 0x00008000) ? tempi I OxffffOOOO : tempi; temp2 = (tempIO] » 16) & OxOOOOFFFF; temp2 = (temp2 & 0x00008000) ? temp2 I OxffffOOOO : temp2; fft_outIj][i)(k].real = (float) tempi; fft_out[j)[i][k].imag = (float) temp2 ; > /* for (k...) V ) /* for (j. . .) */ ) /* for (i . . .) 
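 *
 * Worked example (added for illustration): for a packed input word
 * 0xFFFF0001, the low 16 bits (0x0001) give real = 1.0, while the high 16
 * bits (0xFFFF) have the sign bit set and are extended to 0xFFFFFFFF, that
 * is -1, giving imag = -1.0.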
V return; ) 138 A.9 stepl_beam s.c /* * stepl_beams.c */ /* * This file contains the procedure stepl_beams (), and is part of the * parallel APT benchmark program written for the IBM Sp2 by the STAP * benchmark parallelization team at the University of Southern California * (Prof. Kai Hwang, Dr. Zhiwei Xu, and Masahiro Arakawa), as part of the ARPA * Mountaintop program. * * The sequential APT benchmark program was originally written by Tony Adams * on 7/12/93. * * The procedure stepl_beams () performs the first step adaptive beamforming * algorithm. In this algorithm, we apply the Householder transform to all * rows of a matrix to get a lower triangular Cholesky. Then, we foward/back * solve with the steering vectors to get a set of weights to form the T * matrix. * * In the original sequential version of this procedure, forming the * Householder matrix was done here. However, in the parallel version, the * formation and broadcasting of the Householder matrix was moved into * main U ■ V #include ■stdio.h" #include *math.h* ^include "defs.h* kinclude <sys/types.h> #include <sys/time.h> /* * stepl_beams (> * inputs: str_vecs, partial_temp * outputs: partial_temp * * str_vecs: steering vectors used to solve for the T matrix * partial_temp: the Householder matrix */ stepl beams (str_vecs, partial_temp| COMPLEX str_vecs[][DOF]; COMPLEX part ial_temp[RNG_S)[ELJ; /* * Variables * * tempjnat: MAX temporary matrix holding various data * el: the maximum number of elements in each PRI * beams: the maximum number of beams * rng: the maximum number of range gates in each PRI * num_cols: the number of samples in the first step beamforming matrix * i, k: loop counters * lower_triangular_rows: the number of rows to be lower triangularized * num_rows: the number of rows on which the Householder transform is applied * start_row: the row number on which to start the Householder transform * weight_vec: storage for the weight solution vectors * make_t: I = generate T matrix from the solution weight vectors; 0 = don't * generate the T matrix, but pass back the weight vector * numtask: the total number of tasks running this program 139 * taskid: an identifier for this task * fpl, fp2 : pointers to files; used for diagnostic purposes * e, c: loop counters; used for diagnostic purposes */ static COMPLEX temp_mat[EL][COLS]; int el = EL; int beams = BEAM; int rng = RNG; int num_cols - RNG_S; int i, k; int lower_triangular_rows; int nun(_rows; int start_row; COMPLEX weight_vec(DOF]; int make_t; int numtask, taskid; FILE "fpl, *fp2; int e, c; /* * Begin function body: stepl_beams () * * Rearrange partial_temp, and store into temp_mat. V for (i = 0 ; i < 320; i + + ) { for (k = 0 ; k < el; k+ + ) ( temp_mat[k][i].real - partial_temp[i][k]-real; tempjnat[kj[i].imag = partial_temp[i][k].imag; ) } Save tempjnat for diagnostic purposes. / mpc_environ (inumtask, ttaskid); if (taskid == 0) { print f (* saving temp_jnat. . .\n"); } if ((fpl = fopen ("/scratchl/masa/temp_mat_par.real", "w")) == NULL) { printf ("Error - unable to open /scratchl/masa/temp_mat_par.real.\n"); exit (-1); 1 if ((fp2 = fopen (*/scratchl/masa/temp_mat_par.imag", "w")) == NULL) ( printf ("Error - unable to open /scratchl/masa/temp_mat_par.imag,\n*); exit (-1); ) for (e = 0; e < EL; e + +) ( for (c = 0; c < 320; C++) ( fwrite (&(temp_jnatIelEc].real), sizeof (float), 1, fpl); fwrite (fc(tempjnat[e][cj .imag), sizeof (float), 1, fp2 ); ) ) fclose (fpl); fclose (fp2); 140 /’ * Perform Householder transform. 
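 *
 * Added summary: at this point temp_mat holds the EL x RNG_S (nominally
 * 32 x 320 for APT) first-step sample matrix.  house () lower-triangularizes
 * all EL rows of it, and forback () is then called with make_t = 1 to solve
 * against the BEAM steering vectors and form the BEAM x EL T matrix used in
 * the second beamforming step.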
'/ start_row =0; /* Start Householder on 1st row */ num_rows = el; /* Do Householder on all rows */ lower_triangular_rows = el; /* Lower triangularize all rows */ house (num_rows, num_cols, lower_triangular_rows, start_row, temp_mat); Save tempjnat for diagnostic purposes. / mpc_environ (tnumtask, itaskid); if (taskid == 0) printf (‘ saving temp_mat2 ...\n*); if ((f£l = fopen (*/scratchl/raasa/temp_mat2__par. real", "w*) ) == NULL) printf ('Error - unable to open /scratchl/masa/temp_mat2_par.real.\n*) exit (-1}; if ((fp2 = fopen <'/scratchl/masa/temp_mat2_par.imag*, "w")) == NULL) printf ("Error unable to open /scratchl/masa/temp_mat2_par.imag.\n*I exit (-1); for (e = 0; e < EL; e + +) for (c = 0; C < 320; C++) ( fwrite (&(temp_jnat[e][c].real), sizeof (float), 1, fpl); fwrite (&Itemp_mat[ejlc].imag), sizeof (float), 1, fp2); fclose (fpl); fclose (fp2); Now, temp_|nat has lower triangular matrix as the first MxM rows. Set make_t to 1 to generate the T matrix. / make t /* * call forback to solve M times with a different steering vector, putting * the Hermitian of the solution weight vectors into the rows of the T matrix. */ forback <lower_triangular_rows, str_vecs, weight_vec, make_t, temp_jnat); * The T matrix will have N main beams in 1st N rows, and A auxiliary beams in * the remaining A rows. return; 141 A.10 step2_beams.c step2_beams. c / This file contains the procedure step2_beams (), and is part of the parallel APT benchmark program written for the IBM SP2 by the STAP benchmark parallelization team at the University of Southern California (Prof. Kai Hwang, Dr. Zhiwei Xu, and Masahiro Arakawa), as part of the ARPA Mountaintop program. The sequential APT benchmark program was originally written by Tony Adams on 7/12/93. The procedure step2_beams () performs the second adaptive beamforming step. This parallel version has been modified for parallel execution in that each task performs the beamforming algorithm on only its slice of the data cube. Other than that, the algorithm has been left alone. Also, most, if not all, of the comments in this file are taken directly from the sequential version. For each of the dopplers: 1. Does T_matrix * doppler (BEAM number of beams by RNG range gates). Nominally, BEAM = 32. 2. For each of NUMSEG segments (of RNGSEG range gates each): Nominally, NUMSEG = 7, and RNGSEG = 40. a. Finds the maximum power in all beams, other than the main beams. (First MBEAM beams are main beams) b. Selects PWR_BM most power beams and then appends MBEAM main beams to the PWR_BM making (nominally 9+12=21) PWR_BM + MBEAM beams. c. Does a Householder on the PWR_BM + MBEAM beams beams but lower triangularizing only the first PWR_BM beams. d. Then, for each main beam, lower triangularize it and append to the PWR_BM genearting a lower triangular MxM matrix (nominally M = 10). e. Forward and back solve each MxM matrix applying steering vectors for forward solutions. (Steering vectors have been modified by Tb (the maximum power beams of the T matrix)). f. Apply weights from (e.) above to input dopplers to get output data cube to be used for target detection. The output data cube is MBEAM beams by PRI dopplers by RNG range gates divided, in NUMSEG segments of RNGSEG range gates each. 
This routine requires 6 input parameters as follows: dopplers: number of PRIs in the input data beams: number of beams (rows) in the input data rng: number of range gates in the input data str_vecs: pointer to the steering vectors local_cube[]I)(]: input data cube detect_cube[][)(]: detection data cube •include "stdio.h* •include "math.h’ •include *defs.h“ •include <sys/types.h> •include <sys/times.h> extern int taskid; /* * step2_beams () * inputs: dopplers, beams, rng, str_vecs, local_cube 142 o u t p u t s : d e t e c t _ c u b e dopplers: number of PRIs in the input data beams: number of beams in the input data rng: number of range gates in the input data str_vecs: a pointer to the array of COMPLEX steering vectors local_cube: input data cube detect_cube: detection data cube step2_beams (dopplers, beams, rng, str_vecs, local_cube, detect_cube) int dopplers; int beams; int rng; COMPLEX str_vecs(][DOF]; COMPLEX local_cube[]{RNG]{EL] ; COMPLEX detect_cube[]{PRI / NN][RNG]; Variables temp_mat: a temporary matrix holding various data rng_seg: number of range gates per segment main_beams: number of main beams in the input data aux_beams: number of auxiliary beams in the input data max_pwr_beamss number of max pwr beams used in Cholesky num_seg: number of range segments for Cholesky el: number of elements in the steering vectors num_rows: numbe rof rows on which to apply the Cholesky start_row: the row number on which to start applying the Cholesky lower_triangular_rows: number of rows to lower triangularize 1, 3, k, m, n: loop counters row: row number col: column number appl: column number for application data dop: loop counter for doppler number seg_col: column number for correct 1 of 7 segments next_seg: start column number for next 1 of 7 segments indx: index desired column number xreal: storage for real data for pointer use ximag: storage for imag data for pointer use weight_vec: storage for weight solution vector mod_str_vec: storage for one modified steering vector x_beams: hold each doppler for step 2 beamforming make_t: flag to make or not make T matrix. If make_t is set to 0, T matrix is not made and the weight vector is returned to the caller real_ptr: pointer to next real data imaa ptr: pointer to next imaginary data max_sum: max sum of the ABEAM auxiliary beams: nominally ABEAM = 20. pwr_sums: pwr sums of ABEAM auxiliary beams row_indx: row index of pwr sums of SBEAM aux beams max_indx: row index of PWR_BM max pwr beams sum: COMPLEX variable to hold sum of complex data / static COMPLEX temp_jnat[BEAM][COLS]; int rng_seg = RNGSEG; int main_beams = MBEAM; int aux_beams = ABEAM; int max_pwr_beams = PWR_BM; int num_seg = NUMSEG; int el = EL; int num_rows; int start_row; int lower_triangular_rows; int i, j, k, m, n; 143 int row; int col; int appl; int dop; int seg_col; int next_seg; int indx; float xreal[EL]; float ximag[EL]; COMPLEX weight_vec[DOF]; COMPLEX mod_str_vec[l][DOF]; COMPLEX x_beams[EL][RNG_S]; int make_t; float *real_ptr; float *imag_ptr; float max_sum; float pwr_sums[ABEAM]; int row_indx[ABEAM]; int max_indx[PWR_BM]; COMPLEX sum; extern void house]); extern void forback!); i* * Begin function body: step2_beams () V for (dop = 0 ; dop < dopplers; dop+O { I* for (dop...I */ /* * For each doppler, form beams by multiplying T_matrix by each doppler data * in the local_cube. MBEAM main beams are first then ABEAM auxiliary beams. * Nominally, MBEAM = 12 and ABEAM = 20. Perform T_matrix * doppler data and * store into the x_beams matrix. 
V for (j = 0; j < rng; j++) { /* for (j . . .) */ /* * For each doppler move EL elements of each range gate into temporary * holding vectors to multiply all rows of T matrix using pointers. */ real_ptr = xreal; imag_ptr = ximag; for (k = 0 ; k < beams; k++) { /• for (k...) * I *real_ptr++ = local_cube[dop][j](k].real; *imag_ptr+4- = local_cube [dop) | j ] [k] . imag; ) /* for (k...) */ /* * Multiply 1 range gate of each doppler by all rows of T_matrix summing the * products of each row and storing the sums in a corresponding row of x_beams * doppler output matrix. */ for (row = 0 ; row < beams; row++) ( /* for (row...) V sum.real = 0 .0 ; sum.imag = 0 .0 ; real_ptr = xreal; imag_ptr = ximag; for (col = 0 ; col < beams; col++) 144 ( /* for (col...) */ sum.real += (t_matrix[row][col].real * *real_ptr - t_matrix[row][colJ.imag * *imag_ptr); sum.Imag *= (t_matrix[row][col].imag * *real_ptr++ + t_matrix[row] [col] . real * *imag_ptrt + ) ; ) /* for (col...) */ x_beams[row][j].real = sum.real; x_beams[rowj [jj.imag = sum.imag; } /* for (row...) */ ) /* for (j . . .) V /* * For each doppler we now have BEAM beams formed in x_beams matrix. Main * beams are the first MBEAM beams then ABEAM auxiliary beams. Divide RNG * range gates into NUMSEG segments of RNGSEG range gates. Process RNGSEG * range gates per secgment for each of the NUMSEG segments for both the * cholesky data and application of weights data. The PWR_BM max pwr beams * are the first beams and following these beams are the MBEAM main beams. */ for (j = 0 ; j < num_seg,- j + +) [ /* for (j. . . ) */ next_seg = j * rng_seg; t* * Begin by moving MBEAM main beams to both Cholesky area and weights * application area of temp_mat as the MBEAM+1 through the MBEAM+PWR_BM * beams. */ for (k = 0 ; k < main_beams; k+ + ) { /* for (k...) V for (col = 0; col < rng_seg; col++) ( /* for (col...) */ row = max_pwr_beams + k; appl = col + rng_seg; /* application region RNGSEG */ /* range gates after cholesky */ /* data region */ seg_col = next_seg + col; /* gets the data from the */ I* correct range segments col */ /* which divides the range */ /* gates */ temp_jnat [row] [col ]. real = x_beams [k] [seg_col ] . real; temp_mat[row][col].imag = x_beams[k][seg_col].imag; temp_jnat[row][appl].real = x_beams[k][seg_col].real; tempjnat[row][appl].imag = x_beams[k][seg_col].imag; ] /* for (col.,.) */ ) /* for (k...) */ /* * For each segment, find the pwr sums of the ABEAM auxiliary beams. */ for (k = 0 ; k < aux_beams; k++) ( /* for (k...) V pwr_sums[k] = 0 .0 ; row_indx[k] = main_beams + k; row = k + main_beams; for (col = 0 ; col < rng_seg; col++) ( /* for (col...) */ seg_col = next_seg + col; /* gets the data from the */ /* correct range segments col */ /* which divides the range */ /* gates */ pwr_sums(k] += (x_beams[row][seg_col].real * x_beams[row][seg_col].real + 145 x_beams[row][seg_col].imag * x_beams[row j[seg_colj .imag); ] /* for (col...) */ ) /* for (k...) */ I* * For each segment, find PWR_BM max pwr beams of the ABEAM aux beams. Sort * them by max pwr in max_indx vector. */ for (k = 0; k < max_pwr_beams; k + +) { /* for (k...} */ /* * Init max_index[k] since pwr_sums may all be 0 or Not a Number. V max_indx[k] = main_beams + k; max_sum = 0 .0 ; for (m = 0 ; m < aux_beams; m++) { /* for (m...) */ /* * The > in the next if statement was changed to a >= */ if (pwr_sums[m) >= max_sum) { max_sum = pwr_sums[m]; max_indx[k] = row_indx[m]; n - m; } } /* for (m...) 
*/ pwr_sums[nI = 0 .0 ; } /* for (k...> */ /* * For each segment, move PWR_BM max pwr beams to both Cholesky area and * weights application area of temp_mat as the first PWR_BM beams. */ for (k = 0 ; k < max_pwr_beams; k++( ( /* for (k...) */ indx = max_indx[k]; /* Indexes the correct max_pwr */ /* beam row */ for (col = 0; col < rng_seg; col+t) ( /* for (col...) */ appl = col + rng_seg; /* application region RNGSEG */ /* range gates after cholesky */ /* data region */ seg_col = next_seg + col; /* gets the data from the */ /* correct range segments col */ /* which divides the range */ /* gates */ tempjnat[k][col].real = x_beams[indx][seg_col].real; tempjnat[k][col].imag = x_beams[indx][seg_colj.imag; temp_mat[kj [appl].real = x_beams[indx][seg_col].real; tempjnat[kj [applj.imag = x_beams[indxj[seg_col].imag; } /* for (col...) */ ) /* for (k...) */ /* * In 2nd step adaptive beam forming we apply householder to all PWR_BM * + MBEAM rows but only lower triangularize the PWR_BM maxj>wr_beams rows. */ num_rows = ma x_pw r_beam s + main_beams; 146 1ower_t ri angu1 a r_rows = max_pwr_beams; start_row = 0 ; /* start housholder on 1st row */ /* * Call the second step adaptive beamforning Householder routine. */ house (num_rows, rng_seg, lower_triangular_rows, start_row, temp_jnat); /* * In addition call householder routine MBEAM more times with one of the main * beam rows moved to the PWR_BM+1 row in order to lower triangularize one * more row to make it a MxM lower triangular matrix. Then this will be * forward and back solved for weights. Nominally M = 10. */ for (k = 0; k < main_beams; k++) { 1* for (k...) V /* * Move a main beam row to PWR_BM+1 row except for the main beam 1 which is * already in the Mth row and doesn't have to be moved. */ if (k) { /* if (k) V row = k + max_pwr_beams; /* desired 1 of MBEAM main beam */ I* numbers */ for (col = ■ 0 ; col < rng_seg; col++) { /* for (col...> */ appl = col + rng_seg; /* application region RNGSEG */ /* range gates after cholesky */ /* data region */ temp_mat [max_j?wr_beams] [col] .real = temp_mat[row][col].real; temp_mat[max_pwr_beams][col].imag = temp_mat[row][col].imag; temp^mat[max_pwr_beams][appl].real = temp_mat[row][appl].real; temp_jnat[max_pwr_beams][appl].imag = temp_mat[row][appl].imag; } /* for (col...) */ } /* if (k) */ num_rows = max_pwr_beams + 1; /* Allows 1 row to be processed */ /* in housed */ lower_triangular_rows = max_pwr_beams + 1 ; start_row = max_pwr_beams; /* start housholder on PWR_BM + */ I* 1 (last) row. This does V /* Householder on only 1 row */ /* * Call second step adaptive beamforming Householder routine. V house (num_rows, rng_seg, lower_triangular_rows, start_row, tempjnat); /* * The temp_mat matrix now has a lower triangular matrix as the first MxM * rows. Modify the main beam steering vector by the max pwr beam rows of the * T_jnatrix plus 1 selected main beam row to get a lxM modified steering * vector. Move EL elements of each selected main beam steering vector into 147 * a temporary holding vector to multiply all max pwr rows of T matrix using * pointers. */ real_ptr = xreal; imag_ptr = ximag; for (n = 0; n < el; n++) { t* for (n...) */ *real_ptr++ = str_vecs[k]In].real; *imag_ptr++ = str_vecs[kj[n].imag; } /* for <n...) */ /* * Multiply the steering vector by all max pwr beams of the T_matrix. */ for (m = 0; m < max_pwr_beams; m+o I /* for (tn. . . ) »/ sum.real = 0 .0 ; sum.imag = 0 .0 ; real_ptr = xreal; imag_ptr = ximag; row = max_indx[m] ; for (col = 0 ; col < el; col++] ( /* for (col...) 
*/ sum.real += (tjnatrixfrow][col].real * *real_ptr - t_matrix[row][col].imag * *imag_ptr); sum. imag += (tjnatrixfrow][col].imag * *real_ptr + + + t_matrix[row][col].real * *imag_ptr++); } /* for (col...) */ /* * Put the modified steering vector in slot 0 of the mod_str_vec matrix since * only one forward back solution is done in the forback routine. */ mod_str_vec[0 ][m].real = sum.real; mod_str_vec[0 ][ml.imag = sum.imag; } /* for (m...) */ /* * One more row to multiply. Multiply the steering vector by the selected * main beam row of the T_matrix. */ sum.real = 0 ,0 ; sum.imag = 0 .0 ; real_ptr = xreal; imag_ptr = ximag; row = k; /* T matrix row = selected main beam row *i for (col = 0; col < el; col++) { I * for (col.. . > */ sum.real += (t_matrix[row][col].real * *real_ptr - t_matrix[row][col].imag " *imag_ptr); sum.imag += (t_matrix[row][col].imag * *real_ptr++ + t_matrix[row][col].real * *imag_ptr++); } /* for (col...) */ mod_str_vec[0][max_pwr_beams].real = sum.real; mod_str_vec[0 ][max_pwr_beams].imag = sum.imag; /* * Set make_t = 0 to NOT generate a T matrix. Do only one forward back * solution and pass back the solution weight vector. V make_t = 0 ; 148 /* * Call forback to solve one weight vector with modified steering vector. */ forback ( num_rows, mod_str_vec, weight_vec, make_t, temp_mat); /* * Multiply the conjugate of the weight vector with the RNGSEG application * data range gates for each of the MBEAM main beams. Store the results in * the output detection cube. Use temp storage for the weight vector with * pointers to speed multiply. V real_ptr = xreal; imag_ptr = ximag; for (n = 0; n < max_pwr_beams + 1 ; n++) ( /* for (n...) •/ /* * Store the conjugate of the weight vector. */ *real_ptr++ = weight_vec[n].real; •imag_ptr++ = - weight_vec[nJ.imag; } /« for <n...) */ /* * Multiply the weight vector by all range gates of the application data set. * The application data set is from RNGSEG column to RNGSEG+RNGSEG columns * of "temp_mat’ matrix. */ for (m = rng_seg; m < rng_seg + rng_seg; m++) ( /* for (m...) */ sum.real = 0 .0 ; sum.imag = 0 .0 ; real_ptr = xreal; imag_ptr = ximag; for (row = 0; row < max_pwr_beams + 1 ; row++) ( /• for (row...) */ sum.real += (temp_mat[row]|m].real * *real_ptr - temp_mat[row][m].imag * *imag_ptr); sum.imag += (temp_mat[row][m].imag * *real_ptr++ + temp_mat [row] [m] .real * *imag_ptr+*•) ; } /* for (row..,) */ /* * Store the result into the detection data cube. */ col = j * rng_seg + m - rng_seg; detect_cube[k][dop][col].real = sum.real; detect_cube[kj [dop][col1.imag = sum.imag; ] /* for (m...) V } t* for (k...) * I } /* for (j . . .) V } /* for (dop...> */ return; ] 149 A. 
11 defs.h

/*
 * defs.h
 */

#define NN 64

#define SWAP(a,b) {float swap_temp=(a).real;\
                   (a).real=(b).real;(b).real=swap_temp;\
                   swap_temp=(a).imag;\
                   (a).imag=(b).imag;(b).imag=swap_temp;}

#ifdef IBM
#define log2(x) ((log(x))/(M_LN2))
#define CLK_TICK 100.0
#endif

#ifdef DBLE
#define float double
#endif

typedef struct {
   float real;
   float imag;
} COMPLEX;

#define AT_LEAST_ARG 2
#define AT_MOST_ARG 4
#define ITERATIONS_ARG 3
#define REPORTS_ARG 4

#define USERTIME(T1,T2)  ((t2.tms_utime-t1.tms_utime)/60.0)
#define SYSTIME(T1,T2)   ((t2.tms_stime-t1.tms_stime)/60.0)
#define USERTIME1(T1,T2) ((time_end.tms_utime-time_start.tms_utime)/60.0)
#define SYSTIME1(T1,T2)  ((time_end.tms_stime-time_start.tms_stime)/60.0)
#define USERTIME2(T1,T2) ((end_time.tms_utime-start_time.tms_utime)/60.0)
#define SYSTIME2(T1,T2)  ((end_time.tms_stime-start_time.tms_stime)/60.0)

/* #define LINE_MAX 256 */

#define TRUE 1
#define FALSE 0

/* The following default dimensions are MAXIMUM values.  The actual     */
/* dimension for PRIs will be the 1st entry in file "testcase.str".     */
/* The 2nd entry in the file is the Target Detection Threshold.         */
/* The rest of the file will contain steering vectors starting with     */
/* the main beams then all auxiliary beams.  The file "testcase.hdr"    */
/* will have descriptions of entries in the "testcase.str" file.  Also  */
/* a description of the data file "testcase.dat" will be given.  This   */
/* file contains data to fill the input "data_cube[][][]" array.        */

#define COLS 1500       /* Maximum number of columns in holding vector "vec"  */
                        /* in house.c for max columns in Householder multiply */
#define MBEAM 12        /* Number of main beams                               */
#define ABEAM 20        /* Number of auxiliary beams                          */
#define PWR_BM 9        /* Number of max power beams                          */
#define V1 64
#define V2 128
#define V3 1500
#define V4 64
#define DIM1 64         /* Number for dimension1 in input data cube           */
#define DIM2 128        /* Number for dimension2 in input data cube           */
#define DIM3 1500       /* Number for dimension3 in input data cube           */
#define DOP_GEN 2048    /* MAX Number of dopplers after FFT                   */
TRUE, output execution times */ Flag if set TRUE, output data report files */ number of times program has executed */ number of times to execute program */ void xu_target ( ini, in2 , out, len ) float ini[](4], in2[)[4], out[1[4]; int *len ; 1 int i,il,i2, n ; il=i2=0 ; for (i = 0; i<25; i + + ) { out[i][0] = 0.0; ] for (i=0; i<25; i++) ( if (ini [il] [3] != 0.0 (,& (in2[i2j [3)==0.0 II ini til] E 0 ] < in2[i2][0])> ( out (i] [0 ] = inlfil][0 ] out(i) [1 ] outti] [2 ] out[i][3] il + +; inltil][1] inltil][2] inltil][3] ) 151 else if (in2[12] [3 ] ! = 0.0 && (Ini [il] [3!= = 0 .0 II inl[il)[0] > in2[12][0])) ( out [i] [0 ] = ln2 [1 2][0 ] out[1] [1] = in2 [i 2 1[1 ] out[i][2 ] = in2 [1 2][2 ] out[i][3] = 1n2[i 2][3] i2 + + ; ) else break ; A.12 compile_apt mpcc -qarch=pwr2 -03 -D A PT -D IBM -o apt bench_mark_APT.c cell_avg_cfar. c fft.c fft_APT.c forback.c cmd_line.c house.c read_input_APT.c stepl_beams. step2_beams. c lm A.13 run.256 poe apt /scratch2/zxu/new_data -procs 256 Appendix B Parallel HO-PD Code The parallel code for the HOPD benchmark, along with the script to compile and run the parallel HO-PD program, are given in this appendix. The version of the code below is for 1 to 128 nodes. B.1 bench_mark_STAP.c bench_mark_STAP. c / Parallel HO-PD Benchmark Program for the IBM SP2 This parallel HO-PD benchmark program was written for the IBM SP2 by the STAP benchmark parallelization team at the University of Southern California (Prof. Kai Hwang, Dr. Zhiwei Xu, and Masahlro Arakawa), as part of the ARPA Mountaintop program. The sequential HO-PD benchmark program was originally written by Tony Adams on 10/22/93. This file contains the procedure main (), and represents the body of the HO-PD benchmark. The program's overall structure is as follows: 1. Load data in parallel. 2. Perform FFTs in parallel. 3. Redistribute the data using the total exchange operation. 4. Complete data redistribution using circular shift operations. 5. Form a set of space-time steering vectors. 6 . Perform beamforming in parallel. 7. Perform CFAR/cell averaging in parallel. 8 . Reduce local target lists into one complete target list. This program can be compiled by calling the shell script compile_hopd. Because of the nature of the program, before the program is compiled, the header file defs.h must be adjusted so that NN is set to the number of nodes on which the program will be run. We have provided a defs.h file for each power of 2 node size from I to 256, called defs.001 to defs.256. This program can be run by calling the shell script run.###, where ### is the number of nodes on which the program is being run. Unlike the original sequential version of this program, the parallel HOPD program does not support command-line arguments specifying the number of times the program body should be executed, nor whether or not timing information should be displayed. The program body will be executed once, and the timing information will be displayed at the end of execution. The input data file on disk is stored in a packed format: Each complex number is stored as a 32-bit binary integer; the 16 LSBs are the real 153 * half of the number, and the 16 MSBs are the imaginary half of the number. * These numbers are converted into floating point numbers usable by this * benchmark program as they are loaded from disk. * * The steering vectors file has the data stored in a different fashion. All * data in this file is stored in ASCII format. 
The first two numbers in this * file are the number of PRIs in the data set and the threshold level, * respectively. Then, the remaining data are the steering vectors, with * alternating real and imaginary numbers. */ •include •include •include •include •include •include •include •include "stdio.h* "math.h* ■de f s.h * <sys/types.h> <sys/times.h> <mpproto.h> <sys/time.h> /* * Global variables * output_time: a flag; 1 = output timing information, 0 = don't output timing * information * output_report: a flag; 1 = write data to disk, 0 = don't write data to disk * pricnt; the number of points in the FFT * threshold: minimum power return before a signal is considered a target * t_matrix: the T matrix */ int output_time = FALSE; int output_report = FALSE; int pricnt = PRI; float threshold; COMPLEX t_matrix[BEAM][EL]; /* * main () * inputs; argc, argv * outputs: none it * This procedure is the body of the program. */ main (argc, argv) int argc; char *argv(]; ( /* main */ /* * Original variables (variables which were in the sequential version of the * HO-PD program) * * mod_str_vecs: holds the modified steering vectors * target_report: the 25 entries for targets which have been detected * el: the maximum number of elements in each PRI * det_bms: the number of beams in the detection data * beams: the total number of beams * rng: the maximum number of range gates in each PRI * f_temp: buffer for binary floating point output data * i, j, k: loop counters * dopplers: the number of dopplers after the FFT * xreal: storage for real FFT data 154 * ximag: storage for imaginary FFT data * sintable: storage for SINE table * costable: storage for COSINE table * str_vecs: spatial steering vectors with EL COMPLEX elements * weight_vec: storage for the weight solution vectors * str_name: holder for the name of the steering vectors file * data_name: holder for the name of the input data file */ static COMPLEX mod_str_vecs(BEAM)[DOP)[DOF]; static float target_report[25] [4] ; int el = EL; int det_bms = BEAM; int beams = BEAM; int rng = RNG; float f_temp[l]; int i, j, k; int dopplers; float xreal[PRI]; float ximag[PRI]; float sintable[PRI]; float costable(PRI); COMPLEX str_vecs[BEAM][EL]; COMPLEX weight_vec[DOF]; char str_name[LINE_MAX]; char data_name[LINE_MAX]; FILE *fopen(); FILE *f_dop; extern void cmd_line(); extern void fft_STAP(); extern void form_str_vecs(); extern void form_beams(}; extern void cell_avg_cfar() ; /* * Variables we added or modified to parallelize the code * * rc: return code from MPL commands * numtask: the total number of tasks running this program * taskid: the identifier number for this task * nbuf[4]: a buffer to hold data from the MPL call mpc_task_query * allgrp: an identifier representing all the tasks running this program * n: loop counter * offset: a variable used to count the number of elements rewound so far * msglen: message length; used in MPL commands * out_row: no longer used * p, r, e: loop counters * blklen: block length; used in MPL commands * source: source task number * dest: destination task number * type: message type; used in MPL commands * nbytes: number of bytes in message * targets: number of targets * fft_out: local portion of the data cube * fft_vector: vector used during the total exchange operation * local_cube: local portion of the data cube after the total exchange and * the circular shifts * shift_out: outgoing message of the circular shift operation * shift_in: incoming message of the circular shift operation * detect cube: detection data cube */ int rc; 
int numtask; int taskid; int nbuf[4]; int allgrp; 155 int n; int offset; int msglen; int out_row; int p, r, e; int blklen; int source; int dest; int type; int nbytes; int targets; Static COMPLEX fft_out[PRI][RNG i NN][EL]; Static COMPLEX fft_vector[PRI * RNG * EL / NN]; static COMPLEX iocal_cube[(PRI / NN) + 2][RNG][EL]; static COMPLEX shift_out[RNG][EL]; static COMPLEX shift_in[RNG][EL]; static COMPLEX detect_cube[BEAM][DOP / NN][RNG]; Timing variables *_start and *_end: the start and end CPU times for... *_user and *_sys: the user and system time breakdowns for the time spent for.. . *_user_max and *_sys_jnax: the largest user and system times for... *_clock_start and *_clock_end: the start and end wall clock times for... *_clock_time: the net wall clock time spent for... all: the entire program disk: reading the data from disk fft: the FFTs index: the MPL command mpc_index rewind: rewinding the 1-D output of the mpc_index operation into a 3-D data cube shift: the two circular shift operations str; forming the steering vectors beam: the beamforming step cfar: the start and end time of the CFAR/cell averaging step report: the target reporting step / struct tms all_start, all_end; struct tms disk_start, disk_end; struct tms fft_start, fft_end; struct tms index_start, index^end; struct tms rewind_start, rewind_end; struct tms shift_start, shift_end; struct tms str_start, str_end; struct tms beam_start, beam_end; struct tms cfar_start, cfar_end; struct tms report_start, report_end; float all_user, all_sys; float disk_user, disk_sys; float fft_user, fft_sys; float index_user, index_sys; float rewind_user, rewind_sys; float shift_user, shift_sys; float str_user, str_sys; f1oa t beam_u se r, beam_sy s; float cfar_user, cfar_sys; float report_user, report_sys; float all_user_max, all_sys_max; float disk_u8er_Aiax, diek_eys_max; float fft_user_jnax, fft_sys_max; float index_user_jnax, index^sysjnax; 156 float rewind_user_max, rewind_sys_max,- float shift_user_max, shift_sys_max; float str_userjnax, str_sys_^iax; float beam_user_max, beam_sys^max; float cfar_userjnax, cfar_sysjnax; float report_user_max, report_sy syntax; struct timeval all_clock_start, all_clock_end; struct timeval disk_clock_start, disk_clock_end; struct timeval fft_clock_start, fft_clock_end; struct timeval index_clock_start, index_clock_end; struct timeval rewind_clock_start, rewind_clock_end; struct timeval shift_clock_start, shift_clock_end; struct timeval str_clock_start, str_clock_end; struct timeval beam_clock_start, beam_clock_end; struct timeval cfar_clock_start, cfar_clock_end; struct timeval report_clock_start, report_clock_end; float all_clock_time; float disk_clock_time; float fft_clock_time; float index_clock_time,- float rewind_clock_time; float shift_clock_time; float str_clock_time; float beam_clock_time; float cfar_clock_time; float report_clock_time; /* * Begin function body: main () * * Initialize for parallel processing: Here, each task or node determins its * task number (taskid) and the total number of tasks or ndoes running this * program (numtask) by using the MPL call mpc_environ. Then, each task * determins the identifier for the group which encompasses all tasks or * nodes running this program. This identifier (allgrp) is used in collective * communication or aggregated computation operations, such as mpc_index, */ rc = mpc_environ (tnumtask, ttaskid); if (rc == -1) ( printf ("Error - unable to call mpc_environ.\n*) ; exit (-1); ) if (numtask ! = NN) ( printf ("Error - task number mismatch... 
check defs.h.in"); exit (-1); ) rc = mpc_task_query (nbuf, 4, 3); if (rc == -1) ( printf ("Error - unable to call mpc_task_query.\n"); exit (-1); ) allgrp = nbuf[3]; if (taskid == 0) ( printf ("Running...in"); ) gettimeofday (tall_clock_start, (struct timeval*) 0); times (&all_start); 157 / * * Get arguments from the command line. In the sequential version of the * program, the following procedure was used to extract the number of times * the main computational body (after the FFT) was to be repeated, and flags * regarding the amount of reporting to be done during and after the program * was run. In this parallel program, there are no command line arguments to * be extracted except for the name of the file containing the data cube. */ cmd_line (argc, argv, str_name, data_name); I* * Read input files. In this section, each task loads its portion of the * data cube from the data file. */ /* * if (taskid == 0) { * printf (■ loading data...\n*); ) */ mpc_sync (allgrp); gettimeofday (4disk_clock_start, (struct timeval*) 0 ); times (tdisk_start); read_input_STAP (data_name, str_name, str_vecs, fft_out); times (tdisk_end); gettimeofday (4disk_clock_end, (struct timeval*) 0 ); /* * FFT: In this section, each task performs FFTs along the PRI dimension * on its portion of the data cube. The FFT implementation used in this * program is a hybrid between the original implementation found in the * sequential version of this program and a suggestion given to us by MHPCC. * This change in implementation was done to improve the performance of the * FFT on the SP2. */ / * * if (taskid == 0 ) I * printf (■ running parallel FFT...\n“); ) */ mpc_sync (allgrp); gettimeofday (ifft_clock_Btart, (struct timeval*) 0 ); times (ifft_start); fft_STAP (fft_out); times (fcfft_end); gettimeofday (ifft_clock_end, (struct timeval*) 0 ); /* * Perform index operation to redistribute the data cube. Before the index * operation, the data cube was sliced along the RNG dimension, so each * task got all the PRIs. After the index operation, the data cube is sliced * along the PRI dimension, so each task gets all the RNGs. * * Because the MPL command mpc_index doesn’t work to our specifications, * we need to rewind the 1-D matrix which is the output of the mpc_index * operation back into a 3-D data cube slice. * / /* * if (taskid == 0) ( 158 * printf (■ indexing data cube...\n"); } */ mpc_sync (allgrp); gettimeofday <&index_clock_start, (struct timeval*) 0 ); times (&index_start); rc = mpc_index (fft_out, fft_vector, PRI * RNG * EL * sizeof (COMPLEX) / (NN * NN), allgrp); times (&index_end); gettimeofday <tindex_clock_end, (struct timeval*) 0 ); if (rc == -1) ( printf {’Error - unable to call mpc_index.\n*); exit (-1) ; ) / * * if (taskid == 0) * ( * printf (" rewinding data cube...\n"}; ) */ mpc_sync (allgrp); gettimeofday (&rewind_clock_start, (struct timeval*) 0 ); times (trewind_start); offset = 0; for (n = 0; n < NN; n++) ( for (p = 0; p < PRI / NN; p+>) ( for (r = n * RNG / NN; r < (n t 1) * RNG / NN; r++) ( for (e = 0; e < EL; e++) < local_cube[p + 1][r][e].real = fft_vector[offset].real; local_cube[p + 1][rj [ej.imag = fft_vector[offsetj.imag; offset++; ) ) ) ) times (trewind_end); gettimeofday (trewind_clock_end, (struct timeval*) 0 ); / * * Beamforming requires not only the dopplers made available after indexing, * but also the first doppler on both sides. Use two circular shift operations * to get them. 
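 *
 * Concretely, after the index operation and the two circular shifts each
 * task holds PRI/NN + 2 doppler rows in local_cube: rows 1 .. PRI/NN are
 * its own dopplers, where local row p corresponds to global doppler
 *
 *     taskid * (PRI / NN) + p - 1,        1 <= p <= PRI / NN,
 *
 * the same arithmetic used later when compute_weights () and the target
 * report code convert back to global doppler numbers.  Row 0 receives the
 * doppler immediately preceding the task's first one, and row PRI/NN + 1
 * receives the doppler immediately following its last one, with the first
 * and last tasks wrapping around circularly.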
*/ / * * if (taskid == 0) ( * printf (’ performing circular shift...Vn*); * > */ mpc_sync (allgrp); gettimeofday (tshift_clock_start, (struct timeval*) Ou tlines (tshift_start); msglen = RNG * EL * sizeof (COMPLEX); for (r = 0; r < RNG; r++) ( for (e = 0; e < EL; e++) { shift_outlr][e].real = local_cubetPRI / NN)[r]fe].real; 159 s h i f t o u t ( r j [ e ] . imag = l o c a l _ c u b e [ P R I / N N ] [ r ] [ e ] . imag; } > rc = mpc_shift (shift_out, shift_in, msglen, 1, 0 , allgrp); if (rc == -1 ) { printf ('Error - unable to call mpc_shift.\n"); exit (-1 ); } for (r = 0; r < RNG; r++) { for (e = 0; e < EL; e++) { local_cube[0][r][e].real = shift_in[r][e].real; local_cube[0][r][ej.imag = shift_in[r][e).imag; ) } for (r = 0; r < RNG; r+O ( for (e = 0; e < EL; e++) ( shift_out[r][e].real = local_cube[1][r][e],real; shift_out(rj[ej .imag = local_cube[1][r][ej.imag; ) ) rc = mpc_shift (shift_out, shift_in, msglen, -1, 0 , allgrp); if (rc — “1) ( printf ("Error - unable to call mpc_shift.\n"}; exit (-1); > for (r = 0; r < RNG; r++) ( for (e = 0; e < EL; e+H ( local_cube[(PRI / NN) + 1][rJ[e].real = shift_in[r][e].real; local_cube[(PRI / NN) + 1j [rj[ej.imag = shift_in[r][e].imag; ) ) times (&shift_end>; gettimeofday (&shift_clock_end, (struct timeval*) 0 ); / * * Form a set of space_time steering vectors composed of the concatenation of * doppler_transformed spatial steering vectors for doppler of interest and * preceding and succeeding dopplers. Modified steering vectors will be put in * the mod_str_vecs matrix. */ /* * if (taskid == 0) < * printf (" forming steering vectors..An"); ) */ mpc_sync (a 1lgrp >; gettimeofday (tstr_clock_start, (struct timeval*) 0); times (tstr_start); form_str_vecs (str_vecs, mod_str_vecs); times (fcstr_end); gettimeofday (fcstr_clock_end, (struct timeval*) 0 ); 160 / * * Perform adaptive beamforming. Each task performs this step on its slice * of the data cube in parallel. */ / * * if (taskid == 0 ) ( * printf (* beamforming...\n") ; > * / dopplers = pricnt / NN; mpc_sync (allgrp); gettimeofday (ibeam_clock_start, (struct timeval*) 0 ); times (ibeam_start); form_beams (dopplers, local_cube, detect_cube, mod_str_vecs); times (ibeam_end); gettimeofday (&beam_clock_end, (struct timeval*) 0 ); / * * Perform cell_avg_cfar and target detection. Each task performs this step * on its slice of the data cube in parallel. * / /* * if (taskid == 0 ) { * printf (■ performing cell_avg_cfar...\n*) ; ) * / mpc_sync (allgrp); gettimeofday <&cfar_clock_start, (struct timeval*) 0 ); times (&cfar_start); cell_avg_cfar (threshold, dopplers, det_bms, rng, target_report, detect_cube); times (tcfar_end); gettimeofday (tcfar_clock_end, (struct timeval*) 0 ); / * * Gather target reports: In this step, the tasks collect the closest 25 * targets. This target sorting is performed in the following manner. The tasks * pair off. The two tasks in this pair combine their target lists (i.e. the * targets found in their own slice of the data cube). Then, one of these two * tasks sorts the list, and takes the closest targets (up to 25). Then, the * tasks with these new, combined target lists pair off, and this process is * repeated again. At the end, one task will have the final target list, which * contains the closest targets in the entire data cube (up to 25). * * The target list combining is done by matched blocking-sends and blocking- * receives. 
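 *
 * For example, with NN = 8 the reduction takes log2(NN) = 3 rounds:
 *
 *     round i = 1:  tasks 1, 3, 5, 7 send their 25-entry lists to 0, 2, 4, 6
 *     round i = 2:  tasks 2 and 6 send their merged lists to 0 and 4
 *     round i = 4:  task 4 sends its merged list to 0
 *
 * In round i a task sends when taskid % (2 * i) == i and receives when
 * taskid % (2 * i) == 0; each message is the 25-entry, 4-value target list
 * (blklen = 25 * 4 * sizeof (float)), and the receiver appends it after its
 * own 25 entries, sorts the combined 50, and keeps the closest 25.  After
 * the final round task 0 holds the closest targets of the entire data cube.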
* / /* * if (taskid == 0) ( * printf (" gathering target reports...\n*); ) * / mpc_sync (allgrp); gettimeofday <&report_clock_start, (struct timeval*) 0 ); times (treport_start); for (i = 1; i < NN; i = 2 * i) { /* for (i. . . ) */ blklen = 25 * 4 * sizeof (float); source = taskid + i; 161 dest = taskid - i; type = i; if ({taskid » (2 * i)) == 0 tt (NN != 1)) { rc = mpc_brecv (target_report + 25, blklen, isource, itype, inbytes); if (rc == -1 ) { printf ("Error - unable to call mpc_brecv.\n’); exit(-1 ) ; } ) if ((taskid % (2 * i)) == i (NN != 1)) ( rc = mpc_bsend (target_report, blklen, dest, type); If (rc == -1 ) ( exit(-1 ); ) 1 /* * Sort combined target list. */ for (targets = 0; targets < 50; targets++) { /* for (targets...) */ for (r = targets + 1 ; r < 50 ; r ++} ( /* for (r...) */ if (((target_report[targets][0] > target_report[r]10]) && (target_report[r][0 ] > 0 .0 )) II ((target_report[targets][0 ) < target_report[r](0 |) && (target_report[targets][0 ] == 0 .0 ))) ( /* if (...) V float tmp; tmp = target_report[r][0 ]; target_report[r][0 ] = target_report[targets][0 ]; target_report[targets][0 ] = tmp; tmp - target_report[r][1 ]; target_report[r][1] = target_report(targetsJ[1]; target_report[targets][1] = tmp; tmp = target_report[r][2 ]; target_report[r][ 2] = target_report[targets][2 ]; target_report[targets][2 ] = tmp; tmp = target_report[r][3]; target_report[r][3] = target_report[targets][3]; target report[targets][3] = tmp; ) /* if (...) */ ] /* for (r...) */ ) /* for (targets...) */ ) /* for (i. . .) */ times (Lreport_end); times <&all_end); gettimeofday (ireport_clock_end, (struct timeval*) 0 ); gettimeofday (iall_clock_end, (struct timeval*) 0 ) ; if (taskid == 0 ) [ printf (*... done.\n‘); ] /* * Now, the program is done with the computation. All that remains to be done * is to report the targets that were found, and to report the amount of time * each step took. The target list should be identical from run to run, since * the program started with the same input data cube. This target list and * execution time data reporting is performed by task 0 . 1 6 2 / if (taskid == 0 ) ( printf CENTRV RANGE BEAM for (i = 0; i < 25; i + + ) printf (■ %.02d %.03d (int)target_report[i][0 ] (int)target_report[i][ 2 ] DOPPLER POWER\n' %.02d %.03d %f\n*, (int)target_report[i] [1] , target_report[i][3] ); /* * Collect timing information. Calculate elapsed user and system CPU time for * each step. 
V all_user = (all_end.tms_utime - all_start,tms_utime) / CLK_TICK; all_sys = (all_end.tms_stime - all_start.tms_stime) / CLK_TICK; disk_user = (disk_end.tms_utime disk_start.tms_utime) / CLK_TICK; disk_sys = (disk_end.tms_stime - disk_start.tms_stime) I CLK_TICK; fft_user = (fft_end.tms_utime - fft_start.tms_utime) / CLK_TICK; fft_sys = (fft_end.tms_stime - fft_start.tms_stime) / CLK_TICK; index_user = (index_end.tms_utime - index_start.tms_utime) / CLK_TICK; index_sys = (index_end.tms_stime - index_start.tms_stime) / CLK_TICK; rewind_user = (rewind_end.tms_utime - rewind_start.tms_utime) / CLK_TICK; rewind_sys = (rewind_end.tms_stime - rewind_start.tms_stime) / CLK_TICK; shift_user = (shift_end.tms_utime - shift_start.tms_utime) / CLK_TICK; shift_sys = (shift_end.tms_stime - shift_start.tms_stime) / CLK_TICK; str_user = (str_end.tms_utime - str_start.tms_utime> / CLK_TICK; str_sys = (str_end.tms_stime - str_start.tms_stime) I CLK_TICK; beam_user = (beam_end.tms_utime - beam_start.tms_utime) / CLK_TICK; beam_sys = (beam_end.tms_stime - beam_start.tms_stime) / CLK_TICK; cfar_user = {cfar_end,tms_utime - cfar_start.tms_utime) I CLK_TICK; cfar_sys = (cfar_end.tms_stime - cfar_start.tms_stime) / CLK_TICK; report_user = (report_end.tms_utime - report^start.tms_utime) / CLK_TICK; report_sys = (report_end.tms_stime - report_start,tms_stime) t CLK_TICK; t* * Calculate elapsed wall clock time for each step. */ all_clock_time = (float) (all_clock_end.tv_sec - all_clock_start.tv_sec) + (float) ((all_clock_end.tv_usec - all_clock_start.tv_usec) ( 1 0 0 0 0 0 0.0 ); disk_clock_time = (float) (disk_clock_end.tv_sec - disk_clock_start.tv_sec) + (float) ((disk_clock_end.tv_usec - disk_clock_start.tv_usec) I 1 0 0 0 0 0 0.0 ); fft_clock_time = (float) (fft_clock_end.tv_sec - fft_clock_start.tv_sec) + (float) ((fft_clock_end.tv_usec - fft_clock_start.tv_usec> / 1 0 0 0 0 0 0.0 ); index_clock_time = (float) (index_clock_end.tv_sec - index_clock_start.tv_sec) + (float) ((index_clock_end.tv_usec - index_clock_start.tv_usec) / 1 0 0 0 0 0 0.0); rewind_clock_time = (float) (rewind_clock_end.tv_sec - rewind_clock_start.tv_sec) + lfloat) ((rewind_clock_end.tv_usec - rewind_clock_start.tv_usec) / 1 0 0 0 0 0 0.0 ); shift_clock_time = (float) (shift_clock_end.tv_sec - shift_clock_start.tv_sec) 163 + (float) ((shift_clock_end,tv_usec - shift_clock_start.tv_usec) / 100000 0 .0); str_clock_time = (float) (str_clock_end.tv_sec - str_clock_start.tv_sec) + (float) ((str_clock_end.tv_usec - str_clock_start-tv_usec) / 1 0 0 0 0 0 0.0 ); beam_clock_time = (float) (beam_clock_end.tv_sec - beam_clock_start.tv_sec) + (float) ((beam_clock_end.tv_usec - beam_clock_start.tv_usec) / 1 0 0 0 0 0 0.0 ); cfar_clock_time - (float) (cfar_clock_end.tv_sec - cfar_clock_start.tv_sec) + (float) ((cfar_clock_end.tv_usec - cfar_clock_start.tv_usec) / 1 0 0 0 0 0 0.0 ); report_clock_time = (float) (report_clock_end.tv_sec - reported ock_start.tv_sec) * ■ (float) ( ( report_clock_end . tv_usec - report_clock_start.tv_usec) / 1 0 0 0 0 0 0.0 ); /* * Find the largest amount of user and system CPU time spent * mpc_reduce call. 
 */
    rc = mpc_reduce (&all_user, &all_user_max, sizeof (float), 0, s_vmax, allgrp);
    if (rc == -1)
    {
        printf ("Error - unable to call mpc_reduce.\n");
        exit (-1);
    }
    rc = mpc_reduce (&all_sys, &all_sys_max, sizeof (float), 0, s_vmax, allgrp);
    if (rc == -1)
    {
        printf ("Error - unable to call mpc_reduce.\n");
        exit (-1);
    }
    rc = mpc_reduce (&disk_user, &disk_user_max, sizeof (float), 0, s_vmax, allgrp);
    if (rc == -1)
    {
        printf ("Error - unable to call mpc_reduce.\n");
        exit (-1);
    }
    rc = mpc_reduce (&disk_sys, &disk_sys_max, sizeof (float), 0, s_vmax, allgrp);
    if (rc == -1)
    {
        printf ("Error - unable to call mpc_reduce.\n");
        exit (-1);
    }
    rc = mpc_reduce (&fft_user, &fft_user_max, sizeof (float), 0, s_vmax, allgrp);
    if (rc == -1)
    {
        printf ("Error - unable to call mpc_reduce.\n");
        exit (-1);
    }
    rc = mpc_reduce (&fft_sys, &fft_sys_max, sizeof (float), 0, s_vmax, allgrp);
    if (rc == -1)
; printf {' index_user_max = %.2f s, index_userjnax, index_sys_max); printf (' rewind_user_jnax - %.2f s, rewind_user_max, rewind_sys_max printf (’ shift_user_max = %.2f s, shift_user_max, shift_sys_jnax) ; printf (' str_user_max = %.2f s, str_user_fnax, str_sys_max) ; printf (' beam_user_max = %.2f s, beant_user_jnax, beam_sys_max) ; printf (' cfar_user_jnax = %.2f s, cfar_user_^ax, cfar_sysjnax); printf (" report_user_jnax = %.2f s, all_sys_max - % ,2 f s\n', disk_sys_jnax = % ,2 f s\n', f ft_sys_max % ,2 f s\n' , index_sys_max - 4 ,2 f s\n', rewind_sys_max \ , = % ,2 f s\n', i t shi ft_sys_max = % ,2 f s\n', str_sys_max 4 ,2f s\n', beam_sys_jnax = %. 2 f s\n", cfar_sys_max = t. 2 f s\n', report_sys_max = t. 2 f s\n', 166 report_user_max, report_sys_max) ; printf printf printf printf printf printf printf printf printf printf printf } return; } /* main */ \nWall clock timing all_clock_time = %f\n“, disk_clock_time = %f\n", fft_clock_time = %f\n*, index_clock_time = %f\n", rewind_clock_time = %f\n", shift_clock_time = %f\n‘, str_clock_time = %f\n", beam_clock_time = %f\n", cfar_clock_time - %f\n*, report_clock_time = %f\n’, all_clock_time); disk_clock_time); fft_clock_time); indexwclock_time) ; rewind_clock_time); shift_clock_time); str_clock_time); beam_clock_time); cfar_clock_time); report_clock_time); B2 cell_avg_cfar.c /* * cell_avg_cfar.c V /* * This file contains the procedure cell_avg_cfar (), and is part of the * parallel HO-PD benchmark program written for the IBM SP2 by the STAP * benchmark parallelization team at the University of Southern California * (Prof. Kai Hwang, Dr. Zhiwei Xu, and Masahiro Arakawa), as part of the * ARPA Mountaintop program. * * The sequential HO-PD benchmark program was originally written by Tony Adams * on 10/22/93 . * * This procedure was largely untouched during our parallelization effort, * and therefore almost identical to the sequential version. Modifications * were made to perform the cell averaging and CFAR for only a slice of the * data cube, instead of the entire data cube (as was the case in the * sequential version of the program). As such, most, if not all, of the * comments in this procedure are taken directly from the sequential version * of this program. */ #include "stdio.h* #include *math.h“ ♦include *defs.h" ♦include <sys/types.h> ♦include <sys/time.h> ce1l_avg_c far () inputs: threshold, dopplers, beams, rng, detect_cube outputs: target_report threshold: when the power at a certain cell exceeds this threshold, we consider it as having a target dopplers: the number of dopplers in the input data beams: the number of beams (rows) in the input data rng: the number of range gates (columns) in the input matrix target_report: a 25-element target report matrix (each element consisting of four floating point numbers: RANG, BEAM, DOPPLER, and POWER), to output the 25 closest targets found in this slice of detect_cube 167 * detect_cube: input beam/doppler/rnggate matrix V cell_avg_cfar (threshold, dopplers, beams, rng, target_report, detect_cube) float threshold; int dopplers; int beams; int rng; float target_report[] [4); COMPLEX detect_cube[][DOP/NN](RNG]; /* * Variables: Most, if not all, of the variable comments were taken directly * from the sequential version of this program. 
* * sum: a holder for the cell sum * num_seg: number of range segments * rng_seg: number of range gates per segment * targets: holder for the number of targets to include in the target report ■ * num_targets: the number of targets to include in the target report * N: holder for the number of range gates per range segment * guard_range: number of cells away from cell of interest * rseg, r, i: loop variables * beam, dop: loop variables * rng_start: address of the first range gate in each range segment * rng_end: address of the last range gate in each range segment * ncells: holder for cells: guard_range per range segment * taskid: the identifier for this task (added for parallel execution) */ float sum; int num_seg - NUMSEG; int rng_seg - RNGSEG; int targets; int num_targets = 25; int N; int guard_range = 3; int rseg, r, i; int beam, dop; int rng_start; int rng_end; int ncells; extern int taskid; /* * Begin function body: cell_avg_cfar () * * Process ceil averaging cfar by range segments: do cell averaging in ranges. */ N = rng_seg; /* Number of range cells per range segment */ targets =0; /* Start with 0 targets in target report number */ for (i = 0 ; i < num_targets; i++) ( target_report[i](0 ] = 0 .0 ; target_report[ijflj = 0 .0 ; target_report[ij[2 ] = 0 .0 ; target report(ij [3) = 0.0; ) /* * Do the entire cell average CFAR algorithm starting from the first range * segment and continuing to the last range segment. This ensures getting * 25 least-range targets first, so the process can be stopped without * looking further into longer ranges. 168 V for (rseg = 0; rseg < num_seg; rseg++) ( I * for (rseg...) */ /* * Get start and end range gates for each of the range segments: starts at * low ranges and increments to higher ranges. V rng_start = rseg * N; rng_end = rng_start + N - 1; /* * 1st get the range cell power and store it in detect_cube[][][].real. For * each beam and each doppler, get the power for each range cell in the * current range segment. */ for (beam = 0; beam < beams: beam++) { /* for (beam...) */ for (dop = 0 ; dop < dopplers; dop++) ( /* for (dop...) */ for (r = rng_start ; r <= rng_end; r++) { /* for (r...) */ sum = detect_cube[beam](dop][r].real * detect_cube[beam][dop][r].real + detect_cube[beam][dop][r].imag * detect_cube[beam]Idop][r].imag; detect_cube[beam][dop][r].real = sum; } /* for (r...) * t } /* for (dop...) * i } /* for (beam...) */ /* * Now, get the range cell's average power, using cell averaging CFAR described * above, and store it in detect_cube[][][j.imag. For each beam and each * doppler, get the average power for each range cell in the current range * segment. */ for (beam = 0 ; beam < beams; beam++) { /* for (beam...) */ for (dop - 0 ; dop < dopplers; dop++) { /* for (dop...) */ sum = 0 .0; ncells = N - guard_range - 1; /* * Do a summation loop for the first range cell at the start of a range * segment. */ for (r = rng_start + guard_range + 1; r <= rng_end; r++) < sum += detect_cube[beam][dop][r].real; } /* * The average power is the sum divided by the number of cells in the * summation. */ detect_cube[beam][dop][rng_startj.imag = sum / ncells; /* * Now, perform loops until the guard band is fully involved in the data. */ 169 for (r = rng_start + 1; r <= rng_start * ■ guard_range; r+ + ) 1 /* for (r...) */ sum = sum - detect_cube[beam][dop][r + guard_range].real, - — ncells; detect_cube[beam][dop][r].imag = sum / ncells; } /* for (r...] */ /* * Now, perform loops until the guard band reaches the range segment border. 
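 *
 * In this middle region the average for each cell excludes the cell of
 * interest and the guard_range cells on either side of it.  With the HO-PD
 * settings used here (N = RNGSEG = 1250 / 2 = 625 range cells per segment,
 * guard_range = 3), each of these averages covers 618 of the segment's 625
 * cells, compared with 621 cells (ncells = N - guard_range - 1) for the
 * very first cell of the segment, whose guard band extends only forwards.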
* t for (r = rng_start + guard_range + 1 ; r <= rng_end - guard_range; r+ + ) { sum = sum + detect_cube[beam][dop][r-guard_range-1].real - detect_cube[beam][dop|[r + guard_range].real; detect_cube[beam][dop][r].imag = sum / ncells; 1 i* * Now, perform loops to the end of the range segment. * / for !r = rng_end - guard_range + 1; r <= rng_end; r*+) ( sum += detect_cube[beam][dop![r - guard_range - 1].real; ++ncells; detect_cube[beam][dop][r].imag = sum t ncells; ) } I* for (dop...) V } /* for (beam...) * I /* * Compare the range cell power to the range cell average power. Start at * the minimum range and increase the range until 25 targets are found. Put * the target's RANGE, BEAM, DOPPLER, and POWER level in a matrix called * "target_report■. Store all values as floating point numbers. Some can get * converted to integers on printout later as required. */ for (r = rng_start ; r <= rng_end; r++) { /* for (r...) V for (beam = 0 ; beam < beams; beam++) ( /* for (beam...] */ for (dop = 0 ; dop < dopplers,- dop + +) ( /* for (dop...) */ if (((detect_cube[beam][dop][r].real - detect_cube[beam][dop][r].imag) > threshold) && (targets <= num_targets)) ( /* if (((...)) ) V target_report[targets][0 ] = (fioat) r; target_report[targets] [ 1 1 = (float) beam; target_report[targets][2 ] = (float) dop + taskid * PRI / NN; target_report[targets] [3] = detect_cube[beam][dop][r].real; targets +- l; if (targets >= num_targets) ( goto quit_report; > > /* if (((... I )) */ ) /* for (dop...) */ } /* for (beam...) */ ) /* for (r...) */ ) /* for (rseg..) */ 170 quit_report: ; return; } B J cmd_line.c /* * cmd_line.c */ /* * This file contains the procedure cmd_line (), and is part of the parallel * HO-PD benchmark program written for the IBM SP2 by the STAP benchmark * parallelization team at the University of Southern California (Prof. Kai * Hwang, Dr. Zhiwei Xu, and Masahiro Arakawa), as part of the ARPA * Mountaintop program. * * The sequential HO-PD benchmark program was originally written by Tony Adams * on 10/22/93. * * The procedure cmd_line () extracts the name of the files from which the * input data cube and the steering vector data should be loaded. The * function of this parallel version of cmd_line () is different from that * of the sequential version, because the sequential version also extracted * the number of iterations the program should run and some reporting * options. */ ((include ■stdio.h" ((include ■math.h" ((include "defs.h' #include <sys/types.h> ((include <sys/time.h> * cmd_line U * inputs: argc, argv * outputs: str_name, data_name * * argc, argv: these are used to get data from the command line arguments * str_name: a holder for the name of the input steering vectors file * data_name: a holder for the name of the input data file */ cmd_line (argc, argv, str_name, data_name) int argc; char *argv[]; char str_nameILINE_MAX]; char data_name[LINE_MAX]; /* * Begin function body: cmd_line () V strcpy (str_name, argv[l]); strcat (str_name, ".str*); strcpy (data_name, argv[l]); strcat (data_name, *.dat‘); return; 171 B.4 com pute_beams.c /* * compute_beams.c V /* * This file contains the procedure compute_beams (), and is part of the * parallel HO-PD benchmark program written for the IBM SP2 by the STAP * benchmark parallelization team at the University of Southern California * {Prof, Kai Hwang, Dr. Zhiwei Xu, and Masahiro Arakawa), as part of the ARPA * Mountaintop program. * * The sequential HO-PD benchmark program was originally written by Tony Adams * on 10/22/93. 
* * The procedure compute_beams does beamforming per doppler. Its input is a * set of BEAM * NUMSEG number of weights for one doppler. It returns the ’ results in a new output cube consisting of BEAM beams by DOP dopplers by * RNG range gates. Nominally, BEAM, NUMSEG, DOP, and RNG =2, 2, 128, and * 1280, respectively. * * This procedure was largely untouched during our parallelization effort, * and is therefore almost identical to the sequential version. Modifications * were made to perform this step on only a slice of the data cube, instead of * the entire data cube (as was the case in the sequential version of the * program). As such, most, if not all, of the comments in this procedure are * taken directly from the sequential version of this program. V (•include "stdio.h" #include "math.h" #include "defs.h" #include <sys/types.h> #include <sys/times.h> compute_beams {) inputs: dop, local_cube, weights, temp_mat outputs: detect_cube dop: the doppler number for which the beams are computed local_cube: input data cube and dopplers after the FFTs weights: matrices holding the input set of weights detect_cube: detection output data cube temp_mat: temporary matrix holding three adjacent dopplers '/ compute_beams {dop, local_cube, weights, detect_cube, temp_mat) int dop; COMPLEX local_cube(][RNG]IEL]; COMPLEX weights[](NUMSEG](DOF); COMPLEX detect_cube(](DOP / NNj[RNG]; COMPLEX temp_mat[][COLS]; ( /* * Variables * * rng_seg: number of range gates per segment * rngsamp: sample range gates, used in each segment * num_seg: number of range segments for the Cholesky * dofs: number of DOFs (rows) in the input data 172 * rng: number of range gates in the input data * el: number of elements in the steering vectors * i, j, k, m, n: loop counters * dopplers: holder for the number of dopplers in data_cube * row: row number * col: column number * appl: column number for application data destination * apply: column number for application data source * beam: beam number to be formed in output data * seg: segment number * start_seg: column number for start range gate for a segment * doplrl, doplr2, doplr3: loop countgers for doppler number * xreal, ximag: storage for real and imaginary data for pointer use * real_ptr: pointer to next real data * imag_ptr: pointer to next imaginary data * sum: COMPLEX variable to hold the sum of complex data * pricnt: number of PRIs in the input data cube V int rng_seg = RNGSEG int rngsamp = RNG_S; int num_seg = NUMSEG int dofs = DOF; int rng = RNG; int el = EL; int i, j, k, m, n; int dopplers; int row; int col; int appl; int apply; int beam; int seg; int start_seg; int doplrl; int doplr2 ; int doplr3; float xreal [DOF],* float ximag[DOF]; float *real_ptr; float *imag_ptr; COMPLEX sum; extern int pricnt; /* * Begin function body: compute_beams (> * * Perform beam forming for all dopplers. Determine the doppler numbers for * the doppler of interest, as well as the preceding and following dopplers. V doplrl = dop - 1 ; doplr2 = dop; doplr3 = dop + 1; /* * For each doppler, apply BEAM * NUMSEG set of weights to the data. */ for (beam = 0; beam < BEAM; beam++) { /* for (beam.., ) */ for (seg = 0 ; seg < num_seg; seg++) ( /* for (seg...) */ /* * Apply the conjugate of the weight vector to all RNGSEG range gates for each * segment. Store the result in the output detection cube. If seg = 0 apply the * weights to the following segment, else for all other seg numbers apply the 173 * weights to preceding segment. 
*/ if (seg == 0) ( start seg = rng seg; ) else < start_seg - (seg - 1 ) * rng_seg; } i* * Temporary storage for the weight vector with pointers to speed multiply. •/ real_ptr = xreal; imag_ptr = ximag; /■* * Store the conjugate of the weight vector. */ for (n = 0 ; n < dofs; n++) ( *real_ptr++ = weights[beam][seg][n].real; *imag_ptr++ = weights[beam][seg][n].imag; [ /* * Put application data elements in temp_mat. 1st EL rows. * i for (k = 0 ; k < el; k + +) ( /* for (k...[ */ row = k; for (col = 0; col < rng_seg; col++) { /* for (col...) */ appl = col + start_seg; I* Application region */ t* * Get all RNGSEG range gates from one of NUMSEG segments. */ temp_jnat [row] [col [ . real = local_cube [doplrl ][ appl ] [k ]. real; temp_mat[row][colj .imag = local_cube[doplrlj[appl][kj.imag; } /* for (col...) */ } /* for (k...) */ t* * 2nd EL rows. */ for (k = 0 ; k < el; k++) ( /* for (k...) */ row = el + k; for (col = 0 ; col < rng_seg; col++l ( /* for (col...) */ appl = col + start_seg; /* Application region */ /* * Get all RNGSEG range gates from one of NUMSEG segments. */ temp_mat[row][col].real = local_cube[doplr2 ][appl][k].real; tempjnat[row][col].imag = local_cube[doplr2 j[appl][kj.imag; 174 ) /* for (col...) */ ) /* for (k...) */ /* * 3rd EL rows. */ for (k = 0; k < el; k++) ( /* for (k...) V row = 2 * el + k; for (col = 0 ; col < rng_seg; col++) { /* for (col...) */ appl = col + start_seg; /* Application region */ /* * Get all RNGSEG range gates from one of NUMSEG segments. */ temp_mat[row][col].rea1 = local_cube[doplr3][appl][k].real; temp_mat[row][col].imag = local_cube[doplr3][applI[k].imag; J /’ for (col...} */ } for (k...) */ /* * Multiply the weight vector by all range gates of the application data. * Application data set at 1st RNGSEG cols of "temp_mat" matrix. V for (m = 0 ; m < rng_seg ; m++) { /* for (m...) */ sum.real = 0 .0; sum.imag = 0 .0; real_ptr = xreal; imag_ptr = ximag; for (row = 0 ; row < dofs; row++) I I * for (row...) */ sum.real += (temp_mat[row][m].real * *real_ptr - temp_mat[row][m].imag * *imag_ptr); sum.imag *- (temp_mat[row][m].imag * *real_ptr++ + tempjnat[row][m].real * *imag_ptr++); ) /* for (row...) */ /* * Store the result into the detection data cube by col = range gate. Store the * result in the same range gates used for application region. */ col = m + start_seg; detect_cube[beam][dop - 1][col].real = sum.real; detect_cube[beam][dop - 1][col].imag = sum.imag; } /* for (m...) */ } /* for (seg...) */ ) /* for (beam...) V return; ) 175 B J com pute_weights.c compute_weights,c / This file contains the procedure compute_weights (), and is part of the parallel HO-PD benchmark program written for the IBM SP2 by the STAP benchmark parallelization team at the University of Southern California (Prof. Kai Hwang, Dr. Zhiwei Xu, and Masahiro Arakawa), as part of the ARPA Mount a intop program. The sequential HO-PD benchmark program was originally written by Tony Adams on 10/22/93. The compute_,weights (} procedure does the weights computation prior to beamforming. Its input is a dop and dopplers formed in the fft () routine. Also, the modified space-time steering vectors are input to this routine. It returns a set of weights to be used in the beam computation routine. For each input doppler number: For each of the NUMSEG segments (of RNGSEG range gates each): a. 
For each doppler of interest, take two adjacent dopplers (doppler of interest and preceding and succeeding dopplers), making 3*EL = DOFS number of rows and RNG_S range gates (every other range gate, 0, 2, 4, 6, etc. until RNG_S number of range gates). Nominally, RNG_S = 2 8 8. b. Do a Householder on all DOFs (rows), lower triangularizing all rows. Since we have NUMSEG segments of range gates, we do a Cholesky and a forback solve on a segment for later application of weights to one of the other segments adjacent to it. Starting range gate = 0 for Cholesky for the first segment. Starting range gate = NUMSEG * RNGSEG for Cholesky for the second or higher segment and apply weights to range gates starting at (NUMSEG-1) * RNGSEG segment (previous segment). c. Forward and back solve each DOF x DOF lower triangular matrix using modified space-time steering vectors DOF x 1. There will be RNGSEG separate sets of weights, per beam per doppler. Space-time steering vectors have already been formed composed of the concatenation of doppler transformed spatial steering vectors for the doppler of interest and preceding and succeeding dopplers. d. Pass weights from (c.) above to caller to apply weights. Weights are calculated for each beam and each range segment and are passed in a matrix to be applied to data for each doppler. This procedure was largely untouched during our parallelization effort, and is therefore almost identical to the sequential version. Modifications were made to perform this step on only a slice of the data cube, instead of the entire data cube (as was the case in the sequential version of the program). As such, most, if not all, of the comments in this procedure are taken directly from the sequential version of this program. ! ((include “stdio.h” ((include "math.h" ((include “defs.h" •include <sys/types.h> •include <sys/times.h> /* * compute_weights () 176 * inputs: dop, local_cube, mod_str_vecs, temp_mat * outputs: weights * * dop: number of the doppler for weight computation * local_cube: input data cube and dopplers after FFTs * weights: weights matrix per doppler * mod_str_vecs: matrices holding the modified steering vectors * temp_mat: temporary matrix holding three adjacent dopplers '/ compute_weights (dop, local_cube, weights, mod_str_vecs, temp_mat) int dop; COMPLEX local_cube[][RNG][EL]; COMPLEX weights!I[NUMSEG][DOF]; COMPLEX mod_str_vecs[][DOP][DOF]; COMPLEX temp_mat[][COLS]; /* * Variables * * rng_seg: number of range gates per segment * rngsamp: sample range gates used in each segment * num_seg: number of range segments for the Cholesky * dofs: numebr of DOFS (rows) in the input data * rng: number of range gates in the input data * el: number of elements in the steering vectors * num_rows: number of rows on which to apply the Cholesky * start_row: the row number on which to start applying the Cholesky * lower_triangular_rows: number of rows to lower triangularize * i, j, k, m, n: loop counters * dopplers: holder for the number of dopplers * row: row number * col: column number * chol: column number for Cholesky data * beam: beam number to be formed in the output data * seg: segment number * doplrl, doplr2, doplr3: loop counters for the doppler number * xreal, ximag: storage for real and imaginary data for pointer use * weight_vec: storage for weight solution vector * mod_str_vec: storage for one modified steering vector * make_t: flag: 0 = T matrix is not made and weight vector is returned * caller * pricnt: number of PRIs in the input local_cube * taskid: identifier 
number for this task / int rng_seg = RNGSEG; int rngsamp = rng_S; int num_seg = NUMSEG; int dofs = DOF; int rng = RNG ; int el = EL; int num_rows; int start_row; int lower_triangular_rows; int i, j, k, m, n; int dopplers; int row; int col; int chol; int beam; int seg; int doplrl; int doplr2 ; int doplr3; float xreal[DOF]; float ximag[DOF]; COMPLEX weight_vec[DOF]; COMPLEX mod_str_vec[1][DOF]; int make_t; extern pricnt; extern void house(); extern void forback(); extern taskid; int dopxu; * Begin function body: compute_weights () * * Set make_t = 0 so we don’t form the T matrix in the forback () routine. */ make_t = 0; /* * Because of the way the data cube was redistributed (using the index and two * circular shifts), each node has all the dopplers it needs to do its * computation. For each doppler, form a DOF (3 * EL) rows by RNG_S columns * matrix using the elements for every other range gate from the doppler of * interest and previous and following dopplers. */ doplrl = dop - 1 ; doplr2 = dop; doplr3 = dop + 1; dopxu = (taskid * PRI / NN) + dop - 1; /* * Perform weight computations for all beams for all segments. Perform a * Householder tranform and a forward and back solution for the weights for * each segment. */ for (seg - 0 ; seg < num_seg; seg++) ( I * for (seg...) */ /* * 1st, Put Cholesky data elements in temp_mat to do householder. 1st EL rows V for (k = 0 ; k < el; k++) ( /* for (k...) */ row = k; for (col = 0 ; col < rngsamp; col++) { /* for (col...) */ chol = 2 * col + seg * rng_seg; /* Cholesky data */ !* * Get every other range gate from either 1st or 2nd segment. */ temp_mat[row][col].real = local_cube[doplrl][chol3[k].real; temp_jnat[row][colj.imag = local_cube[doplrlj[cholj[kj.imag; } /* for (col...) */ } /* for (k...I */ /* * 2nd EL rows. */ for (k = 0 ; k < el; k++) 178 { /* for (k. . . ) */ row = el + k; for {col = 0 ; col < rngsamp; col++) { /* for (col...) * I chol = 2 * col + seg * rng_seg; /* Cholesky data */ /* * Get every other range gate from either the 1st or 2nd segment. */ temp_mat[row][col|.real = local_cube[doplr2][chol][k].real; temp_mat[row][colj.imag = local_cube[doplr2 ][chol][kj.imag; ) /* for (col...) */ ) /* for (k...) */ /* * 3rd EL rows. */ for (k = 0 ; k < el; k++) ( /* for (k...) */ row = 2 * el + k; for (col = 0 ; col < rngsamp; col++) ( /* for (col...) */ chol = 2 * col + seg * rng_seg; /* Cholesky data */ /* * Get every other range gate from either the 1st or 2nd segment. */ temp_mat[row][col].real = local_cube[doplr3][chol][k].real; temp_mat[row][colj.imag = local_cube[doplr3j[chol]fkj.imag; ) /* for (col...) */ ) /* for (k...) */ I* * Perform a Householder transform to lower triangularize all rows. V num_rows = dofs; lower_triangular_rows = dofs; start_row = 0 ; /* start householder on 1st row V /* * Call the Householder routine. V house (num_rows, rngsamp, lower_triangular_rows, start_row, temp_mat); /* * temp_mat matrix now has a lower triangular matrix as the 1st MxM rows. M * nominally = 144 = DOF or 3 * EL. V for (beam = 0; beam < BEAM; beam++) ( /* for (beams...) */ /* * Move the appropriate steering vector into a temporary holding matrix. */ for (n = 0 ; n < dofs; n++) I mod_str_vec[0 ][n].real = mod_str_vecs[beam][dopxu][n].real; mod_str_vec[0 j[n].imag = mod_str_vecs[beam][dopxu]in].imag; ] 179 /* * Call forback to solve weight vector with modified steer vector. */ forback (num_rows, mod_str_vec, weight_vec, make_t, temp_mat); /* * Store weights into the weights matrix. 
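 *
 * At this point weight_vec holds one DOF-element weight vector for the
 * current (beam, segment) pair; with the HO-PD dimensions DOF = 3 * EL =
 * 144, BEAM = 2 and NUMSEG = 2, the weights matrix therefore collects four
 * 144-element vectors per doppler, and compute_beams () later applies each
 * one to the range segment adjacent to the segment it was trained on.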
V for (n = 0 ; n < dofs; n++) ( weights[beam)[seg]In].real = weight_vecIn].real; weights[beam][seg][n].imag = weight_vecIn].imag; ) ) /* for (beams...) */ ) /* for (seg...) */ return; ) B.6 fltc /* * f f t. c */ /* * This file contains the procedures fft () and bit_reverse (), and is part * of the parallel HO-PD benchmark program written for the IBM SP2 by the * STAP benchmark parallelization team at the University of Southern California * (Prof. Kai Hwang, Dr. Zhiwei Xu, and Masahiro Arakawa), as part of the ARPA * Mount a intop program. A * The sequential HO-PD benchmark program was originally written by Tony Adams * on 10/22/93. * * The procedure fft () implements an n-point in-place decimation-in-time * FFT of complex vector "data" using the n/2 complex twiddle factors in * *w_common". The implementation used in this procedure is a hybrid between * the implementation in the original sequential version of this program and * the implementation suggested by MHPCC. This modification was made to * improve the performance of the FFT on the SP2. * * The procedure bit_reverse () implements a simple (but somewhat inefficient) * bit reversal. */ •include "defs.h* /■* * fft () * inputs: data, w_common, n, logn * outputs: data */ void fft (data, w_common, n, logn) COMPLEX *data, *w_comnon; int n, logn; ( /* fft */ int incrvec, iO, il, 12, nx, tl, t2, t3; float fO, fl; 180 v o i d b i t _ r e v e r s e () ; /* * Begin function body: fft () * * Bit-reverse the input vector. */ (void) bit_reverse (data, n) ; I* * Do the first log n - 1 stages of the FFT. */ i2 = logn; for (incrvec = 2 ; incrvec < n; incrvec « = 1) ( /* for (incrvec...) */ i 2 - - ; for (iO = 0; iO < incrvec >> 1; i0++) { /* for (iO . . .) */ for (il = 0 ; il < n; il +- incrvec) { /* for (il. . .) */ tl = iO + il + incrvec / 2; t2 = iO << i2 ; t3 = iO + il; fO = data[tl1.real * w_common[t2].real data[tl].imag * w_common[t2].imag; fl = data[tl}.real * w_common[t2 ].imag data[tl].imag * w_common[t2].real; data[tl].real = data[13J .real - fO; data[tl].imag = data[13].imag - fl data[t3j.reai = data[13] .real + fO data[13j.imag = data[131 .imag + ■ fl ) /* for (il. . .) */ ) /* for (iO...) */ ) /* for (incrvec...) */ /* * Do the last stage of the FFT. */ for (i0 = 0 ; id < n / 2; i0 + + ) { /* for (iO . . .) */ tl = iO + n / 2; fO = data[tl].real * w_common[iO].real - data[tl].imag * w_common[iO].imag; fl = data[tl].real * w_common[iO].imag + data[tl].imag * w_common[iO].real; data[tl].real - data[iO].real - fO; data[tli.imag = datatiO].imag - fl data[iO].real = datafiO].real + fO data[iO].imag = datafiO].imag + fl ) /* for (10 . . .) */ > /* fft */ /* * bit_reverse () * inputs: a, n * outputs: a V void bit_reverse (a, n) COMPLEX *a; int n; ( int i, j, k; 181 /* * Begin function body: bit_reverse () */ 3 = 0; for (i = 0 ; i < n - 2; i++> { if < i < j) SWAP(a (j] , a [i]) ; k = n » 1 ; while (k <= j) { 3 = ki k » = 1; ) 3 *= k; } > B.7 m_STAP.c /* * f ft_STAP.c */ /* * This file contains the procedure fft_STAP (), and is part of the parallel * HO-PD benchmark program written for the IBM SP2 by the STAP benchmark * parallelization team at the University of Southern California (Prof. Kai * Hwang, Dr. Zhiwei Xu, and Masahiro Arakawa), as part of the ARPA Mountaintop * program. ■ * * The sequential HO-PD benchmark program was originally written by Tony Adams * on 10/22/93. * * The procedure fft_STAP () performs an FFT along the PR! dimension for each * column of data (there are a total of RNG*EL such columns) by calling the * procedure fft (>. 
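 */

/*
 * For reference, a direct O(n*n) DFT that should agree (up to rounding)
 * with the radix-2 routine fft () above; it is useful only for checking
 * small cases.  The name dft_ref and the split real/imaginary arrays are
 * illustrative and are not part of the benchmark.
 */
#include <math.h>

static void dft_ref (int n, const float *in_re, const float *in_im,
                     float *out_re, float *out_im)
{
    int j, k;
    double pix2 = 2.0 * 3.14159265358979;

    for (k = 0; k < n; k++)
    {
        double sr = 0.0, si = 0.0;

        for (j = 0; j < n; j++)
        {
            /* accumulate in[j] * exp(-2 pi i j k / n), the same sign
               convention as the twiddle factors built in fft_STAP () */
            double ang = -pix2 * (double) j * (double) k / (double) n;
            double wr  = cos (ang);
            double wi  = sin (ang);

            sr += in_re[j] * wr - in_im[j] * wi;
            si += in_re[j] * wi + in_im[j] * wr;
        }
        out_re[k] = (float) sr;
        out_im[k] = (float) si;
    }
}

/*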
*/ #include “stdio.h" ((include "math.h" ((include "defs.h" ((include <sys/types.h> ((include <sys/time.h> fft_STAP () inputs: fft_out outputs: fft_out fft_out the FFT output data cube / f ft_STAP (fft_out( COMPLEX fft_out(][RNG / NN][ELJ; ( /* * Variables 182 * pricnt: pricnt globally defined in main * fft (): a pointer to the FFT function (which performs a radix 2 * calculation) * el: the maximum number of elements in each PRI * beams: the maximum number of beams * rng: the maximum number of range gates in each PRI, in this data cube slice * (modified for parallel execution) * i, j, k: loop counters * xrealptr, ximagptr: not used * logpoints: log base 2 of the number of points in the FFT * pix2 , pi: 2*pi and pi * x: storage for the temporary FFT vector * w: storage for the twiddle factors table V extern int pricnt; extern void fft(); int el = EL; int beams = BEAM; int rng = RNG / NN; int i, j, k; float ‘xrealptr; float ‘ximagptr; int logpoints; float pi, pix2 ; static COMPLEX x[PRI]; Static COMPLEX w [PRI]; /* * Begin function body: fft_STAP () * * Generate twiddle factors table w. V logpoints = log2 ((float) pricnt) + 0 .1; pi = 3.14159265350979; pix2 = 2 . 0 * pi; for (i = 0; i < PRI; i++) ( w[i].imag = -sin (pix2 * (float) i / (float) PRI); w[i].real = cos (pix2 * (float) i / (float) PRI); ) /* * Perform one FFT for each rng and el. */ for (i = 0 ; i < rng; i++) { /* for (!...) */ for (k = 0 ; k < el; k++) { /* for (k...) */ /* * Move all the pri's into a vector. */ for (j = 0; j < pricnt; j++) ( x[j].real = fft_out(j][i][k].real; x[j].imag = fft_out[j][i][k].imag; ) /* * call the FFT routine. */ 183 fft (x, w, pricnt, logpoints); /* * Move FFT’ed data back into fft_out. */ for (j = 0; j < pricnt; j++) { fft_out[j][i][k].real = x(j].real; fft_out[jj[ij [kj.imag = xijj.imag; ) } /* for (k...) V ) /* for (i . . . ) V return; 1 B.8 forback.c forback.c / This file contains the procedure forback (), and is part of the parallel HO-PD benchmark program written for the IBM SP2 by the STAP benchmark parallelization team at the University of Southern California (Prof. Kai Hwang, Dr. Zhiwei Xu, and Masahiro Arakawa), as part of the ARPA Mountaintop program. The sequential HO-PD benchmark program was originally written by Tony Adams on 10/22/93 . The procedure forback () performs a forward and back substitution on an input temp_mat array, using the steering vectors, str_vecs, and normalizes the solution returned in the vector “weight_vec*. If the input variable 'make_t' = 1, then the T matrix is generated in this routine by doing BEAM forward and back solution vectors. Nominally, BEAM = 32. * The conjugate transpose of each weight vector, to get the Hermitian, is * put into the T matrix as row vectors. The input, tempjnat, is a square * lower triangular array, with the number of rows and columns = num_rows. * * This routine requires 5 input parameters as follows; * num_rows: the number of COMPLEX elements, M, in temp_mat * str_vecs; a pointer to the start of the steering vector array * weight_vec: a pointer to the start of the COMPLEX weight vector * make_t: 1 = make T matrix; 0 = don't make T matrix but pass back the * weight vector * temp_mat[][]; a temporary matrix holding various data * * This procedure was untouched during our parallelization effort, and * therefore is virtually identical to the sequential version. Also, most, * if not all, of the comments were taken verbatim from the sequential * version. 
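 */

/*
 * A minimal sketch of the forward-substitution step described above: solve
 * the complex lower-triangular system L y = s one element at a time,
 * dividing by each diagonal through multiplication by its conjugate over
 * its squared magnitude, as forback () does on temp_mat.  The name
 * forward_solve and the flat, split real/imaginary arrays are illustrative
 * only; they are not part of the benchmark.
 */
static void forward_solve (int n,
                           const float *lr, const float *li,  /* L, row-major, entry (i,k) at i*n + k */
                           const float *sr, const float *si,  /* right-hand side s                    */
                           float *yr, float *yi)              /* solution y                           */
{
    int i, k;

    for (i = 0; i < n; i++)
    {
        float accr = sr[i];
        float acci = si[i];
        float dr, di, d;

        for (k = 0; k < i; k++)
        {
            /* subtract L[i][k] * y[k] from the right-hand side */
            accr -= lr[i * n + k] * yr[k] - li[i * n + k] * yi[k];
            acci -= lr[i * n + k] * yi[k] + li[i * n + k] * yr[k];
        }
        dr = lr[i * n + i];
        di = li[i * n + i];
        d  = dr * dr + di * di;                /* squared magnitude of L[i][i] */
        yr[i] = (dr * accr + di * acci) / d;   /* y[i] = acc / L[i][i]         */
        yi[i] = (dr * acci - di * accr) / d;
    }
}

/*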
*/ •include 'stdio.h* •include 'math.h* •include "defs.h' /* 184 forback O inputs: num^rows, str_vecs, weight_vec, make_t, temp_mat outputs: weight_vec num_rows: the number of elements, M, in the input data str_vecs: a pointer to the start of the steering vector array weight_vec: a pointer to the start of the COMPLEX weight vector make_t: 1 = make T matrix; 0 = don't make T matrix but pass back the weight vector temp_mat: MAX temprary matrix holding various data / forback (num_rows, str_vecs, weight_vec, make_t, temp_mat) int num_rows; COMPLEX s t r_vecs{] [DOF]; COMPLEX weight_vec[]; int make_t; COMPLEX temp_mat[][COLS]; { /* * Variables ■ t * i, j, k: loop counters * last: the last row or element in the matrices or vectors * beams: loop counter * num_beams; loop counter * sum_sq: a holder for the sum squared of the elements in a row * sum: a holder for the sum of the COMPLEX elements in a row * temp: a holder for temporary storage of a COMPLEX element * abs_mag: the absolute magnitude of a complex element * wt_factor: a holder for the weight normalization factor, according to * Adaptive Matched Filter Normalization * steer_vec: one steering vector with DOF COMPLEX elements * vec: a holder for the complex solution intermediate vector * tmp_mat: a pointer to the start of the COMPLEX temp matrix */ int i, j, k; int last; i nt beams; int num_beams; float sum_sq; COMPLEX sum; COMPLEX temp; float absjnag; float wt_factor; COMPLEX steer_vec[DOFJ; COMPLEX vec[DOF1; COMPLEX tmp_mat[DOF][DOF]; /* * Begin function body: forback O * * The temp_jnat matrix contains a lower triangular COMPLEX matrix as the * first MxM rows. Do a forward back solution BEAM times with a different * steering vector. If make_t = 1, then make the T matrix. */ if (make_t] ( num_beams = num_rows; /* Do M forward back solution vectors. Put */ /* them else into T matrix. */ ) else { 185 num beams = l; /* Do only 1 solution weight vector •/ ) for (beams = 0 ; beams < num_beams; beams+ 0 { /* for (beams. . .) */ for (j = 0 ; j < num_rows; 3++) { steer_vec[j].real = str_vecs[beams][j].real; steer_vec[jj.imag = str_vecs[beams][jj.imag; ) /* * Step 1: Do forward elimination. Also, get the weight factor = square root * of the sum squared of the solution vector. Used to divide back substitution * solution to get the weight vector. Divide the first element of the COMPLEX * steering vector by the first COMPLEX diagonal to get the first element of * the COMPLEX solution vector. First, get the absolute magnitude of the first * lower triangular diagonal. V abs_mag = temp_mat[0 ][0 ).real * temp_mat[0 ][0 ].real + temp_mat[0 ][0 ].imag * temp_mat[0 ][0 ].imag; /* * Solve for the first element of the solution vector. */ vec[0 ].real = (temp_mat[0 )[0 1.rea1 * steer_vec[0 ].real + temp_mat[0 ][0 ].imag * steer_vec[0 1 .imag) / absjnag; vec[0 ].imag = (temp_mat[0 ][0 ].real * steer_vec[0 j .imag - temp_mat[0 ][0 ].imag * steer_vec[0 ).real) / abs_mag; /* * Start summing the square of the solution vector. */ sum_sq = vec[0 ].real * vec[0 ].real + vec[0].imag * vec[0 ].imag; /* * Now solve for the remaining elements of the solution vector. */ for (i = 1 ; i < num_rows,- i + + ) { /* for (i . . . ) *! sum.real = 0 .0 ; sum.imag = 0 .0 ; for (k - 0 ; k < i; k++) ( /* for (k...) */ sum.real += (temp_mat[i][k].real * vec[k].real - temp_mat1i][k].imag * vec[k].imag); sum.imag += (tempjnat[i][k].imag * vec[k].real + temp_matfi][k].real * vec[k].imag); ) /* for (k...) */ /* * Now subtract the sum from the next element of the steering vector. 
*/ temp.real = steer_vec[i].real - sum.real; temp.imag = steer_vec[i].imag - sum.imag; t* * Get the absolute magnitude of the next diagonal. V abs_mag = temp_mat[i][i].real * temp_mat[i1 [i].real + tempjnat[i][i].imag * temp_mat[i][i].imag; 186 /* * Solve for the next element of the solution vector. •/ vec{i].real = (temp_|nat (i ] [ i] . real * temp.real + temp_mat[i][i).imag * temp.imag) I abs_mag; vec(i].imag = <temp_mat(i][i].real * temp.imag - temp_mat[i)[i).imag * temp.real) / absjnag; /* * Sum the square of the solution vector. */ sum_sq += (vec[i|.real * vec[i].real + vecIi].imag * vec(i].imag); ) /* for (i . . .) */ wt_factor = sqrt ((double) sum_sq); /* * Step 2: Take the conjugate transpose of the lower triangular matrix to * form an upper triangular matrix. V for (i = 0; i < num_rows; i++) I /* for (i . . . ) */ for (j = 0 ; j < num_rows; j++) { /* for (j . . .) * I tmp_matIi)[j].real = temp_mat[j]Ii].rea1; tmp_mat[ij [jJ.imag = - temp_mat[j][i].imag; ) /* for (j . . . ) V } /* for (i . . .) */ /* * step 3: Do a back substitution. */ last = num_rows - 1; /* * Get the absolute magnitude of the last upper triangular diagonal. */ abs_mag = tmp_mat{last][last).real * tmp_mat[last)[last].real + tmp_mat[last][last].imag * tmp_mat[last][last].imag; /* * Solve for the last element of the weight solution vector. */ weight_vecilast].real = (tmp_mat(last][last].real * vec[last].real + tmp_mat[last][last].imag * vec[last].imag) / abs_mag; weight_vec[last].imag = (tmp_|nat[last][last].real * vec[last].imag - tmpjnat[last][last].imag * vec[last].real) / abs_mag; /* * Now solve for the remaining elements of the weight solution vector from * the next to last element up to the first element. */ for (i = last - 1; i >= 0 ; i--) [ /* for (i . ..) */ sum.rea1 = 0 .0 ; sum.imag = 0 .0 ; for (k = i t 1; k <= last; k++) { /* for (k...) */ sum.real += (tmpjnat[i][k].real * weight_vec[k].real - tmp_mat[i][k].imag * weight_vec[k].imag); 187 sum.imag *-= (tmp_mat [ i] [k] . imag * weight_vecIk].real + tmp_mat[i][k].real * weight_vec[k].imag); ) /* for <k. . .) */ /* * Subtract the sum from the next element up of the forward solution vector. */ temp.real = vectil-real - sum.real; temp.imag = vec[ij.imag - sum.imag; /* * Get the absolute magnitude of the next diagonal up. */ abs_mag = tmp_mat[i][i].real * tmp_mat[i][i].real + tmp_mat[i][i 1.imag * tmp_mat[i)[i].imag; /* * Solve for the next element up of the weight solution vector, * / weight_vec[i].real = (tmp_jnat [ i ] [ i ] . real * temp.real + tmp_mat|i)[i].imag * temp.imag) / abs_mag; weight_vec [ i J . imag - (tmp_jnat i i ] 1 i ) . real * temp, imag - tmp_mat|i][i].imag * temp.real) / abs_mag; ) /* for {i . . . ) */ /* * Step 4: Divide the solution weight_vector by the weight factor. */ for (i = 0; i < num„rows; i++) t weight_vec[i].real /= wt_factor; weight_vec[i].imag /= wt_factor; ) ttifdef APT /* * If make_t = 1, make the T matrix. */ if (make_t) I /* if (make_t) */ I* * Conjugate transpose the weight vector to get the Hermitian. Put each * weight vector into the T matrix as row vectors. V for (j = 0 ; j < num_rows; j++> ( t_matrix[beams][j].real = weight_vec[j].real; tjnatrixjbeams][j].imag = - weight_vec[j].imag; ] ) /* if (make_t) */ #endif ) /* for (beams...) */ return; ) 188 BS form_beams.c /* * form_beams. c */ /* * This file contains the procedure form_beams (), and is part of the parallel * HO-PD benchmark program written for the IBM SP2 by the STAP benchmark * parallelization team at the University of Southern California (Prof. 
Kai * Hwang, Dr. Zhiwei Xu, and Masahiro Arakawa), as part of the ARPA Mountaintop * program. * * The sequential HO-PD benchmark program was originally written by Tony Adams * on 10/22/93. * * The procedure form_beams () performs the beamforming by calling the * compute_weights () and compute_beams () procedures. */ ((include "stdio.h" ((include “math.h* ((include "defs.h* ((include <sys/types .h> ((include <sys/times,h> extern int taskid, numtask, allgrp; * form_beams () * inputs; dopplers, local_cube, mod_str_vecs * outputs: detect_cube * * dopplers: number of PRIs in the input data * local_cube: input data cube and dopplers after FFTs * detect_cube: detection data cube * mod_str_vecs: matrices holding modified steering vectors */ form_beams (dopplers, local_cube, detect_cube, mod_str_vecs) int dopplers; COMPLEX local_cube(](RNG][EL); COMPLEX detect_cubel)(DOP / NN)[RNG]; COMPLEX mod_str_vecs[][DOP][DOF] ; /* * Variables * * temp_mat: a temporary matrix holding three adjacent dopplers * weights: weights matrix per doppler to apply to range segment per beam * dop: loop counter for doppler number */ static COMPLEX tempjnat[DOF][COLS 1; static COMPLEX weights [BEAM] [NUMSEG] [DOF] , - int dop; extern void compute_weights(); extern void compute_beams(); /* * Begin function body: form_beams (). 189 * Perform beamforming for all dopplers. * I for (dop = 1 ; dop <= dopplers; dop++) ( compute_weights (dop, local_cube, weights, mod_str_vecs, temp_mat); compute_beams (dop, local_cube, weights, detect_cube, tempjnatj; } return; } B.10 form_str_vccs.c /• * lorm_st r_vecs.c V /* * This file contains the procedure form_str_vecs (), and is part of the * parallel HO-PD benchmark program written for the IBM SP2 by the STAP * benchmark parallelization team at the University of Southern California * (Prof. Kai Hwang, Dr. Zhiwei Xu, and Masahiro Arakawa), as part of the ARPA * Mountaintop program. * * The sequential HO-PD benchmark program was originally written by Tony Adams * on 10/22/93. */ #include "stdio.h" #include "math.h* #include "defs.h" #include <sys/types.h> #include <sys/time.h> * form_str_vecs () * inputs: str_vecs * outputs: mod_str_vecs * * str_vecs: storage for complex spatial steering vectors * mod_str_vecs: modified steering vectors matrices */ form_str_vecs (str_vecs, mod_str_vecs> COMPLEX str_vecs(][EL]; COMPLEX mod_str_vecs[][DOP][DOF]; /* * Variables * * el: number of elements in each PRI * element: number of the element * beams: number of beams * dops: number of dopplers after the FFT * i, j, k: loop counters */ int el = EL; 190 int element; int beams = BEAM; int dops = DOP; int i, j, k; /* * Begin function body: form_str_vecs () I k * Special case steering vectors. */ for (i = 0 ; i < beams; i++) { /* for (i...) */ for (j = 0 ; j < dops; j++) { /* for (j. . . ) */ /* * 1st, put EL number 0,0 in special case steering vectors. * I for (k = 0 ; k < el; k++) ( ! * for (k...) */ mod_str_vecs[i][j][k].real = 0 .0 ; mod_s t r_vecs[ i j[j j [ k ].imag = 0 .0 ; ) /* for (k...) V f* * 2nd, put EL elements length spacial steering vector in special case * steering vectors. */ for (k = 0 ; k < el; k++) { /* for (k...) V element = k + el; mod_str_vecs[i][j][element].real = str_vecsli][k].real; mod_str_vecs[ij fjj [element].imag = str_vecs[i][kj.imag; } /* for (k...) */ /* * 3rd, put EL number 0,0 in special case steering vectors. *I for (k = 0 ; k < el; kt +) { /* for (k...) */ element = k + 2 * el; mod_str_vecs[i][j][element 1.real = 0 .0 ; mod_str_vecs[ij [j][elementi.imag = 0 .0 ; } /* for (k. 
..) * t } /* for (j...) V ] /* for (i . . . ) */ return; } B.11 house.c /* * house.c */ /* * This file contains the procedure house (>, and is part of the parallel * HO-PD benchmark program written for the IBM SP2 by the STAP benchmark * parallelization team at the University of Southern California (Prof. Kai * Hwang, Dr. Zhiwei Xu, and Masahiro Arakawa), as part of the ARPA Mountaintop * program. * * The sequential HO-PD benchmark program was originally written by Tony Adams * on 10/22/93. * * The procedure house () performs the Householder transform, in place, on * an N by M complex imput matrix, where M >= N. It returns the results in * the same location as the input data. * * This routine requries 5 input parameters as follows: * num_rows: number of elements in the temp_mat * num_cols: number of range gates in the tempjnat * lower_triangular_rows: the number of rows in the output tempjnat that * have been lower triangularized * start_row: the number of the row on which to start the Householder * temp_mat[][]: a temporary matrix holding various data * * This procedure was untouched during our paralleiization effort, and * therefore is virtually identical to the sequential version. Also, most, * if not all, of the comments were taken verbatim from the sequential * version. */ (♦include "stdio.h" (♦include Tnath.h* (♦include “defs.h" * house () * inputs: num_rows, num_cols, lower_triangular_rows, start_row, temp_mat * outputs: temp_mat * * num_rows: number of elements, N, in the input data * num_cols: number of range gates, M, in the input data * lower_triangular_rows: the number of rows to be lower triangularized * start_row: the row number of which to start the Householder * temp_jnat: temporary matrix holding various data */ house (num_rows, num_cols, lower_triangular_rows, start_row, temp_mat) int num_rows; int num_cols; int lower_triangular_rows; int start_row; COMPLEX temp_mat[J[COLS); /* * Variables * * i, j, k: loop counters * rtemp: a holder for temporary scalar data * x_square: a holder for the absolute square of complex variables * xmax_sq: a holder for the maximum of the complex absolute of variables * vec: a holder for the maximum complex vector 2 * num_cols max * sigma: a holder for a complex variable used in the Householder * gscal: a holder for a complex variable used in the Householder * alpha: a holder for a scalar variable used in the Householder * beta: a holder for a scalar variable used in the Householder */ int i, j, k; float rtemp; float x_square; float xmax_sq; COMPLEX vec[2*COLS]; 192 COMPLEX sigma; COMPLEX gscal; float alpha; float beta; /* * Begin function body; house () * * Loop through temp_mat for number of rows = lower_triangular_rows. Start at * the row number indicated by the start_row input variable. V for (i = start_row; i < lower_triangular_rows; i + + ) { /* for (i. . . > */ /* * Step 1: Find the maximum absolute element for each row of temp_mat, * starting at the diagonal element of each row. */ xmax_sq = 0 .0; for (j - i; j < num_cols; j++) { /* for (j . . . ) */ x_square = temp_mat[i][j].real * tempjnat[i](j].real + temp_mat[i][j].imag * temp„mat[i][j).imag; if (xraax_sq < x_square) ( xmax_sq = x_square; } } /* for (j . . . ) V /• * step 2; Normalize the row by the maximum value and generate the complex * transpose vector of the row in order to calculate alpha = square root of * the sum square of all the elements in the row. V xmax_sq = (float) sqrt ((double) xmax_sq); alpha = 0 ,0; for (j = i; j < num_cols; j++) ( /* for (j . . .) 
*/ vec[j].real = temp_mat[iJ(j].real / xmax_sq; vec[j],imag = - temp_mat[i)[j].imag / xmax_sq; alpha += (vec[j],real * vec[j].real + vec[j].imag * vec[j].imag); ) /* for (j . . . ) •/ alpha = (float) sqrt ((double) alpha); /* * Step 3: Find beta = 2 / (b (transpose) * b). Find sigma of the relevant * element = x(i) / lx(i)l. V rtemp = vec[1].real * vec(i].real + veclil.imag * vecfij.imag; rtemp = (float) sqrt ((double) rtemp); beta = 1.0 / (alpha * (alpha + rtemp)); if (rtemp >= l.QE-16) ( sigma.real = vec[i].real / rtemp; sigma.imag = vec[ij.imag / rtemp; } else ( sigma.real = 1 .0 ; s igma.imag = 0 .0 ; ) 193 * Step 4: Calculate the vector operator for the relevent element. */ vec[i].real += sigma.real * alpha; vec[i].imag += sigma.imag * alpha; /* * Step 5: Apply the Householder vector to all the rows of tempjnat. */ for (k = i; k < num_rows; k + +) { /* for (k...> */ /* * Find the scalar for finding g. V gscal.real = 0 ,0 ; gscal.imag = 0 .0 ; for (j = i; j < num_cols; j++) { t* for (j . . . ) */ gscal.real += (temp_mat[k] Ej].real * vec]j3-real - temp_mat[k][j].imag * vec[j].imag); gscal.imag += (temp_mat[k]tjI•real * vec[j].imag + temp_mat[k][j].imag * vec[j].real); } /* for (j . . .) */ gscal.real *= beta; gscal.imag *= beta; /* * Modify only the necessary elements of the temp_mat, subtracting gscal * * conjg (vec) from temp_|nat elements. V for (j = i; j < num_cols; j++) f ! * for <j . . . ) */ temp_matIk][j].real -- (gscal.real * vec[j].real + gscal.imag * vec[j].imag); temp_mat[k][j].imag -= (gscal.imag * vectjj.real gscal.real * vec[jj.imag); ) 1* for (j...) */ ) /* for (k. . .> */ > /* for (i . . . ) */ return; I B.12 read_input_STAP.c /* * read_input_STAP.c V /* * This file contains the procedure read_input_STAP (), and is part of the * parallel HO-PD benchmark program written for the IBM SP2 by the STAP * benchmark parallelization team at the University of Southern California * (Prof. Kai Hwang, Dr. Zhiwei Xu, and Masahiro Arakawa), as part of the ARPA * Mountaintop program. * * The sequential HO-PD benchmark program was originally written by Tony Adams * on 10/22/93. * 194 * The procedure read_input_STAP () reads the input data files (containing * the data cube and the steering vectors}. * * In this parallel version of read_input_STAP (}, each task reads its portion * of the data cube in from the data file simultaneously. In order to improve * disk performance, the entire data cube slice is read from disk, then * converted from the packed binary integer format to the floating point * number format. * * Each complex number is stored on disk as a packed 32-bit binary integer. * The 16 LSBs are the real portion of the number, and the 16 MSBs are the * imaginary portion of the number. * The steering vector file contains the number of PRIs in the input data, * the target power threshold, and the steering vectors. The data is stored * in ASCII format, and the complex steering vector numbers are stored as * alternating real and imaginary numbers. 
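 */

/*
 * Annotation (not part of the original listing): the packed format described
 * above is unpacked inside read_input_STAP () with explicit masking and
 * sign extension of each 16-bit half. The helper below is an equivalent
 * restatement of that conversion; the name unpack_sample is illustrative
 * only, and a 16-bit short is assumed, as on the SP2.
 */
static void unpack_sample (unsigned int packed, float *re, float *im)
{
    /* The 16 LSBs hold the real part and the 16 MSBs the imaginary part.
       Casting through short performs the same sign extension as the
       mask-and-OR sequence in read_input_STAP () below. */
    *re = (float) (short) (packed & 0x0000FFFF);
    *im = (float) (short) ((packed >> 16) & 0x0000FFFF);
}

/*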
*/ #include "stdio.h" #include ■math.h* •include ■defs.h" •include <sys/types.h> •include <sys/time.h> •ifdef DBLE char * fmt = * % 1f *; •else char *fmt = "%f; •endif extern int taskid, numtask, allgrp; * read_input_STAP () * inputs: data_name, str_name * outputs: str_vecs, fft_out * * data_name: name of the file containing the input data cube * str_name: name of the file containing the steering vectors * str_vecs: the steering vectors * fft_out: this task's slice of the data cube */ read_input_STAP (data_name, str_name, str_vecs, fft_out| char data_name[1; char str_name[]; COMPLEX str_vecs[][EL]; COMPLEX fft_out[)[RNG t NN][EL]; /* * Variables * * el: the maximum number of elements in each PRI * beams: the maximum number of beams * rng: the maximum number of range gates * dopplers: number of dopplers in the input data * temp: a buffer for binary input integer data * tempi, temp2 : holders for integer data * i, j, k: loop counters * fopen (): a pointer to the file open function * f_str, f_dat: pointers to the input files * pricnt: the pricnt variable globally defined in main u * threshold: the threshold variable globally defined in main (} 195 * taskid: the identifier for this task * local_int_cube: the local portion of the data cube V int el = EL; int beams = BEAM; int rng = RNG / NN; int dopplers; unsigned int temp[l]; long int tempi, temp2; int i, j, k; FILE *fopen(); FILE *f_str, *f_dat; extern int pricnt extern float threshold; extern taskid; unsigned int local_int_cube[RNG / NN)iPR1] I ELI; t* * Begin function body: read_input_STAP () * * Every task: load the steering vectors file. V if <(f_str = fopen (str_name, *r*)| =- NULL) printf ("Error - task %d unable to open steering vector file.\n", taskid); exit (-1 ); } /* * The first item in the steering vector file is the number of PRIs. The * second item in the steering vector file is the target detection threshold. V fscanf <f_str, "%d", ipricnt); fscanf <f_str, fmt, ^threshold); i * * Read in the rest of the steering vector file. V for (i = 0; i < beams; i++) ( for (j = 0 ; j < el; j++) { fscanf (f_str, fmt, &str_vecs[i][j].real); fscanf (f_str, fmt, tstr_vecs[i][j].imag); ) ) fclose |f_str); !* * Read in the data file in parallel. */ if ((f_dat = fopen (data_name, *r“)| == NULL) ( printf ("Error - task %d unable to open data file.Xn", taskid); exit (-1 ) , - ) fseek (f_dat, taskid * rng * pricnt * el * sizeof (unsigned int), 0); fread (local_int_cube, sizeof (unsigned int), pricnt * rng * el, f_dat); fclose (f_dat); 196 /* * Convert data from unsigned int format to floating point format. */ for (i = 0 ; i < rng; i + +) { /* for (i . . .) */ for {j = 0 ; j < pricnt; j + + ) t /* for (j . . . ) V for (k = 0 ; k < el; k++) { /* for (k...) V tempIO] = local_int_cube[i][j][k]; tempi = OxOOOOFFFF & temp[0]; tempi = (tempi & 0x00008000) ? tempi I OxffffOOOO : tempi; temp2 = (temp[01 » 16) & OxOOOOFFFF; temp2 = (temp2 & 0x00008000) ? temp2 I OxffffOOOO : temp2; fft_out[j]Ii][k!.real = (float) tempi; fft_out[jj (i][kl.imag = (float) temp2 ; ) /* for (k...) */ } /* for (j . . .) */ I /* for ( i . . . 
) */ return; ) B.13 defs.h /* defs.h */ •define NN 64 •define SWAP(a,b) (float swap_temp=(a).real;\ (a).real=(b).real;(b).real=swap_temp;\ swap_temp=(a).imag;\ (a).irnag^(b).imag;(b).imag=swap_temp;) •ifdef IBM •define log2(x) ((log(x))/(M_LN2)) •define CLK_TICK 100.0 •endif •ifdef DBLE •define float double •endif typedef struct { float real; float imag; ) COMPLEX; •define AT_LEAST_^RG 2 •define AT_MOST_ARG 4 •define ITERATIONS_ARG 3 •define REPORTS^ARG 4 •define USERTIME(T1.T2) •define SYSTIME(T1,T2) •define USERTIME1<T1,T2) •define SYSTIME1(T1,T2) •define USERTIME2<T1,T2) •define SYSTIME2(T1,T2) ((t2 .tms_utime-tl.tms_ut ime)/60.0) (t2.tms_stime-tl.tms_stime)/60.0) ((time_end.tms_utime-time_start.tms_utime)/60.0) ((time_end.tms_stime-time_start.tms_stime)/60.0) ((end_t ime.tms_utime-start_time.tms_utime)/60.0) ((end_time.tms_stime-start_time,tms_stime)/60.0) /* •define LINE_MAX 256 V •define TRUE 1 •define FALSE 0 /* The following default dimensions are MAXIMUM values. The actual /* dimension for PRIs will be the 1st entry in file "testcase.str■. /* The 2nd entry in the file is the Target Detection Threshold. /* The rest of the file will contain steering vectors starting with /* the main beams then all auxiliary beams. The file "testcase.hdr" /* will have descriptions of entries in "testcase.str" file. Also /* a description of the data file "testcase.dat" will be given. This /* file contains data to fill the input "data_cube[)[][]■ array. •define COLS 1500 •def ine •def ine • def ine •def ine •define • def ine • def ine • def ine • def ine •define • def ine • def ine MBEAM 12 ABEAM 20 PWR_BM 9 VI 64 V2 128 V3 1500 V4 64 DIM1 64 /' DIM2 128 /’ DIM3 1500 n DOP_GEN 204 8 PRIGEN 204 8 Maximum number of columns in holding vector "vec" */ in house.c for max columns in Householder multiply */ Number of main beams */ Number of auxiliary beams */ Number of max power beams */ Number for dimensionl in input data cube */ Number for dimension2 in input data cube */ Number for dimension3 in input data cube */ /* MAX Number of dopplers after FFT */ /* MAX Number of points for 1500 data points FFT */ /* zero filled after 1500 up to 2048 points */ •define NUM_MAT 32 /* Number of matrices to do householde on */ /* NUM.J4AT must be less than DIM2 above */ •ifdef APT •define PRI 1024 t* Number of pris in input data cube */ •define DOP 1024 I* Number of dopplers after FFT V •define RNG 280 /* Number of range gates in input data cube */ •define EL 32 /* number of elements in input data cube */ •define BEAM 3 2 /* Number of beams */ •define NUMSEG 7 /* Number of range gate segments */ •define RNGSEG RNG/NUMSEG /* Number of range gates per segment */ •define RNG_S 320 /* Number sample range gates for 1st step beam forming V •define DOF EL /* Number of degrees of freedom */ extern COMPLEX tjnatrix{BEAM][EL]; /* T matrix */ •endif •ifdef STAP •def ine •define •define •def ine •define •define •define •define •define •endi f PRI 128 DOP 128 RNG 10 24 EL 48 BEAM 2 NUMSEG 2 RNGSEG RNG/NUMSEG RNG_S 256 DOF 3*EL /* Number of pris in input data cube */ /* Number of dopplers after FFT *1 /* Number of range gates in input data cube */ /* number of elements in input data cube */ /* Number of beams */ Number of range gate segments */ Number of range gates per segment */ t* Number of sample range gates for beam forming V /» Number of degrees of freedom */ t* •ifdef GEN •define EL V4 •define DOF EL •endif extern int output_time; f* extern int output_report; !i extern int repetitions; /’ Flag if set TRUE, output execution 
times */ Flag if set TRUE, output data report files */ number of times program has executed */ 198 e x t e r n i n t i t e r a t i o n s ; /* num ber o f t i m e s t o e x e c u t e p r o g r a m V B.14 compUe_hopd mpcc -qarch=pwr2 -03 -DSTAP -DIBM -o stap bench_mark_STAP.c cell_avg_cfar.c fft.c fft_STAP.c forback.c cmd_line.c house.c read_input_STAP.c forrn_beams.c compute_weights.c compute_beams.c form_str_vecs.c -lm B.15 run.256 poe stap /scratchl/masa/new_stap -procs 256 -us 199 Appendix C Parallel General Code The parallel code for the General benchmark, along with the script to compile and run the parallel General program, are given in this appendix. C .l Sorting and FFT Subprogram C.1.1 bench_mark_SORT.c t* * bench_mark_SORT.c /* * Parallel General Benchmark Program for the IBM SP2 This parallel General benchmark program was written for the IBM SP2 by the STAP benchmark parallelization team at the University of Southern California (Prof. Kai Hwang, Dr. Zhiwei Xu, and Masahiro Arakawa), as part of the ARPA Mountaintop program. The sequential General benchmark program was originally written by Tony Adams on 8/30/93. This file contains the procedure main (), and represents the body of the General Sorting subprogram. This subprogram's overall structure is as follows: 1. Load data in parallel. 2. Each task searches its slice of the data cube for the largest element. 3. Perform a reduction operation to find the largest element in the entire data cube. 4. Each task sorts its slice of the data cube along the DIM1 dimension. 5. Collect the totals calculated in the bubble sort operation using a reduction operation. This program can be compiled by calling the shell script compile_sort. Because of the nature of the program, before the program is compiled, the header file defs.h must be adjusted so that NN is set to the number of nodes on which the program will be run. We have provided a defs.h file for each power of 2 node size from 1 to 256, called defs.001 to defs.256. This program can be run by calling the shell script run.###, where #»# is the number of nodes on which the program is being run. Unlike the original sequential version of this program, the parallel General program does not support command-line arguments specifying the number of times the program body should be executed, nor whether or not timing information should be displayed. The program body will be executed once, and the timing information will be displayed at the end of execution. 200 * The input data file on disk is stored in a packed format: Each complex * number is stored as a 32-bit binary integer; the 16 LSBs are the real * half of the number, and the 16 MSBs are the imaginary half of the number. * These numbers are converted into floating point numbers usable by this * benchmark program as they are loaded from disk. * * The steering vectors file has the data stored in a different fashion. All * data in this file is stored in ASCII format. The first two numbers in this * file are the number of PRIs in the data set and the threshold level, * respectively. Then, the remaining data are the steering vectors, with * alternating real and imaginary numbers. 
*/ #include "stdio.h* #include "math.h" #include "defs.h" #include <sys/types.h> iinclude <sys/times.h> tinclude <mpproto.h> #include <sys/time.h> tinclude /* * main () * inputs: argc, argv * outputs: none V main (argc, argv) int argc; char *argv[],- ( /* * Variables * * data_cube: input data cube * tl, t2 : holder for the time function return in seconds * diml, dim2, dim3: dimensions 1, 2, and 3 in the input data cube * loci, loc2, loc3: holders for the location of the maximum entry * temp: buffer for binary input integerdata * f_templ, f_temp2 : holders for float data * f_pwr: holder for float power data * i, j, k, n, offset, r: loop counters * max: float variable to hold the maximum of complex data * index_jnax, g_index_jnax: data structures with the location and power of the * largest element (locally and globally, respectively) * data_name, str_name: the names of the files containing the input data cube * and the vectors, respectively * totals: holders for DIM1 totals * global_totals: holderB for global DIM1 totals * logpoints: log base 2 of the number of points in the FFT * sum: a COMPLEX variable to hold the sum of complex data * points: number of points in the FFT * pix2: pi X 2 (2 X 3.14159) * pi: pi (3.14159) * x: storage for temporary FFT vector * w: storage for twiddle factors table * blklen: block length * rc: return code from an MPL call * taskid: identifier for this task * numtask: the number of tasks running this program * nbuf: a buffer to hold data returned by the mpc_task_query call * allgrp: an identifier for the group encompassing all tasks running this * program * source, dest, type, nbytes: used by point-to-point communications 2 0 1 * t COMPLEX data_cube[DIM2] (DIM1][DIM3] ; struct tms tl,t2 ; int diml = DIM1; int dim2 = DIM2; int dim3 - DIM3; int loci,loc2,loc3 ; unsigned int temp[l]; float f_templ, f_temp2 ; float f_pwr; int i, j, k, n, offset, r; float max; INDEXEDJ4AX indexjnax[1], g_index_max[II; char data_name[LINE_MAX]; char str_name[LINE_MAX]; FILE *fopen(); FILE *f_dop; float totals[DIM1]; float global_totals[DIM1 ] ; int logpoints; COMPLEX sum; int points = DOP_GEN; float pix2 , pi; static COMPLEX x[DOP_GENJ; static COMPLEX w[DOP_GEN]; int blklen; int rc; int taskid; int numtask; int nbuf[4]; int allgrp; int source, dest, type, nbytes; /* * Timing variables * I struct timeval tvO, tvl, tv2; struct tms all_start, all_end; float all_user, all_sys; float all_user_max, all_sys_max; float all_wall, task_wall; . 
struct tms disk_start, disk_end; float disk_user, disk_sys; float disk_user.jnax, disk_sys_max; float disk_wall, disk_wall_jnax; struct tms maxl_start, maxl_end; float maxl_user, maxl_sys; float maxl_user.jnax, maxl_sys_max; float maxl_wall, maxl_wall_max; struct tms max2_start, max2_end; float max2_user, max2_sys; float max2_user_max, max2_sys_max; float max2_wall, max2_wall_max; struct tms before_start, before_end; float before_user, before_sys; float before_user_max, before_sys_max; float before_wall, before_wall_max; struct tms sort_start, sort_end; float sort_user, sort_sys; float sort_userjnax, sort_sys_max; float sort_wall, sort_wall_max; 202 struct tms sort_total_start, sort_total_end; float sort_total_user, sort_total_sys; float sort_total_user_max, sort_total_sys_max; float sort_total_wall, sort_total_wall_jnax; struct tms index_start, index_end; float index_user, index_sys; float index_user_max, index_sys_max; float index_wall, index_wall_max; struct tms fftl_start, fftl_end; float fftl_user, fftl_sys; float f ftl_user.jnax, fftl_sys_max; float fftl_wail, fftl_wall_max; struct tms fft2_start, fft2_end; float fft2_user, fft2_sys: float fft2_user_max, fft2_sys_max; float fft2_wall, fft2_wall_max; struct tms fft3_start, fft3_end; float fft3_user, fft3_sys; float fft3_userjnax, fft3_sys_max; float fft3_wall, fft3_wall_max; struct tms after_start, after_end; float after_user, after_sys; float af ter_user_fnax, after_sys_jnax; float after_wall, after_wall_max; /* * Externally defined functions. */ extern void cmd_line(>; extern void read_input_SORT_FFT( extern void bubble_sort(); extern void fft(); /* * Begin function body: main () * * Initialize for paralle processing: Here, each task or node determines its * task number (taskid) and the total number of tasks or nodes running this * program (numtask) by using the MPL call mpc_environ. Then, each task * determines the identifier for the group which encompasses all tasks or * nodes running this program. This identifier (allgrp) is used in collective * communication or aggregated computation operations, such as mpc_index. */ gett imeofday (&tvO, (struct t imeval* ) 0) /* before time */ rc = mpc_environ (&numtask, fctaskid); if (rc == -1) ( printf (“Error - unable to call mpc_environ.Vn“); exit (-1); } if (numtask 1= NN) < if (taskid == 0) ( printf (“Error - task number mismatch... check defs.h.\n“); exit (-1 ); ) 203 } rc = mpc_task_query (nbuf, 4, 3); if (rc = = -1 > < printf ("Error - unable to call mpc_task_query.\n*); exit (-1 ); } allgrp = nbuf[3]; if (taskid == 0 ) ( printf ("Running...\n*); ) gettimeofday(itv2, (struct timeval*)0 ); /* before time */ task_wall = (float) (tv2 .tv_sec - tv0 .tv_sec) * 1 0 0 0 00 0 + tv2 .tv_usec - tv0 .tv_usec; times (ta1l_start); t* * Get arguments from command line. In the sequential version of the program, * the following procedure was used to extract the number of times the main * computational body was to be repeated, and flags regarding the amount of * reporting to be done during and after the program was run. In this paralle * program, there are no command line arguments to be extracted except for the * name of the file containing the data cube. V cmd_line (argc, argv, str_name, data_name); /* * Read input files. In this section, each task loads its portion of the data * cube from the data file. 
*/ if (taskid == 0 ) { printf (" loading data...\n"); } mpc_sync (allgrp); times (&disk_start); gettimeofday (SLtvl, (struct timeval*)0 ); /* before time */ read_input_SORT_FFT (data_name, data_cube); gettimeofday(itv2, (struct timeval*)0); /* after time V disk_wall = (float) (tv2 .tv_sec - tvl.tv_sec) * 1000 00 0 + tv2 .tv_usec - tvl.tv_usec; times (idisk_end) , - /* * start SORT Steps. Find the element with the largest magnitude in this * task's slice of the data cube. */ if (taskid == 0 ) printf("Finding maximum COMPLEX entry in data_cube before FFT \n"); times (imaxl_start) , * gettimeofday(&tvl, (struct timeval*)0 ); /* before time */ max = 0 .0; for (i = 0 ; i < diml; i++) for (j = 0 ; j < dim2 ; j++) for (k =0; k < dim3; k++> ( 204 f_templ = data_cube[j](i][k].real; f_temp2 = data_cube[j]1i][k].imag; f_pwr = f_templ*f_templ + f_temp2’f_temp2; if (f_pwr > max) { max = f_pwr; locl = i; loc2=j; loc3=k; ) /* * printf("task %d finds max %f at %d %d %d\n’, * taskid, max, loci, loc2, loc3>; */ index_max[0 ]-value = max; index_max[0 ].loci = loci; index_max(0 ].loc2 = loc2 ; index_max[0].loc3 = loc3 + taskid * dim3; gettimeofday(itv2, (struct timeval'JO); /* after time V maxl_wall = (float) (tv2 .tv_sec - tvl.tv_sec) * 1 0 0 0 0 0 0 + tv2 .tv_usec - tvl.tv_usec; times(tmaxl_end); /* * Now aggregate to get the global maximum (the largest element in the entire * data cube). V mpc_sync (allgrp); times(&before_start); gettimeofday(ttvl, (struct timeval*)0 ); /* before time */ rc = mpc_reduce (index_max, g_index_max, sizeof (INDEXED_MAX), 0, xu_index_jnax, allgrp); if (rc == -1 ) ( printf ('Error - unable to call mpc_reduce.\n“) ; exit (-1 ); ) gettimeofday(ttv2, (struct timeval'IO); /* after time */ before_wall = (float) (tv2 .tv_sec - tvl.tv_sec) * 1 0 0 0 0 0 0 * tv2 .tv_usec - tvl.tv_usec; times(&before_end); if (taskid = = 0 ) ( max = g_indexjnax(0 ].value ; loci = g_index_max[0 ].loci ; loc2 - g_index_jnax 10 ] . loc2 ; loc3 = g_index_max[0].loc3 ; printf('TIME: finding max before FFT = %f user secs and %f sys secs\n', USERTIME(tl,t2), SYSTIME(t1,t2) ); printf('POWER of max COMPLEX entry before FFT = %f \n", max); printf('LOCATION of max entry before FFT = %d %d %d \n", loci,loc2,loc3); ) /* * SORT DIM1 of data_cube DIM2*DIM3/3 times and output DIM1 average values. * DIM3 must be divisible by 3 to give a whole number dimension. */ 205 if (taskid == 0 ) printf("Sort %d element vectors %d*%d times. Output %d averagesin", DIM1, DIM2, DIM3/3, DIM1 ); times (&sort_start); gettimeofday(&tvl, (struct timeval*)0 ); /* before time */ bubble_sort(data_cube, totals); gettimeofday(&tv2, (struct timeval*)0); /* after time */ sort_wall = (float) (tv2 .tv_sec - tvl,tv_sec) * 1000000 + tv2 .tv_usec - tvl. tv_usec times (fcsort_end); /* * Get global totals of bubble sort. 
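 */

/*
 * Annotation (not part of the original listing): the aggregation above relies
 * on the user-supplied combining function xu_index_max and on the INDEXED_MAX
 * type from the General defs.h, neither of which is reproduced at this point.
 * The sketch below shows only the element-wise comparison such a combiner has
 * to perform; the type and function names are illustrative, and the exact
 * argument convention MPL expects for user reduction functions is not shown
 * here.
 */
typedef struct
{
    float value;        /* power of the largest element found so far */
    int   loc1;         /* its location in the data cube             */
    int   loc2;
    int   loc3;
} INDEXED_MAX_SKETCH;

/* Keep whichever record carries the larger power; on a tie the value
   already held is kept, matching the strict comparison in the search loop
   above. */
static void indexed_max_combine (INDEXED_MAX_SKETCH *in, INDEXED_MAX_SKETCH *inout)
{
    if (in->value > inout->value)
    {
        *inout = *in;
    }
}

/*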
*/ mpc_sync (a 1lgrp); times (tsort_total_start) , - gettimeofday (&tvl, (struct timeval*)0); /* before time V rc = mpc_reduce (totals, global_totals, diml*sizeof(float), 0, s_vadd, allgrp); if (rc == -1) ( printf ("Error - unable to call mpc_reduce.\n"); exit (-1); ) gettimeofday(&tv2 , (struct timeval*)0); /* after time */ sort_total_wall = (float) (tv2 .tv_sec - tvl.tv_sec) * 1 0 0 0 00 0 ♦ tv2 .tv_usec - tvl.tv_usec; gettimeofday(itv2 , (struct timeval*)0 ); /* after time */ all_wall = (float) (tv2.tv_sec - tv0 .tv_sec) * 1000000 + tv2 .tv_usec - tvO.tv_usec; times (&sort_total_end); times(&all_end); if (taskid==0) { for (i = 0; i < DIM1; i + + ) printf(" Average value # %d = »f\n", i, global_totals[i]/(dim2*dim3*NN/3)}; printf("TIME; Sorting %d elements vectors = *f user secs and %f sys secsin", DIM1, ITSERTIME (tl, t2 ), SYSTIME(tl, 12 ) ); ) /* * Compute all times. «/ all_user = (float) (all_end.tms_utime - all_start.tms_utime)/1 0 0.0 ; all_sys = (float)(all_end.tms_stime - all_start.tms_stime)/1 0 0.0 ; rc = mpc_reduce (tall_user, fcall_user_max, sizeof (float), 0 , s_vmax,allgrp); if (rc = = -1) ( printf ("Error mpc_reduce.\n"); exit (-1 ); ) rc = mpc_reduce (&all_sys, &all_sys_max, sizeof (float), 0 , s_vmax,allgrp); if (rc == -1) ( printf ("Error mpc_reduce.\n"); exit (-1); 206 } /* * Compute disk times. */ rc = mpc_reduce (&disk_wall, &disk_wall_max, sizeof{float), 0 , s_vmax, allgrp); if (rc == -1 ) ( printf ("Error mpc_reduce.\n*); exit (-1 ); } disk_user = (float) (disk_end.tms_utime - disk_start.tms_utime)/1 0 0.0 ; disk_sys = (float) (disk_end.tms_stime disk_start.tms_stime)11 0 0.0 ; rc = mpc_reduce <fcdisk_user, &disk_user.jnax, sizeof (float), 0 , s_vmax, allgrp); if (rc == -1 ) ( printf ("Error mpc_reduce.\n"); exit ( D ; ) rc = mpc_reduce (6tdisk_sys, &disk_sys_max, sizeof (float), 0 , s_vmax,allgrp); if (rc == -1 ) { printf ("Error mpc_reduce.\n"); exit (-1 ) ; ) /* * Compute maxi times. */ rc = mpc_reduce (&maxl_wall, imaxl_walljnax, sizeof(float), 0 , s_vmax, allgrp); if (rc == -i) ( printf (“Error mpc_reduce.\n*); exit {-1 ); ) maxl_user = (float) (maxl_end.tms_utime - maxl_start.tms_utime)/1 0 0.0 ; maxl_sys = (float)(maxl_end.tms_stime - maxl_start.tms_stime)/1 0 0.0 ; rc = mpc_reduce ((.maxl_user, &maxl_user_max, sizeof (float), 0 , s_vmax, allgrp); if (rc == -1 ) ( printf ("Error mpc_reduce.\n"); exit (-1 ); } rc - mpc_reduce (4maxl_sys, (<maxl_sys_max, sizeof (float), 0, s_vmax,allgrp); if (rc = = -1 ) ( printf ("Error mpc_reduce.\n"); exit (-1 ); } /* * Compute before times. */ rc = mpc_reduce (tbefore_wall, tbefore_wall_max, sizeof(float), 0 , s_vmax, allgrp); if (rc == -1 ) ( printf ("Error mpc_reduce.\n"); exit {-1 ); ) before_user = (float) (before_end.tmB_utime - before_start.tms_utime)i 1 0 0.0 ; 207 before_sys = (float)(before_end.tms_stime - before_start.tms_stime)/1 0 0.0; rc = mpc_reduce (&before_user, tbefore_user_max, sizeof (float), 0 , s_vmax, allgrp); if (rc = = -1) ( printf ("Error mpc_reduce. \n") , * exit (-1 ) ; I rc = mpc_reduce (&before_sys, &before_sys_max, sizeof (float), 0 , s_vmax, allgrp); if (rc == -1 ) ( printf ("Error mpc_reduce.\n") ; exit (-1 ); 1 /* * Compute sort times. 
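 */

/*
 * Annotation (not part of the original listing): each timing block above, and
 * the sort-phase blocks that follow, repeats the same pattern of reducing one
 * per-task float to its maximum over all tasks with mpc_reduce () and the
 * s_vmax operator. A minimal sketch of that pattern wrapped as a helper is
 * shown below; the name reduce_max_float is illustrative only and does not
 * appear in the benchmark source.
 */
static float reduce_max_float (float local, int allgrp)
{
    float global = 0.0;

    /* Task 0 receives the maximum, exactly as in the blocks above. */
    if (mpc_reduce (&local, &global, sizeof (float), 0, s_vmax, allgrp) == -1)
    {
        printf ("Error mpc_reduce.\n");
        exit (-1);
    }
    return global;
}

/* For example, disk_user_max = reduce_max_float (disk_user, allgrp); would
   stand in for one of the blocks above, with the result meaningful on
   task 0. */

/*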
*/ rc = mpc_reduce (&sort_wall, tsort_wall_max, sizeof(float), 0, s_vtnax, allgrp); if (rc == -1 ) { printf ("Error mpc_reduce.\n") ; exit (-1 1; ) sort_user = (float) (sort_end.tms_utime - sort_start.tms_utime)/10 0.0 ; sort_sys = (float)(eort_end.tms_stime - sort_start.tms_etime)/1 0 0.0 ; rc = mpc_reduce (&sort_user, &sort_user_max, si2eof (float), 0 , s_vmax, allgrp); if (rc == -1 ) { printf ("Error mpc_reduce.\n"); exit (-1 1; ) rc = mpc_reduce (&sort_sys, &sort_sys_max, sizeof (float), 0, s_vmax,allgrp); if (rc = = -1 ) I printf ("Error mpc_reduce.\n") ; exit (-1) ; } /* * Compute sort_total times. */ rc = mpc_reduce (&sort_total_wal1, &sort_total_wall_max, sizeof(float), 0, s_vmax, allgrp); if (rc == -1 ) { printf ("Error mpc_reduce.\n■); exit (-1); ) sort_total_user = (float) (sort_total_end.tms_utime - sort_total_start.tms_utime)/1 0 0.0 ; sort_total_sys = (float)(sort_total_end.tms_stime - sort_total_start.tms_stime)/1 0 0.0 ; rc = mpc_reduce (isort_total_user, tsort_total_userjnax, sizeof (float), 0, a vmax,allgrp); if (rc == -1 ) { printf ("Error mpc_reduce.\n"); exit (-1); ) rc = mpc_reduce (&sort_total_sys, tsort_total_sys_max, sizeof (float), 0 , s_vmax, allgrp) , • 208 if (rc == -1) < printf ("Error mpc_reduce.\n"); exit (-1); ) /* * Display timing information. */ if (taskid == 0 ) { printf (' printf ("\n\n*** CPU Timing information - numtask = %d\n\n", NN) ; printf (■ all_user.jnax = %.2f s, all_sys_max = %.2f s\n", all_user_max, all_sys_max); disk_user_max = %.2f s, disk_sys_max = %.2f s\n", disk_user_max, disk_sys_max); maxl_user_max = %.2f s, maxl_sys_max = %.2 f s\n‘, maxl_user_max, maxl_sys_max); * before_userjnax = %.2f s, before_sys_max - %.2f s\n", before_user_max, before_sys_max); " sort_user_max = %.2f s, sort_sysjnax = %.2f s\n", sort_user.jnax, sort_sys_max); sort_total_user_max = %.2f s, sort_total_sys_max = %.2f pr pr pr pr pr ntf ( 1 ntf ntf ntf s\n*, sort_total_user_max, sort_total_sys_max); ntf (*\n*** Wall Clock Timing information - numtask = %d\n\n", NN); ) printf printf printf printf printf printf printf (■ all_wall (* task_wall (* disk_wall (* maxl_wall (" before_wall (■ sort_wall (• sort_total_wall %.0f us\n", all_wall); = %.0f us\n", task_wall); = %.0f us\n", disk_wall_max) ; = %.0f us\n*, maxl_wall_max): = %.0f us\n", before_wall_max) , * = %.0f us\n*, sort_wall_max); = %.0f us\n', sort_total_wall_max); exit(0 ); C.1.2 bench m ark SORT FFT.c bench_mark_SORT_FFT.c / Parallel General Benchmark Program for the IBM SP2 This parallel General benchmark program was written for the IBM SP2 by the STAP benchmark parallelization team at the University of Southern California (Prof. Kai Hwang, Dr, Zhiwei Xu, and Masahiro Arakawa), as part of the ARPA Mountaintop program. The sequential General benchmark program was originally written by Tony Adams on 8/30/93. This file contains the procedure main (), and represents the body of the General Sorting and FFT subprograms. This subprogram's overall structure is as follows: 1. Load data in parallel. 2. Each task searches its slice of the data cube for the largest element. 209 3. Perform a reduction operation to find the largest element in the entire data cube. 4. Each task sorts its slice of the data cube along the DIM1 dimension. 5. collect the totals calculated in the bubble sort operation using a reduction operation. 6 . Perform FFTs along the D1M1 and DIM2 dimensions. 7. Redistribute the data cube using a total exchange operation. 8 . Perform FFTs along the DXM3 dimension. 9. 
Each task searches its slice of the data cube for the largest element after the FFTs. 10. Perform a reduction operation to find the largest element in the entire data cube. This program can be compiled by calling the shell script compile_sort_fft. Because of the nature of the program, before the program is compiled, the header file defs.h must be adjusted so that NN is set to the number of nodes on which the program will be run. We have provided a defs.h file for each power of 2 node size from 1 to 2S6, called defs.001 to defs.256. This program can be run by calling the shell script run.###, where ### is the number of nodes on which the program is being run. Unlike the original sequential version of this program, the parallel General program does not support command line arguments specifying the number of times the program body should be executed, nor whether or not timing information should be displayed. The program body will be executed once, and the timing information will be displayed at the end of execution. The input data file on disk is stored in a packed format: Each complex number is stored as a 32-bit binary integer: the 16 LSBs are the real half of the number, and the 16 MSBs are the imaginary half of the number. These numbers are converted into floating point numbers usable by this benchmark program as they are loaded from disk. The steering vectors file has the data stored in a different fashion. All data in this file is stored in ASCII format. The first two numbers in this file are the number of PRIs in the data set and the threshold level, respectively. Then, the remaining data are the steering vectors, with alternating real and imaginary numbers. / •include 'stdio.h' •include "math.h" •include "defs.h" •include <sys/types.h> •include <sys/times.h> •include <mpproto.h> •include <sys/time.h> •include /* * main () * inputs: argc, argv * outputs: none V main (argc, argv) int argc; char *argv[); ( /* * Variables * * data_cube: input data cube * doppler_cube: Doppler data cube * data_vector: used in the index step * tl, t2 : holders for the time function return in seconds 2 1 0 * diml, dim2, dim3: dimensions 1, 2, and 3 for the input data cube. 
* loci, loc2, loc3: holders for the location of the largest element * temp: buffer for binary input integer data * f_templ, f_temp2 : holders for float data * f_pwr: holder for float power data * i, j, k, n, offset, r: loop counters * max: float variable to hold the maximum of complex data * index_jnax, g_indexjnax: data structures holding the location and power of * the largest elements (local and global, repsectively) * data_name, str_name: names of the files containing the input data file and * the vectors, respectively * totals: holders for DIM1 totals * global_totals: holders for global D1M1 totals * logpoints: log base 2 of the number of points in the FFT * sum: a COMPLEX variable to hold the sum of complex data * points: the number of points in the FFT * pix2: pi times 2 (2 x 3.14159) * pi: pi (3.14159) * x: storage for temporary FFT vector * w: storage for the twiddle factors table * blklen: block length; used for MPL calls * rc: return code from an MPL call * taskid: the identifier for this task * numask: the number of tasks running this program * nbuf: buffer to hold data returned by the mpc_task_query call * allgrp: the identifier for the group encompassing all the tasks running * this program * source, dest, type, nbytes: used by MPL calls V COMPLEX data_cube[DIM2]IDIMl][DIM3]; COMPLEX doppler_cube[DIM1][DIM2/NN][DOP_GEN]; COMPLEX data_vector[DIM2 *DIM1*DIM3]; struct tms tl,t2; int diml = DIM1; int dim2 = DIM2; int dim3 = DIM3; int loci, loc2, loc3 , - unsigned int temp[l]; float f_templ, f_temp2 ; float f_pwr; int i, j, k, n, offset, r; float max; INDEXED_MAX index_max[1], g_index_max[1] ; char data_name(LINE_MAX); char str_nameiLINE_MAX]; FILE ‘fopen(); FILE *f_dop; float totals[DIM1]; float global_totals[DIMl]; int logpoints; COMPLEX sum; int points = DOP_GEN; float pix2 , pi; Static COMPLEX X[DOP_GEN|; static COMPLEX w[DOP_GEN); int blklen; int rc; int taskid; int numtask; int nbuf{4]; int allgrp; int source, dest, type, nbytes; /* * Timing variables. 
*/ 2 1 1 struct timeval tvO, tvl, tv2,- struct tms all_start, all_end; float all_user, all_sys; float all_user_max, all_sys_max; float all_wall, task_wall; struct tms disk_start, disk_end; float disk_ueer, disk_sys; float dlsk_user_max, disk_sys_max; float disk_wall, disk_wall_max; struct tms maxl_start, maxl_end; float maxl_user, maxl_sys; float maxl_user_max, maxl_sys_max; float maxl_wall, maxl_wall_max; struct tms max2_start, max2_end; float max2_user, max2_sys; float max2_user_max, max2_sys_max; float max2_wall, max2_wall_max; struct tms before_start, before_end; float before_user, before_sys; float before_user_max, before_sys_max; float before_wall, before_wall_max; struct tms sort_start, sort_end; float sort_user, sort_sys; float sort_userjnax, sort_sys_max; float sort_wall, sort_wall_max; struct tms sort_total_start, sort_totai_end; float sort_total_user, sort_total_sys; float sort_total_user_max, sort_total_sys_jnax float sort_total_wal1, sort_total_wall_max; struct tms lndex_start, index_end; float index_user, index_sys; float index_user_max, index_sys_max; float index_wall, index_wall_max; struct tms fftl_start, fftl_end; float fftl_user, fftl_sys; float fftl_user_max, fftl_sys_max; float fftl_wall, fftl_walljnax; struct tms fft2_start, fft2_end; float fft2_user, fft2_sys; float fft2_user_max, fft2_sys_max; float fft2_wall, fft2_wall_max; struct tms fft3_start, fft3_end; float fft3_user, fft3_sys; float fft3_user_max, fft3_sys_max; float fft3_wall, fft3_wall_max; struct tms afterestart, after_end; float after_user, after_sys; float after_user_max, after_sysjnax; float after_wall, after_wall_jnax; Externally defined functions. / extern void cmd_line<); extern void read_input_SORT_FFT(); extern void bubble_sort() ; extern void fft(); /* * Begin function body: main () • * Initialize for parallel processing: Here, each task or node determines its * task number (taskid) and the total number of tasks or nodes running this * program (numtask) by using the MPL call mpc_environ. Then, each task * determines the identifier for the group which encompasses all tasks or * nodes running this program. This identifier (allgrp) is used in collective * communication or aggregated computation operations, such as mpc_index. */ gettimeofday(&tvO, (struct timeval*)0 ); /* before time */ rc - mpc_environ (tnumtask, staskid); if (rc = = -1 } ( printf (“Error - unable to call mpc_environ.\n" >; exit ( -1 ) ; } if (numtask != NN) { if (taskid == 0 ) ( printf ("Error - task number mismatch... check defs.h.in"); ex i t (-1 ); ) ) rc = mpc task_query (nbuf, 4, 3); if (rc == -1 ) ( printr ("Error - unable to call mpc_task_query.\n"); exit (-1 ); ) allgrp = nbuf[3]; if (taskid == 0 ) ( printf ("Running...\n"); } gettimeofday(&tv2 , (struct timeval*)0 ); /* before time V task_wall = (float) (tv2 .tv_sec - tv0 .tv_sec> * 1 0 0 0 0 0 0 + tv2 .tv_usec - tv0 .tv_usecj times (&all_start); /* * Get arguments from command line. In the sequential version of the program, * the following procedure was used to extract the number of times the main * computational body was to be repeated, and flags regarding the amount of * reporting to be done during and after the program was run. In this paralle * program, there are no command line arguments to be extracted except for the * name of the file containing the data cube. */ cmd_line (argc, argv, str_name, data_name); /* * Read input files. In this section, each task loads its portion of the data * cube from the data file. 
V i f ( t a s k i d == 0) 213 ( printf (■ loading data...\n"); } mpc_sync (allgrp); times (&disk_start); gettimeofdayl&tvl, (struct timeval*)0); /* before time */ read_input_SORT_FFT (data_name, data_cube); gettimeofday(&tv2 , (struct timeval*)0 ); /* after time */ disk_wall = (float) (tv2 .tv_sec - tvl.tv_sec) * 1000000 + tv2 .tv_usec - tvl.tv_usec; times (&disk_end); /* * Start SORT Steps. Find the element with the largest magnitude in this task's * slice of the data cube. V /* * if (taskid == 0 ) * printf("Finding maximum COMPLEX entry in data_cube before FFT \n"); *i times (&maxl_start); gettimeofday(itvl, (struct timeval*)0|; /* before time */ max = 0 .0; for (i = 0 ; i < diml; i + + ) for (j = 0 ; j < dim2 ; j++) for (k = 0; k < dim3; k++) ( f_templ = data_cube[j][i][k].real; f_temp2 - data_cube[j][ijIk).imag; f_pwr - f_templ*f_templ + f_temp2*f_temp2 ; if (f_pwr > max) ( max = f_pwr; locl= i; loc2=j; loc3=k; ) ) /* * printf("task %d finds max %f at %d %d %d\n", taskid, max, loci, loc2 , loc3); */ index_max[0].value = max ; index_max[0}.loci = loci ; index_max[0].loc2 = loc2 ; index_max[0].loc3 = loc3 + taskid * dim3 ; gettimeofday(&tv2, (struct timeval*)0); /* after time V maxl_wall = (float) (tv2 .tv_sec - tvl.tv_sec) * 1000000 + tv2 .tv_usec - tvl.tv_usec; times(wnaxl_end); /* * Now aggregate to get the global maximum (the largest element in the entire * data cube). */ mpc_sync (allgrp); times(&before_start); gettimeofday(fctvl, (struct timeval*)0); /* before time •/ rc = mpc_reduce (indexjnax, g_lndex_max, sizeof (INDEXED_JttX), 0, 214 xu_index_jnax, allgrp); if (rc == -1) ( printf ('Error - unable to call mpc_reduce.\n*); exit (-1 ); ) gettimeofday<&tv2 , (struct timeval*)0 ); /* after time */ before_wall = (float) (tv2 .tv_sec - tvl.tv_sec) * 1 0 0 0 0 0 0 ♦ tv2 .tv_usec - tvl.tv_usec; times(fcbefore_end) ; if (taskid == 0 ) ( max = g_index_max[0J.value ; loci = g_irdexH jnax[0 ) . loci ; loc2 = g_index_max[0 ].loc2 ; loc3 = g_index__max[0) ■ loc3 ; printf('TIME: finding max before FFT = %f user secs and %f sys secs\n*, USERTIME(tl,t2>, SYSTIME(t1,t2) ); printf('POWER of max COMPLEX entry before FFT = %f \n*, max); printf('LOCATION of max entry before FFT = %d %d %d \n', loci,loc2,loc3); > /■* * SORT DIM1 of data_cube DIM2*DIM3/3 tiroes and output DIM1 average values. * DIM3 must be divisible by 3 to give whole number dimension. V if (taskid == 0 ) printf(‘Sort %d element vectors »d*%d times. Output %d averagesNn", DIM1, DIM2, DIM3/3, DIM1 ); times (&sort_start); gettimeofday(ttvl, (struct timeval*)0); /* before time */ bubble_sort(data_cube, totals); gettimeofday<itv2 , (struct timeval*)0 ); /* after time */ sort_wall = (float) (tv2 .tv_sec - tvl.tv_sec) * 1 0 0 0 0 0 0 + tv2 .tv_usec - tvl.tv_usec; times (itsort_end) ; /* * Get global totals of bubble sort. */ mpc_sync (allgrp); times (tsort_total_Btart); gettimeofday(&tvl, (struct timeval*)0 ); /* before time */ rc = mpc_reduce (totals, global_totals, diml*sizeof(float), 0 , s_vadd, allgrp); if (rc == -1) ( printf ('Error - unable to call mpc_reduce.\n'); exit (-1 ); ) gettimeofday(ttv2 , (struct timeval*)0 ); /* after time */ sort_total_wall = (float) (tv2 .tv_sec - tvl.tv_sec) * 1 0 0 0 0 0 0 + tv2 .tv_usec - tvl,tv_usec; ~ times (&sort_total_end); 215 if (taskid==0 ) < for (i = 0; i < DIM1; i + + ) printf(■ Average value * »d = %f\n", i, global_totalsti) / (dim2*diro3*NN/3 ) ) ; printf(‘TIME: Sorting %d elements vectors = %f user secs and %f sys secsVn’, DIM1,USERTIME 111,t2), SYSTIME(tl,t2) ); ) /* * Start FFT steps. 
* * Generate twiddle factors table w. V pi = 3.14159265358979; pix2 = 2 . 0 * pi; for (i = 0 ; i < points; i + + ) { wii].imag = - sin(pix2 * (float)i / (float)points); wiij.real - cos(pix2 * (float)i / (float)points); } /* * 1st FFT: Perform dim2*dim3 number of diml points FFTs. */ logpoints = log2 ((float) diml ) + 0 .1 ; if (taskid~0 ) printf("Computing %d*%d %d point FFTs in', dim2, dim3*NN, diml); times(ifftl_start); gettimeofday(itvl, (struct timeval*)0 ); /* before time */ for (j = 0 ; j < dim2 ; j++) for (k = 0 ; k < dim3; k++) { for (i = 0 ; i < diml; i++) ( x[i].real = data_cube[j][i](k].real; xlij.imag = data_cube[jj [i](k].imag; ) fft (x, w, diml, logpoints); for (i = 0 ; i < diml; i++) < data_cubelj][i][k].real = x[i].real ; data_cubetj][i]tk].imag = x[ij.imag ; ) ) gettimeofday (itv2 , (struct t itneval * ) 0 > ; /* after time */ fftl_wall = (float) (tv2 .tv_sec - tvl.tv_sec) * 1 0 0 0 0 0 0 + tv2 .tv_usec - tvl.tv_usec; times (if ftl_end) /* * 2nd FFT: Perform diml*dim3 number of dim2 points FFTs. V logpoints = log2 ((float) dim2 ) + 0 .1; if (taskid==0 ) printf('Computing %d*%d %d point FFTs in", diml, dim3*NN, dim2); times(i fft2_start); gettimeofday(itvl, (struct timeval*)0 ); /* before time */ 216 for (i = 0 ; i < diml; i++) for (k = 0 ; k < dim3; k++) ( for (j = 0 ; j < dim2 ; j + + ) < x[j].real = data_cube[j][i)[k].real; xijj.imag = data_cube(j][i][kj.imag; } fft (x, w, dim2 , logpoints); for (j = 0 ; j < dim2 ; j + +) < data_cubeljJ[i][k].real = x[j].real ; data_cube[j)[ij[k].imag = x[j].imag ; ) ) gettimeofday(ttv2 , (struct timeval*)0 ); /* after time */ fft2_wall - (float) (tv2 .tv_sec - tvl.tv_sec) * 1 0 0 0 0 0 0 + tv2 .tv_usec - tvl.tv_usec; times(ifft2_end); /* * Index data_cube to doppler_cube. This total exchange operation is needed * so that each task has all the elements along the DIM3 dimension (the * dimension along which the last set of FFTs is performed). */ if (taskid == 0 ) { printf (* indexing data cube...\n"); ) mpc_sync (allgrp); times (iindexestart); gettimeofday(itvl, (struct timeval*)0); /* before time */ rc = mpc_index (data_cube, data_vector, DIM1 * DIM2 * DIM3 * sizeof (COMPLEX) / NN, allgrp); if (rc == -1 ) < printf (“Error - unable to call mpc_lndex.\n“); exit (-1 ); ) if (taskid == 0 ) ( printf (" rewinding data cube...\n*>; ) / offset = 0 ; for (n = 0; n < NN; n++) for (j= 0 ; j < DIM2/NN; j++) for (i = 0; i < DIM1 ; i++) for (k = n * DIM3; k < (n + 1) * DIM3 ; k++) ( doppler_cube[i][j 1 [k).real = data_vector[offset].real; doppler_cube[i][jj[kj.imag = data_vector[offset].imag; of fset + + ; ) gettimeofday(itv2 , (struct timeval*)0 ); /* after time */ index_wall = (float) (tv2 .tv_sec - tvl.tv_sec) * 1 0 0 0 0 0 0 + tv2 .tv_usec - tvl.tv_usec; times (&index_end); 217 * 3rd FFT: Perform diml*dlm2 number of DOP„GEN points FFTs. 
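 */

/*
 * Annotation (not part of the original listing): mpc_index () above delivers
 * into data_vector one contiguous block of DIM1*(DIM2/NN)*DIM3 complex
 * samples from each source task n, ordered j-major, then i, then k within
 * the block, and the rewind loop walks data_vector in exactly that order.
 * The helper below computes the same offset directly, with k being the
 * range-gate index within source task n's DIM3-long block; the name
 * dv_offset is illustrative only.
 */
static int dv_offset (int n, int j, int i, int k)
{
    return ((n * (DIM2 / NN) + j) * DIM1 + i) * DIM3 + k;
}

/* Equivalently, doppler_cube[i][j][n*DIM3 + k] = data_vector[dv_offset (n, j, i, k)],
   which is what the running offset counter in the rewind loop produces before
   the third set of FFTs below. */

/*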
*/ dim2 = dim2/NN; dim3 = dim3*NN ; logpoints = log2((float) DOP_GEN ) + 0.1; if <taskid==0) printf("Computing %dMd %d point FFTs \n‘, diml, dim2*NN, DOP_GEN); times(ifft3_start); gettimeofday(itvl, (struct timeval*)0); /* before time */ for (i = 0; i < diml; i++) for (j = 0 ; j < dim2 ; j++) I for (k = 0; k < dim3; k + + ) ( x[k].real = doppler_cube[ii[j][k].real; xjkj.imag = doppier_cube[iJ[j)[kj.imag; ) for (k = dim3; k < DOP_GEN; k++) ( x(k].real = 0 .0 ; x[k].imag = 0 .0 ; } fft (x, w, DOP_GEN, logpoints); for (k = 0; k < D0P_GEN; k++) ( doppler_cube[iI[j 1[k].real = x(k].real ; doppler_cube[ij [jj[kj .imag = xjkj.imag ; > ) gettimeofday(itv2 , (struct timeval*)0 ); /* after time */ fft3_wall ^ (float) (tv2.tv_sec - tvl.tv_sec) * 1000000 + tv2.tv_usec - tvl.tv_usec; times(ifft3_end); if (taskid - = 0 ) printf(’Finding maximum COMPLEX entry in data_cube after FFT \n*); times (tmax2_start); gettimeofday(itvl, (struct timeval*)0); /* before time */ max = 0 .0; /* * Find the element with the largest magnitude in each task's slice of the * data cube after the FFTs. */ for (i = 0 ; i < diml; i++) for (j = 0 ; j < dim2 ; j++) for (k = 0; k < DOP_GEN; k++) ( f_templ = doppler_cube[i][j][k].real; f_temp2 = doppler_cube[ij (jj [kj .imag; f_pwr = f_templ*f_teropl + f_temp2*f_temp2; if (f_pwr > max) ( max = f_pwr; 1OC1= i; loc2=j; loc3=k; ) > gettimeofday(itv2, (struct timeval*)0 ); /* after time */ 218 max2_wall = (float) (tv2 .tv_sec - tvl.tv_sec) * 1000000 + tv2 .tv_usec - tvl.tv_usec; times(imax2_end); /* * printf("task Id finds max If at Id Id ld\n", * taskid, max, loci, loc2, loc3); V index_jnax[0] .value = max; index_max[0].loci = loci; index_max[0 ].loc2 = loc2 + taskid * dim2 ; index_max[0],loc3 = loc3; /* * Now aggregate to get the global maximum (the element with the largest * magnitude in the entire data cube after the FFTs). */ mpc_sync (allgrp); times (tafter_start) ; gettimeofday(ttvl, (struct timeval*)0); /* before time */ rc = mpc_reduce (index_jnax, g_index_max, sizeof (INDEXED_MAX), 0, xu_index_max, allgrp); if (rc -- -1) ( printf ("Error - unable to call mpc_reduce.\n"); exit ( -1); ) gettimeofday|&tv2, (struct timeval*)0 ); /* after time */ after_wall = (float) (tv2 .tv_sec - tvl.tv_sec) * 1000000 + tv2 .tv_usec - tvl.tv_usec; all_wall = (float) (tv2 .tv_sec - tv0 .tv_sec) * 1 000000 + tv2 .tv_usec - tvO.tv_usec; times(iafter_end); times (Stall_end) ; /* * Compute all times. */ all_user = (float) (all_end.tms_utime - all_start.tms_utime)/1 0 0.0 ; all_sys = (float)|all_end.tms_stime - all_start.tms_stime)/1 0 0.0 ; rc = mpc_reduce (tall_user, tall.user^max, sizeof (float), 0, s_vmax,allgrp); if (rc -= -1 ) ( printf ("Error mpc_reduce.\n"); exit (-1 ); ) rc = mpc_reduce (&all_sys, &all_sys_max, sizeof (float), 0 , s_vmax,allgrp); if (rc == -1 ) ( printf ("Error mpc_reduce.\n"); exit {-1 ); ) i* * Compute disk times. */ rc = mpc_reduce (fcdisk_wall, &disk_wall_max, sizeof(float), 0 , s_vmax, allgrp); if (rc == -1 ) { printf ("Error mpc_reduce.\n") ; 219 exit (-1); } disk_user = (float) (disk_end.tms_utime - disk_start.tms_utime)/1 0 0.0 ; \ disk_ays = (float)(disk_end.tms_stime - disk_start.tms_stime>/1 0 0.0 ; \ rc = mpc_reduce (&disk_user, &disk_user_max, sizeof (float), 0 , s_vmax, allgrp); if (rc == -1 ) ( printf {"Error mpc_reduce.\n"); exit (-1 ); } rc = mpc_reduce (fcdisk_sys, &disk_sys_max, sizeof (float), 0, s_vmax, al lgrp) ; if (rc == -1) ( printf ("Error mpc_reduce.\n"); exit <-1>; ) /* * Compute maxi times. 
V rc = mpc_reduce (tmaxl_wall, &maxl_wall^max, sizeof(float), 0, s_vmax, allgrp); if (rc == -1) { printf ("Error mpc_reduce.\n"); exit (-1) ; ) maxl_user = (float) (maxl_end.tms_utime - maxl_start.tms_utime)/1 0 0.0 ; maxl_sys = (float)(maxl_end,tms_stime - maxl_start,tms_stime)/1 0 0.0; rc = mpc_reduce (tmaxl_user, &maxl_user_max, sizeof (float), 0 , s_vmax, allgrp) ; if (rc = = -1) < printf ("Error mpc_reduce.\n"); exit (-1); ) rc = mpc_reduce (&maxl_sys, imaxl_sys_max, sizeof (float), 0, s_vmax,allgrp); if (rc == -1) { printf ("Error mpc_reduce.\n*); exit (-1); ) /• * Compute before times. */ rc = mpc_reduce <&before_wall, &before_wal1 jnax, sizeof(float), 0, s_vmax, allgrp); if (rc == *1) I printf ("Error mpc_reduce.\n*) ; exit (-1 ); ) before_user = (float) (before_end.tms_utime - before_start.tms_utime)/1 0 0.0 ; before_sys = (float)(before_end.tms_stime - before_start.tms_stime)/1 0 0.0 j rc = mpc_reduce (&before_user, &before_user_max, sizeof (float), 0, s_vmax, allgrp); if (rc == -1) ( printf ("Error mpc_reduce,Sn"); exit (-l); ) rc = mpc_reduce (fcbefore_sys, s.before_sys_max, sizeof (float), 0 , s_vmax, allgrp); i f ( r c == -1) 220 { p r i n t f ( “E r r o r m p c _ r e d u c e . \ n * ); e x i t ( - 1 ) ; } /* * Compute sort times. */ rc = mpc_reduce (isort_wall, &sort_wall_jnax, sizeof(float), 0 , s_vmax, allgrp); if (rc == -1) ( printf (“Error mpc_reduce.\n“); exit (-1); } sort_user = (float) (sort_end.tms_utime - sort_start.tms_utime)/1 0 0.0 ; sort_sys = (float)(sort_end.tms_stime * sort_start.tms_stime)/1 0 0.0 ; rc = mpc_reduce (isort_user, &sort_user_max, sizeof (float), 0 , s_vmax, allgrp); if (rc == -1) ( printf ("Error mpc_reduce.in'); exit (-1 ); ) rc = mpc_reduce (ScSort_sys, &sort_sys_max, sizeof (float), 0, s_vmax,allgrp); if (rc == -1) I printf (“Error mpc_reduce.\n*); exit (-1 ); ) /* * compute sort_total times. V rc = mpc_reduce (&eort_total_wall, tsort_total_wall_max, sizeof(float), 0, s_vmax, allgrp); if (rc == -1 ) < printf (“Error mpc_reduce.\n“); exit (-1 ); ) sort_total_user = (float) (sort_total_end.tms_utime - sort_total_start.tms_utime)/1 0 0.0; sort_total_sys = (float)(sort_total_end.tms_stime - sort_total_start.tms_st ime)/1 0 0.0 ; rc = mpc_reduce (&sort_total_user, tsort_total_user_)nax, sizeof (float), 0, s vmax, allgrp) ; if (rc == -1 ) ( printf (“Error mpc_reduce.\n“ ) ; exit (-1 ) ; ) rc = mpc_reduce (&sort_total_sys, &sort_total_sys_max, sizeof (float), 0 , s_vmax, allgrp) , - if (rc == -1 ) { printf (“Error mpc_reduce.\n* ) ; exit (-1 ); ) /* * Compute fftl times. */ rc = mpc_reduce (&fftl_wall, fcfftl_wall_piax, si2eof(float), 0 , s_vmax, allgrp) , - i f ( r c == -1) 221 { p r i n t f ( " E r r o r m p c _ r e d u c e . \ n " ); e x i t ( - 1 ) ; > fftl_user = (float) (fftl_end.tms_utime - fftl_start.tms_utime)/1 0 0.0 ; fftlZsys = (float)(fftl_end.tms_stime - fftl_start.tms_stime)/1 0 0.0; rc = mpc_reduce (ifftl_user, ifftl_user_max, sizeof (float), 0 , s_vmax, allgrp); if (rc == -1) ( printf ("Error mpc_reduce.\n"); exit I -1 ); } rc = mpc_reduce (ifftl_sys, ifftl_sys_max, sizeof (float), 0, s_vmax,allgrp); if (rc == -1 ) ( printf ("Error mpc_reduce.\n"); exit (-1 ); } /* * Compute fft2 times. 
*/ rc = mpc_reduce (ifft2_wall, ifft2_wall_max, sizeof{float), 0, s_vmax, allgrp); if (rc == -1 ) { printf ("Error mpc_reduce.\n"); exit (-1) ; ) fft2_user = (float) (fft2_end.tms_utime - fft2_start.tms_utime)/1 0 0.0 ; fft2_sys = (float)(fft2_end.tms_stime - fft2_start.tms_stime)/1 0 0.0 ; rc = mpc_reduce (ifft2_user, ifft2_user_max, sizeof (float), 0, s_vmax, allgrp) , - if (rc == -1 ) ( printf ("Error mpc_reduce.\n"); ex i t ( -1) ; ) rc - mpc_reduce (ifft2_sys, ifft2_sys_max, sizeof (float), 0, s_vmax,allgrp); if (rc == -1) t printf ("Error mpc_reduce.\n"); ex i t (-1) ; ) /* * Compute fft3 times. V rc = mpc_reduce (ifft3_wall, ifft3_wall_max, sizeof(float), 0, s_vmax, allgrp); if (rc == -1) ( printf ("Error mpc_reduce.\n")j exit (-1); ) fft3_user = (float) (fft3_end.tms_utime - fft3_start.tms_utime)/100.0; fft3_sys = (float)(fft3_end.tms_stime - fft3_start.tms_stime)/100.0; rc = mpc_reduce (ifft3_user, ifft3_user_max, sizeof (float), 0, s_vmax, allgrp); if (rc == -1 ) ( printf ("Error mpc_reduce.\n*); exit ( -1) ; ) rc = mpc_reduce (ifft3_sys, ifft3_sys_max, sizeof (float), 0, s_vmax,allgrp); 222 if (rc = = -1 ) ( p r i n t f ( " E r r o r m p c _ r e d u c e . \ n " ); e x i t ( - 1 ); I /■ * Compute index times. */ rc = mpc_reduce (iindex_wal1, &index_wall_max, sizeof(float), 0 , s_vmax, allgrp); if (rc = = -1) ( printf ("Error mpc_reduce.\n"); exit (-1 }; ) index_user = (float) (index_end.tms_utime - index^start.tms_utime)/1 0 0.0 ; index_sys = (float>(index_end.tms_stime - index_start.tms_stime)/1 0 0,0 ; rc = mpc_reduce (&index_user, &index_user_max, sizeof (float), 0, s_vmax, allgrp); if (rc == -1) ( printf ("Error mpc_reduce.\n"); exit (-1); } rc = mpc_reduce (iindex_sys, &index_sys_max, sizeof (float), 0 , s_vmax, allgrp), ■ if (rc == -1 ) { printf ("Error mpc_reduce.\n"); exi t (-1); ) /* * Compute max2 times, */ rc = mpc_reduce (&max2_wal1, &max2_wall_max, sizeof(float), 0 , s_vmax, allgrp); if (rc == -1) { printf ("Error mpc_reduce.\n"); exit (-1); ) max2_user = (float) (max2_end.tms_utime - max2_start.tms_utime)/1 0 0.0 ; max2_sys = (float)(max2_end.tms_stime - max2_start.tms_stime)/1 0 0.0 ; rc = mpc_reduce (Simax2_user, tmax2_userjnax, sizeof (float), 0, s_vmax, allgrp); if (rc = = -1) ( printf ("Error mpc_reduce.\n*) ; exit (-1 ); ) rc = mpc_reduce <&max2_sys, tmax2_sysjrax, sizeof (float), 0 , s_vmax,allgrp); if (rc == -1 ) ( printf ("Error mpc_reduce.\n"); exit (-1 ); ) /* * Compute after times. */ rc = mpc_reduce (iafter_wall, &after_wall_max, sizeof(float), 0 , s_vmax, allgrp); i f (r c == -1) 223 { p r i n t f ( " E r r o r m p c _ r e d u c e . i n " ); e x i t ( - 1 ) ; > after_user = (float) (after_end.tms_utime - after_start.tms_utime)/1 0 0,0 ; after_sys = (float)(after_end.tms_stime - after_start.tms_stime)/1 0 0.0 ; rc = mpc_reduce (&after_user, s,af ter_user_max, sizeof (float), 0 , s_vmax, aligrp); if (rc == -1 ) ( printf ("Error mpc_reduce.in"); exit (-1 ); I rc = mpc_reduce (&after_sys, &after_sys_max, sizeof (float), 0 , s_vmax, allgrp); if (rc == -1) ( printf ("Error mpc_reduce.in") ; exit (-1 ); /* * Report results. */ if (taskid == 0 ) ( max = g_index_max[0].value; loci = g_index,jnax [0] . loci; loc2 = g_index_piax[0] ,loc2 ; loc3 = g_index_jnax [0] . loc3 ; printf("TIME: finding max after FFT - %f user secs and %f sys secsin", USERTIME(t1,t2), SYST1ME(tl,t2) ); printf("POWER of max COMPLEX entry after FFT = \n", max); printf("LOCATION of max entry after FFT = %d %d %d in", loci,loc2,loc3); t* * Display timing information. 
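All of the figures reported in the section that follows come from two clocks: times () gives per-process user and system CPU time in ticks, divided by CLK_TICK (100 on AIX, per defs.h), while gettimeofday () gives wall-clock time, differenced into microseconds. A standalone illustration of both computations is sketched below; the dummy work loop exists only to consume time, and on systems other than AIX the tick rate should come from sysconf (_SC_CLK_TCK).

/* timing_demo.c - standalone illustration of the two timing computations
 * used throughout this appendix: user/system CPU seconds from times (),
 * and wall-clock microseconds from gettimeofday ().  The CLK_TICK value
 * and the dummy work loop are illustrative. */
#include <stdio.h>
#include <sys/times.h>
#include <sys/time.h>

#define CLK_TICK 100.0     /* ticks per second on AIX, as in defs.h */

int main ()
{
    struct tms      t1, t2;
    struct timeval  tv1, tv2;
    volatile double x = 0.0;
    long   i;
    float  user, sys, wall;

    times (&t1);
    gettimeofday (&tv1, (struct timezone *) 0);    /* before time */
    for (i = 0; i < 5000000; i++)
        x += (double) i;                           /* dummy work */
    gettimeofday (&tv2, (struct timezone *) 0);    /* after time */
    times (&t2);

    user = (float) (t2.tms_utime - t1.tms_utime) / CLK_TICK;
    sys  = (float) (t2.tms_stime - t1.tms_stime) / CLK_TICK;
    wall = (float) (tv2.tv_sec - tv1.tv_sec) * 1000000
         + (float) (tv2.tv_usec - tv1.tv_usec);

    printf ("user = %.2f s, sys = %.2f s, wall = %.0f us\n", user, sys, wall);
    return 0;
}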
V if (taskid = = 0 ) ( sin", printf ("inin*** CPU Timing information - numtask = tdinin", NN) ; printf (’ all_user_max = %.2f s, all_sys_max = t.2f s\n“, all_user_max, all_sys_max); printf (" disk_user_jnax = %.2f s, disk_sys_max = %.2f s\n", disk_user_max, disk_sys_max); printf (* maxl_user_inax = %.2f s, maxl_sys_max = \.2l s\n", maxl_user_max, maxl_sys_max); printf (" before_user_max = %.2f s, before_sys_max = %.2f sin", before_user_max, before_sys_max); printf (* sort_user_piax = %.2f s, sort_sys_max = *.2f s\n", sort_user_max, sort_sys_max); printf (" sort_total_user_tnax = %.2f s, sort_total_sysjnax = %.2f sort_total_user_max, sort_total_sys_max); printf (* fftl_user_ftiax = %,2f s, fftl_sys_max = %.2f s\n", fftl_user_max, fftl_sys_max); printf (" fft2_user_max = %.2f s, fft2_sys_piax = %.2f sin", fft2_user_max, fft2_sys_max); printf (* fft3_user_piax = %.2f s, fft3_sys_max = %.2f sin", fft3_user_max, fft3_sys_max); printf (" index_userjnax = %.2f s, index_sysjnax = %.2f sin", index_user_piax, index_sys_max) ; printf t" max2_user_jnax = %.2f s, max2_sys_max = %.2f sin", max2_userjnax, max2_sys_max); printf (" after_ueer_max = %.2f s, after_sys_max = %.2 f sin". 224 after_user_max, after_sys_max) ] print print print print print print print print print print print print print print Wall Clock T allwall iming informat = %.0f us\n", ion - numtask = %d\n\n*, NN); a 1l_wa1 1); task_wal1 = %.0f us\n", task_wal1); disk_wall %.0f us\n", disk_wall_max); maxl_wal1 _ %.0f us\n*, maxl_wall_max); before_wall %.0f us\n*, before_wall_max); sort_wall %.0f us\n‘, sort_wall_max); sort total wall %.0f us\n‘, sort_total_wall_max fftl_wall - %.0f us\n*, fftl_wall_max); f ft2_wal1 = %.0f us\n*, fft2_wall_max); f ft3_wal1 = %.0f us\n", fft3_wall_max); index_wall - %.0f us\n", index_wall_max); max2_wall %.0f us\n*, max2_wall_max); after_wall = %.0f us\n", after_wall_max); exit(0 ) C.13 bubble_9ort.c /* * bubble_sort.c */ /* * This file contains the procedure bubble_sort (), and is part of the parallel * General benchmark program written for the IBM SP2 by the STAP benchmark * parallelization team at the University of Southern California (Prof. Kai * Hwang, Dr. Zhiwei Xu, and Masahiro Arakawa), as part of the ARPA Mountaintop * program, * * The sequential General benchmark program was originally written by Tony * Adams on 8/30/93. * * The bubble_sort () procedure contains the sort routine to do DIM2*DIM3/3 * number of DIM1 element bubble sorts and to output DIMt average values of the * DIM2*DIM3/3 sorts. DIM3 must be divisible by 3 to give whole number * dimensions. The power of the COMPLEX entries will be used to sort the data. * The procedure sorts the COMPLEX input data_cube[DIM1][DIM2][DIM3]. * Nominally, DIM1, DIM2, and DIM3 = 64, 128, and 1500, respectively, and * thus the program sorts 128 * 500 = 64,000 64-element vectors. 
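As a small standalone illustration of the sort just described, the following program computes the squared-magnitude sort key for one short vector of made-up complex samples and bubble-sorts it into ascending order; the real bubble_sort () walks pointers over a DIM1-element holding vector but performs the same passes.

/* bubble_demo.c - standalone illustration of the sort key and the bubble
 * sort described above, on one short vector of hypothetical sample values. */
#include <stdio.h>

typedef struct { float real; float imag; } CPLX;   /* stands in for COMPLEX */

int main ()
{
    CPLX  sample[6] = { {3,4}, {1,0}, {0,2}, {4,0}, {1,1}, {0,0} };
    float vec[6];
    float tmp;
    int   n, i, sort_num;

    /* Sort key: squared magnitude (real*real + imag*imag), as in bubble_sort (). */
    for (i = 0; i < 6; i++)
        vec[i] = sample[i].real * sample[i].real
               + sample[i].imag * sample[i].imag;

    /* Ascending bubble sort, shrinking the unsorted region each pass. */
    for (sort_num = 6 - 1; sort_num > 0; sort_num--)
        for (n = 0; n < sort_num; n++)
            if (vec[n] > vec[n + 1])
            {
                tmp        = vec[n];
                vec[n]     = vec[n + 1];
                vec[n + 1] = tmp;
            }

    for (n = 0; n < 6; n++)
        printf ("%.1f ", vec[n]);     /* prints 0.0 1.0 2.0 4.0 16.0 25.0 */
    printf ("\n");
    return 0;
}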
*/ #include "stdio.h" #include "math.h* #include *defs.h* * bubble_sort {) * inputs: data_cube * outputs: totals * * data_cube: input data cube * totals: holders for DIM1 totals */ bubble_sort (data_cube, totals) COMPLEX data_cube[)[DIM1](DIM3]; float totals I); 225 /* * Variables * * diml, dim2, dim3: dimensions 1, 2, and 3 in the input data cube * sort_num: number of the sort number * f_pwr: holder for float power data * vec_diml: holder for DIM1 element SORT vector * this_ptr, next_ptr: pointers to float data * i, j, k, n: loop counters */ int diml = DIM1; int dim2 = DIM2; int dim3 = DIM3; int sort_num; float f_pwr; float vec_diml[DIM1); float *this_ptr, *next_ptr; int i, j, k, n; /* * Begin function body: bubble_sort (). * * Perform DIM2*DIM3/3 number of DIM1 element sorts. * * Zero out DIM1 totals. V for(n = 0; n < DIM1; n++) totals[n] = 0.0; /* * Sort by the magnitude of the complex number (sum of the squares of the real * and imaginary parts of the number). * * Select dimensions so we always do dim2*dim3/3 number of diml element sorts. */ for (j - 0 ; j < dim2; j++) for (k = 0; k < dim3/3; k++) ( /* * Initialize pointer to start of holding vector. */ this_ptr = vec_diml; for (i = 0; i < diml; i++) ( *this_ptr++ - data_cube[j][i][k]-real * data_cube[j][i)[k).real + data_cube[j}[i][k].imag * data_cube[j)[i)[k].imag; ) /* * Bubble sort the contents of the holding vector. 1st initialize pointers to * start of holding vector. */ this_ptr = vec_diml; next_ptr = vec_diml + 1; /* * Initialize the number of bubbles. V sort_num = diml - 1 ; /* * Get 1st number. 226 */ f_pwr = *this_ptr; while (Bort_num > 0 ) ( for (n = 0 ; n < sort_num; n+ + ) { if (*next_ptr < f_pwr) ( *this_ptr++ = *next_ptr++; ) else { *this_ptr++ = f_pwr; f_pwr = *next_ptr++; ) } *this_ptr = f_pwr; sort_num -= 1; /* * Reinitialize pointers to start of holding vector. V this_ptr = vec_diml; next_ptr = vec_diml + 1 ; /* * Get 1st number. */ f_pwr = *this_ptr; ) for(n = 0 ; n < diml; n++) totalslnl += vec_diml[n1; } return; } C.1.4 cmd_Iine.c /* * cmd_line.c V /* * This file contains the procedure cmd_line (J, and is part of the parallel * General benchmark program written for the IBM SP2 by the STAP benchmark * parallelization team at the University of Southern California (Prof. Kai * Hwang, Dr. Zhiwei Xu, and Masahiro Arakawa), as part of the ARPA Mountaintop * program. * * The sequential General benchmark program was originally written by Tony * Adams on 8/30/93. * * The procedure cmd_line () extracts the name of the files from which the * input data cube and the steering vector data should be loaded. The * function of this parallel version of cmd_line () is different from that * of the sequential version, because the sequential version also extracted * the number of iterations the program should run and some reporting * options. V #include "stdio.h* tinclude “math.h" 227 •include -defs.h" •include <sys/types.h> •include <sys/time.h> * cmd_li ne () * inputs: argc, argv * outputs: str_name, data_name * * argc, argv: these are used to get data from the command line arguments * str_name: a holder for the name of the input steering vectors file * data_name: a holder for the name of the input data file */ cmd_line {argc, argv, str_name, data_name) int argc; char *argv[]; char str_name(LINE_MAX); char data_name[LINE_MAX]; /* * Begin function body: cmd_line () */ strcpy (str_name, argv[l]); strcat (str_name, ".str"); strcpy (data_name, argv[l]); strcat (data_name, ’.dat"); return; C.1.5 fltc /• * l ft. 
c
 */

/*
 * This file contains the procedures fft () and bit_reverse (), and is part
 * of the parallel General benchmark program written for the IBM SP2 by the
 * STAP benchmark parallelization team at the University of Southern California
 * (Prof. Kai Hwang, Dr. Zhiwei Xu, and Masahiro Arakawa), as part of the ARPA
 * Mountaintop program.
 *
 * The sequential General benchmark program was originally written by Tony
 * Adams on 8/30/93.
 *
 * The procedure fft () implements an n-point in-place decimation-in-time
 * FFT of complex vector "data" using the n/2 complex twiddle factors in
 * "w_common".  The implementation used in this procedure is a hybrid between
 * the implementation in the original sequential version of this program and
 * the implementation suggested by MHPCC.  This modification was made to
 * improve the performance of the FFT on the SP2.
 *
 * The procedure bit_reverse () implements a simple (but somewhat inefficient)
 * bit reversal.
 */

#include "defs.h"

/*
 * fft ()
 * inputs: data, w_common, n, logn
 * outputs: data
 */
void fft (data, w_common, n, logn)
COMPLEX *data, *w_common;
int n, logn;
{ /* fft */
    int incrvec, i0, i1, i2, nx, t1, t2, t3;
    float f0, f1;
    void bit_reverse();

    /*
     * Begin function body: fft ()
     *
     * Bit-reverse the input vector.
     */
    (void) bit_reverse (data, n);

    /*
     * Do the first log n - 1 stages of the FFT.
     */
    i2 = logn;
    for (incrvec = 2; incrvec < n; incrvec <<= 1)
    { /* for (incrvec...) */
        i2--;
        for (i0 = 0; i0 < incrvec >> 1; i0++)
        { /* for (i0...) */
            for (i1 = 0; i1 < n; i1 += incrvec)
            { /* for (i1...) */
                t1 = i0 + i1 + incrvec / 2;
                t2 = i0 << i2;
                t3 = i0 + i1;
                f0 = data[t1].real * w_common[t2].real -
                     data[t1].imag * w_common[t2].imag;
                f1 = data[t1].real * w_common[t2].imag +
                     data[t1].imag * w_common[t2].real;
                data[t1].real = data[t3].real - f0;
                data[t1].imag = data[t3].imag - f1;
                data[t3].real = data[t3].real + f0;
                data[t3].imag = data[t3].imag + f1;
            } /* for (i1...) */
        } /* for (i0...) */
    } /* for (incrvec...) */

    /*
     * Do the last stage of the FFT.
     */
    for (i0 = 0; i0 < n / 2; i0++)
    { /* for (i0...) */
        t1 = i0 + n / 2;
        f0 = data[t1].real * w_common[i0].real -
             data[t1].imag * w_common[i0].imag;
        f1 = data[t1].real * w_common[i0].imag +
             data[t1].imag * w_common[i0].real;
        data[t1].real = data[i0].real - f0;
        data[t1].imag = data[i0].imag - f1;
        data[i0].real = data[i0].real + f0;
        data[i0].imag = data[i0].imag + f1;
    } /* for (i0...) */
} /* fft */

/*
 * bit_reverse ()
 * inputs: a, n
 * outputs: a
 */
void bit_reverse (a, n)
COMPLEX *a;
int n;
{
    int i, j, k;

    /*
     * Begin function body: bit_reverse ()
     */
    j = 0;
    for (i = 0; i < n - 2; i++)
    {
        if (i < j)
            SWAP(a[j], a[i]);
        k = n >> 1;
        while (k <= j)
        {
            j -= k;
            k >>= 1;
        }
        j += k;
    }
}

C.1.6 read_input_SORT_FFT.c

/*
 * read_input_SORT_FFT.c
 */

/*
 * This file contains the procedure read_input_SORT_FFT (), and is part of
 * the parallel General benchmark program written for the IBM SP2 by the
 * STAP benchmark parallelization team at the University of Southern California
 * (Prof. Kai Hwang, Dr. Zhiwei Xu, and Masahiro Arakawa), as part of the ARPA
 * Mountaintop program.
 *
 * The sequential General benchmark program was originally written by Tony
 * Adams on 8/30/93.
 *
 * The procedure read_input_SORT_FFT () reads the input data files (containing
 * the data cube and the steering vectors).
 *
 * In this parallel version of read_input_SORT_FFT (), each task reads its
 * portion of the data cube in from the data file simultaneously.
In order to * improve disk performance, the entire data cube slice is read from disk, then * converted from the packed binary integer format to the floating point * number format. * * Each complex number is stored on disk as a packed 32-bit binary integer. * The 16 LSBs are the real portion of the number, and the 16 MSBs are the * imaginary portion of the number. * * The steering vector file contains the number of PRIs in the input data, * the target power threshold, and the steering vectors. The data is stored * in ASCII format, and the complex steering vector numbers are stored as 230 * alternating real and imaginary numbers. */ tinclude ‘stdio.h* #include "math.h* #include "defs-h" #include <sys/types.h> #include <sys/time.h> #include <mpproto.h> #i fdef DBLE char * fmt = *%lf; • else char *fmt = *%f“; #endif extern int taskid, numtask, allgrp; * read_input_SORT_FFT () * inputs: data_name, str_name * outputs: v4, data_cube * * data_name: the name of the file containing the data cube * str_name: the name of the file containing the input vectors * v4: storage for the v4 vector * data_cube: input data cube *! read_input_SORT_FFT (data_name, data_cube) char data_name[] ; COMPLEX data_cube[][DIM1]{DIM3); /* * Variables * * diml, dim2, dim3: dimension 1, 2, and 3 in the input cube, respectively * temp: buffer for binary input integer data * tempi, temp2 : holders for integer data * f_templ, f_temp2 : holders for float data * junk: a temporary variable * f_power: holder for float power data * i, j, k: loop counters * local_int_cube: the integer version of the local portion of the data cube * blklen, rc: variables used for MPL calls */ int diml = DIM1; int dim2 = DIM2; int dim3 = DIM3; unsigned int tempfl]; long int tempi, temp2 ; float f_templ, f_temp2 ; float f_pwr; int i, j, k; FILE *fopen() ; FILE *f_dat; unsigned int local_int_cube(DIM3](DIM2][DIM1I; long blklen, rc,- /* * Begin function body: read_input_SORT_FFT O. * * Read in the data_cube file in parallel. 231 / if ((f_dat = fopen (data_name, “r") J == NULL) I printf (“Error - task %d unable to open data file.\n“, taskid); exit (-1); ) fseek (f_dat, taskid * diml * dim2 * dim3 * sizeof (unsigned int), 0 ) ; fread (local_int_cube, sizeof (unsigned int), diml * dim2 * dim3, f_dat); fclose (f_dat); /* * Convert data from unsigned int format to floating point format. * / for (k = 0 ; k < dim3; k++) for (j = 0 ; j < dim2 ; j++) for (i = 0 ; i < diml; i++) ( temp[0 ) = local_int_cube[k][j)[i); tempi = OxOOOOFFFF & temp[0|; tempi = (tempi & 0x00008000) ? tempi I OxffffOOOO : tempi; temp2 = (temp|0] >>16) & OxOOOOFFFF; temp2 = (temp2 & 0x00008000) ? temp2 I OxffffOOOO : temp2; data_cube( j ] tiMkl .real = (float) tempi; data_cube(j]jil[kj.imag = (float) temp2; ) return; ) C.1.7 defs.h /* defs.h */ /* 94 Aug 25 - We added a #define macro to compensate for the lack of a */ /* log2 function in the IBM version of math.h */ /* 94 Sep 13 - We added a definition for number of clock ticks per */ /* second, because this varies between the Sun OS and the IBM V /* AIX OS. 
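The two ideas described above for read_input_SORT_FFT (), seeking to a per-task slab of the packed file and sign-extending each 16-bit half-word, can be exercised with the small standalone program below. The file name, slab size, and sample values are made up; the benchmark's real offset is taskid * DIM1 * DIM2 * DIM3 * sizeof (unsigned int), and the sign extension assumes the usual two's-complement representation.

/* packed_read_demo.c - standalone sketch of the partitioned read and of the
 * packed sample format: 16 LSBs = real part, 16 MSBs = imaginary part, both
 * signed 16-bit integers.  The unpacking mirrors the conversion loop in
 * read_input_SORT_FFT () / read_input_VEC (). */
#include <stdio.h>
#include <stdlib.h>

#define TASKS    2          /* stand-in for NN        */
#define PER_TASK 4          /* samples per task slab  */

int main ()
{
    unsigned int packed[TASKS * PER_TASK];
    unsigned int slab[PER_TASK];
    FILE *fp;
    int taskid, i;
    int re, im;

    /* Build and write a tiny packed file: real in the 16 LSBs, imaginary in
     * the 16 MSBs, both stored as 16-bit two's-complement values. */
    for (i = 0; i < TASKS * PER_TASK; i++)
    {
        re = i - 3;                      /* small signed test values */
        im = 2 * i;
        packed[i] = ((unsigned int) im << 16) |
                    ((unsigned int) re & 0x0000FFFFu);
    }
    if ((fp = fopen ("demo.dat", "wb")) == NULL)
        exit (-1);
    fwrite (packed, sizeof (unsigned int), TASKS * PER_TASK, fp);
    fclose (fp);

    /* Each "task" seeks to its own slab and unpacks it. */
    for (taskid = 0; taskid < TASKS; taskid++)
    {
        if ((fp = fopen ("demo.dat", "rb")) == NULL)
            exit (-1);
        fseek (fp, (long) (taskid * PER_TASK * sizeof (unsigned int)), 0);
        fread (slab, sizeof (unsigned int), PER_TASK, fp);
        fclose (fp);

        for (i = 0; i < PER_TASK; i++)
        {
            /* Sign-extend each 16-bit half-word (two's complement assumed). */
            re = slab[i] & 0x0000FFFF;
            re = (re & 0x00008000) ? (int) (re | 0xffff0000u) : re;
            im = (slab[i] >> 16) & 0x0000FFFF;
            im = (im & 0x00008000) ? (int) (im | 0xffff0000u) : im;
            printf ("task %d sample %d: real = %d, imag = %d\n",
                    taskid, i, re, im);
        }
    }
    return 0;
}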
*/ /* Sun OS version - 60 clock ticks per second */ Ufdef SUN ttdefine CLK_TICK 6 0.0 ttendif /* IBM AIX version - 100 clock ticks per second */ kifdef IBM •define log2(x) ((log(x))/(M_LN2)) •define CLK_TICK 100.0 •endif • i fdef DBLE •define float double •endi f 232 typedef struct { float real; float imag; } COMPLEX; /* used in new fft */ •define SWAP (a, b) {float swap_temp=(a).real;{a).real=(b}.real;(b).real=swap_temp;\ swap_temp=(a).imag;{a).imag=(b) .imag;(b) .imag=swap_temp;} •define NN 64 XXXUUUUUU V #define PRI_CHUNK (PRI/NN) /* XXXXXUU */ •define AT_LEAST_ARG 2 #def ine AT_MOST_ARG 4 •def ine ITERATIONS_ARG 3 #def ine REPORTS_ARG 4 • define USERTIME(Tl,T2) •define SYSTIME(Tl,T2) •define USERTIME1(Tl,T2) #def ine SYSTIME1(Tl,T2) #define USERTIME2(Tl,T2) #define SYSTIME2(Tl,T2) {(t2.tms_utime-tl.tms_utime)/CLK_TICK) ((t2.tms_st ime-tl.tms_st ime)/CLK_TICK) ((time_end.tms_utime-time_start.tms_utime)/CLK_TICK) {(time_end.tms_stime-time_start.tms_stime)/CLK_TICK) I(end_time.tms_utime-start_time.tms_utime)/CLK_TICK) ((end_time.tms_stime-start_time.tms_stime)/CLK_TICK) #i fndef LINE_MAX #define LINE_MAX 256 •endif •define TRUE 1 •define FALSE 0 •define COLS 153 6 •define •define •define •define •def ine •define •define •def ine •def ine •define •define •def ine MBEAM 12 ABEAM 20 PWR_BM 9 VI 64 V2 128 V3 COLS V4 VI DIM1 VI / DIM2 128 / DIM3 V3/NN DOP_GEN 2048 PRI_GEN 2048 * Maximum number of columns in holding vector "vec" */ * in house.c for max columns in Householder multiply */ * Number of main beams */ * Number of auxiliary beams */ * Number of max power beams */ * Number for dimensionl in input data cube */ * Number for dimension2 in input data cube */ /* Number for dimensions in input data cube */ I* MAX Number of dopplers after FFT */ /* MAX Number of points for 1500 data points FFT */ /* zero filled after 1500 up to 2048 points */ •define NUM_MAT 32 /* Number of matrices to do householde on V /* NUM_MAT must be less than DIM2 above */ • i fdef APT •define PRI 256 •define DOP 256 /* •define RNG 280 •define RNG 256 •define EL 32 •define BEAM 32 •define NUMSEG 7 /* Number of pris in input data cube */ /* Number of dopplers after FFT */ */ /* Number of range gates in input data cube */ /* Number of range gates in input data cube */ /* number of elements in input data cube */ /* Number of beams */ /* Number of range gate segments V •define RNGSEG RNG/NUMSEG /* Number of range gates per segment */ •define RNG_S 320 /* Number sample range gates for 1st step beam forming */ •define DOF EL I* Number of degrees of freedom */ extern COMPLEX t_matrix[BEAM](EL); /* T matrix V •endif 233 Hifdef STAP Hdefine PRI 128 /* Number of pris in input data cube */ •define DOP 128 /* Number of dopplers after FFT */ Hdefine RNG 1250 /* Number of range gates in input data cube V Hdefine EL 48 /* number of elements in input data cube */ Hdefine BEAM 2 /* Number of beams */ Hdefine NUMSEG 2 /* Number of range gate segments */ Hdefine RNGSEG RNG/NUMSEG /* Number of range gates per segment */ Hdefine RNG_S 288 /* Number of sample range gates for beam forming */ Hdefine DOF 3*EL /* Number of degrees of freedom "/ Hendi f Hifdef GEN Hdefine EL V4 Hdefine DOF EL Hendi f extern int output_time; /* Flag if set TRUE, output execution times */ extern int output_report; /* Flag if set TRUE, output data report files */ extern int repetitions; /* number of times program has executed */ extern int iterations; /* number of times to execute program *t typedef struct max_index ( float value ; int loci, loc2, loc3 ; } 
INDEXED_MAX ; extern void xu_index_max ( ini, in2, out, len ) INDEXED_MAX ini [], in2I], out[]; int *len ; < int i, n ; n = *len/sizeof(int) ; for (i-0; i<n; i + + ) { if (ini Ii].value > in2 Ji].value) { out[i].value = ini[i].value ; out[i].loci = ini(ij.loci; out[i].loc2 = inl[i].loc2 ; out[i].loc3 - inl[i].loc3; J else < out[i].value = in2 [i],value ; out[i].loci = in2 [i].locl; out[i].loc2 = in2 [ij.loc2 ; out[i].loc3 = in2[ij.loc3; } } > C.1.8 compUe_sort mpcc -03 -qarch=pwr2 -DGEN -DIBM -o gen bench_jnark_SORT_FFT,c bubble_sort.c cmd_line.c fft.c read_input_SORT_FFT.c -lm 234 C.1.9 run.128 poe gen /scratchl/masa/vec_data -procs 128 -us CJ 2 Vector Multiply Subprogram C.2.1 bench_m ark_VEC.c bench_mark_VEC.c / Parallel General Benchmark Program for the IBM SP2 This parallel General benchmark program was written for the IBM SP2 by the STAP benchmark parallelization team at the University of Southern California (Prof. Kai Hwang, Dr. Zhiwei Xu, and Masahiro Arakawa), as part of the ARPA Mountaintop program. The sequential General benchmark program was originally written by Tony Adams on 8/30/93. This file contains the procedure main (), and represents the body of the General Vector Multiply subprogram. This subprogram's overall structure is as follows: 1. Load data in parallel. 2. Perform DIM1*DIM3 DIM2-element vector multiplications using vector v2. 3. Each task searches its slice of the data cube for the largest element. 4. Perform a reduction operation to find the largest element in the entire data cube. 5. Perform DIM2*DIM3 DIM1 element vector multiplications using vector vl. 6 . Each task searches its slice of the data cube for the largest element. 7. Perform a reduction operation to find the largest element in the entire data cube. 8 . Redistribute the data cube using a total exchange operation. 9. Perform DIM1*DIM2 DIM3-element vector multiplications using vector v3, 10. Each task searches its slice of the data cube for the largest element. 11. Perform a reduction operation to find the largest element in the entire data cube. This program can be compiled by calling the shell script compile_vec. Because of the nature of the program, before the program is compiled, the header file defs.h must be adjusted so that NN is set to the number of nodes on which the program will be run. We have provided a defs.h file for each power of 2 node size from 1 to 256, called defs.001 to defs.256. This program can be run by calling the shell script run.##*, where ### is the number of nodes on which the program is being run. Unlike the original sequential version of this program, the parallel General program does not support command-line arguments specifying the number of times the program body should be executed, nor whether or not timing information should be displayed. The program body will be executed once. 235 an d t h e t i m i n g i n f o r m a t i o n w i l l be d i s p l a y e d a t t h e end o f e x e c u t i o n . The input data file on disk is stored in a packed format: Each complex number is stored as a 32-bit binary integer; the 16 LSBs are the real half of the number, and the 16 MSBs are the imaginary half of the number. These numbers are converted into floating point numbers usable by this benchmark program as they are loaded from disk. The steering vectors file has the data stored in a different fashion. All data in this file is stored in ASCII format. The first two numbers in this file are the number of PRIs in the data set and the threshold level, respectively. 
Then, the remaining data are the steering vectors, with alternating real and imaginary numbers. '/ •include "stdio.h" •include "math.h* •include "defs.h" •include <sys/types.h> •include <sys/times.h> •include <mpproto.h> •include <sys/time.h> •include * main () * inputs: argc, argv * outputs: none */ main (argc, argv) int argc; char *argv[]; /* * Variables * * data_cube: input data cube * new_cube: indexed data cube * data_vector: used in the inex step * vl, v2, v3: vectors used during the vector multiplications * tl, t2 : holders for the time function return in seconds * diml, dim2, dim3: dimensions 1, 2, and 3 of the input data cube * loci, loc2 , loc3; holders for the location of the maximum entry * temp: buffer for binary input integer data * f_templ, f_temp2 : holders for float data * f_pwr: holder for float power data * i, j, k, n, offset, r: loop counters * max: float variable to hold the maximum of complex data * max_pwr; holder for max float power data * real_prt, imag_prt: pointers to float data * xreal, ximag: holding vectors for vector multiply * vreal, vimag: holding vectors for float vectors vl, v2, and v3 * index_jnax, g_index_max: data structures with the location and power of the * largest element (locally and globally, repectively) * data_name, str_name: holders for the names of the files containing the * input data cube and the vectors, respectively * totals: holders for DIM1 totals * global_totals: holders for the global DIM1 totals * sum: COMPLEX variable to hold the sum of complex data * blklen: block length * rc: return code from an MPL call * taskid: identifier for this task * numtask: total number of tasks running this program 236 * nbuf: a buffer to hold data returned by mpc_task_query * allgrp: the identifier for the group encompassing all the tasks running * this program * source, dest, type, nbytes: used by point-to-point communications */ COMPLEX data_cube|DIM2][DIMlj[DIM3]; COMPLEX new_cube[DIM1] [DIM2/NN] [DIM3 *NN]; COMPLEX data_vector[DIM2*DIM1*DIM3]; COMPLEX Vl[VI]; COMPLEX v2[V2]; COMPLEX v3[V3]; struct tms tl,t2 ; int diml = DIM1; int dim2 = DIM2; int dim3 = DIM3; int loci,loc2 ,loc3; unsigned int temp[l]; float f_templ, f_temp2 ; float f_pwr; int i, j, k, n, offset, r; float max; float max_pwr; float *real_ptr, *imag_ptr; float xreal[COLS], ximag[COLS]; float vreal[COLS], vimag[COLS]; float *vreal_ptr, *vimag_ptr; INDEXED_MAX index_max(1], g_index_max[1] ; char data_name[LINE_MAX]; char str_name[LINE_MAXj; FILE *fopen(]; float totals[DIM1]; float global_totals[DIM1]; COMPLEX sum; int blklen; int rc; int taskid; int numtask; int nbuf[4]; int allgrp; int source, dest, type, nbytes; /* * Timing variables. V struct timeval tvO, tvl, tv2; struct tms all_start, all_end; float all__user, all_sys; float all_user_max, all_sysurtax; float all_wall, task_wall; struct tms disk_start, disk_end; float disk_user, disk_sys; float disk_user_max, disk_sys_max; float disk_wall, disk_wall_max; struct tms index_start, index_end; float index_user, index_sys; float index_user_max, index_sys_jnax; float index_wall, index_wall_max; struct tms vecl_start, vecl_end; float vecl_user, vecl_sys; float vecl_userjnax, vecl_sys_max; float vecl_wall, vecl_wall_max; struct tms vec2_start, vec2_end; 237 float vec2_user, vec2_sys; float vec2_user_rnax, vec2_sys_max; float vec2_wall, vec2_wall_max; struct tms vec3_start, vec3_end; float vec3_user, vec3_sys; float vec3_user_jnax, vec3_sys_max; float vec3_wall, vec3_wall_max: /» * Externally defined functions. 
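Every subprogram in this appendix begins with the same MPL start-up sequence, sketched below in isolation: mpc_environ () returns the task count and this task's identifier, and mpc_task_query () with the arguments used here returns, in nbuf[3], the group identifier that the collective calls take as their last argument. The sketch mirrors the calls in this listing and runs only under POE on the SP2; it is not a stand-alone UNIX program.

/* init_sketch.c - minimal sketch of the MPL start-up sequence used by every
 * subprogram in this appendix; all calls mirror the ones in this listing. */
#include <stdio.h>
#include <stdlib.h>
#include <mpproto.h>

int main ()
{
    int taskid, numtask, allgrp, nbuf[4];

    if (mpc_environ (&numtask, &taskid) == -1)
    {
        printf ("Error - unable to call mpc_environ.\n");
        exit (-1);
    }
    if (mpc_task_query (nbuf, 4, 3) == -1)
    {
        printf ("Error - unable to call mpc_task_query.\n");
        exit (-1);
    }
    allgrp = nbuf[3];

    printf ("task %d of %d ready (allgrp = %d)\n", taskid, numtask, allgrp);
    mpc_sync (allgrp);          /* barrier, as used before each timed step */
    return 0;
}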
*/ extern void cmd_line(); extern void read_input_VEC(}; /• * Begin function body: main () * * Initialize for parallel processing. Here, each task or node determines its * task number (taskid) and the total number of tasks or nodes running this * program (numtask) by using the MPL call mpc_environ. Then, each task * determines the identifier for the group which encompasses all tasks or * nodes running this program. This identifier (allgrp) is used in collective * communication or aggregated computation operations, such as mpc_index. V gettimeofday(ttvO, (struct timeval*)0}; /* before time */ rc = mpc_environ (&numtask, itaskid); if (rc == -1 ) ( printf (’Error - unable to call mpc_environ.\n") ; exit (-1); ) if (numtask != NN) ( if (taskid == 0 ) ( printf (’Error - task number mismatch... check defs.h.\n"); exit (-1); ) I rc = mpc_task_guery (nbuf, 4, 3); if (rc == -1 ) { printf ("Error - unable to call mpc_task_query.\n"); exit (-1 ); } allgrp = nbuf[3]; if (taskid == 0 ) { printf (’Running...\n’); ) gettimeofday(&tv2 , (struct timeval")0); /* before time */ task_wall = (float) (tv2 .tv_sec - tv0 .tv_sec) * 1 0 0 0 0 0 0 + tv2 .tv_usec - tv0 .tv_usec; times (&all_start); /* * Get arguments from command line. In the sequential version of the program, * the following procedure was used to extract the number of times the main * computational body was to be repeated, and flags regarding the amount of 238 * reporting to be done during and after the program was run. In this paralle * program, there are no command line arguments to be extracted except for the * name of the file containing the data cube. */ cmd_line (argc, argv, str_name, data_name); /* * Read input files. In this section, each task loads its portion of the data * cube from the data file. */ if (taskid == 0) ( printf (" loading data. . .\n") , - } mpc_sync (allgrp); times (&disk_start) , * gettimeofday(itvl, (struct timeval*)0 ); /* before time */ read_input_VEC (data_name, str_name, vl, v2, v3, data_cube) ; gettimeofday(&tv2 , (struct timeval*)0); /* after time */ disk_wall = (float) (tv2 .tv_sec - tvl.tv_sec) * 1000000 + tv2 .tv_usec - tvl.tv_usec; times (&disk_end) /* * Start Vector Multiplication steps. V if (taskid-=0 ) printf("VECTOR multiply data_cube by vl,v2,v3 and output largest values\n"}; /* * Perform DIM1*DIM3 DIM2 element vector multiplies using vector v2. */ if (taskid == 0 ) printf ("Perform %d*%d number of *d element multiplies by vector v2\n", DIM1, DIM3*NN, DIM2 ) , * times (ivec2_start); gettimeofday(&tvl, (struct timeval*)0); I * before time */ /* * Initialize max_pwr = 0.0 and location = 0,0,0. */ max_pwr = 0 .0; loci = 0; loc2 = 0; loc3 = 0; /* * Move vector v2 into holding vectors for the use of pointers. */ vreal_ptr = vreal; vimag_ptr = vimag; for (n - 0; n < V2; n++> { *vreal_ptr++ = v2 [n].real; *vimag_ptr++ = v2 [n].imag; } for (i = 0 ; i < dimlj i++) for (k = 0; k < dim3; k++) 239 { real_ptr = xreal; imag_ptr = ximag; vreal_ptr = vreal; vimag_ptr = vimag; for (j = 0 ; j < dim2; j++) ( *real_ptr++ = data_cube[j][i][kJ.real * *vreal_ptr - data_cube[j][i 1[k].imag * *vimag_ptr; *imag_ptr++ = data_cube[j][i][k].real * *vimag_ptr++ + data_cube[j][i][k].imag * *vreal_ptr++; ) /* * Find the element with the largest magnitude in this task's slice of the * data cube. 
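The vector-multiply loops in this subprogram expand the complex product inline; for reference, the rule they implement is (a + bi)(c + di) = (ac - bd) + (ad + bc)i, as in this small standalone check with made-up operands.

/* cmul_demo.c - standalone illustration of the complex multiplication
 * expanded inline in the vector-multiply loops of this subprogram. */
#include <stdio.h>

typedef struct { float real; float imag; } CPLX;   /* stands in for COMPLEX */

static CPLX cmul (CPLX x, CPLX y)
{
    CPLX z;

    z.real = x.real * y.real - x.imag * y.imag;
    z.imag = x.real * y.imag + x.imag * y.real;
    return z;
}

int main ()
{
    CPLX a = { 1.0f, 2.0f };      /* hypothetical data sample  */
    CPLX b = { 3.0f, -1.0f };     /* hypothetical vector entry */
    CPLX c = cmul (a, b);

    printf ("(%g, %g)\n", c.real, c.imag);   /* prints (5, 5) */
    return 0;
}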
*/ real_ptr = xreal; imag_ptr = ximag; for (j = 0 ; j < dim2 ; j++] { f_pwr - *real_ptr * *real_ptr + *imag_ptr * *imag_ptr; *real_ptr++; * imag ptr + +; if (max_pwr < f_pwr) { max_pwr = f_pwr; loci = i; loc2 = j; loc3 = k; ) 1 } /* * printf("task %d finds max %f at %d %d %d\n*, * taskid, max_pwr, loci, loc2, loc3) , - */ index_max[0 ].value = max_pwr; index jnax[0 ).loci = loci; indexjnax(O).loc2 = loc2; index_max[0].loc3 - loc3 + taskid * dim3; /* * Now aggregate to get the global maximum (the element with the largest * magnitude in the entire data cube). */ rc = mpc_reduce (indexjnax, g_index_max, sizeof <INDEXED_MAX), 0, xu_index_max, allgrp); if (rc = = -1) { printf (‘Error unable to call mpc_reduce,\n"); exit (-1); } gettimeofday<4tv2, (struct timeval*)0); /* after time */ vec2_wall = (float) (tv2 .tv_sec - tvl.tv_sec) * 1 000000 + tv2 .tv_usec - tvl.tv_usec; times (tvec2_end); if (taskid 0 ) { max_pwr = g_index_max[0].value ; loci = g_index_max[0].loci ; loc2 = g_index_max[0].loc2 ; 240 loc3 = g_index_max[0].loc3 ; printf("Max value for v2 vector multiply = %f\n", max_pwr); printf("Location = %d %d %d\n", loci, loc2, loc3); ) /* * Perform DIM2*DIM3 DIM1 element vector multiplies using vector vl. */ if (taskid==0 ) printf ("Perform %d*%d number of %d element multiplies by vector vl\n", DIM1, DIM3"NN, DIM2 ); times (tvecl_start); gettimeofday(ttvl, (struct timeval*)0 ); /* before time */ /* * Initialize max_pwr = 0.0 and location = 0,0,0. */ max_pwr = 0 .0; loci = 0 loc2 = 0 loc3 = 0 /* * Move vector vl into holding vectors for the use of pointers. */ vreal_ptr = vreal; vimag_ptr = vimag; for (n = 0; n < VI; n+t) ( *vreal_ptr++ = vlIn).real; *vimag_ptr++ - vllnj.imag; ) for (.i = 0 ; j < dim2 ; j++> for <k = 0; k < dim3; k+ + ) ( real_ptr = xreal; imag_ptr = ximag; vreal_ptr = vreal; vimag_ptr - vimag; for (i = 0; i < diml; i++) ( *real_ptr++ = data_cube(j] 1i][k].real * *vreal_ptr - data_cube[j][i][k].imag * *vimag_ptr; *imag_ptr++ = data_cube[j]Ii][k).real * *vimag_ptr++ + data_cube[ j] (iJ [k] .imag * »vreal_ptr+ + ) /* * Find the element with the largest magnitude in this tasks’s slice of the * data cube. */ real_ptr = xreal; imag_ptr = ximag; for (i - 0; i < diml; i++) f_pwr = *real_ptr * *real_ptr + *imag_ptr * *imag_ptr; *real_ptr++; *imag_ptr++; if (max_pwr < f_pwr) { max_pwr = £_pwr; loci = i; loc2 - j; 241 loc3 = k; } ) ) r * printf("task *d finds max %f at %d %d %d\n", * taskid, max_pwr, loci, loc2, loc3); */ irdex_max[0].value = max_pwr; index_jnax[0) . loci = loci; index_max[0].loc2 = loc2 ; index_jnax[0] . loc3 = loc3 + taskid * dim3; /* * Now aggregate to get the global maximum (the element with the largest * magnitude in the entire data cube), */ rc = mpc_reduce (index_max, g_index_max, sizeof (INDEXED_MAX), 0, xu_index_max, allgrp); if (rc -1 ) { printf ("Error - unable to call mpc_reduce.\n") exit (-1); ) gettimeofday (Sctv2 , (struct timeval*)0 ); /* after time *1 vecl_wall = (float) (tv2 .tv_sec - tvl.tv_sec) * 1000000 + tv2 .tv_usec - tvl.tv_usec; times (&vecl_end); if (taskid == 0 ) { max_pwr = g_indexjnax(0 ].value ; loci = g_index_max[0).loci ; loc2 = g_index_max[0!.loc2 ; loc3 = g_index_max[0).loc3 ; printf("Max value for v2 vector multiply = %f\n", max_pwr); printf("Location = %d %d %d\n‘, loci, loc2, loc3); ) /* * Index data_cube[DIM2][DIM1][DIM3] to new_cube(DIM1)[DIM2/NN][DIM3*NN]. 
* This total exchange operation is needed so that each task has all the data * along the DIM3 dimension (the dimension along which the last set of vector * multiplications is performed). */ if (taskid == 0) < printf {* indexing data cube...\n*); ) mpc_sync (allgrp); times (tindex_start); gettimeofday ((itvl, (struct timeval*)0 ); /" before time */ rc = mpc_index (data cube, data_vector, DIM1 * DIM2 * DIM3 * sizeof (COMPLEX) / NN, allgrp); if (rc == -1) { printf ("Error - unable to call mpc_index.\n")j exit (-1 >; > 242 if (taskid == 0 ) ( printf (" rewinding data cube...\n">; } mpc_sync (allgrp); offset = 0 ; for (n = 0; n < NN; n+ + ) for (j= 0 ; j < DIM2/NN; j + + > for (i = 0; 1 < DIMX ; i++) for (k = n * DIM3; k < (n + 1) * DIM3 ; k + +) { new_cube(i][j] [k].real = data_vector[offset I.real; new_cube[ij[j1fk].imag = data_vector[offset I.imag; of fset + + ; ) gettimeofday(&tv2, (struct timeval*)0); /* after time */ index_wall = (float) (tv2 .tv_sec - tvl.tv_sec) * 1 0 0 0 0 0 0 + tv2 ,tv_usec - tvl.tv_usec; t imes (& i ndex_end); /* * Perform DIM1*DIM2 DIM3 element vector multiplies using vector v3. */ if (taskid==0) printf ("Perform %d*%d number of %d element multiplies by vector v3\n", DIM1, DIM2, DIM3 *NN ); times (&vec3_start) ; gettimeofday(&tvl, (struct timeval*)0 ); /* before time */ /* * Initialize max_pwr = 0.0 and location - 0,0,0. V max_pwr = 0 .0 ; loci = 0 ; loc2 = 0 ; loc3 = 0; /* * Move vector v3 into holding vectors for the use of pointers. */ vreal_ptr = vreal; vimag_ptr = vimag; for (n = 0; n < V3; n++) { *vreal_ptr++ = v3[n].real; *vimag_ptr++ = v3[nj.imag; > for (i = 0 ; i < diml; i++) for (j = 0; j < dim2/NN; j++) { real_ptr = xreal; imag_ptr = ximag; vreal_ptr = vreal; vimag_ptr = vimag; for (k = 0; k < dim3*NN; k++) ( *real_ptr++ = new_cube[i][j][k].real * *vreal_ptr - new_cube[i][j][k].imag * *vimag_ptr; *imag_ptr++ = new_cube[i][j][k].real * *vimag_ptr++ + new_cube[i][j][k].imag * *vreal_ptr++; ) 243 * Finding the element with the largest magnitude in this task's slice of the * data cube. */ real_ptr = ■ xreal; imag_ptr = ximag; for (k = 0; k < dim3*NN; k + +) { f_pwr = *real_ptr * *real_ptr + *imag_ptr * *imag_ptr; *real_ptr+ +; *imag_ptr+ + ; if (max_pwr < f_pwr) { max_pwr = f_pwr; loci = i; loc2 - j; loc3 = k; } } } /* " printf("task %d finds max %f at %d %d %d\n", * taskid, max, loci, loc2, loc3); */ index_max[0 ].value = max_pwr; index_max(0 J.loci = loci; index_max[0).loc2 = loc2 + taskid * dim2/NN; index_maxtO].loc3 = loc3; /* * Now aggregate to get the global maximum (the element with the largest * magnitude in the entire data cube). "/ rc = mpc_reduce (index_jnax, g_index_max, sizeof (INDEXED_MAX), 0, xu_index_max, allgrp); if (rc == -1 ) { printf ("Error - unable to call mpc_reduce.\n*); exit (-1); 1 gettimeofday(&tv2, (struct timeval")0 ); t* after time */ vec3_wall = (float) (tv2 .tv_sec - tvl.tv_sec) * 1 0 0 0 0 0 0 + tv2.tv_usec - tvl.tv_usec; times (&vec3_end); gettimeofday(&tv2, (struct timeval*)0 ); /* after time */ all_wall = (float) (tv2 .tv_sec - tv0 .tv_sec) * 1 0 0 0 0 0 0 + tv2 .tv_usec - tv0 .tv_usec; times(&all_end); if (taskid == 0 ) ( max_pwr = g_indexjnax[0).value; loci = g_index_max[0 ].loci; loc2 = g_index_jnax(0 ] . loc2 ; loc3 = g_index_jnax[0] . loc3; printf("Max value for vl vector multiply = %f\n", max_pwr); printf("Location = %d %d %d\n", loci, loc2, loc3); ) /* * Compute all times. 
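The mpc_index () redistribution and the "rewind" loop earlier in this listing can be checked on a toy problem without MPL. The standalone sketch below tags every element with its global coordinates, simulates the block exchange that mpc_index () performs, applies the same rewind loop, and verifies that task t ends up holding new_cube[i][j][k] equal to global element (j + t*DIM2/NN, i, k); all dimensions and the TAG encoding are made up for the illustration.

/* exchange_demo.c - standalone simulation of the total exchange and the
 * rewind loop, small enough that the index arithmetic can be verified. */
#include <stdio.h>

#define NN    2                 /* number of tasks                          */
#define DIM1  2
#define DIM2  4                 /* global dim2 (each task holds all of it)  */
#define DIM3  3                 /* per-task share of the global dim3        */

#define TAG(j, i, k) ((j) * 10000 + (i) * 100 + (k))   /* encodes coordinates */

int main ()
{
    /* Before: task t holds data_cube[DIM2][DIM1][DIM3]; local k maps to
     * global range gate k + t * DIM3. */
    static int data_cube[NN][DIM2][DIM1][DIM3];
    /* After: task t holds new_cube[DIM1][DIM2/NN][DIM3*NN]; local j maps to
     * global j + t * DIM2/NN. */
    static int new_cube[NN][DIM1][DIM2 / NN][DIM3 * NN];
    static int data_vector[NN][DIM2 * DIM1 * DIM3];
    int t, n, i, j, k, offset, errors = 0;

    for (t = 0; t < NN; t++)
        for (j = 0; j < DIM2; j++)
            for (i = 0; i < DIM1; i++)
                for (k = 0; k < DIM3; k++)
                    data_cube[t][j][i][k] = TAG (j, i, k + t * DIM3);

    /* Simulate mpc_index: block t of each sender's flattened cube, i.e. its
     * j-planes [t*DIM2/NN, (t+1)*DIM2/NN), arrives at task t in rank order. */
    for (t = 0; t < NN; t++)
    {
        offset = 0;
        for (n = 0; n < NN; n++)
            for (j = t * (DIM2 / NN); j < (t + 1) * (DIM2 / NN); j++)
                for (i = 0; i < DIM1; i++)
                    for (k = 0; k < DIM3; k++)
                        data_vector[t][offset++] = data_cube[n][j][i][k];
    }

    /* The rewind loop, with the same index order as in the listing. */
    for (t = 0; t < NN; t++)
    {
        offset = 0;
        for (n = 0; n < NN; n++)
            for (j = 0; j < DIM2 / NN; j++)
                for (i = 0; i < DIM1; i++)
                    for (k = n * DIM3; k < (n + 1) * DIM3; k++)
                        new_cube[t][i][j][k] = data_vector[t][offset++];
    }

    /* Check: task t's new_cube[i][j][k] holds global element
     * (j + t * DIM2/NN, i, k). */
    for (t = 0; t < NN; t++)
        for (i = 0; i < DIM1; i++)
            for (j = 0; j < DIM2 / NN; j++)
                for (k = 0; k < DIM3 * NN; k++)
                    if (new_cube[t][i][j][k] != TAG (j + t * (DIM2 / NN), i, k))
                        errors++;
    printf ("%d mismatches\n", errors);   /* prints 0 mismatches */
    return 0;
}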
*/ 244 all_user = (float) (all_end.tms_utime - all_start,tms_utime)/1 0 0.0 ; all_sys = (float)(all_end.tms_stime - all_start.tms_stime)/1 0 0.0 ; rc = mpc_reduce (&all_user, &all_user_jnax, sizeof (float), 0 , s_vmax,allgrp); if (rc == -1 ) f printf ('Error mpc_reduce.\n■); exit (-1 >; ) rc = mpc_reduce (&all_sys, &all_sys_max, sizeof (float), 0 , s_vmax,allgrp), - if (rc == *1 ) < printf ("Error mpc_reduce.\n“); exit (-1 ); ) /* * Compute disk times, * I rc = mpc_reduce (&disk_wal1, &disk_wall_max, sizeof(float), 0 , s_vmax, aiIgrp); if (rc == -1 ) { printf (“Error mpc_reduce.\n“); exit (-1 ) ; } disk_user = (float) (disk_end.tms_utime - disk_start.tms_utime)/1 0 0.0 ; disk_sys = (float)(disk_end.tms_stime - disk_start.tms_stime)/1 0 0.0 ; rc = mpc_reduce (&disk_user, &disk_user_jnax, sizeof (float), 0 , s_vmax, allgrp); if (rc == -l) ( printf (“Error mpc_reduce.\n“); exit (-1 ); ) rc = mpc_reduce (idisk_sys, &disk_sysjnax, sizeof (float), 0, s_vmax,allgrp); if (rc =- -i) ( printf (“Error mpc_reduce.\n“); exit (-1 ); ) /* * Compute vecl times. */ rc = mpc_reduce (&vecl_wall, s.vecl_wall_max, sizeof(float), 0, s_vmax, allgrp) ; if (rc = = -1) < printf (“Error mpc_reduce.\n“); exit (-1); ) vecl_user = (float) (vecl_end.tms_utime - vecl^start.tms_utime>/1 0 0.0 ; vecl_sys = (float)(vecl_end.tms_stime - vecl_start.tms_stime)/1 0 0.0 ; if <taskid==0 ) printf(“vecl sys = \ f user = %f\n", vecl_sys,vecl_user); rc = mpc_reduce (ivecl_user, &vecl_user_max, sizeof (float), 0 , s_vmax, allgrp); if (rc == -l) ( printf (“Error mpc_reduce.\n*)! exit (-1 ); ) rc = mpc_reduce (fcvecl_sys, &vecl_sys_jnax, sizeof (float), 0 , s_vmax,allgrp); if (rc == -1 ) f printf (“Error mpc_reduce.\n“); 245 exit (-1); } /* * Compute vec2 times. V rc = mpc_reduce (&vec2_wall, &vec2_wall_jnax, sizeof(float), 0 , s_vmax, allgrp); if (rc == -1) < printf ("Error mpc_reduce.\n"); exit (-1); } vec2_user = (float) (vec2_end.tms_utime - vec2_start.tms_utime)/1 0 0.0 ; vec2_sys = (float)(vec2_end.tms_stime - vec2_start.tms_stime)/1 0 0.0; rc = mpc_reduce (&vec2_user, *<vec2_user_max, sizeof (float), 0 , s_vmax, allgrp); if (rc == -1 ) ( printf ("Error mpc_reduce.\n"),- exit (-1); ) rc = mpc_reduce (&vec2_sys, &vec2_sys_max, sizeof (float), 0, s_vmax, allgrp) , - if (rc == -1) ( printf ("Error mpc_reduce.\n"); exit (-1); ) /* * Compute vec3 times. */ rc = mpc_reduce (&vec3_wall, J*vec3_wall_jnax, sizeof (float) , 0, s_vmax, allgrp); if (rc = = -1 ) I printf < "Error mpc_reduce.\n") , * exit (-1); ) vec3_user = (float) (vec3_end.tms_utime - vec3_start.tms_utime)/100.0; vec3_sys = (float)(vec3_end.tms_stime - vec3_start.tms_stime)/100.0; rc = mpc_reduce (&vec3_user, &vec3_user_max, sizeof (float), 0, s_vmax, allgrp); if (rc = = -1) ( printf ("Error mpc_reduce.\n"); exit (-1 ); ) rc = mpc_reduce (&vec3_sys, &vec3_sys_rnax, sizeof (float), 0, s_vmax, allgrp) , • if (rc == -1) ( printf ("Error mpc_reduce.\n*); exit (-1 ); > /* * Compute index times. 
*/ rc = mpc_reduce (£index_wa1 1, &index_wall_max, sizeof(float), 0 , s_vmax, allgrp); if (rc == -1 ) ( printf ("Error mpc_reduce.\n■); exit (-1 ); ) index_user = (float) (index_end.tms_utime - index^start.tms_utime)/1 0 0.0 j 246 index_sys = (float)(index_end.tms_stime - index_start.tms_stime)/1 0 0.0 ; rc = mpc_reduce (&index_user, &index_user_max, sizeof (float), 0 , s_vmax, allgrp); if (rc == -1 ) 1 printf ("Error mpc_reduce.\n"); exit (-1 ); ) rc = mpc_reduce (&index_sys, &index_sys_max, sizeof (float), 0 , s_vmax, allgrp); if (rc == -1 ) ( printf ("Error mpc_reduce.\n"); exit (-1 ); ) /* * Display timing information. V if (taskid == 0 ) ( printf ("\n\n*** CPU Timing information - numtask = %d\n\n", NN) ; printf printf printf printf printf printf printf print f printf print f printf print f print f print f } exit(0 ); ) C.2.2 cmd_line.c /* * cmd_line.c */ /* * This file contains the procedure cmd_line (), and is part of the parallel * General benchmark program written for the IBM SP2 by the STAP benchmark * parallelization team at the University of Southern California (Prof. Kai * Hwang, Dr. Zhiwei Xu, and Masahiro Arakawa), as part of the ARPA Mountaintop * program. * * The sequential General benchmark program was originally written by Tony * Adams on 8/30/93. (" all_user_max = %,2 f s, all_user_max, all_sys_max); (* disk_user_jnax = %.2f s, disk_user_max, disk_sys_max); (* vecl_userjnax = %.2f s, vecl_user_max, vecl_sys_max) , - (" vec2_user_max = %.2f s, vec2_user_max, vec2_sys_max); (" vec3_user_max = %.2f s, vec3_user_max, vec3_sysjnax) , * (" index_user_max = %.2 f e index_user_jnax, index_sys_max) ; all_sys_max disk_sys_max vecl_sys_max vec2_sys_max vec3_sys_max i ndex_sy s_ma x %.2 f s\n", *.2 f s\n", ».2 f s\n", %.2 f s\n", %.2f s\n", = %.2f s\n*, (*\n*** Wall Clock Timing information - numtask = 4d\n\n", NN) ; all_wall task_wall d i s k_wa11 vecl_wall vec2_wall vec3_wall index_wall ».0f us\n", all_wall); = %.0f us\n", task_wall); = %.0f us\n", disk_wall_max); = %.Of us\n", vecl_wall_max); = %.Of us\n", vec2_wall_max); = %.Of us\n", vec3_wall_max); = %.0f us\n", index_wall_max); 247 * The procedure cmd_line () extracts the name of the files from which the * input data cube and the steering vector data should be loaded. The * function of this parallel version of cmd_line () is different from that * of the sequential version, because the sequential version also extracted * the number of iterations the program should run and some reporting * options. */ (•include *stdio.h* (•include "math.h’ ••include *defs.h* ••include <sys/types.h> ((include <sys/time.h> * cmd_line () * inputs: argc, argv * outputs: strename, data_name * * argc, argv: these are used to get data from the command line arguments * str_name: a holder for the name of the input steering vectors file * data_name: a holder for the name of the input data file V cmd_line (argc, argv, str_name, data_name) int argc; char *argv[]; char str_name[L1NE_MAX]; char data_name[LINE_MAX); /* * Begin function body: cmd_line () */ strcpy (str_name, argv[l]); strcat (str_name, *.str’); strcpy (data_name, argv[l]),- strcat (data_name, ".dat*); return; C J J read_input_VEC.c /* * read_input_VEC.c */ /* * This file contains the procedure read_input_VEC (), and is part of the * parallel General benchmark program written for the IBM SP2 by the STAP * benchmark parallelization team at the University of Southern Calfornia * (Prof. Kai Hwang, Dr. Zhiwei Xu, and Masahiro Arakawa), as part of the ARPA * Mountaintop program. 
* * The sequential General benchmark program was originally written by Tony * Adams on 8/30/93. * * The procedure read_input_VEC () reads the input data files (containing the * data cube and the steering vectors). 248 * In this parallel version of read_input_VEC (), each task reads its portion * of the data cube in from the data file simulatenously. In order to improve * disk performance, the entire data cube slice is read from disk, then * converted from the packed binary integer format to the floating point * number format. * * Each complex number is stored on disk as a packed 32-bit binary integer. * The 16 LSBs are the real portion of the number, and the 16 MSBs are the * imaginary portion of the number. * * The steering vector file contains the number of PRIs in the input data, * the target power threshold, and the steering vectors. The data is stored * in ASCII format, and the complex steering vector numbers are stored as * alternating real and imaginary numbers. */ •include "stdio.h* •include *math.h* •include *defs.h" •include <sys/types.h> •include <sys/time.h> •include <mpproto.h> •ifdef DBLE char *fmt = ■•If*,- •else char *fmt = *%f"; •endif extern int taskid, numtask, allgrp; read_input_VEC () inputs: data_name, str_name outputs: vl, v2, v3, data_cube data_name: the name of the file containing the data cube str_name: the name of the file containing the input vectors vl, v2, v3: storage for the vl, v2, and v3 vectors, respectively data_cube: input data cube read_input_VEC (data_name, str_name, vl, v2, v3, data_cube) char data_name[3; char str_name[); COMPLEX vl[]; COMPLEX v2 [ ] , * COMPLEX v3 [ ] ; COMPLEX data_cube[][DIM1][DIM3]; /* * Variables * * diml, dim2, dim3: dimension 1, 2, and 3 in the input cube, respectively * temp: buffer for binary input integer data * tempi, temp2 : holders for integer data * f_templ, f_temp2 : holders for float data * junk: a temporary variable * f_power: holder for float power data * i, j, k: loop counters * local_int_cube: the integer version of the local portion of the data cube * blklen, rc: variables used for MPL calls */ 249 int diml = DIM1; int dim2 = DIM2; int dim3 - DIM3; unsigned int temp[l]; long int tempi, temp2; float f_templ, f_temp2 ; float junk; float f_pwr; int i, j, k; FILE *fopen<); FILE *f_dat, *f_vec; unsigned int local_int_cube[DIM3J[DIM2}[DIM1); long blklen, rc; /* * Begin function body: read_input_VEC (). * * Read in the data_cube file in parallel. */ if ((f_dat = fopen (data_name, "r")) = = N U L L ) { printf ("Error - task Id unable to open data file.Xn", taskid); exit <-1 ) ; ) fseek (f_dat, taskid * diml * dim2 * dim3 * sizeof (unsigned int), 0) ; fread (local_int_cube, sizeof (unsigned int), diml * dim2 * dim3, f_dat); fclose (f_dat); I* * Convert the data from the unsigned integer format to floating point * format. */ for (k = 0; k < dim3; kt + ) for (j = 0; j < dim2; j++) for (i = 0; i < diml; i + +> ( tempEO] = local_int_cube[k}[j][i]; tempi = OxOOOOFFFF & temp(0]; tempi = (tempi & 0x00008000) ? tempi I OxffffOOOO : tempi; temp2 = (temp[0] >>16) & OxOOOOFFFF; temp2 = (temp2 & 0x00008000) ? temp2 I OxffffOOOO : temp2; data_cube[j]Ii)[k].real = (float) tempi; data_cube[ j Hi] [k] . imag = (float) temp2; I /* * open the vector file using the file pointer f_vec. V if ((f vec = fopen (str_name, “r“) ) == NULL) ( printf (" Cant open input steering vector file ls\n ", str_name); exit (-1); } /* * Read the VECTORS from input file with vectors in vl,v2,v3 order. 
*/ for (i = 0; i < VI; i + + ) ( fscanf (f_vec, fmt, tvl[i].real); fscanf (f_vec, fmt, tvl[ij.imag); ) 250 for (i = 0; i < V2; i + + ) I fscanf (f_vec, fmt, &v2 [i].real); fscanf (f_vec, fmt, tv2 [i].imag); ) for (i = 0 ; i < V3; i + +> ( fscanf (f_vec, fmt, &v3[i].real); fscanf (f_vec, fmt, &v3[i].Imag); > fclose)f_vec); return; ) C.2.4 defs.h /* defs.h */ /* 94 Aug 25 - We added a #define macro to compensate for the lack of a */ /* log2 function in the IBM version of math.h V /* 94 Sep 13 - We added a definition for number of clock ticks per V /* second, because this varies between the Sun OS and the IBM */ /* AIX OS. */ /* Sun OS version - 60 clock ticks per second */ •ifdef SUN •define CLKJTICK 60.0 Hendi f /* IBM AIX version 100 clock ticks per second V #i fdef IBM •define log2(x) ((log(x))I (M_LN2)) •define CLK_TICK 100.0 •endif •ifdef DBLE •define float double •endif typedef struct { float real; float imag; ) COMPLEX; /* used in new fft */ •define SWAP)a,b) {float swap_temp=(a).real;(a).real=(b).real;(b).real=swap_temp;\ swap_temp=(a).imag;(a).imag=(b).imag;(b).imag=swap_temp;) •define NN 2 t* XXXUUUUUU */ •define PRI_CHUNK (PRI/NN) /* XXXXXUU */ 231 •define AT_LEAST_JARG 2 •define AT_MOST_ARG 4 •define ITERATIONS^ARG 3 •define REPORTS_ARG 4 •define USERTIME(Tl,T2) ((t2.tms_utime-tl.tms_utime)/CLK_TICK) •define SYSTIME(Tl,T2) <(t2.tms_stime-tl.tms_stime)/CLK_TICK) •define USERTIME1(Tl,T2) ((time_end.tms_utime-time_start.tirs_utime)/CLK_TICK) •define SYSTIME1(T1,T2| ( (time_end. tms_stiire-t irre_start .tms_stime) /CLK_TICK) •define USERTIME2(Tl,T2) ((end_time.tms_utime-start_time.tms_utime}/CLK_TICK) •define SYSTIME2(Tl,T2) ((end_time.tms_stime-start_time.tms_stime)/CLK_TICK> •ifndef LINEJ4AX •define LINE_MAX 256 •endif •define TRUE 1 •define FALSE 0 •define COLS 153 6 •def ine •def ine •def ine •def ine •define •def ine •def ine •def ine •def ine •define •def ine •def ine MBEAM 12 ABEAM 20 PWR_BM 9 VI 64 V2 128 V3 COLS V4 VI DIM1 VI / DIM2 128 / DIM3 V3/NN DOP GEN 2048 PRIGEN 2048 * Maximum number of columns in holding vector *vec" */ * in house.c for max columns in Householder multiply */ * Number of main beams */ * Number of auxiliary beams */ * Number of max power beams */ * Number for dimensionl in input data cube */ * Number for dimension2 in input data cube */ /* Number for dimension3 in input data cube */ /* MAX Number of dopplers after FFT */ /* MAX Number of points for 1500 data points FFT */ /* zero filled after 1500 up to 2048 points */ •define NUM_MAT 32 /* Number of matrices to do householde on */ /* NUMJ1AT must be less than DIM2 above */ •ifdef APT •define PRI 256 i* •define DOP 256 t /* •def ine RNG 280 •define RNG 256 •define EL 32 •define BEAM 3 2 •define NUMSEG 7 Number of prls in input data cube */ ' Number of dopplers after FFT */ *t /* Number of range gates in input data cube */ /* Number of range gates in input data cube */ /* number of elements in input data cube */ /* Number of beams */ /* Number of range gate segments */ •define RNGSEG RNG/NUMSEG /* Number of range gates per segment */ •define RNG_S 320 /* Number sample range gates for 1st step beam forming */ •define DOF EL /* Number of degrees of freedom */ extern COMPLEX t_jnatrix (BEAM] [EL]; /* T matrix */ •endif •ifdef STAP •def ine •define •define •define •define •define •define •define •define PRI 128 DOP 128 RNG 1250 EL 48 BEAM 2 NUMSEG 2 /* Number of pris in input data cube */ /* Number of dopplers after FFT */ /* Number of range gates in input data cube */ /* number of elements in input data 
cube */ /* Number of beams */ /* Number of range gate segments */ Number of range gates per segment */ RNG_S 288 /* Number of sample range gates for beam forming */ DOF 3*EL /* Number of degrees of freedom */ RNGSEG RNG/NUMSEG /* •endif 252 •ifdef GEN tide fine EL V4 •define DOF EL •endif extern int output_time; /' extern int output_report; /' extern int repetitions; /' extern int iterations; /' Flag if set TRUE, output execution times */ Flag if set TRUE, output data report files number of times program has executed */ number of times to execute program */ typedef struct max_index ( float value ; int loci, loc2 , } INDEXED_MAX ; loc3 extern void xu_index_max ( ini, in2, out, len ) INDEXED_MAX ini[], in2[], out[]; int *len ; ( int i, n ; n = *len/sizeof(int) ; for (i = 0 ; i<n; i + + > { if (ini[i].value > in2 [i].value) < out(ij .value = inl[i).value ; out[i].loci = inl[i].locl; out[i].loc2 = inl[i].loc2 ; out[i].loc3 = inl[i].loc3; else { out[i].value = in2 [i).value ; out[i],loci = in2 [i].locl; out[i].loc2 = in2 [i].loc2 ; out[i].loc3 = in2[i].loc3; > ) ) C15 compile_vec mpcc -03 -qarch=pwr2 -DGEN -DIBM -o vec bench_mark_VEC.c cmd_line.c read_input_VEC.c -lm C.2.6 run.128 poe vec /scratchl/masa/vec_data -procs 126 -us C 3 Linear Algebra Subprogram C3.1 bench_mark_LIN.c bench_jnark_LIN.c / Parallel General Benchmark Program for the IBM SP2 This parallel General benchmark program was written for the IBM SP2 by the STAP benchmark parallelization team at the University of Southern California (Prof. Kai Hwang, Dr. Zhiwei Xu, and Masahiro Arakawa), as part of the ARPA Mountaintop program. The sequential General benchmark program was originally written by Tony Adams on 8/30/93. This file contains the procedure main (), and represents the body of the General Linear Algebra subprogram. This subprogram's overall structure is as follows: 1. Load data in parallel. 2. Perform the Householder transform on the data in parallel. 3. Perform forward and back substitution on the data in parallel. 4. Apply weight vectors to the data in parallel. This program can be compiled by calling the shell script compile_lin. Because of the nature of the program, before the program is compiled, the header file defs.h must be adjusted so that NN is set to the number of nodes on which the program will be run. We have provided a defs.h file for each power of 2 node size from 1 to 32, called defs.001 to defs.032. This program can be run by calling the shell script run.###, where ### is the number of nodes on which the program is being run. Unlike the original sequential version of this program, the parallel General program does not support command-1ine arguments specifying the number of times the program body should be executed, nor whether or not timing information should be displayed. The program body will be executed once, and the timing information will be displayed at the end of execution. The input data file on disk is stored in a packed format: Each complex number is stored as a 32-bit binary integer; the 16 LSBs are the real half of the number, and the 16 MSBs are the imaginary half of the number. These numbers are converted into floating point numbers usable by this benchmark program as they are loaded from disk. The steering vectors file has the data stored in a different fashion. All data in this file is stored in ASCII format. The first two numbers in this file are the number of PRIs in the data set and the threshold level, respectively. 
Then, the remaining data are the steering vectors, with alternating real and imaginary numbers. / •include "stdio.h* •include "math.h* • include •defs.h1 ' •include <sys/types.h> 254 •include <sys/times-h> •include <mpproto.h> •include <sys/time.h> •include i* * ma i n <) * inputs: argc, argv * outputs: none V main (argc, argv) int argc; char *argv[]; { Variables data_cube: input data cube v4: vector of complex numbers temp_mat: temporary matrix beams_piat: temporary matrix for holding output beam data diml, dim2, dim3: dimensions 1, 2, and 3 in the input data cube, respect ively loci, loc2, loc3: holders for the location of the maximum entry start_row: holder for the row on which to start the Householder lower_triangular_rows: holder for the number of rows to lower triangularize num_rows: holder for the total number of rows num_cols: holder for the total number of columns i, j, k: loop counters dopplers: number of dopplers after the FFT weight_vec: pointer to the start of the COMPLEX weight vector xreal, ximag: pointer to the start of the real and imaginary parts of the float data vector max: float variable to hold the maximum of the complex data sum: variable to hold the sum of the complex data f_pwr, f_templ, f_temp2 : float variables real_ptr, imag_ptr: pointer to float variables str_vecs; pointer to the start of the steering vector array make_t: a flag; 1 = make T matrix, 0 = don't ma eT matrix but pass back the weight vector index_max: a data structure holding the location and value of the maximum g_index_jnax: a data structure holding the location and value of the maximum data_name, str_name: the names of the files containing the input data cube and the steering vectors, respectively blklen: block length rc: return code from an MPL call taskid: the identifier for this task numtask: the total number of tasks running this program nbuf: a buffer to hold the data from the MPL call mpc_task_query allgrp: the identifier representing all the tasks running this program source, dest, type, nbytes: parameters to point-to-point MPL calls / COMPLEX data_cube[DIM1][DIM2][DIM3] ; COMPLEX v4[V4]; static COMPLEX temp jnat[V4][COLS]; COMPLEX *‘beams_mat; int diml = DIM1; int dim2 = DIM2; int dim3 = DIM3; int loci,loc2,loc3; int start_row; int lower_triangular_rows; int num_rows; 255 int num_cols = V3; int i, j, k; int dopplers; COMPLEX weight_vec[V4I; float xreal[V4]; float ximag[V4]; float max; COMPLEX sum; float f_pwr; float f_templ; float f_temp2; float *real_ptr; float *imag_ptr; COMPLEX str_vecs(l)[DOF]; int make_t; INDEXED_MAX index_max[NUM_J4AT]; INDEXED_MAX g_index_max[NUM_MAT*NN]; char data_name[LINE_MAX]; char str_name[LINE_MAX|; int blklen; int rc; int taskid; int numtask; int nbuf[4]; int allgrp; int source, dest, type, nbytes; Timing variables all_start, all_end; start and end CPU times for the entire program disk_start, disk_end: start and end CPU times for the disk access step lin_start, lin_ends start and end CPU times for the linear algebra step tl, t2, tvO, tvl, tv2: temporary variables all_user, all_sys: user and system CPU times for the entire program all_user_max, all_sys_max: maximum user and system CPU times for the entire program disk_user, disk_sys: user and system CPU times for the disk access setp disk_user_max, disk_sys_max: maximum user and system CPU times for the entire program all_wall, task_wali: wall clock time and maximum wall clock time for the entire program disk_wall, disk_wail_max: wall clock time and maximum wall clock time for the disk access step lin_user, 
lin_sys: user and system CPU times for the linear algebra step 1in_user_jnax, 1in_sys_max: maximum user and system CPU times for the linear algebra step lin_wall, lin_wall_max: wall clock time and maximum wall clock time for the linear algebra step / struct tms ali_start, all_end; struct tms disk_start, disk_end; struct tms lin_start, lin_end; struct tms tl,t2 ; struct timeval tvO, tvl, tv2; float all_user, all_sys; float all_user_jnax, all_sys_max; float disk_user, disk_sys; float disk_user_max, disk_sys_max; float ail_wall, task_wall; float disk_wall, disk_wall_max; float lin_user, lin_sys; float lin_userjtax, lin_sys_max; float lin_wall, lin_wall_max; 256 /* * Externally defined functions */ extern void cmd_line(); extern void read_input_LIN(); extern void house!); extern void forbackd; /* * Begin function body: main () * * Initialize for parallel processing. Here, each task or node determines its * task number (taskid) and the total number of tasks or nodes running this * program (numtask) by using the MPL call mpc_environ. Then, each task » determines the identifier for the group which encompasses all tasks or * nodes running this program. This identifier (allgrp) is used in collective * communication or aggregated computation operations, such as mpc_index. */ gettimeofday(ttvO, (struct timeval*I 0 ); /* before time */ rc = mpc_environ (tnumtask, &taskid); if (rc == -1) ( printf ("Error - unable to call mpc_environ.in*); exit (-1 ); } if (numtask !- NN) ( if (taskid == 0) < printf ("Error - task number mismatch... check defs.h.in*I; exit (-1); ) ) rc = mpc_task_query (nbuf, 4, 3); if (rc == -1) ( printf ("Error - unable to call mpc_task_query.in"}; exit (-1); } allgrp = nbuf(3]; gettimeofday(&tv2, (struct timevalMO); /* before time */ task_wall = (float) (tv2 .tv_sec - tv0 .tv_sec) * 1 0 00 00 0 + tv2 .tv_usec - tvO. tv_usec,- if (taskid == 0) ( printf ("Running... in"); ) times (s,all_start) , • /* * Get arguments from command line. In the sequential version of the program, * the following procedure was used to extract the number of times the main * computational body was to be repeated, and flags regarding the amount of * reporting to be done during and after the program was run. In this paralle * program, there are no command line arguments to be extracted except for the * name of the file containing the data cube. */ cmd_line (argc, argv, str_name, data_name); 257 /* * Read input files. In this section, each task loads its portion of the data * cube from the data file. V if (taskid == 0) ( printf (■ loading data...\n"}; ) mpc_sync (allgrp); times <&disk_start); gettimeofday(&tvl, (struct timeval*)0); /* before time */ read_input_LIN (data_name, str_name, v4, data_cube); gettimeofday(fctv2 , (struct timeval*)0); /* after time */ disk_wall = (fioat) (tv2 .tv_sec - tvl.tv_sec) * 1 000000 + tv2 .tv_usec - tvl.tv_usec; times (&disk_end); /* * Start Linear Algebra steps. */ make_t = 0; /* Dont make T matrix. Return a weight vector */ times(&lin_start); gettimeofday(&tvl, (struct timeval*)0); /* before time */ /* * Move steering vector V4 into the str_vecs matrix. */ for (i = 0; i < V4; i+t) { str_vecs[0][i].real = v4(i).real; str_vecs[0][i]•imag = v4(ij,imag; ) /* * Perform the Householder transform only on NUM_MAT of theV2 dopplers (DIM1 * by DIM3). Allocate space for the beams_mat matrix = NUM_MAT*V3 COMPLEX * matrix. 
*/ if ((beams_mat = (COMPLEX **) malloc (NUM_MAT * sizeof (COMPLEX *))) == NULL) ( printf (" Cant allocate space for output beams_mat holding matrix\n *); free <temp_jnat); exit (-1); ) for (i = 0; i < NUM_MAT; i + +) ( if ( (beams _fnat [i] = (COMPLEX *) malloc (V3 * sizeof (COMPLEX))) == NULL) I printf (■ Cant allocate space for output beams_mat holding matrix ■); free (temp_mat); free (beams_mat); exit (-1); ) ) /* * Move 1 of NUM_MAT dopplers to tempjnat from data_cube, perform a Householder * transform, then forward and back solve for the weights. Apply the weights * to the data and put the beamformed dopplers into the beamsjnat matrix. * * Select 1st NURLMAT matrices with DIM1 by DIM3 data elements. 258 V for max) ( max = f_pwr; /* Always beam 0 */ locl=0 loc2-j loc3=k ) * printf ("task %d finds max %f at %d %d %d\.n*, * taskid, max, loci, loc2, loc3); V indexjnax[i],value = max; index_jnax[i] . loci = loci; index_max[ij.loc2 = loc2 + taskid * dim2 ; index_max[i].loc3 = loc3; } /* end i for_loop */ /* * Gather local max entries into node 0. V rc = mpc_gather (index_max, g_indexjnax, NUM.J4AT* sizeof(INDEXED_MAX), 0, allgrp); if (rc == -1 ) ( printf ("Error - unable to call moc gather.\n"); exit (-1 ); ) gettimeofday(itv2 , (struct timeval*)0 ); /* after time */ lin_wall = (float) (tv2 .tv_sec - tvl.tv_sec) * 1 0 0 0 0 0 0 + tv2 .tv_usec - tvl.tv_usec; all_wall = (float) (tv2 .tv_sec - tv0 .tv_sec) * 1 0 0 0 0 0 0 + tv2 .tv_usec - tv0 .tv_usec; times(&1in_end); times(sall_end); /* * Compute all times. V all_user = (float) (all_end.tms_utime - all_start.tms_utime)/I0 0 .0 ; all_sys = <float)(all_end.tms_stime - all_start.tms_stime)/1 0 0.0 ; rc = mpc_reduce (&all_user, fcall_user_jnax, sizeof (float), 0 , s_vmax,allgrp); 2 6 0 if (rc == -1 ) { p r i n t f ( " E r r o r m p c _ r e d u c e . \ n * ); e x i t ( - 1 ) ; } rc = mpc_reduce (&all_sys, &all_sys_max, sizeof (float), 0 , s_vmax,allgrp); If (rc == -1 ) ( printf ("Error mpc_reduce.\n*); exit {-1 >; ) /* * Compute disk times. V rc = mpc_reduce (&disk_wall, idisk_wall_max, sizeof(float), 0 , s_vmax, allgrp); if (rc == -1 ) ( printf ("Error mpc_reduce.\n"); exit (-1 ); ) disk_user = (float) (disk_end.tms_utime - disk_start.tms_utime)/1 0 0,0 ; disk_sys = (float)(disk_end.tms_stime - disk_start.tms_stime)/1 0 0.0 ; rc = mpc_reduce (&disk_user, &disk_user_max, sizeof (float), 0 , s_vmax, allgrp); if (rc == -1 ) ( printf ("Error mpc„reduce.\n“); exit (-1 ); ) rc = mpc_reduce (&disk_sys, &disk_sys_max, sizeof (float), 0, s_vmax,allgrp); if (rc == -1 ) ( printf ("Error mpc_reduce.\n"); exit (-1 ); ) /* * Compute lin times. V rc = mpc_reduce (&lin_wall, &lin_wall_max, sizeof(float), 0 , s_vmax, allgrp); if (rc == -1 ) ( printf ("Error mpc_reduce.\n"); exit (-1 ); ) lin_user = (float) (lin_end.tms_utime - lin_start.tms_utime)/1 0 0.0 ; lin_sys = (float)(lin_end.tms_stime - 1in_start.tms_stime)/1 0 0.0 ; rc = mpc_reduce (tlin_user, &lin_user_max, sizeof (float), 0 , s_vmax,allgrp); if (rc == -1 ) ( printf ("Error mpc_reduce.\n"); exit (-1 ); ) rc = mpc_reduce (tlin_sys, &lin_sys_max, sizeof (float), 0 , s_vmax,allgrp); if (rc == -1 ) ( printf ("Error mpc_reduce.\n*) ; exit (-1 ); ] /* * Report results. V 261 if (taskid == 0 ) for (i =0; i < NUM_J4AT*NN; i++) ( max = g_index_max[i).value; loci = g_index_jnax [ i ] . 
loci ; loc2 = g_index_max[ij.loc2 ; loc3 = g_index_max[ij.loc3; printf("#ld POWER of max COMPLEX output beams data = %f \n",i, max) printf(• LOCATION of max COMPLEX output beams data = %d %d %d \n*, loci,loc2,loc3); } /* * Display timing information. '/ if (taskid 0 ] ( printf ("\n\n*** CPU Timing information - numtask = %d\n\n*, NN); printf (* all_user_max = %.2f s, all_sys_max = %.2f s\n' all_user_max, all_sys_max); printf (* disk_user_max = %.2f s, disk_sys_max = %.2f s\n", disk_user_max, disk_sys_max); printf (■ lin_user_max = %.2f s, lin_sys_max = %.2f s\n‘, 1in_u se r_max, 1i n_sy s_ma x); > printf printf printf printf printf ("\n*** Wall Clock Timing information - numtask = »d\n\n", NN); (‘ all_wall task_wall disk_wall lin_wall = % . 0 f us\n", all_wall); = %.Of us\n", task_wall); = %.0f us\n", disk_wall); ! %.0f us\n", lin_wall); free (beamsjnat); C J i cmd.line.c /* * cmd_line.c */ /* * This file contains the procedure cmd_line (), and is part of the parallel * General benchmark program written for the IBM SP2 by the STAP benchmark * parallelization team at the Univerisity of Southern California (Prof. Kai * Hwang, Dr. Zhiwei Xu, and Masahiro Arakawa), as part of the ARPA Mountaintop * program. * * The sequential General benchmark program was originally written by Tony * Adams on 8/30/93. * * The procedure cmd_line (> extracts the name of the files from which the * input data cube and the steering vector data should be loaded. The * function of this parallel version of cmd_line I) is different from that * of the sequential version, because the sequential version also extracted * the number of iterations the program should run and some reporting * options. */ •include "stdio.h" *include "math.h* •include ■defs.h* •include <sys/types,h> 2 6 2 ^include <sys/time.h> cmd_line () inputs: argc, argv outputs: str_name, data_name argc, argv: these are used to get data from the command line arguments str_name: a holder for the name of the input steering vectors file data_name: a holder for the name of the input data file / cmd_line (argc, argv, str_name, data_name) int argc; char *argv(]; char str_name[LINE_MAX!; char data_name(LINE_MAX]; /* * Begin function body: cmd_line () */ strcpy (str_name, argv[l]|; strcat (str_name, *.str’|; strcpy (data_name, argv[l)}; strcat (data_name, ".dat*i; return; C3 3 forback.c forback.c / This file contains the procedure forback (), and is part of the parallel General benchmark program written for the IBM SP2 by the STAP benchmark parallelization team at the University of Southern California (Prof. Kai Hwang, Dr. Zhiwei Xu, and Masahiro Arakawa), as part of the ARPA Mountaintop program. The sequential General benchmark program was originally written by Tony Adams on 8/30/93. The procedure forback () performs a forward and back substitution on an input temp_mat array, using the steering vectors, str_vecs, and normalizes the solution returned in the vector "weight^ec*. If the input variable "make_t" = 1, then the T matrix is generated in this routine by doing BEAM forward and back solution vectors. Nominally, BEAM = 32. The conjugate transpose of each weight vector, to get the Hermitian, is put into the T matrix as row vectors. The input, temp_jnat, is a square lower triangular array, with the number of rows and columns = num_rows. 
This routine requires 5 input parameters as follows: num_rows: the number of COMPLEX elements, M, in tempjnat str_vecs: a pointer to the start of the steering vector array weight_vec: a pointer to the start of the COMPLEX weight vector 263 * make_t: 1 = make T matrix; 0 = don't make T matrix but pass back the * weight vector * temp_jnat[][]: a temporary matrix holding various data * * This procedure was untouched during our parallelization effort, and * therefore is virtually identical to the sequential version. Also, most, * if not all, of the comments were taken verbatim from the sequential * version. */ •include "stdio.h* •include "math.h" •include "defs.h' * forback (} * inputs: num_rows, str_vecs, weight_vec, make_t, temp_mat * outputs: weight_vec * * num_rows: the number of elements, M, in the input data * str_vecs: a pointer to the start of the steering vector array * weight_vec: a pointer to the start of the COMPLEX weight vector * make_t: 1 = make T matrix; 0 = don't make T matrix but pass back the * weight vector * temp_mat: MAX temprary matrix holding various data */ forback (num_rows, str_vecs, weight_vec, make_t, temp_mat) int num_rows; COMPLEX str_vecs[][DOF]; COMPLEX weight_vec[]; int make_t; COMPLEX temp__mat[][COLS]; /* * Variables * * i, j, k: loop counters * last: the last row or element in the matrices or vectors * beams: loop counter * num_beams: loop counter * sum_sq: a holder for the sum squared of the elements in a row * sum: a holder for the sum of the COMPLEX elements in a row * temp: a holder for temporary storage of a COMPLEX element * abs_jnag: the absolute magnitude of a complex element * wt_factor: a holder for the weight normalization factor, according to * Adaptive Matched Filter Normalization * steer_vec: one steering vector with DOF COMPLEX elements * vec: a holder for the complex solution intermediate vector * tmp_mat: a pointer to the start of the COMPLEX temp matrix */ int i, j, k; int last; int beams; int num_beams; float sunusq; COMPLEX sum; COMPLEX temp; float abs_mag; float wt_factor; COMPLEX steer_vec[DOF]; COMPLEX vec[DOF]; COMPLEX tmp_jnat [DOF] [DOF] ; * Begin function body: forback () * * The temp_mat matrix contains a lower triangular COMPLEX matrix as the * first MxM rows. Do a forward back solution BEAM times with a different * steering vector. If make_t = 1, then make the T matrix. V if (make_t) ( num_beams = num_rows; !* Do M forward back solution vectors. Put */ /* them else into T matrix. */ 1 else t num_beams =1; /* Do only 1 solution weight vector */ } for (beams = 0 ; beams < num_beams; beams+ 0 ( /* for (beams...) * / for (j = 0 ; j < num_rows; j++) ( steer_vec[j].real = str_vecsI beams] [jI .real; steer_vec[j].imag = str_vecsIbeams][j].imag; } /* * Step 1: Do forward elimination. Also, get the weight factor = square root * of the sum squared of the solution vector. Used to divide back substitution * solution to get the weight vector. Divide the first element of the COMPLEX * steering vector by the first COMPLEX diagonal to get the first element of * the COMPLEX solution vector. First, get the absolute magnitude of the first * lower triangular diagonal. */ abs_mag = temp_mat[0 1 [0 ].real * tempjnat[0][0].real + temp_mat[0 ][0 ].imag * temp_mat[0][0 ].imag; /* * Solve for the first element of the solution vector. */ vec(0 |.real = (temp_mat(0][0].real * steer_vec[0 ].real + temp_matjoj [0].imag * steer_vec[0 ].imag) / abs_mag; vec[0 J.imag = (temp_jnat|0][0).real * steer_vec[0].imag - temp_mat 10] (0] . 
imag * steer_vec[0 ].real> / abs_jtiag; /* * start summing the square of the solution vector. */ sum_sq = vec(0].real * vec[0 ].real + vec[0].imag * vec 10 ].imag; /* * Now solve for the remaining elements of the solution vector. V for (i = 1 ; i < num_rows; i++) I /* for (i . . . ) */ sum.real = ■ 0 .0 ; sum.imag = 0 .0 ; for (k = 0 ; k < i; k++) ( /* for (k...) */ sum.real += (temp_mat[i][k].real * vec[k].real - temp_mat [ 1) [k] . imag * vec [k] . imag) ; sum.imag += (temp_jnat [i] [k] . imag * vec[k].real + temp_mat[i][k].real * vecIk].imag); ) /* for (k...) */ 265 /* * Now subtract the sum from the next element of the steering vector. */ temp.real = steer_vecti].real - sum.real; temp.imag = steer_vec[ij .imag - sum.imag; /* * Get the absolute magnitude of the next diagonal. */ abs_mag = temp_matIiJ[i].real * temp_mat[i](i].real + temp_mat[i][iJ.imag * temp_matIi][i].imag; /* * Solve for the next element of the solution vector. */ vecIi].real = (temp_mat[i][i].rea1 * temp.real + temp_mat(i]Ii].imag * temp.imag) / abs_mag; vec[i).imag = (temp_mat[i][i].real * temp.imag - temp_mat[i](i).imag * temp.real) / abs_mag; /* * Sum the square of the solution vector. * / sum_sq + = (vec[i].real * vec[i].real + vec[i).imag * vec[i].imag); } /* for (i. . .) */ wt_factor = sqrt {(double) sum_sq); /* * Step 2: Take the conjugate transpose of the lower triangular matrix to * form an upper triangular matrix. */ for (i = 0 ; i < nun\_rows; i + + ) { i* for (i. . .) •/ for (j = 0 ; j < num_rows; j++) ( /* for (j . . . ) */ tmp_matli][j].real = temp_matEj][i].real; tmp_mat(i]tjj.imag = - temp_mat[jI[i].imag; 1 /* for (j ...) */ ) /* for (i. . .) */ /• * Step 3: Do a back substitution. •/ last = num_rows - 1 ; /• * Get the absolute magnitude of the last upper triangular diagonal. */ abs_mag = tmp_mattlast][last].real * tmp_mat[last][last].real + tmpjnat[last](last).imag * tmp_mat[laBt|[last].imag; /* * Solve for the last element of the weight solution vector. */ weight_vec(last].real = (tmpjnat(last][last].real * vec[last].real + tmp_mat[last][last].imag * vec[last].imag) / abs_mag; weight_vec[last].imag = (tmpjnatElast][last].real * vec[last].imag - tmp_mat[last][last].imag * vec[last].real) 2 6 6 / a b s _ m a g ; i* * Now solve for the remaining elements of the weight solution vector from * the next to last element up to the first element, */ for (i = last - 1 ; i >= 0 ; i--) i /* for { i . . . ) */ sum.rea1 = 0 .0 ; sum.imag = 0 .0 ; for (k = i + l; k <= last; k++) { /* for (k...) */ sum.real += (tmp_mat[i](k!.real * weight_vec[k].real - tmp_mat[i][k].imag * weight_vec(k).imag); sum.imag += (tmpjnat[i] !k).imag * weight_vec[k].real + tmp_mat [ i ] [k] . real * weight_vec Ek) . imag) } /* for (k...) V /* * Subtract the sum from the next element up of the forward solution vector. V temp.real = vec[i],real - sum.real; temp.imag = vec[ij.imag - sum.imag; /* * Get the absolute magnitude of the next diagonal up. */ abs_jnag = tmp_mat [ i ] [ i ] . real * tmpjnat [ i ] [ i] . real + tmp_mat[i]|i].imag * tmp_mat[i][i].imag; /* * Solve for the next element up of the weight solution vector. V weight_vec[i].real = (tmp_mat[i][i].real * temp.real + tmp_mat[i][i].imag * temp.imag) / abs_mag; weight_vec[i].imag = (tmpjnat[i][i).real * temp.imag - tmpjnat[i][i].imag * temp.real) / abs_mag; ) /* for {i...) */ !* * Step 4: Divide the solution weight_vector by the weight factor. 
V for (i = 0 ; i < num_rows; i++) { weight_vec[i).real /- wt_factor; weight vectij.imag /- wt_factor; ) #i fdef APT /* * If make t = 1, make the T matrix. */ if (make_t) { /* if (make_t) */ /* * Conjugate transpose the weight vector to get the Hermitian. Put each * weight vector into the T matrix as row vectors. V for tj = 0 ; j < num_rows; j++) 267 { t_matrix[beams][j].real = weight_vecij].real; tjnatrix[beams][j].Imag = - weight_vec[j].imag; } } /* if (make_t) */ •endi f } /* for {beams...) * / return; } C 3.4 house.c /* * house.c */ /* * This file contains the procedure house (), and is part of the parallel * General benchmark program written for the IBM SP2 by the STAP benchmark * parallelization team at the University of Southern California (Prof. Kai * Hwang, Dr. Zhiwei Xu, and Masahiro Arakawa), as part of the ARPA Mountaintop * program. * * The sequential General benchmark program was originally written by Tony * Adams on 6/30/93. * * The procedure house () performs the Householder transform, in place, on * an N by M complex imput matrix, where M >= N. It returns the results in * the same location as the input data. * * This routine requries 5 input parameters as follows: * num_rows: number of elements in the temp_mat * num_cols: number of range gates in the temp_mat * lower_triangular_rows: the number of rows in the output temp_jrat that * have been lower triangularized * start_row: the number of the row on which to start the Householder * temp_mat[][]: a temporary matrix holding various data * * This procedure was untouched during our parallelization effort, and * therefore is virtually identical to the sequential version. Also, most, * if not all, of the comments were taken verbatim from the sequential * version. */ #include "stdio.h* •include ■math.h" •include 'defs.h1 * house () * inputs: num_rows, num_cols, lower_triangular_rows, start_row, temp_mat * outputs: temp_mat * * num_rows: number of elements, N, in the input data * num_cols: number of range gates, M, in the input data * lower_triangular_rows: the number of rows to be lower triangularized * start_row: the row number of which to start the Householder * tempjnat: temporary matrix holding various data */ house (num_rows, num_cols, lower_triangular_rows, start_row, temp_mat} int num_rows; 268 int nuin_cols; int lower_triangular_rows; int start_row; COMPLEX temp_mat[][COLS]; /* * Variables * * i, j, k: loop counters * rterop: a holder for temporary scalar data * x_square: a holder for the absolute square of complex variables * xmax_sq: a holder for the maximum of the complex absolute of variables * vec: a holder for the maximum complex vector 2 * num_cols max * sigma: a holder for a complex variable used in the Householder * gscal: a holder for a complex variable used in the Householder * alpha: a holder for a scalar variable used in the Householder * beta: a holder for a scalar variable used in the Householder */ int i, j, k; float rtemp; float x_square; float xmax_sq; COMPLEX vec[2*C0LS]; COMPLEX sigma; COMPLEX gscal; float alpha; float beta; /* * Begin function body: house {) it * Loop through temp_mat for number of rows = lower_triangular_rows. Start at * the row number indicated by the start_row input variable. */ for (i - start_row; i < lower_triangular_rows; i++) { t * for (i . .. ) V /* * Step 1: Find the maximum absolute element for each row of temp_mat, * starting at the diagonal element of each row. */ xmax_sq = 0 .0 ; for (j = i; j < num_cols; j++) [ /* for (j . . . 
) */ x_square = temp_mat[i][j].real * temp_mat[i][j1.real + temp_mat [i] [ j ] . imag * temp_jnat [i] [ j ] , imag; if (xmax_sq < x_square) ( xmax_sq = x_square; ) I /* for (j . . . ) */ /* * Step 2: Normalize the row by the maximum value and generate the complex * transpose vector of the row in order to calculate alpha = square root of * the sum square of all the elements in the row. */ xmax_sq = (float) sqrt ((double) xmax_sq); alpha = 0 .0; for (j = i; j < num_cols; j++) ( /* for (j . . . ) */ vec[j].real = temp_mat[i][j].real / xmax_sq; 269 vec[j].imag = - temp_mat[i](j],imag / xmax_sq; alpha += (vec[j].real * vec[jj.real + vec(j].imag * vec[j].imag); J /* for (j . . . > V alpha = (float) sqrt ((double) alpha); /* * Step 3: Find beta = 2 / (b (transpose) * b). Find sigma of the relevant * element = x(i) / !x(i)l. V rtemp = vec(iJ.real * vec[i).real + vec(i).imag * vec[i].imag; rtemp = (float) sqrt ((double) rtemp); beta = 1 . 0 / (alpha * (alpha + rtemp)); if (rtemp >= 1.0E-16) < sigma.real = vec[i].real / rtemp; sigma.imag = vec[i].imag / rtemp; } else { sigma.real = 1 .0 ; sigma.imag = 0 .0 ; ) /* * Step 4: Calculate the vector operator for the relevent element. */ vec(i).real += sigma.real * alpha; vec(i].imag += sigma.imag * alpha; /* * Step 5; Apply the Householder vector to all the rows of temp_jnat. */ for (k = i; k < num_rows; k++) ( I* for (k...> •/ /* * Find the scalar for finding g. */ gscal.real = 0 .0 ; gscal.imag - 0 .0 ; for (j = if j < num_cols; j+t) ( /* for (j . . . ) */ gscal.real += (temp_mat[k][j].real * vecfj].real - temp_jnat [k] [ j ] . imag * vec (j ). imag) ; gscal.imag += (temp_jnat[k](j).real * vec[j].imag + temp_mat[k][j].imag • vecfj].real); ) /* for (j . . .) */ gscal.real *= beta; gscal.imag *= beta; /* * Modify only the necessary elements of the temp_mat, subtracting gscal * * conjg (vec) from tempjnat elements. */ for (j = i; j < num_cols; j++) ( /* for (j . . .) */ temp_mat[k][j).real -- (gscal.real * vectjj.real + gscal.imag * vec(jJ .imag); temp_mat[k]tjJ•imag -= (gscal.imag * vecijj.real - gscal.real * vec[j].imag); } /* for (j. . . ) V 270 } /* for (k...) V } /* for {i . . . ) */ return; } C 3 i read_input_LIN.c /* * read_input_LIN.c */ /* * This file contains the procedure read_input_LIN <), and is part of the * parallel General benchmark program written for the IBM by the STAP benchmark * parallelization team at the University of Southern California (Prof. Kai * Hwang, Dr. Zhiwei Xu, and Masahiro Arakawa), as part of the ARPA Mountaintop * program. * * The sequential General benchmark program was originally written by Tony * Adams on 6/30/93. i t * The procedure read_input_LIN () reads the input data files (containing the * data cube and the steering vectors). * * In this parallel version of read_input_LIN (), each task reads its portion * of the data cube in from the data file simulatenously. In order to improve * disk performance, the entire data cube slice is read from disk, then * converted from the packed binary integer format to the floating point * number format. * * Each complex number is stored on disk as a packed 32-bit binary integer. * The 16 LSBs are the real portion of the number, and the 16 MSBs are the * imaginary portion of the number. * * The steering vector file contains the number of PRIs in the input data, * the target power threshold, and the steering vectors. The data is stored * in ASCII format, and the complex steering vector numbers are stored as * alternating real and imaginary numbers. 
*/ •include "stdio.h* •include "math.h* •include "defs.h* •include <sys/types.h> •include <sys/time.h> •include <mpproto.h> •ifdef DBLE char *fmt = “ %1f“; lalop char *fmt = *%f -; •endif extern int taskid, numtask, allgrp; read_input_LIN () inputs: data_name, str_name outputs: v4, data_cube data_name: the name of the file containing the data cube 271 * str_name: the name o£ the file containing the input vectors * v4: storage for the v4 vector * data_cube: input data cube */ read_input_LIN (data_name, str_name, v4, data_cube) char data_name[],- char str_name[]; COMPLEX v4[]; COMPLEX data_cube[] (DIM2] [DIM3],- * Variables * * diml, dim2, dim3: dimension 1, 2, and 3 in the input cube, respectively * temp: buffer for binary input integer data * tempi, temp2 : holders for integer data * f_templ, f_temp2 : holders for float data * junk: a temporary variable * f_power: holder for float power data * i, j, k: loop counters * local_int_cube: the integer version of the local portion of the data cube * blklen, rc: variables used for MPL calls V int diml = DIM1; int dim2 = DIM2; int dim3 = DIM3; unsigned int temp[l]; long int tempi, temp2; float f_templ, f_temp2 ; float junk; float f_pwr; int i, j, k; FILE *fopen< >; FILE *f dat, * f_vec; unsigned int local_int_cube[DIM2][DIM1][DIM3t; long blklen, rc; /* * Begin function body: read_input_LIN (), f t * Read in the data_cube file in parallel. */ if Mf_dat = fopen (data_name, "r")) == NULL) j printf ("Error - task %d unable to open data file.Vn", taskid); exit (-1 ); ) fseek (f_dat, taskid * diml * dim2 * dim3 * sizeof (unsigned int), 0); fread (local_int_cube, sizeof (unsigned int), diml * dim2 * dim3, f_dat); fclose (f_dat); /* * Convert the data from the unsigned integer format to floating point * format. */ for (i = 0 ; i < diml; i++) for (j = 0 ; j < dim2 ; j++) for (k = 0; k < dim3; k++) ( temp[0 ] = local_int_cube(j] [i) [k]; tempi = OxOOOOFFFF fc temp[0]; tempi = (tempi & 0x00008000) ? tempi I OxffffOOOO : tempi; 272 temp2 = (temp[0] >> 16) & OxOOOOFFFF; temp2 = (temp2 & 0x00008000) ? temp2 I OxffffOOOO : temp2; data_cube[i][j] ik].real = (float) tempi; data_cube[i][jj fk],imag = (float) temp2 ; } (* * Open the vector file using the file pointer f_vec. V if <{f vec = fopen (str_name, "r*>) == NULL) ( printf (* Cant open input steering vector file %s\n ", str_name); exit ( 1 ); > /* * Read the vectors from the input file with vectors vl, v2, v3, and v4 in * order. V for (i = 0; i < V4; i++) ( fscanf (f_vec, fmt, &v4[i].real); fscanf (f_vec, fmt, &v4[ij.imag); ) fclose(f_vec); return; } C3 .6 defsJ) /* defs.h *! /* 94 Aug 25 We added a «define macro to compensate for the lack of a */ /* log2 function in the IBM version of math.h */ /* 94 Sep 13 - We added a definition for number of clock ticks per */ /* second, because this varies between the Sun OS and the IBM */ /* AIX OS. 
 */

/* Sun OS version - 60 clock ticks per second */
#ifdef SUN
#define CLK_TICK 60.0
#endif

/* IBM AIX version - 100 clock ticks per second */
#ifdef IBM
#define log2(x) ((log(x))/(M_LN2))
#define CLK_TICK 100.0
#endif

#ifdef DBLE
#define float double
#endif

typedef struct {
    float real;
    float imag;
} COMPLEX;

/* used in new fft */
#define SWAP(a,b) {float swap_temp=(a).real;(a).real=(b).real;(b).real=swap_temp;\
                   swap_temp=(a).imag;(a).imag=(b).imag;(b).imag=swap_temp;}

#define AT_LEAST_ARG 2
#define AT_MOST_ARG 4
#define ITERATIONS_ARG 3
#define REPORTS_ARG 4

#define USERTIME(T1,T2)   ((t2.tms_utime-t1.tms_utime)/CLK_TICK)
#define SYSTIME(T1,T2)    ((t2.tms_stime-t1.tms_stime)/CLK_TICK)
#define USERTIME1(T1,T2)  ((time_end.tms_utime-time_start.tms_utime)/CLK_TICK)
#define SYSTIME1(T1,T2)   ((time_end.tms_stime-time_start.tms_stime)/CLK_TICK)
#define USERTIME2(T1,T2)  ((end_time.tms_utime-start_time.tms_utime)/CLK_TICK)
#define SYSTIME2(T1,T2)   ((end_time.tms_stime-start_time.tms_stime)/CLK_TICK)

#ifndef LINE_MAX
#define LINE_MAX 256
#endif

#define TRUE 1
#define FALSE 0

#define COLS 1536     /* Maximum number of columns in holding vector "vec" */
                      /* in house.c for max columns in Householder multiply */
#define MBEAM 12      /* Number of main beams */
#define ABEAM 20      /* Number of auxiliary beams */
#define PWR_BM 9      /* Number of max power beams */
#define V1 64
#define V2 32/NN
#define V3 1536
#define V4 V1
#define DIM1 V1       /* Number for dimension1 in input data cube */
#define DIM2 V2       /* Number for dimension2 in input data cube */
#define DIM3 V3       /* Number for dimension3 in input data cube */
#define DOP_GEN 2048  /* MAX Number of dopplers after FFT */
#define PRI_GEN 2048  /* MAX Number of points for 1500 data points FFT */
                      /* zero filled after 1500 up to 2048 points */

#define NUM_MAT V2    /* Number of matrices to do householder on */
                      /* NUM_MAT must be less than DIM2 above */

#ifdef APT
#define PRI 256            /* Number of pris in input data cube */
#define DOP 256            /* Number of dopplers after FFT */
/* #define RNG 280 */      /* Number of range gates in input data cube */
#define RNG 256            /* Number of range gates in input data cube */
#define EL 32              /* number of elements in input data cube */
#define BEAM 32            /* Number of beams */
#define NUMSEG 7           /* Number of range gate segments */
#define RNGSEG RNG/NUMSEG  /* Number of range gates per segment */
#define RNG_S 320          /* Number of sample range gates for 1st step beam forming */
#define DOF EL             /* Number of degrees of freedom */
extern COMPLEX t_matrix[BEAM][EL];   /* T matrix */
#endif

#ifdef STAP
#define PRI 128            /* Number of pris in input data cube */
#define DOP 128            /* Number of dopplers after FFT */
#define RNG 1250           /* Number of range gates in input data cube */
#define EL 48              /* number of elements in input data cube */
#define BEAM 2             /* Number of beams */
#define NUMSEG 2           /* Number of range gate segments */
#define RNGSEG RNG/NUMSEG  /* Number of range gates per segment */
#define RNG_S 288          /* Number of sample range gates for beam forming */
#define DOF 3*EL           /* Number of degrees of freedom */
#endif

#ifdef GEN
#define EL V4
#define DOF EL
#endif

extern int output_time;    /* Flag: if set TRUE, output execution times */
extern int output_report;  /* Flag: if set TRUE, output data report files */
extern int repetitions;    /* number of times program has executed */
extern int iterations;     /* number of times to execute program */

typedef struct max_index {
    float value;
    int loc1, loc2, loc3;
} INDEXED_MAX;

extern void xu_index_max (in1, in2, out, len)
INDEXED_MAX in1[], in2[], out[];
int *len;
{
    int i, n;

    n = *len / sizeof(int);
    for (i = 0; i < n; i++) {
        if (in1[i].value > in2[i].value) {
            out[i].value = in1[i].value;
            out[i].loc1  = in1[i].loc1;
            out[i].loc2  = in1[i].loc2;
            out[i].loc3  = in1[i].loc3;
        }
        else {
            out[i].value = in2[i].value;
            out[i].loc1  = in2[i].loc1;
            out[i].loc2  = in2[i].loc2;
            out[i].loc3  = in2[i].loc3;
        }
    }
}

C.3.7 compile_lin

mpcc -O3 -qarch=pwr2 -DGEN -DIBM -o lin bench_mark_LIN.c cmd_line.c forback.c house.c read_input_LIN.c -lm

C.3.8 run.032

poe lin /scratch1/masa/lin_data -procs 32 -us
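For reference, the packed complex input format described in the header comments of bench_mark_LIN.c and read_input_LIN.c (each sample is one 32-bit word whose 16 LSBs hold the real part and whose 16 MSBs hold the imaginary part, both as signed 16-bit integers) can be exercised with a small stand-alone sketch such as the one below. It mirrors the intent of the conversion loop in read_input_LIN.c but is not taken from the benchmark sources; the portable sign extension and the names unpack_sample and word are illustrative only.

/* unpack_demo.c -- illustrative sketch only; not part of the benchmark code */
#include <stdio.h>

typedef struct { float real; float imag; } COMPLEX;

/* Unpack one 32-bit word: 16 LSBs = real part, 16 MSBs = imaginary part,
   each interpreted as a signed 16-bit integer and converted to float. */
static COMPLEX unpack_sample (unsigned int word)
{
    int re = (int) (word & 0xFFFFu);          /* low 16 bits  */
    int im = (int) ((word >> 16) & 0xFFFFu);  /* high 16 bits */
    COMPLEX c;

    if (re & 0x8000) re -= 0x10000;           /* sign-extend 16 -> 32 bits */
    if (im & 0x8000) im -= 0x10000;
    c.real = (float) re;
    c.imag = (float) im;
    return c;
}

int main (void)
{
    /* 0xFFFF0001 packs real = +1 (low half) and imaginary = -1 (high half). */
    COMPLEX c = unpack_sample (0xFFFF0001u);

    printf ("real = %.1f, imag = %.1f\n", c.real, c.imag);  /* prints 1.0 and -1.0 */
    return 0;
}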