Production-level test issues in delay line based asynchronous designs
PRODUCTION-LEVEL TEST ISSUES IN DELAY LINE BASED ASYNCHRONOUS DESIGNS

by Yang Zhang

A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, in Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (ELECTRICAL ENGINEERING)

December 2018

Copyright 2018 Yang Zhang

Acknowledgements

First and foremost, I would like to offer my sincere appreciation to my advisor, Prof. Peter A. Beerel, for guiding and supporting me in my research on asynchronous designs. I appreciate all his contributions of time, ideas, and funding to enrich my Ph.D. experience. In particular, he offered help and guidance even during weekends and while traveling.

Regarding my Ph.D. admission, I would like to thank Prof. Melvin Breuer, who hired me in 2013. He guided my research during the first two years of my Ph.D. program and published three papers with me. Because of his ill health at the time, he then handed me over to my current advisor. After a long battle with cancer, Prof. Breuer passed away on a Saturday afternoon, January 28, 2017.

For this dissertation, I would like to thank my dissertation committee members (Prof. Sandeep Gupta, Prof. Aiichiro Nakano, and Prof. Peter Beerel) for their time, interest, and helpful comments. I would also like to thank the other two members of my qualifying committee (Prof. Jeffrey Draper and Prof. Alice Parker) for their time and insightful questions.

In addition, I would like to mention the names of those who offered invaluable support over the past five years: Xin Yue, Zhenzhen Liu, Wen Huang, and Xiaozhuo Yang. They are among the best friends I can always rely on. And thanks go to Xiaopang, a guinea pig who was my companion from 2013 to 2016.

Lastly, special thanks go to my family for all their love and encouragement. They raised me; funded my studies since kindergarten; and taught me to be a humble, self-motivated, and ambitious person. My father encouraged me to pursue the Ph.D. program.
My mother spent plenty of time with me during my high school days and flew to the US twice when I was not able to go home. Thank you.

Table of Contents

Acknowledgements
List of Figures
List of Tables
Abstract
Chapter 1: Introduction
  1.1 Basic Bundled-data Design Template
  1.2 Timing Resilient Bundled-Data Design Template
  1.3 Timing constraints
    1.3.1 Setup Constraint
    1.3.2 Hold Time
  1.4 Fault Models
  1.5 Contributions of Thesis
  1.6 Thesis Organization
Chapter 2: Optimizing Yield Given an SPQL Constraint
  2.1 Key parameters
    2.1.1 Critical paths
    2.1.2 Ratio of the delay line for test to delay line (X and X')
    2.1.3 Yield
    2.1.4 Shipped product quality loss
    2.1.5 Aging effects
  2.2 Uniform Test Delay Ratio given Average-case Performance Constraints for Bundled Data Design
    2.2.1 Monotonicity of SPQL over Test Delay Ratios
    2.2.2 Yield Optimization
    2.2.3 Monte Carlo Simulation and Results Discussion (Uniform Test Delay Ratio)
    2.2.4 Measuring Correlations
    2.2.5 Results
    2.2.6 Application of Theory
  2.3 Per-chip Test Delay Ratio X(L) given Average-case Performance Constraints for Bundled Data Design
    2.3.1 Monte Carlo Simulation and Results Discussion (Per-chip Test Delay Ratio)
    2.3.2 Global and Local Variations
    2.3.3 Per Delay Line
  2.4 Uniform and Per-chip X(L) given Worst-case Performance Constraints for Bundled Data Design
    2.4.1 Uniform Test Delay Ratio
    2.4.2 Per-chip Test Delay Ratio
    2.4.3 Monte Carlo Simulation Results
    2.4.4 Hold Time
  2.5 Aging analysis
    2.5.1 Introduction to NBTI aging model
    2.5.2 Results
  2.6 Extension to TR-BD Design
Chapter 3: Edge: a Yield-Aware Synthesis and PnR Flow
  3.1 Challenges in Asynchronous Designs
  3.2 Physical Design using the ACDC Flow
    3.2.1 ACDC Flow
    3.2.2 Relative timing constraints
    3.2.3 Adding Test Margin using ACDC
    3.2.4 Edge: Integration of Synthesis and ACDC flow
    3.2.5 Cosimulation: Verification using UVM structure
    3.2.6 Tested using an industrial design
  3.3 Scan Structure and Programmable Delay Line for Bundled Data Design
    3.3.1 Introduction to Click Controllers
    3.3.2 Scan Cells for Edge
    3.3.3 Test Procedure
  3.4 S2A and A2S interfaces
    3.4.1 Introduction to sync-async domain crossing
    3.4.2 S2A Structure
    3.4.3 S2A Petri-net and timing constraints
    3.4.4 A2S Structure
    3.4.5 A2S Petri-net model and timing constraints
Chapter 4: Summary
  4.1 Summary of Accomplishments
  4.2 Conclusions and Possible Next Steps
Bibliography

List of Figures

Figure 1.1 Four types of chips (modified from [1])
Figure 1.2 Bundled-data Design
Figure 1.3 (a) Four phase handshaking (b) Two phase handshaking
Figure 1.4 Blade: Timing Violation Resilient Design
Figure 1.5 Timing Diagram of a BD template
Figure 1.6 Timing Diagram of TRBD template
Figure 2.1 Normally distributed test parameters
Figure 2.2 Full-buffer channel net model of the performance
Figure 2.3 Illustration of the hold time constraint
Figure 2.4 16-bit Carry Select Adder
Figure 2.5 Programmable delay line with MUX
Figure 2.6 Delay line (XL) vs Delay line for test (L)
Figure 2.7 Yield of BD/RO to SYNC vs $\rho_{T,XL}$
Figure 2.8 log(SPQL) vs X
Figure 2.9 Yield vs required SPQL
Figure 2.10 1 Stage Sample Circuit
Figure 2.11 Comparison of Yields
Figure 2.12 Per-chip Test Margin
Figure 2.13 Yield of BD/RO designs under worst-case performance constraints and PVT variations
Figure 2.14 Yield of BD/RO designs under worst-case performance constraints and process variation
Figure 3.1 Undetectable Stuck-at fault
Figure 3.2 Scan Cell Proposed in [2]
Figure 3.3 Functional Test for GALS system
Figure 3.4 ACDC flow using existing commercial EDA tools
Figure 3.5 Target bundled-data template for Edge
Figure 3.6 Edge flow using existing commercial EDA tools
Figure 3.7 Target industrial design, showing the initial decomposition of asynchronous and synchronous islands
Figure 3.8 Non-token Click Controller
Figure 3.9 Token Click Controller
Figure 3.10 Scan Cell
Figure 3.11 Timing of launch-off-shift [3]
Figure 3.12 Timing of launch-off-capture
Figure 3.13 Synchronous to asynchronous domain crossing
Figure 3.14 Petri-net for S2A
Figure 3.15 Asynchronous to synchronous domain crossing
Figure 3.16 Petri-net for A2S

List of Tables

2.1 Analysis dimensions
2.2 Analysis of the correlation matrix due to PVT variations
2.3 and versus required SPQL
2.4 Yield Comparison under Large Hold Time and Average-case Performance Constraints
2.5 Yield of BD/RO over SYNC given SPQL and Worst-case Performance Constraints
2.6 Yield Comparison under Large Hold Time and Worst-case Performance Constraints
2.7 Analysis of the Critical Path under Test and Delay Line over 9 Years

Abstract

The use of bundled-data and bundled-data resilient design with programmable delay lines has been proposed to combat process, voltage, and temperature (PVT) variations. The programmable delay lines offer the opportunity to add test margin into such designs, in which the delay line in shipped products is set slower than that which is successfully tested. How to set the test margin to maximize yield given manufacturing constraints has not yet been explored. We create mathematical models to analyze this problem and develop practical schemes to optimally set test margins. The underlying hypothesis is that the correlation between the delay of the programmable delay lines and combinational logic in bundled data asynchronous designs will lead to higher yields than comparable synchronous designs. This thesis aims to test this hypothesis by completing the following steps.

1. Adopt a mathematical model used in synchronous design to quantify yield and manufacturing constraints of a variety of bundled-data design styles. This will involve using a canonical model with normal distribution for delays, slacks, as well as the correlation between such delays.

2. Mathematically prove that the optimal yield of bundled data designs is higher than their synchronous counterparts under certain conditions and quantify this advantage.

3.
Explore schemes for optimally setting both uniform and per-chip test margin. Uniform test margin is conservative, implying a fixed margin for all chips. In contrast, the per-chip method changes test margins based on chip performance and can lead to higher yields.

4. Analyze two delay lines, the forward and backward latency delay lines, and propose a strategy to tune them post-silicon to balance performance and yield.

5. Implement a yield-aware placement and routing physical design flow for bundled data. We will show how commercial EDA tools [4] [5] can be used to place, route, and configure the delay lines within the bundled data design. The package will be open sourced and distributed online.

6. Test the flow on industry-scale designs that will include scan chains and controllers for test to support post-silicon characterization, tuning of the delay lines, and efficient manufacturing test.

Chapter 1 Introduction

As mobile electronic and Internet of Things devices become more popular, the demand for lower power consumption has increased. However, as CMOS technology scales down and supply voltages are lowered, process, voltage, and temperature (PVT) variations have become a limiting factor in traditional synchronous design. Alternatives, such as the use of asynchronous circuits, have begun to gain popularity with the promise of low power consumption and resilience to PVT variations. However, high-volume manufacturing test of asynchronous circuits is largely uncharted territory and is considered difficult due to the lack of a global clock. In particular, production-level test issues for these circuits, such as techniques to increase yield and limit shipped product quality loss (SPQL), have not been fully explored yet, and configuration and at-speed test issues remain largely open.

Of particular interest to this work are bundled data (BD) asynchronous designs [6] and their timing resilient variants [7], both of which use programmable delay lines to match the delay of combinational logic.
The basic BD design template is shown in Figure 1.2. These asynchronous design techniques are a promising means of combating increased PVT variations because the delay of the delay line often tracks the delay of the critical path [8], [9]. BD designs have been the focus of many proposed test schemes (see e.g., [10]), but neither the analysis and optimization of the associated manufacturing test metrics nor the configuration of the programmable bundled-data delay lines has received much attention.

Basic fault models for analysis include stuck-at and delay faults [11, 12]. A stuck-at-0 fault on a signal means that the signal is tied to 0. Alternatively, this situation can be described by saying that the signal will take an "infinite" amount of time to rise from 0 to 1, so a stuck-at fault is a special case of a delay fault. In traditional synchronous design, the clock period must be sufficiently long to ensure the setup time of the sequential elements is satisfied. In BD design, on the other hand, it is the delay line that must be sufficiently long to ensure the setup time of the sequential elements is satisfied. In both cases, delay faults, in which the delay of a path in the combinational logic differs from what is expected, can lead to circuit failures.

We expect the programmable delay lines will be analyzed during chip characterization, tested at a particular delay setting, and shipped at a possibly different, longer-delay setting. The difference between the test frequency and the chip's shipped frequency is called test margin. This margin may be necessary to account for: 1) incomplete test coverage, in which the critical path under test (T) may be different from the actual critical path (C); and 2) temperature and voltage during actual operation that may be different from those under test.

Some of the shipped chips may actually be deficient, leading to what are typically called shipped product quality loss (SPQL) chips. In contrast, some of the rejected chips may actually be good, resulting in yield loss.
As the selection of test paths is not ideal, it is apparent that the four types of chips shown in Figure 1.1 can be produced [1].

Figure 1.1: Four types of chips (modified from [1])

- Good chips: test paths pass the test and chip performance meets the customer specification.
- Bad chips: test paths fail the test and chip performance does not meet the customer specification.
- Yield loss chips: test paths fail the test but chip performance meets the customer specification.
- SPQL chips: test paths pass the test but chip performance does not meet the customer specification.

Yield and SPQL can be expressed as

$$\text{Yield} = \frac{\text{Good chips} + \text{SPQL chips}}{\text{All chips}} \tag{1.1}$$

$$\text{SPQL} = \frac{\text{SPQL chips}}{\text{SPQL chips} + \text{Good chips}} \tag{1.2}$$

Yield and SPQL analysis for synchronous design has been explored by many researchers (e.g., [13] [14]), but the corresponding analysis for BD designs and their timing resilient variants has been insufficient. The biggest difference between such BD designs and traditional synchronous designs is that the delay lines track the delay of the combinational logic. That is, the delays are correlated. This is unlike a traditional synchronous design, in which the clock frequency is fixed. Interestingly, this correlation is similar to what is seen in a synchronous circuit driven by a ring oscillator (RO). The difference is that in BD designs there may be multiple delay lines associated with different pipeline stages of the design, compared to a single delay line making up the RO. Due to proximity effects, we would expect the correlation of a BD design to be larger than that of an RO-based synchronous design. Nevertheless, the mathematical analysis of yield and SPQL is identical. The correlation between combinational logic and ring oscillators has been analyzed in [15]. This analysis showed that global PVT variation in such designs does not cause as many problems as in traditional synchronous designs.
Moreover, it showed that local variation affects the performance of synchronous circuits with ring oscillators, but only to a limited extent. However, this analysis did not include a formal analysis of manufacturing test metrics, such as test margin, yield, and SPQL.

The thesis adopts a mathematical model of delay proposed in synchronous yield analysis [16] to model both BD and RO-based designs and compare their yield and SPQL to equivalent traditional synchronous designs. The yield advantage of the correlated designs is quantified, given the correlation coefficient between the combinational logic and the delay line for test. Based on this model, we propose methods to determine the uniform and per-chip test margin needed to maximize yield while meeting a required SPQL. The analysis applies directly to TR-BD when a worst-case performance constraint is given. For average-case conditions we propose modifications to the definitions of yield and SPQL, as explained in Section 2.6. To complement and support our mathematical model, we propose an open-source design flow that automates synthesis and PnR and improves yield for bundled-data design.

Figure 1.2: Bundled-data Design

1.1 Basic Bundled-data Design Template

Asynchronous communication is the transmission of data, generally without the use of an external clock signal, where data can be transmitted intermittently rather than at regular intervals. Quasi-delay-insensitive (QDI) circuits and bundled data circuits are two popular asynchronous design styles. A QDI circuit is invariant to the delays of any of the circuit's wires or elements, except that it assumes certain fan-outs are isochronic. It reduces design time, but often requires more area and consumes more power. In contrast, bundled-data design offers smaller area and lower power consumption, but imposes more timing constraints. Asynchronous blocks communicate using asynchronous channels, which are simply a bundle of wires and a protocol to synchronize computation and transfer data between blocks.
Various asynchronous communication channels have been developed, with trade-offs between robustness and performance. This section introduces the basic bundled-data design template.

The basic bundled-data channels consist of request and acknowledge lines and a data bus, as shown in Figure 1.2. The sequential elements in bundled data designs can be flip-flops or latches. The controller initiates the transfer of valid data between pipeline stages by asserting the request signal. Once the data has been consumed by the receiving stage, the acknowledge wire is asserted by the receiving stage's controller. This sequence of events is presented in Figure 1.3.

Figure 1.3: (a) Four phase handshaking (b) Two phase handshaking

The request/acknowledge handshaking protocol comes in two different flavors. With four-phase handshaking, the rising edge of request/acknowledge indicates a valid transaction, as illustrated in Figure 1.3-a. The rising edge of the request indicates to the receiver that valid data, sometimes referred to as a token, is available on its inputs. The rising edge of the acknowledge line indicates that the data has been consumed by the next stage. After the rising edge of acknowledge, both request and acknowledge are reset to logic 0, in order to set up for a subsequent data transaction. In contrast, in the two-phase bundled data protocol, there is no difference in meaning between the rising and falling transitions of the request and acknowledge handshaking lines, as shown in Figure 1.3-b. Both transitions of the request line indicate the presence of a new token. The controller can initiate the next data transaction right after the acknowledge signal toggles.

In both cases, bundled data design uses matched delay lines to track the delay of the combinational single-rail logic. One traditional bundled-data template, named micropipelines [6] and proposed by Ivan Sutherland, uses a two-phase protocol and serves as a point of departure for several more advanced approaches. For example, Muller et al. present a four-phase version [17], which uses a similar control structure.

Bundled-data (BD) designs have switching activity similar to that of their synchronous counterparts because the combinational logic is unchanged, and the total area is also similar because the area of the control circuits and delay lines is comparable to that of a clock tree. One benefit is that, when not needed, no token is sent to the pipeline and it remains idle, consuming only static power. This has led many researchers to claim that asynchronous circuits offer a form of perfect clock gating [18] [19].

One challenge in BD designs is that the delay line must be conservatively designed to be longer than the worst-case delay of its corresponding logic under all possible process, voltage, and temperature (PVT) corners [20]. This conservatism can take away much of the advantage of BD design. Researchers have proposed to mitigate this problem by duplicating the BD delay lines [21], constraining the design to regular structures such as PLAs [22], and using soft latches [23]. Others have proposed adding timing resilient features on top of the bundled data architecture [7], as described below.

1.2 Timing Resilient Bundled-Data Design Template

Timing resilient bundled-data design is an extension to bundled data design that adds logic for timing error detection and correction into the template. In addition to mitigating the effect of PVT variation, it increases the average performance by taking advantage of data dependencies. Many synchronous design techniques for timing resilient designs have been explored that address delay variations. Canary FFs have been proposed that predict when the design is close to a timing failure [24]. Designs can then adjust their supply voltage or clock frequency either statically or dynamically to ensure the circuit is working at the edge of failure.
In addition, Razor [25] circuits have been proposed that contain in situ timing violation detection mechanisms that allow the circuits to recover from timing errors via architectural replay or automatic pipeline stalling, further removing margin.

Blade is a recently proposed asynchronous template for timing resilient design and is based on the pipeline block diagram shown in Figure 1.4. Like basic bundled-data design, Blade uses single-rail logic, but in Blade the logic is followed by error detecting latches (EDLs) that are controlled using two reconfigurable delay lines and an asynchronous Blade controller. The first delay line is of duration $\delta$ and controls when the EDL becomes transparent, allowing the data to propagate through the latch. The Blade controller speculatively assumes that the data at the input of the EDL is stable when it becomes transparent and thus sends an output request along the typical bundled data channel L/R. The second delay line, with duration $\Delta$, defines the time window during which the EDL is transparent. If data changes during this window, but stabilizes before the latch becomes opaque, it is recorded as a timing violation, which can subsequently be corrected. Consequently, $\Delta$ defines a timing resiliency window (TRW) after $\delta$ during which the speculative timing assumption may be safely violated. In particular, if the combinational output transitions during the TRW, the error detection logic flags a timing violation by asserting its Err signal, which is sampled by the controller. The Blade controller then communicates with its right neighbor using a novel handshaking protocol, implemented with an additional error channel (RE/LE), to recover from the timing violation by delaying the opening of the next stage's latch.

Figure 1.4: Blade: Timing Violation Resilient Design

There are several types of error detecting storage elements [26].
For example, soft-error-tolerant storage elements can be adapted to produce an error signal in response to a delay fault. Many of these storage elements, however, have the disadvantage of delaying a stable output, either by delaying inputs until validity is guaranteed, by requiring pipeline flushes to recover from errors, or by correcting errors on subsequent clock cycles [27, 28]. Other techniques use elaborate clock generation or time borrowing to allow error correction within the same clock period [29]. As in Bubble Razor [30], however, we propose using error-detecting latches that detect whether signals are still changing after the latch goes transparent and, if so, generate an associated error signal to the controller. The latched value is valid as long as the data becomes valid before the latch becomes opaque. In other words, the pulse width of the latch determines how much time borrowing (timing resiliency) is allowed.

The proposed Blade template implements a new form of asynchronous handshaking called speculative handshaking, illustrated in Figure 1.4. A request signal between blocks is speculatively asserted assuming the delay line of length $\delta$ is sufficiently long and no timing violation occurs. A secondary 1-bit extend signal is asserted to indicate that this assumption was incorrect and a violation was detected. This extend signal tells the next pipeline stage that the correct data will be stable at the output of the latch later than the speculative request anticipated and that it must compensate for this. In particular, if extend is asserted, the next stage must add a delay of $\Delta$, the same delay as the pulse width of the previous latch.

1.3 Timing constraints

In synchronous design, the setup time of a storage element defines the minimum amount of time its input data should be stable before the sampling clock event. It plays a role in constraining the maximum delay of the combinational logic, $t_{\text{critical path}}$, as in the following equation.
$$t_{\text{c2q}} + t_{\text{critical path}} + t_{\text{setup}} + t_{\text{margin}} \leq t_{\text{clk period}}, \tag{1.3}$$

where $t_{\text{c2q}}$ is the delay from the clock event to the output of the registers, and $t_{\text{margin}}$ is needed to accommodate uncertainty such as clock skew and PVT variation.

The hold time of a storage element in synchronous design, on the other hand, defines the minimum amount of time data should be stable after the clock event. It is important when considering the shortest delay through the combinational logic, as shown below.

$$t_{\text{c2q}} + t_{\text{shortest path}} \geq t_{\text{hold}} + t_{\text{margin}} \tag{1.4}$$

Bundled-data design and timing resilient bundled-data design have somewhat different timing constraints for setup and hold time, as described below.

1.3.1 Setup Constraint

The setup constraint for bundled data design, including its timing-resilient variants, differs somewhat from the equation for synchronous design. Instead of the clock period, the delay of the delay line constrains the maximum critical path delay, as shown in Equation 1.5.

$$t_{\text{c2q}} + t_{\text{critical path}} + t_{\text{setup}} + t_{\text{margin}} \leq t_{\text{delay line}} + t_{\text{borrow}}. \tag{1.5}$$

As in Equation 1.3, $t_{\text{margin}}$ for a bundled-data design consists of local clock tree differences and PVT-induced uncertainty. However, the PVT uncertainty that matters is the local rather than the global variation, as is discussed in Section 2.3.2. The arrival time of the local clock, stage CLK in Figure 1.5, is affected by the local clock trees that fan out from the controllers. Because each controller drives a different number of FFs, the clock trees are of different sizes, and this difference is considered part of $t_{\text{margin}}$. In contrast, $t_{\text{margin}}$ in synchronous design needs to accommodate both global and local variations. Note that $t_{\text{delay line}}$ represents the delay line DL in Figure 1.5 and $\delta + \Delta$ in Figure 1.6. Note also that $t_{\text{borrow}}$ is added to the right side of the inequality only if time borrowing is allowed.

Figure 1.5: Timing Diagram of a BD template
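The setup constraints in Equations 1.3 and 1.5 differ only in their right-hand side, which the following minimal Python sketch tries to make concrete. The class `StageTiming`, the function names, and the picosecond values are hypothetical and chosen purely for illustration; they are not taken from the thesis or from any EDA tool.

```python
from dataclasses import dataclass


@dataclass
class StageTiming:
    """Per-stage delays in picoseconds (illustrative values only)."""
    t_c2q: int
    t_critical_path: int
    t_setup: int
    t_margin: int

    def required_time(self) -> int:
        # Left-hand side shared by Equations 1.3 and 1.5.
        return self.t_c2q + self.t_critical_path + self.t_setup + self.t_margin


def sync_setup_ok(stage: StageTiming, t_clk_period: int) -> bool:
    """Equation 1.3: the clock period bounds the critical path."""
    return stage.required_time() <= t_clk_period


def bd_setup_ok(stage: StageTiming, t_delay_line: int, t_borrow: int = 0) -> bool:
    """Equation 1.5: the delay line (plus any time borrowing) bounds it."""
    return stage.required_time() <= t_delay_line + t_borrow


def min_delay_line(stage: StageTiming, t_borrow: int = 0) -> int:
    """Smallest delay-line setting that still satisfies Equation 1.5."""
    return stage.required_time() - t_borrow


stage = StageTiming(t_c2q=100, t_critical_path=700, t_setup=50, t_margin=50)
print(sync_setup_ok(stage, t_clk_period=1000))          # True
print(bd_setup_ok(stage, t_delay_line=850))             # False
print(bd_setup_ok(stage, t_delay_line=850, t_borrow=50))  # True
print(min_delay_line(stage))                            # 900
```

Keeping the shared left-hand side in one method mirrors the observation in the text that only the bounding quantity changes between the synchronous and bundled-data cases.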
1.3.2 Hold Time

The hold time analysis is classified into three categories: flip-flop (FF) based bundled data (BD), latch based bundled data (LB), and timing resilient bundled data (TR-BD). Their hold time constraints are slightly different, due to the structural differences between the templates.

$$t_{\text{c2q}} + t_{\text{shortest path}} + t_{\text{margin}} + t_{\text{backward latency}} \geq t_{\text{hold}} \tag{1.6}$$

$$t_{\text{c2q}} + t_{\text{shortest path}} + t_{\text{margin}} + t_{\text{backward latency}} \geq t_{\text{hold}} + t_{\text{overlap}} \tag{1.7}$$

$t_{\text{backward latency}}$ is labeled as BL in Figure 1.5. It is the time interval from stage 2 receiving the request from stage 1 to stage 1 receiving the acknowledge from stage 2. The backward latency mitigates hold time issues for flip-flop based bundled data design, since it postpones the rising clock edge generated by the previous stage. The hold time constraint for FF based BD is defined in Equation 1.6, where $t_{\text{backward latency}}$ is typically large enough to satisfy the constraint. The hold time constraints of latch based BD and TR-BD, in contrast, are defined using Equation 1.7. The details of these hold time constraints are described as follows.

- FF based BD's hold time: One of the benefits of FF-based BD designs is that they mitigate hold time issues. In particular, the backward latency associated with the asynchronous handshaking lower-bounds the time between a FF being clocked and new data arriving at its input. This latency is typically much larger than the hold times of the FFs, which means that hold time concerns in such designs are largely mitigated.

Figure 1.6: Timing Diagram of TRBD template

- Latch based BD's hold time: The BL in Figure 1.5 does not guarantee non-overlapping clocks. Unlike FFs, latches in BD are transparent while two clocks overlap with each other. For example, when the local clock pulse width is larger than BL, the clocks of stages 1 and 2 are both high for a short period of time. Equation 1.7 is used to constrain the hold time of latch based BD design. The overlapping period is defined by a combination of the backward latency and the local clock pulse width. If the backward latency is larger than the clock pulse width, the overlapping period can be negative.

- TR-BD's hold time: TR-BD designs have a hold time issue because they are driven by two clocks, which can overlap with each other for a short period of time. The overlap period in Equation 1.7 requires a larger delay for the shortest path of the combinational logic. TR-BD uses the same equation for its hold time constraint. However, the overlapping period depends on the delays of the two delay lines. The timing of TR-BD is shown in Figure 1.6. Ideally, the duty cycle of the clock is defined by the two delay lines, $\delta$ and $\Delta$. If $\Delta$ is larger than $\delta$, the first falling edge of the stage 1 clock arrives after the first rising edge of the stage 2 clock. Thus, to avoid this problem, designers ensure that $\Delta$ does not exceed $\delta$.

1.4 Fault Models

To analyze the yield and SPQL of bundled-data design and its variants, we must first adopt a model of the faults expected during manufacturing. The two most common fault models are the stuck-at and delay fault models.

The stuck-at fault model [31] assumes individual signals or pins can be stuck at logical '1' or '0'. It mimics the effect of a variety of manufacturing defects within an integrated circuit [32]. For example, if an internal line and the power line are shorted, then the internal line can be considered to be permanently stuck-at-1. Stuck-at fault (SAF) models include both single and multiple SAFs. The single SAF model assumes that only one input on one gate will be faulty at a time. If more are faulty, a test set that can detect any single fault should easily find most multiple faults [11].

The delay fault model [33] considers a circuit faulty if the delay of any path from PI to PO or between pipeline stages exceeds the operational system clock period. An SA0 or SA1 can be modeled as a delay fault in which the signal takes an "infinite" amount of time to change to 1 or 0, respectively.
A delay test relies on vector pairs, and all designated input transitions are supposed to occur at the same time. Unlike a stuck-at fault test, a delay fault test set must be applied at speed, i.e., at the expected delay of the pipeline stage.

Stuck-at fault tests do not require stochastic analysis to evaluate. In contrast, delay fault test analysis is stochastic: the delays of all gates and paths vary from chip to chip, and thus a mathematical delay model is required for such an analysis. A canonical delay model [16] for gate delays, slacks, and slews can be expressed as

$$a_0 + \sum_{i=1}^{n} a_i Y_i + a_{n+1} R_a \tag{1.8}$$

where $a_0$ represents the mean value, $Y_i$ models global process variations, and $R_a$ models other variations. $Y_i$ and $R_a$ are assumed to be zero-mean, unit-variance Gaussians. Coefficients $a_1$ to $a_{n+1}$ are sensitivities to the corresponding variations.

The critical path under test ($T$ or $T'$), the actual critical path ($C$ or $C'$), and the delay of the delay line ($L$ or $L'$) are modeled using form 1.8 with different parameters. $T$ and $C$ are the paths with the longest delay, which determine the setup constraint, whereas $T'$ and $C'$ are the paths with the shortest delay, which determine the hold constraint. $L$ and $L'$ are the forward and backward latency delay lines, as shown in Figure 1.2.

We assume that $T$, $C$, and $L$ are correlated to some degree. We thus introduce correlation coefficients $\rho_{T,C}$, $\rho_{T,L}$, and $\rho_{C,L}$ to quantify their correlations.

$$\rho_{T,C} = \frac{\mathrm{cov}[T,C]}{\sigma_T \sigma_C} \tag{1.9}$$

where

$$\mathrm{cov}[T,C] = \sum_{i=1}^{n} a_{T,i}\, a_{C,i} \tag{1.10}$$

$a_{T,i}$ and $a_{C,i}$ are the sensitivities of distributions $T$ and $C$, respectively, to globally correlated variations. $\rho_{T,L}$ and $\rho_{C,L}$ are computed similarly, and the corresponding parameters can be obtained for $T'$, $C'$, and $L'$ in the same way.

1.5 Contributions of Thesis

In the area of test for bundled data and bundled-data resilient design, this work provides a mathematical analysis of production-level test parameters and provides solutions to improve yield post-silicon using existing EDA tools.
The expected contributions cover both theory and practice.

Theory:

We prove that SPQL is inversely related to test margin for BD as well as TR-BD designs. The importance of this proof is that, to maximize yield, we should set the design to have the smallest test margin that reaches the maximum allowable SPQL set by the end customers.

We explore schemes that generate a uniform and a per-chip test margin. The uniform test margin is conservative, implying a fixed margin for all chips, while the per-chip method allows the test margin to change based on individual chip performance. In both cases, the goal is to set the test margin to maximize yield while satisfying a required SPQL and performance target. The test schemes and their analysis apply to BD as well as TR-BD designs.

We address the setup and hold timing constraints plus performance constraints. Bundled-data design with two delay lines offers the opportunity of tuning the backward delay line to fix hold time violations.

Application:

We created a yield-aware placement and routing flow for BD and TR-BD designs based on a commercial physical design flow. The design implements programmable delay lines with built-in programmability for the desired amount of test margin and includes scan that makes it possible to configure and test the delay line settings. It also considers the impact of the placement and routing of the delay line on the correlation between it and the logic it matches. The principal goal of the flow is to maximize yield subject to both SPQL and performance constraints. However, the scheme also ensures that the required manufacturing test time is comparable to that of an equivalent synchronous circuit. This is because the scan chain structure we adopt is similar to the chains of a synchronous design, and the BD or TR-BD design has better performance after circuit re-timing.

We build a clock domain crossing circuit to enable the use of asynchronous circuits in a synchronous environment.
It adopts an open-source design for communication between a synchronous circuit and a 4-phase asynchronous design. The circuit we built adds more logic to it in order to support 2-phase asynchronous designs, such as EDGE.

We tested and evaluated these flows on a suite of benchmark circuits, including the open-core CPU Plasma. The netlist generated by the flow is verified by comparing its simulation results with those of the RTL design. The flow was also tested by an industrial partner using an industrial-scale circuit [34]. The industrial design selected for bundled-data implementation is a low power vision classifier hardware accelerator.

1.6 Thesis Organization

The remainder of this thesis is organized as follows. Chapter 2 presents our work that determines the optimal uniform test margin for bundled-data designs and preliminary work that sets a per-chip test margin for bundled-data designs. Both test margin methods are mathematically formulated and validated using Monte Carlo simulation on a sample circuit, a 16-bit carry select adder. Chapter 3 describes our yield-aware design flow, including the synthesis and PnR flow for bundled data designs that supports both our scan for delay testing and programmable delay lines. Chapter 4 summarizes this thesis, providing some conclusions and possibilities for future work.

Chapter 2

Optimizing Yield Given an SPQL Constraint

Yield is a monotonically increasing function of the test delay ratio. Higher yield can be achieved by setting the test delay ratio to a larger value, closer to 1. However, the larger the test delay ratio, the higher the resulting SPQL. The goal of this chapter is to increase yield, given that SPQL should not exceed a fixed value set by the end customer, without decreasing performance. Determining which test delay ratio maximizes the yield under these constraints is our principal question. The test delay ratio can either be set uniformly for all chips [35] or tuned based on chip performance [1].
The uniform test margin is determined by the distributions of T, C, and L, and is applied to all chips uniformly. The per-chip test margin, on the other hand, is determined not only by these distributions but also by measurement data from each chip, e.g., the delay of an internal ring oscillator or delay line. Setting the margin per chip requires additional overhead compared to uniform margins, including increased test time and more complex delay line configuration, but offers higher yields.

Table 2.1: Analysis dimensions

Dimension           First column      Second column
Test margin         uniform           per-chip
Performance limit   average case      worst case
Time                right after fab   years after fab
Violations          setup             hold
Variables           as measured       with variations by 10%

As indicated in Table 2.1, the research analyzed multiple dimensions, including test margin, performance, aging effects, timing violations, and the influence of test related variables. In particular, the items in the first column were completed before my qualifying exam, whereas the items in the second column were completed after it. Our study of bundled data design using a per-chip test delay ratio verifies that we can obtain higher yields than when using uniform test margins.

The rest of the chapter describes the uniform test delay ratio for bundled data design and the per-chip test delay ratio for bundled data design, as well as for timing resilient bundled data design. The reference model for yield comparison is synchronous design with speed binning.

2.1 Key parameters

To mathematically analyze yield, a model for the basic test parameters is needed. This section discusses these parameters conceptually; a more formal model that captures their variation and correlations is described in the next section.

Figure 2.1: Normally distributed test parameters

2.1.1 Critical paths

For setup time analysis, the critical paths are the paths in the combinational logic with the longest delay.
The actual longest path determines the clock period for synchronous and RO circuits and the minimum length of the forward delay line for BD circuits.

During test, selected paths on the chip are tested to achieve a balance between fault coverage and test time. The longest path under test (T) is the longest path exercised among all test vectors. In some cases, the actual longest path (C) is triggered by the applied test vectors. In other cases, however, due to process variations and long test times [36], the actual longest path may not be exercised by the applied test vectors. In such cases, C differs from T. The forward delay line or clock period L should be sufficiently long to ensure that a large majority of shipped chips work with the actual longest path C. In this paper, T, C, and L are modeled using Gaussian distributions, as shown in Figure 2.1.

In contrast to setup time analysis, hold time analysis considers the shortest paths as critical. During test, the shortest path under test (T') is the fastest logic path exercised among all test vectors. The actual shortest path (C') may not be triggered by the applied test vectors. The backward delay line L' should be tuned to ensure that the actual shortest path C' is longer than the hold time requirement. T', C', and L' are modeled using Gaussian distributions as well.

After test, the longest tested path of a passing chip is, by definition, shorter than the delay line or clock period. Similarly, the shortest tested path is sufficiently long to satisfy the hold time constraint. In contrast, the actual longest path and shortest path have a small chance of violating the setup or hold time constraint. The probability that the setup or hold time of a passing chip is not met is known as SPQL.

2.1.2 Ratio of the delay line for test to delay line (X and X')

During test, the forward delay line L may be tuned to have a smaller delay than is set for actual shipped chips.
The ratio is known as the test delay ratio:

$$X = \frac{\text{Delay Line for Test}}{\text{Actual Delay Line}} \tag{2.1}$$

Ideally X is a constant. However, because of process variations, X itself may vary from chip to chip, and it depends on the correlation coefficient between the delay line for test (XL) and the forward delay line (L). If the correlation coefficient equals 1, X is a constant and thus has a variance of 0. If it is close to 0, X may have a large variance, and an analysis based on a constant X can be incorrect. Fortunately, our experimental results show that when the delay line is designed carefully, XL and L are indeed highly correlated.

The difference between the actual delay line and the delay line for test is the test margin for BD/RO designs.

$$L = XL + \text{Test margin} \tag{2.2}$$

In contrast, when we analyze the hold time test delay ratio, the delay line during test is longer than the delay line in working mode. We use X' to represent this ratio, which is naturally larger than 1.

$$L' + \text{Test margin} = X'L' \tag{2.3}$$

Note, however, that only BD designs have the ability to tune the hold time delay line L' and take advantage of X'.

2.1.3 Yield

To compare the yields of SYNC, BD, and RO designs, the SYNC design is assumed to have a nominal clock period of $T_{clk}$ and a nominal test clock period of $XT_{clk}$. To model performance binning, we allow synchronous chips to ship with a range of clock periods from $T_{clk}(1-\delta)$ to $T_{clk}(1+\delta)$ [37]. In particular, $T_{clk}(1+\delta)$ is the slowest shipped clock period. The yield of the SYNC design is thus

$$Yield_{SYNC} = P\big(T + s < XT_{clk}(1+\delta),\ T' > h\big) \tag{2.4}$$

where s and h represent the setup and hold time of the SYNC design.

Similarly, BD/RO designs are assumed to have delay line delays of L and L', where the nominal delays during test are XL and X'L', with X < 1 and X' > 1. However, the definition of the yield of a BD/RO design depends on the system requirements.
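Equation 2.4 can be estimated numerically once distributions for T and T' are chosen. The following is a minimal Monte Carlo sketch rather than the analysis used in this work: the function name `sync_yield`, the Gaussian means and sigmas, and the split into one shared (global) term plus independent local terms are all illustrative assumptions, and `delta` stands for the fractional width of the slowest speed bin.

```python
import random

def sync_yield(n, t_clk, x, delta, s, h, seed=0):
    """Monte Carlo estimate of P(T + s < X*T_clk*(1+delta), T' > h)."""
    rng = random.Random(seed)
    passed = 0
    for _ in range(n):
        g = rng.gauss(0.0, 1.0)  # global variation shared by both paths (assumed)
        # Longest tested path T and shortest tested path T' (assumed parameters).
        t_long = 0.90 * t_clk * (1 + 0.05 * g + 0.02 * rng.gauss(0.0, 1.0))
        t_short = 0.20 * t_clk * (1 + 0.05 * g + 0.02 * rng.gauss(0.0, 1.0))
        if t_long + s < x * t_clk * (1 + delta) and t_short > h:
            passed += 1
    return passed / n

if __name__ == "__main__":
    print(sync_yield(20000, 1.0, 0.97, 0.00, 0.02, 0.05))  # no binning
    print(sync_yield(20000, 1.0, 0.97, 0.05, 0.02, 0.05))  # 5% slowest bin
```

Because both calls reuse the same seed, they see the same simulated chips, so widening the bin can only add passing chips.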
Figure 2.2: Full-buffer channel net model of the performance

In this paper, the performance of BD designs is modeled using the Full Buffer Channel Net (FBCN) model [9] of a typical master-slave latch bundled-data configuration [38], illustrated in Figure 2.2. In this marked graph model, the forward latency represents the datapath delay from the master to the slave latches as well as the datapath delay from the slave to the master latches. It is captured by the delay line in the forward path, L, and is labelled on the round places in the marked graph. The backward latency is the delay determined by the handshaking overhead in BD designs and is not present in RO designs. It is captured by the delay line in the backward path, L', and is labelled on the square places in the marked graph. The performance of the circuit is determined by the longest cycle in this graph [9] and thus equals $\max(2L,\ L + L',\ 2L')$.

In particular, if it is acceptable to ship chips whose performance varies with PVT variations but on average has the same delay as synchronous designs, then L and L' can be assumed to be normally distributed with means equal to $T_{clk}(1+\delta)/2$. The BD/RO yield is then the probability that the longest path under test (T) is smaller than the delay line for test (XL) and the shortest path under test (T') is larger than the overlapping period ($W - X'L'$), as shown in Figure 2.3.

$$Yield_{BAVE} = P\big(T + s < XL,\ T' - h > W - X'L'\big)^{1} \tag{2.5}$$

This definition may be best suited for many-core or multi-chip designs for non real-time applications. Also, note that the condition $T' - h > W - X'L'$ is only valid for BD designs, which include a tunable backward delay line. In contrast, $T' - h > 0$ is the hold time condition for RO-based designs.

1 It is assumed that time borrowing is not allowed.

Figure 2.3: Illustration of the hold time constraint

Note that, in this case, the larger the delay of the programmable delay line during test, the higher the probability that T will be smaller than XL.
This means that the yield of a BD/RO design is a monotonically increasing function of X. Given a certain X, the yield is determined by the correlation between T and XL, $\rho_{T,XL}$. For example, if $\rho_{T,XL}$ equals 1, the delay line (XL) tracks the critical path (T) for every chip and the chance of having a chip that does not pass the test is 0. If $\rho_{T,XL}$ equals 0, on the other hand, there is a good chance of having a larger T and a smaller XL, i.e., a test failure.

If, in contrast, a worst-case performance constraint is given, the yield of a BD circuit can be expressed as

$$Yield_{BWC} = P\big(T + s < XL,\ T' - h > W - X'L',\ L + L' < T_{clk}(1+\delta),\ 2L < T_{clk}(1+\delta)\big) \tag{2.6}$$

For RO-based designs, the hold condition $T' - h > W - X'L'$ should be replaced with $T' - h > 0$, as above, and the performance constraint should be replaced with $L < T_{clk}(1+\delta)$, as they have no backward delay line L'. Notice also that for BD designs we omit the constraint $2L' < T_{clk}(1+\delta)$ because, in practice, the nominal delay of L' is much smaller than that of L and thus this constraint is typically redundant.

To appreciate the difference between worst-case and average-case constraints, consider the case where $\delta$ is set to zero. If the means of L and L' are naively set to $T_{clk}/2$, as is optimal when considering average-case performance, the worst-case yield would be close to 50% and we would lose approximately half of the manufactured chips due to setup violations. Thus a more sophisticated approach to optimize L for this case is needed, and our specific proposed approach is discussed in Section 2.4.

Interestingly, the worst-case yield definition can be further classified into two sub-categories. $Yield_{BWCP}$ is the yield considering performance violations caused only by process variations, and $Yield_{BWCPVT}$ also considers performance violations caused by (temporary) changes in operating voltage and temperature.
In some applications, such as mobile and IoT, we may allow performance to change with changes in voltage and temperature, and for such applications $Yield_{BWCP}$ may be suitable. In other applications with strict real-time constraints, however, $Yield_{BWCPVT}$ may be a better measure.

As discussed by Cortadella et al. [15], the delays of paths that are physically close to each other tend to be highly correlated. Given that the delay line (XL) and the associated combinational paths (T) are often physically close, their delays are generally highly correlated, i.e., $\rho_{T,XL}$ is close to 1. Consequently, given an average performance constraint, BD/RO designs will have a higher $Yield_{BAVE}$ than SYNC designs for the same test margin. More precisely, Cortadella et al. [15] suggest that the clock margin need only compensate for the local process variation (i.e., mismatch) between the delay line under test and the combinational critical path. Similarly, we will show that for BD/RO designs only local variations motivate a larger test margin and affect chip yield. Conversely, to achieve the same yield as a SYNC design, we will show that BD/RO designs can have a smaller test delay ratio X. On the other hand, because the delay lines dictating the performance of BD/RO designs are affected by PVT variations similarly to synchronous combinational logic, we show that the yield advantage of BD/RO designs is significantly lower when strict worst-case performance constraints are given.

2.1.4 Shipped product quality loss

Manufacturers generally set a strict limit on the shipped product quality loss (SPQL) to provide an upper bound on the failure rate of shipped products. The SPQL of a SYNC design is defined as

$$SPQL_{SYNC} = P\big(C + s > T_{clk}(1+\delta)\ \text{or}\ C' < h \mid \text{pass test}\big) \tag{2.7}$$

where the condition for passing the test is as used in Equation 2.4.
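The conditional probability in Equation 2.7 can be illustrated with a small Monte Carlo experiment. The sketch below is only an illustration under stated assumptions: it considers the setup term alone, it models the actual critical path C as the maximum of the tested path and one hypothetical untested path, and the function name `sync_spql` and all numeric parameters are invented for the example.

```python
import random

def sync_spql(n, t_clk, x, delta, s, seed=0):
    """Estimate P(C + s > T_clk*(1+delta) | T + s < X*T_clk*(1+delta))."""
    rng = random.Random(seed)
    limit = t_clk * (1 + delta)  # slowest shipped clock period
    passed = escaped = 0
    for _ in range(n):
        g = rng.gauss(0.0, 1.0)  # global variation shared by all paths (assumed)
        tested = 0.9 * t_clk * (1 + 0.04 * g + 0.02 * rng.gauss(0.0, 1.0))
        untested = 0.9 * t_clk * (1 + 0.04 * g + 0.02 * rng.gauss(0.0, 1.0))
        actual = max(tested, untested)  # C is the longest of all paths
        if tested + s < x * limit:      # chip passes the delay test
            passed += 1
            if actual + s > limit:      # but fails in the field: a test escape
                escaped += 1
    return escaped / passed if passed else 0.0

if __name__ == "__main__":
    for x in (0.90, 0.95, 1.00):
        print(x, sync_spql(100000, 1.0, x, 0.0, 0.02))
```

In this toy model a smaller X rejects more chips at test but lets fewer failing chips through, which is the trade-off between yield and SPQL discussed in this section.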
Similarly, we define the SPQL of a BD/RO design as

$$SPQL_{BAVE} = P\big(C + s > L\ \text{or}\ C' - h < W - L' \mid \text{pass test}\big) \tag{2.8}$$

where the condition for passing the test is the same as used in Equation 2.5. Similarly, to define $SPQL_{BWC}$, we simply apply the stricter performance constraint for passing the test, as expressed in Equation 2.6.

2.1.5 Aging effects

Aging increases delays from their values when shipped, resulting in gradual performance degradation. Aging does not change the definitions of yield, SPQL, etc., but it does change the distributions of the yield-determining parameters, including T, C, and L.

Our simulations show that aging affects T, C, and L similarly. Thus, for BD/RO designs in applications that allow chips to slow down as they age, we can determine yield using Monte Carlo simulation results that do not include aging, as addressed in Section 2.5. For BD/RO designs in applications that require a chip to meet a fixed performance constraint throughout its lifetime, we need to apply post-aging variations to transistor width, length, and threshold voltage, run Monte Carlo simulation, and use the resulting aged distributions. For SYNC designs, the clock or power supply must either be set conservatively based on the aged distribution or be adjusted as the chip ages using both distributions.

2.2 Uniform Test Delay Ratio given Average-case Performance Constraints for Bundled Data Design

In this section, yield and SPQL are mathematically analyzed for both synchronous and BD/RO designs assuming a uniform test delay ratio X. Yield and SPQL are introduced in Section 2.1. Theorem I shows that SPQL is a monotonically increasing function of the test delay ratios X and X'.

2.2.1 Monotonicity of SPQL over Test Delay Ratios

The SPQL and yield of a bundled data design are both determined by the test delay ratio. Yield clearly increases with the test delay ratio. This subsection proves that SPQL also increases with the test delay ratios X and X'.
Theorem I: The SPQL of a BD design is a monotonically increasing function of X if the correlation coefficient between T and C satisfies $\rho_{T,C} > 0$.

Proof: Assuming X' is constant during the proof, SPQL can be simplified as

$$SPQL_{BD} = P(C > L \mid T < XL) = \int_{-\infty}^{+\infty} f_L(l)\, P(C > l \mid T < Xl)\, dl$$

The derivative of $SPQL_{BD}$ is as follows

$$\frac{d(SPQL_{BD})}{dX} = \int_{-\infty}^{+\infty} f_L(l)\, \frac{d[P(C > l \mid T < Xl)]}{dX}\, dl \tag{2.9}$$

where $f_L$ is the distribution of L. Because $f_L(l) > 0$, the theorem statement is true iff $\frac{d[P(C > l \mid T < Xl)]}{dX} > 0$. This inequality is equivalent to

$$\frac{d\left[P(C > l,\ T < Xl)/P(T < Xl)\right]}{dX} = \frac{d[A/B]}{dX} = \frac{dA/dX}{B^2}\left(B - A\,\frac{dB/dX}{dA/dX}\right) > 0$$

where

$$A = P(C > l,\ T < Xl) \tag{2.10}$$

and

$$B = P(T < Xl) \tag{2.11}$$

After further simplification, the inequality can be written as

$$\int_{-\infty}^{Xl} f_{T,L}(t,l)\, \frac{\int_{l}^{+\infty} f_{C|T,L}(c \mid t,l)\,dc}{\int_{l}^{+\infty} f_{C|T,L}(c \mid Xl,l)\,dc}\, dt < \int_{-\infty}^{Xl} f_{T,L}(t,l)\, dt$$

where $f_{T,L}$ is the joint distribution of T and L and $f_{C|T,L}$ is the distribution of C given T and L. C can be re-written as

$$C = \alpha T + \beta L + U \tag{2.12}$$

where U follows a Gaussian distribution and $\alpha$ is positive if $\rho_{T,C} > 0$ and negative if $\rho_{T,C} < 0$. U is uncorrelated with T and L. We can show that

$$\int_{l}^{+\infty} f_{C|T,L}(c \mid t,l)\,dc = \int_{l - \alpha t - \beta l}^{+\infty} f_U(u)\,du \tag{2.13}$$

and

$$\int_{l}^{+\infty} f_{C|T,L}(c \mid Xl,l)\,dc = \int_{l - \alpha Xl - \beta l}^{+\infty} f_U(u)\,du \tag{2.14}$$

where $f_U$ is the distribution of U. Because passing chips have t smaller than Xl, the lower integration limit in Equation 2.13 is larger than that in Equation 2.14 when $\alpha > 0$, so Equation 2.13 is smaller than Equation 2.14 if $\rho_{T,C} > 0$. Equivalently, $\int_{l}^{+\infty} f_{C|T,L}(c \mid t,l)\,dc$ is smaller than $\int_{l}^{+\infty} f_{C|T,L}(c \mid Xl,l)\,dc$ if $\rho_{T,C} > 0$. This concludes the proof.

Similarly, we can prove that the SPQL of a BD design is a monotonically increasing function of X' if the correlation coefficient between T' and C' satisfies $\rho_{T',C'} > 0$.

The importance of Theorem I is that it verifies what we know intuitively: we achieve the maximum yield when we set the SPQL to its maximum limit. It can thus guide designers and CAD tools in setting a proper test margin. In particular, given the same yield, placing the delay line and the combinational logic closer together leads to higher correlation, a smaller X, and thus a smaller SPQL.
In other words, the relative location of the combinational logic and the delay line affects the test margin, yield, and SPQL of BD/RO circuits. However, the theorem highlights that this result is true only when the correlation between the different combinational path delays (i.e., T and C, T' and C') is positive.

2.2.2 Yield Optimization

The theorem covers two cases: when hold time is not problematic, we can maximize X' to reach the required SPQL and thus the optimal yield; when setup time is not problematic, we can maximize X in the same fashion. However, we need to balance these two parameters when both the setup and hold time requirements have small margins, because SPQL is then affected by X and X' at the same time. We propose a brute-force iteration over X and X' to achieve the optimal yield. The model is built before this iteration and includes only a few mathematical equations.

Uniform X

Due to the monotonicity of SPQL, the optimal test delay ratio X for BD/RO designs is obtained when the SPQL is set to its maximum limit [39]. In particular, by setting Equation 2.8 to q, we are able to obtain a unique optimal value for X.

Uniform X and X'

To find the optimal joint values of X and X', we sweep them and identify the pairs whose SPQL equals its limit q. By plugging the satisfying pairs of X and X' into Equation 2.5, we obtain a set of yields and record the pair that leads to the maximum yield.

2.2.3 Monte Carlo Simulation and Results Discussion (Uniform Test Delay Ratio)

Theoretically, given the same X, a high correlation coefficient $\rho_{T,XL}$ leads to a yield advantage for BD circuits. As discussed in Section 2.2, given the same SPQL, a high correlation coefficient would in turn result in a larger yield for BD circuits.
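The selection of a uniform X described in Section 2.2.2 can be sketched on sampled data as follows. This is a simplified illustration and not the analytical procedure of this work: it sweeps only X (setup side), estimates yield as P(T < XL) and SPQL as P(C > L | T < XL) from simulated chips, and keeps the grid point with the highest yield whose SPQL does not exceed q. The helper names, the Gaussian parameters, and the grid are assumptions made for the example.

```python
import random

def draw_chips(n, seed=0):
    """Return (T, C, L) samples with a shared global term (assumed model)."""
    rng = random.Random(seed)
    chips = []
    for _ in range(n):
        g = rng.gauss(0.0, 1.0)
        t = 0.90 * (1 + 0.05 * g + 0.02 * rng.gauss(0.0, 1.0))      # tested path
        other = 0.90 * (1 + 0.05 * g + 0.02 * rng.gauss(0.0, 1.0))  # untested path
        l = 0.95 * (1 + 0.05 * g + 0.01 * rng.gauss(0.0, 1.0))      # delay line
        chips.append((t, max(t, other), l))
    return chips

def yield_and_spql(chips, x):
    """Empirical yield P(T < X*L) and SPQL P(C > L | T < X*L)."""
    passed = [(c, l) for t, c, l in chips if t < x * l]
    if not passed:
        return 0.0, 0.0  # nothing ships, so nothing can escape
    escapes = sum(1 for c, l in passed if c > l)
    return len(passed) / len(chips), escapes / len(passed)

def pick_uniform_x(chips, q, grid):
    """Grid point with the highest yield among those with SPQL <= q."""
    best = None
    for x in grid:
        y, spql = yield_and_spql(chips, x)
        if spql <= q and (best is None or y > best[1]):
            best = (x, y, spql)
    return best

if __name__ == "__main__":
    chips = draw_chips(20000)
    grid = [0.85 + 0.005 * i for i in range(31)]  # X from 0.85 to 1.00
    print(pick_uniform_x(chips, 0.001, grid))
```

Extending the sweep to pairs (X, X') would add a second loop and the hold-time conditions of Equations 2.5 and 2.8.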
2.2.4 Measuring Correlations

Figure 2.4: 16-bit Carry Select Adder

To demonstrate the potential yield advantage of BD over SYNC design under high correlation coefficients, 9,000 Monte Carlo simulations were run on an example circuit and the correlation coefficients between T, C, and L were obtained. All circuits were designed in the IBM 65 nm CMOS technology and are sized to achieve a balance between rising and falling propagation delays. We varied the voltage from 0.9 V to 1.1 V and the temperature from -55 °C to 120 °C.

Figure 1.2 shows a BD template with a tunable delay line. We chose a 16-bit carry select adder as our example combinational logic because it is a simple circuit with multiple potentially critical paths and thus represents the case in which it may not be practical to test all possible paths. In particular, the structure of the carry select adder, shown in Figure 2.4, has 17 inputs and 17 outputs. Assuming that the delay of a MUX is comparable to the delay of a 1-bit full adder, the critical path runs from the least significant bit of one of the grouped ripple carry adders (RCAs) to the most significant bit of the primary outputs.

For each Monte Carlo run, the delays of all paths are recorded. The largest delay among all possible critical paths is the actual critical path delay and gives one sample point of C. The maximum of the path delays from Cin to Cout and from A[2] to Cout, a subset of all potentially critical paths, gives one sample point of T. The sample points of T and C are $a_{T,i}$ and $a_{C,i}$ in Equation 1.10, respectively. By plugging the sample values into the equation, cov[T,C] is obtained. We then use Equation 1.9 to calculate $\rho_{T,C}$, where $\sigma_T$ and $\sigma_C$ can be easily obtained from all sample points. With all of the above parameters, the joint distribution of T and C is available.

Similarly, a programmable delay line with a MUX, shown in Figure 2.5, is analyzed to quantify the correlation between XL and L.
The delay select is set to 0 during test and 1 for shipped chips. Thus the delay line for test (XL) uses 38 inverters and the delay line (L) uses 40 inverters. The value 40 is picked to obtain a delay line slightly longer than the critical path of the CSA. Depending on the SPQL requirement, we can pick any even number smaller than 40 for the test delay line; 38 is one possible value that results in a reasonable yield and SPQL. Using the same PVT variation setup, MC analysis is performed to measure the correlation between the delays of XL and L.

Figure 2.5: Programmable delay line with MUX

Table 2.2: Analysis of the correlation matrix due to PVT variations.

        1.0 V, 27 °C, varying P           PVT variation
        T      C      XL     L            T      C      XL     L
T       1      0.979  0.871  0.868        1      0.997  0.985  0.985
C       0.979  1      0.875  0.872        0.997  1      0.986  0.985
XL      0.871  0.875  1      0.998        0.985  0.986  1      0.999
L       0.868  0.872  0.998  1            0.985  0.985  0.999  1

Table 2.2 shows the final correlation matrix of T, C, XL, and L. Compared to process variation alone, PVT variation leads to higher correlation coefficients. This is because the impact of local mismatch is reduced when global systematic variations are introduced. Depending on the actual PVT variation in real circuits, the correlation coefficients may change. However, the rest of the paper shows results based on these obtained parameters.

In particular, our simulations show that the tested and actual critical paths of the combinational logic (T and C) are highly correlated, largely because T and C may share logic paths. Moreover, XL and L are even more highly correlated, with $\rho_{XL,L} = 0.999$ under PVT variation, because in our example delay line the tested delay line XL is actually part of the shipped delay line L. The high correlation between XL and L is shown in Figure 2.6, whose slope indicates that $X = XL/L$ has a small variance. In this particular circuit setting, the mean of X is 0.97 and its variance is $2.4 \times 10^{-5}$, suggesting that the test delay ratio X in BD design is close to a constant.
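The correlation measurement described above can be mimicked on synthetic data. The following sketch is an assumption-laden stand-in for the SPICE Monte Carlo runs: each of the 40 inverter stages gets a hypothetical delay made of a global term shared by the whole chip plus an independent local term, XL is the sum of the first 38 stages, L is the sum of all 40, and the MUX is ignored (which is why the ratio comes out near 38/40 rather than the measured 0.97). The Pearson coefficient follows the form of Equation 1.9.

```python
import random
from statistics import fmean, pstdev

def pearson(a, b):
    """Sample correlation: cov[a, b] / (sigma_a * sigma_b)."""
    ma, mb = fmean(a), fmean(b)
    cov = fmean([(x - ma) * (y - mb) for x, y in zip(a, b)])
    return cov / (pstdev(a) * pstdev(b))

def delay_line_samples(runs, seed=0):
    """Return per-run delays of the test line (38 stages) and full line (40)."""
    rng = random.Random(seed)
    xl, l = [], []
    for _ in range(runs):
        g = rng.gauss(0.0, 1.0)  # global variation for this chip (assumed 5%)
        stages = [1 + 0.05 * g + 0.03 * rng.gauss(0.0, 1.0) for _ in range(40)]
        xl.append(sum(stages[:38]))  # delay line for test, XL
        l.append(sum(stages))        # shipped delay line, L
    return xl, l

if __name__ == "__main__":
    xl, l = delay_line_samples(2000)
    ratio = [a / b for a, b in zip(xl, l)]
    print("rho(XL, L) =", pearson(xl, l))
    print("mean X =", fmean(ratio), "var X =", pstdev(ratio) ** 2)
```

Because XL is a sub-chain of L, the global term cancels in the ratio and only the local mismatch of the last two stages contributes to the variance of X.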
In contrast, in synchronous design, the test delay ratio X is constant.

Figure 2.6: Delay line for test (XL) vs delay line (L)

2.2.5 Results

The mathematical analysis in Section 2.2 quantifies the BD yield advantage and shows SPQL vs X conditioned on the correlation coefficient, and Table 2.2 reports the correlations obtained from Monte Carlo SPICE simulations of an example circuit.

Figure 2.7: Yield of BD/RO to SYNC vs $\rho_{T,XL}$

Combining these results, Figure 2.7 plots the yield ratio of BD/RO over SYNC vs the correlation coefficient between T and XL, which graphically illustrates Theorem I. The curve labeled $\delta = 0$ shows that the ratio is larger than 1 when the correlation coefficient is larger than 0.51. The red dotted line towards the right side of the plot indicates the actual $\rho_{T,XL}$ measured from our sample circuit under PVT variations. Such a strong correlation means that the BD/RO design will have a larger yield for the same given X. In other words, to achieve the same yield, a BD/RO design can have its test margin set smaller than that of the comparable traditional synchronous circuit. With a 5% slowest speed bin, i.e., $\delta = 0.05$, the yield of SYNC after binning increases, but the yield of BD/RO is still larger than that of SYNC if the correlation coefficient is larger than 0.8. As we increase $\delta$, the threshold correlation coefficient at which the yields are equal increases. For example, $\delta = 0.1$ leads to a larger threshold value of 0.89.

Figure 2.8 plots the SPQL vs the test delay ratio X, graphically illustrating Theorem II using our measured statistical results. In particular, the mean, covariance, and correlation matrix of T, C, and L are computed from our Monte Carlo simulation data. The joint distribution of T, C, and L is then mathematically derived. By integrating the joint distribution, we get the base-10 logarithm of the SPQL of BD/RO versus X, which indicates that SPQL is a monotonically increasing function of X.

Comparing yield as a function of SPQL is also useful.
To control the quality of shipped product, the foundry may be asked to deliver product with a required SPQL. Given the required SPQL, a higher yield means more shipped chips and higher profits. Figure 2.9 plots the yield versus the required SPQL for SYNC with binning and for BD/RO. To obtain the yield vs SPQL curve for BD, we repeatedly perform the following two-step process. For each desired SPQL, we determine the corresponding test delay ratio X from Figure 2.8. Based on this X, the BD/RO yield, $P(T < XL)$, is calculated using the joint distribution of the critical path under test and the delay line delay. The yield vs SPQL curve for SYNC is obtained similarly.

Figure 2.8: log(SPQL) vs X

Figure 2.9: Yield vs required SPQL

When the required SPQL is larger than 0.001, the yield of a BD/RO design is 50% higher than that of the comparable SYNC design without binning. A larger $\delta$ allows more slow chips to pass the test. As shown in Figure 2.9, $\delta = 1$ boosts the SYNC yield, but it is still not as good as that of the equivalent BD/RO circuit.

2.2.6 Application of Theory

The theorems developed in Section 2.2 support several important design decisions. Assuming the correlation between the delay line and the combinational logic is known, the designer can set the test margin to achieve a given SPQL. This may be done at design time if the delay line supports only a fixed test margin, or post-silicon if the test margin is configurable. Once the margin is set, the theory also identifies the expected yield.

However, the theory assumes that the correlation between the delay line and the combinational logic is known. This correlation may be obtained pre-layout using a variety of models, including the SPICE-based Monte Carlo analysis used in Section 2.2.4 and the advanced on-chip variation models built into commercial tools [40], which predict the correlation as a function of the bounding box of the relevant paths being explored.
The latter model enables a practical design flow in which the test margin depends on the size of the bounding box that contains the combinational block and its matching delay line. The larger the bounding box, the lower the correlation, which motivates using a larger test margin. The obtained correlation estimate can also be fine tuned after fabrication using data obtained during chip characterization. If the delay line supports a programmable test margin, this may warrant altering the test margin setting post-silicon, potentially improving yield while still satisfying the SPQL requirements. Finally, as information about the actual chip environments becomes known, it may also be possible to reduce the test margins in the field.

2.3 Per-chip Test Delay Ratio X(L) given Average-case Performance Constraints for Bundled Data Design

This section adapts the per-chip method from synchronous design to bundled data design. The test delay ratio X is represented as a function of chip performance.

Figure 2.10: 1 Stage Sample Circuit

A delay line, a chain of delay elements, is a naturally good candidate for estimating chip performance, as shown in Figure 2.10. By adding an extra inverter and setting Lselect = 0, we connect the delay line to form a loop, which works as a ring oscillator. In this subsection, we tune the test margin parameter X based on different ring oscillator delays.

SYNC design Per-chip method

Note that the SYNC design does not offer the opportunity to tune the backward delay line during test. The optimization problem for a per-chip SYNC design given an SPQL limit q is simplified as follows

$$\max_{X(L)}\ P\big(T + s < (1+\delta)XT_{clk}\big) \tag{2.15}$$

subject to

$$P\big(C + s > (1+\delta)T_{clk} \mid T + s < (1+\delta)XT_{clk}\big) \le q \tag{2.16}$$

where X is a function of L.
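By rewriting the problem using integrals, we get

$$\max_{X(L)} \int_{-\infty}^{+\infty} \int_{-\infty}^{(1+\gamma)XT_{clk}-s} P_{T,L}(t,l)\,dt\,dl \tag{2.17}$$

subject to

$$\int_{-\infty}^{+\infty} \int_{(1+\gamma)T_{clk}-s}^{+\infty} \int_{-\infty}^{(1+\gamma)XT_{clk}-s} P_{T,C,L}(t,c,l)\,dt\,dc\,dl \;-\; q \int_{-\infty}^{+\infty} \int_{-\infty}^{(1+\gamma)XT_{clk}-s} P_{T,L}(t,l)\,dt\,dl = 0$$

Solving the problem with the Lagrangian method, we get

$$\mathcal{L}(X,\lambda) = (1+\lambda q) \int_{-\infty}^{+\infty} \int_{-\infty}^{(1+\gamma)XT_{clk}-s} P_{T,L}(t,l)\,dt\,dl \;-\; \lambda \int_{-\infty}^{+\infty} \int_{(1+\gamma)T_{clk}-s}^{+\infty} \int_{-\infty}^{(1+\gamma)XT_{clk}-s} P_{T,C,L}(t,c,l)\,dt\,dc\,dl$$

We define $H(X,l,\lambda)$ as

$$H(X,l,\lambda) = (1+\lambda q) \int_{-\infty}^{(1+\gamma)XT_{clk}-s} P_{T,L}(t,l)\,dt \;-\; \lambda \int_{(1+\gamma)T_{clk}-s}^{+\infty} \int_{-\infty}^{(1+\gamma)XT_{clk}-s} P_{T,C,L}(t,c,l)\,dt\,dc \tag{2.18}$$

To reach the optimal yield, $X$ must satisfy

$$\frac{\partial H(X,l,\lambda)}{\partial X} = 0 \tag{2.19}$$

From this optimality condition we obtain the following equation for $X$

$$(1+\lambda q)\,P_{T,L}\big((1+\gamma)XT_{clk}-s,\;l\big) \;-\; \lambda \int_{(1+\gamma)T_{clk}-s}^{+\infty} P_{T,C,L}\big((1+\gamma)XT_{clk}-s,\;c,\;l\big)\,dc = 0 \tag{2.20}$$

Dividing by $\lambda\,P_{T,L}$, the equation simplifies to

$$\int_{(1+\gamma)T_{clk}-s}^{+\infty} P_{C|T,L}\big(c \;\big|\; (1+\gamma)XT_{clk}-s,\;l\big)\,dc = q + \frac{1}{\lambda} \tag{2.21}$$

Since $C$ conditioned on $T$ and $L$ is Gaussian with mean $\hat{\mu}$ and standard deviation $\hat{\sigma}$, the mean of the conditional distribution can be expressed as

$$\hat{\mu} = \Phi^{-1}\!\left(q + \frac{1}{\lambda}\right)\hat{\sigma} + (1+\gamma)T_{clk} - s \tag{2.22}$$

We also introduce the following definitions.

$$V = \begin{pmatrix} C \\ T \\ L \end{pmatrix} = \begin{pmatrix} C \\ V_{TL} \end{pmatrix}, \qquad \mu_V = \begin{pmatrix} \mu_C \\ \mu_T \\ \mu_L \end{pmatrix} = \begin{pmatrix} \mu_C \\ \mu_{TL} \end{pmatrix} \tag{2.23}$$

and

$$\Sigma_V = \begin{pmatrix} \sigma_C^2 & \sigma_{CT} & \sigma_{CL} \\ \sigma_{CT} & \sigma_T^2 & \sigma_{TL} \\ \sigma_{CL} & \sigma_{TL} & \sigma_L^2 \end{pmatrix} = \begin{pmatrix} \sigma_C^2 & \Sigma_{C,TL} \\ \Sigma_{C,TL}^{\top} & \Sigma_{TL} \end{pmatrix}$$

$$\hat{\mu} = \mu_C + \Sigma_{C,TL}\,\Sigma_{TL}^{-1}\,(V_{TL} - \mu_{TL}) \tag{2.24}$$

By combining Equations 2.22 and 2.24, with Equation 2.24 evaluated at $T = (1+\gamma)XT_{clk} - s$ and $L = l$, we can write $X$ as

$$X = \alpha L + \beta \tag{2.25}$$

where

$$\alpha = \frac{\sigma_{CT}\sigma_{TL} - \sigma_{CL}\sigma_T^2}{(1+\gamma)T_{clk}\,(\sigma_{CT}\sigma_L^2 - \sigma_{CL}\sigma_{TL})} \tag{2.26}$$

$\beta$ can be obtained by substituting $X$ in Equation 2.16 by $\alpha L + \beta$.

$$\beta = -\alpha\mu_L + \frac{\sigma_T^2\sigma_L^2 - \sigma_{TL}^2}{\sigma_{CT}\sigma_L^2 - \sigma_{CL}\sigma_{TL}} + \frac{\mu_T + s}{(1+\gamma)T_{clk}} + \frac{(\sigma_T^2\sigma_L^2 - \sigma_{TL}^2)\left[\Phi^{-1}\!\left(q + \frac{1}{\lambda}\right)\hat{\sigma} - s - \mu_C\right]}{(\sigma_{CT}\sigma_L^2 - \sigma_{CL}\sigma_{TL})\,(1+\gamma)T_{clk}} \tag{2.27}$$

BD design Per-chip method

Assuming $X'$ is constant, the problem of maximizing yield for per-chip BD/RO designs given an SPQL limit $q$ is simplified as follows

$$\max_{X(l)} \; P(T + s < XL) \tag{2.28}$$

subject to

$$P(C + s > L \mid T + s < XL) \le q \tag{2.29}$$

where $X$ is a function of $l$, a per-chip measure of the delay line. Recall that both yield and SPQL are monotonically increasing functions of $X$ [39]. Consequently, the yield is maximized when $P(C + s > L \mid T + s < XL)$ is set to $q$, the required SPQL.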
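The above maximization problem can be rewritten as follows.

$$\max_{X(l)} \int_{-\infty}^{+\infty} \int_{-\infty}^{Xl-s} f_{T,L}(t,l)\,dt\,dl \tag{2.30}$$

subject to

$$\int_{-\infty}^{+\infty} \int_{l-s}^{+\infty} \int_{-\infty}^{Xl-s} f_{T,C,L}(t,c,l)\,dt\,dc\,dl \;-\; q \int_{-\infty}^{+\infty} \int_{-\infty}^{Xl-s} f_{T,L}(t,l)\,dt\,dl = 0$$

The problem can be solved using the Lagrangian method as follows.

$$\mathcal{L}(X,\lambda) = (1+\lambda q) \int_{-\infty}^{+\infty} \int_{-\infty}^{Xl-s} f_{T,L}(t,l)\,dt\,dl \;-\; \lambda \int_{-\infty}^{+\infty} \int_{l-s}^{+\infty} \int_{-\infty}^{Xl-s} f_{T,C,L}(t,c,l)\,dt\,dc\,dl$$

We define $H(X,l,\lambda)$ as

$$H(X,l,\lambda) = (1+\lambda q) \int_{-\infty}^{Xl-s} f_{T,L}(t,l)\,dt \;-\; \lambda \int_{l-s}^{+\infty} \int_{-\infty}^{Xl-s} f_{T,C,L}(t,c,l)\,dt\,dc \tag{2.31}$$

To obtain the optimal yield, $X$ should satisfy

$$\frac{\partial H(X,l,\lambda)}{\partial X} = 0 \tag{2.32}$$

From this optimality condition we obtain the following equation for $X$

$$(1+\lambda q)\,f_{T,L}(Xl-s,\;l) \;-\; \lambda \int_{l-s}^{+\infty} f_{T,C,L}(Xl-s,\;c,\;l)\,dc = 0 \tag{2.33}$$

This equation can be further simplified as follows.

$$\int_{l-s}^{+\infty} f_{C|T,L}(c \mid Xl-s,\;l)\,dc = q + \frac{1}{\lambda} \tag{2.34}$$

To solve this equation, we introduce the following definitions.

$$V = \begin{pmatrix} C \\ T \\ L \end{pmatrix} = \begin{pmatrix} C \\ V_{TL} \end{pmatrix}, \qquad \mu_V = \begin{pmatrix} \mu_C \\ \mu_T \\ \mu_L \end{pmatrix} = \begin{pmatrix} \mu_C \\ \mu_{TL} \end{pmatrix} \tag{2.35}$$

and

$$\Sigma_V = \begin{pmatrix} \sigma_C^2 & \sigma_{CT} & \sigma_{CL} \\ \sigma_{CT} & \sigma_T^2 & \sigma_{TL} \\ \sigma_{CL} & \sigma_{TL} & \sigma_L^2 \end{pmatrix} = \begin{pmatrix} \sigma_C^2 & \Sigma_{C,TL} \\ \Sigma_{C,TL}^{\top} & \Sigma_{TL} \end{pmatrix}$$

The mean of the conditional Gaussian distribution can be calculated from the above definitions.

$$\hat{\mu} = \mu_C + \Sigma_{C,TL}\,\Sigma_{TL}^{-1}\,(V_{TL} - \mu_{TL}) \tag{2.36}$$

Based on the conditional Gaussian distribution in Equation 2.34, the mean of the distribution can also be expressed as

$$\hat{\mu} = \Phi^{-1}\!\left(q + \frac{1}{\lambda}\right)\hat{\sigma} + l - s \tag{2.37}$$

By combining Equations 2.36 and 2.37, with Equation 2.36 evaluated at $T = Xl - s$ and $L = l$, the per-chip test delay $Xl$ becomes an affine function of $l$

$$X\,l = \alpha l + \beta \tag{2.38}$$

where

$$\alpha = \frac{\sigma_T^2\sigma_L^2 - \sigma_{TL}^2 + \sigma_{CT}\sigma_{TL} - \sigma_{CL}\sigma_T^2}{\sigma_{CT}\sigma_L^2 - \sigma_{CL}\sigma_{TL}} \tag{2.39}$$

is known and $\beta$ is a function of $\lambda$. To obtain the value of $\beta$, we can directly substitute $Xl$ in Equation 2.29 by $\alpha l + \beta$.

$$\beta = \mu_T + s + \frac{(\sigma_{CL}\sigma_T^2 - \sigma_{CT}\sigma_{TL})\,\mu_L}{\sigma_{CT}\sigma_L^2 - \sigma_{CL}\sigma_{TL}} + \frac{(\sigma_T^2\sigma_L^2 - \sigma_{TL}^2)\left[\Phi^{-1}\!\left(q + \frac{1}{\lambda}\right)\hat{\sigma} - s - \mu_C\right]}{\sigma_{CT}\sigma_L^2 - \sigma_{CL}\sigma_{TL}} \tag{2.40}$$

2.3.1 Monte Carlo Simulation and Results Discussion (Per-chip Test Delay Ratio)

Monte Carlo simulation and analysis of the simulation data show how global and local variations each affect yield. We observe advantages of bundled data over synchronous design regarding yield and SPQL.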
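As a rough illustration of this kind of analysis, the following sketch estimates yield and SPQL from Monte Carlo samples for a given test threshold. It is a toy model under stated assumptions, not the SPICE-based flow used in this work: the function name `simulate_bd`, the global-plus-local variation model, and every numeric value (means, standard deviations, sample count) are hypothetical. A uniform test margin corresponds to the threshold `X * l`, while a per-chip margin would use an affine threshold in `l`.

```python
import random


def simulate_bd(threshold, n=20000, seed=0, setup=0.0):
    """Estimate (yield, SPQL) of a bundled-data stage by Monte Carlo.

    threshold: function mapping the delay line delay l to the largest
               T + s that still passes the test (X * l for a uniform
               margin, alpha * l + beta for a per-chip margin).
    Assumed toy model: every chip draws one global factor shared by all
    paths plus independent local mismatch; delays are in arbitrary units.
    """
    rng = random.Random(seed)
    passed = 0   # chips that pass the test
    escaped = 0  # chips that pass the test but fail in the field
    for _ in range(n):
        g = rng.gauss(0.0, 0.10)                     # global variation
        t = 0.90 * (1.0 + g + rng.gauss(0.0, 0.03))  # critical path under test
        c = 0.90 * (1.0 + g + rng.gauss(0.0, 0.03))  # critical path in the field
        l = 1.00 * (1.0 + g + rng.gauss(0.0, 0.03))  # delay line
        if t + setup < threshold(l):
            passed += 1
            if c + setup > l:
                escaped += 1
    yield_estimate = passed / n
    spql_estimate = escaped / passed if passed else 0.0
    return yield_estimate, spql_estimate


# Two uniform test margins evaluated on the same sampled chips.
tight = simulate_bd(lambda l: 0.90 * l)
loose = simulate_bd(lambda l: 1.00 * l)
```

Because both calls reuse the same seed, they evaluate the same sampled chips, so the larger margin can only pass more of them. Sweeping the margin in this way produces a yield versus SPQL curve for the assumed model.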
The next subsection introduces how global and local variations affect yield and presents some experimental results. The following subsection digs deeper into the comparison of the yields of bundled data and synchronous designs, given the same required SPQL.

2.3.2 Global and Local Variations

Process variations are categorized into global and local variations. For global variations, device parameters change equally for all transistors. In contrast, for local variations, also known as mismatch, each transistor is affected differently. Variations of physical parameters lead to variations of electrical parameters, such as threshold voltage or gate capacitance. In turn, this affects the performance of digital circuits because it changes gate delays or leakage currents. However, their effect on the yield of bundled data designs has not been explored yet.

It is naturally assumed that the delay of the delay line under test is larger than the delay of the critical path under test.

$$X > \frac{T}{L} \tag{2.41}$$

In this data set, $X$ is larger than 0.895, in order to maintain a longer delay line for test. The figure in this subsection is based on the range [0.9, 1].

Figure 2.11 plots the yield of SYNC and BD under different variations and shows how they are affected differently, assuming that the delay distributions of the combinational logic in the SYNC and BD designs are the same. In particular, the plot illustrates the effect of two types of variations. PVT indicates global process variation, local process mismatch, voltage varied between 0.9 V and 1.1 V, and temperature varied from -55 °C to 120 °C. Global indicates the same, but without local process mismatch.

Figure 2.11: Comparison of Yields
In this figure, we assume that the delay distribution of the combinational logic in the SYNC design is the same as that of the bundled data design. The yield of BD-PVT is higher than that of SYNC-PVT. This is due to the high correlation between the critical path under test and the delay line: the delay line tracks the combinational logic and leads to a higher yield. We can also observe that the difference between SYNC-PVT and SYNC-Global is much smaller than the one between BD-PVT and BD-Global. BD-Global is close to 1, which means global variation has no big effect on the yield of a bundled data circuit.

The intuition behind this result is discussed in [15] in the context of margins for a ring-oscillator-based clock. Because global variation changes the delay of the ring oscillator/delay line and the combinational logic in the same manner, it does not warrant increasing the clock margin. We show that, for the same reason, global variations do not adversely affect the yield of BD designs. Thus, as long as the mean of the delay line under test is longer than the critical path of the combinational logic, the resulting yield is close to 1. In contrast, for SYNC designs the yield under global variations behaves similarly to the yield under PVT variations and is significantly less than 1 when the test margin is not sufficiently large. This is because the period of the global clock is fixed and thus does not track the delay of the combinational logic.

2.3.3 Per Delay Line

The per-chip, or per-delay-line, test margin method has been discussed above. This section focuses on the application of this method and shows the advantage of adopting it for our existing CSA template circuit.

The test margin parameter $X$, as a function of $l$, depends on the required SPQL. In other words, $\alpha$ and $\beta$ vary with the required SPQL.

Table 2.3: $\alpha$ and $\beta$ versus required SPQL

| q | 0.001 | 0.002 | 0.003 | 0.004 | 0.005 | 0.006 | 0.007 |
|---|---|---|---|---|---|---|---|
| $\beta$ | -41.8 | -28.3 | -18.7 | -10.6 | -3.25 | 3.83 | 10.9 |
| $\alpha$ | 0.992 | 0.992 | 0.992 | 0.992 | 0.992 | 0.992 | 0.992 |

In this particular circuit, the maximum SPQL is 0.0072, reached when $X$ equals 1.
Tables and figures in this subsection therefore cover SPQL values from 0 to 0.007.

Table 2.3 shows the different $\alpha$ and $\beta$ values for each required SPQL. The manufacturer needs to pick the correct set of numbers based on the given customer requirements.

Figure 2.12: Per-chip Test Margin

Figure 2.12 shows four curves, using the uniform and per-chip methods for yield calculation. The synchronous uniform curve is the baseline yield versus SPQL for comparison. The synchronous per-chip method described in [1] would increase the yield by 10% for this circuit.

The bundled data uniform method has a yield advantage over both the uniform and per-chip synchronous methods, of around 40%. The benefit results from the following two factors. First, the combinational logic and delay line are highly correlated. The higher correlation in bundled data design leads to a smaller $X$ for the same yield. Second, a smaller $X$ indicates a smaller SPQL, as proved in the DATE paper. Thus, given the same SPQL, the bundled data design has a bigger $X$ and thus a larger yield.

Hold Time

We explored the cases where the hold time constraint is as large as 10% or 20% of Tclk. In these scenarios, we need to either set a minimum constraint on the shortest path or tune $L'$ to resolve the hold time issue. To simplify the analysis, we assume that the minimum constraint can improve the mean of $T'$ by at most $T'$.

Table 2.4: Yield Comparison under Large Hold Time and Average-case Performance Constraints

| Hold time (h) | 0 | 10% Tclk (tune L') | 20% Tclk (tune L') | 10% Tclk (add min delay constraint) | 20% Tclk (add min delay constraint) |
|---|---|---|---|---|---|
| Yield BDAVE | 0.99 | 0.99 | 0.96 | 0.99 | 0.99 |
| Yield SYNC | 0.61 | 0.42 | 0.01 | 0.61 | 0.05 |

Table 2.4 shows that the yield of the bundled-data design is still close to 1 when the hold time is 20% of Tclk, whereas the yield of comparable SYNC designs drops sharply.

2.4 Uniform and Per-chip X(L) given Worst-case Performance Constraints for Bundled data design

Performance constraints vary from application to application.
Ensuring an average performance constraint may be acceptable in some cases, such as multi-core systems in which individual cores can have varying performance or where voltage scaling can compensate for varying performance. However, in other applications, a manufacturer may be required to meet certain worst-case performance constraints. With this motivation, this section focuses on the following problem: given a required SPQL and a worst-case performance constraint, configure the setup and hold delay lines as well as their uniform/per-chip test margins to maximize yield. Because average and worst-case performance are the same for SYNC designs, this section focuses on BD/RO design.

Unfortunately, optimizing the BD/RO delay lines for the worst-case performance constraint is more complicated than for the average-case performance constraint because the yield may no longer be a monotonic function of the delay lines $L$ and/or $L'$. Instead, we need to consider variations and carefully balance setup and hold time violations with the worst-case performance constraint to find the optimal setting of $L$ and $L'$ and their associated test delay ratios $X$ and $X'$.

2.4.1 Uniform Test Delay Ratio

Uniform X

The optimal yield given both SPQL and worst-case performance constraints depends on $L$, $L'$, $X$ and $X'$. However, if the hold time requirement is easily met, e.g. the shortest paths are sufficiently long to satisfy the hold constraint, no hold time test margin is needed. As a first step, this subsection makes this assumption and therefore focuses on setting $X$ and $L$.

Given this assumption, we set the SPQL to its maximum limit $q$ and optimize the test delay ratio $X$ and $L$. Based on $SPQL_{BWC}$, by assuming $X' = 1$ and that $L'$ is constant, we can obtain the optimal test delay ratio $X$ as a function of $L$ and $q$. By sweeping $L$, we obtain its corresponding test delay ratio $X$.
More specifically, all combinations of test delay ratio $X$ and $L$ are plugged into Equation 2.6 to obtain the possible yields for a given $q$, and the $(X, L)$ pair that leads to the maximum yield is recorded. Note that we can also run this procedure multiple times to determine how the optimal yield varies as a function of $q$.

Uniform X and X'

In Section 2.4.1, we assumed $X' = 1$ and kept $L'$ constant, sweeping $L$ to achieve the optimal $X$ and yield. In this subsection, we wish to optimally set $X'$ and $L'$ as well as $L$ and $X$. To do this, we propose to simultaneously sweep all but one of $X$, $X'$, $L$ and $L'$. For example, for each sample point of $X$, $L$ and $L'$, $X'$ can be calculated from $SPQL_{BWC}$. Each four-tuple can then be plugged into Equation 2.6 to obtain the possible yields for a specified $q$. The maximal yield can be picked from these results, concluding the optimization procedure.

2.4.2 Per-chip Test Delay Ratio

Per-chip X

In Section 2.3, we found that, given an average-case performance constraint, we could express the optimal per-chip $X$ as a function of two parameters $\alpha$ and $\beta$ and $L$ by manually solving the optimization problem expressed in Equation 2.38. An important observation is that $\alpha$, expressed in Equation 2.39, is independent of the lower and upper limits on $L$. Thus, the worst-case performance limit on the forward delay line, which bounds the upper limit on $L$, does not affect the value of $\alpha$. Consequently, the optimal $\beta$ can be derived by substituting the per-chip test delay $XL$ in $SPQL_{BWC}$ by $\alpha L + \beta$, where $X'$ is assumed to be 1.

Per-chip X and X'

Similar to the average-case situation described in Section 2.3, the per-chip optimization problem is complicated because a finite grid search over all possible models of $X$ and $X'$ is difficult to construct. To simplify the optimization, we first run the analysis described in Section 2.4.1 by assuming $X$ and $X'$ are uniformly set.
We then assume $X'$ is set to the optimal uniform value and obtain the optimal per-chip $X$ as a function of $L$ using the analysis in Section 2.4.2. Lastly, we can find the optimal per-chip $X'$ by fixing $X$ to this optimal value. This heuristic approach does not guarantee an optimal solution, but does lead to a better yield compared to using uniform test margins.

2.4.3 Monte Carlo Simulation Results

Optimizing yield under worst-case performance constraints is more complex and, as described in Section 2.4, generally requires a brute-force search through a subset of parameters. As an example of an intermediate result of such a search, Figure 2.14 shows the normalized mean of the delay line versus test delay ratio, given two different SPQL requirements. These results arise from the search sweep step, where $X'$ is 1 and $L'$ is $T_{clk}(1+\gamma)/2$. Notice that the mean of the delay line increases as $X$ increases. The values above the plot are the corresponding yields of a BD/RO circuit under a worst-case performance requirement set to $1.05\,T_{clk}$, with temporary performance changes due to fluctuations in temperature and voltage allowed. Notice that as $X$ increases, the yield initially rises, reaches a maximum, and then begins to fall. This makes finding the optimum yield straightforward.

Figure 2.13: Yield of BD/RO designs under worst-case performance constraints and PVT variations

Similar results can be obtained for strict worst-case performance constraints which do not allow temporary performance changes due to voltage and/or temperature fluctuations, as illustrated in Figure 2.13. Here, the yield varies in a similar manner but with smaller values than those in Figure 2.14.

The difference between these two figures is summarized in Table 2.5. In particular, BD/RO designs under worst-case process variation lead to 8% higher yield when SPQL equals 0.0005 and 12% higher yield when SPQL equals 0.005.
However, a BD/RO design under worst-case process, voltage and temperature variation leads to 14% less yield when SPQL equals 0.0005 and 12% less yield when SPQL equals 0.005. We can further improve the yield of the BD/RO design by applying the per-chip analysis in Section 2.4.2, which improves the yield by 5% to 8%, as shown in the last row of Table 2.5. In both cases, if we assume temporary performance changes caused by voltage and/or temperature fluctuations are allowed, we see significant yield advantages for BD/RO designs over SYNC designs. However, if a stricter criterion for performance is required, BD/RO designs lose their advantage over SYNC designs.

Figure 2.14: Yield of BD/RO designs under worst-case performance constraints and process variation

Table 2.5: Yield of BD/RO over SYNC given SPQL and Worst-case Performance Constraints

| Yield | SYNC PVT (SPQL = 0.0005) | B WCP (SPQL = 0.0005) | B WCPVT (SPQL = 0.0005) | SYNC PVT (SPQL = 0.005) | B WCP (SPQL = 0.005) | B WCPVT (SPQL = 0.005) |
|---|---|---|---|---|---|---|
| Uniform | 0.61 | 0.66 | 0.52 | 0.65 | 0.73 | 0.56 |
| Per-chip | 0.65 | 0.71 | 0.56 | 0.69 | 0.77 | 0.59 |

2.4.4 Hold Time

Table 2.6: Yield Comparison under Large Hold Time and Worst-case Performance Constraints

| Hold time (h) | 0 | 10% Tclk (tune L') | 20% Tclk (tune L') | 10% Tclk (add min delay constraint) | 20% Tclk (add min delay constraint) |
|---|---|---|---|---|---|
| Yield BWCP | 0.66 | 0.54 | 0.47 | 0.54 | 0.50 |
| Yield BWCPVT | 0.52 | 0.42 | 0.25 | 0.43 | 0.32 |
| Yield SYNC | 0.61 | 0.42 | 0.01 | 0.61 | 0.05 |

Table 2.6 shows the yield comparison under both worst-case performance constraints and hold constraints. When the hold time is 0, Yield BWCPVT is smaller than Yield SYNC, but Yield BWCP is slightly bigger than Yield SYNC. However, because of the ability to tune $L'$, BD designs have a significant yield advantage when the hold time constraints increase to 20% of Tclk. Adding hold buffers improves the obtainable yields, but the yield advantage of BD/RO designs remains significant.

2.5 Aging analysis

With the aggressive downscaling of CMOS technology, Negative Bias Temperature Instability (NBTI) has become one of the most critical aging effects threatening the reliability of nanoscale CMOS circuits [41–44].
NBTI is caused by the stress on PMOS transistors ($V_{gs} = -V_{dd}$) and leads to an increase in both the threshold voltage ($V_{th}$) of the PMOS transistor and the delay of the associated gate. Due to the NBTI effect, many circuit paths that are not critical in the design stage may turn critical over time, causing timing violations during operation [42]. The NBTI-induced timing difference will significantly affect the accuracy of the proposed yield and shipped product quality loss analysis, and therefore it is imperative to consider the NBTI effect in the proposed evaluation framework.

2.5.1 Introduction to NBTI aging model

We use an NBTI aging model for a 65nm process of a commercial foundry, where NBTI is identified as the most critical aging effect for this process. In this model, the NBTI-induced threshold voltage shift $\Delta V_{th}$ of a PMOS transistor is calculated as

$$\Delta V_{th} = f_{NBTI}(V_{dd},\, t_{on},\, D_{load},\, L_g,\, T) \tag{2.42}$$

where $V_{dd}$, $t_{on}$, $L_g$, $D_{load}$, and $T$ represent the supply voltage, total "on" state time, gate length, load, and temperature, respectively. The aging model is similar to other accessible NBTI aging models in the literature [41, 43, 44].

Next, we propose an aging-aware Monte Carlo simulation flow with the NBTI model. We assume the circuit operates under a constant supply voltage $V_{dd}$ throughout its lifetime. For each PMOS transistor, the load $D_{load}$ and gate length $L_g$, which are determined in the design stage, are extracted from the netlist. The "on" state time $t_{on}$ is calculated by multiplying the total circuit operation time $t_{op}$ by the probability $p_{on}$ that the PMOS is in the "on" state (i.e., its gate input is at logic 0, so that $V_{gs} = -V_{dd}$), i.e., $t_{on} = t_{op} \cdot p_{on}$. According to [45], the probability of the logic "on" state can be calculated using two approaches: (i) the correlation coefficient method (CCM) proposed in [46], or (ii) simulations over a large set of typical vectors (possibly obtained by running a set of benchmark programs). In this work, the first approach is adopted.
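The on-time computation described above can be sketched as follows. Note that this sketch illustrates approach (ii), estimating the "on" probability by counting over a set of simulated vectors, rather than the correlation coefficient method adopted in this work. The function names and the sample gate trace are hypothetical, and the operation time is assumed to be given in years.

```python
def on_probability(gate_values):
    """Fraction of simulated vectors in which the PMOS gate input is logic 0.

    A PMOS transistor conducts (and is under NBTI stress) while its gate
    input is at logic 0, so this fraction is an estimate of p_on.
    """
    if not gate_values:
        raise ValueError("at least one simulated vector is required")
    zeros = sum(1 for value in gate_values if value == 0)
    return zeros / len(gate_values)


def on_time(t_op, p_on):
    """Total stress time t_on = t_op * p_on, in the same unit as t_op."""
    return t_op * p_on


# Hypothetical trace of one gate input over four simulated vectors.
p_on = on_probability([0, 1, 0, 0])
t_on = on_time(8.0, p_on)  # assumed 8 years of operation
```

The resulting on-time would then be one of the inputs of the foundry aging model in Equation 2.42, which is not reproduced here.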
One important observation is that the temperature parameter $T$ appears in both Equation 2.42 and the PVT variation analysis. In the proposed aging-aware analysis, for each PVT corner, the NBTI-induced $\Delta V_{th}$ of all the PMOS transistors in the circuit of interest is re-calculated based on Equation 2.42 with the $T$ of that corner. Furthermore, for each user-specified circuit operation time, the Monte Carlo simulation (mentioned in Section 2.2.4) is executed once with the updated $V_{th}$ drift applied to each PMOS transistor. Algorithm 2.1 provides the pseudo code of the flow.

Algorithm 2.1 Pseudo code for the aging-aware Monte Carlo simulation flow
  Load netlist of interest and technology library;
  foreach PMOS in the netlist do
    Extract D_load, L_g;
    Calculate p_on using the correlation coefficient method [46];
  end
  foreach corner of the technology do
    foreach user-specified circuit operation time t_op do
      foreach PMOS do
        Calculate t_on = t_op * p_on;
        Update V_th via Equation 2.42;
        Update width and length based on process variation and Equation 2.42;
      end
      Run simulations according to Section 2.2.4;
    end
  end

Table 2.7: Analysis of the Critical Path under Test and Delay Line over 9 Years

| Year | 0 | 3 | 6 | 9 |
|---|---|---|---|---|
| $\mu_T / \mu_L$ | 0.903 | 0.904 | 0.903 | 0.903 |
| $\sigma_T / \sigma_L$ | 0.931 | 0.931 | 0.931 | 0.931 |
| $\rho_{T,L}$ | 0.899 | 0.899 | 0.899 | 0.899 |
| Delay at year N over year 0 | 1.000 | 1.011 | 1.013 | 1.015 |

2.5.2 Results

Based on the NBTI model in Section 2.5.1, we run Monte Carlo simulation with global and local variations over a period of 9 years. Our goal is to determine how the mean and variance of the relative delays change over the lifetime of the part. We explore whether these changes impact the failure rate over time and how we should set the delay line in order to ensure functionality as the circuit ages.

Table 2.7 shows the trend of the delay of the critical path under test with a step size of 3 years. The delays of $T$ and $L$ increase by about 1% by the third year after being shipped.
The delay then increases more slowly, becoming 1.5% larger at year 9 after being shipped. Both the ratio of the means and the ratio of the standard deviations of $T$ and $L$ remain the same. This means that aging can be viewed as a global variation that affects $T$ and $L$ quite similarly. Consequently, the correlation coefficient of $T$ and $L$ remains constant, and asynchronous chips will likely remain functional as they age, although they run a bit slower.

In comparison, to ensure synchronous chips remain functional over their lifetime, the clock period or voltage must be conservatively set when shipped or altered over time. Otherwise, there is a significant chance that aged chips will fail.

In both SYNC and BD/RO design, however, if the performance constraint applies to the entire lifetime of the circuit, we should use the joint distribution of $T$, $C$ and $L$ from the Monte Carlo simulation that includes the aging variations. The analysis methods, however, are the same as in the non-aging case.

2.6 Extension to TR-BD Design

The performance advantage of the TR-BD design varies from application to application. It employs two delay lines, $\delta$ and $\Delta$. The expected delay line length is

$$E = \delta + p\,\Delta \tag{2.43}$$

where $p$ is the probability that the combinational logic needs $\delta + \Delta$ to complete the logic operation. In other words, $\delta$ is always needed, and $\Delta$ is employed only when the combinational logic requires more time. With this defined, the comparable asynchronous bundled data design with an average-case performance constraint would use $E$ to achieve the same performance, so its yield is restricted by the delay line length $E$, which is smaller than $\delta + \Delta$. In comparison, the delay line length of the TR-BD design is restricted by $\delta + \Delta$. The resulting yield and SPQL equations are as follows.
$$\mathrm{Yield}_{TBAVE} = P\!\left(T + s < XL\,\frac{\delta + \Delta}{E},\;\; T' - h > W X' L'\right) \tag{2.44}$$

and

$$\mathrm{SPQL}_{TBAVE} = P\!\left(C + s > L\,\frac{\delta + \Delta}{E} \;\text{ or }\; C' - h < W L' \;\middle|\; \text{pass test}\right) \tag{2.45}$$

Given that $p$ in Equation 2.43 is smaller than 1, $\frac{\delta + \Delta}{E}$ is larger than 1, which leads to a better yield for the TR-BD design. We conclude that the smaller $p$ is for the TR-BD design, the better its yield.

Chapter 3

Edge: a Yield-Aware Synthesis and PnR Flow

Chapter 2 introduced theoretical methods to set the test margin to maximize yield given a required SPQL, without sacrificing the performance of the circuit. This chapter addresses physical-design-level issues. In particular, we re-use existing EDA tools built for synchronous design to set the test margin and support manufacturing test of asynchronous bundled-data and timing resilient bundled-data designs.

This chapter first introduces the challenges in testing asynchronous circuits. It also introduces a hybrid scan solution that uses a traditional scan chain and global clock for propagating test vectors and responses but relies on the inserted delay lines to control the capture of the test responses. For bundled-data design, our solution involves a new click-based [47] controller described in this chapter. Using this controller, we then create specific test procedures that support both stuck-at and delay fault testing. The chapter then describes some of the challenges we will face as we try to extend this approach to timing resilient bundled-data designs.

Synthesis, placement and routing from RTL to bundled data design is not directly supported by commercial EDA tools. EDA tools rely on constraints such as max delay and clock period, but do not inherently understand the importance of the relative timing constraints between a delay line and combinational logic. Some researchers have explored the constraints for bundled data design. For example, the relative timing constraints have been quantified in [20].
A flow that enables the explicit definition of relative timing constraints on top of a standard commercial placement and routing tool flow, called ACDC, is proposed in [48] and tested on BD designs. We propose to adapt and extend this flow to support programmable delay line insertion, which enables the test margin.

3.1 Challenges in Asynchronous Designs

Tool development for asynchronous design relies on commercial EDA tools that do not support bundled-data designs directly. During the development of the asynchronous design flow, we observed challenges in synthesis, PnR and testing of asynchronous circuits.

So far, several challenges in synthesis and PnR have been identified and resolved.

- Commercial-tool version-specific behavior. We found that the behavior of specific "compile" commands differs between versions. Some support our min-delay constraints while others ignore them.
- Advanced libraries. The co-simulation environment relies on post-synthesis delays being stored in an SDF file format, but some versions of the synthesis tool do not support SDF 3.x. PrimeTime had to be used instead, complicating the flow.
- Design integration. There was no relatively small block that could be tested in isolation in our commercial setting. Instead, the flow had to be tested on an island of synchronous logic with no surrounding clock domain crossing circuitry. While addressing setup issues is relatively straightforward, this introduced hold issues that required manual tweaking of delay lines to fix.

Besides the lack of support from EDA tools, the usage of bundled data design is inhibited by testing-related problems. Many researchers have explored testability issues of bundled data circuits and proposed various test methods specifically for asynchronous design [49]. However, testing asynchronous circuits is still a difficult task due to problems with controllability. For example, stuck-at fault testing and delay testing for synchronous circuits require a single-step feature, which is easily achieved using a global clock.
Asynchronous controllers cannot trigger all local clocks at the same time without additional changes to the structure of the controller.

The problems with testing asynchronous circuits can be outlined as follows.

Figure 3.1: Undetectable Stuck-at fault

Figure 3.2: Scan Cell Proposed in [2]

Figure 3.3: Functional Test for GALS system

- In asynchronous design styles, some redundant circuits are added to cope with hazards and races and produce glitch-free outputs. However, the introduction of redundancy makes testing difficult: some stuck-at faults are undetectable. The SA0 fault at the output of the red AND gate in Figure 3.1 is undetectable, because the test vector that could potentially detect the fault always forces the output of the OR gate to logic 1. The output of the OR gate does not change even if the output of the red AND gate is stuck at 0.
- Latch-based bundled data design has no D flip-flops, which are widely used during synchronous test. To construct a scan cell for asynchronous design, the area of the latches is doubled or tripled. The scan cell in Figure 3.2 includes one additional latch and a D flip-flop to make the storage cell in the asynchronous circuit scannable.
- Testers used for synchronous design generally expect outputs at the rising edge of the global clock. To test an asynchronous circuit with a scan chain, the test team has to change the structure of the tester, or add a wrapper to the asynchronous circuit that interacts with the asynchronous design and the synchronous tester simultaneously. Figure 3.3 shows a wrapper that connects automatic test equipment to a globally asynchronous, locally synchronous (GALS) system.

Due to all these challenges in testing asynchronous circuits, there is a need for more research effort in this area.
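The undetectable stuck-at fault caused by redundancy can be reproduced with a small exhaustive fault simulation. The circuit of Figure 3.1 is not reproduced here, so the sketch below assumes a textbook hazard-free AND-OR structure, f = a·b + a'·c + b·c, in which the consensus term b·c is the redundant gate. The gate names `g1`, `g2`, `g3` and the helper functions are hypothetical.

```python
from itertools import product


def circuit(a, b, c, sa0=None):
    """Evaluate f = a*b + a'*c + b*c with an optional stuck-at-0 fault.

    sa0 names the AND gate whose output is forced to 0 ("g1", "g2" or
    "g3"); None evaluates the fault-free circuit. g3 is the redundant
    consensus term that is added only to remove a static hazard.
    """
    g1 = 0 if sa0 == "g1" else a & b
    g2 = 0 if sa0 == "g2" else (1 - a) & c
    g3 = 0 if sa0 == "g3" else b & c
    return g1 | g2 | g3


def detecting_vectors(sa0):
    """Return all input vectors whose output differs from the fault-free one."""
    return [v for v in product((0, 1), repeat=3)
            if circuit(*v) != circuit(*v, sa0=sa0)]


redundant = detecting_vectors("g3")  # no vector exposes the fault: []
ordinary = detecting_vectors("g1")   # one detecting vector: [(1, 1, 0)]
```

Activating the fault on `g3` requires b = c = 1, but then either `g1` or `g2` already drives the OR gate to 1, which is the masking effect described above.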
3.2 Physical Design using the ACDC Flow

PnR flows for synchronous designs are built to constrain the worst-case delay of the design to meet the specified clock period, whereas PnR flows for asynchronous design must instead constrain the combinational logic delay to be faster than the corresponding programmable delay line(s) of each pipeline stage such that a given overall cycle time target is met. Our PnR tool flow will address the constraints for asynchronous designs while also ensuring the optimal test margin can be added to the delay lines. The flow will automatically insert programmable delay lines into the design that can be configured to be shorter during test than during normal operation to enable the insertion of test margin. Physical design flows will be developed for both BD and TR-BD designs, and both flows will leverage the ICC physical design tool from Synopsys. The yields of both BD and TR-BD designs are then predicted using the flows, under PVT variation.

Moreover, the PnR flows must also add sufficient design-for-test circuits to support the manufacturing test of these circuits. For synchronous circuits, the added circuitry is typically some sort of scan chain, controlled by a special test clock, that is responsible for propagating test vectors and test responses in and out of the chip. The lack of global clocks thus creates new challenges for testing asynchronous designs.

3.2.1 ACDC Flow

ACDC enables the design to meet RTC requirements during synthesis and PnR. Small sample bundled-data circuits, such as a FIFO, have been implemented and verified using this flow [48]. Our proposed work is to adapt the flow to work with programmable delay lines, and test it on larger designs, including Plasma [50], an open core CPU.

The original ACDC flow is shown in Figure 3.4. ACDC is similar to regular synthesis and PnR flows for synchronous design, but with intermediate steps for optimization and delay element insertion to meet relative timing constraints.
Boxes with thicker borders contain steps that are exclusive to the flow. ACDC uses two tools, Design Compiler and IC Compiler (ICC) from Synopsys. The flow comprises three tasks [48] as follows.

- Initial logic synthesis configures the technology library, then reads in and elaborates the design.
- Optimization and delay line insertion maps the cells from the library to the design. The design is re-synthesized with more constraints set by the designer, including delay element insertion and relative timing constraints. At the end of synthesis, the netlist is stored.
- Physical synthesis loads the design, in the form of the synthesis netlist, into ICC. In this step, the traditional physical synthesis steps take place. In addition, a repetitive delay element insertion is performed, which adds delay elements one by one to the delay line until it meets the relative timing requirements. The design after PnR is finally stored.

The synthesis EDA tool enforces maximum delay constraints and performs setup and hold checks on sequential elements. Maximum delay constraints are usually applied on the logic path between registers to ensure the circuit can operate under a performance constraint. In addition, the EDA tool also supports minimum delay constraints, but only allows constant values for these. However, the delays of the delay line and the combinational logic are not constant. They depend on many factors, such as gate mapping and physical synthesis. Such relative timing constraints (RTC) are not well supported by commercial EDA tools, which mainly target synchronous designs. To complement the commercial EDA tool, the RTC are added and fulfilled using our modified ACDC flow. As mentioned earlier, the ACDC tool is designed to support non-programmable delay lines in the form of strings of inverters. We show that we can add programmable delay lines to the tool flow to enable test margin insertion.
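The repetitive delay element insertion of the physical synthesis task can be sketched as a simple loop. This is a simplified illustration rather than the ACDC implementation: it assumes every inserted element adds a fixed delay and that the data path delay does not change, whereas a real flow would re-run timing analysis after each insertion. The function name, its parameters and the delay values (assumed to be in picoseconds) are hypothetical.

```python
def insert_delay_elements(control_delay, data_delay, element_delay,
                          max_elements=1000):
    """Add delay elements one by one until the relative timing constraint holds.

    The constraint modeled here is that the control (request) path delay
    must be strictly greater than the data path delay. Returns the number
    of inserted elements and the resulting control path delay.
    """
    if element_delay <= 0:
        raise ValueError("element_delay must be positive")
    inserted = 0
    while control_delay <= data_delay:
        if inserted >= max_elements:
            raise RuntimeError("constraint not met within max_elements")
        control_delay += element_delay
        inserted += 1
    return inserted, control_delay


# Hypothetical stage: 100 ps control path, 250 ps data path, 40 ps per element.
count, final_delay = insert_delay_elements(100, 250, 40)  # (4, 260)
```

In this model the step size of one delay element sets how closely the control path delay can approach the data path delay.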
Figure 3.4: ACDC flow using existing commercial EDA tools

3.2.2 Relative timing constraints

Bundled-data circuits rely on carefully tuned delay elements (DEs) in the request path to ensure that handshake events take place only when data is valid; that is, the request signal must arrive only after the data signal is stable. RTCs define signal arrival order and can be used to fulfill such requirements. In the context of BD circuits, these constraints relate the data path and the control path: the minimum control path delay must be greater than the maximum data path delay; if it is not, a DE must be inserted in the control path to satisfy this requirement. In addition to DEs on the request path, it may be necessary to add DEs on acknowledge paths as well, to guarantee that the hold constraints of the registers are met.

3.2.3 Adding Test Margin using ACDC

The theorems developed in Section 2.2 support several important design decisions. Assuming the correlation between the delay line and the combinational logic is known, the designer can set the test margin to achieve a given SPQL. This may be done at design time if the delay line supports only a fixed test margin, or post-silicon if the test margin is configurable. Once the margin is set, the theory also identifies the expected yield.

However, the theory assumes that the correlation between the delay line and the combinational logic is known. This correlation may be obtained pre-layout using a variety of models, including the SPICE-based Monte Carlo analysis used in Section 2.2.4 and the advanced on-chip variation models built into commercial tools [40], which predict the correlation as a function of the bounding box of the relevant paths being explored. The latter model enables a practical design flow in which the test margin depends on the size of the bounding box that contains the combinational block and its matching delay line. The larger the bounding box, the lower the correlation, which motivates using a higher test margin.
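A bounding-box-dependent test margin could be sketched as follows. This is only an illustration of the idea that a larger bounding box calls for a higher margin: the breakpoints, the margin values, the use of the box diagonal as the size measure, and all function names are hypothetical placeholders, not values or interfaces from the thesis or from any commercial tool.

```python
from bisect import bisect_right

# Hypothetical lookup: bounding-box diagonal (um) -> test margin (fraction).
# Placeholder numbers chosen only so that the margin grows with the box size.
DIAGONAL_BREAKPOINTS_UM = [50.0, 200.0, 500.0]
TEST_MARGINS = [0.02, 0.04, 0.06, 0.08]


def bounding_box_diagonal(points):
    """Diagonal of the smallest axis-aligned box containing all (x, y) points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    width = max(xs) - min(xs)
    height = max(ys) - min(ys)
    return (width ** 2 + height ** 2) ** 0.5


def margin_for_stage(logic_cells, delay_line_cells):
    """Pick a margin from the box enclosing the logic and its delay line."""
    diagonal = bounding_box_diagonal(list(logic_cells) + list(delay_line_cells))
    return TEST_MARGINS[bisect_right(DIAGONAL_BREAKPOINTS_UM, diagonal)]


# A delay line placed next to its logic versus one placed far away.
logic = [(0.0, 0.0), (3.0, 4.0)]
print(margin_for_stage(logic, [(1.0, 1.0)]))
print(margin_for_stage(logic, [(600.0, 800.0)]))
```

In an actual flow, the table would be derived from the correlation model of the on-chip variation analysis rather than fixed by hand.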
The obtained correlation estimate can also be fine-tuned after fabrication using data obtained during chip characterization. If the delay line supports a programmable test margin, this may warrant altering the test margin setting post-silicon, potentially improving yield while still satisfying the SPQL requirements. Finally, as information about the actual chip environments becomes known, it may also be possible to reduce the test margins in the field.

It is also important to note that the design of the programmable delay line controls the achievable precision of the test margin settings. In particular, the test margin can be implemented by adding extra pairs of inverters to the delay line, as illustrated in Figure 2.5, or by controlling the back-body bias of a current-starved delay shift inverter as in [51], the latter enabling a delay resolution as small as 2 ps. The finer the resolution, the lower the quantization error and the closer the design can get to the desired SPQL and optimal yield. However, note that the delay line must also be implemented carefully to preserve the correlation between tested and shipped delays [51]. In particular, to ensure such a high correlation, the shipped and tested delay settings must share a relatively large common circuit path.

The ACDC tool implements the design by setting relative timing constraints and iteratively adding buffers to meet them. However, placement constraints that would improve yield are not considered as part of the flow. Ideally, the tool would identify the critical path and its corresponding delay line, and reduce the chance of a timing violation by placing them close to each other.

3.2.4 Edge: Integration of Synthesis and ACDC flow

The Edge flow [52] is an automated flow that converts single-clock RTL to click-controller-based bundled-data designs. The current version targets a simple two-cluster bundled-data design template, illustrated in Figure 3.5.
The first cluster can be viewed as driving a bank of master latches and the second cluster as driving a bank of slave latches. The programmable delay line on the backward latency path between controllers provides control over the non-overlap period between clocks, somewhat akin to traditional synchronous two-phase latch-based design. This enables efficient post-silicon hold fixing, freeing the designer from worst-case analysis. The flow also uses industrial EDA tools, including Design Compiler for synthesis and NC-Sim for simulation.

Figure 3.5: Target bundled-data template for Edge.

With the ACDC flow adopted into Edge, the conversion flow has five main steps, as illustrated in Figure 3.6.

Synthesis: Synthesize RTL to synchronous netlist
During this step, the flow synthesizes RTL code to a flip-flop-based design at a given clock frequency. In particular, the synthesis tool is constrained to use only one type of flip-flop (an Edge FF) to simplify subsequent steps in the flow.

Latch Conversion: Convert FF into latch based design
The flop-based design is converted to a latch-based design by replacing all Edge FFs with master-slave latches using a simple Tcl script. Instead of one single global clock, two clocks are used to drive the latches, one for the master latches and the other for the slave latches.

Figure 3.6: Edge flow using existing commercial EDA tools

Retiming: Balance combinational logic
After the latch conversion step, there is no combinational logic between master and slave latches. We perform retiming to move logic through the slave latches to achieve a balanced latch-based design. This allows the backward latency of the controllers to be hidden, so that it does not impact performance, and enables time borrowing. This step is performed using a Tcl script that sets the two clock phases and the time-borrowing window using built-in functionality of the commercial synthesis tools.
BD conversion: Add click controllers to the netlist
We then replace the global clocks with local clocks driven by communicating click controllers [47] connected via the delay lines. The click controllers make timing analysis relatively easy because they contain traditional flip-flops that break internal loops. Notice that the clock of the master latches is driven by a token controller, which sends an initial Rreq immediately after reset is de-asserted. The clock of the slave latches, on the other hand, is connected to a non-token controller. Delay lines are then inserted between controllers to match the worst-case combinational delay, using the iterative flow proposed in [48].

3.2.5 Cosimulation: Verification using UVM structure

The Universal Verification Methodology (UVM) is a standardized methodology for verifying integrated circuit designs. The Edge flow uses a UVM verification structure [53] to verify the final netlist. UVM enables the original synchronous and final asynchronous netlists to be simulated concurrently with the same set of inputs, comparing all corresponding outputs and checking for any inconsistencies.

Figure 3.7: Target industrial design, showing the initial decomposition of asynchronous and synchronous islands.

The verification starts with a special reset sequence that, in the asynchronous case, begins with a controller reset, followed by an input request and a second controller reset. The first two steps ensure that all datapath latches are reset, and the third step ensures that the controllers restart in a known state. After the reset sequence, random input data are continually sent to the UVM drivers of both the synchronous and BD netlists. UVM monitors collect the outputs of the two netlists and send them to a scoreboard in which they are compared. Assertions are added to ensure that the controllers work as expected, reporting any abnormality observed during simulation.
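The scoreboard comparison described above can be illustrated with a small model. UVM testbenches are written in SystemVerilog; the Python class below is only an analogy under stated assumptions, and its name and methods are hypothetical. It assumes that corresponding outputs are matched by arrival order rather than by time, since the asynchronous netlist does not produce outputs on clock edges.

```python
from collections import deque


class Scoreboard:
    """Compares outputs of two netlists in arrival order (hypothetical model)."""

    def __init__(self):
        self.sync_outputs = deque()    # from the synchronous netlist monitor
        self.async_outputs = deque()   # from the bundled-data netlist monitor
        self.compared = 0
        self.mismatches = []           # (index, expected, actual)

    def push_sync(self, value):
        self.sync_outputs.append(value)
        self._compare_available()

    def push_async(self, value):
        self.async_outputs.append(value)
        self._compare_available()

    def pending(self):
        """Number of outputs still waiting for their counterpart."""
        return len(self.sync_outputs) + len(self.async_outputs)

    def _compare_available(self):
        # Compare as long as both monitors have delivered an output.
        while self.sync_outputs and self.async_outputs:
            expected = self.sync_outputs.popleft()
            actual = self.async_outputs.popleft()
            if expected != actual:
                self.mismatches.append((self.compared, expected, actual))
            self.compared += 1


board = Scoreboard()
for value in (0x10, 0x20):
    board.push_sync(value)
board.push_async(0x10)
board.push_async(0x21)   # deliberately wrong second output
print(board.compared, board.mismatches)
```

Queuing the outputs of each netlist separately lets either one run ahead of the other without producing false mismatches.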
3.2.6 Tested using an industrial design

The flow was developed at USC and tested on an industrial design with partners from Qualcomm using a Qualcomm-proprietary low-power low-voltage cell library. The design chosen for bundled-data implementation is a low-power vision classifier hardware accelerator. Its main building blocks are a CPU, a dedicated hardware engine, and memory. The accelerator communicates with an external vision sensor via a sensor interface and, when embedded in a larger SoC, can communicate with other IP blocks via a standard bus interface.

As illustrated in Figure 3.7, to constrain our initial experiment, we chose to convert only the CPU and the HW engine into an asynchronous circuit, leaving the interface blocks (i.e., bus, sensor, and memory) in the synchronous domain. We introduced a simple shim that interfaces the two clock domains, handling reset of the asynchronous design and synchronizing the asynchronous and synchronous clocks. The design was performed and evaluated at Qualcomm (with guidance from us) using our open-source flow, and demonstrated that our flow can work across libraries in an industrial setting. More details on the challenges associated with this effort can be found in [34].

3.3 Scan Structure and Programmable Delay Line for Bundled Data Design

Despite the potential advantages of bundled-data design over its synchronous counterparts, the test of BD designs is not well developed, due to the lack of a global clock and of support from related EDA tools. We add a scan chain to click-controller-based bundled-data designs and have enhanced our open-source flow to support this test structure.

Functional test is easier than scan-based test, but leads to relatively low fault coverage. To achieve higher fault coverage with manageable area overhead, we propose a scan-based structure that supports both stuck-at and delay fault detection. The controller we use is named Click and is discussed in Section 3.3.1.
The storage element is a regular D flip-flop, driven by a global clock during test and by a local clock during normal working mode. The purpose of the test design is to adopt current EDA tools and re-use them to automatically generate test vectors.

3.3.1 Introduction to Click Controllers

The Click controller is an asynchronous template for two-phase handshaking protocols that uses only standard cells (no special cells such as C-elements). This makes these controllers easy to integrate into open-source tools that operate on commercial cell libraries.

Figure 3.8: Non-token Click Controller

The click controller for the master latches, shown in Figure 3.8, includes one FF and three complex gates. Initially, req, ack, and clk are all reset to logic 0. To initiate a data transaction, a.req transitions from 0 to 1, which causes the DFF to toggle to 1 and drives b.req/a.ack to 1 as well. As feedback, clk returns to logic 0. Thus one transaction completes, and the controller continues by requesting the next stage to accept the new data. The minimum pulse width can be controlled by adding delay elements to the feedback path.

Figure 3.9: Token Click Controller

The structure of the click controller for the slave latches, shown in Figure 3.9, is similar to that of the non-token controller. Because it is a token controller, it resets the DFF to logic 1 at the beginning to initiate a request. Thus the very first data transaction is initiated when both a.req and b.ack transition from 0 to 1. As in its non-token counterpart, the minimum pulse width can be controlled by adding delay elements to its feedback path, from the output of the DFF to its D input.

3.3.2 Scan Cells for Edge

The use of latches in Edge enables retiming and potential time borrowing. Thus the scan cell design should reflect that latches are needed during working mode and that scan FFs are used only during test.

Figure 3.10: Scan Cell

The scan cell for Edge is shown in Figure 3.10.
Based on IEEE Standard 1149.1 [54], a scan interface includes the following five standardized signals.

TDI (Test Data In): This signal is named TI in Figure 3.10. It serves as the data input of the scan cell only when test mode is enabled. During test mode, TDI is the data input of a scan chain. It works together with TCK to support data scan-in and scan-out, the basic operations during test.

TCK (Test Clock): This signal is named Tclk in Figure 3.10 and is the clock source of the scan cell. Latch1 is transparent when TCK is logic 0 and opaque when it is logic 1. In contrast, Latch2 is transparent when TCK is logic 1 and opaque when it is logic 0. TCK is driven by a click controller, whereas the synchronous test clock is a global source.

TMS (Test Mode Select): This signal is named TE in Figure 3.10. Asserting it to 1 enables scan-in and scan-out; it is set to 0 during working mode. This signal switches between two inputs: D, the input during normal mode, and TI, the input during test mode.

TDO (Test Data Out): This signal is named TQ in Figure 3.10. The test data out of one cell is connected to the test data in of the next to form one or more scan chains. The test data is scanned in and out bit by bit through the scan chains.

TRST (Test Reset): This signal is named RN in Figure 3.10. It resets the scan cell to logic 0 at the very beginning of the circuit simulation.

The port names of the scan cell are consistent with a 28nm technology library on our group server, so they differ slightly from the standard ones. Also note that Latch1 is used during test mode only, whereas Latch2 is always used. Tclk is driven by a click controller, described in Figure 3.8. In the rest of this section, a pulse on Tclk is equivalent to a request to the click controller that triggers the pulse.

3.3.3 Test Procedure

The proposed structure consists of one controller for each stage/cluster of FFs.
In this approach, we sketch a test plan for bundled-data designs that is similar to that of synchronous designs. The purpose is to re-use commercial tools built for synchronous designs for asynchronous test, and thus avoid the effort of redesigning the ATPG process for bundled-data designs. The proposed test scheme for stuck-at (SA) fault test is as follows.

Set TE to 1 and all other signals to 0.
Use Tclk and TI as the scan-in clock and data to fill all stages with test vectors.
Set TE to 0 and send another pulse on Tclk.
Set TE to 1 and scan out data using Tclk.

Except for the controller-driven test clock, the test sequence is the same for synchronous and bundled-data designs. Similarly, we can describe the delay fault test sequences for bundled-data designs. Delay fault test includes two basic techniques, launch-off-shift and launch-off-capture. Their timing is shown in Figures 3.11 and 3.12. The launch-off-shift procedure relies on the shifting of the scan chain to generate the second test vector during the CC phase shown in Figure 3.11, whereas the launch-off-capture procedure relies on the output of the combinational logic of each stage to generate the second test vector during the CC phase shown in Figure 3.12.

Figure 3.11: Timing of launch-off-shift [3]

The DFT teams at some of our industry contacts state that launch-off-shift is not used, due to the limitations of the shift-enable signal: it is hard to make sure that its transition happens on time during at-speed test. In Figure 3.11, the first transition on SEN is required right after the pulse in the LC phase and before the pulse in the CC phase. Because the shift-enable signal drives all scan cells, its transition time is long and the timing constraint is hard to meet.

Figure 3.12: Timing of launch-off-capture

Thus we focus on the launch-off-capture methodology and propose the following test scheme for Edge.

Set TE to 1 and all other signals to 0.
Use Tclk and Sin as the scan-in clock and data to fill all stages with test vectors.
Set TE to 0 and send another pulse on Tclk.
Reset the click controller to its initial state.
Set TE to 1 and scan out data using Tclk.

There are many solutions that stop requests between stages; however, with them the test sequence and test procedure generated by a synchronous EDA tool can no longer be re-used. For example, Mr. Go [55] is a proper stopper that has been applied to a chip, but the test sequence in a Mr. Go design needs to be regenerated for each new circuit. The trade-off between test sequence generation and a global test clock is another topic we can address. There may be more efficient ways of designing the controller, possibly with modified scan cells for asynchronous design, that remove the need for a global clock.

3.4 S2A and A2S interfaces

In synchronous design, a clock domain crossing is the transfer of signals from one clock domain to another. There is no clock domain crossing issue between asynchronous designs, as long as they use the same handshake protocol. However, it becomes an issue when signals traverse from a synchronous design into an asynchronous design (and vice versa). The synchronous-to-asynchronous (S2A) interface enables data transfer into the asynchronous design, which requires glitch-free control signals. In contrast, the asynchronous-to-synchronous (A2S) interface enables data transfer into the synchronous design, and must deal with potential metastability issues due to the domain crossing.

3.4.1 Introduction to sync-async domain crossing

A modular synchronizing FIFO has been proposed to communicate between NoC and function blocks [56]. The input of the FIFO can be configured as a put interface for either synchronous or 4-phase asynchronous designs. Similarly, the output of the FIFO can be configured as a get interface.
However, our Edge design is based on a 2-phase handshaking protocol, which requires additional circuitry to embed the open-source FIFO [57] into Edge.

3.4.2 S2A Structure

The S2A structure re-uses the FIFO for 4-phase asynchronous bundled-data design. To interface with Edge and its 2-phase handshaking, additional circuitry is added, as shown in Figure 3.13. The green part is the original FIFO, where the signals on the left interact with the synchronous domain and the signals on the right interact with the 4-phase asynchronous domain. The asynchronous part passively waits for a get signal to initiate the transaction. For this reason, an XNOR gate is added to trigger such a transaction after reset. After the FIFO acknowledges, the data should be ready to be consumed. Thus we need a toggle flip-flop to notify the asynchronous design to accept the data. A detailed Petri-net description of this shim logic is given in Section 3.4.3.

3.4.3 S2A Petri-net and timing constraints

The reset signal sets the FF to logic 0, which consequently triggers get4+. Ack4+ is not able to respond while the reset signal is active. Right after the reset signal is deactivated, ack4+ responds and causes req2+ and get4- simultaneously. In other words, the FIFO is supposed to deactivate its signals and complete the 4-phase transaction, and Edge is supposed to save the data. It is essential that latching in the data takes longer than completing the 4-phase transaction; otherwise, the controller is in danger of receiving an unexpected sequence and malfunctioning. In the Petri-net description, the dashed lines are timing constraints that must be satisfied. After Edge responds by toggling the acknowledge signal, the transaction ends.

Figure 3.13: Synchronous to asynchronous domain crossing

Looking closely at Figure 3.13, we can see that this timing constraint implies that the path netQ > xnor > req4 > ack4 should be shorter than the path netQ > req2 > ack2 > xnor.
Also, the ack4 from the FIFO should meet the minimum pulse width requirement of the FF.

Figure 3.14: Petri-net for S2A

3.4.4 A2S Structure

Similar to the S2A structure, the A2S structure also re-uses the open-source FIFO [56, 57] to interact with the synchronous design, with additional circuitry added to enable data transactions with a 2-phase handshaking protocol. The green rectangle in Figure 3.15 is the synchronizing FIFO that we propose to re-use. The signals on the left of the block follow the 4-phase handshaking protocol. The ones on the right wait for the synchronous block to pull data from the FIFO. The only part we need to address is the asynchronous part, a 2-phase to 4-phase conversion circuit that enables the use of Edge. The request from Edge is supposed to trigger the request to the FIFO, and we place an XOR gate on the request for this purpose. When the FIFO acknowledges and saves the data, Edge should be acknowledged as well. The flip-flop is added to enable this acknowledgement. A detailed Petri-net is shown in Section 3.4.5.

Figure 3.15: Asynchronous to synchronous domain crossing

3.4.5 A2S Petri-net model and timing constraints

The reset signal sets the FF to logic 1. The req2 signal from Edge is supposed to initiate the transaction and triggers req4+, the request for the FIFO to save the data from Edge. The FIFO then responds by asserting ack4+, which triggers the acknowledge to Edge and req4- simultaneously. This indicates that the FIFO is trying to deactivate the request and acknowledge while Edge can send the next request. Ideally, it takes longer for Edge to send the next data; otherwise, the Petri-net is in danger of unexpected sequences. The dashed lines mark the branches that should be faster than their parallel branches. In Figure 3.15, the path netQ > xor > req4 > ack4 is supposed to be shorter than the path netQ > ack2 > req2 > xor. Also, the ack4 from the FIFO should meet the minimum pulse width requirement of the FF.
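The path constraint stated above for the A2S shim amounts to a race between two paths that start at the same net. A minimal sketch of such a check is given below. The node names follow the paths in the text, but the per-segment delays and the helper functions are hypothetical; in practice the delays would come from static timing analysis of the placed-and-routed netlist.

```python
def path_delay_ps(path, edge_delay_ps):
    """Sum the delays of consecutive node pairs along a path."""
    return sum(edge_delay_ps[(src, dst)] for src, dst in zip(path, path[1:]))


def race_slack_ps(fast_path, slow_path, edge_delay_ps):
    """Slow-path delay minus fast-path delay.

    A positive value means the path that must be shorter wins the race.
    """
    return (path_delay_ps(slow_path, edge_delay_ps)
            - path_delay_ps(fast_path, edge_delay_ps))


# Paths named in the text for the A2S shim.
A2S_FAST = ["netQ", "xor", "req4", "ack4"]
A2S_SLOW = ["netQ", "ack2", "req2", "xor"]

# Hypothetical segment delays in picoseconds, for illustration only.
EXAMPLE_DELAYS_PS = {
    ("netQ", "xor"): 30, ("xor", "req4"): 20, ("req4", "ack4"): 100,
    ("netQ", "ack2"): 40, ("ack2", "req2"): 300, ("req2", "xor"): 30,
}

slack = race_slack_ps(A2S_FAST, A2S_SLOW, EXAMPLE_DELAYS_PS)
print(f"A2S race slack: {slack} ps")
```

The same helper applies to the S2A shim by substituting its two paths, netQ > xnor > req4 > ack4 and netQ > req2 > ack2 > xnor. A non-positive slack would indicate that delay has to be added to the slower branch.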
Figure 3.16: Petri-net for A2S

Chapter 4 Summary

The thesis ends with a summary of the work and some conclusions.

4.1 Summary of Accomplishments

The thesis adapts a mathematical model to a variety of bundled-data (BD) designs and introduces an approach that sets the test delay ratio using both uniform and per-chip test margins. We first address the optimization of yield subject to a maximum SPQL and an average-case performance limit. The yield and SPQL definitions include both setup and hold requirements. For the per-chip test margin, we use the delay of the delay lines in bundled-data designs as a proxy for chip performance. This leads to an increase in yield for the same required SPQL. We also show that the advantage of these bundled-data designs over a comparable synchronous design is significant, particularly when the correlation between the delay line and the combinational logic is high. As for chip performance, we also address the problem of tuning delay lines and adding test margin to maximize chip yield given a worst-case performance constraint.

Furthermore, on the tool side, we built an enhanced physical design flow based on ACDC [48] that supports the insertion of programmable delay lines to control the test delay ratio. To re-use current test tools for synchronous design and insert test vectors without writing a new ATPG, the new flow automatically inserts scan structures that support stuck-at and delay fault testing. Lastly, support for automatically adding clock domain crossing circuits, which enable the use of asynchronous modules within a larger synchronous design, has been added to the tool. In particular, we leveraged an open-source GitHub circuit to create two-phase S2A and A2S interfaces that properly mitigate metastability.

4.2 Conclusions and Possible Next Steps

Asynchronous design removes the need for a global clock and global clock margins that are increasingly challenging in synchronous designs.
It offers many attractive possibilities for power and/or performance improvements. Because the delay constraints for many asynchronous designs are relative rather than absolute, they are more robust to PVT variation. This thesis quantifies this improvement in terms of an improvement in yield.

However, the lack of EDA tool support hampers widespread use. The thesis mathematically analyzes two bundled-data design styles popular in the asynchronous domain and shows how to optimally configure the delay lines to maximize yield subject to performance and manufacturing constraints. It also shows how existing EDA tools can be used to implement these schemes via scripts that force the tool to check and guarantee key relative timing relationships between the delay line(s) and the combinational logic. Further, a scan-based test structure that allows reuse of test EDA tools is designed.

We believe this work helps complete the tool flow and further enables successful commercialization of these types of design styles. Some start-up companies, e.g., Reduced Energy Microsystems (REM) [58], are aiming to commercialize BD/TR-BD based asynchronous chips with high energy efficiency. A test structure and strategy is a must when managing a mass production line to commercialize chips, and we hope that this thesis will be an important step in this direction.

The future work includes several potential directions.

First, we can extend the theory to apply to other delay models, including log-normal, which may be more accurate in sub-threshold regions of operation. In addition, further simulation-based studies are necessary to support the theorems based on the log-normal distribution.

Second, the physical design for commercialization needs more detailed placement and routing to optimize the performance of asynchronous designs. Optimization of the power consumption after physical design has not been fully explored.
We believe there is clearly a trade-off between area/power consumption and chip performance for asynchronous design that deserves additional attention.

Third, the Edge flow can be improved by supporting Blade, a timing-resilient bundled-data design style. Due to the complexity of the controllers and error-detecting latches of Blade, more constraints are needed to make sure it works after physical design.

Finally, exploring formal verification methods for the circuitry generated by Edge is also an important area. In cases where master and slave latches are both retimed, it is hard to formally verify asynchronous designs. The tools that formally verify synchronous designs should be enhanced to support asynchronous designs as well.

Bibliography

[1] J. Xiong, V. Zolotov, C. Visweswariah, and P. A. Habitz, "Optimal test margin computation for at-speed structural test," IEEE Trans. on CAD, vol. 28, no. 9, pp. 1414–1423, 2009.
[2] F. Te Beest, A. Peeters, K. van Berkel, and H. Kerkhoff, "Scan chain optimization for asynchronous circuits," 2002.
[3] M. Abramovici, M. A. Breuer, and A. D. Friedman, Digital systems testing and testable design. Computer science press New York, 1990, vol. 2.
[4] P. Kurup and T. Abbasi, Logic synthesis using Synopsys®. Springer Science & Business Media, 2012.
[5] I. Synopsys, "Compiler datasheet," 2016.
[6] I. E. Sutherland, "Micropipelines," Communications of the ACM, vol. 32, no. 6, pp. 720–738, 1989.
[7] D. Hand, M. T. Moreira, H.-H. Huang, D. Chen, F. Butzke, Z. Li, M. Gibiluka, M. Breuer, N. L. V. Calazans, and P. A. Beerel, "Blade – a timing violation resilient asynchronous template," in Asynchronous Circuits and Systems (ASYNC), 2015 21st IEEE International Symposium on. IEEE, 2015, pp. 21–28.
[8] G. Russell, A. Yakovlev, A. Bystrov, D. Kinniment, and O. Maevsky, "On-chip structures for timing measurements and test," in Proceedings of the 8th International Symposium on Asynchronous Circuits and Systems.
IEEE Computer Society, 2002, p. 190.
[9] P. A. Beerel, R. Ozdag, and M. Ferretti, A Designer's Guide to Asynchronous VLSI. Cambridge University Press, 2010.
[10] O. A. Petlin and S. B. Furber, "Scan testing of micropipelines," in Proceedings of the 13th IEEE VLSI Test Symposium, 1995, pp. 296–301.
[11] M. Abramovici, M. Breuer, and A. Friedman, Digital Systems Testing and Testable Design. John Wiley & Sons, 1990.
[12] N. K. Jha and S. Gupta, Testing of Digital Systems. Cambridge University Press, 2003.
[13] D. Appello, A. Fudoli, K. Giarda, V. Tancorre, E. Gizdarski, and B. Mathew, "Understanding yield losses in logic circuits," IEEE Design & Test, vol. 21, no. 3, pp. 208–215, 2004.
[14] Y. Gao, Y. Zhang, D. Cheng, and M. A. Breuer, "Trading off area, yield and performance via hybrid redundancy in multi-core architectures," in 31st IEEE VLSI Test Symposium (VTS), 2013, pp. 1–6.
[15] J. Cortadella, M. Lupon, A. Moreno, A. Roca, and S. S. Sapatnekar, "Ring oscillator clocks and margins," in ASYNC, 2016.
[16] C. Visweswariah, K. Ravindran, K. Kalafala, S. G. Walker, S. Narayan, D. K. Beece, J. Piaget, N. Venkateswaran, and J. G. Hemmett, "First-order incremental block-based statistical timing analysis," IEEE Trans. on CAD, vol. 25, no. 10, pp. 2170–2180, 2006.
[17] S. B. Furber and P. Day, "Four-phase micropipeline latch control circuits," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 4, no. 2, pp. 247–253, 1996.
[18] T. Lin, K.-S. Chong, J. S. Chang, and B.-H. Gwee, "An ultra-low power asynchronous-logic in-situ self-adaptive vdd system for wireless sensor networks," IEEE Journal of Solid-State Circuits, vol. 48, no. 2, pp. 573–586, 2013.
[19] M. Imai, T. Van Chu, K. Kise, and T. Yoneda, "The synchronous vs. asynchronous noc routers: an apple-to-apple comparison between synchronous and transition signaling asynchronous designs," in Networks-on-Chip (NOCS), 2016 Tenth IEEE/ACM International Symposium on. IEEE, 2016, pp. 1–8.
[20] J.
V. Manoranjan and K. S. Stevens, "Qualifying relative timing constraints for asynchronous circuits," in Asynchronous Circuits and Systems (ASYNC), 2016 22nd IEEE International Symposium on. IEEE, 2016, pp. 91–98.
[21] I. J. Chang, S. P. Park, and K. Roy, "Exploring asynchronous design techniques for process-tolerant and energy-efficient subthreshold operation," IEEE Journal of Solid-State Circuits, vol. 45, no. 2, pp. 401–410, 2010.
[22] N. Jayakumar, R. Garg, B. Gamache, and S. P. Khatri, "A pla based asynchronous micropipelining approach for subthreshold circuit design," in Proceedings of the 43rd annual Design Automation Conference. ACM, 2006, pp. 419–424.
[23] J. Liu, S. M. Nowick, and M. Seok, "Soft mousetrap: A bundled-data asynchronous pipeline scheme tolerant to random variations at ultra-low supply voltages," in 2013 IEEE 19th International Symposium on Asynchronous Circuits and Systems. IEEE, 2013, pp. 1–7.
[24] B. Munger, D. Akeson, S. Arekapudi, T. Burd, H. R. Fair, J. Farrell, D. Johnson, G. Krishnan, H. McIntyre, E. McLellan et al., "Carrizo: A high performance, energy efficient 28 nm apu," IEEE Journal of Solid-State Circuits, vol. 51, no. 1, pp. 105–116, 2016.
[25] D. Ernst, N. S. Kim, S. Das, S. Pant, R. Rao, T. Pham, C. Ziesler, D. Blaauw, T. Austin, K. Flautner et al., "Razor: A low-power pipeline based on circuit-level timing speculation," in Microarchitecture, 2003. MICRO-36. Proceedings. 36th Annual IEEE/ACM International Symposium on. IEEE, 2003, pp. 7–18.
[26] A. J. Drake, A. KleinOsowski, and A. K. Martin, "A self-correcting soft error tolerant flip-flop," in 12th NASA Symposium on VLSI Design, Coeur d'Alene, Idaho, USA, Oct. Citeseer, 2005, pp. 4–5.
[27] A. Jagirdar, R. Oliveira, and T. J. Chakraborty, "Efficient flip-flop designs for set/seu mitigation with tolerance to crosstalk induced signal delays," in Proc. IEEE Workshop Silicon Errors Logic Syst. Effects. Citeseer, 2007.
[28] R. Naseer and J.
Draper, "The df-dice storage element for immunity to soft errors," in Circuits and Systems, 2005. 48th Midwest Symposium on. IEEE, 2005, pp. 303–306.
[29] M. Choudhury, V. Chandra, K. Mohanram, and R. Aitken, "Timber: Time borrowing and error relaying for online timing error resilience," in Design, Automation & Test in Europe Conference & Exhibition (DATE), 2010. IEEE, 2010, pp. 1554–1559.
[30] M. Fojtik, D. Fick, Y. Kim, N. Pinckney, D. Harris, D. Blaauw, and D. Sylvester, "Bubble razor: An architecture-independent approach to timing-error detection and correction," in Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2012 IEEE International. IEEE, 2012, pp. 488–490.
[31] G. W. Zobrist, VLSI fault modeling and testing techniques. Intellect Books, 1993.
[32] S. Lee, B. Cobb, J. Dworak, M. Grimaila, and M. Mercer, "A new atpg algorithm to limit test set size and achieve multiple detections of all faults," in Proceedings of the conference on Design, automation and test in Europe. IEEE Computer Society, 2002, p. 94.
[33] A. Krstic and K.-T. T. Cheng, Delay fault testing for VLSI circuits. Springer Science & Business Media, 1998, vol. 14.
[34] Y. Zhang, H. Cheng, D. Chen, H. Fu, S. Agarwal, M. Lin, and P. Beerel, "Challenges in building an open-source flow from rtl to bundled-data design," in 24th IEEE International Symposium on Asynchronous Circuits and Systems (ASYNC), May 2018.
[35] J. Xiong, V. Zolotov, C. Visweswariah, and P. A. Habitz, "Optimal margin computation for at-speed test," in Proceedings of the conference on Design, automation and test in Europe. ACM, 2008, pp. 622–627.
[36] L. Li, K. Chakrabarty, S. Kajihara, and S. Swaminathan, "Efficient space/time compression to reduce test data volume and testing time for ip cores," in 18th IEEE International Conference on VLSI Design, 2005, pp. 53–58.
[37] S. P. Mu, M. C. T. Chao, S. H. Chen, and Y. M.
Wang, \Statistical frame- work and built-in self-speed-binning system for speed binning using on-chip ring oscillators," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 24, no. 5, pp. 1675{1687, May 2016. [38] J. Cortadella, A. Kondratyev, L. Lavagno, and C. P. Sotiriou, \Desynchro- nization: Synthesis of asynchronous circuits from synchronous specications," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Sys- tems, vol. 25, no. 10, pp. 1904{1921, 2006. [39] V. S. H. C. P. B. Yang Zhang, Haipeng Zha, \Test margin and yield in bundled data and ring-oscillator based designs," accepted to 23rd IEEE International Symposium on Asynchronous Circuits and Systems (ASYNC), 2017. [40] S. Walia, \Primetime R advanced OCV technology," 2009. [On- line]. Available: https://www.synopsys.com/Tools/Implementation/SignO/ CapsuleModule/PrimeTime AdvancedOCV WP.pdf [41] F. Kriebel, S. Rehman, M. Shaque, and J. Henkel, \ageOpt-RMT: compiler- driven variation-aware aging optimization for redundant multithreading," in Proceedings of the 53rd Annual Design Automation Conference. ACM, 2016, p. 46. 113 [42] J. B. Velamala, V. Ravi, and Y. Cao, \Failure diagnosis of asymmetric ag- ing under NBTI," in Computer-Aided Design (ICCAD), 2011 IEEE/ACM International Conference on. IEEE, 2011, pp. 428{433. [43] D. Gnad, M. Shaque, F. Kriebel, S. Rehman, D. Sun, and J. Henkel, \Hayat: Harnessing dark silicon and variability for aging deceleration and balancing," in Design Automation Conference (DAC), 2015 52nd ACM/EDAC/IEEE. IEEE, 2015, pp. 1{6. [44] T.-B. Chan, W.-T. J. Chan, and A. B. Kahng, \Impact of adaptive voltage scaling on aging-aware signo," in Proceedings of the Conference on Design, Automation and Test in Europe. EDA Consortium, 2013, pp. 1683{1688. [45] J. Li and J. Draper, \Accelerating soft-error-rate (SER) estimation in the presence of single event transients," in Proceedings of the 53rd Annual Design Automation Conference. ACM, 2016, p. 55. 
[46] S. Ercolani, M. Favalli, M. Damiani, P. Olivo, and B. Ricco, \Estimate of signal probability in combinational logic networks," in European Test Confer- ence, 1989., Proceedings of the 1st. IEEE, 1989, pp. 132{138. [47] A. Peeters, F. te Beest, M. de Wit, and W. Mallon, \Click elements: An implementation style for data-driven compilation," in Asynchronous Circuits and Systems (ASYNC), 2010 IEEE Symposium on. IEEE, 2010, pp. 3{14. [48] M. Gibiluka, M. T. Moreira, and N. L. V. Calazans, \A bundled-data asyn- chronous circuit synthesis ow using a commercial eda framework," in Digital System Design (DSD), 2015 Euromicro Conference on. IEEE, 2015, pp. 79{86. [49] S. Zeidler and M. Krsti c, \A survey about testing asynchronous circuits," in Circuit Theory and Design (ECCTD), 2015 European Conference on. IEEE, 2015, pp. 1{4. [50] \Plasma - most MIPS I(TM) opcodes, howpublished = https://opencores. org/project/plasma, note = Updated: Nov 21, 2016." [51] W. Hua, R. Tadros, and P. Beerel, \A 2 ps resolution, ne-grained delay element in 28 nm fdsoi," Electronics Letters, vol. 51, no. 23, pp. 1848{1850, 2015. [52] Y. Zhang, H. Zha, and H. Cheng, \Edge 1.0.2," https://github.com/ nobodybutyou1/Edge, 2018. [53] V. R. Cooper, Getting Started with UVM: A Beginner's Guide. Verilab Publishing, 2013. 114 [54] R. Bennetts and A. Osseyran, \Ieee standard 1149.1-1990 on boundary scan: history, literature survey, and current status," Journal of Electronic Testing, vol. 2, no. 1, pp. 11{25, 1991. [55] M. Roncken, S. M. Gilla, H. Park, N. Jamadagni, C. Cowan, and I. Sutherland, \Naturalized communication and testing," in Asynchronous Circuits and Sys- tems (ASYNC), 2015 21st IEEE International Symposium on. IEEE, 2015, pp. 77{84. [56] T. Ono and M. Greenstreet, \A modular synchronizing fo for nocs," in Pro- ceedings of the 2009 3rd ACM/IEEE International Symposium on Networks- on-Chip. IEEE Computer Society, 2009, pp. 224{233. [57] A. 
Abdelhadi, \Edge 1.0.2," https://github.com/AmeerAbdelhadi/ cell-based mixed fo. ow, 2018. [58] REM: Reduced Energy Microsystems: http://www.remicro.com. 115
Abstract
The use of bundled-data and bundled-data resilient designs with programmable delay lines has been proposed to combat process, voltage, and temperature (PVT) variations. Programmable delay lines make it possible to add test margin to such designs: the delay line in shipped products is set slower than the setting at which the chip was successfully tested. How to set the test margin to maximize yield under manufacturing constraints has not yet been explored. We create mathematical models to analyze this problem and develop practical schemes to set test margins optimally.

The underlying hypothesis is that the correlation between the delays of the programmable delay lines and of the combinational logic in bundled-data asynchronous designs will lead to higher yields than in comparable synchronous designs. This thesis aims to test this hypothesis by completing the following steps.

1. Adopt a mathematical model used in synchronous design to quantify the yield and manufacturing constraints of a variety of bundled-data design styles. This will involve a canonical model with normally distributed delays and slacks, as well as the correlations between such delays.
2. Mathematically prove that, under certain conditions, the optimal yield of bundled-data designs is higher than that of their synchronous counterparts, and quantify this advantage.
3. Explore schemes for optimally setting both uniform and per-chip test margins. A uniform test margin is conservative, applying one fixed margin to all chips. In contrast, the per-chip method adjusts the test margin based on each chip's performance and can lead to higher yields.
4. Analyze two delay lines, the forward and backward latency delay lines, and propose a strategy for tuning them post-silicon to balance performance and yield.
5. Implement a yield-aware placement and routing physical design flow for bundled-data designs. We will show how commercial EDA tools [4] [5] can be used to place, route, and configure the delay lines within the bundled-data design. The package will be open sourced and distributed online.
6. Test the flow on industry-scale designs that include scan chains and test controllers to support post-silicon characterization, tuning of the delay lines, and efficient manufacturing test.
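As a rough illustration of the hypothesis above, consider a single timing path in which the delay-line delay and the combinational-logic delay are jointly normal with correlation rho. The timing slack (delay-line delay minus logic delay) is then normal with variance sigma_d^2 + sigma_l^2 - 2 rho sigma_d sigma_l, so a strong positive correlation shrinks the slack variance and raises the probability that the slack is non-negative. The sketch below is not the model developed in the thesis; the function name, the single-path simplification, and all numeric values are assumptions chosen only for illustration. A synchronous design is approximated here as a fixed clock period, that is, a "delay line" with zero variance.

```python
from math import sqrt
from statistics import NormalDist


def pass_probability(mu_delay, sigma_delay, mu_logic, sigma_logic, rho):
    """Return P(delay-line delay >= logic delay) for jointly normal delays.

    Illustrative single-path model; all names here are hypothetical.
    """
    mean_slack = mu_delay - mu_logic
    # Variance of the difference of two correlated normal variables.
    var_slack = (sigma_delay ** 2 + sigma_logic ** 2
                 - 2.0 * rho * sigma_delay * sigma_logic)
    if var_slack <= 0.0:
        # Degenerate case: the slack is deterministic.
        return 1.0 if mean_slack >= 0.0 else 0.0
    return NormalDist().cdf(mean_slack / sqrt(var_slack))


# Assumed numbers in arbitrary time units (not taken from the thesis).
MU_LOGIC, SIGMA_LOGIC = 1.00, 0.10
MU_DELAY, SIGMA_DELAY = 1.15, 0.10

# Fixed clock period: the timing reference does not vary with the chip.
fixed_clock = pass_probability(MU_DELAY, 0.0, MU_LOGIC, SIGMA_LOGIC, 0.0)
# Delay line that varies independently of the logic.
uncorrelated = pass_probability(MU_DELAY, SIGMA_DELAY, MU_LOGIC, SIGMA_LOGIC, 0.0)
# Delay line that tracks the logic closely (rho = 0.9).
correlated = pass_probability(MU_DELAY, SIGMA_DELAY, MU_LOGIC, SIGMA_LOGIC, 0.9)

print(f"fixed clock:        {fixed_clock:.4f}")
print(f"uncorrelated delay: {uncorrelated:.4f}")
print(f"correlated delay:   {correlated:.4f}")
```

With these assumed numbers, the correlated delay line gives the highest pass probability and the uncorrelated one the lowest, which is the qualitative effect the hypothesis relies on. A real design has many paths and manufacturing constraints that this sketch ignores.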
Asset Metadata
Creator
Zhang, Yang (author)
Core Title
Production-level test issues in delay line based asynchronous designs
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Electrical Engineering
Publication Date
10/30/2018
Defense Date
10/10/2018
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
asynchronous, bundled-data, OAI-PMH Harvest, PVT, SPQL, TEST, yield
Format
application/pdf (imt)
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Beerel, Peter (committee chair), Gupta, Sandeep (committee member), Nakano, Aiichiro (committee member)
Creator Email
zhan808@usc.edu,zymynameqc@gmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c89-100198
Unique identifier
UC11675433
Identifier
etd-ZhangYang-6918.pdf (filename),usctheses-c89-100198 (legacy record id)
Legacy Identifier
etd-ZhangYang-6918.pdf
Dmrecord
100198
Document Type
Dissertation
Rights
Zhang, Yang
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA