A low-power high-speed single-ended parallel link using three-level differential encoding
A LOW-POWER HIGH-SPEED SINGLE-ENDED PARALLEL LINK USING THREE-LEVEL DIFFERENTIAL ENCODING

by Sotirios Zogopoulos

A Dissertation Presented to the FACULTY OF THE GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (ELECTRICAL ENGINEERING)

May 2007

Copyright 2007 Sotirios Zogopoulos

ACKNOWLEDGMENTS

This dissertation is the end of a hard and adventurous course that I took during the last five years at USC. I am grateful for meeting many inspiring and knowledgeable professors, colleagues, and friends. I cannot imagine myself succeeding without the support, the guidance and the encouragement of several people. For that, I would like to seize this opportunity to acknowledge all of them, especially my parents and my brother, who believed in me and supported me from the first day.

I would like to thank my advisor, Won Namgoong, who believed in me and encouraged me to pursue the Ph.D. title. He infused me with his knowledge and academic ethics and helped me in every step of my research in overcoming the theoretical and practical problems that I faced. His guidance played a major role in the successful completion of my research and my personal growth during those years.

My family deserves most of the recognition. I cannot thank them enough for their emotional and financial support, as well as the upbringing that I received from my earlier years. Even though they were a world away during those last five years, they were closer to me than ever. Not only did they carry the weight of stress and anxiety with me throughout the years, but they also rejoiced in the happiness of my achievements in every step that I took. I am extremely happy that they feel proud of me. There are no words that can describe my appreciation for them. I can only feel lucky that I have them.

I also would like to thank Prof. Robert A. Scholtz, who made me a member of his research group (ULTRALAB) and provided me with access to its valuable laboratory equipment. Without the infrastructure of his laboratory, I would never have completed the testing phase of my chips.

I especially thank Prof. Hossein Hashemi for letting me use his laboratory equipment to improve the performance of the designed system. Having access to his bond wiring machine and the probe station played a key role in the development of the testing boards. I also give thanks to two of his students, Jonathan Roderick and Harish Krishnaswamy, for the time that they spent answering my questions regarding CAD tool difficulties and the operation of the bond wiring machine.

Furthermore, I would like to thank my committee members Prof. Peter Beerel, Prof. Sandeep Gupta and Prof. Roger Zimmermann for their valuable comments during my qualification exam. Their guidance led me towards exploring the scaling potential of the present work, revealing new and very interesting findings.

Last, but certainly not least, I feel lucky to be surrounded by few but exceptional friends. Even though some of them are not in my field of interest, they all supported me in their own unique way. George Dimou inspired me seven years ago to pursue a graduate degree in the US. During the last five years, he helped me countless times with various research difficulties that I faced. He supported me as an engineer with his technical expertise and as a person with his friendship. I cannot imagine myself going through that course without having him by my side. Costas Xiouros and Panagiotis Galiotos were always next to me, strengthening my faith and keeping me going. They charged me with courage and optimism. My co-researcher and friend Ali Medi helped me with numerous technical difficulties, and he was always willing to spend hours upon hours with my projects, especially during the first few years.
I feel obliged to acknowledge those people that chose to be there for me. Thank you for being my friends.

Finally, I am grateful that the MOSIS Educational Program (MEP) gave me a free run that made the fabrication of my design a reality.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT
Chapter One: Introduction
Chapter Two: Background
2.1. Synchronous vs. Asynchronous links
2.2. Coding
2.3. Types of links
2.4. Design Metrics
2.5. Design Problems and Techniques
2.5.1. Power Supply Noise
2.5.2. Reference Ambiguity and Voltage Offset
2.5.3. Channel Termination
2.5.4. Inter-Symbol Interference
2.6. Time Multiplexing
2.7. Multiple Pulse Amplitude Modulation (M-PAM)
Chapter Three: Overall Architecture
3.1. The algorithm
3.2. Speed enhancement
3.3. Area and power efficiency
Chapter Four: Transceiver Architecture
4.1. Transmitter
4.2. Receiver
Chapter Five: Testing Method & Results
Chapter Six: Scaling of the Algorithm
6.1. The conventional multilevel differential coding
6.2. Differential signals over a plurality of conductors
6.3. Scaling tradeoffs and specifications
6.4. Scaling of the system
6.4.1. Scaling the transmitter
6.4.2. Scaling the receiver
6.4.3. Finding the optimum receiver / Computational load
6.4.4. Decreasing the computational load
6.5. Scaling cases
6.5.1. Using six conductors (W=6)
6.5.2. Using twelve conductors (W=12)
6.6. Scaling results
6.7. Speed and Power efficiency
Chapter Seven: Conclusion
7.1. Scaling
7.2. Future work
References

LIST OF TABLES

TABLE 1: PROPOSED ENCODING SCHEME
TABLE 2: COMPARISON TABLE
TABLE 3: OVERALL CIRCUIT PERFORMANCE SUMMARY
TABLE 4: SCALING EFFECTIVENESS OF THE CONVENTIONAL MULTILEVEL DIFFERENTIAL CODING
TABLE 5: SCALING OF THE ALGORITHM PRESENTED IN [16]
TABLE 6: THE 12 POSSIBLE SYMBOL COMBINATIONS OF THE PROPOSED CODING ALGORITHM
TABLE 7: NUMBER OF COMBINATIONS AS WE SCALE THE NUMBER OF CONDUCTORS
TABLE 8: SUB-SYMBOLS DECODED BY CONDUCTORS 1, 2, 3 AND 4 USING SUBSET A
TABLE 9: SUB-SYMBOLS DECODED BY CONDUCTORS 1, 2, 3 AND 4 USING SUBSET C
TABLE 10: SUB-SYMBOLS DECODED BY CONDUCTORS 1 AND 2 USING SUBSET COMP 2
TABLE 11: SUB-SYMBOLS DECODED BY CONDUCTORS 1, 2 AND 3 USING SUBSET D
TABLE 12: DECODING OF THE THREE 4-INPUT COMPARATORS
TABLE 13: ANALYTICALLY PRESENTING ALL THE 48 SYMBOL CASES THAT CAN BE CODED AND DECODED IN A 6 CONDUCTOR SYSTEM
TABLE 14: SCALING RESULTS FOR A SYSTEM 6*2^X CONDUCTORS
TABLE 15: SCALING RESULTS FOR A SYSTEM 5*2^X CONDUCTORS
TABLE 16: SCALING RESULTS FOR A SYSTEM 7*2^X CONDUCTORS
TABLE 17: SCALING RESULTS FOR A SYSTEM 8*2^X CONDUCTORS

LIST OF FIGURES

FIGURE 1: THE CONVENTIONAL AND PROPOSED ENCODED SINGLE-ENDED LINK ARCHITECTURE
FIGURE 2: LINK ARCHITECTURE
FIGURE 3: ASYNCHRONOUS COMMUNICATION LINK
FIGURE 4: TYPICAL PARALLEL LINK ARCHITECTURE
FIGURE 5: SERIAL LINK ARCHITECTURE
FIGURE 6: THE CURRENT DISSIPATION OF THE DRIVERS IS DATA DEPENDENT
FIGURE 7: INTER-SYMBOL INTERFERENCE
FIGURE 8: PULSE PRE-SHAPING AT THE TRANSMITTER
FIGURE 9: ADAPTIVE EQUALIZATION AT THE RECEIVER
FIGURE 10: MULTIPLE PULSE AMPLITUDE MODULATION (M-PAM)
FIGURE 11: 4-PULSE AMPLITUDE MODULATION ARCHITECTURE
FIGURE 12: DRIVER ARCHITECTURE AND CURRENT PATH
FIGURE 13: THE CURRENT PATH AT THE CONVENTIONAL AND THE PROPOSED DRIVER ARCHITECTURE
FIGURE 14: FOUR-WAY TIME-INTERLEAVING IN THE TRANSMITTER
FIGURE 15: THE PROGRAMMABLE TERMINATION RESISTOR
FIGURE 16: DETAIL ARCHITECTURE
FIGURE 17: FOUR-WAY TIME-INTERLEAVING IN THE RECEIVER
FIGURE 18: TWO INPUT COMPARATOR OF THE RECEIVER
FIGURE 19: FOUR INPUT COMPARATOR OF THE RECEIVER
FIGURE 20: CHIP MICROGRAPH
FIGURE 21: PLACING THE DIE ON BOARD
FIGURE 22: DIE FOOTPRINTS OF BOTH TRANSMITTER AND RECEIVER
FIGURE 23: TESTING SETUP
FIGURE 24: EYE DIAGRAM
FIGURE 25: POWER AND AREA BREAKDOWN
FIGURE 26: THREE BITS ARCHITECTURE OF A CONVENTIONAL MULTILEVEL DIFFERENTIAL CODING
FIGURE 27: SEVEN BITS ARCHITECTURE OF A CONVENTIONAL MULTILEVEL DIFFERENTIAL CODING
FIGURE 28: SCALING OF THE CONVENTIONAL MULTILEVEL DIFFERENTIAL CODING
FIGURE 29: DIFFERENTIAL SIGNALING OVER THREE CONDUCTORS. REGENERATED FIGURE FROM [16]
FIGURE 30: SCALING OF THE ALGORITHM PRESENTED IN [16]
FIGURE 31: COMPARING THE TWO MULTILEVEL DIFFERENTIAL CODING SCHEMES
FIGURE 32: SCALING EFFICIENCY AT THE TRANSMITTER. THE PLOT PRESENTS THE LOG2(SYMBOL-SET) WHICH REPRESENTS THE NUMBER OF BITS THAT CAN BE CODED FROM THE TRANSMITTER
FIGURE 33: BITS/PIN CODING EFFICIENCY AT THE TRANSMITTER DURING SCALING
FIGURE 34: THE TWO CURRENT PATHS AT THE TRANSMITTER SIDE OF A SYSTEM WITH SIX CONDUCTORS
FIGURE 35: POWER EFFICIENCY OF X-INPUT COMPARATORS (X>2)
FIGURE 36: COMPUTATIONAL COMPLEXITY OF A RECEIVER WITH SIX CONDUCTORS
FIGURE 37: W/2 2-INPUT COMPARATORS WITHOUT SHARING INPUTS
FIGURE 38: (W/2-1) 2-INPUT COMPARATORS WITHOUT SHARING INPUTS
FIGURE 39: W/2 2-INPUT COMPARATORS WITH ONE COMMON CONDUCTOR
FIGURE 40: GRAPHICAL REPRESENTATION OF SUBSET C
FIGURE 41: GRAPHICAL REPRESENTATION OF SUBSET D
FIGURE 42: GRAPHICAL REPRESENTATION OF SUBSET E
FIGURE 43: RECEIVER'S ARCHITECTURE USING SIX CONDUCTORS
FIGURE 44: RECEIVER'S ARCHITECTURE USING TWELVE CONDUCTORS
FIGURE 45: SCALING EFFICIENCY OF THE PROPOSED SYSTEM VS THE OLD ONES
FIGURE 46: POWER EFFICIENCY OF THE SCALING

ABSTRACT

As the lithography process scales, the throughput requirement for chip interconnections increases. Furthermore, the band-limited channel practically does not scale, and the power budget has become more important than ever. This tradeoff has raised the need for innovative designs. As a result, a plethora of new design techniques have been discussed extensively during the last decade. This challenge prompted the development of the presented work, which delivers a complete solution for systems that require low power, high speed and a high level of integration. The proposed work combines the advantages of serial and parallel link architectures to overcome known problems of both. Parallel links traditionally use binary coding. Their symbol rate is limited mainly by reference ambiguity and power supply noise. Serial links, on the other hand, spend two pins for each link. The proposed coding sends the data differentially among the pins without using a reference signal. In conjunction with a novel driver architecture, it keeps the power dissipation at the transceiver constant, decreasing the power supply noise. The transmitter recycles the current among the drivers to decrease power dissipation, while the receiver uses simple comparators to recover the data. To demonstrate the proposed coding scheme, a parallel link that codes three bits over four conductors was designed in a 0.18-μm CMOS process. It reaches a data rate of 4.2-Gb/s/pin, dissipates 17.1-mW/Gb/s and achieves 100-Gbps/mm². The efficiency of the coding increases further as it scales. By using only three signal levels, the system guarantees that the noise margin will remain the same.
Using just six conductors, the coding scheme sends 5.585 bits, increasing the I/O pin efficiency to 93%. The effective bit rate would be 5.2-Gb/s/pin. To further increase performance, twelve conductors can be used to code 0.97 bits per pin.

Chapter One: Introduction

As the CMOS process technology scales and the computational speed increases, system integration follows to deliver a powerful set of resources. Therefore, extensive parallelism and improved I/O data rates had to be developed to keep up with the continuously increasing data throughput requirements ([3], [10], [17], [19], [21] and [23]). Unfortunately, the packaging technology, as well as the transmission channel, which is formed by printed circuit board (PCB) traces, vias, connectors and bonding wires, does not scale the same way as the lithography process. Since the on-chip I/O pad dimensions barely decrease while the chip perimeter and area remain unchanged due to the high level of integration, it is impossible to improve throughput using extensive parallelism. The number of pads has increased to just a few hundred ([9] and [18]), while the number of transistors has increased to tens of millions. This has prompted extensive research during the last decade ([1], [2], [11], [12] and [20]) attempting to close the gap between the processing clock frequency and the I/O throughput by improving the data rate per pin. Time multiplexing, differential signaling, multiple pulse amplitude modulation (M-PAM) and channel equalization are some of the techniques used in serial link architectures, boosting the data rate to several Gb/s per link over an unscalable channel. Although the data rate in serial links is high, speed cannot be the only objective, since other requirements such as area utilization and power dissipation cannot be ignored in systems that require a high degree of integration and power efficiency.
Since high-speed chip-to-chip interconnections require integration of many links, several previous parallel link implementations tried to optimize for both speed and power efficiency. In [12] and [2], parallel links employing differential binary signaling that operate at 4-Gb/s per link while dissipating 22.5-mW/Gb/s and 32-mW/Gb/s, respectively, were reported. The use of differential signaling in parallel links, however, limits the I/O pin utilization to 50%, dropping the effective data rate to 2-Gb/s/pin. Furthermore, the signal bandwidth of the differential signal is twice as wide as in single-ended links that achieve the same data rate per pin, resulting in increased performance degradation due to inter-symbol interference (ISI). To achieve high data rates and power efficiency while minimizing the effects of ISI, a single-ended parallel link operating at 5.6-Gsymbol/s with 75% I/O pin utilization is presented. The corresponding data rate is 4.2-Gb/s/pin while dissipating 17.1-mW/Gb/s. This implementation, which codes three bits over four conductors, demonstrates the efficiency of the coding algorithm and the power saving capability of the driver’s architecture. The efficiency of the coding algorithm increases when more than four conductors are used. Scaling the coding scheme to six conductors, for example, increases the I/O bits-per-pin efficiency to 93%. Further scaling can approach 100% efficiency, which is often attained by parallel link architectures trading power for speed, as will be presented later.

Figure 1: The conventional and proposed encoded single-ended link architecture.
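
The efficiency and data-rate figures quoted above follow from simple arithmetic, which the Python sketch below reproduces. The function names are hypothetical. The sketch assumes that the 5.6 figure is the per-pin symbol rate in Gsymbol/s (an assumption, consistent with 5.6 × 0.75 = 4.2) and that the six-conductor case codes log2(48) = 5.585 bits, the 48 symbols being taken from the title of Table 13.

```python
import math

def bits_per_pin(bits_per_symbol: float, conductors: int) -> float:
    """I/O pin efficiency: coded bits divided by the number of conductors."""
    return bits_per_symbol / conductors

def data_rate_per_pin(symbol_rate_gsps: float, efficiency: float) -> float:
    """Effective data rate per pin in Gb/s for a given symbol rate."""
    return symbol_rate_gsps * efficiency

SYMBOL_RATE = 5.6  # Gsymbol/s; assumed meaning of the 5.6 figure in the text

# Implemented link: three bits coded over four conductors.
eff4 = bits_per_pin(3, 4)
# Six conductors: log2 of the 48 usable symbols (about 5.585 bits).
eff6 = bits_per_pin(math.log2(48), 6)

print(f"W=4: {eff4:.0%}, {data_rate_per_pin(SYMBOL_RATE, eff4):.1f} Gb/s/pin")
print(f"W=6: {eff6:.0%}, {data_rate_per_pin(SYMBOL_RATE, eff6):.1f} Gb/s/pin")
```

Under these assumptions the sketch gives 75% and 4.2 Gb/s/pin for four conductors, and 93% and 5.2 Gb/s/pin for six, matching the values stated in the abstract.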
The transmitter’s architecture [Figure 1] and the coding scheme, which is a significant improvement over [20], were designed with the area and power criteria of a system that requires a high level of integration in mind. The main feature of the transmitter’s drivers is that they share the same current path, reducing the power dissipation by 33% compared to a conventional driver. On the other end, three comparators are all that is needed for data recovery. The architectural simplicity of both transmitter and receiver results in an area-efficient system that reaches 100-Gbps/mm² in a 0.18-μm CMOS process.

In chapter two, the background of links is discussed. In chapter three, the coding scheme is presented along with its advantages in speed, area and power. In chapter four, the architectural details of the transceiver are described, and in chapter five, the results and the measurement process are presented. In the final chapter, the scaling efficiency of the coding algorithm is examined.

Chapter Two: Background

The purpose of data links is to transfer data between two points. They are developing rapidly to account for the increasing throughput requirements that become more pronounced as the lithography process advances. As shown in Figure 2, they consist of three main parts: a transmitter, a channel, and a receiver. The transmitter transforms the digital data sequence into an analog waveform and sends it over the channel, while the receiver samples the received analog waveform and converts it back to its original digital form. The transmitter can usually be separated into an encoder and a digital-to-analog converter (DAC), while the receiver usually consists of an analog-to-digital converter (ADC) and a decoder. The channel is formed by the electrical characteristics of the entire transmission path, including the printed circuit board (PCB) traces and vias, the bond wires, the package leads, the connectors, as well as any wire used for connectivity.
Figure 2: Link architecture.

2.1. Synchronous vs. Asynchronous links

In real-time systems, the data interconnection links can be synchronous or asynchronous. An asynchronous example is presented in Figure 3. The transmitter’s driver shapes its output according to the data value and informs the receiver using the ‘Valid’ signal (also known as the ‘Request’ signal). The receiver samples the value on the channel as soon as the valid signal reaches its end of the channel and sends an acknowledgment to the transmitter. When the transmitter receives the acknowledgment, it sets the valid signal to low and waits for the receiver to set its acknowledgment signal to low as well. The whole process repeats for each new set of data.

The drawback of this communication protocol is that the data rate is directly related to the propagation time of the channel. In the above example, each new bit takes more than three times the propagation time of the channel to be transmitted. For systems that are several inches apart, the propagation time is comparable with the rise and fall times of the system, and therefore the data rate becomes unacceptably low.

Figure 3: Asynchronous communication link.

To overcome that problem, a synchronous system uses neither the valid nor the acknowledgment signal, since the data that is transmitted over the channel changes on each clock cycle. If the internal clock frequency is fast enough, the transmitter does not have to wait until the data reaches the receiver. More than one symbol can exist on a physically long channel. The smallest clock cycle period that can be used is determined by the slew rate and the coding scheme of the transmitter.
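
The contrast between the two schemes can be put in numbers with a rough Python sketch. It takes the statement that each asynchronous transfer costs more than three channel propagation times as an upper bound on the handshaking rate, and counts how many symbols a synchronous link keeps in flight. The function names and the propagation time and clock period values are hypothetical, chosen only for illustration.

```python
def async_max_rate(t_prop_s: float, crossings_per_transfer: float = 3.0) -> float:
    """Upper bound on transfers per second when every transfer needs
    several propagation times of the channel (three, per the text)."""
    return 1.0 / (crossings_per_transfer * t_prop_s)

def sync_rate(clock_period_s: float) -> float:
    """Symbols per second of a synchronous link: one symbol per clock cycle."""
    return 1.0 / clock_period_s

def symbols_in_flight(t_prop_s: float, clock_period_s: float) -> float:
    """Number of symbols travelling on the channel at the same time."""
    return t_prop_s / clock_period_s

T_PROP = 1e-9     # hypothetical 1 ns propagation time
T_CLK = 250e-12   # hypothetical 250 ps clock period

print(f"asynchronous bound: {async_max_rate(T_PROP) / 1e6:.0f} Mtransfers/s")
print(f"synchronous rate:   {sync_rate(T_CLK) / 1e9:.1f} Gsymbol/s")
print(f"symbols in flight:  {symbols_in_flight(T_PROP, T_CLK):.1f}")
```

The asynchronous bound depends only on the channel, whereas the synchronous rate depends only on the clock period, which is why the latter keeps improving as the circuits get faster.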
As soon as the output driver of the transmitter shapes the appropriate pulse on the channel, the next data can be transmitted. At the receiver side, an identical clock is used to sample each one of the pulses, adjusting its rising edge to the center of the received symbol (traditionally, the pulses on the channel are called symbols). The request/acknowledge protocol of the asynchronous link, by contrast, is also known as handshaking.

2.2. Coding

Each symbol may represent more or less than one data bit, depending on the coding scheme. In differential coding, for example, two symbols, one from each output, are necessary to fully describe the transmitted bit. The effective contribution of each symbol per bit is 50%. On the other hand, in binary transmissions a single symbol fully describes the data transmitted, while in multiple pulse amplitude modulation (M-PAM) systems, as presented later, each pulse represents k bits, where 2^k = M, with M being the number of different signal levels on the channel.

The choice of the appropriate link depends on the application’s requirements. Systems that use binary or M-PAM coding suffer from power supply noise, crosstalk, voltage offset, and inter-symbol interference (ISI). Their advantages over systems that use differential coding are that they typically have better I/O pin utilization and are easily integrated due to their size and low power consumption. Moreover, M-PAM coding can be applied to boost their effective data rate. In differential signaling, on the other hand, the system suffers from less voltage noise and manages to reach better data rates. The disadvantage is that the increased data rate stretches the signal bandwidth to the point that it is comparable with the channel bandwidth, and equalization techniques are necessary for compensation. Consequently, the design complexity increases, and differential signaling becomes a weaker candidate for systems that require a high level of integration, due to its increased power consumption and area.

2.3. Types of links

Synchronous links can be separated into serial and parallel links. Their main differences lie in the synchronization scheme between the receiver and the transmitter, as well as in the coding scheme that is used to transmit the data over the channel. Parallel links, which have multiple outputs used for parallel data transmission, use an additional wire that carries a reference clock from the transmitter to the receiver for synchronization purposes. An additional wire carries a reference voltage level that is used by the receiver for data recovery. In serial links, on the other hand, the receiver recovers the timing information from the collected data waveform and self-synchronizes to the transmitter. In both cases, as the data rate increases, programmable delay modules are included in the system to make it skew tolerant; they set the rising edge of the clock at the receiver side to the appropriate time position within the symbol for data sampling.

Figure 4: Typical parallel link architecture.

In Figure 4, a typical single-ended parallel link architecture is presented ([1], [2], [11], [20] and [25]). There are multiple pairs of transmitters and receivers, but all of them share the same clock and reference signal. In this particular example, the transmitters control the clock reference, which makes the whole system source-synchronous. The data is binary coded, representing the digital ‘one’ and ‘zero’ with a high or low voltage level on the channel, respectively. An additional wire is used to provide a reference voltage level, which is used at the receiver side for data recovery.
A comparator samples the voltage difference between the received data and the reference signal, which operates as a threshold voltage between logical zero and one, on the rising edge of the clock. Since the output of the comparator is the recovered data, no additional decoding is necessary. (In single-ended architectures, each data sequence is coded over a single wire; binary signaling is also known as non-return-to-zero (NRZ) coding.) By removing the clock recovery system from the receiver and providing the reference signal from the transmitter, this architecture keeps the complexity of the receiver to a minimum. The I/O pin utilization efficiency reaches N/(N+2), where N is the number of data bits transmitted simultaneously through the link.

Figure 5: Serial link architecture.

In Figure 5, the top-level architecture of a conventional serial link is presented ([3], [5], [6], [8], [13] and [24]). There are two major differences compared with the parallel link architecture: the receiver uses a clock recovery system, and the data is coded in time as the voltage difference across two pins instead of using a single wire to binary code the data. At the receiver side, a simple comparator is used to compare the voltage difference across the two wires. The hardware complexity of the transmitter, as well as the area used, is bigger than that of the conventional parallel link architecture. Because two wires are used to form one link, the I/O pin efficiency drops to 50%. For that reason, serial links are usually used for systems that require a low level of integration but performance that reaches several Gbps per link.

Traditionally, serial links are faster than parallel links due to the differential coding of the data. Any possible offset variation at the output lines affects both outputs, but their voltage difference, which actually represents the data, remains unaffected.
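
The pin utilization of the two conventional architectures can be compared with a short Python sketch. It takes the utilization of the parallel link as data pins divided by total pins, N/(N+2), with the two extra pins being the shared clock and reference wires, so that the result stays below 100%. The function names and the chosen values of N are hypothetical.

```python
def parallel_pin_utilization(n_data_pins: int, overhead_pins: int = 2) -> float:
    """Data pins over total pins for a single-ended parallel link that
    shares one clock wire and one reference wire (two overhead pins)."""
    return n_data_pins / (n_data_pins + overhead_pins)

def serial_pin_utilization(pins_per_link: int = 2) -> float:
    """One data stream carried differentially over two wires."""
    return 1 / pins_per_link

for n in (4, 8, 16, 32):
    print(f"N={n:2d}: parallel {parallel_pin_utilization(n):.0%}, "
          f"serial {serial_pin_utilization():.0%}")
```

The overhead of the shared wires shrinks as N grows, while the differential link stays at 50% regardless of how many links are placed side by side.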
Due to the continuously increasing bit rate, the signal bandwidth becomes comparable to the channel bandwidth. At that point, the channel operates as a low-pass filter, distorting the signal waveform at the receiver side and consequently increasing the bit error rate (BER) of the system. To overcome these problems, extensive research during the last decade introduced several techniques, which are presented later.

2.4. Design Metrics

Over the years, as link specifications changed according to market needs, researchers developed several metrics to evaluate and compare different link architectures. Some of the most common metrics are: bit rate, transmission latency, bit error rate (BER), bit rate per pin, power per gigabit per second, area per gigabit per second and frequency range of operation. An optimum link would dissipate minimum power and use minimum area while having the highest bit rate per pin, a wide frequency window of operation and minimum latency. For systems that are optimized strictly to improve their data rate, like most serial links, several of the above metrics, such as the area per gigabit per second, become less important. In this thesis, we develop an innovative hybrid system that combines the advantages of serial and parallel links and tries to optimize for speed, power, and area, delivering an attractive solution for high-performance, low-cost systems that require a high level of integration.

Although the bit rate is the most common metric of the speed of a system, in this work a fairer metric is the bit rate per pin, since the I/O pin utilization differs between systems. In serial links the effective number of transmitted bits per pin is 0.5, since they use differential coding, whereas single-ended links that use multiple-amplitude pulse modulation may transmit 2 or even 3 bits/pin. Another important metric in this work is the power per gigabit per second.
To derive this, we measure the amount of power that is dissipated to transmit one gigabit of data per second at the highest possible bit rate. Another important metric, which degrades as the bit rate increases, is the bit error rate; it remains equally important for single-ended and serial links. Several successful systems today operate with a BER of 10^-12 or lower.

2.5. Design Problems and Techniques

Extensive research during the last years explored the landscape of links, exposing several design difficulties and barriers that are present in both serial and parallel links. Some of them are: reference ambiguity, inter-symbol interference, power supply noise and signal reflection. Several of them have been mitigated by developing new design techniques. The designer's challenge is to create innovative architectures that combine most of the techniques described below, or even to develop new ones, to boost the overall performance of the proposed link.

2.5.1. Power Supply Noise

The power supply noise of a system, also known as L*di/dt noise, is created by sudden current changes that occur within a very short period of time. The phenomenon becomes severe when the power supply source has a poor frequency response and when the switching frequency of the system increases. The power supply fluctuations are created by the inductive behavior of the power distribution network. This behavior is reduced when the current dissipation of the system experiences only minor or low-frequency changes. Such an approach, especially for systems with unpredictable input sequences, is closely tied to the architectural details of the system, and it is rarely obvious whether such a design is possible. In links, the current dissipation of the transmitter's drivers is data-dependent. Figure 6 presents an example of a conventional three-output single-ended parallel link. The current dissipation of the drivers when the transmitted data is '101' differs from that when it is '000'.
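To give a feeling for the magnitudes involved, the Python sketch below estimates the supply fluctuation V = L*di/dt when a conventional three-output link switches from '000' to '101'. All numbers are assumptions chosen for illustration and are not measurements from this work: each driver is assumed to draw 10 mA only while it transmits a '0', the supply path is assumed to have 2 nH of inductance, and the current is assumed to change within 100 ps.

```python
def supply_current(word, i_per_driver):
    """Total driver current for a data word.
    Assumption: a driver draws current only while transmitting '0'."""
    return i_per_driver * word.count("0")

def ldidt_noise(inductance, i_before, i_after, t_switch):
    """Magnitude of the supply fluctuation V = L * di/dt."""
    return inductance * abs(i_after - i_before) / t_switch

I_PER_DRIVER = 10e-3   # A, assumed current of one active driver
L_SUPPLY = 2e-9        # H, assumed bond-wire plus package-lead inductance
T_SWITCH = 100e-12     # s, assumed duration of the current change

i_000 = supply_current("000", I_PER_DRIVER)
i_101 = supply_current("101", I_PER_DRIVER)
noise = ldidt_noise(L_SUPPLY, i_000, i_101, T_SWITCH)
print(f"current step: {abs(i_101 - i_000) * 1e3:.0f} mA, "
      f"estimated noise: {noise * 1e3:.0f} mV")
```

With these assumed values the current step is 20 mA and the estimate is roughly 400 mV; if two consecutive words draw the same current, the estimate is zero.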
Figure 6: The current dissipation of the drivers is data dependent.

Another widely used technique to minimize the L*di/dt noise is to compensate for the inductive behavior of the power supply network by placing decoupling capacitors close to the modules that dissipate current. Capacitors with a small capacitance value, and therefore a fast frequency response, are usually placed close to the circuitry that causes the rapid current changes. As the distance from the circuitry increases, so does the value of the decoupling capacitors. This technique can be used effectively at the PCB level but does not completely resolve the problem. The power distribution network inside the chip is connected to the source through bond wires that are highly inductive, and consequently the phenomenon persists locally inside the chip. The bonding wires isolate the circuitry from the power supply on board. More, and better, power I/Os may reduce the phenomenon, but this solution is not practical considering the number of I/Os that are available.

2.5.2. Reference Ambiguity and Voltage Offset

Reference ambiguity becomes a hazard in single-ended parallel links. Power line fluctuations in both the transmitter and the receiver, crosstalk, and process mismatches may add offset and noise to the voltage level of the reference signal. Careful layout design of the chip and of the board may decrease crosstalk to some extent. For offset mismatches of the reference level, additional analog or digitally controlled modules can be used to calibrate both transmitter and receiver to operate with the correct reference level.

2.5.3. Channel Termination

Improper channel termination causes signal reflections at both ends of the channel. This affects the system performance in two different ways.
First, the reflection of the signal that reaches the receiver bounces back and is added to the next symbol, causing amplitude distortion. Second, the termination mismatches reduce the power delivered to the receiver, increasing the chance of an error. Reflections occur at both ends of the channel as well as at several points along the channel, such as vias, package leads and connectors. As the reflected signal travels, it gradually loses amplitude due to the channel resistance and eventually becomes negligible, usually after an unpredictable amount of time. In a system with poor channel termination, the transmitter is forced to wait for the channel to become a clean environment for proper transmission before retransmitting.

2.5.4. Inter-Symbol Interference

Inter-symbol interference (ISI) is a phenomenon that appears at high symbol rates, where the signal bandwidth and the channel bandwidth become comparable. The channel operates as a low-pass filter, distorting the shape of the transmitted symbol as presented in Figure 7. At the receiver side, the symbol becomes smaller and wider. When a symbol sequence is transmitted, the symbols overlap and data recovery becomes impossible.

Figure 7: Inter-symbol interference.

Figure 8: Pulse pre-shaping at the transmitter.

ISI can be treated using channel equalization techniques in both the transmitter and the receiver ([4], [5], [6], [7], [15] and [22]). At the transmitter the pulses are pre-shaped, as in Figure 8, removing the high frequencies from the signal. The received pulse has less amplitude but, due to its shape, it does not interfere with the neighboring symbols. The transmission sequence '101' and the equivalent received signal at the receiver are presented in Figure 8.

Figure 9: Adaptive equalization at the receiver.

At the receiver, ISI can be treated using adaptive equalization.
A high-resolution analog-to-digital converter (ADC) samples the distorted pulse at the right time. Several negative feedback loops (Figure 9) connected in series, each with an adjustable gain 'A', compensate for the ISI by removing the amplitude contribution of the neighboring symbols. The drawback of this technique is the power dissipation and the area of the high-resolution ADC, which is a vital component of this configuration.

Figure 10: Multiple pulse amplitude modulation (M-PAM). (Panel label: the phases for a 4-way time-interleaving example.)

2.6. Time Multiplexing

Using the previously presented techniques, the maximum symbol rate of high-end links has reached several GSymbols/sec (designs faster than 10 GSymbols/sec have been reported). Even though the lithography process scales fast, the internal clock frequency cannot reach the same rate. To compensate for the frequency gap, time multiplexing architectures have been designed. Figure 10 presents a 4-way time multiplexing architecture. Four drivers connected in parallel transmit over the same channel, each one transmitting in turn for 1/4th of the clock period. At the receiver, a similar architecture of four receivers connected in parallel samples the channel four times within one clock period, using four different phases of the same clock. This concept can be applied using more than four phases of the same clock. The disadvantage of such an approach, besides its area and power inefficiency, is the increasing output capacitance created by the diffusion at the drain of each driver. A high output capacitance results in a low slew rate, which directly limits the transmission symbol rate.

2.7. Multiple Pulse Amplitude Modulation (M-PAM)

As the symbol rate in links increases, the bit rate increases linearly (footnote 1) using techniques like channel equalization and time multiplexing.
To further improve the bit rate, which is what actually determines the performance of the system, multiple pulse amplitude modulation (M-PAM) signaling can be used [5] (Figure 11). The output driver transmits k bits with each transmitted symbol, using M output voltage levels, where 2^k = M. At the receiver, an ADC samples the channel and compares it with M-1 reference voltage levels for data recovery.

Figure 11: 4-pulse amplitude modulation architecture.

Footnote 1: The symbol rate and bit rate are not the same but are related linearly. The coding scheme that is used in each case determines their exact relation.

Chapter Three: Overall Architecture

3.1. The algorithm

Differential encoding has traditionally been used to transmit data by coding a bit of information as the voltage difference across two pins while using only two possible voltage levels. The concept of using multiple signal levels to differentially encode the data over multiple conductors is a relatively old idea ([14] and [16]). The present work is an improvement of this coding scheme, since it incorporates several innovations that decrease its power consumption and make it more efficient as it scales to more conductors while having the same noise margin.

Figure 12: Driver architecture and current path.

Through the proposed encoding scheme, the data is differentially transmitted across several pins using 3-level signaling, transmitting three bits of information via four pins. In Figure 12, nodes P0, P1, P2 and P3 represent the pins, while the set of signal levels in each pin is assumed to be {High, Center, Low}.
Table 1 presents the voltage level of the four output pins for all eight possible bit transmission cases. In each of those cases, exactly two pins are assigned to "Center", while the remaining two are at the "High" and the "Low" voltage level. Compared to a group of three traditional differential links, which would use two voltage levels and six pins, this coding scheme establishes three differential links using only four pins. Data D[0] and D[1] are coded differentially across the pins (P0, P1) and (P2, P3), respectively. Using 3-level coding, it is possible to transmit each of the first two bits in two possible ways. If, for example, D[0] is '0', then according to the coding algorithm P0 has to be at a higher voltage than P1. This can be accomplished either by setting P0 to "High" and P1 to "Center", or P0 to "Center" and P1 to "Low". Since one of the coding requirements is to have exactly one pin in each pair at the "Center" voltage level, P0 cannot be "High" while P1 is "Low". By choosing between those two ways to transmit the first two data bits, D[0] and D[1], the average voltage level of each pair of pins, (P0, P1) and (P2, P3), can change. Using the difference between the average voltages of the two pairs of pins, a third differential link transmits the data bit D[2].

Table 1: Proposed encoding scheme.

        Data to be transmitted (D[0:2])
Pins    000  001  010  011  100  101  110  111
P0       H    C    H    L    C    C    C    L
P1       C    L    C    C    H    L    H    C
P2       C    H    L    H    C    C    L    C
P3       L    C    C    C    L    H    C    H
*L = Low, C = Center, H = High

On the receiver side, three comparators are required to decode data D[0:2]. To recover D[0], the voltage of P0 is compared with that of P1. If VP0 > VP1, D[0] = 0; otherwise, D[0] = 1. Similarly, D[1] is recovered by comparing the voltages of P2 and P3. For D[2], the sum of the voltage values of P0 and P1 is compared with that of P2 and P3. Due to the nature of the coding scheme, no extra hardware is required to recover the data.

3.2. Speed enhancement

The increased I/O pin utilization improves the link performance by decreasing the ISI. Since the proposed coding scheme differentially transmits three data bits over four pins, the data rate of each pin reaches 75% of the symbol rate, compared to 50% for a conventional differential link. A traditional differential link with the same data rate per pin as the proposed scheme would suffer from greater ISI.

The reduced ISI of the proposed transceiver is not the only breakthrough. Through the integration of coding and an innovative driver architecture, the proposed transceiver eliminates two of the most speed-limiting factors in parallel links: reference ambiguity and power supply fluctuations at the output drivers [11]. These phenomena bound the achievable single-ended parallel link data rate to approximately 3-Gb/s/pin [1]. In the conventional parallel link architecture [Figure 1], a reference signal, driven by the transmitter, becomes the threshold voltage level at the receiver side for data recovery. Crosstalk and power noise at both ends of the link may introduce an error in the reference voltage level, making the reference ambiguous and increasing the bit error rate. By employing differential signaling, the present work does not require a reference signal, eliminating any potential speed degradation due to reference ambiguity.

Another speed-limiting factor in parallel links arises from the power line fluctuations [L*di/dt] at the drivers. As the transmitted data changes randomly with time, so does the current dissipation of the drivers. The non-ideal power rails cannot withstand sudden current changes, introducing voltage fluctuations that cause jitter and distortion of the transmitted data. L models the inductive effect of the bonding wires and the leads of the package, while 'di' stands for the variable current dissipation of the drivers.
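The coding rules of Section 3.1 can be made concrete with a short Python sketch. It is derived from the comparison rules stated in the text (a '0' puts the first pin of a pair above the second, and each pair holds exactly one "Center" pin), not transcribed from Table 1; for the inputs 011 and 101 its output corresponds to the table columns labelled 101 and 011 as printed here. The polarities chosen for D[1] and D[2] are assumptions, picked to agree with the 000 column, and the names `UPPER`, `LOWER` and `encode` are hypothetical.

```python
from itertools import product

UPPER = ("H", "C")   # level pair with the higher average
LOWER = ("C", "L")   # level pair with the lower average

def encode(d0, d1, d2):
    """Map three bits to the levels of (P0, P1, P2, P3) using the comparison rules."""
    # Assumption: D[2] = 0 means the (P0, P1) pair has the higher average.
    pair01, pair23 = (UPPER, LOWER) if d2 == 0 else (LOWER, UPPER)
    # Within a pair, a 0 puts the first pin above the second; a 1 swaps them.
    p0, p1 = pair01 if d0 == 0 else pair01[::-1]
    p2, p3 = pair23 if d1 == 0 else pair23[::-1]
    return p0, p1, p2, p3

codewords = {bits: encode(*bits) for bits in product((0, 1), repeat=3)}
for bits, pins in codewords.items():
    print(bits, pins)

# Every codeword uses one High, one Low and two Center levels, so the set of
# levels on the four pins is the same for every data word.
same_levels = all(sorted(p) == ["C", "C", "H", "L"] for p in codewords.values())
print("identical level set for every word:", same_levels)
```

Because every codeword contains the same set of levels, a driver whose current depends only on which levels are present would draw the same current for every data word, which is the property the constant-current argument relies on.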
The natural way to eliminate the power line switching noise is to create an ideal power source, either by minimizing L or by keeping the current dissipation of the drivers constant during operation. The inductive effect of each pin [L/pin] depends on the electrical characteristics of the bonding wires and the package leads. Therefore, in order to reduce the total inductance of the power source, more power supply pins have to be used, decreasing the I/O pin utilization of the whole system. Even in systems that use more power supply pins than usual, L*di/dt noise still arises from the local power rail tree of the drivers.

The proposed 3-level coding scheme and innovative driver architecture guarantee that the driver circuitry always consumes the same amount of current, forcing the 'di' factor to zero and therefore making the drivers immune to L*di/dt switching noise. The switching activity at the drivers remains the same for all eight possible data combinations, since the "High" and "Low" signal levels each appear exactly once and the sum of the signal levels of the four pins is constant (i.e. High + Center + Center + Low). As a result, the variation of the driving current becomes zero, eliminating the L*di/dt switching noise.

3.3. Area and power efficiency

Compared to other parallel link solutions, the present work delivers an area-efficient module that is ideal for extensive integration. Even though the coding algorithm may look complicated, in hardware it can be reduced to exactly four (footnote 1) AND gates per output pin, making the area and power overhead due to coding negligible.

Figure 13: The current path in the conventional and the proposed driver architecture.

The driver circuit in our design consists of both PMOS and NMOS transistors, as shown in Figure 13.
A resistor is added at each of the four outputs to provide a 50 Ohm source impedance, with its other terminal shorted to the resistors of the other drivers. Since the proposed encoding scheme ensures that one pin is 'High' and another 'Low' while the remaining two are 'Center', exactly one PMOS and one NMOS, in two different drivers, are enabled among the four drivers at any given time. As a result, a single current path through the two enabled transistors is all that is needed to drive the four output pins. In Figure 12, the current path at time stamp t2 is illustrated by the dotted line. Compared to a conventional 3-bit parallel link that uses pull-down drivers [Figure 1], the proposed transmitter dissipates 33% less current for the same voltage margin, assuming equiprobable zeros and ones.

Footnote 1: One AND gate for each one of the four phases.

Chapter Four: Transceiver Architecture

To validate the proposed concepts, a chip has been fabricated in the 0.18-µm TSMC process. Each chip hosts a pair of transceivers, and the drivers use a separate power supply so that their current consumption can be measured experimentally. An external clock frequency of 1.4-GHz has been used as a reference, while a programmable clock generation structure creates four clock phases internally. To compensate for the gap between the on-chip frequency and the I/O data rate, a four-way time-interleaved architecture is employed at each pin.

To simplify the testing process, a group of three pseudo-random binary sequence (PRBS) generators has been included for each one of the four phases in both the transmitter and the receiver. Each group generates a sequence of thirty-one 3-bit numbers, which are then mapped to 4-bit numbers through the proposed coding scheme. An active-low Reset signal initializes the pattern sequence transmission.
After initialization, a known and unique bit pattern that cannot be misinterpreted by the receiver as a data sequence establishes synchronization between the transmitter and the receiver. When the initialization sequence is detected at the receiver, PRBS generators identical to those in the transmitter create exactly the same sequence pattern and compare it to the received data. Several output signals at the receiver report possible errors at specific pins and phases of the transmission.

In addition, to account for pin-to-pin mismatches and timing skew, both of which become more pronounced at higher data rates, digital calibration circuits have been included throughout the design. Digital calibration occurs at the pre-drivers, which are controlled through digital tuning memories, by slightly delaying or advancing the control signals of the related driver and phase for each output pin separately.

4.1. Transmitter

The transmitter architecture is presented in Figure 14. Since four-way time-interleaving has been employed, a PRBS generator, an encoder and four output drivers are connected in series for each one of the four transmission phases. Every PRBS generator uses a separate 'Reset' signal to initialize its bit sequence. Each output pin has four drivers that are enabled in a time-sharing manner, each for 25% of the clock period. The four outputs are connected, through a programmable resistor that is used for impedance termination of the channel, to a common node that functions as an AC ground. Due to the proposed coding algorithm, at any given time only one PMOS and one NMOS can be turned on among all the drivers of the same phase. The current path forms across those two transistors and the two resistors in between.
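A sequence of thirty-one values is what a 5-bit maximal-length linear feedback shift register (LFSR) produces, so the test pattern generation can be sketched in Python as follows. The text does not state the polynomial, the seeds, or how the three generators of a group are combined; the feedback taps (stages 5 and 3), the seeds and the function names below are assumptions made only for illustration.

```python
def prbs5_states(seed):
    """Return the 31 successive states of a 5-bit Fibonacci LFSR.
    Assumption: feedback is the XOR of stages 5 and 3."""
    if not 0 < seed < 32:
        raise ValueError("seed must be a non-zero 5-bit value")
    states, state = [], seed
    for _ in range(31):
        states.append(state)
        feedback = ((state >> 4) ^ (state >> 2)) & 1
        state = ((state << 1) | feedback) & 0b11111
    return states

def prbs5_bits(seed):
    """One period (31 bits) of the sequence, taken from the newest register bit."""
    return [s & 1 for s in prbs5_states(seed)]

def word_sequence(seeds=(1, 2, 4)):
    """Thirty-one 3-bit words from three generators with different (assumed) seeds."""
    return list(zip(*(prbs5_bits(s) for s in seeds)))

words = word_sequence()
print(len(words), words[:4])
```

A receiver-side checker under the same assumptions would regenerate `word_sequence()` after synchronization and compare it word by word with the decoded data.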
Figure 14: Four-way time-interleaving in the transmitter.

Figure 15 shows the programmable resistor that is used for channel termination. It is formed by connecting three resistors in series. To control the total resistance value, pass gates are connected in parallel with the 30 and 50 Ohm resistors. When one of the pass gates is turned on, the resistance value decreases due to the parallel combination of the resistor itself and the resistive channel of the pass gate. The size of the pass gates and the value of the resistors pose a trade-off between the resistance range and the frequency response of the system. A big pass gate with a small channel resistance would be ideal for a large resistance range but has a poor frequency response due to its diffusion capacitance. On the other hand, a small pass gate has a great frequency response but a poor range. To get around this problem, a 25 Ohm resistor has been used to 'hide' the capacitive pass gates from the critical output node, improving the frequency response of the system. The final resistor has an effective resistance range from 35 to 105 Ohm and a -3dB bandwidth of 9.9GHz, which is sufficient compared to the signal bandwidth (footnote 1).

Figure 15: The programmable termination resistor.

Besides the wide range of resistance values, it is also crucial to improve the resolution of the programmable effective resistance to better match the channel. To achieve greater resolution, each pass gate was replaced by the parallel combination of two smaller ones.
Using four control signals and sizing one of the two parallel pass gates twice as big as the other, there are 16 possible resistor values, well distributed across the available resistance range. In laboratory experiments, the best impedance matching occurred with the control signals set to C[0:3]='0110', which corresponds to an effective total resistance of 51 Ohm.

Footnote 1: The symbol rate of the link is 5.6Gsps/pin.

Figure 16 presents a detailed diagram of the driver architecture. The offset voltage level at the output lines is determined by the DC level of the internal node that connects all four outputs through the termination resistors. That node behaves as an AC ground, while its DC voltage level can be controlled by the PMOS to NMOS width ratio at the drivers. Using a programmable memory, the pre-drivers set the appropriate transistor ratio and therefore change the output slew rate and the peak-to-peak voltage difference of the three possible signal levels. The final width of the driver transistors and the effective resistance value determine the single current path that is necessary to drive the four outputs, as mentioned above. Using a 96 μm wide PMOS and a 40 μm wide NMOS while setting the programmable resistance to 51 Ohm, the drivers dissipate 12.2mA. Since the current dissipation of the pre-drivers is linearly dependent on the transmission data rate, whereas the drivers maintain a constant current, the best power efficiency occurs at the highest rate.

Figure 16: Detailed architecture.

4.2. Receiver

The top-level receiver architecture is presented in Figure 17. Three comparators and a group of three pseudo-random binary sequence (PRBS) generators are used for each one of the four phases. All four inputs are connected through programmable resistors, as on the transmitter side, for impedance termination.
Since three comparators are dedicated to each one of the four phases, data de-multiplexing is not necessary. To simplify the testing process, the received data and the output of the PRBS generators are compared, and several output signals report possible errors.

Figure 17: Four-way time-interleaving in the receiver.

Figure 18 depicts the two-input comparator. It is used to decode data D[0] and D[1] by comparing the voltage levels between the pins (P0, P1) and (P2, P3), respectively. The clock signal configures the sampling point by enabling the differential amplifier. For data D[2], a four-input comparator has been designed [Figure 19] to compare the average voltage level of each one of the above pairs. No further decoding is required, since according to the coding algorithm the comparators also act as the decoder of the receiver. When the Clock signal is low, PMOS transistors M8, M9, M3 and M2 pre-charge the output nodes of the amplifier to Vdd, and NMOS M1 turns off the common-mode current of the differential amplifier. At the rising edge of the clock, the pre-charging phase ends and the output voltage nodes follow the voltage difference of the inputs. Due to the low sampling frequency [1.4-GHz] and the amplification gain, the output swing is big enough to directly drive a differential latch that uses a later phase of the same clock.

Figure 18: Two-input comparator of the receiver.

Figure 19: Four-input comparator of the receiver.
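The behaviour of the three comparators can be sketched in Python as a purely functional model, ignoring clocking and pre-charge. The rule for D[0] follows the text (VP0 > VP1 gives 0); the same polarity is assumed for D[1] and D[2], and the three voltage levels are hypothetical values chosen for illustration, not values from the fabricated chip.

```python
def decode(v0, v1, v2, v3):
    """Recover (D0, D1, D2) from the four pin voltages with three comparisons."""
    d0 = 0 if v0 > v1 else 1                  # two-input comparator on (P0, P1)
    d1 = 0 if v2 > v3 else 1                  # two-input comparator on (P2, P3), polarity assumed
    d2 = 0 if (v0 + v1) > (v2 + v3) else 1    # four-input comparator on the pair sums, polarity assumed
    return d0, d1, d2

HIGH, CENTER, LOW = 1.2, 0.9, 0.6   # volts, assumed for illustration only

received = (HIGH, CENTER, CENTER, LOW)          # the 000 column of Table 1
shifted = tuple(v + 0.25 for v in received)     # same symbol with a common offset on all pins

print(decode(*received), decode(*shifted))
```

Because each decision depends only on a voltage difference, an offset that is common to all four pins leaves the decoded bits unchanged, and no reference voltage appears anywhere in the model.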
Chapter Five: Testing Method & Results

The proposed transceiver was fabricated in a 0.18-μm CMOS process, donated by MOSIS. The chip micrograph is shown in Figure 20. Due to the dense mesh of the power rails, it is not possible to visually identify the transistor population of individual modules. At the right edge of the micrograph, the four output pins of the transmitters can be identified. They are purposely placed further apart to create extra area for the layout of the four output drivers of the corresponding phases. The compact layout of the four drivers reduces the capacitance at the output node by reducing the required interconnection. The chip size is 3.75 x 2.95 mm^2, and around its perimeter there are 82 I/O pads of 100 x 100 μm each. An ESD protection structure was placed in each low-frequency I/O pad.

Figure 20: Chip micrograph.

The packages available for the dimensions of the designed die posed a design problem due to their poor electrical characteristics. The inductance and capacitance of the leads form an extremely inductive channel that would limit the cutoff frequency of the low-pass channel to less than 10 GHz. Furthermore, the cavity size of the packages that had more than 82 pins was a lot bigger than the dimensions of the die. To route the signals from the edge of the die to the start of the lead, a lengthy and consequently extremely inductive bond wire would have had to be placed. Placing the die closer to the edge of the cavity to shorten the bond wires of the critical I/Os could not be the solution either. In that case, each of the pins would use either a longer wire or a differently shaped (arc) wire, resulting in a non-uniform channel among the pins. To get around this problem, the chip was placed directly on the printed circuit board (PCB), as shown in Figure 21, decreasing the length of the critical I/O bonding wires to approximately 1mm.
For that, the PCB had a gold finish to allow a direct connection of the bond wire to the copper traces. The PCB had four layers of copper and used a standard FR4 substrate. The two internal planes were used strictly for power distribution of the chip, minimizing the power supply noise on board. To further improve the power source of the chip, the connections between the power supply pads of the die and the board follow the shortest possible route. A series of wide vias was placed close to and around the perimeter of the die and connected to the internal power supply planes. That topology was also ideal for routing 82 bond wires in a very small area, considering that the pitch between the pads of the die is just 100μm, while the minimum clearance on the board was at least 7mils. A picture of the die footprints and their interconnection is presented in Figure 22.

Figure 21: Placing the die on board.

Figure 22: Die footprints of both transmitter and receiver.

To allow the measurement of the current dissipation at the transmitter's drivers, a set of islands of copper planes was placed just under the power supply pads of the drivers at the transmitter. The same tactic was used in the die, by separating the power supply rails of the drivers from those of the rest of the design. To further decrease the power supply noise of the board, several decoupling capacitors of different sizes were placed.

Both transmitter and receiver used an external clock frequency reference of 1.4GHz to create the four clock phases. Programmable delay chains were used to de-skew the four phases. For each output, four copies of the same driver were placed in parallel with each other. Each one of them was transmitting at the clock rate but only for one fourth of the clock period, boosting the effective symbol rate of each pin to 5.6-Gsymbol/s/pin. For the testing of the chip, two dies were placed on the same board, as shown in Figure 22.
In one of them only the transmitter was activated, and in the other only the receiver. The dies were placed as close to each other as possible to keep the size of the board to a minimum. That was a requirement, since the bond wiring machine used could not work with bigger boards. To connect the clock and some critical signals for testing purposes, horizontally placed SMA connectors were used to decrease reflections.

Figure 23: Testing setup. (Blocks shown: laptop, logic analyzer, interface board, testing board, digital sampling oscilloscope, signal generator, power supplies.)

For the testing of the chip, the topology presented in Figure 23 was used. Since the chip had approximately 1800 programmable parameters, a laptop running a script file programmed a logic analyzer, which was responsible for serially programming both chips. A custom-made interface board was used to decrease the voltage from 3.3 to 1.8 volts. The signal generator provided the clock to both the transmitter and the receiver, while a high-frequency high-impedance probe was used to display the eye diagram on the digital oscilloscope.

Figure 24 shows the output of the digital oscilloscope. The minimum eye opening is 100ps and 120mV. The presented symbol rate is 5.6-Gsymbol/s/pin. Since the proposed encoding scheme maps three data bits onto four pins, the effective data rate is 4.2-Gb/s/pin.

Figure 24: Eye diagram.

The drivers and pre-drivers dissipate 23.2-mW and 37.4-mW, respectively, while the receiver uses just 11.4mW. When transmitting at the highest bit rate, the dissipated power is 17.1-mW/Gb/s (including both transmitter and receiver), while extensive measurements at the receiver show that the BER is < 10^-14. In Table 2 the proposed architecture is compared with three previous parallel links.

Table 2: Comparison table

              Present work   Ref [12]   Ref [2]    Ref [1]
Power (mW)    60.6           90         128        112
Link mode     Single Ended   Differ.    Differ.    Single Ended
Process       0.18-μm        0.25-μm    0.13-μm    0.25-μm
Gb/s/pin      4.2            2          2          3
mW/Gb/s       17.1           22.5       32         37

Table 3 summarizes all the results. The proposed system was fabricated using a 0.18-μm TSMC process and the measured BER was below 10^-14. When the clock was higher than 1.4-GHz, the random generators were failing due to their parasitic capacitances and the BER was increasing rapidly. Without that problem, the performance of the system could probably have been even higher.

Table 3: Overall circuit performance summary

Technology                         TSMC1P6M 0.18-μm
Clock rate                         1.4 GHz
Data rate (per pin)                4.2 Gb/s
Symbol rate (per pin)              5.6 GS/s
BER                                < 10^-14
Area
  Total                            3 x 1.3 mm^2
  Drivers                          0.245 mm^2
  Encoders                         0.008 mm^2
Power
  Receiver (all comparators)       11.4 mW
  Transmitter (total of 4 pins)    60.6 mW
    Drivers                        23.2 mW (38%)
    Pre-drivers                    37.4 mW (62%)
  Total power per Gb/s             17.1 mW/Gb/s

Figure 25 presents the power and area breakdown for the critical parts of the system. The pre-drivers are responsible for approximately 50% of both the area and the power. That is due to the fact that they were designed with a fan out of four (FO4) to avoid failing at high frequencies due to the parasitic capacitances. We could not run post-layout simulations for the pre-drivers and the random generators due to the computational load of the task and the absence of the necessary resources. With careful layout and post-layout simulations, it could be possible to increase the FO of the pre-drivers to four or even higher, decreasing their current dissipation by at least 50%.

Figure 25: Power and area breakdown. (Power breakdown slices: 52%, 32.2%, 15.8%; area breakdown slices: 51.5%, 22%, 26.5%; categories: receiver, transmitter drivers, transmitter pre-drivers.)

Chapter Six: Scaling of the Algorithm

The effectiveness of the proposed system is based on its coding algorithm as well as on the architecture of the driver, which manages to recycle the current among the output pins.
In this chapter, we examine how the algorithm can be scaled to more than 3 bits, improving the performance of the system even further. In the present work, four wires are used to send three bits of information, resulting in 75% I/O pin efficiency. The system forms three serial links using differential coding over just four wires, with three possible voltage levels at the output and three comparators at the receiver side. Before proceeding to the scaling of the system, it is crucial to understand the system’s tradeoffs as we scale different parameters. The only requirements of this hybrid system are differential coding, to avoid reference ambiguity; constant current dissipation of the group of drivers, to eliminate the power supply noise; and current recycling among the drivers, for energy efficiency. For a deeper understanding of the design tradeoffs as the system scales to more pins, two similar coding schemes are described and scaled below.

6.1. The conventional multilevel differential coding

A similar and well-known architecture has been widely used (see footnote 1) to increase the number of data links per conductor. The simplest version of this system establishes one differential link using a conventional differential driver at the transmitter side and a comparator at the receiver. The system sends 0.5 data bits per conductor, while the noise margin is the maximum possible, since only two signal levels are used. Scaling the system to three data bits, three drivers and three comparators are used to establish three differential links over just four conductors, increasing the I/O pin efficiency to 75%. In Figure 26, the system architecture is presented. Two of the three differential drivers form two conventional differential links over the four conductors.
The third link is established by driving the common-mode voltage of each of the two differential pairs, using the third driver and four resistors at the transmitter side. At the receiver side, four resistors collect the common-mode voltage of each pair and a common comparator decodes the third data bit. In that system, there can be as few as three possible voltage levels over the channel, decreasing the noise margin to 50% of that of a conventional differential link. Ignoring for now the driver architecture and the channel termination, the system in Figure 26 and the current work have the same I/O pin efficiency and the same noise margin.

Footnote 1: The telephony industry uses this technique to support more clients using less wiring.

[Figure 26: Three-bit architecture of a conventional multilevel differential coding (data bits D0 to D2).]

[Figure 27: Seven-bit architecture of a conventional multilevel differential coding (data bits D0 to D6).]

The conventional way to scale this system even further is to increase the number of links per conductor. Figure 27 shows that by using an extra driver to control the common-mode voltage of two of the above systems, seven data bits can be differentially coded over just eight conductors using seven drivers and seven comparators, at the transmitter and receiver side respectively. The system’s I/O pin efficiency increases to 87.5%, while the noise margin decreases to 25% of that of a serial link. This is due to the fact that five signal levels are used in this case (Table 4). To further scale the system, two of the above systems are used in parallel while another driver establishes an extra differential link by controlling the common-mode voltage of each one of them. The resulting I/O pin efficiency increases to 93.75%, while the noise margin decreases to 12.5% of that of a conventional differential link.
By now, it is clear that as the system scales, it gains efficiency from the larger number of bits transmitted over each conductor but loses efficiency from its decreasing noise margin. As the noise margin decreases, the system becomes very sensitive to inter-symbol interference, and voltage noise and offset among the pins force the bit error rate to increase even further. It is very challenging to properly recover the data with an extremely low noise margin, even by decreasing the symbol rate. Another disadvantage of the above coding is that the voltage headroom decreases continuously as the process technology scales, which would decrease the system’s efficiency even further.

Table 4: Scaling effectiveness of the conventional multilevel differential coding

| #Data Bits | #Pins | #Voltage Levels | Bits/Symbol | Noise Margin | Bits/Symbol/(Noise Margin)^-1 |
|---|---|---|---|---|---|
| 3 | 4 | 3 | 0.750 | 0.500 | 0.375 |
| 7 | 8 | 5 | 0.875 | 0.250 | 0.219 |
| 15 | 16 | 9 | 0.938 | 0.125 | 0.117 |
| 31 | 32 | 17 | 0.969 | 0.063 | 0.061 |
| 63 | 64 | 33 | 0.984 | 0.031 | 0.031 |

Since there is a clear tradeoff between those two metrics, a more useful one is needed to estimate whether this type of scaling is effective. Since the noise margin decreases and the data rate per symbol increases at different rates, it makes more sense to use the data rate per symbol over (noise margin)^-1, that is, the product of the two, as the metric for estimating whether we benefit from increasing the number of possible voltage levels at the output. Table 4 summarizes the characteristics of the conventional multilevel differential coding as we scale the number of voltage levels that can be used. Figure 28 presents the Bits/Pin, the Noise Margin and the Bits/Pin/(Noise Margin)^-1, and clearly shows that the total improvement of the system decreases because the noise margin decreases at a greater rate than the data bits per pin improve.
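The rows of Table 4 follow a regular pattern that can be reproduced numerically. The sketch below is an illustration only: the closed forms (pins doubling at each stage, 2^n + 1 voltage levels, a noise margin of 1/(levels - 1)) are inferred from the table entries rather than stated in the text, and the function name `conventional_scaling` is hypothetical.

```python
def conventional_scaling(stages):
    """Rebuild the Table 4 rows for the conventional multilevel differential coding.

    Assumption (inferred from the table, not stated in the text): at stage n
    the link uses 2**(n+1) pins, carries one bit fewer than it has pins, and
    needs 2**n + 1 voltage levels, giving a noise margin of 1/(levels - 1).
    """
    rows = []
    for n in range(1, stages + 1):
        pins = 2 ** (n + 1)
        data_bits = pins - 1
        levels = 2 ** n + 1
        bits_per_pin = data_bits / pins
        noise_margin = 1 / (levels - 1)
        # Bits/(Noise Margin)^-1 is simply the product of the two quantities.
        metric = bits_per_pin * noise_margin
        rows.append((data_bits, pins, levels, bits_per_pin, noise_margin, metric))
    return rows


table4 = conventional_scaling(5)
for data_bits, pins, levels, bpp, nm, metric in table4:
    print(f"{data_bits:3d} {pins:3d} {levels:3d} {bpp:.3f} {nm:.3f} {metric:.3f}")
```

Under these assumptions the combined metric falls at every stage, which is the trend the text attributes to the noise margin shrinking faster than the bits per pin grow.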
[Figure 28: Scaling of the conventional multilevel differential coding. Plot of Bits/Pin, Noise Margin, and Bits/Pin/(Noise Margin)^-1 (axis “Efficiency”) against the number of pins (4 to 256).]

6.2. Differential signals over a plurality of conductors

It becomes clear from the scaling of the multilevel differential coding scheme that its decreasing noise margin dominates over its increasing I/O pin efficiency. Here, an issued patent [16], which motivated the present coding scheme, is briefly presented. It uses multiple conductors to differentially encode multiple data bits, increasing the I/O pin efficiency. The algorithm relies on the number of conductors used. The number of transmitters and receivers used is equal to the number of all the possible combinations that can be generated by taking two conductors at a time:

\[ C^{W}_{2} = \frac{W!}{(W-2)!\,2!} \]

where W is the number of conductors and 2 is the number of items taken at a time. At the receiver side, a resistor connects each conductor to every other one to transform the current flow into a voltage difference. So the number of transmitters, receivers, and resistors used is the same. Another important condition for successful data recovery is that the sum of the currents across all the conductors has to be zero. That condition reduces the number of symbol combinations that can be used across the link to fewer than \(2^{C^{W}_{2}}\). For example, for the case of four conductors, \(C^{4}_{2} = 6\) transmitters and receivers are used. The symbol combinations that can be generated are \(2^{6} = 64\), but only 24 of them satisfy the current condition. So by using four conductors, up to \(\log_2(24) = 4.585\) bits can be transmitted. In Figure 29 (see footnote 1), the system’s architecture is presented when the algorithm uses three signal levels and three conductors.
In this case, only six out of the eight possible combinations satisfy the current condition.

[Figure 29: Differential signaling over three conductors (transmitters Tx0 to Tx2, receivers Rx0 to Rx2, resistors Ra, Rb, Rc, currents Ia, Ib, Ic, 3 traces). Regenerated figure from [16].]

Footnote 1: This is a regeneration of Fig. 5 from US patent number 6,556,628 [16].

The main focus of this coding scheme is to increase the I/O pin efficiency even further by forming several differential links over the minimum possible number of conductors. The disadvantage of this coding algorithm is that the number of signal levels increases linearly with the number of conductors used, which decreases the noise margin. The output capacitance at the transmitter side also increases linearly, since W-1 drivers are connected in parallel to each conductor, where W is the total number of conductors used. The increasing capacitance at the output of the transmitter will decrease the slew rate of the drivers and therefore the symbol rate of the system as it scales. Considering other aspects, like power consumption and area, the present algorithm will have to use several logic gates for coding and decoding, before the transmitter and after the receiver respectively. The number of gates used will increase exponentially with the number of drivers. For the conventional multilevel differential coding, on the other hand, no coding or decoding is used. In Table 5 we summarize the characteristics of [16] as it scales to multiple conductors and signal levels.

Table 5: Scaling of the algorithm presented in [16]

| #Pins | #Voltage Levels | #Symbols | Data Bits | Bits/Pin | Noise Margin | Bits/Symbol/(Noise Margin)^-1 |
|---|---|---|---|---|---|---|
| 3 | 3 | 6 | 2.585 | 0.862 | 0.500 | 0.431 |
| 4 | 4 | 24 | 4.585 | 1.146 | 0.333 | 0.382 |
| 5 | 5 | 120 | 6.907 | 1.381 | 0.250 | 0.345 |
| 6 | 6 | 720 | 9.492 | 1.582 | 0.200 | 0.316 |
| 7 | 7 | 5040 | 12.299 | 1.757 | 0.167 | 0.293 |

Compared to the conventional multilevel differential signaling, the present scheme is more efficient since, as it scales, it transmits more than one data bit per conductor.
In contrast, the conventional multilevel differential signaling has an upper bound of one data bit per conductor, and its noise margin decreases faster. Another main advantage of the present algorithm is that it scales with the use of one extra conductor each time. On the other hand, the conventional multilevel differential coding has to double the number of conductors each time that it scales (Table 4), making it less effective for a system that has to use a limited number of conductors.

[Figure 30: Scaling of the algorithm presented in [16]. Plot of Bits/Pin, Noise Margin, and Bits/Pin/(Noise Margin)^-1 (axis “Efficiency”) against the number of pins (3 to 9).]

Since the number of signal levels increases linearly with scaling, the noise margin decreases. Using the same metric as with the previous multilevel differential algorithm, we can see from Figure 30 that the total performance of the system decreases, even though the Bits/Pin increases at a greater rate than in the previous scheme. For comparison, Figure 31 presents the Bits/Pin/(Noise Margin)^-1 of the two multilevel differential coding schemes. It is clear that the second scales better than the first, primarily because its data bits per pin can exceed one. The most important conclusion, on the other hand, is that neither of them performs better as it scales. The reason is that both coding algorithms rely on a continuously increasing number of signal levels.

[Figure 31: Comparing the two multilevel differential coding schemes. Plot of Bits/Pin/(Noise Margin)^-1 against the number of pins for the multilevel coding and the patent [16].]

6.3. Scaling tradeoffs and specifications

The scaling analysis of the above two coding algorithms, which share several characteristics with the proposed work, reveals some important tradeoffs that will be used to justify the chosen scaling technique.
It is clear by now that the number of signal levels should not increase with scaling, since a small noise margin will result in a higher BER due to increased ISI and the continuously decreasing voltage headroom. Even if the voltage headroom does not decrease as the lithography process scales, at high symbol rates the system cannot withstand a very small noise margin. To compensate for a continuously increasing number of signal levels, the I/O pad voltage could also rise to provide more voltage headroom, keeping the noise margin constant as the system scales. In such a case, however, the impact on the power efficiency would dominate the overall performance. One additional reason not to scale the number of signal levels is the increased ISI, which becomes more pronounced as the number of signal trajectories increases.

Another important factor that should be considered is the parasitic capacitance and the proper channel termination at each end of the channel. For example, the number of drivers that are connected to each conductor at the transmitter side should not increase unreasonably as the system scales. Both proper termination and low output/input capacitance are essential to achieve high slew rates and therefore increase the symbol rate.

For systems that require a high level of integration, power dissipation and area efficiency are also very important. As the number of conductors increases, so does the complexity of the coding/decoding, and therefore the power dissipation and the area of the system. The extra logic will increase the power, and the scaling can be successful only if the data rate grows faster than the power. The power growth, the number of signal levels, the output capacitance, and the symbol period of the system are some of the most important parameters that threaten the efficiency of the scaling.
The objective is to scale the existing algorithm so that the I/O pin efficiency improves at a greater rate than the above parameters grow, delivering a system that performs better in terms of speed and power dissipation. Considering the design tradeoffs that arose through the scaling analysis of the above two differential coding schemes, the scaling specifications of the proposed work are presented below:

1. Differential coding needs to be used to avoid reference signals and therefore eliminate the reference ambiguity problem.
2. The current dissipation across the drivers has to remain constant to minimize power supply noise (\(L\,\frac{di}{dt} \cong 0\)).
3. No more than three signal levels should be used.
4. Current recycling among the drivers, at the transmitter side, should be used to increase the power efficiency.
5. The number of drivers that connect to each conductor should be minimized to keep the output capacitance at the transmitter as low as possible.
6. Considering the power and area overhead due to coding, the complexity of the coder/decoder should be kept to the minimum possible.

6.4. Scaling of the system

Before proceeding to the analysis of the scaling tactic, a thorough examination of the coding and decoding algorithm reveals some very useful information for the development of the scaled system. In the proposed work, the coding scheme establishes three differential links over just four conductors with the use of three signal voltage levels. The algorithm uses a single AND gate per conductor to code the data at the transmitter, while no further decoding is necessary at the receiver side.

Table 6: The 12 possible symbol combinations of the proposed coding algorithm (H = High, C = Center, L = Low)

| Case | Pin 0 | Pin 1 | Pin 2 | Pin 3 |
|---|---|---|---|---|
| 1 | C | C | H | L |
| 2 | C | C | L | H |
| 3 | C | H | C | L |
| 4 | C | H | L | C |
| 5 | H | C | C | L |
| 6 | H | C | L | C |
| 7 | C | L | C | H |
| 8 | C | L | H | C |
| 9 | L | C | C | H |
| 10 | L | C | H | C |
| 11 | H | L | C | C |
| 12 | L | H | C | C |

Given the number of conductors, the number of signal levels and the coding algorithm, there can be up to twelve possible symbol combinations, as presented in Table 6.
From those twelve cases, only eight, which are highlighted, can be detected at the receiver. The remaining four cannot be decoded since, for each one of those cases, the two “Center” voltage levels are compared by a 2-input comparator. Besides that, the 4-input comparator, in each one of those four cases, compares the average voltage level of the two “Center” signal levels with the average voltage level of the “High” and the “Low” signal levels. As a result, the output of both comparators is just an amplified version of the channel noise and, with the present receiver architecture, those four cases are impossible to decode. So the performance of the system depends not only on the symbol combinations that can be generated at the transmitter, but also on the number of symbols that can be detected by the receiver. It will become clear shortly that understanding this phenomenon in depth is the backbone of the scaling process.

During the rest of the analysis, we will use the word “Symbol” to describe the set of “High”, “Low” and “Center” voltage levels of all the conductors of the system that are present during one transmission period. Also, let the number of conductors available be equal to W, and let the number of “High” signal levels, which is the same as the number of “Low” signal levels, be equal to K.

Besides the fact that the receiver architecture directly affects the overall performance of the system through the number of symbols that it can decode, an equally important characteristic is the number of symbols that can be generated by the transmitter. Although the number of “High”, “Center” and “Low” signal levels is constant among the symbols, a transmitter can generate many symbols by reordering them among the conductors. The exact number of symbol combinations is a function of the number of conductors (W) used, as well as of the number of “High” and “Low” signal levels (K) that are present among all the conductors.
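The eight-out-of-twelve result for four conductors can be checked with a small enumeration. The sketch below is only illustrative and rests on an assumption that is not spelled out in this passage: the receiver is modeled as two 2-input comparators on pins (0, 1) and (2, 3) plus one 4-input comparator that compares the average of pins 0 and 1 with the average of pins 2 and 3. The names `COMPARATORS`, `compare` and `decodable` are hypothetical, and the levels are idealized as +1, 0 and -1.

```python
from itertools import permutations

# Idealized voltage levels for High, Center and Low.
LEVEL = {"H": 1.0, "C": 0.0, "L": -1.0}

# Hypothetical pin pairing (an assumption, see the lead-in):
# two 2-input comparators and one 4-input comparator.
COMPARATORS = [((0,), (1,)), ((2,), (3,)), ((0, 1), (2, 3))]


def symbols(w=4, k=1):
    """All distinct orderings of K highs, K lows and W-2K centers."""
    return sorted(set(permutations("H" * k + "L" * k + "C" * (w - 2 * k))))


def compare(symbol, plus, minus):
    """Return 1 or 0, or None when both averaged inputs are equal (noise only)."""
    avg_plus = sum(LEVEL[symbol[i]] for i in plus) / len(plus)
    avg_minus = sum(LEVEL[symbol[i]] for i in minus) / len(minus)
    if avg_plus == avg_minus:
        return None
    return 1 if avg_plus > avg_minus else 0


def decodable(symbol_list, comparators):
    """Map each symbol whose comparator outputs are all well defined to its code word."""
    good = {}
    for s in symbol_list:
        word = tuple(compare(s, p, m) for p, m in comparators)
        if None not in word:
            good["".join(s)] = word
    return good


all_symbols = symbols()
good = decodable(all_symbols, COMPARATORS)
rejected = sorted("".join(s) for s in all_symbols if "".join(s) not in good)
print(len(all_symbols), len(good), rejected)
```

With this assumed pairing the rejected symbols are exactly those in which both “Center” levels land on the same 2-input comparator, and the eight remaining symbols produce eight distinct 3-bit words.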
For the rest of the analysis we name the set of all the symbols that can be generated by a transmitter the “Symbol-Set”. The size of the Symbol-Set is given by the formula:

\[ \text{Symbol-Set} = \frac{W!}{K!\,K!\,(W-2K)!} \]

It is the number of permutations of W elements divided by the permutations of each sub-group of identical elements within W. For example, in a system of six conductors and two “High” levels, the symbols “H1 C L H2 L C” and “H2 C L H1 L C” are the same symbol if “H1” and “H2” are equal. So the total number of permutations of six elements has to be divided by 2 to eliminate the duplicates that arise because there are two “High” signals in each symbol. The same has to be done for the “Low” as well as the “Center” signals. In Table 7, the size of the Symbol-Set for twelve different values of W is presented. It is obvious that for each of those cases, the selection of K dramatically affects the size of the Symbol-Set.

Table 7: Number of combinations as we scale the number of conductors

| W | K | S | K | S | K | S |
|---|---|---|---|---|---|---|
| 4 | 1 | 12 | | | | |
| 5 | 1 | 20 | 2 | 30 | | |
| 6 | 1 | 30 | 2 | 90 | | |
| 7 | 1 | 42 | 2 | 210 | 3 | 140 |
| 8 | 2 | 420 | 3 | 560 | 4 | 70 |
| 9 | 2 | 756 | 3 | 1680 | 4 | 630 |
| 10 | 2 | 1260 | 3 | 4200 | 4 | 3150 |
| 12 | 3 | 1.8E+04 | 4 | 3.4E+04 | 5 | 1.6E+04 |
| 16 | 4 | 9.0E+05 | 5 | 2.0E+06 | 6 | 1.6E+06 |
| 24 | 7 | 6.7E+09 | 8 | 9.4E+09 | 9 | 6.5E+09 |
| 30 | 9 | 4.2E+12 | 10 | 5.5E+12 | 11 | 4.1E+12 |
| 32 | 10 | 4.1E+13 | 11 | 4.5E+13 | 12 | 2.8E+13 |

S is the size of the “Symbol-Set”; W is the number of conductors; K is the number of “High” and “Low” signal levels in each symbol.

Table 7 shows the size of the Symbol-Set as we scale the system to more than four conductors. For each value of W, the Symbol-Set size is calculated for up to three values of K. The values of K that are highlighted are the ones that maximize the size of the Symbol-Set.
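The Symbol-Set formula is easy to evaluate and to cross-check by brute force for small W. The following is a minimal sketch, not the procedure used in the original work; the function names `symbol_set_size`, `enumerate_symbols` and `best_k` are hypothetical.

```python
from itertools import permutations
from math import factorial


def symbol_set_size(w, k):
    """Size of the Symbol-Set: W! / (K! * K! * (W - 2K)!)."""
    if k < 0 or 2 * k > w:
        raise ValueError("need 0 <= 2K <= W")
    return factorial(w) // (factorial(k) ** 2 * factorial(w - 2 * k))


def enumerate_symbols(w, k):
    """Brute-force cross-check: distinct orderings of K highs, K lows, W-2K centers."""
    levels = "H" * k + "L" * k + "C" * (w - 2 * k)
    return set(permutations(levels))


def best_k(w):
    """K in 1..W//2 that yields the largest Symbol-Set for W conductors."""
    return max(range(1, w // 2 + 1), key=lambda k: symbol_set_size(w, k))


for w in (4, 5, 6, 7, 8, 9, 10):
    sizes = {k: symbol_set_size(w, k) for k in range(1, w // 2 + 1)}
    print(w, sizes, "best K =", best_k(w))
```

The enumeration removes duplicates with a set, which is the same de-duplication that the division by K!, K! and (W-2K)! performs in the closed form, so the two counts agree wherever the brute-force version is still cheap enough to run.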
Although we cannot yet prove for all possible scaling cases that the performance of the system increases as we maximize the size of each Symbol-Set, it is intuitively reasonable and has been confirmed experimentally for all the cases of Table 7, as will be presented later. The more symbols a transmitter can generate, the more symbols a receiver is likely to decode. Besides that, as will be proven below, the size of any Symbol-Set reaches its maximum when \(K = W/3\). That corresponds to an equal number of “High”, “Center” and “Low” signal levels in each Symbol. For example, considering the phenomenon presented in Table 6, when the number of “High” and “Low” signals (K) dominates over the number of “Center” signals (W-2*K), or vice versa, a 2-input comparator will reject most of the Symbols, since there will be more cases where the voltage potential at both its inputs is the same.

Increasing the Symbol-Set: The Symbol-Set size is maximized when the denominator of the ratio \(\frac{W!}{K!\,K!\,(W-2K)!}\) is minimized. We claim that the denominator \(f(K) = K!\,K!\,(W-2K)!\) reaches its global minimum when \(K = W/3\). At \(K = W/3\) the denominator becomes:

\[ f\!\left(\tfrac{W}{3}\right) = \left(\tfrac{W}{3}\right)!\,\left(\tfrac{W}{3}\right)!\,\left(W - \tfrac{2W}{3}\right)! = \left[\left(\tfrac{W}{3}\right)!\right]^{3} \]

For all \(x \in \{1, 2, \ldots, \tfrac{W}{3}-1\}\):

\[ f\!\left(\tfrac{W}{3}-x\right) = \left(\tfrac{W}{3}-x\right)!\,\left(\tfrac{W}{3}-x\right)!\,\left(\tfrac{W}{3}+2x\right)! = \left[\left(\tfrac{W}{3}\right)!\right]^{3} \frac{\left(\tfrac{W}{3}+1\right)\left(\tfrac{W}{3}+2\right)\cdots\left(\tfrac{W}{3}+2x\right)}{\left[\tfrac{W}{3}\left(\tfrac{W}{3}-1\right)\cdots\left(\tfrac{W}{3}-x+1\right)\right]^{2}} > f\!\left(\tfrac{W}{3}\right) \]

since the numerator and the denominator of the fraction each contain 2x factors, every numerator factor is larger than W/3 and every denominator factor is at most W/3.

For all \(x \ge 1\) for which \(K = \tfrac{W}{3}+x\) is still valid (\(\tfrac{W}{3}-2x \ge 0\)):

\[ f\!\left(\tfrac{W}{3}+x\right) = \left(\tfrac{W}{3}+x\right)!\,\left(\tfrac{W}{3}+x\right)!\,\left(\tfrac{W}{3}-2x\right)! = \left[\left(\tfrac{W}{3}\right)!\right]^{3} \frac{\left[\left(\tfrac{W}{3}+1\right)\left(\tfrac{W}{3}+2\right)\cdots\left(\tfrac{W}{3}+x\right)\right]^{2}}{\tfrac{W}{3}\left(\tfrac{W}{3}-1\right)\left(\tfrac{W}{3}-2\right)\cdots\left(\tfrac{W}{3}-2x+1\right)} > f\!\left(\tfrac{W}{3}\right) \]

So, since \(f\!\left(\tfrac{W}{3}-x\right) > f\!\left(\tfrac{W}{3}\right)\) and \(f\!\left(\tfrac{W}{3}+x\right) > f\!\left(\tfrac{W}{3}\right)\), the denominator \(f(K)\) is minimized at \(K = W/3\) and therefore the Symbol-Set is maximized. In Figure 32 and Figure 33 the maximum coding capability of the transmitter is presented.
It becomes clear that as the number of conductors increases, the maximum number of symbols increases at an even higher rate. For example, with 30 conductors (W=30) more than 40 bits can be transmitted. Figure 33 presents the I/O pin efficiency of the transmitter. It clearly shows that by using six conductors (W=6, K=2), the number of symbols that are generated is greater than \(2^{6}\), and therefore the scaling breaks the barrier of one bit per pin and reaches an I/O pin efficiency of more than 100%. According to the scaling specifications, more than three signal levels cannot be used. Therefore, the graph of Figure 33 tracks the maximum possible coding efficiency that can be achieved without compromising the noise margin of the system. Although that is very promising, it is uncertain whether a receiver exists that can decode that many symbols. The scaling objective will be to design a receiver that can decode most of those symbols without compromising simplicity and power efficiency. For example, the encoder/decoder for a system of tens of wires will obviously consume an amount of power that cannot be neglected. Finding the optimum design for the receiver will be the most important, as well as the most challenging, task of the scaling process.

[Figure 32: Scaling efficiency at the transmitter. The plot presents log2(Symbol-Set), which represents the number of bits that can be coded at the transmitter, against the number of conductors (W).]

[Figure 33: Bits/pin coding efficiency at the transmitter during scaling, against the number of conductors (W).]

6.4.1. Scaling the transmitter

As the system scales, so does the transmitter.
Fortunately, the transmitter architecture can easily be adapted to apply the same fundamental architectural innovations presented earlier to more than four wires. Three of those characteristics are the absence of a reference signal, the constant current dissipation among all the drivers, and the current recycling of the transmitter. The absence of the reference signal is achieved by coding the data differentially among the pins. The current dissipation of the group of drivers is constant, which minimizes the power supply noise. That is guaranteed by the nature of the coding scheme, since all the Symbols of the Symbol-Set have the same number of “High” and “Low” signal levels (K). The number of “High” signals is the same as the number of “Low” signals, so the current recycling among the drivers is also achieved.

The transmitter will scale linearly with the number of conductors (W) and the number of “High” and “Low” signal levels used (K). For each conductor added during the scaling process, there will be a driver module similar to the ones already presented, connected in parallel with the rest of the transmitter design through a resistor for channel termination. The transmitter also depends linearly on K. The number of pre-drivers that need to be active is the same for every symbol of a given system, but it changes with scaling. According to the coding algorithm presented earlier, there are exactly K “High” and K “Low” signal levels. So there are 2*K pre-drivers enabled during the period of any Symbol. Half of them drive the PMOS sub-drivers during a symbol transmission, and the other half the NMOS sub-drivers. By increasing K, the power dissipation of the drivers will increase linearly, and the number of symbols that can be decoded at the receiver side will hopefully increase, improving the I/O pin efficiency.
For the scaling to be successful, the power overhead due to the extra current paths at the transmitter side should be compensated by the total throughput increase of the system. As we increase K beyond one, we no longer have a single current path across the group of drivers. Since K outputs are at the “High” voltage potential and K at the “Low”, the system can be seen as the superposition of K similar current paths across the group of drivers. In Figure 34, the transmitter architecture of a system with six conductors (W=6) and two current paths (K=2) is presented. In the left and middle sub-figures, the two current paths are represented by the dotted lines when the transmitted symbol is “HCHCLL”. For each of those two cases, an AC ground is formed between the drivers through the voltage divider that is formed across each current path. The third sub-figure of Figure 34 presents the superposition of both current paths across the group of drivers. Thus, we guarantee that the common node among the drivers will be an AC ground, which is a requirement for the correct operation of the system when K is greater than one.

[Figure 34: The two current paths at the transmitter side of a system with six conductors, for the symbol “HCHCLL”: the current path from pin 0 to pin 4, the current path from pin 2 to pin 5, and the superposition of both current paths, with the AC ground marked.]

During the scaling process the encoder of the transmitter will also be affected. With four conductors, the encoder of the transmitter consists of sixteen AND gates. There is a group of four at each of the outputs of the transmitter in order to implement four-way time interleaving. The number of inputs of each gate depends linearly on the number of bits that are coded by each transmitter. Since the encoder size increases linearly during the scaling, it is not expected to affect the power efficiency of the system.
On the other hand, at the receiver side, no decoder was necessary after the comparators for the case of four conductors. Unfortunately, that is not expected to be the case for more than four conductors. The extra hardware, due to decoding at the receiver side, is expected to increase the power dissipation of the system.

6.4.2. Scaling the receiver

The system that was designed, tested, and presented earlier establishes three differential links over four conductors. The design process started at the transmitter, by forming several differential links over the channel using more signal levels than usual. Unfortunately, it is impossible to boost the efficiency of the algorithm using more than three signal levels, due to the problems triggered by the decreased noise margin. The scaling will have to rely on a more sophisticated receiver, capable of decoding as many symbols as possible. For example, the receiver of the system with four wires can decode only eight out of the twelve possible symbols generated by the transmitter. If a receiver of a system with W conductors can decode more than \(2^{0.75W}\) symbols, which corresponds to the maximum efficiency of the system already built, the scaling will be successful. By swapping the voltage levels from one conductor to another, rather than using the common-mode voltage of the differential drivers to establish more differential links, we can take advantage of the continuously increasing number of symbol combinations and decode them through the simplest possible decoding scheme at the receiver side.

As the number of conductors increases, the number of possible receivers grows at a permutational rate, as will be shown shortly. Each possible receiver is a collection of comparators. Comparators with more than two inputs, for example four, must be used to decode more data from the same Symbol-Set, by comparing the average voltage level of one group of conductors with that of another.
Each of those comparators introduces a restriction that is related to the noise margin of the system. For comparison, a simple differential amplifier architecture is used. The two inputs are compared by controlling the current flowing through each NMOS. For a comparator that uses more than two inputs, a group of NMOS transistors is connected in parallel on each side of the differential amplifier. The inputs turn the NMOS transistors on or off to varying degrees, so that they share the biasing current among the branches as a function of the voltage level on the gate of each NMOS. In a two-input comparator, if the two inputs have the same voltage potential, each branch carries half the biasing current. A graphical representation of the currents for a 2-input comparator is presented in the left diagram of Figure 35. The output of the comparator will be amplified noise at the time of sampling.

[Figure 35: Power efficiency of X-input comparators (X>2). Labels recovered from the two diagrams: transistors M1 to M4 with inputs C, H, L, C; bias currents I_b and 2I_b; branch currents I_b/4 and I_b/2; “double the current”; “double the output capacitance”.]

On the other hand, in a four-input comparator, if one input signal on each side of the comparator has the same voltage potential, the output of the comparator will rely only on the other two inputs. In that case, assuming the same common mode for all the inputs, only one fourth of the biasing current flows through each NMOS. For the same biasing current as the 2-input comparator, the gain of the 4-input comparator is decreased. Since only half of the biasing current is used for amplification, the slew rate of comparators with different numbers of inputs will differ. This phenomenon scales linearly with the number of inputs. To overcome this problem, the biasing current of an x-input comparator should increase to \(\frac{x}{2} I_b\). In the right diagram of Figure 35 the four-input comparator is presented.
The biasing current has been increased to $2 \cdot I_b$ to compensate for the half of the current that is wasted through NMOS M1 and M4, which have the same input. So as the number of inputs per comparator increases, so does the biasing current. Unfortunately, the current inefficiency of the x-input comparators is not their only problem. As the number of inputs increases and the biasing current follows, so does the output capacitance of the amplifier. The comparator loads itself, decreasing its fan-out and consequently its slew rate.

Besides the scaling difficulties of each comparator at the receiver side, finding the optimum receiver as the system scales is a challenging process. Even though it is possible to find the optimum receiver for small values of W using brute-force search, it is a computationally exhaustive task for larger values of W, as described below. For that reason, it is preferable to reuse, at a larger scale, the architectural characteristics of the receiver of a system that has few wires. Before proceeding to the simulation results, the brute-force simulation flow is presented, as well as the complexity growth of the problem when it scales.

6.4.3. Finding the optimum receiver / Computational load

In this section, the simulation flow and the computational load of finding the optimum receiver as the system scales are presented.

• Let W be the number of conductors: $W \ge 3$
• Let K be the number of High and Low signal levels: $1 \le K \le \begin{cases} \frac{W-2}{2} & \text{if } W \text{ is even} \\ \frac{W-1}{2} & \text{if } W \text{ is odd} \end{cases}$
• The number of all the possible symbols is: $\text{Symbol-Set} = \frac{W!}{K!\,K!\,(W-2K)!}$

Since the receiver consists of a group of comparators, before proceeding further, it is necessary to list all the possible comparators. The comparators will compare the average value of two groups of signals. Each group can have from at least one to as many as $\frac{W}{2}$ signals.
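As a quick check of the symbol-count expression, the following minimal sketch evaluates $W!/(K!\,K!\,(W-2K)!)$ for the two cases mentioned in the text (four conductors with K=1 and six conductors with K=2). The function name `symbol_set_size` is a hypothetical helper introduced only for illustration, and the input check is a simplification (it only requires $2K \le W$).

```python
from math import factorial


def symbol_set_size(w: int, k: int) -> int:
    """Number of distinct symbols with k High, k Low and (w - 2k) Center levels.

    Hypothetical helper: evaluates W! / (K! * K! * (W - 2K)!).
    """
    if w < 3 or k < 1 or 2 * k > w:
        raise ValueError("expected w >= 3 and 1 <= k <= w / 2")
    return factorial(w) // (factorial(k) ** 2 * factorial(w - 2 * k))


if __name__ == "__main__":
    print(symbol_set_size(4, 1))  # four conductors, one High and one Low
    print(symbol_set_size(6, 2))  # six conductors, two High and two Low
```

The two values agree with the twelve symbols quoted for the four-conductor system and the 90 symbols quoted for the six-conductor system.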
The search objective is to find the smallest set of comparators that will form the most efficient receiver as the system scales. The number of all possible comparators varies with the number of conductors as well as with the number of inputs per comparator. Here we present the number of different 2-input, 4-input, 6-input and, in general, X-input comparators that could be part of the final solution. In what follows, $C^{W}_{X}$ denotes the number of combinations of W elements taken X at a time.

• The number of 2-input comparators is:
$\text{2-input} = C^{W}_{2} = \frac{W!}{(W-2)!\,2!}$

• For the 4-input comparators we need to find all the combinations of W elements taken four at a time. Using those four conductors, it is possible to design more than one 4-input comparator. For example, using the set (A, B, C and D) of conductors there can be three 4-input comparators: [(AB-CD), (AC-BD) and (AD-BC)]. To model this mathematically for the X-input comparators, we multiply the number of combinations of W elements taken X at a time by the number of combinations of X elements taken $\frac{X}{2}$ at a time, divided by two. The division by two is due to the fact that the comparator (AB-CD) is identical to (CD-AB). When we calculate the combinations of X elements taken $\frac{X}{2}$ at a time, for example with X=4, we get [(AB-CD), (CD-AB), (AC-BD), (BD-AC), (AD-BC) and (BC-AD)]. Clearly, half of them are redundant. So the number of 4-input comparators is:
$\text{4-input} = \frac{C^{W}_{4} \cdot C^{4}_{2}}{2} = \frac{W!}{(W-4)!\,4!} \cdot \frac{4!}{(4-2)!\,2!} \cdot \frac{1}{2}$
$\text{4-input} = \frac{C^{W}_{4} \cdot C^{4}_{2}}{2} = \frac{W!}{(W-4)!\,4!} \cdot 3$

• And the number of X-input comparators is:
$\text{X-input} = \frac{C^{W}_{X} \cdot C^{X}_{X/2}}{2} = \frac{W!}{(W-X)!\,X!} \cdot \frac{X!}{\left(X-\frac{X}{2}\right)!\,\left(\frac{X}{2}\right)!} \cdot \frac{1}{2}$
$\text{X-input} = \frac{C^{W}_{X} \cdot C^{X}_{X/2}}{2} = \frac{W!}{(W-X)!\,X!} \cdot \frac{X!}{2 \cdot \left(\left(\frac{X}{2}\right)!\right)^{2}}$
$\text{X-input} = \frac{C^{W}_{X} \cdot C^{X}_{X/2}}{2} = \frac{W!}{2 \cdot (W-X)!\,\left(\left(\frac{X}{2}\right)!\right)^{2}}$

By adding together all the possible comparators that could be part of a receiver, we get:

• All the possible comparators are:
$\text{Comparators} = C^{W}_{2} + \frac{C^{W}_{4} \cdot C^{4}_{2}}{2} + \frac{C^{W}_{6} \cdot C^{6}_{3}}{2} + \dots + \frac{C^{W}_{X} \cdot C^{X}_{X/2}}{2}$
$\text{Comparators} = \frac{W!}{(W-2)!\,2!} + \sum_{X=4}^{Z} \frac{W!}{2 \cdot (W-X)!\,\left(\left(\frac{X}{2}\right)!\right)^{2}}$
where $X \in \{4, 6, 8, 10, \dots, Z\}$, with $Z = W$ if W is even and $Z = W-1$ if W is odd.

So there can be many possible receivers that need to be evaluated as to the number of symbols that they can decode. The number of possible receiver architectures is not only a function of which comparators they include, but also of how many of them are included. For example, as will be presented later on, in a system of 6 conductors, the receiver that decodes the largest number of transmitted symbols contains six comparators. On the other hand, the best receiver for a system with 5 conductors contains 4 comparators, although there is another architecture that consists of 5 comparators and decodes the same number of symbols. Of course, between these two cases, the first is preferred due to its simplicity as well as its power and area efficiency. Although the number of comparators of the optimum receiver is unknown, there are maximum and minimum numbers of comparators that should be tested. The minimum number should be at least 0.75*W, since anything less than that cannot deliver a better performance than the coding algorithm for four conductors that was analytically presented earlier. The upper limit is determined by the size of the Symbol-Set that is generated. Since the maximum number of symbols that can be decoded is determined by the number of symbols that can be generated at the transmitter, the maximum value of R is $R_{max} = \log_2\left(\text{Symbol-Set}_W\right)$.

• The number of possible receiver architectures is:
$\sum_{R=0.75 \cdot W}^{R=\log_2(\text{Symbol-Set}_W)} C^{\text{Comparators}}_{R}$, where $C^{\text{Comparators}}_{R} = \frac{\text{Comparators}!}{(\text{Comparators}-R)!\,R!}$

It may not be obvious from the mathematical formula above, but the computational complexity of the brute-force search technique is huge, even for systems with very few conductors.
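The comparator-counting formula can be evaluated directly. The sketch below is illustrative only: the helper names `x_input_comparators` and `all_comparators` are hypothetical, and the code implements the expression $C^{W}_{X} \cdot C^{X}_{X/2} / 2$ exactly as written above (for X=2 it reduces to $C^{W}_{2}$).

```python
from math import comb


def x_input_comparators(w: int, x: int) -> int:
    """Count X-input comparators over W conductors: C(W, X) * C(X, X/2) / 2.

    Hypothetical helper; x must be even because each side of a comparator
    receives x/2 conductors.
    """
    if x < 2 or x % 2 or x > w:
        raise ValueError("x must be even with 2 <= x <= w")
    return comb(w, x) * comb(x, x // 2) // 2


def all_comparators(w: int) -> dict:
    """Map each even input count X (2, 4, ..., Z) to the number of comparators."""
    return {x: x_input_comparators(w, x) for x in range(2, w + 1, 2)}


if __name__ == "__main__":
    print(all_comparators(4))  # {2: 6, 4: 3}
    print(all_comparators(6))  # {2: 15, 4: 45, 6: 10}
```

For four conductors this reproduces the three 4-input comparators (AB-CD), (AC-BD) and (AD-BC). For six conductors the formula as written gives 10 six-input comparators, while the six-conductor example in the text counts twenty (80 comparators in total), which corresponds to counting both orderings of the 3-versus-3 split.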
According to the formula above, the number of all possible receiver architectures also increases at a permutational rate, since the number of available comparators is much bigger than the number of comparators used in each receiver ($\text{Comparators} \gg R$). For example, for a system with six conductors (W=6) and two “High” and two “Low” signal levels (K=2), 90 possible symbols can be generated from the transmitter (Symbol-Set size = 90). The receiver can consist of a group of 5 to 7 comparators for sampling/decoding. Considering that the system uses six conductors, the number of all possible comparators that can be formed is much bigger than that of a four-conductor system. There can be up to fifteen 2-input, forty five 4-input and twenty 6-input comparators, resulting in a set of 80 in total. So for a system with six conductors, all of the possible receiver topologies can be evaluated by finding all the combinations of 80 elements taken 5 to 7 at a time. As the system scales, the number of conductors increases linearly, but the number of all possible comparators and, consequently, the number of all possible receiver topologies increase at a permutational rate.

[Figure 36: Computational complexity of a receiver with six conductors. Horizontal axis: number of comparators at the receiver (4 to 7). Vertical axis (logarithmic, 1.E+00 to 1.E+10): possible receiver architectures.]

Using a brute-force search, each one of the possible receivers needs to be evaluated as to the number of symbols that it can decode. The curve in Figure 36 represents the number of all the possible receivers, using a set of 80 possible comparators to choose from. The number of receiver architectures changes as the number of comparators per receiver varies from 5 to 7. Notice that the vertical axis is in logarithmic scale, revealing the magnitude of the computational load of the brute-force approach.
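The growth plotted in Figure 36 can be reproduced with a few lines. This is a minimal sketch that takes the set size of 80 comparators and the range of 5 to 7 comparators per receiver directly from the text; the function name `receiver_architectures` is a hypothetical helper.

```python
from math import comb

COMPARATORS = 80  # size of the comparator set quoted in the text for W = 6


def receiver_architectures(n_comparators: int, sizes: range) -> dict:
    """Number of ways to pick R comparators out of the available set, per R."""
    return {r: comb(n_comparators, r) for r in sizes}


if __name__ == "__main__":
    counts = receiver_architectures(COMPARATORS, range(5, 8))
    for r, n in counts.items():
        print(f"R={r}: {n:,} candidate receivers")
    print(f"total: {sum(counts.values()):,}")
```

Each extra comparator multiplies the number of candidates by roughly an order of magnitude, which is what the logarithmic axis of the figure conveys.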
For example, using six comparators at the receiver (R=6), there are more than 300 million cases to be evaluated to find the ideal receiver architecture. Evaluating about 300 million cases for a six-conductor system is not an efficient approach, especially since the computational load increases at a permutational rate as the system scales. To sufficiently decrease the search time, the computational load has to be minimized by developing a good intuition of the optimum receiver’s architecture. By concentrating the search on a subset of comparators that is more likely to contain the comparators of the ideal receiver, the search time is expected to decrease dramatically.

6.4.4. Decreasing the computational load

Here we present the methodology that decreases the computational load of the brute-force search. This methodology concentrates the search among the subsets of comparators that perform well. A subset of comparators performs well when the number of symbols that it can decode is maximized. A symbol is successfully decoded when each comparator at the receiver has a voltage difference between its inputs or sets of inputs¹ that is equal to or greater than the predefined noise margin $\frac{V_{pp}}{2}$. While evaluating a receiver topology, each one of the comparators needs to satisfy that requirement for all the symbols that need to be decoded. Let A be a subset of symbols for which a collection of comparators can satisfy all its noise margin requirements. When another comparator is added to the group, some of the symbols of subset A may not satisfy its noise margin requirement. So the number of symbols that can be detected by the resulting architecture can only decrease or, in the best case, stay the same. To decrease the number of potential receiver architectures that need to be evaluated, we need to develop a metric to decide whether a subset of comparators should be ignored² or not.
The metric will evaluate the efficiency of a subset of comparators as a group, according to the number of symbols that satisfy the noise margin requirements of the subset. The fact that a subset of comparators does not perform well does not imply that each one of the comparators in the subset should be ignored. It is the combination of the specific group of comparators that cannot detect most of the symbols. The efficiency of a subset of comparators relies not only on the kind of comparators used but, most importantly, on the relation between them. Comparators that share the same input can decode fewer symbols. Furthermore, a subset of comparators with poor performance will not be part of the optimum receiver. For example, in a system with W conductors, where W>4, the six 2-input comparators [(1◦2), (1◦3), (1◦4), (2◦3), (2◦4), (3◦4)] obviously cannot coexist in the same receiver. Due to the noise margin requirements, this system has the following restrictions:

1. V(1) ≠ V(2), V(3), V(4)
2. V(2) ≠ V(3), V(4)
3. V(3) ≠ V(4)

According to these rules, the four voltages must all be different from each other, so no symbol that uses three signal levels can satisfy all of the above. Therefore, there is no point in evaluating any receiver architecture that includes this subset of comparators. Even though this is an extreme case that can very easily be excluded from evaluation, there are collections of comparators, as presented below, that can decode fewer symbols than other ones, mostly because the comparators share the same signals. In the computational analysis that was previously presented, all possible comparator configurations had to be evaluated.

¹ A thorough analysis of the comparators that have more than 2 inputs, as well as the symbols that can be detected by them, is presented later on.
² Note that the combination of comparators will be ignored rather than the comparators themselves.
To decrease the search time, only the subsets of comparators that can decode the most symbols will be considered. The receiver architectures that contain any other subset will be ignored. As a result, the number of receiver architectures that need to be evaluated decreases significantly. In the next few pages, a heuristic approach is presented that intends to minimize the search time by ignoring subsets that we know perform poorly.

For a deeper understanding of how the number and kind of comparators affect the number of symbols that can be decoded, we consider the following general case. When a receiver is formed by a single comparator, the system decodes one bit. The output of the comparator will be either zero or one. At this point, the Symbol-Set can be separated into two big subsets, A and B ($\text{Symbol-Set} = A \cup B$). The first subset, A, includes all the symbols that cannot satisfy the noise margin requirement of the comparator and therefore cannot be used by the system. The remaining symbols form the second subset, $B = \text{Symbol-Set} \setminus A$. For some of the symbols in subset B, the output of the comparator will be one, while for the rest it will be zero. Another comparator can be included in the receiver architecture to decode more bits. Of course, the second comparator has to be able to retrieve the next bit using symbols only from subset B of the Symbol-Set. Subset B will be separated into two smaller subsets, C and D. C will include all of the symbols that cannot be detected by the second comparator, while the rest of them form subset D. Similarly, for different symbols in D, the output of the two comparators can be “00”, “01”, “10” or “11”. The addition of more comparators most probably decreases the number of symbols that can be detected by the group but, on the other hand, increases the number of unique symbols that can be distinguished and therefore the number of bits that can be decoded.
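The narrowing process described above can be traced numerically. The sketch below is a simplified illustration, not the simulation flow used in this work: it assumes numeric levels L=0, C=1, H=2, treats a comparator as satisfied whenever the sums of its two input groups differ (ignoring the exact noise margin threshold), and applies the six comparators of the six-conductor receiver of Section 6.5.1 one at a time. The names `symbol_set`, `compare`, `narrow` and `RECEIVER` are hypothetical.

```python
from itertools import permutations

LEVEL = {"L": 0, "C": 1, "H": 2}  # assumed numeric stand-ins for the three levels

# Comparators as (left group, right group) of 1-based conductor indices.
RECEIVER = [
    ((1,), (2,)), ((3,), (4,)), ((5,), (6,)),
    ((1, 2), (3, 4)), ((1, 2), (5, 6)), ((3, 4), (5, 6)),
]


def symbol_set(w, k):
    """All distinct symbols with k High, k Low and (w - 2k) Center levels."""
    base = "H" * k + "L" * k + "C" * (w - 2 * k)
    return sorted(set(permutations(base)))


def compare(symbol, left, right):
    """Return 1 or 0 when the group sums differ, None when they tie."""
    a = sum(LEVEL[symbol[i - 1]] for i in left)
    b = sum(LEVEL[symbol[i - 1]] for i in right)
    return None if a == b else int(a > b)


def narrow(symbols, comparators):
    """Add comparators one by one; record (surviving symbols, distinct output codes)."""
    history = []
    survivors = list(symbols)
    for n in range(1, len(comparators) + 1):
        active = comparators[:n]
        # Drop the symbols that the newly added comparator cannot resolve.
        survivors = [s for s in survivors if compare(s, *active[-1]) is not None]
        codes = {tuple(compare(s, l, r) for l, r in active) for s in survivors}
        history.append((len(survivors), len(codes)))
    return history


if __name__ == "__main__":
    for step, (alive, codes) in enumerate(narrow(symbol_set(6, 2), RECEIVER), 1):
        print(f"{step} comparators: {alive} symbols satisfy all, {codes} distinct outputs")
```

Under these assumptions the surviving set shrinks from 90 symbols to 48 while the number of distinguishable outputs grows with every comparator, until both meet at 48.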
The same process repeats until the point where the addition of another comparator into the receiver would decrease the number of unique symbols that can be decoded at the receiver. That will occur, for example, when none of the symbols in D would be able to satisfy the noise margin requirements of a third comparator. Of course, that behavior depends not only on how many comparators are used, but also on which ones are used in each receiver.

It becomes clear by now that as we add more comparators into a receiver, the number of symbols for which the receiver can satisfy the noise margin restrictions decreases. On the other hand, more comparators are necessary to decode more bits and increase the system’s efficiency. These two opposing effects raise a design tradeoff for the receiver architecture. The optimum receiver will eventually maximize its efficiency by decoding the maximum number of bits. That will occur only if the intersection of the symbols that can be decoded by each one of the receiver’s comparators reaches its global maximum. To express that mathematically, let R be the number of all the comparators in a receiver. Also let $\text{Symbol}_X$ be the set of all the symbols that satisfy the noise margin requirements of comparator X. The objective is to find the subset of comparators for which the set $\text{Symbol}_1 \cap \text{Symbol}_2 \cap \dots \cap \text{Symbol}_R$ is maximized. This task, even for systems with a small number of conductors, is very complicated.

To solve this problem, we use the divide and conquer technique. We divide the set of comparators into several smaller subsets and deal with each one of them separately. The subsets are organized according to the number of inputs of the comparators. So in the case of a six-conductor system, there are subgroups of 2-input, 4-input and 6-input comparators. For each one of those subgroups there are one or more subsets that can decode most of the symbols.
In particular, as will be proved later, there is exactly one subset of 2-input comparators that can decode most of the symbols. For the subsets of comparators with four or more inputs, there can similarly be found one or a few ideal subsets that can decode most of the symbols. An ideal subset is one that could be part of the final solution. In this case, if there is an x-input subset A that can decode the symbol set $S_A$ while there is another subset B that can decode the symbol set $S_B$, then subset A can be neglected only if $S_A \subseteq S_B$. After finding the ideal subsets of 2-input, 4-input etc. comparators, the task will be to find the subsets whose symbol set intersection is the biggest. The objective is to find the group of 2-input comparators that can decode most of the symbols, thus creating a new symbol set for the next stage. The next step is to find the set of 4-input comparators that can decode the most symbols of the new symbol set, and so forth. For the 2-input comparator case, it is proved below that there is one ideal subset of 2-input comparators that is part of the optimum receiver. On the other hand, this is not the case for the subset of 4-input comparators. There can be many subsets of 4-input comparators that can potentially be part of the ideal solution. All of them have to be evaluated separately. The same holds for any subset of comparators that have more than two inputs. Finding the optimum receiver requires the evaluation of every single architecture that is a combination of the subsets of x-input comparators explained earlier, where $x \in [2, 4, 6, \dots, W]$.

Finding the optimum subset of 2-input comparators:

The optimum subset of 2-input comparators can partially decode the greatest number of symbols out of the Symbol-Set. Partially decode means that the two-input comparators capture part of the transmitted data, just like the first least significant bits (LSB) of an 8-bit number.
Below, we prove that in a general case system, having an equal number of “High”, “Center” and “Low” signal levels ($W - 2 \cdot K = K = \frac{W}{3}$), there is one subset of $\frac{W}{2}$ 2-input comparators which maximizes the number of symbols that can be partially decoded. The complete list of the 2-input comparators of a general case system with W conductors and $K = \frac{W}{3}$ is:

(1 ◦ 2), (1 ◦ 3), (1 ◦ 4), (1 ◦ 5) … (1 ◦ (W-1)), (1 ◦ W)
(2 ◦ 3), (2 ◦ 4), (2 ◦ 5) … (2 ◦ (W-1)), (2 ◦ W)
(3 ◦ 4), (3 ◦ 5) … (3 ◦ (W-1)), (3 ◦ W)
(4 ◦ 5) … (4 ◦ (W-1)), (4 ◦ W)
…
((W-2) ◦ (W-1)), ((W-2) ◦ W)
((W-1) ◦ W)

Note that if we swap two of the conductors, the symbol-set does not change, since it covers all the possible combinations of symbols. Also, the way that the 2-input comparators are picked does not affect the performance of the system, since all the possible combinations are already included. It can be shown that the subset of 2-input comparators $\text{Comp}_2$ = [(1 ◦ 2), (3 ◦ 4), …, ((W-1) ◦ W)] is part of the ideal receiver. A very important characteristic of this set of comparators is that none of them share an input with one another. The advantage of this collection of 2-input comparators is that it exploits the signal levels from all the conductors, using the smallest number of comparators while having the fewest possible restrictions. Any other subset of 2-input comparators will either have more restrictions that need to be satisfied for each symbol, or will neglect some of the signal levels carried over some of the conductors. This will limit the decoding efficiency of the subset of comparators, as is proven here:

• Considering the subset $\text{Comp}_2$ = [(1 ◦ 2), (3 ◦ 4), …, ((W-1) ◦ W)], each of the conductors has been exclusively assigned to a single input of one comparator. To successfully decode a symbol, there are $\frac{W}{2}$ requirements that need to be satisfied for each one of the decoded symbols [V(P1) ≠ V(P2), V(P3) ≠ V(P4), …, V(P(W-1)) ≠ V(PW)].
A graphical representation of that subset of comparators is presented in Figure 37.

[Figure 37: W/2 2-input comparators without sharing inputs]

• To prove that subset $\text{Comp}_2$ partially decodes the greatest number of symbols, we evaluate and compare every other possible subset of 2-input comparators that can be generated by gradually modifying the contents of subset $\text{Comp}_2$. There are five possible cases that need to be evaluated.

o The first subset can be generated from subset $\text{Comp}_2$ by removing one of its 2-input comparators: A = [(1 ◦ 2), (5 ◦ 6), …, ((W-1) ◦ W)]. In subset A, comparator (3 ◦ 4) is missing; its graphical representation is shown in Figure 38. As a result, there are $\left(\frac{W}{2} - 1\right)$ restrictions that need to be satisfied for each of the symbols: V(P1) ≠ V(P2), V(P5) ≠ V(P6), …, V(P(W-1)) ≠ V(PW). Even though there is one less restriction compared to subset $\text{Comp}_2$, the number of symbols that can be decoded is smaller, since the signals carried by conductors 3 and 4 are neglected.

[Figure 38: (W/2-1) 2-input comparators without sharing inputs]

o To improve the efficiency of the receiver, one additional comparator can be added to subset A, comparing the signal levels between conductors 2 and 3. That will result in subset B = [(1 ◦ 2), (2 ◦ 3), (5 ◦ 6), …, ((W-1) ◦ W)], which is presented graphically in Figure 39. In this case, there are $\frac{W}{2}$ restrictions: V(P1) ≠ V(P2), V(P2) ≠ V(P3), …, V(P(W-1)) ≠ V(PW). Again, this subset decodes fewer symbols than subset $\text{Comp}_2$: although both have the same number of restrictions, subset B neglects the signals carried by conductor 4. More analytically, Table 8 shows that there are only 12 sub-symbols that can be decoded using the conductors 1, 2, 3 and 4 and the comparators of subset B, compared to the $6^2 = 36$ sub-symbols that can be decoded using subset $\text{Comp}_2$.
Table 8: Sub-symbols decoded by conductors 1, 2, 3 and 4 using subset B

Index  1  2  3  4
1      H  C  H  -
2      H  C  L  -
3      H  L  H  -
4      H  L  C  -
5      C  H  L  -
6      C  H  C  -
7      C  L  H  -
8      C  L  C  -
9      L  H  L  -
10     L  H  C  -
11     L  C  H  -
12     L  C  L  -

[Figure 39: W/2 2-input comparators with one common conductor]

o To take into account the signal level carried by conductor 4 without using subset $\text{Comp}_2$, another 2-input comparator can be added to compare conductor 4 with another one. As a result, there are two possible subsets of 2-input comparators that can be formed: C and D. The graphical representation of those subsets is shown in Figure 40 and Figure 41 respectively.

C = [(1 ◦ 2), (2 ◦ 3), (2 ◦ 4), …, ((W-1) ◦ W)]
D = [(1 ◦ 2), (2 ◦ 3), (4 ◦ 5), …, ((W-1) ◦ W)]

[Figure 40: Graphical representation of subset C]

Table 9: Sub-symbols decoded by conductors 1, 2, 3 and 4 using subset C

Index  1  2  3  4
1      H  C  L  L
2      H  C  L  H
3      H  C  H  L
4      H  C  H  H
5      H  L  C  C
6      H  L  C  H
7      H  L  H  C
8      H  L  H  H
9      L  C  L  L
10     L  C  L  H
11     L  C  H  L
12     L  C  H  H
13     L  H  C  C
14     L  H  C  L
15     L  H  L  C
16     L  H  L  L
17     C  L  C  C
18     C  L  C  H
19     C  L  H  C
20     C  L  H  H
21     C  H  C  C
22     C  H  C  L
23     C  H  L  C
24     C  H  L  L

Unfortunately, neither of these subsets can decode more symbols than subset $\text{Comp}_2$, since they introduce one more noise margin restriction $\left(\frac{W}{2} + 1\right)$. More analytically, considering only the signals carried by conductors 1, 2, 3 and 4, subset C can decode just 24 sub-symbols, as presented in Table 9. On the other hand, the same conductors using subset $\text{Comp}_2$ would be able to decode as many as $6^2 = 36$ sub-symbols. This is because, with a pair of conductors and subset $\text{Comp}_2$, six sub-symbols can be decoded, as presented in Table 10.

Table 10: Sub-symbols decoded by conductors 1 and 2 using subset Comp 2

Index  1  2
1      H  L
2      H  C
3      L  H
4      L  C
5      C  H
6      C  L

Subset D is also less efficient than subset $\text{Comp}_2$.
Considering only the signals that can be carried by conductors 1, 2 and 3, subset D can decode just 12 sub-symbols, as presented in Table 11. Considering the conductors 1, 2, 3, 4, 5 and 6, subset D can decode up to $12^2 = 144$ sub-symbols. Using the signals over the same six conductors, subset $\text{Comp}_2$ can decode up to $6^3 = 216$ sub-symbols.

[Figure 41: Graphical representation of subset D]

Table 11: Sub-symbols decoded by conductors 1, 2 and 3 using subset D

Index  1  2  3
1      H  C  H
2      H  C  L
3      H  L  H
4      H  L  C
5      C  H  L
6      C  H  C
7      C  L  H
8      C  L  C
9      L  H  L
10     L  H  C
11     L  C  H
12     L  C  L

o To cover all the possible cases, one last case must be examined. We can extend subset $\text{Comp}_2$ to E = [(1 ◦ 2), (3 ◦ 4), (2 ◦ 3), …, ((W-1) ◦ W)] by adding one extra comparator to subset $\text{Comp}_2$ between a new pair of conductors. In this example, conductors 2 and 3 are used. The graphical representation of this subset of 2-input comparators is presented in Figure 42. Again in this case, using subset $\text{Comp}_2$ and the signals carried by conductors 1, 2, 3 and 4, there are $6^2 = 36$ possible sub-symbols that can be decoded, versus 24 (Table 9) using subset E.

[Figure 42: Graphical representation of subset E]

So subset $\text{Comp}_2$ = [(1 ◦ 2), (3 ◦ 4), …, ((W-1) ◦ W)] maximizes the number of symbols that can be partially decoded by using only 2-input comparators. Using $\frac{W}{2}$ comparators, the system can decode the same number of bits, given that $K = \frac{W}{3}$. Let $\text{Symbol}_2$ be the symbol set that satisfies the noise margin requirements of subset $\text{Comp}_2$. The objective from this point on is to decode more bits from $\text{Symbol}_2$ using comparators with more than two inputs. The above analysis holds when $W \bmod 3 = W \bmod 2 = 0$. The number of conductors needs to be a multiple of three for the number of “High”, “Low” and “Center” signals within each symbol to be the same.
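The sub-symbol counts quoted for the different subsets of 2-input comparators can be checked by exhaustive enumeration. The following is a minimal sketch under the same simplification used in Tables 8 to 11: every conductor may independently carry H, C or L, and a comparator (a ◦ b) only requires the two levels to differ. The function name `count_sub_symbols` is a hypothetical helper.

```python
from itertools import product

LEVELS = "HCL"


def count_sub_symbols(n_conductors, comparators):
    """Count level assignments in which every 2-input comparator sees two different levels.

    comparators is a list of (a, b) pairs of 1-based conductor indices.
    """
    return sum(
        all(s[a - 1] != s[b - 1] for a, b in comparators)
        for s in product(LEVELS, repeat=n_conductors)
    )


if __name__ == "__main__":
    print(count_sub_symbols(4, [(1, 2), (3, 4)]))          # Comp2 on conductors 1 to 4
    print(count_sub_symbols(3, [(1, 2), (2, 3)]))          # chain 1-2-3 (Tables 8 and 11)
    print(count_sub_symbols(4, [(1, 2), (2, 3), (2, 4)]))  # subset C (Table 9)
    print(count_sub_symbols(4, [(1, 2), (3, 4), (2, 3)]))  # subset E
    print(count_sub_symbols(6, [(1, 2), (3, 4), (5, 6)]))  # Comp2 on six conductors
```

The same helper also confirms the extreme case discussed earlier: the six comparators over all pairs of four conductors leave no valid assignment, because four pairwise different voltages cannot be formed from three levels.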
If $K \ne \frac{W}{3}$ or $W \bmod 3 \ne 0$, the number of bits that can be decoded from subset $\text{Comp}_2$ will be less than $\frac{W}{2}$. If the number of conductors is not a multiple of two, then there will be $\frac{W+1}{2}$ 2-input comparators. The first $\frac{W-1}{2}$ will compare the consecutive pairs of inputs of each symbol, while the last signal (the W-th) will be compared with one of the first W-1 signals. Of course, it is not obvious which one will be the right one, so all of the W-1 cases have to be considered. For example, for the case of a system with five conductors (W=5), the subset $\text{Comp}_2$ could consist of the two comparators [(1 ◦ 2), (3 ◦ 4)] and one of the four [(1 ◦ 5), (2 ◦ 5), (3 ◦ 5), (4 ◦ 5)]. So in that case, there would be 4 possible combinations for the $\text{Comp}_2$ subset. In general, for a system where $W \bmod 2 \ne 0$, there are W-1 possible $\text{Comp}_2$ subsets to take into consideration.

Finding all the potential subsets of X-input comparators, where X≥4:

According to the above analysis, it becomes clear that any subset of X-input comparators ($\text{Comp}_X$) needs to maximize the number of symbols ($\text{Symbol}_X$) that can be partially decoded. By partially decoded we mean that, for any symbol within $\text{Symbol}_X$, the noise margin requirements of all the comparators within $\text{Comp}_X$ are satisfied. $\text{Symbol}_X$ will reach its maximum size only when the comparators within $\text{Comp}_X$ do not share the same inputs. For X≥4, there are many subsets of x-input comparators that satisfy those requirements. In order to find the optimum solution, all of them will have to be evaluated and compared. Of course, if $W \bmod X \ne 0$, there are going to be exactly $\text{int}\left(\frac{W}{X}\right)$ x-input comparators used, while 1 to X-1 conductors are left over at the end. Those can be connected to another x-input comparator along with some already assigned signals. In any case, the total number of systems that need to be evaluated and compared decreases drastically compared to the brute-force search approach.

6.5.
Scaling cases

In this section, using the heuristic approach presented above, several scaling cases are presented and analyzed for systems with a small number of conductors. The architectural characteristics of those cases reveal certain patterns, which can be used for the scaling of bigger systems that have ten or more conductors (W≥10). Considering the enormous computational load that is associated with finding the optimum solution for large systems, this heuristic technique can be used instead. Unfortunately, there is no proof that this technique of scaling finds the optimum receiver architecture. Below, the receiver architecture for several cases is presented as the coding algorithm is applied to larger-scale systems.

6.5.1. Using six conductors (W=6)

As was presented earlier for the system that has six conductors, there are eighty different comparators available that can be used to decode the symbols at the receiver side. As a result, the number of possible receivers that should be evaluated is more than 300 million. To decrease the workload of the brute-force search, the set of available comparators has to shrink to the point that the simulation time becomes reasonable. Following the method presented above, the subset of 2-input comparators decreased from fifteen to three, [(1 ◦ 2), (3 ◦ 4), (5 ◦ 6)], while the subset of 4-input comparators decreased from 45 to also three, [(12 ◦ 34), (12 ◦ 56), (34 ◦ 56)]. One 6-input comparator was also considered, (123 ◦ 456). The optimum solution, using the above subset of seven comparators, consists of the three 2-input and the three 4-input comparators. The system with six conductors is a special case of the heuristic approach that was described earlier. It decreases the computational load of finding the optimum receiver architecture to the point that almost all the candidate comparators belong to the ideal solution.
The symbol set of any subset of 2-input comparators is a subset of the symbol set of [(1 ◦ 2), (3 ◦ 4), (5 ◦ 6)]. Mathematically, that can be expressed as $S_X \subseteq S_{[(1 \circ 2), (3 \circ 4), (5 \circ 6)]}$, where X is any subset of 2-input comparators. The same holds for the subset of 4-input comparators [(12 ◦ 34), (12 ◦ 56), (34 ◦ 56)]. In this special case, the symbol sets of the 4-input comparator subsets [(13 ◦ 24), (13 ◦ 56), (24 ◦ 56)] and [(12 ◦ 34), (12 ◦ 56), (34 ◦ 56)] are not the same, but they have the same number of symbols. So for equivalent subsets of comparators, which one has to be part of the ideal solution? That can be answered only by considering the symbol sets of the rest of the subsets of the heuristic approach. In this particular case, the subset [(12 ◦ 34), (12 ◦ 56), (34 ◦ 56)] was selected since it has a bigger intersection with the symbol set of the 2-input comparator subset [(1 ◦ 2), (3 ◦ 4), (5 ◦ 6)].

[Figure 43: Receiver's architecture using six conductors. Conductors 1 to 6 feed the 2-input comparators and the 4-input comparators (12, 34), (34, 56) and (56, 12).]

The graphical representation of the receiver’s architecture is presented in Figure 43. In total, the receiver can decode 48 symbols, resulting in 5.585 bits ($\log_2(48) = 5.585$). The system reaches an I/O pin efficiency of $\frac{5.585}{6} = 93.08\%$, as opposed to the 75% of the system using four conductors. Improving the I/O pin efficiency by 18 percentage points is a huge achievement, considering that it results only from the coding/decoding optimization of the algorithm. Given that the symbol rate remains 5.6-Gsymbols/s/pin, the effective data rate increases to 5.2Gbits/s/pin. An improvement of approximately 23% of the data rate can be achieved by applying the coding algorithm to six conductors.

Understanding this topology in detail is crucial to further scale the algorithm for systems that have $6 \cdot 2^{X}$ ¹ conductors, where $X \in [1, 2, 3, 4, \dots)$.
The three 2-input comparators decode exactly three bits. To decode more data bits from each symbol, comparators with more inputs must be used. The three 4-input comparators used in this receiver’s architecture compare pairs of inputs with each other. Within each pair, the signals cannot have the same voltage potential, since they are also compared with each other by the 2-input comparators. Therefore, the only three combinations of signals that can occur within the pairs of conductors (12), (34) and (56) are (Low, Center), (Low, High) or (High, Center). Each one of those pairs has a unique average signal level. So each of the 4-input comparators can be treated as a 2-input comparator that tracks a voltage difference between its two inputs. Each one of its inputs, in this case, is the average voltage level of a pair of signals. For simplicity, let A represent the average voltage level of the pair (Low, Center). Let B and C represent the average voltage level of the pairs (Low, High) and (High, Center) respectively. Ranking those signals according to their voltage potentials, A has the lowest and C the highest average voltage level (A<B<C).

• (L, C) Let the symbol for that average voltage level be A
• (L, H) Let the symbol for that average voltage level be B
• (C, H) Let the symbol for that average voltage level be C

Using the three 4-input comparators and the three possible voltage potentials, up to six possible symbol cases can be decoded, as is analytically presented in Table 12. Fortunately, in this case, for each one of the $2^3 = 8$ possible cases for the 2-input comparators, there are 6 possible cases according to the values decoded from the 4-input comparators. In total, $3 + \log_2(6) = 5.585$ bits can be decoded. In Table 13, all 48 symbols, as well as the outputs of each one of the comparators for the six-conductor receiver, are presented.

¹ 6, 12, 24, etc.
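The decoding of the three 4-input comparators, and the resulting bit count, can be reproduced with a short sketch. It assumes only the ranking A<B<C stated above and the convention, consistent with Table 12, that a comparator outputs 1 when its first input group has the higher average. The names `RANK`, `COMPARATORS` and `four_input_outputs` are hypothetical.

```python
from itertools import permutations
from math import log2

RANK = {"A": 0, "B": 1, "C": 2}  # A = (L, C), B = (L, H), C = (C, H); A < B < C

# Index pairs into (pair 12, pair 34, pair 56): comparators (12,34), (12,56), (34,56).
COMPARATORS = ((0, 1), (0, 2), (1, 2))


def four_input_outputs(pairs):
    """Outputs of the three 4-input comparators for one assignment of A, B, C."""
    return tuple(int(RANK[pairs[i]] > RANK[pairs[j]]) for i, j in COMPARATORS)


if __name__ == "__main__":
    table = {p: four_input_outputs(p) for p in permutations("ABC")}
    for idx, (pairs, outputs) in enumerate(table.items(), 1):
        print(idx, *pairs, *outputs)

    cases = len(set(table.values()))   # distinct 4-input output patterns
    bits = 3 + log2(cases)             # three bits from the 2-input comparators
    print(f"{8 * cases} symbols, {bits:.3f} bits, pin efficiency {100 * bits / 6:.2f}%")
```

The six output patterns are all different, so the 4-input comparators distinguish six cases for each of the eight 2-input outcomes, which gives the 48 symbols and 5.585 bits quoted in the text.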
Table 12: Decoding of the three 4-input comparators

         Inputs of the 4-input comparators    Outputs of the 4-input comparators
Index    (1,2)    (3,4)    (5,6)              (12,34)    (12,56)    (34,56)
1        A        B        C                  0          0          0
2        A        C        B                  0          0          1
3        B        A        C                  1          0          0
4        B        C        A                  0          1          1
5        C        A        B                  1          1          0
6        C        B        A                  1          1          1

A represents (L, C) or (C, L); B represents (L, H) or (H, L); C represents (H, C) or (C, H); A < B < C.

Table 13: All 48 symbol cases that can be encoded and decoded in a six-conductor system

               Comparator outputs                        Inputs of the 4-input comparators
Conductors     2-input        4-input                    12-34      12-56      34-56
1 2 3 4 5 6    1-2 3-4 5-6    12-34 12-56 34-56          1 2 3 4    1 2 5 6    3 4 5 6
L C L H C H    0   0   0      0     0     0              L C L H    L C C H    L H C H
L C L H H C    0   0   1      0     0     0              L C L H    L C H C    L H H C
L C C H L H    0   0   0      0     0     1              L C C H    L C L H    C H L H
L C C H H L    0   0   1      0     0     1              L C C H    L C H L    C H H L
L C H L C H    0   1   0      0     0     0              L C H L    L C C H    H L C H
L C H L H C    0   1   1      0     0     0              L C H L    L C H C    H L H C
L C H C L H    0   1   0      0     0     1              L C H C    L C L H    H C L H
L C H C H L    0   1   1      0     0     1              L C H C    L C H L    H C H L
L H L C C H    0   0   0      1     0     0              L H L C    L H C H    L C C H
L H L C H C    0   0   1      1     0     0              L H L C    L H H C    L C H C
L H C L C H    0   1   0      1     0     0              L H C L    L H C H    C L C H
L H C L H C    0   1   1      1     0     0              L H C L    L H H C    C L H C
L H C H L C    0   0   0      0     1     1              L H C H    L H L C    C H L C
L H C H C L    0   0   1      0     1     1              L H C H    L H C L    C H C L
L H H C L C    0   1   0      0     1     1              L H H C    L H L C    H C L C
L H H C C L    0   1   1      0     1     1              L H H C    L H C L    H C C L
C L L H C H    1   0   0      0     0     0              C L L H    C L C H    L H C H
C L L H H C    1   0   1      0     0     0              C L L H    C L H C    L H H C
C L C H L H    1   0   0      0     0     1              C L C H    C L L H    C H L H
C L C H H L    1   0   1      0     0     1              C L C H    C L H L    C H H L
C L H L C H    1   1   0      0     0     0              C L H L    C L C H    H L C H
C L H L H C    1   1   1      0     0     0              C L H L    C L H C    H L H C
C L H C L H    1   1   0      0     0     1              C L H C    C L L H    H C L H
C L H C H L    1   1   1      0     0     1              C L H C    C L H L    H C H L
C H L C L H    0   0   0      1     1     0              C H L C    C H L H    L C L H
C H L C H L    0   0   1      1     1     0              C H L C    C H H L    L C H L
C H L H L C    0   0   0      1     1     1              C H L H    C H L C    L H L C
C H L H C L    0   0   1      1     1     1              C H L H    C H C L    L H C L
C H C L L H    0   1   0      1     1     0              C H C L    C H L H    C L L H
C H C L H L    0   1   1      1     1     0              C H C L    C H H L    C L H L
C H H L L C    0   1   0      1     1     1              C H H L    C H L C    H L L C
C H H L C L    0   1   1      1     1     1              C H H L    C H C L    H L C L
H L L C C H    1   0   0      1     0     0              H L L C    H L C H    L C C H
H L L C H C    1   0   1      1     0     0              H L L C    H L H C    L C H C
H L C L C H    1   1   0      1     0     0              H L C L    H L C H    C L C H
H L C L H C    1   1   1      1     0     0              H L C L    H L H C    C L H C
H L C H L C    1   0   0      0     1     1              H L C H    H L L C    C H L C
H L C H C L    1   0   1      0     1     1              H L C H    H L C L    C H C L
H L H C L C    1   1   0      0     1     1              H L H C    H L L C    H C L C
H L H C C L    1   1   1      0     1     1              H L H C    H L C L    H C C L
H C L C L H    1   0   0      1     1     0              H C L C    H C L H    L C L H
H C L C H L    1   0   1      1     1     0              H C L C    H C H L    L C H L
H C L H L C    1   0   0      1     1     1              H C L H    H C L C    L H L C
H C L H C L    1   0   1      1     1     1              H C L H    H C C L    L H C L
H C C L L H    1   1   0      1     1     0              H C C L    H C L H    C L L H
H C C L H L    1   1   1      1     1     0              H C C L    H C H L    C L H L
H C H L L C    1   1   0      1     1     1              H C H L    H C L C    H L L C
H C H L C L    1   1   1      1     1     1              H C H L    H C C L    H L C L

6.5.2. Using twelve conductors (W=12)

Having designed the six-conductor receiver, it becomes straightforward to apply the same topology to systems in which the number of conductors is $6 \cdot 2^X$, where $X \in \{1, 2, 3, 4, \ldots\}$. For example, if $X = 1$, the number of conductors is $W = 12$. Dedicating one 2-input comparator to every two consecutive conductors, the subset $\mathrm{Comp}_2$ becomes $[(1 \circ 2), (3 \circ 4), (5 \circ 6), (7 \circ 8), (9 \circ 10), (11 \circ 12)]$. Each of those pairs of inputs can be seen as an average voltage level that can be used by a 4-input comparator to extract more data bits. Each pair can have one of three possible average voltage levels, A, B, and C, which are presented in Table 12. By treating each such pair of signals as one signal level, each 4-input comparator can be matched with one of the 2-input comparators of the six-conductor receiver presented earlier.
Similarly, by treating each of the three groups of four conductors, $[(12 \circ 34), (56 \circ 78), (9,10 \circ 11,12)]$, as one signal level, each 8-input comparator of the subset $[(1234 \circ 5678), (1234 \circ 9,10,11,12), (5678 \circ 9,10,11,12)]$ can be considered as one of the 4-input comparators used in the six-conductor system. The whole system therefore reduces to the six-conductor system presented earlier, by mapping the 8-input and 4-input comparators of the 12-conductor system onto the 4-input and 2-input comparators, respectively, of the six-conductor system. In addition, the subset of 2-input comparators $[(1 \circ 2), (3 \circ 4), (5 \circ 6), (7 \circ 8), (9 \circ 10), (11 \circ 12)]$ becomes part of the solution. With the six 2-input and three 4-input comparators, nine data bits can be decoded in total. Using the last three 8-input comparators, a further $\log_2(6) = 2.585$ bits can be decoded. In total, a system with 12 conductors can send 11.585 data bits, delivering an I/O pin efficiency of 96.54%. For any system that has $6 \cdot 2^X$ conductors, the same topology can be used by introducing $W/2$ new 2-input comparators and by treating each $2n$-input comparator as an $n$-input comparator of the system with half as many conductors. The final receiver of the system with 12 conductors is presented in Figure 44.

Figure 44: Receiver's architecture using twelve conductors (stages of 2-input, 4-input, and 8-input comparators)

6.6. Scaling results

In a system with 24 conductors, using the topology of the 12-conductor receiver, up to 23.585 data bits can be decoded, resulting in an I/O pin efficiency of 98.27%. Table 14 summarizes the results as the system scales to $6 \cdot 2^X$ conductors for $X \in \{0, 1, 2\}$.
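The bit counts quoted above for 6, 12, and 24 conductors follow a simple doubling rule, sketched below. The sketch assumes that each doubling adds $W/2$ new 2-input comparators worth one bit each, and that a symbol set consists of all arrangements of K High, K Low, and W - 2K Center signals. The multinomial formula is an inference that matches the symbol-set sizes listed for these systems; it is not stated explicitly in the text, and the function names are hypothetical.

```python
from math import comb, log2

def symbol_set_size(w, k):
    """Arrangements of k High, k Low and w - 2k Center signals on w conductors.

    Assumed model of the transmitter's symbol set.
    """
    return comb(w, k) * comb(w - k, k)

def decoded_bits(x, base_bits=3 + log2(6)):
    """Conductor count and decoded bits for the 6 * 2**x conductor receiver.

    Each doubling adds one new 2-input comparator, and therefore one bit,
    for every pair of conductors in the doubled system.
    """
    w, bits = 6, base_bits
    for _ in range(x):
        w *= 2
        bits += w // 2
    return w, bits

for x, k in [(0, 2), (1, 4), (2, 8)]:  # K values as listed in Table 14
    w, bits = decoded_bits(x)
    efficiency = 100 * bits / w
    print(w, k, symbol_set_size(w, k), round(2 ** bits),
          round(bits, 3), f"{efficiency:.2f}%")
```

Because every doubling adds exactly half a bit per conductor on top of half the previous efficiency, the efficiency climbs toward, but never reaches, one bit per pin.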
Even though the transmitter can increase the size of the symbol set at a permutational rate, the best receiver cannot decode the same number of symbols. As the system scales, its efficiency approaches the barrier of one data bit per pin, which is the I/O pin efficiency of binary coding.

Table 14: Scaling results for a system with $6 \cdot 2^X$ conductors

                                                                  Comparators
W     K    Symbol Set    Decoded Symbols    Bits      Efficiency  2-in   4-in   8-in   16-in   Total
6     2    90            48                 5.585     93.08%      3      3      0      0       6
12    4    34650         3072               11.585    96.54%      6      3      3      0       12
24    8    9.46E+9       1.2E+7             23.585    98.27%      12     6      3      3       24

Besides the six-conductor system and its multiples of two, the scaling patterns of the five-, seven-, and eight-conductor systems are summarized in Table 15, Table 16, and Table 17. In each case, the performance improves as the system scales. In each of the $5 \cdot 2^X$, $7 \cdot 2^X$, and $8 \cdot 2^X$ system cases (the $8 \cdot 2^X$ family is an extension of the four-conductor system, as presented in Table 17), the performance increases at a different rate. For the five-conductor system, the coding algorithm does not perform well. The coding efficiency is 71.7%, which is lower than the efficiency achieved by the four-conductor system. As expected, this system scales at a slower rate than the $6 \cdot 2^X$ family.

Table 15: Scaling results for a system with $5 \cdot 2^X$ conductors

                                                                  Comparators
W     K    Symbol Set    Decoded Symbols    Bits      Efficiency  2-in   4-in   8-in   16-in   Total
5     2    30            12                 3.585     71.70%      3      1      0      0       4
10    3    4200          384                8.585     85.85%      5      3      1      0       9
20    7    5.55E+12      3.9E+5             18.585    92.93%      10     5      3      1       19

The seven-conductor system, on the other hand, is more efficient than the four-conductor system, but it is outperformed by the six-conductor system. Both have approximately the same power budget, using six comparators at the receiver side and two current paths at the transmitter side (K=2). Even though they spend the same power in absolute terms, the power efficiency of the six-conductor system is better because of its higher data rate.
As expected, the systems with $7 \cdot 2^X$ conductors improve the system's efficiency as they scale, but they still remain worse than the $6 \cdot 2^X$ systems.

Table 16: Scaling results for a system with $7 \cdot 2^X$ conductors

                                                                  Comparators
W     K    Symbol Set    Decoded Symbols    Bits      Efficiency  2-in   4-in   8-in   16-in   Total
7     2    210           48                 5.585     79.79%      5      1      0      0       6
14    5    252252        6144               12.585    89.89%      7      5      1      0       13
28    9    6.38E+11      1.0E+8             26.585    94.94%      14     7      5      1       27

Finally, the $8 \cdot 2^X$ systems can be seen as the extension of the four-conductor link, using the scaling approach described above. As presented in Table 17, the efficiency of the system increases from 75% to 87.50%, 93.75%, and so on. The eight-conductor system is clearly worse than the one with six conductors, since it uses three current paths at the transmitter and seven comparators to deliver a system with 87.5% efficiency, compared to the 93.08% of the six-conductor system.

Table 17: Scaling results for a system with $8 \cdot 2^X$ conductors

                                                                  Comparators
W     K    Symbol Set    Decoded Symbols    Bits      Efficiency  2-in   4-in   8-in   16-in   Total
4     1    12            8                  3         75.00%      2      1      0      0       3
8     3    560           128                7         87.50%      4      2      1      0       7
16    5    2018016       32768              15        93.75%      8      4      2      1       15
32    11   4.55E+13      2.15E+9            31        96.87%      16     8      4      2       31

6.7. Speed and Power efficiency

Figure 45 compares the scaling efficiency of the proposed coding scheme, the conventional differential coding, and the differential coding over a plurality of conductors, using the metric Bits/Symbol/(Noise Margin)^-1. The total efficiency of the system improves without compromising the noise margin. The $6 \cdot 2^X$ curve delivers the best performance of all the cases. It is clearly the best case to choose, not only because of its performance but mostly because it remains simple enough to implement. For the six-conductor system, just three 2-input and three 4-input comparators are used, delivering a surprisingly good I/O pin efficiency of 93%.
Figure 45: Scaling efficiency of the proposed system vs the old ones. [Plot: number of conductors used on the horizontal axis, Bits/Pin/(Noise Margin)^-1 on the vertical axis; curves for $6 \cdot 2^X$, $8 \cdot 2^X$, $7 \cdot 2^X$, $5 \cdot 2^X$, differential signals over a plurality of conductors, and conventional differential; the 4-pin case is annotated.]

Figure 46 demonstrates the power efficiency of the system as it scales. Each line corresponds to one of the system cases $5 \cdot 2^X$, $6 \cdot 2^X$, $7 \cdot 2^X$, and $8 \cdot 2^X$. The curves have been normalized to the power consumption of the four-conductor system. It can be seen that 66% of the cases spend up to 10% more power, while the system speed improves by approximately the same amount. Scaling can therefore be seen as a process of trading power efficiency for speed and vice versa. As the system scales, the dominant factor in its power consumption becomes the extra current paths that are generated at the transmitter side to increase the size of the symbol set (K). As the number of conductors increases, so does the number of “High” and “Low” signals.

Figure 46: Power efficiency of the scaling. [Plot: number of conductors used on the horizontal axis, normalised mW/Gbps on the vertical axis; curves for $5 \cdot 2^X$, $6 \cdot 2^X$, $7 \cdot 2^X$, and $8 \cdot 2^X$.]

Chapter Seven: Conclusion

As the lithography process scales and the clock frequencies of integrated circuits increase, so does the level of parallelism, which integrates many more functions within the core of the same system. As a result, the throughput bottleneck has moved to the chip interconnections, whose data rates scale more slowly than the clock rate. Furthermore, as the power budget of systems remains the same or even shrinks while their functionality increases rapidly, the power efficiency of next-generation I/Os becomes a necessity, especially for systems that need a high level of integration. The proposed work is a hybrid link for chip-to-chip interconnections that combines the design advantages of both parallel and serial links.
By coding the data differentially among the pins, just as in serial links, we overcome two of the most important speed-limiting factors present in parallel link architectures: reference ambiguity and power line fluctuations. On the other hand, systems that use differential coding occupy twice as many I/Os. This is a known drawback of differential coding, since it decreases the effective performance (Gbits/s/pin) by 50%. To overcome this problem, three signal levels are used to boost the I/O pin efficiency close to 100% as the system scales. With just three signal levels, the noise margin of the system remains equal to the noise margin of binary coding, which is typically used in parallel links. To keep the power dissipation of the drivers low, a novel but simple driver architecture has been used to recycle the current among the drivers. Using a simple comparator at the receiver side, the proposed coding scheme decodes the transmitted data by comparing the average voltage difference between two or more signals. Due to the nature of the coding, there is no reference ambiguity.

In addition, the proposed architecture has theoretically zero switching noise at the power rails of the driver. The current dissipation of all the drivers remains constant over time for all possible data combinations. Although that happens naturally in a typical serial link, it is not automatically the case for a coding scheme that uses three signal levels instead of two. To get around this problem, the coding algorithm, together with the proposed architecture, forces the number of “High” and “Low” signal levels to be the same within each symbol, independent of the data transmitted. The drivers therefore dissipate the same amount of power, theoretically eliminating any power supply noise (L*di/dt). To make the transmission power efficient, the current used to pull one of the outputs “High” is recycled and reused by one of the outputs that drive “Low”.
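The constant-current property can be illustrated with a deliberately simplified model. It assumes that every “High” output draws one unit of current from the supply and that this unit is recycled through a “Low” output, so only the number of “High” signals determines the supply current. Both the current model and the function names are assumptions made for illustration; they do not describe the actual driver circuit.

```python
from itertools import permutations, product

def supply_current_units(symbol):
    """Supply current in a simplified model: one unit per High output.

    Assumption: the current of each High output is recycled through a Low
    output, so Low and Center outputs draw nothing extra from the supply.
    """
    return symbol.count("H")

def is_balanced(symbol):
    """True when the symbol has as many High as Low signals."""
    return symbol.count("H") == symbol.count("L")

# Three-level code on four conductors: one High, one Low, two Center (K = 1).
three_level = {"".join(p) for p in permutations("HLCC")}
# Conventional binary single-ended signaling on four conductors, for contrast.
binary = {"".join(p) for p in product("HL", repeat=4)}

three_level_currents = {supply_current_units(s) for s in three_level}
binary_currents = {supply_current_units(s) for s in binary}

print(len(three_level), all(is_balanced(s) for s in three_level),
      three_level_currents)
print(len(binary), binary_currents)
```

In this toy model the balanced three-level symbols all draw the same supply current, whereas the binary words draw a data-dependent current, which is the source of the L*di/dt noise discussed above.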
Considering equiprobable zeros and ones, the present transmitter dissipates 33% less power than a conventional driver architecture. To validate the proposed concepts, a transceiver chip has been implemented in a 0.18 μm TSMC process. It encodes three data bits over four conductors using three signal levels. To account for the difference between the clock rate and the symbol rate, four clock phases are generated internally for both the transmitter and the receiver. At each output pin, a four-way time-interleaved architecture is employed. To simplify the testing process, a pseudo-random binary sequence (PRBS) generator is included for each phase. To account for pin-to-pin mismatches and timing skew, both of which become more pronounced at higher data rates, digital calibration circuits have been included throughout the design.

7.1. Scaling

When the algorithm is applied to a system with four conductors, the I/O pin efficiency becomes 75%. With a symbol rate of 5.6 Gsymbols/s/pin, the effective data rate becomes 4.2 Gbits/s/pin. Improving the I/O pin efficiency has a direct impact on the data rate. For that reason, the present work examines the scaling behavior of the proposed algorithm. Scaling increases the I/O pin effectiveness of the system (that is, it sends more data per conductor) when more conductors are used. Probably the most important requirement during the scaling process is to use only three signal levels for coding, which keeps the noise margin constant. The scaling of the transmitter is a straightforward process. Multiple drivers are each connected to their corresponding conductor and, in parallel, to one another through their termination resistors. If more than one signal has to be high, an equal number of low signals guarantees the same current dissipation among the drivers at any given time. Finding the optimum receiver, on the other hand, is a computationally complex task whose complexity grows at a permutational rate as the system scales. The optimum solution for the receiver is the set of comparators that can decode the most symbols that can be generated at the transmitter. Due to the computational complexity of finding the optimum receiver, this dissertation presents the ideal receiver architecture for systems that have fewer than ten conductors.

The case that best demonstrates the effectiveness of the scaling process is most probably the six-conductor case. The system differentially codes 5.585 bits over just six conductors, using only three signal levels and six comparators at the receiver side. The effective data rate of the system is 93% of the symbol rate. Using a clock rate of 1.4 GHz and 4-way time interleaving, the system could send 5.2 Gbits/s/pin. That is equivalent to a symbol rate of 10.4 Gsymbols/s for a serial link that uses traditional differential coding.

7.2. Future work

There are two possible ways to extend the work of this dissertation. Although scaling the proposed algorithm may be neither necessary nor practical for more than ten pins, the optimum receiver architecture for any number of conductors should be found. The coding efficiency is already given. The objective from this point on is to decode as many symbols as possible using the most efficient receiver. Possibly, a receiver with a different architecture could perform better than just a collection of simple comparators.

The other extension of the proposed work concerns the efficiency of the comparators. As the number of inputs increases for large-scale systems, their current dissipation has to increase as well to achieve the same slew rate at the output. On the other hand, the larger output capacitance caused by the increased transistor width will decrease the slew rate. The efficiency of those comparators will definitely affect scaling.
Last, but certainly not least, new coding schemes should be considered to further improve the performance of parallel links.
Abstract
As the lithography process scales, the throughput requirement for chip interconnections increases. Furthermore, the band-limited channel practically does not scale, and the power budget has become more important than ever. This tradeoff has raised the need for innovative designs. As a result, a plethora of new design techniques has been discussed extensively during the last decade. This challenge prompted the development of the presented work, which delivers a complete solution for systems that require low power, high speed, and a high level of integration.
Asset Metadata
Creator
Zogopoulos, Sotirios
(author)
Core Title
A low-power high-speed single-ended parallel link using three-level differential encoding
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Electrical Engineering
Defense Date
12/05/2006
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
differentially coding,link,low power,OAI-PMH Harvest,parallel,serial
Language
English
Advisor
Namgoong, Won (
committee chair
), Beerel, Peter A. (
committee member
), Gupta, Sandeep K. (
committee member
), Zimmermann, Roger (
committee member
)
Creator Email
zogopoul@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-m305
Unique identifier
UC1160638
Identifier
etd-Zogopoulos-20070301 (filename),usctheses-m40 (legacy collection record id),usctheses-c127-318862 (legacy record id),usctheses-m305 (legacy record id)
Legacy Identifier
etd-Zogopoulos-20070301.pdf
Dmrecord
318862
Document Type
Dissertation
Rights
Zogopoulos, Sotirios
Type
texts
Source
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Repository Name
Libraries, University of Southern California
Repository Location
Los Angeles, California
Repository Email
cisadmin@lib.usc.edu