Encoding Techniques for Energy-efficient and Reliable Communication in VLSI Circuits

by

Yazdan Aghaghiri

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (ELECTRICAL ENGINEERING)

May 2005

Copyright 2005 Yazdan Aghaghiri

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

UMI Number: 3180384. Copyright 2005 by Aghaghiri, Yazdan. All rights reserved. UMI Microform 3180384, Copyright 2005 by ProQuest Information and Learning Company. All rights reserved. This microform edition is protected against unauthorized copying under Title 17, United States Code. ProQuest Information and Learning Company, 300 North Zeeb Road, P.O. Box 1346, Ann Arbor, MI 48106-1346.

Dedication

To Tahereh and Yahya

Epigraph

Of science naught remained I did not know,
Of secrets, scarcely any, high or low;
All day and night for three score and twelve years,
I pondered, just to learn that naught I know.

Omar Khayyam, Mathematician, Astronomer and Poet
Acknowledgements

First and foremost, I would like to thank my advisor, Professor Massoud Pedram, for his exceptional advisement during the past five years. It has been an invaluable privilege for me to work with him during these years. In addition to his scientific brilliance, his kind personality, honesty, and goodwilled support mean a lot to me.

I would like to profoundly thank my father, Yahya, and my mother, Tahereh. Their love and devotion filled the sweetest days of my childhood and is still the most important source of inspiration in my life. Whatever I have accomplished in my life is because of them, and I cannot imagine what I would have been without them. I also want to thank my sister, Toranj, and my brother, Ashkan, for their kindness and support. Throughout the past years, being away studying, I have tremendously missed my family.

I am grateful to Professor Michael A. Arbib and Professor Sandeep K. Gupta for serving on my thesis committee and taking the time to attend my final defense. I also thank Professor Robert A. Scholtz and Professor Peter A. Beerel for attending my qualifying exam.

I would like to thank Dr. Farzan Fallah for his collaboration. His extensive knowledge of engineering has been very helpful to me.

I am also grateful to all my friends for the sincere friendship they have given me, those I have been able to see in recent years and those I have not seen for a long time, back home in Iran or in other parts of the world. In addition, I am thankful to all my colleagues, especially the members of our research group at USC.

Yazdan Aghaghiri
USC, December 2004
Table of Contents

Dedication  ii
Epigraph  iii
Acknowledgements  iv
List of Tables  ix
List of Figures  xii
Abstract  xv

Chapter 1  INTRODUCTION  1
1.1 Design Challenges  1
1.2 Overview of the Dissertation  2

Chapter 2  LOW-POWER ENCODING TECHNIQUES  8
2.1 Introduction  8
2.2 General Notations and Definitions  12
2.3 Overview of Previous Work  19
2.4 A Set of Irredundant Encoding Techniques  27
2.4.1 T0-Concise  27
2.4.2 Offset-XOR-SM  33
2.4.3 Offset-XOR-SMC  37
2.4.4 Performance Analysis  40
2.5 ALBORZ Encoding Techniques  45
2.5.1 Approach  45
2.5.2 Redundant ALBORZ  46
2.5.2.1 Fixed Codebook  50
2.5.2.2 Adaptive Codebook  52
2.5.3 Irredundant ALBORZ Code  53
2.5.4 Quantitative Codebook Analysis  57
2.5.5 Performance Analysis  64
2.6 Conclusions  65

Chapter 3  SECTOR-BASED ENCODING ALGORITHMS  67
3.1 Introduction  67
3.2 Previous Related Work  69
3.3 The Approach to Sector-based Encoding  73
3.4 Fixed Sector Encoding  78
3.4.1 Fixed Two Sector Encoding  78
3.4.2 Fixed Multiple Sector Encoding  80
3.5 Dynamic Sector Encoding  86
3.5.1 Dynamic Two Sector Encoding  90
3.5.2 Dynamic Multiple Sector Encoding  103
3.5.2.2 Generating Codewords for Covered Sourcewords  109
3.5.2.3 Generating Codewords for an Exposed Sourceword  112
3.6 Experimental Results  120
3.6.1 Sector-based Encoding for Low Power  121
3.6.2 Sector-based Encoding for Data Compaction  127
3.6.2.1 Compaction of TIFF Image Files  127
3.6.2.2 Compaction of Mixed Streams  130
3.7 Conclusions  133

Chapter 4  INSTRUCTION-SET-AWARE MEMORIES  134
4.1 Introduction  134
4.2 Basic Approach  135
4.3 Next Address Prediction in BEAM  138
4.3.1 Instruction Addresses  138
4.3.2 Data Addresses  145
4.4 Performance Analysis  147
4.5 Power Analysis  154
4.6 Conclusion  158

Chapter 5  PATTERN-SENSITIVE ENCODING FOR RELIABILITY  159
5.1 Introduction  159
5.2 Background  163
5.3 The Proposed Approach  167
5.3.1 Sensitive Patterns  173
5.3.2 Pattern Sensitive Encoding  177
5.3.3 Design Choices  181
5.4 Experimental Results  186
5.5 Conclusions  191

Chapter 6  ENCODING TECHNIQUES FOR REDUCING HOT-CARRIER DEGRADATION  192
6.1 Introduction  192
6.2 Problem Setup  198
6.3 Reducing the Bus Maximum Activity  206
6.4 Coding with Combinational Functions  209
6.5 Coding with Sequential Functions  222
6.5.1 Inter-Sequential Functions  223
6.5.2 General Sequential Functions  226
6.6 Experimental Results  233
6.7 Conclusions  238

CONCLUSIONS AND FUTURE DIRECTIONS  239
BIBLIOGRAPHY  242

List of Tables

Table 2-1  Simple example proving T0 would fail without an extra bit  29
Table 2-2  Example of T0-C encoding  31
Table 2-3  Example showing T0-C for a self-jumping instruction  31
Table 2-4  Example of LSB-Inv function in the 8-bit space  35
Table 2-5  Example of a 3-bit codebook  39
Table 2-6  Switching activity of SPEC 2000 traces in millions and average percentage saving for different encoding techniques  42
Table 2-7  Encoder hardware synthesis and power estimation  43
Table 2-8  Typical codebook for fixed redundant ALBORZ  50
Table 2-9  Encoder hardware synthesis and power estimation  65
Table 3-1  An example of the DTSE encoding for a three-bit address space and sector heads equal to 001 and 011  101
Table 3-2  An example of the DTSE decoder  102
Table 3-3  DMSE encoding for a 5-bit bus using four sector heads equal to {1,3,7,21}; shaded cells are exposed sourcewords  118
Table 3-4  Percentage savings for traces of data address (no cache)  123
Table 3-5  Percentage savings for traces of data address (no cache)  124
Table 3-6  Average contribution of Sector-ID bits and offset bits in total remaining transitions  126
Table 3-7  Average transition savings for different techniques  127
Table 3-8  Compaction ratio for different images from the SIPI database  130
Table 3-9  Effectiveness of DMSE for mixed streams  132
Table 4-1  Percentage of transition cost for different kinds of instructions  150
Table 4-2  Transition saving for different stages of our proposed method for the instruction address bus  151
Table 4-3  Transition saving for data addresses and the percentage of accesses precisely predicted for a full-size shadow register file  153
Table 4-4  Transition saving for data addresses, cache hit, and the percentage of accesses precisely predicted for a 4-entry directly mapped cache  153
Table 4-5  Results of hardware analysis and power estimation  158
Table 5-1  Number of errors in 5000 bus cycles (no encoding has been performed)  172
Table 5-2  Example showing Miller effect count  174
Table 5-3  Power consumption for different configurations of the bus  187
Table 5-4  Error over a single group for various cycle time constraints  190
Table 6-1  E(Min(L,S)) / E(S)  202
Table 6-2  Representatives of three NP-equivalence sets in the 2-bit space plus some other members of class [F1]  210
Table 6-3  Encoding with redundancy to reduce maximum transitions  214
Table 6-4  Modeling a trace based on number of inter-sourcewords  217
Table 6-5  Comparison of different methods applied over instruction addresses  236
Table 6-6  Comparison of different methods applied over data addresses  237

List of Figures

Figure 2-1  Basic block diagram for on-chip and off-chip encoding/decoding  13
Figure 2-2  Block diagram of low power bus encoding with transition signaling  14
Figure 2-3  General framework for low power bus encoding  26
Figure 2-4  T0-C encoder  32
Figure 2-5  Offset-XOR-SM encoder  36
Figure 2-6  Percentage of branch instructions based on the required bits to represent their displacement for SPEC95 benchmarks  37
Figure 2-7  Offset-XOR-SMC encoder  40
Figure 2-8  Contribution of different kinds of instructions in total number of bus transitions  41
Figure 2-9  Comparison of total power saving for different encoding techniques  44
Figure 2-10  Redundant ALBORZ encoder  47
Figure 2-11  Redundant ALBORZ decoder  48
Figure 2-12  Irredundant ALBORZ encoder  55
Figure 2-13  Ratio of total activity of the encoded bus to the original bus for fixed redundant ALBORZ  59
Figure 2-14  Ratio of codebook misses and hits  60
Figure 2-15  Ratio of total activity of the encoded bus to the original bus for adaptive redundant ALBORZ  61
Figure 2-16  Ratio of codebook misses and hits  61
Figure 2-17  Ratio of total activity of the encoded bus to the original bus for irredundant ALBORZ  63
Figure 2-18  Ratio of codebook misses and hits  63
Figure 2-19  Comparison of total power savings of different encoding techniques  65
Figure 3-1  Comparison of contiguous versus dispersed sectorization  82
Figure 3-2  Concept of sectors in data streams with spatio-temporal correlation  89
Figure 3-3  Two sector heads and the part of the memory space that each one covers  92
Figure 3-4  Graphical representation of distance functions  94
Figure 3-5  Comparison of the number of ones in the offset of different sourcewords  119
Figure 4-1  Block diagram of the calculation/prediction unit in memory  138
Figure 4-2  Percentage of different kinds of control flow instructions  148
Figure 4-3  Effect of stack size on JR transitions  152
Figure 4-4  Hardware implemented in memory for predicting instruction addresses (jump and links, jumps and branches)  155
Figure 4-5  Hardware implemented in memory for predicting data addresses  156
Figure 5-1  The top metal layer with no ground level above it  171
Figure 5-2  Number of errors versus transition count in a group of size 11  175
Figure 5-3  Number of errors versus Miller effect count in a group of size 11  175
Figure 5-4  Special parity check matrix for the extended Hamming code  178
Figure 5-5  Bus configuration of the proposed ERC architecture  185
Figure 5-6  Block diagram of a generalized PSE  185
Figure 6-1  A set of 3 global lines that are good candidates for encoding and decoding  199
Figure 6-2  Model for transition-routing sequential functions  228
Figure 6-3  Interchange Block  231
Figure 6-4  XOR-Rotate & INC-XOR-Rotate  233

ABSTRACT

System-on-chip (SoC) design is evolving as a result of the increasing number of devices that can be integrated on a single chip. To cope with the enormous complexity of SoC design, the design flow is shifting toward interconnecting pre-designed macro-cells or subsystems with application-specific interconnects and buses. The interconnection of these modules will be a major challenge in realizing these complex and highly integrated systems. In this dissertation, we address the power consumption and reliability issues that arise in the design process and propose encoding and decoding solutions that help overcome these bottlenecks. Several encoding techniques are proposed that can effectively reduce switching activity over instruction and data address buses. This is achieved by exploiting the spatial locality of the traces that flow over these buses.
These techniques are all designed to accommodate the specific limitations of encoding over on-chip and off-chip buses, such as tight delay constraints and power and area considerations.

Another set of encoding techniques, named sector-based encoding techniques, is proposed; these are particularly effective at exploiting the spatio-temporal locality of various kinds of traces. The fixed version is very suitable for hardware implementations (e.g., for low power applications), whereas the dynamic version is better suited for software implementations (e.g., for data compaction).

An instruction-set-aware memory is proposed that is capable of predicting addresses to some extent. This leads to a reduction of traffic over the memory bus that can be targeted for power reduction or performance improvement.

An energy-efficient, reliable channel for on-chip communication is proposed. This includes a pattern-sensitive encoding scheme that selects the optimum encoding technique based on the sensitivity of patterns. In addition, the various levels of reliability (encoding techniques) are selected in such a way that consistency between the encoder and decoder is guaranteed with zero communication overhead.

Finally, the problem of hot-carrier device degradation for bus drivers is tackled from an encoding point of view. The problem is systematically solved by first characterizing the data that appears on the bus, and then finding combinational and/or sequential encoding functions that efficiently solve the problem.

Chapter 1  INTRODUCTION

1.1 DESIGN CHALLENGES

Advances in silicon technologies have tremendously revolutionized human lives during the past three decades. A lot more is still to come. Where this progress will go, and when it will stop, if ever, is more a matter of philosophy.
However, to continue progressing at the same rate, many technological issues need to be addressed. Although the growth in integration capability is expected to come to a stop in the near future, there is still a lot of room for design methodologies to advance and evolve before we can claim that we are close to the boundaries.

A huge number of transistors can be fabricated on a single chip. To ease the design effort, complex designs try to reuse pre-designed blocks. Consequently, the overall performance of the system is not solely determined by the performance of these blocks; it is also affected by how efficiently these blocks are connected to each other. Delay, power consumption, and, more recently, the reliability of communication over interconnecting channels have become major issues challenging research engineers. Interconnects will become even more critical because, unlike transistors, their delay and power consumption do not scale down with technology.

In this dissertation, I focus on different solutions that attempt to solve the problems of communication over on-chip and off-chip channels. The proposed techniques are a set of high-level solutions, such as encoding techniques, that are effective for reducing power consumption, increasing reliability, and compacting data. Some of these techniques deal with more than one of the above issues at a time. For instance, encoding techniques that increase reliability have already been intensively investigated by the communication research community. Yet when it comes to on-chip communication, energy-efficient and reliable communication is a new field of research that still calls for innovative solutions.

1.2 OVERVIEW OF THE DISSERTATION

This section provides an overview of the proposed techniques and methodologies and the obtained results.
In Chapter 2, we first propose three general encoding techniques that are suitable for decreasing the power consumption of global buses, off-chip or on-chip. The best target for these techniques is a wide, highly capacitive memory bus. The first of these techniques, known as T0-C, works based on the sequential behavior of traces such as instruction address traces. The other two techniques, known as Offset-XOR-SM and Offset-XOR-SMC, work based on a codebook that can help reduce transitions on both instruction and data address buses. These methods decrease switching activity by up to 86% without the need for redundant bus lines. Having no redundancy means that implementing these techniques on an existing system does not incur the redesign and remanufacturing cost of the whole system. The power dissipation of the encoder and decoder blocks has also been calculated and shown to be insignificant in comparison with the power saved on the memory address bus as a result of applying the techniques.

In the second part of this chapter, we introduce another set of techniques known as the ALBORZ encoding techniques. The ALBORZ code is constructed based on a codebook that maps the offsets to reduce their projected activity; it has fixed and adaptive versions and, in general, is redundant. With enhancements that make it adaptive and irredundant, it achieves up to an 89% activity reduction on the instruction address bus. Although ALBORZ requires relatively more complex blocks compared to the first set of encoding techniques, its higher flexibility and good performance over a wider range of traces make it a suitable choice in many different applications.
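The sequential-trace idea behind T0-style codes can be sketched as follows. This is a minimal illustration of the general principle only, not the actual T0-C algorithm; the stride value and the disambiguation rule are assumptions for the sketch. Note the deliberate flaw: if a non-sequential address happens to equal the frozen codeword, this naive irredundant decoder would misread it, which is exactly why the original T0 code needs an extra bus line and why an irredundant variant must resolve the ambiguity differently.

```python
# Sketch of transition-free signaling for sequential addresses (the general
# idea behind T0-style codes; NOT the exact T0-C construction).

STRIDE = 4  # assumed instruction-address stride


class T0LikeEncoder:
    def __init__(self):
        self.prev_addr = None
        self.prev_code = 0

    def encode(self, addr):
        if self.prev_addr is not None and addr == self.prev_addr + STRIDE:
            code = self.prev_code   # sequential fetch: freeze the bus, 0 transitions
        else:
            code = addr             # non-sequential: send the address itself
        self.prev_addr, self.prev_code = addr, code
        return code


class T0LikeDecoder:
    def __init__(self):
        self.prev_addr = None
        self.prev_code = None

    def decode(self, code):
        # CAVEAT: ambiguous if a jump target equals the frozen codeword;
        # T0 resolves this with a redundant line, T0-C by other means.
        if self.prev_addr is not None and code == self.prev_code:
            addr = self.prev_addr + STRIDE  # frozen bus implies sequential fetch
        else:
            addr = code
        self.prev_addr, self.prev_code = addr, code
        return addr
```

On the trace 100, 104, 108, 200, 204 the encoder emits 100, 100, 100, 200, 200, so the two sequential runs cost zero bus transitions.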
In Chapter 3, we present an entirely new set of techniques known as sector-based encoding techniques, which are irredundant encoding techniques that can effectively exploit locality in different traces of sourcewords. After the locality is exploited, a suitable codeword can be generated either with the goal of activity reduction or, alternatively, data compaction. The sector-based encoding techniques are quite successful at both of these tasks. The key idea behind these techniques is to partition the sourceword space into a number of sectors, each with a unique identifier called the sector head. These sectors can, for example, correspond to the address spaces of the code, heap, and stack segments of one or more application programs. Each sourceword is then dynamically mapped to the appropriate sector and is encoded with respect to the sector head that resides in that sector. In general, the sectors may be determined a priori or may be updated dynamically based on the sourceword that was last encountered in the trace. If the sectorization is done in advance and does not change thereafter, the technique is called Fixed Sector Encoding, or FSE for short. Our experimental results show that for a computer system without an on-chip cache, FSE with up to 8 sectors can decrease the switching activity of data address and multiplexed address traces by averages of 55% and 67%, respectively. For a system with an on-chip cache, up to 55% transition reduction is achieved on a multiplexed address bus between the internal cache and the external memory. Assuming a 10 pF per-line bus capacitance, we show that, by using the proposed encoding techniques, a power reduction of up to 52% can be achieved for an external data address bus and 42% for the multiplexed bus between the cache and main memory.
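The fixed-sector idea can be rendered as a small sketch. This is a simplification built on stated assumptions, not the dissertation's exact FSE codeword construction: here sectors are fixed partitions by high-order bits, each sector's head is taken to be the last word referenced in it, and the offset is sent as an XOR difference against that head (so words near a recently referenced word in the same sector produce codewords with few ones, i.e., few transitions under transition signaling).

```python
# Hedged sketch of fixed-sector encoding. Assumptions (not from the text):
# 16-bit sourcewords, 4 fixed sectors selected by the top 2 bits, head =
# last-referenced word of the sector, offset = XOR against the head.

WIDTH, K = 16, 2
SECT_SHIFT = WIDTH - K


def popcount(x):
    return bin(x).count("1")


class FSECodec:
    """Encoder and decoder share this state machine and stay in lockstep."""

    def __init__(self):
        # Each sector's head starts at the sector's base address.
        self.heads = [s << SECT_SHIFT for s in range(1 << K)]

    def encode(self, word):
        s = word >> SECT_SHIFT          # sector ID from high-order bits
        offset = word ^ self.heads[s]   # few 1s when word is near the head
        self.heads[s] = word            # head tracks the last reference
        return s, offset

    def decode(self, s, offset):
        word = offset ^ self.heads[s]
        self.heads[s] = word            # mirror the encoder's head update
        return word
```

On a trace that interleaves two localized streams (say a code segment and a stack segment), each stream keeps its own head, so neither stream destroys the other's locality.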
The other alternative in sector-based encoding is when the sectorization changes adaptively based on the sourcewords. We refer to this set of techniques as Dynamic Sector Encoding, or DSE for short. In these techniques, sectorization is done in such a manner that, for most sourcewords in the space, the closest sector head is the head of their own sector. As a result, each sourceword is encoded with respect to a nearby, previously referenced sourceword. This technique can be targeted at either data compaction or the reduction of activity in traces. In the experimental results, we show the effectiveness of DSE when it is applied to three different streams: data address streams, image files, and mixed data streams that represent sensor data communicated in a wireless sensor network. DSE achieves an average 68% transition reduction for data addresses, an average 23% compaction for image files, and up to 19% additional lossless compaction for sensor data.

In Chapter 4, we present a novel technique for reducing the required amount of communication on the memory bus by using instruction-set-aware memories. These memories can be used for reducing the power consumption of instruction or data address buses, or for increasing the throughput of these buses for performance reasons. The proposed approach relies on the availability of smart memories that have a certain awareness of the instruction format of one or more architectures. Based on this knowledge, the memory calculates or predicts the instruction and data addresses. Hence, not all addresses need to be sent from the processor to the memory. This, in turn, significantly reduces the activity on the memory bus. The proposed method can eliminate up to 97% of the transitions on the instruction address bus and 75% of the transitions on the data address bus with a small hardware overhead.
Actual power savings of 85% for the instruction bus and 64% for the data bus were achieved for a per-line bus capacitance of 10 pF.

In Chapter 5, we address the dilemma of power efficiency versus reliability of high-speed on-chip communication in future System-on-Chip (SoC) designs. We propose an innovative Energy-efficient, Reliable on-chip communication Channel architecture. We also introduce the notion of Pattern Sensitive Encoding and employ it to minimize the effect of crosstalk coupling noise on on-chip interconnect buses. More precisely, input patterns that are more vulnerable to crosstalk noise are recognized and protected with a more robust encoding technique. Finally, we introduce a new methodology by which adaptive reselection of the encoding techniques over the channel is done without imposing any communication overhead; i.e., there is no need for the sender to inform the receiver which encoding technique is being used. Instead, the receiver itself is capable of determining it without external help. This is achieved as a result of the care we have taken in selecting the different encoding techniques. Experimental results show that the power consumption of the bus lines can be reduced by up to 28% with the pattern-sensitive technique. This power saving is achieved without any impact on reliability and with only a 0.4% performance loss.

In Chapter 6, we address the issue of hot carriers and their impact on the reliability of a VLSI circuit through the accelerated aging of the transistors employed in the bus line drivers. We tackle this phenomenon by formulating and solving a bus encoding problem, which we refer to as Bit-level Transition Balancing (BTB). The BTB problem is to minimize the maximum value of the expected number of transitions over a group of lines, which we refer to as a bus.
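The zero-overhead scheme identification can be illustrated with a deliberately simple toy, which is not the dissertation's construction: if the candidate encodings are chosen so that their codeword sets are disjoint, the receiver can classify every received word on its own. Here, as an assumed example, a "light" scheme emits only even-parity words and a "robust" stand-in emits only odd-parity words over a 9-bit channel carrying 8-bit data.

```python
# Toy illustration of self-identifying encodings: two schemes whose codeword
# sets are disjoint (even parity vs. odd parity), so the receiver needs no
# side channel to know which scheme produced a word. Purely illustrative.

def parity(x):
    return bin(x).count("1") & 1


def enc_light(word):
    # Append a bit that makes the 9-bit codeword have EVEN total parity.
    return (word << 1) | parity(word)


def enc_robust(word):
    # Stand-in for a stronger code: 9-bit codeword with ODD total parity.
    return (word << 1) | (parity(word) ^ 1)


def decode(cw):
    # The receiver infers the scheme from the codeword itself.
    scheme = "robust" if parity(cw) else "light"
    return scheme, cw >> 1
```

The real design space is richer (the dissertation guarantees encoder/decoder consistency across several reliability levels), but the disjoint-codeword-set principle is the part this toy captures.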
We approach this problem systematically by first answering the question of how much information about the characteristics of the data that appears on the bus is needed to find combinational and/or sequential encoding functions that optimally solve the BTB problem. Next, we propose a number of different encoding techniques that efficiently solve the BTB problem. Experimental results demonstrate the effectiveness of these techniques.

Chapter 2
LOW-POWER ENCODING TECHNIQUES

2.1 INTRODUCTION

With the increasing number of transistors on a chip and rising operating frequencies, the total power dissipation of VLSI circuits is rapidly increasing. This not only creates energy concerns but also causes high temperatures on the chip surface that lead to higher costs and a variety of reliability problems. Another significant issue is that many systems are becoming portable and wireless, operating from a battery pack with a limited energy supply. Excessive power consumption directly impacts the operation time of such systems before their battery needs to be replaced or recharged. Modern digital design must reconcile the competing goals of realizing circuits that have low power consumption and that meet tight performance constraints. Therefore, low-power design methodologies at all levels are critically needed to overcome the above problems.

The major building blocks of a digital system include various processing cores, memory chips, I/O blocks, and the communication channels dedicated to providing the means for data transfer between these blocks. These channels tend to support heavy traffic and often constitute the performance bottleneck in many systems.
The overall performance of the system is set by how effectively the various cores can communicate rather than by how fast each individual core can perform computation. Therefore, the design and implementation of modern communication channels (both on-chip and off-chip) has received significant attention from the research community in recent years. In a conventional system, a key channel might be a local bus between the CPU and the memory controller, or a memory bus between the memory controller (which may be on-chip or off-chip) and the memory devices. The bus may be used for addresses, data, or a combination of the two. Over these channels, the energy dissipation per access is usually quite high, which in turn limits the power efficiency of the overall system. The energy consumption depends on various factors. In general, the power consumption of a node depends on its total capacitance, its activity factor, and the square of the operating voltage. To be exact:

P_avg = α_T · C_load · V_dd² · f_clk

In this equation, α_T is the node activity factor, which is the effective number of transitions experienced per clock cycle, and is a number between 0 and 1. Also, C_load is the total capacitance of the node to ground and f_clk is the clock frequency. It is important to mention that in the above formula, we are only considering the total capacitance of the node with respect to ground. We do not consider the inter-wire capacitance of the lines of the bus in this work. This assumption is valid if wires are shielded by ground lines or properly separated such that the effect of inter-wire capacitance is negligible. In [58], the authors look at the bus power problem when considering the inter-wire capacitances. We usually refer to the product of the activity factor and the total capacitance as the switched capacitance of the node.
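As a quick numerical illustration of the power formula above, the following sketch evaluates P_avg for one bus line. The operating-point values (10 pF load, 1.8 V supply, 100 MHz clock, toggling on a quarter of the cycles) are illustrative assumptions, not figures from this work.

```python
# Sketch of the dynamic power formula P_avg = alpha_T * C_load * V_dd^2 * f_clk.
# The operating-point values below are illustrative assumptions.

def avg_dynamic_power(alpha_t, c_load, v_dd, f_clk):
    """Average dynamic power of a node in watts."""
    return alpha_t * c_load * v_dd ** 2 * f_clk

# Example: one bus line with 10 pF load, 1.8 V supply, 100 MHz clock,
# toggling on 25% of the cycles.
p = avg_dynamic_power(0.25, 10e-12, 1.8, 100e6)
print(round(p * 1e6, 2), "uW")  # 810.0 uW
```

The quadratic dependence on V_dd is why supply-voltage scaling, discussed next, is so attractive despite its performance cost.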
The larger the switched capacitance of a node, the more it contributes to the total energy consumption of the system. Various approaches exist for reducing the energy consumption of power-hungry nodes, for example lowering the supply voltage, reducing the load capacitance, or reducing the activity factor. Lowering the supply voltage leads to a quadratic improvement in the energy consumption of the system [19] but negatively affects the performance of the system at the same time. Reducing output capacitance is usually achieved through downsizing of the driving transistors and is only effective for a node whose total capacitance is dominated by device capacitance rather than wiring capacitance. Thus, this method is not applicable to power-hungry communication channels. The technique we focus on in this chapter is the application of high-level encoding and decoding techniques that reduce the activity factor of highly capacitive nodes. This is usually referred to in the literature as the Low Power Bus Encoding problem, which we now formally define.

Definition 2-1 Low Power Bus Encoding problem: The problem of finding encoding and decoding techniques that can reduce the total power consumption of a bus (a group of lines and their corresponding drivers) by manipulating the activity factor of the different lines.

The emphasis of this chapter is on encoding/decoding techniques that minimize the power consumption of memory instruction and data address buses. We will assume that different lines have equal capacitances. Therefore, the low power bus encoding problem reduces to finding functions that minimize the total number of transitions that happen on the bus. In the remainder of this chapter, we will first give some general notations and definitions that will be used throughout the whole work.
After that, we will provide a thorough overview of the previous work that has been done on low power bus encoding. Finally, we present our proposed techniques together with experimental results.

2.2 GENERAL NOTATIONS AND DEFINITIONS

Throughout this work, we assume that X is the input to the encoding block. X is an N-bit number and can represent code, addresses, or data of different sizes. The bits of X are represented by X[N] down to X[1], where X[N] is the MSB. Next, we have a set of definitions that we will use throughout this book.

Definition 2-2 Sourceword: It is what flows on the original bus, on which we do the encoding and decoding. In the context of memory buses, we sometimes use the word address or data interchangeably with sourceword.

Definition 2-3 Codeword: It is the code generated by encoding sourceword X. We denote the encoded sourceword, or codeword, by F(X) unless specified otherwise.

Definition 2-4 Bus: It is used for referring to the word that is actually sent over the channel. The bus is denoted by BUS(X) and can be equal to or different from the codeword F(X). The reason for this naming convention is that sometimes we send the codeword directly over the channel and sometimes we don't (for example, we might use transition signaling for conveying encoded values to the receiver; refer to Definition 2-8). Thus, we use BUS(X) to easily differentiate between the above scenarios.

Definition 2-5 Sender and Receiver: These are the blocks that communicate over the channel. In low power bus encoding, we want the encoding to be transparent to other blocks, i.e., we do not want to modify the design of other blocks in any way. As long as the receiver is able to generate X from BUS(X), we can guarantee this.
Figure 2-1 shows the low power bus encoding block diagram. As can be seen, we differentiate between the codeword F(X) and the bus BUS(X). There is an S block on the sender side that is responsible for putting the codeword on the bus. There is also an R block on the receiver side, which should be capable of extracting the codeword F(X) from the bus. These blocks are optional, and the actual codeword might be put on the bus directly.

Figure 2-1 Basic block diagram for on-chip and off-chip encoding/decoding.

Definition 2-6 Reversible Function: Function F is called reversible if the value X can be uniquely determined from F(X). In other words, there exists a function F⁻¹ such that F⁻¹(F(X)) = X.

Definition 2-7 Trace: We refer to a collection of sourcewords that flow consecutively on the bus as a trace, and we denote it by T = <X_1, X_2, ..., X_LE>. LE represents the length of the trace and does not have to be a finite number.

Definition 2-8 Transition Signaling: Sometimes, instead of sending the codeword F(X_i) on the bus directly, we send the bit-wise exclusive-or of the codeword F(X_i) and the previous value of the bus, BUS(X_{i-1}), as the new value on the bus. We call this Transition Signaling. We also refer to it by saying "XORing the codeword on the bus". We do not consider the transition signaling as part of the encoder. The block diagram for an encoder with transition signaling is shown in Figure 2-2.

Figure 2-2 Block diagram of low power bus encoding with transition signaling.

Definition 2-9 Inter-trace: For a trace T, we define the inter-trace B as the trace obtained by XORing consecutive sourcewords in trace T, i.e., B = <U_2, U_3, ..., U_LE>, where U_i = X_i ⊕ X_{i-1} for i ranging from 2 to LE. We call each sourceword U_i of
trace B, an inter-sourceword of trace T. Furthermore, we use the notation B = X(T) (X stands for XOR) to show that B is the inter-trace of T.

Definition 2-10 Offset: It is the arithmetic difference between two consecutive sourcewords, i.e., X_i - X_{i-1}.

Definition 2-11 XOR-difference: It is another name for U_i = X_i ⊕ X_{i-1}.

Definition 2-12 Number of One's in X: The total number of one's (as compared to zero's) in the binary representation of the sourceword X is denoted by NumberofOnes(X) (or NO(X) for short).

Definition 2-13 Hamming Distance (a.k.a. XOR distance or Transition Count): It is the number of bit differences between two sourcewords, codewords, or buses. The Hamming distance of a sourceword is the number of one's in the inter-sourceword, i.e., NO(U_i) = NO(X_i ⊕ X_{i-1}). The Hamming distance of a sourceword directly implies the amount of energy that is dissipated when that sourceword is put on the bus if there is no coding in effect.

Definition 2-14 Total Activity of a Line: The total number of transitions that happen on line i (i varies from 1 to N) when a trace is sent on the bus is denoted by TotalLineActivity[i] (or TLA[i] for short). We have:

TLA[i] = Σ_{j=2}^{LE} NO((X_j ⊕ X_{j-1}) · 2^{i-1}), where (·) is the bit-wise AND function.

Definition 2-15 Total Activity of a Trace: The total number of transitions that happen as a result of sending the trace on the bus is denoted by TotalActivity(T) (or TA(T) for short).

TA(T) = Σ_{i=2}^{LE} NO(X_i ⊕ X_{i-1}) = Σ_{i=2}^{LE} NO(U_i)

Definition 2-16 Maximum Activity of a Trace: The maximum among the total activities of the different lines is referred to as MaximumActivity(T) (or MA(T) for short).

MA(T) = max_i {TLA[i]}

Usually, when we talk about encoding for low power, only the total activity of the trace is important. Therefore, we will not look at the maximum activity of traces until Chapter 6.
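The activity metrics of Definitions 2-12 through 2-16 can be sketched directly in code. The 4-bit example trace below is purely illustrative.

```python
# A minimal sketch of Definitions 2-12 to 2-16 for an N-bit bus.
# The example trace is illustrative, not taken from the dissertation.

def NO(x):
    """NumberofOnes(x): one's in the binary representation (Definition 2-12)."""
    return bin(x).count("1")

def total_activity(trace):
    """TA(T): sum of Hamming distances of consecutive sourcewords (Def. 2-15)."""
    return sum(NO(a ^ b) for a, b in zip(trace, trace[1:]))

def total_line_activity(trace, i):
    """TLA[i]: transitions on line i (line 1 = LSB), per Definition 2-14."""
    mask = 1 << (i - 1)
    return sum(NO((a ^ b) & mask) for a, b in zip(trace, trace[1:]))

def maximum_activity(trace, n_bits):
    """MA(T): largest per-line activity over lines 1..N (Definition 2-16)."""
    return max(total_line_activity(trace, i) for i in range(1, n_bits + 1))

T = [0b0000, 0b0001, 0b0011, 0b0000]   # 4-bit example trace
print(total_activity(T))               # 1 + 1 + 2 = 4
print(maximum_activity(T, 4))          # lines 1 and 2 each toggle twice -> 2
```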
Definition 2-17 Stride: In several cases, the offset of a trace might be equal to a fixed value for a considerable portion of the trace. We refer to this recurring offset as the stride of the trace.

As an example, instruction address traces have a stride. Most of the time, instructions are sequential (unless a control flow instruction [31] is executed), and therefore we say that the stride is equal to 1. For data addresses, we might have strides as well. For example, when long data arrays are accessed, data addresses will follow a stride equal to the size of each element in the array. This stride is not fixed and will change when a new array with a different element size is accessed.

Definition 2-18 Locality: It is meant to capture the closeness of sourcewords. In general, when we speak of locality in a trace, we mean that, as a result of the similarity between sourcewords in that trace, the amount of information included in the trace is actually less than the bit count of the trace. Locality, when exploited, can be applied to compacting sourcewords or to other applications such as reducing the total activity in the trace.

Definition 2-19 Temporal Locality: Temporal locality means that if sourceword X appears in the trace once, this sourceword is highly probable to appear again very soon. However, there is no exact information on how soon X is going to appear in the trace again.

Definition 2-20 Spatial Locality: Spatial locality means that if sourceword X appears in the trace once, its next sourceword will be quite close to X. Likewise, there is no information on exactly how close the next sourceword will be.

Definition 2-21 Spatio-temporal Locality: It is the most general sort of locality.
It basically means that if a sourceword is referenced in the trace once, its neighboring sourcewords will likely be accessed sometime in the near future. Exploiting spatio-temporal locality is more difficult than exploiting spatial or temporal locality and requires more resources.

Definition 2-22 Redundancy: It represents the use of redundant bits in the process of encoding. This means the encoded value requires more than N bits, and the width of the channel or bus of interest should be increased. The extra lines used in the process of encoding and decoding are usually referred to as redundant lines. Redundant lines are not desirable in general, for various reasons such as increasing the area requirement of the channel and modifying the standard interface of the channels (especially for external buses).

Definition 2-23 Codebook: It is a lookup table that is implemented in the decoder or the encoder block. Its width and other characteristics are arbitrarily specified. Also, it is not necessarily a fixed lookup table, and its contents might be changeable.

Definition 2-24 Limited Weight Codes: A binary number that has only a single one in its binary representation is called a 1-Limited Weight Code, or 1-LWC for short. There are a total of N 1-limited weight codes in the N-bit space. In general, an i-LWC is a number having i one's in its binary representation. A 1-LWC is sometimes also called a one-hot code.

2.3 OVERVIEW OF PREVIOUS WORK

Having established the fundamental definitions and notations, in this section we look at the previous work on low power bus encoding and compare various encoding techniques. In [59], Stan proposed the Bus-Invert method. The idea is that if the Hamming distance of the current sourceword is larger than N/2, then the sourceword can be inverted so as to push the Hamming distance below N/2.
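A minimal sketch of the Bus-Invert rule just described is shown below. One extra invert line tells the receiver whether the word was sent inverted; the 8-bit example values are illustrative assumptions.

```python
# Minimal sketch of Bus-Invert encoding as described above: if the new word
# differs from the previous bus value in more than N/2 bit positions, send
# its complement and assert the redundant invert line.

def bus_invert_step(prev_bus, x, n_bits):
    """Return (bus_value, invert_line) for one sourceword."""
    hamming = bin(prev_bus ^ x).count("1")
    if hamming > n_bits // 2:
        return (~x) & ((1 << n_bits) - 1), 1   # send the inverted word
    return x, 0                                # send the word unchanged

def bus_invert_decode(bus, invert_line, n_bits):
    """Receiver recovers the original sourceword from bus + invert line."""
    return (~bus) & ((1 << n_bits) - 1) if invert_line else bus

# 8-bit example: previous bus 0x00, new word 0xFE (7 transitions > 4),
# so the inverted word 0x01 goes out with only 1 transition.
bus, inv = bus_invert_step(0x00, 0xFE, 8)
print(hex(bus), inv)                           # 0x1 1
assert bus_invert_decode(bus, inv, 8) == 0xFE
```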
One redundant bit is needed to make the decoder capable of distinguishing between the original and the inverted sourceword at the receiver. The Bus-Invert method tends to perform well when sending random sourcewords, which is often the case on data buses. However, this method is largely ineffective on address buses, which tend to exhibit a high degree of sequential behavior. In [67], the authors investigated the effect of partitioning the bus and applying bus-invert coding to these partitions separately. Obviously, the downside of this partitioning is the necessity of having more redundant lines on the channel. The simplest scheme would be to reserve one redundant line for each group to which inversion might be applied. However, to make it more efficient, the set of redundant bits indicating the inversion of each group can be binary encoded and thus compacted. In [68], the authors use a Field Programmable Gate Array (FPGA) to dynamically change the bit grouping on the fly in order to maximize transition savings. They propose a heuristic to determine the bit grouping for each sourceword of the trace, considering the energy overhead of reconfiguring the encoder and decoder as well as its timing effects.

In [14], Benini et al. proposed the T0 code, which exploits the sequentiality of sourcewords to reduce the total activity on the address bus. The observation is that instruction addresses are sequential except when control flow instructions are encountered or exceptions occur. Therefore, address traces tend to stick to a stride equal to 1. T0 makes use of this behavior to reduce the activity of the trace. T0 adds a redundant bus line, called INC. If the addresses are sequential, the sender freezes the value on the bus and sets the INC line. Otherwise, INC is de-asserted and the original address is sent.
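The T0 behavior just described can be sketched as follows. The handling of the very first address (sent as-is with INC de-asserted) and the example address values are illustrative assumptions.

```python
# Minimal sketch of the T0 scheme described above: when the new address is
# the previous address plus the stride (1 here), the sender freezes the bus
# and asserts the redundant INC line; otherwise it sends the address itself.

def t0_encode(trace):
    """Yield (bus_value, inc) pairs for an address trace."""
    bus, prev = trace[0], trace[0]
    out = [(bus, 0)]                  # first address sent as-is (assumption)
    for x in trace[1:]:
        if x == prev + 1:
            out.append((bus, 1))      # bus frozen, INC asserted
        else:
            bus = x
            out.append((bus, 0))      # send the address, INC de-asserted
        prev = x
    return out

def t0_decode(pairs):
    """Receiver reconstructs the address trace from (bus, inc) pairs."""
    trace = [pairs[0][0]]
    for bus, inc in pairs[1:]:
        trace.append(trace[-1] + 1 if inc else bus)
    return trace

addrs = [100, 101, 102, 200, 201]
assert t0_decode(t0_encode(addrs)) == addrs
print(t0_encode(addrs))   # [(100, 0), (100, 1), (100, 1), (200, 0), (200, 1)]
```

Note that the bus changes value only twice for these five addresses, which is where the activity saving comes from.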
On average, a 60% reduction in address bus switching activity is achieved by T0 coding [30]. In this chapter, we propose a T0-like encoding technique for an address bus that does not require any redundant lines. We call this new encoding technique T0-Concise, or T0-C [3] for short. We will elaborate on this technique in a subsequent section.

In [12], a new coding technique called the Beach Solution was proposed. In this method, the address trace of a program is profiled, and then possible correlations between different bits of the profiled trace are extracted. This information is subsequently used to define encoding functions that reduce the total switching activity. However, this method is only applicable to systems where the application programs are fixed and known a priori, since the encoding technique needs exact knowledge of the address bus trace.

In [48], Musoll et al. proposed an address bus encoding method that works based on the fact that, at any time during execution, a program uses a limited number of working zones in the address space. Thus, instead of sending the address, its offset with regard to the previous reference in the same zone is sent along with the zone identifier. One extra bit is required to notify the receiver whether this coding is in effect or the address itself is being sent. In Chapter 3, we will elaborate on this technique further. In fact, the sector-based encoding technique, which is the topic of that chapter, uses the same rationale as the work proposed by Musoll and his colleagues.

In [32], Ikeda et al. proposed using codebooks in the sender and the receiver. By using a codebook, they intended to keep track of the previously accessed sourcewords. The codebook includes a set of sourcewords that were recently referenced. For every new sourceword, the code with the minimum Hamming distance to the sourceword is found in the codebook.
Subsequently, an identifier of the selected code, along with the XOR-difference (refer to Definition 2-11) between the sourceword and the code, is sent over the bus. The authors improved their method by using an adaptive codebook in [38], in which the codes in the codebook are replaced dynamically. As the program execution proceeds, only codes equal to previously referenced sourcewords, or sourcewords close to them, remain in the codebook.

Any low power bus encoding technique that is meant to reduce the total activity on the bus should somehow incorporate knowledge of the previous value on the bus into the encoding. To understand this, suppose that there is no transition signaling (refer to Definition 2-8) and the codeword is directly sent on the bus. In such a case, the encoder would not be able to predict the number of transitions that will happen on the bus as a result of the new codeword, and therefore there would be no guarantee on the performance of the code. Despite this fact, looking at the techniques introduced earlier in this section, we realize that some of them, such as T0, do not always use transition signaling. Let us consider this encoding technique again. When the current sourceword is equal to the previous sourceword plus the known stride, T0 freezes the bus and does not send a new value. Not sending a new value is equivalent to sending the previous value on the bus, which requires knowledge of the previous value and is the same as transition signaling of zero on the bus. On the other hand, when the current sourceword is not an increment of the previous sourceword, it is directly sent on the bus. In this case, no transition signaling is involved, and the number of transitions that will occur on the bus is not predictable.
This shows that T0 does not always use transition signaling. So how is the performance of this code still acceptable? Since the sourcewords are mostly sequential, knowledge of the previous value on the bus is used most of the time, and because of this the encoding is effective overall.

In Figure 2-1, the R block is the block that actually uses the knowledge of the previous bus value and determines the actual value that will flow on the bus. There are different choices for this block, such as XORing the previous bus value with the current codeword, or adding them. The actual selection of this function essentially sets the goals in the generation of the codeword. The best candidate for this function is the XOR function because, first, it simplifies the low power bus encoding problem as compared to other choices such as addition, and second, it can be applied to the different lines in parallel. When the XOR function is selected for the R block, the mission of the F function (refer to Figure 2-2) is to minimize the number of one's in the codeword, i.e., NO(F(X)) (refer to Definition 2-12).

There is a complete set of encoding techniques that make use of transition signaling and a reversible function. Having these two characteristics, these techniques eliminate the requirement for any redundant bits. As mentioned earlier, with transition signaling as the R block, the low power bus encoding problem is converted to finding codewords with the smallest average number of one's in them. The most efficient of these encoding techniques is the INC-XOR code, which was proposed by Ramprasad et al. in [54]. It can easily be seen that when the sourcewords are sequential, no switching activity occurs on the bus (similar to the case of the T0 code). Later on, the authors of [30] called this method T0-XOR.
The encoder works as follows:

// INC-XOR Encoder
F(X_i) = X_i ⊕ (X_{i-1} + 1)
BUS(X_i) = BUS(X_{i-1}) ⊕ F(X_i)
// end

One important point about INC-XOR is that sometimes, even if the difference between X_i and X_{i-1} is small, their Hamming distance may be quite large. This usually occurs for sourcewords X_i and X_{i-1} that are located on opposite sides of a power of two, e.g., 61 and 69 are located on the two sides of 64. In these cases, although the offset (X_i - X_{i-1}) is small and contains few one's, the XOR-difference (X_i ⊕ X_{i-1}) contains many one's and thus causes many transitions on the bus when it is XOR'ed with the value on the bus. We refer to this problem as the "consecutive sourceword XOR problem". In [30], the authors proposed another encoding technique, called the Offset-XOR code. The encoder works as follows:

// Offset-XOR Encoder
BUS(X_i) = BUS(X_{i-1}) ⊕ (X_i - X_{i-1})
// end

However, this encoding becomes much more effective if the coding algorithm is modified as follows (resulting in a code that we will call Offset-XOR with Stride, or Offset-XOR-S for short):

// Offset-XOR-S Encoder
BUS(X_i) = BUS(X_{i-1}) ⊕ (X_i - (X_{i-1} + 1))
// end

The reason for Offset-XOR-S's improvement over Offset-XOR is that it avoids switching activity when sequential addresses are encoded. Later on, we will study an effect similar to the consecutive sourceword XOR problem that degrades the performance of Offset-XOR-S. Generally speaking, encoding techniques that exercise transition signaling perform poorly when the codeword includes many one's. Consequently, in this chapter we present a new code, called Offset-XOR with Stride and Mapped offsets, or Offset-XOR-SM for short, which addresses this shortcoming by applying a mapping function to the offsets in the Offset-XOR-S code.
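The two encoders above, and the consecutive sourceword XOR problem, can be checked with a short sketch. The 16-bit bus width is an illustrative assumption; the 61/69 example is the one from the text.

```python
# Sketches of the INC-XOR and Offset-XOR-S encoders given above, both using
# transition signaling. A 16-bit bus width is assumed for illustration.

N = 16
MASK = (1 << N) - 1

def inc_xor_bus(prev_bus, x, x_prev):
    """BUS(X_i) = BUS(X_{i-1}) ^ (X_i ^ (X_{i-1} + 1))"""
    return prev_bus ^ (x ^ ((x_prev + 1) & MASK))

def offset_xor_s_bus(prev_bus, x, x_prev):
    """BUS(X_i) = BUS(X_{i-1}) ^ (X_i - (X_{i-1} + 1)), offset in two's complement."""
    return prev_bus ^ ((x - (x_prev + 1)) & MASK)

def transitions(prev_bus, new_bus):
    return bin(prev_bus ^ new_bus).count("1")

# Sequential addresses cost nothing under either code:
assert transitions(0, inc_xor_bus(0, 101, 100)) == 0
assert transitions(0, offset_xor_s_bus(0, 101, 100)) == 0

# The consecutive sourceword XOR problem from the text: 61 -> 69 straddles 64,
# so INC-XOR's codeword 69 ^ (61 + 1) has many one's although the offset is small.
print(bin(69 ^ 62).count("1"))            # 6 transitions for INC-XOR
print(bin((69 - 62) & MASK).count("1"))   # 3 for Offset-XOR-S (offset = 7)
```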
In [53], the authors proposed a coding framework that can be used for modeling different bus encoding techniques. Their proposed framework is presented in Figure 2-3.

Figure 2-3 General framework for low power bus encoding.

As seen in this figure, they also propose to always have transition signaling as a building block of a low power bus encoder in their framework. The framework comprises the functions F, f1, and f2. Function F is a predictor that tries to predict the current sourceword based on the previous sourceword. Function f1 is in charge of exploiting locality and decorrelating consecutive sourcewords from each other. After that, function f2 tries to reduce the number of one's (in the binary representation) of its input as much as possible, which translates to a lower number of transitions when the codeword is XORed on the bus.

In the next section, we will look at the different algorithms proposed by us. First, we will present T0-C. This method does not apply general transition signaling (just like T0) but smartly suppresses the redundant bit of T0. In general, if T0 is converted to INC-XOR (T0-XOR), the requirement of having a redundant bit is eliminated. However, although transition signaling has many benefits, it has the delay disadvantage of requiring two XOR operations, one at the sender and one at the receiver. T0-C is an innovative extension of T0 that is irredundant, has better performance than T0, and does not need general transition signaling. After T0-C, we present two other techniques known as Offset-XOR-SM and Offset-XOR-SMC.
These two techniques can be derived using the general framework of [53]; however, the application of innovative blocks (such as LSB-Inv and the codebook) makes them superior to the previously proposed algorithms. In a subsequent section of this chapter, we will look at another set of encoding techniques known as the ALBORZ encoding techniques. All of the ALBORZ techniques are codebook-based, and they are more flexible than the first set. We will look at each of these techniques in detail and state their pros and cons with respect to each other.

2.4 A SET OF IRREDUNDANT ENCODING TECHNIQUES

2.4.1 T0-CONCISE

The first proposed encoding technique is an extension of the T0 code. It improves the T0 code in a number of important ways. First of all, it eliminates the redundant bit. As mentioned earlier, the low power bus encoding techniques that we propose have applications for both on-chip and off-chip buses. In particular, for off-chip buses, adding one extra line to the bus is not tolerable, needless to say that it would also cause the pin configuration of the chip to change. Besides that, the elimination of the redundant line or lines results in higher power savings on the bus. In T0-C, similar to the T0 code, the basic saving happens as a result of freezing the bus when sourcewords are sequential.

To see how T0-C works, suppose that we recklessly suppress the redundant bit in the T0 code. In other words, when X_i and X_{i-1} are sequential addresses, we simply freeze the bus, and in all other cases, we send the original sourceword X_i on the bus. As an example, suppose that the first sourceword is equal to 39. We send this sourceword on the bus. Now if the second sourceword is equal to 40, the sender just freezes the bus.
The decoder detects that the bus is frozen and no new data has been received; therefore, it presumes that the current sourceword is an increment of the previous sourceword, i.e., it is equal to 39+1=40, and everything is fine. Another scenario would be that, instead of 40, the current address is equal to 42. In this case, the sender just sends 42 on the bus since it is not sequential. On the other side, the decoder realizes that the bus has changed and that 42 was received. So it knows that 42 is the actual sourceword; again, no problem. However, the above simple scheme would fail. For example, suppose that we encounter a backward branch whose target address is the same as the current (frozen) bus value. This is illustrated in the following table.

Table 2-1 Simple example proving T0 would fail without an extra bit.
X    BUS
39   39
40   39
41   39
39   39 ???

As can be seen, when we reach the last row of the table, the original T0 without the redundant bit leads to ambiguity when the decoder is interpreting the values. If we use 39 as the codeword, the receiver (decoder) cannot determine whether the sourceword was 39 (backward jump) or 42 (next sequential address). So the problem occurs when the data on the bus is equal to the target address of the control flow instruction. This is why redundancy was originally introduced into the T0 code. However, T0-C applies a more efficient solution to this problem. To correctly handle backward branches with target addresses equal to the current bus value, an unusual pattern has to be sent to the receiver. By unusual, we mean a pattern that can alert the receiver to the special case of the target address being equal to the value of the bus.
However, this cannot be a fixed pattern because we assume that jumps to any and all addresses are permissible (picking a fixed pattern to designate this case could create potentially large activity on the bus and, at the same time, would require that this particular pattern not be used as a regular jump target). The solution we adopt in T0-C for the case when BUS(X_{i-1}) = X_i is to set the codeword to the previous address plus one, i.e., F(X_i) = X_{i-1} + 1. The reason is that this is the only pattern that the receiver does not expect from the sender. Notice that when the receiver sees the value X_{i-1}+1, it recognizes that the trend of sequential addresses has been interrupted because the bus value has changed. On the other hand, when it examines the new jump address received on the bus, it finds that this jump address is the same as the previous address plus the stride. The receiver knows that if a special case had not been encountered, there would have been no need for the sender to send a new value on the bus. This special case is, of course, when the target of a jump is the same as the current value on the bus. The decoder is aware of this, and the ambiguity is resolved! The T0-C encoder works as follows:

// T0-C Encoder
if (X_i == X_{i-1} + 1)
    BUS(X_i) = BUS(X_{i-1})
else if (X_i != BUS(X_{i-1}))
    BUS(X_i) = X_i
else
    BUS(X_i) = X_{i-1} + 1
// end

On the receiver side, when the value X_{i-1}+1 is received, the previous value on the bus is regarded as the branch target. For the previous example, we will have:
This is a jump instruction whose branch target is the branching instruction itself; that is, the instruction is waiting for an external event. The first time this instruction executes, BUS(Xi-1) is not equal to Xi. Therefore, because we have a simple jump in this case, we simply send Xi. The next time this instruction executes, the encoder recognizes it as the special case and will thus send Xi-1 + 1 on the bus. Therefore, at each point in time, T0-C is a one-to-one mapping; however, this mapping is not fixed and depends on the previous value on the bus. This case is illustrated in the following table.

Table 2-3 Example showing T0-C for a self-jumping instruction.

X     BUS
39    39
39    40
39    39
39    40

Results show that T0-C outperforms T0. This is mainly due to the suppression of the redundant bit in T0. The T0-C encoding decreases switching activity on an address bus about 9% more than the T0 code. The encoder block diagram for T0-C is shown in Figure 2-4. The left comparator block selects the previous codeword if the current sourceword is sequential. If the current sourceword is not sequential, the output of the left MUX is put on the bus. This output is equal to the actual sourceword unless the current sourceword is equal to the previous bus value; in that case, the incremented current sourceword is put on the bus. We have not shown the block diagram of the decoder, but it is very similar to the encoder and has the same amount of complexity.

Figure 2-4 T0-C encoder.

2.4.2 OFFSET-XOR-SM

The objective is to improve the Offset-XOR-S code by further reducing the number of one's in the offset right before the transition signaling block. Suppose that we are encoding instruction addresses.
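For concreteness, the T0-C encode/decode loop described above can be sketched in Python. This is an illustrative software model, not the hardware implementation; the function names and the handling of the first address (sent unchanged) are our own assumptions.

```python
def t0c_encode(addrs, width=32):
    """T0-C: freeze the bus on sequential addresses; on a jump, send the
    target itself, unless it equals the frozen bus value, in which case
    send previous address + 1 (a value the receiver never otherwise
    expects from a frozen bus)."""
    mask = (1 << width) - 1
    bus, prev, out = None, None, []
    for x in addrs:
        if prev is None:
            word = x                      # first word goes out unchanged
        elif x == (prev + 1) & mask:
            word = bus                    # sequential: keep the bus frozen
        elif x != bus:
            word = x                      # ordinary jump: send the target
        else:
            word = (prev + 1) & mask      # special case: target == bus value
        out.append(word)
        bus, prev = word, x
    return out

def t0c_decode(words, width=32):
    """Inverse of t0c_encode, mirroring the three encoder cases."""
    mask = (1 << width) - 1
    bus, prev, out = None, None, []
    for w in words:
        if prev is None:
            x = w
        elif w == bus:
            x = (prev + 1) & mask         # frozen bus: sequential address
        elif w == (prev + 1) & mask:
            x = bus                       # unexpected pattern: target == bus
        else:
            x = w                         # ordinary jump
        out.append(x)
        bus, prev = w, x
    return out
```

Running the model on the trace of Table 2-2 (39, 40, 41, 39) yields the bus sequence 39, 39, 39, 42, and on the self-jump trace of Table 2-3 it yields 39, 40, 39, 40, matching both tables.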
When we encounter a backward jump in an instruction trace, the resulting offset is negative. This negative number tends to have a small magnitude and therefore, in two's complement form, it contains many one's. In a typical application program, many small backward branches exist, and the offsets of all these branches are small negative numbers. If these offsets are transition-signaled over the bus, a large number of switching activities will occur on the bus because of these small negative numbers. For this reason, the performance (in terms of the average activity on the address bus) of the Offset-XOR and Offset-XOR-S techniques is relatively poor. We will refer to this problem as the "small negative offset problem." We also discussed the consecutive sourceword XOR problem that degrades the performance of INC-XOR (refer to section 2.3). In practice, although INC-XOR and Offset-XOR are very much alike and both suffer from a similar problem that degrades their performance, the INC-XOR code outperforms the Offset-XOR code noticeably [30]. This is because the small negative offset problem, which is the Achilles' heel of the Offset-XOR code, shows up much more frequently than the consecutive sourceword XOR problem. Indeed, as reported in [30], the switching activity reduction for INC-XOR is 74% versus 41% for Offset-XOR. In the following paragraphs, we present a new coding technique that solves the small negative offset problem. We call this method Offset-XOR with Stride and Mapping of offsets, or Offset-XOR-SM for short. The Offset-XOR-SM encoder works as follows:

// Offset-XOR-SM Encoder
BUS(Xi) = BUS(Xi-1) ⊕ LSB-Inv(Xi - (Xi-1 + 1))
// end

where the LSB-Inv(X) function is defined as follows.
Definition 2-25 Function LSB-Inv(X): The LSB-Inv(X) function inverts all bits of X except the most significant one if X is a negative number; otherwise it returns X unchanged. To determine whether X is negative, we examine its MSB. Here, we assume computation is done in the N-bit space and interpret X as a two's complement signed number:

LSB-Inv(X) = X                       if X ≥ 0
LSB-Inv(X) = X ⊕ (2^(N-1) - 1)       otherwise

Please note that, in general, X is an N-bit number and we do not make a general presumption of interpreting it as signed or unsigned. When we carry out addition or subtraction, we do not care about overflow. However, sometimes (as here in the definition of LSB-Inv) we interpret X as a signed number, just for the definition to make more sense. In two's complement format, small negative numbers are represented with many one's. By using LSB-Inv, the number of one's in the representation of small negative numbers is reduced. In fact, only negative numbers are affected by this function. When a negative X is mapped under LSB-Inv, the MSB remains unchanged and represents the sign of the number. The remaining bits represent the distance from zero to the negative number. Therefore, if the negative number is small, the distance will also be small and can thereby be easily compacted by suppressing its MSB bits. Table 2-4 presents an example of applying LSB-Inv in the 8-bit space.

Table 2-4 Example of the LSB-Inv function in the 8-bit space.

X       X in binary    LSB-Inv(X)    LSB-Inv(X) in binary
1       0000 0001      1             0000 0001
-1      1111 1111      -128          1000 0000
-128    1000 0000      -1            1111 1111
-3      1111 1101      -126          1000 0010

In the Offset-XOR-SM code, when Xi - Xi-1 - 1 is positive, it is not affected by the LSB-Inv function and is directly transition-signaled over the bus.
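The definition above, and the Offset-XOR-SM encoder it plugs into, can be sketched in Python as follows. The function names and the assumed initial state (an all-zero bus and a "previous address" of -1 before the trace starts) are ours; the sketch only models the bit-level behavior.

```python
def lsb_inv(x, width=8):
    """LSB-Inv: interpret x as an N-bit two's-complement value; if it is
    negative (MSB set), invert every bit except the MSB, i.e. XOR with
    0111...1 = 2^(N-1) - 1. The function is its own inverse."""
    x &= (1 << width) - 1
    if x >> (width - 1):                    # MSB set -> negative
        x ^= (1 << (width - 1)) - 1
    return x

def oxsm_encode(addrs, width=32):
    """Offset-XOR-SM: transition-signal LSB-Inv(Xi - (Xi-1 + 1))."""
    mask = (1 << width) - 1
    out, bus, prev = [], 0, -1
    for x in addrs:
        off = (x - (prev + 1)) & mask       # stride-adjusted offset
        bus ^= lsb_inv(off, width)          # transition signaling
        out.append(bus)
        prev = x
    return out

def oxsm_decode(words, width=32):
    mask = (1 << width) - 1
    out, bus, prev = [], 0, -1
    for w in words:
        off = lsb_inv((w ^ bus) & mask, width)  # LSB-Inv undoes itself
        x = (prev + 1 + off) & mask
        out.append(x)
        bus, prev = w, x
    return out
```

Because LSB-Inv leaves the MSB untouched, applying it twice returns the original value, which is what makes the decoder this simple. On a purely sequential trace the stride-adjusted offset is zero, so the bus does not switch at all after the first word.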
Obviously, sequential addresses do not cause any activity on the bus (Xi - Xi-1 - 1 = 0). However, if Xi - Xi-1 - 1 is negative, it is optimized by LSB-Inv to cause fewer transitions on the bus. Unlike Offset-XOR-S, in Offset-XOR-SM small negative numbers cause only a few transitions on the bus. The extra hardware that this method imposes on Offset-XOR is negligible. With this mapping we can achieve more than 40% improvement over the Offset-XOR code and about 3% improvement over the INC-XOR code as reported in [30].(1) The encoder block diagram for Offset-XOR-SM is shown in Figure 2-5.

Figure 2-5 Offset-XOR-SM encoder.

(1) The well-known sign-magnitude representation is not used for mapping the offsets. The reason is that converting numbers from two's complement to the sign-magnitude representation requires more complex hardware compared to our proposed scheme. Furthermore, the greatest negative number in two's complement form does not have any representation in the sign-magnitude form.

2.4.3 OFFSET-XOR-SMC

The output of the LSB-Inv function can be optimized even further using a more complex encoder and decoder. In the following, we describe a new method called Offset-XOR with Stride, Mapping of offsets and Codebook, or Offset-XOR-SMC for short, that uses a codebook to reduce the number of one's in the codewords. The size of the codebook can vary and should be selected to fit the specific application. The extra cost of this improvement in performance is the complexity of a codebook that can optimize the offset. The more capacitive the buses are, the more complex the encoding blocks that become practical for saving power on the bus. Therefore, it is the relative magnitude of the bus capacitance that determines a tolerable bound on the complexity of the encoder and the decoder.
Figure 2-6 Percentage of branch instructions based on the required bits to represent their displacement for SPEC95 benchmarks.

The codebook (refer to Definition 2-23) is a K-bit to K-bit mapping function implemented on both the sender and the receiver sides. The K least significant bits of the output of the LSB-Inv function are mapped by this codebook before transition signaling on the bus. In practice, a small K is sufficient, because the absolute displacements of control flow instructions are typically small numbers. As can be seen in Figure 2-6, more than 95% of branch displacements in the SPEC95 [75] benchmark programs can be represented with less than 10 bits [31]. Therefore, in our implementation we chose K to be equal to 10. In general, K should be selected based on the magnitude of the most frequent jumps in a program and on constraints on the size of the codebook. In order for this mapping to decrease the switching activity, numbers are mapped in such a manner that small numbers map to numbers with few one's in them. If X1 and X2 are two K-bit numbers and CB(X1) and CB(X2) are the corresponding K-bit outputs from the codebook (i.e., the codewords of X1 and X2), then CB must be defined in such a way that:

If X1 < X2  ⇒  NO(CB(X1)) ≤ NO(CB(X2))

where NO(Y) denotes the number of one's of Y (refer to Definition 2-12). The Offset-XOR-SMC encoder works as follows:

// Offset-XOR-SMC Encoder
BUS(Xi) = BUS(Xi-1) ⊕ CB(LSB-Inv(Xi - (Xi-1 + 1)))
// end

As mentioned before, in our experiments we selected K to be equal to 10 bits. The arrangement of codewords in the codebook is as follows.
The first codeword of the codebook is 0, and the next 10 codewords are 10-bit 1-LWCs (binary numbers that have only a single one; refer to Definition 2-24). The next 45 entries are 10-bit 2-LWCs, and so on. An important point in the actual implementation of the codebook is that it can be organized in such a fashion that if two numbers are bit-complements of each other, their codewords are also bit-complements of one another. Table 2-5 presents an example of such an organization for a 3-bit codebook. This observation is used to reduce the number of entries in our codebook by a factor of two and thus significantly decreases the codebook hardware overhead. The Offset-XOR-SMC code yields an extra 3% saving compared to the Offset-XOR-SM code.

Table 2-5 Example of a 3-bit codebook.

X       000  001  010  011  100  101  110  111
CB(X)   000  001  010  100  011  101  110  111

Figure 2-7 shows the block diagram of the Offset-XOR-SMC encoder.

Figure 2-7 Offset-XOR-SMC encoder.

2.4.4 PERFORMANCE ANALYSIS

To evaluate the proposed encoding techniques, we generated detailed instruction address bus traces for a number of SPEC2000 [75] benchmark programs using a simulator called SimpleScalar [72]. The reported results are based on averaging over six programs from the benchmark set, namely vpr (FPGA circuit placement and routing software), parser (word processing software), gcc (C language compiler), vortex (object-oriented database), equake (seismic wave propagation simulation), and art (image recognition/neural networks). The first four of these programs have been selected from the integer benchmark set, whereas the last two are from the floating-point benchmark set. SimpleScalar is an academic architecture widely used in computer architecture research projects.
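One simple way to generate a codebook with the weight-ordering property described above is to enumerate all K-bit patterns in order of increasing number of ones. The sketch below (our own illustration) happens to reproduce Table 2-5 exactly for K = 3; note, however, that the complement-symmetric organization used to halve the hardware would require a particular ordering within each weight class, which this simple enumeration does not guarantee for larger K.

```python
from itertools import combinations

def build_codebook(k):
    """List all k-bit patterns in order of increasing weight (number of
    ones). CB(x) is then the x-th codeword of this list, so smaller
    inputs receive codewords with no more ones than larger inputs."""
    codes = []
    for weight in range(k + 1):
        # choose which bit positions are set for this weight class
        for bits in combinations(range(k), weight):
            codes.append(sum(1 << b for b in bits))
    return codes
```

For K = 10 this produces entry 0 = 0, entries 1-10 as the 1-LWCs, entries 11-55 as the 2-LWCs, and so on, matching the arrangement described in the text.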
We will repeatedly use SimpleScalar for evaluating different encoding techniques throughout this work. Control flow instructions in this architecture are branches and different kinds of jumps, such as Jump, Jump and Link, etc. All other instructions are sequential instructions. A branch that is not taken also acts as a sequential instruction. For each of the benchmark programs, more than 15 million instruction addresses were generated by simulation. We first analyzed these traces to determine the contribution of each category of instructions to the total number of transitions that happen on the bus. The results are presented in Figure 2-8. It can be seen that most of the activity is due to sequential instructions; although the number of transitions caused by a single sequential instruction is usually much smaller than that of a control flow instruction, the total contribution of sequential instructions is dominant because of their sheer count in the execution flow.

Figure 2-8 Contribution of different kinds of instructions to the total number of bus transitions.

Next, different encoding techniques were applied to these traces to measure their effect in reducing switching activity. The simulation results are shown in Table 2-6. The Original entry in this table refers to the total transition count without any encoding. The other columns show the transition count when different encoding techniques are applied to the addresses, together with the percentage of transitions remaining. Finally, the last row of the table shows the percentage transition saving for each of the encoding techniques.
Table 2-6 Switching activity of SPEC2000 traces in millions & average percentage saving for different encoding techniques.

           Original   T0      T0-C    Offset-XOR-S   INC-XOR   Offset-XOR-SM   Offset-XOR-SMC
vpr        22.94      9.17    6.45    17.24          5.56      4.75            4.05
           100%       40.0%   28.1%   75.2%          24.2%     20.7%           17.7%
parser     21.27      6.02    4.17    13.61          3.79      3.13            2.39
           100%       28.3%   19.6%   64.0%          17.8%     14.7%           11.2%
equake     22.65      7.94    5.59    10.88          5.21      4.24            3.60
           100%       35.1%   24.7%   48.0%          23.0%     18.7%           15.9%
vortex     22.17      5.61    4.34    9.90           4.00      3.47            3.03
           100%       25.3%   19.6%   44.6%          18.1%     15.7%           13.7%
gcc        22.46      8.14    5.74    11.65          5.20      4.57            3.63
           100%       36.2%   25.5%   51.9%          23.1%     20.3%           16.2%
art        20.06      4.65    3.33    9.80           2.68      2.02            1.40
           100%       23.2%   16.6%   48.9%          13.4%     10.1%           7.0%
Average
transition
saving     0%         68.7%   77.6%   44.6%          80.1%     83.3%           86.4%

To estimate the actual overhead of the above encoder blocks, we first generated the netlist of each encoder/decoder circuit in Berkeley Logic Interchange Format (BLIF) [76]. The netlists were optimized using the SIS script.rugged [77] and mapped to a 1.5-volt, 0.18µ CMOS library using the SIS technology mapper. The I/O voltage was assumed to be 3.3 V. Instruction addresses of the benchmark programs were then fed into a gate-level logic simulation program named sim-power [51]. Based on the input vectors and the mapped design, the encoder power was estimated by sim-power. The results for a 100 MHz system clock are reported in Table 2-7. The actual power saved in any system depends on the actual capacitance of the bus lines and the encoding technique applied in the system. Therefore, for any fixed value of line capacitance, one of the techniques would be the most effective. In Figure 2-9, the percentage of total bus power saved versus I/O capacitance per line is compared for the different encoding techniques. As the line capacitance grows, Offset-XOR-SMC clearly outperforms the other techniques.
Table 2-7 Encoder hardware synthesis and power estimation.

                                             INC-XOR   T0-C   Offset-XOR-SM   Offset-XOR-SMC
Number of literals                           440       767    661             2693
Area of encoder (×1000 λ²)                   334       410    399             1043
Number of gates                              306       386    379             1136
Power dissipated by encoder & decoder (µW)   266       642    740             1822

Figure 2-9 Comparison of total power saving for different encoding techniques.

2.5 ALBORZ ENCODING TECHNIQUES

In this section, we look at another set of encoders, which we refer to as the ALBORZ encoding techniques. ALBORZ is an acronym for Address Level Bus Power Optimization. The ALBORZ encoding techniques have similarities with Offset-XOR-SMC in that they employ codebooks for mapping offsets to low-cost codewords, i.e., limited weight codes (refer to Definition 2-24). However, there are major differences in the structure of the codebook which distinguish ALBORZ from Offset-XOR-SMC. Specifically, in the adaptive version of ALBORZ, a small codebook can deliver the same performance as the large codebook used in Offset-XOR-SMC, since the codebook entries can be dynamically updated. This makes ALBORZ suitable for a broader variety of applications and traces.

2.5.1 APPROACH

The ALBORZ code exploits the optimality of limited weight codes when they are used with transition signaling, as mentioned earlier. The ALBORZ encoder first calculates the offset of the new sourceword. After that, the encoder uses the codebook to map the offset. Therefore, each offset is first looked up in this codebook.
If the offset is present in the codebook (i.e., there is a hit), then the codeword is generated based on the value looked up from the codebook; otherwise, there is a miss and some other action takes place depending on the kind of ALBORZ we are using. ALBORZ can be used either as a redundant code, using one extra bit per bus, or as an irredundant code. The redundant version makes use of a simpler codebook and does not always apply transition signaling. In this version of ALBORZ, when there is a hit in the codebook, a code is extracted from the codebook and is transition-signaled on the bus. On the other hand, when there is a miss, the original sourceword is put on the bus. The extra bit in this scheme is used to distinguish between hit and miss. However, the burden of having one redundant bit can be taken off ALBORZ. This is done in the irredundant version of ALBORZ, which employs a relatively more complex codebook. It always has transition signaling as the final block and is a reversible N-bit mapping without any need for redundant bits. Next, we will look in detail at how each of these techniques works.

2.5.2 REDUNDANT ALBORZ

In redundant ALBORZ, for each sourceword, its offset is calculated first. The offset is then looked up in a codebook that tries to map it to a limited weight code (refer to Definition 2-24). If the offset is found in the codebook, the LWC associated with that offset is extracted from the codebook. In other words, the offset is mapped to a LWC, which will be transition-signaled over the bus. At the same time, the redundant bit, which we refer to as the CODEON bit, is set to one. On the contrary, if the offset is not found in the codebook, the sourceword itself is put on the bus.
On the decoder side, if CODEON is zero, the value on the bus is used as the sourceword. Otherwise, XORing the current bus with the previous bus extracts the codeword. This codeword will be equal to a LWC; it is looked up in the decoder's reverse codebook, and the corresponding offset is retrieved. This offset is used to calculate the new sourceword. Figure 2-10 and Figure 2-11 show the block diagrams of the redundant ALBORZ encoder and decoder, respectively.

Figure 2-10 Redundant ALBORZ encoder.

Figure 2-11 Redundant ALBORZ decoder.

Every entry of the encoder codebook consists of two fields: off and lwc. The following terminology and notation are used in the remainder of this section:

LWC(off): limited weight code of the entry with offset = off
OFFSET(lwc): offset of the entry with LWC = lwc
OFFSETS: set of offsets in the codebook
CODES: set of limited weight codes in the codebook

The redundant ALBORZ encoder is formally described as follows:

// Redundant ALBORZ Encoder
off = Xi - Xi-1
if ( off ∈ OFFSETS )
    CODEON = 1
    BUS(Xi) = CODEON || ( {BUS(Xi-1)}N ⊕ LWC(off) )
else
    CODEON = 0
    BUS(Xi) = CODEON || Xi
// end

where || is the concatenation operator. The ALBORZ decoder is described as:

// Redundant ALBORZ Decoder
if ( CODEON )
    lwc = {BUS(Xi) ⊕ BUS(Xi-1)}N
    Xi = OFFSET(lwc) + Xi-1
else
    Xi = {BUS(Xi)}N
// end

where {X}N denotes casting X to N bits. The mapping between offsets and LWCs can either be fixed or vary at runtime. If the mapping is fixed, the codebook is called a fixed codebook. Otherwise, the codebook entries can be updated during the execution of programs; in this case, the codebook is called an adaptive codebook.
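The redundant scheme above can be modeled in Python as follows. The codebook here is a toy dictionary mapping offsets to LWCs, and the CODEON bit is modeled as the first element of a (codeon, bus_word) pair; the function names, the toy codebook contents, and the handling of the first address (always a miss) are our own assumptions.

```python
def alborz_encode(addrs, book, width=32):
    """Redundant ALBORZ: on a codebook hit, transition-signal the looked-up
    LWC and raise CODEON; on a miss, put the raw sourceword on the bus with
    CODEON low. Returns a list of (codeon, bus_word) pairs."""
    mask = (1 << width) - 1
    out, bus, prev = [], 0, None
    for x in addrs:
        off = None if prev is None else (x - prev) & mask
        if off in book:
            bus ^= book[off]          # hit: XOR the LWC onto the bus
            out.append((1, bus))
        else:
            bus = x                   # miss: send the sourceword itself
            out.append((0, bus))
        prev = x
    return out

def alborz_decode(words, book, width=32):
    """Inverse mapping, using the reverse codebook LWC -> offset."""
    rev = {lwc: off for off, lwc in book.items()}
    mask = (1 << width) - 1
    out, bus, prev = [], 0, None
    for codeon, w in words:
        if codeon:
            x = (prev + rev[(w ^ bus) & mask]) & mask
        else:
            x = w
        out.append(x)
        bus, prev = w, x
    return out
```

With a codebook whose first entry maps offset +1 to the all-zero LWC, sequential addresses leave the bus frozen (only the CODEON line is high), mirroring the fixed-codebook arrangement described next. Offsets are kept as N-bit masked values, so a negative offset such as -1 is stored as its two's complement pattern.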
Next, we explain these two alternatives in more detail.

2.5.2.1 Fixed Codebook

In a fixed codebook, the mapping between offsets and limited weight codes does not change, and the codebook is implemented using a ROM or combinational logic. Consider a 32-bit address bus. To avoid any transitions when two consecutive sourcewords are sequential (i.e., assuming that the difference between two sequential addresses is one), the first entry of the codebook has to be as follows:

off = +1, lwc = 00000000h

For a 32-bit bus, there exist 32 1-LWCs (refer to Definition 2-24). Thus, the LWCs of the next 32 entries of the codebook have exactly one 1. The offsets of these entries have to be the most frequently encountered offsets in any program. Therefore, these entries have to be used for small negative and positive offsets.

Table 2-8 Typical codebook for fixed redundant ALBORZ.

off     lwc
+2      00000001h
+3      00000002h
...     ...
+17     00008000h
-1      00010000h
...     ...
-15     40000000h
-16     80000000h

Table 2-8 shows the mapping for offsets +2 to +17 and -1 to -16 for a 32-bit wide bus. The numbers in the lwc column are in hexadecimal format. The number of 2-LWCs is 496, and the same method as that described above is used to map offsets +18 to +265 and -17 to -264 in our fixed codebook. The maximum size of the codebook that we can use in a system depends on the number of transistors required by the codebook, the ratio of the switched capacitance inside the codebook to the switched capacitance on the bus, and the ratio of the internal (codebook logic) to external (bus driver) power supply voltages. With the trend toward scaling down transistor feature sizes and power supply voltage levels, the maximum size of the codebook can be increased.
Here, based on the experiments that we carried out, we decided not to go any further than including 2-LWCs, due to area and power consumption considerations. A higher reduction in the switched capacitance of the bus might be achieved if the application of more complex encoders and decoders is justified in the system. Another observation is that it is possible to design a fixed codebook in such a way that the offsets are used to index the codebook entries. Consider the previous table, which included positive offsets from +1 to +17. If +16 and +17 are eliminated from the table, the four LSB bits of the positive offsets can be used to index the codebook while the MSB bits are zero. This will significantly reduce the codebook's hardware complexity. We will consider this optimization when we estimate the actual overhead of the ALBORZ encoder.

2.5.2.2 Adaptive Codebook

In the adaptive codebook, the offset column of the codebook is implemented using a read/write memory. When an offset lookup takes place, if the offset is present in the codebook, the corresponding limited weight code is extracted and transition-signaled. If a miss occurs, the actual sourceword is sent over the bus and, at the same time, its offset replaces one of the offsets in the codebook. This guarantees that the next time the same offset is looked up, it will be present in the codebook (assuming that it has not been replaced by subsequently generated offsets). By adaptive replacement of offsets, those that are most commonly encountered in a program are gradually loaded into the codebook; this will increase the hit rate and will result in a higher reduction in the bus switching activity. Notice that the same policy has to be used to update the codebooks of the encoder and the decoder to ensure coherence between sender and receiver.
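A direct-mapped adaptive codebook of the kind just described can be sketched as follows. This is purely illustrative: the class name, the choice of the offset's low index bits as the set index, and the convention that entry i permanently owns the fixed LWC (1 << i) while only its offset field is rewritten are all our own assumptions.

```python
class AdaptiveCodebook:
    """Direct-mapped adaptive codebook (a sketch): entry i always owns the
    fixed LWC (1 << i); only the offset field changes at runtime. Running
    the identical replace() rule on every miss on both sides keeps the
    encoder and decoder codebooks coherent with no extra traffic."""

    def __init__(self, n_entries=16):
        self.offsets = [None] * n_entries   # read/write offset column
        self.n = n_entries

    def index(self, off):
        return off % self.n                 # set chosen by the offset's LSBs

    def lookup(self, off):
        """Return the entry's LWC on a hit, or None on a miss."""
        i = self.index(off)
        return (1 << i) if self.offsets[i] == off else None

    def replace(self, off):
        """Direct-mapped eviction: the new offset overwrites whatever
        lives in its set. Called on every miss, on both sides."""
        self.offsets[self.index(off)] = off
```

Because each offset can live in exactly one entry, a hit test requires a single comparison rather than the fully associative search a general replacement policy would need, which is the simplification argued for in the text.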
Every time the CODEON bit is reset, the decoder realizes that a replacement has occurred in the encoder's codebook. Thus, the decoder updates its codebook entry correspondingly. The decoder and the encoder follow the same eviction policy; therefore, they select the same entry every time a new offset is placed in the codebook. This guarantees that the same set of offsets exists in both codebooks. By using an adaptive codebook instead of a fixed codebook, the number of entries in the codebook can be significantly reduced while maintaining the same level of switching activity savings on the bus. However, to determine if an offset is present in the codebook, it has to be compared to all offsets in the codebook. This kind of fully associative [31] comparison is usually costly from the hardware and power consumption viewpoints. To simplify the codebook hardware, the replacement (or eviction) policy can be changed to use direct or set-associative mappings [31]. With these policies, each offset can be placed only in certain entries of the codebook; therefore, to identify a hit, only a small number of the entries have to be searched. We will use the direct-mapped policy in our implementation.

2.5.3 IRREDUNDANT ALBORZ CODE

The fixed and adaptive techniques that we described earlier are both redundant codes. In other words, they need an extra bit to signal whether a miss or a hit has happened in the encoder. We have already discussed the various disadvantages of having redundant bits. This motivates an irredundant version of the ALBORZ code. It is possible to alter the ALBORZ code to suppress the redundant bit. Recall that the CODEON bit is required because the encoder has to inform the decoder whether the actual sourceword was sent or a looked-up LWC was transition-signaled. We modify our coding scheme so that the actual sourceword is never sent over the bus; instead, either the offset or a looked-up value from the codebook is XOR'ed with the previous value of the bus. We do it as follows: when there is a hit in the codebook, we XOR the output of the codebook with the previous bus value; when there is no hit, we XOR the offset itself. However, the codebook structure needs to be modified from what we had in redundant ALBORZ since, under such a scheme, ambiguity can occur when the offset itself is a LWC. In such a case, the receiver cannot distinguish between a miss and a hit. Changing our replacement policy such that LWCs are prevented from entering the codebook as offsets can partially solve this problem. After this modification of the replacement policy, if the decoder receives a LWC, it knows that there must have been a hit in the table. This guarantees that offsets that are equal to one of the LWCs present in the codebook cannot hit in the offset column. However, we still need a way to transmit these LWC offsets without any ambiguity. Our solution is to reverse-map the LWC to the corresponding offset based on the codebook. This will be more understandable with the help of an example.

Example. Consider a single-entry codebook with the following values for a 32-bit space:

off = +18, lwc = 00000040h

If the offset is +18, then 00000040h, or +64, is XOR'ed with the previous value of the bus. If the offset is +64, then +18 is XOR'ed with the bus. If the offset is neither +64 nor +18, say it is +11, the offset itself, i.e., +11, is XOR'ed with the previous value on the bus. Figure 2-12 illustrates the block diagram of the irredundant ALBORZ encoder.

Figure 2-12 Irredundant ALBORZ encoder.
The elimination of the CODEON bit comes at the expense of marginally more complex encoding logic. The codebook for irredundant ALBORZ is different from that of the redundant code. As before, each entry of the codebook consists of an offset associated with a LWC. The main difference is that no offset is equal to any of the LWCs in the codebook. In other words, there is no number in common between the OFFSETS column and the CODES column. When an offset is looked up in the codebook, it is compared with both the OFFSETS and the CODES; if there is a match in the OFFSETS column, the corresponding LWC will be output. Otherwise, if there is a match in the CODES column, the offset corresponding to that LWC is used as the codebook output. The output is then transition-signaled on the bus. If the offset is not equal to any of the OFFSETS or CODES, then a miss occurs and the offset itself is XOR'ed with the previous value on the bus. The codebook effectively implements a one-to-one mapping between the offsets and the LWCs. This can result in a reduction in switching activity as long as the probability of having a match in the offset column is higher than that of having a match in the LWC column. The irredundant ALBORZ encoder is formally described as follows:

// Irredundant ALBORZ Encoder
off = Xi - Xi-1
if ( off ∈ OFFSETS )
    F(Xi) = LWC(off)
else if ( off ∈ CODES )
    F(Xi) = OFFSET(off)
else
    F(Xi) = off
BUS(Xi) = F(Xi) ⊕ BUS(Xi-1)
// end

Note that (OFFSETS ∩ CODES = ∅); otherwise it may not be possible to correctly decode the address in the decoder. The irredundant ALBORZ decoder is described as follows:

// Irredundant ALBORZ Decoder
lwc = BUS(Xi) ⊕ BUS(Xi-1)
if ( lwc ∈ CODES )
    Xi = Xi-1 + OFFSET(lwc)
else if ( lwc ∈ OFFSETS )
    Xi = Xi-1 + LWC(lwc)
else
    Xi = Xi-1 + lwc
// end

Next, we will investigate different codebook characteristics and evaluate the performance of the above techniques with respect to the size of the codebook.

2.5.4 QUANTITATIVE CODEBOOK ANALYSIS

First, we will look at the fixed redundant ALBORZ codebook. The fixed codebook is implemented using ROMs or combinational logic. Based on the codebook size, a certain number of LSB bits of the offset are used to index into the codebook. The number of entries that can be used varies based on the amount of extra hardware that can be tolerated, which itself depends on the technology, system
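The irredundant encoder and decoder above can be sketched in Python as a pair of mutually inverse loops. The codebook is a dictionary from offsets to LWCs whose keys and values must be disjoint, as required by the text; the function names and the assumed initial state (bus = 0 and a previous address of 0 before the trace starts) are ours, and offsets are kept as N-bit masked values.

```python
def irr_encode(addrs, book, width=32):
    """Irredundant ALBORZ: something is always XOR'ed onto the bus, so no
    CODEON bit is needed. The codebook swaps its OFFSETS and CODES columns
    (one-to-one), and a miss passes the offset through unchanged."""
    rev = {lwc: off for off, lwc in book.items()}   # CODES -> OFFSETS
    mask = (1 << width) - 1
    out, bus, prev = [], 0, 0
    for x in addrs:
        off = (x - prev) & mask
        f = book.get(off, rev.get(off, off))        # swap, or pass through
        bus ^= f                                    # transition signaling
        out.append(bus)
        prev = x
    return out

def irr_decode(words, book, width=32):
    """Undo the swap: a received CODES value maps back to its offset, a
    received OFFSETS value maps to its LWC, anything else is the offset."""
    rev = {lwc: off for off, lwc in book.items()}
    mask = (1 << width) - 1
    out, bus, prev = [], 0, 0
    for w in words:
        f = (w ^ bus) & mask
        off = rev.get(f, book.get(f, f))
        x = (prev + off) & mask
        out.append(x)
        bus, prev = w, x
    return out
```

With the single-entry codebook of the example (off = +18, lwc = 00000040h), an offset of +18 puts 00000040h on the bus, an offset of +64 puts +18 on the bus, and any other offset is transition-signaled as-is, exactly as described.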
Figure 2-13 Average of total activity of the encoded bus to the original bus for fixed redundant ALBORZ.

To further reduce the hardware complexity and shrink the codebook significantly, the following technique is used in the hardware design. First, the LSB-Inv of the offset (refer to Definition 2-25) is calculated. After this mapping, a group of LSB bits (based on the size of the codebook) is used to look up into the codebook. Thus, for example, 2 and -3 will hit in the same row of the codebook because they differ only in the MSB. After the code is looked up, if the MSB is equal to one, the code is flipped. Suppose 00000100b is extracted for both 2 and -3. This code is used as-is when the original number is 2, but for -3 the code is flipped to generate a new code, 00100000b. With this simple technique, half of the codebook entries can be eliminated, which reduces the hardware overhead significantly. Figure 2-14 shows the percentage of hits and misses based on the number of codebook entries. As can be seen in this figure, the number of hits increases only marginally beyond 256 entries.

Figure 2-14 Ratio of codebook misses and hits.

For the redundant adaptive ALBORZ, similar figures (Figure 2-15 and Figure 2-16) have been drawn. The numbers of entries are much fewer, since the entries are read/write memory and thus tend to consume more energy. In our implementation, the offsets are replaced using a direct mapping scheme [31]. An offset equal to one, i.e., sequential instructions, should be mapped to all zeros as its output LWC, causing no transition. Some entries can be designed so that they cannot be replaced.
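The codebook-halving trick above can be sketched as follows for an 8-bit offset. Two details are modeled as assumptions consistent with the worked example: LSB-Inv is taken to be bitwise inversion when the sign bit is set (so -3 = 11111101b becomes 00000010b, colliding with +2), and the final "flip" of the looked-up code is taken to be mirroring the bit pattern, which is what turns 00000100b into 00100000b.

```python
# Sketch of the half-size codebook lookup, assuming flip = bit mirroring.
N = 8
MASK = (1 << N) - 1

def lsb_inv(off):
    # invert all bits when the MSB (sign bit) is one
    return off if not (off >> (N - 1)) & 1 else ~off & MASK

def mirror(code):
    # reverse the N-bit pattern; the mirror of an LWC is still an LWC
    out = 0
    for i in range(N):
        if code & (1 << i):
            out |= 1 << (N - 1 - i)
    return out

# Hypothetical half-size codebook indexed by the 2 LSBs after LSB-Inv.
CODEBOOK = {0b10: 0b00000100}      # row shared by +2 and -3

def lookup(off):
    off &= MASK
    key = lsb_inv(off) & 0b11      # a group of LSB bits selects the row
    code = CODEBOOK[key]
    # if the MSB of the original offset was one, flip (mirror) the code
    return mirror(code) if (off >> (N - 1)) & 1 else code
```

Mirroring preserves the weight of the code, so positive and negative offsets sharing a row still receive distinct low-weight codes.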
This can be very helpful, for example, when dealing with instruction addresses. Also, a saturating counter can be added to each entry to prevent entries with substantial hits from getting evicted. In our implementation we used a single-bit saturating counter. If an entry has more than one hit, this bit is set. On the other hand, when a new offset tries to replace an entry, this bit is checked; if it is equal to one, it is reset to zero. Otherwise, the entry is replaced.

Figure 2-15 Average of total activity of the encoded bus to the original bus for adaptive redundant ALBORZ.

Figure 2-16 Ratio of codebook misses and hits.

Irredundant adaptive ALBORZ was also evaluated for different numbers of entries (Figure 2-17 and Figure 2-18). Again, the number of entries should be much smaller than in the fixed version. Switching activity is caused by three different sources. The first is when there is a hit in the offset column (offset-hit) and an LWC is transition-signaled over the bus; we denote the associated activity by lwc. The second is when there is a hit in the LWC column (LWC-hit) and the output is a stored jump offset; we use the word offset to denote this kind of activity. The last is when there is no hit in either of the columns (denoted miss) and the offset itself is transition-signaled over the bus; we call the corresponding transitions miss. As mentioned earlier, in this scheme the LWCs are fixed and only the offset entries are replaced. Since each generated offset is compared with both columns, the way the LWCs are placed in the second column has a great effect on the overall performance.
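The single-bit saturating counter policy described above can be modeled behaviorally as follows: an entry that has hit more than once survives one replacement attempt (its bit is reset instead of the entry being evicted) and is replaced only on the next attempt. The class and field names are illustrative.

```python
# Behavioral sketch of the single-bit saturating replacement policy.
class Entry:
    def __init__(self, offset):
        self.offset = offset
        self.hits = 0
        self.used = 0            # the single-bit saturating counter

def access(entry, offset):
    """Returns True on a hit; on a miss, applies the replacement policy."""
    if entry.offset == offset:
        entry.hits += 1
        if entry.hits > 1:
            entry.used = 1       # more than one hit: set the bit
        return True
    if entry.used:               # bit set: spare the entry, reset the bit
        entry.used = 0
        return False
    entry.offset = offset        # bit clear: replace the entry
    entry.hits = 0
    return False
```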
Because we map LWCs to offsets, every time there is a hit in the LWC column, many bits may switch. In a program, the number of small branches is typically large. This suggests that eliminating LWCs that are small in magnitude from the second column will increase the overall performance. Based on a simple analysis of our traces, we arranged the second column as follows. Among the 1-LWCs, the first four, i.e., 0x1, 0x2, 0x4 and 0x8, are omitted, and the remaining 28 1-LWCs fill the first 28 entries. The rest of the entries are filled with 2-LWCs sorted in decreasing order. There are many large numbers among the total of 496 2-LWCs, which are very unlikely to be offsets. With this arrangement of the codebook, offsets rarely cause any hit in the LWC column, as one can see in Figure 2-18.

Figure 2-17 Average of total activity of the encoded bus to the original bus for irredundant ALBORZ.

Figure 2-18 Ratio of codebook misses and hits.

2.5.5 PERFORMANCE ANALYSIS

In this section, we have carried out experiments similar to those in Section 2.4.4 to determine the overhead of the ALBORZ encoding techniques. In general, ALBORZ encoding techniques are more flexible than the techniques presented in Section 2.4 and can achieve higher performance. However, this comes at the cost of higher implementation overhead and more complex hardware. We analyzed the power dissipation overhead of the encoder/decoder logic.
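The LWC-column arrangement above can be generated programmatically for a 32-bit bus. The ordering of the 28 surviving 1-LWCs is not specified in the text, so ascending order is assumed here for illustration; the 2-LWCs are sorted in decreasing order as described.

```python
# Sketch of the LWC column: 0x1, 0x2, 0x4, 0x8 omitted, the remaining
# 28 1-LWCs first (ascending order assumed), then 2-LWCs decreasing.
from itertools import combinations

WIDTH = 32

def lwc_column(num_entries):
    one_lwcs = [1 << i for i in range(4, WIDTH)]             # 28 entries
    two_lwcs = sorted(((1 << i) | (1 << j)
                       for i, j in combinations(range(WIDTH), 2)),
                      reverse=True)                          # 496 entries
    return (one_lwcs + two_lwcs)[:num_entries]
```

Since the table never holds more than 128 entries, only the very largest 2-LWCs (e.g., bit pairs near the MSB) ever appear, and such values are very unlikely to occur as real offsets.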
Each of the above encoders/decoders with a reasonable number of entries was described in BLIF format, and the netlists were optimized using the SIS script.rugged and mapped to a 1.5-volt, 0.18-micron CMOS library using the SIS technology mapper. Power estimation was done using sim-power for the encoders only. The clock frequency was 100 MHz. Results are summarized in Table 2-9. Comparing this table with Table 2-7, we observe the extra complexity of the ALBORZ techniques relative to those presented earlier in this chapter. Figure 2-19 shows the actual power savings (accounting for encoder and decoder power dissipation) that can be achieved by using ALBORZ in comparison to the INC-XOR method when the I/O supply voltage is 3.3 V, for different values of line capacitance. As one can see, for large values of line capacitance all of our methods outperform the INC-XOR method.

Table 2-9 Encoder hardware synthesis and power estimation.

                                      INC-XOR   Fixed ALBORZ   Adaptive ALBORZ   Adaptive ALBORZ
                                                512-entry      32-entry          32-entry
                                                redundant      redundant         irredundant
  Number of literals                  440       1750           1615              2146
  Area of encoder (x1000 lambda^2)    334.82    766.67         817.64            797.53
  Number of gates                     306       818            870               827
  Encoder power dissipation (mW)      0.13      2.32           1.65              2.18

Figure 2-19 Comparison of total power savings of different encoding techniques.

2.6 CONCLUSIONS

We introduced two different sets of low-power encoding techniques in this chapter. The first set includes TO-C, Offset-XOR-SM and Offset-XOR-SMC. The second group of techniques is known as the ALBORZ encoding techniques. We
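The trade-off behind Figure 2-19 can be expressed with a back-of-the-envelope model: encoding pays off only when the bus power saved by the reduced switching activity exceeds the power burned by the encoder/decoder. The activity and codec-power numbers below are illustrative assumptions, not measurements from this work.

```python
# Net-saving model: (alpha_plain - alpha_coded) * C * V^2 * f  -  P_codec.
V = 3.3        # I/O supply voltage (volts)
F = 100e6      # bus clock frequency (Hz)

def bus_power(transitions_per_cycle, c_line):
    """Average dynamic bus power: alpha * C * V^2 * f (watts)."""
    return transitions_per_cycle * c_line * V * V * F

def net_saving(c_line, alpha_plain=4.0, alpha_coded=2.0, p_codec=4.0e-3):
    """Power saved by encoding, net of encoder+decoder dissipation (assumed values)."""
    return bus_power(alpha_plain - alpha_coded, c_line) - p_codec

# Break-even line capacitance where coding starts to pay off:
break_even = 4.0e-3 / ((4.0 - 2.0) * V * V * F)
```

With these illustrative numbers the break-even capacitance is on the order of a few picofarads, consistent with the observation that all the proposed methods win at the 20-45 pF line capacitances shown in Figure 2-19.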
introduced various architectures for codebooks that are useful in low-power bus encoding techniques. We evaluated the complexity and performance of each of the proposed techniques and compared them with each other and with INC-XOR. The experimental results show the effectiveness of our proposed algorithms.

Chapter 3
SECTOR-BASED ENCODING ALGORITHMS

3.1 INTRODUCTION

In this chapter, we present a completely new set of techniques known as Sector-based Encoding (SE) techniques. Sector-based encoding techniques are a set of irredundant encoding techniques that can exploit locality in different kinds of traces very effectively. Having exploited the locality, sector-based encoding can be targeted for either low-power bus encoding or data compaction; i.e., a different codeword is defined for achieving either of the above goals. The key idea in sector-based encoding is to partition the sourceword space into a number of disjoint sectors, with a unique register in each sector called the sector head. The sector head of each sector is used for tracking the sourcewords that fall in that sector; therefore, it is used for exploiting the locality of sourcewords in that sector. The way we partition the space into sectors, i.e., the methodology of sectorization, is a crucial aspect of sector-based encoding and makes a difference in the quality and complexity of the code. If the sectorization is done once in advance and persists thereafter, the technique is called Fixed Sector-based Encoding, or FSE for short. The first part of this chapter deals with FSE. Then we will move to the other choice, which is having the sectorization change adaptively with respect to the sourcewords. We refer to this set of techniques as Dynamic Sector-based Encoding, or DSE for short.
The primary goal of the dynamic scheme is to do the sectorization in a manner such that, for most sourcewords in the space, the closest sector head is the sector head of their own sector. We start this chapter by looking at related previous work. Then we take a closer look at the concept of sector-based encoding and how it tries to capture the locality in streams. Next, we describe the fixed sector-based encoding techniques, their extensions, advantages, and limitations. After that, we move to the dynamic sector-based encoding techniques. For both kinds of encoding techniques, we first look at the simplest version, which employs only two sector heads. This will give us insight when we subsequently describe the general case. Finally, we present examples, experimental results, and conclusions.

3.2 PREVIOUS RELATED WORK

Spatio-temporal locality is the most complicated kind of locality to capture. Consider data addresses, for example. In general, data address traces are good representatives of traces with spatio-temporal locality. Suppose that data is accessed in multiple huge data structures. Elements of a single structure are close to each other and can result in a lot of spatial locality; but when several structures are accessed alternately, the spatial locality fades in the stream, although it still exists and can be exploited. Few low-overhead techniques exist that are able to exploit this kind of locality in an efficient manner. Musoll et al. proposed the working zone method in [48]. Their method takes advantage of the fact that data and instruction addresses tend to remain in a small set of working zones. These zones can, for example, correspond to the spaces for the code, heap, and stack segments of different programs.
For the addresses that lie in each of these zones, a relatively high degree of locality is observed. For example, code addresses tend to be sequential. Data is accessed repeatedly and/or close to previously accessed data, as when accessing arrays, structures, etc. The stack is accessed in a LIFO style, which has a lot of locality in it. For each working zone, a dedicated register called the zone register is implemented and used to keep track of the accesses in that zone. When a new address arrives, the offset of the address is calculated with respect to all zone registers. The address is thus mapped to the working zone with the smallest offset. If the offset is sufficiently small, one-hot encoding is performed and the result is sent on the bus using transition signaling (refer to Definition 2-8). Otherwise, the address itself is sent over the bus. The working zone method employs multiple redundant lines. First, it needs one extra line to show whether encoding has been done or the original value has been sent. It also uses additional lines to identify the working zone that was used to compute the offset; for this purpose, a number of redundant bits on the order of log2 of the total number of zone registers is needed. Based on this information, the decoder on the other side of the bus can uniquely decode the address. The working zone method also has the ability to detect a stride in any of the working zones. A stride is a constant offset that occurs repeatedly; if detected, it can be used to completely eliminate the switching activity for the corresponding addresses (refer to Definition 2-17). For instruction addresses, the stride corresponds to the offset of sequential instructions. Stride is very important when instruction address encoding is tackled.
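A simplified software model of the working-zone idea just described (not the exact scheme of [48]; the zone count, offset range, and the restriction to positive offsets are assumptions made for illustration) looks like this:

```python
# Simplified working-zone encoder: the offset to the closest zone
# register is one-hot encoded when small enough; otherwise the raw
# address is sent. The returned hit flag and zone index stand in for
# the redundant bus lines the method requires.
WIDTH = 16
NUM_ZONES = 4

def wz_encode(addr, zone_regs):
    offsets = [addr - z for z in zone_regs]
    zone = min(range(NUM_ZONES), key=lambda i: abs(offsets[i]))
    off = offsets[zone]
    zone_regs[zone] = addr          # zone register tracks accesses in its zone
    if 0 < off <= WIDTH:            # small positive offset: one-hot code
        return True, zone, 1 << (off - 1)
    return False, zone, addr        # otherwise send the address itself
```

Note the redundancy this implies: one line for the hit flag plus log2(NUM_ZONES) lines for the zone index, which is exactly the overhead the sector-based techniques of this chapter avoid.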
In fact, the large number of sequential instructions (stride equal to one) is the foundation of the considerable transition savings usually seen in instruction address encoding techniques. For data addresses, a stride can occur when, for example, a program is accessing elements of an array in memory. In general, utilizing a fixed stride has a very small impact on decreasing the switching activity of data addresses. The stride of data address traces is not fixed; different strides may occur when different arrays of data are accessed, and these arrays may be accessed alternately. Another issue is that these strides, unlike instruction address strides, are unknown (unless some profiled information about the specific application is available). The working zone technique uses the right methodology for exploiting spatio-temporal locality; in fact, this is the idea that any technique should use for exploiting this kind of locality. However, the working zone method does so inefficiently because of its large area overhead and the excessive complexity of encoding and decoding. In addition, it can be completely ineffective on some address traces because of non-uniformities in the stride of sourcewords in different zones. Also, mapping the offset to a one-hot code is highly prone to failure. Consider a data address bus where address offsets are not small enough to be mapped to a one-hot code; in such a case the original address is sent over the bus, which usually causes many transitions. Another encoding method that has shown relative effectiveness for data addresses is the bus-invert method [59], elaborated in Section 2.3. Bus-invert selects between the original and the inverted sourceword in a way that minimizes the switching activity on the bus.
The resulting codeword (which includes an extra bit to indicate whether the sourceword or its complement has been sent) is transition-signaled over the bus. This technique is quite effective for reducing the number of ones in addresses with random behavior, but it is ineffective when addresses exhibit some degree of locality. Since locality in data address traces is lower than in pure instruction addresses, bus-invert coding tends to be more effective for these traces. To make the bus-invert method more effective, the bus can be partitioned into a handful of bit-level groups and bus-invert coding applied separately to each of these groups [67]. However, the major problem of the bus-invert method is that it relies only on the previous sourceword for exploiting locality, and this is not sufficient. Finally, the framework proposed by Ramprasad et al. in [53] is not very effective when it comes to dealing with spatio-temporal locality. Although this is a general coding framework and can model a number of noteworthy low-power coding methods for encoding instruction addresses, it also has the limitation of looking only at the last sourceword. The experiments presented in [53] show that the most effective method is Gray coding for data addresses and INC-XOR for multiplexed addresses [61]. Therefore, for pure data addresses without predictable sequential behavior, none of these techniques is capable of performing well. We will show in our experimental results that sector-based encoding techniques outperform all these methods by a wide margin.

3.3 THE APPROACH TO SECTOR-BASED ENCODING

We start this section with a couple of definitions.
Definition 3-1 Sectors: In sector-based encoding, the sourceword space is partitioned into disjoint constituents, which we call sectors.

Definition 3-2 Sector Head: There is exactly one register in each sector. This register is called the sector head and is used to keep track of sourcewords in that sector.

Definition 3-3 Sectorization: The method of partitioning the sourceword space into these disjoint sectors is called sectorization.

We propose two different categories of sector-based encoding techniques, i.e., fixed and dynamic. In both techniques, each sourceword falls in one of the sectors and is encoded based on the sector in which it is located. We want the sectors to partition the sourceword space in such a way that the sourcewords in the same sector are relatively close to each other. This implies that if we encode each sourceword with respect to the previous sourceword accessed in the same sector, spatial locality within each sector will enable us to define a codeword that either reduces the number of bit transitions on the bus or minimizes the number of required bits. To elaborate on the above statement, let us look at the following example. We consider two cases. In the first case, a trace of sourcewords scattered all over the sourceword space is sent over a bus without any encoding. Because these sourcewords are dispersed, they are likely to have large Hamming distances (refer to Definition 2-13). In the second case, we partition the sourceword space into two sectors, so that the original trace is divided into two sub-traces based on this sectorization. In each sector, the sourcewords are closer to each other, and if we calculate the total activity (refer to Definition 2-15) of these two sub-traces, it will be less than the total transition count of the original trace.
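The two-case argument above can be illustrated numerically. The trace values below are made up for the example: a 16-bit trace alternating between the lower and upper halves of the space costs fewer total bit flips when each sourceword is charged against the previous sourceword of its own half-space sector.

```python
# Total switching activity of a trace, undivided vs. split by sector.
def bits(x):
    return bin(x).count('1')

def total_activity(trace):
    return sum(bits(a ^ b) for a, b in zip(trace, trace[1:]))

# Sourcewords alternating between the two halves of a 16-bit space
trace = [0x1000, 0x9000, 0x1004, 0x9008, 0x100C, 0x9010]

# Case 1: the undivided trace
flat = total_activity(trace)

# Case 2: one sub-trace per half-space sector (the MSB selects the sector)
low  = [x for x in trace if x < 0x8000]
high = [x for x in trace if x >= 0x8000]
sectored = total_activity(low) + total_activity(high)

assert sectored < flat   # 5 transitions instead of 12 for this trace
```

The encoder never physically splits the trace; as the text explains next, it performs this separation virtually by keeping one reference register per sector.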
In practice, sourcewords are not partitioned into two sub-traces; it is the function of the encoder to perform this "virtual separation" of sourcewords in the trace. This observation is the key insight behind the proposed sector-based encoding techniques. To show the benefits of sectorization quantitatively, let us consider the data addresses for a memory system without a cache. Each memory access is routed to the corresponding physical address. Each data access generated by the CPU may be used for accessing a data value either in the stack, which is used for storing function return addresses and local variables, or in the heap, which is used to hold global data and dynamically allocated variables. The stack may reside in one memory segment, e.g., the upper half of the memory, whereas the heap may reside in another memory segment, e.g., the lower half of the memory. Let H and S denote heap and stack accesses, respectively. By an H→S access, we mean address bus transitions that occur when the stack is accessed after a heap access. S→H, S→S and H→H are defined similarly. The number of bit transitions caused by H→S and S→H accesses is often higher than that of S→S and H→H accesses. As explained earlier, this is because the heap and stack sectors are usually placed far from one another in the memory address space. From detailed simulations performed on a large number of benchmark programs on the SimpleScalar architecture [72], we have observed that if we apply the Offset-XOR encoding technique [30] to a data address bus, S→H and H→S accesses are responsible for about 73% of all bit transitions. Now suppose we break the trace into two parts: one includes accesses to the stack, while the other includes accesses to the heap.
If we separately apply the Offset-XOR encoding to each of these two traces and sum the transitions for each trace, then up to 61% reduction in switching activity is achieved with respect to the undivided trace. This shows the importance of proper sectorization. There are various real-world cases in which the data on the bus has strong spatio-temporal locality. Consider the bus between a cache and main memory. Blocks that need to be fetched from or written to the main memory belong to different processes. If two blocks belong to the same physical page, their addresses will be very close to one another. However, if they are not in the same page, they can be significantly far apart. The total transitions caused by two consecutive accesses that happen to fall in the same page tend to be much smaller than the transition counts of successive accesses to different pages. A similar behavior is observed when multiple masters communicate with different devices on a shared bus. Consider the AMBA bus [74] as an example. The addresses generated by each one of the masters have some sort of locality. However, the addresses that different masters generate can be uncorrelated and may cause a large number of transitions on the bus when control of the bus is handed over from master to master. When sector-based encoding is applied to such a bus, each sender and receiver should have its own copy of the sector heads. Although, in the long run, a sourceword may be encoded with respect to any of the sector heads, we expect that a sender will use the same sector head for some time before resorting to another sector head. This is true if the current sector head retains the last sourceword that was sent over the bus by that specific sender in a previous session.
So it is as if the sector heads are automatically distributed among the different sources (provided that a sector-based encoding with enough sector heads is applied). Moreover, one sender may even use more than one sector head. Thus, locality within the sourcewords of each sender is effectively exploited by sector-based encoding. One major advantage of the sector-based encoding techniques is that they do not require any redundant bits. Obviously, in the codeword, some bits are dedicated to conveying information about the sector that has been used as a reference for encoding. The remaining bits are used for encoding the difference, i.e., the difference between the new sourceword and the previous one accessed in that sector (which is kept in the sector head). As mentioned, we propose two different sets of encoding techniques. First we will look at fixed sector-based encoding (FSE), in which the partitioning of the sourceword space is done statically. This kind of sector-based encoding has the benefit of very low overhead. However, performing a fixed sectorization such that sourcewords fall evenly into the disjoint sectors is not easy. For FSE, we propose a special method of sectorization called dispersed sectorization that can achieve this goal to a certain extent. After that, we will look at the dynamic sector-based encoding (DSE) techniques, in which the sectors are dynamically changed with respect to the most recent sourcewords. This set of encoding techniques can be efficient for a much larger set of traces with diverse behaviors. Sectorization is very flexible in DSE: the distance between different sector heads can vary arbitrarily, and sourcewords of different sectors can get very close to each other while still falling properly into their corresponding sectors. Next we elaborate on the above two classes in detail.
3.4 FIXED SECTOR ENCODING

In this section, we look at the first set of sector-based encoding techniques, which utilize a static partitioning of the sourceword space. These techniques will be collectively referred to as FSE techniques. We present them by first looking at the simplest version, which employs only two sector heads; later we explore the multiple-sector versions.

3.4.1 FIXED TWO SECTOR ENCODING

A Fixed Two Sector Encoder (FTSE) is the simplest FSE scheme. The sectors are defined to be the lower half and the upper half of the sourceword space, with one sector head for each sector. Each sector head is bound to one of the half-space sectors and can freely roam all over it. Therefore, the MSB of each sector head is known and fixed, and we actually need only (N-1) bits to represent the sector head; the MSB is known implicitly. The MSB of the sourceword also determines its sector and thus the sector head that shall be used for encoding. In addition, the MSB of the codeword is defined to be equal to the MSB of the sourceword, i.e., the MSB is the same for sourceword and codeword. The remaining bits of the codeword are obtained by XORing the sector head (which is N-1 bits) with the sourceword. To do this, we have to cast the sector head to an N-bit number; more precisely, we concatenate a zero with it at the left.

    // FTSE encoder
    if ( X[N] == 0 )
        FTSE(X) = (0 || SH0) ⊕ X    // concatenate SH0 with a zero, then XOR with X
        SH0 = {X}N-1
    else
        FTSE(X) = (0 || SH1) ⊕ X    // concatenate SH1 with a zero, then XOR with X
        SH1 = {X}N-1
    // end

The codeword is transition-signaled over the bus. SH0 and SH1 are (N-1)-bit numbers in the lower half and upper half of the memory map, respectively.
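The FTSE pseudo-code above can be sketched in software as follows, including transition signaling. The decoder keeps its own copy of the two sector heads, which stays in sync with the encoder because both update the head of the selected sector with each decoded sourceword.

```python
# Software model of FTSE for an N-bit bus with transition signaling.
N = 16
LOW_MASK = (1 << (N - 1)) - 1       # the (N-1) stored bits of a sector head

class FTSE:
    def __init__(self):
        self.sh = [0, 0]            # SH0 (lower half), SH1 (upper half)
        self.bus = 0                # previous value on the bus

    def encode(self, x):
        s = x >> (N - 1)            # MSB selects the sector
        code = self.sh[s] ^ x       # (0 || SHs) XOR X; MSB passes unchanged
        self.sh[s] = x & LOW_MASK   # sector head tracks the sourceword
        self.bus ^= code            # transition signaling
        return self.bus

    def decode(self, bus):
        code = bus ^ self.bus
        self.bus = bus
        s = code >> (N - 1)         # MSB of the codeword equals MSB of X
        x = code ^ self.sh[s]
        self.sh[s] = x & LOW_MASK
        return x
```

Note that no comparator or subtractor appears anywhere: sector selection is a single-bit test and the codeword is one XOR, which is exactly the simplicity the text claims for FTSE.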
That is, the actual location of SH0 is 0||SH0 and the actual location of SH1 is 1||SH1. As can be seen, FTSE is a very simple encoding technique. This simplicity comes from the fact that no complex operation such as subtraction or comparison is required to determine the sector of each sourceword, and the codeword is computed very easily by an XOR operation. These characteristics carry over to the multiple-sector version, which is elaborated next.

3.4.2 FIXED MULTIPLE SECTOR ENCODING

In Fixed Multiple Sector Encoding (FMSE), the sourceword space is partitioned into multiple sectors, where the number of sectors is a power of 2. As a result of having more sectors, sourcewords are more evenly distributed among the sectors. For example, consider FTSE when all sourcewords lie in the lower half of the memory. Clearly, the FTSE encoding degenerates to simply XORing sourcewords over the bus, which performs poorly. FMSE overcomes this potential weakness by applying two methods:

1. Increasing the number of sectors. This helps to reduce the probability of having distant sourcewords fall in the same sector.
2. Using segments to further divide sectors. This kind of sectorization is referred to as dispersed sectorization, as opposed to contiguous sectorization. In dispersed sectorization, a sector is not a single contiguous chunk of memory; rather, it is comprised of much smaller segments dispersed all over the sourceword space.

A detailed description of FMSE follows. Suppose the sourceword space is divided into 2^M sectors. Therefore, a total of M bits of the sourceword should be used for determining the sector.
Definition 3-4 Sector-ID bits: The bit positions in sector-based codewords that are used to identify the sector head used in the encoding of sourcewords are referred to as Sector-ID bits.

If an encoding technique employs 2^M sectors, then it requires exactly M Sector-ID bits. The simplest extension of FTSE to obtain a multiple-sector scheme would be to choose the M most significant bits of the sourceword as the Sector-ID bits and copy them to the codeword. The remaining bits in the sourceword are then XORed with the corresponding bits of the sector head to compose the codeword (sector heads are (N-M) bits each and should be concatenated with M zeros before the XOR operation). The result comprises the FMSE codeword. With the above extension, FMSE can support an arbitrarily large number of sectors. However, with some more investigation into actual cases, we realize that even the increased number of sectors may not be sufficient to achieve a uniform and efficient distribution of the sourcewords over the sectors. For example, consider the main memory of a typical computer system with an internal cache. In comparison with the whole virtual address space, the size of the main memory is usually much smaller, and even when allocating a handful of bits as Sector-ID bits, the whole physical memory may reside in very few of the sectors. This problem cannot be solved unless an unreasonable number of sectors is used, and even then a significant portion of the sectors would be left unused. For this reason, we propose a new technique for partitioning the sourceword space. We call this new method of sectorization dispersed sectorization. The trick is that instead of using the MSB bits as Sector-ID bits, some of the intermediate bits of the sourcewords are used. This changes the sectors from large contiguous sections to smaller disjoint (dispersed) sections.
We call each of these subsections a segment of the sector. Figure 3-1 depicts the two different methods of sectorization and how the position of the Sector-ID bits affects the sectorization. As the Sector-ID bits are shifted from left to right, the number of segments within each sector increases; at the same time, the size of each segment becomes smaller.

Figure 3-1 Comparison of contiguous versus dispersed sectorization.

In contiguous sectorization, given two arbitrary numbers in the space, the total number of sectors that fall between those numbers depends on the two numbers and the total number of sectors. In dispersed sectorization, however, the size of the segments is also a factor in determining the number of sectors between two arbitrarily selected numbers. Even if an interval is small, it may include at least one segment of each sector. As long as the interval includes a segment from each of the sectors, sourcewords that lie in that interval may fall in any of the 2^M sectors, and this is exactly what we want: we do not want some of the sectors to be overused and others not used at all. We will later discuss the granularity of segments and how it affects the performance of encoding and decoding. Suppose we have 2^M sectors. Each sector head is a register bound to one of the sectors. Consequently, M bits of each sector head are fixed and known, so we only need (N-M) bits for each sector head. However, to make the following pseudo-code easier to understand, we assume that sector heads are also N-bit numbers and that the bits in the positions of the Sector-ID bits are zeros. The sector to which each sector head belongs is known from the sector head index.
The Sector-ID bits in the sourceword are used to select the correct sector head for codeword calculation, and they are copied to the codeword as in the FTSE encoder. We use these bits to arrive at the corresponding sector head and XOR the sourceword with that sector head to generate the codeword. The Sector-ID bits in the sourceword are XORed with the corresponding zeros in the sector head; hence, they do not change.

// FMSE encoder
// 2^M sectors, 2^M sector heads SH_0 ... SH_{2^M - 1}
// Sector-ID bits: X[i+M] ... X[i+1] (an M-bit number)
FMSE(X) = X XOR SH_{X[i+M]...X[i+1]}
Update SH_{X[i+M]...X[i+1]} with X and make its Sector-ID bits zero
// end

A fundamental question in FMSE is which bit positions should be used for the Sector-ID. The number of bits defines the number and the size of the sectors; the location of the bits defines the number and the size of the segments. Notice that these bits can be chosen from non-adjacent bit positions. The answer to this question depends on the characteristics of the sourcewords and the kind of trace we are dealing with. As an example, consider a bus between an internal cache and an external memory. For such a system, our simulations indicate that there is an optimum range for the Sector-ID bits: as long as the Sector-ID bits lie within that range, their exact location does not make a big difference. Assume that the Sector-ID bits are M contiguous bits in the sourceword. As depicted in Figure 3-1, shifting the position of the Sector-ID bits to the right makes the segments smaller. A main memory usually occupies a small portion of the address space. The segments should be small enough that at least one segment of each sector is included in the memory space.
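The pseudocode above can be turned into a small executable sketch (a minimal model, assuming 8-bit sourcewords, four sectors, and Sector-ID bits taken from the two MSBs; the function names are ours, and the bit positions can be moved for dispersed sectorization):

```python
N, M = 8, 2          # 8-bit sourcewords, 4 sectors (illustrative sizes)
ID_LO = N - M        # lowest Sector-ID bit position (MSBs: contiguous case)

def fmse_step(x, heads):
    """One FMSE encoding step: XOR the sourceword with the sector head
    selected by its Sector-ID bits, then update that head with the
    sourceword (Sector-ID bit positions cleared, as in the pseudocode)."""
    sid = (x >> ID_LO) & ((1 << M) - 1)
    code = x ^ heads[sid]                 # Sector-ID bits pass through unchanged
    heads[sid] = x & ~(((1 << M) - 1) << ID_LO)
    return code

def fmse_decode_step(code, heads):
    """Receiver mirror image: the Sector-ID bits of the codeword equal
    those of the sourceword, so the same head is selected and XORed back."""
    sid = (code >> ID_LO) & ((1 << M) - 1)
    x = code ^ heads[sid]
    heads[sid] = x & ~(((1 << M) - 1) << ID_LO)
    return x

# Sender and receiver keep synchronized head registers (all zeros at reset).
tx_heads = [0] * (1 << M)
rx_heads = [0] * (1 << M)
codes = [fmse_step(a, tx_heads) for a in (0x41, 0x43, 0x40)]
back = [fmse_decode_step(c, rx_heads) for c in codes]
```

Because each head stores the previous sourceword of its sector (with the Sector-ID positions zeroed), consecutive nearby addresses in the same sector produce codewords that differ from the sourceword only in a few low-order bits.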
This guarantees that all sectors will be used for the encoding. On the other hand, the Sector-ID bits should be shifted to the left enough to make each segment at least as large as one physical page of the memory system. Close virtual addresses might end up far from each other after they are translated to physical addresses. As long as two virtual addresses are within the same virtual page, they will be mapped to the same physical page and will therefore remain just as close as they were before translation. This is not the case, however, when they are not in the same virtual page. Suppose multiple programs are executed in the system. All cache misses generate requests to the external memory. Every time execution of one program is stopped and another program is executed, many cache misses happen. These misses are due to reading the code and data of the new program from consecutive blocks that are most probably in the same physical page. To be efficient, the dispersed sectorization scheme should place all of these sourcewords in the same sector; this is why it is beneficial for the segments to be larger than the physical memory pages. As long as the Sector-ID bits satisfy the two aforementioned constraints and remain within the corresponding bounds, we have a proper sectorization and expect good performance. This is confirmed by the experimental results.

Next, we will look at the dynamic sector-based encoding techniques. As mentioned earlier, dynamic and fixed encoding have advantages and disadvantages with respect to each other. In the next section, we will first go through those differences more carefully and explain how the major limitations of fixed sector-based encoding are resolved in the dynamic scheme.

3.5 DYNAMIC SECTOR ENCODING

Before going into the details of dynamic sector-based encoding, it is important to recall a few points about FSE techniques.
In this set of techniques, it is solely the initial sectorization that determines whether two sourcewords fall into the same sector; the behavior of the sourcewords has no effect on it. DSE, in contrast, is designed such that the sectorization adaptively changes with respect to the most recent sourcewords in order to maximize the performance of the encoding. Of course, the sectorization limitation in FSE comes with the benefit of remarkably simple encoding and decoding blocks. In FSE, encoders and decoders are very lightweight, and the delay associated with encoding and decoding is negligible. This is a vital concern when these encodings are to be implemented in hardware. Another advantage is that FSE scales with a lower rate of complexity growth than DSE: as the number of sectors in DSE goes up, its complexity rises at a faster pace than that of FSE. As an example, in FSE it is always very easy to determine the sector to which a specific sourceword belongs, whereas doing the same task in DSE becomes much more difficult as the number of sectors grows, as will be explained in Section 3.5.2. Indeed, FSE can easily be scaled to support an arbitrary number of sectors in hardware and is a perfect choice for actual hardware applications when a large number of sectors is required, such as when the target bus is the bus between the internal cache and the outside memory chip, or a bus shared by multiple masters and slaves, as discussed earlier. In such cases, DSE would probably be less practical because of the associated hardware complexity and delay overhead, which is not tolerable on a high-performance bus. Yet DSE is very interesting because of its adaptive sectorization, which is generally much more effective than FSE.
Let us start this section with the following fundamental question: "What is the optimum sectorization for exploiting the locality of a stream?" Assume that a trace of sourcewords is given and we want to come up with an optimal time-varying sectorization such that: first, the last access in each sector is recorded in a register called the sector head of that sector; second, each new sourceword falls in one of the sectors and is encoded with respect to the sector head of that sector; and third, for each new sourceword, the sector head of its sector is the actual closest sector head among all of them, so that the encoding can be done most efficiently. Realizing such a sectorization would be rather difficult, and we will not attempt to solve the problem optimally, because even if we did, the solution would be impractical due to its high complexity. Instead, we will try to develop an effective way of sectorizing the sourceword space that exploits the locality in the stream, in accordance with the rules mentioned above, as much as possible.

Given the footprint of a sourceword stream, one may come up with different sectorization schemes using different numbers of sector heads. For example, the sourceword stream may be viewed as occupying two major sectors, while within each major sector the sourcewords may be partitioned into multiple minor sectors, as depicted in Figure 3-2. This figure shows the footprint of the references in the sourceword space. It can be seen that the sourcewords lie in two major sectors, while the lower one contains three smaller sectors within itself. Therefore, by looking at the footprint of a stream, there is no definite way of determining whether or not two sourcewords fall into the same sector.
In fact, the answer depends on how many sectors there are, how fast these sourcewords are accessed in time, what the initial values of the sector heads are, and how they are updated afterwards.

Figure 3-2 Concept of sectors in data streams with spatio-temporal correlation (the second sector is magnified to show multiple smaller sectors contained in a major sector).

The DSE encoding is formulated in a way that lets the sector heads freely roam the whole space. This is the biggest advantage of DSE over FSE. In FSE, the sectorization is fixed and does not change adaptively; therefore, each sector head is bound to a specific segment of the space. In fixed multiple sector encoding, even with dispersed sectorization (refer to 3.5.2), it is possible that sourcewords with special characteristics are unevenly distributed among different sectors, which causes a huge degradation in encoding efficiency. This occurs when the intrinsic memory footprint of the stream differs from what the fixed sectorization dictates. For example, in Figure 3-2, consider the case where the sectors are so large that all sourcewords in the magnified part lie in one sector. Obviously, the encoding will not be able to make the most of the correlation that exists between the sourcewords in the magnified section. Besides, in dispersed sectorization, the possibility always exists that two distant sourcewords lie in the same sector because their Sector-ID bits are the same, which leads to poor performance. This phenomenon is very similar to conflict cache misses, which occur when the cache is not fully associative [31].
Dynamic sector-based encoding minimizes the effect of the two aforementioned problems. First, it can track sourcewords in different sectors even if these sectors are minuscule or the accessed sourcewords are very close to each other. Second, the conflict phenomenon basically does not arise in DSE. We first present the dynamic two-sector encoding to introduce the concepts and definitions to the reader in a relatively easy fashion. After that, we will look at the multiple-sector encoding and present general theorems that apply to dynamic sector-based encoding.

3.5.1 DYNAMIC TWO SECTOR ENCODING

The Dynamic Two Sector Encoding (DTSE) makes use of only two sector heads. To encode a sourceword, its arithmetic offset is computed with respect to both sector heads, and one of the sector heads is chosen as the reference for encoding that sourceword. The codeword consists of two kinds of bits: a single bit that specifies the sector head used for encoding (we call this bit the Sector-ID bit, just as in FSE; refer to Definition 3-4) and the remaining (N-1) bits that encode the offset of the sourceword with respect to the selected sector head. After the codeword is computed based on the selected sector head and sent to the receiver, the sector head value is updated with the last sourceword. This means that the sectors are dynamically changed in order to track the sourcewords. In general, when we work in the N-bit space, the offset between two arbitrary sourcewords requires N bits, but in DTSE we only have (N-1) bits available for the offset between the sourceword and the sector head. To understand how (N-1) bits can be used to encode an N-bit offset, consider a circular sourceword space where 2^(N-1) - 1 is adjacent to -2^(N-1) (see Figure 3-3). Sector heads SH0 and SH1 are shown on the circle.
We only have a budget of (N-1) bits for representing the difference between the sourceword and the sector head. Consider sector heads SH0 and SH1 shown in Figure 3-3 (a). With an (N-1)-bit offset with respect to SH0, this sector head covers the upper portion of the circle (in a later section, we will provide a formal definition of covering). Similarly, the lower portion of the circle is covered by SH1. Therefore, the entire space can be covered using an (N-1)-bit offset with respect to one of the sector heads. Now consider the more general case depicted in Figure 3-3 (b). In this case, the arcs covered by SH0 and SH1 intersect. There is a portion of the sourceword space that is covered twice; we call this set of sourcewords Doubly Covered, or DC. There is also a portion that cannot be covered using an (N-1)-bit offset, which we call Not Covered, or NC. We will also use the term exposed for these not-covered sourcewords interchangeably. Note that adding 2^(N-1) to a point in NC maps it to a point in DC. This fact will be used when encoding the points in NC. If a point X is in DC, it can be encoded with respect to either of the two sector heads; we encode it with respect to its closest sector head. This leaves one of the potential codewords unused. On the other hand, if a point Y is in NC, it cannot be encoded with respect to either sector head, because neither of them covers Y. However, adding 2^(N-1) to Y maps it to a point X in DC. If X is encoded with respect to the closest sector head, the obtained codeword cannot be reused for Y, as the encoding would not be one-to-one in that case, i.e., the decoder would not be able to tell whether X or Y had been sent. Therefore, we choose to encode X with respect to the sector head that is farthest from it and use the obtained codeword as the codeword for Y.
Figure 3-3 Two sector heads and the part of the memory space that each one covers: (a) SH0 covers the upper portion of the circle and SH1 the lower portion; (b) the covered arcs intersect, producing a Doubly Covered (DC) region and a Not Covered (NC) region.

Next, we give a set of definitions that will be used throughout the DSE sections. It is important to mention again that sourcewords X, Y, Z and sector head SH are assumed to be N-bit values in 2's complement format. The bits of X are represented by X[1] to X[N], where X[N] is the most significant bit (MSB). When we carry out addition and subtraction operations between these variables, we do not care about overflow; the result is always interpreted as a signed number in the N-bit space. For instance, in the 8-bit space [-128,127], the addition -125+(-100) evaluates to 31 because -125-100 = -256+31. In other words, arithmetic is done modulo 2^N.

Definition 3-5 Function dist(SH,X): Given SH and X, the distance from SH to X is defined as follows:

dist(SH,X) = X - SH        if 0 <= X - SH <= 2^(N-1) - 1
dist(SH,X) = SH - X - 1    otherwise (then 0 <= SH - X - 1 <= 2^(N-1) - 1)

Note that the distance is a non-negative number and that the order of the arguments of the dist function is important. Consider the 8-bit circular sourceword space where 127 and -128 are adjacent to one another. Sourcewords decrease in the clockwise direction as long as they do not pass the sign boundary (-128 to 127). Consider two sector heads SH0 and SH1 and a sourceword X as depicted in Figure 3-4. The distance from SH0 to X is the length of the smaller of the two following arcs: the clockwise arc from SH0 to X+1 and the anti-clockwise arc from SH0 to X. Here the shorter arc is the clockwise one from SH0 to X+1; thus, we say the distance direction is clockwise. Similarly, the distance from SH1 to X is determined to be the length of the anti-clockwise arc from SH1 to X, i.e.
the distance direction is anti-clockwise.

Definition 3-6 Distance direction: Given distinct SH and X, we write X ≺ SH if dist(SH,X) requires a clockwise travel from SH to X+1. Note that, unlike < (the less-than operator), ≺ is not transitive, i.e., X ≺ Y and Y ≺ Z do not imply X ≺ Z. We write X ≼ Y if and only if X ≺ Y or X = Y.

Figure 3-4 Graphical representation of the distance functions dist(SH0,X) and dist(SH1,X) on the circular 8-bit space (127 adjacent to -128).

Definition 3-7 Closeness: We say X is closer to SH0 than to SH1 when dist(SH0,X) is smaller than dist(SH1,X). When two sector heads SH0 and SH1 are both odd or both even, the distances of a sourceword to them cannot be equal, i.e., dist(SH0,X) ≠ dist(SH1,X). Since in many cases we make decisions by comparing distances, from now on we assume that all sector heads are odd. This guarantees that we will never have equal distances. Also, in the updating process, we will make sure they remain odd.

Lemma 3-1. For any two odd sector heads SH0 and SH1, if X sweeps the N-bit space, half of the time it will be closer to SH0 and half of the time it will be closer to SH1. If X is closer to SH0, then X+2^(N-1) will be closer to SH1, and vice versa.

Proof. Proving that X and X+2^(N-1) cannot both be closer to SH0 is sufficient, since it partitions the sourceword space into two subgroups: one of them closer to SH0 and the other closer to SH1. For any two N-bit numbers X and SH0, as depicted in Figure 3-4, it is easy to verify that:

dist(X,SH0) + dist(X + 2^(N-1), SH0) = 2^(N-1) - 1 = constant.

Therefore, for any arbitrary numbers SH0 and SH1, we will have:

dist(X,SH0) + dist(X + 2^(N-1), SH0) = dist(X,SH1) + dist(X + 2^(N-1), SH1)
⇒ dist(X,SH0) > dist(X,SH1) ⇔ dist(X + 2^(N-1), SH0) < dist(X + 2^(N-1), SH1)

and the proof is complete.
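Definition 3-5 and Lemma 3-1 lend themselves to a direct numerical check. The following sketch (our own helper function, with all arithmetic taken modulo 2^N as the text prescribes) verifies both the constant-sum identity and the closeness property for every X in an 8-bit space:

```python
def dist(sh, x, n):
    """Definition 3-5: distance from sector head sh to sourceword x,
    with all arithmetic taken modulo 2^n."""
    mask = (1 << n) - 1
    d = (x - sh) & mask               # anti-clockwise arc from sh to x
    if d <= (1 << (n - 1)) - 1:
        return d
    return (sh - x - 1) & mask        # clockwise arc from sh to x+1

# Lemma 3-1 for N = 8: for two odd heads, X and X + 2^(N-1) are never
# closer to the same head, and the two distances to one head sum to
# the constant 2^(N-1) - 1.
N = 8
SH0, SH1 = 1, 81                      # two distinct odd sector heads
for x in range(1 << N):
    xh = (x + (1 << (N - 1))) & ((1 << N) - 1)
    assert dist(SH0, x, N) + dist(SH0, xh, N) == (1 << (N - 1)) - 1
    assert (dist(SH0, x, N) < dist(SH1, x, N)) != (dist(SH0, xh, N) < dist(SH1, xh, N))
```

The loop exercises every sourceword, so it also confirms that odd heads never produce a distance tie, as claimed in Definition 3-7.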
Definition 3-8 Function C(X,SH0;SH1): It is defined as follows:

C(X,SH0;SH1) = LSB-Inv({X - SH0}_{N-1})    if dist(X,SH0) < dist(X,SH1)
C(X,SH0;SH1) = LSB-Inv({X - SH1}_{N-1})    otherwise (dist(X,SH0) > dist(X,SH1))

Theorem 3-1. For any two odd sector heads SH0 and SH1, when X sweeps the N-bit space, C(X,SH0;SH1) sweeps the (N-1)-bit space. Each value in that space is produced exactly twice: once when X is closer to SH0 and a second time when X is closer to SH1.

Proof. From Lemma 3-1, it is clear that C(X,SH0;SH1) is calculated with respect to SH0 for half of the sourcewords and with respect to SH1 for the other half. Since LSB-Inv (refer to Definition 2-25) is a one-to-one mapping of sourcewords in the (N-1)-bit space, it has no impact on the proof, and the emphasis is on {X-SH0}_{N-1} and {X-SH1}_{N-1}. For every (N-1)-bit number X', either X1 = LSB-Inv^(-1)(X') + SH0 or X2 = LSB-Inv^(-1)(X') + SH0 + 2^(N-1) is closer to SH0. We call it X_SH0, and we have {X_SH0 - SH0}_{N-1} = X'. Similarly, there is another number X_SH1 that also generates X' (i.e., {X_SH1 - SH1}_{N-1} = X') and is closer to SH1. X_SH0 and X_SH1 are distinct numbers because one of them is closer to SH0 and the other is closer to SH1. Therefore, for every (N-1)-bit number X', we have found two distinct numbers (X_SH0 and X_SH1) such that applying the function C() to them generates X', and the proof is complete.

Now, based on Theorem 3-1, we explain how DTSE works. We call the two sector heads SH0 and SH1. First, C(X,SH0;SH1) is calculated; this is an (N-1)-bit number. We use the MSB of the codeword to send the Sector-ID (refer to Definition 3-4), which represents the sector head that has been used for encoding.
For example, 0 can be used for SH0 and 1 for SH1. The DTSE encoder is defined as follows:

// DTSE Encoder
DTSE(X) = (Sector-ID) || C(X,SH0;SH1)
if (dist(X,SH0) < dist(X,SH1))
    SH0 = Odd(X)
else
    SH1 = Odd(X)
// Set the sector head to the current sourceword
// end

The function Odd(X) makes sure the sector heads remain odd for the next round of encoding and is defined as follows.

Definition 3-9 Function Odd(X): Odd(X) is either X or X+1, whichever is an odd number.

The resulting code is transition-signaled over the bus (refer to Definition 2-8). The above theorem guarantees that, for any arbitrary values of the sector heads, the N-bit sourceword is mapped to an N-bit codeword in a one-to-one manner. As a result, it is possible to uniquely decode the numbers on the receiver side.

Let us take a closer look at how this one-to-one mapping is put together. Consider an arbitrary sourceword X in the DC set. Obviously, X+2^(N-1) is in the NC set. Assume that X is closer to SH0, so that it will be encoded with respect to SH0. We wish to assign a codeword to X+2^(N-1). The reasonable candidate is the codeword that would have been generated for X if it had been encoded with respect to SH1. On the other hand, suppose we forget for the moment that X+2^(N-1) is not covered by either of the two sector heads. From the above theorem, X+2^(N-1) is closer to SH1. So, if we simply use the rule that every sourceword should be encoded with respect to the sector head closest to it, then X+2^(N-1) will be encoded with respect to SH1.
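Putting Definitions 3-8 and 3-9 together with the encoder pseudocode gives the following executable sketch. The LSB-Inv behavior is our assumption, inferred from the Table 3-1 example: when the sign bit of the k-bit offset is set, the k-1 lower bits are inverted, so small negative offsets end up with few ones.

```python
def dist(sh, x, n):
    """Definition 3-5, with modulo-2^n arithmetic."""
    mask = (1 << n) - 1
    d = (x - sh) & mask
    return d if d <= (1 << (n - 1)) - 1 else (sh - x - 1) & mask

def lsb_inv(x, k):
    """LSB-Inv in the k-bit space (assumed behavior, inferred from
    Table 3-1): if the sign bit is set, invert the k-1 lower bits."""
    x &= (1 << k) - 1
    return x ^ ((1 << (k - 1)) - 1) if x >> (k - 1) else x

def odd(x, n):
    """Definition 3-9: X or X+1, whichever is odd."""
    return x & ((1 << n) - 1) if x & 1 else (x + 1) & ((1 << n) - 1)

def dtse_encode(x, heads, n):
    """One DTSE step: Sector-ID bit || C(X,SH0;SH1), then update the
    chosen head with Odd(X). `heads` is the mutable [SH0, SH1] state."""
    sec = 0 if dist(heads[0], x, n) < dist(heads[1], x, n) else 1
    c = lsb_inv(x - heads[sec], n - 1)        # offset in the (n-1)-bit space
    heads[sec] = odd(x, n)
    return (sec << (n - 1)) | c
```

Passing a fresh [SH0, SH1] pair on every call keeps the heads fixed, which reproduces the rows of Table 3-1; in normal operation the same mutable list is reused, so the heads track the stream.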
Now we know that the encodings of X and X+2^(N-1) with respect to SH1 are exactly the same, because both have the same offset with respect to SH1. To be precise:

LSB-Inv({X - SH1}_{N-1}) = LSB-Inv({X + 2^(N-1) - SH1}_{N-1})

Consequently, by adopting the simple and straightforward rule of "encoding a sourceword with respect to its closest sector head," we use all of the codewords left unused by the sourcewords in the DC set to encode the sourcewords in the NC set. This process is equivalent to mapping the NC set onto the DC set (by inverting the MSB) and encoding those sourcewords with respect to the sector head that is farthest from them, which clearly results in a one-to-one mapping. Therefore, our general rule is to measure the distance from a sourceword to both sector heads and encode the sourceword with respect to the closest sector head. As a final step, LSB-Inv is applied to the lower (N-1) bits of the word to reduce the number of ones in small negative offsets (refer to 2.4.2). When LSB-Inv is applied to large negative numbers, the number of ones in the word increases; in practice, however, the LSB-Inv function is on average quite effective in reducing the number of ones, since offsets in each sector tend to be small numbers.

On the receiver side, an XOR operation is first required to reverse the effect of transition signaling and extract the codeword. The sector is easily determined from the MSB, i.e., the Sector-ID bit. Next, by using the value of the corresponding sector head on the receiver side (note that the sector heads on the sender and receiver sides are synchronized) and the remaining (N-1) bits of the codeword, the sourceword X is computed (assuming X is not in NC). After this step, it is determined whether the computed X is actually closer to the sector head that has been used for decoding.
If this is true, the sourceword has been correctly calculated; otherwise, the constant 2^(N-1) should be added to X to produce the correct sourceword (note that adding 2^(N-1) is equivalent to inverting the MSB).

// DTSE Decoder
Z = BUS(X_i) XOR BUS(X_{i-1})
U = Z[N-1] || LSB-Inv({Z}_{N-1})
// First apply LSB-Inv in the (N-1)-bit space, then sign-extend
if (Z[N] == 0) then
    X = SH0 + U
    if (dist(X,SH1) < dist(X,SH0)) then X += 2^(N-1)
else
    X = SH1 + U
    if (dist(X,SH0) < dist(X,SH1)) then X += 2^(N-1)
if (dist(X,SH0) < dist(X,SH1))
    SH0 = Odd(X)
else
    SH1 = Odd(X)
// end

Table 3-1 shows an example of using DTSE to encode a three-bit sourceword space. The first column lists the original sourcewords. The two boldface numbers in this column are the sector heads (SH0 is 001 and SH1 is 011). The second and third columns give the distance with respect to the two sector heads. The fourth column shows the SH that should be used in the calculation of C(X,SH0;SH1), which is determined by comparing the distances to the two sector heads. The fifth column shows the offset of X to the closest sector head. The next column is C(X,SH0;SH1), which is calculated by casting the previous column to two bits and then applying the LSB-Inv function. The last column shows the codewords before transition signaling. The MSB of each codeword is the Sector-ID, i.e., it is zero for the addresses encoded with respect to SH0 and one for those encoded with respect to SH1.

Table 3-1 An example of DTSE encoding for a three-bit address space and sector heads equal to 001 and 011.
X   | dist(X,001) | dist(X,011) | Sector-ID, SH | X - SH_i | C(X,001;011) | Codeword
000 | 00          | 10          | 0, 001        | 111      | 10           | 010
001 | 00          | 01          | 0, 001        | 000      | 00           | 000
010 | 01          | 00          | 1, 011        | 111      | 10           | 110
011 | 10          | 00          | 1, 011        | 000      | 00           | 100
100 | 11          | 01          | 1, 011        | 001      | 01           | 101
101 | 11          | 10          | 1, 011        | 010      | 11           | 111
110 | 10          | 11          | 0, 001        | 101      | 01           | 001
111 | 01          | 11          | 0, 001        | 110      | 11           | 011

Table 3-2 shows how the values received on the bus are decoded in the receiver. The first column shows the bus, and the second column shows the codeword extracted from the bus by XORing consecutive bus values. The third column shows the sector head that should be used for decoding, based on the Sector-ID bit of the codeword. The next column shows U+SH (refer to the decoding algorithm). If X (the sourceword) is covered by the selected sector head, U+SH will be equal to X; otherwise, its MSB is inverted to compute X. Finally, the last two columns show the updated values of the sector heads after each decoding step. The initial values of the bus and sector heads are shown in parentheses at the top of their corresponding columns.

Table 3-2 An example of the DTSE decoder.

Bus (000) | Codeword | Sector-ID, SH | U+SH_i | X   | SH0 (001) | SH1 (011)
000       | 000      | 0, 001        | 001    | 001 | 001       | 011
110       | 110      | 1, 011        | 010    | 010 | 001       | 011
111       | 001      | 0, 001        | 010    | 110 | 111       | 011
101       | 010      | 0, 111        | 110    | 110 | 111       | 011

The DTSE encoding method can be extended to support more sectors. The resulting encoding/decoding functions get more complex as the number of sectors is increased. Nevertheless, the encoding is a very interesting function that is useful in different applications such as data compaction. In this extension, there is no restriction on the number of sectors except that it should be a power of two. Therefore, more than one bit may be used as Sector-ID bits. The rest of the bits contain the offset information with respect to one of the sector heads. Sectors are not contiguous in general and can comprise multiple segments.
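Before moving on to more sectors, the decoder pseudocode and the Table 3-1/3-2 walk-through can be replayed end-to-end with the sketch below (same assumptions as before: the LSB-Inv behavior is inferred from Table 3-1, and the helper names are ours):

```python
def dist(sh, x, n):
    mask = (1 << n) - 1
    d = (x - sh) & mask
    return d if d <= (1 << (n - 1)) - 1 else (sh - x - 1) & mask

def lsb_inv(x, k):
    # Assumed LSB-Inv: flip the k-1 low bits when the sign bit of the
    # k-bit value is set. It is an involution, so it undoes itself.
    x &= (1 << k) - 1
    return x ^ ((1 << (k - 1)) - 1) if x >> (k - 1) else x

def odd(x, n):
    return x & ((1 << n) - 1) if x & 1 else (x + 1) & ((1 << n) - 1)

def dtse_encode(x, heads, n):
    sec = 0 if dist(heads[0], x, n) < dist(heads[1], x, n) else 1
    c = lsb_inv(x - heads[sec], n - 1)
    heads[sec] = odd(x, n)
    return (sec << (n - 1)) | c

def dtse_decode(cw, heads, n):
    """Mirror of the DTSE decoder pseudocode (codeword already extracted
    from the bus, i.e. after the transition-signaling XOR)."""
    mask, half = (1 << n) - 1, 1 << (n - 1)
    sec = cw >> (n - 1)
    u = lsb_inv(cw, n - 1)                # undo LSB-Inv (involution)
    if u >> (n - 2):                      # sign-extend the (n-1)-bit offset
        u -= half
    x = (heads[sec] + u) & mask
    if dist(heads[1 - sec], x, n) < dist(heads[sec], x, n):
        x = (x + half) & mask             # X was in the NC set
    upd = 0 if dist(heads[0], x, n) < dist(heads[1], x, n) else 1
    heads[upd] = odd(x, n)
    return x

# Replay the Table 3-2 trace: sourcewords 001, 010, 110, 110.
tx, rx = [0b001, 0b011], [0b001, 0b011]
codes = [dtse_encode(x, tx, 3) for x in (1, 2, 6, 6)]
bus, prev = [], 0b000                     # transition signaling: XOR with
for c in codes:                           # the previous bus value
    prev ^= c
    bus.append(prev)
decoded = [dtse_decode(c, rx, 3) for c in codes]
```

Running this reproduces the Codeword and Bus columns of Table 3-2 and returns the original sourcewords, with the sender and receiver head registers staying synchronized throughout.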
These segments might be of different sizes. Therefore, if a stream of sourcewords is properly sectorized, this multi-sector dynamic sectorization method can adapt its sectors to the intrinsic sectors of the trace and thus effectively exploit locality within the trace. In the next section, we will look at the multiple-sector extension of DSE.

3.5.2 DYNAMIC MULTIPLE SECTOR ENCODING

Dynamic Multiple Sector Encoding (DMSE) is the most effective sector-based encoding technique for exploiting spatio-temporal correlation. It is capable of tracking sourcewords scattered in multiple sections of the space. It accomplishes this task by partitioning these scattered sourcewords into disjoint groups such that a higher level of locality is observed within each group. DMSE supports as many sectors as desired, with the constraint that the number of sectors must be a power of two. Before a formal description of this encoding technique, we explain the principles on which DMSE is built. After that, we describe how the codewords are generated: first, we define the encoding for those sourcewords that are close to the sector heads; then we explain how distant sourcewords are handled in DMSE. Finally, we show, through lemmas and theorems, that DMSE is indeed a one-to-one mapping.

The purpose of having several sector heads is to make it possible to encode nearby sourcewords with respect to a relatively close sector head, even though these sourcewords are not accessed sequentially in time. To be general, let us assume that there are 2^M sector heads (SH_0 to SH_{2^M - 1}). These sector heads can be anywhere within the sourceword space. A total of M bits of the codeword are automatically allocated to the Sector-ID bits (to be used for sector identification). The rest of the codeword bits are used for the offset bits.
In the sequel, we will see how to formulate and solve the problem of encoding the sourcewords in a manner such that an irredundant one-to-one mapping between sourcewords and codewords is obtained, while the sector-based offset encoding concept is also realized.

Definition 3-10 Class of X: Given X, let { X + i*2^(N-M), 0 <= i <= 2^M - 1 } denote a set of sourcewords. We call this set the class of X and denote it by [X].

Example. Consider a space with N=8 and M=2. [0] = [64] = [-128] = [-64] = {0, 64, -128, -64}.

Definition 3-11 Covering: Sourceword X is covered by sector head SH_i exactly if SH_i - 2^(N-M-1) ≼ X ≺ SH_i + 2^(N-M-1) (refer to Definition 3-6). Notice that when a sourceword is covered by a sector head, its offset to that sector head can be represented by (N-M) bits, i.e., if SH_i is subtracted from X, the result will not require more than (N-M) bits.

Since we have 2^M sectors, we need to reserve M bits of the codeword for the Sector-ID bits. Therefore, we only have (N-M) bits left, which can be used to encode the offset of X when X is covered by SH_i. In other words, for a covered sourceword, a codeword comprising M Sector-ID bits and (N-M) offset bits can be generated. This codeword is a suitable candidate that can meet the desired objective of exploiting locality. However, a potential problem is that, with this definition, some sourcewords may be covered by several of the sector heads, whereas some other sourcewords may be covered by none of them. Since exactly one codeword is needed for each sourceword, and in general some of the sourcewords are multiply covered, some potential codewords will be wasted, i.e., they will remain unused.
At the same time, some of the sourcewords will not be covered by any of the sector heads, i.e., they will remain exposed (not covered). So far, we have not specified how to generate codewords for the exposed sourcewords. The obvious approach is to employ the unused codewords to encode the exposed sourcewords, as explained below.

Example: Consider an 8-bit sourceword space [-128,127] with four sector heads SH_0 = 1, SH_1 = 81, SH_2 = -75, and SH_3 = -13 (N=8, M=2). The coverage regions of the four sector heads are [-31,32], [49,112], [-107,-44], and [-45,18], respectively. Examples of multiply covered sourcewords are -1, -2, etc., whereas examples of exposed sourcewords are -108, -109, etc. Also, we have dist(SH_1, 97) = 97-81 = 16, dist(SH_1, 127) = 127-81 = 46, dist(SH_3, -127) = -13-(-127)-1 = 113, etc.

Definition 3-12 Function secid(X): The "closest sector head" to a sourceword X is the sector head SH_secid(X) such that the distance from this sector head to X is minimum. To guarantee that SH_secid(X) is unique, we assume that the sector heads are initialized to distinct odd numbers in the N-bit space and remain odd every time they are updated. This guarantees that secid(X) is unique.
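The coverage regions and the exposed/doubly covered examples above can be checked mechanically. In the sketch below, the covering test follows Definition 3-11, restated (our reformulation) as "the signed offset X - SH fits in (N-M) bits":

```python
N, M = 8, 2
MASK = (1 << N) - 1
HEADS = [1, 81, -75 & MASK, -13 & MASK]     # SH_0..SH_3 from the example

def signed(x, n=N):
    """Interpret x mod 2^n as an n-bit two's-complement value."""
    x &= (1 << n) - 1
    return x - (1 << n) if x >= (1 << (n - 1)) else x

def covered(sh, x):
    """Definition 3-11 (restated): X is covered by SH iff the offset
    X - SH lies in [-2^(N-M-1), 2^(N-M-1) - 1]."""
    off = signed(x - sh)
    return -(1 << (N - M - 1)) <= off < (1 << (N - M - 1))

# Coverage region of each head, as a signed interval [lo, hi]:
regions = []
for sh in HEADS:
    pts = sorted(signed(x) for x in range(1 << N) if covered(sh, x))
    regions.append((pts[0], pts[-1]))
```

The computed regions match the four intervals stated in the example, -108 and -109 are covered by no head (exposed), and -1 and -2 are each covered by exactly two heads (SH_0 and SH_3).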
Y + 2^(N-M) = Z. Now we have:

SH covers Y, thus: SH - 2^(N-M-1) ≤ Y ≤ SH + 2^(N-M-1) - 1 ⇒ SH - 2^(N-M-1) ≤ Y.

SH covers Z, thus: SH - 2^(N-M-1) ≤ Z ≤ SH + 2^(N-M-1) - 1 ⇒ Y + 2^(N-M) ≤ SH + 2^(N-M-1) - 1 ⇒ Y ≤ SH - 2^(N-M-1) - 1,

which produces a conflict. This completes the necessity proof. ∎

Lemma 3-4. Any sector head SH covers at least one member of every class.

Proof. Given an arbitrary class of sourcewords, say [X], if SH is equal to a member Y of class [X], then SH clearly covers Y. Next, consider the case that SH differs from every member. First assume that Y < SH, where Y is the member of [X] satisfying Y < SH < Y + 2^(N-M). Now suppose SH covers neither Y nor Y + 2^(N-M), i.e.,

Y is not covered by SH ⇒ Y < SH - 2^(N-M-1);

Y + 2^(N-M) is not covered by SH ⇒ SH + 2^(N-M-1) ≤ Y + 2^(N-M) ⇒ SH - 2^(N-M-1) ≤ Y,

which gives rise to a conflict. So SH must cover either Y or Y + 2^(N-M). The case SH < Y is similar. This completes the proof. ∎

These lemmas are very useful. Based on Lemma 3-2, two sourcewords generate the same offset with respect to a sector head SH exactly when they are in the same class. Lemma 3-3 states that the same sector head cannot cover two distinct members of the same class at the same time. As a result, the closest sector head of two covered members of a class cannot be identical. Lemma 3-4 states that at least one member of every class is covered by a given sector head. Therefore, based on these lemmas, exactly one member of a class is covered by a given sector head. Suppose SH_i covers X. As mentioned earlier, a good candidate codeword for X would be

Code(X|SH_i) = i || LSB-Inv({X - SH_i}_(N-M)),

where || is the concatenation operator. If sourceword X is covered by SH_secid(X), their difference requires (N-M) bits, to which LSB-Inv is applied to provide additional compaction.
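Lemmas 3-3 and 3-4 together say that each sector head covers exactly one member of every class; this can be verified exhaustively for the running example (helper names are ours):

```python
# Brute-force check of Lemmas 3-3 and 3-4 for N=8, M=2, heads {1, 81, -75, -13}:
# every sector head covers exactly one member of every class.

N, M = 8, 2
HEADS = [1, 81, -75, -13]

def signed(v):
    v %= 1 << N
    return v - (1 << N) if v >= 1 << (N - 1) else v

def covers(sh, x):
    return (x - sh + (1 << (N - M - 1))) % (1 << N) < (1 << (N - M))

for r in range(1 << (N - M)):        # one representative per class
    members = [signed(r + i * (1 << (N - M))) for i in range(1 << M)]
    for sh in HEADS:
        assert sum(covers(sh, y) for y in members) == 1
print("each head covers exactly one member of each of the 64 classes")
```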
Note that LSB-Inv (Definition 2-25) is computed in the (N-M)-bit space in the above formula and is a one-to-one mapping. In this formula, i is the Sector-ID part and the rest is the offset part.

Theorem 3-2. Consider two sourcewords Y and Z covered by SH_i and SH_j, respectively:

Code(Y|SH_i) = Code(Z|SH_j) iff Y = Z.

Proof. Sufficiency is evident. For necessity, suppose that Y and Z are distinct and their codewords are equal. The Sector-ID parts of their codewords should then be equal, which directly implies that i = j, i.e., SH_i = SH_j. The lower parts should also be equal, and considering that LSB-Inv is a one-to-one mapping, the offsets of Y and Z with respect to SH_i = SH_j should be the same. Thus, Y and Z should be in the same class based on Lemma 3-2. However, based on Lemma 3-3, they cannot be covered by the same sector head unless they are equal. This is a contradiction and the proof is complete. ∎

3.5.2.2 Generating Codewords for Covered Sourcewords

We have found an effective way to generate distinct codewords for covered sourcewords; however, to make the offset part as small as possible, we use the sector head that is closest to the sourceword for generating the codeword. Therefore, the DMSE codeword is defined as follows:

DMSE(X) = secid(X) || LSB-Inv({X - SH_secid(X)}_(N-M)).

Example. Getting back to the 8-bit space example with the same four sector heads as in the previous example, suppose we want to encode X=3. The closest sector head is SH0. Therefore, we have: secid(3) = 0, SH0 = 1.

F(3) = DMSE(3) = 00b || LSB-Inv({00000011b - 00000001b}_6) = 00b || 000010b = 00000010b = 2.

However, X=3 is also covered by SH3: Code(3|SH3) = 11010000b = -48. This is a wasted code that is not going to be used by any other covered sourceword.
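LSB-Inv itself is Definition 2-25 from the previous chapter and is not reproduced here; the reading used in the sketch below — invert the k-1 low bits of a k-bit offset when its top bit is set — is inferred from the worked examples in this section, which it reproduces:

```python
# Codeword for a covered sourceword: Code(X|SH_i) = i || LSB-Inv({X-SH_i}_(N-M)).
# The lsb_inv reading is inferred from the examples; helper names are ours.

N, M = 8, 2

def lsb_inv(off, k):
    """Involution on k-bit values; offsets of small negative numbers get few ones."""
    return off ^ ((1 << (k - 1)) - 1) if off >= 1 << (k - 1) else off

def signed(v):
    v %= 1 << N
    return v - (1 << N) if v >= 1 << (N - 1) else v

def code(x, sh, sec_id):
    k = N - M
    return signed((sec_id << k) | lsb_inv((x - sh) % (1 << k), k))

print(code(3, 1, 0))     # 2   : F(3) = 00000010b from the example
print(code(3, -13, 3))   # -48 : the wasted codeword Code(3|SH3) = 11010000b
```

Note that `lsb_inv` is its own inverse, which is what lets the decoder undo it later.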
Now consider X=67. The closest sector head is SH1. For this sourceword, LSB-Inv makes a difference. We have: secid(67) = 1.

F(67) = DMSE(67) = 01b || LSB-Inv({01000011b - 01010001b}_6) = 01b || LSB-Inv(110010b) = 01b || 101101b = 01101101b = 109.

This is the first step toward guaranteeing a one-to-one function. We have found a code that maps a sub-space of the N-bit space (the sub-space comprised of the covered sourcewords) to another sub-space of the N-bit space in a one-to-one manner. As a result, the number of sourcewords not yet mapped (exposed sourcewords) is equal to the number of codewords not yet exploited (unused codewords). We have to find an algorithm for coding the exposed sourcewords. Before doing that, we prove that the above statement is also true within a class, i.e., the number of exposed sourcewords of a class is equal to the number of unused codewords of that class.

Theorem 3-3. For any given sourceword X, let S_i, 0 ≤ i ≤ 2^M - 1, denote the number of sector heads that cover the i-th member of the class, {X + i*2^(N-M)}_N. We have

Sum over i from 0 to 2^M - 1 of S_i = 2^M.

Proof. Each sector head covers exactly one member of the class. This means that if we count the number of sector heads that cover a member of a class and sum it over all members of the class, we will be counting the total number of sector heads, which is equal to 2^M. ∎

Theorem 3-4. For any given sourceword X, the number of exposed sourcewords in the class [X] is equal to the number of unused codewords in that class. An unused codeword is a Code(X|SH) where SH covers X but is not the closest sector head to X.

Proof. Since the S_i's are non-negative numbers, if one sourceword is covered more than once (S_i > 1), S_i - 1 other sourcewords will exist that are exposed. At the same time, S_i - 1 codewords are left unused for a sourceword covered S_i times.
Therefore, there is a one-to-one correspondence between the exposed sourcewords and the unused codewords with respect to any S_i that is greater than one. We conclude that in each class [X], the number of unused codewords is exactly the same as the number of exposed sourcewords. ∎

Example. In our 8-bit space example, consider X=3, with class [3] = {3, 67, -125, -61}. 3 is covered by both SH0 and SH3, 67 and -61 are covered by SH1 and SH2, respectively, and -125 is exposed. Thus, S0=2, S1=S3=1, S2=0. We see that -125 is an exposed sourceword. On the other hand, we saw in the previous example that -48 is an unused codeword in this class.

3.5.2.3 Generating Codewords for an Exposed Sourceword

All that is left is to present an algorithm capable of mapping the exposed sourcewords to the unused codewords in a one-to-one manner. Although it is possible to devise different ways of doing this mapping, it is very important to do it in a way that keeps the algorithm as simple as possible. Since our algorithm uses the unused codewords in each class to make up for the exposed sourcewords in that same class, in the worst case it is enough to evaluate only the sourcewords in that class to find the coding for a sourceword. The algorithm presented next maps the exposed sourcewords in a class to the unused codewords in the same class by examining them one by one. In the worst case, it has to visit all of the sourcewords in a class; thus, its complexity is of order 2^M. Because M is usually a very small number compared to N, the algorithm is in fact feasible. By "evaluating a sourceword", we refer to a variety of tasks, for example, checking whether one sector head is the closest sector head of a specific sourceword.
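The covered-sourceword path of DMSE and the coverage counts S_i can both be checked for the running example. In the sketch below, dist() is our reading of Definition 3-6 (defined earlier in the dissertation), chosen to be consistent with the worked examples: downward steps from a sector head are discounted by one, so the coverage region is exactly the set of sourcewords at distance less than 2^(N-M-1). Helper names are ours:

```python
# DMSE for covered sourcewords, reproducing F(3) = 2 and F(67) = 109, plus
# the coverage counts S_i of class [3] (S0=2, S1=1, S2=0, S3=1, summing to 4).

N, M = 8, 2
HEADS = [1, 81, -75, -13]
K = N - M

def signed(v):
    v %= 1 << N
    return v - (1 << N) if v >= 1 << (N - 1) else v

def covers(sh, x):
    return (x - sh + (1 << (K - 1))) % (1 << N) < (1 << K)

def dist(sh, x):
    """Assumed reading of Definition 3-6 (backward distance discounted by 1)."""
    return min((x - sh) % (1 << N), (sh - x - 1) % (1 << N))

def secid(x):
    return min(range(1 << M), key=lambda i: dist(HEADS[i], x))

def lsb_inv(off):
    return off ^ ((1 << (K - 1)) - 1) if off >= 1 << (K - 1) else off

def dmse_covered(x):
    i = secid(x)
    return signed((i << K) | lsb_inv((x - HEADS[i]) % (1 << K)))

print(dmse_covered(3), dmse_covered(67))   # 2 109

members = [signed(3 + i * (1 << K)) for i in range(1 << M)]
s = [sum(covers(sh, y) for sh in HEADS) for y in members]
print(members, s)   # [3, 67, -125, -61] [2, 1, 0, 1]
```

Because the sector heads are kept odd, no two heads can ever tie under this distance, which is what makes secid(X) unique.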
Evaluation is much easier than generating the codeword for a sourceword. In practice, the number of evaluations needed to map a given sourceword is often much smaller than 2^M.

Definition 3-13 Function CM(X,SH): (which stands for Covered Member in class [X] by sector head SH) is defined as: CM(X,SH) = Y ∈ [X] such that SH covers Y. This member exists and is unique based on Lemma 3-4 and Lemma 3-3.

Algorithm SHE(X) (which stands for Sector Head for the Exposed sourceword X) is presented next. This algorithm finds the appropriate sector head for encoding an exposed sourceword.

ALGORITHM SHE(X): Sector Head for Exposed Sourceword X
1. Y = CM(X, SH_0); SEC = secid(Y);
2. Y += 2^(N-M) (mod 2^N) until Y becomes an exposed sourceword;
3. SEC++ (mod 2^M) until secid(CM(X, SH_SEC)) is not equal to SEC;
4. If Y is not equal to X, go to 2;
5. return SEC;

Now that we have found SHE(X), the codeword can easily be generated as if this sector head covers X, i.e.,

DMSE(X) = SHE(X) || LSB-Inv({X - SH_SHE(X)}_(N-M)).

The closest sector head to X, i.e., secid(X), is defined for all (both covered and exposed) sourcewords. It can be equal to or different from SHE(X) (remember that SHE(X) is only defined for an exposed sourceword).

Theorem 3-5. Algorithm SHE(X) leads to a one-to-one mapping of exposed sourcewords to the unused codewords in class [X].

Proof. First, it finds the member in [X] that is covered by SH_0 and calls it Y. In step 2, the next exposed member in [Y] is found by adding 2^(N-M) to Y (modulo 2^N) repeatedly. In step 3 of the algorithm, starting from secid(CM(X, SH_0)) and incrementing, the next sector head that is unused in class [Y] is found, i.e., a sector head that covers a member of [Y] but is not that member's closest sector head.
The exposed sourceword found in step 2 and the unused sector head found in step 3 are matched together, and this process is repeated. In particular, suppose X is the j-th exposed member of [Y] greater than Y, i.e., the j-th member not covered by any sector head. Likewise, we choose SHE(X) to be the j-th sector head after secid(CM(X, SH_0)) that has not been used in the encoding of any of the covered sourcewords. In this way, we create a one-to-one mapping between the unused codewords and the exposed sourcewords. Therefore, a one-to-one mapping has been established between the exposed members of the class and the unused codewords generated by the other members (note that, by Lemma 3-2, an exposed member generates the same offset as the covered member of its class with respect to any given sector head, so the codeword assigned this way is precisely one of the unused codewords). ∎

Example. In our 8-bit space example, let us calculate SHE(-125). In step 1, Y is initialized to 3 = CM(-125, SH_0), and SEC is initialized to secid(3) = 0. In step 2, 64 is added to Y twice to generate an exposed sourceword, which is -125. In step 3, SEC is incremented. SEC=1 is rejected because 1 = secid(CM(-125, SH_1)) = secid(67). Similarly, SEC=2 is rejected. Finally, SEC=3 is accepted. The algorithm terminates because Y=X; thus, SHE(-125) = 3. The codeword is calculated to be DMSE(-125) = -48. This is exactly the unused codeword in class [3] that was shown in the previous examples.

Theorem 3-6. The DMSE code is a one-to-one mapping in the N-bit space.

Proof. The argument of Theorem 3-5 is valid in all 2^(N-M) classes, and therefore DMSE as a whole is a one-to-one code. ∎

In practice, if the trace has good spatio-temporal locality and a sufficient number of sector heads is employed, sourcewords tend to be covered most of the time. There is rarely a need to carry out the SHE algorithm. In other words, the average time required to do the coding is much shorter than the worst-case running time of SHE, because usually there is no need to invoke this algorithm. DMSE is effective only when the sourcewords are covered.
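The steps of SHE can be transcribed directly; the sketch below (helper names and the dist() reading are our own assumptions, as before) reproduces the SHE(-125) example above, where steps 2 and 3 advance at least once on each round:

```python
# Sketch of algorithm SHE(X) for the running example (N=8, M=2,
# sector heads {1, 81, -75, -13}).

N, M = 8, 2
HEADS = [1, 81, -75, -13]
K = N - M

def signed(v):
    v %= 1 << N
    return v - (1 << N) if v >= 1 << (N - 1) else v

def covers(sh, x):
    return (x - sh + (1 << (K - 1))) % (1 << N) < (1 << K)

def exposed(x):
    return not any(covers(sh, x) for sh in HEADS)

def dist(sh, x):   # assumed reading of Definition 3-6
    return min((x - sh) % (1 << N), (sh - x - 1) % (1 << N))

def secid(x):
    return min(range(1 << M), key=lambda i: dist(HEADS[i], x))

def cm(x, sh):
    """Covered Member (Definition 3-13): the member of [X] covered by SH."""
    return next(y for y in (signed(x + i * (1 << K)) for i in range(1 << M))
                if covers(sh, y))

def she(x):
    """Sector head index for the exposed sourceword X."""
    y = cm(x, HEADS[0])                 # step 1
    sec = secid(y)
    while True:
        y = signed(y + (1 << K))        # step 2: next exposed member
        while not exposed(y):
            y = signed(y + (1 << K))
        sec = (sec + 1) % (1 << M)      # step 3: next unused sector head
        while secid(cm(x, HEADS[sec])) == sec:
            sec = (sec + 1) % (1 << M)
        if y == x:                      # step 4
            return sec

print(she(-125))   # 3, matching the worked example
```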
For exposed sourcewords, the DMSE performance cannot be predicted. As long as the number of exposed sourcewords in a trace is not too large, the overall performance of the DMSE technique will be quite remarkable.

The decoder can easily perform the reverse of the above operations. First, it retrieves the sourceword under the assumption that it was covered and was encoded with respect to its closest sector head. Next, it checks whether the decoded sourceword happens to be closer to another sector head. If this is not the case, then the decoding has been performed properly and the sourceword is accepted. Otherwise, the decoder knows that the sourceword is actually an exposed sourceword that belongs to the class of the currently retrieved sourceword. Therefore, it has to perform the reverse steps of the SHE(X) algorithm (we call this inverse algorithm AS(Z,X), which stands for Adjust Sector of X) to find the actual sourceword. While this process is under execution, the receiver can already use the lower (N-M) bits of the sourceword, since it knows that they will not change. The pseudocode for the complete encoder and decoder of DMSE is provided next.

// DMSE Encoder
if X = CM(X, SH_secid(X)) then
    Code(X) = secid(X) || LSB-Inv({X - SH_secid(X)}_(N-M));
else
    Code(X) = SHE(X) || LSB-Inv({X - SH_SHE(X)}_(N-M));
Update sector heads;
// end

And the pseudocode for the decoder:

// DMSE Decoder
Z = BUS(X_i) XOR BUS(X_(i-1))   // Z is the received codeword
secZ = Z[N] ... Z[N-M+1]; offZ = Z[N-M] ... Z[1];
X = SH_secZ + LSB-Inv(offZ);
if secid(X) = secZ then
    Source(Z) = X;
else   // X corresponds to an exposed sourceword
    Source(Z) = AS(Z, X);
Update sector heads;
// end

Algorithm AS(Z,X) basically performs the inverse operation of SHE(X) and is described below.
ALGORITHM AS(Z,X): Adjust Sector of received codeword Z in class [X]
1. Y = CM(X, SH_0); SEC = secid(Y);
2. SEC++ (mod 2^M) until secid(CM(X, SH_SEC)) is not equal to SEC;
3. Y += 2^(N-M) (mod 2^N) until Y becomes an exposed sourceword;
4. If secZ is not equal to SEC, go to 2;
5. return Y;

In Table 3-3, we have shown the encoding of the 5-bit address space with 4 sector heads. We have intentionally chosen the first three sector heads to be close to each other. By doing this, we have increased the number of exposed and multiply covered sourcewords. The sector heads are the entries 00001, 00011, 00111, and 10101. As can be observed, for the sourcewords that are equal to the sector heads, only the Sector-ID bits are non-zero and the offset is equal to zero. In addition, we have marked the exposed sourcewords with an asterisk.

Table 3-3 DMSE encoding for a 5-bit bus using four sector heads equal to {1, 3, 7, 21}. Entries marked with * are exposed sourcewords.

X       F(X)    X       F(X)    X       F(X)    X       F(X)
00000   00100   01000   10001   10000*  01110   11000   11011
00001   00000   01001   10010   10001   11111   11001*  01101
00010   01100   01010   10011   10010   11110   11010*  00001
00011   01000   01011*  10111   10011   11101   11011*  00010
00100   01001   01100*  10110   10100   11100   11100*  00011
00101   10101   01101*  01010   10101   11000   11101   00111
00110   10100   01110*  01011   10110   11001   11110   00110
00111   10000   01111*  01111   10111   11010   11111   00101

In Figure 3-5, we compare the number of one's in the offset part of the codewords. We consider the offset part only because we want to show that the number of one's is reduced for sourcewords close to the sector heads. As mentioned above, for sourcewords that are equal to the sector heads, the offset is zero.
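Putting the encoder and decoder together, the same machinery sketched earlier reproduces Table 3-3 and round-trips every codeword, including the exposed ones handled by AS. As before, the helper names and the dist() reading are our own assumptions:

```python
# Reproducing Table 3-3 (N=5, M=2, heads {1, 3, 7, 21}) and round-tripping
# every codeword through the decoder (algorithm AS included).

N, M = 5, 2
HEADS = [1, 3, 7, 21]
K = N - M
W = 1 << N

def covers(sh, x): return (x - sh + (1 << (K - 1))) % W < (1 << K)
def exposed(x): return not any(covers(sh, x) for sh in HEADS)
def dist(sh, x): return min((x - sh) % W, (sh - x - 1) % W)
def secid(x): return min(range(1 << M), key=lambda i: dist(HEADS[i], x))
def lsb_inv(o): return o ^ ((1 << (K - 1)) - 1) if o >= 1 << (K - 1) else o

def cm(x, sh):
    return next(y for y in ((x + i * (1 << K)) % W for i in range(1 << M))
                if covers(sh, y))

def exposed_members(x):          # exposed members of [X] in SHE/AS scan order
    y = cm(x, HEADS[0])
    while True:
        y = (y + (1 << K)) % W
        if exposed(y):
            yield y

def unused_secs(x):              # unused sector heads in SHE/AS scan order
    sec = secid(cm(x, HEADS[0]))
    while True:
        sec = (sec + 1) % (1 << M)
        if secid(cm(x, HEADS[sec])) != sec:
            yield sec

def dmse(x):                     # SHE pairs the j-th exposed with j-th unused
    if exposed(x):
        i = next(s for y, s in zip(exposed_members(x), unused_secs(x)) if y == x)
    else:
        i = secid(x)
    return (i << K) | lsb_inv((x - HEADS[i]) % (1 << K))

def decode(z):
    sec_z, off = z >> K, lsb_inv(z & ((1 << K) - 1))
    off = off - (1 << K) if off >= 1 << (K - 1) else off   # signed offset
    x = (HEADS[sec_z] + off) % W
    if secid(x) == sec_z:
        return x
    return next(y for y, s in zip(exposed_members(x), unused_secs(x))
                if s == sec_z)   # algorithm AS(Z, X)

for x in (0b00000, 0b01011, 0b10000, 0b10101):
    print(format(x, "05b"), "->", format(dmse(x), "05b"))
# 00000 -> 00100, 01011 -> 10111, 10000 -> 01110, 10101 -> 11000

assert len({dmse(x) for x in range(W)}) == W        # one-to-one (Theorem 3-6)
assert all(decode(dmse(x)) == x for x in range(W))  # decoder inverts encoder
```

Reading the offset as a signed (N-M)-bit number places the decoded candidate inside SH_secZ's coverage region, so covered sourcewords decode in one step and the secid mismatch test cleanly detects the exposed case.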
[Figure 3-5: Comparison of the number of one's in the offset part of the codewords for different sourcewords; the count drops to zero at the sector heads.]

Initialization of the sector heads plays a major role in determining the effectiveness of the encoding. Suppose that in each step we only update the sector head closest to the sourceword; it is then possible that some sector heads are not used for a long period of time. In such a case, the performance of the encoding may be significantly degraded, since only a few of the sector heads will be used. Different scenarios can be constructed where a sector head gets stuck at one point in the sourceword space. We need to make sure that such circumstances do not persist. To avoid these kinds of problems, some sort of scrambling that continuously shuffles the sector heads is needed. Although this shuffling may cause a marginal degradation in the performance of the encoding for some data streams, it makes the whole process more robust. Performing the scrambling de-sensitizes the algorithm to the sector head initialization. To implement the scrambling, instead of always updating one sector head, sometimes two of them are updated (as will be seen in the experimental results section). This enforces a constant movement of the sector heads, and therefore it will not be possible for a sector head to remain frozen for a long time.

In the next section of this chapter, we will see how the FSE and DSE techniques can be put into application. After the locality is exploited, these techniques can be used for various purposes, such as reducing the number of transitions in a trace or reducing the size of the trace. Experimental results confirm the effectiveness of the sector-based encoding techniques.

3.6 EXPERIMENTAL RESULTS

To show the effectiveness of the sector-based encoding, we have selected a variety of different input streams.
First, we look at the results of applying these techniques to different traces for low-power encoding applications. We compare the sector-based techniques with other low-power encoding techniques and show the effectiveness of our approach. Because of the increased complexity of DSE when the number of sectors goes beyond two, we will not utilize DMSE as a low-power bus encoding technique, and comprehensive low-power encoding results will only be reported for the DTSE and FSE techniques. However, to illustrate the effectiveness of DMSE in exploiting locality, we compare the results of applying DMSE to data addresses using configurations with up to 32 sectors. The next part of the experimental results concerns the application of DSE for lossless data compaction. This part includes two sub-sections. In the first sub-section, we examine TIFF image files and compact them using DMSE (losslessly). In the second, we show how DMSE can be used to reduce the size of the communicated data in a typical sensor network scenario, where multiple sensors transmit different measured quantities over a shared channel to a base station. The number of communicated bits is quite important for the lifetime of the sensor network. We will see how the sector-based encoding techniques can reduce the size of such a trace by exploiting its locality.

3.6.1 SECTOR-BASED ENCODING FOR LOW POWER

To evaluate our encoding techniques, we again used the SPEC2000 [75] benchmark programs (refer to 2.4.4) with the SimpleScalar simulator [72]. We used different sets of traces for reporting the performance of the sector-based encoding techniques compared to the previous chapter. We generated three different sets of address traces, each representing a different memory configuration.
The first two sets of traces were generated for a memory system without an on-chip cache and are traces of data and multiplexed addresses, respectively. A data address trace includes only the data accesses and assumes that the data and instruction buses are separate. A multiplexed address trace includes both instruction and data addresses. The amount of correlation in multiplexed address traces is much higher than in data address traces because of the included instruction addresses. The third set of traces was generated for a system with two levels of internal caches and a memory management unit that translates second-level cache misses into physical addresses. The second-level cache is a unified cache; therefore, addresses that miss this cache are either instruction or data addresses requesting instruction or data blocks. We compared our proposed techniques with INC-XOR [53], Gray encoding [61], and the Working Zone technique [48]. INC-XOR and Gray coding are reported in [53] to have produced the best results for data and multiplexed addresses. For the Working Zone method, we used two registers and thus call this method WZE-2 for short. First, we present a detailed comparison of our methods and the above techniques when applied to the data address traces in Table 3-4 and Table 3-5. These tables show the percentage of savings, i.e., the percentage of cancelled transitions, for the above methods and various alternatives of FSE and DSE. In Table 3-4, we have also included the results of applying FTSE and FMSE(3). The number in parentheses is the number of bits allocated for the Sector-ID; FMSE(3) is therefore a fixed multiple-sector encoding using 8 sector heads. Dispersed sectorization has also been used to improve the results. In this case, FMSE(3) does not perform better than FTSE. As mentioned earlier, data
addresses are the most difficult kind of addresses in terms of exploiting their locality.

Table 3-4 Percentage savings for traces of data addresses (no cache).

            WZE-2    INC-XOR   Gray     FTSE     FMSE(3)
vpr         36.1%    -7.9%     39.0%    56.8%    52.1%
parser      35.3%    -1.5%     40.3%    66.8%    64.4%
equake      16.8%     3.3%     28.9%    47.8%    37.4%
vortex      19.5%     3.0%     24.5%    45.4%    47.2%
gcc         16.8%    -2.6%     30.0%    56.3%    47.9%
art         15.5%    -12.5%    24.6%    59.1%    54.7%
Average     23%      -3%       31%      55%      51%

Next, we have the results of applying the DSE techniques to the same set of traces, presented in Table 3-5. The following updating policy has been used for these schemes:

// Updating the sector heads
SH_secid(X) = Odd(X)   // refer to Definition 3-9
count = count + 1 (mod 4)
if (count == 0)
    SH_(secid(X)-1 (mod 2^M)) = Odd(X) - 2
// end

In the above updating policy, count is an integer that is initialized to zero at the beginning of the algorithm and is incremented in each step. Every time it equals zero modulo 4, two of the sector heads are updated. This process accomplishes the goal of scrambling the sector heads: every four steps, the used sector head and the one preceding it are placed close to one another. Based on detailed experiments, we have reached the conclusion that four is the right number of steps for this double update. Another technique that we use to improve the results is prediction of the Sector-ID bits. If two consecutive sourcewords are in the same sector, their Sector-ID bits will be identical; therefore, instead of transmitting the Sector-ID bits directly, we can transmit the XOR-difference (refer to Definition 2-11) of the Sector-ID of the current codeword and that of the previous codeword. In this way, if the two Sector-IDs are equal, their XOR difference will be zero and the activity will be further reduced.
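The updating policy and the Sector-ID XOR-difference prediction just described can be sketched together. Odd(X) is Definition 3-9, which is not reproduced in this chapter; we assume here that it simply forces the least-significant bit of X to one, and all other helper names are ours:

```python
# Sketch of the sector-head updating policy (with the every-4th-step double
# update) and of Sector-ID XOR prediction.  Odd(X)'s reading is an assumption.

N, M = 8, 2
K = N - M
heads = [1, 81, -75, -13]      # example (odd) sector heads
count = 0

def odd(x):
    return x | 1               # assumed reading of Odd(X): force the LSB to 1

def update(x, sec):
    """Update SH_secid(X); every fourth update also moves the preceding head."""
    global count
    heads[sec] = odd(x)
    count = (count + 1) % 4
    if count == 0:
        heads[(sec - 1) % (1 << M)] = odd(x) - 2

for _ in range(4):
    update(40, 0)
print(heads)                   # [41, 81, -75, 39]: SH3 was dragged next to SH0

def xor_sector_id(codeword, prev_codeword):
    """Replace the Sector-ID field by its XOR with the previous codeword's."""
    c, p = codeword % (1 << N), prev_codeword % (1 << N)
    return (((c >> K) ^ (p >> K)) << K) | (c & ((1 << K) - 1))

print(bin(xor_sector_id(0b01000010, 0b01000001)))   # 0b10: Sector-ID became 00
```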
We refer to this method as "Sector-ID XOR signaling". The results are reported in Table 3-5.

Table 3-5 Percentage savings for traces of data addresses (no cache).

            DTSE     DMSE(2)   DMSE(3)   DMSE(4)   DMSE(5)
vpr         60.1%    70.1%     69.9%     67.7%     65.6%
parser      57.7%    68.0%     73.8%     76.4%     76.3%
equake      45.0%    47.2%     58.1%     59.2%     58.3%
vortex      49.5%    56.2%     62.2%     63.0%     64.0%
gcc         49.6%    60.4%     65.7%     66.6%     67.0%
art         55.4%    62.5%     77.7%     83.5%     79.6%
Average     52.9%    60.7%     67.9%     69.4%     68.5%

As can be seen, results have been reported for 2 up to 32 sector heads. When the number of sector heads increases from 16 to 32, the results get marginally worse. The reason is that the transition overhead caused by the Sector-ID bits themselves becomes too large and offsets the overall gain. In fact, in Table 3-6 we show the average contribution of the Sector-ID bits and the offset bits to the DSE codeword transitions for various numbers of sector heads. When we go from 16 sector heads to 32, the contribution of the Sector-ID bits increases by about 10% (simply because one more bit is added to the Sector-ID), while the contribution of the offset bits decreases by about 10%, and these two cancel each other out. This highlights a very nice attribute of DSE: increasing the number of sector heads is very effective in reducing the transitions caused by the offset bits, but for this specific example it is not helpful because the gain is cancelled by the Sector-ID transitions. In wider buses, increasing the number of sector heads would certainly be more beneficial. Using 8 sector heads, only 32.1% of the transitions remain, i.e., 68% of the transitions are eliminated in the data streams, which is much higher than the 55% reduction reported in Table 3-4 for FTSE. For example, for the benchmark vpr, after applying DMSE(3), the remaining codeword transitions are 30.1% of the original transitions.
In the codeword stream, 17.2% of the transitions are due to transitions on the Sector-ID bits and 12.8% are caused by the offset bits.

Table 3-6 Average contribution of Sector-ID bits and offset bits to the total remaining transitions.

                 DTSE     DMSE(2)   DMSE(3)   DMSE(4)   DMSE(5)
Sector-ID bits   9.9%     22.6%     39.5%     52.6%     62.0%
Offset bits      90.0%    77.3%     60.4%     47.3%     37.9%

Next, we compare the encoding techniques of Table 3-4 and DTSE against the various previously proposed techniques described earlier. We decided not to go beyond two sector heads for DSE; since the goal is reducing on-chip bus power consumption, we want to control the level of complexity of the encoder and decoder. We do not go into the detailed results and only report the final averages obtained for the same sets of benchmarks described earlier. The results are shown in Table 3-7. As we can see, our techniques outperform the other techniques for all kinds of traces. INC-XOR is ineffective for these traces because it counts on sequential behavior of the sourcewords, which is not present in data and multiplexed address traces. The Gray and WZE-2 techniques perform much better than INC-XOR, but still fall short of our techniques. For the multiplexed addresses with cache, FMSE(3) performs significantly better than the other techniques. This is due to the fact that, in this case, the number of intrinsic sectors over which the sourcewords are scattered is higher than for the previous sets of traces. As a result, the benefits of having more than two sectors become more pronounced in this case.

Table 3-7 Average transition savings for different techniques.
                                  INC-XOR   Gray   WZE-2   DTSE   FTSE   FMSE(3)
Data Address (No Cache)           -3%       31%    23%     51%    55%    51% (1)
Multiplexed Address (No Cache)     6%       25%    47%     41%    52%    67% (3)
Multiplexed Address (w. Cache)     1%       10%    16%     19%     6%    55% (3)

3.6.2 SECTOR-BASED ENCODING FOR DATA COMPACTION

In this section, we investigate the application of DSE for data compaction.

3.6.2.1 Compaction of TIFF Image Files

We also employ the DMSE techniques for compaction of colored image files in TIFF (tag-based image file format). The pictures were chosen from volume 4 of the USC SIPI Image Database (USID) [73]. The color depth for each pixel is 24 bits, i.e., 8 bits per color channel (red, blue, and green). It is likely that for two neighboring pixels the values of the same color channel are very close to one another, which we refer to as chromatic locality [25]. Our goal is to exploit this locality and compress these TIFF files using DMSE. Therefore, we apply DMSE to words of size 8 bits. Similar to what we did with Sector-ID XOR signaling, we also need to minimize the transitions of the Sector-ID bits. It would not be helpful to use the plain Sector-ID XOR signaling, because there is no
First, if the Sector-ID bits are not zero in the codeword, we just send the original sourceword. However, we do update the sector heads. The second one is when the Sector-ID bits get zero as a result of chromatic Sector-ID XOR signaling. In this case, clearly, M bits get eliminated in the first place. We also optimize the offset. Offset is composed of a sign bit and a distance. This distance is likely to need less than the N-M -l bits that have been allocated for it. Let’s say that, most of the time, we observe that it requires N -M -l-a bits. So, an additional group of a bits of data is saved. We find a such that most packets with zero Sector-ID bits fit into it. To identify these two packets from one another, one additional bit is required for each packet. More precisely, each packet would have either of the following formats: 128 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 1 bit, 1 for Long packet 8 bits, Original Data 1 bit, 0 for Short packet (N-M-a) bits, reduced DMSE Encoded data Notice that for the long packets, we send the original data instead of the codeword, because sending the codeword will not help us do a better job of compacting the sourceword and we just make the decoder’s job more difficult. However, we do update the sector heads, because we want them to keep on tracking the sourcewords. The updating policy is very similar to that for the data addresses. The only difference is that when we update two sector heads, we do not place them as close to one another as in the previous case. Details are given below. //U p d a tin g the sector heads Experimental results demonstrate that an average of 23% reduction in the size of the TIFF files is achieved by using DMSE with M equal to 4. Note that this is a lossless compaction of the TIFF files. SH secid(X ) = Odd(X) count = count + 1 ( M od 4) If ( count = = 0 )(M od 4) O dd(X )-2 / / end 129 Reproduced with permission of the copyright owner. 
Table 3-8 Compaction ratio for different images from the SIPI database.

TIFF image   M=2, a=1   M=3, a=1   M=4, a=1   M=5, a=1
4.1.05       90.6%      77.5%      75.2%      78.2%
4.1.06       88.8%      86.4%      86.0%      88.4%
4.1.07       87.0%      80.5%      66.7%      55.8%
4.1.08       86.9%      73.8%      74.4%      61.2%
4.2.01       84.6%      72.9%      69.6%      72.1%
4.2.02       89.5%      77.8%      76.8%      79.6%
4.2.04       86.7%      80.5%      82.4%      89.1%
4.2.07       86.7%      82.5%      86.9%      92.6%
Average      87.6%      79.0%      77.2%      77.1%

3.6.2.2 Compaction of Mixed Streams

In sensor networks, each node is responsible for sensing different physical/environmental phenomena and sending the sensor data over a wireless network to the base station. Considered alone, these streams tend to have considerable locality that can easily be exploited. However, when the values are mixed with each other, spatial locality changes to spatio-temporal locality, which is obviously more difficult to deal with. DMSE is very effective for compacting these mixed streams. To demonstrate this, we show how DMSE can be applied for lossless compaction of mixed streams. In this experiment, we constructed two different streams, stream1 and stream2. In each of the streams, the difference between two consecutive sourcewords follows a normal distribution; therefore, considerable spatial locality exists in these streams. Next, we generated six other streams based on stream1 and stream2 as follows: in the stream called mix-i, the index i
This correlation is highest when the streams are not mixed. For mixed streams, as the mix-chunk is increased from 1 to 6, the correlation goes up. We use a simple algorithm for compacting each of the streams, and we will show how this compaction algorithm can be improved for the mixed streams when it is applied over DMSE codewords instead of the original sourcewords. For the original sourcewords, the compaction works as follows. We first calculate the difference between consecutive sourcewords. Next, we encode them as packets of two different sizes, i.e., long ones and short ones. We find the optimum size for the short packets; if any difference is larger than that, the original packet is sent as is, otherwise it is reduced to the short packet size. Of course, we need one bit per packet so that we can distinguish between the two packet types. Since the correlation goes down in mixed streams, the achievable compaction goes down as the mix-chunk gets small. Now we show that by using DMSE, the compaction factor can be significantly increased. We have used two different versions of DMSE: the first one without Sector-ID XOR signaling and the second one with it. For the version without Sector-ID XOR signaling, we use exactly the same compaction algorithm as explained above. For the version with Sector-ID XOR signaling, we used the compaction algorithm that we used for the TIFF files; basically, these compaction algorithms are very similar. The updating algorithm is the same as the one used for data addresses in both cases. As expected, better results are obtained with the second method when the mix-chunk is larger. For mix-1, around 19% (75.0% versus 93.8%) additional compaction is achievable when we apply DMSE. The best improvement achieved by DMSE has been highlighted in bold face for each stream.
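The baseline two-size difference compaction just described can be sketched as follows. This is a hedged illustration: the signed-field convention for the short packets and the handling of the first sourceword are assumptions, since the text does not spell them out.

```python
def compacted_bits(stream, word_bits, short_bits):
    """Total encoded size in bits under the two-size scheme: every packet
    carries a one-bit long/short flag; consecutive differences that fit in
    a signed `short_bits` field are sent short, the rest are sent as the
    original `word_bits` sourceword.  (The first sourceword is assumed to
    be sent in full.)"""
    total = word_bits
    for prev, cur in zip(stream, stream[1:]):
        diff = cur - prev
        if -(1 << (short_bits - 1)) <= diff < (1 << (short_bits - 1)):
            total += 1 + short_bits      # short packet: flag + reduced field
        else:
            total += 1 + word_bits       # long packet: flag + original data
    return total

def best_short_size(stream, word_bits):
    """Search for the short-packet width that minimises the encoded size,
    i.e., the 'optimum size for the short packets' of the text."""
    return min(range(1, word_bits),
               key=lambda k: compacted_bits(stream, word_bits, k))
```

For a 16-bit stream `[0, 1, 2, 3, 100]`, three differences of 1 go out as 3-bit short packets and the difference of 97 goes out long, for 42 bits in total versus 80 uncompacted.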
No improvement was, however, achieved for the original streams because their locality is mostly spatial, not spatio-temporal. The higher the spatio-temporal locality of the stream, the bigger the difference DMSE makes.

Table 3-9 Effectiveness of DMSE for mixed streams.

                           Compacted size, DMSE
          Corr.  Compacted  No Sector-ID XOR        Sector-ID XOR
                 size,      signaling               signaling
                 No DMSE    k=1    k=2    k=3       k=1    k=2    k=3
stream1    0.99  64.3%      68.1%  72.8%  78.0%     66.0%  70.4%  75.0%
stream2    0.99  50.6%      53.8%  59.3%  64.8%     52.1%  56.6%  61.7%
mix-1     -0.36  93.8%      75.0%  75.5%  78.7%     76.2%  77.6%  81.2%
mix-2      0.31  83.4%      75.0%  75.6%  78.8%     72.5%  71.6%  73.9%
mix-3      0.54  76.8%      75.8%  73.0%  76.7%     74.0%  69.7%  72.4%
mix-4      0.65  73.5%      75.8%  72.1%  75.8%     73.7%  67.5%  71.0%
mix-5      0.72  71.4%      73.3%  72.1%  76.5%     72.2%  68.0%  71.2%
mix-6      0.77  70.0%      72.4%  72.4%  72.4%     70.7%  67.5%  70.0%

3.7 CONCLUSIONS

In this chapter, we have proposed a very efficient encoding technique, called Sector-based Encoding (SE), for exploiting locality in different kinds of streams. This technique is particularly suitable for exploiting spatio-temporal locality. The basic idea is to create disjoint sectors, each with a sector head that is used to encode the sourcewords falling in that sector. The sectorization scheme is either fixed or dynamic. Sector heads are continuously updated in accordance with the sourcewords; in the dynamic scheme, the sectorization itself also changes dynamically. SE results in irredundant codes; thus it is a very suitable technique for low power bus encoding and lossless data compaction. We demonstrated the effectiveness of SE by applying it to different kinds of streams extracted from different application scenarios. We first used FSE (Fixed SE) for reducing the total activity of different traces.
Next, we showed that DSE (Dynamic SE) can be used for compaction of TIFF image files and streams with a high level of spatio-temporal locality.

Chapter 4

INSTRUCTION-SET-AWARE MEMORIES

4.1 INTRODUCTION

In this chapter, we propose a new technique for encoding addresses on the memory address bus. This technique, which we call BEAM for "Bus Encoding based on instruction-set-Aware Memories", is based on programmable smart memories that can be configured to attain a certain awareness of the instruction set architecture of the processor. Consequently, to some extent, these memories can calculate or predict the instruction or data addresses, thereby relieving the processor of the burden of sending these addresses on the memory bus. In the BEAM technique, the memory carries out the calculation or prediction of the next instruction or data address based on the information that it has collected from the instructions executed so far. The processor supervises the memory's address generation. If the address generated by the memory is correct, the processor does not send anything on the bus; otherwise it sends enough information to help the memory correct its prediction. This results in less traffic and lower activity on the address bus, which in turn reduces the switched capacitance of the bus. The BEAM technique uses a simple decoder and encoder compared to other bus encoding techniques; therefore, BEAM can effectively reduce the power consumption of the memory bus subsystem, as will be illustrated in this chapter.

4.2 BASIC APPROACH

Embedded processors dominate the total number of shipped processors.
Many of these processors are designed for applications that do not require very high performance compared to state-of-the-art general-purpose processors. The clock frequency of these processors is also much lower than that of their general-purpose counterparts. Low power consumption is often a vital design criterion for embedded processors, especially when these processors are designed for use in battery-powered systems. Tight constraints on power consumption prevent embedded processors from having complex micro-architectural features such as memory management, branch prediction, and out-of-order execution. Furthermore, many embedded processors do not have on-chip caches.1 In this chapter, we assume a processor without an internal cache or out-of-order execution, which is a good example of a low power embedded processor. In such a system, the energy consumed on the external memory bus of the processor is typically a large portion of the total power dissipation. Therefore, low power bus encoding techniques can greatly reduce the overall power consumption of such a system. In a typical embedded system, to access an instruction or data, addresses are generated in the processor and sent over an address bus to the memory. In the BEAM technique, in most cases, the address is generated inside the memory; consequently, there is no need for the processor to send everything over the bus. The address is either calculated or predicted in the memory. When we say the address is calculated, we mean that the memory is able to generate the address correctly, whereas in the case of prediction the address might not be correct. If the memory is able to calculate the address, the processor knows it and does not do anything. Otherwise, the processor follows the same technique to predict the next address.
Therefore, the processor will be able to verify the correctness of the address prediction in the memory. If the predicted address is correct, the processor does not do anything. Otherwise, it intervenes by sending either a signal or the correct address to the memory. In the remainder of this chapter we use calculating and predicting interchangeably.

1 Some fast-growing categories of applications in which on-chip caches are not used are stream processing and network processing. Using a cache does not help to increase performance for these applications.

Predicting the address values leads to a reduction in the switched capacitance of the bus; therefore, it reduces the power consumption. Both instruction and data addresses can be predicted. In this chapter, we examine instruction and data addresses separately and propose methods that are appropriate for each of them. The two sets of methods can easily be integrated so as to predict both instruction and data addresses on a multiplexed bus. For the memory to predict the addresses, it should know the format of the instructions of the processor. Note that this does not mean it is necessary to design a specific memory for every Instruction Set Architecture (ISA). The instruction formats of many RISC architectures are very similar. Hence, the basic functions of these smart memories would be the same, and with the help of simple programmable hardware, which can be configured by the processor during an initialization phase, these memories are made compatible with a target ISA. Figure 4-1 shows a block diagram of the BEAM address calculator/predictor unit in the memory. The current instruction is analyzed in an ISA-aware unit. This module identifies the instruction type. There is also a prediction/calculation module that generates the next address. This module requires the following inputs: the current address, the current instruction, the value on the bus, and some information from the ISA-aware module. The ISA-aware unit determines how to calculate the next address. The current instruction is needed because it may contain an immediate value, which is the offset of the next address. In the next two sections, we give details of the proposed prediction schemes for instruction and data addresses.

Figure 4-1 Block diagram of the calculation/prediction unit in memory (the current address, current instruction, and bus value feed an ISA-aware unit and a calculation/prediction module that produces the next address).

4.3 NEXT ADDRESS PREDICTION IN BEAM

First we describe our encoding method for the instruction addresses. Next we present our data address prediction technique.

4.3.1 INSTRUCTION ADDRESSES

In a typical program, one out of every seven instructions is a control flow instruction [31]. This implies that, most of the time, instructions are fetched from consecutive memory locations. This considerable spatial correlation in the instruction addresses is the key factor in many low power bus-encoding techniques such as the T0 method [14]. T0 was elaborated in Chapter 2. In this method, if the new address is one (the stride value) larger than the previous address (i.e., the new address is sequential), then the new address is not sent over the bus and the bus is frozen. The new address is only sent over the bus if it is not sequential. T0 uses one extra bit to inform the memory whether or not the bus is frozen. We adopt an approach similar to T0 to predict all sequential addresses. Furthermore, we add other prediction schemes to decrease the switching activity on the bus. To eliminate the transitions of non-control flow instructions in the BEAM technique, the memory must distinguish them from control flow instructions and calculate the next address.
The interesting difference between T0 and BEAM is that the latter does not require any redundant bit. Suppressing the redundant bit in T0 results in ambiguity on the memory side every time the target of a control flow instruction is equal to the value on the bus (refer to 2.4.1). However, this problem does not exist in BEAM because the memory looks at the current instruction to find out whether the next address is sequential or not. Therefore, it does not need to be notified by the processor.
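This redundant-bit-free sequential scheme can be sketched as follows. The decoded `instr` form and its `kind` field are hypothetical stand-ins for the memory's instruction decoding logic, not SimpleScalar's actual encoding; control-flow handling here is the bare fallback (the target still arrives on the bus), which later sections refine.

```python
STRIDE = 1  # sequential addresses differ by one stride, as in the text

def is_control_flow(instr):
    """The memory decodes the fetched instruction itself, so unlike T0 no
    redundant bus line is needed to flag a sequential fetch."""
    return instr['kind'] in ('branch', 'j', 'jal', 'jr', 'jalr')

def memory_next_address(cur_addr, instr, bus_value=None):
    """Memory-side next-address sketch: sequential addresses are calculated
    locally with nothing sent on the bus; for control flow instructions this
    bare version still takes the target from the bus."""
    if not is_control_flow(instr):
        return cur_addr + STRIDE
    return bus_value
```

Because both sides inspect the same instruction, the processor and memory always agree on whether the bus carries a target or stays frozen.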
Although jumps are deterministic, there are jump instructions whose targets are not known at compile time. This category of jumps is usually called indirect jumps. This means that the jump instruction itself does not include 140 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. the offset of the jump. In SimpleScalar, for J and JAL instructions, the offset is specified as an immediate value in the instruction, whereas for JR and JALR instructions, it is the content of a register that determines the target of the jump; thus, it is not known beforehand. The first set of instructions that we target in BEAM technique is direct jumps i.e. J and JAL. Note that J is used to implement unconditional jumps in the program, whereas JAL is used to implement function calls. When JAL is executed, it links the return address to a special register, which will be used later when returning from the function. If the memory recognizes J and JAL, it can easily compute the address of the next instruction by adding the offset embedded in the jump instruction to the current address. This is exactly how the processor computes the next address. For these instructions, memory is able to calculate the target address. At the same time, processor does not need to send any address for the next instruction if the current instruction is J or JAL. This completely eliminates the switching activity on the memory address bus for J and JAL instructions. The next step is to tackle the branch instructions. A not-taken branch behaves the same way as a non-control flow instruction does. The target of a taken branch can be easily calculated by extracting the offset from the instruction and adding it to the current address exactly like J and JAL. The problem is that the outcome of a branch is unknown in the memory and adding extra hardware for computing it is 141 Reproduced with permission of the copyright owner. 
Further reproduction prohibited without permission. impractical. Therefore, the only possibility is to predict the outcome of branches in the memory. The prediction scheme should be as simple as possible. Suppose that memory predicts all branches as ‘ taken’ and then calculates the targets of those branches. When executing the branch in the processor, if it is taken, the memory’s prediction is correct. Hence, it is not necessary to send anything from the processor to the memory. If the branch is not taken, memory has failed in its prediction and the processor sends a signal to the memory to indicate that the branch is not taken and memory fetches the next sequential address. Therefore, memory locally predicts the branches and processor sends a signal to the memory whenever memory’s prediction is not correct. To decrease the power consumption, a single bit transition on a specific line of the bus is used to signal the memory. As a result each branch will cause at most one transition on the bus. Since in a typical program about 70% of all branches are taken [31], this scheme leads to elimination of a significant number of transitions. To further improve the result, better prediction schemes may be used for predicting branches. Modem branch prediction schemes have up to 99% accuracy [31]. However, their hardware overhead is not tolerable in our system. The last category of instructions to be tackled is JR and JALR instructions. JR is mostly used to implement function returns. To do this, the program usually reads the return address from the stack and puts it into a register (assuming that the 142 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. return address is on top of the stack). After that a jump to that address is performed by executing a JR instruction. Another usage of JR is in implementing conditional branch to several targets such as what case does in C programs. 
However, it is important to note that the majority of JR instructions are used to return from functions. For the SPEC92 benchmark programs, according to [31], function returns account for 85% of the indirect jumps on average. Therefore, we propose a technique for predicting the target of JR instructions only when they are used to implement function returns. Final jump instruction in SimpleScalar is JALR which is mostly used for implementing pointer to functions, i.e., a pointer that can point to different functions and call them from a specific place in the program. However, since JALR is rarely used, it has a minor impact on the total number of transitions. Therefore, in our scheme processor always sends the next address of a JALR instruction and memory does not attempt for any kind of prediction. Our technique to reduce the transition cost of JR is as follows. The return address is saved whenever there is a function call. Later, when there is a JR instruction, the saved addresses are used to predict the target address. Therefore, a stack is used to save the return addresses. Since this scheme will not work for all JR instructions (because not all JR instructions are used to implement function returns), the processor has to have a stack as well to find out if the memory is able 143 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. to correctly predict the address. Upon encountering a JAL, the return address (the same address that is linked during execution of these instructions) is pushed into the stacks. Later, when a JR is executed, if it is a function return instruction, its return address ought to be present in the stack. If the JR is not a function return, then its target will not be in the stack and cannot be predicted. This is confirmed in the processor by comparing the value of the register to which JR is jumping with the value stored on the top of the stack. 
If they match, then the processor knows that the memory correctly predicts the target, and the processor does not send anything on the bus. If the target of the jump is not equal to the value stored on top of the stack, then the memory prediction will be incorrect and the processor will simply send the new address over the bus. The memory detects the activity on the bus and concludes that its predicted value was incorrect. Consequently, it uses the address received on the bus instead of the address on top of the stack. An essential question is how large the stack should be. For any stack with a finite number of entries, there is always the possibility of an overflow. Once the stack overflows, no JR will be predicted correctly until enough function returns are executed so that the number of nested function calls becomes less than the size of the stack. Therefore, a regular stack is not a suitable choice, because many nested functions may not return until the very end of the program. However, if we make the stack circular, this problem is solved. Although there is still the possibility of overflow for circular stacks, the most recent return addresses will not be lost in the case of overflow, and the prediction scheme will perform better. In Section 4.4, we address the issue of the stack size quantitatively.

4.3.2 DATA ADDRESSES

In this section, we describe how data addresses can be predicted. In SimpleScalar, reading or writing to memory for data purposes is done only with load and store instructions. There are two different addressing modes in this architecture. The first one is displaced addressing, in which the address to be accessed is calculated by adding a register Rs to an offset embedded in the instruction.
The data accessed in memory is written to register Rd:

Rd <= MEM(Rs + Offset)

The second type is indexed addressing, in which the address is calculated by adding the values of two registers:

Rd <= MEM(Rs + Rt)

If the memory wants to locally calculate the address accessed by these instructions, it has to know the contents of the registers. However, the register file is in the processor, and the memory does not have access to it. The solution we adopt is to implement a shadow register file in the memory. If the shadow register file is kept completely coherent with the processor register file, then the accessed addresses can easily be calculated in the memory. However, making the two register files coherent is very expensive in terms of the number of required bus transactions. Thus, the values of the registers in the shadow register file are updated only when there is a memory access instruction with displaced addressing. Every time the processor sends the data address to the memory, the memory subtracts the offset embedded in the instruction (which is known to the memory) to calculate the value of the register used in the instruction. Therefore, the register value can be updated in the memory shadow register file without additional overhead on the bus. Now this "semi-coherent" register file in the memory can be used to predict data addresses. On the other side, the processor can easily determine whether the value that the memory has for a register is valid. This is done by keeping track of all registers that have been modified (i.e., have been the destination of a mov instruction [72]) since the last time they were used as pointers in a memory access instruction. If a register has an up-to-date value when it is used in the memory access instruction, the data address can be correctly calculated in the memory. When the processor detects that the memory does not have the updated value of the register, it sends the new address value on the bus. At the same time, the memory knows when the register value is not valid and will read the address from the bus instead of calculating it. Additionally, the memory uses this new value to update its register file. On both sides, a valid flag corresponding to the register is set, indicating that the register value has become valid for future references. SimpleScalar [72] has 32 general-purpose registers. Hence, the shadow register file should have 32 registers, but this can be expensive for a low power scheme. To reduce the number of registers, we consider the fact that compilers usually use a small set of registers as pointers in memory access instructions. So it is possible to use a small cache to hold the values of some of the registers. Whenever a new register is used in a memory access instruction, that register occupies one entry of the cache; the address of the next instruction using that register to access memory can then be correctly predicted. In fact, we will show in the next section that using a 4-entry cache instead of 32 registers has only a marginal effect on the switching activity reduction while reducing the hardware overhead significantly. To avoid evicting registers that are more frequently used, we use a saturating counter [31].

4.4 PERFORMANCE ANALYSIS

In this section we examine the actual transition reduction obtained by applying the BEAM technique. First we focus on instruction addresses. Figure 4-2 gives the percentage of different types of instructions used in SPEC2000 [75] benchmark programs. Forward and backward branches as well as ‘taken’ and ‘not-taken’ branches are reported separately.
As we mentioned previously, JALR is rarely used in the programs. Furthermore, the numbers of JAL and JR instructions are almost the same in all programs, which shows that most JR instructions are function returns matching the function calls implemented by JAL instructions.

Figure 4-2 Percentage of different kinds of control flow instructions (backward and forward not-taken branches, backward and forward branches, jump and link register, jump and link, jump register, and jump) for vpr, parser, equake, vortex, gcc, and art.

Next we compare the transition cost of different types of instructions in the original instruction address trace. The results are reported in Table 4-1. This table shows the percentage contribution of each category of instruction to the total number of transitions. The upper part of the table shows the contribution of non-control flow instructions and branch instructions, and the lower part presents the contribution of the different jump instructions. By looking at this table, it is possible to determine the transition saving that is achieved when the transitions of a certain category of instructions are eliminated using the BEAM technique. For example, according to the table, if jump transitions are suppressed by predicting jump targets, around 5% of the total transitions will be saved. However, this is not an exact number, because when a new encoding is applied to cancel the transition cost of jumps, other transition costs will be affected. This is because the transition cost of an instruction depends on the current value on the bus and the next address. Notice that our method is similar to T0 for predicting the sequential addresses. However, because our method does not require an extra bit, it performs better than T0.
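The staged instruction-address predictor built up in Section 4.3.1 (sequential fetch, direct jumps, branches predicted taken, and a return-address stack for JR) can be sketched end to end as follows. The decoded `instr` fields are hypothetical, and a plain Python list stands in for the circular stack.

```python
def predict_next(cur_addr, instr, ras, stride=1):
    """One prediction step combining the stages evaluated in Table 4-2.
    `ras` is the return-address stack shared (conceptually) by both the
    processor and the memory."""
    kind = instr['kind']
    if kind == 'jal':
        ras.append(cur_addr + stride)        # link: save the return address
        return cur_addr + instr['offset']    # direct jump: offset is in the instruction
    if kind == 'j':
        return cur_addr + instr['offset']
    if kind == 'branch':
        return cur_addr + instr['offset']    # predicted taken; a single-bit
                                             # signal corrects a not-taken branch
    if kind == 'jr':
        return ras.pop() if ras else None    # predicted function return
    return cur_addr + stride                 # non-control-flow: sequential
```

The processor runs the same function; only when its real next address disagrees with the prediction does anything travel on the bus.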
According to our results, the BEAM technique using sequential prediction only outperforms T0 by 5%.

Table 4-1 Percentage of transition cost for different kinds of instructions.

         Forward taken   Backward taken   Not taken      Non-control
         branch          branch           branch         flow
vpr      2935     9%     879      3%      2021     6%    22789   68%
parser   1696     5%     127      0%      2008     6%    23425   73%
equake   2310     7%     532      2%      2029     6%    24274   71%
vortex   2206     7%     589      2%      1279     4%    25252   75%
gcc      3081     9%     102      0%      1513     4%    24136   72%
art      1001     3%     357      1%      569      2%    27151   87%
Average           6.7%            1.3%             4.6%          74.3%

         Jump            Jump and link    Jump register  Jump and link
                                                         register
vpr      1742     5%     1716     5%      1584     5%    1        0%
parser   2405     7%     1412     4%      1104     3%    0        0%
equake   1400     4%     1775     5%      1955     6%    1        0%
vortex   979      3%     1709     5%      1630     5%    35       0%
gcc      2319     7%     1123     3%      1383     4%    12       0%
art      1768     6%     253      1%      216      1%    1        0%
Average           5.3%            3.8%             4.0%           0.0%

Table 4-2 shows the actual savings achieved by the BEAM technique when different levels of prediction are used. As one can see, adding the branch prediction method to simple sequential prediction increases the transition saving from 65.0% to 86.2%. If the prediction of J and JAL instructions is also included, the saving increases to 92.1%. Up to this point, the required extra hardware is essentially negligible; we only need an adder (to add the extracted offset to the current address), several multiplexers, and some logic for detecting the instructions' types. To predict the target of JR instructions, we have to use two stacks to store return addresses: one in the processor and the other in the memory. We generated these results assuming that a 10-entry circular stack is used.

Table 4-2 Transition saving for different stages of our proposed method for the instruction address bus.

         predict seq. ins   + predict branches   + predict J and JAL   + predict JR
vpr      58.2%              84.9%                90.6%                 95.5%
parser   64.3%              85.2%                91.5%                 97.1%
equake   59.3%              83.7%                89.4%                 96.7%
vortex   65.1%              87.2%                92.7%                 97.8%
gcc      60.6%              84.8%                91.8%                 97.2%
art      82.1%              91.4%                96.6%                 99.2%
Average  65.0%              86.2%                92.1%                 97.3%

Figure 4-3 shows the effect of the size of the circular stack on transition saving. The vertical axis shows the number of transitions caused by JR instructions. According to the figure, a 10-entry stack is sufficient, and increasing the number of entries in the stack beyond 10 would only marginally improve performance.

Figure 4-3 Effect of stack size on JR transitions (JR transition count versus stack sizes from 5 to 25 entries for vpr, parser, equake, vortex, gcc, and art).

By using our method, up to 97.3% of all transitions of instruction addresses can be suppressed. The remaining 3% of the transitions are dominated by transitions caused by the misprediction of branches and jump registers. More saving can be achieved by using better branch prediction schemes, but this would require more complex hardware. Next, we quantitatively investigate the prediction of data addresses. We have used the same set of benchmark programs and generated the data address traces for them. A precise prediction represents a case when the value of a register is valid in the memory shadow register file; in this case, the memory can exactly predict the data address. Table 4-3 shows the percentage of transition saving and the precise predictions when we use a 32-entry shadow register file in the memory. In Table 4-4, we report the results when using a 4-entry cache (with direct mapping [31] of registers into the cache entries) instead of the full-size shadow register file. The average saving in transition count declines by only 6%.

Table 4-3 Transition saving for data addresses and the percentage of accesses precisely predicted for a full-size shadow register file.

         Transition saving   Precise predictions
vpr      86.1%               74.7%
parser   80.5%               65.1%
equake   76.3%               82.1%
vortex   74.2%               74.3%
gcc      81.9%               71.1%
art      95.4%               85.6%
Average  82.4%               75.4%

Table 4-4 Transition saving for data addresses, cache hit rate, and the percentage of accesses precisely predicted for a 4-entry directly mapped cache.

         Transition saving   Cache hit   Precise predictions
vpr      80.3%               87.3%       68.3%
parser   74.6%               81.3%       60.6%
equake   70.7%               88.4%       74.8%
vortex   66.9%               87.0%       69.1%
gcc      77.3%               81.5%       61.0%
art      88.1%               88.1%       84.7%
Average  76.3%               83.6%       69.8%

4.5 POWER ANALYSIS

We report the actual power saving (i.e., one that accounts for the power dissipation overhead of the codecs) achieved by the BEAM technique. We do this evaluation separately for instruction and data address traces. We estimate the power for the blocks used in the memory only, namely, the memory codecs. This is because, on the processor side, most of the required hardware is already in place, or there are similar logic blocks into which the required codec hardware can be integrated with minor overhead. For example, the processor needs to identify the types of instructions such as branches and jumps; this task is already done in the instruction decode unit. In fact, in some cases our technique even decreases the burden on the processor. As an example, the processor no longer has to calculate the target address for a jump or a branch, since this is always done in the memory. The outcomes of branches still have to be determined, but the actual target calculation is no longer needed. This is similar to moving the adder that calculates the targets of jumps and branches from the processor to the memory chip.
Notice that we consider a version of the BEAM technique that does not employ the JR prediction. The reason is that the 5% reduction in transition activity that can be achieved by implementing JR prediction does not, in many cases, justify using two different stacks.

Figure 4-4 shows the BEAM codec, which is used in the memory for predicting branches and direct jumps. The instruction-set aware unit only determines whether the instruction is a JR/JALR, a branch, or a sequential instruction. If it is a branch, the leftmost multiplexer selects the branch offset; otherwise, the jump offset is selected. Either this value or the instruction stride is added to the current address. If the current instruction is not a JR or a JALR, this sum will be the next address; otherwise, the value received from the bus will determine the next address. The JR/JALR signal generated by the instruction-set aware unit controls the multiplexer that chooses between these two addresses.

Figure 4-4 Hardware implemented in memory for predicting instruction addresses (jump and links, jumps, and branches).

Figure 4-5 shows the blocks required in the memory for calculating data addresses. There is a cache with four entries that can hold the values of four registers. The Rs field in the current instruction is used to index into the register cache. The Rt field identifies a register that has been used as a target register and is used to invalidate a cached register. The instruction-set aware unit sends two signals to the cache, namely, invalidate and Mem-Access.
Any access to the cache may lead to a hit or a miss. Even on a hit, the register value may be invalid, meaning that the value of the register has been modified by an instruction since the last memory access that used it. Thus, only if there is a valid hit is the cached value used for calculating the address, by adding it to the offset embedded in the instruction. If there is no valid hit, the value will be received on the bus from the processor and the rightmost multiplexer will select this value as the data address; at the same time, the value is used to update one of the entries in the cache using direct mapping.

Figure 4-5 Hardware implemented in memory for predicting data addresses.

To estimate the actual overhead of the above memory codecs, we followed the same methodology as in the previous chapters. First, we generated the netlist of each circuit in Berkeley Logic Interchange Format (BLIF). The netlists were optimized using the SIS script.rugged and mapped to a 1.5 V, 0.18 um CMOS library using the SIS technology mapper. The I/O voltage was assumed to be 3.3 V. The number of literals, the area, and the number of gates are reported for both the instruction and data codecs in Table 4-5. Next, we calculated the power consumption of these circuits; these values are needed to determine the actual power reduction of the bus. Therefore, instruction and data address traces of the benchmark programs were fed into the codecs, and the power consumption was estimated using sim-power [33], a gate-level power estimation tool.
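The data-address prediction mechanism described above can be sketched in a few lines of Python. This is an illustrative software model of the scheme, not the thesis's actual hardware; the class and method names are our own.

```python
class RegisterCache:
    """Software sketch of the 4-entry direct-mapped register cache kept on
    the memory side (illustrative; names are ours, not the thesis's RTL)."""

    def __init__(self, entries=4):
        self.n = entries
        self.tag = [None] * entries      # which register occupies each slot
        self.val = [0] * entries         # cached register values

    def _slot(self, reg):
        return reg % self.n              # direct mapping of registers to slots

    def invalidate(self, rt):
        """Rt names a target register: its cached copy (if any) is now stale."""
        s = self._slot(rt)
        if self.tag[s] == rt:
            self.tag[s] = None

    def predict(self, rs, offset):
        """On a valid hit, the memory computes the data address itself."""
        s = self._slot(rs)
        if self.tag[s] == rs:
            return self.val[s] + offset
        return None                      # miss: address must come over the bus

    def update(self, rs, value):
        """A value received over the bus refreshes the direct-mapped entry."""
        s = self._slot(rs)
        self.tag[s], self.val[s] = rs, value
```

For instance, after `update(3, 0x1000)`, a memory access through register 3 with offset 8 is predicted as `0x1008` without any address transfer, while a subsequent write to register 3 (`invalidate(3)`) forces the next access to come over the bus again.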
The results for a 100 MHz system clock are reported in Table 4-5 as "power of BEAM memory codec." Assuming a bus capacitance of 10 pF/line, we calculated the original bus power (i.e., when no encoding is used) using the same address traces that we used for the sim-power estimation. The total power saving, accounting for the extra on-chip codecs, and the percentage of saving are also reported in Table 4-5.

Table 4-5 Results of hardware analysis and power estimation.

                                      Instruction codec    Data codec
Num of Literals                       686                  720
Area (x1000 λ^2)                      343                  588
Num of Gates                          311                  528
Original Bus Power (uW)               5205                 20050
Bus Power with BEAM (uW)              416                  6055
Power of BEAM Memory Codec (uW)       364                  1144
Codec Power + Bus Power with BEAM     780                  7199
Power Saved with BEAM (uW)            4421                 12851
Percentage Saving over Bus            85%                  64%

4.6 CONCLUSION

In this chapter, we described a new method for encoding instruction and data address buses. Our method achieves up to 97% reduction in switching activity for an instruction address bus; for a data address bus, the saving is approximately 64%. The small hardware overhead makes our method practical. Our experiments show that the power consumption of the memory instruction and data codecs is 0.36 mW and 1.15 mW, respectively. In practice, when using our method, several modules are moved from the processor to the memory and some new blocks are added to the processor; therefore, the processor power consumption remains almost the same or even decreases. Our techniques can be combined to predict the addresses on a multiplexed address bus.

Chapter 5 PATTERN-SENSITIVE ENCODING FOR RELIABILITY

5.1 INTRODUCTION

Conventional methodologies for SoC design are unable to cope with the billions of transistors that can potentially be integrated on the chip.
Large and complex SoCs can become a reality only if one is able to cost-efficiently incorporate pre-designed modules (IP cores) into the design and to enable high-speed and reliable communication among these modules. The communication channels are usually long and highly capacitive, which in turn increases the latency and power consumption of these channels. Furthermore, as a result of shrinking device and interconnect geometries, increasing clock speed requirements, and down-scaling of the supply voltage (as well as the threshold voltage of the switching devices), DC noise margins and timing slacks are significantly reduced. Therefore, we begin to encounter a situation in which various sources of noise in VLSI circuits cause a significant increase in the number of errors over on-chip communication channels.

A SoC system may be subjected to very different operating conditions based on when and where it is used. The change in "environmental conditions" can in turn result in large variations in the peak and RMS values of the external and internal noise sources. On the other hand, the reliability constraints may also vary for different tasks. A modern SoC brings together several different modules performing various tasks and operations. Communication channels in such a system carry a large amount of data with a broad range of characteristics and criticality constraints. Therefore, the level of protection required in a communication session depends on the criticality of the data being interchanged. Sensitive data should be protected more, at the expense of more complex encoding, higher power consumption, or system performance degradation. The bit error rate over the communication channels in SoC designs is expected to rise and, at the same time, to become subject to a higher degree of unpredictability.
Notice that it is ill-advised to design a SoC for the absolute worst-case conditions (e.g., using the peak values of internal and external noise sources under the highest temperatures and the lowest supply voltages) because of the tight performance specs for many of these designs. It is thus crucial to design SoCs for typical operating conditions, such that the final design meets the throughput (data bandwidth) and/or latency constraints, and then to ensure their reliability and integrity by employing architectural means that can detect/correct errors occurring under various operating conditions, and to do this energy-efficiently. The reliability constraint is either set by the designer or defined by a runtime environment factor based on the attributes of the specific task being executed on the SoC. The noise parameters should somehow be estimated within the SoC. Some of the noise sources have stationary behavior, which makes the job of sensing/estimating them easier. Other sources of noise, such as capacitive crosstalk, are very much dependent on the values being communicated over the bus; this means that the effect of crosstalk noise can be very different in each transfer cycle. This dynamic behavior makes controlling the bit error rate in the communication channel a challenging task. As an example, consider the effect of temperature. If the temperature increases for some reason, it will increase the variance of the noise, which will uniformly increase the BER over the different lines of a bus. However, the same bus working at a fixed temperature might be affected differently by crosstalk based on the values that are transmitted over the bus.
Another important factor is that crosstalk noise is the dominant source of faults in on-chip wires [27]. This means that a system will not be able to adaptively adjust the reliability level of its communication links unless it considers the level of sensitivity to crosstalk faults. In this chapter, we propose an Energy-efficient, Reliable on-chip communication Channel (ERC) architecture. Unlike previous techniques, it directly considers vulnerability to capacitive crosstalk. This architecture includes a Pattern Sensitive Encoding (PSE) technique, which is designed for adaptive reinforcement of transaction reliability against crosstalk. A new methodology for determining the sensitivity of patterns is introduced, and PSE adaptively selects the appropriate encoding technique based on the pattern sensitivity. Note that although the proposed ERC adaptively selects the right type and/or degree of encoding based on the sensitivity of the transmitted pattern to coupling noise, it does not impose any additional overhead on the system. This is also a key accomplishment of our work.

The remainder of this chapter is organized as follows. In the background section, we review previously proposed techniques for dealing with the power and reliability problems of interconnects. In the following section, we look at the details of our approach, including sensitive patterns, PSE, and design choices. The next section presents our experimental results, and the chapter ends with a conclusion section.

5.2 BACKGROUND

Many methods have been proposed that tackle the power consumption of on- and off-chip buses (although signal reliability concerns are absent from most of these works) [44][53][69][58][36].
In these works, the line capacitances, including area and fringing capacitances to ground plus the inter-wire capacitance, determine the amount of power dissipated in the bus line drivers. Therefore, the power consumption of a bus is proportional to the switching activity of the lines with respect to ground plus the activity of the lines with respect to each other. Reducing the switching activity by encoding the values that go over the bus is an effective method for bus power reduction and has been addressed by many researchers (e.g., [44]). Low-voltage signaling is another effective method for reducing the power consumption of the bus drivers [62]. An aggressive low-voltage signaling method is differential voltage signaling, which requires a doubling of the number of lines in the bus [36]. More recently, researchers have tackled both the power consumption and the reliability of bus lines [17][65][70][42]. One solution is to employ error correction/detection schemes over the interconnection channels. Specifically, researchers have opted for linear encoding techniques [43] due to their low overhead and high-speed encoding/decoding. In [17], the authors have compared linear error correcting codes (ECC) versus error detecting codes (EDC) from an energy viewpoint. They concluded that error detecting techniques are more power-efficient than correcting schemes. This is due to both a lower residual error rate (if a code is used only for error detection, it can detect a broader range of errors compared to when it is used for both error detection and correction) and simplified encoder/decoder blocks. In [42], the authors have proposed an adaptive scheme for dynamically selecting an encoding technique based on reliability and energy considerations.
In other words, not all the redundant lines implemented on the bus for encoding are always used; instead, an optimum encoding scheme is selected based on the noise conditions at the moment. Different encoding schemes employ different numbers of redundant lines. Encoding techniques with higher error detection capabilities require more redundant bits and have more power-hungry encoders/decoders; therefore, by this dynamic selection of techniques, power is saved whenever the reliability constraints or noise levels leave any room. To implement such an adaptive scheme, it is required to monitor the error level. In their proposed scheme, noise is estimated by keeping track of detected errors: they repeatedly count the number of detected errors in a specific window of time and, based on that, select the encoding scheme for the next interval. The stronger the noise interference, the higher the number of detected errors. To adapt rapidly to varying noise conditions, this window of time should be relatively small. However, in a small period of time very few errors will be revealed, considering typical on-chip error rates, and this complicates the task of selecting a new encoding technique. To increase the number of detected errors in a moderately sized window, they have proposed the notion of a victim line: a line that is operated at a lower voltage and is therefore more vulnerable to noise. Noise is estimated based on the errors detected on the victim line. They make use of three different encoding techniques, namely parity, Hamming (double error detection), and the extended Hamming code (triple error detection) [43]. The extended Hamming code is an extension of the Hamming code and is obtained by selectively puncturing some of the message bits of the Hamming codeword. The minimum distance of this code is 4 [43].
Dynamic Voltage Scaling (DVS) is another low-power technique traditionally used to adapt a system to varying performance requirements. It can also be very useful for tuning the power consumption and reliability of interconnects when applied to bus drivers. In [65], the authors have proposed a dynamic method for controlling the supply voltage and frequency of on-chip buses to guarantee safe and power-efficient communication over the bus for a given quality of service. They propose different policies for controlling the voltage swing and operating clock of the interconnects. Their most comprehensive policy finds the optimum voltage and frequency that can meet the delay and reliability constraints of the system. With this policy, in situ estimation of technological and environmental effects minimizes the need for a priori knowledge, hence simplifying the design process. In [70], the authors present a better model for characterizing the faults that cause errors on a line. Unlike the previous techniques, they do not assume a uniform bit error rate over all lines of the bus. In fact, they propose a new model for representing the different sources of noise that can affect on-chip wires. Each fault is characterized based on its probability of occurrence, the effect that it causes on the line (such as inversion faults, stuck-at faults, delay faults, etc.), and the number of lines that are affected by the fault. They then select a specific encoding for different Quality of Service levels such as maximum bandwidth, guaranteed integrity, minimum latency, and high reliability. Standard techniques for alleviating the effect of crosstalk noise are shielding and the use of repeaters [40]. The use of repeaters is a technology-dependent technique and requires optimal sizing of the repeaters. Besides, it is inherently a power-hungry approach.
On the other hand, shielding is an area-hungry technique. Cross-chip buses are usually routed in high metal layers, which are limited resources and should be used as efficiently as possible. In another work, the authors have proposed a rather complex technique to make the sensitive patterns less susceptible to crosstalk [64]. However, complexity is probably the most important factor that takes away the credibility of their approach for future SoCs.

5.3 THE PROPOSED APPROACH

In this chapter, we propose a new methodology for applying linear error correcting techniques over on-chip buses. We specifically intend to present a method that tackles crosstalk-induced faults. Our work differs from previous work because most existing schemes that employ adaptive encoding assume that the noise interference over the bus lines can be modeled as an independent, identically distributed random process. Therefore, they model the noise as a uniform bit error rate over all lines (which can in turn be estimated if it is monitored within a window of transmissions). Only in [70] have the authors presented an elaborate fault model that can represent different kinds of noise; their results are considerably different from the case where noise is modeled as a uniform bit error rate. Capacitive crosstalk usually affects several adjacent or closely placed interconnect lines and can thus cause multiple faults whenever it hits. Accurate modeling of crosstalk is not easy; therefore, we do not adopt an analytical crosstalk model. Instead, we rely on HSPICE [78] simulations to validate our proposed encoding technique. In this chapter, we consider a 32-bit bus. We would like to have the ability of detecting/correcting many errors.
We use the word "many" to make a distinction between our work and the existing techniques. More precisely, in almost all of the previous techniques, the utilized encoding is capable of at most detecting up to three errors or correcting two errors on the whole bus. In practice, correcting more errors can be quite beneficial if we are to avoid designing the system for the worst-case scenarios. However, implementing a single linear encoding technique that is capable of correcting/detecting more than three errors on a relatively wide bus can be fairly expensive in terms of the associated delay and power dissipation overheads. A reasonable tradeoff is to partition the bus into different groups of bits and separately encode each group, which is exactly what we have done. Instead of a 32-bit bus, we have three groups of about 11 bits (two of size 11 and one of size 10), each encoded such that either one error can be corrected or two errors can be detected. This means that in each group we apply an extended Hamming code [43] as the highest level of protection. As a result, up to three errors can be corrected on the bus, or up to nine errors can be detected, or some other combination of correction and detection can be performed (such as correcting two errors and detecting three others). Clearly, because of the bit grouping, the positions of these errors cannot be arbitrary, and there are limitations on the spatial distribution of the errors that can be detected and/or corrected; for example, no more than one error can be corrected in each group. In a subsequent section, we explain the rationale of our choice of grouping and encoding technique. A group of size 11 requires at least 6 lines for the extended Hamming code.
Therefore, a total of 18 redundant lines are needed for the three groups, which makes the total number of lines equal to 50. This is a little more than 55% redundancy with respect to the original bus (compare this to line shielding, which requires a doubling of the bus line count). Of course, there exists an encoding that can detect an arbitrary combination of 9 errors on a 32-bit bus, but any code with this capability requires at least 21 redundant lines [51]. The resulting encoder/decoder functions would also be much more complex compared to those used when partitioning the 32-bit bus into three smaller buses and encoding each group separately. Therefore, grouping also benefits us by simplifying the encoding and decoding functions. We will see that another important advantage of grouping is that it facilitates our goal of capturing and reacting to pattern-dependent vulnerability to crosstalk noise. The reason is that capacitive crosstalk is inherently a local effect. (Although inductive crosstalk is increasingly becoming more important, capacitive crosstalk remains the dominant source of coupling noise for on-chip interconnects; see our simulation results.) This fact implies that for lines to induce coupling noise into one another, they must be close to each other. We can therefore design the bus as small groups of lines such that each group has minimal interference with the other groups. This is done by putting a larger spacing or adding shielding lines between the groups. Subsequently, we can consider the crosstalk vulnerability within the groups instead of across the whole bus. As stated above, our solution increases the number of lines from 32 to 50. Besides, we need some shielding between the different groups of lines to minimize their interference with each other.
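As a quick check on the line-count arithmetic above, a few lines of Python reproduce the redundancy figures (the per-group check-bit count of 6 is the one stated in the text):

```python
groups = [11, 11, 10]            # the three groups of the 32-bit bus
check_bits_per_group = 6         # extended Hamming check lines per group
redundant = len(groups) * check_bits_per_group
total_lines = sum(groups) + redundant
print(redundant, total_lines, 100 * redundant / 32)
# 18 redundant lines, 50 lines total, 56.25% redundancy over the 32-bit bus
```

The 56.25% figure is consistent with the "little more than 55%" quoted above.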
As a result, we will assume that our solution takes the space of 55 lines instead of 50 lines to account for the area overhead of group isolation. The way lines are organized on the bus is shown in Figure 5-5; this figure is elaborated in the following sections. To obtain accurate crosstalk modeling throughout this chapter, we performed detailed HSPICE simulations using 70 nm technology parameters for different scenarios. The power supply voltage was set to 1.2 V. We assumed that the target bus is a global on-chip bus implemented in the highest metal layer, as depicted in Figure 5-1(a); this layer is covered by packaging material from above. We used the Berkeley Predictive Technology Model [71] (also known as BPTM) to extract the specifications of the interconnects. All the interconnect simulations were done using the HSPICE transmission-line (W-element) model. The target bus has 55 lines with the following specifications: l=2500um, w=0.45um, s=0.2um, h=0.2um, and t=1.2um. These parameters are defined in Figure 5-1(b).

Figure 5-1 The top metal layer with no ground level above it: (a) the bus under the packaging material; (b) definition of the line parameters.

We call the 55-bit bus using the above specification a CodedBus from now on. To make a comparison between our solution and a 32-bit bus with the same total area but increased line spacing, we simulated a second bus with 32 lines and increased spacing, which we call InflatedBus. We calculated that in the InflatedBus the line spacing should be increased to 0.68um, compared to the 0.2um spacing in the CodedBus. To show the difference in crosstalk interference between the CodedBus and the InflatedBus, we simulated these buses for 5000 pairs of randomly generated vectors. We chose random vectors to exemplify the wide variety of data that can be interchanged on the bus of a multi-functionality SoC.
The input signals over the various bus lines are random pulses with random arrival and transition times. The input arrival times are randomly distributed over a window of 100 ps. The input rise time is a uniform random variable from 20 to 80 ps. The results are reported in Table 5-1, which gives the number of errors for the CodedBus and the InflatedBus under three different cycle time constraints. Notice that no encoding or decoding is done on the CodedBus, which is why the table refers to it as Raw CodedBus; we are just quantifying the effect of extra spacing between lines on the number of generated errors. As the cycle time increases, signals have more time to settle down from the coupling effects; therefore, the number of errors decreases for each bus. The major observation is that although the InflatedBus is expected to reduce the number of errors due to its much larger inter-wire spacing, a considerable portion of the crosstalk errors still affect the system. This many errors is more than what can be tolerated in a digital system; thus, spacing between the bus lines alone cannot guarantee a desired level of data reliability.

Table 5-1 Number of errors in 5000 bus cycles (no encoding has been performed).

Bus cycle time (ps)            750    1000    1250
Raw CodedBus error count       82     15      13
InflatedBus error count        39     10      8
Reduction in number of errors  52%    33%     38%

5.3.1 SENSITIVE PATTERNS

A major contribution of our work is the development of the pattern-sensitive encoding (PSE) technique for encoding an on-chip bus. In pattern-sensitive encoding, the code for each input pattern is selected based on the probability of having an error in that specific pattern.
Sensitivity depends not only on the vector that is being sent but also on the vector that was sent prior to it. We use the notion of a "pattern" (as opposed to a "vector") to emphasize this fact: when we say that a pattern is sensitive, we mean that the pair of vectors is sensitive to noise. Next, we look at the problem of determining the level of noise sensitivity (or vulnerability) of a pattern. A number of researchers have tackled this problem. In [27], the authors present the Maximum Aggressor model and propose that a victim is most likely to be affected when it makes a transition in the opposite direction with respect to all adjacent lines/aggressors. However, a more recent work shows that the worst-case effect on the victim can happen when nearby aggressors make a transition in the opposite direction whereas distant ones switch in the same direction [49]. In general, patterns that result in a higher transition count are more likely to be affected by crosstalk noise. Therefore, we first analyze the sensitivity of patterns based on their transition count (refer to Definition 2-13). Next, we look at our proposed model for evaluating crosstalk. We refer to the number of Miller effect capacitances in a pattern and use that as a measure of the level of noise sensitivity of that pattern. We define the Miller effect count as the number of 01 -> 10 or 10 -> 01 sub-patterns. Table 5-2 shows the Miller effect counts for three pairs of transmitted data (for two different bus widths).

Table 5-2 Example showing the Miller effect count.

First vector    Second vector    # Miller effects
01010           10001            2
010             101              2
00101           00011            1

To achieve a logical assessment of the noise sensitivity of a pattern, we compared the two metrics suggested above, i.e., the transition count and the Miller effect count.
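The Miller effect count is easy to compute directly from a pair of vectors. The small helper below (our own illustration, operating on bit strings) reproduces the three rows of Table 5-2:

```python
def miller_effect_count(prev, curr):
    """Count adjacent line pairs that switch in opposite directions,
    i.e., sub-patterns 01 -> 10 and 10 -> 01 between consecutive vectors."""
    opposing = {("01", "10"), ("10", "01")}
    return sum((prev[i:i + 2], curr[i:i + 2]) in opposing
               for i in range(len(prev) - 1))

# The three examples of Table 5-2:
print(miller_effect_count("01010", "10001"))  # 2
print(miller_effect_count("010", "101"))      # 2
print(miller_effect_count("00101", "00011"))  # 1
```

Note that the metric looks only at neighboring line pairs, reflecting the local nature of capacitive coupling discussed earlier.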
We simulated a bus of width 11 (defined earlier as the basic group size) for 20,000 randomly generated patterns (using the same statistical distributions of random arrival and rise times as were used in the comparison of CodedBus and InflatedBus) and organized the errors based on the number of transitions and on the Miller effect count. The results are shown in Figure 5-2 and Figure 5-3. The cycle time is assumed to be 750 ps.

Figure 5-2 Number of errors versus transition count in a group of size 11.

Figure 5-3 Number of errors versus Miller effect count in a group of size 11.

In Figure 5-2, the x-axis depicts the number of transitions in a group. From 0 up to 11 transitions may occur when a pair of vectors is transmitted on an 11-bit bus. Since the vectors have been selected randomly, the histogram shows a binomial shape with respect to the number of transitions. For each transition count, two bars are depicted. The right bar shows the number of erroneous patterns that occurred on the bus for the corresponding transition count, whereas the left bar shows the number of error-free patterns (divided by 10). Figure 5-3 uses the same convention except that the x-axis now corresponds to the Miller effect count (hence, it ranges from 0 to 10).
Comparing the above charts, we conclude that the number of Miller effects is a far better indicator of a pattern's noise sensitivity. Since our encoder adaptively selects the encoding technique, it should be able to recognize sensitive patterns efficiently. If we examine Figure 5-2, we see that patterns with five or more transitions are relatively sensitive to noise (this conclusion is drawn from the ratio of the number of erroneous patterns to the number of error-free patterns). For these patterns, we should use our most reliable encoding scheme. The drawback is that such patterns constitute a large percentage of the total patterns that occurred; therefore, to protect the few erroneous patterns, we would have to apply boosted encoding to a huge number of bus patterns, even though many of them never lead to errors. On the other hand, when we organize the patterns by the number of Miller effects, we can easily see that patterns with 0 or 1 Miller effects are not at all sensitive to noise, even though they constitute a huge percentage of the total number of bus patterns, which is exactly what we have been looking for. Therefore, it is not necessary to apply the enhanced encoding (e.g., the extended Hamming code) to these patterns. In fact, we use boosted encoding only for the remaining patterns, which have two or more Miller effects. Using this gauge, we can partition patterns such that a significant portion of the total number of noise-induced errors on the bus is caused by a relatively small number of "sensitive patterns".

5.3.2 PATTERN SENSITIVE ENCODING

We propose a pattern-sensitive encoding (PSE) technique that selects the encoding method based on the vulnerability of the input pattern to on-chip noise sources.
From the previous results, a gauge of the crosstalk noise sensitivity of a pattern is the number of Miller effects that occur on the bus when the new pattern arrives. We assumed that the original bus has 32 lines, which are divided into three groups of roughly the same size. To make sure the bus always has a minimum capability for error detection, we generate a parity bit for each of the groups. These three parity bits are always sent on the bus together with the data lines. This is the lowest level of protection on the bus. For each group, if it becomes known that the group is about to encounter a noise-sensitive pattern, i.e., if the current pattern results in two or more Miller effects in that group, we encode the group by using the extended Hamming code, which provides the highest level of protection for that group. The utilized extended Hamming code has the following parity check matrix [43] (for the group that has 10 lines, we simply omit the last column of the parity check matrix):

        | 1 1 1 1 1 1 1 1 1 1 1      |
        | 1 1 1 1 0 0 0 0 0 0 1      |
   H =  | 1 0 0 0 1 1 1 0 0 0 1   I6 |
        | 0 1 0 0 1 0 0 1 1 0 1      |
        | 0 0 1 0 0 1 0 1 0 1 1      |
        | 0 0 0 1 0 0 1 0 1 1 1      |

Figure 5-4 Special parity check matrix for the extended Hamming code.

In this matrix, I6 is the identity matrix of size 6. The first row of the parity check matrix (excluding the I6 columns) is all ones. This means that this redundant line is actually computing the parity bit for all the lines in that group. Hence, the parity code is a sub-function of this extended Hamming code. In other words, the lower level of protection is subsumed in the higher level of protection. We have made sure that this last statement is always true for the various codes we have used. The algorithm for pattern-sensitive encoding and decoding is as follows. Assume the original data lines, denoted by n1 to n32, are divided into three groups: G1 to G3.
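As a concrete illustration of how a code of the Figure 5-4 kind operates, the sketch below implements systematic encoding and single-error-correcting decoding for a parity check matrix of the form [A | I6] whose first row is all ones, so that the first redundant bit is exactly the group parity bit. The matrix A used here is only illustrative: it satisfies the properties stated in the text (all-ones first row, distinct nonzero columns), but it is our own assumption rather than the exact matrix of [43].

```python
# Illustrative 6x11 systematic part A of an extended-Hamming-style
# parity check matrix H = [A | I6]. Assumption: this exact A is ours;
# it only preserves the properties the text requires -- the first row
# is all ones (so the first redundant bit is the group parity bit) and
# all 17 columns of H are distinct and nonzero (so any single-bit
# error has a unique syndrome and is correctable).
A = [
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1],
    [1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1],
    [0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1],
    [0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1],
]

def encode(data):
    """Systematic encoding: 11 data bits -> 17-bit codeword."""
    checks = [sum(a * d for a, d in zip(row, data)) % 2 for row in A]
    return list(data) + checks          # data bits are left unchanged

def decode(word):
    """Single-error-correcting decode of a 17-bit received word."""
    data, checks = list(word[:11]), word[11:]
    syndrome = [(sum(a * d for a, d in zip(row, data)) + c) % 2
                for row, c in zip(A, checks)]
    if not any(syndrome):
        return data                     # no error detected
    columns = [[row[j] for row in A] for j in range(11)]
    if syndrome in columns:             # single error in a data bit
        data[columns.index(syndrome)] ^= 1
        return data
    if syndrome.count(1) == 1:          # single error in a check bit
        return data
    raise ValueError("uncorrectable error pattern")
```

Because the first row of A is all ones, `encode(data)[11]` is exactly the XOR of the data bits; in other words, the parity code is a sub-function of this stronger code, as required above.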
The three lines used for parity encoding are denoted p1 to p3, and the remainder of the redundant lines are denoted hi,1 to hi,5 for i = 1, 2, 3. Note that the encoding technique for each group is selected independently in each bus cycle. The pseudocode for the sender is as follows:

    // PSE encoder
    for each i from 1 to 3:
        if (miller_effect_count in Gi) >= 2
            compute and send lines hi,1 to hi,5
        compute and send pi
    // end

And the pseudocode for the receiver is:

    // PSE decoder
    for each i from 1 to 3:
        if there is activity on any of the lines hi,1 to hi,5
            decode Gi assuming ext. Hamming code by using the hi's and pi
        else
            decode Gi assuming parity code by using pi only
    // end

The reader may correctly argue that it is possible that the extended Hamming code is in effect, yet no activity is observed on the h lines. This case happens when the value on the h lines (coded bits plus noise) is equal to the previous value on these lines (previously coded bits plus noise). Since both the previous value and the current one are subject to noise, we can assume that they are independent of one another. With this assumption, the probability of these two values being equal is 1/32, which is around 3%. In addition, this matters only if more than one error occurs per group of bits, which makes the probability of such an event very low. However, when this situation does occur, it is not possible for the receiver to determine that the extended Hamming code has been used, due to the non-activity of the h lines. Therefore, the decoder makes its default choice and assumes that only the parity encoding has been applied.
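The receiver-side inference and its corner case can be sketched as follows (a minimal illustration of our own; the representation of the h lines as bit lists is an assumption). With five h lines per group and independent, equiprobable values, the chance that a freshly encoded h-vector coincides with the previous one is 2^-5 = 1/32, matching the figure quoted above.

```python
def infer_code(prev_h, curr_h):
    """Receiver-side inference: any toggle on the five h lines means the
    extended Hamming code was driven this cycle; otherwise fall back to
    parity decoding, which is always safe because the parity code is a
    sub-function of the stronger code."""
    return "hamming" if prev_h != curr_h else "parity"

# Corner-case probability: 5 h lines with independent equiprobable values.
p_missed = 1 / 2 ** 5   # = 1/32, about 3%
```

In the missed case the receiver silently decodes with parity only; as the text explains, this costs some error-correction capability for that cycle but can never produce a logical decoding error.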
Of course, since parity is a sub-function of the extended Hamming code, no logical error will occur. This is exactly why the extended Hamming code has been designed to have the parity code as a sub-function. A systematic code is a code that does not modify the original data bits but only adds redundant bits to them in order to construct the codeword. Suppose a sender utilizes one of a number of systematic codes Ci, i = 1...N. The codes are selected such that each code is a sub-function of its successor, i.e., the functionality of Ci is included in Ci+1. We denote this containment relation by Ci ⊂ Ci+1. The sender chooses an arbitrary Ci for each transaction. The receiver observes the activity on the redundant lines in order to discover the encoding technique. Now, if a sufficient number of these redundant lines do not exhibit any activity, the receiver will mistakenly assume that the data was encoded with a weaker code, say Cj instead of Ci (j < i). However, this mistake will not lead to any logical decoding error because we have set Cj ⊂ Ci, and the redundant lines of Cj can still be useful. Another assumption is that the redundant lines are properly shielded such that they are not affected by coupling effects when they are not conveying data, so the receiver will not mistakenly see data on these lines. This assumption is satisfied by properly shielding each additional set of code lines, e.g., the lines used by Ci+1 but not by Ci. The above algorithm is effective in that it enables adaptive selection of the encoding technique without any communication overhead. By communication overhead, we mean extra messages notifying the receiver of the utilized technique. However, it does have some area overhead to shield or space out the groups of redundant lines.

5.3.3 DESIGN CHOICES

In this section, we explain our design choices for each part of the proposed ERC architecture.
First, we summarize the overall flow. The key idea is to partition the channel into a small number of properly shielded sub-channels and to add redundancy for each group of bits in a sub-channel. Next, the encoder dynamically and efficiently recognizes the noise-sensitive patterns that are to be communicated over each sub-channel and uses the redundant bits to employ the most energy-efficient encoding technique that will guarantee a minimum level of reliability for the transmitted pattern on the sub-channel. Finally, the decoder, having examined the activity on the set of redundant lines for each sub-channel, makes a guess as to which encoding technique was used to encode the sub-pattern in each sub-channel, and decodes the codeword accordingly. Note that since the decoder can only mistakenly use a weaker code that is subsumed by the actual code used by the encoder, there is no possibility of a logical error during the decoding phase, although in that case some amount of information is not exploited. Partitioning the 32-bit bus into three groups of 11, 11 and 10 bits is done to reduce the logic complexity of the codec and to enable efficient recognition of the noise-sensitive patterns. We opted for these group sizes because 1) a larger group size would increase the logic complexity; 2) a larger group size would compromise the ability to recognize the sensitive input patterns; and 3) with a larger group, the portion of error-free patterns with two or more Miller effects would increase. This directly degrades the quality of the results, as the extended encoding technique would then be applied to a larger number of patterns that, in fact, do not need it. On the other hand, a smaller group size means even more redundant lines in the system, which is also undesirable. Next, we justify the number of redundant lines per group.
We need a minimum error detection capability within each group, which we provide by employing parity encoding; the parity bit is generated by XORing all lines in the group. We also set the most reliable coding technique for each group to be the extended Hamming code. This extended Hamming code is selected such that one of its redundant lines is the parity bit for all data lines of the group. Therefore, we need 6 redundant lines per group in order to form the parity check matrix of the extended Hamming code. Suppose instead that we decided to have 5 redundant lines per group. An extended Hamming code that includes a parity check code and has 5 redundant lines per group can only accommodate 6 data lines. Therefore, we would need 6 groups (for a 32-bit bus) and a total of 30 redundant lines, which is unacceptable. Figure 5-5 shows the bus configuration for the proposed ERC architecture; data lines, redundant lines and shield lines are shown. Note that the single parity bit is sent along with the data bits of each group. The Hamming lines, i.e., the lines used exclusively for the extended Hamming encoding, are stacked together at the end of the bus. There are two reasons for this. First, the encoding technique used to encode each pattern depends on the noise sensitivity of that pattern. If the h lines were interleaved with the data lines, the actual pattern sent over the bus would depend on these redundant lines, which could change the sensitivity of the pattern itself; this would make recognizing the noise sensitivity of a pattern a much more complicated task. Second, by keeping the extra lines together, we avoid having to charge/discharge the inter-wire (coupling) capacitances between the original data lines and the h lines. In our encoding scheme, the h lines are used in only some of the cycles.
Therefore, during cycles when these lines are not used for encoding, we can make them retain their previous values. By doing this, their ground capacitances do not have to be charged or discharged. However, to avoid charging/discharging their inter-wire capacitances, we must also make sure that their adjacent lines do not make a transition; this is accomplished by putting all the h lines next to each other. Of course, the benefit lost as a result of the proposed bundling and placement of the h lines is that, if these lines were dispersed among the data lines, they could serve as shielding lines during the times they are not used, thereby reducing the noise susceptibility of the transmitted pattern. However, based on our experimental results and detailed simulations, we have concluded that the benefit of this shielding effect does not justify the cost of additional power consumption on the bus and larger encoding complexity. Figure 5-6 shows the block diagram of a generalized PSE. One block is in charge of evaluating the noise sensitivity; we call it the Sensitive Pattern Recognizer (SPR). The output of this block is used to decide the appropriate level of encoding to be applied to the pattern. This decision is made in a block named the Encoding Selector (ES). Additional inputs may be fed to the ES block, such as reliability constraints and information from a general noise-monitoring unit, e.g., a victim line as described in [42]. The ES block may be programmable, so that the criteria for identifying the sensitive patterns can be set after the chip is fabricated and tested, or even adjusted based on runtime conditions. In many cases, a whole
chip fails because of a crosstalk fault. Using the programmability of the SPR, we can incrementally tighten the criteria to include the critical patterns as noise-sensitive ones. As soon as the error-making pattern is covered, the flaw is eliminated and the chip functions without any problem. This is a post-production (in-field) reliability enhancement of a VLSI chip with minimal impact on its power/performance specifications.

[Figure 5-5: the bus layout, showing the three group encoders, the data lines plus parity bits, and the extra (h) lines stacked together at the end of the bus.]

Figure 5-5 Bus configuration of the proposed ERC architecture.

[Figure 5-6: the Sensitive Pattern Recognizer (SPR) feeding the Encoding Selector (ES), which drives the encoder; reliability constraints and a noise-monitoring unit are optional inputs to the ES.]

Figure 5-6 Block diagram of a generalized PSE.

5.4 EXPERIMENTAL RESULTS

In this section, we present experimental results showing the delay/power/reliability characteristics of the proposed ERC architecture. The key accomplishment of the PSE is to introduce flexibility for eliminating crosstalk faults under given timing constraints, and to do so as energy-efficiently as possible. We do not include any figure of merit such as total system power saving, because such a figure is highly dependent on the circuit timing constraints, the performance/power tradeoffs, and the number and physical length of the global wires in the VLSI circuit. Instead, we present low-level results illustrating the power and reliability capability of pattern-sensitive encoding. We compared the total power consumption of 1) InflatedBus; 2) a CodedBus that uses the extended Hamming code all the time; 3) a CodedBus that employs PSE but with the redundant lines interspersed among the data lines instead of stacked together; and 4) a CodedBus that employs the PSE technique as described. The bus cycle time was set to 750ps for all cases. Results are reported in Table 5-3.
Solution 1 (the first entry in the table, i.e., InflatedBus) results in the least power but lacks the required level of reliability. Solution 2, which is reliable, is not power-efficient. Solution 3 corresponds to PSE with interspersed redundant lines, whereas Solution 4 corresponds to the scheme of Figure 5-5 and is the most power-efficient reliable solution.

Table 5-3 Power consumption for different configurations of the bus.

1  InflatedBus                                    211.1 uW
2  CodedBus (always Ext. Ham. code, no PSE)       369.1 uW
3  CodedBus (PSE, interspersed redundant lines)   288.7 uW
4  CodedBus (PSE as in Figure 5-5)                266.3 uW

In the following, we make a few observations about the power consumption of the encoder and decoder logic. First, note that because the different encoding techniques in our proposed framework are selected such that each encoding is completely subsumed by its successor encodings, the total complexity of the encoder/decoder logic is determined by the most complex encoding scheme. Second, based on the results of Figure 5-1, we expect that any high-performance SoC bus will need protection by a rather sophisticated error-correcting code such as the extended Hamming code. Third, compared to a reliable high-performance bus, our proposed PSE technique does not consume more power in its encoding/decoding logic; in fact, without sacrificing reliability, it reduces the power consumption of the bus because of its pattern sensitivity. Finally, note that the codec power dissipation will scale down with technology faster than the power consumption of the global bus drivers. The latter is in fact expected to rise due to longer buses, higher data transfer rates, and faster clock frequencies,
whereas the codec power is expected to fall as a result of Moore's-law scaling and diminishing supply voltage levels. In terms of reliability, the objective is to ensure that the residual error rate (RER) does not exceed a bound. Residual errors are errors that are left undetected or are mistakenly corrected. In our proposed scheme, a single error in each group can be corrected. If the pattern containing this error is recognized as a sensitive pattern and sent over the channel using the extended Hamming code, then the error will be corrected. However, if the pattern is not recognized as a sensitive pattern, the error can only be detected. In addition, if there are two errors in a group, they are detected but cannot be corrected, and a retransmission is required. Since crosstalk faults depend on the pattern, it is likely that a retransmission of the same data will also be erroneous. To avoid this problem, we assume that all retransmissions are done using the strongest encoding technique and that the retransmission is done in two bus cycles. Note that even our strongest encoding technique cannot correct more than one bit error per group, so if we encounter two errors, we must retransmit the data in two cycles so that the delay fault is potentially eliminated. Simulation shows that, with this policy, all detected errors become either correctable or are washed away. We performed detailed simulations with 20,000 pairs of vectors to determine the number of residual errors in a group under various cycle time constraints. In this simulation, the first vector in each pair is applied and kept on for a long period of time so that it settles to its final steady-state value on the channel. After that, we apply the second vector and record the number of bit errors on the receiver side after a fixed clock cycle time.
Two different cases were investigated. First, we assumed that the Sensitive Pattern Recognizer (SPR) marks a pattern as sensitive when it has two or more Miller effects (the same criterion as in the sender algorithm). In the second experiment, we assumed that patterns with three or more Miller effects are classified as noise-sensitive. By doing this, we are trading off power dissipation against reliability. This change decreases the number of sensitive patterns from 7,139 to 2,830. We report the number of errors and the percentage of required retransmission cycles for four different cycle times in Table 5-4.

Table 5-4 Errors over a single group for various cycle time constraints.

Cycle time (psec)           750      700      650      600
# Sensitive patterns: 7139 (two Miller effects and above)
# Crosstalk faults           53      557     1778     3987
# Corrected errors           38      514     1578     2805
# Detected errors            15       43      191     1105
# Residual errors             0        0        6       67
RER                           0        0   2.7e-5   3.0e-4
Re-trans cycles            0.4%     1.2%     5.5%    31.6%
# Sensitive patterns: 2830 (three Miller effects and above)
# Crosstalk faults           53      557     1778     3987
# Corrected errors           35      415     1095     1562
# Detected errors            18      142      674     2308
# Residual errors             0        0        9      117
RER                           0        0   4.9e-5   5.3e-4
Re-trans cycles            0.5%     4.2%    19.5%    61.5%

The above table shows that by using PSE we are able to operate reliably at a bus frequency that would otherwise result in many transmission errors. In the second case, when the criterion for recognizing sensitive patterns is changed, the number of sensitive patterns is significantly reduced. For longer cycle times, this is definitely to our benefit, since it yields higher power savings without jeopardizing the reliability. However, as the cycle time constraints get tighter, the overhead of retransmission cycles increases significantly, and this extra power saving is offset by the increased cycles.
5.5 CONCLUSIONS

A major challenge in SoC design is to connect numerous modules and IP blocks that need to communicate with each other at high speed over long distances. This SoC design trend has been accompanied by a significant increase in the noise sources that adversely affect VLSI interconnect performance. Consequently, reliable on-chip communication in future SoC designs will not be possible unless these designs employ on-chip error detection and correction mechanisms. In this chapter, we presented an Energy-efficient, Reliable on-chip communication Channel (ERC) architecture, which is specifically designed to compensate for crosstalk errors in a highly power-efficient manner. We also proposed the notion of Pattern Sensitive Encoding (PSE). We experimented with various bus configurations, reported their corresponding power consumption, and compared different instances of PSE in terms of reliability and performance. Finally, we proposed a new methodology for selecting encoding techniques such that the encoding selection between sender and receiver is done without imposing any communication overhead on the system.

Chapter 6

ENCODING TECHNIQUES FOR REDUCING HOT-CARRIER DEGRADATION

6.1 INTRODUCTION

With the current trend of shrinking minimum feature sizes and rising clock frequencies in VLSI circuits, reliability has become a major design issue. In spite of the one-time benefits of reducing the supply voltage level, the substrate temperature and power density are rapidly increasing. At the same time, the decrease of critical device dimensions to sub-micron ranges results in more intense horizontal and vertical electric fields in the channel region.
Under the gate of a transistor, these enormous fields give rise to electrons and holes with kinetic energies significantly higher than the silicon band gap (1.1 eV). These electrons and holes may be injected into the gate oxide and can cause permanent changes in the oxide interface charge distribution. This phenomenon is called the hot-carrier effect, or sometimes the hot-electron effect, because the injection happens more often for electrons due to their smaller barrier height compared to holes (3.1 eV for electrons versus 4.8 eV for holes [11]). A sizeable increase in the threshold voltage of the affected transistors and a corresponding decrease in their drain current driving capability are undesirable results of such hot-carrier injection into the gate oxide. The hot-carrier effect is exacerbated as technology moves toward smaller device dimensions and higher clock frequencies [11]. Another phenomenon caused by carriers with energies higher than 1.1 eV is the creation of electron-hole pairs through impact ionization. In an NMOS device, the generated electrons are collected by the drain, whereas the generated holes drift in the substrate toward the ground terminals and thereby contribute to a substrate current. Carrier injection and impact ionization are most severe when the device is in the saturation region, because then the intensity of the electric field in the channel is at its maximum. Therefore, device degradation strongly depends on the duration of time that the transistor stays in the saturation region, and the substrate leakage current is a good indicator of the degree of device degradation [41]. Every time the output of a gate makes a transition, some transistors must pass through the saturation region, either to turn on or to turn off. This means that the amount of time spent in saturation directly depends on the output activity of the device.
In addition, it depends on the output load capacitance and the slew rate of the input signal, since these two parameters directly determine the output slew rate for a given gate. Now, letting D_fresh and D_aged denote the fresh and stressed (aged) propagation delays of a CMOS inverter, we can write:

    D_aged(t) = ψ(T_slew, C_load, N_sw, t) · D_fresh

where T_slew denotes the input slew rate, C_load is the output capacitance, N_sw represents the output switching activity, and t is the time parameter (the total "on-time" of the circuit in seconds). The function ψ is a non-linear function that is determined from transistor-level simulations and is usually represented as a three-dimensional table [19]. For gates other than simple inverters, the ratio-based degradation model proposed in [66] is applied. In this model we have D_aged = α·D_fresh, where α > 1 is defined as the overall degradation of all transistors in the gate and is calculated as:

    α = (Σ_i α_i) − n + 1

In the above equation, n is the number of transistors in series and α_i > 1 is the aged-to-fresh delay ratio when only input pin i is under stress, defined by ψ(T_slew, C_load, N_sw, t). Hot-carrier degradation can cause digital systems to fail. A line driver may fail due to an increase in its threshold voltage, which slows down the driver to the point that it violates the bus timing constraints, i.e., a gate delay fault is created as a result of hot-carrier degradation. It is also possible to encounter a case whereby the increased delays of consecutive gates along a combinational path in the circuit create a setup time violation for the flip-flop at the end of the path, which is known as a path delay fault. In this work, we do not consider faults caused by accumulated path delay.
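The ratio-based model above is straightforward to evaluate. The following sketch (with illustrative delay values and degradation ratios of our own, not taken from [66]) computes the aged delay of a gate from its per-pin degradation ratios:

```python
def aged_delay(d_fresh, alphas):
    """Ratio-based hot-carrier degradation model described in the text.

    alphas[i] > 1 is the aged-to-fresh delay ratio when only input pin i
    is under stress; for a gate with n series transistors the overall
    degradation is alpha = (sum of alphas) - n + 1, and
    D_aged = alpha * D_fresh.
    """
    n = len(alphas)
    alpha = sum(alphas) - n + 1
    return alpha * d_fresh

# A simple inverter degrades by its single per-pin ratio, while a
# 3-input gate combines its ratios (illustrative numbers):
inv_aged = aged_delay(100.0, [1.25])            # alpha = 1.25 -> 125.0
nand_aged = aged_delay(100.0, [1.1, 1.2, 1.3])  # alpha = 1.6  -> 160.0
```

Note that for a single-input gate the formula reduces to D_aged = α_1 · D_fresh, consistent with the inverter equation above.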
Our focus is on on-chip buses with a single driver. We define the lifetime of a wire or line as the Mean Time To Failure (MTTF) of its driving gate. This wire might be a segment of a bus between the driver and a repeater, or a segment between two repeaters. We also assume that the whole system fails as soon as a single wire fails. There are different approaches to modeling the lifetime of a wire. For example, we could say that a line fails when ΔD = D_aged − D_fresh > β·D_fresh, where β is a user-defined parameter greater than zero (e.g., 0.25). However, the lifetime of a wire is actually a random variable with a certain probability distribution function, so the above deterministic definition is not suitable. Unfortunately, it is extremely difficult to derive the lifetime distribution from the physics of the hot-carrier effect [11]. A complete statistical characterization of a transistor's lifetime depends heavily on the technology and on various physical phenomena. Instead, what one typically does is empirically determine the distribution that best fits the lifetime of a wire across different designs and different chip realizations of the same design. Both lognormal and Weibull probability distribution functions have been used for characterizing hot-carrier lifetime [34]; these two distributions are similar near their mean values. In this paper, we choose to model the lifetime of an interconnect driver as a lognormal random variable. In addition, based on the previous discussion, we make the following assumption: "the mean value of the driver lifetime is inversely proportional to its output switching activity." The hot-carrier effect has been extensively studied in the past few years.
Different design techniques for minimizing hot-carrier degradation have been introduced, such as transistor resizing and reordering [29,11], logic factorization [55], technology mapping [23], logic restructuring [20], and binding and scheduling [28]. In this paper, we propose a completely new approach to minimizing the hot-carrier-induced failures of on-chip buses in VLSI circuits. More precisely, by increasing the lifetime of the gates that are subject to the most severe hot-carrier degradation (e.g., bus drivers), we increase the lifetime of the whole circuit. Buses tend to have high capacitance compared to other wires in the system; in addition, they usually have a high activity rate compared to other nets in the circuit. This makes bus drivers some of the most hot-carrier-degradation-prone components. We assume that we are given a set of interconnect lines in a VLSI circuit that are flagged as hot-carrier sensitive (HC-sensitive). Furthermore, we assume that if one of these lines fails, then the whole circuit fails. Our approach is to add encoding/decoding hardware that protects the circuit against hot-carrier failure. We want the encoding and decoding functions to be as lightweight as possible so as to minimize their impact on the overall area, delay and power consumption of the circuit. Returning to the bus problem, the fact is that, even with equal-sized drivers and identical loads on each and every bus line, different lines of the same bus may age at different rates due to their different bit-level activities over time. For example, the LSB of a bus carrying small positive numbers over the circuit lifetime is expected to have a much higher activity than the MSB of the same bus. The bus as a whole, however, fails as soon as any of its lines fails.
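The LSB/MSB imbalance is easy to demonstrate. The sketch below (our own illustration) counts per-line transitions for a bus carrying an incrementing counter, a simple stand-in for the "small positive numbers" mentioned above:

```python
def line_activities(values, width):
    """Per-line transition counts for a stream of bus words (line 0 = LSB)."""
    counts = [0] * width
    prev = 0
    for v in values:
        toggled = prev ^ v
        for i in range(width):
            counts[i] += (toggled >> i) & 1
        prev = v
    return counts

# A 10-bit bus carrying the values 0, 1, ..., 999 in sequence:
acts = line_activities(list(range(1000)), width=10)
# The LSB driver toggles on every word (999 transitions), while the MSB
# driver toggles only once (at the 512 boundary). With identical drivers,
# the LSB driver therefore ages fastest and sets the lifetime of the bus.
```

Since the whole bus fails when its most active line fails, it is this maximum per-line count, not the total, that the encodings proposed below aim to reduce.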
The encoding solutions we propose increase the lifetime of a bus by balancing the switching activity among all bus lines, i.e., by minimizing the maximum line activity of the bus. If we do not balance the transition counts over the different lines of the bus, then the expected lifetime of the bus will be dominated by the most active line and, as a result, will be much shorter. The remainder of this paper is organized as follows. In the next section, we set up the problem precisely. In Section 6.3, we explain the type of approach we have adopted to solve this problem and give some definitions. In Section 6.4, different combinational functions are examined, while in Section 6.5 sequential solutions are studied. Results are presented in Section 6.6, and conclusions are provided in the final section.

6.2 PROBLEM SETUP

The goal of this work is to apply encoding functions that reduce the maximum bit-level activity over a set of bus line drivers. The bus encoding and decoding functions are performed in such a way that the logic functionality of the circuit is not impacted. If we consider a single-bit bus, there obviously exists no single-bit encoding and decoding function that can reduce the bit-level activity without modifying the surrounding logic. Instead, consider a multi-bit bus where the bus lines are bundled together, implying that the set of drivers are placed in close proximity to each other, as are the set of bus receivers. We assume that the bus drivers have nearly identical electrical characteristics (i.e., in terms of their size and drive strength). Our goal is to maximize the bus lifetime, which is set by the minimum lifetime among all its bit lines, through bus encoding. The lifetime of a bus line is mainly determined by the output switching activity of its driver.
This is because the drivers for all bus lines are the same, so the only factor impacting a line's lifetime is its activity level, which in turn determines the extent of the hot-carrier effect on the line driver. Figure 6-1 shows an example of a bus comprising three global lines. One of these lines has much higher activity than the other two and is thus flagged as an HC-sensitive line. The two other lines can be used to reduce the hot-carrier vulnerability of this sensitive line by reducing the output switching activity of its driver. This will likely be achieved by increasing the activity of the non-sensitive lines; note, however, that the overall lifetime of the bus improves because the maximum activity over the bus lines is reduced. Adding the decoder and encoder logic can be done with minimal impact on the chip routing because of the proximity of the bus drivers and receivers.

[Figure 6-1: an encoder driving three global lines, one of which is marked as a sensitive line, with a decoder at the receiving end.]

Figure 6-1 A set of 3 global lines that are good candidates for encoding and decoding.

For a given bus, we denote the lifetime of the driver of line i by the random variable (r.v.) LT_i. We assume that each LT_i is a r.v. with a lognormal probability distribution function and that these r.v.'s are independent of each other. We further assume that the expected value of LT_i is proportional to the inverse of the switching activity of line i. Therefore, the Mean Time To Failure (i.e., the lifetime) of a bus line is inversely proportional to the switching activity of its driver. When a single bit-line driver fails, the whole bus fails, and hence the whole circuit fails. If we denote the lifetime of the whole bus by LT_bus, then we have LT_bus = Min_i{LT_i}.
Define G_{LT_i}(T) = Prob(LT_i <= T) as the cumulative distribution function (cdf) of LT_i. Similarly, the cdf of LT_bus can be defined as:

G_{LT_bus}(T) = Prob(LT_bus <= T) = 1 - Prod_i (1 - G_{LT_i}(T)).

Let P_X(T) denote the (marginal) probability distribution function (pdf) of an r.v. X. The pdf of LT_bus is obtained by taking the derivative of its cumulative distribution function, yielding:

P_{LT_bus}(T) = Sum_i [ P_{LT_i}(T) * Prod_{j != i} (1 - G_{LT_j}(T)) ].

Obviously, E(LT_bus) = E(Min_i{LT_i}), where E denotes the expectation operator. It is, however, difficult to calculate this expectation value. Instead, we approximate it with Min_i{E(LT_i)}, which is accurate if the variances of all the LT_i lognormal variables are in the same range. This is because if any of the LT_i variables has an expectation value much larger than Min_i{E(LT_i)}, then that variable will have little impact in determining E(LT_bus). To illustrate this point, let L and S denote two lognormal r.v.'s such that the mean of variable L is greater than the mean of variable S. Table 6-1 reports a comparison of Min(E(L),E(S)) = E(S) with E(Min(L,S)) for different combinations of E(L)/E(S) and Var(L)/Var(S), assuming E(S)=1 and Var(S)=0.25. This table shows three facts: 1) For a fixed ratio of E(L)/E(S), the error between E(Min(L,S)) and E(S) grows as the ratio Var(L)/Var(S) increases. 2) For a fixed ratio of Var(L)/Var(S), the error between E(Min(L,S)) and E(S) diminishes as the ratio E(L)/E(S) increases. 3) For E(L)/E(S) > 2, the error between E(Min(L,S)) and E(S) vanishes and remains negligible independent of Var(L)/Var(S), as if variable L had no impact in determining E(Min(L,S)).
Table 6-1 E(Min(L,S)) / E(S) as a function of E(L)/E(S) and Var(L)/Var(S)

Var(L)/Var(S)    E(L)/E(S):  1      2      3      5      7      10
  1                        0.73   0.96   0.99   0.99   0.99   0.99
  4                        0.63   0.92   0.99   0.99   0.99   0.99
  9                        0.55   0.87   0.97   0.99   0.99   0.99
 16                        0.50   0.82   0.95   0.99   0.99   0.99
 25                        0.45   0.77   0.92   0.99   0.99   0.99

Speaking more generally, we know:

E(Min(L,S)) = E(L) + E(S) - Int_{tau=0}^{inf} Q_L(tau) P_S(tau) d(tau) - Int_{tau=0}^{inf} Q_S(tau) P_L(tau) d(tau)

where Q_X(tau) = Int_{t=tau}^{inf} t * P_X(t) dt.

For two lognormal random variables L and S with E(L) > E(S), if the overlap between the pdf's of these two random variables is small, then Q_L(tau) P_S(tau) may be approximated by E(L) P_S(tau), whereas Q_S(tau) P_L(tau) is nearly zero. Therefore, in this case, E(Min(L,S)) ~ E(S) = Min(E(L),E(S)).

Although the above conclusions were drawn for the case of only two r.v.'s, our experiments show that they still hold as long as the number of r.v.'s remains small, i.e., <= 8. For example, suppose that instead of two r.v.'s, S and L, we have five r.v.'s S, L1, ..., L4 such that E(Li) = 3E(S) and Var(Li)/Var(S) <= 9 for i = 1...4. In this case, using our approximation will only result in an additional 4% error on top of the results of Table 6-1 (the total error will be 7%, which is still rather small).

The hot-carrier-aware bus encoding problem is different from the low-power bus encoding problem [2, 13, 30, 53, 58]. This is because the objective of the former is to minimize the maximum bit-level activity (a minmax cost function), whereas the latter attempts to minimize the total bit-level activity (a minsum cost function). Note that HC-aware bus encoding can be applied to a bus that has already been optimized for low power dissipation in order to balance the bit-level transition counts of the bus lines. This may result in some increase in the total power dissipation, but it increases the bus reliability over time.

The abovementioned problem setup is based on a number of approximations.
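The comparison reported in Table 6-1 is easy to reproduce numerically. Below is a minimal Monte Carlo sketch (all function names are ours, not part of the text); it converts a target (mean, variance) pair into lognormal parameters and estimates E(Min(L,S))/E(S) by sampling:

```python
import math
import random

def lognormal_params(mean, var):
    """Convert a target (mean, variance) pair to lognormal (mu, sigma)."""
    sigma2 = math.log(1.0 + var / mean ** 2)
    return math.log(mean) - sigma2 / 2.0, math.sqrt(sigma2)

def e_min_ratio(e_l, var_l, e_s=1.0, var_s=0.25, n=200_000, seed=1):
    """Monte Carlo estimate of E(Min(L,S)) / E(S)."""
    rng = random.Random(seed)
    mu_l, sg_l = lognormal_params(e_l, var_l)
    mu_s, sg_s = lognormal_params(e_s, var_s)
    total = sum(min(rng.lognormvariate(mu_l, sg_l),
                    rng.lognormvariate(mu_s, sg_s)) for _ in range(n))
    return (total / n) / e_s

# A large E(L)/E(S) makes the approximation Min(E(L),E(S)) = E(S) accurate;
# a large Var(L)/Var(S) at comparable means makes it poor (cf. Table 6-1).
close = e_min_ratio(e_l=7.0, var_l=0.25)   # E(L)/E(S) = 7, Var ratio 1
rough = e_min_ratio(e_l=1.0, var_l=6.25)   # E(L)/E(S) = 1, Var ratio 25
```

With these two parameter choices the estimate lands near 0.99 in the first case and well below 0.7 in the second, matching the corner entries of Table 6-1.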
It is possible to come up with rare examples where, even though the maximum bit-level activity of the bus is reduced, the lifetime of the bus actually becomes shorter. To be more precise, consider a four-bit bus (S, L1, L2, L3), where the expected values of the lognormally distributed lifetime variables LT_k of the bus lines are E(S) = 1, E(L1) = E(L2) = E(L3) = 5. All variances are equal to 0.25. The exact lifetime of this bus (assuming lognormal distributions) is written as:

LT_bus^{old,Analytic} = E(Min_k{LT_k}) = E(Min_k{lognorm(E(LT_k), Var(LT_k))}).

With our proposed approximation, we have:

LT_bus^{old,Approx} = Min_k{E(LT_k)} = 1.

Suppose that after HC-aware bus encoding, the activities change such that the new lifetime expectation values are E(S) = E(Li) = 1.1 for all i. Let us assume that the variances remain the same. From our approximation, the final bus lifetime is:

LT_bus^{new,Approx} = Min_k{E(LT_k)} = 1.1,

which is 10% longer than before, that is, LT_bus^{new,Approx} > LT_bus^{old,Approx}, and thus the HC-aware encoding appears to have been effective. However, for this case, LT_bus^{new,Analytic} = 0.67, which shows that the encoding has in fact been unsuccessful. In general, when such a case happens and the activities of a number of non-critical lines increase significantly, the actual lifetime of the bus, LT_bus^{new,Analytic}, may become smaller than LT_bus^{old,Analytic}. So the question is: under what conditions can we trust our approximations and actually expect a lifetime improvement after HC-aware encoding of a bus?

In the above example, assume instead that the activity of no line becomes larger than 50% of its initial activity. Making that assumption, the lifetime will be extended by at least 4% in the worst case. This means that although the absolute accuracy of our approximation may not be good, its relative accuracy (i.e., the fidelity of the approximation) is high.
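The counterexample above can be verified by simulation; a sketch (function name ours) that estimates E(Min_k{LT_k}) by sampling the independent lognormal lifetime variables:

```python
import math
import random

def expected_min_lifetime(means, var=0.25, n=100_000, seed=2):
    """Monte Carlo estimate of E(Min_k LT_k) for independent lognormal
    lifetime variables with the given expectation values."""
    rng = random.Random(seed)
    params = []
    for m in means:
        s2 = math.log(1.0 + var / m ** 2)
        params.append((math.log(m) - s2 / 2.0, math.sqrt(s2)))
    total = 0.0
    for _ in range(n):
        total += min(rng.lognormvariate(mu, sg) for mu, sg in params)
    return total / n

before = expected_min_lifetime([1.0, 5.0, 5.0, 5.0])  # approximation says 1.0
after = expected_min_lifetime([1.1, 1.1, 1.1, 1.1])   # approximation says 1.1
# The approximation reports a 10% gain, but the exact expectation drops,
# since four lines now compete near the same (low) lifetime.
```

The simulated exact lifetime drops from roughly 1 to roughly 0.67, as claimed in the text.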
As a general rule, given a set of lines, as long as the maximum line activity is reduced by a large percentage (e.g., 50% or more with respect to the initial maximum), or the activities of the non-sensitive lines do not come close to the reduced maximum activity after encoding (e.g., remain less than 50% of the new maximum), we will have a lifetime improvement. This is in fact the scenario that we encounter in practice; that is, in the methods that we will present, we almost always achieve around a 20-50% reduction in the maximum activity (see the Results section), and we never see a large increase in the switching activities of the non-critical lines. This means that although the actual activity values after encoding might affect the accuracy of our approximation, there will be a high fidelity between the exact lifetime and the approximated one, and the lifetime will be extended almost as much as we expect it to be. (Let f be an approximation of a function g; we say f has high fidelity with respect to g if f(x) > f(y) implies that g(x) > g(y).)

Finally, we recognize that RT-level design techniques for distributing the bus activity have been proposed in the past [28]. These methods attempt to bind and schedule the data transfers of a control and data flow graph representing the application so as to evenly distribute the switching activities. In our proposed approach, encoding is performed after completing the logical design, with some level of information from placement and routing. This is the stage in the design flow when the arrival times and slews of logic signals and the capacitances of interconnect lines can be well estimated. These physical attributes of the design are needed to first identify the degradation-prone drivers to which the encoding functions are applied.
6.3 REDUCING THE BUS MAXIMUM ACTIVITY

Based on what we said in the previous section, we will next examine encoding techniques that can reduce the maximum activity of a trace. The maximum activity of a trace (refer to Definition 2-16) is the maximum among the total activities of its different lines (refer to Definition 2-14). The total activity of line i of a trace T (1 <= i <= N) is denoted by TLA[i], whereas the maximum activity of trace T is denoted by MA(T). Other notions that will be used frequently throughout this chapter are the inter-trace and the inter-sourcewords, i.e., B = <U_2, U_3, ..., U_LE> (refer to Definition 2-9).

We are interested in finding a function that reduces the maximum activity of a trace T. Different traces emerge on a given bus in different epochs. Of course, we would like to design an encoder and a decoder that work well for all of these traces. These traces may be similar to each other in terms of a number of different characteristics. Therefore, we will look at our problem based on the following methodology: Given a certain amount of information about a trace, we will investigate whether this information is sufficient to design appropriate encoding/decoding functions for the purpose of reducing the maximum bit transition count. We will then devise the kind of logic functions, combinational or sequential, that are capable of accomplishing this goal.

Definition 6-1 Characteristic c: An equivalence class of traces CL(c) is the set of traces T that are characterized by some characteristic of interest, c. In other words, they cannot be distinguished from each other as far as characteristic c is concerned.

Definition 6-2 Characteristic number of sourcewords: Each sourceword in trace T is an N-bit binary number. Let N_i denote the number of sourcewords that are equal to i, 0 <= i < 2^N, i.e., N_i = Sum_{1<=j<=LE} delta(X_j - i), where delta is the Kronecker delta function.
We ought to characterize and classify traces based on their characteristics. Indeed, the existence and construction of a reversible function F that minimizes MA(T) is strongly dependent on the set of common characteristics that are extracted from these traces.

Example. Some classes, for instance, are: CL(N_0 = 100), i.e., the traces with exactly a hundred sourcewords equal to zero; CL(N=1, LE=200, N_0=N_1), i.e., the one-bit traces of 200 sourcewords (bits) with an equal number of ones and zeros; CL(TLA[i] = K_i given for every i), i.e., all traces that have K_i transitions on line i, where {K_i} is given; CL(a 2^N-state lag-one Markov source generator); etc. Notice that different classes may have common members.

Definition 6-3 Bit-level Transition Balancing Problem: Given a class of traces CL(c), the bit-level transition balancing (BTB) problem refers to the problem of finding a function F that
1. is reversible, and
2. satisfies MA(F(T)) < MA(T) for every member T of CL(c).

Lemma 6-1. No universal function, combinational or sequential, can always reduce MA(T) for all traces.

This is an intuitive, yet imperative, result. Notice that the existence of such a function would be in conflict with information-theoretic principles, because repeated application of that function would eliminate all transitions in a trace without loss of information. This lemma is remarkable in the sense that it prompts us to characterize the trace first and then find solutions based on that characteristic. There is no magical logic that can always reduce the maximum activity. We must have a certain level of knowledge about the bus in order to come up with a function that always works. We investigate combinational and sequential decoders and encoders separately.
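The bit-level quantities introduced above (TLA[i], MA(T), and the inter-trace B) are straightforward to compute for a concrete trace; a small sketch with sourcewords represented as integers (all names are ours):

```python
def inter_trace(trace):
    """Inter-trace B = <U2, ..., U_LE>, where U_j = X_j XOR X_{j-1}."""
    return [a ^ b for a, b in zip(trace, trace[1:])]

def tla(trace, n_bits):
    """Total line activity TLA[i]: number of transitions on bit line i."""
    b = inter_trace(trace)
    return [sum((u >> i) & 1 for u in b) for i in range(n_bits)]

def ma(trace, n_bits):
    """Maximum activity MA(T) = max_i TLA[i]."""
    return max(tla(trace, n_bits))

t = [0b00, 0b01, 0b00, 0b01, 0b11]   # a small 2-bit trace
# inter-trace: [01, 01, 01, 10]; TLA = [3, 1]; MA = 3
```

Line 0 toggles three times while line 1 toggles once, so this trace's activity is badly unbalanced even though its total activity is only four transitions.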
Combinational functions, if they can accomplish what we want, would be much better choices, as they need neither a clock nor additional flip-flops. For sequential functions, we recognize two different categories. The first one is a special category of sequential circuits that we call inter-sequential functions, and the second one makes use of general sequential functions. Inter-sequential functions, as we will see, are combinational functions that are applied to the inter-trace B instead of the trace T, and they are usually much simpler than general finite state machines. We first look at combinational logic in detail.

6.4 CODING WITH COMBINATIONAL FUNCTIONS

In this section, we investigate when combinational functions are capable of reducing MA(T). A combinational function F: {0,1}^N -> {0,1}^N can be used for the encoding task only if it is reversible (refer to Definition 2-6). The total number of such functions is (2^N)!, corresponding to the permutations of all sourcewords in the N-bit space. However, since transitions on lines are of primary significance, these functions can be partitioned into NP-equivalence sets (NP stands for Negation and Permutation). Two functions are NP-equivalent exactly if they can be transformed into one another by inverting one or more output columns and/or swapping some columns. Therefore, the number of NP-equivalence sets is (2^N)! / (N! * 2^N). For example, for N=2, there are three distinct classes. One function from each class is shown in Table 6-2 (F1, F2 and F3). The function shown from each class has the property of mapping sourceword zero to zero, i.e., F1(00...0) = 00...0. Some other members of class [F1] have also been shown.
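The count (2^N)! / (N! * 2^N) can be verified by brute force for N = 2; a sketch (names ours) that represents each reversible function as the tuple (F(0), F(1), F(2), F(3)) and groups all 4! = 24 functions into NP-orbits:

```python
from itertools import permutations

N = 2
words = list(range(2 ** N))

def np_variants(f):
    """All NP-equivalents of f: invert and/or permute its output columns."""
    variants = set()
    for perm in permutations(range(N)):
        for inv in range(2 ** N):
            def transform(y, perm=perm, inv=inv):
                y ^= inv                       # invert selected output columns
                out = 0
                for i, p in enumerate(perm):   # permute output columns
                    out |= ((y >> i) & 1) << p
                return out
            variants.add(tuple(transform(y) for y in f))
    return variants

# Partition all (2^N)! reversible functions into NP-equivalence classes.
classes = []
for f in permutations(words):
    if not any(f in c for c in classes):
        classes.append(np_variants(f))
```

For N = 2 the loop finds exactly three orbits of eight functions each, i.e., 24/8 = 3, matching the formula and the three representatives of Table 6-2.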
Table 6-2 Representatives of the three NP-equivalence sets in the 2-bit space, plus some other members of class [F1]

X     F1(X)   F2(X)   F3(X)   Some members of [F1]
00    00      00      00      00    10
01    01      01      11      10    11
10    10      11      10      01    00
11    11      10      01      11    01

Definition 6-4 Replace-Invert: Consider a function F; we say F replace-inverts the i-th bit if there is a corresponding bit in the output that is always equal to the i-th input bit or its inversion. Clearly, when a function replace-inverts a bit, it does not change the bit transition count of that line.

Lemma 6-2. A characteristic class defined by a given set of total activities of lines {TLA[i]} is unbalanceable under all reversible combinational functions. More precisely, for every reversible combinational function F, a trace may be found that actually increases MA(T):

For all c = {TLA[i]}, for all F, there exists T in CL(c) such that MA(F(T)) > MA(T).

Proof. For a function F and a given set of bit transition counts in which all counts are nonzero, we construct a counterexample by building a trace whose maximum number of transitions increases after applying F, i.e., we build a trace that overwhelms the function. If the number of transitions of a line is zero, that line can be omitted; therefore, without loss of generality, we assume that all TLA[i] are greater than zero. Assume TLA[i] represents the maximum number of transitions. Any function that reduces MA(T) cannot replace-invert this bit. The trace T can be built as follows: the first TLA[i]+1 sourcewords of the trace alternate between two sourcewords that differ only in the i-th position, say X1 and X2. Obviously, F(X1) and F(X2) must differ in at least one bit (say the j-th position), which will cause the same TLA[i] transitions on that bit in F(T). Now we add one more sourceword to this trace, which causes the maximum number of transitions to exceed TLA[i]. For this new sourceword, the i-th bit is the same as in the previous sourceword.
The rest of the bits are allowed to change without violating the bit transition constraints, because the TLA[i]'s are all greater than zero. Now we pick a sourceword from the 2^{N-1} possible sourcewords whose mapping causes a transition in the j-th position under function F. Such a sourceword exists because F does not replace-invert the i-th bit. The rest of the trace can easily be generated to fulfill all the remaining bit transition constraints. F is overwhelmed and the proof is complete. ■

Therefore, knowing the bit transition counts is not enough to solve the BTB problem. Let us think in a new direction and assume that the information we have about a trace is the exact count of each sourceword (the number of times each sourceword appears in the trace). More precisely, the given information looks like this: for trace T, {N_i, 0 <= i < 2^N} is given.

Lemma 6-3. A characteristic class defined by a given set of numbers of sourcewords {N_i}, given N_i > 0 for 0 <= i < 2^N, is unbalanceable under all reversible combinational functions, that is:

For all c = {N_i : N_i > 0}, for all F, there exists T in CL(c) such that MA(F(T)) > MA(T).

Proof. If there exists F such that MA(F(T)) < MA(T) for all T in CL({N_i : N_i > 0}), then MA(F(T)) < MA(T) for all T in CL({N_i : N_i = 1}). In other words, if a function reduces MA(T) for every trace that is a member of CL({N_i : N_i > 0}), the function must also work for an arbitrary trace containing exactly one of each sourceword. This is because if equal sourcewords are ordered consecutively in a trace of CL({N_i : N_i > 0}), their effect on MA(T) is as if only one of them existed in the trace. Therefore, the problem may be reduced to finding a function that reduces MA(T) for a trace that contains every sourceword exactly once.
Now suppose that the input trace with N_i = 1 is arranged to construct a balanced gray code [35]. By definition, this leads to the minimum achievable MA(T) at the input, so no function can reduce MA(T) for this trace, and the proof is complete. ■

This means that as long as all sourcewords are present in the trace, no function can guarantee a reduction in the maximum transition count of the trace. This result is intuitively expected. Characterizing by the set {N_i} is not suitable for grouping traces as far as the BTB problem is concerned. This is because the positions of these sourcewords with respect to each other in the trace are a determining factor, yet they are completely ignored in this characterization.

Interestingly, if not all sourcewords are present in a trace, the class may become balanceable. This can easily be shown by the following example. T is a trace in the 8-bit space whose sourcewords are shown in the first column of Table 6-3, i.e., T belongs to CL({N_i = 1 for the eight sourcewords shown, N_i = 0 for the rest}). It is easy to verify that no matter how the sourcewords are ordered, MA(T) will be greater than 2. The second column shows F(T). It is not difficult to verify that MA(F(T)) will always be 2. In fact, it is the redundancy in the encoding (refer to Definition 2-22) that enables us to find a combinational function with the desired property. An interesting problem would be to find those sub-classes (characterized by the number of sourcewords) for which the problem is balanceable. This is an open problem, and we have not been able to solve it yet. Next, we will look at other methods to characterize a trace.
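The example's claim can be checked mechanically for one particular ordering; a sketch (helper name ours) using the eight sourcewords and their one-cold codewords as listed in Table 6-3:

```python
def line_activities(trace, n_bits):
    """Per-line transition counts of a trace of integer sourcewords."""
    return [sum(((a ^ b) >> i) & 1 for a, b in zip(trace, trace[1:]))
            for i in range(n_bits)]

# Sourcewords of the example and their one-cold images (Table 6-3):
# the k-th sourceword maps to the 8-bit word with a single 0 in bit k.
T = [0b11111000, 0b11111001, 0b11111011, 0b11111111,
     0b11111101, 0b11111100, 0b11111110, 0b11111010]
F = {t: 0xFF ^ (1 << k) for k, t in enumerate(T)}
FT = [F[x] for x in T]

ma_in = max(line_activities(T, 8))    # 3 for this ordering (> 2)
ma_out = max(line_activities(FT, 8))  # 2 for the one-cold encoding
```

In the one-cold output, each line carries a 0 in exactly one codeword, so any ordering of the eight codewords gives every line at most two transitions.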
Table 6-3 Encoding with redundancy to reduce the maximum transition count

T, MA(T) > 2    F(T), MA(F(T)) = 2
11111000        11111110
11111001        11111101
11111011        11111011
11111111        11110111
11111101        11101111
11111100        11011111
11111110        10111111
11111010        01111111

Definition 6-5 Characteristic number of inter-sourcewords: Let L_i denote the number of inter-sourcewords in the inter-trace B of a trace T that are equal to i, 0 <= i < 2^N, i.e., L_i = Sum_{1<=j<=LE-1} delta(U_{j+1} - i).

Another way to increase the amount of information, compared to the case where only the bit transition counts of a trace are known, is to provide the set of L_i values, i.e., {L_i, i = 0...2^N - 1}. Obviously, the total activity of line j, i.e., TLA[j], is simply calculated by adding up those L_i's that correspond to an inter-sourceword with a one in its j-th position, i.e., TLA[j] = Sum_i L_i * (j-th bit of i) for 0 <= i < 2^N. Note that all traces T in CL({L_i}) have the same maximum activity. The same is true when traces are characterized by {TLA[i], 1 <= i <= N}. But when traces are modeled by {N_i, 0 <= i < 2^N}, MA(T) differs across different instances of the class. This motivates the following definitions.

Definition 6-6 Uniformity: Given a characteristic class of traces CL(c), we say that the class CL(c) is uniform if all of its traces have equal MA(T) values.

Definition 6-7 Regularity: A uniform characteristic class is regular under a set of reversible functions {F_i} if, for each F in {F_i}, the characteristic class F(CL(c)) is uniform.

Definition 6-8 Inter-combinational function: A combinational function F that has the property F(X1 xor X2) = F(X1) xor F(X2) is called an inter-combinational function.

For inter-combinational functions, the mapping of the inter-sourceword of two sourcewords is equal to the inter-sourceword of their mappings. Of course, not all the combinational functions in the N-bit space have this characteristic.
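The inter-combinational property (XOR-linearity) can be checked exhaustively for small N; a sketch (names ours), including a reversible 3-bit function that fixes zero yet violates the property:

```python
def is_inter_combinational(f, n_bits):
    """Exhaustive check of F(x1 ^ x2) == F(x1) ^ F(x2) over all word pairs."""
    words = range(2 ** n_bits)
    return all(f[a ^ b] == f[a] ^ f[b] for a in words for b in words)

# F1, F2, F3 of Table 6-2, each written as an output tuple indexed by input.
F1 = (0, 1, 2, 3)
F2 = (0, 1, 3, 2)
F3 = (0, 3, 2, 1)

# A reversible 3-bit function that fixes 0 but is NOT inter-combinational:
# it swaps sourcewords 3 and 4 and leaves everything else alone
# (G[1 ^ 2] = G[3] = 4, while G[1] ^ G[2] = 3).
G = (0, 1, 2, 4, 3, 5, 6, 7)
```

For N = 2, every reversible function that fixes zero happens to be inter-combinational, which is why each NP-equivalence class there has an inter-combinational representative; the 3-bit example G shows that this stops being true for larger N.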
Only a small portion of the combinational functions are inter-combinational for a large N. If a combinational function is inter-combinational, it is easy to prove that we must always have F(00...0) = 00...0. To completely specify the rest of the function, it is enough to specify the output for N linearly independent inputs (linear independence means that none of these values can be generated by an XOR combination of the other ones). An instance of a linearly independent set is the set of one-hot sourcewords (refer to Definition 2-24). Any other input can be decomposed into an XOR of these one-hot values. The mapping of any sourceword under F is uniquely determined if the mapping of an independent set is known. If a function is inter-combinational, it is possible to calculate MA(F(T)), where T belongs to CL({L_i}) and the L_i's are the numbers of inter-sourcewords.

Lemma 6-4. A characteristic class defined by a given set of numbers of inter-sourcewords CL({L_i}) is uniform. However, it is regular only under the inter-combinational subset of combinational functions.

Consider the 2-bit space. In this case, interestingly, each NP-equivalence class has an inter-combinational representative. These representatives are exactly the functions that were reported in Table 6-2; F1, F2 and F3 are all inter-combinational functions. Now suppose that a set of traces (N=2) is characterized by the number of inter-sourcewords in the trace, as shown in Table 6-4. A 01 inter-sourceword occurs during one of the following events on the bus (00->01 or 01->00 or 11->10 or 10->11), and the total number of such events is equal to L1, per the following table.

Table 6-4 Modeling a trace based on the number of inter-sourcewords
Inter-sourceword U    Number
00                    L0
01                    L1
10                    L2
11                    L3

For such a trace, MA(T) = Max{L1+L3, L2+L3}. It is easy to prove that there exists a function in the 2-bit space that always reduces MA(T) if and only if L3 is the absolute maximum of {L1, L2, L3} and L2 is not equal to L1. In such a case, either F2 or F3 will reduce MA(T). For example, if L2 is the minimum of the three, i.e., L2 < L1 < L3, then applying F2 to T (refer to Table 6-2) results in MA(F(T)) = Max{L2+L1, L2+L3} = L3+L2, which is less than MA(T) = Max{L1+L3, L2+L3} = L1+L3. A similar approach may be used for N > 2.

Example. Consider the two least significant bits (LSBs) of an instruction address bus. Suppose that the instructions are sequential 80% of the time. L0, L1, L2 and L3 can be determined as follows (here the L_i's are specified as the percentage of the corresponding inter-sourceword relative to the total number of inter-sourcewords). The eighty percent sequential instructions contribute 40% to L1 and 40% to L3, whereas the 20% non-sequential instructions contribute 5% each to L0, L1, L2 and L3. Thus, L0 = 5%, L1 = 45%, L2 = 5%, L3 = 45%, and MA(T) = 90%. After applying the F2 function to this bus, the new L_i's will be: L0 = 5%, L1 = 45%, L2 = 45%, L3 = 5%, and thus MA(F(T)) = 50%.

For T characterized by {L_i}, we have a methodology to determine whether MA(F(T)) is less than MA(T) for an inter-combinational function F. This means that if the trace were tested against all inter-combinational functions, it would be possible to determine whether it is balanceable under inter-combinational functions. However, this test is not possible for large N, since the time required for it is exponential with respect to N. In practice, this test is usually feasible because we do not want to consider N larger than 6 or 7, due to the increasing complexity of the encoding/decoding functions.
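The 2-bit LSB example can be verified directly; a sketch (names ours) computing MA from the inter-sourceword counts and applying F2 as a permutation of inter-words:

```python
def ma_from_counts(L):
    """MA(T) for a 2-bit trace with inter-word counts L = [L0, L1, L2, L3]:
    line 0 toggles on inter-words 01 and 11, line 1 on 10 and 11."""
    return max(L[1] + L[3], L[2] + L[3])

def apply_inter_comb(L, f):
    """Inter-word counts after an inter-combinational function f."""
    out = [0] * len(L)
    for u, cnt in enumerate(L):
        out[f[u]] += cnt
    return out

F2 = (0, 1, 3, 2)            # swaps inter-words 10 and 11 (Table 6-2)
L = [5, 45, 5, 45]           # 80%-sequential instruction stream, in percent
L_new = apply_inter_comb(L, F2)
# MA drops from max(45+45, 5+45) = 90 to max(45+5, 45+5) = 50.
```

Because F2 is inter-combinational, only the inter-word counts need to be permuted; the traces themselves never have to be examined.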
Another point is that inter-combinational functions are only a small subset of the set of combinational functions. It is very difficult to analyze the effect of a non-inter-combinational function on a trace characterized by a set of L_i's. (By non-inter-combinational, we mean a combinational function that is not inter-combinational.) We cannot find a single characteristic class CL({L_i}) that is balanceable under a non-inter-combinational function. Therefore, we surmise that no characteristic class CL({L_i}) is balanceable under non-inter-combinational functions, although we have not been able to prove this statement. This means that if CL({L_i}) is unbalanceable under inter-combinational functions, then it is unbalanceable under all combinational functions. The basis for our conjecture is the fact that CL({L_i}) is uniform under the set of inter-combinational functions only. Therefore, it is not possible to control the variations of MA(F(T)) of traces under non-inter-combinational functions.

Definition 6-9 Characteristic Markov source: A trace class CL(c) may be defined by a lag-one Markov source R(I,S); that is, each distinct sourceword X in any trace T in this class denotes a state s in S of the Markov source, and each pair of consecutive vectors <X_i, X_{i+1}> defines a transition edge in the Markov source between s_i and s_{i+1}. I denotes the set of external inputs of the Markov source. A Markov source is completely specified if the probability of being in each state and the conditional probabilities of transitioning from one state to another are known. The transition probability matrix of R completely defines the characteristic class CL(c = R(I,S)). We assume that the external input values are uniformly distributed in the input space.
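A lag-one Markov source of this kind is easy to instantiate and sample; a small sketch (the names and the particular transition matrix are ours) that draws a trace from a mostly-sequential 2-bit source and measures its maximum activity:

```python
import random

def sample_trace(P, start, length, seed=0):
    """Draw a trace from a lag-one Markov source; P[s] is a list of
    (next_state, probability) pairs, and states double as sourcewords."""
    rng = random.Random(seed)
    trace, s = [start], start
    for _ in range(length - 1):
        r, acc = rng.random(), 0.0
        for nxt, p in P[s]:
            acc += p
            if r < acc:
                s = nxt
                break
        trace.append(s)
    return trace

def max_activity(trace, n_bits):
    """MA(T): the largest per-line transition count."""
    return max(sum(((a ^ b) >> i) & 1 for a, b in zip(trace, trace[1:]))
               for i in range(n_bits))

# A 2-bit source: with prob. 0.8 go to s+1 mod 4, else jump uniformly.
P = {s: [((s + 1) % 4, 0.8)] + [(t, 0.05) for t in range(4)] for s in range(4)}
trace = sample_trace(P, 0, 10_000)
frac = max_activity(trace, 2) / (len(trace) - 1)   # close to 0.9 here
```

The LSB line toggles on roughly 90% of the steps for this source, reproducing the MA(T) = 90% figure of the sequential-LSB example above.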
Definition 6-10 Reversible Markov mapping: We define a reversible function mapping F on R(I,S) as R(I,F(S)), with F^{-1}(F(R(I,S))) = R(I,S).

Definition 6-11 A characteristic class defined by a Markov source R(I,S) is regular under all reversible combinational functions.

Proof. CL(c = R(I,S)) is uniform, and it is mapped to CL(R(I,F(S))) under a combinational function F. CL(R(I,F(S))) is itself uniform; therefore, R(I,S) is regular under all reversible combinational functions. ■

For small N, it is thus possible to construct the output Markov source for every function F and find the new MA(F(T)). Therefore, balanceability can be checked by applying one function from each NP-equivalence class to the Markov source. The problem of finding the best of these functions is similar to the minimum Hamming-weight state assignment problem, which is known to be NP-complete [63]. For that reason, we developed a heuristic algorithm for our problem too. Of course, if N is small enough, then brute-force checking will lead to the optimum solution.

Definition 6-12 Minimum Max-Transition State Assignment: Find a reversible function F such that, for the traces of class R(I,F(S)), MA(T) is minimized.

We next present a heuristic algorithm, named PermuteStates, for solving this problem. The complexity of step 3 in this heuristic algorithm is 2^MaxSetSize; the larger MaxSetSize is, the closer the heuristic solution will be to the optimum one.

ALGORITHM (PermuteStates):
1. Generate an initial state assignment by setting F(s_i) = i, s_i in S;
2. SetSize = 2;
3. for every subset H of S with cardinality SetSize
4.   for every possible permutation of the states in H
5.     if the permutation reduces MA(T), then accept it and break;
6. if a permutation has been accepted, then SetSize = 2 and goto step 3;
7. SetSize = SetSize + 1;
8.
if SetSize > MaxSetSize then exit;
9. goto step 3;

A practical example of Markov source modeling is the characteristic class representing a sequential trace. We define a sequential Markov source as a source that generates a trace in which each sourceword is either equal to the previous sourceword or to the previous sourceword incremented by one. For a sequential trace, the best transition balancing is achieved using balanced gray codes [18]. Balanced gray codes are gray codes that result in an almost equal number of transitions on all lines of the trace if the input trace is sequential. By "almost equal," we mean that the difference between the transition counts of any two lines is not greater than two. If N is a power of two, it is possible to construct a fully balanced gray code, i.e., one that results in exactly the same number of transitions on all bit lines. Because of this, fully balanced gray codes are more advantageous. Balanced gray codes, which in general are not inter-combinational functions, are the best solutions for a sequential Markov source [18]. Instruction address traces fit very well into the class of sequential traces. Here, we model instruction and data addresses as a lag-one Markov source. As will be seen in the experimental results section, a 6-bit balanced gray code is a good practical solution for instruction address buses.

Before starting the next section, let us take note of the relationships among the different characteristic classes that we have considered so far. We say characteristic c is reduced to characteristic c' if every T in c is also a T in c', and we write c <= c'. If characteristic c' is enough to determine balanceability under a set of functions, then c will also be enough: we can simply map any c to the corresponding c' and do the check over characteristic c'.
Now consider the four different characteristic classes that we have examined so far. We have: {R(I,S): Markov source} <= {N_i: number of sourcewords}, and {R(I,S)} <= {L_i: number of inter-sourcewords} <= {TLA[i]: bit transitions}. We will use these relationships when analyzing sequential functions in the next section.

6.5 CODING WITH SEQUENTIAL FUNCTIONS

In this section, we first examine a special category of sequential circuits that we refer to as inter-sequential, and we then study the general class of sequential circuits. We will see that sequential circuits are the most effective functions for solving the BTB problem, because they can balance classes with the least given information, namely the characteristic classes characterized by bit transitions.

6.5.1 INTER-SEQUENTIAL FUNCTIONS

In inter-sequential encoders, registers and some XOR gates are used to generate the inter-sourcewords; a combinational function is then applied to the inter-trace to generate the output inter-trace. Finally, the output trace is recovered from its inter-trace by again using XOR gates and registers.

Definition 6-13 Inter-sequential function: We define an inter-sequential function mapping G on T as G(T), where B(G(T)) = F(B(T)) (F is a combinational function) and G(X_1) = X_1 (X_1 is the first member of trace T). Here B(T) denotes the inter-trace of T. In addition, note that this equation only determines the inter-trace of G(T); the actual output trace depends on how we define the mapping of the first sourceword. The function is reversible if G^{-1}(G(T)) = T, and this happens if and only if F is a reversible combinational function. For inter-sequential functions, we have F(X_1 xor X_2) = G(X_1) xor G(X_2), an equation similar to the one we had for inter-combinational functions (refer to Definition 6-8).
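Definition 6-13 translates into a simple register-and-XOR structure; a behavioral sketch (names ours) of an inter-sequential encoder/decoder pair built around a combinational map f on inter-sourcewords:

```python
def inter_seq_encode(trace, f):
    """Form each inter-sourceword, map it through f, re-accumulate by XOR."""
    if not trace:
        return []
    out = [trace[0]]                       # G(X1) = X1
    prev_in, prev_out = trace[0], trace[0]
    for x in trace[1:]:
        y = prev_out ^ f[x ^ prev_in]      # output inter-word is f(u)
        out.append(y)
        prev_in, prev_out = x, y
    return out

def inter_seq_decode(coded, f_inv):
    """Inverse operation; works whenever f is a reversible map."""
    if not coded:
        return []
    out = [coded[0]]
    prev_in, prev_out = coded[0], coded[0]
    for y in coded[1:]:
        x = prev_in ^ f_inv[y ^ prev_out]
        out.append(x)
        prev_in, prev_out = x, y
    return out

# Unlike a combinational encoder, an inter-sequential one may move the
# zero inter-word: here f swaps inter-words 00 and 11.
f = (3, 1, 2, 0)                           # an involution, so f_inv = f
t = [0, 3, 0, 3, 0]                        # both lines toggle every step
enc = inter_seq_encode(t, f)               # becomes a constant trace
dec = inter_seq_decode(enc, f)
```

Every step of t has inter-word 11, which f maps to 00, so the encoded trace is constant (zero transitions on both lines) and the decoder recovers t exactly; this is the extra freedom that Lemma 6-6 below exploits.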
Inter-sequential functions are important for two reasons. First, their output depends only on the current and the previous sourceword, i.e., Xi and Xi-1; therefore they need to store only the previous sourceword, which means less overhead compared to a general sequential encoder. Second, it is interesting to analyze these encoders because of the similarities they have to inter-combinational circuits.

Lemma 6-5. A characteristic class defined by a given set of total line activities {TLA[i]} is unbalanceable under any reversible inter-sequential function. More precisely, for every reversible inter-sequential function G, a trace may be found such that MA(G(T)) is higher than MA(T): ∀c = {TLA[i]}, ∀G, ∃T ∈ CL(c) : MA(G(T)) > MA(T).

Proof. Suppose TLA[j] is the maximum number of bit transitions. Consider an input trace T composed of TLA[j]+1 sourcewords that have transitions only in the j-th position. Then every inter-sourceword of T is all zeros except in the j-th position (a one-hot code). For any function that reduces the maximum transition count of this trace, this one-hot code must be mapped to zero; otherwise the same number of transitions will occur on some other line of the output trace. Consequently, the inter-sourceword zero must be mapped to some other inter-sourceword that has at least one 1 in its binary representation. Now, the last sourceword of T can be appended to T again without changing its bit transitions; in other words, inter-sourceword zero can be inserted into any trace without violating the bit-transition constraints. However, each inserted zero inter-sourceword is mapped to a non-zero inter-sourceword, and this will cause
a transition on some bit of the output trace. By repeating this insertion, the function is eventually overwhelmed, and the proof is complete. ■

Lemma 6-6. A characteristic class defined by a given set of numbers of inter-sourcewords CL({Li}) is regular under all reversible inter-sequential functions.

Proof. A class CL({Li}) is converted to a class CL({Li'}) under an inter-sequential function, and both classes are uniform. Therefore, any characteristic class CL({Li}) is regular under all reversible inter-sequential functions. ■

We examine the case N = 2, just as we did for inter-combinational functions. Suppose that the given information is the same as that provided in Table 6-4. Each function corresponds to a permutation of inter-sourcewords. For trace T, MA(T) is calculated to be Max{L1+L3, L2+L3}. Suppose that the characteristic class is mapped to CL({Li'}) under G. For output traces, MA(G(T)) is similarly calculated to be Max{L1'+L3', L2'+L3'}. To obtain a lower MA(G(T)), the function should map L3' to the minimum and L0' to the maximum of the Li's. Clearly, inter-sequential functions are more effective in decreasing MA(T) than the inter-combinational subset of combinational functions.

For inter-sequential circuits there is no need to go further than this and model the trace with a Markov source. This is because {R(I,S)} < {Li}, and balanceability is already determined at this level of information.

6.5.2 GENERAL SEQUENTIAL FUNCTIONS

General sequential functions are the most effective functions for balancing the transitions of a trace. They also carry the maximum complexity compared with the functions of the previous sections.

Definition 6-14 Sequential function: We define a sequential function mapping H on T as H(T), where H(Xi) = F(Xi, Xi-1, ..., X1). The function is reversible if H⁻¹(H(T)) = T. Index i does not have to be a finite number.

Lemma 6-7.
A characteristic class defined by a given set of total line activities {TLA[i]} is balanceable exactly when the difference between the maximum and the minimum total activities is two or more: ∀c = {TLA[i]}, given Max{TLA[i]} − Min{TLA[i]} ≥ 2, ∃F, ∀T ∈ CL(c) : MA(F(T)) < MA(T).

Proof. First we prove an interesting property of any sequential function that reduces MA(T). Suppose the bit transitions are given; for any given sequential function, we want to build a trace with those bit transitions that overwhelms the function. By overwhelming a function we mean demonstrating that the function has MA(H(T)) equal to or higher than MA(T). We first claim that if a sequential function H at any point in time maps a non-zero inter-sourceword to a zero inter-sourceword under F, that function can be overwhelmed very easily. Since a transition in the input has been translated into zero transitions in the output, a zero inter-sourceword in the input must lead to at least one transition in the output. Therefore, whenever during the construction of the trace the sequential machine would generate no transition for the non-zero inter-sourceword we are about to insert, we insert a zero inter-sourceword in the input trace instead. This causes at least one transition in the output. If this continues indefinitely, the function is sooner or later overwhelmed, because we keep inserting transitions into H(T) without inserting transitions into T. Now suppose we build our input trace such that in each step only one transition happens. Based on the above, we know that the sequential function generates at least one transition in the output for each transition in the input.
Now we argue that it is enough to consider only those functions that always have an equal number of transitions in the input and in the output, i.e., functions that distribute transitions over different lines; in other words, we have shown that no other function can do better than these functions. The general block diagram of these transition-distributing functions is shown in Figure 6-2. The permutation function is not a fixed permutation and may change with time; this is one of the main differences from inter-sequential circuits.

Figure 6-2 Model for transition-routing sequential functions: generate the inter-sourcewords, permute their bits, and regenerate the output trace from the permuted inter-sourcewords.

A sequential function used to evenly distribute transitions over the bit lines resembles a complex routing network that can route the transitions of each line to any other line. Evidently, the routing configuration must change as time progresses in order to achieve a uniform distribution of bit-level switching activities. It is important to note, however, that this reconfiguration is based only on the sourcewords that have already been conveyed to the receiver and on the knowledge of the bit-level activities of the different lines. Now, based on these properties of sequential functions, we state that a suitable function may not exist for every given set of bit-level transition counts. For instance, given bit transitions {TLA[1]=4, TLA[2]=3, TLA[3]=3}, it is easy to verify that no function can reduce MA(T): since transitions are only distributed (not suppressed) among the lines, no distribution can decrease MA(T) in this case. Therefore, we assume that the average of the TLA[i]'s is at least one unit less than MA(T) in the original trace.
If only one line has the maximum activity, this translates into a difference of at least two between the maximum and the minimum of the bit transitions. Without loss of generality, we assume that only one line in T has the maximum activity. The algorithm for controlling the routing network is as follows; in each step, based on the transitions that happen, either the suitable function is found or the problem reduces to another balanceable problem. Assume the line with the maximum activity is Lmax and the line with the minimum activity is Lmin.

1. If Lmax makes a transition and Lmin does not, swap Lmin and Lmax and set the encoding to H(X) = X for the remainder of the trace. For the output trace, MA(H(T)) will be less than that of the original trace.

2. If Lmin makes a transition, a difference of at least two remains between the maximum and the minimum transition counts. Repeat the algorithm.

3. If neither makes a transition, do not change the routing. Repeat the algorithm.

It is now easy to verify that MA(T) is reduced by at least one using this algorithm, and the proof is complete. ■

Sequential functions are much more effective for implementing bit-balancing encoders than combinational functions; they can solve the BTB problem in cases where combinational and inter-sequential functions utterly fail. The algorithm presented in the proof above reduces MA(T) by only one; by cascading such blocks, the maximum transition count can be reduced as much as is possible for the given characteristic. However, the above solution only demonstrates the potential of sequential circuits.
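The transition-routing idea behind this proof can be simulated in a few lines of Python. The sketch below is a heuristic of our own, not the exact algorithm above: each inter-sourceword has the bits of the currently most- and least-active output wires swapped, and because the permutation is recomputed from the bus history that both sides observe, the decoder can mirror it exactly:

```python
# Heuristic sketch of a transition-routing sequential encoder: swap the
# transition bits of the most- and least-active wires, choosing the swap
# from counters both encoder and decoder can rebuild from the bus itself.
# Names and the swap policy are ours, for illustration only.

def swap_bits(v, i, j):
    if ((v >> i) ^ (v >> j)) & 1:
        v ^= (1 << i) | (1 << j)
    return v

def route(trace, n, decode=False):
    counts = [0] * n                      # per-wire activity seen on the bus
    out = [trace[0]]
    for prev, cur in zip(trace, trace[1:]):
        jmax = counts.index(max(counts))
        jmin = counts.index(min(counts))
        d = prev ^ cur
        if decode:                        # bus delta: un-swap, then accumulate
            out.append(out[-1] ^ swap_bits(d, jmax, jmin))
            bus_d = d
        else:                             # source delta: swap, then signal
            bus_d = swap_bits(d, jmax, jmin)
            out.append(out[-1] ^ bus_d)
        for i in range(n):
            counts[i] += (bus_d >> i) & 1
    return out

# A worst case: all 12 transitions on line 0 of a 3-bit bus.
T = [k % 2 for k in range(13)]
bus = route(T, 3)
assert route(bus, 3, decode=True) == T    # the swap is its own inverse
```

Here line 0's twelve transitions end up spread as four per wire, so MA drops from 12 to 4 while the receiver recovers the trace exactly.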
Each of these blocks is quite costly, because keeping track of the transitions on each line and identifying the maximum and the minimum is very expensive in terms of hardware resources. The superiority of sequential circuits over combinational functions comes at the price of increased overhead; such solutions are complex compared to combinational ones. Therefore, simpler sequential functions may be used instead that balance transitions heuristically. A straightforward example is a sequential function that only swaps two lines: instead of a complete N-bit to N-bit routing network, two multiplexers are used to swap two lines. The swapping should be done so as to make the transition counts of the two lines almost equal. Hence, if there is a large difference between the transition counts of two lines, the maximum transition count can be prudently decreased.

The first proposed block is called the interchange block, shown in Figure 6-3. It swaps two lines only when the values of the two lines are equal, i.e., both zero or both one; this is because we do not want to add any extra transitions to the original trace. For this scheme to work properly, the two lines should be uncorrelated; otherwise, it can easily be shown that the encoder may become ineffective in some cases (as an example, consider the trace <00, 01, 00, 00, 01, 00, 00, ...>). The interchange block is a fast solution when a vulnerable line is neighbored by a sturdy line from which it can get help conveying the information while reducing its exposure to degradation.

Figure 6-3 Interchange block: lines 1 and 2 are routed to outputs 1 and 2 through control logic driven by a transition counter.

It is possible to use several interchange blocks, or progressive levels of them, to achieve better results.
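A behavioral sketch of the interchange block (our own Python rendering; the threshold of two and the counter placement are assumptions) exhibits both key properties: the routing state toggles only when the two values are equal, so no transition is ever added, and the receiver can mirror the state because it observes the same bus activity:

```python
# Behavioral sketch of a two-line interchange block. `decode=True` runs the
# mirrored receiver. The threshold and policy are illustrative assumptions.

def interchange(stream, decode=False):
    swapped = False
    counts = [0, 0]                     # transition counters on the bus wires
    out, prev = [], None
    for a, b in stream:
        x = (b, a) if swapped else (a, b)    # route through the two muxes
        y = (a, b) if decode else x          # values actually on the bus
        if prev is not None:
            counts[0] += prev[0] ^ y[0]
            counts[1] += prev[1] ^ y[1]
        prev = y
        out.append(x)
        # Reconfigure only when both lines carry equal values, so the swap
        # itself never adds a transition; both sides see the same condition.
        if a == b and abs(counts[0] - counts[1]) >= 2:
            swapped = not swapped
    return out

src = [(t % 2, 0) for t in range(41)]   # line 0 carries all 40 transitions
bus = interchange(src)
assert interchange(bus, decode=True) == src
```

Here line 0's forty transitions end up split almost evenly across the two wires, matching the "reduce MA(T) by half" behavior reported for Interchange-1 in the experimental results.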
In such cases, to make the scheme even simpler and decrease the overhead of the controlling logic, we can modify the interchange block and employ a global decision maker for all blocks, whose job is to swap the lines in all interchange blocks after every K clocks. This approach may marginally increase the total number of transitions, but it is almost as effective as before in reducing the maximum activity. We call this technique the global-interchange solution.

In practice, it is sometimes very efficient to use functions that map a non-zero inter-sourceword to the zero inter-sourceword at the output. This may cancel a large number of transitions; for example, it is advantageous if the coding is devised to suppress transitions for sequential addresses in a trace of instruction addresses. This is a common trick in encodings that aim to reduce the maximum transition count of a trace [30]. Other configurations are also possible. One effective solution is to extract the inter-sourceword and rotate it by I bits in each clock cycle, where I is incremented by one for the next cycle (I is calculated mod N); the 1's in the inter-sourcewords are thereby distributed over different lines. Figure 6-4 illustrates this scheme. Another solution is to send (Xi+1) ⊕ Xi+1 to the rotating network; for sequential traces this has the additional advantage of producing no transitions when values are sequential, so MA(T) can be reduced even more. We call the first approach XOR-Rotate and the second INC-XOR-Rotate. Recall that INC-XOR [53] is an irredundant bus encoding technique that sends (Xt+1) ⊕ Xt+1 on the bus by transition signaling (XORing the value with the previous value on the bus).
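The two rotate-based schemes can be sketched as follows in Python (the per-cycle rotation amount I = t mod N, the bus width, and all names are our assumptions; the second variant uses the INC-XOR code value of [53]):

```python
# Sketch of XOR-Rotate (inc=False) and INC-XOR-Rotate (inc=True) on an
# N-bit bus: form the code value, rotate it by I bits (I incremented mod N
# each cycle), and place it on the bus by transition signaling.
N = 8
MASK = (1 << N) - 1

def rotl(v, r):
    r %= N
    return ((v << r) | (v >> (N - r))) & MASK

def encode(trace, inc=True):
    bus = [trace[0] & MASK]
    for t in range(1, len(trace)):
        base = trace[t - 1] + 1 if inc else trace[t - 1]
        code = (base ^ trace[t]) & MASK        # zero for a sequential step (inc=True)
        bus.append(bus[-1] ^ rotl(code, t))    # rotate, then XOR onto the bus
    return bus

def decode(bus, inc=True):
    rec = [bus[0]]
    for t in range(1, len(bus)):
        code = rotl(bus[t] ^ bus[t - 1], N - t % N)   # inverse rotation
        base = rec[-1] + 1 if inc else rec[-1]
        rec.append((code ^ base) & MASK)
    return rec
```

For a purely sequential trace, INC-XOR-Rotate produces no bus transitions at all, which is why it achieves the lowest Max Transition Ratio among the methods compared in the experiments below.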
Figure 6-4 XOR-Rotate and INC-XOR-Rotate: F(T) is produced by rotating the inter-sourceword by I bits and XORing it onto the bus, with I increased each cycle.

6.6 EXPERIMENTAL RESULTS

In the previous sections we studied the BTB problem under different constraints. In this section we present experimental results of applying different encoding techniques to two kinds of traces: instruction address traces and data address traces. We applied both combinational and sequential functions to these traces and compared the results. Our methodology is similar to that of the previous chapters, and we report results averaged over six different SPEC2000 benchmarks: vpr, parser, equake, gcc, vortex, and art. Each trace was generated by simulating 10 million instructions using Simplescalar [72]. We report two quantities for each method: 1) Max Transition Ratio, the ratio of the maximum transition count of the bus lines after encoding to that before encoding, i.e., MA(F(T))/MA(T) (refer to Definition 2-16); and 2) Total Transition Ratio, the ratio of the total transition count of the bus after encoding to that before encoding, i.e., TA(F(T))/TA(T) (refer to Definition 2-15).

Of course, Max Transition Ratio is of primary interest in this paper; Total Transition Ratio shows what percentage of the total transitions has been eliminated. The greater this percentage, the lower the probability of odd scenarios (such as the one investigated at the end of Section 6.2) and the greater the energy saving. Table 6-5 and Table 6-6 present the comparison between the different techniques. Instruction address traces are mostly sequential; therefore, we expect the balanced gray code to be the most effective method for reducing the maximum transition counts of these traces.
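The two reported ratios are straightforward to compute from a pair of traces; a sketch with names of our choosing, demonstrated here with the plain reflected Gray code on a 4-bit sequential trace:

```python
# Sketch: Max Transition Ratio MA(F(T))/MA(T) and Total Transition Ratio
# TA(F(T))/TA(T) for an original and an encoded trace of n-bit words.

def transition_ratios(original, encoded, n):
    def tla(trace):
        return [sum(((a ^ b) >> i) & 1 for a, b in zip(trace, trace[1:]))
                for i in range(n)]
    t_in, t_out = tla(original), tla(encoded)
    return max(t_out) / max(t_in), sum(t_out) / sum(t_in)

T = list(range(16))                        # sequential 4-bit trace
G = [x ^ (x >> 1) for x in T]              # plain reflected Gray code
max_ratio, total_ratio = transition_ratios(T, G, 4)
print(max_ratio, total_ratio)
```

A ratio below one in either column means the encoder improved the trace by that measure.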
Using the balanced gray code, only one transition occurs for each new sequential address, and this transition is distributed over the different bus lines. We tested balanced gray codes for buses of width 3, 4, 5, and 6; the number after the dash in each balanced gray code entry refers to the width of the bus. For ideally sequential symbols, the results should improve as the bus becomes wider. However, as can be seen in the table, this is not the case for instruction addresses: the marginal improvement of the balanced gray code diminishes as a result of non-sequential instructions. The next entries in the table correspond to results obtained by applying the PermuteStates technique to a Markov model extracted from the instruction addresses. Again, the number after the dash shows the width of the bus; e.g., PermuteStates-4 is the result for a 4-bit-wide bus. Based on the results, PermuteStates performs better than the balanced gray code when the bus is 5 or 6 bits wide. We used MaxSetSize equal to 4. We also experimented with Interchange, the sequential encoder presented in the previous section. We report results for three configurations of the interchange block. For two lines, if one line is more active than the other, interchange distributes the transitions of the more active line onto the other line. We used configurations with multiple levels of interchange; the number after the dash in each interchange entry represents the number of levels used. As expected, Interchange-1 is capable of reducing MA(T) by half. This is because the most active line of the bus is grouped with a line of almost zero activity; therefore the maximum transition count after encoding will be their total transition count divided by two.
The reported results are for the best possible configuration of grouping two bits, meaning that at each level the line with the highest activity is grouped with the line with the lowest activity, and so on. Finally, we report the results for the XOR-Rotate and INC-XOR-Rotate methods. As mentioned earlier, these methods require rotation networks that can perform an arbitrary amount of rotation in each clock; their superb results come at the expense of extremely complex logic networks for the required arithmetic and rotation operations. As can be seen, the performance of INC-XOR-Rotate is superior to that of XOR-Rotate, because INC-XOR-Rotate eliminates many transitions by exploiting the sequentiality of instruction addresses.

Table 6-5 Comparison of different methods applied to instruction addresses.

Method | Max Trans. Ratio | Total Trans. Ratio
Bal. Gray-3 | 0.538 | 0.704
Bal. Gray-4 | 0.311 | 0.654
Bal. Gray-5 | 0.314 | 0.633
Bal. Gray-6 | 0.235 | 0.603
PermuteStates-4 | 0.351 | 0.708
PermuteStates-5 | 0.265 | 0.621
PermuteStates-6 | 0.231 | 0.567
Interchange-1 | 0.508 | 1
Interchange-2 | 0.302 | 1
Interchange-3 | 0.203 | 1
XOR-Rotate | 0.084 | 1
INC-XOR-Rotate | 0.017 | 0.199

In practice, the target bus may not be sequential, in which case methods like the balanced gray code are no longer applicable. A simple example is a data address bus; Table 6-6 shows the results for such a bus, with notation and conventions similar to those of Table 6-5. It can be observed that the balanced gray code performs poorly. Moreover, because data address bus transitions are originally much more balanced than those of instruction address buses, even Interchange or XOR-Rotate is not a successful solution, and these are mostly not reported in the table.
The only effective solution that we could find in this case is the PermuteStates encoder; we report results for three different bus widths.

Table 6-6 Comparison of different methods applied to data addresses.

Method | Max Trans. Ratio | Total Trans. Ratio
Bal. Gray-4 | 0.823 | 0.946
Bal. Gray-5 | 0.981 | 1.021
Bal. Gray-6 | 0.960 | 0.982
PermuteStates-4 | 0.586 | 0.822
PermuteStates-5 | 0.523 | 0.823
PermuteStates-6 | 0.467 | 0.775
Interchange-1 | 1.043 | 1

Let us summarize the results of the two tables. For instruction address buses of width up to 6, the best result is obtained with three levels of interchange, a 79.7% reduction in maximum transitions, whereas the PermuteStates heuristic achieves a 76.9% reduction. PermuteStates is pure combinational logic and needs much less overhead than three levels of the interchange block. We do not count the results of XOR-Rotate and INC-XOR-Rotate because of their infeasibility. For data addresses, the only really effective method is the PermuteStates technique: it achieves a 53.3% reduction in maximum transitions, and none of the other techniques comes close. All techniques other than Interchange also bring a good reduction in total transitions which, as explained earlier in the paper, certifies the validity of the estimations to a certain level.

6.7 CONCLUSIONS

In this paper we thoroughly investigated the problem of reducing the maximum transition count over a group of lines. We examined the problem under various levels of available information about the trace, applying different kinds of functions: combinational, inter-sequential, and sequential logic. We were able to solve the problem exactly in many cases.
We presented polynomial-time solutions where the exact solution leads to an infeasible algorithm. We presented experimental results using instruction and data address buses, which are good examples of typical buses that may be vulnerable to hot-carrier degradation. Our experimental results also show the effectiveness of the Markov-source heuristic for instruction address and data address traces. The actual selection of a technique depends highly on the characteristics of the trace and on other constraints in the system.

CONCLUSIONS AND FUTURE DIRECTIONS

This thesis provides various solutions to the growing concerns in realizing highly complex digital systems: the power consumption and the reliability of communication between different blocks of the system. The proposed techniques are a set of high-level solutions, such as encoding techniques, that are effective for reducing switching activity, increasing reliability, and compacting data. Various lightweight encoding techniques, such as T0-C and Offset-XOR-SM, were proposed that are suitable for reducing the activity of traces with spatial locality; it was shown that a significant portion of the activity on typical instruction and data address buses can be eliminated using these techniques. Another set of low-power bus encoding techniques, the ALBORZ code, was introduced; it uses a codebook for encoding and decoding. The codebook structure depends highly on the context of the application and should be carefully selected by the designer to achieve maximum performance. After that, we proposed the sector-based encoding techniques, which are very effective in exploiting the spatio-temporal locality of various kinds of traces. Sector-based encoding has a fixed and an adaptive version.
The fixed version is more suitable for hardware implementation, while the dynamic version is better suited for software implementation. The notion of instruction-set-aware memory was proposed: memories that are capable of predicting the addresses of instructions and data. Using these memories leads to a reduction of traffic over the memory bus that can be targeted for power reduction or performance improvement. Pattern-sensitive encoding was proposed, along with a specific method for assessing the sensitivity of patterns with respect to crosstalk errors. Pattern-sensitive encoding selects the optimal encoding technique based on the sensitivity of the patterns; the different encoding techniques are selected in such a way that consistency between encoder and decoder is guaranteed with zero communication overhead. Finally, bit transition balancing was carefully formulated for enhancing the reliability of transistors with respect to hot-carrier degradation. The lifetime of the bus is increased by increasing the lifetime of its most vulnerable driver, i.e., the one with the highest activity. The problem was systematically solved by first characterizing the data that appears on the bus, and then finding combinational and/or sequential encoding functions that efficiently solve the problem. As technology moves on, the criticality of communication versus computation becomes more prevalent. This will naturally push design methodology toward the application of more complex hardware blocks for making communication low power and reliable. Many of our proposed techniques can be readily implemented in current digital systems; some might become more feasible in future technologies. The overall performance of these techniques will depend highly on the details of the hardware implementation.
BIBLIOGRAPHY

1. A. Abdollahi, F. Fallah, M. Pedram, "Runtime Mechanisms for Leakage Current Reduction in CMOS VLSI Circuits," Proceedings of Int'l Symposium on Low Power Electronics and Design, pp. 213-218, 2002.

2. Y. Aghaghiri, F. Fallah, M. Pedram, "Reducing Transitions on Memory Buses Using Sector-Based Encoding Technique," Proceedings of Int'l Symposium on Low Power Electronics and Design, pp. 190-195, 2002.

3. Y. Aghaghiri, F. Fallah, M. Pedram, "Irredundant Address Bus Encoding for Low Power," Proceedings of Int'l Symposium on Low Power Electronics and Design, pp. 182-187, 2001.

4. Y. Aghaghiri, F. Fallah, M. Pedram, "A Class of Irredundant Encoding Techniques for Reducing Bus Power," Journal of Circuits, Systems and Computers, World Scientific Publishing Company, Vol. 11, No. 5, 2002, pp. 445-457.

5. Y. Aghaghiri, F. Fallah, M. Pedram, "Transition Reduction in Memory Buses Using Sector-based Encoding Techniques," IEEE Transactions on Computer Aided Design, Vol. 23, No. 8, Aug. 2004, pp. 1164-1174.

6. Y. Aghaghiri, F. Fallah, M. Pedram, "BEAM: Bus Encoding Based on Instruction-Set-Aware Memories," Proceedings of Asia and South Pacific Design Automation Conference, pp. 3-8, 2003.

7. Y. Aghaghiri, M. Pedram, "Data Compaction with Dynamic Multiple-sector Encoding," submitted to ACM Transactions on Design Automation of Electronic Systems, 2004.

8. Y. Aghaghiri, M. Pedram, "Combating Hot Carrier Effects via Bit-level Transition Balancing," submitted to IEEE Transactions on Very Large Scale Integration Systems, 2004.

9. Y. Aghaghiri, M. Pedram, "Reliable and Power-efficient Encoding Techniques for SoC Communication," submitted to Design Automation Conference, 2005.

10. Y. Aghaghiri, F. Fallah, M.
Pedram, "ALBORZ: Address Level Bus Power Optimization," Proceedings of Int'l Symposium on Quality Electronic Design, pp. 470-475, 2002.

11. E. A. Amerasekera and F. N. Najm, Failure Mechanisms in Semiconductor Devices, Wiley & Sons, 1998.

12. L. Benini, G. De Micheli, E. Macii, M. Poncino, and S. Quer, "System-Level Power Optimization of Special Purpose Applications: The Beach Solution," Proceedings of Int'l Symposium on Low Power Electronics and Design, pp. 24-29, Aug. 1997.

13. L. Benini, G. De Micheli, E. Macii, D. Sciuto, and C. Silvano, "Address Bus Encoding Techniques for System-Level Power Optimization," Proceedings of Design Automation and Test in Europe, pp. 861-866, 1998.

14. L. Benini, G. De Micheli, E. Macii, D. Sciuto, C. Silvano, "Asymptotic Zero-Transition Activity Encoding for Address Buses in Low-Power Microprocessor-Based Systems," Proceedings of 7th Great Lakes Symposium on VLSI, pp. 77-82, Mar. 1997.

15. L. Benini, A. Macii, E. Macii, M. Poncino, R. Scarsi, "Synthesis of Low-Overhead Interfaces for Power-Efficient Communication over Wide Busses," Proceedings of Design Automation Conference, pp. 128-133, 1999.

16. L. Benini, G. De Micheli, "Networks on Chips: A New SoC Paradigm," IEEE Computer, Vol. 35, No. 1, pp. 70-78, Jan. 2002.

17. D. Bertozzi, L. Benini, G. De Micheli, "Low Power Error Resilient Encoding for On-chip Data Buses," Proceedings of Design Automation and Test in Europe, Mar. 2002.

18. G. S. Bhat, C. D. Savage, "Balanced Gray Codes," Electronic Journal of Combinatorics, Vol. 3, No. 1, R25, 1996.

19. A. Chandrakasan, S. Sheng, R. Brodersen, "Low Power CMOS Digital Design," IEEE Journal of Solid State Circuits, Vol. SC-27, No. 4, pp. 1082-1087, April 1992.

20. C. Chang, K. Wang, M.
Marek-Sadowska, "Layout-Driven Hot-Carrier Degradation Minimization Using Logic Restructuring Techniques," Proceedings of Design Automation Conference, 2001.

21. N. Chang, K. Kim, J. Cho, "Bus Encoding for Low-Power High-Performance Memory Systems," Proceedings of Design Automation Conference, Jun. 2000.

22. P. Chang, E. Hao, Y. N. Patt, "Target Prediction for Indirect Jumps," Proceedings of Int'l Symposium on Computer Architecture, pp. 274-283, Jun. 1997.

23. Z. Chen, I. Koren, "Technology Mapping for Hot-Carrier Reliability Enhancement," Proceedings of International Society for Optical Engineering, Vol. 3216, pp. 42-50, 1997.

24. W. C. Cheng, M. Pedram, "Low Power Techniques for Address Encoding and Memory Allocation," Proceedings of Asia and South Pacific Design Automation Conference, pp. 245-250, 2001.

25. W. C. Cheng, M. Pedram, "Chromatic Encoding: a Low Power Encoding Technique for Digital Visual Interface," Proceedings of Design, Automation and Test in Europe, pp. 694-699, 2003.

26. J. Cong, Y. Fan, G. Han, X. Yang, Z. Zhang, "Architecture and Synthesis for Multi-Cycle On-Chip Communication," Proceedings of Int'l Symposium on System Synthesis, pp. 77-78, 2003.

27. M. Cuviello, S. Dey, X. Bai, Y. Zhao, "Fault Modeling and Simulation for Crosstalk in System-on-Chip Interconnects," Proceedings of Int'l Conference on Computer Aided Design, pp. 297-303, 1999.

28. A. Dasgupta, R. Karri, "Electromigration Reliability Enhancement Via Bus Activity Distribution," Proceedings of Design Automation Conference, pp. 353-356, 1996.

29. A. Dasgupta, R. Karri, "Hot-Carrier Reliability Enhancement via Input Reordering and Transistor Sizing," Proceedings of Design Automation Conference, pp. 819-824, 1996.

30. W. Fornaciari, M. Polentarutti, D. Sciuto, and C.
Silvano, "Power Optimization of System-Level Address Buses Based on Software Profiling," Proceedings of Int'l Conf. on Hardware/Software Codesign, pp. 29-33, 2000.

31. J. Hennessy, D. Patterson, Computer Architecture: A Quantitative Approach, 2nd edition, Morgan Kaufmann Publishers, 1996.

32. M. Ikeda, K. Asada, "Bus Data Coding with Zero Suppression for Low Power Chip Interfaces," Int'l Workshop on Logic and Architecture Synthesis, pp. 267-274, Dec. 1996.

33. S. Iman, M. Pedram, "POSE: Power Optimization and Synthesis Environment," Proceedings of Design Automation Conference, pp. 21-26, 1996.

34. JEITA, Standard of Japan Electronics and Information Technology Industries Association, "Failure Mechanism Driven Reliability Test Methods for LSIs (Amendment 1)," Oct. 2001.

35. K. C. Kapur, L. R. Lamberson, Reliability in Engineering Design, John Wiley & Sons, 1997.

36. K. Kim, K. Baek, N. Shanbhag, C. L. Liu, S. Kang, "Coupling-Driven Signal Encoding Scheme for Low-Power Interface Design," Proceedings of Int'l Conference on Computer Aided Design, pp. 318-321, 2000.

37. T. Kogel, M. Doerper, A. Wieferink, S. Goossens, "A Modular Simulation Framework for Architectural Exploration of On-Chip Interconnection Networks," Proceedings of Int'l Symposium on System Synthesis, pp. 7-12, 2003.

38. S. Komatsu, M. Ikeda, K. Asada, "Low Power Chip Interface based on Bus Data Encoding with Adaptive Code-book Method," Proceedings of Ninth Great Lakes Symposium on VLSI, pp. 368-371, 1999.

39. Y. Leblebici, S. M. Kang, Hot Carrier Reliability of MOS VLSI Circuits, Kluwer Academic Publishers, 1993.

40. D. Li, A. Pua, P. Srivastava, U. Ko, "A Repeater Optimization Methodology for Deep Sub-Micron, High-Performance Processors," Proceedings of Int'l Conference on Computer Design, VLSI in Computers and Processors, pp. 726-731, 1997.

41. P. C. Li and I.
Hajj, “Computer Aided Redesign of VLSI Circuits for Hot- Carrier Reliability,” Proc. o f International Conference on Computer Design, 1993. 42. L. Lin, N. Vijaykrishnan, M. Kandemir, M. J. Irwin, “Adaptive Error Protection for Energy Efficiency,” Proceedings o f Int 7 Conference on Computer Aided Design, pp. 2-7, 2003. 43. S. Lin, D. J. Costello, “Error Control Coding: Fundamentals and Applications,” Prentice Hall, 1993. 44. L. Macchiarulo, E. Macii, M. Poncino, “Low-energy for Deep-submicron Address Buses”, Proceedings o f International Symposium on Low Power Electronics and Design, pp.176-181, Aug. 2001. 45. M. Mamidipaka, D. Hirschberg, N. Dutt, “ Low Power Address Encoding using Self-Organizing Lists”, Proceedings o f Int’ l Symposium on Low Power Electronics and Design, Aug 2001. 46. Mahesh N. Mamidipaka , Daniel S. Hirschberg , Nikil D. Dutt, “Adaptive low-power address encoding techniques using self-organizing lists,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol.l 1 No.5, pp. 827-834, October 2003. 47. R. Murgai, M. Fujita, A. Oliveria, “Using Complemnetation and resequencing to minimize transitions,” Proceedings o f Design Automation Conf, pp. 694-697, 1998. 48. E. M usoll, T. Lang, J. Cortadella, ’’ Exploiting the locality of memory references to reduce the address bus energy,” Proceedings o f Int 7 Symposium on Low Power Electronics and Design, pp.202-207, 1997. 49. M. Nourani, A. Attarha, “Signal Integrity: Fault Modeling and Testing in High-Speed SoCs”, Journal o f Electronic Testing, pp. 175-190, 2002. 50. P. Panda, N. Dutt, “ Reducing Address Bus Transitions for Low Power Memory Mapping”, proceedings o f Design Automation and Test in Europe, pp. 63-67, March 1996. 246 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 51.8. V. Pless, W.C. Huffman, R.Brualdi, W. Huffman, “Handbook of Coding Theory,” North-Holland, 1998. 52. S. Iman, M. 
Pedram, “POSE: Power optimization and synthesis environment,” Proceedings of Design Automation Conference, pp. 21-26, 1996. 53. S. Ramprasad, N. Shanbhag, I. Hajj, “ A Coding Framework for Low Power Address and Data Busses”, IEEE Transactions on Very Large Scale Integration Systems, Vol. 7, No. 2, pp. 212-221, 1999. 54. S. Ramprasad, N. R. Shanbhag, I. N. Hajj, “ Sigal Coding for Low Power: Fundamental Limits and Practical Realizations”, IEEE Transactions on Circuits and Systems, II, Vol. 46, No. 7, pp. 923-929, Jul. 1999. 55. K. Roy and S. Prasad, ” Logic Synthesis for Reliability - An Early Start to Controlling Electromigration and Hot Carrier Effects,” Proceedings o f Design Automation and Test in Europe, pp. 136-141, 1994. 56. Y. Shin, S. I. Chae, K. Choi, “Partial Bus-Invert Coding for Power Optimization of System Level Bus,” Proceedings o f Int’ l Symposium on Low Power Electronics and Design, pp. 127-129, 1998. 57. J.E. Smith, “ A Study of Branch Prediction Strategies”, Proceedings of Int’ l Symposium on Computer Architecture, pp. 135-148, May 1981. 58. P. P. Sotiriadis, A. Chandrakasan, “ Bus Energy Minimization by Transition Pattern Coding (TPC) in Deep Submicron Technologies,” Proc. International Conference on Computer Aided Design, pp. 317-321, Nov. 2000. 59. M. R. Stan, W. P. Burleson, “Bus-Invert Coding for low-Power I/O,” IEEE Transactions on Very Large Scale Integration Systems, Vol.3, No. 1, pp. 49-58, Mar. 1995. 60. M. R. Stan, W. P. Burleson, “Two-Dimensional Codes for Low Power,” Proceedings of Int’l Symposium on Low Power Electronics and Design, pp. 335- 340, 1996. 61. C. L. Su, C. Y. Tsui, A. M. Despain, “Saving Power in the Control Path of Embedded Processors,” IEEE Transactions on Design and Test o f Computers, Vol. 11, No.4, pp. 24-30, 1994 247 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 62. C. 
Svensson, “Optimum Voltage Swing on On-Chip and Off-chip Interconnect,” IEEE journal o f Solid-State Circuits, Vol. 36, No. 7, pp. 1108- 1112, Jul 2001. 63. V. Veeramachaneni, A. Tyagi, S. Rajgopal, “ Re-encoding for Low Power State Assignment of FSMs,” Intl. Symposium on Low Power Electronics and Design, pp. 173-178, 1995. 64. B.Victor, K. Keutzer, “Bus Encoding to Prevent Crosstalk Delay,” Int’ l Conference on Computer Aided Design, pp. 9-14, 2001. 65. F. Worm, P. Ienne, P. Thiran, G. De Michelli, “An Adaptive Low-power Transmission Schem for On-chip Networks,” Proceedings o f Int j Symposium On System Synthesis, pp. 92-100, Oct. 2002. 66. H. Yonezawa, J. Fang, Y. Kawakami, N. Iwanishi, L. Wu, A. Chen, N. Koike, P. Chen, C. Yeh and Z. Liu, “Ratio Based Hot-Carrier Degradation Modeling for Aged Timing Simulation of Millions of Transistors Digital Circuits, IEEE Int’ l Electron Devices Meeting Technical Digest, pp. 93-96, 1998. 67. Y. Shin, S. I. Chae, K. Choi,” Partial Bus-Invert Coding for Power Optimization of System Level Bus,” Proceedings o f In t’ l symposium on Low Power Electronics and Design, pp. 127-129, 1998. 68. S. Yoo, K. Choi, “Interleaving Partial Bus-Invert Coding for Low Power Reconfiguration of FPGAs,” Proceedings o f Sixth Int’ l Conference on VLSI and CAD, ppl 549-552, 1999. 69. H. Zhang, V. George, J. M. Rabaey, “Low-swing on-chip signaling techniques: Effectiveness and Robustness,” IEEE Transactions on Very Large Scale Integration, Vol. 8, No. 3, Jun. 2000. 70. H. Zimmer, A. Jantsch, “A Fault Model Notation and Error-Control Scheme for Switch-to-Switch Buses in a Network-on-Chip,” Proceedings o f Int 7 conference on Hardware/Software Codesign + Int 7 symposium on System Synthesis, pp. 188-193, 2003. 71. http://www.device.eecs.berkeley.edu/~ptm/introduction.html 72. http://www.simplescalar.com 73. http://sipi.usc.edu 248 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 74. 
http://www.arm.com/products/solutions/AMBAHomePage.html 75. http://www.spec.org 76. http://www.bdd-portal.org/docu/blif/ 77. http://www-cad.eecs.berkeley.edu/Software/software.html 78. http://www.synopsys.com/products/mixedsignaI/hspice/hspice.html Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.