AREA COMPARISONS OF FIFO QUEUES USING SRAM
AND DRAM MEMORY CORES
by
Praveen Krishnanunni
A Thesis Presented to the
FACULTY OF THE SCHOOL OF ENGINEERING
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
MASTER OF SCIENCE
(ELECTRICAL ENGINEERING)
Copyright 2004 Praveen Krishnanunni
ACKNOWLEDGEMENTS
This work has been made possible by a whole lot of wonderful people.
Firstly, I would like to thank my advisor Dr. Alice Parker for all her help and
most importantly for her patience. I have done quite a few directed research
assignments under her and each one of them has been a wonderful learning
experience. Without all of those minor projects, this work would not have
materialized. She gave me ample time to read, research, understand and
implement. I would also like to thank Dr. Peter Beerel and Dr. Timothy
Pinkston for agreeing to be on my committee. Your feedback is valuable to me.
The courses I did under Professor Gandhi Puvvada and Professor Won
Namgoong here at USC have helped me understand important concepts of
Digital Electronics. I would like to specifically thank Professor Puvvada for his
wonderful lectures on FIFO systems.
My project partner and good friend Aniket Kadkol has played a major role in
this thesis. His suggestions and ideas have helped me better the results obtained
from this study. He worked on the performance issues of FIFOs and the
numbers depicting speed measurements presented in this document are adapted
from his findings.
I thank God for His kind blessings and constant help in getting me through
troubled times. Without Him, my life would have been a whole lot different.
My parents have been my best and biggest motivators. They have given me this
opportunity to pursue higher studies at USC. I dedicate this thesis and my
Master's degree to them.
TABLE OF CONTENTS
Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
Chapter 2: Related work
Chapter 3: FIFO Overview
Chapter 4: The flag generation block
Chapter 5: Memory cores
Chapter 6: Architecture of a 16 location 1Kbit packet size FIFO
Chapter 7: Results
Chapter 8: Future work
Bibliography
LIST OF TABLES
Table 1: Areas of 1k packet size FIFOs employing SRAM and DRAM memory cores for varying number of locations
Table 2: Areas of 16 location FIFOs employing SRAM and DRAM memory cores for varying packet sizes
Table 3: Areas of 1k packet size 16 location FIFOs employing SRAM and DRAM memory cores for various process technologies
Table 4: Write-cycle delay of 1k packet size 16 location FIFOs employing SRAM and DRAM memory cores and ring-pointer addressing scheme for varying number of locations
Table 5: Read-cycle delay of 1k packet size 16 location FIFOs employing SRAM and DRAM memory cores and ring-pointer addressing scheme for varying number of locations
Table 6: Areas of the decoder and ring-pointer addressing schemes
LIST OF FIGURES
Figure 1: Block diagram of a basic FIFO setup
Figure 2: Internal structure of a typical FIFO
Figure 3: Block diagram of the flag generator
Figure 4: FIFO timing diagram
Figure 5: Dual-port SRAM cell
Figure 6: Three-port DRAM cell
Figure 7: Binary counter
Figure 8: 4 bit address decoder
Figure 9: Delay vs fanout plot
Figure 10: Sizing of gates in an inverter
Figure 11: Sizing of gates in a NOR gate
Figure 12: Sizing of gates in a NAND gate
Figure 13: Single stage driver
Figure 14: Two stage driver
Figure 15: Three stage driver
Figure 16: Four stage driver
Figure 17: Structure of the decoder for a 16-location 1k packet-size SRAM memory
Figure 18: Ring-pointer
Figure 19: Structure of the ring-pointer for a 16-location 1k packet-size SRAM memory
Figure 20: Binary to gray code conversion circuit
Figure 21: Gray to binary code conversion circuit
Figure 22: XOR gate architecture comparison
Figure 23: One-bit subtractor
Figure 24: Flag generation circuit
Figure 25: Areas of 3-port DRAM and 2-port DRAM
Figure 26: Areas of 3-port DRAM and 2-port SRAM
Figure 27: Area-delay comparisons of 3-port DRAM and 2-port DRAM
Figure 28: Area-delay comparisons of 3-port DRAM and 2-port SRAM
Figure 29: Number of locations vs. Area
Figure 30: Packet size vs. Area
Figure 31: Process technology vs. Area for various processing technologies
Figure 32: Number of locations vs. write-cycle delay using ring-pointer addressing scheme
Figure 33: Number of locations vs. read-cycle delay using ring-pointer addressing scheme
Figure 30: Number of locations vs. Area
ABSTRACT
On-chip networks employ dedicated processing logic which manipulates
incoming high-speed data streams. Data buffers are required to prevent loss of
information due to variations between the input rate and the processing rate.
FIFOs are the most common on-chip data buffers. This study presents an
asynchronous FIFO operating across multiple clock domains for use in an on-
chip network. Determining the condition of the FIFO requires the read pointer
value to be sent to the write clock domain and the write pointer value to be sent
to the read clock domain. Debugging such multi-clock designs is extremely
difficult in synthesized hardware. A simple scheme to determine the ‘Empty’
and ‘Full’ condition of the buffer is presented.
Conventional FIFO buffers employ a dual-port SRAM memory core. High
speed buffering does not allow DRAMs to be a viable option due to the
overhead involved in refreshing the stored value. This work presents a three-
port DRAM as a possible alternative. It requires three transistors per bit as
against eight transistors per bit for the dual-port SRAM. A dedicated refresh
port is used, which greatly reduces the need for the refresh controller to stall a
read or write operation. A simple refresh scheme using minimum logic is also
presented. The advantages and possible disadvantages of this core have been
investigated.
Values for the area and performance of similar sized FIFO blocks employing
SRAM and DRAM memory cores are provided as the result of this study.
Chapter 1
Introduction.
Speed is the biggest buzzword in technology today. High speed network on
chip (NOC) architectures are necessary to speed up data processing. Custom
ASICs are not very viable from the commercial point of view since they are
not easily scalable. Hence, dedicated processors employing NOCs are fast
gaining popularity. One of the biggest challenges in designing hardware for a
NOC is to determine the best possible memory architecture. The performance
of the memory architecture in turn depends primarily on the physical and
circuit structure and secondarily on the address and command protocols.
Data integrity requires the chip to handle traffic ideally at a rate equal to the
input rate. This requires the memory systems to keep up with the processing
speeds. Although the memory core is not directly involved in operations
performed after memory access, it contributes a major part to the operation’s
ability to meet time constraints.
Another challenge facing the memory designer is the random nature of data.
Incoming data packets are not of the same size. Random arrival times and
varying packet sizes reduce the expected performance of the memory core.
Network processors, for example, could be an end user of the NOC that we are
designing. Current technology incorporates up to 64 line cards in a large
multi-rack network-router chassis. Over time, each line card will have a
growing number of lines and chips to handle complex processing
requirements. Line rate quadruples with every generation, which means that
space and power will be as important as speed. However, the fastest memories
are not the cheapest or the smallest. Thus, network processor memories must
be fast, small and consume very little power.
We aimed at minimizing the area of a network processor FIFO buffer while
maintaining a particular read/write speed. The conventional FIFO buffer used
in high-speed applications that employs a dual-port SRAM core is studied
first. The overall area of the FIFO buffer is minimized as much as possible
while maintaining a cycle time of approximately 5 ns. Asynchronous FIFO
buffers operating across multiple clock domains may cause errors in logic.
We have presented a simple flag-generation scheme that ensures correct
operation of any asynchronous FIFO buffer. We then study the possibility of
reducing the area even further by replacing the SRAM memory core with a
DRAM core. In this regard, we present a three-port DRAM as a possible
replacement for the dual-port SRAM.
Chapter 2
Related work.
This chapter is a listing of all previous work that has been used as references
in writing this thesis. Considerable research has been done in the area of
buffer design and optimization. High speed communication networks and
imaging systems employ dedicated packet buffer architectures. Agere has
published an interesting article about the evolution of high-speed packet
processors that require efficient buffering schemes [23]. This article lists the
advantages and disadvantages of various technologies that have been
employed in the design of high-speed data processing logic. Whenever there
is a requirement to process an input stream that arrives in a random manner, it
indicates that the buffering stage will most probably be asynchronous.
Clifford has written a very good reference on asynchronous FIFO design [3].
His work provides an elaborate explanation of how the status signals of the
FIFO are exchanged across clock domains. Yantchev, Huang, Josephs and
Nedelchev discuss how these buffers work to support high throughput [22].
Chapter 4 of this thesis is based on the course notes and lectures given by
Professor Gandhi Puvvada of USC [14]. This chapter describes the design of
the flag generator. This block is the only part of the design that works across
two clock domains. Whenever multiple clock designs are synthesized,
debugging the hardware becomes a problem. We have presented a simple
flag generation scheme that will work even if the read and write domain
clocks are totally unrelated. However, for easy debugging, it can be assumed
that the clocks are multiples of each other and are in phase.
The heart of the FIFO buffer is the memory core. Conventional high-speed
FIFO buffers employ an SRAM core owing to the relatively slow
performance of the DRAM. Sundar Iyer, Ramana Rao Kompella and Nick
McKeown provide a very good introduction to the requirements of a fast
packet buffer [9]. Their paper deals with a buffering scheme that uses a large
DRAM core assisted by a small SRAM cache. A memory management
algorithm for replenishing the cache is also presented. A dual-port SRAM
core is necessary to facilitate simultaneous reads and writes that will be
required in our application. Smith gives the reader a basic idea of how dual
port memory modules are employed to achieve the bandwidth demanded by
high-speed communication systems [17]. The ideas obtained from these
papers have been used to study dual-port SRAM FIFO buffer cores in this
thesis.
As discussed earlier, network processor buffers must be not only fast but also
small. Low cost and high density make DRAMs a very good choice for on-
chip buffers, especially ones used in high-speed applications like network
processors. Moreover, SRAM memories are susceptible to ‘soft errors’, a
phenomenon caused by external radiation leading to changes in the value
stored in a cell. Jeanne Graham gives an introduction to what soft errors are
[6]. Graham talks about how and why soft errors are a problem in SRAM
memories. Premkishore Shivakumar, Michael Kistler, Stephen W.Keckler,
Doug Burger and Lorenzo Alvisi propose an end-to-end model that can be
used to compute the soft-error rates for existing and future designs [16].
Anthony Cataldo's article describes how soft errors affect network processor
buffers [2]. He also lists possible malfunctions that may occur if a soft error
goes undetected. A packet may be routed to a totally different location due to
a single bit flip in the SRAM memory core. Therefore we attempted to study
the tradeoffs involved in using a DRAM instead of SRAM as the core of the
FIFO buffer. Katayama has written an extremely interesting paper on
semiconductor memories [12]. It explains the advantages and limitations of
DRAMs in detail. This work gave us an idea of what the current technology
trend is. It was important to look at how asynchronous DRAMs are designed.
Ekanayake and Manohar talk about the design of a high-performance on-chip
pipelined asynchronous DRAM suitable for use in a microprocessor cache
[5]. The paper concludes that asynchronous DRAMs can be tweaked to
approach the performance levels of SRAMs. Refreshing a DRAM cell is a
cause of concern in on-chip memories. Dedicated controllers may have to be
designed to achieve this operation. Moreover, a refresh cycle usually stalls
any read or write operation occurring at that point in time. This is not very
desirable in high-speed applications. Micron Technology’s white paper on
refresh techniques [25] describes the different refreshing schemes used in the
industry today. It also provides an idea of how frequently a DRAM storage
capacitor needs to be refreshed in order to prevent loss of data.
We looked at possible options to try and ‘hide’ the refresh operation. Our
idea was to avoid a read or a write cycle from being stalled by the refresh
cycle. We also aimed to simplify the design of the refresh controller.
Takayasu Sakurai, Kazutaka Nogami, Kazuhiro Sawada and Tetsuya Iizuka
introduce a dual-port DRAM having a dedicated refresh port [15]. We extend
this work and propose a three-port three transistor DRAM as a low-area
alternative to the dual-port eight transistor SRAM. Our DRAM core has one
dedicated read port, one dedicated write port and one dedicated refresh port.
Chapter 5 of this thesis gives an introduction to the memory cores used. The
detailed design and working of the same can be found in the Master’s Thesis
of Aniket Kadkol [11]. He has discussed performance issues in his work
while I focused on area.
FIFO buffers are sequential-access memory elements. The first data packet to
be written into the buffer is also the first to be read out of the buffer. This
means that the address generation block, that is usually a decoder in random
access memories, can be replaced by a simple ring pointer. Haibo Wang and
Sarma B.K. Vrudhula present one such pointer and this has been used as a
reference for our work [20]. This thesis also presents a comparative study on
both the decoder and the ring pointer addressing schemes in Chapter 6. The
concept of logical effort [18] has been employed to obtain an approximate
design with which to start actual simulations.
Chapter 7 is a description of results obtained from this study. We hope that
these numbers will help make on-chip memory design choices a little
simpler.
Chapter 3
FIFO overview.
This chapter provides an introduction to the conventional asynchronous FIFO
buffer. My role in Dr. Parker's 'Netchip' project was to compare various
memory cores in a FIFO and obtain the benefits and limitations of each
setup. A FIFO (First-in First-out), as the name suggests, is a memory that
incorporates sequential access. The first item to be written is also the first to
be read out. On the outside of the FIFO buffer, we have systems writing into
(referred to as the writer from this point on) and systems reading out of
(referred to as the reader from this point on) the FIFO buffer. Each of these
systems may or may not operate at the same clock frequency. This study
considers the case where the reader and writer operate at two different clock
frequencies. The internal design of the FIFO is not of any importance to
either the writer or the reader. The writer must not write into the FIFO when
it is full and the reader must not read out of the FIFO when it is empty.
The status of the FIFO is made available to the writer and reader through the
‘Full’ and ‘Empty’ flags respectively. The writer can input information into
the FIFO during any clock provided the ‘Full’ flag is not set. The reader can
read out information from the FIFO during any clock provided the ‘Empty’
flag is not set.
Figure 1: Block diagram of a basic FIFO setup (writer and reader connected to the FIFO, with Full and Empty flags)
As the figure indicates, the writer deposits data into the FIFO at a clock rate
of Wclk while the reader reads data out of the FIFO at a clock rate of Rclk.
The clocks can be assumed to be multiples of each other and perfectly in
phase so as to enable easy debugging upon synthesis. However, the design
presented here will work even if the clock domains are completely unrelated.
The block diagram of the internal structure of a typical FIFO is shown in
Figure 2.
Figure 2: Internal structure of a typical FIFO (memory core, read and write pointers, and flag generator; Rd_addr = read address, Wr_addr = write address)
Figure 2 helps us identify two distinct data paths:
• The memory core along with the read/write memory pointers
• The flag generator circuit
The operations of these data paths are in parallel to one another, occurring
immediately after the pointer values are generated.
The memory core is essentially a RAM array with separate read and write
ports. Single port memories would require an arbiter to grant access to either
the reader or the writer. This is not very desirable for a high-speed buffer.
The read and write ports have separate read and write addresses, generated by
two counters of width log2 (number of locations). They are commonly
referred to as the read and write pointers respectively. The write pointer
points to the location that will be written next, and the read pointer points to
the location that will be read next. A write operation increments the write
pointer, and a read operation increments the read pointer.
The flag generator circuit generates the ‘Empty’ and ‘Full’ signals. As
mentioned earlier, these signals indicate that the FIFO has reached a terminal
condition: If ‘Full’ flag is set, then the FIFO has reached a terminal condition
for write and if ‘Empty’ flag is set, the FIFO has reached a terminal condition
for read. A terminal condition for write implies that the FIFO has no space to
accommodate more data and a terminal condition for read implies that the
FIFO buffer has no more data available for readout.
A brief description of the operation of the FIFO buffer is as follows. At reset,
the pointers are both 0. This is the empty condition of the FIFO buffer. The
‘Empty’ flag must be set and the ‘Full’ flag must be reset. When the ‘Empty’
flag is set, read operations cannot be performed and so the first operation is a
write. A write loads location 0 of the FIFO and increments the write pointer
to 1. This brings the ‘Empty’ flag low. Assuming that the writer continues to
write without any read cycles, the write pointer will ultimately equal (number
of locations-1). This means that the last location in the memory core is the
next location that will be written to. At this condition, a write will cause the
write pointer to point to the 0th location of the memory core.
At this instant, the write and read domain pointers are equal, but the FIFO is
full not empty. This means that equal pointer values alone do not indicate the
exact state of a FIFO memory. Designing a way to remember what operation
was performed last so as to differentiate between full and empty conditions is
essential. A solution to this is to design a memory element to store values
depending on whether the FIFO is almost full or almost empty. If the cause of
read and write pointer equality is a system reset or a read, the FIFO is empty
and if the cause is a write, the FIFO is full.
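The behavior described above can be summarized in a small software analogue. The sketch below assumes a 16-location buffer and a single clock domain; the "last operation was a write" flag plays the role of the memory element mentioned above, and all names are illustrative rather than taken from the thesis (the asynchronous, two-clock case is handled by the flag generator of Chapter 4).

class SimpleFifo:
    # Software analogue of a 16-location FIFO. The last_op_was_write flag
    # disambiguates the full and empty conditions when the pointers are equal.
    def __init__(self, locations=16):
        self.locations = locations
        self.store = [None] * locations
        self.wp = 0                      # write pointer
        self.rp = 0                      # read pointer
        self.last_op_was_write = False   # reset state: equal pointers mean empty

    def full(self):
        return self.wp == self.rp and self.last_op_was_write

    def empty(self):
        return self.wp == self.rp and not self.last_op_was_write

    def write(self, packet):
        assert not self.full(), "writer must not write when 'Full' is set"
        self.store[self.wp] = packet
        self.wp = (self.wp + 1) % self.locations
        self.last_op_was_write = True

    def read(self):
        assert not self.empty(), "reader must not read when 'Empty' is set"
        packet = self.store[self.rp]
        self.rp = (self.rp + 1) % self.locations
        self.last_op_was_write = False
        return packet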
The flag generator block operates with both the read and write clocks. The
basic idea is to transfer data between domains operating at different clock
frequencies. Although the memory core also has two clocks connected, it is
not a cause of concern because it is generally a dual-port RAM with one port
dedicated for writes and the other dedicated for reads. Hence the operations
that rely on the clocks are independent of each other. This is not the case in
the flag generator circuit. A detailed description of the flag generator block is
presented in the next chapter.
Chapter 4
The flag generation block.
The internal structure of the flag generation block is shown in Figure 3. The
circuit has its own write and read domain pointers, which are incremented
using the write and read clocks respectively. As mentioned earlier,
determining the condition of the ‘full’ and ‘empty’ flags requires pointer
values to be exchanged across clock domains. For example, to calculate the
value of the read pointer in the write domain, we must sample the output of
the read domain pointer using the write domain clock. In this case,
metastability may arise. Metastability will hamper the correct operation of the
FIFO buffer and must be accounted for and dealt with suitably. The
phenomenon of metastability can be best explained using the example of a D
flip-flop.
In a D flip-flop, the output Q attains the value of the input D after a finite
time called the 'clock to Q' time Tcq, and its value varies depending on the
architecture of the flip-flop. If the setup and hold times are maintained, the
output Q is guaranteed to follow input D after Tcq. However, if the clock edge
arrives while D is still changing, the output cannot be determined. It may end
up being a one or a zero. There is a finite non-zero probability that the output
will remain meta-stable forever. Metastability without resolution is rarely the
case in practical systems, and industry standards adopt 20 Tcq as a reasonable
time for resolving metastability. The technique of double sampling has been
employed to handle metastability. This involves connecting a second flip-flop
to the output of the first flip-flop. The value of 20 Tcq of the second flip-flop
must be as much smaller than the clock period as possible. In our design,
double sampling is adopted at the outputs of both write and read domain
pointers.
There is one other hidden problem, which should not be overlooked. We are
sampling the counter with a clock that is different from the counter clock.
Double sampling will get rid of the metastability problem but the values that
are obtained at the output of the second sampling flip-flop may end up being
totally different from the actual values. Assume the case wherein the read
pointer is changing from 1111 to 0000 where all 4 bits change values or for
that matter, any other transition where more than one bit flips. In our
example, all bits flip from 1 to 0. Consider the worst case wherein all 4 bits go
into metastability. Now, after double sampling, the output obtained may be
0111. The most significant bit may resolve to 0 instead of 1 after coming out
of metastability. 0111 corresponds to location 8 in a FIFO memory
(0000=location 1). The pointer has in fact moved from location 16 to location
1 whereas the information passed on to the writer’s side is that the read
pointer is at location 8. The information obtained by the full-flag generator is
therefore completely different and will lead to erroneous flag generation.
This problem can be fixed by making use of a counter in which only one bit
will flip. The obvious answer, therefore, is to employ a Gray counter. The
solution to the above problem is explained by considering the same example.
Assume that the read counter is changing from location 16 to location 1. The
Gray code equivalent of the binary number 1111 corresponding to location 16
is 1000. The Gray code equivalent of the binary number 0000 corresponding
to location 1 is 0000. Here we see that only one bit flips, namely the most
significant bit. Hence the worst possible case here is when the sampled value
of this bit goes into metastability. Ultimately, it will have to attain a value of
0 or 1. If it falls to 0, the value passed on to the writer side is that the read
pointer is at location 1. If unfortunately the bit stays at 1, the writer side will
think that the read pointer is still at location 16. Thus a totally different value
is never obtained. In the next clock when the read pointer is sampled again by
the write clock, the most recent value is obtained at the writer side.
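The property being exploited here can be checked with a few lines of Python. This is only an illustrative sketch, assuming 4-bit (16-location) pointers; the names are not from the thesis.

def bin_to_gray(b):
    # Standard reflected Gray code: adjacent counts differ in exactly one bit.
    return b ^ (b >> 1)

def bits_changed(a, b):
    return bin(a ^ b).count("1")

for i in range(16):
    nxt = (i + 1) % 16
    print(i, "->", nxt,
          "binary bits flipped:", bits_changed(i, nxt),
          "gray bits flipped:", bits_changed(bin_to_gray(i), bin_to_gray(nxt)))

The binary transition from 1111 to 0000 flips all four bits, so a mis-sampled value can be arbitrary; every Gray transition, including 1000 to 0000, flips exactly one bit, so a mis-sampled Gray value can only be the old or the new count.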
Figure 3: Block diagram of the flag generator
(RS: Read domain sampling flip-flops; WS: Write domain sampling flip-flops; AF: Almost Full logic; AE: Almost Empty logic; Dwd: depth in write domain; Drd: depth in read domain; CC: Gray to binary code converter)
It is to be noted that erroneous operation is not at all a possibility even if we
do not obtain the most recent value after sampling. An elaborate discussion of
the same is provided below.
Let us assume that the write clock is much faster than the read clock. Say at
clock i on the reader’s side, the sample value of the write pointer is M when
actually the write pointer is at location M+l. The reader will not attempt to
read out of location M even though in reality, the writer has already written
into it and moved on to location M+l. This is not harmful. However we do
end up wasting a read clock but this is unavoidable to ensure correctness.
This may continue to happen and there may even be a stage when the reader
will decide that the FIFO is empty when it is actually full. This can happen if
the writer fills up the FIFO while the reader reads out of what he thinks is the
last location with valid data. Only in the next read clock will the reader
sample the latest value of the write pointer and realize that the FIFO is not
empty any more.
The same argument extends to the case where the read clock is faster than the
write clock. Now, the writer will assume that fewer reads have
occurred and may end up not writing into the FIFO, thinking it is full, at some
point when the fast reader has actually emptied it. This again is not
harmful because we hold back the write operation; one write
clock is wasted. The following pseudocode, which uses Figure 3 as a
reference, provides a simple description of the working of the flag generator
block (a small executable sketch of the same steps follows the list).
1) Upon reset, set the ‘Empty’ flag and reset the ‘Full’ flag.
2) Whenever a write occurs, increment the write domain pointer WP, which
is a gray code counter. The read domain pointer remains unchanged.
3) Whenever a read occurs, increment the read domain pointer RP, which is a
gray code counter. The write domain pointer remains unchanged.
4) Sample the write domain pointer with the read clock Rclk. (If the clock
edge of Rclk occurs while the write domain pointer is still changing,
metastability may occur. The two sampling flip-flops RS1 and RS2 perform
double-sampling so as to prevent metastability.)
5) Sample the read domain pointer with the write clock Wclk. (If the clock
edge of Wclk occurs while the read domain pointer is still changing,
metastability may occur. The two sampling flip-flops WS1 and WS2 perform
double-sampling so as to prevent metastability.)
6) Transfer the value of the sampled write pointer WSG to the read domain.
7) Transfer the value of the sampled read pointer RSG to the write domain.
8) Compute the difference between WSG and the value of the read pointer RP
using subtractor 2 in the read domain.
9) Compute the difference between RSG and the value of the write pointer WP
using subtractor 1 in the write domain. Both subtractions are performed using
two's complement arithmetic.
10) If WSG and RP are nearly equal in the read domain (the two most significant
bits of the output of the subtractor equal zero, which means that the
difference between the two pointers is less than or equal to 3), generate an
'almost empty' signal (using the AE block). This signal must be stored in a
memory element for use in a later clock when WSG and RP become equal.
11) If RSG and WP are nearly equal in the write domain (the two most significant
bits of the output of the subtractor equal one, which means that the difference
between the two pointers is greater than or equal to 12), generate an 'almost
full' signal (using the AF block). This signal must be stored in a memory
element for use in a later clock when RSG and WP actually become equal.
12) If WSG and RP are exactly equal, check the status of the almost empty
signal. If it is set, the FIFO is actually empty and not full. Set the 'Empty' flag.
13) If RSG and WP are exactly equal, check the status of the almost full
signal. If it is set, the FIFO is actually full and not empty. Set the 'Full' flag.
14) The status of the ‘Empty’ flag is queried by the reader on each read
clock to determine if it is safe to read from the FIFO.
15) The status of the ‘Full’ flag is queried by the writer on each write clock to
determine if it is safe to write into the FIFO.
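The fifteen steps above can be condensed into a short behavioral sketch. This is an illustrative, single-threaded Python model, assuming 4-bit pointers and the thresholds stated in steps 10 and 11; the class and method names are invented for the sketch, and the subtraction order is chosen so that the quoted thresholds (difference greater than or equal to 12 for almost full, less than or equal to 3 for almost empty) apply directly.

def bin_to_gray(b):
    return b ^ (b >> 1)

def gray_to_bin(g):
    b = 0
    while g:
        b ^= g
        g >>= 1
    return b

class FlagGenerator:
    def __init__(self, bits=4):
        self.mask = (1 << bits) - 1
        self.wp = 0                    # write domain pointer (Gray coded)
        self.rp = 0                    # read domain pointer (Gray coded)
        self.full = False              # step 1: reset state
        self.empty = True
        self.almost_full = False       # memory elements of steps 10 and 11
        self.almost_empty = True

    def write(self):                   # step 2
        self.wp = bin_to_gray((gray_to_bin(self.wp) + 1) & self.mask)

    def read(self):                    # step 3
        self.rp = bin_to_gray((gray_to_bin(self.rp) + 1) & self.mask)

    def write_clock(self, sampled_rp_gray):
        # steps 5, 7, 9, 11, 13, 15: evaluated in the write domain on Wclk
        depth = (gray_to_bin(self.wp) - gray_to_bin(sampled_rp_gray)) & self.mask
        if depth == 0:
            self.full = self.almost_full        # equal pointers: consult stored flag
        else:
            self.full = False
            self.almost_full = depth >= 12      # two MSBs of the difference are one

    def read_clock(self, sampled_wp_gray):
        # steps 4, 6, 8, 10, 12, 14: evaluated in the read domain on Rclk
        depth = (gray_to_bin(sampled_wp_gray) - gray_to_bin(self.rp)) & self.mask
        if depth == 0:
            self.empty = self.almost_empty      # equal pointers: consult stored flag
        else:
            self.empty = False
            self.almost_empty = depth <= 3      # two MSBs of the difference are zero

In hardware the sampled pointer values arrive through the double-sampling flip-flops of steps 4 and 5; here they are simply passed in as arguments.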
We also examined an exhaustive set of cases when an erroneous read or write
may occur. These are as follows:
1) The faster reader reads from an empty location: This will never happen in
our design. Even if the reader is faster than the writer, the ‘Empty’ flag is
generated as soon as the read pointer catches up with the write pointer. This
is because the write pointer when sampled by the reader will either give the
current location or the previous location (if metastability occurs) of the write
pointer and never the next location. The read pointer cannot be misinterpreted
at the read domain because it is operated by the read clock.
2) The faster writer writes into a filled location: This again will never happen
in our design. Even if the writer is faster than the reader, the ‘Full’ flag is
generated as soon as the write pointer catches up with the read pointer. This
is because the read pointer when sampled by the writer will either give the
current location or the previous location (if metastability occurs) of the read
pointer and never the next location. The write pointer cannot be
misinterpreted at the write domain because it is operated by the write clock.
3) The reader samples a random value of write pointer or vice versa. This
again is avoided by making use of gray code counters. The sampled value
will be either the latest value, say n, of the pointer or one less than the latest
value, (n-1), which is the previous location of the FIFO.
4) Sampled value of the read pointer is not the most recent value. The faster
reader has read from nearly all locations but the slow writer samples the read
pointer and thinks that the FIFO is full when it is actually empty. This will
not cause an erroneous read or write. What will happen is the writer will hold
off a write operation during that clock even though it can actually go ahead
and write safely. There is wastage of one write clock but no harm is done.
Therefore, the slower clock domain may end up wasting a clock. In the next
clock, the writer will sample the read pointer again and will realize that the
FIFO has become empty, leading to a normal write operation.
Therefore it can be safely concluded that the only drawback of this scheme is
that the slower clock domain may end up wasting one clock under the
circumstance indicated in point 4 above. Under no circumstance will an
empty location be read from or a full location be written into. The exact
architectures of the blocks shown in Figure 3 are shown in Chapter 6 where
we have presented a simple flag generation circuit in an attempt to reduce
overall area.
Figure 4: FIFO timing diagram (worst case, assuming out-of-phase clocks).
Case 1 (slow reader, fast writer): the faster writer will never write into a filled-up location because it will always have the most recent or previous value of the read pointer. For example, during write clock i, the writer will never have a read pointer value corresponding to read clock i+1 (indicating that the reader has read the last location and it is okay to go ahead and write).
Case 2 (slow writer, fast reader): the faster reader will never read from an empty location because it will always have the most recent or previous value of the write pointer. For example, during read clock i, the reader will never have a write pointer value corresponding to write clock i+1 (indicating that the writer has written into the last location and it is okay to go ahead and read).
Chapter 5
Memory cores.
This chapter gives a brief introduction to the types of memory cores used in
the FIFO. Each memory location stores a single packet.
a) Dual-port Static Random Access Memory
High-speed applications employ SRAM memory. Most of the details about
the working of the memory cores can be found in the Master's thesis report of
my project partner Aniket Kadkol, a graduate student at USC [11]. He has
discussed performance issues of this FIFO while I focused on area
minimization. We have used a dual-port core so as to enable simultaneous
read and write operations from two different locations. Figure 5 is a circuit
diagram showing the internal structure of a single dual-port SRAM cell. The
dual-port cell has four pass transistors per cell as opposed to two in a
conventional SRAM cell. The SRAM (static RAM) cell stores one bit (either
a 1 or a 0) of information without the need for refreshing the contents.
Figure 5: Dual-port SRAM cell
b) Three-port Dynamic Random Access Memory
High-energy neutrons present in cosmic radiations and alpha particles present
in packaging material impurities can strike a sensitive region in a chip,
generating a track of electron-hole pairs. When these pairs accumulate at a p-
n junction, a current pulse is produced. If the charge generated by this pulse
is greater than a particular critical value, the data stored in the memory cell
gets corrupted. Such errors are called ‘Soft Errors’ and they are measured in
units of FIT (Failures In Time, 1 failure in 10^9 hours of operation) [4].
Industry experts specify 1000 FITs per megabit to be a typical soft error rate.
However, devices implemented using 0.13 µm technologies have shown soft-
error rates up to 10,000 FITs per megabit [6]. Technology scaling decreases
device geometries, but this also reduces the value of critical charge required
to flip the value stored in the cell. Therefore, scaling worsens the impact of
soft errors on modern memory systems. Dense packing of devices increases
the probability of a high-energy particle striking a vulnerable node on the
chip. In a PC, soft errors are eclipsed by the more common software bugs and
will most probably go unnoticed. Rebooting the computer will solve the
problem and no harm is done. The same cannot be said about memory
systems used in networking components. A soft-error can route a packet to a
totally different location. SRAMs are more susceptible to these errors
because the data is stored at nodes between cross-coupled inverters. A
particle hit on any of these nodes can flip the value that is stored. Cross
coupled inverter action will magnify the corrupted nodal charge and invert
the stored value. DRAMs are fairly resistant to soft errors due to superior
packaging techniques and also due to the fact that data is stored on a
capacitor. The amount of energy required to flip the charge stored on this
capacitor is far greater than the amount of energy required to flip the data
stored at a node in an SRAM cell. In a high-speed application, a DRAM
memory core may seem impractical because of the constant refreshing and
destructive readouts associated with them. A refresh operation in normal
DRAMs stalls read and write requests. Moreover, readout is destructive and
hence must be accompanied by a write-back stage. Generally, the output of
the sense amplifier is written back into the cell. These drawbacks decrease
speed greatly.
We have proposed a technique, which improves upon earlier research [19], to
achieve high speeds using a DRAM core. Apart from separate read and write
ports, the memory cell is provided with a refresh port [11]. Therefore, a
refresh operation can occur without hampering a read or write operation. The
setup of the DRAM cell is as shown in Figure 6.
Figure 6: Three-port DRAM cell
An elaborate discussion on the working of this cell and a simplified refresh
scheme that we proposed together can be found in Aniket's Master's thesis
report [11]. We compared the three-port DRAM memory with the dual-port
SRAM and also with a previously proposed asynchronous DRAM [5]. The
results obtained through this study have been presented in Chapter 7.
Chapter 6:
Architecture of a 16 location 1Kbit packet size FIFO.
The gate-level structure of a 16 location FIFO is explained in this chapter. As
explained earlier, the architecture comprises two distinct data paths:
1) The memory with its addressing schemes and pointers, and
2) The flag generators.
We have employed two memory addressing methods, the ring-pointer
architecture and the decoder architecture and compared their areas and
performance [11].
Decoder Architecture:
The 16-location FIFO memory requires a four-bit counter to generate the
address. The binary counter employing T flip-flops to perform this operation
is shown in Figure 7. The counter is enabled only when there is a request to
read or write.
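As a rough behavioral sketch of this address generation (illustrative Python, not the T flip-flop circuit of Figure 7): the counter holds the address of the location accessed in the current cycle and advances only when a read or write request enables it.

class AddressCounter:
    def __init__(self, bits=4):
        self.bits = bits
        self.value = 0          # corresponds to the output bits O1..O4

    def access(self, enable):
        addr = self.value
        if enable:              # the counter is clocked only on a read/write request
            self.value = (self.value + 1) % (1 << self.bits)
        return addr

counter = AddressCounter()
print([counter.access(e) for e in (True, True, False, True)])   # [0, 1, 2, 2]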
Figure 7: Binary counter
The output bits O1, O2, O3 and O4 provide the address of the location that
is to be accessed in the current read/write clock. These bits are fed to a
decoder. The decoder activates a single word line depending on the value
supplied by the address bits. The architecture of the decoder is shown in
Figure 8.
Figure 8: 4 bit address decoder
The inputs from the counter are inverted to obtain their complements (~O1,
~O2, ~O3 and ~O4). These are then pre-decoded and suitably fed into
AND gates. Each AND gate selects a unique word line. Hence a 16-location
memory core requires 16 AND (NAND implementation in CMOS) gates.
Each word line extends over 1024 locations (internal packet size). A single
AND gate is not sufficient to drive the entire word line. We need to buffer the
input suitably. The concept of logical effort [18] is employed to determine the
ideal number of stages required to drive the word line capacitance while
maintaining a specific delay value. Actual results were computed using
HSPICE. Logical effort has been used merely as a guide to arrive close to the
exact result. Determining a value for delay required a small mathematical
analysis, which is shown below.
The sequences of operations performed on the packet are as follows:
1. Write received packet to memory.
2. Read received packet from memory for processing.
3. Write back the packet after processing.
4. Read out the processed packet.
Let us assume an input rate of 10 Gbps. This means that the throughput with
respect to a single-port memory is four times the input rate which is 4*10=40
Gbps. We have used separate ports for reading from and writing into the
memory. Therefore the desired throughput is reduced by half and now
becomes 20 Gbps. Internal packet size is set at 1Kb. Hence the rate at which
packets access the memory is given by packet size/throughput = 1 Kb/20
Gbps = 47 ns. This time includes packet processing also. We have allotted
90% of the time to processing and 10% to buffering. This sets a target time of
5 ns per cycle. A read or a write operation must complete within
approximately 5 ns. We have tried to minimize the area as much as possible
by selecting suitable architectures while trying to obtain this performance
metric.
Here is a small example of how logical effort [18] can be used to determine
the optimum number of stages in logic. Logical effort is a simple way to
estimate the delay in logic. According to this concept, Delay = Effective
fanout + Parasitic delay (p). A graph can be plotted between the delay
through a gate and its fanout. The slope of this plot gives the value of logical
effort of the gate. This is shown in Figure 9.
Figure 9: Delay vs. fanout plot (slope = logical effort g; intercept = parasitic delay p; x-axis = fanout h)
An inverter comprises a PMOS transistor in the pull-up network and an
NMOS transistor in the pull-down network. The PMOS transistor gate width
is made approximately twice that of NMOS [18] so that the rise and fall times
during a pull-up and pull-down operation are equal because a PMOS has
holes as the majority charge carriers and their mobility is approximately half
of that of electrons which are the majority charge carriers in an NMOS.
Current technology recommends PMOS transistor gate width to be 2.6-2.7
times that of the NMOS transistor gate width. This thesis has assumed a
PMOS transistor width of 2 times the NMOS transistor gate width so as to
simplify explanations on gate sizing. By increasing gate width, the channel
resistance of the PMOS transistor is reduced by half so as to compensate for
the decreased mobility of charge carriers. This does not guarantee exactly
equal rise and fall times. Simulation results provided a rise time that was
approximately 1.17 times the fall time. An inverter is sized as shown in
Figure 10.
Figure 10: Sizing of gates in an inverter
Logical effort of any gate under study is given by the equation
(Rgate · Cgate) / (Rinv · Cinv),
where Rgate is the channel resistance of the gate under study, Cgate is the
capacitance seen at the gate by the input signal driving the logic gate under
study, Rinv is the channel resistance of the inverter shown in Figure 10, and Cinv is
the gate capacitance of the inverter shown in Figure 10.
The logical effort of the gate under study is calculated by sizing the
transistors such that the effective pull-up and pull-down resistances are equal
to that of the inverter. Then the ratio of gate widths would give us the logical
effort of the gate. The sizing for a NOR gate and a NAND gate is shown in Figure
11 and Figure 12 respectively.
Figure 11: Sizing of gates in a NOR gate
Figure 12: Sizing of gates in a NAND gate
Therefore, logical effort of a minimum sized NAND gate is 4/3 and logical
effort of a minimum sized NOR gate is 5/3. Consider the following setup in
Figure 13 where a load capacitance of 64 units is to be driven with a gate
whose capacitance is 1 unit.
Figure 13: Single stage driver
Delay d = N · (G·B·H)^(1/N) + p,
where G = path logical effort = 1 for a single inverter,
H = path fanout = 64,
B = branching factor = 1 in this case,
p = parasitic delay = 1 for an inverter [18],
N = number of stages = 1 in this case.
Therefore, delay = 64 + 1 = 65. (Equation 1)
The delay of an inverter having an effective fanout of 4 = 4 + 1 = 5. (Equation 2)
Equation 2 gives the value of one FO4 (fanout of 4) delay, which is
approximately 150 ps [18].
The delay obtained in Equation 1 can be normalized with respect to Equation
2 as follows.
Normalized delay of the gate = 65/5 = 13 FO4 delays = 1.9 ns.
If another inverter is added to the path, it may seem that an extra inverter
delay will be added to the above value. However this is not true and is
explained using Figure 14 below. The second inverter is given a gate width of
8 times the first so as to maintain an equal fanout in each stage.
Figure 14: Two-stage driver
Now, G·B·H = 64 but N = 2.
Hence, delay = 2 · (64)^(1/2) + 2 = 18; normalized, 18/5 = 3.6 FO4 delays = 0.54 ns.
We now add another inverter to the path as shown in Figure 15 and maintain
an equal fanout per stage.
Figure 15: Three-stage driver
Now, G·B·H = 64 but N = 3.
Hence, delay = 3 · (64)^(1/3) + 3 = 15; normalized, 15/5 = 3 FO4 delays = 0.45 ns.
This decrease in delay will not continue as more and more stages are added
with equal fanouts per stage. Consider the setup in Figure 16.
Figure 16: Four-stage driver
Now, G·B·H = 64 but N = 6.
Hence, delay = 6 · (64)^(1/6) + 6 = 18; normalized, 18/5 = 3.6 FO4 delays = 0.54 ns.
Therefore, the delay decreases and then increases, even though extra gates are
added to the path. For a specific path fanout, there is an optimum number of
stages for which delay is minimum. This is approximately equal to the
number of gates in the path that will maintain an effective fanout of 4 at
each gate [27].
Note: The effective fanout per stage for minimum delay is highly
technology specific. We used the value 4 to obtain the initial number of gates
in the critical path so as to start simulations. The final number was obtained
upon simulation employing trial and error.
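The trend illustrated in Figures 13 through 16 can be reproduced with a few lines of arithmetic. This is only a back-of-the-envelope sketch of the logical effort model used above (G = B = 1, H = 64, one unit of parasitic delay per inverter, one FO4 delay = 5 units, roughly 150 ps); the final gate counts and sizes in this thesis came from HSPICE simulation.

FO4_UNITS = 5.0      # delay of an inverter with a fanout of 4, in normalized units
FO4_PS = 150.0       # approximate value of one FO4 delay in picoseconds [18]

def chain_delay(gbh, n, parasitic_per_stage=1.0):
    # Path delay of an n-stage chain with equal effort per stage:
    # n * (GBH)^(1/n) plus n parasitic delays.
    return n * gbh ** (1.0 / n) + n * parasitic_per_stage

for n in (1, 2, 3, 6):
    d = chain_delay(64, n)
    fo4 = d / FO4_UNITS
    print(f"N={n}: {d:.0f} units = {fo4:.1f} FO4 = {fo4 * FO4_PS / 1000:.2f} ns")

# N=1 gives 65 units (13 FO4), N=2 gives 18 (3.6 FO4), N=3 gives 15 (3 FO4) and
# N=6 gives 18 again (3.6 FO4): the delay falls and then rises as stages are added.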
The number of stages required for minimizing decoder delay is calculated as
follows:
Let Cwordline be the capacitance of a single word line in the memory array,
Cpass-gate be the total gate capacitance of all the pass transistors located in the
memory cells of the locations through which the word line runs, and Cwire be
the capacitance of the wire constituting the word line. The total capacitance
of the word line can then be given by the equation
Cwordline = Cpass-gate + Cwire
Cgate is approximately equal to 2 fF/µm [10].
Assuming metal 2 is used, Cwire will be approximately 0.2 fF/µm [10].
The area of an SRAM cell is approximately (40λ × 40λ), where λ is the channel
length that depends on process technology. Assuming 0.15 µm technology, the
above values now become
Cgate = 0.3 fF/λ
Cwire = 0.03 fF/λ
The word line has to span 1024 SRAM cells (as internal packet size is 1024
bits and each location stores a packet).
Hence the capacitance values become
Cpass-gate = (4×2)λ × 1024 × 0.3 fF/λ = 2457.6 fF
Cwire = 40λ × 1024 × 0.03 fF/λ = 1228.8 fF
Cwordline = 2457.6 fF + 1228.8 fF = 3686.4 fF
Since Cgate = 0.3 fF/λ, we can express the word line capacitance in terms of
gate width to make calculations easier:
Cwordline = 3686.4/0.3 = 12288 λ of gate width.
Logical effort is then used to determine the approximate number of stages
which will give us a minimum delay.
G = (4/3) × (4/3) × (1) for the two NAND gates and inverters in the path.
H = 12288/3.
B = 2.
Therefore, GBH = 116523.00.
We aim at maintaining an effective fanout per stage of 4 [27].
(GBH)^(1/N) = 4.
(116523.00)^(1/N) = 4.
The value of N that satisfies the above equation is found to be 9. With N = 9,
effective fanout per stage = 3.65.
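The same numbers can be obtained with a short script. This sketch only repeats the arithmetic above under the stated assumptions (0.15 µm technology, a 40λ × 40λ SRAM cell, 0.3 fF/λ of gate capacitance and 0.03 fF/λ of metal-2 wire capacitance, a target effort of 4 per stage); GBH is taken as the value quoted above, and the per-stage widths of Figure 17 additionally depend on where the branches occur in the chain.

import math

CELLS_PER_WORDLINE = 1024     # 1k-bit packet: one cell per bit along the word line
C_GATE_PER_LAMBDA = 0.3       # fF/λ
C_WIRE_PER_LAMBDA = 0.03      # fF/λ

c_pass = (4 * 2) * CELLS_PER_WORDLINE * C_GATE_PER_LAMBDA    # 2457.6 fF
c_wire = 40 * CELLS_PER_WORDLINE * C_WIRE_PER_LAMBDA         # 1228.8 fF
c_wordline = c_pass + c_wire                                  # 3686.4 fF
wordline_in_lambda = c_wordline / C_GATE_PER_LAMBDA           # 12288 λ of gate width

gbh = 116523.0                                # path effort quoted in the text
n = math.ceil(math.log(gbh) / math.log(4))    # smallest N with at most 4 effort per stage
effort_per_stage = gbh ** (1.0 / n)

print(f"C_wordline = {c_wordline:.1f} fF = {wordline_in_lambda:.0f} λ of gate width")
print(f"N = {n}, effective fanout per stage = {effort_per_stage:.3f}")   # 9 stages, ~3.65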
The structure of the decoder with 9 stages is as shown in Figure 17. We add
inverters to the beginning of the chain so as to reduce the number of gates in
an attempt to reduce the area. If inverters were added to the branches located
at the end of the chain, the number of additional gates added to make N = 9
would have been doubled. The gates are then sized using the value of
effective fanout per stage. Therefore, the individual gate widths are given by
the product of the gate width of the preceding stage with the effective fanout
per stage.
A = 3.65 × 3 = 10.95 λ (where 3.65 is the effective fanout per stage and 3 is
the gate width of the first inverter in the path).
Similarly,
B = 39.96 λ.
C = 145.88 λ.
D = 266.23 λ.
E = I = 485.87 λ.
F = J = 1773.44 λ.
G = K = 1618.27 λ.
H = L = 5906.69 λ.
Figure 17: Structure of the decoder for a 16-location 1k packet-size SRAM memory
The total area of the pre-decoder and decoder was found to be 1750295.02 λ².
The decoder was simulated for delay values as well and these are tabulated at
a later stage.
Ring-pointer architecture:
Serial access memories like a FIFO can be addressed using a ring-pointer.
We have studied the advantages and drawbacks of this design and compared
them with the decoder scheme of addressing. A ring pointer activates the
word lines in increasing order of locations. One word line is activated per
clock. The basic architecture [20] is given in Figure 18.
Figure 18: Ring-pointer
When the FIFO is enabled by a read or write request, a value of '1' is
provided at the input. During the negative half of the clock, PMOS transistor
P1 propagates this value to the input of the inverter I1, which produces an
output of '0' at the input of the pass gate formed by NMOS transistor N2. Until
the clock goes into its positive half cycle, the NMOS transistor N2 remains
off. During the positive half cycle, transistor N2 turns on and transmits '0' to
the input of inverter I2, which in turn produces an output of '1' to drive the word
line. However, this setup will take a lot of time to drive the large word line
capacitance. Therefore, we use the concept of logical effort once again to
determine the number of stages necessary to minimize the delay.
Here again, like in the decoder setup, the word line capacitance is 12288 λ.
We use only inverters to drive the word line and so the path logical effort (G)
is 1. Absence of branching makes branching factor B = 1. Hence, GBH = H =
12288/3 = 4096.
With N = 6, effective fanout per stage will exactly equal 4. The complete
setup of a ring-pointer word line driver for lk-packet size is as shown in
Figure 19.
Figure 19: Structure of the ring-pointer for a 16-location 1k packet-size
SRAM memory (N = 6; word line capacitance of 12288 λ).
The gates are then sized using the value of the effective fanout per stage,
which in this case is exactly 4. Therefore, A = 3*4 = 12 λ, B = 48 λ,
C = 192 λ, D = 768 λ, E = 3072 λ. The area of this setup was calculated in
the same way as for the decoder and was found to be 967680 λ². We realize a
reduction in area as compared to the decoder scheme. Simulations were
performed to compare the speeds of the decoder setup with the ring pointer,
and the ring pointer was found to be much faster than the decoder. Although
both addressing schemes were simulated and compared for the SRAM memory
core, we used only the ring-pointer addressing scheme with the DRAM core,
because using a decoder would further slow down the memory access. The
outputs from the decoder or ring pointer are connected to the appropriate
word lines in the memory as explained in the previous chapter.
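The sizing quoted above can be checked in a few lines: with a word line load of 12288 λ and a 3 λ first inverter, six stages give a stage effort of exactly 4, and the successive widths reproduce A through E.

```python
# Checking the six-stage inverter chain quoted above.
load = 12288.0      # word line capacitance, expressed in lambda of gate width
first = 3.0         # width of the minimum-size first inverter
n_stages = 6

fanout = (load / first) ** (1.0 / n_stages)   # 4096 ** (1/6) = 4
print(round(fanout, 6))                       # 4.0

widths = [first * 4 ** i for i in range(1, n_stages)]
print(widths)       # [12.0, 48.0, 192.0, 768.0, 3072.0] -- A through E
```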
Gray-code counter architecture:
As discussed earlier, a gray-code counter is employed in the flag-generator
block so as to avoid erroneous outputs while sampling across clock domains.
The binary counter shown in Figure 11 can be converted into a gray-code
counter using a simple code-conversion circuit shown in Figure 20.
Figure 20: Binary to gray code conversion circuit (gray outputs g3..g0
derived from the binary counter bits).
From Figure 3, it is understood that the inputs to the subtractors, which
calculate the depth of the FIFO, are in binary form. We used the circuit
shown in Figure 21 to convert the sampled values from gray code into their
corresponding binary form.
Figure 21: Gray to binary code conversion circuit (gray inputs g3..g0,
binary outputs b3..b0).
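The two conversions can be summarised at the bit level as follows. This is a behavioural sketch of the circuits in Figures 20 and 21 for the 4-bit counters used in the 16-location FIFO, not a description of the gate-level implementation.

```python
# Bit-level sketch of the code converters for 4-bit pointer values.
def binary_to_gray(b):
    return b ^ (b >> 1)    # g[i] = b[i] xor b[i+1]; the MSB passes through

def gray_to_binary(g):
    b = 0
    while g:
        b ^= g             # b[i] = xor of all gray bits at position i and above
        g >>= 1
    return b

# Round-trip check over all 16 pointer values.
for value in range(16):
    assert gray_to_binary(binary_to_gray(value)) == value
print([format(binary_to_gray(v), '04b') for v in range(4)])  # 0000 0001 0011 0010
```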
I would like to mention here that all XOR gates were implemented using
pass-transistor logic in an attempt to reduce area. A comparative study of the
same is presented below. A 2-input XOR gate implemented using CMOS
technology will have 8 transistors excluding the inverters used to generate
input complements. The same gate implemented using pass transistor logic
will only have 4 transistors. The structures are shown in Figure 22. A binary
to gray and gray to binary conversion requires four 2-input XOR gates, one
3-input XOR gate and one 4-input XOR gate. CMOS logic results in a total
of 144 transistors for this setup. On the other hand, pass-transistor logic gives
us a total of 72 transistors. The areas of code converters were as follows:
a) Using CMOS logic - 57024 λ².
b) Using pass-transistor logic - 26784 λ².
Figure 22: XOR gate architecture comparison (a CMOS XOR gate with pull-up
and pull-down networks versus a pass-transistor XOR gate).
We realized a 53% reduction in the area of the code converters by using
pass-transistor logic.
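One common 4-transistor pass-gate realisation computes output = A·~B + ~A·B; the short check below confirms that this expression is indeed the XOR function, which is the property the area comparison above relies on. The code is purely illustrative.

```python
# Truth-table check that the pass-gate expression A*~B + ~A*B is XOR.
for a in (0, 1):
    for b in (0, 1):
        pass_gate_out = (a and not b) or (not a and b)
        assert int(pass_gate_out) == (a ^ b)
print("pass-transistor network matches XOR for all input combinations")
```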
Subtractor architecture:
FIFO depth is calculated in both the read and write domains using a simple
ripple-carry adder/subtractor circuit. A faster architecture was not employed
because this block works in parallel with the memory core access, which is
slower. A 4-bit subtractor is used in the 16-location FIFO. Each block in the
subtractor is a one-bit full adder with one of the inputs complemented and
with an initial carry C0 of 1. The equations employed in the setup are
Difference = A xor ~B xor Cin.
Borrow = (A*B) + Cin*(A+B).
Subtraction is realized through two's complement addition. The gate-level
implementation of one single block is shown in Figure 23.
Figure 23: One-bit subtractor.
The borrow output of each stage is fed in as the carry input of the next
stage. The final 4-bit result gives us the depth of the FIFO. Depth on the
write side is given by DWd and depth on the read side by DRd, as shown in
Figure 3.
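At the word level, the 4-bit two's-complement subtraction simply computes the pointer difference modulo 16. A minimal sketch of that computation (not of the bit-level adder chain itself) is shown below; the function name is illustrative.

```python
# Word-level sketch of the depth calculation: two's-complement addition of
# the complemented pointer with an initial carry of 1, i.e. (w - r) mod 16.
def fifo_depth(write_ptr, read_ptr, bits=4):
    mask = (1 << bits) - 1
    return (write_ptr + ((~read_ptr) & mask) + 1) & mask

print(fifo_depth(5, 2))   # 3  -- three packets currently held
print(fifo_depth(2, 5))   # 13 -- pointer wrap-around handled automatically
print(fifo_depth(9, 9))   # 0  -- ambiguous: the FIFO is either empty or full
```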
Empty and Full flag generators:
The outputs of the subtractors in the read and write domains are used to
generate the empty and full flags respectively. An output of zero from the
subtractor, indicating equal read and write pointer values, could occur for
both an empty and a full condition. Hence, intermediate signals called
'almost full' and 'almost empty' are generated to further qualify the outputs
from the subtractors. We designed the setup shown in Figure 24 to generate
the 'Full' and 'Empty' flags.
Figure 24: Flag generation circuit (write-domain depth bits DW and
read-domain depth bits DR from the subtractors, intermediate signals NF, NE,
NFF and NEF, 2:1 multiplexers, D flip-flops, and the RESET, FULL and EMPTY
signals).
The above setup works as follows. The two most significant bits of the
outputs from the subtractors on the read and write sides are checked. If the
write-side subtractor has both of its upper MSBs at one, the FIFO is 'nearly
full' and this makes NF = 1. If the read-side subtractor has both of its
upper MSBs at zero, the FIFO is 'nearly empty' and this makes NE = 1. This
information has to be stored for use in a later clock cycle, when the output
from either the read- or write-domain subtractor is zero. Hence, a D
flip-flop is employed to store the value presented on each of the NF and NE
signal lines. The flip-flop on the 'Full' flag side is clocked by the write
clock and the flip-flop on the 'Empty' flag side is clocked by the read
clock. This is because the reader only needs to know whether the FIFO is
empty, and the writer only needs to know whether the FIFO is full, at a
particular read or write clock respectively.
The following requirements have also been taken care of in our circuit:
1) Upon reset, the ‘Nearly Empty Final’ (NEF) signal must become 1,
which sets the ‘empty’ flag and the ‘Nearly Full Final’ (NFF) signal must
become 0, which resets the ‘Full’ flag. 2:1 multiplexers are employed for
achieving this.
2) A bistable element must ensure that both flags are not set at the same
time. NOR gates are used to generate the bistable element.
As the FIFO gets filled up, NF becomes 1. This causes NEF to become 0, which
in turn produces a 1 on NFF. Now, if the subtractor produces an output of
zero, we know that the FIFO was most recently 'Nearly Full' and hence it is
actually 'Full' now. As the reader starts emptying the contents, NE becomes
1. In a network processor, the writer is usually much faster than the reader,
and hence NF would have gone to 0 before NE became 1, since the write clock
would have sampled the most current value of the read pointer and sent it to
the write-side subtractor. Hence, NF would have become 0 while NE is still 0.
This would continue to keep NFF at 1, which is not a real problem, because
the output of the subtractor, being fed with the most recent read pointer
values, would not be zero. Now, when NE = 1 and NF = 0, NEF becomes 1. This
indicates that the FIFO was most recently 'Nearly Empty', and an output of 0
from the subtractor would then set the 'Empty' flag.
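The qualification logic can be summarised behaviourally as follows: a depth of zero is interpreted as FULL or EMPTY according to the latched nearly-full/nearly-empty state. This is only a sketch of the intent described above; the function names and interface are illustrative and do not correspond to the gate-level circuit of Figure 24.

```python
# Behavioural sketch of the flag qualification described above.
def nearly_full(depth_write):    # both MSBs of the 4-bit write-side depth are 1
    return (depth_write >> 2) == 0b11

def nearly_empty(depth_read):    # both MSBs of the 4-bit read-side depth are 0
    return (depth_read >> 2) == 0b00

def flags(depth_write, depth_read, nff, nef):
    # nff / nef are the values latched from NF / NE in the write / read domains.
    full = (depth_write == 0) and nff
    empty = (depth_read == 0) and nef
    return full, empty

# The FIFO was most recently nearly full (nff latched as 1), so a zero
# write-side depth is interpreted as FULL rather than EMPTY.
print(flags(0, 4, nff=True, nef=False))   # (True, False)
```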
Chapter 7:
Results.
The circuits designed and explained in the previous chapters were laid out
and simulated using HSPICE to obtain the performance metrics. Although
this work focused on minimizing the areas of the components, delay values
obtained on simulation [11] have also been presented in this chapter. We
compared the three-port DRAM FIFO with the dual-port SRAM FIFO and also with
a dual-port DRAM FIFO [5]. The areas and delays [11] for these architectures
have been tabulated, and a graphical comparison is also included wherever
necessary.
The breakdown of individual component areas for a 1k-packet size FIFO with
a SRAM memory core, for varying numbers of locations and expressed in λ², is
as follows:
a) 16-location FIFO:
Area of memory core = 26214400 λ².
Total areas of decoders (one on write and one on read side) = 3500590.04 λ².
Total areas of ring-pointer (one on write and one on read side) = 1935360 λ².
Area of counters = 128064 λ².
Area of subtractors = 184320 λ².
Area of sampling flip-flops = 222336 λ².
Area of full and empty flag generators = 27000 λ².
Area of sense amplifiers = 2727936 λ².
Total area of the setup = 3.30 * 10^7 λ² (with the decoder addressing scheme).
Total area of the setup = 3.14 * 10^7 λ² (with the ring-pointer addressing
scheme).
b) 32-location FIFO:
Area of memory core = 53660672 λ².
Total areas of decoders (one on write and one on read side) = 5842298.9 λ².
Total areas of ring-pointer (one on write and one on read side) = 3870720 λ².
Area of counters = 155856 λ².
Area of subtractors = 229638 λ².
Area of sampling flip-flops = 277920 λ².
Area of full and empty flag generators = 54216 λ².
Area of sense amplifiers = 2727936 λ².
Total area of the setup = 6.29 * 10^7 λ² (with the decoder addressing scheme).
Total area of the setup = 6.09 * 10^7 λ² (with the ring-pointer addressing
scheme).
c) 64-location FIFO:
Area of memory core = 104857600 λ².
Total areas of decoders (one on write and one on read side) = 16139661.6 λ².
Total areas of ring-pointer (one on write and one on read side) = 7741440 λ².
Area of counters = 183648 λ².
Area of subtractors = 275718 λ².
Area of sampling flip-flops = 333504 λ².
Area of full and empty flag generators = 54216 λ².
Area of sense amplifiers = 2727936 λ².
Total area of the setup = 1.165 * 10^8 λ² (with the decoder addressing
scheme).
Total area of the setup = 1.161 * 10^8 λ² (with the ring-pointer addressing
scheme).
d) 128-location FIFO:
Area of memory core = 209715200 λ².
Total areas of decoders (one on write and one on read side) = 18454229.4 λ².
Total areas of ring-pointer (one each on write and read side) = 15482880 λ².
Area of counters = 211440 λ².
Area of subtractors = 321798 λ².
Area of sampling flip-flops = 389088 λ².
Area of full and empty flag generators = 54216 λ².
Area of sense amplifiers = 2727936 λ².
Total area of the setup = 2.31 * 10^8 λ² (with the decoder addressing scheme).
Total area of the setup = 2.28 * 10^8 λ² (with the ring-pointer addressing
scheme).
The breakdown of individual component areas for a 16-location FIFO with a
SRAM memory core for varying packet sizes is as follows:
a) 128-bit packets:
Area of memory core = 3276800 λ².
Total areas of decoders (one on write and one on read side) = 500454.72 λ².
Total areas of ring-pointer (one each on write and read side) = 345003.84 λ².
Area of counters = 128064 λ².
Area of subtractors = 184320 λ².
Area of sampling flip-flops = 222336 λ².
Area of full and empty flag generators = 27000 λ².
Area of sense amplifiers = 340992 λ².
Total area of the setup = 4.67 * 10^6 λ² (with the decoder addressing scheme).
Total area of the setup = 4.52 * 10^6 λ² (with the ring-pointer addressing
scheme).
b) 256-bit packets:
Area of memory core = 6553600 λ².
Total areas of decoders (one on write and one on read side) = 780240.96 λ².
Total areas of ring-pointer (one each on write and read side) = 441135.36 λ².
Area of counters = 128064 λ².
Area of subtractors = 184320 λ².
Area of sampling flip-flops = 222336 λ².
Area of full and empty flag generators = 27000 λ².
Area of sense amplifiers = 681984 λ².
Total area of the setup = 8.57 * 10^6 λ² (with the decoder addressing scheme).
Total area of the setup = 8.23 * 10^6 λ² (with the ring-pointer addressing
scheme).
c) 512-bit packets:
Area of memory core = 13107200 λ².
Total areas of decoders (one on write and one on read side) = 1643407.68 λ².
Total areas of ring-pointer (one each on write and read side) = 1150590.72 λ².
Area of counters = 128064 λ².
Area of subtractors = 184320 λ².
Area of sampling flip-flops = 222336 λ².
Area of full and empty flag generators = 27000 λ².
Area of sense amplifiers = 1363968 λ².
Total area of the setup = 1.66 * 10^7 λ² (with the decoder addressing scheme).
Total area of the setup = 1.61 * 10^7 λ² (with the ring-pointer addressing
scheme).
The breakdown of individual component areas for a 1k-packet size FIFO with
a DRAM memory core for varying numbers of locations is as follows:
a) 16-location FIFO:
Area of memory core = 6553600 λ².
Total area of decoder (one on refresh side) = 1047321.04 λ².
Total areas of ring-pointer (one each on write and read side) = 1262465.28 λ².
Area of counters = 128064 λ².
Area of subtractors = 184320 λ².
Area of sampling flip-flops = 222336 λ².
Area of full and empty flag generators = 27000 λ².
Area of sense amplifiers = 6414336 λ².
Area of refresh circuitry = 254562 λ².
Total area of the setup = 1.60 * 10^7 λ².
b) 32-location FIFO:
Area of memory core = 13107200 λ².
Total area of decoder (one on refresh side) = 1041533.85 λ².
Total areas of ring-pointer (one each on write and read side) = 2524930.56 λ².
Area of counters = 155856 λ².
Area of subtractors = 229638 λ².
Area of sampling flip-flops = 277920 λ².
Area of full and empty flag generators = 54216 λ².
Area of sense amplifiers = 6414336 λ².
Area of refresh circuitry = 254562 λ².
Total area of the setup = 2.40 * 10^7 λ².
c) 64-location FIFO:
Area of memory core = 26214400 λ².
Total area of decoder (one on refresh side) = 1729577.43 λ².
Total areas of ring-pointer (one each on write and read side) = 5049861.12 λ².
Area of counters = 183648 λ².
Area of subtractors = 275718 λ².
Area of sampling flip-flops = 333504 λ².
Area of full and empty flag generators = 54216 λ².
Area of sense amplifiers = 6414336 λ².
Area of refresh circuitry = 254562 λ².
Total area of the setup = 4.05 * 10^7 λ².
d) 128-location FIFO:
Area of memory core = 52428800 λ².
Total area of decoder (one on refresh side) = 3711647.92 λ².
Total areas of ring-pointer (one each on write and read side) = 10099722.24 λ².
Area of counters = 211440 λ².
Area of subtractors = 321798 λ².
Area of sampling flip-flops = 389088 λ².
Area of full and empty flag generators = 54216 λ².
Area of sense amplifiers = 6414336 λ².
Area of refresh circuitry = 254562 λ².
Total area of the setup = 7.38 * 10^7 λ².
The breakdown of individual component areas for a 16-location FIFO with a
DRAM memory core for varying packet sizes is as follows:
a) 128-bit packets:
Area of memory core = 819200 λ².
Total area of decoder (one on refresh side) = 244423.87 λ².
Total areas of ring-pointer (one each on write and read side) = 296271.36 λ².
Area of counters = 128064 λ².
Area of subtractors = 184320 λ².
Area of sampling flip-flops = 222336 λ².
Area of full and empty flag generators = 27000 λ².
Area of sense amplifiers = 801792 λ².
Area of refresh circuitry = 254562 λ².
Total area of the setup = 2.97 * 10^6 λ².
b) 256-bit packets:
Area of memory core = 1638400 λ².
Total area of decoder (one on refresh side) = 300940.71 λ².
Total areas of ring-pointer (one each on write and read side) = 362760.96 λ².
Area of counters = 128064 λ².
Area of subtractors = 184320 λ².
Area of sampling flip-flops = 222336 λ².
Area of full and empty flag generators = 27000 λ².
Area of sense amplifiers = 1603584 λ².
Area of refresh circuitry = 254562 λ².
Total area of the setup = 4.51 * 10^6 λ².
c) 512-bit packets:
Area of memory core = 3276800 λ².
Total area of decoder (one on refresh side) = 754866.83 λ².
Total areas of ring-pointer (one each on write and read side) = 909934.08 λ².
Area of counters = 128064 λ².
Area of subtractors = 184320 λ².
Area of sampling flip-flops = 222336 λ².
Area of full and empty flag generators = 27000 λ².
Area of sense amplifiers = 3207168 λ².
Area of refresh circuitry = 254562 λ².
Total area of the setup = 8.96 * 10^6 λ².
We also compared the three-port DRAM FIFO buffer with a dual-port DRAM FIFO
buffer and a dual-port SRAM FIFO buffer. Adding an extra transistor to the
dual-port DRAM cell may seem impractical. Our study, however, indicated that
the simplified refresh scheme used in this work greatly reduced the area of
the overall memory core when compared with a dual-port DRAM of comparable
performance [5] (note that the processing technology considered in that work
may be different). The following charts were obtained as a result of this
study; they are presented here to give the reader an idea of how much
variation to expect when incorporating our memory core in the FIFO.
Figure 25: Areas of 3-port DRAM and 2-port DRAM (64-location, 256-bit packet
memory cores).
Figure 26: Areas of 3-port DRAM and 2-port SRAM (64-location, 1024-bit packet
memory cores).
The delays indicated below are the read-cycle delays obtained for memory
cores of the same size [11].
Figure 27: Area-delay comparisons of 3-port DRAM and 2-port DRAM
(64-location, 256-bit packet memory cores).
Figure 28: Area-delay comparisons of 3-port DRAM and 2-port SRAM
(64-location, 1024-bit packet memory cores).
A tabular representation of the results obtained through this study is
presented below, together with graphical representations.
S.No   Number of locations   Area using 0.15 µm technology (mm²)
                             SRAM     DRAM
1.     16                    0.707    0.3621
2.     32                    1.371    0.5413
3.     64                    2.613    0.9114
4.     128                   5.150    1.67
Table 1: Areas of 1k packet size FIFOs employing SRAM and DRAM memory cores
for varying number of locations
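The mm² figures in Table 1 appear to follow directly from the λ² totals listed earlier, taking λ equal to the quoted 0.15 µm process dimension; the short sketch below shows that conversion under this assumption.

```python
# Sketch of the lambda^2 -> mm^2 conversion apparently behind Table 1,
# assuming lambda equals the quoted 0.15-micron process dimension.
LAMBDA_UM = 0.15

def lambda2_to_mm2(area_lambda2, lam_um=LAMBDA_UM):
    return area_lambda2 * (lam_um ** 2) * 1e-6   # um^2 -> mm^2

print(lambda2_to_mm2(3.14e7))  # ~0.71 mm^2, cf. the 0.707 SRAM entry (16 locations)
print(lambda2_to_mm2(1.60e7))  # 0.36 mm^2, cf. the 0.3621 DRAM entry (16 locations)
```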
Figure 29: Number of locations vs. area for 1k packet size SRAM and DRAM
memory core FIFOs.
S.No   Packet size (bits)   Area using 0.15 µm technology (mm²)
                            SRAM      DRAM
1.     128                  0.1018    0.067
2.     256                  0.1853    0.1015
3.     512                  0.3641    0.2017
4.     1024                 0.707     0.3621
Table 2: Areas of 16-location FIFOs employing SRAM and DRAM memory cores for
varying packet sizes
Figure 30: Packet size vs. area for 16-location SRAM and DRAM memory core
FIFOs.
S.No   Process technology (µm)   Area (mm²)
                                 SRAM      DRAM
1.     0.09                      0.2545    0.1303
2.     0.13                      0.531     0.2719
3.     0.15                      0.707     0.3621
4.     0.18                      1.018     0.5214
Table 3: Areas of 1k packet size 16-location FIFOs employing SRAM and DRAM
memory cores for various process technologies
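The entries in Table 3 appear consistent with simple λ² scaling of the 0.15 µm areas to the other nodes; the sketch below reproduces them under that assumption.

```python
# Sketch of the lambda^2 scaling that Table 3 appears to follow: area at a
# node is the 0.15-micron area scaled by the square of the node ratio.
def scale_area(area_mm2_at_015, node_um):
    return area_mm2_at_015 * (node_um / 0.15) ** 2

for node in (0.09, 0.13, 0.15, 0.18):
    sram = scale_area(0.707, node)     # 0.15-micron SRAM FIFO area from Table 3
    dram = scale_area(0.3621, node)    # 0.15-micron DRAM FIFO area from Table 3
    print(node, round(sram, 4), round(dram, 4))
# e.g. 0.09 um -> about 0.2545 / 0.1304 mm^2; 0.18 um -> about 1.018 / 0.5214 mm^2
```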
Figure 31: Process technology vs. area for 1k packet size 16-location SRAM
and DRAM memory core FIFOs.
The delay values obtained by using each of these memory cores are presented
below.
S.No   Number of locations   Write-cycle delay (ns)
                             SRAM     DRAM
1.     16                    0.54     2.53
2.     32                    0.88     2.78
3.     64                    1.53     3.28
4.     128                   2.84     4.58
Table 4: Write-cycle delay of 1k packet size FIFOs employing SRAM and DRAM
memory cores and the ring-pointer addressing scheme for varying number of
locations
Figure 32: Number of locations vs. write-cycle delay using the ring-pointer
addressing scheme.
S.No   Number of locations   Read-cycle delay (ns)
                             SRAM     DRAM
1.     16                    2.83     5.06
2.     32                    2.87     5.09
3.     64                    2.9      5.11
4.     128                   3.08     5.13
Table 5: Read-cycle delay of 1k packet size FIFOs employing SRAM and DRAM
memory cores and the ring-pointer addressing scheme for varying number of
locations
Figure 33: Number of locations vs. read-cycle delay using the ring-pointer
addressing scheme.
The results shown in Table 6 compare the decoder scheme of addressing with
the ring-pointer scheme. Although they do not provide an overall picture of
the entire FIFO, these numbers have been provided to justify our choice of
the ring-pointer addressing scheme for the three-port DRAM memory core.
S.No   Number of locations   Area using 0.15 µm technology (mm²)
                             Decoder   Ring-pointer
1.     16                    0.0787    0.0435
2.     32                    0.1314    0.0870
3.     64                    0.3631    0.1741
4.     128                   0.4152    0.3483
Table 6: Areas of the decoder and ring-pointer addressing schemes
Figure 34: Number of locations vs. area for the decoder and ring-pointer
addressing schemes.
Chapter 8:
Future work.
This thesis is part of a larger project, NetChip, a network-on-chip (NOC)
approach intended for high-speed data processing, headed by Dr. Alice Parker
at the University of Southern California. The on-chip FIFO buffers discussed
in this work are intended for use in this chip. On-chip memory devices enable
throughput capabilities in the gigabit-per-second range; they are extremely
compact and dissipate minimal power. New leading-edge applications ranging
from games to networking infrastructure equipment are driving the need to
include large-capacity and high-speed memory on-chip rather than separately
as a discrete device [8]. Until recently, embedded SRAMs were widely used as
on-chip buffers, because manufacturing embedded DRAMs has not been easy. A
DRAM process, for example, needs temperatures above 1,000 °C to form the
capacitor, about twice the temperature threshold of a typical logic process.
Exposing the transistors to these temperatures can degrade the performance
of the logic considerably. Chip makers have tried to mitigate this problem by
depositing thick layers of oxide over the transistors, but this process is
costly and tends to reduce wafer yields. However, new breakthroughs in this
area by NEC and Samsung have made the future look a lot brighter for DRAMs.
NEC has
developed a special DRAM capacitor [1] using two layers of metal separated
by a low-k dielectric material. The tungsten used to build the capacitor is the
same metal that forms vias in a logic process. This obviates the need for
polysilicon, which is normally used for DRAM cells, allowing the DRAM
cell formation to mesh with a standard CMOS logic process.
Samsung Electronics, on the other hand, announced that it has developed the
industry's first "CVD aluminum" process technology, a 70-nanometer-node DRAM
process technology employing the chemical vapor deposition (CVD) method [28].
The CVD aluminum process involves forming conducting films by turning a
metal-organic source, including aluminum, into particles through chemical
reactions; these particles are then deposited on the wafer surface using
suitable processing techniques. Existing DRAM circuit-wiring processes have
employed the physical vapor deposition (PVD) method, in which thin films are
formed by turning solid-state materials into particles. Because of the
problem of "voids", in which the deposition is not made evenly on the wafer
surface, the PVD method has been difficult to apply to processes at 90
nanometers or below.
However, if the CVD aluminum process technology is employed, not only is the
problem of voids addressed, but the electrical properties of the circuit
wiring are also dramatically improved, making it an essential process
technology. This technology is believed to start the trend towards
70-nanometer DRAMs. Furthermore, analysis shows that the CVD aluminum process
technology would reduce circuit-wiring process costs by up to approximately
20 percent, as it does not require the planarization (etch-back) and cleaning
steps that have been required until now in the conventional DRAM
circuit-wiring process. This technology will help extend the applications of
the DRAM module from the PC to mobile and consumer electronics products in
the very near future. These advancements in process technologies motivated
us to consider the possibility of replacing a dual-port SRAM with a DRAM of
comparable if not equal performance.
In this SoC era, 50 percent or more of the die will consist of embedded
memory, and embedded DRAMs are slowly becoming the standard.
Significant research is being done in this area. NetChip could be used in next-
generation network processors. Internet speed is doubling every six to nine
months, faster than Moore's Law. Dedicated processors are now required to
handle the data transmitted by high-speed links. Network processors were
invented to bridge the gap between link rates and processing speeds.
Embedded DRAM is a highly desirable choice for many types of memory in
network processors because it allows designers to extend the memory on-chip
to provide new levels of performance and it reduces the total number of chips
on board, resulting in lower power dissipation and higher overall
performance. Moreover, embedded DRAMs are capable of meeting very high
bandwidth requirements by making use of wide buses of 512 bits or more.
Next generation network processors require four basic types of memory:
packet memory, header memory, routing-table lookup memory and program
memory [7]. Packet memory is a data buffer that contains the payload
information to be transmitted. This data streams in and out of the processor,
and so it is very suitable for DRAM with a fast page mode for sequential
accesses. It is usually implemented with off-chip SDRAM or RDRAM, since
packet memory tends to be large, on the order of 32 to 256 Mbytes. We may
be able to bring this block onto the surface of the chip by using advanced
architectures and process-specific design. Packet memory connects directly to
the data stream and hence it must have high bandwidth. Packet memory also
requires at least two ports so that it can be connected to the data stream and
yet still accessed by the network processor when required.
There are trade-offs to be made with the packet buffer size to optimize the
system efficiency. The minimum packet size should be about equal to the
buffer or segment size [26]. If the packet buffer size is too large, the memory
efficiency is poor for minimum-size packets; if it is too small, then multiple
buffers have to be used to hold most packets. That issue affects not only the
memory efficiency but also the bandwidth requirement, since cycles used to
write to unused memory locations are also lost. Another trade-off involves
the size of the packet buffer and the speed of decision making. If forwarding
decisions can be made quickly, then a smaller packet buffer can be used and
it can be more easily placed on-chip to save power and area.
Header memory is used to store copies of the header information associated
with the packets. It is written in bursts and read randomly in small packets.
The processors read the header memory to obtain information about the
packet and determine the appropriate method of processing the data. It is
usually 256 Kbytes (2 Mbits) and has low permanency, since it is associated
with the packets.
Lookup memory contains the routing table and is sometimes implemented as
content-addressable memory (CAM). It is typically quite large and is often
implemented off chip. The size of this table can have a direct relationship to
performance. Hence this is another area where there is scope for
optimization. The data in this memory is relatively stable and it is written
through a low-speed update port, but it must be read very rapidly and
typically requires random access.
Program memory is used by the network processors to store their program
code. It has high permanency and is often specified by users as having five-
year data retention while powered. The program memory is typically 4k to 8k
words but the processor word may be 32 to 64 bits, resulting in a range of
128 to 512 kbits. The next generation of network processors is associated
with much larger program store, perhaps eight to 16 times that size.
The program and lookup memory have high permanency, and many
manufacturers want the contents to be guaranteed for a worst-case life of five
years. Recent data suggests that SRAM demonstrates soft-error failure rates
on the order of 1,000 FITs/megabit. This suggests that embedded DRAMs
will take the place of SRAMs in most of tomorrow’s packet processors.
The header and packet memory have very low permanency. In many
applications it may be possible to eliminate the refresh requirement. The
residence time of the data is so short that the probability of soft errors on any
particular packet is extremely low. Therefore SRAMs may be retained
provided they satisfy the area and power constraints. Embedded DRAMs can
be implemented with very wide data buses so that memory accesses are
performed on hundreds or thousands of bits at a time.
Header memory requirements are racing ahead of embedded technology, and
they may remain off chip. Lookup memory can be built using DRAM CAM
for significant gains in area and power. This technology has been pioneered
by Mosaid for standalone 2-Mbit CAMs, but next-generation devices will
contain at least 18 Mbits and will be difficult to integrate economically.
Program memory requires dual-port memory with one high-speed random-
access port connected to the processor and a relatively low-speed port for
program updates. The memory would include refresh, which would also
perform ECC invisibly and would far exceed the five-year data retention
specification that is required by most manufacturers.
There is scope for improvement in each of the above memory blocks that will
be in use in tomorrow’s high-speed data processing chips. Embedded
DRAMs are in the introduction phase. The complexity of embedded DRAM
prevents manufacturers from developing and marketing advancements
quickly [29]. Embedded DRAM capacitors still require complex processing
steps that are not required for embedded SRAMs. Thus, high-speed on-chip
memories will be an area that is intensely researched for at least the next
decade.
Bibliography:
[1] "NEC, Mosys push bounds of embedded DRAM", Anthony Catalado, EE Times,
December 16, 2002.
[2] "SRAM soft errors cause hard network problems", Anthony Catalado, EE
Times, August 17, 2001.
[3] "Simulation and Synthesis Techniques for Asynchronous FIFO Design",
Clifford E. Cummings, Sunburst Design Inc., Rev 1.1, April 2002.
[4] "The effect of threshold voltages on the soft error rate", V. Degalahal,
R. Ramanarayanan, N. Vijaykrishnan, Y. Xie and M. J. Irwin, 5th International
Symposium on Quality Electronic Design, March 22-24, 2004, pages 503-508.
[5] "Asynchronous DRAM design and synthesis", V. N. Ekanayake, R. Manohar,
Ninth International Symposium on Asynchronous Circuits and Systems, 12-15 May
2003, pages 174-183.
[6] "Soft errors, a problem as SRAM geometries shrink", Jeanne Graham,
Electronics Supply and Manufacturing, January 28, 2002.
[7] "Embedded DRAM Has a Home in the Network Processing World", Gord Harling,
http://www.eedesign.com/isd/OEG20010803S0026.
[8] "New Embedded DRAM Solutions for High-Performance SoCs", Hideya Horikawa
and Hamid Aslam, http://www.us.design-reuse.com/articles/article3500.html.
[9] "Analysis of a memory architecture for fast packet buffers", Sundar Iyer,
Ramana Rao Kompella, Nick McKeown, IEEE Workshop on High Performance
Switching and Routing, 29-31 May 2001, pages 368-373.
[10] Course notes for VLSI Design, Professor Won Namgoong, Department of
Electrical Engineering-Systems, University of Southern California.
[11] Master's thesis, "Performance Issues in Network on Chip FIFO Queues",
Aniket Kadkol, Department of Electrical Engineering-Systems, University of
Southern California, May 2004.
[12] "Trends in semiconductor memories", Y. Katayama, IEEE Micro, Volume 17,
Issue 6, Nov.-Dec. 1997, pages 10-17.
[13] "(Ba,Sr)TiO3 dielectrics for future stacked-capacitor DRAMs",
D. E. Kotecki, J. D. Baniecki, H. Shen, R. B. Laibowitz, K. L. Saenger,
J. J. Lian, T. M. Shaw, S. D. Athavale, C. Cabral, Jr., P. R. Duncombe,
M. Gutsche, G. Kunkel, Y.-J. Park, Y.-Y. Wang and R. Wise, IBM Journal of
Research and Development, Volume 43, Number 3, page 367, 1999.
[14] "FIFO design techniques", EE560 course notes, Prof. Gandhi Puwada,
Department of Electrical Engineering-Systems, University of Southern
California.
[15] "Transparent-Refresh DRAM (TReD) using Dual-Port DRAM cell", Takayasu
Sakurai, Kazutaka Nogami, Kazuhiro Sawada and Tetsuya Iizuka, IEEE Custom
Integrated Circuits Conference, 16-19 May 1988, pages 4.3/1-4.3/4.
[16] "Modelling the effects of technology trends on soft-error rate of
combinational logic", technical report by Premkishore Shivakumar, Michael
Kistler, Stephen W. Keckler, Doug Burger, Lorenzo Alvisi, Department of
Computer Sciences, University of Texas at Austin.
[17] "The SARAM (sequential access and random access memory), a new kind of
dual-port memory for communications now and beyond", J. C. Smith,
WESCON/'93 Conference Record, 28-30 Sept. 1993, pages 571-579.
[18] "Logical Effort: Designing for speed on the back of an envelope", Ivan
E. Sutherland, Robert F. Sproull, Conference on Advanced Research in VLSI,
Santa Cruz, California, C. H. Sequin, ed., MIT Press, 1991, pages 1-16.
[19] "An analytical access time model for on-chip cache memories", T. Wada,
S. Rajan, S. A. Przybylski, IEEE Journal of Solid-State Circuits, Volume 27,
Issue 8, Aug. 1992, pages 1147-1156.
[20] "A low-voltage low-power ring pointer for FIFO memory design", technical
report by Haibo Wang and Sarma B. K. Vrudhula, Center for Low-Power
Electronics, University of Arizona, Tucson.
[21] "New Features in Synchronous FIFOs", D. Wyland, WESCON/'93 Conference
Record, 28-30 Sept. 1993, pages 580-585.
[22] "Low-latency asynchronous FIFO buffers", J. T. Yantchev, C. G. Huang,
M. B. Josephs, I. M. Nedelchev, Second Working Conference on Asynchronous
Design Methodologies, 30-31 May 1995, pages 24-31.
[23] White paper on "The challenge for next generation network processors",
Agere Inc., September 10, 1999.
[24] http://bwrc.eecs.berkeley.edu/Classes/ICDesign/EE141_f03.
[25] Technical note on "Various methods of DRAM refresh" (TN-04-30), Micron
Technology.
[26] Current states and trends in memory IP,
http://www.mosysinc.com/files/pdf/ipjapan.pdf.
[27] Paper on logical effort located at http://research.sun.com,
async/Publications/KPDisclosed/LogEfftSlides/letalk.pdf.
[28] CVD processing technologies for DRAMs,
http://www.samsung.com/Products/Semiconductor/News/DRAM.
[29] Resources on embedded DRAM technology,
http://smithsonianchips.si.edu/ice/cd/MEM96/SEC12.pdf.