SYNCHRONIZATION AND TIMING TECHNIQUES
BASED ON STATISTICAL RANDOM SAMPLING
by
Rashed Zafar Bhatti
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER ENGINEERING)
December 2007
Copyright 2007 Rashed Zafar Bhatti
Dedication
This thesis is dedicated to my parents and wife Sadaf who provided unconditional love
and support throughout the course of my PhD studies and research.
Acknowledgements
I would like to present my profound gratitude to my PhD advisor, Dr. Jeffrey
Draper, who kindly undertook the supervision of my research. Despite his
heavy commitments and extremely busy research schedule, he provided consistent
technical guidance and support throughout the course of my research.
I am also grateful to the Cray Award recipient Dr. Monty Denneau, from the IBM T. J.
Watson Research Center, who provided me an opportunity to work under his able
guidance on one of the most prestigious national-level research projects,
BlueGene/Cyclops-64. This research is a byproduct of the Cyclops-64 project.
I would like to express my sincere thanks to Dr. Keith Chugg, Dr. Alice Parker, Dr.
Victor Prasanna and Dr. Aiichiro Nakano who kindly consented to be on my PhD
guidance committee and helped me to shape this research work.
Table of Contents
Dedication.........................................................................................................................ii
Acknowledgements..........................................................................................................iii
List of Figures..................................................................................................................vi
Abstract............................................................................................................................ix
Chapter 1: Introduction...............................................................................................1
1.1 Motivational Background .................................................................................1
1.2 The Genesis of the Proposed Technique...........................................................2
1.2.1 Scope of the Theory..................................................................................3
1.3 Introduction of Proposed Technique.................................................................4
1.4 Limitations of Conventional Approaches .........................................................5
1.5 Organization......................................................................................................6
Chapter 2: Theoretical Framework.............................................................................9
2.1 Random Sampling.............................................................................................9
2.2 Random Sampling Theory Applied to Digital VLSI Signals .........................10
2.2.1 Quantification Sample Size.....................................................................11
2.2.2 Relation of Sample Size with Accuracy and Confidence .......................14
2.3 Characterization of Random Clock.................................................................15
2.3.1 Ideal Random Clock and Periodically Repeating Interval......................16
2.3.2 Random Clock Relative to Signal under Measurement..........................17
2.3.3 Bounds of the Average Frequency of Random Clock ............................23
Chapter 3: Circuit Level Implementations ...............................................................25
3.1 Duty Cycle Measurement and Correction.......................................................25
3.1.2 Survey of Duty Cycle Corrector (DCC) Circuits....................................27
3.1.3 Implementation of Proposed DCC..........................................................30
3.1.4 Random Sampling Unit (RSU) for DCC ................................................32
3.1.5 Experimental Setup and Results .............................................................33
3.2 Relative Phase Measurement ..........................................................................35
3.2.2 Conventional Ways to Tackle the Problem ............................................37
3.2.3 Extension of the Technique to Simultaneous Random Sampling...........38
3.2.4 Implementation of Circuit Design ..........................................................39
3.2.5 Random Sampling Unit (RSU) ...............................................................40
3.2.6 Experimental Setup and Results .............................................................42
3.3 Random Event Generation..............................................................................44
3.3.1 Pseudo Random Number Generators......................................................46
3.3.2 Proposed Pseudo Random Clock Generator (PRCG).............................49
3.3.3 Simulation Methodology and Results.....................................................54
3.3.4 Deductions ..............................................................................................57
Chapter 4: Case Studies VLSI Applications.............................................................58
4.1 Proposed Techniques Applied to SerDes design ............................................58
4.1.1 Significance of SerDes Systems .............................................................59
4.1.2 Conventional SerDes Designs.................................................................60
4.1.3 Design Conception..................................................................................62
4.1.4 Implementation of the Proposed SerDes Design ....................................63
4.1.5 Transmitter Side (Serializer)...................................................................64
4.1.6 Receiver side (Deserializer)....................................................................68
4.1.7 Analysis of the Design............................................................................70
4.1.8 Experimental Results ..............................................................................72
4.2 Proposed Technique Applied to DDR/DDR2 Bus Interface Timing..............74
4.2.1 Background of the DDR/DDR2 SDRAMS ............................................75
4.2.2 Implementation .......................................................................................79
4.2.3 Analysis of the Overall Design...............................................................83
4.2.4 Verification Methodology and Results ...................................................84
4.2.5 Deductions ..............................................................................................85
Chapter 5: Proposed Technique with Scaling Technologies ....................................86
5.1 Technology dependency .................................................................................87
5.1.1 Measurability ..........................................................................................87
5.1.2 Adjustability............................................................................................91
5.1.3 Deductions ..............................................................................................92
Chapter 6: The Scope, Limitations and Extensions..................................................93
6.1 Scope of the Proposed Synchronization and Timing Technique....................93
6.2 Limitations of the Proposed Technique ..........................................................95
6.2.1 Measurement Time .................................................................................95
6.2.2 Static Compensation ...............................................................................96
6.2.3 Continuous Monitoring...........................................................................96
6.2.4 Controller Logic or Software..................................................................97
6.2.5 Convergence Speed.................................................................................97
6.3 More Applications of the proposed technique................................................98
6.4 Possible Extension of the Proposed Research ................................................99
Conclusion ....................................................................................................................100
Bibliography .................................................................................................................101
List of Figures
Figure 1. Conceptual Block Diagram of the Proposed Theory.........................................3
Figure 2. A periodic signal under observation................................................................10
Figure 3. Area under the Gaussian distribution curves within the confidence interval..13
Figure 4. Sample size versus Confidence Level (at p=0.5)............................................15
Figure 5. Random edges relative to the signal under measurement ...............................17
Figure 6. Fixed frequency sampling clock beating with the signal ................................19
Figure 7. Sample Points distributed within the τ_sig interval ............................................19
Figure 8. SMD based all digital DCC.............................................................................29
Figure 9. The conceptual diagram of proposed DCC .....................................................30
Figure 10. Timing diagram of Stretching and Chopping................................................31
Figure 11. Implementation for random sampling unit....................................................32
Figure 12. Maximum Observed Error Normalized by Expected Error ..........................34
Figure 13. Two signals under relative phase measurement ............................................39
Figure 14. Digital System with Closed Loop Signaling Convention..............................40
Figure 15. Random Sampling Unit (RSU) for Relative Phase Measurement.................41
Figure 16. Maximum Observed Error Normalized by Expected Error. .........................44
Figure 17. Three Bit Fibonacci LFSR configurations ....................................................47
Figure 18. Three Bit Galois LFSR configurations..........................................................48
Figure 19. A basic building block of CA50745 cellular automata .................................48
Figure 20. Conceptual Blocks of proposed PRCG ........................................................50
Figure 21. Ring Slice of proposed PRCG.......................................................................51
Figure 22. Delay Blender of PRCG ................................................................................51
Figure 23. Sample point distribution of a frequency mix ...............................................53
Figure 24. Modified Random Sampling Unit (RSU)......................................................54
Figure 25. Relative Δ_r Distribution of the PRCG under test...........................................55
Figure 26. Standard Deviation of Sample Point Probabilities ........................................56
Figure 27. Effective Sample Points Resolution Δr-effective (ps)...................................56
Figure 28. A Conventional SerDes System. ...................................................................61
Figure 29. Tx-bit unit: A 4-to-1 serializer circuit module. .............................................65
Figure 30. Timing diagram of Tx-bit unit.......................................................................65
Figure 31. Duty Cycle Corrector (DCC) and Phase Generator. .....................................66
Figure 32. Combined Random Sampling Unit (RSU)....................................................67
Figure 33. Tx-Byte unit...................................................................................................68
Figure 34. Rx-bit unit (1-to-4 De-serializer) with Ring Buffer. .....................................69
Figure 35. Rx byte unit ...................................................................................................70
Figure 36. Timing diagram of 1-to-4 de-serializer (Rx-Bit unit)....................................71
Figure 37. Eye diagram of simulated results of a SerDes link........................................73
Figure 38. A typical read and write cycle timing of DDR/DDR2. .................................75
Figure 39. Source synchronous I/O channel...................................................................77
Figure 40. Source synchronous write transaction timing................................................78
Figure 41. Block diagram of DCC circuit.......................................................................81
Figure 42. Duty Cycle Correction...................................................................................81
Figure 43. Block diagram of receiver strobe timing circuit............................................82
Figure 44. Delayed strobe timing....................................................................................82
Figure 45. Setup and Hold Time Violation Regions.......................................................89
Figure 46. Minimum Detectable Pulse width .................................................................91
Abstract
The rapid scaling of silicon technologies over the past decade has introduced
some arduous constraints for design engineers. The technology progression has
exacerbated the power problem whereas the rapidity of scaling has enormously reduced
the time-to-market. Standard cell and FPGA based technologies have emerged as the
best approaches to achieve reduced time-to-market, but these technologies almost
eradicate the possibility of using custom designed components. In the given scenario
many timing and synchronization problems are reborn, requiring fresh solutions to fit in
this new circuit design paradigm. In this research, a hypothesis derived from statistical
estimation is proposed that forms the basis of a new circuit design methodology. The
proposed technique addresses some synchronization problems by applying statistical
random sampling to high-speed digital signals. Through this technique, timing
parameters like pulse width, duty cycle, and clock and data skew can very accurately be
measured and adjusted. The proposed technique provides a new way to tackle some
classical VLSI problems with considerably reduced circuit complexity, which in turn
makes the overall design area, power and design-time efficient. The proposed circuits do
not require custom designed components, which makes them reusable and portable to
most standard cell or FPGA technologies. A Serializer/Deserializer (SerDes) system
using the proposed circuit design approach exhibits a 2.5 times improvement in power
dissipation compared to a typical conventional design, with a 60% smaller area requirement.
Chapter 1: Introduction
1.1 Motivational Background
Timing uncertainty issues with control signal pulse width, clock duty cycles,
relative phase of multiple clocks, clock skew with respect to data and relative skew in
parallel data lines are classical synchronization problems of electronic systems. In every
era of digital design, engineers had some solutions for these problems that performed
within the technological constraints of the time. As technology kept scaling in
accordance with Moore's law [46], circuit integration reached a limit where
interconnection delays started to dominate gate delays, making the timing issues
even worse. Today's highly scaled deep submicron technologies allow ultra large
scale integration (ULSI) and enable systems to operate at multi-gigahertz speeds, but
leakage currents have increased considerably, making power management a principal
design parameter. Moreover, post-fabrication timing uncertainties due to process
variation have also increased considerably. These variations demand some adaptability in
circuit design techniques to mitigate the effect of the resulting timing uncertainties and
let the system operate at its optimal speed. A commercial effect of rapidly scaling
technologies is a reduced time to market for any target product. The economic demands
of reduced time to market leave little margin for engineers to perform tedious and
time-consuming cycles of design and characterization of custom designed components. Many
efforts have been made to close the gap between ASIC and custom designed
components [16], but the use of custom designed components does negatively
affect the time to market of the final product. Hence, standard cell based ASIC
technologies have become the mainstream approach for digital circuits and systems
design. The given technological scenario indicates a need for a family of reduced
complexity circuits which are inherently area, power and design time efficient to face
the technological challenges and are easily portable across technologies to meet the
reduced time to market demands.
1.2 The Genesis of the Proposed Technique
In general the measurement of time duration of a naturally occurring phenomenon
is considered a trivial problem, but as the absolute duration of a phenomenon
approaches or becomes comparable to the smallest resolvable unit of time of the
employed measurement technology, the problem becomes nontrivial. This thesis is
based on the following general hypothesis to accurately observe and measure certain
timing parameters of on-chip high-speed signals:
“If any periodic or periodically excitable behavior of a signal, a group of signals or
a system can be binomially captured at a given instant of time, through some realizable
mechanism, then the average duration of that behavior is measurable through statistical
random sampling”.
The theoretical statement above opens up a new dimension of timing circuit and
system design in which a target parameter is converted to a binomial periodic behavior
which is then repeatedly captured at random instants of time to characterize the signal,
group of signals or system under observation. A sample of pre-calculated size is then
analyzed to accurately measure the desired parameter with the associated confidence.
Figure 1 shows the conceptual blocks of the design that implements the proposed
theory.
Figure 1. Conceptual Block Diagram of the Proposed Theory (blocks: Random Instant Generator R(t); Behavior Capturing Mechanism; Signal or System Under Observation; Test Framework to excite the behavior periodically; Event Counter 1, loaded with the desired accuracy and confidence in terms of sample size n; Event Counter 2, counting the number of times the target behavior is captured, X)
The test framework excites a system to evince a behavior periodically which
corresponds to a certain parameter of the system to be quantified. The “Event Counter
1” is loaded with a value n that corresponds to a desired accuracy and confidence in the
measurement. The counter is decremented at every event generated by “Random Event
Generator”. The “Event Counter 2” is initially reset to zero, and it is incremented at the
event generated by “Random Event Generator” only when the “Behavior Capturing
Mechanism” block senses the occurrence of the target behavior. Given the periodicity of
the target behavior, the proportion of the number of times the target behavior is captured to
the sample size can be used to compute the average duration of the target behavior.
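As an illustration only, the following Python sketch mimics the behavior of these conceptual blocks in software. The function name, the exponentially distributed gaps used as a stand-in for the random instant generator R(t), and the numeric values are assumptions made for this sketch, not part of the proposed hardware.

```python
import random

def measure_behavior_duration(behavior_at, period, sample_size, avg_gap):
    """Software mimic of the Figure 1 loop: Event Counter 1 counts down the
    pre-computed sample size n while Event Counter 2 counts how often the
    target behavior is captured at the random instants supplied by R(t)."""
    counter1 = sample_size           # loaded with n (desired accuracy/confidence)
    counter2 = 0                     # number of times the behavior is captured (X)
    t = 0.0
    while counter1 > 0:
        t += random.expovariate(1.0 / avg_gap)   # next random instant from R(t)
        if behavior_at(t % period):              # behavior capturing mechanism
            counter2 += 1
        counter1 -= 1
    # the proportion X/n times the period gives the average duration of the behavior
    return (counter2 / sample_size) * period

# example: a periodic "behavior" that is active for the first 0.6 ns of each 2 ns cycle
print(measure_behavior_duration(lambda t: t < 0.6, period=2.0,
                                sample_size=100_000, avg_gap=0.37))   # close to 0.6
```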
1.2.1 Scope of the Theory
Although the general form of the theory is applicable to any problem that can be
mapped to the conceptual blocks shown in Figure 1 above, the real scope, in which the
significance of this theory becomes most evident, is when the duration of the behavior
under observation is too small to be measured without involving some complex and
resource heavy techniques. The following common synchronization problems are some
examples where the proposed technique can be applied to observe the various signal
and system behaviors which are either difficult to observe with conventional
methodologies or typically involve resource heavy techniques:
• Pulse width of a high speed on-chip signal
• Measurement of duty cycle of a high speed on-chip signal
• Relative Phase measurement of high speed on-chip signal
• Rise time measurement of a digital signal
• Detecting overlapping of multiple signals
• Measuring the duration of the excited state of a system after a certain
stimulation
1.3 Introduction of Proposed Technique
This research introduces an idea to apply statistical random sampling techniques to
observe timing parameters of digital signals embedded deep inside VLSI chips. The
proposed technique evolves new strategies to handle synchronization and timing
problems which are area, power and design time efficient compared to their
conventional counterparts. The circuits designed using the proposed approach, to fix the
timing distortion problems, can easily be mapped to standard cell technologies without
requiring any custom designed components. The hypothesis proposed in this research,
based on the Law of Large Numbers [62], is applied to observe and manipulate the duty
cycle of a clock-like periodic signal and the relative phase of two or more periodic signals
as a proof of concept. The proposed methodology states:
“If the state of some periodic signal(s) under observation is captured repeatedly at
random instants of time and a large sample data of premeditated size is gathered then
accurate conclusions can be drawn about certain parameters of the signal(s) under
observation through statistical inferences”.
This research also propounds a circuit design approach for controllability of the
directly or indirectly observable parameters of digital signals through statistical random
sampling.
1.4 Limitations of Conventional Approaches
The proposed hypothesis is applied to two classical problems related to timing
uncertainties of digital signals: (1) Duty Cycle Correction and (2) Relative Phase
Adjustment. In this thesis the conventional methods to solve these problems are studied
and segregated into three categories for the purpose of survey: (1) pure analogue circuit
design approaches [53], (2) mixed-signal design approaches and (3) all-digital design
approaches. In general the former two design approaches involve resource-heavy
components like operational amplifiers (op amps) [54], charge pumps [75][34], phase
and frequency detectors (PFDs) [60][66], voltage-controlled oscillators (VCOs),
voltage-controlled delay lines [75], PLLs [13][17][48], DLLs [36], etc. Although these
components, if designed carefully, provide higher precision and speed in certain cases,
they make the overall design more area and power hungry, especially when
considerably many instances of the target adjustment circuits are replicated over a
single chip. Moreover, the circuit complexities involved in analogue and mixed-signal
circuits require a lengthy and tedious design flow, which negatively affects the time to
market. All-digital design approaches like [69] and [48] generally replace the
analogue components with digital counterparts that increase the design complexity.
Moreover, due to the nature of these problems, some custom designed digital
components are always used to compensate for the quantization issues, which
restricts the portability of these designs. Chapter 3 provides more elaborate and
problem-specific surveys of some of these known techniques for tackling the timing and
synchronization problems.
1.5 Organization
This thesis is organized as follows. Chapter 2 lays down the theoretical
framework of statistical random sampling applied to digital VLSI signals, which forms
the basis of the proposed synchronization and timing techniques. The successful
application of the proposed technique requires that an unbiased sample of data is
gathered at uniformly distributed random instants of times within the periodic cycles
under observation [62] [42] [30]. For this purpose a “Random Clock” is used, Chapter
2: also provides the mathematical groundwork to characterize a required random clock
for a given timing application. Chapter 3: looks into the issues involved in a circuit-
level implementation of the proposed theoretical concept especially on pure standard
cell or FPGA technologies which allow the proposed circuits to be portable across
rapidly scaling technologies. Section 3.1 provides a closer look of the typical VLSI
timing problem of duty cycle measurement and correction; it describes how the
6
proposed idea can be applied at circuit level to locally observe and manipulate the duty
cycle of a clock with considerably reduced circuit complexities. Section 3.2 extends the
idea to observe signals to handle yet another synchronization issue, in which the relative
phase of two digital signals is measured and adjusted using the proposed scheme. A
good random clock to facilitate an unbiased random sampling is indispensable for the
correct operation of the circuits proposed in Chapter 3. To address this issue, Section
3.3 describes a practical implementation technique of a pseudo-random clock generator
(PRCG). Although the proposed circuits do not require an on-chip source for a random
clock, the proposed standard cell based implementation of a PRCG allows integration of
the PRCG with the random sampling units (RSU) on a single die for standalone
applications. Chapter 4 describes two VLSI applications in which the proposed
timing and synchronization technique is successfully used to avoid resource heavy and
custom designed components and to reduce the circuit level complexities. Section 4.1
describes a complete SerDes system design as a case study in which the entire
synchronization is adapted to the proposed theme. The SerDes design using the
proposed technique provides an example of how circuit level complexities can be
reduced and overall design can be made area, power and design time efficient using the
proposed technique. A 2.5 times improvement in power dissipation compared to a
typical conventional design with a 60% less area requirement is successfully
demonstrated through the SerDes design. Section 4.2 describes yet another practical
application of the proposed synchronization and timing techniques to a high speed
source synchronous [1][2] DDR/DDR2 SDRAM I/O bus interface.
The statistical random sampling technique is applied to measure and correct the
duty cycle of the clock to produce source synchronous signals for write bus cycle and to
measure and correct the phase of the incoming strobe to correctly capture the data.
Chapter 5 presents the issues of the proposed synchronization and timing techniques
with scaling technologies for a typical design flow of an RSU for a given technology.
Chapter 6 briefly reviews the general scope and limitations of the proposed timing
techniques along with some possible future extensions of this research work.
Chapter 2: Theoretical Framework
The statistical random sampling technique of inferential statistics suggests that
instead of examining the entire group, called the population, which may be difficult or
impossible to do, a small part of the population, called a sample, can be examined. This
chapter is composed of two major sections. The first section explains how the survey
problem of inferential statistics, which draws conclusions about a certain characteristic of
a given population, is mapped to the problem of observing parameters of on-chip digital
signals. Some useful derivations and relations that come in handy in statistical random
sampling based circuit design are also explained in the first section. The second section
deals with the characterization of the random clock used for statistical random sampling of
VLSI signals; this clock provides edges at random instants of time to capture the state of the
digital VLSI signals under observation.
2.1 Random Sampling
A sampling procedure that assures that each element in the population has an equal
chance of being selected is referred to as simple random sampling. A sample is a subset
of a population. Since it is usually impractical to test every member of a population, a
sample from the population is typically the best approach available. Inferential statistics
generally require that sampling be random to make the sample as representative of the
population as possible by choosing the sample to resemble the population on the most
important characteristics [30].
2.2 Random Sampling Theory Applied to Digital VLSI Signals
The proposed technique exploits the well-known Law of Large Numbers [62] for
statistical estimation and repeatedly captures the state of the signal to be measured at
random instants of time. A large sample data of premeditated size is gathered that
provides the measurement results with an accuracy and confidence level as high as
desired. To explain the theory of statistical random sampling applied to digital signal
observation, this section considers a periodic signal of a known frequency (f) whose pulse
width is to be observed. In a single trial the state (high or low) of the signal is captured
as x_i at a random instant of time; Figure 2 shows an example situation.
Figure 2. A periodic signal under observation
It is clear that if the known time period of the periodic signal under measurement is
T = 1/f and the unknown pulse width is t_w, then the probability of capturing a high in a
single trial is given by the fraction of the time period for which the signal remains high,
p = t_w/T. If a sample {x_0, x_1, x_2, ..., x_n} of size n of such trials is gathered, then
x_0, x_1, x_2, ..., x_n are independent and identically distributed (iid) Bernoulli random
variables with probability distribution given by Pr(x_i = 1) = p and Pr(x_i = 0) = (1 - p) = q.
From the weak law of large numbers [62]:
\lim_{n \to \infty} P\left( \left| \frac{X}{n} - p \right| \geq \varepsilon \right) = 0        (Equation 1)
where X = x_0 + x_1 + x_2 + ... + x_n is the number of times logic high is captured
in n trials and ε is a very small positive number. The law says that for a large number
of trials the value of X/n will be very close to the probability p of getting a logic high in a
single trial, which provides a direct measure of the pulse width given the time period. A
stronger result follows from the strong law of large numbers:

\lim_{n \to \infty} \frac{X}{n} = p        (Equation 2)
This says that X/n converges to p except in a negligible number of cases. It remains to be
established how big the sample size n should be so that the result observed through random
sampling can be considered accurate enough to be relied upon in a digital system paradigm.
This is quantified in terms of the measurement accuracy, which is represented by an error
level and a confidence level given by confidence interval estimation.
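A small numerical sketch (not part of the original text) of this convergence, assuming a 2000 ps period with a 700 ps pulse width and uniformly drawn sampling instants:

```python
import random

T, t_w = 2000, 700                        # period and (unknown) pulse width in ps, so p = 0.35
signal_high = lambda t: (t % T) < t_w     # state of the periodic signal at time t

for n in (100, 10_000, 1_000_000):
    X = sum(signal_high(random.uniform(0, 1e9)) for _ in range(n))
    print(n, X / n * T)                   # X/n * T approaches t_w = 700 ps as n grows
```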
2.2.1 Quantification Sample Size
To look into how big a sample should be to provide a certain accuracy and
confidence level, we first characterize the sampling distribution of the gathered sample.
Since x_i is a binomially distributed random variable and we are looking at the fraction
p = t_w/T of the time period for which the signal remains high, the scenario maps exactly
to the problem of statistically observing the proportion of a binomially distributed large
population. For a sample size n the sampling distribution of the proportion can be
represented with mean (μ_P) and variance (σ_P²) as follows [62]:

\mu_P = p        (Equation 3)

\sigma_P^2 = \frac{pq}{n} = \frac{p(1-p)}{n}        (Equation 4)
According to the Central Limit Theorem this statistic can be closely approximated by a
Gaussian distribution with mean μ = μ_P and variance σ² = σ_P². If P = X/n =
(x_0 + x_1 + x_2 + ... + x_n)/n is the observed proportion, i.e. the number of times logic high
is captured divided by the number of trials n in the observed sample, the confidence limits
for p are given by the following equation:

p = P \pm z_c \sigma_P = P \pm z_c \sqrt{\frac{pq}{n}} = P \pm z_c \sqrt{\frac{p(1-p)}{n}}        (Equation 5)
where ±z_c is the critical value that marks the limits within which the area under
the bell-shaped normal distribution curve, shown in Figure 3, is equal to the
confidence interval, also known as the confidence level. ±z_c σ_P is the standard error in the
observed value P. In simple words, the above equation says that the actual value of this
proportion, which is p, could be off by ±z_c σ_P from the observed value P. The value z_c
for a desired confidence interval (CI) can be found with the following formula; Table 1
lists some frequently used values of z_c.

z_c = \sqrt{2} \cdot \operatorname{erf}^{-1}(CI)        (Equation 6)
Table 1. Typical values of z_c
Confidence Level: 99.73%  99%   98%   96%   95.45%  95%   90%    80%   68.27%  50%
z_c:              3.00    2.58  2.33  2.05  2.00    1.96  1.645  1.28  1.00    0.6745
Figure 3. Area under the Gaussian distribution curves within the confidence interval
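The Table 1 values can be reproduced (to rounding) from Equation 6; the following sketch uses Python's standard normal quantile, which is equivalent to √2·erf⁻¹(CI):

```python
from statistics import NormalDist

# z_c is the point where the central area under the standard normal curve equals CI,
# which is the same as Equation 6: z_c = sqrt(2) * erf^-1(CI)
for ci in (0.9973, 0.99, 0.98, 0.96, 0.9545, 0.95, 0.90, 0.80, 0.6827, 0.50):
    z_c = NormalDist().inv_cdf((1 + ci) / 2)
    print(f"CI = {ci:.2%}  ->  z_c = {z_c:.4f}")
```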
In this process the proportion p is the parameter under measurement, with a certain
standard error and confidence level. Equation 5 can be solved for p in terms of P, z_c
and n as follows:

p = \frac{P + \frac{z_c^2}{2n} \pm z_c \sqrt{\frac{P(1-P)}{n} + \frac{z_c^2}{4n^2}}}{1 + \frac{z_c^2}{n}}        (Equation 7)
For a very large value of n, Equation 7 can be simplified to the following form
by ignoring insignificant terms:

p = P \pm z_c \sqrt{\frac{P(1-P)}{n}}        (Equation 8)

From the above relation the observable error ξ can be represented as the difference
between the actual and observed values of the proportion:

\xi = p - P = \pm z_c \sqrt{\frac{P(1-P)}{n}}        (Equation 9)

The observable error ξ can be represented in terms of a given desirable percentage
accuracy α as follows:

\alpha = (1 - \xi)        (Equation 10)

From this we can now arrive at a relation into which a desired percentage accuracy α
and confidence level can be plugged to determine the required sample size:

n = \left(\frac{z_c}{1 - \alpha}\right)^2 p(1-p)        (Equation 11)
2.2.2 Relation of Sample Size with Accuracy and Confidence
It is obvious from Equation 11 that n has a quadratic relation with accuracy.
Figure 4 shows the relation of sample size n with the desired confidence level and
tolerable error level for p = 0.5 (which corresponds to the maximum value of n). A log2
scale representation is used to directly determine an optimum size of the binary counters
needed for circuit implementation, as described in the subsequent sections.
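A sketch of this sizing calculation, combining Equations 6 and 11 at the worst case p = 0.5; the loop mirrors the confidence/error combinations plotted in Figure 4, and the reported bit count is simply log2 of the computed n rounded up:

```python
import math
from statistics import NormalDist

def sample_size(error, confidence, p=0.5):
    """Equation 11 with accuracy = 1 - error; p = 0.5 gives the worst-case (largest) n."""
    z_c = NormalDist().inv_cdf((1 + confidence) / 2)   # Equation 6
    return math.ceil((z_c / error) ** 2 * p * (1 - p))

for conf in (0.90, 0.99, 0.999, 0.9999, 0.99999, 0.999999):
    for err in (0.10, 0.01, 0.001, 0.0001):
        n = sample_size(err, conf)
        print(f"confidence {conf}, error {err}: n = {n}, counter bits = {n.bit_length()}")
```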
Figure 4. Sample size versus Confidence Level (at p=0.5): log2(n) plotted against confidence level for error levels of 10%, 1%, 0.1% and 0.01%
2.3 Characterization of Random Clock
The ability to provide a “random clock” is a crucial factor to apply the statistical
random sampling technique to accurately measure the timing of on-chip high-speed
digital signals. This gives rise to some obvious questions about the proposed scheme
like “What is the random clock?”, “How is it generated?” and “What parameters need to
be considered?” etc. Theoretically the random clock required for the measurement
through the proposed technique is a signal whose edges have a uniformly distributed
probability of occurrence independent of the signal to be measured, so that all parts of
the measured signal are observable with equal probability, thus making a single
observation a Bernoulli trial.
A practical realization of such a truly random signal for statistical random sampling
based measurement using a purely standard cell based digital technology platform is not
straightforward. Most practical digital systems employ pseudo-random sequences where
theoretically a true random sequence is required, but this transition always requires a
careful analysis of the problem and the practical system at hand. This section analyzes
the problem of statistical random sampling applied to periodic digital VLSI signals and
establishes practical bounds and essential parameters for a practical random clock
required to apply the proposed technique for accurate on-chip signal measurements.
2.3.1 Ideal Random Clock and Periodically Repeating Interval
Since the proposed hypothesis considers periodic or periodically excitable signals,
the distribution of the random edges used to collect the signal statistics is required to be
uniform within the periodic interval of the signal (of time period τ_sig) under
observation. For a particular signal under observation, its periodicity determines how an
arbitrary distribution of the random edges would fold within its periodic interval due to
the modulo-τ_sig operation. To study this phenomenon a hypothetical ideal random clock
is considered, with the inter-arrival times of its positive edges forming a truly random
sequence {τ_x1, τ_x2, τ_x3, ...}. The average value of this random number sequence is
E(τ_xi) = τ_avg, which corresponds to the average time period and average frequency f_avg
of the random clock.
Figure 5 shows how the edges of this ideal random clock occur at instants of time
{tx_0, tx_1, tx_2, tx_3, ...} within the periodic interval τ_sig defined by the signal under
measurement. The following equations capture the same property in a mathematical
form:

tx_0 = φ, where φ is the initial phase between the measuring and measured signals
tx_1 = φ + τ_x1 (mod τ_sig)
tx_2 = tx_1 + τ_x2 (mod τ_sig)
...
tx_i = tx_{i-1} + τ_xi (mod τ_sig),   0 ≤ tx_i ≤ τ_sig        (Equation 12)
It is obvious from Equation 12 that the distribution of the instants of time
{tx_0, tx_1, tx_2, tx_3, ...} does not depend only upon the random clock itself but also
upon the time period τ_sig of the periodic signal under measurement. Thus it is crucial to
characterize the random clock relative to the signal(s) under observation.
Figure 5. Random edges relative to the signal under measurement
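The folding behavior of Equation 12 can be sketched numerically as follows; the exponentially distributed inter-arrival times are only a convenient stand-in for an ideal random clock, chosen for this illustration:

```python
import random

tau_sig, tau_avg = 2000, 9200          # ps: signal period and average random-clock period
bins, t = [0] * 10, 0.0
for _ in range(200_000):
    t += random.expovariate(1.0 / tau_avg)       # random inter-arrival time (assumed exponential)
    e = t % tau_sig                               # Equation 12: fold the edge into [0, tau_sig)
    bins[min(int(e / tau_sig * 10), 9)] += 1      # tally which tenth of the period was hit
print(bins)   # the ten counts should all be close to 20,000 if the folded edges are uniform
```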
2.3.2 Random Clock Relative to Signal under Measurement
Due to the technological and TTM constraints, pure digital platforms do not allow
the luxury of using sophisticated chaotic oscillators [40][31][14], PLLs [53] or
programmable analogue delay components in design. The most practical approach for a
pure digital platform is to use a component capable of producing multiple frequencies
of fixed values that can be switched in a pseudo-random fashion to generate a frequency
mix. To characterize the relation of the random clock generated in this way with respect
to the signal under measurement, it is important to first consider a fixed frequency
sampling clock and observe the occurrence of its edges along the periodic cycle of the
signal under measurement. In the following analysis all time measurements are
represented in integral values of high-precision units, e.g., picoseconds (ps).
2.3.2.1 Fixed-Frequency Sampling Clock
Figure 6 shows a signal of time period τ_sig being observed with a sampling clock of
a fixed frequency f_sclk. If the values of the time period of the sampling clock, τ_sclk, and
the time period of the sampled signal, τ_sig, are represented by pure integers with equalized
exponents, then the sequence of occurrence of the edges of the sampling clock within the
periodic cycle of the sampled signal can be represented with the following linear
congruential sequence:

t_i = (i \times \tau_{sclk} + \varphi) \bmod \tau_{sig}        (Equation 13)
where φ is an arbitrary initial phase between the signal under measurement and the
sampling clock. The sequence in which the time instants {t_0, t_1, t_2, ..., t_n} appear within
the τ_sig interval depends highly upon the intricate relation between the numbers τ_sig and
τ_sclk [42]. Before looking into the properties of this sequence, another important
consideration is the length of the sequence, which also determines the achievable sample
point resolution.
Figure 6. Fixed frequency sampling clock beating with the signal
Figure 7. Sample Points distributed within the τ_sig interval (panels (a) and (b); time axis t (mod τ_sig); spacing Δ_r = gcd(τ_rclk, τ_sig))
2.3.2.2 Sample Point Resolution
For some combination of τ_sclk and τ_sig the linear congruence given in Equation 13
produces a finite set of numbers {t_0, t_1, t_2, ..., t_n} called a group, where each member of
the group is a discrete sample point within the τ_sig interval, i.e. 0 < t_i < τ_sig. If the sample
points are sorted by their value, they form a set of discrete instants of time uniformly
spaced within a τ_sig interval for a fixed frequency sampling clock, as shown in Figure
7(b). The spacing Δ_r between any two consecutive sample points on the time axis
determines the maximum possible resolution with which a given fixed frequency
sampling clock can observe the signal under measurement. If the values of τ_sclk and τ_sig
are represented by pure integers with equalized exponents, then the value of Δ_r is
given by the greatest common divisor (gcd) [19] of τ_sig and τ_sclk:

\Delta_r = \gcd(\tau_{sclk}, \tau_{sig})        (Equation 14)
In the best case the sample point resolution is Δ_r = 1, for a relatively prime
combination of the two integers representing τ_sig and τ_sclk [42][19]. A catastrophic case is
when τ_sig exactly divides τ_sclk, and there is a single sample point. To better understand
the concept of sample points, consider the following example.
Example
Let the frequency of the signal under measurement be f_sig = 500 MHz, so τ_sig = 2 ns.
If this signal is sampled using a sampling clock of frequency f_sclk = 108.625 MHz
(a crystal oscillator commonly used in a variety of test equipment), then τ_sclk = 9206 ps.
Equalizing the exponents, τ_sig = 2000 ps.
Sample point resolution: Δ_r = gcd(9206, 2000) = 2 ps.
Length of the sample point sequence: τ_sig / Δ_r = 2000 / 2 = 1000.
Sample points represent discrete time instants within the τ_sig interval of a particular
signal of frequency f_sig = 1/τ_sig that can be observed by a given fixed frequency sampling
signal.
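The resolution and sequence-length calculation from this example can be reproduced directly (a sketch using the same numbers):

```python
from math import gcd

tau_sig_ps = 2000        # 500 MHz signal under measurement
tau_sclk_ps = 9206       # 108.625 MHz sampling clock

delta_r = gcd(tau_sclk_ps, tau_sig_ps)     # Equation 14: sample point resolution
print(delta_r, tau_sig_ps // delta_r)      # 2 ps resolution, 1000 distinct sample points
```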
Theoretically the time between these discrete instants is unobservable, but when we
look into the practical implementation of a random clock generator we will see how
secondary effects like signal noise and timing jitter fill up these unobservable spaces
fully or partially. The sequence in which these discrete instants of time appear depends
highly upon the values of τ_sclk and τ_sig, and it is not necessarily random [42].
2.3.2.3 Producing Frequency Mix with Pseudo Random Numbers
In the preceding subsection it is shown that a value of the sampling clock time
period τ_sclk relatively prime to τ_sig produces the longest sequence of time instants within
the periodic interval of the signal under measurement, but the sequence in which these
sample points appear for a particular pair of τ_sclk and τ_sig is not necessarily random. In
most practical cases the frequency of the signal to be measured with statistical random
sampling cannot be altered at will to obtain a relatively prime pair of τ_sclk and τ_sig, whereas
the frequency of the sampling clock, and thus τ_sclk, can be controlled and generated as
desired. Fortunately, for a given value of τ_sig there exist infinitely many relatively prime
numbers. If a collection of n frequencies with time periods τ_1, τ_2, τ_3, ..., τ_n is gathered,
with all τ_i relatively prime to the τ_sig of the signal to be measured, then each sampling
frequency corresponds to a sequence of sample points. Although the individual
sequences corresponding to each sampling frequency may not be random, if these
sequences are mixed and switched using some pseudo-random number sequence then a
fairly random sequence of instants of time can be produced. It is obvious that due to the
discrete digital nature of the platform the resultant sequence of instants of time would
not be a true random sequence, as it repeats after a fixed length. A pseudo-random
signal produced in this way is called a frequency mix. The observable sample points
using this signal remain the same as those of any individual constituent frequency
of the frequency mix; thus the sample point resolution does not deteriorate. To analyze
the effectiveness of a frequency mix and the randomness of the sequence of instants of
time produced by it, consider the following formulation.
Let the pseudo-random sequence be represented by ψ_i and the time period of the
frequency selected using this pseudo-random sequence be represented by τ(ψ_i); then the
time instants within the periodic cycle τ_sig of the signal under measurement are given by

tx_0 = φ, where φ is the initial phase between the measuring and measured signals
tx_1 = φ + τ(ψ_1) (mod τ_sig)
tx_2 = tx_1 + τ(ψ_2) (mod τ_sig)
...
tx_i = tx_{i-1} + τ(ψ_i) (mod τ_sig),   0 ≤ tx_i ≤ τ_sig        (Equation 15)

Equation 15 is identical to Equation 12 for an ideal random clock, except that the
truly random inter-arrival times of the random edges are replaced by a
pseudo-random sequence of intervals. A circuit that follows the above model simply
transforms an applied random number sequence into random instants of time that can be
used for statistical random sampling. Section 3.3 shows a practical circuit level
implementation of the pseudo-random clock generation model shown in this section.
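A software sketch of this frequency-mix model (Equation 15) is given below; the particular periods 9207, 9301 and 9403 ps and the use of Python's generator in place of an on-chip pseudo-random sequence are assumptions for illustration only:

```python
import math
import random

tau_sig = 2000                       # ps, period of the signal under measurement (500 MHz)
periods = [9207, 9301, 9403]         # assumed sampling-clock periods, each coprime to tau_sig
assert all(math.gcd(p, tau_sig) == 1 for p in periods)

rng = random.Random(1)               # software stand-in for the pseudo-random sequence psi_i
t, visited = 0, set()
for _ in range(1_000_000):
    t = (t + rng.choice(periods)) % tau_sig   # Equation 15: tx_i = tx_(i-1) + tau(psi_i) mod tau_sig
    visited.add(t)

# with coprime periods Delta_r = 1 ps, so all 2000 instants are reachable;
# a long enough run should visit all (or nearly all) of them
print(len(visited), "distinct sample points out of", tau_sig)
```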
2.3.3 Bounds of the Average Frequency of Random Clock
To optimize the design of a practical random clock generator for this particular
application, it is very important to carefully consider the upper and lower bounds on the
average frequency f_avg of a random clock which can guarantee the desired measurement
accuracy while remaining within realizable limits. The lower bound of f_avg is not as critical
as the upper bound, as it merely affects the measurement time, i.e. a lower value of f_avg
means a slower measurement. Before establishing the upper bound for f_avg, it is
important to understand that the random sampling technique finds its real worth in
applications where the signal under measurement has so high a frequency that sampling
it at higher frequencies may not be practical or possible. For the sake of completeness,
consider that if f_avg is set too high relative to the signal under measurement, the
gathered sample may capture only a part of the entire cycle of the signal,
hence providing a highly biased measurement result. To set a quantitative upper bound
on the maximum average frequency f_avg-max for the random clock, consider f_sig-min
(= 1/τ_sig-max) to be the minimum frequency of the signal expected to be observed with the
minimum desirable accuracy α_min and confidence CI_min. According to Equation 11 the
required sample size is given by the following equation:
n_{min} = \left(\frac{z_{c\,min}}{1 - \alpha_{min}}\right)^2 p(1-p)        (Equation 16)
where z_c min is the critical value corresponding to the confidence interval CI_min, and
p is the target value of the signal parameter under measurement. It is intuitive that if the
minimum required accuracy is α_min = 99%, which corresponds to a maximum error level
ξ_max = (1 − α_min) = 1%, then at most 1% of the total trials should occur in a single
cycle of the signal under measurement. Given this constraint the upper bound of the
average frequency of the random clock, f_avg-max, is set as follows:
τ_avg-min = 1 / f_avg-max
ξ_max · n_min · τ_avg-min = τ_sig-max    (ξ_max of the total sampling time equals one cycle of the signal)
f_avg-max / (ξ_max · n_min) = f_sig-min
f_avg-max = f_sig-min · (ξ_max · n_min)        (Equation 17)
For example, if we need a sample size of 1000 to observe a certain parameter of a
signal of frequency 500 MHz with 99% accuracy, then the fastest random clock that may
be used should not have an average frequency of more than 5 GHz. This means that the state of
the signal under measurement would be observed 10 times per cycle on average.
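The example above can be restated directly in code; the numbers are the ones quoted in the text:

```python
f_sig_min = 500e6      # Hz, slowest signal expected to be observed
n_min = 1000           # required sample size (from Equation 16 for the chosen accuracy/confidence)
xi_max = 0.01          # maximum tolerable error level (99% accuracy)

f_avg_max = f_sig_min * xi_max * n_min          # Equation 17
print(f_avg_max / 1e9, "GHz")                   # 5.0 GHz, i.e. about 10 random edges per signal cycle
```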
Chapter 3: Circuit Level Implementations
The preceding chapter laid out the theoretical basis of the proposed concept of
applying statistical random sampling to observe and accurately measure on-chip digital
signals. This chapter looks into the practical circuit level implementation of the
proposed concept over purely digital standard cell based technologies. Two primary
timing problems are considered: (1) Duty Cycle Correction and (2) Relative Phase Adjustment.
The former demonstrates how the proposed technique can practically be
applied to measure and correct the duty cycle of an on-chip digital signal buried deep in
the die of a chip. The latter shows how the proposed concept can be extended to multiple
signals to measure and adjust their relative timing. The last section describes a standard cell
based circuit implementation of a pseudo random clock generator (PRCG) that can be
integrated on the chip along with the Random Sampling Units (RSU).
3.1 Duty Cycle Measurement and Correction
A specific value of duty cycle of an on-chip clock or signal often becomes of
extreme significance in VLSI circuits like DRAMs, dynamic/domino pipelined
circuits, pipelined analog-to-digital converters (ADCs) and Serializer/Deserializer
(SERDES) circuits, which are sensitive to the duty cycle or where operations are
synchronized with both transitions of the clock. This section explains the application of
the proposed random sampling technique to the duty cycle correction of high-speed on-
chip signals. The proposed design shows how to locally adjust the duty cycle of a
periodic synchronization signal or clock using a standard cell based circuit. The circuit
repeatedly captures the state of the signal to be measured at random instants of time. A
large sample data of premeditated size is gathered that provides the measurement results
with an accuracy and confidence level as high as desired. The proposed circuit delays
the input signal with a programmable delay line of a modest size; the delayed and
original input signals are used to stretch or chop the signal to produce a desired output
duty cycle using a small and simple logic circuit. The high measurement accuracy
achievable through the proposed random sampling technique provides a way to correct
the duty cycle with a maximum error of less than half the smallest delay resolution unit
available for correction. Experimental results gathered though extensive simulations of
the proposed circuits manifest a very close correlation to the expected theoretical
results. This section first looks into the significance of the duty cycle correction
problem for high-speed digital circuits, then provides a quick survey of
existing solutions for the subject problem before going into the proposed circuit and
experimental results.
3.1.1.1 Significance of Duty Cycle Correction Problem
The clock signal is the heartbeat of all synchronous digital computing and
communication circuits, some of which are sensitive to both edges of the clock.
Dynamic and domino logic circuits require one phase of the clock cycle for pre-charge
and the other to evaluate, thus imposing a tight constraint on the duty cycle of the clock
to operate at the maximum possible speed. Memory systems like SRAMs and DRAMs
also require a part of the clock cycle to pre-charge the bit/bit bar lines and a part of it for
read or write operations [55]. In data communication circuits and systems the
importance of clock-to-data correlation is magnified, and large variations in the duty
cycle of the clock cannot be tolerated. Similarly, in SERDES technology, when both
edges of the serialization signal are used to serialize the data, a balanced duty cycle
becomes very important to provide equal transmission time for each symbol. In
advanced deep submicron VLSI technologies, the clock is distributed to individual
components through a large clock distribution tree made up of clock buffers and
interconnects of appropriate sizes to minimize skew and end-to-end delay. A noticeable
degradation in duty cycle can be observed at the terminal ends of the signal distribution
network, even for signals generated with a perfectly stable and accurate signal source.
This is due to the slight mismatch in the drive strengths of pull-up and pull-down
networks of the CMOS gates/buffers and non-uniformity in the distribution of wiring
capacitance. A local duty cycle correction circuit is usually required to fix this problem.
3.1.2 Survey of Duty Cycle Corrector (DCC) Circuits
Measurement and correction of duty cycle of on-chip signals is a classic VLSI and
ASIC design problem. The design approaches for duty cycle corrector (DCC) circuits
already proposed in the literature can be categorized broadly into analog, digital and
mixed-signal.
3.1.2.1 Analogue DCC
Purely analog duty cycle corrector (DCC) circuits like the one proposed by [28]
consist of a voltage controlled oscillator (VCO), operational amplifiers (OPAMPs),
phase detectors and frequency filters, which make the design extremely resource heavy.
These circuits are obviously not a good choice when die area is the most important
constraint.
3.1.2.2 Mixed-Signal DCC
The mixed-signal design approaches [15],[75] are very fast but require very careful
design that is independent of process, voltage and temperature variation for analog
components like charge pumps, integrators etc.
3.1.2.3 Digital Synchronous Mirror Delay based DCC
The Synchronous Mirror Delay (SMD), first proposed by [55], forms the basis
of digital design approaches like [64] and [69]. An all-digital DCC by Wang [69] is
shown in Figure 8. Although this SMD based DCC [69] is a pure digital solution of
the duty cycle correction problem, it is not an efficient approach, in area and design
time, for implementation with standard cell libraries in the ASIC design flow, especially
when considerably many DCC’s are required all over the chip, and frequent adjustment
or relocking is not required. The major hurdles that limit the approach in [69] while
working with standard cell libraries are (1) building a custom designed component, like
a perfectly symmetric SR flip-flop, (2) building a precisely matched SMD and (3)
generating a pulse of a specific width which is a recursive call for the solution.
Moreover, the SMD-based designs employ comparatively long delay lines that make the
corrected signal more prone to jitter due to power supply variations.
Figure 8. SMD based all digital DCC: (a) block diagram of the SMD based DCC; (b) timing diagram of the SMD based DCC; (c) conceptual diagram of the SMD; (d) timing diagram of the HCDL.
The maximum measurement and correction error of [69] is equal to the primary
delay element (usually the fastest component in the library) used in the construction of
the SMD, whereas the proposed technique provides a way to very accurately measure
the duty cycle and reduces the maximum correction error to less than half the smallest
delay resolution unit.
3.1.3 Implementation of Proposed DCC
The design concept can be understood through the simple block diagram given in
Figure 9. The circuit uses a small programmable delay line that provides a delayed
version of the input signal. The delayed signal is ORed with the original input to stretch
and ANDed with the input to chop the input signal as illustrated in Figure 10. This
configuration accepts an input duty cycle range of 30% to 70%, and it can adjust a
signal with 30% input duty cycle to any desired value within a 30% to 60% range.
Similarly a 70% input duty cycle can be adjusted to any desired value within a 40% to
70% range. For example, if the input duty cycle is 30%, a signal that is delayed 20% of
the cycle time is ORed with the input to produce a corrected 50% duty cycle.
Figure 9. The conceptual diagram of proposed DCC
These examples also show that to correct an input signal to 50% duty cycle, the
signal undergoes a maximum delay of 20% of the cycle time. This scheme is therefore
less prone to power supply jitter as compared to SMD-based designs in which a signal
essentially passes through a delay of 1.5 to 2 times its cycle time. The output
multiplexer selects between the stretched and chopped signal as required. It also
provides a path for the original signal to the output for initial measurement. The output
of this multiplexer is fed to a random sampling unit for the measurement procedure. The
control unit initiates and reads the measurement to correct the output duty cycle by
varying the delay setting of the programmable delay line.
Figure 10. Timing diagram of Stretching and Chopping
3.1.3.1 Correction Error
The control unit uses the delay line tap that minimizes the duty cycle correction
error due to the quantization effect of the minimum delay unit used in the delay line. As
an example, suppose a signal with a 2ns cycle time and 30% duty cycle is to be corrected to a 50% duty cycle, and the unit delay of the delay line is 33ps. Ideally the signal should
be stretched by 400ps, but the closest approximations can be achieved by selecting the
12th or 13th tap that yield 396ps and 429ps, respectively. Since the random sampling
technique provides an accurate output duty cycle measurement, the control unit selects
the 12th tap to keep the correction error to 4ps instead of 29ps.
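As a rough software illustration of this tap-selection step (not the actual control-unit logic), the short Python sketch below picks the delay-line tap that minimizes the residual duty cycle error; the function name and the assumed number of taps are illustrative, while the numeric values match the example above.

def best_tap(cycle_ps, duty_in, duty_target, unit_delay_ps, num_taps=32):
    # Ideal stretch needed to move duty_in to duty_target, e.g. 0.20 * 2000 ps = 400 ps.
    ideal_ps = (duty_target - duty_in) * cycle_ps
    # Pick the tap whose total delay is closest to the ideal stretch.
    tap = min(range(1, num_taps + 1), key=lambda t: abs(t * unit_delay_ps - ideal_ps))
    return tap, abs(tap * unit_delay_ps - ideal_ps)

print(best_tap(cycle_ps=2000, duty_in=0.30, duty_target=0.50, unit_delay_ps=33))
# selects the 12th tap (about 4 ps error) rather than the 13th (about 29 ps error)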
3.1.4 Random Sampling Unit (RSU) for DCC
Theoretically the random sampling unit is simply a gated flip-flop clocked with a
random clock. The practical implementation of such a sampling circuit requires careful
handling of metastability issues since the clock of the latching register and the input
signal may switch simultaneously. The register output could settle into an undefined region, neither a logical high nor a logical low. Several solutions have been proposed
to alleviate this problem [52],[58]. The simplest approach uses two or three cascaded
flip-flops to demetastabilize the sampled value of the input signal by providing it
enough time to settle down to a stable value before it is forwarded to other logic.
Figure 11. Implementation for random sampling unit
A simple implementation of the random sampling unit is shown in Figure 11, which includes two counters. At any transition of "Sample", "Counter 1" is loaded with
“Desired Sample size (n)” and “Counter 2” is reset. At every edge of the random clock,
“Counter 1” is decremented, and “Counter 2” is incremented when the signal is sampled
as high. When “Counter 1” decrements to zero, further sampling is stopped and
“Counter 2” is read to calculate the duty cycle of the input signal. The size of the
counters used depends upon the required accuracy and confidence level.
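As a behavioral illustration (not the hardware itself), the Python sketch below mirrors the roles of "Counter 1" and "Counter 2" for an ideal periodic input sampled at uniformly random instants; the signal model and the sample size are assumptions made only for this example.

import random

def measure_duty_cycle(duty, n=65535):
    counter2 = 0
    for _ in range(n):                    # Counter 1 counts the n trials down to zero
        t = random.random()               # random sampling instant, as a fraction of a cycle
        if t < duty:                      # ideal signal is high for the first 'duty' fraction
            counter2 += 1                 # Counter 2 counts the samples captured as high
    return counter2 / n                   # observed proportion estimates the duty cycle

print(measure_duty_cycle(0.30))           # typically within about 1% of 0.30 for this n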
Our design space exploration shows that a design with 16-bit counters could be
implemented in a modest area (1700 cells of size 0.4 μm x 4.8 μm in 130nm technology)
and provides 99% accuracy with a 99.9999% confidence. To make the correction
process faster, coarser measurements can be done in the beginning with smaller sized
samples and more accurate measurements can be done with large sized samples towards
the end of the correction process.
3.1.5 Experimental Setup and Results
Functional verification is done through post synthesis simulations of the proposed
design targeted to IBM Cu-11 (130nm technology). The test bench generates a random clock using uniformly distributed random numbers to drive the synthesized HDL description of the design. For various combinations of accuracy and confidence level,
the necessary sample size is found using Equation (5). An extensive series of simulation
experiments was performed for different signal duty cycles and combinations of confidence level and accuracy settings, but results are shown only for the 50% duty cycle case due to space constraints. The 50% duty cycle results were chosen because the expected error level is maximum in the measurement of a 50% duty cycle signal for any given sample size and
confidence level.
Figure 12 shows the maximum observed error in 100 simulations run over each set of parameters for a 50% duty cycle signal. The results are normalized to the expected error for each case so that fine-level detail could be observed for all cases on the graph shown. The decreasing trend of observed error with increased confidence level is a consequence of increased sample size. It is noticeable from the results that the observed error always remained within the limits of the expected error and that the proposed technique is equally valid at different accuracy settings, which is favorable for designing a fast-converging duty cycle adjustment algorithm.
Figure 12. Maximum Observed Error Normalized by Expected Error (normalized error vs. confidence level, plotted for accuracy settings of 90%, 99% and 99.9%)
3.1.5.1 Deductions
Unlike the pulse width measurement problem, the duty cycle problem does not require knowledge of the frequency or time period of the signal. The maximum correction error of the proposed design is half the maximum delay resolution of the delay line, without involving any resource-heavy analog components or custom digital components, thus making it very attractive for standard cell based ASIC designs. The design is purely standard cell based and can practically be ported to any ASIC design technology.
3.2 Relative Phase Measurement
This section explains how the statistical random sampling technique discussed in previous sections can be extended to simultaneous random sampling and applied to measure and adjust the relative phase of on-chip high-speed digital signals. The
proposed technique as applied to timing uncertainty mitigation in the signaling of a
digital system is presented as an example; the relative phase information is used to
minimize the timing skew. The proposed circuit in this section captures the state of the
signals under measurement simultaneously at random instants of time and gathers a
large sample data to estimate the relative phase between the signals. By carefully
premeditating the sample size, the accuracy and confidence of the result can be set to a
level as high as desired. The accurately sensed value of relative phase enables the correction circuit to reduce the maximum correction error to less than half the maximum delay resolution unit available for adjustment. The proposed circuit design is based on standard cell technology, which makes the design practically portable to any process or technology. The pure standard cell based circuit design approach reduces the overall
design time and circuit complexity as well. The test results of the proposed circuit
manifest a very close correlation to the simulated and theoretically expected results. The
random sampling unit (RSU) circuit proposed for phase measurement in this section
occupies an area of 3350 μm² in 130nm technology, which is an order of magnitude smaller
than what is required for its analog equivalent in the same technology. In the following
subsections the significance of the relative phase measurement problem and a quick
survey of conventional approaches are provided before going into details of circuit level
implementation, its experimental setup and test results.
3.2.1.1 Significance of the Problem
Relative phase measurement and adjustment of digital signals embedded deep
inside a chip become extremely significant for correct functionality or optimal
performance of certain systems. In data communication circuits and systems the
importance of the clock-to-data correlation is magnified, and maximum timing margin
can be achieved only by aligning the capturing edge of the clock at a certain point in the
data eye. This can be achieved through the adjustment of the relative phase of the
capturing clock with respect to data. Similarly in serializer-deserializer (SERDES)
technology, multiple phases of the clock are used to launch and capture data at the
SERDES. The timing uncertainties in data signaling systems are mainly categorized as
skew and jitter [20]. Uncertainties due to mismatched line lengths, process variations and pin parasitics, etc., are generally time invariant for a system at given operating conditions and are grouped together to be called "skew". Synchronous open-loop
systems tolerate the skew at the cost of performance, i.e., by low-frequency operation,
whereas active closed loop systems trade area for performance gain by employing phase
locked loops (PLLs) or delay locked loops (DLLs). The basic idea of an active closed-
loop skew compensation is to reduce exactly as much skew as needed. It is important to
note that if the operating condition of a system is not time varying, it would not require
frequent adjustments and fast locking mechanisms to compensate the skew. This forms the basis of the claim that the statistical random sampling technique can be applied to observe and adjust the duty cycle of an on-chip digital signal. The idea of observing the duty cycle of a signal described in Section 2.3 can be extended to observe the relative phase of multiple signals with respect to each other by capturing their states simultaneously at random instants of time. The observed information can then be used to minimize the timing skew in systems where operating conditions do not change frequently.
3.2.2 Conventional Ways to Tackle the Problem
Phase measurement and detection is yet another classic VLSI and ASIC design
problem. Phase detectors (PD) and Phase/Frequency detectors (PFD) are commonly
used in phase locked loops and delay locked loops [13][17][41]. In a typical delay
locked loop (DLL), a phase detector signals the loop control circuit to increase,
decrease or stop the loop delay adjustment. Similarly, in a typical PLL circuit, the
relative phase of the output of a voltage controlled oscillator (VCO) with respect to the
reference signal is measured by a PFD and used as feedback to adjust the VCO’s output.
Soliman et al. [60] explored the design space of PDs and PFDs and categorized them with respect to their functionality and implementation. The design spectrum of phase measurement and detection circuits can also be divided with respect to circuit families: (1) analog, (2) mixed-signal and (3) pure digital.
3.2.3 Extension of the Technique to Simultaneous Random Sampling
Previous sections showed how random sampling can provide observability of a signal deep inside a VLSI chip to measure its pulse width or duty cycle. The concept of random sampling can be extended to simultaneous random sampling, in which the states of multiple signals are captured simultaneously at random instants of time. For the relative phase measurement problem, two periodic signals are considered. The signals have the same frequency and one leads the other by some unknown phase difference, so there are four distinct regions as shown in Figure 13. To measure the phase difference we
estimate the length of the “region A” that corresponds to a simultaneously captured
value "10" of the two signals. Defining p as the proportion of "region A" to the cycle time T_cycle, i.e., p = t_A / T_cycle, and a single trial as a simultaneously captured state of the two signals that can take four distinct values 10, 11, 01 and 00 corresponding to the four regions shown in Figure 13, the probability of capturing a logic state "10" in a single trial would be equal to p. This reduces the problem of relative phase measurement to the problem of observing a proportion discussed in Chapter 2 and allows us to apply the same theoretical basis that we have established in that chapter.
The observed value of P, which represents the proportion of the number of times logic state "10" is captured out of the n trials of the gathered sample, can now be mapped to relative phase using the following equation with a certain confidence and error level.

φ = 2π·P = 2π·t_A / T_cycle    Equation 18
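A minimal behavioral sketch of this mapping in Python is given below; it assumes two ideal 50% duty cycle square waves of the same period, with the second lagging the first by no more than half a cycle, sampled simultaneously at uniformly random instants.

import math, random

def measure_phase(phase_rad, n=65535):
    count_10 = 0
    for _ in range(n):
        t = random.uniform(0.0, 2 * math.pi)                    # random instant in phase units
        s1 = (t % (2 * math.pi)) < math.pi                      # leading signal (50% duty cycle)
        s2 = ((t - phase_rad) % (2 * math.pi)) < math.pi        # lagging signal
        if s1 and not s2:                                       # captured state "10" falls in region A
            count_10 += 1
    return 2 * math.pi * count_10 / n                           # Equation 18: phi = 2*pi*P

print(math.degrees(measure_phase(math.radians(90))))            # close to 90 degrees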
Figure 13. Two signals under relative phase measurement
3.2.4 Implementation of Circuit Design
The block diagram of a signaling convention shown in Figure 14 explains the
design concept of the proposed technique, where phase measurement and adjustment
through random sampling is employed for a typical problem of placing the capturing
edge of the clock in the middle of the eye of received data symbols. The circuit employs
a programmable delay line at the receiver side in the path of the clock to adjust its
sampling edge at a desired phase with respect to the data to be sampled. During the
clock to data alignment step, a pattern of known frequency (clock itself in this case) is
sent at both data and clock lines. The random sampling unit captures the state of the
data and clock lines simultaneously at the edges of a random clock. The random
sampling unit records a required number of observations to measure the phase
difference between the signals received through the two paths. The control unit uses the
phase difference information and sets the taps of the delay line to adjust the phase of the
clock with respect to the data line to provide maximum tolerance against timing
uncertainties by minimizing the timing skew in the two paths.
Figure 14. Digital System with Closed Loop Signaling Convention.
3.2.5 Random Sampling Unit (RSU)
Theoretically the random sampling unit (RSU) consists simply of flip-flops clocked
with a random clock. The practical implementation of such a sampling circuit requires
careful handling of metastability issues since the clock of the latching register and the
input signal may switch simultaneously. The register output could settle into an undefined region, neither a logical high nor a logical low. To alleviate this problem several solutions exist in the literature [52]. Maggioni et al. used sample-and-holds with
comparators in [43]. To keep the circuits portable and purely digital we employed a
simple approach that uses cascaded flip-flops to demetastabilize the sampled value of
the input signal by providing it enough time to settle down to a stable value before it is
consumed by other logic.
The implementation block diagram of the random sampling unit is shown in Figure 15; it includes two event counters. At any transition of control signal "Sample",
“Counter 1” is loaded with “Desired Sample Size (n)” and “Counter 2” is reset. At
every active edge of the random clock, “Counter 1” is decremented, whereas “Counter
2” is incremented only when the captured state matches with the “Region Code”, e.g.,
for region A, Region Code =“10”. When “Counter 1” decrements to zero, further
sampling is stopped and “Counter 2” is read to calculate the phase difference of the two
input signals. The size of the counters used depends upon the required accuracy and
confidence level.
Figure 15. Random Sampling Unit (RSU) for Relative Phase Measurement.
Design space exploration shows that a design with 16-bit counters could be
implemented within a modest area (1745 cells of size 0.4 μm x 4.8 μm in IBM Cu-11,
130nm technology) and provides 99% accuracy with a 99.9999% confidence level. To
make the correction process faster, coarser measurements can be done in the beginning
with smaller sized samples, and more accurate measurements can be performed with
large sized samples towards the end of the correction process.
3.2.5.1 Phase Correction Error
Timing jitter in the signals due to power supply noise is defined to be a zero mean
random variable [20]. The error induced in the phase measurement due to jitter in the
signals under measurement is averaged out to zero for a large sized sample, by virtue of
its zero mean characteristic. The high measurement accuracy achievable through the
proposed random sampling technique enables the control unit to select the delay line tap
that minimizes the phase correction error due to quantization effect of the maximum
delay adjustment resolution possible through the employed delay line. As an example,
consider a 500 MHz system that requires a phase adjustment of 72° to eliminate skew
and achieve maximum timing margin. The maximum delay resolution of the delay line
in the target technology is 33ps. Ideally the clock should be delayed by 400ps, but the
closest approximations can be achieved by selecting the 12th or 13th tap that yield
396ps and 429ps delays, respectively. The accurate phase measurement enables the
control unit to select the 12th tap to keep the correction error to 4ps instead of 29ps.
3.2.6 Experimental Setup and Results
Functional verification of the proposed technique is done through post synthesis
simulations of the design targeted to IBM Cu-11 (130nm) technology. Simulation test
benches used uniformly distributed random numbers to generate a random clock as
stimulus for the synthesized netlist. To validate the idea for physical design, the RSU
along with the integrated random clock generator was synthesized and ported to a
Xilinx FPGA. Two periodic signals with a frequency on the order of 100MHz, one of which is delayed using digitally controlled delay lines, were used to test the phase measurement accuracy and consistency of the RSU. An extensive series of experiments was
performed for various combinations of accuracy and confidence level with input signals
at different relative phase settings. The relative phase values measured with the RSU in
these experiments are compared against those observed through a digital oscilloscope. The 90° phase difference results are chosen because in a typical double
data rate digital signaling system, clock to data adjustment is kept at 90° for maximum
timing margin. Moreover, the expected error level is nearly at its maximum value while
observing two signals at 90° phase difference for any given sample size and confidence
level. Figure 16 shows the maximum observed error in 100 experiments run for each set
of parameters over signals at 90° phase difference. The results are normalized to
expected error for each case so that fine-level detail could be observed for all cases on
the graph shown. The decreasing trend of observed error with increased confidence
level is a consequence of increased sample size. The results show that the observed
error is always within the limits of expected error and the proposed technique is equally
valid at different accuracy settings.
3.2.6.1 Deductions
This section explained how the statistical random sampling technique can be
extended to observe and manipulate the relative phase of multiple signals. Using premeditated parameters, a greatly enhanced measurement accuracy and a wide measurement range can be obtained. Measurement error due to jitter is cancelled out by the
zero mean random characteristic of jitter itself. The maximum phase correction error of
the adjustment circuit is reduced to half the maximum delay resolution of a delay line
without involving any resource-heavy analog components or custom designed digital
components. The circuit is implemented purely with standard cells, making it extremely
suitable for System-on-Chip (SoC) applications since it is design time efficient and
portable to any process or technology. The theory of high-speed signal observation and
measurement using the random sampling technique shown in this section is not only applicable to digital VLSI circuits and systems, but can also be extended to other domains of science and engineering, such as instrumentation, power electronics and industrial controls.
Figure 16. Maximum Observed Error Normalized by Expected Error (normalized error vs. confidence level, plotted for accuracy settings of 90%, 99% and 99.9%)
3.3 Random Event Generation
Like any other circuit level problem, random clock generation can also be
addressed using pure analogue, mixed signal or pure digital paradigms. More precise
and continuous random clock generation circuits like chaos based circuits [14] and
chaotic oscillators [40],[31] can be employed to generate random numbers that can be
used to generate random edges. The motivation for using chaotic systems to generate
random numbers is that the state of a chaotic system is unpredictable, but this is true
only in the long term [5]. Unpredictability means that consecutive random numbers are
independent of each other in the probabilistic sense, thus providing a way of signal
sampling based on Bernoulli trials. The chaotic oscillator based random clock
generation can be categorized under the analogue design approach, which is not
consistent with the primary theme of the technique proposed in this thesis. As discussed
in earlier sections, analogue electronic components make the overall circuit resource
heavy and design time inefficient. Moreover these circuits are not easily portable across
rapidly changing technologies. This section discusses some possible ways to realize a
random clock at a purely digital circuit level that could be employed for the proposed
random sampling units (RSU’s) described in the preceding two sections. It is
established in Chapter 2 that the proposed technique requires random instants of time,
at which the state of the signal under measurement is repeatedly captured, so it is
important to understand that the problem at hand is generation of random instants of
time and not the generation of random numbers itself. Of course it can be argued that
there must exist some way to convert a given sequence of random or pseudo-random
numbers to random instants of time. This section describes a pure standard cell based
pseudo-random clock generator [10] using any well-known pseudo-random number sequence.
3.3.1 Pseudo Random Number Generators
The problem of random number generation caught the attention of scientists and
engineers shortly after the invention of digital computers, and the literature carries a
great deal of research information about generating and testing pseudo-random
numbers [42][44][2]. The quality of the random numbers plays a vital role in the success of applications like simulations, approximation algorithms for NP-hard problems, spread-spectrum communications, security, encryption, etc., which all require good random number sequences. The oldest and most common way to implement a random
number generator is a Linear Feedback Shift Register (LFSR)[44][12]. Cellular
Automata based pseudo-random number generators have recently emerged and have proved to produce better random number sequences [59]. Let us have a quick look at both
the circuits to compare the complexity and robustness before looking into how to use
these pseudo-random sequences to produce pseudo-random instants of time.
3.3.1.1 Linear Feedback Shift Register
A careful selection of the taps for feedback causes the internal state of the shift
register to cycle through a set of unique values. The choice of LFSR length, gate type,
LFSR type, maximum length logic, and tap positions allows multiple design dimensions
to control the implementation and feedback of the LFSR, which, in turn, controls the sequence of repeating values the LFSR iterates through. Following are two well-known configurations for LFSRs:
3.3.1.1.1 Fibonacci LFSR
In the Fibonacci implementation, the outputs from some of the registers are
exclusive-ORed with each other and fed back to the input of the shift register. Figure 17
shows a 3-bit Fibonacci LFSR.
Figure 17. Three Bit Fibonacci LFSR configurations
The length of the pseudo-random sequence is dependent on the length of the shift
register and the number and the position of the feedback taps. The number and the
position of the taps are commonly represented by a delay polynomial given by the
following
P(D) = D³ + D + 1    Equation 19
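A Python sketch of a maximal-length 3-bit Fibonacci LFSR is shown below; the feedback taps implement the recurrence a(n) = a(n-2) XOR a(n-3) implied by the polynomial above, though the exact stage indexing of the wiring in Figure 17 may differ, and the seed value is an arbitrary non-zero assumption.

def fibonacci_lfsr(seed=0b001, steps=7):
    state, out = seed, []
    for _ in range(steps):
        out.append(state & 1)                       # serial output bit
        fb = ((state >> 1) ^ (state >> 2)) & 1      # XOR of two register outputs fed back to the input
        state = ((state << 1) | fb) & 0b111         # shift and insert the feedback bit
    return out

print(fibonacci_lfsr())    # the 3-bit state cycles through all 7 non-zero values before repeating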
3.3.1.1.2 Galois LFSR
In the Galois implementation, the gates are placed between the registers. Figure 18
shows the Galois implementation of the LFSR from the previous example. Again, the
length of the pseudo-random sequence is dependent on the length of the shift register
and the number and the position of the feedback taps. The number and the position of
the taps can also be represented as a delay polynomial.
Figure 18. Three Bit Galois LFSR configurations
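For comparison, a Python sketch of the Galois form of the same 3-bit generator is given below; each step is equivalent to multiplying the state by D modulo P(D), so the XOR is applied between stages whenever the bit shifted out is 1. The seed is again an arbitrary non-zero assumption.

def galois_lfsr(seed=0b001, steps=7):
    state, states = seed, []
    for _ in range(steps):
        states.append(state)
        msb = (state >> 2) & 1            # bit about to be shifted out of the 3-bit register
        state = (state << 1) & 0b111      # shift (multiply the state polynomial by D)
        if msb:
            state ^= 0b011                # reduce using D^3 = D + 1 for P(D) = D^3 + D + 1
    return states

print(galois_lfsr())    # -> [1, 2, 4, 3, 6, 7, 5], all 7 non-zero states before repeating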
Figure 19. A basic building block of CA50745 cellular automata
3.3.1.2 Cellular Automata (CA) Based Random Number Generator
Hortensius et al. [33] showed that non-homogeneous Cellular Automata (CA) composed of two linear functions could generate random numbers superior to those of the linear feedback shift register and of the homogeneous nonlinear CA proposed by Wolfram [33].
48
Shackleford et al. [59] recently proposed some optimized constructions of CA based random number generators. Figure 19 shows a basic building block of CA50745.
Figure 19(a) shows that the functionality of a cell derives directly from a LUT truth
table. Figure 19 (b) describes the cell symbology. Each cell contains a 4-input lookup
table which defines the cell functionality and a 1-bit register which holds the cell’s
state. The notation derives directly from the decimal value of the function’s 16-bit truth
table as shown in Figure 19 (c). Leading 0s are included to prevent confusion with 3-
neighborhood rules. Figure 19 (d) shows how connections to cell i are expressed as a set
of displacements from i’s ordinal value.
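The exact rule table and neighborhood of CA50745 are not reproduced here, but the flavor of a non-homogeneous CA random number generator can be sketched in Python with the two linear rules (rule 90 and rule 150) commonly associated with the construction cited above; the cell count, rule assignment, seed and cyclic boundary below are illustrative assumptions.

def hybrid_ca_step(cells, rules):
    n, nxt = len(cells), []
    for i, rule in enumerate(rules):
        left, right = cells[(i - 1) % n], cells[(i + 1) % n]   # cyclic boundary assumed
        if rule == 90:
            nxt.append(left ^ right)                 # rule 90: XOR of the two neighbors
        else:
            nxt.append(left ^ cells[i] ^ right)      # rule 150: XOR of neighbors and self
    return nxt

cells, rules = [1, 0, 0, 1, 0, 1, 1, 0], [90, 150] * 4
for _ in range(4):
    cells = hybrid_ca_step(cells, rules)
    print(cells)                                     # successive pseudo-random CA states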
3.3.2 Proposed Pseudo Random Clock Generator (PRCG)
3.3.2.1 Basic Circuit of PRCG
Figure 20 shows the conceptual block diagram of a basic random clock generator
employed for an FPGA-based experiment performed for relative phase measurement,
elaborated in Section 3.2.6. This is a pure digital approach that can be targeted to a
standard cell based ASIC library, which makes this design portable across many
technologies. A pseudo-random sequence of frequencies is generated by switching the
control input of a ring oscillator. The control sequence of pseudo-random numbers is produced with a linear feedback shift register (LFSR) [44], which controls the length of the ring oscillator.
Figure 20. Conceptual Blocks of proposed PRCG
This PRCG relies on the capricious behavioral characteristics of a ring oscillator
due to jitter and the pseudo-random number sequence. In the absence of jitter, the theoretical output of the circuit would be pseudo-random and would repeat its outputs
after a fixed interval. The inherent jitter of the ring oscillator makes the output
completely unpredictable and unrepeatable. The random clock generated through this
configuration provides results well within the theoretical bounds, but the maximum
achievable accuracy is limited by the fastest element used in the ring. The fastest
element of the ring determines the shortest possible time difference between two
theoretically possible sample points within the periodic interval of the signal under
measurement.
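A behavioral Python sketch of this idea is given below: a small maximal-length LFSR selects the ring length, and the resulting period plus Gaussian jitter sets the spacing of successive random edges. The base delay, per-slice delay and jitter sigma are illustrative assumptions, not characterized values.

import random

def prcg_edges(num_edges, base_ps=5000.0, slice_ps=66.0, jitter_sigma_ps=5.0):
    state, edges, t = 0b1011, [], 0.0
    for _ in range(num_edges):
        fb = ((state >> 3) ^ state) & 1                    # maximal-length 4-bit Fibonacci LFSR
        state = ((state << 1) | fb) & 0xF                  # pseudo-random control value 1..15
        period = base_ps + state * slice_ps                # ring length selected by the control value
        t += period + random.gauss(0.0, jitter_sigma_ps)   # jitter makes the edges unrepeatable
        edges.append(t)
    return edges

print(prcg_edges(5))    # timestamps (in ps) of the first few pseudo-random clock edges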
3.3.2.2 Improved PRCG
The basic PRCG shown in Figure 20 is improved to follow the theoretical model
proposed in section 2.3.2. The circular ring oscillator is built with a series of non-
uniform ring slices and a delay blender. The fixed part of the ring determines the
maximum average frequency f_avg-max of the PRCG; its value can be determined from the expression derived in Section 2.3.3. Figure 21 shows a ring slice which is connected to the nth bit of the control port and contains N_inv = 2(n+1) inverters. To produce a fine
frequency mix, the proposed PRCG design adds a delay blender in the ring oscillator.
Figure 22 shows a delay blender built with inverters of various performance levels. The
delay blender enables this circuit to produce frequencies of very close values that
increase the probability of getting frequencies of better sample point resolution Δr characteristics for a given signal periodicity. Most common commercial standard cell
libraries provide a reasonably wide spectrum of buffers and inverters of different
performance levels. The cells with smaller delay differential can be used in an
increasing order to build a delay blender.
Figure 21. Ring Slice of proposed PRCG
Figure 22. Delay Blender of PRCG
3.3.2.3 Characterization of the proposed PRCG
A variable-frequency random clock is generated when the length of the ring oscillator is continuously switched as the random edges are generated at the output. The length control can be switched with any given sequence of numbers {x_0, x_1, x_2, …, x_m} applied on the control port. If the time periods of the frequencies generated corresponding to this sequence are given by {τ_x0, τ_x1, τ_x2, …, τ_xm}, then edges produced by the variable-frequency random clock within the time period τ_sig interval of the signal under observation can theoretically be represented by the following recurrence:

t_x0 = φ,   t_x1 = (τ_x0 + t_x0) mod τ_sig,   …,   t_xi = (τ_xi-1 + t_xi-1) mod τ_sig    Equation 20

The underlying assumption of Equation 20 is that the delay lines switching in the ring would be empty, whereas in a practical case all delay lines carry some phase of the oscillations. In spite of this anomaly, the above expression provides a fairly good reference for designing a variable frequency random clock to measure a given range of signal frequencies. The final distribution of the edges within the τ_sig interval highly depends upon the relative characteristics Δr of the individual constituent frequencies of the entire frequency mix produced by the oscillator and the sequence in which they are switched. We compare three types of switching sequences: (1) a linear sequence, (2) LFSR [44] based random number sequences and (3) cellular automata based random number sequences [33].
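The recurrence of Equation 20 can be evaluated directly in software to see how many distinct (jitter-free) sample points a given switching sequence produces within a τ_sig interval; the Python sketch below does exactly that, with arbitrary illustrative period values.

def sample_points(tau_seq_ps, tau_sig_ps, num_edges, phi=0.0):
    points = [phi % tau_sig_ps]                          # t_x0 = phi
    for i in range(num_edges - 1):
        tau = tau_seq_ps[i % len(tau_seq_ps)]            # next period of the switched clock
        points.append((points[-1] + tau) % tau_sig_ps)   # t_xi = (tau_x(i-1) + t_x(i-1)) mod tau_sig
    return points

# e.g. a 2000 ps signal sampled by a clock switched among three nearby periods
pts = sample_points([9975, 10035, 10095], 2000.0, 10000)
print(len({round(p, 1) for p in pts}))                   # number of distinct sample points reached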
For the sake of simplicity, the above analysis ignored the expected jitter in the signal generated by the ring oscillator. The jitter is a zero mean random variable with a Gaussian distribution [20] and it can be exploited as a source of true randomness [26]. Figure 23 shows the probability distribution of the edges of a frequency mix produced by the proposed PRCG with Gaussian jitter. The resultant distribution becomes a continuous distribution rather than a discrete distribution on fixed sample points within the τ_sig interval. For smaller values of Δr and larger values of jitter variance the distribution would be more uniform in the τ_sig interval.
Figure 23. Sample point distribution of a frequency mix
3.3.2.4 Modified Random Sampling Unit (RSU)
Keeping in view the importance of Δr characteristics, the proposed circuit shown in
Figure 24 incorporates an additional counter, “Counter 0”. This counter is used to
observe the relative characteristic of a random clock and the measured signal. This
allows the controller to select a set of good frequencies and isolate bad ones relative to
the signal under observation.
Figure 24. Modified Random Sampling Unit (RSU)
3.3.2.5 Quantification of measurement resolution Δr
At the start of a sample observation, counter 0 is loaded with a value “S”. At run
time counter 0 is decremented at every edge of the signal under observation. When it
reaches zero, the value of "Counter 1" is captured in register "Q". Δr is calculated from the following formula; a smaller value of Δr is indicative of a higher measurement resolution.

Δr = gcd( S·τ_sig / (n − Q), τ_sig )    Equation 21
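A minimal numerical sketch of Equation 21 in Python is shown below, with all times expressed as integer picoseconds so that an ordinary integer gcd applies; the specific values of S, n, Q and τ_sig are illustrative assumptions.

from math import gcd

def delta_r(S, tau_sig_ps, n, Q):
    avg_clk_period_ps = round(S * tau_sig_ps / (n - Q))   # S signal cycles seen over (n - Q) random-clock edges
    return gcd(avg_clk_period_ps, tau_sig_ps)             # smaller Delta_r means finer sample point resolution

# 39 cycles of a 2000 ps signal observed while Counter 1 dropped by 40 counts:
print(delta_r(S=39, tau_sig_ps=2000, n=65535, Q=65495))   # gcd(1950, 2000) = 50 ps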
3.3.3 Simulation Methodology and Results
To validate the analytical models established in this work the proposed circuits
were synthesized and targeted to IBM Cu-08 (90 nm) standard cell technology. A
PRCG with 8-bit control width was implemented with the upper 5 bits connected to ring
slices for coarse variation and the 3 lower bits connected to a delay-blender for finer variation of frequencies. The f_avg-max of the frequency mix was set approximately to 100
MHz with a fixed chain of inverters. The sampling edge distribution characteristics of a
fixed frequency sampling clock and variable frequency random clocks were observed
relative to 1 GHz, 622.08MHz, 500 MHz, 300MHz and 250 MHz signals. For Variable-
Frequency Pseudo-Random Clocks the control was switched using three types of
number sequences (1) linear sequence, (2) LFSR based random number [44] and (3)
cellular automata based random number sequences [59]. All experiments were run for a
sample size of 1,000,000 trials. The experimental results proved that the proposed
analytical models very accurately track the actual behavior of a PRCG built with a
controlled ring oscillator. The performance of four types of sampling clocks is
compared on the basis of two performance metrics: (1) smoothness of the probability distribution of sample points in the τ_sig interval and (2) the value of the effective sample point resolution Δr-effective.
Figure 25. Relative Δr Distribution of the PRCG under test (number of frequencies vs. Δr in ps, for 1 GHz, 622 MHz, 500 MHz, 300 MHz and 250 MHz signals)
Figure 26. Standard Deviation of Sample Point Probabilities (for FIX, LIN, LFSR and CA sampling clocks at 1 GHz, 622 MHz, 500 MHz, 300 MHz and 250 MHz)
Figure 27. Effective Sample Points Resolution Δr-effective (ps) (for FIX, LIN, LFSR and CA sampling clocks at 1 GHz, 622 MHz, 500 MHz, 300 MHz and 250 MHz)
Figure 25 shows the Δr distribution of the frequency mix produced by the PRCG under test. Figure 26 shows a comparison of the first metric in terms of the standard deviation of the probabilities of the sample points in a τ_sig interval. Figure 27 compares the Δr-effective characteristics relative to different values of τ_sig. Notice that, as predicted by the proposed analytical model, the fixed-frequency sampling clock, if selected correctly, produces a perfectly uniform distribution and the smallest Δr-effective = Δr. Linearly varying and cellular automata based control sequences produced the smoothest probability distributions in the τ_sig interval. The clock generated with an LFSR based random number sequence manifests the noisiest distribution behavior in the τ_sig interval but achieves a higher sample point resolution, i.e., Δr = 1ps. This high sample point resolution is due to the random phases that are produced in the ring oscillator by the random switching of the control. Cellular automata based random number sequences produced results comparable to those of LFSR based random number sequences, but their circuit complexity is much larger than that of a simple LFSR.
3.3.4 Deductions
This section critically analyzes the desirable characteristics of a pseudo-random
clock which can be employed for measurement of on-die digital VLSI signals using a
statistical random sampling technique. It also proposes a method to generate pseudo-
random clocks that would measure a wide range of signals without losing measurement
resolution. The design rules provided in this section can be used to build an efficient
PRCG. The proposed random sampling unit (RSU) provides a way to evaluate and
adjust the reliability of the measurement results.
Chapter 4: Case Studies of VLSI Applications
4.1 Proposed Techniques Applied to SerDes design
This section introduces a standard cell based design technique for a Serializer and
Deserializer (SerDes) communication link [8]. The SerDes design is undertaken as a case study in which the design employs statistical random sampling to observe and adjust the synchronization and serialization signals rather than using resource-heavy PLL or DLL based frequency multipliers/synthesizers and clock data recovery (CDR) circuits. The proposed design is area, power and design time efficient as compared to conventional SerDes designs, making it very attractive for modest budget multi-core
and multi-processor ASICs with wide communication buses that are difficult to
accommodate within the pin count of commonly available packaging. The proposed
methodology speeds up the design time and enhances the portability of the design
across technologies while keeping the overall cost as low as possible. The serialization
and deserialization logic is based on standard cell technology that makes the design
highly portable. Multiple serial lines are bundled with a strobe that is used as a
reference signal for deserialization. Data-to-strobe timing skew is compensated by
adjusting the launch times of strobe and data symbols at the sender side. The edges of
the strobe are set within the eye of data symbols to have maximum timing margin,
which makes the design inherently tolerant of jitter. Power consumption of the proposed
SerDes design is 30 mW per serial link targeted to IBM Cu-11 (130 nm) technology, nearly a 2.5x improvement over the conventional design with a 60% smaller area requirement.
4.1.1 Significance of SerDes Systems
Over the past few decades processor designers have been exploiting mainly
technology frequency scaling for performance gain, with architectural improvements providing only modest additional performance. However, frequency scaling is reaching its
physical limits, and architectures have matured so greatly that any further revolutionary
improvement is unlikely. To push forward the conventional performance scaling trend
computer engineers are now moving towards a parallel processing paradigm by putting
more and more processing cores on a single chip along with different kinds of on-chip
memory architectures. DIVA’s PIM chips [22], Intel’s Pentium D, AMD’s Dual
Opteron, Sun’s Niagara and IBM’s Cell are just a few of the many examples of such
architectures. Multiple cores on a single chip essentially demand wider interfacing
buses that produce a considerable mass of interconnection wires on the chip and require
a high number of package pins to connect the silicon die with the rest of the board level
circuit. This problem is often tackled by serializing multiple signals to a high-speed data
channel at the source side and deserializing them at the target side. A dedicated pair of a
Serializer and Deserializer (SerDes) is used for this purpose. SerDes technology has
existed for some time, but, until recently, has had little visibility as a true interface in
commonly fabricated ASICs. The reasons that SerDes has not enjoyed widespread
adoption are (1) design time inefficiency, (2) high power requirements, (3) channel bit
error rate (BER) and (4) silicon area cost. The recent surge in low voltage differential
signaling (LVDS) technology and its common-mode versatility has partially solved the channel bandwidth and BER problems [63], but all known state-of-the-art SerDes designs like IBM High Speed SerDes (HSS) [61] are still complex, area-intensive and power hungry, which makes them unsuitable for medium and low-budget projects.
4.1.2 Conventional SerDes Designs
Figure 28 illustrates the main components of conventional SerDes systems [24].
The serializer on the sender side includes a clock multiplier or frequency synthesizer
and a parallel-to-serial multiplexer circuit with an I/O driver at the output, whereas clock recovery and serial-to-parallel conversion circuits with a receiver are the main components of the deserializer at the receiving end. A clock multiplier is one of the very important components of a SerDes system. In conventional high-speed SerDes systems like [29], [45] and [61], to achieve maximum bandwidth and pack a large number of parallel data lines onto one serial link, low-jitter fast-locking PLL based clock
multipliers/frequency synthesizers are used to drive the parallel to serial converters.
Similarly a clock recovery circuit, on the receiver side, employs sophisticated PLLs or
DLLs to recover the clock on the receiver end to capture and deserialize data back to a
parallel form. I/O drivers and receivers are used to provide a high bandwidth
communication channel between the sender and receiver. Different types of driver and
receiver pairs are used depending upon the desired operating parameters. The two most
commonly used kinds are low voltage differential signaling (LVDS) and current mode
logic (CML) drivers. LVDS drivers generally operate at 100-450 mV swing with DC-
coupled signal and provide a speed between 155 Mb/s and 2 Gb/s. CML drivers operate
at a higher voltage swing with DC or AC coupled signals and typically provide
bandwidth above 2.5 Gb/s [63]. To further enhance the communication channel
performance advanced techniques like feed forward equalization (FFE) and pre-
emphasis are used on the transmitter side, whereas on the receiver side equalization is
achieved through either passive equalizers, receiver peaking pre-amplifiers or decision
feedback equalizers (DFE) [8]. IBM’s state-of-the-art 6.4GB/s CMOS SerDes core [61]
employs CML drivers and receivers with a 4-tap FFE and 5-tap DFE for channel
equalization. Given the frequent usage of LVDS and CML drivers, component libraries for VLSI technologies (e.g., IBM Cu-11, Cu-08, etc.) typically provide a reasonable
choice of LVDS and CML drivers and receivers.
Figure 28. A Conventional SerDes System.
4.1.3 Design Conception
Many well characterized high-performance IP cores of SerDes systems that are
commercially available employ sophisticated components like low-jitter, fast-locking
PLLs or DLLs for clock multiplication or frequency synthesis and clock recovery
circuits to achieve high-quality communication. However, these characteristics make
the overall design extremely resource heavy in terms of silicon area and power
consumption. Moreover, these IPs are still far too expensive for medium budget
projects. On the other hand custom designing a SerDes system with a conventional
approach is not design time efficient, especially for systems where SerDes itself is not a
primary component of the system. The primary motivation of the proposed work is to
come up with a SerDes design approach that can avoid the resource heavy analog
components like PLLs and clock data recovery (CDR) circuits.
In the proposed design, the system clock and its phases are used for multiplexing the data onto the serial link, which avoids the requirement of PLL-based high-frequency clock generation to serialize the parallel data. On the transmitter side the two main challenges
involved in this approach are (1) generating a clean 50% duty cycle clock from the
system clock and (2) producing its exact phase that can be used to multiplex the parallel
data lines to serialized symbols of fairly equal duration. The statistical random sampling
technique is applied for both correcting the duty cycle [6] and generating multiple
phases of the clock [7]. To avoid complex CDR circuits, multiple serial lines are
grouped with a strobe that acts as a reference signal and is used for clocking on the
receiver side. On the receiver side the two main challenges are (1) compensation of the
relative skew of the multiple serial links that belong to the same strobe group and (2) de-skewing and aligning the strobe with respect to the serial data so that the data capturing edges occur in the eye of the data symbols to have maximum timing margin.
Programmable delay lines at the transmitter side are used, through which the launch
time of the serial data and strobe can be controlled.
4.1.4 Implementation of the Proposed SerDes Design
The SerDes design presented in this section has been targeted to the IBM Cu-11 standard cell library and is designed for a massively parallel computing architecture [9][1] with a system clock of 500MHz. Both transmitter and receiver sides are accessible
to a common host through which initial configurations, measurements and adjustments
can be done at start up before the high speed serial channels are activated. The high
speed communication is done through LVDS driver (OLVDS) and receiver (ILVDS)
cells, provided in the IO library of IBM Cu-11 technology. The SerDes design is based
on 4-to-1 multiplexing serializers (Tx-bit units) and 1-to-4 de-multiplexing deserializers (Rx-bit units), thus providing a 2 Gbps link at each serial channel. Seven high-speed bits are
grouped with a strobe signal to form a Tx-byte unit. A Tx-byte unit takes a 28 bit packet
and transmits it over 8 serial channels including a channel for strobe. The strobe is
generated by applying a bit pattern of alternating ones and zeros like 0101 to a Tx-bit unit that is the same as those used for serialization of data bits. On the receiver side the 8 serial channels are received at an Rx-byte unit which includes 7 Rx-bit units. An Rx-bit unit is a 1-to-4 deserializer that uses the edges of the strobe to capture and deserialize the
data back to parallel bits. Implementation details of the various components of the proposed SerDes system are provided in the following subsections.
4.1.5 Transmitter Side (Serializer)
4.1.5.1 Tx-Bit Unit
Figure 29 shows the standard cell based Tx-bit unit circuit. All four input data bits
D0~D3 are captured at the edge of the clock; bits D2 and D3 are recaptured at the negative
edge of the clock. Three balanced multiplexers are used to multiplex the data to a single
serial output. The timing diagram of the Tx-bit unit is shown in Figure 30. It takes a
symmetric (50% duty cycle) clock (CLK50) and its 90° lagging phase (CLK90) to
multiplex and serialize four parallel data bits. In the proposed design these two signals
are generated from the system clock. On large chips, the system clock is distributed
through large clock distribution trees made up of clock buffers and interconnects of
appropriate sizes to minimize skew and end-to-end delay. A noticeable degradation in
duty cycle can occur at the terminal ends of the signal distribution network. This is due
to the slight mismatch in the drive strengths of pull-up and pull-down networks of the
CMOS gates/buffers and non-uniformity in the distribution of wiring capacitance. To
conserve the portability of the design across different technologies a standard cell based
duty cycle corrector (DCC) circuit is designed to fix the problem of duty cycle
degradation.
Figure 29. Tx-bit unit: A 4-to-1 serializer circuit module.
Figure 30. Timing diagram of Tx-bit unit.
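A behavioral Python sketch of this 4-to-1 multiplexing step is given below; the bit-to-quarter ordering is an assumption made for illustration (the actual ordering is fixed by the multiplexer wiring in Figure 29), and the clocks are represented only implicitly by iterating over the four quarters of each cycle.

def serialize(words, order=(0, 1, 2, 3)):
    stream = []
    for d in words:                  # each d is one 4-bit word (D0..D3) captured per clock cycle
        for q in order:              # one serial symbol per quarter of the 500 MHz cycle -> 2 Gbps
            stream.append(d[q])
    return stream

print(serialize([(1, 0, 1, 1), (0, 1, 0, 0)]))   # -> [1, 0, 1, 1, 0, 1, 0, 0]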
4.1.5.2 Duty Cycle Corrector (DCC)
The transmitter side of the SerDes employs a DCC circuit similar to the one introduced in Section 2.3, with the small modification that this component also generates
CLK90. The duty cycle corrected symmetric clock is passed through a digitally
controlled delay line to produce its 90° lagging phase, so that the two can be fed to the Tx-bit unit as CLK50 and CLK90. The relative phase between CLK50 and CLK90 is adjusted with the help of a random sampling unit (RSU). The block diagram of this portion of the circuit is shown in Figure 31.
Figure 31. Duty Cycle Corrector (DCC) and Phase Generator.
4.1.5.3 Random Sampling Unit (RSU)
The RSU shown in Figure 32 combines the functionalities of the RSUs introduced in Sections 3.1.4 and 3.2.5. At any transition of control signal "Sample", "Counter
1” is loaded with “Desired Sample Size (n)” and the other two counters are reset.
Cascaded flip flops are used to address the metastability issues due to the asynchronous
nature of a random clock. At an active edge of the random clock, “Counter 1” is
decremented, whereas “Counter 2” is incremented only when the captured state of
“Signal 1” is high. Similarly “Counter 3” is incremented only when “Signal 1” is high
and "Signal 2" is captured low. When "Counter 1" decrements to zero, further sampling is stopped; at this point "Counter 2" provides the measurement of the duty
cycle of Signal 1 and “Counter 3” shows the phase difference of the two input signals.
The size of the counters used depends upon the required accuracy and confidence level.
Our design space exploration shows that a design with 16-bit counters could be
implemented within a modest area (1745 cells of size 0.4 μm x 4.8 μm in IBM Cu-11,
130nm technology) and provides 99% accuracy with a 99.9999% confidence level. To
make the correction process faster, coarser measurements can be done in the beginning
with smaller sized samples, and more accurate measurements can be performed with
large sized samples towards the end of the measurement and adjustment process.
Figure 32. Combined Random Sampling Unit (RSU)
4.1.5.4 Tx Byte Unit
Figure 33 shows the block diagram of a Tx-byte unit that integrates 8 Tx-bit units
to generate 7 high speed data streams and a reference strobe. A single DCC and phase generator in conjunction with an RSU is used to supply CLK50 and CLK90 to all 8 Tx-bit
units. A reference strobe sent from the transmit side negates any need for CDR on the
receiver side. This considerably reduces circuit complexities of the de-serializer at the
cost of a slight pin count increase (for strobe signal) as compared to conventional
SerDes systems. To compensate the skew of the data with respect to the strobe, due to
the routing and connecting wires to the receiver, the Tx-bit units are fed with CLK50
and CLK90 through controlled delay lines. By adjusting these delay lines, the launch
time of serialized data or the strobe can be varied. Multiplexers at the input of each Tx-
bit unit allow a selection between the regular data and alignment pattern.
Figure 33. Tx-Byte unit
4.1.6 Receiver side (Deserializer)
The receiver side contains the Rx-bit units and ring buffers to deserialize and align
the received bits of a data packet. The de-serialized data is deposited into the ring
buffers. After reset, the buffer reading process is kept on hold until all buffers have
received at least one data set.
4.1.6.1 Rx-bit unit and Rx-byte unit
Figure 34 shows the deserializer circuit with a ring buffer. The strobe is
passed through an equal delay fork to generate a true (stb) and inverted (stb_bar)
version of the received strobe signal. The “stb” and “stb_bar” signals are then used to
deserialize the received data back to four parallel bits. The two input flip flops capture
the state of the serial signal in every cycle of the strobe. A “write” signal is generated by
dividing the “stb” signal to capture the de-serialized data into the ring buffer when the
output of the Rx-bit unit is valid. Figure 36 shows the timing diagram of the deserialization process of the Rx-bit unit and the write signal. The Rx-byte unit is a simple integration of seven Rx-bit units and RSUs to observe the serial signals and strobe.
Figure 35 shows the Rx-byte unit.
Figure 34. Rx-bit unit (1-to-4 De-serializer) with Ring Buffer.
Figure 35. Rx byte unit
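The corresponding receive-side behavior can be sketched in Python as regrouping the captured serial symbols into 4-bit words before they are written to the ring buffer; alignment to word boundaries is assumed to have been established during the start-up adjustment phase.

def deserialize(stream):
    words = []
    for i in range(0, len(stream) - len(stream) % 4, 4):
        words.append(tuple(stream[i:i + 4]))       # one parallel word per two strobe cycles
    return words

print(deserialize([1, 0, 1, 1, 0, 1, 0, 0]))       # -> [(1, 0, 1, 1), (0, 1, 0, 0)]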
4.1.7 Analysis of the Design
The proposed SerDes is a unique design as compared to conventional SerDes
designs in many ways. Timing skew is compensated during the configuration stage as a
closed-loop system at boot time, whereas an open-loop timing convention is employed
at run-time for high-speed data communication. Effects of jitter are tolerated by having
a large timing margin. Looking at the overall picture, the proposed SerDes design is
based on some unusual tradeoffs as compared to conventional SerDes designs. In this
section we analyze different aspects of the design.
4.1.7.1 Design Time efficiency
The fast pace of technology scaling and reduced time to market require design time
efficient circuits and systems. To make the system design time efficient, the proposed
design methodology suggests the use of well-characterized library components instead
of custom designing components that may involve a long and tedious characterization phase before they can be used in the actual system. This design rule places a very hard
constraint on the design dimensions and removes a lot of design flexibility. In the
proposed design this constraint restricts the possibilities of using any custom designed
high-speed line driver and receiver pair. This practically imposes an upper limit on the
achievable communication speed by the SerDes designed under this rule, but advanced
component libraries greatly mitigate this limit. Typical standard cell libraries and IO
libraries are becoming fast enough to meet the speed requirements, e.g., the LVDS drivers provided in the IBM Cu-11 IO libraries are sophisticated enough to handle a
data rate of about 2~3 Gbps.
Figure 36. Timing diagram of 1-to-4 de-serializer (Rx-Bit unit).
4.1.7.2 Portability of Design
The proposed SerDes is strictly a standard cell based design that does not require
any custom designed component. This characteristic makes the design practically portable
to any standard cell ASIC or FPGA technology. The requirement of resource heavy
components like a PLL or DLL for frequency synthesis and CDR circuits is negated by
using a standard cell based RSU that can provide a comparable performance when used
with an appropriate sample size. The trade-off here is the startup time the SerDes requires to get ready before it can start its high-speed data communication. In high end SerDes systems this time is reduced by using fast-locking PLLs and CDRs. The setup
time of the proposed SerDes design may range from 500ms to a few seconds depending
upon the average frequency of the random clock driving the RSUs and the convergence characteristics of the search algorithms employed to adjust the delay line taps. If the system
has to run for a considerable time before it is restarted then the setup time of a few
seconds becomes insignificant.
4.1.7.3 PIN count reduction ratio
The serializer employs the Tx-bit unit which is a 4-to-1 multiplexer unit, but
a quantitative analysis easily indicates that the package pin count reduction ratio is not
exactly 1/4. This is due to a strobe signal which is grouped with every 7 high-speed bits,
i.e., 28 parallel bits are communicated over 8 high-speed channels, thus the actual pin
count reduction ratio of this design is 2/7 instead of 1/4.
4.1.8 Experimental Results
The proposed SerDes design targeted to IBM Cu-11 is laid out for a 1600 ball grid
array (BGA) pin package using area pad drivers (OLVDS) and receivers (ILVDS) of
IBM IO libraries. The design is then tested through HSPICE simulations for lossy
transmission lines connecting wire models of 50 Ω co-axial cable, wires on FR4 and
NELCO13 boards of unequal lengths for data and strobe lines. Lengths of the wires are
set such that the delay lines compensating the strobe-data skew emulate the
worst case scenario. Figure 37 shows an eye diagram of the data signal received
including line loss, crosstalk, reflections and jitter at the pads of an LVDS receiver with
respect to the edges of the strobe signal for FR-4 board wires. The simulations include
full loss, crosstalk, power supply reflections, and jitter. In spite of the jitter in the strobe
the wide opening of the eye shows a timing margin that corresponds to a tolerance of around ±200ps to other sources of noise like crosstalk, etc.
Figure 37. Eye diagram of simulated results of a SerDes link.
4.1.8.1 Deductions
The increasing trend of multi-core processors and embedded systems demands large signal counts. SerDes is a direct solution to reduce package pin count by practically time multiplexing multiple signals onto the same pin. However, the
conventional designs of SerDes systems come with large die area, operating power and
financial cost. The proposed SerDes design in this section uses a unique design
technique to avoid resource heavy components used in conventional SerDes designs.
The design is based on standard cell technology and is portable to any other technology. The power consumption of the proposed SerDes design is 30 mW per serial link targeted to IBM Cu-11 (130 nm) technology, nearly a 2.5x improvement over the conventional design with 60% less area.
4.2 Proposed Technique Applied to DDR/DDR2 Bus Interface Timing
This section describes a new way to tackle the critical bus cycle timing issue related
to DDR/DDR2 DRAM bus transactions using the statistical random sampling
technique [11]. This allows a pure standard cell based design which is inherently area, power and design time efficient compared to the existing solutions proposed in the literature. The proposed scheme is successfully applied over high-speed on-chip signals
to measure and correct the duty cycle of clocks to produce source synchronous signals
and to adjust the phase of the incoming strobe to correctly capture the data. The
proposed technique is employed to interface Samsung K4T51163QB_D5 DDR2 chips to an IBM TJ Watson massively parallel processing ASIC [1], targeted to IBM
Cu-08 (90 nm technology). The measurement and correction results obtained very
closely track the expected theoretically estimated results. The proposed design is a fully
digital solution based on standard cell components and does not require any custom
designed component. This makes it extremely design time efficient and portable across
most ASIC and FPGA technologies.
4.2.1 Background of the DDR/DDR2 SDRAMS
RAMBUS [36] and Double Data Rate (DDR) [38] memory technologies were introduced to achieve high data throughput [73] and reduce the ever-increasing performance gap between digital logic and memory subsystems. These memory
technologies utilize source-synchronous double data rate techniques to achieve higher
data bandwidth. In DDR/DDR2 a strobe is sent along with a group of data bits, but
timing of the strobe with respect to the data is very critical and differs for read and write
bus operations. Figure 38 shows typical read and write timing of DDR/DDR2 for a read
latency (RL) of 5 cycles and a burst length of four. Notice that during the write operation
the edges of the strobe have to reach the memory centered within the burst of data bits
to maximize jitter tolerance and timing margin, whereas during read cycles the edges
of both data and strobe are launched from the memory chip simultaneously, leaving
the responsibility of delaying the strobe to correctly capture the data with the receiving
logic.
Figure 38. A typical read and write cycle timing of DDR/DDR2.
It is also very important to have a clock with a balanced (50%) duty cycle on the
transmitting side, to keep the timing of the individual data bits and the strobe equalized
so that the DRAM can correctly capture the data [38]. To correct the phase of the strobe
relative to the data and the duty cycle of the clock, the design approaches already
proposed in the literature can be categorized broadly into pure analog, mixed-signal and
all-digital solutions. Purely analog solutions to this problem usually employ an analog
phase-locked loop (PLL) and phase detectors, which makes the design extremely resource
heavy. These circuits are obviously not a good choice when die area is the most
important constraint. The mixed-signal design approaches are better than a pure
analog approach [53] but require very careful design, independent of process, voltage
and temperature variation, for analog components like charge pumps, integrators, etc.
Most of the previously proposed all-digital solutions [18][70] either produce less
accurate timing compensation or use some custom designed digital components.
Compared to these, the proposed approach reduces circuit-level complexity by
exploiting the well-known Law of Large Numbers [62] for statistical estimation and
repeatedly captures the state of the signal(s) to be measured at random instants of time.
A large sample of predetermined size is gathered, which provides the measurement
results with an accuracy and confidence level as high as desired. Statistical random
sampling has long been used to quantitatively estimate a particular attribute of a given
data set by randomly sampling some correlated observable phenomenon [42]. This work
shows how the proposed technique can successfully be applied over high-speed on-chip
signals to measure and correct the duty cycle of clocks to produce source synchronous
signals and to measure and correct the phase of the incoming strobe to correctly capture
the data. To better understand the problem the following subsection provides a quick
review of source synchronous I/O systems.
4.2.1.1 Source Synchronous I/O Systems
The relentless scaling of integrated circuit technologies in recent years has pushed
on-chip processing capability into the multi-GHz regime. To support overall system
performance in this regime, reliable, high-speed inter-chip communication networks are
indispensable. Source-synchronous signaling used in RAMBUS and
DDR/DDR2 memory systems is a widely accepted technique for high-speed parallel
bus interfaces in digital systems [4]. Figure 39 shows a typical source synchronous
channel with a PLL on the transmission side to provide a balanced and stable
synchronization clock to launch the data and strobe. A separate channel is dedicated to
transmit the clock or strobe, which is then de-skewed at the receiver by a delay locked
loop (DLL) to sample at the middle of the data eye. The cost of the additional clock
channel is amortized by sharing it among several parallel data channels.
Figure 39. Source synchronous I/O channel
In contrast to the conventional common clock signaling [20] discussed in Section 3.2, a
source synchronous bus provides a sampling strobe synchronized with the data [20].
Figure 40 shows the setup and hold timing of a typical source synchronous bus. In this
technique the absolute signal propagation delays (flight times) are omitted from the
timing equations because both data and strobe are sourced from the same transmitter
and a carefully designed printed circuit board (PCB) equalizes the propagation delays.
All delay terms are converted to differential delays, which are represented relative to the
sampling edge of the strobe.
Figure 40. Source synchronous write transaction timing.
For common clock signaling [20] the design optimization problem is actually the
minimization of the signal propagation delay spread across the manufacturing process
and the environmental conditions, whereas for the source synchronous scheme the
design optimization problem reduces to the minimization of the differential delays (or
skews) between the signals and the associated strobe [1]. The basic source synchronous
bus timing optimization equations, when data is transmitted on both edges of the strobe,
are given as follows:
T_vb > T_su(receiver) + (t_f-data(max) − t_f-strobe(min))        Equation 22

T_va > T_h(receiver) + (t_f-strobe(max) − t_f-data(min))        Equation 23

(T_vb + T_va) < 0.5 T_cycle        Equation 24
where T_vb and T_va are the minimum times the signal is required to be valid at the
receiving component before and after the sampling edge of the strobe, respectively.
Times of flight (propagation delays) of data and strobe are represented by t_f. The
difference term in the above equations arises from timing uncertainties, and it has a
dynamic and a static component. The static component arises from mismatched
parameters like impedance and length of the two channels; it is usually called skew.
Conventionally it is compensated with a PLL or a DLL on the receiving end as shown
in the figure. The main sources of the dynamic component are signal jitter, crosstalk,
ambient noise and intersymbol interference (ISI). The dynamic part of this timing
uncertainty is usually tolerated by having timing margins for both setup and hold
timing, given all other techniques to minimize these effects are already employed. It is
obvious from the above equations that positioning the edges of the strobe within the eye
of the data and duty cycle of the strobe is very critical to obtain the optimized data rate
and timing margin.
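To make the use of Equations 22 through 24 concrete, the following minimal sketch (in Python) checks a candidate bus timing budget against them. The numeric values for the setup/hold requirements, flight-time spreads and cycle time are hypothetical illustrations only and are not taken from the actual DDR2 interface described here.

```python
# Illustrative check of the source synchronous timing constraints
# (Equations 22-24). All numbers are hypothetical examples, in ns.

def check_bus_timing(t_vb, t_va, t_su, t_h,
                     tf_data_min, tf_data_max,
                     tf_strobe_min, tf_strobe_max, t_cycle):
    """Return True if the valid-before/valid-after windows satisfy
    Equations 22-24 for the given flight-time spreads."""
    setup_ok = t_vb > t_su + (tf_data_max - tf_strobe_min)   # Equation 22
    hold_ok = t_va > t_h + (tf_strobe_max - tf_data_min)     # Equation 23
    rate_ok = (t_vb + t_va) < 0.5 * t_cycle                  # Equation 24
    return setup_ok and hold_ok and rate_ok

# Hypothetical double-data-rate budget with t_cycle = 5 ns.
print(check_bus_timing(t_vb=0.9, t_va=0.9, t_su=0.35, t_h=0.35,
                       tf_data_min=1.0, tf_data_max=1.4,
                       tf_strobe_min=1.1, tf_strobe_max=1.3,
                       t_cycle=5.0))
```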
4.2.2 Implementation
The DDR/DDR2 SDRAMs use a differential clock (CK) input to latch the address
and command signals. The address and command setup and hold times from CK are
represented by t_IS and t_IH respectively [38]. For a write cycle, the controller sends strobe
and data to the SDRAM one cycle following the write command, within the t_DQSS timing
specification [38]. The data is latched on both edges of the strobe DQS. For a read cycle,
the SDRAM takes a few cycles (the CAS latency [38]) to assert data DQ and strobe DQS. The
DQ-to-DQS skew is given by the parameter t_DQSQ [38], and the data is held valid for
the t_QH time duration. During a read cycle, the DDR implements a delay locked loop
(DLL) circuit which tracks both edges of the CK input signal and aligns the DQS
output edges with the CK input edges. For the controller design, the read cycle timing is a
complete loop from the read command launch time until the DQS signal appears at its
receiver. During a write cycle, the controller launches DQS after CK with a delay of
t_DQSS. In this case, the timing specifications t_DQSS/t_DQSH with respect to CK falling edges
are used for timing the data bus.
4.2.2.1 Conceptual Block Diagrams
As noted above, the DDR/DDR2 SDRAMs latch the address and command signals with a
differential clock (CK) input. The data (DQ) and strobe (DQS) are bidirectional buses which
are required to be “turned around” to switch between read and write cycles. During
write cycles the controller is expected to send the data with the edges of the strobe centered
within the data. This becomes a trivial problem if the duty cycle of the clock launching
the data and strobe is fully balanced to 50%. In practice a noticeable degradation in duty
cycle can be observed at the terminal ends of the clock tree, even if the clock is
generated with a perfectly stable and accurate oscillator or PLL, due to the slight
mismatch in the drive strengths of the pull-up and pull-down networks of the CMOS
gates/buffers and non-uniformity in the distribution of wiring capacitances. Hence, a
local duty cycle correction circuit is required to fix this problem. Figure 41 shows the
proposed circuit of a duty cycle corrector (DCC). To locally correct the duty cycle of the
clock signal, its state is repeatedly observed and recorded at random instants of time.
The probability of capturing a logic high (one) in a particular random observation
directly corresponds to the duty cycle under measurement. The accuracy and confidence
level of the measurement can be controlled with the size of the sample.
The proposed circuit then delays the input signal with a digitally controlled delay line;
the delayed and original inputs are ORed or ANDed to stretch or chop the asymmetric
clock signal, respectively, producing a balanced output duty cycle as shown in
Figure 42.
Figure 41. Block diagram of DCC circuit.
Figure 42. Duty Cycle Correction.
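The following minimal sketch models this measurement principle in software: the clock state is captured at random instants, the fraction of logic-high samples estimates the duty cycle, and the sign of the error suggests whether the corrector would stretch (OR) or chop (AND) the waveform. The waveform model, duty cycle value and sample size are illustrative assumptions and do not represent the actual DCC hardware.

```python
import random

def clock_state(t, period=2.0, duty=0.43):
    """Ideal clock model: '1' during the first duty*period of each cycle."""
    return 1 if (t % period) < duty * period else 0

def estimate_duty_cycle(n_samples=38416, period=2.0, duty=0.43):
    """Estimate duty cycle as the fraction of '1' states captured at
    random sampling instants (the random-clock edges of the RSU)."""
    ones = sum(clock_state(random.uniform(0, 1000 * period), period, duty)
               for _ in range(n_samples))
    return ones / n_samples

est = estimate_duty_cycle()
print(f"estimated duty cycle = {est:.3f}")
# A controller would then pick the correction: OR with a delayed copy to
# stretch the high phase if est < 0.5, AND to chop it if est > 0.5.
action = "stretch (OR)" if est < 0.5 else "chop (AND)"
print("suggested correction:", action)
```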
The second critical timing issue, delaying the edges of the strobe to capture the
data correctly, is also handled using the statistical random sampling technique. Figure 43
shows that both data and strobe are received through well matched channels. The strobe is
then delayed using a digitally controlled delay line to center its edges within the data and
attain maximum timing margin. To accurately set up the delay line, a balanced clock is
fed to it. The original and delayed signals are fed to a random sampling unit
that measures the relative timing by simultaneously capturing the state of the two signals at
random instants of time. As shown in Figure 44, the required phase delay corresponds to
region “A”; the joint probability of capturing the two signals with region code 10
directly corresponds to the phase delay to be measured and corrected.
Figure 43. Block diagram of receiver strobe timing circuit
Figure 44. Delayed strobe timing (one cycle T_cycle of the strobe and its delayed copy is divided into regions A, B, C and D; region A corresponds to the captured value 10).
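The following sketch models this joint-sampling idea in software: the strobe and its delayed copy are captured at the same random instants, and the fraction of samples that fall in region “A” (captured code 10) estimates the delay as a fraction of the cycle. The square-wave model, delay value and sample size are illustrative assumptions only.

```python
import random

T_CYCLE = 2.0          # hypothetical strobe period (ns)
TRUE_DELAY = 0.35      # hypothetical delay-line setting to be measured (ns)

def square(t, period=T_CYCLE):
    """50% duty cycle square wave: '1' during the first half cycle."""
    return 1 if (t % period) < period / 2 else 0

def estimate_delay(n_samples=38416):
    """Estimate the strobe/delayed-strobe phase delay from the joint
    probability of capturing region code 10 (strobe=1, delayed=0)."""
    hits = 0
    for _ in range(n_samples):
        t = random.uniform(0, 1000 * T_CYCLE)           # random instant
        code = (square(t), square(t - TRUE_DELAY))      # (strobe, delayed)
        if code == (1, 0):                              # region "A"
            hits += 1
    return (hits / n_samples) * T_CYCLE                 # P(10) * T_cycle

print(f"estimated delay = {estimate_delay():.3f} ns (actual {TRUE_DELAY} ns)")
```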
4.2.3 Analysis of the Overall Design
The DDR/DDR2 bus interface design described in this section is unique compared
to conventional designs in many ways. Timing skew is compensated during
the configuration stage at boot time; timing margins are maximized to tolerate dynamic
timing uncertainties like jitter, crosstalk, etc. In this section we analyze different aspects
of the design.
4.2.3.1 Design Time efficiency
The design employs well-characterized library components instead of custom
designed components that may involve long and tedious characterization phases before
they can be used in production systems. The pure standard cell based ASIC design flow
makes the overall design and verification time much shorter. This design time efficiency
reduces the time to market of the final product.
4.2.3.2 Portability of Design
The proposed DDR/DDR2 interface design is strictly a standard cell based design
that does not require any custom designed component. This characteristic makes the
design practically portable to any standard cell ASIC or FPGA technology. The
requirement of resource heavy components like a PLL or DLL for frequency synthesis
and data/strobe alignment is negated by using a standard cell based RSU that can
provide a comparable performance when used with an appropriate sample size. The
trade-off here is the configuration time at startup of the interface. If the system runs
for a considerable time before it is restarted, then a setup time of a few seconds
becomes insignificant. This startup configuration time can be reduced significantly by
storing the configuration corresponding to the native PCB and socket in some non-
volatile memory and loading it into the configuration registers as part of the boot-up
sequence.
4.2.3.3 Flexibility of Bus Timing Specification
The strobe can easily be placed at any specific point within the data eye, which means that if
the setup and hold times of the bus protocol are changed for some technical
reason, the proposed interface can adapt to the new timing specifications.
4.2.4 Verification Methodology and Results
The proposed technique has been applied to interface Samsung K4T51163QB_D5
DDR2 chips to a massively parallel processing ASIC [1]. The design initially
targeted IBM Cu-11 (130 nm technology), but because of rapidly scaling technologies,
program management transitioned the entire project to IBM Cu-08 before the final
implementation phase. The portability of the proposed technique paid off, and the
DDR/DDR2 timing interface design was conveniently ported to IBM Cu-08 (90
nm technology). To handle signal integrity issues, 1.8V SSTL (stub series-terminated
logic) drivers of the IBM Cu-08 IO library are used. For functional and timing
verification a very accurate simulation model provided by Denali [21] is employed.
Extensive simulations were conducted to prove the effectiveness of the proposed
technique. The measurement and correction results obtained closely matched the
theoretically estimated results. As noted earlier, the design is a fully digital solution
based on standard cell components that requires no custom designed component,
making it extremely design time efficient and portable across most ASIC and FPGA
technologies.
4.2.5 Deductions
This section described yet another very important application of the proposed
synchronization and timing technique that allows a pure standard cell design. Similar to
the SerDes design, the DDR/DDR2 bus timing interface built using the proposed
technique eliminates the DLL or PLL from the design, which considerably reduces its
overall footprint. This not only reduces circuit complexity but also eliminates the
power perpetually consumed by power-hungry components such as PLLs and
DLLs. The inherent portability of the employed design technique allowed the design,
initially built on IBM Cu-11, to be easily ported to IBM Cu-08 technology.
Chapter 5: Proposed Technique with Scaling
Technologies
For more than 30 years, MOS device technologies have been improving at a
dramatic rate. A large part of the success of the MOS transistor is due to the fact that it
can be scaled to increasingly smaller dimensions, which results in higher performance.
The ability to improve performance consistently while decreasing power consumption
has made CMOS architecture the dominant technology for integrated circuits. The
scaling of the CMOS transistor has been the primary factor driving improvements in
microprocessor performance. Transistor delay times have decreased by more than 30%
per technology generation resulting in a doubling of microprocessor performance every
two years. Given this, it is very important to analyze any new scheme to determine its
validity with rapidly scaling technologies. Portability is a vital element of the proposed
scheme. In this chapter the effects of scaling technologies on the proposed
synchronization and timing technique are analyzed. The primary factors and parameters
of the technological platform which may affect the measurement and adjustment
accuracy of the proposed technique are considered. Later in the chapter a
comprehensive design flow of a Random Sampling Unit for typical given technological
parameters is described.
5.1 Technology dependency
The proposed technique strongly advocates the use of standard cell technologies to
maintain the portability of the design across changing technologies. This is possible
because the proposed technique employs some rudimentary primitives of digital logic
circuits. For precise analysis let us consider separately the two parts of accuracy of the
proposed technique, namely measurability and adjustability.
5.1.1 Measurability
As shown in Chapter 2, theoretically the measurement accuracy of statistical
random sampling can be increased by increasing the sample size of the gathered
statistics. However, in practice there are some platform technology dependent
parameters that limit the achievable accuracy of this technique. When the statistical
random sampling technique is applied to digital VLSI signals the measurement
accuracy depends upon the fidelity of the sampling element and quality of the random
event generator. The dependency of the measurement accuracy of a random sampling
unit (RSU) on the quality of the random clock employed has already been discussed in
detail in Sections 2.3 and 3.3. Ideally the sampling element employed should capture
the state of the signal under observation right at the given random instant of time.
Typical Digital Signal Processing (DSP) systems employ very sensitive Sample and
Hold (S/H) functions for this purpose, which obviously are analog components with
several sampling fidelity issues related to their use, like linearity, sampling speed and
quantization noise. In pure standard cell based digital platforms the luxury of an S/H is not
available, and the state of the signal under observation is captured with regular flip-flops
or latches available in standard cell libraries. These components come along with
technology dependent parameters like setup time, hold time and capture delays.
Although in digital systems we don’t have to deal with complex issues like quantization
noise, the technology dependent parameters of flip-flops or latches have to be carefully
considered to correctly capture the state of the signal or signals under observation at any
given random instant of time. Let us look into the effects of these parameters on the
measurement accuracy when using statistical random sampling.
5.1.1.1 Effect of Setup and Hold time
“Set-up time” and “hold time” describe the timing requirements on the data input of
a flip-flop or register with respect to the clock input. The two parameters define the time
during which data must be stable at the input of the flip-flop or register in order to
guarantee predictable performance over the full range of operating conditions and
manufacturing tolerances. A positive set-up time describes the length of time that the
data must be available and stable before the active clock edge. A positive hold time, on
the other hand, describes the length of time that the data to be clocked into the flip-flop
must remain available and stable after the active clock edge. The scan-able flip-flops in
new technologies like IBM Cu-11 and Cu-08 manifest a negative hold time, which
suggests that the data may make a transition even before the occurrence of an active edge
of the clock. If an active edge of the sampling clock occurs too close to a transition of
the data input, violating setup or hold time, the flip-flop may not capture the data state
as intended, or at best the flip-flop could enter a meta-stable state [25][67]. A meta-
stable flip-flop takes a longer time to settle to a stable ‘0’ or ‘1’ state. An obvious
question arises: how do setup and hold times affect the accuracy achievable through
statistical random sampling? Figure 45 shows a signal under observation with shaded
regions representing where the occurrence of a sampling edge would violate setup or
hold time. Figure 45(a) shows the case when both setup (t_su) and hold (t_h) times are positive
and Figure 45(b) shows a negative hold time case. The suffix ‘0’ or ‘1’ represents the
state for which the violation has occurred, or the state that would have been captured had
there been no violation.
Figure 45. Setup and Hold Time Violation Regions
A flip-flop in a meta-stable state settles down to ‘0’ or ‘1’ with equal probability
[72]. Notice that theoretically setup and hold time violations are the same whether the
data to be captured has a value ‘0’ or ‘1’. Thus for a large sample size the effect of
metastability due to setup and hold time violations evens out. Although separate values
of the t_su0, t_h0 and t_su1, t_h1 parameters are not usually documented in the datasheets of
component libraries, for any practical flip-flop the values of t_su0, t_h0 and t_su1, t_h1 may not
be equal because of the mismatched mobility of the current carriers of the pull-up and
pull-down transistors [53][71][72]. For precise estimation of these values, simple SPICE
simulations can be performed. The differential of these parameters, Δt_v, given by the
following equation, provides an estimate of the inaccuracy introduced in measurements
through statistical random sampling when the particular flip-flop is employed.

Δt_v = | (t_su0 + t_h0) − (t_su1 + t_h1) |        Equation 25
5.1.1.2 Minimum Detectable Pulse width
Setup and hold times of the flip-flop used to capture the state of the signal under
measurement also limit the smallest positive or negative pulse width T_PW-min that can be
measured through the proposed statistical random sampling technique. Figure 46 shows
how the setup and hold time violation regions join each other when the pulse width of the
signal approaches T_PW-min. T_PW-min is given by the following equation:

T_PW-min = max( (t_su0 + t_h0), (t_su1 + t_h1) ) ≈ t_su + t_h        Equation 26
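As a small worked example of Equations 25 and 26, the snippet below computes Δt_v and T_PW-min from a set of flip-flop characterization values; the picosecond numbers are hypothetical stand-ins for what a SPICE characterization might return.

```python
# Hypothetical flip-flop characterization values (ps) for capturing a '0'
# versus a '1'; in practice these would come from SPICE simulations.
t_su0, t_h0 = 42.0, 12.0
t_su1, t_h1 = 38.0, 15.0

# Equation 25: measurement inaccuracy contribution of the flip-flop.
delta_tv = abs((t_su0 + t_h0) - (t_su1 + t_h1))

# Equation 26: smallest pulse width measurable with this flip-flop.
t_pw_min = max(t_su0 + t_h0, t_su1 + t_h1)

print(f"delta_tv = {delta_tv:.1f} ps, T_PW-min = {t_pw_min:.1f} ps")
```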
5.1.1.3 Effect of Clock-to-Q Propagation Delay
Clock-to-Q propagation delays t_clk-Q-lh and t_clk-Q-hl, i.e., when the output switches from
low to high and from high to low respectively, are other technology dependent
parameters associated with flip-flops. Distinct values of t_clk-Q-lh and t_clk-Q-hl are usually
documented in the datasheets of standard cell library flip-flops. Notice that these
parameters merely indicate how long a captured state takes to appear at the output Q of
the flip-flop after it is captured at a given random instant of time, triggered by an active
edge at the Clk input of the flip-flop. It is obvious that these parameters do not affect the
measurability or accuracy of the proposed technique.
Figure 46. Minimum Detectable Pulse width
5.1.2 Adjustability
Since the proposed technique advocates the use of standard cell technology to keep
the target design portable and design time efficient, it brings a limitation to the
adjustability of the signal under observation. As Sections 3.1 and 3.2 show, with standard
cell technology the timing of the signals is adjusted by digital delay lines. The digital
delay lines limit the resolution of the timing adjustment because the basic delay element
determines the finest achievable adjustment resolution. To achieve higher delay
resolution, the fastest possible component is used to build the delay line. As the examples
in Sections 3.1 and 3.2 show, since the proposed technique provides a very accurate way
to measure the timing, the adjustment inaccuracy at a given technology can be reduced
to less than or equal to half the finest achievable delay resolution. Scaling of technologies
typically makes individual components faster; this trend apparently improves the
adjustability achievable through the proposed timing technique. However, if operating
frequencies also scale with scaling technologies, this improvement does not make any
significant relative difference.
5.1.3 Deductions
In this chapter it is shown how the proposed technique remains valid with scaling
technologies. The setup and hold time differential may slightly affect the measurement
results. This effect can be minimized by selecting a flip-flop from the given library that
is characterized by a minimum value of the setup and hold time differential. The
adjustability of timing through the use of delay lines scales well with scaling
technologies.
Chapter 6: The Scope, Limitations and Extensions
This research introduces a technique of applying statistical random sampling to
digital VLSI signals related to synchronization in order to observe certain characteristic
parameters. It provides observability for signals buried deep inside VLSI chips, which
gives rise to new ways of looking at circuit and board level synchronization problems.
This also opens up new avenues in synchronization strategies that let VLSI designs adapt
to circuit and system level timing uncertainties. In addition to summarizing the
important findings of this research, this chapter also elucidates the scope of the
proposed synchronization and timing technique along with some of its limitations
compared to conventional timing techniques. This chapter also presents a few possible
directions to extend this research to solve other timing and measurement problems.
6.1 Scope of the Proposed Synchronization and Timing Technique
This research mostly focused on two synchronization and timing problems, i.e.,
duty cycle and relative phase measurement and correction. The timing uncertainties in
digital circuits and systems are mainly categorized as skew and jitter [20]. The part of
the timing uncertainty due to mismatched lengths and impedances of interconnects,
process variations, mismatched rise/fall delays, pin parasitic parameters, etc., is
generally time invariant for a system at given operating conditions and is grouped
together under the term “skew”.
Jitter is the dynamic part of timing uncertainties and comes from effects like ambient
electrical noise, crosstalk, inter-symbol interference (ISI) [20][51] and delay variations
due to power supply variations and IR drops. The proposed technique provides an
accurate way of measuring the static part of the timing uncertainty by averaging out the
measurement error caused by jitter, owing to its zero-mean random characteristic [20]. The
underlying assumption of the technique is that operating conditions are maintained
during the measurement, correction and operation of the circuit. Should the operating
conditions change, the measurement and correction must be reiterated for correct
operation. Timing circuits of conventional closed-loop systems trade off area to allow
frequently changing operating conditions by employing phase locked loops (PLLs) or
delay locked loops (DLLs). The basic idea of an active closed-loop skew compensation
is to reduce exactly as much skew as needed. However, if the operating condition of a
system is not time varying, it does not require frequent adjustments and fast locking
mechanisms to compensate the skew. In massively parallel digital VLSI designs [9][1]
die area is extremely precious, as designers seek to accommodate as much logic as possible.
Since these kinds of systems are usually operated under well-controlled operating conditions,
the proposed timing techniques become extremely attractive because of the area
efficiency achieved.
When high-speed digital signals are buried deep inside a die, an accurate local
measurement and correction of such signals becomes extremely expensive in terms of
circuit complexity and die area. Sections 3.1 and 3.2 show how a few flip-flops
sampling the signal or signals under observation can provide extremely accurate
measurements of their timing parameters like pulse width, duty cycle and relative phase.
In a purely standard cell technology platform where custom designed components
are extremely prohibitive because of the design effort and characterization time
involved, the proposed technique proves to be extremely useful. It allows design
engineers to choose from standard available digital components for a given technology
and still achieve high measurement accuracy. Circuit designs based on the proposed
technique remain equally valid for technologies like FPGAs where the luxury of
complex custom designed components is often non-existent.
6.2 Limitations of the Proposed Technique
This research mainly focused on the many-fold advantages of the proposed
technique in terms of area, power and design time efficiency. It is noted that the
proposed technique becomes the only choice when custom designed components are
unavailable or prohibitively expensive. However, it is important to keep in view some
drawbacks, limitations and tradeoffs of the proposed techniques.
6.2.1 Measurement Time
The statistical random sampling technique applied to digital signals proposed in
this research exploits the law of large numbers [62][27] and collects a large sample, of
predetermined size [6], of measurements of the signal or signals under observation. This
makes it an inherently slow measurement process. The measurement time also depends
upon the average frequency of the random clock employed to capture the state of the
signal or signals under observation. The proposed random sampling technique finds its
real worth in applications where the signal under measurement has so high a frequency
that sampling it at higher frequencies may be impractical or even prohibitive. But if
technological limits allow, then the upper bound on the average sampling frequency
established in Section 2.3.3 would allow a faster measurement. For a signal of 500
MHz sampled at an average frequency 10 times slower, i.e., 50 MHz, measuring with
99% accuracy and 95% confidence would take approximately 0.8 ms. The lock time of
typical on-chip PLLs ranges from 200 ms to 400 ms. In this amount of time, the proposed
technique would be able to undergo 250 to 500 measurements, which should be more
than enough to converge to a locked point.
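The 0.8 ms figure can be reproduced with the standard sample-size formula for estimating a proportion, n = z^2 p(1-p) / E^2, assuming the worst case p = 0.5, a 95% confidence level (z = 1.96) and a half-width E of 0.5%; this reading of the 99% accuracy figure is an assumption made here only to illustrate the arithmetic.

```python
import math

def samples_needed(confidence_z=1.96, half_width=0.005, p=0.5):
    """Sample size for estimating a proportion to within +/- half_width
    at the given confidence level (worst case p = 0.5)."""
    return math.ceil(confidence_z**2 * p * (1 - p) / half_width**2)

def measurement_time(avg_sample_rate_hz, **kwargs):
    """Time to gather the required sample at the given average rate."""
    return samples_needed(**kwargs) / avg_sample_rate_hz

n = samples_needed()                      # 38,416 samples
t = measurement_time(50e6)                # 50 MHz average random clock
print(f"{n} samples -> {t*1e3:.2f} ms")   # roughly 0.77 ms, i.e. about 0.8 ms
```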
6.2.2 Static Compensation
The proposed technique suggests closed-loop measurement/adjustment operation
during the configuration phase and open-loop operation during the working
phase. That is, the static timing uncertainties are compensated only at configuration
time. Any change in timing due to a change in operating conditions or a physical
change in the logic devices may affect the timing during regular operation.
6.2.3 Continuous Monitoring
As mentioned above, if the operating conditions are subject to change, it is
indispensable to regularly take measurements and perform appropriate adjustments as
needed. This is only possible if the subject signal or signals are inherently periodic, like
clock signals. If the signals during regular operation are not periodic in nature, then
sensing a change in the timing parameters due to some change in operating conditions
requires an interruption of the regular operation of the circuit to excite the timing path with
a periodic signal. A more efficient approach is to have a dummy path, well
matched with the active paths, that is always kept excited with a periodic signal for continuous
monitoring of any change in the timing parameters.
6.2.4 Controller Logic or Software
The duty cycle correction and skew compensation circuits shown in Sections 3.1 and
3.2 were controlled by an intelligent software controller that initiates and reads the
measurements and intelligently decides which tap of the compensation delay line is to be
selected. If a particular application does not have an on-chip controller, then hooks for
an external controller should be provided, or, if die area allows, an on-chip state machine
can be employed as a controller.
6.2.5 Convergence Speed
The proposed technique allows high measurement accuracy, but as discussed in
Section 5.1.2 the adjustability depends upon the delay line elements, which determine
the adjustment resolution. Theoretically, if the delay line provides equally spaced timing
increments corresponding to the delay line input control parameter, then a simple binary
search algorithm [19] can quickly converge to the desired point where a given
uncertainty is best compensated. In practice, due to non-uniform interconnect delays and
slight variations in the delay elements comprising the delay line, equally spaced timing
increments are not fully achievable. In this case, the controller may use a
binary search algorithm for coarse adjustment, but for fine adjustment it can resort to a
linear search to determine the delay line tap that best compensates the target timing
uncertainty, as sketched below.
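The controller's tap search could follow the sketch below: a binary search on the sign of the measured error for coarse adjustment, followed by a linear scan of the neighboring taps to pick the smallest residual. The measure_error callback stands in for an RSU measurement and is a hypothetical interface, not the controller actually implemented in this work.

```python
def find_best_tap(num_taps, measure_error):
    """Coarse binary search then fine linear search over delay-line taps.

    measure_error(tap) returns the signed residual timing error observed
    by the random sampling unit with the given tap selected
    (negative: delay too short, positive: delay too long).
    """
    lo, hi = 0, num_taps - 1
    while lo < hi:                        # coarse: binary search on sign
        mid = (lo + hi) // 2
        if measure_error(mid) < 0:
            lo = mid + 1
        else:
            hi = mid
    # fine: linear scan around the coarse result to absorb non-uniform steps
    window = range(max(0, lo - 2), min(num_taps, lo + 3))
    return min(window, key=lambda tap: abs(measure_error(tap)))

# Example with an idealized delay line of 64 taps, ~7 ps per tap, and a
# 150 ps skew to cancel (all numbers hypothetical).
print(find_best_tap(64, lambda tap: tap * 7.0 - 150.0))
```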
6.3 More Applications of the Proposed Technique
The proposed synchronization and timing techniques can be applied to delay testing
of a digital circuit that can be used to more accurately characterize a fabrication process.
Individual gates can be tested for their delay behaviors under different operating
conditions [43]. Despite the use of well-characterized library components, process
variations affect the yield. If the foundry inserts some timing compensation delay-lines
and state capturing flip-flops that may allow post fabrication timing adjustments, then
the statistical random sampling technique can be applied to accurately measure and
compensate any timing variation due to fabrication process and thus provide a possible
increase in the yield.
Another application of the proposed techniques could be the implementation of digital
pipelines with non-uniform stage delays. It is very common in processor pipelines for
the execution stage to be one of the most functionally intensive stages. To equalize the
stage delays, register retiming techniques are used, which at times enormously increase
the number of pipeline registers or make the pipeline control overly complex. The
proposed technique can be used to delay the clock edge driving the next stage, thus
stealing some time from the next pipeline stage.
6.4 Possible Extension of the Proposed Research
Statistical random sampling has long been used to quantitatively estimate a
particular attribute of a given data set by randomly sampling some correlated observable
phenomenon [53][54]. In 1786 Pierre Simon Laplace estimated the population of
France by using a sample along with a ratio estimator. In this research the concept of
simple random sampling is used, which happens to map very well to the measurement
of periodic digital signals. In this technique, a sample of predetermined size is gathered by
repeatedly capturing the state of the signal at random instants of time. The sample is
then analyzed to determine the signal parameter to be estimated. Recently, Progressive
Random Sampling (PRS) techniques have been proposed by Santos [56][57] that can be
applied to retrospective sampling of multi-period data. It is shown that PRS serves to
either improve sampling estimates or reduce sample sizes, as demonstrated by two
example applications. The random sampling of digital signals also represents multi-
period data, and there is a probability of a particular portion or instant of the signal
appearing in the collected data in multiple periods. These two conditions very closely match
the example shown by Santos [57], in which claims made under federal social welfare
programs require retrospective sampling over multiple time periods. PRS suggests
sample gathering as a continuous process rather than treating each sample separately. This
mechanism may provide continuously flowing feedback information that would allow a
closed-loop operation, like that of a PLL or DLL, for timing compensation.
Conclusion
The technique proposed in this thesis provides a reduced complexity circuit design
approach to many resource heavy VLSI circuits like duty cycle correctors (DCCs),
delay locked loops (DLLs), frequency multipliers, phase detectors, etc. The proposed
design approach reduces circuit complexity, making the overall design area, power and
design time efficient, hence reducing the time to market of the targeted product. The
proposed technique does not require any specialized or custom designed components,
which makes it attractive for ASIC design, and designs employing the proposed
techniques can be targeted to any standard cell technology, making them practically
portable across technologies. The work included case studies and has successfully shown
that the proposed techniques enabled a standard cell based design of a SerDes system
that is 2.5 times more power efficient and requires 60% less silicon real estate than an
analog counterpart. The theory of high-speed signal observation and measurement using
the random sampling technique proposed in this research is not limited only to digital
VLSI circuits and systems, but can be extended to other domains of science and
engineering like quantum physics, instrumentation, power electronics and industrial
controls.
Bibliography
[1] George Almási, Călin Caşcaval, José G. Castaños, Monty Denneau, Derek Lieber, José E.
Moreira, Henry S. Warren Jr, “Dissecting Cyclops: A detailed analysis of a multithreaded
architecture”, ACM SIGARCH Computer Architecture News, March 2003, Volume 31.
[2] Tawfili Arabi, Jeff Jones, Greg Taylor, and Dave Rientleau, “Modeling, simulation, and
design methodology of the interconnect and packaging of an ultra-high speed source
synchronous bus”, Electrical Performance of Electronic Packaging, 1998. IEEE 7th Topical
Meeting, 26-28 Oct. 1998.
[3] Vittorio Bagini, Marco Bucci, “A Design of Reliable True Random Number Generator for
Cryptographic Applications Source” Proceedings of the First International Workshop on
Cryptographic Hardware and Embedded Systems 1999.
[4] Ganesh Balamurugan, Naresh Shanbhag, “Modeling and mitigation of jitter in multiGbps
source-synchronous I/O links”, Computer Design, 2003. Proceedings. 21st International
Conference on 13-15 Oct. 2003 Page(s)254 – 260.
[5] G.M. Bernstein, M.A. Lieberman, “Secure random number generation using chaotic
circuits”, IEEE Transactions on Circuits and Systems, Volume 37 , Issue 9 , Sept. 1990
Pages1157 – 1164.
[6] R. Bhatti, M. Denneau and J. Draper, “Duty cycle measurement and correction using a
random sampling technique”, 48th IEEE International Midwest Symposium on Circuits and
Systems 2005.
[7] R. Bhatti, et al, “Phase Measurement and Adjustment of Digital Signals Using Random
Sampling Technique”, IEEE International Symposium on Circuits and Systems 2006.
[8] R. Bhatti, M. Denneau and J. Draper, “2 Gbps SerDes design based on IBM Cu-11 (130nm)
standard cell technology”, Proceedings of the 16th ACM Great Lakes symposium on VLSI,
GLSVLSI '06.
[9] R. Bhatti, C. Steele and J. Draper, “PBuf: An On-Chip Packet Transfer Engine for
MONARCH”, 49th IEEE International Midwest Symposium on Circuits and Systems 2006.
[10] R. Bhatti, K. Chugg and J. Draper, "Standard Cell based Pseudo-Random Clock Generator
for Statistical Random Sampling of Digital VLSI Signals", 50th IEEE International
Midwest Symposium on Circuits and Systems 2007.
[11] R. Bhatti, M. Denneau and J. Draper, "Data Strobe Timing of DDR2 using a Statistical
Random Sampling Technique", 50th IEEE International Midwest Symposium on Circuits
and Systems 2007.
[12] Richard P. Brent, “On the periods of generalized Fibonacci recurrences”, Mathematics of
Computation 63 (1994), Also Technical Report TR-CS-92-03 (March 1992, revised March
1993).
[13] P. Chen, C. Chung and C. Lee, “An All-Digital PLL with cascaded dynamic phase average
loop for wide multiplication range applications”, IEEE International Symposium on Circuits
and Systems 2005.
[14] G. Chen and T. Ueta, “Chaos in Circuits and Systems”, ed 2002 (Singapore World
Scientific).
[15] Kuo-Hsing Cheng et al, “A phase-locked pulse width control loop with programmable duty
cycle”, Advanced System Integrated Circuits 2004. IEEE Asia-Pacific Conference.
[16] D.G. Chinnery and K. Keutzer, “Closing the Gap Between ASIC and Custom: An ASIC
Perspective”, In Proceedings of the 2000 Design Automation Conference, 2000, pp. 631-642.
[17] C. Chung and C. Lee, “An all-digital phase-locked loop for high-speed clock generation”,
IEEE Journal of Solid-State Circuits, Volume 38, Issue 2, Feb. 2003 Page(s)347 – 351.
[18] Ching-Che Chung, Pao-Lung Chen, Chen-Yi Lee, “An All-Digital Delay-Locked Loop for
DDR SDRAM Controller Applications” VLSI Design, Automation and Test, 2006
International Symposium on April 2006.
[19] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Cliff Stein, “Introduction
to Algorithms” (Second Edition) by published by MIT Press and McGraw-Hill.
[20] W.J. Dally and J.W. Poulton, “Digital Systems Engineering”, Cambridge University Press,
1998.
[21] Denali Software, Inc. (http://www.denali.com/products/interfaces_dram.html).
[22] Jeff Draper, et al, The Architecture of the DIVA Processing-In-Memory Chip, Proceedings
of the International Conference on Supercomputing, June 2002.
[23] Milos Drutarovsky and Pavol Galajda, “Chaos–based true random number Generator
embedded in a mixed–signal Reconfigurable hardware”, Journal of Electrical Engineering,
Vol. 57, NO. 4, 2006, 218–225.
[24] C. Dryden, “Survey of design and process failure modes for high-speed SerDes in
nanometer CMOS”, 23rd Proceedings of IEEE VLSI Test Symposium 2005.
[25] Michael A. Epstein, “Method and apparatus for generating random numbers using flip-flop
meta-stability”, US Patent Number 6631390, Mar 6, 2000.
[26] Viktor Fischer, Milos Drutarovsky, “True Random Number Generator Embedded in
Reconfigurable Hardware”, 4th International Workshop on Cryptographic Hardware and
Embedded Systems, CHES 2002.
[27] Albert Leon-Garcia, “Probability and Random Processes for Electrical Engineering (2nd
Edition)”, Book by Addison-Wesley Publishing Company.
[28] T. Gawa et al, “A 50% duty-cycle correction circuit for PLL output”, IEEE International
Symposium on Circuits and Systems 2002, Volume 4 , 26-29 May 2002.
[29] T. Geurts et al, “A 2.5 Gbps - 3.125 Gbps multi-core serial-link transceiver in 0.13 /spl
mu/m CMOS”, Proceeding of the 30th European Solid-State Circuits Conference 2004.
Page(s)487 – 490.
[30] A. Gubner, “Probability and Random Processes for Electrical and Computer Engineers”
Book by Cambridge University Press.
[31] F. Han, X. Yu, Y. Wang, Y. Feng and G. Chen, “n-scroll chaotic oscillators by second-
order systems and double-hysteresis blocks”, Elect Ltrs, Volume 39 , Issue 23 , 13 Nov.
2003 Pages1636-8.
[32] David Harris , Ivan Sutherland, Robert F. Sproull , “Logical Effort Designing Fast CMOS
Circuits”, Book by The Morgan Kaufmann Series in Computer Architecture and Design
[33] P. D. Hortensius, R. D. McLeod, and H. C. Card, “Parallel number generation for VLSI
systems using cellular automata,” IEEE Trans. on Computers, vol. 38, no. 10, pp. 1466–
1473, Oct. 1989.
[34] P.A. Howard and A.E. Jones, “Improved charge pump phase detector for digital phase-
locked loop”, IEEE Colloquium on Analogue Signal Processing 1994.
[35] Terng-Yin Hsu, Chung-Cheng Wang and Chen-Yi Lee, “Design and analysis of a portable
high-speed clock generator”, Circuits and Systems II IEEE Transactions on Analog and
Digital Signal Processing also Circuits and Systems II IEEE Transactions on Express
Briefs, Volume 48, Issue 4, April 2001 Page(s)367 – 375.
[36] C. C. Huang, K. S. Oh, S. Rajan, “The Interconnect Design and Analysis of RAMBUS
Memory Channel,” Proceedings of ASME IPACK 2001, IPACK2001-15531, July 2001.
[37] K. Iniewski et al, “SerDes technology for gigabit I/O communications in storage area
networking”, 4th IEEE International Workshop on System-on-Chip for Real-Time
Applications Proceedings 2004, Page(s)247 – 252.
[38] JEDEC Standard DDR2 SDRAM Specification, JESD79-2C (Revision of ESD79-2B)
MAY 2006.
[39] M. Jessa, M. Walentynowicz, “Discrete-time phase-locked loop as a source of random
sequences with different distributions”, IEEE International Symposium on Circuits and
Systems 2002.
[40] A. J. Johansson, H. Floberg, “Random number generation by chaotic double scroll oscillator
on chip”, IEEE International Symposium on Circuits and Systems, 1999.
[41] T. Johnson, A. Fard and D. Aberg, “An improved low voltage phase-frequency detector
with extended frequency capability”, 47th Midwest Symposium on Circuits and Systems
2004.
[42] D. E. Knuth, The Art of Computer Programming Volume 2, Seminumerical Algorithms, 3rd
ed., Addison-Wesley, ch. 3, 1998.
[43] S. Maggioni, A. Veggetti, A. Bogliolo and L. Croce “Random sampling for on-chip
characterization of standard-cell propagation delay”, Fourth International Symposium on
Quality Electronic Design 2003.
[44] G. Marsaglia, “A current view of random numbers,” Computer Science and Statistics The
Interface, L. Billard, ed., Elsevier Science Publishers B. V., pp. 3–10, 1985.
[45] E. Matoglu et al, “Design and verification of multi-gigabit transmission channels using
equalization techniques”, Proceedings of Electronic Components and Technology 2005.
[46] Gordon E. Moore , “Cramming More Components Onto Integrated Circuits”, Electronics,
Volume 38, Number 8, April 19, 1965.
[47] K. Nakamura, et al, ‘A CMOS 50% duty cycle repeater using complementary phase
blending”, Symposium on VLSI Circuits, 2000. Digest of Technical Papers. June 2000.
[48] T. Olsson and P.Nilsson, “An all-digital PLL clock multiplier”, IEEE Asia-Pacific
Conference on ASIC, 2002.
[49] Uday Padmanabhan, Janet M. Wang, Jiang Hu “Statistical Clock Tree Routing for
Robustness to Process Variations”, International Symposium on Physical Design (ISPD
`06), April 9 – 12, 2006, San Jose, California, USA.
[50] H. Partovi et al, “A 62.5 Gb/s multi-standard SerDes IC”, Proceedings of Custom Integrated
Circuits Conference 2003.
[51] N. Pham, M. Cases, J. Bandyopadhyay, “Design, modeling and simulation methodology for
source synchronous DDR memory subsystems”, Electronic Components and Technology
Conference, 2000. 2000 Proceedings. 50th 21-24 May 2000.
[52] J. Rabaey, A. Chandrakasan, and B. Nikolic “Digital Integrated Circuits”. Englewood
Cliffs, NJ Prentice- Hall, 1996.
[53] Behzad Razavi, “Design of Analog CMOS Integrated Circuits”, Book by McGraw-Hill
January 2001.
[54] W. Rhee, B. Parker and D. Friedman, “A semi-digital delay-locked loop using an analog-
based finite state machine”, Circuits and Systems II IEEE Transactions on Express Briefs,
[see also Circuits and Systems II IEEE Transactions on Analog and Digital Signal
Processing,] Volume 51, Issue 11, Nov. 2004 Page(s)635 – 639.
[55] T.Saeki, Y. Nakaoka, M. Fujita, “A 2.5-ns clock access, 250-MHz, 256-Mb SDRAM with
synchronous mirror delay”, IEEE Journal of Solid-State Circuits, Nov. 1996.
[56] De los Santos, P., Jr., Burke, R.J., Tien, J.M., “A retrospective multi-period sampling
approach”, IEEE International Conference on Computational Cybernetics and Simulation
1997.
[57] De los Santos, P.A., Jr., Burke, R.J., Tien, J.M., “Progressive random sampling: a
multiperiod estimation technique with applications” , IEEE Transactions on Man and
Cybernetics Systems, Volume 30, Issue 4, Nov. 2000 Page(s):418 - 426.
[58] J.-N. Seizovic, “Pipeline synchronization,” Proceedings of IEEE ASYNC, 1994.
[59] Barry Shackleford, Motoo Tanaka, Richard J. Carter, and Greg Snider Shackleford, “High-
performance cellular automata random number generators for embedded probabilistic
computing systems”,. Proceedings. NASA/DoD Conference on Evolvable Hardware, 2002.
[60] S. Soliman, F. Yuan and K. Raahemifar, “An overview of design techniques for CMOS
phase detectors”, IEEE International Symposium on Circuits and Systems 2002.
[61] M. Sorna et al, “A 6.4Gb/s CMOS SerDes core with feedforward and decision-feedback
equalization”, IEEE International Solid-State Circuits Conference, 2005.
[62] M. Spiegel, J. Schiller and R. Srinivasan, “Theory and Problems of Probability and
Statistics”, 2nd Edition, McGraw Hill.
[63] E. Suckow, “Basics of High-Performance SerDes Design”, http://www.analogzone.com
[64] Kihyuk Sung, Lee-Sup Kim, “A high-resolution synchronous mirror delay using successive
approximation register”, JSSC, IEEE, Volume 39 , Issue 11 , Nov. 2004 Pages1997 – 2004.
[65] S. Sunter, A. Roy and J. F. Cote, “An automated, complete, structural test solution for
SerDes”, Proceedings of International Test Conference 2004.
[66] Yonghui Tang, R.L. Geiger, “Phase detector for PLL-based high-speed data recovery”, IEE
JNL, Electronics Letters Volume 38, Issue 23.
[67] Thomas C. Tang, “Experimental studies of metastability behaviors of sub-micron
CMOS ASIC flip flops”, Proceedings of Fourth Annual IEEE International ASIC
Conference and Exhibit, 1991.
[68] B. Vizvari, G. Kolumban, “Quality evaluation of random numbers generated by chaotic
sampling phase-locked loops” IEEE Transactions on Circuits and Systems, Volume 45 ,
Issue 3 , March 1998.
[69] Yi-Ming Wang et al, “An all-digital 50% duty-cycle corrector”, International Symposium
on Circuits and Systems 2004, Volume 2 , 23-26 May 2004
[70] You-Jen Wang, Shao-Ku Kao, Shen-Iuan Liu, “All-digital delay-locked loop/pulsewidth-
control loop with adjustable duty cycles”, Solid-State Circuits, IEEE Journal of Volume 41,
Issue 6, June 2006.
[71] Neil H. E. Weste and Kamran Eshraghian, “Principles of CMOS VLSI Design, A System
Perspective”, Book by Addison-Wesley Publishing Company.
[72] Neil H..E. Weste and David Harris, “CMOS VLSI Design A Circuits and Systems
Perspective (3rd Edition)”, Book by Addison-Wesley Publishing Company.
[73] A. Wirick, S. Ulrich, N. Pham, M. Cases, D. N. de Araujo, “Design and modeling
challenges for DDR II memory subsystems”, Electrical Performance of Electronic
Packaging, 2003, Page(s)229 – 232.
[74] S. Wolfram, “Random sequence generation by cellular automata,” Advances in Applied
Mathematics, vol. 7, pp. 123–169, June 1986. (Also available in S. Wolfram, Cellular
Automata and Complexity, Addison-Wesley, 1994).
[75] Po-Hui Yang, Jinn-Shyan Wang, “Low-voltage pulsewidth control loops for SOC
applications”, JSSC, IEEE, Volume 37 , Issue 10 , Oct. 2002 Pages1348 – 1351.