IN-SITU DIGITAL POWER MEASUREMENT
TECHNIQUE USING CIRCUIT ANALYSIS
By
Siddharth S. Bhargav
December 2015
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)

Copyright 2015 Siddharth S. Bhargav
“The worthwhile problems are the ones you can really solve or help solve, the ones you can
really contribute something to. … No problem is too small or too trivial if we can really do
something about it” – Richard Feynman
Dedicated to
My Parents, Brother, Grandparents,
Prof. Alice C. Parker, Prof. Alain J. Martin and Prof. Young H. Cho
Acknowledgements
“The true contribution and beauty of research lies in its degree of pervasiveness and
simplicity” – An invaluable statement by my advisor, Prof. Young H. Cho, that made my research
relevant and satisfying. With myriad angles of envisioning research and its applicability, my
advisor provided me unique lessons in solving problems that truly matter. I believe such a skill
can only be acquired by working with experienced experts such as Prof. Cho, to see a problem
through many different eyes. I am deeply indebted to my advisor for his support throughout my
PhD as a mentor, guide and friend.
“Figure out the axes of the problem and it should fall apart to an elegant solution” – One of
the gems by Prof. Alice C. Parker that guided me with a thorough outlook during my research. I
am grateful to Prof. Parker for showing me to believe in one’s true potential. I thank Prof. Parker for providing me innumerable chances to meet and for directing me towards the right step. I thank Prof.
Sandeep K. Gupta for his extensive support in mentoring me throughout my PhD. I am also grateful
to Prof. Gupta for providing me his invaluable time and resources for my work.
“The best way to test an idea is by trying to kill it several times and in several ways (figuratively). If it survives, you can be confident the idea is good” – Prof. Alain J. Martin. I would like to thank Prof. Alain J. Martin for his momentous help and support in pursuing research since the days of my master’s degree. I am indebted to him for nurturing my zeal towards research and for teaching me the importance of first-hand experience in ground-up implementation and reasoning.
Special thanks to my dissertation and qualification committees, including Professor Ramesh
Govindan, Professor Antonio Ortega, for their time, guidance, and feedback.
I would like to thank Sean J. Keller, Sameer G. Kulkarni, Rajeev S. Rao, Joseph W. Kemp III,
Prasanjeet Das, Aditya Deshpande and Stefanie Pimentel for providing me invaluable suggestions
and discussions during my academic journey. I would also like to thank all my friends, current and
former graduate students at USC and Caltech: Andrew Goodney, Rishvanth K. Prabakar, Chandan S. Raj, Shawn M. Kailath, Apoorv Bhargava, Anup Ganesh, Anuj Gupta, Varun Krishnan, Sandesh Pai, Lohith Bellad, Dhanya Davies, Ajith Guthi, Devardhi Milind Mandya Shubhachandra, Taniya Mondal, Vaibhav Pathak, Rohit Phatak, Akash Prithvi, Ashutosh Shanker, Ketul Sheth, Keshav Sonbhadra, Harishkumar Umayi, Han Wang, Saketkumar Srivastav, Rahul Hari, Deepak Srinivas, Kusha Kallesh, Ganesh Balachander, Jayanth Padmanabhan, Prashanth Harshangi, Shanmugam Manjunathan, Vishwas Vedantam, Mohan Krishnareddy, Praveen Manjunatha, Apratim Choudhury, Ayush Jaiswal and Amanda Singleton.
I would like to thank the graduate student advisors at the Electrical Engineering department: Diane
Demetras, Christina Fontenot, and Tim Boston for their timely help and guidance.
I would also like to thank the ISI – Networks Division for their help and support: Terry Benzel,
Matthew Binkley, Jeanine Yamazaki, Ted Faber, Alba Regalado-Palacios, and John Wroclawski.
Finally, I thank and dedicate my work to my parents and grandparents for their loving
encouragement, support and for being there to nourish me with the strength and courage to pursue
my goals.
Table of Contents
Chapter 1 : Introduction ................................................................................................... 15
1.1 Motivation............................................................................................................... 15
1.2 Benefits of accurate, low latency and in-situ power measurements ....................... 16
Chapter 2 : Related Works............................................................................................... 18
2.1 Dedicated ADCs ..................................................................................................... 18
2.2 Power Modeling...................................................................................................... 20
2.3 Distinction from Top-down Approach.................................................................... 24
Chapter 3 : Our Approach................................................................................................ 27
3.1 Identifying Instrumentation points.......................................................................... 27
3.2 Power Measurement using sensor data ................................................................... 32
3.1.1 Solving for Power Using Numerical Methods ............................................... 34
3.1.2 Dealing with unaccountable parameters......................................................... 36
Chapter 4 : From Methodology to Practice...................................................................... 38
4.1 Examining residual power ...................................................................................... 38
4.2 Analysis of Error vs. Overhead............................................................................... 39
4.3 From Ideas to Practice ............................................................................................ 39
4.4 Implementation ....................................................................................................... 40
4.4.1 Evaluation Designs......................................................................................... 40
4.4.2 Instrumentation............................................................................................... 41
4.4.3 Circuit Cuts..................................................................................................... 41
4.5 Experiments ............................................................................................................ 43
Chapter 5 : Results on Initial Experimentation................................................................ 44
5.1 Total power comparison ......................................................................................... 44
5.2 Partitioned power.................................................................................................... 45
5.3 Analysis of Residual Power .................................................................................... 46
5.4 Analysis of Error vs. Overhead............................................................................... 48
Chapter 6 : Processor Power Monitoring......................................................................... 52
6.1 Sensor Architecture................................................................................................. 52
6.2 Hardware Instrumentation ...................................................................................... 54
Chapter 7 : Accurate Fine grained Energy Per instruction (EPI) and
Energy per component per instruction (EPC).................................................................. 57
7.1 Introduction............................................................................................................. 58
7.2 Contributions........................................................................................................... 59
7.2.1 Minimally Invasive In-situ EPI Monitoring Methodology ............................ 61
7.2.1.1 Background ........................................................................................... 62
7.2.1.2 Our Approach ......................................................................................... 63
7.2.1.3 Online method to estimate Energy ......................................................... 67
7.2.1.4 Training and test instruction set generation ........................................... 68
7.2.1.5 Experimental Setup ................................................................................ 70
7.2.2 Associating Peripheral Activities with Instruction Profiler ......................... 72
7.2.3 Experiments and results to evaluate our method to monitor
energy per Instruction................................................................................... 73
7.2.3.1 Evaluating the effect of time synchronization
on Rate of Convergence ........................................................................ 73
7.2.3.2 Evaluating the effect of noise on convergence...................................... 75
7.2.4 Evaluating the Energy per Instruction of OpenRISC processor..................... 75
7.2.5 Method to Measure Accurate In-situ Runtime
Energy per Operation of a SoC ................................................................... 76
7.2.5.1 Experiments and results to evaluate energy per operation
of ORPSoC ............................................................................................ 77
7.2.5.2 Analysis of hardware instrumentation for EPI monitoring ................... 79
7.2.6 Extracting Energy Per Component (EPC) for GPP based SoC...................... 80
7.2.6.1 Mapping ISA into Architectural Components....................................... 81
7.2.6.2 Experiments and Results ....................................................................... 82
7.2.6.3 EPC Result Analysis ............................................................................. 83
7.3 Discussion of the Related Work ............................................................................. 84
7.3.1 FPGA energy measurement techniques........................................................... 84
7.3.2 Real time estimation techniques...................................................................... 84
7.3.3 Offline estimation techniques.......................................................................... 85
Chapter 8 : Automated Extraction of Energy per Component using Online Regression 86
8.1 Automated Extraction of Energy per component using online regression ............. 87
8.2 Predict and update number of components............................................................. 90
8.3 Results..................................................................................................................... 92
Chapter 9 : Accelerating Physical Level Sub-Component Power Simulation by
Online Power Partitioning................................................................................................ 95
9.1 Related Work .......................................................................................................... 96
9.2 Enabling Faster and Leaner Simulation.................................................................. 97
9.3 Fundamental Power Estimation Methodology........................................................ 98
9.4 Training vector set generation .............................................................................. 101
9.5 Instrumenting Logic to Observe Activity ............................................................. 101
9.6 Experiments for analysis of faster and leaner simulation ..................................... 102
9.6.1 Results on faster and leaner simulation ........................................................ 103
9.7 Enabling Accuracy and Resource Trade-off......................................................... 103
9.7.1 Summarizing Sub-circuit Activity................................................................ 103
9.7.2 Analysis of accuracy vs resource tradeoffs .................................................. 105
9.7.3 Experiments for analysis of accuracy vs resource tradeoffs ........................ 106
9.7.4 Results on accuracy vs resource tradeoffs.................................................... 106
9.8 Calibration Speedup Enhancements ..................................................................... 107
9.8.1 Experiments on calibration enhancements ................................................... 107
9.8.2 Results on calibration enhancements............................................................ 108
9.9 Conclusion ............................................................................................................ 108
Chapter 10 : Future Work - Accurate Fine grained power monitoring technique
Using instruction profile ................................................................................................ 110
10.1 Initial Implementations and Results....................................................................... 113
10.1.1 Implementation and results on ORPSOC based system............................. 113
10.1.1.1 Fine grained instruction profiling using basic block map ................. 114
10.1.1.2 Instruction Profiling of processes using kernel interrupts................. 120
10.1.1.3 Low overhead accurate instruction profiling per thread.................... 122
10.2 Implementation and results on ARM based system............................................ 123
10.3 Initial Results ...................................................................................................... 128
References...................................................................................................................... 129
List of Figures
Figure 1 : Logic density, performance/power, power and processor cores trends ....... 13
Figure 2 : Coarse-grain management vs fine grain management .................................. 17
Figure 3 : Yield vs. process technology......................................................................... 17
Figure 4 : Taxonomy of Related Work .......................................................................... 19
Figure 5 : Circuit C with various cuts............................................................................ 28
Figure 6 : Constraining the problem .............................................................................. 30
Figure 7 : Finding the closest point of required accuracy and instrumentation 31
Figure 8 : Sensor- hardware design .............................................................................. 34
Figure 9 : NetFPGA Ref. Router .................................................................................. 34
Figure 10 : On-chip power measurement design ............................................................ 41
Figure 11 : Comparison of the sum of per-component results with the total power
measured with an external ADC NetFPGA Board ................................... 44
Figure 12 : Values of power weight during run-time calibration. .................................. 47
Figure 13 : Error Vs. Overhead Analysis for benchmark circuits used in emulation ..... 50
Figure 14 : Analysis of Error vs. Overhead of Simulated results of CKT4.................... 51
Figure 15 : Instruction Profiler (Spatial Pattern Matcher with Accumulator) ................ 53
Figure 16 : Logic Gate Level Sensor (Temporal Pattern Matcher with Accumulator) .. 54
Figure 17 : Instruction Profiler (Spatial Pattern Matcher with Accumulator) ................ 55
Figure 18 : Hybrid Instrumentation ................................................................................ 56
Figure 19 : ORPSoC Setup ............................................................................................. 66
Figure 20 : ORPSoC Architecture .................................................................................. 67
Figure 21 : Analysis of EPI weight convergence under low noise and noisy
Measurements ............................................................................................. 70
Figure 22 : Analysis of EPI weight convergence due to variation in sync. ................... 70
Figure 23 : Instruction counters and MAC unit to calculate total energy...................... 70
Figure 24 : Processor with Instruction profiler and peripheral sensors ......................... 71
Figure 25 : Measured total vs Calculated Energy using EPI for
Dhrystone benchmark ................................................................................. 74
Figure 26 : Error over time for EPI of ORPSOC........................................................... 74
Figure 27 : Results on EPI measurements for various workloads ....................... 77
Figure 28 : Component decomposition by EPC ............................................................ 81
Figure 29 : Results of EPC measurements..................................................................... 82
Figure 30 : Comparison of EPI and EPC results............................................................. 83
Figure 31 : Flow of automated EPC ............................................................................... 90
Figure 32 : AEPC results. ............................................................................................... 92
Figure 33 : Comparison of EPI, EPC and AEPC results ................................................ 93
Figure 34 : flow of accelerated simulator ..................................................................... 103
Figure 35 : Error vs overhead for microcontroller for 65nm and 180nm sim. ............. 103
Figure 36 : Flow - User program compiled and executed ............................................ 111
Figure 37 : Modified flow for power monitoring ......................................................... 111
Figure 38 : Updated Flow with lower overhead ........................................................... 111
Figure 39 : Sample flow to reduce overhead ................................................................ 112
Figure 40 : Example code flows ................................................................................... 113
Figure 41 : Modification to facilitate library function profiling ................................... 115
Figure 42 : GPR to store base address for instruction profiling ................................... 115
Figure 43 : Sample assembly code with block counter code ........................................ 118
Figure 44 : Assembly code to increment basic blocks.................................................. 118
Figure 45 : Snapshot of the result with block and instruction counters........................ 119
Figure 46 : Work flow to use hardware assisted thread level instruction profiling...... 122
Figure 47 : Histogram of instructions in a program...................................................... 127
Abstract
Over the last decade, it has become increasingly important to monitor several parameters
during the operational lives of chips. For example, power measurements are used to prevent chips
from overheating and, in many emerging complex systems, enable power optimizations. There are
a number of tools such as SPICE and dedicated instrumentation methods available for obtaining
power data today. However, simulation and model based solutions are often limited in precision
while direct measurements incur impractical overhead. The need for accurate power monitoring
during the lifetime of a chip is threefold:
Firstly, it is no longer possible to use pessimistic estimates as the power and performance gains
provided by each new technology generation are diminishing [40]. The pessimistic power
estimation introduces thick guard bands during the application of dynamic voltage and frequency
scaling (DVFS) and hence causes non-optimal power management. DVFS is not applied when
there is an opportunity to scale due to these thick guard bands. The thick guard bands in turn reduce
performance per watt which diminishes the advantages of technology scaling. As a consequence
of this issue, along with unchanged threshold and nominal supply voltages, figure 1 shows that
performance has plateaued for several generations of technology. With a fine grained, low latency,
in-situ and accurate power measurement system, we try to gain back the loss in performance per
watt by reducing the thickness of the guard band.
Secondly, experts in [39] suggest that application level power management and adaptive
control of power are crucial for future CMOS systems. They hypothesize that most of the circuit
design techniques have already been used and the room for power minimization lies with the
applications running on the systems. With knowledge of the amount of power being used by the
hardware for the software jobs, the software can adjust itself by issuing different types of
instructions to limit the power. We enable this by providing an all-digital, fine grained and high
resolution power measurement system that can provide applications with details of power usage
on each component. Per-component power can be used actively by the applications to limit component-specific power usage while maintaining program correctness.
Lastly, many of these analyses must capture the effects of process variations. While some of
the available tools such as SPICE provide feedback on the parametric effects of process variations, they incur long simulation run-times. These simulations incur impractically high run-times for complex circuits such as microprocessors. Also, they do not provide a means to include temperature variations across the chip during simulation. The current methods assume uniform temperature characteristics across the entire chip, while on a real chip these temperatures vary by more than 50% [42].
In this work, we present a method for on-line power measurements at high resolution that is
more scalable, lower power, and less invasive than many traditional methods. Our initial results
on simulations show 99.999% accuracy relative to the HSPICE simulator and a 3000x speedup. Also, we have designed a run-time system and demonstrated it on an FPGA platform. Our prototype simultaneously monitors 350 different instrumentation points inside the FPGA as well as the on-board components at 128 kS/sec. Our measurements differ from those acquired using an ADC by at most 2.87%. Furthermore, an off-line data analysis suggests that our method is more resilient to instrumentation errors than an ADC.

Figure 1: Logic density, performance/power, power and processor cores trends [41]
Chapter 1 : Introduction
Experts predict power to play a crucial role in IC performance for future CMOS technologies
[28]. Excessive power dissipation contributes to overheating and faster aging of circuits, which
may lead to failures of chips. The problem of power minimization has been addressed by different
solutions. Researchers have scaled the supply voltage to operate transistors in subthreshold
regions to reduce total power [30]. Also, efforts have been made to minimize the number of
switching transistors by applying several Boolean minimization techniques to reduce dynamic
power [29]. Experts also suggest that these generic circuit design techniques have been extensively
explored and subsequent gains in smaller technologies tend to diminish [31]. They suggest
application-specific techniques, such as application level power optimization and adaptive power
control methods, to save more power in future technologies.
This proposal introduces a new approach for on-chip power measurement that is more scalable,
accurate, and has a lower overhead as compared to traditional power measurement techniques.
1.1 Motivation
Over the past decade, it has become increasingly important to monitor/estimate several
parameters during the operational lives of chips. For example, power estimates are needed to
control circuits to optimize power/energy during operation. In some critical systems, it will become
increasingly useful to monitor aging parameters to identify impending early-life failures and to
proactively take remedial actions.
Unlike in the past, it is no longer possible to use overly pessimistic estimates since the
overheads of corresponding overdesign are growing. Also, such overheads are becoming much
less acceptable as the gains provided by each new technology generation are diminishing for many
key parameters [40].
1.2 Benefits of accurate, low latency and in-situ power measurements
We have identified three areas in power monitoring that benefit from our work. First, power
management schemes require low latency and accurate feedback about power usage to operate
optimally. Our work addresses both latency and accuracy. Second, to develop new power
management methods, hardware must be designed in concert with the software. Our work provides
a new level of monitoring granularity that is not possible today. Finally, our work can be applied to the design loop to give the designer feedback about estimated power usage, and it can be used during testing, for example, to improve yield by understanding how process variations affect power usage.
To expound on the first and second points, some existing methods use direct ADC-based power
measurements. However, these ADCs are too slow. As discussed in related works, another class
of power estimation techniques uses modeling. These models are often inaccurate. This implies
that the yellow regions in figure 2 represent additional power savings that could be achieved but are currently lost to slow ADCs or inaccurate power models.
To elaborate on the third point, estimating or measuring power usage is useful during the design
cycle and testing cycle. For example, a BIST system for power can help in post-silicon fixes to
recover some functionally correct chips. It can provide information on power dissipated by sub-
circuits and identify areas where power dissipation is in excess. Further, these areas can be tuned
to match the desired power budget. Otherwise, these chips might be marked as failed even though
they are functionally correct. This is known as parametric loss and, as shown in figure 3, as the feature size shrinks, this problem accounts for a larger percentage of yield loss.
Figure 2 : Coarse-grain management vs fine grain management [43]
Figure 3: Yield vs. process technology [44]
Chapter 2 : Related Works
Fine grained, low overhead, and accurate dynamic power management techniques are becoming
more important in a wide range of emerging computation systems. System developers are
interested in such techniques to reduce power at board and chip levels. While many emerging
power monitoring methods rely on direct measurements from dedicated current sensors or
statistically derived power estimates, some methods combine these two techniques. The taxonomy
of related works is described in Figure 4. Prior efforts to monitor power can be classified into dedicated-ADC and power-modeling methods.
2.1 Dedicated ADCs
In this method of power measurement, input current to each component is measured using
dedicated ADCs and current sense resistors. The advantage of measuring power using this
technique is in its simplicity and accuracy. Researchers in Low Power Energy Aware Processing
(LEAP) [24] and on-chip power measurement demonstrated in [25, 26] have shown promising
results in power optimization in embedded systems via accurate fine grained power monitoring
and management.
They use dedicated ADCs to monitor the power of each on-board [24] or on-chip [25, 26] component
and have shown up to 60% power savings over devices without such power management.
However, there are challenges that arise due to the nature of instrumentation and ADC designs.
One of the major challenges of this method is its high resource requirement. The technology used
to manufacture transistors for on-chip ADCs for on-die power management applications is the
same as that for digital logic. The amplification gain provided by circuits using these devices is
dependent on the channel length. With technology scaling, smaller channel length transistors have
lower amplification gain [12]. Hence, several cascaded ADCs or larger sized transistors are needed
to compensate for this loss resulting in greater area and power. Furthermore, mixed signal designs
are more expensive to integrate due to the requirements of supply and ground plane isolation.
Another challenge is in ADC calibration. When ADCs are used for a long time, degradation in
performance of the measurement system can be expected due to aging. Such systems require
an offline recalibration or replacement in order to minimize inaccuracy due to instrumentation
errors. However, recalibration may not be a practical fix for many of these systems.
Figure 4: Taxonomy of Related Work
2.2 Power Modeling
Other on-board and on-chip measurement techniques are used for fine grained power
measurement due to the challenges associated with dedicated ADCs. These techniques rely on the
accurate on-line extraction of a power model for each architecture component from workload
information and external ADC based power measurement [15, 16]. Run time power estimations
are done using the power model built over long windows of time for different workload data
collected in synchrony with ADC values. These power models eliminate the need for on-chip
ADCs for monitoring power and can be classified as static and dynamic analysis methods.
In static analysis methods, a power model is built by using an offline analysis performed on
simulation data for various operating regions of the systems. Power is modeled as a direct function
of signal activity as observed in simulations at output pins and internal wires of a VLSI chip for a
given workload. This power model is used to estimate power for the rest of the lifetime of the chip
[16, 19].
Najm et al. provide a survey of these static offline techniques that have been used to estimate
power in VLSI circuits [13]. These methods are broken down into two categories: probabilistic
and statistical.
Probabilistic methods model power through the use of signal transition probabilities, a
known signal behavior or using Boolean differences. Using this information, an offline power
model is built for the most common regions of activity of the chip. Power is estimated by
monitoring the input activities of the circuit and the power model.
Alternatively, Najm et al. discuss the following methods that rely on statistics: statistical
data of power dissipated for a given design is built using Monte Carlo simulations [19]. Subsequent
real time power is estimated at the full system level or per gate level using the statistical data and
primary inputs.
On the other hand, dynamic analysis techniques build various power models considering
different workloads and operating conditions of the chips. Counters are used to locate the
appropriate position of operation of the chip on the power model to estimate the power. The power
model is chosen dynamically based on the workload of the chip.
Several efforts have been made to measure power accurately at the architecture level.
Researchers built models of various capacitances of each block of a processor. They used another
module to record power values along with the details of each unit accessed every cycle. By
comparing with these power values, they improved their capacitive model. A lookup table was
built using the capacitive model and activity rates were used to index to extract power numbers in
real time [17].
Another line of research by Kim et al. endeavored to achieve un-instrumented per-
component power measurement of domestic electrical appliances [32]. They consider various
measurements such as electric and magnetic measurements to model a linear programming
problem and solve it using numerical methods.
Power usage has also been modeled as a linear regression model using a set of relevant
hardware counters [7] for various workloads. This method calculates power for monitored
components by considering the best linear fit of values.
One of the latest power modeling techniques leverages the information in the built-in event counters on processors and adds a new dimension to power monitoring [15, 17, 18]. Capacitances
of various logic blocks are modeled using these event counter values along with a known set of
workloads and ADC measurements. Given a sufficient number of such models, counter values can
be used to estimate the power dissipated by various components at any given time.
Several efforts have been made by industry to measure power using digital sensors. Some of these efforts, which we focus on, are AMD’s Richland Accelerated Processing Unit (APU) and Kabini processor, IBM’s Power 7, Power 7+ and Power 8, and Intel’s Clover field SoC and Itanium
processor [35].
AMD’s Kabini processor revealed the use of digital activity monitors. The processor
calculates power by multiplying the activity counts with a weight and accumulates all the results
to find the total dynamic power consumed. The weights are calculated offline for various workload
data that includes various temperatures and voltages. Static power is then added to dynamic power
to find the total power consumed by the system. The static power is derived by modeling leakage
at various voltages and temperatures. The appropriate value to be added is chosen based on the
counter value, operating voltage and temperature.
IBM’s power proxy uses event counters located at various logic blocks in the architecture.
These event counters keep track of events such as cache hit/miss, ALU access, etc. that occur
relevant to the component monitored [16]. Initially, a power model is built and weights are
extracted for each of these counters for various workloads offline. They then use these weights to
calculate power in real time.
The Itanium architecture published in [36] described the use of activity counters to model
capacitance and subsequently measure dynamic power by multiplying with associated weights.
Static power component which was derived offline for idle state is added with this dynamic power
to calculate the total power.
The improvements in on-chip power estimation come at a relatively low overhead via power
modeling, since there is no need to change the existing hardware. However, these techniques suffer from inaccuracies when presented with an increasing number of micro-architecture components.
One source of the inaccuracy is due to modeling errors. The techniques described earlier
require an offline power modeling task through curve fitting. When the fit model is applied to the
real time operation of a chip, small variations in workload and environment introduce errors into
the estimations. Real time measurement errors are also introduced as the result of one-time
analysis across all the copies of the chip. The curve fit represents the averaged response across
various workload and per-chip process variations.
Our initial implementation focused on extracting accurate dynamic and static power of on-
board components using digital counter instrumentation on the NetFPGA 1G platform [37]. We instrumented counters at the inputs and outputs of all the components instantiated in the reference router design on the NetFPGA, since the FPGA is connected to all other on-board components. We calculated power using the weights extracted for the counter values and obtained a maximum difference of less than 3% between the calculated and measured powers. We build on this earlier effort to accurately measure the power of each logical component within the chip. In this work, we
present a technique to measure power accurately on chip and evaluate the bounds on accuracy of
the technique with respect to the instrumentation overhead. We also perform analysis on the effect
of measurement noise and perform different experiments to validate our claims on accuracy at
various granularities of measurement.
2.3 Distinction from Top-down Approach
In general, the power modeling methods described above do not consider the design specific
characteristics of a chip such as structure of gates and structure of transistor networks. These
methods follow top-down approaches where power is either measured directly or associated with high-level characteristics. In doing so, there is no means of validating the power based
on the underlying logic. Therefore, even if errors can be measured, there is no way to determine
their sources.
Our distinction from the event counter approaches can be illustrated with the following
measurement parameters:
1. Monitoring scheme
The monitoring of activity at various pins of the specific component is done via a digital sensor
that counts only one type of transition, i.e., 0 to 1 or 1 to 0. The event counter techniques count every event at a higher level, such as cache hits/misses or branch rate, which may involve several components such as memory, comparators, decoders, and arithmetic units within each level of
measurement.
2. Granularity of measurement
We find the relationship between the activity at the inputs of a component and the activity within it, which in turn is proportional to the power dissipated. We could potentially estimate the power of each gate, or of a group of gates known as a logic cone, within the component at any instant of time. On the other hand, the
event counter technique provides power dissipated by the component for component specific
events for large windows of time.
For example, let us consider a case where the event counter technique is used to measure
power drawn by a branch prediction unit on a taken branch. Our technique provides power of the
entire branch prediction unit and power of the internal logic such as comparators. It also provides
power values when there might be some stray input activity due to other circuit switching that
might not necessarily produce output but causes power dissipation. The event counter technique
provides a broader measurement of total component power caused by a specific event but does not
provide accurate power values for stray switching.
3. Static Power and Noise
In the event counter technique, the static power of the component is measured as idle power of
the system for each component and the rest of the weights are calculated using this idle power in
the equation. On the other hand, we consider the fundamental concepts of static power to derive
weights and also isolate the noise into a residual component from the measured power. The
inclusion of these components in our algorithm gives us a means to automatically calibrate and
remove instrumentation noise from our measurement technique.
4. Accuracy
Our method is formulated based on fundamental circuit switching theory, and hence achieves an error rate of less than 3%, as we can account for dynamic and static power along with the noise. This
allows us to efficiently partition power, provide bounds on accuracy versus overhead, and identify
the power dissipated by the measurement unit itself for each component. The event counter technique
estimates power values of various components but does not isolate sources of error or evaluate its
accuracy for different numbers of high performance counters.
Chapter 3 : Our Approach
Our method follows a bottom-up approach and leverages the fundamental switching theory
in CMOS to measure power.
The problem formulation leverages the fundamental switching theory and circuit analysis
tools in computer aided design techniques to instrument sensors and obtain accurate
measurements. The problem is formulated in two phases. In the first phase, we formulate a method
to perform digital sensor instrumentation and in the second phase we use the values from the
instrumented digital sensors to perform accurate measurements.
3.1 Identifying Instrumentation points
The problem statement for virtual sensor instrumentation is formulated as: For a given subset
of input vectors for a digital logic circuit, how do we instrument sensors to achieve the given
accuracy?
Initially, let “C” be a module under consideration (MUC) that is combinational with “n” wires. Let $tr_i$ represent the transition count on the $i$-th wire of the circuit.

Let us define the ground truth to be an external power measurement scheme that measures the power drawn from the VDD of the MUC. Given that each of these “n” wires drives a capacitive load of $c_i$, the ground truth power can be expressed as

$$P_{gt} = \sum_{i=1}^{n} \left( tr_i \cdot c_i \cdot V_{dd}^{2} \right) \qquad (1)$$
We can represent the ground truth, or the actual power, as a weighted multiplication of the transitions on each wire:

$$P_{calc} = \sum_{i=1}^{n} \left( w_i \cdot tr_i \right), \qquad (2)$$

where $w_i = c_i \cdot V_{dd}^{2}$ is a weighted constant, and

$$P_{ext} - P_{calc} = R, \qquad (3)$$

where $R$ represents the residue and $P_{ext}$ is the external power measured at the VDD. $R = 0$ under ideal conditions, and $R \neq 0$ under PVT variations and noise. We derive $w_i$ empirically for a given subset of vectors such that it accounts for the PVT variations.

For large circuits, it may not be possible to monitor $tr_i$ for all $i$; hence, we need a method to find
a subset of “i” and still be able to obtain the desired accuracy. We find the subset of “i” by finding
cuts of the entire circuit, which divide the larger circuit into a set of smaller groups of gates. We
then instrument the inputs of these cuts. We approximate the total power measured as a function
of transitions at the inputs of these cuts and extract weights. We use these weights to perform subsequent power measurements for various input vectors. To illustrate this process mathematically, let us consider that “m” wires were instrumented when the circuit was cut as red boxes as shown in figure 5.

Figure 5: Circuit C with various cuts
Then, the estimated power can be represented as

$$P_{calc,m} = \sum_{i=1}^{m} w_i \cdot tr_i . \qquad (4)$$

Now, we find different sized and shaped cuts as

$$P_{calc,p} = \sum_{i=1}^{p} w_i \cdot tr_i ,$$

where the total input set of all the cuts is “p” such that $n > m > p$. Then

$$P_{gt} - P_{calc,m} = \varepsilon_m , \qquad P_{gt} - P_{calc,p} = \varepsilon_p , \qquad (5)$$

where $\varepsilon_m$ and $\varepsilon_p$ are the deviations from ground truth. Given that $m > p$, we hypothesize that $\varepsilon_p \geq \varepsilon_m$.
For the MUC represented in figure 5, starting from the smallest cut of the design, i.e., a single box, it is hard to enumerate every combination of cut aggregation to find the points of virtual
sensor instrumentation for a given accuracy. Also, if the cut results in functionally unequal blocks,
it is hard to attribute the source of this error. To solve this problem, we need to find a scheme of
cut aggregation such that we can evaluate the accuracy of one cut and apply that across the other
cuts of the design. This scheme provides us a means to compare accuracy of our method across
the set of sub-circuits. We consider the most commonly used arithmetic logic such as adders and
multipliers and perform analysis of these units. These units are found to contain a basic block of
computation which is repeated to form the entire unit. By sub-dividing the entire logic in terms of
these basic blocks, we obtain functionally equivalent cuts which can be instrumented with virtual
sensors. The same constraint can be represented as shown in figure 6. Initially, we identify
functionally equal units in the same row and subsequently we identify the repeating units within
the same column.
At the lowest level of instrumentation, we place virtual sensors at the inputs of these basic
blocks. We then group two adjacent blocks to find the next point of accuracy, and so on. We first
find the accuracy when virtual sensors are placed at the inputs of MUC. We repeat this process by
finding the next largest non-overlapping cut that is functionally equivalent within the set of other
cuts. We continue until we find the point of accuracy that is desired.
To illustrate this process, let us consider finding the instrumentation point for an error of 1.6% for the ISCAS benchmark circuit C6288. As shown in figure 7, we first start with the “m” inputs of the MUC. We then find the accuracy when we cut the circuit into basic blocks that aggregate 8 of the repeating rows together. The accuracy was found to be 2.1%. We then extrapolate a line between these two points, as shown in figure 7, and estimate the point of instrumentation for the desired accuracy. We then instrument the closest estimated
block size and evaluate its accuracy. It was found that when blocks containing 7 rows were aggregated for virtual sensor instrumentation, the accuracy obtained was better than desired, and we stopped the process. We repeat the above process in case the desired accuracy is not achieved.

Figure 6: Constraining the problem
To illustrate this mathematically, we consider the power measurements made with (2) as ground truth. We approximate the power dissipated by the entire circuit as a weighted multiplication of input activity:

$$P_{calc,m} = \sum_{i=1}^{m} \left( w_i \cdot tr_i \right), \quad \text{where } m = \text{the inputs of } C. \qquad (6)$$

$P_{gt} - P_{calc,m} = \varepsilon_m$ provides us with the deviation in measurements.

If the maximum deviation in measurements is greater than the desired value $E$, we cut the circuit into smaller, functionally equal units and evaluate the accuracy. That is, if $\max(\varepsilon_m) > E$, we find cuts $V_k$ of the circuit, $C = V_1 \cup V_2 \cup \dots \cup V_K$, such that $f(V_k) = f(V_{k+1})$ for all $k+1 \leq K$, where $f(x)$ denotes the function represented by the MUC and $V_k$ represents the $k$-th cut of the MUC. We cut the circuit such that each block contains approximately $\lceil l/2 \rceil$ of the repeating rows, where “l” represents the number of rows in the constrained representation of the MUC.

Figure 7: Finding the closest point of required accuracy and instrumentation (data presented here is for the ISCAS benchmark C6288)

We then instrument the inputs of these cuts and evaluate the accuracy again:

$$P_{calc,a} = \sum_{i=1}^{a} \left( w_i \cdot tr_i \right), \quad \text{where } a = \text{the inputs of the cuts}. \qquad (7)$$

$$P_{gt} - P_{calc,a} = \varepsilon_a \qquad (8)$$

If the deviation is still greater than the desired accuracy, we cut the circuit into blocks of approximately $\lceil l/2 \rceil - s$ rows, where $s = 0, 1, 2, \dots$ such that $s \leq \lceil l/2 \rceil$, and repeat the procedure until we find the point of desired accuracy.
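To make the search procedure concrete, the sketch below summarizes it in Python. It is illustrative only: the helper evaluate_error(rows) is a hypothetical stand-in for instrumenting the inputs of cuts that aggregate the given number of repeated rows, replaying the training vectors, and returning the resulting error in percent.

```python
def find_instrumentation_point(evaluate_error, total_rows, target_error, max_iters=10):
    """Sketch of the accuracy-guided search for the cut size (rows per block).

    evaluate_error(rows) is a hypothetical helper: it instruments the inputs of
    cuts that each aggregate `rows` of the functionally equal rows and returns
    the observed error (%). rows == total_rows corresponds to instrumenting
    only the primary inputs of the MUC.
    """
    prev = (total_rows, evaluate_error(total_rows))      # whole module first
    half = max(1, total_rows // 2)
    curr = (half, evaluate_error(half))                  # then ~l/2 rows per cut
    for _ in range(max_iters):
        rows, err = curr
        if err <= target_error:
            return rows, err                             # desired accuracy reached
        prev_rows, prev_err = prev
        # Extrapolate the line through the last two (rows, error) points to the
        # target error and round to the nearest feasible block size.
        if prev_err == err:
            guess = rows - 1
        else:
            guess = round(rows + (target_error - err) * (prev_rows - rows) / (prev_err - err))
        guess = min(total_rows, max(1, int(guess)))
        if guess == rows:                                # no progress: shrink by one row
            guess = max(1, rows - 1)
        prev, curr = curr, (guess, evaluate_error(guess))
    return curr                                          # best attempt within the budget
```

For the C6288 example above, the two starting evaluations correspond to instrumenting the primary inputs and to blocks of 8 rows, and the extrapolation step then lands on blocks of 7 rows.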
3.2 Power Measurement using sensor data
We build the method of calculating weights using concepts of linear regression and fundamentals
of circuit switching. Let us consider the fundamental power equation:
$$\text{Total Power}, \; P_{total} = P_{dyn} + P_{static}$$
Due to the instrumentation and other effects of power measurement such as quantization
noise, we add an additional residue component that accounts for these effects. Rewriting the ideal
equation,
$$\text{Total Power}, \; P_{total} = P_{dyn} + P_{static} + \text{Residue} \qquad (9)$$
where $P_{dyn}$ is the dynamic power and $P_{static}$ is the static power of the digital system. For simplicity, we consider process, voltage and temperature (PVT) variations and noise to be minimal. We know that the dynamic power of “n” gates is:
$$P_{dyn} = \sum_{j=1}^{n} a_j c_j v_j^{2} f_j \qquad (10)$$

where $a_j$ denotes the activity, i.e., the number of transitions, at the output of gate $j$, and $c_j$, $v_j$, and $f_j$ denote the total load capacitance driven by the output of the $j$-th gate, the power supply voltage at the $j$-th gate, and the frequency at the $j$-th gate, respectively.
From (2), equation (10) can be rewritten as:

$$P_{dyn} = \sum_{j=1}^{n} W_j a_j \qquad (11)$$

We know that the static power of “n” gates is:

$$P_{static} = \sum_{j=1}^{n} I_{leak,j} \, V_{supply} \qquad (12)$$

where $V_{supply}$ is the supply voltage and $I_{leak,j}$ is the leakage current of the $j$-th gate, given by [1]:

$$I_{leak,j} = \frac{W}{L}\, \mu (n-1) C_{ox} \vartheta_T^{2} \, e^{\left( \frac{V_{gs} - V_{th}}{n \vartheta_T} \right)} \left( 1 - e^{-\frac{V_{ds}}{\vartheta_T}} \right)$$

where $W$ and $L$ are the width and length of the transistor, $\mu$ is the mobility of electrons, $n$ is the body effect coefficient, $\vartheta_T$ is the thermal voltage, $C_{ox}$ is the gate oxide capacitance, $V_{th}$ is the threshold voltage, $V_{ds}$ is the source-to-drain voltage and $V_{gs}$ is the gate-to-source voltage. For
a given constant temperature and voltages, we know that all the terms on the right hand side can
be considered constant. Hence, assuming temperature does not change for the given sample
window, $P_{static}$ can be alternatively represented as:

$$P_{static} = 1 \cdot W_{static} \qquad (13)$$

where $W_{static} = \sum_{j=1}^{n} I_{leak,j} \, V_{supply}$.

Now, the total power can be written as:

$$P_{total} = \left( \sum_{j=1}^{n} W_j a_j \right) + \left( 1 \cdot W_{static} \right) + \text{Residue} \qquad (14)$$
We could measure the total power and the activity rates at each gate and use linear regression on this data to obtain the weights. Device characteristics and switching vary due to process variations, and hence the power weights are expected to change within a chip for each gate. These variations are lumped into the residue. The algorithm adapts to the new values of power such that the weights are the best fit for the specific conditions. We recalculate the weights by applying our algorithm when the difference between the total measured and calculated power is above an expected error threshold.
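As a small illustration of equation (14) and of the recalculation trigger just described, the following Python sketch computes the calculated power from counter values and converged weights and flags when the weights should be re-derived. The variable names are illustrative and not taken from the dissertation's implementation.

```python
import numpy as np

def calculated_power(activity_counts, weights, w_static):
    """Equation (14): weighted activity counts plus the constant static-power term."""
    return float(np.dot(activity_counts, weights)) + w_static

def needs_recalibration(p_measured, activity_counts, weights, w_static, threshold):
    """Trigger a re-fit when measured and calculated power disagree by more than a threshold."""
    p_calc = calculated_power(activity_counts, weights, w_static)
    return abs(p_measured - p_calc) > threshold
```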
3.1.1 Solving for Power Using Numerical Methods
We use concepts of applied mathematics to simultaneously solve for all the variables to
partition the external measured power using internally observed variables. To illustrate this
concept, let us consider a set of simultaneous equations defined by y(t) as shown below:
$$y_i(t) = \sum_{j=1}^{m} w_{ij}\, x_j(t), \quad \forall\, i = 1 \dots n,\; j = 1 \dots m \qquad (15)$$

where $w_{ij}$ is the coefficient of the variable $x_j(t)$. Now, by applying the principles of applied mathematics, equation (15) can be represented as a matrix equation in $y_i(t)$, $w_{ij}$ and $x_j(t)$:

$$\begin{bmatrix} y_1(t) \\ \vdots \\ y_n(t) \end{bmatrix} = W \begin{bmatrix} x_1(t) \\ \vdots \\ x_m(t) \end{bmatrix}, \qquad \text{where } W = \begin{bmatrix} w_{11} & \cdots & w_{1m} \\ \vdots & \ddots & \vdots \\ w_{n1} & \cdots & w_{nm} \end{bmatrix} \qquad (16)$$

The matrices $y_i(t)$, $W$ and $x_j(t)$ consist of observed values of the same variables at different instances of time. They are sorted in ascending order of time to form matrices, which essentially represent a system of linear equations as denoted by equation (16). The matrix $x_j(t)$ can be computed by multiplying $y_i(t)$ by $W^{-1}$.

Figure 8: Sensor hardware design
Figure 9: NetFPGA Ref. Router
We can apply the same solution by building matrices for $P_{total}$ and $a_j$ by using equation (16) in (14). The signal activity rates from each component are collected from a counter module as shown in figure 8 and sorted temporally to build the matrix $a_j$. The sensor hardware shown in figure 8 consists of a 16-bit binary up counter and a D flip-flop, along with an AND gate to capture the edge of a transition. The outputs of these modules are used to create the matrix $a_j$. An additional column with a constant coefficient is included in the matrix to account for static power in the measurements. We take the transpose of the matrix $a_j$ to solve for the weights. The total power $P_{total}$ is observed using an
external ADC and sorted temporally to obtain a total power matrix. The per-component power
matrix is obtained by multiplying the inverse of the mixing matrix (also known as the un-mixing
matrix) with the observed power matrix. This per-component matrix consists of weights of each
component to translate counter values into actual power numbers. The final converged weights are
used for all subsequent power measurements.
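A minimal numerical sketch of the procedure just described is shown below, assuming the counter values and ADC readings have already been collected and time-aligned. The names are illustrative, and a least-squares solve stands in for the explicit inversion (the pseudo-inverse plays the role of the un-mixing matrix).

```python
import numpy as np

def solve_power_weights(activity, p_total):
    """activity: (T, k) counter values per sampling window for k components;
    p_total: (T,) externally measured total power, sorted in time order."""
    T = activity.shape[0]
    # Constant column so that its weight absorbs the static-power term.
    A = np.hstack([activity, np.ones((T, 1))])
    # Least-squares solution of A @ w = p_total.
    w, *_ = np.linalg.lstsq(A, p_total, rcond=None)
    per_component_power = A[:, :-1] * w[:-1]     # weight times activity, per window
    residue = p_total - A @ w                    # what the fitted model cannot explain
    return w[:-1], w[-1], per_component_power, residue
```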
Then, one can select training data such that it covers all regions of our design’s operation.
We generate input patterns to collect data using exhaustive test pattern generation techniques by
leveraging the concepts in [34].
Subsequently, we use a sliding window over the above set of linear equations to obtain component-specific rates. Once the weights converge across several windows, we use these values
to calculate power for the remainder of the given input vector sequence. Furthermore, we make
this method practical for runtime applications by leveraging concepts of successive regression and
estimation as explained in [38]. We eliminate the need for using complex matrix computation and
large memories to store previous data. Instead, we update the weights using fewer predecessor
values and compute results using arithmetic multiplication and addition.
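One standard way to realize such a successive update is recursive least squares, which replaces the batch matrix solve with a small per-window update and keeps no history; the sketch below is a generic formulation under that assumption, not the exact update rule of [38].

```python
import numpy as np

class RecursiveWeightEstimator:
    """Generic recursive least-squares update for the power weights (sketch)."""

    def __init__(self, n_features, forgetting=0.99, delta=1e3):
        self.w = np.zeros(n_features)          # weights, incl. the static-power column
        self.P = np.eye(n_features) * delta    # inverse correlation estimate
        self.lam = forgetting                  # discounts older windows

    def update(self, x, y):
        """x: activity counts for one window (with a trailing 1.0 for static power);
        y: externally measured total power for the same window."""
        x = np.asarray(x, dtype=float)
        Px = self.P @ x
        gain = Px / (self.lam + x @ Px)
        err = y - self.w @ x                   # innovation against the current model
        self.w = self.w + gain * err
        self.P = (self.P - np.outer(gain, Px)) / self.lam
        return self.w
```

Each call consumes only the latest window's counters, so no history matrix needs to be stored; once the weights stop changing across windows they can be frozen and used for the remaining measurements.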
To measure total power using our method, we multiply each weight $W_j$ with the corresponding activity rate of the components and obtain the power values. Our method can be extended to include
various modes in DVFS by performing our analysis over such modes to extract respective power
weights.
3.1.2 Dealing with unaccountable parameters
3.1.2.1 Computing for residue
In a real scenario, there are various sources of error that are introduced into measurements.
The error might be due to ADC noise, PVT variations, environmental noise and noise due to the
switching of the components themselves. Under these conditions, we refine our method to reject
this noise by adding a residual component to the measured power, as defined by the term R in equation (3). We then measure this residue as

$$\text{Residue}, \; R = \left| P_{measured} - P_{calc} \right| \qquad (17)$$

Our algorithm yields power weights at every iteration; the per-component power ($P_{calc}$) is calculated by multiplying the weights with the measured signal activity rates. This result is further compared with the measured power, $P_{measured}$, to obtain the residual value described in equation (17). We perform analysis on the residue to evaluate the resilience of our method.
3.1.2.2 Run-time self-calibration
Unaccountable parameters such as circuit aging and temperature effects introduce error into
our measurement systems, and in turn manifest as increased residue. However, frequent online
calibration and re-calibration of our monitoring system can be done to reduce errors with an
external ADC while the system is running.
3.1.2.3 Internal circuit power
The FPGA platforms used do not have a means to measure per-component or fine grained
power accurately at high speeds. We validate our claims of accurate internal circuit power
measurement per component by comparing the total platform power measured by an ADC against
the sum of all of the per-component power that we computed using our method.
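The validation criterion amounts to a single relative-error check; a trivial sketch with illustrative names is:

```python
def validation_error(per_component_powers, adc_total):
    """Relative difference between the sum of per-component powers and the ADC total."""
    return abs(sum(per_component_powers) - adc_total) / adc_total
```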
Chapter 4 : From Methodology to Practice
We implemented our power measurement technique on multiple FPGA platforms with a
number of known designs and digital sensor instrumentation to validate our claims on accurate
power partitioning and noise suppression in power measurements. The platforms used have no
dedicated ADCs to measure power consumed by the platform or its components. Also, the FPGA
does not have any way to separate the power consumed by different parts of the internal design.
For all experimentation, we consider the externally measured power values using ADCs to be
ground truth.
We then continued to validate the correctness of our method by performing another set of experiments in which we switched on some of the known components and compared the derived total power
with ground truth. In order to emulate components where input and output signals are inaccessible,
we attached resistors of different values at general purpose I/Os (GPIO) of the FPGA with an
assumption that the on/off signal of the components can be monitored. The measured data at the
resistors are then compared against our computed values.
4.1 Examining residual power
We also performed an experiment with multiple data sets to evaluate the resilience of our
algorithm. The weight convergence and the noise projection in residue provide us with a method
to evaluate our technique for noise resilience.
4.2 Analysis of Error vs. Overhead
We examine the bounds on accuracy of our measurement technique by recalculating weights
while systematically eliminating counters to extract and compare the measured power with ground
truth. We perform this analysis in simulation and in real time on an FPGA to identify various sources
of errors in measurement.
4.3 From Ideas to Practice
Although the idea described in the previous section is intuitive, a direct mapping of activity sensors to every gate is impractical, since the number of gates required for the sensors would be far greater than that of the target circuit. Therefore, we present a method for accounting for the activity rate
of a sub-circuit with fewer counters to make this method feasible.
Our formulation begins with recent FPGA power modeling research that showed that the signal
activity within the logic is proportional to the activity at the inputs of the circuit for various
commonly used benchmark designs [19]. We build on this result to assume that the signal activity
of any functionally independent subcomponent within a chip is proportional to the signal activity
at its input wires. We apply our interpretation of this result to our problem formulation to
approximate the total activity of a logically dependent set of wires with signal transition counts
from a much smaller set of wires.
First, we partition the summation of equation 3 into k summations representing the sub-
circuits, each consisting of values from a set of wires that are tightly inter-dependent as shown in
Equation 18.
40
P = \sum_{i \in S_1} a_i w_i + \sum_{i \in S_2} a_i w_i + \cdots + \sum_{i \in S_k} a_i w_i        (18)

where S_1, ..., S_k denote the sets of wires belonging to the k sub-circuits. Then, our approximation approach allows us to rewrite each summation as the product of a new weight (w') associated with all the wires within each tightly dependent set and the transition counts from the input wires (a') of each such set, which may proportionally influence the activity rates of the wires within the set.
P = a'_1 w'_1 + a'_2 w'_2 + \cdots + a'_k w'_k = \sum_{j=1}^{k} a'_j w'_j        (19)

This approximation produces an equation of the same form as Equation 11. Therefore, the
same applied mathematics based algorithm can be applied to estimate the power at each sub-
circuit. We use this instrumentation method to implement and test our power monitoring algorithm
on various designs on simulation and FPGA platforms.
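To make the weight-extraction step concrete, the following Python sketch shows one standard way to fit the per-sub-circuit weights of Equation 19 from an activity matrix and externally measured total power using ordinary least squares. It is only an illustrative model of the mathematics, not the solver implemented on the platform; the function and variable names (fit_power_weights, activity, measured_power) and the synthetic data are assumptions for demonstration.

# Minimal sketch: fit Equation 19's weights w' by least squares, P ~ A @ w.
import numpy as np

def fit_power_weights(activity: np.ndarray, measured_power: np.ndarray) -> np.ndarray:
    """activity: (m samples, k sub-circuits) transition counts a'.
    measured_power: (m,) total power readings from the external ADC.
    Returns the k estimated weights w'."""
    weights, *_ = np.linalg.lstsq(activity, measured_power, rcond=None)
    return weights

def estimate_power(activity_row: np.ndarray, weights: np.ndarray) -> float:
    """Per-sample total power estimate: sum over sub-circuits of a'_j * w'_j."""
    return float(activity_row @ weights)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    true_w = np.array([0.8, 1.5, 0.3])             # hypothetical weights
    A = rng.integers(0, 1000, size=(200, 3)).astype(float)
    p = A @ true_w + rng.normal(0, 1.0, size=200)  # noisy "ADC" readings
    print("estimated weights:", fit_power_weights(A, p))

In practice the counters supply the rows of the activity matrix at each sampling instant, and the same solve is repeated as more samples arrive.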
4.4 Implementation
We use the NetFPGA 1G board to perform experiments on the validation of total power and
sub-circuit power (shown in figure 2) and Xilinx SP605 board for validating our claims of per
component power.
4.4.1 Evaluation Designs
For our initial experiments, we integrated our sensors into a hardware IP router design as shown in figure 2. This working design was configured onto a NetFPGA platform and deployed
into an actual network to evaluate its effectiveness in a realistic computer networking environment.
We also built a generic multiple arithmetic computation engine shown in figure 3 on SP605 –
Spartan 6 board to compare our result against the ground truth measurements.
4.4.2 Instrumentation
We monitor every pin of the components connected to the main FPGA on the boards to
obtain an activity factor. To do so, we sample the digital output pin of a module of interest and
increment a counter whenever we see a transition - rising or falling. The digital sensor is as shown
in figure 1. The interfaces to the on-board hardware components that are connected to the main
FPGA are instantiated in Verilog descriptor files as modules, and these modules are open for
modification within a set of well-defined constraints. Thus we may add our counting circuitry
easily without changing the functionality of the design.
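The following Python sketch is a behavioral model of the digital sensor just described; the actual sensor is a few gates added to the Verilog module interfaces. The class name and the example waveform are illustrative assumptions.

# Behavioral sketch of the transition-counting digital sensor: sample a
# monitored wire once per clock and increment a counter on every
# transition, rising or falling.
class TransitionCounter:
    def __init__(self):
        self.prev = 0
        self.count = 0

    def clock(self, sample: int) -> None:
        """Call once per clock with the sampled wire value (0 or 1)."""
        if sample != self.prev:      # rising or falling edge detected
            self.count += 1
        self.prev = sample

if __name__ == "__main__":
    sensor = TransitionCounter()
    for bit in [0, 0, 1, 1, 0, 1, 0, 0, 1]:   # hypothetical sampled waveform
        sensor.clock(bit)
    print("activity count:", sensor.count)    # 5 transitions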
4.4.3 Circuit Cuts
Rather than instrumenting every wire in the circuit, we choose a subset of wires to keep the digital sensor instrumentation feasible for practical implementation. We identify sets of gates that are bounded by multiple inputs to a logic structure that generates a single output within a level of logic; these structures are known as cones [5]. We leverage this idea to identify various cones within a logic level and monitor the inputs of each such group of gates to evaluate the effectiveness of our technique.
Figure 10: On-chip power measurement design
These circuit cuts ensure data independence across the various circuits within a logic level of a component and provide isolation in terms of power and activity. Hence, we reliably partition the power without correlation effects across the various circuits, achieving the highest accuracy.
To evaluate the error versus the overhead of counter addition, we view these modules as a
bunch of output cones of various levels in a circuit and added sensors at inputs of each of these
cones. We traverse the gate level circuit that is described as a Verilog net list and identify each of
these cones by traversing from output to input. Initially, we identify the level of gates in the logic,
the primary input being level 0 and primary output as the highest. First, we find out various logic
levels in the circuit. Now, starting from primary output, we traverse back by one level and find all
the cones between these two levels. We continue this process till all the levels are covered. The
gate feeding the primary output is labeled as cone head and we find its fan-ins. We include a field
in the gate description such that it can be used as a marker to indicate cone heads. We label each
of the fan-in relevant to the cone with the name of the cone head. As we traverse to smaller levels,
we check the number of levels considered for aggregation and assign intermediate cone heads.
Once we identify all the cones, we pick their inputs and assign the digital sensors. We also ensure
that there is no duplication of these sensors while still considering those wires during calculations.
In each step, we compare the number of levels for each gate traversed and stop if it crosses a
predetermined level threshold. This provides us with means to collect gates for a cone between
various logic levels. The identification of these gates is termed as level aggregation and we keep
track of these various levels aggregated. The counter data is collected over software registers in
NetFPGA [21]. We use wireshark to capture Ethernet packets containing counter data for SP605
board. With this data, the activity matrix is built.
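The following Python sketch illustrates the backward cone traversal described above under simplifying assumptions: the netlist is a dictionary mapping each gate to its fan-in gates, primary inputs have no fan-ins, and a level map holds each gate's logic level (primary inputs at level 0). The function name, data structures and the tiny example netlist are hypothetical; the real pass operates on the Verilog netlist.

# Sketch of cone identification: starting from a cone head, walk fan-ins
# backwards until the aggregation boundary (level threshold) or a primary
# input is reached; the frontier wires become the monitored cone inputs.
from typing import Dict, List, Set

def cone_inputs(head: str,
                fanins: Dict[str, List[str]],
                level: Dict[str, int],
                level_threshold: int) -> Set[str]:
    """Return the set of wires/gates that feed the cone rooted at `head`."""
    boundary = level[head] - level_threshold
    inputs: Set[str] = set()
    stack, visited = [head], {head}
    while stack:
        gate = stack.pop()
        for src in fanins.get(gate, []):
            if src in visited:
                continue
            visited.add(src)
            # Stop expanding at the aggregation boundary or a primary input.
            if level[src] <= boundary or not fanins.get(src):
                inputs.add(src)
            else:
                stack.append(src)
    return inputs

if __name__ == "__main__":
    # Tiny hypothetical netlist: a,b -> g1; g1,c -> g2 (cone head)
    fanins = {"g2": ["g1", "c"], "g1": ["a", "b"]}
    level = {"a": 0, "b": 0, "c": 0, "g1": 1, "g2": 2}
    print(cone_inputs("g2", fanins, level, level_threshold=1))  # {'g1', 'c'}
    print(cone_inputs("g2", fanins, level, level_threshold=2))  # {'a', 'b', 'c'}

Increasing the level threshold aggregates more logic into a single cone, reducing the number of counters (overhead) at the cost of the accuracy analyzed in section 5.4.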
4.5 Experiments
The IP router was connected to 4 networking nodes that transmitted and received packets at
the maximum 1 Gbps data rate as described in section 4. We tried to implement realistic scenarios
under heavy operating loads. The values of the counters were periodically captured at the same
time the total power of the board was measured with an external ADC.
This experiment validates our claim of accurate total power partitioning using internally
observed variables. However, it does not fully support our claim of per component power
measurement since we do not have the per component ground truth values.
We validate our claims on per component power measurement by performing a well-
controlled experiment where we measure and compare power dissipated by a single logic block
against the total measured power. The SP605 is programmed with four ISCAS benchmark circuits: a traffic light controller (CKT2 is ISCAS benchmark S298), a single error correction circuit (CKT4 is C1908), and two other combinational logic benchmark circuits (CKT1 is C880 and CKT3 is C432). We chose these functional blocks because they are representative of common components in arithmetic data paths. Using them, we run all possible test cases that enable and disable specific components and measure power to verify and validate our proposed method for per-component power. The same test vectors are applied across the various experiments by programming
the linear feedback shift register with the same seed. This provides us with a means to compare
power with ground truth under the same testing conditions.
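The repeatability of the stimulus rests on the fact that an LFSR re-seeded with the same value reproduces exactly the same vector sequence. The short Python sketch below illustrates the idea; the 16-bit width and tap mask are illustrative choices and not necessarily the polynomial used on the board.

# Sketch: a Galois-style 16-bit LFSR yields an identical vector sequence for
# an identical seed, so measured and calculated power are compared under the
# same stimulus across experiments.
def lfsr16(seed: int, taps: int = 0xB400):
    state = seed & 0xFFFF
    while True:
        lsb = state & 1
        state >>= 1
        if lsb:
            state ^= taps
        yield state

if __name__ == "__main__":
    gen_a, gen_b = lfsr16(0xACE1), lfsr16(0xACE1)
    run_a = [next(gen_a) for _ in range(5)]
    run_b = [next(gen_b) for _ in range(5)]
    assert run_a == run_b        # same seed -> same test vectors
    print(run_a)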
Chapter 5 : Results on Initial Experimentation
We perform experiments to evaluate our method of translating the ideal power measurement
technique described in equation 6 to a feasible technique described in equation 11. We perform
analysis of our results in the following three phases:
5.1 Total power comparison
We compare measured and calculated total power over the NetFPGA to validate our claim of accurate power partitioning using internally observed variables. A subset of the experimental data (training set) is used with the solver to obtain the weights for individual components. The solver was trained with 50000 data sets from the NetFPGA, and the power measured is for the subsequent 1000 instances of time; figure 11 is a snapshot of 150 ms for 20 components. It shows a time series of calculated power measurements whose sum is overlaid on top of the power measured by an external ADC. Over 20 components, the total of the errors would have caused a greater difference between measured and calculated power had the partitioned power values been erroneous. This provides us with the first step in validating the claim of accurate power partitioning using internally observed values. This also empirically validates our assumption of decomposing the total power over k components as given by equation 11. Results suggest that the total error introduced in such a decomposition is a maximum of 2.87%.

Figure 11: Comparison of the sum of per-component results with the total power measured with an external ADC on the NetFPGA board
5.2 Partitioned power
The results from the NetFPGA platform indicate effective power partitioning among the observed components. However, they do not fully support the claim of per-component power measurement. We use the SP605 platform to perform experiments to validate our claim on per-component power measurement. Unlike the NetFPGA, the power of the main FPGA, i.e. the Spartan 6, can be directly measured on the SP605 board. It has a current sense resistor connected to the Spartan 6 power supply that is accessible on-board for measurement.
We enable and disable specific components on a design implemented on SP605 board to
compare measured and calculated power to verify and validate our method for per component
power. We validate our claims on accuracy of component power measurement by programming
this FPGA with a known component and comparing ground truth with the calculated power. The difference between the measured power values across the various scenarios is compared with the calculated power to remove the effects of other on-chip components during our analysis.
Additionally, the total power values are recorded when the same selected component was
enabled with other components switched off. This data is compared with calculated power values
for the same testing scenarios. Table 1 presents the summary of results. For each case, a “0” and a “1” in a column represent enabling or disabling of the corresponding circuit, respectively. We collected and compared data for 16 cases with the 4 circuits. It can be seen that the difference between calculated and measured power values is less than 2%.
Case   Ckt1   Ckt2   Ckt3   Ckt4   Error (%)
  1      0      0      0      0      1.16
  2      0      0      0      1      1.33
  3      0      0      1      0      1.28
  4      0      0      1      1      1.42
  5      0      1      0      0      1.22
  6      0      1      0      1      1.56
  7      0      1      1      0      1.38
  8      0      1      1      1      1.59
  9      1      0      0      0      1.27
 10      1      0      0      1      1.62
 11      1      0      1      0      1.51
 12      1      0      1      1      1.68
 13      1      1      0      0      1.29
 14      1      1      0      1      1.71
 15      1      1      1      0      1.78
 16      1      1      1      1      1.85

Table 1: Evaluation of per-component power on SP605

5.3 Analysis of Residual Power
We perform analysis of residual power under noisy total power measurements to evaluate the resilience of our power partitioning method. We use a noisy ADC with our method and extract the residue to perform further analysis. To quantify the amount of noise in the ADCs, we first measured square waves of frequencies 1 MHz, 10 MHz and 100 MHz from a signal generator using both ADC1 and ADC2 and compared the measurements over 50000 data points for each frequency.
Peak error for both ADCs was found to occur at 100 MHz. ADC1 was off by a maximum of 4.8% and ADC2 by 11.25%. Along with these ADCs, we added a signal with a known characteristic to the ADC1 measurements. Using these three measurements (ADC1, ADC2, ADC1+Noise) in our method, power weights were obtained. It can be observed from Figure 12 that
the power weights for the FPGA converged to the same value for all three cases. The
convergence of power weights with ADC1 was faster with 36 iterations while that with ADC2 and
ADC1 + Noise converged with 106 and 183 iterations respectively. Furthermore, the result showed
that the algorithm deposited uncorrelated factors of the signal into the residue component of the
equation; ultimately yielding the same per component power weights even with noisy or less
accurate ADCs.
Unlike the event counter techniques, our method does not require modeling of power. In fact,
our counters are instrumented to simply measure the signal activity rate of a wire or a group of
wires in a system. Since the counter values quantify activities that are closely correlated to the actual dynamic power dissipation of the gates within the logic blocks, we are effectively sensing the power using the counters.

Figure 12: Values of power weights during run-time calibration
5.4 Analysis of Error vs. Overhead
We systematically remove digital sensors within the logic to perform analysis on bounds on
accuracy of our system. First, we identify various levels of logic cones and then we aggregate
various levels of cones. We add digital sensors at the inputs of larger cones obtained post
aggregation. This way, we remove counters in the smaller cones and account for counters only in
the aggregated cones. We take the net list that has been instrumented with sensors at finest level
of logic cones. We traverse from the primary outputs to primary inputs.
We rename the cone head’s labels for aggregation and propagate the change for the rest of the
cone. Once the cone is aggregated, we save the list of input wires of the aggregated cones for
further instrumentation.
Table 2: Summary - Error vs. Overhead (each entry is Error (%) / Overhead (%))

Circuit   Least Error   Maximum Error   Optimal Error and Overhead
CKT1      1.17/110      5.6/3           2.14/19
CKT2      1.01/93       2.78/1          1.56/13
CKT3      0.98/130      4.3/2           1.43/21
CKT4      1.17/95       4.22/1          2.78/11
For our current implementation, we aggregate the first immediate level of logic cones and
repeat it iteratively until only the primary input to output cones exist. At every level of aggregation
of cones, we extract power using our method and compare with direct measurements.
Figure 13 suggests that the circuit structure determines the variation of error for various numbers of counters instrumented. All the results on overhead presented are for the FPGA platform.
The maximum overhead point is the region where we instrument counters at the input of circuit
cones at the smallest level feasible. The minimum overhead occurs when only inputs of the
components are instrumented. For the benchmark circuits used, the minimum error point occurred
when maximum numbers of counters were instrumented. The maximum error occurred when only
the inputs of the modules were instrumented.
Initially for CKT 1, we instrumented counters at cones obtained between subsequent two
levels. The maximum error with such instrumentation was 1.17%. Further, we aggregated cones
between subsequent three levels and so on. At every step of aggregation, we measured and
compared the power values. We observed an increase in error when the number of cones
aggregated increased. This was primarily due to the activity within the logic which was not fully
represented by switching at the inputs of the cones. At 19%, 20% and 40% overheads, the errors
changed abruptly, taking values of 2.14%, 2% and 1.5% respectively. These errors changed because some of the wires which contributed to greater activity, and hence power, became fully covered by the sensor instrumentation.
Further, we perform the same analysis using simulation with 65nm and 180nm device models
on SPICE to perform analysis on sources of errors and applicability of our method at various
technology nodes. Figure 14 shows the maximum error at maximum number of counters
instrumented to be 0.01% and 0.03% for 65nm and 180nm respectively. Also, it can be observed
that the shapes of the curves from emulation and simulation are similar for circuit 4. The difference between the 2.75% error point in emulation and the corresponding 1.75% in simulation (65nm implementation) can be considered a bias stemming from the emulation platform.
Figure 13: Error Vs. Overhead Analysis for benchmark circuits used in emulation
The source of this error can be explained as a bias introduced by external ADC instrumentation
itself. It also suggests that our method is closer to the actual power dissipated in the logic and is
comparable to the ideal power measurements.
Figure 14: Analysis of Error vs. Overhead of Simulated results of CKT4
Chapter 6 : Processor Power Monitoring
The focus of our work is to deploy an accurate in-situ power monitoring system in a real time
environment in many of the commonly used mobile devices. We found that the processor is a
single common factor amongst all these mobile devices that can potentially benefit from our power
monitoring system. Current mobile devices use multi-core processors, and battery life is of critical importance. With our technology, each mobile device can manage power based on its own manufacturing characteristics to extend battery life.
Our technology uses digital hardware sensors and/or software monitoring mechanism as
underlying instrumentation for correctly partitioning the precise dynamic and static power for sub-
circuits within digital logic designs. The total power measurements need to be collected in conjunction with the digital sensor information to enable calibration of the system. The current state-of-the-art in SoCs suggests that all of this can be implemented with minimal extra hardware overhead. To
implement such a system on a processor, in this chapter we revisit our sensor architecture and
describe sensor instrumentation within a processor and compare the results.
6.1 Sensor Architecture
The digital sensor can be described as various combinations of spatial and temporal pattern
matchers followed by a function unit as shown in Figure 15. Then the total chip power and the
accumulated digital values from all sensors are sampled at a regular interval to extract parameters
for the algorithms to derive dynamic and static power of each sub-circuit represented by the
sensors.
The spatial pattern matcher examines one or more bits or wires and looks to match a bit pattern in any one instance of time. This unit can be set to look for a particular pattern in the digital lines and data, and produces a corresponding output. For example, control signals or instruction opcodes can be matched by this unit to determine the activity of a functional unit. In some cases, a
spatial pattern matcher is simply a wire to directly connect input to the output.
The output of the spatial pattern matcher is then passed on to the temporal pattern matcher to
determine patterns over some duration of time. For instance, a temporal pattern matcher may look
for a single bit pattern of 0 followed by 1 over any two clock transitions; this temporal pattern
matcher would detect all of the rising signal transitions of the bit or the wire.
Then the output of the pattern matcher is passed to a functional unit to re-condition the output
to represent the total activity of the sub-circuits over a sample of time. This unit can be made using
a combination of a memory or a look-up-table and an accumulator.
The look-up-table can be used to remap the pattern matcher output to match a typical activity
of a sub-circuit. This map can be created using simulators with typical input vectors or logic
activity estimators through models. In certain cases this map may be a direct 1-to-1 map to
eliminate the need for any memory.
The unit activity represented by a digital value can be combined by arithmetic units such as an
accumulator, adder, or other functions. For example, accumulators can be used to simply accumulate the unit activity values to incrementally represent the activity of a sub-circuit over a sample of time.

Figure 15: Instruction Profiler (Spatial Pattern Matcher with Accumulator)
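The following Python sketch is a behavioral model of the sensor pipeline just described; the hardware version is a handful of gates and a register. The class name, the example spatial pattern and the LUT values are hypothetical placeholders for per-match activity estimates, not values from the dissertation.

# Behavioral sketch: spatial pattern matcher (one-cycle bit pattern across
# wires), temporal matcher (0-then-1 over two cycles), optional LUT remap,
# and an accumulator that builds up the sub-circuit's activity summary.
class ActivitySensor:
    def __init__(self, spatial_pattern, lut=None):
        self.spatial_pattern = tuple(spatial_pattern)  # e.g. an opcode bit pattern
        self.lut = lut or {0: 0, 1: 1}                 # 1-to-1 map by default
        self.prev_match = 0
        self.accumulator = 0

    def clock(self, wires) -> None:
        match = 1 if tuple(wires) == self.spatial_pattern else 0
        # Temporal matcher: detect a 0 -> 1 transition of the spatial match.
        rising = 1 if (self.prev_match == 0 and match == 1) else 0
        self.prev_match = match
        self.accumulator += self.lut[rising]

if __name__ == "__main__":
    sensor = ActivitySensor(spatial_pattern=(1, 0, 1), lut={0: 0, 1: 3})
    for wires in [(0, 0, 0), (1, 0, 1), (1, 0, 1), (0, 1, 1), (1, 0, 1)]:
        sensor.clock(wires)
    print("accumulated activity:", sensor.accumulator)   # two rising matches -> 6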
6.2 Hardware Instrumentation
6.2.1 Logic Gate-level Sensors
Logic gate-level sensor instrumentation that is shown in figure 16 captures digital activities of
each component of interest completely. This type of sensor can be described as a spatial sensor consisting of a direct wire connection followed by a 0-to-1 pattern matcher over two time states (which can easily be built using an AND gate and a D flip-flop). Then, each 0-to-1 transition can be accumulated using a 1-bit accumulator. Multiple 0-to-1 transitions can also be accumulated into a single accumulator to reduce the hardware resources at the price of reduced accuracy. The values from
these sensors are then used by the power management unit to extract dynamic and static power in
real time. It captures activity of each component spatially and measures power at any instant of
time.
Figure 16: Logic Gate Level Sensor (Temporal Pattern Matcher with Accumulator)
We have implemented a proof-of-concept system on an FPGA platform to test the effectiveness of this type of sensor in various digital logic designs. This implementation gave sub-component power measurements that match the ground truth with 99% accuracy on the SP605 platform for common benchmark circuits [8].
6.2.2 Instruction Profiler
Alternatively, we could use instructions or control signals that drive the logic on the chip and
match patterns temporally to measure power. In figure 17, we use an instruction profiler that
matches all the instructions and increments a counter. These instruction counts are used for all
subsequent power calculations. Power measurements at a finer granularity can be made to capture per-component energy by mapping each of these instructions to the components it activates over time. A counter associated with each component is incremented based on the instruction executed.
These counters can be then used as parameters to our algorithm in measuring dynamic power
dissipated as each instruction is executed.
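The Python sketch below illustrates the instruction-profiler idea: each committed opcode increments its own counter, and an opcode-to-component map increments a counter for every component the instruction activates. The mapping shown is purely illustrative (the real mapping comes from the processor architecture), and the class and dictionary names are assumptions.

# Sketch: per-instruction and per-component counters driven by retired opcodes.
from collections import Counter

OPCODE_TO_COMPONENTS = {
    "l.add": ["decode", "alu", "regfile"],
    "l.mul": ["decode", "multiplier", "regfile"],
    "l.lwz": ["decode", "lsu", "dcache"],
}

class InstructionProfiler:
    def __init__(self):
        self.instr_counts = Counter()
        self.component_counts = Counter()

    def retire(self, opcode: str) -> None:
        self.instr_counts[opcode] += 1
        for component in OPCODE_TO_COMPONENTS.get(opcode, []):
            self.component_counts[component] += 1

if __name__ == "__main__":
    prof = InstructionProfiler()
    for op in ["l.add", "l.lwz", "l.add", "l.mul"]:
        prof.retire(op)
    print(prof.instr_counts)       # per-instruction counters
    print(prof.component_counts)   # per-component activity counters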
Figure 17: Instruction Profiler (Spatial Pattern Matcher with Accumulator)
6.2.3 Hybrid Instrumentation
There are more complex interactions between several peripheral components which might not involve processor intervention, such as DMA access or Ethernet packet reception. Such interactions might dissipate non-negligible power and cause error in measurements. The spatial measurement technique captures such interactions efficiently, but it incurs greater instrumentation overhead. The temporal power measurement has low overhead but fails to capture such interactions.
We implemented a system which is a hybrid of the two techniques to capture off processor
interactions separately as a component and all other measurements using the temporal
measurement technique. Figure 18 shows an example of the implementation. This method gave us
99% average accuracy when implemented on an OpenRISC processor [44] running Linux and measuring power in real time using an external ADC.
Figure 18: Hybrid Instrumentation
Chapter 7 : Accurate Fine-grained Energy Per Instruction (EPI) and Energy Per Component Per Instruction (EPC)

Despite the seemingly unrelenting technology trend of shrinking transistors, the performance per
watt of digital systems stopped keeping pace for most of the last decade. Many major integrated circuit (IC) manufacturers suggest that the main reasons for this behavior are the stunted scaling of the nominal supply voltage and the magnifying impacts of process variations. Researchers in both academia and industry are also conceding to the notion that most of the effective solutions for fixing the energy problems have already been used, and that the remaining avenue for new exploration is application-specific design tuning and smart energy management. While this avenue is a widely accepted path forward, IC designers need a sufficiently accurate and feasible
energy measurement technology to even begin to consider any energy management algorithms.
In this section, first we present a minimally invasive in-situ and accurate energy per instruction
(EPI) monitoring methodology. Then, we show that the same parameters used to compute EPI can
be used along with architectural information to extract accurate energy per component (EPC) in
general purpose processor (GPP) based System-on-Chip (SoC). Finally, we present a generalized
sub-component level partitioning methodology for SoC that can be applied to our technology to
automatically obtain EPC (AEPC) even when there is limited architectural information of the
system.
Our technology has been verified on a number of FPGA-based SoC platforms with a full-featured Linux operating system to evaluate the resilience of our method for various scenarios. Results show that our novel in-situ power extraction methodology provides EPI and EPC results within a maximum deviation of 3% from high-accuracy total system energy measurements from a data
acquisition (DAQ) tool; this low level of deviation is found to be less than the manufacturing
variations present in most current sense resistors used by DAQs.
7.1 Introduction
Energy monitoring of modern day digital integrated circuits (ICs) is complicated by
instrumentation noise, process variations and system dynamics. It becomes increasingly difficult
to monitor energy in FPGA based systems due to variations during the bitfile generation such as
place and route variations between multiple runs of the same design. The research described here
is a novel integrated method to estimate the energy of software applications running on FPGA-based systems in the face of such complications. The novelty is in obtaining energy estimates online that adapt to process variations and system dynamics in the presence of noise. The accurate estimates
from our technique can assist energy management systems to be more effective and hence increase
the battery life in mobile systems.
Earlier efforts indicate application level fine grained energy management is the path forward
as most of the circuit design techniques to reduce energy have already been extensively explored
[43,46]. This implies that accurate and fine-grained feedback on the energy consumption of applications at runtime to the energy management systems is a path to fruitful gains in energy savings. Several efforts from industry and academia to achieve greater energy efficiency are based either on dedicated direct measurement techniques or on offline energy-modelling methods.
FPGA manufacturers provide pre-characterized library based simulation tools [48,49] to estimate
energy consumed by the underlying hardware. However, energy estimates from these tools do not
consider manufacturing variations to provide run time results for FPGA chip at hand. Some of the
FPGA platforms provide means to measure energy on the FPGA energy supply rails by inserting
current sense resistors that are monitored by on-board data acquisition tools [49]. However, the
dedicated measurement technique tends to be impractical for larger designs on FPGA due to the
number of components to be monitored for useful granularity of measurements. On the other hand,
the other energy estimation techniques using offline and online modelling methods [17,66] either
incur high latency or tend to be erroneous during longer usage cycles. The temperature, packaging
variations and ageing patterns contribute to variations in the energy dissipation within a chip and
across its copies. Such variations introduce error in these measurement techniques since frequent
recalibration of energy models is not employed. This suggests, in consideration of process, voltage
and temperature (PVT) variations, each chip should be individually characterized to extract chip
specific power models under various operating conditions to enable accurate energy estimates.
7.2 Contributions
Considering these issues, we present an online, in-situ, accurate and runtime energy monitoring
method for FPGA based SoC designs. The technique accounts for process variations and other
hardware implementation parameters as it is in-situ and online. Our method also features a re-
calibration mechanism that enables the measurement system to adapt to the system changes during
runtime to consistently provide accurate results. Such a system can assist FPGA application
developers to iterate their applications based on the accurate runtime energy feedback of their
software application. FPGA designers can use our method to build better energy management
schemes that is accurate for each FPGA chip without the need of pre-characterization of the every
system for energy. The accurate energy measurements can feed-forward to the synthesis tool such
that it can optimize the place and route patterns of a design on FPGA for maximum performance
per joule that is chip-specific. The two contributions of our work to the state-of-the-art are:
a) Minimally invasive in-situ EPI monitoring methodology for FPGA based GPP
In our method, we estimate the energy dissipated by the instructions of the embedded processor at runtime in a chip-specific manner, with a frequent re-calibration scheme. To build such a system, we use an instruction profiler built into the hardware that provides accurate instruction counts during program execution in real time. We use an external total energy measurement or data acquisition (DAQ) scheme using an analog-to-digital converter. A hardware unit is built into the
system to acquire the total energy values from the external DAQ. Energy weights are extracted
using the instruction count values and the total measured energy. For subsequent energy per
instruction (EPI) estimation, energy weights are multiplied with their corresponding instruction
count value.
b) Technique to measure accurate energy per component (EPC) in a GPP based System-
on-Chip on FPGA
Next, we find the physical level sub-component power by extending the EPI measurement
technique to accurately measure EPC. We examine the processor and SoC architecture to extract
a mapping from instruction execution to component usage. Using component usage counts, we
partition the total energy into sub-component energy usage without increasing the hardware
overhead. Our experiments show that the runtime EPC result is better than the EPI-based result. Also, this method of mapping instruction-level power to physical components did not require any additional sensors beyond the instruction profiler used in the EPI method.
Results of the experimentation presented in subsequent sections suggest that our current measurement system can measure SoC energy online at less than 3% deviation from the reference external energy measurement values. We use the term “minimal” as our energy measurement system incurs less than 2% total system area and energy overhead. The overhead we incur consists of a built-in hardware profiler and a short calibration phase using the total measured energy. We use the term
“invasive” to indicate the post-design modification to the platform. The platform is instrumented
with our monitoring methodology after its successful completion of the test and verification
phases. The term “in-situ'' refers to our on-chip methodology that uses hardware to monitor energy
and software to calibrate the measurement scheme.
7.2.1 Minimally Invasive In-situ EPI Monitoring Methodology
The main focus of this chapter is to accurately measure the energy consumed by the processor and its peripherals at a fine granularity. However, in this section, we begin by building an in-situ online energy-per-instruction measuring methodology to account for runtime variations (e.g. supply voltage variations) on the chip. These variations tend to change the way a component dissipates energy and hence contribute to error in energy estimation. In a later section, we extend the EPI methodology to include peripheral energy consumption. The key contribution of our work
is partitioning the total energy online using sparsity in instruction execution using minimal
memory, energy, and hardware area without the need to create any explicit offline model. The
current work builds on our previous effort in measuring energy accurately online by summarizing
activity within a circuit by dividing it into groups of gates [50]. However, a direct implementation
of the technique is infeasible due to the numbers of wires of each architectural component that are
monitored to obtain high accuracy. We therefore summarize the activity of the SoC based on
several observations of the system and extend mathematical analysis from our prior work to
implement an accurate measurement system. Also, it is observed that, our prior work incurs large
number of iterations to suppress noise to extract accurate weights. Hence, we use the technique of
noise suppression by residue minimization during estimation by Handschin and Schweppe et al
[51,52] and modify it based on empirical results to reduce the number of iterations required to
suppress noise. We use part of the method of Joseph et al [53] to test the best fit weights for energy
estimation of the processor based system.
7.2.1.1 Background
We begin our analysis on the total energy consumption of the SoC to arrive at the energy
monitoring method presented in this paper. From the fundamentals of CMOS circuit switching,
the total energy consists of dynamic and static energy. The dynamic energy depends on circuit
switching activity and can be further decomposed into energy due to capacitive switching and the
crowbar current. The static energy dissipation is considered to be based primarily on leakage
current. Intuitively, observing switching activity at every gate of the system results in an accurate
measurement of dynamic energy of these components at finest granularity. Also, static energy can
be measured at the voltage supply rail of these circuits based on the input patterns applied.
However, it is infeasible to implement such a system due to impractical implementation overheads.
We therefore find alternatives to monitor sub-systems by summarizing the activities to estimate
dynamic and static energy with minimal loss in accuracy. We build our measurement method by
measuring the total energy and correlating it with the summary of the circuit activity within the
SoC at hand. This paper assumes that the hardware implementation (Verilog description of the
hardware) of the SoC is available for analysis to build the energy estimation platform. We plan to
extend this method in our future work to cases where such details of the hardware implementation are not available. Further, we monitor the dedicated voltage rail to the FPGA using a DAQ to measure
total energy consumed by the SoC (e.g. VCCINT rail on FPGA). Our current work considers the
values from the DAQ as the reference or ground truth to compare and update our system. We
observe that a feature to measure total energy is available for a wide range of processor and SoC platforms. Therefore, the system-level requirements of our work are applicable to a wide number of platforms currently in use.
7.2.1.2 Our Approach
We approach our problem to summarize circuit activity within SoC by finding alternatives to
represent activity within the embedded processor. The method to measure embedded processor is
extended to monitor peripherals along with the processor activity to extract SoC energy in section
7.2.6. We summarize the activity within the processor by observing that the execution of
instructions within a processor is unique in the way the components are used. Also, the instruction
execution within a processor gives us a finite set of component usage patterns that contributes
towards total processor energy consumption. We use this observation to instrument instructions
that drive the logic on the chip with a hardware instruction profiler to extract the accurate
instruction count of the application. The hardware instruction profiler is built into the processor at
the instruction decode stage. This enables our system to accurately capture the summarized activity
and correlate it to the total energy consumed. Initially, we formulate our energy model of the
processor as active and idle energy components during an instruction execution as:
E_{total} = E_{active} + E_{idle}        (20)

The active energy comprises the dynamic and static energy of the processor that is proportional
to the instruction execution of the processor. This component consists of the energy of the architectural components activated during the execution of the instruction at hand. The idle energy consists of the
dynamic and static energy consumed by components such as hazard detection units in the
processor that are not directly controlled by the instructions themselves. We iteratively find the
idle energy by projecting it into a constant component. The idle energy for our system is
approximately constant, and our results from section 7.2.6 validate this assumption. On the other
hand, the idle energy may change due to changes in the system at runtime such as supply voltage
changes e.g. voltage drops. Such effects that persist during applications executions are shown to
cause significant variations in energy dissipation and in turn introduce error in energy estimation
[55]. However, our technique features an online update to quickly adapt to these changes.
Further, we examine the effect of idle components on active energy during application
execution. Prior efforts show that variations in the application workloads cause erroneous
estimations due to variations in instruction execution patterns [56]. The variations are caused by
insertion of NOPs or flushing of instructions by the idle components due to branching operations,
instruction dependencies and exceptions.
This in turn may cause inaccurate correlation between instruction executed and the active
energy of the system. In our current system, we observe that the runtime changes in instruction
execution are implemented at the processor’s decode stage. Therefore, we account for these
variations in instruction executions by observing the instructions issued at the decode stage. E.g. a
hardware stalling unit inserts low power NOPs during stalling in the instruction decode stage [57].
Our hardware profiler is added at the output of these units in the decode stage such that it counts the dynamically inserted NOPs and accurately accounts for variations in instruction execution time.
Apart from variations in instruction execution, earlier efforts on energy estimation indicate that glitches or spurious switching activity result in erroneous energy estimates [58]. We observe in
our gate-level simulations of architectural components that the spurious switching within a
component is directly proportional to the activity at their inputs. Hence, we assume that the
summary of the activity within the processor accounts for the glitching activity of the components.
This in turn implies that the weights derived for active energy account for the glitching energy of the components activated by the instructions. Our results in the following sections suggest that this approximation does not increase the estimation error for various use cases.
Let us consider that the given processor “P” supports “n” instructions in its instruction set architecture (ISA). Let us consider that the processor executes “k” of these instructions. We estimate the active energy as

E_{active}(t) = \sum_{i=1}^{k} N_i(t) \, w_i

where N_i(t) represents the number of times each of the “k” instructions executed within the period “t”, and w_i represents the energy weight associated with each of these instructions on the processor, extracted from the previous “t-1” samples. We then use the w_i to estimate the energy of the sample at “t”. Let us consider that the processor activated “V” components to complete execution of the k-th instruction. Therefore, from the fundamentals of CMOS switching, w_k = \sum_{v=1}^{V} C_v V_{dd}^2, where C_v is the switched capacitance of the v-th component with nominal supply voltage V_{dd}. We can re-write equation (20) as

E_{total}(t) = \sum_{i=1}^{k} N_i(t) \left( \sum_{v=1}^{V_i} C_v V_{dd}^2 \right) + E_{idle}        (21)

where the idle energy is represented as E_{idle} = 1 \cdot w_{idle}. We project the inactive energy consumed by the system onto this constant component.
We use auto-regression methods to solve for the weights of the set of linear equations defined by equation (21), as this results in an accurate weight per component for estimating energy [53]. From the fundamentals of auto regression, Akaike’s Information Criterion (AIC) identifies the best set of variables and then derives the best fit weights for the variables identified. Joseph et al [53] derive alternatives to find the parameter set for extracting best fit weights, as AIC is impractical. This is due to the exhaustive search by AIC over the parameter set space (parameters such as gates, wires, and different voltages and temperatures) to extract the best parameter set for energy estimation of a digital system. We use instruction counts as parameters for weight extraction and find the best fit weights for these parameters using the work of Joseph et al. We find the best fit weights by first extracting the weights at the startup of the system using our training instruction set, and then use the method of Joseph et al to derive a test instruction set to test the effectiveness of the weight fit. We add an additional
component, residue, to equation (21) to account for the instrumentation and other effects of energy
measurement such as quantization noise. Therefore, equation (21) can be re-written as:
E_{total,meas}(t) = \sum_{i=1}^{k} N_i(t) \, w_i + E_{idle} + R(t)        (22)

We measure this residue for the sample at time “t” as

R(t) = \left| E_{total,meas}(t) - E_{total,calc}(t) \right|        (23)
Our algorithm yields energy weights at every iteration of energy and instruction profile sampling; E_{total,calc}(t) is the total calculated energy extracted by multiplying the energy weights with the measured instruction counts. This result is further compared with the total measured energy, E_{total,meas}(t), to obtain the residual value described in equation (23). We use the technique of Handschin and Schweppe et al to use a portion of the total residue for subsequent iterations of weight extraction. We implement the estimator used in this work to solve equation (22) as

r(t) = E_{total,meas}(t) - \sum_{i=1}^{k} N_i(t) \, w_i(t-1), \quad w_i(t) = w_i(t-1) + g(t) \, N_i(t) \, r(t)        (24)

where the gain g(t) is determined by \sigma, the variation of the residue distribution, and by c, the conditioning factor. Equation (24) is solved for w_i(t) for every additional measurement of E_{total,meas}(t) and N_i(t). The initial value of the weights is zero and, for every subsequent iteration, the residue value r(t) is used. This process is continued until the weights converge, i.e. |w_i(t) - w_i(t-1)| < \epsilon, \epsilon \to 0, where \epsilon is the allowed deviation. It has been proven by Schweppe et al for estimators of the form described in equation (24) that:

Property 1. Convergence of the weights is also the point of minimum residue.
Property 2. The weights are not smeared by noise and are stable.

We empirically chose the conditioning factor that gives us the fastest convergence rate for the OpenRISC System-on-Chip (ORPSoC) running on the Atlys platform. We perform various experiments to validate these properties on our system and compare results with ground truth to confirm our claims on the accuracy of the solver.
7.2.1.3 Online method to estimate Energy
From the fundamentals of regression and estimation [7], we can extract the weights as W = (A^T A)^{-1} A^T E_{meas}. Therefore, equation (24) can be written as:

E_{total,calc}(t) = \sum_{i=1}^{k} N_i(t) \left[ (A^T A)^{-1} A^T E_{meas} \right]_i + E_{idle}        (25)

where A is the activity matrix built by collecting all the instruction counts and arranging them temporally for “m” samples. That is,

A = \begin{bmatrix} N_{1,1} & \cdots & N_{n,1} \\ \vdots & \ddots & \vdots \\ N_{1,m} & \cdots & N_{n,m} \end{bmatrix}

where N_{n,m} indicates the m-th count value of the n-th instruction. Solving for W requires (A^T A)^{-1}, which is computationally expensive (diagonalization and inversion). We use the matrix inversion lemma presented by Mendel [38] in equation (25) to reduce the matrix inversion into arithmetic division. Equation (25) assumes the A, E_{meas} and W matrices are of order m x n, m x 1 and n x 1 respectively. However, we know that these matrices have m = 1 at each online update. As derived in the matrix inversion lemma, the inverse operation is then reduced to arithmetic division. Substituting these results in (25) and simplifying, we get:

E_{total,calc}(t) = \sum_{i=1}^{k} N_i(t) \, w_i + E_{idle}        (26)

To measure the total energy using our method, we multiply each energy weight by its corresponding instruction count. The arithmetic operations themselves can either be implemented as dedicated hardware for calibration and estimation or can be implemented as a thread in the operating system. The system frequently updates the weights online to account for changes in process and voltage variations that affect the energy of these digital components. If the error exceeds the user-defined tolerance, the calibration tool updates the weights again until convergence is achieved.

Figure 19: ORPSoC Setup
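The following Python sketch shows the spirit of this online per-sample update: with one new sample at a time (m = 1), the matrix inverse in the normal equations collapses to a rank-one update and a scalar division. It is a standard recursive least-squares form, not the dissertation's exact estimator (which additionally conditions its gain on the residue distribution); the class name, forgetting factor and synthetic data are assumptions.

# Sketch of an online weight update in the spirit of Equations (24)-(26).
import numpy as np

class OnlineEnergySolver:
    def __init__(self, n_counters: int, forgetting: float = 0.99):
        self.w = np.zeros(n_counters)        # energy weights, start at zero
        self.P = np.eye(n_counters) * 1e3    # inverse-correlation estimate
        self.lam = forgetting

    def update(self, counts: np.ndarray, measured_energy: float) -> float:
        x = counts.astype(float)
        residue = measured_energy - x @ self.w     # Eq. (23): measured - calculated
        Px = self.P @ x
        gain = Px / (self.lam + x @ Px)            # scalar division, no matrix inverse
        self.w += gain * residue
        self.P = (self.P - np.outer(gain, Px)) / self.lam
        return residue

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    true_w = np.array([2.0, 0.5, 1.2, 0.1])        # hypothetical EPI weights
    solver = OnlineEnergySolver(n_counters=4)
    for _ in range(500):
        counts = rng.integers(0, 50, size=4)
        energy = counts @ true_w + rng.normal(0, 0.05)
        solver.update(counts, energy)
    print("converged weights:", np.round(solver.w, 3))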
7.2.1.4 Training and test instruction set generation
We derive the preliminary weights for the platform as a part of system start up and then update
these weights online to account for runtime changes.

Figure 20: ORPSoC Architecture

The preliminary weights for our system are derived based on the observations by Austin et al and Gao et al on modern-day processors during
system start-up [60,61]. They show that maximum activity within a processor occurs due to
maximum activities within the architectural components during system start-up. To extract
maximum activity of architectural components, we employ a commonly used technique from
Chang et al [62]. They identify the critical path set that contributes to maximum activity within
the gate-level implementation of each component and then derive maximum activity input vector
set on the critical paths identified. We use their method to extract the maximum activity input
vectors for the gate-level description of all the components of our processor and infer instructions from the vectors based on the architecture of the processor. The inferred instructions from the maximum activity vectors form the training set, and the test set to evaluate the effectiveness of the weight fit is generated using the technique of Joseph et al. Based on the results of Chang et al,
we observe that the off-critical path activation energy of components can be averaged and is
assumed to be a part of critical path set activation energy without significant loss in accuracy in
energy estimation in our FPGA based system. The validity of this assumption for our current
FPGA based system is supported by the results in Section 7.2.6 that show the converged EPO
weights provide energy estimates with a maximum deviation of less than 3% from ground truth
for various applications. These assumptions and observations might vary for different classes of
circuits and systems; we explore more on this aspect in our future work.
7.2.1.5 Experimental Setup
We implement our method on Digilent Atlys board [18] that consists of a Spartan 6 FPGA.
We program the FPGA with ORPSoC design [19]. ORPSoC consists of an OpenRISC1200
embedded processor along with several peripheral components such as Ethernet, Direct Memory
Access (DMA) controller, Video Graphics Array (VGA) and a Double Data rate (DDR) memory
controller. It runs a Linux operating system and presents us with a system that emulates realistic
behavior of processors and SoC. The ORPSoC design is open source, thus all hardware and
software component design details are accessible to users and open to modification. The processor
implemented in this system supports 80 instructions as a part of its ISA. As shown in figure 19,
we collect the total measured energy and feed it back to the processor.

Figure 21: Analysis of rate of convergence of EPI weights under noise and noisy measurements
Figure 22: Comparison of EPI weights due to variations in synchronization
Figure 23: Instruction counters and MAC unit to calculate total energy

For all our experimentation,
the total energy is measured using PMODADC5 [59] as a part of our DAQ and is considered as
the ground truth. The instruction profile is observed at runtime at the decode stage of the processor
using a hardware profiler. Our hardware profiler consists of 80 pattern matchers, one per instruction, each of which matches the executed instruction and increments a 16-bit counter to extract the accurate instruction profile at runtime.
As shown in figure 23, a hardware multiply and accumulate (MAC) unit multiplies an
instruction count value with its weight and accumulates it to produce total energy of the embedded
processor. The total calculated energy from the MAC unit is sampled at the same time as the
external measured energy at 1 kHz, using a sampling unit to match the internal timer of the operating system (OS). This is done to estimate the energy consumed by an application between every context-switching operation of the OS. A C program is written that runs on the processor as a background process to compare the measured and calculated results. It also initiates a procedure to update the weights using equation (26) when the error exceeds the desired value. All the counters and measured energy values are accessible to the operating system, and the C program solves for weights by dereferencing the instruction counts and total measured energy values. Once the weights converge, they are used to calculate energy values.
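The logic of that background process is sketched below in Python for readability; the actual implementation is C code on the OpenRISC that reads memory-mapped counters. The 2% tolerance, the callback names and the use of the OnlineEnergySolver sketched earlier are all illustrative assumptions.

# Sketch of the background calibration loop's logic.
import time
import numpy as np

ERROR_TOLERANCE = 0.02          # recalibrate if relative error exceeds this

def calibration_loop(read_counters, read_measured_energy, solver, period_s=0.001):
    """read_counters/read_measured_energy stand in for memory-mapped reads
    (returning a numpy count vector and a float); solver is an online
    estimator such as the OnlineEnergySolver sketched earlier."""
    while True:
        counts = read_counters()
        measured = read_measured_energy()
        calculated = float(np.dot(counts, solver.w))
        error = abs(measured - calculated) / max(measured, 1e-9)
        if error > ERROR_TOLERANCE:
            solver.update(counts, measured)   # refresh weights until they reconverge
        time.sleep(period_s)                  # 1 kHz sampling, matching the OS timer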
Figure 24: Processor with instruction profiler and peripheral sensors
7.2.2 Associating Peripheral Activities with Instruction Profiler
With increasing numbers of components integrated on-chip, instrumenting each peripheral
component with activity sensors might not be possible. It might be infeasible to emulate each
peripheral on SoC and instrument a set of wires that completely capture the activity due to
impractical overhead. Also, the location of the component might make it difficult to collect the
activity data in synchrony with the others to extract power. To solve this problem, we derive
component activity from the instructions executed on the embedded processor. The peripherals of
the ORPSoC are activated based on the execution of a load or store word of the data xFFH to the
base addresses found in the memory mapping definition. We add additional hardware in the
instruction profiler to match these instructions along with the base addresses to account for
peripheral components. Whenever a load or store word is executed along with a known set of base
addresses, a counter is incremented to indicate associated peripheral component activity. The
comparator unit that matches these instructions, along with the counters, itself accounts towards the energy consumed by the measurement unit. In order to account for this, we add a column that consists of the summation of these peripheral counter values and represents the power used by the
measurement unit itself. Figure 24 shows the processor instrumentation after the peripheral
counters are integrated to the instruction profiler. These counters are sampled along with the other
instruction counters and are used in our online regression algorithm to extract weights associated
with each instruction and peripherals.
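The Python sketch below illustrates the peripheral-association idea: when a load or store word retires, its effective address is compared against known peripheral base addresses and that peripheral's activity counter is incremented. The address map, mask and class name are hypothetical stand-ins, not the real ORPSoC memory map.

# Sketch: associate load/store instructions with peripheral activity counters.
from collections import Counter

PERIPHERAL_BASES = {           # illustrative base addresses -> peripheral name
    0x9000_0000: "uart",
    0x9100_0000: "gpio",
    0x9200_0000: "ethernet",
}
BASE_MASK = 0xFF00_0000        # assumed decode granularity

class PeripheralProfiler:
    def __init__(self):
        self.counts = Counter()

    def retire(self, opcode: str, address: int) -> None:
        if opcode not in ("l.lwz", "l.sw"):        # load word / store word
            return
        name = PERIPHERAL_BASES.get(address & BASE_MASK)
        if name is not None:
            self.counts[name] += 1

if __name__ == "__main__":
    prof = PeripheralProfiler()
    prof.retire("l.sw", 0x9000_0010)    # UART register write
    prof.retire("l.lwz", 0x9200_0004)   # Ethernet register read
    prof.retire("l.add", 0x0000_0000)   # not a memory access, ignored
    print(prof.counts)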
The same method can be extended to modern day processors by instrumenting signals that are
uniquely activated by the system, such as interrupt handlers. These signals are found to be an integral part of a system that controls and co-ordinates various activities within the SoC. This provides us
means to extract the power consumed by various components on the SoC without having to
instrument the peripheral component itself.
7.2.3 Experiments and results to evaluate our method to monitor energy per
Instruction
We begin by evaluating our solver for its accuracy in weight extraction to validate the
properties on stability of weights and noise suppression as described in section 7.2.1.2. We extract results for various use cases on the embedded processor and compare them with ground truth. For the current section, we record the data and use it to perform offline analysis using Matlab by adding noise with known characteristics and time-shifted counter values. These analyses are performed offline as they cannot be done during program execution with our platform. However, we use the online nature of our system in later sections and show the effectiveness of our online method for various use cases. The analysis of the solver described in equation (24) is performed in the following
steps:
7.2.3.1 Evaluating the effect of time synchronization on Rate of Convergence
We evaluate the effectiveness of our solver by time shifting the total measured values and the
observed instruction counts. This is done to evaluate our sliding window solver for its ability to
identify the sparsity in the total energy measurements and correlate them with the observed
instruction counts. For the system at hand, we time shift the counter values in two cases: one step ahead of and one step behind the DAQ timestamp values. Time shifting the instruction profile by one step ahead represents a delay in the capture of DAQ values. Time shifting the instruction profile by one step behind represents late capture of the instruction profile values, as we expect the maximum lag in capturing instruction and measured energy values to be within 50,000 cycles. This observation holds true for all sampling cases of our current platform, as DAQ and sampling is much slower than the processor clock. The convergence of weights to the same values for the cases of a step ahead, a step behind, and accurate sampling indicates that our solver correlates the sparsity across various windows to produce stable weights.
As it can be seen from figure 22, the time shifting of instruction profile values affects the
number of samples required for convergence. However, the weights converged to the same value.
This indicates our solver indeed identifies the sparsity to produce stable results and validates the
property 1 mentioned in section 7.2.1.
Figure 25: Measured total vs Calculated Energy using EPI for Dhrystone benchmark
Figure 26: Error over time for EPI of ORPSOC
7.2.3.2 Evaluating the effect of noise on convergence
Next, we evaluate the effect of noise on the weight extraction by our method. First, we perform
analysis of residual energy under low noise measurements (DAQ1). The PMODADC5 was
characterized and was found to have less than 1% noise. We use values from DAQ1 as low noise
measurements for our experimentation. We then add a signal with known characteristic to the total
energy measurements of DAQ1to evaluate the resilience of our energy partitioning method. The
convergence of weights to the same values for the cases of noisy and low noise cases shows that
our method is robust against noise effects. Also, we perform analysis on residue to evaluate the
resilience of our method.
As can be seen from figure 23, the weights for the noisy measurements converged to the same values as those for the low-noise measurements. The weights converged in 106 iterations for the low-noise scenario, whereas it took up to 137 iterations to converge for the noisy-measurement scenario.
Furthermore, the residue obtained from the analysis contained the added noise. This validates our
claims of robustness and noise suppression of our method as mentioned in property 2 in section
7.2.1.
7.2.4 Evaluating the Energy per Instruction of OpenRISC processor
We perform EPI measurements on the OpenRISC processor using the Dhrystone benchmark and evaluate our measurements by comparing them with the fundamental technique [32]. The fundamental
technique instruments the supply rails and measures the energy while the instruction is executed
in a loop. This value is recorded and averaged over various runs of the instruction at hand offline.
Similarly, average energy consumed by all the instructions are recorded and are used for
subsequent energy estimations. We extract the EPI using the fundamental technique and our
technique using a synthetic code that was written to run all the instructions. We evaluate the
effectiveness of our method by comparing the accuracy of energy estimates for Dhrystone
benchmark with those of the fundamental technique as it is the most intuitive way to measure EPI.
The Dhrystone benchmark has been widely used to evaluate performance of embedded central
processing units (CPUs) for various processor architectures. It consists of several memory and
integer operations that are intended to evaluate the performance of the CPU. We run the Dhrystone
benchmark several times in a loop to extract the number of Dhrystones per second. We observed
50,000 Dhrystones per second for our current implementation of ORPSoC. Figure 25 shows the plot for a subset of the measured versus the calculated energy values. Figure 26 shows the plot of the maximum deviation of the estimated values from the ground truth. The peak deviation from the ground truth was 4.35% and the average error was 3.76%. In contrast, the results from the fundamental technique indicated a maximum deviation of up to 10.8% with reference to the ground truth. The results suggest that the idle component and noise affect the active energy component to cause the 10.8% deviation in the fundamental technique. On the other hand, figure 26 implies our system produced accurate
results by suppressing noise. The instruction counts during the benchmark execution from our
method and the traditional technique were found to be the same as the expected benchmark
instruction profile. These results confirm that our work extracts accurate runtime instruction counts
and improves upon the fundamental work to estimate accurate runtime energy values.
7.2.5 Method to Measure Accurate In-situ Runtime Energy per Operation of
a SoC
We hypothesize that the error in our EPI measurements is in part due to peripheral component
activity. Furthermore, the energy dissipation patterns are expected to change on each FPGA chip
for different synthesis runs due to variations in placement and routing patterns and manufacturing variations. We therefore provide an online method to partition the total measured energy based on instruction execution along with the peripherals at fine resolutions to account for such variations.
Let us consider that we collect activities M_j(t) from the activity counters of the “p” peripherals in our SoC. Equation (22) can then be written for the total energy for “k” instructions as

E_{total,meas}(t) = \sum_{i=1}^{k} N_i(t) \, w_i + \sum_{j=1}^{p} M_j(t) \, u_j + E_{idle} + R(t)        (27)

where u_j is the energy weight associated with each activity of each peripheral component.
We use the same mathematical technique used to solve equation (22) to solve equation (27).
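To make equation (27) concrete, the following is a minimal sketch, in Python with NumPy, of how the instruction and peripheral energy weights could be fitted from logged counter snapshots and external energy measurements. It uses a one-shot batch least-squares fit purely for illustration; the function name and array layout are assumptions, and the actual system solves the same problem with the iterative online update described in section 7.2.

import numpy as np

def extract_energy_weights(inst_counts, periph_counts, measured_energy):
    # inst_counts:     (samples, k) matrix of per-instruction counts per interval
    # periph_counts:   (samples, p) matrix of peripheral activity counts per interval
    # measured_energy: (samples,) vector of total energy from the external ADC
    samples = inst_counts.shape[0]
    ones = np.ones((samples, 1))                      # constant column for the idle energy
    A = np.hstack([inst_counts, periph_counts, ones])
    # Least-squares fit of E_total = sum(w_i a_i) + sum(u_j b_j) + E_idle
    w, *_ = np.linalg.lstsq(A, measured_energy, rcond=None)
    k, p = inst_counts.shape[1], periph_counts.shape[1]
    return w[:k], w[k:k + p], w[-1]                   # instruction, peripheral, idle terms

The residue R of equation (27) corresponds to the remaining fitting error of such a solve.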
7.2.5.1 Experiments and results to evaluate energy per operation of
ORPSoC
We instrument counters on a set of peripheral inputs in the current ORPSoC implementation to extract the EPO of the SoC. We build on prior work that extracts a minimal set of wires for maximum accuracy [50]. Using the results of that work, we independently and exhaustively characterize each peripheral component to arrive at the wires to be monitored for maximum accuracy. We instrument 70 dedicated counters at various peripheral inputs and use them along with the instruction counts to extract energy weights.
Figure 27: Results on energy measurements for various workloads
We run the Coremark benchmark
to evaluate processor and peripheral energy measurements. We use file transfer via UART and toggle the general purpose inputs and outputs (GPIO) to exercise peripherals that are not used by Coremark. Lastly, we run grep and gzip along with their benchmark files to evaluate our system on commonly used applications. We extract the results from our method and compare them with the ground truth for all these applications.
The Coremark benchmark exercises the processor, memory, DMA, the Wishbone bus and parts of the VGA controller during its execution. We run this benchmark in a loop 100,000 times; the maximum deviation from the ground truth was 2.8%. The ten instructions that contributed most to the total energy are shown in Fig. 27.
We evaluate UART energy consumption by transferring a file of the largest size (1 MB) that our current system setup supports. The 1 MB file was transferred 100 times in different sessions; the maximum deviation from the ground truth was 1.85%, and the distribution of energy across the various circuits is shown in Fig. 27. GPIOs may have additional circuits to drive larger loads and hence might consume greater energy [63]. To ensure our method works with such modifications, we connect a GPIO available on our platform to a light emitting diode (LED) and toggle it continuously for 2 seconds using C code running on our processor. The results for 100 runs show a maximum total deviation from the ground truth of 2.4%; the GPIO entry in Fig. 27 provides a breakdown of the energy usage during GPIO access. Further, we analyze the energy consumption of grep and gzip, which are known to be I/O and arithmetic-instruction intensive and also exercise several parts of ORPSoC. We create a 1 MB file based on the commonly used benchmark standard for grep and gzip and run the grep and gzip commands on the file as suggested by the benchmark suite. The grep and gzip entries in Fig. 27 show the top 10 contributors to total energy consumption, and the maximum error was 2.78% across the various runs. Also, we found that the previous 10 iteration snapshots, comprising the instruction counts, peripheral activity counts and total energy measurements, are sufficient for convergence when performing the online update of the 80 instruction and 70 peripheral energy weights. The maximum deviation in the energy measurements for these commonly used applications was less than 3%. It can also be observed in Fig. 27 that the idle energy is constant, which supports the assumption made in equation (20). The benchmarks exercised the various peripherals in different patterns, causing variations in the system dynamics such as changes in voltage drops. Our system adapts quickly to compensate for these changes and hence keeps the maximum deviation from the ground truth within 3%.
7.2.5.2 Analysis of hardware instrumentation for EPI monitoring
Further, we evaluate the energy and area overhead incurred by the hardware instrumented to monitor the EPO of the SoC. The base OpenRISC processor used 5390 slices (79% FPGA utilization) and the entire ORPSoC platform used 6413 slices (94% FPGA utilization). The hardware profiler embedded in the processor used a total of 93 slices, and the complete hardware instrumentation used 122 slices, giving a total area overhead of 1.9% for ORPSoC. The overhead values were consistently less than 2% for other implementations of the platform [45]. Furthermore, we add a column with the sum total of all the counter values during the weight extraction to partition out the energy consumed by our measurement system. For the user applications tested above, our hardware instrumentation used 0.87% of the total platform energy. The C code used for weight extraction and self-calibration was characterized independently and was found to consume less than 1% of the total system energy and CPU time to derive converged weights. It can also be observed in Fig. 27 that the idle energy is approximately constant across the various use cases, which supports the assumption made in equation (20).
7.2.6 Extracting Energy Per Component (EPC) for GPP based SoC
Prior art suggests there is scope for improving the power efficiency of a microprocessor given accurate, low-latency estimates of the energy consumed per component per instruction executed. The EPI value alone does not give information about the physical-level power consumption of the processor, which a thermal management unit could use to optimize temperature across the chip. Therefore, we map instruction counts into micro-architectural component activation counts
and use them with equation (20) to extract weights per component. Let us consider that an instruction I activates a subset C_I of the “m” components in the processor P, where “m” is the total number of components in P needed to support its ISA. The energy consumed by I can then be represented as,

E_I = \sum_{i \in C_I} w_i + \sum_{j \in \bar{C}} s_j + R    (28)

where D is the number of components not used by I, \bar{C} represents the components that are not exercised by the processor, and s_j is the static contribution of component j. We assume that the components in \bar{C} contribute only to the static component of the total embedded processor power. For a set of k instructions executed in an interval t, we can rewrite equation (28) with the corresponding static and dynamic power components as,

E_total = \sum_{i=1}^{m} w_i a_i(t) + \sum_{j=1}^{m} s_j(t) + R    (29)

where a_i(t) is the activation count of component i over the interval. As (29) is of the same form as (22), the same applied-mathematics-based algorithm can be
applied to estimate the EPC per instruction. Using the method of mapping the peripherals in a SoC discussed in the previous section, equation (29) can be re-written for the SoC as,

E_SoC = \sum_{i=1}^{m} w_i a_i(t) + \sum_{l=1}^{p} u_l b_l(t) + E_static + R    (30)

Now, (30) is of the same form as (22). Therefore, the same regression algorithm can be applied to estimate the power of the SoC.
7.2.6.1 Mapping ISA into Architectural Components
To profile every instruction in the OpenRISC processor, we begin by manually identifying the various architectural components activated by the processor, such as the ALU and the control unit. We decompose instructions into various components hierarchically and continue the decomposition until the desired accuracy is reached. Initially, we divide the set of supported instructions based on their operation; for example, we classify instructions that use the load-store unit as one group. This type of classification gives us seven groups for the initial decomposition of the ISA of the current OpenRISC architecture: ALU, Multiply-and-Accumulate, Shift-Rotate, Branch, Load-Store, Compare and Floating Point operations. We hypothesize that these components contribute a larger share of the power usage when an associated instruction activates them during execution. We label this level of granularity of decomposition as Layer 1 of the instruction profile.
Based on the desired accuracy, these sets of components are further decomposed by identifying differences between the instructions within a specific group. Using the changes covered during the previous decomposition, we identify additional sub-components during the current decomposition to compensate for unaccounted differences. The sub-components found in the current decomposition are then associated with the differences between the instructions under consideration. The error due to differences between instructions with similar component-usage profiles can be compensated by subsequent decompositions. The error can stem from differences in instruction execution paths (e.g. jump instructions) and in the usage of each component (e.g. changes in input patterns to the ALU and memory in add and sub instructions).
Figure 28: Component decomposition by EPC
We create a mapping template that can be used by the processor to convert the hardware instruction counts into component counts and thereby estimate energy at the desired accuracy. Using this mapping, we extract the component utilization of any program running on the CPU and estimate its energy consumption from the component counts.
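As an illustration of such a mapping template, the sketch below converts per-instruction counts into Layer-1 component activation counts. It is a minimal Python example; the mnemonics and the component assignments are simplified assumptions, not the full hierarchical template used in our experiments.

# Hypothetical Layer-1 template: instruction mnemonic -> activated components.
LAYER1_MAP = {
    "l.add":  ["ALU", "Control"],
    "l.mac":  ["Multiply-and-Accumulate", "Control"],
    "l.sll":  ["Shift-Rotate", "Control"],
    "l.bf":   ["Branch", "Control"],
    "l.lwz":  ["Load-Store", "Control"],
    "l.sfeq": ["Compare", "Control"],
}

def instruction_to_component_counts(inst_counts, mapping=LAYER1_MAP):
    # Accumulate one component activation per instruction execution.
    comp_counts = {}
    for inst, count in inst_counts.items():
        for comp in mapping.get(inst, []):
            comp_counts[comp] = comp_counts.get(comp, 0) + count
    return comp_counts

# Example: 10 adds and 5 loads activate the control unit 15 times.
print(instruction_to_component_counts({"l.add": 10, "l.lwz": 5}))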
7.2.6.2 Experiments and Results
We explore physical-level power by measuring the EPC of ORPSoC. We hierarchically extracted a mapping down to 3 levels of components and used this mapping for partitioning power. We repeated all the experiments used to validate EPI and compared the EPC results. The EPC method was able to map the 80 instructions into 55 components and provide results with a maximum deviation of 1.8% from the ADC ground truth, as shown in Figure 29. This validates that our method indeed measures physical-level sub-component energy.
Figure 29: Result on EPC measurements
Table 1 compares some of the salient features of our technique with the prior art. The results suggest that our online technique improves over the prior art with accurate, scalable power measurements. Our technique also accounts for process variations, since the weights for each chip are calibrated independently, unlike the other methods.
7.2.6.3 EPC Result Analysis
The results suggest that the total power measured using the EPC technique is more accurate than that of the EPI technique. Comparing the two, the EPI technique averages the utilization of the various components, whereas EPC accounts for each individual component activation.
To illustrate this effect, consider the execution of 10 instances of instruction I1, 5 of instruction I2 and 15 of instruction I3 by the OpenRISC processor, as shown in Figure 30. EPI assumes an equal contribution from all the architectural components during the execution of I1, I2 and I3. This can be seen as assigning an equal activity of 10 to every component in the processor for I1; similarly, an activity of 5 is assigned to every component for I2. The EPC mapping, on the other hand, records an activity contribution of 5 by component B during I1 and a contribution of 10 during I2. This difference in the activation counts untangles the averaging effect of EPI and produces more accurate results.
Figure 30: Comparison of EPI and EPC results
For example, we observed that the weights for the conditional branch and add instructions as extracted by EPI differ by only 1%, whereas with EPC each of these activations is treated as an independent component and a weight is extracted for each; the energy per component calculated for these components differs by 4%. The overall error of EPC is less than 2%, whereas it is close to 2.4% for EPI. This validates our hypothesis about the averaging effect of EPI.
7.3 Discussion of the Related Work
Prior works lay the foundation for our current work. They posit that accurate application-level energy feedback for software is essential to efficiently manage the limited energy stored in batteries [46]. Several efforts have been made in this area of research, which can be broadly classified as follows:
7.3.1 FPGA energy measurement techniques
The FPGA manufacturers provide simulation-based energy estimation techniques that use a pre-evaluated library of energy utilization for various devices. This information is used, along with a lookup table and the component usage of the user design on the target FPGA, to estimate energy [47,48]. Some FPGAs [65] also feature dedicated current-sense resistors on the voltage rails of the chip that are accessible by the on-board DAQ to measure the run-time energy consumption of the chip.
7.3.2 Real time estimation techniques
In this estimation technique, an initial offline capacitive energy model is built based on total
energy measurements and event sets [67,68,69,70]. An energy model is built associating the total current drawn by the platform with performance counter values that indicate various events on the processor. Further, Iyer et al [66] build a system that updates this energy model on-chip before its
first use by the energy management unit.
7.3.3 Offline estimation techniques.
In this technique, direct current measurements or values from simulations are used to derive the average energy dissipated per instruction [71,72,73,74]. The energy per instruction is modelled using offline regression analysis based on the collected energy data, access rates, execution times and instruction counts. These values are then used to estimate the energy consumed by a program at runtime. The techniques other than the DAQ-based measurement technique described above do not provide energy estimates that account for manufacturing variations; the DAQ-based technique, on the other hand, does not provide fine-grained energy usage for large SoC designs. The related works for EPI measurements mentioned in this section share similarities with our initial work in their use of total energy measurements along with instruction data and its intended use [15, 17]. However, our online regression algorithm and the use of digital counters fundamentally differ from any prior work. Our usage and update of the active and idle energy components differ from the other techniques, and we recalibrate these values frequently using an integrated external measurement scheme. Furthermore, the fact that the measurement energy can be isolated from the system energy consumption fundamentally differentiates our work from the other techniques.
Chapter 8 : Automated Extraction of Energy per
Component using Online Regression
To enable an application to effectively manage power, it can change the way it executes based on the energy consumed per instruction by each architectural component. This ensures that the application never overshoots the power budget, and hence the thermal budget, of the chip. For example, the software might issue instructions to use the CPU for floating point multiplication rather than the dedicated FPU. This provides a platform for software to manage the power consumed by the hardware without the need for elaborate power models or detailed information on the microarchitecture of the chip.
A software or hardware implementation of such a fine-grained power measurement system requires a detailed analysis of the micro-architecture to extract a thermal profile of each component. In modern processors, performing such an analysis might not be feasible due to the number and complexity of the components. Also, the micro-architecture of the processor might not be available for analysis, given a black-box implementation of the desired instruction set. Traditional dedicated-sensor methods are impractical due to the number of sensors required, and the internal circuitry of such black boxes might not be available for modification. Furthermore, some of the current techniques that extract a power model offline might be erroneous, as they do not capture the correlation between several instructions and components during execution. Therefore, we present an online technique that initially discovers a minimal set of components to achieve a desired accuracy and then uses these estimated components along with their power weights to perform subsequent power estimation. The component discovery can be done once, or performed frequently at the system level so as to account for the power consumed by plug-and-play peripherals.
Discovering the components within a digital system, along with their interoperability with other components, by explicit enumeration can be a hard process. Hence, we provide a technique that implicitly enumerates the various combinations to obtain a minimal set of components for power monitoring. In order to implement such a system, we need an accurate instruction profile of the software running on the system. We then extract the component usage per instruction and associate it with the total power consumed by the system. The online linear regression technique then extracts the contribution of each of these components towards the total power. We then perform an analysis over these results to extract the number of components for the given accuracy. The overhead of implementing this system can be seen as measurement and calibration overhead. The measurement overhead consists of the system required to extract the instruction profile. This overhead can be seen as a trade-off between the number of instances of software instrumentation code and an actual hardware instruction profiler with its data-collecting system. The calibration overhead is an algorithm that performs arithmetic multiplications and additions over the instruction count values; it can be implemented either as stand-alone hardware or as a software kernel thread in the operating system.
8.1 Automated Extraction of Energy per component using online regression
An accurate, fine grained and low latency power measurement system can enable efficient
operation of the on-chip power management algorithm. The on-chip power management schemes
use the measured values/estimates from the measurement system to make decisions on dynamic
voltage and frequency scaling (DVFS) for various components on-chip. The management
algorithm also has a decision guard band that helps reduce the chances of accidentally turning systems on or off, thereby reducing the impact on the system's performance. In modern chips, DVFS incurs several hundreds of clock cycles, resulting in processor stalls while each component becomes ready before it is used. However, the gains obtained with smaller process technologies diminish with such pessimistic decision guard bands. Also, due to manufacturing variations, current techniques tend to be erroneous across different chips and over long usage cycles. This leads to sub-optimal operation of the power management algorithms.
To improve the accuracy of state-of-the-art measurement techniques, each chip must be independently characterized and the estimation models must be updated accordingly. Since it is impractical to perform such an operation during large-scale manufacturing, we built an accurate, in-situ, low-latency, fine-grained measurement system (for static and dynamic power) that features a fast self-calibration scheme.
In this section, we build on our online regression technique to derive an accurate energy per-component measurement system that requires only an external measurement scheme and an instruction set architecture, or distinguishable signal activities within the system of interest. We proceed by extracting weights per component for a list of components that we assume initially, and then prune it based on the weights to find the underlying components. Figure 31 shows the flow of our method.
Initially, we assume:
1. For every instruction there exists 1 unique component and (n − 1) shared components, where n is the number of instructions in the ISA.
Definition: An instruction is defined as a unique command to a processor, or any distinguishable signal or set of signals within a given logic, that performs an activity within the logic and dissipates power/energy.
For example, if an ISA has 80 instructions, then for each instruction we initially assume 1
unique component and 79 shared components.
2. Every pair of instructions (j, k) shares a component C_{jk}; i.e., whenever instruction j or instruction k executes, the activity count of C_{jk} is incremented, for all j, k in the ISA.
We consider that the shared component C is activated when either instruction j or k is executed
irrespective of the order of execution.
These two assumptions give us an initial matrix with n columns that represent unique components and n(n−1)/2 columns that represent dependencies. We then consider n + n(n−1)/2 activity columns per sample and extract weights for each of the components over multiple iterations with several samples. Then, we compare the weights across all the dependent components and merge or delete columns based on the weights. If the weights of two or more columns are the same, we merge those columns into a single column. If the weight of a column is zero, we remove that column, as it does not contribute to the power for the given program. We then observe weights that do not converge and separate each of those into two columns: one with the earlier counting scheme and the other with a static value to absorb the variations. We repeat these steps until we obtain converged weights and can no longer merge or bifurcate for the given accuracy. Merging dependent components can be seen as implicitly enumerating the system for probable interdependencies of components between instructions. Removing columns with zero weights can be thought of as pruning redundant components from the set. Lastly, segregating a non-convergent weight into multiple columns can be explained as a way of discovering components, or of separating a component and its inter-relation with multiple instructions during program execution.
8.2 Predict and update number of components
Our initial iteration of component discovery for an ISA with n instructions assumes n(n−1)/2 dependent components. This implies that for each instruction in the current 80-instruction ISA, there is 1 independent component and 79 dependent components on the processor. Each instruction shares at least one component with every other instruction, giving 79 shared components per instruction and 3160 distinct shared components in total.
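For illustration, the sketch below builds one row of this initial activity matrix: n unique-component entries followed by n(n−1)/2 shared-component entries, one per instruction pair. It assumes, per assumption 2, that a shared column accumulates the executions of either instruction in its pair; the counting rule and the function name are illustrative only.

from itertools import combinations

def build_initial_activity_row(inst_counts):
    # inst_counts: per-instruction execution counts for one sample interval (length n)
    n = len(inst_counts)
    row = list(inst_counts)                     # n unique-component columns
    for j, k in combinations(range(n), 2):      # n*(n-1)/2 shared-component columns
        row.append(inst_counts[j] + inst_counts[k])
    return row

# Example with n = 3: the row has 3 unique and 3 shared entries.
print(build_initial_activity_row([10, 5, 15]))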
We assume that each of these components contributes towards the dynamic and static power consumption of the processor. We apply our interpretation of estimating energy from [] to our current work to estimate power from an estimated set of unique and dependent components. More precisely, we first partition the summation of Equation (20) into (n + m + p) summations for a processor P with “n” instructions and “m” peripherals, along with their static power components P_s. We represent the residue R as the deviation between the externally measured power (ground truth) and the values from our technique.
Figure 31: Flow of automated EPC

P_total = \sum_{i=1}^{n} a_i w_i + \sum_{j=1}^{n(n-1)/2} a_j w_j + \sum_{l=1}^{p} a_l w_l + P_s + R    (31)

This approximation produces an equation of the same form as our work in [static power patent].
Therefore, the same online regression algorithm can be applied to estimate the power at each sub-
circuit.
Next, we extract weights represented by “w” in (31) using the techniques described in section
7.2. Comparing these weights we preserve, combine or eliminate the terms in (31).
That is, for every pair of weights w_l, w_q in the extracted set:
if w_l ≠ w_q, we preserve the elements corresponding to w_l and w_q;
if w_l = w_q or w_l ≈ w_q, we combine the elements corresponding to w_l and w_q, where the operator “≈” means that w_l and w_q are within a threshold deviation of each other;
if w_l = 0 or w_q = 0 (or w_l ≈ 0 or w_q ≈ 0), we eliminate the element corresponding to w_l and/or w_q.
Let us consider that we have “n” unique components and that “t” components remain from the n(n−1)/2 dependent components, either preserved or combined. Then (31) can be re-written as,

P_total = \sum_{i=1}^{n} a_i w_i + \sum_{j=1}^{t} a_j w'_j + \sum_{l=1}^{p} a_l w_l + P'_s + R    (32)

where w' and P'_s denote the latest set of values for the reduced component set.
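A minimal sketch of one merge-and-prune pass over the extracted weights is given below, in Python with NumPy. The tolerance used for “≈”, the column layout and the function name are assumptions made for illustration; the full flow additionally splits non-convergent weights into two columns, which is omitted here.

import numpy as np

def refine_components(A, weights, tol=0.02):
    # A:       (samples, cols) activity matrix (unique followed by dependent columns)
    # weights: (cols,) converged weights for the current column set
    # tol:     relative threshold used to treat two weights as equal
    keep, merged_into = [], {}
    for c in range(A.shape[1]):
        if abs(weights[c]) <= tol:                 # zero weight: prune the column
            continue
        match = next((k for k in keep
                      if abs(weights[k] - weights[c]) <= tol * abs(weights[k])), None)
        if match is None:
            keep.append(c)                         # distinct weight: keep as its own column
        else:
            merged_into[c] = match                 # equal weight: merge into an earlier column
    A_new = A[:, keep].astype(float)
    for c, k in merged_into.items():
        A_new[:, keep.index(k)] += A[:, c]         # merged column accumulates both activities
    return A_new, keep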
We have conducted a limited number of experiments to evaluate the feasibility and accuracy
of this method. The results of these experiments show that this approach provides highly accurate
estimates.
8.3 Results
We implement our method on the Digilent Atlys board, an FPGA platform that is programmed with the OpenRISC SoC (ORPSoC) design. ORPSoC consists of an OpenRISC 1200 embedded processor along with several peripheral components such as Ethernet, a Direct Memory Access controller, Universal Serial Bus, Video Graphics Array and Double Data Rate memory.
Figure 32: AEPC results
It runs a
Linux operating system and presents us with a system that emulates realistic behavior of processors
and SoC. The ORPSoC design is open source, thus all hardware and software components design
details are accessible to users and open to modification. The processor implemented in this system
supports 80 instructions as a part of its ISA.
We begin by assuming the processor has 80 unique and 3160 dependent components. The peripheral counters are accounted for based on the instruction-based activation described in chapter 7. We run the Dhrystone benchmark under varying temperature and extract weights. The temperature variations were induced by applying a heat gun to our ORPSoC platform on the Digilent Atlys board. Several weights converged to zero and some weights converged to the same value. We eliminate the elements with zero weights and combine the elements with equal weights. This process reduced the number of elements from 3240 to 1552 columns. We repeat the process and extract the number of components for a desired accuracy of 99.2%. In the pre-final step, we obtained 55 components, for which we compared the weights obtained using EPC. We found that the component weight of the ALU using the EPC method was approximately equal to the sum of two functional weights and a logical component in our AEPC method. We identified these components based on their occurrence in the overlap of instructions across these components. We tried combining dissimilar components but observed that the weights did not converge to a fixed value.
Figure 33: Comparison of EPI, EPC and AEPC results
Chapter 9 : Accelerating Physical Level Sub-
Component Power Simulation by Online Power
Partitioning
Accurate power simulation of physical-level subcomponents in modern digital circuits is complicated by memory requirements and simulation time. The research described here is a novel method to accelerate power simulation at fine granularities with lower memory usage and shorter simulation times for a given platform. Circuit designs built under strict power constraints can then be iterated at design time with highly accurate power values. This in turn provides greater opportunity to accurately budget power across the chip while designing circuits with billions of transistors.
Currently, designers evaluate circuits and sub-circuits for functional correctness and timing during the design phase. However, accurate design-time or simulation-time power evaluation is often not performed, due to the extremely long simulation times incurred by cycle-accurate simulators. As an alternative, FPGA platforms can be used to quickly estimate power usage during design and simulation; however, they are constrained by area, granularity and the number of gates that can be simulated. To overcome such problems, researchers have suggested using power-model-based techniques that abstract the various underlying circuit components to extract power estimates. The results obtained using these modeling techniques can be off by up to 100% [76,77].
This paper presents a method to accelerate power simulation of a given cycle-accurate
simulator to provide design time power estimates at logic simulation speed. The overhead incurred
is a short calibration phase using accurate technology-specific power simulation. Fast and accurate
power estimates for sub-circuits from our method also feed forward into the place-and-route and floor-planning stages of circuit design. With accurate power estimates, sub-circuits with higher power dissipation can be placed away from each other, which in turn yields a more uniform spatial temperature profile of the circuit and thereby provides additional thermal operating margin. The three main contributions of this paper are as follows:
1. Accelerating commercial SPICE for power simulation with a faster and leaner methodology
2. A system with a tunable tradeoff between accuracy, speed and memory
3. Improving accuracy and performance of power simulators using less accurate results.
The three contributions of our work are described in sections 9.2, 9.7 and 9.8. Discussion of the related work is presented in section 9.1.
9.1 Related Work
Several efforts have been made at developing fast simulation techniques to provide quick
feedback to designers. The prior art can be classified based on the granularity of simulation, as
follows:
1. Architectural component-level power simulation.
One of the initial contributions to the area of accelerated simulation is by Landman et al [78], who build a capacitance model for each architectural component and use it for all subsequent power estimations. Several works build on this to derive accelerated, model-based simulation techniques for circuits and systems [79, 80].
2. Gate Level power simulation
Some of the other efforts focus on achieving gate-level simulation speedup by exploiting concurrency in computation and distributing the simulation amongst several cores [81]. Kapre [83] and Chiou [84] expand on this work to achieve SPICE simulation speedup using FPGAs. On the other hand, efforts have been made to use analog simulation to build a database of the power usage of the library cells for all subsequent simulations [85,86,87].
The architectural-level and gate-level power simulation approaches described above require building models for every technology node and every device that may be used. Our method instead extracts weights online, at the beginning of the experiment, for any given technology-based device model. We then perform all subsequent simulations using these weights and the observed transitions. This provides a means to extend a given cycle-accurate simulator with our logic-simulation-based power simulator to simulate power accurately with low resource overhead. We obtain accurate simulation results at logic simulation speeds after the short calibration phase.
9.2 Enabling Faster and Leaner Simulation
The aim of our current work is to accelerate power simulation of physical-level subcomponents
at fine granularities at low resource overheads without losing accuracy. To achieve such a system,
we learn values from the base simulator and use the weights to perform subsequent power
simulation at logic simulation speeds. Our approach builds on power-modeling research that showed that the signal activity within a logic block is proportional to the activity at the inputs of the circuit for various commonly used benchmark designs [93]. We apply our interpretation of this result to our problem formulation to approximate the total activity of a logically dependent set of wires with signal transition counts from a much smaller set of wires. Further, we extend our online regression framework to learn power weights online using this smaller set of wires in the system. We then use part of the work of Wu et al [89] to design our learning method, along with its training procedure, to derive an accurate simulation system. Their work shows that peak dynamic power estimation is the critical factor for accurate fine-grained power analysis in digital logic. They also observe that for peak power dissipation, the static power is a smaller component than the dynamic power and can be averaged during power estimation using a moving-window solver without losing accuracy. We therefore observe activities and correlate them to the dynamic power of the digital logic, and project the average static power onto a constant component. Furthermore, prior work shows that glitching or hazardous switching causes a measurable power dissipation [96]. Analysis of activities and gate-level simulation results shows that the power weights derived by our method account for such switching, as they are proportional to the signal transitions at the inputs of the logic cuts. These approximations are found to be valid, as they do not cause significant error; this is confirmed by 65nm and 180nm simulation results for various different types of logic at 99.999% accuracy, as presented in sections 9.2, 9.5 and 9.7. Our current system does not consider temperature variations and assumes that details of the layout and gate-level implementation of the design are available for analysis.
9.3 Fundamental Power Estimation Methodology
We begin by considering “C”, a module under consideration (MUC) that is combinational with “n” wires and “m” gates. Let a_i represent the voltage transition count on the i-th wire of the circuit. Let us define the ground truth to be an external power measurement scheme that measures the total power drawn by the MUC. Given that each of the “n” wires drives a capacitive load C_i at supply voltage V, and that the j-th gate draws a leakage current I_{leak,j}, the ground-truth total power can be expressed as,

P_gt = \sum_{i=1}^{n} a_i C_i V^2 + \sum_{j=1}^{m} I_{leak,j} V    (33)

Following the work of Chang et al, we approximate the static power in (33) as a multiple of a weighted constant, i.e. k P_s, where P_s = \sum_{j=1}^{m} I_{leak,j} V and “k” is a constant coefficient used to project P_s. Hence, the total power is estimated as,

P_est = \sum_{i=1}^{n} W_i a_i + k P_s    (34)

where W_i = C_i V^2. We derive W_i and k empirically for a given subset of vectors such that |P_gt − P_est| < ε, with ε → 0. Furthermore, the total power can be written as:

P = (\sum_{i=1}^{n} W_i a_i) + (1 · P_s) + R    (35)

where the term R is the residue representing the modelling error; based on empirical results, we set k = 1. We measure the residue as,
R = |P_gt − P_est|    (36)

Equation (35) can be solved for the dynamic and static power using linear regression by the method of auto-regression. However, solving equation (35) involves complex matrix computations to extract the power weights, and the cost grows with the number of wires monitored. Such a system becomes impractical for larger circuits when resources such as simulation time and memory are considered. We therefore derive our current online method from the estimation properties used in our earlier effort to make the system online. We use the matrix inversion lemma from Mendel and the widely used technique of Handschin and Schweppe to derive an online method that converges by suppressing bad data. From the fundamentals of linear estimation with least squares, equation (35) can be written as:

W_t = (A^T A)^{-1} A^T P_t    (37)
where A = {a_{i,t}} is the activity matrix built by collecting all the transition counts, arranged temporally over “t” samples, and a_t are the transition counts extracted at time instant t. Extracting the weight vector set W in this form requires a full matrix inversion. Equation (37) assumes that the A, W and P matrices are of order t × n, n × m and t × m respectively. However, we know that for our system m = 1. As derived in the matrix inversion lemma, the matrix inverse operation in the iterative form of (37) is then reduced to an arithmetic division. Furthermore, we use a part of the residue in every successive iteration during estimation, following the technique of Handschin and Schweppe. They show that using partial residual values produces faster weight convergence by suppressing noise, i.e. |W_{t+1} − W_t| < ε with ε → 0 for a minimum number of samples t. This property is required by our system to use a minimum number of calibration cycles and thereby reduce the number of simulations required from the base cycle-accurate simulator. Substituting these results into (37), we get:
Power Simulator | Accuracy (%) | Number of Vectors | Total Run-time Memory (MB) | Total Run-time (min) | Speedup Factor
Comm. SPICE     | 100.000      | 104,200           | 14000                      | 5172.00              | 1.0x
Accel. SPICE    | 98.000       | 104,200           | 4000                       | 731.45               | 7.1x
Our method (A)  | 98.000       | 104,200           | 27                         | 48.12                | 107.5x
Our method (B)  | 99.710       | 104,200           | 33                         | 60.46                | 85.5x
Our method (C)  | 99.999       | 104,200           | 49                         | 151.54               | 34.1x
Accel. SPICE    | 98.000       | 1,042,000         | 12,000                     | 7440.15              | 1.0x
Our method      | 99.999       | 1,042,000         | 520                        | 342.15               | 21.7x
Table 3: Comparison of results of microcontroller simulations for varying accuracy for Comm. SPICE, Accl. SPICE and our method
P_{t+1} = \sum_{i=1}^{n} a_{i,t+1} W_{i,t} + R    (38)

where P_{t+1} is the additional total power measurement made at time “t + 1”, a_{i,t+1} are the updated counter values and W_{i,t} are the weights. For every additional measurement made, we
iteratively update the weights and multiply the transition weights by their corresponding count values to obtain the power values. The arithmetic operations are implemented as a dedicated background software thread for calibration. Furthermore, device characteristics and switching vary due to process variations, and hence the power weights are expected to change for each simulation run. The algorithm adapts to the new power values and circuits so that the weights are the best fit for the specific conditions.
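The sketch below shows a standard recursive least-squares update of this form in Python with NumPy. Because m = 1, each update divides only by the scalar (1 + aᵀPa), so no full matrix inversion is needed. The class name, the initial covariance and the seeding interface are illustrative assumptions; the actual solver additionally feeds back partial residues in the style of Handschin and Schweppe, which is not reproduced here.

import numpy as np

class OnlinePowerWeights:
    def __init__(self, n_wires, seed_weights=None):
        # seed_weights correspond to the precomputed seed values of section 9.8
        self.w = np.zeros(n_wires) if seed_weights is None else np.asarray(seed_weights, float)
        self.P = np.eye(n_wires) * 1e6              # large initial covariance

    def update(self, counts, measured_power):
        a = np.asarray(counts, float)               # transition counts for this sample
        residue = measured_power - a @ self.w       # deviation from the current estimate
        gain = self.P @ a / (1.0 + a @ self.P @ a)  # scalar division instead of matrix inverse
        self.w = self.w + gain * residue
        self.P = self.P - np.outer(gain, a @ self.P)
        return self.w

    def estimate(self, counts):
        return float(np.asarray(counts, float) @ self.w)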
9.4 Training vector set generation
We generate training patterns for data collection using the technique of Chang et al [90]. They provide a technique to generate a test vector set for a given digital logic that causes maximum capacitance switching by considering each logic cut as fan-out-free logic. We generate test vectors using their
technique and observe that an initial value that represents the switching power can be used in our
method. We use these initial or seed values during training to reduce the number of vectors to be
run since the seed covers many of these use cases. The experimentation in the current section (i.e.
Section 9.4) and section 9.7 use zero initial seed to perform worst case analysis on convergence
time. Section 9.7 provides details on the use of seed value for training. We test the effectiveness
of training during actual simulation and adapt to anomalies in execution online.
9.5 Instrumenting Logic to Observe Activity
In practice, we annotate the logic simulation of a given circuit to collect signal transition counts from the monitored wires. Once the logic simulator is run, the transition counts are collected and multiplied by the weights to extract the power associated with each transition of the component. We then take the sum of these values to arrive at the total component, sub-circuit and full-circuit power.
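For example, once the weights are available, combining them with the transition counts reported by the annotated logic simulation reduces to a weighted sum, as in the sketch below; the wire names and numeric values are purely illustrative.

def power_from_transitions(transition_counts, weights, static_power):
    # transition_counts: monitored wire name -> toggle count from the logic simulation
    # weights:           monitored wire name -> learned power weight
    # static_power:      the constant (averaged) static term P_s
    dynamic = sum(weights[w] * n for w, n in transition_counts.items())
    return dynamic + static_power

counts = {"alu_out_0": 1200, "alu_out_1": 830}      # hypothetical wires and counts
w = {"alu_out_0": 2.1e-6, "alu_out_1": 1.9e-6}      # hypothetical learned weights
print(power_from_transitions(counts, w, static_power=0.004))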
9.6 Experiments for analysis of faster and leaner simulation
For our initial experiments, the flow of our work is as shown in Figure 34. The first step uses the ABC synthesis tool with our modifications to identify the various wires to be instrumented, and the second step uses a model-based simulation over the training vector set. Next, our tool extracts the weights for power simulation, and in the last step, logic simulations from Verilator [92] are used along with the power weights to produce the power simulation results.
The commercial version of a SPICE-based simulator (Comm. SPICE) is used as the reference platform for all our experimentation. We use the leading commercial accelerated SPICE simulator platform (Accl. SPICE) for the analysis and comparative study of our work with 65nm and 180nm technology libraries. Initially, the layout and RC network details are extracted from the synthesized gate-level netlist of the benchmark circuit with the standard gate library. The complete training vector sets are run for every iteration of monitoring points provided by our analysis tool. The resulting power and signal transition data, along with the simulation runtime and memory usage, are collected. We use the ISCAS '85 and '89 benchmarks for our experimentation, as they provide a platform to evaluate our system on the most common logic used in modern datapaths. We chose a SEC/DED Viterbi decoder (1988 gates) and a microcontroller (50000 gates) to evaluate the various simulator tradeoffs for smaller and larger circuits within the permissible limits of the available platform.
Figure 34: Flow of our method
We use
the test vector suite available with these circuits to
test our system. We also add synthetic test vectors
into these basic vector sets to evaluate our simulation
time and speedup claims.
9.6.1 Results on faster and leaner
simulation
For this case, we perform experiments to obtain
maximum accuracy in power values and evaluate the tradeoff between the size of the circuit and
its simulation time, given a test input vector set. As can be seen from Table 3, the reference simulator used 14 GB of memory and 5172.00 minutes to complete the simulation of the microcontroller with 104,200 vectors. The commercial accelerated SPICE simulator, on the other hand, used 4 GB and 731.45 minutes with a 2% deviation from the reference results, whereas our method used 49 MB and 151.54 minutes with a 0.001% deviation from the reference. The calibration phase took 9.8 minutes with reference data.
9.7 Enabling Accuracy and Resource Trade-off
Based on the empirical data, we hypothesize that area versus simulation time and accuracy versus overhead are the two most important tradeoffs in achieving maximum accuracy during simulation. We test this hypothesis against the reference simulator results and compare them with our work.
9.7.1 Summarizing Sub-circuit Activity
Intuitively, monitoring each CMOS gate in the logic would provide us the maximum accuracy
for estimating power. However, such a system is impractical due to memory and computation time
requirements. It was observed that larger circuits incur greater memory usage and simulation time.
Figure 35: Error vs overhead for microcontroller for 65nm and 180nm simulations
We choose a subset of wires, rather than every wire in the circuit, for transition counting to minimize the memory and time requirements and improve simulation speed at a target accuracy. We extend the results of previous efforts which show that the activity at the inputs of a logic block is proportional to the power dissipated by that logic [93,93]. We identify various cuts within the logic and monitor the inputs of these logic cuts to derive simulation results at the target accuracy. To identify the cuts, we use the technique of Liu et al [95] to partition the directed acyclic network flow graph derived for the logic to be simulated. We use activity as the weight of the edges of the network and partition it using their algorithm. They show that their method derives non-overlapping cuts for any directed acyclic network flow graph. We use this property to iteratively find different-sized, non-overlapping cuts that summarize circuit activity without any correlation of activities between the cuts.
Let us consider that the given digital logic is a function “f” of literals X = {x_1, ..., x_n} and Y = {y_1, ..., y_m}. We view f as a combination of “z” sub-functions or logic cuts F_i in the circuit, i.e.

f(X, Y) = \bigcup_{i=1}^{z} F_i    (39)

such that |P_f − \sum_{i=1}^{z} P_{F_i}| < ε and Area(F_min) < Area(F_i) < Area(F_max).
To evaluate the error versus the overhead of adding monitoring points, the smallest “F” is considered to be a single CMOS gate. Initially, we identify flip-flops as pseudo primary inputs (for the flip-flops driving the fan-in) and pseudo primary outputs (for the flip-flops at the fan-out of the logic). The signal activity for a network flow graph is extracted using its binary decision diagram and the input vector set. We then use the technique of Liu et al to build a network flow graph and partition it based on the area and signal activity constraints. Starting from the smallest cut, we iteratively accumulate cuts until the desired accuracy has been achieved. We identify the various cuts in the logic and assign their inputs to be monitored. We also ensure that there is no duplication of these signals while still considering those wires during calculations. In each step, we aggregate the cuts with their immediate fan-in cuts, starting from the logic's output towards its input, and compare the accuracy achieved. During the aggregation, we eliminate the monitoring of the input wires of the fan-out cut and monitor the inputs of its fan-in cuts; this way, the fan-in cuts along with the cut of interest can be considered as one larger cut. We stop the aggregation if |P_f − \sum_{i} P_{F_i}| > ε. We perform this operation to identify a smaller set of wires that summarizes the input activity. We assume the area of the logic is proportional to the amount of activity within it. Therefore, the constraint in (39) is used to find the minimal set of wires for summarizing the signal activity within the logic to achieve the desired simulation accuracy.
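A loose sketch of this aggregation loop is shown below. The data structures, the error callback and the function name are assumptions used to illustrate the idea; the actual flow operates on the partitioned network flow graph rather than on opaque cut identifiers.

def select_monitoring_cuts(cuts, fanin, estimate_error, eps):
    # cuts:           cut identifiers ordered from the logic's outputs towards its inputs
    # fanin:          cut -> list of its immediate fan-in cuts
    # estimate_error: returns the power error when only the inputs of the given cuts are monitored
    # eps:            target error bound from equation (39)
    monitored = set(cuts)                       # start by monitoring every cut's inputs
    for cut in cuts:
        # Merge the cut with its fan-ins: drop its own input monitors and
        # monitor the inputs of the fan-in cuts instead.
        trial = (monitored - {cut}) | set(fanin.get(cut, []))
        if estimate_error(trial) <= eps:
            monitored = trial                   # aggregation stays within the error bound
    return monitored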
9.7.2 Analysis of accuracy vs resource tradeoffs
Accurate circuit simulations using a highly parametrized gate library have higher memory requirements, as they involve extensive matrix operations [83]. In our work, we collapse the entire model into a small set of parameters by learning their effect on the peak power. For each circuit, we calibrate our models before performing simulations so that the model collapsing does not compromise the accuracy of the simulation results. Further, the model collapsing and calibration are tunable, yielding a user-definable system based on the accuracy and resource requirements. To validate this claim, we perform the analysis with circuits of different sizes at different technology nodes and report the results.
9.7.3 Experiments for analysis of accuracy vs resource tradeoffs
We consider the leading commercial simulator to perform analysis. We use the RC extracted
netlist of the benchmarks to simulate power for all the test vector sets for each set of monitoring
points. The details of layout including RC netlist of all the circuits are extracted using the standard
layout library cells available for the given technology. Using the RC netlist, we calculate the load
capacitance of the switching unit at the points of observation. Next, we extract seed values based on equation (38) for use in the simulator. We then record the simulation time, memory usage and the simulation data for further analysis and comparison with our technique.
9.7.4 Results on accuracy vs resource tradeoffs
We consider the worst-case accuracy and memory utilization to evaluate the tradeoffs of our method. We therefore use the microcontroller benchmark, the largest circuit currently at hand, to perform this analysis. Figure 35 shows the maximum error at the minimum number of counters instrumented, at 0.001%, for both 65nm and 180nm for the microcontroller
Accuracy (%) | Raw conv.: Comm. SPICE (Samp. / Time, mins) | Raw conv.: Accl. Sim (Samp. / Time, mins) | Seeded conv.: Comm. SPICE (Samp. / Time, mins) | Seeded conv.: Accl. Sim (Samp. / Time, mins) | Speedup using Accl. Sim for training (Raw) | (Seeded)
99.999 | 19055 / 953.50 | 19070 / 121.19 | 66 / 3.12 | 102 / 0.59 | 7.88x | 5.29x
99.7   | 4120 / 210.00  | 4157 / 30.35   | 63 / 3.10 | 89 / 0.53  | 6.92x | 5.85x
98     | 2230 / 122.00  | 2283 / 18.01   | 55 / 3.02 | 71 / 0.47  | 6.77x | 6.43x
97     | 1775 / 69.00   | 1810 / 14.51   | 43 / 1.56 | 57 / 0.38  | 4.76x | 4.10x
Table 4: Results on rate of convergence with Comm. SPICE, Accl. SPICE and our method
benchmark. It can also be observed that the shape of the curves from simulation is similar for both technologies. The 0.1% point indicates the instrumentation of control signals only; at that point our method produces an error of about 60% for both the 65nm and 180nm technologies, which can be attributed to inaccurate summarization of activity. As can be seen in Table 3, for 98% accuracy our system achieves 148x lower memory utilization than the fast simulator. There is also an increase in speed when the desired accuracy is reduced, which can be attributed to the reduction in the amount of data processed due to the fewer observation points used by our system.
9.8 Calibration Speedup Enhancements
Accurate circuit simulations incur higher memory overhead and run slower, since each parameter of the device must be evaluated to extract the results. We use a subset of these simulation runs to learn the weights that represent these parameters and use a logic simulator to perform all subsequent simulations. For long simulation runs, the calibration phase is much shorter than the complete simulation, and therefore we achieve a speedup in total runtime. However, the calibration overhead might exceed the actual simulation cycles for short vector sets. It is therefore desirable to pursue alternative approaches that reduce the calibration overhead to increase the speed and utility of such simulation accelerators.
9.8.1 Experiments on calibration enhancements
In order to reduce the number of samples, we precompute probable weights using the RC network of the layout of the logic, the nominal voltage and the target frequency values. These precomputed weights are used as seed values for the solver during the training phase. We first analyze convergence by setting the initial weights of the estimation to zero and running the solver over various sample sets until the weights converge (raw convergence). Next, we use a seed value and run the solver from this value until the weights converge. We compare and evaluate the overhead of the zero-initial-weight and non-zero-seed cases to analyze the tradeoff between the training set and the total run results.
9.8.2 Results on calibration enhancements
Table 4 summarizes our results on calibration enhancements; it indicates that fewer samples are required as the desired accuracy is decreased. It also suggests that higher-quality power estimation leads to faster convergence compared to faster but less accurate estimates. The results show that the weights converged to exactly the same values for both the reference and the fast simulator, for both the raw and seeded methods. Also, the seeded extraction of weights with the accelerated simulator is faster than with the accurate simulator even though the number of iterations is higher. Furthermore, the modified vectors run the entire test suite for the microcontroller 10 times in a randomized order to avoid any caching of results and thereby provide results closer to running full-length random vectors. As shown in Table 3, our method runs 21.7x faster than the accelerated SPICE platform. The corresponding simulation on the commercial SPICE simulator did not converge, and hence we could not compare against that reference data.
9.9 Conclusion
This paper presents a method to simulate physical level component and sub-component power
accurately and efficiently at logic simulator speeds. By comparing values from the reference
commercial SPICE platform and values obtained from our method, it can be concluded that the
method is consistent and accurate. The results on memory usage and simulation runtime suggest
that our method indeed produces simulation results that are as accurate as reference yet leaner and
faster. This method was implemented on several ISCAS ‘85 and ‘89 benchmark designs. The latest
implementation of the technology shows that the maximum measurement difference of 0.001%
against reference design with 244x speedup was achieved. By comparing our measurement at
various granularities against the commercial SPICE simulator and the commercial accelerated
SPICE platforms, it has been verified that our technique provides a flexible accuracy versus
resource overhead platform.
The convergence analysis suggests that the method of using seed values along with a fast
simulator can further reduce the training overhead. This makes our system practical for simulating
smaller circuits. Hence, by observing signal activity rates and the training data set along with the online solver, the simulation of power consumed by ICs and their sub-components can be accelerated for any given platform.
Apart from the fundamental contributions of our work, the weights obtained from the accurate commercial SPICE platform and the accelerated SPICE platform are compared. The results suggest that our method rejects inaccuracies and produces the same weights as the accurate simulation platform even with inaccuracies in the training data set.
Chapter 10 : Future Work
Accurate Fine grained power monitoring technique
using instruction profile
Current compilers transform user applications into the target Instruction Set Architecture. These are then converted into binaries that are executed by the processor. A high-level flow of modern compilers is shown in Figure 36. Applications coded in higher-level languages such as C or C++ are compiled to a set of assembly-level instructions based on the ISA supported by the underlying processor. A set of these instructions can be identified as a block of code that is atomic in execution: as soon as the processor starts to execute the first instruction in that basic block, it will execute the entire block of instructions. These basic blocks are then translated into strings of 1's and 0's that are executed on the processor.
We leverage the current art and inject our analysis to enable power monitoring before the
translation of the instructions. In our work, we add a post-compile stage before converting the
assembly code to binary to enable power monitoring. The post-compile stage performs an analysis
on the compiled code and inserts instructions to monitor basic block execution. This stage
essentially provides the number of times the basic blocks ran. This information is then translated
into an instruction count and a component count in subsequent stages of power monitoring. The
instruction counts act as valuable resources for power estimation. The state-of-the-art software
assisted power monitoring uses these predicted values of instruction counts to estimate power and
assist the power management systems. These values tend to be erroneous because of interruptions: the estimated instruction count falls outside the range of the actual values. The error in the estimation stems from interrupt-handling sub-routines that add several instructions to the execution path.
To solve this problem, we modify the usual flow as shown in figures 37, 38. We perform an
analysis on the compiled code, which consists of a set of assembly-level instructions. We find all the basic blocks and add instructions to count the number of times they are executed. Using this, we translate the basic block counts into instruction counts. The resultant instruction counts are used to accurately extract the power of the various components of the system. This provides a fully software platform that can be used to accurately measure the power consumed by the underlying hardware on any hardware-software system.
Figure 36: Flow - User program compiled and executed
Figure 37: Modified flow for power monitoring
Figure 38: Updated flow with lower overhead
Figure 39: Sample flow to reduce overhead
Figure 40: (a, b, c) Example code flows
Examples of the code flow are shown in Figure 40. We use these examples to illustrate our measurement technique. Each block in these examples represents a basic code block generated by the compiler. Each basic code block is a set of assembly-level instructions that executes atomically. The high-level code is decomposed into several such blocks.
For full instrumentation, we add instructions to increment the counter associated with each type of basic block. The instructions are added either at the beginning or at the end of the basic block's execution. With full instrumentation, we estimate the overhead to be 30%. To reduce this overhead, we leverage graph covering techniques to identify unique blocks; an example flow is shown in Figure 40. As shown in Figure 39, we then instrument only the unique blocks instead of performing full instrumentation. Also, in cases such as Figure 40, we collapse subsequent blocks to create composite blocks. In example (c) of Figure 40, we collapse blocks 1 and 2, and blocks 1 and 3, to create composite blocks and count the number of times the composite blocks are executed. A sketch of such a post-compile instrumentation pass is given below.
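The sketch below outlines such a post-compile pass in Python: it scans the compiled assembly, detects basic-block boundaries from labels and branches, and emits a counter-increment marker at the start of each block. The regular expressions, the OpenRISC-style mnemonics and the BB_COUNT macro are simplifying assumptions; a real pass would reuse the compiler's own control flow graph.

import re

LABEL = re.compile(r"^\s*[.\w]+:")                        # assumed label syntax
BRANCH = re.compile(r"^\s*(l\.j|l\.jal|l\.bf|l\.bnf)\b")  # assumed branch mnemonics

def instrument_basic_blocks(asm_lines):
    out, block_id, new_block = [], 0, True
    for line in asm_lines:
        if not line.strip():                        # keep blank lines untouched
            out.append(line)
            continue
        if LABEL.match(line):
            new_block = True                        # a label starts a new basic block
        elif new_block:
            out.append(f"    BB_COUNT {block_id}")  # hypothetical counter-increment macro
            block_id += 1
            new_block = False
        out.append(line)
        if BRANCH.match(line):
            new_block = True                        # a branch ends the current block
    return out, block_id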
10.1 Initial Implementations and Results
The initial experimentation was performed on a Digilent Atlys board programmed with the ORPSoC design. We also implemented our scheme on a Freescale ARM platform and verified our claims.
10.1.1 Implementation and results on ORPSOC based system
We implement our method on the Digilent Atlys board, an FPGA platform that is programmed with the OpenRISC SoC (ORPSoC) design. ORPSoC consists of an OpenRISC 1200 embedded
processor along with several peripheral components such as Ethernet, Direct Memory Access
controller, Universal Serial Bus, Video Graphics Array and a Double Data rate memory. It runs a
Linux operating system and presents us with a system that emulates realistic behavior of processors
and SoC. The ORPSoC design is open source, thus all hardware and software components design
details are accessible to users and open to modification.
10.1.1.1 Fine grained instruction profiling using basic block map
The knowledge of instructions executed per unit time on a processor can be used to extract
accurate energy consumed per component per instruction. These values can be used at various
levels: At the hardware or gate level, it can be used to minimize hot spots on the chip without
altering the software components. At compiler level, it can be used to issue instructions to limit
power used by the system without elaborate hardware or software overhead. At a software level,
it can be used to build a generic platform to generate chip and environment specific execution
pattern to extract maximum performance per watt from the device. Accurate and low overhead
instruction profiling is therefore a key parameter to optimize power, heat and energy in OS based
systems.
The instruction or block count values are used by our power monitoring tool to extract accurate
estimates of power consumed by various hardware components when the user application
programs are run. These power values can then be used by process scheduler to distribute power
evenly across all parts of the chip. The uniform distribution of power implies uniform temperature
distribution across the chip that can in turn be used by cooling systems to work more efficiently.
Hence, we can enable software based temperature aware digital system that is custom to every
manufactured copy of the chip and also reduces the impact of process and packaging variations on
power and temperature.
Most software programs implement their functionality by using several operating-system built-in
functions along with the logic of the program itself. The code of these built-in functions or
procedures is either executed from a known location or copied into the program at the relevant
locations. The application binary interface (ABI) of an operating system governs how these
function calls in the user program are executed. It can also support position independent code
(PIC) for function calls, which allows subroutine code to execute independently of its actual
location in main memory. The ABI for OpenRISC 1000 does not define PIC in its Linux 2.6 port.
Figure 41: Modification to facilitate library function profiling
Figure 42: GPR to store base address for instruction profiling
PIC provides a means to use shared libraries such that the library code is executed from a location
that is not used by any other shared files. In the absence of such a feature, the code of each
procedure or subroutine call is linked statically into the program; such libraries are known as
static libraries. We add instructions in each routine of these static libraries to count the basic
blocks at run time.
Many static libraries used in OpenRISC are trimmed down from their full versions to suit the
embedded platform. OpenRISC uses uClibc, a lighter version of glibc intended for embedded and
real-time computing. It also uses BusyBox, a lightweight software bundle that supports utilities
such as nslookup, cat, vim, and mkdir. The C libraries provided by uClibc and BusyBox are
compiled separately into object files for subsequent linker operations. We first modify these
libraries to create archives (such as libc.a and libpthread.a) that contain the instruction profiling
code. We then update the object library with these modified archives, and the linker uses them in
the relevant locations while linking the user code. The execution of the resulting binaries provides
an accurate instruction profile at run time.
In OpenRISC, we require 3 additional instructions to count each executed instruction of the user
program, which implies a 3x overhead for counting every instruction. Because of this impractical
overhead, we choose to count sets of instructions in the assembled program rather than add code
to count every instruction separately. In order to choose the sets of instructions or blocks to be
counted, the program flow must be known to avoid mispredictions. We therefore leverage concepts
from compiler design, basic blocks and the CFG, to accurately determine the program flow and
build a system that provides an accurate instruction profile. Each basic block has a known set of
instructions, and the compiler uses a finite set of basic blocks.
We account for the various basic blocks of each program by adding block counters in the data
section of the program. This is achieved using a Python script that allocates memory in the data
section of the program and initializes a CPU general-purpose register (GPR) to point to the
starting address of the block counters. We then set a flag in our compiler to exclude this GPR
during compilation. As shown in figure 42, the GPR that holds the base address of the counter
memory is later used in the program to access the required block counters. We leverage these
reserved GPRs to extend our work to programs whose code is spread across multiple files. In the
commonly used approach, each file is compiled separately and the linker creates a single
executable from all the compiled code. Instead, we manually concatenate all the files used by the
program along with the user program and add our basic block counting code, since it is impractical
to instrument each of these files separately.
In our current approach, we first convert the user's C program files into assembly files using an
architecture-specific compiler (for the current work, GCC for OpenRISC). The compiler organizes
the assembly code into a tree of blocks indicating the different program execution paths. We use
a script to parse the assembly code, detect the basic blocks, and insert instructions that increment
the block counters before each block completes. We create the block counters as an integer array
that holds the counts of the various blocks in the code. We add 3 instructions in the assembly code
to increment the corresponding block counter, and we allocate the counter memory in the data
section so that the main program stack region is left unmodified. These instructions are also added
such that the functionality of the software remains unaltered. The block count data stored in the
array is later used to determine the numbers and types of basic blocks executed. We create a
mapping template to convert the block counts into an instruction profile: using the block counts,
we look up the mapping to extract the instruction count for any given program.
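Conceptually, this mapping step amounts to a weighted sum of block counts, as in the sketch below; the array names and dimensions are illustrative rather than the tool's actual layout.
/* Sketch of converting block counts into an instruction profile.
 * block_map[b][op] holds how many instructions of opcode class 'op'
 * appear in basic block b; both arrays here are illustrative. */
#define NUM_BLOCKS  8
#define NUM_OPCODES 16
void block_counts_to_profile(const unsigned int bb_count[NUM_BLOCKS],
                             const unsigned int block_map[NUM_BLOCKS][NUM_OPCODES],
                             unsigned long profile[NUM_OPCODES])
{
    for (int op = 0; op < NUM_OPCODES; op++)
        profile[op] = 0;
    for (int b = 0; b < NUM_BLOCKS; b++)
        for (int op = 0; op < NUM_OPCODES; op++)
            profile[op] += (unsigned long)bb_count[b] * block_map[b][op];
}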
The overhead due to the profiling instructions is small relative to the average number of
instructions in a basic block. We observed that basic blocks in commonly run user programs
contain about 50 instructions on average, which implies the profiling code adds roughly 6% overhead.
Furthermore, we pass the base address of the block counters to the library function calls so that
their block executions are also accounted for. We initialize contiguous memory locations for the
counters so that each one is accessible by adding an offset to the base address. This allows a
library function to update the relevant counters in the array using a single base address, depending
on which basic blocks it executes, before returning to the calling program. The data width chosen
for the block counters is 2 bytes, so each counter can count from 0 to 65,535 executions. We
observed that a user application can have up to 200,000 blocks on average and each block can
execute up to 50,000 times, so the 2-byte data width is sufficient for the average numbers of basic
block executions in these applications. The base address of the basic block counter array
(bb_count) is stored in GPR R28 and is used to access the individual block counters that make up
the program's profile.
Figure 43: Sample assembly code with block counter code instrumented
Figure 44: Assembly code to increment basic blocks
Figure 43 shows a snapshot of the implemented code that increments basic block counters, added
at the starting point of each basic block. As shown in figure 44, basic block counter 16 is accessed
by adding an offset to the base address: the base address is stored in GPR R28, the offset is
calculated as 16 × 2 bytes = 32, and this offset is added to the base address to access counter 16.
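In C terms, the access pattern amounts to the following sketch, where the base pointer plays the role of R28 and the 2-byte counter width matches the description above; the function name is illustrative.
/* Accessing a block counter relative to the base address held in R28.
 * With 2-byte counters, block 16 lives at offset 16 * 2 = 32 bytes. */
#include <stdint.h>
static inline void increment_counter(uint8_t *bb_base, int block_id)
{
    uint16_t *ctr = (uint16_t *)(bb_base + block_id * 2);  /* block 16 -> offset 32 */
    *ctr += 1;                                              /* load, add, store: 3 instructions */
}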
We implement our system to frequently sample the basic block profile and thereby extract the
instruction counts of the software running on the processor at runtime. We use interrupts to sample
the counters in the user code: we add signal handler code to the user application that executes a
subroutine when a particular signal (SIGINT) is received. The subroutine reads the block counts
and stores them in a temporary buffer allocated in memory. This signal handler is run frequently
to collect fine-grained instruction profiles of the software in the buffer space, and the buffer is
recorded into a file at a known rate. The basic block counts of the user application stored in the
file include the numbers and types of basic blocks executed within the code and the library calls.
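A minimal user-space sketch of such a sampling handler is shown below, assuming a counter array of the layout described above; the array size, output file name, and the use of file I/O inside the handler are for illustration only.
/* Sketch of a signal handler that samples the block counters on SIGINT
 * and appends a snapshot to a profile file. Names and sizes are illustrative;
 * file I/O inside a signal handler is shown only for clarity. */
#include <signal.h>
#include <stdio.h>
#define NUM_BLOCKS 256
extern unsigned short bb_count[NUM_BLOCKS];   /* 2-byte counters in the data section */
static unsigned short snapshot[NUM_BLOCKS];   /* temporary buffer */
static void sample_counters(int signo)
{
    (void)signo;
    for (int b = 0; b < NUM_BLOCKS; b++)
        snapshot[b] = bb_count[b];
    FILE *f = fopen("bb_profile.log", "a");   /* hypothetical file name */
    if (!f)
        return;
    for (int b = 0; b < NUM_BLOCKS; b++)
        if (snapshot[b])
            fprintf(f, "%d %u\n", b, snapshot[b]);
    fclose(f);
}
int install_sampler(void)
{
    return signal(SIGINT, sample_counters) == SIG_ERR ? -1 : 0;
}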
We test our system on a string library that has up to 12 library functions such as strlen, strcpy,
and strcmp. We extracted the library calls used by a program and added block counters to the
basic blocks of both the user program and the library call code. To instrument the libraries, we
begin by compiling their source C code and then add instructions to count basic blocks in the
assembly file.
Figure 45: Snapshot of the result with block and instruction counters
We observed an average of 5 basic blocks within some of the string library functions and use this
data to allocate 5 block counters for the string library calls. As an example, consider a program
that computes the length of a string using the strlen function available in uClibc. We modify the
strlen function into strlen_ip, which accepts the base address of the counters to update with its
basic block counts. The strlen_ip library function computes the length of the string whose address
is passed in register R3 (the second highlighted box in figure 43), while the base address of the
basic block counters is passed in register R4 (the first highlighted box in figure 43). When a user
program calls strlen_ip, it calculates the length of the string and updates the basic block counts
with its profile before returning to the calling program. Similarly, when a user application calls
any of these built-in functions, our modified library functions are linked into the main program
and count the basic blocks executed. The sample codes in the appendix illustrate the usage of such
modified library calls. Figure 41 shows the flow for library modification of the string functions,
and figure 45 shows the result of instrumenting the profiling code in the software: block counts
are shown on the left and the instruction profile on the right.
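A C-level view of the modified call is sketched below; the exact signature and block layout of strlen_ip are assumptions based on the description of registers R3 and R4 above, and the real routine is instrumented assembly rather than C.
/* Sketch of the modified library call. The string pointer corresponds to R3
 * and the counter base address to R4 in the assembly of figure 43.
 * The signature and block layout below are assumed for illustration. */
#include <stddef.h>
#include <stdint.h>
size_t strlen_ip(const char *s, uint16_t *bb_count)
{
    size_t len = 0;
    bb_count[0]++;                 /* entry block of strlen_ip */
    while (s[len] != '\0') {
        bb_count[1]++;             /* loop body block */
        len++;
    }
    bb_count[2]++;                 /* exit block, before returning to the caller */
    return len;
}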
10.1.1.2 Instruction Profiling of processes using kernel interrupts
A common user application with 200,000 blocks, each executing up to 50,000 times, produces large
amounts of data that may be hard to use in an embedded system with limited space and energy.
Hence, we explore an alternative method that profiles instructions with low overhead and good
accuracy while producing less profile data. We extract the instruction profile at the process level
using the CFG and instruction counters, and we sample the profile data at a regular interval using
kernel-level interrupts.
Linux provides multi-tasking by allocating a fixed time period to each user- and kernel-level
process. To facilitate this, the kernel uses jiffies to keep track of the execution time of each thread.
The jiffies are generated by a timer, which in turn uses a counter to generate ticks. The tick or
interrupt timer frequency is programmed during booting using a predetermined value in the
configuration file and is based on the system architecture. The Linux kernel on OpenRISC uses
these interrupts to update the clocks used by subsystems such as the CPU real-time clock and the
network timers. During these interrupts, program execution switches from user mode to kernel
mode, and the kernel monitors the execution time of each process. Each process is given a fixed
time frame, and the kernel checks the length of code executed during that frame. The scheduler
executes the next process in the queue if the previous task has completed, or it uses a fair-share
scheduling scheme to accommodate multiple large processes that are incomplete. Between
interrupts, the kernel stores the program counter (PC) value to keep track of the execution path of
the processes. We leverage this to identify the location of code execution in the control flow graph
(CFG). We extract the CFG by parsing the object dump of the software along with the set of basic
blocks used by the compiler to generate the object files. The CFG indicates all the paths that a
program can possibly take during execution. Using the PC values, we trace the exact path of
program execution and extract the instruction profile of the code executed between interrupts.
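The lookup from a sampled PC to a position in the CFG can be sketched as a binary search over the block start addresses extracted from the object dump; the function and array names below are illustrative.
/* Sketch: map a sampled program counter to the basic block containing it.
 * block_start[] is assumed to hold the start address of each block, sorted
 * ascending, as extracted from the object dump. Returns the block index. */
#include <stdint.h>
int pc_to_block(uint32_t pc, const uint32_t *block_start, int nblocks)
{
    int lo = 0, hi = nblocks - 1, ans = -1;
    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        if (block_start[mid] <= pc) {
            ans = mid;             /* candidate: last block starting at or before pc */
            lo = mid + 1;
        } else {
            hi = mid - 1;
        }
    }
    return ans;                    /* -1 if pc precedes the first block */
}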
Our current ORPSoC platform executes about 500,000 instructions between successive timer
interrupts. With this many instructions executed between samples, path tracing becomes harder
and errors are introduced into the CFG path traversal. To reduce such errors, we reduce the
number of instructions executed between successive ticks by increasing the number of ticks per
second. To achieve this, we modify the preprocessor constant "HZ" during kernel compilation.
We observed that the minimum tick period that can be achieved is 80-100 microseconds; that is,
the tick interval cannot be reduced below this range. The typical interrupt interval used in our
system is 10 milliseconds. Reducing the tick interval below 80 microseconds cannot be done
without compromising the correctness of other modules in the system: modules synchronized to
kernel interrupts cannot keep up with the higher-speed ticks. About 5,000 instructions are executed
to run each tick and its interrupt routine, so at higher tick rates the modules may not be able to
complete execution of all their instructions, which results in malfunctioning of the hardware.
10.1.1.3 Low overhead accurate instruction profiling per thread
It may not be possible to achieve higher tick rates for finer tracing of the CFG in technologies
with higher clock rates: the achievable range of tick rates shrinks as more modules are integrated
on-chip because of synchronization issues. Also, the additional instructions added to the program
to extract basic block profiles can cause non-negligible overhead in execution time for a large
basic block set, which in turn adds a power overhead since the measurement code has a
considerable power footprint. Therefore, we build a hardware-assisted technique to account for
the instructions or block counts of each thread executed on the processor with low power overhead
and negligible overhead in execution time.
Figure 46: Work flow to use hardware assisted thread level instruction profiling
We build a hardware unit to account for each instruction executed in the thread and use a
dedicated bus to store its instruction count in the memory. The hardware unit consists of a pattern
matcher for instructions and counters to store the instances of instructions matched. The pattern
matcher also assists in storing the thread relevant information. For each context switch operation,
we ascribe thread specific information using one of the following schemes:
i. We identify an instruction or set of instructions that is executed during each context
switch and store this information along with the power profile in memory. We
hypothesize that this set of instructions contains thread-specific information such as the
thread ID, and we use this data to distinguish the power consumed by different threads.
ii. We use the reserved instruction space to create an additional instruction to assist the
pattern matcher. This instruction is inserted in the signal handler code during the context
switch operation and enables the power measurement unit to store the profile along with
the thread ID in memory.
In either scheme, we match the relevant instructions and enable the instruction profiler to store the
profile in a memory location accessible by software. The software then computes power using the
per-instruction weights and the instruction counts per thread. Figure 46 shows a generic flow of
operations for the two schemes.
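In both schemes the software-side power computation reduces to a weighted sum of the per-thread instruction counts, as sketched below; the weight table, opcode classes, and sampling interval are assumptions for illustration.
/* Sketch of the software-side power computation: energy-per-instruction
 * weights multiplied by per-thread instruction counts, divided by the
 * sampling interval. The weight table and interval are illustrative. */
#define NUM_OPCODES 16
double thread_power(const unsigned long counts[NUM_OPCODES],
                    const double energy_per_instr[NUM_OPCODES],  /* joules per instruction class */
                    double interval_s)                            /* sampling interval in seconds */
{
    double energy = 0.0;
    for (int op = 0; op < NUM_OPCODES; op++)
        energy += (double)counts[op] * energy_per_instr[op];
    return energy / interval_s;                                   /* average power in watts */
}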
10.2 Implementation and results on ARM based system
We test our method on a processor that is commonly deployed in modern-day devices. We use the
Freescale i.MX6Q SABRE Lite board, which has an embedded ARM Cortex processor and an
Ubuntu 12.04 operating system optimized for the ARM processor. In the first step, we compile the
given C program into ARMv7 assembly instructions using GCC. We set a flag in GCC to prevent
the use of two hardware registers, which store the profile data and are subsequently used by the
instruction profiling tool. The functions of the two reserved registers are:
i. Register 1: Reserved to store the address of the global instruction profile pointer. The
entire instruction profile is stored starting at this address.
ii. Register 2: Temporary register used to store the address of the block counter to be
incremented. Each basic block has an associated counter, and we use a set of three
instructions to increment the block counters (details are given in the code described in the
appendix). This requires a temporary register to hold the address of the counter being
incremented.
We create an array of memory locations to store the basic block counts. The basic block count
array is an array of software counters, one associated with each basic block; its size is therefore
equal to the number of basic blocks. Each counter is an integer (4 bytes), so the maximum counter
value is 2^32 - 1. The size of the basic block array is calculated by parsing the ARM assembly code
and extracting the number of basic blocks in the assembly file. An array of integer counters of the
determined size is then initialized to zeroes in the data region of the program. Three instructions
are added at the end of each basic block to increment its counter whenever that basic block is
executed. The incremented value is stored back into the basic block array at its location in the
data region. These additional instructions add an overhead of 20-25% in code size. In the main
function, the start address of the data region is stored in one of the reserved registers. Each counter
can then be accessed by adding an offset to this base address, and the second reserved register is
used as a temporary register to hold the address of the counter.
We create an array of memory locations to store the instruction counts in a program. The block
instruction array is defined in the data region of the assembly program and stores the assembly
instructions of each basic block. The total number of block instruction arrays is equal to the
number of basic blocks in the program, and each block instruction array holds the instructions of
that particular basic block. For example:
const char *a1[6] = {"add", "sub", "mov", "ldr", "add", "str"};
const char *a2[7] = {"add", "sub", "fldd", "mov", "ldr", "add", "str"};
This example shows a program with two basic blocks, the first having 6 instructions and the second
having 7 instructions. The array of strings is declared constant because the instructions associated
with a basic block do not change.
The global instruction profile pointer is a pointer defined in a header file included in each file
associated with the program and in each library file. It acts as a shared piece of memory between
all the C files associated with the execution of the program and all the library files. It stores the
instruction types and the counts of those instructions executed by the processor in real time. It is
a shared memory region whose size is set to an arbitrary but sufficient amount (10 kB) for testing
purposes. The shared memory is initialized using the shmget() function call. The following is an
example of the functions used to initialize the shared memory and attach it to the program's
address space:
#include <stdio.h>
#include <stdlib.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#define SHMSZ1 10240            /* size of the shared profile region (10 kB, as described above) */

int shmid1;
key_t key1;
int *shm1;                      /* Global instruction profile pointer */
key1 = 1000;                    /* Name of the segment */
/* Create the segment */
if ((shmid1 = shmget(key1, SHMSZ1, IPC_CREAT | 0666)) < 0) {
    perror("shmget1");
    exit(1); }
/* Attach the segment to the data space */
if ((shm1 = (int *)shmat(shmid1, NULL, 0)) == (int *)-1) {
    perror("shmat1");
    exit(1); }
For testing our system, the Whetstone benchmark program was used along with the IEEE 754
math library. Each C program in the library is modified, and the block counter array, block
instruction array, and global instruction profile pointer are initialized as described above. The
makefile of the math library is modified to compile the instrumented assembly programs instead
of the C program files. At the end of each assembly library file, before returning to the caller, the
instruction profile, that is, the set of instructions executed in the library file, is calculated and
stored at the instruction profile address, and the instruction profile pointer is advanced. This is
done by scanning the basic block count array and multiplying each counter value by the
instructions of that basic block stored in the block instruction array, yielding the total number of
instructions executed. The library files are then archived into a shared library file that can be
dynamically linked with the main program. Thus, the main program believes it is calling the
original math library while in reality it is calling the modified library functions.
127
The main C program file includes signal handler code (written in assembly) that runs every second.
The job of the signal handler is the following (a C sketch of such a handler is given after this list):
1. Run through the basic block count array and check whether each count value has changed from
the previous iteration. If it has, take the difference between the new value and the previous value
and multiply it by the block instruction array of that basic block. This gives the types and counts
of the instructions executed during that particular one-second interval, which are added onto the
instruction profile at the address stored in the instruction profile pointer.
2. Print the instruction profile to a file.
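The following minimal C sketch illustrates such a periodic handler (the actual handler is written in assembly); the array names, the one-second timer setup, and the output file are illustrative, and performing file I/O inside a signal handler is shown only for clarity.
/* Sketch of a one-second SIGALRM handler that diffs the block counters,
 * converts the delta to an instruction profile, and logs it. Names and
 * sizes are illustrative. */
#include <signal.h>
#include <stdio.h>
#include <sys/time.h>
#define NUM_BLOCKS  8
#define NUM_OPCODES 16
extern volatile unsigned int bb_count[NUM_BLOCKS];
extern const unsigned int block_map[NUM_BLOCKS][NUM_OPCODES];
extern int *shm1;                    /* global instruction profile pointer */
static unsigned int prev[NUM_BLOCKS];
static void on_tick(int signo)
{
    (void)signo;
    for (int b = 0; b < NUM_BLOCKS; b++) {
        unsigned int delta = bb_count[b] - prev[b];   /* executions in the last second */
        prev[b] = bb_count[b];
        for (int op = 0; op < NUM_OPCODES; op++)
            shm1[op] += (int)(delta * block_map[b][op]);
    }
    FILE *f = fopen("instr_profile.log", "a");        /* hypothetical file name */
    if (f) {
        for (int op = 0; op < NUM_OPCODES; op++)
            fprintf(f, "%d %d\n", op, shm1[op]);
        fclose(f);
    }
}
void start_sampler(void)
{
    struct itimerval tv = { {1, 0}, {1, 0} };         /* fire every one second */
    signal(SIGALRM, on_tick);
    setitimer(ITIMER_REAL, &tv, NULL);
}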
To plot a real-time histogram of the instructions executed by the processor, a tool called
Processing is used. A Java program takes as input the instruction profile file generated by the tool
in the back end and plots a histogram that displays the types and total numbers of instructions
executed in real time.
Figure 47: Histogram of instructions in a program
10.3 Initial Results
1. Real-time instruction profile:
Our system extracts an accurate instruction profile of any software program at runtime as a
background process. The real-time profile can then be used to calculate the energy consumed per
instruction and can be mapped to the energy consumed per component.
2. Execution time up by 30-40%:
The execution time of a program increases by 30-40% on average. This is due to the profiling
instructions included in the basic blocks: a basic block averages about 20-30 instructions, and the
3 profiling instructions added per block increase execution time by 30-40%. This was observed on
the Whetstone benchmark program using the Linux 'time' command.
3. Code size increased by 20-25% (can be optimized using graph techniques):
The size of the code increases by 20-25% on average because of the three instructions included per
basic block. This can be further optimized by using graph theory techniques to merge successive
basic blocks that do not have an incoming branch from any other basic block.
4. Benchmark program with math library (dynamically linked) working:
For testing purposes, the Whetstone benchmark program was used. This program includes function
calls that invoke dynamically linked math library functions. The IEEE 754 math library was
modified to produce the instruction profile. The above results were derived from tests carried out
on the Whetstone benchmark program.
Abstract
Over the last decade, it has become increasingly important to monitor several parameters during the operational lives of chips. For example, power measurements are used to prevent chips from overheating and, in many emerging complex systems, enable power optimizations. There are a number of tools such as SPICE and dedicated instrumentation methods available for obtaining power data today. However, simulation and model based solutions are often limited in precision while direct measurements incur impractical overhead. The need for accurate power monitoring during the lifetime of a chip can be seen three fold:
Firstly, it is no longer possible to use pessimistic estimates as the power and performance gains provided by each new technology generation are diminishing. The pessimistic power estimation introduces thick guard bands during the application of dynamic voltage and frequency scaling (DVFS) and hence causes non-optimal power management. The DVFS is not applied when there is an opportunity to scale due to these thick guard bands. The thick guard bands in turn reduce performance per watt which diminishes the advantages of technology scaling. As a consequence of this issue along with unchanged threshold and nominal supply voltages, performance has plateaued for several generations of technology. With a fine grained, low latency, in-situ and accurate power measurement system, we try to gain back the loss in performance per watt by reducing the thickness of the guard band.
Secondly, experts suggest that application level power management and adaptive control of power are crucial for future CMOS systems. They hypothesize that most of the circuit design techniques have already been used and the room for power minimization lies with the applications running on the systems. With the knowledge of amount of power being used by the hardware for the software jobs, the software can adjust itself by issuing different types of instructions to limit the power. We enable this by providing an all-digital, fine grained and high resolution power measurement system that can provide applications with details of power usage on each component. Per component power can be used actively by the applications to limit the components specific power usage while maintaining program correctness.
Lastly, many of these analyses must capture the effects of process variations. While some of the available tools such as SPICE provide feedback on parametric effect of process variations, they incur long run-time simulations. These simulations incur impractically high run-time for complex circuits such as microprocessor. Also, they do not provide a means to include temperature variations across the chip during simulation. The current methods assume uniform temperature characteristics across the entire chip while on a real chip