DESIGN TRADEOFFS IN A PACKET-SWITCHED
NETWORK ON CHIP ARCHITECTURE
by
Vikram Muttineni
A Thesis Presented to the
FACULTY OF THE VITERBI SCHOOL OF ENGINEERING
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
MASTER OF SCIENCE
(ELECTRICAL ENGINEERING)
December 2005
Copyright 2005 Vikram Muttineni
ACKNOWLEDGEMENTS
I thank my advisor and Committee Chair, Dr. Alice C. Parker, who read numerous revisions in a short interval of time, for giving me the opportunity to work in a very interesting area, and for her support and guidance throughout my graduate studies at the University of Southern California. I thank Dr. Timothy Pinkston and Dr. Michael Neely for serving on my committee.
I thank my parents, my sister Jyotsna and my brother M.V. Rao for their constant encouragement and support, without which I wouldn't have completed this work. Finally, I thank all the friends I have met over my years at the University of Southern California, especially Wei Chen.
Table of Contents
Acknowledgements
List of Figures
Abstract
Chapter One: Introduction
Chapter Two: Related Work
Chapter Three: Current Work
Chapter Four: Conclusion
Bibliography
List of Figures
Figure 1: Throughput and maximum latency requirements for various traffic types [16]
Figure 2: Reachable distance per clock [14]
Figure 3: Billion transistor era from 2007 [12]
Figure 4: Minimum half perimeter delay for various core sizes [12]
Figure 5: Gate length trends in high performance microprocessor units (MPU) [12]
Figure 6: 2D mesh topology [11]
Figure 7: 2D torus topology [11]
Figure 8: Bypass torus layout topology [11]
Figure 9: Clustered torus topology [11]
Figure 10: Typical NOC topology. A packet-switched architecture for network on chip
Figure 11: Proposed NOC topology. Enhanced packet-switched architecture
Figure 12: Single FIFO built from four identical FIFOs to synchronize packet size
Figure 13: A single IP core with large input and output buffers and a processing element
Figure 14: How the router input buffers and output buffers are connected through a crossbar switch
Figure 15: Input and output ports of a router
Figure 16: Enhanced packet-switched architecture
Figure 17: The model under study
Figure 18: Modeling the input output buffers of a router
Figure 19: Open queuing network
Figure 20: Closed queuing network
Figure 21: Latency vs. number of nodes/stages for both the architectures
Figure 22: Number of stages vs. throughput of the network for both the architectures
ABSTRACT
As technology scales, billion-transistor chips will become a reality in the near future. Reliable and efficient use of the available communication medium (on-chip interconnects, buffers and routers) is of critical importance for such a chip. The focus of this work is on modeling a "Network on Chip" to analyze its performance. We model the on-chip network as a queuing network with blocking and analyze the network. Two possible architectures are considered: the enhanced packet-switched architecture and the packet-switched architecture. The architectures differ in the routing network; the routing networks are analyzed and modeled to estimate the performance. We also discuss the need for "Network on Chip" in the near future.
Chapter One: Introduction
Technology scaling decreases gate delays and increases the length of the chip edge. Wires can be classified into two categories: wires that shrink as technology scales and wires that do not (global wires). Wires that shrink with technology pose no challenge to designers, as the relative change in the speed of wires to that of the gates is modest. Global wires will not scale as technology scales and will require multiple clock cycles to communicate across the chip. Optimistic predictions estimate that the propagation delays for highly optimized global wires are between 6 and 10 clock cycles. Network on Chip is an innovative architectural style that overcomes the global wire delay problem. The focus of this work is on estimating the performance of Network on Chip.
Systems on Chip (SOC) have long been the order of the day where intra-chip communication is required. But with shrinking transistor sizes there is a need for a more efficient way of interconnecting on-chip components (cores). Systems on chip provide highly reliable and dedicated buses for communication between IP cores. But with tens of IP cores being integrated on a single chip, dedicated buses are not an attractive solution, as they consume much of the available silicon area and buses are inherently non-scalable [16], [26]. If the IP cores need to communicate through buses, then we will need bidirectional buses and each IP core will act as a transmitter and receiver. On such buses only one transmitter can control the bus at any particular instant of time. We need a bus arbiter when several transmitters/processors attempt to communicate simultaneously. A processor which
seeks to communicate first usually gets the bus mastership. Arbitration should be fast to avoid performance losses. Arbitration and the response time of slow bus slaves can result in performance losses, as the bus remains idle while the master waits for the slave to respond [19], [26]. Such buses are widely used in microprocessors for transmitting data to various devices. On such buses, when an input instruction is performed, the bus holds the opcode for the instruction during the first machine cycle. During the next machine cycle, it holds the port address. During the last machine cycle, the input device is enabled and places the actual data on the bus. Data is placed on the bus for a very brief time. This completes one transaction on a bus. Each transaction is at least three machine cycles and each machine cycle is four or five clock periods long [19]. New bus proposals are still being made to standardize IP interfaces. Such architectures propose to use multiple on-chip buses, requiring case-specific grouping of IP cores. However, such architectures are not scalable. A bus-based architecture remains convenient for SOCs that integrate fewer than five processors [26].
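Putting the cycle counts quoted above together (plain arithmetic on the figures from [19], not a new measurement), the minimum cost of one bus transaction is:

\[ T_{\text{transaction}} \ge 3\ \text{machine cycles} \times (4\ \text{to}\ 5)\ \frac{\text{clock periods}}{\text{machine cycle}} = 12\ \text{to}\ 15\ \text{clock periods}. \]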
Figure 1: Throughput and maximum latency requirements for various traffic types [16]. (Throughput plotted against latency in µs.)
Bus-based systems are limited not only by their non-scalable architecture, as explained, but also by their energy inefficiency. Longer buses with increasing chip edge length mean an increased clock period, which leads to performance losses. So there is a need for an alternative method of communication on chip, such as Network on Chip. Networks are preferable over bus-based architectures because they can support concurrent communication, which buses cannot [1], [26]. Networks also have advantages in terms of modularity, latency, bandwidth and fault tolerance compared to bus-based architectures [1], [13]. Modular designs enable faults to be tolerated through isolation, redundancy and reconfiguration [13]. Also, with multiple cores on a single chip there will be substantial task-level parallelism between the cores that should be exploited for optimal performance [16]. To exploit this parallelism we need interconnection networks with a very high throughput, depending on the number
of cores [16]. A bus-based architecture would certainly fail to provide such high bandwidth and throughput.
This work focuses on estimating the performance of "Network on Chip" when a general-purpose on-chip communication network is used for communication. Figure 2 shows the reachable distance on a chip per clock for different process technologies. It can be seen that as technology scales the chip edge length increases slightly but the reachable distance per clock decreases rapidly. So to communicate across the chip we will need multiple clock cycles. Figure 3 shows the number of transistors that can be integrated on a single chip. It can be seen that we can integrate more than a billion transistors on a single chip by 2007. Also, from Figure 4, we can see that at 50 nm technology and for 20,000K gates per core, we need 1000 ps to communicate across a half-perimeter length; i.e., we need 12.4 clock cycles (at 12.4 GHz) [12], [13].
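The cycle count quoted follows directly from the wire delay and the clock rate:

\[ N_{\text{cycles}} = t_{\text{wire}} \times f_{\text{clk}} = 1000\,\text{ps} \times 12.4\,\text{GHz} = \left(10^{-9}\,\text{s}\right)\left(12.4 \times 10^{9}\,\text{Hz}\right) = 12.4. \]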
Figure 2: Reachable distance per clock [14]. (x-axis: technology node in µm.)
Figure 3: Billion transistor era from 2007 [12]. (Logic transistors per chip, with a complexity growth rate of 58% per year, compounded.)
Figure 4: Minimum half perimeter delay for various core sizes [12]. (x-axis: technology node in nm.)
Figure 5 shows the scaling of gate length with time in high-performance microprocessors. The physical gate length shown in Figure 5 is the gate length of a fully processed transistor. The prediction is that the scaling will continue at 70% per two-year cycle through the 32 nm physical microprocessor unit (MPU) gate length in 2005, and return to a three-year cycle trend thereafter.
To reap the benefits of technology scaling, innovations in circuit-level and architectural-level design are required [12]. Network on Chip is an innovative architectural-level design technique for reaping the benefits of technology scaling.
Figure 5: Gate length trends in high performance microprocessor units (MPU) [12]. (Printed and physical MPU gate length, plotted against year.)
Many NOC designs have been proposed and there is much commercial activity. Much of the research concerns packet-switched architectures of the type described by L. Benini and G. De Micheli [26]. However, there are few tradeoff studies that validate architectural choices involving this basic architecture. One of the most fundamental choices is the topology of the architecture. In this work we examine two topologies: the traditional packet-switched architecture and the enhanced packet-switched architecture. The two architectures differ in how the IP cores communicate with the communication network.
Chapter Two: Related Work
2.1 Topologies
Dally and Towles [1] propose to use a folded torus topology in which the nodes in each row are connected cyclically. In such a network the IP cores communicate by sending packets to one another over the network. Packet-switched architectures effectively deal with communication errors by error containment [26]. Time slot reservations have been proposed for certain classes of traffic, such as data streams, to provide guaranteed, predictable performance for latency-critical or jitter-critical applications. In addition, virtual channels and cyclic reservation registers have been suggested for efficiently handling pre-scheduled and dynamic traffic.
While most of the research community is focused on packet-switched architectures, Raghavan [23] presents a token ring implementation for NOC with a hierarchical networking strategy. He predicts that future NOCs will have a ring of rings or a ring of packet-switched networks or one of their extensions. Packet-switched networks with virtual channel flow control as proposed by Dally and Towles [1] suffer from large area overhead and leakage currents, and thus he proposes a heterogeneous approach to solve the problem. Raghavan concludes that the high cost of a packet-switched network can only be absorbed if the traffic rates demand such an implementation; otherwise, a token ring architecture can be implemented where simpler interconnection schemes suffice. The advantages of the token ring architecture are scalability without significant degradation in performance, good bandwidth
utilization, absence of data collisions, smaller buffer requirements and decreased
latency by introducing multiple tokens [28].
Bhandary [28] proposes to use a hexagonal lattice routing scheme for NOCs. In such an architecture there are four different types of nodes: the D-routing nodes, monitor nodes, flow control nodes and ND-routing nodes. The D-routing nodes make simple routing decisions, monitor nodes implement the functionality of a monitor, flow control nodes implement flow control functions and ND-routing nodes forward the data on the ring. This scheme makes the following assumptions:
1.) Stations which need to communicate very often (which have a high bandwidth requirement) are clustered on the same ring or adjacent rings to minimize latency.
2.) The control information ring is not susceptible to metal migration since there is no heavy data flow [28]. Metal migration is the diffusion of metal atoms along the conductor in the direction of electron flow. Diffusion occurs because the momentum transfer between the electrons and the metal atoms increases the probability that a metal atom will move in the direction of electron flow. Such a diffusion process will preferentially fill metal ion vacancies found in crystal defects, leaving a vacancy in the location from which the metal atom came.
Link et al. [11] analyzed four different topologies: the standard 2D mesh, 2D torus layout, bypass torus layout and clustered torus architecture. In a 2D mesh network topology each non-peripheral router is bidirectionally connected to its
four nearest neighbors and to its related IP core (not shown), as shown in Figure 6.
Figure 6: 2D mesh topology [11].
In the 2D torus topology each router is connected to four other routers including the
peripheral routers, so routers on one side can send packets directly to routers on the
other side. In such a topology logically adjacent routers are no longer physically
adjacent. These long wires result in high interconnect energy consumption. Figure 7
shows a 2D torus topology of interconnected routers.
Figure 7: 2D torus topology [11].
In a bypass torus topology each router is connected not only to its closest neighbor but also to the next neighbor, as shown in Figure 8. This reduces the maximum distance in the network by a factor of 2. However, both torus topologies suffer from longer wires than the standard 2D mesh, since in the torus topologies the minimum wire length is more than the width of two IP cores/PEs.
Figure 8: Bypass torus layout topology [11].
In the clustered torus topology four neighboring IP cores form a group which is connected to a single router, as shown in Figure 9. This results in fewer routers being required, and allows each router to consume more area.
Figure 9: Clustered torus topology [11].
It has been shown that the
clustered torus topology has the least average latency (when the effect of the longer wires is ignored), while the 2D mesh and bypass topologies have the maximum average latency, when wormhole routing is employed with area constraints. When all the topologies are given equal flit and output buffer sizes regardless of the area requirement, the bypass network topology performs significantly better than all the other topologies due to its large amount of buffer space, while the clustered torus performs poorly, as the smaller number of routers on a chip results in lower bandwidth due to the fixed flit size.
2.2 Performance Evaluation
To evaluate the performance of a NOC, Sun, Kumar and Jantsch [29] present a methodology for simulating a NOC. The NS simulator [31] was used for constructing the network. The topology of the network, the network protocols and the routing algorithms were described in NS. NS provides TCP and UDP as the network transmission protocols, and static, session, dynamic and manual routing schemes for modeling traffic. Topology parameters such as buffer size, bandwidth, number of nodes and link delay can be described in a script file in Tcl (Tool Command Language). A 5 × 5 two-dimensional mesh topology was modeled. It was assumed that the buffer size in each core is infinite but the buffer size in the switches is finite. Packet dropping was also considered. It has been shown that the probability of dropping a packet decreases as the buffer size increases when the buffer size is small, and that the probability of dropping a packet is not sensitive to buffer size when the buffer size is large.
Wiklund, Sathe and Liu [30] present benchmarking as a solution for evaluating the relative performance of different interconnection network architectures for NOCs and for finding the bottlenecks in the interconnection architectures. A bottleneck is defined as a performance-limiting factor in the interconnection architecture. They define a benchmark as the combination of the specification(s) that have been used and the results that have been achieved in the process of benchmarking. They conclude that the specification of the benchmark is basically the traffic pattern specification, packet size and number of ports. Wiklund et al. [30] resort to benchmarking for performance analysis because an analytical model to describe the performance of a network-style interconnect is difficult to achieve, due to the inherent complexity of the network design space and the way traffic patterns affect the network performance.
2.3 Buffer Allocation and Optimization
Buffers are an integral part of Network on Chip design. Buffers are required at the input port and at the output port of each router. Jingcao Hu and Radu Marculescu [4] present a queuing model based on an M/M/1/K model to analyze and optimize the buffer allocation. The algorithm they present identifies the performance bottleneck among the different router channels and addresses it by adding extra buffers. The system analyzer generates the system of equations for all the routers once the architecture parameters (such as routing algorithm and delay parameters) and the application parameters (the probability that a packet generated by a particular core has as its destination another particular core) are given. The system analyzer initially assigns the buffer in each used channel to be one packet large. These equations are solved and the probability of a buffer being full is calculated. The buffer which has the highest probability of being full is then selected as the bottleneck and its size is incremented by one packet. This process is repeated until the total buffering space used in all the channels across the chip reaches the buffer limit. In this model the inter-arrival times and service times are considered to be independent, identically and exponentially distributed, and the queues are considered to be finite [4]. The time between consecutive arrivals is referred to as the interarrival time. The time elapsed from the commencement of service to its completion for a customer at a service facility is referred to as the service time. Since simulating a network for NOC demands long simulation times for all possible configurations, they conclude that simulation is not a feasible solution.
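The following is a minimal sketch of this greedy allocation loop, assuming each channel can be approximated in isolation as an M/M/1/K queue; the function names, the per-channel utilizations and the standalone closed-form blocking probability are our illustration, not the authors' implementation [4].

def p_full(rho, k):
    """Steady-state probability that an M/M/1/K queue holds K packets
    (buffer full): P_K = (1 - rho) * rho**K / (1 - rho**(K + 1))."""
    if abs(rho - 1.0) < 1e-12:
        return 1.0 / (k + 1)          # limiting case rho -> 1
    return (1.0 - rho) * rho**k / (1.0 - rho**(k + 1))

def allocate_buffers(channel_rho, budget):
    """Greedy loop in the spirit of Hu and Marculescu [4]: start every
    used channel at one packet of buffering, then repeatedly grow the
    buffer that is most likely to be full until the budget is spent."""
    sizes = {ch: 1 for ch in channel_rho}
    while sum(sizes.values()) < budget:
        bottleneck = max(sizes, key=lambda ch: p_full(channel_rho[ch], sizes[ch]))
        sizes[bottleneck] += 1        # add one packet to the bottleneck
    return sizes

# Example: four channels with assumed utilizations, 12 packets in total.
print(allocate_buffers({"x+": 0.9, "x-": 0.5, "y+": 0.7, "y-": 0.3}, budget=12))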
To optimize performance, shared buffers can be implemented at the inputs of the routers, as suggested by G.M. Link et al. [11]. A shared buffer operates like the traditional stack and heap in memory systems, where each port (read pointer and write pointer) starts at one end of the array and works towards the center. This has the advantage of allowing either port to make use of the entire buffer storage space. Shared buffers bring with them a 15% area penalty compared to two separate single buffers [11]. This area penalty is due to the additional complexity of the bi-directional shift array and the complexity of the control logic [11].
2.4 Proposed Architecture
Figure 10: Typical NOC topology. A packet-switched architecture for network on chip. (The smaller circles represent the routers and the larger circles represent the IP cores.)
In the scheme in Figure 10 above, each router has five inputs: four from the adjacent routers and one from the nearest core. The routers communicate with each other and transfer data from one IP core to another. Also, in this type of network
we need as many routers as cores; i.e., the ratio of the number of cores to the number of routers in such a network is always equal to one.
To handle non-uniform traffic (in time and space) we plan to have the scheme shown in Figure 11. When the traffic is uniform the typical NOC topology will perform well, when buffers are properly allocated. But when the traffic is non-uniform, the typical NOC topology will experience long delays, leading to performance loss. The scheme shown below will be able to handle non-uniform traffic by providing alternate data paths for transmitting/receiving the data.
Increasing the buffer sizes is not the only way to decrease the latency of a network. An alternative solution would be to increase the number of routers. In the scheme shown in Figure 11, the core can transmit data to more than one router; Figure 11 shows a core that can transmit data to four routers. In this type of topology, if the router to which the core wants to transmit data is busy, then the core can poll the other three routers and transmit data to the router that is least busy. If all the routers are busy, then the core has to wait until at least one router is free and transmit data through that router. The proposed architecture relieves the bottleneck of various inputs to the router trying to transmit data to the same output port of the router. When the core wants to transmit data and the router is busy, it is not always a good idea to immediately poll for a free router and transmit data through it, because the data would then (in most cases) go through a longer path, and this means more traffic. So we
should be able to predict how long the router (which was currently busy) would be
busy, if the core wanted to transmit data to another router.
Figure 11: Proposed NOC topology. Enhanced packet-switched architecture.
If the router would not be busy (i.e., the router would be free) within a few clocks of the instant the core wants to transmit data, then it would not be a good idea to poll for another router; rather, the core should wait for that short interval of time and
transmit data through the original router. This requires a good prediction of the traffic, but it is not always possible to predict the traffic in all networks. If the traffic cannot be predicted then we can have a fixed or variable time period for which the core should poll the router (even though the router is busy), and acquire another router only if the original is still busy at the end of that time period. A variable time period is needed if the traffic is highly erratic.
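A behavioral sketch of this polling policy follows, in Python for concreteness (the Router model, the queue-occupancy proxy for "least busy" and the timeout handling are our assumptions; the thesis does not prescribe an implementation):

from dataclasses import dataclass

@dataclass
class Router:
    """Toy router model: clocks until free, and input-queue occupancy."""
    busy_for: int = 0
    occupancy: int = 0

def choose_router(preferred, alternates, timeout_clocks):
    """Poll the preferred (shortest-path) router for up to timeout_clocks
    cycles; fall back to the least-busy free alternate only if the
    preferred router stays busy past the timeout."""
    for _ in range(timeout_clocks):
        if preferred.busy_for == 0:
            return preferred              # freed up in time: shortest path
        preferred.busy_for -= 1           # wait one clock, then poll again
    free = [r for r in alternates if r.busy_for == 0]
    if free:                              # least busy = fewest queued packets
        return min(free, key=lambda r: r.occupancy)
    return None                           # all routers busy: keep waiting

# Example: preferred router busy for 8 more clocks, timeout of 3 clocks.
r0 = Router(busy_for=8)
alts = [Router(occupancy=2), Router(occupancy=5), Router(busy_for=4)]
print(choose_router(r0, alts, timeout_clocks=3))   # picks the occupancy-2 router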
2.5 On-Chip Router
A router is a device that determines the next network point to which a data packet should be forwarded en route to its destination. The IP cores communicate through the network, and routers form a part of the network. The IP core interfaces with the network through a router input buffer, through which it transmits packets into the network, and a router output buffer, through which it receives packets from the network. A router consists of input/output buffers (or queuing buffers), a crossbar switch, a routing function, and a central arbiter and controller. The input/output buffers of the router are usually implemented as FIFOs (first in, first out) and are used to store incoming packets temporarily if they cannot be routed immediately, and sometimes outgoing packets in case retransmission is required. This increases the throughput of the router. The crossbar switch makes and breaks the circuits for routing the data from incoming to outgoing paths. The routing function translates the packet header to the appropriate output buffer to which the data should be routed. Deterministic or adaptive routing algorithms could be used.
However, it has been shown that adaptive routing algorithms are not suitable, due to the computational complexity of updating and maintaining the routing information [11]. The central arbiter prioritizes the packets when arbitration is required. It monitors the status of all input and output queues (whether they are empty or full) and grants permission to input queues to transmit data to the output port. When a router input buffer has data to be routed, the central arbiter sends the data to the router output buffer indicated by the destination address of the incoming packet. If more than one input queue has data to be sent to the same output queue, a prioritizer is used to arbitrate. The prioritizer takes four requests as inputs, from the four different input queues, and generates a grant depending on the priority principle used. To summarize, a router reads the destination address of an incoming packet and routes it to the appropriate buffer, prioritizing when multiple packets are to be routed to the same destination address. The router does not process the packets.
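As an illustration of the grant logic just described, here is a minimal fixed-priority version in Python (the thesis leaves the priority principle open; round-robin would slot into the same structure, and the port names are ours):

def arbitrate(requests, priority_order):
    """Grant the output port to one requesting input queue.
    requests maps input-port name -> True if that queue has data;
    priority_order lists ports from highest to lowest priority."""
    for port in priority_order:
        if requests.get(port, False):
            return port                   # highest-priority requester wins
    return None                           # no input queue is requesting

# Example: X+ and Y- request the same output port; X+ wins under this order.
print(arbitrate({"x+": True, "x-": False, "y+": False, "y-": True},
                priority_order=["x+", "x-", "y+", "y-"]))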
In the proposed new topology, the router-to-core ratio is always greater than one. A high ratio indicates the ability of the network to handle more traffic with less congestion, as it will be able to handle more transactional bandwidth [11]. The price we pay for the increased ratio is in terms of silicon area and increased complexity of the router [11]. But it seems to be a reasonable tradeoff, as the routing network in a conventional packet-switched NOC occupies minimal area (6.6%) [1]. The area of the router is dominated by the buffers. It is estimated that the logic receiver circuits, buffer storage, and routing will occupy an area less than 50 µm wide by 3 mm long
along each edge of the IP core, for a total overhead of 0.59 mm² (6.6%) [1], when the dimensions of the IP cores are 3 mm × 3 mm. It has also been shown that when the routers are connected by a 32-bit-wide bus and 4 control lines and contain six 32-bit-wide FIFO buffers with a depth of 4, the router occupies 7% of the device area on an FPGA (Virtex II 6000) [25]. This area is expected to decrease drastically with the rapid growth in size of IP cores [25]. As technology scales there is more room for performance optimization because of increased wiring availability and buffer space. This available space can be used for the additional routers being proposed.
NOCs with a router-to-core ratio greater than one demand more silicon area than the typical NOC topology. Also, increasing the number of routers may not be the solution in all cases; it depends on the nature of the traffic, the placement of the routers, and the position of the input/output pins on each core. The router-to-core ratio can give a quick estimate of how well the network would function in terms of congestion when the traffic is uniform.
Chapter Three: Current Work
3.1 Synchronizing Packet Width
Packets of data which are to be communicated across the NOC need not be of the same size. This can lead to performance loss, and analyzing such networks is complex, as each packet size becomes a customer of a specific class (in queuing networks) and we can expect to have many classes of customers in a NOC. The performance loss is mainly due to the fact that if the network is capable of transmitting 64-bit data and we transmit 16-bit data, then we are not utilizing the resources completely and the resources are wasted. Suppose we have two different cores that produce data whose packet lengths are 16 bits and 64 bits wide. These packets can be transmitted in the following ways.
1.) Break the 64-bit-wide packet into 16-bit-wide packets and transmit the data. This introduces the overhead of carrying the source and destination address in each of the 16-bit packets.
2.) Transmit the 16-bit data as 64-bit-wide data by appending 48 bits of redundant data (i.e., by appending all zeros).
3.) Append four 16-bit-wide data packets to form a single packet of width 64 bits and transmit it as a 64-bit packet. This scheme is attractive because we can try to have a uniform packet size throughout the network. But the overhead lies in the fact that the delay per packet increases [6] (and the number of packets transmitted decreases, as we are appending four packets to form a single packet in this example) and all four packets reach their destination at the same instant. If the packets are transmitted individually without appending data, the first packet reaches its destination first and the other packets follow in the order they are injected into the network. It has been shown [6] that the size of the packet is directly proportional to the service time per packet. So we are increasing the overhead of transmitting a single packet but decreasing the number of packets transmitted by appending four packets together. We present a methodology for how this appending of multiple packets can be accomplished.
Figure 12: Single FIFO built from four identical FIFOs to synchronize packet size. (F1-F4 are 16-bit-wide, n-location FIFOs; a 16-bit-wide input from the core feeds them, a control unit sequences them, and together they present a 64-bit-wide output to the network.)
As shown in Figure 12, a single FIFO can be built from four individual FIFOs, each 16 bits wide and n locations deep. The value of n depends on the nature of the traffic. The 16-bit-wide data input comes from the IP core producing it, and the 64-bit-wide data output is the input to the network. Consider the case when all four FIFOs (F1, F2, F3, and F4) are empty. When the IP core wants to transmit data, it writes its first 16-bit datum into F1. This FIFO (F1) now has data which could be read, but we do not read it; we wait until data has been written into the other FIFOs. So we read the FIFOs (F1, F2, F3 and F4) only when all of them have at least one 16-bit entry.
The second 16-bit datum is written not into F1 but into F2, the third into F3 and the fourth into F4. After data has been written into F1, F2, F3 and F4, it can be read from all four FIFOs simultaneously, giving 64-bit data as input to the network. So data is written into the FIFOs in the order F1, F2, F3, F4, F1, F2, F3, F4, ... in a circular fashion, and data is always read simultaneously from all the FIFOs F1, F2, F3 and F4 at one stroke.
The IP core may not always produce four packets (or a multiple of 4) of data in quick succession. If the IP core produces 2 packets of data and two more packets only after a long interval of time, then the first two packets will remain in the FIFO, waiting until there are two more packets to be transmitted. To avoid this we can have a control unit that waits for a specified interval of time. If the IP core does not produce data during this interval, and if F1, or F1 and F2, or F1, F2 and F3 have data which is to be communicated
to another IP core, then the control unit will wait for a fixed interval of time (a timeout) and then let the data be read, appending all zeros to make it 64-bit-wide data. The input to the control unit is the write pointer location of each FIFO. From the location of the write pointers we can determine which FIFO we are currently writing into, and from the difference of the write pointers of two different FIFOs we can start counting the required number of clock cycles and make a decision.
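A behavioral sketch of this write/read discipline follows, in Python as a stand-in for what would be RTL (the timeout constant and the class name are illustrative):

from collections import deque

class PackingFifo:
    """Four 16-bit FIFOs (F1..F4) presented as one 64-bit FIFO. Writes go
    to F1, F2, F3, F4, F1, ... in circular order; a read pops all four
    FIFOs at once to form one 64-bit word."""

    def __init__(self, timeout_clocks=16):
        self.fifos = [deque() for _ in range(4)]
        self.wr = 0                   # index of the FIFO written next
        self.idle = 0                 # clocks since the last write
        self.timeout = timeout_clocks

    def write(self, word16):
        self.fifos[self.wr].append(word16 & 0xFFFF)
        self.wr = (self.wr + 1) % 4   # circular write order F1 -> F4
        self.idle = 0

    def tick(self):
        """Advance one clock; on timeout, zero-pad a partial group so
        already-written data is not stranded waiting for more packets."""
        self.idle += 1
        if self.idle >= self.timeout and self.wr != 0:
            while self.wr != 0:       # pad the remaining lanes with zeros
                self.write(0x0000)

    def read64(self):
        """Read only when every FIFO has at least one 16-bit entry."""
        if all(self.fifos):
            parts = [f.popleft() for f in self.fifos]
            return parts[0] << 48 | parts[1] << 32 | parts[2] << 16 | parts[3]
        return None                   # not enough data for a 64-bit word

# Example: only two 16-bit packets arrive; the timeout pads F3 and F4.
pf = PackingFifo(timeout_clocks=4)
pf.write(0xAAAA)
pf.write(0xBBBB)
for _ in range(4):
    pf.tick()
print(hex(pf.read64()))               # 0xaaaabbbb00000000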
3.2 Modeling of Buffers for on Chip Routers
In the M/M/1/K or M/M/1 queuing model, interarrival and service times are independent, identically and exponentially distributed. An exponential distribution has the property that it is a strictly decreasing function. This property rules out situations where potential customers approaching the queuing system tend to postpone their entry if they see another customer entering ahead of them, but it is consistent with the common phenomenon of arrivals occurring randomly. In our case the arrivals occur randomly only at the processing element, which randomly generates packets of data and deposits them in the infinite buffer. At the input and output buffers of the router the arrivals are not random: arrivals tend to be postponed if the buffer is full or if another customer is entering ahead of them.
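For context, the standard M/M/1/K result (not derived in the thesis) quantifies how often a finite buffer is in this "full" state. With utilization \(\rho = \lambda/\mu\), the steady-state probability that a K-place buffer is full, and hence that an arrival is blocked, is

\[ P_K = \frac{(1-\rho)\,\rho^{K}}{1-\rho^{K+1}} \quad (\rho \neq 1), \qquad P_K = \frac{1}{K+1} \quad (\rho = 1). \]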
Queues are considered to be finite or infinite according to whether their size is relatively small or large. It should be noted in this context that finite and infinite are both relative terms. Queues are considered to be infinite if a relatively large upper bound on the permissible number of customers is possible. However, for queuing systems where the upper bound is small enough that it can be reached, it
becomes necessary to assume a finite queue [3]. The finite case is more difficult analytically because the number of customers in the queuing system affects the number of potential customers outside the system at any instant of time. In our case, the buffers present at the input node of the router are small enough that the upper bound can be reached easily and frequently. The buffer present on the IP core of the chip is large and can be approximated as an infinite buffer, since the upper bound is usually not reached. Queuing systems with finite buffers do not obey the equivalence property. The equivalence property states that "for a service facility with s servers and an infinite queue, with a Poisson input with parameter λ and with the same exponential service time distribution with parameter µ for each server (where sµ > λ), the steady-state output of such a service facility is also a Poisson process with parameter λ" [3]. The equivalence property makes no assumption about the type of queue discipline: whether the queue discipline is first-come-first-served, random or even a priority discipline, the served customers will leave the service facility according to a Poisson process. For systems with finite queues in series, no simple product form solution is available and only limited results have been obtained [3].
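In symbols (our compact restatement of the property quoted above):

\[ \text{arrivals} \sim \text{Poisson}(\lambda),\ \ \text{service} \sim \text{Exp}(\mu)\ \text{at each of}\ s\ \text{servers},\ \ s\mu > \lambda \ \Longrightarrow\ \text{steady-state departures} \sim \text{Poisson}(\lambda). \]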
3.3 Approximating the input and output buffers of the processing element and
the router
Consider a Network on Chip processor with N × N cores (N rows and N columns). Each IP core of the NOC can be modeled as a processing element together with a core input and a core output buffer, as shown in Figure 13.
Figure 13: A single IP core with large input and output buffers and a processing element.
Note: The core output buffer is the buffer that accepts the data the processing element outputs to the network, so that the data can be transmitted to other IP cores. Similarly, the core input buffer contains the data the processing element will receive, which has been transmitted by other IP cores.
These large core input buffers and core output buffers can be approximated as infinite buffers. So these buffers obey the equivalence property of queuing networks. Let us assume that the input to the input buffer has an interarrival time that
follows a Poisson process with parameter λ. The assumption that the interarrival time at the output buffer follows a Poisson process is consistent with the common phenomenon of the input being random [3]. While considering the core input and core output buffers, the buffer at the router node can be merged with the buffer on the processing element and the two can be considered as a single large/infinite buffer.
Now let us consider a single router with its input and output buffers, as shown in Figure 14 and Figure 15. These router buffers are small compared to the core input and core output buffers on the IP core. As these router buffers can be expected to become full (i.e., their upper bound can be reached), they should be approximated as finite buffers. Note that the router input buffer in the X+ direction can communicate with any buffer except the router output buffer in the X+ direction. This is because if a buffer in the X+ (input) direction transmits data to the buffer in the X+ (output) direction (through the crossbar switch), it is equivalent to transmitting data to itself, which is redundant transmission of data. Figure 15 illustrates this.
Figure 14: How the router input buffers and output buffers are connected through a crossbar switch. (Input and output buffers in the X+, X-, Y+ and Y- directions, plus a buffer from the output of the processing element and a buffer for input to the processing element, are all connected through the crossbar switch under the control of the arbiter.)
To summarize, the core input and core output buffers of the processing element can be approximated as infinite buffers, and the router input and router output buffers can be approximated as finite buffers.
Figure 15: Input and output ports of a router. (X+, X-, Y+ and Y- input and output ports, plus an input from and an output to the core.)
Consider a packet which is injected by IP core 1 and which has IP core 9 as its destination, and let us follow the path the packet takes. Its first stop will be the router node (1, 1). We can say that the buffer which is at router node 1 and connected to IP core 1 is an infinite buffer and is part of the large buffer on the core.
Figure 16: Enhanced packet-switched architecture.
The packet can reach its destination either through the path provided by routers (1,1), (1,2), (1,3), (2,3), (3,3) or through the path provided by routers (1,1), (2,1), (3,1), (3,2), (3,3), depending on the routing principle (X+ first or Y+ first), or through any other path, depending on the routing principle employed. Irrespective of the path the packet takes, as the packet travels to its destination it traverses a series of finite buffers located at the nodes of each router. At the destination it again reaches an infinite buffer of the core. So a packet which is injected into the network will pass through an infinite buffer (at the sender end) followed by a series of finite buffers, and at the destination the packet reaches an infinite buffer. As this packet is being routed to its destination, there are other packets also trying to reach their destinations. The routes followed by the other packets depend on their source and destination nodes, so these packets may share part of their route with the packet under consideration or follow a completely different route. The packets which share part of their route with the one under consideration (route (1,1), (1,2), (1,3), (2,3), (3,3) or route (1,1), (2,1), (3,1), (3,2), (3,3)) can be said to be intermediate arrivals and departures along that route. So the model under consideration is a sequence of buffers (with the first and last buffers being infinite and the remaining buffers being finite) with intermediate arrivals and intermediate departures at each node, as shown in Figure 17. The intermediate arrivals at each node cannot be dropped but must be held at the source if the buffer is full, as it may be extremely difficult to implement end-to-end protocols. The service time here is the time the router takes to "service the
packet” and route it in its destination direction depending on the routing principle
employed.
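A sketch of the dimension-ordered (X+ first) principle mentioned above, which reproduces the second of the two example paths (the function name and coordinate convention are ours):

def xy_next_hop(current, dest):
    """Dimension-ordered (X-first) routing: correct the X coordinate
    fully, then the Y coordinate. Arguments are (x, y) router indices."""
    x, y = current
    dx, dy = dest
    if x != dx:
        return (x + (1 if dx > x else -1), y)    # move along X first
    if y != dy:
        return (x, y + (1 if dy > y else -1))    # then along Y
    return None                                  # arrived at the destination

# Path from router (1,1) to (3,3): (1,1) (2,1) (3,1) (3,2) (3,3).
hop, path = (1, 1), [(1, 1)]
while (hop := xy_next_hop(hop, (3, 3))) is not None:
    path.append(hop)
print(path)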
Figure 17: The model under study. (An infinite buffer at the sender feeds a series of finite buffers, with intermediate arrivals and intermediate departures at each node.)
Consider router R1 and router R2, as in Figure 18. B1 is the router output buffer of router R1 and B2 is the router input buffer of R2. In modeling we combine B1 and B2 into a single buffer and consider it to be a part of router R2; router R2 by itself is considered to be a node. Figure 18 shows only the buffers associated with other routers; it does not show the buffers associated with the core.
Figure 18: Modeling the input output buffers of a router. (Each router contains the arbiter, crossbar switch and routing function; the output buffer B1 of R1 and the input buffer B2 of R2 are merged into one buffer.)
As finite buffers do not obey the equivalence property, to analyze this situation we resort to queuing networks. A queuing network is described as "networks of service facilities where customers must receive service at some or all of the facilities," or simply a set of interconnected nodes [1], [6]. Each node consists of a queue, where customers wait for service, and one or more servers. Customers receive service at one or more, or at all, of the nodes depending on the network topology. Information such as the expected total waiting time, the expected number of customers in the entire system and the throughput of such a network can only be obtained by studying the entire network [1]. Since the intermediate queues in the network we are studying are finite, the flow of customers through one node may be momentarily stopped if a destination node has reached its full capacity. This phenomenon is called blocking, and such networks (queuing networks with finite-capacity nodes) are referred to as "queuing networks with blocking" [6]. Queuing networks can be classified as open, closed and mixed. In an open queuing network model, customers enter the network from outside, receive service at one or more nodes according to the network topology, and eventually leave the network, as shown in Figure 19. In a closed queuing network, as shown in Figure 20, no arrivals to or departures from the network are allowed and there is a constant population of customers circulating in the network. Mixed queuing networks are models with multiple types of customers that are closed with respect to some customers and open with respect to others. Briefly, a network is said to be open if all customers can leave the network and closed if no customer can enter or leave [7].
Figure 19: Open queuing network.
Figure 20: Closed queuing network.
To be more precise, the network model under study is an open queuing network with blocking, with fixed (constant) service time. Blocking can be further classified into four types. If t0 is the time when station n finishes serving a customer so that station (n+1) becomes full, then for t > t0 and until a transition occurs in station (n+1), no
customer can be served by station n because there is no room available in station (n+1). The four possible types of blocking are:
1.) At t = t0, station n is not blocked and starts serving its next customer. When the customer finishes its service, it is blocked in station n.
2.) At t = t0, station n is not blocked and starts serving its next customer. When the customer finishes its service, it starts being served again. This corresponds to a retransmitted packet. [When station (n+1) is busy, some networks drop the packet, depending on the blocking principle employed. Such a packet is retransmitted, and the retransmitted packet receives the same service as the previous packet. To capture the effect of such a retransmitted packet in a queuing network, the customer is served repeatedly until station (n+1) is no longer busy.]
3.) At t = t0, station n is blocked, but the next customer enters service without being served, and station n is blocked in a "full" state.
4.) At t = t0, station n is blocked, the next customer is not allowed to enter service, and station n is blocked in an empty state.
It can be shown that rules 2 and 3 are equivalent and that rules 1 and 4 are equivalent. The particular type of blocking in which we are interested is rule 4: when station n is blocked, the next customer is not allowed to enter service, and station n is blocked in an empty state. A simple M/M/1/K or M/M/1 queuing model (the first M represents the distribution of interarrival times and the second M the distribution of service times, both of which are exponential; the 1 represents the
number of servers, and the K represents that the queue is of finite length K; the absence of K implies that the queue is infinite) cannot be used to represent the queuing network under study, because the output of a finite buffer which forms the input to the next finite buffer in the sequence can no longer be said to have an exponential distribution. Queuing networks with blocking are in general difficult to analyze, and closed-form solutions for the steady-state queue length distributions are generally not attainable [3]. An alternative to blocking would be to drop packets when the buffer is full. Though this seems an attractive way to avoid blocking, it may be extremely difficult (or infeasible) to implement end-to-end protocols on a NOC because of power limitations and the number of wires available for communication. So packet dropping is not a solution to avoid blocking [6]. The techniques that must be used for the analysis of such networks are therefore analytical approximations, numerical techniques and simulation techniques [3]. A closed exponential queuing network with blocking has a product form solution in the following three cases: 1) when the routing matrix is reversible; 2) when the probability of blocking does not depend on the number of units in the destination node, but is simply a constant; and 3) when the service rate at each node is constant, but there is zero probability that a queue is empty. Unfortunately, our queuing network is an open queuing network with blocking, so the above solutions cannot be used for its analysis. Analyzing arbitrary configurations of open exponential queuing networks with blocking is still an unresolved problem [3]. Here we employed simulation techniques to solve the problem. A queuing network was
modeled as described above, and the mean waiting times and throughputs were calculated through simulations with varying buffer size. The throughput was also calculated for different numbers of nodes in series. The results are shown in Figure 21 and Figure 22. QNAT was used for building and simulating the network. All results shown below were obtained after simulating for 10,000 arrival events. For the sake of simplicity we have assumed a uniform packet size, and the buffer size is a multiple of the packet size. The worst-case throughput was calculated and is shown in the graph below. Worst-case throughput is the throughput of the network when we are trying to communicate from one diagonal end of the chip to the other. Networks with various numbers of cores on each edge (number of stages or number of nodes) have also been simulated and their throughput and latency are reported below. In the enhanced packet-switched architecture, when a single router is busy the IP core transmits data through another available router. The simulations did not take into account the logic circuitry or the latency associated with retransmitting data through another router when the router through which the IP core wants to transmit is busy.
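To make the setup concrete, the following is a toy discrete-time version of the model (an infinite source buffer feeding finite buffers in series, blocking rule 4, one-slot constant service); it is a simplified stand-in for the QNAT experiments, with all parameter values ours:

import random

def simulate(stages=4, buf=2, p_arrival=0.6, slots=10_000, seed=1):
    """Tandem of `stages` finite buffers (capacity `buf`) fed by an
    infinite source queue. One packet is served per stage per slot; a
    stage is blocked (rule 4) when the downstream buffer is full.
    Returns the delivered throughput in packets per slot."""
    random.seed(seed)
    source = 0                # infinite buffer at the sending core
    queues = [0] * stages     # finite router buffers
    delivered = 0
    for _ in range(slots):
        if queues[-1] > 0:    # last stage drains into the infinite sink
            queues[-1] -= 1
            delivered += 1
        for i in range(stages - 2, -1, -1):
            if queues[i] > 0 and queues[i + 1] < buf:   # not blocked
                queues[i] -= 1
                queues[i + 1] += 1
        if source > 0 and queues[0] < buf:              # inject from source
            source -= 1
            queues[0] += 1
        if random.random() < p_arrival:                 # Bernoulli arrivals
            source += 1
    return delivered / slots

for n in (2, 4, 6, 8):
    print(f"{n} stages: throughput = {simulate(stages=n):.3f}")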
Figure 21: Latency vs. number of nodes/stages for both the architectures. (Latency plotted against the number of stages, 0-10, for the packet-switched and enhanced packet-switched architectures.)
Figure 22: Number of stages vs. throughput of the network for both the architectures. (Throughput plotted against the number of stages, 0-10, for the packet-switched and enhanced packet-switched architectures.)
It can be seen from Figure 21 that the latency increases as the number of stages increases. The enhanced packet-switched architecture has lower latency than the packet-switched architecture because it distributes its traffic more evenly than the traditional architecture. The latencies of the two architectures tend to become similar as the number of stages/nodes increases. Figure 22 shows that the enhanced packet-switched architecture has a higher throughput than the traditional architecture, but its throughput falls steeply beyond 6 nodes. This steeper rate of fall in network performance is expected because, as the number of nodes increases, the two networks behave increasingly alike: the router-to-core ratio in the enhanced packet-switched architecture approaches unity, while the router-to-core ratio in a typical NOC topology is always unity. Assuming that everything else remains the same in both architectures (i.e., that the network architecture does not have a significant effect on the network performance when the number of nodes is large), we can expect both networks to behave similarly in terms of network performance as the router-to-core ratio approaches unity.
Chapter Four: Conclusion
A queuing network with finite buffers and non-memoryless assumptions on
arrivals and packet lengths cannot be exactly analyzed according to the M/M/1
models associated with Jackson networks. Therefore we evaluated the performance
of the queuing network based on simulation experiments, rather than associating it
with any analytical model.
Chapters 1 and 2 focuses on the need for looking at architecture other than a bus-
based architecture in the billion transistor era to exploit the advantages of technology
scaling. It explains why a bus-based architecture would certainly fail to meet the
bandwidth requirements in the billion transistor era and exploit the advantages of
technology scaling.
Chapter 3 focuses on the need for a model other than the M/M/1/K or M/M/1
model. It models an on-chip network as a series of finite buffers (with blocking),
with intermediate arrivals and departures, and with infinite buffers at the first and
last nodes of the network. Both topologies are compared and the results presented.
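Purely as a sketch of this tandem model, the following Python fragment simulates a line of K nodes in which the first and last nodes have unbounded buffers, the intermediate nodes have finite buffers, and a server whose downstream buffer is full holds its finished packet until space opens (blocking after service). All parameters (K, buffer sizes, arrival and service rates) and the exponential inter-arrival and service distributions are illustrative assumptions, and intermediate arrivals and departures are omitted for brevity; the thesis's own experiments were run in a network simulator and did not rely on memoryless distributions.

import heapq
import random

random.seed(1)

K = 4                        # nodes in tandem (hypothetical)
BUF = [None, 2, 2, None]     # None = infinite buffer (first and last nodes)
LAM = 0.8                    # arrival rate at node 0 (hypothetical)
MU = [1.0, 1.2, 1.2, 1.0]    # service rates (hypothetical)
T_END = 10000.0

queues = [[] for _ in range(K)]  # waiting packets, stored as arrival timestamps
busy = [False] * K               # is the server at node i occupied?
blocked = [None] * K             # packet held in server i, waiting for room at i+1
events, seq = [], 0
done, total_delay = 0, 0.0

def push(t, kind, node):
    global seq
    heapq.heappush(events, (t, seq, kind, node))
    seq += 1

def start_service(t, node):
    # Serve the head-of-line packet if the server is free and not blocked.
    if busy[node] or blocked[node] is not None or not queues[node]:
        return
    busy[node] = True
    push(t + random.expovariate(MU[node]), "done", node)

def has_room(node):
    return BUF[node] is None or len(queues[node]) < BUF[node]

push(random.expovariate(LAM), "arrive", 0)
while events:
    t, _, kind, i = heapq.heappop(events)
    if t > T_END:
        break
    if kind == "arrive":                 # node 0 has an infinite buffer
        queues[0].append(t)
        start_service(t, 0)
        push(t + random.expovariate(LAM), "arrive", 0)
        continue
    # "done": the server at node i finishes its head-of-line packet.
    busy[i] = False
    pkt = queues[i].pop(0)
    if i == K - 1:                       # departs from the last (infinite) node
        done += 1
        total_delay += t - pkt
    elif has_room(i + 1):
        queues[i + 1].append(pkt)
        start_service(t, i + 1)
    else:
        blocked[i] = pkt                 # downstream full: block after service
    # This departure freed a buffer slot; let a blocked upstream server push in.
    if i > 0 and blocked[i - 1] is not None and has_room(i):
        queues[i].append(blocked[i - 1])
        blocked[i - 1] = None
        start_service(t, i - 1)
    start_service(t, i)

print(f"delivered {done} packets, mean end-to-end delay {total_delay / done:.3f}")

Replacing random.expovariate with any other sampler changes the arrival and service processes without touching the blocking logic, which is what makes simulation attractive where the M/M/1/K analysis does not apply.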
To exploit all the advantages of technology scaling in the multibillion-transistor
era, we believe that a single network architecture would not be a solution. A
combination of architectures, one at each level, would be the optimal solution, and
the choice of architecture(s) will depend heavily on the particular application of the
Network on Chip.
Bibliography
[21] Anant Agarwal, Limits on Interconnection Network Performance, IEEE Transactions on Parallel and Distributed Systems, Vol. 2, Issue 4, pp. 398-412, 1991

[18] Tayfur Altiok, Approximate Analysis of Queues in Series with Phase-Type Service Times and Blocking, Operations Research, Vol. 37, No. 4, July-August 1989

[10] Simonetta Balsamo, Vittoria de Nitto Persone, Raif Onvural, Analysis of Queueing Networks with Blocking, Kluwer Academic Publishers, 2001

[26] L. Benini and G. De Micheli, Networks on Chips: A New SoC Paradigm, IEEE Computer, Vol. 35, Issue 1, pp. 70-78, January 2002

[28] Vinayak Prabhakar Bhandary, A Hexagonal Lattice Routing Scheme for On-Chip Networks, Directed Research, University of Southern California, Los Angeles

[25] Christophe Bobda, Mateusz Majer, Dirk Koch, Ali Ahmadinia, and Jürgen Teich, A Dynamic NoC Approach for Communication in Reconfigurable Devices, Department of Computer Science, University of Erlangen-Nuremberg, Am Weichselgarten 3, D-91058 Erlangen, Germany

[4] Paul Caseau and Guy Pujolle, Throughput Capacity of a Sequence of Queues with Blocking Due to Finite Waiting Room, IEEE Transactions on Software Engineering, Vol. SE-5, No. 6, November 1979

[1] William J. Dally and Brian Towles, Route Packets, Not Wires: On-Chip Interconnection Networks, Annual ACM/IEEE Design Automation Conference, pp. 684-689, 2001, ISBN 1-58113-297-2

[16] Pierre Guerrier and Alain Greiner, A Generic Architecture for On-Chip Packet-Switched Interconnections, Design, Automation and Test in Europe, pp. 250-256, 2000, ISBN 1-58113-244-1
[3] F. S. Hillier and G. J. Lieberman, Introduction to Operations Research, McGraw-Hill Publishing Company, 1990

[14] Ron Ho, Kenneth W. Mai and Mark A. Horowitz, The Future of Wires, Proceedings of the IEEE, pp. 490-504, April 2001

[6] Jingcao Hu and Radu Marculescu, Application-Specific Buffer Space Allocation for Networks-on-Chip Router Design, Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, San Jose, CA, November 2004

[22] Jingcao Hu and Radu Marculescu, Smart Routing for Networks-on-Chip, Technical Report, http://www.ece.cmu.edu/~sld/pubs/papers/dyad_dac04.pdf

[12] International Technology Roadmap for Semiconductors, ITRS 1999, ITRS 2003

[7] T. Lin and L. Kleinrock, Performance Analysis of Finite-Buffered Multistage Interconnection Networks with a General Traffic Pattern, IEEE Transactions on Computers, Vol. 43, Issue 2, pp. 153-162, February 1994, ISSN 0018-9340

[11] G. M. Link, J. Kim, N. Vijaykrishnan, Chita R. Das, Network-on-Chip (NoC) Architectures: A Resource-Constrained Perspective

[20] Youngsong Mun and Hee Yong Youn, Performance Analysis of Finite Buffered Multistage Interconnection Networks, Proceedings of the Ninth ACM Symposium on Data Communications, pp. 124-133, 1985, ISSN 0146-4833

[8] Harry G. Perros, Queueing Networks with Blocking, Oxford University Press, 1994

[5] H. G. Perros and Tayfur Altiok, Approximate Analysis of Networks of Queues with Blocking: Tandem Configurations, IEEE Transactions on Software Engineering, Vol. SE-12, No. 3, March 1986
[13] Timothy Mark Pinkston, Jeonghee Shin, Trends Towards On-Chip Networked Microsystems, SMART Interconnects Group, Technical Report, http://ceng.usc.edu/smart/publications/archives/CENG-2004-17.pdf

[15] QNAT, A Graphical Queuing Network Analysis Tool, http://poisson.ecse.rpi.edu/~hema/qnat/

[23] Dhananjay Raghavan, Extending the Design Space for Networks on Chip, Master's Thesis, University of Southern California, Los Angeles

[2] Gordon L. Stuber, Principles of Mobile Communication, Kluwer Academic Publishers, 2001

[29] Yi-Ran Sun, Shashi Kumar, Axel Jantsch, Simulation and Evaluation for a Network on Chip Architecture Using Ns-2, Proceedings of the 20th NORCHIP Conference, Copenhagen, November 2002

[31] The Network Simulator ns-2, http://www.isi.edu/nsnam/ns/

[19] John Uffenbeck, Microcomputers and Microprocessors: The 8080, 8085, and Z-80 Programming, Interfacing, and Troubleshooting, Prentice Hall, 2000

[27] G. V. Varatkar and R. Marculescu, On-Chip Traffic Modeling and Synthesis for MPEG-2 Video Applications, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 12, Issue 1, pp. 108-119, January 2004, ISSN 1063-8210

[9] Jean Walrand, An Introduction to Queueing Networks, Prentice Hall, 1988

[30] Daniel Wiklund, Sumant Sathe and Dake Liu, Benchmarking of On-Chip Interconnection Networks, Proceedings of the 4th IEEE International Workshop on System-on-Chip for Real-Time Applications (IWSOC'04)