Can Networked Systems Benefit from Tomorrow’s Fast, but Unreliable, Memories?
Bin Liu, Da Cheng‡, Hsunwei Hsiung‡, Ramesh Govindan, Sandeep Gupta‡
University of Southern California, Los Angeles, CA, USA
Abstract
As we move closer to the end of the Moore’s law
era, future networked systems are likely to be more con-
strained by hardware capabilities than they have been
in the past. To understand how networked systems will
be affected by future developments in hardware, we use
roadmaps developed by the semiconductor industry to
explore the impact of future hardware constraints on net-
worked systems. Motivated by the projected unreliabil-
ity of SRAMs and Flash memory, we study the impact of
this trend on two qualitatively different networking sub-
systems whose functional properties depend upon this
hardware: end-to-end reliable communication and Flash-
based key-value storage. While these subsystems have
built-in protections against data loss, we find that these
protections are insufficient to guard against the projected
increases in memory unreliability in the future. Using
a systematic evaluation of mitigation strategies, we find
that it is possible to make these subsystems robust to
memory unreliability, but different, somewhat counter-
intuitive, strategies are necessary for each of the systems.
1 Introduction
Networked systems have benefited from unprecedented
growth in hardware capabilities. High speed switch-
ing fabrics, data centers, networked sensing, and wire-
less and mobile computing (to name a few) would not
have been possible without improvements in the design
of underlying circuits and devices, resulting in improved
speed, reliability, power, etc. Because the evolution of
networked systems is strongly determined by hardware
advances, it is instructive to examine how future im-
provements in hardware will affect networked systems.
The hardware community invests heavily in the devel-
opment of roadmaps, of which the ITRS roadmap for
chips [9] is the best known. These roadmaps project fu-
ture directions for chips and systems in terms of a wide
range of important metrics, particularly computational
and storage capacities, performance, power, cost, reli-
ability (lifetime), and resilience (to internal and exter-
nal noise). Recent versions of semiconductor roadmaps
show that technology improvements are generally slow-
ing down, and these slowdowns are likely to persist be-
yond the end of the Moore’s law era. Improvements are
continuing in some dimensions (e.g., cost-per-transistor),
[Figure 1: SRAM Throughput. Throughput (packets/sec) vs. technology node (45, 32, 22, 16, 12 nm (~2020)) for the Do Nothing, Idealized, and Hardware responses.]
but slowing down in others (e.g., speed), and recessing in
others (e.g., power, reliability, and resilience).
In particular, recent ITRS roadmaps and other evi-
dence suggest that the bit error rates in static RAMs
(SRAMs) and Flash memory, two technologies often
used in networked systems, will increase by several or-
ders of magnitude within the next 10 years (Section 2).
With increasing technology density (from 45/32 nm to-
day to 12-13 nm in around 2020), variability in manufac-
turing processes and other factors make it more difficult
to reliably read from or write to these memories.
Because networked systems have built-in mechanisms
for handling failures, such memory failures manifest
themselves as (sometimes graceful) performance degra-
dation in networked systems. For example, in a reli-
able transport protocol, a memory failure can inflate flow
completion time because memory failures cause dropped
packets which trigger loss recovery. In a distributed
Flash-based key-value store [11], a memory failure can
manifest itself as a service-level agreement (SLA) viola-
tion (because it can increase the time to complete a put or
get operation), or as data unavailability (similar to, say,
unavailability due to replica failure).
There can be three possible responses to the increased
unreliability of memory technologies, and Figure 1 de-
picts the consequences of choosing each one of these re-
sponses, for one of the technologies (SRAM). The first
is to use today’s technology, the consequence of which
is a stagnant SRAM throughput (the Do Nothing curve);
by contrast, future generations of SRAMs will be signifi-
cantly faster (the Idealized SRAM throughputs were ob-
tained from SRAM speed improvement projections in the
ITRS roadmap while ignoring any projected increase in
unreliability), and this choice doesn’t benefit from those
speed improvements. The second is to develop complete
error masking in hardware. This choice, labeled Hard-
ware, was generated by designing, in Verilog, a pipelined
low-delay encoder/decoder for the binary BCH class of
codes, selecting the level of error correction needed to
completely mask SRAM errors for each technology den-
sity, then calculating the overhead of encoding and de-
coding for each technology year and the resulting SRAM
throughput. Hardware is pessimal, and is unable to lever-
age the projected increases in SRAM throughput, result-
ing in a large gap between Idealized and Hardware.
The third response, motivated by this gap, is to ex-
plore a regime where hardware manufacturers sell less
than perfect memories, and networking designers devise
efficient reliability mechanisms using a combination of
techniques in order to achieve performance closer to Ide-
alized than Hardware does. As counterintuitive as that
may seem, selling unreliable hardware is not without
precedent — wireless chipsets do not guarantee 100%
reliable transmission, and rely on additional software to
mask packet losses.
The key research question we address in this paper is:
What combinations of memory error masking techniques
are sufficient to ensure acceptable performance for net-
worked systems? Given the early stage of this research,
we choose to explore this research question in the con-
text of two networked sub-systems: a reliable transport
protocol (specifically TCP) that uses SRAM for packet
memory, and a distributed Flash-based key-value store
that relies both on SRAMs for storing packets at inter-
mediate hops, and Flash memory for storing key-value
pairs on storage servers. Our methodology is as fol-
lows. For each subsystem, we first study (using a simula-
tor carefully validated through analytical modeling) the
projected impact of increasing memory errors on these
two subsystems, then search the space of combinations
of memory error masking techniques (applied at differ-
ent granularities: in memory hardware, at a single node,
within a group of nodes, and end-to-end) to characterize
the combinations that give acceptable performance.
We find that TCP’s flow completion times increase by
more than an order of magnitude beyond an inflection
point in the SRAM roadmap in 2015 (Section 3). This in-
flection point is caused by significantly high packet loss
rates that result in frequent timeout-based loss recovery.
We find that TCP cannot be made robust to the projected
rate of memory failures using purely end-to-end mitiga-
tion, but modest increases in hardware redundancy, and
adding per-hop redundancy can be successful in counter-
ing the effects of memory unreliability. Furthermore, the
performance cost of these modest redundancy additions
is small (Section 5): one of our mitigation approaches is
able to closely track Idealized while having enough re-
dundancy to permit acceptable TCP performance.
For the Flash-based key-value store, its availability
(when retrieval latency is taken into account) also drops
off significantly after 2015, and, in some cases, these
degradations can be attributed in part to SRAM failures,
and in part to Flash failures (Section 4). For this sys-
tem, end-to-end mitigation approaches give acceptable
performance but the amount of redundancy needed can
vary with workload and the age of the Flash.
While preliminary, these results suggest a qualified
“yes” to the question we pose in the title of this paper.
They suggest that (a) should hardware designers be un-
able to find cost-effective memory error masking tech-
niques, there exists a combination of modest improve-
ments to hardware combined with software error mask-
ing techniques that will give acceptable performance
while allowing them to leverage the increased idealized
performance of memories, (b) but that the specific com-
bination may depend upon the networked system and the
memory technologies under consideration.
2 Understanding Memory Unreliability
In the semiconductor industry, developing projections for
future trends in hardware has become institutionalized
because these projections have proved to have predic-
tive power. The International Technology Roadmap for
Semiconductors (ITRS) [9], sponsored by the largest in-
dustry organizations in five leading chip manufacturing
regions in the world, puts out biennial roadmaps [31].
Until recently, these roadmaps had projected exponen-
tial growth for many performance metrics as a function
of increasing technology density. Recent versions of
these roadmaps show that technology improvements are
generally slowing down. In particular, CMOS semicon-
ductor memories (like SRAM, DRAM and Flash [12])
which constitute about 20% of the semiconductor mar-
ket, are projected to become increasingly unreliable.
2.1 Static RAM (SRAM) Roadmap
Static RAMs use a memory cell that is able to read and
write, and retains its value using internal feedback, as
long as power is applied. A typical SRAM cell contains
a pair of weak cross-coupled inverters holding the state,
and a pair of access transistors to read or write the state.
SRAMs are used in applications that require fast access
times, from caches to register files to tables to scratch-
pad buffers [47]. For the same reason, they are central to
networking, since they are used to buffer packets or parts
thereof, or to store control information such as forward-
ing tables and rules.
The ITRS roadmap for SRAMs is extensive, so we
focus on one part: consumer-relevant parameters for S-
RAMs. For SRAMs for both cost-performance (CP)
microprocessors (for desktops), and high-performance
(HP) microprocessors (for servers), the roadmap projects
that the amount of SRAM available on a constant die
area (140 mm² for CP, 160 mm² for HP) will double with
each successive technology generation. One section of
the ITRS roadmap [7] continues to project a doubling of
SRAM capacity as a function of technology density (Ta-
ble 2) and improved access times, but this comes at a cost
of significantly higher power¹ requirements.
Most relevant for this paper are the projected increases
in bit error rate (BER) of SRAM with increasing tech-
nology density (Table 1) from another section of the
ITRS Roadmap [5]. In general, failures in memories
can be classified as soft failures or hard failures [36].
Soft failures refer to bit-flips during a memory’s oper-
ation, caused by alpha particles or cosmic rays [39] and
are negligible². Hard failures occur in SRAM due to the
mismatch in the strength between neighboring transistors
[36]. Increases in technology density can cause signifi-
cant variations in fabrication processes, which can also
increase the mismatch between fabricated transistors and
ultimately result in hard failures. Until recently, the ITRS
roadmap did not incorporate any projections for SRAM
failure rates, since SRAMs were assumed to be gener-
ally reliable; in fact, ITRS 2009 was the first ITRS edition to include SRAM reliability projections³. Table 1 shows
two roadmap projections (ITRS roadmaps are updated
roughly once every two years). The 2009 projections in-
dicate 7 orders of magnitude worsening of the read/write
BER; the 2011 projections are even more pessimistic⁴, suggesting 20% BER by 2020.
There are generally two ways to counter these trends
in hardware. Table 2 assumes existing circuit technology
and standard testing approaches [7], but circuit and archi-
tecture innovations can improve reliability, or new test-
ing approaches can eliminate a large proportion of error
prone SRAMs. Of course, the latter will happen at the
expense of lower yields and higher testing costs.
A second approach is to include increased correction
capabilities in memories. In fact, every SRAM cur-
rently on the market includes an error correction code,
the Hamming code (also known as SEC-DED code for
¹ The total power is estimated using SRAM size, access time (read/write time), and the per-cell static and dynamic power.
² The probability of a soft error for a single bit stored in SRAM/DRAM for 20 ms is about 10⁻¹¹ (derived from [6]).
³ Compared with SRAMs, DRAM reliability is much higher and does not appreciably degrade with technology density, according to both the ITRS roadmap and prior research [38, 40]. For example, results from a real experiment [38] show that on average only 0.22% of DRAM DIMMs (dual in-line memory modules) see an uncorrectable error per year, a number much smaller than the projected SRAM unreliabilities we discuss in this section. Thus, in this paper we ignore DRAM errors.
⁴ In fact, these numbers are so pessimistic that we use the 2009 SRAM roadmap in our evaluations, and are currently checking the validity of the 2011 projections with some memory manufacturers. The 2013 projections are not due to be released until the ITRS 2013 Winter Conference on Dec 6th, 2013.
Table 1: SRAM bit error rates (BER) [5]
ITRS edition   Year             2010    2012    2015    2018    2020
               Technology (nm)  45      32      22      16      12
2009           Read/Write BER   4E-11   1E-07   5E-06   1E-04   2E-04
2011           Read BER         3E-07   2E-04   1E-02   6E-02   2E-01
               Write BER        3E-03   1E-02   4E-02   1E-01   2E-01
its single error correction, double error detection capabil-
ity). This code protects every double word (32 bits) with
6 parity bits; the memory hardware stores both the data
and the parity bits. The parity bits are generated upon a
memory write, and checked during a read. While ECC
improves the reliability, it can slow down access times
by factors of two or more [44]. Recent research [49] has
proposed using stronger codes (such as Reed-Solomon),
but to our knowledge, no commercial SRAMs use this,
likely because of chip area and access latency overheads.
Table 2: SRAM Roadmap for future technologies [7]
Year                     2010    2013    2016    2019    2022
Technology (nm)          45      35      25      18      13
Size (MB)                8       16      32      64      128
Vdd (V)                  1       0.9/1   0.7     0.7     0.7
Static power (mW)        5E-04   1E-03   2E-03   3E-03   5E-03
Dynamic power (mW/MHz)   5E-07   4E-07   4E-07   3E-07   2E-07
Write/Read time (ns)     1.2     0.8     0.5     0.3     0.3
Total power (W)          58.67   200     716.8   2048    5803
Table 2 shows projections for reductions in memory
delays as well as improvements in several other param-
eters. However, to take advantages of these, it is nec-
essary to tackle the corresponding projected increase in
error rates shown in Table 1. In the next section, we dis-
cuss whether and to what extent ECC and other software
reliability mechanisms are able to counter the projected
increases in memory unreliability.
2.2 Flash Memory Roadmap
The second memory technology whose roadmap we dis-
cuss is Flash. Two Flash memory architectures are preva-
lent today: the common ground NOR Flash, used for
code and data storage, and the NAND Flash, optimized
for data storage. A Flash memory cell is basically a
a transistor with a floating gate that is completely sur-
rounded by dielectrics, where the floating gate acts as the
storing electrode for the cell device [12]. Recent Flash
products use multi-level cells (MLC), where two or more
bits are stored in the same cell, which increases the Flash
storage density.
Flash is now used in a variety of applications ranging
from portable consumer electronics, smartphones and
tablets, to enterprise storage as well as storage for note-
books and desktop PCs. Recent research has proposed
using Flash in a distributed key-value store [11].
Recent editions of the ITRS roadmap also cover Flash
memory extensively. Table 3 shows the Flash memory
roadmap for some of the parameters of interest to this
Table 3: NAND (2D) Flash Roadmap from [6][8]
Year 2011 2013 2015 2017 2019
Technology (nm) 22 18 15 13 11
Capacity (GB) 8 16 32 64 128
Bits per cell 3 3 3 3 3
Data retention (years) 10 10 10 10 10
Endurance (1,000 P/E cycles) 10 10 10 5 5
paper. Memory capacity is projected to double with ev-
ery technology generation [6], and the number of bits
per cell is projected to stay constant at 3. Most impor-
tant from our perspective is that the endurance of Flash
memory cells will halve in the 2017 timeframe.
Flash unreliability occurs for at least two reasons.
First, a Flash read operation requires comparison with
reference voltages to determine the stored value in the
cell; as technology density increases, this comparison re-
quires accurate sensing, adversely impacting chip yield.
More relevant from our perspective is that each write op-
eration to a Flash, called a Program/Erase (P/E) cycle, is
known to cause a fairly uniform wear-out of the cell per-
formance, which eventually limits its endurance [14, 23].
This aging of a Flash cell can be countered by wear-
leveling techniques [14]: many Flash file systems and
firmware spread out writes to different locations in a flash
to increase the time before any location becomes unreli-
able. However, such wear-leveling just affects the time to
first failure, but does not fundamentally alter the raw bit
error rate since it does not alter the physics of cell aging.
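As a toy illustration of this point (it is not any particular flash translation layer's algorithm), wear-leveling only spreads P/E cycles evenly across blocks; the per-cell RBER-vs-age curve is untouched:

# Toy wear-leveling sketch (illustrative only): logical writes are rotated
# across physical blocks so no single block ages much faster than the rest,
# delaying the first failure but leaving the per-cell error physics intact.
class RoundRobinWearLeveler:
    def __init__(self, n_blocks):
        self.pe_cycles = [0] * n_blocks   # P/E count per physical block
        self.next_block = 0

    def write(self, data):
        block = self.next_block
        self.next_block = (self.next_block + 1) % len(self.pe_cycles)
        self.pe_cycles[block] += 1        # this block ages by one P/E cycle
        return block                      # caller records where the data landed

leveler = RoundRobinWearLeveler(n_blocks=4)
for _ in range(10):
    leveler.write(b"value")
print(leveler.pe_cycles)                  # [3, 3, 2, 2] -- wear is spread evenly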
A recent study has quantified the impact of age (in
terms of number of P/E cycles on a Flash memory sector)
on reliability [23]. This study measured the raw bit error
rate (RBER) for 11 commercial products from 5 man-
ufacturers. We used the upper bound of the measured
RBER at each age to plot the RBER for technology be-
fore 2017. Also, we have used the ITRS endurance pro-
jections (halved endurance after 2017) to plot the RBER
for technologies after 2017 (Figure 2(b)). In a later sec-
tion, we use these estimates to study the impact of Flash
unreliability on networked systems.
[Figure 2: Memory Roadmaps. (a) SRAM Roadmap: read/write bit error rate vs. year, 2010-2020. (b) Flash Roadmap: raw bit error rate (log scale) vs. number of P/E cycles, for technologies before and after 2017.]
To reduce aging errors, ECC is widely used in Flash
memories and, in particular, in MLC architectures. In
early NAND Flash devices, manufacturers recommended
the use of Hamming code. These devices had 512-byte
sectors, so the correction was typically applied to the en-
tire Flash sector. Hamming codes are relatively easy to
construct and in Flash are often implemented in software,
since the overhead to generate the parity and corrections
is fairly low. As error rates increased, the likelihood
of multiple-bit errors within a single sector increased,
so Hamming codes were used over smaller blocks (i.e.,
at a finer granularity), or designers turned to stronger
error correction codes, e.g., Reed-Solomon (RS) codes
or binary Bose-Chaudhuri-Hocquenghem (binary BCH) codes, to ensure data integrity. As NAND Flash
error rates continued to grow, the ECC capability (the
number of errors that can be corrected) has increased and
the correction block size has decreased in order to main-
tain, or further improve, memory reliability [3].
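The benefit of shrinking the correction block can be illustrated with a small calculation (this is our own illustration, not a result from [3]): with the same per-block correction capability, independent codes over smaller blocks tolerate a higher total number of errors per sector, at the cost of more total parity.

# Probability that a 512-byte sector is uncorrectable: one t-error code over the
# whole sector versus an independent t-error code per 128-byte block.
from math import comb

def p_uncorrectable(n_bits, t, ber):
    """P(more than t of n_bits are in error); the sum is truncated where terms vanish."""
    return sum(comb(n_bits, i) * ber**i * (1 - ber)**(n_bits - i)
               for i in range(t + 1, min(n_bits, t + 50) + 1))

ber, t = 1e-4, 4
whole = p_uncorrectable(512 * 8, t, ber)                      # one code per sector
split = 1 - (1 - p_uncorrectable(128 * 8, t, ber)) ** 4       # one code per 128-byte block
print(whole, split)   # the finer-grained scheme fails far less often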
3 Reliable Transport
Memory is used in routers and switches in a data network
to store and forward packets. When memories can induce
errors in packets, this can result in dropped packets or
undetected end-to-end errors, affecting the performance
of reliable transport protocols.
While not much is publicly known about the detailed
design of commercial routers and switches, many aca-
demic studies have posited a hybrid SRAM/DRAM ar-
chitecture (e.g., [28, 21, 22]). In discussions with col-
leagues, we understand that SRAM is likely used to
buffer packets (or parts thereof) in high-speed shallow-
buffered switching devices (e.g., core or aggregation
switches in data centers), or to buffer the parts of a packet
that need to be processed in some way (e.g., for deep
packet inspection), and DRAM is used in all other set-
tings. The ITRS roadmap predicts that SRAM read/write
error rates will increase dramatically with technology
density, as shown in Figure 2(a) (this figure plots the
2009 SRAM roadmap from Table 1).
Why study reliable transport? Most transport proto-
cols have, over the years, developed sophisticated mech-
anisms to be robust to data loss: header checksums,
end-to-end CRC, and retransmission-based loss recov-
ery. Our interest in this section is to understand how
much a reliable transport protocol, like TCP, which was
not explicitly designed to be robust to switch and router
memory errors, is affected by these errors.
3.1 Impact of Memory Errors on TCP
Methodology and Assumptions. To understand the im-
pact of memory errors on TCP, we use as our baseline
a version of TCP with SACK and ECN; this version,
implemented in ns-2, mirrors the congestion control al-
gorithms found in the Linux 2.6.22 kernel [46]. We
also consider an intentionally simplified setting: a chain
topology with n hops, a sender at one end and a receiver
on the other. We assume the links themselves to be per-
fect, and that the network is uncongested; thus, any per-
formance degradation is entirely due to memory errors.
Finally, we also assume that a packet is written to and
read from SRAM at each hop; with increasing demand
for faster switches, it is likely that this assumption will
become true in the future, but at the very least, it is pes-
simistic, which is appropriate when we consider mitiga-
tion strategies.
Model. As the first step in understanding the impact of
memory errors on TCP, we develop an analytical model
to derive end-to-end packet loss rate as a function of
memory bit failure rate. Translating memory errors into
end-to-end packet loss rates is nontrivial because memo-
ries have in-built error-correction codes (ECCs), and in a
multi-hop topology, packets can be dropped at any switch
because of memory errors, or an erroneous packet can
reach the end undetected (because memory ECCs are un-
able to detect errors beyond certain levels of error).
Our model incorporates memory ECC and performs two
functions: it helps to understand the frequency of patho-
logical situations like end-to-end undetected errors, and
it can validate our simulator (described below). Our sim-
ulator also incorporates other forms of checksums⁵.
In developing the model, we use the following no-
tation. e(Y) represents the BER of an SRAM bit-cell
read/write in the technology prevalent in year Y (Fig-
ure 2). $P_L$ is the length of the packet in bits.
For each bit of $P_L$, the probability that this bit is read and written correctly at a given hop (we assume a packet is written to SRAM and read from SRAM exactly once) is $p_{bit} = [1 - e(Y)]^2$. Modern SRAMs store data in units of b-bit flits, where each flit is protected using an error correcting code (ECC) with $a$ additional parity bits. SRAM hardware checks the ECC bits, and signals a read failure if the memory failures exceed the error correction capability of the ECC. Now, the probability of i incorrect bits after a flit has been written and read is:

$p_{FLIT,i} = C^{i}_{b+a}\,(1 - p_{bit})^{i}\, p_{bit}^{\,b+a-i}$
SRAMs use a Hamming code for ECC, with the ability
to correct a 1 bit error, and detect 2 bit errors. When more
than 2 bits are in error, the Hamming code’s behavior is
a little more complicated: 3 to 5 bit errors are assumed to
effectively be 1 bit errors and are corrected, resulting in
the flit having 4 (undetected) erroneous bits, but 6 erro-
neous bits cause the read to fail. In order to simplify our
model, we consider flits with no more than 3 bit errors;
the hardware roadmaps make higher bit errors for current
flit sizes (where each flit is 32 bits) exceedingly unlikely.
Under this assumption, since the entire packet is stored in $\lceil L/b \rceil$ flits, the probability for the whole packet leaving a hop without any error (detected or otherwise) is that of each storage flit having either no error bit or one error bit (since the latter can be corrected), and is thus given by:

$p_{SUCC} = \sum_{i=0}^{\lceil L/b \rceil} C^{i}_{\lceil L/b \rceil}\; p_{FLIT,0}^{\,i}\; p_{FLIT,1}^{\,\lceil L/b \rceil - i}$

⁵ Our evaluation ignores MAC layer frame checks since these are applied after a packet has been read from memory at one hop, and checked before the packet is written to memory at the next hop, and we have assumed perfect links.
The likelihood that a packet is dropped at a hop is the
probability that at least one flit in a packet contains 2 bits
in error:
$p_{DROP} = 1 - (1 - p_{FLIT,2})^{\lceil L/b \rceil}$
The final quantity we are interested in is the probability
of a packet being transmitted to the next hop correctly, or
with an undetected error.
Accounting for undetected errors is a little subtle.
These occur when a flit incurs 3 erroneous bits, which the
ECC incorrectly assumes as being a 1 bit correctable er-
ror, and thereby introduces 4 bits of error. Furthermore,
a single packet may contain multiple flits in error, and
since we are interested in determining whether these er-
rors can be propagated end-to-end, we calculate $p_{ERROR,i}$ as the probability that a packet is successfully transmitted to the next hop but has exactly i flits with an undetected error:

$p_{ERROR,i} = \sum_{j=0}^{\lceil L/b \rceil - i} \frac{(i+j)!}{i!\,j!}\; C^{i+j}_{\lceil L/b \rceil}\; p_{FLIT,3}^{\,i}\; p_{FLIT,1}^{\,j}\; p_{FLIT,0}^{\,\lceil L/b \rceil - (i+j)}$
Our evaluation of the hardware roadmap indicates that
scenarios where i> 1 occur rarely, so in the subsequent
steps of the modeling, we restrict i= 1. In this setting,
a single flit in the packet contains 3 error bits, so the re-
stored flit will contain 4 error bits after ECC correction.
However, some of these 4 erroneous bits may be in the
a-bit parity field. If all 4 of these erroneous bits are in the
parity field, the packet does not contain an undetected er-
ror. Let $p_{ERROR,1,i}$ with $i = 0, \ldots, 4$ denote the probability that exactly i error bits are in the data field (i.e., $4-i$ are in the parity bits):

$p_{ERROR,1,i} = p_{ERROR,1}\; C^{i}_{4} \left(\frac{b}{a+b}\right)^{i} \left(\frac{a}{a+b}\right)^{4-i}$
Then $p_{GOOD}$, the probability that a packet is transferred from one node to the next without any error, is:

$p_{GOOD} = p_{SUCC} + p_{ERROR,1,0}$

Finally, $p_{GOOD}^{\,n}$ is the probability that a packet reaches the receiving application without error, where n is the number of hops, assuming bit error independence at each hop.
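To make the model concrete, the following is a minimal Python sketch of these expressions (illustrative, not the simulator implementation), assuming 32-bit flits with 6 parity bits and the at-most-3-errors-per-flit simplification above:

# Per-hop and end-to-end packet error model from this section.
from math import comb, ceil

def flit_prob(i, e, b=32, a=6):
    """Probability that exactly i of the b+a stored bits are wrong after one
    SRAM write plus one read, each with bit error rate e."""
    p_bit = (1.0 - e) ** 2              # a bit survives both the write and the read
    n = b + a
    return comb(n, i) * (1.0 - p_bit) ** i * p_bit ** (n - i)

def per_hop_probs(e, L_bits, b=32, a=6):
    """Return (p_SUCC, p_ERROR_1_0) for one store-and-forward hop."""
    f = ceil(L_bits / b)                # number of flits in the packet
    p0, p1, p3 = (flit_prob(i, e, b, a) for i in (0, 1, 3))
    # p_SUCC: every flit has 0 or 1 errors (1-bit errors are corrected by ECC)
    p_succ = sum(comb(f, i) * p0 ** i * p1 ** (f - i) for i in range(f + 1))
    # p_ERROR,1: exactly one flit has 3 errors (miscorrected), the rest have 0 or 1
    p_error_1 = sum((1 + j) * comb(f, 1 + j) * p3 * p1 ** j * p0 ** (f - 1 - j)
                    for j in range(f))
    # ...,0: all 4 post-"correction" error bits land in the a parity bits
    p_error_1_0 = p_error_1 * (a / (a + b)) ** 4
    return p_succ, p_error_1_0

def end_to_end_good(e, L_bits, hops):
    """p_GOOD^n: the packet reaches the receiving application with no error."""
    p_succ, p_err10 = per_hop_probs(e, L_bits)
    return (p_succ + p_err10) ** hops

# e.g., the 2020 SRAM BER from Table 1 (2009 edition), a 1550-byte packet, 5 hops
print(end_to_end_good(2e-4, 1550 * 8, 5))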
Simulation. In order to simulate memory failure, we
modified the queue buffer component of ns-2.35 to inject
bit errors with a probability e(Y), and to perform ECC
decoding/encoding. We also integrated these modifica-
tions into the simulator’s TCP-Linux component. We im-
plemented the IP header checksum and the TCP packet
checksum; the checksums can reliably detect up to 15
error bits in the IP header or the TCP packet [42].
In our simulations, we simulate a 15 MB file transfer
across chain topologies with varying lengths. For each
packet, the IP header, TCP header and data field (if any)
are set to be of size 20 bytes, 30 bytes, and 1500 bytes,
respectively. We also use identical network links with
100 Mbps bandwidth; there are no competing flows, so
most packet losses are due to memory errors. We also
ignore SRAM read/write delay, since this is negligible
relative to network bandwidth. All simulation results are
averaged over 20 runs. Our simulation trace contains
enough information to calculate the probability of end-
to-end error-free packet delivery.
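For illustration, the error-injection step can be sketched as follows (a minimal stand-in for the queue-buffer modification described above, not the actual ns-2.35 patch):

# Flip each bit of a packet buffer independently with probability ber; this is
# done once for the SRAM write and once for the read at every hop.
import random

def inject_bit_errors(buf, ber, rng=random.Random(0)):
    for idx in range(len(buf)):
        for bit in range(8):
            if rng.random() < ber:
                buf[idx] ^= (1 << bit)
    return buf

packet = bytearray(20 + 30 + 1500)   # IP header + TCP header + data, as above
inject_bit_errors(packet, ber=2e-4)  # write into SRAM at a hop
inject_bit_errors(packet, ber=2e-4)  # read back out of SRAM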
Validation and Results. Figure 3(a) compares simula-
tion results with the model for 5, 10, 15 and 20 hops.
These results match closely; the maximum difference be-
tween the model and simulation is 0.0072, while the min-
imum and the mean are 0 and 0.00067. This provides an
internal consistency check for our simulator.
The results also show pessimistic packet delivery per-
formance; by 2020, only 80% of the packets can be deliv-
ered across a five-hop network, even with existing mem-
ory error correction capabilities. For larger diameter net-
works, the delivery ratio decreases to 40%. From our
model, we can also evaluate the likelihood of undetected
end-to-end bit errors. We find that, in 2020, the like-
lihood of packets with undetected bit errors in a 5 hop
network is 1 in 300,000, and for a 20-hop network it is
1 in 150,000, a much higher value than the 1 in 16 mil-
lion to 10 billion packets observed in earlier work [43].
Given the prevalence of large flows in the Internet, this
suggests that the TCP checksum should be enhanced.
But is TCP with SACK and ECN robust to SRAM er-
rors? Figure 3(c), which plots the completion time of
a TCP flow as a function of technology density, shows
that it most certainly is not. As the figure shows, TCP
incurs an inflection in completion time at about 2015:
even in the absence of congestion, and in the presence of
a full suite of TCP mechanisms like Fast Recovery, Fast
Retransmit, SACK, and ECN, TCP completion times in-
crease by several orders of magnitude (note the log-scale
on the y-axis) after 2015.
Why does this happen? As Figure 3(b) shows, the
number of TCP timeouts increases dramatically after
2015, faster than the number of fast retransmission
events. When timeouts happen, TCP has no option but to
reduce the congestion window to re-establish ack clock-
ing. This also suggests that methods that attempt to dis-
tinguish wireless losses from congestion losses, as pro-
posed for wireless access networks, may not be effective
for memory errors; these help prevent congestion win-
dow reductions when triple dup-ACKs occur. We quan-
tify this intuition below.
Finally, while the severity of the completion time in-
flation is a function of topology diameter, the inflection
happens at the same time for all diameters.
3.2 Making TCP Robust to Memory Errors
In TCP, memory errors manifest themselves as dropped
packets or as delivered packets with undetectable errors.
Strategies that increase the power of TCP to detect and
correct errors can mitigate these manifestations. Be-
fore we describe these strategies, we must define what
it means to mitigate memory unreliability: we say that a
strategy (or a combination of strategies) mitigates mem-
ory unreliability for reliable transport if it masks the ef-
fects of unreliability. For TCP, we have seen that the ef-
fect of memory unreliability is to inflate flow completion
time. So, we say a mitigation strategy is successful if its
completion time inflation (defined as the ratio of comple-
tion time with mitigation over completion time without
memory errors) is less than 1.05.
What can we learn from wireless? There is a large
literature on enhancements to TCP to deal with wireless
losses. One line of work (e.g. [20, 34]) attempts to distin-
guish congestion losses from non-congestion losses, and
to modify TCP congestion control behavior in response
to non-congestion losses. To determine if these propos-
als can help TCP to become robust to memory errors, we
evaluated two such proposals, TCP-Veno [20] and TCP-
Westwood [34]. As Figure 4(a) shows, these techniques
are generally better than our baseline, but still show an
inflection in the completion time curve after 2015. This
is because packet loss rates due to memory errors in later
technology years result in a significant number of time-
outs, after which TCP is forced to drop its congestion
window due to loss of ack clocking.
A second line of work attempts to recover losses lo-
cally (either at the link layer or using Snoop-like [1] tech-
niques), either by completely retransmitting packets, or
using partial packet recovery techniques [29, 25, 48, 50].
Unfortunately, none of these techniques apply in our set-
ting, since they all assume that one end of the wireless
channel has an uncorrupted version of the packet. In our
setting, this is not true: a memory error at the sending-
side switch may cause that switch’s copy of the packet to
have an (undetected) error.
Mitigation Strategies. Our mitigation strategies are mo-
tivated by the third line of work in wireless loss recov-
ery, which has explored forward error correction (FEC).
While there are many potential FEC-based mitigation
strategies, we consider qualitatively different ones that
increase the bit error detection and correction capabili-
ties at different granularities: flit, packet, and flow.
Flit-level: We study the impact of using a stronger code,
like binary BCH, instead of the Hamming code currently
employed in SRAMs. This code has the following prop-
erty: given the code block of length $[2^{h-1}, 2^{h}-1]$ bits
[Figure 3: TCP performance under different technologies. (a) Error-free packet delivery: success rate vs. year, model vs. simulation for 5, 10, 15, and 20 hops. (b) Packet retransmissions: number of timeout and fast retransmissions vs. year for 5, 10, 15, and 20 hops. (c) Flow-completion time (s, log scale) vs. year for 5, 10, 15, and 20 hops.]
with ht parity bits, the code can correct up to t bit errors
and detect up to t+ 1 bit errors. We call this strategy
Memory-ECC.
Packet-level: In this strategy, binary BCH is applied to
the entire packet. We consider two variants: one where
the code is checked and re-applied at each hop (Hop-
FEC), and one where the code is applied once at the
sender and checked at the receiver (End-FEC).
Flow-level: Finally, we consider adding FEC to TCP
(we call this Flow-NC), and implement a recently pro-
posed scheme designed to improve TCP recovery la-
tency [45]. Generally, in this approach, TCP sends one
redundant XOR-coded packet for every K packets, so if
any K of K+ 1 packets are received, TCP can recover
the lost packet. If more than one packet is lost, the re-
dundant packet’s ACK can help trigger fast recovery and
avoid timeouts.
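A minimal sketch of the XOR coding behind Flow-NC (our illustration of the idea in [45], not its implementation) is:

# One XOR parity packet is sent per K data packets, so any single loss within
# the group can be rebuilt at the receiver.
def xor_packets(packets):
    """Bytewise XOR of equal-length packets."""
    out = bytearray(len(packets[0]))
    for p in packets:
        for i, byte in enumerate(p):
            out[i] ^= byte
    return bytes(out)

def encode_group(data_packets):
    """Return the K data packets followed by one XOR parity packet."""
    return list(data_packets) + [xor_packets(data_packets)]

def recover_missing(received_k):
    """XOR of any K packets of the group (data + parity) equals the missing one."""
    return xor_packets(received_k)

group = encode_group([bytes([i] * 8) for i in range(3)])   # K = 3
received = group[:1] + group[2:]                           # packet 1 was lost
assert recover_missing(received) == group[1]

To put the redundancy of the first three strategies in perspective (a back-of-the-envelope estimate using the BCH property above and the packet size from Section 3.1): a 1550-byte packet is 12400 bits, which falls in the block-length range $[2^{13}, 2^{14}-1]$, so h = 14 and a 2-bit Hop-FEC or End-FEC adds roughly 28 parity bits per packet, well under 1% overhead; by contrast, strengthening the flit-level Memory-ECC on a 32-bit flit (h = 6) to t = 2 costs 12 parity bits per flit, redundancy comparable to a third of the flit itself.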
Methodology. Each of these strategies has at least one
parameter that can take many discrete values (for the first
3 strategies it is t, and for the last it is K). Our goal
is to find all combinations of strategies and associated
parameter values that result in successful mitigation.
Now, since the search space of mitigations is fairly
large (e.g., with 4 strategies, each with 5 parameters, re-
sults in 1024 possible combinations), we impose a car-
dinal ordering on the strategies in order of increasing
complexity of deployment: Hop-FEC, which may be im-
plemented by changing per router software; End-FEC,
which requires per host software changes; Flow-NC,
which needs more extensive per host software changes
than End-FEC; and Memory-ECC, which incurs hard-
ware changes at every router. We use this ordering to
prune the search space as follows. We start with the eas-
iest strategy and find the lowest parameter setting which
results in a successful mitigation. We then use the high-
est parameter for this strategy that does not result in a
successful mitigation, and vary the next strategy in the
search space. We repeat this until we have explored all
four strategies, a greedy approach that gives us an enve-
lope of successful mitigations which are most desirable
according to our ordering.
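The pruning can be sketched as follows; here simulate(config) stands in for running the ns-2 experiments and returning the completion-time inflation, and the parameter lists are illustrative:

# Greedy envelope search over mitigation strategies, ordered by deployment complexity.
TARGET_INFLATION = 1.05   # a mitigation is "successful" below this inflation

SEARCH_ORDER = [
    ("hop_fec",    list(range(0, 16))),      # BCH correcting 0-15 bits per packet, per hop
    ("end_fec",    list(range(0, 16))),      # BCH correcting 0-15 bits, end to end
    ("flow_nc_k",  [0, 8, 6, 4, 3, 2]),      # 0 = disabled; smaller K = more redundancy
    ("memory_ecc", ["hamming", "bch2"]),     # flit-level ECC, as in Table 4
]

def greedy_envelope(simulate):
    successful = []
    config = {name: values[0] for name, values in SEARCH_ORDER}
    for name, values in SEARCH_ORDER:
        # find the lowest setting of this strategy that succeeds, holding the
        # more complex strategies at their baseline
        winner_idx = None
        for idx, v in enumerate(values):
            config[name] = v
            if simulate(config) <= TARGET_INFLATION:
                winner_idx = idx
                successful.append(dict(config))
                break
        # continue from the highest setting that still fails, so the next
        # (more complex) strategy has to close the remaining gap
        if winner_idx is None:
            config[name] = values[-1]
        elif winner_idx > 0:
            config[name] = values[winner_idx - 1]
    return successful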
We have implemented all of these mitigation methods
in NS-2.35 and in this section we use our simulator to
evaluate the above mitigation methods, both individually
and in combination. The parameter space explored for
each method is listed in Table 4.
Table 4: Parameters chosen for mitigation
Memory-ECC   Hamming code, or binary BCH code that corrects 2 error bits
Hop-FEC      None, or binary BCH code that corrects 1-15 error bits
End-FEC      None, or binary BCH code that corrects 1-15 error bits
Flow-NC      None, or one redundant packet for every 2, 3, 4, 6, or 8 normal packets
As before, we simulate the transfer of a 15 MB file
over 5 or 10 hops under different technologies. Results
are averaged over 20 runs.
Mitigation results. The successful mitigation strategies
across different technologies are summarized in Table 5
and Table 6. Before discussing these, we explore which
mitigation strategies, surprisingly, do not work. An in-
teresting observation is that none of the end-to-end solu-
tions alone work for technologies after 2018. As shown
in Figure 4(b), with increasing error correction capabil-
ity, pure End-FEC’s completion time inflation improves
marginally but is still significantly above our target infla-
tion. Since memory errors manifest themselves at every
hop, so many packets are dropped before they reach the
destination that End-FEC performs poorly.
Flow-NC performs better than end-FEC post-2018
(Figure 4(c)), since it can repair lost packets, but it still
does not qualify as a successful mitigation strategy by
our definition. In the best case, with K= 2, on a 10-hop
network, Flow-NC has an inflation of 1.85. Interestingly,
in a 10-hop network, Flow-NC incurs slightly lower in-
flation (1.78) with K= 3 (lower redundancy) than K= 2.
This is because the completion-time cost of the extra traffic at K = 2 outweighs the benefit of the faster loss recovery obtained with the higher redundancy.
Table 5: Minimum mitigation requirement for 5 hops
Year        Hop-FEC  End-FEC  Flow-NC  Memory-ECC  Inflation
2010, 2012  0        0        0        Hamming     1.0
2015        0        0        0        Hamming     1.0002
2018        2        0        0        Hamming     1.0035
            0        0        0        BCH 2       1.00026
2020        2        0        0        Hamming     1.0078
            0        0        0        BCH 2       1.0021
Table 6: Minimum mitigation requirement for 10 hops
Year        Hop-FEC  End-FEC  Flow-NC  Memory-ECC  Inflation
2010, 2012  0        0        0        Hamming     1.0
2015        0        0        0        Hamming     1.0005
2018        2        0        0        Hamming     1.0044
            0        0        0        BCH 2       1.0006
2020        4        0        0        Hamming     1.0069
            3        1        0        Hamming     1.0251
            0        0        0        BCH 2       1.0166
What mitigation strategies are successful? First,
changing the memory ECC to use 2-bit binary BCH
(t = 2) successfully mitigates over 5 and 10 hops across
all the technology years. To understand the hop depen-
dence of this result, we studied the topologies of 15 and
20 hops. 2-bit binary BCH continues to work for these
topologies in 2018, but fails in 2020, where its inflation
is 15% and 49% respectively for 15 and 20 hops.
Other successful mitigation strategies exist that do not
require flit-level error correction. At 5 hops, adding a 2-
bit hop-FEC results in successful mitigation. At 10 hops,
adding either 4-bit hop-FEC or a combination of 3-bit
hop-FEC and 1-bit End-FEC is sufficient.
Summary and Implications. The implication that end-
to-end mitigations alone are not successful is telling:
since memory unreliability is more pervasive than con-
gestion (in the sense that every hop can encounter mem-
ory unreliability), we are forced to look for solutions in-
side the network if the ultimate performance goal is to
effectively mask unreliability. There appear to be rea-
sonable engineering solutions to dealing with the pro-
jected memory unreliability, at least as far as current ITRS
projections go (2020): adding small amounts of addi-
tional error correcting capability to memories (e.g., by
using 2-bit or 3-bit binary BCH), or to packets at each
hop (e.g., 2-4 bits of hop-FEC) perhaps in concert with
a small amount of end-FEC. Of course, these solutions
are not quick fixes and may require extensive hardware
or software changes.
Some of these successful mitigation strategies are
topology dependent: a vendor can either choose to en-
gineer large enough redundancy to cover known network
diameters, or can choose to specialize redundancy based
on market segment (e.g., use lower redundancy in data
centers with smaller hop counts, and higher redundancy
for products targeted for the broader Internet).
4 Flash-Based Key-Value Storage
In this section, we explore the robustness to memory
errors of a qualitatively different network subsystem, a
Flash-based key-value store (F-KVS). Recent work has
shown that an F-KVS can provide higher throughput and
use lower energy than a key-value store that uses mag-
netic disks. In an F-KVS, several storage servers that use
Flash/SSD are used to store replicas of key-value pairs;
[Figure 5: Flash-Based Key-Value Storage [11]. Front-end servers forward requests to back-end storage nodes (A-F) arranged in a ring; each back-end node keeps a DRAM index pointing into an append-only Flash data log. The inset illustrates sector-by-sector voting across three data copies, each split into codewords, to recover from a corrupted codeword.]
each storage or retrieval operation traverses a network of
switches that, in our evaluation, use SRAM for packet
buffering. In other words, F-KVS is interesting because
it uses both the memory technologies that are projected
to become highly unreliable.
4.1 Impact of Memory Errors on an F-KVS
Methodology and Assumptions. Our exploration is
based on a simplified version of a recent design for a F-
KVS [11] (Figure 5). In this design, the storage servers
are arranged in a logical ring, and consistent hashing is
used to map keys to storage nodes. Front-end servers
initiate put and get requests, which are then forwarded to
a back-end storage server. When a put reaches a node,
the value associated with the key is stored in an append-
only Flash data log, and an entry for the key-value pair
is added to a DRAM index. For a get request, the key is
looked up in the DRAM index to find the Flash location
for the corresponding value.
Our F-KVS, following [11], uses quorum replication
to improve put-get reliability. In our implementation, the
front-end sends a put request to a coordinator, which in
turn replicates the value in R−1 successor nodes on the
ring. Both writes and reads are subject to quorum con-
sensus for success; we use the same quorum consensus
algorithm as in [11].
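A compact sketch of this put path (consistent-hashing ring, coordinator plus R−1 successors, write quorum of R_P) is shown below; the helper names are illustrative, not FAWN's API:

# Consistent-hashing ring with quorum-replicated puts.
import hashlib
from bisect import bisect_right

class Ring:
    def __init__(self, node_ids):
        self.points = sorted((self._h(n), n) for n in node_ids)

    @staticmethod
    def _h(x):
        return int(hashlib.sha1(str(x).encode()).hexdigest(), 16)

    def replicas(self, key, r):
        """Coordinator = first node clockwise of hash(key); then its r-1 successors."""
        keys = [p for p, _ in self.points]
        start = bisect_right(keys, self._h(key)) % len(self.points)
        return [self.points[(start + i) % len(self.points)][1] for i in range(r)]

def put(ring, key, value, store_fn, R=3, R_P=2):
    """store_fn(node, key, value) -> bool models a Flash write plus read-back check."""
    coordinator, *successors = ring.replicas(key, R)
    if not store_fn(coordinator, key, value):    # the coordinator write must succeed
        return False
    acks = sum(store_fn(n, key, value) for n in successors)
    return acks >= R_P - 1                       # write quorum of R_P, including the coordinator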
Our evaluations model a multi-level cell (MLC)
NAND Flash. To deal with unreliability, MLC NAND
Flash chips have limited built-in error-correction. Typi-
cally, before writing each 512-byte sector, the Flash con-
troller computes parity bits that depend on the specific
code (Reed-Solomon, binary BCH, etc.) [2].
Model. As before, we develop an analytical model that
predicts F-KVS put and get success rates as a function of
SRAM and Flash unreliability. In deriving these prob-
abilities, we model Flash error correction capabilities,
errors introduced by network transmissions between the
front-end and the storage node(s), and the reliability af-
forded by quorum consensus. We cross-check our model
with the simulator, and then derive our conclusions on
F-KVS robustness to memory errors using simulation.
Model. If an ECC scheme can correct up to E error bits
per codeword (the data and parity bits together constitute
[Figure 4: Discussions about End-FEC and Flow-NC. (a) Separating congestion and non-congestion losses: flow completion time (s, log scale) vs. year for baseline TCP, TCP Veno, and TCP Westwood. (b) End-FEC only (2018 technology): completion time inflation vs. End-FEC error-correcting capability (bits), for 5 and 10 hops. (c) Flow-NC only (2018 technology): completion time inflation vs. interval of sending one redundancy packet, for 5 and 10 hops.]
a codeword), the codeword will have an uncorrectable error if E+1 or more bits fail. The probability that a codeword will fail is

$p_{CW,FAIL} = \sum_{i=E+1}^{N_D+N_P} p_{CW,FAIL,i}$, where $p_{CW,FAIL,i} = C^{i}_{N_D+N_P}\; RBER^{i}\,(1-RBER)^{N_D+N_P-i}$,

where $N_D$ is the number of data bits, $N_P$ is the number of parity bits, and RBER is the raw bit error rate derived from Figure 2(b) (as discussed in Section 2).
However, with a small probability, the data sector of a codeword will be error-free if all the error bits are located in the parity bits. Thus, the probability that the data sector in a codeword will fail is:

$p_{D,FAIL} = \sum_{i=E+1}^{N_D+N_P} p_{CW,FAIL,i} \left\{ 1 - \left[ \frac{N_P}{N_D+N_P} \right]^{i} \right\}$

For data of size $L_D$ bits, the probability that the read/write operations are successful without errors is:

$p_{D,SUCC} = (1 - p_{D,FAIL})^{\lceil L_D / N_D \rceil}$

If we further consider that the error appears during the process of writing and then reading, we can get $p_{DWR,FAIL}$ and $p_{DWR,SUCC}$ by using $1-(1-RBER)^2$ to replace RBER in the above equations.
Now consider the F-KVS, and assume that a put or get fits into m packets. A put request first traverses $n_0$ hops from a front-end node to the coordinator storage server. At the coordinator, the data is written into Flash, then read from the Flash and sent out to R−1 successors; each such communication also incurs network transmissions followed by a Flash write and a read (to check the correctness of the write) at each replica. At the coordinator, the probability of a successful put request is:

$p_{PUT,WR}(n_0) = \left[ p_{GOOD}^{\,n_0}\; p_{DWR,SUCC} \right]^{m}$

If each successor is $n_1$ hops away in the physical topology from the coordinator (for simplicity, we assume all the successors have an equal hop-distance to the coordinator), then for the put operation to be successful, we require that $R_P - 1$ replications are successful (so that a write quorum exists). Given this, the probability of success of the put request is:

$p_{PUT,SUCC} = p_{PUT,WR}(n_0) \sum_{i=R_P-1}^{R-1} C^{i}_{R-1}\; p_{PUT,WR}^{\,i}(n_1)\,\left[ 1 - p_{PUT,WR}(n_1) \right]^{R-1-i}$

The derivation of $p_{GET,SUCC}$, the rate of success of a get, is similar, and we have omitted it for brevity.
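A minimal Python sketch of the put-success expression above follows (a transcription for illustration, not the simulator code). N_D and N_P are the data and parity bits per Flash codeword, E the correction capability, p_good the per-hop packet success probability from Section 3, and rber the raw bit error rate from Figure 2(b):

from math import comb, ceil

def p_data_fail(rber, E, N_D, N_P):
    """Probability that the data portion of one codeword is wrong after ECC.
    The sum is truncated: terms beyond roughly E+50 errors are negligible here."""
    n = N_D + N_P
    return sum(comb(n, i) * rber ** i * (1 - rber) ** (n - i)
               * (1 - (N_P / n) ** i)
               for i in range(E + 1, min(n, E + 50) + 1))

def p_data_succ_wr(rber, E, L_D, N_D, N_P):
    """Write-then-read success for L_D data bits (each bit sees the RBER twice)."""
    rber_wr = 1 - (1 - rber) ** 2
    return (1 - p_data_fail(rber_wr, E, N_D, N_P)) ** ceil(L_D / N_D)

def p_put_succ(p_good, n0, n1, rber, E, L_D, N_D, N_P, m=1, R=3, R_P=2):
    """Coordinator write over n0 hops, then a write quorum among the R-1
    successors, each n1 hops from the coordinator."""
    def p_put_wr(hops):
        return (p_good ** hops * p_data_succ_wr(rber, E, L_D, N_D, N_P)) ** m
    coord, repl = p_put_wr(n0), p_put_wr(n1)
    quorum = sum(comb(R - 1, i) * repl ** i * (1 - repl) ** (R - 1 - i)
                 for i in range(R_P - 1, R))
    return coord * quorum

# e.g., 512-byte sectors (4096 data bits) with 4-bit correction; the parity-bit
# count here (52) is an illustrative value, not taken from the paper
print(p_put_succ(p_good=0.999, n0=4, n1=2, rber=1e-4, E=4,
                 L_D=1500 * 8, N_D=4096, N_P=52))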
Simulation. We have implemented F-KVS in the NS-
2.35 simulator. Specifically, we have implemented the
Flash storage and replication strategies discussed above.
The Flash read and write speeds are set to 2000 MB/s
and 1000 MB/s [4] respectively. Both the simulator and
the model use hop distances drawn from the hop distri-
bution in a fat-tree topology. The model assumes that
put and get operations incur UDP-based packet transmis-
sions while the simulation, more realistically, uses TCP
for these operations. Finally, our simulator and our eval-
uations do not take network congestion into account and
do not account for node failures. In this sense, our results
represent an optimistic assessment of the actual perfor-
mance of an F-KVS.
Validation and Results. To validate the analysis and the
simulator, we enable UDP-based request transmissions
in the simulator and use comparable hop distances for
the transmission drawn from a fat-tree topology. We set
R and $R_P$ to the typical values of 3 and 2 as used in [18].
We evaluate F-KVS performance at three distinct P/E
cycles, which have qualitatively different RBER values:
2500 (low P/E), 8000 (medium P/E) and 10000 (high P/E
as this is the most common lifetime that manufacturers
guarantee for current commercial products [23]). This
means, for instance, that our medium P/E evaluation as-
sumes that the block of Flash being accessed has already
been exposed to 8000 P/E cycles.
Results for m = 1 are shown in Figure 6. All the re-
sults are averaged over 100,000 put-get operations. For
each year, we use the most advanced SRAM and Flash
technologies predicted to be available in that year. Over-
all, the results of the model and simulation are consis-
tent: the maximum errors are 0.0046 and 0.0025 for put
and get respectively, while the mean errors are 0.0013
and 0.00068. We validated for larger values of m and
obtained similar results.
[Figure 6: Validation of put/get performance under different technologies (UDP, m = 1). (a) put operation and (b) get operation: success rate vs. year, model vs. simulation at 2500, 8000, and 10000 P/E cycles.]
As an aside, the performance of a put degrades faster
than that of a get. During a put, the Flash write at the
coordinator must be successful before the key-value pair
is replicated at successors. Thus, if the coordinator fails,
the put is also deemed to have failed. However, in a get,
a quorum is sufficient to ensure success.
Having ensured that our simulation and model are mu-
tually consistent, we can now understand the impact of
memory errors for F-KVS. The first step is to devise a
metric for F-KVS. Given that F-KVS is a storage sys-
tem, we choose the success rate of a put followed by a
get as the metric. However, latency is an important con-
sideration in F-KVS designs [18]. To accommodate this,
we also evaluate F-KVS under a latency constraint. Un-
der this constraint, if the latency of a put or get is inflated
by more than a factor of 2, a conservative choice, we
consider the operation to have failed. Finally, our eval-
uation also considers the impact of the size of the key-
value pair, simulating cases where 1 or 16 packets may
be needed for these.
Our results are shown in Figure 7. For small key-value
objects (Figure 7(a)), when we place no latency con-
straint, the medium P/E roadmap shows a small degra-
dation in success rate, while the low P/E roadmap shows
almost no degradation at all. In this regime, the in-built
Flash error-correcting capability, together with the re-
dundancy offered by quorum consensus suffices to mask
memory failures. Since we impose no timeout con-
straints, TCP eventually ensures reliable put and get re-
quest/response delivery. However, after 2017, the perfor-
mance at 10,000 P/E cycles degrades dramatically, since
endurance is halved and the resilience mechanisms are
not sufficient. Thus, the roadmap in this case tracks the
inflection in the Flash roadmap.
Under a latency constraint, however, at all P/E lev-
els there is an inflection in the success rate after 2015,
tracking the inflection in the SRAM roadmap. Clearly,
the TCP flow completion times are significantly high
post-2015, resulting in increasing failures of put-get op-
erations. In this regime, the performance of F-KVS is
entirely dominated by SRAM failures for the low and
medium P/E cases. For the high P/E case, however, the
Flash roadmap still determines performance.
For larger objects (Figure 7(b)), however, the medium
[Figure 7: put-get performance under different technologies (TCP); LC is the latency constraint. (a) Small key-value objects (1 packet) and (b) large key-value objects (16 packets): success rate vs. year at 2500, 8000, and 10000 P/E cycles, with and without the latency constraint.]
P/E case shows a qualitatively different result than for
small objects. Even without a latency constraint, perfor-
mance is noticeably worse than for smaller objects, and
there is a further degradation in performance when la-
tency is taken into account. In this regime, the roadmap is
determined in part by the Flash technology roadmap (the
unconstrained case) and in part by the SRAM roadmap
(the constrained case).
In summary, the roadmap for F-KVS shows some
nuances: F-KVS generally tracks the SRAM roadmap
since the latency of TCP recovery dominates, but in
some regimes, both the SRAM roadmap and the Flash
roadmap determine performance.
4.2 Making F-KVS Robust to Memory Errors
In F-KVS, SRAM errors can inflate TCP completion
times, while Flash errors can result in get failures. A mit-
igation strategy for F-KVS should, ideally, mask both of
these failures. To define a metric for successful mitiga-
tions, consider that an F-KVS is fundamentally a storage
system, so the ultimate test of correctness is that every
get of a successful put must be 100% successful. How-
ever, an F-KVS is a distributed storage system with the
possibility of component failure, and such systems are
often engineered for a certain availability (e.g., [18] con-
siders a design for 3-9s availability). Given this failure
tolerance, we define a successful mitigation strategy as
one that provides 4-9s availability, i.e., the probability of
a correct put-get pair should be at least 99.99%.
We also make one additional simplification to reduce
the search space of successful mitigations. Distributed
key-value stores are also latency constrained and in Sec-
tion 4.1 we had placed a latency constraint on put and
get operations. In this section, we assume that we have
a successful TCP mitigation strategy in place, i.e., where
the completion time latency penalty is less than 5%. Our
results below use binary BCH memory-ECC to ensure
successful TCP mitigation. For F-KVS, Flash unrelia-
bility does not add much to the put-get request comple-
tion time, as we discuss later; the latency inflation comes
from SRAM failures. This means that all our success-
ful mitigation strategies described below trivially satisfy
the latency constraint. In future work, we plan to ex-
plore joint mitigation strategies that may consider TCP
mitigation methods that inflate completion time below
the latency constraint. This exploration may reveal other
combinations of successful F-KVS mitigations that we
are not able to uncover in this paper.
Strategies. As with TCP, we study qualitatively different
mitigation strategies that apply reliability mechanisms at
different granularities: at the Flash sector level, at the
node level, across nodes, and end-to-end.
Flash-sector: Most current Flash memories [2] use bi-
nary BCH or RS codes with 4 bit correction and 5 bit
detection capability [35]. In our evaluations, we explore
Flash-ECC mitigation with t bit correction and t+ 1 bit
detection capabilities, for t ≥ 4.
Node: Within a single node, we can attempt two mitiga-
tion strategies. Local-Temporal applies to put operations.
When a key-value pair is received at a node, it is writ-
ten to Flash, then read back again, and if the two values
don’t match, the writing is re-tried. This process is re-
peated at most N
T
times. In Local-Spatial, when a node
receives a key-value pair, it writes N
D
different copies to
the local Flash, then reads them back. It then votes on
the 3 copies sector-by-sector: i.e., if, for example, the
first sector of 2 of the 3 copies match, that first sector
is chosen for the result. The put is deemed successful
only if the read value can be reconstructed (i.e., quorum
consensus succeeds for each sector) and matches the re-
ceived value. Our voting is sector-by-sector because that
is the granularity at which error correction occurs in the
Flash hardware. A similar definition holds for get suc-
cess, with one caveat. For a get operation, there is a very
small probability (which we ignore) of the winners of the
vote having bit errors at the same bit location, resulting
in a quorum that doesn’t yield the correct answer.
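A minimal sketch of this sector-by-sector vote (assuming 512-byte sectors and in-memory copies; it is illustrative, not the simulator code) is:

SECTOR = 512  # bytes, the granularity of the Flash ECC

def vote_sectors(copies, n_sectors):
    """Return the reconstructed value, or None if some sector has no majority."""
    result = bytearray()
    for s in range(n_sectors):
        chunks = [c[s * SECTOR:(s + 1) * SECTOR] for c in copies]
        winner = max(set(chunks), key=chunks.count)
        if chunks.count(winner) <= len(copies) // 2:
            return None                          # no quorum for this sector
        result.extend(winner)
    return bytes(result)

def local_spatial_put(value, flash_write, flash_read, n_copies=3):
    """Write n_copies, read them back, vote; succeed only if the vote matches."""
    addrs = [flash_write(value) for _ in range(n_copies)]
    copies = [flash_read(a) for a in addrs]
    n_sectors = (len(value) + SECTOR - 1) // SECTOR
    return vote_sectors(copies, n_sectors) == value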
Across Nodes: While quorum replication is used in F-
KVS to improve availability in the case of node failure, it
can also be used to recover correct key-value pairs in the
presence of Flash errors. We explore the Node-Replica
mitigation scheme which uses N replicas, where N is
odd: if at least $\lfloor N/2 \rfloor + 1$ copies are identical to each
other, a get is deemed to be successful (or if that many
put operations succeed, the put is deemed successful).
End-to-end: Our end-to-end strategy, App-FEC adds re-
dundancy to the key-value pair at the front-end before
a put operation to provide $E_A$-bit correction capability.
When a get returns the value, the front-end applies the
correction, and returns the value if there were no errors.
In our implementation, the FEC is applied to every 1500
byte chunk of the key-value pair data, so that the correc-
tion capability is independent of the total length of the
key-value pair, but the overhead of coding is proportional
to the size of the key-value pair.
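The chunking can be sketched as follows (bch_parity and bch_correct are placeholders for whatever $E_A$-bit-correcting encoder and decoder are used; they are not real library calls):

# Per-chunk App-FEC framing: parity is attached to each 1500-byte chunk, so the
# correction capability per chunk is fixed while the total coding overhead grows
# in proportion to the key-value pair's size.
CHUNK_BYTES = 1500

def app_fec_encode(value, bch_parity, e_a):
    """Return a list of (chunk, parity) pairs covering the whole value."""
    chunks = [value[i:i + CHUNK_BYTES] for i in range(0, len(value), CHUNK_BYTES)]
    return [(chunk, bch_parity(chunk, e_a)) for chunk in chunks]

def app_fec_decode(encoded, bch_correct):
    """Correct each chunk independently and reassemble; bch_correct returns the
    repaired chunk, or None if more than e_a bits were in error."""
    repaired = [bch_correct(chunk, parity) for chunk, parity in encoded]
    return None if any(c is None for c in repaired) else b"".join(repaired)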
As with TCP, we define an ordering, as follows, among
[Figure 8: Node-Replica schemes for Flash technology after 2017 (1-packet key-value pairs): PUT, GET, and PUT-GET success rates for N(3,2), N(5,3), and N(7,4) at (a) 8000 P/E cycles and (b) 10000 P/E cycles. Note that the Y-axis starting points differ between the two panels.]
these methods in approximate order of increasing com-
plexity, so as to limit the search space. App-FEC re-
quires software changes at the front-end only. Node-
replica modifies a parameter to the replication algorithm,
while Local-Temporal requires fewer per node software
changes than Local-Spatial. Finally, Flash-ECC requires
modifying the hardware on all storage servers.
Mitigation strategies. We have implemented these
strategies in ns-2.35 with the kernel mirror plugin and
we now report the results of simulations to identify suc-
cessful mitigation strategies (those that provide 99.99%
availability for a put followed immediately by a get)
or combinations thereof. Our experimental settings are
identical to those in Section 4.1. We are also able to mathe-
matically model all of these strategies and cross-validate
them with our simulator (discussed below).
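One natural formalization of this criterion, with P_put the probability that the put succeeds and P_{get|put} the probability that the immediately following get succeeds, is

\[
P_{\text{put-get}} \;=\; P_{\text{put}} \cdot P_{\text{get} \mid \text{put}} \;\geq\; 0.9999 .
\]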
Our first result is somewhat counter-intuitive. Node
replication is not a successful strategy for up to 7 repli-
cas (Figure 8), and increasing the number of replicas does
not increase availability significantly and may actually
degrade availability in some cases (Figure 8(b)). This
turns out to be an artifact of the design of FAWN’s [11]
replication algorithm, which we used in F-KVS. In this
algorithm, on a put, if the coordinator fails due to a Flash
failure, the put is deemed to have failed, and replication
is not attempted. While this design certainly makes sense
in the absence of Flash errors, an alternative, more ro-
bust design would have attempted to replicate the data
and then checked if there was a write quorum. Given
this result, we fix the number of node replicas at 3 in the
evaluations that follow.
We now turn to successful strategies. First, pre-2017,
before the expected halving of Flash endurance, none of
these mitigation strategies are required: the baseline 4-
bit Flash-ECC in current devices is sufficient to ensure
a reliable F-KVS. Post-2017, however, the strategies for
achieving successful mitigation at medium P/E and high
P/E, for two different key-value pair sizes (1 packet and
16 packets) are shown in Table 7. These successful
mitigation strategies reveal a fairly nuanced picture.
The first column of Table 7 describes successful miti-
gation strategies for the small (1 packet) key-value pairs
with Flash being at medium P/E. Each cell lists a collec-
tion of related successful mitigation strategies. A cell is
generally a sequence of pairs (c, d), where c represents
the parameter associated with the strategy in that row,
and d represents the parameter for App-FEC. Thus, the
top-left cell contains two successful strategies: one
uses a parameter of 2 for Local-Temporal (this strategy
does not require App-FEC), and the other uses
a parameter of 1 but requires 5 bits of App-FEC. Our
search varies one strategy's parameter at a time,
finds the smallest value of the parameter that is successful on its own,
and then decreases that strategy's parameter, compensating
for the loss of robustness with increased App-FEC. Thus,
we are able to explore only a part of the search space; we
have left a more systematic exploration to future work.
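The search heuristic just described can be sketched as follows; is_successful is a hypothetical oracle (in our case, a batch of simulation runs) that reports whether a given combination of the strategy's parameter and App-FEC bits meets the availability target, and the parameter bounds are illustrative.

def greedy_mitigation_search(is_successful, max_param, max_app_fec):
    # Populate one Table 7 cell: find the smallest value of the strategy's
    # parameter that succeeds with no App-FEC, then repeatedly lower the
    # parameter and compensate with additional App-FEC correction bits.
    found = []
    base = next((c for c in range(1, max_param + 1) if is_successful(c, 0)), None)
    if base is None:
        return found  # the strategy cannot succeed on its own within the budget
    found.append((base, 0))
    c, d = base, 0
    while c > 1:
        c -= 1  # weaken the strategy by one step ...
        while d <= max_app_fec and not is_successful(c, d):
            d += 1  # ... and add App-FEC bits until the combination succeeds again
        if d > max_app_fec:
            break
        found.append((c, d))
    return found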
In this setting, 10 bits of App-FEC and double
Local-Temporal redundancy⁶ are successful. However,
7 Local-Spatial copies are needed for successful mitiga-
tion. Finally, about 9 bits of Flash-ECC suffice to en-
sure successful mitigation in this setting, but the Flash-
ECC correction capability can be reduced almost lin-
early by linearly increasing App-FEC correction capabil-
ity. These results suggest that temporal redundancy and
error correction capabilities may be more effective than
reconstruction using spatial voting.
With larger key-value pairs, 10 bits of App-FEC
continue to suffice, but 3 Local-Temporal attempts and 9
Local-Spatial copies are needed, as are 11 bits of Flash-
ECC. This suggests that some of these mitigation strate-
gies are F-KVS workload-dependent.
At high P/E cycles (i.e., an “older” F-KVS), with a
smaller key-value pair, 25 bits of App-FEC, 9 Local-
Temporal attempts and 16 bits of Flash-ECC are all
successful, but 73 Local-Spatial copies are needed.
All of these values are significantly higher than their
medium P/E counterparts, with the Local-Spatial copies
being a more dramatic demonstration of the weakness
of quorum-based reconstruction. Thus, all mitigation
strategies are also uniformly P/E-level specific. By contrast,
as Figure 2(b) shows, pre-2017 the existing mitigation strategy
with 4-bit Flash-ECC is P/E-level independent: the same
mitigation strategy applies to all P/E levels. Clearly, the
order of magnitude higher RBER post-2017 qualitatively
changes the landscape of successful strategies.
The workload-dependence at high P/E values is qual-
itatively similar to the workload-dependence at medium
P/E values, so we have omitted the results. This is also
surprising because this workload dependence does not
exist pre-2017. To confirm this, we have checked that the
baseline mitigation strategy continues to work pre-2017
for key-value pairs of size up to 64 packets.
Summary and Implications. Pre-2017 the existing
⁶ Since Local-Temporal applies only to put, we always augment increased Local-Temporal with 3-copy Local-Spatial redundancy.
Table 7: Minimum requirements for mitigation
(P/E cycle, key-value size)   (8000,1)                        (8000,16)                        (10000,1)
App-FEC                       10                              10                               25
Local-Temporal                (2,0),(1,5)                     (3,0),(2,1),(1,6)                (9,0),(7,1),(5,11),(3,21),(1,23)
Local-Spatial                 (7,0)                           (9,0),(7,1),(5,2),(3,5)          (73,0),(9,21),(7,22),(5,23),(3,24)
Flash-ECC                     (9,0),(8,2),(7,6),(6,8),(5,9)   (11,0),(8,3),(7,7),(6,8),(5,9)   (16,0),(8,13),(7,19),(6,22),(5,23)
Flash error correction capabilities are successful at
all P/E levels and workload sizes we have examined.
Post-2017, when Flash endurance halves, the behavior
changes and all successful mitigation strategies require
non-trivial amounts of additional redundancy. Node-
Replica is unhelpful for mitigation because of an arti-
fact in the protocol design, while Local-Spatial provides
weak reconstruction in that it requires a very large num-
ber of copies to be successful. The levels of redundancy
required for all successful mitigation strategies are both
workload-dependent (vary with the size of the key-value
pair) and P/E-level-dependent (vary with the “age” of
the F-KVS). Since, in general, redundancy incurs cost,
this argues for either carefully provisioning F-KVS re-
dundancy based on the workload, or designing mitiga-
tion strategies such that their levels of redundancy can
be progressively increased with the age of the F-KVS.
Finally, it is interesting to note that, unlike for TCP, an
end-to-end strategy (App-FEC) is successful, with suffi-
cient redundancy.
Validation. We have also been able to analytically
model our mitigation strategies and derive the probabil-
ity of successful put and get operations using Flash-ECC,
Node-Replica, Local-Temporal, and Local-Spatial. Our
analytical derivation methodology is similar to that dis-
cussed above.
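To illustrate the style of derivation (a simplified sketch assuming independent attempts and sectors, and ignoring the coordinator-failure artifact), let q be the probability that a single write-read-verify attempt stores and returns the whole key-value pair without uncorrectable sector errors, and let s be the probability that one stored copy of one sector reads back correctly; then

\[
P_{\text{Local-Temporal}}(N_T) = 1 - (1 - q)^{N_T},
\qquad
P_{\text{Local-Spatial}}(N_D) = \Bigl[ \sum_{k=\lfloor N_D/2 \rfloor + 1}^{N_D} \binom{N_D}{k} s^{k} (1-s)^{N_D-k} \Bigr]^{S},
\]

where S is the number of sectors in the key-value pair; the expressions for Flash-ECC and Node-Replica follow the same pattern.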
Figure 9 shows the validation of the simulator us-
ing our analytical model for a post-2017 Flash at 10,000
P/E cycles. In general, the simulation results match the
model quite closely, likely because, in this validation, we
have assumed a successful TCP mitigation strategy. This
validation gives us greater confidence in our results; in
particular, we have also verified that the large amounts of
redundancy required for Local-Spatial in the simulations
above are also predicted by the model. The model also
predicts the workload-dependence of successful mitiga-
tion strategies post-2017.
5 The Cost of Robustness
Our explorations suggest that cautious optimism may be
warranted: they indicate that many successful mitigation
strategies do exist. However, we have not taken mitiga-
tion costs into account. One manifestation of this cost is
reduced performance. In Figure 4(c), we show how ad-
ditional redundancy degrades performance, an instance
Figure 9: Model validation for 10000 P/E cycles under post-2017 Flash technology, comparing model predictions with simulation results for 16-packet key-value pairs: (a) success probability for put operations, (b) success probability for get operations. The X-axis lists mitigation strategies in the format [Flash-ECC, Local-Temporal, Local-Spatial, (Node-Replica R, Node-Replica R_P)]; for example, F4T3S5N(3,2) means 4-error-bit-correcting binary BCH, at most 3 Local-Temporal attempts, 5-copy Local-Spatial voting, and 3 node replicas with at least 2 successful operations.
Figure 10: SRAM throughput (packets/sec) versus technology node (45, 32, 22, 16, and 12 nm, ~2020), comparing the Do nothing, Idealized, Hardware, and Mitigation curves.
where the cost of mitigation outweighs its benefits. In
this instance, the cost manifests itself in additional net-
work traffic, which impacts the performance metric used
to characterize the benefit (completion time).
Another way to quantify performance cost is to go
back to Figure 1. How much do our mitigation ap-
proaches affect effective SRAM throughput? To ad-
dress this question, we consider a successful mitigation
strategy for TCP: using a weaker flit-level error correc-
tion, together with hop-by-hop packet-level error correc-
tion. We designed encoder/decoder (codec) circuitry for
packet-level binary BCH using a recently proposed archi-
tecture [19] (this BCH codec was used to compute the
Hardware curve in Figure 1, since the codec design can
be adapted to different levels of correction). Our codec
design uses parallelism to reduce decoding delays, and
also pipelines the flit-level and packet-level decoding
stages (in BCH codecs, decoding delays dominate).
Finally, we computed the delays
for encoding and decoding for each technology density
using delay projections published by ITRS. This gives us
our Mitigation curve.
We find that this curve tracks the Idealized SRAM
throughput closely, and is well above Hardware (Fig-
ure 10), suggesting that our cross-layer mitigation strate-
gies might help networked systems benefit from faster,
but unreliable, memories. In particular, since Idealized
represents an upper bound for future SRAMs, our miti-
gation approach is close to optimal. To understand why
Mitigation is significantly better than Hardware even
though both of them provide identical error correction
capabilities, note that the decoding delay grows non-linearly
with the correction capability. Thus, it helps significantly to perform
weaker error correction at the flit level, and move some
of the correction responsibility to the packet level. Less
important, although still a factor, is the pipelining be-
tween the two decoding stages.
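A back-of-the-envelope sketch of both effects, assuming (purely for illustration) that decode latency grows quadratically with the correction capability t, and that the flit-level and packet-level decoders are pipelined so throughput is set by the slower stage rather than by the sum of the two delays:

def decode_delay(t, unit=1.0):
    # Illustrative latency model only: decode delay assumed to grow ~quadratically in t.
    return unit * t * t

def single_level_throughput(t_total):
    # All correction at one level, no pipelining: one packet per total decode delay.
    return 1.0 / decode_delay(t_total)

def split_pipelined_throughput(t_flit, t_packet):
    # Weaker flit-level correction plus packet-level correction, with the two
    # decoding stages pipelined: throughput is limited by the slower stage.
    return 1.0 / max(decode_delay(t_flit), decode_delay(t_packet))

# Example: 8 bits of correction at one level vs. a 4 + 4 split across two levels.
print(single_level_throughput(8))        # 1/64 packets per unit time
print(split_pipelined_throughput(4, 4))  # 1/16 packets per unit time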
However, this analysis is preliminary, in that it ignores
many of the other costs: fabrication costs of new ECC
circuitry, testing costs, and the associated yields. Much
of this is left to future work, but we are starting to take
some steps in this direction. We have recently begun
to explore new testing techniques for SRAM [15] (and
TCAM [26]⁷) chips for the kinds of errors caused by
process variations (Section 2). We have shown that exist-
ing techniques are insufficient for catching these errors,
and have designed new tests. However, our work also
demonstrates that test costs increase significantly, result-
ing in possibly increased overall costs. More of this kind
of exploration is needed before we understand some of
the fundamental choices facing future designers of robust
networked systems.
⁷ TCAMs use similar technology to SRAMs.
6 Related Work
The networking community has a long history of
recognizing hardware trends and pursuing research agen-
das to address limitations imposed by hardware. Ex-
amples include developing fast packet forwarding algo-
rithms to deal with memory latency [41], designing novel
data center interconnects to achieve high bisection band-
width using commodity switches [10], devising software
radios to leverage processing power improvements [13],
and exploring in-situ sensing using advances in device
miniaturization [27]. In a departure from this body of
work, our paper systematically explores the effect of
hardware technology slowdowns on networked systems,
explores different mitigation approaches, and discusses
potential ways forward. In [32], we make the case for
developing systematic networking roadmaps; this paper
considers more networked systems than that work does, and also
discusses mitigation strategies, which that work does not.
Mechanisms for reliability have been extensively ex-
plored in the networking literature. These methods have
included using temporal redundancy (ARQ-based ap-
proaches), spatial redundancy (coding, replication) etc.
The literature on the subject is too vast to recount here,
but has clearly influenced our choice of mitigation strate-
gies, all of which have been explored in some context or
another. In particular, TCP has seen much work on loss
recovery; while methods for correct loss recovery in TCP
have been known for a while, fast loss recovery contin-
ues to be the subject of active research. On a related note,
beyond very early papers (e.g., [37]), we have not seen
a discussion of memory errors affecting networked com-
munication. Stone et al. [43, 42] uncover TCP checksum
failures in measured network traffic, but do not identify
memory errors as the root cause.
The dependence of Flash unreliability on pro-
gram/erase cycles, and its cause, tunnel-oxide ag-
ing, have been known since the inception of the tech-
nology [12]. Recent empirical measurements have
documented increasing unreliability
in commercial Flash products [23, 24, 35]; our Flash
roadmap (Section 2.2) combines some of these measure-
ments with the Flash ITRS roadmap to develop Flash re-
liability estimates. Flash-based key-value storage sys-
tems have seen significant recent activity, and that has
motivated our use of these subsystems. In addition to
[11], distributed Flash-based key-value stores are also
discussed in several recent papers that have focused on
different aspects: reducing per-key memory consumption
with system performance and lifetime prediction [30],
achieving extremely low RAM footprint [17], and ex-
ploiting fast sequential write performance using a log-
structure on Flash [16].
7 Conclusion
In this paper, we have explored how memory roadmaps
impact the reliability and performance of networked sub-
systems, and have studied successful mitigation strate-
gies. Our explorations suggest that cautious optimism
may be warranted: they indicate that many success-
ful cross-layer mitigation strategies do exist, although
some of these strategies may be counter-intuitive or be
workload-dependent. Much future work remains: ex-
ploring the cost of mitigation strategies, tracking up-
dates to hardware roadmaps, exploring other networked
subsystems, and, more generally, developing a system-
atic, continuously applied methodology for evaluating
the impact of hardware trends on networked systems.
References
[1] The berkeley snoop protocol. http://nms.lcs.mit.edu/
hari/papers/snoop.html.
[2] Ecc options for improving nand flash memory reliability.
http://www.micron.com/
/media/Documents/Products/
Software%20Article/SWNL implementing ecc.pdf.
[3] Hamming, rs, bch, ldpc - the alphabet
soup of nand ecc. http://www.cyclicdesign.
com/index.php/parity-bytes/3-nandflash/
24-hamming-rs-bch-ldpc-the-alphabet-soup-of-nand-ecc.
[4] Intel solid-state drive 910 series: No spin. all
grin. http://www.intel.com/content/www/us/en/
solid-state-drives/solid-state-drives-910-series.html.
[5] International technology roadmap for semiconductors
2009 design. http://www.itrs.net/Links/2009ITRS/
2009Chapters 2009Tables/2009 Design.pdf.
[6] International technology roadmap for semiconductors
2011 process integration, devices, and structures (pids)
plus. http://www.itrs.net/Links/2011ITRS/2011Tables/
PIDS 2011Tables.xlsx.
[7] International technology roadmap for semiconduc-
tors 2011 system drivers. http://www.itrs.net/Links/
2011ITRS/2011Chapters/2011SysDrivers.pdf.
[8] International technology roadmap for semiconductors
2011 test and test equipment. http://www.itrs.net/Links/
2011ITRS/2011Tables/Test 2011Tables.xlsx.
[9] International technology roadmap for semiconductors.
2011.
[10] M. Al-Fares, A. Loukissas, and A. Vahdat. A scalable,
commodity data center network architecture. In Proc.
ACM SIGCOMM, 2008.
[11] D. Andersen, J. Franklin, M. Kaminsky, A. Phanishayee,
L. Tan, and V. Vasudevan. Fawn: A fast array of wimpy
nodes. In SOSP, 2009.
[12] R. Bez, E. Camerlenghi, A. Modelli, and A. Visconti. In-
troduction to flash memory. Proceedings of the IEEE,
91:489–502, 2003.
[13] V. Bose. Design and implementation of software radios
using a general purpose processor. PhD thesis, MIT,
1999.
[14] J. Brewer and M. Gill. Nonvolatile memory technologies
with emphasis on flash. 2008.
[15] D. Cheng, H. Hsiung, B. Liu, J. Chen, J. Zeng, R. Govin-
dan, and S. Gupta. A new march test for process-variation
induced delay faults in srams. In IEEE Asian Test Sympo-
sium, 2013.
[16] B. Debnath, S. Sengupta, and J. Li. Flashstore: high
throughput persistent key-value store. In VLDB, 2010.
[17] B. Debnath, S. Sengupta, and J. Li. Skimpystash: Ram
space skimpy key-value store on flash-based storage. In
SIGMOD, 2011.
[18] G. DeCandia, D. Hastorun, M. Jampani, G. Kakula-
pati, A. Lakshman, A. Pilchin, S. Sivasubramanian,
P. Vosshall, and W. Vogels. Dynamo: amazon's highly
available key-value store. In SOSP, 2007.
[19] M. Fabiano, M. Indaco, S. Carlo, and P. Prinetto. Design
and optimization of adaptable bch codecs for nand flash
memories. Microprocessors and Microsystems, 37(4-
5):407–419, 2013.
[20] C. Fu and S. Liew. Tcp veno: Tcp enhancement for trans-
mission over wireless access networks. IEEE Journal on
Selected Areas in Communications (JSAC), 21(2):216–
228, 2003.
[21] J. Garcia, J. Corbal, L. Cerda, and M. Valero. Design
and implementation of high-performance memory sys-
tems for future packet buffers. In Proc. IEEE MICRO-36,
2003.
[22] J. Garcia, M. March, L. Cerda, J. Corbal, and M. Valero.
On the design of hybrid dram/sram memory schemes for
fast packet buffers. In Proc. HPSR, 2004.
[23] L. Grupp, A. Caulfield, J. Coburn, S. Swanson,
E. Yaakobi, P. Siegel, and J. Wolf. Characterizing flash
memory: anomalies, observations, and applications. In
Proc. IEEE MICRO-42, 2009.
[24] L. Grupp, J. D. Davis, and S. Swanson. The bleak fu-
ture of nand flash memory. In Proceedings of the 10th
USENIX conference on File and Storage Technologies,
2012.
[25] B. Han, A. Schulman, F. Gringoli, N. Spring, B. Bhat-
tacharjee, L. Nava, L. Ji, S. Lee, and R. Miller. Maranello:
Practical partial packet recovery for 802.11. In USENIX
NSDI, 2010.
[26] H. Hsiung, D. Cheng, B. Liu, R. Govindan, and S. Gupta.
Interplay of failure rate, performance, and test cost in
tcam under process variations. In IEEE Asian Test Sym-
posium, 2013.
[27] C. Intanagonwiwat, R. Govindan, D. Estrin, J. Heide-
mann, and F. Silva. Directed diffusion for wireless sensor
networking. IEEE/ACM Trans. Networking, 11(1):2–16,
2003.
[28] S. Iyer, R. Kompella, and N. McKeown. Designing packet
buffers for router linecards. IEEE/ACM Trans. Network-
ing, 16(3), 2008.
[29] K. Jamieson and H. Balakrishnan. Ppr: Partial packet
recovery for wireless networks. In SIGCOMM, 2007.
[30] H. Lim, B. Fan, D. Andersen, and M. Kaminsky. Silt: A
memory-efficient, high-performance key-value store. In
SOSP, 2011.
[31] B. Liu, H. Hsiung, D. Cheng, R. Govindan, and S. Gupta.
Towards Systematic Roadmaps for Networked Systems.
Technical Report 12-931, University of Southern Califor-
nia, October 2012.
[32] B. Liu, H. Hsiung, D. Cheng, R. Govindan, and S. Gupta.
Towards systematic roadmaps for networked systems. In
ACM HotNets, 2012.
[33] B. Liu, H. Hsiung, D. Cheng, R. Govindan, and S. Gupta.
DECAF: Detecting and characterizing ad fraud in mobile
apps. Technical Report 13-937, University of Southern
California, September 2013.
[34] S. Mascolo, C. Casetti, M. Gerla, M. Sanadidi, and
R. Wang. Tcp westwood: Bandwidth estimation for en-
hanced transport over wireless links. In ACM MobiCom,
2001.
[35] N. Mielke, T. Marquart, N. Wu, J. Kessenich, H. Belgal,
E. Schares, F. Trivedi, E. Goodness, and L. Nevill. Bit
error rate in nand flash memories. In IEEE IRPS, 2008.
[36] S. Mukhopadhyay, H. Mahmoodi, and K. Roy. Modeling
of failure probability and statistical design of sram array
for yield enhancement in nanoscaled cmos. IEEE Tran.
Computer-Aided Design of Integrated Circuits and Sys-
tems, 24(12):1859–1880, 2005.
[37] J. Saltzer, D. Reed, and D. Clark. End-to-end arguments
in system design. ACM Transactions on Computer Sys-
tems (TOCS), 2(4):277–288, 1984.
[38] B. Schroeder, E. Pinheiro, and W. Weber. Dram errors
in the wild: a large-scale field study. In SIGMETRICS,
2009.
[39] Tezzaron Semiconductor. Soft errors in electronic memory: a
white paper. 2004.
[40] V. Sridharan and D. Liberty. A study of dram failures in
the field. In IEEE SC, 2012.
[41] V. Srinivasan, S. Suri, and G. Varghese. Packet classifica-
tion using tuple space search. In Proc. ACM SIGCOMM,
1999.
[42] J. Stone, M. Greenwald, C. Partridge, and J. Hughes.
Performance of checksums and crc’s over real data.
IEEE/ACM Transactions on Networking, 6:529–543,
1998.
[43] J. Stone and C. Partridge. When the crc and tcp checksum
disagree. In SIGCOMM, 2000.
[44] T. Suzuki, I. Hatanaka, A. Shibayama, and H. Akamatsu.
A sub-0.5-v operating embedded sram featuring a multi-
bit-error-immune hidden-ecc scheme. 2006.
[45] T. Flach, N. Dukkipati, A. Terzis, B. Raghavan, N. Cardwell,
Y. Cheng, A. Jain, S. Hao, E. Katz-Bassett, and
R. Govindan. Reducing web latency: the virtue of gentle
aggression. In ACM SIGCOMM, 2013.
[46] D. Wei and P. Cao. Ns-2 tcp-linux: an ns-2 tcp implemen-
tation with congestion control algorithms from linux. In
ValueTool’06 – Workshop of NS-2, 2006.
[47] N. Weste and D. Harris. Cmos vlsi design: A circuits and
systems perspective. 2011.
[48] G. Woo, P. Kheradpour, D. Shen, and D. Katabi. Beyond
the bits: cooperative packet recovery using physical layer
information. In ACM MobiCom, 2007.
[49] D. Yoon and M. Erez. Memory mapped ecc: low-cost
error protection for last level caches. In ISCA, 2009.
[50] J. Zhang, H. Shen, K. Tan, R. Chandra, Y. Zhang, and
Q. Zhang. Frame retransmissions considered harmful:
improving spectrum efficiency using micro-acks. In ACM
MobiCom, 2012.