MEMRISTIVE DEVICE AND ARCHITECTURE FOR ANALOG
COMPUTING WITH HIGH PRECISION AND
PROGRAMMABILITY
by
Wenhao Song
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)
December 2023
Copyright 2023 Wenhao Song
Dedication
To my family
Acknowledgments
I would like to express my deepest gratitude to my advisor and committee chair, Jianhua Joshua
Yang, for his unwavering support and mentorship throughout my Ph.D. journey. Prof. Yang’s
visionary approach to identifying significant challenges has been instrumental in shaping my
research focus and academic pursuits. His commitment to fostering a professional mindset has
significantly contributed to my growth. None of the work in this dissertation would be possible
without him. I would also like to extend my gratitude to Prof. Shuo-Wei (Mike) Chen, Prof. Wei
Wu, and Prof. Aiichiro Nakano for their support and encouragement as part of my dissertation
committee.
I am deeply indebted to Prof. Qiangfei Xia, who introduced me to advanced fabrication and
immensely helped me in our close collaborations. His unremitting dedication to academic identity
and taste sets a guiding model for my own pursuit. I am also thankful to the professors who have
not only instructed me in multidisciplinary courses but also imparted valuable life qualities: Prof.
Mario Parente, Prof. Hossein Pishro-Nik, Prof. Aura Ganz, Prof. Neal G. Anderson, Prof. Lixin
Gao, Prof. Eric Polizzi, Prof. Ramakrishna Janaswamy, Prof. Subhransu Maji, Prof. Wayne
Burleson at UMass, as well as Prof. Tony Levi and Prof. Aluizio Prata at USC.
I gratefully acknowledge the generous support from various funding agencies that made this
research possible, including Air Force Office of Scientific Research (AFOSR) through the
Multidisciplinary University Research Initiative (MURI) program (contract no. FA9550-19-1-0213), USA Air Force Research Laboratory (AFRL) (Prime Contract Nos. FA8650-21-C-5405 and
FA8750-22-1-0501), Army Research Office (grant nos. W911NF2120128 and W911NF1810268),
National Science Foundation (contract no. 2023752 and 2036359), and TetraMem (015542-00001).
Special thanks to Dr. Ye Zhuo. As the only two students who transferred to USC together, we
share many colorful moments from coast to coast. Without your encouragement, many things
would not be possible. I am also grateful to my colleague and friend, Yunning Li, for bringing me
into this great research team and area. Many thanks to Dr. Zhongrui Wang and Dr. Can Li, for their
indispensable assistance on many projects. I had the pleasure of working with all my current and
past colleagues at UMass and USC, including Dr. JungHo Yoon, Dr. Saumil Joshi, Dr. Navnidhi
Kumar Upadhyay, Dr. Mingyi Rao, Dr. Rivu Midya, Shiva Asapu, Dr. Peng Yan, Dr. Xumeng
Zhang, Leibin Ni, as well as Dr. Hao Jiang, Dr. Shuang Pi, Dr. Peng Lin, Daniel Belkin, Puming
Fang, Rui Wang, Dr. Fan Ye, Yi Huang, Fatemeh Kiani, Vignesh Ravichandran at UMass; and Dr.
Mengjiao Li, Dr. Taehwan Moon, Seung Ju Kim, Minjae Kim, Byeongsoo Kang, Piyush Sud,
Ruoyu Zhao, Tong Wang, Yichun Xu, Hanting Liao, Zixu Wang, Zihan Wang and Jian Zhao at
USC. I’d also like to recognize the unselfish and continuous support from many former Hewlett Packard Laboratory members and former and current colleagues at TetraMem.
I would be remiss in not mentioning my family. They always encourage me to pursue my
dream.
Finally, thanks to you, the reader, for picking up this dissertation. I hope you enjoy reading it
as much as I did writing it.
TABLE OF CONTENTS
Dedication....................................................................................................................................... ii
Acknowledgments..........................................................................................................................iii
List of Figures................................................................................................................................ vi
Abstract........................................................................................................................................ viii
Related Publications with Links ..................................................................................................... x
Chapter 1: Introduction................................................................................................................ 1
1.1 Memristor Background................................................................................................ 4
1.2 Field-Programmable Analog Arrays (FPAA) Background .......................................... 8
1.3 Dissertation Organization............................................................................................ 10
Chapter 2: Field-Programmable Analog Arrays (FPAA) .............................................................11
2.1 Introduction and Motivation.........................................................................................11
2.2 memFPAA Based Low/high-pass Filter...................................................................... 14
2.3 memFPAA based Audio Equalizer .............................................................................. 17
2.4 memFPAA based Mixed-frequency Classifier ............................................................ 19
2.5 Methods....................................................................................................................... 24
2.6 Summary ..................................................................................................................... 28
Chapter 3: Denoising memristor for multi-level accurate reading ............................................ 29
3.1 Introduction and Motivation........................................................................................ 29
3.2 Conductance Levels and Arrays on Integrated Chips ................................................. 33
3.3 High Precision Programming Algorithm..................................................................... 36
3.4 Conduction channel evolution in denoising processes................................................ 38
3.5 Switching and denoising mechanisms......................................................................... 41
3.6 Methods....................................................................................................................... 47
3.7 Summary ..................................................................................................................... 51
Chapter 4: Programming with arbitrarily high precision........................................................... 52
4.1 Introduction ................................................................................................................. 52
4.2 High precision obtained with low-precision devices .................................................. 54
4.3 Experimental demonstration of high precision solvers............................................... 61
4.4 Methods....................................................................................................................... 72
4.5 Summary ..................................................................................................................... 86
Chapter 5: Conclusion and Future Work ................................................................................... 87
5.1 Contributions............................................................................................................... 87
5.2 Future Work................................................................................................................. 88
References..................................................................................................................................... 90
List of Figures
Figure 1.1 Typical I-V characteristics of 1T1R devices. ................................................................ 6
Figure 1.2 The retention of memristors of different conductance states. ....................................... 7
Figure 1.3 1T1R Memristor Programming. .................................................................................... 8
Figure 2.1 The memFPAA based on a 1T1R crossbar. ..................................................................11
Figure 2.2 The memFPAA based first-order low/high-pass filter................................................. 15
Figure 2.3 A memFPAA audio equalizer....................................................................................... 17
Figure 2.4 Performance of the audio equalizer with different weight vectors.............................. 19
Figure 2.5 A memFPAA mixed-frequency signal classifier.......................................................... 22
Figure 2.6 Measured Neuron responses of the MemFPAA mixed-frequency classifier............... 23
Figure 2.7 Simulated neuron responses of the MemFPAA mixed-frequency classifier. .............. 24
Figure 2.8 Photograph of the memFPAA measurement system. .................................................. 26
Figure 2.9 Diagram of the memFPAA measurement system........................................................ 27
Figure 3.1 High precision memristor for neuromorphic computing............................................. 32
Figure 3.2 The algorithm of high precision programming. .......................................................... 38
Figure 3.3 Direct observation of the evolution of conduction channels in the denoising process
through conductive atomic force microscope (C-AFM)....................................................... 40
Figure 3.4 Trapped-charge-induced conductance change in incomplete conduction channels. ... 42
Figure 3.5 Mechanism of denoising using subthreshold voltage, identified using C-AFM
measurements and phase-field theory simulations. .............................................................. 45
Figure 3.6 The pattern programming and denoising in a 256 × 256 1-transistor-1-memristor
(1T1M) array......................................................................................................................... 49
Figure 3.7 The testing environment and the schematic of the 1T1R array with its driving
circuits................................................................................................................................... 50
Figure 4.1 Comparison of Arbitrary precision programming and traditional crossbar arrays...... 55
Figure 4.2 Photos and Programmability of the memristor crossbar array on the SoC. ................ 62
Figure 4.3 Experimental results on Poisson solver with arbitrary precision programming with
three arrays............................................................................................................................ 64
Figure 4.4 Hardware recursive least square filter with arbitrary precision programming............ 67
Figure 4.5 Scalability, efficiency and more applications.............................................................. 70
Figure 4.6 Photo of the fully integrated SoC testing platform...................................................... 74
Figure 4.7 Diagram of the analog in-memory computing accelerator SoC (system on chip). ..... 75
Figure 4.8 Experimental results on Poisson solver with arbitrary precision programming with
three arrays on the non-fully integrated memristor platform................................................ 82
Figure 4.9 Experimental results of accumulated summation of 3 subarrays in SoC using
Green’s function preconditioner............................................................................................ 83
Figure 4.10 VMM result comparison between multiple subarrays and one subarray with extra
tuning cycles. ........................................................................................................................ 84
Figure 4.11 Accumulated summation of 5 subarrays using Green’s function preconditioner
in simulation. ................................................................................................................... 85
Abstract
While digital computing dominates the technological landscape, analog computing has superior
energy efficiency and high throughput. However, its historical limitations in precision and programmability have confined its application to specific and low-precision domains, notably in
neural networks. The escalating challenge posed by the analog data deluge calls for versatile
analog platforms. These platforms must exhibit exceptional efficiency and boast reconfigurability
and precision.
Recent breakthroughs in analog devices, such as memristors, have laid the groundwork for
unparalleled analog computing capabilities. Leveraging the multifaceted role of memristors, we
introduce memristive field-programmable analog arrays (FPAAs), mirroring the functionality of
their digital counterparts, field-programmable gate arrays (FPGAs). To elevate precision, we
delve into the origins of reading noise, successfully mitigating it and achieving an unparalleled
2048 conductance levels in individual memristors—equivalent to 11 bits per cell, setting a record
precision among diverse memory types. Acknowledging the persistent demand for single or double
precision in various applications, we propose and develop a circuit architecture and programming
protocol. This innovation enables analog memories to attain arbitrarily high precision with
minimal circuit overhead. Our experimental validation involves a memristor System-on-Chip
fabricated in a standard foundry, demonstrating significantly improved precision and power
efficiency compared to traditional digital systems.
The co-design approach presented empowers low-precision analog devices to perform high-precision computing within a programmable platform. This demonstration underscores the
transformative potential of analog computing, transcending historical limitations and ushering in
a new era of precision and efficiency.
Related Publications with Links
(*: These authors contributed equally.)
W. Song, M. Rao, Y. Li, C. Li, Y. Zhuo, F. Cai, W. Yin, C. Wei, S. Lee, H. Zhu, L. Gong, M.
Barnell, Q. Wu, P. A. Beerel, M. S.-W. Chen, N. Ge, M. Hu, Q. Xia, J. J. Yang, Programming in
memristor arrays with arbitrarily high precision for analog computing. Science, in press (2023).
M. Rao*, H. Tang*, J. Wu*, W. Song*, M. Zhang, W. Yin, Y. Zhuo, F. Kiani, B. Chen, X. Jiang,
H. Liu, H.-Y. Chen, R. Midya, F. Ye, H. Jiang, Z. Wang, M. Wu, M. Hu, H. Wang, Q. Xia, N. Ge,
J. Li, J. J. Yang, Thousands of conductance levels in memristors integrated on CMOS. Nature 615,
823–829 (2023).
https://www.nature.com/articles/s41586-023-05759-5
Y. Li*, W. Song*, Z. Wang, H. Jiang, P. Yan, P. Lin, C. Li, M. Rao, M. Barnell, Q. Wu, S. Ganguli,
A. K. Roy, Q. Xia, J. J. Yang, Memristive Field‐Programmable Analog Arrays for Analog
Computing. Advanced Materials, 2206648 (2022).
https://onlinelibrary.wiley.com/doi/10.1002/adma.202206648
Chapter 1: Introduction
In the big data and IoT (internet of things) era, while ubiquitous sensors are still rapidly growing
in both their number and the rate of generating analog data, it has become too time- and energy-consuming to digitize all the analog data for processing. This calls for fundamental changes in rethinking and redesigning digital systems for various applications, including but not
limited to infrastructure, mobile devices, autonomous systems, robotic systems, medical and health
systems, national security and defense, and energy management. In many of these applications,
energy efficiency and processing throughput are increasingly important for complicated computing
tasks such as classification and video/audio processing while processing accuracy is less critical,
which favors analog data processing over digital.1–5
In addition, it has become increasingly
important to pre-process sensed data and reduce the analog data size by orders of magnitude before
digitizing them, which demands efficient analog circuits. Robots equipped with human-like audio
and video sensing systems are a typical example of such applications.
Nevertheless, the development of analog circuits lags far behind that of their digital counterparts and needs a major boost. One of the main reasons for this situation is the lack of reconfigurable and scalable platforms for fast analog circuit prototyping and verification, analogous to field-programmable gate arrays (FPGAs) for digital circuits; such platforms are field-programmable analog arrays (FPAAs). The recent progress on memristive device technology6–11 may have for the first time
provided potential solutions for large-scale, versatile, high-speed, low-energy FPAAs. Memristors
are non-volatile, highly scalable devices with fast programming speed and multiple conductance states, able to play various roles in the FPAA design. Here, we experimentally
demonstrated a platform of a memristive field-programmable analog array (memFPAA) with
memristive devices serving as a variety of core analog elements and CMOS components as
peripheral circuits. We reconfigured the memFPAA and implement a first-order band pass filter,
an audio equalizer, and an acoustic mixed frequency classifier, as application examples. The
memFPAA, featured with programmable analog memristors, memristive routing networks, and
memristive vector-matrix multipliers, opens opportunities for fast prototyping analog designs as
well as efficient analog applications in signal processing and neuromorphic computing.
Furthermore, we are not satisfied with solving only traditional analog problems, which are typically synonymous with low-precision problems. We also want to solve high-precision problems that are hard or inefficient even for current digital computers, that is to say, high-performance computing problems.
Many complex physical systems can be described by coupled nonlinear equations that must be
analyzed simultaneously at multiple spatiotemporal scales. However, these systems are often too
complex for analytical techniques, and direct numerical computation is hindered by the "curse of
dimensionality," which requires exponentially increasing resources as the size of the problem
increases. These systems can range from nanoscale problems in material modeling to large-scale
problems in climate science. While the need for accurate and high-performance computing
solutions is growing, traditional von Neumann computing architectures are reaching their limits in
terms of speed, power consumption, and infrastructure.
Remarkably, we found this to be possible by exploiting the in-memory computing architecture, which circumvents the memory-processor bottleneck inherent to von Neumann architectures. To achieve efficient in-memory computing, various emerging devices, such as floating-gate transistors12–14, phase-change15–17, ferroelectric18–22, magnetic23, and metal oxide24,6,25–27 materials, have been studied intensively to enable parallel computation of matrix operations (the vector-matrix multiplication, or VMM) in nonvolatile memory crossbars. However, technical challenges such as reading noise and writing variability caused by device-to-device inhomogeneity have limited the scalability, precision, and accuracy28–31 required for high-performance scientific computing.
The first difficulty we solved is minimizing the reading noise to obtain accurate and stable conductance levels for accurate number representation, which is not only necessary for traditional high-performance computing but also beneficial for other applications such as neural networks32,33. Here we report over 2048 conductance levels, the largest number among all types of memories ever reported, achieved with memristors in fully integrated chips with 256 × 256 memristor arrays
monolithically integrated on CMOS circuits in a standard foundry. We have unearthed the
underlying physics that previously limited the number of achievable conductance levels in
memristors and developed electrical operation protocols to circumvent such limitations. These
results reveal insights into the fundamental understanding of the microscopic picture of memristive
switching and provide approaches to enable high-precision memristors for various applications.
The next difficulty we solved is the writing accuracy and device-to-device variations, which
form the major gap between numerical simulation (which usually uses the same model to
represent all memristive devices in the crossbar) and the experimental implementation. This
requires some software-hardware codesign. We proposed a circuit architecture and programming
protocol that enables arbitrarily high precision with minimal circuit overhead and is not limited to
memristor platforms. With a fully integrated System on Chip (SoC) based on memristors, we
experimentally demonstrate substantially enhanced precision and power efficiency over traditional
digital partial differential equation solvers and Recursive Least Square filters for scientific
computing. Our co-design approach allows low-precision analog devices to perform high-precision computing.
1.1 Memristor Background
Memristive devices are electrical resistance switches that can retain a state of internal resistance
based on the history of applied voltage and current. These devices can store and process
information and offer several key performance characteristics that exceed conventional integrated
circuit technology. Memristors usually have a metal-insulator-metal (MIM) structure and are non-volatile devices with good retention (see Figure 1.2), which means that they do not consume
power to maintain their states or require extra memory to store the configuration information. A
memristive FPAA, i.e., memFPAA, can result in a reduced footprint as a highly scalable memristor
can be used to replace multiple CMOS elements while maintaining the same function in many
cases, such as programmable resistors. Memristors have a faster programming speed (e.g., <0.1 ns)34 and multiple stable resistance levels (e.g., 256 levels in Figure 2.1e or even thousands of levels in Figure 3.1g), promising faster and more accurate FPAAs.
Memristors are particularly useful in the form of crossbar arrays. In practice, the so-called
1T1R (a memristor on top of a metal-oxide-semiconductor (MOS) transistor as an access device in
each cell) architecture is widely adopted for many reasons. It allows independent access to
memristors with a linear current–voltage (I–V) relation in an array with the transistor gate control,
so each memristor’s conductance can be precisely tuned. Moreover, unlike passive arrays, a 1T1R
crossbar enables accurate analogue VMM with linear I–V memristors that yield a good
approximation to the scalar product of a vector component and matrix element. A column of cells
shares a common top electrode and a common gate terminal, while a row of cells shares a common
bottom electrode. The memristors have sufficient ON/OFF ratio to meet the needs as switches of
the routing networks in the FPAA fabric. The typical I-V characteristics of 1T1R devices are shown
in Figure 1.1. The transistors can suppress the sneak path currents during memristor programming
and enable rapid tuning of conductance of memristors by adjusting the gate voltage to impose
current compliance for memristor tuning (see Figure 1.3).
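To make the closed-loop programming described above concrete, the sketch below illustrates a write-verify tuning loop of the kind implied by Figure 1.3: read the conductance, compare it with the target, and apply SET pulses (with the gate voltage setting the current compliance) or RESET pulses until the device falls within a tolerance band. This is a minimal illustration, not the actual driver code; all function names, voltage values and step sizes are hypothetical placeholders.

```python
# Minimal sketch of a write-verify (closed-loop) conductance tuning routine for
# one 1T1R cell, in the spirit of Figure 1.3. The callables and voltage values
# below are hypothetical placeholders, not the actual measurement-system API.

def tune_cell(read_conductance, apply_set, apply_reset, g_target, tol,
              v_set=1.5, v_reset=1.2, v_gate=1.0, max_cycles=200):
    """Iteratively SET/RESET one cell until |G - g_target| <= tol (in siemens)."""
    for _ in range(max_cycles):
        g = read_conductance()                 # verify step, e.g. read at 0.2 V
        err = g - g_target
        if abs(err) <= tol:
            return g                           # converged inside the tolerance band
        if err < 0:
            # Too resistive: SET pulse on the top electrode; the transistor gate
            # voltage imposes the current compliance, so it is raised gradually
            # to move the device toward a higher conductance state.
            apply_set(v_top=v_set, v_gate=v_gate)
            v_gate = min(v_gate + 0.05, 2.0)
        else:
            # Too conductive: RESET pulse (negative polarity across the device).
            apply_reset(v_bottom=v_reset)
            v_reset = min(v_reset + 0.05, 2.5)
    return read_conductance()                  # best effort after max_cycles
```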
Figure 1.1 Typical I-V characteristics of 1T1R devices. a. The voltage sweep I-V characteristics of a Ta/HfOx/Pd
memristor. The memristor is switched ON by a positive voltage around 2.5V and switched OFF by a negative voltage
around -4V. The transistor was switched ON during the measurement. b. The corresponding resistance under voltage
sweep, showing about 5 orders of magnitude ON/OFF ratio in the non-volatile memristor. The peak HRS resistance
of the memristor is around 100MΩ. c. Analog switching behaviors of a Ta/HfOx/Pd memristor. The memristor is
switched to various states with various current compliances imposed by different transistor gate voltages. Increased
gate voltage allows a larger current compliance and thus allows switching the memristor to a lower resistance state.
The memristor can be tuned in an analog fashion between 500Ω and 20kΩ.
Figure 1.2 The retention of memristors at different conductance states, read for up to 10,000 seconds at room temperature, 85°C and 150°C, respectively.
1.2 Field-Programmable Analog Arrays (FPAA) Background
Field-programmable analog arrays (FPAAs) were conceptualized in the late 1980s35 and first
commercialized in 1996.36 The expectation was that such an analog platform might be eventually
used as a general-purpose circuit that could be reconfigured directly for different applications or
as a tool for prototyping analog designs. The general form of an FPAA is a monolithic collection
Figure 1.3 1T1R Memristor Programming. a. 3 examples of the conductance tuning process. The first row of panels
shows the measured conductance evolution of the memristors over programming cycles. The lower 3 rows show the
SET voltage (applied to the top electrodes of memristors by column boards), reset voltage (applied to the bottom
electrodes of memristors by row boards), and the transistor gate voltage, respectively. The subsequent voltages are
determined by the conductance feedback as illustrated by the 3 cases of the 3 columns of panels. b. The comparison
between target conductance and the experimentally written values for memristors in a 32×32 array. The blue dots show
the written conductance of memristors, and the red dashed lines show the tolerance range (±10) during
the conductance tuning process. The precision could be potentially increased by setting a narrower tolerance range. c.
The histogram and d. cumulative distribution function (CDF) of the writing error, which is defined as the difference
between final written conductance and the target conductance.
of configurable analog blocks (CABs), a user-controllable routing network (switching matrix) used
for reconfiguring the connections of the building blocks, and a collection of memory elements
used to define both the function and structure (configuration memory).37 The CABs may have
different elements, such as transistors, programmable resistors/capacitors, op-amps, vector-matrix
multipliers (VMMs) etc. However, early CABs lacked basic compact reconfigurable elements,
such as programmable analog resistors, and complicated solutions were used accordingly,38–40
which were very costly in terms of chip area, design complexity, noise level and power
consumption. The early switching matrices in those FPAAs were CMOS-centric, typically employing pass transistors or CMOS transmission gates, which are volatile and thus necessitated the dedicated configuration memory elements in those early FPAAs to store the configuration information. With only a handful of function units, those FPAAs resembled the PLD (programmable logic device) stage of FPGAs in the 1980s. The adoption of floating-gate (FG) transistors into the FPAAs was a great step forward towards a large-scale FPAA.41–48 In these FPAAs, FG transistors
were used as both the switching matrices and programmable resistors, leading to more compact
circuits that have more function units yet use less energy. Nonvolatility of the FG transistor-based
switching matrices obviates the need for memory elements for configuration storage. However,
the common issues of FG transistors, such as high operation voltages, limit the performance of
such FPAAs.
1.3 Dissertation Organization
Chapter 1 introduced the basics of memristors and FPAAs. The rest of the dissertation is organized
as follows. Chapter 2 presents a proof-of-principle demonstration of the reconfigurable analog computing platform memFPAA, which can be reconfigured like an FPGA and also offers preliminary analog computing capability. Chapter 3 presents an effective technique that substantially improves the reading stability of multi-level memristors and achieves thousands of stable conductance states. The possible mechanism of the reading noise is studied via C-AFM measurements and first-principles calculations. Chapter 4 presents a new algorithm-hardware co-design that enables arbitrarily high precision with minimal circuit overhead and experimentally solves several high-performance computing problems. Finally, Chapter 5 concludes this dissertation with an outlook on future work.
Chapter 2: Field-Programmable Analog Arrays (FPAA)
2.1 Introduction and Motivation
Figure 2.1 The memFPAA based on a 1T1R crossbar. a. A memFPAA architecture using memristor crossbar array
as routing network. A common CAB may consist of transistors, capacitors, transimpedance amplifiers (TIA),
memristors, memcapacitors, op-amps, as well as analog vector-matrix multipliers (VMM) based on memristor
subarrays. b. Schematic of the 1T1R crossbar. TE is short for Top Electrode and BE for Bottom Electrode. The
transistors not only serve as the accessing devices to avoid sneak path currents during memristor programming but
also facilitate analog memristor programming with gate-voltage-induced current compliance. The memristors form a
crossbar array if all transistors are switched ON. c. Optical image of an operational 32×32 1T1R array, which can be
connected to external circuits by a probe card. d. The scanning electron micrograph of 1T1R cells in the 32×32
array. Zoomed-in image is a memristor in a 1T1R cell. e. DC voltage sweeps of 256 different memristor conductance
states from 50 to 2000 μS. The states are evenly distributed and show good I-V linearity.
The recent progress on memristive device technology6–11 may have for the first time provided
potential solutions for large-scale, versatile, high-speed, low-energy FPAAs49. Memristors are nonvolatile devices, which means that they do not consume power to maintain their states or require
extra memory to store the configuration information. A memristive FPAA, i.e., memFPAA, can
result in a reduced footprint as a highly scalable memristor can be used to replace multiple CMOS
elements while maintaining the same function in many cases, such as programmable resistors.
Memristors have a faster programming speed (e.g. <0.1ns)34 and multiple stable resistance levels
(e.g. 256 levels in Figure 2.1e or even thousands of levels,50 promising faster and more accurate
FPAAs. The functions of key components in the FPAAs, including efficient switching matrices,
programmable resistors, and vector-matrix multipliers can all be naturally achieved with
memristive devices as demonstrated in this study, which differentiates this study from the previous
memristor FPAA studies, where only a couple of volatile memristors were used on a feedback path
to make the gain of an amplifier adaptive46. Various analog computing tasks, including band-pass
filters, an audio equalizer and an acoustic mixed frequency classifier, have been experimentally
implemented in our memFPAA as detailed below. As shown in Figure 2.1a, a memFPAA may
consist of multiple CABs connected by reconfigurable routing networks, which can also be
realized by using crossbars of memristors with a very large ON/OFF ratio. Each CAB may consist
of different components of different granularity levels, for example, transistors, memristors,
memcapacitors, capacitors, op-amps, as well as VMMs etc. Memristors can be used as switches
on the fabric of the memFPAA. A memristor in its low resistance state (LRS) works as an ON-state switch, which electrically bridges components. Conversely, a memristor in its high resistance state (HRS) works as an OFF-state switch, cutting the connection so that only negligible signal can pass through it. For demonstration purposes, a memristive crossbar with a One-Transistor-One-Memristor (1T1R)
structure is used in our memFPAA design, comprising integrated hafnium oxide (Ta/HfO2/Pd)
memristors built on the drain terminals of n-type enhancement mode metal-oxide-semiconductor
(MOS) transistors.51 In principle, the transistor in the 1T1R cell can be replaced by a selector to form a so-called 1-Selector-1-Memristor (1S1R) cell, which maintains the nonvolatility and
improves the scalability and stackability. Figure 2.1b shows the schematic of an example 1T1R
array, with photographed details of cells in a 32×32 array illustrated in Figure 2.1c and d. A column
of cells shares a common top electrode and a common gate terminal, while a row of cells shares a
common bottom electrode. The memristors have sufficient ON/OFF ratio to meet the needs as
switches of the routing networks in the FPAA fabric (See Figure 1.1.) The transistors can suppress
the sneak path currents during memristor programming and enable rapid tuning of conductance of
memristors by adjusting the gate voltage to impose current compliance for memristor tuning (See
Figure 1.3.). More importantly, to fulfill the role of a continuously programmable resistor, each
individual memristor exhibits stable multilevel analog programmability and good I-V linearity in
the conductance range from 50 to 2000 μS as shown in the Figure 2.1e (See Figure 1.2 for retention
of different conductances.) Such a crossbar array can naturally implement analog vector-matrix multiplication by using Ohm’s Law for multiplications and Kirchhoff’s Current Law for current summations simultaneously across the entire crossbar array, resulting in fast, low-power and high-density computing compared to its digital counterpart with the von Neumann architecture.52–54
It is worth noting that while memristive devices are adopted to more efficiently perform some
analog functions in the memFPAA, CMOS peripheral circuits are also critical components for such
memristor/CMOS hybrid systems.
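As an illustration of the analog VMM just described (Ohm’s Law for the per-cell multiplications and Kirchhoff’s Current Law for the column-wise summation), the short numerical sketch below models a crossbar as a conductance matrix; the array size, conductance range and TIA feedback value are illustrative assumptions, not the exact experimental parameters.

```python
# Numerical illustration of crossbar VMM: each cell contributes I = G * V
# (Ohm's Law) and currents add along a column (Kirchhoff's Current Law),
# so the column currents are simply G^T @ v. Values are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
G = rng.uniform(50e-6, 2000e-6, size=(24, 15))   # conductance matrix, 50-2000 uS
v_in = rng.uniform(-0.2, 0.2, size=24)           # row input voltages (V)

i_out = G.T @ v_in                               # column output currents (A)
v_out = -10e3 * i_out                            # read out by TIAs (10 kOhm feedback, assumed)
```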
2.2 memFPAA Based Low/high-pass Filter
Analog filters are commonly used in sensing signal processing systems. However, the traditional
filters built with switched capacitors are very bulky: a large number or area of capacitors is needed for each filter, which is too big and costly for integrated edge sensing modules. Alternatively, such a filter can be achieved by combining one memristor shown in Figure 2.1e with one capacitor, essentially two devices instead of hundreds or thousands. A band-pass
filter could be realized by connecting a high pass filter with a low pass filter in series. To illustrate
the reconfigurability of the memristor based FPAA, we show the continuous tuning of the cutoff frequency of the first-order low-pass and high-pass filters, mathematically defined by $|H_{\mathrm{LP}}| = 1/\sqrt{1+(f/f_0)^2}$ and $|H_{\mathrm{HP}}| = 1/\sqrt{1+(f_0/f)^2}$, respectively, where $|H|$ is the amplitude gain of the signal, $f$ is the input signal frequency, and $f_0 = 1/(2\pi RC)$ is the $-3$ dB cutoff frequency of the filter. A low-pass filter effectively transmits lower-frequency signals and attenuates higher frequencies, while a high-pass filter does the opposite, transmitting high-frequency signals. Both filters employ a resistor
and a capacitor as depicted in Figure 2.2a. In the low-pass filter, the capacitor exhibits reactance,
and blocks low-frequency signals, forcing them through the output load instead. At higher
frequencies the reactance drops, and the capacitor effectively functions as a short circuit. Different
Figure 2.2 The memFPAA based first-order low/high-pass filter. a. The schematic of a first order low-pass filter
with a resistor in series with the load, and a capacitor parallel to the load, and the first order high-pass filter, with
swapped resistance and capacitance. b. A memristor-based first-order low-pass filter. The color of memristors
represents the conductance of devices; yellow indicates the device is used as an ON-state switch and deep blue as an
OFF-state switch. Other color shows the memristor is of intermediate conductance, as indicated by the color bar. c.
Like b, the implementation of a memristor-based first-order high-pass filter. d. The frequency responses of the low
pass filter configured with different memristor resistances, covering the full audio frequency range from 20 Hz to 20 kHz.
The main curve shows the median, while the shadow area shows the 25% and 75% quartiles of over 10 repeated
measurements. e. Like d, the frequency responses of the high pass filter configured with different memristor resistance.
from a conventional low-pass filter, a memristor based low-pass filter employs a memristor as a
reconfigurable analog resistor in series with the load (Figure 2.2b), and the analog programmability
of the memristor yields a continuous tuning ability of the cutoff frequency of the filter, as shown
in Figure 2.2d.55–57 For demonstration purposes, the memristor was programmed in the range from 1 kΩ to 20 kΩ with three different capacitances (10, 100 and 470 nF) to cover the whole audio frequency range (20 Hz to 20 kHz). Similarly, the high-pass filter could be constructed by swapping
the resistance and capacitance elements (Figure 2.2c) featuring the same continuously tunable
cutoff frequency, as shown in Figure 2.2e.
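The relation between the programmed memristor resistance and the filter response can be checked numerically. The sketch below evaluates the first-order transfer functions given above for a few resistance values within the 1 kΩ to 20 kΩ programming range; the 100 nF capacitance is one of the three values used, and the specific R values are only examples.

```python
# Sketch: cutoff frequency f0 = 1/(2*pi*R*C) and first-order filter gains
# |H_LP| = 1/sqrt(1+(f/f0)^2), |H_HP| = 1/sqrt(1+(f0/f)^2). Illustrative values.
import numpy as np

C = 100e-9                                          # 100 nF, one of the capacitors used
f = np.logspace(np.log10(20), np.log10(20e3), 200)  # audio band, 20 Hz - 20 kHz

for R in (1e3, 5e3, 20e3):                          # within the 1 kOhm - 20 kOhm range
    f0 = 1.0 / (2 * np.pi * R * C)                  # -3 dB cutoff frequency
    h_lp = 1.0 / np.sqrt(1.0 + (f / f0) ** 2)       # low-pass gain vs frequency
    h_hp = 1.0 / np.sqrt(1.0 + (f0 / f) ** 2)       # high-pass gain vs frequency
    print(f"R = {R/1e3:g} kOhm -> f0 = {f0:.0f} Hz")
```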
2.3 memFPAA based Audio Equalizer
Based on the filters implemented above, we further constructed a memFPAA audio equalizer that
adjusts the balance of different frequency components of an audio signal.58 Such an equalizer has wide applications in consumer electronics and telecommunications, either as an independent
Figure 2.3 A memFPAA audio equalizer. a. The schematic of an audio equalizer consisting of a bank of cascaded
high and low pass filters followed by a VMM. b. The 24×15 memFPAA subarray configured as an audio equalizer.
Memristors and switches in bandpass filters and the weight network in a are implemented in the crossbar subarray,
while off-chip TIAs, buffers, capacitors, as well as inputs and outputs are also shown in the figure. The color behind
the cross point of the grids represents the programmed conductance of the memristor in the array. The top right 4
memristors are the equalizing weights which can be reconfigured. c-f. 4 example frequency responses of the equalizer
with different weight vectors reconfigured. Red solid lines show the measured frequency response of the output of the
equalizer. Blue dashed lines are the responses of each band-pass filter. The 4 color blocks show the conductance of the memristors of the 4 bands. As indicated, a high-conductance weight memristor retains the signal amplitude of the corresponding frequency band while a low-conductance memristor suppresses that band.
device in acoustic applications or as the intermediate stage of a complicated audio system (e.g.
pre-processing for speech recognition).59–61 A commonly used equalizer architecture in audio
processing consists of parallel band-pass filters, buffers, and a weight network implemented by a VMM, which
can be realized using memristors as shown in Figure 2.3a. The band-pass filters pick specific
frequency bands of the input signal and send them to buffers. The weight network modulates the
amplitude of each frequency band component and sums them up to produce a signal with adjusted
balance. The memFPAA audio equalizer was experimentally built with a 32×32 1T1R crossbar
array, external capacitors, and op-amps as shown in Figure 2.3b. As shown in the conductance map,
a routing network based on a memristor subarray with 24 rows and 15 columns was used to
connect separate components. The first-order high and low pass filters employ programmable
memristors in series with capacitors, which could be easily reconfigured to produce different
frequency responses. After passing through the buffers, the 4 frequency band components are
modulated by the weighted VMM network according to the equation $V_{\mathrm{out}}(t) = R_f \sum_{j=1}^{4} V_j(t) \cdot G_j$, where $V_{\mathrm{out}}(t)$ is the output voltage, $R_f$ is the feedback resistance of the transimpedance amplifier, $V_j(t)$ is the voltage signal from the j-th band-pass filter, and $G_j$ is the equalizing weight of the j-th frequency band represented by the conductance of its corresponding memristor. If a
weight memristor has a high conductance, the corresponding frequency band signal is retained,
while a memristor with a low conductance weight significantly suppresses the associated signal.
To show the reconfigurability of the weight memristors, we programmed 4 different weight vectors
by tuning the weight memristors to obtain different frequency responses, as shown in Figure 2.3c-f. A weight vector with all individual weights large (Figure 2.3c) or small (Figure 2.3f) retains or depresses the signals in all bands, while weight vectors with non-uniform weight elements selectively emphasize signals in the frequency range of interest, as shown in Figure 2.3d-e. (See Figure
2.4 for more different frequency responses.)
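A behavioural sketch of the equalizer’s weight stage is given below: four band signals are scaled by their weight conductances and summed, following the expression for $V_{\mathrm{out}}(t)$ above. The band waveforms, feedback resistance and conductance values are invented for illustration and are not the measured ones.

```python
# Behavioural sketch of the equalizer weight network: the output is the weighted
# sum of the four band-pass outputs, V_out(t) = R_f * sum_j V_j(t) * G_j.
# All signals and component values below are illustrative.
import numpy as np

t = np.linspace(0, 5e-3, 5000)                       # 5 ms of signal
band_centres = [60, 600, 3000, 12000]                # hypothetical band centres (Hz)
V_bands = np.array([np.sin(2 * np.pi * fb * t) for fb in band_centres])

R_f = 10e3                                           # TIA feedback resistance (assumed)
G = np.array([1500e-6, 100e-6, 1500e-6, 100e-6])     # weight conductances (S)

v_out = R_f * (G @ V_bands)                          # equalized output waveform
```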
2.4 memFPAA based Mixed-frequency Classifier
Such an audio equalizer shares similarities with some parts of our hearing system. The human hearing system samples and differentiates sounds at different frequencies, and the action potentials are
Figure 2.4 Performance of the audio equalizer with different weight vectors. The red solid lines are the measured frequency response of the output of the equalizer. Blue dashed lines are the responses of the band-pass filters. A high-conductance weight memristor retains the signal amplitude of the corresponding frequency band, while a low-conductance memristor suppresses the signal.
transmitted to the auditory cortex in our brain for further processing (e.g., classification). Our
memFPAA can perform a similar function given that the memristor VMM CAB can serve as a
single-layer perceptron neural network to classify vectors in a hyperplane.29-32 Here we
demonstrate a simple memFPAA based mixed-frequency classifier. As shown in Figure 2.5a, the
band pass filters play the role of the basilar membrane of the human hearing system where different
parts are selectively resonant to sinusoidal inputs of different frequencies. The signal is then fed to
the buffers and output to the single layer perceptron network, with memristor synapses and peak
detector neurons. Each post-synaptic neuron, consisting of a transimpedance amplifier and a peak
detector, receives its own weighted sum of the signals of band pass filters. Since the input signal
is a combination of signals of different frequencies, the neuron associated with synapses that have
larger weights on the matched frequencies (with the input signals) will yield a larger output voltage.
This process closely mimics the process in which hair cells sense the vibration of the basilar
membrane and transmit signals to the cerebral cortex. For demonstration purposes, we experimentally constructed 6 different temporal input patterns, where each pattern is a combination of sinusoidal waves of two different frequencies, as illustrated in Figure 2.5b. The input, after passing through the band-pass filters, is then sent into a pre-trained 4×6 memristive synaptic array with measured weights depicted in Figure 2.5c. The outputs are the measured peak voltages of the 6 post-synaptic
neurons. They produce unique responses to each input, correctly classifying the input signals of
different combinations of frequencies, as shown in Figure 2.5d. (See also Figure 2.4 and Figure 2.6.) Figure 2.5e shows the outputs for input pattern P4 as an example. Input P4 is the sum of two
sinusoidal waves of frequencies 200Hz and 2kHz. The neurons N1 and N5 receive signals from
the 200Hz filter with large-weight synapses and the 2kHz filter with small-weight synapses, so
they output waveforms with dominant frequency of 200Hz at 1V peak. Similarly, the neurons N2
and N6 output waveforms with dominant frequency of 2kHz at 1V peak. The synapses associated
with neuron N3 to both 200Hz and 2kHz filters are of low weight, so N3 outputs almost zero signal.
On the contrary, neuron N4 observes a waveform close to the input signal since its synapses
connecting both 200Hz and 2kHz filters are of large weights, producing the highest output peak
~2V. This proof of principle demonstration of the memFPAA acoustic classifier could be expanded
to handle tasks with higher complexity, such as speech recognition.
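The classifier readout described above can be summarized in a few lines: each post-synaptic neuron forms a weighted sum of the band-pass filter outputs through its synapse conductances and a TIA, and a peak detector reports the maximum amplitude. The sketch below uses idealized filter outputs for pattern P4 (200 Hz + 2 kHz) and random synapse conductances purely for illustration; the real 4×6 weights are the measured ones in Figure 2.5c.

```python
# Sketch of the single-layer perceptron readout with peak-detector neurons.
# Filter outputs are idealized and the 4x6 synapse weights are random here,
# only to illustrate the signal flow (not the measured array of Figure 2.5c).
import numpy as np

t = np.linspace(0, 20e-3, 20000)
band_f = [20, 200, 2000, 20000]                      # the four band-pass centres (Hz)
W = np.random.default_rng(1).uniform(50e-6, 2000e-6, size=(4, 6))  # synapse conductances

# Idealized filter-bank response to input pattern P4 = 200 Hz + 2 kHz.
filter_out = np.array([np.sin(2 * np.pi * f * t) if f in (200, 2000)
                       else np.zeros_like(t) for f in band_f])

R_f = 10e3                                           # TIA feedback resistance (assumed)
neuron_v = R_f * (W.T @ filter_out)                  # six TIA output waveforms
peaks = neuron_v.max(axis=1)                         # peak-detector readout per neuron
winner = int(np.argmax(peaks))                       # neuron with the strongest response
```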
Figure 2.5 A memFPAA mixed-frequency signal classifier. a. Schematic of the classifier circuit consisting of a bank
of cascaded high and low pass filters, serving the role of the basilar membrane of the human hearing system, followed
by a 4×6 VMM as synaptic array with 6 neurons implemented by TIAs and peak detectors. b. 6 different temporal
input patterns (P1~P6), each of which is the composition of two sinusoidal waves of different frequencies. (From the
left, 20Hz+200Hz, 20Hz+2kHz, 20Hz+20kHz, 200Hz+2kHz, 200Hz+20kHz, and 2kHz+20kHz.) c. Measured
conductance map of the 4×6 VMM synaptic array. d. Measured output peak voltages of the post-synaptic neurons to
the input patterns. Each individual input pattern is associated with a unique response of the output artificial neurons.
e. The temporal output responses of the 6 output neurons N1~N6 to input pattern P4. The red signals are voltages
measured after the TIAs and the blue signals are voltages measured after the peak detectors.
Figure 2.6 Measured Neuron responses of the MemFPAA mixed-frequency classifier.
The upper schematic shows the synaptic weight matrix. The temporal responses of the 6 output neurons (as indicated
by the black arrows) to the 6 input patterns (P1, P2, …, P6) are illustrated in the lower 6×6 panels.
2.5 Methods
Device fabrication: Transistors with a feature size of 2 μm were used in this work. The transistors
were fabricated in a commercial fab, which gives small wire resistance (about 0.3Ω per block).
The wire resistance per block is calculated by dividing the measured resistance of the entire TE or
BE line by 32, in the case of a 32x32 array. Photolithography, thin film deposition, and liftoff were
used to integrate the memristors with the wired transistors. In order to remove native metal oxide
layers, argon plasma treatment was applied to the transistor chip, which gives a better electrical
Figure 2.7 Simulated neuron responses of the MemFPAA mixed-frequency classifier. The temporal responses of
the 6 output neurons (N1, N2, …, N6) to the 6 input patterns (P1, P2, …, P6) are illustrated in the 6×6 panels.
connection. Metal vias were created by sputtering of 5-nm silver (Ag) and 200-nm palladium (Pd),
followed by lifting off in warm acetone. Samples were annealed at 300 °C in nitrogen ambience
(flow 20 sccm) for half an hour. A 60-nm thick Pd bottom electrode was sputtered on a 5-nm
tantalum (Ta) adhesion layer. To ensure high film quality and step coverage, we used water
and tetrakis (dimethylamido) hafnium as precursors for depositing a 5-nm HfO2 switching layer
by ALD at 250 °C. Photolithography and reactive ion etch (RIE) using CHF3/O2 patterned the
switching layer. Lastly, the top electrode, a 50-nm-thick Ta layer covered by a 10-nm-thick Pd passivation layer, was sputtered and lifted off.
Electrical measurements: Details of the memristor-based FPAA measurement setup are shown in Figure 2.8. The FPAA circuit consists of a chip with a 32×32 1T1R crossbar array accessed by a
probe card and an extended PCB containing op-amps, TIAs and capacitors. The 1T1R array was
programmed by an external PCB system. After the 1T1R array was programmed, it was connected
to the extended PCB for measurements. The input signal was produced by a Keysight 33220a
waveform generator, while the output was collected by a Keysight 3104T oscilloscope. MATLAB
scripts were used for automatic data collection and memristor programming. The final results were
obtained by applying inputs from the waveform generator and measuring the outputs from PCB
via oscilloscope. The system was not “fully” integrated, as the chip and the extended measurement
board were connected via high-speed cables, as shown in the Figure 2.8 photo. The chip-to-board
connection for audio equalizer and mixed-frequency signal classifier experiments are shown as
diagrams in Figure 2.9.
Statistical Analysis: Each curve in Figure 2.1e shows raw measurement data. Each curve and shaded area in Figure 2.2d-e shows quartiles calculated from 10 repeated measurements after robust linear regression with a heuristic window size. The images and curves in Figure 2.3 and Figure 2.5 show raw measurement data. All data were processed in MATLAB.
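For clarity, the quartile processing mentioned above amounts to the following computation (the actual scripts were written in MATLAB; this Python equivalent with synthetic data is only a sketch of the statistics, omitting the robust regression step):

```python
# Sketch of the statistics behind Figure 2.2d-e: the plotted curve is the median
# of repeated measurements and the shaded band spans the 25th-75th percentiles.
# Synthetic data stand in for the 10 repeated frequency-response measurements.
import numpy as np

n_repeats, n_freqs = 10, 200
gain_db = np.random.default_rng(2).normal(-3.0, 0.2, size=(n_repeats, n_freqs))

median = np.percentile(gain_db, 50, axis=0)   # main curve
q25 = np.percentile(gain_db, 25, axis=0)      # lower edge of the shaded area
q75 = np.percentile(gain_db, 75, axis=0)      # upper edge of the shaded area
```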
Figure 2.8 Photograph of the memFPAA measurement system. a. The 1T1R array chip is accessed by a probe card.
It can be connected to the extended circuit, which contains off-the-shelf CAB elements (e.g., op-amps, capacitors) for
measurements, or the row and column boards for programming. This switch is controlled by a multi-channel
multiplexer. The input signal is generated by a Keysight 33220a waveform generator, while the output signal is
collected by a Keysight 3104T oscilloscope. b. Top view of the 32 x 32 probe card. c. Optical microscopic image
showing a 32x32 1T1R array in contact with the probe card’s probes. d. Front view of the 8 column boards (left
boards) and 8 row boards (right boards). Each column board can drive 16 channels of analog voltages and sense 8
channels of currents simultaneously. Every row board can drive 16 channels of analog voltage simultaneously. All
boards are mounted on a motherboard communicating with a computer via an MCU for programming control.
Figure 2.9 Diagram of the memFPAA measurement system. a. Diagram for the memFPAA audio equalizer in Figure 2.3. A 24x15 1T1R subarray is used and connected to the extended circuit as shown in Figure 2.8. b. Diagram for the memFPAA mixed-frequency signal classifier in Figure 2.5. A 24x20 1T1R subarray is used and connected to the extended circuit as shown in Figure 2.8.
2.6 Summary
A novel implementation of FPAA blocks with memristor crossbar arrays has been developed. In
this new reconfigurable analog computing structure, different types of memristors play various
critical roles, not only as switches in the fabric network, but also as reconfigurable resistors for
VMMs and other analog units, such as tunable filters. A variety of computing functions have been
experimentally implemented with the memFPAA. Like FPGAs, large-scale FPAAs may be used
either as a general purpose reconfigurable analog circuit or as a platform to quickly prototype
analog circuit designs, which can reduce the design time from months to minutes. FPAAs based
on memristor crossbar arrays, as demonstrated in this study, are expected to significantly
advance the development and applications of analog circuits to meet the increasing needs of analog
computing.
Chapter 3: Denoising memristor for multi-level
accurate reading
3.1 Introduction and Motivation
Neural networks based on memristive devices9,62,63 have shown potential in substantially
improving throughput and energy efficiency for machine learning64,65 and artificial intelligence66,
especially in edge applications67–81. Because training a neural network model from scratch is very
costly in terms of hardware resources, time, and energy, it is impractical to do it individually on
billions of memristive neural networks distributed at the edge. A practical approach would be to
download the synaptic weights obtained from the cloud training and program them directly into
memristors for the commercialization of edge applications (Figure 3.1a). Some post-tuning in
memristor conductance to adapt to local situations may follow afterwards or during applications.
Therefore, a critical requirement on memristors for neural network applications is a high-precision
programming ability to guarantee uniform and accurate performance across a massive number of
memristive networks32,54,82–86. That translates into the requirement of many distinguishable
conductance levels on each memristive device, not just lab-made devices but more importantly,
devices fabricated in foundries. High precision memristors also benefit other neural network
applications, such as training and scientific computing32,33. Here we report over 2048 conductance
levels, the largest number among all types of memories ever reported, achieved with memristors
in fully integrated chips with 256 × 256 memristor arrays monolithically integrated on CMOS
circuits in a standard foundry. We have unearthed the underlying physics that previously limited
the number of achievable conductance levels in memristors and developed electrical operation
protocols to circumvent such limitations. These results reveal insights into the fundamental
understanding of the microscopic picture of memristive switching and provide approaches to
enable high-precision memristors for various applications.
Memristive switching devices are known for their relatively large dynamical range of
conductance, which can potentially lead to a large number of discrete conductance levels. Studies
have tried to improve the programming speed and accuracy for certain applications like neural networks87. However, the highest number reported to date has been no more than two hundred82.
There are no forbidden conductance states within the dynamical range of the device since a
memristor is typically analog and can, in principle, achieve an infinite number of conductance
levels. However, the fluctuation commonly observed at each conductance level (Figure 3.1e) limits
the number of distinguishable levels achievable within a specific conductance range. Interestingly,
we found that such fluctuation can be substantially suppressed, as shown in Figure 3.1e and f, by
applying appropriate electrical stimuli (termed as ‘denoising’ processes). Importantly, such
denoising process does not require any extra circuitry beyond the normal read and program circuits.
We incorporated the denoising process into device tuning algorithms and successfully programmed
a commercial-semiconductor-manufacturer-made memristor (Figure 3.1b-d) into 2048
conductance levels (Figure 3.1g), corresponding to 11-bit resolution. Conductive atomic force
microscopy (C-AFM) was employed to visualize the evolution of conduction channels during
programming and denoising processes. We discovered that a normal switching operation (SET or
RESET) always ends up with some incomplete conduction channels, which appear as islands or
blurry edges along the main conduction channel and are more resistive and less stable than the
main conduction channel. First principle calculations suggest that these incomplete channels are
unstable phase boundaries, and their conductance is sensitive to trapped charges, contributing to
the large fluctuations of each conductance level. We revealed, experimentally and theoretically,
that an appropriate voltage in the denoising process either annihilates (weakens) or completes
(enhances) these incomplete channels, resulting in a great reduction in fluctuation and a significant
increase in memristor precision. The observed phenomena generally exist in memristive switching
processes with localized conduction channels, and the insights can be applied to most memristive
material systems for scientific understanding and technological applications.
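The program-then-denoise flow described in this chapter can be summarized as the loop sketched below: after ordinary closed-loop programming, the read fluctuation (the standard deviation of repeated reads at a constant voltage) is checked, and small denoising pulses are applied until the state is quiet. The function names, pulse parameters and thresholds are placeholders, not the actual on-chip protocol.

```python
# Hedged sketch of program-then-denoise: write-verify first, then apply small
# 'denoising' pulses while the read fluctuation (std of repeated 0.2 V reads)
# remains above a tolerance. All callables and values are placeholders.
import statistics

def program_with_denoise(read_current, write_to_target, apply_denoise_pulse,
                         noise_tol, n_reads=100, max_denoise=20):
    write_to_target()                               # ordinary closed-loop programming
    for _ in range(max_denoise):
        samples = [read_current(v_read=0.2) for _ in range(n_reads)]
        if statistics.pstdev(samples) <= noise_tol:
            return True                             # fluctuation (RTN) suppressed
        apply_denoise_pulse()                       # small pulse below the switching threshold
    return False                                    # state still noisy after max attempts
```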
Figure 3.1 High precision memristor for neuromorphic computing. a, Proposed scheme of the large-scale
application of memristive neural networks for edge computing. Neural network training is performed in the cloud.
The obtained weights are downloaded and accurately programmed into a massive number of memristor arrays
distributed at the edge, which imposes high-precision requirements on memristive devices. b, An eight-inch wafer
with memristors fabricated by a commercial semiconductor manufacturer. c, High-resolution transmission electron
microscopy image of the cross-section view of a memristor. Pt and Ta serve as the bottom electrode (BE) and top
electrode (TE), respectively. Scale bars, 1 μm and 100 nm (inset). d, Magnification of the memristor material stack.
Scale bar, 5 nm. e, As-programmed (blue) and after-denoising (red) currents of a memristor are read by a constant
voltage (0.2 V). The denoising process eliminated the large-amplitude RTN observed in the as-programmed state
(see Methods). f, Magnification of three nearest-neighbour states after denoising. The current of each state was read
by a constant voltage (0.2 V). No large-amplitude RTN was observed, and all of the states can be clearly
distinguished. g, An individual memristor on the chip was tuned into 2,048 conductance levels by high-resolution off-chip driving circuitry, and each level was read by a d.c. voltage sweeping from 0 to 0.2 V. The target conductance was set from 50 µS to 4,144 µS with a 2-µS interval between neighbouring levels. All readings at 0.2 V are less than 1 µS from the target conductance. Bottom inset, magnification of the conductance levels. Top inset, experimental results of an entire 256 × 256 array programmed by its 6-bit on-chip circuitry into 64 blocks of 32 × 32 devices, with each block programmed into one of 64 conductance levels. Each of the 256 × 256 memristors had previously been switched over one million cycles, demonstrating the high endurance and robustness of the devices.
3.2 Conductance Levels and Arrays on Integrated Chips
Memristors used in this study were fabricated on an eight-inch wafer by a commercial
semiconductor manufacturer (Figure 3.1b). Details about the fabrication process are provided in
the Methods. Cross-section views of a memristor are shown in Figure 3.1c, and the crucial resistive
switching layers are magnified in Figure 3.1d. The device, which consists of a Pt bottom electrode,
a Ti/Ta top electrode and a HfO2/Al2O3 bilayer, was fabricated in a 240-nm via above the CMOS
peripheral circuitry. The Al2O3 and Ti layers are designed to be thin (<1 nm) so that they behave as a mixed layer rather than as two separate continuous layers. When the bottom electrode is grounded,
the device can be switched by applying either a sufficiently positive voltage (for set) or a negative
voltage (for reset) to the top electrode. The fluctuation level (characterized by the standard
deviation of a measured current under a constant voltage) after a set or a reset operation is
distributed in a wide range. The result indicates that an as-programmed state typically has large
fluctuations. This considerably limits the applications of memristors, but is a characteristic of
memristive materials more generally88–91. The data also show that a set operation tends to induce
a larger fluctuation in an as-programmed state than does a reset operation. Such reading
fluctuations mainly consist of random telegraph noise (RTN), which typically has step-like
transitions between two or more current levels at random time points under a constant reading
voltage. Such RTN generally exists in memristors. Even fluctuations that do not seem step-like
may in fact be made of RTN92, which can be shown only when the measurement sampling rate
is higher than the RTN frequency. It has been demonstrated previously by simulations that
memristor RTN may be caused by charges occasionally trapping into certain defects and blocking
conduction channels because of Coulomb screening89,93. However, experiments that directly link
trapped charges, conduction channel(s) and RTN, and how to remove it, are missing. Although this
is a critical issue for memristors in general, it has been unclear how to reduce the RTN in
memristors. These experiments are important not only for understanding the physical origin of
memristor RTN but also for revealing the entire microscopic picture of memristive switching and
providing possible solutions to high-precision memristors.
We discovered that the fluctuation level could be greatly reduced by applying small voltage
pulses with optimized amplitude and width. An example is given in Figure 3.1e, in which an as-programmed state with a considerable fluctuation (blue) was stabilized into a low-fluctuation state (red) by denoising pulses. Using a three-level feedback algorithm devised for denoising, as shown in
Figure 3.2, a single memristor was tuned into 2,048 conductance states between 50 and 4,144 µS,
with a 2-µS interval between every two neighbouring states. All states were read by a voltage
sweeping from 0 to 0.2 V, as shown in Figure 3.1g. The bottom inset to Figure 3.1g shows
magnification of the current–voltage curves, which show the well-distinguishable states and the
marked linearity of each state. Three nearest-neighbour states after denoising are shown in Figure
3.1f, in which a constant voltage of 0.2 V reads each state for 1,000 s. The current fluctuation of
every state is within 0.4 µA, corresponding to 2 µS in conductance. No significant overlap was
observed in the neighbouring states. Memristors from multiple chips of an 8-inch wafer were
measured, demonstrating considerable programming uniformity across the entire wafer. We further
used the denoising process in the array-level programming of an entire 256 × 256 array using the
on-chip circuitry. The experimentally programmed patterns are shown in Figure 3.1g (top inset)
and Figure 3.6. For demonstrations using the on-chip circuitry, the programming precision was
limited by the precision of the on-chip analog-to-digital conversion peripheral circuitry, which was
6-bit (64 levels) in this design. The testing set-up and the schematic of the driving circuits are
shown in Figure 3.7. Because a relatively smaller voltage is needed for denoising than is required
for typical set or reset programming, the extra energy consumption is only a small fraction of the
energy needed for programming. Further studies show that the denoising operation can also reduce
RTN in other material stacks. Because reading noise has been observed in various resistive
switching materials, the results indicate that the denoising step is an important, potentially essential,
process for the training of memristive neural networks because unstable readings lead to incorrect
outputs from the neural networks, and these cannot be compensated by adaptive in situ training.
3.3 High Precision Programming Algorithm
We first constructed a model for the device and used it to coarsely tune the device. The device was
gradually set by a 2V DC signal using different compliance currents with an increment of 50 µA
from 50 µA to 2 mA. After each set operation, the device was read by a 0.2V voltage. After the 2
mA set current compliance was reached, the device was reset by -1.5V to the initial state. That
set/read data collection process was repeated for 5 times to minimize the effect of cycle-to-cycle
variation. After all the compliance current-conductance pair data (Ic, g) were collected, they were
fitted into a linear model g = f (Ic). When using a transistor as the current limiter, the gate voltage
of the transistor could be calculated according to its transfer curve. The device was then set to each
target conductance using the compliance current predicted by the model. This was called coarse
tuning. If the resulting conductance was within a tolerance ∆1 of the target, the coarse tuning was successful and the fine tuning was applied next. If the deviation was larger than ∆1, the device was reset and coarsely tuned again. If the coarse tuning failed after N1 attempts, the estimated model g = f (Ic) no longer correctly represented the current status of the device, so we re-calibrated it and started over. The typical value of ∆1 was 50 µS and N1 was 5.
Figure 3.2 The algorithm of high precision programming. The algorithm was implemented in C# to control a Keysight B1500 Semiconductor Device Analyzer, applying programming voltages with customized parameters.
During the fine tuning, we set/reset the device if the current conductance was smaller/larger than the target. For the set operation, the voltage for the next set operation was calculated using Newton's method, where the gradient was extracted from the coarse tuning result; for the reset operation, the reset voltage amplitude was increased by a fixed step each time. After each set/reset operation, the device conductance was read by a 0.2V DC voltage and compared to the target conductance. If the difference was larger than a predefined tolerance ∆2 (typically 1 µS), another round of set/reset was performed. Once the difference was smaller than ∆2, the device current was read at a constant 0.2V signal for a fixed duration to measure its fluctuation over time. That fluctuation measurement was repeated multiple times, and each measurement was conducted within 10 s to save time. If the conductance fluctuation was larger than a predefined value ∆3 (typically 2 µS), a stabilization (denoising) process took place, which typically applied one or more pairs of one positive and one negative pulse at 0.35 V until the noise level was within the tolerance range. The stabilization voltage may sometimes change the average conductance of the device. If the mismatch between the actual and target conductance was larger than ∆1 or ∆2, a corresponding coarse tuning or fine tuning process was applied again.
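To make the structure of this three-level feedback loop concrete, the sketch below implements it in Python against a toy simulated device. The device model, the assumed linear relation g = 2·Ic, and all noise magnitudes and step sizes are illustrative placeholders rather than the behaviour of the real 1T1R cell or the C# control code.

```python
import numpy as np

# Illustrative sketch of the coarse-tune / fine-tune / denoise feedback loop.
# ToyMemristor is a crude stand-in for the real 1T1R cell; all constants are
# assumptions for demonstration only.
class ToyMemristor:
    def __init__(self, seed=0):
        self.rng = np.random.default_rng(seed)
        self.g = 100e-6                                    # conductance (S)
    def set_pulse(self, i_compliance):
        # as-programmed conductance follows the compliance current, g = f(Ic)
        self.g = 2.0 * i_compliance * (1 + 0.01 * self.rng.standard_normal())
    def reset_pulse(self, amplitude):
        self.g *= 1 - 0.02 * amplitude                     # each reset lowers g a bit
    def denoise_pulse_pair(self, v=0.35):
        pass                                               # suppresses RTN on real devices
    def read(self, v=0.2):
        return self.g * (1 + 0.001 * self.rng.standard_normal())

def tune(dev, g_target, delta1=50e-6, delta2=1e-6, delta3=2e-6, n1=5):
    # Coarse tuning: invert the fitted model g = f(Ic) = 2 * Ic
    for _ in range(n1):
        dev.set_pulse(i_compliance=g_target / 2.0)
        if abs(dev.read() - g_target) < delta1:
            break
    # Fine tuning: set if too low (Newton step on Ic), reset if too high
    ic, v_reset = g_target / 2.0, 0.0
    for _ in range(500):
        g = dev.read()
        if abs(g - g_target) < delta2:
            break
        if g < g_target:
            ic += (g_target - g) / 2.0                     # slope df/dIc = 2
            dev.set_pulse(ic)
        else:
            v_reset += 0.1
            dev.reset_pulse(v_reset)
    # Denoising: pulse pairs until the read fluctuation is within delta3
    while np.std([dev.read() for _ in range(50)]) > delta3:
        dev.denoise_pulse_pair()
    return dev.read()

print(tune(ToyMemristor(), g_target=300e-6))
```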
3.4 Conduction channel evolution in denoising processes
Deciphering the underlying reason for the above results is essential for finding a reliable solution
to the problem of unstable conductance states and understanding the dynamic process of
memristive switching. Visualizing the evolution of conduction channels during electrical
operations is informative for this purpose94–97. We used C-AFM to precisely locate the active
conduction channel(s) and scan all of the surrounding regions. Details of the measurement are
provided in the Methods. A customized device was fabricated for the C-AFM measurements. A
schematic of its structure is shown in Figure 3.3a. To use the Pt-coated C-AFM tip as the top
electrode, the device was designed to have a reversed structure compared with that of the standard
device shown in Figure 3.1d. By grounding the bottom electrode and applying a voltage to the top
electrode, the device can be operated as our standard device with opposite voltage polarities—that
is, a positive voltage tends to reset the device, and a negative voltage tends to set the device.
Denoising operations were also successfully performed by C-AFM, as shown in Figure 3.3b,c. The
conductance scanning results corresponding to the reading results of Figure 3.3b are shown before
(Figure 3.3d) and after (Figure 3.3e) denoising, and those for the reading results of Figure 3.3c are
shown in Figure 3.3f,g. A comparison of the conductance maps in Figure 3.3d,e, reveals that the
main part of the conduction channel (the ‘complete’ channel) remains nearly the same whereas the
positive denoising voltage annihilates an island-like channel (the ‘incomplete’ channel). By
contrast, the negative denoising voltage (Figure 3.3f,g) reduces the noise by removing the current
dips in Figure 3.3c. These results indicate that the conductance of an RTN-rich state can be divided
into two parts: the base conductance provided by complete channels and the RTN provided by
incomplete channels. These incomplete channels had formed together with complete channels but
were smaller in size. Such incomplete channels were also observed in SrTiO3-based resistive
switching devices98. A memristor can be denoised by eliminating incomplete channels (by either
removing or completing them). Incomplete channels are more sensitive to voltage stimuli
compared with complete channels, which makes it possible to tune the former without affecting
the latter by using appropriate electrical stimuli. Further studies suggest that this is a general
mechanism and can also be performed in other material stacks. It should be noted that the
seemingly isolated island(s) may or may not be electrically connected with the main conduction
channel beneath the surface. However, this does not change the denoising mechanisms or operation
protocols.
Figure 3.3 Direct observation of the evolution of conduction channels in the denoising process through
conductive atomic force microscope (C-AFM). a, Schematic of the customized memristor structure and C-AFM
testing set-up. A C-AFM probe was used as the top electrode in the customized device. Because Ta easily oxidizes in air and is not a practical probe material, a Pt probe was used. This Pt probe had the same purpose as that of the bottom
Pt electrode of the standard memristor that we used. To maintain the material stack of a standard memristor, the
customized memristor has a reversed structure. b, Current readings at 0.1 V before (red) and after (blue) a denoising
process using a subthreshold reset voltage. c, Current readings at 0.1 V before (red) and after (blue) a denoising process
using a subthreshold set voltage. d, Conductance map measured by C-AFM scanning corresponding to the before-denoising state (red) in b. e, Conductance map corresponding to the after-denoising state (blue) in b. f, Conductance
map measured by C-AFM scanning corresponding to the before-denoising state (red) in c. g, Conductance map
corresponding to the after-denoising state (blue) in c. The dashed yellow circles in d–g highlight the changes observed
before and after the denoising process. Scale bars, 10 nm.
3.5 Switching and denoising mechanisms
To understand the mechanism of denoising, we studied the microscopic origin of RTNs in
memristors. A critical question is whether RTN is induced by an ‘atomic effect’ or ‘electronic
effect’. Incomplete channels are consistently observed in a C-AFM scanning whenever RTN is
observed. Once incomplete channels are eliminated, RTN disappears. Such result indicates that
RTN is a phenomenon in company with incomplete channels rather than being induced by the
transition process between incomplete and complete channels. Previously, a theoretical framework
was established for the electronic RTN mechanism88,89,99–101, in which the electrical conduction of
the incomplete conduction channels is frequently blocked by Coulomb repulsion when nearby
defects trap electrons and become negatively charged. RTN caused by the atomic motion induced by external voltage stimuli is random and irregular in amplitude even when the device is driven by regular voltage pulses102.
To identify the type of defect that traps or detraps charges, we measured memristor RTN at
different voltages and performed further theoretical analyses. First-principles calculations indicate
that the defects might be oxygen interstitials that have large relaxation energies and thus long
trapping or detrapping times, consistent with the measurement. It was also previously reported99
that charge trapping or detrapping at oxygen interstitials may be responsible for RTN in oxide
memristors. The strongly non-equilibrium condition during device programming probably drives
oxygen ions from conduction channels into their surrounding regions103, leading to oxygen
interstitial defects and potentially providing a type of trapping or detrapping source. By further
analysing the relationship between characteristic duration of RTN and the reading voltage
amplitude, we propose that RTN is predominantly induced by an electronic effect rather than an
atomic effect in our device.
Figure 3.4 Trapped-charge-induced conductance change in incomplete conduction channels. a, The RTN-responsible defect (orange) is 1 nm away from an island-like conduction channel (blue). The channel is formed by a conductive phase region (phase II) and the phase boundary (PB) region. b, The transport electron wavefunction corresponding to a, where z denotes the position of the channel along the electron transport direction (from −3 nm to 3 nm), and n(z) shows the normalized integration of the transport electron wavefunction on the plane perpendicular to the z direction, which indicates the electrical conduction at each z position. The black and red curves are n(z) when the carrier density in the channel is 5 × 10¹⁸ cm⁻³ or 1 × 10¹⁹ cm⁻³ with one electron trapped at the defect, respectively, and the blue line is n(z) with no electron trapped. c, Two defects (orange) are positioned away from a channel that is attached to the main conduction channel. The PB region is 3 nm in width. d, The transport electron wavefunction corresponding to c. The red and blue lines correspond to n(z) when one electron is trapped in the defect 0.8 nm and 1 nm away from the channel, respectively, and the green and black lines correspond to n(z) when both or none of the defects have trapped electrons. The carrier density in the channel for the simulation is 5 × 10¹⁸ cm⁻³. The values of τc and τe are computed in a similar method as reported in ref. [42], where the material-specific parameters were chosen according to our device information and first-principles calculation results, including the atomic relaxation energy (EREL) during electron trapping and the thermal activation energy of the trapped electron. The distribution functions of τ0 and τ1 are derived for different reading voltages (the delay effect of the current response to the electron trapping is included by an empirical coefficient, see Supplementary Information SI-15 for details). We can see that the simulated distributions are consistent with those extracted from the experimental RTN measurements. The large EREL is the primary reason for the long τ0 and τ1. In comparison, the characteristic times of oxygen vacancy and Ta substitution defects are both on the order of nanoseconds because of their small EREL. The trend of τ0 and τ1 can be intuitively understood as follows. Because the electron capture rate of the defect is positively related to the electric current density through the defect, τ1 is inversely related to the absolute value of the reading voltage. In comparison, the electron emission process is not substantially influenced by the current level, so τ0 is approximately independent of the applied voltage.
The incomplete channel blocking process was modelled as shown in Figure 3.4. On the basis
of C-AFM experiments, we classified the device region into three phases: the non-conductive phase (phase I), the conductive phase (phase II) and the region between them, which has an intermediate conductance (phase boundary). During the programming or denoising operations, these phase-boundary regions form or disappear, accompanying the observation of RTN and its removal, indicating that some RTN-inducing incomplete channels are located in these phase boundary regions. Figure 3.4a shows a defect trapping or detrapping an electron 1 nm away from an island-like incomplete channel that has a width of 1 nm. The transport electron wavefunctions ψ(x, y, z) with or without a trapped charge are visualized in Figure 3.4b by the probability density at each cross-section of the channel, n(z) = ∫∫ |ψ(x, y, z)|² dx dy (where z is the axis along the channel). The wave functions show what proportion of the injected electron propagates through the channel.
To mimic the different percentages of phase II, two charge carrier densities (averaged over phase
I and phase II) were used for the simulations. The results indicate that the incomplete channel is
fully blocked at a lower charge carrier density (lightly doped with oxygen vacancies,
corresponding to less phase II) and partially blocked at a higher charge carrier density (heavily
doped, corresponding to more phase II). Figure 3.4c corresponds to another commonly observed
C-AFM result, in which the incomplete channel is attached to the main channel with multiple
charge traps around it. Figure 3.4d shows that a trapped charge close to the incomplete channel
tends to have a larger impact on conductance than one far away. Furthermore, the effect of multiple
charge traps can enhance each other and lead to a multiplied change of conductance because the
thick phase boundary region is completely blocked. Compared with previous models using classic
carrier drift-diffusion equations, we use quantum transport formalism to simulate the influence of
charged defects on channel conductivity, confirming that the Coulomb blockade mechanism
applies to nanoscale channels. Furthermore, we inferred that two or more (N) charge-trapping
defects can lead to complex RTN patterns with a maximum of 2^N levels, which is consistent with previous reports100,101.
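As a purely numerical illustration of this last point (a toy model, not the quantum transport simulation described above), the snippet below treats N independent charge traps as two-state telegraph processes, each blocking a different fraction of the channel current, so that the combined read current visits up to 2^N distinct levels. The capture and emission probabilities and blockage values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_rtn(n_traps=2, steps=20000, p_capture=0.01, p_emit=0.02):
    occupied = np.zeros(n_traps, dtype=bool)
    blockage = 0.1 * (1 + np.arange(n_traps))   # each trap blocks a different amount
    current = np.empty(steps)
    for t in range(steps):
        capture = (~occupied) & (rng.random(n_traps) < p_capture)
        emit = occupied & (rng.random(n_traps) < p_emit)
        occupied = (occupied | capture) & ~emit
        current[t] = 1.0 - float(blockage @ occupied)   # base current minus blockage
    return current

trace = simulate_rtn(n_traps=2)
print("distinct current levels observed:", len(np.unique(np.round(trace, 6))))  # up to 2**2
```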
Figure 3.5 Mechanism of denoising using subthreshold voltage, identified using C-AFM measurements and
phase-field theory simulations. a–d, After switching (a), the conduction channel is first denoised by a 0.2 V voltage
(b) and then reset twice with a 0.5 V voltage (c,d), as measured by C-AFM. e–h, Phase-field simulations of the
conduction channels when the device is freshly switched (e), then denoised (f), and reset twice (g,h). The dynamics
of the conductive and insulating phase fields are simulated on the basis of the phase transition energy pathway from
the first-principles calculation. We propose that the conductive and insulating phases are the orthorhombic phase with
a high number of oxygen vacancies and the monoclinic phase without oxygen vacancies, respectively. The denoising
process is captured by the phase-field relaxation, in which the island of the incomplete channel disappears and the
phase boundary sharpens.
Because the RTN originates from the incomplete conduction channels, the denoising process
is associated with the disappearance of both the island and the blurry boundary of the main channel.
A subthreshold voltage that is much smaller than the set or reset voltages can decrease the RTN
because of the phase-field relaxation, as shown in Figure 3.5. For this specific material system, the
relatively conductive and insulating phases (phase II and phase I, respectively; Figure 3.4) are the
orthorhombic and monoclinic phases of HfO2, because the orthorhombic phase is stabilized by a
high number of oxygen vacancies104. The denoising voltage provides a driving force for the phase
relaxation through both temperature effects and the current-induced forces, enabling the system to
relax towards an equilibrium state. The free energy F and equation of motion of the system are as
follows:
ΔF = ∫ [Δf0(φ) + (K/2)(∇φ)²] dV

(1/L) ∂φ(r, t)/∂t = −δΔF[φ]/δφ(r, t) = −∂Δf0/∂φ + K∇²φ

where φ is the order parameter (here the monoclinic angle) describing the transition from the monoclinic (m) to the orthorhombic (o) phase, Δf0 is the free energy density for a system with a certain order parameter, K is the gradient energy parameter, and L is a kinetic coefficient. The energy density Δf0 is derived from the first-principles calculations. Using the phase-field simulation, we obtain behavior similar to that observed by C-AFM: after denoising, the island disappears and the boundary of the main channel sharpens. The
disappearance and sharpening of the boundary are driven by the energy barrier between the two
phases, in which the high-energy boundary region is reduced. During the reset process, the
conduction channel shrinks in size and its conductivity also decreases because the strong voltage
drives the oxygen vacancy away from the switching-active region. The incomplete conduction
channels—that is, the islands and boundary regions in a freshly switched state—are frozen in a
highly non-equilibrium state because they are always formed at the end of the set or reset voltage
pulse and do not have a chance (sufficient time) to reach the same stable state as the more mature
complete channel region formed earlier. Therefore, these incomplete conduction channels are
prone to change; the completion or removal can be induced by a subthreshold voltage. In contrast
to the electron transport in the complete main conduction channel, that of incomplete channels can
be readily blocked by trapped charges (Figure 3.4), making them the main source of RTN. The
situation is more severe for a conductance state obtained by a set switching process because the
creation and growth of a conduction channel comprise a positive feedback process, which happens
faster and faster and leaves no time for the maturation of the newly formed conduction channels
before the end of each switching pulse. In the denoising process, there is no need for the migration,
annihilation or creation of trap sites (for example, interstitial oxygen defects). Although the
specific phases involved may be different for different oxide systems, our approach and
conclusions are generally applicable.
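The sketch below integrates a generic Allen-Cahn equation of the form given above on a small 2D grid, with a simple double-well potential standing in for the first-principles Δf0 and with purely illustrative (not HfO2-specific) parameters. Starting from a large channel plus a small island and a noisy, blurry boundary, the island relaxes away and the boundary sharpens while the main channel survives, qualitatively mirroring the denoising behaviour.

```python
import numpy as np

# Minimal Allen-Cahn phase-field relaxation sketch. The double-well free
# energy density W*phi^2*(1-phi)^2 and all parameters are illustrative
# assumptions, not values fitted to the HfO2 system.
N, dx, dt, K, L, W = 64, 1.0, 0.1, 1.0, 1.0, 1.0
x, y = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")

phi = np.zeros((N, N))
phi[(x - 32) ** 2 + (y - 32) ** 2 < 12 ** 2] = 1.0        # main (complete) channel
phi[(x - 50) ** 2 + (y - 50) ** 2 < 3 ** 2] = 1.0         # small island
phi += 0.1 * np.random.default_rng(0).standard_normal((N, N))  # blurry, noisy boundary

def laplacian(f):
    return (np.roll(f, 1, 0) + np.roll(f, -1, 0) +
            np.roll(f, 1, 1) + np.roll(f, -1, 1) - 4 * f) / dx ** 2

for _ in range(300):
    dfdphi = 2 * W * phi * (1 - phi) * (1 - 2 * phi)      # d(Δf0)/dφ for the double well
    phi += dt * L * (K * laplacian(phi) - dfdphi)         # ∂φ/∂t = −L δΔF/δφ

island = (x - 50) ** 2 + (y - 50) ** 2 < 3 ** 2
channel = (x - 32) ** 2 + (y - 32) ** 2 < 8 ** 2
print("island mean φ:", phi[island].mean(), "| channel mean φ:", phi[channel].mean())
```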
3.6 Methods
Memristor fabrication:
Standard memristor integrated with CMOS driving circuits:
The CMOS part was fabricated in a standard 180-nm process line in a commercial semiconductor
manufacturer with an exposed tungsten via at the top. Memristors were processed in the same
process line with customized materials and protocols. After surface oxide cleaning of the tungsten
via, the Pt bottom electrodes were sputtered and patterned on the vias. Holes for memristors were
created by etching through a patterned SiO2 isolation layer (~100 nm) and terminating at the
surface of Pt. The resistive switching layer (HfO2/Al2O3) and top electrode (Ti/Ta) were filled into
the etched holes sequentially, in which the resistive switching layers were fabricated by atomic
layer deposition and the top electrode was fabricated by sputtering. Finally, a standard aluminum
interconnect was used to connect the top electrode to bond pads for electrical testing.
A customized memristor for C-AFM measurement
The customized device was fabricated in a university cleanroom on an Si wafer covered with
thermally oxidized SiO2 (~100 nm). The bottom electrode (Ta/Ti) and resistive switching layers
(Al2O3/HfO2) were deposited by an AJA sputtering system. The four layers were fabricated
continuously in a high-vacuum chamber to avoid oxidation of Ta and Ti. The chip was then
patterned and etched to expose part of the bottom electrode. After surface oxide cleaning, Pt was
deposited onto the exposed bottom electrode to prevent oxidation and serve as the ground contact
during C-AFM measurement.
Electrical measurements:
Single device measurement
Electrical measurements of the standard memristor (factory-made complete memristor with top
electrode) were performed on a Keysight B1500A semiconductor device analyzer equipped with
a B1530A waveform generator and fast measurement unit. To realize the algorithm as shown in
Figure 3.2, we built a program using C# to control the electrical operations of B1500A.
Array measurement
The schematic of the one-transistor–one-memristor array with on-chip driving circuits and the
testing set-up is shown in Figure 3.6 and Figure 3.7.
Figure 3.6 The pattern programming and denoising in a 256 × 256 1-transistor-1-memristor (1T1M) array. The algorithm is similar to Fig. S4 but performed in a 256 × 256 array. Each array-programming operation is followed by an array denoising operation. In all subplots, 1 least significant bit (LSB) is 2.34 µS. a) The target weight map of the programming. b) The programmed weight map. c) The reading noise of devices in the array after the initial programming, before the first denoising operation. d) The final reading noise of devices in the array after 7 iterations of programming and denoising; the number of noisy devices significantly decreased compared to that in c). e) The histogram of all device noise levels. The inset is a zoomed-in view of the histogram in the range of 0-2%. f) The trend of the average reading noise of all 65,536 devices over the 7 programming/denoising iterations.
Figure 3.7 The testing environment and the schematic of the 1T1R array with its driving circuits. a) The testing set-up. The 1T1R array and its driving circuits are controlled by a field-programmable gate array (FPGA) through a voltage level shifter. b) The schematic of the 1T1R array and its on-chip driving circuits. Device reading and writing voltages are controlled by on-chip digital-to-analog converters (DACs). Current reading is realized by a transimpedance amplifier (TIA) and an analog-to-digital converter (ADC). The current flow directions in the three basic operations (SET, RESET and READ) are marked by arrows.
3.7 Summary
We have achieved 2,048 conductance levels in a memristor, which is more than an order of
magnitude higher than previous demonstrations82,105. Notably, these were obtained in memristors
of a fully integrated chip fabricated in a commercial factory. We have shown the root cause of
conductance fluctuations in memristors through experimental and theoretical studies and devised
an electrical operation protocol to denoise the memristors for high-precision operations. The
denoising process has been successfully applied to the entire 256 × 256 crossbar using the on-chip driving circuitry designed for regular reading and programming without any extra hardware. These results not only provide crucial insights into the microscopic picture of the memristive
switching process but also represent a step forward in commercializing memristor technology as
hardware accelerators of machine learning and artificial intelligence for edge applications.
Moreover, such analog memristors may also enable electronic circuits capable of growing for the
recently proposed mortal computations106.
Chapter 4: Programming with arbitrarily high precision
4.1 Introduction
Many complex physical systems can be described by coupled nonlinear equations that must be
analyzed simultaneously at multiple spatiotemporal scales. However, these systems are often too
complicated for analytical techniques, and direct numerical computation is hindered by the "curse
of dimensionality," which requires exponentially increasing resources as the size of the problem
increases. These systems range from nanoscale problems in material modeling to large-scale
problems in climate science. While the need for accurate and high-performance computing
solutions is growing, traditional von Neumann computing architectures are reaching their limit in
terms of speed, energy consumption, and infrastructure.
A promising alternative is in-memory computing that circumvents the memory-processor
bottleneck inherent to von Neumann architectures. In-memory computing in crossbars can execute
a large vector-matrix multiplication (VMM) in the analog domain within one computing cycle
(O(1) time complexity) by exploiting Ohm's law and Kirchhoff's current summation law, I = GᵀV, where V is the input voltage vector, G is the conductance matrix and I is the output current vector from the crossbar. In the digital domain, such a computation requires N² multiplications and additions, where N is the vector size107. To achieve efficient in-memory
computing, various emerging devices108, such as floating gate transistors12,14,78, phase-change15–17, ferroelectric19–22, magnetic23,74, organic109, and metal oxide24,6,25,27 switching materials, have been
studied intensively to enable the parallel computation of matrix operations in nonvolatile memory
crossbars. However, technical challenges26,28–31,110, such as reading noises and writing variabilities
(caused by device-to-device inhomogeneities), have limited the scalability and precision required
by many applications, such as high-performance scientific computing and in-situ training for
neural networks.
We have achieved thousands of conductance levels in Chapter 3, by eliminating the reading
noise issue in individual memristors111. Still, practical numerical problems often require solutions
with single precision (2²³ ≈ 10⁷ levels, or ~10⁻⁷ error) or double precision (2⁵² ≈ 10¹⁵ levels, or ~10⁻¹⁵ error). Accordingly, analog devices have been primarily used for applications without high-precision requirements, such as machine learning112,113,86,17,66,114, randomness-based processing
like stochastic computing115–117, and hardware security118–120. To achieve high-precision solutions,
innovations in architecture and algorithms, co-designed with analog devices, must be made.
Some theoretical121 and experimental studies122 have used analog arrays to generate a low-precision estimate and then resort to an integrated high-precision digital solver for refinement to
produce the required high-precision solutions. Recent efforts in solving high-precision numerical
problems used memristors as binary or low-precision cells. They relied on complicated peripheral
circuit design123 or intensive software processing32 techniques with frequent quantization
operations to obtain error-free results, which substantially reduced energy and area efficiency due
to increased costs for analog-to-digital converters (ADCs) and other post-processing.
In this work, we propose and demonstrate a new circuit architecture and programming
protocol that can efficiently represent high-precision numbers using multiple relatively low-precision analog devices, such as memristors, with a greatly reduced overhead in circuitry, energy, and latency compared with existing quantization approaches. As proof-of-principle demonstrations, we have
experimentally solved both static and time-evolving partial differential equations, including
Laplace and Poisson equations, Navier-Stokes (N-S) equations, Magnetohydrodynamics (MHD)
problems and Recursive least square (RLS) filters, with memristor crossbars playing various
critical roles in the solver. We have achieved solutions with up to 10⁻¹⁵ precision on a
fully integrated memristor SoC chip while maintaining a substantial power efficiency advantage
over conventional digital PDE solvers.
4.2 High precision obtained with low-precision devices
Figure 4.1 shows the simplified schematics of a traditional crossbar architecture with a bit-slicing
approach and our proposed true analog architecture and its programming algorithm that can
achieve arbitrarily high precisions by overcoming the issues of device writing accuracy and
variability. Peripheral input/output registers and digital-to-analog converters (DACs) are omitted
in the schematic for simplicity.
Figure 4.1 Comparison of Arbitrary precision programming and traditional crossbar arrays.
(A) Traditional crossbar arrays with ADCs and additional post-processing circuits. (B) Proposed arbitrary precision
programming circuit with shared ADCs. (C) Example of programming a numerical value A = 1 into multiple memristor
devices step by step. Red, green, and blue represent memristive devices in the 1st, 2nd, and 3rd subarray. (D) Flowchart
of the arbitrarily high precision programming algorithm.
The traditional in-memory-computing architecture123,65,114,124 follows the same paradigm as
digital circuits, which is not error tolerant. In digital circuits design, each number is represented
by multiple bits. During multiplication, each bit in the multiplier is multiplied with each bit in the
multiplicand to get many partial products. So, a multiplier is a group of bit shifters and adders that
add the partial products. Such an approach is inherited and expanded to arrays in the traditional in-memory-computing architecture, forming the so-called bit-slicing approach. In that approach, the
memory devices are bound into very limited (usually binary) predetermined states, with the weight
matrix and input vectors sliced into multiple bit-planes (only weight planes are shown in Figure
4.1 and the input planes are omitted for simplicity), and partial VMMs are performed in those bit
planes to obtain many partial products. Those partial products are then quantized and combined
using additional digital circuitry such as ADCs, shifters, and adders to get the full VMM product.
That is almost as complex as the digital multiplier, if not more. On the algorithm side, the same
computing algorithm optimized for digital computing is typically used. Therefore, the accuracy of
the VMM result relies heavily on the programming accuracy of each cell in the array. This
approach forces analog devices to behave like digital devices, which negates the advantages of
analog devices and analog computing.
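For comparison, the short NumPy sketch below mimics the bit-slicing bookkeeping just described: a 4-bit weight matrix is split into binary planes, each plane contributes a partial VMM, and the partial products are shifted and added digitally. The matrix size, values and the purely digital input are arbitrary simplifications.

```python
import numpy as np

# Toy illustration of the bit-slicing approach: binary bit planes, partial
# VMMs, then digital shift-and-add to recover the full product.
rng = np.random.default_rng(0)
W = rng.integers(0, 16, size=(4, 4))          # 4-bit weight matrix
v = rng.integers(0, 8, size=4)                # input vector (kept digital here)

bit_planes = [(W >> b) & 1 for b in range(4)] # one binary "crossbar" per bit
partials = [plane.T @ v for plane in bit_planes]
result = sum(p << b for b, p in zip(range(4), partials))  # shift-and-add

assert np.array_equal(result, W.T @ v)
print(result)
```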
Instead, the proposed true analog approach tries to complete the computation in the analog
domain as much as possible. It only converts the computing result to digital at the last step. In
practice, due to yield issues and device-to-device variations, or to keep programming iterations
and time reasonable, some devices are less accurately programmed than others and may not meet
the predetermined criteria of the traditional bit-slicing approach, resulting in quantization error. In
our analog approach, we also use the weighted sum of multiple devices to represent one number
but utilize the subsequently programmed devices to compensate for the conductance error of the
previously programmed devices. Such compensation is done by dynamic mapping between the
residual value (error) and the conductance in the proposed algorithm (Figure 4.1D, detailed in
Supplementary Text). As shown in Figure 4.1B, multiple crossbar subarrays were used for the
multistage compensation. The subarrays can be physically placed horizontally, vertically, or 3D-stacked without substantially changing the algorithm.
A simple example of writing a number a = 1 into three combined devices is drawn in Figure
4.1C. In this example, the first device ended up with a 10% programming error, either by one-shot
programming or by a read-verify feedback programming method111 with a few programming
cycles, as it was programmed to be 0.9 instead of 1. After reading the programming result of this
first device, we can program a second device to compensate for this error. The second device likely
also had a 10% programming error and was programmed to 0.9, for example. A weight of less than
1 (e.g., 0.1) was used for the second device to ensure the error scales down, with which the second
device represented 0.9*0.1 = 0.09. Therefore, the combined value of those two devices became
0.9 + 0.1*0.9 = 0.99, successfully reducing the total error to only 1% with just two sequential programming operations. Similarly, adding a third device further reduced the total error to 0.1%.
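The few lines of Python below replay this scalar example with simulated writes of up to 10% error and fixed weights of 1, 0.1 and 0.01. The fixed weights are a simplification; in the actual scheme the scaling factors are computed adaptively per column, as described next.

```python
import numpy as np

# Toy illustration of residual compensation with three devices of ~10% write
# error; weights and error model are illustrative only.
rng = np.random.default_rng(0)

def program_device(target):
    return target * (1 + 0.1 * rng.uniform(-1, 1))   # up to 10% write error

a_target, weights = 1.0, [1.0, 0.1, 0.01]
stored, value = [], 0.0
for w in weights:
    residual = a_target - value                      # what is still missing
    g = program_device(residual / w)                 # device holds residual / weight
    stored.append(g)
    value = sum(wi * gi for wi, gi in zip(weights, stored))
    print(f"weight {w:<5} -> combined value {value:.6f}, error {a_target - value:+.2e}")
```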
More rigorously, to write any target numerical matrix A, R0 = A is considered as the initial residual and mapped to a conductance matrix G1,target within the programmable conductance
range of the memristive devices. Since A can have both positive and negative elements while the
conductance is a physical quantity related to the device, which must be positive and within a certain
dynamic range, a mapping method that can map both positive and negative values to a given
positive range is needed. One approach is to perform a scaling followed by a shifting so that all
elements are shifted into the positive range. This linear mapping method, G1,target = K1R0 + B1, was chosen in our work, where K is a diagonal matrix for scaling, and B serves as a global offset
consisting of identical column vectors (each unique value in B is a different shift for the entire
column of memristors). K and B are chosen in such a way that each column of Gi,target is mapped to the full conductance range of the memristor. Since there is a transpose operation in the VMM formula, I = GᵀV, the columns of G correspond to the rows of A. An alternative to this linear mapping
method would be using the differential pairs54,65, which utilizes two devices to represent one
number, and the number is proportional to the difference of the conductances of the two devices.
But this approach would double the number of memristors used. Both approaches are compatible
with the proposed programming algorithm, but to avoid introducing unnecessary complexity in
our description, we assume all numbers are positive in the schematic in Figure 4.1.
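The snippet below contrasts the two mappings in NumPy: the scale-and-shift mapping adopted here and the differential-pair alternative that uses two conductances per number. The conductance window and example values are arbitrary.

```python
import numpy as np

# Two ways of mapping signed numbers onto positive conductances; the window
# and example values are illustrative only.
g_min, g_max = 30e-6, 700e-6
a = np.array([-0.8, -0.1, 0.3, 0.9])

# (1) scale-and-shift: g = k*a + b, chosen to span the full window
k = (g_max - g_min) / (a.max() - a.min())
b = g_min - k * a.min()
g = k * a + b
print(np.allclose((g - b) / k, a))            # inverse mapping recovers a

# (2) differential pair: a is proportional to g_pos - g_neg (two devices)
g_mid = 0.5 * (g_min + g_max)
s = (g_max - g_min) / (2 * np.abs(a).max())
g_pos, g_neg = g_mid + s * a, g_mid - s * a   # both stay inside [g_min, g_max]
print(np.allclose((g_pos - g_neg) / (2 * s), a))
```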
The first memristor subarray was then programmed targeting G1,target, using fast one-shot programming or the write-verify approach. Because of the analog nature of devices, there was always a programming error between the target G1,target and the programmed array G1. G1 was inversely mapped to the numerical matrix A1, and the residual matrix R1 = A − A1 was set as the programming target for the next subarray. Similarly, the second subarray was programmed to G2, which was converted to A2, and the residual R2 = A − (A1 + A2) = R1 − A2 became the target of the third subarray.
One of the key advantages of this approach over the traditional bit-slicing technique is that
the scaling factor K in the linear mapping is dynamically calculated and adaptive to the
programming performance of each column of the subarray instead of a predetermined value. This
allows faster convergence when the programming error is small and guarantees the largest residual
of each column to be monotonically decreasing (converging) when the programming error is large.
That is because even if a device in the next subarray is stuck and extremely far away from the
target, it cannot be further than the difference between Gon and Goff, thus its residual cannot be
larger than the previous largest residual. As the scaling factors, K, or the weights for subsequent
subarrays decrease, the remaining error would also decrease and converge toward zero. Since the
actual programming result is read and considered when calculating the next residual, the accuracy
can be guaranteed. And with the help of the shrinking scaling factors, the effective precision of the
whole array can exceed the device programming precision. In principle, an arbitrarily high
precision can be achieved by using more and more subarrays. Such scaling factor granularity was
chosen as a balance between using one scaler for the whole subarray and individual scaling factors
for every device in the subarray. Using a global scaling factor for the whole subarray would make
the entire subarray less effective and hampered by even a single inaccurate device, and maintaining
individual scaling factors for every device would require too much extra computation and could
not be implemented in hardware efficiently.
The proposed mapping mechanism could be conveniently implemented in hardware by
programming the feedback resistor in the existing trans-impedance amplifier (TIA) circuits and adding a last row to the subarray for the offset B. The overhead for the dynamic scaling factor
calculation was negligible because it was only needed once for each subarray and required no
additional array reading operation. This calculation had the same complexity as calculating the
programming voltage amplitudes for one cycle in the write-verify programming, which usually
takes multiple cycles. During programming of each subarray, the corresponding switch was turned
on, and switches of other subarrays were turned off. Since both the traditional scheme and the
proposed scheme require high resolution ADCs for accurate VMM operations, the proposed
scheme does not require a higher ADC precision, but more efficiently utilizes the existing ADC
precisions.
Such a high-precision programming method enabled high-precision full vector-matrix
multiplications. When the input voltages were applied to the rows, switches of all subarrays were
turned on simultaneously, and the output currents of all subarrays were naturally weighted and
summed together to obtain the total VMM result, which was then sent to the ADCs for a final
digitization. The entire VMM process was analog, and the result was only digitized at the last step.
Multiple subarrays can share one ADC, which saves a substantial portion of this most area and
power-consuming components123,125,126, as well as other post-processing digital circuits like bit
shifters and adders for partial products in the traditional approach. Also, in the pre-processing
circuit, input bit-planes, time and energy overheads incurred in the bit-slicing approach can be
eliminated in our approach as well.
4.3 Experimental demonstration of high precision solvers
We use the proposed architecture and algorithm to solve partial differential equations (PDE) as an
experimental verification. The experiments were conducted on two memristor platforms, i.e., a
non-fully integrated system to represent lab-made memristors with relatively larger device variations, and a fully integrated SoC chip to represent fab-made memristors with improved homogeneities. The former consisted of a 128 × 64 one-transistor-one-resistor (1T1R) memristor crossbar array and PCB driving circuits. The latter was an analog in-memory computing accelerator SoC with ten neural processing units (NPUs). Each NPU had a 256 × 256 memristor array
fabricated in a commercial foundry with much better yield and uniformity than the lab memristors
(see Methods, Figure 4.6 and Figure 4.7). Our experiments on these two platforms verified that
the proposed approach worked well for both cases with large or small device variances and the
SoC chip exhibited an especially encouraging performance. Photographs of the unpacked SoC are
shown in Figure 4.2A. Each memristive cell could be programmed in an analog fashion within the
range from 30 to 700 µS by controlling the gate voltage of the transistor in the 1T1R cell (Figure
4.2B). A 64 x 64 region was programmed in a write-and-verify manner to a multilevel pattern
within 30 programming cycles (Figure 4.2, C to E). A few conspicuous devices were not written
to the target due to the device-to-device variability and limited programming cycles (Figure 4.2C).
Those devices, if not compensated later, would greatly affect the vector-matrix multiplication
accuracy, and prevent the PDE solver from convergence.
Figure 4.2 Photos and Programmability of the memristor crossbar array on the SoC.
(A) Optical images of the wafer and a system-on-chip (SoC) under test. Each chip has ten 256 × 256 1T1R crossbar arrays. Scale bars from left to right: 2 mm, 500 µm, 10 µm. (B) The final conductance map of a 64 × 64 region after 30 programming cycles. (C) The absolute error map after 30 programming cycles.
The effectiveness of our arbitrary precision programming method was first verified by using
the on-chip crossbar as a VMM core in the high-precision partial differential equation solver. The
solver used preconditioned conjugate gradient (PCG) algorithm. Compared with the vanilla
conjugate gradient (CG) method, PCG employed a preconditioning matrix to improve the
condition number of the system to be solved and thus improve the convergence speed127 (See
Methods). In hardware, a high-precision VMM core enabled new efficiency optimizations that
were previously not feasible. As an example of our software-hardware co-design advantage, this
VMM core allowed us to use the more efficient Green’s function of the problem as the
preconditioner128,129 for numerous problems, instead of the classical Jacobi (diagonal)
preconditioner that has been widely used32,130 in digital solver because of its computational
simplicity for digital computers. Utilizing this more efficient but complex preconditioner allowed
us to converge faster than digital solvers with simpler preconditioners. To accommodate multiple
subarrays in our chip, the problem with a size of Nx × Ny was downsized into a coarser mesh with a size of nx × ny when performing the hardware preconditioning.
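As background for the results that follow, the sketch below is a generic NumPy implementation of PCG in which the preconditioning step is a dense matrix-vector product, the operation off-loaded to the crossbar in the hardware solver. The small 1D Poisson system, the inverse-matrix preconditioner standing in for a Green's function, and the Jacobi comparison are illustrative, not the experimental 2D setup.

```python
import numpy as np

# Generic preconditioned conjugate gradient; M_apply(r) is the preconditioning
# VMM that the memristor crossbar performs in hardware. The problem and the
# preconditioners below are illustrative stand-ins.
def pcg(A, b, M_apply, tol=1e-10, max_iter=1000):
    x = np.zeros_like(b)
    r = b - A @ x
    z = M_apply(r)
    p = z.copy()
    rz = r @ z
    for k in range(max_iter):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            return x, k + 1
        z = M_apply(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x, max_iter

n = 64
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)     # 1D Poisson operator
b = np.zeros(n); b[n // 4] = 1.0; b[3 * n // 4] = -0.5   # one source, one sink
M = np.linalg.inv(A)                                     # dense, Green's-function-like
_, it_dense = pcg(A, b, lambda r: M @ r)
_, it_jacobi = pcg(A, b, lambda r: r / np.diag(A))       # classical diagonal preconditioner
print("iterations:", it_dense, "(dense preconditioner) vs", it_jacobi, "(diagonal)")
```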
Figure 4.3 Experimental results on Poisson solver with arbitrary precision programming with three arrays.
(A) The classical diagonal preconditioner matrix. (B) The target Green’s function preconditioner matrix. (C) The
summation of all three numerical matrices A1+A2+A3. (D) The 1st subarray, programmed and mapped to A1. (E) The 2nd subarray, programmed and mapped to A2. (F) The 3rd subarray, programmed and mapped to A3. (G) Correct solution of a 128 × 128 Poisson equation example by the hardware solver using all three subarrays with the SoC system. (H) The residual of the solution over iterations for different settings. The Ideal and Diagonal results are obtained with a software solver; the n = 1, 2, 3 results are experimentally obtained with the SoC system.
As an example, we solved a Poisson equation with Nx = Ny = 128 grids by down-sampling it into an nx = ny = 6 mesh in the preconditioning process (see Movie S1). In this hardware preconditioning step, the input matrix was flattened into a 1 × 36 vector to multiply with the flattened 2D physical preconditioner matrix. Hence, the size of the preconditioner matrix needed was 36 × 36 per subarray. A traditional diagonal preconditioner for the Poisson problem is drawn in Figure
4.3A. The Green’s function for the Poisson equation was calculated explicitly and reshaped to 2D
for hardware VMM (Figure 4.3B). Up to three subarrays were experimentally programmed to the
target Green’s function preconditioner matrix for hardware VMM in the PCG algorithm, so the
total number of physical devices used was 108 × 36. The results obtained with the non-fully integrated platform
are shown in Figure 4.8 while the SoC platform generated much improved results as shown in
Figure 4.3. The programmed subarrays A1 to A3 are shown as Figure 4.3, D to F, and the effective
matrix (Figure 4.3C) was the summation of A1 to A3. Compared with the traditional single matrix
approach (Figure 4.3D), Figure 4.3C was much closer to the target in Figure 4.3B. Additional
details on this mapping process can be found in Figure 4.9.
We solved the same initial condition of one source and two sinks with different numbers of
subarrays enabled in the VMM operation to see the differences of the solution obtained. With only
the first subarray of lab made memristors (non-fully integrated platform), the preconditioner could
not be effectively reconstructed in hardware. Thus, the solution did not converge correctly (Figure
4.8G), which revealed the subpar performance of a normal memristor crossbar in scientific
computing without using the approach proposed in this study. When using two or more subarrays,
the obtained solution converged to the correct value (Figure 4.3G). Compared with similar
previous work32 that achieved 2.7% mean absolute error (MAE) in hardware VMM, the precision
of the solution improved enormously as more subarrays were used for VMM, and up to 10⁻¹⁵ precision was obtained with three subarrays within 600 iterations (Figure 4.3H). The use of more
subarrays would bring the residual curve closer to the ideal curve of using Green’s function
preconditioner, ultimately converging to the ideal curve. Using the diagonal preconditioner was slower than using the Green's function preconditioner because it contained less information and was less
effective. Because of the experimental reading variation, there were slight variations on the
residual curve each run, which did not change the above general observations. We did not need to
use techniques like time multiplexing, as all subarrays could compute simultaneously, greatly
improving the throughput.
Figure 4.4 Hardware recursive least square filter with arbitrary precision programming.
(A) The randomly generated original signal u(t), the received noisy signal y(t) after passing through an echoey channel, and the live estimation of the noise-free signal ŷ(t). (B) The estimated coefficients of the channel with the ground truth. Experiments using n = 1~3 subarrays are performed. With 2 or more subarrays used, the hardware result is almost identical to the software estimation. (C) The coefficient estimation history within 40 timesteps. Each line represents one coefficient of the channel. (D) The covariance matrix and the numerical matrices of the three subarrays programmed at t = 1. (E) The covariance matrix and the numerical matrices of the three subarrays programmed at t = 10. (F) The covariance matrix and the numerical matrices of the three subarrays programmed at t = 40.
The PCG algorithm employs a fixed matrix, and we have gone a step further by employing a
changing matrix in the hardware recursive least squares (RLS) filter application (see Methods).
Suppose a signal u(n) is transmitted over an echoey, noisy channel; it will be received as y(n) = Σk w(k)u(n − k) + v(n), where w(k) are the channel coefficients and v(n) represents additive noise (Figure 4.4A). The RLS filter
can be used to estimate the channel coefficients and recover the noise-free version of the
received signal, ŷ(n). At each timestep, it uses the last step’s estimation to update the next step.
The crossbar array served as the covariance matrix which was critical in updating the Kalman gain
and the estimation of the coefficients. In this example problem, the noisy window was assumed to
be t = 10, thus there were ten coefficients to be estimated and the covariance matrix was 10 x 10.
The experiments were done with one subarray, 2 subarrays and 3 subarrays. Their corresponding
estimated coefficients are shown in Figure 4.4B, along with the software estimation and the ground
truth that was randomly generated when setting up the problem. One subarray was not sufficient
to accurately estimate the channel coefficients, but two or three subarrays significantly improved
the result, which overlapped with the software estimation. The coefficients' updating history is
drawn in Figure 4.4C, with covariance matrices on three typical timesteps shown in Figure 4.4, D
to F. The covariance matrix was updated in hardware in each timestep. It was observed that it
changed from the initial diagonal matrix to a checkerboard-like shape in the middle and finally
changed to a banded matrix when the estimation became stable. This verified the capability and
stability of our programming scheme for dynamic matrices that change during computation.
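For reference, the snippet below is a plain NumPy version of the RLS recursion used here, with the inverse covariance matrix P held as an ordinary array rather than in a crossbar. The channel length, forgetting factor and noise level are arbitrary choices for illustration.

```python
import numpy as np

# Minimal recursive least squares (RLS) channel-estimation sketch. P plays the
# role of the covariance matrix held in the crossbar in the hardware demo.
rng = np.random.default_rng(0)
K = 10                                    # number of channel taps to estimate
w_true = rng.standard_normal(K)           # ground-truth echoey channel
u = rng.standard_normal(400)              # transmitted signal
y = np.convolve(u, w_true)[:len(u)] + 0.01 * rng.standard_normal(len(u))

lam, w_est = 0.99, np.zeros(K)
P = 1e3 * np.eye(K)                       # inverse covariance matrix
for n in range(K, len(u)):
    x = u[n - K + 1:n + 1][::-1]          # most recent K samples, newest first
    k = P @ x / (lam + x @ P @ x)         # Kalman gain
    e = y[n] - w_est @ x                  # a priori estimation error
    w_est = w_est + k * e
    P = (P - np.outer(k, x @ P)) / lam    # covariance update

print("max coefficient error:", np.max(np.abs(w_est - w_true)))
```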
Given our current chip's limited physical crossbar size, extended configurations for solving
larger equations and other types of PDEs were verified in simulation. Up to five subarrays were
programmed in simulations using our noisy memristor model calibrated with lab made devices and
used in the PDE solver to solve the Poisson equation. A large writing tolerance of 60 µS was
assumed in the writing process. The five matrices and the summed effective matrix were visualized
in Figure 4.11. Even with such a considerable programming error, the effective summed matrix
closely approximated the target matrix as more arrays contributed to the summation.
As for the scalability, increasing the mesh size reduced the iterations needed to achieve a
specific solution precision, and the numbers of iterations required to obtain 10⁻¹⁰ and 10⁻¹⁴ precision on a 512 × 512 problem are listed in Figure 4.5A. We also compared the energy performance with a highly optimized digital system based on an application-specific integrated circuit (ASIC), which exhibited an energy efficiency of 7.02 TOPS/W2 and a latency of 10.4 ns, almost the same speed as our system. We obtained a nearly two-orders-of-magnitude energy advantage over the digital system (Figure 4.5B).
Figure 4.5 Scalability, efficiency and more applications.
(A) The number of iterations for convergence decreases as the mesh size increases. (B) Energy consumption compared with an ASIC design running at approximately the same speed. (C) Real part of the DFT matrix (n = 16) used in the Navier-Stokes equation solver. (D) Solved velocity field of a Navier-Stokes equation at t = 4.8 s solved by MATLAB. (E-I) Solved velocity field of a Navier-Stokes equation at t = 4.8 s solved by memristor simulation using 1-5 subarrays. (J) The solved velocity and magnetic flux density fields of an MHD problem at t = 2 by MATLAB. (K) The solved velocity and magnetic flux density fields of an MHD problem at t = 2 by memristor simulation using five
subarrays.
The proposed programming method proved even more valuable when solving complicated time-evolving problems like Navier-Stokes equations and magnetohydrodynamics (MHD) problems. When solving those equations, a subsequent timestep needs to be calculated based on the result of the previous timesteps, so even tiny errors in a previous step could accumulate and propagate, making high precision critical for each timestep. Our approach is a general programming approach that can serve not only for the preconditioners but also for any other matrices, for instance, the discrete Fourier transform (DFT) matrices (Figure 4.5C). As an example, we solved Navier-Stokes equations in a simulation of the motion of fluids using the spectral method and n = 1~5 subarrays (Movie S2). The simulation result (Figure 4.5, E to I) became closer to that of the MATLAB solver (Figure 4.5D) as more subarrays were used. In each timestep, the solution was transformed to the spectral space by multiplying with the DFT matrix and transformed back to the physical space using the inverse DFT after the pressure and diffusion effects were applied in the frequency domain. The input vectors and the DFT matrix were divided into real and imaginary parts to process complex number multiplication.
One advantage of such spectral methods is that they can achieve high accuracy with relatively
few grid points. The Fourier transform and inverse Fourier transform were highly parallelizable
but could be computationally expensive, which is a perfect example to be solved by the hardware
VMM using memristive crossbars. As a last example, we solved complicated MHD problems
72
where the fluid flow and magnetic fields were coupled together, by exploiting both our hardware
FFT technique in solving the N-S subproblem and hardware PCG technique in solving the pressure
and magnetic field pressure. The simulation with five subarrays perfectly matched our MATLAB
solver within 100 timesteps (Figure 4.5, J and K).
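To illustrate how a DFT matrix can play the VMM role in such a spectral step, the sketch below advances a single 1D diffusion (viscosity-only) update using explicit forward and inverse DFT matrix multiplications, splitting complex numbers into real and imaginary parts as described above. It is a toy step, not the full Navier-Stokes or MHD solver, and all parameters are illustrative.

```python
import numpy as np

# One spectral diffusion step done with explicit DFT matrices, i.e., as plain
# vector-matrix multiplications of the kind a crossbar performs. Grid size,
# viscosity and timestep are illustrative.
n, nu, dt = 16, 0.1, 0.05
k = np.fft.fftfreq(n, d=1.0 / n)                            # integer wavenumbers
F = np.exp(-2j * np.pi * np.outer(k, np.arange(n)) / n)     # DFT matrix
Finv = np.conj(F).T / n                                     # inverse DFT matrix

u = np.sin(2 * np.pi * np.arange(n) / n) + 0.5 * np.cos(4 * np.pi * np.arange(n) / n)

def vmm_complex(M, v):
    # complex VMM expressed with real/imaginary parts, as in the hardware mapping
    Mr, Mi, vr, vi = M.real, M.imag, v.real, v.imag
    return (Mr @ vr - Mi @ vi) + 1j * (Mr @ vi + Mi @ vr)

u_hat = vmm_complex(F, u.astype(complex))                   # forward transform
u_hat *= np.exp(-nu * (2 * np.pi * k) ** 2 * dt)            # exact diffusion in k-space
u_next = vmm_complex(Finv, u_hat).real                      # back to physical space

# sanity check against NumPy's FFT-based result
ref = np.fft.ifft(np.fft.fft(u) * np.exp(-nu * (2 * np.pi * k) ** 2 * dt)).real
print("max difference vs FFT reference:", np.max(np.abs(u_next - ref)))
```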
4.4 Methods
Electrical measurements
Electrical characterization data in Figure 4.8 were measured using our customized multi-board PCB measurement system(1). MATLAB scripts were used for automatic data collection and
memristor programming.
Electrical measurement data in Figure 4.2-Figure 4.5 were measured using an analog in-memory computing accelerator SoC (system on chip) designed with the 65 nm technology node.
VMM and matrix writing were conducted in the memristor-based computing cores. The other
functions (e.g., operation sequence control and matrix transpose/rotation) were realized in an open-source RISC-V CPU, also on chip. To store large data such as an input image or neural network
results, on-chip memory was used. The computing cores, CPU and memory were connected to the
on-chip AXI (advanced extensible interface) bus. Direct memory access (DMA) was used to efficiently
transfer data across the SoC (usually between memory and the NPU, or from one NPU to another)
from any AXI-connected components.
Each computing core has a 256 × 256 1T1R array and its driving circuits (DAC, ADC, etc.),
control circuits, input/output buffers, etc. The reading and writing voltages are generated by an
array of digital-to-analog converters (DACs), and analog-to-digital converters (ADCs) are
responsible for reading the output. The computing core also has a neural network function block,
which makes it capable of neuromorphic computation, but it is not used for the numerical
calculations in this work.
There are two functional modes of operation: matrix writing mode and VMM mode. In the
matrix writing mode, memristor conductance values are programmed or updated based on the target matrix values, such as pre-trained neural network weights. A set/reset operation can be applied multiple times on the
same device using a multi-pulse feedback scheme to program memristors to the desired
conductance values within the error tolerance. In the VMM mode, all crossbar word lines are
turned on and the accumulated current on each bit line is converted to a voltage through
transimpedance amplifiers (TIAs). After that, the ADC samples and converts the voltage values to
digital bits.
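The following is a minimal behavioral model of the VMM read path described above: all word lines are driven, each bit-line current is the conductance-weighted sum of the inputs, a TIA converts it to a voltage, and an ADC quantizes the result. The conductance range, TIA feedback resistance, and ADC resolution used here are illustrative assumptions, not the actual parameters of the chip.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters only (not the actual chip values)
ROWS, COLS = 256, 256
G = rng.uniform(1e-6, 50e-6, size=(ROWS, COLS))   # cell conductances (S)
v_in = rng.uniform(0.0, 0.05, size=ROWS)          # word-line read voltages (V)
R_TIA = 1e3                                       # TIA feedback resistance (assumed)
ADC_BITS = 8
V_FS = 1.0                                        # ADC full-scale voltage (assumed)

# Kirchhoff current summation on each bit line: I_j = sum_i G_ij * V_i
i_bl = G.T @ v_in

# TIA converts current to voltage, then the ADC quantizes the result
v_tia = i_bl * R_TIA
lsb = V_FS / 2 ** ADC_BITS
codes = np.clip(np.round(v_tia / lsb), 0, 2 ** ADC_BITS - 1).astype(int)

print(codes[:8])          # digitized VMM outputs of the first 8 columns
```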
To determine the functional mode, operation sequence, voltage amplitudes and so on, the control
circuits take the configuration and parameters from registers that communicate with the AXI bus.
Figure 4.6 Photo of the fully integrated SoC testing platform.
The USB 2.0 to QuadSPI bridge (UMFT4222EV module) connects the SPI port on the evaluation board to the USB
port on the control host.
Figure 4.7 Diagram of the analog in-memory computing accelerator SoC (system on chip).
(A) Diagram of the whole SoC. There are 10 computing cores in the SoC. (B) Diagram of each computing core. Each
core has a 256 x 256 crossbar array.
The Programming Algorithm
The target mathematical matrix $A$ is represented by a conductance matrix $G$, with $G = f(A)$. We
choose $f(A) = KA + B$, where $K$ is a diagonal matrix and $B$ consists of identical column vectors.
For each subarray, we employ the commonly used write-verify method (2) with a limited number
of iterations, or a one-shot method for the best efficiency.
▪ Initialization: $R_0 = A$; $n$ is the number of subarrays to use.
▪ For $i = 1$ to $n$:
• Map the numerical residual matrix to conductance by $G_{t,i} = f_i(R_{i-1})$, with $K$
and $B$ chosen such that each column of $G_{t,i}$ is mapped to the full conductance
range of the memristor.
• Write $G_{t,i}$ to subarray $i$ and get a different, actually programmed $G_i \approx G_{t,i}$.
• Inverse-map the conductance to the numerical matrix by $A_i = f_i^{-1}(G_i)$.
• Compute the residual matrix $R_i = R_{i-1} - A_i$.
▪ End for
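A minimal simulation of this programming loop is sketched below, assuming a hypothetical Gaussian write error and an assumed conductance window; for simplicity the scale and offset here are chosen per column of the residual, playing the role of K and B above. It illustrates how the residual shrinks as more subarrays are accumulated.

```python
import numpy as np

rng = np.random.default_rng(1)
G_MIN, G_MAX = 1e-6, 100e-6       # assumed usable conductance window (S)
WRITE_STD = 2e-6                  # hypothetical programming error per cell (S)

def program_subarrays(A, n_sub):
    """Return per-subarray numerical matrices A_i whose sum approximates A."""
    R = A.copy()                               # R_0 = A
    parts = []
    for _ in range(n_sub):
        # f_i: map each column of the residual onto the full conductance range
        lo, hi = R.min(axis=0), R.max(axis=0)
        K = (G_MAX - G_MIN) / np.where(hi > lo, hi - lo, 1.0)   # per-column scale
        B = G_MIN - K * lo                                      # per-column offset
        G_target = K * R + B
        # imperfect analog write
        G_actual = G_target + rng.normal(0.0, WRITE_STD, R.shape)
        # inverse map back to the numerical domain
        A_i = (G_actual - B) / K
        parts.append(A_i)
        R = R - A_i                            # residual for the next subarray
    return parts

A = rng.standard_normal((64, 64))
for n in (1, 3, 5):
    err = np.abs(sum(program_subarrays(A, n)) - A).max()
    print(f"{n} subarray(s): max |error| = {err:.2e}")
```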
Preconditioned conjugate gradient method (PCG)
Conjugate gradient methods are a class of iterative algorithms used to solve large linear systems
of equations that are hard to solve with direct methods. They are particularly useful for symmetric
positive-definite systems, and they can converge to the exact solution in a relatively small number
of iterations. However, the convergence rate of these methods can be slowed down by the presence
of eigenvalues clustered near zero or widely separated, which can make the condition number of
the system large.
One way to improve the convergence of conjugate gradient methods is to use a preconditioner,
a matrix that transforms the original linear system into a new one with a better condition number.
PCG is among the fastest and most reliable iterative methods for solving symmetric positive-definite systems (3).
Green’s function preconditioner is a type of preconditioner that uses Green's function of the
differential operator in the original system to construct a matrix approximating the inverse of the
original system(4). It has been shown to be effective in accelerating the convergence of conjugate
gradient methods for a wide range of problems, including those arising in fluid dynamics,
electromagnetics, and quantum mechanics.
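For reference, a compact PCG loop is sketched below in Python. The preconditioner is passed in as a function handle; here a simple Jacobi (diagonal) preconditioner stands in for the meshed Green's function preconditioner, whose application on the hardware corresponds to one analog VMM.

```python
import numpy as np

def pcg(A, b, apply_M_inv, tol=1e-10, max_iter=1000):
    """Preconditioned conjugate gradient for a symmetric positive-definite A.
    apply_M_inv(r) returns M^{-1} r; on the memristor system this step is the
    analog VMM with the preconditioner matrix."""
    x = np.zeros_like(b)
    r = b - A @ x
    z = apply_M_inv(r)
    p = z.copy()
    rz = r @ z
    for k in range(max_iter):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            return x, k + 1
        z = apply_M_inv(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x, max_iter

# Small SPD test system with a Jacobi (diagonal) preconditioner as a stand-in
rng = np.random.default_rng(2)
Q = rng.standard_normal((100, 100))
A = Q @ Q.T + 100 * np.eye(100)
b = rng.standard_normal(100)
x, iters = pcg(A, b, lambda r: r / np.diag(A))
print(iters, np.linalg.norm(A @ x - b))
```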
Partial differential equations
All equations mentioned in this work are solved using the finite-difference method in 2D space.
Poisson’s equation
The equation is $\Delta u = h$, where $u$ is the potential and $h$ is the source.
Poisson’s equation is widely solved in electrostatics to find the electrical potential for a given
charge distribution h.
Example problem setup: the problem size is $n_x = n_y = 128$ points in a bounded $L_x = L_y = 1$
space with zero boundary conditions. The sources are three point charges: +3 at (0.4, 0.8), -5 at (0.5, 0.5),
and +2 at (0.8, 0.8). The units are arbitrary.
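A sketch of this setup is given below: the five-point-stencil Laplacian is assembled with scipy.sparse and the three point charges are approximated as delta functions on the nearest grid nodes. The grid indexing and the delta-function discretization are illustrative choices, and a direct sparse solve is used only as a software reference.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

N = 128                      # interior grid points per side
h = 1.0 / (N + 1)            # spacing; zero Dirichlet boundary all around

# 1D second-difference matrix and the 2D five-point Laplacian via Kronecker sums
T = sp.diags([1, -2, 1], [-1, 0, 1], shape=(N, N))
A = (sp.kron(sp.eye(N), T) + sp.kron(T, sp.eye(N))) / h**2

# Source term: three point charges approximated as deltas on the nearest nodes
b = np.zeros((N, N))
for q, (x0, y0) in [(+3, (0.4, 0.8)), (-5, (0.5, 0.5)), (+2, (0.8, 0.8))]:
    b[int(round(y0 / h)) - 1, int(round(x0 / h)) - 1] += q / h**2

u = spsolve(A.tocsr(), b.ravel()).reshape(N, N)   # potential on the grid
print(u.min(), u.max())
```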
Laplace’s equation
The equation is $\Delta t = 0$, where $\Delta = \nabla \cdot \nabla = \nabla^2$ is the Laplace operator.
Laplace’s equation is the source-free version of Poisson’s equation, which can describe
equilibrium states like steady-state heat transfer problems or electrostatics problems.
Example problem setup: the problem size is $n_x = n_y = 128$ points in a bounded $L_x = L_y = 1$ space
with Dirichlet boundary conditions. The bottom side of the space is fixed to t = 3 and the other three
sides are fixed to t = 0. The units are arbitrary.
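A corresponding sketch for the Laplace problem is shown below; the nonzero Dirichlet value on the bottom side enters through the right-hand side of the interior system. Which grid row is treated as the bottom edge is an indexing assumption of this sketch.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

N = 128
h = 1.0 / (N + 1)
T = sp.diags([1, -2, 1], [-1, 0, 1], shape=(N, N))
A = (sp.kron(sp.eye(N), T) + sp.kron(T, sp.eye(N))) / h**2

# Laplace: zero source; the fixed boundary value enters through the right-hand side.
rhs = np.zeros((N, N))
t_bottom = 3.0
rhs[0, :] -= t_bottom / h**2       # interior row next to the bottom boundary
t = spsolve(A.tocsr(), rhs.ravel()).reshape(N, N)
print(t.min(), t.max())            # values stay between the boundary extremes 0 and 3
```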
Navier-Stokes equation
The Navier-Stokes equations are a set of coupled nonlinear partial differential equations that are
difficult to solve analytically, especially for complex geometries or turbulent flows. Here we are
solving a simple case with the incompressible flow:
$$\frac{\partial \mathbf{u}}{\partial t} + (\mathbf{u} \cdot \nabla)\mathbf{u} - \nu \nabla^2 \mathbf{u} = -\nabla\!\left(\frac{p}{\rho_0}\right) + \mathbf{g}$$
where $\mathbf{u}$ is the velocity, $\nu$ is the kinematic viscosity, $p$ is the pressure and $\mathbf{g}$ is the body
acceleration.
Example problem setup: the problem size is $n_x = n_y = 16$ points in a bounded $L_x = L_y = 1$
space with $\nu = 0.001$ and $\mathbf{g} = 0$. The initial velocity field
$$\mathbf{u} = (u_x, u_y) = \left(\sum_{k=1}^{3} a_k \sin(2\pi b_k x + c_k)\cos(2\pi d_k y + e_k),\; -\sum_{k=1}^{3} a_k \cos(2\pi b_k x + c_k)\sin(2\pi d_k y + e_k)\right)$$
is the composition of 3 sinusoidal waves, where the coefficients are
$a = [1.0, 0.6, 0.3]$, $b = [3.0, 5.0, 7.0]$, $c = [1.2, 0.0, 0.5]$, $d = [4.0, 3.0, 7.0]$, $e = [5.0, 0.0, 0.5]$.
Simulation timestep dt = 0.05, number of timesteps = 150. The units are arbitrary.
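The sketch below builds the first component of this initial velocity field and applies one viscous substep in the frequency domain using explicit DFT matrices (the quantity mapped onto the crossbar), following the transform-filter-inverse-transform pattern described above. Only the diffusion factor is shown; the pressure projection and advection steps of the full solver are omitted.

```python
import numpy as np

n, nu, dt = 16, 0.001, 0.05

# Grid and the first component of the initial velocity field (3 sinusoids)
x, y = np.meshgrid(np.arange(n) / n, np.arange(n) / n)
a, b, c = [1.0, 0.6, 0.3], [3.0, 5.0, 7.0], [1.2, 0.0, 0.5]
d, e = [4.0, 3.0, 7.0], [5.0, 0.0, 0.5]
ux = sum(a[k] * np.sin(2*np.pi*b[k]*x + c[k]) * np.cos(2*np.pi*d[k]*y + e[k])
         for k in range(3))

# DFT matrix (the quantity mapped onto the crossbar) and its inverse
F = np.exp(-2j * np.pi * np.outer(np.arange(n), np.arange(n)) / n)
Finv = np.conj(F) / n

# 2D transform by two matrix multiplications, diffusion in the frequency domain,
# then transform back (only the viscous substep is shown here)
k = 2 * np.pi * np.fft.fftfreq(n, d=1.0 / n)
kx, ky = np.meshgrid(k, k)
ux_hat = F @ ux @ F.T
ux_hat *= np.exp(-nu * (kx**2 + ky**2) * dt)      # exact integration of diffusion
ux_new = np.real(Finv @ ux_hat @ Finv.T)
print(float(np.abs(ux_new - ux).max()))           # small change over one substep
```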
MHD equations
MHD can be described by a set of equations consisting of a continuity equation, an equation of
motion, an equation of state, Ampère's Law, Faraday's law, and Ohm's law. Here we are solving a
simple case where the pressure $p$ is isotropic and the adiabatic index $\gamma$, electrical resistivity $\eta$ and
kinematic viscosity $\nu$ are all constant scalars. A fluid with velocity $\mathbf{u}$ and magnetic field $\mathbf{B}$ can be
described by:
the continuity equation
$$\frac{\partial \rho}{\partial t} = -\mathbf{u} \cdot \nabla \rho - \rho \nabla \cdot \mathbf{u},$$
the equation of state
$$\frac{\partial p}{\partial t} = -\mathbf{u} \cdot \nabla p - \gamma p \nabla \cdot \mathbf{u},$$
the equation of motion
$$\frac{\partial \mathbf{u}}{\partial t} = -(\mathbf{u} \cdot \nabla)\mathbf{u} - \frac{\nabla p}{\rho} - \frac{1}{\mu_0 \rho}\nabla\frac{B^2}{2} + \frac{1}{\mu_0 \rho}(\mathbf{B} \cdot \nabla)\mathbf{B} + \nu \nabla^2 \mathbf{u},$$
and the induction equation
$$\frac{\partial \mathbf{B}}{\partial t} = \nabla \times (\mathbf{u} \times \mathbf{B}) - \nabla \times [\eta(\nabla \times \mathbf{B})],$$ which comes from Ohm's
law, Ampère's law and Faraday's law.
Example problem setup: the problem space is periodic with $L_x = L_y = 2$ and is divided
into $n_x = n_y = 16$ grids. All quantities are normalized before calculation. Normalized $\rho_0 = 1$,
$\eta = 0.1$, $\nu = 0.1$, and $\gamma = 1$. The initial particle velocity field is $\mathbf{u} = (u_x, u_y) = (-\sin(2\pi(y + 0.5)),\, \sin(2\pi(x + 0.5)))$. The initial magnetic flux density is $\mathbf{B} = (B_x, B_y) = (-0.2\sin(2\pi(y + 0.5)),\, 0.2\sin(4\pi(x + 0.5)))$. Timestep dt = 0.02 and the
number of timesteps = 100. The mesh size for the Green's function preconditioner is $6 \times 6$.
Recursive least squares (RLS)
Recursive least squares (RLS) is an adaptive filter algorithm that recursively finds the coefficients
that minimize a weighted linear least squares cost function relating to the input signals. Suppose a
signal x(n) is transmitted over an echoey, noisy channel, it will be received as
$$y(n) = \sum_{k=0}^{p} w_k\, x(n-k) + v(n),$$
where v(n) represents additive noise. The RLS filter can be used to
estimate the channel coefficients $\mathbf{w}$ from $x(n)$ and $y(n)$ and to recover a noise-free version of the received
signal, $\hat{y}(n) = \mathbf{w}^{T}\mathbf{x}_n$, so that $\hat{y}(n)$ is close to $y(n)$ in the least squares sense.
For a p-th order RLS filter, the algorithm runs as follows, where $\alpha$ is the prior estimation
error, $\lambda$ is the forgetting factor, $\mathbf{g}$ is the Kalman gain, and $\mathbf{P}$ is the covariance matrix.
For n = 1, 2, ...
$$\mathbf{x}_n = [x(n), x(n-1), \ldots, x(n-p)]^{T}$$
$$\alpha(n) = y(n) - \mathbf{x}_n^{T}\,\mathbf{w}(n-1)$$
$$\mathbf{g}(n) = \mathbf{P}(n-1)\,\mathbf{x}_n \left\{\lambda + \mathbf{x}_n^{T}\,\mathbf{P}(n-1)\,\mathbf{x}_n\right\}^{-1}$$
$$\mathbf{P}(n) = \lambda^{-1}\mathbf{P}(n-1) - \mathbf{g}(n)\,\mathbf{x}_n^{T}\,\lambda^{-1}\mathbf{P}(n-1)$$
$$\mathbf{w}(n) = \mathbf{w}(n-1) + \alpha(n)\,\mathbf{g}(n)$$
Example problem setup: the window size is t = 10, which is also the filter order, so there are
10 coefficients to be estimated. The input signal x(n) consists of normally distributed random numbers
drawn from the standard normal distribution N(0,1), and the channel coefficients are normally distributed random
numbers [0.1344, 0.4585, -0.5647, 0.2155, 0.0797, -0.3269, -0.1084, 0.0857, 0.8946, 0.6924].
$\lambda = 0.97$.
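A direct transcription of the recursion above, run on this example setup, is sketched below. The noise level of the channel and the initialization of the covariance matrix (a scaled identity) are assumptions of the sketch rather than values given in the setup.

```python
import numpy as np

rng = np.random.default_rng(3)
p = 10                                     # filter length (number of taps)
lam = 0.97                                 # forgetting factor
w_true = np.array([0.1344, 0.4585, -0.5647, 0.2155, 0.0797,
                   -0.3269, -0.1084, 0.0857, 0.8946, 0.6924])

n_samples = 2000
x = rng.standard_normal(n_samples)
# received signal: channel convolution plus (assumed) additive noise
y = np.convolve(x, w_true)[:n_samples] + 0.01 * rng.standard_normal(n_samples)

w = np.zeros(p)
P = 1e3 * np.eye(p)                        # common initialization (assumed)
for n in range(p - 1, n_samples):
    xn = x[n::-1][:p]                      # [x(n), x(n-1), ..., x(n-p+1)]
    alpha = y[n] - xn @ w                  # prior estimation error
    g = P @ xn / (lam + xn @ P @ xn)       # gain vector
    P = (P - np.outer(g, xn) @ P) / lam    # covariance update
    w = w + alpha * g

print(np.round(w, 4))                      # estimated channel coefficients
```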
Calculation of the energy and time consumption for solving a Poisson equation, and discussion of the
comparison with the ASIC target
The speed and power are compared with the digital approach using a highly optimized digital system
with an application-specific integrated circuit (ASIC) (51).
For the memristor solver, since the preconditioner matrix is fixed during the iterations, no re-programming is needed, and the VMM preconditioning operations take the dominant portion of the
hardware time and energy. Each VMM operation takes 10 ns, the average VMM voltage is 0.05 V,
the average resistance of the memristor cells is 10 kΩ, and 5 subarrays are used. Each VMM consists of
M × N multiplications and additions, which results in 160 TOPS/W for the memristor crossbar arrays. As
an approximate comparison, the highly optimized digital system with an application-specific
integrated circuit (ASIC) fabricated at the 40 nm technology node for 4-bit 100-dimensional vector
and 4-bit 100 × 200 matrix multiplication reported 7.02 TOPS/W (51). Although not a direct
comparison, our system is about 22 times more energy efficient.
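The 160 TOPS/W figure follows from a few lines of arithmetic, reproduced below under the stated assumptions (a 256 x 256 array, one multiplication and one addition per cell, 0.05 V average read voltage, 10 kΩ average cell resistance, 10 ns per VMM, and five subarrays contributing to one effective matrix).

```python
# Back-of-envelope check of the 160 TOPS/W figure quoted above.
M = N = 256            # crossbar dimensions (256 x 256, as on the SoC)
V = 0.05               # average VMM read voltage (V)
R = 10e3               # average cell resistance (ohm)
t_vmm = 10e-9          # time per VMM (s)
n_sub = 5              # subarrays summed for one effective matrix

ops = 2 * M * N                            # one multiply + one add per cell
power = n_sub * M * N * V**2 / R           # static read power of all cells (W)
energy = power * t_vmm                     # energy per effective VMM (J)

tops_per_w = ops / energy / 1e12
print(f"{tops_per_w:.0f} TOPS/W, {tops_per_w / 7.02:.1f}x the 7.02 TOPS/W ASIC figure")
```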
The energy and time cost for solving the same Poisson problem are further calculated based
on the number-of-iterations data in Supplementary Table 1. The ASIC solver uses the diagonal
preconditioner while the memristor solver uses the meshed Green's function preconditioner as part
of the co-optimization. This co-optimization not only reduces the time for the preconditioning step,
but also for the other vector-matrix multiplications in the iterative solver, as it reduces the total
number of iterations needed. The time cost ratio is the same as the energy ratio when the two systems are scaled
to have the same power.
Figure 4.8 Experimental results on the Poisson solver with arbitrary precision programming with three arrays on the non-fully integrated memristor platform.
(A) The classical diagonal preconditioner matrix. (B) The target Green's function preconditioner matrix. (C) The
summation of all three matrices A1-A3. (D to F) The 1st, 2nd and 3rd matrices programmed and reverse mapped. (G)
The incorrect solution of a 128 x 128 Poisson equation example by the hardware solver using only the 1st subarray. (H)
The solution of a 128 x 128 Poisson equation example by the hardware solver using all three subarrays. (I) The residual
of the solution over iterations for different settings.
Figure 4.9 Experimental results of accumulated summation of 3 subarrays in SoC using Green’s function
preconditioner.
(A) The target matrix. (B) The written numerical matrix A1. (C) The residual matrix of A1. (D) The written numerical
matrix A2. (E) The accumulated sum of A1 and A2. (F) The residual matrix of A1+A2. (G) The written numerical matrix
A3. (H) The accumulated sum of A1, A2 and A3. (I) The residual matrix of A1+A2+A3. As more subarrays are used,
the summed effective matrix becomes closer to the target matrix.
Figure 4.10 VMM result comparison between multiple subarrays and one subarray with extra tuning cycles.
(A) The 1st subarray, programmed and mapped to A1. (B) The 2nd subarray, programmed and mapped to A2. (C) The
3rd subarray, programmed and mapped to A3. All three subarrays are programmed with 30 cycles. (D) The effective matrix
equals A1+A2+A3. (E) One array programmed with 90 cycles. (F) Actual VMM output over ideal VMM output for
both cases.
Figure 4.11 Accumulated summation of 5 subarrays in simulation using the Green's function preconditioner.
(A) The five numerical matrices A1 to A5. (B) The accumulated sums from A1+A2 to A1+A2+A3+A4+A5. As more
subarrays are used, the summed effective matrix becomes closer to the target matrix.
4.5 Summary
We have demonstrated an innovative circuit architecture and programming protocol that can
efficiently program inaccurate analog devices with arbitrarily high precision within the limit of
digital peripheral circuits. This method enables us to execute partial differential equation solvers
with remarkable precision, energy efficiency, and throughput. Beyond in-memory computing
architecture, it also opens doors for computational applications previously considered infeasible
for emerging analog memories, such as scientific computing, neural network training, and complex
physical system modeling. The accuracy of analog computing was not solely limited by writing
accuracy, but also by reading accuracy. Although our proposed programming method overcame
the limitation of writing accuracy caused by the device's writing pulse response variations and a
small number of stuck or poorly conditioned devices, hardware computing accuracy was still
bound by device reading variation and by peripheral circuit elements such as ADC precision. With recent
advancements in memristor programming (111), the reading variations can be substantially mitigated,
making the proposed solver more effective. This co-design approach enables the use of low-precision analog devices for high-precision computing, considerably broadening the application
scope of analog computing.
Chapter 5: Conclusion and Future Work
5.1 Contributions
Memristive devices, although studied for years, have so far been limited to applications that do not
require high precision, such as neural networks, due to the analog nature of the
device and the computing approach. This dissertation primarily focused on changing this situation
and expanding the capability of such analog devices by providing them with high precision and
programmability. Understanding has been advanced in this process and
new avenues have been opened for high-precision analog computing applications. The efforts span
from the device level, through the circuit and algorithm level, to the system level design. Specifically, the contributions
of this dissertation include:
• Developed a novel implementation of FPAA blocks with memristor crossbar arrays,
in which different types of memristors play various critical roles. Experimentally
demonstrated a variety of computing functions such as tunable filters and a mixed-frequency
classifier.
• Studied the root cause of conductance fluctuations in memristors through experimental
and theoretical studies.
• Devised an electrical operation protocol to denoise the memristors and demonstrated 2048
conductance levels in a memristor from a fully integrated chip, the highest precision
among all types of known memory devices.
• Invented an innovative circuit architecture and programming protocol that can efficiently
program inaccurate analog devices with arbitrarily high precision within the limit of the
digital peripheral circuits.
• Experimentally solved partial differential equations and a recursive least squares problem
using the above method with remarkable solution precision (<10⁻¹⁵ error) and energy
efficiency (~100× over a digital ASIC).
5.2 Future Work
There are still unsolved issues in developing memristive devices for efficient computing
applications. These issues can also be categorized into different levels, but note that an issue at one level may
be solved or mitigated by a solution at a different level.
An FPAA is a general-purpose reconfigurable analog circuit. A very preliminary demonstration
was shown in this dissertation using memristor crossbars. With proper circuit design, it has much
larger potential in more complicated and demanding analog computing applications, such as solving linear
systems directly in the analog domain rather than doing the math in the digital domain.
As mentioned in the previous chapter, when performing VMM, the hardware computing accuracy is
still bound by peripheral circuit elements such as ADC precision. New ADC designs or new
computing architectures with new reading modes may be able to solve this problem. The linearity
of the device and the VMM circuit could become the new bottleneck once a certain level of precision is
achieved, which will require dedicated and practical algorithms to compensate for it.
The arbitrary precision algorithm opens doors for computational applications previously
considered infeasible for emerging analog memories, such as scientific computing and complex
physical system modeling, as well as online training for complicated models.
References
1. Prezioso, M. et al. Training and operation of an integrated neuromorphic network based on
metal-oxide memristors. Nature 521, 61–64 (2015).
2. Sheridan, P. M. et al. Sparse coding with memristor networks. Nature Nanotech 12, 784–789
(2017).
3. Mead, C. Neuromorphic electronic systems. Proc. IEEE 78, 1629–1636 (1990).
4. Xia, Q. & Yang, J. J. Memristive crossbar arrays for brain-inspired computing. Nature
Materials 18, 309–323 (2019).
5. Wang, Z. et al. Resistive switching materials for information processing. Nat Rev Mater 5,
173–195 (2020).
6. Yang, J. J., Strukov, D. B. & Stewart, D. R. Memristive devices for computing. Nature
Nanotech 8, 13–24 (2013).
7. Yang, J. J. et al. Memristive switching mechanism for metal/oxide/metal nanodevices. Nature
Nanotech 3, 429–433 (2008).
8. Liu, M., Yu, H. & Wang, W. FPAA Based on Integration of CMOS and Nanojunction Devices
for Neuromorphic Applications. in Nano-Net (ed. Cheng, M.) vol. 3 44–48 (Springer Berlin
Heidelberg, 2009).
9. Yang, Y. & Huang, R. Probing memristive switching in nanoionic devices. Nat Electron 1,
274–287 (2018).
10. Ascoli, A., Tetzlaff, R., Kang, S.-M. S. & Chua, L. System-Theoretic Methods for Designing
Bio-Inspired Mem-Computing Memristor Cellular Nonlinear Networks. Front. Nanotechnol.
3, 633026 (2021).
11. Gao, B. et al. Memristor-based analogue computing for brain-inspired sound localization with
in situ training. Nat Commun 13, 2026 (2022).
12. Ramakrishnan, S. & Hasler, J. Vector-Matrix Multiply and Winner-Take-All as an Analog
Classifier. IEEE Trans. VLSI Syst. 22, 353–361 (2014).
13. Merrikh-Bayat, F. et al. High-Performance Mixed-Signal Neurocomputing With Nanoscale
Floating-Gate Memory Cell Arrays. IEEE Trans. Neural Netw. Learning Syst. 29, 4782–4790
(2018).
14. Danial, L. et al. Two-terminal floating-gate transistors with a low-power memristive operation
mode for analogue neuromorphic computing. Nat Electron 2, 596–605 (2019).
15. Burr, G. W. et al. Experimental Demonstration and Tolerancing of a Large-Scale Neural
Network (165 000 Synapses) Using Phase-Change Memory as the Synaptic Weight Element.
IEEE Transactions on Electron Devices 62, 3498–3507 (2015).
16. Boybat, I. et al. Neuromorphic computing with multi-memristive synapses. Nature
Communications 9, 2514 (2018).
17. Joshi, V. et al. Accurate deep neural network inference using computational phase-change
memory. Nat Commun 11, 2473 (2020).
18. Kaneko, Y., Nishitani, Y. & Ueda, M. Ferroelectric Artificial Synapses for Recognition of a
Multishaded Image. IEEE Trans. Electron Devices 61, 2827–2833 (2014).
19. Aziz, A. et al. Computing with ferroelectric FETs: Devices, models, systems, and applications.
in 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE) 1289–1298
(IEEE, 2018). doi:10.23919/DATE.2018.8342213.
20. Khan, A. I., Keshavarzi, A. & Datta, S. The future of ferroelectric field-effect transistor
technology. Nat Electron 3, 588–597 (2020).
21. Berdan, R. et al. Low-power linear computation using nonlinear ferroelectric tunnel junction
memristors. Nature Electronics 3, 259–266 (2020).
22. Niu, X., Tian, B., Zhu, Q., Dkhil, B. & Duan, C. Ferroelectric polymers for neuromorphic
computing. Applied Physics Reviews 9, 021309 (2022).
23. Romera, M. et al. Vowel recognition with four coupled spin-torque nano-oscillators. Nature
563, 230–234 (2018).
24. Strukov, D. B., Snider, G. S., Stewart, D. R. & Williams, R. S. The missing memristor found.
Nature 453, 80–83 (2008).
25. Woods, W. & Teuscher, C. Approximate vector matrix multiplication implementations for
neuromorphic applications using memristive crossbars. in 2017 IEEE/ACM International
Symposium on Nanoscale Architectures (NANOARCH) 103–108 (IEEE, 2017).
doi:10.1109/NANOARCH.2017.8053729.
26. Fahimi, Z., Mahmoodi, M. R., Klachko, M., Nili, H. & Strukov, D. B. The Impact of Device
Uniformity on Functionality of Analog Passively-Integrated Memristive Circuits. IEEE Trans.
Circuits Syst. I 68, 4090–4101 (2021).
27. Zhou, G. et al. Volatile and Nonvolatile Memristive Devices for Neuromorphic Computing.
Adv Elect Materials 8, 2101127 (2022).
28. Kim, Y. et al. Neural network learning using non-ideal resistive memory devices. Front.
Nanotechnol. 4, 1008266 (2022).
29. Amirsoleimani, A. et al. In‐Memory Vector‐Matrix Multiplication in Monolithic
Complementary Metal–Oxide–Semiconductor‐Memristor Integrated Circuits: Design Choices,
Challenges, and Perspectives. Advanced Intelligent Systems 2, 2000115 (2020).
30. Mahmoodi, M. R., Vincent, A. F., Nili, H. & Strukov, D. B. Intrinsic Bounds for Computing
Precision in Memristor-Based Vector-by-Matrix Multipliers. IEEE Trans. Nanotechnology 19,
429–435 (2020).
31. Bengel, C. et al. Reliability aspects of binary vector-matrix-multiplications using ReRAM
devices. Neuromorph. Comput. Eng. 2, 034001 (2022).
32. Zidan, M. A. et al. A general memristor-based partial differential equation solver. Nat Electron
1, 411–420 (2018).
33. Yao, P. et al. Face classification using electronic synapses. Nature Communications 8, 15199
(2017).
34. Choi, B. J. et al. High‐Speed and Low‐Energy Nitride Memristors. Adv. Funct. Mater. 26,
5290–5296 (2016).
35. Gulak, P. G. Field programmable analog arrays: past, present and future perspectives. in 1995
IEEE TENCON. IEEE Region 10 International Conference on Microelectronics and VLSI.
‘Asia-Pacific Microelectronics 2000’. Proceedings 123–126 (IEEE, 1995).
doi:10.1109/TENCON.1995.496352.
36. Marsh, D. FROM EDN EUROPE: Programmable analogue ICs challenge Spice-and-breadboard designs. 9 (2001).
37. D’Mello, D. R. & Gulak, P. G. Design Approaches to Field-Programmable Analog Integrated
Circuits. in Field-Programmable Analog Arrays (eds. Pierzchala, E., Gulak, G., Chua, L. O.
& Rodríguez-Vázquez, A.) 7–34 (Springer US, 1998). doi:10.1007/978-1-4757-5224-3_1.
38. Fisher, W. A., Fujimoto, R. J. & Smithson, R. C. A programmable analog neural network
processor. IEEE Trans. Neural Netw. 2, 222–229 (1991).
39. Austin, K. Integrated circuit for analogue system. (1993).
40. Lee, E. K. F. & Gulak, P. G. Field programmable analogue array based on MOSFET
transconductors. Electronics Letters 28, 28–29 (1992).
41. Brink, S., Hasler, J. & Wunderlich, R. Adaptive Floating-Gate Circuit Enabled Large-Scale
FPAA. IEEE Trans. VLSI Syst. 22, 2307–2315 (2014).
42. Basu, A. et al. A Floating-Gate-Based Field-Programmable Analog Array. IEEE J. Solid-State
Circuits 45, 1781–1794 (2010).
43. Hall, T. S., Hasler, P. & Anderson, D. V. Field-Programmable Analog Arrays: A Floating-Gate Approach. in Field-Programmable Logic and Applications: Reconfigurable Computing
Is Going Mainstream (eds. Glesner, M., Zipf, P. & Renovell, M.) vol. 2438 424–433 (Springer
Berlin Heidelberg, 2002).
44. Harrison, R. R., Bragg, J. A., Hasler, P., Minch, B. A. & Deweerth, S. P. A CMOS
programmable analog memory-cell array using floating-gate circuits. IEEE Trans. Circuits
Syst. II 48, 4–11 (2001).
45. Hall, T. S., Twigg, C. M., Hasler, P. & Anderson, D. V. Developing large-scale field-programmable analog arrays. in 18th International Parallel and Distributed Processing
Symposium, 2004. Proceedings. 142–147 (IEEE, 2004). doi:10.1109/IPDPS.2004.1303121.
46. Laiho, M. et al. FPAA/Memristor Hybrid Computing Infrastructure. IEEE Trans. Circuits Syst.
I 62, 906–915 (2015).
47. Laiho, M. et al. Analog signal processing on a FPAA/memristor hybrid circuit. in 2014 IEEE
International Symposium on Circuits and Systems (ISCAS) 2265–2268 (IEEE, 2014).
doi:10.1109/ISCAS.2014.6865622.
48. Kim, S., Hasler, J. & George, S. Integrated Floating-Gate Programming Environment for
System-Level ICs. IEEE Trans. VLSI Syst. 24, 1–9 (2015).
49. Yang, J., Qureshi, M. S., Ribeiro, G. M. & Williams, R. S. Field-programmable analog array
with memristors. 8 (2014).
50. Yang, J. et al. Thousands of conductance levels in memristors monolithically integrated on
CMOS. https://www.researchsquare.com/article/rs-1939455/v1 (2022) doi:10.21203/rs.3.rs-1939455/v1.
51. Jiang, H. et al. Sub-10 nm Ta Channel Responsible for Superior Performance of a HfO2
Memristor. Sci Rep 6, 28525 (2016).
52. Hu, M. et al. Dot-Product Engine for Neuromorphic Computing: Programming 1T1M
Crossbar to Accelerate Matrix-Vector Multiplication. IEEE Design Automation Conference
(2016) doi:10.1145/2897937.2898010.
53. Fujishima, H., Takemoto, Y., Onoye, T. & Shirakawa, I. An architecture of a matrix-vector
multiplier dedicated to video decoding and three-dimensional computer graphics. IEEE Trans.
Circuits Syst. Video Technol. 9, 306–314 (1999).
54. Li, C. et al. Analogue signal and image processing with large memristor crossbars. Nat
Electron 1, 52–59 (2018).
55. Bayat, F. M., Alibart, F., Gao, L. & Strukov, D. B. A Reconfigurable FIR Filter with
Memristor-Based Weights. (2016) doi:10.48550/ARXIV.1608.05445.
56. Ali, S., Hassan, A., Hassan, G., Bae, J. & Lee, C. H. Memristor-capacitor passive filters to
tune both cut-off frequency and bandwidth. in (eds. Chung, Y. et al.) 1032379 (2017).
doi:10.1117/12.2264963.
57. Sözen, H. & Cam, U. On the realization of memristor based RC high pass filter. in 2013 8th
International Conference on Electrical and Electronics Engineering (ELECO) 45–48 (IEEE,
2013). doi:10.1109/ELECO.2013.6713933.
58. Välimäki, V. & Reiss, J. All About Audio Equalization: Solutions and Frontiers. Applied
Sciences 6, 129 (2016).
59. Kumar, N., Himmelbauer, W., Cauwenberghs, G. & Andreou, A. G. An analog VLSI chip
with asynchronous interface for auditory feature extraction. IEEE Trans. Circuits Syst. II 45,
600–606 (1998).
60. Kim, B. & Pardo, B. Speeding Learning of Personalized Audio Equalization. in 2014 13th
International Conference on Machine Learning and Applications 495–499 (IEEE, 2014).
doi:10.1109/ICMLA.2014.86.
61. George, S. et al. A Programmable and Configurable Mixed-Mode FPAA SoC. IEEE Trans.
VLSI Syst. 1–9 (2016) doi:10.1109/TVLSI.2015.2504119.
62. Chua, L. Memristor-The missing circuit element. IEEE Trans. Circuit Theory 18, 507–519
(1971).
63. Valov, I., Waser, R., Jameson, J. R. & Kozicki, M. N. Electrochemical metallization
memories—fundamentals, applications, prospects. Nanotechnology 22, 254003 (2011).
64. Wen, W., Wu, C., Wang, Y., Chen, Y. & Li, H. Learning structured sparsity in deep neural
networks. Advances in neural information processing systems 29, (2016).
65. Wan, W. et al. A compute-in-memory chip based on resistive random-access memory. Nature
608, 504–512 (2022).
66. Kumar, S., Wang, X., Strachan, J. P., Yang, Y. & Lu, W. D. Dynamical memristors for higher-complexity neuromorphic computing. Nat Rev Mater 7, 575–591 (2022).
67. Yu, S. Neuro-Inspired Computing With Emerging Nonvolatile Memorys. Proc. IEEE 106,
260–285 (2018).
68. Xue, C.-X. et al. A CMOS-integrated compute-in-memory macro based on resistive random-access memory for AI edge devices. Nat Electron 4, 81–90 (2020).
69. Lanza, M. et al. Memristive technologies for data storage, computation, encryption, and radio-frequency communication. Science 376, eabj9979 (2022).
70. Yao, P. et al. Fully hardware-implemented memristor convolutional neural network. Nature
577, 641–646 (2020).
71. Zhang, W. et al. Neuro-inspired computing chips. Nat Electron 3, 371–382 (2020).
72. Ielmini, D. & Wong, H.-S. P. In-memory computing with resistive switching devices. Nat
Electron 1, 333–343 (2018).
73. Zidan, M. A., Strachan, J. P. & Lu, W. D. The future of electronics based on memristive
systems. Nat Electron 1, 22–29 (2018).
74. Jung, S. et al. A crossbar array of magnetoresistive memory devices for in-memory computing.
Nature 601, 211–216 (2022).
75. Sangwan, V. K. & Hersam, M. C. Neuromorphic nanoelectronic materials. Nat. Nanotechnol.
15, 517–528 (2020).
76. Burr, G. W. A role for analogue memory in AI hardware. Nat Mach Intell 1, 10–11 (2019).
77. Chen, S. et al. Wafer-scale integration of two-dimensional materials in high-density
memristive crossbar arrays for artificial neural networks. Nat Electron 3, 638–645 (2020).
78. Fuller, E. J. et al. Parallel programming of an ionic floating-gate memory array for scalable
neuromorphic computing. Science 364, 570–574 (2019).
79. Choi, C. et al. Reconfigurable heterogeneous integration using stackable chips with embedded
artificial intelligence. Nat Electron 5, 386–393 (2022).
80. Lim, D.-H. et al. Spontaneous sparse learning for PCM-based memristor neural networks. Nat
Commun 12, 319 (2021).
81. Xu, X. et al. Scaling for edge inference of deep neural networks. Nat Electron 1, 216–222
(2018).
82. Sun, Y. et al. A Ti/AlO x /TaO x /Pt Analog Synapse for Memristive Neural Network. IEEE
Electron Device Lett. 39, 1298–1301 (2018).
83. Stathopoulos, S. et al. Multibit memory operation of metal-oxide bi-layer memristors. Sci Rep
7, 17532 (2017).
84. Kim, H., Mahmoodi, M. R., Nili, H. & Strukov, D. B. 4K-memristor analog-grade passive
crossbar circuit. Nat Commun 12, 5198 (2021).
85. Mackin, C. et al. Optimised weight programming for analogue memory-based deep neural
networks. Nat Commun 13, 3765 (2022).
86. Choi, S. et al. SiGe epitaxial memory for neuromorphic computing with reproducible high
performance based on engineered dislocations. Nature Mater 17, 335–340 (2018).
87. Yan, Z., Hu, X. S. & Shi, Y. SWIM: selective write-verify for computing-in-memory neural
accelerators. in Proceedings of the 59th ACM/IEEE Design Automation Conference 277–282
(ACM, 2022). doi:10.1145/3489517.3530459.
88. Choi, S., Yang, Y. & Lu, W. Random telegraph noise and resistance switching analysis of
oxide based resistive memory. Nanoscale 6, 400–404 (2014).
89. Ielmini, D., Nardi, F. & Cagli, C. Resistance-dependent amplitude of random telegraph-signal
noise in resistive switching memories. Appl. Phys. Lett. 96, 053503 (2010).
90. Puglisi, F. M., Pavan, P., Padovani, A., Larcher, L. & Bersuker, G. Random Telegraph Signal
noise properties of HfOx RRAM in high resistive state. in 2012 Proceedings of the European
Solid-State Device Research Conference (ESSDERC) 274–277 (IEEE, 2012).
doi:10.1109/ESSDERC.2012.6343386.
91. Lee, J.-K. et al. Extraction of trap location and energy from random telegraph noise in
amorphous TiOx resistance random access memories. Appl. Phys. Lett. 98, 143502 (2011).
92. Puglisi, F. M., Padovani, A., Larcher, L. & Pavan, P. Random telegraph noise: Measurement,
data analysis, and interpretation. in 2017 IEEE 24th International Symposium on the Physical
and Failure Analysis of Integrated Circuits (IPFA) 1–9 (IEEE, 2017).
doi:10.1109/IPFA.2017.8060057.
93. Puglisi, F. M., Zagni, N., Larcher, L. & Pavan, P. Random Telegraph Noise in Resistive
Random Access Memories: Compact Modeling and Advanced Circuit Design. IEEE Trans.
Electron Devices 65, 2964–2972 (2018).
94. Yang, Y. et al. Probing nanoscale oxygen ion motion in memristive systems. Nat Commun 8,
15173 (2017).
95. Ranjan, A., Raghavan, N., Shubhakar, K., O’Shea, S. J. & Pey, K. L. Random Telegraph Noise
Nano-spectroscopy in High-κ Dielectrics Using Scanning Probe Microscopy Techniques. in
Noise in Nanoscale Semiconductor Devices (ed. Grasser, T.) 417–440 (Springer International
Publishing, 2020). doi:10.1007/978-3-030-37500-3_12.
96. Hui, F. & Lanza, M. Scanning probe microscopy for advanced nanoelectronics. Nat Electron
2, 221–229 (2019).
97. Celano, U. et al. Three-Dimensional Observation of the Conductive Filament in Nanoscaled
Resistive Memory Devices. Nano Lett. 14, 2401–2406 (2014).
98. Du, H. et al. Nanosized Conducting Filaments Formed by Atomic-Scale Defects in Redox-Based Resistive Switching Memories. Chem. Mater. 29, 3164–3173 (2017).
99. Puglisi, F. M., Larcher, L., Padovani, A. & Pavan, P. A Complete Statistical Investigation of
RTN in HfO2-Based RRAM in High Resistive State. IEEE Trans. Electron Devices 62, 2606–
2613 (2015).
100. Ambrogio, S. et al. Statistical Fluctuations in HfO x Resistive-Switching Memory: Part
II—Random Telegraph Noise. IEEE Trans. Electron Devices 61, 2920–2927 (2014).
101. Becker, T. et al. An Electrical Model for Trap Coupling Effects on Random Telegraph
Noise. IEEE Electron Device Lett. 41, 1596–1599 (2020).
102. Brivio, S., Frascaroli, J., Covi, E. & Spiga, S. Stimulated Ionic Telegraph Noise in
Filamentary Memristive Devices. Sci Rep 9, 6310 (2019).
103. Miao, F. et al. Anatomy of a Nanoscale Conduction Channel Reveals the Mechanism of
a High-Performance Memristor. Adv. Mater. 23, 5633–5640 (2011).
104. Zhou, Y. et al. The effects of oxygen vacancies on ferroelectric phase transition of HfO2-
based thin film from first-principle. Computational Materials Science 167, 143–150 (2019).
105. Chen, B. et al. A memristor-based hybrid analog-digital computing platform for mobile
robotics. Sci. Robot. 5, eabb6938 (2020).
106. Hinton, G. The Forward-Forward Algorithm: Some Preliminary Investigations. Preprint
at http://arxiv.org/abs/2212.13345 (2022).
107. Hu, M. et al. Dot-product engine for neuromorphic computing: programming 1T1M
crossbar to accelerate matrix-vector multiplication. in Proceedings of the 53rd Annual Design
Automation Conference 1–6 (ACM, 2016). doi:10.1145/2897937.2898010.
108. Zangeneh-Nejad, F., Sounas, D. L., Alù, A. & Fleury, R. Analogue computing with
metamaterials. Nat Rev Mater 6, 207–225 (2020).
109. Van De Burgt, Y. et al. A non-volatile organic electrochemical device as a low-voltage
artificial synapse for neuromorphic computing. Nature Mater 16, 414–418 (2017).
110. Choe, G., Lu, A. & Yu, S. 3D AND-Type Ferroelectric Transistors for Compute-in-Memory and the Variability Analysis. IEEE Electron Device Lett. 43, 304–307 (2022).
111. Rao, M. et al. Thousands of conductance levels in memristors integrated on CMOS.
Nature 615, 823–829 (2023).
112. Gokmen, T. & Vlasov, Y. Acceleration of Deep Neural Network Training with Resistive
Cross-Point Devices: Design Considerations. Front. Neurosci. 10, (2016).
113. Wang, Z. et al. Fully memristive neural networks for pattern classification with
unsupervised learning. Nat Electron 1, 137–145 (2018).
114. Jiang, H., Li, W., Huang, S. & Yu, S. A 40nm Analog-Input ADC-Free Compute-in-Memory RRAM Macro with Pulse-Width Modulation between Sub-arrays. in 2022 IEEE
Symposium on VLSI Technology and Circuits (VLSI Technology and Circuits) 266–267 (IEEE,
2022). doi:10.1109/VLSITechnologyandCir46769.2022.9830211.
115. Cai, F. et al. Harnessing Intrinsic Noise in Memristor Hopfield Neural Networks for
Combinatorial Optimization. (2019).
116. Misra, S. et al. Probabilistic Neural Computing with Stochastic Devices. Advanced
Materials 2204569 (2022) doi:10.1002/adma.202204569.
117. Riahi Alam, M., Najafi, M. H., Taherinejad, N., Imani, M. & Gottumukkala, R. Stochastic
Computing in Beyond Von-Neumann Era: Processing Bit-Streams in Memristive Memory.
IEEE Trans. Circuits Syst. II 69, 2423–2427 (2022).
118. Jiang, H. et al. A novel true random number generator based on a stochastic diffusive
memristor. Nat Commun 8, 882 (2017).
119. Ibrahim, H. M., Abunahla, H., Mohammad, B. & AlKhzaimi, H. Memristor-based PUF
for lightweight cryptographic randomness. Sci Rep 12, 8633 (2022).
120. Larimian, S., Mahmoodi, M. R. & Strukov, D. B. Improving Machine Learning Attack
Resiliency via Conductance Balancing in Memristive Strong PUFs. IEEE Trans. Electron
Devices 69, 1816–1822 (2022).
121. Richter, I. et al. Memristive Accelerator for Extreme Scale Linear Solvers. in Government
Microcircuit Applications & Critical Technology Conf. (GOMACTech) (2015).
122. Le Gallo, M. et al. Mixed-precision in-memory computing. Nat Electron 1, 246–253
(2018).
123. Huo, Q. et al. A computing-in-memory macro based on three-dimensional resistive
random-access memory. Nat Electron 5, 469–477 (2022).
124. Zheng, Y.-L., Yang, W.-Y., Chen, Y.-S. & Han, D.-H. An Energy-Efficient Inference
Engine for a Configurable ReRAM-Based Neural Network Accelerator. IEEE Trans. Comput.-
Aided Des. Integr. Circuits Syst. 42, 740–753 (2023).
125. Langenegger, J. et al. In-memory factorization of holographic perceptual representations.
Nat. Nanotechnol. (2023) doi:10.1038/s41565-023-01357-8.
126. Luo, Y. et al. A Compute-in-Memory Hardware Accelerator Design With Back-End-of-Line (BEOL) Transistor Based Reconfigurable Interconnect. IEEE J. Emerg. Sel. Topics
Circuits Syst. 12, 445–457 (2022).
127. Barrett, R. et al. Templates for the Solution of Linear Systems: Building Blocks for
Iterative Methods. (Society for Industrial and Applied Mathematics, 1994).
doi:10.1137/1.9781611971538.
128. Loghin, D. Green’s functions for preconditioning. (University of Oxford, 1999).
129. Ichimura, T. et al. A Fast Scalable Iterative Implicit Solver with Green’s function-based
Neural Networks. in 2020 IEEE/ACM 11th Workshop on Latest Advances in Scalable
Algorithms for Large-Scale Systems (ScalA) 61–68 (IEEE, 2020).
doi:10.1109/ScalA51936.2020.00013.
130. Vandenplas, J., Calus, M. P. L., Eding, H. & Vuik, C. A second-level diagonal
preconditioner for single-step SNPBLUP. Genet Sel Evol 51, 30 (2019).
Abstract
While digital computing dominates the technological landscape, analog computing has superior energy efficiency and high throughput. However, its historical limitation in precision and programmability has confined its application to specific and low-precision domains, notably in neural networks. The escalating challenge posed by the analog data deluge calls for versatile analog platforms. These platforms must exhibit exceptional efficiency and boast reconfigurability and precision.
Recent breakthroughs in analog devices, such as memristors, have laid the groundwork for unparalleled analog computing capabilities. Leveraging the multifaceted role of memristors, we introduce memristive field-programmable analog arrays (FPAAs), mirroring the functionality of their digital counterparts, field-programmable gate arrays (FPGAs). To elevate precision, we delve into the origins of reading noise, successfully mitigating it and achieving an unparalleled 2048 conductance levels in individual memristors, equivalent to 11 bits per cell, setting a record precision among diverse memory types. Acknowledging the persistent demand for single or double precision in various applications, we propose and develop a circuit architecture and programming protocol. This innovation enables analog memories to attain arbitrarily high precision with minimal circuit overhead. Our experimental validation involves a memristor System-on-Chip fabricated in a standard foundry, demonstrating significantly improved precision and power efficiency compared to traditional digital systems.
The co-design approach presented empowers low-precision analog devices to perform high-precision computing within a programmable platform. This demonstration underscores the transformative potential of analog computing, transcending historical limitations and ushering in a new era of precision and efficiency.