Custom Hardware Accelerators for Boolean Satisfiability
by
Soowang Park
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)
December 2022
Copyright 2022 Soowang Park
Acknowledgements
It has been 25 years since I began learning computer programming and mathematics. I would
like to express my sincere gratitude to the following people, without whom I would not have been
able to complete this thesis and my degree.
Foremost, I sincerely thank my thesis advisor, Professor Sandeep Gupta, whose insight and
knowledge of the subject matter steered me through this research. Under his guidance, it has been
a long but pleasant journey to completely reshape the way I think and solve problems, and to
construct my implementation and writing in a whole new way.
Besides my advisor, I would like to thank the rest of my doctoral committee, Prof. Pierluigi
Nuzzo and Prof. Chao Wang, not only for their insightful comments on design automation, computer-aided
verification, and Boolean satisfiability, but also for the thought-provoking questions which inspired me to
investigate my research from various perspectives.
My sincere thanks also go to Prof. Jae-Won Nam, who guided me on full-custom design and
encouraged me to continue my research. Without his invaluable support, it would not have been
possible to conduct this research.
Last but not least, I send my biggest thanks to my family for all their support. I am a lucky
man to have met my dearest wife, Daeun, and I sincerely thank her for all her trust and support. I am so
thankful to my lovely kids, Jenna and Leo, who have always been the source of my joy. I send
all my love to my mother and father, my biggest supporters. The sacrifices they have made
for me are beyond description. In addition, I truly thank my parents-in-law for supporting
me all the time.
Table of Contents
Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
  1.1 Motivation
  1.2 Modern Boolean Satisfiability solvers
  1.3 Related research on hardware accelerators for SAT
Chapter 2: Custom Hardware Accelerators for Boolean Satisfiability
  2.1 Introduction and background: Profiling of MiniSAT2
    2.1.1 Runtime analysis and focus on BCP and other dominant operations
    2.1.2 Characteristics of BCP and opportunities for acceleration
    2.1.3 Observations
  2.2 Design innovation for Boolean Constraint Propagation
    2.2.1 Low-cost content-addressable memory architecture
    2.2.2 Novel memory architecture using CAMs and SRAMs
    2.2.3 Near-memory computing
    2.2.4 Elimination of von Neumann overheads
    2.2.5 Low-power architecture
Chapter 3: HW-BCP Design and Optimization
  3.1 Design overview
  3.2 SAT Submodule design
    3.2.1 Memory cells
    3.2.2 Memory array
    3.2.3 Processing logic
    3.2.4 Combining logic
  3.3 Design optimization for HW-BCP
  3.4 Floorplan and H-tree design
  3.5 HW-BCP design metrics
Chapter 4: Custom Hardware Accelerators for Conflict-Driven Clause Learning
  4.1 Background
    4.1.1 Review of SAT Operations
    4.1.2 CDCL runtime analysis and functions
  4.2 HW-CDCL data structure and operations
    4.2.1 Design choices for Clause Information Table (CIT)
    4.2.2 New module, Assignment Information Table (AIT)
    4.2.3 HW-CDCL operations
    4.2.4 Delay of HW-CDCL operations
  4.3 HW-CDCL algorithm
    4.3.1 Algorithm for analyze
    4.3.2 Algorithm for memLitRedundant
  4.4 Integrated architecture, HW-SAT
Chapter 5: Evaluation of HW-SAT
  5.1 Speedup
    5.1.1 Definition of speedup
    5.1.2 BCP speedup
    5.1.3 analyze speedup
    5.1.4 memLitRedundant (mLR) speedup
    5.1.5 Overall SAT-level speedup
  5.2 Extrapolation
    5.2.1 Area and scaling
    5.2.2 Delay and scaling
    5.2.3 Feasibility
  5.3 Comparison
  5.4 Summary
Chapter 6: Contributions and Future Research
  6.1 Contributions
  6.2 Future research
Bibliography
List of Tables
2.1 Profiling for MiniSAT2 [1]: Total runtime devoted to BCP, analyze, and litRedundant for SAT instances with various sizes.
2.2 Cache miss rates for some SAT benchmark instances of different sizes; Level 1 Data (L1-D), Level 2 Data (L2-D), and Level 2 (L2) miss rates are shown. Cache miss rates are fairly independent of the size of SAT instances.
2.3 HW-BCP operations and CAM/SRAM modes
2.4 Power reduction and area/delay overheads using the proposed low-power architecture on HW-BCP.
3.1 Optimized H-tree input wire design.
3.2 BCP operation delay.
3.3 HW-BCP operations and CAM/SRAM modes for HW-BCP and AIT.
4.1 Overall HW-SAT operations and CAM/SRAM modes for HW-BCP and AIT.
4.2 Summarized CDCL operations and delays: CIT, AIT-Search, AIT-Push, and AIT-Pop
5.1 Profile of BCP; total number of BCP invocations during runtime and BCP runtime (%), average BCP delay in SW (d_avg,SW,BCP (ns)), BCP speedup (θ_BCP), and SAT-level speedup (S_BCP) on MiniSAT2 [1].
5.2 CPU Benchmark Data – PassMark single thread score
5.3 Profile analyze in terms of CIT/AIT operations and analyze speedup.
5.4 Profile litRedundant in terms of AIT operations and memLitRedundant (mLR) speedup (θ_mLR).
5.5 Runtimes for BCP, analyze, and mLR (P_BCP, P_analyze, and P_mLR, respectively) and individual function speedups (θ_BCP, θ_analyze, and θ_mLR, respectively), and the overall SAT-level speedup (S_BCP,analyze,mLR) in MiniSAT2+HW-SAT on MiniSAT2 [1].
5.6 Extrapolation for 7nm technology; maximum numbers of clauses and corresponding estimated chip area (mm²), individual function speedups (θ_BCP, θ_analyze, and θ_mLR), and SAT-level speedup (S_BCP,analyze,mLR) on MiniSAT2+HW-SAT.
5.7 Function speedups (θ_BCP, θ_analyze, and θ_mLR) and SAT-level speedup (S_BCP,analyze,mLR) on MiniSAT2+HW-SAT (designed for one million clauses) compared to software MiniSAT2 [1] for different technologies: 65nm with our memory cells, 65nm with library cells, and 7nm.
5.8 SRAM cell area (µm²) and maximum number of clauses in a 5cm² chip for different technologies: 65nm with our memory cells, 65nm with library cells, and 7nm.
5.9 BCP performance comparison between MiniSAT2 [1], FPGA-BCP [2], and the proposed HW-BCP.
List of Figures
2.1 A. Execution profile for MiniSAT2 [1]: % CPU time required by key functions, average for 63 instances; B. Total runtime devoted to BCP for SAT instances with various sizes.
2.2 BCP algorithm.
2.3 BCP data structure.
2.4 Floorplan with H-Tree data bus and SAT Submodules.
2.5 HW-BCP CAM/SRAM array with formula information. (This figure assumes that n_v,max and n_c,max are 8 and 16, respectively.)
2.6 BCP mode of SAT Submodule.
2.7 BCP operation with an example.
2.8 Group example: 4 groups of SAT Submodules and Group Status Table.
3.1 Our schematics and layouts of CAM and SRAM cells.
3.2 CAM/SRAM array and processing logic.
3.3 Sub-combining logic.
3.4 Combining logic inside each SAT Submodule.
3.5 Optimized SAT Submodule and a part of the layout.
3.6 SAT Group Submodule.
3.7 SAT Array floorplan with combining logic blocks.
3.8 SAT Array geometry and H-tree stages.
3.9 Buffer design for H-tree input wire.
4.1 Algorithm of analyze in MiniSAT2 [1].
4.2 Procedure of conflict-clause minimization in litRedundant.
4.3 CDCL mode of SAT Submodule.
4.4 Hardware modules for HW-CDCL; Assignment Information Table (AIT) and Clause Information Table (CIT).
4.5 Algorithm of HW-CDCL-analyze; analyze function of HW-CDCL.
4.6 Architecture of HW-SAT.
5.1 Area and wire delay with different instance sizes on different technologies: (1) 65nm with our memory cells and (2) 65nm with library memory cells.
Abstract
Boolean Satisfiability (SAT) has broad usage in Electronic Design Automation (EDA), artificial
intelligence (AI), and theoretical studies. Further, since SAT is NP-complete, its acceleration
will also enable acceleration of a wide range of combinatorial problems.
We propose a completely new custom hardware design to accelerate SAT. Starting with the
well-known fact that Boolean Constraint Propagation (BCP) takes most of the SAT solving time
(80-90%), we first focus on accelerating BCP. Beyond BCP, to achieve > 10× SAT-level speedup,
we also accelerate functions of conflict-driven clause learning (CDCL), namely analyze and
memLitRedundant. By profiling a widely-used software SAT solver, MiniSAT v2.2.0 (MiniSAT2) [1],
we identify opportunities to accelerate BCP via parallelization and elimination of all von
Neumann overheads, including data movement. We also create hardware implementations of other
data structures for CDCL and design accelerators for CDCL operations.
We propose hardware accelerators for Boolean Satisfiability (HW-SAT). The core part of
HW-SAT is an accelerator for BCP (HW-BCP), which achieves these goals via a customized
combination of content-addressable memory (CAM) cells, SRAM cells, logic circuitry, and optimized
interconnects. Specifically, our CAM design is area-delay efficient since we modify the CAM by
removing its high-area, high-delay part, namely the priority encoder. The secondary part of
HW-SAT consists of accelerators for CDCL (HW-CDCL), which carry out key lookup operations via
reuse of the HW-BCP memory array with minimal design updates, and an implementation of the
assignment information table (AIT).
In 65nm technology, for the largest SAT instances in the SAT Competition 2017 benchmark
suite [3], our HW-BCP dramatically accelerates BCP (10.1ns per BCP in simulations), and hence
provides a 5.0× speedup over FPGA-BCP [2] and a 137-275× speedup over MiniSAT2 running on
general-purpose processors, where MiniSAT2 is an optimized software implementation which
includes successful heuristics like the two-watched literal [4] and blocking literal [5]. More importantly,
due to the area efficiency of ASIC design, our HW-BCP can solve 16× larger SAT instances than
FPGA-BCP [2]. Our HW-CDCL is a completely new architecture which accelerates analyze by
15-33× and memLitRedundant by 6-26×.
Even though HW-BCP accelerates BCP by two orders of magnitude (137-275×), the overall
acceleration at the SAT level is limited to 4.0-11.6×. However, the proposed HW-SAT, which
integrates HW-BCP and HW-CDCL, achieves a SAT-level speedup of 8.6-18.7×.
Then, we extrapolate our HW-SAT design to 7nm technology and estimate area and delay.
The analysis shows that in 7nm, in a realistic chip size, our HW-SAT would be able to solve the
largest SAT instances in the benchmark suite [3]. Again, for 7nm, our HW-SAT will be an order of
magnitude faster and will solve problems that are an order of magnitude larger than FPGA-BCP.
Finally, we have developed SW approaches (namely, appropriate placement of clauses across
subarrays in HW-SAT) and HW enhancements (additional logic to enable only a fraction of
HW-SAT subarrays) to reduce the average power of HW-SAT to acceptable levels.
Chapter 1: Introduction
We present a novel custom hardware accelerator for classical combinatorial problems. To achieve
area-delay-power efficient acceleration, the hardware design is identified through comprehensive
design-space exploration and optimization at different levels: circuit, architecture, and
software.
1.1 Motivation
The semiconductor industry has benefited dramatically from Moore's law for the past 50 years. However,
the industry now faces challenges due to the slowing of Moore's law. Researchers are hence
developing alternative ways to improve performance.
Besides traditional central processing units (CPUs), custom architectures have been developed
to accelerate specific demanding applications, such as graphics processing units (GPUs) and
special accelerators for deep neural networks (DNNs) and graph processing. GPUs are widely used in
image processing and scientific computing because of their powerful parallel processing for vector
operations. Field programmable gate arrays (FPGAs) are also widely used to accelerate diverse
applications from embedded systems to cloud computing because of their reconfigurability [6].
We are interested in designing hardware to accelerate combinatorial problems. Specifically,
if we develop custom hardware to solve one of the NP-complete problems efficiently in terms
of area, delay, and power, it will significantly impact the mathematical sciences and a wide range of
industries.
Among combinatorial problems, the Boolean Satisfiability (SAT) problem has long been a core
problem (details in Section 1.2). Due to its importance, many software SAT solvers
have been developed and considerable performance improvement has been achieved. Modern
software SAT solvers like MiniSAT2 [1] achieved substantial performance improvements by adopting
successful heuristics. Despite these improvements, SAT solvers face performance bottlenecks.
Specifically, MiniSAT2 [1] spends 80-90% of its total runtime on Boolean Constraint Propagation
(BCP) operations, where each BCP takes around 1000 clock cycles on general-purpose processors
due to sequential table lookups (pointer chasing) as well as data movement and other von Neumann
overheads.
Several ASIC and FPGA implementations of SAT solvers have been proposed to improve the
performance of software SAT solvers by parallelizing SAT operations at coarse grain [2, 7, 8]. While
some ASIC SAT solvers achieved considerable speedup at the SAT level, they can only solve
small SAT instances (under 1000 clauses). A successful BCP-level SAT solver [2] was developed
by implementing BCP accelerators on an FPGA (FPGA-BCP). FPGA-BCP [2] achieved a
10-20× speedup on BCP operations and worked for SAT instances with up to 64K clauses. However,
the FPGA architecture incurs long latency due to a sequence of memory lookups; further, the limited
capacity of FPGA Block RAMs (BRAMs) limits the size of SAT instances. Also, after
implementing BCP, limited resources remained on the FPGA to implement operations of SAT beyond
BCP.
We are motivated to design a completely new hardware accelerator as an ASIC, where we
fully customize the memory architecture and logic circuitry for SAT. In particular, we propose a
custom hardware design for BCP operations (HW-BCP) which enables fully parallel operation using
content-addressable memory (CAM).
Compared to FPGA-BCP [2], in 65nm technology, the proposed HW-BCP achieves a 5.0× speedup
on BCP operations and can also solve 16× larger SAT instances (with up to 1M clauses).
We demonstrate this via a detailed comparison of the performance of FPGA-BCP and our HW-BCP
in Section 5.3. Besides area and delay advantages over FPGA-BCP [2], the HW-BCP adopts an
efficient low-power architecture to minimize power consumption, which will be discussed in
Section 2.2.5.
We further improve our HW-BCP by expanding its hardware to process more SAT operations
(which is not feasible on FPGAs due to limited resources), eventually achieving much higher
speedups at the SAT level.
1.2 Modern Boolean Satisfiability solvers
A combinatorial problem consists of finding one solution, among a finite set of candidates, that satisfies a
set of constraints. Such problems have a wide range of applications, such as constraint programming (CP),
linear programming (LP), Satisfiability, etc.
Among these, Boolean Satisfiability (SAT), an NP-complete problem, is key to many areas
of mathematical science. A large number of other combinatorial problems are NP-complete, and
hence SAT plays an important role in many software frameworks such as solving constraint
integer programs (SCIP) [9], symbolic execution [10], constraint integer programming, constraint
programming, etc. SCIP is widely used in graph theory to find optimal paths for transport
problems, chip design verification, network optimization, etc. Symbolic execution is a fundamental
technique used by software testing tools to generate test cases dynamically and analyze software
quality, and is widely used in academic research [11, 12] and industry applications [13, 14].
All these problems can be mapped to SAT and hence will benefit from a SAT accelerator.
A Boolean Satisfiability (SAT) problem consists of finding an assignment of values to the
variables of a Boolean formula φ such that φ evaluates to true, or determining that no such assignment
exists, i.e., φ is unsatisfiable. Typically, φ is expressed in conjunctive normal form (CNF), as
illustrated below:

φ := (x_0 ∨ x_1 ∨ x_2)(¬x_2 ∨ x_3 ∨ x_4)(¬x_2 ∨ x_5)(¬x_4 ∨ ¬x_5 ∨ x_6),    (1.1)
where the x_i, i ∈ {0,...,6}, are the Boolean variables, and ¬ denotes logical negation. x_i
or its complement ¬x_i is a literal. A clause is the disjunction (denoted by ∨) of one or more
literals, delimited by parentheses in Eq. (1.1). The overall CNF is the conjunction (denoted by ∧
and dropped, for simplicity, in Eq. (1.1)) of one or more clauses. We will refer to the above four
clauses as C_0, C_1, C_2, C_3, respectively.
Most modern software SAT solvers (SW-SAT) are based on the Davis-Putnam-Logemann-
Loveland (DPLL) algorithm [15]. A simple version of DPLL uses a heuristic approach to select a
variable and a value to be assigned in the next decision step, e.g., by assigning value v to variable
x_i. It then adds the assignment (x_i, v) to a decision tree, together with (x_i, ¬v), which is added as
a yet-unexplored alternative, and invokes Boolean Constraint Propagation (BCP). BCP makes
the given assignment and checks the status of the clauses in the formula to produce one of the
following three outcomes.
Satisfied: This case occurs if every clause in φ is satisfied. If so, DPLL reports that the formula
is satisfiable, reports the values assigned to all the variables, i.e., the satisfying assignment, and
terminates.
Conflict: This case occurs if at least one clause is unsatisfied. DPLL invokes a backtrack, i.e.,
it revisits a recent variable assignment on the decision tree, assigns a yet-unexplored alternative
value to the variable, updates the decision tree, and continues. If no variable with an unexplored
alternative exists, then DPLL reports that φ is unsatisfiable and terminates.
Unit propagation: If the given assignment makes all but one literal in a clause false and one
literal has an unspecified value (−), then that literal must be assigned true to avoid making the clause,
and the entire formula, unsatisfiable.
In the example CNF in Eq. (1.1), if the first assignment is (x_5, 0), then the clause C_2 = (¬x_2 ∨ x_5)
leads to a unit propagation, since the second literal, x_5, evaluates to false and the first literal, ¬x_2,
is the only one with unspecified value. It is now necessary to assign 0 to x_2 to prevent this clause
(hence the entire CNF) from becoming false. BCP reports this to DPLL as a unit propagation,
either by reporting that the clause C_2 has a unit propagation or by identifying (x_2, 0) as a necessary
assignment. DPLL continues by processing the assignments required by every unit propagation
before it proceeds to the next decision step. We refer to these assignments as necessary steps.
The assignments made in necessary steps may also be added to the decision tree, each with a flag
denoting that it was a necessary assignment and hence does not have an unexplored alternative.
DPLL then uses its heuristic to select another variable and value to be assigned.
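A compact, self-contained C++ sketch of this simple DPLL procedure is given below. It is illustrative only: the helper names and the recursive, chronological backtracking are our simplifications, not MiniSAT2's implementation.

```cpp
#include <vector>

using std::vector;

enum Val { F = 0, T = 1, U = 2 };             // U = unassigned
struct Lit { int var; bool neg; };            // x_var (neg=false) or !x_var
using Clause = vector<Lit>;
enum Status { SAT_C, CONFLICT_C, UNIT_C, UNRESOLVED_C };

// Status of one clause under the current partial assignment; if the clause
// is unit, the single free literal is returned through `unit`.
static Status clauseStatus(const Clause& c, const vector<Val>& val, Lit& unit) {
    int freeLits = 0;
    for (const Lit& l : c) {
        if (val[l.var] == U) { ++freeLits; unit = l; }
        else if ((val[l.var] == T) != l.neg) return SAT_C;   // literal is true
    }
    if (freeLits == 0) return CONFLICT_C;                    // all literals false
    return freeLits == 1 ? UNIT_C : UNRESOLVED_C;
}

// Recursive DPLL: decisions plus unit propagation; backtracking is implicit
// in the call stack (`val` is passed by value).
static bool dpll(const vector<Clause>& phi, vector<Val> val) {
    for (bool changed = true; changed; ) {     // propagate to a fixpoint
        changed = false;
        for (const Clause& c : phi) {
            Lit u{};
            Status s = clauseStatus(c, val, u);
            if (s == CONFLICT_C) return false;               // backtrack
            if (s == UNIT_C) { val[u.var] = u.neg ? F : T; changed = true; }
        }
    }
    for (int v = 0; v < (int)val.size(); ++v)  // next decision step
        if (val[v] == U) {
            for (Val x : {T, F}) {             // try v, then its alternative
                val[v] = x;
                if (dpll(phi, val)) return true;
            }
            return false;                      // no unexplored alternative
        }
    return true;                               // every clause is satisfied
}
```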
Modern software SAT solvers are very efficient at solving large and difficult problem
instances in practical runtimes. Most solvers are based on the DPLL algorithm [15] and heuristic local
search [16]. Recent software SAT solvers use additional methods to efficiently prune the decision
tree, such as non-chronological backtracking [17], which includes a function that determines the
decision level to backtrack to, and two watched literals (2WL) [4], which significantly reduces the
number of lookups. They also use conflict-driven clause learning (CDCL) [18], i.e., a clause-learning
step that identifies a new clause based on the conflict information. MiniSAT v2.2.0 (MiniSAT2) [1]
implements these advanced techniques and reflects the basic structure of many leading
software SAT solvers.
The performance of MiniSAT2 [1] is limited by the overheads of von Neumann machines. First,
even to run a simple task, the fetch-decode-execute cycle is required. Second, because BCP is
memory bound [19], it requires many table lookups and sustained memory accesses,
and hence incurs high data-movement overheads. In Section 2.1.2, we investigate the
overheads of von Neumann machines in more detail and look for opportunities for hardware acceleration.
1.3 Related research on hardware accelerators for SAT
Gulati et al. demonstrated a full implementation of a SAT solver in an ASIC [8]. Clauses are
partitioned into multiple clause banks. In each bank, the clause ID is used as the row address and the
variable ID is used as the column address. Each clause cell contains the value of a literal (0, 1, or x)
and logic circuits for implication. This design achieved considerable speedup on some SAT instances,
but is not scalable to large SAT instances due to the limitation of the above addressing mechanism.
They also demonstrated a similar design on an FPGA [7]. However, to fit into the small memory
on the FPGA, the original instance is grouped into multiple bins that are loaded and solved
sequentially. Considerable bin-swapping overhead limits the performance improvement.
Davis et al. proposed BCP accelerators on FPGA (FPGA-BCP) [2] by implementing an efficient
mechanism based on a tree walk to maximize the utility of the limited capacity of FPGA Block
RAMs (BRAMs). The original clauses are divided into 2^p groups so that in each group a specific
variable appears at most once. To implement the clause-index tree walk, the variable ID (k bits) is
divided into k/m chunks, and a multi-step index computation is used to find the clause. When there
is a variable assignment, the base address is first concatenated with the first chunk of the VID to
access the address of the next base index. Then the next chunk of the VID is taken and the same
operation is repeated until the leaf node, where the associated clause information resides. Each
clause group can perform the tree walk to accelerate BCP. This design is limited by FPGA BRAM
capacity. They achieve an 18× speedup over MiniSAT2 [1] for SAT benchmark instances with up
to 64K clauses. (For a fair comparison, we scale the performance of general-purpose processors
running MiniSAT2 to 65nm technology.)
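The chunked index walk can be sketched as follows; the node layout and names are our assumptions for illustration, not the actual FPGA-BCP implementation (in hardware, each step is one BRAM lookup).

```cpp
#include <cstdint>
#include <vector>

// One tree node per group of BRAM words: 2^m next-level base indices.
struct IndexNode { std::vector<uint32_t> next; };

// Walk a k-bit variable ID in m-bit chunks (most significant chunk first);
// the value reached at the leaf is the address of the clause information.
uint32_t treeWalk(const std::vector<IndexNode>& nodes,
                  uint32_t vid, int k, int m, uint32_t base = 0) {
    for (int shift = k - m; shift >= 0; shift -= m) {
        uint32_t chunk = (vid >> shift) & ((1u << m) - 1);
        base = nodes[base].next[chunk];   // one memory lookup per chunk
    }
    return base;                          // leaf: clause-info address
}
```

With m = 4, for example, a 16-bit VID takes four such lookups, which is why multiple clock cycles per BCP are unavoidable in this scheme.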
Thong et al. [19] proposed a memory architecture that uses variables as addresses of hardware
memory specially designed for multithreading. This design requires large memory and a complex
Network-on-Chip (NoC) between processing elements. Hence, their actual implementation is for
instances with hundreds of clauses. Other approaches that implement fast SAT accelerators are
also able to handle only small instances [20–22].
Among the approaches mentioned above, we focus on FPGA-BCP [2] due to its scalability.
Multiple clock cycles are required to complete a BCP operation, due to the multiple table lookups
for the tree walk and multiplexing. We verified that memory usage and latency are both optimized
when the chunk size (m) is 4. We also profiled benchmark instances to derive the distributions
of the numbers of clauses in which variables appear. This profile shows that there is no speedup
for numbers of inference engines > 64 (i.e., for p > 6). Using the above, for their study, which uses
SAT instances with ≤ 64K clauses and a 65nm FPGA with a 5ns clock, the latency for each BCP
operation is 10 clock cycles and hence 50ns.
Specifically, in Chapter 2 we demonstrate that our design is scalable to the largest SAT benchmark
instances for a realistic chip area and achieves a considerable performance boost via complete
parallelization. We also show a detailed comparison of the performance of FPGA-BCP and our
HW-BCP.
Chapter 2: Custom Hardware Accelerators for Boolean Satisfiability
2.1 Introduction and background: Profiling of MiniSAT2
In this chapter, we analyze state-of-the-art software SAT solvers to guide the design of our
hardware accelerator. The most frequently used operations become the candidates for hardware
processing units. The specific designs we pursue for our processing units, as well as the overall
architecture of our accelerator, maximize parallelism for these operations and avoid the
bottlenecks and overheads faced by software solvers.
Due to its high performance, MiniSAT2 [1] is used in this study as the reference software SAT
solver. Generally, Boolean Constraint Propagation (BCP) operations are believed to dominate the
total CPU time for modern software SAT solvers like MiniSAT2 [1].
2.1.1 Runtime analysis and focus on BCP and other dominant operations
To understand and characterize the instance-to-instance commonalities and variations in the
percentage of runtime required for BCP, we profiled the SAT Competition 2017 benchmark suite [3]
using MiniSAT2 [1]. This profiling was performed on an Intel Xeon Silver 4114 operating at 2.2
GHz with 96 GB of DDR4 memory.
Whereas easy SAT instances are solved in a few seconds, hard instances take several hours
or more. We selected 63 instances of medium difficulty (total runtime in the 10-5000 sec
range) for our profiling; the summary is shown in Fig. 2.1A and more details in Table 2.1. Our
Figure 2.1: A. Execution profile for MiniSAT2 [1]: % CPU time required by key functions, average
for 63 instances, B. Total runtime devoted to BCP for SAT instances with various sizes.
benchmarking confirmed that the average total BCP runtime is 82% of the total runtime for
these SAT instances.
Further, on the CPU we used for benchmarking MiniSAT2, the average runtime per BCP call is
500ns, which corresponds to approximately 1000 clock cycles. Fig. 2.1B reports the range of total
BCP time across benchmarks and shows that, regardless of the size of the SAT problem instance
(number of variables and number of clauses), the average total BCP time falls in the 56-93%
range. Further, while small instances tend to have more than 90% of runtime
devoted to BCP, generally speaking, BCP time is dominant independent of the instance size.
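As a quick arithmetic check, at the 2.2 GHz clock of our profiling machine: 500ns × 2.2 GHz = 1100 ≈ 1000 clock cycles per BCP call.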
Also, as shown in Fig. 2.1A, the functions that constitute the second and third highest
fractions of the total CPU runtime are litRedundant (7%) and analyze (5%). The analyze function
is called when there is a conflict. It inspects the assignment stack to identify the reason for the
conflict, and returns a learned clause and a backtracking level. The litRedundant function simplifies each
learned clause identified by analyze, iterating over each literal in the learned clause to check
whether the literal can be removed.
Hence, we first design hardware to accelerate BCP operations. Subsequently, we design hard-
ware to accelerate analyze and litRedundant.
Table 2.1: Profiling for MiniSAT2 [1]: Total runtime devoted to BCP, analyze, and litRedundant
for SAT instances with various sizes.
SAT instance BCP (%) analyze (%) litRedundant (%) Variables Clauses
ak128astepbg2msisc 82.0 6.2 8.3 293,123 943,236
ak128astepmodbtisc 82.9 4.8 8.2 263,025 869,765
hwmcc15deep-beemhanoi4b1-k32 69.5 6.3 19.5 392,428 921,944
ak128modbtbg1asisc 59.9 0.7 0.7 263,025 869,765
ak128paralparalisc 80.7 5.6 10.6 265,303 859,947
hwmcc15deep-intel032-k84 88.2 3.2 4.4 426,091 882,799
ak128astepbg2asisc 78.0 7.6 10.5 260,359 845,071
ak128diagodiagoisc 75.6 7.8 13.2 260,097 844,289
hwmcc15deep-6s341r-k16 91.7 2.3 2.4 296,616 878,409
hwmcc15deep-6s340rb63-k22 91.0 2.4 1.6 276,394 817,662
hwmcc15deep-6s44-k40 87.7 3.0 4.9 259,705 757,473
hwmcc15deep-6s44-k38 88.6 2.9 4.3 242,811 707,931
hwmcc15deep-bob12s02-k17 85.8 4.3 4.3 216,247 639,882
hwmcc15deep-6s188-k46 88.2 3.0 4.4 213,938 631,035
hwmcc15deep-bob12s02-k16 86.6 3.9 4.2 203,492 602,144
hwmcc15deep-6s188-k44 88.2 3.1 4.6 202,374 596,873
hwmcc15deep-bobpcihm-k33 84.7 2.3 2.2 213,845 593,197
hwmcc15deep-6s366r-k72 88.9 4.2 3.6 198,696 583,514
hwmcc15deep-bobpcihm-k32 84.8 2.5 2.2 204,319 566,666
hwmcc15deep-bobpcihm-k31 85.3 2.6 2.3 194,793 540,135
hwmcc15deep-beembkry8b1-k45 76.9 4.5 14.9 249,516 528,366
hwmcc15deep-bobpcihm-k30 85.2 2.5 2.2 185,267 513,604
hwmcc15deep-intel065-k11 81.7 5.3 7.9 156,426 376,987
hwmcc15deep-6s340rb63-k16 92.1 3.1 1.0 123,061 363,618
hwmcc15deep-beemloyd3b1-k31 85.0 4.2 8.5 122,450 327,830
slp-synthesis-aes-top30 89.8 2.8 4.2 101,451 304,209
mizh-md5-48-5 77.4 8.3 7.9 66,892 240,181
mizh-md5-48-2 79.7 7.0 7.7 66,892 239,781
mizh-md5-47-3 78.2 8.0 9.3 65,604 234,719
mizh-md5-47-5 78.3 8.7 9.9 65,604 235,061
slp-synthesis-aes-top29 89.9 2.8 4.0 94,998 284,752
slp-synthesis-aes-top28 90.8 2.7 3.6 88,763 265,956
hwmcc15deep-beemcmbrdg7f2-k32 80.9 5.3 10.9 102,936 283,796
slp-synthesis-aes-top26 91.3 2.6 3.3 76,943 230,335
slp-synthesis-aes-top25 91.4 2.6 3.3 71,356 213,504
hwmcc15deep-6s105-k35 90.3 2.7 3.4 75,903 224,704
hwmcc15deep-6s161-k18 88.0 4.2 4.5 48,224 140,164
slp-synthesis-aes-top24 91.4 2.6 3.4 65,983 197,322
hwmcc15deep-intel066-k10 81.2 6.5 7.5 78,582 185,591
hwmcc15deep-6s516r-k18 72.1 2.9 2.5 51,996 147,555
hwmcc15deep-6s516r-k17 69.5 2.8 2.6 47,761 135,421
hwmcc15deep-6s161-k17 88.1 4.3 4.5 44,195 128,348
gss-40-s100 75.3 16.4 5.7 32,814 98,623
gss-38-s100 74.5 17.1 5.7 32,759 98,454
gss-36-s100 74.9 17.1 5.2 32,642 98,095
gss-34-s100 75.0 17.1 5.1 32,465 97,556
gss-32-s100 74.5 17.9 4.8 32,312 97,093
gss-30-s100 74.2 18.1 4.8 32,309 97,083
gss-28-s100 75.6 17.0 4.6 32,151 96,586
hwmcc15deep-6s33-k34 84.4 5.6 7.2 24,952 72,571
hwmcc15deep-6s33-k33 84.9 5.6 6.9 24,125 70,153
Sz512_15128_1.smt2-cvc4 87.2 5.6 3.7 10,478 33,986
hwmcc15deep-6s179-k17 78.6 7.8 10.6 11,456 32,806
hwmcc15deep-6s399b03-k02 83.9 7.8 5.0 7,191 21,427
hwmcc15deep-6s399b02-k02 83.9 7.7 5.0 6,950 20,704
modgen-n200-m90860q08c40-16823 88.2 6.0 2.9 2,200 9,086
modgen-n200-m90860q08c40-6967 88.7 5.9 2.6 2,200 9,086
modgen-n200-m90860q08c40-29667 88.7 5.9 2.6 2,200 9,086
modgen-n200-m90860q08c40-13698 89.5 5.5 2.4 2,200 9,086
modgen-n200-m90860q08c40-3230 87.3 6.7 3.0 2,200 9,086
2.1.2 Characteristics of BCP and opportunities for acceleration
Based on the simple version of BCP described in Section 1.2, given an assignment (x_i, v), we
identified the following operations: (a) update the value of variable x_i to v; (b) visit every clause
that includes x_i and check whether the clause is satisfied, has a conflict, or has a unit propagation.
In turn, for each clause C_k visited in step (b), the check requires that we (c) access the identifier
(ID) of every variable in C_k, then access the value of each variable, and perform logic operations on
these values to determine the status of C_k (satisfied, conflict, unit propagation, or none of the above).
Then, the number of operations required for the BCP steps listed above is

O(αγ),    (2.1)

where α is the average number of clauses in which a variable (e.g., x_i) appears and γ is the average
number of variables in a clause. These operations are lookups in large tables stored in memory
to find the clauses (clause-list table), the variables in each clause (global clause memory), and the
values of the variables (global variable value memory), plus logic operations to determine the final
outcome (satisfied, conflict, unit propagation).
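In software, these steps amount to the nested loop sketched below, which makes the O(αγ) structure explicit (reusing the Clause/Val/Status helpers from the DPLL sketch in Section 1.2; occursIn, the clause-list table, is a name we introduce for illustration).

```cpp
// Naive BCP for assignment (x_i = v), mirroring steps (a)-(c) above:
// ~alpha clause visits, each performing ~gamma variable-value lookups.
void naiveBcp(int i, Val v, vector<Val>& val,
              const vector<Clause>& clauses,
              const vector<vector<int>>& occursIn,   // clause-list table
              vector<Status>& status) {
    val[i] = v;                                      // (a) update x_i
    for (int cid : occursIn[i]) {                    // (b) every clause with x_i
        Lit unit{};
        status[cid] = clauseStatus(clauses[cid], val, unit);  // (c) O(gamma)
    }
}
```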
As is clear from the above, software BCP has a long serial execution [19] on von Neumann
machines. While several researchers [2] have parallelized BCP at coarse grain, in all these
approaches each BCP instance is essentially executed serially in the manner summarized above.
The key challenge to parallelizing BCP beyond this in multi-core von Neumann systems is
that each time BCP is called, it must visit the relevant parts of all the large data structures in SAT.
Specifically, each BCP call necessitates visits to all the memories shown in Fig. 2.3. To identify all
clauses where the assigned variable is used, it accesses global clause memory to retrieve one clause
at a time. To evaluate the value of each clause, it needs the value of each variable in the clause. For
this, it accesses global literal value memory, where a literal is a variable with its polarity (e.g., x_i or
¬x_i). This makes BCP a critically memory-bounded process [19]. When the size of the SAT instance
is large, the entire clause information cannot be held in small on-chip memory (SRAM caches),
Table 2.2: Cache miss rates for some SAT benchmark instances of different sizes; Level 1 Data
(L1-D), Level 2 Data (L2-D), and Level 2 (L2) miss rates are shown. Cache miss rates are fairly
independent of the size of SAT instances.
Benchmark instance | No. of variables | No. of clauses | L1-D miss rate (%) | L2-D miss rate (%) | L2 miss rate (%)
mp1-9_29 | 1,458 | 19,741 | 4.4 | 0.0 | 0.0
mp1-blockpuzzle_9x9_s1_frcc7 | 7,209 | 1,219,027 | 1.8 | 0.2 | 0.0
g2-hwmcc15deep-6s340rb63-k16 | 123,061 | 363,618 | 2.5 | 0.6 | 0.2
g2-UTI-20-5p1 | 225,926 | 1,195,016 | 3.4 | 0.9 | 0.2
g2-ak128paralparalisc | 265,303 | 859,947 | 2.0 | 0.9 | 0.2
g2-hwmcc15deep-beemlifts3b1-k29 | 1,238,026 | 3,431,266 | 2.8 | 0.9 | 0.2
and hence requires the use of the full memory hierarchy from cache to main memory (DRAM). (High-end
Intel Comet Lake S i9 processors have up to 20MB of L3 cache [23].) Thus, extremely high
memory latencies caused by cache misses become a performance bottleneck for software SAT
solvers.
Impact of cache misses: Cache miss rates for SAT benchmark instances of different sizes were
obtained by running Valgrind cache simulations [24] for the first 10 billion instructions of each
instance and are shown in Table 2.2. Due to the large number of table lookups, cache misses
occur across all sizes of SAT instances and result in significant
performance overheads on von Neumann machines. The cache miss rates show that there is no
strong relationship between cache miss rate and SAT instance size.
Even though the miss rates seem low, they impact performance significantly. By simulating
BCP operations using Valgrind [24], the L1 and L2 cache miss rates were identified as 8% and 0.3%,
respectively. In the real world, Intel Skylake i7 processors [25] have L1 data, L2, and L3 cache
latencies of 5, 12, and 42 cycles, respectively; additionally, DRAM latency is 42 cycles plus 51ns.
With cache misses and their latencies, the overall latency of a BCP operation is doubled.
Analyzing BCP to identify parallelism: To accelerate BCP in new ways, we started by
conducting a detailed analysis of the BCP algorithm and data structures used in MiniSAT2 [1]. We
first carried out a qualitative analysis to derive a symbolic expression for BCP runtime complexity.
We then combined this with profiling by incorporating the actual runtime information into our
symbolic expression (by compiling the values of its coefficients) and used this to identify the key
performance bottlenecks and completely new hardware designs to accelerate BCP.
The BCP implementation in MiniSAT2 [1] uses advanced approaches beyond the simple
summary above. Specifically, to resolve the memory bottleneck, the two-watched-literal (2WL) [4] and
blocking literal [5] schemes were developed; they reduce the number of clauses to be observed by
monitoring the activities of only two literals in each clause, which substantially improved the
performance of software SAT solvers. To implement the 2WL scheme, a Watched List Table (WLT)
and Watched Lists (WLs) are required, as shown in Fig. 2.3. The WLT is a table containing, for each
literal, a pointer to its watched list and the list size. The average size of a watched list is denoted by
α′, which is significantly smaller than α (the average number of clauses in which a variable appears)
in Eq. (2.1). Each WL, which is retrieved by directly indexing the WLT with a literal, is a two-column
table of clause references (crefs) and blockers. A cref is the index of the associated clause in global
clause memory. A blocker is a literal copied from the clause, used to reduce the probability
of accessing clause memory. This takes advantage of the observation that we do not need to visit
a clause if one of its literals is known to be assigned true, and thus the clause is already satisfied.
This is why another level of indirection, the WL, is inserted: to minimize the frequency of access to
global clause memory. Even with these two very effective schemes, modern software SAT
solvers like MiniSAT2 [1] still spend 80-90% of total CPU time on BCP.
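In code, the 2WL bookkeeping has roughly the following shape (field names are our assumptions; MiniSAT2's actual types differ).

```cpp
#include <cstdint>
#include <vector>

struct Watcher {
    uint32_t cref;      // index of the clause in global clause memory
    int32_t  blocker;   // cached literal; if currently true, skip the clause
};

// WLT: one watched list (WL) per literal, alpha' entries on average.
// The blocker check touches only this table; global clause memory is
// visited only when the blocker is not true.
using WatchedListTable = std::vector<std::vector<Watcher>>;
```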
Our goal: We aim to fully parallelize BCP at very fine grain; hence we design our custom
hardware as an ASIC and have complete freedom to maximize parallelism. To identify the
opportunities for hardware acceleration, we analyze the BCP algorithm's complexity and memory
accesses qualitatively.
As shown in Fig. 2.2, the BCP algorithm starts when a literal, x_i, is assigned true (x_i = 1).
MiniSAT2 [1] uses the 2WL scheme, where the clauses containing the watched literal just assigned
false need to be updated. Let WL(¬x_i) be the list of pointers to clauses where ¬x_i is watched.
Thus, WL(¬x_i) is retrieved from the WLT as shown in Fig. 2.3. Since the size of the WL is α′, α′
iterations are required to check whether each blocker of the associated clauses is true or not. For
Figure 2.2: BCP algorithm.
each time, to get the value of the literal, global literal value memory (shown in Fig. 2.3) must be
accessed. Since checking the blocker (line 6 in Fig. 2.2) is performed for all the related clauses being
watched, this becomes the dominant part of the performance bottleneck.
If the blocker is not true, then the BCP algorithm must visit global clause memory (lines 9-10 in
Alg. 1 in Fig. 2.2). We denote the probability that this branch is reached by ε_1. After getting the
header of the clause by directly indexing global clause memory with the cref, a preliminary job
for maintaining the 2WL structure (making ¬x_i the second literal) is executed (lines 11-13 in Alg. 1 in
Fig. 2.2). Then, the BCP algorithm checks whether the first watched literal is already true (β_1 cycles;
lines 9-14 in Alg. 1 in Fig. 2.2). If it is true, the BCP algorithm does not need to examine this clause
any further; the blocker is set to this first literal and the algorithm skips to the next clause. If it
Figure 2.3: BCP data structure.
is not true (this branch is reached with probability ε_2), the non-watched literals must be examined
serially to find a new watched literal (the clause has γ literals). If there is a literal not assigned false,
the 2WL structure is updated with this literal and the algorithm skips to the next clause (β_2 cycles;
lines 18-23 in Alg. 1 in Fig. 2.2). Since there is an early termination of this loop (line 18 in Alg.
1 in Fig. 2.2), the actual number of visits in this loop, γ′, is much smaller than γ (γ′/γ ≃ 0.1). If a
non-false literal is not found (this branch is reached with probability ε_3), this clause has either a UP
or a conflict, which is determined by checking whether the first literal is false (β_3 cycles; line 24
in Alg. 1 in Fig. 2.2). If it is false, this clause cannot be satisfied and the BCP algorithm returns the cref
to the SAT algorithm to deal with the conflict. If it is non-false (unassigned), it causes the UP and
thus this literal must be assigned true.
Overall, the number of cycles taken on a von Neumann machine by the BCP algorithm, h, is
approximately:

h = α′(β_0 + β_1·ε_1 + β_2·γ′·ε_2 + β_3·ε_3).    (2.2)
Eq. (2.2) gives a qualitative view of how BCP is organized in terms of algebraic coefficients:
two nested loops (α′ and γ′), cycles for each task (β_i; i = 0,1,2,3), and their probabilities (ε_i; i =
0,1,2,3).
To understand the BCP algorithm quantitatively, in other words, how large these coefficients are,
we profiled benchmark instances and found that α′, β_0, β_1, β_2, β_3, and γ′ are 10, 25, 30, 30, 21,
and 1.8, respectively. The probabilities ε_1, ε_2, and ε_3 are 39%, 35%, and 11%, respectively. We
noticed that even a simple task, e.g., checking the value of a blocker, requires 25 cycles due to
data-movement overheads across all the memories (shown in Fig. 2.3) and the fetch-decode-execute
cycle of von Neumann machines. On top of that, the BCP algorithm executes this task α′ (= 10)
times. As a result, h is around 500+ cycles under the assumption of no cache
misses. With cache misses and the subsequent memory accesses, h grows to over 1000 cycles.
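Plugging the profiled coefficients into Eq. (2.2) reproduces this count: h = 10 × (25 + 30×0.39 + 30×1.8×0.35 + 21×0.11) = 10 × 57.9 ≈ 580 cycles.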
Thus far, we have identified opportunities to parallelize the BCP algorithm. In the
next section, we show how we design our custom hardware to parallelize BCP and eliminate the data
movement and all the von Neumann overheads captured in Eq. (2.2).
2.1.3 Observations
For software SAT solvers, the two-watched-literal scheme (2WL) [4] is extremely successful, as it
provides a performance improvement of one to two orders of magnitude by significantly reducing clause
memory accesses. Blocking literals [5] provide considerable additional performance gains
by adding another layer of data structure that further reduces clause memory accesses. Even with
these successful methods, modern software SAT solvers like MiniSAT2 [1] spend 80-90% of total
runtime on BCP operations, and each BCP takes on the order of 1000 clock cycles on general-purpose
processors due to table lookups and data movement.
We propose a custom hardware design for BCP (HW-BCP): a co-processor that replaces software
BCP and accelerates MiniSAT2 [1]. Since BCP performs only Boolean operations once the
algorithm selects a variable and the value to assign, we design a fully parallel architecture using
content-addressable memory (CAM) and dedicated near-memory logic blocks to eliminate table
lookups and data movement.
2.2 Design innovation for Boolean Constraint Propagation
2.2.1 Low-cost content-addressable memory architecture
Content-addressable memory (CAM) is a powerful memory architecture, as it provides extremely
low delay for word searches by using hardware to parallelize the search. The required time for a word
search is less than 1 ps [26]. However, it is widely believed that CAM is an expensive structure in
terms of area and delay. This is mainly due to its priority encoder. Even for a relatively small CAM
size, a conventional priority encoder takes 35% of the total area of the CAM array [27] and more than 60%
of the overall CAM delay [28]. These costs of the priority encoder increase more than linearly with CAM
size. Therefore, the use of CAM has been limited to a few applications where large CAM capacity
is not required, such as text database retrieval [29], address searching in local area networks [30],
and set-associative caches with very small set sizes.
We propose a novel memory architecture for BCP that uses CAM after removing its high-area,
high-delay part, namely the priority encoder. Instead of the priority encoder, we use
significantly simplified logic circuitry for combining signals (described in Section 3.2.4). Also,
in each row of the CAM array, the CAM matchline is horizontally connected to the SRAM wordline; thus
we remove the decoder for the SRAM as well. This part of our design is similar to how CAM
and RAM cells are used in set-associative caches. However, we go further by adding a dedicated
copy of the clause-evaluation logic block right next to the SRAM cells of each clause and directly
using the data stored in the SRAM cells to evaluate each clause. Thus, we remove part of the circuitry
for the SRAM read operation as well. In this manner, without the priority encoder, we can take full
advantage of the CAM's capability for parallel search, eliminate most of the subsequent data
movement, and parallelize the subsequent logic processing.
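Behaviorally, the difference is that a search returns the full vector of matchlines rather than a single encoded address; the following software model (not RTL) illustrates the idea.

```cpp
#include <cstdint>
#include <vector>

// Software model of a priority-encoder-free CAM search: every row's
// matchline is kept and directly drives that row's downstream logic.
// In hardware, all row comparisons happen in parallel.
std::vector<bool> camSearch(const std::vector<uint32_t>& rows, uint32_t key) {
    std::vector<bool> matchlines(rows.size());
    for (size_t r = 0; r < rows.size(); ++r)
        matchlines[r] = (rows[r] == key);   // no encoding of a "first" match
    return matchlines;
}
```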
Figure 2.4: Floorplan with H-Tree data bus and SAT Submodules.
2.2.2 Novel memory architecture using CAMs and SRAMs
CAM is a key memory structure which helps maximize parallelism by broadcasting a search
key to the entire memory array when searching for the clauses associated with a BCP. The potential
increase in fanout delay is minimized by employing our H-tree structure for the wires. Importantly, such
a design, with optimal buffer insertion and sizing, allows us to limit the delay to O(log N) as the number
of nodes to which we broadcast the search key increases by O(N).
As shown in Fig. 2.3, a software BCP implementation involves lookups in tables WLT and WL,
in the global clause memory, and the global literal value memory, to identify all clauses watching
the desired literal, the variables used in these clauses, and the values of these variables, respectively.
Let n_v,max and n_c,max be the maximum numbers of variables and clauses supported by HW-BCP,
respectively. Because any CNF formula can be converted to a 3-CNF formula, where every clause
18
has exactly three literals, we can focus on 3-CNF formulas in our design. The structure of HW-
BCP is shown in Fig. 2.4. The CAM and SRAM arrays hold the clauses as well as the values of the
variables as follows. Each row of the CAM is designed to store a variable-ID (VID), i.e., the value
i for variable x
i
, and hence is log
2
(n
v,max
)-bit wide. Each row of the SRAM uses 1 bit to store the
polarity of the literal in a clause, i.e., 0 or 1, respectively, if the clause contains the literal x
i
or¬x
i
.
It also uses 2 bits to store the current value of the variable x
i
, namely, 0, 1, or U. If variable x
i
appears in multiple rows, the corresponding value is stored (replicated) in each of these rows.
When MiniSAT2 reports conflict or unit propagation, it also returns the clause ID (CID).
Explicitly storing the CID on the chip significantly increases the size of the memory required. One key
innovation of our method is that we store the CID implicitly, as follows. First, once we determine the
order in which the clauses are to be stored in SAT Submodules, we assign a new implicit CID to
each clause based on its location. Hence, if the original clause C_i is placed in the j-th location
within SAT Submodule k, then C_i is assigned an implicit CID obtained by concatenating the
binary values for j and k.
This enables us to design combining logic cells that generate the implicit CID. Specifically,
within each SAT Submodule, the combining logic cells generate the local CID (i.e., the binary
string that corresponds to j in the above example) using their locations. Simply, the first-level
combining logic cells that are in the top half of a SAT Submodule output a 1-bit value 0, while
those in the bottom half output the value 1. The combining logic cells at each subsequent level,
logic within each SAT Submodule produces the local CID without storing CID. By continuing
this at every level of the combining logic outside the SAT Submodules, we similarly create the
remaining bits of the implicit CID. Thus, we significantly reduce the storage area at negligible
increase in the complexity of the combining logic cells. Further, this also reduces the widths of
outgoing interconnects, since one bit is added to the implicit CID at every level and, importantly,
very low bit widths are needed at the early levels where we have exponentially large numbers of
interconnects.
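For illustration, when the number of clause locations per submodule is a power of two, the implicit CID is just the concatenation of the submodule index k and the local location j. The exact bit order below is an assumption; in hardware, the bits are derived from cell positions level by level and nothing is stored.

```cpp
#include <cstdint>

// Implicit clause ID: submodule index k concatenated with local location j.
// localBits = log2(number of clause locations per SAT Submodule).
uint32_t implicitCid(uint32_t k, uint32_t j, unsigned localBits) {
    return (k << localBits) | j;
}
// e.g., implicitCid(2, 5, 4) == 0b100101 == 37
```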
Figure 2.5: HW-BCP CAM/SRAM array with formula information. (This figure assumes that
n_v,max and n_c,max are 8 and 16, respectively.)
Since HW-BCP is designed for 3-CNF formulas, each clause is placed in three consecutive
rows. For example, for the formula φ_1 := (x_0 ∨ x_1 ∨ ¬x_6)(x_1 ∨ ¬x_2 ∨ x_6)···(¬x_5 ∨ ¬x_6 ∨ x_7), the
IDs of the three variables in C_0 are in rows 0, 1, and 2; the IDs of the variables in C_1 are in rows 3,
4, and 5; and so on. Rows 0 to 5 of the VID-CAM will respectively store variable IDs 0, 1, 6, 1, 2, 6.
Also, the polarity bits in rows 0 to 5 will respectively store 0, 0, 1, 0, 1, 0. Fig. 2.5 shows how
the formula information is represented in the CAM/SRAM array after this problem is loaded. (In
this example figure we assume that n_v,max and n_c,max are 8 and 16, respectively.)
CAM operating modes: The CAM array operates in three modes: write, read, and search. The
write and read modes use the CAM's row decoder and bit lines. In the search mode, the decoder is
disabled, the values to be searched are applied to the search lines (SL and its complement), and the
match line (ML) goes high for every row that contains the value being searched.
The read mode of the CAM is not required for the BCP operation. However, the read mode of the
CAM is used right after clause evaluation in the SAT Submodules so that each BCP module can store
the implication-graph information used by the CDCL operation, which we also accelerate (see
Chapter 4). Thus, in our integrated design, which accelerates BCP as well as CDCL
operations, the CAM array operates first in the search mode and then in the read mode, as shown in
Table 2.3.
20
Table 2.3: HW-BCP operations and CAM/SRAM modes

Operation | Sub-operation | CAM mode | SRAM mode
LOAD | N/A | write | write
UNDO | N/A | search | CAM-activated write
BCP | Phase 1 | search | CAM-activated write
BCP | Phase 2 | read | read
SRAM operating modes: The SRAM array operates in three modes, as shown in Table 2.3. In
the two standard modes, read and write, the SRAM uses its row decoder as well as its bit lines to
write or read values in the SRAM cells. In the new CAM-activated write mode, the decoder is
disabled and the match line of the corresponding CAM row is used to control the SRAM's word
line, as shown in the box at the bottom of Fig. 2.4. The bit lines are used to write the values input by
the user into the CAM-activated rows of SRAM cells. The CAM-activated write mode is used not
only in BCP but also when we undo variable assignments after backtracking to one of the previous
decision levels as a result of conflict analysis; in that case, U is written into the two bits of SRAM
that hold the value. In BCP, the SRAM write mode is used only when a new SAT problem is loaded
into HW-BCP, to load the polarity bits and to initialize the value bits to U. Subsequently, when we solve
this SAT problem, we mostly use the SRAM in CAM-activated write mode to update (or UNDO)
the values of the variables, as shown in Table 2.3 (see Phase 1). During this process, to facilitate
operations such as unit propagation, in some steps the CAM and SRAM are used in read mode to
output the variable ID and polarity, as shown in Table 2.3 (see Phase 2).
Loading a new SAT problem (LOAD): Using the write mode of the CAM, the variable IDs for
clauses C_0, C_1, ... are written into the corresponding CAM rows, one variable ID per row. Using
the write mode of the SRAM, the values of the corresponding literal polarities are loaded into the
corresponding cells in every row. At the same time, the SRAM cells that hold the variable values
are initialized to the unspecified value U. The variable ID values in the CAM and the polarity
values in the SRAM do not change as long as HW-BCP continues to work on the same formula.
Operation during each BCP step: As shown in Fig. 2.6, during each BCP operation, the CAM
is configured in the search mode, the columns of the SRAM that hold variable values are configured
Figure 2.6: BCP mode of SAT Submodule.
in CAM-activated write mode, i.e., the match-line of the CAM in each row serves as the word line
of these SRAM cells. All other columns of the SRAM are in the hold mode, i.e., they continue to
hold their previous values. To perform BCP with assignment (x
i
,v), the variable ID, i, is applied to
the search lines of the CAM cells in every row (see V ID and V ID at the bottom of Fig. 2.4) and the
value v is applied to VAL and VAL. As a result, the match line becomes high for every row where
the CAM stores i, i.e., the ID of variable x
i
. Consequently, the value v is written into the SRAM
cells that store variable values, in every row where the CAM contains i, i.e., the index of variable
x
i
. (This corresponds to Phase 1 in Table 2.3; more in Chapter 4.)
Next, the processing logic for each clause directly accesses the values in the corresponding SRAM cells and carries out the operations required to generate the conflict (CONF), unit propagation (UP), or satisfied (DONE) output signals. Subsequently, the combining logic integrates these signals into a single set. In cases where either UP or CONF occurs, it also generates the local CID (details in Section 3.2). Finally, in the case where UP occurs, it also outputs the LID.
Figure 2.7: BCP operation with an example.
For BCP, the CID of the clause that caused the CONF needs to be sent to software to carry out conflict analysis. In contrast, for UP we prefer to report the literal ID (LID), as this directly provides the information required to make the subsequent necessary assignments. Hence, our BCP does not require us to report the CID for a UP. However, since we accelerate CDCL along with BCP, we output the CID information for UP as well as for CONF. (This corresponds to Phase 2 in Table 2.3; more in Chapter 4.)
BCP operation with an example: Fig. 2.7 explains the BCP operation again with the example. In Fig. 2.7, the BCP operation starts when the Boolean constraint x_6 = 0 is provided by the CPU. The VID 6 (= 3'b110) is driven onto the CAM search lines to find every occurrence of x_6 in the CAM array. The CAM match lines are then activated (red dotted lines in Fig. 2.7) in every row where 6 is stored. Next, to assign the value to the activated rows, the value 0 is provided to the SRAM array via the SRAM bit lines. Since the corresponding SRAM word lines are activated by the active CAM match lines, the values in these rows are updated. After that, each clause is evaluated by the copy of the processing logic provided for that clause. The processing logic generates three signals indicating whether the clause has a CONF, a UP, or a DONE.
Undoing variable assignments (UNDO): During each UNDO operation, similar to the BCP
operation, the CAM is configured in the search mode and the SRAM is configured in the CAM-
activated write mode. Undoing the variable assignment is completed by writing U into the value
bits.
In summary, for BCP, our unique design with CAM and SRAM cells completely parallelizes the excitation of the word line of every row where variable x_i appears, the storage of the value v in the variable-value SRAM cells in that row, and the evaluation of the status of the clause (conflict, unit propagation, or satisfied). Further, it avoids the overheads due to pointer chasing across multiple tables and pointer arithmetic, the bottlenecks associated with moving data from memories to the ALU, and all von Neumann overheads.
Compared to BCP operations and other major operations, such as analyze and litRedundant,
the delays caused by LOAD or UNDO operations are negligible. In our experimental evaluations
via benchmarking, we ignore the LOAD/UNDO operation delays.
2.2.3 Near-memory computing
For each clause in the memory block, we add a dedicated logic block to compute "satisfied", "conflict", or "unit propagation". A small logic circuit can compute these values for each clause, given the values of the variables used in the clause as well as their polarities. For example, when performing BCP for (x_3, 0), we need to perform this computation only for every clause where x_3 appears; in the example formula φ_1, this must be performed for clause C_1. Given the current values of x_2, x_3, and x_5 and the polarities of the corresponding variables, 1, 0, 0, satisfied and conflict can be obtained by computing the value of the clause: if the value is 1, satisfied = 1; if the value is 0, conflict = 1. Further, if the value of the clause is still unspecified, then we can determine unit propagation using a circuit that checks whether exactly one out of these three variables has the unspecified value U. In our design, all the values required for this computation are available in the SRAM cells in the three consecutive rows that store the clause C_1.
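The per-clause computation can be modeled in software as follows. This is a sketch, not the hardware netlist: it takes the three polarity-adjusted literal values as inputs, and names such as evalClause are illustrative.

```cpp
#include <array>
#include <cstdint>

enum class Val : uint8_t { ZERO, ONE, U };  // U = unspecified/unassigned

struct ClauseStatus { bool done, conf, up; int upIndex; };  // upIndex: which literal propagates

// Models the per-clause processing logic: three polarity-adjusted literal
// values in; DONE, CONF, or UP out (all false if the clause is still undetermined).
ClauseStatus evalClause(const std::array<Val, 3>& lit) {
    int ones = 0, unknowns = 0, upIndex = -1;
    for (int i = 0; i < 3; ++i) {
        if (lit[i] == Val::ONE) ++ones;
        else if (lit[i] == Val::U) { ++unknowns; upIndex = i; }
    }
    ClauseStatus s{false, false, false, -1};
    if (ones > 0)           s.done = true;                         // some literal is true
    else if (unknowns == 0) s.conf = true;                         // all literals false
    else if (unknowns == 1) { s.up = true; s.upIndex = upIndex; }  // unit propagation
    return s;
}
```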
As each set of three consecutive rows of our design stores one clause (in our 3-CNF), we need one logic block for every set of three rows, with the following three key features. First, the logic block for clause C_j directly receives its inputs from the SRAM cells in the three consecutive rows that store C_j. This direct access dramatically reduces data-movement latency, as it avoids three memory read cycles, where each cycle would be long due to the need to precharge and discharge long bit lines.

Second, the match lines (ML) of the VID-CAM cells for clause C_j are combined via an OR gate to enable the part of the logic block that computes unit propagation for the clause. This helps ensure that the same unit propagation is not duplicated and hence saves bandwidth on the interconnect that gathers unit-propagation outcomes from the SAT Submodules and communicates them to the outputs of HW-BCP.

Third, the height of the block layout is matched with the height of the set of three rows to minimize area.
2.2.4 Elimination of von Neumann overheads
Our goal is to design a unique chip architecture that increases parallelism and significantly improves the performance of BCP by optimizing the memory architecture, logic processing, and wire design, and by eliminating von Neumann overheads (shown in Eq. (2.2)).

The operations required for BCP are quite simple: checking values, chasing pointers, and logic operations. This allows us to design simple custom logic for processing and to eliminate most of the high-area modules in a CPU, such as the arithmetic logic unit (ALU), the floating-point unit (FPU), the branch predictor (BP), etc.

Further, this simplicity allows us to incorporate a large number of processing elements and place them next to the data. This helps with parallelization as well as with the elimination of von Neumann overheads, especially data movement.
Our study above showed that BCP needs to access all the large data structures used by MiniSAT2 [1] (see Fig. 2.3). To avoid the high delays associated with off-chip DRAM accesses, our goal is to fit all the memories required for large SAT instances on-chip. The elimination of high-area modules described above already frees up area.

To parallelize the evaluation of each clause, i.e., to identify whether the clause is satisfied, has a conflict, or has a unit propagation, we incorporate in each row logic elements that evaluate the SAT clause with the updated literal values. This also eliminates much of the data-movement overhead. In this manner, we also eliminate the need for the blocking literal [5], the indirection memory, and the two-watched-literals (2WL) [4] structure. More importantly, we eliminate the complex sequence of operations required to update the two watched literals, which is expensive to implement in custom hardware.
In this manner, via parallelization, we eliminate all the algebraic coefficients shown in Eq. (2.2). Consequently, the proposed HW-BCP takes only one cycle, albeit with an increased clock period; in comparison, 1000+ cycles are required on general-purpose processors. In our HW-BCP, the fanout of the variable ID (VID) increases by order α, since the VID is broadcast to all the CAM rows. However, careful wire and buffer design in the H-tree structure (shown in Fig. 2.4) to optimize the delay of the VID broadcast reduces this delay to order log(α), which primarily determines the clock period for HW-BCP.

Hence, overall, our HW-BCP design reduces the BCP delay by a factor greater than

    O(α / log(α)).   (2.3)
2.2.5 Low-power architecture
The key idea for dramatically reducing the power consumption of the entire chip is to enable only the essential modules. First, the SAT Submodules are partitioned into a small number of groups, say 2^P, where P is 2-6. For each group, one indicating bit is required to control whether the group is disabled or not. When a VID is provided to start a BCP operation, the information about the groups in which the variable appears is fetched from a lookup table, namely the Group Status Table, using the VID as the address.

Figure 2.8: Group example: 4 groups of SAT Submodules and Group Status Table.

Table 2.4: Power reduction and area/delay overheads using the proposed low-power architecture on HW-BCP.

Groups   Power reduction (Ideal power reduction)   Area overhead (%)   Delay overhead (%)
16       1/2 (1/9)                                 1.2                 1.8
32       1/4 (1/13)                                2.5                 2.6
64       1/8 (1/16)                                5.0                 3.7
128      1/12 (1/18)                               9.9                 5.2
For example, for 1M clauses, we need 2^13 SAT Submodules, as each Submodule holds 128 clauses. As shown in Fig. 2.8, the 2^13 SAT Submodules are partitioned into 2^2 groups (P = 2), which means that each group has 2^11 SAT Submodules. For each group, a group status bit, a single bit indicating its status, is needed to enable or disable all the submodules in the group. In this example, when a VID equal to 3 is given, the group information is fetched from the Group Status Table. The entry 0011 denotes that the first and second groups are disabled while the third and fourth groups are enabled. Thus, for each variable, group information is needed. To implement this scheme, a 1-bit signal is connected to each group to turn the group on or off. In this example, a 4-bit bus is required for this technique and is added to the H-tree data bus.
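A minimal software model of this lookup, assuming P = 2 so that each entry is a 4-bit mask (names such as GroupStatusTable are illustrative):

```cpp
#include <cstdint>
#include <vector>

// Models the Group Status Table: one 2^P-bit enable mask per variable,
// indexed by VID. With P = 2 there are four groups per mask.
struct GroupStatusTable {
    std::vector<uint8_t> mask;  // mask[vid]: bit g = 1 enables group g

    uint8_t lookup(uint32_t vid) const { return mask[vid]; }

    bool groupEnabled(uint32_t vid, unsigned g) const {
        return (mask[vid] >> g) & 1u;
    }
};

// Example from Fig. 2.8: for VID 3 the stored entry 0011 (shown MSB-first in
// the figure) enables the third and fourth groups and disables the first two.
```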
CAM cells operate in two states in our design: dynamic and static. When a CAM cell is dynamic, during the search mode, the transistors in the CAM cell operate to evaluate the precharged match line, which can result in a match or a mismatch. From our transistor-level simulations, the CAM cell consumes 695nW for a match operation and 607nW for a mismatch operation. When the CAM cell just holds the stored value, without any charging or discharging activity, it is in the static state, in which it consumes 31nW according to our simulations.

By turning off a group's status bit, the controller in each SAT Submodule in that group does not generate the control signals that drive charging activity in its BCP memory array; thus all the memory cells in the group can be in the static mode. Also, the flip-flops that form part of the search-line drivers for the CAM and the bit-line drivers for the SRAM can be made inactive using a clock-gating technique controlled by the group status bit.
Next, an efficient algorithm for distributing clauses to the groups is required to minimize the number of enabled groups and thereby save power. Assuming an ideal algorithm that distributes clauses across the groups such that most variables enable only one group, the ideal power reduction is calculated and shown in Table 2.4. While leaving the design of a good clause-distribution algorithm as a proposed research task, we have shown that a simple round-robin heuristic, sketched below, can provide the needed power reductions. Basically, when loading the clauses of a new SAT problem during the preprocessing phase, each clause goes to the groups sequentially: the first clause is assigned to the first group, the second clause to the second group, and so on; when the assignment reaches the last group, it starts again from the first group.
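A minimal sketch of this round-robin heuristic (clause and group indices are illustrative):

```cpp
#include <vector>

// Round-robin clause distribution: clause i is assigned to group i mod numGroups.
// Returns, for each group, the list of clause IDs it receives.
std::vector<std::vector<int>> distributeRoundRobin(int numClauses, int numGroups) {
    std::vector<std::vector<int>> groups(numGroups);
    for (int cid = 0; cid < numClauses; ++cid)
        groups[cid % numGroups].push_back(cid);
    return groups;
}
```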
After the assignment, it is possible that within a group some SAT Submodules have the same variable assigned multiple times. In our preliminary evaluations, this small number of repetitions can be tolerated by supporting a few (fewer than 5) simultaneous SRAM write operations in different rows of an individual SAT Submodule. A systematic study of the trade-off between the reduction in power and the number of simultaneous write operations allowed within individual submodules is left for future research.
With this low-power architecture, which enables only the indispensable SAT Submodules and logic circuitry, the power reduction is calculated by analyzing SAT benchmark instances. After assigning clauses to groups, each variable has a list of the groups to which it belongs. The average number of groups that need to be enabled is estimated by averaging the number of groups over all variables. This estimated average number of enabled groups is then used to calculate the dynamic power consumption; we assume that the disabled groups consume only static power. Considerable power reduction is achieved, as shown in Table 2.4. We expect that the power reduction can be moved much closer to the ideal by developing more powerful heuristics for distributing clauses across the SAT Submodules.
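The estimation just described can be sketched as follows, assuming the round-robin assignment above (a software model; the per-cell dynamic and static power figures quoted earlier would then weight the enabled and disabled groups):

```cpp
#include <cstddef>
#include <set>
#include <vector>

// Estimates the average number of groups a BCP operation must enable:
// for each variable, count the distinct groups holding a clause that uses it,
// then average over all variables.
double avgEnabledGroups(const std::vector<std::vector<int>>& clauseVars,  // clause -> VIDs
                        int numVars, int numGroups) {
    std::vector<std::set<int>> groupsOfVar(numVars);
    for (std::size_t cid = 0; cid < clauseVars.size(); ++cid)
        for (int v : clauseVars[cid])
            groupsOfVar[v].insert(static_cast<int>(cid) % numGroups);  // round-robin group
    double total = 0.0;
    for (const auto& g : groupsOfVar) total += static_cast<double>(g.size());
    return total / numVars;
}
```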
The area overhead is shown for different numbers of groups in Table 2.4. The major area overhead comes from the size of the lookup table. Since each variable needs its group information, the number of table entries equals the number of variables. For each variable, the table entry stores 2^P bits of group information, where each bit indicates whether the corresponding group is enabled or not. These additional data bits also increase the bus area and hence the chip area. We ignore the negligible area overheads from the control logic and the bus wires for the Group Status Table.
Since the Group Status Table is composed of SRAM cells, an SRAM read operation is required to fetch the group status bits. This information is then delivered along with the VID via the H-tree. The major delay overhead of the low-power architecture is therefore the wire delay through the H-tree, which is calculated as 0.3ns; the delay of the SRAM read operation is less than 0.1ns. Thus the total delay overhead is 0.4ns.
In summary, our unique design with CAM and SRAM cells and logic completely parallelizes
BCP, enabling faster computing and lower data movement latencies. Also, our low-power archi-
tecture ensures that the average chip power remains at acceptable levels.
Chapter 3: HW-BCP Design and Optimization
3.1 Design overview
We now investigate the integration of the above innovations into a design that can indeed offer
significant acceleration at low cost and at acceptable power.
We first consider minimizing the chip area. For each clause, a significant portion of the area is occupied by the 3·log2(n_v,max) 1-bit CAM cells required by our architecture, where n_v,max is the number of variables in the largest SAT benchmark we wish to solve.
While CAM is widely believed to have high area and delay, CAM cells are only 2× larger than SRAM cells [31]. We propose a novel memory architecture with regard to the use of CAM by removing its high-area, high-delay part, namely the priority encoder. In place of the priority encoder, we use much simpler logic circuitry for combining signals (details in Section 3.2). Without the priority encoder, we can take full advantage of the CAM's capability for parallel search.
Finally, for each clause, we also require nine SRAM cells and a logic block with around 50
gates. Overall, our design is practical in terms of area. As discussed in Chapter 5, in TSMC 65nm,
a single-chip version of HW-BCP can solve SAT benchmark problems with one million clauses.
Importantly, this design completely avoids any need for off-chip access to large data structures
while solving a SAT problem. The SAT problem needs to be loaded into the HW-BCP from an
external source, but the associated delay is incurred only once per problem.
We now consider delay minimization, which is key to acceleration. Clearly, a single BCP module for one million clauses would incur extremely high delay, calling for optimization strategies that partition the module into submodules connected via an H-tree, as shown on the top-left of Fig. 2.4. In 65nm TSMC, an optimized design consists of SAT Submodules that each store 128 clauses and an optimally buffered H-tree (more next). Also, the output signals produced by the SAT Submodules drive a combining-logic tree which combines and communicates the outputs from the submodules, via the H-tree, to the output of HW-BCP. This processing, which is distributed along the levels of the H-tree, takes three specific forms: for satisfied outputs, the combining logic computes AND; for conflict outputs, it computes OR; and for unit-propagation outputs, it selects one of the output values. In all three cases, the final result is provided at the output of the combining-logic tree.

We now recall that an optimally buffered design starting with a minimum-size inverter can drive a load that is N× the input capacitance of a minimum-size inverter with a delay that scales logarithmically with N [32]. We can then show that the delay of our optimally buffered H-tree shows a similar trend, i.e., O(log(n_c,max)), where n_c,max is the number of clauses in the largest SAT problem targeted by HW-BCP. The delay of the combining-logic tree is also O(log(n_c,max)). Finally, the delay of each SAT Submodule is low, because the size of the sub-array is small. In fact, the length of the bit and search lines is only 128×3 SRAM cell heights, the length of the word and match lines amounts to the width of only log2(n_v,max) CAM cells + 3 SRAM cells, and the size of the logic blocks is also small. In summary, it is possible to create a design that completely parallelizes BCP at a fine grain and eliminates all the von Neumann overheads, including data-movement overheads.
3.2 SAT Submodule design
The SAT Submodule (shown in Fig. 2.4) is the key module, which performs the BCP operation locally as well as the CDCL operations (explained in Chapter 4). The SAT Submodule is equipped with circuitry to support both and performs one operation at a time. Since we have our own custom memory architecture for HW-BCP, we use a custom design flow for the memory cells/array and a general digital design flow for the processing/combining logic.
Figure 3.1: Our schematics and layouts of CAM and SRAM cells.
Our HW-BCP has a fixed structure that can load and solve 3-SAT instances. Thus, the SAT Submodule has 3N memory rows, where N is the number of clauses in the Submodule, and each clause occupies three rows of the CAM/SRAM array. Every problem instance that is not already in 3-SAT form is converted to 3-SAT prior to use of our SAT hardware. The study of possible performance differences between the original problem and the converted 3-SAT problem is left for future research.
3.2.1 Memory cells
For precise area and delay estimation, we design our own memory cells using the minimum metal/wire pitches, following the foundry design rule checks (DRCs). We use the TSMC 65nm GP PDK for the entire design. Custom schematics and layouts of the memory cells are designed using Cadence Virtuoso IC 6.17 and shown in Fig. 3.1. Both the CAM and SRAM cells are designed to have the same height; thus, the cells can be placed side by side and abutted horizontally and vertically.
3.2.2 Memory array
Our HW-BCP memory array operates in two modes: a BCP mode for HW-BCP and a CDCL mode for HW-CDCL (details in Chapter 4).
In the BCP mode, the BCP operation has two phases. In the first phase, the CAM array is in the search mode and the SRAM array is in the write mode. The CAM match lines are aligned with the word lines of the SRAM cells, which hold the values of the variables and the polarities. Specifically, the CAM match line in every row drives a tri-state driver which, when enabled, drives the SRAM word line to control a write operation of the SRAM cells in that row. To optimize the operation, the SRAM read circuitry is not used in the first phase of the BCP operation; instead, the literal values stored in the SRAM cells in each set of three consecutive rows corresponding to a 3-SAT clause directly drive the corresponding logic circuitry next to the SRAM cells (the processing logic in Fig. 2.4) to evaluate the clause. Then, in the second phase, both the CAM array and the SRAM array are in the read mode. Given the CID and the location ID that locates the VID within the CID, the address decoder selects the word line of the corresponding CAM and SRAM cells. The CAM read circuitry reads the VID stored in the CAM cells and the SRAM read circuitry reads the polarity stored in the SRAM cells. We output these two values, since the LID is the combination of the VID and the polarity.
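For reference, a plausible LID encoding consistent with this description follows; the exact bit packing in the hardware is not specified here, so treat this layout as an assumption.

```cpp
#include <cstdint>

// A literal ID packs the VID with a single polarity bit; placing the polarity
// in the LSB is our assumption, consistent with "LID = VID + polarity".
inline uint32_t makeLid(uint32_t vid, bool polarity) {
    return (vid << 1) | static_cast<uint32_t>(polarity);
}
inline uint32_t vidOf(uint32_t lid)      { return lid >> 1; }
inline bool     polarityOf(uint32_t lid) { return (lid & 1u) != 0; }
```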
For the CDCL mode, as in the second phase of the BCP operation, the CAM array is in the read
mode and the SRAM array is in the read mode. Given the CID, the CAM/SRAM array performs
three read operations consecutively to output the three LIDs in the clause (details in Chapter 4).
The entire memory part of the SAT Submodule is designed at the netlist level with a complete floorplan, and the precise wire information is computed using the cell dimensions, n_v,max, and the number of clauses in each submodule. RC parasitics are estimated using Mentor Graphics Calibre 2019. Using the extracted RC parasitics, transistor-level simulations are performed using Cadence Spectre 18.1.
3.2.3 Processing logic
The logic part of the SAT Submodule is designed using a standard digital design flow. Specifically,
for each clause (3 rows of CAM and SRAM cells), the logic circuitry that determines the three signals mentioned above (the processing logic shown in Fig. 2.4) is synthesized using Synopsys Design Compiler 2019, placed and routed using Cadence Innovus 18.1, and connected to the SRAM as shown in Fig. 3.2. The processing logic for each clause in the SAT Submodule generates the CONF, UP, and DONE signals.

Figure 3.2: CAM/SRAM array and processing logic.

Figure 3.3: Sub-combining logic.
3.2.4 Combining logic
HW-BCP reports a single unit propagation (UP) or a conflict (CONF) at a time to the CPU. Thus, combining logic is required to integrate the signals generated for each clause into a single set of signals (the output of the submodule). After being evaluated by the processing logic, each clause sends out CONF, UP, and DONE along with a local clause ID (CID) and literal ID (LID). The sub-combining logic (shown in Fig. 3.3) integrates the signals from two clauses based on priority. Our clause-distribution algorithm orders and places the clauses in such a manner that at most one clause with each LID is placed in any one SAT Submodule. This means that within a SAT Submodule, there is no possibility of multiple conflicts or UPs occurring in any one BCP step.

Figure 3.4: Combining logic inside each SAT Submodule.

The highest priority is assigned to CONF. Hence, if one of the clauses has a conflict, the conflict overrides all other signals. When there is no conflict (both CONFs are zero), a UP takes priority over a satisfied (DONE) signal; in this case, the logic also sends out the literal ID (LID) that is causing the UP. If there is neither a conflict nor a UP, then DONE is considered: when both clauses are satisfied, the logic sends out the DONE signal.
Consider the example where the left clause has CONF_L = 1 and UP_L = 0, and the right clause has CONF_R = 0 and UP_R = 1. Since CONF takes priority over UP, both UP_L and UP_R are ignored. The local CID is then generated based on the location of the clause causing the conflict, which is 0 here (0/1 for left/right, respectively). This bit is used to extend the generated local CID at each level of sub-combining logic within the combining logic inside the SAT Submodule, as shown in Fig. 3.4.
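The following C++ sketch models one sub-combining block under these priority rules. The hardware is purely combinational; the names and the exact CID bit placement are illustrative assumptions.

```cpp
#include <cstdint>

struct SubOut {
    bool conf = false, up = false, done = false;
    uint32_t cid = 0;  // local clause ID, grown by one bit per level
    uint32_t lid = 0;  // literal ID accompanying a UP
};

// Combines two inputs at one level of the tree; 'bits' is the CID width of each
// input, so the side bit (0 = left, 1 = right) becomes the new most-significant bit.
SubOut subCombine(const SubOut& l, const SubOut& r, uint32_t bits) {
    auto pick = [bits](const SubOut& s, uint32_t side) {
        SubOut o = s;
        o.cid = (side << bits) | s.cid;  // extend the local CID with the side bit
        return o;
    };
    if (l.conf) return pick(l, 0);  // CONF has the highest priority
    if (r.conf) return pick(r, 1);
    if (l.up)   return pick(l, 0);  // then UP, which carries its LID
    if (r.up)   return pick(r, 1);
    SubOut o;
    o.done = l.done && r.done;      // DONE only when both sides are satisfied
    return o;
}
```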
In the optimized SAT Submodule configuration (see Section 3.3), we have 128 clauses in each SAT Submodule. As shown in Fig. 3.4, the first level has 64 sub-combining logic blocks, and the second level has 32 sub-combining logic blocks to integrate the signals from the first level. Overall, there are 7 levels and a total of 127 sub-combining logic blocks.

Along with the UP signal, the combining logic also generates the local CID of the original clause that is causing the UP. At the first level, a 1-bit local CID is generated per sub-combining logic block. At the second level, using the two 1-bit local CIDs and a UP signal, the local CID is extended to two bits. Eventually, the entire combining logic inside each SAT Submodule generates a 7-bit local CID for the UP.
Thus we create a detailed netlist of the SAT Submodule, which includes all the parasitics from our memory cell layouts (shown in Fig. 3.1), and perform accurate simulations of the entire SAT Submodule. Logic synthesis is performed using Synopsys Design Compiler 2019, and place and route of the designed logic is performed using Cadence Innovus 18.1. The post-layout extracted netlist of the logic part is simulated (at the logic level), and the simulation results are combined with those for the memory to obtain accurate delays for the entire SAT Submodule.
3.3 Design optimization for HW-BCP
Once we know the precise dimensions and parasitics of each SAT Submodule, as shown in Fig. 2.4, we design and optimize the layout of the H-tree data path.
We repeat the above process by exploring the design space across different sizes of SAT Sub-
modules, i.e., different numbers of clauses in each SAT Submodule, as well as different buffer sizes
for memory write drivers, and different place and route options for the logic circuitry. To capture
the delay precisely, we iterate netlist-level simulations with the extracted RC parasitics from the
layout and logic-level simulations for the logic part. The area of the memory part is calculated
based on our CAM and SRAM cells. The area of the logic part is calculated based on the post-PnR geometry of the blocks. Finally, we select the H-tree and SAT Submodule design that optimizes a desired combination of area and delay.

Figure 3.5: Optimized SAT Submodule and a part of the layout.
If we have a large number of rows in a SAT Submodule, it reduces the total number of SAT Submodules and thus saves overall bus area. However, within a SAT Submodule, longer CAM search lines require stronger inverters to drive them. If the size of this inverter exceeds the width of the memory column, we must place the memory cells farther apart, and hence the memory density decreases; in turn, this significantly increases the area overhead. Practically, in the TSMC 65nm PDK we use, an inverter with width ≤ 16× the width of the standard inverter is able to fit within the width of the single memory column that we custom-design.

Given this constraint, the selected SAT Submodule (shown in Fig. 3.5) has 384 rows for 128 clauses. Its area is 0.05mm² (= 73um × 727um). The delay of the BCP operation is 1.1ns. The power consumption per BCP operation is 5.2mW when the submodule is operating, and the static power consumption is 0.2mW when the submodule is not selected.

We use this SAT Submodule as a building block to design HW-BCP to load and solve large SAT problems.
3.4 Floorplan and H-tree design
Our goal is to design an accelerator that can solve large SAT instances with up to one million clauses, which requires 2^13 SAT Submodules. We employ an H-tree bus design to equalize the delay for sending and receiving data to/from the SAT Submodules, as shown in Fig. 3.7. Since the BCP operation requires broadcasting the VID to all the SAT Submodules, we face two considerable wire-delay challenges. First, we have a high fanout, broadcasting the VID to 2^13 SAT Submodules. Second, due to the large number of SAT Submodules, the wires are very long, especially at the higher levels of the H-tree. We explored an exhaustive set of design choices to identify the one that optimally tackles these challenges.
Design of the SAT Group Submodule: The H-tree wire length is minimized when we have a square-like floorplan, so it is helpful to have a square-like module from which to build the top-level floorplan. As our SAT Submodule is much taller than it is wide (height-to-width ratio of 10:1), we first group 8 SAT Submodules into one group, namely, the SAT Group Submodule (shown in Fig. 3.6).

Inside the SAT Group Submodule, to minimize the driver load when broadcasting, the input wire starts with the minimum-sized (1×) standard inverter and is locally buffered to drive the bit and search lines.

We also need seven sub-combining logic blocks to integrate the output signals from the 8 SAT Submodules. The local CID is also extended from 7 bits to 10 bits via three additional levels of sub-combining logic blocks. We assume the output wires from the SAT Submodules can be routed on top of the SAT Submodules. The SAT Group Submodule also has extra space for the output bus, the input bus, and the combining logic blocks. The output bus width is calculated based on the minimum wire width and wire pitch. It is assumed that the input bus and combining logic blocks can be placed below the output-bus area. The area of the designed SAT Group Submodule is 0.43mm² (= 734um × 581um).
Figure 3.6: SAT Group Submodule.

Top-level floorplan with SAT Array: We then have the top-level floorplan for the SAT Array, as shown in Fig. 3.7. The number of SAT Group Submodules is 1024 (= 32 × 32) for one million clauses. Outside the SAT Group Submodules, sub-combining logic blocks are located at every merge point in the H-tree to integrate the signals. The area of the SAT Array is 450mm² (= 18.9mm × 23.8mm).
H-tree design optimization: Given all the geometric information for the SAT Submodule, SAT Group Submodule, and SAT Array, we perform H-tree design optimization. The total H-tree wire is divided into 9 stages for wire design and optimization, as shown in Fig. 3.8. For simplicity, we use the same size of buffers/inverters within each stage. For the H-tree output wire, we divide the entire wire into N segments and add a fixed-size inverter for each segment. The H-tree input wire, however, branches at every branch point, as shown in Fig. 3.8, and eventually drives a SAT Group Submodule, whose load capacitance, C_L, is calculated as 390× a minimum-sized inverter, considering the input wires and local drivers inside. Thus, we use the method used for super-buffer design to calculate the optimal number of stages to drive this large load.
As shown in Fig. 3.9, buffering across the whole H-tree input wire is designed as follows:

    C_init · (α/2)^(K-1) · α = C_L,   (3.1)

where C_init is the initial driving capacitance at the starting point of the H-tree input, α is a sizing factor, and K is the number of H-tree stages, 9. C_L is the load capacitance presented by a SAT Group Submodule. Solving Eq. (3.1) gives α = 2.64.

Figure 3.7: SAT Array floorplan with combining logic blocks.
Figure 3.8: SAT Array geometry and H-tree stages.

Figure 3.9: Buffer design for H-tree input wire.

We optimize the total H-tree wire delay by partitioning the long wire into equal-length segments and adding an inverter between consecutive segments. Based on the geometry information, the total H-tree wire length, L, is calculated as 38mm (= 37,867um). We then need N+1 inverters for N segments. Each segment delay, d_seg, is composed of a wire delay (d_seg,wire) and an inverter delay (d_seg,inv):
    d_seg = d_seg,wire + d_seg,inv = (1/2)·r·c·(L/N)^2 + d_0·(1 + ((L/N)·c + C_g,next) / C_g),   (3.2)
where c is the unit wire capacitance, r is the unit wire resistance, d_0 is the delay of a minimum-sized inverter without load capacitance, and C_g is the gate capacitance of a minimum-sized inverter. c and r are computed via parasitic extraction; d_0 and C_g are estimated from netlist simulations. The exact values are not disclosed due to a non-disclosure agreement with TSMC.
The total H-tree wire delay, d_tot, is the product of N and d_seg; d_tot is minimized at N = 150.
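As an illustration of this optimization, the following C++ sketch sweeps N to minimize N · d_seg per Eq. (3.2). The technology constants are placeholders, since the real 65nm values are under NDA; substituting the extracted r, c, d_0, and C_g would reproduce the reported optimum of N = 150.

```cpp
#include <cstdio>

// Sweeps the segment count N to minimize the total H-tree wire delay
// d_tot = N * d_seg, with d_seg from Eq. (3.2). All constants are assumed
// placeholder values, not the NDA-protected 65nm figures.
int main() {
    const double L  = 37867e-6;  // total H-tree wire length (m)
    const double r  = 2.0e5;     // unit wire resistance (ohm/m), assumed
    const double c  = 2.0e-10;   // unit wire capacitance (F/m), assumed
    const double d0 = 5.0e-12;   // unloaded inverter delay (s), assumed
    const double Cg = 1.0e-15;   // inverter gate capacitance (F), assumed

    double bestDelay = 1e99; int bestN = 1;
    for (int N = 1; N <= 2000; ++N) {
        double seg  = L / N;
        double dSeg = 0.5 * r * c * seg * seg           // distributed RC of one segment
                    + d0 * (1.0 + (seg * c + Cg) / Cg); // inverter driving wire + next gate
        double dTot = N * dSeg;
        if (dTot < bestDelay) { bestDelay = dTot; bestN = N; }
    }
    std::printf("optimal N = %d, total delay = %.3g s\n", bestN, bestDelay);
    return 0;
}
```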
Table 3.1: Optimized H-tree input wire design.

Stage   L (mm)   N    Inverter size for H-tree input wire
1       9.4      48   8
2       5.9      30   11
3       4.7      24   14
4       3.0      15   18
5       2.4      12   24
6       1.5      8    32
7       1.2      6    42
8       0.7      4    56
9       0.9      3    74
Table 3.1 shows the optimized H-tree input wire design, including the stage length, the number of segments (or inverters), and the optimal inverter size for each stage. Overall, the H-tree has an input delay of 3.4ns and an output delay of 3.2ns.

We designed the layout of our entire H-tree for the above geometry (wire lengths at every level) and buffer sizing and placement, and terminated it with the loads of all the SAT Group Submodules. We performed SPICE simulations on the designed H-tree to measure the wire delays. The delay reported by these simulations was slightly lower (by < 5%) than that given by our calculations above, confirming that our calculations, and hence our buffer design, are quite accurate. For simplicity, we use the calculated delay values (which are slightly pessimistic) to compute other metrics such as speedup (details in Section 5.1).

All the delay and area evaluations ahead for the 65nm design using our memory cells are for this detailed design.
3.5 HW-BCP design metrics
All HW-BCP design metrics are summarized in this section.
The area of the SAT Array is 658mm² (= 23.8mm × 27.6mm). The SAT Array has 2^10 SAT Group Submodules, which corresponds to 2^13 SAT Submodules. The area of a SAT Group Submodule is 0.63mm² (= 0.86mm × 0.73mm). The area of a SAT Submodule is 0.078mm² (= 0.107mm × 0.727mm).

The area of the Group Status Table is 33mm².
As shown in Table 3.2, the BCP operation is composed of the Group Status Table delay, the H-tree input delay to broadcast the VID, the memory-operation delay for a CAM search and an SRAM read, the processing & combining logic delay, and finally the H-tree output delay, which are 0.4, 3.4, 0.5, 1.7, and 3.2ns, respectively. Additionally, as described in Chapter 4, the BCP operation also includes a sub-operation that builds the implication-graph information required for conflict-driven clause learning (CDCL) by updating the Assignment Information Table (AIT) (details in Section 4.2.2), which takes 1.0ns. Hence, the total BCP operation delay is 10.1ns.
Table 3.2: BCP operation delay.

Sequence of operations                        Delay (ns)   Method to measure
Group Status Table delay                      0.4          Size estimation and calculation with buffer design
Broadcast VID via H-tree input bus            3.4          Calculation with buffer design
Memory operation (CAM search and SRAM read)   0.5          Netlist-level simulation
Processing and combining logic                1.7          Logic simulation
Send out via H-tree output bus                3.2          Calculation with buffer design
Update AIT                                    1.0          Scaled down from HW-BCP
Total BCP operation delay                     10.1
Table 3.3: HW-BCP operations and CAM/SRAM modes for HW-BCP and AIT.

Operations   Sub-operations    HW-BCP CAM mode   HW-BCP SRAM mode      AIT CAM mode   AIT SRAM mode
LOAD         N/A               write             write                 hold           hold
UNDO         N/A               search            CAM-activated write   hold           hold
BCP          Phase 1           search            CAM-activated write   hold           hold
             Phase 2           read              read                  hold           hold
             Update AIT        hold              hold                  write          write
CDCL         Read three LIDs   read              read                  hold           hold
Table 3.3 summarizes how HW-BCP supports the LOAD, UNDO, BCP, and CDCL operations, and specifies the memory operating mode(s) for each operation.

The average power consumption of a single CAM cell during a BCP operation (i.e., in the dynamic mode) is 651nW; the CAM cell can be in either the match or the mismatch mode, where the power consumption is 695nW and 607nW, respectively. A CAM cell can be placed in the static mode by the low-power architecture: the memory cells in the disabled SAT Submodules are in the hold mode, where each cell consumes 31nW. With the low-power architecture, a BCP operation consumes 5.4W, whereas without it (i.e., with the entire SAT Array operating) a BCP operation consumes 43W. Thus, the low-power architecture gives an 8× reduction in power consumption.
Chapter 4: Custom Hardware Accelerators for Conflict-Driven
Clause Learning
We achieved considerable speedup (well over 100×) for the BCP operation by using our HW-BCP [33]. However, even that does not provide more than a 6-7× speedup at the SAT level, because the SAT-level speedup becomes bounded by the non-BCP operations.

Our overall goal in designing custom hardware accelerators for Boolean Satisfiability (HW-SAT) is to achieve a much higher speedup for SAT solving, namely 10× to 100×. Clearly, this requires us to extend our transistor-level architecture for HW-BCP [33] to develop hardware that accelerates additional (i.e., beyond BCP) operations of a SAT solver.

In this chapter, we describe our approach for accelerating most parts of conflict-driven clause learning (CDCL), which is the second-highest-runtime operation (after BCP) of Boolean Satisfiability.
4.1 Background
4.1.1 Review of SAT Operations
To implement hardware accelerators for CDCL (HW-CDCL), we started with a review of the major SAT operations. Modern SAT solvers like MiniSAT2 [1] carry out the following sequence of operations. First, in Operation-1, the solver uses decision heuristics, such as VSIDS [4], to select a variable as well as the value to assign to it; this is often called a decision. Operation-2, BCP, is performed for the above decision. In many cases, BCP causes multiple consecutive unit propagations. While performing BCP, the decision is stored; and while processing each subsequent unit propagation, information about the assignment is stored. The stored information captures the implication graph (an example is shown in Fig. 4.2), which describes all possible ways of forcing the unit propagations and is used for CDCL. In Operation-3, if there is any conflict while performing BCP for a decision or any of the subsequent unit propagations, the CDCL algorithm is invoked. It finds a cut in the implication graph that caused the conflict and uses this information to derive a learned clause, which is composed of the negation of the assignments that caused the conflict. This learned clause is added to the clause database to effectively prune the search tree after a restart [34]. After that, among the variables involved in the conflict, the one that was assigned first is identified, and non-chronological backtracking is performed to the corresponding decision level [35].

The entirety of Operation-2 can be accelerated by slightly expanding the BCP operation on our HW-BCP. In this chapter we present our custom architecture, namely HW-CDCL, to accelerate most of Operation-3.
4.1.2 CDCL runtime analysis and functions
As shown in Table 2.1, in software, CDCL takes 12% of the total SAT runtime. CDCL has three components: analyze, litRedundant, and findBacktrackLevel. First, analyze generates an initial learned clause from a conflict; its average runtime is 6.7% of the total (ranging from 2 to 18%). Second, litRedundant optimizes the learned clause derived by analyze from the conflict (this step is also known as conflict-clause minimization [36]); its average runtime is 5.6% of the total (ranging from 1 to 20%). Third, findBacktrackLevel finds the literal at the highest decision level in order to perform non-chronological backtracking [35]. Since the average runtime of this operation is less than 0.5%, we do not consider it for acceleration. Consequently, we target analyze and litRedundant for hardware acceleration.

We profiled MiniSAT2 [1] for CDCL to investigate and identify the operations we accelerate; the details are described in Section 5.1.
Figure 4.1: Algorithm of analyze in MiniSAT2 [1].
analyze, the primary CDCL function: analyze is a function that analyzes the conflict to identify the learned clause, using the information stored during each decision and each unit-propagation step. If we view this stored information as a graph, as shown in Fig. 4.2, then analyze starts with the conflict node and, in a breadth-first manner, expands the literals from the current decision level by following the assignment edges backwards to find a dominator in the implication graph.

In software, this is performed using the algorithm shown in Fig. 4.1. Starting from the clause that caused the conflict, it checks each literal in the clause to determine whether it is assigned at the current decision level. If so, it uses the reason clause associated with this literal to perform further investigation; the reason clause either stores the CID of the clause that forced the value of this literal via unit propagation, or stores no CID if the value was assigned by a decision. If the literal is not assigned at the current decision level, it is added to the learned clause. The algorithm iterates by popping assignments from the assignment stack until no literal assigned at the current decision level remains. To implement this iterative process, the algorithm uses the cnt variable, incrementing or decrementing the number of remaining jobs to perform. When analyze completes execution, it provides the initial version of the learned clause.

Figure 4.2: Procedure of conflict-clause minimization in litRedundant.
litRedundant, the secondary CDCL function: The learned clause generated by analyze is not optimal in terms of its number of literals; hence, it can be optimized to improve SAT-solving performance. litRedundant is a function that tries to minimize the number of literals in the learned clause. It tries to determine whether the reasons for some literals can be completely explained by subsets of the other literals in the learned clause [36]; in that case, the literals that meet this requirement can be removed. As shown in Fig. 4.2, in the graph view, starting from the 1-UIP [1] cut in the implication graph, the optimization algorithm tries to find a more efficient cut that minimizes the number of literals in the learned clause.

We designed hardware accelerators for analyze and for a part of litRedundant, and we describe the details next.
4.2 HW-CDCL data structure and operations
MiniSAT2 [1] uses data structures, including the assignment stack and the reason clause, to maintain the assignment history and to build and analyze implication graphs. Likewise, for HW-CDCL, we have explored the design space and identified new data structures and operations. In particular, we have developed efficient hardware implementations of the Clause Information Table (CIT) and the Assignment Information Table (AIT), as well as the key operations, namely CIT, AIT-Search, AIT-Push, and AIT-Pop.
4.2.1 Design choices for Clause Information Table (CIT)
The CIT stores the clause information, where each clause has three literal IDs (LIDs), as shown in Fig. 4.4. We identified and studied multiple design choices for implementing the CIT: CIT-Design-0, CIT-Design-1A, CIT-Design-1B, and CIT-Design-2.

CIT-Design-0 (implicitly implemented in software): The CIT is already implicitly implemented in software. The entire clause information is loaded in DRAM, assuming that we have sufficient DRAM capacity, and a subset of frequently/recently accessed clauses is stored in the cache memory. Thus, when HW-CDCL needs the LIDs included in any clause, it can send a request to the software. Considering that most of the clause information is stored in DRAM, it can take more than 50 CPU cycles to retrieve this information.
CIT-Design-1A/1B (separate hardware, Full CIT or CIT Cache): We can implement a separate hardware module for the CIT. In CIT-Design-1A, we add a Full CIT, where the complete clause information is loaded into a full-sized SRAM during initial problem loading. The area overhead of this option is considerable, estimated at approximately 30% using the key design parameters from our original HW-BCP implementation. Since the Full CIT is a large memory, it must also be partitioned into sub-arrays and requires an H-tree bus; the delay for accessing the Full CIT is estimated to be greater than 50% of the HW-BCP operation delay. In CIT-Design-1B, we design a hardware module that stores only the frequently/recently used clause information, i.e., a CIT Cache. However, a cache is useful only when the stored data is used multiple times. Since the information about many clauses is used only once after each conflict, a CIT Cache is not a meaningful option.
CIT-Design-2 (reuse HW-BCP): The information required within the CIT is already available within HW-BCP. With minor modifications to the combining logic for clause reordering, the clause location can be calculated in the H-tree. In this case, the delay of a CIT operation is the sum of the H-tree input and output delays. This design choice has negligible area overhead, as it largely reuses HW-BCP.

Figure 4.3: CDCL mode of SAT Submodule.

After analyzing the above choices, we selected CIT-Design-2 for the CIT implementation, where HW-BCP is slightly modified to support CIT operations for retrieving clause information.
4.2.2 New module, Assignment Information Table (AIT)
Other than the Assignment Information Table (AIT), all the modules and features needed for HW-CDCL are already implemented in HW-BCP (and explained in Chapter 3). Specifically, the CIT is implicitly implemented in HW-BCP, and the CIT operation is supported using the CAM read mode. Also, the delay for updating the AIT for CDCL is already included in the BCP operation delay in Section 3.5. In this section, we describe our new hardware design for the AIT.

The AIT is a new data structure that we implement for CDCL; it is required to accelerate the conflict analysis. As shown in Fig. 4.4, the AIT stores every decision and all BCP information. During each BCP step, the BCP-related information, including the reason clause, i.e., the CID of the clause which caused the unit propagation, as well as the decision level, is recorded in the AIT. The AIT also includes storage for the variable seen, which is used to run the clause-learning algorithm.
The AIT is an integrated memory structure that contains (like the SAT Submodule) content-addressable memory (CAM) and SRAM. We store the VID in the CAM and the other related information in the SRAM. Thus, first, the AIT supports an AIT-Search operation: using the CAM's search operation, we can search across all the VIDs in the CAM array, find the row with a match, and fetch the corresponding VID-related information from the SRAM in the same row. Then, to also support the stack structure, the AIT is designed to support two stack operations, AIT-Push and AIT-Pop.
Since the structure of the AIT is similar to that of the SAT Submodule, we use a design methodology similar to the one we used to design the SAT Submodule and SAT Array for HW-BCP. We decided the maximum number of rows for the AIT by running benchmark simulations. Basically, the AIT replaces the assignment stack used in the software implementation of SAT. In our simulations, we measured the number of rows of the assignment stack across different benchmark instances; the maximum was less than 110,000.

As in our SAT Submodule design, a large memory structure with such a large number of rows must be partitioned into subarrays. Hence, we partition the AIT into AIT Submodules. Based on our SAT Submodule design experience, we know that the maximum number of memory rows for a single subarray is less than 400, because of the driver strength of the maximum inverter size that can fit within the width (pitch) of a single memory column.
Figure 4.4: Hardware modules for HW-CDCL; Assignment Information Table (AIT) and Clause Information Table (CIT).
Table 4.1: Overall HW-SAT operations and CAM/SRAM modes for HW-BCP and AIT.

Operations   Sub-operations    HW-BCP CAM mode   HW-BCP SRAM mode      AIT CAM mode   AIT SRAM mode
LOAD         N/A               write             write                 hold           hold
UNDO         N/A               search            CAM-activated write   hold           hold
BCP          Phase 1           search            CAM-activated write   hold           hold
             Phase 2           read              read                  hold           hold
             AIT-Push          hold              hold                  write          write
CIT          Read three LIDs   read              read                  hold           hold
AIT-Search   N/A               hold              hold                  search         CAM-activated read
AIT-Pop      N/A               hold              hold                  read           read
AIT-Push     N/A               hold              hold                  write          write
Thus, we design each AIT Submodule to have 256 rows. The entire AIT then uses 512 AIT Submodules, for a total of 2^17 rows (meeting the above requirement of more than 110,000 rows), and uses an H-tree data bus similar to that of the SAT Array, as shown in Fig. 4.6. The area of the AIT is 26mm².

Due to the structural similarities between the AIT and SAT Submodules, to estimate the delay of each sequence of AIT operations we used, instead of performing a detailed design of the AIT, the memory-operation delays from the SAT Submodule design and the H-tree wire delay scaled from the SAT Array design. Details of the AIT operation delays are presented in Section 4.2.4 and Table 4.2.
4.2.3 HW-CDCL operations
CIT operation: The CIT only operates on the values of the clauses in the SAT problem being solved. This information, namely the variable IDs (VIDs) and polarities of the three variables in each clause, is loaded into the CAM and SRAM cells when we start working on a new SAT problem, as described earlier for our BCP operation.

Hence, the only new CIT operation we need is: given a clause ID (CID), j, output the three associated literal IDs (LIDs), i.e., the VID and polarity of each of the three literals in the clause. As shown in Fig. 4.3 and Table 4.1, during a CIT operation the CAM is configured in the read mode, and the columns of the SRAM that hold the polarities are configured in the CAM-activated read mode, i.e., the word line of the CAM in each row serves as the word line of these SRAM cells. All other columns of the SRAM are in the hold mode. To perform the CIT operation, the local clause ID, j, is provided to the CAM decoder, which selects the three rows corresponding to the jth clause. Within the jth clause, a local controller within the SAT Submodule selects the first row, i.e., the first variable in the clause. The selected CAM row and SRAM row drive the sense amplifiers to output the VID and polarity, which together form the first literal ID (LID). The local controller repeats this for the second and third rows (of the jth set of rows) to output the second and third LIDs.

For CDCL, our design allows the CAM array to be in the read mode with minimal additional circuitry.
AIT-Search operation: For each AIT-Search operation, the CAM is configured in the search mode, and the columns of the SRAM that hold reason, decLevel, and seen are all configured in the CAM-activated read mode, i.e., the match line of the CAM in each row drives the word line of the SRAM cells in that row. Tri-state drivers inserted between the CAM array and the SRAM array ensure that the CAM match lines drive the SRAM word lines only during the AIT-Search operation. The operation starts by inputting a variable ID and using it to drive the CAM search lines. Consequently, the match line of the CAM row with that variable ID goes high and selects the corresponding SRAM row to drive the sense amplifiers, and hence the AIT outputs the values of reason, decLevel, and seen for that row.

AIT-Pop operation: During an AIT-Pop operation, the CAM and the SRAM are both in the read mode. To support AIT-Pop, the AIT uses a set of tri-state drivers between the word lines of the CAM and SRAM arrays, controlled such that the CAM word lines drive the SRAM word lines. Hence, in this mode, given an address associated with the stack pointer, the CAM decoder drives the CAM word line, which activates the corresponding CAM row as well as the corresponding SRAM row. The selected row then drives the sense amplifiers to output the VID-related information stored in the row, including the VID, reason, decLevel, and seen. After that, the stack pointer is decremented by one.
Table 4.2: Summarized CDCL operations and delays: CIT, AIT-Search, AIT-Push, and AIT-Pop.

Operation    Sequence of operations                      Delay (ns)   Method to measure
CIT          Broadcast CID via H-tree input bus          3.4          Calculation with buffer design
             Memory operation (CAM read)                 0.2          Netlist simulation
             Send out LID1 via H-tree output bus         3.2          Calculation with buffer design
             Send out LID2 via H-tree output bus         3.2          Calculation with buffer design
             Send out LID3 via H-tree output bus         3.2          Calculation with buffer design
             Total CIT operation delay                   13.2
AIT-Search   Broadcast LID via H-tree input bus          0.8          Scaled down from HW-BCP
             Memory operation (CAM search & SRAM read)   0.5          Netlist simulation
             Send out info via H-tree output bus         0.7          Scaled down from HW-BCP
             Total AIT-Search operation delay            2.0
AIT-Push     Broadcast input via H-tree                  0.8          Scaled down from HW-BCP
             CAM/SRAM write                              0.2          Netlist simulation
             Total AIT-Push operation delay              1.0
AIT-Pop      Broadcast input via H-tree                  0.8          Scaled down from HW-BCP
             CAM/SRAM read                               0.2          Netlist simulation
             Send out info via H-tree output bus         0.7          Scaled down from HW-BCP
             Total AIT-Pop operation delay               1.7
AIT-Push operation: During an AIT-Push operation, the CAM and the SRAM are both in the write mode. The AIT-Push operation is performed by incrementing the stack pointer by one and writing the VID-related information into the CAM and SRAM row selected by the CAM decoder, using the stack pointer as the address.
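As a behavioral reference for these operations, the following C++ sketch models the AIT as a stack that also supports search by VID. The CAM search is a single parallel operation in hardware; the linear scan and all names here are illustrative.

```cpp
#include <cstdint>
#include <vector>

struct AitEntry {
    uint32_t vid;       // variable ID (held in the CAM portion of the row)
    int32_t  reason;    // CID of the clause that forced the assignment, or -1 for a decision
    uint32_t decLevel;  // decision level of the assignment
    bool     seen;      // marker used by the clause-learning algorithm
};

class Ait {
    std::vector<AitEntry> stack_;  // top of the stack = most recent assignment
public:
    void push(const AitEntry& e) { stack_.push_back(e); }  // AIT-Push

    AitEntry pop() {                                       // AIT-Pop (stack must be non-empty)
        AitEntry e = stack_.back();
        stack_.pop_back();
        return e;
    }

    // AIT-Search: a single parallel CAM search in hardware; a scan here.
    AitEntry* search(uint32_t vid) {
        for (auto& e : stack_)
            if (e.vid == vid) return &e;
        return nullptr;
    }
};
```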
4.2.4 Delay of HW-CDCL operations
We analyze the delay of each CDCL operation and summarize the results in Table 4.2.

A CIT operation first sends the CID via the H-tree input bus of the SAT Submodule (3.4ns). Then, within the corresponding SAT Submodule, it performs a memory operation (0.2ns, as determined via simulation of the post-layout extracted netlist). Next, it sends out the three LIDs sequentially via the H-tree output bus (3.2ns per LID). Hence, the CIT operation takes a total of 13.2ns.

An AIT-Search operation starts by sending the LID via the H-tree input bus of the AIT; this delay is 0.8ns, obtained using a scaled-down version of the HW-BCP H-tree input bus. It then performs the CAM search and SRAM read operations, which take 0.3ns and 0.2ns, respectively. After that, it sends out the corresponding VID-related information stored in the AIT via the H-tree output bus (0.7ns). Thus, an AIT-Search operation takes 2.0ns.

An AIT-Push operation starts by sending the input data, along with the stack-pointer value as the address of the AIT row, via the H-tree input bus of the AIT (0.8ns). It then performs a CAM/SRAM write operation (0.2ns). Thus, an AIT-Push operation takes 1.0ns.

An AIT-Pop operation starts by sending the read signal, along with the stack-pointer value as the address of the AIT row, via the H-tree input bus of the AIT (0.8ns). It then performs a CAM/SRAM read operation (0.2ns), and finally sends out the selected row's data via the H-tree output bus (0.7ns). In total, an AIT-Pop operation takes 1.7ns.
4.3 HW-CDCL algorithm
We designed new hardware accelerators for CDCL (HW-CDCL) that accelerate 70-80% of the software CDCL runtime by using the newly designed modules and operations described above.
4.3.1 Algorithm for analyze
Figure 4.5: Algorithm of HW-CDCL-analyze; analyze function of HW-CDCL.

The primary function of HW-CDCL is to perform, in hardware, the analyze function of CDCL, creating an initial learned clause from a conflict. HW-CDCL-analyze performs the entire operation and hence completely replaces the software analyze explained in Section 4.1. Fig. 4.5 shows the pseudo-code of HW-CDCL-analyze. It starts when a conflict is detected by HW-BCP, which sends out the conflict information, namely the CID of the clause that caused the conflict (CID_conf), via the input H-tree of the SAT Submodule. Using CID_conf, it accesses the CIT to obtain the three literal IDs (LIDs) in this clause via the CIT operation. Then, for each LID_i in CID_conf, it accesses the AIT by providing its VID as the search key (the VID is obtained by omitting one bit, the polarity, from the LID). Via the AIT-Search operation, it retrieves the corresponding values of reason, decLevel, and seen. If this literal is not yet seen (seen = 0) and its decision level is not the root level (decLevel > 0), then seen is first set to 1, using the AIT-Push operation, to mark that this LID has been checked and does not need to be checked again. If this LID is assigned at the current decision level, further investigation is required; we use the variable counter to track the number of tasks that remain, so we increment counter here. If this LID is not assigned at the current decision level, it is added to the learned clause. After repeating this process for each LID in CID_conf (the subloop in Fig. 4.5), the algorithm finds the next CID to examine: accessing AIT entries from the top using AIT-Pop operations, it finds a seen VID that was marked in a previous iteration. The reason clause of this LID becomes the next CID_conf, and counter is decremented. The entire process is repeated until the value of counter becomes zero.
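A software model of this loop, built on the Ait sketch above plus a hypothetical cit() helper standing in for the CIT operation (none of these names come from the actual hardware or from MiniSAT2), might look like:

```cpp
#include <array>
#include <cstdint>
#include <vector>
// Assumes the Ait/AitEntry sketch shown earlier, plus two illustrative helpers:
//   cit(cid)   -> the three LIDs of clause 'cid' (models the CIT operation)
//   vidOf(lid) -> drops the polarity bit of a LID to obtain its VID
std::array<uint32_t, 3> cit(int32_t cid);
uint32_t vidOf(uint32_t lid);

// Models HW-CDCL-analyze: walk the implication graph backwards from the
// conflicting clause and collect the initial learned clause.
std::vector<uint32_t> hwCdclAnalyze(Ait& ait, int32_t cidConf, uint32_t curLevel) {
    std::vector<uint32_t> learned;
    int counter = 0;
    do {
        for (uint32_t lid : cit(cidConf)) {              // CIT operation
            AitEntry* e = ait.search(vidOf(lid));        // AIT-Search
            if (e && !e->seen && e->decLevel > 0) {
                e->seen = true;                          // mark as checked
                if (e->decLevel == curLevel) ++counter;  // expand this one later
                else learned.push_back(lid);             // literal crosses the cut
            }
        }
        AitEntry top = ait.pop();                        // AIT-Pop from the top
        while (!top.seen) top = ait.pop();               // skip unmarked entries
        cidConf = top.reason;                            // its reason clause is examined next
        --counter;
    } while (counter > 0);
    return learned;
}
```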
4.3.2 Algorithm for memLitRedundant
The secondary function of HW-CDCL is to implement in hardware nearly half (about 42-46%) of
the litRedundant operation. Since the conflict-clause minimization is composed of several opera-
tions that are difficult to implement in custom hardware, we do not create a full-hardware solution.
55
Instead, we noticed that the implication graph information used by litRedundant operation is al-
ready stored in AIT and that litRedundant software requires a large number of memory operations
to read this information. Thus, when litRedundant operation is performed in software, the memory
operations required for reading AIT information can be replaced by reading from our hardware
AIT to decrease runtime.
Every time an AIT-Search operation is needed while executing litRedundant in software, our HW-CDCL performs the AIT-Search operation in hardware instead of running this part in software; the remaining litRedundant software is executed by the CPU. To implement this, we need a minor design change in our AIT, namely extending the seen field by one more bit, which is required to perform conflict-clause minimization in software. We refer to the portion of litRedundant made up of AIT-Search operations as memLitRedundant; it is our target for acceleration.
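A simplified software model of this split is sketched below; hw_ait_search is a hypothetical stand-in for the hardware AIT-Search call, and the sketch omits the seen-bit bookkeeping and failure cleanup of the real litRedundant implementation.

```python
# Sketch of the SW/HW split in litRedundant (illustrative only). The
# control flow stays in software on the CPU; every read of the implication
# graph -- (reason, decLevel, seen) for a variable -- is served by the
# hardware AIT instead of cache-miss-prone software memory reads.

def lit_redundant(lid, cit, hw_ait_search, allowed_levels):
    stack = [lid >> 1]                           # VIDs still to examine
    while stack:
        vid = stack.pop()
        reason, level, seen = hw_ait_search(vid) # HW AIT-Search (2.0 ns)
        if reason is None:                       # a decision variable:
            return False                         # literal is not redundant
        for lit in cit[reason]:                  # walk the antecedent clause
            v = lit >> 1
            _, lvl, sn = hw_ait_search(v)        # HW AIT-Search again
            if not sn and lvl > 0:
                if lvl not in allowed_levels:
                    return False                 # escapes the clause's levels
                stack.append(v)
    return True                                  # literal is redundant
```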
In Chapter 5, we will present additional details about this using the expanded profiling data for analyze & memLitRedundant and evaluate the overall speedup provided by our integrated design.
4.4 Integrated architecture, HW-SAT
We have presented our novel architecture for SAT which accelerates the entire BCP operation as
well as key functions of CDCL operations.
The primary part of HW-SAT is HW-BCP, which performs BCP operations and provides dra-
matic acceleration. As shown in Fig. 4.6, HW-BCP is composed of SAT Array, BCP Controller,
and Group Status Table. In BCP mode, the key module for HW-SAT is the SAT Array which
stores problem information and includes logic circuitry required for maximum parallelization. It
also contains the H-tree input data bus as well as the H-tree output data bus, which also includes
combining logic to integrate the processed signals. The BCP Controller generates control signals
for HW-BCP to perform the BCP operation. Group Status Table is a lookup table for low-power
architecture.
Figure 4.6: Architecture of HW-SAT.
The secondary part of HW-SAT is HW-CDCL, which performs CDCL operations. For CDCL,
HW-CDCL is composed of SAT Array, Assignment Information Table (AIT), and CDCL Controller.
SAT Array is used again but in a different mode, namely, the CDCL mode. Since the BCP operation and the CDCL operations are mutually exclusive, at any given time, SAT Array is either in the BCP mode or in the CDCL mode. In the CDCL mode, SAT Array works as the Clause Information Table
(CIT) to provide variable IDs (VIDs) for any given clause ID. AIT stores the implication graph
information to produce the 1-UIP [1] cut for CDCL. CDCL Controller generates control signals
for HW-CDCL.
Other than the BCP and CDCL operations, we also need a LOAD operation to load a problem into the SAT Array initially and an UNDO operation to cancel variable assignments when a conflict occurs. During these operations, the SAT Array works simply as storage. Also, the two kinds of SRAM that store the variable values and all values of the AIT need to be initialized.
Chapter 5: Evaluation of HW-SAT
5.1 Speedup
We evaluate the overall speedup at the MiniSAT2 [1] level, as well as at the level of individual operations, such as BCP, analyze, or memLitRedundant. We assume that our HW-SAT is a co-processor to MiniSAT2 [1] running on general purpose processors. That is, we assume that we have MiniSAT2+HW-SAT, where BCP, analyze, and memLitRedundant run on our HW-SAT and the rest of MiniSAT2 [1] runs on general purpose processors. We assume that the integrated design, MiniSAT2+HW-SAT, runs as follows: while MiniSAT2 [1] is running, whenever it reaches a function that can be performed by our HW-SAT, MiniSAT2 [1] stops and this function is performed by our HW-SAT.
Besides the runtime information shown in Table 2.1, we further profiled the 63 instances of the SAT Competition 2017 benchmark suite [3] using MiniSAT2 [1] for speedup analysis. We provide the entire list of benchmark instances and their detailed profile information on GitHub (https://github.com/soowangp/HW-SAT).
5.1.1 Definition of speedup
As above, we view HW-SAT as implementing functions such as BCP. When a function is accel-
erated by our HW-SAT, the delay of this function in software is replaced by the corresponding
HW-SAT operation delay. Consequently, the new runtime with our HW-SAT, $d_{new,func}$, is estimated as:

$$d_{new,func} = d_{SW} - N_{func} \times d_{avg,SW,func} + N_{func} \times d_{HW,func}, \quad (5.1)$$
where $d_{SW}$ is the total runtime in MiniSAT2 [1], $N_{func}$ is the total number of times the function is invoked during the runtime, $d_{avg,SW,func}$ is the average delay of the function in MiniSAT2 [1], and $d_{HW,func}$ is the operation delay of the function in HW-SAT.

For that single function, the speedup of HW-SAT over MiniSAT2 [1], $\theta_{func}$, is:

$$\theta_{func} = \frac{d_{avg,SW,func}}{d_{HW,func}}. \quad (5.2)$$
Then, we estimate the SAT-level speedup over MiniSAT2 [1], $S_{func}$, where the function is accelerated by using our HW-SAT in the integrated design, MiniSAT2+HW-SAT. $S_{func}$ is calculated as:

$$S_{func} = \frac{d_{SW}}{d_{new}} = \frac{1}{1 - P_{func} + \frac{P_{func}}{\theta_{func}}}, \quad (5.3)$$

where $P_{func}$ is the ratio of the total function runtime to the total runtime in MiniSAT2 [1]. We measure $P_{func}$ by profiling MiniSAT2 [1], as described ahead.
5.1.2 BCP speedup
To estimate the BCP speedup and the SAT-level speedup for BCP, we use the above equations with BCP as the function. The new runtime in MiniSAT2+HW-SAT ($d_{new,BCP}$) and the BCP speedup ($\theta_{BCP}$) are calculated as follows:

$$d_{new,BCP} = d_{SW} - N_{BCP} \times d_{avg,SW,BCP} + N_{BCP} \times d_{HW,BCP}, \quad (5.4)$$

$$\theta_{BCP} = \frac{d_{avg,SW,BCP}}{d_{HW,BCP}}, \quad (5.5)$$

where $N_{BCP}$ is the total number of BCP operations during the runtime, $d_{avg,SW,BCP}$ is the average BCP delay in MiniSAT2 [1], and $d_{HW,BCP}$ is the delay of the BCP operation in our HW-SAT. Likewise, the BCP SAT-level speedup ($S_{BCP}$) is calculated as follows:
Table 5.1: Profile of BCP; total number of BCP invocations during runtime, BCP runtime (%), average BCP delay in SW ($d_{avg,SW,BCP}$ (ns)), BCP speedup ($\theta_{BCP}$), and SAT-level speedup ($S_{BCP}$) on MiniSAT2 [1].

| Bench ID | No. of BCP invocations | BCP runtime, $P_{BCP}$ (%) | Avg BCP delay in SW, $d_{avg,SW,BCP}$ (ns) | BCP speedup, $\theta_{BCP}$ | SAT-level speedup, $S_{BCP}$ |
|---|---|---|---|---|---|
| 1 | 3,399,514,205 | 82.4 | 1,744 | 173 | 5.5 |
| 2 | 3,084,528,975 | 82.0 | 1,914 | 190 | 5.4 |
| 3 | 3,297,646,472 | 82.9 | 1,809 | 179 | 5.7 |
| 4 | 3,766,401,134 | 80.7 | 1,542 | 153 | 5.0 |
| 5 | 2,934,902,026 | 88.2 | 2,162 | 214 | 8.2 |
| 6 | 3,455,195,721 | 78.1 | 1,626 | 161 | 4.5 |
| 7 | 3,192,591,220 | 75.6 | 1,705 | 169 | 4.0 |
| 8 | 2,373,511,355 | 91.7 | 2,781 | 275 | 11.6 |
| 9 | 3,570,891,076 | 91.0 | 1,834 | 182 | 10.5 |
| 10 | 3,257,464,151 | 87.7 | 1,939 | 192 | 7.8 |
| 11 | 3,029,964,903 | 88.6 | 2,106 | 209 | 8.5 |
| 12 | 3,448,601,225 | 85.8 | 1,790 | 177 | 6.8 |
| 13 | 4,607,062,314 | 88.3 | 1,379 | 137 | 8.1 |
| 14 | 2,910,116,649 | 86.6 | 2,142 | 212 | 7.2 |
| 15 | 4,436,823,281 | 88.2 | 1,431 | 142 | 8.0 |
$$S_{BCP} = \frac{d_{SW}}{d_{new,BCP}} = \frac{1}{1 - P_{BCP} + \frac{P_{BCP}}{\theta_{BCP}}}, \quad (5.6)$$

where $P_{BCP}$ is the ratio of the total BCP runtime to the total runtime in MiniSAT2 [1], which we show in Table 2.1.
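As a worked example, the benchmark 1 row of Table 5.1 can be reproduced from Eqs. (5.5) and (5.6), using the 10.1ns HW-BCP operation delay discussed in Section 5.2.2:

```python
# Benchmark 1 from Table 5.1: theta_BCP and S_BCP from Eqs. (5.5)-(5.6).
d_avg_sw_bcp = 1744.0   # ns, average BCP delay in MiniSAT2 (Table 5.1)
d_hw_bcp = 10.1         # ns, BCP operation delay in HW-SAT
p_bcp = 0.824           # BCP fraction of total runtime (Table 5.1)

theta_bcp = d_avg_sw_bcp / d_hw_bcp                # Eq. (5.5)
s_bcp = 1.0 / (1.0 - p_bcp + p_bcp / theta_bcp)    # Eq. (5.6)
print(round(theta_bcp), round(s_bcp, 1))           # 173 5.5, as in the table
```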
We expanded the profile for BCP and show the data for 15 instances in Table 5.1. For each instance, the table shows the number of BCP operations ($N_{BCP}$), the ratio of BCP runtime to total runtime ($P_{BCP}$), the average BCP operation delay in MiniSAT2 [1] ($d_{avg,SW,BCP}$), the calculated BCP speedup ($\theta_{BCP}$), and the overall SAT-level speedup ($S_{BCP}$). For a fair comparison, $d_{avg,SW,BCP}$ is adjusted for the 65nm technology based on the CPU benchmark data shown in Table 5.2.

Since total BCP time is 75.6-91.7% of total SAT solving time, the maximum speedup we can achieve at the entire SAT level is 4.0-11.6×. Even though we achieved significant speedup (137-275×) on BCP operations, the SAT-level speedup is bounded by Amdahl's Law, which motivated us to expand our HW-BCP to further accelerate CDCL.
Table 5.2: CPU benchmark data – PassMark single thread score.

| Technology | CPU name | PassMark single thread score |
|---|---|---|
| Intel 65nm | Intel E6600 @2.4GHz | 951 |
| TSMC 65nm | AMD Phenom @2.0GHz | 798 |
| TSMC 65nm | AMD Phenom @2.4GHz | 957 |
| Intel 45nm | Intel E7300 @2.66GHz | 1086 |
| Intel 32nm | Intel i7-970 @3.2GHz | 1498 |
| Intel 32nm | Intel i7-2700K @3.5GHz | 1792 |
| Intel 22nm | Intel i7-3770K @3.5GHz | 2081 |
| Intel 22nm | Intel i7-4790K @4.0GHz | 2470 |
| Intel 14nm | Intel i7-6700K @4.0GHz | 2523 |
| Intel 14nm | Intel i7-8700K @3.7GHz | 2775 |
| Intel 14nm | Intel i7-10700K @3.8GHz | 3080 |
| TSMC 7nm | AMD Ryzen 7 5800X @4.0GHz | 3495 |
5.1.3 analyze speedup
As explained in Section 4.3.1, we accelerate analyze by using our HW-CDCL. The majority of the delay comes from the HW-CDCL operations: CIT, AIT-Search, AIT-Pop, and AIT-Push (shown in Table 4.2). Since other delays, such as state machine transitions, counter increments/decrements, and the HW-CDCL controller, are negligible, we ignore them in this study.

We investigated analyze by profiling MiniSAT2 [1] to determine the number of HW-CDCL operations. Table 5.3 shows, for each benchmark instance, the number of analyze calls and the total numbers of CIT/AIT-Search/AIT-Push/Pop operations during the MiniSAT2 [1] runtime. We then calculate the average number of operations per analyze for each operation type, the average analyze delay in HW-CDCL ($d_{avg,HW,analyze}$ (ns)) using the HW-CDCL operation delays shown in Table 4.2, and the average analyze delay in SW ($d_{avg,SW,analyze}$ (ns)), i.e., the total analyze runtime divided by the number of analyze calls. Consequently, based on Eq. (5.2), we calculate the analyze speedup,

$$\theta_{analyze} = \frac{d_{avg,SW,analyze}}{d_{avg,HW,analyze}}. \quad (5.7)$$

Similar to $d_{avg,SW,BCP}$, $d_{avg,SW,analyze}$ is adjusted to the 65nm technology based on the CPU benchmark data shown in Table 5.2.
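For reference, $d_{avg,HW,analyze}$ is simply a weighted sum of the per-call operation counts and per-operation delays. The sketch below uses the AIT delays from Section 4.2; the CIT delay is left as a parameter, since the exact Table 4.2 value is not repeated here.

```python
# Assembling d_avg,HW,analyze from per-call operation counts (sketch).
D_AIT_SEARCH = 2.0   # ns, AIT-Search delay from Section 4.2
D_AIT_PUSH = 1.0     # ns, AIT-Push delay
D_AIT_POP = 1.7      # ns, AIT-Pop delay

def d_avg_hw_analyze(n_cit, n_search, n_push, n_pop, d_cit):
    """Per-analyze HW delay as a weighted sum of operation delays;
    d_cit is the CIT operation delay from Table 4.2 (not repeated here)."""
    return (n_cit * d_cit + n_search * D_AIT_SEARCH
            + n_push * D_AIT_PUSH + n_pop * D_AIT_POP)
```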
We achieved considerable speedup, 14.1-31.1×, for the analyze function. We will explain the overall SAT-level speedup, i.e., where we combine the analyze speedup with the BCP speedup, in Section 5.1.5.
5.1.4 memLitRedundant (mLR) speedup

As explained in Section 4.3.2, we accelerate only part of litRedundant, namely its AIT-Search operations, i.e., memLitRedundant (mLR), by using our HW-CDCL. We profiled the litRedundant function in MiniSAT2 [1] using Valgrind cache simulations [24] to determine the runtime that is due to memLitRedundant (mLR) operations.

Table 5.4 shows, for each benchmark instance, the litRedundant runtime (%), the total number of CPU cycles for the litRedundant function, the total number of CPU cycles for mLR within litRedundant, and the ratio of mLR to litRedundant in terms of CPU cycles. We assume that this portion can be replaced with our HW-CDCL AIT-Search operations for acceleration while leaving the remaining portions running in software. We also counted the total number of AIT-Search operations in litRedundant; the average number per litRedundant invocation is shown in the table. We then calculate the average mLR delay in HW-CDCL ($d_{avg,HW,mLR}$ (ns)) using the HW-CDCL operation delays shown in Table 4.2, and the average mLR delay in SW ($d_{avg,SW,mLR}$ (ns)), i.e., the litRedundant runtime times the mLR/litRedundant ratio. Consequently, based on Eq. (5.2), we calculate the mLR speedup,

$$\theta_{mLR} = \frac{d_{avg,SW,mLR}}{d_{avg,HW,mLR}}, \quad (5.8)$$

which is shown in Table 5.4. Similar to $d_{avg,SW,BCP}$, $d_{avg,SW,mLR}$ is also adjusted for the 65nm technology based on the CPU benchmark data shown in Table 5.2.

We achieved considerable speedup, 5.5-25.7×, for the mLR function. We explain the overall SAT-level speedup, where we combine this with the BCP and analyze speedups, in the next section.
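As a consistency check on Table 5.4, benchmark 1 can be reproduced from the 2.0ns AIT-Search delay of Section 4.2 (the small gap to the table's 51.3ns comes from rounding the average operation count):

```python
# Benchmark 1 from Table 5.4: d_avg,HW,mLR and theta_mLR.
avg_searches = 26                 # AIT-Search ops per litRedundant call
d_hw_mlr = avg_searches * 2.0     # ~52 ns, vs. 51.3 ns in the table
theta_mlr = 535.0 / 51.3          # Eq. (5.8) with the table's values
print(round(d_hw_mlr, 1), round(theta_mlr, 1))   # 52.0 10.4
```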
Table 5.3: Profile of analyze in terms of CIT/AIT operations and analyze speedup.

| Bench ID | No. of analyze calls | analyze runtime (%) | No. of CIT ops | No. of AIT-Search ops | No. of AIT-Push/Pop ops | Avg CIT ops per analyze | Avg AIT-Search ops per analyze | Avg AIT-Push/Pop ops per analyze | Avg analyze delay in HW, $d_{avg,HW,analyze}$ (ns) | Avg analyze delay in SW, $d_{avg,SW,analyze}$ (ns) | analyze speedup, $\theta_{analyze}$ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 3,056,572 | 5.1 | 494,535,607 | 1,897,388,348 | 2,594,134,062 | 162 | 621 | 849 | 4,527 | 119,650 | 26.4 |
| 2 | 2,825,125 | 6.2 | 604,924,706 | 2,332,237,632 | 2,647,503,192 | 214 | 826 | 937 | 5,742 | 156,696 | 27.3 |
| 3 | 2,622,238 | 4.8 | 464,591,893 | 1,959,555,932 | 2,414,056,268 | 177 | 747 | 921 | 5,074 | 132,859 | 26.2 |
| 4 | 1,906,258 | 5.6 | 483,766,240 | 1,772,681,409 | 3,571,487,027 | 254 | 930 | 1,874 | 7,743 | 210,687 | 27.2 |
| 5 | 4,910,381 | 3.2 | 204,550,069 | 1,686,206,333 | 2,237,002,334 | 42 | 343 | 456 | 1,856 | 46,615 | 25.1 |
| 6 | 2,455,986 | 7.6 | 618,277,662 | 2,883,199,620 | 2,782,744,206 | 252 | 1,174 | 1,133 | 7,204 | 222,132 | 30.8 |
| 7 | 1,354,585 | 7.8 | 589,487,475 | 2,768,747,027 | 2,847,644,219 | 435 | 2,044 | 2,102 | 12,688 | 416,609 | 32.9 |
| 8 | 1,490,323 | 2.3 | 73,503,516 | 662,189,200 | 2,258,358,493 | 49 | 444 | 1,515 | 3,580 | 112,538 | 31.4 |
| 9 | 5,185,494 | 2.4 | 153,235,609 | 982,133,779 | 3,012,061,060 | 30 | 189 | 581 | 1,558 | 33,315 | 21.4 |
| 10 | 2,137,731 | 3.0 | 112,736,082 | 733,128,207 | 3,031,601,960 | 53 | 343 | 1,418 | 3,300 | 102,037 | 30.9 |
| 11 | 1,926,437 | 2.9 | 109,782,001 | 664,349,864 | 2,704,321,725 | 57 | 345 | 1,404 | 3,338 | 107,243 | 32.1 |
| 12 | 12,329,091 | 4.3 | 739,972,392 | 3,571,809,824 | 2,977,980,800 | 60 | 290 | 242 | 1,699 | 25,093 | 14.8 |
| 13 | 2,507,461 | 3.1 | 165,727,547 | 932,951,847 | 3,934,583,195 | 66 | 372 | 1,569 | 3,733 | 87,561 | 23.5 |
| 14 | 12,251,532 | 3.9 | 577,105,395 | 3,109,662,167 | 2,558,249,783 | 47 | 254 | 209 | 1,411 | 22,612 | 16.0 |
| 15 | 2,606,569 | 3.1 | 176,674,723 | 949,526,917 | 3,950,751,342 | 68 | 364 | 1,516 | 3,672 | 85,334 | 23.2 |
Table 5.4: Profile of litRedundant in terms of AIT operations and memLitRedundant (mLR) speedup ($\theta_{mLR}$).

| Bench ID | litRedundant runtime (%) | Total no. of CPU cycles for litRedundant | No. of CPU cycles for mLR | mLR/litRed. ratio (%) | Avg no. of AIT-Search ops per litRed. | Avg mLR delay in HW, $d_{avg,HW,mLR}$ (ns) | Avg mLR delay in SW, $d_{avg,SW,mLR}$ (ns) | mLR speedup, $\theta_{mLR}$ |
|---|---|---|---|---|---|---|---|---|
| 1 | 8.3 | 3,778,907,548 | 1,753,267,351 | 46 | 26 | 51.3 | 535 | 10.4 |
| 2 | 8.3 | 4,470,728,686 | 2,022,648,874 | 45 | 20 | 40.3 | 494 | 12.3 |
| 3 | 8.2 | 4,437,619,125 | 1,996,921,557 | 45 | 13 | 26.4 | 370 | 14.0 |
| 4 | 10.6 | 4,327,497,959 | 1,821,286,253 | 42 | 20 | 39.2 | 627 | 16.0 |
| 5 | 4.4 | 3,292,516,111 | 1,451,409,403 | 44 | 12 | 23.3 | 321 | 13.8 |
| 6 | 10.5 | 6,143,362,826 | 2,663,669,000 | 43 | 14 | 28.0 | 343 | 12.2 |
| 7 | 13.2 | 8,895,186,811 | 4,026,418,090 | 45 | 18 | 36.7 | 596 | 16.3 |
| 8 | 2.4 | 2,117,259,672 | 966,877,350 | 46 | 15 | 30.3 | 646 | 21.3 |
| 9 | 1.6 | 1,140,595,727 | 476,526,299 | 42 | 9 | 18.3 | 216 | 11.8 |
| 10 | 4.9 | 2,444,829,148 | 1,052,190,127 | 43 | 13 | 25.9 | 662 | 25.6 |
| 11 | 4.3 | 2,917,029,749 | 1,252,656,314 | 43 | 13 | 25.0 | 642 | 25.7 |
| 12 | 4.3 | 4,582,111,140 | 1,978,245,855 | 43 | 11 | 21.8 | 120 | 5.5 |
| 13 | 4.4 | 2,084,051,508 | 897,024,507 | 43 | 14 | 28.7 | 458 | 16.0 |
| 14 | 4.2 | 4,760,948,330 | 2,051,924,375 | 43 | 11 | 21.8 | 149 | 6.8 |
| 15 | 4.6 | 2,345,686,235 | 1,015,152,365 | 43 | 15 | 30.9 | 454 | 14.7 |
Table 5.5: Runtimes for BCP, analyze, and mLR ($P_{BCP}$, $P_{analyze}$, and $P_{mLR}$, respectively), individual function speedups ($\theta_{BCP}$, $\theta_{analyze}$, and $\theta_{mLR}$, respectively), and the overall SAT-level speedup ($S_{BCP,analyze,mLR}$) of MiniSAT2+HW-SAT over MiniSAT2 [1].

| Bench ID | BCP runtime, $P_{BCP}$ (%) | analyze runtime, $P_{analyze}$ (%) | mLR runtime, $P_{mLR}$ (%) | $\theta_{BCP}$ | $\theta_{analyze}$ | $\theta_{mLR}$ | SAT-level speedup, $S_{BCP,analyze,mLR}$ |
|---|---|---|---|---|---|---|---|
| 1 | 82.4 | 5.1 | 3.9 | 173 | 26 | 10 | 10.4 |
| 2 | 82.0 | 6.2 | 3.8 | 190 | 27 | 12 | 11.1 |
| 3 | 82.9 | 4.8 | 3.7 | 179 | 26 | 14 | 10.5 |
| 4 | 80.7 | 5.6 | 4.5 | 153 | 27 | 16 | 9.8 |
| 5 | 88.2 | 3.2 | 1.9 | 214 | 25 | 14 | 13.6 |
| 6 | 78.1 | 7.6 | 4.6 | 161 | 31 | 12 | 9.3 |
| 7 | 75.6 | 7.8 | 6.0 | 169 | 33 | 16 | 8.6 |
| 8 | 91.7 | 2.3 | 1.1 | 275 | 31 | 21 | 18.7 |
| 9 | 91.0 | 2.4 | 0.7 | 182 | 21 | 12 | 15.2 |
| 10 | 87.7 | 3.0 | 2.1 | 192 | 31 | 26 | 12.8 |
| 11 | 88.6 | 2.9 | 1.9 | 209 | 32 | 26 | 13.9 |
| 12 | 85.8 | 4.3 | 1.9 | 177 | 15 | 6 | 11.0 |
| 13 | 88.3 | 3.1 | 1.9 | 137 | 23 | 16 | 13.2 |
| 14 | 86.6 | 3.9 | 1.8 | 212 | 16 | 7 | 11.6 |
| 15 | 88.2 | 3.1 | 2.0 | 142 | 23 | 15 | 13.2 |
5.1.5 Overall SAT-level speedup
In our integrated design, MiniSAT2+HW-SAT, all three functions – BCP, analyze, and mLR – are accelerated: whenever the software invokes one of the three functions, it stops running that operation in software and our HW-SAT executes the corresponding operation.

To estimate the overall SAT-level speedup, we expand Eq. (5.1) to the three functions in MiniSAT2+HW-SAT; the new runtime, $d_{new,\{BCP,analyze,mLR\}}$, is calculated as follows:

$$\begin{aligned}
d_{new,\{BCP,analyze,mLR\}} = d_{SW} &- N_{BCP} \times d_{avg,SW,BCP} - N_{analyze} \times d_{avg,SW,analyze} - N_{mLR} \times d_{avg,SW,mLR} \\
&+ N_{BCP} \times d_{HW,BCP} + N_{analyze} \times d_{avg,HW,analyze} + N_{mLR} \times d_{avg,HW,mLR}. \quad (5.9)
\end{aligned}$$
Similarly, the SAT-level speedup for individual functions, Eq. (5.3), is expanded for MiniSAT2+HW-SAT, and the overall SAT-level speedup, $S_{\{BCP,analyze,mLR\}}$, is calculated as follows:
$$S_{\{BCP,analyze,mLR\}} = \frac{d_{SW}}{d_{new,\{BCP,analyze,mLR\}}} = \frac{1}{1 - P_{BCP} - P_{analyze} - P_{mLR} + \frac{P_{BCP}}{\theta_{BCP}} + \frac{P_{analyze}}{\theta_{analyze}} + \frac{P_{mLR}}{\theta_{mLR}}}. \quad (5.10)$$
Table 5.5 shows the values of each parameter used in Eq. (5.10) and the calculated overall SAT-level speedup, $S_{\{BCP,analyze,mLR\}}$, for each benchmark instance. The SAT-level speedup of MiniSAT2+HW-SAT over MiniSAT2 [1] is in the range 8.6-18.7×, which is a significant improvement over the SAT-level speedup (4.0-11.6×) obtained when we only accelerated BCP using our HW-BCP.
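For example, the benchmark 1 row of Table 5.5 follows directly from Eq. (5.10), using the more precise per-function speedups from Tables 5.1, 5.3, and 5.4:

```python
# Benchmark 1 from Table 5.5: overall SAT-level speedup via Eq. (5.10).
p_bcp, p_analyze, p_mlr = 0.824, 0.051, 0.039
t_bcp, t_analyze, t_mlr = 173.0, 26.4, 10.4   # Tables 5.1, 5.3, and 5.4

s = 1.0 / (1.0 - p_bcp - p_analyze - p_mlr
           + p_bcp / t_bcp + p_analyze / t_analyze + p_mlr / t_mlr)
print(round(s, 1))   # 10.4, as in Table 5.5
```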
5.2 Extrapolation
Based on the 65nm HW-SAT design, we want to see how area, delay, power, and the largest SAT instance size change, especially as we move from 65nm to 7nm technology. For this 65nm-to-7nm transformation, simply applying a scaling factor would be imprecise, because 7nm technology uses FinFETs and the wire pitch/spacing rules are very different. Hence, we carry out a more careful transformation.
By maintaining the same design methodology, we estimate area and delay based on the size of library SRAM cells for 65nm and 7nm. Based on our memory cells as well as the literature [31], we assume that the CAM cell area is two times the SRAM cell area. Thus, for extrapolation, based on the SRAM cell size, we calculate the maximum number of rows in the same or a similar size of SAT Submodule, assuming the same bit-line driver strength that we had for our HW-SAT in the 65nm technology (we used up to a 16× inverter to drive the memory search- and bit-lines due to the limited width of a single memory column). Consequently, we recalculate the size of the SAT Submodule with the updated combining logic. Similar to what we already performed for the floorplan and H-tree design in Section 3.4, based on the geometry information, we recalculate the wire length of the H-tree and the other performance metrics. Since we do not have delay models of the 65nm library memory cells or the 7nm library cells, we estimate an upper bound of the total delay for both.

Figure 5.1: Area and wire delay with different instance sizes: (1) 65nm with our memory cells and (2) 65nm with library memory cells.
5.2.1 Area and scaling
As mentioned above, in our HW-SAT, the SRAM cell area is a key factor that determines the entire chip area, the H-tree data path length (shown in Fig. 2.4), and the maximum instance size loaded on the chip.
To extrapolate area as well as delay (discussed in Section 5.2.2) realistically, we start by designing our own CAM and SRAM cells (shown in Fig. 3.1). Since we use the standard PDK and follow its design rules for our memory cells, their size is quite a bit larger than industry library memory cells. Thus, we also extrapolate area and delay based on the library cells. The library SRAM cell area suggested by TSMC is 0.520µm² at 65nm [37] and 0.027µm² at 7nm [38]. Considering theoretical area scaling from 65nm to 7nm, we would expect the SRAM cell area to decrease by 86× (= (65/7)²), but in reality it is only 19.2× smaller. This is because it is difficult to make a very compact SRAM cell in recent technologies at 7nm or below, as the FinFET front-end is combined with much more stringent back-end metal routing rules. Thus, the anticipated performance enhancement from 65nm to 7nm may not be proportional to the gate length of the transistor.

Table 5.6: Extrapolation for 7nm technology; maximum numbers of clauses and corresponding estimated chip area (mm²), individual function speedups ($\theta_{BCP}$, $\theta_{analyze}$, and $\theta_{mLR}$), and SAT-level speedup ($S_{BCP,analyze,mLR}$) of MiniSAT2+HW-SAT.

| Max number of clauses | Chip area (mm²) | $\theta_{BCP}$ | $\theta_{analyze}$ | $\theta_{mLR}$ | $S_{BCP,analyze,mLR}$ |
|---|---|---|---|---|---|
| 1M | 6.7 | 245× | 35.4× | 11.8× | 12.0× |
| 2M | 13.7 | 186× | 26.3× | 9.9× | 11.8× |
| 4M | 28.3 | 172× | 23.7× | 9.3× | 11.7× |
| 8M | 58.1 | 119× | 16.2× | 7.1× | 11.1× |
| 16M | 119.4 | 103× | 14.2× | 6.4× | 10.9× |
| 32M | 245.0 | 68× | 9.1× | 4.5× | 10.0× |
| 64M | 502.5 | 58× | 7.7× | 3.9× | 9.7× |
| 128M | 1029.8 | 37× | 4.8× | 2.6× | 8.4× |
Fig. 5.1 shows how area and wire delay change with different target instance sizes for two cases: (1) 65nm with our memory cells and (2) 65nm with library memory cells [37]. The instance size (x-axis) has five cases: the numbers of clauses are $2^{16}$, $2^{17}$, $2^{18}$, $2^{19}$, and $2^{20}$, respectively. Total area increases with the instance size. In 65nm, the target instance size for a practical chip size would fall in the range $2^{16}$-$2^{20}$.
However, if we re-design the proposed HW-SAT using 7nm technology, as shown in Table 5.6, we can have an HW-SAT that is able to load the largest SAT instances, with 32M or 64M clauses, on a reasonable chip size (less than or about 5cm²). Due to the much smaller 7nm SRAM cell area, the overall chip area shrinks significantly. For 1M clauses, which is the maximum number of clauses that fits into a reasonable chip size at 65nm, the 7nm HW-SAT achieves 77.7× less chip area compared to the 65nm design with our memory cells and 19.9× less chip area compared to the 65nm design with library memory cells.
Table 5.7: Function speedups ($\theta_{BCP}$, $\theta_{analyze}$, and $\theta_{mLR}$) and SAT-level speedup ($S_{BCP,analyze,mLR}$) of MiniSAT2+HW-SAT (designed for one million clauses) compared to software MiniSAT2 [1] for different technologies: 65nm with our memory cells, 65nm with library cells, and 7nm.

| Technology | $\theta_{BCP}$ | $\theta_{analyze}$ | $\theta_{mLR}$ | SAT speedup over MiniSAT2 [1] |
|---|---|---|---|---|
| 65nm w/ our cells | 137-275× | 15-33× | 6-26× | 8.6-18.7× |
| 65nm w/ lib cells | 226-456× | 28-61× | 9-42× | 8.9-19.3× |
| 7nm | 181-366× | 19-42× | 4-20× | 8.6-18.9× |

5.2.2 Delay and scaling
By estimating the total BCP delay for our 65nm cells and, further, the wire delay for 65nm/7nm library cells, we evaluate accelerated BCP performance and assess the potential of the advanced technology. The total delay for BCP is composed of three parts: 2× the wire delay on the H-tree (shown in Fig. 2.4; 1× for data-in and 1× for data-out), the memory operation delay from the point that data arrive at the SAT Submodule to the point that a new value is written to the SRAM, and the subsequent logic circuit delay. The wire delay is the dominant part of the proposed HW-BCP. Using our memory cells, we achieve an optimal total delay of 10.1ns, in which the total wire delay, total memory operation delay, and total logic delay are 7.6ns (74.6%), 0.9ns (8.9%), and 1.7ns (16.5%), respectively. When the 65nm library memory cells are used, the operation delays (BCP, AIT-Search, AIT-Pop, AIT-Push, and CIT) become 6.1, 1.2, 0.9, 0.6, and 6.4ns, respectively. (The upper-bound total delay is estimated by adopting the same memory operation delay as our 65nm cells.)
Speedups for each function and the overall SAT-level speedup are calculated for a version of our HW-SAT design with the 65nm library memory cells (shown in Table 5.7). Due to the almost 4× smaller memory cell area of the 65nm library cells compared to our 65nm memory cells, this version achieves significantly faster BCP operations as well as other operations. However, the SAT-level speedup improves only slightly (3.2-3.5%) when the HW-SAT uses library cells instead of our 65nm cells.
Next, we carefully address the estimation of wire delay at the 7nm node. The logic circuit delay is estimated using the normalized inverter FO4 delay, which decreases by 4× at 7nm. RC (Ω·F/µm²) increases by 8-18× when technology changes from 180nm to 35nm [39]. With this tendency, we extrapolate RC for 7nm.
Table 5.8: SRAM cell area (µm²) and maximum number of clauses in a 5cm² chip for different technologies: 65nm with our memory cells, 65nm with library cells, and 7nm.

| Technology | SRAM cell area (µm²) | Max no. of clauses |
|---|---|---|
| 65nm w/ our cells | 2.160 | 1M |
| 65nm w/ lib cells | 0.520 | 4M |
| 7nm | 0.027 | 77M |
RC is expected to increase by 15× when the technology changes from 65nm to 7nm. We estimate the delay of the data bus (shown in Fig. 2.4) by assuming that it is a semi-global wire and that the RC estimate lies between the conservative and aggressive estimates. Also, the upper-bound total delay is pessimistically estimated by adopting the same memory operation delay as our 65nm cells.
Considering the above factors, as shown in Table 5.6, the speedups of each individual function ($\theta_{BCP}$, $\theta_{analyze}$, and $\theta_{mLR}$) and the overall SAT-level speedup ($S_{BCP,analyze,mLR}$) are calculated by assuming that the runtime ratios and software delays of each function are averaged. For the speedup calculation, each operation delay in software is adjusted for 7nm based on the CPU benchmark data (shown in Table 5.2).
Since general purpose processors in 7nm also have much improved performance (4× compared to 65nm), even with the considerably improved performance of the 7nm HW-SAT for one million clauses, the overall SAT-level speedup (8.6-18.9×) is similar to that at 65nm.
5.2.3 Feasibility
Since the proposed HW-SAT is a scalable ASIC design, the largest SAT instance that can be solved is limited by the chip size. We assume that 5cm² is a feasible chip size for each technology. Based on the SRAM cell size for each technology from 65nm to 7nm, we calculate the SAT Submodule and overall geometric metrics, and the maximum number of clauses that can be loaded into a 5cm² chip, as shown in Table 5.8.
Table 5.9: BCP performance comparison between MiniSAT2 [1], FPGA-BCP [2], and the proposed HW-BCP.

| | MiniSAT2 [1] 65nm | MiniSAT2 [1] 7nm | FPGA-BCP [2] 65nm | FPGA-BCP [2] 7nm | HW-BCP 65nm | HW-BCP 7nm |
|---|---|---|---|---|---|---|
| Largest instance (No. of clauses) | N/A | N/A | 64K | 2M | 1M | 32M* |
| Avg. clock cycles per BCP | 1000 | 1000 | 10 | 11+ | 1 | 1 |
| Clock period (ns) | 1.8 | 0.45 | 5 | 2.5 | 10.1 | 6.8- |
| Avg. BCP delay (ns) | 1800 | 450** | 50 | 27.5+ | 10.1 | 6.8- |

\* Largest instance size in the SAT Competition 2017 benchmark suite
\*\* Adjusted based on the CPU benchmark data shown in Table 5.2
5.3 Comparison
Table 5.9 shows that, for HW-BCP using our 65nm memory cells, the largest instance that fits on a 5cm² chip has 1M clauses, with a clock period of 10.1ns. Compared to FPGA-BCP [2], setting aside the fact that our HW-BCP can load 16× larger instances, in 65nm, HW-BCP achieves at least a 5.0× speedup for BCP operations. Also, on SAT instances with up to 1M clauses, in 65nm, our HW-BCP provides a 197.8× speedup over MiniSAT2 [1] for BCP operations.

In 7nm, we estimate that the HW-BCP can hold the largest SAT instances (32M clauses) on a 2.5cm² chip. In our HW-BCP, the clock period can be considerably reduced due to the lower wire delays in 7nm. As an upper bound, even with the pessimistic assumption that the memory operation delay is the same as that for the 65nm technology, we can achieve at least a 6.8ns clock period. FPGA-BCP [2] can also be expanded and implemented on Xilinx's largest 7nm FPGA [40] and load an instance with 2M clauses due to the 32× larger BRAM capacity of the 7nm chips. The FPGA clock period can decrease by 2× [40]. Compared to FPGA-BCP [2], our HW-BCP can load 16× larger SAT instances and achieves at least a 5.0× speedup on BCP. On the largest SAT instances, in 7nm, our HW-BCP has at least a 66.2× speedup for BCP over MiniSAT2 [1] running on a 7nm CPU.
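The headline numbers in this comparison follow directly from the average BCP delays in Table 5.9:

```python
# Headline BCP comparisons derived from Table 5.9.
print(round(50 / 10.1, 1))   # 5.0x over FPGA-BCP at 65nm (50 ns vs 10.1 ns)
print(round(450 / 6.8, 1))   # 66.2x over 7nm-CPU MiniSAT2 (450 ns vs 6.8 ns)
```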
5.4 Summary
We have designed a custom hardware accelerator for SAT (HW-SAT) which accelerates most SAT
operations. The primary function is to fully parallelize BCP operations and eliminate all von Neu-
mann overheads, including data movement. Also by storing on-chip all major large data structures
(clauses, variable values, etc.), we completely eliminate cache misses and associated performance
overheads. The secondary function is to accelerate parts of conflict analysis and conflict-clause
minimization [36], namely analyze and memLitRedundant, using the newly designed modules such
as Assignment Information Table (AIT), Clause Information Table (CIT), and their operations. In
65nm technology, HW-BCP achieves a 5.0× speedup over FPGA-BCP [2] while solving 16× larger instances than FPGA-BCP. Our HW-SAT, which is a full-custom combination of memory, logic circuitry, and interconnect design, also shows significant speedups: 137-275×, 15-33×, and 6-26× for the individual functions BCP ($\theta_{BCP}$), analyze ($\theta_{analyze}$), and mLR ($\theta_{mLR}$), respectively. We achieved an overall SAT-level speedup, $S_{BCP,analyze,mLR}$, in the range of 8.6-18.7× over general
purpose processors. We also show that on 7nm, besides performance enhancements, HW-SAT can
support the largest SAT benchmark instances in a practical chip size. We still have opportunities
to improve overall performance by expanding HW-SAT to process other operations in SAT.
Chapter 6: Contributions and Future Research
6.1 Contributions
Contributions from users' perspective: Since every problem in the very large set of NP-complete problems is reducible to 3-SAT, our HW-SAT provides opportunities for acceleration to a wide range of users interested in various NP-complete problems. Further, the CAM-, SRAM-, and logic-based architecture of our HW-SAT can be adapted to obtain extremely efficient hardware for solving many other problems whose performance is limited by index-based search operations.
Circuit design contributions: We developed a novel design methodology in terms of custom
memory design, integration of the memory array with digital combining logic, and a comprehen-
sive method for wire design and optimization that jointly optimizes for long wires, high total fanout
(a fanout of 2 at each stage), and high load capacitance at the final levels. Specifically, we achieved
significant wire delay reduction via optimal VLSI design; the delay for broadcasting to O(N) nodes
becomes O(logN).
Our HW-SAT accelerates BCP operations by a significantly higher factor compared to previous research [2, 19]. We create hardware implementations of other data structures, which map the software implementation of the analyze function of SAT to hardware. Specifically, we developed an approach for re-using hardware resources that we already have (to accelerate BCP) for accelerating conflict analysis in SAT solvers. More importantly, our HW-SAT is the first to accelerate CDCL operations and thereby provides a breakthrough increase in SAT-level acceleration.
Conceptual contributions: We designed custom hardware accelerators to maximize lookup/index-based search parallelization using CAM, which works well for memory-bound or pointer-chasing-intensive applications such as Boolean satisfiability. We eliminate all von Neumann overheads, including data movement, by directly accessing memory arrays, computing locally near the memory using dedicated computational blocks, and transferring data with maximum parallelism using dedicated local interconnects. In this manner, we enable maximal parallelization at practical area.
6.2 Future research
Architecture improvements – Pipelining: In our future work, we plan to optimize HW-SAT de-
lay by pipelining our HW-SAT architecture. Our HW-SAT performs each operation in a single
cycle. For example, BCP operation is primarily composed of broadcasting via the input H-tree,
subsequent memory and logic operations, and returning the results via the output H-tree, and these
are performed serially within the same clock cycle. Since a majority of the delay of this opera-
tion comes from H-tree input/output, we have an opportunity for improving overall performance
via pipelining. To implement a pipelined architecture, we need to insert registers into the H-tree
and design a new controller that will flush the pipeline when a conflict occurs. Once our archi-
tecture is pipelined, we can achieve much higher clock frequency, much higher throughput, and
consequently significantly higher speedups.
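A first-order estimate of the potential gain, under assumptions that are ours rather than the thesis's (balanced stages and a 0.1ns register overhead per stage):

```python
# First-order pipelining estimate (illustrative assumptions only).
TOTAL_DELAY = 10.1    # ns, single-cycle BCP delay at 65nm
REG_OVERHEAD = 0.1    # ns per stage, assumed setup + clk-to-q cost
for stages in (2, 4, 8):
    cycle = TOTAL_DELAY / stages + REG_OVERHEAD
    print(stages, "stages ->", round(cycle, 2), "ns cycle")
    # throughput improves roughly by the stage count once the pipe is full
```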
Internal Unit Propagation Forwarding: The assignment of a value to a variable often causes
multiple unit propagations at the same time. The current version of our HW-SAT reports only one
unit propagation to CPU and ignores all others. We can create a much more efficient architecture
by processing multiple unit propagations, via internal unit propagation forwarding. To implement
this, each SAT Submodule needs to be redesigned to have a local queue to store unit propagation
tasks and every branch/merge point in the H-tree, i.e., every combining logic block, needs to have
buffers and controllers to control the traffic caused by multiple unit propagations.
Custom VLSI design: There are opportunities for further performance improvement via optimal
VLSI design. Currently, processing logic and combining logic are designed using general digital
RTL-to-GDS design flow. The logic can be further optimized in terms of area and delay via full
custom design. Our memory cells are designed to satisfy all foundry design rule checks (DRCs).
Custom memory cells in libraries do not satisfy all standard DRCs, and hence have lower area,
delay, and power. We can redesign our modules by using/modifying the library cells to lower area,
delay, and power for each operation.
Chip tape-out: When we started this project, we had planned for chip tape-outs. However, due to the need to explore a very large space of design options at higher levels, we did not create a complete layout for the entire chip. We propose to develop a design flow for full custom VLSI design
for the entire chip. Once fabricated, HW-SAT chips will open various opportunities for acceler-
ation of the remaining parts of SAT algorithm, namely the heuristics, decision tree management,
and backtracking. They will also enable us to demonstrate the acceleration our design can provide
at the level of applications that use SAT, such as digital design verification and path planning in
robotics.
Clause size and design trade-off: We chose to design our HW-SAT for fixed-size clauses, where each clause has three literals. Since every SAT problem can be reduced to 3-SAT in polynomial time (see the sketch after this paragraph), we assumed that a SAT problem is transformed into the 3-SAT format before it uses our HW-SAT. We can redesign our HW-SAT to have 4 or more literals in each clause. This would
enable exploration of additional high-level design trade-offs. For example, this would reduce the
number of SAT Submodules and consequently reduce area and delay overheads of the H-tree; at the
same time, this may under-utilize the CAM array used for clause storage and potentially increase
area overhead. Many other such high-level design trade-offs need to be explored.
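For background, the standard polynomial-time clause splitting that underlies the 3-SAT assumption looks as follows (a generic sketch, not part of our tool flow; literals are signed integers and fresh auxiliary variables are drawn from a counter):

```python
# Standard k-SAT to 3-SAT clause splitting with auxiliary variables.
import itertools

def to_3sat(clause, fresh_vars):
    """Split one clause with >3 literals into equisatisfiable 3-literal
    clauses; literals are ints (negative = negated)."""
    out = []
    while len(clause) > 3:
        x = next(fresh_vars)
        out.append(clause[:2] + [x])    # (l1 v l2 v x)
        clause = [-x] + clause[2:]      # (~x v l3 v ... v lk)
    out.append(clause)
    return out

aux = itertools.count(100)              # fresh variable IDs (example)
print(to_3sat([1, -2, 3, 4, -5], aux))
# [[1, -2, 100], [-100, 3, 101], [-101, 4, -5]]
```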
SW-HW co-design: We have noted several opportunities for improving the overall performance of our HW-SAT via co-design of software and hardware in a comprehensive design space. We have already employed SW-HW co-design when distributing the original clauses to avoid duplicated variables appearing in a SAT Submodule, ensuring that we process only one unit propagation per
submodule in each cycle. Since we experimentally showed that we have sufficient SRAM write
noise-margin to allow multiple SRAM write operations within each subarray at the same time,
with some minor design updates to support handling of multiple unit propagations in each SAT
Submodule and an improved clause distribution algorithm, we expect to achieve considerable area
and delay improvements.
Bibliography
[1] N. Eén and N. Sörensson, “An Extensible SAT-solver,” in Int’l Conference on Theory and
Applications of Satisfiability Testing , May 2003, pp. 502–518.
[2] J. D. Davis, Z. Tan, F. Yu, and L. Zhang, “A Practical Reconfigurable Hardware Accelerator
for Boolean Satisfiability Solvers,” in ACM/IEEE Design Automation Conference (DAC), Jun.
2008, pp. 780–785.
[3] T. Balyo, M. J. Heule, and M. Järvisalo, “Proceedings of SAT Competition 2017: Solver and
Benchmark Descriptions,” University of Helsinki, Department of Computer Science, 2017.
[4] M. W. Moskewicz, C. F. Madigan, Y. Zhao, L. Zhang, and S. Malik, “Chaff: Engineering
an Efficient SAT Solver,” in ACM/IEEE Design Automation Conference (DAC), 2001, pp.
530–535.
[5] N. Sörensson and N. Eén, “MiniSat 2.1 and MiniSat++ 1.0 - SAT Race 2008 Editions,” The
SAT race 2008: Solver descriptions, 2008.
[6] X. Zhang, J. Wang, C. Zhu, Y. Lin, J. Xiong, W.-m. Hwu, and D. Chen, “DNNBuilder: an
Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs,”
in Proceedings of the International Conference on Computer-Aided Design. ACM, Nov.
2018.
[7] K. Gulati, S. Paul, S. P. Khatri, S. Patil, and A. Jas, “FPGA-based hardware acceleration
for Boolean satisfiability,” ACM Trans. on Design Automation of Electronic Systems, vol. 14,
no. 2, pp. 1–33, Apr. 2009.
[8] K. Gulati, M. Waghmode, S. Khatri, and W. Shi, “Efficient, scalable hardware engine for
Boolean satisfiability and unsatisfiable core extraction,” IET Computers Digital Techniques,
vol. 2, no. 3, pp. 214–229, May 2008.
[9] T. Achterberg, “SCIP: solving constraint integer programs,” Mathematical Programming
Computation, vol. 1, no. 1, pp. 1–41, Jan. 2009.
[10] C. Cadar, V. Ganesh, P. M. Pawlowski, D. L. Dill, and D. R. Engler, “EXE: Automatically
Generating Inputs of Death,” ACM Transactions on Information and System Security, vol. 12, no. 2, pp. 1–38, Dec. 2008.
[11] K. Sen, D. Marinov, and G. Agha, “CUTE: A Concolic Unit Testing Engine for C,” ACM
SIGSOFT Software Engineering Notes, vol. 30, no. 5, pp. 263–272, Sep. 2005.
[12] C. Cadar, D. Dunbar, and D. Engler, “Klee: Unassisted and automatic generation of high-
coverage tests for complex systems programs,” in Proceedings of the 8th USENIX Conference
on Operating Systems Design and Implementation, ser. OSDI’08. USENIX Association,
2008, p. 209–224.
[13] P. Godefroid, M. Y. Levin, and D. Molnar, “Automated whitebox fuzz testing,” in ACM Net-
work and Distributed System Security Symposium, Nov. 2008.
[14] S. Artzi, A. Kiezun, J. Dolby, F. Tip, D. Dig, A. Paradkar, and M. D. Ernst, “Finding bugs in
dynamic web applications,” in Proceedings of the 2008 international symposium on Software
testing and analysis - ISSTA ’08. ACM Press, 2008.
[15] M. Davis, G. Logemann, and D. Loveland, “A Machine Program for Theorem-Proving,”
Communications of the ACM, vol. 5, no. 7, pp. 394–397, Jul. 1962.
[16] B. Selman and H. Kautz, “Domain-Independent Extensions to GSAT: Solving Large Struc-
tured Satisfiability Problems,” in Proceedings of the 13th Int’l Joint Conference on Artificial
Intelligence, Aug. 1993, pp. 290–295.
[17] D. A. McAllester, “An Outlook on Truth Maintenance,” in AI Memo, MIT AI Laboratory,
Aug. 1980.
[18] J. P. Marques-Silva and K. Sakallah, “GRASP: A Search Algorithm for Propositional Satis-
fiability,” IEEE Trans. on Computers, vol. 48, pp. 506–521, May 1999.
[19] J. Thong and N. Nicolici, “FPGA Acceleration of Enhanced Boolean Constraint Propagation
for SAT Solvers,” in IEEE/ACM Int’l Conference on Computer-Aided Design (ICCAD), Nov. 2013,
pp. 234–241.
[20] M. Safar, M. W. El-Kharashi, M. Shalan, and A. Salem, “A reconfigurable, pipelined, conflict
directed jumping search SAT solver,” in IEEE/ACM Design, Automation Test in Europe, Mar.
2011, pp. 1–6.
[21] A. A. Sohanghpurwala and P. Athanas, “An Effective Probability Distribution SAT Solver on
Reconfigurable Hardware,” in Int’l Conference on ReConFigurable Computing and FPGAs
(ReConFig), Nov. 2016, pp. 1–6.
[22] X. Yin, B. Sedighi, M. Varga, M. Ercsey-Ravasz, Z. Toroczkai, and X. S. Hu, “Efficient
Analog Circuits for Boolean Satisfiability,” IEEE Trans. on Very Large Scale Integration
(VLSI) Systems, vol. 26, no. 1, pp. 155–167, Jan. 2018.
[23] Intel, “Intel Comet Lake Processors Datasheet,” in Intel Documentation and Resources,
2020, [Accessed Sep. 1, 2022]. [Online]. Available: https://www.intel.com/content/www/us/
en/products/platforms/details/comet-lake-s.html
[24] N. Nethercote and J. Seward, “Valgrind: a framework for heavyweight dynamic binary in-
strumentation,” ACM SIGPLAN Notices, vol. 42, no. 6, pp. 89–100, Jun. 2007.
[25] A. V . Nori, J. Gaur, S. Rai, S. Subramoney, and H. Wang, “Criticality aware tiered cache
hierarchy: A fundamental relook at multi-level cache hierarchies,” in 2018 ACM/IEEE 45th
Annual International Symposium on Computer Architecture (ISCA). IEEE, Jun. 2018.
[26] M. Motomura, J. Toyoura, K. Hirata, H. Ooka, H. Yamada, and T. Enomoto, “A 1.2-million
Transistor, 33-MHz, 20-b Dictionary Search Processor (DISP) ULSI with a 160-kb CAM,”
IEEE Journal of Solid-State Circuits, vol. 25, no. 5, pp. 1158–1165, 1990.
[27] T. Yamagata, M. Mihara, T. Hamamoto, Y. Murai, T. Kobayashi, M. Yamada, and H. Ozaki,
“A 288-kb Fully Parallel Content Addressable Memory Using a Stacked-Capacitor Cell
Structure,” IEEE Journal of Solid-State Circuits, vol. 27, no. 12, pp. 1927–1933, Dec. 1992.
[28] A. T. Do, S. Chen, Z.-H. Kong, and K. S. Yeo, “A low-power CAM with efficient power and
delay trade-off,” in 2011 IEEE International Symposium of Circuits and Systems (ISCAS),
May 2011.
[29] H. Yamada, M. Hirata, H. Nagai, and K. Takahashi, “A high-speed string-search engine,”
IEEE Journal of Solid-State Circuits, vol. 22, no. 5, pp. 829–834, Oct. 1987.
[30] H. Yamada, Y. Murata, T. Maeda, R. Ikeda, K. Motohashi, and K. Takahashi, “Real-time
string search engine LSI for 800-mbit/sec LANs,” in Proceedings of the IEEE 1988 Custom
Integrated Circuits Conference. IEEE, 1988.
[31] K. Pagiamtzis and A. Sheikholeslami, “Content-Addressable Memory (CAM) Circuits and
Architectures: a Tutorial and Survey,” IEEE Journal of Solid-State Circuits, vol. 41, no. 3,
pp. 712–727, Mar. 2006.
[32] N. H. E. Weste and D. M. Harris, CMOS VLSI Design: A Circuits and Systems Perspective.
4th ed. Pearson, 2010.
[33] S. Park, J.-W. Nam, and S. K. Gupta, “HW-BCP: A Custom Hardware Accelerator for SAT
Suitable for Single Chip Implementation for Large Benchmarks,” in ACM/IEEE 26th Asia
and South Pacific Design Automation Conference , Jan. 2021.
[34] C. P. Gomes, B. Selman, H. Kautz et al., “Boosting combinatorial search through randomiza-
tion,” AAAI/IAAI, vol. 98, pp. 431–437, 1998.
[35] J. P. Marques-Silva, “An overview of backtrack search satisfiability algorithms,” in Proc. 5th
Int’l Symp. on Artificial Intelligence and Mathematics, Jan. 1998.
[36] A. Van Gelder, “Improved conflict-clause minimization leads to improved propositional proof
traces,” in Theory and Applications of Satisfiability Testing - SAT 2009. Springer, Jul. 2009,
pp. 141–146.
[37] ITRS, “ITRS Report 2008,” in http://www.itrs2.net/itrs-reports.html, 2008.
[38] J. Chang et al., “A 7nm 256Mb SRAM in high-k metal-gate FinFET technology with write-
assist circuitry for low-VMIN applications,” in IEEE Int’l Solid-State Circuits Conference
(ISSCC), Feb. 2017, pp. 206–207.
[39] R. Ho, K. Mai, and M. Horowitz, “The Future of Wires,” Proceedings of the IEEE, vol. 89,
no. 4, pp. 490–504, Apr. 2001.
[40] S. Ahmad et al., “A Versatile 7nm Adaptive Compute Acceleration Platform Processor,” in
IEEE Hot Chips Symp., Aug. 2019.