Improving Reliability, Power and Performance in Hardware Transactional
Memory
by
Sang Wook Stephen Do
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)
December 2018
To Jin Do, Jong Soon Kim and Ho Joon Do,
my sisters,
also, Hwan Jin Rho and Park Jung
ACKNOWLEDGEMENTS
First and foremost, I would like to express my gratitude to my advisor, Michel Dubois, for everything.
I also want to sincerely thank Murali Annavaram for his help and advice as a professor and a friend.
I thank my qualifying and defense committee members: Barath Raghavan, Aiichiro Nakano, Sandeep
Gupta and Jeffery T. Draper.
I am grateful for all of my colleagues and friends, especially Jin Ho Suh, Daniel Wong, Mehrtash
Manoochehri, Hyeran Jeon and Lakshmi Kumar Dabbiru, a.k.a. ‘Prince’.
I really thank Diane Dmetras, Tim Boston and all the staff members of the department.
Last but not least, I am grateful for my family and my friend Hakseung Lee.
TABLE OF CONTENTS
Dedication ....................................................................................................................................... 2
Acknowledgements ......................................................................................................................... 3
List of Figures ................................................................................................................................. 8
List of Tables ................................................................................................................................ 11
Abstract ........................................................................................................................................................ 12
1. Introduction ........................................................................................................................................... 13
1.1 Transactional Memory .................................................................................................................. 13
1.2 Contributions ................................................................................................................................. 16
2. TRANSIENT ERROR DETECTION AND RECOVERY USING HARDWARE
TRANSACTIONAL MEMORY .......................................................................................................... 19
2.1 Introduction ................................................................................................................................... 19
2.2 Single Core Reliability .................................................................................................................. 21
2.2.1 Transient Error Detection and Correction ............................................................................. 21
2.2.2 Minimizing Error Detection and Correction Overheads ....................................................... 26
2.2.3 Design Overview ................................................................................................................... 27
2.2.4 Evaluation .............................................................................................................................. 29
2.3 Multi-Core Reliability ................................................................................................................... 32
2.3.1 Combining Coherence and Reliability .................................................................................. 32
2.3.2 Design Overview ................................................................................................................... 34
2.3.3 Evaluations ............................................................................................................................ 38
2.3.4 Speculative Abort .................................................................................................................. 45
2.4 Future Work .................................................................................................................................. 47
3. POWER EFFICIENT HARDWARE TRANSACTIONAL MEMORY .............................................. 48
3.1 Introduction ................................................................................................................................... 48
3.2 Dynamic Transaction Issue (DTI) ................................................................................................. 51
3.2.1 Conflict History ..................................................................................................................... 51
3.2.2 Conflict Prediction ................................................................................................................ 53
3.2.3 Wakeup from Idle State ......................................................................................................... 54
3.2.4 Transaction Flow ................................................................................................................... 55
3.2.5 Examples ............................................................................................................................... 56
3.2.6 Prediction Accuracy .............................................................................................................. 57
3.2.6.1 False negative alarm ..................................................................................................... 57
3.2.6.2 False positive alarm ...................................................................................................... 57
3.3 Micro-Architecture ........................................................................................................................ 58
3.3.1 Overall Architecture .............................................................................................................. 58
3.3.2 Conflict Prediction ................................................................................................................ 59
3.3.2.1 Conflict history vector .................................................................................................. 59
3.3.2.2 Running TxID vector ................................................................................................... 59
3.3.2.3 Prediction logic ............................................................................................................ 59
3.3.2.4 TxID generation ........................................................................................................... 60
3.3.2.5 Receiving TxID ............................................................................................................ 60
3.4 Overheads ...................................................................................................................................... 61
3.4.1 Performance Overheads ........................................................................................................ 61
3.4.2 Message Overheads ............................................................................................................... 62
3.4.2.1 Propagating memory addresses for stores .................................................................... 62
3.4.2.2 Propagating TxID at TX Begin and TX End ............................................................... 63
3.4.3 Hardware Overheads ............................................................................................................. 64
3.4.4 Power Overheads ................................................................................................................... 65
3.5 Related Work ................................................................................................................................. 65
3.6 Evaluation ..................................................................................................................................... 70
3.6.1 Experimental Setup ............................................................................................................... 70
3.6.1.1 Simulation setup ........................................................................................................... 70
3.6.1.2 Machine setup .............................................................................................................. 72
3.6.2 Results and Analyses ............................................................................................................. 73
3.7 Future work ................................................................................................................................... 84
4. Fine-Grain Transaction Scheduling ...................................................................................................... 85
4.1 Introduction ................................................................................................................................... 85
4.2 Fine Grain Transaction Conflict Prediction and Scheduling ........................................................ 88
4.3 Design Overview ........................................................................................................................... 92
4.3.1 Applicability .......................................................................................................................... 92
4.3.2 Design Space ......................................................................................................................... 92
4.3.2.1 Storage format .............................................................................................................. 92
4.3.2.2 Scope of history ............................................................................................................ 93
4.3.2.3 Scheduling .................................................................................................................... 93
4.3.3 Design Choices ...................................................................................................................... 93
4.3.3.1 Storage format .............................................................................................................. 93
4.3.3.2 Scope of history and signature update ......................................................................... 94
4.3.3.3 Scheduling .................................................................................................................... 95
4.4 Implementation Details ................................................................................................................. 97
4.5 Evaluation ..................................................................................................................................... 99
4.5.1 Simulation Environment ........................................................................................................ 99
4.5.2 Results and Analyses ........................................................................................................... 102
4.6 Future Work ................................................................................................................................ 110
5. Speculative Conflict Detection and Resolution in Hardware Transactional Memory ........................ 111
5.1 Introduction ................................................................................................................................. 111
5.2 Lazy conflict detection systems .................................................................................................. 112
5.2.1 Non-speculative conflict detection ...................................................................................... 113
5.2.2 Invalidation Without Validation (IWV) .............................................................................. 116
5.2.3 Flexibly lazy TM algorithms ............................................................................................... 118
5.3 IWV protocol ............................................................................................................................... 119
5.3.1 Overview ............................................................................................................................. 119
5.3.2 Implementation details ........................................................................................................ 121
5.3.3 Operational examples .......................................................................................................... 126
5.3.4 Comparison with ScalableBulk ........................................................................................... 130
5.3.4.1 Commit of N independent transactions/chunks ......................................................... 130
5.3.4.2 Conflict detection ....................................................................................................... 132
5.4 Evaluation ................................................................................................................................... 133
5.4.1 Methodology ....................................................................................................................... 133
5.4.2 Results ................................................................................................................................. 135
5.5 Future Work ................................................................................................................................ 138
6. Conclusion .......................................................................................................................................... 139
References ................................................................................................................................................. 140
LIST OF FIGURES
Figure 2.1 SRTM load and store data paths .................................................................................. 23
Figure 2.2 Error detection and correction with finger prints ........................................................ 25
Figure 2.3 SRTM execution flow ................................................................................................. 25
Figure 2.4 SRTM design overview ............................................................................................... 27
Figure 2.5 Performance overheads without errors ........................................................................ 29
Figure 2.6 Execution time breakdowns in Normal mode ............................................................. 30
Figure 2.7 Transitions between TX modes of execution .............................................................. 32
Figure 2.8 TX timeline with Eager detection ................................................................................ 35
Figure 2.9 TX timeline with Lazy detection ................................................................................. 35
Figure 2.10 Restoration log ........................................................................................................... 37
Figure 2.11 Log-update code ........................................................................................................ 38
Figure 2.12 Performance overheads without errors ...................................................................... 40
Figure 2.13 Execution cycle breakdown ....................................................................................... 41
Figure 2.14 Delayed aborts ........................................................................................................... 43
Figure 2.15 Speculative abort ....................................................................................................... 46
Figure 2.16 Performance overheads with speculative abort ......................................................... 47
Figure 3.1 Energy loss due to repeated aborts (Lazy conflict detection) ...................................... 48
Figure 3.2 Energy loss due to repeated aborts (Eager conflict detection) .................................... 49
Figure 3.3 Transaction flow .......................................................................................................... 55
Figure 3.4 Energy savings with DTI in the example of Figure 3.1 .............................................. 56
Figure 3.5 Logical modules in a processor node with DTI ........................................................... 58
Figure 3.6 Store-Load skew .......................................................................................................... 61
Figure 3.7 Conflict prediction module .......................................................................................... 64
Figure 3.8 Scalability with the number of threads of the base machine ....................................... 72
Figure 3.9 Dynamic power consumption ...................................................................................... 73
Figure 3.10 Overall dynamic energy consumption (transactional and non-transactional codes) . 74
Figure 3.11 Dynamic energy wasted on aborted transactions ...................................................... 75
Figure 3.12 Dynamic energy spent on all transaction executions – aborted or committed .......... 76
Figure 3.13 Average consecutive aborts ....................................................................................... 77
Figure 3.14 Execution times ......................................................................................................... 77
Figure 3.15 Committed cycles ...................................................................................................... 79
Figure 3.16 Transaction commit rates .......................................................................................... 79
Figure 3.17 Message overhead percentage ................................................................................... 80
Figure 3.18 Message overhead in data transferred ....................................................................... 81
Figure 4.1 False negative in CGPS ............................................................................................... 86
Figure 4.2 False positive in CGPS ................................................................................................ 86
Figure 4.3 Conflict prediction at memory accesses in FGPS ....................................................... 90
Figure 4.4 Signature updating events with FGPS ......................................................................... 96
Figure 4.5 System architecture ..................................................................................................... 97
Figure 4.6 FGPS microarchitecture with inputs/outputs ............................................................... 97
Figure 4.7 Data structure for write-set .......................................................................................... 98
Figure 4.8 FGPS prediction logic pseudo code ............................................................................ 98
Figure 4.9 CGPS prediction logic pseudo code (base machine) ................................................. 101
Figure 4.10 Execution time overall – normalized ....................................................................... 102
Figure 4.11 Number of transaction cycles – normalized ............................................................ 103
Figure 4.12 Aborted cycles – normalized ................................................................................... 104
Figure 4.13 Number of transaction aborts – normalized ............................................................ 105
Figure 4.14 Repeated transaction aborts due to false negatives in CGPS .................................. 107
Figure 4.15 Commit rate after transaction execution – normalized ............................................ 108
Figure 5.1 Early abort by early invalidation ............................................................................... 116
Figure 5.2 False abort ................................................................................................................. 117
Figure 5.3 IWV protocol flow .................................................................................................... 119
Figure 5.4 Commit without validation ........................................................................................ 120
Figure 5.5 Directory structure ..................................................................................................... 124
Figure 5.6 Livelock between processor nodes 1 and 2 ............................................................... 124
Figure 5.7 Message flows ........................................................................................................... 125
Figure 5.8 Parallel commits of independent transactions ........................................................... 126
Figure 5.9 Conflict detection and resolution ............................................................................... 128
Figure 5.10 Execution times ....................................................................................................... 133
Figure 5.11 Cycle time distribution for IWV .............................................................................. 134
Figure 5.12 Cycle time distribution for ScalableBulk ................................................................ 134
Figure 5.13 Transaction abort rate .............................................................................................. 135
Figure 5.14 Aborts in the validation phase ................................................................................. 136
Figure 5.15 Commit rate after requesting invalidation in IWV .................................................. 136
Figure 5.16 False abort rates in IWV .......................................................................................... 137
LIST OF TABLES
Table 2.1 Additional storage space for SRTM support ................................................................ 28
Table 2.2 Machine setting for SRTM simulation ......................................................................... 28
Table 2.3 Percentage Reduction in Number ................................................................................. 31
Table 2.4 Machine setting for MRTM simulation ........................................................................ 39
Table 2.5 SPLASH2 inputs ........................................................................................................... 39
Table 2.6 Percentage reduction in number ................................................................................... 42
Table 2.7 Number of aborts and aborted cycles – MRTM ........................................................... 43
Table 2.8 Read-set, write-set and instructions in TX .................................................................... 44
Table 2.9 Number of aborts and aborted cycles – speculative abort normalized delayed abort ... 46
Table 3.1 Repeated aborts ............................................................................................................. 50
Table 3.2 Hardware components .................................................................................................. 65
Table 3.3 Design space in hardware ............................................................................................. 69
Table 3.4 Simulation configuration .............................................................................................. 71
Table 3.5 Benchmark applications ................................................................................................ 71
Table 3.6 Message overheads with the reduction technique ......................................................... 80
Table 3.7 Ratios between stores and write-set size per transaction .............................................. 82
Table 3.8 Power, Energy and Performance with Message Reduction .......................................... 82
Table 3.9 Simulation summary with varying number of threads .................................................. 83
Table 4.1 FGPS design choices .................................................................................................... 96
Table 4.2 Processor node configuration ...................................................................................... 100
Table 4.3 Benchmark applications .............................................................................................. 101
Table 4.4 Fraction of aborted cycles in overall transaction cycles ............................................. 104
Table 4.5 Aborted cycles and transaction aborts in CGPS and FGPS – normalized .................. 106
Table 4.6 Number of transactions with consecutive aborts ........................................................ 107
Table 4.7 Number of signature operations per memory access in FGPS ................................... 109
Table 4.8 Additional messages at transaction aborts .................................................................. 109
Table 5.1 Cache state transition .................................................................................................. 120
Table 5.2 Core message types ..................................................................................................... 125
Table 5.3 Simulation configuration ............................................................................................ 133
ABSTRACT
Transactional Memory (TM) enhances the programmability as well as the performance of parallel
programs running on a multi-core or multi-processor system. To achieve this goal, TM adopts a lock-
free approach, in which mutually exclusive events are executed optimistically and corrected later if
violations of mutual exclusion are detected. As a result, TM disposes of the complexities of conventional
locking mechanisms, especially when multiple locks must be held simultaneously. Some proposals, such
as Transactional Memory Coherence and Consistency (TCC) and Bulk, extend this approach to cache
coherence and consistency with coherence protocols that rely on the optimistic execution of TM.
To realize TM’s full potential, we propose various architecture schemes in this dissertation, targeting
improvements in the reliability, power and performance of Hardware Transactional Memory (HTM)
systems. First, we introduce transaction-based reliability that protects processor cores from transient
errors. Second, we propose a Dynamic
Transaction Issue (DTI) scheme that can be easily implemented on top of existing HTM systems, saving
the power dissipation and energy consumption associated with transaction aborts. Third, we refine our
DTI scheme with a new approach based on Fine Grain Prediction and Scheduling (FGPS), improving
the prediction accuracy of the prior proactive scheduling algorithms. Lastly, we target transaction
commit, which is the common case in most benchmark applications, to improve the execution time of
successful transactions.
Chapter 1
1. INTRODUCTION
1.1 Transactional Memory
Transactional Memory (TM) [21] has been proposed to enhance the programmability and the
performance of parallel programs running on a multi-core or multi-processor system. To achieve this
goal, TM adopts a lock-free approach, in which mutually exclusive events are executed optimistically
and corrected later if violations of mutual exclusion are detected after the fact. As a result, TM disposes
of the complexities of conventional locking mechanisms, especially when multiple
locks must be held simultaneously. Some TM proposals such as Transactional Memory Coherence and
Consistency (TCC) [19] and Bulk [10] extend the idea to cache coherence and consistency and propose
cache coherence protocols relying on the optimistic approach of TM.
In any TM architecture, mutually exclusive events, critical sections, or memory writes subject to
cache coherence (as in TCC and Bulk) are wrapped in transactions, or ‘chunks’ as they are called in
Bulk, which are finite sequences of machine instructions. For example, instead of using a conventional
locking mechanism requiring lock acquires and lock releases, a programmer encloses a critical section
with special instructions specifying transaction boundaries, such as ‘transaction_begin’ and
‘transaction_end’.
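To make the contrast concrete, the following C++ sketch shows the same trivial critical section written once with a conventional lock and once with transaction boundaries. It is illustrative only: transaction_begin() and transaction_end() are hypothetical stand-ins for the boundary instructions named above, not the interface of any particular HTM, and the lock-based variant uses a generic mutex API.

    #include <mutex>

    static std::mutex counter_lock;
    static long shared_counter = 0;

    // Conventional locking: the programmer must pair every lock acquire with a release.
    void increment_with_lock() {
        counter_lock.lock();
        shared_counter += 1;          // critical section
        counter_lock.unlock();
    }

    // Hypothetical stand-ins for the transaction boundary instructions named above;
    // a real HTM would map these to dedicated hardware instructions.
    inline void transaction_begin() { /* begin speculative execution */ }
    inline void transaction_end()   { /* validate and commit */ }

    // TM version: the same critical section is simply wrapped in a transaction;
    // conflict detection, resolution and rollback are handled by the TM system.
    void increment_with_tm() {
        transaction_begin();
        shared_counter += 1;          // executed optimistically
        transaction_end();
    }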
To guarantee the correct semantics of a program, transactions go through three phases, as do
database transactions under optimistic concurrency control in a database system adopting the
“execute first and correct later” approach [25].
In the first phase, a processor core executes the machine instructions in the current transaction,
including memory reads and writes, optimistically, hoping that no other transaction writes the same
memory locations concurrently. The aggregates of the memory reads and writes in a transaction are
referred to as the read-set (R-set) and the write-set (W-set) of the transaction.
During the second phase, it is first determined whether any other transaction has accessed the same
memory location or locations during the first phase, with at least one of the transactions involved
writing a shared location, thus violating the atomicity of a critical section or cache coherence
transaction. If there is no overlap between an R-set and a W-set, or between two W-sets, of different
transactions, the transaction moves to the third phase. If there is such an overlap, or conflict, one of
the transactions involved in the conflict is selected as the winner; depending on the circumstances, the
winner could be the current transaction or any other transaction involved. All the transactions involved
in the conflict except the winning one are aborted from their current execution point and rolled back to
re-execute their instructions from the beginning.
In the third and last phase, a transaction posts its W-set to the entire system so subsequent transactions
can read the newly updated data in its W-set. Until this moment, the W-set must be kept invisible,
isolated from the rest of the system: since the current transaction may not survive conflicts, it is not safe
to expose its tentative W-set to other transactions.
These three phases do not have to be sequential and may overlap, depending on the actual
implementation. The three phases are called the execution phase, in which a transaction executes the
instructions within its boundaries optimistically; the validation phase, in which transaction conflicts are
detected and one survivor is selected; and the commit phase, in which the W-set of the survivor is
proclaimed. Similar terminology is used in some TM and database proposals such as [25]: read
(execution) phase, validation phase and write (commit) phase.
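As a concrete illustration of the validation check, the following sketch tests two transactions for a conflict, assuming their R-sets and W-sets are tracked as plain sets of memory block addresses. This is an assumption made for the example; real HTM systems typically track these sets with cache metadata or hash signatures instead.

    #include <cstdint>
    #include <unordered_set>

    using AddrSet = std::unordered_set<uint64_t>;   // set of memory block addresses

    // True if the two sets share at least one address.
    static bool intersects(const AddrSet& a, const AddrSet& b) {
        const AddrSet& smaller = (a.size() < b.size()) ? a : b;
        const AddrSet& larger  = (a.size() < b.size()) ? b : a;
        for (uint64_t addr : smaller)
            if (larger.count(addr)) return true;
        return false;
    }

    // Two transactions conflict if one has written a location the other has read
    // or written; an overlap between the two R-sets alone is harmless.
    bool transactions_conflict(const AddrSet& r_a, const AddrSet& w_a,
                               const AddrSet& r_b, const AddrSet& w_b) {
        return intersects(w_a, r_b) || intersects(w_b, r_a) || intersects(w_a, w_b);
    }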
Any TM system should provide at least the following fundamental supports, besides other
requirements, to guarantee the proper operation of TM transactions as described above: conflict
detection, conflict resolution (or concurrency control), rollback of an aborted transaction, and data
management (or version management) [33].
A conflict detection mechanism in a TM system should be able to detect overlaps, or conflicts, between
an R-set and a W-set, or between two W-sets, of simultaneous transactions. TM systems rely on
a conventional cache coherence protocol or a customized protocol to detect such conflicts. Once a
conflict is detected, the TM system should resolve it with a conflict resolution mechanism that applies
some priority scheme to select a winning transaction, which continues, and to force the other
transactions involved in the conflict to abort. For an aborted transaction, the processor should provide a
rollback mechanism that restores the system state as it was before the beginning of the transaction and
restarts the execution from there. Lastly, version management is concerned with the management of
W-sets and associated data until the commit phase, keeping them isolated from the rest of the system
and invisible to other transactions.
The implementation of the fundamental supports described above may have a big impact on the
design complexity and the resulting performance. TM proposals are broadly categorized as follows.
First, depending on the physical implementation, TM systems are classified as Software TM (STM),
including but not limited to [20, 22, 30, 43, 45], which implements the required mechanisms in software,
and Hardware TM (HTM), such as [1, 9, 10, 19, 33, 37, 39, 58], which implements those requirements in
hardware. Some TM systems adopt both software and hardware mechanisms [12, 24, 31] and are
sometimes called ‘Hybrid TM’.
TM systems are further classified, using the terminology introduced in [33], into eager or early
conflict-detection systems [1, 33, 39, 58], in which conflicts are detected as soon as possible by
exposing W-sets in the execution phase, versus lazy or late conflict-detection systems [9, 10, 19, 37],
in which conflict detection is postponed until the end of the validation phase, so that conflicts are
detected lazily using W-sets proclaimed after validation.
Also, regarding version management, some TM systems adopt ‘eager’ version management schemes
[1 (UTM), 33, 58], in contrast with ‘lazy’ version management schemes used by other TM systems such
as [1 (LTM), 9, 19, 39]. With the eager management scheme, newly updated data in a W-set eagerly
replace old data in the original memory locations, and the old data are stored in locations other than the
original ones during the execution phase. In case of an abort, the old data are restored to their original
locations. In case of a commit, the new data are validated and the old data are discarded. As a
consequence, the commit process takes less time than the abort process.
On the other hand, TM systems adopting lazy version management store new data in separate
locations while keeping old data in their original locations. At a transaction commit, after the validation
phase, the new data should be moved to the original locations, superseding the old data, or at least the
locations of the new data should be known to the entire system if the data transfer is not done right
away. At a transaction abort, the new data are just invalidated. Consequently, as opposed to eager
management, the lazy management scheme favors transaction aborts in terms of processing speed.
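The following sketch contrasts the two policies, using a map as a stand-in for memory; the structures EagerTx and LazyTx and the map-based representation are illustrative assumptions, since real HTM systems implement version management in caches, store buffers or hardware logs.

    #include <cstdint>
    #include <unordered_map>

    using Memory = std::unordered_map<uint64_t, uint64_t>;   // address -> value (stand-in for memory)

    // Eager version management: new data replace old data in place; the old data go
    // into an undo log. Commit just drops the log; abort must restore every entry.
    struct EagerTx {
        std::unordered_map<uint64_t, uint64_t> undo_log;
        void store(Memory& mem, uint64_t addr, uint64_t val) {
            if (!undo_log.count(addr)) undo_log[addr] = mem[addr];   // save the old value once
            mem[addr] = val;                                         // update the original location
        }
        void commit() { undo_log.clear(); }                          // fast commit
        void abort(Memory& mem) {                                    // slower abort
            for (auto& kv : undo_log) mem[kv.first] = kv.second;     // restore old data
            undo_log.clear();
        }
    };

    // Lazy version management: new data stay in a write buffer; old data stay in place.
    // Abort just drops the buffer; commit must publish every buffered entry.
    struct LazyTx {
        std::unordered_map<uint64_t, uint64_t> write_buffer;
        void store(uint64_t addr, uint64_t val) { write_buffer[addr] = val; }
        void commit(Memory& mem) {                                   // slower commit
            for (auto& kv : write_buffer) mem[kv.first] = kv.second; // publish new data
            write_buffer.clear();
        }
        void abort() { write_buffer.clear(); }                       // fast abort
    };

The asymmetry described above is visible directly in the sketch: the eager scheme's commit is trivial while its abort walks the undo log, whereas the lazy scheme's abort is trivial while its commit must publish the write buffer.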
Some TM proposals [28, 46, 49, 50] take flexible approaches between the eager and the lazy policies
to achieve better performance.
1.2 Contributions
Transactional Memory has long been a popular topic in academia due to its novel approach to thread
synchronization in the shared memory model. Industry has started supporting Hardware Transactional
Memory (HTM) in commercial products [23, 38, 54, 61].
In this dissertation, we introduce new ways to improve the reliability, power and performance of
HTM with the following contributions.
In Chapter 2, we propose a novel architecture to provide protection against transient faults for
processor cores and make the following contributions:
• We propose a novel microarchitecture for transient error detection and recovery in processor
cores based on time-redundancy Backward Error Recovery (BER), leveraging HTM’s existing features at minimal hardware cost.
• We provide implementation details for single core and multi-core reliability – SRTM (Single-
core Reliability on HTM) and MRTM (Multi-core Reliability on HTM).
• We evaluate the performance overheads of both SRTM and MRTM by comparing them to the
base machine without the error detection and recovery features.
In Chapter 3, we propose a dynamic transaction issue scheme that reduces power dissipation and
energy consumption of a base HTM machine with the following contributions:
• We propose a simple hardware scheme saving power and energy on existing HTM systems. By
targeting a more specific problem than prior scheduling algorithms, we achieve this goal with low
hardware overhead, low implementation complexity and a small performance penalty.
• We provide the implementation details of our proposed scheme.
• We evaluate our scheme on a cycle-accurate simulator by comparing it with various alternative
hardware mechanisms from a power and energy perspective.
In Chapter 4, we further improve the accuracy of transaction abort prediction, which is essential to any
proactive scheduling algorithm, including the one proposed in Chapter 3 of this dissertation, by
proposing a new approach with the following contributions to the state of the art:
• We introduce a novel hardware prediction and scheduling algorithm based on Fine Grain
transaction Prediction and Scheduling (FGPS) in Hardware Transactional Memory.
• We provide the implementation details of FGPS in hardware.
• We evaluate FGPS on a cycle-accurate simulator, comparing it with one of the most effective
CGPS (Coarse Grain Prediction and Scheduling) designs.
In Chapter 5, we propose a way to improve the execution time of HTM systems relying on lazy conflict
detection by improving the transaction commit process, which is on the critical path of transaction
execution in the common case. We make the following contributions:
• We introduce a new cache invalidation algorithm based on a fully optimistic approach in which
cache invalidations are sent en masse without validation and transactional conflict detection is
embedded in the cache protocol.
• We describe in some detail an invalidation-based cache protocol based on the new optimistic
approach, which we have implemented on top of the SESC/SuperTrans simulator with some
modifications.
• We compare the performance of the new protocol by simulation against ScalableBulk, arguably
one of the most aggressive IAV protocols to date.
Chapter 2
2. TRANSIENT ERROR DETECTION AND RECOVERY USING
HARDWARE TRANSACTIONAL MEMORY
2.1 Introduction
Modern microprocessor designs are becoming more vulnerable to transient faults and soft errors [5]
than ever before due to design trends mandating low supply voltage and reduced noise margins,
shrinking feature sizes and increased transistor density for fast, low power circuits. Detecting and
correcting such errors has become an important design goal. Transient errors in random logic in a
processor core can be detected by executing the same instructions on the same hardware but at different
times – time redundancy [35, 40, 53], or on redundant hardware but at the same time – space
redundancy [18, 47, 57], and by comparing the outputs of the different instances of the instruction
execution [42]. Detected error(s) can be corrected by either rolling back to a checkpoint to re-execute
instructions from the checkpoint in the hope that the transient error(s) will have vanished – Backward Error
Recovery (BER), or by comparing results from redundant hardware and deciding based on a majority
vote in the hope that no errors have occurred in the majority – Forward Error Recovery (FER) such as
Triple Modular Redundancy (TMR) [29].
Time-redundancy BER provides cost-effective solutions for systems that cannot afford high hardware
cost/complexity for the sake of reliability. However, it confronts the following major implementation
challenges, especially in the era of multicore or multiprocessor machines. First, inputs to the executions
of two instances of an instruction stream must be guaranteed to be identical to avoid false-positive error
detection, which would trigger a costly error recovery routine. For example, a value loaded from a memory
location must be the same in the first instance and in the second instance of the same load instruction –
input replication. Input replication is not trivial given the possibility that the memory location may be
updated or invalidated by another thread running on another processor core. Second, an error must be
confined within the core that caused it – error confinement. Otherwise, if errors are propagated
through shared memory to other cores, the cost of recovery is huge. Third, re-execution overheads for
error detection and correction should be minimized: naively running back-to-back instances
of the same instruction stream on the same core incurs a 2× penalty or even more given the switching
overheads.
Industry now supports HTM in commercial products [38, 54, 61]. We leverage features of HTM to
provide processor cores with transient error detection and recovery at minimal hardware cost. First, the
abort/rollback mechanism is used for transient error detection and correction. Transactions are executed
twice in a row using the abort/rollback mechanism for error detection. A transaction with a detected
error is executed a third time for error correction. Second, the isolation mechanism is used to confine an
error to a processor core, preventing it from propagating to the memory system and hence to remote
processor cores. Third, together with the isolation mechanism, the conflict detection and resolution
mechanisms are used for input replication. By delaying the commit of a transaction that has successfully
validated its read- and write-sets, input replication is guaranteed in the following executions of the same
transaction for error detection and correction, because no other transaction can modify the memory
locations belonging to the committing transaction’s read- and write-sets. Inputs to the second and
third executions for error detection and correction can also be replicated in a write buffer.
In summary, we introduce transaction-based core reliability by leveraging existing features of HTM
systems to protect cores from transient errors efficiently. We make the following contributions:
• We propose a novel microarchitecture for transient error detection and recovery in
processor cores based on time-redundancy BER, leveraging existing HTM features at minimal
hardware cost.
• We provide implementation details for single core and multi-core reliability – SRTM
(Single-core Reliability on HTM) and MRTM (Multi-core Reliability on HTM).
• We evaluate the performance overheads of both SRTM and MRTM by comparing them
to the base machine without the error detection and recovery features.
The rest of the chapter is organized as follows. Section 2.2 presents the implementation details of
SRTM with evaluation results and analyses. Section 2.3 presents extensions to support MRTM. In
addition to base results, the section also proposes a speculative version of MRTM to further reduce the
execution time for error detection and correction. Section 2.4 closes the chapter with comments on future
work.
2.2 Single Core Reliability
2.2.1 Transient Error Detection and Correction
SRTM detects transient errors in the core of a processor by executing the same transaction twice on
the core, and by comparing the outputs of the two consecutive executions, namely register writes and
memory stores. Inputs to the core are memory loads. For differentiation, we call the first execution
Normal mode execution (in short Normal mode), and the second execution Error-Detection mode
execution (in short ED mode). A mismatch between the results of the two executions indicates that an
error is detected and error correction ensues. Otherwise, if the outputs match, the transaction commits its
stores to memory and moves to the next transaction. Comparing end results saves much comparison
overhead because registers and memory locations are overwritten, and many transient-error effects
are masked. All structures outside the core, including caches, main memory and the interconnection network,
are protected by other means such as Error-Correcting Codes (ECCs).
No transaction conflict detection and resolution is necessary for applications running on a single core
as only one transaction is active at a time. In the case of cores with fine-grain multithreading, such as
simultaneous multi-threading (SMT), transactions can be handled in the same way as in MRTM,
discussed in the following section, given that existing HTM support can detect conflicts among
transactions running simultaneously on the same core. We do not discuss error detection and correction
for fine-grain multithreading separately in this thesis.
To protect the entire execution, every instruction should be included in a transaction, as in continuous
Transactional Memory proposed in TCC [19], Bulk [10, 11] and Multicheckpointing processors [16, 51,
52]. The hardware forms transactions dynamically, transparently to the software stack, on various events
such as hardware buffer overflows. We call such an automatic transaction an implicit transaction, as in
[16, 51, 52], in contrast to explicit transactions, for which the programmer specifies the transaction boundaries in
the code. In the following, transactions are assumed to be implicit transactions, unless stated otherwise.
We choose a store buffer overflow to be the main cause of transaction formation as in [19], although
the overflow of any hardware resource could also be a trigger. Starting in Normal mode, when the buffer
is full, the current transaction is aborted and restarted in ED mode. At the end of ED mode, the register
values and the store values of the two execution modes are compared. The Transaction Write Buffer
(TxWB) can be implemented as a conventional store buffer, capable of forwarding the latest values to loads.
TxWB holds memory store values temporarily and isolates these values until the commit of the current
transaction, as is done in TCC [19]. External events, such as software or hardware exceptions and
thread switches, also cause early termination of a transaction.
During Normal mode, stores write only to TxWB but never to the cache and memory to enforce input
replication as well as error confinement. Local memory loads access TxWB, which is relatively small in
size (such as thirty-two entries in the evaluation section of this chapter), and the first-level cache in
parallel, hence causing little performance overhead. TxWB hits always take priority over cache hits
because they provide the latest values. At the beginning of ED mode, the valid bits of the TxWB entries
are cleared in order to guarantee input replication, providing the same inputs from cache and memory
for load instructions during both Normal and ED modes. In ED mode, TxWB is used in the same way as
in Normal mode. At the end of ED mode, if no error is detected, the blocks in TxWB become non-
speculative and error-free, and a new transaction is started in Normal mode after flushing the blocks in
TxWB to the cache. Figure 2.1 shows data paths for memory load and store.
It should be noted that the flushing of TxWB could be done one block at a time, in the background,
whenever a new store is actually executed to avoid waiting for the flush of the entire buffer at once. For
example, in Normal mode if a load_X hits in TxWB, then X is loaded into a register. If a store_X hits in
TxWB for the first time, the old value of X is flushed to the cache. Subsequent store_X hits just
overwrite new values in TxWB. If a store_X misses in TxWB, a value in a TxWB entry, which has not
yet been allocated for a store, is flushed to the cache and overwritten with the value of X. In addition to
a valid bit, a dirty bit is used to indicate whether the corresponding entry has been written during a new
execution. A transaction is formed when TxWB is full of store values, whether the entries are
flushed all together or one at a time.
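The following sketch captures this block-at-a-time flush policy for stores, assuming a small fully associative TxWB whose entries carry the valid and dirty bits just described; the entry layout and the helper name flush_to_cache are illustrative assumptions, not details of the actual design.

    #include <cstdint>
    #include <vector>

    struct TxWBEntry {
        uint64_t addr = 0, value = 0;
        bool valid = false;   // entry holds a value (possibly committed by the previous transaction)
        bool dirty = false;   // entry was written by the current transaction
    };

    // Stand-in for writing a committed, error-free block back to the cache.
    void flush_to_cache(uint64_t /*addr*/, uint64_t /*value*/) {}

    // Handle one store with block-at-a-time background flushing. Returns true if the
    // store was absorbed, false if TxWB is already full of dirty entries, i.e., the
    // event that triggers transaction formation.
    bool txwb_store(std::vector<TxWBEntry>& txwb, uint64_t addr, uint64_t value) {
        for (auto& e : txwb) {                                 // hit in TxWB?
            if (e.valid && e.addr == addr) {
                if (!e.dirty) flush_to_cache(e.addr, e.value); // first hit: flush the old committed value
                e.value = value;                               // subsequent hits simply overwrite
                e.dirty = true;
                return true;
            }
        }
        for (auto& e : txwb) {                                 // miss: claim an entry not yet allocated to a store
            if (!e.dirty) {
                if (e.valid) flush_to_cache(e.addr, e.value);  // evict its leftover committed value
                e = {addr, value, true, true};
                return true;
            }
        }
        return false;                                          // TxWB full: form a transaction
    }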
To compare the results of the Normal mode and ED mode executions, we adopt
fingerprinting as proposed in [48], where register and memory updates committed from the reorder
buffer (ROB) in program order are hashed into a Cyclic Redundancy Check (CRC) code. Fingerprints
are very compact, hence reducing the comparison overheads further. The probability of error detection is
very high, e.g. 0.999985 with a 16-bit CRC and 0.99999999976 with a 32-bit CRC [48, 55].
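As an illustration of how such a fingerprint can be accumulated, the sketch below folds each retired architectural update into a 16-bit CRC. The CRC-16/CCITT polynomial 0x1021 is an assumption made for the example rather than the polynomial used in [48]; the quoted detection probability of 0.999985 corresponds to an aliasing probability of roughly 2^-16 for a 16-bit code.

    #include <cstdint>

    // Bit-by-bit CRC-16 update (MSB first), one input byte at a time.
    static uint16_t crc16_update(uint16_t crc, uint8_t byte) {
        crc ^= static_cast<uint16_t>(byte) << 8;
        for (int bit = 0; bit < 8; ++bit)
            crc = (crc & 0x8000) ? static_cast<uint16_t>((crc << 1) ^ 0x1021)
                                 : static_cast<uint16_t>(crc << 1);
        return crc;
    }

    // Fold one retired architectural update (e.g., a destination register value or a
    // store address/value pair packed into 64 bits) into the running fingerprint of
    // the current execution mode.
    uint16_t fingerprint_update(uint16_t fingerprint, uint64_t retired_value) {
        for (int i = 0; i < 8; ++i)
            fingerprint = crc16_update(fingerprint, static_cast<uint8_t>(retired_value >> (8 * i)));
        return fingerprint;
    }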
Alternatively, one could compare the final images of the register files and TxWBs of both modes of
execution at the end of ED mode. This would result in 100% detection probability. Some transient errors
are also masked in the final images as error-free values to the same destination are overwritten.
However, storing and comparing extra copies of the raw data incurs more area and power overheads
than fingerprinting because we need to store up to three versions of the register files and TxWB, one for
each mode of execution. Also, these additional hardware structures need protection such as ECC, as
used in the cache memory.
SRTM corrects transient errors by executing a transaction a third time in a row in ‘Error-Correction’
mode (in short, EC mode) if the fingerprints in Normal mode and in ED mode do not match. At the end
of EC mode, if the fingerprint matches either one of the prior executions in Normal or ED mode, then
the detected error(s) are corrected and the core moves to the next transaction by creating a new
checkpoint and flushing the blocks in TxWB to the cache. Otherwise, the core triggers an exception, whose
handler deploys other measures to correct the uncorrected error(s), such as executing the transaction a
fourth time on the same core and/or on another core if any core is available. Regarding the hardware
overheads for error correction, we simply need another CRC register to buffer the fingerprint in EC mode. Figure
2.2 illustrates how fingerprints are used for transient error detection and correction. The results in each
execution mode are hashed into a fingerprint register as instructions are retired from the Re-Order
Buffer (ROB) in program order. Addresses and values of store instructions are kept in the Load/Store
Queue (LSQ) in each mode until the retire stage and merged on the ROB bus. The two-bit mode register
indicates the current mode of execution: 00 – Normal mode; 01 – ED mode; 10 – EC mode; 11 –
Exception mode. During Normal execution, the results from the LSQ and ROB are combined into the
Fingerprint-Normal register. During execution in ED mode, the results are combined into the
Fingerprint-ED register. These two Fingerprints are compared to detect transient error(s) using the mode
register and the 3-2 multiplexer in front of the comparator in Figure 2.2. Figure 2.3 shows the execution
flow of an application run on SRTM.
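The overall flow can be summarized by the following sketch, in which the helper functions are stand-ins for hardware actions (rollback and re-execution, TxWB flush and checkpointing, exception delivery); all names are illustrative, and the mode encoding follows the two-bit mode register described above.

    #include <cstdint>

    enum class Mode : uint8_t { Normal = 0, ED = 1, EC = 2, Exception = 3 };

    // Stand-ins for hardware behavior, so that the sketch is self-contained.
    uint32_t run_transaction(Mode /*mode*/) { return 0; } // roll back to the checkpoint, re-execute, return the fingerprint
    void commit_txwb_and_checkpoint() {}                  // flush TxWB blocks to the cache, take a new checkpoint
    void raise_uncorrected_error() {}                     // trigger the exception handler (e.g., a fourth run or another core)

    // One pass through the Normal / ED / (EC) flow of Figure 2.3.
    void srtm_execute_one_transaction() {
        uint32_t fp_normal = run_transaction(Mode::Normal);
        uint32_t fp_ed     = run_transaction(Mode::ED);   // automatic abort and re-execution
        if (fp_ed == fp_normal) {                         // fingerprints match: no error detected (common case)
            commit_txwb_and_checkpoint();
            return;
        }
        uint32_t fp_ec = run_transaction(Mode::EC);       // third execution for error correction
        if (fp_ec == fp_normal || fp_ec == fp_ed)         // two of the three executions agree: error corrected
            commit_txwb_and_checkpoint();
        else
            raise_uncorrected_error();                    // no agreement: escalate
    }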
2.2.2 Minimizing Error Detection and Correction Overheads
Because the additional hardware cost of SRTM is low and can hardly be lowered further, we focus on
reducing the performance overheads of error detection and correction, or more specifically, the
execution times in ED and EC modes.
The execution times in both ED and EC modes are accelerated as compared to Normal mode because
the number of cache misses in the instruction and data caches is reduced due to memory prefetching
effects obtained by executing the same set of instructions twice in a row. Some cache misses are still
unavoidable because the blocks read in Normal mode may be replaced by other memory blocks and evicted
from the cache due to the limited space in cache sets. To alleviate this problem, we add small, fully-
associative victim caches in the memory hierarchy, accessed in parallel with a cache, to hold evicted
blocks in Normal mode. The victim caches benefit the Normal, ED and EC modes of execution. No
special memory/cache prefetching technique other than the victim caches is used in this thesis.
We further reduce the execution time of the redundant executions by using branch outcomes recorded
during an execution, hence eliminating most, if not all, branch misprediction penalties in a following
execution, whether it is ED mode or EC mode. One of the simplest ways to implement this branch
“recollector” is to use a shift register made of flip-flops working as a First-In-First-Out (FIFO) buffer
with a counter register keeping track of the number of valid entries in the buffer. The branch recollector
records the outcome of each branch as it retires from the ROB until the recollector buffer is full, or until
every branch outcome is recorded, whichever comes first. Branch recollection works like a very accurate
branch predictor. The branch recollector also works in conjunction with any branch predictor. A
conventional branch predictor can be used when there is no valid entry in the recollector buffer.
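A minimal sketch of such a recollector is given below, modeled as a circular buffer of single-bit outcomes with a counter of valid entries rather than a literal flip-flop shift register; the 256-entry size is an arbitrary assumption for illustration.

    #include <cstddef>

    constexpr std::size_t kRecollectorEntries = 256;    // buffer size (assumed for illustration)

    struct BranchRecollector {
        bool outcomes[kRecollectorEntries] = {};        // recorded taken/not-taken bits
        std::size_t head = 0, tail = 0, count = 0;      // counter of valid entries

        // Normal mode: record each branch outcome as the branch retires from the ROB,
        // until the buffer is full or every branch outcome has been recorded.
        void record(bool taken) {
            if (count == kRecollectorEntries) return;   // buffer full: stop recording
            outcomes[tail] = taken;
            tail = (tail + 1) % kRecollectorEntries;
            ++count;
        }

        // ED/EC mode: replay the recorded outcomes in order; once the buffer is
        // drained, fall back to the conventional branch predictor's prediction.
        bool predict(bool conventional_prediction) {
            if (count == 0) return conventional_prediction;
            bool taken = outcomes[head];
            head = (head + 1) % kRecollectorEntries;
            --count;
            return taken;
        }
    };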
To further reduce the execution time in ED mode, other architectural information could be gathered
during Normal mode to help the following execution in ED mode. For example, a fast FIFO queue could
be added to record not only branch directions but also branch target addresses, or even to
record the entire instruction stream in a transaction, eliminating all instruction cache misses. Also, the
simple victim caches can be refined to further reduce the number of cache misses for both instruction
and data. We did not explore these approaches.
2.2.3 Design Overview
Figure 2.4 shows a block diagram of a core and its peripherals to support SRTM. New component
blocks added to the base machine are shaded. The base HTM module is responsible for generating a
checkpoint and rolling back the core to the checkpoint as requested by the Error Detection and
Correction (EDC) module. The conflict detection and resolution mechanism supported by the base HTM
module is not active in SRTM mode.
TxWB generates an interrupt when it is full of valid entries to let the EDC module know that it is time
to generate a transaction and change the execution mode. Events other than the TxWB overflow may
trigger the TX generation interrupt. For example, in the MRTM implementation in the next section, we
limit the number of instructions in a transaction to avoid incurring too many conflicts between
concurrent transactions because, in general, the longer a transaction is, the more conflicts occur.
The EDC module can be turned off easily for fast execution of an application for which reliability is
not critical. For example, ignoring the TxWB overflow and turning off the interrupt from TxWB in the
SRTM module can make the entire execution run in Normal mode with no implicit transactions. Even
during execution of an application, the EDC module can be turned on and off based on the
characteristics of code sections.
Table 2.1 lists additional storage space needed for the SRTM support with 4-byte addresses and 32-
byte cache blocks.
2.2.4 Evaluation
In this section, we evaluate the performance overheads of SRTM compared to a base machine. The
performance overhead is measured as the additional execution time in ED mode with no errors detected,
which is the common case. We model our base machine and the SRTM support in the framework of
Intel’s Pin tool-set [27] by adding a memory hierarchy and other necessary components shown in Figure
2.4. Table 2.2 shows the machine setting in the simulation.
The benchmarks are the SPEC CPU 2006 [60] integer and floating-point applications. Of the 29
benchmarks, we could not get results for 400.perlbench, 435.gromacs and 482.sphinx3 due to errors on
the host machine.
Figure 2.5 shows the execution times of the SPEC applications including the OS execution, for both
the base machine and the SRTM machine without any detected errors. Execution times are normalized
to the base machine to show the performance overheads of SRTM. The numbers on the X-axis are the
application numbers in the benchmark suite, e.g. 401.bzip2. All the application names of the entire suite
can be found in Table 2.3 below. Integer applications are on the left side of 483 and the Floating-Point
(FP) applications are on the right side of 483. ‘AVG’ is the average of all. From the results, the re-
execution overhead ranges from 15.8% (of the base machine execution time) with 401.bzip2 to 82.5%
with 416.gamess. The overall average, integer average and FP average are 42.5%, 40.8% and 43.8%
respectively.
Figure 2.6 shows the breakdowns of the execution times during Normal mode to give insights into the
results of Figure 2.5. Table 2.3 lists the percentage reductions of event numbers from Figure 2.6.
Regarding cache misses, in many applications L2 miss penalties are dominant in Normal mode and are
reduced effectively during ED mode. As a result, the performance overheads are also reduced
effectively. For example, in 401.bzip2, where the L2 penalty is responsible for more than 80% of the
execution time, 94.1% of L2 misses are removed so that the performance overhead is less than 25% as
shown in Figure 2.5. Overall, L2 misses are removed in ED mode most by 99.7% in 473.astar, least by
44.5% in 462.libquantum, and on average by 87.2% for all the applications. 403.gcc, 429.mcf,
445.gobmk, 453.povray and 465.tonto all show non-negligible L1-I miss penalties and the memory
prefetching effect reduces these penalties by 81.1%, 82.1%, 82.1%, 80.8% and 73.7% respectively. L1-
D penalties are reduced most by 99.7% in 429.mcf, least by 22.2% in 447.dealII, and on average by
74.0%.
The branch recollector is effective in many applications as shown in Table 2.3. The reductions in
misprediction rates range from 100% in 429.mcf to 28.6% in 454.calculix, with an average of 72.9%.
Especially in 403.gcc, 445.gobmk, 458.sjeng and 453.povray, applications with relatively high
percentages of branch mispredictions, the branch recollector is effective, yielding 99.0%, 78.9%, 85.1%
and 84.5% misprediction reductions, respectively. We expect that the bigger the branch recollector, the
more effective it will be.
In applications such as 416.gamess and 447.dealII where the computation time is dominant, the
performance overheads are higher than the average in Figure 2.5 because the current implementation of
SRTM targets cache misses and branch mispredictions only.
2.3 Multi-Core Reliability
In this section, we extend the SRTM support of the previous section to protect multiple cores from
transient errors in a multi-core system running multi-threaded applications, with little additional
hardware. Unlike in the SRTM case, transactions (TXs) are now active concurrently on multiple
cores, conflicting with each other on shared memory space and necessitating mechanisms for transaction
conflict detection and resolution. For this purpose, the base HTM module in Figure 2.4 is fully functional
for handling TX conflicts. Our goal is to smoothly combine the execution flow of SRTM with TX aborts
due to conflicts.
2.3.1 Combining Coherence and Reliability
MRTM is different from SRTM in that a transaction is now a unit of coherence as well as reliability.
The base HTM module enforces transaction coherence by detecting and resolving conflicts among
concurrent transactions. The EDC module adds transient-error resiliency to each core. We do not add
a new architectural component to SRTM but we make the two modules collaborate effectively.
For both coherence and reliability, every core in the system executes transactions in isolation, reading
and writing values on the memory speculatively. There are two types of abort that force a rollback of a
transaction. First, the EDC module triggers an abort in every transaction automatically for error
detection and possibly for error correction if necessary. We call this type of abort automatic abort.
Second, when a transaction conflict is detected and the current transaction did not win the conflict, the
base HTM module aborts the transaction. We call this type of abort a conflict abort. Automatic aborts are
triggered only at transaction boundaries, while conflict aborts can occur at any time during execution – more
specifically, at any time before a transaction is guaranteed to win any possible conflict after passing the
validation phase successfully. Automatic aborts always restart a transaction in a different execution
mode from the previous execution, while conflict aborts always restart a transaction in Normal
mode. Figure 2.7 shows the transitions between execution states on transaction aborts.
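As a minimal sketch of these transitions (C++ with illustrative names, not the actual control logic of the EDC or base HTM modules), the state changes of Figure 2.7 can be summarized as follows:

    #include <cstdio>

    // Execution modes of a core and the two abort types described above
    // (illustrative names only).
    enum class Mode { Normal, ED, EC };
    enum class Abort { Automatic, Conflict };

    // Mode in which the transaction is (re-)executed after an abort:
    // automatic aborts advance Normal -> ED -> EC (re-execution for error
    // detection, then error correction), while a conflict abort always
    // restarts the transaction in Normal mode.
    Mode next_mode(Mode current, Abort why) {
        if (why == Abort::Conflict) return Mode::Normal;
        switch (current) {
            case Mode::Normal: return Mode::ED;     // re-execute to detect errors
            case Mode::ED:     return Mode::EC;     // fingerprints mismatched: try to correct
            case Mode::EC:     return Mode::Normal; // committed, or retried afresh
        }
        return Mode::Normal;
    }

    int main() {
        Mode m = Mode::Normal;
        m = next_mode(m, Abort::Automatic);  // Normal -> ED
        m = next_mode(m, Abort::Conflict);   // conflict abort: back to Normal
        std::printf("restart mode: %d\n", static_cast<int>(m));
        return 0;
    }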
Transactions commit their write-sets, making them accessible to the entire system, when two conditions
are satisfied: the error-free and the conflict-free conditions. For the error-free condition, the execution
of a transaction must be free from transient errors: either no error is detected, or an error is detected and
corrected. That is, the fingerprints of the two consecutive executions in Normal mode and ED mode
match, or the fingerprint of the third execution in EC mode matches either of the fingerprints of the
Normal or ED mode. For the conflict-free condition, a transaction must pass the validation phase
successfully – either no TX conflict is detected or the highest priority to commit is acquired. In other
words, a transaction must be guaranteed to commit.
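A minimal sketch of the resulting commit decision, assuming a hypothetical fingerprint type and helper names rather than the actual EDC interface:

    #include <cstdint>
    #include <optional>

    using Fingerprint = std::uint64_t;  // compressed execution signature (illustrative)

    struct TxFingerprints {
        Fingerprint normal;                 // from the Normal-mode execution
        Fingerprint ed;                     // from the ED-mode re-execution
        std::optional<Fingerprint> ec;      // from the EC-mode execution, if it ran
    };

    // Error-free condition: the Normal and ED fingerprints match, or the EC
    // fingerprint matches either of them.
    bool error_free(const TxFingerprints& f) {
        if (f.normal == f.ed) return true;
        return f.ec && (*f.ec == f.normal || *f.ec == f.ed);
    }

    // A transaction may commit its write-set only if it is error free and has
    // passed validation (no conflict detected, or highest commit priority acquired).
    bool may_commit(const TxFingerprints& f, bool passed_validation) {
        return error_free(f) && passed_validation;
    }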
Input replication between the modes of execution (Normal, ED and EC) is guaranteed collaboratively
by the SRTM module, which isolates store values until the commit time, and the base HTM module,
which recognizes and handles a violation of input replication and changes in the input set of memory
loads via its conflict detection and resolution mechanism. From the base HTM module’s view,
transactions only become longer than the original ones due to the automatic aborts caused by the EDC
module. The base HTM module detects and resolves TX conflicts between these longer transactions for
both coherence and input replication.
2.3.2 Design Overview
There are several options to implement MRTM. One of the most important factors is the underlying
conflict detection and resolution mechanism because this mechanism directly affects the implementation
of the base HTM module, which in turn affects the MRTM implementation leveraging the base HTM
support. The conflict detection and resolution mechanism is broadly represented by Eager and Lazy
conflict detection and resolution mechanisms, Eager detection and Lazy detection in short.
First, with Eager detection, TX conflicts are detected and resolved during the execution of a
transaction. If the transaction finishes execution without an abort, it is considered to have passed the
validation phase successfully and to be ready to commit its write-set. At that moment, we can simply abort the
transaction and rerun it in ED mode for error detection. In the following execution, if the transaction
reaches the end point safely and the two fingerprints of the consecutive executions match, the
transaction commits its write-set and a new transaction is started. In case of no match, the transaction is
executed a third time in the same way as previously, letting the base TM module handle TX conflicts.
Figure 2.8 illustrates the timeline of a transaction with Eager conflict detection and resolution. From the
base TM module’s view, the validation phase becomes longer than the validation without the MRTM
support, composed of Validation1 and Validation2, plus an optional Validation3. A conflict abort may occur during
any validation phase making the execution revert all the way back to the beginning of Normal mode as
in Figure 2.7. Alternatively, we can make a transaction never abort after Validation 1 to reduce the
penalty of conflict abort, which increases as an abort occurs later in the timeline in Figure 2.8. For
example, a priority tag can be attached to memory accesses from a transaction that has passed
Validation1, conferring a higher priority in conflict resolution.
In contrast to Eager detection, the validation phase in Lazy detection is initiated right after an
execution and is therefore separate from the execution. The timeline is illustrated in Figure 2.9.
Validation can be overlapped with the execution in ED mode because the validation activities occur
outside the core, generally requiring the exchange of messages on the interconnection network. Transactions
can commit at the end of ED mode if no error is detected, or at the end of EC mode if an error is
detected and corrected.
The duration of a validation phase depends on the mechanism used to validate. Validations can be
short, by forcing transactions to commit serially with the help of a commit token circulated among cores.
Validation can also be done for each memory store in the write-set by reserving a single access point
such as a directory entry. This approach generally takes more cycles but allows multiple non-conflicting
transactions to commit simultaneously. In Figure 2.9, the validation phase is shown to be shorter than
the execution phase in ED mode because that is the common case. However, a validation phase longer
than an execution phase is possible in the case of a short transaction and/or a large write-set.
We have implemented MRTM on a cycle-accurate simulator with Lazy detection because it is less
intrusive in the simulator implementation than Eager detection, in which conflicts are tested and resolved
on each memory access during transaction execution. We adopt the early validation approach, in which
the validation phase runs in parallel with ED mode, denoted as Validation1 in Figure 2.9. With no errors
detected, transactions start the commit phase by propagating the addresses of the committing
transaction's write-set at the end of ED mode, denoted as TX-commit1 in Figure 2.9. Transactions are
generated automatically when TxWB overflows with valid entries or when the number of instructions
committed from the ROB reaches a preset threshold, whichever comes first. Limiting the number of
instructions in a transaction may be useful to limit the number of TX conflicts, which is generally
proportional to transaction size.
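A minimal sketch of this transaction-generation trigger; the names are hypothetical, and the 32-entry TxWB and 2500-instruction threshold used in the evaluation below are assumed:

    // Hardware trigger that closes the current implicit transaction: either the
    // transactional write buffer (TxWB) is full of valid entries, or the number
    // of instructions committed from the ROB has reached a preset threshold.
    struct TxGenTrigger {
        static constexpr unsigned kTxWBEntries     = 32;   // assumed TxWB capacity
        static constexpr unsigned kMaxInstructions = 2500; // assumed instruction limit

        unsigned txwb_valid_entries   = 0;  // valid entries currently in TxWB
        unsigned retired_instructions = 0;  // instructions retired in this transaction

        bool should_generate_transaction() const {
            return txwb_valid_entries >= kTxWBEntries ||
                   retired_instructions >= kMaxInstructions;
        }

        void reset() { txwb_valid_entries = 0; retired_instructions = 0; }
    };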
So far we have described MRTM in the context of implicit transactions (similar to BulkSC), in which
the size of transactions is controlled by hardware. Implicit transactions enforce sequential consistency.
Explicit (programmed) transactions can also be handled in MRTM. The primary challenge is to support
long transactions for both conflict detection and error detection/correction. Commercial HTM
implementations already support long transactions that do not fit in hardware resources. For example, if
speculative data of a long transaction overflow private buffer space, a fallback path is taken to execute
memory instructions non-speculatively, removing the need for the buffer space as in [23].
Therefore, we only need to support error detection and correction for long transactions. Because
execution results are compressed in fingerprints for comparison, we just have to guarantee input
replication between the modes of execution. During Normal mode, we use spare memory space to buffer
the data of store addresses that overflow TxWB, and we direct subsequent loads to those addresses to the
overflow buffer space. At the end of Normal execution, the overflow buffer space is simply discarded
and the memory hierarchy remains intact during the following ED mode. Since detecting no error is the
common case, we log old data in the overflow buffer space during ED mode, unlike in Normal mode.
If no error is detected, a new transaction can start after the log is discarded. If an error is detected, the
old data is restored from the log, and the transaction is re-executed in EC mode with the same input. In
EC mode, the restoration log is formed in the same way as in ED mode, considering that successful
error recovery is now the common case.
The log formation can be done in either software or hardware using two registers pointing to the start
and end addresses of the log. For example, if a mode bit indicates that the execution runs in overflow
mode, then every store instruction should invoke an exception or a function call that logs the old value if
the memory address has not been logged before. The logging is not on the critical path because the log is
only used at the end of execution, and it can be done in parallel with the main flow of execution.
Restoration of the log can be done quickly because searching the data structure for redundant entries is
unnecessary. Figures 2.10 and 2.11 show the pseudo-assembly code for the log update and the log data
structure for 4-byte addresses and data.
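Since the pseudo-assembly of Figures 2.10 and 2.11 is not reproduced here, the following C++ sketch illustrates the same idea under stated assumptions: 4-byte addresses and data, an append-only log delimited by start/end pointers, and a hypothetical "already logged" filter:

    #include <cstdint>
    #include <unordered_set>
    #include <vector>

    // One restoration-log record: a 4-byte address and the old 4-byte value it held.
    struct LogEntry {
        std::uint32_t addr;
        std::uint32_t old_value;
    };

    // Append-only restoration log, conceptually bounded by start/end registers.
    struct OverflowLog {
        std::vector<LogEntry> entries;          // log storage (start..end)
        std::unordered_set<std::uint32_t> seen; // illustrative "already logged?" filter

        // Called on every store while running in overflow mode: record the old
        // value once per address; later stores to the same address are ignored.
        void on_store(std::uint32_t addr, std::uint32_t old_value) {
            if (seen.insert(addr).second)
                entries.push_back({addr, old_value});
        }

        // Restoration walks the log and writes the old values back; no search
        // for duplicates is needed because duplicates were filtered on insert.
        template <typename MemoryWriteFn>
        void restore(MemoryWriteFn write) const {
            for (const LogEntry& e : entries) write(e.addr, e.old_value);
        }

        void discard() { entries.clear(); seen.clear(); }
    };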
In future work, we could further reduce the performance overheads of the re-execution delay with the
help of the compiler and the programmer. If a compiler or programmer can reveal code sections that run
sequentially or access only read-only or private data, the core can execute those sections faster because
no transaction conflicts occur and so no validation phases are necessary.
In terms of hardware overheads, no additional hardware storage is necessary for the MRTM extension
to SRTM. Only the control logic should be modified to handle the conflict abort signal generated by the
transaction conflict detection and resolution mechanism in the base TM module in Figure 2.4. As in the
case of SRTM, the EDC module can also be turned off for fast execution.
2.3.3 Evaluation
In this section, we evaluate the performance overheads of MRTM, comparing a base machine and the
MRTM machine in terms of execution time with no errors. First, we implement all the components and
control logic on a cycle-accurate simulator [3] that provides multi-core simulation. We set the hardware
parameters of each core to the SRTM settings shown in Table 2.2, with additional parameters
regarding conflict detection and resolution summarized in Table 2.4. The execution times are measured
for the entire program execution, as all the instructions are protected from transient errors. For this
protection, every instruction is executed in a transaction generated automatically by hardware when
TxWB is full or when 2500 instructions are committed, whichever comes first, to limit transaction size.
Lazy conflict detection is adopted with the validation phase overlapped with execution, shown as
Validation1 in Figure 2.9. Simulation results are obtained by running the SPLASH-2 [56] benchmark
suite on the modified simulator. Table 2.5 lists the inputs for the benchmark applications.
Figure 2.12 shows the performance overheads of MRTM in execution times for the SPLASH-2
applications. Execution times are measured for the entire execution, as every instruction is protected
from transient errors inside a transaction. For example, Barnes takes 66.3% more time to finish its
execution with MRTM than without MRTM. All applications except Raytrace exhibit performance
overheads of less than 100%; the overheads range from 17.8% for Ocean to 130.8% for Raytrace, with
an average of 55.8%.
To gain insight into the results of Figure 2.12, Figure 2.13 shows execution-time distributions in
percentage, obtained by dividing the sum of the cycles in each breakdown category by the sum of the
total cycles of all the processor cores in the base machine. Table 2.6 lists the percentage reductions in the
number of misses in the level 1 instruction (L1-I) and data (L1-D) caches and in the level 2 shared cache
(L2-shared) during the ED-mode execution, the breakdown categories of Figure 2.13 to which the
techniques minimizing error detection overheads have been applied, as compared with the Normal-mode
execution. For example, in Barnes, 12.3% of the L1-I misses of the Normal-mode execution are removed
in the ED-mode execution. It should also be noted that the reductions in Table 2.6 need not be directly
proportional to the performance overheads in Figure 2.12, because the measured values in Figure 2.12
are the execution times of applications run on a multi-core system, while the measurements in Figure
2.13 are the sums of cycles over all the cores in the system, which may overlap in time across cores. The
percentage reductions of branch mispredictions are 100% in all the applications and are not presented
separately.
In Figure 2.13, L2 (shared) misses and the memory access time dominate in many applications.
Among these applications, the reduction in cache misses in the L2 shared cache in Table 2.6 is reflected
in the execution overheads in Figure 2.12. For example, in Ocean, where the L2 misses and memory
accesses are dominant, 97.2% of L2 misses in Normal mode are removed in ED mode and this reduction
is reflected in the performance overheads in Figure 2.12. In contrast, in Radiosity, Raytrace and
Volrend, the victim cache for the L2 cache is not as effective, so the performance overheads are
relatively high.
Regarding L1-I and L1-D misses, the simple victim cache scheme is more effective with data than
with instructions, as shown in Table 2.6. This is not surprising considering the average number of
instructions and the average number of cache blocks in the read- and write-sets of transactions, as
presented in Table 2.8 below. In some applications such as Radiosity, the combined L1-I and L1-D miss
penalties are not negligible in Figure 2.13, and the reduction percentages are low as well, especially for
the L1-I cache. This indicates that there is still room to reduce the performance overheads in future work.
For example, we could store retired instructions in a stack memory, and retrieve and execute instructions
from the stack in ED mode.
The number of aborted cycles (‘abort’ in Figure 2.13) due to transaction conflict can also affect the
performance overheads adversely, especially in an application where the abort rate is high such as
Raytrace as shown in Figure 2.13. Table 2.7 lists the number of aborts and the number of aborted cycles
of the MRTM machine normalized to those of the base machine. For example, in Barnes, the number of
aborts and the number of aborted cycles increase by 16.1% and 94.4%, respectively. Overall, the
numbers increase in all applications except Water, where the number of aborts decreases by 9.8%.
Figure 2.14 illustrates how transaction aborts are handled in the current implementation. While the base
machine sends out abort signals right after the validation phase, the MRTM machine sends out the
signals after ED mode if the transaction is error free. Considering the average number of instructions in
transactions, the validation phase finishes earlier than the execution in ED mode, as shown in Figure
2.14, and two factors could have caused the increases in the number of aborts and aborted cycles. First,
it is possible that more transactions could read the write-set of the committing transaction during the
delayed abort period, increasing the number of aborts. Second, transactions doomed to be aborted are
simply delayed by the duration of the delayed abort period.
Though transaction aborts are not the common case in general, they can be problematic in applications
with high transaction abort rates, as shown in the case of Raytrace above, where the performance
overhead exceeds 100% of the base machine. A high abort rate also affects the base machine
adversely and has been a concern among TM researchers. In an effort to reduce transaction aborts,
transaction scheduling algorithms have been proposed, such as those in [2, 4, 6, 7, 59], to
name a few. These algorithms can be used together with MRTM and may reduce the performance
overheads of MRTM by reducing the number of transaction aborts. We leave the implementation and
evaluation of MRTM with a transaction scheduling algorithm to future work. Instead, we propose a
mechanism that can conditionally reduce the performance overheads with moderate modification of the
current implementation in the following section.
Table 2.8 lists the average and maximum read- and write-set sizes, and the average number of instructions
in a transaction. In many applications, the maximum write-set size is 32 and the average number of
instructions is close to 2500, because transactions are set to be generated when the 32-entry TxWB is full
or when 2500 instructions are retired from the ROB, whichever comes first. The average write-set sizes
are smaller because the 2500-instruction limit is reached more often. The read-set sizes are smaller than
the write-set sizes because, in general, not all loaded addresses are written back. Read-set size can also be
limited by tracking the addresses of loaded blocks. The victim cache scheme should be more efficient at
reducing data cache misses in ED mode for small transactions with small read- and write-sets. However,
small transactions also increase transaction commit overheads due to the increased number of transaction
commits.
2.3.4 Speculative Abort
As observed earlier, a high abort rate in transaction execution can adversely affect the performance
overheads of MRTM. One way to improve the situation, without help from outside components such as a
transaction scheduler, is to remove the cause conditionally, based on the expected outcome. As shown in
Figure 2.14, the cause of the situation is the delayed abort, and the expected outcome at the end of ED
mode is a match or no match between the fingerprints of the two consecutive executions. If we send abort
signals right after the validation phase in Figure 2.14 and the fingerprints match, then we reduce the
adverse effect on the performance overheads without other side effects. If we do the same but the
fingerprints do not match and the error is not fixed, then we end up with unnecessary aborts caused by a
transaction that is doomed by an uncorrectable error. Those transactions aborted unnecessarily will
resume execution and be guaranteed to move forward by the base HTM module, causing performance
degradation but not affecting correctness. Consequently, if uncorrectable errors are not the common
case, we mitigate the effect of a high abort rate by aborting transactions speculatively. Figure 2.15
illustrates the speculative abort scheme. Accesses to the write-set are blocked and NACK'ed until
commit time to prevent other transactions from reading uncommitted values, because during the interval
in the figure between 'Send abort signals' and 'TX commit' there is still a possibility that the transaction
never commits due to unfixed error(s).
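A minimal sketch of this policy with hypothetical names (the actual signals belong to the base HTM and EDC modules):

    // Speculative abort: abort signals are sent to the losers right after the
    // validation phase instead of at the end of ED mode, but the winner's
    // write-set remains blocked (reads are NACK'ed) until it actually commits.
    struct CommittingTx {
        bool validation_passed  = false;  // won conflict resolution
        bool fingerprints_match = false;  // known only at the end of ED mode
        bool committed          = false;  // write-set made globally visible

        // Send abort signals as soon as validation has passed, even though an
        // uncorrectable error could still prevent this transaction from
        // committing (the speculative aborts would then be unnecessary but harmless).
        bool send_abort_signals_now() const { return validation_passed; }

        // The commit itself still requires matching fingerprints.
        bool may_commit() const { return validation_passed && fingerprints_match; }

        // Between the early abort signals and the actual commit, remote
        // accesses to the write-set must be NACK'ed.
        bool nack_accesses_to_write_set() const {
            return validation_passed && !committed;
        }
    };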
Table 2.9 compares the number of aborts and the number of aborted cycles between the speculative
abort (Figure 2.15) and the delayed abort (Figure 2.14) schemes. For example, in Barnes the number of
aborts and the number of aborted cycles are reduced by 13.5% and 39.6% with the speculative abort
scheme. Figure 2.16 shows the execution times with speculative abort normalized to those on the base
machine. In Raytrace, the performance overheads are reduced to below 100%. In other applications, the
performance overheads are not changed much because the abort rates are not high in those applications.
2.4 Future Work
Future work should focus on, though not be limited to, further reducing the performance overheads of
transient error detection, which is the common case on the critical path of execution. The experimental
results show that there is still margin for improvement. Executing the same instruction stream twice is
surely a burden, but it is also advantageous because we know what will happen in the near future. If we
could exploit this architectural information about the future, we could further reduce the performance
overheads.
Chapter 3
3. POWER EFFICIENT HARDWARE TRANSACTIONAL
MEMORY
The work in this chapter is based on [13].
3.1 Introduction
While TM greatly improves the programmability of shared-memory multiprocessors and CMPs (Chip
MultiProcessors), an obvious disadvantage of any TM system is the machine cycles wasted on
transaction aborts, resulting in power and performance losses. In some worst-case scenarios, the losses
can be large as identified in [7].
One of the worst scenarios, which might occur in a TM system adopting the Lazy conflict detection
scheme, in which memory stores are sent out all together after the validation phase for conflict detection,
is illustrated in Figure 3.1. In this scenario, N transactions (Tx1 ~ TxN) run on N processor cores (Pr1 ~
PrN) and compete and conflict with each other for the same memory location(s). In the first stage, Tx1 is
chosen as a winner by the conflict resolution mechanism and is committed successfully while Tx2 ~
TxN are aborted and restarted. The aborts are marked by ‘x’ in the figure. This situation repeats itself
until TxN is committed successfully, resulting in a total wasted energy (grey areas in the figure) of
(Energy per abort) × N(N − 1) / 2,    (1)
which is ~ O(N²), if we assume that the transactions are fairly well synchronized and the energy
consumed per abort is roughly the same among the aborted transactions. In this example, the repeated
aborts after the first one in each transaction are predictable, and transactions should not restart after their
first abort until all conflicts with other transactions are cleared. This is what DTI (Dynamic Transaction
Issue) aims to achieve.
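As a sketch of the arithmetic behind the O(N²) figure above and the O(N) figure obtained with DTI in Section 3.2.5, assuming each aborted execution wastes roughly the same energy E_abort (a symbol introduced here only for illustration):

    \begin{align*}
    E^{\mathrm{base}}_{\mathrm{wasted}} &= E_{\mathrm{abort}} \sum_{i=1}^{N-1} i
        = E_{\mathrm{abort}} \, \frac{N(N-1)}{2} \;\sim\; O(N^2), \\
    E^{\mathrm{DTI}}_{\mathrm{wasted}} &= E_{\mathrm{abort}} \, (N-1) \;\sim\; O(N), \\
    \frac{E^{\mathrm{DTI}}_{\mathrm{wasted}}}{E^{\mathrm{base}}_{\mathrm{wasted}}} &= \frac{2}{N}.
    \end{align*}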
Figure 3.2 illustrates another example, in which energy waste is due to repeated aborts in a TM
system adopting the Eager conflict detection scheme. In Eager conflict detection, a memory store is
propagated at the execution of a store instruction during the execution phase using an underlying cache
coherence protocol for conflict detection. In Figure 3.2, a store in Tx1 conflicts at first with Tx2 and Tx1
is aborted. Then after Tx1 is restarted the same store in Tx1 again conflicts with Tx2 and Tx1 is again
aborted. In this situation, Tx1 could be aborted multiple times by Tx2, wasting energy on the cycles
spent executing aborted transactions, as shown by the grey area. For as long as Tx2 is running, these
conflicts resulting in the abort of Tx1 are predictable. Tx1 should not restart until Tx2 is finished.
DTI deals with this case too.
To observe how many consecutive aborts of the same transaction actually occur in benchmark
applications on our base machine, we measured the number of such repeated aborts. Our base HTM
machine (described in more detail in Section 3.6) has no support for suppressing restarts of aborted
transactions. The results are presented in Table 3.1. The second row “average repeated aborts” gives the
average number of aborts repeated consecutively before a transaction commits given that the transaction
aborts at least once. Thus these numbers do not include transactions that commit on their first execution.
For example, in Bayes, an abort is repeated about three times in a row on average. In Vacation, every
aborted transaction commits successfully on its next try without repetition; thus only one abort is counted each
time. Overall, the benchmark programs experience 4.25 consecutive aborts before committing, among
all transactions that abort at least once. The third row shows the fraction of aborted transactions that
experience more than one consecutive abort. For example, 65% of the aborted transactions in Bayes
experience at least two aborts in a row. Overall 42% of the aborted transactions suffer from multiple
consecutive aborts for all the programs.
To remedy the problem of multiple consecutive aborts, we propose a power-efficient Hardware
Transactional Memory which reduces the energy wasted on transaction aborts. More specifically, we
introduce a hardware mechanism called Dynamic Transaction Issue (DTI) that dynamically suppresses
the re-issuing of an aborted transaction if there is a strong possibility that the transaction will be aborted
again after the initial abort.
The rest of the chapter is organized as follows. In Section 3.2, we explain the dynamic issue
mechanism in detail. Section 3.3 describes micro-architectural modifications to an existing TM system.
In Section 3.4, we discuss the overheads incurred by DTI and propose mechanisms to reduce their
impact. Section 3.5 is on related work. Section 3.6 provides experimental results comparing DTI and
some other alternatives with a base machine that has no mechanism to suppress re-issuing of aborted
transactions. Finally, Section 3.7 proposes future work.
3.2 Dynamic Transaction Issue (DTI)
In DTI, a transaction is not restarted immediately once it is aborted if there is a reasonable suspicion
that the transaction will conflict with another transaction in the future. Instead, the transaction is
suppressed from restarting until the suspicion is gone. During the time of suppression, the processor core
waits for a wakeup signal in a power-saving mode, thus saving power/energy. To predict a future
conflict, two sets of information, conflict history and currently running transaction IDs (TxIDs), are
maintained in each core.
3.2.1 Conflict History
We maintain a conflict history record in each processor core using a unique transaction ID (TxID) assigned
to every new transaction. A "new" transaction means a transaction not restarted after an abort. Aborted
transactions are not assigned a new TxID but keep their old TxID until they commit successfully. The
conflict history is recorded in each core by pairing a TxID with every processor core ID (PID). For
example, when an aborted transaction has a conflict with a transaction with TxID 123 from processor 3
(PID = 3), TxID 123 is recorded in a record entry indexed by PID 3. We only keep track of the most
recent conflict from a processor core, so the total number of entries in the local record equals the
number of processor cores in the system minus one (the local core does not have an entry for itself). In
the previous example, if a new conflict is detected with a transaction with TxID = 456 from the same
processor (PID = 3), "456" overwrites the old TxID (123) in the local conflict history. This is based on
the assumption that a conflict with a transaction having a new TxID implies that the old transaction on
the same remote core was successfully committed or de-scheduled from the core and is no longer active.
When the current transaction on a processor core is committed successfully, the conflict history record
in that core is cleared, as a new transaction is assumed to have no relation to past history.
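A minimal sketch of these per-core records (hypothetical types and method names, not the dissertation's actual CPM implementation):

    #include <cstdint>
    #include <vector>

    using TxID = std::uint32_t;   // 32-bit transaction ID (see Section 3.3.2.5)
    constexpr TxID kNoTx = 0;     // sentinel: no TxID recorded

    // Per-core conflict prediction state; both records have one slot per core,
    // indexed by PID (the local core's own slot simply stays unused).
    struct ConflictPredictionState {
        std::vector<TxID> conflict_history;  // most recent conflicting TxID per remote core
        std::vector<TxID> running_txid;      // TxID currently running on each remote core

        explicit ConflictPredictionState(unsigned num_cores)
            : conflict_history(num_cores, kNoTx), running_txid(num_cores, kNoTx) {}

        // Keep only the most recent conflicting TxID for each remote core.
        void record_conflict(unsigned remote_pid, TxID tx) { conflict_history[remote_pid] = tx; }

        // Update the "currently running transactions" record on a TxID broadcast.
        void update_running(unsigned remote_pid, TxID tx) { running_txid[remote_pid] = tx; }

        // On a successful commit the conflict history is cleared: a new
        // transaction is assumed to have no relation to past history.
        void clear_history() { conflict_history.assign(conflict_history.size(), kNoTx); }
    };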
To detect and record a conflict, a TxID is sent along with a memory store address or a signature, as
was done in [58]. Store addresses can be sent over a conventional cache coherence protocol
in Eager conflict detection HTM systems in which such addresses are sent out to other transactions
during the execution phase; data is not sent at that time but will be exposed later, after the commit phase.
In the case of Lazy conflict detection, we need to send out store addresses during the execution phase
because in current protocols they are sent out only after the validation phase by successfully committing
transactions. Otherwise, no conflict information can be collected from aborted transactions since aborted
transactions would not propagate their stores. Moreover new conflicts caused by an already committed
transaction are not useful to other transactions. For example, in Figure 3.1, conflict information from
Tx1 received by Tx2,…,TxN during Tx1’s commit phase is stale and possibly detrimental since Tx1 is
already committed. Furthermore, the transactions Tx2, …, TxN need to collect conflict information from
each other; otherwise, the aborted transactions Tx2, …, TxN will be restarted blindly and aborted
repeatedly, which does not resolve the issue. If conflict information is communicated only at commit
time, the conflict information from aborted transactions would never be gathered, because only
committed, winning transactions would have a chance to send store addresses for conflict histories in the
Lazy detection scheme. Losing a competition does not have to be useless: we want to use the conflict
information from aborted transactions as well. We will evaluate the overheads caused by these extra
messages in Section 3.4.2.
3.2.2 Conflict Prediction
In addition to a conflict history record in each core, a record of the TxIDs currently running on all cores is
maintained locally within each processor core for conflict prediction. When a new transaction starts its
execution, its TxID, generated dynamically, is broadcast to all other processor cores to notify them. The
current TxIDs of all other cores are stored in a local record called the "currently running transactions"
record; like the conflict history record, its entries are indexed by a PID. In response to a broadcast,
remote cores overwrite the TxID indexed by the sender's PID in their currently running transactions
record. In Section 3.4.2.2, we will discuss the overheads of TxID broadcasts and introduce a technique
that can avoid them.
There are no ordering requirements between a TxID broadcast and the execution of the current
transaction on a core because TxIDs are only used to predict whether an aborted transaction will have a
conflict again if it were restarted immediately. For example, a transaction, which has broadcast its TxID
earlier, can still issue loads and stores regardless of the pending TxID broadcast. Also, there is no
ordering required between TxID broadcasts, allowing a transaction to commit even if multiple of its
TxID broadcasts are outstanding.
When a transaction is about to restart after its latest abort, TxIDs from the conflict history record and
the currently running transaction record are indexed by the same PID and are compared. If the
comparison results in a match, a future conflict is predicted and the processor core enters an idle state or
any energy saving mode. (N – 1) such comparisons are done, if there are N cores in the system.
Deadlocks may occur. For example, two cores could enter the idle state while their records show a
conflict with each other. They would then stay in the idle state forever because neither will commit and
send a new TxID to the other. To avoid deadlocks and guarantee forward progress, a higher priority is
given to the older TxID. In other words, to prevent transactions from waiting for each other in the idle
state, if the TxID associated with a core is older than the TxIDs causing the predicted future conflicts,
the core does not enter the idle state but instead restarts its transaction.
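A minimal sketch of the prediction step, including the priority rule that avoids deadlock; the helper names are hypothetical, and smaller TxIDs are assumed to be older (consistent with the TxID generation options in Section 3.3.2.4):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    using TxID = std::uint32_t;
    constexpr TxID kNoTx = 0;

    // conflict_history[pid]: most recent TxID we conflicted with on core pid.
    // running_txid[pid]:     TxID currently running on core pid.
    // Returns true if the core should enter an idle/low-power state instead of
    // restarting its aborted transaction; smaller TxID = older = higher priority.
    bool should_idle(const std::vector<TxID>& conflict_history,
                     const std::vector<TxID>& running_txid,
                     TxID my_txid) {
        bool conflict_predicted = false;
        TxID oldest_conflicting = my_txid;
        for (std::size_t pid = 0; pid < conflict_history.size(); ++pid) {
            TxID past = conflict_history[pid];
            if (past != kNoTx && past == running_txid[pid]) {  // past conflictor still running
                conflict_predicted = true;
                if (past < oldest_conflicting) oldest_conflicting = past;
            }
        }
        // Deadlock avoidance: if our own TxID is the oldest one involved,
        // do not idle; restart the transaction instead.
        return conflict_predicted && oldest_conflicting < my_txid;
    }

A TxID update received while the core is idle simply re-runs this check, and the core wakes up as soon as it returns false, which matches the wakeup mechanism described next.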
3.2.3 Wakeup from Idle State
Instead of probing, which requires continual inquiries, we advocate a signaling scheme triggered by
an event to wake up a processor core from the idle state. More specifically, a change in the currently
running TxID record triggers a new prediction of future conflicts. When a new TxID arrives at a core in
idle state, it updates the local currently running transactions record and triggers a local comparison
between TxIDs in the conflict history record and the record of currently running TxIDs. If no future
conflict is predicted, the core wakes up and resumes execution by restarting the pending aborted
transaction. Otherwise, the core stays in the idle state.
When a transaction is followed by non-transactional code, it is necessary to invalidate the TxID
associated with the transaction to avoid indefinite waiting. Let’s assume that a processor core is in idle
state due to a future conflict with TxID ‘x’ running on another core with PID ‘y’. The waiting core will
keep waiting until ‘y’ commits or de-schedules ‘x’, issues a new transaction ‘z’ and sends this new TxID
‘z’ to the waiting core to invoke the wakeup mechanism. If there is no such ‘z’ hence no delivery of the
new TxID (Note that ‘x’ might be the last transaction scheduled on ‘y’) it looks to the waiting core as if
‘x’ is running forever and the waiting core will never wake up. So, to prevent this undesirable situation,
processor core ‘y’ needs to send a message invalidating ‘x’, which updates other processor cores’
records whenever a core exits a transaction and starts the execution of non-transactional code.
3.2.4 Transaction Flow
Figure 3.3 shows the flow of transaction executions with DTI. When a transaction is new, i.e. not
restarted after an abort, it follows the normal flow of execution, which results in a Tx abort or commit. If
a transaction was aborted during its previous execution, it next enters the execution phase or idle state
based on DTI conflict prediction. When a TxID update is received and no future conflicts are predicted,
the core is awakened from the idle state and restarts the execution of the current aborted transaction. TxIDs
are broadcast whenever a new transaction starts execution or non-transactional code follows a
transaction.
3.2.5 Examples
Figure 3.4 illustrates how DTI saves the energy wasted in the transaction aborts of Figure 3.1. When
Tx1 is committed, the other transactions, Tx2 ~ TxN, are aborted as in Figure 3.1. However, at this time,
the aborted transactions have the information about past conflicts as well as the currently running Tx’s
in other cores, which are used to predict future conflicts. Consequently, only Tx2, which has the highest
priority among all remaining transactions, is allowed to restart execution. Meanwhile the cores running
Tx3, … , TxN, enter an idle state. The times spent in idle states are shown in the darker areas in Figure
3.4. When Tx2 is committed a new TxID or an invalidation from Tx2 is broadcast from the core that
committed Tx2 and this event triggers the waiting cores (Pr3 ~ PrN) currently in idle state to re-evaluate
and re-predict the possibility of future conflicts. Tx3 has the highest priority at this moment, and so it
resumes execution. The same procedure is repeated until TxN, the transaction with the lowest priority, is
committed successfully. The total energy wasted on the aborts is now reduced to (Energy per abort) ×
(N − 1), which is ~ O(N) and roughly 2/N of the energy consumption in Eq. (1).
Regarding the scenario of Figure 3.2, Tx1 knows that Tx2 has aborted it and that Tx2 is still
running at the moment right before its first restart, so Tx1 enters an idle state instead of restarting right
away, thus saving energy.
3.2.6 Prediction Accuracy
As in most hardware prediction schemes based on dynamic information, understanding the impact of
the accuracy of conflict predictions in DTI is important to identify possible drawbacks as well as future
improvements. The two major types of DTI mispredictions are false negative and false positive alarms.
3.2.6.1 False negative alarm
False negative alarms occur when DTI falsely predicts that there is no conflict with a concurrent
transaction. For example, in Figure 3.4, TxN needs to know that it had a conflict with Tx2 in order to
avoid restarting. However, if the store address causing the conflict was delayed or lost in
the transfer from the Tx2 core to the TxN core, or if Tx2 never sent the conflicting address because Tx2
was aborted by Tx1 before reaching the instruction generating the conflicting address, then there will be
a false negative alarm. In the latter case, we might let Tx2 run until the end of the transaction and send
store addresses even after its abort to collect the addresses of upcoming stores. However, the execution
time overheads offset the gains of higher prediction accuracy. False negatives do not affect correctness
but they affect the performance of the system by enabling some transaction aborts that could have been
avoided.
3.2.6.2 False positive alarm
A false positive alarm happens when a transaction is predicted to have a conflict with other
transactions but the conflict does not happen. The main result is a delayed update of the currently
running TxID records, which in turn delays the restart of aborted transactions and perhaps increases the
overall execution time. Unlike the false negative case, a false positive could affect the
correctness of a program if the update of the currently running TxIDs never occurs, such as in the case of
a transaction followed by non-transactional code. The solution to this problem was explained in Section
3.2.3.
3.3 Micro-Architecture
3.3.1 Overall Architecture
Figure 3.5 shows the micro-architecture of a processor node equipped with DTI partitioned into
logical modules. The cores execute machine instructions from their private caches. The TM module is
part of existing HTMs and provides TM services for the processor core such as conflict detection and
conflict resolution. The Conflict Prediction Module (CPM), which can be easily integrated into an
existing HTM system, is responsible for maintaining the data structures for conflict history and for
currently running TxIDs, for the conflict prediction logic, and for other necessary functions such as TxID
generation. CPM is the only addition to a traditional HTM to support DTI. It works closely with the
processor core and the TM module by exchanging control signals such as the wakeup signal from CPM
to the core when a conflict is removed. The processor node is connected to other nodes via an
interconnection network.
3.3.2 Conflict Prediction Module
As shown in Figure 3.5 the Conflict Prediction Module has five basic functions: maintaining a
Conflict History vector, maintaining a Running TxID vector, predicting conflicts, generating TxIDs
and receiving TxIDs from other cores.
3.3.2.1 Conflict history vector
Conflict history is stored in a vector with entries indexed by a processor core ID (PID) uniquely
associated with a processor core. The number of entries is the same as the number of cores (minus 1) in
the multi-core system. Each entry keeps the TxID with which the current transaction has most recently
conflicted, until it is replaced by a new conflicting TxID. Flushing logic is added to clear all
the entries at a transaction commit.
3.3.2.2 Running TxID vector
The hardware structure of this vector is the same as for the conflict history vector. However the
content is different. In this case, a TxID stored in a vector entry is the TxID of the currently running
transaction on the processor core with the corresponding PID.
3.3.2.3 Prediction Logic
Two kinds of comparisons are needed to predict future conflicts. First, the entries with the same index (the
same PID) from the conflict history vector and the running TxID vector are compared in parallel. If a
TxID stored in an entry of the conflict history vector indexed by PID i matches the TxID stored in the
running TxID vector indexed by the same PID i, a future conflict is predicted since the same transaction
with a past conflict is still running. (N – 1) such comparisons are conducted, where N is the number of
cores. Second, the core's own TxID is compared with the TxIDs causing future conflicts. The processor core
enters an idle state unless its TxID has the highest priority. The comparison logic is invoked whenever
an aborted transaction is restarted or an update to the running TxID vector is received.
3.3.2.4 TxID generation
A TxID is needed for two purposes: to identify conflicting transactions and to make priority
decisions. Ideally, a global timestamp can serve directly as a TxID generator. Alternatively, a core can
obtain a global TxID value by updating a global variable accessible by all cores with a special
instruction such as a 'read and increment' or 'fetch and add' instruction, which reads the shared variable
and increments it by one atomically. Also, a processor core may generate a TxID locally by combining
its core ID with a local timer value to assign a unique identification; in this case priority is determined
by the core ID.
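A minimal sketch of the two generation options; the field widths in the local option are assumptions, not the dissertation's layout:

    #include <atomic>
    #include <cstdint>

    // Option 1 (local): combine a core-local timer with the core ID so that
    // TxIDs are unique across cores and roughly ordered in time; ties are
    // broken by the core ID. The 24-bit/8-bit split is an assumption.
    std::uint32_t make_txid(std::uint32_t local_timer, std::uint8_t core_id) {
        return (local_timer << 8) | core_id;
    }

    // Option 2 (global): a shared counter updated with an atomic fetch-and-add
    // yields globally ordered TxIDs.
    std::uint32_t next_global_txid(std::atomic<std::uint32_t>& counter) {
        return counter.fetch_add(1);
    }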
3.3.2.5 Receiving TxID
As TxIDs are generated in ascending order (at least those generated in the same processor core) it is
possible to commit the current transaction while multiple TxIDs are still outstanding in the same core.
When TxIDs from a specific core (i.e., with the same PID) are received out of order, a stored smaller
TxID is overwritten by a larger one, and an arriving TxID smaller than the stored one is simply ignored.
Extra care is needed for transactional code followed by non-transactional code, as discussed in Section
3.2.3. In this case, we need to buffer a TxID invalidation message if the message is received before the
TxID that it invalidates.
For sufficiently long-running programs, it is imaginable for a TxID to roll over its maximum value, so
that TxID counters or timestamps overflow. When this happens, two transactions could have the same
ID at the same time, confusing the conflict resolution logic as well as the priority logic. A counter or
timestamp with 32 bits takes 2^32 values; thus, with 16 cores each running one transaction, this should
not be a problem. There are simple workarounds (as for all techniques relying on hardware counters)
if this is a real problem.
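A minimal sketch of the receive rule for the currently running transactions record (hypothetical helper names; TxID rollover is ignored, as discussed above):

    #include <cstdint>
    #include <vector>

    using TxID = std::uint32_t;
    constexpr TxID kNoTx = 0;

    // Out-of-order delivery rule: for a given remote PID keep only the largest
    // (newest) TxID seen so far; older TxIDs arriving late are simply ignored.
    void receive_txid(std::vector<TxID>& running_txid, unsigned remote_pid, TxID incoming) {
        if (incoming > running_txid[remote_pid])
            running_txid[remote_pid] = incoming;
    }

    // Invalidation sent when a core leaves transactional code: clears the entry
    // so that no waiting core keeps predicting a conflict with a transaction
    // that no longer exists.
    void receive_txid_invalidation(std::vector<TxID>& running_txid, unsigned remote_pid) {
        running_txid[remote_pid] = kNoTx;
    }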
3.4 Overheads
DTI has some overheads: performance, inter-core message traffic, area and power/energy.
3.4.1 Performance Overheads
Figure 3.6 illustrates a scenario causing performance overheads due to the inherent shortcomings of
DTI. In Figure 3.6, two transactions, Tx1 and Tx2, just had a conflict for memory address X, and Tx2,
which was aborted by Tx1, is now about to make a decision whether to restart or idle at time T1. Tx1
has higher priority than Tx2. With DTI, Tx2, which keeps Tx1’s ID in its conflict history vector as well
as in the currently running transactions vector, enters an idle state at time T1 and later wakes up and
restarts execution at time T2 after it is notified of Tx1’s commit. However, it was actually safe to restart
Tx2 at T1 because the load to X occurs after Tx1’s commit, hence there was no need for Tx2 to idle
between T1 and T2. So time (T2 – T1) is a performance overhead, an unnecessary delay.
Several observations are warranted from this example. The overhead is not caused by a false positive
alarm since there is an actual conflict between the two transactions. Instead it is caused by the timing of
the conflict and the lack of information about the actual timing. This timing information is hard to
collect. From the fact that Tx2 keeps the conflict information of ‘X’ with Tx1, the Load of X in Tx2
must have occurred before the end of Tx1 in the previous run; otherwise, the conflict on X could not have
been detected and recorded. In the repeated run, the Load of X was somehow delayed enough in Tx2 to safely
load X without aborting Tx2. However, it is hard to predict the exact timing of accesses to X as any
event might happen between the Tx2 begin and the Load X in consecutive runs of the same transaction.
To avoid this overhead, we would have to track conflicts on each memory access. For example, we
could allow Tx2 to begin but make a decision of whether to continue at every memory load or store. We
leave this avenue of research to future work. It will require further modifications and simulations and
may be impractical.
Another comment regarding this performance overhead is that, at the moment the above event
occurs, it is not clear how the delay will affect the overall execution time: it is possible that the delay
improves overall performance by avoiding other Tx aborts, even though, in general, serializing
transactions does not beat running them in parallel first and serializing only when a conflict is detected.
The occurrence of such delays, as well as their effects, is more obvious in hindsight than in foresight.
3.4.2 Message Overheads
To maintain the information needed for conflict prediction, the following message overheads are
incurred, depending on the conflict detection policy of the underlying TM system.
3.4.2.1 Propagating memory addresses for stores
If we implement DTI on a TM system relying on Eager conflict detection, no additional messages are
necessary, as the conflict detection mechanism requires sending store addresses over a conventional
cache coherence protocol during the execution phase regardless of DTI; we just need to piggyback
TxIDs onto the coherence messages already needed, increasing the packet size by a few bytes.
In a TM system with lazy conflict detection, additional messages not in the base TM system are
needed. These messages are due to write hits in the local cache during the execution phase (write misses
have to go out of the local node anyway, so no additional message is needed for them), since these
write hits, which are purely local and circumscribed within the local node boundary in the base TM, now
have to propagate outside local boundaries. To reduce the number of additional messages, we introduce
the following technique with modest hardware modification. We propose that a processor node sends
store addresses attached with a TxID over the network in bulk at the end of the transaction, regardless of
whether the transaction commits or aborts, instead of sending out a store address for each individual
write hit during execution. Cores send out store addresses in a message at transaction abort in addition to
transaction commit, with an additional bit to differentiate between these two types of messages: a
message originating from an aborted transaction does not trigger a transaction abort but only updates the
conflict history vectors in remote cores. This technique reduces the store traffic in DTI drastically, as
simulation results will show in the evaluation section.
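A minimal sketch of such a bulk message and the receiver-side rule; the packet layout and names are hypothetical, since the text does not specify them:

    #include <cstdint>
    #include <vector>

    using TxID = std::uint32_t;

    // Bulk message sent at the end of a transaction under Lazy detection: all
    // store addresses of the (committed or aborted) transaction, the sender's
    // TxID and PID, and one bit saying whether the sender committed.
    struct StoreSetMessage {
        TxID     txid;
        unsigned sender_pid;
        bool     from_committed_tx;            // false: only update conflict histories
        std::vector<std::uint32_t> store_addresses;
    };

    // Receiver side: a message from a committed transaction triggers conflict
    // detection (and possibly an abort) as in the base TM; a message from an
    // aborted transaction only records the conflict for future prediction.
    void handle_store_set(const StoreSetMessage& msg,
                          const std::vector<std::uint32_t>& my_read_write_set,
                          std::vector<TxID>& conflict_history,
                          bool& abort_requested) {
        for (std::uint32_t remote_addr : msg.store_addresses) {
            for (std::uint32_t local_addr : my_read_write_set) {
                if (remote_addr != local_addr) continue;
                conflict_history[msg.sender_pid] = msg.txid;    // most recent conflictor
                if (msg.from_committed_tx) abort_requested = true;
                return;
            }
        }
    }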
3.4.2.2 Propagating TxID at TX Begin and TX End
To keep the information of transactions running locally in every processor node current, each node
broadcasts a TxID at the beginning of a new transaction and also at the commit of the same transaction
if the transaction is followed by non-transactional code. We denote these TxID_B and TxID_C. Our
goal is not to keep such information in all cases but to predict future aborts using that information. A
TxID_B does not need to be sent to a processor node running an independent transaction, i.e. a
transaction without conflicts with others. Also, a TxID_B does not need to be sent out until the first store
address is sent out for conflict detection. Thus a separate message for TxID_B other than a message to
propagate a store address to update the conflict history vectors is not needed. No additional message or
broadcast under both Lazy and Eager conflict detection is needed: TxIDs are sent only to the nodes
currently sharing memory blocks. By the same token, no additional message for TxID_C is needed with
Lazy conflict detection as a transaction has to send out store addresses for conflict detection at the
commit time anyway. However, additional messages are needed for TxID_C in Eager conflict detection
in which a transaction commits silently. TxIDs must be sent to processor nodes which have conflicted
with the committing transaction.
3.4.3 Hardware Overheads
Hardware overheads comprise additional design costs and the area for the additional components of
the conflict prediction module shown in Figure 3.7. The design cost is marginal because the operation of
the prediction module is very simple: it compares two vector elements by bitwise exclusive OR. As for
the area overhead, we show the necessary circuitry in Table 3.2, assuming 32-bit TxIDs and N processor
cores. For example, if there are 16 cores in the system, we need two vector storage components, each of
which has a size of 60 bytes (15 cores × 4 bytes), plus 15 comparators or 1 comparator with a counter
register.
3.4.4 Power Overheads
Power overhead derives from sending and receiving the additional messages described in the
message overheads section and from the activity of the circuitry listed in Table 3.2. We will estimate these
overheads by the number of additional messages and the activity frequency factors in the evaluation
section.
3.5 Related Work
Several TM proposals aim to avoid redundant transaction aborts by throttling the issue of transactions,
from a simple back-off mechanism such as the random back-off of aborted transactions in [7], to an
adaptive scheduling technique based on transaction abort and commit rates [59], to more complex
proactive transaction scheduling by profiling the pattern of conflicts between transactions [6], to a
reordering scheme based on work stealing at the thread level [2]. These proposals mainly focus on
performance improvement by scheduling transactions to avoid transaction aborts. By contrast, this work
proposes a simple hardware scheme to save energy in HTM systems, targeting consecutive aborts of the
same transaction. We believe that DTI is the first effort addressing energy savings in such systems using
a hardware dynamic transaction issuing algorithm.
Among all the TM scheduling proposals, the proactive scheduling paper [6] is the most closely related to
this work in that proactive scheduling collects and uses information about past transaction conflicts and
currently running transactions. Proactive scheduling is different from DTI in the following aspects.
Proactive scheduling is heavily reliant on software mechanisms incurring software overheads, such as
scanning a software graph structure at every transaction begin, while DTI relies solely on simple
hardware mechanisms without software overheads. Moreover, proactive scheduling targets HTMs with
Eager conflict detection, while DTI is applicable to systems with either Eager or Lazy conflict detection.
Finally, proactive scheduling improves execution time by switching threads, which causes thread-switching
overheads and relies on a supply of threads, while DTI targets the reduction of energy/power
consumption by putting processor cores into an energy-saving mode or idle state without any other
major activities. Consequently, our mechanism is easy to port to existing HTM systems, regardless of
the characteristics of the underlying system, with minimal hardware modifications, no modification of
the software stack, and without incurring much execution overhead.
HARP [4] is also directly related to our work as it relies on hardware-only mechanisms to predict
future conflicts using past conflict information. DTI specifically focuses on issues related to consecutive
aborts of the same transaction, which can be dealt with using minimal system information and modest
hardware cost, targeting only previously aborted transactions and tracking only the most recent conflicts.
By contrast, HARP keeps track of large amounts of stored information in each core regarding transaction
executions in the whole system in order to schedule transactions. DTI can manage the necessary
information in a small, simple, distributed hardware structure composed of two vectors, each of them
requiring less than one hundred bytes in each core (assuming a 32-bit, i.e. 4-byte, TxID and 16 processor
cores). In the case of HARP, a bigger hardware structure,
estimated to be 2.06 KB per core according to the proposal, is needed to maintain more detailed
information such as transaction size, contention ratio, number of consecutively predicted conflicts and
more for multiple transactions as opposed to the last conflicting transaction per core in DTI. By
targeting aborted transactions only, DTI does not interfere with the execution of committing
transactions, which incur no overheads in DTI but do in HARP. The main goal of DTI is to save
power and energy consumption in a multi-core system that runs one thread per core, though it might run
on a system with multiple threads per core with the thread-switching decision delegated to software or
the operating system. Consequently, our scheme fits well in the context of relatively small systems that
cannot afford a big power/energy budget such as portable machines. Different from DTI, HARP focuses
mainly on enhancing system performance, especially for systems running multiple applications, each of
which is multi-threaded.
Several papers address the topic of power and energy savings in TM.
Sanyal et al. propose “Clock Gate on Abort (CGA)” [44] to improve the energy efficiency of HTM by
clock gating or by turning off processor cores for aborted transactions. The mechanisms in CGA are
developed for a specific HTM system, Scalable TCC [9], which is known to be a suboptimal design due
to its limited parallelism and scalability. By contrast DTI is readily applicable to any HTM system and is
not limited to the underlying TM system. Moreover, DTI relies on up-to-date dynamic information to
decide whether to put a processor core in an idle state, instead of simply putting the cores into idle state
immediately after a transaction is aborted, as is done in CGA. DTI is more responsive to changes in
current conditions for cores in an idle state by snooping any changes that might wake up the core. In
Sanyal et al.'s approach, a core in idle state waits for a timer, preset to a fixed amount of time, to expire
(the amount of time itself is determined dynamically based on the current contention level, but once a
core enters the idle state, it waits for the amount of time determined at the time of entrance). This
approach is very similar to the blind simple back-off schemes mentioned earlier. Finally, DTI is based on
a distributed control mechanism, while Sanyal et al.'s approach relies on a centralized control mechanism
delegated to directory modules, which avoids broadcast messages (broadcast messages are also avoidable
in our approach using the techniques in Section 3.4.2.2) but incurs delays due to congestion.
Ferri et al. [15] propose TM architectures well suited to embedded multicore systems, with emphasis
on energy, performance and complexity. The authors discuss various techniques for energy efficiency
such as shutting down the Transaction Cache when not in use. The restart of an aborted transaction is
triggered after a simple random exponential back-off period during which the CPU stays in a low-power
mode. Their proposed embedded architectures emphasizing energy efficiency are good targets for DTI.
Gaona et al. [17] propose Selective Dynamic Serialization (SDS), an energy reduction scheme targeting an Eager-Eager (Eager conflict detection and Eager version management) HTM system, specifically LogTM-SE [58]. The scheme serializes
transactions (instead of retrying immediately) when a counter incremented at the detection of a conflict
(NACK_SDS) or of an abort (ABORT_SDS) saturates to a preset value. Energy is saved by putting cores in a low-power mode during the time they wait for their turn, using a hardware record based on
transaction priority for each conflicting address. The proposal adopts a token-based scheme, in which
the highest priority transaction wakes up the next highest priority transaction waiting for permission to
resume in line, as compared to DTI where a processor core decides locally whether to restart a
transaction based on the information of past conflicts and current transactions. SDS stalls transactions
during execution when a conflict is detected (address-base) while DTI restricts the restart of aborted
transactions at their beginning (transaction-base), and, consequently, is applicable to both Lazy and
Eager conflict detection TMs.
Hourglass [26], like DTI, focuses on repeated aborts with a simple contention management policy
evaluated in a Software Transactional Memory (STM) system. When transactions are aborted
consecutively over a certain threshold number, those transactions are marked as ‘toxic transactions’. A
toxic transaction acquires a token that prevents any other transaction from starting execution except
those that have already started execution. Once the toxic transaction is committed successfully, the token
is released to let other transactions proceed. There are several differences between Hourglass and DTI.
First, Hourglass doesn’t use conflict history for conflict prediction. Second, all concurrent transactions
are serialized as compared to only transactions with conflict history in DTI. Third, a central arbiter is
necessary in the case of multiple pending toxic transactions.
Table 3.3 compares the choices made in DTI and in several other proposals.
3.6 Evaluation
3.6.1 Experimental Setup
3.6.1.1 Simulation setup
We implemented and simulated DTI on top of the SESC simulator [41], a cycle-accurate simulator for
out-of-order cores in multi-core configurations, which we augmented with the following packages.
First, we enable the TM package [34] on top of the base simulator using ‘--enable-transactional’. The
package includes a TM framework that incorporates the common operations of TM into the core of the
SESC simulator such as taking a checkpoint at a transaction begin and restoring the system context back
to the checkpoint at a transaction abort. Within this framework, functions related to the execution of
individual transactions such as beginTransaction(), abortTransaction() and commitTransaction() are
already implemented. We added DTI on top of this platform.
Second, we use the Power package [8], which is enabled by ‘--enable-power’, to estimate dynamic
power and energy consumption during the execution of a program by keeping track of units accessed
during each clock cycle. The dynamic power consumption is estimated by
P_d = a × C × V_dd^2 × f, (2)
where P_d is the dynamic power consumption; C is the load capacitance calculated by the capacitance models described in [8] for the various hardware structures defined by the hardware parameters set in the SESC configuration file, sesc.conf; V_dd is the supply voltage; a is an activity factor indicating the average switching activity in every clock tick, ranging from 0 to 1 and dependent on the execution behavior of each benchmark (for circuits that pre-charge and discharge on every cycle, a is set to 1); and f is the clock frequency.
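As a purely illustrative check of Equation (2), with hypothetical values that are not taken from our simulations, a = 0.5, C = 1 nF, V_dd = 1 V and f = 2 GHz give P_d = 0.5 × (1×10^-9) × 1^2 × (2×10^9) = 1 W.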
Tables 3.4 and 3.5 summarize the hardware configuration of the hardware platform and the benchmark applications. To help understand the scalability of the benchmarks, Figure 3.8 shows their execution times with varying numbers of threads, from one to sixteen, normalized to those with one thread on the base system, a TM system with no back-off or suppression of transaction restarts. (For Bayes ('B'), the execution times are normalized to the execution time with two threads due to a simulation error with one thread.)
3.6.1.2 Machine setup
To evaluate our approach and compare it to other proposals, we implemented and compared the
following hardware-only schemes.
Base Machine with No Back-off. Each core of the base machine has a TM module implementing
Lazy conflict detection and Lazy version management using bus contention for transaction conflict
detection as in TCC [19]. A transaction is restarted immediately after it is aborted. There is no
scheduling algorithm or back-off scheme.
Linear Back-off. A transaction is backed-off by a number of cycles linearly dependent on the number
of aborts experienced consecutively. For example, if the linear constant is L and the number of
consecutive aborts is N, the processor core stays in the idle state for (L × N) cycles before it restarts the
aborted transaction. L is set to eight cycles in our simulations.
Exponential Back-off. This is the same as the linear back-off scheme except that the number of cycles in idle state increases exponentially as L^N, where L is the radix (set to two in our simulations) and N is the number of consecutive aborts. In this scheme a large number of consecutive aborts is dealt with more aggressively, but a small number of consecutive aborts is dealt with less aggressively, than in the linear back-off scheme.
Random Back-off. Originally proposed in [7] to remedy one of the TM pathologies identified in the
paper, this scheme handles the situation in Figure 3.1 by idling processor cores for a random number of
cycles. In our simulations a random number between one and ten is generated using the C++ rand() function and is then multiplied by the number of consecutive aborts to calculate the number of idle cycles.
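As a concrete illustration of the three back-off policies just described, the following minimal sketch computes the number of idle cycles each one would choose after a given number of consecutive aborts, assuming the constants stated above (L = 8 for the Linear scheme and a radix of 2 for the Exponential scheme); it is not the simulator code.

    #include <cstdlib>

    // Number of idle cycles chosen by each back-off scheme after
    // 'consecutiveAborts' aborts of the same transaction (illustrative sketch).
    unsigned linearBackoff(unsigned consecutiveAborts) {
        const unsigned L = 8;                        // linear constant used in our simulations
        return L * consecutiveAborts;                // L x N cycles
    }

    unsigned exponentialBackoff(unsigned consecutiveAborts) {
        const unsigned L = 2;                        // radix used in our simulations
        unsigned cycles = 1;
        for (unsigned i = 0; i < consecutiveAborts; ++i)
            cycles *= L;                             // L^N cycles
        return cycles;
    }

    unsigned randomBackoff(unsigned consecutiveAborts) {
        unsigned r = 1 + std::rand() % 10;           // random number in [1, 10]
        return r * consecutiveAborts;                // scaled by the abort count
    }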
Dynamic Transaction Issue (DTI). In this scheme, a processor core enters and wakes up from idle
state when certain conditions are met using the Conflict History vector and the Running TxID vector as
described earlier in this paper. We implement DTI on top of the base machine.
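For readers who want a concrete picture of the restart test that DTI applies, the following is a minimal, hypothetical sketch of the kind of check the two vectors enable; the vector names, layout and encoding are illustrative only, and the exact mechanism is the one described earlier in the dissertation.

    #include <cstdint>
    #include <vector>

    // Hypothetical sketch: a core stays idle while any remote core is still running
    // a transaction recorded as the last one that conflicted with this core.
    // A TxID of 0 is used here to mean "no transaction running" (an assumption).
    bool mayRestartAbortedTransaction(const std::vector<uint32_t>& conflictHistory, // last conflicting TxID per core
                                      const std::vector<uint32_t>& runningTxId) {   // TxID currently running per core
        for (size_t core = 0; core < runningTxId.size(); ++core) {
            if (runningTxId[core] != 0 && runningTxId[core] == conflictHistory[core])
                return false;   // a previously conflicting transaction is still running: stay idle
        }
        return true;            // no predicted conflict: wake up and restart the transaction
    }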
3.6.2 Results and Analysis
Figure 3.9 compares the dynamic power consumption of the schemes described in Section 3.6.1.2, abbreviated as 'No' for the No back-off scheme (Base); 'Lin' for the Linear back-off scheme; 'Exp' for the Exponential back-off scheme; 'Ran' for the Random back-off scheme; and 'Dyn' for our Dynamic Transaction Issue (DTI) scheme. These abbreviations are also used later in the paper. Power consumption is normalized to the power consumption of the No back-off scheme, the base machine. As
the figure shows, DTI outperforms all the back-off schemes for all benchmark programs in terms of
dynamic power consumption (by up to 38.2% in Intruder) except for Fmm where the Linear back-off
scheme consumes about 0.05% less power than DTI.
In the back-off schemes including DTI, the idle state is implemented by stopping instructions in the
fetch stage. A processor core enters the idle state by stopping fetching instructions. At that time, all prior
in-flight instructions proceed to the retirement stage. Once the core exits the idle state, execution resumes by restarting the fetching of instructions. With this method, we can accurately model an idle state which does not incur a dynamic power overhead to resume execution. An alternative to this policy is to put the core into a sleep mode, which might save more power but would affect performance whenever the core is woken up.
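The idle-state model just described amounts to a simple fetch gate; the sketch below is illustrative only (not the simulator's implementation) and captures the behavior: fetch stops, in-flight instructions drain, and fetch is simply re-enabled on wake-up.

    // Illustrative fetch-gate model of the idle state used by all back-off schemes.
    struct CoreFetchGate {
        bool fetchEnabled = true;
        void enterIdle() { fetchEnabled = false; }    // stop fetching; in-flight instructions drain and retire
        void exitIdle()  { fetchEnabled = true;  }    // resume fetching; no wake-up power penalty is modeled
        bool canFetch() const { return fetchEnabled; }
    };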
We now evaluate DTI in terms of dynamic energy consumption, which is calculated by multiplying
the dynamic power consumption by the execution time in cycles (power × delay model). Figure 3.10
shows the total dynamic energy consumptions of the various schemes during the entire execution of
each benchmark, normalized to the energy consumptions of the No back-off policy (Base). DTI achieves
energy savings for some programs, most notably for Intruder by about 37%, and also for Yada and
Labyrinth by 12% and 8% respectively. The Linear back-off, the Exponential back-off and the Random
back-off schemes show similar energy consumption except for Intruder and Yada. In Intruder, the Linear
scheme, the Exponential scheme and the Random scheme improve energy consumption by 4%, 20% and
5% respectively as compared to the No back-off scheme. In Yada, energy consumption is slightly higher
with the Exponential and Random Back-off schemes. This is possible if a back-off scheme increases
execution time, for example by increasing the number of aborts.
Energy savings are very similar to power savings in most cases, as demonstrated by comparing Figure
3.9 and Figure 3.10. Variations are less than 1%. One of the reasons for these small differences is that
DTI (and other schemes) doesn’t predict future conflict with perfect accuracy, and some mispredictions
may slightly increase execution times and thus energy consumption as compared to power consumption.
For example, in Intruder, while the power consumption of DTI decreases by 38% relative to the No
back-off scheme (Base), energy consumption decreases by 37%. This slight difference can be attributed
to the increased execution time of DTI relative to the execution time of the No back-off scheme. The
increase in execution time is about 1%. In summary, the relative energy consumptions are similar to the
relative power consumptions. This implies that DTI does not sacrifice performance in order to save
power.
Figure 3.11 shows the dynamic energy wasted in aborted transactions only normalized to the case of
No back-off. For all applications DTI reduces the wasted energy due to aborts more than the other
schemes because it adapts to future transaction execution using past information rather than relying on a
static back-off mechanism. Figure 3.11 displays the differences between the various schemes better than
Figure 3.10 because Figure 3.10 shows the energy consumed during the entire execution of each
benchmark program while Figure 3.11 concentrates on the energy spent on aborted transactions only. If
transactions are a small portion of an entire execution, then we expect little improvement from any of
the back-off schemes, including DTI. For example, only 0.5% and 1.3% of the entire execution times of
Bayes and Vacation are transactions in the No back-off scheme, so that the bars for Bayes and Vacation
in Figure 3.10 are flat. Moreover, the fraction of aborted cycles to total cycles in transaction executions
(aborted cycles plus committed cycles) is another factor that affects overall energy consumption. For
example, in Fmm, only 0.1% of total transaction execution cycles is spent in aborted transactions. In
Figure 3.12, which is closer to Figure 3.10 than Figure 3.11, Fmm shows no improvement with any
back-off scheme. This does not mean that DTI or the other back-off schemes are ineffective but that
there is no need for them, since the underlying basic TM system with no back-off already runs smoothly
without losing machine cycles on wasted work or aborted transactions.
To confirm that DTI actually solves the problem illustrated in Figure 3.1, Figure 3.13 shows the
average number of consecutive aborts of the same transaction with each scheme, normalized to that of
the No back-off scheme. The figure shows how many times an aborted transaction repeats aborts
consecutively on average before it is committed successfully. For example, in Bayes, with the No back-
off scheme, aborted transactions experience repeated aborts 2.91 times consecutively before committing.
In the case of Vacation, every aborted transaction commits successfully right after the initial abort in
every scheme, so that the bars of Vacation are equal in the figure. DTI reduces consecutive aborts more
effectively than any other Back-off scheme. This is also reflected in the energy consumptions in Figure
3.10. The Exponential scheme works to some extent, but not as well as DTI.
Figure 3.14 shows the execution times of the benchmark programs, normalized to those of the No
back-off scheme. We could not find significant variations among the schemes; the variations range from
-1.4% (Intruder with Exponential Back-off) to +2.5% (Yada with Random Back-off). With DTI, the
range is from -0.2% to +1.5%, with an average of +0.4%. These results confirm once more that DTI
does not sacrifice performance to save power. Note that performance can be affected by the underlying
TM system, in particular by the version management policy. Because our base machine adopts lazy
version management, in which transaction aborts are fast and local, performance gains with DTI are not
guaranteed in all cases. For example, when two transactions with conflict history are about to restart
execution after an abort, DTI serializes the two transactions, while the base machine with no back-off re-
executes the two transactions in parallel. Whether there is an actual conflict or not, the resulting
execution time of DTI is the sum of the execution times of the two transactions, whereas in the base
machine, the execution time is the sum of the execution times of the two transactions only when there is
an actual conflict. This can occur because the prediction of DTI is not perfect.
Figure 3.15 compares the number of committed cycles (cycles in committed transactions) spent by all
the transactions in the No-Backoff scheme and in DTI. The figure shows that the number of committed
cycles is not noticeably different in the base machine (No-Backoff) and in DTI. Thus DTI does not
interfere with the normal flow of execution of benchmark programs when there are no conflicts among
concurrent transactions. DTI only targets aborted transactions, and does not affect independent
transactions.
Figure 3.16 shows transaction commit rates calculated by dividing the number of transactions that
successfully commit by the number of transactions that reach the end of their execution (either aborted
or committed) in all back-off schemes, normalized to the same rates in the No back-off scheme. In
Labyrinth, DTI achieves a better commit rate than the No back-off scheme by about 30%. In the figure,
DTI is shown to improve the commit rates in all the benchmark programs because of the conflict
prediction mechanism. With DTI, transactions have a better chance of committing than with the other
schemes once they reach the end of their execution. The commit rates of the No back-off scheme are
0.90, 0.95, 0.73, 0.91, 0.68, 1.00, 0.98 and 1.00 for the benchmark programs in the order of the
horizontal axis of the figure.
Figure 3.17 shows the message overheads during the execution of transactions with DTI. The figure
shows the fraction of extra messages needed to update the data structures of the conflict prediction
modules as discussed in Section 3.4.2, and the fraction of L1 miss request messages plus the commit
request messages. L1 misses by store instructions do not contribute to extra messages since they go
through the network anyway in the No back-off policy. In some programs, the fraction of extra
messages due purely to DTI is not negligible. Overall, extra messages contribute more than 20% of the total traffic. In particular, in Fmm, the fraction of extra messages is 45.2%, which is a huge increase
in message traffic in the network and could affect overall performance and power.
The technique proposed in Section 3.4.2.1, in which store address packets are sent in bulk not only at
transaction commits but also at transaction aborts, drastically reduces the number of packets. Addresses
originating from aborted transactions do not abort the transaction at the receiving nodes but are only used
to update the conflict history vector of the receiving nodes. The fractions of extra messages with bulk
transfer of stores addresses on aborts are given in Table 3.6. In the table, the fractions of extra messages
are now negligible, close to zero. This is because only one message, which possibly contains multiple
addresses, is sent out at a transaction abort, as compared with sending individual store messages
separately during the execution of the transaction.
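The following sketch illustrates the idea behind this bulk transfer; the structure and names are hypothetical, but it shows how only the distinct cache-block addresses written by the transaction are carried in a single message at the abort.

    #include <cstdint>
    #include <set>
    #include <vector>

    // Illustrative write log kept during a transaction: only the distinct cache-block
    // addresses written so far are remembered, and they are sent in one message at
    // a transaction abort.
    struct TxWriteLog {
        std::set<uint32_t> blockAddrs;                 // distinct 32-bit block addresses

        void onStore(uint32_t addr, uint32_t blockMask) {
            blockAddrs.insert(addr & blockMask);       // keep only block granularity
        }

        std::vector<uint32_t> buildAbortMessage() {    // payload of the single message sent at abort
            std::vector<uint32_t> payload(blockAddrs.begin(), blockAddrs.end());
            blockAddrs.clear();                        // the log is reset for the retried transaction
            return payload;                            // 4 bytes of payload per distinct address
        }
    };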
To compare the message overheads in terms of number of bytes transferred, Figure 3.18 compares the
number of extra bytes transferred (in the payload only) in DTI with and without the message overhead
reduction technique, normalized to the case without the reduction technique. We assume 32-bit
addresses. With the reduction technique, the total payload is measured by counting the number of
distinct block addresses in messages sent at transaction aborts, and multiplying the number by 4 (32-bit
address). Without the reduction technique, the total payload is calculated by multiplying the number of
messages sent for every store during the execution phase by 4. In the figure, we observe that the payload
size in bytes is cut dramatically with the reduction technique.
A major factor contributing to the dramatic reduction of bytes transferred in Figure 3.18 is the
temporal locality of memory stores: The same memory address may be written repeatedly during the
execution of a transaction, and the more frequent such repetitions are, the better the reduction technique
works. To quantify the temporal locality of stores, we measured the ratios between the number of stores
and the distinct cache block addresses in a transaction, on average, in Table 3.7. For example, in Bayes,
only 0.5% of the total stores in a transaction target distinct block addresses.
We also compare the power consumption, the energy consumption and the performance of DTI with
and without the message overhead reduction technique. This is necessary because the reduction
technique may change the relative timings between the executions of transactions to a point that it may
affect the results (power, energy and performance) noticeably. Table 3.8 shows the measurements for
DTI with the reduction technique normalized to DTI without the reduction technique. In the table, no
noticeable change is observed except for Intruder, in which the power and energy consumptions are
increased by 16% and 15% respectively. However, the power and energy consumptions of Intruder with
the reduction technique are still better than the No Back-off scheme by 25%, with 1% increase in
performance.
As discussed in Section 3.4.4, there are two major sources of power overheads: the power for
communicating extra messages and the power for activating the prediction logic. With the message
reduction technique, the power overhead of extra messages is marginal as shown in Table 3.6. The power overhead of the prediction logic is also marginal due to the relatively small capacitance and activity factor of the logic (the factor a in Equation (2)), which is limited by the number of transaction commits and aborts and ranges from 0.000 to 0.051 with an average of 0.015.
To help better understand how DTI performs for applications with varying number of threads, Table
3.9 summarizes results from simulations with 2, 4, 8 and 16 threads and cores. Each number is
normalized to the base machine with no back-off. Some entries are not available because we could not
run the simulator for some benchmarks and some thread numbers. Overall, dynamic energy savings,
especially on aborted cycles, increase as the number of threads increases due to the increased level of
contention in the base machine. The execution times are mostly unaffected as DTI targets only aborted transactions with low performance overheads. The dynamic energy wasted on aborts in DTI increases at times for Bayes, Genome and Fmm, presumably because the limited prediction accuracy of the scheme and the relatively small number of aborts amplify the effects of false negatives.
3.7 Future Work
Future work includes, but is not limited to, obtaining additional simulation results for various types of
HTM platforms including commercial HTM implementations to evaluate and understand how DTI can
save power/energy on these platforms. Comparisons with related proposals such as those described in
Section 3.5 are needed too. Finally, incorporating DTI into HTM systems for energy-sensitive
architectures such as those in smartphones will be a direction for our future research work.
Chapter 4
4. FINE-GRAIN TRANSACTIONAL SCHEDULING
4.1 Introduction
An obvious drawback of Transactional Memory (TM) is the overhead of transaction aborts, which can
waste a large number of cycles, possibly of the order of N^2 in the worst case, where N is the number of concurrent threads [13], especially on high-performance parallel machines where N is large. As a remedy, research papers have proposed transaction scheduling algorithms to reduce the number of aborts and hence the number of wasted cycles. Most prior scheduling algorithms [4, 6, 13, 15, 17, 26, 44, 59] use the transaction conflicts that have already happened and have been recorded to predict a future conflict for a new transaction which is about to start, and take action based on the prediction. Some of
these algorithms [4, 6, 13] also use the information of concurrently running transactions in the hope of
improving conflict prediction accuracy.
Based on conflict prediction a processor core decides whether to start the transaction immediately, to
stall the transaction and schedule another transaction, or to stall the core in an energy-saving mode. For
example, the Proactive scheduler in [6] predicts a conflict with a ‘confidence’ value measuring the level
of confidence that there will be a transaction conflict between a new transaction and other transactions
already running in other processor cores. Every time a core detects an actual conflict between a pair of
transactions, Tx_i and Tx_j, the scheduler updates the confidence value at entry (i, j) in a hardware table for
future conflict prediction. We call these scheduling algorithms Coarse Grain transaction conflict
Prediction and Scheduling (CGPS) because they predict a transaction conflict at the coarse granularity
of an entire transaction.
The prediction accuracy of CGPS algorithms is limited by the granularity of the prediction. First,
CGPS accuracy suffers from false negatives, i.e., CGPS may predict no conflict but a conflict and a
transaction abort actually happen as illustrated in Figure 4.1. In the figure, right before a processor core
(P2) in a multiprocessor system starts its transaction Tx2, the CGPS scheduler predicts no conflict
because no conflict history is currently recorded for Tx1 which is running on another processor core
(P1). The scheduler schedules Tx2 immediately on P2, and later a conflict occurs on address X between
the write X in P1 and the read X in P2. This conflict aborts Tx2 when it is detected, hence wasting the
machine cycles spent in Tx2 up to the time of the conflict detection.
Second, CGPS also suffers from false positives when CGPS predicts a conflict but no conflict is
detected as illustrated in Figure 4.2. In the figure the CGPS scheduler could immediately schedule Tx2
in P2 without causing a conflict on X because in real time Tx1 commits before the read X in Tx2.
However, because of a conflict history with Tx1, the scheduler waits until the end of Tx1 to start Tx2.
Arguably, false positives are not as detrimental as false negatives in general because their major
overhead is unnecessary context switching or processor stalls. Such unnecessary activities waste system
resources, but not as much as transaction aborts do.
In this chapter, we introduce Fine Grain transaction conflict Prediction and Scheduling (FGPS) which
predicts transaction conflicts at each memory access to deal with false positives and false negatives in
CGPS.
In summary, we make several new contributions to the state of the art:
• We introduce a novel hardware prediction and scheduling algorithm based on fine-grain transaction conflict prediction in Hardware Transactional Memory and Bulk.
• We provide the implementation details of FGPS in hardware.
• We evaluate FGPS on a cycle-accurate simulator, comparing it with one of the most effective CGPS designs [13].
The rest of the chapter is organized as follows. Section 4.2 provides an overview of
Fine Grain transaction conflict Prediction and Scheduling (FGPS), highlighting differences with Coarse
Grain transaction conflict Prediction and Scheduling (CGPS). We show how FGPS reduces the number
of false positives and false negatives and describe the hardware support for FGPS. Section 4.3 explores
the design space of HTM prediction and scheduling schemes from which we select the design choices
for our FGPS design. Section 4.4 provides the implementation details of the FGPS design. Section 4.5
compares the FGPS implementation to a prior CGPS implementation with simulation results obtained
from a cycle accurate simulator. Lastly, Section 4.6 concludes the chapter with comments on future
work.
4.2 Fine Grain Transaction Conflict Prediction and Scheduling
Predicting a transaction conflict accurately is critical to transaction scheduling algorithms based on
conflict prediction. In this section, we contrast our proposal for FGPS and existing predictive proposals
based on CGPS. We show how FGPS addresses both false negatives and false positives, and we outline
the hardware support necessary to implement FGPS.
The detection of a transaction conflict is a dynamic event dependent on the actual timings of
transaction executions. CGPS uses dynamic information gathered on the current and past states of the
system to predict future conflicts. Specifically, to predict a transaction conflict, CGPS relies on a
hardware data structure maintaining two types of dynamic information in each core: identification of
currently running transactions and conflict histories between these transactions and a new transaction to
be scheduled.
When a processor core is about to start a new transaction or restart an aborted transaction, a CGPS
module in the local processor core consults the data structure to predict a transaction conflict. If the
transaction has a conflict history with at least one of the transactions still running concurrently at remote
processor cores, the CGPS module predicts a future conflict, stalls the transaction and schedules another
transaction instead or stalls the transaction in an energy-saving mode until the remote, conflicting
transaction is committed.
CGPS predicts a future conflict based on past conflict histories at the beginning of a transaction, at
which time the timing and identification of a memory access – a load or a store – that is related to the
conflict histories is unknown. Therefore the possibility of false negatives and false positives exists as
illustrated earlier in the examples of Figure 4.1 and Figure 4.2. Just the fact that there has been a conflict
in the past does not provide accurate information about future conflict detection. By contrast, FGPS
predicts conflicts at the time of every memory access, and hence handles false negative and positive
predictions better. To achieve this, the FGPS module tracks the write-sets of currently running
transactions instead of conflict histories between transactions.
To better understand false negatives and false positives in CGPS, we take a closer look at the relation
between a possible conflict and the actual detection of a conflict. A possible conflict exists when part of
the read-set or write-set of a transaction overlaps with the write-set of another transaction. However, the
possible existence of a conflict between two transactions does not always lead to the manifestation of the
conflict and a transaction abort. This is because the manifestation and detection of a conflict depends on
the actual timing of the memory access that causes the conflict. Besides, aborts are also dependent on
the underlying conflict-resolution mechanism activated at runtime.
CGPS requires that two conditions be met to avoid a false negative prediction between two
transactions that have a possible transaction conflict with each other. First, processor cores should have
executed the two transactions before. Second, the TM modules running the transactions should have
detected the conflict between the transactions in the past. Unless both conditions are met, no conflict
history could have been recorded. The first condition is violated if either one of the two transactions is a
first timer, never executed before. For the second condition, even though two transactions have a
possible conflict and were executed before (first condition), if there was enough of a time gap between
the executions of the two transactions so that the processor cores committed their transactions without
detecting the conflict, no conflict history was recorded.
The same reasoning explains how false positives happen when memory accesses such as write X in
Tx1 and read X in Tx2 in Figure 4.2 in two transactions with a conflict history are executed with enough
of a time gap not to cause a transaction conflict this time around. Conclusion: The existence of a
possible conflict does not always lead to actual conflict detection and a transaction abort.
In summary, the sources of false negatives and false positives in CGPS come from 1) relying on past
information, not present information, to predict a future conflict, and 2) predicting the future at the
granularity of transactions, which does not reflect the actual timings of the memory accesses that cause
transaction conflicts.
Figure 4.3 illustrates how FGPS predicts a transaction conflict using present information at the
granularity of memory accesses using the write-sets of currently running transactions in concurrent
threads on remote processor cores. By contrast with CGPS, a processor core in FGPS always starts the
execution of a transaction immediately at its beginning. Conflict predictions occur at the times of
shared-memory accesses, such as the read or write to memory address X in the figure, by testing
whether the address belongs to any of the write-sets (W_i) of the (N – 1) transactions currently running (X in {W_i}?), assuming N processor cores in the system. If the test result is positive, a conflict is
predicted, the core stalls the execution of transaction Tx and a scheduling action such as switching to
another transaction ensues.
To remove false negatives, FGPS only requires one condition, which is easier to satisfy than the two
conditions of CGPS. The condition is that processor cores have executed currently running transactions
before so that their write-sets are known in advance. In CGPS, both the currently running transactions
and the transaction to be scheduled (Tx in Figure 4.3) should have executed in the past and conflicts
should have been detected and recorded as well.
Static, software approaches involving high-level languages or compilers might reveal write-sets of
transactions statically before the execution of the transactions. If that were possible, there would be no
false negatives anymore. However, such static approaches are out of the scope of this paper and we
leave them for future work.
False positives rarely occur in FGPS because predictions are made at the granularity of shared-
memory accesses. If the address of a memory access belongs to the write-set of any of the currently
running transactions, a conflict will be detected between the memory address and the remote transaction
unless the conflicting remote transaction is aborted by a conflict with another transaction, which is a
false negative. For example, in Figure 4.3, let’s assume that X belongs to one of the remote write-sets
W_i. This indicates that there is an actual conflict between Tx and the transaction that owns W_i, so that Tx will be aborted as soon as W_i is committed. In CGPS, the detection of a conflict depends on the actual
timing of execution of the conflicting memory access (read/write X), which is unknown at the beginning
of a transaction.
Following the preceding discussion, FGPS deals better with false negatives and false positives than
CGPS because 1) FGPS does not rely on past history but just on present information and 2) it does so at
a granularity that reflects the actual timings of the sources causing transaction conflicts.
To support FGPS, the hardware should provide the following mechanisms in each processor core: 1) a
storing mechanism in a buffer space to store the write-sets of currently running transactions in remote
threads – store; 2) a conflict checking mechanism that checks whether a memory access belongs to the
write-sets – comparison; and 3) an updating protocol to keep the write-sets of currently running
transactions up-to-date – update. In addition, the hardware should guarantee forward progress, avoiding
deadlocks or livelocks.
4.3 Design Overview
In this section, we first discuss the applicability of FGPS for basic HTM systems, then we explore the
design space of hardware support for transaction prediction and scheduling, and finally we show the
design choices we have made that fit well with an FGPS implementation.
4.3.1 Applicability
FGPS can be applied to any base HTM system regardless of the underlying conflict detection
mechanism. For both eager and lazy conflict detection, a memory access is allowed to access the
memory hierarchy only if no conflict is predicted. With the eager mechanism, the base HTM detects a
conflict for the memory access just performed during the transaction execution. With the lazy
mechanism, the base HTM detects a conflict(s) collectively for the memory accesses already performed
at the end of the transaction execution. In either case, if the prediction(s) were accurate, no conflict is
detected, hence no transaction aborts.
4.3.2 Design Space
4.3.2.1 Storage format
Two options are available to keep track of the current write-set: either keeping all individual store
addresses of a transaction in a buffer or accumulating addresses in a signature using a hashing function
based, for example, on a Bloom filter [10]. TM proposals such as [37, 58] employ a signature scheme
for conflict detection, mostly because a signature of addresses is much more compact than a lengthy list
of raw addresses.
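To make the signature option concrete, the following is a minimal Bloom-filter-style write signature; it is an illustrative sketch only (real HTM signatures such as those in [37, 58] use hardware hash functions and different sizes, and the hash constants below are assumptions). Inserting every store address and testing membership on each access may return false positives but never false negatives.

    #include <bitset>
    #include <cstdint>

    // Minimal Bloom-filter-style write signature (illustrative sketch).
    class WriteSignature {
        std::bitset<1024> bits;
        static uint32_t h1(uint32_t a) { return (a * 2654435761u) % 1024; }     // hypothetical hash functions
        static uint32_t h2(uint32_t a) { return ((a >> 6) * 40503u) % 1024; }
    public:
        void insert(uint32_t blockAddr) { bits.set(h1(blockAddr)); bits.set(h2(blockAddr)); }
        bool mayContain(uint32_t blockAddr) const {
            return bits.test(h1(blockAddr)) && bits.test(h2(blockAddr));       // possible false positives only
        }
        void clear() { bits.reset(); }
    };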
4.3.2.2 Scope of history
Regarding the scope of history bookkeeping, some proposals [4, 6] include all the conflicts that
happened in the past to predict future conflicts. For this, the scheduler accesses and updates a history of
conflicts between two specific transactions using transaction IDs (TxID) assigned to each transaction.
By contrast, the scheduler in [13] registers and uses only the most recent conflicts that happened to the
current transaction during its last execution, simplifying the prediction process and requiring a smaller
buffer space.
4.3.2.3 Scheduling
When a scheduler predicts a conflict during a transaction execution, it may choose to stall the
transaction and switch to another thread, or to stall the transaction in the core until no more conflict is
predicted. For example, the proactive scheduler in [4] switches to another thread if the size of the remote,
conflicting transaction is larger than a threshold, or waits with retries and random back-off delays if the
size is smaller than the threshold. A scheduler such as in [13] stalls the core in all cases and puts the
processor core in an energy-saving mode until the remote transaction causing the conflict commits and
updates its current status.
4.3.3 Design Choices
4.3.3.1 Storage format
In terms of storage and comparison overheads, signatures are better than a list of raw addresses to
track the current write-set. Moreover, the use of signatures for conflict prediction leverages existing
hardware support such as signature encoding and membership test already present in some HTM
systems for conflict detection.
4.3.3.2 Scope of history and signature update
Regarding the scope of history, we choose to store the most recent write-set signatures of aborted
transactions rather than all past signatures, which simplifies the updating of signatures. We store in each
(local) core N-1 write-set signatures, each of which encodes the write-set of a remote transaction running
on one of (N-1) remote processor cores.
The signature updating procedure is different depending on the conflict detection mechanism of the
underlying base HTM.
If the base HTM uses eager conflict detection, no additional messages are necessary during a
transaction execution. When a memory store is performed, any processor core receiving the invalidation
for the memory store through a cache coherence protocol needs to update the corresponding write-
signature indexed by the Processor core ID (PID) attached to the invalidation message. When a
processor core starts a new transaction or commits a transaction after an aborted transaction, its
signature stored in other cores does not represent the current transaction anymore. Therefore the core
sends an invalidation message with its PID to invalidate its signature kept in remote nodes.
If the base HTM uses lazy conflict detection a processor core aborting a transaction sends out its
current write-signature (which has accumulated the aborted transaction’s store addresses up to the point
of abort) to all remote processor cores unless the core switches to a new transaction. A remote core that
receives the signature then updates its local copy of the sender processor core’s signature using the
sender PID attached. The signature encodes the write-set of a transaction that has been aborted and is
retried on the sender processor core. When a processor core commits a transaction, its base HTM sends
a write-signature to the other cores for conflict detection. A receiver core invalidates the locally stored
signature of the committing transaction using the attached PID. A bit is added to any write-signature
packet to indicate whether the signature is from an aborted or committed transaction, so that a receiver
may take the appropriate action (updating or invalidating a signature).
Figure 4.4 summarizes the signature updating events of sender and receiver with eager and lazy
conflict detection.
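As an illustration of the receiver-side behavior under lazy conflict detection, the sketch below stores the signature of a retried, previously aborted transaction and invalidates the stored entry when the sender commits; the names, the signature size and the number of cores are assumptions, not the actual packet format.

    #include <bitset>
    #include <cstdint>

    using Sig = std::bitset<256>;        // opaque write-signature representation (size is an assumption)

    // One incoming write-signature packet: sender PID, signature, and a bit telling
    // whether it comes from an aborted (retried) or a committed transaction.
    struct SignaturePacket {
        uint32_t senderPid;
        Sig      signature;
        bool     fromAbort;
    };

    struct RemoteSignatureBuffer {
        static const int kCores = 16;    // assumed number of processor cores
        Sig  entries[kCores];
        bool valid[kCores] = {false};

        void onPacket(const SignaturePacket& p) {
            if (p.fromAbort) {           // remember the write-set of the retried transaction
                entries[p.senderPid] = p.signature;
                valid[p.senderPid] = true;
            } else {                     // commit: the stored signature no longer represents a running transaction
                valid[p.senderPid] = false;
            }
        }
    };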
For the scope of history, we could alternatively store all (or several) past signatures in a data structure,
not just the most recent signatures of aborted transactions. This would make it possible to keep track of
the write-signatures of past committed transactions to predict a conflict for a committed transaction that
is re-executed (re-scheduled) after the transaction has been committed successfully. This policy might
improve the prediction accuracy but the updating process including the maintenance of the data structure
would be more complex with a larger buffer space. We leave this for future work.
Also, we could let an aborted transaction run until the end of the transaction execution. This way, a
complete write-signature, including the write addresses beyond the abort point, could be obtained, which
could improve prediction accuracy. However, this approach would also increase the execution time,
offsetting its benefit, as reported in [13].
4.3.3.3 Scheduling
Regarding an action that follows a positive conflict prediction, for example transaction/thread
switching or core stalling, we delegate the decision to a scheduler at an upper level, which is typically a
part of the operating system. Nevertheless, we still have to guarantee forward progress at the hardware
level, especially for the stalling policy in which a deadlock might occur between two transactions
stalling each other. We use a timestamp to prioritize older transactions over younger ones in a small-size
system relying on a bus interconnection. In a large system with a scalable network, a shared counter
variable used as a timestamp and accessed with an atomic read-and-increment instruction may be
preferable. Other stalling policies such as prioritizing shorter transactions or policies based on static
prioritization could also be imposed.
Table 4.1 summarizes the design choices for our FGPS design.
4.4 Implementation Details
In this section, we provide implementation details for the FGPS design outlined in the previous
section.
Figure 4.5 shows the system architecture of our FGPS machine. Processor nodes are connected to
each other via an interconnection network through a network interface. In each node, the base HTM
provides hardware support for the correct operation of TM such as detecting and resolving a transaction
conflict. On top of the base HTM, the FGPS module is responsible for conflict prediction and
scheduling.
Figure 4.6 shows the microarchitecture of the FGPS module. The prediction logic receives a memory
address as an input and returns the prediction result as an output. The buffer space stores received write-
signatures in the entries indexed by PIDs. Figure 4.7 shows the hardware data structure in the buffer
space.
Figure 4.8 shows the pseudo-code for the prediction logic. Upon request, the prediction logic iterates
through valid entries in the signature data structure to test the membership of the address in one of the
signatures [10] (MemoryAddress in WriteSignature[i]). Once a conflict is found and if the timestamp of
the write-signature causing the conflict (WriteSignature[i].TimeStamp) is older than that of the
transaction currently running in the processor core, the logic stops iterating and returns the result
(ConflictPrediction) as “true”. Alternatively the membership operation test could be done in parallel
across all (N-1) entries. Under the core stalling policy, whenever a transaction is stalled due to a
predicted conflict, an update to the signature data structure calls the prediction logic and triggers the
resumption of execution if no more conflict is predicted.
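For concreteness, the following is a C++ rendering of the logic sketched in Figure 4.8; it is an illustrative sketch (the struct layout, hash constants and names are assumptions) showing the per-access membership test and the timestamp comparison that prioritizes older transactions.

    #include <bitset>
    #include <cstdint>

    // One buffered remote write-signature entry with its validity bit and the
    // timestamp of the remote transaction (layout is an assumption).
    struct RemoteWriteSig {
        bool     valid = false;
        uint64_t timeStamp = 0;
        std::bitset<1024> bits;                       // Bloom-filter bits (size is an assumption)
        bool mayContain(uint32_t a) const {           // membership test with hypothetical hash functions [10]
            return bits.test((a * 2654435761u) % 1024) && bits.test(((a >> 6) * 40503u) % 1024);
        }
    };

    // Returns true (ConflictPrediction) if the access should stall the local
    // transaction: the address hits a valid remote write-signature whose
    // transaction is older than the local one, which preserves forward progress.
    bool fgpsPredictConflict(uint32_t memoryAddress, uint64_t localTimeStamp,
                             const RemoteWriteSig* writeSignature, int numRemoteCores) {
        for (int i = 0; i < numRemoteCores; ++i) {
            if (!writeSignature[i].valid)
                continue;                                          // no tracked remote write-set for this core
            if (writeSignature[i].mayContain(memoryAddress) &&     // MemoryAddress in WriteSignature[i]
                writeSignature[i].timeStamp < localTimeStamp)      // the remote transaction is older
                return true;                                       // predict a conflict and stall
        }
        return false;                                              // no conflict predicted
    }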
4.5 Evaluation
In this section, we provide a comparison between our FGPS design and a base CGPS machine. We
first describe the simulation environment and we follow with simulation results and analyses.
4.5.1 Simulation Environment
Our base machine is based on the CGPS design in [13], which was shown to be superior to previous
CGPS systems using simple back-off scheduling, and matches our FGPS design in complexity for
hardware support. We implemented this base machine and our FGPS machine on a cycle-accurate
microarchitecture simulator and ran a set of benchmark applications on both machines. The major goals
of the evaluations are to compare the effectiveness of FGPS and CGPS for predicting transaction aborts
and to assess the overheads of the FGPS machine.
The simulation platform is the SESC simulator [41], which is capable of simulating a multi-core
machine composed of Out-of-Order superscalar MIPS cores with detailed timing. On top of the basic
SESC simulator, we added a TM package [34], which provides a TM context with basic HTM primitives
such as a transaction commit function including a simple transaction validation process and an abort and
rollback function restoring the machine state to a checkpoint made at the beginning of a transaction. We
modified the existing primitives and added the new components shown in Figures 4.5 and 4.6 to implement our
design and the base machine on the simulator.
Table 4.2 shows the simulation configuration of a processor node in Figure 4.5.
For benchmark applications, we used the STAMP benchmark suite [32], which was developed for
TM evaluation, and the SPLASH-2 benchmark suite [56], where lock acquisitions and releases are
directly converted to transaction begins and ends. Table 4.3 summarizes the benchmark applications
with input parameters. The first eight applications are from the STAMP suite and the rest is from the
SPLASH-2 suite. We could not run all applications due to simulator errors especially in some of the
SPLASH-2 applications because of the direct conversion of locks to transactions.
Also, in both the CGPS and FGPS machines, we adopted the core stalling policy with the timestamp
mechanism to guarantee forward progress, making a direct comparison between the two machines
possible. Lazy conflict detection, for which the CGPS design in [13] was shown to outperform some
other back-off schemes, was used in the underlying base HTM module.
4.5.2 Results and Analyses
Figure 4.10 compares the execution times of the base CGPS machine and of the FGPS machine for
the benchmark applications, normalized to those of the CGPS machine. We measured the execution
times from the time of the first transaction execution to exclude pre-transaction execution time, which is
the same in the CGPS and the FGPS machines and is sometimes long enough to obscure differences
between the two machines. We can observe from the figure that the execution times do not vary much
between the two machines except in Bayes ('B') for two reasons. First, we stall processor cores on
conflict detection and put them in an energy-saving mode instead of switching threads to reduce the
execution time. Second, the execution times include non-transactional code, which mitigates the
differences in the transactional code execution times between the two machines.
To exclude the effect of non-transactional code, Figure 4.11 shows the execution time of transactions,
measuring the number of cycles spent in the execution of transactions only normalized to the transaction
cycles in the CGPS machine. Cycles spent in both committed and aborted transactions are counted and
displayed. We observe that in all cases except for one (Labyrinth), FGPS spends fewer cycles in
transactions. Note that the prediction and scheduling schemes in both the CGPS machine and the FGPS
machine directly target aborted transactions to reduce the number of cycles wasted in them and do not
affect committed transactions directly.
The level of contention among transactions characterizing an application also affects the improvement
margin of the FGPS machine. For example, for all the benchmark applications in STAMP with low
contention in Table 4.3 (Genome (G), Kmeans (K), Ssca2 (S) and Vacation (V)), FGPS shows little
improvement over CGPS. These (relatively) small improvements do not mean that FGPS is ineffective.
The gains attributable to FGPS depend to some degree, but not entirely, on the contention characteristics
of applications: higher contention levels do not guarantee larger improvement margins but mean more
opportunities for improvement. Table 4.4 lists the fraction of aborted cycles among all transaction cycles
(aborted cycles plus committed cycles) of the benchmark applications run on a machine equipped with
no prediction/scheduling (no FGPS/CGPS module in Figure 4.5) in order to show the dynamic
transaction contention characteristics. Among the SPLASH-2 applications (BA ~ WS), Water-spatial,
which has the highest fraction of aborted cycles, benefits most from FGPS as shown in Figure 4.11.
Other factors affect the effectiveness of FGPS beside the fraction of aborted cycles, as demonstrated in
Table 4.4: FGPS fares worse in Labyrinth (‘L’) despite its highest fraction of aborted cycles.
To evaluate the direct effect of FGPS on aborted cycles, Figure 4.12 compares the number of aborted
cycles between the CGPS and the FGPS machines, normalized to that of CGPS. From the figure, fine
grain prediction reduces the number of aborted cycles in most benchmark applications. The number of
aborted cycles is reduced most in Intruder (‘I’) (by 70.5 %). In the case of Vacation, the number of
transaction aborts is already very small even without prediction and scheduling and so FGPS can do
little to improve. For Labyrinth (‘L’), FGPS is still not effective according to this metric. Its number of
aborted cycles is increased by 7% over CGPS. Figures 4.11 and 4.12 show correlations between the
number of aborted cycles and the number of transaction cycles, especially for Labyrinth.
The average number of aborted cycles over all benchmarks is reduced by 34.4 %. The cycles saved in
aborted transactions can be leveraged to improve performance by switching to other threads, or
power/energy by putting processor cores in a low power/energy saving mode during those cycles.
Figure 4.13 shows the number of transaction aborts in the CGPS and FGPS machines, normalized to
that of CGPS. For all applications except Labyrinth ('L') and Yada ('Y'), the number of aborts in the FGPS machine is lower, but the reductions do not always match those in Figure 4.12 because the decrease in the number of cycles in Figure 4.12, which is the most important metric, depends not only on the number of transaction aborts but also on the timings of transaction aborts: aborts may occur towards the beginning, in the middle, or towards the end of a transaction, resulting in different numbers of wasted cycles. We observe that there is a strong correlation between the increased number of
transaction aborts in Labyrinth (‘L’) and the increased number of aborted cycles in Figure 4.12. In this
case, the FGPS machine, which is not completely free from false negative and false positive predictions,
does not work effectively to reduce the overall number of transaction aborts.
Table 4.5 compares the number of aborted cycles and the number of transaction aborts between the
CGPS and FGPS machines, normalized to the machine without a conflict prediction/scheduling
mechanism. For example, in Bayes (‘B’), CGPS raises the number of aborted cycles by 0.2 % over the
machine without a prediction/scheduling mechanism while FGPS decreases the number of aborted
cycles by 47.5 %. In most cases, except for Labyrinth ('L'), FGPS is better than CGPS. On average, the
number of aborted cycles in FGPS is better by 39.1 % and the number of transaction aborts is improved
by 39 % over the machine without prediction/scheduling.
Notably, Table 4.5 shows that in applications such as Fmm ('F'), Ocean-contiguous ('O'), Water-nsquared ('W') and Water-spatial ('WS'), for which CGPS provides no improvement, FGPS shows significant
improvements. This, coupled with the reduced number of consecutive aborts in all four applications as
shown in Table 4.6, indicates that FGPS handles false negatives much better than CGPS for those
applications.
Figure 4.14 illustrates an example in which a transaction experiences repeated transaction aborts due
to a false negative in the CGPS machine. In the figure, after Tx1’s first abort, P2 immediately starts Tx2,
which has never been executed before, because no conflict is predicted at the moment. Later, when P2
commits Tx2, it sends Tx2’s write-set, which overlaps with Tx1’s read-set or write-set, causing Tx1’s
second abort. By contrast, given the very same available information, in the FGPS machine P1 would continue and commit Tx1 after the first abort, because P2 stalls Tx2 as soon as it predicts a future conflict after the beginning of Tx2.
Figure 4.15 shows commit rates after transaction execution, normalized to those of the CGPS
machine. We measure these commit rates by dividing the number of transaction commits by the number
of transactions that reach the end of their execution and then either commit or abort. All other factors
(such as the number of transactions) being equal, the closer a transaction abort is to the end of
transaction execution, the worse the penalty of the abort is because the transaction wastes more cycles
and machine resources.
Although the number of aborted cycles displayed in Figure 4.12 provides the ultimate measure of
savings, we also use the commit rates to observe how accurate conflict predictions are. At one extreme,
if every piece of information is available about past, present and future, there are no false negatives and
the commit rates should be 100%. With the current implementations, two sources limit the commit rate.
First, write-sets sent at aborts are not complete but instead are accumulated only up to the abort time.
Second, tracking only the latest write-sets does not allow full access to what happened in the past. We
leave for future work the exploration of further optimizations that decrease false negatives and increase
commit rates further but result in increased overheads in execution time and hardware support. In Figure
4.15, the FGPS machine shows better commit rates overall than the CGPS machine, by 9.5 % on
average. The average commit rates of the FGPS and CGPS machines are 0.91 and 0.86 respectively.
To estimate conflict prediction overheads, Table 4.7 shows the number of signature operations per
memory access, measured by dividing the number of signature membership tests (MemoryAddress in
WriteSignature[i] in Figure 4.8) by the number of memory reads and writes for the applications run on
the FGPS machine. In all the applications, the number of membership operations is less than one and the
average for all applications is 0.43.
There are two factors that contribute to such low numbers. First, in some applications, transaction
commits are much more common than transaction aborts, and the FGPS machine only tracks the write-sets of the most recent aborts; hence in these applications write signatures remain invalid in the
prediction data structure (Figure 4.7) because write signatures from commits are never stored in the data
structure. For example, in Fmm (‘F’), less than 0.1 % of transactions experience aborts and the rest do
not propagate their write-set signatures to the data structure, necessitating no membership operations.
Second, as illustrated in Figure 4.8, the prediction logic does not visit all the valid signatures but stops as
soon as it detects a conflict.
Table 4.8 displays the number of additional messages (over the machine with no
scheduling/prediction) sent at transaction aborts for conflict prediction in the CGPS and FGPS
machines. In some applications, FGPS is better and in others it is worse than CGPS in terms of the
number of additional messages. Considering the number of other network activities and resulting
message traffic such as cache protocol messages and cache miss traffic, the numbers in Table 4.8 are
practically negligible.
4.6 Future Work
Future work includes, but is not limited to, exploring other design choices for the FGPS algorithm to
further reduce the number of false negatives with increased storage and execution overheads and using
the result of conflict prediction for thread context switching. Also, implementing our FGPS algorithm on
top of current TM support in commercial processors is desirable to understand its overheads in a current
commercial environment.
Chapter 5
5. SPECULATIVE CONFLICT DETECTION AND RESOLUTION
IN HARDWARE TRANSACTIONAL MEMORY
5.1 Introduction
In this chapter we address the timely delivery of invalidations and detection of conflicts in Hardware
Transactional Memory and other similar approaches committing stores in bulks. We advocate a fully
optimistic approach in which processor nodes executing transactions send cache invalidations for their
write-set en masse right after the execution phase and without validating the transaction. One of the
specific targets is Hardware Transactional Memory with late or lazy conflict detection as described in
[33] (lazy HTM) such as TCC [19] and Scalable TCC [9] in which processor nodes/cores transmit
multiple – possibly many – invalidations in one block. In addition to lazy HTM, our new approach is
applicable to any system based on an execution model similar to lazy HTM, and requiring multiple
cache invalidations in groups lazily such as Bulk [10] and its derivatives [11, 37]. Bulk provides
efficient speculative execution and correct memory consistency in shared-memory multiprocessors
based on so-called ‘chunks’ (a chunk in Bulk is equivalent to a transaction in lazy HTM).
While existing lazy HTM and Bulk protocols are mainstays for correct operation, they are not fully
optimistic from the perspective of the timing of invalidations, and their performance and complexity can
be improved by using a more optimistic approach as this paper demonstrates. We propose to send cache
invalidations en masse without validation and in parallel with transaction validation in lazy HTM and in
shared-memory systems using a Bulk-like consistency protocol. In summary, the contributions of our
paper are as follows.
• We introduce a new cache invalidation algorithm based on a fully optimistic approach in which
cache invalidations are sent en masse without validation and transactional conflict detection is
embedded in the cache protocol.
• We describe in some details an invalidation-based cache protocol based on the new optimistic
approach, which we have implemented on top of the SESC/SuperTrans simulator with some
modifications.
• We compare the performance of the new protocol by simulation against ScalableBulk, arguably
one of the most aggressive IAV protocols to date.
The rest of this chapter is organized as follows. Section 5.2 describes and compares prior approaches
and our new approach to highlight major differences. Section 5.3 describes a new cache invalidation
protocol based on the new approach. Section 5.4 presents our simulation results comparing IWV with
ScalableBulk. Finally, section 5.5 discusses future work.
5.2 Lazy conflict detection systems
In designing a shared-memory system based on the execution-validation-commit model requiring
simultaneous cache invalidations such as lazy HTMs, one important design decision is how and when to
propagate the cache invalidations to stakeholders in the system, i.e., processor nodes which have read or
written shared memory locations in their cache speculatively. This is important not only because the
decision is related to conflict detection for correctness but also because it affects complexity and
performance. How to propagate invalidations is well explored and documented in the current literature
on lazy conflict detection systems [9, 10, 19, 36, 37]. By contrast, when to propagate invalidations is an
aspect that has not received much attention.
In the following subsections, we first describe prior lazy systems with a focus on when the updates in the write-set are propagated, and then we present our new approach in comparison with these existing systems. For completeness, we also include other approaches in our survey of prior art, even if they are not directly related to the specific goal of this chapter. First we describe the current state of the art, in which invalidations are propagated after validation.
5.2.1 Non-speculative conflict detection
The current approach to propagate cache invalidations in lazy (lazy conflict detection) HTM and other
lazy systems is to wait until the end of the validation phase before sending invalidations to commit
values in the write-set. We call this approach Invalidation After Validation (IAV). IAV protocols send
cache invalidations when the current set of speculative writes are guaranteed to commit to the entire
system without failure: processor cores must first validate their current write-set before sending out
invalidations for them. In general, this validation process takes time due to activities such as message
exchanges between processor nodes and/or between processor nodes and directory nodes. It is hard to
hide the latency of this time-consuming process because it is on the critical path of transaction
execution. IAV protocols are complex and also delay conflict detection, which in turn delays transaction
aborts resulting in useless execution.
TCC [19] and BULK [10] rely on a central arbiter to validate or secure the current write-set. When a
processor node finishes the execution phase for the current transaction or bulk, it requests permission to
commit from the central arbiter. Once the permission is granted, the write-set is secure and invalidations
(of the write-set) are broadcast to the entire system. A node receiving an invalidation for its read-set or write-set before its own permission to commit is granted must abort and restart its current transaction. While this
scheme simplifies the validation phase, the reliance on a central arbiter and a broadcast mechanism
limits its scalability on a point-to-point interconnection network. In addition, it does not allow parallel
commits of independent transactions or bulks, since transactions or bulks are restricted to commit one at
a time in the whole system.
Scalable TCC [9]’s main goal is to improve the scalability of TCC in the context of a distributed
shared-memory multiprocessor with a directory-based protocol. When a transaction finishes the
execution phase, it first obtains a Transaction ID (TID), which is a unique number and is incremented by
one at every TID request. The TID is used for conflict resolution or concurrency control: a lower
(earlier) TID is given higher priority when transactions conflict with each other. Then the transaction
tries to secure its write-set and read-set by reserving and probing directory modules. Each directory
module maintains a Now Serving TID (NSTID) that tracks which TIDs have been serviced by it and
consequently which TID has the highest priority at the moment for all memory addresses attached to the
directory. The transaction can start the commit phase and begin sending invalidations for its write-set once it has verified, by probing the corresponding NSTIDs, that all the addresses in its write-set and read-set have the highest priority, i.e., that no prior transaction with a higher-priority TID has written (or is in the process of writing) any address in its write-set or read-set. To support parallel commits of independent transactions, transactions must send their TID in ‘skip’ messages to directories holding memory addresses outside their current write-set, to notify them that they do not intend to write there. This validation procedure, which may involve probing and re-probing all the directory modules repeatedly, guarantees that cache invalidations are sent only by committing transactions.
Thus the cache protocol is separate from the transaction protocol. Besides being overly complex, it
cannot hide delays on the critical path of the transaction and it delays conflict detection. Late conflict
detection in turn increases the wasted work done by transactions that will abort later. In addition, the
directory-based TID tracking mechanism limits parallel commits of independent transactions trying to
reserve the same directory module(s).
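To make the NSTID idea above more concrete, the following C++ sketch shows one way a directory module's NSTID bookkeeping could be expressed. It is our own illustration based solely on the description above; the names (Directory, probe, skip, done) and the handling of skip messages are assumptions, not the actual Scalable TCC implementation, and out-of-order skip arrivals, which would require buffering, are omitted for brevity.

// Hypothetical sketch of the NSTID bookkeeping at one directory module.
#include <cstdint>

struct Directory {
    uint64_t nstid = 1;   // Now Serving TID: the TID this directory is currently serving

    // A committing transaction probes the directory: it may proceed here only
    // when the directory is now serving its TID, i.e., all lower (higher-priority)
    // TIDs have already been serviced or skipped at this directory.
    bool probe(uint64_t tid) const { return nstid == tid; }

    // Independent transactions that do not write any address mapped to this
    // directory send a 'skip' message so the directory can advance past their TID.
    void skip(uint64_t tid) { if (tid == nstid) ++nstid; }

    // Called when the transaction currently being served finishes its commit here.
    void done(uint64_t tid) { if (tid == nstid) ++nstid; }
};

In this simplified picture, a transaction repeatedly probes every directory touched by its read-set and write-set and may commit only when all probes succeed, which is why the procedure sits on the critical path of every commit.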
Scalable HTM [36] proposes various algorithms to reserve directory modules in order to reduce the
message traffic of Scalable TCC. As in Scalable TCC, the basic approach is to secure a transaction’s
read-set and write-set by reserving their directory modules. Once all the memory addresses in the read-
set and write-set are secured, the current transaction is successfully validated and able to start sending
out invalidation messages. Several approaches are proposed to secure the read and write sets. In basic
Sequential Commit (SEQ), directories are reserved sequentially in ascending order. In SEQ with parallel
reader optimization (SEQ-PRO), multiple readers without a writer are allowed to reserve a directory
simultaneously. Finally in SEQ with occupancy stealing with timestamps (SEQ-TS), directories are
reserved in parallel and priorities are given based on timestamps. These techniques avoid contact with
all directory modules, but the reservation process is still on the transaction’s critical path and limits the
parallelism among independent transactions as in Scalable TCC [9] above.
ScalableBulk [37] also reserves directory modules to proceed to the commit phase and to propagate
invalidations but, by contrast with Scalable HTM, it lets two chunks (transactions) reserve the same
directory simultaneously unless they actually have a conflict on the same memory block. The
reservation process is done by a ‘Group Formation’ protocol in which participating directory modules
are grouped together serially by passing a protocol message called a ‘g’ or ‘grab’ message. Once the g message arrives back at the first reserved directory module, the leader directory, invalidations are sent to the set of sharer nodes collected during the circulation of the g message, using the committing processor’s write address signature, an abridged version of its write-set. A collision is detected, using each chunk’s read and write signatures, when two or more conflicting chunks try to reserve the same directory, and the chunk which reserved the directory first is chosen as the winner. While
ScalableBulk’s Group Formation protocol can achieve a high level of parallelism for independent
chunks by reserving directory modules simultaneously, the group formation process is still on the
critical path of the chunk execution. ScalableBulk is the most aggressive scalable IAV protocol
proposed so far and it will be the reference point in the evaluations of our scheme, which we now
describe.
5.2.2 Invalidation without validation
By contrast with previous approaches, we propose to propagate invalidations for the write-set early,
right after the execution phase, without waiting to secure the write-set or the read-set. This new
approach is ‘fully optimistic’ in the sense that invalidations are sent out even without write/read-set
validation or central arbitration.
This Invalidation-Without-Validation (IWV) approach has the following advantages over IAV: (1) it
minimizes wasted work or stalling on transactions which will be aborted due to conflicts, by sending
invalidations used in conflict detection as early as possible as shown in Figure 5.1; (2) it simplifies the
underlying protocol by embedding transactional conflict detection in the cache protocol and tracking the
delivery of invalidations to each stakeholder instead of tracking conflicts; (3) it enables a high level of
parallelism for independent transactions, at least the same level as in ScalableBulk but without signature
processing, by simply tracking the delivery of invalidations; and (4) it offers more opportunities to hide
communication delays by removing the validation phase from the critical path of execution, and
consequently it is less sensitive to large network delays. Note that sending invalidations without validation does not mean that there is no validation, but rather that invalidations are sent out before it is confirmed that the transaction will commit. Advantages (2), (3) and (4) will become clearer after an IWV protocol is presented in Section 5.3.
A possible drawback of IWV is false aborts, which occur when transactions are aborted by
invalidations sent by another transaction, which ends up being aborted later. This is possible because, at
the time when invalidations are sent out without validation, it is not known whether the sending
transaction will commit. Figure 5.2 illustrates a false abort. The second transaction aborts the third
transaction and then later the first transaction aborts the second transaction. Therefore the third
transaction should never have been aborted. False aborts do not occur in IAV-based protocols because a transaction sends out invalidations only if it is guaranteed to commit. While false aborts do not affect correctness, frequent false aborts can harm performance. False aborts are not a focus of this chapter, although they are accounted for in our evaluations; more work to avoid them is warranted in order to further optimize the IWV protocol. Besides false aborts, false invalidations may also occur: a cache block may be invalidated, without causing an abort, by an invalidation message sent by a transaction that is aborted later. However, false invalidations are rarer than false aborts and their penalty is much smaller.
5.2.3 Flexible lazy TM algorithms
Other research proposals have been published to improve the time-consuming validation phase of lazy TMs. While IWV is not directly related to these proposals, we discuss a few here and highlight the differences.
Flexible TM [46] lets software determine when to resolve conflicts detected in hardware. Eazy HTM
[50] is an HTM protocol with eager conflict detection and lazy conflict resolution (EAger + laZY). Both
proposals separate conflict detection from conflict resolution to take advantage of both eager conflict
detection and lazy conflict resolution. Dynamic HTM [28] increases flexibility further by adapting the
hardware to various workloads. To achieve maximal performance, the hardware switches dynamically
between an eager mode (eager conflict detection – eager version management) and a lazy mode (eager
conflict detection – lazy conflict resolution – lazy version management) using information gathered on
past transaction executions. ZEBRA TM [49] can switch mode between eager-conflict/eager-version
and lazy-conflict/lazy-version down to the granularity of a cache-line. In the case of the lazy mode, a
processor node first has to acquire a global commit token to ask directory modules to send invalidations
to relevant nodes.
These flexible algorithms improve the validation process and bring other benefits, but they are different from our approach. First, in our approach, processor nodes do not track conflicts with other nodes during the execution phase, as is done in protocols using an early conflict detection mechanism, which must maintain records of conflicts.
need to be aware of commits and aborts of other nodes to maintain conflict tracking records. Third, no
central arbiter is necessary but simple tags attached to invalidations are enough to resolve conflicts
between concurrent transactions.
We could apply our approach partly to the lazy parts of these flexible protocols. We defer this partial
application of IWV and detailed comparisons with these flexible algorithms to future work.
5.3 IWV protocol
5.3.1 Overview
Figure 5.3(a) depicts the overall flow of an IWV protocol. When a processor node completes the
execution phase of a transaction and wants to publish the current write-set, it enters the validation phase
where the current transaction is validated or is aborted by another processor node. Once the transaction
passes the validation phase, it enters the commit phase, and the execution phase of the next transaction is
started. Invalidations of the blocks in the current write-set are sent out before the transaction passes the
validation phase, right after the execution phase.
Figure 5.3 (b) illustrates how processor nodes process incoming invalidation messages. When a
processor node receives an invalidation, it first checks whether the message actually causes a conflict with the current transaction and, if this is the case, it decides whether or not to abort by comparing the
priorities of the invalidation message and the current transaction. An abort can occur during the
execution phase or the validation phase, but not during the commit phase.
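The following C++ sketch summarizes this receive-side decision, assuming the ticket-based priorities introduced later in Section 5.3.2. The types, field names and helper functions are our own illustration, not code from our simulator.

// Minimal sketch of the decision taken when an invalidation arrives (Figure 5.3(b)).
#include <cstdint>
#include <unordered_set>

enum class Phase { Execution, Validation, Commit };

struct Transaction {
    Phase phase = Phase::Execution;
    uint64_t ticket = UINT64_MAX;              // no ticket is held yet during execution
    std::unordered_set<uint64_t> accessed;     // blocks touched by the current transaction

    bool conflictsWith(uint64_t block) const { return accessed.count(block) != 0; }
    void abortAndRestart() { accessed.clear(); phase = Phase::Execution; /* roll back state */ }
};

// Invoked when an invalidation for 'block', tagged with the sender's commit
// ticket, arrives at the node running 'tx'.
void onInvalidation(Transaction& tx, uint64_t block, uint64_t senderTicket) {
    if (!tx.conflictsWith(block)) return;      // no conflict: only the cache copy is invalidated
    if (tx.phase == Phase::Commit) return;     // aborts never happen in the commit phase
    if (tx.phase == Phase::Execution || senderTicket < tx.ticket)
        tx.abortAndRestart();                  // the sender holds the higher-priority ticket
}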
5.3.2 Implementation details
First, for scalability and compatibility with an arbitrary interconnection network, we adopt a
directory-based cache protocol that employs memory directories to track the global state of memory
blocks and to remove races caused by memory updates.
To track the local state of cache lines in private caches, two bits are attached to each cache line: V bit
(Valid bit) and A bit (Access bit). The V bit, if set to 1, indicates that the corresponding cache block is valid. The A bit is set to 1 once the block has been read or written by the current transaction, and is reset to 0 when the current transaction commits or aborts. When a memory access hits in the local cache and the block is valid, the processor reads or writes the block and sets the A bit to 1. If the access misses in the local
cache, the cache controller sends a memory request to the home directory module with the necessary
information to retrieve the latest copy of the block. To deal with races between outgoing memory
requests and incoming invalidation requests, the A-bit should be set before sending any memory request.
At the home directory, the requesting node is registered as a sharer for the block, and an
acknowledgement (ACK) including data in the case of a read request, or a negative acknowledgement
(NACK), which forces the requesting node to retry, is sent back to the requester. Invalidation messages
are forwarded only to actual sharers and conflicts are detected when the invalidation messages hit local
cache blocks with the A bit set to 1. As an alternative to the A bit for conflict detection, a signature
scheme such as the one in ScalableBulk [37] can be used. Table 5.1 shows cache state transitions when
the A bit is used for conflict detection.
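As an illustration of this per-line bookkeeping, the following C++ sketch mirrors the V/A handling just described. It is our own simplified model of Table 5.1, not the actual protocol tables or simulator code.

// Sketch of the V/A bits per cache line and how they are used.
#include <cstdint>

struct CacheLine {
    uint64_t tag = 0;
    bool v = false;   // Valid bit
    bool a = false;   // Access bit: set on the first transactional access, cleared at commit/abort
};

// Local access: on a hit to a valid line the A bit is set; on a miss the A bit is
// set *before* the request leaves, which closes the race with an invalidation
// that could arrive while the miss is still outstanding.
void onLocalAccess(CacheLine& line, bool hit) {
    line.a = true;
    if (hit && line.v) return;        // read or write the local copy
    // ... otherwise send a memory request to the home directory and install the reply ...
}

// Incoming invalidation: the local copy is dropped; a conflict is signalled only
// if the current transaction has actually touched the block (A bit set).
bool onIncomingInvalidation(CacheLine& line) {
    bool conflict = line.a;
    line.v = false;
    return conflict;                  // the caller resolves the conflict using ticket priorities
}

// At commit or abort, the A bits of the transaction's footprint are reset.
void onTransactionEnd(CacheLine& line) { line.a = false; }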
Concurrency control and conflict resolution
For concurrency control, which assigns relative priority to concurrent transactions, we adopt a ticket
mechanism similar to the TID scheme of Scalable TCC [9]. When a processor core finishes its execution phase, it acquires a unique commit ticket number that is attached to every invalidation message sent by the transaction. A new ticket is generated by incrementing the last ticket number by 1. The ticket distribution mechanism can be implemented by a central ticket distributor with a simple counter and a buffer to store incoming requests, or more simply by a global variable accessed with a special instruction such as a ‘fetch-and-add’ instruction. To hide the ticket acquisition delay, a processor core may request a ticket early, before the end of the execution phase. Detected conflicts are resolved by giving higher priority to tickets with smaller numbers.
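As a concrete illustration of the second, simpler option, the following C++ sketch implements a ticket distributor with an atomic fetch-and-add together with the smaller-number-wins priority rule. The function names and the use of a software atomic variable are our own assumptions; a hardware implementation could instead use a counter at a central distributor.

// Sketch of ticket acquisition and priority comparison.
#include <atomic>
#include <cstdint>

std::atomic<uint64_t> g_nextTicket{1};        // globally shared ticket counter

// Called at (or shortly before) the end of a transaction's execution phase.
uint64_t acquireCommitTicket() {
    return g_nextTicket.fetch_add(1);         // atomic fetch-and-add yields a unique ticket
}

// Conflict resolution: the smaller ticket number wins.
bool hasPriorityOver(uint64_t myTicket, uint64_t otherTicket) {
    return myTicket < otherTicket;
}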
Validation in IWV
The validation process in IWV is done by tracking invalidations instead of reserving shared objects
such as directory modules or a bus as in IAV protocols. Tracking invalidations is much simpler than
reserving objects. There is no need for multiple and possibly repeated probe requests to the directories in
order to reserve such modules while avoiding deadlocks. Most important of all, tracking invalidations is in any case indispensable in all shared-memory systems with cache invalidation protocols. Therefore the main mechanism added to the cache coherence protocol in IWV is the
ticket acquire/release mechanism. There is no more polling of directories or forming groups of
directories, and the complexity of the protocol is close to a regular shared-memory cache protocol. As a
result, the number of exchanged messages can be reduced and the validation is faster as compared to
IAV protocols.
To pass the validation phase and proceed to the commit phase in IWV, a processor node must simply have the highest commit priority. Priority checking is the backbone of the validation process in IWV and is done by acquiring and releasing commit tickets, rather than directories as in scalable HTM or Bulk schemes. A processor node releases its current commit ticket by sending the ticket number to all the other processor nodes once it has received all invalidation acknowledgments for the cache blocks being invalidated in the current write-set. A processor node knows that it has the highest priority to commit when all the tickets with a lower number than its own have been released and have arrived at the processor node. Acquiring released tickets can be implemented using a method similar to the ‘skip vector’ mechanism in Scalable TCC [9], by buffering released tickets and updating the last ticket received in order. For example, if a processor node holds ticket number 5 and has received tickets 1, 2 and 4, it must wait for ticket 3 to commit its current transaction. The releases of commit tickets can be done in parallel and out of order, and so can transaction commits. In some cases, when no conflict is detected, transactions can commit without even waiting on the validation phase, which completes in the background. Figure 5.4 illustrates this ideal case: the validation phase of all transactions is skipped altogether because each transaction holds the highest-priority ticket before the end of its execution phase.
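The released-ticket tracking can be sketched as follows in C++. The structure (an in-order RT counter plus a small buffer of out-of-order releases) follows the description above, but the names and the use of a std::set are our own illustrative choices rather than the actual hardware structure.

// Sketch of per-node released-ticket tracking.
#include <cstdint>
#include <set>

struct ReleasedTicketTracker {
    uint64_t rt = 0;                  // last ticket released *in order* (the RT field)
    std::set<uint64_t> pending;       // tickets released out of order, buffered

    void onTicketRelease(uint64_t t) {
        pending.insert(t);
        // Advance RT over any contiguous run of buffered tickets.
        while (pending.count(rt + 1)) { pending.erase(rt + 1); ++rt; }
    }

    // A node holding 'myTicket' may commit once every lower-numbered ticket
    // has been released, i.e., RT has reached myTicket - 1.
    bool canCommit(uint64_t myTicket) const { return rt + 1 == myTicket; }
};

// Example from the text: holding ticket 5 with releases 1, 2 and 4 received,
// canCommit(5) is false (RT stops at 2); it becomes true once ticket 3 arrives.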
Directory structure
In addition to the sharer list used to identify destinations for invalidation requests, we attach a field
holding a ticket number to each directory entry in order to further filter redundant invalidation messages,
accelerate the ticket release process and reduce the number of false aborts. When an invalidation request
carrying a ticket number arrives at a directory entry, the incoming ticket number is compared with the
ticket number held in the entry. If the incoming ticket has a higher priority (lower number) or if no ticket
is held in the entry, then the incoming ticket number is stored in the entry and invalidation messages are
sent to all sharers. Otherwise, if the entry holds a ticket with a lower number, the directory
acknowledges the invalidation request without forwarding the request to sharers because there must be a
write-write conflict and the requesting node will eventually abort its current transaction.
Also, like the busy bit in a conventional directory protocol, to avoid a race condition between a cache
invalidation and a cache miss request, when a directory entry with a valid ticket number receives a cache
miss request (read or write), it NACKs the requestor to make it retry the request later – alternatively the
directory module may buffer incoming requests. When a transaction commits and updates a block state at the home node, it clears the ticket number in the corresponding directory entry to allow accesses to that block. Similarly, when a processor aborts its current transaction after sending out invalidations, it must clear any tickets it holds in the corresponding directory entries. Figure 5.5 shows the directory structure for one entry.
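The following C++ sketch captures the per-entry behavior just described: the ticket field filters lower-priority invalidation requests, causes cache misses to be NACKed while a ticket is held, and is cleared at commit or abort. The field names and the encoding of ‘no ticket’ as 0 are our own illustration, not the exact directory implementation.

// Sketch of one directory entry in the IWV protocol.
#include <cstdint>
#include <vector>

struct DirEntry {
    int owner = -1;                   // -1: the memory is the owner of the block
    std::vector<int> sharers;         // nodes that have read or written the block
    uint64_t ticket = 0;              // 0: no transaction is currently invalidating this block

    // Invalidation request (IR) from 'requester' carrying its commit ticket.
    // Returns the number of invalidations actually forwarded (reported in the IRA).
    int onInvalidationRequest(int requester, uint64_t reqTicket) {
        if (ticket != 0 && ticket < reqTicket)
            return 0;                 // a higher-priority writer got here first: write-write
                                      // conflict; the requester will eventually abort
        ticket = reqTicket;           // record the (new) highest-priority ticket
        int sent = 0;
        for (int s : sharers)
            if (s != requester) { /* forward an I message to s */ ++sent; }
        return sent;
    }

    // A cache-miss request that hits an entry with a valid ticket is NACKed (or buffered).
    bool mustNack() const { return ticket != 0; }

    // A commit (U) or an abort clears the ticket so the block becomes accessible again.
    void clearTicket() { ticket = 0; }
};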
Forward progress
The explicit concurrency control mechanism using the commit ticket scheme removes all of the
possible sources of deadlock without the need to deploy a deadlock avoidance or detection algorithm, by
explicitly enforcing relative priorities between speculative transactions. However, in rare cases, the IWV protocol may enter a livelock situation in which processor nodes abort each other indefinitely, as illustrated in Figure 5.6. One simple solution is to back off and delay the restart of the execution phase of a repeatedly aborted transaction. Alternatively, a processor node can hold on to its current ticket after being aborted, and restart the execution with the same ticket until that ticket becomes the highest in priority. This second solution can also enforce fairness for a long transaction repeatedly aborted by short transactions. We have adopted the first solution in our simulator.
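As an illustration of the back-off solution we adopted, the following C++ sketch delays the restart of a repeatedly aborted transaction. The exponential growth of the delay and the constants are our own assumptions chosen only for illustration; the scheme described above requires only that the restart be delayed.

// Sketch of a simple back-off policy applied on repeated aborts.
#include <cstdint>

struct BackoffPolicy {
    uint32_t consecutiveAborts = 0;

    // Called when the current transaction is aborted; returns how long to wait
    // before restarting the execution phase.
    uint64_t nextDelayCycles() {
        ++consecutiveAborts;
        uint64_t base = 64;                               // assumed base delay in cycles
        uint32_t shift = consecutiveAborts < 10 ? consecutiveAborts : 10;
        return base << shift;                             // delay grows with repeated aborts
    }

    // Called when the transaction finally commits.
    void reset() { consecutiveAborts = 0; }
};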
Message types
Table 5.2 lists core message types in the IWV protocol and Figure 5.7 shows message flows.
A message is initiated only after the messages it depends on have been transmitted. For example, the TR message can be initiated only after the IR, I, IRA and IA messages have been transmitted. The update (U) message can be sent at any time once the commit ticket of the committing node becomes the highest-priority ticket. Sometimes the ticket invalidation (TI) message can be combined with the update (U) message, namely when the update message is sent after all the invalidation acknowledgments (the IA messages) have been collected from the sharer node(s); in that case no separate TI message is needed, as in (4 – a) and (4 – b) of Figure 5.8 below (operational example).
5.3.3 Operational examples
We now go step by step through two examples of protocol transactions. The first example illustrates
how parallel commits are supported. The second example shows what happens on an abort.
Parallel commit
Figure 5.8 shows how independent transactions commit in parallel. Events with the same number
occur concurrently while those with smaller numbers must precede those with larger numbers.
(1 – a) shows the initial states of the processor nodes where processor nodes (PR) 1, 2 and 3 have
completed their execution phase. The nodes have read and written memory blocks A, B and C, and have
acquired tickets 2, 3, and 4, respectively. The Released Ticket (RT) fields store the last released ticket in
order, which is 1 at the moment. The directory module (DIR) has entries for A, B and C, and currently
no processor node has read or written the blocks except PR1 (A), PR2 (B) and PR3 (C) as shown in (1-
b) column S. Currently the memory (‘M’) keeps valid data for all the three blocks as an owner. ‘0’s in
the T (ticket) column mean that currently no transaction is trying to commit for the blocks and the
blocks are freely accessible.
In (2 – a), PR1 knows that it has the highest priority to commit by comparing its current ticket (= 2) with the ticket in its RT field (= 1), which was released last, and so it sends an Update (U) & Invalidation Request (IR) message for A (1. U&IR_A) to the directory; this message updates the global cache state of A (PR1 becomes the new owner) and requests DIR to forward invalidation(s) to sharers, if any (none in this case except PR1 itself). The transactions in PR2 and PR3, which have also finished their execution, send Invalidation Request messages (1. IR_B and IR_C) to DIR; these requests are likewise forwarded to sharers as needed, but they do not update the global cache block states for B and C because at this moment PR2 and PR3 are not guaranteed to commit their current transactions (B and C remain owned by the memory). As a response to the update and invalidation requests, the directory immediately
sends back Invalidation Request Acknowledgements (2. IRA) to the requesting processor nodes with 0
attached, since no other processor nodes share the blocks. (2 – b) shows the directory state after (2 – a) with changes in bold. Note that PR1 has become the owner of block A, and that blocks B and C now hold tickets 3 and 4, respectively, to prevent further accesses to those blocks.
In (3 – a), after the IRAs have been received, each processor node, having been acknowledged for all of its invalidation requests, releases its current ticket by sending out Ticket Release (3.TR_ticket) messages to
all other processors. The directory states do not change, but the Released Ticket (RT) field in each
processor node is set to 4 after released tickets 2, 3 and 4 have been received by the processor nodes.
Now PR2 and PR3 are ready to commit their transactions as the nodes know that they currently have
the highest priority to commit as indicated by their RT values. Both nodes can simultaneously send
Update messages for memory block B and C (4.U_B and U_C) to the directory as shown in (4 – a).
These messages update the block states in the directory so that nodes 2 and 3 become the owners of B and C, as shown in (4 – b). The T (Ticket) column in the directory now holds all 0’s for blocks A, B and C, and memory requests can be forwarded to the right owners.
Conflict detection and resolution
Figure 5.9 illustrates how a conflict is detected and resolved.
(1 – a) and (1 – b) show the processor states and the directory state respectively. The only difference
from the initial states of Figure 5.8 (no conflict case) is that there is a transactional conflict between PR1
and PR2 as both processor nodes have read and written memory block A; so the directory has recorded
nodes PR1 and PR2 as sharers for A as shown in the S (Sharer) column for A in (1 – b). Currently
Memory is the owner of A because the stores to A from PR1 and PR2 are tentative and local.
In (2 – a), PR1 sends a U&IR message for A (1.U&IR_A) to DIR, knowing that it has the highest
priority to commit, and starts the next transaction. DIR forwards PR1’s invalidation request to PR2
(2.I_A), which has been identified as a sharer for block A, and responds to PR1 with an IRA message with ‘1’ attached, meaning that one invalidation has been sent on PR1’s behalf for block A. PR2 also sends an Invalidation
Request for A (1.IR_A) to DIR. However, when this request with ticket 3 arrives at DIR, a valid ticket (= 2), smaller than the arriving one, is already held in the directory entry for A, and this makes DIR send back an IRA message (2.IRA) with ‘0’ attached to PR2, since no invalidation has actually been forwarded for PR2’s request for block A. Had PR2’s request with ticket 3 arrived at DIR before PR1’s request with ticket 2, PR2’s request would have been forwarded to PR1, where the invalidation request would be dropped because PR1 holds a ticket with higher priority (= 2) at that moment; also, at DIR, the ‘3’ in the T (ticket) field for A would then have been replaced by ‘2’ on the arrival of PR1’s request. PR3 also sends an IR message for block C (1.IR_C) and receives an IRA (2.IRA) with ‘0’
attached as response because there is no sharer for C except PR3. (2 – b) shows the directory states after
the events in (2 – a). For block A, the owner has been changed to 1 (PR 1) due to PR1’s successful
commit for the block. However, the T field still holds the ticket number (‘2’) because the invalidation
process for A is still in progress. For block C, the ticket number from PR3 (= 4) is also held in the entry because the transaction has not yet finished (committed or aborted), even though the invalidation process for its write-set (block C) has completed.
(3 – a) shows message exchanges after the events in (2 – a). Now PR2 and PR3 can release their tickets (3 and 4) by sending the TR messages (3.TR_3 and TR_4), since all the invalidation requests for their write-sets have been acknowledged. PR2 also sends an Invalidation Acknowledgement for A (3.IA_A) in response to PR1’s invalidation request forwarded earlier in (2 – a) by DIR (IR_A). (3 – b) and (3 – c) show the directory state and the processor node states, respectively. In (3 – c), we can notice that PR1’s
RT (Released Ticket) value has changed to 4 after tickets 3 and 4 were grabbed by PR1. However, the RT values of PR2 and PR3 are still 1 because ticket 2, owned by PR1, has not yet been released; the RT buffers of both PR2 and PR3 store ticket numbers 1, 3, and 4.
In (4 – a), PR1 sends a ticket release message for ticket 2 (4.TR_2), which is grabbed by PR2 and PR3 and makes these recipients update their RT values to 4, as in (4 – c). As an alternative to (3 – a) in Figure 5.8, to reduce message traffic at the cost of a potential increase in delay, the TR_2 message is transmitted to PR3 via PR2, where the ticket release message is snooped. PR1 also sends a Ticket Invalidation message for A (4.TI_A) to DIR, after being acknowledged for all the invalidations it requested (IA_A in (3 – a)), to clear the ticket held in the directory entry for A and consequently allow accesses to the block. After receiving the ticket release message from PR1, which makes ticket 4 the highest-priority ticket, PR3 sends an Update message for C (5.U_C), which changes the owner of block C to 3 and clears ticket 4 held in the directory entry, as in (4 – c).
5.3.4 Comparison with ScalableBulk
5.3.4.1 Commit of N independent transactions/chunks
This section highlights the differences between ScalableBulk and our new protocol based on the
Invalidation Without Validation (IWV) approach regarding the handling of independent transactions
(chunks) and transactions with conflicts, especially in terms of delay overhead.
ScalableBulk
To commit a chunk (transaction), ScalableBulk goes through the following activities, all on the
critical path of transaction completion: (a) sending the read and write sets of the committing transaction
to the group leader which is a directory module with the lowest directory number among the
participating directories; (b) propagating the group (g) message from the group leader to the next node
in the group and to the group tail, which is a directory module with the highest number in the
participating directories; (c) sending back the g message from the group tail to the group leader; and (d)
acknowledging commit success from the group leader to the committing processor node. So, the time
needed to commit N independent transactions is the maximum of {(a) + (b) + (c) + (d)} of the N
transactions because the transactions are processed in parallel by the protocol, where b is proportional to
the number of participating directory nodes. If D is the average delay between nodes (between processor
and directory, or directory and directory), and if the average number of participating directories in the
group is K, we can formulate the (a), (b), (c) and (d) above as follows:
a = D;   b = (K − 1)D;   c = D (0 if K = 1);   d = D
So the total delay for committing N independent transactions is equal to
Max_{i=1,…,N} { D + (K_i − 1)D + D + D } = Max_{i=1,…,N} { (K_i + 2)D } = (K_max + 2)D    (Eq. 1)
The total delay is proportional to the number of directories reserved (K) in the group.
IWV
The total delay for committing N independent transactions is the maximum delay from releasing a
commit ticket to acquiring the released ticket among the N transactions. This time is the sum of four
components: (a) sending an invalidation request from the committing node to the home node directory;
(b) forwarding the invalidation request from the home node to a sharer node; (c) acknowledging the
invalidation from the sharer node to the committing node; and (d) releasing the current ticket from the
committing node to the independent nodes. Consequently, with the same notation used for ScalableBulk,
the total delay for committing N independent transactions is
D + D + D + D = 4D    (Eq. 2)
From (Eq. 1) and (Eq. 2), the delay to commit N independent transactions in ScalableBulk is less than
in IWV if
(K_max + 2)D < 4D  ⇔  K_max < 2.
This formula establishes that, for ScalableBulk to commit N independent transactions more efficiently than IWV, the number of directories holding a transaction's data cannot be more than one (K_max = 1).
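As a small worked example of (Eq. 1) and (Eq. 2), the following C++ snippet evaluates both delays for illustrative values of D and K_max; the numbers are assumptions chosen only to exercise the formulas.

// Worked comparison of the commit delays in ScalableBulk (Eq. 1) and IWV (Eq. 2).
#include <cstdio>

int main() {
    double D = 20.0;        // assumed average one-hop delay, in cycles
    int    Kmax = 3;        // assumed largest number of directories in a commit group

    double scalableBulk = (Kmax + 2) * D;   // Eq. 1: (K_max + 2) * D
    double iwv          = 4 * D;            // Eq. 2: 4D, independent of K_max

    std::printf("ScalableBulk commit delay: %.0f cycles\n", scalableBulk);  // 100 cycles
    std::printf("IWV commit delay:          %.0f cycles\n", iwv);           // 80 cycles
    // IWV is never slower once K_max >= 2, and strictly faster for K_max > 2,
    // i.e., whenever a transaction's footprint spans more than two directory modules.
    return 0;
}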
5.3.4.2 Conflict detection
ScalableBulk
In ScalableBulk, transaction conflicts are detected when the write signature of the committing processor node is delivered to a sharer node, which happens only after the participating directories have been reserved successfully at the end of the validation phase. Accordingly, the conflict detection delay can be estimated as the time for (a) + (b) + (c) from the commit of N independent transactions above, plus the delay from the leader to a sharer node, which is also D (the average delay between nodes).
This is equal to
(K + 2)D    (Eq. 3)
IWV
In the IWV protocol, the conflict detection delay is simply the time to send an invalidation message
from the committing node to a sharer node through the corresponding directory node, and this is equal to
D + D = 2D    (Eq. 4)
From (Eq. 3) and (Eq. 4), the time needed for detecting a conflict is always greater in ScalableBulk
than in IWV.
5.4 Evaluation
In our evaluations, we compare our Invalidation Without Validation (IWV) protocol with the
ScalableBulk protocol. ScalableBulk is one of the most aggressive protocols using the Invalidation After
Validation (IAV) approach.
5.4.1 Methodology
Our base simulator is SESC [41] augmented with SuperTrans [34]. We have modified the code to
implement our IWV protocol and ScalableBulk. Programs run on the simulator are selected from the
SPLASH-2 benchmark suite [56], in which locks are converted directly to transactions, and from the
STAMP benchmark suite [32] composed of Transactional Memory applications. Instructions are
executed out of order on the base SESC platform until a processor core tries to commit the current
transaction by executing a special instruction, ‘transaction commit’, which stalls the processor core and
initiates the protocol procedures for both IWV and ScalableBulk. Transactions are aborted and re-started
as needed when the abort conditions are met by restoring the processor state stored in each transaction
context. Nested transactions are flattened by subsuming the inner transaction’s write-set into that of the
outer transaction. Table 5.3 summarizes the simulation configuration.
5.4.2 Results
Figure 5.10 displays the execution times of the STAMP and SPLASH-2 benchmark applications
under IWV normalized to the execution times of ScalableBulk. The execution times are the number of
cycles from the first transaction to the end of the program execution. On average, the IWV protocol
improves the execution time by 15.6 % and up to 41.3 % in kmeans. In water-nsq, IWV performs worse
than ScalableBulk by 5.9%.
Figures 5.11 and 5.12 show cycle time distribution per transaction in IWV and ScalableBulk. Each
bar in the graphs is divided into three categories, ‘Commit cycle’ – the number of cycles used to execute
instructions in a transaction committed successfully (computation or useful cycles); ‘Abort cycle’ – the
number of cycles used to execute instructions in transactions aborted (wasted cycles); and ‘Nacked
cycle’ – the number of cycles during which a processor core was stalled, waiting in the validation phase
to be committed or aborted, or being Nacked for requests for memory blocks locked by another
transaction in the validation phase or the commit phase.
Figure 5.13 compares transaction abort rates calculated as the number of aborts divided by the total
number of transactions for IWV and ScalableBulk. Except for labyrinth and water-sp, IWV has lower
abort rates than ScalableBulk, on average 0.21 for IWV and 0.30 for ScalableBulk. Figure 5.14 displays
percentages of aborts in the validation phase out of total aborts. The numbers indicate how badly
transaction aborts affect the overall performance because if a transaction is aborted in the validation
phase, more cycles are wasted than in the execution phase. In water-sp, both IWV and ScalableBulk
experience aborts mostly in the validation phase.
Figures 5.15 and 5.16 are related to the false abort/conflict of IWV. The bars in Figure 5.15 represent
the fraction of transactions committed successfully after sending out invalidations. For example, in
kmeans, 94% of such transactions committed successfully, and the remaining 6% aborted after sending their invalidation requests. These rates are important for two reasons. First, they give us an idea of how much bandwidth is wasted on unnecessary messages injected into the interconnection network by aborted transactions. Second, they provide upper bounds on the number of false aborts/invalidations, which are harder to measure directly because both aborts and invalidations must be recorded. For example, given that ssca2 commits 88% of the time after sending invalidations, 12% is an upper bound on the fraction of false aborts, as well as on the bandwidth wasted by aborted transactions; it is only an upper bound because not every abort actually causes another transaction to abort, for instance when the target transaction, to which an invalidation was sent, has already been aborted by another transaction, especially a committing one. Figure 5.16 shows the actual false abort rates, calculated as the ratio of the number of aborts caused by later-aborted transactions to the total number of aborts. All applications have a false abort rate of 5% or lower except raytrace (~14.7%).
5.5 Future Work
One important direction for future work is to compare our speculative conflict detection and invalidation mechanism, which builds on late conflict detection, with early (eager) speculative conflict detection, which sits at the opposite end of the design space. This would cover the whole spectrum of conflict detection: conventional late (lazy) conflict detection; our new, unconventional speculative late conflict detection; and conventional early (eager) conflict detection. To do so, a protocol based on early conflict detection should be implemented and simulated in the same framework used for speculative late conflict detection, enabling a fair, side-by-side comparison of all three protocols.
Chapter 6
6. CONCLUSION
Transactional Memory (TM) has been a popular subject in academia since the idea was first introduced. In the first wave of research, many papers focused on proposing TM platforms with their supporting mechanisms and on verifying their operation with cycle-accurate simulators. Now that Hardware Transactional Memory (HTM) support has been included in commercial products, current efforts concentrate on resolving implementation issues and moving to the next step.
To realize TM's full potential, we propose various architectural schemes in this dissertation, targeting improvements in the reliability, power and performance of HTM systems. First, we introduce transaction-based reliability protecting processor cores from transient errors. Second, we propose a Dynamic Transaction Issue (DTI) scheme that can be easily implemented on top of existing HTM systems, saving the power dissipation and energy consumption associated with transaction aborts. Third, we refine our DTI scheme with a new approach based on Fine Grain Prediction and Scheduling (FGPS), improving the prediction accuracy of prior proactive scheduling algorithms. Lastly, we target transaction commit, the common case in most benchmark applications, to improve performance in terms of execution time.
To move to the next phase and let TM systems reach their full potential, it is time to put more effort into TM software development, including applications running on continuous transactional memory, where every instruction belongs to a transaction. In the same vein, our future work will include developing TM software, in addition to the directions outlined at the end of Chapters 2, 3, 4 and 5.
REFERENCES
[1] C. S. Ananian, K. Asanovic, B. C Kuszmaul, C. E. Leiserson and S. Lie, “Unbounded Transactional
Memory,” In HPCA 11, pp. 316-327, Feb. 2005.
[2] M. Ansari, M. Lujan, C. Kotselidis, K. Javis, C. Kirkham and I. Watson, “Steal-on-abort: Dynamic
Transaction Reordering to Reduce Conflicts in Transactional Memory,” In HiPEAC 2009, pp. 4-18,
Jan. 2009.
[3] E. K. Ardestani and J. Renau, “ESESC: A Fast Multicore Simulator Using Time-Based Sampling,”
In HPCA 19, pp. 448-459, Feb. 2013.
[4] A. Armejach, A. Negi, A. Cristal, O. Unsal, P. Stenstrom and T. Harris, “HARP: Adaptive Abort
Recurrence Prediction for Hardware Transactional Memory,” In HiPC’13, pp. 196-205, Dec. 2013.
[5] R. Baumann, “Soft errors in Advanced Computer Systems,” IEEE Design and Test of Computers,
22(3): 258-266, 2005.
[6] G. Blake, R. G. Dreslinski, and T. Mudge, “Proactive Transaction Scheduling For Contention
Management,” In MICRO 42, pp. 156-167, Dec. 2009.
[7] J. Bobba, K. E. Moore, H. Volos, L. Yen, M. D. Hill, M. M. Swift and D. A. Wood, “Performance
Pathologies in Hardware Transactional Memory,” In ISCA 34, pp. 81-91, June 2007.
[8] D. Brooks, V. Tiwari and M. Martonosi, “Wattch: A Framework for Architectural-Level Power
Analysis and Optimizations,” In ISCA 27, pp. 83-94, June 2000.
[9] H. Chafi, J. Casper, B. D. Carlstrom, A. McDonald, C. C. Minh, W. Baek, C. Kozyrakis and K.
Olukotun, “A Scalable, Non-blocking Approach to Transactional Memory,” In HPCA 13, pp. 97-
108, Feb. 2007.
[10] L. Ceze, J. Tuck, J. Torrellas and C. Cascaval, “Bulk Disambiguation of Speculative Threads in
Multiprocessors,” In ISCA 33, pp. 227-238, June 2006.
[11] L. Ceze, J. Tuck, P. Montesinos and J. Torrellas, “BulkSC: Bulk Enforcement of Sequential
Consistency,” In ISCA 34, pp. 278-289, June 2007.
[12] P. Damron, A. Fedorova and Y. Lev, “Hybrid Transactional Memory,” In ASPLOS 12, pp. 336-
346, Dec. 2006.
[13] S. W. S. Do and M. Dubois, “Power Efficient Hardware Transactional Memory: Dynamic Issue of
Transactions,” In ACM Transactions on Architecture and Code Optimization, Vol. 13 Issue 1, April
2016.
[14] S. W. S. Do and M. Dubois, “Core Reliability: Leveraging Hardware Transactional Memory,” In
IEEE Computer Architecture Letters, Vol. 17, Issue 2, pp. 105-108, 2018.
[15] C. Ferri, S. Wood, T. Moreshet, R. Iris Bahar and M. Herlihy, “Embedded-TM: Energy and
Complexity-effective Hardware Transactional Memory for Embedded Multicore Systems,” In
JPDC, 2010.
[16] M. Galluzzi, E. Vallejo, A. Cristal, F. Vallejo, R. Beivide, P. Stenstrom, J. Smith and M. Valero,
“Implicit Transactional Memory in Kilo-Instruction Multiprocessor,” In ACSAC-2007, pp. 339-
353, Aug. 2007.
[17] E. Gaona, J. R. Titos-Gil, J. Fernandez and M. E. Acacio, “Selective Dynamic Serialization for
Reducing Energy Consumption in Hardware Transactional Memory Systems,” In The Journal of
Supercomputing, 2014.
[18] M. Gomaa, C. Scarbrough, T. N. Vijaykumar and I. Pomeranz, “Transient-Fault Recovery for Chip
Multiprocessors,” In ISCA 30, pp. 98-109, June 2003.
[19] L. Hammond, V. Wong, M. Chen, B. D. Carlstrom, J. D. Davis, B. Hertzberg, M. K. Prabhu, H.
Wijaya, C. Kozyrakis and K. Olukotun, “Transactional Memory Coherence and Consistency,” In
ISCA 31, pp. 102-113, June 2004.
[20] T. Harris and K. Fraser, “Language Support for Lightweight Transactions,” In OOPSLA ’03, pp.
388-402, Oct. 2003.
[21] M. Herlihy and J. E. B. Moss, “Transactional Memory: Architectural Support for Lock-free Data
Structures,” In ISCA 20, pp. 289-300, May 1993.
[22] M. Herlihy, V. Luchangco, M. Moir and W. N. Scherer, “Software Transactional Memory for
Dynamic-sized Data Structures,” In PODC 24, pp. 92-101, July 2003.
[23] C. Jacobi, T. Slegel and D. Greiner, “Transactional Memory Architecture and Implementation for
IBM System Z,” In MICRO-45, pp. 25-36, Dec. 2012.
[24] S. Kumar, M. Chu, C. J. Hughes, P. Kundu and A. Nguyen, “Hybrid Transactional Memory,” In
PPoPP ’06, pp. 209-220, Mar. 2006.
[25] H. T. Kung and J. T. Robinson, “On Optimistic Methods for Concurrency Control,” ACM TODS,
6(2): 213-226, June 1981.
[26] Y. Liu and M. Spear, “Toxic Transactions,” TRANSACT’11. San Jose, CA
[27] C. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi and K.
Hazelwood, “Pin: Building Customized Program Analysis Tools with Dynamic Instrumentation,”
In Proc. ACM SIGPLAN Conf. Programming Language Des. Implementation, pp. 27-38, Jun.
2005.
[28] M. Lupon, G. Magklis and A. Gonzalez, “A Dynamically Adaptable Hardware Transactional
Memory,” In MICRO-43, pp. 27-38, Dec. 2010.
[29] R. E. Lyons and W. Vanderkulk, “The Use of Triple-Modular Redundancy to Improve Computer
Reliability,” In IBM Journal of Research and Development, Vol. 6, Issue 2, pp. 200-209, Apr. 1962.
[30] V. J. Marathe, W. N. Scherer and M. L. Scott, “Adaptive Software Transactional Memory,” In
DISC ’05, pp. 354-368, Sept. 2005.
[31] C. C. Minh, M. Trautmann, J. Chung, A. McDonald, N. Bronson, J. Casper, C. Kozyrakis and K.
Olukotun, “An Effective Hybrid Transactional Memory System with Strong Isolation Guarantees,”
In ISCA 34, pp. 69-80, June 2007.
[32] C. Minh, J. Chung, C. Kozyrakis and K. Olukotun, “STAMP: Stanford Transactional Applications
for Multi-processing,” In IISWC’08, Sept. 2008
[33] K. Moore, J. Bobba, M. J. Moravan, M. D. Hill and D. A. Wood, “LogTM: Log-based
Transactional Memory,” In HPCA 12, pp. 254-265, Feb. 2006.
[34] J. Poe, C. Cho and T. Li, “Using Analytical Model to Efficiently Explore Hardware Transactional
Memory and Multi-core Co-design,” In SBAC-PAD, Oct. 2008.
[35] M. Prvulovic, Z. Zhang and J. Torrellas, “ReVive: Cost-Effective Architectural Support for
Rollback Recovery in Shared-Memory Multiprocessors,” In ISCA 29, pp. 111-122, May 2002.
[36] S. H. Pugsley, M. Awasthi, N. Madan, N. Muralimanohar and R. Balasubramonian, “Scalable and
Reliable Communication for Hardware Transactional Memory,” In PACT’08, Oct. 2008.
[37] X. Qian, W. Ahn and J. Torrellas, “ScalableBulk: Scalable Cache Coherence for Atomic Blocks in
a Lazy Environment,” In MICRO-43, pp. 447-458, Dec. 2010.
[38] R. Rajwar and M. G. Dixon. “Intel Transactional Synchronization Extensions,” Intel Developer
Forum San Francisco, 2012.
[39] R. Rajwar, M. Herlihy and K. Lai, “Virtualizing Transactional Memory,” In ISCA 32, pp. 494-505,
June 2005.
[40] S. K. Reinhardt and S. S. Mukherjee, “Transient Fault Detection via Simultaneous Multithreading,”
In ISCA 27, pp. 25-36, June 2000.
[41] J. Renau, B. Fraguela and L. Wei, “SESC simulator,” http://sourceforge.net/projects/sesc/, June
2005.
[42] E. Rotenberg, “AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessor,” In
Proceedings of Fault-Tolerant Computing Systems (FTCS), 1999.
[43] B. Saha, A. Adl-Tabatabai, R. L. Hudson, C. C. Minh and B. Hertzberg, “A High Performance
Software Transactional Memory System for a Multi-core Runtime,” In PPoPP ’06, pp. 187-197,
Mar. 2006.
[44] S. Sanyal, S. Roy, A. Cristal, O. S. Unsal and M. Valero, “Clock Gate on Abort: Towards Energy-
Efficient Hardware Transactional Memory,” In IPDPS ’09, pp. 1-8, May 2009.
[45] N. Shavit and S. Touitou, “Software Transactional Memory,” In PODC 14, pp. 204-213, Aug.
1995.
[46] A. Shriraman, S. Dwarkadas and M. L. Scott, “Flexible Decoupled Transactional Memory
Support,” In ISCA 35, pp. 139-150, June 2008.
[47] T. J. Slegel, R. M. Averill, M. A. Check, B. C. Giamei, B. W. Krumm, C. A. Krygowski, W. H. Li,
J. S. Liptay, J. D. MacDougall, T. J. McPherson, J. A. Navarro, E. M. Schwarz, K. Shum and C. F.
Webb, “IBM’s S/390 G5 Microprocessor Design,” In Micro, pp. 12-23, March/April 1999.
[48] J. C. Smolens, B. T. Gold, J. Kim, B. Falsafi, J. C. Hoe and A. G. Nowatryk, “Fingerprinting:
Bounding Soft-error-detection Latency and Bandwidth,” IEEE Micro, Vol. 24, No. 6, pp. 22-29,
Nov. 2004.
[49] R. Titos-Gil, A. Negi, M. E. Acacio, J. M. Garcia and P. Stenstrom, “ZEBRA: A Data-centric,
Hybrid-policy Hardware Transactional Memory Design,” In ICS’11, pp. 53-62, June 2011.
[50] S. Tomic, C. Perfumo, C. Kulkarni, A. Armejach, A. Cristal, O. Unsal, T. Harris and M. Valero,
“EazyHTM: Eager-lazy Hardware Transactional Memory,” In MICRO-42, pp. 145-155, Dec. 2009.
[51] E. Vallejo, M. Galluzzi, A. Cristal, F. Vallejo, R. Beivide, P. Stenstrom, J. E. Smith and M. Valero,
“KIMP: Multicheckpointing Multiprocessors,” In XVI jornadas de Paralelismo, Sep. 2005.
[52] E. Vallejo, M. Galluzzi, A. Cristal, F. Vallejo, R. Beivide, P. Stenstrom, J. E. Smith and M. Valero,
“Chip Multiprocessors with Implicit Transactions,” In ACACES 2006, pp. 167-170, July 2006.
[53] T. N. Vijaykumar, I. Pomeranz and K. Cheng, “Transient-Fault Recovery Using Simultaneous
Multithreading,” In ISCA 29, pp. 87-98, May 2002.
[54] A. Wang, M. Gaudet, P. Wu, J. N. Amaral, M. Ohmacht, C. Barton, R. Silvera and M. Michael,
“Evaluation Of Blue Gene/Q Hardware Support For Transactional Memories,” In PACT 21, pp.
127-136, Sep. 2012.
[55] J. K. Wolf, A. M. Michelson and A. H. Levesque, “On the Probability of Undetected Error for
Linear Block Codes,” IEEE Trans. Commun., Vol. 30, no. 2, pp. 317-325, Feb. 1982.
[56] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh and A. Gupta, “The SPLASH-2 Programs:
Characterization and Methodological Considerations,” In ISCA 22, pp. 24-36, June 1995.
[57] A. Wood, “Data Integrity Concepts, Features, and Technology,” White paper, Tandem Division,
Compaq Computer Corporation.
[58] L. Yen, J. Bobba, M. R. Marty, K. E. Moore, H. Volos, M. D. Hill, M. M. Swift and D. A. Wood,
“LogTM-SE: Decoupling Hardware Transactional Memory from Caches,” In HPCA 13, pp. 261-
272, Feb. 2007.
[59] R. M. Yoo and H. S. Lee. 2008, “Adaptive Transaction Scheduling For Transactional Memory
Systems,” In SPAA ’08, pp. 169 – 178, 2008.
[60] SPEC CPU 2006, Standard Performance Evaluation Corporation, https://www.spec.org/cpu2006/
[61] Intel Corporation Transactional Synchronization in Haswell, Retrieved from
http://software.intel.com/en-us/blogs/2012/02/07/transactional-synchronization-in-haswell/, Sept.
2012.
Abstract
Transactional Memory (TM) enhances the programmability as well as the performance of parallel programs running on a multi-core or multi-processor system. To achieve this goal, TM adopts a lock-free approach, in which mutually exclusive events are executed optimistically and corrected later if violations of mutual exclusion are detected. As a result TM disposes of the complexities of conventional locking mechanisms, especially when multiple locks must be held simultaneously. Some proposals such as Transactional Memory Coherence and Consistency (TCC) and Bulk extend the applicability to cache coherence and consistency with cache coherence protocols relying on the optimistic approach of TM.
To realize TM's full potential, we propose various architecture schemes in this dissertation, targeting improvements for reliability, power and performance of HTM systems. First, we introduce transaction-based reliability protecting processor cores from transient errors. Second, we propose a Dynamic Transaction Issue (DTI) scheme that can be easily implemented on top of existing HTM systems, saving the power dissipation and energy consumption associated with transaction aborts. Third, we refine our DTI scheme with a new approach based on Fine Grain Prediction and Scheduling (FGPS), improving the prediction accuracy of the prior proactive scheduling algorithms. Lastly, we target transaction commit, which is the common case in most benchmark applications, to improve the execution time of successful transactions.