DESIGN OF MODULAR MULTIPLICATION

by

Bo Zhang

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)

May 2022

Copyright 2022 Bo Zhang

Acknowledgements

This thesis was accomplished with the precious assistance of my Ph.D. advisor, Prof. Massoud Pedram. He brought me to a new research area, the design of modular multiplication and fully homomorphic encryption, in which I had very little background knowledge and research experience. Nevertheless, he spent time teaching me how to explore this new area, locate the problems, and discuss solutions. His influence on my whole life can never be overstated, as I have learned a great deal from him from many different perspectives.

Next, I would like to thank the other committee members of my oral defense exam, Prof. Peter Beerel and Prof. Aiichiro Nakano, for providing valuable feedback on my research and helping me push the research boundary.

I was lucky to meet and work with so many excellent people. In particular, I cannot thank Zeming Cheng enough for his help in research and discussions; together we were able to propose many novel ideas to overcome research challenges. The alumni and current members of the SPORT lab led by Prof. Pedram enjoy my sincere appreciation. They include Prof. Shahin Nazarian, Mingye Li, Naveen Kumar Katam, Haolin Cong, Shuang Chen, Tiansong Cui, Luhao Wang, Arash Fayyazi, Souvik Kundu, Mahdi Nazemi, Mohammad Saeed Abrishami, Marzieh Vaeztourshizi, Hassan Afzalikusha, Ghasem Pasandi, Soheil Nazar Shahsavani, and Amirhossein Esmaili. Moreover, Fangzhou Wang and Huimei Cheng also provided significant help over the past years.

Finally, I would like to express my sincerest appreciation to my parents (Yonghong Liu and Haiyun Zhang) and my wife (Yitong Chen) for their unconditional support and love. Without any of the aforementioned people, this thesis would not have been possible.

Table of Contents

Acknowledgements
List of Tables
List of Figures
Abstract
1 Introduction
  1.1 Multiplication
    1.1.1 Overflow Optimization
    1.1.2 Optimization of Expression (A+B)C
  1.2 Modular Multiplication
    1.2.1 Full-word Modular Multiplication
    1.2.2 Iterative Modular Multiplication
    1.2.3 Scalable Modular Multiplication
  1.3 Thesis Organization
2 Multiplication
  2.1 Preliminaries
    2.1.1 Modified Booth Encoding Technique
    2.1.2 Partial Product Optimization
    2.1.3 Compression
    2.1.4 High Performance Logarithmic Adder
  2.2 Overflow Optimization
    2.2.1 Overflow Criteria
    2.2.2 Overflow Prevention
    2.2.3 Overflow Correction
    2.2.4 Overflow Correction for Positive Number
  2.3 Optimization of Expression (A+B)C
    2.3.1 Encode2bit
    2.3.2 Encode
    2.3.3 Example: Polynomial Evaluation
    2.3.4 Encode with Ceiling, Floor, and Round Operations
3 Modular Multiplication
  3.1 Categories
  3.2 Iterative Basic Modular Multiplication
  3.3 Full-word Basic Modular Multiplication
4 Barrett Modular Multiplication
  4.1 Iterative Barrett Modular Multiplication
    4.1.1 Classic Iterative Barrett Modular Multiplication
    4.1.2 New Iterative Barrett Modular Multiplication: Basic
    4.1.3 New Iterative Barrett Modular Multiplication: Advanced
  4.2 Full-word Barrett Modular Multiplication
    4.2.1 Classic Full-word Barrett Modular Multiplication
    4.2.2 New Full-word Barrett Modular Multiplication: Basic
    4.2.3 New Full-word Barrett Modular Multiplication: Advanced
  4.3 Extension to All Moduli
    4.3.1 Classic Extension
    4.3.2 New Extension
  4.4 Experimental Results
  4.5 Conclusion
5 Montgomery Modular Multiplication
  5.1 Domain Transformation
  5.2 Iterative Montgomery Modular Multiplication
    5.2.1 Classic Iterative Montgomery Modular Multiplication
    5.2.2 New Iterative Montgomery Modular Multiplication: Basic
    5.2.3 New Iterative Montgomery Modular Multiplication: Advanced
  5.3 Full-word Montgomery Modular Multiplication
    5.3.1 Classic Full-word Montgomery Modular Multiplication
    5.3.2 New Full-word Montgomery Modular Multiplication
  5.4 Scalable Montgomery Modular Multiplication
    5.4.1 Classic Scalable Montgomery Modular Multiplication
    5.4.2 New Scalable Montgomery Modular Multiplication: L1
    5.4.3 New Scalable Montgomery Modular Multiplication: L2
  5.5 Experimental Results
    5.5.1 Iterative Montgomery Modular Multiplication
    5.5.2 Full-word Montgomery Modular Multiplication
    5.5.3 Scalable Montgomery Modular Multiplication
  5.6 Conclusion
6 Summary and Future Work
  6.1 Summary
    6.1.1 Overflow Optimization and Encode
    6.1.2 Barrett Modular Multiplication
    6.1.3 Montgomery Modular Multiplication
  6.2 Future Work
References

List of Tables

2.1 Generate Partial Product PP_i = X(-2Y[2i+1] + Y[2i] + Y[2i-1])
2.2 Truth Table for C_1 and C_2
2.3 Truth Table for Overflow Prevention Data Model
2.4 Truth Table for Overflow Correction Data Model
2.5 Truth Table for Overflow Correction Data Model
2.6 Truth Table for a_1, b_1, c_1
4.1 Truth Table for a and b
4.2 Different Iterative Barrett MM Designs on Virtex-7 FPGA
4.3 5-stage Full-word Barrett MM Designs on Virtex-7 FPGA
5.1 Different Iterative Montgomery MM Designs on Virtex-7 FPGA
5.2 5-stage Full-word Montgomery MM Designs on Virtex-7 FPGA
5.3 Scalable Montgomery Modular Multiplication with N = 1024
5.4 Different Scalable Montgomery MM Designs on Virtex-7 FPGA

List of Figures

2.1 Apply Radix-4 Modified Booth Encoding to an 8-bit Signed Number
2.2 Optimization before Compression of Signed Numbers
2.3 Compression of 9 Numbers Based on CSA
2.4 8-bit Signed Multiplication with Radix-4 Modified Booth Encoding
2.5 Overflow Criteria
2.6 Data Model for Overflow Prevention
2.7 Apply Overflow Prevention during Multiplication
2.8 Data Model for Overflow Correction
2.9 Apply Overflow Correction during Multiplication
2.10 Apply Overflow Correction when Result is k-bit
2.11 Data Model for Overflow Correction with Positive Number
2.12 Model for Grouping 2 Bits
2.13 Property of Outputs e and c_out
2.14 Example of Encode when N = 8
2.15 Ceiling Operation
2.16 Ceiling Operation with Encode
2.17 Floor Operation
2.18 Floor Operation with Encode
2.19 Round Operation
2.20 Round Operation with Encode
3.1 Category of Modular Multiplication
4.1 The Optimization of Z^(i+1)/2^{N-3} as Z̃_H^(i+1)
4.2 The Computation of q^(i)
4.3 Reduction of q_S^(i+1) and q_C^(i+1)
4.4 The Optimization of q_S^(i+1) + q_C^(i+1) as q̃^(i+1)
4.5 The Computation of Z^(i)
4.6 Block-level Diagram for the Hardware Implementation of the Proposed Iterative Barrett Modular Multiplication
4.7 Compress T̃_H µ by Truncation
4.8 Compute q̃ = [T̃_H µ / 2^{N+6}] by Compression and EncodeRound
4.9 Block-level Diagram for the 4-stage Pipelined Implementation of the Proposed Full-word Barrett Modular Multiplication
5.1 Example of EncodeSN when m = 8
5.2 Computing q^(i)
5.3 Computing Z^(i)
5.4 Block-level Diagram for the Hardware Implementation of the Proposed Iterative Montgomery Modular Multiplication
5.5 EncodeSN for (N+2)-bit q_S and q_C
5.6 Compute T_S + T_C + q̃M by Truncated Compression
5.7 Block-level Diagram for the 4-stage Pipelined Implementation of the Proposed Full-word Montgomery Modular Multiplication
5.8 Compute TS_j^(i) and TC_j^(i) in L1 Scalable Montgomery MM
5.9 Compute PS and PC in L1 Scalable Montgomery MM
5.10 Data Dependency Graph with L = 1
5.11 Schedule for N = 16, w = 8, and m = 2 with (a) p = 3 and (b) p = 2
5.12 Internal Design of the Processing Element (PE) with L = 1
5.13 PEc Diagram with L = 1
5.14 Proposed Scalable Architecture of Montgomery MM with L = 1
5.15 Classic Data Dependency Graph with L = 2
5.16 New Data Dependency Graph with L = 2
5.17 Internal Design of the Processing Element (PE) with L = 2
5.18 PEc Diagram with L = 2

Abstract

Constituting many advanced mathematical operations, multiplication has been thoroughly studied from many perspectives, including latency, area, and power. Modified Booth encoding is a very important optimization that reduces the number of partial products during multiplication. Since the compression result of the partial products can have a potential overflow, a final addition on the compression result is required to obtain the correct product. Therefore, multiplication is usually implemented as a black box that is presented to other operations.

In this thesis, we present two ways to solve the overflow problem and reduce multiplication to compression. Further, a novel encoding with constant delay is proposed to skip the full-bitwidth addition in the expression (A+B)C. These two optimizations allow us to replace multiplication and addition/subtraction with compression and encoding in any linear mathematical operation consisting of multiplications and additions/subtractions. The critical path is thus dramatically reduced: we only need an addition at the very end of the operations to produce the final result.
The proposed encoding is also extended to support ceiling, floor, and round operations, as in the expressions ⌈(A+B)/2^k⌉C, ⌊(A+B)/2^k⌋C, and [(A+B)/2^k]C. Although these are non-linear operations, we can still skip the addition A+B and the division by 2^k using our constant-delay encoding technique.

Stemming from abstract algebra, linear algebra, etc., modern cryptosystems are commonly used all over the world to secure sensitive communications. In the context of cryptosystems, modular addition/subtraction and modular multiplication are the two primitive operations that significantly affect the performance of a cryptosystem. More precisely, modular operations for bitwidths ranging from hundreds to thousands of bits are needed to ensure a high security level. Modular addition/subtraction is quite simple, whereas modular multiplication (MM) is quite complex. Barrett MM and Montgomery MM are the most widely used algorithms, since they replace the long division in classic MM with multiplication and shift.

Depending on whether the bits of the multiplier are digested in one shot or over many cycles, a modular multiplication can be categorized as full-word MM or iterative MM. Full-word MM is usually implemented in a pipelined structure so as to achieve high throughput, and we thereby need to reduce silicon area as much as possible. Iterative MM scans a few bits of the multiplier in every iteration; here it is necessary to optimize both computation latency and silicon area. Moreover, non-decomposability is a property of modular multiplication: a large-size modular multiplication cannot be decomposed into a collection of small-size modular multiplications. Scalable Montgomery MM, a special variant of Montgomery MM, is able to perform arbitrary-size modular multiplication with fixed resource utilization, as long as a memory management unit is available to buffer intermediate results. As a result, there are five different modular multiplications for different applications and contexts: full-word Barrett MM, iterative Barrett MM, full-word Montgomery MM, iterative Montgomery MM, and scalable Montgomery MM.

This thesis presents remarkable improvements to all five modular multiplications. Our proposed overflow solution and encoding technique are applied to all designs, but more importantly, we have many optimizations specific to each modular multiplication. Generally speaking, for both Barrett and Montgomery, a full-word MM requires three normal multiplications in series, and an iterative MM needs sequential computation of the quotient and the intermediate result in each iteration. Although the optimizations of the five modular multiplications are proved in completely different ways, we propose a full-word Barrett MM and a full-word Montgomery MM in which two of the three multiplications are truncated, with only half-size partial products. Our proposed iterative Barrett MM and iterative Montgomery MM are able to compute the quotient and the intermediate result in parallel. Finally, the new scalable Montgomery MM has a new data dependency graph that requires less computation latency and hardware area than state-of-the-art references.

Chapter 1
Introduction

1.1 Multiplication

Multiplication is a key basic operation of many advanced operations like consecutive multiplications and polynomial evaluation. Due to its wide application, the performance of multiplication affects those advanced operations.
To save hardware area, designers use modified Booth encoding to reduce the number of partial products, although the resulting partial products are signed numbers. The compression of signed numbers results in two signed numbers, the sum of which can have an overflow; we need to ignore the potential overflow to recover the correct result. Consequently, multiplication is treated as an atomic operation and implemented completely in hardware in order to drive the workflow of advanced operations.

1.1.1 Overflow Optimization

Instead, we propose two methods to solve the overflow problem. The first method is overflow prevention, where the addition of the compression result is guaranteed to have no overflow. This method requires sign extension of the signed numbers by several extra bits before compression. The compression result of N-bit signed numbers is thus two (N+k)-bit signed numbers without overflow, where k is usually 1 or 2.

Compression of signed numbers can also be interpreted as a fast addition and subtraction of many signed numbers. As a result, the actual result size after compression and addition can be less than the size of the signed numbers. The overflow prevention method is then redundant, since it needs more bits than the size of the signed numbers. Therefore, our second method is overflow correction, in which we can remove the overflow through just a 2-bit addition rather than a full-bitwidth addition. If the result of compressing and adding N-bit signed numbers is a k-bit signed number with k < N, the overflow correction method achieves two (k+1)-bit signed numbers without overflow.

1.1.2 Optimization of Expression (A+B)C

Although multiplication can be replaced with compression, and we can present the result by two signed numbers without overflow, the workflow of advanced operations forces us to perform additions. For example, we need the addition before the multiplication in the expression (A+B)C; otherwise the area of the two multiplications AC and BC is too costly. We propose a novel encoding technique to group and encode k bits of A and B. Rather than performing a full-bitwidth addition A+B, the encoded result of each group can generate a partial product directly. In the case of k = 2, we can generate a partial product for each 2-bit group and thus remove half of the partial products. Independent of the bitwidth of A and B, the encoding procedure only requires a constant-delay 2-bit addition. We can thus skip the addition A+B and completely replace multiplication with compression in the expression (A+B)C.

In many cases, we even need ceiling/floor/round operations before a multiplication. Our proposed encoding technique allows us to skip the addition and the ceiling/floor/round operations in the expressions ⌈(A+B)/2^k⌉C, ⌊(A+B)/2^k⌋C, and [(A+B)/2^k]C, where k is a positive integer.

1.2 Modular Multiplication

Design and optimization for cryptosystems like Rivest–Shamir–Adleman (RSA) [1], elliptic-curve cryptography (ECC) [2], somewhat homomorphic encryption (SHE) [3], and fully homomorphic encryption (FHE) [4,5] has been a subject of great interest. As an important primitive operation, modular multiplication plays a great role in building a cryptosystem. The encryption and decryption of RSA are built on modular exponentiation, which is implemented through multiple rounds of modular multiplication. Most FHE schemes make extensive use of the number theoretic transform (NTT) and ring addition/multiplication, which in turn depend on modular multiplication.
As a result, an efficient hardware realization of a modular multiplication algorithm is a key challenge for a high-performance cryptosystem.

Given a multiplicand X, a multiplier Y, and a modulus M, the modular multiplication (MM) computes Z = XY mod M. To maintain a high security level for a cryptosystem, the bitwidth N of the modulus M typically ranges from hundreds to thousands of bits. Barrett modular multiplication (BMM) [6] and Montgomery modular multiplication (MMM) [7] are the two most widely used modular multiplications, because they replace the computationally expensive division with multiplications and shifts.

1.2.1 Full-word Modular Multiplication

Modular multiplication can be classified into full-word MM and iterative MM, depending on whether all bits of the multiplier are scanned at once. When the modulus M has hundreds of bits, e.g., N = 128, the silicon area of an N-bit normal multiplication may be affordable. Therefore, some references compute the product XY in one shot and then perform the modular reduction. Such an MM is referred to as full-word BMM or full-word MMM, based on whether Barrett modular reduction or Montgomery modular reduction is used. A full-word MM consists of three consecutive multiplications and is frequently implemented in a pipelined architecture to achieve high throughput. The goal of full-word MM design is thereby to save as much area as possible. Our research allows us to replace all three multiplications with compressions. Moreover, we propose a new full-word BMM and a new full-word MMM, in which the operands of two of the compressions are truncated to roughly half size.

1.2.2 Iterative Modular Multiplication

When N is a large number, e.g., 1,024, a complete (one-shot) N-bit MM is very expensive in terms of the required gate count and silicon area. Instead, an iterative version of the MM, which repeatedly scans m bits of the multiplier Y in each iteration and multiplies them with the multiplicand X, becomes preferable. There are basically two challenges in the development of an iterative MM.

i. Computationally expensive operations. In the i-th iteration of the iterative MM algorithm, three multiplications and two additions are required. When m = 1 or m = 2, the multiplications can be explicitly simplified so that the critical path is short. However, the multiplications become complex when m is large or when we hope to design an iterative MM parametric in m. Although the number of iterations needed to fully scan all bits of the multiplier is reduced, the critical path becomes very long and thus undesirable.

ii. Computation dependency. In each iteration, we have to compute the quotient first and update the intermediate result next. This data dependency unavoidably results in a long critical path for an MM design. The situation becomes worse as m increases.

Instead, we propose a new iterative BMM and a new iterative MMM. Although the proofs for BMM and MMM are different, the computations of the quotient and the intermediate result in our algorithms become parallel. Moreover, multiplication and addition are replaced with compression and encoding, with the support of our overflow solution for signed compression and our encoding technique. As a result, we can optimize both computation latency and hardware area.

1.2.3 Scalable Modular Multiplication

A large-bitwidth normal multiplication can be decomposed into a collection of small-bitwidth normal multiplications through a decomposition algorithm like Karatsuba or Toom-Cook. Unlike normal multiplication, modular multiplication is in general non-decomposable.
For example, we cannot use a 128-bit modular multiplier to compute a 1024-bit modular multiplication. All the modular multiplications we have discussed assume that the bitwidth N of the modulus M is known in advance.

There is a variant of Montgomery MM called scalable Montgomery MM. A smart processing element (PE) is designed and instantiated many times, and the instances work cooperatively. With the help of a memory management unit to buffer intermediate results, the kernel, consisting of all instantiated PEs, is able to compute an arbitrary-size modular multiplication with fixed resource utilization. We propose new scalable Montgomery MM algorithms based on very efficient processing elements. Experimental results show a dramatic reduction in computation latency and hardware area.

1.3 Thesis Organization

The thesis is organized as follows: chapter 2 presents our solution to the overflow problem during signed compression and the optimization of (A+B)C. Chapter 3 introduces the basic modular multiplication based on long division. Chapter 4 explains our proposed iterative BMM, full-word BMM, and their extension to all moduli, whereas chapter 5 shows our proposed iterative MMM, full-word MMM, and scalable MMM. Chapter 6 summarizes our work and discusses future work related to modular multiplication.

Chapter 2
Multiplication

Multiplication is an essential operation in many applications like arithmetic logic units, microprocessors, digital signal processors, graphics engines, etc. Given a multiplicand X and a multiplier Y, multiplication computes the product Z = XY. Given its importance, multiplication has been investigated in many references; an interested reader may refer to [8] for an excellent review.

Typically, a multiplication can be decomposed into three steps: (1) generate partial products based on X and one or more bits of Y; (2) compress all partial products into two numbers; (3) add the results after compression and ignore any potential overflow.

In this chapter, we first discuss how to implement multiplication based on the radix-4 modified Booth encoding technique. Later, we provide two solutions that ensure there is no overflow when adding the results after compression. Finally, a novel encoding technique is proposed to perform A+B in the expression (A+B)C with a constant delay.

2.1 Preliminaries

To better explain the algorithms, some notation is needed. An N-bit integer X is represented as X = Σ_{i=0}^{N-1} X[i]·2^i in radix-2 and X = Σ_{i=0}^{n_m-1} X_i·2^{im} in radix-2^m, where n_m = ⌈N/m⌉ and X_i = X[im+m-1 : im] denotes the bits of X belonging to word i. In case n_m·m > N, we sign extend X by n_m·m − N bits when X is a signed number, or pad X with n_m·m − N zeros when X is an unsigned number.

2.1.1 Modified Booth Encoding Technique

Without loss of generality, we consider both the multiplicand X and the multiplier Y to be N-bit signed numbers, where N is an even number. The multiplier Y can be expressed as in (2.1),

Y = -Y[N-1]·2^{N-1} + Σ_{i=0}^{N-2} Y[i]·2^i    (2.1)

where Y[i] is the i-th bit. The intuitive implementation of multiplication is to first compute all partial products based on the multiplicand X and every bit of the multiplier Y, as in (2.2). PP_i = X·Y[i] is the partial product generated by X and Y[i] when i ≠ N-1, and PP_{N-1} = -X·Y[N-1]. Next, we need to add all N partial products together.
Z = XY = X(-Y[N-1]·2^{N-1} + Σ_{i=0}^{N-2} Y[i]·2^i)
  = -X·Y[N-1]·2^{N-1} + Σ_{i=0}^{N-2} (X·Y[i])·2^i = Σ_{i=0}^{N-1} PP_i·2^i    (2.2)

Modified Booth encoding [9] is a commonly used technique to reduce the number of partial products. Generally speaking, a radix-2^k modified Booth encoding groups k+1 bits of the multiplier Y and combines them with the multiplicand X to generate one partial product. The total number of partial products is only N/k. In case k is not a divisor of N, we can sign extend Y to a multiple of k. Considering its simplicity and efficiency, the radix-4 modified Booth encoding (k = 2) is widely used. We can rewrite Y as in (2.3), where Y[-1] = 0.

Y = -Y[N-1]·2^{N-1} + Σ_{i=0}^{N-2} Y[i]·2^i = Σ_{i=0}^{N/2-1} (-2Y[2i+1] + Y[2i] + Y[2i-1])·4^i    (2.3)

The main trick is to split Y[1], Y[3], ..., Y[N-3] into 2Y[1]−Y[1], 2Y[3]−Y[3], ..., 2Y[N-3]−Y[N-3]. Figure 2.1 shows an example for an 8-bit signed number (value 125). Notice that the weight at bit positions 1, 3, and 5 becomes negative. If we group every 2 bits in step (iii), the value range of a group is [-2, 2]. We can thus rewrite the multiplication by defining PP_i = X(-2Y[2i+1] + Y[2i] + Y[2i-1]) as in (2.4). Each partial product PP_i is an (N+1)-bit signed number.

Figure 2.1: Apply Radix-4 Modified Booth Encoding to an 8-bit Signed Number

Z = XY = X Σ_{i=0}^{N/2-1} (-2Y[2i+1] + Y[2i] + Y[2i-1])·4^i = Σ_{i=0}^{N/2-1} PP_i·4^i    (2.4)

Table 2.1 shows the truth table for the different Y[2i+1], Y[2i], and Y[2i-1] bit values. As a result, we can easily encode Y[2i+1], Y[2i], and Y[2i-1] and generate PP_i. The number of partial products is reduced from N to N/2, which saves a lot of area.

Table 2.1: Generate Partial Product PP_i = X(-2Y[2i+1] + Y[2i] + Y[2i-1])

Y[2i+1]  Y[2i]  Y[2i-1]  |  PP_i
   0       0       0     |   0
   0       0       1     |   X
   0       1       0     |   X
   0       1       1     |   2X
   1       0       0     |  -2X
   1       0       1     |  -X
   1       1       0     |  -X
   1       1       1     |  -0

2.1.2 Partial Product Optimization

In order to add all signed numbers, we first need to sign extend all partial products up to 2N bits, since the product Z = XY must be a 2N-bit signed number. The example in figure 2.2 shows the optimization for an 8-bit multiplication. For the sake of brevity, all signs are labelled S, although different numbers can have different sign values. The sign bits in a number can be converted into 1's followed by S̄ in step (ii), where S̄ is the Boolean inversion of S. Any overflows can be ignored because Z only has 2N bits. Notice that all dashed polygons can be removed, since the sum of each polygon is equal to 2^16. Step (iii) shows the result after optimization.

Figure 2.2: Optimization before Compression of Signed Numbers

2.1.3 Compression

Compression based on the carry-save adder (CSA) is a technique to speed up the addition of many numbers [10–12]. A CSA can compress three t-bit numbers X, Y, and Z into two (t+1)-bit numbers S and C as shown below.

S[i] = X[i] ⊕ Y[i] ⊕ Z[i]
C[i+1] = X[i]Y[i] + X[i]Z[i] + Y[i]Z[i]
S[t] = 0, C[0] = 0, 0 ≤ i ≤ t-1    (2.5)

Evidently, 2C[i+1] + S[i] = X[i] + Y[i] + Z[i] for 0 ≤ i ≤ t-1. The CSA (a 3-to-2 compressor generating C and S from X, Y, and Z) uses t parallel (1-bit) full adder cells, and thus incurs the delay of only one full adder cell regardless of the bitwidth t of its operands.
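A minimal software sketch of this 3:2 compression step is shown below (an illustration for checking, not the thesis's hardware design). It operates bitwise on Python integers, so the invariant S + C = X + Y + Z of (2.5) can be verified directly.

```python
def csa(x: int, y: int, z: int) -> tuple[int, int]:
    """3:2 carry-save compressor: returns (s, c) with s + c == x + y + z.

    Each bit follows (2.5): S[i] = x^y^z, C[i+1] = majority(x, y, z).
    Works on plain Python ints (negatives behave as infinite two's complement).
    """
    s = x ^ y ^ z                           # bitwise sum without carries
    c = ((x & y) | (x & z) | (y & z)) << 1  # majority bits, weighted one position up
    return s, c

# Usage: compress three numbers into two whose sum is unchanged.
s, c = csa(13, 7, 9)
assert s + c == 13 + 7 + 9
```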
Generally speaking, compressing k ≥ 3 numbers using a linear arrangement of CSAs results in two numbers S and C, with a critical path delay equal to k−2 full adder cell delays. On the other hand, we can compress k ≥ 3 integers using a balanced ternary tree arrangement of CSAs with only O(log k) full adder cell delays. Figure 2.3 shows an example that compresses 9 numbers L_1, L_2, ..., L_9 into two numbers S and C. The critical path delay is reduced from 7 full adder cell delays in the linear arrangement to 4 full adder cell delays in the ternary tree arrangement. Obviously, we can use the balanced ternary tree arrangement of CSAs to efficiently compress the N/2 partial products into two signed numbers during multiplication.

Figure 2.3: Compression of 9 Numbers Based on CSA

2.1.4 High Performance Logarithmic Adder

Finally, to complete the multiplication, we just need to add the two numbers after compression. There are many high performance adders like the Kogge-Stone adder (KSA) [13] and the Brent-Kung adder (BKA) [14]. The time and area complexity of these adders is logarithmic, although they differ in being time-oriented or area-oriented optimizations.

Figure 2.4 shows the whole flow of an 8-bit signed multiplication with X = 108 and Y = 125. The compression yields Z_S = -23876 and Z_C = -28160, so the result of compression alone is not correct, because Z_S + Z_C ≠ XY. However, we get the correct product Z = 13500 when ignoring the overflow during the addition.

Figure 2.4: 8-bit Signed Multiplication with Radix-4 Modified Booth Encoding

2.2 Overflow Optimization

The final addition after compression seems necessary. We should ignore the potential overflow, but we cannot locate where the overflow occurs. In figure 2.4, the overflow occurs at bit position 15, whereas it is also possible to have an overflow starting from other bit positions. In this section, we present two solutions such that the sum of the results after compression has no overflow.

• Overflow prevention: after reserving enough bits, the compression result fits into a data model that has no overflow;
• Overflow correction: the compression result fits into a data model where only a 2-bit addition is required to remove any potential overflow.

Consequently, we can establish the relation XY = Z_S + Z_C in figure 2.4. When combined with a novel encoding technique (discussed later in section 2.3), we can greatly simplify computations like consecutive multiplications or polynomial evaluation.

2.2.1 Overflow Criteria

Theorem 1. The sum of two N-bit signed numbers has an overflow if and only if the carry when adding all N bits differs from the carry when adding the least significant N-1 bits.

Figure 2.5 shows the procedure to add two N-bit signed numbers Z_S and Z_C. Define C_1 as the carry of Z_S[N-2:0] + Z_C[N-2:0] and C_2 as the carry of Z_S + Z_C. Table 2.2 shows the truth table of C_1 and C_2. When C_1 = C_2, the sum is valid. However, the sum cannot be an N-bit signed number when C_1 ≠ C_2.

Figure 2.5: Overflow Criteria

Table 2.2: Truth Table for C_1 and C_2

C_1  C_2  |  comment
 0    0   |  valid
 0    1   |  Z_S + Z_C < -2^{N-1}
 1    0   |  Z_S + Z_C ≥ 2^{N-1}
 1    1   |  valid
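Theorem 1 is easy to spot-check in software. The sketch below (my helper, assuming the operands are given as raw N-bit two's-complement encodings) compares the two carries directly.

```python
def has_overflow(zs: int, zc: int, n: int) -> bool:
    """Theorem 1: the n-bit two's-complement sum overflows iff the carry into
    the sign position (C1) differs from the carry out of all n bits (C2)."""
    mask = (1 << (n - 1)) - 1
    c1 = ((zs & mask) + (zc & mask)) >> (n - 1)  # carry from the low n-1 bits
    c2 = (zs + zc) >> n                          # carry out of the full n bits
    return c1 != c2

assert has_overflow(0b0111, 0b0001, 4)        # 7 + 1 = 8 does not fit in 4 bits
assert not has_overflow(0b0011, 0b0010, 4)    # 3 + 2 = 5 fits
```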
2.2.2 Overflow Prevention

The overflow comes from the fact that the sum is out of range. Figure 2.6 shows the data model for overflow prevention: the leading 2 bits of Z_S are the same, and the leading 2 bits of Z_C are 0. Assuming the sum Z = Z_S + Z_C is an N-bit signed number, Z[N] must be the sign extension.

Figure 2.6: Data Model for Overflow Prevention

We define the carry from the addition of the least significant N-1 bits as C_1. Table 2.3 shows all possible combinations of S_1 and C_1. Notice that it is impossible to have S_1 = 0 and C_1 = 1, because the resulting Z would be larger than 2^{N-1}, which would violate the fact that Z is an N-bit signed number. For the remaining three cases, Z_S + Z_C can never have any overflow; therefore, the sum Z is presented correctly.

Table 2.3: Truth Table for Overflow Prevention Data Model

S_1  C_1  |  S_2
 0    0   |   0
 0    1   |  N.A.
 1    0   |   1
 1    1   |   0

We can apply this method when compressing all partial products during an N-bit signed multiplication, as in figure 2.7. In this case, we deliberately sign extend all partial products to 2N+2 bits in step (ii). In step (iii), we remove all unnecessary 1's and easily guarantee that the maximum value of the dashed polygon is less than 2^{2N+1} when N > 4, as shown in (2.6). In the case of N ≤ 4, we just need to sign extend all partial products to 2N+4 bits; the remaining proofs are similar to the above and are omitted here for brevity.

2^{N+4} + Σ_{i=1}^{N/2-2} 2^{N+3}·4^i + 2^{2N} = (5/3)·2^{2N} + (1/3)·2^{N+4} < 2^{2N+1}    (2.6)

Consequently, the compression results in two (2N+1)-bit unsigned numbers Z_S and Z_C in step (iv), regardless of the compression strategy. Notice that Z_S[2N] and Z_C[2N] cannot be 1 at the same time; otherwise Z_S + Z_C would be larger than 2^{2N+1}. As a result, the sum of Z_S[2N], Z_C[2N], and the leading 2 bits outside the polygon is a 2-bit number with the same value in both bits, which is Z_S[2N] XNOR Z_C[2N]. Since Z_S[2N+1] is always the same as Z_C[2N], we can hide Z_S[2N+1] and Z_C[2N+1] in a hardware implementation. We can use two (2N+1)-bit signed numbers to accurately present the product XY.

Figure 2.7: Apply Overflow Prevention during Multiplication

In conclusion, overflow prevention consists of four steps:

i. Sign extend all numbers by enough bits and convert the sign bits S into 1's followed by S̄;
ii. Remove overflows when merging constant 1's;
iii. Calculate the maximum value of the variables;
iv. Fit the compression result into the data model without overflow.

The number of bits by which we need to sign extend depends on the maximum value of the variables in step (iii). We usually sign extend by 1 or 2 bits when compressing N signed numbers, whereas we may sign extend by ⌈log2(N)⌉ + 2 bits in the worst case.
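The prevention idea can be sketched numerically: sign extend the operands by k extra bits before carry-save compression, and the resulting redundant pair adds up without wraparound. The helpers to_fixed/from_fixed and the value of k here are my own (the thesis's step (iii) bound gives ⌈log2(N)⌉+2 in the worst case); this is a single fixed example reusing csa() from above, not a proof.

```python
def to_fixed(v: int, n: int) -> int:
    """Two's-complement encoding of v in n bits (hypothetical helper)."""
    return v & ((1 << n) - 1)

def from_fixed(u: int, n: int) -> int:
    """Decode an n-bit two's-complement value."""
    return u - (1 << n) if u >> (n - 1) else u

def compress_with_prevention(values: list[int], n: int, k: int = 2):
    """Compress n-bit signed numbers after sign extending by k extra bits,
    modeling a fixed (n+k)-bit datapath by masking each CSA stage."""
    w = n + k
    s, c = 0, 0
    for v in values:
        s, c = csa(s, c, to_fixed(v, w))
        s &= (1 << w) - 1
        c &= (1 << w) - 1
    return s, c, w

vals = [-20, 53, -7, 14]                     # 8-bit operands; sum is 40
s, c, w = compress_with_prevention(vals, 8)
assert from_fixed(s, w) + from_fixed(c, w) == sum(vals)   # no overflow in the final add
```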
2.2.3 Overflow Correction

Overflow prevention reserves enough bits by sign extension so that the compression result has no overflow. If we compress a collection of N-bit signed numbers, we obtain two signed numbers whose size is more than N bits. But what if we already know the result is a k-bit signed number (k < N)? The overflow prevention method is correct but redundant. We need a method to detect and correct the overflow after compression.

The overflow correction method relies on the data model in figure 2.8. Assume Z_S and Z_C are two (N+1)-bit signed numbers, whereas the sum Z = Z_S + Z_C is an N-bit signed number. In step (i), we add the leading 2 bits. Ignoring the overflow, call the sum 2a + b in step (ii). Define c as the carry from the addition of the least significant N-1 bits. Since the sum is an N-bit signed number, the leading 2 bits of Z must be the same as S in step (iii). Table 2.4 shows all combinations of a and b. Notice that the combination a = 0, b = 1 is not possible, since the leading 2 bits of Z would then differ, whatever c is. When a = b = 0, c has to be 0 to ensure S = 0. Similarly, we can deduce c for the combinations a = 1, b = 0 and a = b = 1. All three combinations present Z without overflow. Therefore, we can accurately present the N-bit signed number Z from the (N+1)-bit signed numbers Z_S and Z_C.

Figure 2.8: Data Model for Overflow Correction

Table 2.4: Truth Table for Overflow Correction Data Model

a  b  |  c
0  0  |  0
0  1  |  N.A.
1  0  |  1
1  1  |  0/1

We can apply this method when compressing all partial products during an N-bit signed multiplication, as in figure 2.9. Since the product is a 2N-bit signed number, we can sign extend all partial products to 2N+1 bits in step (i). In step (ii), we compress them into two numbers Z_S and Z_C. Finally, we just need to add the leading 2 bits in step (iii). The data model in figure 2.8 guarantees there is no overflow.

Figure 2.9: Apply Overflow Correction during Multiplication

What if we already know the result is a k-bit signed number in this case? We just need to truncate to the least significant k+1 bits, as in figure 2.10. A 2-bit addition on the leading 2 bits is sufficient.

Figure 2.10: Apply Overflow Correction when Result is k-bit

Once we apply either overflow prevention or overflow correction, the compression result is accurate. For example, multiplication can be replaced with compression, and we can guarantee XY = Z_S + Z_C. The question that immediately comes to mind is why we need to prevent or correct the overflow at all; the answer will become clear after we discuss the optimization for computing the expression (A+B)C.

2.2.4 Overflow Correction for Positive Number

When Z must be an N-bit positive number, we can further simplify the data model and represent Z by two N-bit signed numbers Z_S and Z_C [15], as shown in figure 2.11. We only need to add the leading bits Z_S[N-1] and Z_C[N-1] in step (i). Table 2.5 shows the possible carry c of the least significant N-1 bits for the different values of a in step (ii). Because the result Z in step (iii) is positive, c has to be the same as a. In this case, the sum of Z_S and Z_C cannot have any overflow.

Figure 2.11: Data Model for Overflow Correction with Positive Number

Table 2.5: Truth Table for Overflow Correction Data Model

a  |  c
0  |  0
1  |  1
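Below is a software sketch of the 2-bit fix behind figure 2.8 (my own construction of the idea, not the exact hardware mapping). It assumes the data-model precondition: zs and zc are (n+1)-bit encodings whose true sum fits in n signed bits. Only the top two bits are added; the low bits pass through untouched, and the corrected pair then adds with no overflow.

```python
def correct_overflow(zs: int, zc: int, n: int) -> tuple[int, int]:
    """2-bit overflow correction: zs, zc are (n+1)-bit two's-complement
    encodings whose true sum fits in n bits. Add only bits [n:n-1] of each
    (dropping the 2-bit carry) and fold the result back into zs."""
    low = (1 << (n - 1)) - 1
    top = ((zs >> (n - 1)) + (zc >> (n - 1))) & 0b11  # the 2-bit addition
    zs2 = (top << (n - 1)) | (zs & low)
    zc2 = zc & low
    return zs2, zc2

zs, zc = 0b10111, 0b01110          # 5-bit encodings of -9 and 14; true sum 5
a, b = correct_overflow(zs, zc, 4)
assert from_fixed(a, 5) + from_fixed(b, 5) == 5   # pair now sums without overflow
```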
2.3 Optimization of Expression (A+B)C

The expression (A+B)C is a common basic expression. Normally, we would first perform the full-size addition D = A+B followed by the multiplication Z = D·C (during which we can utilize modified Booth encoding to cut the number of generated partial products in half). Notice that computing the expression Z = A·C + B·C directly is so costly in hardware resources that it is not an option.

Instead, we propose a novel encoding technique (called Encode) which removes the necessity of performing the full-bitwidth addition of A and B when computing (A+B)·C. It does so by grouping and encoding every k bits of A and B and using these encoded k-bit groups to generate the partial products of D·C. In this section, we discuss and show the implementation of this encoding approach for the k = 2 case; extension to higher values of k is straightforward. This technique thereby facilitates many mathematical expressions.

2.3.1 Encode2bit

To explain Encode, we must carefully analyze the various steps of a 2-bit encoding, as shown in figure 2.12. Consider 2-bit unsigned numbers a = 2a_1 + a_0 and b = 2b_1 + b_0, and a 1-bit unsigned number c_in. The goal is to compute c_out, d[1:0], and e such that

a + b + c_in = 4c_out - 2d_1 + d_0 + 4e    (2.7)

Note that d_i = d[i] for i = 0, 1.

Figure 2.12: Model for Grouping 2 Bits

We define 2c_1 + d_0 = a_0 + b_0 + c_in for step (i) of figure 2.12. Table 2.6 presents the truth table over a_1, b_1, and c_1 such that a_1 + b_1 + c_1 = 2d_2 + d_1 + 2c_out in step (ii). The overflow due to a_1 = b_1 = 1 is captured in c_out = a_1 AND b_1.

Table 2.6: Truth Table for a_1, b_1, c_1

a_1  b_1  c_1  |  c_out  d_2  d_1  e
 0    0    0   |    0     0    0   0
 0    0    1   |    0     0    1   1
 0    1    0   |    0     0    1   1
 0    1    1   |    0     1    0   1
 1    0    0   |    0     0    1   1
 1    0    1   |    0     1    0   1
 1    1    0   |    1     0    0   0
 1    1    1   |    1     0    1   1

In step (iii), we split d_1 as 2d_1 - d_1. Notice that d_2 and d_1 cannot both be 1 at the same time; therefore, we only need e = d_2 OR d_1 to represent d_2 + d_1 in step (iv). In conclusion, Algorithm 1 shows the 2-bit encoding, which takes the three inputs a, b, and c_in and generates the outputs e, d, and c_out. The critical path delay is equal to that of a 2-bit adder. To avoid confusion, note that in these expressions XOR and AND have the same precedence, which is higher than OR.

Algorithm 1 [c_out, d, e] = Encode2bit(a, b, c_in)
Input: a[1:0], b[1:0], c_in
Output: c_out, d[1:0], e
1: t = (a_0 AND b_0) OR (a_0 AND c_in) OR (b_0 AND c_in)
2: c_out = a_1 AND b_1
3: d_0 = a_0 XOR b_0 XOR c_in
4: d_1 = a_1 XOR b_1 XOR t
5: e = a_1 XOR b_1 OR t

We can also sum a, b, and c_in into a 3-bit unsigned number f, as shown in figure 2.13. An interested reader can easily verify that

e + c_out = f_2 + f_1
d_1 = f_1
d_0 = f_0    (2.8)

Figure 2.13: Property of Outputs e and c_out

2.3.2 Encode

Algorithm 2 calculates N-bit ã and b̃ for two N-bit signed numbers a and b and a 1-bit unsigned input c. Here, we assume the sum a + b + c has no overflow; for example, a + b + c can be an (N-1)-bit signed number so that we can easily correct the overflow. Without loss of generality, we assume the bitwidth N of the inputs a and b is even; otherwise, we can sign extend a and b by 1 bit. The output c_out of one 2-bit encoding is used as the input c_in of the next 2-bit encoding. For example, the c_out from Encode2bit(a[N-3:N-4], b[N-3:N-4], c_in) serves as the c_in for Encode2bit(a[N-1:N-2], b[N-1:N-2], c_in).

Algorithm 2 Z̃ = Encode(a, b, c)
Input: N-bit signed a and b, 1-bit unsigned c
Output: N-bit ã and b̃
1: ã = b̃ = 0, c_out^(-1) = c
2: for i = 0 to N/2-1 do
3:   [c_out^(i), d^(i), e^(i)] = Encode2bit(a[2i+1:2i], b[2i+1:2i], c_out^(i-1))
4:   ã[2i+1:2i] = d^(i)
5:   b̃[2i+2] = e^(i)
6: end for
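The following is a bit-level software model of Algorithm 1 and Algorithm 2 (a sketch whose names mirror the pseudocode; decode_pm is my own helper for re-applying the signed digit weights, and to_fixed comes from the earlier sketch). It reproduces the N = 8 example of figure 2.14.

```python
def encode2bit(a: int, b: int, c_in: int) -> tuple[int, int, int]:
    """Algorithm 1: a + b + c_in == 4*c_out - 2*d1 + d0 + 4*e  (identity (2.7))."""
    a0, a1 = a & 1, (a >> 1) & 1
    b0, b1 = b & 1, (b >> 1) & 1
    t = (a0 & b0) | (a0 & c_in) | (b0 & c_in)
    c_out = a1 & b1
    d0 = a0 ^ b0 ^ c_in
    d1 = a1 ^ b1 ^ t
    e = (a1 ^ b1) | t
    return c_out, (d1 << 1) | d0, e

def encode(a: int, b: int, c: int, n: int) -> tuple[int, int]:
    """Algorithm 2. The loop is sequential here, but since each c_out is just an
    AND of two bits, all groups can be evaluated in parallel in hardware."""
    a_t = b_t = 0
    c_prev = c
    for i in range(n // 2):
        c_prev, d, e = encode2bit((a >> 2 * i) & 3, (b >> 2 * i) & 3, c_prev)
        a_t |= d << (2 * i)        # d0 has weight +1, d1 weight -2 in its group
        b_t |= e << (2 * i + 2)    # e has weight +4 relative to its group
    # the top-group e and c_out are dropped ("ignored" in figure 2.14)
    return a_t & ((1 << n) - 1), b_t & ((1 << n) - 1)

def decode_pm(a_t: int, b_t: int, n: int) -> int:
    """Re-apply the signed weights: odd bits of a_t are negative."""
    v = b_t
    for i in range(0, n, 2):
        v += ((a_t >> i) & 1) << i
        v -= ((a_t >> (i + 1)) & 1) << (i + 1)
    return v

at, bt = encode(to_fixed(-20, 8), 53, 1, 8)
assert decode_pm(at, bt, 8) % 256 == (-20 + 53 + 1) % 256   # == 34, as in figure 2.14
```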
Recall that the sum of the inputs is equal to the sum of the outputs for a 2-bit encoding, as shown in (2.7). The goal of the Encode procedure is to represent the sum a + b + c in a different format. Although algorithm 2 is presented as a for loop, all 2-bit encodings can be performed in parallel. Figure 2.14 shows an example for N = 8. In step (ii), we group every 2 bits and compute c_out for each group. The c_out output of the first 2-bit encoding is ignored, since we assume a + b + c has no overflow. In step (iii), we perform the 2-bit encodings and generate the outputs e and d for each group; c_out is already computed and used in step (ii). Similarly, the e output of the first 2-bit encoding is also ignored. In step (iv), we combine the results into two numbers; the output e from encoding the leading 2 bits is ignored. In step (v), we fill the empty slots with 0. Notice that we already have all the information in step (iii); steps (iv) and (v) are only shown to explain the computation to the reader. The critical path delay to perform Encode is roughly equal to that of a 2-bit adder.

Figure 2.14: Example of Encode when N = 8

An important observation is that we can group the result into 2-bit parts again. As shown in step (v) of figure 2.14, some bits are constant 0. The value of the example can be read as 0·4^3 + 2·4^2 + 1·4^1 + (-2)·4^0 = 34. The value range of a group is [-2, 2]. Once we apply the encoding technique to A and B in the expression (A+B)C, we can use each group to generate a partial product directly and thus have only N/2 partial products, just like radix-4 modified Booth encoding. There is no need to do an N-bit addition A+B.

2.3.3 Example: Polynomial Evaluation

Polynomial evaluation is a good example where we can apply our encoding technique and overflow prevention/correction. Let us rewrite the polynomial a(x) as in (2.9).

a(x) = a_0 + a_1 x + ... + a_n x^n = (((0 + a_n)x + a_{n-1})x + ...)x + a_0    (2.9)

Algorithm 3 a(x) = a_0 + a_1 x + ... + a_n x^n
Input: a_0, a_1, ..., a_n
Output: Z
1: Z_S = 0, Z_C = a_n
2: for i = n-1 to 0 do
3:   Z̃ = Encode(Z_S, Z_C)
4:   (Z_S, Z_C) = Compress(Z̃x + a_i)
5: end for
6: Z = Z_S + Z_C

We can first accurately compress the partial products of a_n x + a_{n-1} into Z_S + Z_C. After encoding Z_S and Z_C through Z̃ = Encode(Z_S, Z_C), we can generate the partial products of Z̃x + a_{n-2}. Our overflow prevention/correction allows us to compress the partial products into Z_S and Z_C without overflow. The procedure continues until we have updated Z_S and Z_C for Z̃x + a_0. Finally, we add Z = Z_S + Z_C. In total we need n-1 Encode operations, n-1 compressions, and 1 addition.
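To make the dataflow of Algorithm 3 concrete, here is a simplified software model (my own construction, not the thesis's hardware): the accumulator is kept as a redundant pair (Z_S, Z_C), each Horner step folds Z_S·x, Z_C·x, and a_i with the csa() sketch from above standing in for Encode + Compress, and a single full addition at the very end produces a(x). Exact big-integer arithmetic replaces partial-product generation.

```python
def horner_carry_save(coeffs: list[int], x: int) -> int:
    """Evaluate a(x) = a_0 + a_1 x + ... + a_n x^n with the accumulator in
    carry-save form; only one full-bitwidth addition at the very end."""
    zs, zc = 0, coeffs[-1]                 # Z_S = 0, Z_C = a_n
    for a_i in reversed(coeffs[:-1]):      # i = n-1 down to 0
        zs, zc = csa(zs * x, zc * x, a_i)  # invariant: zs + zc == running value
    return zs + zc                         # the single final addition

coeffs = [5, -3, 0, 7]                     # a(x) = 5 - 3x + 7x^3
assert horner_carry_save(coeffs, 4) == 5 - 3 * 4 + 7 * 4 ** 3
```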
2.3.4 Encode with Ceiling, Floor, and Round Operations

In this thesis, we need to perform ceiling ⌈X⌉, floor ⌊X⌋, and round [X] operations on a signed number X. Note that an unsigned number can be treated as a positive signed number. We can bound the difference when comparing X with ⌈X⌉, ⌊X⌋, and [X], respectively:

X - ⌈X⌉ ∈ (-1, 0],  X - ⌊X⌋ ∈ [0, 1),  X - [X] ∈ [-0.5, 0.5)    (2.10)

Based on overflow prevention and correction, an (N-1)-bit signed number X can be derived from two N-bit signed numbers X_S and X_C after ignoring the potential overflow. It seems necessary to perform the addition X_S + X_C in the following expressions, where we assume the denominator is a power of 2 for simplicity:

⌈(X_S + X_C)/2^k⌉ Y,  ⌊(X_S + X_C)/2^k⌋ Y,  [(X_S + X_C)/2^k] Y    (2.11)

In this section, we show how to avoid the addition X_S + X_C while keeping the difference acceptable.

2.3.4.1 Ceiling Operation

Figure 2.15 shows the procedure to partially add several bits of X_S and X_C. Assume the sum X_S + X_C is an (N-1)-bit signed number. In step (i), an addition on the leading 2 bits is used to correct the potential overflow. The 3-bit addition on X_S[k-1:k-3] and X_C[k-1:k-3] is used to control the difference in step (ii). We apply a 3-bit addition because its delay is close to that of our Encode2bit; we may add more bits if we extend the encoding technique by grouping more bits.

Figure 2.15: Ceiling Operation

The result of X_S[k-1:k-3] + X_C[k-1:k-3] is a 4-bit unsigned number. Define the leading 2 bits as a and b, as shown in step (iii). One can easily verify the following statements:

X_S[k-1:k-3] + X_C[k-1:k-3] - 8a ∈ [0, 7]
X_S[k-1:k-3] + X_C[k-1:k-3] - 8a - 4b ∈ [0, 3]
X_S[k-1:k-3] + X_C[k-1:k-3] - 8a - 8b ∈ [-4, 3]    (2.12)

In step (iv), we define X_H as consisting of X_S[N-1:k], X_C[N-1:k], a, and a constant 1. The difference between X_H and (X_S + X_C)/2^k lies in the range [-1, 0.125), as shown in (2.13). If we replace ⌈(X_S + X_C)/2^k⌉ with X_H, the range of the difference is close to that of the ceiling operation, but X_H does not require the full-size addition X_S + X_C.

X_S + X_C - X_H·2^k
  = (X_S + X_C) - (X_S[N-1:k] + X_C[N-1:k] + a + 1)·2^k
  = (X_S[k-4:0] + X_C[k-4:0]) + (X_S[k-1:k-3] + X_C[k-1:k-3] - 8a)·2^{k-3} - 2^k
  ∈ [-2^k, 0.125 × 2^k)

(X_S + X_C)/2^k - X_H ∈ [-1, 0.125)    (2.13)

When it comes to the expression ⌈(X_S + X_C)/2^k⌉ Y, we prefer to use our proposed encoding technique, because we can then skip the addition and generate the partial products directly. Figure 2.16 shows the procedure based on encoding, where we group every 2 bits from bit position N-1 down to bit position k and apply Encode2bit. In case N-k is not a multiple of 2, we just need to sign extend X_S and X_C by 1 bit. The c_in input of the last Encode2bit is the constant 1. Meanwhile, we compute the carry of a 2-bit normal addition X_S[k-1:k-2] + X_C[k-1:k-2] + (X_S[k-3] AND X_C[k-3]). Consequently, the resulting X̃_H in step (iii) can be used to generate (N-k)/2 partial products directly.

Figure 2.16: Ceiling Operation with Encode

Algorithm 4 Z̃ = EncodeCeiling(a, b, k)
Input: N-bit signed a and b, integer k < N
Output: (N-k)-bit ã and b̃
1: (ã, b̃) = Encode(a[N-1:k], b[N-1:k], 1)
2: b̃[0] = Carry(a[k-1:k-2] + b[k-1:k-2] + (a[k-3] AND b[k-3]))

Algorithm 4 shows the ceiling operation with Encode. Line 2 only computes the carry, which is simpler than a full 2-bit addition. The result X̃_H = (ã, b̃) from EncodeCeiling(X_S, X_C, k) is the one we want in step (iii) of figure 2.16.
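The bound (2.13) is easy to spot-check in software. The sketch below (x_h_ceiling and signed_high are my own helper names; to_fixed comes from the earlier sketch) computes X_H directly from the bit fields of figure 2.15 and verifies (X_S + X_C)/2^k − X_H ∈ [−1, 0.125) over random inputs.

```python
import random

def x_h_ceiling(xs: int, xc: int, n: int, k: int) -> int:
    """X_H per figure 2.15: the signed high words of X_S and X_C, plus the bit
    a (top bit of the 3-bit partial sum), plus a constant 1. Requires k >= 3."""
    def signed_high(u: int) -> int:          # decode bits [n-1:k] as signed
        h = (u >> k) & ((1 << (n - k)) - 1)
        return h - (1 << (n - k)) if h >> (n - k - 1) else h
    a = (((xs >> (k - 3)) & 7) + ((xc >> (k - 3)) & 7)) >> 3
    return signed_high(xs) + signed_high(xc) + a + 1

n, k = 16, 6
for _ in range(1000):
    # draw X_S, X_C so that their true sum fits in n-1 signed bits
    x = random.randrange(-(1 << (n - 2)), 1 << (n - 2))
    xs = random.randrange(-(1 << (n - 2)), 1 << (n - 2))
    xc = x - xs
    diff = x / 2 ** k - x_h_ceiling(to_fixed(xs, n), to_fixed(xc, n), n, k)
    assert -1 <= diff < 0.125
```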
The difference of X H and (X S +X C )/2 k can be shown 32 X S X C 0 0 + - + 0 bit position N-1 …… …… …… …… k + + + (ii) (iii) (iv) X H X S X C 0 0 …… …… b 0 0 …… …… 0 a + - + + + 0 0 …… …… b 0 0 …… …… 0 a + - + + + 0 X S X C …… …… …… …… (i) Figure 2.17: Floor Operation in the range [0,1.125). If we replace X S +X C 2 k with X H , the range of difference is close to floor operation but X H does not require a full-size addition X S +X C . X S +X C − X H 2 k = (X S +X C ) − (X S [N− 1 :k]+X C [N− 1 :k]+a)2 k = (X S [k− 4 : 0]+X C [k− 4 : 0]) +(X S [k− 1 :k− 3]+X C [k− 1 :k− 3]− 8a)2 k− 3 ∈ [0,1.125× 2 k ) X S +X C 2 k − X H ∈ [0,1.125) (2.14) Figure 2.18 shows the procedure based on encoding to compute X S +X C 2 k Y. The c in input of the last Encode2bit is the constant 0. Meanwhile, we compute the carry from a 33 2-bit normal addition X S [k− 1 : k− 2]+X C [k− 1 : k− 2]+X S [k− 3] AND X C [k− 3]. Consequently, the resulting ˜ X H in step (iii) can be used to generate (N − k)/2 partial products directly. X S X C + - + 0 bit position N-1 …… …… …… …… k + + + (i) …… (ii) ignored 0 …… …… b 0 0 0 0 + - + - + - + - x SH x CH (iii) X H ~ ignored 0 a Normal addition Figure 2.18: Floor Operation with Encode Algorithm5showstheceilingoperationwithEncode. Line7isonlytocomputethecarry, which is simpler than a 2-bit addition. The result ˜ X H = (˜ a, ˜ b) from EncodeFloor(X S ,X C ,k) is the one we want in step (iii) of figure 2.18. Algorithm 5 ˜ Z = EncodeFloor(a,b,k) Input: N-bit signed a and b, integer k <N Output: (N− k)-bit ˜ a and ˜ b 1: (˜ a, ˜ b) = Encode(a[N-1:k],b[N-1:k],0) 2: ˜ b[0] = Carry(a[k-1:k-2]+b[k-1:k-2]+a[k-3] AND b[k-3]) 34 2.3.4.3 Round Operation Figure 2.19 shows the procedure to partially add several bits of X S and X C . X S X C 0 0 + - + 0 bit position N-1 …… …… …… …… k + + + (ii) (iii) (iv) X H X S X C 0 0 …… …… b 0 0 …… …… 0 a + - + + + 0 0 …… …… b 0 0 …… …… 0 a + - - + + b X S X C …… …… …… …… (i) Figure 2.19: Round Operation The result of X S [k− 1 : k− 3]+X C [k− 1 : k− 3] is a 4-bit unsigned number. Define the leading 2 bits as a and b shown in step (iii). After splitting b as 2b− b, we define X H consisting of X S [N− 1 : k], X C [N− 1 : k], a, and b in step (iv). The difference of X H and (X S +X C )/2 k can be shown in the range [− 0.5,0.625). If we replace X S +X C 2 k with X H , the range of difference is close to round operation but X H does not require a full-size addition X S +X C . 35 X S +X C − X H 2 k = (X S +X C ) − (X S [N− 1 :k]+X C [N− 1 :k]+a+b)2 k = (X S [k− 4 : 0]+X C [k− 4 : 0]) +(X S [k− 1 :k− 3]+X C [k− 1 :k− 3]− 8a− 8b)2 k− 3 ∈ [− 2 k− 1 ,2 k− 1 +2 k− 3 ) X S +X C 2 k − X H ∈ [− 0.5,0.625) (2.15) X S X C + - + 0 bit position N-1 …… …… …… …… k + + + (i) …… (ii) ignored 0 …… …… 0 0 0 + - + - + - + - X SH X CH (iii) X H ~ ignored c out e Figure 2.20: Round Operation with Encode Figure 2.20 shows the procedure based on encoding, where we group every 2 bits from bit position N− 1 to bit position k− 2 and apply Encode2bit. In case N− k +2 is not a multiple of 2, we just need to sign extend X S and X C by 1 bit. The c in input of the last Encode2bit is the constant X S [k− 3] AND X C [k− 3]. In step (ii), output c out of the last Encode2bit serves as input c in of the second last Encode2bit. In step (iii), output e of the 36 last Encode2bit is put accordingly. Based on equation (2.8), we have c out +e is equal toa+b shown in figure 2.19. Consequently, the resulting ˜ X H in step (iii) can be used to generate (N− k)/2 partial products directly. 
Algorithm 6 Z̃ = EncodeRound(a, b, k)
Input: N-bit signed a and b, integer k < N
Output: (N-k)-bit ã and b̃
1: (ã, b̃) = Encode(a[N:k-2], b[N:k-2], a[k-3] AND b[k-3])
2: ã = ã[N-k+2:2]
3: b̃ = b̃[N-k+2:2]

Algorithm 6 shows the round operation with Encode. The result X̃_H = (ã, b̃) from EncodeRound(X_S, X_C, k) is the one we want in step (iii) of figure 2.20.

Chapter 3
Modular Multiplication

3.1 Categories

Given a multiplicand X, a multiplier Y, and a modulus M, the modular multiplication (MM) computes Z = XY mod M. Basically, an MM algorithm needs a large-bitwidth division to compute the quotient of the intermediate result over the modulus. Instead, there are two MM algorithms, Montgomery MM (MMM) [7] and Barrett MM (BMM) [6], that replace the large-bitwidth division with multiplications and shifts.

Figure 3.1: Category of Modular Multiplication

As a primitive operation, MM, MMM, and BMM have many variants targeting different design purposes, as shown in figure 3.1. Depending on whether the bitwidth N of the modulus M is fixed or variable, the design of an MM falls into two categories: fixed-precision and scalable MM. A fixed-precision MM assumes the bitwidth N of the modulus is fixed. Since N is known, some optimization techniques, like Karatsuba multiplication and Toom-Cook multiplication [16], can be applied to simplify the design. Moreover, fixed-precision MM can be further classified into iterative MM and full-word MM, depending on whether the multiplier Y is scanned iteratively or in one shot.

Because of the mathematical hardness required to maintain a high security level, N is usually a large number, e.g., 1,024. A complete N-bit multiplication is very expensive in terms of the required gate count and silicon area. Instead, an iterative MM, which repeatedly scans m bits of the multiplier and multiplies them with the multiplicand, becomes preferable. When m = 1, multiplying the multiplicand with 1 bit of the multiplier only requires a collection of AND logic gates operating in parallel. When m = 2, the modified Booth encoding technique [9] may be applied to simplify the partial products. For these reasons, many prior art references have focused on improving MM when m is fixed to 1 or 2. Unfortunately, the number of required iterations when m is small becomes very high, which results in a large end-to-end latency for performing the MM. Moreover, it is difficult and inefficient to extend prior art realizations to support a parameterizable m.

When N is not large, e.g., 32 or 128, it is possible to implement a complete N-bit multiplication. Full-word MM is the variant that scans the multiplier Y in one shot and implements the whole algorithm directly. Consequently, it fits a pipelined architecture to achieve high throughput. A full-word MM usually consists of three sequential N-bit multiplications. Prior references and our work show that some of these multiplications can be truncated. The goal of full-word MM design is thus to reduce the required gate count and silicon area.

Unlike regular multiplication, which can be decomposed into a collection of small-size multiplications, conventional MM design is in general non-decomposable. For example, we cannot use a conventional 1,024-bit MM to perform a 2,048-bit MM.
Fortunately, there is a variant of the Montgomery MM, called scalable Montgomery MM [15,17-19], which is able to perform Montgomery MM in a scalable manner, i.e., with arbitrary operand size using "fixed" hardware resources. A carefully designed processing element (PE) implements each iteration over many cycles. Since the computation of the i-th iteration can start as soon as the least m bits of the intermediate result Z^(i-1) of the (i-1)-st iteration are available, we can instantiate multiple PEs and make them work in parallel. A PE can even be reused once it finishes its assigned iteration. Consequently, with the help of a memory management unit that buffers intermediate results, a fixed number of PEs can perform an arbitrary-size MM. Notice that only Montgomery MM admits such a scalable design; basic MM and Barrett MM do not.

3.2 Iterative Basic Modular Multiplication

Recall that an N-bit integer Y can be represented as Y = Σ_{i=0}^{N-1} Y[i]·2^i in radix-2 and Y = Σ_{i=0}^{n_m - 1} Y_i·2^{im} in radix-2^m, where n_m = ⌈N/m⌉ and Y_i = Y[im+m-1:im] denotes the bits of Y belonging to word i. Algorithm 7 presents the iterative N-bit modular multiplication, where we scan an m-bit word of the multiplier Y from the most significant word to the least significant word. In particular, we left-shift the intermediate result Z^(i+1) from the (i+1)-st iteration by m bits and add it to the product of X and Y_i. Next, the quotient q^(i) is computed as the floor of T^(i) divided by M, and Z^(i) is calculated by subtracting q^(i)·M from T^(i). In this algorithm, we need to iterate d = ⌈N/m⌉ times to scan all words of the multiplier Y.

Algorithm 7 Radix-2^m Basic Modular Multiplication
Input: X, Y ∈ [0, M), r = 2^m, d = ⌈N/m⌉
Output: Z ∈ [0, M)
1: Z^(d) = 0
2: for i = d-1 to 0 do
3:   T^(i) = Z^(i+1)·r + X·Y_i
4:   q^(i) = ⌊T^(i)/M⌋
5:   Z^(i) = T^(i) - q^(i)·M
6: end for
7: return Z = Z^(0)

It is easy to show that Z^(i) lies in the range [0, M):

T^(i)/M - q^(i) = T^(i)/M - ⌊T^(i)/M⌋ ∈ [0, 1)
Z^(i) = T^(i) - q^(i)·M ∈ [0, M)    (3.1)

We can unfold the expression of Z^(0) back to Z^(d) as shown below:

Z^(i) = T^(i) - q^(i)·M = Z^(i+1)·r + X·Y_i - q^(i)·M
Z^(0) = Z^(1)·r + X·Y_0 - q^(0)·M
      = ···
      = Z^(d)·r^d + X·Σ_{i=0}^{d-1} Y_i·r^i - M·Σ_{i=0}^{d-1} q^(i)·r^i
      = XY - M·Σ_{i=0}^{d-1} q^(i)·r^i ≡_M XY    (3.2)

where ≡_M denotes the congruence relationship modulo M. To avoid confusion between the congruence relationship and the modular operation, we use a ≡_M b to show the congruence relationship between a and b under modulus M, and a mod M = b to show the modular operation. For example, 15 ≡_4 7 because 15 mod 4 = 3 and 7 mod 4 = 3.

Although this algorithm is simple and immediately correct (i.e., it does not need a correction step as found in more advanced modular multiplication algorithms), it relies on large-bitwidth divisions to compute q^(i), which are quite expensive. This algorithm is also referred to as interleaved modular multiplication (IMM), especially when m is 1 or 2 and the multiplication can be easily optimized.
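For reference, the following Python sketch (ours, purely behavioral) mirrors algorithm 7 directly and makes the expensive division in line 4 explicit:

    def basic_mm(X, Y, M, m, N):
        """Radix-2^m basic modular multiplication (Algorithm 7 sketch)."""
        r = 1 << m
        d = -(-N // m)                 # ceil(N/m) iterations
        Z = 0
        for i in reversed(range(d)):
            Yi = (Y >> (i * m)) & (r - 1)   # word i of the multiplier
            T = (Z << m) + X * Yi
            q = T // M                 # the expensive large-bitwidth division
            Z = T - q * M              # equivalently, Z = T % M
        return Z

    # e.g. basic_mm(123, 456, 789, m=4, N=10) == (123 * 456) % 789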
Based on algorithm 7, we may identify two main challenges that hinder the performance of iterative modular multiplication.

i. Computationally expensive operations. In the i-th iteration of algorithm 7, we need a large-bitwidth division to compute the quotient q^(i). Two multiplications are needed to compute the products X·Y_i and q^(i)·M. Finally, two additions are required to update the intermediate result Z^(i) = Z^(i+1)·r + X·Y_i - q^(i)·M. Division, multiplication, and addition are all quite expensive in terms of area and time, especially when N is large.

ii. Computation dependency. In each iteration, we have to compute the quotient q first and update the intermediate result Z next. This data dependency unavoidably results in a long critical path for an MM design. The situation becomes worse as m increases: although the cycle count scales as roughly 1/m, the critical path becomes longer at the same time, so the marginal reduction in computation latency (period × cycle count) gradually diminishes.

3.3 Full-word Basic Modular Multiplication

Algorithm 8 Full-word Basic Modular Multiplication
Input: X, Y ∈ [0, M)
Output: Z ∈ [0, M)
1: T = XY
2: q = ⌊T/M⌋
3: Z = T - qM
4: return Z

The full-word modular multiplication, shown in algorithm 8, is quite straightforward; its proof is omitted for brevity. Evidently, the main challenge for full-word modular multiplication is the costly area, since we need to fully implement two N-bit multiplications, a division, and a subtraction for an N-bit modulus M.

Chapter 4 Barrett Modular Multiplication

Barrett modular multiplication (BMM) [6] is a widely used MM algorithm because it replaces time-consuming divisions with multiplications and shifts.

4.1 Iterative Barrett Modular Multiplication

4.1.1 Classic Iterative Barrett Modular Multiplication

Let M be an N-bit integer in the range [2^{N-1}, 2^N). Algorithm 9 is the classic iterative BMM to compute Z = XY mod M. Notice that iterations in this algorithm are indexed from d-1 down to 0. The parameter µ is precomputed and provided as an input to the algorithm. α and β are two parameters used to control the error. Finally, r = 2^m.

Algorithm 9 Radix-2^m Barrett modular multiplication
Input: X, Y ∈ [0, M), 2^{N-1} ≤ M < 2^N, r = 2^m, d = ⌈N/m⌉, µ = ⌊2^{N+α}/M⌋
Output: Z ∈ [0, M)
1: Z^(d) = 0
2: for i = d-1 to 0 do
3:   Z^(i) = Z^(i+1)·r + X·Y_i
4:   q^(i) = ⌊⌊Z^(i)/2^{N+β}⌋·µ / 2^{α-β}⌋
5:   Z^(i) = Z^(i) - q^(i)·M
6: end for
7: Z = Z^(0)
8: if Z ≥ M then
9:   Z = Z - M
10: end if
11: return Z

The floor operation ⌊x⌋ computes the maximum integer that is smaller than or equal to x; for example, ⌊3.5⌋ = 3. The computation of the quotient q^(i) in the i-th iteration involves three floor operations. Obviously, there can be a difference between x and ⌊x⌋. Let us denote the differences for the three floor operations by e_1, e_2, and e_3, respectively. Note that e_1, e_2, e_3 ∈ [0, 1):

e_1 = Z^(i)/2^{N+β} - ⌊Z^(i)/2^{N+β}⌋
e_2 = 2^{N+α}/M - ⌊2^{N+α}/M⌋
e_3 = ⌊Z^(i)/2^{N+β}⌋·µ/2^{α-β} - ⌊⌊Z^(i)/2^{N+β}⌋·µ/2^{α-β}⌋    (4.1)

Consequently, we can elaborate the computation of q^(i) as follows:

q^(i) = (Z^(i)/2^{N+β} - e_1)(2^{N+α}/M - e_2)/2^{α-β} - e_3
      = Z^(i)/M - (2^{N+β}/M)e_1 - (Z^(i)/2^{N+α})e_2 - e_3 + e_1·e_2/2^{α-β}
e = Z^(i)/M - q^(i) = (2^{N+β}/M)e_1 + (Z^(i)/2^{N+α})e_2 + e_3 - e_1·e_2/2^{α-β}    (4.2)

Notice that e denotes the difference between Z^(i)/M and q^(i) in the above equation. When α ≥ m+3 and β ≤ -2, assuming Z^(i+1) ∈ [0, 2M), the derivation below proves that Z^(i) will also lie in the [0, 2M) range:

2^{N+β}/M ≤ 2^{N-2}/2^{N-1} = 0.5,  and  e_2/2^{α-β} < 2^{N+β}/M
Z^(i) = Z^(i+1)·r + X·Y_i < 3·2^{N+m},  so  Z^(i)/2^{N+α} < 3·2^{N+m}/2^{N+m+3} = 0.375
e = (2^{N+β}/M)e_1 + (Z^(i)/2^{N+α})e_2 + e_3 - e_1·e_2/2^{α-β} ∈ [0, 2)
Z^(i) ← Z^(i) - q^(i)·M = eM ∈ [0, 2M)    (4.3)
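A direct Python rendering of algorithm 9 may help make the parameter roles concrete. This is our behavioral sketch, with no hardware optimizations; it assumes 2^{N-1} ≤ M < 2^N, α ≥ m+3, and β ≤ -2:

    def barrett_mm_classic(X, Y, M, m, N, alpha, beta):
        """Behavioral sketch of Algorithm 9."""
        r = 1 << m
        d = -(-N // m)
        mu = (1 << (N + alpha)) // M              # mu = floor(2^(N+alpha)/M)
        Z = 0
        for i in reversed(range(d)):
            Z = Z * r + X * ((Y >> (i * m)) & (r - 1))        # line 3
            q = ((Z >> (N + beta)) * mu) >> (alpha - beta)    # line 4
            Z -= q * M                                        # line 5
        return Z - M if Z >= M else Z             # at most one correction

    # barrett_mm_classic(123, 77, 201, m=2, N=8, alpha=5, beta=-2) == (123*77) % 201 == 24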
We observe three data dependencies in algorithm 9; the critical path of the computation in each iteration thus encompasses three multiplications and two additions:

i. Computation of Z^(i) in line 3 precedes computation of ⌊Z^(i)/2^{N+β}⌋ in line 4;
ii. Computation of q^(i) in line 4 precedes computation of q^(i)·M in line 5;
iii. Computation of Z^(i) in line 5 of the i-th iteration precedes computation of Z^(i)·r + X·Y_{i-1} in line 3 of the (i-1)-st iteration.

We can combine lines 3 and 5 of algorithm 9 to explicitly show the relationship between Z^(i+1) and Z^(i) in each iteration:

Z^(i) = Z^(i+1)·r + X·Y_i - q^(i)·M    (4.4)

Based on this relation, we can recursively unroll the computation of Z^(0) back to Z^(d) = 0. Since d = ⌈N/m⌉, we have Σ_{k=0}^{d-1} Y_k·r^k = Y. Consequently, the result Z^(0) after all iterations is congruent to XY modulo M:

Z^(0) = Z^(1)·r + X·Y_0 - q^(0)·M
      = (Z^(2)·r + X·Y_1 - q^(1)·M)·r + X·Y_0 - q^(0)·M
      = Z^(2)·r^2 + X(Y_1·r + Y_0) - M(q^(1)·r + q^(0))
      = ···
      = Z^(d)·r^d + X·Σ_{k=0}^{d-1} Y_k·r^k - M·Σ_{k=0}^{d-1} q^(k)·r^k
      = XY - M·Σ_{k=0}^{d-1} q^(k)·r^k ≡_M XY    (4.5)

Since equation (4.3) guarantees Z^(0) ∈ [0, 2M), at most one correction is needed to ensure that the output Z ∈ [0, M). In particular, we return Z^(0) - M if Z^(0) ≥ M and return Z^(0) otherwise.

4.1.2 New Iterative Barrett Modular Multiplication: Basic

Algorithm 10 shows the basic version of the proposed BMM. Since q^(i) and Z^(i) in each iteration rely only on the inputs q^(i+1) and Z^(i+1), they can be computed in parallel. Consequently, all three multiplications can be performed simultaneously, and the critical path is reduced to one multiplication and two additions. This algorithm needs d = ⌈(N+1)/m⌉ + 2 iterations, indexed in reverse order from d-3 down to -2. Compared with algorithm 9, algorithm 10 needs only two additional iterations. Notice that the number of iterations d differs between algorithms.

Algorithm 10 Radix-2^m Barrett modular multiplication: basic version
Input: X, Y ∈ [0, M), 2^{N-1} ≤ M < 2^N, r = 2^m, d = ⌈(N+1)/m⌉+2, µ = [2^{N+2m+3}/M]
Output: Z ∈ [0, M)
1: Z^(d-2) = 0, q^(d-2) = 0
2: for i = d-3 to -2 do
3:   g^(i) = [[Z^(i+1)/2^{N-3}]·µ / 2^{2m+6}]
4:   q^(i) = g^(i) - q^(i+1)·r
5:   Z^(i) = (Z^(i+1) - q^(i+1)·M·r)·r + X·Ỹ_i
6: end for
7: Z = Z^(-2)/r^2
8: if Z < 0 then
9:   Z = Z + M
10: end if
11: return Z

4.1.2.1 Optimization of XY_i

Before presenting the proof of correctness for the proposed BMM algorithm, let us briefly discuss the optimization used to compute X·Y_i. Without loss of generality, we assume that m is an even number. Instead of computing X·Y_i, we scan Y_i and Y_{i-1}[m-1] and encode them as Ỹ_i as follows:

Ỹ_i = Y_i - Y_i[m-1]·r + Y_{i-1}[m-1]
    = Σ_{k=0}^{m/2-1} (-2Y[2k+1+im] + Y[2k+im] + Y[2k-1+im])·4^k    (4.6)

where Y_i[m-1] is the most significant bit of the m-bit unsigned integer Y_i, and Y[2k+im] is bit (2k+im) of the multiplier Y. Following the radix-4 modified Booth encoding, we repeatedly group the three bits Y[2k+1+im], Y[2k+im], and Y[2k-1+im] for 0 ≤ k ≤ m/2-1. Since -2Y[2k+1+im] + Y[2k+im] + Y[2k-1+im] lies in [-2, 2], this radix-4 encoding generates digits in the range [-2, 2], which are then multiplied with the multiplicand to generate the required partial products. Consequently, the product X·Ỹ_i has m/2 partial products, whereas X·Y_i has m/2+1 partial products. It is easy to prove that the range of Ỹ_i is [-r/2, r/2]. Moreover, we can compute the range of the partial sums:

Σ_{k=0}^{i} Ỹ_k·r^k = Σ_{k=0}^{i} Y_k·r^k - Y[(i+1)m-1]·r^{i+1} ∈ [-2^{(i+1)m-1}, 2^{(i+1)m-1}) = [-r^{i+1}/2, r^{i+1}/2)    (4.7)

When (i+1)m-1 ≥ N, we have Y[(i+1)m-1] = 0, since Y is an N-bit unsigned integer less than M. For the first iteration i = d-3, we should guarantee (d-3+1)m-1 ≥ N, which holds because d ≥ ⌈(N+1)/m⌉+2. Hence

Y = Σ_{k=0}^{⌈N/m⌉-1} Y_k·r^k = Σ_{k=0}^{⌈(N+1)/m⌉-1} Ỹ_k·r^k    (4.8)
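The encoding of (4.6) and the identity (4.8) can be checked with a few lines of Python. This is our illustrative sketch (the name booth_words is ours):

    def booth_words(Y, m, n):
        """Word recoding of (4.6): Y~_i = Y_i - Y_i[m-1]*r + Y_{i-1}[m-1]."""
        r, msb = 1 << m, lambda x: (x >> (m - 1)) & 1
        w = [(Y >> (i * m)) & (r - 1) for i in range(n)]
        return [w[i] - msb(w[i]) * r + (msb(w[i - 1]) if i else 0) for i in range(n)]

    # 10-bit Y, m = 4, n = ceil((N+1)/m) = 3 words
    Y, m, n = 733, 4, 3
    enc = booth_words(Y, m, n)
    assert sum(w << (i * m) for i, w in enumerate(enc)) == Y          # identity (4.8)
    assert all(-(1 << m) // 2 <= w <= (1 << m) // 2 for w in enc)     # range [-r/2, r/2]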
4.1.2.2 Range of Z^(i) - q^(i)·M·r

Let us prove the range of Z^(i) - q^(i)·M·r by mathematical induction. The base case is Z^(d-2) - q^(d-2)·M·r = 0. Assuming Z^(i+1) - q^(i+1)·M·r ∈ (-1.25Mr, 1.25Mr), we would like to derive a q^(i) such that Z^(i) - q^(i)·M·r ∈ (-1.25Mr, 1.25Mr). We can estimate the absolute bound of Z^(i) as follows:

|Z^(i)| = |(Z^(i+1) - q^(i+1)·M·r)·r + X·Ỹ_i|
        ≤ |Z^(i+1) - q^(i+1)·M·r|·r + |X·Ỹ_i|
        < 1.25Mr^2 + 0.5Mr = (1.25r + 0.5)Mr < 2^{N+2m+1}    (4.9)

In other words, Z^(i) is an (N+2m+2)-bit signed number. Similarly, |Z^(i+1)| < 2^{N+2m+1}, and Z^(i+1) is an (N+2m+2)-bit signed number.

Instead of the floor operation of algorithm 9, we use the round operation in this algorithm. [x] denotes the round function, which finds the integer nearest to x, e.g., [2.4] = 2 and [2.5] = 3. The difference between x and [x] is in the range [-0.5, 0.5). We can thus restate the differences e_1, e_2, and e_3 as below, with e_1, e_2, e_3 ∈ [-0.5, 0.5):

e_1 = Z^(i+1)/2^{N-3} - [Z^(i+1)/2^{N-3}]
e_2 = 2^{N+2m+3}/M - [2^{N+2m+3}/M]
e_3 = [Z^(i+1)/2^{N-3}]·µ/2^{2m+6} - [[Z^(i+1)/2^{N-3}]·µ/2^{2m+6}]    (4.10)

Similarly, we can elaborate the computation of g^(i) as follows:

g^(i) = (Z^(i+1)/2^{N-3} - e_1)(2^{N+2m+3}/M - e_2)/2^{2m+6} - e_3
      = Z^(i+1)/M - (2^{N-3}/M)e_1 - (Z^(i+1)/2^{N+2m+3})e_2 - e_3 + e_1·e_2/2^{2m+6}
e = Z^(i+1)/M - g^(i) = (2^{N-3}/M)e_1 + (Z^(i+1)/2^{N+2m+3})e_2 + e_3 - e_1·e_2/2^{2m+6}    (4.11)

Here, e denotes the difference between Z^(i+1)/M and g^(i). We can prove that |e| < 1 as follows:

|e| < (2^{N-3}/2^{N-1})·(1/2) + ((1.25r+0.5)Mr/2^{N+2m+3})·(1/2) + 1/2 + 1/2^{2m+8}
    < 1/2^3 + (1.25/2^4 + 1/2^6) + 1/2 + 1/2^{10} < 0.75 < 1    (4.12)

As a result, Z^(i+1)/M - g^(i) ∈ (-1, 1) implies Z^(i+1) - g^(i)·M ∈ (-M, M). Next we rewrite q^(i) in terms of g^(i) and q^(i+1):

q^(i) = g^(i) - q^(i+1)·r  ⇔  g^(i) = q^(i+1)·r + q^(i)    (4.13)

Consequently, we have Z^(i) - q^(i)·M·r ∈ (-1.25Mr, 1.25Mr), as shown below:

Z^(i) - q^(i)·M·r = (Z^(i+1) - q^(i+1)·M·r)·r + X·Ỹ_i - q^(i)·M·r
                  = (Z^(i+1) - (q^(i+1)·r + q^(i))·M)·r + X·Ỹ_i
                  = (Z^(i+1) - g^(i)·M)·r + X·Ỹ_i
|Z^(i) - q^(i)·M·r| < 0.75Mr + 0.5Mr = 1.25Mr    (4.14)

We can also prove that q^(i) is an (m+2)-bit signed number:

q^(i)·M = (Z^(i+1) - q^(i+1)·M·r) - (Z^(i+1) - g^(i)·M)
        ∈ (-(1.25r+1)M, (1.25r+1)M) ⊆ (-2Mr, 2Mr)
q^(i) ∈ (-2r, 2r)    (4.15)
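Putting the pieces together, here is our behavioral Python sketch of algorithm 10 (not the hardware datapath). The round operation [x] is implemented as floor(x + 1/2), and booth_words is the word recoding from the sketch in section 4.1.2.1, repeated here so the snippet is self-contained:

    def rnd(a, k):
        """[a / 2^k]: round to nearest, ties rounded upward."""
        return (a + (1 << (k - 1))) >> k

    def booth_words(Y, m, n):
        r, msb = 1 << m, lambda x: (x >> (m - 1)) & 1
        w = [(Y >> (i * m)) & (r - 1) for i in range(n)]
        return [w[i] - msb(w[i]) * r + (msb(w[i - 1]) if i else 0) for i in range(n)]

    def barrett_mm_basic(X, Y, M, m, N):
        """Behavioral sketch of Algorithm 10."""
        r = 1 << m
        d = -(-(N + 1) // m) + 2
        mu = (2 ** (N + 2 * m + 3) + M // 2) // M      # mu = [2^(N+2m+3)/M]
        Yt = booth_words(Y, m, d - 2) + [0, 0]         # Y~_{-1} = Y~_{-2} = 0
        Z = q = 0
        for i in range(d - 3, -3, -1):                 # i = d-3 down to -2
            g = rnd(rnd(Z, N - 3) * mu, 2 * m + 6)                # line 3
            q, Z = g - q * r, (Z - q * M * r) * r + X * Yt[i]     # lines 4-5 in parallel
        Z //= r * r                                    # exact division by r^2
        return Z + M if Z < 0 else Z

    # barrett_mm_basic(123, 77, 201, m=2, N=8) == (123 * 77) % 201 == 24

Note that the tuple assignment evaluates both right-hand sides with the previous iteration's q and Z, which is exactly the parallelism the algorithm exposes.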
4.1.2.3 Correctness of Final Result

We may unroll the computation of Z^(-2) back to Z^(d-2) = 0 as follows:

Z^(-2) = (Z^(-1) - q^(-1)·M·r)·r + X·Ỹ_{-2}
       = Z^(0)·r^2 + X(Ỹ_{-1}·r + Ỹ_{-2}) - M·r^2(q^(0)·r + q^(-1))
       = ···
       = Z^(d-2)·r^d + X·Σ_{k=-2}^{d-3} Ỹ_k·r^{k+2} - M·r^2·Σ_{k=-1}^{d-2} q^(k)·r^{k+1}
       = X·Σ_{k=-2}^{d-3} Ỹ_k·r^{k+2} - M·r^2·Σ_{k=-1}^{d-2} q^(k)·r^{k+1}    (4.16)

Because Ỹ_{-2} = Ỹ_{-1} = 0 and q^(d-2) = 0, the above equation can be simplified as follows:

Z^(-2) = X·r^2·Σ_{k=0}^{d-3} Ỹ_k·r^k - M·r^2·Σ_{k=-1}^{d-3} q^(k)·r^{k+1}
Z^(-2)/r^2 = X·Σ_{k=0}^{d-3} Ỹ_k·r^k - M·Σ_{k=-1}^{d-3} q^(k)·r^{k+1} = XY - M·Σ_{k=-1}^{d-3} q^(k)·r^{k+1} ≡_M XY    (4.17)

Therefore, Z^(-2) is a multiple of r^2, and Z^(-2)/r^2 is congruent to XY modulo M. The last step is to prove the range of Z^(-2)/r^2. Since Ỹ_{-2} = Ỹ_{-1} = 0, we can unroll the computation of Z^(-2) back to Z^(0):

Z^(-2) = (Z^(-1) - q^(-1)·M·r)·r + X·Ỹ_{-2}
       = Z^(0)·r^2 + X(Ỹ_{-1}·r + Ỹ_{-2}) - M·r^2(q^(0)·r + q^(-1))
       = Z^(0)·r^2 - M·r^2(q^(0)·r + q^(-1))
       = (Z^(0) - g^(-1)·M)·r^2
Z^(-2)/r^2 = Z^(0) - g^(-1)·M ∈ (-M, M)    (4.18)

Therefore, Z^(-2)/r^2 must be in the range (-M, M). We return Z^(-2)/r^2 + M if Z^(-2) < 0 and return Z^(-2)/r^2 otherwise. The proof is complete.

4.1.3 New Iterative Barrett Modular Multiplication: Advanced

Although algorithm 10 breaks the data dependency, full-width multiplications and additions still hinder the development of an efficient BMM. Algorithm 11 shows an advanced version of the proposed BMM algorithm, in which all multiplications and additions in each iteration are replaced with compression and encoding operations. A hardware implementation of this advanced algorithm achieves much lower area and computation latency than all prior state-of-the-art BMM designs.

4.1.3.1 Error Analysis

In this algorithm, we replace [Z^(i+1)/2^{N-3}] with Z̃_H^(i+1) and further replace [Z̃_H^(i+1)·µ/2^{2m+6}] with q̃^(i+1) (see the later discussion). The error terms e_1 and e_3 then lie in the range [-0.5, 0.625). Furthermore, since µ is precomputed, e_2 is always in [-0.5, 0.5), as shown in (4.19).

Algorithm 11 Radix-2^m Barrett modular multiplication: advanced version
Input: X, Y ∈ [0, M), 2^{N-1} ≤ M < 2^N, r = 2^m, d = ⌈(N+1)/m⌉+2, µ = [2^{N+2m+3}/M]
Output: Z ∈ [0, M)
1: Z_S^(d-2) = Z_C^(d-2) = q_S^(d-2) = q_C^(d-2) = 0
2: for i = d-3 to -2 do
3:   ▷ Define Z_SH^(i+1) and Z_CH^(i+1)
4:   Z_SH^(i+1) = Z_S^(i+1)[N+2m+2:N-6]
5:   Z_CH^(i+1) = Z_C^(i+1)[N+2m+2:N-6]
6:   ▷ Encode Z_H^(i+1) and q^(i+1)
7:   Z̃_H^(i+1) = EncodeRound(Z_SH^(i+1), Z_CH^(i+1), 3)
8:   q̃^(i+1) = EncodeRound(q_S^(i+1), q_C^(i+1), 3)
9:   q̂^(i+1) = EncodeQSN(q_S^(i+1), q_C^(i+1))
10:  ▷ Compress for Z̃_H^(i+1)·µ - q̂^(i+1)·2^{3m+6}
11:  (q_S^(i), q_C^(i)) = Compress(Z̃_H^(i+1)·µ - q̂^(i+1)·2^{3m+6})
12:  ▷ Perform division by 2^{2m+6} via truncation
13:  q_S^(i) = q_S^(i)[3m+8:2m+3]
14:  q_C^(i) = q_C^(i)[3m+8:2m+3]
15:  ▷ Compress for (Z^(i+1) - q̃^(i+1)·M·r)·r + X·Ỹ_i
16:  Ỹ_i = Y_i - Y_i[m-1]·r + Y_{i-1}[m-1]
17:  (Z_S^(i), Z_C^(i)) = Compress((Z^(i+1) - q̃^(i+1)·M·r)·r + X·Ỹ_i)
18:  ▷ Extract useful information by truncation
19:  Z_S^(i) = Z_S^(i)[N+2m+2:0]
20:  Z_C^(i) = Z_C^(i)[N+2m+2:0]
21: end for
22: Z = Z_S^(-2)[N+2m:2m] + Z_C^(-2)[N+2m:2m]
23: if Z < 0 then
24:   Z = Z + M
25: end if
26: return Z

e_1 = Z^(i+1)/2^{N-3} - Z̃_H^(i+1) ∈ [-0.5, 0.625)
e_2 = 2^{N+2m+3}/M - [2^{N+2m+3}/M] ∈ [-0.5, 0.5)
e_3 = Z̃_H^(i+1)·µ/2^{2m+6} - q̃^(i+1) ∈ [-0.5, 0.625)    (4.19)

Similar to the proof of the basic version of the proposed BMM algorithm, we can prove that e in (4.11) lies in (-0.73, 0.89) and Z^(i) - q^(i)·M·r ∈ (-1.23Mr, 1.39Mr). We need to point out that the (m+2)-bit signed q^(i) cannot be smaller than -1.5r when m ≥ 2, since 1.23r + 0.89 ≤ 1.5r → r ≥ 3.3 and r = 2^m ≥ 2^2 = 4:

q^(i)·M = (Z^(i+1) - q^(i+1)·M·r) - (Z^(i+1) - g^(i)·M)
        ∈ (-(1.23r+0.89)M, (1.39r+0.73)M) ⊆ (-2Mr, 2Mr)
q^(i) ∈ (-2r, 2r)    (4.20)

4.1.3.2 Optimization of [Z^(i+1)/2^{N-3}]

In this algorithm, the (N+2m+2)-bit signed number Z^(i) is represented by two (N+2m+3)-bit signed numbers Z_S^(i) and Z_C^(i). We could obtain Z^(i) by adding Z_S^(i) + Z_C^(i) and ignoring the potential overflow, but we do not perform this addition. We again use mathematical induction to prove correctness. The base case is Z_S^(d-2) = Z_C^(d-2) = Z^(d-2) = 0.
Assuming we can represent Z^(i+1) by Z_S^(i+1) and Z_C^(i+1), we would like to represent Z^(i) by Z_S^(i) and Z_C^(i) as well.

Following algorithm 10, the computation of q^(i) requires [Z^(i+1)/2^{N-3}]·µ. To implement this directly, we would need to perform the addition Z_S^(i+1) + Z_C^(i+1) first, extract the most significant bits next, and finally perform the multiplication. Two requirements seem to force the addition Z_S^(i+1) + Z_C^(i+1):

i. we need the addition to eliminate the potential overflow;
ii. we need to control the difference between Z^(i+1)/2^{N-3} and [Z^(i+1)/2^{N-3}] as shown in (4.10).

This data dependency requires a long critical path. Instead, we can use the encoding technique with the round operation from section 2.3.4. Figure 4.1 shows the optimization of [Z^(i+1)/2^{N-3}]·µ, where we group every 2 bits of the leading 2m+8 bits of Z_S^(i+1) and Z_C^(i+1). The logical AND of Z_S[N-6] and Z_C[N-6] is used as c_in of the last Encode2bit operation. In step (iii), we obtain Z̃_H^(i+1) after ignoring the least 2 bits. The result may be used to generate m+3 partial products without a full-bitwidth addition of Z_S^(i+1) and Z_C^(i+1).

[Figure 4.1: The Optimization of [Z^(i+1)/2^{N-3}] as Z̃_H^(i+1) (bit-level diagram)]

The analysis in section 2.3.4 establishes the range of e_1:

e_1 = Z^(i+1)/2^{N-3} - Z̃_H^(i+1) ∈ [-0.5, 0.625)    (4.21)

4.1.3.3 Optimization of q^(i)

After replacing [Z^(i+1)/2^{N-3}] with Z̃_H^(i+1), the computation of q^(i) is updated as in (4.22). Notice that we move q^(i+1)·r inside the round function:

q^(i) = [Z̃_H^(i+1)·µ/2^{2m+6}] - q^(i+1)·r = [(Z̃_H^(i+1)·µ - q^(i+1)·2^{3m+6})/2^{2m+6}]    (4.22)

Similar to writing Z^(i+1) in terms of Z_S^(i+1) and Z_C^(i+1), we want to represent q^(i+1) with q_S^(i+1) and q_C^(i+1). In this algorithm, we compute two (m+6)-bit signed numbers q_S^(i) and q_C^(i) in the i-th iteration. Rather than computing the (m+2)-bit q^(i+1) itself, we derive a variable q̂^(i+1) from q_S^(i+1) and q_C^(i+1) while guaranteeing that e_3 ∈ [-0.5, 0.625). We thus rewrite the computation of q^(i) after replacing q^(i+1) with q̂^(i+1):

q^(i) = [(Z̃_H^(i+1)·µ - q̂^(i+1)·2^{3m+6})/2^{2m+6}]    (4.23)

The whole process is proven by mathematical induction. The base case is q_S^(d-2) = q_C^(d-2) = q^(d-2) = 0. Assuming we can compute a q̂^(i+1) from q_S^(i+1) and q_C^(i+1), we would like to derive q_S^(i) and q_C^(i) so that we can compute a q̂^(i) as well.

Figure 4.2 shows the computation of Z̃_H^(i+1)·µ - q̂^(i+1)·2^{3m+6} in step (i) by expanding all partial products from Z̃_H^(i+1)·µ and q̂^(i+1)·2^{3m+6}. Following the discussion of overflow correction, it is sufficient to reserve the least 1 + (m+2) + (2m+6) = 3m+9 bits. Since q^(i) is an (m+2)-bit signed number as shown in (4.15), we only need to add the leading 2 bits in step (ii) to correct any potential overflow. Let the dashed rectangle in step (iii) denote q̂^(i).
Following the discussion of the round operation in section 2.3.4, we can control the error e_3 to lie in the range [-0.5, 0.625) through the addition q_S^(i)[2m+5:2m+3] + q_C^(i)[2m+5:2m+3] in step (ii), as shown below:

e_3 = (Z̃_H^(i+1)·µ - q̂^(i+1)·2^{3m+6})/2^{2m+6} - q̂^(i) ∈ [-0.5, 0.625)    (4.24)

[Figure 4.2: The Computation of q^(i) (bit-level diagram)]

In conclusion, we reserve q_S^(i) = q_S^(i)[3m+8:2m+3] and q_C^(i) = q_C^(i)[3m+8:2m+3] in step (ii) so as to compute q̂^(i) in step (iii) of the next iteration, just like q̂^(i+1) is used in the current iteration. For brevity, we use the notation q̂^(i+1) = EncodeQSN(q_S^(i+1), q_C^(i+1)):

a = q_S^(i+1)[3m+8:3m+7] + q_C^(i+1)[3m+8:3m+7]
b = q_S^(i+1)[2m+5:2m+3] + q_C^(i+1)[2m+5:2m+3]
q̂_S^(i+1) = (-2a[1] + a[0])·2^{m+1} + q_S^(i+1)[3m+6:2m+6]
q̂_C^(i+1) = q_C^(i+1)[3m+6:2m+6]
q̂^(i+1) = (q̂_S^(i+1), q̂_C^(i+1), b[3], b[2])    (4.25)

Notice that it is not necessary to perform the EncodeQSN operation completely, since we only need the least 3 bits of q̂^(i+1) in step (i) of figure 4.2.

4.1.3.4 Optimization of Z^(i)

The last task is to compute Z_S^(i) and Z_C^(i). We can also use Encode to speed up the computation of (q_S^(i+1) + q_C^(i+1))·M. As discussed around (4.25), we reserve m+6 bits of q_S^(i) and q_C^(i) to derive q̂^(i). Since the last 3 bits are used to control the error, the remaining m+3 bits of q_S^(i+1) and q_C^(i+1) would generate m/2+2 partial products. In fact, we can reduce them to m+2 bits and generate only m/2+1 partial products when m ≥ 2. Figure 4.3 shows the partial addition on the leading 4 bits and the last 3 bits in step (i); the results in step (iii) can be added together as illustrated in step (iv).

[Figure 4.3: Reduction of q_S^(i+1) and q_C^(i+1) (bit-level diagram)]

Table 4.1 shows all four combinations of a and b in step (iii). If a = b = 0 or a = b = 1, we can ignore q_S^(i+1)[m+2] and q_C^(i+1)[m+2] in step (iii). We only need to prove that the remaining two combinations of a and b are impossible. Equation (4.20) proves that q^(i+1) is an (m+2)-bit signed number in the range (-1.5r, 2r), so the leading 2 bits of q^(i+1) in step (iv) must be the same. The combination a = 0, b = 1 is impossible because then the leading 2 bits would differ. When a = 1 and b = 0, the maximum value that can be represented is -2^{m+2} + 2^m + 2^{m-1} + 2·2^{m-1} = -3·2^{m-1} = -1.5r. Since q^(i+1) > -1.5r, this combination is also impossible.

Table 4.1: Truth Table for a and b

a | b | Action
0 | 0 | Ignore q_S^(i+1)[m+2] and q_C^(i+1)[m+2] in step (iii)
0 | 1 | Not possible
1 | 0 | Not possible
1 | 1 | Ignore q_S^(i+1)[m+2] and q_C^(i+1)[m+2] in step (iii)

In conclusion, at the beginning of step (i), it is safe to ignore q_S^(i+1)[m+5] and q_C^(i+1)[m+5] if we perform q_S^(i+1)[m+4:m+2] + q_C^(i+1)[m+4:m+2], which is equivalent to performing Encode2bit on q_S^(i+1)[m+4:m+3] and q_C^(i+1)[m+4:m+3] with c_in = q_S^(i+1)[m+2] AND q_C^(i+1)[m+2]. We can continue and apply Encode2bit to every 2 bits of q_S^(i+1) and q_C^(i+1), as shown in figure 4.4. In step (iii), we ignore the last 2 bits and use the remaining bits of q_S^(i+1) and q_C^(i+1) to directly generate m/2+1 partial products.
The whole procedure can be written as q̃^(i+1) = EncodeRound(q_S^(i+1)[m+4:0], q_C^(i+1)[m+4:0], 3).

[Figure 4.4: The Optimization of q_S^(i+1) + q_C^(i+1) as q̃^(i+1) (bit-level diagram)]

Figure 4.5 shows how to compute (Z^(i+1) - q̃^(i+1)·M·r)·r + X·Ỹ_i. Since the final Z^(i) is an (N+2m+2)-bit signed number as shown in (4.9), it is sufficient to compress only the least N+2m+3 bits of m+4 numbers into two (N+2m+3)-bit signed numbers based on the overflow correction method. Let us define these two numbers as Z_S^(i) and Z_C^(i) (this is motivated by the fact that the sum of these two numbers is the correct Z^(i)). The aforesaid m+4 numbers consist of four parts: m/2+1 numbers from q̃^(i+1)·M, m/2 numbers from X·Ỹ_i, two numbers from Z^(i+1), and one number constructed by collecting the tail bits generated along with the partial products of q̃^(i+1)·M and X·Ỹ_i.

[Figure 4.5: The Computation of Z^(i) (bit-level diagram)]

4.1.3.5 Block-Level Diagram

Figure 4.6 shows the block-level diagram for the hardware implementation of the proposed iterative BMM algorithm. With the help of EncodeRound, EncodeQSN, and compression, the critical paths to update (q_S, q_C) and (Z_S, Z_C) correspond to a 2-bit addition followed by a compression of m+4 numbers.

[Figure 4.6: Block-level Diagram for the Hardware Implementation of the Proposed Iterative Barrett Modular Multiplication]

The critical path of the proposed BMM thus lies in the addition module for the final summation and correction step. Considering the trade-off between time and area, we use a high-performance logarithmic Brent-Kung adder (BKA) [14] and specify a 2-cycle timing constraint. Note that we need to perform one final summation and at most one correction; therefore, we only need to use the BKA twice in four consecutive cycles, which justifies the 2-cycle timing constraint. The control signal CT steers the data flow of the BKA. When CT = 0, we compute Z_S + Z_C and store the sum in Z. When CT = 1, we check the sign of Z: if Z < 0, we compute Z + M; otherwise, no correction is needed and we simply keep the original Z. In conclusion, the total clock cycle count for the presented BMM design is

CycleCount = d + 4    (4.26)

where d = ⌈(N+1)/m⌉+2 is the number of iterations in the proposed BMM algorithm.

4.2 Full-word Barrett Modular Multiplication

4.2.1 Classic Full-word Barrett Modular Multiplication

Algorithm 12 Full-word Barrett modular multiplication
Input: X, Y ∈ [0, M), µ = ⌊2^{N+α}/M⌋
Output: Z ∈ [0, M)
1: T = XY
2: q = ⌊⌊T/2^{N+β}⌋·µ / 2^{α-β}⌋
3: Z = T - qM
4: if Z ≥ M then
5:   Z = Z - M
6: end if
7: return Z

Once we have a good understanding of the iterative BMM design, the proof of the full-word BMM shown in algorithm 12 becomes quite easy. We provide the proof here so that this section can be read independently and the treatment is complete. Let us reuse e_1, e_2, and e_3 to denote the differences for the three floor operations. Note that e_1, e_2, e_3 ∈ [0, 1).
e_1 = T/2^{N+β} - ⌊T/2^{N+β}⌋
e_2 = 2^{N+α}/M - ⌊2^{N+α}/M⌋
e_3 = ⌊T/2^{N+β}⌋·µ/2^{α-β} - ⌊⌊T/2^{N+β}⌋·µ/2^{α-β}⌋    (4.27)

Consequently, we can elaborate the computation of q as follows:

q = (T/2^{N+β} - e_1)(2^{N+α}/M - e_2)/2^{α-β} - e_3
  = T/M - (2^{N+β}/M)e_1 - (T/2^{N+α})e_2 - e_3 + e_1·e_2/2^{α-β}
e = T/M - q = (2^{N+β}/M)e_1 + (T/2^{N+α})e_2 + e_3 - e_1·e_2/2^{α-β}    (4.28)

Notice that e denotes the difference between T/M and q in the above equation. When α ≥ N+1 and β ≤ -2, the derivation below proves that Z will lie in the range [0, 2M):

T = XY < M^2 < 2^{2N}  →  T/2^{N+α} < 2^{2N}/2^{2N+1} = 1/2
2^{N+β}/M ≤ 2^{N-2}/2^{N-1} = 1/2
e = (2^{N+β}/M)e_1 + (T/2^{N+α})e_2 + e_3 - e_1·e_2/2^{α-β} ≤ (2^{N+β}/M)e_1 + (T/2^{N+α})e_2 + e_3 < 1/2 + 1/2 + 1 = 2    (4.29)

Evidently, the main challenge for full-word BMM is the costly area, since we need to fully implement three large-size multiplications in series. Computation latency is less of a concern for full-word BMM, since it is typically implemented as a pipelined architecture to achieve high throughput. A possible concern is the deep pipeline when N is very large.

4.2.2 New Full-word Barrett Modular Multiplication: Basic

Algorithm 13 Full-word Barrett modular multiplication: basic version
Input: X, Y ∈ [0, M), µ = [2^{2N+3}/M]
Output: Z ∈ [0, M)
1: T = XY
2: q = [[T/2^{N-3}]·µ / 2^{N+6}]
3: Z = T - qM
4: if Z < 0 then
5:   Z = Z + M
6: end if
7: return Z

Choosing the parameters α = N+3 and β = -3, algorithm 13 shows the basic version that we are going to optimize. The proof is straightforward when we follow the error analysis below:

e_1 = T/2^{N-3} - [T/2^{N-3}] ∈ [-0.5, 0.5)
e_2 = 2^{2N+3}/M - [2^{2N+3}/M] ∈ [-0.5, 0.5)
e_3 = [T/2^{N-3}]·µ/2^{N+6} - [[T/2^{N-3}]·µ/2^{N+6}] ∈ [-0.5, 0.5)    (4.30)

Consequently, we can elaborate the computation of q as follows:

q = (T/2^{N-3} - e_1)(2^{2N+3}/M - e_2)/2^{N+6} - e_3
  = T/M - (2^{N-3}/M)e_1 - (T/2^{2N+3})e_2 - e_3 + e_1·e_2/2^{N+6}
e = T/M - q = (2^{N-3}/M)e_1 + (T/2^{2N+3})e_2 + e_3 - e_1·e_2/2^{N+6}    (4.31)

The difference e between T/M and q must be in the range (-1, 1). The result Z = T - qM thereby lies in the range (-M, M). Because q is an integer, q = T/M - e > -1 implies q ≥ 0, so q is an N-bit unsigned number:

T = XY < 2^{2N},  T/M ≤ (M-1)^2/M = M - 2 + 1/M < 2^N - 1
|e| < (2^{N-3}/2^{N-1})·(1/2) + (2^{2N}/2^{2N+3})·(1/2) + 1/2 + 1/2^{N+8} < 1
q = T/M - e ∈ (-1, 2^N)  →  q ∈ [0, 2^N)    (4.32)
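A behavioral Python sketch of algorithm 13 (ours, not the pipelined datapath; the round operation [x] is implemented as floor(x + 1/2)):

    def barrett_fullword_basic(X, Y, M, N):
        """Behavioral sketch of Algorithm 13 (assumes 2^(N-1) <= M < 2^N)."""
        rnd = lambda a, k: (a + (1 << (k - 1))) >> k   # [a/2^k], ties rounded up
        mu = (2 ** (2 * N + 3) + M // 2) // M          # mu = [2^(2N+3)/M]
        T = X * Y
        q = rnd(rnd(T, N - 3) * mu, N + 6)             # line 2
        Z = T - q * M                                  # Z in (-M, M)
        return Z + M if Z < 0 else Z

    # barrett_fullword_basic(123, 77, 201, N=8) == (123 * 77) % 201 == 24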
4.2.3 New Full-word Barrett Modular Multiplication: Advanced

At first glance, all three multiplications in algorithm 13 are required, to compute XY, [T/2^{N-3}]·µ, and qM. In reality, the multiplications for [T/2^{N-3}]·µ and qM do not have to be complete. For example, it is sufficient to compute the least N+1 bits of qM, because Z = T - qM in line 3 must be in the range (-M, M). Reference [20] uses two truncated multiplications for [T/2^{N-3}]·µ and qM; their error analysis ensures an output in the range [0, 4M). Although they may need three subtractions to compare the result with M, 2M, and 3M, the truncated multiplications save more area. In contrast, our advanced version of the full-word BMM algorithm replaces the multiplications with compressions while still ensuring an output range of (-M, M).

Algorithm 14 Full-word Barrett modular multiplication: advanced version
Input: X, Y ∈ [0, M), µ = [2^{2N+3}/M], truncation position k
Output: Z ∈ [0, M)
1: ▷ Compute XY
2: (T_S, T_C) = Compress(XY)
3: ▷ Replace [T/2^{N-3}] with EncodeRound
4: T_SH = T_S[2N+2:N-6]
5: T_CH = T_C[2N+2:N-6]
6: T̃_H = EncodeRound(T_SH, T_CH, 3)
7: ▷ Compress T̃_H·µ after truncating the least k bits
8: (q_S, q_C) = CompressUpper(T̃_H·µ, k)
9: ▷ Replace [T̃_H·µ/2^{N+6}] with EncodeRound
10: q_S = q_S[2N+6:N+3]
11: q_C = q_C[2N+6:N+3]
12: q̃ = EncodeRound(q_S, q_C, 3)
13: ▷ Compress the least N+1 bits of T_S + T_C - q̃M
14: (Z_S, Z_C) = CompressLower(T_S + T_C - q̃M, N+1)
15: ▷ Correct the result to the range [0, M)
16: Z = Z_S + Z_C
17: if Z < 0 then
18:   Z = Z + M
19: end if
20: return Z

Because T = XY < M^2 lies in [0, 2^{2N}), we can represent the 2N-bit unsigned number T by two (2N+1)-bit signed numbers T_S and T_C based on the overflow correction for positive numbers discussed in section 2.2.4. Meanwhile, encoding with the round operation from section 2.3.4 allows us to quickly compute T̃_H = EncodeRound(T_SH, T_CH, 3) and generate N/2+2 partial products for T̃_H·µ. The analysis of EncodeRound ensures

e_1 = T/2^{N-3} - T̃_H ∈ [-0.5, 0.625)    (4.33)

Since µ is computed in advance, we always have

e_2 = 2^{2N+3}/M - µ = 2^{2N+3}/M - [2^{2N+3}/M] ∈ [-0.5, 0.5)    (4.34)

4.2.3.1 Optimization of q

For the computation of qM = [T̃_H·µ/2^{N+6}]·M, we could generate all partial products of T̃_H·µ, compress them into two signed numbers, and apply EncodeRound for [T̃_H·µ/2^{N+6}]. However, we do not need to compress everything. If we still guarantee that the difference e between T/M and q is in the range (-1, 1), then q is an N-bit unsigned number, as proved in (4.32), and we only need two (N+1)-bit signed numbers to represent it; it is sufficient to compress the least 2N+7 bits of T̃_H·µ. We deliberately represent q by two (N+2)-bit signed numbers and compress the least 2N+8 bits, because we assume N is even. We can thereby generate N/2+1 partial products from q̃M.

Figure 4.7 shows how the least k bits of all partial products are truncated (how to choose k is discussed later). The tail bit of the last partial product usually cannot be hidden during compression, and we simply ignore it. The remaining bits are compressed into two numbers q_S and q_C. In step (ii), the tail bit and all truncated bits can be viewed as summing to a number U with U ≤ 2^{N+2} + (k/2)·2^k. To avoid a round operation, we add q_S[N+5:N+3] + q_C[N+5:N+3] into a 4-bit number with leading bits a and b. The leading 2 bits are added to correct potential overflow. We can split b as 2b - b in step (iii) and define q as consisting of all bits in the dashed rectangle of step (iv). Following equation (2.12), we can compare q in step (iv) with q_S + q_C + U in step (ii). Notice that q_S + q_C + U equals T̃_H·µ because the addition has no overflow.

[Figure 4.7: Compress T̃_H·µ by Truncation (bit-level diagram)]

U ∈ [0, 2^{N+2} + (k/2)·2^k]
(q_S + q_C) - q·2^{N+6} ∈ [-2^{N+5}, 2^{N+5} + 2^{N+3} - 2^{k+1}]
e_3 = T̃_H·µ/2^{N+6} - q = (q_S + q_C + U)/2^{N+6} - q ∈ [-0.5, 0.6875 + (k/2 - 2)·2^{k-N-6}]    (4.35)

Consequently, we can elaborate the computation of q as follows:

q = T̃_H·µ/2^{N+6} - e_3 = (T/2^{N-3} - e_1)(2^{2N+3}/M - e_2)/2^{N+6} - e_3
  = T/M - (2^{N-3}/M)e_1 - (T/2^{2N+3})e_2 - e_3 + e_1·e_2/2^{N+6}
e = T/M - q = (2^{N-3}/M)e_1 + (T/2^{2N+3})e_2 + e_3 - e_1·e_2/2^{N+6}    (4.36)

The upper bound of the error e between T/M and q is

e < (2^{N-3}/2^{N-1})·0.625 + (2^{2N}/2^{2N+3})·0.5 + 0.6875 + (k/2 - 2)·2^{k-N-6} + 0.625·0.5/2^{N+6}
  = 0.90625 + ((k/2 - 2)·2^k + 0.3125)/2^{N+6}    (4.37)

The lower bound of the error e between T/M and q is

e ≥ -((2^{N-3}/2^{N-1})·0.5 + (2^{2N}/2^{2N+3})·0.5 + 0.5 + 0.625·0.5/2^{N+6})
  = -(0.6875 + 0.3125/2^{N+6}) > -0.69 > -1    (4.38)

In conclusion, we can guarantee e ∈ (-1, 1) as long as we find a proper k for a given modulus size N such that the upper bound stays below 1. For example, for a 128-bit modular multiplication, we can choose k = 124 and thereby truncate almost half of the partial products.
[Figure 4.8: Compute q̃ = [T̃_H·µ/2^{N+6}] by Compression and EncodeRound (bit-level diagram)]

4.2.3.2 Optimization of Z

After replacing [T̃_H·µ/2^{N+6}] with the q defined in figure 4.7, we need a quick way to compute qM. Figure 4.8 shows how to apply EncodeRound from section 2.3.4 to avoid a full-bitwidth addition and to generate N/2+1 partial products. Since e = T/M - q ∈ (-1, 1), the result Z = T - qM must be in the range (-M, M) and is thus an (N+1)-bit signed number. It is therefore safe to compress only the least N+1 bits of all partial products of T_S + T_C - q̃M.

4.2.3.3 Block-level Diagram

Figure 4.9 shows the block-level diagram for the 4-stage pipelined hardware implementation of the proposed full-word BMM algorithm. Instead of relying on multiplications, we reduce the critical path to an encoding and a compression. With the help of EDA tools such as Design Compiler, advanced logic optimization and retiming are possible, and the number of stages may be varied to meet timing constraints.

[Figure 4.9: Block-level Diagram for the 4-stage Pipelined Implementation of the Proposed Full-word Barrett Modular Multiplication]

4.3 Extension to All Moduli

The last issue for BMM is the range of the modulus. All BMM algorithms so far assume 2^{N-1} ≤ M < 2^N. This assumption allows us to replace the long division of basic modular multiplication with a multiplication and a constant shift by a power of 2; a constant shift is trivial to implement in hardware, since it amounts to picking the corresponding bits. However, the assumption also excludes half of the possible moduli: any modulus in the range [1, 2^{N-1}) is not supported. In this section, we discuss how to modify the BMM algorithms to support almost all moduli in the range [1, 2^N).

4.3.1 Classic Extension

Algorithm 15 shows the extension that supports all N-bit moduli, where the input t is the bit size of the current modulus. For example, t = 4 when M = 13 ∈ [2^3, 2^4).
Algorithm 15 Extended full-word Barrett modular multiplication
Input: X, Y ∈ [0, M), current modulus size t, 2^{t-1} ≤ M < 2^t, µ = ⌊2^{2t}/M⌋
Output: Z ∈ [0, M)
1: T = XY
2: q = ⌊T·µ/2^{2t}⌋
3: Z = T - qM
4: if Z ≥ M then
5:   Z = Z - M
6: end if
7: return Z

The proof of correctness is quite similar to what we have already done. Let us reuse the notations e_2 and e_3:

e_2 = 2^{2t}/M - ⌊2^{2t}/M⌋ ∈ [0, 1)
e_3 = T·µ/2^{2t} - q = T·µ/2^{2t} - ⌊T·µ/2^{2t}⌋ ∈ [0, 1)    (4.39)

Consequently, we can elaborate the computation of q as follows:

q = T·µ/2^{2t} - e_3 = T(2^{2t}/M - e_2)/2^{2t} - e_3 = T/M - (T/2^{2t})e_2 - e_3
e = T/M - q = (T/2^{2t})e_2 + e_3 ∈ [0, 2)    (4.40)

In this algorithm, the computation of q requires a dynamic shifter and a complete 2N×(N+1) unsigned multiplication. Since the shift depends on the current modulus size t rather than the largest modulus size N, we cannot truncate bits of the partial products of T·µ in advance. The multiplication T·µ has to be a complete 2N×(N+1) unsigned multiplication because, in the worst case, T = XY is a 2N-bit and µ an (N+1)-bit unsigned number. In contrast, the classic full-word BMM algorithm (algorithm 13) needs only an (N+3)×(N+4) unsigned multiplication, and our new full-word BMM algorithm improves this further.

4.3.2 New Extension

In this section, we show how to easily extend our proposed BMM algorithms to support almost all moduli. All we have to do is choose a new precomputed input µ and insert shifters in the proper places. The resulting BMM designs remain efficient because all improvements (EncodeRound, truncation, etc.) are preserved.

4.3.2.1 Full-word Barrett modular multiplication

Algorithm 16 Extended full-word Barrett modular multiplication: basic version
Input: X, Y ∈ [0, M), current modulus size t, 2^2 ≤ 2^{t-1} ≤ M < 2^t
Input: µ = [2^{N+t+3}/M]
Output: Z ∈ [0, M)
1: T = XY
2: q = [[T/2^{t-3}]·µ / 2^{N+6}]
3: Z = T - qM
4: if Z < 0 then
5:   Z = Z + M
6: end if
7: return Z

Algorithm 16 extends the basic version of our proposed full-word BMM (algorithm 13), where µ is changed from [2^{2N+3}/M] to [2^{N+t+3}/M], and [T/2^{N-3}] becomes [T/2^{t-3}]. In line 2, we need a dynamic right shift that depends on t. To avoid the inefficiency of handling a negative shift amount, we simply assume t ≥ 3, so the modulus must be no less than 2^2 = 4. Rather than repeating the whole correctness argument, we just provide the error analysis:

e_1 = T/2^{t-3} - [T/2^{t-3}] ∈ [-0.5, 0.5)
e_2 = 2^{N+t+3}/M - [2^{N+t+3}/M] ∈ [-0.5, 0.5)
e_3 = [T/2^{t-3}]·µ/2^{N+6} - q ∈ [-0.5, 0.5)
q = ([T/2^{t-3}])·([2^{N+t+3}/M])/2^{N+6} - e_3
  = T/M - (2^{t-3}/M)e_1 - (T/2^{N+t+3})e_2 - e_3 + e_1·e_2/2^{N+6}
e = T/M - q = (2^{t-3}/M)e_1 + (T/2^{N+t+3})e_2 + e_3 - e_1·e_2/2^{N+6} ∈ (-1, 1)    (4.41)

Consequently, [T/2^{t-3}] is a (t+4)-bit unsigned number and µ is an (N+5)-bit unsigned number. Rather than implementing the complete 2N×(N+1) unsigned multiplication of algorithm 15, our extended full-word BMM needs only an (N+4)×(N+5) unsigned multiplication.
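The extension amounts to two small changes in the earlier sketch of algorithm 13. The following Python sketch is ours; it assumes t ≥ 3 (for t = 2 the "shift" becomes a left shift, handled by the guard inside rnd):

    def barrett_fullword_ext(X, Y, M, N):
        """Sketch of Algorithm 16; supports any modulus 4 <= M < 2^N."""
        t = M.bit_length()                        # current modulus size: 2^(t-1) <= M < 2^t
        def rnd(a, s):                            # [a / 2^s]; exact left shift if s <= 0
            return (a + (1 << (s - 1))) >> s if s > 0 else a << -s
        mu = (2 ** (N + t + 3) + M // 2) // M     # mu = [2^(N+t+3)/M]
        T = X * Y
        q = rnd(rnd(T, t - 3) * mu, N + 6)
        Z = T - q * M
        return Z + M if Z < 0 else Z

    # barrett_fullword_ext(7, 11, 13, N=8) == 77 % 13 == 12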
Algorithm 17 shows our advanced version of the extended full-word BMM. Compared with algorithm 14, the only two changes are: 1) µ becomes [2^{N+t+3}/M]; 2) we need shifters at lines 4 and 5. Similarly, we assume t ≥ 6 to ensure proper behavior of the shifters, i.e., M ≥ 2^5 = 32. The proof is omitted for brevity; an interested reader can establish correctness by following the proof of algorithm 14.

Algorithm 17 Extended full-word Barrett modular multiplication: advanced version
Input: X, Y ∈ [0, M), current modulus size t, 2^5 ≤ 2^{t-1} ≤ M < 2^t
Input: µ = [2^{N+t+3}/M], truncation position k
Output: Z ∈ [0, M)
1: ▷ Compute XY
2: (T_S, T_C) = Compress(XY)
3: ▷ Replace [T/2^{t-3}] with Shifter and EncodeRound
4: T_SH = T_S ≫ (t-6)
5: T_CH = T_C ≫ (t-6)
6: T̃_H = EncodeRound(T_SH, T_CH, 3)
7: ▷ Compress T̃_H·µ after truncating the least k bits
8: (q_S, q_C) = CompressUpper(T̃_H·µ, k)
9: ▷ Replace [T̃_H·µ/2^{N+6}] with EncodeRound
10: q_S = q_S[2N+6:N+3]
11: q_C = q_C[2N+6:N+3]
12: q̃ = EncodeRound(q_S, q_C, 3)
13: ▷ Compress the least N+1 bits of T_S + T_C - q̃M
14: (Z_S, Z_C) = CompressLower(T_S + T_C - q̃M, N+1)
15: ▷ Correct the result to the range [0, M)
16: Z = Z_S + Z_C
17: if Z < 0 then
18:   Z = Z + M
19: end if
20: return Z

4.3.2.2 Iterative Barrett modular multiplication

It should not be surprising that our proposed iterative BMM can also be extended easily to support almost all moduli, as shown in algorithm 18. The only two changes are: 1) µ becomes [2^{t+2m+3}/M]; 2) we need a shifter at line 3. Following the proof of algorithm 10, the proof of correctness is evident and thus omitted.

Algorithm 18 Extended radix-2^m Barrett modular multiplication: basic version
Input: X, Y ∈ [0, M), current modulus size t, 2^2 ≤ 2^{t-1} ≤ M < 2^t, r = 2^m
Input: d = ⌈(N+1)/m⌉+2, µ = [2^{t+2m+3}/M]
Output: Z ∈ [0, M)
1: Z^(d-2) = 0, q^(d-2) = 0
2: for i = d-3 to -2 do
3:   g^(i) = [[Z^(i+1)/2^{t-3}]·µ / 2^{2m+6}]
4:   q^(i) = g^(i) - q^(i+1)·r
5:   Z^(i) = (Z^(i+1) - q^(i+1)·M·r)·r + X·Ỹ_i
6: end for
7: Z = Z^(-2)/r^2
8: if Z < 0 then
9:   Z = Z + M
10: end if
11: return Z

Finally, algorithm 19 presents the advanced version of the extended iterative BMM, which replaces multiplication and addition with compression and encoding. Compared with algorithm 11, which only supports the modulus range [2^{N-1}, 2^N), algorithm 19 supports the modulus range [2^5, 2^N). The cost is two extra shifters that work in parallel.

Algorithm 19 Extended radix-2^m Barrett modular multiplication: advanced version
Input: X, Y ∈ [0, M), current modulus size t, 2^5 ≤ 2^{t-1} ≤ M < 2^t, r = 2^m
Input: d = ⌈(N+1)/m⌉+2, µ = [2^{t+2m+3}/M]
Output: Z ∈ [0, M)
1: Z_S^(d-2) = Z_C^(d-2) = q_S^(d-2) = q_C^(d-2) = 0
2: for i = d-3 to -2 do
3:   ▷ Define Z_SH^(i+1) and Z_CH^(i+1)
4:   Z_SH^(i+1) = Z_S^(i+1) ≫ (t-6), Z_CH^(i+1) = Z_C^(i+1) ≫ (t-6)
5:   ▷ Encode Z_H^(i+1) and q^(i+1)
6:   Z̃_H^(i+1) = EncodeRound(Z_SH^(i+1), Z_CH^(i+1), 3)
7:   q̃^(i+1) = EncodeRound(q_S^(i+1), q_C^(i+1), 3)
8:   q̂^(i+1) = EncodeQSN(q_S^(i+1), q_C^(i+1))
9:   ▷ Compress for Z̃_H^(i+1)·µ - q̂^(i+1)·2^{3m+6}
10:  (q_S^(i), q_C^(i)) = Compress(Z̃_H^(i+1)·µ - q̂^(i+1)·2^{3m+6})
11:  ▷ Perform division by 2^{2m+6} via truncation
12:  q_S^(i) = q_S^(i)[3m+8:2m+3]
13:  q_C^(i) = q_C^(i)[3m+8:2m+3]
14:  ▷ Compress for (Z^(i+1) - q̃^(i+1)·M·r)·r + X·Ỹ_i
15:  Ỹ_i = Y_i - Y_i[m-1]·r + Y_{i-1}[m-1]
16:  (Z_S^(i), Z_C^(i)) = Compress((Z^(i+1) - q̃^(i+1)·M·r)·r + X·Ỹ_i)
17:  ▷ Extract useful information by truncation
18:  Z_S^(i) = Z_S^(i)[N+2m+2:0]
19:  Z_C^(i) = Z_C^(i)[N+2m+2:0]
20: end for
21: Z = Z_S^(-2)[N+2m:2m] + Z_C^(-2)[N+2m:2m]
22: if Z < 0 then
23:   Z = Z + M
24: end if
25: return Z

4.4 Experimental Results

The proposed Barrett modular multiplication algorithms were coded in the Verilog hardware description language and implemented using the default flow of the Xilinx Vivado Virtex-7 xc7v585tffg1157-3 device, including synthesis and implementation (place & route).
Instead of optimizing the algorithm for a specific value of m, our iterative and scalable MM designs maintain the flexibility to implement different values of m, reflecting the trade-off between latency and area. The improvements achieved by our implementation can be seen in terms of both hardware area and computation latency.

Table 4.2: Different Iterative Barrett MM Designs on Virtex-7 FPGA

Design | N | m | Cycles | Period (ns) | Latency (µs) | LUT | FF | ATP | Latency impr. (%) | Area impr. (%) | ATP impr. (%)
Alg 11 | 1024 | 4 | 263 | 3.37 | 0.89 | 20644 | 3124 | 21.07 | - | - | -
Alg 11 | 1024 | 8 | 135 | 3.89 | 0.53 | 28276 | 3147 | 16.50 | - | - | -
Alg 11 | 1024 | 16 | 71 | 4.92 | 0.35 | 47970 | 3204 | 17.88 | - | - | -
[21] | 1024 | 4 | 263 | 5.90 | 1.54 | 21008 | 3081 | 37.09 | 42.44 | 1.33 | 43.21
[21] | 1024 | 8 | 135 | 7.30 | 0.97 | 31274 | 3087 | 33.36 | 45.91 | 8.55 | 50.54

(*) ATP stands for the product of the total area (LUT+FF) and the end-to-end post-place&route latency (in ms). The improvement columns report the savings of Alg 11 relative to [21] at the same m.
(**) We collect the best performance of each design for a given m.

Due to the non-linear operations (ceiling, floor, and round), it is not easy to simplify the multiplication in Barrett MM when m ≠ 1. Consequently, there are not many references focusing on iterative Barrett MM design. Reference [21] is the latest iterative Barrett MM (our prior contribution), which replaces multiplication with signed compression using the overflow prevention data model when the sum is positive. However, it retains the data dependency of computing q first and Z next, and q is obtained through a multiplication rather than a compression. Our proposed iterative BMM allows the parallel computation and completely replaces multiplication and addition with compression and encoding. Consequently, we save 45.91% in computation latency and 8.55% in area, which amounts to a 50.54% saving in ATP.

Finally, table 4.3 shows our proposed full-word Barrett MM (algorithm 14) in a 5-stage pipelined architecture. Since we replace multiplication and addition with compression and encoding, the addition in the final step becomes the performance bottleneck and is thus split into two stages. To the best of the authors' knowledge, there are no references presenting full-word Barrett MM results on FPGA.

Table 4.3: 5-stage Full-word Barrett MM Designs on Virtex-7 FPGA

N | Period (ns) | LUT | FF | Total | ATP ((LUT+FF)·µs)
32 | 3.8 | 3450 | 526 | 3976 | 15.11
64 | 4.9 | 10980 | 1043 | 12023 | 58.91
128 | 6.8 | 39764 | 2104 | 41868 | 284.70
256 | 9.0 | 154559 | 4401 | 158960 | 1430.64

(*) ATP stands for the product of the total area (LUT+FF) and the post-place&route period (in µs).

4.5 Conclusion

In this chapter, we presented several efficient Barrett MM algorithms. Our proposed iterative Barrett MM successfully replaces the costly multiplication and addition in each iteration with compression and encoding. The proposed EncodeRound avoids the full-size addition when computing the required quotient and intermediate result. Moreover, by manipulating the partial products into the overflow correction model, we solved the overflow problem associated with modified Booth coding when applied to signed numbers. Experimental results show the reduction in area and time of our design compared to the latest published results.

Our proposed full-word Barrett MM removes the need for complete multiplications when computing the quotient. The truncated compression removes almost half of the bits when generating partial products. Our careful error analysis guarantees that the final result is in the range (-M, M), so only one correction is needed.

Chapter 5 Montgomery Modular Multiplication

5.1 Domain Transformation

Montgomery modular multiplication [7] is another widely used MM algorithm because it replaces time-consuming divisions with multiplications and shifts.
Let M be an N-bit odd positive integer. Instead of directly computing the modular multiplication Z̄ = X̄Ȳ mod M, the operation can be performed efficiently in the Montgomery domain with Montgomery MM. Assume R is a power of 2 and R > M. We can transform X̄ (in the ordinary domain) into X (in the Montgomery domain) as follows:

X = X̄R mod M,  X̄ = X·R^{-1} mod M    (5.1)

where R^{-1} is the multiplicative modular inverse of R, with R^{-1} ∈ (0, M) and R·R^{-1} mod M = 1. The Montgomery form Z of Z̄ = X̄Ȳ mod M is expressed in terms of X, Y, and R^{-1} as follows:

Z = Z̄R mod M = X̄ȲR mod M = (X̄R)(ȲR)R^{-1} mod M = XY·R^{-1} mod M    (5.2)

Notice that the transformations between X and X̄ can themselves be cast as the operation XY·R^{-1} mod M with suitable inputs:

X = X̄·R^2·R^{-1} mod M = X̄·(R^2 mod M)·R^{-1} mod M
X̄ = X·1·R^{-1} mod M    (5.3)

It is easy to prove the homomorphism of addition and multiplication between the original and Montgomery domains:

• XY·R^{-1} mod M = (X̄Ȳ)R mod M
• (X + Y) mod M = (X̄ + Ȳ)R mod M

That is, the results of addition or multiplication in the Montgomery domain equal those of addition or multiplication in the original domain followed by a forward transformation from the original domain to the Montgomery domain. The RSA cryptosystem [1], based on modular exponentiation, is a typical application of MMM, because the cost of the two required domain transformations is amortized over many modular multiplications and thus becomes negligible. Therefore, the design of Z = XY·R^{-1} mod M determines the performance of Montgomery MM.

5.2 Iterative Montgomery Modular Multiplication

5.2.1 Classic Iterative Montgomery Modular Multiplication

Algorithm 20 Radix-2^m Montgomery modular multiplication
Input: X, Y ∈ [0, M), odd M < 2^N, r = 2^m, d = ⌈N/m⌉
Input: R = r^{⌈N/m⌉}, r·r^{-1} - M·M' = 1
Output: Z ∈ [0, M)
1: Z^(-1) = 0
2: for i = 0 to d-1 do
3:   Z^(i) = Z^(i-1) + X·Y_i
4:   q^(i) = (Z^(i) mod r)·M' mod r
5:   Z^(i) = (Z^(i) + q^(i)·M)/r
6: end for
7: Z = Z^(d-1)
8: if Z > M then
9:   Z = Z - M
10: end if
11: return Z

Algorithm 20 shows the radix-2^m Montgomery modular multiplication, where we scan the m-bit word Y_i of the multiplier in the i-th iteration. Let d denote the number of iterations. Bézout's identity [22] implies that r·r^{-1} - M·M' = 1 when r and M are relatively prime, with r^{-1} ∈ (0, M) and M' ∈ (0, r). Both r^{-1} and M' can be computed through the extended Euclidean algorithm [23].

In line 3 of algorithm 20, the product X·Y_i is added to the intermediate result Z^(i-1) obtained in the (i-1)-st iteration to produce the current result Z^(i). We compute q^(i) in line 4 so that Z^(i) + q^(i)·M becomes a multiple of r; therefore, the computation of (Z^(i) + q^(i)·M)/r in line 5 reduces to a simple right shift of (Z^(i) + q^(i)·M) by m bits. To achieve this goal, we can set q^(i) = Z^(i)·M', so that

Z^(i) + q^(i)·M = Z^(i) + Z^(i)·M'·M = Z^(i)(1 + M·M') = Z^(i)·r·r^{-1} ≡_r 0    (5.4)

where ≡_r denotes the congruence relationship modulo r. Notice that Z^(i)·M' is only one of many possible choices for q^(i): adding/subtracting any multiple of r to/from Z^(i)·M' still meets the requirement Z^(i) + q^(i)·M ≡_r 0. In other words, q^(i) is unique up to a modulo on r. Consequently, the valid choices of q^(i) form a set:

q^(i) = Z^(i)·M' + t·r, t ∈ ℤ,  i.e.,  q^(i) ≡_r Z^(i)·M' mod r    (5.5)

The m-bit value Z^(i)·M' mod r is the widely used choice for q^(i) because it is the minimum positive choice.
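A behavioral Python sketch of algorithm 20 (ours; it also illustrates how the domain transformations of (5.3) reuse the same routine):

    def mont_mm(X, Y, M, m, N):
        """Sketch of Algorithm 20: returns X*Y*R^-1 mod M, with R = r^ceil(N/m)."""
        r = 1 << m
        d = -(-N // m)
        Mp = pow(-M, -1, r)            # M' = -M^-1 mod r, from r*r^-1 - M*M' = 1
        Z = 0
        for i in range(d):
            Z += X * ((Y >> (i * m)) & (r - 1))   # line 3
            q = (Z % r) * Mp % r                  # line 4: makes Z + q*M a multiple of r
            Z = (Z + q * M) >> m                  # line 5: exact division by r
        return Z - M if Z >= M else Z             # final conditional correction

    # With m = 2, N = 4, M = 13 (so R = 16):
    #   mont_mm(7, 11, 13, 2, 4) == 7 * 11 * pow(16, -1, 13) % 13 == 4
    # Domain transforms per (5.3): mont_mm(X_bar, R*R % M, ...) maps into the
    # Montgomery domain, and mont_mm(X, 1, ...) maps back.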
Line 5 of algorithm 20 can be rewritten as in (5.6) below, which establishes a relation between Z^(i) and Z^(i-1). Using this relation, we can recursively unfold Z^(i) back to Z^(-1) = 0:

Z^(i)·r = X·Y_i + q^(i)·M + Z^(i-1)
Z^(i)·r^2 = X·Y_i·r + X·Y_{i-1} + q^(i)·M·r + q^(i-1)·M + Z^(i-2)
···
Z^(i)·r^{i+1} = X·Σ_{k=0}^{i} Y_k·r^k + M·Σ_{k=0}^{i} q^(k)·r^k    (5.6)

Consider the last iteration i = d-1 = ⌈N/m⌉-1. We have Σ_{k=0}^{i} Y_k·r^k = Y. Defining R = r^{⌈N/m⌉} and Q = Σ_{k=0}^{i} q^(k)·r^k, we have:

Z^(i)·R = Z^(i)·r^{⌈N/m⌉} = Z^(i)·r^{i+1} = X·Σ_{k=0}^{i} Y_k·r^k + M·Σ_{k=0}^{i} q^(k)·r^k = XY + MQ    (5.7)

Bézout's identity implies that R·R^{-1} - M·M* = 1 for some integer M*, since R is a power of two and M is odd. Therefore, Z^(i) = Z^(d-1) ≡_M XY·R^{-1}, as shown in (5.8):

Z^(i)(1 + M·M*) = Z^(i)·R·R^{-1}
Z^(i) + Z^(i)·M·M* = XY·R^{-1} + MQ·R^{-1}
Z^(i) ≡_M XY·R^{-1}    (5.8)

The range of Z^(d-1) is [0, 2M) for q^(i) = Z^(i)·M' mod r (the proof is omitted for brevity), so only the single correction in lines 8-10 is needed to bring Z back to the range [0, M).

We observe three data dependencies in algorithm 20:

i. Computation of Z^(i) precedes computation of (Z^(i) mod r)·M' mod r;
ii. Computation of q^(i) precedes computation of Z^(i) + q^(i)·M;
iii. Computation of Z^(i) in the i-th iteration precedes computation of Z^(i) + X·Y_{i+1} in the (i+1)-st iteration.

The critical path in each iteration thus encompasses three multiplications and two additions. Feedforward Montgomery MM [24,25] is a variant of Montgomery MM that is able to compute q^(i) and Z^(i) in parallel. However, feedforward Montgomery MM can only guarantee that Z^(d-1) in the last iteration falls in the range [0, 2^{m+1}M). As a result, Z cannot be adjusted back to the range [0, M) with a constant number of corrections. To the best of the authors' knowledge, there exist no prior-art Montgomery MM algorithms that parallelize the computation of q^(i) and Z^(i) while ensuring that the result of the last iteration lies in a constant range (i.e., independent of m). Note also that the number of iterations d can change for different variants of the Montgomery MM.

In contrast, we propose an efficient Montgomery MM in this section. For the sake of clarity, we first present a basic version of the proposed Montgomery MM, in which we parallelize the computation of q^(i) and Z^(i) while guaranteeing that Z^(d-1) lies in the range [0, 2M). The basic version establishes the correctness of the proposed Montgomery MM. Next, we present an advanced version, in which multiplication and addition operations are replaced with compression and encoding operations. This advanced version significantly reduces the critical path delay while guaranteeing that Z^(d-1) lies in the range (-M, 2M).

5.2.2 New Iterative Montgomery Modular Multiplication: Basic

Algorithm 21 shows the basic version of the proposed Montgomery MM. In this algorithm, q^(i) and Z^(i) are independent of each other and can thus be computed in parallel. Based on Bézout's identity, we can find r^{-1} ∈ (0, M) and M'' ∈ (0, r^2) such that r^2·r^{-1} - M·M'' = 1.

Algorithm 21 Radix-2^m Montgomery modular multiplication: basic version
Input: X, Y ∈ [0, M), odd M < 2^N, r = 2^m
Input: d = ⌈N/m⌉+2, R = r^{⌈N/m⌉}, r^2·r^{-1} - M·M'' = 1
Output: Z ∈ [0, M)
1: Z^(-1) = 0, q^(-1) = 0
2: for i = 0 to d-1 do
3:   g^(i) = (Z^(i-1) mod r^2)·M'' mod r^2
4:   q^(i) = (g^(i) - q^(i-1))/r
5:   Z^(i) = (Z^(i-1) + X·Y_i·r^2 + q^(i-1)·M)/r
6: end for
7: Z = Z^(d-1)
8: if Z > M then
9:   Z = Z - M
10: end if
11: return Z
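A behavioral Python sketch of algorithm 21 (ours). Note that g, q, and Z of one iteration depend only on values from the previous iteration, so the three products can be formed in parallel in hardware; the tuple assignment below mimics that:

    def mont_mm_parallel(X, Y, M, m, N):
        """Sketch of Algorithm 21: returns X*Y*R^-1 mod M, with R = r^ceil(N/m)."""
        r = 1 << m
        d = -(-N // m) + 2
        Mpp = pow(-M, -1, r * r)       # M'' with r^2*r^-1 - M*M'' = 1
        Z = q = 0
        for i in range(d):
            Yi = (Y >> (i * m)) & (r - 1)         # Y_i = 0 once i*m >= N
            g = (Z % (r * r)) * Mpp % (r * r)     # line 3
            q, Z = (g - q) // r, (Z + X * Yi * r * r + q * M) >> m   # lines 4-5
        return Z - M if Z >= M else Z

    # mont_mm_parallel(7, 11, 13, 2, 4) == 7 * 11 * pow(16, -1, 13) % 13 == 4

Both divisions are exact: g - q^(i-1) is a multiple of r by (5.11), and Z^(i-1) + q^(i-1)·M ≡_r 0 by the induction proved next.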
Let us prove the correctness of algorithm 21 by mathematical induction. At the beginning, Z^(-1) + q^(-1)·M = 0 ≡_r 0. Assume q^(i-1) ∈ [0, r) and Z^(i-1) + q^(i-1)·M ≡_r 0. We would like to derive a q^(i) such that Z^(i) + q^(i)·M ≡_r 0. To do this, we first define g^(i) = (Z^(i-1) mod r^2)·M'' mod r^2, as in line 3 of algorithm 21. Next we show that Z^(i-1) + g^(i)·M ≡_{r^2} 0:

Z^(i-1) + g^(i)·M = Z^(i-1) + (Z^(i-1)·M'' mod r^2)·M
                  ≡_{r^2} Z^(i-1) + Z^(i-1)·M''·M = Z^(i-1)(1 + M·M'') = Z^(i-1)·r^2·r^{-1} ≡_{r^2} 0    (5.9)

Furthermore, equation (5.9) implies the following:

Z^(i-1) + g^(i)·M ≡_r 0,  hence  Z^(i-1) + (g^(i) mod r)·M ≡_r 0    (5.10)

Because both q^(i-1) and (g^(i) mod r) are valid choices in [0, r), by the uniqueness of q^(i-1) up to a modulo on r in (5.5), we conclude q^(i-1) = g^(i) mod r. Therefore, g^(i) - q^(i-1) is a multiple of r, i.e.,

g^(i) - q^(i-1) ≡_r 0    (5.11)

In line 4 of algorithm 21, we have q^(i) = (g^(i) - q^(i-1))/r. Because g^(i) is 2m bits and q^(i-1) is m bits, q^(i) is also m bits. With this choice of q^(i), and based on line 5 of algorithm 21, we can now prove that Z^(i) + q^(i)·M ≡_r 0. From equation (5.9), we know

Z^(i-1) + X·Y_i·r^2 + g^(i)·M ≡_{r^2} 0
⇔ ((Z^(i-1) + X·Y_i·r^2 + q^(i-1)·M)/r + ((g^(i) - q^(i-1))/r)·M)·r ≡_{r^2} 0
⇔ Z^(i) + q^(i)·M ≡_r 0    (5.12)

and the proof is complete. Consequently, all three multiplications in algorithm 21 may be done in parallel; the critical path for computing Z^(i) is thus reduced to one multiplication and two additions.

We can also unfold the expression of Z^(i) back to Z^(-1) as shown below:

Z^(i)·r^{i+1} = X·r^2·Σ_{k=0}^{i} Y_k·r^k + M·Σ_{k=0}^{i} q^(k-1)·r^k
Z^(i)·r^{i+1} = X·r^2·Σ_{k=0}^{i} Y_k·r^k + M·r^2·Σ_{k=0}^{i-2} q^(k+1)·r^k
Z^(i)·r^{i-1} = X·Σ_{k=0}^{i} Y_k·r^k + M·Σ_{k=0}^{i-2} q^(k+1)·r^k    (5.13)

Because Z^(-1) = 0, we have q^(-1) = q^(0) = 0. Let r^{-(i-1)} denote the modular multiplicative inverse of r^{i-1} for i = 0, 1, ..., d-1, i.e., r^{i-1}·r^{-(i-1)} ≡_M 1. We can then relate Z^(i) and X(Σ_{k=0}^{i} Y_k·r^k)·r^{-(i-1)}:

Z^(i)·r^{i-1} = X·Σ_{k=0}^{i} Y_k·r^k + M·Σ_{k=0}^{i-2} q^(k+1)·r^k ≡_M X·Σ_{k=0}^{i} Y_k·r^k
Z^(i) ≡_M X·(Σ_{k=0}^{i} Y_k·r^k)·r^{-(i-1)}    (5.14)

When i ≥ ⌈N/m⌉ - 1, we have Σ_{k=0}^{i} Y_k·r^k = Y. Because q^(i) ∈ [0, r), we can estimate the upper bound of Z^(i):

Z^(i)·r^{i-1} = X·Σ_{k=0}^{i} Y_k·r^k + M·Σ_{k=0}^{i-2} q^(k+1)·r^k
             ≤ XY + M(r-1)·Σ_{k=0}^{i-2} r^k = XY + M(r-1)·(r^{i-1}-1)/(r-1) < (M + r^{i-1})M
Z^(i) < (1 + M/r^{i-1})M    (5.15)

When (i-1)m ≥ N, we have M < 2^N ≤ 2^{(i-1)m} = r^{i-1}, so Z^(i) < 2M. Therefore, when we set d = ⌈N/m⌉+2, we have Z^(d-1) ∈ [0, 2M). Compared with algorithm 20, algorithm 21 needs only two additional iterations.

5.2.3 New Iterative Montgomery Modular Multiplication: Advanced

Although algorithm 21 breaks the data dependency between q^(i) and Z^(i) and thus reduces the computation latency, multiplications and additions still hinder the development of a very high-performance Montgomery modular multiplication. Algorithm 22 shows an advanced version of the proposed Montgomery MM, which replaces all multiplications and additions in each iteration with compression and encoding operations. The result is an algorithm which can be implemented in hardware with much lower area and computation latency than any prior art design.
In algorithm 22, $Z^{(i-1)}$ is represented by a triplet (three values): an $(N+2m+1)$-bit signed number $Z_S^{(i-1)}$, an $(N+2m+1)$-bit signed number $Z_C^{(i-1)}$, and a 1-bit unsigned number $Z_{carry}^{(i-1)}$:

$$Z^{(i-1)} = (Z_S^{(i-1)}, Z_C^{(i-1)}, Z_{carry}^{(i-1)}) = Z_S^{(i-1)} + Z_C^{(i-1)} + Z_{carry}^{(i-1)} \tag{5.16}$$

Note that we do not explicitly perform the above addition. The operation $(Z^{(i-1)} \bmod r^2)$ is simply replaced with $Z_L^{(i-1)}$ as follows:

$$Z_L^{(i-1)} = (Z_S^{(i-1)}[2m-1:0], Z_C^{(i-1)}[2m-1:0], Z_{carry}^{(i-1)}) = Z_S^{(i-1)}[2m-1:0] + Z_C^{(i-1)}[2m-1:0] + Z_{carry}^{(i-1)} \equiv_{r^2} Z^{(i-1)} \bmod r^2 \tag{5.17}$$

Notice that the difference between $(Z^{(i-1)} \bmod r^2)$ and $Z_L^{(i-1)}$ is only 0 or $r^2$. Similarly, $q^{(i-1)}$ is represented by a triplet: an $m$-bit unsigned number $q_S^{(i-1)}$, an $m$-bit unsigned number $q_C^{(i-1)}$, and a 1-bit unsigned number $q_{carry}^{(i-1)}$:

$$q^{(i-1)} = (q_S^{(i-1)}, q_C^{(i-1)}, q_{carry}^{(i-1)}) = q_S^{(i-1)} + q_C^{(i-1)} + q_{carry}^{(i-1)} \tag{5.18}$$

Algorithm 22 Radix-$2^m$ Montgomery modular multiplication: advanced version
Input: $X, Y \in [0, M)$, odd $M < 2^N$, $r = 2^m$
Input: $d = \lceil N/m \rceil + 2$, $R = r^{\lceil N/m \rceil}$, $r^2 r^{-1} - MM'' = 1$
Output: $Z \in [0, M)$
1: $Z_S^{(-1)} = Z_C^{(-1)} = Z_{carry}^{(-1)} = q_S^{(-1)} = q_C^{(-1)} = q_{carry}^{(-1)} = 0$
2: for $i = 0$ to $d-1$ do
3:   ▷ Form $Z^{(i-1)}$, $Z_L^{(i-1)}$, and $q^{(i-1)}$ as triplets
4:   $Z^{(i-1)} = (Z_S^{(i-1)}, Z_C^{(i-1)}, Z_{carry}^{(i-1)})$
5:   $Z_L^{(i-1)} = (Z_S^{(i-1)}[2m-1:0], Z_C^{(i-1)}[2m-1:0], Z_{carry}^{(i-1)})$
6:   $q^{(i-1)} = (q_S^{(i-1)}, q_C^{(i-1)}, q_{carry}^{(i-1)})$
7:   ▷ Encode $Z_L^{(i-1)}$ and $q^{(i-1)}$
8:   $\tilde{Z}_L^{(i-1)} = \text{Encode}(Z_L^{(i-1)})$
9:   $\tilde{q}^{(i-1)} = \text{Encode}(q^{(i-1)})$
10:  $\hat{q}^{(i-1)} = \text{EncodeSN}(q^{(i-1)})$
11:  ▷ Compress for $\tilde{Z}_L^{(i-1)} M'' - \hat{q}^{(i-1)}$
12:  $(q_S^{(i)}, q_C^{(i)}) = \text{Compress}(\tilde{Z}_L^{(i-1)} M'' - \hat{q}^{(i-1)})$
13:  ▷ Perform division by $r$ via right shifting
14:  $q_{carry}^{(i)} = q_S^{(i)}[m-1]$ OR $q_C^{(i)}[m-1]$
15:  $q_S^{(i)} = q_S^{(i)} \gg m$, $q_C^{(i)} = q_C^{(i)} \gg m$
16:  ▷ Compress for $Z^{(i-1)} + X \tilde{Y}_i r^2 + \tilde{q}^{(i-1)} M$
17:  $\tilde{Y}_i = Y_i - Y_i[m-1] r + Y_{i-1}[m-1]$
18:  $(Z_S^{(i)}, Z_C^{(i)}) = \text{Compress}(Z^{(i-1)} + X \tilde{Y}_i r^2 + \tilde{q}^{(i-1)} M)$
19:  ▷ Prevent overflow from signed compression
20:  $Z_S^{(i)}[N+3m] = Z_S^{(i)}[N+3m]$ XNOR $Z_C^{(i)}[N+3m]$
21:  $Z_C^{(i)}[N+3m] = 0$
22:  ▷ Perform division by $r$ via right shifting
23:  $Z_{carry}^{(i)} = Z_S^{(i)}[m-1]$ OR $Z_C^{(i)}[m-1]$
24:  $Z_S^{(i)} = Z_S^{(i)} \gg m$, $Z_C^{(i)} = Z_C^{(i)} \gg m$
25: end for
26: $Z = Z_S^{(d-1)} + Z_C^{(d-1)} + Z_{carry}^{(d-1)}$
27: if $Z > M$ then
28:   $Z = Z - M$
29: else if $Z < 0$ then
30:   $Z = Z + M$
31: end if
32: return $Z$

5.2.3.1 EncodeSN

Using a procedure similar to the one for algorithm 21, we determine $q^{(i-1)}$ so that $Z^{(i-1)} + q^{(i-1)} M \equiv_r 0$. Recall that $q^{(i-1)}$ is only unique up to a multiple of $r$, as proved in (5.5). This observation motivates the application of our encoding techniques.

EncodeSN calculates $(t+1)$-bit outputs $\hat{a}$ and $\hat{b}$ from $t$-bit inputs $a$ and $b$. The encoding process consists of the following steps:

a) calculate a 3-bit intermediate result $d$ as $d[2:0] = a[t-1:t-2] + b[t-1:t-2] + (a[t-3]$ AND $b[t-3])$;
b) calculate $\hat{a}[t-4:0] = a[t-4:0]$ and $\hat{b}[t-4:0] = b[t-4:0]$;
c) calculate $\hat{a}[t-3] = a[t-3]$ XOR $b[t-3]$; $\hat{a}[t-2] = d[0]$; $\hat{a}[t-1] = d[1]$; $\hat{a}[t] = d[2]$;
d) calculate $\hat{b}[t-1:t-3] = 0$; $\hat{b}[t] = d[1]$.

The values of the outputs $\hat{a}$ and $\hat{b}$ are defined as

$$v_{\hat{a}} = \hat{a}[t] 2^t - \hat{a}[t-1] 2^{t-1} + \sum_{i=0}^{t-2} \hat{a}[i] 2^i, \qquad v_{\hat{b}} = \hat{b}[t] 2^t - \hat{b}[t-1] 2^{t-1} + \sum_{i=0}^{t-2} \hat{b}[i] 2^i \tag{5.19}$$

ensuring that $v_{\hat{a}} + v_{\hat{b}} = a + b$; the bit-level steps are illustrated by the sketch below.
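The following Python sketch mirrors steps a)-d) and the value definition (5.19). The function names and the final assertion (which reuses the operands of figure 5.1) are ours, added only for illustration.

    def encode_sn(a, b, t):
        """Steps a)-d): map t-bit a, b to (t+1)-bit patterns weighted per (5.19)."""
        bit = lambda x, i: (x >> i) & 1
        d = (a >> (t - 2)) + (b >> (t - 2)) + (bit(a, t - 3) & bit(b, t - 3))  # step a)
        mask = (1 << (t - 3)) - 1
        a_hat = (a & mask) | ((bit(a, t - 3) ^ bit(b, t - 3)) << (t - 3))      # steps b), c)
        a_hat |= (bit(d, 0) << (t - 2)) | (bit(d, 1) << (t - 1)) | (bit(d, 2) << t)
        b_hat = (b & mask) | (bit(d, 1) << t)                                  # step d)
        return a_hat, b_hat

    def value(x, t):
        """Value per (5.19): weight +2^t at bit t, -2^(t-1) at bit t-1, + below."""
        low = x & ((1 << (t - 1)) - 1)
        return ((x >> t) & 1) * (1 << t) - ((x >> (t - 1)) & 1) * (1 << (t - 1)) + low

    a_hat, b_hat = encode_sn(172, 245, 8)        # the q_S, q_C of figure 5.1
    assert value(a_hat, 8) + value(b_hat, 8) == 172 + 245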
We first perform EncodeSN$(q_S^{(i-1)}, q_C^{(i-1)})$. Although the outputs are $m+1$ bits, we truncate them and assign the least significant $m$ bits to $\hat{q}_S^{(i-1)}$ and $\hat{q}_C^{(i-1)}$; $\hat{q}_{carry}^{(i-1)}$ is simply $q_{carry}^{(i-1)}$ itself. Because $\hat{q}_S[m] 2^m$ and $\hat{q}_C[m] 2^m$ are multiples of $r = 2^m$, the difference between $\hat{q}^{(i-1)} = (\hat{q}_S^{(i-1)}, \hat{q}_C^{(i-1)}, \hat{q}_{carry}^{(i-1)})$ and $q^{(i-1)}$ is a multiple of $r$, so that

$$\hat{q}^{(i-1)} \equiv_r q^{(i-1)}, \qquad Z^{(i-1)} + \hat{q}^{(i-1)} M \equiv_r 0 \tag{5.20}$$

We use the notation $\hat{q}^{(i-1)} = \text{EncodeSN}(q^{(i-1)})$ to represent the said encoding process.

[Figure 5.1: Example of EncodeSN when m = 8]

Figure 5.1 shows the case of $\hat{q}^{(i-1)}$ when $m = 8$, where the superscript $(i-1)$ is omitted for brevity. The signs + and − stand for positive and negative bit weights, respectively. In this example, $q^{(i-1)} = 418$ and $\hat{q}^{(i-1)} = -94$; the difference is $q^{(i-1)} - \hat{q}^{(i-1)} = 512 = 2 \cdot 2^8 = 2r$.

5.2.3.2 Encode

Taking a second look at $\hat{q}^{(i-1)} = \text{EncodeSN}(q^{(i-1)})$, we actually apply Encode2bit on the leading 2 bits of $q_S^{(i-1)}$ and $q_C^{(i-1)}$ with $c_{in} = q_S^{(i-1)}[m-3]$ AND $q_C^{(i-1)}[m-3]$ and ignore the outputs $c_{out}$ and $e$. Note that $q_S^{(i-1)}[m-3]$ and $q_C^{(i-1)}[m-3]$ are adjusted accordingly. Instead of encoding only the leading 2 bits of $q_S^{(i-1)}$ and $q_C^{(i-1)}$, we can group every 2 bits and perform Encode$(q_S^{(i-1)}, q_C^{(i-1)}, q_{carry}^{(i-1)})$. Similar to the computation of $\hat{q}^{(i-1)}$, we truncate the outputs and assign the least significant $m$ bits to $\tilde{q}_S^{(i-1)}$ and $\tilde{q}_C^{(i-1)}$. Since we only ignore the outputs $e$ and $c_{out}$ of the first Encode2bit in both EncodeSN and Encode, we have:

$$\tilde{q}^{(i-1)} = \hat{q}^{(i-1)}, \qquad Z^{(i-1)} + \tilde{q}^{(i-1)} M \equiv_r 0 \tag{5.21}$$

We use $\tilde{q}^{(i-1)} = \text{Encode}(q^{(i-1)})$ to represent the said encoding process. When we compute $\tilde{q}^{(i-1)} M$, we only have $m/2$ partial products, and there is no need to perform the $m$-bit addition $q_S^{(i-1)} + q_C^{(i-1)} + q_{carry}^{(i-1)}$. Since the range of each group is $[-2, 2]$, the range of the $m$-bit $\tilde{q}^{(i)}$ is $[-\frac{2}{3}(r-1), \frac{2}{3}(r-1)]$ as shown below:

$$|\tilde{q}^{(i)}| \leq \sum_{k=0}^{m/2-1} 2 \cdot 4^k = 2 \cdot \frac{4^{m/2} - 1}{4 - 1} = \frac{2}{3}(2^m - 1) = \frac{2}{3}(r - 1) \tag{5.22}$$

5.2.3.3 Computation of $q^{(i)}$

Recall that the computation of $g^{(i)} = (Z^{(i-1)} \bmod r^2) M'' \bmod r^2$ in algorithm 21 results in $Z^{(i-1)} + g^{(i)} M \equiv_{r^2} 0$, that is,

$$Z^{(i-1)} + ((Z^{(i-1)} \bmod r^2) M'' \bmod r^2) M \equiv_{r^2} 0 \tag{5.23}$$

If we add/subtract any multiple of $r^2$ to/from $(Z^{(i-1)} \bmod r^2)$, the above equation still holds. In other words, it is safe to replace $(Z^{(i-1)} \bmod r^2)$ with $Z_L^{(i-1)}$ or $\tilde{Z}_L^{(i-1)} = \text{Encode}(Z_L^{(i-1)})$, because the difference between any two of $(Z^{(i-1)} \bmod r^2)$, $Z_L^{(i-1)}$, and $\tilde{Z}_L^{(i-1)}$ is a multiple of $r^2$. The benefit is that we only have $m$ partial products when computing $\tilde{Z}_L^{(i-1)} M''$. We thus have:

$$Z^{(i-1)} + (\tilde{Z}_L^{(i-1)} M'' \bmod r^2) M \equiv_{r^2} 0 \tag{5.24}$$

Equation (5.24) implies:

$$Z^{(i-1)} + (\tilde{Z}_L^{(i-1)} M'' \bmod r) M \equiv_r 0 \tag{5.25}$$

Considering equation (5.20) and the uniqueness of $q^{(i-1)}$ up to a multiple of $r$, we have

$$\tilde{Z}_L^{(i-1)} M'' \bmod r - \hat{q}^{(i-1)} \equiv_r 0 \tag{5.26}$$

Following equation (5.12), which was derived for algorithm 21, the $q^{(i)}$ obtained for algorithm 22 should meet the following requirement:

$$q^{(i)} \equiv_r \frac{\tilde{Z}_L^{(i-1)} M'' - \hat{q}^{(i-1)}}{r} \tag{5.27}$$

Throughout algorithm 22, Compress denotes the carry-save reduction of several addends into two numbers; a toy software model is given below before we walk through figure 5.2.
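This is only a behavioral model, assuming fixed-width wrap-around arithmetic. A real implementation arranges the 3:2 carry-save adders (CSAs) as a balanced ternary tree for $O(\log m)$ delay; this loop reduces the operands sequentially but produces an equally valid pair.

    def csa(a, b, c, mask):
        """One 3:2 carry-save step: a + b + c == s + cy (modulo mask+1)."""
        s = a ^ b ^ c
        cy = (((a & b) | (a & c) | (b & c)) << 1) & mask
        return s & mask, cy

    def compress(operands, width):
        """Reduce a list of addends to two numbers, keeping only the low bits."""
        mask = (1 << width) - 1
        ops = [x & mask for x in operands]       # negatives become 2's-complement
        while len(ops) > 2:
            s, cy = csa(ops.pop(), ops.pop(), ops.pop(), mask)
            ops += [s, cy]
        return ops[0], ops[1]

    s, c = compress([0x3A, 0x15, 0x27, 0x09], 8)
    assert (s + c) & 0xFF == (0x3A + 0x15 + 0x27 + 0x09) & 0xFF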
Figure 5.2 shows the compression of $\tilde{Z}_L^{(i-1)} M'' - \hat{q}^{(i-1)}$ for the least significant $2m$ bits. In step (i), we compress $m+3$ numbers into two unsigned numbers $q_S^{(i)}$ and $q_C^{(i)}$. Notice that each partial product has a tail bit. The aforesaid $m+3$ numbers consist of three parts: $m$ numbers from $\tilde{Z}_L^{(i-1)} M''$, 2 numbers from $\hat{q}_S^{(i-1)}$ and $\hat{q}_C^{(i-1)}$, and 1 number constructed by collecting the tail bits and $\hat{q}_{carry}^{(i-1)}$. We then use a balanced ternary tree arrangement of CSAs to perform the compression $(q_S^{(i)}, q_C^{(i)}) = \text{Compress}(\tilde{Z}_L^{(i-1)} M'' - \hat{q}^{(i-1)})$ with $O(\log m)$ full-adder cell delays. Here, we perform fixed-bitwidth compression and collect only the least significant $2m$ bits of the two numbers after compression, because any overflow is a multiple of $r^2$.

Because of equation (5.26), the sum of the least significant $m$ bits can only be 0 or $r$. When $q_S^{(i)}[m-1] = 1$ or $q_C^{(i)}[m-1] = 1$, the following equality holds: $q_S^{(i)}[m-1:0] + q_C^{(i)}[m-1:0] = r$. On the other hand, when $q_S^{(i)}[m-1] = q_C^{(i)}[m-1] = 0$, we have $q_S^{(i)}[m-1:0] + q_C^{(i)}[m-1:0] = 0$. Therefore, we may define $q_{carry}^{(i)}$ as follows:

$$q_{carry}^{(i)} = q_S^{(i)}[m-1] \text{ OR } q_C^{(i)}[m-1], \qquad q_{carry}^{(i)} r = q_S^{(i)}[m-1:0] + q_C^{(i)}[m-1:0] \tag{5.28}$$

[Figure 5.2: Computing $q^{(i)}$]

In step (ii), the computation of $(\tilde{Z}_L^{(i-1)} M'' - \hat{q}^{(i-1)})/r$ is reduced to a right shift by $m$ bits:

$$q_{carry}^{(i)} = q_S^{(i)}[m-1] \text{ OR } q_C^{(i)}[m-1], \qquad q_S^{(i)} = q_S^{(i)} \gg m, \qquad q_C^{(i)} = q_C^{(i)} \gg m \tag{5.29}$$

Note that the resulting $q^{(i)} = (q_S^{(i)}, q_C^{(i)}, q_{carry}^{(i)})$ is valid since it meets equation (5.27). Consequently, the computation of $q^{(i)}$ relies on encoding and compression operations rather than the more expensive multiplication and addition operations.

5.2.3.4 Error Analysis

Before doing the error analysis for $Z^{(i)}$, let us briefly discuss the optimization of computing $X Y_i$ based on the modified Booth encoding. Instead of computing $X Y_i$, we scan $Y_i$ and $Y_{i-1}[m-1]$ and encode them as $\tilde{Y}_i$, that is,

$$\tilde{Y}_i = Y_i - Y_i[m-1] r + Y_{i-1}[m-1] = \sum_{k=0}^{m/2-1} (-2 Y[2k+1+im] + Y[2k+im] + Y[2k-1+im]) 4^k \tag{5.30}$$

Note that $Y_{-1} = 0$. Following the radix-4 modified Booth encoding, we can further divide $\tilde{Y}_i$ into $m/2$ groups by grouping every 2 bits. Because the range of $-2Y[2k+1+im] + Y[2k+im] + Y[2k-1+im]$ is $[-2, 2]$, we can use it to generate a partial product. Consequently, the product $X \tilde{Y}_i$ has $m/2$ partial products, whereas $X Y_i$ has $m/2 + 1$ partial products. Moreover, the range of $\tilde{Y}_i$ falls in $[-\frac{r}{2}, \frac{r}{2}]$ as shown next:

$$Y_i \in [0, 2^m - 1], \qquad Y_i - Y_i[m-1] r \in [-2^{m-1}, 2^{m-1} - 1]$$
$$\tilde{Y}_i = Y_i - Y_i[m-1] r + Y_{i-1}[m-1] \in [-2^{m-1}, 2^{m-1}] = [-\tfrac{r}{2}, \tfrac{r}{2}] \tag{5.31}$$

We can further compute the range of $\sum_{k=0}^{i} \tilde{Y}_k r^k$; the per-digit correction terms telescope, leaving only the most significant scanned bit:

$$\sum_{k=0}^{i} \tilde{Y}_k r^k = \sum_{k=0}^{i} Y_k r^k - Y[(i+1)m-1] \, r^{i+1} \in [-2^{(i+1)m-1}, 2^{(i+1)m-1}) = [-\tfrac{r^{i+1}}{2}, \tfrac{r^{i+1}}{2}) \tag{5.32}$$

When $(i+1)m - 1 \geq N$, we have $Y[(i+1)m-1] = 0$, so that

$$Y = \sum_{k=0}^{\lceil N/m \rceil - 1} Y_k r^k = \sum_{k=0}^{\lceil (N+1)/m \rceil - 1} \tilde{Y}_k r^k \tag{5.33}$$

The recoding (5.30) and the identity (5.33) can be checked with the short sketch below.
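A minimal sketch, assuming even $m$ and the digit convention of (5.30); booth_digit and booth_digits are illustrative names, and the final assertion checks (5.33) on an example.

    def booth_digit(Yi, msb_prev, m):
        """Ytilde_i = Y_i - Y_i[m-1]*2^m + Y_{i-1}[m-1], in [-2^(m-1), 2^(m-1)]."""
        return Yi - ((Yi >> (m - 1)) & 1) * (1 << m) + msb_prev

    def booth_digits(Y, N, m):
        digits, msb_prev = [], 0                  # Y_{-1} = 0
        for i in range(-(-(N + 1) // m)):         # ceil((N+1)/m) digits
            Yi = (Y >> (i * m)) & ((1 << m) - 1)
            digits.append(booth_digit(Yi, msb_prev, m))
            msb_prev = (Yi >> (m - 1)) & 1
        return digits

    Y, N, m = 0xB6D, 12, 4
    assert sum(d << (k * m) for k, d in enumerate(booth_digits(Y, N, m))) == Y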
Since $Z^{(i)} = (Z^{(i-1)} + X \tilde{Y}_i r^2 + \tilde{q}^{(i-1)} M)/r$, we can unfold the expression until $Z^{(-1)} = 0$ as shown below:

$$Z^{(i)} r^{i+1} = X r^2 \sum_{k=0}^{i} \tilde{Y}_k r^k + M \sum_{k=0}^{i} \tilde{q}^{(k-1)} r^k = X r^2 \sum_{k=0}^{i} \tilde{Y}_k r^k + M r^2 \sum_{k=0}^{i-2} \tilde{q}^{(k+1)} r^k$$
$$Z^{(i)} r^{i-1} = X \sum_{k=0}^{i} \tilde{Y}_k r^k + M \sum_{k=0}^{i-2} \tilde{q}^{(k+1)} r^k \tag{5.34}$$

Note that because $Z^{(-1)} = 0$, we have $\tilde{q}^{(0)} = \tilde{q}^{(-1)} = 0$. Let $r^{-(i-1)}$ denote the modular multiplicative inverse of $r^{i-1}$ for $i = 0, 1, \cdots, d-1$, i.e., $r^{i-1} r^{-(i-1)} \equiv 1 \bmod M$. The relationship between $Z^{(i)}$ and $X (\sum_{k=0}^{i} \tilde{Y}_k r^k) r^{-(i-1)}$ can be derived as follows:

$$Z^{(i)} r^{i-1} = X \sum_{k=0}^{i} \tilde{Y}_k r^k + M \sum_{k=0}^{i-2} \tilde{q}^{(k+1)} r^k \equiv_M X \sum_{k=0}^{i} \tilde{Y}_k r^k$$
$$Z^{(i)} \equiv_M X \Big( \sum_{k=0}^{i} \tilde{Y}_k r^k \Big) r^{-(i-1)} \tag{5.35}$$

When $i \geq \lceil (N+1)/m \rceil - 1$, the following equality holds:

$$Z^{(i)} r^{i-1} = X \sum_{k=0}^{i} \tilde{Y}_k r^k + M \sum_{k=0}^{i-2} \tilde{q}^{(k+1)} r^k = XY + M \sum_{k=0}^{i-2} \tilde{q}^{(k+1)} r^k \tag{5.36}$$

We can derive upper and lower bounds for $Z^{(i)}$ as follows:

$$Z^{(i)} r^{i-1} < M^2 + \tfrac{2}{3} M (r-1) \sum_{k=0}^{i-2} r^k < M^2 + \tfrac{2}{3} M (r^{i-1} - 1) < (M + \tfrac{2}{3} r^{i-1}) M \;\Rightarrow\; Z^{(i)} < \Big( \tfrac{2}{3} + \frac{M}{r^{i-1}} \Big) M$$
$$Z^{(i)} r^{i-1} \geq 0 - \tfrac{2}{3} M (r-1) \sum_{k=0}^{i-2} r^k > -\tfrac{2}{3} r^{i-1} M \;\Rightarrow\; Z^{(i)} > -\tfrac{2}{3} M > -M \tag{5.37}$$

When the condition $M/r^{i-1} \leq 4/3$ is met, we have $-M < Z^{(i)} < 2M$. Equation (5.38) below simplifies this condition: because $\log_2 3M < N+2$, the condition is met when $N \leq (i-1)m$. Therefore, when we set $d = \lceil N/m \rceil + 2$, we have $Z^{(d-1)} \in (-M, 2M)$, and only one correction, either an addition or a subtraction, is needed to adjust $Z$ back to $[0, M)$. Another scenario is to ensure $M/r^{i-1} \leq 1/3$, which gives $-M < Z^{(i)} < M$; in that case, we should set $d = \lceil (N+2)/m \rceil + 2$.

$$M/r^{i-1} \leq 4/3 \;\Leftrightarrow\; 3M \leq 2^{(i-1)m+2} \;\Leftrightarrow\; \log_2 3M \leq (i-1)m + 2 \tag{5.38}$$

We can also compute the maximum size of any intermediate result $Z^{(i)}$ as follows:

$$Z^{(i)} r^{i-1} = X \sum_{k=0}^{i} \tilde{Y}_k r^k + M \sum_{k=0}^{i-2} \tilde{q}^{(k+1)} r^k < M \frac{r^{i+1}}{2} + \tfrac{2}{3} M (r-1) \sum_{k=0}^{i-2} r^k < M \Big( \frac{r^{i+1}}{2} + \tfrac{2}{3} r^{i-1} \Big)$$
$$Z^{(i)} < M \Big( \frac{r^2}{2} + \tfrac{2}{3} \Big) < 2^{N+2m} \tag{5.39}$$

In other words, independent of the iteration count $i$, any $Z^{(i)}$ can be represented by an $(N+2m+1)$-bit signed number.

5.2.3.5 Computation of $Z^{(i)}$

The last task is to efficiently compute $Z^{(i)} = (Z^{(i-1)} + X \tilde{Y}_i r^2 + \tilde{q}^{(i-1)} M)/r$. This time, it is not safe to add or subtract a multiple of $r$. Our proposed encoding and the modified Booth encoding save half of the partial products and reduce the hardware area. After compression, the partial products are compressed into two signed numbers. In this case, we use the overflow prevention data model shown in figure 2.6 of section 2.2.2.

We again prove the result by mathematical induction. The base case is $Z^{(-1)} = Z_S^{(-1)} + Z_C^{(-1)} + Z_{carry}^{(-1)} = 0$. Assume $Z^{(i-1)} = (Z_S^{(i-1)}, Z_C^{(i-1)}, Z_{carry}^{(i-1)})$ fits the overflow prevention data model; if we show that $Z^{(i)} = (Z_S^{(i)}, Z_C^{(i)}, Z_{carry}^{(i)})$ also fits the data model, then we will have proven that the sum $Z^{(i)}$ is correctly represented.

Figure 5.3 shows the computation of $Z^{(i)}$. Both $\tilde{q}^{(i-1)} M$ and $X \tilde{Y}_i r^2$ generate $m/2$ partial products, and $Z^{(i-1)}$ is represented by $Z_S^{(i-1)}$, $Z_C^{(i-1)}$, and $Z_{carry}^{(i-1)}$. Because $Z^{(i)}$ is an $(N+2m+1)$-bit signed number, $Z^{(i-1)} + X \tilde{Y}_i r^2 + \tilde{q}^{(i-1)} M = Z^{(i)} r$ is an $(N+3m+1)$-bit signed number, which is also a multiple of $r$. To sum all partial products together, the first step is to sign-extend the numbers so that all of them have the same bitwidth. Here, we deliberately extend them to $N+3m+2$ bits in step (i) of figure 5.3. For the sake of brevity, all sign bits are labelled $S$, although different numbers can have different sign values. The sign bits of a number can be converted into 1's followed by an $\bar{S}$ in step (ii), where $\bar{S}$ is the Boolean inversion of $S$; any overflow can be ignored because $Z^{(i)} r$ only has $N+3m+1$ bits.
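The identity behind this conversion is $-s \, 2^{t-1} = \bar{s} \, 2^{t-1} - 2^{t-1}$: each negatively weighted sign bit is replaced by its inversion plus a constant, and the constants from all partial products merge into one precomputed word. The toy demonstration below, with illustrative widths and values of our choosing, checks this in a few lines.

    # Sign bits replaced by their inversions; constants merged into one prefix
    # word (wrap-around beyond the W-bit field is deliberately ignored).
    W, t = 16, 8                              # field width and operand width
    nums = [-37, 100, -3, 55]                 # example signed partial products
    prefix = (len(nums) * -(1 << (t - 1))) & ((1 << W) - 1)
    folded = []
    for x in nums:
        u = x & ((1 << t) - 1)                # two's-complement bit pattern
        s_bar = 1 - (u >> (t - 1))            # inverted sign bit
        folded.append((u & ((1 << (t - 1)) - 1)) | (s_bar << (t - 1)))
    assert (sum(folded) + prefix) & ((1 << W) - 1) == sum(nums) & ((1 << W) - 1)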
We further merge all constant 1's into a number called the prefix in step (iii), again ignoring any overflow.

[Figure 5.3: Computing $Z^{(i)}$]

Because the prefix is a constant that depends only on $m$, the real implementation can skip steps (i) and (ii) and start at step (iii). When $m \geq 4$, we can prove that the maximum value of the dashed polygon in step (iii) is less than $2^{N+3m+1}$, as shown below:

$$\Big( \sum_{k=0}^{m/2-1} (2^N - 1) 4^k \Big) + \Big( \sum_{k=0}^{m/2-1} (2^N - 1) 4^k \Big) 2^{2m} + 2 (2^{N+2m+1} - 1) + 1 + 3 \cdot 2^{N+3m-3} < 2^{N+3m+1} \tag{5.40}$$

Since the dashed polygon is less than $2^{N+3m+1}$, the compression of the $m+3$ numbers results in two $(N+3m+1)$-bit unsigned numbers $Z_S^{(i)}$ and $Z_C^{(i)}$ in step (iv), regardless of the compression strategy. Note that $Z_S[N+3m]$ and $Z_C[N+3m]$ cannot both be 1; otherwise $Z_S^{(i)} + Z_C^{(i)}$ would exceed $2^{N+3m+1}$. As a result, the sum of $Z_S[N+3m]$, $Z_C[N+3m]$, and the leading 2 bits of the prefix is a 2-bit number whose two bits share the same value, namely $Z_S[N+3m]$ XNOR $Z_C[N+3m]$. Since $Z_S[N+3m+1]$ is always the same as $Z_S[N+3m]$, we can hide $Z_S[N+3m+1]$ and $Z_C[N+3m+1]$ in a hardware implementation. Because $Z^{(i-1)} + X \tilde{Y}_i r^2 + \tilde{q}^{(i-1)} M \equiv_r 0$, in step (v) the operation $(Z^{(i-1)} + X \tilde{Y}_i r^2 + \tilde{q}^{(i-1)} M)/r$ can be simplified as shown below:

$$Z_{carry}^{(i)} = Z_S^{(i)}[m-1] \text{ OR } Z_C^{(i)}[m-1], \qquad Z_S^{(i)} = Z_S^{(i)} \gg m, \qquad Z_C^{(i)} = Z_C^{(i)} \gg m \tag{5.41}$$

Because $Z^{(i)} = (Z_S^{(i)}, Z_C^{(i)}, Z_{carry}^{(i)})$ fits the overflow prevention data model, we correctly represent $Z^{(i)}$ without performing an addition.

When $m < 4$, we can sign-extend all numbers to $N+3m+3$ bits in step (i) and prove that the dashed polygon is less than $2^{N+3m+2}$ in step (iii). The remaining proofs are similar to the ones above and are omitted for brevity.

5.2.3.6 Block-Level Diagram

Figure 5.4 shows the block-level diagram of the hardware implementation of the proposed Montgomery modular multiplier. With the help of Encode, EncodeSN, and compression, the critical paths to update $(q_S, q_C, q_{carry})$ and $(Z_S, Z_C, Z_{carry})$ correspond to a 2-bit addition plus a compression of $m+3$ numbers. The critical path of the proposed Montgomery MM thus lies in the addition module for the final summation and correction step. Considering the trade-off between time and area, we use a high-performance logarithmic Brent-Kung adder (BKA) [14] and specify a 3-cycle timing constraint. Note that we need to perform one final summation and at most one correction; we only use the BKA twice in 6 consecutive cycles, which justifies the 3-cycle timing constraint.

The control signal CT controls the data flow of the BKA. When CT = 0, we compute $Z_S + Z_C + Z_{carry}$ and store the result in $Z$. When CT = 1, we check the sign of $Z$. If $Z < 0$, we compute $Z + M$. If $Z \geq 0$, we compute $Z - M$; if this result is negative, no correction is needed and we simply keep the original $Z$, otherwise the (positive) result is stored into $Z$. In conclusion, the total clock cycle count of the Montgomery MM is only

$$\text{CycleCount} = d + 6 \tag{5.42}$$

where $d = \lceil N/m \rceil + 2$ is the number of iterations. The summation-and-correction dataflow is summarized by the small model below.
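A behavioral model of the final summation and correction step, assuming only that $Z^{(d-1)} = Z_S + Z_C + Z_{carry}$ lies in $(-M, 2M)$ as derived in section 5.2.3.4; the function name is ours and the model abstracts away the BKA timing.

    def finalize(Z_S, Z_C, Z_carry, M):
        """Final BKA passes: one summation (CT = 0), at most one correction (CT = 1)."""
        Z = Z_S + Z_C + Z_carry          # first pass through the adder
        if Z < 0:                        # second pass: add or subtract M once
            Z += M
        elif Z - M >= 0:
            Z -= M
        return Z                         # result lies in [0, M)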
[Figure 5.4: Block-level diagram of the hardware implementation of the proposed iterative Montgomery modular multiplication]

5.3 Full-word Montgomery Modular Multiplication

5.3.1 Classic Full-word Montgomery Modular Multiplication

Algorithm 23 Full-word Montgomery modular multiplication
Input: $X, Y \in [0, M)$, odd $M < 2^N$, $R = 2^N$, $RR^{-1} - MM^{-1} = 1$
Output: $Z \in [0, M)$
1: $T = XY$
2: $q = (T \bmod R) M^{-1} \bmod R$
3: $Z = (T + qM)/R$
4: if $Z > M$ then
5:   $Z = Z - M$
6: end if
7: return $Z$

Full-word Montgomery MM is the variant that can be implemented in a pipelined architecture. Algorithm 23 shows the classic algorithm, in which we have three multiplications in series. Based on Bézout's identity, we can find $R^{-1} \in (0, M)$ and $M^{-1} \in (0, R)$ such that $RR^{-1} - MM^{-1} = 1$. It is easy to verify that $T + qM$ is a multiple of $R$, so that the division by $R$ becomes a constant shift:

$$T + qM = T + ((T \bmod R) M^{-1} \bmod R) M \equiv_R T + T M^{-1} M = T R R^{-1} \equiv_R 0 \tag{5.43}$$

Consequently, the resulting $Z = (T + qM)/R$ is in the range $[0, 2M)$:

$$Z = \frac{T + qM}{R} = \frac{XY + qM}{R} < \frac{M^2 + MR}{R} = \Big( 1 + \frac{M}{R} \Big) M < 2M \tag{5.44}$$

We can prove $Z \equiv XY R^{-1} \bmod M$ by rewriting $q = (T \bmod R) M^{-1} \bmod R$ as $q = T M^{-1} + tR$ with $t \in \mathbb{Z}$:

$$Z = \frac{T + qM}{R} = \frac{T + (T M^{-1} + tR) M}{R} = \frac{XY (1 + M M^{-1})}{R} + tM = XY R^{-1} + tM \equiv_M XY R^{-1} \tag{5.45}$$

In algorithm 23, the expression $q = (T \bmod R) M^{-1} \bmod R$ can be simplified so that we only compute the least significant $N$ bits of the $N \times N$ unsigned multiplication of $T \bmod R$ and $M^{-1}$. The expression $T + qM$, however, requires a complete $N \times N$ unsigned multiplication before the division by $R$ can be replaced by a shift. In contrast, we propose a full-word Montgomery MM that truncates half of these bits so as to reduce area.

5.3.2 New Full-word Montgomery Modular Multiplication

Algorithm 24 shows the basic version that we are going to optimize. An interested reader can easily prove its correctness by following the proof of algorithm 23; a functional model is also given below. We deliberately change $R$ from $2^N$ to $2^{N+2}$ just to ensure $Z \in (-M, M)$ in the advanced version (algorithm 25) of our proposed full-word Montgomery MM; consequently, the summation and correction can be implemented in parallel. Note that we could keep $R = 2^N$ and prove $Z \in (-M, 2M)$, but that would require implementing the summation and correction in series.

Algorithm 24 Full-word Montgomery modular multiplication: basic version
Input: $X, Y \in [0, M)$, odd $M < 2^N$, $R = 2^{N+2}$, $RR^{-1} - MM^{-1} = 1$
Output: $Z \in [0, M)$
1: $T = XY$
2: $q = (T \bmod R) M^{-1} \bmod R$
3: $Z = (T + qM)/R$
4: if $Z > M$ then
5:   $Z = Z - M$
6: end if
7: return $Z$

Without a doubt, we can replace multiplication and addition with compression and encoding, which greatly reduces the critical path. Moreover, we can truncate half of the bits before the compression of $\tilde{q} M$, rather than performing the complete multiplication $qM$ before the division by $R$ as discussed for algorithm 23. Consequently, our proposed full-word Montgomery modular multiplication is very efficient.
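A minimal functional model of algorithm 24 (equally valid for algorithm 23 with $R = 2^N$), assuming Python big integers. Here M_prime plays the role of the document's $M^{-1}$, since $M M^{-1} \equiv -1 \pmod{R}$; the function name is ours.

    def montgomery_fullword(X, Y, M, N):
        """Return Z = X*Y*R^{-1} mod M with R = 2^(N+2), per algorithm 24."""
        R = 1 << (N + 2)
        M_prime = pow(-M, -1, R)        # M * M_prime == -1 (mod R)
        T = X * Y                       # line 1
        q = (T % R) * M_prime % R       # line 2
        Z = (T + q * M) >> (N + 2)      # line 3: exact division by R
        return Z - M if Z >= M else Z   # lines 4-6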
Algorithm 25 Full-word Montgomery modular multiplication: advanced version
Input: $X, Y \in [0, M)$, odd $M < 2^N$, $R = 2^{N+2}$, $RR^{-1} - MM^{-1} = 1$
Input: truncation position $k$
Output: $Z \in [0, M)$
1: $(T_S, T_C) = \text{Compress}(XY)$
2: $\tilde{T}_L = \text{Encode}(T_S[N+1:0], T_C[N+1:0])$
3: $(q_S, q_C) = \text{CompressLower}(\tilde{T}_L M^{-1}, N+2)$
4: $\tilde{q} = \text{Encode}(q_S, q_C)$
5: $(Z_S, Z_C) = \text{CompressUpper}(T_S + T_C + \tilde{q} M, k)$
6: $2d[1] + d[0] = Z_S[N+1] + Z_C[N+1] + Z_S[N]$ AND $Z_C[N]$
7: $Z = Z_S[2N+2:N+2] + Z_C[2N+2:N+2] + d[1] + d[0]$
8: if $Z < 0$ then
9:   $Z = Z + M$
10: end if
11: return $Z$

5.3.2.1 Encode for $\tilde{T}_L$

As shown in algorithm 25, the first step is to compress the partial products of $XY$ into two signed numbers $T_S$ and $T_C$. Following the overflow correction or overflow prevention data models, $T_S$ and $T_C$ only need to be $2N+2$ bits because $T = XY < M^2 < 2^{2N}$. Next, we truncate the least significant $N+2$ bits of $T_S$ and $T_C$ and encode them into $\tilde{T}_L = \text{Encode}(T_S[N+1:0], T_C[N+1:0])$. From the discussion of the iterative Montgomery MM, we know that the difference between $\tilde{T}_L$ and $T \bmod R$ must be a multiple of $R = 2^{N+2}$, but the resulting $\tilde{T}_L$ allows us to generate only $N/2+1$ partial products for $\tilde{T}_L M^{-1}$. Since the expression $q = (T \bmod R) M^{-1} \bmod R$ contains a modular operation, we only need the least significant $N+2$ bits of the partial products, and we compress them into two $(N+2)$-bit unsigned numbers $q_S$ and $q_C$ after ignoring any overflow.

5.3.2.2 Error Analysis

At the beginning of the Montgomery section, we observed that the choice of $q$ is unique only up to a multiple of $R$. We thus choose to encode $\hat{q} = \text{EncodeSN}(q_S, q_C)$, based on the observation that the difference between $\hat{q}$ and $q_S + q_C$ is a multiple of $R$. Figure 5.5 illustrates the procedure of EncodeSN. We can easily conclude that the range of $\hat{q}$ is $[-0.5R, 0.625R)$ and the corresponding sum is an $(N+3)$-bit signed value. The range of $Z = (T_S + T_C + \hat{q} M)/R$ is $(-M, M)$:

$$Z < \frac{M^2 + 0.625 M R}{R} = \Big( 0.625 + \frac{M}{R} \Big) M < 0.875 M < M$$
$$Z > \frac{0 - 0.5 M R}{R} = -0.5 M > -M \tag{5.46}$$

[Figure 5.5: EncodeSN for $(N+2)$-bit $q_S$ and $q_C$]

$\hat{q}$ is only a temporary variable used to compute the range of the final result. Instead, we actually compute $\tilde{q} = \text{Encode}(q_S, q_C)$ to generate only $N/2+1$ partial products. Following the discussion around equation (5.21), we have $\hat{q} = \tilde{q}$, so that $Z = (T_S + T_C + \tilde{q} M)/R$ is also an $(N+1)$-bit signed number in the range $(-M, M)$.

5.3.2.3 Optimization for $T_S + T_C + \tilde{q} M$

[Figure 5.6: Computing $T_S + T_C + \tilde{q} M$ by truncated compression]

The last step is to show correctness when we truncate $k$ bits of the partial products of $T_S + T_C + \tilde{q} M$ before compression. Figure 5.6 presents the procedure of truncated compression. After truncating the least significant $k$ bits in step (i), we compress all remaining bits into two numbers $Z_S$ and $Z_C$ in step (ii). Here we compress the least significant $2N+3$ bits because $T_S + T_C + \tilde{q} M = ZR$ is a $(2N+3)$-bit signed number. We can virtually add the truncated bits into a number called $U$ in step (ii); the range of $U$ is $[0, (\lceil k/2 \rceil + 2) 2^k)$. The key observation in step (ii) is that $Z_S + Z_C + U$ must equal $T_S + T_C + \tilde{q} M$, so it has to be a multiple of $R$.

In step (iii), the addition $Z_S[N+1:N] + Z_C[N+1:N]$ results in a 3-bit value; let us define its leading two bits as $d_1$ and $d_0$. Recall that $Z_S[N+1:0] + Z_C[N+1:0] + U$ is a multiple of $R$.
If we can guarantee $Z_S[N+1:0] + Z_C[N+1:0] + U < R$ when $d_0 = 0$, then we automatically have $Z_S[N+1:0] + Z_C[N+1:0] + U = 0$. Meanwhile, we also have $Z_S[N+1:0] + Z_C[N+1:0] + U < 2R$ when $d_0 = 1$, because the difference between the $d_0 = 1$ and $d_0 = 0$ cases is $2^{N+1} = 0.5R$. Since the quantity is a multiple of $R$ and $d_0 = 1$, we must have $Z_S[N+1:0] + Z_C[N+1:0] + U = R$. Therefore:

$$Z_S[N+1:0] + Z_C[N+1:0] + U = d_0 R$$
$$\frac{Z_S + Z_C + U}{R} = Z_S[2N+2:N+2] + Z_C[2N+2:N+2] + d_1 + d_0 \tag{5.47}$$

As a result, the number of truncated bits $k$ should meet the following constraint:

$$2^{N+1} + 2^N + \lceil k/2 \rceil 2^k \leq R = 2^{N+2} \;\rightarrow\; \lceil k/2 \rceil 2^k \leq 2^N \tag{5.48}$$

For example, we have $k = 122$ when $N = 128$, which saves almost half the area during the computation of $Z = (T_S + T_C + \tilde{q} M)/R$.

5.3.2.4 Block-level Diagram

Figure 5.7 shows the block-level diagram of the 4-stage pipelined hardware implementation of the proposed full-word Montgomery MM algorithm. Instead of relying on multiplications, we reduce the critical path to one encoding and one compression. With the help of EDA tools such as Design Compiler, advanced logic optimization and retiming are possible, and the number of stages may be varied to meet timing constraints.

[Figure 5.7: Block-level diagram of the 4-stage pipelined implementation of the proposed full-word Montgomery modular multiplication]

5.4 Scalable Montgomery Modular Multiplication

All modular multiplication algorithms discussed so far (either Barrett MM or Montgomery MM) assume that the largest modulus size is a predetermined parameter $N$, such as $N = 128$ or $N = 1024$. Unlike regular multiplication, which can be decomposed into a collection of small-size multiplications, conventional MM designs are in general non-decomposable. For example, we cannot use a conventional 1024-bit MM to perform a 2048-bit MM. Fortunately, there is a branch of Montgomery MM, called scalable Montgomery MM [15,17], which is able to perform Montgomery MM in a scalable manner, i.e., with arbitrary operand size using "fixed" hardware resources. Scalable Montgomery MM can thus support a variable modulus size $N$. A carefully designed processing element (PE) implements each iteration of the iterative Montgomery MM over many cycles. Since we can start the computation of the $i$-th iteration as soon as we complete the computation of the least significant $m$ bits of $Z^{(i-1)}$ in the $(i-1)$-st iteration, we can instantiate multiple PEs and make them work in parallel. A PE can even be reused once it finishes its assigned iteration. Consequently, with the help of a memory management unit to buffer intermediate results, we can use a fixed number of PEs to perform an arbitrary-size MM.

5.4.1 Classic Scalable Montgomery Modular Multiplication

Based on the iterative Montgomery MM, reference [17] presented a radix-2 scalable Montgomery MM. Originally, the authors described their design as a multiple-word radix-2 Montgomery multiplication (MWR2MM). To use a consistent naming convention throughout this article, we refer to it as the scalable radix-2 Montgomery MM.

5.4.1.1 Iterative Radix-2 Montgomery Modular Multiplication

Algorithm 26 shows the iterative Montgomery MM with $m = 1$. The inputs $X$ and $Y$ and the output $Z$ all lie in the range $[0, 2M)$; a direct software model follows.

Algorithm 26 Radix-2 Montgomery modular multiplication
Input: $X, Y \in [0, 2M)$, odd $M < 2^N$, $d = N+2$, $R = 2^{N+2}$
Output: $Z \in [0, 2M)$
1: $Z^{(-1)} = 0$
2: for $i = 0$ to $d-1$ do
3:   $Z^{(i)} = Z^{(i-1)} + X Y[i]$
4:   $q^{(i)} = Z^{(i)} \bmod 2 = Z^{(i)}[0]$
5:   $Z^{(i)} = (Z^{(i)} + q^{(i)} M)/2$
6: end for
7: return $Z = Z^{(d-1)}$
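A direct software sketch of algorithm 26, assuming big-integer arithmetic; note that no final correction is performed, so the result stays in $[0, 2M)$. The function name is illustrative.

    def montgomery_radix2(X, Y, M, N):
        """Return Z = X*Y*2^{-(N+2)} (mod M) with X, Y, Z in [0, 2M)."""
        Z = 0
        for i in range(N + 2):            # d = N + 2 iterations
            Z = Z + X * ((Y >> i) & 1)    # line 3
            q = Z & 1                     # line 4: parity selects the reduction
            Z = (Z + q * M) >> 1          # line 5: exact division by 2
        return Z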
When we must perform multiple modular multiplications back to back, as in modular exponentiation, this setup allows us to start the next modular multiplication without performing the correction step. Therefore, in the remainder of the scalable Montgomery MM discussion, we present the algorithms for computing $Z = XY R^{-1} \bmod M$ with inputs $X$ and $Y$ and output $Z$ in $[0, 2M)$.

5.4.1.2 Scalable Radix-2 Montgomery Modular Multiplication

Extending algorithm 26, reference [17] presents the scalable radix-2 Montgomery MM (algorithm 27). Each outer loop, which goes through $N+2$ iterations, scans 1 bit of the multiplier $Y$; the $i$-th outer loop iteration is equivalent to computing the result of the $i$-th iteration of the iterative Montgomery MM. Since the bitwidth $N$ is variable, the scalable Montgomery MM uses an inner loop, which iterates $e$ times, to scan a word of $w \geq 2$ bits of the multiplicand $X$ in each iteration. Notice that $e$ is equal to $\lceil (N+2)/w \rceil + 1$ in binary form (and $\lceil (N+1)/w \rceil + 1$ in carry-save form [10,12]) when $X, Y \in [0, 2M)$.

Algorithm 27 Scalable radix-2 Montgomery modular multiplication
Input: $X, Y \in [0, 2M)$, odd $M < 2^N$, $R = 2^{N+2}$, $w \geq 2$, $e = \lceil (N+2)/w \rceil + 1$
Output: $Z \in [0, 2M)$
1: $Z = 0$
2: for $i = 0$ to $N+1$ do {// Outer loop}
3:   $C_a = C_b = 0$
4:   for $j = 0$ to $e-1$ do {// Inner loop}
5:     $(C_a, Z_j) = Z_j + X_j Y[i] + C_a$
6:     if $j == 0$ then
7:       $q = Z_0[0]$
8:     end if
9:     $(C_b, Z_j) = Z_j + q M_j + C_b$
10:    $Z_{j-1} = \text{Concatenate}(Z_j[0], Z_{j-1}[w-1:1])$
11:  end for
12:  $Z_{e-1} = Z_{-1} = 0$
13: end for
14: return $Z$

We compute $q$ from the least significant bit (LSB) of $Z_0$ in the 0-th inner loop iteration and use it in the remaining inner loop iterations. The $w$-bit additions in lines 5 and 9 result in a $(w+1)$-bit integer; we assign the most significant bit to $C_a$ or $C_b$ and the remaining $w$ least significant bits to $Z_j$. Finally, because $Z$ must be right-shifted by 1 bit per outer iteration (the division by 2 in line 5 of algorithm 26), the new $Z_{j-1}$ in line 10 is calculated as the concatenation of $Z_j[0]$ and $Z_{j-1}[w-1:1]$. Note that algorithm 27 also works for the case $w = 1$; the only change is to modify line 10 to read $Z_{j-1} = Z_j[0]$.

Once we finish the first inner loop iteration of the $i$-th outer loop iteration (i.e., $i$, $j = 1$), we have computed $Z_0$, which is required for the 0-th inner loop iteration of the $(i+1)$-st outer loop iteration (i.e., $i+1$, $j = 0$). Therefore, if we use $N+2$ processing elements (PEs) to execute the outer loop iterations (one PE per outer loop iteration), we can start the next PE two cycles after the current one. Exploiting this observation, we can design multiple-PE hardware with an issue latency ($L$) of 2. The clock period of a PE is lower-bounded by two $w$-bit additions in binary form (lines 5 and 9).

Reference [17] presented a radix-2 scalable Montgomery MM with an issue latency of $L = 2$. Later references such as [18,19,26] improved the radix-2 scalable Montgomery MM by reducing the issue latency to one ($L = 1$) using elaborate methods. In spite of this low issue latency, the problem remains that, with a radix-2 design, we can only issue one bit of the multiplier every $L$ cycles. As a result, we need at least $LN$ cycles to finish an $N$-bit modular multiplication. The word-serial behavior of algorithm 27 is captured by the following model.
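A bit-exact software model of algorithm 27, a sketch assuming $w \geq 2$ and list-based word storage (our naming). The guard word in $e = \lceil (N+2)/w \rceil + 1$ makes line 12 valid, because the shifted sum never reaches the top word.

    def scalable_radix2(X, Y, M, N, w):
        """Word-serial model of algorithm 27: X*Y*2^{-(N+2)} (mod M) in [0, 2M)."""
        e = (N + 2 + w - 1) // w + 1                 # e = ceil((N+2)/w) + 1
        wmask = (1 << w) - 1
        Zw = [0] * e
        Xw = [(X >> (w * j)) & wmask for j in range(e)]
        Mw = [(M >> (w * j)) & wmask for j in range(e)]
        for i in range(N + 2):                       # outer loop: one bit of Y
            Ca = Cb = q = 0
            Yi = (Y >> i) & 1
            for j in range(e):                       # inner loop: one word per cycle
                t = Zw[j] + Xw[j] * Yi + Ca          # line 5
                Ca, Zj = t >> w, t & wmask
                if j == 0:
                    q = Zj & 1                       # line 7
                t = Zj + q * Mw[j] + Cb              # line 9
                Cb, Zj = t >> w, t & wmask
                Zw[j] = Zj
                if j > 0:                            # line 10: 1-bit right shift
                    Zw[j - 1] = ((Zj & 1) << (w - 1)) | (Zw[j - 1] >> 1)
            Zw[e - 1] = 0                            # line 12: guard word stays zero
        return sum(z << (w * j) for j, z in enumerate(Zw))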
This deficiency cannot be overcome no matter how many PEs are instantiated. When we switch the scalable Montgomery MM from radix-2 to radix-$2^m$, meaning that we issue $m$ bits of the multiplier in each iteration, the number of cycles needed to perform the full modular multiplication is cut down to approximately $1/m$ of the radix-2 count. However, multiplication cannot be easily eliminated when $m > 1$. This is a major concern, since multiplication is a complex arithmetic operation that requires large chip area and circuit delay in any hardware realization. Consequently, although the number of cycles required to finish the $N$-bit modular multiplication is greatly reduced, the time latency (in, say, nanoseconds) can still be large because the clock period of the circuit tends to be long.

5.4.2 New Scalable Montgomery Modular Multiplication: L1

5.4.2.1 Transfer from Iterative to L2 Scalable

Algorithm 28 Radix-$2^m$ Montgomery modular multiplication for $[0, 2M)$
Input: $X, Y \in [0, 2M)$, odd $M < 2^N$, $r = 2^m$
Input: $d = \lceil (N+2)/m \rceil + 2$, $R = r^{\lceil (N+2)/m \rceil}$, $r^2 r^{-1} - MM'' = 1$
Output: $Z \in [0, 2M)$
1: $Z^{(-1)} = 0$, $q^{(-1)} = 0$
2: for $i = 0$ to $d-1$ do
3:   $g^{(i)} = (Z^{(i-1)} \bmod r^2) M'' \bmod r^2$
4:   $q^{(i)} = (g^{(i)} - q^{(i-1)})/r$
5:   $Z^{(i)} = (Z^{(i-1)} + X Y_i r^2 + q^{(i-1)} M)/r$
6: end for
7: $Z = Z^{(d-1)}$
8: return $Z$

In our prior contribution [15], we proposed many optimizations that replace multiplication with compression to support a parameterized $m$. However, that design does not break the data dependency of computing $q$ first and $Z$ next, and it cannot reduce the number of partial products, due to the lack of overflow prevention or correction. Here, we propose a new scalable Montgomery MM, derived from our proposed radix-$2^m$ iterative Montgomery MM. The new design has all of the following advantages: 1) parallel computation of $q$ and $Z$; 2) replacement of multiplication and addition with compression and encoding; 3) support for variable $N$. Consequently, we reduce both computation latency and hardware area. Algorithm 28 shows the iterative Montgomery MM when both inputs $X$ and $Y$ and output $Z$ are in the range $[0, 2M)$, where we need $d = \lceil (N+2)/m \rceil + 2$ iterations.

We can implement each iteration of algorithm 28 over many cycles rather than one cycle, as shown in algorithm 29. Define $X' = X r^2$. We scan a $w$-bit word of $X'$ in each iteration of the inner loop and need $e = \lceil (N+2m+1)/w \rceil$ cycles to finish $X Y_i r^2$, since $X$ is an $(N+1)$-bit unsigned number in the range $[0, 2M)$. Any intermediate result $Z^{(i)}$ in the $i$-th outer loop iteration is represented by words $ZM_j^{(i)}$ and $ZR_j^{(i)}$ ($0 \leq j \leq e-1$) as below. Each word $ZM_j^{(i)}$ is a $w$-bit word whose least significant $w-m$ bits are 0; each word $ZR_j^{(i)}$ is a $(w+2)$-bit word.
$$ZM^{(i)} = \sum_{j=0}^{e-1} ZM_j^{(i)} 2^{wj}, \qquad ZR^{(i)} = \sum_{j=0}^{e-1} ZR_j^{(i)} 2^{wj}$$
$$Z^{(i)} = \sum_{j=0}^{e-1} (ZM_j^{(i)} + ZR_j^{(i)}) 2^{wj} = ZM^{(i)} + ZR^{(i)} \tag{5.49}$$

Algorithm 29 L2 scalable radix-$2^m$ Montgomery modular multiplication: basic version
Input: $X, Y \in [0, 2M)$, odd $M < 2^N$, $r = 2^m$, $w \geq 3m$, $e = \lceil (N+2m+1)/w \rceil$
Input: $d = \lceil (N+2)/m \rceil + 2$, $R = r^{\lceil (N+2)/m \rceil}$, $r^2 r^{-1} - MM'' = 1$
Output: $Z \in [0, 2M)$
1: $ZM^{(-1)} = 0$, $ZR^{(-1)} = 0$, $q^{(-1)} = 0$
2: for $i = 0$ to $d-1$ do {// Outer loop}
3:   for $j = 0$ to $e-1$ do {// Inner loop}
4:     if $j == 0$ then
5:       $g^{(i)} = ZR_0^{(i-1)} M'' \bmod r^2$
6:       $q^{(i)} = (g^{(i)} - q^{(i-1)})/r$
7:     end if
8:     $T_j^{(i)} = ZM_j^{(i-1)} + ZR_j^{(i-1)} + X'_j Y_i + q^{(i-1)} M_j$
9:     $ZR_j^{(i)} = T_j^{(i)}[w+m+1:m]$
10:    $ZM_{j-1}^{(i)} = T_j^{(i)}[m-1:0] \, 2^{w-m}$
11:  end for
12:  $ZM_{e-1}^{(i)} = 0$
13: end for
14: $Z = \sum_{j=0}^{e-1} (ZM_j^{(d-1)} + ZR_j^{(d-1)}) 2^{wj}$
15: return $Z$

The only thing we need to prove is that $Z^{(i)}$ (at the end of the $i$-th iteration of algorithm 28) equals $(\sum_{j=0}^{e-1} T_j^{(i)} 2^{wj})/r$, and that we can represent $Z^{(i)}$ as $Z^{(i)} = ZM^{(i)} + ZR^{(i)}$ under the assumption $Z^{(i-1)} = ZM^{(i-1)} + ZR^{(i-1)}$; the rest of the proof is the same as for the iterative Montgomery MM. We proceed by mathematical induction. The base case is $Z^{(-1)} = ZM^{(-1)} + ZR^{(-1)} = 0$. In the $j$-th iteration of the inner loop, we compute $T_j^{(i)} = ZM_j^{(i-1)} + ZR_j^{(i-1)} + X'_j Y_i + q^{(i-1)} M_j$. Because of the division by $r$, we decompose $T_j^{(i)}$ into two parts: the least significant $m$ bits and the rest. When $w \geq m$, $Z^{(i)} r \bmod r = \sum_{j=0}^{e-1} T_j^{(i)} 2^{wj} \bmod r = T_0^{(i)} \bmod r$ must be 0. We can then express $(\sum_{j=0}^{e-1} T_j^{(i)} 2^{wj})/r$ as below, where $ZM_{e-1}^{(i)} = 0$. Consequently, we obtain $ZR_j^{(i)}$ and $ZM_{j-1}^{(i)}$ immediately once we compute $T_j^{(i)}$, by assigning $ZR_j^{(i)} = T_j^{(i)}[w+m+1:m]$ and $ZM_{j-1}^{(i)} = T_j^{(i)}[m-1:0] \, 2^{w-m}$.

$$Z^{(i)} = \frac{Z^{(i-1)} + X Y_i r^2 + q^{(i-1)} M}{r} = \frac{\sum_{j=0}^{e-1} (ZM_j^{(i-1)} + ZR_j^{(i-1)} + X'_j Y_i + q^{(i-1)} M_j) 2^{wj}}{r} = \frac{\sum_{j=0}^{e-1} T_j^{(i)} 2^{wj}}{r}$$
$$= \sum_{j=0}^{e-1} \big\lfloor T_j^{(i)}/r \big\rfloor 2^{wj} + \frac{\sum_{j=1}^{e-1} (T_j^{(i)} \bmod r) 2^{wj}}{r} = \sum_{j=0}^{e-1} \big\lfloor T_j^{(i)}/r \big\rfloor 2^{wj} + \sum_{j=0}^{e-2} \big( (T_{j+1}^{(i)} \bmod r) 2^{w-m} \big) 2^{wj}$$
$$= \sum_{j=0}^{e-1} ZR_j^{(i)} 2^{wj} + \sum_{j=0}^{e-1} ZM_j^{(i)} 2^{wj} \tag{5.50}$$

The next question is the size of $T_j^{(i)}$. The size of $ZM_j^{(i)}$ is exactly $w$ bits, with the least significant $w-m$ bits equal to 0. Assume the size bound of $ZR_j^{(i-1)}$ is $2^{w+x}$. We need to find an integer $x$ such that $T_j^{(i)} = ZM_j^{(i-1)} + ZR_j^{(i-1)} + X'_j Y_i + q^{(i-1)} M_j$ is a $(w+m+x)$-bit number; we can then split $T_j^{(i)}$ to form a $(w+x)$-bit $ZR_j^{(i)}$ and a $w$-bit $ZM_{j-1}^{(i)}$. The inequality below shows that $x = 2$ suffices, so $T_j^{(i)}$ has $w+m+2$ bits:

$$T_j^{(i)} = ZM_j^{(i-1)} + ZR_j^{(i-1)} + X'_j Y_i + q^{(i-1)} M_j \leq (2^m - 1) 2^{w-m} + (2^{w+x} - 1) + (2^w - 1)(2^m - 1) + (2^w - 1)(2^m - 1) < 2^{w+m+x} \tag{5.51}$$

Finally, $Z^{(i-1)} \bmod r^2$ equals $ZR_0^{(i-1)} \bmod r^2$ under the assumption $w \geq 3m$, because the least significant $w-m \geq 2m$ bits of $ZM_0^{(i-1)}$ are 0:

$$Z^{(i-1)} \bmod r^2 = (ZM_0^{(i-1)} + ZR_0^{(i-1)}) \bmod r^2 = ZR_0^{(i-1)} \bmod r^2 \tag{5.52}$$

This observation implies that we can start the computation of $T_0^{(i)}$ of the $i$-th iteration right after we know $ZR_0^{(i-1)}$ from the 0-th inner loop iteration and $ZM_0^{(i-1)}$ from the 1-st inner loop iteration of the $(i-1)$-st iteration. In other words, we can start the $i$-th iteration two inner loop iterations after the start of the $(i-1)$-st iteration. We refer to this as the L2 scalable Montgomery MM, with an issue latency of 2 cycles. The word-level model below reproduces this dataflow in software.
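A word-level Python model of algorithm 29, a sketch assuming $w \geq 3m$ and our own variable names; it mirrors the ZM/ZR word split and agrees with the iterative model given earlier.

    def scalable_l2(X, Y, M, N, m, w):
        """Model of algorithm 29: result = X*Y*R^{-1} (mod M) in [0, 2M)."""
        r = 1 << m
        d = (N + 2 + m - 1) // m + 2              # d = ceil((N+2)/m) + 2
        e = (N + 2 * m + w) // w                  # e = ceil((N+2m+1)/w)
        M2 = pow(-M, -1, r * r)                   # M'' with r^2*r^{-1} - M*M'' = 1
        wmask = (1 << w) - 1
        Xw = [((X * r * r) >> (w * j)) & wmask for j in range(e)]  # words of X'
        Mw = [(M >> (w * j)) & wmask for j in range(e)]
        ZM, ZR, q = [0] * e, [0] * e, 0
        for i in range(d):                        # outer loop
            Yi = (Y >> (i * m)) & (r - 1)
            g = (ZR[0] % (r * r)) * M2 % (r * r)  # line 5: q needs only ZR_0
            q_next = (g - q) // r                 # line 6: exact division
            for j in range(e):                    # inner loop
                T = ZM[j] + ZR[j] + Xw[j] * Yi + q * Mw[j]   # line 8
                ZR[j] = T >> m                    # line 9
                if j > 0:
                    ZM[j - 1] = (T & (r - 1)) << (w - m)     # line 10
            ZM[e - 1] = 0                         # line 12
            q = q_next
        return sum((ZM[j] + ZR[j]) << (w * j) for j in range(e))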
5.4.2.2 Transfer from L2 Scalable to L1 Scalable

Algorithm 30 L1 scalable radix-$2^m$ Montgomery modular multiplication: basic version
Input: $X, Y \in [0, 2M)$, odd $M < 2^N$, $r = 2^m$, $w \geq 4m$, $e = \lceil (N+2m+1)/w \rceil$
Input: $d = \lceil (N+2)/m \rceil + 2$, $R = r^{\lceil (N+2)/m \rceil}$, $r^2 r^{-1} - MM'' = 1$
Output: $Z \in [0, 2M)$
1: $ZM^{(-2)} = 0$, $ZM^{(-1)} = 0$, $ZR^{(-1)} = 0$, $q^{(-1)} = 0$
2: for $i = 0$ to $d-1$ do {// Outer loop}
3:   for $j = 0$ to $e-1$ do {// Inner loop}
4:     if $j == 0$ then
5:       $g^{(i)} = ZR_0^{(i-1)} M'' \bmod r^2$
6:       $q^{(i)} = (g^{(i)} - q^{(i-1)})/r$
7:     end if
8:     $T_j^{(i)} = ZM_j^{(i-2)}/r + ZR_j^{(i-1)} + X'_j Y_i + q^{(i-1)} M_j$
9:     $ZR_j^{(i)} = T_j^{(i)}[w+m+1:m]$
10:    $ZM_{j-1}^{(i)} = T_j^{(i)}[m-1:0] \, 2^{w-m}$
11:  end for
12:  $ZM_{e-1}^{(i)} = 0$
13: end for
14: $Z = \sum_{j=0}^{e-1} (ZM_j^{(d-2)}/r + ZM_j^{(d-1)} + ZR_j^{(d-1)}) 2^{wj}$
15: return $Z$

Algorithm 29 has an issue latency of 2 cycles because we must wait for the generation of $ZM_0^{(i-1)}$ in the 1-st inner loop iteration of the $(i-1)$-st outer loop iteration. Following our prior contribution [15], this constraint can be removed. Any intermediate result $Z^{(i)}$ in the $i$-th outer loop iteration is represented by $ZM_j^{(i-1)}$, $ZM_j^{(i)}$, and $ZR_j^{(i)}$ ($0 \leq j \leq e-1$) as below. Each word $ZM_j^{(i)}$ is a $w$-bit word whose least significant $w-m$ bits are 0; each word $ZR_j^{(i)}$ is a $(w+2)$-bit word.

$$ZM^{(i)} = \sum_{j=0}^{e-1} ZM_j^{(i)} 2^{wj}, \qquad ZR^{(i)} = \sum_{j=0}^{e-1} ZR_j^{(i)} 2^{wj}$$
$$Z^{(i)} = \sum_{j=0}^{e-1} (ZM_j^{(i-1)}/r + ZM_j^{(i)} + ZR_j^{(i)}) 2^{wj} = ZM^{(i-1)}/r + ZM^{(i)} + ZR^{(i)} \tag{5.53}$$

Similarly, the only thing we need to prove is how $Z^{(i)}$ (at the end of the $i$-th iteration of algorithm 28) is represented through the summation $\sum_{j=0}^{e-1} T_j^{(i)} 2^{wj}$, so that $Z^{(i)} = ZM^{(i-1)}/r + ZM^{(i)} + ZR^{(i)}$; the rest of the proof is the same as for the iterative Montgomery MM. We again use mathematical induction. The base case is $Z^{(-1)} = ZM^{(-2)}/r + ZM^{(-1)} + ZR^{(-1)} = 0$. Because of the division by $r$, we can compute $Z^{(i)}$ in terms of $T_j^{(i)}$ and $ZM_j^{(i-1)}$ as below, where $ZM_{e-1}^{(i)} = 0$. When $w \geq 2m$, the least significant $w-m \geq m$ bits of $ZM_0^{(i-1)}$ are 0, and we have $Z^{(i)} r \bmod r = T_0^{(i)} \bmod r = 0$. Consequently, we obtain $ZR_j^{(i)}$ and $ZM_{j-1}^{(i)}$ immediately once we compute $T_j^{(i)}$, by assigning $ZR_j^{(i)} = T_j^{(i)}[w+m+1:m]$ and $ZM_{j-1}^{(i)} = T_j^{(i)}[m-1:0] \, 2^{w-m}$.

$$Z^{(i)} = \frac{Z^{(i-1)} + X Y_i r^2 + q^{(i-1)} M}{r} = \frac{ZM^{(i-2)}/r + ZM^{(i-1)} + ZR^{(i-1)} + X Y_i r^2 + q^{(i-1)} M}{r}$$
$$= \frac{\sum_{j=0}^{e-1} ZM_j^{(i-1)} 2^{wj} + \sum_{j=0}^{e-1} T_j^{(i)} 2^{wj}}{r} = \sum_{j=1}^{e-1} \frac{ZM_j^{(i-1)}}{r} 2^{wj} + \sum_{j=0}^{e-2} \big( (T_{j+1}^{(i)} \bmod r) 2^{w-m} \big) 2^{wj} + \sum_{j=0}^{e-1} \big\lfloor T_j^{(i)}/r \big\rfloor 2^{wj}$$
$$= ZM^{(i-1)}/r + ZM^{(i)} + ZR^{(i)} \tag{5.54}$$

We can also simplify the computation of $g^{(i)}$ under the assumption $w \geq 4m$: the least significant $w-m \geq 3m$ bits of $ZM_0^{(i-1)}$ and the least significant $w-2m \geq 2m$ bits of $ZM_0^{(i-2)}/r$ are 0, hence

$$g^{(i)} = (Z^{(i-1)} \bmod r^2) M'' \bmod r^2 = (ZM_0^{(i-2)}/r + ZM_0^{(i-1)} + ZR_0^{(i-1)}) M'' \bmod r^2 = ZR_0^{(i-1)} M'' \bmod r^2 \tag{5.55}$$

Assume the size bound of $ZR_j^{(i-1)}$ is $2^{w+x}$. We need to find an integer $x$ such that $T_j^{(i)} = ZM_j^{(i-2)}/r + ZR_j^{(i-1)} + X'_j Y_i + q^{(i-1)} M_j$ is a $(w+m+x)$-bit number; we can then split $T_j^{(i)}$ to form a $(w+x)$-bit $ZR_j^{(i)}$ and a $w$-bit $ZM_{j-1}^{(i)}$. The inequality below shows that $x = 2$ suffices, so $T_j^{(i)}$ has $w+m+2$ bits.
$$T_j^{(i)} = ZM_j^{(i-2)}/r + ZR_j^{(i-1)} + X'_j Y_i + q^{(i-1)} M_j \leq (2^m - 1) 2^{w-2m} + (2^{w+x} - 1) + (2^w - 1)(2^m - 1) + (2^w - 1)(2^m - 1) < 2^{w+m+x} \tag{5.56}$$

Finally, the computation of $T_0^{(i)}$ in the $i$-th iteration relies only on $ZR_0^{(i-1)}$ and $ZM_0^{(i-2)}$. $ZR_0^{(i-1)}$ is computed in the 0-th inner loop iteration of the $(i-1)$-st outer loop iteration, and $ZM_0^{(i-2)}$ is computed in the 1-st inner loop iteration of the $(i-2)$-nd outer loop iteration. We thus observe that $ZR_0^{(i-1)}$ and $ZM_0^{(i-2)}$ become available at the same time. This allows us to start the computation of $T_0^{(i)}$ right after we know $ZR_0^{(i-1)}$ from the 0-th inner loop iteration. In other words, we can start the $i$-th iteration one inner loop iteration after the start of the $(i-1)$-st iteration. We refer to this as the L1 scalable Montgomery MM, with an issue latency of 1 cycle.

5.4.2.3 Encoding Technique

Of course, the next step is to replace multiplication and addition with compression and encoding. Any intermediate result $Z^{(i)}$ in the $i$-th iteration is represented by $ZSM_j^{(i-1)}$, $ZCM_j^{(i-1)}$, $ZSM_j^{(i)}$, $ZCM_j^{(i)}$, $ZSR_j^{(i)}$, $ZCR_j^{(i)}$, and $Z_{carry}^{(i)}$ ($0 \leq j \leq e-1$) as below. Each word $ZSM_j^{(i)}$ or $ZCM_j^{(i)}$ is a $w$-bit unsigned word whose least significant $w-m$ bits are 0. Each word $ZSR_j^{(i)}$ or $ZCR_j^{(i)}$ is a $(w+3)$-bit signed word, where $ZCR_j^{(i)}[w+2] = 0$ and the sum $ZSR_j^{(i)} + ZCR_j^{(i)}$ is also a $(w+3)$-bit signed word. $Z_{carry}^{(i)}$ is a 1-bit unsigned word.

Algorithm 31 L1 scalable radix-$2^m$ Montgomery modular multiplication: advanced version
Input: $X, Y \in [0, 2M)$, odd $M < 2^N$, $r = 2^m$, $w \geq 4m$, $e = \lceil (N+2m+1)/w \rceil$
Input: $d = \lceil (N+4)/m \rceil + 2$, $R = r^{d-2}$, $r^2 r^{-1} - MM'' = 1$
Output: $Z \in [0, 2M)$
1: $ZSM^{(-2)} = ZCM^{(-2)} = 0$
2: $ZSM^{(-1)} = ZCM^{(-1)} = 0$, $ZSR^{(-1)} = ZCR^{(-1)} = 0$, $Z_{carry}^{(-1)} = 0$
3: $q_S^{(-1)} = q_C^{(-1)} = 0$, $q_{carry}^{(-1)} = 0$
4: for $i = 0$ to $d-1$ do {// Outer loop}
5:   $q^{(i-1)} = (q_S^{(i-1)}, q_C^{(i-1)}, q_{carry}^{(i-1)})$
6:   $\tilde{q}^{(i-1)} = \text{Encode}(q^{(i-1)})$, $\hat{q}^{(i-1)} = \text{EncodeSN}(q^{(i-1)})$
7:   for $j = 0$ to $e-1$ do {// Inner loop}
8:     $(TS_j^{(i)}, TC_j^{(i)}) = \text{Compress}(X'_j \tilde{Y}_i + \tilde{q}^{(i-1)} M_j + ZSM_j^{(i-2)}/r + ZCM_j^{(i-2)}/r + ZSR_j^{(i-1)} + ZCR_j^{(i-1)} + (Z_{carry}^{(i-1)}$ AND $j == 0))$
9:     $TS_j^{(i)}[w+m+2] = TS_j^{(i)}[w+m+2]$ XNOR $TC_j^{(i)}[w+m+2]$
10:    $TC_j^{(i)}[w+m+2] = 0$
11:    $ZSR_j^{(i)} = TS_j^{(i)}[w+m+2:m]$, $ZCR_j^{(i)} = TC_j^{(i)}[w+m+2:m]$
12:    $ZSM_{j-1}^{(i)} = TS_j^{(i)}[m-1:0] \, 2^{w-m}$, $ZCM_{j-1}^{(i)} = TC_j^{(i)}[m-1:0] \, 2^{w-m}$
13:    if $j == 0$ then
14:      $Z_L^{(i-1)} = (ZSR_j^{(i-1)}[2m-1:0], ZCR_j^{(i-1)}[2m-1:0], Z_{carry}^{(i-1)})$
15:      $\tilde{Z}_L^{(i-1)} = \text{Encode}(Z_L^{(i-1)})$
16:      $(q_S^{(i)}, q_C^{(i)}) = \text{Compress}(\tilde{Z}_L^{(i-1)} M'' - \hat{q}^{(i-1)})$
17:      $q_{carry}^{(i)} = q_S^{(i)}[m-1]$ OR $q_C^{(i)}[m-1]$, $q_S^{(i)} = q_S^{(i)} \gg m$, $q_C^{(i)} = q_C^{(i)} \gg m$
18:      $Z_{carry}^{(i)} = TS_j^{(i)}[m-1]$ OR $TC_j^{(i)}[m-1]$
19:    end if
20:  end for
21:  $ZSM_{e-1}^{(i)} = ZCM_{e-1}^{(i)} = 0$
22: end for
23: $ZR_H = 0$, $Carry = 0$, $FBS = 0$, $FBC = 0$
24: for $j = 0$ to $e-1$ do
25:   $(PS_j, PC_j) = \text{Compress}(ZR_H + FBS + FBC + ZSM_j^{(d-2)}/r + ZCM_j^{(d-2)}/r + ZSM_j^{(d-1)} + ZCM_j^{(d-1)} + ZSR_j^{(d-1)}[w-1:0] + ZCR_j^{(d-1)}[w-1:0] + (Z_{carry}^{(d-1)}$ AND $j == 0) + M_j)$
26:   $PS_j[w+2] = PS_j[w+2]$ XNOR $PC_j[w+2]$, $PC_j[w+2] = 0$
27:   $(Carry, Z_j) = PS_j[w-1:0] + PC_j[w-1:0] + Carry$
28:   $FBS = PS_j[w+3:w]$, $FBC = PC_j[w+3:w]$
29:   $ZR_H = ZSR_j^{(d-1)}[w+2:w] + ZCR_j^{(d-1)}[w+2:w]$
30: end for
31: return $Z$
$$ZSM^{(i)} = \sum_{j=0}^{e-1} ZSM_j^{(i)} 2^{wj}, \quad ZCM^{(i)} = \sum_{j=0}^{e-1} ZCM_j^{(i)} 2^{wj}, \quad ZSR^{(i)} = \sum_{j=0}^{e-1} ZSR_j^{(i)} 2^{wj}, \quad ZCR^{(i)} = \sum_{j=0}^{e-1} ZCR_j^{(i)} 2^{wj}$$
$$Z^{(i)} = \frac{ZSM^{(i-1)} + ZCM^{(i-1)}}{r} + ZSM^{(i)} + ZCM^{(i)} + ZSR^{(i)} + ZCR^{(i)} + Z_{carry}^{(i)} \tag{5.57}$$

Similarly, the only thing we need to prove is how $Z^{(i)} = (Z^{(i-1)} + X \tilde{Y}_i r^2 + \tilde{q}^{(i-1)} M)/r$ is represented through the summation $\sum_{j=0}^{e-1} T_j^{(i)} 2^{wj}$, so that $Z^{(i)} = ZSM^{(i-1)}/r + ZCM^{(i-1)}/r + ZSM^{(i)} + ZCM^{(i)} + ZSR^{(i)} + ZCR^{(i)} + Z_{carry}^{(i)}$. An interested reader can prove the rest by following the advanced version of our proposed iterative Montgomery MM in algorithm 22.

$$Z^{(i)} = \frac{Z^{(i-1)} + X \tilde{Y}_i r^2 + \tilde{q}^{(i-1)} M}{r} = \frac{\sum_{j=0}^{e-1} (ZSM_j^{(i-1)} + ZCM_j^{(i-1)} + TS_j^{(i)} + TC_j^{(i)}) 2^{wj}}{r}$$
$$= \frac{ZSM^{(i-1)} + ZCM^{(i-1)}}{r} + \sum_{j=0}^{e-1} \big( \lfloor TS_j^{(i)}/r \rfloor + \lfloor TC_j^{(i)}/r \rfloor \big) 2^{wj} + \frac{\sum_{j=1}^{e-1} (TS_j^{(i)} \bmod r + TC_j^{(i)} \bmod r) 2^{wj}}{r} + \frac{TS_0^{(i)} \bmod r + TC_0^{(i)} \bmod r}{r}$$
$$= \frac{ZSM^{(i-1)} + ZCM^{(i-1)}}{r} + ZSM^{(i)} + ZCM^{(i)} + ZSR^{(i)} + ZCR^{(i)} + Z_{carry}^{(i)} \tag{5.58}$$

Since $Z^{(i)} r \bmod r = (TS_0^{(i)} + TC_0^{(i)}) \bmod r = 0$, the sum $TS_0^{(i)}[m-1:0] + TC_0^{(i)}[m-1:0]$ can only be 0 or $r$. We can therefore set $Z_{carry}^{(i)} = TS_0^{(i)}[m-1]$ OR $TC_0^{(i)}[m-1]$: only when $TS_0^{(i)}[m-1] = TC_0^{(i)}[m-1] = 0$ is the sum 0; otherwise the sum is $r$.

$$Z_{carry}^{(i)} = TS_0^{(i)}[m-1] \text{ OR } TC_0^{(i)}[m-1]$$
$$Z_{carry}^{(i)} r = TS_0^{(i)}[m-1:0] + TC_0^{(i)}[m-1:0] = TS_0^{(i)} \bmod r + TC_0^{(i)} \bmod r \tag{5.59}$$

What remains is to determine the size of $ZSR_j^{(i)}$ and $ZCR_j^{(i)}$. We prove here that $X'_j \tilde{Y}_i + \tilde{q}^{(i-1)} M_j + ZSM_j^{(i-2)}/r + ZCM_j^{(i-2)}/r + ZSR_j^{(i-1)} + ZCR_j^{(i-1)} + Z_{carry}^{(i-1)}$ can be correctly represented by two $(w+m+3)$-bit signed numbers $TS_j^{(i)}$ and $TC_j^{(i)}$ without overflow, under the assumption that $ZSR_j^{(i-1)}$ and $ZCR_j^{(i-1)}$ are two $(w+3)$-bit signed numbers. As a result, we split $TS_j^{(i)}$ and $TC_j^{(i)}$ to form the $(w+3)$-bit signed numbers $ZSR_j^{(i)}$ and $ZCR_j^{(i)}$, and the $w$-bit unsigned numbers $ZSM_{j-1}^{(i)}$ and $ZCM_{j-1}^{(i)}$.

Figure 5.8 shows the procedure for computing $X'_j \tilde{Y}_i + \tilde{q}^{(i-1)} M_j + ZSM_j^{(i-2)}/r + ZCM_j^{(i-2)}/r + ZSR_j^{(i-1)} + ZCR_j^{(i-1)} + Z_{carry}^{(i-1)}$. In step (i), we sign-extend all partial products to $w+m+4$ bits. For the sake of brevity, all sign bits are labelled $S$, although different numbers can have different sign values. The sign bits of a number are converted into 1's followed by an $\bar{S}$, where $\bar{S}$ is the Boolean inversion of $S$. We merge all constant 1's into a number called the prefix in step (ii); any overflow can be ignored as long as the final result is at most a $(w+m+4)$-bit signed number. It is easy to verify that the value in the dashed polygon must be less than $2^{w+m+3}$, so that we can compress it into two $(w+m+3)$-bit numbers $TS_j^{(i)}$ and $TC_j^{(i)}$ in step (iii), regardless of the compression strategy. $TS_j^{(i)}[w+m+2]$ and $TC_j^{(i)}[w+m+2]$ cannot both be 1; otherwise $TS_j^{(i)} + TC_j^{(i)}$ would exceed $2^{w+m+3}$. As a result, the sum of $TS_j^{(i)}[w+m+2]$, $TC_j^{(i)}[w+m+2]$, and the leading 2 bits of the prefix is a 2-bit number whose two bits share the same value, namely $TS_j^{(i)}[w+m+2]$ XNOR $TC_j^{(i)}[w+m+2]$.
[Figure 5.8: Computing $TS_j^{(i)}$ and $TC_j^{(i)}$ in the L1 scalable Montgomery MM]

As discussed in the section on overflow prevention, the sum in step (iv) no longer overflows. Since $TS_j^{(i)}[w+m+3]$ is always the same as $TS_j^{(i)}[w+m+2]$, we can hide $TS_j^{(i)}[w+m+3]$ and $TC_j^{(i)}[w+m+3]$ in a hardware implementation.

Since $X$ and $Y$ are in the range $[0, 2M)$, we must prove the required number of outer loop iterations such that $Z^{(d-1)}$ is in the range $(-M, M)$. Following the proof for the iterative Montgomery MM, in each outer loop iteration we ensure $Z^{(i)} r = Z^{(i-1)} + X \tilde{Y}_i r^2 + \tilde{q}^{(i-1)} M$. When $i \geq \lceil (N+2)/m \rceil - 1$, the following equality holds:

$$Z^{(i)} r^{i-1} = X \sum_{k=0}^{i} \tilde{Y}_k r^k + M \sum_{k=0}^{i-2} \tilde{q}^{(k+1)} r^k = XY + M \sum_{k=0}^{i-2} \tilde{q}^{(k+1)} r^k \tag{5.60}$$

We can derive upper and lower bounds for $Z^{(i)}$ as follows. If $N + 4 \leq (i-1)m$, then $2/3 + 4M/r^{i-1} < 1$, and we need at least $d = \lceil (N+4)/m \rceil + 2$ iterations to ensure $Z^{(d-1)} \in (-M, M)$:

$$Z^{(i)} r^{i-1} < 4M^2 + \tfrac{2}{3} M (r-1) \sum_{k=0}^{i-2} r^k < 4M^2 + \tfrac{2}{3} M (r^{i-1} - 1) < (4M + \tfrac{2}{3} r^{i-1}) M \;\Rightarrow\; Z^{(i)} < \Big( \tfrac{2}{3} + \frac{4M}{r^{i-1}} \Big) M$$
$$Z^{(i)} r^{i-1} \geq 0 - \tfrac{2}{3} M (r-1) \sum_{k=0}^{i-2} r^k > -\tfrac{2}{3} r^{i-1} M \;\Rightarrow\; Z^{(i)} > -\tfrac{2}{3} M > -M \tag{5.61}$$

Because $Z^{(d-1)} \in (-M, M)$, the target $Z \in [0, 2M)$ is simply the sum $Z^{(d-1)} + M$. For ease of hardware implementation, we use a final loop that generates one $w$-bit word $Z_j$ per iteration from the inputs $M_j$, $ZSM_j^{(d-2)}/r$, $ZCM_j^{(d-2)}/r$, $ZSM_j^{(d-1)}$, $ZCM_j^{(d-1)}$, $ZSR_j^{(d-1)}[w-1:0]$, $ZCR_j^{(d-1)}[w-1:0]$, $Z_{carry}^{(d-1)}$, $ZR_H$, $FBS$, and $FBC$. Initially, the extra inputs $ZR_H$, $FBS$, and $FBC$ are 0. We compress all numbers into two $(w+4)$-bit signed numbers $PS$ and $PC$, of which only the least significant $w$ bits are needed to compute $Z_0$; the leading 4 bits are stored into $FBS$ and $FBC$ for the computation of $Z_1$. The leading 3 bits of $ZSR_j^{(d-1)}$ and $ZCR_j^{(d-1)}$ are also added, as $ZR_H = ZSR_j^{(d-1)}[w+2:w] + ZCR_j^{(d-1)}[w+2:w]$, toward the computation of $Z_1$. This procedure continues until the whole $Z$ is computed. Meanwhile, the carry bit produced when iteratively adding the least significant $w$ bits of $PS$ and $PC$ serves as the carry input of the next $w$-bit addition.

The last thing to show is that it suffices to make $PS$ and $PC$ $(w+4)$-bit signed numbers; the leading bit $PC[w+3]$ is 0, and the sum $PS + PC$ has no overflow. In figure 5.9, we stitch all required segments together into six numbers for efficient compression ($Z_{carry}$ can easily be hidden during compression). Step (i) shows the stitched results after the sign-extension optimization. Evidently, the sum of the dashed polygon is less than $13 \times 2^w < 2^{w+4}$, and we can compress the dashed polygon into two numbers $PS$ and $PC$ in step (ii). $PS[w+3]$ and $PC[w+3]$ cannot both be 1; otherwise the sum would exceed $2^{w+4}$. Finally, the sum of the two remaining constant 1's, $PS[w+3]$, and $PC[w+3]$ is a 2-bit result whose bits share the same value, $PS[w+3]$ XNOR $PC[w+3]$. Since $PS[w+4]$ is always equal to $PS[w+3]$, we can hide $PS[w+4]$ and $PC[w+4]$ in a hardware implementation. The whole discussion follows the proof of overflow prevention, and the sum $PS + PC$ does not overflow.
[Figure 5.9: Computing $PS$ and $PC$ in the L1 scalable Montgomery MM]

5.4.2.4 Scheduling and Performance Estimation

Figure 5.10 shows the data dependency graph. Node A denotes the tasks of the 0-th inner loop iteration; node B denotes the tasks of an inner loop iteration with $j \neq 0$; node C denotes the task in lines 23-30 of algorithm 31. We can start the A task of PE $i$ as soon as we finish the A task of PE $i-1$; the issue latency is thus $L = 1$.

[Figure 5.10: Data dependency graph with L = 1]

Each PE needs $e$ cycles to execute one A task and $e-1$ B tasks. Ideally, we need at least $p = \lceil (N+2m+4)/m \rceil$ PEs to cover all outer loop iterations. In the context of a scalable design, however, the bitwidth $N$ of the modulus $M$ is variable, so we cannot know in advance how many PEs are required. As mentioned earlier, after $e$ cycles a PE becomes free and may be reused. Assume $p$ PEs are employed, and let $k$ denote the number of rounds in which we use all $p$ PEs, resulting in $kp$ outer loop iterations of algorithm 31. From (5.61), it is easy to see that $k = \lceil (N+2m+4)/(mp) \rceil$ guarantees that the output $Z$ lies in $[0, 2M)$ as required.

Figure 5.11 shows an example with $N = 16$, $w = 8$, $m = 2$, and $e = \lceil (N+2m+1)/w \rceil = 3$. When we have $p = 3$ PEs ($p \geq e$), as in figure 5.11(a), we need $k = \lceil (N+2m+4)/(mp) \rceil = 4$ rounds. PE #1 finishes the computation for $\tilde{Y}_0$ and becomes free exactly when we need to issue $\tilde{Y}_3$; we therefore need $kp = 12$ cycles to finish all A tasks. When we only have $p = 2$ PEs ($p < e$), as in figure 5.11(b), we need $k = 6$ rounds. When we want to issue $\tilde{Y}_2$, PE #1 is still busy and we have to wait; we therefore need $(k-1)e + p = 17$ cycles to finish all A tasks. Once all A tasks are done, we can start the C tasks and collect all results, which takes $e+1$ cycles.

[Figure 5.11: Schedule for N = 16, w = 8, and m = 2 with (a) p = 3 and (b) p = 2]

Based on the relative magnitudes of $p$ and $e$, the number of cycles required to perform $Z = XY R^{-1} \bmod M$ (with inputs $X$ and $Y$ and output $Z$ in $[0, 2M)$) using algorithm 31 is:

$$\text{cycle\_cnt} = \begin{cases} kp + e + 1 & \text{if } e \leq p \\ ke + p + 1 & \text{otherwise} \end{cases} \tag{5.62}$$

where

$$r = 2^m, \quad p = \text{PE count}, \quad R = r^{kp-2} = 2^{m(kp-2)}, \quad e = \Big\lceil \frac{N+2m+1}{w} \Big\rceil, \quad k = \Big\lceil \frac{N+2m+4}{mp} \Big\rceil \tag{5.63}$$

The snippet below reproduces the two schedules of figure 5.11 from these formulas.
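A quick numeric check of (5.62)-(5.63), assuming only the integer parameters above; the asserted values match the schedules of figure 5.11.

    import math

    def cycle_count(N, m, w, p):
        e = math.ceil((N + 2 * m + 1) / w)         # inner-loop length, per (5.63)
        k = math.ceil((N + 2 * m + 4) / (m * p))   # number of reuse rounds, per (5.63)
        return k * p + e + 1 if e <= p else k * e + p + 1

    assert cycle_count(16, 2, 8, 3) == 16          # figure 5.11(a): p = 3
    assert cycle_count(16, 2, 8, 2) == 21          # figure 5.11(b): p = 2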
5.4.2.5 Hardware Implementation

Finally, algorithm 32 details the scalable architecture in terms of the number of processing elements $p$ and the number of rounds $k$; the iteration superscripts are removed. Figure 5.12 shows the internal design of the processing element (PE) that finishes one inner loop in $e$ cycles. During the first iteration ($j = 0$), we update $q_S$ and $q_C$ based on the computation $(q_S, q_C) = \text{Compress}(\tilde{Z}_L M'' - \hat{q})$; for the other iterations ($j \neq 0$), we keep $q_S$ and $q_C$ unchanged. A control signal CT is thus used to differentiate $j = 0$ from $j \neq 0$ and to steer the data flow.

We also compute $(TS_j, TC_j) = \text{Compress}(X'_j \tilde{Y}_i + \tilde{q} M_j + ZSM'_j/r + ZCM'_j/r + ZSR_j + ZCR_j + Z_{carry})$ in every iteration. The outputs $TS_j$ and $TC_j$ are registered and then split to form $ZSR_j$, $ZCR_j$, $ZSM_{j-1}$, $ZCM_{j-1}$, and $Z_{carry}$. When CT = 1 (i.e., $j = 0$), we update $Z_{carry}$ as $TS_j[m-1]$ OR $TC_j[m-1]$; otherwise (CT = 0) we set $Z_{carry} = 0$. Lastly, we send 0 to the registers $ZSM_{j-1}$ and $ZCM_{j-1}$ when CT = 1.

Since $Z$ is computed iteratively, we design a special processing element (PEc) to finish this task in $e$ cycles; figure 5.13 shows its diagram. Because $ZSM_j$ and $ZCM_j$ are generated one cycle later than the other inputs, we delay the other inputs by one cycle. To ensure correct operation in the first iteration ($j = 0$), we force $ZR_H$, $Carry$, $FBS$, and $FBC$ to be 0; in the remaining iterations, $ZR_H$, $Carry$, $FBS$, and $FBC$ work as designed.

We show the block-level architecture of the proposed radix-$2^m$ scalable Montgomery modular multiplication in figure 5.14. The $ZSM'_j$ and $ZCM'_j$ required in the $i$-th outer loop iteration of algorithm 32 are actually the $ZSM_j$ and $ZCM_j$ generated in the $(i-2)$-nd outer loop iteration. Similarly, the $ZSM'_j$ and $ZCM'_j$ required by the PEc are the $ZSM_j$ and $ZCM_j$ generated by the next-to-last PE. To compute $Z_j$, we need $ZSM'_j$, $ZCM'_j$, $ZSM_j$, $ZCM_j$, $ZSR_j$, $ZCR_j$, and $Z_{carry}$.

Algorithm 32 L1 scalable radix-$2^m$ Montgomery modular multiplication: hardware version
Input: $X, Y \in [0, 2M)$, odd $M < 2^N$, $r = 2^m$, $w \geq 4m$, $e = \lceil (N+2m+1)/w \rceil$
Input: $p$ = PE count, $k = \lceil (N+2m+4)/(mp) \rceil$, $d = kp$, $R = r^{d-2}$, $r^2 r^{-1} - MM'' = 1$
Output: $Z \in [0, 2M)$
1: $ZSM' = ZCM' = ZSM = ZCM = ZSR = ZCR = Z_{carry} = 0$
2: $q_S = q_C = q_{carry} = 0$
3: for $i = 0$ to $d-1$ do {// Outer loop}
4:   $q = (q_S, q_C, q_{carry})$, $\tilde{q} = \text{Encode}(q)$, $\hat{q} = \text{EncodeSN}(q)$
5:   for $j = 0$ to $e-1$ do {// Inner loop}
6:     $(TS_j, TC_j) = \text{Compress}(X'_j \tilde{Y}_i + \tilde{q} M_j + (ZSM'_j + ZCM'_j)/r + ZSR_j + ZCR_j + (Z_{carry}$ AND $j == 0))$
7:     $TS_j[w+m+2] = TS_j[w+m+2]$ XNOR $TC_j[w+m+2]$
8:     $TC_j[w+m+2] = 0$
9:     $ZSR_j = TS_j[w+m+2:m]$, $ZCR_j = TC_j[w+m+2:m]$
10:    $ZSM'_{j-1} = ZSM_{j-1}$, $ZCM'_{j-1} = ZCM_{j-1}$
11:    $ZSM_{j-1} = TS_j[m-1:0] \, 2^{w-m}$, $ZCM_{j-1} = TC_j[m-1:0] \, 2^{w-m}$
12:    if $j == 0$ then
13:      $Z_L = (ZSR_j[2m-1:0], ZCR_j[2m-1:0], Z_{carry})$
14:      $\tilde{Z}_L = \text{Encode}(Z_L)$
15:      $(q_S, q_C) = \text{Compress}(\tilde{Z}_L M'' - \hat{q})$
16:      $q_{carry} = q_S[m-1]$ OR $q_C[m-1]$, $q_S = q_S \gg m$, $q_C = q_C \gg m$
17:      $Z_{carry} = TS_j[m-1]$ OR $TC_j[m-1]$
18:    end if
19:  end for
20:  $ZSM_{e-1} = ZCM_{e-1} = ZSM_{-1} = ZCM_{-1} = 0$
21: end for
22: $ZR_H = 0$, $Carry = 0$, $FBS = 0$, $FBC = 0$
23: for $j = 0$ to $e-1$ do
24:   $(PS, PC) = \text{Compress}(ZR_H + FBS + FBC + M_j + ZSM'_j/r + ZCM'_j/r + ZSM_j + ZCM_j + ZSR_j[w-1:0] + ZCR_j[w-1:0] + (Z_{carry}$ AND $j == 0))$
25:   $PS[w+2] = PS[w+2]$ XNOR $PC[w+2]$, $PC[w+2] = 0$
26:   $(Carry, Z_j) = PS[w-1:0] + PC[w-1:0] + Carry$
27:   $FBS = PS[w+3:w]$, $FBC = PC[w+3:w]$
28:   $ZR_H = ZSR_j[w+2:w] + ZCR_j[w+2:w]$
29: end for
30: return $Z$
Figure 5.12: Internal Design of the Processing Element (PE) with L = 1

Figure 5.13: PEc Diagram with L = 1

To compute Z_j, we need ZSM′_j, ZCM′_j, ZSM_j, ZCM_j, ZSR_j, ZCR_j, and Z_carry. When p < e, as in figure 5.11(b), a first-in first-out queue with a depth of e−p slots is necessary to buffer the outputs.

Figure 5.14: Proposed Scalable Architecture of Montgomery MM with L = 1

5.4.3 New Scalable Montgomery Modular Multiplication: L2

5.4.3.1 L2 Algorithm with Encoding Technique

Following algorithm 32, we can implement the L1 scalable Montgomery MM. Similarly, algorithm 33 implements an L2 scalable Montgomery MM with the encoding technique, in which the update of TS_j and TC_j relies on ZSM_j and ZCM_j rather than ZSM′_j and ZCM′_j. Following the discussion of the L1 scalable Montgomery MM, an interested reader can easily verify its correctness. As a result, we need to wait for ZSM_j and ZCM_j before the start of a new inner loop iteration. Figure 5.15 shows the data dependency graph, where PE #2 for Ỹ_1 can only start at the 3rd cycle, so the design has an issue latency of 2 cycles (L = 2). We can generalize the discussion to different issue latencies L. Assume p PEs are employed.
Based on the magnitudes of p and e, the number of cycles required to perform Z = XYR^{-1} mod M (where inputs X and Y and output Z are in [0, 2M)) is

\[
\text{cycle\_cnt} =
\begin{cases}
kLp + e + 1 & \text{if } e \le Lp\\
ke + Lp + 1 & \text{otherwise}
\end{cases}
\tag{5.64}
\]

Algorithm 33 L2 scalable radix-2^m Montgomery modular multiplication: hardware version
Input: X, Y ∈ [0, 2M), odd M < 2^N, r = 2^m, w ≥ 4m, e = ⌈(N+2m+1)/w⌉
Input: p = PE count, k = ⌈(N+2m+4)/(mp)⌉, d = kp, R = r^{d−2}, r^2 r^{−1} − MM″ = 1
Output: Z ∈ [0, 2M)
1:  ZSM = ZCM = ZSR = ZCR = Z_carry = 0
2:  q_S = q_C = q_carry = 0
3:  for i = 0 to d−1 do {// Outer loop}
4:      q = (q_S, q_C, q_carry), q̃ = Encode(q), q̂ = EncodeSN(q)
5:      for j = 0 to e−1 do {// Inner loop}
6:          (TS_j, TC_j) = Compress(X′_j Ỹ_i + q̃M_j + ZSM_j + ZCM_j + ZSR_j + ZCR_j + (Z_carry AND j == 0))
7:          TS_j[w+m+2] = TS_j[w+m+2] XNOR TC_j[w+m+2]
8:          TC_j[w+m+2] = 0
9:          ZSR_j = TS_j[w+m+2 : m]
10:         ZCR_j = TC_j[w+m+2 : m]
11:         ZSM_{j−1} = TS_j[m−1 : 0]·2^{w−m}
12:         ZCM_{j−1} = TC_j[m−1 : 0]·2^{w−m}
13:         if j == 0 then
14:             Z_L = (ZSR_j[2m−1 : 0], ZCR_j[2m−1 : 0], Z_carry)
15:             Z̃_L = Encode(Z_L)
16:             (q_S, q_C) = Compress(Z̃_L M″ − q̂)
17:             q_S = q_S >> m
18:             q_C = q_C >> m
19:             q_carry = q_S[m−1] OR q_C[m−1]
20:             Z_carry = TS_j[m−1] OR TC_j[m−1]
21:         end if
22:     end for
23:     ZSM_{e−1} = ZCM_{e−1} = ZSM_{−1} = ZCM_{−1} = 0
24: end for
25: ZR_H = 0, Carry = 0, FBS = 0, FBC = 0
26: for j = 0 to e−1 do
27:     (PS, PC) = Compress(ZR_H + FBS + FBC + M_j + ZSM_j + ZCM_j + ZSR_j[w−1 : 0] + ZCR_j[w−1 : 0] + (Z_carry AND j == 0))
28:     PS[w+2] = PS[w+2] XNOR PC[w+2], PC[w+2] = 0
29:     (Carry, Z_j) = PS[w−1 : 0] + PC[w−1 : 0] + Carry
30:     FBS = PS[w+3 : w], FBC = PC[w+3 : w]
31:     ZR_H = ZSR_j[w+2 : w] + ZCR_j[w+2 : w]
32: end for
33: return Z

where
\[
r = 2^m,\quad p = \text{PE count},\quad R = r^{kp-2} = 2^{m(kp-2)},\quad
e = \left\lceil \frac{N+2m+1}{w} \right\rceil,\quad
k = \left\lceil \frac{N+2m+4}{mp} \right\rceil
\tag{5.65}
\]

Figure 5.15: Classic Data Dependency Graph with L = 2

For a presumed modulus size N, we need to employ p = ⌈e/L⌉ PEs to avoid waiting. The number of required PEs decreases as L increases, whereas the cycle count (kLp + e + 1) increases. The product of cycle count and PE count is thus proportional to N²/(mw) and is not very sensitive to the value of L:

\[
(kLp + e + 1)\,p \sim kLp \cdot \frac{e}{L}
= \frac{N+2m+4}{mp}\,p \cdot \frac{N+2m+1}{w}
\sim \frac{N^2}{mw}
\tag{5.66}
\]

Since the computation required for one inner loop iteration is roughly the same for both the L1 scalable (algorithm 32) and L2 scalable (algorithm 33) designs, the clock period and hardware area of a PE would be roughly the same. The product of cycle count and PE count is thereby a first-order estimate of the latency-area product. It therefore seems unprofitable to implement the L2 scalable Montgomery MM if we want to optimize both latency and area.

5.4.3.2 New Scheduling and Performance Estimation

Figure 5.15 is the classic data dependency graph following the first scalable Montgomery design [17]. We have to wait one cycle for ZSM_0 and ZCM_0, although ZSR_0 and ZCR_0 are available right after the A task. That is why reference [17] is only an L2 scalable design. However, this data dependency does not fully leverage the immediate availability of ZSR_0 and ZCR_0 after the A task. We propose a new data dependency graph after implementing the processing element as a 2-stage pipeline.
Rather than waiting for ZSM_0 and ZCM_0, the compression (TS_j, TC_j) = Compress(X′_j Ỹ_i + q̃M_j + ZSM_j + ZCM_j + ZSR_j + ZCR_j + (Z_carry AND j == 0)) can be partitioned into two stages. A 2-stage pipeline example is the following:

\[
\begin{aligned}
(TS_j, TC_j) &= \text{Compress}(X'_j \tilde{Y}_i + ZSR_j + ZCR_j + (Z_{carry}\ \text{AND}\ j == 0))\\
(TS_j, TC_j) &= \text{Compress}(TS_j + TC_j + \tilde{q}M_j + ZSM_j + ZCM_j)
\end{aligned}
\tag{5.67}
\]

Consequently, we can provide ZSR_0 and ZCR_0 to start the first stage of PE #2 after the two cycles needed to finish the A task of PE #1. Meanwhile, the generated ZSM_0 and ZCM_0 can be sent directly to the second stage of PE #2. Therefore, PE #2 can complete the A task on time.

Figure 5.16: New Data Dependency Graph with L = 2

The procedure can continue, and the resulting design belongs to the L2 scalable Montgomery MM category due to its issue latency of two cycles. Following the discussion around (5.64), the product of cycle count and PE count is proportional to N²/(mw) and not very sensitive to the value of L. However, our proposed 2-stage pipeline architecture shrinks the critical path of each stage and thus reduces the clock period. Compared with the L1 scalable Montgomery MM, the product of latency and area becomes smaller.

Ideally speaking, we could design a PE as an L-stage pipeline and implement a scalable Montgomery MM with an issue latency of L cycles. Although an L-stage pipeline has a smaller period, it is very hard to reduce the period by a factor of 1/L. Extra registers are also required to partition the combinational logic into L stages, and they contribute a non-negligible area overhead. Consequently, the product of latency and area would not decrease monotonically with increasing L, so we implement the 2-stage pipeline PE.

5.4.3.3 Hardware Implementation

Figure 5.17 presents a 2-stage implementation of the processing element, in which the compression is partitioned into two parts to balance the critical paths of both stages. Note that inputs ZSM_j_IN and ZCM_j_IN are sent to the compressor in the second stage without registering. The main functionality is similar to the PE in the L1 design, where we compute (TS_j, TC_j) = Compress(X′_j Ỹ_i + q̃M_j + ZSM_j + ZCM_j + ZSR_j + ZCR_j + Z_carry) in every iteration. The outputs TS_j and TC_j are registered and further split to form ZSR_j, ZCR_j, ZSM_{j−1}, ZCM_{j−1}, and Z_carry. When CT = 1 (i.e., j = 0), we update Z_carry as TS_j[m−1] OR TC_j[m−1]; otherwise (CT = 0) we set Z_carry = 0. We send 0 to registers ZSM_{j−1} and ZCM_{j−1} when CT = 1.

Figure 5.17: Internal Design of the Processing Element (PE) with L = 2

As for the PEc, we just need to remove the inputs ZSM′_j and ZCM′_j, as shown in figure 5.18. Because ZSM_j and ZCM_j are generated one cycle later than the other inputs, we need to delay the other inputs by one cycle. To ensure correct operation in the first iteration (j = 0), we force ZR_H, Carry, FBS, and FBC to be 0. For the remaining iterations, ZR_H, Carry, FBS, and FBC work as designed.
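The two-stage partition in (5.67) is sound because carry-save compression preserves the arithmetic sum regardless of how the addends are grouped across stages. A small Python check of this grouping property, using stand-in operand values, is sketched below.

```python
def csa(a, b, c):
    # 3:2 carry-save step: sum word and shifted carry word
    return a ^ b ^ c, ((a & b) | (a & c) | (b & c)) << 1

def compress(addends):
    # reduce a list of addends to a redundant (sum, carry) pair
    ops = list(addends)
    while len(ops) > 2:
        ops.extend(csa(ops.pop(), ops.pop(), ops.pop()))
    return ops[0], ops[1]

# Stage 1 consumes the early operands (standing in for ZSR/ZCR);
# stage 2 folds in the late arrivals (standing in for qM, ZSM, ZCM).
early, late = [41, 7, 3], [100, 55, 8]
ts, tc = compress(early)               # pipeline stage 1
ts2, tc2 = compress([ts, tc] + late)   # pipeline stage 2
assert ts2 + tc2 == sum(early) + sum(late)  # same redundant sum either way
```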
Figure 5.18: PEc Diagram with L = 2

5.5 Experimental Results

The proposed Montgomery modular multiplication algorithms were coded in the Verilog hardware description language and implemented using the default flow of the Xilinx Vivado Virtex-7 xc7v585tffg1157-3 device, including synthesis and implementation (place & route). Instead of optimizing the algorithm for a specific value of m, our iterative and scalable MM designs maintain the flexibility to implement different m values, reflecting the trade-off between latency and area. The improvements achieved by our implementation can be seen in terms of both hardware area and computation latency.

5.5.1 Iterative Montgomery Modular Multiplication

Table 5.1: Different Iterative Montgomery MM Designs on Virtex-7 FPGA

Design | N    | m  | Cycles | Period (ns) | Latency (µs) | LUT   | FF    | ATP    | Latency impr. (%) | Area impr. (%) | ATP impr. (%)
Alg 22 | 1024 | 4  | 264    | 3.00        | 0.79         | 17661 | 3120  | 16.46  |       |       |
       | 1024 | 8  | 136    | 3.60        | 0.49         | 23802 | 3152  | 13.20  |       |       |
       | 1024 | 16 | 70     | 4.90        | 0.34         | 45087 | 3187  | 16.56  |       |       |
       | 2048 | 4  | 520    | 3.40        | 1.77         | 32170 | 6177  | 67.80  |       |       |
       | 2048 | 8  | 264    | 3.90        | 1.03         | 49810 | 6250  | 57.72  |       |       |
       | 2048 | 16 | 134    | 5.20        | 0.70         | 88886 | 6271  | 66.31  |       |       |
[27]   | 1024 | 4  | 262    | 3.96        | 1.04         | 16883 | 5165  | 22.88  | 23.66 | 5.75  | 28.05
       | 1024 | 8  | 138    | 7.03        | 0.97         | 30339 | 9279  | 38.44  | 49.53 | 31.97 | 65.66
       | 2048 | 4  | 518    | 4.81        | 2.49         | 33734 | 10315 | 109.75 | 29.04 | 12.94 | 38.23
       | 2048 | 8  | 266    | 7.85        | 2.09         | 62023 | 18489 | 168.12 | 50.69 | 30.37 | 65.67
[28]   | 1024 | 4  | 264    | 3.83        | 1.01         | 22625 | 7533  | 30.49  | 21.67 | 31.09 | 46.03
       | 1024 | 8  | 132    | 6.88        | 0.91         | 34904 | 7427  | 38.44  | 46.09 | 36.33 | 65.67
       | 2048 | 4  | 522    | 4.57        | 2.39         | 36238 | 15066 | 122.39 | 25.89 | 25.26 | 44.60
       | 2048 | 8  | 262    | 7.69        | 2.01         | 79937 | 14700 | 190.67 | 48.90 | 40.76 | 69.73

(*) ATP stands for the product of the total area (LUT+FF) and end-to-end latency post-place&route (in ms).
(**) We collect the best performance of iterative MMM for a given m.

Table 5.1 reports the performance of different iterative Montgomery MM designs. References [27, 28] fully expanded the partial products when computing q^{(i)}M and XY_i. Instead, by implementing the modified Booth coding and the proposed Encode/EncodeSN, we removed half of the partial products. The analysis in section 5.2.3.5 prevents overflow for signed compression. Although the area complexity remains O(Nm), for the configuration N = 1,024 and m = 8 our design reduces the area cost by 31.97% and 36.33%, respectively. Meanwhile, the reduction of the critical path can be summarized in four points.

i. The proposed algorithm 21 breaks the dependency between q and Z and thus reduces the critical path.

ii. The proposed Encode/EncodeSN decomposes a full-bitwidth addition into a parallel collection of 2-bit additions, each of which needs only a constant delay.

iii. The observation that there are many valid choices for q^{(i)} in (5.5) removes the necessity of adding the results after performing Compress(Z̃^{(i−1)}_L M″ − q̂^{(i−1)}).

iv. The data model in figure 2.6 prevents overflow when adding two signed numbers and thus removes the necessity of adding the results after performing Compress(Z^{(i−1)} + XỸ_i r² + q̃^{(i−1)}M).

Consequently, for the configuration N = 1,024 and m = 8, our design reduces the computation latency by 49.53% and 46.09%, respectively. As a result, we save more than 63% in terms of the ATP metric. The advantages of our design in other configurations are also evident.
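For reference, the ATP entries in table 5.1 can be recomputed from the raw cycle, period, and area columns; the small Python helper below (ours, for bookkeeping only) shows the arithmetic for one row.

```python
def atp_ms(cycles, period_ns, lut, ff):
    """ATP = (LUT + FF) x end-to-end latency, in the table's (LUT+FF)*ms units."""
    latency_us = cycles * period_ns / 1000.0   # ns -> us
    return (lut + ff) * latency_us / 1000.0    # scale to ms units

# Alg. 22 row with N = 1024, m = 8: 136 cycles at 3.60 ns, 23802 LUT + 3152 FF.
print(round(atp_ms(136, 3.60, 23802, 3152), 2))   # 13.2, matching the table's 13.20
```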
When increasing m, the clock period gradually increases, so the savings in computation latency decrease accordingly. Based on the experimental results for our proposed Montgomery MM, we achieve the minimum value of the ATP metric at m = 8.

5.5.2 Full-word Montgomery Modular Multiplication

Finally, table 5.2 shows our proposed full-word Montgomery MM (algorithm 25) in a 5-stage pipelined architecture. Similar to the full-word Barrett MM implementations, we break the final addition into two stages and avoid a long critical path. To the best of the authors' knowledge, no prior reference presents full-word Montgomery MM results on an FPGA.

Table 5.2: 5-stage Full-word Montgomery MM Designs on Virtex-7 FPGA

N   | Period (ns) | LUT    | FF   | Total  | ATP ((LUT+FF)·µs)
32  | 3.6         | 3240   | 541  | 3781   | 13.61
64  | 4.9         | 11120  | 1061 | 12181  | 59.69
128 | 7.1         | 39629  | 2132 | 41761  | 296.50
256 | 9.7         | 153030 | 4451 | 157481 | 1527.57

(*) ATP stands for the product of the total area (LUT+FF) and period post-place&route (in µs).

5.5.3 Scalable Montgomery Modular Multiplication

Given a chip area constraint and the word size m of the multiplier, there are many possible kernel configurations with the number p of PEs and the word size w of the multiplicand for a designer-specified bitwidth N of the modulus M. The number of cycles is a function of e and p, as shown in equation (5.64). When e ≤ Lp, the cycle count is kLp + e + 1. A PE becomes free before we issue another word of the multiplier, meaning that we have instantiated too many PEs. Reducing p improves resource utilization until e = Lp. Note also that changing p does not affect the cycle count, since kp is roughly a constant. When e > Lp, the cycle count is ke + Lp + 1. A PE is still busy when we want to issue another word, so we have to wait for PEs to become available. Increasing p reduces k and thus lowers the cycle count until e becomes equal to Lp. Therefore, for a designer-specified bitwidth N of the modulus M, the optimum value of p is ⌈e/L⌉.

Table 5.3: Scalable Montgomery Modular Multiplication with N = 1024

L1 Scalable MMM (algorithm 32):
m  | w   | p  | Cycles | Period (ns) | Latency (µs) | LUT   | FF    | ATP
4  | 32  | 33 | 300    | 2.8         | 0.84         | 13107 | 5307  | 15.47
4  | 64  | 17 | 292    | 3.0         | 0.88         | 12575 | 5272  | 15.63
4  | 128 | 9  | 273    | 3.1         | 0.85         | 13959 | 5807  | 16.73
4  | 256 | 5  | 268    | 3.4         | 0.91         | 15408 | 7181  | 20.58
8  | 32  | 33 | 168    | 3.8         | 0.64         | 27730 | 6039  | 21.56
8  | 64  | 17 | 156    | 3.8         | 0.59         | 24004 | 5658  | 17.58
8  | 128 | 9  | 147    | 3.8         | 0.56         | 27431 | 6015  | 18.68
8  | 256 | 5  | 143    | 4.1         | 0.59         | 29161 | 7296  | 21.37
16 | 64  | 17 | 88     | 4.8         | 0.42         | 51812 | 6435  | 24.60
16 | 128 | 9  | 84     | 5.0         | 0.42         | 50143 | 6434  | 23.76
16 | 256 | 5  | 78     | 5.2         | 0.42         | 58479 | 7538  | 26.78

L2 Scalable MMM (algorithm 33):
m  | w   | p  | Cycles | Period (ns) | Latency (µs) | LUT   | FF    | ATP
4  | 32  | 17 | 580    | 1.8         | 1.04         | 7364  | 6752  | 14.74
4  | 64  | 9  | 542    | 2.1         | 1.14         | 7267  | 6700  | 15.90
4  | 128 | 5  | 532    | 2.1         | 1.12         | 8336  | 7411  | 17.65
4  | 256 | 3  | 530    | 2.3         | 1.22         | 10848 | 9191  | 24.34
8  | 32  | 17 | 308    | 2.2         | 0.68         | 15263 | 8928  | 16.39
8  | 64  | 9  | 290    | 2.3         | 0.67         | 13674 | 8426  | 14.74
8  | 128 | 5  | 282    | 2.5         | 0.71         | 14966 | 8991  | 16.89
8  | 256 | 3  | 272    | 2.7         | 0.73         | 18625 | 10903 | 21.69
16 | 64  | 9  | 164    | 3.0         | 0.49         | 28877 | 11034 | 19.64
16 | 128 | 4  | 152    | 3.1         | 0.47         | 29657 | 11078 | 19.19
16 | 256 | 3  | 146    | 3.2         | 0.47         | 36076 | 12933 | 22.90

(*) ATP stands for the product of the total area (LUT+FF) and end-to-end latency post-place&route (in ms).
(**) Sufficient PEs are instantiated for N = 1024.
(***) Two extra pipeline stages are inserted into the PEc to reduce the critical path.

Table 5.3 shows our implementations of the L1 and L2 scalable Montgomery MM algorithms under different configurations of (w, m). As discussed, the optimum p is ⌈e/L⌉ with e = ⌈(N+2m+1)/w⌉, which is almost independent of m when w ≥ 4m and thus decreases with the increase of w.
Meanwhile, increasing w means that a PE is more complex and hence requires more LUTs for the required compressions and encoding. Consequently, different implementations of the Montgomery MM with a fixed m have nearly the same area (in terms of the number of LUTs they need). Since the internal design of a PE consists of encoding and compression, the clock period has a complexity of only O(log m) and is nearly independent of w. However, when w becomes too large, the latency of the addition in the PEc dominates the clock period, so the Montgomery MM computation latency increases even though the cycle count is somewhat reduced. The placement and routing inside a PE also become more complex and require more area. Based on the area-time product, the configurations (32, 4), (64, 8), and (128, 16) for (w, m) achieve the best trade-off for both the L1 and L2 scalable Montgomery MM algorithms.

Table 5.4: Different Scalable Montgomery MM Designs on Virtex-7 FPGA

Design    | N    | m  | Cycles | Period (ns) | Latency (µs) | LUT   | FF    | ATP   | Latency impr. (%) | Area impr. (%) | ATP impr. (%)
L2 Alg 33 | 1024 | 4  | 580    | 1.80        | 1.04         | 7364  | 6752  | 14.74 |        |       |
          | 1024 | 8  | 290    | 2.30        | 0.67         | 13674 | 8426  | 14.74 |        |       |
          | 1024 | 16 | 152    | 3.10        | 0.47         | 29657 | 11078 | 19.19 |        |       |
L1 Alg 32 | 1024 | 4  | 300    | 2.80        | 0.84         | 13107 | 5307  | 15.47 | -24.29 | 23.34 | 4.72
          | 1024 | 8  | 156    | 3.80        | 0.59         | 24004 | 5658  | 17.58 | -12.52 | 25.49 | 16.15
          | 1024 | 16 | 84     | 5.00        | 0.42         | 50143 | 6434  | 23.76 | -12.19 | 28.00 | 19.23
[15]      | 1024 | 4  | 290    | 3.90        | 1.13         | 19124 | 4638  | 26.87 | 7.69   | 40.59 | 45.14
          | 1024 | 8  | 145    | 5.80        | 0.84         | 37613 | 4971  | 35.81 | 20.69  | 48.10 | 58.84

(*) ATP stands for the product of the total area (LUT+FF) and end-to-end latency post-place&route (in ms).
(**) We collect the best performance of scalable MMM for a given m.

Reference [15] is the state-of-the-art scalable Montgomery MM design (our contribution). It implemented the m-bit multiplication operation to compute the quotient q but replaced the multiplication operation with compression when computing the intermediate result Z. Since reference [15] cannot avoid overflow after signed compression, it did not use modified Booth coding and had to expand all partial products as unsigned numbers. The critical path thereby consists of an m-bit addition and two compressions of m+2 and 2m+4 numbers. In contrast, our new scalable Montgomery MM (either L1 or L2) allows the application of the modified Booth coding and encoding technique to reduce the number of partial products. Meanwhile, the computations of q and Z in each iteration become parallel. Compared with reference [15], our L2 scalable Montgomery MM saves 20.69% computation latency and 48.10% area, which amounts to a 58.84% saving in ATP.

The cycle count kLp + e + 1 roughly doubles when we switch from the L1 design to the L2 design. However, the insertion of stage registers cannot cut the clock period in half, due to setup/hold time constraints, placement, and routing. For example, the clock period of the L2 design with m = 16 and w = 128 is 3.1 ns, whereas that of the L1 design with the same configuration is 5 ns. The computation latency of the L2 design is thus larger than that of the L1 design. Note that the difference in computation latency shrinks as m increases, because the critical path becomes longer and we can achieve a more balanced insertion of stage registers. On the other hand, the number of PEs required for a presumed N is ⌈e/L⌉. Compared with the L1 design, we need only half the PEs and half the LUTs. However, the 2-stage pipeline requires more registers, so the area saving at m = 16 and w = 128 is 28% rather than 50%. We therefore save 19.23% ATP.
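The sizing rule p = ⌈e/L⌉ together with (5.64) predicts the cycle counts in tables 5.3 and 5.4. The sketch below assumes, per our reading of the table 5.3 footnote, that the two extra PEc pipeline stages add two cycles; with that assumption it reproduces the listed L1 cycle counts.

```python
from math import ceil

def kernel_size(N, m, w, L):
    """Optimum PE count p = ceil(e/L) and the cycle count of (5.64),
    plus 2 cycles for the extra PEc stages (our reading of the footnote)."""
    e = ceil((N + 2*m + 1) / w)
    p = ceil(e / L)
    k = ceil((N + 2*m + 4) / (m*p))
    cycles = (k*L*p + e + 1 if e <= L*p else k*e + L*p + 1) + 2
    return p, cycles

# Two L1 rows of table 5.3 (N = 1024):
print(kernel_size(1024, 4, 32, L=1))   # (33, 300)
print(kernel_size(1024, 8, 64, L=1))   # (17, 156)
```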
Although LUT+FF is the most common way to estimate area on an FPGA, this metric may not be entirely fair since we do not know the actual silicon areas of a LUT and an FF. In the future, we can implement all designs as ASIC circuits and obtain a more precise area comparison.

5.6 Conclusion

In this chapter, we presented several efficient Montgomery MM algorithms. Our proposed iterative Montgomery MM successfully replaces costly multiplication and addition with compression and encoding in each iteration. The proposed encoding technique avoids the full-size addition when computing the required quotient. Moreover, by manipulating the partial products to fit the overflow prevention model, we solved the overflow problem associated with modified Booth coding when applied to signed numbers. Consequently, the critical path delay for performing iterative Montgomery MM is dramatically reduced. Experimental results show the reduction in area and time of our design compared to the latest published results.

Our proposed full-word Montgomery MM removes the necessity of completing the qM multiplication. The truncated compression removes almost half of the bits when generating the partial products of qM. All multiplications are replaced with compression based on our overflow prevention and correction methods, and the proposed encoding technique avoids the full-size additions. Our careful error analysis guarantees that the final result is in the range (−M, M), so we need only one correction.

Finally, we transform our iterative Montgomery MM into a scalable Montgomery MM by computing intermediate results through an inner loop. Our L1 scalable Montgomery MM allows an issue latency of one cycle to save computation latency. Rather than being constrained by the data dependency graph of classic scalable designs, we implemented the processing elements (PEs) in a pipelined architecture and proposed a new data dependency graph. The resulting L2 scalable Montgomery MM removes half of the PEs, while the pipelined PEs also reduce the clock period. The product of area and time thus achieves its minimum.

Chapter 6 Summary and Future Work

This chapter concludes the thesis. First, we summarize the achieved contributions. Next, we describe challenges that are worth studying in the future.

6.1 Summary

6.1.1 Overflow Optimization and Encode

We propose two methods to solve overflow problems: overflow prevention and overflow correction. The overflow prevention method allows us to compress signed numbers into two signed numbers such that their sum does not overflow. The overflow correction method corrects any potential overflow with only a 2-bit addition rather than a full-bitwidth addition. These two methods allow us to replace multiplication with compression.

Moreover, our proposed encoding technique provides a quick path to skip the addition in expressions like (A+B)C, ⌈(A+B)/2^k⌉C, ⌊(A+B)/2^k⌋C, and round((A+B)/2^k)C. By grouping and encoding several bits of A and B, we can directly generate a partial product. The encoding technique requires only a constant delay, independent of the bitwidths of A and B. Many advanced operations, such as consecutive multiplications and polynomial evaluation, become much more efficient. Our research on overflow optimization and encoding replaces multiplication and addition with compression and encoding for all five main categories of modular multiplication: (1) iterative Barrett MM; (2) full-word Barrett MM; (3) iterative Montgomery MM; (4) full-word Montgomery MM; (5) scalable Montgomery MM. These optimizations dramatically reduce the computation latency and hardware area.
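To illustrate the encoding idea in its simplest form, the following Python toy derives each partial product of (A+B)C from one bit pair of A and B in constant delay, so A+B is never formed explicitly. This radix-2 unsigned sketch is only a stand-in for the radix-4 modified Booth encoding with signed digits used in the thesis; the function name encoded_mul is ours.

```python
def encoded_mul(A, B, C, nbits=16):
    """Radix-2 toy version of the (A+B)C encoding: every partial product
    comes from one bit pair (a_i, b_i), so the sum A+B is never computed."""
    pps = []
    for i in range(nbits):
        d = ((A >> i) & 1) + ((B >> i) & 1)  # per-position digit in {0, 1, 2}
        pps.append((d * C) << i)             # 0, C, or 2C, then shift
    return sum(pps)                          # hardware would compress these

assert encoded_mul(1234, 777, 999) == (1234 + 777) * 999
```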
6.1.2 Barrett Modular Multiplication

As a popular modular multiplication algorithm, Barrett MM is widely implemented in many cryptosystems. Rather than computing the quotient q through the division of the intermediate result Z by the modulus M, a precomputed parameter µ is used to approximate 1/M and replace the division with multiplication and shift.

6.1.2.1 Iterative Barrett Modular Multiplication

In order to ensure that the intermediate result Z in every iteration lies in a fixed range such as [0, 2M), classic iterative Barrett MM requires us to sequentially compute the quotient q first and the intermediate result Z next. Instead, our proposed iterative Barrett MM successfully breaks this data dependency. Specifically, in each iteration we have two parallel tasks: (1) place the intermediate result Z^{(i)} into the range (−1.23Mr, 13.9Mr) with r = 2^m by subtracting a multiple of the modulus, q^{(i+1)}M, and adding the partial products XỸ_i; (2) compute an associated quotient q^{(i)} to maintain the relationship Z^{(i)} − q^{(i)}M ∈ (−M, M). A detailed proof based on mathematical induction and careful error analysis guarantees that the final result is also in the range (−M, M). Only one correction is required to put the final result back into the range [0, M). Compared with the state-of-the-art iterative Barrett MM, we saved 50.54% ATP at m = 8.

6.1.2.2 Full-word Barrett Modular Multiplication

The goal of full-word Barrett MM is always to reduce area as much as possible. Due to the non-linear ceiling/floor/round operations, the computation of the quotient would seem to require a complete multiplication. However, our proposed full-word Barrett MM proves that it is safe to remove almost half of the partial products before the compression in the quotient computation.

6.1.2.3 Extension to All Moduli

The most commonly used Barrett MM assumes that the most significant bit of the modulus is always 1. This assumption simplifies division by a power of 2 to a constant shift, which is equivalent to picking the corresponding wires in hardware. The classic extension to support almost all moduli requires two sequential dynamic shifters on the critical path and thus increases computation latency. In contrast, we proposed a new extension method that is suitable for both iterative and full-word Barrett MM, with only one dynamic shifter on the critical path.

6.1.3 Montgomery Modular Multiplication

Montgomery MM is another algorithm widely implemented in many cryptosystems; it computes modular multiplication in the Montgomery domain. All multiplications, additions, and subtractions in the original domain can be implemented homomorphically in the Montgomery domain. Therefore, the cost of the domain transformations between the original domain and the Montgomery domain is amortized and negligible as long as many operations are performed in the Montgomery domain.

6.1.3.1 Iterative Montgomery Modular Multiplication

The main procedure in each iteration of iterative Montgomery MM is to find a quotient q for the intermediate result Z so as to ensure Z + qM ≡ 0 (mod r). Similar to iterative Barrett MM, this purpose also requires us to sequentially compute the quotient q first and the intermediate result Z next. Without doubt, we succeed in breaking this data dependency. In each iteration, we implement two tasks: (1) set the least significant m bits of the intermediate result Z^{(i)} to 0 by adding the partial products XỸ_i r² and q^{(i−1)}M and then right-shifting by m bits; (2) compute an associated quotient q^{(i)} to maintain the relationship Z^{(i)} + q^{(i)}M ≡ 0 (mod r). A detailed proof based on mathematical induction guarantees that the final result is in the range (−M, M) and congruent to XYR^{−1}. Only one correction is required to put the final result back into the range [0, M). Compared with the state-of-the-art iterative Montgomery MM designs, we saved 65.66% and 65.67% ATP, respectively.
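For readers unfamiliar with the Montgomery recurrence Z + qM ≡ 0 (mod r), the textbook word-level Python sketch below shows the quotient selection, the exact division (here by a single full-width R), and the domain round-trip; it is a reference model, not our hardware formulation, and M2 stands in for the precomputed constant M″.

```python
def redc(Z, M, R, M2):
    """Textbook Montgomery reduction: returns Z * R^{-1} mod M, given
    M2 = -M^{-1} mod R (the analogue of M'' in the algorithms above)."""
    q = (Z * M2) % R            # choose q so the low bits of Z + qM vanish
    t = (Z + q * M) // R        # exact division: Z + qM is a multiple of R
    return t - M if t >= M else t

M, R = 0xF1, 1 << 8                     # small odd modulus, R = 2^8 > M
M2 = (-pow(M, -1, R)) % R               # precomputed -M^{-1} mod R

x, y = 100, 200
xm, ym = (x * R) % M, (y * R) % M       # map into the Montgomery domain
zm = redc(xm * ym, M, R, M2)            # Montgomery product of x and y
assert redc(zm, M, R, M2) == (x * y) % M  # map back: ordinary product mod M
```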
6.1.3.2 Full-word Montgomery Modular Multiplication

We also want to save area when implementing full-word Montgomery MM. Although we can guarantee that the final result is a multiple of a power of 2 before the right shift, this property would also seem to require completing the multiplication. However, our proposed full-word Montgomery MM shows that it is safe to remove almost half of the partial products before the compression of qM. We can still find the correct result in the range (−M, M) and need only one correction.

6.1.3.3 Scalable Montgomery Modular Multiplication

Scalable Montgomery MM is an important variant of modular multiplication. After designing a complex but efficient processing element (PE), we can instantiate many PEs and connect them properly as a kernel. With the support of a memory management unit, we can use a fixed kernel to support modular multiplication of arbitrary size, albeit over many cycles.

We first transform our latest iterative Montgomery MM into the L1 scalable Montgomery MM by decomposing and computing the intermediate result Z^{(i)} through an inner loop. The resulting L1 algorithm allows an issue latency of one cycle: we can start a new PE immediately after we finish the first cycle of the current PE. Further, we propose a new data dependency graph associated with the pipelined PE. The implemented 2-stage pipelined PE belongs to the L2 scalable Montgomery MM category, but it greatly reduces the critical path and also needs fewer PEs. As a result, our L2 scalable Montgomery MM saves 58.84% ATP compared with the best state-of-the-art design.

6.2 Future Work

Our overflow solution and encoding technique allow us to replace multiplication with compression. Modular multiplication is just one example of the advanced operations to which these optimizations apply; there are many other applications. For example, we can easily extend a digital signal processor (DSP) from computing the expression AB + C to expressions such as (A+B)C + D.

Based on our highly efficient modular multiplications, we can expect remarkable improvements in several research directions.

i. As the main cryptosystems used all over the world, RSA and ECC consist of many modular multiplications. A good research direction would be a new hardware accelerator based on our proposed modular multiplication, so as to speed up encryption, decryption, etc.

ii. RSA and ECC have been shown to be vulnerable to quantum attacks due to Shor's algorithm [29] running on quantum computers. Lattice-based cryptosystems (SHE and FHE) have been shown to be quantum-attack-resistant cryptosystems. The homomorphic multiplications of SHE and FHE rely on the ring multiplication of two polynomials, each polynomial containing n coefficients and representing a ciphertext in a ring of size n. Ring multiplication is typically performed by applying a forward number-theoretic transform (NTT) [30] to the two polynomials, followed by a Hadamard (pointwise) modular multiplication of the NTT-transformed results, and concluded by an inverse NTT of the Hadamard computation (see the sketch after this list). The use of the NTT reduces the number of modular multiplications from n² to (3/2)n log n + n. A good research direction is to design an efficient NTT architecture.

iii. As quantum-attack-resistant cryptosystems, lattice-based cryptosystems (SHE and FHE) are potential candidates that preserve user privacy and enable multiplication/addition on ciphertexts. For example, users can safely upload ciphertexts to a cloud platform and perform computationally expensive operations there without conveying sensitive plaintext. However, FHE was first proposed in 2009 [5], and the performance of ciphertext operations lags that of plaintext operations by several orders of magnitude. A hardware accelerator to speed up FHE operations, together with the development of the corresponding software, is a very important but difficult research direction.
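The sketch referenced in item ii above is a naive cyclic NTT multiplication in Python with toy parameters (P = 17, n = 4, W = 4 are ours); ring-LWE schemes typically work in the negacyclic ring modulo x^n + 1, and a real accelerator would use an O(n log n) butterfly datapath rather than this O(n²) evaluation.

```python
P, n, W = 17, 4, 4   # toy prime P, length n, W a primitive n-th root of unity mod P

def ntt(a, root):
    """Naive O(n^2) number-theoretic transform mod P: evaluate the
    polynomial a at the n powers of `root`."""
    return [sum(a[j] * pow(root, i * j, P) for j in range(n)) % P
            for i in range(n)]

def cyclic_mul(a, b):
    """Multiply a and b mod (x^n - 1) and mod P: forward NTT, Hadamard
    (pointwise) product, then inverse NTT with the 1/n scaling."""
    fc = [x * y % P for x, y in zip(ntt(a, W), ntt(b, W))]
    n_inv = pow(n, -1, P)
    return [c * n_inv % P for c in ntt(fc, pow(W, -1, P))]

# (1 + 2x)(3 + 4x) = 3 + 10x + 8x^2, coefficients well below P:
print(cyclic_mul([1, 2, 0, 0], [3, 4, 0, 0]))   # [3, 10, 8, 0]
```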
References

[1] R. L. Rivest, A. Shamir, and L. Adleman, "A method for obtaining digital signatures and public-key cryptosystems," Communications of the ACM, vol. 21, no. 2, pp. 120–126, 1978.

[2] N. Koblitz, "Elliptic curve cryptosystems," Mathematics of Computation, vol. 48, no. 177, pp. 203–209, 1987.

[3] M. Naehrig, K. Lauter, and V. Vaikuntanathan, "Can homomorphic encryption be practical?" in Proceedings of the 3rd ACM Workshop on Cloud Computing Security, 2011, pp. 113–124.

[4] R. L. Rivest, L. Adleman, M. L. Dertouzos et al., "On data banks and privacy homomorphisms," Foundations of Secure Computation, vol. 4, no. 11, pp. 169–180, 1978.

[5] C. Gentry, "Fully homomorphic encryption using ideal lattices," in Proceedings of the Forty-First Annual ACM Symposium on Theory of Computing, 2009, pp. 169–178.

[6] P. Barrett, "Implementing the Rivest Shamir and Adleman public key encryption algorithm on a standard digital signal processor," in Conference on the Theory and Application of Cryptographic Techniques. Springer, 1986, pp. 311–323.

[7] P. L. Montgomery, "Modular multiplication without trial division," Mathematics of Computation, vol. 44, no. 170, pp. 519–521, 1985.

[8] N. H. Weste, CMOS VLSI Design: A Circuits and Systems Perspective. Orthogonal Publishing L3C, 2011.

[9] O. L. MacSorley, "High-speed arithmetic in binary computers," Proceedings of the IRE, vol. 49, no. 1, pp. 67–91, 1961.

[10] V. Bunimov and M. Schimmler, "Area and time efficient modular multiplication of large integers," in Proceedings IEEE International Conference on Application-Specific Systems, Architectures, and Processors (ASAP 2003). IEEE, 2003, pp. 400–409.

[11] Y.-Y. Zhang, Z. Li, L. Yang, and S.-W. Zhang, "An efficient CSA architecture for Montgomery modular multiplication," Microprocessors and Microsystems, vol. 31, no. 7, pp. 456–459, 2007.

[12] C. McIvor, M. McLoone, and J. V. McCanny, "Fast Montgomery modular multiplication and RSA cryptographic processor architectures," in The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, 2003, vol. 1. IEEE, 2003, pp. 379–384.

[13] P. M. Kogge and H. S. Stone, "A parallel algorithm for the efficient solution of a general class of recurrence equations," IEEE Transactions on Computers, vol. 100, no. 8, pp. 786–793, 1973.

[14] R. P. Brent and H. T. Kung, "A regular layout for parallel adders," IEEE Transactions on Computers, vol. 31, no. 3, pp. 260–264, 1982.

[15] B. Zhang, Z. Cheng, and M. Pedram, "High-radix design of a scalable Montgomery modular multiplier with low latency," IEEE Transactions on Computers, 2021.

[16] Z. Gu and S. Li, "A division-free Toom–Cook multiplication-based Montgomery modular multiplication," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 66, no. 8, pp. 1401–1405, 2018.

[17] A. F. Tenca and Ç. K. Koç, "A scalable architecture for modular multiplication based on Montgomery's algorithm," IEEE Transactions on Computers, vol. 52, no. 9, pp. 1215–1221, 2003.
[18] M.-D. Shieh and W.-C. Lin, "Word-based Montgomery modular multiplication algorithm for low-latency scalable architectures," IEEE Transactions on Computers, vol. 59, no. 8, pp. 1145–1151, 2010.

[19] M. Huang, K. Gaj, and T. El-Ghazawi, "New hardware architectures for Montgomery modular multiplication algorithm," IEEE Transactions on Computers, vol. 60, no. 7, pp. 923–936, 2010.

[20] J. Ding and S. Li, "A modular multiplier implemented with truncated multiplication," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 65, no. 11, pp. 1713–1717, 2017.

[21] B. Zhang, Z. Cheng, and M. Pedram, "A high-performance low-power Barrett modular multiplier for cryptosystems," in 2021 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED). IEEE, 2021, pp. 1–6.

[22] E. Bézout, Théorie générale des équations algébriques. De l'imprimerie de Ph. D. Pierres, 1779.

[23] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms. MIT Press, 2009.

[24] H. Orup, "Simplifying quotient determination in high-radix modular multiplication," in Proceedings of the 12th Symposium on Computer Arithmetic. IEEE, 1995, pp. 193–199.

[25] T. Wu, S. Li, and L. Liu, "Fast RSA decryption through high-radix scalable Montgomery modular multipliers," Science China Information Sciences, vol. 58, no. 6, pp. 1–16, 2015.

[26] D. Harris, R. Krishnamurthy, M. Anders, S. Mathew, and S. Hsu, "An improved unified scalable radix-2 Montgomery multiplier," in 17th IEEE Symposium on Computer Arithmetic (ARITH'05). IEEE, 2005, pp. 172–178.

[27] J.-S. Pan, P. Song, and C.-S. Yang, "Efficient digit-serial modular multiplication algorithm on FPGA," IET Circuits, Devices & Systems, vol. 12, no. 5, pp. 662–668, 2018.

[28] S. S. Erdem, T. Yanık, and A. Çelebi, "A general digit-serial architecture for Montgomery modular multiplication," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 25, no. 5, pp. 1658–1668, 2017.

[29] P. W. Shor, "Polynomial-time algorithms for prime factorization and discrete logarithms on a quantum computer," SIAM Review, vol. 41, no. 2, pp. 303–332, 1999.

[30] C. Du and G. Bai, "Towards efficient polynomial multiplication for lattice-based cryptography," in 2016 IEEE International Symposium on Circuits and Systems (ISCAS). IEEE, 2016, pp. 1178–1181.
Abstract
Constituting many advanced mathematical operations, multiplication has been thoroughly studied from many perspectives, including latency, area, and power. The modified Booth encoding is a very important optimization that reduces the number of partial products during multiplication. Since the compression of the partial products could overflow, a final addition of the compression results is required to obtain the correct product. Therefore, multiplication is usually implemented as a black box within other operations.
In this thesis, we present two ways to solve the overflow problem and simplify multiplication into compression. Further, a novel encoding with a constant delay is proposed to skip the full-bitwidth addition in the expression (A+B)C. These two optimizations allow us to replace multiplication and addition/subtraction with compression and encoding in any linear mathematical operation consisting of multiplications and additions/subtractions. The critical path is thus dramatically reduced; we only need an addition at the very end of the operations to produce the final result. The proposed encoding is also extended to support ceiling, floor, and round operations, as in the expressions ceil((A+B)/2^k), floor((A+B)/2^k), and round((A+B)/2^k). Although these are non-linear operations, we can still skip the addition A+B and the division by 2^k using our constant-delay encoding technique.
Stemming from abstract algebra, linear algebra, etc., modern cryptosystems are commonly used all over the world to secure sensitive communications. In the context of cryptosystems, modular addition/subtraction and modular multiplication are the two primitive operations that significantly affect performance. More precisely, modular operations with bitwidths ranging from hundreds to thousands of bits are desirable to ensure a high security level. Modular addition/subtraction is fairly simple, whereas modular multiplication (MM) is quite complex. Barrett MM and Montgomery MM are the most widely used algorithms, since they replace the long division of classic MM with multiplication and shift.
Depending on whether the bits of the multiplier are digested in one shot or over many cycles, a modular multiplication can be categorized as full-word MM or iterative MM. Full-word MM is usually implemented in a pipelined structure so as to achieve high throughput, and we thereby need to reduce silicon area as much as possible. Iterative MM scans some bits of the multiplier in every iteration, so it is necessary to optimize both computation latency and silicon area. Moreover, non-decomposability is a property of modular multiplication: a large-size modular multiplication cannot be decomposed into a collection of small-size modular multiplications. Scalable Montgomery MM, a special variant of Montgomery MM, is able to perform arbitrary-size modular multiplication with fixed resource utilization, as long as a memory management unit buffers the intermediate results. As a result, there are five different modular multiplications for different applications and contexts: full-word Barrett MM, iterative Barrett MM, full-word Montgomery MM, iterative Montgomery MM, and scalable Montgomery MM.
This thesis presents remarkable improvements to all five modular multiplications. Without a doubt, our proposed overflow solution and encoding technique are applied to all designs, but more importantly, we have many optimizations specific to each modular multiplication. Generally speaking, for both Barrett and Montgomery, a full-word MM requires three normal multiplications in series, and an iterative MM needs sequential computation of the quotient and the intermediate result in each iteration. Although the optimizations of the five modular multiplications are proved in completely different ways, we propose a full-word Barrett MM and a full-word Montgomery MM in which two of the three multiplications are truncated to only half-size partial products. Our proposed iterative Barrett MM and iterative Montgomery MM are able to compute the quotient and the intermediate result in parallel. Finally, the new scalable Montgomery MM has a new data dependency graph that requires less computation latency and hardware area than state-of-the-art references.