---
title: "RSA Optimization"
date: 2022-12-06
draft: false
---

## INTRODUCTION

RSA is a public key cryptosystem named after the creators of the algorithm: Rivest, Shamir, and Adleman [@STALLINGS]. It is widely used for both confidentiality and authentication. The main advantage of RSA is that the public key is published, so anyone can use it to encrypt messages to the owner of that key. Unlike symmetric key schemes, RSA does not require the sender and receiver to agree on a common key to encrypt and decrypt messages. To send an encrypted message, the encryptor only has to look up the public key of the intended recipient.

The RSA algorithm works as follows. First, a public and private key pair is generated. The public key is published and can be used to encrypt messages to the owner and to verify the owner's signatures. The corresponding private key is known only to the owner and can be used to decrypt messages encrypted with the owner's public key. The private key can also be used to sign messages.

The public and private keys are generated by choosing two distinct large primes, `p` and `q`. Then `n` is computed by multiplying `p` and `q` together. Since `n` is the product of two primes, we can compute Euler's totient function by `φ(n) = (p-1)(q-1)`. Then, an integer `e` is chosen such that `1 < e < φ(n)` [...]

[...] `result := result*3 = (3^2)^2*3 = 3^5 = 243`

Iteration for digit 4: `result := result^2 = ((3^2)^2*3)^2 = 3^10 = 59049`; the fourth digit of `1010` (binary) equals "0".

`return result`

The idea is to implement this exponentiation-by-squares algorithm in the processor as a `pow` command. The `pow` command takes the destination register, the register containing the base, and the register containing the exponent. It then runs the loop using one multiplier to accumulate the squares and a second multiplier to accumulate the value whenever a 1 digit is hit. The loop grabs one binary digit at a time from the exponent register.
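The exponentiation-by-squares loop described above can be sketched in software as follows. This is only an illustrative Python model under stated assumptions, not the proposed hardware `pow` instruction: the function names `pow_by_squares` and `pow_mod_by_squares` are hypothetical, and the modular variant is added here because RSA reduces every intermediate product modulo `n`.

```python
def pow_by_squares(base: int, exponent: int) -> int:
    """Left-to-right exponentiation by squaring (hypothetical helper)."""
    if exponent < 0:
        raise ValueError("exponent must be non-negative")
    result = 1
    # Scan the binary digits of the exponent, most significant first.
    for digit in bin(exponent)[2:]:
        result = result * result        # always square
        if digit == "1":
            result = result * base      # multiply only when the digit is 1
    return result


def pow_mod_by_squares(base: int, exponent: int, modulus: int) -> int:
    """Same loop, reducing modulo `modulus` after every multiplication."""
    if exponent < 0 or modulus <= 0:
        raise ValueError("need exponent >= 0 and modulus > 0")
    result = 1 % modulus
    for digit in bin(exponent)[2:]:
        result = (result * result) % modulus
        if digit == "1":
            result = (result * base) % modulus
    return result


if __name__ == "__main__":
    # Exponent 10 is 1010 in binary, matching the worked example.
    print(pow_by_squares(3, 10))
    # Cross-check the modular variant against Python's built-in pow.
    print(pow_mod_by_squares(4, 13, 497) == pow(4, 13, 497))
```

For the exponent `1010` (binary), the loop squares four times and multiplies by the base twice, passing through the same intermediate values, 243 and 59049, as the worked example.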
We always run the result accumulator and, depending on the digit, may also use the second accumulator. The value then passes through one more multiplier, which multiplies the accumulated squares by the output of the 1's multiplier.

## JUSTIFICATION AND ANALYSIS

In this section, we describe how our specialized instructions improve the performance of RSA encryption.

### ENCRYPTION AND DECRYPTION EXPONENT

Using the modular instruction in the Euclidean Algorithm, we can reduce the number of stalls needed. Without it, the loop must stall for the result of the divide and for the result of the multiply; the modular instruction does not have to stall for any previous result. In the loop of the Euclidean Algorithm without the modular instruction, two of the five instructions need to stall to wait for a previous result.

`CPI = Ideal CPI + stalls per instruction`

`CPI_old = 1 + 2/5 = 1.4`

`CPI_new = 1` (since no instructions in the loop need to stall for a result)

Thus the speedup for the loop of the Euclidean Algorithm is:

`Speedup = CPI_old / CPI_new = 1.4 / 1 = 1.4`

Thus, with respect to stalls, we have a speedup of 1.4 by using the modular instruction in the Euclidean Algorithm. Of course, the latency of the modular instruction is likely to be longer than the latency of the divide, multiply, and subtract instructions, since it performs three calculations in one instruction. However, since we are building a system only for RSA, we can use specialized hardware to reduce the latency of the modular instruction and so obtain the speedup of 1.4. As mentioned in the design section, another advantage of the modular instruction is that it reduces the number of temporary registers from three to one.

We assume that the store instruction finishes in one clock cycle. Thus, in the Extended Euclidean Algorithm, three of the thirteen instructions in the loop need to stall for previous results.
However, by using the modular and multiply-subtract instructions, no instruction needs to stall to wait for a previous result to be available.

`CPI = Ideal CPI + stalls per instruction`

`CPI_old = 1 + 3/13 = 1.23`

`CPI_new = 1` (since no instructions in the loop need to stall for a result)

Thus the speedup for the loop of the Extended Euclidean Algorithm is:

`Speedup = CPI_old / CPI_new = 1.23 / 1 = 1.23`

Thus, with respect to stalls, we have a speedup of 1.23 by using the modular and multiply-subtract instructions in the Extended Euclidean Algorithm. As mentioned above, the latency of the modular and multiply-subtract instructions is likely to be higher than that of the traditional instructions. However, we are only building a system for RSA encryption and decryption, so we can build hardware that reduces the latency of the modular and multiply-subtract instructions and so obtain the speedup of 1.23. A further advantage of the modular and multiply-subtract instructions is that they reduce the number of temporary registers needed from five to three.

### PRIME NUMBER GENERATION, ENCRYPTION, AND DECRYPTION

Using the `pow` command, we can cut the stalls for large exponents in half. Because the algorithm breaks the exponent into its binary representation and loops through the digits, the worst-case scenario is looping half the number of times. One example is an exponent of eight, which is 1000 in base two. Since the number of loop iterations is halved, the number of multiplies is also halved. Even when the base-two representation of the exponent contains many ones, the extra multiplications can use a separate multiplier, since they do not rely on the result of the squares. From the pseudocode above, we can assume that approximately 1/4 of the instructions are multiplies, so 1/4 will stall.
Using that estimate, together with the fact that the number of stalls is cut in half, we can use the following equations to determine the overall speedup:

`CPI = Ideal CPI + stalls per instruction`

`CPI_old = 1 + 1/4 = 1.25`

`CPI_new = 1 + (1/4)/2 = 1 + 1/8 = 1.125`

Thus the speedup of this section is:

`Speedup = CPI_old / CPI_new = 1.25 / 1.125 = 1.11`

## CONCLUSIONS

In analyzing the typical algorithms used as part of RSA, we identified two primary bottlenecks in both encryption and decryption: modulus and exponentiation operations. Because these two operations are so prevalent, we propose that they warrant specialized instructions. We also found that a multiply-subtract instruction increases the speedup when finding a decryption exponent. These specialized instructions improve performance by reducing the number of stalls, bringing the CPI closer to the ideal CPI.

The process of finding encryption and decryption exponents benefits from the increased efficiency of modulus operations. Likewise, prime generation and encryption and decryption in general benefit from a faster exponentiation operation. By implementing these instructions, the whole procedure gains an overall speedup, built from specific speedups of 1.11 for all uses of exponentiation, 1.4 for finding an encryption exponent, and 1.23 for finding a decryption exponent. Thus, we met our goal of improving the performance of the RSA algorithm by creating a specialized instruction set architecture.

A further investigation would be to look at specific hardware that supports our instruction set architecture. To gain the maximum speedup, we need hardware that reduces the latency of all the specialized instructions to a negligible amount.
Since we are creating a system with the sole task of encrypting and decrypting messages using RSA, we can create hardware that allows the specialized instructions to have a latency comparable to that of the traditional instructions.

## Bibliography

Beauchemin, Pierre, Brassard, Crepeau, Claude, Goutier, Claude, and Pomerance, Carl. The Generation of Random Numbers That Are Probably Prime.

Bellare, M., Garay, J., and Rabin, T. Fast Batch Verification for Modular Exponentiation and Digital Signatures. *Advances in Cryptology - Eurocrypt 98 Proceedings, Lecture Notes in Computer Science*, 1998.

Boneh, Dan and Shacham, Hovav. Fast Variants of RSA. *Cryptobytes*, 2002.

Cohen, H. and Frey, G. (editors). Handbook of Elliptic and Hyperelliptic Curve Cryptography. Discrete Math. Appl., Chapman and Hall/CRC, 2006.

Cormen, T., Leiserson, C., Rivest, R., and Stein, C. Introduction to Algorithms, Second Edition. MIT Press and McGraw-Hill, 2001.

Stallings, W. Cryptography and Network Security, 2nd ed. Prentice-Hall, 1998.