Creating a repo for everything on website

2023-03-30 13:50:07 -04:00
parent 41b9dc1e12
commit d235c415b8
37 changed files with 1652 additions and 2 deletions
--- a/content/posts/rsa-optimization.md
+++ b/content/posts/rsa-optimization.md
@@ -0,0 +1,524 @@
+---
+title: "RSA Optimization"
+date: 2022-12-06
+draft: false
+---
+
+# INTRODUCTION
+
+RSA is a public key cryptosystem, which was named after the creators of
+the algorithm: Rivest, Shamir, and Adleman [@STALLINGS]. It is widely
+used for both confidentiality and authentication. The main advantage for
+using RSA is the keys are created in such a way that the public key is
+publish and can be used to encrypt all messages to the owner of the
+public key. Unlike symmetric key schemes, RSA does not require sender
+and receiver to agree on a common key to encrypt and decrypt messages.
+To send an encrypted message, the encryptor only has to look up the
+public key of the intended recipient.
+
+The RSA algorithm works as follows. First, a public key and private key
+pair are generated. The public key is published and can be used to
+encrypt messages and verify signatures of the owner of the public key.
+The corresponding private key is known only to the owner and can be used
+to decrypt messages encrypted with the owner's public key. The private
+key can also be used to sign messages. The public and private keys are
+generated by choosing two distinct large primes, `p` and `q`. Then `n`
+is computed by multiplying `p` and `q` together. Since `n` is the
+product of two primes, we can compute Euler's totient function by
+`φ(n)=(p-1)(q-1)`. Then, an integer `e` is chosen such that
+`1<e<φ(n)` and `gcd(e,φ(n))=1`, where gcd stands for
+greatest common denominator. Finally an integer `d` is computed such
+that `d=e^{-1}(mod φ(n))`. Thus, the public key is the modulus
+`n` and the encryption exponent `e`. The private key is the decryption
+exponent `d`.
+
+RSA is criticized for the amount of time it takes to encrypt and decrypt
+messages. To encrypt a message, the sender uses the public key, `(n,e)`.
+The message is converted to an integer `m`, such that `0<m<n`. Then, the
+ciphertext message `c` is computed by `c=m^{e}(mod n)`. The message
+can be recovered by using the private key, `d`, and the formula
+`m=c^{d}(mod n)`. Currently to be considered secure, the size of `n`
+needs to be 1024 bits or more. This makes RSA computationally slow
+because arithmetic modulo large numbers is notoriously slow. For
+example, on small handheld devices, RSA decryption with 1024 bit modulus
+can take up to 40 seconds [@BONEH]. Also, RSA decryption can
+significantly reduce the number of requests a web server can handle. In
+addition, as computers become more powerful, 1024-bit RSA will not be
+considered secure. This means that in the future, the size of the
+modulus will increase. Therefore, we need to create methods we can use
+RSA in a practical fashion.
+
+To improve the speed of RSA encryption and decryption, we propose to
+create an instruction set architecture that will be made solely for RSA.
+In this paper, we will talk strictly about using RSA for encrypting and
+decrypting messages. However the same instruction set architecture we
+propose in this paper can be used for signing and verifying messages
+with RSA.
+
+# CHARACTERISTICS OF RSA
+
+There are three areas that the RSA can be optimized: finding the
+encryption and decryption exponent, prime number generation, and
+encrypting and decrypting messages. In this section we will describe how
+finding the encryption and decrypting exponent, prime number generation,
+and encrypting and decrypting the message is usually done, without
+specialized instructions.
+
+## ENCRYPTION AND DECRYPTION EXPONENT
+
+The current approach to verifying that the encryption exponent is
+coprime to the `φ(n)` is by using the Euclidean Algorithm. To find
+the decryption exponent, the Extended Euclidean Algorithm is used.
+Therefore, we will include these algorithms in our architecture and
+create instructions to accommodate the Euclidean Algorithm and Extended
+Euclidean Algorithm.
+
+The Euclidean Algorithm can be executed using the following pseudocode
+[@STALLINGS]:
+
+    ;; load phi of n (computed using Euler's totient function) and encryption exponent
+    load r1, location_of_phi(n)
+    load r2, location_of_encryption_exponent
+
+    loop while r2 not equal to zero
+
+    div t0, r1, r2       ;; t0 = r1 / r2
+    mult t1, r2, t0         ;; t1 = r2 * r
+    sub t2, r1, t1       ;; t2 = r1 - t1
+
+    store r2, r1            ;; r1 = r2
+    store t2, r2            ;; r2 = t2
+
+    ;; end loop
+
+    return r2                 ;; this is equal to the gcd of phi of n and the encryption exponent 
+     
+
+If the `gcd(φ(n), e) = 1`, then `e` is a valid encryption exponent
+for RSA, and we can now find the decryption exponent, `d`. If the
+`gcd(φ(n), e) \neq 1`, then a new encryption exponent `e` needs to
+be chosen such that `1<e<φ(n)`. Then, the Euclidean Algorithm
+needs to be ran again to verify that `gcd(φ(n), e) = 1`.
+Typically, to execute the Euclidean Algorithm, a divide, multiply, and
+subtract instruction would be needed in each iteration of the loop. In
+the design section, we will describe an instruction that will combine
+the divide, multiply, and subtract instruction into one instruction.
+
+Once the encryption exponent, `e` is chosen, we need to find the
+decryption exponent `d`. The decryption exponent, `d` needs to satisfy
+the following formula: `d*e = 1 (mod φ(n))`. Another way of
+describing `d` is it is the multiplicative inverse of `e` modular
+`φ(n)`. The Extended Euclidean Algorithm can be executed using the
+following pseudocode [@CORMEN]:
+
+    ;; load phi of n (computed using Euler's totient function) and encryption exponent
+    load r1, location_of_phi(n)
+    load r2, location_of_encryption_exponent
+
+    ;; values that will be used and updated in the loop of the algorithm
+    store 0, r3              ;; r3 = 0
+    store 0, r4              ;; r4 = 0
+    store 1, r5              ;; r5 = 1
+    store 1, r6              ;; r6 = 1
+
+    loop while r2 not equal to zero
+
+    div t0, r1, r2        ;; t0 = r1/r2
+    store r2, t1            ;; t1 = r2
+    mult t2, r1, t0      ;; t2 = r1 * t0
+    sub r2, r2, t2        ;; r2 = r2 - t2
+    store t1, r1            ;; r1 = t1
+    store r3, t3            ;; t3 = r3
+    mult t4, t0, r3      ;; t4 = t0 * r3
+    sub r3, r5, t4        ;; r3 = r5 - t4
+    store r3, r5            ;; r5 = r3
+    store r4, t5            ;; t5 = r4
+    mult t6, t0, r4      ;; t6 = t0 * r4
+    sub r4, r6, t6        ;; r4 = r6 - t6
+    store t5, r6            ;; r6 = t5
+
+    ;; end loop
+
+    return r5, r6          ;; r5 and r6 are e and d
+
+Since we chose an `e` such that `gcd(φ(n), e) = 1`, there will
+always exist a `d` such that `ed = 1 (mod φ(n))`. The Extended
+Euclidean Algorithm returns the integer `d`. Typically to execute the
+Extended Euclidean Algorithm, one divide, six stores, three multiplies,
+and three subtracts are needed within the loop. In the design section,
+we will use two instructions that will combine some of the instructions
+used in the Extended Euclidean Algorithm to reduce the number of stalls
+within the loop.
+
+## PRIME NUMBER GENERATION
+
+The common approach to generating large primes in making encryption and
+decryption keys is to randomly select integers and test them for
+primality. Since for large numbers, being sure of primality can be an
+expensive operation, algorithms are used that find "probable primes."
+These algorithms test numbers for primality with a certainty over a
+desired threshold. The particular algorithm we will investigate is known
+as Rabin's test, whose high-level use in prime generators as
+follows[@BEAUCHEMIN]:
+
+    ;;Generates a prime number of binary length l and certainty factor k
+    ;;The number will have a exp(4,-k) chance of being composite (non-prime)
+    function genPrime (l, k):
+        repeat
+            n <- Randomly selected l-digit odd number
+        until repeatRabin(n, k) = "prime"
+        return n
+
+    function repeatRabin(n, k)
+        from i <- 0 to k or until done
+            done <- rabin(n) = "composite"
+        end
+        if done
+            return "composite"
+        else
+            return "prime"
+    
+    function rabin(n)
+        s <- the number of trailing zeros in binary form of n-1
+        t <- n-1 with trailing zeros removed
+        a <- integer randomly selected between 2 and n-2
+        x <- exp(a,d) mod n
+        if x = 1 or x = n - 1
+            return "prime"
+        for r <- 1 ... s - 1
+            x <- exp(x,2) mod n
+            if x = 1 return "composite"
+            if x = n - 1 return "prime"
+        end for
+        return "composite"
+
+This technique will take a long amount of time for large values of `t`
+and `s`. For large values of `t`, exponentiation done in a typical
+manner would take an increasing number of iterations to calculate (as
+shown below). The instructions to calculate `x^{2} (mod n)` would be
+executed `s` times. These two factors indicate a heavy reliance on the
+ability of a system to calculate exponentiation.
+
+## ENCRYPTION AND DECRYPTION
+
+One aspect of RSA to improve upon is performing large exponentiation.
+Currently the implementation of exponentiation is performed by the
+compiler, which creates inconsistencies in speed between compilers and
+is generally not optimal. The general practice for compilers is to loop
+for the value in the exponent while accumulating the total.
+
+Current pseudocode for how compilers handle exponentiation [@COHEN]:
+
+    function power (base, exponent)
+         set result = base
+         loop from 1 to exponent do
+              result = result * base
+         return result
+
+This technique is very slow, especially for large exponents. It will
+cause a lot of stalls as the processor waits for the results of the
+previous computation. It will also stall waiting for another multiplier
+to be available. We can lessen the number of stalls using a technique
+known as exponentiation by squaring. This technique is explained further
+in the design section.
+
+# DESIGN
+
+In this section, we will describe specialized instructions that will be
+used for prime number generation, computing the encryption and
+decryption exponent, and encrypting a decrypting a message.
+
+## ENCRYPTION AND DECRYPTION EXPONENT
+
+The issue with implementing the Euclidean Algorithm the traditional way
+is a divide, multiply, and subtract instructions are needed for each
+iteration of the loop. We noticed that the Euclidean Algorithm is
+finding a modulus in each iteration of the loop, thus we propose to use
+a modulus instruction for the Euclidean Algorithm, which is defined
+below.
+
+    mod x, y, z : x = z - ( (z/y) * y)
+
+By using the modulus instruction, we have reduced the number of stalls
+in each loop. Unlike the traditional implementation, which requires both
+the multiply and subtraction instruction to stall until the instruction
+before it writes it's value to memory, the modulus instruction has all
+the values immediately.
+
+Therefore, we can implement the Euclidean Algorithm with the following
+pseudocode:
+
+    ;; load phi of n (computed using Euler's totient function) and encryption exponent
+    load r1, location_of_phi(n)
+    load r2, location_of_encryption_exponent
+
+    loop while r2 not equal to zero
+
+    mod t0, r1, r2      ;; t0 = r2 - ( (r2/r1) * r1)
+
+    store r2, r1           ;;r1 = r2
+    store t0, r2           ;;r2 = t2
+
+    ;; end loop
+
+    return r2                ;; this is equal to the gcd of phi of n and the encryption exponent 
+     
+
+Using the mod instruction, we can eliminate the use of the divide,
+multiply, and subtract instructions. Also, we use only one temporary
+register, as opposed to three temporary registers. We will provide an
+analysis to the speedup that the modulus instruction gives the
+implementation of the Euclidean Algorithm in the justification and
+analysis section.
+
+For the implementation of the Extended Euclidean Algorithm, we propose
+to use two specialized instructions: a modular instruction and a
+multiply-subtract instruction. The modular instruction combines the
+divide, multiply, and subtract instruction, and is defined the same as
+it is above, in the Euclidean Algorithm. The multiply-subtract
+instruction is defined as follows:
+
+     multsub w, x, y, z : w = x - (y * z)
+
+Like in the Euclidean Algorithm, by using the modular instruction in the
+Extended Euclidean Algorithm, we are reducing the number of stalls in
+each loop iteration. We are further reducing the number of stalls by
+also using the multiply-subtract instruction. Thus, we can implement the
+Extended Euclidean Algorithm with the following pseudocode:
+
+     ;; load phi of n (computed using Euler's totient function) and encryption exponent
+    load r1, location_of_phi(n)
+    load r2, location_of_encryption_exponent
+
+    store 0, r3              ;; r3 = 0
+    store 0, r4              ;; r4 = 0
+    store 1, r5              ;; r5 = 1
+    store 1, r6              ;; r6 = 1
+
+    loop while r2 not equal to zero
+
+    div t0, r1, r2                ;; t0 = r1/r2
+    store r2, t1                    ;; t1 = r2
+    mod r2, r1, r2                ;; r2 = r2 - ( (r2/r1) * r1)
+    store t1, r1                    ;; r1 = t1
+    store r3, t2                    ;; t2 = r3
+    multsub r3, r5, t0, r3       ;; r3 = r5 - (t0 * r3)
+    store t2, r5                    ;; r5 = t2
+    store r4, t3                    ;; t3 = r4
+    multsub  r4, r6, t0, r4      ;; r4 = r6 - (t0 * r3) 
+    store t3, r6                    ;; r6 = t3
+
+    ;; end loop
+
+    return r5, r6                  ;; r5 and r6 are e and d
+
+Using the modular and multiply-subtract instructions, we have reduced
+the amount of instructions in the loop of the implementation of the
+Extended Euclidean Algorithm from 13 instructions to 10 instructions. We
+also reduce the temporary registers from five to three. We will provide
+an analysis to the speedup given to the Extended Euclidean Algorithm by
+using the modular instruction and the multiply-subtract instruction in
+the justification and analysis section.
+
+## PRIME NUMBER GENERATION, ENCRYPTION, AND DECRYPTION
+
+One issue already discussed in the previous section is that of stalls
+during large exponents. The way exponentiation is handled causes many
+stalls slowing down the program. One way to lessen the number of stalls
+is by using a technique called exponentiation by squaring. This
+technique uses the square of the result to accumulate the final solution
+instead of just the base. In this way it'll divide the number of
+multiplies in half, thus divide the number of stalls in half.
+
+The pseudocode for how exponentiation by squaring will work [@COHEN]:
+
+    function power (base, exponent)
+         set binaryExp = (n1, n2, n3, ... nm) 
+         ;; where binaryExp is the binary representation of exponent and n1 is 1
+         set result = base
+         set singleResult = 1
+         loop for x from 0 to (length of binaryExp - 1) do
+              result = result * result
+              if binaryExp[x] == 1
+                    singleResult = singleResult * base
+
+         result = result * singleResult
+         return result
+
+An example for calculating 3 to the 10th is as follows:
+
+    result := 3
+    binaryExp := "1010"
+
+    Iteration for digit 2:
+      result := result^{2} = 3^{2} = 9
+      1010bin - Digit equals "0"
+
+    Iteration for digit 3:
+      result := result^{2} = (32)2 = 34  = 81
+      1010bin - Digit equals "1" --> result := result*3 = (32)2*3 = 35  = 243
+
+    Iteration for digit 4:
+      result := result2 = ((32)2*3)2 = 310  = 59049
+      1010bin - Digit equals "0"
+
+    return result
+
+The idea is to implement this exponentiation by squares algorithm in the
+processor using a pow command. This pow command takes the destination
+register, the register containing the base and the register containing
+the exponent. This command then runs the loop using one multiplier to
+accumulate the square and one multiplier to accumulate the value when a
+1 is hit. The loop will grab one binary digit at a time from the
+exponent register. We will always run the result accumulator, then
+depending on the digit, may use the second accumulator. It will then run
+through one more multiplier to multiply the squares by the 1's
+multiplier.
+
+# JUSTIFICATION AND ANALYSIS
+
+In this section, we will describe how our specialized instructions will
+improve the performance of the RSA encryption.
+
+## ENCRYPTION AND DECRYPTION EXPONENT
+
+Using the modular instruction in the Euclidean Algorithm, we can reduce
+the number of stalls needed. Instead of needing to stall for the result
+of the divide and the result of the multiply within the loop, the
+modular instruction does not have to stall for any previous results. In
+the loop of the Euclidean Algorithm without the modular instruction, two
+of the five instructions need to stall to wait for a previous result.
+
+    CPI = Ideal CPI + The number of stalls per instruction
+
+    CPIold = 1 + 2/5
+    CPIold =  1.4
+
+    CPInew = 1 (since no instructions in the loop need to stall for a result)
+
+    Thus the speedup for the Euclidean Algorithm in the loop is:
+    Speedup = CPIold / CPI new
+    Speedup = 1.4 / 1
+    Speedup = 1.4
+
+Thus, with respect to stalls, we have a speedup of 1.4 by using the
+modular instruction in the Euclidean Algorithm. Of course, the latency
+of the modular instruction is likely to be longer than the latency of
+the divide, multiply, and subtract instructions, since it is performing
+three calculations in one instruction. However, since we are building a
+system only for RSA, we can use specialized hardware to reduce the
+latency of the modular instruction to receive the speedup of 1.4. As
+mentioned in the design section, another advantage to using the modular
+instruction is we reduce the number of temporary registers from three to
+one.
+
+We are assume that the store instruction finishes in one clock cycle.
+Thus, in the Extended Euclidean Algorithm, three of the thirteen
+instructions in the loop need to stall for previous results. However, by
+using the modular and multipy-subtract instructions, no instruction
+needs to stall to wait for a previous result to be available.
+
+    CPI = Ideal CPI + The number of stalls per instruction
+
+    CPIold = 1 + 3/13
+    CPIold =  1.23
+
+    CPInew = 1 (since no instructions in the loop need to stall for a result)
+
+    Thus the speedup for the Euclidean Algorithm in the loop is:
+    Speedup = CPIold / CPI new
+    Speedup = 1.23 / 1
+    Speedup = 1.23
+
+Thus with respect to stalls, we have a speedup of 1.23 by using the
+modular and multiply-subtract instructions in the Extended Euclidean
+Algorithm. As mentioned above, it is likely that the latency of the
+modular and multiply-subtract instructions are higher than the
+traditional instructions. However, we are only building a system for RSA
+encryption and decryption, so we can build hardware that will reduce the
+latency of the modular and multiply-subtract instructions to receive the
+speedup of 1.23. Also, an advantage to using the modular and
+multiply-subtract instructions is we reduce the number of temporary
+registers needed from five to three.
+
+## PRIME NUMBER GENERATION, ENCRYPTION, AND DECRYPTION
+
+Using the pow command we can reduce stalls of large exponents in half.
+Since the algorithm breaks the exponent into a binary representation of
+its self and loops through the digits, the worse case scenario is
+looping half the number of times. One example of this is if the exponent
+is eight which is 1000 in base two. Since the number of loops is divided
+in half, the number of multiplies are also divided in half. Even with a
+large number of ones in the base two representation of the exponent,
+this can use a seperate multiplyer, since it doesn't rely on the result
+of the squares.
+
+Using the above pseudocode we can assume that approximately 1/4 of the
+instructions are multiplies, thus 1/4 will stall. Using that knowledge
+and the knowledge that the number of stalls is cut in half we can use
+the following equations to determin the overall speedup:
+
+    CPI = Ideal CPI + The number of stalls per instruction
+
+    CPI old = 1 + 1/4
+    CPI old = 1.25
+
+    CPI new = 1 + 1/4/2
+    CPI new = 1 + 1/8
+    CPI new = 1.125
+
+    Thus the speedup of this section is:
+    Speedup = CPI old / CPI new
+    Speedup = 1.25/1.125
+    Speedup = 1.11
+
+# CONCLUSIONS
+
+In analyzing the typical algorithms used as a part of RSA, we have
+identified two primary bottlenecks in both encryption and decryption:
+modulus and exponentiation operations. We propose that because these two
+operations are so prevalent, they warrant specialized instructions. We
+also noticed that it is beneficial to use a multiply-subtract
+instruction to increase speedup while finding a decryption exponent.
+These specialized instructions improve performance by reducing the
+number of stalls, and therefore improving ideal CPI. The process of
+finding encryption and decryption exponents benefits from the increased
+efficiency of modulus operations. Likewise, the processes of prime
+generation and encryption and decryption in general, both benefit from a
+faster exponentiation operation.
+
+We find that by implementing these instructions, the whole procedure
+finds an overall speedup, from specific speedups of 1.11 for all uses of
+exponentiation, 1.4 for finding an encryption exponent, and 1.23 for
+finding a decryption exponent. Thus, we met our goals for improving the
+performance of the RSA algorithm by creating a specialized instruction
+set architecture.
+
+A further investigation we could make is to look at specific hardware
+that supports our instruction set architecture. To gain the maximum
+speedup, we need to use hardware that reduces the latency for all the
+specialized instructions to a negligible amount. Since we are creating a
+system with the sole task of encrypting and decrypting messages using
+RSA, we can create hardware that allows the specialized instructions to
+have a latency comparable to the traditional instructions.
+
+# Bibliography
+
+3 Beauchemin, Pierre, Brassard, Crepeau, Claude, Goutier, Claude, and
+Pomerance, Carl The Generation of Random Numbers That Are Probably Prime
+
+Bellare, M., Garay, J., and Rabin T. Fast Batch Verification for Modular
+Exponentiation and Digital Signatures. *Advances in Cryptology-
+Eurocrypt 98 Proceedings, Lecture Notes in Computer Sciencey*, 1998.
+
+Boneh, Dan and Shacham, Horavl. Fast Variants of RSA. *Appears in
+Cryptobytes*, 2002. Cohen, H., Frey, G. (editors): Handbook of elliptic
+and hyperelliptic curve cryptography. Discrete Math. Appl., (Chapman and
+Hall/CRC 2006).
+
+Cormen, T., Leiserson, C., Rivest, R., and Stein, C. Introduction to
+Algorithms, Second Edition. MIT Press and McGraw-Hill, 2001.
+
+Stallings, W., Cryptography and Network Security 2nd ed. (Prentice-Hall,
+1998)
+