flow-hugo/content/posts/rsa-optimization.md

---
author: "Halvo (Human)"
title: "RSA Optimization"
date: 2022-12-06
tags:
  - rsa
  - optimization
  - instruction-set
  - cryptography
  - performance
draft: false
summary: |
  A dive into a custom ISA for RSA: modular‑instruction shortcuts, multiply‑subtract tricks, and exponentiation‑by‑squaring hacks that shave a few cycles off the dreaded big‑number math.
---

## INTRODUCTION

RSA is a public key cryptosystem, which was named after the creators of
the algorithm: Rivest, Shamir, and Adleman [@STALLINGS]. It is widely
used for both confidentiality and authentication. The main advantage for
using RSA is the keys are created in such a way that the public key is
publish and can be used to encrypt all messages to the owner of the
public key. Unlike symmetric key schemes, RSA does not require sender
and receiver to agree on a common key to encrypt and decrypt messages.
To send an encrypted message, the encryptor only has to look up the
public key of the intended recipient.

The RSA algorithm works as follows. First, a public key and private key
pair are generated. The public key is published and can be used to
encrypt messages and verify signatures of the owner of the public key.
The corresponding private key is known only to the owner and can be used
to decrypt messages encrypted with the owner's public key. The private
key can also be used to sign messages. The public and private keys are
generated by choosing two distinct large primes, `p` and `q`. Then `n`
is computed by multiplying `p` and `q` together. Since `n` is the
product of two primes, we can compute Euler's totient function by
`φ(n)=(p-1)(q-1)`. Then, an integer `e` is chosen such that
`1<e<φ(n)` and `gcd(e,φ(n))=1`, where gcd stands for
greatest common denominator. Finally an integer `d` is computed such
that `d=e^{-1}(mod φ(n))`. Thus, the public key is the modulus
`n` and the encryption exponent `e`. The private key is the decryption
exponent `d`.

RSA is criticized for the amount of time it takes to encrypt and decrypt
messages. To encrypt a message, the sender uses the public key, `(n,e)`.
The message is converted to an integer `m`, such that `0<m<n`. Then, the
ciphertext message `c` is computed by `c=m^{e}(mod n)`. The message
can be recovered by using the private key, `d`, and the formula
`m=c^{d}(mod n)`. Currently to be considered secure, the size of `n`
needs to be 1024 bits or more. This makes RSA computationally slow
because arithmetic modulo large numbers is notoriously slow. For
example, on small handheld devices, RSA decryption with 1024 bit modulus
can take up to 40 seconds [@BONEH]. Also, RSA decryption can
significantly reduce the number of requests a web server can handle. In
addition, as computers become more powerful, 1024-bit RSA will not be
considered secure. This means that in the future, the size of the
modulus will increase. Therefore, we need to create methods we can use
RSA in a practical fashion.

To improve the speed of RSA encryption and decryption, we propose to
create an instruction set architecture that will be made solely for RSA.
In this paper, we will talk strictly about using RSA for encrypting and
decrypting messages. However the same instruction set architecture we
propose in this paper can be used for signing and verifying messages
with RSA.

## CHARACTERISTICS OF RSA

There are three areas that the RSA can be optimized: finding the
encryption and decryption exponent, prime number generation, and
encrypting and decrypting messages. In this section we will describe how
finding the encryption and decrypting exponent, prime number generation,
and encrypting and decrypting the message is usually done, without
specialized instructions.

### ENCRYPTION AND DECRYPTION EXPONENT

The current approach to verifying that the encryption exponent is
coprime to the `φ(n)` is by using the Euclidean Algorithm. To find
the decryption exponent, the Extended Euclidean Algorithm is used.
Therefore, we will include these algorithms in our architecture and
create instructions to accommodate the Euclidean Algorithm and Extended
Euclidean Algorithm.

The Euclidean Algorithm can be executed using the following pseudocode
[@STALLINGS]:

    ;; load phi of n (computed using Euler's totient function) and encryption exponent
    load r1, location_of_phi(n)
    load r2, location_of_encryption_exponent

    loop while r2 not equal to zero

    div t0, r1, r2       ;; t0 = r1 / r2
    mult t1, r2, t0         ;; t1 = r2 * r
    sub t2, r1, t1       ;; t2 = r1 - t1

    store r2, r1            ;; r1 = r2
    store t2, r2            ;; r2 = t2

    ;; end loop

    return r2                 ;; this is equal to the gcd of phi of n and the encryption exponent


If the `gcd(φ(n), e) = 1`, then `e` is a valid encryption exponent
for RSA, and we can now find the decryption exponent, `d`. If the
`gcd(φ(n), e) \neq 1`, then a new encryption exponent `e` needs to
be chosen such that `1<e<φ(n)`. Then, the Euclidean Algorithm
needs to be ran again to verify that `gcd(φ(n), e) = 1`.
Typically, to execute the Euclidean Algorithm, a divide, multiply, and
subtract instruction would be needed in each iteration of the loop. In
the design section, we will describe an instruction that will combine
the divide, multiply, and subtract instruction into one instruction.

Once the encryption exponent, `e` is chosen, we need to find the
decryption exponent `d`. The decryption exponent, `d` needs to satisfy
the following formula: `d*e = 1 (mod φ(n))`. Another way of
describing `d` is it is the multiplicative inverse of `e` modular
`φ(n)`. The Extended Euclidean Algorithm can be executed using the
following pseudocode [@CORMEN]:

    ;; load phi of n (computed using Euler's totient function) and encryption exponent
    load r1, location_of_phi(n)
    load r2, location_of_encryption_exponent

    ;; values that will be used and updated in the loop of the algorithm
    store 0, r3              ;; r3 = 0
    store 0, r4              ;; r4 = 0
    store 1, r5              ;; r5 = 1
    store 1, r6              ;; r6 = 1

    loop while r2 not equal to zero

    div t0, r1, r2        ;; t0 = r1/r2
    store r2, t1            ;; t1 = r2
    mult t2, r1, t0      ;; t2 = r1 * t0
    sub r2, r2, t2        ;; r2 = r2 - t2
    store t1, r1            ;; r1 = t1
    store r3, t3            ;; t3 = r3
    mult t4, t0, r3      ;; t4 = t0 * r3
    sub r3, r5, t4        ;; r3 = r5 - t4
    store r3, r5            ;; r5 = r3
    store r4, t5            ;; t5 = r4
    mult t6, t0, r4      ;; t6 = t0 * r4
    sub r4, r6, t6        ;; r4 = r6 - t6
    store t5, r6            ;; r6 = t5

    ;; end loop

    return r5, r6          ;; r5 and r6 are e and d

Since we chose an `e` such that `gcd(φ(n), e) = 1`, there will
always exist a `d` such that `ed = 1 (mod φ(n))`. The Extended
Euclidean Algorithm returns the integer `d`. Typically to execute the
Extended Euclidean Algorithm, one divide, six stores, three multiplies,
and three subtracts are needed within the loop. In the design section,
we will use two instructions that will combine some of the instructions
used in the Extended Euclidean Algorithm to reduce the number of stalls
within the loop.

### PRIME NUMBER GENERATION

The common approach to generating large primes in making encryption and
decryption keys is to randomly select integers and test them for
primality. Since for large numbers, being sure of primality can be an
expensive operation, algorithms are used that find "probable primes."
These algorithms test numbers for primality with a certainty over a
desired threshold. The particular algorithm we will investigate is known
as Rabin's test, whose high-level use in prime generators as
follows[@BEAUCHEMIN]:

    ;;Generates a prime number of binary length l and certainty factor k
    ;;The number will have a exp(4,-k) chance of being composite (non-prime)
    function genPrime (l, k):
        repeat
            n <- Randomly selected l-digit odd number
        until repeatRabin(n, k) = "prime"
        return n

    function repeatRabin(n, k)
        from i <- 0 to k or until done
            done <- rabin(n) = "composite"
        end
        if done
            return "composite"
        else
            return "prime"

    function rabin(n)
        s <- the number of trailing zeros in binary form of n-1
        t <- n-1 with trailing zeros removed
        a <- integer randomly selected between 2 and n-2
        x <- exp(a,d) mod n
        if x = 1 or x = n - 1
            return "prime"
        for r <- 1 ... s - 1
            x <- exp(x,2) mod n
            if x = 1 return "composite"
            if x = n - 1 return "prime"
        end for
        return "composite"

This technique will take a long amount of time for large values of `t`
and `s`. For large values of `t`, exponentiation done in a typical
manner would take an increasing number of iterations to calculate (as
shown below). The instructions to calculate `x^{2} (mod n)` would be
executed `s` times. These two factors indicate a heavy reliance on the
ability of a system to calculate exponentiation.

### ENCRYPTION AND DECRYPTION

One aspect of RSA to improve upon is performing large exponentiation.
Currently the implementation of exponentiation is performed by the
compiler, which creates inconsistencies in speed between compilers and
is generally not optimal. The general practice for compilers is to loop
for the value in the exponent while accumulating the total.

Current pseudocode for how compilers handle exponentiation [@COHEN]:

    function power (base, exponent)
         set result = base
         loop from 1 to exponent do
              result = result * base
         return result

This technique is very slow, especially for large exponents. It will
cause a lot of stalls as the processor waits for the results of the
previous computation. It will also stall waiting for another multiplier
to be available. We can lessen the number of stalls using a technique
known as exponentiation by squaring. This technique is explained further
in the design section.

## DESIGN

In this section, we will describe specialized instructions that will be
used for prime number generation, computing the encryption and
decryption exponent, and encrypting a decrypting a message.

### ENCRYPTION AND DECRYPTION EXPONENT

The issue with implementing the Euclidean Algorithm the traditional way
is a divide, multiply, and subtract instructions are needed for each
iteration of the loop. We noticed that the Euclidean Algorithm is
finding a modulus in each iteration of the loop, thus we propose to use
a modulus instruction for the Euclidean Algorithm, which is defined
below.

    mod x, y, z : x = z - ( (z/y) * y)

By using the modulus instruction, we have reduced the number of stalls
in each loop. Unlike the traditional implementation, which requires both
the multiply and subtraction instruction to stall until the instruction
before it writes it's value to memory, the modulus instruction has all
the values immediately.

Therefore, we can implement the Euclidean Algorithm with the following
pseudocode:

    ;; load phi of n (computed using Euler's totient function) and encryption exponent
    load r1, location_of_phi(n)
    load r2, location_of_encryption_exponent

    loop while r2 not equal to zero

    mod t0, r1, r2      ;; t0 = r2 - ( (r2/r1) * r1)

    store r2, r1           ;;r1 = r2
    store t0, r2           ;;r2 = t2

    ;; end loop

    return r2                ;; this is equal to the gcd of phi of n and the encryption exponent


Using the mod instruction, we can eliminate the use of the divide,
multiply, and subtract instructions. Also, we use only one temporary
register, as opposed to three temporary registers. We will provide an
analysis to the speedup that the modulus instruction gives the
implementation of the Euclidean Algorithm in the justification and
analysis section.

For the implementation of the Extended Euclidean Algorithm, we propose
to use two specialized instructions: a modular instruction and a
multiply-subtract instruction. The modular instruction combines the
divide, multiply, and subtract instruction, and is defined the same as
it is above, in the Euclidean Algorithm. The multiply-subtract
instruction is defined as follows:

     multsub w, x, y, z : w = x - (y * z)

Like in the Euclidean Algorithm, by using the modular instruction in the
Extended Euclidean Algorithm, we are reducing the number of stalls in
each loop iteration. We are further reducing the number of stalls by
also using the multiply-subtract instruction. Thus, we can implement the
Extended Euclidean Algorithm with the following pseudocode:

     ;; load phi of n (computed using Euler's totient function) and encryption exponent
    load r1, location_of_phi(n)
    load r2, location_of_encryption_exponent

    store 0, r3              ;; r3 = 0
    store 0, r4              ;; r4 = 0
    store 1, r5              ;; r5 = 1
    store 1, r6              ;; r6 = 1

    loop while r2 not equal to zero

    div t0, r1, r2                ;; t0 = r1/r2
    store r2, t1                    ;; t1 = r2
    mod r2, r1, r2                ;; r2 = r2 - ( (r2/r1) * r1)
    store t1, r1                    ;; r1 = t1
    store r3, t2                    ;; t2 = r3
    multsub r3, r5, t0, r3       ;; r3 = r5 - (t0 * r3)
    store t2, r5                    ;; r5 = t2
    store r4, t3                    ;; t3 = r4
    multsub  r4, r6, t0, r4      ;; r4 = r6 - (t0 * r3)
    store t3, r6                    ;; r6 = t3

    ;; end loop

    return r5, r6                  ;; r5 and r6 are e and d

Using the modular and multiply-subtract instructions, we have reduced
the amount of instructions in the loop of the implementation of the
Extended Euclidean Algorithm from 13 instructions to 10 instructions. We
also reduce the temporary registers from five to three. We will provide
an analysis to the speedup given to the Extended Euclidean Algorithm by
using the modular instruction and the multiply-subtract instruction in
the justification and analysis section.

### PRIME NUMBER GENERATION, ENCRYPTION, AND DECRYPTION

One issue already discussed in the previous section is that of stalls
during large exponents. The way exponentiation is handled causes many
stalls slowing down the program. One way to lessen the number of stalls
is by using a technique called exponentiation by squaring. This
technique uses the square of the result to accumulate the final solution
instead of just the base. In this way it'll divide the number of
multiplies in half, thus divide the number of stalls in half.

The pseudocode for how exponentiation by squaring will work [@COHEN]:

    function power (base, exponent)
         set binaryExp = (n1, n2, n3, ... nm)
         ;; where binaryExp is the binary representation of exponent and n1 is 1
         set result = base
         set singleResult = 1
         loop for x from 0 to (length of binaryExp - 1) do
              result = result * result
              if binaryExp[x] == 1
                    singleResult = singleResult * base

         result = result * singleResult
         return result

An example for calculating 3 to the 10th is as follows:

    result := 3
    binaryExp := "1010"

    Iteration for digit 2:
      result := result^{2} = 3^{2} = 9
      1010bin - Digit equals "0"

    Iteration for digit 3:
      result := result^{2} = (32)2 = 34  = 81
      1010bin - Digit equals "1" --> result := result*3 = (32)2*3 = 35  = 243

    Iteration for digit 4:
      result := result2 = ((32)2*3)2 = 310  = 59049
      1010bin - Digit equals "0"

    return result

The idea is to implement this exponentiation by squares algorithm in the
processor using a pow command. This pow command takes the destination
register, the register containing the base and the register containing
the exponent. This command then runs the loop using one multiplier to
accumulate the square and one multiplier to accumulate the value when a
1 is hit. The loop will grab one binary digit at a time from the
exponent register. We will always run the result accumulator, then
depending on the digit, may use the second accumulator. It will then run
through one more multiplier to multiply the squares by the 1's
multiplier.

## JUSTIFICATION AND ANALYSIS

In this section, we will describe how our specialized instructions will
improve the performance of the RSA encryption.

### ENCRYPTION AND DECRYPTION EXPONENT

Using the modular instruction in the Euclidean Algorithm, we can reduce
the number of stalls needed. Instead of needing to stall for the result
of the divide and the result of the multiply within the loop, the
modular instruction does not have to stall for any previous results. In
the loop of the Euclidean Algorithm without the modular instruction, two
of the five instructions need to stall to wait for a previous result.

    CPI = Ideal CPI + The number of stalls per instruction

    CPIold = 1 + 2/5
    CPIold =  1.4

    CPInew = 1 (since no instructions in the loop need to stall for a result)

    Thus the speedup for the Euclidean Algorithm in the loop is:
    Speedup = CPIold / CPI new
    Speedup = 1.4 / 1
    Speedup = 1.4

Thus, with respect to stalls, we have a speedup of 1.4 by using the
modular instruction in the Euclidean Algorithm. Of course, the latency
of the modular instruction is likely to be longer than the latency of
the divide, multiply, and subtract instructions, since it is performing
three calculations in one instruction. However, since we are building a
system only for RSA, we can use specialized hardware to reduce the
latency of the modular instruction to receive the speedup of 1.4. As
mentioned in the design section, another advantage to using the modular
instruction is we reduce the number of temporary registers from three to
one.

We are assume that the store instruction finishes in one clock cycle.
Thus, in the Extended Euclidean Algorithm, three of the thirteen
instructions in the loop need to stall for previous results. However, by
using the modular and multipy-subtract instructions, no instruction
needs to stall to wait for a previous result to be available.

    CPI = Ideal CPI + The number of stalls per instruction

    CPIold = 1 + 3/13
    CPIold =  1.23

    CPInew = 1 (since no instructions in the loop need to stall for a result)

    Thus the speedup for the Euclidean Algorithm in the loop is:
    Speedup = CPIold / CPI new
    Speedup = 1.23 / 1
    Speedup = 1.23

Thus with respect to stalls, we have a speedup of 1.23 by using the
modular and multiply-subtract instructions in the Extended Euclidean
Algorithm. As mentioned above, it is likely that the latency of the
modular and multiply-subtract instructions are higher than the
traditional instructions. However, we are only building a system for RSA
encryption and decryption, so we can build hardware that will reduce the
latency of the modular and multiply-subtract instructions to receive the
speedup of 1.23. Also, an advantage to using the modular and
multiply-subtract instructions is we reduce the number of temporary
registers needed from five to three.

### PRIME NUMBER GENERATION, ENCRYPTION, AND DECRYPTION

Using the pow command we can reduce stalls of large exponents in half.
Since the algorithm breaks the exponent into a binary representation of
its self and loops through the digits, the worse case scenario is
looping half the number of times. One example of this is if the exponent
is eight which is 1000 in base two. Since the number of loops is divided
in half, the number of multiplies are also divided in half. Even with a
large number of ones in the base two representation of the exponent,
this can use a seperate multiplyer, since it doesn't rely on the result
of the squares.

Using the above pseudocode we can assume that approximately 1/4 of the
instructions are multiplies, thus 1/4 will stall. Using that knowledge
and the knowledge that the number of stalls is cut in half we can use
the following equations to determin the overall speedup:

    CPI = Ideal CPI + The number of stalls per instruction

    CPI old = 1 + 1/4
    CPI old = 1.25

    CPI new = 1 + 1/4/2
    CPI new = 1 + 1/8
    CPI new = 1.125

    Thus the speedup of this section is:
    Speedup = CPI old / CPI new
    Speedup = 1.25/1.125
    Speedup = 1.11

## CONCLUSIONS

In analyzing the typical algorithms used as a part of RSA, we have
identified two primary bottlenecks in both encryption and decryption:
modulus and exponentiation operations. We propose that because these two
operations are so prevalent, they warrant specialized instructions. We
also noticed that it is beneficial to use a multiply-subtract
instruction to increase speedup while finding a decryption exponent.
These specialized instructions improve performance by reducing the
number of stalls, and therefore improving ideal CPI. The process of
finding encryption and decryption exponents benefits from the increased
efficiency of modulus operations. Likewise, the processes of prime
generation and encryption and decryption in general, both benefit from a
faster exponentiation operation.

We find that by implementing these instructions, the whole procedure
finds an overall speedup, from specific speedups of 1.11 for all uses of
exponentiation, 1.4 for finding an encryption exponent, and 1.23 for
finding a decryption exponent. Thus, we met our goals for improving the
performance of the RSA algorithm by creating a specialized instruction
set architecture.

A further investigation we could make is to look at specific hardware
that supports our instruction set architecture. To gain the maximum
speedup, we need to use hardware that reduces the latency for all the
specialized instructions to a negligible amount. Since we are creating a
system with the sole task of encrypting and decrypting messages using
RSA, we can create hardware that allows the specialized instructions to
have a latency comparable to the traditional instructions.

## Bibliography

3 Beauchemin, Pierre, Brassard, Crepeau, Claude, Goutier, Claude, and
Pomerance, Carl The Generation of Random Numbers That Are Probably Prime

Bellare, M., Garay, J., and Rabin T. Fast Batch Verification for Modular
Exponentiation and Digital Signatures. *Advances in Cryptology-
Eurocrypt 98 Proceedings, Lecture Notes in Computer Sciencey*, 1998.

Boneh, Dan and Shacham, Horavl. Fast Variants of RSA. *Appears in
Cryptobytes*, 2002. Cohen, H., Frey, G. (editors): Handbook of elliptic
and hyperelliptic curve cryptography. Discrete Math. Appl., (Chapman and
Hall/CRC 2006).

Cormen, T., Leiserson, C., Rivest, R., and Stein, C. Introduction to
Algorithms, Second Edition. MIT Press and McGraw-Hill, 2001.

Stallings, W., Cryptography and Network Security 2nd ed. (Prentice-Hall,
1998)