Most advanced security systems rely on public-key schemes based on either the factorization or the discrete logarithm problem. Since both problems are known to be closely related, a major breakthrough in cryptanalysis tackling one of them could render a large set of cryptosystems completely useless. The McEliece public-key scheme is based on the alternative security assumption that decoding unknown linear binary codes is NP-complete. In this work, we investigate the efficient implementation of the McEliece scheme on reconfigurable hardware, which was until now considered a challenge due to the required storage of its large keys. To the best of our knowledge, this is the first time that the McEliece encryption scheme has been implemented on a Xilinx Spartan-3 FPGA.
Erklärung / Statement
Hiermit versichere ich, dass ich meine Diplomarbeit selbst verfasst und keine anderen als die angegebenen Quellen und Hilfsmittel benutzt sowie Zitate kenntlich gemacht habe.
I hereby declare that the work presented in this thesis is my own work and that
to the best of my knowledge it is original except where indicated by references to
other authors.
Bochum, October 14, 2009
Stefan Heyse
Contents

1.1. Motivation
1.2. Existing Implementations
1.3. Goals
1.4. Outline
2.1. Overview
2.2. Key Generation
2.3. Encryption
2.4. Decryption
2.5. Reducing Memory Requirements
2.6. Security
     2.6.1. Weaknesses
     2.6.2. Attacks
2.7. Side Channel Attacks
3.1.
3.2.
3.3.
3.4.
3.5.
3.6.
3.7.
6.1. Encryption
     6.1.1. Reading the Public Key
     6.1.2. Encrypting a Message
6.2. Decryption
     6.2.1. Inverting the Permutation P
6.3. Decoding
     6.3.1. Computing the Square Root of T(z)+z
     6.3.2. Solving the Key Equation
     6.3.3. Computing the Error Locator Polynomial Sigma
     6.3.4. Searching Roots
6.4. Reverting the Substitution S
8.1. Conclusions
8.2. Problems
8.3. Outlook for Further Work
The advanced properties of public-key cryptosystems are required for many cryptographic tasks, such as key establishment between parties and digital signatures. In this context, RSA, ElGamal, and later ECC have evolved as the most popular choices and form the foundation for virtually all practical security protocols and implementations that require public-key cryptography. However, these cryptosystems rely on two primitive security assumptions, namely the factoring problem (FP) and the discrete logarithm problem (DLP), which are also known to be closely related. With a significant breakthrough in cryptanalysis or a major improvement of the best known attacks on these problems (i.e., the Number Field Sieve or Index Calculus), a large number of currently employed cryptosystems could turn out to be insecure overnight. Already the existence of a quantum computer that can perform computations on a few thousand qubits would render FP- and DLP-based cryptography useless. Though quantum computers of that dimension have not been reported to have been built yet, we already want to encourage a larger diversification of cryptographic primitives in future public-key systems. However, to be accepted as real alternatives to conventional systems like RSA and ECC, such security primitives need to support efficient implementations with a comparable level of security on recent computing platforms. For example, one promising alternative is the family of public-key schemes based on Multivariate Quadratic (MQ) polynomials, for which hardware implementations were proposed at CHES 2008 [9]. In this work, we demonstrate the efficient implementation of another public-key cryptosystem, proposed by Robert J. McEliece in 1978, that is based on coding theory [35]. The McEliece cryptosystem incorporates a linear error-correcting code (namely a Goppa code) which is hidden as a general linear code. For Goppa codes, fast decoding algorithms exist when the code is known, but decoding codewords without knowledge of the coding scheme is proven NP-complete [3]. In contrast to DLP- and FP-based systems, this makes the scheme also suitable for the post-quantum era, since it will remain unbroken when appropriately chosen security parameters are used [6].
The vast majority of today's computing platforms are embedded systems; already in 2001, 98% of the microprocessors produced worldwide were assembled into embedded platforms. Only a few years ago, most of these devices could provide only a few hundred bytes
of RAM and ROM, which was a tight restriction for application (and security) designers. Thus, the McEliece scheme was regarded as impracticable on such small embedded systems due to the large size of its private and public keys. For example, at an 80-bit security level, the public key is 437.75 KByte and the secret key is 377 KByte large.
But nowadays, recent families of microcontrollers provide several hundred kilobytes of Flash ROM. Moreover, recent off-the-shelf hardware such as FPGAs also contains dedicated memory blocks and Flash memories that support on-chip storage of up to a few megabits of data. In particular, these memories can be used to store the keys of the McEliece cryptosystem.
While a microcontroller implementation already exists [16], this work presents the first implementation of the McEliece cryptosystem on a Xilinx Spartan-3AN 1400 FPGA, which is suitable for many embedded system applications. To the best of our knowledge, no other implementations of the McEliece scheme have been proposed targeting embedded platforms. Fundamental operations in McEliece are based on encoding and decoding binary linear codes over binary extension fields, which, in particular, can be implemented very efficiently in dedicated hardware. Unlike FP- and DLP-based cryptosystems, operations on binary codes do not require computationally expensive multi-precision integer arithmetic, which is beneficial for small computing platforms.
There exist only a few McEliece software implementations [39, 40] for 32-bit architectures. The implementation in [39] is written in pure i386 assembler. It encrypts at 6 Kbit/s and decrypts at 1.7 Kbit/s on a 16 MHz i386 CPU. A C implementation for 32-bit architectures exists in [40]. The code is neither formatted nor does it contain a single line of source code comments. Nevertheless, it was used for the open-source peer-to-peer software Freenet and Entropy [19, 18]. Due to the poor documentation, we are unable to give performance results for this specific code. An FPGA implementation of the original McEliece cryptosystem appears not to exist; only in [8] is a signature scheme derived from McEliece implemented on an FPGA. To the best of our knowledge, this work presents the first implementation of the McEliece public-key scheme for embedded systems.
The goal of this thesis is a proof-of-concept implementation of McEliece for reconfigurable hardware. The primary target is to solve the memory problem for the large public and private key data and to fit all components into a low-cost FPGA of the Spartan-3 class. Simultaneously, we want to achieve a high performance, to allow the McEliece scheme to become a competitor among post-quantum cryptosystems on embedded systems. For encryption, the large public key has to be stored in a quickly accessible area, and for decryption, a way has to be found to reduce the private key size, because the target FPGA has only limited memory available.
First, the growing interest in post-quantum cryptography is motivated, and it is explained why embedded implementations in particular are necessary. Next, an introduction to the McEliece scheme is given. Subsequently, the basic concepts of error-correcting codes, especially Goppa codes, are presented. Afterwards, an introduction to FPGAs is given and it is shown how to perform finite field arithmetic on them. Then the implementation of encryption is explained with respect to the properties of the target platform. Next, the implementation of decryption is described on the second target platform. Finally, the results are presented and options for further improvement are outlined.
The McEliece cryptosystem consists of three algorithms, which will be explained in the following sections: a key generation algorithm producing a public/private key pair, an encryption algorithm, and a decryption algorithm. The public key is a hidden generator matrix G of a binary linear code of length n and dimension k, capable of correcting up to t errors. R. McEliece suggested using classical binary Goppa codes. The generator matrix of this code is hidden using a random k × k binary non-singular substitution matrix S and a random n × n permutation matrix P. The matrix Ĝ = S·G·P together with the error correcting capability t forms the public key. G itself, together with the matrices S and P, forms the secret key.

To encrypt a message, it is transformed into a codeword of the underlying code and an error e with Hamming weight wH(e) ≤ t is added. Without knowledge of the specific code used, the errors cannot be corrected and therefore the original message cannot be recovered. The owner of the secret information can reverse the transformations of Ĝ and use the decoding algorithm of the code to correct the errors and recover the original message.
c' = m·Ĝ = m·S·G·P,   c = c' + e = m·S·G·P + e
c·P^-1 = m·S·G·P·P^-1 + e·P^-1 = m·S·G + e·P^-1   (2.1)
The decoding algorithm can still be used to correct the t errors and obtain m̂ = m·S, because the equation wH(e) = wH(e·P^-1) ≤ t still holds. To finally recover the original message m, a multiplication by S^-1 is performed, which yields m̂·S^-1 = m·S·S^-1 = m.
Basically, the parameters of the McEliece cryptosystem are the parameters of the underlying Goppa code. After choosing the underlying Galois field GF(2^m) and the error correcting capability t of the code, all other parameters depend on these two values. The original parameters suggested by McEliece in [35] are m = 10, t = 50, but in the Handbook of Applied Cryptography the authors noted that t = 38 is a better choice with regard to the computational complexity of the algorithm while not reducing the security level. These parameters nowadays offer only about 2^60 security [6] compared to a symmetric cipher. To achieve an appropriate security level of at least 2^80, the parameters n = 2^11 = 2048, t = 27 and k = n − m·t = 1751 must be chosen. Then the key generation is:
Algorithm 1 Key Generation Algorithm for the McEliece Scheme
Input: security parameters m, t
Output: K_pub, K_sec
1: n ← 2^m, k ← n − m·t
2: C ← random binary (n, k) linear code capable of correcting t errors
3: G ← k × n generator matrix for the code C
4: S ← random k × k binary non-singular matrix
5: P ← random n × n permutation matrix
6: Ĝ ← k × n matrix S·G·P
7: return public key (Ĝ, t); private key (S, G, P)
Note that in Section 2.4 only the matrices P^-1 and S^-1 are used. It is thus possible to precompute these inverse matrices, and the private (decryption) key is then redefined as (S^-1, G, P^-1). The permutation matrix P is very sparse: in every row and every column exactly a single one occurs. This fact is used in the implementation to save space when storing P.
Suppose Bob wishes to send a message m to Alice, whose public key is (Ĝ, t). McEliece encrypts a k-bit message into an n-bit ciphertext. With the actual parameters, this results in a 2048-bit ciphertext for a 1751-bit message, an overhead factor of n/k ≈ 1.17.
Algorithm 2 McEliece Message Encryption
Input: m, K_pub = (Ĝ, t)
Output: Ciphertext c
1: Represent the message m as a binary string of length k
2: c' ← m·Ĝ
3: Generate a random n-bit error vector z containing at most t ones
4: c ← c' + z
5: return c

2.4 Decryption
To decrypt the ciphertext and receive the message, Alice has to perform the following steps:

Algorithm 3 McEliece Message Decryption
Input: c, K_sec = (P^-1, G, S^-1)
Output: Plaintext m
1: ĉ ← c·P^-1
2: Use a decoding algorithm for the code C to decode ĉ to m̂ = m·S
3: m ← m̂·S^-1
4: return m
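The complete scheme can be traced end-to-end with a toy example. The following sketch is an illustration only, not the implementation described in this thesis: it substitutes the Goppa code with a tiny Hamming (7,4) code correcting t = 1 error, but follows Algorithms 1 to 3 step by step. All helper names are our own.

```python
import numpy as np

rng = np.random.default_rng(42)

def gf2_inv(A):
    """Invert a binary matrix over GF(2) by Gauss-Jordan; None if singular."""
    n = A.shape[0]
    M = np.concatenate([A % 2, np.eye(n, dtype=np.int64)], axis=1)
    for col in range(n):
        piv = next((r for r in range(col, n) if M[r, col]), None)
        if piv is None:
            return None
        M[[col, piv]] = M[[piv, col]]
        for r in range(n):
            if r != col and M[r, col]:
                M[r] ^= M[col]
    return M[:, n:]

# Generator and parity check matrix of the systematic Hamming (7,4) code, t = 1.
G = np.array([[1,0,0,0,1,1,0],
              [0,1,0,0,1,0,1],
              [0,0,1,0,0,1,1],
              [0,0,0,1,1,1,1]])
H = np.array([[1,1,0,1,1,0,0],
              [1,0,1,1,0,1,0],
              [0,1,1,1,0,0,1]])
k, n, t = 4, 7, 1

# Key generation (Algorithm 1): random non-singular S, random permutation P.
while True:
    S = rng.integers(0, 2, (k, k))
    S_inv = gf2_inv(S)
    if S_inv is not None:
        break
perm = rng.permutation(n)
P = np.eye(n, dtype=np.int64)[:, perm]
G_hat = S @ G @ P % 2                 # public key (G_hat, t)

# Encryption (Algorithm 2): c = m*G_hat + z with wt(z) <= t.
m = np.array([1, 0, 1, 1])
z = np.zeros(n, dtype=np.int64)
z[rng.integers(n)] = 1
c = (m @ G_hat + z) % 2

# Decryption (Algorithm 3): undo P, decode, undo S.
c_hat = c @ gf2_inv(P) % 2            # c * P^-1
syn = H @ c_hat % 2                   # syndrome decoding corrects the 1 error
if syn.any():
    err_pos = next(j for j in range(n) if (H[:, j] == syn).all())
    c_hat[err_pos] ^= 1
m_rec = c_hat[:k] @ S_inv % 2         # first k bits give m*S (G is systematic)
assert (m_rec == m).all()
print("recovered:", m_rec)
```

The real parameters only change the sizes: G becomes the 1751 × 2048 Goppa generator matrix and the syndrome decoding step becomes Patterson's algorithm of Chapter 3.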
To make McEliece-based cryptosystems more practical (i.e., to reduce the key sizes), there is ongoing research into replacing the code with one that can be represented in a more compact way. Examples of such alternative representations are quasi-cyclic codes [20] or low-density parity-check codes [37], but these variants have also been broken again.
Using a naive approach, in which the code is constructed from the set of all elements of GF(2^m) in lexicographical order and both matrices S, P are totally random, the public key Ĝ = S·G·P becomes a random k × n matrix. However, since P is a sparse permutation matrix with only a single 1 in each row and column, it is more efficient to store only the positions of the 1s, resulting in an array of n·m bits.
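For the parameters used later in this thesis (n = 2048, k = 1751, m = 11), the savings are easy to quantify. The following lines are only a back-of-the-envelope check of the sizes quoted in the text:

```python
n, k, m = 2048, 1751, 11

pub_full_kbyte = k * n / 8 / 1024      # full G_hat as a bit matrix: 437.75 KByte
perm_full_kbyte = n * n / 8 / 1024     # P as a full n x n bit matrix: 512 KByte
perm_packed_bits = n * m               # only the positions of the 1s: 22528 bits
print(pub_full_kbyte, perm_full_kbyte, perm_packed_bits / 8 / 1024)
```

Storing P as an index array thus shrinks it from 512 KByte to 2.75 KByte.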
Another trick to reduce the public key size is to convert Ĝ to systematic form (I_k | Q), where I_k is the k × k identity matrix. Then only the k × (n − k) matrix Q is published [17]. But to achieve the same security level the message has to be
In the past, many researchers attempted to break the McEliece scheme [33, 32, 2], but none of them was successful in the general case. At Crypto '97, Berson [7] showed that McEliece has two weaknesses: it fails when encrypting the same message twice and when encrypting a message that has a known relation to another message. In the first case, assume that c1 = m·Ĝ + e1 and c2 = m·Ĝ + e2 with e1 ≠ e2 are sent, which leads to c1 + c2 = e1 + e2.
The probability that the two error vectors overlap in exactly i positions is

p_i = Pr(|{l : e1(l) = 1} ∩ {l : e2(l) = 1}| = i) = C(50, i)·C(974, 50−i) / C(1024, 50)   (2.2)

and the expected cardinality of the set L1 of positions where c1 and c2 differ is

E(|L1|) = Σ_{i=0}^{50} (100 − 2i)·p_i ≈ 95.1   (2.3)
For example, with the cardinality |L1| = 94 it follows that |L0| = 930 and only 3 entries of L0 are affected by an error. The probability of selecting 524 unmodified positions from L0 is
C(927, 524) / C(930, 524) ≈ 0.0828   (2.4)
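The numbers in Equations (2.2) to (2.4) can be verified numerically for the original parameters n = 1024, k = 524, t = 50. This is a sketch for checking purposes only:

```python
from math import comb

n, k, t = 1024, 524, 50

# Eq. (2.2)/(2.3): expected size of L1, the set of positions where the
# ciphertexts c1 and c2 of the same message differ.
p = [comb(t, i) * comb(n - t, t - i) / comb(n, t) for i in range(t + 1)]
E_L1 = sum((100 - 2 * i) * p[i] for i in range(t + 1))
print(E_L1)                          # approx. 95.1, matching (2.3)

# Eq. (2.4): probability that 524 positions chosen from the 930 positions
# in L0 avoid the 3 error-affected ones.
prob = comb(927, 524) / comb(930, 524)
print(prob)                          # approx. 0.0828, matching (2.4)
```

The hypergeometric weights p_i sum to 1, which is a useful sanity check on the reconstruction.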
According to [17], there is no simple rule for choosing t with respect to n. One should try to make an attack as difficult as possible using the best known attacks. A recent paper [5] by Bernstein, Lange and Peters introduces an improved attack on McEliece with the support of Bernstein's list decoding algorithm [4] for binary Goppa codes.
[Table: security levels with the corresponding sizes of K_pub and of K_sec (G(z), P, S) in Kbit; the extracted values 644, 3502, and 33178 Kbit survive without their row labels.]
For keys limited to 2^16, 2^17, 2^18, 2^19, and 2^20 bytes, the authors propose Goppa codes of lengths 1744, 2480, 3408, 4624, 6960 and degrees 35, 45, 67, 95, 119 respectively, with 36, 46, 68, 97, 121 errors added by the sender. These codes achieve security levels of 2^84.88, 2^107.41, 2^147.94, 2^191.18, and 2^266.94 against the attack described by the researchers.
The susceptibility of the McEliece cryptosystem to side channel attacks has not yet been studied extensively. This is probably due to the low number of practical systems employing the McEliece cryptosystem. However, embedded systems can always be subject to passive attacks such as timing analysis [30] and power/EM analysis [34]. In [42], a successful timing attack on the Patterson algorithm was demonstrated. The attack does not recover the key, but reveals the error vector z and hence allows for efficient decryption of the message c. This implementation is not susceptible to this attack due to unconditional instruction execution, i.e., the implementation will not terminate early after a certain number of errors has been corrected.
For a key recovery attack on McEliece, the adversary needs to recover the secret substitution and permutation matrices S and P, and the Goppa code itself, either represented by G or by the Goppa polynomial g(z) and the support L. A feasible SPA is possible to recover ĉ: for each bit set in ĉ, our implementation performs one iteration of the extended Euclidean algorithm (EEA). Assuming the execution of the EEA to be visible in the power trace, ĉ can easily be recovered. The attacker is then able to recover the whole permutation matrix P with less than 100 chosen ciphertexts. This powerful attack can easily be prevented by processing the bits of ĉ in a random order. Without recovering P, power attacks on inner operations of McEliece are aggravated by the unknown input. Classical timing analysis, as described in [30], seems more realistic, as many parts of the algorithm, such as the EEA, the permutation, and the modular reduction, exhibit a data-dependent runtime. Yet no effective timing attack has been reported so far, and simple countermeasures are available [31].
An attacker knowing P could recover the generator polynomial G(z) and thereby break the implementation [17]. As a countermeasure, we will randomize the execution of step 1 of Algorithm 4, due to the possible SPA recovering ĉ and consequently P^-1. Differential EM/power attacks and timing attacks are impeded by the permutation and scrambling operations (P and S) obfuscating all internal states and, finally, by the large key size. Yet template-like attacks [12] might be feasible if no further protection is applied.
For this section it is assumed that the reader is familiar with the theory of finite fields and algebraic coding theory. A good introduction to finite fields for engineers can be found in [36]. The following definitions are taken from [45] and define the crucial building blocks for Goppa codes. Because the McEliece scheme is based on problems from coding theory, a short introduction to this field, based on [17], is given.
Definition 3.1.1 An (n, k) code C over a finite field F is a k-dimensional subspace of the vector space F^n. We call C an (n, k, d) code if the minimum distance is d = min_{x,y ∈ C} dist(x, y), where dist denotes a distance function, e.g., the Hamming distance. The distance wt(x) := dist(0, x) of x ∈ F^n to the null vector is called the weight of x.
Definition 3.1.2 The matrix G ∈ F^(k×n) is a generator matrix for the (n, k) code C over F if the rows of G span C over F. The matrix H ∈ F^((n−k)×n) is called a parity check matrix for the code C if the right kernel of H is C. The code generated by H is called the dual code of C and is denoted by C^⊥.
By multiplying a message with the generator matrix, a codeword is formed which contains redundant information. This information can be used to recover the original message even if some errors occurred, for example during transmission of the codeword over a radio channel.
Introduced by V. D. Goppa in [21], Goppa codes are a generalization of BCH and RS codes. Good decoding algorithms exist, e.g., [38] and [44].
Definition 3.2.1 (Goppa polynomial, syndrome, binary Goppa codes). Let m and t be positive integers and let

g(z) = Σ_{i=0}^{t} g_i·z^i ∈ GF(2^m)[z]   (3.1)
be a Goppa polynomial, and let

L = {γ_0, …, γ_{n−1}},  γ_i ∈ GF(2^m)   (3.2)

be a support of pairwise distinct elements with

g(γ_i) ≠ 0 for 0 ≤ i ≤ n−1.   (3.3)

The syndrome of a vector c = (c_0, …, c_{n−1}) ∈ GF(2)^n is

S_c(z) = Σ_{i=0}^{n−1} c_i · (g(z) − g(γ_i)) / (g(γ_i)·(z − γ_i))  mod g(z)   (3.4)

The binary Goppa code Γ(L, g(z)) over GF(2) is the set of all c = (c_0, …, c_{n−1}) ∈ GF(2)^n such that the identity

S_c(z) = 0   (3.5)

holds in the polynomial ring GF(2^m)[z], or equivalently

S_c(z) = Σ_{i=0}^{n−1} c_i / (z − γ_i) ≡ 0  mod g(z)   (3.6)
If g(z) is irreducible over GF(2^m), then Γ(L, g(z)) is called an irreducible binary Goppa code.
Recall Equation (3.4): every bit of the received codeword is multiplied by

(g(z) − g(γ_i)) / (g(γ_i)·(z − γ_i))   (3.7)
Writing the Goppa polynomial as g(z) = g_s·z^s + g_{s−1}·z^{s−1} + … + g_0, one can construct the parity check matrix H as

H = | g_s/g(γ_0)                                   …  g_s/g(γ_{n−1})                                   |
    | (g_{s−1} + g_s·γ_0)/g(γ_0)                   …  (g_{s−1} + g_s·γ_{n−1})/g(γ_{n−1})               |
    | ⋮                                               ⋮                                                |
    | (g_1 + g_2·γ_0 + … + g_s·γ_0^{s−1})/g(γ_0)   …  (g_1 + g_2·γ_{n−1} + … + g_s·γ_{n−1}^{s−1})/g(γ_{n−1}) |
   (3.8)
3.4 Encoding
This matrix can be factored as

H = | g_s      0    …  0   |   | 1          1          …  1            |   | 1/g(γ_0)              |
    | g_{s−1}  g_s  …  0   | · | γ_0        γ_1        …  γ_{n−1}      | · |    ⋱                  |
    | ⋮             ⋱      |   | ⋮                        ⋮            |   |       1/g(γ_{n−1})    |
    | g_1      g_2  …  g_s |   | γ_0^{s−1}  γ_1^{s−1}  …  γ_{n−1}^{s−1}|
   (3.9)

Since the first (triangular) factor has a determinant unequal to zero, the product H̄ of the second and third factors is an equivalent parity check matrix with a simpler structure. By applying the Gaussian algorithm to H̄, one can bring it to systematic form (I_{n−k} | Q), where I_{n−k} is the (n−k) × (n−k) identity matrix. Note that whenever a column swap is performed, the corresponding elements of the support are swapped as well. From the systematic parity check matrix (I_{n−k} | Q), the systematic generator matrix G can now be derived as (Q^T | I_k).
However, decoding such a codeword r with a (possibly) additive error vector e on the receiver's side is far more complex. For decoding, we use Patterson's algorithm [38] with improvements from [43]. Since r = c + e and S_c(z) ≡ 0 mod G(z), the syndrome Syn(z) of a received codeword can be obtained from Equation (3.6) by

Syn(z) = Σ_{γ ∈ GF(2^m)} e_γ / (z − γ)  mod G(z)   (3.10)
To finally recover e, we need to solve the key equation σ(z)·Syn(z) ≡ ω(z) mod G(z), where σ(z) denotes the corresponding error locator polynomial and ω(z) denotes an error weight polynomial. It can be shown that ω(z) = σ′(z) is the formal derivative of the error locator. By splitting σ(z) into even and odd polynomial parts, σ(z) = a(z)^2 + z·b(z)^2, we finally obtain the following equation, which needs to be solved to determine the error positions:

Syn(z)·(a(z)^2 + z·b(z)^2) ≡ b(z)^2  mod G(z)   (3.11)
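Over a binary field, the even/odd split is particularly cheap because squaring merely spreads coefficients to even positions. The minimal sketch below works over GF(2)[z] with polynomials as coefficient lists; in the actual decoder the coefficients live in GF(2^11), where a coefficient square root is additionally needed. The helper names are ours:

```python
def gf2_sq(p):
    """Square a polynomial over GF(2): coefficient i moves to position 2i."""
    out = [0] * (2 * len(p) - 1) if p else []
    for i, c in enumerate(p):
        out[2 * i] = c
    return out

def gf2_add(p, q):
    """Add (XOR) two GF(2) polynomials given as coefficient lists."""
    L = max(len(p), len(q))
    p = p + [0] * (L - len(p))
    q = q + [0] * (L - len(q))
    return [a ^ b for a, b in zip(p, q)]

def split_even_odd(sigma):
    """Recover a, b with sigma(z) = a(z)^2 + z*b(z)^2 over GF(2)."""
    a = sigma[0::2]   # even-index coefficients; sqrt is the identity in GF(2)
    b = sigma[1::2]   # odd-index coefficients
    return a, b

# sigma(z) = z^4 + z^3 + z + 1  ->  a(z) = z^2 + 1, b(z) = z + 1
sigma = [1, 1, 0, 1, 1]          # coefficients sigma_0 .. sigma_4
a, b = split_even_odd(sigma)
recon = gf2_add(gf2_sq(a), [0] + gf2_sq(b))   # a^2 + z*b^2
assert recon == sigma
```

The same splitting is used in Step 3 below to write T(z) + z as T_0(z)^2 + z·T_1(z)^2.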
To solve Equation (3.11) for a given codeword r, the following steps have to be performed:

1. From the received codeword r, compute the syndrome Syn(z) according to Equation (3.10). This can also be done using simple table lookups.

2. Compute an inverse polynomial T(z) with T(z)·Syn(z) ≡ 1 mod G(z) (or provide a corresponding table). It follows that (T(z) + z)·b(z)^2 ≡ a(z)^2 mod G(z).

3. There is a simple case if T(z) = z: then a(z) = 0, so that σ(z) = z·b(z)^2 with z·b(z)^2·Syn(z) ≡ b(z)^2 mod G(z), which directly leads to σ(z) = z. Otherwise, if T(z) ≠ z, compute a square root R(z) for the given polynomial, i.e., R(z)^2 ≡ T(z) + z mod G(z). Based on an observation by Huber [24], we can compute the square root R(z) by

R(z) = T_0(z) + w(z)·T_1(z)   (3.12)

where T_0(z), T_1(z) are the even and odd parts of T(z) + z, satisfying T(z) + z = T_0(z)^2 + z·T_1(z)^2, and w(z)^2 ≡ z mod G(z), which can be precomputed for every given code. We can then determine solutions a(z), b(z) satisfying

a(z) ≡ b(z)·R(z)  mod G(z)   (3.13)

with a modified Euclidean algorithm (see Section 3.6). Finally, we use the identified a(z), b(z) to construct the error locator polynomial σ(z) = a(z)^2 + z·b(z)^2.

4. The roots of σ(z) denote the positions of error bits (see Section 3.7). If σ(α^i) ≡ 0 mod G(z), with α^i being the corresponding power of a generator of GF(2^11), there was an error at position i in the received codeword, which can be corrected by bit flipping.
This decoding process, as required in Step 2 of Algorithm 3 for message decryption, is finally summarized in Algorithm 4:
6:    R(z) ← √(T(z) + z)
7:    Compute a(z) and b(z) with a(z) ≡ b(z)·R(z) mod G(z)
8:    σ(z) ← a(z)^2 + z·b(z)^2
9: end if
10: Determine the roots of σ(z) and correct the errors in r, which results in r_cor
11: m ← r_cor·iG   {map r_cor to m}
12: return m
To solve the equation a(z) ≡ b(z)·R(z) mod G(z) with the extended Euclidean algorithm, observe the following: from σ(z) = a(z)^2 + z·b(z)^2 and deg(σ(z)) ≤ deg(G(z)) it follows that deg(a(z)) ≤ ⌊deg(G(z))/2⌋ and deg(b(z)) ≤ ⌊(deg(G(z))−1)/2⌋. During the iterations of the extended Euclidean algorithm,

r_{i−2}(z) = 1·R(z) + 0·G(z)
r_{i−1}(z) = 0·R(z) + 1·G(z)
⋮
r_i(z) = r_{i−2}(z) − a_i(z)·r_{i−1}(z), where a_i(z) = ⌊r_{i−2}(z)/r_{i−1}(z)⌋   (3.14)

and the cofactors are updated accordingly:

u_i(z) = u_{i−2}(z) − a_i(z)·u_{i−1}(z)   (3.15)
v_i(z) = v_{i−2}(z) − a_i(z)·v_{i−1}(z)   (3.16)

In addition, r_i(z) = u_i(z)·R(z) + v_i(z)·G(z) holds in every iteration. It follows that deg(r(z)) is constantly decreasing after starting with deg(G(z)), while deg(u(z)) is increasing after starting with zero. Using this, one can see that there is a unique point in the computation of the EEA where both polynomials are just below their bounds: the EEA can be stopped when, for the first time, deg(r(z)) < ⌊deg(G(z))/2⌋ and deg(u(z)) ≤ ⌊(deg(G(z))−1)/2⌋.
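This stopping rule can be tried out with GF(2) polynomials encoded as Python integers (bit i holds the coefficient of z^i). The sketch below is a toy version of the partial EEA, not the hardware design; variable names follow the text:

```python
def deg(p):          # degree of a GF(2) polynomial stored as an int
    return p.bit_length() - 1

def gf2_mul(a, b):   # carry-less multiplication over GF(2)
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def gf2_divmod(a, b):
    q = 0
    while deg(a) >= deg(b):
        s = deg(a) - deg(b)
        q ^= 1 << s
        a ^= b << s
    return q, a

def gf2_mod(a, b):
    return gf2_divmod(a, b)[1]

def partial_eea(R, G):
    """Run the EEA on (R, G), stopping when deg(r) < deg(G)/2.
    Returns (a, b) = (r, u) with a = b*R mod G."""
    r_prev, r_cur = G, R      # remainders
    u_prev, u_cur = 0, 1      # cofactors of R
    while deg(r_cur) >= deg(G) / 2:
        q, rem = gf2_divmod(r_prev, r_cur)
        r_prev, r_cur = r_cur, rem
        u_prev, u_cur = u_cur, u_prev ^ gf2_mul(q, u_cur)
    return r_cur, u_cur

G = 0b1011011   # degree-6 polynomial playing the role of G(z)
R = 0b100101    # degree-5 polynomial playing the role of R(z)
a, b = partial_eea(R, G)
assert gf2_mod(gf2_mul(b, R), G) == a         # a = b*R mod G
assert deg(a) <= deg(G) // 2
```

With these example inputs, the returned b also satisfies deg(b) ≤ ⌊(deg(G)−1)/2⌋; for the Patterson solution this bound holds by the degree argument above.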
The roots of σ(z) indicate the error positions: if σ(α^i) = 0, there was an error e_i = 1 at c_i in the received bit string, which can be corrected by just flipping the bit. The roots can be found by evaluating σ(α^i) for every i. A more sophisticated method is the Chien search [13]. To get all roots of σ(z), the following holds for all elements except the zero element:

σ(α^i)     = σ_s·(α^i)^s        + σ_{s−1}·(α^i)^{s−1}         + … + σ_1·α^i      + σ_0
           = φ_{s,i}            + φ_{s−1,i}                   + … + φ_{1,i}      + φ_{0,i}
σ(α^{i+1}) = σ_s·(α^{i+1})^s    + σ_{s−1}·(α^{i+1})^{s−1}     + … + σ_1·α^{i+1}  + σ_0
           = σ_s·(α^i)^s·α^s    + σ_{s−1}·(α^i)^{s−1}·α^{s−1} + … + σ_1·α^i·α    + σ_0
           = φ_{s,i}·α^s        + φ_{s−1,i}·α^{s−1}           + … + φ_{1,i}·α    + φ_{0,i}
           = φ_{s,i+1}          + φ_{s−1,i+1}                 + … + φ_{1,i+1}    + φ_{0,i+1}

In other words, one may define σ(α^i) as the sum of the set {φ_{j,i} | 0 ≤ j ≤ s}, from which the next set of coefficients may be derived as

φ_{j,i+1} = φ_{j,i}·α^j   (3.17)

If

Σ_{j=0}^{s} φ_{j,i} = 0   (3.18)

then σ(α^i) = 0 and α^i is a root. This method is more efficient than the brute-force-like method mentioned before.
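The update rule (3.17) can be demonstrated in a small field. The sketch below uses GF(2^4) with the primitive trinomial z^4 + z + 1 instead of GF(2^11) and compares the Chien search against brute-force evaluation (illustrative code, names are ours):

```python
# GF(2^4) with primitive polynomial z^4 + z + 1; alpha = 0b0010 is primitive.
EXP, LOG = [0] * 30, [0] * 16
x = 1
for i in range(15):
    EXP[i] = x
    LOG[x] = i
    x <<= 1
    if x & 0b10000:
        x ^= 0b10011
for i in range(15, 30):
    EXP[i] = EXP[i - 15]

def mul(a, b):
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]

def poly_eval(sigma, x):
    """Brute-force Horner evaluation of sigma at x."""
    acc = 0
    for c in reversed(sigma):
        acc = mul(acc, x) ^ c
    return acc

# sigma(z) = (z - alpha^1)(z - alpha^3), built with table arithmetic.
a1, a3 = EXP[1], EXP[3]
sigma = [mul(a1, a3), a1 ^ a3, 1]          # sigma_0, sigma_1, sigma_2

# Chien search: phi_j = sigma_j * alpha^(j*i); per step, phi_j *= alpha^j.
phi = sigma[:]                             # i = 0: phi_{j,0} = sigma_j
roots = []
for i in range(15):
    if phi[0] ^ phi[1] ^ phi[2] == 0:      # Eq. (3.18)
        roots.append(i)                    # alpha^i is a root
    for j in range(1, 3):
        phi[j] = mul(phi[j], EXP[j])       # Eq. (3.17)

brute = [i for i in range(15) if poly_eval(sigma, EXP[i]) == 0]
assert roots == brute == [1, 3]
```

Each Chien step costs only s multiplications by constants, which maps naturally onto fixed hardware multipliers; the brute-force variant re-evaluates the full polynomial per element.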
If the generator matrix G is in standard form, one can just take the first k bits of the n-bit codeword c to retrieve the original message m. To get a generator matrix in this form, the elements in the support of the code are reordered until G is in standard form. Note that in the computation of the syndrome and the error correction, the i-th element of c and of σ(z), respectively, corresponds to the i-th element of the support and not to α^i. If a matrix not in standard form is used, a mapping matrix iG, which selects the right k bits from c, has to be found. This matrix must satisfy G·iG = ID_k, where ID_k is the k × k identity matrix. To get iG, do the following:
Algorithm 5 Getting a Mapping Matrix from Codewords to Messages
Input: G
Output: iG
1: Select k columns of G at random
2: Test whether the resulting k × k matrix sub is invertible
3: If not, go to Step 1
4: Otherwise, insert the rows of sub^-1 into an n × k matrix at the positions where the k random columns were taken from, and fill the remaining rows with zeros
5: return iG
For the Magma algorithm computing iG, see Appendix B. A multiplication of a valid codeword c by this matrix iG yields the corresponding message m, because

c·iG = (m·G)·iG = m·ID_k = m   (3.19)
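Algorithm 5 can be sketched with the toy Hamming (7,4) code (illustration only; the thesis uses a Magma routine, see Appendix B):

```python
import numpy as np

rng = np.random.default_rng(7)

def gf2_inv(A):
    """Invert a binary matrix over GF(2) by Gauss-Jordan; None if singular."""
    n = A.shape[0]
    M = np.concatenate([A % 2, np.eye(n, dtype=np.int64)], axis=1)
    for col in range(n):
        piv = next((r for r in range(col, n) if M[r, col]), None)
        if piv is None:
            return None
        M[[col, piv]] = M[[piv, col]]
        for r in range(n):
            if r != col and M[r, col]:
                M[r] ^= M[col]
    return M[:, n:]

# A generator matrix NOT in standard form: Hamming (7,4) with columns permuted.
G_std = np.array([[1,0,0,0,1,1,0],
                  [0,1,0,0,1,0,1],
                  [0,0,1,0,0,1,1],
                  [0,0,0,1,1,1,1]])
k, n = 4, 7
G = G_std[:, rng.permutation(n)]

# Algorithm 5: find k columns forming an invertible submatrix, invert, embed.
while True:
    cols = sorted(rng.choice(n, size=k, replace=False))
    sub_inv = gf2_inv(G[:, cols])
    if sub_inv is not None:
        break
iG = np.zeros((n, k), dtype=np.int64)
iG[cols, :] = sub_inv                  # rows of sub^-1 at the chosen positions

m = np.array([1, 1, 0, 1])
c = m @ G % 2                          # a valid codeword
assert ((c @ iG) % 2 == m).all()       # Eq. (3.19): c * iG = m
```

Because G·iG reduces to G[:, cols]·sub^-1 = ID_k, Equation (3.19) holds for every valid codeword.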
Now all prerequisites to start implementing the McEliece public key scheme are given.
This chapter introduces FPGAs and presents the necessary additional hardware. Furthermore, field arithmetic in hardware is introduced. FPGA stands for Field Programmable Gate Array. An FPGA consists of a large number of look-up tables, each of which can generate any logic function with four inputs and one output, and of basic storage elements based on flip-flops.
[Figure: generic FPGA floorplan showing CLBs in the center, block RAM columns, dedicated multipliers, DCMs, and IOBs around the periphery.]
to a netlist. This process is called synthesis; for this implementation, the tools XST and Synplify were chosen from the many available. Based on the netlist, the correct behavior of the design can be verified using a simulation tool, in our case ModelSim. Both steps are completely hardware independent. The next step is mapping and translating the netlist into logic resources and special resources offered by the target platform. Due to this hardware dependency, this and the following steps need to know the exact target hardware. The final step, place-and-route (PAR), then tries to find an optimal placement for the single logic blocks and connects them via the switching matrix. The output of PAR can now be converted into a bitstream file and loaded into a flash memory on the FPGA board.
The FPGA contains a logic block that can read this flash memory and configure the FPGA accordingly. On most FPGA boards this memory is located outside the FPGA chip and can therefore be accessed by anyone. To protect the content of the bitstream, which may include intellectual property (IP) cores or, as in our case, secret key material, the bitstream can be stored encrypted. The FPGA boot-up logic then has to decrypt the bitstream before configuring the FPGA. Some special FPGAs, for example the Spartan-3AN series, contain a large on-die flash memory, which can only be accessed by physically opening the chip. For the decryption algorithm, the bitstream file has to be protected by one of the two methods mentioned above.
Aside from the algorithmic part of the design, a way has to be found to get data into the FPGA and, after the computation, to read the data back. For this implementation, we stay with a standard UART interface, even though interfaces with higher bandwidth (Ethernet, USB, PCI Express) are possible. The UART used is derived from an existing source for the PicoBlaze published in Xilinx application note 223 [11]. A wrapper component was developed that provides all necessary ports.

[Figure: UART wrapper C_uart, consisting of a transmitter C_uart_tx and a receiver C_uart_rx, with ports such as data_in[7:0], data_out[7:0], serial_in, serial_out, en_16_x_baud, clk, tx_out, rx_in, and the buffer status flags buffer_full, buffer_half_full, and buffer_data_present.]
For some data, a structure is required that holds values for later processing. For example, ciphertext, plaintext, bytes sent to and received from the UART, and also precomputed values have to be stored inside the FPGA. Instead of building this storage from a huge number of registers, both target FPGAs for this implementation provide dedicated RAM. This block RAM (BRAM) is organized in blocks of 18 Kbit and can be accessed at the maximum frequency of the FPGA. Each of the BRAMs is true dual-ported, allowing independent read and write accesses at the same time. With the Xilinx vendor tool CoreGen, each block can be configured as RAM or ROM in single- or dual-port configuration. One can also select the required width and depth of the BRAM. CoreGen then generates a VHDL instantiation template that can be used in the VHDL code. Every type of BRAM can be initialized with predefined content via the CoreGen wizard. It uses a .coe file, which can contain the memory content in binary, decimal, or hex format. This is used to transfer the precomputed tables and the constant polynomial w into the FPGA. The Goppa polynomial G is not stored in a ROM, because we need to access it in its whole width at once. This polynomial is stored as a large register. Because it is constant and only needed in the extended Euclidean algorithm, it will be resolved to fixed connections to VCC and GND by the synthesis tool. The Spartan-3 200 incorporates 12 BRAMs and the Spartan-3 2000 incorporates 40 BRAMs, allowing up to 216 or 720 Kbit of data to be stored, respectively.
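A quick sanity check with the numbers from this chapter shows why the public key cannot live in BRAM (illustrative arithmetic only):

```python
# On-chip BRAM capacity vs. key storage, using figures quoted in the text.
bram_kbits_s3_200 = 12 * 18      # 216 Kbit on the Spartan-3 200
bram_kbits_s3_2000 = 40 * 18     # 720 Kbit on the Spartan-3 2000
pub_key_kbits = 1751 * 2048 / 1024   # full public key: 3502 Kbit

print(pub_key_kbits, bram_kbits_s3_2000)
# The public key alone (3502 Kbit) exceeds even the 720 Kbit of the larger
# device, so it must reside in external (or on-die SPI) flash, not in BRAM.
assert pub_key_kbits > bram_kbits_s3_2000
```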
All data that is required to configure the FPGA, including constant values from the VHDL code such as ROM and RAM initialization values, has to be stored on the FPGA board. Under normal circumstances, the target platform contains a flash memory chip holding the bitstream file. During boot-up, the FPGA reads this file, configures itself, and initializes the BRAM cells accordingly. But this flash has a standard interface and can therefore be read by anyone. To protect the private key data, two ways exist. The first way, which is only possible on newer FPGAs [47], is to store the bitstream file encrypted. The FPGA contains a hardware decryption module and a user-defined secret key. During boot-up, the bitstream file is decrypted inside the FPGA and then the normal configuration takes place. Our target FPGAs are not capable of decrypting bitstream files. But the Spartan-3AN family of FPGAs contains a large on-die flash memory. Assuming that physically opening the chip is hard, this memory can be accepted as secure on-chip storage. If the designer does not connect the internal flash to the outside world, then the flash cannot be read from the I/O pins. Additionally, there exist some security features which can be
27
Level 1
Level 2
Level 3
configured during bitstream generation. Table 4.1 summarizes the different security
levels provides by Spartan3 FPGAs [26].
Analyzing McEliece encryption and decryption algorithms (cf. Section 3.2), the following arithmetic components are required supporting computations in GF(2m ): a
multiplier, a squaring unit, calculation of square roots, and an inverter. Furthermore,
a binary matrix multiplier for encryption and a permutation element for step 2 in
Algorithm 2 are required. Many arithmetic operations in McEliece can be replaced
by table lookups to significantly accelerate computations at the cost of additional
memory. Our primary goal is area and memory efficiency to fit the large keys and
required lookuptables into the limited onchip memories of our embedded target
platform.
Arithmetic operations in the underlying field GF(211 ) can be performed efficiently
with a combination of polynomial and exponential representation. In registers, we
store the coefficients of a value a GF(211 ) using a polynomial basis with natural
order. Given an a = a10 10 + a9 9 + a8 8 + + a0 0 , the coefficient ai GF(2) is
determined by bit i of an 11 bit standard logic vector where bit 0 denotes the least sig
28
(4.1)
5:
6:
7:
8:
9:
10:
11:
A A q zk B
u u + q zk v
if degree( A) degree( B) then
AB
B A { swap A and B}
uv
v u { swap u and v}
end if
end while
u (z)
return ( A0 )
30
Decryption
Encryption
Count
UART receive
219
UART send
256
(8 8)Submatrix MUL
56064
Error Distribution
PRNG*
UART receive
256
UART send
219
Permutation
Polynomial MUL
Polynomial SQRT
Polynomial SQR
EEA+
1024 + 2
EEA.getDegree
EEA.shiftPoly
EEA.MulCoeff
PRNG+

EEA.DivCoeff
(8 8)Submatrix MUL
65, 536
65, 536
1. get_degree: a component that determine the degree and the leading coefficient
(lc) of a polynomial
2. gf_div: a component that divide two field elements
3. gf_mul: a component that multiply two field elements
31
Due to the large number of required multiplications we choose the fastest possible design for this operation. Instead of performing the multiplication with the table
lookup method mentioned in Section 4.2 or implementing schoolbook or Karazubalike methods, we completely unroll the multiplication to a hardwired tree of XORs
and ANDs. This tree is derived from the formal multiplication of two field elements
and includes modulo reduction mod p() = 11 + 2 + 1, which is a defining polynomial for GF(211 ). Using this a complete multiplication is finished in one clock
cycles. See Appendix C for the complete multiplication code.
After optimization by the synthesis tool the multiplier consists of an 11 bit register
for the result and a number of XORs as shown in Table A.2 in Appendix A. Overall
one multiplier consumes 89 fourinput LUTs, but computes the product in one clock
cycle.
To allow get_degree to complete as fast as possible, first every coefficient of the input
polynomial is checked if it is Zero as shown in Listing 5.1.2.
e n t i t y gf_compare i s PORT(
c o e f f _ i n : in
STD_LOGIC_VECTOR(MCE_M1 downto 0 ) ;
equa l : out s t d _ l o g i c
);
end gf_compare ;
32
a r c h i t e c t u r e s t r u c t u r a l of gf_compare i s
begin
equal <= 0 when ( c o e f f _ i n =GF_ZERO ) e l s e
end s t r u c t u r a l ;
1 ;
Also gf_div is designed for troughput. Like mentioned in section 4.2 it makes use of
two precomputed tables in BRAM. Gf_div transforms both input coefficients into exponential representation at once with a TLU in a dual port ROM poly2exp_dualport.
33
The exponents are then subtracted exp_diff. This difference may become negative.
To avoid an extra clock cycle for adding the modulo to bring the difference back
to the positive numbers, we extend the exp2poly table to the negative numbers as
exp2polymod. It turns out that negative numbers are represented in twos complement
form in hardware. So addresses 0 to 2047 contain the standard exponenttopolynom
mapping. Addresses 2048 and 2049 contain due to twos complement form the exponent 2048, 2047, which cant occur because the maximum exponent range is
0 2046. This addresses are filled with a dummy value. The rest of the address
space now contains the same values in the same order as at the beginning of the
table. By this, the modulo reduction is avoided resulting in about 55, 600 overall
saved clock cycles. Remark that the TLU method is chosen because these design
has finished in 3 clock cycles, whereas a dedicated arithmetic unit computing the
division need at least 11 clock cycles when assuming that one clock cycle is needed
per bit operation. Figure 5.1 shows the complete division component. The lookup
tables and the substractor are highlighted in red. The surrounding signals form the
controlling state machine.
poly2exp_dualport
clk
coeff_a[10:0]
coeff_b[10:0]
[10:0]
[10:0]
[10:0]
clka
clkb
douta[10:0]
addra[10:0]
doutb[10:0]
clka
addrb[10:0]
[10:0]
exp2polymod
[10:0]
=1
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
exp_out_b_1[10:0]
[2]
D[11:0]
R
E
Q[11:0]
[11:0]
[11:0]
douta[10:0]
addra[11:0]
[10:0]
coeff_c[10:0]
[10:0]
TLU2
invert\.exp_diff_2[11:0]
exp_diff[11:0]
TLU1
statemachine
rst
start
[11:0]
[11:0]
I[1:0]
C
R
Q[3:0]
[3]
[0:3]
0
[2]
state[0:3]
0
[1]
1
[0]
1
e
d
[1]
e
d
e
d
un1_state_1[0]
un1_state_2
un1_state_3
D[0] Q[0]
R
E
valid
valid
e
d
un1_state[0]
[0]
valid_1_sqmuxa
The poly_mul_shift_add component also is designed with priority to high performance. Therefore, the multiplication of the input polynomial with the coefficient
is completely unrolled. We use 27 multiplier in parallel of which each consists of a
tree of ANDs and XORs. Each of these multiplier blocks is connected to the fixed
coefficient q and one of the polynomial coefficients as shows in Figure 5.2.
34
297
11
Mul26
Mul25
Mul24
Mul23
Mul22
Mul2
Mul1
Mul0
297
(5.1)
Remember that q is the result of the division of the leading coefficients of two polynomials. By setting u(z) = 0, q = A10 , k = 0 and v(z) = u(z), this equation results in
u(z) = 0 +
1
u(z)
z0 u ( z ) =
A0
A0
(5.2)
35
The next required computation, according to Algorithm 4 line 6, is taking the square
root of T (z) + z).Like mentioned in Section 4.2, instead of using a matrix to compute
the square root of the coefficients, we choose the method proposed by K. Huber [24].
For this and a latter purpose, we need a component, which can split polynomials into
its odd and even part so that T (z) + z = T0 (z)2 + z T1 (z)2 . Splitting a polynomial
means to take the square root of all coefficients and assigning square roots from odd
positions to one result polynomials and even one to the other result polynomial.
All coefficients that origin from even positions form the even part and vise versa.
Taking square roots of coefficients only occurs in this component and nowhere else
in the whole algorithm. Therefore, we choose a method that does not consume many
slices, but use the already introduced TLU method.
And the last required field operation is the computation of a polynomial square
in Algorithm 4 at line eight. Like the square root computation, this operation is
required only once during one decryption. Due to this low utilization we decide to
implement it in a space saving way. The square is computed coefficient by coefficient,
thus requiring only a single squaring unit. This squaring unit is simply the unrolled
multiplier from Section 5.1.1, where both input are wired together.
Based on the decisions made in Chapter 4, this chapter discusses the implementation
of encryption and decryption.
In this section the implementation of the matrix multiplication and generation of the
random error plus its distribution among the ciphertext is presented. Remember
that encryption is only c = m G + e. But first the target platform is presented.
Because of the lower logic requirements for the encryption and since the encryption key is public and does not require confidential storage, the encryption routine is
implemented on a low cost Spartan3 FPGA, namely a XC3S200. This device is part of
a Spartan3 development board manufactured by Digilent[14]. Aside from the FPGA
this board additional provides:
1. Onboard 2Mbit Platform Flash (XCF02S)
2. 8 slide switches, 4 pushbuttons, 9 LEDs, and 4digit sevensegment display
3. Serial port, VGA port, and PS/2 mouse/keyboard port
4. Three 40pin expansion connectors
5. Three highcurrent voltage regulators (3.3V, 2.5V, and 1.2V)
6. 1Mbyte onboard 10ns SRAM (256Kb x 32)
Figure 6.2 shows an summary of the design. The single blocks will be explained
in the following.
The implementation consist of two parts which can be selected via two of the
sliding switches. The first part ( selected with sw1 = 1) reads the public key from
the UART and store it in external SRAM. The second part(selected with sw0 = 1)
reads the message from the UART and encrypt it with the public key.
The top level component , called toplevel_encrypt controls the UART interface, reads
the external switches and writes data to the SRAM. Also this component controls
Chapter 6  Implementation
38
(a) Board
BRAM
Buffer
PRNG
for Errors
Access switch
Matrix
Multiplier
FSM
FSM
DEBUG
Mode Select
8x8
Matrix
Multiplier
Row Counter
Column Counter
Buffer
BRAM
Access switch
Buffer
I/O
UART
mce_encrypt
toplevel_encrypt
SRAM
6.1 Encryption
39
procedure is repeated until all 448, 512 bytes of G are written to SRAM. Now, the
FSM returns to idle state which is indicated by driving an LED on the board. Then,
another public key could be read or, more useful, the mode can be switched to
encryption by setting sw1 to 0 and sw0 to 1.
To send data from the PC to the FPGA the open source terminal program Hterm [23]
is used. Hterm can send bytes to the FPGA either from keyboard or from a file. The
structure of the file containing the data is simply one byte in hexadecimal form (w/o
leading 0x header) on a single line.
During development, debug methods where integrated, that are also in the final
design to allow verification of the current state and the behavior of some important
signals. The development board contains four sevensegment displays which share
a common data bus with signals (a, b, c, d, e, f, g, DP) and are individual selected
by four separated anode control lines. The function of the signals can be seen from
Figure 6.3.
1
[3:0]
SevenSeg
dot
data[0:7]
char[3:0]
[0:7]
C_SevenSeg0
Chapter 6  Implementation
40
misused to display data larger than 16 bit. For example, the SRAM address is 18 bit
wide and therefore the two leading bits are displayed as dots in two of the segments.
Four of this components are in parallel connected to a signal called disp, which by
itself can be connected to various internal signals. Which signal is displayed can be
selected using the sliding switches sw2 to sw4. Table 6.1 shows the available debug
modes and how they are selected. Switch sw5 controls whether the status bit of the
Table 6.1.: Function of the Debug Switches
Signal on Seven Segment Display
SW2
SW3
SW4
UART or the control bits for the SRAM are displayed on the seven LEDs led0 to led6.
Signal led7 is reserved and hardwired to indicate completion of transferring G to the
SRAM.
When the encryption mode is entered by setting sw0 = 1, all buffers are cleared,
SRAM is disabled and the state machine is set to idle state. Now every received
byte from the UART is directly feed through to the mce_encrypt component which
is depicted in Figure 6.5. This component contains the matrix multiplier and the
mce_encrypt
clk
prng_clk
rst
start
byte_ready
rst
[7:0]
[31:0]
M_byte_in[0:7]
SRAMdata[31:0]
valid
need_byte
SRAMoe
status[15:0]
Cipher_byte_out[0:7]
SRAMaddress[17:0]
[15:0]
[7:0]
[17:0]
C_mce_encrypt
[31:0]
SRAMdata[31:0]
6.1 Encryption
41
First, every byte is read into an buffer in BRAM. Because this buffer is implemented as BRAM of 18Kbit of which only 1752 bits are needed, we decide to additional put the cipher buffer of 2048 bit into the same BRAM block. The interface of
this combined buffer is shown in Figure 6.6.
YandM_buff
clka
wea[0]
clkb
web[0]
clk
[0:7]
=0
[0:7]
=1
[0:7]
dina[7:0]
douta[7:0]
doutb[7:0]
addra[8:0]
[0:7]
[0:7]
[0:7]
dinb[7:0]
addrb[8:0]
C_YandM_buff
Chapter 6  Implementation
42
finished. ColCnt also addresses the current c_work which saves an extra address register. When a new row is started the next m_work is read in. When the multiplication
is complete, the FSM changes to error distribution state.
To distribute random errors among the ciphertext, the bit address of the error positions is generated onthefly by a fast and small PRNG based on the PRESENT block
chiffre [10]. Figure 6.7 shows how Present is embedded into the PRNG. For real
PRNG
PRNG
rst
prng_clk
rst
clk
start
seed
rst
prng_clk
Present
rst
clk
rst
clk
start
valid
[79:0]
data_out[63:0]
[63:0]
[63:0]
key[79:0]
valid
cipher[63:0]
[63:0]
plain[63:0]
C_PRNG
C_present
C_PRNG
6.2 Decryption
IV
43
80 bit
64 bit
Plain
80 bit
PRESENT
KEY
Cipher
<<1
Add
Round
Counter
80 bit
Random
64 bit
Due to the polynomial arithmetic involved in decoding the goppa code, decryption
requires far more logic and storage than encryption. For this reason, the decryption
is implemented on a larger Spartan3 FPGA, namely an Xilinx XC3S2000. Originally,
it was planned to save the large S matrix in the internal flash of an XC3S1400AN,
but it turns out that reading data from flash is possible only bitwise at 50MHz
using an SPIInterface and therefore will become a bottleneck. Hence we developed
a method to generate S on the fly in the FPGA so that flash is only required for
storing the bitstream. The McEliece implementation (decryption configuration) for
the Spartan3 2000 FPGA is depicted in Figure 6.9. The top_level component is nearly
the same as for encryption, except for the SRAM control and the adopted debugging
information to the new development board. This board used for decryption is a
HWAFXSP32000 from NuHorizons [46].
As for encryption, the ciphertext is first read using the UART and fed into mce_decrypt.
Figure 6.11 shows the important components and registers of mce_decrypt. The several parts will be explained in the following sections. The mce_decrypt component
stores the ciphertext in a BRAM buffer. This buffer is shared for storing plaintext
and ciphertext. The address buses of this buffer use the same address register, except the highest bit address[8] which selects the ciphertext buffer or plaintext buffer,
respectively. See Figure 6.12 for the complete interface.
Chapter 6  Implementation
44
(PRESENT)
FSM
goppa_
decode
Buffer
I/O
mce_decrypt
UART
Buffer
Polynomial EEA
over GF(211)
Syn(z) Table
Syn(z)1 Table
BRAM
BRAM
Polynomial MUL/
SQ over GF(211)
Log Table
Antilog Table
BRAM
PRNG
for S1
FSM
P1
Access switch
Matrix
Multiplier
BRAM
toplevel_decrypt
The first step in decryption is to revert the permutation P. Like mentioned in Section 2.5, the inverse permutation matrix P1 is not stored as matrix but as array
of 11 bit indexes into the ciphertext. P1 is 2048 11 = 22.528 bit large and stored
in a ROM. Because one BRAM can only hold 18 KBits, this ROM is built of two
BRAMs with a simple 11 bit address bus. The input and output buses are depicted
in Figure 6.13.
From this ROM now consecutive the indexes of the permutated ciphertext are
read as 11bit value perm_address. The value at every address i indicates from which
received cipher text bit j the permutated bit i is taken. Because the received ciphertext is stored byte wise, the upper 8 bits select the byte address of YandMbuff and
the lower three the corresponding bit of the byte. Listing C in the appendix should
make the process of permutation clear.
After eight iterations, one byte of the permutated ciphertext has become available,and is input to the goppa_decode component. This component can immediately
start computing a partial syndrome, while concurrently the next permutated byte is
generated. Both components communicate with a simple handshake protocol with
6.2 Decryption
mce_decrypt
D[0] Q[0]
R
E
valid
D[0] Q[0]
E
valid
start_goppa
goppa_decode
clk
rst
clk
rst
clk
rst
start
byte_ready
[0:7]
[1:7]
D[7:0]
E
Q[7:0]
[0:7]
[0:7]
D[7:0]
R
E
Q[7:0]
[0:7]
YandM_buff
clka
wea[0]
clkb
web[0]
need_byte
valid
plain_byte_out[0:7]
cipher_byte_in[0:7]
C_goppa_decode
[0:7]
[0:7]
[0:7]
0
1
[0:7]
[0:7]
P_decrypt\.Y_byte_w_8[0:7]
D[7:0]
E
Q[7:0]
=0
[0:7]
Y_byte_w[0:7]
doutb[7:0]
[0:7]
[0:7]
[0:7]
dinb[7:0]
D[7:0]
E
Q[7:0]
plain_byte_out[0:7]
[0:7]
ext_need_byte
[7:0]
status[15:0]
plain_byte_out[0:7]
[15:0]
addrb[8:0]
[0:7]
cipher_byte[0:7]
douta[7:0]
dina[7:0]
addra[8:0]
[0:7]
=1
perm_byte[0:7]
[7:0]
[0:7]
[0:7]
C_YandM_buff
Y_byte_in[0:7]
D[0] Q[0]
E
[0:7]
start_prng
D[7:0]
E
Q[7:0]
[0:7]
t_work[0:7]
PRNG
prng_clk
rst
clk
start
seed
prng_clk
valid
data_out[63:0]
[63:0]
statemachine_9
C_PRNG
D[0] Q[0]
E
[0:7]
seed_prng
I[33:0]
C
R
Q[18:0]
[0:18]
state[0:18]
ext_byte_ready
start
[2:0]
[8:2]
D[7:0]
R
Q[7:0]
[7:0]
[7:0]
[7:0]
RowCnt[7:0]
[63:0]
D[63:0] Q[63:0]
E
[63:0]
random[63:0]
[8:1]
D[7:0]
R
Q[7:0]
[7:0]
ColCnt[7:0]
[0:7]
D[7:0]
E
Q[7:0]
[0:7]
part_prod[0:7]
[10:0]
D[10:0] Q[10:0]
R
[10:0]
perm_addr[10:0]
[0:6]
D[7:0]
E
Q[7:0]
[0:7]
m_work[0:7]
invP
clka
[10:0]
douta[10:0]
addra[10:0]
[10:0]
C_invP
C_mce_decrypt
45
Chapter 6  Implementation
46
YandM_buff
clka
wea[0]
clkb
web[0]
clk
[0:7]
=0
[0:7]
douta[7:0]
dina[7:0]
doutb[7:0]
addra[8:0]
[0:7]
=1
[0:7]
[0:7]
dinb[7:0]
addrb[8:0]
[0:7]
C_YandM_buff
invP
clk
clka
[10:0]
douta[10:0]
addra[10:0]
[10:0]
C_invP
6.3 Decoding
47
goppa_decode
clk
rst
clk
rst
start
byte_ready
[0:7]
need_byte
valid
plain_byte_out[0:7]
cipher_byte_in[0:7]
[0:7]
C_goppa_decode
48
mce_decrypt
goppa_decode
poly_EEA
clk
rst
start
=0
stopdegree[4:0]
[2]
[0:7]
[10:0]
[0:7]
[10:0]
poly_s[296:0]
[0:7]
addra[7:0]
C_cipher_buff
[10:0]
=0
C_EEA
[0]
[0:7]
douta[7:0]
dina[7:0]
D[7:0]
R
E
Q[7:0]
[0:7]
[10:0]
poly_x[296:0]
[2]
start_EEA
cipher_BRAM_buff
clka
wea[0]
valid
poly_r[153:0]
plain_byte_out[0:7]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
SList
clka
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
poly_mulW_T1_clocked
statemachine_7
[10:0]
I[22:0]
C
R
[10:0]
0
[10:0]
[10:0]
[10:0]
D[7:0]
E
Q[7:0]
[0]
[0:7]
cipher_byte[0:7]
byte_ready
Q[18:0]
state[0:18]
[10:0]
[0:7]
[11]
[0:18]
D[0] Q[0]
R
E
clk
rst
start
[10:0]
[10:0]
[10:0]
start_split
[10:0]
clk
rst
start
clk
rst
D[0] Q[0]
R
E
[10:0]
[10:0]
[10:0]
start_comp_sigma
valid
poly_a_in[153:0]
poly_sigma[307:0]
poly_b_in[153:0]
[10:0]
[10:0]
[10:0]
clk
rst
start
C_comp_sig
[10:0]
[10:0]
poly_sigma_in[307:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
valid
poly_WxT1[296:0]
poly_T1[153:0]
C_mulT1W
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
C_chien_search
[10:0]
[10:0]
[10:0]
[10:0]
poly_even[153:0]
[10:0]
[10:0]
root[10:0]
C_split
[10:0]
[10:0]
[10:0]
valid
ready
poly2split[296:0]
[10:0]
[10:0]
[10:0]
chien_search
clk
rst
start
valid
poly_odd[153:0]
[10:0]
[10:0]
comp_sigma
[10:0]
C_TLU_InvSList
poly_split
[10:0]
[10:0]
douta[10:0]
addra[10:0]
start_mulT1w
[10:0]
start_chien
[10:0]
InvSList
clka
[10:0]
D[0] Q[0]
R
E
[10:0]
start
D[0] Q[0]
R
E
douta[10:0]
addra[10:0]
C_TLU_SLIST
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
[10:0]
C_goppa_decode
C_mce_decrypt
Chapter 6  Implementation
D[0] Q[0]
R
E
6.3 Decoding
49
tion the ready state is entered, in which the FSM waits until signal start drops to zero.
Then, the EEA is finished and returns to idle state.
When the complete syndrome is computed, Syn(z) will be fed again into poly_EEA
to compute T (z) and z is added. In the first version of the implementation all
polynomials are kept in std_logic_vector(297 downto 0) but it turns out that synthesis
tool is unable to handle such large vectors. For example, in a simple XOR, both input
coefficients are placed in one corner of the FPGA and the output far away in another
corner, leading to a large routing delay (nearly 90% of the overall delay). So the
description of polynomials was modified to type PolyArray_t is array(26 downto 0) of
std_logic_vector(10 downto 0). Now, the synthesis tool seems to be able to identify the
regular structure and place corresponding coefficents closer. This leads to a higher
achievable frequency compared to the original version.
Chapter 6  Implementation
50
:!valid_degreea&!:rst
:start&!:rst
:valid_degreea&:stop_degree!=0
checkab
:valid_degreea&:stop_degree="0000"
ready
:start&!:rst
!:valid_reduceu&!:rst
!:valid_div&:valid_reduceu&!
normalize
rst
!:valid_div&!:rst
idle
:rst
&!:
reduce1
:rst
:rst
A
uce
red
lid_
: va
u&
uce
red
lid_
: va
:degreeA&!stop_degree&valid_degreea
:rst
:rst
:rst
reduce2
!:valid_reduceA&!:rst
6.3 Decoding
51
goppa_decode
clk
rst
poly_split
clk
rst
clk
rst
start
[296:0]
valid
poly_odd[153:0]
poly2split[296:0]
poly_even[153:0]
[153:0]
[153:0]
C_split
C_goppa_decode
clk
rst
poly_mulW_T1_clocked
clk
rst
clk
rst
start
[153:0]
valid
poly_WxT1[296:0]
poly_T1[153:0]
[296:0]
C_mulT1W
C_goppa_decode
After computing R(z) completes, it is feed to the poly_EEA component. But now
poly_EEA runs in a special mode. Over the input value stop_EEA, one can decide
if poly_EEA runs in normal mode (stop_EEA = 00000 ) or in solvekeyequationmode (stop_EEA = 00000 ). In this mode, poly_EEA stops the algorithm, when the
Chapter 6  Implementation
52
goppa_decode
poly_mulW_T1_clocked
clk
rst
clk
rst
poly_mul
clk
rst
clk
rst
start
[10:0]
[307:0]
valid
coeff_mulPoly[10:0]
poly_Mulout_extern[307:0]
[307:0]
poly_Mulin_extern[307:0]
C_poly_mul
C_mulT1W
C_goppa_decode
goppa_decode
clk
rst
comp_sigma
clk
rst
clk
rst
start
[153:0]
[153:0]
valid
poly_a_in[153:0]
poly_sigma[307:0]
[307:0]
poly_b_in[153:0]
C_comp_sig
C_goppa_decode
For searching roots of the error locator polynomial, an extra component was designed. It reads the whole polynomial poly_sigma at once and outputs the found
6.3 Decoding
53
clk
rst
chien_search
clk
rst
clk
rst
start
[307:0]
valid
ready
root[10:0]
poly_sigma_in[307:0]
[10:0]
C_chien_search
C_goppa_decode
These array is initialized with the coefficients of poly_sigma and then iteratively updated. After each update, the sum over all array elements is computed and if the
sum becomes zero, the actual tested field element is passed off to the goppa_decode
component. To avoid a timing based side channel attack as mentioned in [30], the
iterations are not stopped when the maximum number of possible roots are found.
Instead the component continue until all field elements are checked.
Each time a new root is reported to goppa_decode, this root (in polynomial representation) is used as address for another ROM InvSList. The 11 bit value located
at this address is the ciphertext bit corresponding the the root. Similar to the permutation operation the upper eight bit address a byte in the ciphertext RAM cipher_BRAM_buff and the lower tree bits select a bit of this byte. Due to the properties
described in Section 3.3 each root correspond with the position of an error bit. This
bit is now corrected(by mean of toggling it) and the whole byte is written back to cipher_BRAM_buff. Because checking roots by chien_search in every case requires more
time than correcting a bit error, no need for synchronizing exist and goppa_decode
can correct the error and afterwards wait for the next root.
When chien_search indicates completion, goppa_decode is also finished. Then,the
transfer of the corrected ciphertext back to mce_decrypt can be started in a bytewise
manner.
Chapter 6  Implementation
54
r
r
0
8
56
r r r
9
57
1
S0,0 = . .
(6.1)
..
..
..
r r
r
7
63
15
Actually, the complete matrix S1 is build row wise from these submatrices. Goal
of this method is to minimize memory accesses for reading and writing message
This way every byte of m
has to be read only once and when one row
bytes of m.
of submatrices is completed, this byte is not required anymore in the later process.
Note that the virtually generated matrix is one bit larger in each dimension than
necessary. The additional bit in each row is removed manually and the additional
what does not affect the computed result.
column is handled by padding a 0 bit to m,
To allow maximum throughput the PRNG has its own clock domain clk_prng. Due
to its simple and regular structure, the PRNG can be clocked with up to the double
frequency compared to the rest of the design. Nevertheless, it turned out that the
multiplication procedure processes one random word in four clock cycles ( five clock
cycles if a column increment occurs) and then wait about 14 clock cycles until the
PRNG has generated the next random word.
Now, the matrix multiplication reads the intermediate message byte mi,j and the
i from YandMbuff and multiply m
i with Si,j . The code used here is
message byte m
the same as shown in listing 6.1.2. The result is added to mi,j and afterwards mi,j is
written back to YandMbuff and mi,j+1 is read. Figure shows a timing diagram of the
multiplication routine:
When the multiplication is finished also the decryption is completed. Now, the
content of YandMbuff is bytebybyte feed to the UART and send to the host PC.
55
MUL
We now present the results the McEliece implementation providing 80 bit security
(n = 2048, k = 1751, t = 27) for an Xilinx Spartan3 FPGA. We report performance
figures for a Xilinx Spartan3 XC3S2005 FPGA performing decryption and a Xilinx
Spartan3 XC3S20005 FPGA performing encryption based on results obtained from
using Xilinx ISE 10.1. The resource requirements for the FPGA implementation after
placeandroute (PAR) are shown in Table 7.1.
Table 7.1.: Implementation results of the McEliece scheme with n = 2048, k =
1751, t = 27 on a Spartan3 FPGA after PAR
Encryption
XC3S200
Resource
Available
Slices
796 (41%)
694 (36%)
1,920
LUTs
1,082 (28%)
870 (22%)
3,840
FFs
1,101 (28%)
915 (23%)
3,840
1 (8%)
1 (8%)
12
4,644 KBits
16,896 Kbits
Slices
12.977 (63%)
12,443 (60%)
20,480
LUTs
17,974 (43%)
18,637 (45%)
40,960
9,985 (24%)
9,894 (23%)
40,960
22 (55%)
22 (55%)
40
11,218 (100%)
4,644 KBits
16,896 Kbits
BRAMs
Decryption
XC3S2000
Flash Memory*
FFs
BRAMs
Flash Memory
Table 7.2 summarizes the clock cycles needed for every part of the de and encryption routines for the FPGA implementation.
In our FPGA design, the CPRNG to generate S1 based on the PRESENT block
cipher turns out to be a bottleneck of our implementation since the matrix generation does not meet the performance of the matrix multiplication. Since designing
CPRNGs for FPGAs is complex, this is not in the scope of this thesis. Hence, Table 7.2 gives estimates for the case that an ideal PRNG (which does not incur any
wait cycles due to throughput limitations) is available. This PRNG must be about
Chapter 7  Results
58
Encryption
Aspect
Spartan3200
Spartan22000
150 MHz
80 MHz
180sec+
147 cycles
Undo permutation c P1
Determine Syn(z)
360,184 cycles
Compute T = Syn(z)1
Compute T + z
625 cycles
487 cycles
312 cycles
Correct errors
312,328 cycles
1,035,684/188,306* cycles
Maximum frequency
Load G into SRAM
Encrypt c = m G
Decryption
Inject errors
Undo scrambling
+
*
S 1
m
to [15], RSA1248 actually corresponds to 80 bit symmetric security. However, no implementation results for embedded systems are available for this key size.
59
Table 7.3.: Comparison of our McEliece designs with singlecore ECC and RSA implementations for 80 bit security.
Method
Platform
McEliece encryption
Spartan3 2005
Time
Throughput
ms/op
bits/sec
2.24
779, 948
81,023/161,829+
McEliece decryption
Spartan3 20005
21.61/10.82+
ECCP160 [22]
Spartan3 10004
5.1
31,200
Spartan3E 15005
51
20,275
These are estimates assuming that an ideal PRNG for generating S1 would be used.
bits)) [1]
Chapter 7  Results
60
Encryption
Decryption
250 MHz
140 MHz
Slices
1,056 ( 5%)
13,162 (72%)
18,432
LUTs
1,204 ( 3%)
19,343 (52%)
36,864
FFs
1,158 ( 3%)
10,861 (29%)
36,864
1 ( 1%)
22 (22%)
96
1,497 KBits
1,497 KBits
Maximum frequency
350 MHz
150 MHz
Slices
369 ( 5%)
4,896 (68%)
7,200
LUTs
944 ( 3%)
13,227 (46%)
28,800
FFs
973 ( 3%)
10,051 (34%)
28,800
1 ( 2%)
12 (25%)
48
1,533 KBits
1,533 KBits
Virtex4lx40
Maximum frequency
BRAMs
Virtex5lx50
Flash Memory*
BRAMs
Flash
Memory*
Available
284 KBit/sec for decryption. The Virtex5 can achieve 1.82 MBits/sec for encryption
and 304 KBit/sec for decryption.
Finally, we like to summarize our findings of this thesis. Furthermore, open issues
and unconsidered aspects will be stated in Section 8.3 to point out the scope for
future research.
We have implemented the McEliece crypto system with special attention to memory
limitations found in all modern FPGAs. This implementation is 2,5 times faster in
encryption than ECC and 23 times faster than RSA. The decryption is about two
times slower than ECC, but 5 times faster than RSA. When taking the output rate
into account this implementation encrypts about 25 times faster than ECC160 with
respect to the plaintext size and decrypts at a five times higher rate. In comparison
to RSA1024 this implementation achieves 38 time the output rate in encryption.
Decryption is eight times faster then RSA1024. Additionally we developed a method
to significantly reduce the secret key size, what is very important since secret key
material has to be stored in a protected area.
Although the underlying algorithms consist matrix multiplication, polynomial
field arithmetic and the extended Euclidean algorithm, due to the large polynomials
and huge data to be stored and processed, we were unable to reach the performance
we expected. For encryption wider RAMs are required to read in as much data as
possible at once. Decryption would benefit from faster PRNGs and shorter polynomials. When the size of the underlying field is increased, less errors are required
for the same level of security. The polynomials can then have a lower degree and
thus an shorter total bit length. This should result in a more effective routing, which
makes about 80% of the delay in our implementation.
Nevertheless, we believe with growing memories in embedded systems, ongoing research and optimizations, McEliece can evolve to a suitable and quantum
computer resistant alternative to RSA and ECC that have been extensively studied for years.
Chapter 8  Discussion
62
Depending on the parameters chosen for place-and-route (PAR), the maximum achievable
frequency is in the range of 70 to 80 MHz on the XC3S2000 FPGA. The critical path is
always in the EEA, between the get_degree and poly_mul_shift_add components. Inserting
a pipeline flip-flop stage into one of these components would double the number of
required clock cycles but would not double the frequency. About 80% of the signal
delay is due to routing. We found that synthesis consistently had problems placing
logically connected parts close together on the chip. For example, the log/antilog
BRAM was placed in one corner of the chip, while the logic working with its output
was placed in the opposite corner. This issue could not be corrected by manual
placement because of other constraints; often signal paths would become longer,
resulting in even worse critical paths.
As mentioned in Section 2.5, some research has been done on reducing the public key
size. These approaches reduce the public key size at the cost of additional
computations during encryption and decryption. For our purposes, however, reducing
the secret key size is more important; this should be investigated further.
The impact of different field sizes at the same security level also remains open
for further research. Especially for implementations that perform the field
arithmetic with table lookups, it can be an advantage if the tables fit exactly
into the available memory and have a word size that is a multiple of the register
size of the target.
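To make the table-size argument concrete: a log/antilog multiplier for GF(2^m) needs two tables of 2^m entries of m bits each, so choosing m such that the entry width and the table size match the target's word size and memory blocks avoids wasted space. A small Python sketch of the technique, using GF(2^8) with the polynomial x^8+x^4+x^3+x+1 (0x11B) and generator 0x03 purely as a familiar example; the function names are ours:

```python
def gf_mul_slow(a, b, m, poly):
    """Bitwise carry-less multiplication in GF(2^m) modulo poly."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a >> m:          # reduce when degree reaches m
            a ^= poly
        b >>= 1
    return r

def build_log_tables(m=8, poly=0x11B, generator=0x03):
    """Build log/antilog tables for GF(2^m): 2^m entries of m bits each."""
    size = 1 << m
    antilog = [0] * size
    log = [0] * size
    x = 1
    for i in range(size - 1):   # generator powers enumerate GF(2^m)*
        antilog[i] = x
        log[x] = i
        x = gf_mul_slow(x, generator, m, poly)
    return log, antilog

def gf_mul(a, b, log, antilog, m):
    """Table-based multiplication: one add mod 2^m - 1, two lookups."""
    if a == 0 or b == 0:
        return 0
    return antilog[(log[a] + log[b]) % ((1 << m) - 1)]
```

With m = 8 each table occupies exactly 256 bytes; for an odd m such as 11, the 11-bit entries must be padded to 16-bit words, which is exactly the kind of mismatch the text refers to.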
Furthermore, the ongoing research into replacing Goppa codes with other, more
compactly representable codes should be intensified. Note that all replacements
proposed for the Goppa codes so far have been broken.
In a future implementation, other ways to compute the parity check matrix should
be tested. Finally, we plan to implement the Niederreiter scheme, which is closely
related to McEliece, in order to compare the efficiency of both systems at the same
level of security.
Table A.1.: Display Characters and Resulting Control Values

Character   Control Value (abcdefg)
0           (0000001)
1           (1001111)
2           (0010010)
3           (0000110)
4           (1001100)
5           (0100100)
6           (0100000)
7           (0001111)
8           (0000000)
9           (0000100)
A           (0001000)
b           (1100000)
C           (0110001)
d           (1000010)
E           (0110000)
F           (0111000)
Chapter A  Tables
[Table: Count of each XOR gate type, XOR2 through XOR12, used in the design]
    [0,1,1,1] : [1,0,0,0],
    [1,1,1,1] : [0,1,0,0],
    default   : [0,0,0,0] >;
  return out;
end function;
function perm(data)
  out := Zero(plainV);
  for i in [0..63] do
    out[((i mod 4)*16 + Floor(i/4)) + 1] := data[i+1];
  end for;
  return out;
end function;
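The index map used in perm above is the PRESENT pLayer: state bit i moves to position (i mod 4)*16 + floor(i/4), i.e. 16*i mod 63 for the inner bits. A quick Python cross-check (helper names are ours) confirms that this map is a permutation of 0..63 and, since 16^3 = 1 mod 63, that applying it three times yields the identity:

```python
def p_layer_index(i: int) -> int:
    """Destination position of state bit i under the PRESENT pLayer."""
    return (i % 4) * 16 + i // 4

def p_layer(bits):
    """Apply the pLayer to a list of 64 bits."""
    out = [0] * 64
    for i, b in enumerate(bits):
        out[p_layer_index(i)] = b
    return out
```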
function present(plain, key)
  // print "plain: ", Reverse(Eltseq(plain));
  data := plain;
  for count in [1..31] do
    // print "\nround: ", count;
    keyseq := Eltseq(key);
    // key := keyV ! keyseq;
    roundkey := plainV ! keyseq[17..80];
    print "Roundkey: ", Reverse(Eltseq(roundkey));
    // print "Data after ...";
    // add roundkey
    data := data + roundkey;
    // print "key xor: ", Reverse(Eltseq(data));
    dataseq := Eltseq(data);
    // sbox layer: substitute each of the 16 nibbles
    for i in [0..15] do
      subst_data := sbox(dataseq[4*i+1 .. 4*i+4]);
      for j in [1..4] do
        data[4*i+j] := subst_data[j];
      end for;
    end for;
    // print "Sbox: ", Reverse(Eltseq(data));
    // permutation layer
    data := perm(data);
    // print "pbox: ", Reverse(Eltseq(data));
    // compute next roundkey
    // print "Next key: ";
    key := Rotate(key, 61);
    // print "after rot: ", Reverse(Eltseq(key));
    keyseq := Eltseq(key);
    key_nibble := keyseq[77..80];
    subst_key := sbox(key_nibble);
    for j in [1..4] do
      key[76+j] := subst_key[j];
    end for;
    // print "after sbox: ", Reverse(Eltseq(key));
    // XOR the 5-bit round counter into key bits 16..20
    tmp := count;
    if (tmp ge 16) then
      tmp := tmp - 16;
      key[20] := key[20] + 1;
    end if;
    if (tmp ge 8) then
      tmp := tmp - 8;
      key[19] := key[19] + 1;
    end if;
    if (tmp ge 4) then
      tmp := tmp - 4;
      key[18] := key[18] + 1;
    end if;
    if (tmp ge 2) then
      tmp := tmp - 2;
      key[17] := key[17] + 1;
    end if;
    if (tmp ge 1) then
      tmp := tmp - 1;
      key[16] := key[16] + 1;
    end if;
    // print "after salt: ", Reverse(Eltseq(key));
  end for;
  keyseq := Eltseq(key);
  roundkey := plainV ! keyseq[17..80];
  print "final roundkey: ", Reverse(Eltseq(roundkey));
  // print "Data after ...";
  // add final roundkey
  data := data + roundkey;
  // print "final key xor: ", Reverse(Eltseq(data));
  return data;
end function;
  plain := state[1];
  key   := state[2];
  count := state[3];
  plain := present(plain, key);
  key := Rotate(key, 1);
  // XOR the 5-bit counter into key bits 1..5
  tmp := count;
  if (tmp ge 16) then
    tmp := tmp - 16;
    key[5] := key[5] + 1;
  end if;
  if (tmp ge 8) then
    tmp := tmp - 8;
    key[4] := key[4] + 1;
  end if;
  if (tmp ge 4) then
    tmp := tmp - 4;
    key[3] := key[3] + 1;
  end if;
  if (tmp ge 2) then
    tmp := tmp - 2;
    key[2] := key[2] + 1;
  end if;
  if (tmp ge 1) then
    tmp := tmp - 1;
    key[1] := key[1] + 1;
  end if;
  count := (count + 1) mod 2^5;
  return <plain, key, count>;
end function;
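The cascade of ge/subtract branches in the two functions above simply XORs the bits of the 5-bit counter into fixed key positions, most significant bit first. The pattern can be written generically; in the following Python sketch the function name and the 0-based positions are ours, mirroring the 1-based Magma indices key[20]..key[16]:

```python
def xor_counter_into_key(key, count, positions):
    """XOR the bits of a counter into key bits at the given
    positions (MSB first); key is a list of 0/1 values.

    Equivalent to the chain: if count >= 16 subtract 16 and toggle
    the first position, then >= 8, >= 4, and so on.
    """
    key = list(key)  # work on a copy
    width = len(positions)
    for bit, pos in enumerate(positions):
        if (count >> (width - 1 - bit)) & 1:
            key[pos] ^= 1
    return key
```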
  // fill S with 8x8 blocks, each block column by column
  for i in [1..BitBound^2] do
    count := count + 1;
    S[8*BlockRowCnt + RowCnt + 1][8*BlockColCnt + ColCnt + 1] := state[1][count];
    // reload prng
    if (count eq 64) then
      state := prng(state);
      count := 0;
    end if;
    if (RowCnt lt 7) then
      RowCnt := RowCnt + 1;
    else
      RowCnt := 0;
      if (ColCnt lt 7) then
        ColCnt := ColCnt + 1;
      else
        ColCnt := 0;
        if (BlockColCnt lt BlockBound - 1) then
          BlockColCnt := BlockColCnt + 1;
        else
          BlockColCnt := 0;
          BlockRowCnt := BlockRowCnt + 1;
          print "Now working in BlockRow ", BlockRowCnt;
        end if;
      end if;
    end if;
  end for;
until IsUnit(Submatrix(S, 1, 1, dim, dim));
return Submatrix(S, 1, 1, dim, dim), S, IV;
end function;
function build_S(IV, dim)
  BlockBound := Ceiling(dim/8);
  BitBound := 8*BlockBound;
  count := 0;
  RowCnt := 0;
  ColCnt := 0;
  BlockRowCnt := 0;
  BlockColCnt := 0;
  state := init_prng(IV);
  state := prng(state);
  S := Zero(KMatrixSpace(F, BitBound, BitBound));
  // fill S with 8x8 blocks, each block column by column
  for i in [1..BitBound^2] do
    count := count + 1;
    // printf "S[%o][%o]\n", 8*BlockRowCnt + RowCnt, 8*BlockColCnt + ColCnt;
    S[8*BlockRowCnt + RowCnt + 1][8*BlockColCnt + ColCnt + 1] := state[1][count];
    // reload prng
    if (count eq 64) then
      state := prng(state);
      count := 0;
    end if;
    if (RowCnt lt 7) then
      RowCnt := RowCnt + 1;
    else
      RowCnt := 0;
      if (ColCnt lt 7) then
        ColCnt := ColCnt + 1;
      else
        ColCnt := 0;
        if (BlockColCnt lt BlockBound - 1) then
          BlockColCnt := BlockColCnt + 1;
        else
          BlockColCnt := 0;
          BlockRowCnt := BlockRowCnt + 1;
          print "Now working in BlockRow ", BlockRowCnt;
        end if;
      end if;
    end if;
  end for;
  print IsUnit(Submatrix(S, 1, 1, dim, dim));
  return Submatrix(S, 1, 1, dim, dim);
end function;
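The nested counters in build_S visit the matrix in a fixed order: within each 8x8 block down one column at a time, then blocks left to right, then block rows top to bottom, consuming one PRNG bit per cell. A Python sketch of just that traversal (the function name is ours) makes the order easy to verify:

```python
def fill_order(block_bound):
    """Yield (row, col) cell indices in the order build_S consumes
    PRNG bits: each 8x8 block is filled one column at a time
    (row index fastest), blocks advance left to right, then the
    next block row starts."""
    for block_row in range(block_bound):
        for block_col in range(block_bound):
            for col in range(8):
                for row in range(8):
                    yield (8 * block_row + row, 8 * block_col + col)
```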
AND b(3)) XOR (a(10) AND b(5));
  if (need_byte = '1') then       -- goppa ready to take
    wait4take <= '1';
  else                            -- don't need or taken?
    if (wait4take = '1') then     -- taken, continue
                                  -- building the next byte
      wait4take <= '0';
      byte_ready <= '0';
      state <= perm1;
    else                          -- don't need, waiting
    end if;
  end if;
else                              -- continue building the current byte
  byte_ready <= '0';
  state <= perm1;
end if;