HEKS: A Family of Key Stretching Algorithms

Arnold Reinhold,

reinhold@world.std.com

July 15, 1999, minor revision July 5, 2001

DRAFT g

1. Introduction

Most people are not good at memorizing even modest amounts of meaningless information. As a result, when they are asked to create a security password or passphrase, they often pick one that is either too short or too easy to guess [REIN]. Such passphrases are then vulnerable to an attacker who searches the space of possible passphrases. Searchers can try random combinations of symbols, use lists of common passphrases, or employ heuristics based, in part, on personal information known about the passphrase creator.

In 1995 I suggested that passphrases be hashed using a computation-intensive algorithm that would make key search attacks more expensive [REIN]. John Kelsey, Bruce Schneier, Chris Hall and David Wagner made a similar proposal [KEL]. They named this approach "key stretching." The Unix password encryption function crypt() can be considered an early key stretcher [MOR]. This paper presents a family of key stretching algorithms, which I call HEKS, for Hash Extended Key Stretcher, and describes two versions in detail.

It is my hope that after review and incorporation of appropriate suggestions, a HEKS algorithm can become a standard for key stretching, just as DES and 3DES are standards for ciphers and SHA1 and MD5 are standards for hash functions. I would argue that deployment of a good standard key stretcher would have a greater positive impact on the actual security of cryptographic systems than will the introduction of the new AES cipher. By publishing this family of algorithms, it is my intent to place the algorithms in the public domain.

The HEKS algorithm family is built upon an existing one-way cryptographic hash function, such as MD5 [MD5] or SHA-1 [SHA]. The HEKS algorithm is demonstrably no less secure than the selected hash. The one-way properties of the hash are also used to show that the HEKS algorithm cannot be effectively shortened.

HEKS accepts a passphrase and optional salt as byte strings and produces an output byte string that is the same length as the output of the selected hash function. Calculating the output requires a substantial and predictable quantity of computation and computer resources, including large amounts of memory and frequent 32-bit multiplication. Salt is recommended to prevent an attacker from building a dictionary of common passphrases and their HEKS equivalents. HEKS is parameterized to allow the user to specify the amount of memory used and the quantity of computation.

1.1. How should a key stretcher be judged?

The purpose of a key stretcher is to make key search attacks more difficult. How effective a given key stretching algorithm is depends on the resources available to the user and to the assumed attacker. Attackers with limited resources would probably use ordinary personal computers to search the key space. Attackers with greater resources can employ large parallel arrays of key search engines, built from custom or field programmable integrated circuits. Users are presumed to have a personal computer with millions of bytes of storage and a fast processor.

Any attack is made more difficult by lengthy key stretching computations. A key stretcher can also try to level the playing field against an attacker with the resources to build massively parallel search engines by utilizing as much as possible of the computing resources that the user already possesses. In the ideal limit, the best such an attacker could do would be to replicate the user’s computer many times. Realistically we can at least force the silicon footprint of each key search engine to be large. A basic low end PC in 1999 has, maybe, half a billion gates inside. If one can weave even 5% of those gates into the key stretch algorithm, massively parallel attacks become much harder.

The continuing progress in integrated circuit technology makes it hard to specify a figure of merit for key stretchers. Moore’s law says performance doubles every 18 months. Still it is possible to roughly quantify key stretcher performance at any point in time, or epoch, as:

M1 = the number of keys searched per logic gate per second at the epoch.

An alternative measure is:

M2 = the number of keys searched per capital investment dollar per second at the epoch.

Calculating M2 is further complicated by the varying costs of designing and procuring custom circuits that different organizations face.

HEKS sets a lower bound on gate count by requiring a large amount of memory in its calculations. Frequent 32-bit multiplies also add to the logic complexity of a custom search engine.

1.2. The General HEKS Algorithm

Here is an outline of the HEKS algorithm:

1. Hash the passphrase and salt with the selected hash function to obtain an initial hash state.

2. Use words of that hash to seed a linear congruential generator, G1, and fill a large array, V, with its output.

3. Repeat many times: draw a long stream of values from a second generator, G2, which reads and rewrites entries of V; accumulate those values into a hash-input buffer, B; fold in passphrase bytes; hash B to update the chained hash state; and reseed the generators from the new hash words.

4. The final hash output is the stretched key.

Note that the output of HEKS is just the hash of the passphrase, the salt, some padding and a series of snapshots of the B array. Therefore HEKS is no weaker than the hash of the passphrase and salt alone.

2. A version of HEKS based on SHA1

Here is a detailed description of a version of HEKS using SHA1. It is intended to be sufficient for coding [though it may not be in this version of the paper]. I call it HEKS1-D1 (for draft 1). A variant, called HEKS1-D2, gives better performance on the hierarchical memories found on typical PCs. See Section 3, below.

2.1 Conventions:

All words are 32-bits.

Arrays are zero-based.

Characters are unsigned.

All variables and arrays are unsigned.

All computations are mod 2**32 unless noted otherwise.

P is the user passphrase as a byte array.

S is a random salt vector as a byte array.

K and N are parameters that determine running time and are discussed below.

V[] is an array of 32-bit words, L words long, where L is large, e.g. 2**18 words (1 megabyte)

B[] is an array of sixteen 32-bit words. This is the natural input buffer size for SHA1. B is initially set to zero.

H() is the hash-next-block operation of SHA1. H accepts 512 bits of data and a 160-bit state as input and forms a 160-bit output. The 160-bit output of H is broken into five 32-bit words w0, w1, w2, w3, and w4.

G1 is a 32-bit linear congruential PRNG with state X. G1 is specified by three 32-bit values: the multiplier a, the increment b, and the initial value of X. Thus Xnew = (a*Xprev + b) mod 2**32.

G2 is a second PRNG, using Algorithm B (the Bays-Durham shuffle) from Knuth [BAY, KNU2, KNU3], with G1 as input. That means that on each cycle, G2prev, the previous output of G2, specifies a random index j in [0, L-1]. If L is not a power of 2, j = G2prev mod L. If L is 2**m, then j is the high-order m bits of G2prev. The next output of G2 is V[j], and V[j] is then replaced by the next value of G1. A short C sketch of G1 and G2 appears at the end of this section.

G2 is initialized by filling V with the first L values of G1 and by setting G2prev = w4. G2prev is reset to w4 after each hash, but V is not altered.
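To make the G1/G2 machinery concrete, here is a minimal C sketch of the two generators as defined above, assuming L = 2**18 so that j can be taken from the high-order 18 bits of G2prev. The names (g1_next, g2_next, L_WORDS) are mine and not part of the specification.

#include <stdint.h>

#define L_WORDS (1u << 18)          /* L = 2**18 words (assumed here) */

static uint32_t V[L_WORDS];         /* the large array V */
static uint32_t a, b, X;            /* G1 state: X = a*X + b (mod 2**32) */
static uint32_t g2prev;             /* previous output of G2 */

/* G1: one step of the linear congruential generator. */
static uint32_t g1_next(void)
{
    return X = a * X + b;           /* uint32_t arithmetic wraps mod 2**32 */
}

/* G2: Algorithm B (Bays-Durham shuffle).  The previous output selects a
   slot j in V; the old V[j] becomes the next output and V[j] is refilled
   from G1. */
static uint32_t g2_next(void)
{
    uint32_t j = g2prev >> (32 - 18);   /* high-order 18 bits, since L = 2**18 */
    uint32_t out = V[j];
    V[j] = g1_next();
    return g2prev = out;
}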

2.2 HEKS-D1 Algorithm:

1. Compute SHA1(P | S), i.e., append the salt to the passphrase and pad the resulting string to a multiple of 512-bit blocks as defined in the SHA1 specification. The resulting 160-bit hash value is broken into five 32-bit words, w0, w1, w2, w3, w4, which also become the state input to the next H().

2. Initialize G1 as follows:

X = w0

a = ((w1 and 0xFBFFFFF8) or 0x02000005)

b = (w2 or 1)

These bit operations make a = 5 mod 8, a < (63/64)m, and a > (1/128)m, where m = 2**32, and force b to be odd, giving G1 maximal period length, as recommended in Knuth [KNU3], Vol. 2, Sec. 3.6, rules iii and iv.

3. Fill the array V with sequential values from G1:

for (i=0; i<L; i++) V[i] = G1next() + P[i mod len(P)]

4. In a loop with n running from 0 to N-1, perform steps 4a — 4d below.

4a. In a loop with i running from 0 to K-1, add G2next() to B[i mod 16]

4b. As long as n is less than the passphrase length, add the nth passphrase byte to B[1]. This step prevents a 160-bit bottleneck in the effective passphrase size.

4c. Compute H(B)

4d. Reinitialize G1 and G2 as follows: X = X + w0; a=w1; b=w2; g2prev=w4

5. The stretched key is the final 160-bit output of H(B): w0, w1, w2, w3, w4

6. Zero all internal variables.
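For readers who prefer code, here is a minimal C sketch of steps 2 through 5, assuming step 1 (the standard SHA-1 of P | S, with its usual padding) has already produced the five words w0..w4, and fixing L = 2**18 and K = 1571 for concreteness. The function and variable names are mine, and this sketch has not been checked against the test vector in Section 2.4, so treat it as illustrative rather than normative.

#include <stddef.h>
#include <stdint.h>

#define L_WORDS (1u << 18)     /* L = 2**18 words (assumed for this sketch) */
#define K_PARAM 1571u          /* K */

static uint32_t rol(uint32_t x, int n) { return (x << n) | (x >> (32 - n)); }

/* H(): the SHA-1 block transform.  It chains a 160-bit state s[] through a
   512-bit block blk[] with no padding and no byte-order conversion. */
static void sha1_compress(uint32_t s[5], const uint32_t blk[16])
{
    uint32_t w[80], a = s[0], b = s[1], c = s[2], d = s[3], e = s[4];
    int i;
    for (i = 0; i < 16; i++) w[i] = blk[i];
    for (i = 16; i < 80; i++) w[i] = rol(w[i-3] ^ w[i-8] ^ w[i-14] ^ w[i-16], 1);
    for (i = 0; i < 80; i++) {
        uint32_t f, k, t;
        if      (i < 20) { f = (b & c) | (~b & d);          k = 0x5A827999u; }
        else if (i < 40) { f = b ^ c ^ d;                   k = 0x6ED9EBA1u; }
        else if (i < 60) { f = (b & c) | (b & d) | (c & d); k = 0x8F1BBCDCu; }
        else             { f = b ^ c ^ d;                   k = 0xCA62C1D6u; }
        t = rol(a, 5) + f + e + k + w[i];
        e = d; d = c; c = rol(b, 30); b = a; a = t;
    }
    s[0] += a; s[1] += b; s[2] += c; s[3] += d; s[4] += e;
}

static uint32_t V[L_WORDS];                   /* the large array V */
static uint32_t g1_a, g1_b, g1_x, g2prev;     /* G1 parameters/state, G2 state */

static uint32_t g1_next(void) { return g1_x = g1_a * g1_x + g1_b; }

static uint32_t g2_next(void)                 /* Bays-Durham shuffle over V */
{
    uint32_t j = g2prev >> (32 - 18);         /* high-order 18 bits: L = 2**18 */
    uint32_t out = V[j];
    V[j] = g1_next();
    return g2prev = out;
}

/* On entry w[0..4] holds SHA1(P | S); on return it holds the stretched key. */
void heks1_d1_core(uint32_t w[5], const uint8_t *P, size_t Plen, uint32_t N)
{
    uint32_t B[16] = {0};
    uint32_t n, i;

    /* Step 2: seed G1 so that a = 5 (mod 8) and b is odd. */
    g1_x = w[0];
    g1_a = (w[1] & 0xFBFFFFF8u) | 0x02000005u;
    g1_b = w[2] | 1u;

    /* Step 3: fill V from G1, folding in the passphrase bytes. */
    for (i = 0; i < L_WORDS; i++)
        V[i] = g1_next() + P[i % Plen];
    g2prev = w[4];

    /* Step 4: N major cycles. */
    for (n = 0; n < N; n++) {
        for (i = 0; i < K_PARAM; i++)             /* 4a: stir B with G2 output */
            B[i % 16] += g2_next();
        if (n < Plen)                             /* 4b: fold in passphrase byte */
            B[1] += P[n];
        sha1_compress(w, B);                      /* 4c: hash B into the state */
        g1_x += w[0]; g1_a = w[1]; g1_b = w[2];   /* 4d: reseed G1 and G2 */
        g2prev = w[4];
    }
    /* Step 5: w[0..4] is now the stretched key.  Step 6 (zeroing V, B and the
       other internal variables) is left out of this sketch. */
}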

2.3 HEKS-D1 design considerations

The following explains the reasons for some of the design choices.

2.3.1 Parameter selection

K is 1571 (decimal), which is approximately 1000*pi/2 and is prime. K was chosen so that the time spent computing H is less than 10% of the overall computation time on common platforms. Thus an attacker who implements H in hardware will only enjoy a small advantage.

We would also like a high likelihood that the B array is affected by at least one change in the V array that occurred in the previous cycle. This can be satisfied by making K > sqrt(L). For K = sqrt(L), the expected number of times the B array is affected by changes to V in the previous cycle is, of course, one. The probability that the B array is affected when K = sqrt(L) is 1 - (1 - 1/sqrt(L))**sqrt(L), which approaches 1 - 1/e = 63.2% for large L.

Since 1571**2 = 2,468,041, this criterion is essentially met for V array sizes up to 10 megabytes (about 2.5 million words).
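As a quick sanity check, the following small C program evaluates 1 - (1 - K/L)**K, the approximate probability that at least one of the K values drawn from G2 in a major cycle touches a V entry rewritten in the previous cycle, for K = 1571 and a few array sizes. At 2.5 million words the result lands near the 1 - 1/e figure quoted above, and it is well above 99% for a one-megabyte V.

#include <math.h>
#include <stdio.h>

int main(void)
{
    const double K = 1571.0;
    const double sizes[] = { 262144.0, 1048576.0, 2500000.0 };  /* L in words */
    int i;
    for (i = 0; i < 3; i++)
        printf("L = %9.0f words: p = %.4f\n",
               sizes[i], 1.0 - pow(1.0 - K / sizes[i], K));
    return 0;
}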

L and N are selected by the user:

L is as large as the user or user community can comfortably provide on the smallest machine on which they plan to use the algorithm. Using L = 2**m makes the algorithm about twice as fast in my tests. On the other hand, choosing L around 1.1*2**m might force an attacker building a massively parallel machine to use the next size memory chip.

N is specified by the user to achieve the desired time delay. For most users a time delay of one half to two seconds would be appropriate. Users who are very security conscious might opt for 30 seconds. There may be applications for longer delays. For example setting a 20 minute delay on a file system locked at close of business would increase a night intruder's risk of apprehension even if they knew the passphrase.

2.3.2 Justification for algorithm details

The passphrase is added, byte by byte, to the B array each major cycle until each byte is used once. This prevents what would otherwise be a 160-bit bottleneck in the algorithm. 160 bits would not be a problem for human passphrases, which almost always have less entropy, but there may be other applications for this algorithm.

The parameters of the first random number generator, used to fill the large V array, are adjusted to ensure the generator has maximal cycle length [KNU3, Section 3.2.1.2, Theorem A, and Section 3.6].

Similar adjustments are not made to the subsequent LCGs. Doing so would only reduce the number of possibilities an attacker would have to consider. We only extract K numbers from each LCG. An occasional short cycle does no harm and the frequency of short cycles in randomly selected LCGs is extremely low based on an empirical test.

The following table summarizes a test of 40 million random LCGs of the form X=a*X+b, where a, b and X0 are random 32-bit quantities generated by SHA1. Only 771 LCGs out of 40 million had a cycle length of 16384 or less. Only 104 had a cycle length of 2048 or less.

Cycle length   Number   Fraction    Cumulative

    32             2    5.00E-08    5.00E-08
    64             0    0.00E+00    5.00E-08
   128             7    1.75E-07    2.25E-07
   256             7    1.75E-07    4.00E-07
   512            16    4.00E-07    8.00E-07
  1024            24    6.00E-07    1.40E-06
  2048            48    1.20E-06    2.60E-06
  4096           107    2.68E-06    5.28E-06
  8192           183    4.58E-06    9.85E-06
 16384           377    9.43E-06    1.93E-05

 Total           771                1.93E-05

With a modulus of 2**32, only cycle lengths that are powers of two are possible. Note that the frequency of cycles seems to drop by roughly a factor of two as the cycle length shortens by a factor of two. Also, Coveyou and McPherson [COV] reported that random LCGs had good spectral properties.

2.3.3 Use of existing SHA1 packages

Some SHA1 packages may need to be modified for use with HEKS. The modifications do not change the SHA1 algorithm, but allow access to the block transformation routine.

Also some implementations of SHA1, e.g. [REID], on little-endian machines incorporate the little-endian to big-endian conversion in the basic block transformation. This is needed for processing the passphrase and salt, but must be disabled for the body of the algorithm, which assumes the array B is a block of 32-bit words.

2.4 Test vector

Here is a HEKS-D1 test vector using the following constants:

K=1571

L=2**18

N=20,000

passphrase = "qwertyuiop"

salt = "sodiumchloride":

FDDE6BEE 7FEC8380 CEEC5AB5 58561205 2EEB2ECA

 

3. HEKS-D2

Tests of HEKS-D1 revealed a marked decrease in performance for the same number of iterations as the size of the V array was increased. A factor of 10 degradation was observed going from 4 kilowords to 4 megawords on a Macintosh 6500 (603e processor, 250 MHz clock, 16 Kbyte internal data cache, 256 Kbyte L2 cache, 96 Mbyte RAM). A graph of runtime vs. the log of the V size clearly reveals the multi-level memory cache structure of the computer on which the test is run. The following table shows log2 of the V array size in words, followed by the run time in seconds. The number of iterations is the same in each case.

log2(V size in words)   Run time (seconds)

 10     2.516667
 11     2.516667
 12     2.616667
 13     4.816667
 14     6.350000
 15     7.316667
 16     8.250000
 17    11.816667
 18    15.200000
 19    18.033333
 20    19.900000
 21    22.083333
 22    23.616667

HEKS-D2 attempts to flatten this curve by changing the V access algorithm to keep most accesses within a window that fits inside the L1 cache. The window base is then changed each major cycle.

Specifically, a variable called cut is initialized each major cycle as follows:

cut = (w1/127) or (2**28)

The G2 algorithm is modified so that the V index, j, is computed as follows:

If g2prev > cut, then j = (w3 + (g2prev >> 22)) mod L; that is, w3 serves as the window base and the high-order 10 bits of g2prev select an offset within the window.

Otherwise j is computed as in HEKS-D1.

This change causes all but about 1/16th of the V accesses to take place within a one-kiloword window. This reduces the runtime difference between a V size of 4 kilowords and 4 megawords in HEKS-D2 to a factor of 1.6 on the same machine.
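Here is a minimal C sketch of the modified index computation, under the reading given above (w3 as the moving window base, the high-order 10 bits of g2prev as the offset) and with L = 2**18; the function name and argument layout are mine.

#include <stdint.h>

#define L_WORDS (1u << 18)      /* L = 2**18 words (assumed here) */

/* cut = (w1 / 127) or 2**28 is recomputed once per major cycle. */
static uint32_t heks_d2_index(uint32_t g2prev, uint32_t cut, uint32_t w3)
{
    if (g2prev > cut)                              /* about 15 times out of 16 */
        return (w3 + (g2prev >> 22)) % L_WORDS;    /* 1-kiloword moving window */
    return g2prev >> (32 - 18);                    /* full-range HEKS-D1 rule */
}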

The following table shows the runtime factor between the 4-megaword and 4-kiloword cases as a function of window size in kilowords:

Window size (kilowords)   Runtime factor

 1    1.59
 2    1.92
 4    2.35
 8    2.96

A plot of these numbers is linear in log2 of the window size until 8 kilowords, where it jumps markedly. Recall that the L1 cache size on the test machine is 4 kilowords (16 Kbytes).

In HEKS-D2 it is no longer likely that the B array is affected by a V change made in the previous cycle; however, this does not allow the calculations to be segmented, since one still needs the result of the previous hash to initialize G1, G2 and the window.

3.1 Test Vector

Here is a HEKS-D2 test vector using the following constants:

K=1571

L=2**18

N=20,000

passphrase = "qwertyuiop"

salt = "sodiumchloride":

E71DB0B5 91202EA0 555AB761 52FB468C 5186AAB2

4. Attacks on the proposed algorithms

Kelsey et al [KEL] list a series of possible attacks on key stretching algorithms. Here is a discussion of each possible attack applied to the proposed HEKS algorithm:

4.1 Narrow-pipe attack.

No intermediate state has a reduced number of bits. The full state of the chained hash transform is carried through the algorithm. The final output is the hash of the input passphrase, the salt and a series of pseudo random blocks. More importantly, the V array has an enormous amount of state that is built up over the course of the computation.

4.2 Reuse of computed values

Since all constants used in each cycle of the algorithm are derived from hash output and assuming the selected hash cannot be reversed, there is no possibility of computing cycle n until cycle n-1 is complete.

4.3 Special purpose massively parallel key search machines

Passphrase searching can always be done in parallel by having each engine search a different set of passphrases. The main design goal of this algorithm is to make massively parallel key search machines as expensive as possible by requiring many 32-bit multiplies and large amounts of memory.

4.4 Cherry picking

Assuming the selected hash produces uniform, unpredictable output, there is no way to search only a portion of the key space cheaply.

4.5 Non random stretched keys

Since the final HEKS output is a hash output, it will enjoy the full randomization of the selected hash.

4.6 Full and partial pre-computation

Assuming the selected hash output is large enough, e.g. 160-bits in SHA1, building a full dictionary will not be feasible. Any key stretching algorithm is liable to a partial attack by building a list containing the stretched versions of the most likely keys. For this reason at least 32 bits of salt should be used with this or any key stretcher. HEKS does not limit salt size, though there is no point in having the salt be larger than the output of the selected hash.

4.7 Minimum use of original key

This criterion does not come from the cited paper, but results from our analysis of the key-stretching properties of CipherSaber-2 [CS2]. To reduce the potential of the key being compromised by analysis of incidental electromagnetic radiation (Tempest), the key values should only be used during the initial stages of the key stretching algorithm. Since in HEKS the key bytes are used only in the initial hash, in filling the V array, and once per byte in stirring the B array, this exposure is minimized.

5. Design alternatives

I have tried to avoid the temptation to add gimmicks to the algorithm and keep it as simple as possible. Here are some of the issues I considered.

Linear congruential pseudo-random number generators (LCGs) have a bad name in cryptography; however, they are not used here for their cryptographic strength but as a way to burn up lots of 32-bit multiplies. LCGs have been extensively studied and, because we are using random constants, we can rely on their average properties. Cryptographic strength comes from frequent application of SHA1.

An alternative is to build G1 using two LCGs, one 32-bit and one based on hardware floating point with the two outputs added together.

Another alternative is to use a 64-bit LCG and 64-bit arrays. Since most PCs today only have hardware support for 32-bit multiplies, this would give an advantage to an attacker who could incorporate a 64-bit multiplier in his search engines.

If L is a power of 2, then the computation of j, the index into V, could be done by masking. However, it would be better to use the upper bits from the output of G1 to derive j, since the low-order bits of an LCG mod 2**n are not very random. An alternative is to use a prime modulus just below 2**32 in the LCG instead of 2**32.

If L is not a power of 2, the suggested method of computing j, j = G2prev mod L, does not produce a completely uniform distribution. A better method is to take the high-order 32 bits of the 64-bit product L*G2prev. Unfortunately, most compilers do not provide efficient access to this machine operation.
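Where a 64-bit integer type is available, the operation can be written directly; a minimal C sketch:

#include <stdint.h>

/* Map a 32-bit G2 output to an index in [0, L-1] using the high-order
   32 bits of the 64-bit product L * G2prev. */
static uint32_t index_from_product(uint32_t g2prev, uint32_t L)
{
    return (uint32_t)(((uint64_t)L * g2prev) >> 32);
}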

An alternative stirring mechanism for B would be to rotate the output of G2 before adding it to B. The amount of rotation might be the low-order 5 or 6 bits of the previous rotated value. The initial shift value would be derived from the hash. Doing this introduces non-linearity to B and would address concerns about poor randomness in the low-order bits of an LCG.

If more than 160 bits of output are required, the next-to-last hash values can be used as a second 160-bit output.

6. Further Work

Here are some tasks that I have not yet done.

6.1 Parameter tuning

Since the HEKS algorithm contains a number of parameters, it is desirable to tune it for maximal performance. One possibility is to write a program to search the HEKS parameter space using some optimization algorithm. Of course the program would have to be run on a variety of common machines. One issue in doing this is how to weight the number of operations against wide accesses to the V array.

Also, in HEKS-D2, the parameter K should perhaps be smaller, and the window size can be tuned to the smallest L2 cache that is likely to be encountered.

6.2 Profile selection

A series of profiles, each representing a different level of resource consumption, should be developed. These might be specified by a simple letter code, starting at a level suitable for palmtops and increasing by factors of 2 in resource use. It may be necessary to use separate letters for the iteration and memory parameters. So, for example, specifying HEKS-1EG would indicate both the algorithm (HEKS-1) and the parameter selection (E level iteration count, G level memory).

6.3 UNICODE Passphrases

Since this is a new standard, it might make sense to specify that the input passphrase always be in 16-bit UNICODE (ISO 10646-1: UTF-16, big-endian). This would assure interoperability across all international versions. The conversion from ASCII to UNICODE is straightforward, and code size and runtime are not an issue here. There also needs to be a convention for dealing with an initial Byte Order Mark, perhaps always stripping it off.

On the other hand, it may be best to just use UTF-8 encoding, since that is transparent to ASCII in most cases and includes the full UNICODE generality.
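If the UTF-16 option above were adopted, the widening conversion really is trivial; a minimal C sketch, which assumes the input is plain ASCII or Latin-1 and ignores Byte Order Mark handling:

#include <stddef.h>
#include <stdint.h>

/* Widen an ASCII (or Latin-1) passphrase of n bytes into 2*n bytes of
   big-endian UTF-16; no byte order mark is written. */
static void ascii_to_utf16be(const uint8_t *in, size_t n, uint8_t *out)
{
    size_t i;
    for (i = 0; i < n; i++) {
        out[2 * i]     = 0;        /* high byte is zero for code points < 256 */
        out[2 * i + 1] = in[i];    /* low byte is the ASCII character */
    }
}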

6.4 Implementation in other languages

Reference implementations should be written in other languages, such as Java, which may not support unsigned integer types. It may be appropriate to adjust the algorithm to accommodate such languages.

7. References

[BAY] C. Bays and S. D. Durham, ACM Trans. Math. Software 2, 1976, pp. 59-64

[COV] Coveyou and McPherson, JACM 14 (1967), pp. 100-119

[KNU2] D. E. Knuth, The Art of Computer Programming, Vol. 2: Seminumerical Algorithms, Second Edition, Sec. 3.2.2, pp. 32-33, Addison-Wesley, 1973

[KNU3] D. E. Knuth, The Art of Computer Programming, Vol. 2: Seminumerical Algorithms, Third Edition, Sec. 3.2.2, p. 34, Addison-Wesley, 1998

 

[KEL] J. Kelsey, B. Schneier, C. Hall, and D. Wagner, Secure Applications of Low-Entropy Keys, 1997 Information Security Workshop (ISW'97), Proceedings (September 1997), Springer-Verlag, 1998, pp. 121-134. Available at http://www.counterpane.com

[MD5] R. Rivest, The MD5 Message-Digest Algorithm, RFC-1321, April 1992

[MOR] R.H. Morris and K. Thompson. Unix password security. Communications of the ACM, 22(11):594, November 1979.

[REID] S. Reid, SHA-1 in C, available at ftp://utopia.hacktic.nl/pub/replay/pub/crypto/HASH/sha/sha1.c

[REIN] A. Reinhold, "Results of a Survey on PGP Pass Phrase Usage," posted to the Usenet group sci.crypt.research on June 2, 1995, available at http://www.hayom.com/passphrase.survey.asc

[SHA] FIPS PUB 180-1, Secure Hash Standard, U.S. Department of Commerce, National Institute of Standards and Technology, April 17, 1995, available at http://bilbo.isu.edu/security/isl/fip180-1.html

 

agr 2001-7-5