Randomness, and the ability to generate random numbers, is one of the most important tools for building secure systems. Why should random, meaningless numbers be important? There are two particularly important things we get from randomness:

**Unpredictability:** In the previous section, we considered a scenario in which an attacker performs a brute-force attack to find a \(k\)-bit encryption key. The attacker doesn’t know anything about the key other than its length, and the key is equally likely to be any of the \(2^k\) different possibilities. Since there is no way to predict what the actual key is, the attacker’s only way to find the key is to try all possible values (we’re ignoring the possibility of cryptanalysis, a topic that will have to wait for another time). This unpredictability comes from randomness, and while picking a 128-bit key using a good source of random bits gives a very unpredictable value, an unfortunately common cause of security weaknesses involves programmers taking shortcuts and using poor sources of randomness. For example, initial implementations of the secure sockets layer (SSL) protocol for protecting web-browsing were very insecure because of precisely this problem. A similar high-impact vulnerability in Debian’s OpenSSL package was found in 2008. There are even situations where predictable random number generators allowed people to cheat against electronic gambling machines, such as the 1995 incident in which a computer technician named Ron Harris used his knowledge of how Keno machines worked to (attempt to) cheat a casino out of hundreds of thousands of dollars. Finally, passwords should be unpredictable so that they can’t be guessed, and while passwords are generally picked by people in decidedly non-random ways, we can gain a lot of insight into security by modeling password selection as a random process so that it can be analyzed for predictability.

**High-probability uniqueness:** There are many situations in security and cryptography in which we need to use some value that has not been used before by a system.
For example, some authentication techniques are “challenge-response” protocols, where one side sends a challenge – sometimes called a *nonce* – to the side that needs to authenticate, and the responder must create a response that uses that nonce. If the nonce has been used before, and an attacker captures the response, they can break this protocol by simply replaying the previous response (this is called a “replay attack,” for obvious reasons). As another example, an encryption scheme called “AES in Counter Mode,” abbreviated AES-CTR, is very secure if an initial counter value has never been used before – but if we repeat an initial counter value, it becomes very, very insecure.

One solution would be to keep track of values that have been used before, to guarantee that values are unique. However, this requires keeping track of a potentially large amount of information, and is even more impractical if values are used across multiple systems that would need to coordinate their values. Fortunately, randomness provides a practical solution: if you pick random values from a large enough sample space, then the probability that you see the same value more than once is negligible. How large is “large enough”? The exact analysis is related to the mathematical problem known as the “birthday problem,” which we’ll look at in detail below. For now, know that you could pick a million random 128-bit counter values every second for 16 years, and the probability that you get the same value more than once would be minuscule (less than one in a billion).

The primary tools we have for talking about randomness come from probability theory, so we review some of those notions before moving on.

The coverage of probability theory here is just a brief review to jog your memory, since students in upper-level computer science classes like this one have seen basic probability theory in several classes before now (e.g., at UNCG you should have seen this in at least CSC 350 and STA 271/290, but you also probably did some basic probability theory in high school as well). While the concepts are well-established and standard, notation varies slightly depending on the author, and for consistency we will use the same notation that Hein used in the CSC 250/350 textbook. If you took those classes here at UNCG, then you can refer back to that book for additional information.

A set of possible outcomes of a random process or experiment is called a *sample space*. We focus on finite sample spaces in this class, which can be denoted as a finite set \(S=\{x_1,x_2,x_3,\cdots,x_n\}\). Each element \(x_i\) of \(S\) is referred to as a *point* in sample space \(S\). A *probability distribution* on \(S\) is a function \(P:S\rightarrow[0,1]\) that maps each point \(x_i\in S\) to a real number \(P(x_i)\) between 0 and 1, called the *probability* of \(x_i\), subject to the condition that the sum of all probabilities is equal to one:

\[ \sum_{i=1}^n P(x_i) = 1 . \]

One of the most common probability distributions we see is the *uniform distribution*, where every point in the sample space has probability \(1/n\). We’ll denote this probability distribution by \(U_n\), so \(U_n(x_i)=1/n\) for all \(x_i\in S\). Notice that the probability doesn’t depend on the actual sample point at all – it is the same for every point. The uniform distribution comes up in many, many problems, from flipping a coin (a uniform distribution on two outcomes) to rolling a fair 6-sided die (a uniform distribution on six outcomes) to selecting a card from a fairly shuffled deck of playing cards (a uniform distribution on 52 outcomes). When we say we “pick a value at random,” we always mean that we pick using a uniform probability distribution unless we explicitly state that we’re using a different distribution. For example, if we say that we pick a 128-bit symmetric encryption key at random, that means that each of the \(2^{128}\) possible keys is equally likely.
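To make “pick a value at random” concrete, here is a small sketch in Python (our choice for illustration; the notes themselves don’t prescribe a language). The standard `secrets` module draws from the operating system’s randomness source, so each outcome is equally likely:

```python
import secrets

# Pick a 128-bit key uniformly at random: every integer in
# [0, 2**128) is equally likely.
key = secrets.randbits(128)
assert 0 <= key < 2**128

# A fair 6-sided die is a uniform distribution on six outcomes.
roll = secrets.randbelow(6) + 1   # uniform over {1, 2, 3, 4, 5, 6}
assert 1 <= roll <= 6
```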

An *event* is some subset of points from the sample space. If \(A\subseteq S\) is an event, we define the probability of the event, written \(P(A)\), as the sum of the probabilities of the points that make up the event. In other words,

\[ P(A) = \sum_{x\in A} P(x) . \]

Since events are defined as sets of sample points, we can perform the usual set operations on them. If we have two events \(A\) and \(B\), we can combine events to represent new events — for example:

- The event \(A\cup B\) represents the event that the outcome is in *either* \(A\) or \(B\) (or both);
- The event \(A\cap B\) represents the event that the outcome is in *both* \(A\) and \(B\).

The intersection operator allows us to define one more fundamental property of events: *independence*. We say that events \(A\) and \(B\) are *independent* if \(P(A\cap B)=P(A)P(B)\).

**Example:** Consider three tosses of a fair coin, recording heads (H) or tails (T) for each toss. The sample space is the set of eight possible outcomes: \[ S=\{ HHH, HHT, HTH, HTT, THH, THT, TTH, TTT \} . \]

Since it’s a “fair coin,” the probability distribution is the uniform distribution on these eight points, so for every \(x\in S\), we have \(P(x)=U_8(x)=1/8\). There are many ways we can define events on this sample space, such as the event of “all heads” or “one head” or “more heads than tails.” Let’s define event \(A\) to be “more heads than tails,” so \(A=\{HHH,HHT,HTH,THH\}\). Since there are four points in that event, and each point has probability \(1/8\), adding up over all of them we get \(P(A)=4/8=1/2\). In other words, there’s a 50-50 chance of getting more heads than tails (and a 50-50 chance of getting more tails than heads). This will be true for any odd number of coin tosses, where there can’t be a tie.
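Since the sample space is tiny, this calculation is easy to check by brute-force enumeration. Here is a Python sketch (illustrative only; the event and variable names are our own) that builds the sample space and computes \(P(A)\):

```python
from itertools import product

# Sample space: all 2**3 = 8 outcomes of three fair coin tosses.
S = [''.join(t) for t in product('HT', repeat=3)]
assert len(S) == 8

def prob(event):
    """P(A) = sum of point probabilities; uniform, so |A| / |S|."""
    return len(event) / len(S)

# Event A: more heads than tails (at least 2 heads out of 3).
A = {x for x in S if x.count('H') > x.count('T')}
assert A == {'HHH', 'HHT', 'HTH', 'THH'}
assert prob(A) == 0.5
```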

Now consider events \(E_1=\{HHH,HHT,HTH,HTT\}\) (representing the first toss coming up heads) and \(E_2=\{HHH,HHT,THH,THT\}\) (representing the second toss coming up heads), and note that \(P(E_1)=P(E_2)=1/2\). Then consider \(E_1\cap E_2=\{HHH,HHT\}\), and with 2 possibilities we have \(P(E_1\cap E_2)=2/8=1/4\). Since \(P(E_1\cap E_2)=P(E_1)P(E_2)\), these events are independent.

On the other hand, consider \(E_3=\{TTH,TTT\}\) (“neither of the first two tosses is heads”) and \(E_4=\{HTT,TTT\}\) (“neither of the last two tosses is heads”). Now we have \(P(E_3)=P(E_4)=1/4\), and so \(P(E_3)P(E_4)=1/16\). But \(P(E_3\cap E_4)=P(\{TTT\})=1/8\). Since \(P(E_3\cap E_4)\neq P(E_3)P(E_4)\), these events aren’t independent. Using an intuitive notion of “independence,” this makes sense: if neither of the first two tosses is heads, you know that the second toss is tails, meaning that the probability of the *second* two tosses being tails is higher than if you knew nothing about the second toss (this “intuitive argument” is really about conditional probabilities, which give an equivalent definition of independence that we won’t go into here).
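Both independence checks can be verified mechanically by enumerating the same eight-point sample space; a Python sketch (with event names matching the text):

```python
from itertools import product

S = [''.join(t) for t in product('HT', repeat=3)]
P = lambda ev: len(ev) / len(S)      # uniform distribution on S

E1 = {x for x in S if x[0] == 'H'}          # first toss is heads
E2 = {x for x in S if x[1] == 'H'}          # second toss is heads
E3 = {x for x in S if 'H' not in x[:2]}     # first two tosses both tails
E4 = {x for x in S if 'H' not in x[1:]}     # last two tosses both tails

# E1 and E2 are independent: P(E1 ∩ E2) = P(E1) * P(E2) = 1/4.
assert P(E1 & E2) == P(E1) * P(E2) == 0.25

# E3 and E4 are not: P(E3 ∩ E4) = 1/8, but P(E3) * P(E4) = 1/16.
assert P(E3 & E4) == 1/8
assert P(E3) * P(E4) == 1/16
```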

A *random variable* \(V\) is a function \(V:S\rightarrow\mathbb{R}\) that maps each sample space point to a real number. Typically (but not always) a random variable represents some measurement taken on a random experiment, such as the number of “heads” results in \(n\) coin tosses, or the running time of an algorithm that makes random choices while it is running (in this case, the sample space is the set of all possible sequences of random choices made by the algorithm).

The *expected value* of a random variable \(V\), written \(E[V]\), is just the weighted sum of the random variable values, weighted by the probabilities. More specifically,

\[ E[V]=\sum_{x\in S} P(x)\cdot V(x) . \]

While this definition of expectation can certainly be used, in practice we often view and calculate this differently. In particular, if we use \(V(S)\) to represent the set of all possible random variable values, then for each \(v\in V(S)\) we can define an event for “\(V=v\)” that gives all sample space points for which the value of \(V\) is \(v\):

\[ (V=v) = \{ x\,|\, x\in S \mbox{ and } V(x)=v \} . \]

Now we can talk about \(P(V=v)\), the probability that \(V\) has value \(v\). The following formula, the weighted sum over distinct random variable values, is how we usually compute an expected value:

\[ E[V] = \sum_{v\in V(S)} P(V=v)\cdot v . \]

**Example:** Continuing our example from above, we can define random variable \(X\) to be the number of heads that came up in our three coin tosses, so the possible random variable values are \(X(S)=\{0,1,2,3\}\). More specifically, we have three events:

\[ \begin{array}{l} (X=0)=\{ TTT \}\\ (X=1)=\{ HTT, THT, TTH \}\\ (X=2)=\{ HHT, HTH, THH \}\\ (X=3)=\{ HHH \} . \end{array} \]

Computing the expected value of \(X\), we get

\[ \begin{array}{rl} E[X] & = P(X=0)\cdot 0 + P(X=1)\cdot 1 + P(X=2)\cdot 2 + P(X=3)\cdot 3 \\ & = \frac{1}{8}\cdot 0 + \frac{3}{8}\cdot 1 + \frac{3}{8}\cdot 2 + \frac{1}{8}\cdot 3 \\ & = \frac{3}{8}+\frac{6}{8}+\frac{3}{8} \\ & = \frac{12}{8} \\ & = \frac{3}{2} . \end{array} \]
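This calculation is also easy to check by enumeration. The Python sketch below (names are ours, for illustration) computes \(E[X]\) both ways – pointwise over the sample space, and as a weighted sum over distinct values – and confirms they agree:

```python
from itertools import product

S = [''.join(t) for t in product('HT', repeat=3)]

# Random variable X: number of heads in three tosses.
X = {x: x.count('H') for x in S}

# E[X] = sum over points x of P(x) * X(x), with P(x) = 1/8 for all x.
E_pointwise = sum(X[x] / len(S) for x in S)

# Equivalently, sum over distinct values v of P(X = v) * v.
values = set(X.values())                     # {0, 1, 2, 3}
E_by_value = sum((sum(1 for x in S if X[x] == v) / len(S)) * v
                 for v in values)

assert E_pointwise == E_by_value == 1.5
```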

This example illustrates that the everyday English-language usage of “what we expect” is different from the mathematical notion of an expected value. The mathematical definition of expectation tells us that the “expected value” of the number of heads is \(3/2\), even though a fractional number of heads is not possible. So we “expect” something impossible! You get better intuition by thinking about this as an average value, and in fact the “law of large numbers” says that if we perform this experiment (three coin tosses) many, many times, counting each time how many heads come up, then the average over our experiments will be very close to the expected value. So while averages and expected values are different mathematically (one is a property of experiments and the other is a property of a probability distribution), they are related in a very strong way.

The “Birthday Problem,” or “Birthday Paradox,” is a common example problem in probability theory, and is fun to use since at first the result defies students’ intuition. While the standard math problem is stated using small numbers and is a problem that you can visualize easily, the exact same problem — but with much larger numbers — comes up repeatedly in cryptography. This is the core problem related to the “high-probability uniqueness” use of randomness that was described at the beginning of this section, and we also see this when analyzing collisions in cryptographic hash functions (while not actually random, the same principles apply).

The basic “Birthday Problem” is this: Given a room with 23 people in it, what is the probability that two share the same birthday? It seems like with such a small number of people, and so many different possible birthdays, the probability should be low. However, the probability turns out to be a little more than 1/2, so it is more likely than not that two people share the same birthday!

To analyze this situation, we generalize the problem to \(n\) people in a room, each with one of \(d\) possible birthdays chosen uniformly and independently. For our specific problem we have \(n=23\) and \(d=365\) (ignoring leap years). Are birthdays really uniform? Not exactly, but it’s close enough!

Let \(D_1\) represent the event that the first person’s birthday is on January 1, and \(D_2\) be the event that the second person’s birthday is on January 2. Since birthdays are independent, with probability 1/365 of being on any particular day, we know that the probability of “Person 1” having a January 1 birthday *and* “Person 2” having a January 2 birthday is \(P(D_1\cap D_2)=P(D_1)P(D_2)=1/365^2\). This argument generalizes easily to any list of days and any number of people: for \(n\) people and any list of \(n\) days, the probability that the people have exactly those birthdays is \(1/365^n\). Since every different list of \(n\) days is a distinct point in our sample space, we can pick any set \(S\) of birthday assignments, and the probability that randomly assigned birthdays come from this set is exactly \(|S|/365^n\). What we’re going to do next is let \(S\) be the set of all birthday assignments in which no two people share a birthday, and calculate the size of \(S\) so that we can find the probability that no two people share a birthday.

Calculating \(|S|\) is now just a combinatorics problem: “Person 1” has 365 different choices for their birthday. “Person 2” can pick any one of the 364 remaining days. “Person 3” has 363 options, and so on. If there were 5 people, then the number of possible birthday assignments with distinct days would be \(365\cdot 364\cdot 363\cdot 362\cdot 361\). For \(n\) people, the number of birthday assignments is \(365\cdot 364\cdots (365-n+1)\). Therefore, for \(n=23\) people, the probability that all have distinct birthdays is

\[ \frac{365\cdot 364\cdot 363\cdots 344\cdot 343}{365^{23}} \approx 0.4927 . \]
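This exact probability is straightforward to compute with a short program; a Python sketch (the function name is ours):

```python
from math import prod

def p_all_distinct(n, d=365):
    """Probability that n people all have distinct birthdays,
    drawn uniformly and independently from d equally likely days:
    (d * (d-1) * ... * (d-n+1)) / d**n."""
    return prod(d - i for i in range(n)) / d**n

p = p_all_distinct(23)
assert abs(p - 0.4927) < 0.0001   # matches the value in the text
assert 1 - p > 0.5                # a shared birthday is more likely than not
```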

Looked at from the opposite side, the probability that two people *do* share a birthday is approximately \(1-0.4927=0.5073\), or slightly better than \(1/2\).

Of course, we’re not really interested in birthdays, as amusing as that result is. For arbitrary \(n\) and \(d\): given a uniformly-distributed sample space of size \(d\), the probability that \(n\) independently chosen points contain a duplicated value is

\[ 1 - \frac{d!}{(d-n)!}\cdot\frac{1}{d^n} . \]

This formula is easy enough to calculate with a program when \(d\) and \(n\) are small, like the traditional birthday problem, but an example application in cryptography (which we’ll see below) can use very large values such as \(d=2^{128}\) and \(n=2^{64}\). Computing those factorials exactly is simply impossible with these numbers, so we need an easier-to-use formula.

For the values of \(d\) and \(n\) that we’re interested in, the probability of a duplicate can be closely approximated by \(1-e^{-n^2/(2d)}\). If we evaluate this formula with \(n=23\) and \(d=365\), we get approximately 0.5155, which is accurate to within 0.01.
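A quick Python check of the approximation against the exact formula, for the classic birthday numbers (\(n=23\) people, \(d=365\) days; function names are ours):

```python
from math import exp, prod

def p_dup_exact(n, d):
    """Exact probability of a duplicate among n uniform samples from d values."""
    return 1 - prod(d - i for i in range(n)) / d**n

def p_dup_approx(n, d):
    """Birthday-bound approximation: 1 - e**(-n**2 / (2*d))."""
    return 1 - exp(-n**2 / (2 * d))

exact = p_dup_exact(23, 365)       # ~0.5073
approx = p_dup_approx(23, 365)     # ~0.5155
assert abs(exact - approx) < 0.01  # accurate to within 0.01
```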

**Example:** Returning to a problem from the beginning of this section: There are systems that require unique values, and if values are ever repeated then security fails. If the values are from a large enough space, say 128-bit binary strings, then we can pick randomly with very little danger of repetition. Let’s push some numbers through and see why this is.

What if we pick a million different values every second, and do this every second of every day for 16 years? What is the probability that we see a duplicate value? Our “powers of two approximation” techniques are very useful here: A million is about \(2^{20}\), and the number of seconds in 16 years is about \(2^{4}\cdot 2^{25}=2^{29}\). Therefore, we are picking \(n=2^{49}\) values out of \(2^{128}\) possibilities. Using our approximation formula, the probability that we see any repeated value is roughly

\[1-e^{-n^2/(2d)}=1-e^{-(2^{49})^2/(2\cdot 2^{128})} = 1-e^{-2^{98}/2^{129}} = 1-e^{-1/2^{31}} . \]

Since \(1/2^{31}\) is so close to zero, \(e^{-1/2^{31}}\) is very close to 1, making the probability very small. But how small?

A simple result using the Taylor series expansion of \(e^{-x}\) shows that when \(x\) is small, \(1-e^{-x}\) is approximately \(x\), with an error of less than 1% when \(x\leq 0.02\). Our “\(x\)” is *much* smaller than 0.02, so the probability of a duplicate is very close to \(1/2^{31}\), or roughly 1 in 2 billion. That’s actually several times lower than the odds of winning the Powerball jackpot with a single random ticket! So you can run this system, picking a million values every second for 16 years, with almost no danger of repetition.
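The arithmetic is easy to confirm in Python. One practical note (our addition, not from the text): computing `1 - exp(-x)` directly loses precision for tiny `x`, so the standard-library function `math.expm1` is the accurate way to evaluate it:

```python
from math import expm1

# For n = 2**49 samples from d = 2**128 values, the exponent is
# -n**2 / (2*d) = -2**98 / 2**129 = -1 / 2**31.
x = 2**-31

# 1 - e**(-x), computed accurately with expm1 to avoid cancellation.
p_dup = -expm1(-x)

# For small x, 1 - e**(-x) is approximately x (Taylor series),
# with relative error far below 1% here.
assert abs(p_dup - x) / x < 0.01
assert p_dup < 1e-9               # less than one in a billion
```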

Next, let’s turn the problem around and ask how many samples we have to take before the probability of seeing a duplicate value exceeds \(1/2\). In other words, for a given \(d\) we need to find \(n\) such that \(e^{-n^2/(2d)}=1/2\). Doing the algebra to solve for \(n\), we get

\[ e^{-n^2/(2d)} = \frac{1}{2} \iff e^{n^2/(2d)} = 2 \iff \frac{n^2}{2d} = \ln 2 \iff n^2 = (2 \ln 2) d \iff n = \sqrt{2\ln 2} \cdot \sqrt{d}, \]

so we need \(n\approx 1.177\sqrt{d}\). Returning to the original birthday problem and plugging in \(d=365\) we get \(n\approx 22.5\). Since we discovered earlier that the “break-even point” for two people having the same birthday is between 22 and 23 people, this is excellent!
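The constant and the birthday-problem sanity check are one line each in Python (the helper name is ours):

```python
from math import log, sqrt

def n_for_half(d):
    """Approximate number of uniform samples from d values needed
    for a ~1/2 probability of seeing a duplicate: sqrt(2 ln 2) * sqrt(d)."""
    return sqrt(2 * log(2)) * sqrt(d)

assert abs(sqrt(2 * log(2)) - 1.177) < 0.001   # the constant in the text
assert abs(n_for_half(365) - 22.5) < 0.1       # matches the birthday problem
```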

For crypto-sized numbers, we generally ignore the constant multiple of 1.177, since it is so close to 1, and use \(n=\sqrt{d}\) for an approximate number of samples required before we are likely to see a duplicate.

**Example:** Again picking 128-bit values at random, how many would we need to select before we’re likely to see a duplicate? Taking the square root, this would require approximately \(\sqrt{2^{128}}=2^{64}\) samples — about 18 billion billion.

Let’s double-check that answer by going through our duplicate probability formula with \(d=2^{128}\) and \(n=2^{64}\). With these values, the probability of a duplicate is

\[1-e^{-n^2/(2d)} = 1-e^{-(2^{64})^2/(2\cdot 2^{128})} = 1-e^{-2^{128}/2^{129}} = 1-e^{-(1/2)} \approx 0.3935 . \]
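A one-line Python check of this value:

```python
from math import exp

# d = 2**128, n = 2**64: the exponent is -2**128 / 2**129 = -1/2.
p_dup = 1 - exp(-0.5)
assert abs(p_dup - 0.3935) < 0.0001
```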

So there’s roughly a \(4/10\) probability of seeing a duplicate, which is pretty close to our goal of \(1/2\) probability, verifying our earlier estimate.

This exact problem comes up a lot in cryptography, where it is the probability of duplicate initialization vectors for ciphers, the probability of an attack called a “meet-in-the-middle” attack working, the probability of seeing collisions in hash functions, and more. You should be very comfortable with the basic rule of thumb that for a sample space of size \(d\), to have a decent probability of finding a duplicate with random samples you need roughly \(\sqrt{d}\) samples.

In the sections above, we discussed randomness as an abstract mathematical concept, with applied examples using simple physical processes like flipping coins and rolling dice, but we are interested in implementing things in computer systems. Where does randomness come from there? The computer certainly doesn’t sit there with a coin, flipping it over and over. The generic name for what we want is a “random number generator,” or RNG, so how do we make an RNG?

The answer to this question is more difficult than it seems. One warning up front, which cannot be repeated too many times to people dealing with security: anything you compute with a standard, deterministic process, from given start values, is *not* random. It doesn’t matter how complex the process is, or how random-*looking* the output is, it is not random. No matter how many statistical tests it can pass, it’s still *not* random. For example, there is a function in the standard C library named `rand()`, which is a pseudorandom number generator (always referred to as a PRNG, where the “P” is *very* important), and the output of `rand()` “looks random” – at least if you don’t look too carefully. This function is good enough for random-looking behavior in simple games, but such values are *not* random, and in high-stakes games the `rand()` function isn’t “random enough” for even basic unpredictability. Using such a PRNG in a cryptographic protocol is absolutely disastrous, and you can end up with no security at all. There are ways to do pseudorandom number generation that are secure in a cryptographic setting, which we call “cryptographically secure pseudorandom number generation,” but there are many pitfalls, and we will devote a full section to this later. And even with a cryptographically secure PRNG, while perhaps secure enough in most settings, you are reminded again: these are *not* random numbers!
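Python’s `random` module makes this point vivid: it uses the Mersenne Twister PRNG, and two generators started from the same seed produce exactly the same “random” sequence. An attacker who learns (or guesses) the seed can reproduce every value:

```python
import random

# Two PRNGs seeded identically are completely deterministic:
# they produce identical "random-looking" output.
a = random.Random(2024)
b = random.Random(2024)

seq_a = [a.randrange(2**32) for _ in range(10)]
seq_b = [b.randrange(2**32) for _ in range(10)]
assert seq_a == seq_b   # nothing here is random
```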

Since you can’t *compute* random values, they need to come from some other source, such as a physical process. There have been various approaches to this over the years, including a semi-serious random number generator made by engineers at Silicon Graphics using pictures of lava lamps (this was known as Lavarand). More practically, there are several higher-bandwidth sources of randomness that modern systems use, which we describe below. For any source of randomness, a user needs to think about whether the source is uniform, because in most cases it is not. A source of random bits in which the 0/1 outcomes don’t have exactly the same probability is called a “biased RNG.” Fortunately, sources that are used in practice are generally post-processed to remove bias, and as a result what the user sees is a nice, uniformly-distributed random source.

*Hardware random number generators:* It is possible to build hardware devices that measure random electrical/thermal “noise” in a circuit, or that measure various actions that result from randomness in quantum physics. If a computer can read from such a device, then it has a source of true randomness. Over the past few decades, hardware random number generators have been built into all sorts of devices, including devices designed specifically for cryptography, security chips such as Trusted Platform Modules (TPMs), general-purpose computer support chipsets, and even directly into computer CPUs. All modern Intel and AMD CPUs have a hardware random number generator built in, which is accessible through the `RDRAND` assembly language instruction.

*Gathering system entropy:* As an alternative to devices built specifically to produce random bits, a computer can measure fluctuating physical processes that it has access to. Computer systems respond to various events that have variable timing, such as keys being pressed on the keyboard, network packets arriving, and hard drives seeking to a particular track. Measuring these events with a high-precision timer, the least significant bits of the times (e.g., the “microseconds” part of the time between keystrokes) provide significant randomness. When the PGP email-encryption program was initially released in the early 1990s, key generation required the user to type for about a minute while the program collected randomness from the keystroke timings. These days, all modern operating systems support this idea directly, constantly measuring these events to feed a pool of random bits, which can be provided to the user through various interfaces. For example, in Linux the `/dev/random` device provides these random numbers, and can be opened and read by any program. This is a blocking device, however, so programs that try to read more randomness than is available will stall until the system collects enough randomness to provide to the user. The alternative is the non-blocking `/dev/urandom` device, which can provide cryptographically secure pseudorandom values when true random values are not available. Unfortunately, it is difficult to get enough randomness this way in non-traditional computing systems such as embedded devices, which don’t have a keyboard or hard drives. This has resulted in some security issues, as explored in the following 2013 research paper:
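From a program, the usual way to reach the operating system’s randomness pool is through a library wrapper rather than opening the device file directly. In Python, for instance, `os.urandom` draws from the OS source (on modern Linux it uses the `getrandom()` system call where available), and the `secrets` module is the recommended higher-level interface for security-sensitive values:

```python
import os
import secrets

# os.urandom reads from the operating system's randomness pool:
# non-blocking, cryptographically secure output.
nonce = os.urandom(16)             # 16 bytes = 128 bits
assert len(nonce) == 16

# The secrets module is the higher-level interface for security use,
# e.g. generating nonces, tokens, and keys.
token = secrets.token_bytes(16)
assert len(token) == 16
```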

Keaton Mowery, Michael Wei, David Kohlbrenner, Hovav Shacham, and Steven Swanson, “Welcome to the Entropics: Boot-Time Entropy in Embedded Devices,” *IEEE Symposium on Security and Privacy*, 2013, pp. 589–603.

Several times in this section, we have referred to “sufficient randomness” or similar notions. Is there a way to measure randomness and reason about it? Yes! That’s the topic of “entropy,” which we cover in the next section.


© Copyright 2020, Stephen R. Tate

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.