Why rand() is bad?
Notation :
- [a, b] - implies the range of integers from a to b inclusive.
- LGC - Linear Congruential Generators
Using rand()
Let’s give an example of generating 16 numbers in the range [0, 99].
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
int main()
{
srand(time(NULL));
for (int i = 0; i < 16; ++i)
printf("%d", rand() % 100);
printf("\n");
}
A seemingly innocuous code with no visible syntactical or logical fallacies. Let’s dive into each one of the functions and variables.
-
time(NULL)
:- In C (and also C++),
NULL
is0
and it’s an integer. Hence, it has a high tendency to cause unintended function overloads and lead to ambiguous code. Usenullptr
instead(from C++11). -
time
runs once every second. Run the code more than once in a second, and now you can replicate the seed.
- In C (and also C++),
-
srand(time(NULL))
:time()
returns a 64-bittime_t
, whilesrand()
accepts 32-bitunsigned int
. Hence, a loss of precision here and potentially vulnerable. -
rand()
:- It has a range of [0, 32767] (215-1). That is awfully small for several use cases today. For instance, you cannot sample data from a dataset of size > 32767 and we deal with hundreds of thousands of data rows easily.
- it uses Linear Congruential Generators for pseudo-random number generation and if the parameters, namely modulus(
m
), multiplier(a
) and increment(c
) do not obey certain conditions, the generated numbers do not acheive a periodicity ofm
and hence, not a uniform distribution. - In higher dimensions, LGCs lie on well-defined hyperplanes(Marsaglia’s Theorem). Not something you want in an internet facing code.
-
rand()%100
:rand()
has a range of [0, 32767]. The length is not a multiple of 100 and if you split the range into buckets of 100, you get 32700 buckets of equal length 100 and one incomplete bucket of length 68 i.e.[0, 67]. Repeat the numbers long enough and the frequency of the first 68 numbers will be higher than the rest.
Introducing <random>
Starting from C++11, provides the random
header which hosts and bunch of random distributions and clean way to draw samples from them as well. The distributions I tend to encounter most are :
uniform_int_distribution
uniform_real_distribution
bernoulli_distribution
poisson_distribution
normal_distribution
Complete list available here.
Coming back to the pseudo-random number generation problem, our task is to draw integer samples from a uniform distribution in the range [0, 99]
. This time we can use uniform_int_distribution
. To draw samples from the above distribution, we need to provide a Random Number Engine (LGC being one). This time, instead of LGC, we use mersenne_twister_engine
. Further, unlike srand()
to set a seed, let’s use random_engine
which produces uniformly distributed non-deterministic random numbers.
The new code snippet is as follows:
#include <random>
#include <iostream>
int main()
{
std::random_device rd;
std::mt19937 mt(rd()); // Can be supplied with static seeds like any non-negative integral value
std::uniform_real_distribution<double> dist(0, 99);
for (int i = 0; i < 16; ++i)
std::cout << dist(mt) << "\n";
}
We’ll start with std::mt19937
first and come to random_device
later.
-
std::mt19937
:- Creates a random number generation engine based on Mersenne Twister Algorithm
- Its periodicity is 219937 i.e. its range is [0, 219937-1]. That’s absolutely massive.
- It is faster than true random methods (like
random_device
which rely on hardware devices) - Passes several statistical tests for randomness including Diehard Tests.
- Not cryptographically secure.
-
std::uniform_real_distribution<double> dist(a, b)
:- Passing the engine to dist, we want numbers in the range [a, b].
- Works on
signed/unsigned short/int/long/long long
, but not onchar
.
-
std::random_device
:- Can be used to set 32-bit and higher seeds.
- It is non-seedable and hence non-reproducible.
- It is slower than Mersenne-Twister.
Acknowledgement
The code-snippets and analysis have been derived from the talk by Stephan T. Lavavej and the intention of this article is to provide an abridged version in readable format for future reference.