Debunking all myths about MT19937 random number generator
In this article I will debunk some popular myths about the Mersenne twister in C++ that have accumulated over the years. The source for some of the myths is this article named C++ Seeding Surprises. Let’s start with few definitions.
- PRNG, pseudo-random number generator
- This is a deterministic algorithm that takes certain state as input and as output it gives a pseudo-random number and the modified state that should be used as the next input. PRNGs are almost always predicable. If you know enough consecutive output random numbers, you can infer the internal state. They should not be used for cryptography, only for things like simulations or games.
- CSPRNG, cryptographically secure pseudo-random number generator
- These algorithms are also deterministic, but they are not predicable. If you see some outputs you should not be able to guess its internal state, future or previous outputs.
- TRNG, True random number generator
- TRNG is based on physical phenomena. When we observe certain physical phenomena that is nondeterministic and measure it with number, that number is considered a truly random number. Such TRNGs are connected to a computer as IO devices.
- Internal state of PRNG
- The internal state usually consists of few numbers of certain size (32-bit, 64-bit, etc).
- Initial internal state
- The first numbers that go in the state, before getting the first pseudo-random number, are called initial state.
- Seed
- A number or sequence of numbers that are used to generate the initial internal state. Often the seed and the initial internal state are the same, but not always. E.g. in Mersenne twister (MT19937) the initial internal state is always different than the seed.
I will now continue with the myths.
Myth 1: MT19937 can never generate 7 or 13 as the first output when seeded with single integer
This is wrong. Here is sample code that debunks this myth.
#include <random>
#include <cassert>
int main()
{
auto rnd = std::mt19937(1080100664);
assert(rnd() == 7);
rnd.seed(736640520);
assert(rnd() == 7);
rnd.seed(1292535796);
assert(rnd() == 13);
}
How did I found these seeds? I’ll leave that as an exercise to the reader.
Myth 2: MT19937 has bad internal states with lot of zeroes
This was true until 2002. After the first publication in 1998, some flaws were discovered with initial internal states that contained lot of zeroes. This was fixed in 2002 by the original authors. They introduced two initialization functions that take the seed, do some fiddling on the bits and only after that they fill the state. There is one algorithm that takes one integer and appropriately fills the state and another algorithm take takes sequence of arbitrary length and then fills the state.
If you look at the constructors of the standard MT19937 you should notice similar algorithms for seeding. Even if the seed sequence generates lot of consecutive zeroes, the additional bit fiddling will fix that. With those algorithms in place, there is explicit distinction between the seed and the initial state. You should not try to directly fill the internal state for the purposes of seeding.
Myth 3: You should seed MT19937 with sequence of 624 integers
No, you shouldn’t. As I stated previously, the seed and the state are different. If the state is huge, the seed does not need to be. Consider the state as implementation detail. The seed matters mostly for the randomness of the few initial outputs. The more seeds you have, the more initial states you have, the more uniformity you have in the first few outputs.
If 32-bit seed is not enough for you, use 64-bit seed, i.e. std::seed_seq
constructed with two integers. Do you know how big is 2^64? Just one for-loop
that iterates all 64-bit numbers and does nothing else may take maybe 200 years
to execute. Is that not enough for you? Use 128-bit seed, like my example in my
previous article. With 128-bit seed you have
gazzilion different initial states. Generating all of them would take bazzilion years.
I have two more arguments related to std::random_device
on why using 624
integers is too much.
std::random_device
is very much platform specific facility, but most likely it
is CSPRNG. There are some things that are common. The OS collects randomness
from various IO like mouse movement, keyboard typing, temporary files and fills
up the randomness, the entropy gets higher. As you request random numbers, the
randomness pool is spent. If you overuse it, it will deplete. Modern operating
systems try very hard this not to happen. Still, you should not fight with the
OS. You should treat std::random_device
as finite resource that should be
spent wisely. Therefore, using just few values from std::random_device
is the
right way to seed. Using 624 values from std::random_device
is wrong.
Secondly, calls to std::random_device
are slow. Internally, it does a system
call which is more expensive than normal function call (in user space). The system
call may further do some IO, which is, again, a slow operation. When you request
more unnecessary data from std::random_device
, you unnecessarily slow your program.
Myth 4: std::seed_seq is not bijection
This is probably the only correct statement in that bad article, but also it is
completely useless. Is it a big problem if there are few collisions? They tested the case
where the input is two 32-bit integers and the output is also two 32-bit integers.
Why would one use seed_seq
in the first place in such case? If the internal
state of some PRNG is 64 bits, I would directly seed it with 64-bit number. Can
someone test for bijection where the input is 2 or 4 integers and the output is
624 integers?
Conclusion
The current design of the library for random numbers is actually good.
There can be some improvements, like ease of use. For starters, I would improve the
documentation in the standard by giving an example that shows the little dance between random_device
,
seed_seq
and mt19937
. I will use just 4 integers from random_device
in that example.
Future proposals about random numbers in C++ must not make random_device
to conform to
the named requirement SeedSequence because that would make it very easy to use
random_device
the wrong way.