What are the best ways to unit test code that outputs random sequences satisfying specific conditions, such as Markov chains?
Let's be specific. There are two natural things to test:
- That the initial state returned in each chain follows the user-specified probabilities, i.e. chain[0] == i with probability p_initial[i].
- That the transitions between states are correct, that is, if chain[k] == i, then the probability of chain[k+1] == j is given by p_transition[i, j].
Now, I can think of two ways to do this.
- Test that the code does what I want it to do. Generate a large number of sequences (say, 10000) and test that the frequencies for the initial state approximate p_initial. Next, test that the frequencies of transitions to state j from state i follow p_transition. (A sketch of such a statistical test is shown after this list.)
The difficulty with this approach is that any finite sample will exhibit fluctuations from the prescribed probabilities. I thus have to set an acceptable level for these fluctuations. But this means that a) the test might miss actual errors in the code if they only change the frequencies a little bit, and b) after a refactoring I might get false failures due to a particular chain falling outside the accepted fluctuation level by pure chance. (I'm assuming I'm using a fixed seed for the random number generator so that at least without refactoring the tests should be reproducible.) Worse, the more I relax my acceptance criteria to mitigate problem b), the more I risk falling for a).
Moreover, I'm only testing a small subset of the requirements for the model (I also want the transitions to be independent, for instance). And compared to how simple the implementation for this code is (given a random number generator; see below), it doesn't seem worth it to add more bloat in the testing code.
- The alternative is to assume that the random number generator was tested by its creators and works fine, and just test that my code calls it in the right way. So I can mock the RNG and check that my code calls it with the right arguments.
The problem with this approach is that it makes the test depend strongly on the implementation. For instance, I could choose the initial state by using something like numpy.random.choice in Python, or I could simply generate a random number between 0 and 1 (numpy.random.random) and implement my own logic for choosing the initial state based on that. Even within each of these choices, I could choose to order the states in any way. The tests must know what the code is doing, and that makes the tests fragile to refactoring.
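For concreteness, here is a rough sketch of what the first, statistical approach might look like for the PlainMarkov class defined further below. The parameter values, the fixed seed, and the 0.02 tolerance are arbitrary choices of mine; the tolerance is exactly the acceptance level whose choice is the problem described above.

import numpy as np

def test_initial_state_frequencies():
    # Statistical check: draw many chains with a fixed seed and compare the
    # empirical initial-state frequencies with p_initial.
    p_initial = [0.2, 0.5, 0.3]
    p_transition = [[1.0, 0.0, 0.0],
                    [0.3, 0.4, 0.3],
                    [0.5, 0.2, 0.3]]
    markov = PlainMarkov(p_initial, p_transition,
                         rng=np.random.default_rng(12345))  # fixed seed for reproducibility
    n_chains = 10000
    first_states = []
    for _ in range(n_chains):
        chain = markov.run()
        # An empty chain means the initial draw was the terminal state 0.
        first_states.append(chain[0] if chain else 0)
    freqs = np.bincount(first_states, minlength=len(p_initial)) / n_chains
    # The tolerance is the arbitrary acceptance level discussed above.
    assert np.allclose(freqs, p_initial, atol=0.02)

With 10000 chains the typical fluctuation in each frequency is of order sqrt(p(1-p)/10000), roughly 0.005, so a tolerance of 0.02 is about a four-sigma acceptance level; that tolerance is the knob that trades off problems a) and b) above.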
So that's my problem: it seems wrong to couple the test to the implementation so strongly. But it also seems wrong to perform a lot of tests of my function that are in effect tests of the RNG (and not particularly stringent tests either).
Is there a different, better way to do this that allows me to test the code that I wrote while relying on the RNG writers to have checked their own code?
For definiteness, here is some (Python) code implementing a simple Markov chain generator:
import numpy as np

class PlainMarkov(object):
    def __init__(self, p_initial, p_transition, rng=None):
        """ p_initial = sequence of initial-state probabilities
            p_transition = sequence of sequences for transition probabilities
            rng = random number generator
        """
        self.p_initial = np.asarray(p_initial)
        self.p_transition = np.asarray(p_transition)
        self.rng = rng if rng is not None else np.random.random.__self__

    def run(self):
        """ Draw one chain from the Markov distribution. """
        n_states = len(self.p_initial)
        state = self.rng.choice(n_states, p=self.p_initial)
        chain = []
        while state != 0:
            chain.append(state)
            state = self.rng.choice(n_states, p=self.p_transition[state])
        return chain
4 Answers
The alternative is to assume that the random number generator was tested by its creators and works fine, and just test that my code calls it in the right way. So I can mock the RNG and check that my code calls it with the right arguments.
This is, of course, the correct answer.
Unless you are actually in the business of designing your own RNG, you want to treat randomness as something to be provided to your system as an input, and thus your tests can choose what inputs to provide on the "random" channel.
Even within each of these choices, I could choose to order the states in any way. The tests must know what the code is doing, and that makes the tests fragile to refactoring.
There are two responses to this:
First: if the observable behavior of the system is altered by a change to the implementation, that change is not a refactoring.
Second: you don't have to over-fit your tests. If you can describe the constraints that are not satisfied by incorrect outputs, then write your tests to check the output against those constraints, rather than worrying that the output matches some previously calculated value.
This is property based testing vs example based testing in a nutshell. I recommend Scott Wlaschin's An Introduction to Property Based Testing if you aren't already familiar with the subject.
This discussion of the Fischer fairy chess problem describes some of the concerns, albeit on a simpler sort of problem with a more limited range of outputs.
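As a hedged sketch of what a property-style test might look like for the PlainMarkov class from the question (this uses the hypothesis library, which is one common choice for property-based testing in Python; the strategies, bounds, and constraints below are my own assumptions, not an exhaustive specification):

import numpy as np
from hypothesis import given, strategies as st

# Strategy for weight vectors; normalizing them yields valid probability vectors.
weights = st.lists(st.floats(min_value=0.05, max_value=1.0), min_size=3, max_size=3)

@given(weights, st.lists(weights, min_size=3, max_size=3),
       st.integers(min_value=0, max_value=2**32 - 1))
def test_chain_satisfies_constraints(w_init, w_trans, seed):
    p_initial = np.asarray(w_init) / np.sum(w_init)
    p_transition = np.asarray(w_trans) / np.sum(w_trans, axis=1, keepdims=True)
    chain = PlainMarkov(p_initial, p_transition,
                        rng=np.random.default_rng(seed)).run()
    # Constraints any correct output must satisfy, regardless of how the
    # implementation consumes the RNG internally:
    assert all(0 <= s < len(p_initial) for s in chain)      # legal state indices
    if chain:
        assert p_initial[chain[0]] > 0                       # allowed initial state
    assert all(p_transition[a, b] > 0
               for a, b in zip(chain, chain[1:]))            # allowed transitions

The point is that the assertions describe constraints on the output rather than a specific expected chain, so the test survives refactorings that change how the implementation consumes the RNG.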
- "that change is not a refactoring." - really? So changing random.nextInt(0,2) == 0 to random.nextBoolean() is not a refactoring? – Stack Exchange Broke The Law, Commented Apr 15, 2019 at 4:49
- Refactoring, by definition [refactoring.com/] [en.wikipedia.org/wiki/Code_refactoring], does not alter the observable behaviour of the system. – KarlM, Commented May 15, 2019 at 21:59
A unit test should be what the name implies: a test of a specific unit.
That generally means you mock away all dependencies. So if you made a fake random number generator that returned a fixed sequence of numbers (1, 7, 4, 3, 9, 2, for example, and always that sequence), then you could test that you generate a very specific Markov chain based on that. And your Markov chain generator function should be one that gives a very specific chain based on a sequence of random numbers. Even if you refactor your function, you would expect that the same random number sequence as input gives the same results.
You are not going to show with a unit test that your function generates some specific probability distribution. You are just trying to show that, if you know exactly which numbers come out of your random number generator, your algorithm gives the expected output. If need be, you can create a second test based on a different set of generated random numbers and show that one works too. If both those tests pass, you don't know your function will always work; you just know it is likely enough to work that it's worth the effort of doing more complex tests, or seeing what happens in production.
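A minimal sketch of this idea applied to the PlainMarkov class from the question. The FakeRNG class, the scripted values, and the expected chain are illustrative assumptions; the expected chain is worked out by hand from the script.

class FakeRNG:
    """A stand-in RNG that plays back a scripted sequence of 'draws'.
    Each scripted value is taken to be the state index returned by
    choice(); this mirrors how PlainMarkov calls its rng."""
    def __init__(self, scripted_states):
        self._states = iter(scripted_states)

    def choice(self, n_states, p=None):
        # Ignore the probabilities and return the next scripted state.
        return next(self._states)

def test_run_with_scripted_rng():
    p_initial = [0.1, 0.5, 0.4]
    p_transition = [[1.0, 0.0, 0.0],
                    [0.2, 0.3, 0.5],
                    [0.4, 0.4, 0.2]]
    # Scripted draws: initial state 1, then 2, then 1, then 0 (terminates).
    markov = PlainMarkov(p_initial, p_transition, rng=FakeRNG([1, 2, 1, 0]))
    assert markov.run() == [1, 2, 1]

Note that this fake bakes in the assumption that the implementation draws states via choice(), which is exactly the coupling debated in the comments below.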
Of course you can also do some kind of automated integration test/load test/statistical analysis on a large number of runs, but that's not a unit test. And with those tests you have to deal with statistically unlikely scenarios (like your random number generator producing 0 twenty times in a row).
Think of your function as a blackjack scoring function. The card that's being drawn is random, but if I have an 8 and a Queen, my score should always be 18 (and I'm alive, etc.). You don't want to test probabilistic behavior; you want to test very deterministic results after your random/probability function has actually been executed.
- But there are several different but equally valid ways in which an implementation could take a sequence of numbers from an RNG and turn them into a sequence of states from a Markov chain. So even if I fix the RNG output, the output from the Markov chain generator is implementation-dependent. What you suggest, I think, is essentially to make the specific way in which this transformation is done a part of the interface. But that doesn't seem much better than making the test depend on a hidden implementation detail. – Legendre17, Commented Aug 6, 2018 at 18:39
- I would argue that making an explicit transformation between random numbers and sequences is much better exactly because it is testable and you could show other people the algorithm and talk about the correctness. As long as you used the same algorithm you could refactor and keep the tests. It's exactly why hidden implementation details are worse. You are testing a specific implementation of your interface. Of course if you had a different implementation you would have to write new tests. Anyone using your interface would be tested with a mock implementation with a (static) mock sequence. – Batavia, Commented Aug 6, 2018 at 18:53
- That's a fair point, but this approach makes the interface cater to the testing framework more than to the user. As a user, I wouldn't care how the transformation from random numbers to state chains is performed. If I had two implementations that differed only in this transformation, I would think of them as having the same interface because nothing in my code would need to change to account for this. Is the best way really to make an implementation detail public just so I can write the tests? – Legendre17, Commented Aug 6, 2018 at 19:39
- @Legendre17 deterministic unit tests do not lock you to a particular implementation. They make it very clear when your implementation has functionally changed. You want to play with functionally different implementations? Fine, do it. Just write each kind its own unit tests. Switch tests when you switch implementations. You want to write statistical tests that don't care about implementation? Fine, do it. Just don't call these slow things unit tests. Run slow tests separately. Nothing Batavia has said dictates any interface details. Just that you use tests wisely. – candied_orange, Commented Aug 6, 2018 at 22:57
- @candied_orange I think I'm not explaining this right. The only functional requirement for writing a Markov generator is that it generates chains drawn from the appropriate distribution. Given a PRNG, such a class is trivial to implement, but that can be done in several ways that use the PRNG differently. These aren't functionally different implementations; they differ only in their internals. So if I write a test for each of them, I'm writing tests that depend on details that are not part of the interface of the class. – Legendre17, Commented Aug 6, 2018 at 23:31
Perhaps I'm missing something, but are you aware that you can use random.seed()?
Use that to set some specific seed. By hand, validate the answer. And store/validate that you get those states after a run with that seeded input value.
Then maybe try another seed, run, and validate those answers (by hand or some other mechanism). And validate that your Markov chain evaluator gets those answers too.
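A hedged sketch of this seeded approach for the PlainMarkov class from the question. It passes a seeded NumPy RandomState rather than calling random.seed(), since that is what PlainMarkov accepts; the seed and parameters are placeholders, and a real "golden" test would pin the hand-validated chain instead of just comparing two runs.

import numpy as np

def test_run_is_reproducible_for_a_fixed_seed():
    p_initial = [0.2, 0.5, 0.3]
    p_transition = [[1.0, 0.0, 0.0],
                    [0.3, 0.4, 0.3],
                    [0.5, 0.2, 0.3]]
    # Two independently seeded generators produce the same draw sequence,
    # so the resulting chains must be identical.
    chain_1 = PlainMarkov(p_initial, p_transition, rng=np.random.RandomState(42)).run()
    chain_2 = PlainMarkov(p_initial, p_transition, rng=np.random.RandomState(42)).run()
    assert chain_1 == chain_2
    # A golden-value test would instead assert chain_1 == [...], where the
    # expected states have been validated by hand for this seed.

As discussed in the comments below, such a pinned chain is tied to the current implementation's use of the generator, so it needs updating whenever that changes.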
Sorry if I missed something, but I hope that helped...
- Yes, I fix the random generator seed to keep the tests deterministic. The problem is that there are different ways in which the code can be implemented that are all correct but can lead to different results, even though the PRNG sequence is fixed. So if I check that I get a particular chain for a particular seed, this can fail when I refactor the code just because an implementation detail changed. I was hoping there's a better way. – Legendre17, Commented Aug 6, 2018 at 20:50
- I'm afraid I don't follow how that's possible. A Markov chain becomes completely deterministic and you can compute the 'right answer' given a hard-wiring of the sequence of 'dice rolls'. A Markov chain is a simple state machine, where the rules for transitioning between states are probabilistic. If you fix the rolls on the dice, you fix the sequence of states. If you actually change your set of states (is that what you mean by refactor?) - then of course you will get a different answer - you have a different Markov chain! @Legendre17 – Lewis Pringle, Commented Aug 6, 2018 at 21:37
- What I'm saying is that even if the dice rolls are fixed, we still have freedom in choosing what each roll means. If we roll 1, does that map to state 1 or state 5? More to the point, suppose I decide to write (rng.random() > 1 - np.cumsum(p)).nonzero()[0][0] instead of rng.choice(len(p), p=p) in the run method. Then I'll get the exact same statistics, but the outcome for each choice of seed will be different. – Legendre17, Commented Aug 7, 2018 at 1:52
- @Legendre17. That choice is enshrined in the test results. You cannot renumber your states or change your random number generator (which the above effectively does). It's worth remembering the point of unit tests or regression tests. It's a tool! The world doesn't end when they fail. Your attention is caught and you are warned to look more closely at your code change. What you describe is not a real problem. – Lewis Pringle, Commented Aug 7, 2018 at 2:57
I think in this case, the most you can test at the interface level is whether the resulting chain is a valid Markov chain, i.e. whether the initial element is one allowed by p_initial, and whether each subsequent element is a valid transition from the element before it. And for that, the test doesn't need to know about the implementation at all.
I think having a further test of whether the results are statistically correct may be a good idea, but I wouldn't call that test a "unit test".
You also can have unit tests of specific implementations, which rely on knowledge of those implementations. Whether that's a useful thing to have depends on the situation. In this case, I wouldn't consider it to be, but YMMV.
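A minimal sketch of such an interface-level check for the PlainMarkov class from the question; the helper name and parameter values are my own, and the check deliberately knows nothing about how run() uses the RNG.

import numpy as np

def assert_valid_chain(chain, p_initial, p_transition):
    """Interface-level check: the chain starts in a state allowed by
    p_initial and only uses transitions allowed by p_transition."""
    p_initial = np.asarray(p_initial)
    p_transition = np.asarray(p_transition)
    if chain:                         # an empty chain means the initial draw was the terminal state 0
        assert p_initial[chain[0]] > 0
    for a, b in zip(chain, chain[1:]):
        assert p_transition[a, b] > 0

def test_run_produces_valid_chains():
    p_initial = [0.0, 0.6, 0.4]
    p_transition = [[1.0, 0.0, 0.0],
                    [0.5, 0.5, 0.0],
                    [0.2, 0.3, 0.5]]
    markov = PlainMarkov(p_initial, p_transition, rng=np.random.default_rng(1))
    for _ in range(50):
        assert_valid_chain(markov.run(), p_initial, p_transition)

Such a check is cheap to run over many generated chains, while the statistical test of the actual probabilities is better kept as a separate, slower test, as noted above.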
- You could mock rng.choice() to return hardcoded values and check that run() returns the expected chain.
- But that presupposes the implementation uses rng.choice() at all. Besides, the output from rng.choice() depends on the arguments with which it's called, so hardcoding the output tests only part of the functionality.