Why do we test our code? The obvious reason is that we want to check that our software is correct (or at least correct enough), but the less obvious (though possibly more important) reason is that we want to make our software correct. That's the point of test-driven development: testable code is easier to reason about, understand, use, and trust.
There are a number of ways to make code more correct. One not-particularly-helpful response is to tell developers to try harder, insisting that a disciplined programmer who remembers to run the right linters and sanitizers can write safe code. Planning for human fallibility, rather than denying it (as other engineering disciplines have always done), requires tools that rule out entire classes of bugs (memory-safe languages), methods for ensuring correctness (model checking and formal verification), and processes that catch mistakes (two-party code review).
While many of these techniques trade off speed for safety, the practice we’ll discuss today, randomized testing, helps with both. In tandem with test-driven development, it can flush out complicated bugs with relatively little effort from developers. Taking Brian Kernighan’s words—“if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it”—to heart, randomized testing employs more powerful techniques than human ingenuity to test and debug applications.
In this post, we'll look at two flavors of randomized testing, fuzzing and property-based testing, and at why they aren't used more widely.
My own worst enemy
I once took a one-day course on software testing which changed the way I thought about these practices forever. The course paired me up with a partner and culminated in an exercise where we attempted to write some simple code (Conway's Game of Life) by alternating 5-minute rounds: I would write a single failing test, and my partner had to write code to make it pass.
If my partner failed to make the test pass in time, we had to start that round over. The trick, though, was that my partner was instructed to be uncooperative: they'd do the worst possible thing to make your tests pass. So if I wrote:
They’d respond with:
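Something in this spirit, a hypothetical "worst possible" implementation that ignores its input entirely:

```python
def step(cells):
    # the worst possible thing: ignore the board entirely and
    # hard-code exactly the answer my one test expected
    return {(0, 1), (1, 1), (2, 1)}
```

The test passes, and yet step is useless, which is exactly the point of the exercise.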
This exercise totally changed the way I think about testing. Before, I always wrote tests by thinking about the code I was going to write (or, more often, had already written). This made it really easy to deceive myself into thinking that I got it right: of course the code does what you predicted it did! But notably, that’s not the same as checking that the code is correct. After this class, I started thinking about testing as a game: can I write test(s) such that it would be almost impossible to make the test pass unless the code was correct? But instead of an uncooperative partner, I’m playing this game against myself.
You'll quickly notice that it's almost impossible to write deterministic tests that incorrect code cannot pass. You might try parameterized testing (enumerating many test cases along with their expected outcomes):
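A sketch of what that might look like (reverse_words is a hypothetical function under test; pytest users would feed the same table to @pytest.mark.parametrize instead of a hand-rolled loop):

```python
def reverse_words(s: str) -> str:
    # hypothetical function under test
    return " ".join(reversed(s.split()))

# each case is (input, expected output)
CASES = [
    ("hello world", "world hello"),
    ("a", "a"),
    ("", ""),
    ("  spaced   out  ", "out spaced"),
]

def test_reverse_words():
    for text, expected in CASES:
        assert reverse_words(text) == expected, (text, expected)
```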
But in the adversarial coding game, your partner could always write a lookup table. That's not a worry in practice, of course, but it does point to an actual issue: if you're just thinking of examples to test, you're limited by your imagination. Over time, most developers get better at thinking of edge cases, but good software engineering is about removing reliance on humans getting things right, because they often don't (myself very much included).
So we want to introduce randomized testing. This is almost a direct generalization of parameterized testing, but there's a catch: in most parameterized tests, you have to hard-code the expected result! How can you do that if you're generating test cases on the fly? Well, you have to start thinking in terms of the abstract properties of the code you're writing: what does it mean for this code to be correct?
The most obvious property to check about some code is that it should never crash or leak memory. This brings us to fuzzing, which is (mostly) about those sorts of checks.
In fuzzing, your test gets random binary data as an input. For example, using pythonfuzz:
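pythonfuzz's harness is a function that receives a bytes buffer, wrapped in its @PythonFuzz decorator. Since that package may not be installed everywhere, here is a self-contained sketch of the same idea using seeded random bytes, with a hypothetical parse_key_value function as the target:

```python
import random

def parse_key_value(data: bytes) -> dict:
    # hypothetical target: parse lines of "key=value" text
    result = {}
    for line in data.decode("utf-8", errors="replace").splitlines():
        if "=" in line:
            key, value = line.split("=", 1)
            result[key.strip()] = value.strip()
    return result

def fuzz_one(buf: bytes) -> None:
    # the property under test: parsing must never raise, whatever the input
    parse_key_value(buf)

def run_fuzzer(iterations: int = 1000) -> None:
    rng = random.Random(0)  # seeded so any failure is reproducible
    for _ in range(iterations):
        buf = bytes(rng.randrange(256) for _ in range(rng.randrange(64)))
        fuzz_one(buf)
```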
The easiest way to implement a fuzzer would be to generate literally random data. But modern fuzzers have a number of features that lead to better testing: they use code-coverage feedback to mutate inputs toward unexplored paths, they start from seed corpora and dictionaries of interesting values, they minimize crashing inputs down to small reproducers, and they pair naturally with sanitizers that turn silent memory errors into loud crashes.
These fuzzers can be extremely effective at catching bugs by applying a fair bit of computing power to try as many inputs as possible, especially ones a developer would never think of. For instance, the OSS-Fuzz effort fuzzes important open source software projects and has found thousands of issues, many of which were security vulnerabilities.
Fuzzing is really important in languages like C/C++ where issues like buffer overflows are common, but it still has value in memory-safe languages like Go, Rust, or Python. For instance, anything that involves untrusted user input could benefit from fuzz coverage, especially parsers.
But what if you’re not writing a parser? It turns out that “not crashing” isn’t the only property you'd want to check! For example, you might know that add(a, b) should always equal add(b, a), and that reversing a list twice should return the original list.
These are great things to check! Note that neither property on its own is sufficient to make sure you actually implemented add() or reverse() correctly. In general, you should use parameterized tests or other traditional testing methods in conjunction with randomized tests. But if somebody came up with an input that violated either property, it would point to a real issue.
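Checked against randomly generated inputs, properties like these might look as follows (a stdlib-only sketch; a property-based testing library would also shrink failing examples for you):

```python
import random

def add(a: int, b: int) -> int:
    return a + b

def reverse(xs: list) -> list:
    return xs[::-1]

def check_properties(trials: int = 1000) -> None:
    rng = random.Random(42)  # seeded for reproducibility
    for _ in range(trials):
        a = rng.randint(-10**9, 10**9)
        b = rng.randint(-10**9, 10**9)
        assert add(a, b) == add(b, a)  # addition is commutative

        xs = [rng.randint(-100, 100) for _ in range(rng.randrange(20))]
        assert reverse(reverse(xs)) == xs  # reversing twice is a no-op
```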
Now we're getting into the realm of property-based testing. This type of testing is inspired by QuickCheck (Haskell), which makes an important observation: if our functions are total, in the sense that they do something correct for all inputs, we can be much more confident that they don't get misused. Consider a simple example based on the excellent Hypothesis library for Python:
Instead, let’s leverage the type system:
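One way to do that, sketched here with a hypothetical NonEmptyList type: make the empty input unrepresentable, so mean becomes total over its whole input type:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class NonEmptyList:
    # at least one element is required by construction
    head: float
    tail: tuple = ()

    def items(self) -> tuple:
        return (self.head, *self.tail)

def mean(xs: NonEmptyList) -> float:
    items = xs.items()
    return sum(items) / len(items)  # len(items) >= 1, always
```

There is simply no NonEmptyList value with zero elements, so the crashing case can no longer be constructed.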
See what happened? The test → code → test failure → fix cycle led here to a refactoring of our code that makes it more correct by construction.
Why aren’t these techniques more popular?
In my experience, discourse around fuzzing and randomized testing is insufficiently nuanced. Proponents say, “you should fuzz everything!” Like most universal advice, this contains a kernel of truth but misses the fact that every developer has a unique problem in front of them, and many developers don’t have the experience with randomized or fuzz testing to apply them properly to their codebase. This leads to frustration: tests that break all the time and cost a lot of money without finding bugs.
Further, most overviews of randomized testing (this post included) use unrealistically simplistic code for illustration. See the talk Property-Based Testing: The Ugly Parts for an illustration of how to extend this to more complicated codebases.
Together, this leads developers to adopt a dichotomous view of these testing techniques: they either “like” or “hate” them, and aim to either use them everywhere, with full buy-in on a particular library, or avoid them entirely. The post My Take on Property-Based Testing has a balanced discussion of the strengths and weaknesses of these approaches, and some heuristics for when to apply each.
Most code would benefit from some kind of randomized testing, especially property-based testing. I find that code written to be friendly to property-based testing is almost always easy to read and reason about. Further, the testing engines will uncover things you never thought about (e.g., overflow bugs) that might actually matter in practice. Fuzz testing can also head off vulnerabilities in code that processes untrusted inputs.