What the fuzz? Better coding through randomized testing

Zachary Newman, Principal Research Scientist
March 13, 2023

Why do we test our code? The obvious reason is that we want to check that our software is correct (or at least correct enough), but the less obvious (though possibly more important) reason is that we want to make our software correct. That's the point of test-driven development: testable code is easier to reason about, understand, use, and trust.

There are a number of ways to make code more correct. One not-particularly-helpful response is to tell developers to try harder, insisting that a disciplined programmer who remembers to run the right linters and sanitizers can write safe code. Planning for, rather than denying, human fallibility (as other engineering disciplines have always done) requires tools that can’t have the classes of bugs you’re worried about (memory-safe languages), methods for ensuring correctness (model checking and formal verification), and processes (two-party code review).

While many of these techniques trade off speed for safety, the practice we’ll discuss today, randomized testing, helps with both. In tandem with test-driven development, it can flush out complicated bugs with relatively little effort from developers. Taking Brian Kernighan’s words—“if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it”—to heart, randomized testing employs more powerful techniques than human ingenuity to test and debug applications.

In this post, we’ll see that:

  • Randomized testing is a powerful tool and should be used more often.
  • Fuzzing (a type of randomized testing) is really useful for specific applications.
  • Both fuzzing and randomized testing are less popular than they should be because popular discourse fails to recommend the right tool for the right job.

My own worst enemy

I once took a 1-day course on software testing which changed the way I thought about these practices forever. The course paired me up with a partner and culminated with an exercise where we attempted to write some simple code (Conway's Game of Life) by alternating between 5-minute rounds of:

  1. A testing phase, where I wrote a test.
  2. A coding phase, where my partner attempted to make the test pass.

If my partner failed to make the test pass in time, we had to start that round over. The trick, though, was that my partner was instructed to be uncooperative: they'd do the worst possible thing to make my tests pass. So if I wrote:

They’d respond with:
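Something like the following (again, a hypothetical reconstruction): supposing my test only checked that a lone cell dies, the laziest implementation that passes is to declare every cell dead, always:

```python
def step(cells: set[tuple[int, int]]) -> set[tuple[int, int]]:
    # The worst possible thing that makes the test pass:
    # everything dies, unconditionally.
    return set()
```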

This exercise totally changed the way I think about testing. Before, I always wrote tests by thinking about the code I was going to write (or, more often, had already written). This made it really easy to deceive myself into thinking that I got it right: of course the code does what you predicted it did! But notably, that’s not the same as checking that the code is correct. After this class, I started thinking about testing as a game: can I write test(s) such that it would be almost impossible to make the test pass unless the code was correct? But instead of an uncooperative partner, I’m playing this game against myself.

Randomized testing

You'll quickly notice that it's almost impossible to write deterministic tests that incorrect code cannot pass. You might try parameterized testing (enumerate many test cases with their expected outcomes):
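For instance, with pytest's `parametrize` (a sketch; `add()` here is a hypothetical function under test):

```python
import pytest

def add(x: int, y: int) -> int:
    # Hypothetical function under test.
    return x + y

@pytest.mark.parametrize(
    "x, y, expected",
    [
        (0, 0, 0),        # identity
        (1, 2, 3),        # simple case
        (-1, 1, 0),       # mixed signs
        (2**31, 1, 2**31 + 1),  # past 32-bit boundary
    ],
)
def test_add(x: int, y: int, expected: int) -> None:
    assert add(x, y) == expected
```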

But in the adversarial coding game, your partner could always write a lookup table. That's not a worry in practice, of course, but it does point to an actual issue: if you're just thinking of examples to test, you're limited by your imagination. Over time, most developers get better at thinking about edge cases, but good software engineering is about doing things that take away reliance on humans getting things right, because they often don't (myself very much included).

So we want to introduce randomized testing. This is almost a direct generalization of parameterized testing, but there's a catch: in most parameterized tests, you have to hard-code the expected result! How can you do that if you're generating test cases on the fly? Well, you have to start thinking in terms of the abstract properties of the code you're writing: what does it mean for this code to be correct?

Fuzzing

The most obvious property to check about some code is that it should never crash or leak memory. This brings us to fuzzing, which is (mostly) about those sorts of checks.

In fuzzing, your test gets random binary data as an input. For example, using pythonfuzz:
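The lost snippet presumably looked something like the sketch below. To keep the example self-contained, I've swapped pythonfuzz's engine for a toy random-bytes loop; with pythonfuzz itself, you'd decorate the target and let its engine supply `buf`. The `parse()` function is a hypothetical stand-in:

```python
import random

def parse(data: bytes) -> list[str]:
    # Hypothetical parser under test: splits comma-separated ASCII fields.
    text = data.decode("ascii")
    return text.split(",")

def fuzz_one(buf: bytes) -> None:
    # The fuzz target: it must never crash on arbitrary bytes.
    try:
        parse(buf)
    except UnicodeDecodeError:
        pass  # rejecting non-ASCII input is acceptable; crashing is not

# Toy driver standing in for the fuzzing engine.
rng = random.Random(0)
for _ in range(1_000):
    buf = bytes(rng.randrange(256) for _ in range(rng.randrange(32)))
    fuzz_one(buf)
print("no crashes in 1,000 random inputs")
```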

Here, we’re using a parse() function because functions that take binary input are the easiest to fuzz.

The easiest way to implement a fuzzer would be to generate literally random data. But modern fuzzers have a number of features that lead to better testing:

  • Mutation-based fuzzers generate random data, but also make small tweaks to prior input data.
  • “Smart” fuzzers understand the structure of the expected input, and can generate data according to a developer-provided grammar to better exercise interesting code paths.
  • Coverage-guided fuzzers profile the code under test, and tweak their inputs to ensure maximal coverage.


These fuzzers can be extremely effective at catching bugs by applying a fair bit of computing power to try as many inputs as possible, especially ones a developer would never think of. For instance, the OSS-Fuzz effort fuzzes important open source software projects and has found thousands of issues, many of which were security vulnerabilities.

Fuzzing is really important in languages like C/C++, where issues like buffer overflow are common, but it still has value in memory-safe languages like Go, Rust, or Python. For instance, anything that involves untrusted user input could benefit from fuzz coverage, especially parsers.

Property-based testing

But what if you’re not writing a parser? It turns out that “not crashing” isn’t the only property you'd want to check! For example, you might know that:

  • add(x, y) should give you the same result as add(y, x)

  • reverse(l) applied twice should give you back the original list

These are great things to check! Note that neither on its own is sufficient to make sure you actually implemented add() or reverse() correctly. In general, you should use parameterized tests or other traditional testing methods in conjunction with randomized tests. But if somebody came up with an input that violated that property, it could point to an issue.
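A minimal randomized check of the two properties above, using Python's built-in `+` and `reversed()` as stand-ins for `add()` and `reverse()`:

```python
import random

rng = random.Random(0)

for _ in range(1_000):
    # Property 1: add(x, y) == add(y, x).
    x = rng.randint(-1_000_000, 1_000_000)
    y = rng.randint(-1_000_000, 1_000_000)
    assert x + y == y + x

    # Property 2: reversing a list twice gives back the original.
    l = [rng.randrange(10) for _ in range(rng.randrange(8))]
    assert list(reversed(list(reversed(l)))) == l

print("1,000 random cases passed")
```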

Now we're getting into the realm of property-based testing. This type of testing is inspired by QuickCheck (Haskell), which makes an important observation: if our functions are total, in the sense that they do something correct for all inputs, we can be much more confident that they don't get misused. Consider a simple example based on the excellent Hypothesis library for Python:
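A reconstruction in the spirit of the original (the function and property names are my guesses; the format is "Last, First", matching the counterexample discussed next):

```python
from hypothesis import given, strategies as st

def format_full_name(first: str, last: str) -> str:
    # Format a name as "Last, First".
    return f"{last}, {first}"

@given(st.text(), st.text())
def test_full_name_has_one_comma(first: str, last: str) -> None:
    # Property: the formatted name contains exactly the one comma we added.
    assert format_full_name(first, last).count(",") == 1
```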

This looks simple enough that it would never do the wrong thing! But consider the input first="Foo, Bar", last="Baz". This will result in "Baz, Foo, Bar", which violates the given property.

Is this a good test? My attitude here used to be "whatever—garbage in, garbage out." But in order to avoid passing “garbage” to this function, the caller needs to call a check_valid_name() parser. For a name this could be relatively cheap, but if the inputs were longer documents, this call could be quite expensive, or expose your system to denial-of-service vulnerabilities.

Instead, let’s leverage the type system:
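One way to encode this in Python (a sketch — the original's exact `Name` type isn't preserved): a small frozen dataclass, with `parse_name()` (described next) as the only intended way to construct one:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Name:
    """A personal name that has been validated: nonempty, no commas.

    Construct via parse_name() rather than directly.
    """
    value: str
```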

Now, we can define a parse_name() function that returns a Name. Then, because of encapsulation, we can be confident that any Name has been parsed and is free of characters like commas. Our property-based test now takes in a Name as its input (you’ll need to define a strategy for making these), so you no longer have to worry about the comma case. Your function does the same thing, but now it’s closer to total: every input of the correct type results in a valid output.

See what happened? The cycle of test->code->test failure->fix code led here to a refactoring of our code that makes it more correct by construction. 

Even though we didn’t fix a bug in format_full_name(), we were forced to clarify what we consider a valid input to that function. Don’t accept “garbage in, garbage out”—instead, use your typechecker to disallow the garbage!

Why aren’t these techniques more popular?

In my experience, discourse around fuzzing and randomized testing is insufficiently nuanced. Proponents say, “you should fuzz everything!” Like most universal advice, this contains a kernel of truth but misses the fact that every developer has a unique problem in front of them, and many developers don’t have the experience with randomized or fuzz testing to apply them properly to their codebase. This leads to frustration: tests that break all the time and cost a lot of money without finding bugs.

Further, most overviews of randomized testing (this post included) use unrealistically simplistic code for illustration. See the talk Property-Based Testing: The Ugly Parts for an illustration of how to extend this to more complicated codebases.

 

Together, this leads developers to adopt a dichotomous view of these testing techniques: they either “like” or “hate” them, and aim either to use them everywhere, with full buy-in on a particular library, or to avoid them entirely. The post My Take on Property-Based Testing has a balanced discussion of the strengths and weaknesses of these approaches, and some heuristics for when to apply each.

Conclusion

Most code would benefit from some kind of randomized testing, especially property-based testing. I find that property-based-testing-friendly code is almost always easy to read and reason about. Further, the engines will uncover things you never thought about (e.g., overflow bugs) that might actually matter in practice. Fuzz testing can also avoid vulnerabilities when processing untrusted inputs.
