May 1, 2026

How we automatically test the world's most secure Linux distribution

Dustin Kirkland, SVP of Engineering

I’ve been running Linux, and only Linux, on my personal machines since 1998. Every laptop, every desktop, every home server. In 27 years I’ve experienced nearly every way a package upgrade can go wrong: the shared library that moved between versions, the config format that changed without a migration path, the init script that worked on install but broke silently on the next reboot because a symlink pointed somewhere that no longer existed after the upgrade, the interpreter that gained a reserved keyword and broke a script I’d written years earlier and never thought to touch again. And because my wife runs Linux, my kids have grown up on it, and I put my parents on it to avoid viruses (which made me their upstream maintainer, permanently), I’ve also been the family IT department for a long time. I know exactly what it feels like when something breaks on an upgrade. And I’ve long wished that distributions would simply test more. Run the thing. Start the daemon. Import the module. Confirm it still works before shipping it.

That frustration is what we set out to fix at Chainguard. And now we’ve finished it.

Chainguard OS, the fully bootstrapped-from-source Linux distribution that underlies all Chainguard Containers, has achieved 100% test coverage across every package and every subpackage in our entire catalog. More than “it compiles.” More than “the build pipelines exit zero.” Every package now has at least one test stanza (and in many cases several) that actually runs the software, verifies it behaves, and catches the exact classes of failures I’ve been fixing by hand since I was playing MUDs.

This is what it looks like today in numbers:

7,451 top-level packages (i.e., open source projects)
20,248 package and subpackage definitions
32,806 binary packages across two architectures (x86_64 and aarch64)

All of these are automatically tested whenever a new package builds. The hard way is the reliable way. And this was, unambiguously, the hard way.

What "tested" actually means

Let me be specific, because "tested" is a word that gets stretched so often it becomes meaningless. “Tested” means:

For a daemon: start the service, wait for a log line confirming it's up, optionally run a health check against the live endpoint, then clean up. Scott Moser built the daemon-check-output pipeline in May 2024 to do exactly this. It's not a smoke test; it's a lifecycle test.
For a CLI tool: invoke the binary with arguments that produce meaningful output, pipe that output through grep or assertions, and confirm it says what it's supposed to say.
For a shared library: run ldd-check to verify every declared dependency resolves against actual files in the package. This is the test that would have saved my Saturday.
For a Python package: import the module.
For a Ruby gem: require it.
For a compiler: compile something.
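
As a rough illustration of the daemon case, here is a minimal Python sketch of a lifecycle check: start a process, wait for a readiness log line, then clean up. It is not the daemon-check-output pipeline itself. The function name `daemon_logs_ready`, its parameters, and the stand-in "daemon" are all hypothetical, chosen only for this example.

```python
import queue
import subprocess
import sys
import threading
import time


def daemon_logs_ready(cmd, expected, timeout=10.0):
    """Start cmd; return True if `expected` shows up in its output in time.

    Hypothetical helper for illustration only.
    """
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # treat stderr as part of the log stream
        text=True,
    )
    lines = queue.Queue()

    def pump():
        # Forward each output line to the queue; None marks end of output.
        for line in proc.stdout:
            lines.put(line)
        lines.put(None)

    threading.Thread(target=pump, daemon=True).start()
    deadline = time.monotonic() + timeout
    try:
        while True:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                return False  # never logged readiness in time
            try:
                line = lines.get(timeout=remaining)
            except queue.Empty:
                return False
            if line is None:
                return False  # process exited before becoming ready
            if expected in line:
                return True
    finally:
        # Clean up: stop the service whether or not the check passed.
        proc.terminate()
        try:
            proc.wait(timeout=5)
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.wait()


if __name__ == "__main__":
    # Stand-in "daemon": logs a readiness line, then keeps running.
    fake_daemon = [
        sys.executable, "-u", "-c",
        "import time; print('server ready'); time.sleep(30)",
    ]
    print(daemon_logs_ready(fake_daemon, "server ready"))
```

The check fails both when the process exits early and when the expected line never appears, which is the difference between confirming a service came up and merely confirming it was launched.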

Josh Wolf built the tw toolkit in February 2025: a composable family of test pipelines that made writing these tests dramatically faster. Dann Frazier contributed the compiler/smoke and compiler/clang pipelines in January 2026. These aren't one-off scripts; they're reusable infrastructure that scales across thousands of packages.

26 months of work

The first Wolfi (now Chainguard OS) commit landed in September 2022. Zero tests. The distribution existed, it was functional, and it was already more secure than anything else on the market. But it was not verified.

Ville Aikas wrote the first test: stanza in December 2023. That's the moment this story really begins. One engineer decided the bar needed to be higher and wrote the code to prove it.

What followed was 26 months of sustained, distributed effort across more than 200 contributors. Some highlights:

Brian Murray ran systematic sweeps through CUDA packages and the Perl ecosystem, writing tests in batches where others might have written one at a time.
Arturo Borrero Gonzalez ran a Python blitz that knocked out a significant portion of that catalog, and also built the metapackage and virtualpackage test pipelines.
Sergio Durigan Junior established the discipline of testing new major versions on day one, so the bar doesn’t slip when packages upgrade.
Justin Vreeland contributed 184 test stanzas, among the highest individual totals in the catalog, systematically working through packages others had left for later.
Furkan Türkal drove test coverage across the network tooling ecosystem (pdns, dnsdist, xdp-tools), contributing 145 stanzas to packages where a broken package means broken networking.
James Rawlings merged 908 pull requests generated by qackage, the AI-assisted test-writing bot Josh Wolf shipped in December 2025. Nine hundred and eight. That’s not a rounding error; that’s a force multiplier.
James Page tackled zlib. Dimitri John Ledkov handled ca-certificates. Arthur Exaltacao tested bats — the Bash Automated Testing System, widely used for shell script testing across the Linux ecosystem — which is, yes, testing the testing framework itself. He restructured it into proper subpackages, added version verification to each, and cleaned up a file permissions issue in the process. Someone had to do it, and it required someone who understood what the tool actually does.

The last 55 packages were the hardest. Lua OpenResty plugins. X font protocol metadata. Packages that existed to provide data, not executables, and required us to think carefully about what "runs correctly" even means for something with no binary entry point. We resolved each one individually. That's not glamorous work. It's the work that proves you're serious.

Why this matters for a rolling distribution

Chainguard OS is a rolling distribution. There are no point releases. No stable branch you freeze and forget. Packages update continuously as upstreams release, which means your images track things like the latest OpenSSL, the latest glibc, or the latest Python runtime, automatically.

That is a significant security advantage. It is also, if done carelessly, a significant operational risk.

The thing that makes continuous delivery trustworthy is continuous verification. A dependency update that breaks shared library linkage needs to be caught in CI, not in your cluster. A daemon that fails to start after a config format change needs to fail our test pipeline at 2:00 a.m., not your on-call rotation at 2:00 a.m.

Test coverage at this scale transforms "rolling" from a risk word into a reliability word. Speed and security are not opposites; they're partners, but only if you've done the work to make speed safe.

Secure-by-default has always been Chainguard’s promise. Tested-by-default is what makes that promise auditable.

The infrastructure that made it possible

None of this happens without the tooling. Writing 7,000 tests by hand, one at a time, would have taken a decade. The team built a testing infrastructure that covered every meaningful category of package behavior.

Service lifecycle testing: Start the daemon. Wait for the specific log line that says it is ready. Optionally send a real request to a live endpoint and validate the response. Shut it down cleanly. More than 270 packages are tested this way, including PostgreSQL, Redis, nginx, HAProxy, and Prometheus.
Shared library dependency verification: Walk every binary in a package and confirm that each shared library it links against exists at the exact path the dynamic linker will look. This is the test that catches the version bump that silently moved a library. It is the most widely used test in the catalog, covering more than 1,000 packages.
CLI invocation testing: Run the binary with real arguments and validate the output. A tool that exits non-zero, reports the wrong version, or whose help text no longer contains the expected subcommands fails immediately.
Language ecosystem testing: Import the Python module. Require the Ruby gem. Load the Node.js package. Build a small Go program. Invoke the Java main class. Every major language ecosystem has its own test pattern.
Compiler and toolchain testing: Actually compile a small program through each compiler in the catalog: C, C++, Fortran, Clang, LLVM. Confirm the output binary runs. Dann Frazier built dedicated compiler/smoke and compiler/clang pipelines in January 2026 to cover this entire class.
Header file and package configuration validation: Parse C and C++ headers using the preprocessor to catch broken includes before any consumer sees them. Validate pkg-config files against real library paths, covering more than 530 packages.
Java archive inspection: Open JAR files, validate their manifests, and confirm declared entry points are present.
Shell script syntax checking: Run every installed shell script through the interpreter’s built-in syntax checker before any code executes.
Package structure verification: Confirm documentation, data, and metapackages contain exactly what they claim. Verify that metapackages’ declared dependencies can all be resolved and installed together.
AI-assisted test generation: Qackage, built by Josh Wolf in December 2025, analyzes installed packages and automatically proposes complete test stanzas.

The pipelines behind the categories above:
- daemon-check-output (service lifecycle)
- ldd-check, pkgconf (library and build configuration)
- gem-check, ver-check, help-check, shell-syntax-check, header-check (language and tooling)
- tw toolkit: a composable family of pipelines you can assemble like building blocks
- jar-check, contains-files, metapackage, virtualpackage, emptypackage, no-docs (package structure)
- compiler/smoke, compiler/clang (toolchain validation)
- qackage (AI-assisted generation)

Every one of these was built because a contributor looked at a category of packages and decided it was too important to leave untested and too large to handle manually. That's the right instinct. Doing the hard work once (building reusable infrastructure) beats doing the easy work 7,000 times.
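
To make the shared library check concrete, here is a small Python sketch of the idea: run `ldd` on a binary and fail if any dependency comes back as "not found". It is an assumption-laden illustration, not the ldd-check pipeline, which may work quite differently. The function names `missing_libraries` and `check_binary` are hypothetical.

```python
import subprocess


def missing_libraries(ldd_output):
    """Return the library names that ldd reported as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        # Unresolved entries look like: "libfoo.so.1 => not found"
        name, sep, target = line.strip().partition(" => ")
        if sep and target.startswith("not found"):
            missing.append(name)
    return missing


def check_binary(path):
    """Run ldd on one binary and raise if any dependency is unresolved.

    Hypothetical helper; assumes an `ldd` executable is on PATH.
    """
    result = subprocess.run(["ldd", path], capture_output=True, text=True)
    missing = missing_libraries(result.stdout)
    if missing:
        raise RuntimeError(f"{path}: unresolved libraries: {', '.join(missing)}")
```

Keeping the parsing separate from the `ldd` invocation means the logic can be exercised against captured output without needing a broken binary on hand.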

An invitation

If you're running containers in production (and you are), the question isn't whether your base image has tests. The question is whether those tests are any good.

Every one of our Chainguard Containers is built on Chainguard OS. Every package in Chainguard OS now has a test that runs the actual software. When a package updates tonight, our CI will start the daemon, verify the library links, import the module, and either pass or catch the failure before it reaches you.

That's the world's most secure and most enterprise-ready rolling Linux distribution. We didn't just claim it. We tested it.

Explore our full directory of container images.