A common assumption made by developers and organizations wading into software supply chain security is -- doesn't my software composition analysis (SCA) tool take care of this problem by finding common vulnerabilities and exposures (CVEs)?
Scanners aren’t magic. And if we are really going to solve the software supply chain security problem, we need to stop treating them like they are.
This post explains how most vulnerability scanners for containers work, and highlights a few challenges this approach has that can lead to blind spots in your infrastructure.
Most scanners use package databases to see what packages are installed inside of containers. Software installed outside of these systems aren’t readily identifiable, making them invisible to scanners. Some scanners have extra heuristics to identify particular pieces of software, but it’s difficult to depend on these when you don’t know how they work.
For example, Snyk has special rules in place to notice Node.js itself inside images, because Node is typically installed manually rather than via a package manager. Scanners without this heuristic would miss it.
Running syft against the official node image shows that it can’t find the node package itself:
When scanners can’t find these packages, they also can’t find vulnerabilities inside of them. And worse, they can’t warn you about things they don’t know about.
How Did We Get Here?
The problem is complicated and no one is really at fault. Most Linux distributions are designed to distribute a single (or very small number) of stable versions of software for long periods of time, with only security fixes backported. Many modern applications choose to release faster than distributions would allow, encouraging their users to install them manually. Exclusively installing software via package managers would solve the inventory problem, but it would leave users without the ability to install many applications, or the specific versions they need.
This mismatch leaves end users in a bit of a bind, as the industry’s increased focus on supply chain security has made accurate asset inventory a critical element in an organization’s security posture. You might think SBOMs are designed to solve this problem, and you wouldn’t be wrong.
Unfortunately, most SBOM tools today work the same way as vulnerability scanners, and operate by scanning built software artifacts in order to generate the SBOM. There’s no guarantee these scanners will find everything, and therefore no guarantee that the SBOM results are complete.
Example: Phantom Wordpress
The best way to illustrate this problem is by example. Let’s start with scanning the official wordpress image on Dockerhub.
Most vulnerability scanners have a mode to output what packages they found in the image before querying for vulnerabilities, so we can use that to sanity test things. This isn’t meant to pick on any particular scanner - they all have similar behavior here, but we’ll use syft and grype here. Running syft against this image shows a massive package list (full output here):
We see a few things we’d expect in a wordpress image, like apache, apt, and openssl, but a few things are missing…
We don’t see wordpress itself here at all, or even PHP itself (the language wordpress is written in). Let’s sanity check to make sure we’re not missing anything, and this does in fact have wordpress and PHP installed. We can start up the container following the instructions and see that wordpress does indeed start up.
Viewing it in the browser works too! Loading localhost:8080 shows a wordpress setup screen:
What about PHP? We can find that too, and see v7.4.30 installed:
So Wordpress and PHP are installed here, but syft doesn’t find either of them. This isn’t just syft. We can double check with other scanners like snyk or trivy. The full output from these is here, but once again there’s no wordpress or PHP.
What’s happening here? How are these phantom packages installed in the images if the vulnerability scanners can’t find them, or report vulnerabilities against them? This isn’t a theoretical concern either! Grype reports over 500 potential CVEs in the wordpress image, many of which are false positives, without even finding PHP and Wordpress. This means the problem isn’t just noise, we’re also missing data and there can be false negatives.
To understand why this happens, we need to look at how these images are built and how wordpress and PHP are getting installed. To do that, we can look at the official Dockerfile. The section where wordpress gets installed is here:
Wordpress gets installed by downloading and extracting a gzipped tar file from the official wordpress site, and the checksum is verified as well. This is a great practice, and even matches the official wordpress installation instructions. But, it unfortunately means that the system package manager (apt on a debian image, apk on Alpine), doesn’t know about the installation.
If we want to see what this package manager does know about, we can open up that file and take a look at it. On Debian, this lives at /var/lib/dpkg/status, and entries look a bit like this:
On Alpine, these live at /lib/apk/db/installed, and entries look like:
This is the big revelation: operating system packages are installed and managed by a package manager, but Wordpress was just installed manually by unzipping a directory. Package managers aren’t terribly fancy, they just store some state on disk to help you remember what programs are installed and what versions they are, in case you want to uninstall or upgrade them later.
So What Next?
Today’s scanners were not built to fully address software supply chain security, and currently do not detect key components of the software they are scanning.
Reverse-engineering, SCA based scanners are fundamentally limited by the information left for them by build tools. It’s clear that we need to shift this problem further left into build tools themselves. At Chainguard, we believe that generating SBOMs at build time is the only way to be sure of their accuracy and completeness. In addition, we believe that instrumenting build tooling to provide accurate inventory is not just a compliance and security benefit - it can also be more developer friendly and productive!
We’re working on a full suite of tooling at Chainguard to do this. We designed melange to make it trivial to convert any piece of software into a complete package with the required metadata to produce SBOMs and accurate vulnerability findings, and apko to assemble these packages into runnable images. More specialized language-specific tooling like jib (for Java) and ko (for Go) have also shown the productivity advantages of dependency-aware build tools.