The universe and software might be more similar than you imagine. Roughly 27 percent of the universe is dark matter, invisible matter that we can’t observe directly with telescopes or radio instruments. Much like cosmological dark matter, software dark matter comprises packages that exist but are effectively unseen: software that is untracked by typical tools like a package manager or a software bill of materials (SBOM). According to our own estimates, based on several hundred popular open source software containers, software dark matter makes up 32 percent of the files in the average analyzed container.
And just as cosmological dark matter poses problems for understanding the universe, software dark matter complicates the job of anyone seeking software transparency, the elusive goal currently championed by SBOM enthusiasts who want complete and accurate knowledge about software to be the norm. Unfortunately, software dark matter has more tangible effects on software users than its physical equivalent: the more software dark matter a container holds, the harder it is for software analysis tools to find and correctly identify that software. And when software analysis tools can’t correctly identify software, there’s a greater chance that scanning tools will miss vulnerabilities that are present, undermining one of the central goals of software transparency.
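We performed an analysis to quantify the percentage of files within 350 popular open source software containers that are software dark matter. The analysis used darkfiles, a tool that we wrote and open-sourced for measuring software dark matter. The findings, detailed below, include: approximately 30 percent of the images have less than one percent software dark matter; treating each container equally, the mean software dark matter percentage is 32 percent and the median is 10 percent; and when containers are weighted by their number of files, the mean rises to 63 percent.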
After further defining software dark matter and presenting and analyzing a software dark matter dataset, this piece calls for reducing software dark matter to enable software transparency.
What Is Software Dark Matter?
Software dark matter refers to files that are not tracked by operating system (OS) package managers (like `apt` or `apk`). Because they are untracked, these files and the packages they represent are invisible, or at least complicated to find, for software composition analysis and security scanning tools. Tools like darkfiles can therefore be used to perform a straightforward calculation: what percentage of files are tracked by the underlying OS package manager, with the remainder counted as software dark matter.
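The calculation described above can be sketched in a few lines. This is a minimal illustration rather than the darkfiles implementation: the function name `dark_matter_percent` and the sample file paths are hypothetical, and the sketch assumes the list of files in the image and the list of files claimed by the package manager have already been collected by some other means.

```python
def dark_matter_percent(all_files, tracked_files):
    """Return the percentage of files NOT claimed by the OS package manager.

    all_files:     every file path found in the image (assumed input)
    tracked_files: paths the package manager database lists (assumed input)
    """
    all_set = set(all_files)
    if not all_set:
        return 0.0  # an empty image has no dark matter by convention here
    dark = all_set - set(tracked_files)
    return 100.0 * len(dark) / len(all_set)


# Hypothetical image: two files installed by packages, two copied in by hand.
image_files = ["/bin/sh", "/usr/bin/curl", "/opt/app/server", "/opt/app/lib/helper.so"]
package_db_files = ["/bin/sh", "/usr/bin/curl"]

print(dark_matter_percent(image_files, package_db_files))
```

In this toy image, half of the files are untracked, so half of the image is software dark matter.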
Why Does Software Dark Matter Matter?
Software dark matter makes the job of software analysis tools harder, both conceptually and technically. This matters most because, when software analysis tools fail to find and correctly identify software components, security scanning tools become more likely to miss known vulnerabilities. The same blind spot also gives attackers room to slip in malicious, unwanted software.
It’s like trying to find and identify goods that were loaded onto a container ship but never entered in the shipping manifest: these goods are likely to be overlooked, treated as second-class cargo, and forgotten. Of course, there are technical tricks that scanners and other tools can use to find this dark matter, but it’s a complicated endeavor compared with checking a package manifest. The implication is that SBOMs and other means of representing dependency information will likely be incomplete or wrong in a world of pervasive software dark matter, which raises the question…
How Much Software Dark Matter Is in Popular Open Source Containers?
Before advocates of software transparency declare war on software dark matter, it’s worth understanding how common software dark matter is. A reasonable starting point, though not the final word, is an assessment of the most popular open source containers on Docker Hub. These containers are commonly used and very likely underpin a wide set of important software applications. This analysis therefore selected 350 containers from among the 1,000 most popular container images (script for collecting popular images). These 350 images had either an Alpine-based or Debian-based operating system, a requirement imposed by the current implementation of the darkfiles tool (script for identifying OS for containers). All images were then analyzed with darkfiles.
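As an illustration of the OS filtering step, the sketch below classifies an image by the contents of its `/etc/os-release` file, using the standard `ID` and `ID_LIKE` fields. It is not the script referenced above, which is not reproduced here. The function names are hypothetical, and treating derivatives that declare `ID_LIKE=debian` (such as Ubuntu) as Debian-based is an assumption, not something the analysis confirms.

```python
def parse_os_release(text):
    """Parse KEY=value lines from an os-release file into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Values may be wrapped in single or double quotes.
        fields[key] = value.strip().strip("\"'")
    return fields


def os_family(text):
    """Return 'alpine', 'debian', or None for anything else (assumed rule)."""
    fields = parse_os_release(text)
    ids = [fields.get("ID", "")] + fields.get("ID_LIKE", "").split()
    if "alpine" in ids:
        return "alpine"
    if "debian" in ids:
        return "debian"
    return None


# Hypothetical os-release contents extracted from three images.
print(os_family('NAME="Alpine Linux"\nID=alpine\n'))
print(os_family('NAME="Ubuntu"\nID=ubuntu\nID_LIKE=debian\n'))
print(os_family('NAME="Fedora Linux"\nID=fedora\n'))
```

Images for which the function returns `None` would fall outside the sample under this assumed rule.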
Figure 1 shows the distribution of software dark matter in this sample of popular container images. The horizontal axis represents the percentage of software dark matter in an image, and the vertical axis represents the percentage of images in the sample with that amount of software dark matter.
The software dark matter graph reveals that approximately 30 percent of the images in this popular Docker Hub image sample have less than one percent software dark matter. Some maintainers are therefore already building images with little to no software dark matter, although the practice appears to be far from widespread.
While there is a concentration of containers with high (90 percent or more) software dark matter, the rest of the distribution is relatively even, spanning a wide range of software dark matter percentages. Treating each container equally, the mean software dark matter percentage is 32 percent and the median is 10 percent. If, however, containers are weighted by the number of files they contain, the mean software dark matter percentage is 63 percent.
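The gap between the per-container mean and the file-weighted mean can be illustrated with a small sketch. The three (total files, dark files) pairs below are invented for illustration and are not drawn from the dataset; the function names are likewise hypothetical. The point is only that one large image with a high share of dark matter pulls the file-weighted mean well above the per-container mean.

```python
from statistics import mean, median

# Hypothetical images as (total_files, dark_files); not real measurements.
images = [(100, 1), (200, 20), (10000, 9000)]


def per_image_percentages(images):
    """Dark matter percentage of each image, every image counted equally."""
    return [100.0 * dark / total for total, dark in images]


def file_weighted_mean(images):
    """Mean weighted by file count, i.e. dark files over all files pooled."""
    total_files = sum(total for total, _ in images)
    dark_files = sum(dark for _, dark in images)
    return 100.0 * dark_files / total_files


percentages = per_image_percentages(images)
print("per-container mean:", mean(percentages))
print("per-container median:", median(percentages))
print("file-weighted mean:", file_weighted_mean(images))
```

Weighting by file count is equivalent to pooling every file from every image and asking what share of that pool is untracked, which is why large images dominate the result.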
It bears mentioning that many aspects of software dark matter are still unexplored, including what explains the prevalence of this phenomenon and whether these findings hold across programming language and package manager ecosystems. On a more technical note, our analysis didn’t consider non-system package managers like pip, so we don’t know what fraction of this dark matter those tools track.
Less Software Dark Matter → More Software Transparency
Software transparency is rightly en vogue. Companies and individuals alike have experienced the downsides of depending on inscrutable software and want reliable means to detect known vulnerable software components and avoid tampering. SBOM advocacy epitomizes this demand for software transparency. And while advocates of software transparency acknowledge a wide range of challenges, less appreciated is that software dark matter, whether in containers or elsewhere, will pose a challenge for software transparency.
Fortunately, software dark matter need not be a permanent obstacle, though we’ll leave approaches to reducing or coping with software dark matter to another day. In short, SBOM advocates, and anyone else committed to software transparency, already have a lot on their plate, but unfortunately it’s time to add yet another challenge: software dark matter.