TL;DR: SBOMs could be substantially more useful if the National Vulnerability Database adopted the purl naming scheme widely. To measure how much purl usage could reduce scanner false positives, this study analyzed real-world vulnerability scanner false positives and found that purl information could have eliminated over fifty percent of them.
A Software Bill of Materials (or SBOM), an ingredient list of software components, is not meant to be a digital paperweight, nice-looking but of little use. SBOMs are meant to be utilitarian, consumed in the pursuit of goals such as identifying known vulnerabilities in open source software components. Underappreciated is that using an SBOM to accomplish a goal like identifying vulnerabilities means combining SBOMs with other data sources, especially software vulnerability data. Consequently, shortcomings in software vulnerability data can reduce the usefulness of SBOMs.
A central deficit of vulnerability data, especially the data in the National Vulnerability Database (NVD), arguably the most important software vulnerability database, stems from the so-called “naming problem”: software currently lacks a universal naming scheme for reliably identifying open source components or products. This naming problem hampers SBOM adoption because automating vulnerability identification via an SBOM is currently difficult and error-prone, producing false positives in which one component is mis-identified as another.
To remedy the naming problem’s effects on SBOM usage, a group that calls itself the SBOM Forum has proposed that vulnerabilities in the NVD be identified by package URL (purl). This naming scheme (explained more later), which makes open source software package ecosystem data a first-class conceptual citizen, offers the promise of more precise software component identification and fewer false positives when using software component analysis tools for vulnerability identification.
To measure the extent to which using the purl naming scheme in the NVD could reduce false positives associated with vulnerability scanning and to help the NVD and groups such as the NIST Software Identification Tagging project make an informed decision about this proposal, this piece conducts two preliminary analyses.
In short, adding purl information to vulnerability data would substantially reduce scanner false positives and help make SBOMs more useful.
The Naming Problem and the Promise of purl
This naming problem, the lack of a universal namespace for open source software components or products, has practical implications. For an SBOM to help identify the potential vulnerabilities in a piece of software, the SBOM must list components, including each component’s name, and those components must be matched against vulnerability records. Imagine an SBOM includes the real npm package “delegate,” and imagine checking whether a vulnerability database, say the NVD, contains any vulnerabilities associated with “delegate.” The naming problem now begins. The database does include a piece of software called “delegate,” and that “delegate” is associated with a vulnerability, but there is no reliable way to tell whether it is the npm package. This uncertainty stems from the naming problem and specifically from the NVD’s use of the so-called CPE naming scheme, a system better suited to naming software products than open source software components. Should a vulnerability scanner report this finding and risk a false positive? Or suppress it and risk a false negative? Welcome to the naming problem, which arises because the CPE naming scheme poorly captures the modern idea of an “ecosystem” such as npm. For readers interested in a more thorough explanation of the naming problem and CPEs in particular, the SBOM Forum’s critique is the most thorough current resource.
purl, or a package URL, offers a solution to the problem described earlier. By creating a naming scheme that embeds the open source package ecosystem information, purl allows a vulnerability database to accurately describe open source components. This approach enables accurate automation that avoids the problems mentioned above.
For instance, the npm package “delegate” can be described, in a purl, as “pkg:npm/delegate.” If the vulnerability database also used the purl naming scheme, then the other “delegate” software would have a different purl (since it is not an npm package), so there would be no match and no tradeoff between avoiding false positives and avoiding false negatives.
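The contrast between name-only matching and purl matching can be sketched in a few lines of Python. This is an illustrative toy, not the full purl specification (real purls can also carry a namespace, version, qualifiers, and subpath), and the vulnerability record below is hypothetical:

```python
def parse_purl(purl: str):
    """Split a minimal purl like 'pkg:npm/delegate' into (type, name).

    Only handles the simple 'pkg:<type>/<name>' form for illustration.
    """
    scheme, rest = purl.split(":", 1)
    assert scheme == "pkg"
    pkg_type, name = rest.split("/", 1)
    return pkg_type, name

# Hypothetical vulnerability record for the *other* "delegate":
vulns = [
    {"id": "CVE-XXXX-0001", "name": "delegate", "purl": "pkg:generic/delegate"},
]

sbom_component = "pkg:npm/delegate"  # the real npm package from the SBOM

# Name-only matching (roughly what CPE-based matching degrades to):
name_matches = [v for v in vulns
                if v["name"] == parse_purl(sbom_component)[1]]

# purl matching also compares the ecosystem (the purl "type"):
purl_matches = [v for v in vulns
                if parse_purl(v["purl"]) == parse_purl(sbom_component)]

print(len(name_matches))  # 1 -- a potential false positive
print(len(purl_matches))  # 0 -- ecosystem mismatch filters it out
```

With only the name, the scanner must either report a dubious match or drop it; with the ecosystem encoded in the identifier itself, the mismatch is unambiguous.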
But a skeptic might still ask: to what extent could the false positive problem be reduced if NVD adopted purl and added a purl to all vulnerabilities? The preliminary analysis below suggests two approaches and an initial rough estimate of an answer.
Approach #1: What Percentage of False Positives Associated with Chainguard’s Images Could be Avoided with purl Use?
Chainguard currently offers container images designed to have few or no known vulnerabilities. Software organizations can therefore avoid the security problems of having many unpatched vulnerabilities and avoid the headache of triaging scanner false positives. But vulnerability scanners do sometimes report potential vulnerabilities, including false positives, against Chainguard Images, providing a well-bounded dataset of scanner false positives to examine.
To investigate the nature of these false positives, Chainguard Labs analyzed all confirmed false positive vulnerability scanner results (from Grype) associated with Chainguard’s Wolfi-based images produced in December 2022 and January 2023. (Wolfi is a community Linux OS designed to address software supply chain security from the ground up.) This resulted in forty-four total false positives. The dataset is here.
The analysis then identified those cases in which the false positive arose because a package was confused with a package in another ecosystem. Manual analysis revealed that this happened in twenty-eight of the forty-four cases. In other words, in slightly over sixty percent of the identified cases, adding purl information, which correctly encodes ecosystem information, to the vulnerability data would have eliminated the false positive.
Of course, this is only one particular analysis of one set of container images by one vulnerability scanner. This limitation motivated a second analysis derived from a broader data source.
Approach #2: What Percentage of Grype’s Publicly Reported False Positives Could be Avoided with purl Use?
Grype, the open source vulnerability scanner, conveniently tags all GitHub issues related to false positives with a “false positive” label. As of January 2023, there were over eighty Grype GitHub issues, open or closed, carrying this label. These issues can be mined to understand the cause of each false positive and to estimate how many false positives could be avoided if vulnerability data (and SBOM data) used purl information. The dataset is here.
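The mining step itself is mechanical: GitHub’s REST API returns each issue with a list of label objects, so tagged issues can be filtered by label name. The sketch below runs the filter over a small in-memory sample shaped like API responses; the sample issues are hypothetical, not actual Grype issues:

```python
def has_label(issue: dict, label_name: str) -> bool:
    """Check whether a GitHub-API-shaped issue dict carries a given label."""
    return any(lbl["name"] == label_name for lbl in issue.get("labels", []))

# Hypothetical issues shaped like GitHub REST API payloads:
issues = [
    {"number": 101, "state": "closed",
     "labels": [{"name": "false positive"}, {"name": "bug"}]},
    {"number": 102, "state": "open",
     "labels": [{"name": "enhancement"}]},
    {"number": 103, "state": "open",
     "labels": [{"name": "false positive"}]},
]

false_positive_issues = [i for i in issues if has_label(i, "false positive")]
print([i["number"] for i in false_positive_issues])  # [101, 103]
```

In practice the issue list would be fetched from the repository’s issues endpoint (which supports filtering by label) rather than hard-coded.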
The analysis first identified those false-positive GitHub issues with sufficient information provided, which reduced the issue count to seventy-five. The main vulnerability associated with each GitHub issue was then analyzed to determine whether the false positive was due to Grype mis-identifying the ecosystem of a given package. Thirty-nine of the seventy-five false positives, according to a manual analysis, were due to a mis-identification of the ecosystem. In other words, approximately fifty percent of these false positives would have been avoided if purl information had been present in the vulnerability data.
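The two headline percentages follow directly from the counts reported above:

```python
# Approach #1: Chainguard Wolfi-based image false positives.
chainguard_fp_total = 44      # confirmed false positives
chainguard_fp_ecosystem = 28  # caused by ecosystem mis-identification

# Approach #2: Grype GitHub issues tagged "false positive".
grype_fp_total = 75           # issues with sufficient information
grype_fp_ecosystem = 39       # caused by ecosystem mis-identification

print(round(100 * chainguard_fp_ecosystem / chainguard_fp_total, 1))  # 63.6
print(round(100 * grype_fp_ecosystem / grype_fp_total, 1))            # 52.0
```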
Analytical Limits and purl Reform
First, these analyses are limited in that they pertain to only one vulnerability scanner, Grype.
Second, and more importantly, some advocates of the current naming system might criticize the purl approach on the grounds that the current system (CPE) does in fact have a way to store the language ecosystem via a “target software” data field. But advocates would have to admit that it is used, at best, inconsistently. (Further empirical analysis would be welcome here.) Part of the underlying problem is arguably that there is insufficient validation of data input during the CVE creation process: some parties that submit CVEs do not add the ecosystem information, worsening the naming problem. While the use of a purl will not change this underlying dynamic, it will, however, provide a stricter definition of valid values, and this strictness could be used to improve NVD data quality in tandem with an improved data validation process.
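The structural difference is worth making concrete. A CPE 2.3 formatted string has thirteen colon-separated components, with “target software” as the eleventh, and submitters routinely leave it as the wildcard “*”; in a purl, the ecosystem (the purl “type”) is a mandatory, position-fixed part of the identifier. The CPE string below is illustrative, not an actual NVD entry:

```python
# CPE 2.3: cpe:2.3:part:vendor:product:version:update:edition:language:
#          sw_edition:target_sw:target_hw:other
cpe = "cpe:2.3:a:delegate_project:delegate:1.0.0:*:*:*:*:node.js:*:*"
fields = cpe.split(":")
target_sw = fields[10]  # "target software" -- optional, often just "*"
print(target_sw)  # node.js

# In a purl, the ecosystem cannot be omitted: it is the segment right
# after the "pkg:" scheme.
purl = "pkg:npm/delegate@1.0.0"
ecosystem = purl.split(":", 1)[1].split("/", 1)[0]
print(ecosystem)  # npm
```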
The utility of an SBOM does not depend only on the data inside an SBOM. For identifying vulnerabilities, SBOMs particularly depend on vulnerability data and, unfortunately, an important source of vulnerability data, the NVD, uses a naming scheme for open source components that leads to component mis-identification. The analysis above suggests that a package URL, or purl, could reduce this mis-identification and specifically reduce the rate of false positives for vulnerability scanners.
What should you do? If you work at the NVD, consider funding an effort to add purl information to at least all new vulnerabilities, if not all vulnerabilities. If you are a researcher, consider performing a larger, broader analysis to corroborate, qualify or refute these findings. In short, there is much to be done to improve the quality and usefulness of SBOMs and, fortunately, this analysis suggests that there are promising potential solutions.