This is about CVE-2007-4559, a CVE in Python's tarfile module. You might have seen it popping back up in your security scanners recently, which is strange for a 15-year-old bug. This post will explore the history of the vulnerability, why it's in your scanner today, why it (probably) shouldn't be, and how we decide how to handle these classes of vulnerabilities at Chainguard.
Before we start, let's first understand what CVEs (Common Vulnerabilities and Exposures) are. They form a list of entries, each containing an identification number, a description, and at least one public reference, for publicly known cybersecurity vulnerabilities. However, not all CVEs are created equal, and many are debated or contested because of the process by which they're filed.
The National Vulnerability Database (NVD) is "world writable." That means almost anyone can file a CVE against almost any piece of software (with a few exceptions). The authors of the software are not consulted in this process, which leads to plenty of disagreement later on. In open-source projects, maintainers often don't have the time to properly contest CVE entries, so the database accumulates "invalid" entries that clog up scanners and vulnerability response teams.
This specific CVE
The specific vulnerability in question relates to Python's tarfile module, which can be used to read and write tar files. There's a function called `tarfile.extractall`, which does exactly what it sounds like - it extracts the contents of a tar file! This extraction can be surprising - tar files can contain lots of things, including file paths that are relative and might even include `../../` path traversals. These are all "features" of the tar format, even though they can result in security vulnerabilities when used incorrectly.
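To make the traversal concrete, here's a minimal sketch that builds an in-memory tar containing a `../../` member and inspects it before extraction. The helper names (`is_within`, `unsafe_members`) and the member path are hypothetical, chosen for illustration:

```python
import io
import os
import tarfile

def is_within(directory: str, target: str) -> bool:
    # Resolve both paths and make sure target stays under directory.
    directory = os.path.abspath(directory)
    target = os.path.abspath(target)
    return os.path.commonpath([directory, target]) == directory

def unsafe_members(tar: tarfile.TarFile, dest: str) -> list:
    """Names of members that would be written outside dest."""
    return [
        m.name for m in tar.getmembers()
        if not is_within(dest, os.path.join(dest, m.name))
    ]

# Build a tar with a "../" traversal member to demonstrate.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    payload = b"pwned"
    member = tarfile.TarInfo(name="../../escape.txt")
    member.size = len(payload)
    tar.addfile(member, io.BytesIO(payload))
buf.seek(0)

with tarfile.open(fileobj=buf) as tar:
    print(unsafe_members(tar, "extracted"))  # ['../../escape.txt']
```

Manually inspecting member names like this was the traditional mitigation: callers had to remember to do it before every extraction, which is exactly the kind of sharp edge the debate below is about.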
The tarfile package in Python prominently warns users not to use this function on untrusted inputs, and its documentation explains the exact issue.
So, is this a vulnerability? The maintainers argued no. The logic was that while this function could be used unsafely, users were warned to not use it in such a manner. Furthermore, the behavior is technically correct, and could be the desired behavior.
The counterargument is that safe defaults are important, and that if a function can be used unsafely, it eventually will be. Maintainers bear a responsibility for making their APIs and projects easy to use safely, and it should be considered a security vulnerability when they're not.
Vulnerabilities are hard!
Why are we talking about this?
The maintainers decided that even though they don’t consider this a security vulnerability, it’s still worth trying to improve the defaults and ergonomics to make them safer. This matches the behavior of Go, which also introduced some changes to the tar standard library functions that will eventually make them safer by default (we opted to enable these by default earlier than Go).
This change in status from "Won't Fix" to "let's at least try to partially fix it and improve the defaults" triggered us to reevaluate our initial decision, and caused many security scanners to start reporting on the vulnerability again. That's right - scanners can selectively hide or show vulnerabilities, and they do so often. Remember those noisy invalid CVEs we talked about earlier? There are so many of them that most vulnerability scanners implement logic to hide (or at least display differently) CVEs that the maintainers have decided not to fix. There's no standard way to report this, so many scanners assume that something marked "Won't Fix" is invalid. This issue got reopened, so it magically went from hidden in scanners to Scary and Vulnerable again!
The proposed fix introduces a new parameter that lets users mark tar files as trusted or untrusted, and then selectively filter out contents rather than extracting everything. This is super handy, but the documentation still states that the function should not be used on untrusted inputs, and the new parameter doesn't fix everything. The default is also left where it was, so while the API can now be called in a slightly safer manner, it's still unsafe by default (though this may change in future releases).
So what now?
As a distributor of Python (and many other open source projects), we have the responsibility to triage CVEs, fix them, or clearly mark them as invalid with an explanation of why. Most CVEs are useful and have straightforward fixes. This one is not.
In these cases, we have a basic process to help guide our decisions. In projects with responsible (and responsive) security teams, we tend to trust their judgment and match their actions. That’s why we previously marked this as "Won’t Fix" and unimportant, matching many other Linux distributions. Now that the issue has been reopened, we adjusted our scanner feed appropriately.
But our work doesn't stop there; we still have to decide what to do with it. We have a few choices in this case:
In this case, we're going to do a combination of 1 and 2. Given that there is no clear "fix" and the code can now be called safely, we'll update our scanner feed to mark it as fixed as of the version that contains the new APIs.
There’s also been some discussion around changing the defaults of the function to opt-in to the safer behavior, even before the upstream Python project does that. We’re going to continue to monitor that conversation and roll out the fix early if there’s consensus that it’s possible to do so safely. If the function were intended for use with unsafe inputs, we would feel much differently.
Our decision making process
In general, security bugs that arise from using code in ways that fall outside the code's defined threat model must pass a very high bar to be considered a CVE. Designing code to process untrusted input is very difficult to get right, and the larger the surface area, the harder it is. Applications can still clearly have bugs that need to be fixed, but it's infeasible to track every piece of code that could be misused as a vulnerability. Fortunately, the NVD has another mechanism for issues like this - Common Weakness Enumerations (CWEs), which can be used to refer to and warn about commonly misused APIs or functions.
To be a bit extreme, this is sort of like filing a CVE against the C language for not protecting against buffer overflows, or against a terminal that allows someone to run "sudo rm -rf /" if you leave your laptop unlocked. While security can always be improved, CVEs are a mechanism for communicating important data between stakeholders, and the threat model of the program in question must be considered when filing them so that this communication channel remains high in signal and low in noise.
Precedent and other examples
When we make borderline calls like this, I like to ask what factors would have made the decision harder or easier. Fortunately, the industry is full of other CVE entries that we can use for inspiration and examples.
For an example where untrusted input did lead to a vulnerability, we can look at the security fixes in Go 1.20.5. The Go toolchain is designed to run on untrusted inputs, and tries very hard not to allow arbitrary code execution during installation and compilation. When they slip up on that, they report CVEs and fix the issues. Meanwhile, many other programming languages (like Python) *do* allow arbitrary code execution during package installation.
This is definitely something to consider when designing systems that use these languages, and we'd consider it a huge improvement to disallow arbitrary code execution during package installation, but we agree that this is not necessarily a security vulnerability, because the systems are designed and clearly documented to operate this way. Put another way, just because curl *can* be piped to bash doesn't make that a vulnerability in curl.
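To illustrate the Python side of that contrast, here's a hypothetical sketch (not how pip works internally): a `setup.py` file is an ordinary Python program, and installers execute it, so any top-level statement runs at install time. We simulate the invocation directly with the current interpreter:

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# A minimal, hypothetical setup.py: any top-level statement executes
# whenever packaging tooling runs the file.
SETUP_PY = 'print("arbitrary code ran at install time")\n'

with tempfile.TemporaryDirectory() as d:
    Path(d, "setup.py").write_text(SETUP_PY)
    # Installers invoke setup.py as a script; run it the same way here.
    result = subprocess.run(
        [sys.executable, "setup.py"],
        cwd=d, capture_output=True, text=True,
    )
    print(result.stdout)
```

Nothing here exploits a bug: executing the file is the documented installation mechanism, which is precisely why it's a design property rather than a CVE.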
A counterexample is CVE-2023-31975 in yasm, which received a CVSS score of 9.8 and triggered a lot of debate. Buffer overflows are always bugs, but they're only security vulnerabilities if they can be exploited. A program that is only designed (and clearly documented) to run on trusted input can't be exploited, even though it can crash. Applications can surely be constructed that run such a program in a manner it wasn't intended for, but the vulnerability belongs in those applications rather than in the program they call.
Programs that are clearly documented as safe for untrusted input (the Go toolchain) should try to meet those obligations. Programs that no reasonable person would run on untrusted input (a shell, yasm) might have bugs, but should not receive CVE entries for vulnerabilities that are only exploitable when they're misused.
You could definitely argue that a reasonable person might assume the tarfile module is safe for untrusted input: tar files seem like innocuous collections of files, and users commonly upload or download them to transfer data. That would be an unsafe assumption, though, and the tarfile module in Python clearly documents this limitation, making it hard to justify a CVE entry.
Tar files are inherently unsafe. While there are ways to improve their handling, users must still know and acknowledge that dealing with them is risky, especially when they come from untrusted sources. The changes to Python's tarfile module don't change this inherent factor, so the module will remain something to be used with care.
We applaud the Python community for improving the security and ergonomics of their APIs, and we agree with their evaluation that this should be improved even though they don’t consider it a vulnerability. As always, we’ll review these case by case, transparently, and document our decisions. We publish these in our advisory database for scanners to consume directly. If you’d like to take advantage of the work we do here, you can! Simply use our Wolfi (un)distribution or our Chainguard Images product and benefit from our patching, triage, and advisory feed.