
This Shit is Hard: The complexities of fixing Python library security issues at scale

Wesley Wiedenmeier, Senior Software Engineer

We all know we should keep our dependencies up to date. We all know we should use automated tooling to ensure versions are bumped the moment a security fix is released for any of the hundreds of thousands of dependencies our applications use.

Unfortunately, we’ve all been in situations where updating a dependency just isn’t an option. And with the rise of malware attacks on package ecosystems, many companies are electing not to bump dependencies the moment new versions become available.

Python developers understand this pain exceptionally well. While Python has many strengths, backwards compatibility is not one of them. Sometimes Python developers are faced with some very unpleasant choices:

  1. Commit to a lengthy migration to a new version of a dependency

  2. Allow security vulnerabilities to go unpatched

  3. Try to backport patches themselves

Not updating right away means you’re vulnerable to known vulnerabilities. Updating as soon as new package versions are released increases your potential exposure to malware. Backporting patches yourself is time-consuming and complex, and few teams have enough extra engineering capacity to do so.

At Chainguard, our mission is to be the trusted source for open source. Although we encourage you to keep your dependencies up to date as suggested in choice one, we recognize that our Chainguard Libraries customers want to keep their applications secure even when upgrading isn’t an option. If you’re in the business of backporting patches yourself, this post may help you understand why it’s an infeasible long-term approach and why you should let my team and me handle it for you.

What patches should we use anyway?

Lots of people write patches for security fixes: upstream project maintainers, Linux distros, some random developer on a forum somewhere. If you Google <cve_id> patch, you will probably get quite a few results.

Unfortunately, not everyone on the internet is entirely trustworthy, so not every patch is of equal quality. People make mistakes, and some people deliberately want to mislead others. If you download and run code from an unknown source on the internet, you don’t know if you’re getting code from a highly competent and honest engineer, someone who doesn’t know how to write secure software, or a malicious actor motivated to gain access to your company and customer secrets.

We really, really don’t want to put our customers at risk by shipping a patch that either doesn’t fix a security issue or, even worse, introduces a new vulnerability. A patch on a random forum may have been written perfectly, but current tooling gives us no way to verify that.

At Chainguard, we take a cautious approach to selecting which patches we backport. We only accept patches that the legitimate upstream project developers have accepted into the main branch of the project’s repository.

Nobody knows a project’s ins and outs better than the original authors and maintainers. We rely on their deep knowledge of their project to properly vet the patches they allow into their codebase.

Generating patches is easy. Trusting them is hard.

It is really, really easy to prompt an AI to “take this patch file and backport it to this version of the codebase.” Anyone can type that into their preferred AI coding agent, go get coffee, and return to a message saying that the backport is complete. Unsurprisingly, the AI agent will be proud and confident in its work.

At Chainguard, we use AI. However, the real value of our work is establishing guardrails and testing that give us confidence that what we’re shipping is correct.

We have several important questions to answer when backporting a patch:

  1. Is backporting this patch even feasible?

  2. Will this cause a regression?

  3. Does this actually fix the vulnerability?

Regression testing is hard

Our regression testing process for CVE backports is conceptually simple. We check that the project has an extensive test suite before performing a backport, and we do not proceed unless we are confident the project has sufficient test coverage. We run tests before applying the backported patches to establish baseline success/failure/skip counts. We then apply our backported patches and re-run the tests. If any new tests fail, we don’t ship the patches. Although this is a simple concept, actually doing it is pretty hard.
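
The gate itself is easy to sketch; all the difficulty lives in reliably producing the per-test outcome data it consumes. A minimal, hypothetical version in Python (the outcome dicts stand in for real test-runner output):

```python
# Hypothetical sketch of the regression gate. The outcome dicts stand in
# for parsed test-runner results; producing them reliably is the hard part.

def compare_runs(baseline: dict, patched: dict) -> list[str]:
    """Return tests that passed at baseline but fail after patching."""
    return sorted(
        test for test, outcome in patched.items()
        if outcome == "failed" and baseline.get(test) == "passed"
    )

baseline = {"test_auth": "passed", "test_parse": "passed", "test_tls": "skipped"}
patched = {"test_auth": "passed", "test_parse": "failed", "test_tls": "skipped"}

# A non-empty result means the backport introduced a regression,
# so we would not ship the patch.
print(compare_runs(baseline, patched))  # ['test_parse']
```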

Before we can start running tests, we need to find a compatible set of runtime dependencies for the target version of a package we’re trying to backport a patch to. As anyone in the Python community knows, this isn’t always the easiest thing to do.

Once you take transitive dependencies into account, a single Python package may have anywhere from dozens to hundreds of runtime dependencies. Many packages have very open version ranges specified for their dependencies. Even the smartest of dependency resolvers can’t figure out a compatible set of dependencies if a package’s metadata doesn’t indicate exactly what versions of every dependency are compatible.

Understandably, many packages don’t set upper version limits on their requirements. As a package maintainer, it’s hard, if not impossible, to predict when the maintainers of one of your dependencies will make a change that breaks compatibility. You could rely on other maintainers following semver conventions, but not everyone does.

Unfortunately, this means that if you create a virtual environment and install a version of a package from five years ago, the chances are slim that the automatically resolved runtime dependencies are all compatible with the package you installed and with each other. Figuring out which set of runtime dependencies work together takes some serious archeology: you effectively have to step back in time to when the package was originally released.
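
A sketch of that archeology in Python, assuming you can look up the upload date of every release (PyPI’s JSON API exposes this): pin each dependency to the newest version that existed when the target package shipped. The package name and dates below are invented for illustration:

```python
from datetime import date

def pin_as_of(releases: dict[str, date], cutoff: date) -> str:
    """Newest version of a dependency uploaded on or before the cutoff."""
    eligible = {ver: day for ver, day in releases.items() if day <= cutoff}
    if not eligible:
        raise ValueError("no release predates the cutoff")
    return max(eligible, key=eligible.get)

# Invented upload dates for a hypothetical dependency "somelib":
somelib_releases = {
    "1.4.2": date(2019, 11, 5),
    "1.5.0": date(2020, 4, 18),
    "2.0.0": date(2022, 7, 1),
}

# Pin as of the target package's release date (say, mid-2020):
print(pin_as_of(somelib_releases, date(2020, 6, 1)))  # 1.5.0
```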

Once you finally have a compatible set of runtime dependencies, you have to figure out how to build the project and get the test suite started.

Python is a diverse software ecosystem with a wide variety of applications. Unsurprisingly, there are many different test tools and build processes out there.

Many of the most essential packages in the ecosystem are also the hardest to build. And often, the most requested packages for CVE remediation have C/C++/Rust dependencies and very involved build processes.

It takes hours for both AI agents and humans to get a clean run of a test suite for a particular version of a package. And you have to repeat that work for every individual version of a package you want to ship patches for, because nothing stays consistent over time.

Knowing when a CVE is fixed is hard

For every Python CVE we remediate, we first confirm that we have a real vulnerability by exploiting it in a controlled, reproducible environment. Then we try to exploit the original vulnerability in the same environment against our patched wheel to verify that our fix was successful. The only variable we change between test runs is which wheel we use: the original, unpatched one or our remediated one.
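
The A/B check itself reduces to a few lines; the stub below stands in for actually running an exploit script inside each environment (the names are hypothetical, not our real harness):

```python
# Hedged sketch of A/B exploit verification. In reality, run_exploit
# would execute the exploit script inside the named virtual environment
# and report whether the exploit succeeded.

def verify_fix(run_exploit, unpatched_env: str, patched_env: str) -> bool:
    """Confirmed only if the exploit works before the patch AND fails after."""
    return run_exploit(unpatched_env) and not run_exploit(patched_env)

# Stub: exploit succeeds against the unpatched wheel, fails against ours.
outcomes = {"venv-unpatched": True, "venv-patched": False}
print(verify_fix(outcomes.get, "venv-unpatched", "venv-patched"))  # True
```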

But how do we get the exploits?

It would be nice if every CVE report included a simple, easily reproducible proof-of-concept demonstrating how the vulnerability works. When original reproduction steps exist, we use them. Unfortunately, they rarely do.

Far more often, we have to understand the patch and generate exploits ourselves. When we initially started this CVE remediation effort, we were skeptical about how reliably we could generate working exploits. Surprisingly and also worryingly, AI agents are adept at generating exploits against known vulnerabilities.

We have had success using AI agents to generate exploits, ranging from HTTP authentication bypasses to timing side-channel attacks in cryptographic libraries.

We won’t go too deep into the specifics of how we’re generating exploits, but suffice to say that as AI agents get smarter and cheaper, everyone needs to get much better at keeping their environments free of known security vulnerabilities.

Using AI safely is also hard

We’re using AI agents for this work, but AI agents can easily be misled. How can we protect our AI systems against prompt-injection attacks?

The real value in our AI usage isn’t the prompts or context we’re using; it’s the guardrails we’re putting around the models. Our CVE remediation effort heavily relies on validation logic that’s implemented in conventional code. We rely on AI agents to figure out how to backport patches, but we use good, old-fashioned Python and Go code to validate the AI agents’ output.

The CVE remediation process is broken into several stages, each producing a YAML metadata file that follows a fixed schema. After each stage runs, we use a combination of code-based validation and human review to verify the stage's output before passing that data to the next stages of the remediation process.
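
A minimal sketch of what that code-based validation looks like, with hypothetical field names rather than our real schema:

```python
# Illustrative stage-output validation; the field names are hypothetical.
REQUIRED = {"cve_id": str, "upstream_repo": str, "commits": list}

def validate_stage_output(data: dict) -> list[str]:
    """Return schema violations; an empty list means the output passes."""
    errors = []
    for field, expected_type in REQUIRED.items():
        if field not in data:
            errors.append(f"missing field: {field}")
        elif not isinstance(data[field], expected_type):
            errors.append(f"wrong type for {field}")
    return errors

stage_output = {
    "cve_id": "CVE-2024-0001",
    "upstream_repo": "https://github.com/example/project",
    "commits": ["deadbeef"],
}
print(validate_stage_output(stage_output))  # []
```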

We mentioned earlier that we only trust patches that come from the real upstream repository for a package. But how do we enforce this?

Chainguard already has well-tested conventional code for resolving the upstream repository for a Python project. This code is part of our “wheel rebuilder,” which rebuilds hundreds of thousands of artifacts from verified sources for Chainguard Libraries for Python. We reuse this code to determine which repo to pull patches from for CVE remediation. When the rebuilder can’t find the source repository on its own, we rely on human-reviewed configuration files that specify the actual upstream repository for a project.

Once an AI agent claims it has found the real upstream patch for a CVE, automated validation code ensures that the repository where the patch was discovered matches the real upstream repository identified by the wheel rebuilder’s source repo discovery logic. The automated validation then checks if the commits discovered by the AI agent were merged to the main branch of this repository. This validation code does some additional sanity checks on the AI agent’s output. If any of these checks fail, the AI agent’s output is immediately discarded.
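
A simplified sketch of the repository-matching piece of that validation (the normalization here is illustrative and skips many real-world edge cases; commit ancestry is a separate check, e.g. via git merge-base --is-ancestor):

```python
from urllib.parse import urlparse

def normalize(repo_url: str) -> str:
    """Canonicalize a repository URL for comparison (illustrative only)."""
    parsed = urlparse(repo_url.rstrip("/"))
    path = parsed.path.removesuffix(".git").lower()
    return f"{parsed.netloc.lower()}{path}"

def repo_matches(agent_repo: str, rebuilder_repo: str) -> bool:
    """The repo the AI agent found must match the rebuilder's upstream."""
    return normalize(agent_repo) == normalize(rebuilder_repo)

print(repo_matches("https://github.com/Example/Proj.git",
                   "https://github.com/example/proj/"))  # True
print(repo_matches("https://gitlab.com/example/proj",
                   "https://github.com/example/proj"))   # False
```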

Human judgment matters

AI agents are incredible tools, but human judgment matters more. Sometimes, when backporting a patch, judgment calls are required to decide how to preserve the upstream patch's behavior while minimizing regression risk.

Every patch is reviewed by the engineer coordinating the backport process and by another engineer with deep knowledge of the Python ecosystem. We don’t just review the patch files themselves; we review everything generated during the backport process, including metadata on the upstream patch, metadata on how to run the upstream test suite, regression testing results, exploitability testing results, and notes on how the backport process itself was performed.

Patch files are hard to review

But how do we review patch files anyway?

Say you’ve finished backporting a patch. You now have a commit from a project’s repo and a patch file that applies to an older version of the package. How do you compare them?

This seems simple at first; you just need a diff showing the changes between the original CVE patch and the backported patch. Why not just run git format-patch -1 <commit_id> to turn the original commit into a patch, and then use standard diff to compare the original patch against your backported patch?

Have you ever taken two patch files that apply to different versions of a codebase and run a diff between them? The output is ugly and certainly isn’t suitable for human review.

Patch files work well because they contain a lot of information (line numbers, context lines, etc.) to help tools locate which exact lines to apply changes to. Unfortunately, all this information varies, sometimes significantly, across different versions of a codebase. As a result, running diff on an original patch and its backport produces output that is unreadable and useless.

How do we compare backported patches to the original patch? We built our own “diff of diffs” tool.

Going into this project, we were surprised that there didn’t really seem to be many options for open source “diff of diffs” tools. There are endless tools out there for manipulating diff and patch files. But we didn’t find a tool that seemed suitable for comparing the code changes introduced by a backported patch to those in the original upstream commit.

The "diff of diffs" tool is designed to compare a backported security patch with the original upstream CVE fix, accepting input as local file paths or GitHub commit URLs. To achieve an accurate comparison, it first normalizes the various patch formats and strips non-essential metadata, such as headers and email formatting.

Crucially, it replaces variable elements such as hunk headers (@@ -53,7 +54,12 @@ class FormData:) and git index lines (index 0303f547ec..2f702d6480 100644) with placeholders, which ensures the tool only compares the actual code modifications. For complex fixes involving multiple upstream commits, it automatically joins them. The resulting output highlights the difference between the code changes in the original fix and the backported patch. This makes it easy to spot any errors in the backporting process or to review the adaptations needed for compatibility with older versions of the library.
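
The core of that normalization can be sketched with a couple of regular expressions (a heavy simplification of the real tool):

```python
import re

# Replace hunk headers and git index lines with placeholders so that
# diffing two patch files compares only the actual code changes.
HUNK = re.compile(r"^@@ [^@]+ @@", re.MULTILINE)
INDEX = re.compile(r"^index [0-9a-f]+\.\.[0-9a-f]+( \d+)?$", re.MULTILINE)

def normalize_patch(patch: str) -> str:
    patch = HUNK.sub("@@ HUNK @@", patch)
    return INDEX.sub("index XXX", patch)

original = ("index 0303f547ec..2f702d6480 100644\n"
            "@@ -53,7 +54,12 @@ class FormData:\n+fixed()\n")
backport = ("index 1111111111..2222222222 100644\n"
            "@@ -48,7 +49,12 @@ class FormData:\n+fixed()\n")

# Identical code changes now compare equal despite differing line numbers.
print(normalize_patch(original) == normalize_patch(backport))  # True
```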

How are our remediated libraries used?

Security products don’t keep you safe if you don’t use them. In that light, usability matters. When we added CVE remediation to our Chainguard Libraries for Python product, we needed to find a way to make these patched libraries effortless for our customers to use.

Given our commitment, first and foremost, to preventing malware from entering our Python libraries, we needed to rebuild every patched library from source rather than offering a patch file.

This allows us to offer drop-in replacements for PyPI that automatically provide patched versions of libraries. That way, you can just point your repository manager or build system at our index, rebuild your application, and you’re protected.

We make this possible using local version numbers. Python package versions follow PEP-440, which allows a local version label to be appended to the main version number with a + character. When resolving dependencies, package managers consider a version such as 1.2.3+local.1 to satisfy the same version constraints as version 1.2.3. However, when both are available, the version carrying the local label sorts higher. Therefore, if you run pip install "package==1.2.3" and 1.2.3+local.1 is available in your index, Python build tools install the patched 1.2.3+local.1 version automatically.

Chainguard uses local version numbers in the format +cgr.N, where N increments as we patch more CVEs for a given package version. When you install a version of a package from our remediated index, it automatically receives the latest +cgr.N local version available, ensuring it's protected against as many CVEs as possible.
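
This behavior is easy to demonstrate with the packaging library, which implements PEP-440 version ordering and matching:

```python
from packaging.specifiers import SpecifierSet
from packaging.version import Version

base = Version("1.2.3")
patched = Version("1.2.3+cgr.1")

# A local version satisfies the same == constraint as its base version...
print(patched in SpecifierSet("==1.2.3"))  # True

# ...but sorts higher, so resolvers prefer it when both are available.
print(patched > base)  # True

# Successive remediations keep stacking: +cgr.2 beats +cgr.1.
print(Version("1.2.3+cgr.2") > patched)  # True
```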

This wouldn’t be possible without our automated approach to securely building Python wheels. In addition to CVE-remediated Python libraries, we ship tens of thousands of Python libraries built from source in a secure, SLSA L2-compliant environment. We ship signed PEP-740 attestations and provide SBOMs for our built wheels, following PEP-770, to help you understand and audit your software supply chain.

Our automation ensures you can trust the wheels you download and run from both our main package index and our Python-remediated index.

But what about scanners?

Our remediated libraries need to work well with scanners. When a customer installs +cgr.N remediated libraries from our index and scans their application for vulnerabilities, their scanner needs to understand that while package==1.2.3 may be vulnerable to a CVE, package==1.2.3+cgr.1 is not.

We make this possible by publishing a VEX feed which specifies which exact CVEs are remediated in each +cgr.N library release. The VEX feed is automatically populated with data from our CVE remediation infrastructure, and new entries appear after newly remediated packages are published to our index.
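
For illustration, an entry in a VEX feed using the OpenVEX format looks roughly like this (the values are invented, and the exact shape of our feed may differ):

```json
{
  "@context": "https://openvex.dev/ns/v0.2.0",
  "statements": [
    {
      "vulnerability": { "name": "CVE-2024-0001" },
      "products": [
        { "@id": "pkg:pypi/package@1.2.3+cgr.1" }
      ],
      "status": "fixed"
    }
  ]
}
```

A scanner that consumes this feed knows to mark CVE-2024-0001 as resolved for the +cgr.1 build even though the base 1.2.3 release is affected.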

We currently integrate with Trivy, Grype, Anchore Enterprise, and AWS Inspector, so if you use any of these scanners with our patched libraries, remediated CVEs will disappear from your vulnerability reports immediately. We plan on adding more scanner integrations this year.

Conclusion

The current state of finding a random patch on the internet, applying it to your library, and hoping for the best doesn’t work as Python ecosystem attacks continue to increase in quantity and scope. Does this mean that we can’t patch some CVEs? Absolutely. Someone may have authored a patch for a CVE that upstream maintainers haven’t yet addressed. A patch may be under review in the upstream project but not yet shipped. Or the original maintainers may have simply chosen not to fix a CVE.

We are constantly evolving, but our general philosophy is that we would rather ship fewer high-quality, highly trusted patches than more patches we have lower confidence in.

Building your own high-quality, trusted index of patched Python libraries for your production environment is hard, and virtually impossible to do well. This is a complex, high-stakes fight you don't need to take on.

If you’d like to learn more about Chainguard Libraries for Python, visit our landing page or reach out to our sales team.
