Mitigating Malware in the Python Ecosystem with Chainguard Libraries
Malware attacks in the Python ecosystem are growing in severity and frequency, with many tainted packages appearing on the Python Package Index (PyPI) each year. To mitigate malware in open source libraries, Chainguard recently announced Chainguard Libraries for Python, a malware-resistant index of packages in which each library and its full dependency tree is built from source inside our hardened infrastructure.
To test our malware-mitigation thesis, we analyzed ~3,025 malicious Python packages sourced from the Backstabber’s Knife Collection, which has grown considerably since its initial release and is now considered one of the most well-vetted datasets of known malicious open source libraries.
Our early results find that ~98% of the malicious libraries would have been avoided by users relying on Chainguard Libraries as their sole source for Python dependencies.
The main reason is that the large majority of these malicious libraries (~92%) have no attributable source code URL, meaning there is no public repository that we could treat as the true source code. Instead, these packages carry their malicious code only in the distributions uploaded to PyPI, such as the source distribution (sdist), and understandably do not make that code available anywhere else. Since Chainguard goes straight to the raw Python source code in a public repository such as GitHub, and no such public source code exists for this malware, the Chainguard Factory would not have built or distributed these libraries. Even if we set aside the malicious packages that lack source code, our analysis shows that Chainguard Libraries would have mitigated 75% of the remaining malware.
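The headline percentages fit together with simple arithmetic. The sketch below uses only the rounded figures quoted in this post, including the ~6% artifact-mismatch share reported in the methodology section. The variable names are ours for illustration and are not part of any Chainguard tooling.

```python
# Rounded shares of the malicious dataset, as quoted in this post (in percent).
no_source_pct = 92          # no attributable source code URL
artifact_mismatch_pct = 6   # source exists, but the published artifact does not match it
malicious_source_pct = 2    # the source code itself is malicious

# The three buckets should account for the whole dataset.
assert no_source_pct + artifact_mismatch_pct + malicious_source_pct == 100

# Overall share that would have been avoided: packages never built,
# plus packages rebuilt cleanly from the public source.
overall_mitigated_pct = no_source_pct + artifact_mismatch_pct

# Share mitigated among only those packages that DO have source code.
with_source_pct = 100 - no_source_pct
mitigated_given_source_pct = 100 * artifact_mismatch_pct / with_source_pct

print(f"Overall mitigated: ~{overall_mitigated_pct}%")
print(f"Mitigated among packages with source: ~{mitigated_given_source_pct:.0f}%")
```

Because the inputs are rounded, the outputs are approximate in the same way the figures in the text are.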
In the rest of this post, we’ll walk through our research approach and methodology, highlight the key findings, and close with practical takeaways for security leaders.
Methodology and Results
For this analysis, Chainguard used the ~3,025 known malicious Python packages from the Backstabber’s Knife Collection, a dataset of known open source package malware. We then created a simple decision tree to determine whether or not Chainguard Libraries would have mitigated each individual malicious dependency. A critical assumption we make in this analysis is that Chainguard Libraries is treated as the sole source for language dependencies, with no fallback to PyPI, where many of these malicious packages originated.
Below is the full decision tree, with three key stages and results:
Valid Source Code Available: Determine whether the malicious package in question was uploaded to PyPI with attributable source code URLs. Since Chainguard Libraries builds libraries, and their associated dependency tree, entirely from source, this is a fundamental requirement for the package to be available in our index. If a library does not have source code available in a public repository, we do not build and supply it. For our dataset, Chainguard would mitigate ~92% of these malicious packages because they lacked source code – we would not have built or distributed these packages.
Artifact Matches Source Code: Determine whether the distributed library was faithfully reproduced from the source code. If the distributed artifact does not match the source code, it has been tampered with, either at package build (e.g., a compromised build pipeline or system) or at distribution (e.g., a distribution point compromised via leaked tokens). Since Chainguard builds libraries in our own infrastructure and serves as the distribution chokepoint, our artifacts are built directly from the public source, which mitigates these kinds of malware attacks. In our dataset, Chainguard would have mitigated an additional ~6% of malicious packages by building an untampered artifact from the public source code. So even if we exclude packages without source code, Chainguard would have mitigated 75% of the malicious libraries that do have source code: ~8% of the original dataset had attributable source code, and in ~6 of those 8 percentage points the published artifact did not match it.
Is the Source Malicious?: Determine whether the source code itself is malicious. Today, because we build straight from source, Chainguard Libraries does not stop malware whose source code is itself malicious unless it is known and reported. Malicious source is significantly rarer because raw source code is difficult to compromise: there are audit trails, PRs have to be merged by the community, and there are many eyes watching the projects. XZ Utils, for example, is widely misunderstood as a source attack, when in reality the backdoor was introduced only in the release tarball and binary release. Building XZ Utils from the actual source code repository mitigates the backdoor because the repository did not contain the attack. The remaining ~2% of malicious packages in our dataset fall into this stage and would not have been stopped by Chainguard today.
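The three stages above can be sketched as a small classifier. This is a minimal illustration of the decision logic as described in this post, not Chainguard's actual analysis code. The `PackageRecord` fields, the `Outcome` labels, and the sample package names are all hypothetical, and the two boolean fields stand in for judgments an analyst would make for each package.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    NOT_BUILT_NO_SOURCE = "mitigated: no attributable source, never built"
    REBUILT_FROM_SOURCE = "mitigated: clean artifact rebuilt from public source"
    NOT_MITIGATED = "not mitigated: source code itself is malicious"


@dataclass
class PackageRecord:
    # Hypothetical per-package labels; every record is a known-malicious package.
    name: str
    has_attributable_source: bool
    artifact_matches_source: bool  # only meaningful when source exists


def classify(pkg: PackageRecord) -> Outcome:
    # Stage 1: without public source there is nothing to build from.
    if not pkg.has_attributable_source:
        return Outcome.NOT_BUILT_NO_SOURCE
    # Stage 2: the malicious artifact differs from the public source,
    # so a rebuild from that source does not carry the tampering.
    if not pkg.artifact_matches_source:
        return Outcome.REBUILT_FROM_SOURCE
    # Stage 3: the artifact is malicious and matches the source,
    # so the source itself must be malicious.
    return Outcome.NOT_MITIGATED


# Made-up records for illustration only; these are not from the dataset.
samples = [
    PackageRecord("example-typosquat", False, False),
    PackageRecord("example-tampered-release", True, False),
    PackageRecord("example-malicious-repo", True, True),
]

tally = Counter(classify(pkg) for pkg in samples)
for outcome, count in tally.items():
    print(f"{count} package(s) -> {outcome.value}")
```

The order of the checks matters: the artifact comparison in stage 2 is only possible once stage 1 has confirmed that public source exists.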


Key Takeaways for Security Teams
Malware in the Python ecosystem is widespread. Relying on public repositories that are optimized for publisher convenience rather than enterprise security presents a significant risk. The most effective way for security teams to combat malware is to turn to secure and trusted open source libraries that have been built entirely from source in hardened infrastructure. This is the approach that Chainguard Libraries takes at its core, and, based on our analysis, one that shows promising results in mitigating malware at its roots.
If you’d like to learn more about how Chainguard Libraries can transform your software supply chain, reach out today. Existing Chainguard Containers customers can get started with Chainguard Libraries by reaching out to their account teams and exploring our documentation.