Taming bad Python packages: Assessing Python malware detectors with a benchmark dataset

John Speed Meyers and Zachary Newman, Principal Research Scientists

August 23, 2023

Ly D. Vu, visiting researcher at Chainguard and a lecturer and researcher at FPT University in Vietnam, contributed to this post.

The number of malicious open source software packages found in popular registries is growing. According to analysis by the Atlantic Council, malicious open source packages have grown from a nearly non-existent phenomenon a decade ago to a near-monthly occurrence. One way to counter these attacks is to detect them and fix them at the source. For example, the Python Package Index (PyPI), the Python programmer’s equivalent of the app store, has experimented with a malware detection capability that analyzes new packages and new package versions, monitoring for potentially malicious packages to help protect PyPI users.

Unfortunately, detecting malicious open source software in the context of a community registry is hard. Registry maintainers and contributors have limited time (often on a volunteer basis) to review security verdicts, a problem made worse by high false positive rates and the huge number of packages in many registries. Also, academic research on malware detection mostly overlooks the peculiar requirements of scanning entire open source software registries (with some exceptions), providing relatively little help. Finally, although software security companies have made progress on this task and aided repository administrators, even their efforts could be more effective with shared datasets and methodologies for OSS malware detection.

To help parties interested in creating better approaches to scanning package registries for malicious OSS, we conceived a project to build a benchmark Python malware dataset to assess the current performance of the PyPI malware checks and assess alternative approaches. This dataset--which contains both malware and benign Python packages--allows interested parties to empirically measure the false positive rate and true positive rate of a particular approach.

Using this open source benchmark dataset to assess the current PyPI checks, our analysis found:

High False Positive Rates: About one third of the 1,430 popular packages triggered an alert. Nearly 15 percent of 986 random packages triggered an alert. Using conservative assumptions about the prevalence of malware, these rates would generate thousands of false positives per week.
Moderately High True Positive Rates: Nearly 60 percent of malicious packages triggered an alert.

These results reinforce what Python malware experts have discovered intuitively: the current approach, despite netting a majority of the malware, has unacceptably high false positive rates, flooding a potential user with false leads. It’s our hope that this benchmark dataset will accelerate PyPI malware discovery, helping security researchers create more efficient Python malware detection techniques. Ultimately, we hope these better approaches reduce the number of malicious packages in PyPI and reduce the average time to detection for malware found in PyPI.

This blog post explains the creation and use of the benchmark dataset to measure the performance of the current PyPI malware checks.

A New Python Malware Benchmark Dataset

To create a dataset that enables interested parties to measure the performance of Python malware detection approaches, we built a dataset with both malicious and benign packages.

Assembling the malicious Python packages was straightforward. The Backstabber’s Knife Collection dataset (checked out at commit 22bd76) contains 107 malicious packages of previously identified Python malware, while the MalOSS dataset (checked out at commit 2349402e) contains 140 valid examples. The benchmark dataset includes only one copy of the 75 packages that were found in both datasets. Furthermore, three packages did not contain Python files, two similarly named packages were deemed the same, and the dataset only included the latest version of each package. Therefore, in total, the dataset contains 168 malicious Python packages. To prevent the further proliferation of OSS malware, both datasets are restricted; researchers must specifically request access, and so we are releasing only the metadata associated with the malicious files. Of course, there are also many examples of Python malware not in this dataset--packages that were removed before a sample was stored--and future researchers should seek, if practical, to incorporate these.
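The package counts above can be checked with a little set arithmetic. The sketch below is illustrative only: the helper names (merge_package_lists, expected_malicious_total) and the placeholder package names are assumptions, not part of the released code, and it assumes the exclusions are applied after the two sources are deduplicated.

```python
def merge_package_lists(first, second):
    """Return the sorted union of two package-name collections (one copy per name)."""
    return sorted(set(first) | set(second))


def expected_malicious_total(backstabbers=107, maloss=140, overlap=75,
                             no_python_files=3, merged_near_duplicates=1):
    """Recompute the malicious package count from the figures quoted in the text."""
    # Packages present in both sources are counted once.
    union = backstabbers + maloss - overlap
    # Drop packages without Python files and collapse the similarly named pair.
    return union - no_python_files - merged_near_duplicates


if __name__ == "__main__":
    # Placeholder names for illustration; not real samples from either dataset.
    source_a = ["pkg-a", "pkg-b", "pkg-c"]
    source_b = ["pkg-b", "pkg-c", "pkg-d"]
    print(merge_package_lists(source_a, source_b))
    print(expected_malicious_total())
```

With the figures from the text, 107 + 140 - 75 - 3 - 1 gives the 168 packages reported.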

Creating a benign package list was more difficult. There is no pre-existing dataset of Python packages universally regarded as benign. We, therefore, created two benign datasets.

As a first step, we created a combined dataset of the 1,000 Python packages that are most downloaded and 1,000 most widely depended upon Python packages. It’s a safe assumption that none of these packages are malicious. We excluded three popular packages that lack any Python files. By using only one copy of any packages found in both datasets, we generated a benign dataset of 1,430 packages. It’s worth acknowledging that popular packages are likely different from a typical Python package: potentially better engineered and more conformant to standard Python programming practices. Consequently, using only popular packages as the benign dataset might lead to unrealistic benchmark results since these packages might be relatively easy for detection tools to classify as benign.

The second step was to select the most recent version of 1,000 randomly chosen Python packages. We excluded packages that lacked any Python files and packages that we had chosen but that were no longer available at the time of analysis, which left a benign dataset of 986 packages. While there is a chance that some of these packages are malicious, the chance that more than a handful are malicious is vanishingly small. Importantly, these packages are more representative of a package selected from PyPI at random.
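One way to draw such a sample reproducibly is sketched below. This is not the project's actual sampling code: the function names, the fixed seed, and the idea of filtering on a list of file names are all assumptions made for illustration.

```python
import random


def sample_packages(all_names, k=1000, seed=0):
    """Draw k distinct package names using a seeded random generator."""
    rng = random.Random(seed)
    # Sorting first makes the sample independent of the input order.
    return rng.sample(sorted(all_names), k)


def has_python_files(file_names):
    """Return True if at least one file in the package has a .py extension."""
    return any(name.endswith(".py") for name in file_names)


if __name__ == "__main__":
    # Hypothetical registry listing; real names would come from PyPI.
    registry = [f"pkg-{i}" for i in range(5000)]
    chosen = sample_packages(registry, k=1000, seed=42)
    print(len(chosen))
    print(has_python_files(["README.md", "pkg/__init__.py"]))
```

Fixing the seed lets another researcher regenerate the same list, which matters when a benchmark is meant to be shared.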

The code for downloading packages from PyPI and then analyzing these packages is available on GitHub.

Table 1 displays the descriptive statistics of the benchmark dataset. The dataset column values contain links to the metadata for each package list.

Dataset	Number of Packages	Number of Python files	Number of Lines of Code
Malicious	168	1,339	228,192
Benign - Popular	1,430	164,223	45,254,876
Benign - Random	986	16,832	2,770,978

Table 1. Descriptive Statistics of the Benchmark Dataset

Note: The scc tool was used to calculate the file and line statistics.

Benchmarking the Current PyPI Malware Detection Approach

PyPI currently implements two security checks. The first is a “setup pattern check” for performing regular expression-based checks (specifically, YARA rules) of setup.py files. The second is a “package turnover check” for performing scans of suspicious behavior related to package ownership. The analysis below only assesses the setup pattern check. These checks generate alerts when an uploaded artifact contains suspicious behavior, allowing administrators to review the alerts and, if necessary, take action.
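To give a feel for how a pattern-based setup.py check behaves, here is a minimal sketch that uses Python's re module as a stand-in for YARA. The pattern names and regular expressions are hypothetical and are not PyPI's actual rules; the sketch only shows the general technique of flagging a file when any pattern matches.

```python
import re

# Hypothetical patterns for illustration; these are NOT PyPI's YARA rules.
SUSPICIOUS_PATTERNS = {
    "dynamic_exec": re.compile(r"\b(?:exec|eval)\s*\("),
    "base64_decode": re.compile(r"\bb64decode\s*\("),
    "shell_command": re.compile(
        r"\b(?:os\.system|subprocess\.(?:Popen|run|call))\s*\("
    ),
    "network_fetch": re.compile(r"\b(?:urlopen|requests\.(?:get|post))\s*\("),
}


def scan_setup_py(source):
    """Return the sorted names of every pattern that matches the source text."""
    return sorted(
        name for name, pattern in SUSPICIOUS_PATTERNS.items()
        if pattern.search(source)
    )


def triggers_alert(source):
    """A file raises an alert if at least one pattern matches."""
    return bool(scan_setup_py(source))


if __name__ == "__main__":
    plain = "from setuptools import setup\nsetup(name='demo', version='1.0')\n"
    # A common benign idiom: read the version by exec-ing a small file.
    version_idiom = "exec(open('demo/version.py').read())\n"
    print(scan_setup_py(plain))
    print(scan_setup_py(version_idiom))
```

The second example is a benign idiom that still matches a naive pattern, which hints at how purely textual rules can produce false positives on legitimate packages.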

We ran PyPI’s setup pattern check on the malicious packages and both sets of benign packages. The checks were only run on the setup.py file, a file often abused by Python malware creators. The analysis in Table 2 reports statistics on the number of generated alerts. Most importantly, the analysis found that around 33 percent (474 packages) of the popular packages and nearly 15 percent (147 packages) of random packages triggered alerts. On a positive note, the analysis found that nearly 60 percent (99 packages) of malicious packages generated at least one alert.

Dataset	Total Number of Packages	Total Number of Packages With At Least 1 Alert	Percentage of Packages With At Least 1 Alert
Malicious	168	99	58.9%
Popular	1,430	474	33.1%
Random	986	147	14.9%

Table 2. Benchmark Results for Current PyPI Malware Checks

Note: This analysis only analyzed setup.py files for all packages.
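The percentages in Table 2 follow directly from the counts. The short sketch below recomputes them; the BenchmarkRow class and percent helper are illustrative names, while the numbers are copied from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkRow:
    dataset: str
    total: int     # total number of packages in the dataset
    alerted: int   # packages with at least one alert

    @property
    def alert_rate(self):
        """Share of packages that triggered at least one alert."""
        return self.alerted / self.total


# Counts copied from Table 2.
TABLE_2 = [
    BenchmarkRow("Malicious", 168, 99),
    BenchmarkRow("Popular", 1430, 474),
    BenchmarkRow("Random", 986, 147),
]


def percent(rate):
    """Format a fraction as a percentage with one decimal place."""
    return f"{rate * 100:.1f}%"


if __name__ == "__main__":
    for row in TABLE_2:
        print(row.dataset, percent(row.alert_rate))
```

For the malicious row the alert rate is the true positive rate; for the two benign rows it is the false positive rate, under the assumption that those packages really are benign.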

These malware checks did catch a majority of malicious packages. However, these checks have a dismal false positive rate, which will waste the time of even the most efficient security analyst. For instance, assuming 20,000 package releases per week (a number we corroborated using a public PyPI dataset) and 20 malicious packages included among these, an analyst using these checks to search for malicious packages must deal with over 4,000 false positive alerts per week.
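The weekly estimate is simple arithmetic: the number of benign releases multiplied by the false positive rate. The post does not state which false positive rate underlies its figure, so the sketch below (with a hypothetical helper name) evaluates the two benign rates from Table 2 plus a pooled rate, which is an assumption added here.

```python
def weekly_false_positives(releases_per_week, malicious_per_week, false_positive_rate):
    """Expected number of benign releases that trigger an alert in one week."""
    benign_releases = releases_per_week - malicious_per_week
    return benign_releases * false_positive_rate


if __name__ == "__main__":
    rates = {
        "random packages": 147 / 986,
        "popular packages": 474 / 1430,
        # Pooling both benign sets is an assumption, not the post's stated method.
        "pooled benign": (147 + 474) / (986 + 1430),
    }
    for label, rate in rates.items():
        estimate = weekly_false_positives(20_000, 20, rate)
        print(f"{label}: about {estimate:,.0f} false positives per week")
```

Under these assumptions the estimates range from roughly 3,000 to roughly 6,600 alerts per week, so the workload is in the thousands whichever benign rate is used.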

Helping the Python Security Community Reduce the Harm from PyPI Malware

We created this Python malware dataset to help anyone engaged in PyPI repository malware scanning assess the effectiveness of proposed approaches. Because this dataset contains both malware and benign packages, security researchers can calculate the true positive rate and, perhaps even more importantly, the false positive rate of a detection approach. This benchmark dataset can help security researchers and the PyPI community detect Python malware more efficiently, reducing the amount of PyPI malware and shortening the time to detection for any malware in PyPI.

Our findings are congruent with the current attitudes among the PyPI community; the status quo malware checks, even though they catch the majority of Python malware, produce too many false positives to be effectively used by a security analyst.

In a future blog post, we plan to extend this Python malware detection benchmarking to more sophisticated open source software malware analyzers such as OSSGadget Detect Backdoor, bandit4mal, and OSSF Package Analysis. We also plan to port the PyPI malware check rules to Semgrep rules and benchmark that approach.

Ly D. Vu is a visiting researcher at Chainguard and a lecturer and researcher at FPT University in Vietnam. Zack Newman is a senior software engineer at Chainguard. John Speed Meyers is a security data scientist at Chainguard.
