Hunting malware on package repositories

Ly D. Vu, Zachary Newman, and John Speed Meyers
October 13, 2022

After several high-profile attacks delivered via open-source software repositories, a tool that can detect malicious open-source packages starts to sound like a magic wand that could make these problems disappear. Fortunately, decades of academic research and a range of commercial malware detection tools aim to build just that.

Are tools the cure for these malicious packages? To find out, we spoke to administrators of and contributors to PyPI, the main repository for Python packages, along with an academic researcher who works on this problem. We also revisited our recent research on malware detection tools to see how they measure up to the requirements of real package repositories. Here’s what we found:

  • These tools aren’t suitable to run on software repositories automatically, in large part because they’re too noisy.
  • External researchers can (and do) run their own tools in their own environments and send reports to get malware removed.
  • This often works out better for everybody involved.
  • There are promising directions for improving these scanners, and other, even more promising techniques for improving software repository security that administrators are working toward right now.

Interviews

We checked in with members of the PyPI community and supply-chain security researchers to see what it would take to deploy malware detection software. PyPI deployed an experimental “malware checks” system in 2020, so our interviewees (an administrator of PyPI, and one developer of the malware check system) have direct experience with running malware detection for a real repository. However, these checks aren’t used anymore. We sought to find out why not, and what it would take to deploy such a system again. It turns out that a lot of common assumptions about malware detection on package repositories are wrong.

False positive rates matter more than false negative rates.

Many researchers build systems designed to catch all or most malware: after all, we don’t want to let bad packages through. They accept a nonzero false positive rate as the price of catching bad actors. However, given the number of legitimate packages published, even a seemingly low rate (like 5%) would require administrators to manually inspect thousands of packages each week. An automated tool needs an “effectively zero” false positive rate. Consistent with our own past results, the PyPI maintainers reported experiencing unsustainably high false positive rates.
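
To make that math concrete, here is a back-of-the-envelope calculation. The upload volume and malware fraction below are hypothetical round numbers for illustration, not measured PyPI statistics:

```python
# Back-of-the-envelope: benign packages flagged per week at a given false
# positive rate. The figures below are hypothetical, not PyPI statistics.
weekly_uploads = 50_000       # assumed weekly package releases
false_positive_rate = 0.05    # a "seemingly low" 5% rate
malware_fraction = 0.001      # assume 1 in 1,000 uploads is malicious

benign_uploads = weekly_uploads * (1 - malware_fraction)
false_alarms = benign_uploads * false_positive_rate
print(f"benign packages flagged per week: {false_alarms:.0f}")
```

Even under these charitable assumptions, a 5% rate buries a handful of real detections under thousands of false alarms.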

Repository administrators must balance multiple security priorities.

PyPI maintainers are deeply invested in the security of the platform. But, as in much of open-source software and infrastructure, there is more work to be done than people to do it. PyPI and similar repositories must weigh automated malware detection against other security investments like software signing and multi-factor authentication. Noting that most malware packages affect few or no actual users (PyPI data shows downloads for taken-down packages, but many come from automated crawlers), PyPI administrators have decided to spend their finite resources on higher-impact projects.

Just because PyPI isn’t running these checks doesn’t mean that others aren’t.

In lieu of repository-side scanning, an interesting symbiosis has emerged: security researchers develop and operate PyPI malware detection systems using their own time and computing resources, providing reports to PyPI when they detect malicious packages. PyPI maintainers get high-quality, low-noise reports on malware, and the security researchers get positive coverage of their company, products, and services.

Benchmarking Different Malware Detection Approaches

To understand if existing systems were appropriate for this setting, we ran some experiments comparing different Python malware detection approaches. These systems include static analysis tools that analyze source code, dynamic analysis tools that observe running software, and metadata analysis tools that look at things like package names.

Many of these tools separate their engine (which can run many different rules in a common format) and the rules themselves (which encode specific traits to look for). We focused on tools with behavior-based (rather than metadata-based) detection, and publicly available source code and detection rules. We found three Python-ecosystem tools which met these conditions: Bandit4Mal, OSSGadget OSS Detect Backdoor, and PyPI Malware Checks.
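
This engine/rules split can be sketched in a few lines. The rules below are illustrative stand-ins, not the actual rule sets shipped by Bandit4Mal, OSSGadget, or the PyPI malware checks:

```python
import ast

# A "rule" is just a name plus a predicate over AST nodes. These two rules
# are illustrative examples only.
RULES = [
    ("exec-call", lambda n: isinstance(n, ast.Call)
        and isinstance(n.func, ast.Name) and n.func.id in ("exec", "eval")),
    ("subprocess-import", lambda n: isinstance(n, ast.Import)
        and any(a.name == "subprocess" for a in n.names)),
]

def scan(source: str) -> list[str]:
    """The 'engine': walk the AST once, firing every rule on every node."""
    tree = ast.parse(source)
    return [name for node in ast.walk(tree)
            for name, pred in RULES if pred(node)]

alerts = scan("import subprocess\nexec('print(1)')")
print(alerts)  # both rules fire on this two-line snippet
```

Keeping rules as data makes it cheap to add, remove, or tune checks without touching the engine, which is why these tools tend to share this shape.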

We used a benchmark dataset including 168 malware packages (courtesy of the Backstabber's Knife Collection and maloss datasets), 1,430 popular packages, and 986 randomly-selected packages. We then scanned these packages with each chosen tool, recording all alerts produced by the setup.py files (which can run malicious code at package installation time) as well as the entire package (for malicious code that executes at runtime). We consider an alert for a malicious package a true positive and an alert for a benign package a false positive. The following table shows the percentage of packages with at least one alert, by tool and dataset:

This is what we learned:

Scanners catch the majority of malicious packages.

All three of these tools had true positive rates above 50%. When including all Python files, the tools detected over 85% of malicious packages. 
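
For concreteness, here is how a detection rate like this is tabulated: a package counts as flagged if a scanner raises at least one alert. The alert counts below are made up for illustration; they are not our benchmark data:

```python
# A package is "flagged" if the scanner produced at least one alert.
# The alert counts below are hypothetical, not our measured results.
def flagged_rate(alert_counts: list[int]) -> float:
    """Percentage of packages with at least one alert."""
    return 100 * sum(1 for c in alert_counts if c > 0) / len(alert_counts)

malicious = [4, 0, 12, 1]   # hypothetical counts for malware packages
benign    = [0, 3, 0, 0]    # hypothetical counts for benign packages

true_positive_rate  = flagged_rate(malicious)   # 75.0
false_positive_rate = flagged_rate(benign)      # 25.0
```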

False positive rates are high (sometimes higher than true positive rates).

The measured tools have false positive rates between 15% and 97%. The false positive rate increases (sometimes higher than the true positive rate for malicious packages) when checking all files, rather than just setup.py files. This suggests that many rules used by these tools are designed to catch behavior that is suspicious in setup.py files, but normal in package code.

When it rains, it pours: packages with one alert often have many more.

The tools can fire multiple alerts per package, and they did. Scanning the setup.py files of benign packages, every tool produced a median of 3 or fewer alerts. When scanning all Python files, the median number of alerts rises to between 10 and 85. The noisiest benign package had 145,799 alerts.

Making alerts stricter means missing a lot of malware.

Rather than flagging a package as possibly malicious if it has any alerts, we tried requiring a threshold number of alerts. We found that as the threshold rises, the tools report very few (or even no) malicious packages even before the false positive rates become manageable.
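
A sketch of this thresholding experiment, with hypothetical alert counts chosen to show the failure mode we observed: malware tends to trip a handful of targeted rules, while noisy benign packages can trip hundreds, so raising the threshold sheds true detections faster than false alarms:

```python
# Flag a package only if it has at least `threshold` alerts.
# All alert counts below are hypothetical, chosen for illustration.
def detection_rates(mal_counts, benign_counts, threshold):
    tp = sum(c >= threshold for c in mal_counts) / len(mal_counts)
    fp = sum(c >= threshold for c in benign_counts) / len(benign_counts)
    return tp, fp

malware = [2, 3, 1, 5, 2]      # malware: few, targeted alerts
benign  = [0, 1, 40, 0, 200]   # benign: mostly quiet, a few very noisy

for t in (1, 5, 50):
    tp, fp = detection_rates(malware, benign, t)
    print(f"threshold={t}: TPR={tp:.0%}, FPR={fp:.0%}")
```

At a threshold of 5, this toy data already misses most malware while the noisy benign packages still fire, mirroring the pattern we measured.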

Some rules are better than others.

One of the rules checks for networking code in unexpected places. These types of checks were a good indicator of a malicious package. Other rules, which looked for metaprogramming or running external processes, were less effective in distinguishing malicious and benign code.
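
An illustrative version of such a networking check for setup.py files. The module list and the rule itself are simplified examples, not the exact rules these tools ship:

```python
import ast

# Simplified example of a "networking code in unexpected places" rule:
# flag network-related imports inside a setup.py. The module list is an
# illustrative assumption, not any tool's actual rule set.
NETWORK_MODULES = {"socket", "urllib", "http", "ftplib", "requests"}

def has_network_imports(setup_py_source: str) -> bool:
    tree = ast.parse(setup_py_source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            if any(a.name.split(".")[0] in NETWORK_MODULES for a in node.names):
                return True
        elif isinstance(node, ast.ImportFrom):
            if node.module and node.module.split(".")[0] in NETWORK_MODULES:
                return True
    return False

print(has_network_imports("from setuptools import setup\nsetup()"))    # False
print(has_network_imports("import urllib.request\nimport setuptools")) # True
```

Networking during installation has a much clearer malicious interpretation than, say, metaprogramming, which is why this class of rule separates the two populations better.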

The tools ran reasonably fast.

Tested on a laptop, checking a typical package took well under 10 seconds. This is too slow to run before a package upload finishes, but is quite reasonable to passively analyze a repository.

Verdict

Overall, these systems aren’t ready to run automatically on a repository like PyPI. The main reason is double-digit false positive rates, which would force maintainers to sift through thousands of alerts every week just to find a handful of malicious packages.

Our interviews pointed to potential directions for better scanning. First, prioritize higher-impact packages: typosquats, shrinkwrapped clones, and popular packages. Second, consider dynamic scanning techniques, which run code in a sandbox and observe its behavior. Third, make sure tools are easy to interpret: “6 alerts” is hard to evaluate; “makes network calls to these domains,” less so. Most importantly, don’t expect volunteer repository administrators to maintain and run tools for you; instead, form a relationship and plan to work together for the long haul.

Conclusions

The primary lesson from our interviews and experiments is to listen to maintainers. Better than anyone else, they understand requirements and priorities, and have a personal stake in preventing attacks. Outsiders just don’t have the expertise to decide whether a proposed system has too much latency or addresses the wrong problem.

This is not to say that independent researchers are useless. On the contrary, they are essential to the ecosystem. Progress relies on their experimental, impractical-seeming ideas. However, it's unfortunately common to assume that a flashy demo justifies dumping a pile of barely-working code on repository administrators. Instead, researchers should engage with maintainers, who can outline requirements for practical systems, and who have endless ideas worth exploring (the previous PyPI malware checks are a great example of such collaboration; we know about their deficiencies only because they tried).

Despite these findings, we remain optimistic about open-source software security. Organizations like the OpenSSF do listen to maintainers while providing resources for academics, maintainers, and companies to collaborate. As long as we listen to what the community has to say, open-source security will steadily improve.

If you’d like to learn more about our results or proposed future directions, please check out our paper on arXiv.
