Securing the machine learning supply chain

Zackary Newman, Principal Research Scientist
November 30, 2022

You’ve done it: you’ve secured your software supply chain. You’ve implemented SLSA, started signing commits, adopted minimal, reproducible container base images, and ensured compliance. All your developers are following best practices—and they’re even happy with the tooling! Time to take a well-earned vacation.

Fast-forward six months: you’ve been hauled in front of Congress to testify about the breach that led to the theft of your customers’ financial data. How? A data scientist was running some statistics on customer data when they pulled some weights for a neural net off the internet and loaded them with `pickle.load()`. This compromised their Jupyter notebook, which had a live connection to a production database. From there, the attackers had won.

Just as “MLOps” applies techniques and processes from the software lifecycle to the data lifecycle, the practice of “ML Supply Chain Security” applies techniques and processes for securing the software lifecycle to the machine learning lifecycle. In this post, we’ll learn:

  • Why machine learning has the same security problems as the rest of the software supply chain: you have standard software dependencies, but you also have models (which run code) and data (which could be malicious!).
  • What you can do about it (without slowing down your data scientists).

Machine Learning Supply Chain Security

According to Microsoft, a typical data science lifecycle for a team involves a step called “Data acquisition and understanding.” This sounds suspiciously like what software engineers do when they engage with open source software. But there are a few factors that make the data science setting scarier:

You still have all of the dangers of a standard software supply chain, because you take dependencies on open source code to even do the analysis:

  • Instead of checking in a list of requirements to Git, the dependencies just wind up in a `pip install` Jupyter notebook cell: there’s no version pinning, and no common format for specifying dependencies.
  • That notebook itself may not be checked into version control of any kind—it could live on a developer’s laptop!
  • Your environment contains a mix of software from the operating system, software built into the notebook, and software installed interactively.
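The unpinned `pip install` pattern described above can be caught mechanically. Here is a minimal sketch, assuming the notebook cell's source is available as a plain string. `unpinned_requirements` is a hypothetical helper (not part of pip or Jupyter), and it is deliberately simplistic: flags that take a value, such as `-r requirements.txt`, and version ranges are not handled.

```python
import re
from typing import List

# Matches a `pip install` line as it might appear in a notebook cell,
# with or without the leading "!" or "%" that notebooks use.
PIP_LINE = re.compile(r"^\s*[!%]?\s*pip\s+install\s+(.*)$")


def unpinned_requirements(cell_source: str) -> List[str]:
    """Return requirements from `pip install` lines that lack an exact `==` pin."""
    unpinned = []
    for line in cell_source.splitlines():
        match = PIP_LINE.match(line)
        if not match:
            continue
        for token in match.group(1).split():
            if token.startswith("-"):  # skip simple flags like -q or --upgrade
                continue
            if "==" not in token:
                unpinned.append(token)
    return unpinned


cell = "!pip install -q pandas scikit-learn==1.1.3\nimport pandas as pd"
print(unpinned_requirements(cell))  # ['pandas']
```

A check like this could run in a pre-commit hook or CI job over checked-in notebooks, which also nudges notebooks into version control in the first place.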

Further, models are really just code:

  • Pulling from a machine learning model repository like Hugging Face is just like getting code from GitHub.
  • These models can be published by anyone, and typically aren’t signed.
  • It’s a common practice to serialize models using Python’s `pickle` module. However, if you were to read the documentation, you’d see “The `pickle` module is not secure.” at the top of the page. Loading pickled data can run arbitrary code!
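To make the `pickle` risk concrete, here is a harmless sketch of how a “model” file can run code the moment it is loaded. `MaliciousModel` is a hypothetical stand-in for an attacker-published artifact, and the payload here only prints a message; a real attacker would run a shell command or fetch malware instead.

```python
import pickle


class MaliciousModel:
    """Hypothetical stand-in for a 'model' an attacker has published."""

    def __reduce__(self):
        # Tells pickle how to "rebuild" the object: call eval() on this string.
        # The expression prints a message, then evaluates to an innocent-looking
        # dict so the victim sees nothing unusual.
        return (
            eval,
            ("print('attacker code ran during load') or {'weights': [0.1, 0.2]}",),
        )


blob = pickle.dumps(MaliciousModel())

# The victim only calls pickle.loads; no method on the result is ever invoked.
model = pickle.loads(blob)
print(model)
```

Note that the loaded object is an ordinary dict: nothing about the result hints that code ran while it was being deserialized.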

Finally, data from unknown sources poses all the same threats:

  • Standard practice is to copy data directly into the Git repository using Git LFS; the data isn’t typically stored in human-readable format, so you can’t tell whether it’s safe by looking.
  • Opaque blobs checked into Git are the best case—it’s common to run `wget` to fetch data.
  • Even if the data is well-formatted, it can contain (undetectable) attacks: poisoning, where even small amounts of poisoned training data can cause huge decreases in the performance of a model, or adversarial “backdoors” that cause the model to misclassify specific inputs.

To sum it up: the data you depend on can introduce vulnerabilities and run attacker code just like software dependencies. However, it’s totally opaque and you can’t tell by looking. Further, the state of the art in operations for data science is about a decade behind that of software engineering. Once an attacker has a foothold, they can do a ton of damage:

  • Compromise developer machines (local) or hosted infrastructure.
  • Compromise build systems, which often run ML models in a CI/CD pipeline.
  • Exfiltrate data that the developer has access to.
  • Get models trained on their data deployed to production (where the attacker now has a backdoor).

Securing the Machine Learning Supply Chain

Fortunately, we know how to secure the ML supply chain: it’s the same way we secure the software supply chain! Specifically:

  • Make sure inputs are coming from trusted sources using digital signatures (perhaps via Sigstore).
  • Make sure the contents of data and models are known, using Software Bills of Materials (SBOMs; the AI Profile group for SPDX is adding SBOM features to support datasets) and VEX, to determine whether you need to worry about specific vulnerabilities in the supply chain.
  • Ensure that code runs with as few permissions as possible (principle of least privilege) by running in sandboxed environments like Google Colaboratory.
  • Use dedicated data formats, like Avro or Parquet, which just store data and don’t run code when you load them!
  • Don’t let data scientists run against production databases; rather, they should use read-only, anonymized replicas.
  • Define policies specifying whom you trust to provide data using The Update Framework (TUF) or in-toto. Enforce them using something like Chainguard Enforce.
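As a first step toward the “trusted inputs” items above, a downloaded model or dataset can be checked against a digest recorded when the artifact was reviewed. This is digest pinning, not a digital signature: it shows the bytes have not changed, but says nothing about who produced them, so it complements rather than replaces tools like Sigstore. The sketch below uses only the Python standard library; `sha256_of` and `verify_artifact` are hypothetical helper names, and the temporary file stands in for a real download.

```python
import hashlib
import hmac
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 so large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path: Path, expected_sha256: str) -> None:
    """Raise ValueError unless the file's digest matches the pinned value."""
    actual = sha256_of(path)
    if not hmac.compare_digest(actual, expected_sha256.lower()):
        raise ValueError(f"digest mismatch for {path.name}: got {actual}")


# Demo with a throwaway file standing in for a downloaded dataset.
with tempfile.TemporaryDirectory() as tmp:
    dataset = Path(tmp) / "train.csv"
    dataset.write_bytes(b"id,label\n1,0\n2,1\n")
    pinned = sha256_of(dataset)       # in practice, recorded at review time
    verify_artifact(dataset, pinned)  # passes silently

    dataset.write_bytes(b"id,label\n1,1\n2,1\n")  # simulate tampering
    try:
        verify_artifact(dataset, pinned)
    except ValueError as err:
        print("rejected:", err)
```

Checking the pinned digest into the same repository as the notebook means a change to the data shows up in code review like any other change.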

Many of these are already best practices in the MLOps world, and follow directly from applying frameworks like SLSA to this problem space. They even have tangential benefits, like reproducibility—which is just good science!

However, the tooling for solving these problems for data science is currently immature. In some cases, software tools can be applied as-is, but ML-specific implementations might be required to support workflows that data scientists have come to expect (for instance, using interactive notebooks). This brief from Georgetown University’s Center for Security and Emerging Technology provides policy recommendations, and this Transatlantic Cyber Forum report recommends a “security approach rooted in conventional information security” and outlines the many steps that will be required to implement it, like “[i]ncreas[ing] transparency, traceability, validation, and verification”, which projects like Sigstore are doing for software!

It will be a long journey to secure ML supply chains, but we can follow the tracks laid in the software world; the sooner we start, the better.
