This Shit is Hard: Building hardened PyTorch wheels with upstream parity
Chainguard’s “This Shit is Hard” series showcases the difficult engineering work we’ve tackled to deliver best-in-class outcomes for customers using our products. We’ve covered several important topics, including the Chainguard Factory, Chainguard Libraries for Java, our integrations with several scanner partners, recent hardening of glibc, our implementation of SLSA Level 3, and the concept of “Zero Trust” in open source software. Today, we’re discussing how we tackled dependency resolution, toolchain challenges, and manylinux compatibility to deliver hardened PyTorch wheels.
When we first announced we were building Chainguard Libraries for Python, we set an ambitious goal: achieve 100% feature parity with upstream builds. Not "close enough." Not "we'll add missing features later." We wanted to run the full test suite with our builds - builds that make use of our hardened toolchain and provide SLSA level 2 provenance - and see the same results as the upstream builds.
As one of the most complex libraries to build, PyTorch served as an excellent stress test.
This wasn't our first rodeo with PyTorch. We'd been building it for Chainguard Containers for over a year, but building wheels for a library distribution introduced new challenges. Dependency management, binary compatibility, and test infrastructure pushed our team into uncharted territory.
What we learned along the way has improved how we approach building native Python wheels in general — including dependencies, OS compatibility, and the trade-offs between maintaining and distributing native dependencies.
The challenge: Understanding what "parity" really means
PyTorch is a massive codebase with numerous optional dependencies that enable performance optimizations on different hardware architectures. Our initial assessment suggested this would be straightforward, as we already had working Chainguard OS package (APK) builds that generate wheels as an intermediate artifact. In practice, though, achieving true parity meant understanding every upstream configuration decision and replicating it in our build system.
Discovering the dependency landscape
The first step was identifying which optional libraries upstream includes. By analyzing PyTorch's build configuration, we discovered several dependencies we needed to integrate.
A practical starting point was to query the upstream PyTorch installation's build configuration with:
import torch
print(torch.__config__.show())
This provides a lot of information about library dependencies, compiler versions, and flags used in the build:
PyTorch built with:
- GCC 13.3
- Intel(R) oneAPI Math Kernel Library Version 2024.2
- Intel(R) MKL-DNN v3.7.1
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- CUDA_VERSION=12.8
...
The output helped us identify the initial set of features enabled in the build, and also sent us down the rabbit hole of understanding what dependencies are required to satisfy them.
Challenge #1: Decoding the dependency list
PyTorch's configuration references MKL, MKL_STATIC, MKLDNN, and MKLDNN_ACL. At first glance, they appeared to be a set of feature flags for the MKL project. We discovered they actually refer to a few different projects, some of which have been renamed or merged over time.
After an investigation, the mapping became clearer:
MKL — Intel's Math Kernel Library, which is now part of oneAPI MKL (sometimes called oneMKL)
MKLDNN — Renamed to oneDNN (PyTorch still uses the old name internally)
MKLDNN_ACL — oneDNN support for the ARM Compute Library, another dependency to track
Code inspection and testing identified other dependencies:
numpy — Python Scientific Computing
triton — Language/compiler using LLVM for code generation
libumf (Unified Memory Framework) — Advanced memory management
MAGMA — GPU-accelerated linear algebra
Initial approach: Consume dependencies from Chainguard OS
For native C/C++ dependencies, our default approach is to package and consume them from our APK repositories. This allows us to use our existing automation to keep them up to date and monitor for CVEs. Unfortunately, this wasn’t always as easy as packaging the libraries and pointing PyTorch’s build system at them:
MKL
Intel distributes MKL as part of their oneAPI toolkit, but also offers a "standalone" download. We chose the standalone option, but soon learned that it contains multiple projects bundled together under various licenses. Ultimately, we determined that very little of this toolkit was required — just the “-classic” subset, in fact. Installing too many mkl-classic shared libraries negatively impacts our parity goals due to subtle optimizations in the PyTorch build system.
Triton
Triton is a runtime dependency distributed as a Python package. The PyTorch build generates a runtime dependency on whichever Triton version is available at build time; if Triton is unavailable at build time, no dependency is generated, resulting in many test failures. The most challenging part of building this dependency is that it requires an LLVM toolchain, which itself must be built from a specific commit hash. Even though that bootstrap LLVM simply gets thrown away at the end, tuning it was the heaviest lift for the Triton dependency.
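As a sanity check, the generated pin can be inspected in the wheel's metadata. A minimal sketch, assuming torch is installed in the current environment (the exact requirement name may vary between builds):
# Sketch: check whether the installed torch wheel declares a Triton requirement.
# If Triton was missing at build time, no such requirement will appear.
from importlib.metadata import requires

triton_reqs = [r for r in (requires("torch") or []) if "triton" in r.lower()]
print(triton_reqs or "no triton requirement declared")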
MAGMA
Packaging the MAGMA library was straightforward, but once we integrated it into our PyTorch builds, we noticed that it had bloated our wheels by 4GB! We assessed our approach and determined that PyTorch had a particular way of integrating MAGMA, which involves source code patches and configuration. Tracking these customizations from release to release so we could consume a pre-built library from Chainguard OS was not a good use of our time. Instead, we opted to build MAGMA just as PyTorch does, and to do so as part of our PyTorch wheel build process.
Challenge #2: CUDA runtime dependencies and wheel compatibility
Upstream PyTorch takes an elegant approach to CUDA: it doesn't bundle the CUDA libraries with its wheels. Instead, it declares runtime dependencies on separate CUDA library wheels that users install from PyPI or elsewhere. These dependencies are pinned to the identical versions PyTorch was built and tested against. For parity, we opted to replicate this pattern.
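A rough sketch of how a user could compare those declared pins against what is actually installed, assuming a CUDA-enabled torch wheel is present in the environment:
# Sketch: list the CUDA library pins declared in torch's wheel metadata
# alongside the versions actually installed in this environment.
from importlib.metadata import PackageNotFoundError, requires, version

for req in requires("torch") or []:
    spec = req.split(";")[0].strip()          # drop environment markers
    if spec.startswith("nvidia-"):
        name = spec.split("==")[0].strip()
        try:
            print(f"{spec}  (installed: {version(name)})")
        except PackageNotFoundError:
            print(f"{spec}  (not installed)")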
The problem: Bridging APK and wheel ecosystems
Our build infrastructure uses CUDA libraries from internal APK packages, which are constantly moving forward and updating versions. However, at runtime, our PyTorch wheel users require a specific library version in wheel format. For our runtime pins to exactly match the versions we picked up at build-time, we needed to do some conversion.
Our solution: Automated APK-to-wheel translation
To solve this challenge, we built a translation layer that performs the following tasks:
Introspect APK package metadata at build time
Map APK names and versions to their PyPI wheel equivalents
Generate accurate wheel dependencies in PyTorch's metadata
Here's an example of the mapping logic in a shell script:
# Map a CUDA wheel name (e.g. nvidia-cublas-cu12) to the corresponding
# Chainguard OS APK package name.
whl_name_to_apk_name() {
  local whl="$1"
  echo "$whl" | sed -r \
    -e "s/-nccl-cu12/-nccl-cuda-${CUDA_VERSION}/" \
    -e 's/-cuda-runtime-/-cuda-cudart-/' \
    -e 's/-(cublas|cufft|cufile|curand|cusolver|cusparse|nvjitlink)-/-lib\1-/' \
    -e 's/-nvtx-/-cuda-nvtx-/' \
    -e 's/-cusparselt-cu([0-9][0-9])/-cusparselt-cuda-\1/' \
    -e 's/-cudnn-cu([0-9][0-9])/-cudnn-9-cuda-\1/' \
    -e "s/-cu[0-9][0-9]/-${CUDA_VERSION}/"
}

# Report the version of an APK package installed in the build environment,
# stripping the package name and the -rN revision suffix.
installed_apk_version() {
  local apk="$1"
  apk info -v | grep -E "^${apk}-([^-]+)-r[0-9]+$" | \
    sed -r "s/${apk}-([^-]+)-r[0-9]+$/\1/"
}
This ensures that our customers use the identical versions of these dependencies that we used when building and testing our packages.
The NCCL edge case
The NCCL source code is available, so we rebuild it from scratch for use as an internal build dependency. However, there's often a lag between when the project tags a release and when they publish the corresponding wheel to PyPI.
Our automation publishes APK updates within hours of a new upstream tag. This creates a window where PyTorch wheels might depend on an NCCL version that doesn't yet exist as a wheel, which breaks installations. For the time being, we pin to the latest NCCL wheel available on PyPI instead of tracking the latest upstream tag.
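A minimal sketch of the kind of availability check this requires, using PyPI's JSON API (nvidia-nccl-cu12 is shown as an example package name; a real implementation would live in our build automation):
# Sketch: find the newest NCCL release that actually has files published on
# PyPI, so the pin never points at a version that exists only as a source tag.
import json
import urllib.request

from packaging.version import Version  # third-party "packaging" library, commonly available

url = "https://pypi.org/pypi/nvidia-nccl-cu12/json"
with urllib.request.urlopen(url) as resp:
    releases = json.load(resp)["releases"]

available = [v for v, files in releases.items() if files]  # skip releases with no files
print(max(available, key=Version))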
Challenge #3: The clashing runtimes problem
With dependencies resolved, we began testing and immediately hit segmentation faults before any actual tests ran.
Debugging the crash
The minimal reproducer was surprisingly simple:
import torch._dynamo # Segfault
We used gdb to capture a backtrace of the crashing thread:
(gdb) bt
#0 0x0000000000000000 in ?? ()
#1 0x00007ffff7ca1ed3 in __pthread_once_slow (once_control=0x7fff24bfeaf8 <optree::GetCxxModule(std::optional<pybind11::module_> const&)::storage+8>, init_routine=0x7fff24ceb420 <__once_proxy>)
at ./nptl/pthread_once.c:116
#2 0x00007fff24b35274 in __gthread_once (__func=<optimized out>, __once=0x7fff24bfeaf8 <optree::GetCxxModule(std::optional<pybind11::module_> const&)::storage+8>)
at /usr/include/x86_64-linux-gnu/c++/13/bits/gthr-default.h:700
#3 std::call_once<pybind11::gil_safe_call_once_and_store<pybind11::module_>::call_once_and_store_result<optree::GetCxxModule(const std::optional<pybind11::module_>&)::<lambda()> >(optree::GetCxxModule(const std::optional<pybind11::module_>&)::<lambda()>&&)::<lambda()> >(std::once_flag &, struct {...} &&) (__once=..., __f=...) at /usr/include/c++/13/mutex:907
...
The more valuable clue came from what the other threads were doing:
#5 0x00007ffedacf1b8b in blas_thread_server () from /home/dann-frazier/noomp/lib/python3.13/site-packages/numpy/_core/../../numpy.libs/libscipy_openblas64_-ff651d7f.so
#5 0x00007fffa25514fb in ?? () from /home/dann-frazier/noomp/lib/python3.13/site-packages/torch/lib/../../torch.libs/libopenblasp-r0-2eadb500.3.30.so
We were vendoring an OpenBLAS library that conflicted at runtime with NumPy's copy, possibly due to clashing OpenMP runtimes. But OpenBLAS wasn't explicitly included in our PyTorch build and isn't vendored with upstream builds. It turned out to be present in our build environment, and the PyTorch build picked it up opportunistically. Removing OpenBLAS from our build environment should have solved the problem, but sadly, our test case was still segfaulting. Only now the crash looked different.
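Duplicate native runtimes like this are easy to spot once you know to look for them. A minimal sketch (Linux-only; the library name patterns are just the ones that appeared in our traces):
# Sketch: after importing the suspect modules, list every BLAS/OpenMP shared
# object mapped into this process; two copies of the same runtime is a red flag.
import re

import numpy   # noqa: F401  (loads NumPy's bundled OpenBLAS)
import torch   # noqa: F401

with open("/proc/self/maps") as maps:
    libs = {line.split()[-1] for line in maps
            if re.search(r"openblas|libgomp|libomp|libiomp", line)}

for lib in sorted(libs):
    print(lib)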
The second crash: Static libstdc++ strikes back
The new crash had a different signature — corruption in basic C++ initialization code, before any PyTorch-specific logic ran:
Thread 1 (Thread 0x7ffff7eb3740 (LWP 102148) "python"):
#0 0x00007fffe98a0c24 in std::codecvt<char16_t, char, __mbstate_t>::do_unshift(__mbstate_t&, char*, char*, char*&) const () from /lib/python3.13/site-packages/torch/lib/libtorch.so
#1 0x00007fffe9907d1d in std::basic_ostream<char, std::char_traits<char> >& std::basic_ostream<char, std::char_traits<char> >::_M_insert<long>(long) () from /lib/python3.13/site-packages/torch/lib/libtorch.so
...
To achieve compatibility with older Linux distributions, our wheels comply with the manylinux standard. To avoid maintaining the complicated custom toolchains used by the reference manylinux build images, we had decided to accept an increase in binary file sizes and statically link a copy of libstdc++ into our native C++ wheels.
We theorized that statically linking libstdc++ into each shared library creates isolated C++ runtime instances that conflict when loaded into the same process.
Building a reproducer
To validate this theory, we used Claude to create a small tool to load Python modules that link with libstdc++, either statically or dynamically, and initialize iostream:
Step 1: Loading module1 (dynamic libstdc++)...
Module1: Using iostream from system libstdc++
Result: Module1 initialized successfully
Step 2: Loading module2 (static libstdc++)...
This should crash due to iostream initialization conflict...
Segmentation fault
Loading any such combination of modules into the same Python process reproduced the crash.
The problem: libstdc++ uses global constructors for iostreams (std::cout, std::cerr). When the second static libstdc++ initializes, it reinitializes these globals, corrupting the first instance's state.
The solution: All libraries in a process must share a single libstdc++ runtime (the system-provided one).
This identified a fundamental architecture issue with our native C++ wheels, and forced us to reconsider our use of static libstdc++.
Challenge #4: Achieving manylinux compatibility without static linking
The manylinux standard exists to ensure Python wheels work across many Linux distributions. It specifies the maximum versions of system libraries that wheels can depend on.
For manylinux_2_28, that means glibc 2.28 (released in 2018) and libstdc++ from GCC 8.
The industry standard: Custom toolchain
The reference manylinux build images use heavily patched GCC versions based on Red Hat vendor branches. These patches allow applications to use modern libstdc++ functionality while retaining compatibility with old libstdc++ libraries, by creating a split libstdc++ model:
libstdc++.so (from GCC 8) — Contains symbols safe for old systems
libstdc++_nonshared.a (from newer GCC) — Contains newer symbols
Using this split, a custom linker script first resolves safe symbols against the old shared library, then statically links any newer symbols the application uses from the non-shared archive directly into the binary. We decided to adopt the same approach: we ported these compatibility patches to the upstream toolchain and developed several regression tests to ensure we exposed the expected library symbols.
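One of those regression checks can be approximated in a few lines. A rough sketch that scans a built shared object for the libstdc++ version tags it references (the path is hypothetical, and tools like auditwheel perform this kind of audit properly):
# Sketch: report the GLIBCXX/CXXABI version tags found in a shared object, to
# compare against what the GCC 8-era libstdc++ assumed by manylinux_2_28 provides.
import re
import sys

path = sys.argv[1] if len(sys.argv) > 1 else "torch/lib/libtorch_python.so"  # hypothetical default
with open(path, "rb") as f:
    blob = f.read()

tags = sorted(set(re.findall(rb"(?:GLIBCXX|CXXABI)_[0-9][0-9.]*", blob)))
print(b"\n".join(tags).decode())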
Challenge #5: Cost-effective test infrastructure
With builds working, we used the upstream test suite to validate parity. PyTorch's unit test suite is extensive, but also resource-intensive.
The scale problem
A complete test run on a single GPU instance clocked in at over 30 hours. Our initial delivery plan:
4 Python versions (3.10, 3.11, 3.12, 3.13)
3 CUDA variants (12.6, 12.8, 12.9)
2 wheel sources (ours and upstream, for comparison)
That's 24 test runs at 30 hours each — totaling 720 hours, or 30 days of compute. Some tests require multiple GPUs — up to 8. If we were to run all of the tests on 8xGPU instances, it would be very expensive ($31.34/hr x 720 = $22,564!). And we'd have to repeat that for each new PyTorch version.
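For reference, the back-of-the-envelope arithmetic behind those figures:
# The arithmetic behind the exhaustive-testing estimate above.
python_versions = 4     # 3.10, 3.11, 3.12, 3.13
cuda_variants = 3       # 12.6, 12.8, 12.9
wheel_sources = 2       # our wheels and upstream wheels
hours_per_run = 30
rate_8x_gpu = 31.34     # USD/hr for an 8xGPU instance

runs = python_versions * cuda_variants * wheel_sources   # 24
hours = runs * hours_per_run                             # 720
print(f"{runs} runs x {hours_per_run} h = {hours} h, ~${hours * rate_8x_gpu:,.2f}")
# -> 24 runs x 30 h = 720 h, ~$22,564.80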
Our solution: Strategic test matrix optimization
We designed a test plan that achieves complete coverage while minimizing cost. First, we optimized our GPU selection: we switched from Ampere (two generations old) to T4 (three generations old) for most of the tests. While Ampere instances must run on hardware with the GPUs physically attached, T4 GPUs can be attached dynamically to generic instance types. We used n1-standard-8 instances, which currently average $0.74/hr with 1 T4 attached. Even a full 720 hours here would be vastly less expensive, around $532.
We also split multi-GPU (distributed) tests out, so we don't need to run everything on multi-GPU instances. We bumped our T4 instances up to the maximum T4 GPU count of 4, which raises their cost to $1.81/hr. These runs took about 8 hours; running them across all combinations would cost around $350.
Finally, we adjusted our approach so that we are still testing each variable in our wheel matrix, but we avoided exhaustive testing of each combination. That is, at least one of our Python 3.13 wheels and at least one of our CUDA 12.9 wheels received full testing, but we avoided full testing for every Python version/CUDA combination.
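A toy illustration of that covering idea, using the matrix above (this is a sketch, not our actual test scheduler):
# Sketch: pick a small set of (Python, CUDA) combinations that still exercises
# every Python version and every CUDA variant at least once, instead of
# giving the full cartesian product full testing.
pythons = ["3.10", "3.11", "3.12", "3.13"]
cudas = ["12.6", "12.8", "12.9"]

full_matrix = [(p, c) for p in pythons for c in cudas]                    # 12 combos
covering = [(p, cudas[i % len(cudas)]) for i, p in enumerate(pythons)]    # 4 combos

print(f"full matrix: {len(full_matrix)} combos, covering set: {len(covering)}")
for combo in covering:
    print(combo)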
We ran into several other issues along the way (GPU instance scarcity, tests that hang indefinitely on T4s, and so on), but running the upstream test suite was worth it: it revealed at least three real parity issues that we may not have caught otherwise.
The path forward
These challenges fundamentally shaped our approach for Chainguard Libraries. The tooling we developed for building PyTorch, and the methodology for integrating dependencies and testing large projects at scale, now benefit every wheel with native dependencies we build.
More importantly, this work demonstrates that achieving true upstream parity requires understanding the why behind every upstream decision, then finding architecturally sound ways to achieve the same goals within our infrastructure.
The result: PyTorch wheels that match upstream feature-for-feature, pass the full test suite, and maintain the binary compatibility guarantees users expect — all built from our fully auditable supply chain.
Building this was hard. Building it with the confidence that our customers could seamlessly drop our builds into their existing deployments was harder. But that's precisely the kind of challenge that makes Chainguard's approach to software supply chain security valuable.
Want to learn more? Explore Chainguard Libraries or contact our team to discuss your Python library needs.