
This Shit is Hard: Java Archeology at a Massive Scale

Manfred Moser, Senior Principal Developer Relations Engineer, Jason van Zyl, Senior Manager, Engineering, Sam Dacanay, Staff Software Engineer, and Arkadiusz Czajkowski, Senior Software Engineer

Our Chainguard Factory runs the Chainguard Container builds securely and reliably around the clock. Built and maintained by our experts, it is now taking on another truly crazy and courageous project: Chainguard Libraries, specifically Chainguard Libraries for Java. Normally, you have to trust thousands of maintainers and release managers, their workstations and CI pipelines, and the transportation of the binaries to public repositories, and eventually, to you. With Chainguard Libraries, that is no longer the case. We make it simple: The Chainguard Factory handles it all.


Chainguard Libraries for Java represents our commitment to securing the JVM library supply chain, with a focus on all open source components on the Maven Central Repository. Lovingly called “Central,” it is one of the largest and longest-running repositories of open source and other binary artifacts, and it is the de facto standard repository for all build tools in the Java ecosystem and community. The scale of Central is astounding, and at nearly 20 years old, it contains some amazing historical stuff. Some of the components are unfortunately long past their due dates, but are still in use. No other repository for JVM libraries comes close in significance, and covering a majority of libraries from Central presents a good starting point for a comprehensive supply chain.


Securing the supply chain of all these JVM components from Central means that we have to follow the same approach taken with our Chainguard Containers product. We go straight to the source and replace the whole supply chain for our customers.


This might sound simple on the surface, but it is definitely not. In this latest post from our “This Shit is Hard” blog series, we dig into the details of how we’re doing it.


How to start with millions of libraries?


The first question we had to answer, looking at the sheer size of the effort, was: “Where do we even start?” Since there is limited information available about downloads and usage of libraries on Central, we had to rely on known data from the artifacts themselves. We looked at the dependency graph of the whole repository and determined “popular” libraries by the number of other libraries depending on them. We also started with newer libraries from the last 5 years, and left out older libraries that rely on creaky old Java 8 or other outdated releases. We settled on starting with 20,000 (!!) libraries and library versions that are the most heavily depended upon. This turns out to provide coverage for over 90% of the artifacts commonly needed for building Java applications. Finding that reasonable and effective starting point was tricky, and involved a whole lot of complex analytics. By now, we are way beyond that count with over 53,000 Maven projects and over 540,000 artifacts, and we are moving to cover all libraries needed by our customers.


In addition to a first set of libraries to build, we had the advantage of the Chainguard Factory. It runs full throttle all day building Chainguard Containers and the needed C libraries and projects, consuming new versions, analyzing new security issues, and sorting out dependencies. Beyond building containers and their underlying packages, the Chainguard Factory now builds language libraries too. Now we “just” had to scale this to building Java-based libraries.


Metadata—are you kidding me?


This initial set of libraries gave us our first task. Next, we started monitoring the feed of new projects and versions deployed to Central, so we could keep our inventory up to date and ready for use by our customers and our own builds. Now we had a growing list of libraries to build, but what we knew about them was… not much, really. We knew the location in Central and the type of file, which is most often a Java Archive (JAR).


Thanks to the Maven repository format, we can tell a few things pretty quickly. For example, all files in one repository directory belong to a single release, such as version `3.17.0` of the `commons-lang3` artifact in the `org.apache.commons` groupId. Typically, you use the JAR file in that folder, but there is also always the project object model (POM) file. It is renamed from the original `pom.xml` in the source code to a `.pom` file in the repository, and it contains a wealth of metadata. At least, it should, and there is some verification of this metadata when releases are deployed to Central. Unfortunately, the reality is that the metadata can be spotty, faulty, or just very difficult to locate. Our experts figure it out and extract the source code location.
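
To make the layout concrete, here is a minimal sketch using the `commons-lang3` coordinates mentioned above: the Maven repository format maps the groupId, artifactId, and version of a release directly to a directory path.

<!-- The coordinates of a release... -->
<groupId>org.apache.commons</groupId>
<artifactId>commons-lang3</artifactId>
<version>3.17.0</version>
<!-- ...map to the directory org/apache/commons/commons-lang3/3.17.0/
     in the repository, which holds commons-lang3-3.17.0.jar,
     commons-lang3-3.17.0.pom, and the related checksum and signature files. -->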


Source code - where art thou? 


Theoretically, we can look at the pom file (in this case, the one for `commons-lang3` version `3.17.0`) and see the specific section about the version control system:


<scm>
  <connection>scm:git:http://gitbox.apache.org/repos/asf/commons-lang.git</connection>
  <developerConnection>scm:git:https://gitbox.apache.org/repos/asf/commons-lang.git</developerConnection>
  <url>https://gitbox.apache.org/repos/asf?p=commons-lang.git</url>
  <tag>rel/commons-lang-3.17.0</tag>
</scm>

In this example, we have enough information, although there are some wrinkles. The source repository URL is actually a redirect: the Apache GitBox repositories are mirrored on GitHub, where the source now lives, and the `rel/commons-lang-3.17.0` tag marks the specific release. This gives us enough information to find the source.


With other projects, we are not so lucky, and we found numerous issues, even though concentrating on newer libraries meant we could narrow things down to Git repositories for now:


  • Source code hosting system is no longer operational: Do you remember Google Code or java.net? Amazingly, SourceForge is still around! Without source code available on any public server, we are not building the library. In some cases the code was just renamed or moved, which required us to perform some detective work.

  • Missing tag in the repository: When a developer builds the release locally and forgets to push the tags, you get the info in the pom file, but not in the source code repository. At a minimum this requires manual intervention to potentially find the right commit.

  • Fake tags: Some build tools can’t correctly handle the SCM tagging, and as a workaround, developers just add some static junk data in the pom (see the example after this list). This passes deployment to Central since there is no strict validation in place. If the release is still valid, we need to manually find the commit.

  • Invalid tag: Even with good intentions, a tag is sometimes created and pushed, but then a minor issue is found and fixed. As a result, the release binary may be manually cut from a later commit than the tag, or the build process might have a bug that causes the release to be built from a different commit than the one listed in the URL. We even found instances where the name of the tag and the version seem to be entirely unrelated. Again, this requires manual detective work to find the right commit.
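
To make the “fake tags” case concrete, here is an illustrative scm section of the kind we run into. The repository URL is hypothetical, and the tag element holds a placeholder like `HEAD` instead of a real release tag:

<scm>
  <connection>scm:git:https://github.com/example/some-library.git</connection>
  <!-- Placeholder instead of a real release tag; the matching commit has to be found manually -->
  <tag>HEAD</tag>
</scm>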


Beyond all these challenges we are tackling, the next step is to look beyond Git. Depending on how far back we go for libraries, we might need to handle Subversion (svn), Mercurial (hg), and other version control systems. We have used them all in the past and will be able to manage.


For some libraries, we are not able to trace back to the original source code because it was either never published or is no longer available. In these cases, we do not provide that library. This is deliberate: we provide our customers a complete and secure supply chain from source. No source, no secure supply chain, no library.


Assuming we can locate the source code, our next easy step awaits: run the build.


All build tools are terrible


In the Java ecosystem, there are many build tools and languages. To support building everything from source, we had to build a meta tool that takes over once the source is available. It needs to gather all the requirements for the build, such as the JDK version, figure out the build tool in use, and then determine the right invocation.
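
For Maven projects, some of these requirements can show up directly in the pom file, if the project declares them at all, which is far from guaranteed. An illustrative properties block might look like this:

<properties>
  <!-- Declares the Java release the project is compiled against -->
  <maven.compiler.release>17</maven.compiler.release>
</properties>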


For build tools, we started with support for Apache Maven, Apache Ant, Gradle, and sbt, and we are prepared to tackle others. We provision the right tool, set up the JVM, try to determine the right configuration for the JVM and the build tool, and then run the build. Depending on the project setup, this can be a simple call to the Maven wrapper with all configuration available in the project, or it can be a long-winded process to figure out how this was ever built in the past. The further back we go, the weirder things get.


For example, in one project, we found a pom.xml file, so our tool automatically assumed that the project was a functioning Maven setup. Well, that was wrong. At that point in time, the pom file was a hollow shell of metadata, and the actual build needed to use the build.xml file from Ant. Unfortunately, the build went ahead and produced an invalid, small JAR file with no useful content. Luckily, our testing caught it, and we now produce a nice and sizable JAR.


Other problems we found with Maven builds include undocumented setup, required system properties, requirements for specific, often old Maven versions, and other complexities that cannot be easily figured out by an automated system.


For projects that originally build with Gradle or Ant, we continue to face challenges in automatically reproducing how things are supposed to work. Gradle build scripts use Groovy or Kotlin to define how the build works, and Ant uses XML. Both are pretty much freeform, with few standards, less metadata, and a lot more, often unnecessary, complexity.


Without experts on the team with significant background in these sorts of builds, we would never be able to provide any useful level of coverage of the needed JARs from Central.


What are we even building? 


So, are we just building JAR files and deploying them and the related pom files? Not really. We also build JARs with embedded dependencies, executable JARs, WAR files, tarballs, and other binaries—whatever the original build creates. The problem is that all these different artifacts need their own build requirements fulfilled, and their own tests to run successfully—typically not an easy task.


Beyond that we are also creating files with detailed information about the build, the build infrastructure, the source code location, and the software bill of materials (SBOM) details for the library. We sign them all and provide the necessary signature and certificate resources. Every library can be traced back to the original source code.


Dependencies - oh my!


In the end, it is nice to build a library successfully, but that is really not the end at all. From our first look at the pom files, we can also learn what the dependencies of each project are, and queue up builds for all of these upstream dependencies. And, of course, these dependencies typically have their own dependencies. You can imagine how that quickly escalates. It means we need to build a whole lot of stuff. And that is only to build up our inventory.
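
To give a sense of the fan-out, here is an illustrative dependencies section from a pom file, with made-up coordinates. Every entry is another project we have to build from source, and every one of those projects declares dependencies of its own:

<dependencies>
  <!-- Each dependency is itself a project we need to build from source -->
  <dependency>
    <groupId>org.example</groupId>
    <artifactId>example-http-client</artifactId>
    <version>2.4.1</version>
  </dependency>
  <!-- ...and each one brings its own dependencies section, so the build queue grows quickly -->
  <dependency>
    <groupId>org.example</groupId>
    <artifactId>example-json</artifactId>
    <version>1.9.0</version>
  </dependency>
</dependencies>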


Time does not stand still, and Central gets new releases deployed all day, every day. We monitor the feed of these new releases and add them to the queue. As a customer, it’s very likely you’ll find the binary already in our Chainguard Libraries for Java repository when you update your dependencies.


However, if you are so fast to adopt that an artifact is not yet there, or you are so far behind that we have not built that very old library yet—your build requests it automatically! We track every request to our repository. Any request we can’t fulfill is recorded, and the resulting artifact identifiers are queued for building.


You have been warned


All of this setup ends up being very powerful, very expensive, super interesting to work on, and also reliant on development and operation by true experts across a whole range of domains, programming languages, build tools, version control systems, and historical knowledge. Chainguard has all the necessary ingredients, from the Linux kernel up to the Java libraries and beyond.


Now, if you want to secure your software supply chain from source, you know what’s ahead for you. So go on and build it yourself—or contact us and we can give you a demo of Chainguard Libraries for Java, and you won’t have to bother with all the craziness of Java archeology. We are doing it for you.


Java is not all we are doing of course, so in a future post, we will tell you more about the fun we have with Python libraries. It’s going to be juicy.
