Writing software has been compared to gardening: a codebase tends to evolve and morph, and a software developer has to heed the natural tendencies of a codebase’s inhabitants. We wondered how far the analogy applies to open source software packages and applications, the digital equivalents of plants and gardens. Do some open source packages, akin to the Virginia Creeper vine, grow so quickly that they actually pose risks to the garden in which they grow? And could there be other packages more like the ornamental English Boxwood shrub, growing slowly and posing relatively little risk?
The analogy suggested itself to us because of the growing recognition that adding dependencies, whether imported directly or indirectly, comes with costs, including security risks. For instance, one 2021 analysis investigated nearly 200,000 open source Python packages and found security issues in nearly 46% of these packages. Another article from 2020 scanned several popular open source software registries for malware and found nearly three hundred confirmed malicious open source software packages that accounted for 300,000+ downloads. Because each new dependency adds risk, we thought that software developers and software security teams would be interested in an estimate of how fast dependency trees grow.
To determine whether popular open source software packages are more like Virginia Creepers, potentially growing out of control, or staid English Boxwoods, we analyzed the dependency data of widely used open source software packages across five programming language ecosystems. Using data from deps.dev, a new Google-released open source software dependency dataset, we found:
Npm and PyPI Are Like a Grove of English Boxwoods
Judging by the latest version of each package, npm and Python Package Index (PyPI) packages grow relatively slowly. The widely depended-upon packages in these ecosystems tend to have few dependencies, often fewer than five. Although a fast-growing minority of packages does exist in these ecosystems, their dependency trees, in general, appear to hardly grow.
Go, Maven, and Cargo Are More Like Virginia Creepers
These ecosystems’ widely depended-upon packages often have relatively more dependencies, often in the low tens, and there is generally a slow increase in the size of the dependency trees in these ecosystems. It’s not weed-like, but it’s not nothing.
Fortunately, this analysis finds that widely depended-upon packages in these ecosystems have relatively modest growth that doesn’t rise to Virginia Creeper levels. Nevertheless, the fact that dependency trees grow even among widely depended-upon packages suggests that developers would benefit from tools that alert them when new dependencies appear.
Measuring Dependency Tree Growth
Measuring the size of an open source package’s dependency tree requires building a dataset that associates a count of dependencies with each package version. Fortunately, Google’s recent release of deps.dev lets us generate exactly this data with queries. The query we used for the analysis below can be found here.
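To make "a count of dependencies" concrete, here is a minimal Python sketch that counts every distinct package reachable from one package version, whether imported directly or indirectly. It is an illustration only, not the deps.dev query used for this analysis: the `count_dependencies` helper, the dictionary layout, and the package names are all hypothetical.

```python
from collections import deque


def count_dependencies(graph, root):
    """Count distinct packages reachable from root, excluding root itself.

    graph is an assumed layout: package name -> list of direct dependencies.
    """
    seen = {root}
    queue = deque([root])
    while queue:
        package = queue.popleft()
        for dep in graph.get(package, ()):
            if dep not in seen:  # count shared dependencies once; tolerate cycles
                seen.add(dep)
                queue.append(dep)
    return len(seen) - 1


# Hypothetical resolved graph for a single package version.
example_graph = {
    "app": ["http-client", "logger"],
    "http-client": ["url-parser", "logger"],
    "url-parser": [],
    "logger": [],
}

print(count_dependencies(example_graph, "app"))  # 3
```

Here "logger" is reached twice but counted once, so the total reflects the number of distinct packages pulled in rather than the number of import edges.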
Figure 1. Number of Packages by Total Dependency Count by Ecosystem Using Latest Version of Each Package
PyPI and npm appear to have widely depended-upon packages with relatively small dependency trees, usually with fewer than ten dependencies. The other ecosystems, especially Cargo, have more packages with larger dependency trees.
We then assessed the growth of these dependency trees. The analysis included all dependencies for a given package version, both direct and indirect, and focused on those packages that are most widely depended on within an ecosystem. Importantly, the analysis only uses non-development package versions (or “ordinal versions”) so that early, non-production releases and experimental releases are excluded. Because there is so much variability, the analysis below also presents the findings by percentile, with different shades indicating different percentiles. Finally, this analysis only used packages with at least 25 or at least 50 versions available, to avoid analytical problems introduced by systematic differences between packages with only a few versions and those with many versions.
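The filtering and percentile steps described above can be sketched roughly as follows. This is a sketch under stated assumptions, not the code behind the figures: `percentile_bands`, the `histories` layout (package name mapped to its total dependency counts, ordered by ordinal version), and the choice of the 10th, 50th, and 90th percentiles are all hypothetical.

```python
from statistics import quantiles


def percentile_bands(histories, min_versions=25, percentiles=(10, 50, 90)):
    """Return, for each of the first min_versions versions, the requested
    percentiles of dependency counts across the packages that qualify."""
    # Keep only packages with enough versions, truncated to a common length.
    kept = [h[:min_versions] for h in histories.values() if len(h) >= min_versions]
    if len(kept) < 2:
        raise ValueError("need at least two packages with enough versions")
    bands = []
    for counts in zip(*kept):  # one tuple of counts per version position
        cuts = quantiles(counts, n=100, method="inclusive")  # 99 cut points
        bands.append({p: cuts[p - 1] for p in percentiles})
    return bands


# Hypothetical data; "pkg-d" has too few versions and is dropped.
histories = {
    "pkg-a": [1, 1, 2],
    "pkg-b": [3, 4, 5],
    "pkg-c": [5, 7, 9],
    "pkg-d": [2, 2],
}

for position, band in enumerate(percentile_bands(histories, min_versions=3), start=1):
    print(position, band)
```

Truncating every qualifying package to the same number of versions keeps each percentile computed over the same set of packages, which is one way to avoid mixing packages with few versions and packages with many.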
Figure 2 presents an analysis of dependency growth for only those widely depended-upon packages with at least 25 versions. The sub-graph title indicates for each ecosystem how many of the top 100 most depended-upon packages contained at least 25 versions.
Figure 2. Dependency Tree Growth Across First 25 Versions for Widely Depended-Upon Packages by Ecosystem
The Cargo ecosystem shows noticeable growth in the top percentile. Cargo, Go, and Maven also show some growth in the middle percentiles, sometimes by 100% over the course of 25 versions. For npm and PyPI, growth appears only in the 90th percentile.
Figure 3 presents a similar analysis of dependency growth, this time including only those packages with at least 50 versions. This analysis is included to guard against the possibility that 25 versions are too few to observe any noticeable trends.
Figure 3. Dependency Tree Growth Across First 50 Versions for Widely Depended-Upon Packages by Ecosystem
With the caveat that fewer widely depended-upon packages in each ecosystem have 50 versions than have 25, an interesting finding emerges. The dependency tree growth curves mostly level out, with a few notable exceptions such as the early versions in Cargo and the late versions in Maven, suggesting that the Virginia Creeper analogy is too strong for these ecosystems. There is also some growth in the top percentiles for npm and PyPI, which cautions against labeling these ecosystems as solely Boxwood territory.
It’s also worth pointing out the possibility that these results exhibit a selection bias. The most popular packages are necessarily towards the bottom of most dependency graphs: any of their dependencies would likely be even more popular. Future analyses might examine a representative sample of applications and investigate the most common direct dependencies, those packages most important to a typical developer. Nevertheless, this work helps characterize the dependency growth of these bottom-most, popular packages.
In sum, npm and PyPI’s widely depended-upon packages have relatively little dependency tree growth, while Go, Maven, and Cargo have notably more. Fortunately, according to our analysis of the deps.dev data for widely depended-upon packages, this growth is mostly at a measured, English Boxwood-like pace. Nevertheless, the steady growth of these dependency trees suggests that software gardeners can’t skip pruning!
Cultivate Your Garden
These findings suggest that pragmatic programmers truly ought to view their dependency trees as akin to garden plants. Their organic growth means that software teams should, absent exceptional diligence, expect their software application dependency trees to grow since the underlying dependencies are themselves growing. This growth creates new security risks, increasing the probability of unintentional vulnerabilities and malicious compromises in final software applications. Software teams could therefore benefit from tools that surface new indirect dependencies.
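As a rough illustration of what such a tool might do, the following Python sketch compares the resolved dependency sets of two builds and reports which newly added packages are indirect. The function name, its inputs, and the package names are hypothetical; a real tool would read these sets from a lockfile or a resolver rather than from hard-coded values.

```python
def new_dependencies(old_resolved, new_resolved, direct):
    """Split packages that appear only in new_resolved into
    (new_direct, new_indirect), each as a sorted list of names."""
    added = set(new_resolved) - set(old_resolved)
    declared = set(direct)  # packages the application itself declares
    return sorted(added & declared), sorted(added - declared)


# Hypothetical resolved dependency sets before and after an upgrade.
before = {"web-framework", "template-engine"}
after = {"web-framework", "template-engine", "yaml-parser", "string-utils"}
declared_by_app = {"web-framework", "yaml-parser"}

new_direct, new_indirect = new_dependencies(before, after, declared_by_app)
print("new direct:", new_direct)      # ['yaml-parser']
print("new indirect:", new_indirect)  # ['string-utils']
```

Run in continuous integration, a check like this could flag "string-utils" for review even though no developer explicitly added it.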