This Shit Is Hard: Hardening glibc

Sergio Durigan Junior, Senior Software Engineer

Chainguard’s “This Shit is Hard” series showcases the difficult engineering work we’ve tackled to deliver best-in-class outcomes for customers using our products. We’ve covered several important topics, including the Chainguard Factory, Chainguard Libraries for Java, our integrations with several scanner partners, and our implementation of SLSA Level 3. Today, we’re discussing hardened compiler flags and the GNU C Library.


It should come as no surprise that we take security very seriously here at Chainguard. This can manifest in various ways, such as our commitment to provide zero-CVE container images to our customers or the vast automation we put in place to make sure our packages are constantly being upgraded with the latest upstream fixes. But security work does not involve only mitigation; it pays to be proactive and take preventative measures that can block bad actors from doing harm. One of these measures is the adoption of hardened compiler flags when building our packages.


In this blog post, I will tell a short story about our attempts to compile one of our most important libraries, the GNU C Library (glibc), with hardened flags. It was an interesting task, full of challenges and hidden knowledge, which allowed us to work with the upstream GNU Compiler Collection (GCC) and glibc projects to resolve complex issues and ultimately ship a hardened version of the library.


Our Toolchain and Build Flags


According to Wikipedia, a toolchain is “a set of software development tools used to build and otherwise develop software.” In the context of Chainguard packages, you can think of a toolchain as the compiler (GCC, for example) and its accompanying tools like the linker, debugger, etc. At Chainguard, we maintain toolchains for many languages (Golang, Rust, Python, C/C++, Java, and many more), but for this specific blog post, we are interested in the C toolchain. And because the standard C library we use, glibc, is such an integral part of any GNU/Linux system, we consider it part of our toolchain as well.


One of our most important proactive measures is to compile our packages using hardened compiler flags. These flags usually come from OpenSSF’s Compiler Options Hardening Guide for C and C++, and we follow their recommendations as closely as possible. Other major GNU/Linux distributions only adopt a small subset of OpenSSF’s suggested flags. This is yet another way in which Chainguard stays at the forefront of security practices.


The build flags we use when compiling C/C++ programs are a mix of general good practices and CPU-specific hardening flags. For example, using -z noexecstack when linking a program marks its stack memory as non-executable, which blocks attacks that rely on executing code injected into the stack. On x86 and ARM64 systems, we use -fcf-protection=full and -mbranch-protection=standard, respectively. These flags instrument the generated code so that the CPU can detect and block control-flow hijacking, the technique behind return-oriented programming (ROP) and jump-oriented programming (JOP).
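One way to confirm that a translation unit was really built with these options is to look at the macros the compiler predefines. The sketch below assumes GCC or Clang: __CET__ is predefined on x86 when -fcf-protection is active, and __ARM_FEATURE_BTI_DEFAULT and __ARM_FEATURE_PAC_DEFAULT are predefined on ARM64 when -mbranch-protection is active. The function names are hypothetical and exist only for illustration.

```c
#include <stdio.h>

/* Hypothetical helper: describes the control-flow hardening that the
   compiler enabled for this translation unit, judging only by the
   macros it predefines.  */
const char *
control_flow_hardening (void)
{
#if defined (__CET__)
  /* x86: bit 0 means indirect branch tracking, bit 1 means shadow
     stack.  -fcf-protection=full sets both.  */
  return (__CET__ & 3) == 3 ? "x86 CET: branch tracking and shadow stack"
                            : "x86 CET: partially enabled";
#elif defined (__ARM_FEATURE_BTI_DEFAULT) || defined (__ARM_FEATURE_PAC_DEFAULT)
  /* ARM64: branch target identification and/or pointer authentication.  */
  return "ARM64 branch protection (BTI and/or PAC)";
#else
  return "none detected";
#endif
}

/* Hypothetical helper: prints the result, for example from main ().  */
void
print_hardening (void)
{
  printf ("control-flow hardening: %s\n", control_flow_hardening ());
}
```

Calling print_hardening () from main () in a program built with gcc -fcf-protection=full (x86) or gcc -mbranch-protection=standard (ARM64) should report the corresponding protection. A build without those flags may still report it if the toolchain enables them by default. The linker option from the paragraph above is usually passed through the compiler driver as -Wl,-z,noexecstack.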


The glibc Problem


Almost every package written in C or C++ that we ship to our users is compiled using our recommended hardened flags. The few exceptions usually happen because we need to disable one specific flag to fix an incompatibility with upstream. But glibc has always been the notable exception, and this is no surprise.


The GNU C Library is a foundational package for us. It is present in basically every image we offer, and is a dependency of the vast majority of our packages. We treat it uniquely when releasing new versions or making changes to it, because a bug in glibc can suddenly become a problem in all of our images.


In our first attempt to compile glibc using hardened flags, we discovered that some applications would start crashing when they used the hardened glibc. Our engineers were able to obtain a stack trace of the crash, but unfortunately, it did not provide much information about what was going on. We decided to revert the introduction of the hardened flags when compiling glibc until we could fully understand the issue. Even with the setback, we recognized how much all of our images would gain from a hardened glibc, and we weren't ready to give up.


The New Attempt and Investigation


After talking to the team, I decided to investigate the problem and understand what could be done to fully enable the hardened flags for glibc. I love debugging hard problems, and have some toolchain background, so I was excited to see how far I could take this task. Although the ultimate goal was to enable all hardened flags, it would already be a win if we could enable a subset of them.


I started by trying to reproduce the problem. Fortunately, I already had one simple test to trigger the failure. All I needed to do was install the py3-matplotlib package and then, inside a system running a hardened glibc, invoke:


$ python3 -c 'import matplotlib'

This would result in an abort with a core dump. I followed the steps, and saw that the crash happened as expected. With that, I could use GDB to obtain the stack trace:


#0  0x00007c43afe9972c in __pthread_kill_implementation () from /lib/libc.so.6
#1  0x00007c43afe3d8be in raise () from /lib/libc.so.6
#2  0x00007c43afe2531f in abort () from /lib/libc.so.6
#3  0x00007c43af84f79d in uw_init_context_1[cold] () from /usr/lib/libgcc_s.so.1
#4  0x00007c43af86d4d8 in _Unwind_RaiseException () from /usr/lib/libgcc_s.so.1
#5  0x00007c43acac9014 in __cxxabiv1::__cxa_throw (obj=0x5b7d7f52fab0, tinfo=0x7c429b6fd218 <typeinfo for pybind11::attribute_error>, dest=0x7c429b5f7f70 <pybind11::reference_cast_error::~reference_cast_error() [clone .lto_priv.0]>)
    at ../../../../libstdc++-v3/libsupc++/eh_throw.cc:93
#6  0x00007c429b5ec3a7 in ft2font__getattr__(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) [clone .lto_priv.0] [clone .cold] () from /usr/lib/python3.13/site-packages/matplotlib/ft2font.cpython-313-x86_64-linux-gnu.so
#7  0x00007c429b62f086 in pybind11::cpp_function::initialize<pybind11::object (*&)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >), pybind11::object, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, pybind11::name, pybind11::scope, pybind11::sibling>(pybind11::object (*&)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >), pybind11::object (*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&)::{lambda(pybind11::detail::function_call&)#1}::_FUN(pybind11::detail::function_call&) [clone .lto_priv.0] ()
   from /usr/lib/python3.13/site-packages/matplotlib/ft2font.cpython-313-x86_64-linux-gnu.so
#8  0x00007c429b603886 in pybind11::cpp_function::dispatcher(_object*, _object*, _object*) () from /usr/lib/python3.13/site-packages/matplotlib/ft2font.cpython-313-x86_64-linux-gnu.so
...

This was the same stack trace that had been obtained by the other engineers working on this issue, which is good. It meant that I was seeing the same problem, and not something different. I remember thinking that it was strange to see the abort function being called right after _Unwind_RaiseException, but I didn’t pay much attention to it. What I wanted was to find another package that was crashing with the hardened glibc so that I could compare the stack traces, and I readily found that the Emacs text editor was also having the same problem. With Emacs, this is the stack trace I got:


#0  0x00007eede329972c in __pthread_kill_implementation () from /lib/libc.so.6
#1  0x00007eede323d8be in raise () from /lib/libc.so.6
#2  0x00007eede322531f in abort () from /lib/libc.so.6
#3  0x00007eede262879d in uw_init_context_1[cold] () from /usr/lib/libgcc_s.so.1
#4  0x00007eede2646e7c in _Unwind_Backtrace () from /usr/lib/libgcc_s.so.1
#5  0x00007eede3327b11 in backtrace () from /lib/libc.so.6
#6  0x000059535963a8a1 in emacs_backtrace ()
#7  0x000059535956499a in main ()

Now the crash is happening inside _Unwind_Backtrace, which means that a pattern emerged! This must have something to do with stack unwinding (or so I thought… keep reading to discover the whole truth). You see, the backtrace function (yes, it’s a function) and C++’s exception handling mechanism use similar techniques to do their jobs, and it pretty much boils down to unwinding frames from the stack.
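To make the connection concrete, here is a minimal sketch that calls libgcc's unwinder directly through _Unwind_Backtrace, the same entry point that appears in frame #4 of the Emacs trace. It assumes the <unwind.h> header shipped with GCC or Clang. The names count_frame and stack_depth are hypothetical.

```c
#include <unwind.h>

/* Callback that the unwinder invokes once for every frame it can
   walk.  ARG points to the running frame counter.  */
static _Unwind_Reason_Code
count_frame (struct _Unwind_Context *context, void *arg)
{
  int *depth = arg;

  (void) context;          /* A real tool would call _Unwind_GetIP here.  */
  ++*depth;
  return _URC_NO_REASON;   /* Tell the unwinder to keep walking.  */
}

/* Hypothetical helper: returns how many stack frames the unwinder
   was able to visit, starting from this function.  */
int
stack_depth (void)
{
  int depth = 0;

  _Unwind_Backtrace (count_frame, &depth);
  return depth;
}
```

Each callback invocation corresponds to one frame that the unwinder located through the program's unwind information. When that lookup goes wrong at the very first step, there is nothing to walk, which is consistent with the abort inside uw_init_context_1 seen in both traces.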


A Minimal Reproducer


Easily reproducing the bug is great and helps with debugging, but having a minimal reproducer for the problem is better.


py3-matplotlib is a huge package and pulls in a bunch of extra dependencies, so it’s not easy to ask other people to “just install this big package plus these other dependencies, and then run this command…”, especially if we have to file an upstream bug and talk to people who may not even run the distribution we’re using. So I set out to try and come up with a smaller recipe to reproduce the issue, ideally something that’s not tied to a specific package from the distribution.


Having all the information gathered from the initial debug session, especially the Emacs backtrace, I thought that I could write a very simple program that just invoked the backtrace function from glibc in order to trigger the code path that leads to _Unwind_Backtrace. Here’s what I wrote:


#include <execinfo.h>

int
main(int argc, char *argv[])
{
  void *a[4096];
  backtrace (a, 100);
  return 0;
}

The good news is that I was still able to reproduce the problem with this small program, which made life a lot easier.


GCC Comes Into Play


Up until this point, I had assumed that this bug was caused by glibc. At the same time, while working on bringing GCC 15 to Wolfi, I accidentally found that the crash did not happen when using a hardened glibc that had been compiled using GCC 15. I switched my focus to GCC and, more specifically, libgcc, which provides the machinery necessary to perform frame unwinding (among other things).


With invaluable help from a friend, I spent some hours diving deep into the internals of the unwinding logic on libgcc. I learned a lot, and eventually found that the following excerpt, from a function called uw_frame_state_for, was problematic:


// ...
  fde = _Unwind_Find_FDE (context->ra + _Unwind_IsSignalFrame (context) - 1,
                          &context->bases);
  if (fde == NULL)
    {
#ifdef MD_FALLBACK_FRAME_STATE_FOR
      /* Couldn't find frame unwind info for this function.  Try a
         target-specific fallback mechanism.  This will necessarily
         not provide a personality routine or LSDA.  */
      return MD_FALLBACK_FRAME_STATE_FOR (context, fs);
#else
      return _URC_END_OF_STACK;
#endif
    }
// ...

The fde variable should not be NULL here, but it is. This never happened when glibc was compiled with GCC 15, but it did happen if glibc was compiled with GCC 14. This NULL value was coming from another function called find_fde_tail, and it seemed like there was some problem with the way the function was calculating the Frame Description Entry (which is what FDE stands for) of /lib/ld-linux-x86-64.so.2. This means that glibc and the dynamic loader are indeed also involved in the bug.
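For context, on ELF systems the unwinder generally finds the FDE for an address by first identifying the loaded object that contains it and then consulting that object's PT_GNU_EH_FRAME program header, which points at a lookup table of FDEs. The sketch below is not libgcc's code. It only shows what that program header looks like from the inside, by scanning the running program's own headers. It assumes Linux with glibc, and find_own_eh_frame_hdr is a hypothetical name.

```c
#include <elf.h>
#include <link.h>
#include <stddef.h>
#include <sys/auxv.h>

/* Hypothetical helper: scans this program's own program headers and
   returns the PT_GNU_EH_FRAME entry, or NULL if there is none.  */
const ElfW(Phdr) *
find_own_eh_frame_hdr (void)
{
  /* The kernel passes the location and number of the program headers
     in the auxiliary vector.  */
  const ElfW(Phdr) *phdr = (const ElfW(Phdr) *) getauxval (AT_PHDR);
  unsigned long phnum = getauxval (AT_PHNUM);

  if (phdr == NULL)
    return NULL;

  for (unsigned long i = 0; i < phnum; i++)
    if (phdr[i].p_type == PT_GNU_EH_FRAME)
      return &phdr[i];

  return NULL;
}
```

If the bookkeeping for an object is wrong, for example its recorded start address, a lookup of this kind can miss the object entirely, and the unwinder ends up with no FDE for an address that clearly belongs to it.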


I decided to bisect GCC and find exactly what changed between GCC 14 and 15 that made things suddenly work, but at this point, I was not sure anymore that we were dealing with a compiler bug. I had reached the limit of my knowledge about GCC/glibc and frame unwinding, so I decided to file an upstream bug and ask for help.


Upstream Takes Over


I filed this bug against GCC, and within hours, the community performed a thorough investigation and identified the root cause. It was indeed a glibc bug after all. A new bug was filed, this time against glibc, and a fix was readily proposed.


In the end, the problem involved __ehdr_start, a magic symbol defined by the linker. Its purpose is described in the code (from elf/dl-support.c):


if (_dl_phdr == NULL)
  {
    /* Starting from binutils-2.23, the linker will define the
       magic symbol __ehdr_start to point to our own ELF header
       if it is visible in a segment that also includes the phdrs.
       So we can set up _dl_phdr and _dl_phnum even without any
       information from auxv.  */

    extern const ElfW(Ehdr) __ehdr_start attribute_hidden;
    assert (__ehdr_start.e_phentsize == sizeof *GL(dl_phdr));
    _dl_phdr = (const void *) &__ehdr_start + __ehdr_start.e_phoff;
    _dl_phnum = __ehdr_start.e_phnum;
  }

But the following definition is the problematic one (from elf/rtld.c):


extern const ElfW(Ehdr) __ehdr_start attribute_hidden;

This symbol (along with its counterpart, __ehdr_end) was being run-time relocated when it shouldn’t be. The fix that was pushed added optimization barriers to prevent the compiler from generating those relocations.
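For readers who have not met __ehdr_start before, the following sketch shows how an ordinary program can use it to find its own ELF header and program headers, much like the dl-support.c excerpt does. It assumes Linux and a linker that defines the symbol (binutils 2.23 or later, per the comment above). The helper names are hypothetical. The empty asm statement is a generic GCC/Clang idiom for an optimization barrier, included only to illustrate the concept. It is not a copy of the upstream patch.

```c
#include <elf.h>
#include <link.h>
#include <stddef.h>
#include <string.h>

/* Defined by the linker, not by any source file.  */
extern const ElfW(Ehdr) __ehdr_start __attribute__ ((visibility ("hidden")));

/* Hypothetical helper: returns the address of our own ELF header.  */
const ElfW(Ehdr) *
own_elf_header (void)
{
  const ElfW(Ehdr) *ehdr = &__ehdr_start;

  /* Illustrative optimization barrier: after this statement the
     compiler must treat EHDR as an opaque value held in a register.  */
  __asm__ ("" : "+r" (ehdr));
  return ehdr;
}

/* Hypothetical helper: returns non-zero if the header starts with
   the ELF magic bytes.  */
int
own_header_is_elf (void)
{
  return memcmp (own_elf_header ()->e_ident, ELFMAG, SELFMAG) == 0;
}

/* Hypothetical helper: locates the program headers the same way the
   glibc excerpt does, by adding e_phoff to the header address.  */
const ElfW(Phdr) *
own_program_headers (size_t *count)
{
  const ElfW(Ehdr) *ehdr = own_elf_header ();

  *count = ehdr->e_phnum;
  return (const ElfW(Phdr) *) ((const char *) ehdr + ehdr->e_phoff);
}
```

In a normal program the dynamic loader has finished relocating everything long before main () runs, so this code is harmless. The dynamic loader itself has no such luxury, because it has to use these addresses while it is still relocating itself.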


Epilogue: Fun with the ARM64 Linker


Did you think it was over? We thought so too. But as it turned out, there was yet another problem lurking in the shadows.


Chainguard supports ARM64 packages and images, and one of the ARM processors on the market is Google’s Axion. For some obscure reason, the ldconfig tool was crashing when running on these processors. After a much shorter investigation, I filed another upstream bug against glibc and, again, got a reply and a proper fix within days. The problem, this time, had to do with the fact that, on ARM64, memcpy and memcmp (among other functions) are implemented using indirect functions (or ifunc). This is a neat glibc feature that allows developers to offer different implementations of the same function and then select among them at runtime. For this to work, however, some initialization code needs to be executed first. In this specific bug, the functions were being invoked before that initialization had run, which was causing the crash.
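As an illustration of the mechanism, here is a minimal sketch of an indirect function using the ifunc attribute supported by GCC and Clang on glibc-based systems. All names are hypothetical, and the selection logic is deliberately trivial. glibc's real resolvers for memcpy and memcmp are far more elaborate.

```c
#include <stddef.h>

typedef unsigned (*sum_fn) (const unsigned char *, size_t);

/* Two interchangeable implementations of the same function.  */
static unsigned
sum_bytes_generic (const unsigned char *p, size_t n)
{
  unsigned total = 0;

  for (size_t i = 0; i < n; i++)
    total += p[i];
  return total;
}

static unsigned
sum_bytes_unrolled (const unsigned char *p, size_t n)
{
  unsigned total = 0;
  size_t i = 0;

  for (; i + 4 <= n; i += 4)
    total += p[i] + p[i + 1] + p[i + 2] + p[i + 3];
  for (; i < n; i++)
    total += p[i];
  return total;
}

/* The resolver runs while relocations are being processed, before
   main () and before most other initialization, so it has to stay
   minimal and must not depend on state that is not set up yet.  */
static sum_fn
resolve_sum_bytes (void)
{
#if defined (__x86_64__)
  __builtin_cpu_init ();   /* Required before CPU checks in a resolver.  */
  if (__builtin_cpu_supports ("avx2"))
    return sum_bytes_unrolled;
#endif
  return sum_bytes_generic;
}

/* Callers see one ordinary function.  The implementation behind it
   is chosen once, by the resolver, at load time.  */
unsigned sum_bytes (const unsigned char *, size_t)
  __attribute__ ((ifunc ("resolve_sum_bytes")));
```

The important detail for this bug is the ordering: a call to sum_bytes is only valid once the resolver has run and its result has been recorded. Calling an indirect function before that step, which is what happened here, jumps through an address that has not been filled in yet.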


Conclusion


These were very interesting bugs to investigate, and it was great being able to help upstream address them. It is also a side effect of Chainguard’s approach to keeping our packages secure: because we are constantly building and testing the latest upstream updates, we end up catching complex bugs that not even the original projects are aware of. If you are interested in learning more about our work with compiler hardening, check out how we mitigated an rsync vulnerability using these compiler flags. Our DevRel team has also provided a great tutorial to get you started on using compiler flags to secure your code.
