
So you want to check image signatures in Kubernetes…?

Dan Lorenc, CEO
July 6, 2023

Signing containers and verifying their signatures is an important security practice everyone should be using in today's threat landscape. It's also mandated by many of the regulatory frameworks enterprises have to meet, such as FedRAMP, PCI-DSS, and HIPAA, to name a few. When most organizations first approach signing and verification, it looks easier than it is, and it can be tempting to build all of the infrastructure yourself. At Chainguard, our team has spent a long time exploring better ways to sign and verify containers, and we found a lot of pitfalls in the process.

Admission webhooks are the obvious place to start for verifying container signatures.

You look up the Kubernetes docs and they say you can verify things with webhooks, so off you go! You can even write a simple one in an afternoon. I did, with Cosigned, and it later turned into the policy-controller project, which grew a lot more features over time and is now the core backend for our Enforce platform.

Nothing here is specific to that project, though; all admission-based signature verifiers have roughly the same architecture. In an effort to save you time, this post is a "speed run" through those challenges and how to overcome them.

Here are five problems you'll likely encounter when trying to verify signatures at deployment time in Kubernetes:

1. Resource Types

Kubernetes admission controllers receive Kubernetes objects, not container images, making it necessary to locate the containers within these objects. While pods are relatively straightforward, other resource types like ReplicaSets, Deployments, Jobs, CronJobs, etc., have containers embedded in different places.

```
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
  labels:
    app: my-app
spec:
  containers:
    - name: my-container
      image: my-image:latest
```

```
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: my-cronjob
spec:
  schedule: "0 1 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: my-job
              image: my-image:latest
```

To verify containers in all these types, you'll need dynamic logic to locate the containers within each object's specification. You can focus on verifying containers only in pods (since all workloads eventually get "compiled" down into them), but this approach can result in unfriendly error messages when issues arise: the higher-level object creation succeeds, but the pods fail to schedule later on.

Handling ephemeral containers, custom resource definitions (CRDs), and Helm charts complicates the problem further. To provide a decent developer experience, you're going to need to hardcode logic for every type and CRD in your environment, and keep it up to date as the Kubernetes APIs change and new CRDs get installed.
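
To make the problem concrete, here's a minimal sketch in Go (not the actual policy-controller code) of the per-type lookup a webhook ends up maintaining just to find the images it needs to verify. It only covers a handful of built-in workload kinds; every additional type or CRD that embeds a pod template needs its own case.

```
// A sketch only: extract the PodSpec from a few built-in workload kinds.
// Real webhooks need a case for every type (and CRD) in the cluster.
package workload

import (
	appsv1 "k8s.io/api/apps/v1"
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
)

// podSpecFor digs the pod spec out of the object the webhook received.
func podSpecFor(obj interface{}) (*corev1.PodSpec, bool) {
	switch o := obj.(type) {
	case *corev1.Pod:
		return &o.Spec, true
	case *appsv1.Deployment:
		return &o.Spec.Template.Spec, true
	case *appsv1.ReplicaSet:
		return &o.Spec.Template.Spec, true
	case *appsv1.StatefulSet:
		return &o.Spec.Template.Spec, true
	case *appsv1.DaemonSet:
		return &o.Spec.Template.Spec, true
	case *batchv1.Job:
		return &o.Spec.Template.Spec, true
	case *batchv1.CronJob:
		return &o.Spec.JobTemplate.Spec.Template.Spec, true
	default:
		return nil, false // unknown kinds and CRDs land here
	}
}

// imagesFrom collects every image reference, including init and ephemeral containers.
func imagesFrom(spec *corev1.PodSpec) []string {
	var images []string
	for _, c := range spec.InitContainers {
		images = append(images, c.Image)
	}
	for _, c := range spec.Containers {
		images = append(images, c.Image)
	}
	for _, c := range spec.EphemeralContainers {
		images = append(images, c.Image)
	}
	return images
}
```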

2. Scale

Once you've identified the containers within the object specifications, the next step is to validate their signatures. The naive approach here is to validate the signature for the image, then return the result in the webhook response.

Unfortunately, a single deployment can result in numerous pods and each pod triggers a separate request to fetch the container's signature from the OCI registry. Although nodes have caching mechanisms, ensuring fresh signatures can be challenging. Caching in webhook servers is possible, but caching is one of the two (or three) most difficult problems in computer science.

If you don’t cache/batch requests, even a modest deployment size might trigger rate limits on container registries as you race to fetch and check signatures for the same image dozens or hundreds of times at once. This can cause back-pressure on the API server, leading to slower deployments and cluster instability.

To do this reliably at scale, you're going to need to face the caching problem head-on and do some concurrency management to avoid the thundering herd problem while filling the cache for the first time. You now have a distributed systems problem, a caching problem, and a concurrency problem to deal with all at once.
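
Here's a rough sketch of one way a Go webhook could tackle that: collapse concurrent verification requests for the same digest with golang.org/x/sync/singleflight and keep a short-lived cache in front of the registry. The verifyAgainstRegistry function is a hypothetical stand-in for whatever actually fetches and checks the signature.

```
// A sketch only: deduplicate and cache signature verification so a burst of
// pods for the same image results in one registry round trip.
package verify

import (
	"sync"
	"time"

	"golang.org/x/sync/singleflight"
)

type cachedResult struct {
	ok      bool
	expires time.Time
}

type verifier struct {
	group singleflight.Group
	mu    sync.Mutex
	cache map[string]cachedResult
	ttl   time.Duration
}

func newVerifier(ttl time.Duration) *verifier {
	return &verifier{cache: make(map[string]cachedResult), ttl: ttl}
}

func (v *verifier) verify(digest string) (bool, error) {
	// Serve recent results from the cache.
	v.mu.Lock()
	if r, hit := v.cache[digest]; hit && time.Now().Before(r.expires) {
		v.mu.Unlock()
		return r.ok, nil
	}
	v.mu.Unlock()

	// Collapse concurrent misses for the same digest into a single fetch,
	// so the thundering herd doesn't hit the registry all at once.
	res, err, _ := v.group.Do(digest, func() (interface{}, error) {
		ok, err := verifyAgainstRegistry(digest)
		if err != nil {
			return nil, err
		}
		v.mu.Lock()
		v.cache[digest] = cachedResult{ok: ok, expires: time.Now().Add(v.ttl)}
		v.mu.Unlock()
		return ok, nil
	})
	if err != nil {
		return false, err
	}
	return res.(bool), nil
}

// verifyAgainstRegistry is a hypothetical stand-in: fetch the signature from
// the OCI registry and verify it against the configured keys.
func verifyAgainstRegistry(digest string) (bool, error) { return false, nil }
```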

3. Tags and digests

After you’ve solved caching, distributed systems, and concurrency, you’re left with the final, hardest problem in computer science: naming! Container images can be addressed in two ways in a registry: by tag or by digest. Tags are convenient names, but they can move around and point to different images at different times. That makes them unsuitable for cryptographic provenance, meaning you have to sign and verify digests instead of tags.

Tags can be resolved to digests, but it’s not quite that simple. If you simply add a resolution step in your webhook before checking signatures, you’ve now created a “time of check” versus “time of use” race condition. The webhook might validate a signature for the digest that corresponds to the tag at the time it checks, but there's no guarantee that the node will resolve the tag to the same digest during the image pull. This is not a theoretical issue: it has resulted in real CVEs in real admission controllers, found only through careful audits. Learn from those, rather than discovering the problem on your own.

It also means the admission controller needs to be granted pull access to all images, and might even need secrets to do so. Misconfigurations here can be dangerous.

An angry goose could watch and rapidly change the image tag to point at a different digest, the registry could be compromised, or you might accidentally sign a multi-architecture manifest that contains many images that have nothing to do with each other. 

The most reliable way to solve this problem is to have developers resolve all tags to digests in configuration before deployment. However, even this is tricky: mutating webhooks can inject additional containers and their ordering is undefined, so now you need reinvocation policies (which lead to scaling adventures) and, ideally, verification in a dedicated webhook that runs after all mutation has occurred. The simple signature checker is now growing quite a bit of functionality, and it even needs mutation permissions on all of your deployments.
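
As an illustration, here's a hedged sketch of resolving a tag to a digest-pinned reference before the manifest ever reaches the cluster, using the google/go-containerregistry library; the image name is made up. The pinned repo@sha256:... form is what gets signed, verified, and ultimately pulled, which closes the time-of-check/time-of-use gap.

```
// A sketch only: resolve a tag to a digest once, up front, and pin it.
package main

import (
	"fmt"
	"log"

	"github.com/google/go-containerregistry/pkg/crane"
	"github.com/google/go-containerregistry/pkg/name"
)

// pinToDigest turns "repo:tag" into "repo@sha256:..." by asking the registry
// what the tag points at right now.
func pinToDigest(image string) (string, error) {
	ref, err := name.ParseReference(image)
	if err != nil {
		return "", err
	}
	digest, err := crane.Digest(image)
	if err != nil {
		return "", err
	}
	return fmt.Sprintf("%s@%s", ref.Context().Name(), digest), nil
}

func main() {
	pinned, err := pinToDigest("my-image:latest") // hypothetical image
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(pinned) // e.g. index.docker.io/library/my-image@sha256:...
}
```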

4. Attestations and additional data

Signatures are a great security practice, but they don't actually prove anything about the image itself, just that a key signed some data. For stronger security guarantees, you need to attest to properties of the image itself: how it was built (SLSA), whether it was properly scanned, or who approved the deployment. This is where attestations come in.

Once you have basic signature checking working, you'll likely want to move up the ladder to attestations. That means fetching even more data, which increases memory pressure on the admission controller and adds latency. If you were getting by before by staying just under registry rate limits, attestations are likely to push you over.

Attestations can also be time-stamped and enable richer policy that's hard to reason about at deployment time. If a container was free of vulnerabilities when it was first built, should it be allowed to continue running six months later? There are almost definitely going to be vulnerabilities present in the image by then, but a deploy-time webhook isn't going to notice until the pod gets rescheduled.
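
As a sketch of what that richer policy might look like, here's a small Go example that inspects an already-verified in-toto statement and rejects it when the scan it attests to is too old. The scanFinishedOn field is an illustrative predicate layout, not a fixed schema.

```
// A sketch only: reject an attestation whose vulnerability scan is too old.
package policy

import (
	"encoding/json"
	"fmt"
	"time"
)

// statement models just the fields this policy cares about from an in-toto
// statement; the scanFinishedOn predicate field is illustrative, not standard.
type statement struct {
	PredicateType string `json:"predicateType"`
	Predicate     struct {
		ScanFinishedOn time.Time `json:"scanFinishedOn"`
	} `json:"predicate"`
}

// checkScanFreshness assumes the attestation's signature has already been
// verified and only evaluates the time-based part of the policy.
func checkScanFreshness(raw []byte, maxAge time.Duration) error {
	var s statement
	if err := json.Unmarshal(raw, &s); err != nil {
		return err
	}
	if age := time.Since(s.Predicate.ScanFinishedOn); age > maxAge {
		return fmt.Errorf("scan attestation is %s old, policy allows at most %s", age, maxAge)
	}
	return nil
}
```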

5. Key rotation

Similar to the timestamping example, signing keys need to be rotated over time. Containers signed with one key might be fine when they get deployed, but when you rotate (or revoke) the signing key, you'll need a way to handle that.

If you do nothing, the container will eventually get restarted or rescheduled and then fail to come back up unless its image is re-signed. This is problematic because you have no idea when that will happen, and it might take down your systems when you aren't expecting it.

To do this safely, you'll need a way to continuously scan and validate running workloads to see which workloads no longer meet the required policies so you can take action on your own schedule. You'll probably also change your policy itself from time to time, and then you risk changing it in a way that puts existing workloads out of compliance. You're going to want a way to "test" or "dry run" policies against existing workloads before you enforce them, to balance reliability with security. Deploy-time webhooks don't provide a way to do this.
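
To give a flavor of the continuous side, here's a rough sketch using client-go that re-checks the images of running pods against the current policy and only reports violations (a dry run) rather than blocking anything. The verifyImage function is a hypothetical stand-in for the real signature and policy check.

```
// A sketch only: list running pods and re-verify their images against the
// current policy, reporting violations instead of evicting anything.
package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	pods, err := client.CoreV1().Pods(metav1.NamespaceAll).List(context.Background(), metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}
	for _, pod := range pods.Items {
		for _, c := range pod.Spec.Containers {
			if err := verifyImage(c.Image); err != nil {
				// Dry run: record the violation so it can be handled on
				// your own schedule, rather than taking the workload down.
				log.Printf("policy violation: %s/%s container %s image %s: %v",
					pod.Namespace, pod.Name, c.Name, c.Image, err)
			}
		}
	}
}

// verifyImage is a hypothetical stand-in: check the image against the current
// policy and signing keys.
func verifyImage(image string) error { return nil }
```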

Let us help

Checking signatures before deployment time is one of those problems that sounds much easier than it is in practice. Deploy-time checks are still valuable, but you need to couple them with a system that continuously validates policies against running workloads and lets you safely change those policies and keys over time. 

My hope with this post is that it saves you time as you architect a signing and verification system for Kubernetes. If you find yourself wishing that something already existed to make signing and verifying container images easier, you're in luck. We built our Enforce Signing product to address the problems mentioned in this post. Enforce Signing gives users the flexibility of keyless signatures alongside privately managed signing infrastructure that doesn't store any sensitive data. Get in touch today and let us help make signing software simple, secure, and scalable for your organization.
