TL;DR: Everything that can be ephemeral, should be ephemeral.
Ephemerality is the property of lasting for a very short time. As it applies to production software, this is a far-reaching principle with critical applications including: credentials, keys, signatures, attestations, and even the infrastructure itself!
As with all things, context matters, and what counts as “a very short time” varies, so it is best to think of ephemerality as a spectrum. From a security standpoint, the ideal is for things to live only as long as they must to accomplish the task at hand. However, sometimes even knowing how long the task will take is difficult, or overhead necessitates some degree of reuse (for efficiency or performance).
The key to this principle is imposing an upper bound on the lifespan of things; generally, the shorter that lifespan, the less risk it presents.
Styles of ephemerality
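Let’s label the two most common styles of ephemerality: “time bombs,” which expire on their own after a fixed lifespan, and “rotation,” where things are periodically replaced and the old ones actively revoked.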
Generally speaking, “time bombs” are preferable to “rotation” because they do not rely on active revocation, but practically it all comes down to lifespans.
Lifespan of ephemerality
Once you have embraced one of the styles of ephemerality, the progression becomes all about lifespans. Generally the best security posture is only letting things live for exactly as long as they absolutely must, but this can be hard to know in advance, may present efficiency trade-offs, and beyond a certain point may have diminishing returns.
For “time bombed” credentials and keys, this typically means a lifespan on the order of minutes. For “rotated” credentials, it is often on the order of days to months, depending on the cost of rotation and whether it can be automated.
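To make the two styles concrete, here is a minimal sketch in Python. The `TimeBombedCredential` class and the `needs_rotation` helper are hypothetical names invented for illustration, and the 15-minute and 30-day lifespans are example values rather than recommendations.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class TimeBombedCredential:
    """Hypothetical credential that carries its own expiry."""
    subject: str
    issued_at: datetime
    ttl: timedelta

    @property
    def expires_at(self) -> datetime:
        return self.issued_at + self.ttl

    def is_valid(self, now: datetime) -> bool:
        # No revocation list is consulted: the credential simply stops
        # working once its lifespan has elapsed.
        return self.issued_at <= now < self.expires_at


def needs_rotation(created_at: datetime, max_age: timedelta, now: datetime) -> bool:
    # Rotation style: the credential never expires by itself, so something
    # external has to notice it is too old, then replace and revoke it.
    return now - created_at >= max_age


t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)
token = TimeBombedCredential("build-bot", issued_at=t0, ttl=timedelta(minutes=15))
print(token.is_valid(t0 + timedelta(minutes=5)))    # True
print(token.is_valid(t0 + timedelta(minutes=20)))   # False
print(needs_rotation(t0, timedelta(days=30), t0 + timedelta(days=45)))  # True
```

The time bomb fails closed if nobody is paying attention, whereas the rotated credential keeps working until someone (or some automation) acts on `needs_rotation`.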
In the context of ephemeral infrastructure this is a place where serverless abstractions really shine because the perceived lifetime of the underlying compute resource is the duration of the request! Most providers do not scope the instances’ lifetimes at quite this fine a grain for efficiency, but the expectation of instance churn is readily built into this development model.
Short-lived credentials are a well-established production best practice (1, 2, 3), so this should not be terribly contentious. One of the main risks posed by long-lived credentials is the “exfiltration attack,” where a credential is leaked and used by the attacker until the leak is detected and the credential revoked! Despite this recognition as a best practice, short-lived credentials remain out of reach for too many users. In some cases this stems from a lack of support in upstream services; these might be workload services that do not surface a short-lived “workload identity” concept (e.g. many CI vendors), or services that do not offer a short-lived authentication option (e.g. DockerHub).
In many cases, the problem arises from asymmetry between the token provider and consumer. For example, for a very long time it was impractical or impossible to use a short-lived “workload identity” token from Google to talk to AWS APIs, and vice versa. However, the industry is rapidly adopting OpenID Connect (OIDC) as a form of portable authentication, and supporting “federation,” which allows users to exchange a token from vendor A for a token native to vendor B (for calling vendor B’s APIs) according to predefined federation rules.
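A rough sketch of what “predefined federation rules” can look like follows. The `FederationRule` type, the `exchange` function, the issuer URL, and the role name are all hypothetical; real vendors each have their own rule syntax. The sketch also assumes the incoming token’s signature has already been verified against the issuer’s published keys, which is the part you must never skip in practice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FederationRule:
    """Hypothetical rule: which foreign identity may act as which local role."""
    issuer: str
    subject: str
    audience: str
    grants_role: str


def exchange(claims: dict, rules: list, now: int, ttl_seconds: int = 900):
    """Return a short-lived native token (as a dict), or None if nothing matches.

    ASSUMPTION: `claims` come from an OIDC token whose signature was already
    verified; signature verification is deliberately omitted here.
    """
    if claims.get("exp", 0) <= now:
        return None  # the incoming token has itself expired
    aud = claims.get("aud")
    audiences = aud if isinstance(aud, list) else [aud]
    for rule in rules:
        if (claims.get("iss") == rule.issuer
                and claims.get("sub") == rule.subject
                and rule.audience in audiences):
            # The native token is itself time bombed.
            return {"role": rule.grants_role, "iat": now, "exp": now + ttl_seconds}
    return None


rules = [FederationRule(
    issuer="https://issuer.vendor-a.example",
    subject="repo:acme/app:ref:refs/heads/main",
    audience="vendor-b",
    grants_role="artifact-pusher",
)]
claims = {
    "iss": "https://issuer.vendor-a.example",
    "sub": "repo:acme/app:ref:refs/heads/main",
    "aud": "vendor-b",
    "exp": 1_700_000_600,
}
native = exchange(claims, rules, now=1_700_000_000)
stranger = exchange({**claims, "sub": "repo:evil/fork:ref:refs/heads/main"}, rules, now=1_700_000_000)
```

Here `native` is a 15-minute token for the `artifact-pusher` role, while `stranger` is `None` because no rule names that subject. No long-lived secret is stored on either side.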
The ubiquitous Kubernetes container platform has had stable support for Service Account Token Volume Projection since version 1.20, which allows workloads to get an OIDC token for their service account. GKE and EKS support public issuers for these tokens, and AKS has preview support, so they can already be used with OIDC federation mechanisms to authenticate with services without any long-lived credentials!
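As an illustration, the following sketch builds a Pod manifest that requests a projected service account token. The `serviceAccountToken` source and its `audience`, `expirationSeconds`, and `path` fields are part of the Kubernetes Pod API; the helper name, pod name, image, audience value, and mount path are placeholders made up for this example. The manifest is emitted as JSON, which `kubectl apply -f` accepts alongside YAML.

```python
import json


def pod_with_projected_token(name, image, service_account, audience, ttl_seconds=600):
    """Build a Pod manifest (as a dict) that mounts a short-lived OIDC token."""
    # Kubernetes rejects projected token lifetimes below 10 minutes.
    if ttl_seconds < 600:
        raise ValueError("expirationSeconds must be at least 600")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "serviceAccountName": service_account,
            "containers": [{
                "name": "app",
                "image": image,
                "volumeMounts": [{
                    "name": "oidc-token",
                    "mountPath": "/var/run/secrets/tokens",  # placeholder path
                    "readOnly": True,
                }],
            }],
            "volumes": [{
                "name": "oidc-token",
                "projected": {"sources": [{
                    "serviceAccountToken": {
                        "audience": audience,             # who the token is intended for
                        "expirationSeconds": ttl_seconds,  # the "time bomb"
                        "path": "oidc-token",
                    },
                }]},
            }],
        },
    }


manifest = pod_with_projected_token(
    name="demo",
    image="registry.example/app:latest",
    service_account="builder",
    audience="sts.vendor-b.example",
)
print(json.dumps(manifest, indent=2))
```

The workload then reads the token from the mounted file each time it needs it, rather than caching it, so that it always picks up the current token.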
Homework / Call-to-Action: if you are still using long-lived credentials, then you should triage whether they can be replaced with a short-lived token. If you consume services that do not support short-lived tokens, then press them to support a short-lived option! Assess the damage it would incur to your business if that credential leaked, and weigh that against the cost of migrating to an alternative.
The idea of short-lived signing keys (and supporting infrastructure) is much newer than short-lived credentials, but it builds on very mature ideas that keep billions of people safe on the internet every day in the form of WebPKI. This idea is quickly gaining traction through projects like Sigstore’s Fulcio, which is rapidly approaching “1.0” and aims to provide a “public good” for signing that parallels what Let’s Encrypt has done for TLS.
As mentioned above, one of the dangers with long-lived credentials and keys is “exfiltration attacks” where a credential or key lands in the hands of a malicious attacker, who can now use it until it is revoked. However, credential revocation is a simpler problem because authorization is often an online process, whereas signature validation is often offline (e.g. air-gapped) using a well-known public key. This makes signing key revocation extremely difficult, and the danger of exfiltration attacks higher.
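One way to see why short-lived signing certificates help with the offline case: if the certificate is only valid for a few minutes, a verifier can reject anything signed outside that window without ever consulting a revocation service. The sketch below is a simplified model, not any project’s actual verification logic. `SigningCertificate` and `signature_time_ok` are hypothetical names, and `signed_at` is assumed to come from a source the verifier trusts (for example a timestamping service), not from the signer’s own say-so.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class SigningCertificate:
    """Hypothetical short-lived certificate binding an identity to a key."""
    identity: str
    not_before: datetime
    not_after: datetime


def signature_time_ok(cert: SigningCertificate, signed_at: datetime) -> bool:
    # ASSUMPTION: `signed_at` is a trusted timestamp. A stolen key is only
    # useful for signatures that can be shown to fall inside this window.
    return cert.not_before <= signed_at <= cert.not_after


t0 = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
cert = SigningCertificate(
    identity="release-bot@example.com",
    not_before=t0,
    not_after=t0 + timedelta(minutes=10),  # example lifespan
)
print(signature_time_ok(cert, t0 + timedelta(minutes=3)))  # True
print(signature_time_ok(cert, t0 + timedelta(days=30)))    # False
```

A key exfiltrated a month later cannot produce a signature that passes this check, even though nobody ever revoked anything.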
The Fulcio project is another context where having OIDC tokens as a standard short-lived credential has been advantageous. The “public good” Fulcio instance maintained by the Sigstore community is rapidly gaining support for workload systems with publicly accessible OIDC “issuer” endpoints (e.g. Google, AWS, GitHub, SPIFFE), and self-hosted Fulcio instances can be configured to support issuers accessible to them (e.g. on-prem).
Homework / Call-to-Action: if you are using long-lived signing keys, then you should triage what systems are handling them, and whether those systems are supported by Fulcio. Generally adding Fulcio support for new workload OIDC issuers is quick, and many CI vendors are rushing to make sure their platform is well supported. If you have questions about a particular CI platform, the Sigstore community can help provide the current state of any in-flight integrations or help identify any integration gaps to press on your CI vendor to support.
The idea of ephemeral attestations is still somewhat nascent. An attestation is a claim about a subject that is then signed by the attestor. To make a claim without associating an expiration (or timestamp) with it is to claim it is a universal truth. Many things can be truthfully claimed at one point in time and stop being true at a later point (or their usefulness simply decays over time).
A real-world example of such a claim is a Covid-19 PCR test, where a licensed authority attests that the patient was negative at the time the test was taken, but a negative result generally has a shelf life of ~72 hours before the test must be repeated.
Within the world of software, vulnerability scans work much the same way. At time X my application container may not have any known vulnerabilities, but the exact same container may be found to have vulnerabilities at time Y without change because a new vulnerability was disclosed in my application or its dependencies. Given the possibility of new disclosures, I may want to “time bomb” my vulnerability attestations to ensure they are regularly refreshed.
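A small sketch of how a consumer might evaluate such an attestation follows. The `Attestation` shape and `is_acceptable` policy function are hypothetical and do not follow any particular attestation format; signature verification is assumed to have happened before this check. The policy honors an explicit expiry when one is present, and otherwise falls back to a maximum age computed from the creation time.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class Attestation:
    """Hypothetical signed claim about an artifact (signature check omitted)."""
    subject_digest: str
    predicate: str
    issued_at: datetime
    expires_at: Optional[datetime] = None  # the optional "time bomb"


def is_acceptable(att: Attestation, now: datetime, max_age: timedelta) -> bool:
    # An explicit expiry set by the attestor always wins.
    if att.expires_at is not None and now >= att.expires_at:
        return False
    # Otherwise the consumer applies its own lifespan policy.
    return now - att.issued_at <= max_age


scanned = datetime(2024, 1, 1, tzinfo=timezone.utc)
scan = Attestation(
    subject_digest="sha256:0123abcd",  # placeholder digest
    predicate="vuln-scan: no known vulnerabilities",
    issued_at=scanned,
    expires_at=scanned + timedelta(days=1),
)
print(is_acceptable(scan, scanned + timedelta(hours=6), max_age=timedelta(days=7)))  # True
print(is_acceptable(scan, scanned + timedelta(days=2), max_age=timedelta(days=7)))   # False
```

The second check fails even though the consumer’s own policy would tolerate a week-old scan, because the scanner itself said the result should not be trusted for more than a day.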
A more contentious example is binary provenance. Suppose I want to claim that a particular binary was produced from a certain set of inputs using a particular process. Should an attestation of this provenance expire? For folks with reproducible builds, this can actually be a desirable property! If provenance attestations expire (and must be re-attested), then we are forced to periodically produce the same result from the same inputs/process. This is part of how SolarWinds’ Trebuchet build system was designed to mitigate the now-famous attack on their build system. Reproducibility and performing reverification builds are also part of how the distroless images are able to produce Tekton Chains-based provenance for SLSA level 2 (the primary build uses GCB, which must match the Tekton-based build). So even in cases where something might be considered a universal truth, it can be useful to get independent confirmation. Put differently: when you have reproducible builds, you can periodically confirm a binary’s provenance, so why wouldn’t you take advantage? When you cannot simply rebuild things to confirm a binary’s provenance, do you really want to trust that binary’s provenance to be true forever?
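The reverification idea boils down to comparing digests from independent builds. Here is a toy sketch, assuming a bit-for-bit reproducible build; `rebuild` stands in for whatever independent build process you have, and both it and `provenance_confirmed` are hypothetical names.

```python
import hashlib
from typing import Callable


def digest(artifact: bytes) -> str:
    return "sha256:" + hashlib.sha256(artifact).hexdigest()


def provenance_confirmed(attested_digest: str, rebuild: Callable[[], bytes]) -> bool:
    # Re-run the build from the same inputs on independent infrastructure
    # and check that it yields exactly the artifact that was attested.
    return digest(rebuild()) == attested_digest


# Stand-ins for real build outputs.
original = b"binary built from commit abc123"
attested = digest(original)

honest_rebuild = lambda: b"binary built from commit abc123"
tampered_rebuild = lambda: b"binary built from commit abc123 + implant"

print(provenance_confirmed(attested, honest_rebuild))    # True
print(provenance_confirmed(attested, tampered_rebuild))  # False
```

When the check passes, the provenance attestation can be re-issued with a fresh expiry; when it fails, either the build is not reproducible or one of the two build systems produced something it should not have.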
Homework / Call-to-Action: If your organization doesn’t have reproducible builds, start to make that investment now as it can take a while to identify all the sources of variation. While this may seem somewhat orthogonal, it becomes a kind of “cheat code” for supply chain security. You should also start to represent all kinds of knowledge about artifacts in terms of attestations (signed claims), and at a minimum, these should carry a creation time, which allows lifespan-based policy judgements, but ideally, they should be time bombed with an expiry.
Most folks have experienced the cautionary tale of the “pet” VM that nobody knows how to rebuild and everyone is afraid to touch; or worse, the critical infrastructure running on Bob’s workstation. These are a security nightmare. Kill these with fire.
The notion of ephemerality in infrastructure is not new, but it has not always been framed through the lens of “ephemerality”. In the context of builds, SLSA captures this idea as the “ephemeral environment” requirement for levels 3+. Another classic example of this is the well-known “pets vs. cattle” idea; however, there are a number of trends that have made infrastructure ephemerality increasingly attainable in recent years (Cloud, Infrastructure-as-Code, Declarative Infrastructure, GitOps, …).
For layered infrastructures, this principle should be considered at every layer of abstraction, for example: Bare Metal, VMs, Pods. Considering the failure modes of all of these layers is standard reliability practice, but just like testing your backups, there is no substitute for routinely rotating your infrastructure to identify gaps. Chaos Engineering has become a popular mechanism for trying to flush out these kinds of gaps, potentially using a jittered cadence.
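A jittered cadence can be as simple as picking a random rotation time inside the tail end of the allowed lifetime, so that a fleet created together does not all rotate at the same instant. This is one possible sketch; `next_rotation` and the 20% jitter window are illustrative choices, not an established convention.

```python
import random
from datetime import datetime, timedelta, timezone


def next_rotation(created_at, max_lifetime, jitter_fraction=0.2, rng=None):
    """Pick a rotation time in the last `jitter_fraction` of the lifetime.

    The result never exceeds created_at + max_lifetime, so the upper
    bound on lifespan still holds.
    """
    if rng is None:
        rng = random.Random()
    earliest = max_lifetime * (1 - jitter_fraction)
    window = max_lifetime - earliest
    return created_at + earliest + window * rng.random()


created = datetime(2024, 1, 1, tzinfo=timezone.utc)
max_life = timedelta(days=30)
rng = random.Random(42)  # seeded only to make the example repeatable
schedule = [next_rotation(created, max_life, rng=rng) for _ in range(5)]
for when in schedule:
    print(when.isoformat())
```

With these values every node is rotated somewhere between day 24 and day 30, spreading the churn out while keeping the 30-day ceiling intact.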
Serverless/Stateless application patterns also lend themselves extremely well to ephemerality because they present the abstraction of infrastructure that exists solely for the context of a given request. This lets the provider of the serverless abstraction rotate the worker units invisibly. For a services example, Knative offers a Kubernetes-based serverless service abstraction, which can be used in conjunction with an early-stage project called the “descheduler” to bound the maximum lifetime of pods. For a build example, Tekton offers a Kubernetes-based serverless build abstraction, which runs each Task in a separate Pod.
Stateful workloads do not have as clear an abstraction to codify ephemerality, but good failover hygiene can be established (e.g. hitless rollouts, chaos testing, testing backups) to prepare workloads for techniques like Time-to-Live limits (TTLs, e.g. via descheduler).
Homework / Call-to-Action: for each layer of your infrastructure, you should understand the maximum lifespan of the various resources, and impose upper bounds as needed. Wherever possible, invest in Infrastructure-as-Code (IaC) to automate bootstrapping new environments and reconciling changes to existing environments (e.g. GitOps). Embrace Stateless/Serverless patterns for as much of your architecture as possible, and practice chaos engineering to help identify gaps. Press your vendors to support TTLs on resources, and integrate with IaC providers (e.g. Terraform, Pulumi).
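As a starting point for that audit, a per-layer lifespan policy can be written down and checked mechanically. In this sketch the `Resource` type, the `overdue` function, the layer names, and the limits in `MAX_LIFESPAN` are all made up for illustration; the inventory would come from your cloud provider or cluster API in practice.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Illustrative upper bounds per layer; pick values that fit your environment.
MAX_LIFESPAN = {
    "bare-metal": timedelta(days=90),
    "vm": timedelta(days=30),
    "pod": timedelta(days=1),
}


@dataclass(frozen=True)
class Resource:
    name: str
    layer: str
    created_at: datetime


def overdue(resources, now):
    """Return the names of resources that have outlived their layer's bound."""
    return [r.name for r in resources if now - r.created_at > MAX_LIFESPAN[r.layer]]


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
inventory = [
    Resource("node-a", "vm", now - timedelta(days=3)),
    Resource("bobs-pet-vm", "vm", now - timedelta(days=400)),
    Resource("api-7f9c", "pod", now - timedelta(hours=2)),
    Resource("batch-old", "pod", now - timedelta(days=5)),
]
print(overdue(inventory, now))  # ['bobs-pet-vm', 'batch-old']
```

Running a report like this on a schedule is a cheap way to find the pets hiding in the herd before enforcing TTLs automatically.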
The attacker perspective
The goal of ephemerality is to make any foothold an attacker gains like quicksand, and any credentials or keys the attacker exfiltrates like the self-destructing messages in Mission: Impossible (or Inspector Gadget, if you prefer).
If an attacker gains access to a production workload, that foothold is only as persistent as the unit of workload they have compromised. If the workload is ephemeral then an attacker is forced to regularly move laterally (e.g. breach something else) or repeatedly breach the workload to maintain their foothold.
For a Tekton example, if an attacker is able to submit a malicious build payload, they essentially have until the build completes (or times out) to complete their attack or move laterally before their foothold is gone. Contrast this with persistent “runners” where an attacker can simply start a daemon process in the background as they plot their next move.
However, these principles must be applied to as many things as they can be. If the above Tekton build is using long-lived credentials, then breaching it once for a short time may be enough to exfiltrate credentials that can be used to continue the attack from their own machines. Conversely, if long-lived infrastructure using short-lived credentials is breached, the attacker can simply continuously exfiltrate (or use) new short-lived credentials.
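A deliberately crude model makes the interaction visible. Assume an attacker can keep exfiltrating fresh credentials for as long as their foothold lasts, and that the last credential they grab stays usable for its full lifetime. Under that assumption the window in which they can act is roughly the sum of the two lifespans. The function name and the example durations are hypothetical.

```python
from datetime import timedelta


def exposure_window(foothold: timedelta, credential_ttl: timedelta) -> timedelta:
    # The last credential stolen, at the very end of the foothold, remains
    # usable for one more full credential lifetime.
    return foothold + credential_ttl


scenarios = {
    # 10-minute build, but a credential that lives for 90 days
    "ephemeral build, long-lived credential":
        exposure_window(timedelta(minutes=10), timedelta(days=90)),
    # 15-minute tokens, but a runner that stays up for 90 days
    "persistent runner, short-lived credential":
        exposure_window(timedelta(days=90), timedelta(minutes=15)),
    # both short: 10 minutes + 15 minutes
    "ephemeral build, short-lived credential":
        exposure_window(timedelta(minutes=10), timedelta(minutes=15)),
}
for name, window in scenarios.items():
    print(f"{name}: {window}")
```

Only the last scenario gets the window down to minutes; shortening one lifespan while leaving the other long barely moves the total.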
Now consider that you have done the work to use short-lived infrastructure, with short-lived credentials, and short-lived keys. You have invested in reproducible builds. However, the attestations made by that infrastructure are treated as forever valid. If an attacker is able to breach your build system just once, and inject a malicious payload into your software, then that bad binary provenance is valid for as long as you use that package. If those attestations must be refreshed (e.g. via reproducible builds), then the attacker has to breach your build system on every single build, and the shorter the lifespan of the attestation, the more persistent the attacker must be.
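The same back-of-the-envelope treatment applies to attestations. If a package stays in use for some period and its provenance must be re-attested every time the attestation expires, the attacker needs a successful breach for each refresh. The helper below is a hypothetical illustration of that arithmetic, assuming every refresh is an independent, honest rebuild unless it is breached again.

```python
import math
from datetime import timedelta
from typing import Optional


def breaches_required(usage_period: timedelta,
                      attestation_lifetime: Optional[timedelta]) -> int:
    """How many separate breaches keep a bad attestation valid for the period."""
    if attestation_lifetime is None:
        return 1  # attestation treated as valid forever: one breach is enough
    return max(1, math.ceil(usage_period / attestation_lifetime))


year = timedelta(days=365)
print(breaches_required(year, None))                # 1
print(breaches_required(year, timedelta(days=30)))  # 13
print(breaches_required(year, timedelta(days=1)))   # 365
```

Each additional required breach is another chance for the attacker to be detected, which is the practical payoff of shorter attestation lifespans.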
The shorter all the lifespans, the harder an attacker has to work to get what they are after, and the less likely they are to be successful.