An OPA Gatekeeper gotcha when enforcing policy on all resource kinds
EDIT 2023-07-02: After some thinking time, I’ve found a better solution to this problem. Read about it in my newer article on OPA user-created resources
At work we’re beginning to roll out OPA Gatekeeper in our Kubernetes clusters to enforce policy on the resources that are deployed there. As part of some early forays into enforcing policy, I ran in to this frustrating ‘gotcha’ that I figured was worth documenting.
A common use case for OPA Gatekeeper is to enforce that resources have particular metadata.
This helps to ensure that, for example, it’s easy to tell who owns a particular Kubernetes resource, by enforcing
that all resources have a label called
Using this policy is fairly straightforward.
ConstraintTemplate resource is installed in to your Kubernetes cluster running Gatkeeper, you can create
a resource like this:
apiVersion: constraints.gatekeeper.sh/v1beta1 kind: K8sRequiredLabels metadata: name: must-have-owner-label spec: match: kinds: - apiGroups: [''] kinds: ['*'] parameters: message: | All resources must have an owner label. labels: - key: 'owner'
This above constraint enforces that all resources have a label called
What could possibly go wrong…
Unfortunately, the above constraint will cause you a number of issues, which I’ll outline below.
The first issue you will run in to
The first issue you will encounter will happen when you try and deploy some sort of workload, say, a
If you’re anything like me, you’ll want to validate that this works by deploying a simple
apiVersion: apps/v1 kind: Deployment metadata: name: httpbin namespace: my-namespace labels: owner: nik spec: replicas: 1 selector: matchLabels: app: httpbin owner: nik template: metadata: labels: app: httpbin owner: nik spec: containers: - image: docker.io/kennethreitz/httpbin imagePullPolicy: IfNotPresent name: httpbin ports: - containerPort: 80
But you’ll need a
Namespace to deploy this to, so let’s create that as well:
apiVersion: v1 kind: Namespace metadata: name: my-namespace labels: owner: nik
With the constraint as written above, OPA Gatekeeper will enforce our
owner label constraint on non-namespaced
resources as well.
But that’s ok, because we included the
owner label on the namespace.
Seems like it should work, right?
If you create your namespace and then try to create your deployment, it will succeed, but your
Deployment will soon
take on the following status:
... status: conditions: - lastTransitionTime: '2022-09-26T23:21:51Z' message: >- pods "httpbin-78f658c47f-" is forbidden: error looking up service account my-namespace/default: serviceaccount "default" not found reason: FailedCreate status: 'True' type: ReplicaFailure ...
Huh? That’s a weird one. I’ve never seen that error before.
If you’re new to Kubernetes, you might not know that if you don’t specify a
serviceAccount for a pod, your pod
will be assigned to the default service account for a namespace, which is called
In this case, we didn’t specify a service account as part of our pod template, so we get the
default service account.
But our error message is saying that there is no
default service account in the namespace we created.
How does the
default service account get there in the first place?
Well, the Kubernetes docs on service account administration have this to say:
A ServiceAccount controller manages the ServiceAccounts inside namespaces, and ensures a ServiceAccount named “default” exists in every active namespace.
Ok, so our
default service account should definitely exist.
When you’re told a resource doesn’t exist in a permissioned environment like Kubernetes, it’s always a good idea to
just double check that the resource actually doesn’t exist using some sort of privileged user.
This helps you rule out a situation where something just has insufficient permissions to know that there really is
a service account in that namespace called
As a privileged user:
$ kubectl get serviceaccounts -n my-namespace No resources found in tenant-automat-reference-app namespace.
Ok, so there is definitely no service account in our namespace, despite the fact that the docs tell us there should be. What’s going on?
As it turns out, we overlooked something important in our mandatory labels constraint.
The key to debugging this issue was to have a look at the control plane logs.
The docs above tell us that it’s the job of the ‘ServiceAccount controller’ to ensure that the
account exists, so perhaps something went wrong while it was trying to create the service account?
That’s exactly what we find:
E0926 22:53:47.849664 14 serviceaccounts_controller.go:180] my-namespace failed with : admission webhook "validation.gatekeeper.sh" denied the request: [must-have-owner-label] All resources must have an owner label.
That explains things.
The service account controller tried to create the
default service account.
But since it has no way of knowing that we now enforce that every resource must have an
owner label, the
Gatekeeper validating admission webhook denies its request to create the
default service account.
The second issue you will run in to
Even if you manage to get around this first issue, for example, by excluding
ServiceAccount resources from the
resources that Gatekeeper cares about, you’re not out of the woods quite yet.
You will run in to exactly the same issue when you create a
Service resource that selects your pods.
This is because backing a
Service resource is an
Endpoint resource that is also created for you by the endpoint
And this controller too does not know about our requirement that every resource must have an
So you will get a cryptic message saying that your pod ‘could not bind to an endpoint’.
If you again look at the control plane logs, you will find:
I0927 00:03:23.031559 14 event.go:291] "Event occurred" object="httpbin" kind="Endpoints" apiVersion="v1" type="Warning" reason="FailedToCreateEndpoint" message="Failed to create endpoint for service my-namespace/httpbin: admission webhook \"validation.gatekeeper.sh\" denied the request: [must-have-owner-label] All resources must have an owner label.
This is a bit of an annoying issue. What options are available to you?
Set up your constraint resource to not include resources like
However this comes with a serious drawback: you can no-longer use a wildcard to say ‘apply to every resource’ as we did in our constraint above. This is because Gatekeeper lacks an option to exclude resources from a constraint, it can only include. So the effect is that you will have to list every resource that your constraint applies to.
This solution is viable, but not particularly nice. Whenever you install a new kind of resource in to your cluster, you’ll need to update this constraint to include that resource if you want to enforce an ownership label. And if you have other similar universally-enforced constraints, you’ll have to update those too. There may be ways to DRY this list of ‘resources that you can safely enforce policies on’, but it’s a bit of a pain to have to update it when you install new resources. You could also automate the generation of this list, but again, it seems like a lot of hassle.
It’s also not entirely clear to me which resources should be on this list, leading me to believe that you’d have to
construct the list by trial and error.
For example, in the above exercise, the
ReplicaSet resource is created just fine, because it seems to inherit the
labels of the parent
ConstraintTemplate to allow excluding resources.
Unfortunately, the only other viable option I’ve found to solve this problem is to deviate from the open source required labels constraint template, and add your own custom rules around how to enforce this constraint.
This would mean adding some fields to the
spec of the
exemptKinds field that accepts a list of
kinds to ignore.
Or if necessary, an
exemptResources field that allows for defining more complex combinations of exempt kinds,
exemptions by name, namespace or API group.
Although this would require some custom Rego, it shouldn’t be too hard to create. The cost here is that you’re deviating from the open source required labels constraint template. However given that the constraint template is only around 30 lines of Rego, you’re not losing out on my by defining your own constraint template.
Hopefully this post serves as some documentation of this frustrating beginner gotcha, as I couldn’t find anyone else having the same issue. If you’re interested in more content on OPA, Rego and Gatekeeper, please reach out and let me know!