Failing Less at Kubernetes with Policy as Code & OPA

Kubernetes has become the de facto way to run modern computing platforms, both in the cloud and on-premise. This is a huge change from just a few years ago, and it didn’t happen overnight. On the road to production readiness with Kubernetes, many teams have run into trouble.

Thanks to Kubernetes Failure Stories, though, engineering teams from around the world have shared in detail what went wrong and how to avoid making the same mistakes in the future. Today we take a look back through the archives to see which issues are still relevant and where a modern Policy as Code stack based on OPA might have helped.

While Policy as Code wouldn’t have saved the day in every failure story, it’s striking how many of these incidents are still relevant today and how the outages could have been avoided with the help of some policy checking.

Templating Blast Radius

In their blog post, the Skyscanner Engineering team share a story in which a templating bug caused many resources to be destroyed. A variable was left unrendered, which meant that resources with different names were all updated to use the literal variable name instead. This led other automation to reconcile the state, which involved removing a lot of resources.

The post doesn’t share an example of the full configuration file, but based on the details provided a simplified version might look something like this:

# env1.yaml
---
namespaces:
- name: Namespace1
  cell: $cluster.name
# env2.yaml
---
namespaces:
- name: Namespace2
  cell: $cluster.name

Here we can see the issue: two files that were rendered with different data for cluster now have the same (incorrect) cell reference, the literal value $cluster.name.

Here are some checks we could run in CI using OPA and the Rego policy language to make sure this doesn’t happen again.

Check for invalid characters in the resource name

One of the simplest checks to perform after templating these resources would be to validate that the cell names do not contain a $ symbol. A Rego policy to do this might look like this:

package play

import future.keywords.contains
import future.keywords.if

invalid_pattern := `[^\w\-]+`

deny contains message if {
	namespace := input.namespaces[_]

	regex.match(invalid_pattern, namespace.cell)

	message := sprintf("Namespace %q cell name %q contains an invalid character, must not match: %s", [
		namespace.name,
		namespace.cell,
		invalid_pattern,
	])
}

This policy could create a helpful error message like this:

Namespace "Namespace1" cell name "$cluster.name" contains an invalid character, must not match: [^\w\-]+

Have a play around with this in the Rego Playground here.
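
Because a check like this runs in CI, it’s worth testing the policy itself too. A small unit test, runnable with opa test, might look something like this (the input document is a made-up example based on the simplified configuration above, and the test assumes the deny rule above is the only one in the package):

package play

import future.keywords.if

# Hypothetical test: a namespace whose cell still contains an unrendered
# template variable should produce exactly one deny message
test_deny_unrendered_cell_name if {
	messages := deny with input as {"namespaces": [{"name": "Namespace1", "cell": "$cluster.name"}]}

	count(messages) == 1
}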

Validate that an expected number of cells are referenced

Hindsight is 20/20, but with Rego, it’s easy to enforce lots of different policies at the same time and there’s more we can do here!

Oftentimes, it’s also useful to sanity-check that an expected number of unique references appear in the input. A policy like this might have helped in this incident:

package play

import future.keywords.contains
import future.keywords.if

expected_cell_count := 5

deny contains message if {
	referenced_cells := {c |
		some n
		c := input.namespaces[n].cell
	}

	count(referenced_cells) < expected_cell_count

	message := sprintf("The list of referenced cells [%s] is shorter than expected (%d)", [
		concat(",", referenced_cells),
		expected_cell_count,
	])
}

An example error message generated by this policy looks like this:

The list of referenced cells [$cluster.name] is shorter than expected (5)

Have a play with this in the Rego Playground here.

Unexpectedly Exposed, Unexpectedly Root

In a post shared by the JW Player DevOps team, they talk about how an attacker was able to run a cryptocurrency miner on their pre-production clusters.

One of the reasons this was possible was that a developer tool was exposed on a public load balancer because of a missed label. Let’s see how we can write a policy to address this at admission time.

Ensuring Load Balancers are internal only

The JW Player team’s load balancer was created by the WeaveScope application. Cloud providers run additional controllers to configure cloud resources in response to Kubernetes services. For example, in Google Cloud, it’s possible to instruct the controller managing load balancers to create an internal one like this:

apiVersion: v1
kind: Service
metadata:
  name: ilb-svc
  annotations:
    networking.gke.io/load-balancer-type: "Internal"
  labels:
    app: hello
spec:
  type: LoadBalancer
...

In certain environments, it’d be advisable to ensure that this annotation was set. If we were using OPA as an admission controller, we could use a policy a bit like this:

package play

import future.keywords.contains
import future.keywords.if

gke_load_balancer_type_key := "networking.gke.io/load-balancer-type"

deny contains msg if {
	lb_type := object.get(
		input.request.object.metadata,
		["annotations", gke_load_balancer_type_key],
		"Unknown",
	)

	lb_type != "Internal"

	msg := "Load balancer type must be set to Internal"
}

You can experiment with this policy yourself here in the Rego playground.
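
As written, the rule applies to any admitted object that reaches it, so in practice you’d either scope the admission webhook configuration to Services or tighten the rule itself. A slightly tightened sketch (still assuming the GKE annotation shown above) might look like this:

package play

import future.keywords.contains
import future.keywords.if

gke_load_balancer_type_key := "networking.gke.io/load-balancer-type"

deny contains msg if {
	# Only check Services that actually request a cloud load balancer
	input.request.object.kind == "Service"
	input.request.object.spec.type == "LoadBalancer"

	lb_type := object.get(
		input.request.object.metadata,
		["annotations", gke_load_balancer_type_key],
		"Unknown",
	)

	lb_type != "Internal"

	msg := sprintf("Service %q must set %s: Internal", [
		input.request.object.metadata.name,
		gke_load_balancer_type_key,
	])
}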

The JW Player team would have also benefited from blocking root access on their clusters. This is made easy with the gatekeeper-library of pod security policies. Similarly, you can easily apply these policies with Styra PSP Compliance Packs.
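
The gatekeeper-library ships constraint templates for exactly this, but to illustrate the idea in plain Rego, a minimal admission sketch (an illustrative example, not the library’s implementation) could deny pods whose containers don’t explicitly opt out of running as root:

package play

import future.keywords.contains
import future.keywords.if

deny contains msg if {
	container := input.request.object.spec.containers[_]

	# Deny unless the container explicitly sets runAsNonRoot: true
	not container.securityContext.runAsNonRoot

	msg := sprintf("Container %q must set securityContext.runAsNonRoot: true", [container.name])
}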

Wait — I’m using that!

In a KubeCon talk from Airbnb, their engineers describe an issue where a component responsible for cleaning up images was deleting images that were still needed. Rather than a misconfiguration, this is a distributed systems problem, and one where policy is part of the solution.

In the talk, the presenters explain that the ECR policy for cleaning up images lacks the data needed to delete images the way they need: it shouldn’t delete images that are still in use.

OPA and a simple cron job could do better than ECR’s own deletion policy here. The process could look something like this:

  • Cron job starts and lists all the images on ECR
  • It filters them down to a list of the N% oldest images on the repo
  • The cron job submits this list to OPA
  • OPA responds with the images which are not in use
  • Cron job deletes the old unused images

Using the OPA project kube-mgmt, it’s possible to replicate Kubernetes data into OPA. We could use this to replicate all of the pod data, which would give us a list of all of the images in use. If this data were too large, we could instead load OPA with a list generated by some other process to save on memory. (As an aside, if you’re looking to load gigabytes of control plane data into OPA, you might be interested in checking out Enterprise OPA.)
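
As a rough sketch, if kube-mgmt were configured to replicate pods into data.kubernetes.pods (the exact path depends on how replication is configured), the set of in-use images could be derived like this:

package play

# Assumption: kube-mgmt replicates pods into data.kubernetes.pods[namespace][name]
images_in_use := {image |
	pod := data.kubernetes.pods[_][_]
	image := pod.spec.containers[_].image
}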

With that in place, let’s see what a policy that does this filtering for us might look like.

package play

# images_in_use: e.g. the set derived from replicated pod data sketched above
images_to_delete := {image |
	some i
	image := input.images[i]
} - images_in_use

In this policy, we take the data loaded into OPA by some other process (images_in_use) and use it to filter the list of images supplied in the input. OPA responds with the set of images to delete.

You can experiment with this policy yourself here in the Rego playground. You might also want to check out the documentation here on the different options available when loading external data like this into OPA.

Next Steps

A Policy as Code system built on OPA is an important tool in your toolbox when building guardrails into your platform. Misconfigurations can come from anywhere, and OPA’s generic model makes it easy to enforce policy across your estate: CI, cron jobs, Kubernetes admission control, and more.

Ask yourself the question: if you had an outage caused by a misconfiguration today, how would you block such a change from being made in the future?

Want to learn more about Kubernetes admission control and failing less at Kubernetes with OPA? Check out our newest Styra Academy course, OPA for K8s Admission Control!
