How DoorDash Ensures Velocity and Reliability through Policy Automation

September 20, 2022 | Lin Du

Infrastructure-as-code is growing in popularity because it automates and streamlines the complexities of deploying company infrastructure into multi-cloud environments. DoorDash uses Terraform with an Atlantis GitOps workflow to provision its infrastructure. These technologies worked well initially, but as the business grew, our cloud infrastructure expanded significantly. Core-infra engineers soon became full-time code reviewers for all the changes needed to keep the platform from breaking. At this volume, human error began to hurt the platform’s health. Networking changes, database updates, or an unreviewed infrastructure pull request could each affect the entire organization in expensive and time-consuming ways.

A critical part of infrastructure deployment is automating the provisioning, updating, and management of cloud infrastructure quickly without violating requirements related to security, reliability, or cost. Here we discuss how DoorDash uses Open Policy Agent (OPA) to build policy-based guardrails with codified rules that ensure velocity and reliability for automated cloud infrastructure deployments.

Defining infrastructure-as-code

Infrastructure-as-code, or IaC, allows managing and provisioning infrastructure through code instead of manual processes. Terraform is an infrastructure provisioning tool that uses HCL (HashiCorp Configuration Language) to describe and create the desired infrastructure and to modify or delete existing infrastructure automatically. At DoorDash, we use GitHub for version control and to manage the lifecycle of our IaC, integrating it with a set of CI/CD tools in a workflow known as GitOps.

Defining policy as code with Open Policy Agent

OPA is an open-source, general-purpose policy engine that unifies policy enforcement across the stack. It provides a high-level declarative language called Rego for writing policy as code, along with simple APIs for offloading policy decision-making from business logic. OPA can enforce policies for cloud infrastructure automation, microservices, Kubernetes, CI/CD pipelines, API gateways, and more.

OPA decouples policy decision-making from policy enforcement. When software needs to make a policy decision, it queries OPA and supplies structured data such as JSON as input. OPA generates the decision by evaluating the query input against policies and data.
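
To make the query-and-decide flow concrete, here is a minimal sketch of a policy that turns a JSON input into a decision. The package name and the sample input in the comments are hypothetical; the field names follow Terraform's JSON plan output. It uses the same pre-1.0 Rego syntax as the recipes in this post, so newer OPA releases may require the `if` and `contains` keywords instead.

```rego
package plan_summary # hypothetical package name

# Example input a caller might supply:
# {
#   "resource_changes": [
#     {"address": "aws_s3_bucket.logs", "type": "aws_s3_bucket",
#      "change": {"actions": ["delete"]}},
#     {"address": "aws_route.main", "type": "aws_route",
#      "change": {"actions": ["update"]}}
#   ]
# }

# Decision: the set of resource addresses whose planned actions include "delete".
deleted_addresses[address] {
	rc := input.resource_changes[_]
	rc.change.actions[_] == "delete"
	address := rc.address
}
```

For the sample input in the comments, the decision is a set containing only aws_s3_bucket.logs. The caller decides what to do with that result, which is the enforcement half that OPA leaves to the software.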

At DoorDash, we use an open-source utility called Conftest, which uses OPA’s Rego language to check assertions. Conftest is pivotal for our goals because it lets us validate the Terraform plan against our OPA rules inside the PR review pipeline.

Using policy as code

A policy is a set of rules, conditions, or instructions meant to be enforced across the organization, in areas such as cloud-native infrastructure, application authorization, or Kubernetes admission control. For example, policy rules can define the conditions that infrastructure code must meet to pass a security control and be deployed.

At DoorDash, we built policy-based guardrails by codifying rules to secure infrastructure deployments and changes, including but not limited to:

  • The critical resource changes that require extra code review from different teams (for example, a load balancer change may require an additional review from a traffic engineer)
  • The supported Terraform modules allowed for infrastructure changes; as long as engineers take the prescriptive approach to deploying a cloud resource, approval is automated
  • The specific actions allowed for particular resources
  • The changes that require security team review
  • The cloud resource tags that must be present
  • The cost parameters around allowable changes to infrastructure
  • The resource types allowed, so that engineers take advantage of existing reserved instances and savings plans
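
As an illustration of one guardrail from the list above, the following is a minimal sketch of a tag check. It is not DoorDash's actual policy: the package name and the required tag keys are assumptions, and the sketch treats every created or updated resource as taggable, which is not true of all resource types.

```rego
package tags_check # hypothetical package name

# Assumed required tag keys; the post does not list the actual tags.
required_tags = {"owner", "team"}

# Only actions that leave a resource in place need tags.
tagged_actions = {"create", "update"}

# Deny any created or updated resource that lacks a required tag key.
deny[msg] {
	rc := input.resource_changes[_]
	tagged_actions[rc.change.actions[_]]

	# Fall back to an empty object when the resource has no "tags" attribute.
	tags := object.get(rc.change.after, "tags", {})
	present := {key | _ = tags[key]}
	missing := required_tags - present
	count(missing) > 0
	msg := sprintf("%s is missing required tags: %v", [rc.address, missing])
}
```

A real policy would likely restrict the check to resource types that support tags, and would need to decide how to treat tag values that are only known after apply.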

Recipe 1. Require core-infra admin group review when critical resources are deleted

package resources_protection

# Critical Resources List Examples
critical_resources = {
        "aws_elasticache_cluster",
        "aws_elasticache_replication_group",
        "aws_elasticsearch_domain",
        "aws_elasticache_subnet_group",
        "aws_db_instance",
        "aws_db_option_group",
        "aws_db_parameter_group",
        "aws_s3_bucket",
        "aws_route",
}

# Check whether any protected resource has an unsupported delete op
protected_resource_deletion_detect = true {
        type := input.resource_changes[r].type
        critical_resources[type]
        input.resource_changes[r].change.actions[_] == "delete"
}

# Deny delete ops on any protected resource
deny[msg] {
        protected_resource_deletion_detect
        msg := "Some of the resources actions require core-infra admin review."
}

Attempts to delete critical cloud resources from infrastructure code generate the following “OPA check failed” message:

Evaluating cloud
FAIL - /root/atlantis/.atlantis/repos/doordash/default.tfplan.json - Some of the resources actions require core-infra admin review.
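
A policy like Recipe 1 can itself be unit tested before it is rolled out. The sketch below assumes the recipe is loaded alongside it in the same package; the plan fragments are hypothetical and contain only the fields the recipe reads.

```rego
package resources_protection

# Deleting a critical resource should produce exactly one deny message.
test_deny_when_critical_resource_deleted {
	plan := {"resource_changes": [{"type": "aws_s3_bucket", "change": {"actions": ["delete"]}}]}
	count(deny) == 1 with input as plan
}

# Updating a critical resource should pass.
test_allow_when_critical_resource_updated {
	plan := {"resource_changes": [{"type": "aws_s3_bucket", "change": {"actions": ["update"]}}]}
	count(deny) == 0 with input as plan
}

# Deleting a resource type outside the critical list should pass.
test_allow_when_noncritical_resource_deleted {
	plan := {"resource_changes": [{"type": "aws_iam_role", "change": {"actions": ["delete"]}}]}
	count(deny) == 0 with input as plan
}
```

Tests written this way can typically be run with `opa test` or `conftest verify` against the policy directory.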

Recipe 2. Require security review for security groups that open port 22 (SSH) without source restrictions

package sg_module_check

ssh_open_to_all_cidrs = true {
	input.ingress[r].from_port == 22
	input.ingress[r].cidr_blocks[_] == "0.0.0.0/0"
}

deny[msg] {
	ssh_open_to_all_cidrs
	msg := "Security Group with SSH port open to all networks. It requires a security review."
}

Attempts to create or update a security group that opens port 22 to the CIDR 0.0.0.0/0 generate the following “OPA check failed” message:

Evaluating sec
FAIL - /root/.atlantis/repos/doordash/default.tfplan.json - Security Group with SSH port open to all networks. It requires a security review.
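
Recipe 2 matches only rules whose from_port is exactly 22 and whose source is the IPv4 range 0.0.0.0/0. One possible extension, sketched below, also catches port ranges that include 22 and the IPv6 range ::/0. It assumes the input carries `to_port` and `ipv6_cidr_blocks` fields next to the fields Recipe 2 already reads; the package name is hypothetical. Rules that allow all protocols are not covered here.

```rego
package sg_module_check_extended # hypothetical package name

# Source ranges treated as open to all networks (IPv4 and IPv6).
open_cidrs = {"0.0.0.0/0", "::/0"}

# True when the rule's port range includes 22.
covers_ssh(rule) {
	rule.from_port <= 22
	rule.to_port >= 22
}

# Open IPv4 source on a port range that includes SSH.
ssh_range_open_to_all {
	rule := input.ingress[_]
	covers_ssh(rule)
	open_cidrs[rule.cidr_blocks[_]]
}

# Open IPv6 source on a port range that includes SSH.
ssh_range_open_to_all {
	rule := input.ingress[_]
	covers_ssh(rule)
	open_cidrs[rule.ipv6_cidr_blocks[_]]
}

deny[msg] {
	ssh_range_open_to_all
	msg := "Security Group with SSH port open to all networks. It requires a security review."
}
```

Defining `ssh_range_open_to_all` twice expresses a logical OR: the rule is true if either body matches.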

Integrating OPA with Atlantis

At DoorDash, the infrastructure code review process enforces policy-as-code through a custom workflow that runs Conftest in Atlantis (DoorDash’s setup predates official OPA support in Atlantis). We self-host Atlantis, an open-source Go application for Terraform pull request automation. When a GitHub pull request is created, Atlantis runs a Terraform plan and passes the plan file to Conftest, which pulls custom policies written in Rego from an AWS S3 bucket and evaluates the plan against them. The output is then posted as a comment on the PR, all in a single action. As shown in Figure 1, PR owners then know whether their PR meets all policy requirements or requires fixes before it is submitted for further review.

Figure 1: DoorDash policy automation flowchart

Conclusion

With infrastructure-as-code becoming ever more prevalent, it is increasingly important to quickly identify cloud infrastructure changes, deployed resources, and usage. Using policy-as-code within infrastructure-as-code helps DoorDash automate infrastructure pull-request reviews and adds extra guardrails for continuously deploying infrastructure without fear of breaking anything. This approach also ensures our infrastructure stays compliant with company policies.

Ideas for the Future

One area we are actively exploring is cloud cost policies. We could set policies on cost estimates before resources are launched. Such guardrails would enable our infrastructure team and the wider engineering organization to self-serve while staying within an acceptable cloud infrastructure budget. We could also ensure that no engineer or team introduces significant spend changes without explicit approval from a member of our FinOps team.

The DoorDash core-infra cloud team is working on an API-driven self-serve approach for automating some of the infrastructure tasks currently done via the Atlantis GitOps workflow discussed above. The goal is to provide a unified experience for all infrastructure operations, which we believe is the future of engineering workflow changes. We will likely cover this in a future blog post.
