Real-time event processing is a critical component of a distributed system's scalability. At DoorDash, we rely on message queue systems based on Kafka to handle billions of real-time events. One of the challenges we face, however, is how to properly validate the system before going live.
Traditionally, an isolated environment such as staging is used to validate new features. But setting up a different data traffic pipeline in a staging environment to mimic billions of real-time events is difficult and inefficient, while requiring ongoing maintenance to keep data up-to-date. To address this challenge, the team at DoorDash embraced testing in production via a multi-tenant architecture, which leverages a single microservice stack for all kinds of traffic including test traffic.
In such a multi-tenant architecture, the isolation is implemented at the infrastructure layer. We will delve here into how we set up multi-tenancy with a messaging queue system based on Kafka.
The world of multi-tenancy
DoorDash has pioneered the testing in production which utilizes the production environment for end-to-end testing. This provides a number of advantages including reduced operational overhead. But this also brings forth interesting challenges around isolating production and test traffic flowing through the same stack. We solve this using a fully multi-tenant architecture where data and traffic is isolated at the infrastructural layer with minimal interference with the application logic.
Multi-tenancy involves designing a software application and its supporting infrastructure to serve multiple customer segments or tenants. At DoorDash, we have introduced a new tenant called doortest in the production environment. Under this tenant, the same application or service instances are shared with different user management, data, and configurations, ensuring efficient and effective testing in a production-like environment.
Data isolation in multi-tenants
In a multi-tenant environment, data isolation is crucial to ensure that tenants don’t impact each other. While we have achieved this in databases, it also needs to be extended to other infrastructure components. In Kafka, a test tenant processing production event can cause data inconsistencies, including outages and other incidents.
There are limitations to the traditional approach of using separate Kafka topics for each tenant, including scalability issues for multiple tenant environments, and inaccurate load testing.
To overcome these challenges, DoorDash has made Kafka a tenant-aware application, which allows different tenants to share the same topic. Figure 1 below provides an overview of the Kafka workflow architecture.
In this workflow, messages originating from various tenant environments are tagged with distinct tenant information by an agent of OpenTelemetry, or OTEL — an open-source framework that provides tools and software to collect and process telemetry data from cloud-native applications. OTEL uses native Kafka headers to propagate context. Upon receipt by the consumer, the context filter relays messages containing the appropriate tenant information to the processor. This ensures that sandbox consumers mirror the configurations of production consumers and subscribe to the same topic.
To achieve this, we made several changes to Kafka producer and consumer clients as described below.
Kafka producer with context propagation
As explained in a previous post, OTEL provides custom context propagation, which simplifies implementation of multi-tenancy on the Kafka producer side.
Each event sent out by the Kafka producer includes propagated tenant and route information.
Additionally, we have scenarios in which a single service requires multiple sandbox environments. To distinguish which sandbox environment an event is directed toward, we incorporate route information to map a production service application name to a sandbox host. A unique host label is generated upon sandbox deployment. The host label varies between deployments but remains consistent among all pods within the same deployment. The pod machine’s environment variable sets the host label, which provides route information in the context propagation. Both of these contexts can easily be configured through an internal UI tool.
Kafka consumer as a service
In DoorDash, the Asgard framework offers a range of standard libraries that encapsulate commonly used server and client functionalities. Asgard dependencies are presented as a single opaque list, providing all the boilerplate necessary for integrating widely used libraries and hiding their versions behind one Asgard version. Asgard also offers yet-another-markup-language, or YAML, configuration files for various environments such as prod, and sandbox.
Asgard lets product team engineers concentrate solely on implementing the business logic in their services. For Kafka consumers, Asgard runs as a service, only exposing configurations through YAML files while processing the event method for developers.
Figure 2 below shows an overview of Asgard. Thanks to this framework, product team engineers only need to focus on the YAML configuration and Service implementation sections.
The Asgard framework allows us to inject multi-tenancy awareness for Kafka consumers in one place, which is then automatically applied to all the product team's services.
Consumer group isolation
Consumer groups allow Kafka consumers to work together and process events from a topic in parallel. Events sent to the same topic will be load-balanced to all consumers in the same group, meaning the first requirement is to set different consumer groups for various tenants. We offer two ways to do consumer group isolation in a sandbox environment.
The first option is manual configuration, where the user can update the YAML config file and set a different group ID for the sandbox environment.
The second option is auto-generation, which is enabled by default for Asgard Kafka consumers. When running in a sandbox environment, the Asgard Kafka consumer service automatically appends the host label’s suffix to the group ID. This ensures that different sandbox deployments have different consumer groups and that within the same deployment, all consumer pods are part of the same consumer group. This approach ensures proper load balancing of events to all consumers within the same group while maintaining isolation between different tenant groups.
This is an example of configuration:
kafka: groupId: xxx_group_id randomTenantGroupId: true … |
Another important consideration is setting the auto.offset.reset property for the Kafka consumer. In the sandbox environment, we set it to latest by default. This is to prevent the inefficient polling of all existing events in the Kafka cluster whenever a new deployment occurs. Instead, the consumer starts from the latest available event.
Tenant and route context isolation
The test tenant Kafka consumer can now subscribe to the same topic as the production tenant to receive real-time events. The next step is to filter out events not targeted to the current tenant consumers.
To achieve this, we introduced an additional Kafka consumer config field that accepts a list of allowed tenant events. By setting this config field, the Kafka consumer verifies the tenant context information and skips non-matching events. This step ensures that sandbox consumers do not accidentally process events intended for production consumers.
After that, there is another filter based on the route information. We compare the host label retrieved from the environment variable with the one inside the route context header to determine whether the current consumer is the event's target destination. This step ensures that production and sandbox consumers do not process events that belong to a different tenant. In the absence of the route information, the production tenant processes the doortest events ensuring that test traffic gets processed if there are no sandbox deployed for the service.
For example, our Advertisements Team sought to segregate production and testing events to prevent adverse impacts on our ad serving algorithms caused by production services processing test events. Consequently, they opted for the config pattern, explicitly defining allowedConsumerTenancies for both production and sandbox environments.
In production environment:
kafka: allowedConsumerTenancies: - prod ... |
In sandbox environment:
kafka: allowedConsumerTenancies: - doortest … |
Meanwhile, our Logistics Team preferred not to handle the responsibility of deploying sandboxes solely for processing all test events. They found it safe for their production services to handle both production and test events. However, they aimed to restrict sandboxes to processing specific test events following the deployment of a new release. To achieve this, they simply set enableTenantRouting to true.
kafka: enableTenantRouting: true … |
Separately, our Dasher Team wanted to shadow all the production events to test a new alternative architecture. This was safe since the processing of the events did not mutate production data. To achieve this, they simply set enableTenantRouting to false.
kafka: enableTenantRouting: false … |
The table in Figure 3 is created by combining tenant and routing context to monitor which Kafka consumer from each environment will handle a specific message.
Consumer Env | Tenant ID (*) | Route Info (*) | Allowed Consumer Tenancies (**) | Process Event? |
prod | prod | N/A | prod | Yes |
prod | doortest | N/A | prod | No |
sandbox | prod | N/A | doortest | No |
sandbox | doortest | N/A | doortest | Yes |
prod | prod | N/A | both | Yes |
prod | doortest | absent | both | Yes |
prod | doortest | present | both | No |
sandbox | prod | N/A | both | No |
sandbox | doortest | sandbox host is not a match | both | No |
sandbox | doortest | sandbox host is a match | both | Yes |
(*) from Kafka event context
(**) from yaml config
Putting it all together
With this new multi-tenant aware Kafka, testing Kafka applications in isolation has become easier for the developers. No code changes are required; developers only need to add a single line to the configuration file. This update addresses several use cases, including the consumption of messages with designated tenant IDs and routing contexts. Additionally, it ensures that all Kafka messages are consumed without any being left unprocessed.
This solution ensures that the multi-tenancy paradigm is fully realized in Kafka, providing data isolation between different tenants and avoiding potential issues with data inconsistencies. Overall, this is a crucial step toward achieving a more robust and reliable production environment at DoorDash.
Conclusão
In summary, DoorDash has implemented a multi-tenancy awareness system for both Kafka producers and consumers that makes the production environment’s tech stack more efficient and developer-friendly for testing new features and patches. DoorDash has streamlined the test-and-release process for product team engineers through simple YAML file configurations while ensuring the security and isolation of each tenant’s data. The result is a more robust and simpler testing-in-production environment.