Rebuilding and Migrating a Session Management System with Zero Downtime

May 28, 2021

Sin Ko, DJ Thirumalai, Li Pei

As we fundamentally rearchitected DoorDash’s platform, migrating the complex session management system from our monolithic codebase to our new microservices architecture proved one of the toughest engineering challenges we faced. This critical system identifies every unique user of DoorDash’s consumer web experience and has multiple dependencies. Making things more difficult, we needed to complete the migration with zero downtime.

We began the process of migrating DoorDash from a monolithic codebase to a microservice architecture in 2019. This fundamental change gave us a more scalable platform able to meet DoorDash’s growing business needs. Much of the work involved extracting logic from the previous codebase and rewriting it as a service in the new architecture.

While migrating the session management system, we not only needed to untangle existing dependencies, but also discovered hidden ones. Maintaining functionality required understanding the logic behind each dependency and identifying its owner.

Our migration strategy involved building a duplicate session management system in our microservice architecture. The new system had to coexist with the prior system on our platform and required components that could handle our production load. Completing the migration with zero downtime meant users needed to be silently granted sessions on the new system to reduce any friction as we logged them out of the old one.

This project demonstrated our team's commitment to solving challenging engineering problems at scale that improve our customers’ experience.

Our monolith session management is an implicit dependency

While our new microservice architecture provides many benefits in terms of developer productivity, its main purpose was to improve reliability as DoorDash’s business rapidly grew. Unfortunately, as long as our web session logic remained in the monolith, our user identification functionality also stayed in this legacy architecture. Because this functionality is the first step in processing any incoming web request, our overall reliability would be tied to the monolith’s performance, which we had already deemed not scalable enough to support our future growth. Despite our growing suite of microservices, we would need to keep scaling the monolith to match our growth until we could migrate the session service.

What exactly is a session?

When we talk about sessions, we are referring to a series of HTTP calls made by the same user to a server. On the server side, a session management system typically identifies and authenticates a user when they sign in, and keeps track of when they sign out. For a much more in-depth look into session management and how it interacts with authentication and access control, refer to this excellent session management cheat sheet by the Open Web Application Security Project.

High stakes extraction

DoorDash session management sits in the middle of the authentication/authorization process, and after soliciting feedback from different internal teams, we realized its connection to security, internal tools, and customer acquisition gives it an outsized business impact.

Session management directly drives customer experience

The most important part of the migration was maintaining existing sessions before, during, and after the authentication process, along with our ability to serve them on the website. Many different sales channels lead new customers to the website, allowing them to start their customer journey. If we could not deliver a new system at least as performant as our existing module, we risked upending the customer experience.

Session invalidation is an important security tool

Invalidating a session, when we believe a user’s session should be logged out, is an important security measure. Companies often need to end a user’s session, either in response to a security incident, as with this GitHub security incident, or to remove stale logins. As a critical part of the migration’s success, we needed to ensure the session invalidation tool continued to operate, reducing the risk exposure for our overall business.

Internal tools are dependent on the monolith session logic

Our monolith powers many of our internal tools, from configuring our markets and stores to housing our experimentation infrastructure, all of which rely on our legacy session module. Problems with the session management system migration could lock teams out of these critical systems. That would prevent them from responding to changing market conditions, and could potentially create a cascading effect that negatively impacts our business.

Scoping the session migration

One of the core challenges we faced was understanding the interaction between session management systems and other critical modules. We know that any session management system works in tandem with the authentication and identity modules. However, we were made aware that other teams had previously modified the session management module in order to quickly ship products. 

Figuring out all the module dependencies and API usages required a deep dive into the existing functionality, exploring the system internally, and doing user testing. We browsed the site, performing actions as a typical customer would, while probing the system, inspecting logs, and reading the code to understand the backend’s behavior. Through this exploration, the team inspected library source code, traced code paths authored more than five years ago, and collaborated with different teams to identify challenges and potential pitfalls early on.

Legacy monolith session module lacks ownership 

At the beginning of this process, we needed to determine which team owned which functionality built into the session management system. At DoorDash, we have been leveraging Django's built-in session module, which lets developers store arbitrary data for each user, and uses a cookie to store a session ID key that maps to said data. Inspecting the session store, we found a lot of extraneous data coupled with data that is critical to a consumer’s ordering experience. While exploring the codebase, we found a number of issues, including spaghetti code, lack of documentation, and throwaway solutions surrounding the data in question.

These findings meant that part of the migration would have to involve unwinding other features from the monolith’s logic. To unwind these features we would need to work with other teams to review all of the module's functionality, assign ownership, and determine if it would be part of the migration or would need a new home in the monolith.

System can mistakenly issue two sessions for two different systems

During our exploration, we discovered that some web pages were still being served by our monolith, which could result in two different session IDs created for a user. If a user browsed one of these pages from the monolith, Django’s session manager would attach its own session ID cookie, and the user would be keeping two different session ID cookies in their browser, one for the monolith session system and one for the new session system. 

The result of having multiple session ID cookies is that subsequent requests will contain two different session IDs, making it hard to tell which one should be trusted. What can often happen is different components of the system are unable to handle both session IDs provided in the same request, causing runtime errors. One of these problem pages is the receipt page, which our users are redirected to immediately after they check out. In this case, our users will check out, and will immediately be asked to provide their credentials to log back in, creating a negative user experience.

Web experience was managed by monolith session module

In a typical authentication flow, a user’s web session exists as a guest session prior to the authentication step. When the user logs in, this session gets upgraded to an authenticated session. Figure 1, below, illustrates how a session system interacts with guest sessions, authentication, and access control. Guest sessions are unique to our web experience, as iOS and Android app users must be logged in to browse merchants and delivery items. 

At DoorDash, we have various platform teams that own and operate our authentication module, but tracking guest sessions is not part of their domains. In our regular web shopping flow, guests are allowed to browse stores and menus, and add items to their cart. But they must be logged in before they can check out. Our reimplementation of the new session management system must grant the same access control as our current system, allowing for guest sessions, but also integrate with our platform’s authentication process to allow business transactions.

Diagram of session management flow
Figure 1: While the platform team takes care of the authentication and access control modules, the session management system migration handles the rest, including the pre-auth sessions and session finalization for our guests.

Planning the rollout, migration, and observability

Our replacement session management system had to deliver both feature parity and the same performance as the monolith session module. But these were just table stakes. Our exploration phase revealed a challenging situation where our legacy logic was critical to ensuring a smooth customer experience, had multiple dependencies, and was a central component to other parts of the platform. At the same time, our exploration highlighted a lack of ownership, and to a certain extent, a lack of observability and insight into the existing system. 

Based on these findings, building a functionally equivalent session management system would not be enough. During the rollout, some proportion of our users would be granted sessions by the new system, while others had sessions from the monolith system, and this period of overlapping session systems was the trickiest part of the project. We needed additional features and logic specifically for this transitional period. Ultimately, we settled on three guiding principles, which we referred to as “meta features”, for our rollout: 

  • The rollout would need to be safe for our customers under either the existing monolith or replacement session system; if one malfunctioned, we could safely move the affected customers to the other system.
  • The rollout would need to be observable and measurable. At any given time we needed to know how many people were using the new session system, and how their experiences compared to users of the monolith system.
  • The migration would need to be executed with zero downtime and minimal disruption because the business and so many teams relied on its effectiveness. 

The meta features for a safe rollout, observing performance, and achieving minimal disruption were deceptively simple and straightforward. However, the complexity of the problem, and how it impacts all the incoming requests, led us to carefully deliberate how we could achieve this migration. 

Ensuring a safe rollout

Knowing that rolling forward is easy, but rolling backward is hard, we had to figure out how to tell browsers to keep a different session ID if we roll forward, and erase their session ID if we roll back. At DoorDash, rolling out a new code path via our experimentation framework is not new; however, these code paths are usually either server side or client side. In our case, we wanted to synchronize the rollout both on the server side and client side. Fortunately, we already had a handshake between the client and server, which happens before our React single-page application begins to render. We decided to expand on this single entry point for synchronizing the client and server rollout status.

In this single entry point, we were able to centralize the logic for checking the status of an experiment. Because we didn’t want our users to have two different session IDs, if we detected the user was in the treatment group of the experiment, we asked the browser to delete its existing session-related cookies and accept a new set of session cookies. 

At the same time, we implemented a kill switch feature. If we activated the kill switch, it instructed the browser to delete any session-related cookies, and the server would reissue the monolith-managed session cookies back to the browser, hence rolling back to our default, pre-migrated state. After testing this among our team members and a subset of staff, we confirmed that the new kill switch was behaving as desired, and we were able to safely fall back to the monolith managed session if the new system misbehaved. The simplified decision diagram in Figure 2, below, illustrates the conditions we checked to determine which session system applied to the user.
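The roll-forward and rollback behavior described above can be sketched as a pure function that tells the browser which session cookies to drop and which to accept. This is a minimal illustration under stated assumptions, not the production implementation: the cookie names `legacy_sid` and `new_sid`, the `CookiePlan` type, and the `issue_session_id` callable are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical cookie names; the article does not name the real ones.
LEGACY_COOKIE = "legacy_sid"
NEW_COOKIE = "new_sid"
SESSION_COOKIES = (LEGACY_COOKIE, NEW_COOKIE)


@dataclass
class CookiePlan:
    delete: list = field(default_factory=list)  # cookies the browser should drop
    accept: dict = field(default_factory=dict)  # cookies the browser should store


def plan_cookies(browser_cookies, use_new_system, issue_session_id):
    """Return the changes that leave the browser with exactly one session cookie.

    issue_session_id(system) stands in for whichever backend mints the ID.
    """
    wanted = NEW_COOKIE if use_new_system else LEGACY_COOKIE
    plan = CookiePlan()
    # Drop any session cookie that belongs to the other system.
    for name in SESSION_COOKIES:
        if name != wanted and name in browser_cookies:
            plan.delete.append(name)
    # Issue a cookie for the wanted system only if the browser lacks one.
    if wanted not in browser_cookies:
        plan.accept[wanted] = issue_session_id("new" if use_new_system else "monolith")
    return plan
```

Calling `plan_cookies` with `use_new_system=False` models the kill switch: a browser holding only a new-system cookie is told to delete it and accept a monolith-issued one.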

We also created a special cookie, which we refer to as a magic cookie, that let anyone enter the new session flow. At its most basic level, this cookie acted as a flag for an incoming HTTP request. If the server detected the flag during the handshake process, it would invoke custom logic to bypass the experimentation or kill switch check, and directly allow its associated browser to use our new session management system.

This cookie let us leverage our existing end-to-end testing suite, or host manual testing, so we could continuously validate that our new session system was compatible with the rest of the rapidly changing DoorDash architecture.

Diagram of incoming request flow
Figure 2: We built logic into our platform that determines which session management system applies based on certain conditions. If a user’s browser has what we termed a magic cookie, for example, the system creates a session in our microservice architecture.
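The checks in this decision flow can be written down as a small function. The sketch below is an assumption about the ordering, based only on the description that the magic cookie bypasses both the kill switch and the experiment check; the cookie name `use_new_session` and the function and enum names are hypothetical.

```python
from enum import Enum


class SessionSystem(Enum):
    MONOLITH = "monolith"
    NEW = "new"


# Hypothetical flag cookie name; the article only calls it a "magic cookie".
MAGIC_COOKIE = "use_new_session"


def choose_session_system(cookies, kill_switch_on, in_treatment):
    """Pick the session system for one incoming handshake request."""
    # The magic cookie bypasses both the kill switch and the experiment check,
    # so manual and end-to-end tests can always reach the new flow.
    if MAGIC_COOKIE in cookies:
        return SessionSystem.NEW
    # The kill switch sends everyone else back to the pre-migration default.
    if kill_switch_on:
        return SessionSystem.MONOLITH
    # Otherwise the experiment bucket decides.
    return SessionSystem.NEW if in_treatment else SessionSystem.MONOLITH
```

Keeping the decision in one pure function mirrors the single entry point idea: the client and server both act on one answer per request.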

Ensuring an observable and measurable rollout

At the beginning of the migration project, we partnered with our Data Analytics team to come up with an experiment plan that would ensure the platform’s conversion rate was not impacted when customers migrated to the new session module. The outcome of this collaboration was an analytics report that tracked every step in our customer journey, comparing customer engagement at each step. The team gained high-level visibility into the performance of the new session system compared to the existing flow, which allowed us to quickly identify underperforming stages and diagnose potential problems.

On the engineering side, we also took advantage of our decision to create a single session entry point by adding logging statements for failure and sending out metric data, giving us the ability to peer into the health and performance of all our components. With these data-driven guardrails in place, we were able to take the guesswork out of the rollout plan, helping us to decide when our next ramp up should be, and how much of an increment we could take confidently.
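As a rough illustration of instrumenting a single entry point, the sketch below wraps a handshake handler so that every call records a success or failure tagged by session system. The `metrics` counter is a stand-in for a real metrics client, and the function names are hypothetical.

```python
import logging
from collections import Counter

logger = logging.getLogger("session_entry_point")
metrics = Counter()  # stand-in for a real metrics client


def observed(system_name, handler, request):
    """Run the handshake handler and record one success or failure data point."""
    try:
        response = handler(request)
    except Exception:
        metrics[(system_name, "failure")] += 1
        logger.exception("session handshake failed for system=%s", system_name)
        raise
    metrics[(system_name, "success")] += 1
    return response


def failure_rate(system_name):
    """Fraction of recorded handshakes that failed for the given system."""
    ok = metrics[(system_name, "success")]
    bad = metrics[(system_name, "failure")]
    total = ok + bad
    return bad / total if total else 0.0
```

Comparing `failure_rate("new")` against `failure_rate("monolith")` is one simple guardrail for deciding whether the next ramp-up is safe.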

Achieving minimal disruption

To ensure a seamless experience for customers navigating between web pages hosted by our monolith and microservice architecture, we added new logic to our monolith that decodes incoming requests containing a new session management system ID cookie, and prevents monolith session ID cookies from being issued if a new session ID exists. While it seems counterintuitive to add new logic into a monolith that we are actively deprecating, the team believed this was a sensible tradeoff. 

Our monolith, written in Django, attaches its own session during an HTTP request as a library behavior, and we did not want to disable this behavior globally as it may lead to other unexpected bugs. We could have moved all the template rendering functionality out of our monolith to completely sidestep this issue, but we needed to focus on our session management migration in order to decouple from our monolith as quickly as possible. 
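The compatibility logic can be pictured as a filter applied to each outgoing response. The sketch below is framework-agnostic and uses plain dictionaries rather than Django's request and response objects; in a Django codebase a check like this would likely live in middleware, but that placement, the cookie names, and the function name are assumptions.

```python
# Hypothetical cookie names; the article does not name the real ones.
LEGACY_COOKIE = "legacy_sid"
NEW_COOKIE = "new_sid"


def strip_legacy_session_cookie(request_cookies, response_cookies):
    """Drop the monolith's session cookie from a response when the request
    already carries a session cookie from the new system.

    Both arguments are plain dicts of cookie name -> value. A new dict is
    returned so the caller's response data is not mutated.
    """
    if NEW_COOKIE not in request_cookies:
        # Legacy users keep the default behavior untouched.
        return dict(response_cookies)
    return {k: v for k, v in response_cookies.items() if k != LEGACY_COOKIE}
```

Scoping the change to requests that already carry a new session ID leaves the library's default behavior in place for everyone else, which is the tradeoff described above.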

After our monolith became compatible with our new session system, users could navigate between all pages, regardless of which session system granted their session, letting us defer the page rendering migration and treat it as a separate project. As it turned out, our monolith would still be involved with the new session management system.

Part of the migration involved simulating heavy loads to validate our new components were ready to handle our production traffic volume. We used loadtest, an open source library, to create synthetic traffic and observed the behavior of our system under stressed conditions. We caught some memory leak issues and bottlenecks that would have manifested when we started receiving a non-trivial amount of traffic, and avoided slowing down our customers in a live production environment.

As part of this migration, we told other teams about the upcoming changes. We made sure that each team was aware of the change and its possible impact on our internal tools, which could directly hinder everyone’s ability to do their job. This communication was important because it gave us an opportunity to engage with teams who were concerned about how the session changes would affect our ability to track user behavior during experiments. To allay general concerns we also shared our charts and dashboards with our product teams, who in turn helped keep an eye on them, because those metrics directly impacted all the web-based features.

Making our code work harder to remove friction for our customers  

After months of planning and execution, we rolled out the new session flow to one percent of our customers. Using the analysis provided by our Analytics team, we saw a small but detectable conversion drag. Given the small rollout percentage, it could have just been noise in the data. After digging into the logs and charts, and using our magic cookie to conduct manual testing, we were confident that the new system was working as expected. 

However, when the same pattern repeated after ramping up to two and then five percent of all customers, we had to investigate further. Once again, we could not find anything wrong in the error logs or charts.
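One common way to judge whether a conversion gap is noise is a two-proportion z-test. The article does not say which method the Analytics team used, so the sketch below is only an illustration of the general idea, and any numbers passed to it would be the reader's own.

```python
import math


def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Return (z, two_sided_p) for the difference between two conversion rates.

    conv_a / n_a is the control group, conv_b / n_b is the treatment group.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pool both groups to estimate the standard error under "no difference".
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF, via the error function.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

With the same relative gap, a larger treatment group shrinks the standard error and the p-value, which is why a drag that looks like noise at one percent becomes harder to dismiss at five percent.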

At this point, we hypothesized that bucketing more and more users into the new session flow created additional friction because each of them started as a logged-out guest in the new session management system. This was a side effect of a policy we introduced to ensure that no user would be issued two different sets of session cookies.

In our quest for a truly disruption-free migration, we went back to the drawing board to solve this problem for our customers. Because our monolith at this point understood both the new and legacy sessions, we implemented a new session exchange mechanism there. When we detected a customer who should be using the new session according to our experimentation framework, but had a legacy session, we assumed that this customer was visiting the site for the first time after being newly bucketed into the treatment group. The logic for this check is seen in Figure 3, below: 

Diagram of incoming request flow
Figure 3: We built additional logic into our monolith that could determine whether a customer’s session information should be copied to the new session management system, ensuring a disruption-free experience.

Within our single entry point, we call the monolith session exchange endpoint, which returns all the data stored in the monolith session system. We then pick and choose the relevant information to store in the new session system. In other words, we prime our new session management system with the customer’s existing data, so their subsequent requests can continue to be authenticated in the new session system. Figure 4, below, shows the revised network flow in a simplified diagram.

Diagram of session information exchange
Figure 4: Using our single entry point, we initiate an exchange of certain session management information from the monolith to the microservice architecture, delivering an unimpeded customer experience.
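The exchange can be sketched in two steps: detect a treatment-group user who still holds only a legacy session, then copy a chosen subset of their data into the new system. Everything named here is hypothetical, including the cookie names, the carried-over fields, and the two callables that stand in for the monolith exchange endpoint and the new session store.

```python
# Hypothetical names throughout: the article does not specify the cookie
# names or which session fields are carried over.
LEGACY_COOKIE = "legacy_sid"
NEW_COOKIE = "new_sid"
FIELDS_TO_CARRY_OVER = ("user_id", "cart_id")


def needs_exchange(cookies, in_treatment):
    """True when a treatment user arrives with only a legacy session."""
    return in_treatment and LEGACY_COOKIE in cookies and NEW_COOKIE not in cookies


def exchange_session(cookies, in_treatment, fetch_legacy_session, create_new_session):
    """Prime the new session system from the monolith session, if needed.

    fetch_legacy_session(legacy_id) -> dict of everything the monolith stored.
    create_new_session(data) -> new session ID.
    Returns the new session ID, or None when no exchange applies.
    """
    if not needs_exchange(cookies, in_treatment):
        return None
    legacy_data = fetch_legacy_session(cookies[LEGACY_COOKIE])
    # Copy only the allow-listed fields; extraneous monolith data stays behind.
    carried = {k: legacy_data[k] for k in FIELDS_TO_CARRY_OVER if k in legacy_data}
    return create_new_session(carried)
```

An allow-list keeps the extraneous data found in the legacy session store from leaking into the new system during the exchange.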

After testing this new exchange mechanism and confirming it worked, we rolled out the additional functionality prior to our next ramp up. With this new feature in place, we observed our conversion rate had recovered, coming back to the same level as customers with the legacy session. 

Conclusion

Migrating a critical piece of functionality to a new system, especially one built into the existing framework, is tricky. Startups in particular may find feature-complete frameworks such as Django provide ease-of-use and rapid product prototyping. On the flip side, going with any particular framework may cause teams to defer their architecture decisions, leaving companies with unowned pieces of functionality, tribal knowledge, and other surprises under the hood.

While this migration was a challenging task, it was also an opportunity to test our engineering teams’ soft skills. Success depends on being able to work with other teams to identify ownership, delegate responsibilities if necessary, and communicate upcoming changes. During such a critical process, it’s important to exercise good judgment and either resolve conflicts among peers or escalate them.

Rolling out any risky, foundational functionality can have a big impact on business metrics. Not only is a rollout plan required, but validating it is crucial for a successful result. Our focus on observability, from both the engineering and business side, gave us the safety and confidence for the entire rollout, and ultimately an incident-free migration.

Ultimately, the team’s customer obsession is what drove us to success. From the very beginning, we knew we had to deliver a smooth transition experience to our customers, both internal and external. We dug into the details to discover hidden dependencies on the monolith session management system, monitored closely for any business impact on our end users, asked ourselves if we could deliver a better experience, and ended up with our session exchange mechanism to repair our conversion drag. At the end of the day, there was no better compliment than the lack of notice from our customers.

Acknowledgments

The authors would like to thank Jie Qin, Li Pei, Dananjayan Thirumalai, Rohit Raghunathan, Ivar Lazzaro, Dmitriy Dunin, Bri Lister, Ram Prasanna, Vlad Protsenko, Michael Sitter, Esha Mallya, Corry Haines, Hang Yin, Kevin Chen, Robert Lee, Arun Kumar Balasubramani, and Kyle Mogilev for their contribution and help throughout this effort.

About the Authors

  • Sin Ko
  • DJ Thirumalai

    DJ Thirumalai is a software engineer at DoorDash.

  • Li Pei

    Li Pei is a software engineer on DoorDash's New Verticals team.
