Making Applications Compatible with Postgres Tables BigInt Update

January 19, 2022

Maggie Fang

Previously, DoorDash relied on Postgres as its main data storage and used Python Django database models to define the data. By default, Django adds an ID field as the primary key, an auto-incrementing Integer whose upper bound is about 2.1 billion (2³¹ - 1). To support the rapid growth of the business, we needed to evolve the data models from Integer to BigInt before we ran out of the Integer ID range and hit integer overflow. While we were able to complete the primary key data type upgrade in tables without downtime, we still needed to guarantee our systems were fully compatible with the tables' BigInt upgrade.

The compatibility challenges of upgrading to BigInt 

To launch IDs beyond Integer, we needed to upgrade the table's primary key, as well as other tables referencing that primary key, to BigInt. Applications that use the upgraded tables also had to be able to correctly insert and accept data beyond the Integer limit. Below were some of the challenges we needed to address before we started rolling out BigInt across our systems:

Foreign key constraints were not always present in cross-table references

When upgrading the primary key data type in a table, we also needed to upgrade other tables with fields referencing the upgraded primary key, for example, tables with a foreign key on the updated primary key. However, some tables simply defined an ordinary field, rather than a foreign key constraint, to reference the upgraded primary key. In our case, when the ID of an order in the order table was passed down to another service, the order ID was saved as an ordinary field under the unrelated name target_id in that service's table. Such references without a foreign key constraint were harder to identify.

Hard to detect if IDs were handled properly in backend applications with a microservice architecture

There was no easy way to confirm whether applications' code could handle the new ID data type properly. This was particularly true for data used across domains under the microservice architecture, where different flows run in isolated services. It's hard to tell exactly how the upgraded table's ID flows and how it is handled inside the services that use it. For example, if the data type of the ID is defined as Integer in the API request and response between services, or if the code downcasts the ID from Long to Integer, we could end up with exceptions or incorrect data as soon as any table begins to use IDs exceeding the Integer range.

Guaranteeing compatibility on all client platforms and versions

Similar data type incompatibility issues could also happen on user-facing clients, such as our Android, iOS, and web apps. For example, if a client application defines a data object with an Integer ID to receive a response from the backend, the client won't work when the backend returns a response with an ID that exceeds the Integer range. We also needed to make sure the BigInt upgrade worked well not only in future client versions, but also in older app versions.

Identifying compatibility issues by using separate PostgreSQL sequences to control data creation

We developed a solution that let us experimentally use the BigInt range for a small set of users to identify compatibility issues before BigInt reached all users. Here was the high-level process:

  1. Created a new, additional PostgreSQL sequence with a starting ID greater than the upper bound of Integer, alongside the original default sequence, and associated it with the table whose primary key data type had been upgraded.
  2. Developed a new code path that specifically used the new sequence to insert data beyond the Integer range.
  3. Rolled out the new code path to a small set of users to find and fix potential issues with IDs beyond the Integer range, while the original sequence of IDs within the Integer range continued to serve the remaining user traffic.
  4. Switched the new sequence to be the default sequence for the table once experiments confirmed it worked reliably.

Step 1: Create additional sequences for the BigInt-upgraded tables 

After the data type of a table's primary key and of the columns in other referencing tables were upgraded from Integer to BigInt (how to upgrade column data types in tables is another big topic, but not the focus of this blog), the tables had the capacity to handle data beyond 2³¹ - 1. We created a separate sequence with a starting ID greater than the Integer upper bound, choosing 2³² as the new start. Values between 2³¹ and 2³² - 1 would wrap around and become negative when truncated to a 32-bit signed Integer; those negative integers could serve as a precious escape hatch if we didn't get all systems ready in time before we overflowed. The sample below shows how we created the sequence:

CREATE SEQUENCE __bigint_tablename_id_seq START WITH 4294967296; 
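As a quick sanity check (illustrative, not part of the migration itself), the first value drawn from the new sequence should already sit beyond the 32-bit Integer range:

-- Illustrative check: the new sequence starts past the Integer upper bound.
SELECT nextval('__bigint_tablename_id_seq');  -- returns 4294967296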

Step 2: Use the two different sequences for data insertion to the same table

We continued to use the original default sequence associated with the table to generate primary keys within the Integer range for public traffic. Meanwhile, we created a new, parallel code path that specifically used the new sequence from step 1 to generate IDs beyond the Integer limit when inserting records into the table. We set up an experiment to control which code path was used before inserting the data. The flow is shown in Figure 1 below. 

          Figure 1: Using two different sequences to insert data, so that new records receive IDs in different ranges depending on whether the new sequence is used

The code sample below shows how we implemented the flow of Figure 1 in a Kotlin-based microservice using JDBI to interact with the database.

@GetGeneratedKeys
@SqlUpdate("insert into $sampleTable ( $insertColumns ) values ( $insertValues )")
fun create(@BindKotlin obj: DBObj): Long

We originally used the query above to insert new data. Here $insertColumns and $insertValues didn't include the ID column and its value, so the ID was automatically generated by the default sequence associated with the table, which produced IDs within the Integer range.

@GetGeneratedKeys("id")  
@SqlUpdate("insert into $sampleTable ( $insertColumnsWithId ) values ( $insertValuesWithIdInNewSequence )")
   fun createWithIdInNewSequence(@BindKotlin obj: DBObj): Long

Meanwhile, we defined the new interface above to specify the ID value when inserting new data. This ID value was generated from the new sequence we created. In the query, insertColumnsWithId = "id, $insertColumns" and insertValuesWithIdInNewSequence = "nextval('__new_sampletable_id_seq'::regclass), $insertValues".

The nextval('__new_sampletable_id_seq'::regclass) call gets the next value from the new sequence. In this case, the data inserted through this interface had IDs beyond the Integer limit, so we could test how our systems handled BigInt.
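For context, the string constants referenced by the two queries might be defined roughly as below; the table, column, and value names here are illustrative assumptions rather than our actual schema:

// Hypothetical constants backing the queries above; table, column, and sequence
// names are illustrative only.
const val sampleTable = "sample_table"
const val insertColumns = "user_id, amount"
const val insertValues = ":userId, :amount"
const val insertColumnsWithId = "id, $insertColumns"
const val insertValuesWithIdInNewSequence =
    "nextval('__new_sampletable_id_seq'::regclass), $insertValues"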

We ran an experiment to select which sequence to use when inserting data into our system. When the new sequence was selected, we were able to use the ID beyond the Integer range for data insertion.

if (user in experiment) {
    createWithIdInNewSequence(obj)
} else {
    create(obj)
}
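
Put together, the gate might look like the sketch below; SampleDao stands for the JDBI interface defined above, and ExperimentClient is a hypothetical stand-in for the experimentation framework, not DoorDash's actual API:

// A minimal sketch of the experiment gate; ExperimentClient and the experiment
// name are assumptions for illustration.
interface ExperimentClient {
    fun isInTreatment(experimentName: String, userId: String): Boolean
}

fun insertSample(dao: SampleDao, experiments: ExperimentClient, obj: DBObj, userId: String): Long =
    if (experiments.isInTreatment("use_bigint_id_sequence", userId)) {
        dao.createWithIdInNewSequence(obj)  // ID drawn from the new sequence, beyond the Integer range
    } else {
        dao.create(obj)                     // ID drawn from the original default sequence
    }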

Step 3: Utilize the ability to insert data beyond the Integer range to detect issues 

As stated in step 2, we kept the current default ID sequence serving public traffic until we confirmed all flows and apps could handle the BigInt upgrade. Meanwhile, we rolled out a small slice of internal user traffic to the new-sequence code path and monitored for any abnormality.

We were able to detect a number of issues during our gradual rollout: 

Read failure on PK upgraded table because API defined the ID in the request/response payload as Integer

Though we could successfully insert data whose IDs were greater than the Integer range into the table, some APIs failed to retrieve such data because the API request only accepted Integer IDs. To fix such errors, we just needed to upgrade the parameter data type from Integer to Long in the problematic endpoints' requests or responses.
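As a simple illustration (the payload and field names below are hypothetical, not our actual API), the fix amounts to widening the ID field in the request or response model:

// Hypothetical response payload; the field name is illustrative.
// Before: deserialization breaks once the ID exceeds 2^31 - 1.
data class OrderResponse(val orderId: Int)

// After: the ID is widened to Long, so values beyond the Integer range round-trip correctly.
data class OrderResponseFixed(val orderId: Long)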

Write failure on referencing tables because referencing tables were not upgraded

From placing an order to the order being delivered, complicated data flows through different services, and those services persist the data they need into their respective databases. During the test rollout, we got reports from a downstream service that it was failing to persist data into its table because of an ID reference to our table.

It turned out that we hadn't upgraded their table, which had a reference column (not a foreign key) pointing to our BigInt-upgraded table. The column in their table referencing our upgraded primary key was still in the Integer format, which led to failures when writing data to their table whenever the referring ID was generated by the new sequence. We fixed these overlooked tables by upgrading the reference column data type to align with the upgraded BigInt table. 
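The fix itself was a straightforward column type change; the sketch below uses a hypothetical downstream table name, with target_id standing in for the plain reference column mentioned earlier:

-- Illustrative only: widen the plain (non-foreign-key) reference column so it
-- can hold IDs generated by the new BigInt sequence.
ALTER TABLE downstream_table ALTER COLUMN target_id TYPE bigint;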

Wrong IDs were persisted into tables due to ID downcasting that raised no errors 

Writing and reading data from the new BigInt sequence successfully was not enough; we also needed to confirm the correctness of that data. For example, one team had an ETL job that joined the Transaction table with the Order table on the order ID to generate a data report. The ETL job ran a query with the following structure:

select * from order o 
join transaction t on t.order_id = o.id 
where …..

Though the data type of the order ID in both the Order table and the Transaction table had been upgraded to BigInt, there was a line of code that downcast the original order ID from Long to Integer, and the result was persisted to the Transaction table:

  targetId = orderId.toInt() 

This Long.toInt() cast runs without any error, but the resulting value is represented by only the least significant 32 bits of the Long value. So data generated from the new sequence could be written and read successfully, but not correctly. Because of this wrong data type cast, the ETL job couldn't produce the expected results until we fixed it! 
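To make the failure mode concrete, a quick check of Kotlin's Long.toInt() truncation (values chosen purely for illustration) shows how silently the corruption happens:

fun main() {
    // Long.toInt() keeps only the least significant 32 bits and raises no error.
    println(2_147_483_648L.toInt())  // 2^31      -> -2147483648 (wraps negative)
    println(4_294_967_296L.toInt())  // 2^32      -> 0
    println(4_294_967_297L.toInt())  // 2^32 + 1  -> 1 (a valid-looking but wrong ID)
}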

Step 4: Switch the new sequence to be the default sequence for the table

After resolving all the issues found in step 3, we were confident that our systems were compatible with IDs beyond the Integer range. The last thing to do was to replace the original default sequence with the new sequence as our default ID generator. This was necessary because, even though the original sequence could also generate IDs greater than the Integer range, it wasn't aware that part of that range had already been occupied by the new sequence. Before the two sequences could collide, we needed to use only the new sequence, so we associated the new sequence with the table as its default and abandoned the original one.
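In Postgres, this switch amounts to pointing the ID column's default at the new sequence; the table and column names below are illustrative:

-- Illustrative sketch: make the new sequence the table's default ID generator.
ALTER TABLE sample_table
    ALTER COLUMN id SET DEFAULT nextval('__bigint_tablename_id_seq'::regclass);

-- Optionally tie the sequence's lifecycle to the column it now serves.
ALTER SEQUENCE __bigint_tablename_id_seq OWNED BY sample_table.id;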

When we reached this point, the BigInt upgrade migration was smoothly done, and our systems could support IDs beyond the Integer range.

Conclusion

This article presented the technique of adding additional Postgres sequences to detect compatibility issues during a Postgres table BigInt upgrade, making sure there were no blind spots in our systems. A BigInt migration is a company-wide task that can be challenging all the way from the lower storage layer, where the data is migrated, to the higher application layers that must remain compatible. A recent blog post suggests using BigInt by default, observing that a production-scale database on properly sized hardware can handle the slight overhead without a problem. If you need to go through a BigInt upgrade, we hope this article sheds light on how to detect compatibility issues and fix them!

Acknowledgements

Many thanks to Cesare Celozzi and Zhengli Sun for their great support, to Alessandro Salvatori for his awesome work on the table schema change and migration, and to Patrick Rogers, Saba Khalilnaji, Justin Lee, Malay Harendra, and Andy Liu for their timely help fixing issues to make the upgrade go smoothly.
