Optimizing OpenTelemetry’s Span Processor for High Throughput and Low CPU Costs

When companies move to microservices, they need to address a new challenge of setting up distributed tracing to identify availability or performance issues throughout the platform. While various tools offered on the market or through open-source perform this task, there is often a lack of standardization, making leveraging these tools costly or complicated.

As DoorDash migrated off its monolith to microservices, we considered these issues and chose OpenTelemetry, an open source distributed tracing project, for its emphasis on telemetry data standardization. To make this implementation successful, we needed to ensure that we could utilize OpenTelemetry without incurring any performance costs that would negatively affect our platform and user experience. To do this, we created a baseline of how OpenTelemetry performed on our platform. We then researched six ways of optimizing its performance and compared them side by side to pick a winner.

What is distributed tracing?

Before we jump into our benchmarking experiments, let’s review some important terminology. Tracing is the process of specialized logging that records information about a program’s execution in real-time. A trace, for example, contains information about the time a program spends in a function. Traces are generally helpful in debugging software performance issues.

Tracing in a monolith architecture is relatively easy. But in a microservice architecture, where a request travels through many services, it is often difficult to get an end-to-end picture of the request’s execution, making it challenging to identify availability or performance issues through traditional tracing.

Distributed tracing, a method to profile and monitor applications built on microservice architecture, solves traditional tracing challenges. As DoorDash moved away from a monolith architecture to a microservice distributed architecture, we needed distributed tracing to identify and resolve availability and performance issues quickly.

Many open source projects or commercial vendors provide distributed tracing. However, for many of the options on the market, there is a lack of standardization, resulting in a lack of data portability and a costly burden on users to maintain the distributed tracing instrumentation libraries. The OpenTelemetry project solves the standardization problems by providing a single, vendor-agnostic solution.

What is OpenTelemetry?

OpenTelemetry is a collection of tools, APIs, and SDKs to instrument, generate, collect, and export metrics, logs, and traces for analysis to understand software's performance and behavior. OpenTelemetry formed as a merger of OpenTracing and OpenCensus after the developers of those projects decided that it’s better to develop a single, vendor-agnostic approach for enabling observability.

OpenTelemetry supports three data sources: traces, metrics, and logs. At Doordash, we decided to adopt OpenTelemetry for traces and began testing its distributed tracing feature for our platform.

How does distributed tracing work using OpenTelemetry?

Once we decided to use OpenTelemetry, we wanted to look into its distributed tracing functionality. Distributed tracing tracks a single request’s progression across the multiple services that form an application. An example of a request might be a call from a mobile endpoint or a UI action by an end-user. A single trace is essentially a tree of spans, where spans are objects that represent a unit of work done by individual services or components involved in the request. A single trace contains a single root span that encapsulates the entire request. The root span would contain child spans which in turn may have child spans.

Using OpenTelemetry, services generate a trace with a globally unique identifier. OpenTelemetry propagates the trace context to all underlying services where each service generates the spans. OpenTelemetry sends the spans to a local collector, which enhances the trace data with custom metadata. The local collectors send the trace data to a collector gateway, where they are processed and shipped to distributed tracing tools such as Jaeger, NewRelic, Splunk or LightStep for storage and visualization.

The trace identifier generated at the root span remains the same for all the child spans within the request. Distributed trace visualizers use this trace identifier to stitch together the spans generated for the same request, thus providing an end-to-end picture of the request execution across the application. Figure 1, below, outlines the high-level architecture of how we set up OpenTelemetry’s distributed tracing.

Infographic of how open telemetry works — Figure 1: OpenTelemetry handles the spans generated by microservices and exports them to a local collector. The local collector enhances the trace data with custom metadata and sends them to a collector gateway. The collector gateway processes the trace data and exports it to tracing tools for storage and visualization.

Load testing OpenTelemetry to identify performance impact

Any significant increase in request latency or CPU overhead would mean a negative experience for DoorDash users or higher resource usage. Once we decided to adopt OpenTelemetry, we wanted to understand the performance impact of using OpenTelemetry’s distributed tracing functionality in our services.

Using our internal load testing tool, we tested the OpenTelemetry agent attached to a Netty gRPC service written in Kotlin. The gRPC service has two rpc calls, createMessage and getMessages, which use AWS’ simple queue service to store and retrieve the messages. OpenTelemetry’s Java agent is attached to the service with the following span exporter configuration options:

otel.traces.exporter=otlp (OpenTelemetry Protocol)
otel.traces.sampler.arg=1 (100% sampled)
otel.bsp.max.queue.size=2048
otel.bsp.max.export.batch.size=512
otel.bsp.schedule.delay=5000 (5 seconds)

We hit the two rpc calls using a synthetic load generating ~500 peak rps. The service deployment has a 1000 millicores CPU and 1000 mebibytes memory allotted in a Kubernetes cluster. We observed 72% pod CPU utilization at peak but only observed 56% pod CPU utilization at peak without the exporter. These experiments show that we have to spend more computing resources to adopt OpenTelemetry in our platform.

Graph charting our CPU utilization percentage — Figure 2: We observed 56% pod CPU utilization when OpenTelemetry’s exporter is disabled

Figure 3: We observed 72% pod CPU utilization when OpenTelemetry’s exporter is enabled. The CPU utilization is higher compared to the experiment where the exporter is disabled

At this point, we noticed an increase in CPU usage with OpenTelemetry when the span exporter is enabled. CPU profiling is generally helpful in identifying these kinds of problems. Our internal load testing tool has an option to enable the YourKit Java profiler and automatically collect CPU profiling data from the service under test. We leveraged this profiling option to collect the CPU profiling data to identify the performance issue. The CPU profiling data showed a background thread named BatchSpanProcessor_WorkerThread that is continuously polling spans from a blocking queue.

Figure 4: Profiling stack traces show the CPU overhead is coming from the batch span processor’s worker thread

Note that even though the profiler shows that the batch span processor’s thread is in a waiting state, there is a direct and indirect CPU cost due to the kernel’s context switches. We played with the batch span processor’s (BSP) options as described below to see if they would help reduce the CPU overhead, but they didn’t help.

Increased batch size otel.bsp.max.export.batch.size to 5120
Reduced trace sampling otel.traces.sampler.arg to 0.01
Increased schedule delay otel.bsp.schedule.delay to 300000 (5 minutes)

Once the workarounds discussed above didn’t help resolve the performance problem, we wanted to do a deep dive into the implementation of OpenTelemetry’s BSP.

How does OpenTelemetry’s BSP work?

Once we identified that the BSP is causing the CPU overhead, we inspected OpenTelemetry’s codebase and its configuration options to understand BSP. Figure 5, below, shows the implementation of BSP, the component in OpenTelemetry that processes the spans generated by microservices and exports them to the collector.

Figure 5: Service threads add spans to a bounded blocking queue. Exporter thread polls the queue continuously and sends the spans to the collector in batches

The OpenTelemetry specifications outline BSP behavior as follows:

otel.bsp.max.queue.size controls the maximum number of spans in the waiting queue. Any new spans are simply dropped once the queue is full.
The exporter thread waits until the current number of spans it collected from the queue reaches otel.bsp.max.export.batch.size before sending them to the collector in one batch. Batching, as explained above, makes perfect sense to reduce the overhead of any network or I/O calls.
Once every otel.bsp.schedule.delay interval, the exporter thread will send any spans it collected from the queue.

Creating benchmarks to establish a performance baseline for the BSP

After going through the BSP’s implementation, we noticed two apparent issues. The first one is the lock contention with the ArrayBlockingQueue. ArrayBlockingQueue uses an exclusive lock to provide threads safe access to the underlying queue. Only one thread holds access to the queue at a time while other threads wait for their turn. This access pattern results in threads doing a lot of waiting, leading to low throughput.

The second issue is the exporter thread’s continuous polling. The exporter thread receives a signal for each new queue entry due to the continuous polling, which causes context switch overhead resulting in high CPU usage. We wrote two benchmarks using Java Microbenchmark Harness (JMH) that surfaced the above two performance issues, then we established a performance baseline and compared the new BSP’s implementations.

Use Java Microbenchmark Harness to write benchmarks

Writing benchmarks using a naive framework is difficult and gives misleading results as many JVM and hardware-level optimizations are applied to the component under test, making the code appear to perform better than in reality. These optimizations are not applicable when the component under test runs as part of a larger application. JMH provides a convenient way to warm up JVM, run multiple iterations, and create multiple worker threads.

The example shown below runs a benchmark measuring throughput (operations per second) with five threads, warms up the JVM for a second, and runs five iterations where each iteration runs for five seconds. Each thread continuously processes a new span for the entire duration of the benchmark.

@Benchmark
@Fork(1)
@Threads(5)
@Warmup(iterations = 1, time = 1) 
@Measurement(iterations = 5, time = 5)
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
public void export_05Thread(
   BenchmarkState benchmarkState, @SuppressWarnings("unused") ThreadState threadState) {
 benchmarkState.numThreads = 5;
 benchmarkState.processor.onEnd(
     (ReadableSpan) benchmarkState.tracer.spanBuilder("span").startSpan());
}

Establishing the right metrics for performance comparison

Once we decided to use JMH to write benchmarks, we wanted to establish the right performance metrics to compare the BSP’s different implementations.

The number of span exports per second

Although the primary JMH metric, operations per second, is a good metric in general, the number of spans getting exported is much better for understanding the BSP’s performance and effectiveness. For example, an implementation of the BSP that aggressively drops the spans would have very high throughput, but it does little valuable work because traces have missing spans. JMH provides a convenient way to track secondary metrics such as exports per second. We will be using the metric exports per second going forward to compare different versions of the BSP. Since we are interested in testing BSP in an isolated environment, we faked the network calls that send spans to the collector to do nothing. This kind of faking was necessary to prevent any external noise in the benchmark experiments.

Measuring CPU time of the exporter thread

Measuring CPU usage of the exporter thread is also very important since we are interested in making the exporter thread very lightweight. Measuring exporter thread CPU is very tricky with JMH since it consumes as much CPU as possible to hit peak throughput. We employed a simple hack shown in the example below to run the benchmark in a steady state. Adding a significant sleep in the JMH worker threads, we generated consistent requests per second.

private static void doWork(BenchmarkState benchmarkState) {
 benchmarkState.processor.onEnd(
     (ReadableSpan) benchmarkState.tracer.spanBuilder("span").startSpan());
 // This sleep is essential to maintain a steady state of the benchmark run
 // by generating 10k spans per second per thread. Without this, JMH outer
 // loop consumes as much CPU as possible making comparing different
 // processor versions difficult.
 // Note that time spent outside of the sleep is negligible allowing this sleep
 // to control span generation rate. Here we get close to 1 / 100_000 = 10K spans
 // generated per second.
 LockSupport.parkNanos(100_000);
}

While running the benchmark, we used the YourKit Java profiler to measure the CPU time of the exporter thread.

Configuration used for the benchmark

All the benchmark experiments are using the configuration given below.

otel.bsp.max.queue.size = 2048
otel.bsp.max.export.batch.size = 512
otel.bsp.schedule.delay = 5000ms

Benchmark results with different implementations of the BSP

Once we established the metrics, it was easy to compare and contrast the BSP’s different implementations. The goal is to increase the exports per second while reducing the CPU usage of the exporter thread. We will be using throughput to signify the exports per second in the benchmark results. Note that all the benchmark results discussed below are from runs using a 2.4 GHz 8-core Intel processor. When trying to figure out the best batch span processor, we looked at the following options:

ArrayBlockingQueue
ConcurrentLinkedQueue
Disruptor with TimeoutBlockingWait strategy
Disruptor with SleepingWait strategy
ArrayBlockingQueue with signal batching
ConcurrentLinkedQueue with signal batching
MpscQueue with signal batching

For each of these options, we ran the benchmarks and compared the results. This approach gave us a side-by-side comparison of exports per second and the CPU time, which enabled us to pick a winner and use it in OpenTelemetry’s BSP.

BSP using ArrayBlockingQueue

The implementation using Java’s inbuilt ArrayBlockingQueue is the existing implementation and was used as a baseline. It uses a preallocated array buffer to implement the queue and a lock for concurrent access. Figure 6 and Figure 7, below, show the benchmark results while using this approach.

Figure 6: We can see a massive drop in throughput with an increase in the number of threads due to lock contention with the array blocking queue.

Figure 7: There is a significant increase in CPU time with an increase in the number of threads due to lock contention and constant polling of the queue.

Our benchmark results show that the ArrayBlockingQueue approach does not scale since the throughput and CPU time get steadily worse as the number of threads increases mainly due to the lock contention.

There is additional overhead hurting the throughput besides the lock contention with the increasing number of threads. A rising number of threads usually puts pressure on the memory subsystem, resulting in decreased performance. Modern processors rely heavily on cache memory to keep a thread’s state. Cache memory is usually one or two orders of magnitude faster than main memory. When we have too many threads, the cache becomes full, and the processor has to evict the cache entries frequently. The evictions result in frequent cache misses leading to main memory access, resulting in decreased performance.

BSP using ConcurrentLinkedQueue

The next obvious choice is to use Java’s inbuilt ConcurrentLinkedQueue. ConcurrentLinkedQueue implements a non-blocking algorithm using atomic compare-and-swap (CAS) supported by modern processors. Given the nature of the algorithm, it is only applicable to an unbounded queue. An unbounded queue is not a great option since we need to allocate memory for every push operation with the queue and deallocate once the entries pop from the queue head. Figure 8 and Figure 9, below, show the benchmark results we got while using this approach.

Figure 8: There is a significant increase in throughput due to less lock contention, but the throughput flattened out early with an increase in the number of threads.

Figure 9: There is a visible decrease in CPU time but a quite significant overhead remains

Using ConcurrentLinkedQueue, we see an increase in throughput and a decent decrease in the CPU time of the exporter thread. However, the throughput flattens out much earlier than expected with the increase in the number of threads. We identified that there are subtle issues with concurrency on a queue data structure.

We noticed that there is contention on head or tail even with CAS. The head and tail usually occupy the same cache lines, making cache coherence very expensive in modern processors due to frequent cache misses. This performance degradation phenomenon is commonly known as false sharing.
There is unnecessary garbage collector activity due to memory allocations for new nodes.
size() implementation of ConcurrentLinkedQueue is actually linear in complexity. Since OpenTelemetry specifications require a bounded size queue, we had to use an atomic counter to efficiently track the queue size and find whether the queue is full.

BSP using LMAX Disruptor

LMAX Disruptor, our next choice, is designed to address the concurrency concerns with queues. The key design ideas used by the Disruptor are:

It uses a specialized ring buffer with preallocated memory and memory padding to avoid false sharing.
It uses sequencing, a concurrency technique, to reduce lock contention with multiple producers. Producers claim the next slot in the ring buffer using an atomic counter. The claiming producer updates the ring buffer’s entry and commits the change by updating the ring buffer’s cursor. The cursor represents the latest entry available for consumers to read.
Consumers read from the ring buffer and have different waiting strategies when waiting for new entries.

Once we picked the Disruptor, we proceeded to the benchmark experiments with the two most common waiting strategies offered by Disruptor, TimeoutBlockingWait and SleepingWait.

Disruptor using TimeoutBlockingWait strategy

We first tried out the Disruptor with the timeout blocking wait strategy. When using this waiting strategy, the Disruptor lets consumers wait for new entries in the ring buffer using a conditional variable. This waiting strategy is similar to how the exporter thread is waiting for new entries using ArrayBlockingQueue or ConcurrentLinkedQueue. Figure 10 and Figure 11, below, show the benchmark results we got while using this approach.

Figure 10: The benchmark results show a very similar throughput to ConcurrentLinkedQueue, except we see less contention and an improved throughput with 20 threads

Figure 11: The benchmark results show a very similar CPU time to ConcurrentLinkedQueue, which is what we expected given that this waiting strategy is the same as the one we implemented using ConcurrentLinkedQueue.

The benchmark results show that the Disruptor with TimeOutBlockingWait strategy is a good choice because of the increase in throughput and decrease in CPU time compared to the baseline. These improvements occur because of the Disruptor’s non-blocking algorithm. The benchmark results also show that this approach did not perform better than the ConcurrentLinkedQueue approach, revealing that the writer threads’ signaling was the bottleneck.

Disruptor using SleepingWait strategy

After using the timeout blocking wait strategy, we wanted to try out the Disruptor with the sleeping wait strategy. When using this wait technique, the Disruptor’s consumer threads that are waiting for new entries initially go through a busy-wait, then use a thread yield, and eventually sleep. According to the Disruptor documentation, this waiting strategy will reduce the producing thread’s impact as it will not need to signal any conditional variables to wake up the exporter thread. Figure 12 and Figure 13, below, show the benchmark results we got while using this approach.

Figure 12: The benchmark results show a vast improvement in throughput. A near zero contention on the write path with no signaling overhead and a busy-wait on the exporter thread contributed to this improvement. But the busy-wait wastes CPU cycles, we could see its impact in the throughput drop with 20 threads.

Figure 13: The benchmark results show a huge increase in CPU time due to busy wait of the disruptor’s sleeping strategy. This high CPU time makes this approach not viable.

The benchmark results show that the Disruptor with SleepingWait strategy is not a good choice because of the significant increase in CPU time compared to the previous approaches. We noticed that this increase in CPU time was due to the busy-wait of the exporter thread. It is important to note that the benchmark results show that this approach was better for high throughput. This high throughput occurred because this approach did not require signaling of the exporter thread. However, the throughput flattened out with 10 threads or more due to the pressure on the memory subsystem and the CPU cycles wasted by the busy-wait. After analyzing these benchmark results, we moved to explore new ways of reducing the signaling overhead while lowering the exporter thread’s CPU time.

Batching the signals to reduce the exporter thread’s CPU time

Thus far, the benchmarks with different implementations showed us the most significant bottleneck to throughput is the writer threads’ locking and signaling, and the significant CPU cost to the exporter thread is context switching or a busy-wait.

The exporter thread is notified about any new entry right away, even though it will export the spans only when its buffer reaches maximum export size. This kind of exporting behavior allows us to reduce the exporter's thread CPU time by notifying only when it will end up doing the export operation.

We use an atomic integer to let the writer threads know the number of spans the exporter thread needs for the export operation. The hypothesis is that this kind of signal batching will reduce the writer threads’ contention and the exporter thread’s frequent context switches. Below is a pseudocode implementation of the signal batching.

// pseudocode of the writer thread
if queue.offer(span):
 if queue.size() >= spansNeeded.get():
   // notify the exporter thread 
   signalExporterThread()



// pseudocode of the exporter thread 
while continue:
  while queue.isNotEmpty() and buffer.size() < EXPORT_SIZE:
    // Fill the local buffer 
    buffer.add(queue.poll())

  if buffer.size() >= EXPORT_SIZE:
    // Export the buffer 
    export(buffer)
 
  if queue.isEmpty():
    // Wait till there are enough spans for next export
    spansNeeded.set(EXPORT_SIZE - buffer.size())
    waitForSignal(WAIT_TIME)
    spansNeeded.set(Integer.MAX_VALUE)

BSP using signal batching and ArrayBlockingQueue

This implementation uses the ArrayBlockingQueue but batches the signals. Figure 14 and Figure 15, below, show the benchmark results using this approach.

Figure 14: The benchmark results show a very good improvement in throughput with signal batching compared to batch span processor’s implementation without signal batching. But the throughput suffered due to lock contention with the blocking queue.

Figure 15: The benchmark results show a huge decrease in CPU time of the exporter thread. This confirms the hypothesis that signal batching would reduce the exporter thread’s context switches.

The benchmark results show that the signal batching with ArrayBlockingQueue is a good choice compared to the baseline because of the improvement in throughput and a significant decrease in the CPU time. But, the throughput suffered with the increase in number of threads due to lock contention and pressure on the memory subsystem. We therefore proceeded to test signal batching with other queue implementations.

BSP using signal batching and ConcurrentLinkedQueue

This implementation uses Java's inbuilt concurrent linked queue to reduce lock contention and signal batching to reduce the exporter thread’s CPU time. Figure 16 and Figure 17, below, show the benchmark results using this approach.

Figure 16: The benchmark results show a huge improvement in throughput due to reduced contention but throughput flattened out early similar to the ConcurrentLinkedQueue implementation without signal batching.

Figure 17: The benchmark results show a similar CPU time of the exporter thread with BlockingQueue and signal batching.

The benchmark results show that the signal batching with ConcurrentLinkedQueue is a good choice compared to the previous approaches because of the improvement in throughput and significant decrease in the CPU time. We then proceeded to test signal batching with other queue implementations to see if we could find an even better solution.

BSP using signal batching and disruptor

The LMAX Disruptor doesn’t offer any batch waiting strategy. Although it is possible to implement a new waiting strategy inside the Disruptor, we found an alternate solution, described in the next section, that is much easier to implement.

BSP using signal batching and MpscQueue

Even though LMAX Disruptor is not applicable with regard to batching signals, the Disruptor’s fundamental design principles, as shown in the above benchmarks, worked very well to mitigate the concurrent linked queue’s contention issues. We learned about different concurrent queue implementations offered by the JCTools library, and the multi-producer single consumer queue (MpscQueue) worked perfectly for the BSP’s use case.

It is tailor-made for the BSP’s use case where we have multi-writer threads and a single consumer, the exporter thread.
It uses a ring buffer and sequencing to resolve head or tail contention. The preallocated ring buffer implies there is no unnecessary garbage collector activity.
It uses memory padding to avoid false sharing.
MpscQueue doesn’t signal the consumer about the new entries, so it is easy to implement custom signaling with batching.

We noticed that MpscQueue is very similar to the Disruptor, but we can easily apply the signal batching. Figure 18 and Figure 19, below, show the benchmark results using this approach.

Figure 18: The benchmark results show this approach to be the best performer. MPSCQueue reduces the write contention seen with the ConcurrentLinkedQueue implementation and signal batching reduces the overhead of notifying exporter thread and context switches. The throughput flattened out with ten worker threads due to the pressure on the memory subsystem similar to what we have seen in other benchmark results.

Figure 19: The benchmark results show this approach to be the best performer. MPSCQueue with signal batching has reduced the CPU time of the exporter thread even more than using signal batching with other queue structures.

The benchmarks show that MpscQueue with the signal batching approach is best for high throughput and a less CPU intensive batch span processor. MpscQueue does resolve the write contention issue seen in ConcurrentLinkedQueue and, with signal batching, reduces the CPU overhead of the exporter thread. The throughput flattened out with ten threads or more due to the pressure on the memory subsystem as seen in the previous benchmarks. We learned that the existing waiting strategies with the Disruptor tend towards lowering the latency and are CPU intensive, whereas MpscQueue with signal batching proved to be a good alternative.

And finally, the load test that initially revealed the problem showed us the optimization reduced CPU utilization.

Figure 20: We observed 56% pod CPU utilization when Open Telemetry’s exporter is disabled.

Figure 21: We observed 56% pod CPU utilization when Open Telemetry’s exporter with the new batch span processor is enabled. The experiment showed that the optimizations helped reduce resource usage.

Trade-Offs with signal batching

Although MpscQueue with signal batching worked best with the default configuration options, it hurts throughput if the export batch size (otel.bsp.max.export.size) is close to the maximum queue size (otel.bsp.max.queue.size) or higher. The trade-off for higher throughput is using a bigger queue with a size greater than the export batch size.

Conclusion

For any companies running on a microservice architecture, distributed tracing is an essential tool. OpenTelemetry is an excellent choice for addressing distributed tracing needs, but the performance issues we have seen with OpenTelemetry’s BSP component could negatively impact throughput and increase resource usage costs.

To address the negative impact, we contributed benchmarks and fixes to the OpenTelemetry project. Specifically, we contributed JMH benchmarks and optimizations in the BSP to improve throughput and reduce CPU cost. These optimizations will be available in the next release of the OpenTelemetry java library. Generally speaking, we believe the benchmarks and the different queue implementations we discussed could be applied to build highly performant multi-threaded producer-consumer components.

Acknowledgments

Thanks to Amit Gud and Rabun Kosar for their OpenTelemetry adoption efforts. Many thanks to the OpenTelemetry project maintainers who responded quickly to the batch span processor’s performance issues and worked with us to implement the optimizations.