Optimizing throughput of Kafka consumers

This article is about ways to optimize the throughput of Kafka consumers, specifically when using Spring Kafka. With some simple techniques we can improve the throughput by a large factor.

Kafka is often marketed as 'low latency' and 'real time', but Kafka is really designed for throughput. Many frameworks don't take proper advantage of this and could perform much better.

Take for example Spring Kafka. Spring Kafka allows you to process messages from Kafka by adding annotations to methods. This is the canonical example:

@KafkaListener(topics = "demo", groupId = "demo")
public void oneByOneListener(ConsumerRecord<String, String> record) {
    repository.save(record.key(), record.value());
}

Assuming the default settings, this boils down to:

while (true) {
    var records = consumer.poll(pollTimeout);
    for (var record : records) {
        try {
            oneByOneListener(record);
        } catch (Exception e) {
            handleError(e);
        }
    }
    consumer.commitSync();
}

Messages are processed one at a time, and the offsets are committed to Kafka once the messages have been processed, effectively acknowledging that they were handled.

For correctness, this is a good default. When an error occurs during processing of a message, the offset is not committed, and Spring Kafka will rewind the offset to ensure the message is processed again.

For performance, however, this is not a good approach.

Committing the offset takes time and puts some load on the Kafka broker. Another problem is that calls to databases or external services have some per-call overhead.

By processing batches of records we can greatly reduce this overhead.

This is also in line with Kafka's design: .poll() returns a batch of records for a reason.
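
Spring Kafka supports this through batch listeners. As a minimal sketch of what that can look like (assuming Spring Kafka 2.8+, where @KafkaListener has a batch attribute; repository.saveAll is a hypothetical bulk variant of the save call used above):

@KafkaListener(topics = "demo", groupId = "demo", batch = "true")
public void batchListener(List<ConsumerRecord<String, String>> records) {
    // a single bulk write instead of one write per record;
    // the offsets for the whole batch are committed once this method returns
    repository.saveAll(records);
}

On older versions you would instead configure a listener container factory with batch listening enabled. Either way, the commit overhead and the per-call overhead of the database are amortized over the whole batch.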

Background reading:

https://docs.spring.io/spring-kafka/reference/kafka/receiving-messages/message-listener-container.html#committing-offsets The Spring Kafka documentation on committing offsets.

Published: 2024-03-14

Tagged: clojure

Optimizing the memory footprints of caches with JOL

I've created a small demo project based on an actual project I did for a client some time ago. It features a basic Java project that shows how to measure the memory footprint of caches using the Java Object Layout (JOL) library, and some techniques to reduce this footprint.

This article explains how the memory footprint of caches can be measured and explains the optimization techniques. It is best read alongside the demo project's source code.

Motivation

Lots of applications use some type of in-memory storage. A common use case is caching results gathered from an external source, like a database or another service. Caches are usually added to improve performance and to reduce the load on the external sources.

Caching the results takes up heap memory. To prevent your application from running out of memory it is a good idea to know what the memory footprint of this storage is, and size the cache or heap accordingly.

Usually the cache has some settings to limit the maximum number of entries. How can you determine how much memory is used at most? Alternatively, what can you set this limit to in order to maximize the usage of the available memory?

To answer these questions you need to determine the average memory footprint of an entry in the cache. This is simple in theory, but somewhat tricky in practice: you could just walk the object graph, recursively sum the footprint of all the fields, and divide the total by the number of entries. However, the exact size of objects is hard to determine: each object carries some overhead, which may vary per JVM implementation and can even depend on the JVM options used.

Rather than depending on rough estimates, it is best to measure how much memory is used. JOL is a tool that traverses an object graph and sums the sizes, giving you an accurate measurement of the memory footprint.

Measuring the memory footprint of a cache

In the example project I set up a mock cache, which is populated with generated data. Just pretend the entries come from a database or some external service.

To measure the total memory usage of an object and everything it references we use JOL's GraphLayout.totalSize():

System.out.println(GraphLayout.parseInstance(cache).totalSize() / 1024d / 1024d + " MiB");

This will report the total size in MiB (mebibytes), which is also the unit used to configure heap sizes: for example, -Xmx500m results in a max heap size of 500 MiB = 500 * 1024 * 1024 bytes.
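
As a minimal, self-contained sketch of such a measurement (here a plain HashMap stands in for the demo project's mock cache, and the generated entries are only illustrative):

import java.util.HashMap;
import java.util.Map;
import org.openjdk.jol.info.GraphLayout;

// Requires the org.openjdk.jol:jol-core dependency.
public class CacheFootprint {
    public static void main(String[] args) {
        Map<String, String> cache = new HashMap<>();
        for (int i = 0; i < 100_000; i++) {
            cache.put("key-" + i, "cached value " + i);
        }

        // walk the object graph starting at 'cache' and sum the sizes of all reachable objects
        long totalBytes = GraphLayout.parseInstance(cache).totalSize();
        System.out.println("total: " + totalBytes / 1024d / 1024d + " MiB");
        System.out.println("avg / entry: " + (double) totalBytes / cache.size() + " bytes");
    }
}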

Note that parsing the object graph and computing the total size is a slow operation. This is not something you want to run often in production. My advice is to measure it locally during development using realistic data, or to use a feature flag to only enable it in test environments. Either way, it is important that the cache is filled with data of similar size as in production.

Once you know the total size, the average size per entry is easily calculated, and you can use this to estimate how much memory will be needed to support different cache sizes. For larger caches you may run into limits: the machine or VM might not offer enough memory, forcing you to upgrade to a more expensive machine. By optimizing the representation of objects stored in the cache, you can lower the memory footprint by a good amount. This might enable you to get good cache utilization without throwing resources at the problem.
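
As a worked example with the per-entry numbers measured later in this article: at roughly 1670 bytes per entry, a 500 MiB budget (500 * 1024 * 1024 bytes) fits about 314,000 entries, while the optimized representation at roughly 570 bytes per entry fits about 920,000 entries in the same budget.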

Optimizing memory usage

Often it is possible to reduce the memory footprint of an object by a factor without too much work. It does require writing customized classes specifically for use in the caches. There is some impact on readability and the result might be somewhat less idiomatic, but in some cases the savings are worth it. If it means you can lower your memory requirements by 3x, from 6000 MiB to 2000 MiB, you can run your application on cheaper machines. Memory can be expensive, especially in cloud environments. Besides lowering costs, optimizing the memory footprint of caches can also enable you to store more entries while using the same amount of memory. Larger caches may improve cache utilization and lower the overall load on the system by avoiding calls to external services or databases.

Optimizing the memory footprint of caches is a bit of a niche technique: for most applications it is not worth the effort, but for applications that process lots of requests the savings can be significant.

My recommendation is to introduce an optimized class just for use in the cache, which conforms to the same interface as the domain object. You can then apply these techniques to reduce the memory footprint.

In the demo project I added several implementations of the Timeline interface, using some of these optimizations:

Use the smallest datatype that works

In the example project, the simple implementation uses java.time.Instant to represent the start and end of an interval. The memory footprint of an Instant is composed of a long (8 bytes) and an int (4 bytes), plus the overhead of an object (12 bytes). Assuming that a precision of milliseconds is sufficient, we can replace each Instant with just a long.
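
As a rough sketch of this idea (these are hypothetical Window classes, not necessarily the demo project's actual code):

import java.time.Instant;

// the idiomatic variant stores two Instant objects per window...
record Window(Instant start, Instant end, int value) {}

// ...the optimized variant stores epoch milliseconds in primitive longs,
// saving two object headers (and the Instants' nanosecond fields) per window
record PrimitiveWindow(long startEpochMilli, long endEpochMilli, int value) {

    static PrimitiveWindow fromWindow(Window w) {
        return new PrimitiveWindow(w.start().toEpochMilli(), w.end().toEpochMilli(), w.value());
    }

    Instant start() {
        // convert back on demand when an Instant is actually needed
        return Instant.ofEpochMilli(startEpochMilli);
    }
}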

Only store essential fields

Often only a subset of the fields needs to be stored in the cache. By including only the required fields we can reduce the memory footprint.

In the demo project, the TimelineSimple class had a createdAt field which is not used, so it is omitted in the TimelinePrimitive and TimelinePrimitiveArrays classes.

Flatten object trees

Every object has some bytes of overhead; how much is implementation specific, but it is usually 12 bytes on modern 64-bit JVM implementations. The TimelinePrimitiveArrays class stores the start, end and value fields in parallel arrays, so no intermediate Window objects are needed.
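
A rough sketch of that layout (field and method names here are illustrative, not necessarily those of the demo project):

class TimelinePrimitiveArrays {
    // one array per field instead of one Window object per interval:
    // three object headers in total, rather than one per window
    private final long[] starts; // epoch millis
    private final long[] ends;   // epoch millis
    private final int[] values;

    TimelinePrimitiveArrays(long[] starts, long[] ends, int[] values) {
        this.starts = starts;
        this.ends = ends;
        this.values = values;
    }

    int valueAt(long epochMilli) {
        for (int i = 0; i < starts.length; i++) {
            if (starts[i] <= epochMilli && epochMilli < ends[i]) {
                return values[i];
            }
        }
        throw new IllegalArgumentException("no window covers " + epochMilli);
    }
}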

Trim collections

When you add an element to an ArrayList and the backing array is full, the backing array is grown (by 50% in the OpenJDK implementation), so it may take up more space than needed. You can use ArrayList.trimToSize() to trim the unused part of the backing array.

Alternatively, you can ensure the array has the right size when constructing it. Most stream methods to build an array, like LongStream.toArray(), already do this.
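
A small sketch of both options (loadEntries() and windows are hypothetical placeholders):

// after bulk-loading, trim the backing array down to the actual number of elements
ArrayList<String> entries = new ArrayList<>();
for (String entry : loadEntries()) { // loadEntries() is a placeholder
    entries.add(entry);              // the backing array grows in steps, possibly overshooting
}
entries.trimToSize();                // shrink the backing array to the exact size

// or build an exactly-sized primitive array directly from a stream
long[] starts = windows.stream()     // 'windows' is a hypothetical List<Window>
        .mapToLong(w -> w.start().toEpochMilli())
        .toArray();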

Measuring the impact of optimizations

When doing any kind of optimization you should always measure the results, so you know whether the effort is worth it and when you have reached your goals.

By comparing the memory footprint of the different implementations, we can check how well the optimizations work.

In the example project, I use three different implementations of a class: one in 'idiomatic' Java, one slightly optimized with primitive fields, and one using primitive arrays. These classes offer the same functionality, but with different memory footprints.

To compare the memory footprint of the variants, some mock data is generated and converted to the different representations. JOL's GraphLayout.totalSize() is then used to measure the total and average memory footprint for each representation.

Running the main function of the demo prints:

Idiomatic:
- elements: 100000
- total memory footprint: 159.15380859375 MiB
- avg memory / element: 1668.84864 bytes / element
Primitive fields:
- elements: 100000
- total memory footprint: 82.69098663330078 MiB
- avg memory / element: 867.07784 bytes / element
Primitive arrays:
- elements: 100000
- total memory footprint: 54.46696472167969 MiB
- avg memory / element: 571.12752 bytes / element

We see that the implementation based on primitive arrays uses about 3x less memory than the idiomatic implementation. The code is a little less idiomatic, but still readable in my opinion, and you can store roughly 3x more entries in the same space, so that is a good improvement. For applications that require large caches this may enable you to increase cache utilization or get by with cheaper machines with less RAM.

Background reading:

https://shipilev.net/jvm/objects-inside-out/ An extensive guide on Java memory footprints, also uses JOL.

https://www.baeldung.com/java-memory-layout A simple introduction to the memory layout of objects.

Consulting

If you are looking for someone who can reduce costs or speed up your system with these types of optimizations, or you want to make sure your applications don't run out of memory, contact optimize@chrisblom.net

Published: 2024-02-23

Tagged: jol java memory optimization performance

About

I am a software developer with a background in artificial intelligence, backed by 12 years of experience in diverse programming languages and domains.

My expertise is designing and building reliable, fast backend systems that handle lots of data. I can support the entire development cycle, from gathering requirements to design, architecture, planning, implementation, testing, operations, monitoring and optimization.

I usually work as a lead developer for clients in long-term engagements, but I also take on short-term projects. For projects with a clear specification I can work for a fixed price. For larger projects I can draw on a team of experts in my network. If you are seeking a seasoned professional for consultation, advice, or potential collaboration, feel free to contact me for a no-obligation consultation.

My approach is rooted in a genuine passion for problem-solving. I value attention to detail, experimentation, consistent quality, and collaborative teamwork in achieving results.

Key Areas of Proficiency:

Contact

LinkedIn
Github
Email: contact@chrisblom.net

Resume

English
Dutch

Business info

Kvk / Chamber of Commerce: 84135875

Published: 2024-01-01

Tagged: about contact

Archive