Apache Kafka can be an inscrutable beast for the uninitiated.
What is Kafka?
Apache Kafka is an open source project that provides powerful distributed processing of continuous data streams, and it is currently trusted in production by thousands of enterprises globally, including the likes of Netflix, Twitter, Spotify, Uber and more.
Its architecture and implementation make it highly reliable and highly available, enabling stream processing applications to utilise geographically distributed data streams. While Kafka is not difficult to use, it’s tricky to optimise.
Here are our first 5 rules that will help you perfect your Kafka system to get ahead!
Kafka is built for parallel processing. Partition count is set at the topic level. The more partitions, the more throughput you can get through greater parallelization. The downside is that more partitions mean more replication overhead and latency, more painful rebalances, and more open files on your servers. Keep these tradeoffs in mind. The most accurate way to determine optimal partition settings is to actually calculate desired throughput against your hardware. Assume a single partition on a single topic can handle ~10 MB/s (producers can actually produce faster than this, but it’s a safe baseline) and then figure out what your desired total throughput is for your system.
If you want to dive in and start testing faster, a good rule of thumb is to start with 1 partition per broker per topic. If that works smoothly and you want more throughput, double that number, but try to keep the total number of partitions for a single topic on a single broker below 10. For example, if you have 24 partitions and three brokers, each broker will be responsible for 8 partitions, which is generally fine. If you have dozens of topics, an individual broker could easily end up handling hundreds of partitions. If your cluster’s total number of partitions is north of 10,000, be sure you have really good monitoring, because rebalances and outages could get really thorny.
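The sizing arithmetic above can be sketched as a small helper. This is an illustrative back-of-the-envelope calculation, not a Kafka API; the function names and the ~10 MB/s baseline are the assumptions stated in the text.

```python
import math

def partitions_for_throughput(target_mb_per_s, per_partition_mb_per_s=10):
    """Minimum partition count to sustain the target throughput,
    assuming the conservative ~10 MB/s per-partition baseline."""
    return max(1, math.ceil(target_mb_per_s / per_partition_mb_per_s))

def partitions_per_broker(total_partitions, broker_count):
    """How many partitions of one topic each broker ends up serving,
    assuming partitions are spread evenly across brokers."""
    return math.ceil(total_partitions / broker_count)

# A 240 MB/s target needs 24 partitions; across three brokers that is
# 8 partitions per broker -- under the suggested per-broker limit of 10.
print(partitions_for_throughput(240))  # 24
print(partitions_per_broker(24, 3))    # 8
```

Treat the result as a starting point for load testing, not a final answer: real per-partition throughput depends on message size, replication factor, and hardware.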
There are two fronts in the war to secure a Kafka deployment:
- Kafka’s internal configuration, and
- The infrastructure on which Kafka is running.
Starting with the latter, the first goal is isolating Kafka and ZooKeeper. ZooKeeper should never be exposed to the public internet (except for unusual use cases).
If you are only using ZooKeeper for Kafka, then only Kafka should be able to talk to it. Restrict your firewalls / security groups accordingly. Kafka should be isolated similarly. Ideally there is some middleware or load balancing layer between any clients connecting from the public internet and Kafka itself. Your brokers should reside within a single private network and by default reject all connections from outside.
As for Kafka’s configuration, the 0.9 release added a number of useful features. Kafka now supports authentication between itself and clients as well as between itself and ZooKeeper. Kafka also now supports TLS, which we recommend using if you have clients connecting directly from the public internet. Be advised that using TLS will impact throughput. If you can’t spare the CPU cycles, then you will need to find some other way to isolate and secure traffic hitting your Kafka brokers.
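As a sketch of what enabling TLS looks like on the broker side, the snippet below shows the relevant `server.properties` keys. The file paths and passwords are placeholders, and whether you require client certificates (`ssl.client.auth`) depends on your own policy.

```properties
# Keep PLAINTEXT for internal traffic, expose TLS on a separate port
listeners=PLAINTEXT://:9092,SSL://:9093

# Broker keystore/truststore (paths and passwords are illustrative)
ssl.keystore.location=/var/private/ssl/kafka.broker.keystore.jks
ssl.keystore.password=changeit
ssl.key.password=changeit
ssl.truststore.location=/var/private/ssl/kafka.broker.truststore.jks
ssl.truststore.password=changeit

# Set to "required" if you also want mutual (client-certificate) auth
ssl.client.auth=required
```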
Open File Config
Ulimit configuration is one of those things that can sneak up on you with a lot of different programs. DevOps engineers have all been there before. A PagerDuty alert fires late at night. It seems at first like a load issue, but then you notice one or more of your brokers is just totally down. You dig through some logs and get one of these: “java.io.IOException: Too many open files.”
It’s an easy fix. Raise the per-process open file limit in /etc/security/limits.conf (the system-wide ceiling, fs.file-max, lives in /etc/sysctl.conf) and restart the broker. Save yourself an outage and ensure that your deployment system (Chef, CloudFormation, etc.) is setting a hard ulimit of at least 128,000.
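Concretely, the fix looks something like the fragment below. This assumes the broker runs as a user named `kafka`; adjust the user and the 128,000 figure to match your deployment.

```conf
# /etc/security/limits.conf -- per-process file-descriptor limits
kafka  soft  nofile  128000
kafka  hard  nofile  128000

# /etc/sysctl.conf -- system-wide ceiling, if it also needs raising
fs.file-max = 500000
```

After editing, restart the broker process and verify with `ulimit -n` in the broker’s environment, since limits are applied at login/process start.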
This one is pretty simple. Low latency is certainly going to be your goal with Kafka. Ideally you have your brokers geographically located near their clients. If your producers and consumers are located in the United States, best not to have your Kafka brokers in Europe. Also be aware of network performance when choosing instance types with cloud providers. If network bandwidth becomes your bottleneck, it may be worthwhile to pay for the bigger AWS servers that offer greater bandwidth.
All of the above issues can be anticipated at the time of cluster creation. However, conditions change, and without a proper monitoring and alerting strategy you can get bitten by one of these problems down the road. With Kafka you want to prioritize two basic types of monitoring: system metrics and JVM stats. For the former, you need to track open file handles, network throughput, load, memory, and disk usage at a minimum. For the latter, be mindful of things like GC pauses and heap usage. Ideally you will keep a good amount of history and set up dashboards for quickly debugging issues.
For alerting, you will want to configure your system (Nagios, PagerDuty, etc.) to warn you about system issues like low disk space or latency spikes. Better to get an annoying alert about reaching 90% of your open file limit than getting an alert that your whole system has crashed.
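The “warn before you crash” idea above reduces to a simple threshold check. This is a minimal sketch, not a monitoring-tool API; the function name and the 90% default are illustrative, and in practice the metric would come from your monitoring agent.

```python
def should_alert(current, limit, warn_fraction=0.9):
    """Return True once usage crosses warn_fraction of the hard limit,
    so the alert fires well before the resource is exhausted."""
    return current >= warn_fraction * limit

# With a 128,000 file-descriptor limit, the warning threshold is 115,200.
print(should_alert(100_000, 128_000))  # False -- still headroom
print(should_alert(120_000, 128_000))  # True  -- warn now, not at 100%
```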
Kafka is a powerful piece of software that can solve a lot of problems. Like most libraries and frameworks, you get out of it what you put into it. If you have a solid infrastructure and a dev team that can devote sufficient time, you can do amazing things with Kafka. Lacking that, Kafka can be risky. Fortunately, there is a growing ecosystem of managed offerings. With the right provider you can get all of the performance and scaling benefits of Kafka without having to go it alone. To that end, check out Instaclustr’s newest offerings in the Kafka space.
Instaclustr is the Open Source-as-a-Service company, delivering reliability at scale. We operate an automated, proven, and trusted managed environment, providing database, analytics, search, and messaging. We enable companies to focus internal development and operational resources on building cutting-edge customer-facing applications.
For more information,