
Kafka work queue with a dynamic number of parallel consumers


The pattern you found is accurate. Note that topics can also be created using the Kafka Admin API and partitions can also be added once a topic has been created (with some gotchas).
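For example, with the kafka-python client (the same library used in the example further down), creating a topic and later adding partitions via the Admin API might look roughly like this; the topic name, partition counts, and broker address are illustrative assumptions:

from kafka.admin import KafkaAdminClient, NewTopic, NewPartitions

# Illustrative sketch; assumes a broker reachable on localhost:9092
admin = KafkaAdminClient(bootstrap_servers=['localhost:9092'])

# Create a topic with 50 partitions through the Admin API
admin.create_topics([NewTopic(name='divide-topic',
                              num_partitions=50,
                              replication_factor=1)])

# Later, grow it to 100 partitions. Gotcha: keyed messages may map
# to different partitions afterwards, breaking per-key ordering.
admin.create_partitions({'divide-topic': NewPartitions(total_count=100)})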

In Kafka, the way to divide work and allow scaling is to use partitions. This is because in a consumer group, each partition is consumed by a single consumer at any time.

For example, you can have a topic with 50 partitions and a consumer group subscribed to this topic:

  • When the throughput is low, you can have only a few consumers in the group and they should be able to handle the traffic.

  • When the throughput increases, you can add consumers, up to the number of partitions (50 in this example), to pick up some of the work.

In this scenario, 50 consumers is the limit in terms of scaling. Consumers expose a number of metrics, such as consumer lag, that let you decide whether you have enough of them at any given time.
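For example, a rough way to measure lag with kafka-python (the topic, group, and broker address are the ones used elsewhere in this thread and are assumptions):

from kafka import KafkaConsumer, TopicPartition

# Sketch: per-partition lag = latest (end) offset - committed offset
consumer = KafkaConsumer(group_id='some-constant-group',
                         bootstrap_servers=['localhost:9092'])
partitions = [TopicPartition('divide-topic', p)
              for p in consumer.partitions_for_topic('divide-topic')]
end_offsets = consumer.end_offsets(partitions)
for tp in partitions:
    committed = consumer.committed(tp) or 0
    print('partition %d: lag=%d' % (tp.partition, end_offsets[tp] - committed))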


Thank you Mickael for pointing me in the correct direction.

https://www.safaribooksonline.com/library/view/kafka-the-definitive/9781491936153/ch04.html

Kafka consumers are typically part of a consumer group. When multiple consumers are subscribed to a topic and belong to the same consumer group, each consumer in the group will receive messages from a different subset of the partitions in the topic.

https://dzone.com/articles/dont-use-apache-kafka-consumer-groups-the-wrong-wa

Having consumers as part of the same consumer group means providing the “competing consumers” pattern, whereby the messages from topic partitions are spread across the members of the group. Each consumer receives messages from one or more partitions (“automatically” assigned to it) and the same messages won’t be received by the other consumers (assigned to different partitions). In this way, we can scale the number of consumers up to the number of partitions (having one consumer reading only one partition); in this case, a new consumer joining the group will be in an idle state without being assigned to any partition.
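You can observe this assignment with kafka-python (the topic and group names are the ones from the example below; a consumer that joins after all partitions are taken will see an empty assignment):

from kafka import KafkaConsumer

# Sketch: inspect which partitions this group member was assigned
consumer = KafkaConsumer('divide-topic',
                         group_id='some-constant-group',
                         bootstrap_servers=['localhost:9092'])
consumer.poll(timeout_ms=1000)  # joining the group triggers assignment
print(consumer.assignment())    # set of TopicPartition for this member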

Example code for dividing the work among 3 consumers, up to a maximum of 100 (note that recent Kafka versions take --bootstrap-server instead of --zookeeper):

bin/kafka-topics.sh --partitions 100 --topic divide-topic --create --replication-factor 1 --zookeeper localhost:2181

...

from kafka import KafkaConsumer

for n in range(0, 3):
    consumer = KafkaConsumer(group_id='some-constant-group',
                             bootstrap_servers=['localhost:9092'])
    ...
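A fuller, minimal sketch of one such worker (each instance would normally run in its own process or container; the processing step here is just a print):

from kafka import KafkaConsumer

consumer = KafkaConsumer('divide-topic',
                         group_id='some-constant-group',
                         bootstrap_servers=['localhost:9092'])
for message in consumer:
    # Process one unit of work; ordering is only guaranteed per partition
    print(message.partition, message.offset, message.value)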


I think you are on the right path.

Here are the steps involved:

  1. Create the Kafka topic with the required number of partitions. The number of partitions is the unit of parallelism: you can run at most that many consumers in parallel to process the work.
  2. You can increase the number of partitions later if the scaling requirements grow, BUT this comes with caveats such as repartitioning (messages with the same key may map to a different partition afterwards). Please read the Kafka documentation about adding partitions.
  3. Define a consumer group for the consumers. Kafka assigns partitions to the available consumers in the group and rebalances automatically whenever a consumer is added or removed (see the rebalance-listener sketch after this list).
  4. If the consumers are packaged as Docker containers, Kubernetes helps manage them, especially in a multi-node environment. Other tools include Docker Swarm, OpenShift, Mesos, etc.
  5. Kafka guarantees message ordering only within a partition, not across the whole topic.
  6. Check the delivery guarantees (at-least-once, exactly-once) against your use case; a small at-least-once sketch follows below.
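For step 3, kafka-python lets you observe rebalances with a ConsumerRebalanceListener; the topic, group, and broker address below are just the ones from the earlier example:

from kafka import KafkaConsumer, ConsumerRebalanceListener

class LoggingRebalanceListener(ConsumerRebalanceListener):
    # Invoked by the consumer whenever the group rebalances
    def on_partitions_revoked(self, revoked):
        print('partitions revoked:', revoked)   # commit offsets / flush state here
    def on_partitions_assigned(self, assigned):
        print('partitions assigned:', assigned)

consumer = KafkaConsumer(group_id='some-constant-group',
                         bootstrap_servers=['localhost:9092'])
consumer.subscribe(['divide-topic'], listener=LoggingRebalanceListener())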
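And for step 6, at-least-once delivery on the consumer side is usually achieved by disabling auto-commit and committing offsets only after the work is done; a minimal sketch (process_message is a hypothetical work function):

from kafka import KafkaConsumer

def process_message(message):
    ...  # hypothetical work function

consumer = KafkaConsumer('divide-topic',
                         group_id='some-constant-group',
                         bootstrap_servers=['localhost:9092'],
                         enable_auto_commit=False)
for message in consumer:
    process_message(message)
    consumer.commit()  # committing after processing gives at-least-once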

Alternatively, you can use the Kafka Streams API. Kafka Streams is a client library for processing and analyzing data stored in Kafka. It builds upon important stream processing concepts such as properly distinguishing between event time and processing time, windowing support, and simple yet efficient management and real-time querying of application state.