Kafka

This post is first created by CodingMrWang, 作者 @Zexian Wang ,please keep the original link if you want to repost it.

What is Kafka

Apache Kafka is a distributed high throughput message system. Apache Kafka is a distributed streaming platfom.

Advantages

Low delay: has to power to deal with data with time complexity of O(1).

High throughput: Can support over 100000 data per second in each low quality machine.

Horizontal expansion: Support message partition among Kafka brokers, distributed consumer and support horizontal expansion.

Sequence: In each partition, messages are sequential. Multiple Scenes: Support offline data processing and real-time data processing.

Kafka framework

Kafka

Topic&Parititon&Segment

Each data record contains a key-value pair and a timestamp, the timestamp is mainly used in streaming process.

Each topic is a messsge group.

Each topic contains multiple paritions, each parition is like a folder in different machine which contributes the high throughput of Kafka.

All data are stored in disk sequentially, the older file in each parition would be delted.

Kafka2

Producer

When producer send message: Producer is Asynchronous, if want it to be Synchronous, use flush()

How producer guarantees sequence: Use queue and retry mechanism set max.in.flight.requests.per.connection = 1 message two will wait message one be sent successfully.

Routing strategy: partitioner.class

public int partition(String topic, Object key, byte[] keyBytes, Object value, byte[] valueBytes, Cluster cluster) {
	List<PartitionInfo> partition = cluster.partitionsForTopic(topic);
	int numPartitions = partitions.size();
	if (keyButes == null) {
		int nextValue = counter.getAndIncrement();
		List<PartitionInfo> availablePartitions = cluster.availablePartitionsForTopic(topic);
		if (availablePartitions.size() > 0) {
			int part = Utils.toPositive(nextValue) % availablePartitions.size();
			return availablePartitions.get(part).partition();
		} else {
			// no partitions are available, give a non-available partition
			return Utils.toPositive(nextValue) % numPartitions;
			}
	} else {
	    // has the keyBytes to choose a partition
		return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
	}
	

Consumer Group

Each parition can only be consumed by one consumer.

consumer

When a new consumer enter or quit, paritions will be allocated to consumer again.

rebalance

Kafka uses centralized reblance.

A coordinator read all topics from Zookeeper and listen the changes of all topics and entry of consumers.
It choose a leader for each group and leader send rebalance plan to coordinator through SyncGroup.
Other member get reblance plan from Coordinator through SyncGroup.

High availability mechanism

Data replication

replication

Followers pull data from leader

Failover

failover

If all replicas fails

Wait for one of replica in ISR recover and choose it to be the leader
- Long wait time, reduce availability.
- If all replicas in ISR would not recover, the partition will die.
Choose the replic who first recover to be the leader no matter it is in ISR or not.
- It doesn’t contain all commits, may loss data.
- High availability

Kafka mechanism