This post is first created by CodingMrWang, 作者 @Zexian Wang ,please keep the original link if you want to repost it.
What is Kafka
Apache Kafka is a distributed high throughput message system. Apache Kafka is a distributed streaming platfom.
Advantages
Low delay: has to power to deal with data with time complexity of O(1).
High throughput: Can support over 100000 data per second in each low quality machine.
Horizontal expansion: Support message partition among Kafka brokers, distributed consumer and support horizontal expansion.
Sequence: In each partition, messages are sequential. Multiple Scenes: Support offline data processing and real-time data processing.
Kafka framework
Topic&Parititon&Segment
Each data record contains a key-value pair and a timestamp, the timestamp is mainly used in streaming process.
Each topic is a messsge group.
Each topic contains multiple paritions, each parition is like a folder in different machine which contributes the high throughput of Kafka.
All data are stored in disk sequentially, the older file in each parition would be delted.
Producer
When producer send message: Producer is Asynchronous, if want it to be Synchronous, use flush()
How producer guarantees sequence: Use queue and retry mechanism set max.in.flight.requests.per.connection = 1 message two will wait message one be sent successfully.
Routing strategy: partitioner.class
public int partition(String topic, Object key, byte[] keyBytes, Object value, byte[] valueBytes, Cluster cluster) {
List<PartitionInfo> partition = cluster.partitionsForTopic(topic);
int numPartitions = partitions.size();
if (keyButes == null) {
int nextValue = counter.getAndIncrement();
List<PartitionInfo> availablePartitions = cluster.availablePartitionsForTopic(topic);
if (availablePartitions.size() > 0) {
int part = Utils.toPositive(nextValue) % availablePartitions.size();
return availablePartitions.get(part).partition();
} else {
// no partitions are available, give a non-available partition
return Utils.toPositive(nextValue) % numPartitions;
}
} else {
// has the keyBytes to choose a partition
return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
}
Consumer Group
Each parition can only be consumed by one consumer.
When a new consumer enter or quit, paritions will be allocated to consumer again.
Kafka uses centralized reblance.
-
A coordinator read all topics from Zookeeper and listen the changes of all topics and entry of consumers.
-
It choose a leader for each group and leader send rebalance plan to coordinator through SyncGroup.
-
Other member get reblance plan from Coordinator through SyncGroup.
High availability mechanism
Data replication
Followers pull data from leader
Failover
If all replicas fails
-
Wait for one of replica in ISR recover and choose it to be the leader
- Long wait time, reduce availability.
- If all replicas in ISR would not recover, the partition will die.
-
Choose the replic who first recover to be the leader no matter it is in ISR or not.
- It doesn’t contain all commits, may loss data.
- High availability