In mid-February 2021, numerous product teams reported intermittent failures on producers and lag on consumers. This nightmare went on for two weeks. When we dug in and compared all the times this had happened over the past few weeks, we found a common event: a migration of Kafka broker(s) performed by Google.
This wasn't a normal migration, but a live migration. Here, the virtual machine instance is kept running and is migrated live to another host in the same zone, instead of requiring the VM to be rebooted.
Partition Leaders and Replicas

Every partition, under ideal conditions, is assigned a broker that acts as its leader and zero or more brokers that act as replicas, governed by the replication factor. The leader handles all read and write requests for the partition, while the followers passively replicate the leader and remain in sync.
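The leader/replica relationship above can be sketched as a toy model. This is an illustration of the concept, not Kafka's actual implementation: each partition tracks its replica set and in-sync replicas (ISR), and when the leader's broker fails, a new leader is promoted from the remaining in-sync replicas.

```python
# Toy model of a partition's leader/replica assignment.
# Not Kafka's real code; names and election rule are illustrative.

class Partition:
    def __init__(self, replicas):
        # The first replica in the list acts as the "preferred" leader.
        self.replicas = list(replicas)
        self.isr = list(replicas)   # in-sync replicas
        self.leader = self.replicas[0]

    def broker_failed(self, broker):
        # Drop the failed broker from the ISR; if it was the leader,
        # promote the next in-sync replica (or none if the ISR is empty).
        self.isr = [b for b in self.isr if b != broker]
        if self.leader == broker:
            self.leader = self.isr[0] if self.isr else None

p = Partition(replicas=[1, 2, 3])
print(p.leader)         # 1 (preferred leader)
p.broker_failed(1)      # leader's broker goes away (e.g. during migration)
print(p.leader)         # 2 (promoted from the ISR)
print(p.isr)            # [2, 3]
```

The sketch also shows why a broker disappearing mid-migration is disruptive: every partition it led must fail over, and clients pointed at the old leader see errors until metadata refreshes.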
Controller

A controller is a broker, and a cluster always has exactly one controller. In the event of the controller going down, ZooKeeper elects a new controller for the cluster. ZooKeeper expects heartbeats from all the brokers in the cluster, and if a heartbeat isn't received within a certain interval, ZooKeeper assumes the broker is non-functional. (This interval is governed by ZOO_TICK_TIME, which defaults to 2000 ms.) So, if the controller doesn't send a heartbeat within the configured time, controller re-election takes place and another broker becomes the controller instead.
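The heartbeat-timeout mechanism above can be sketched as a toy failure detector. This is a simplified model, not ZooKeeper's real API: the timeout constant, class names, and the "lowest broker id wins" election rule are all illustrative (real ZooKeeper-based election works via an ephemeral /controller znode).

```python
# Toy model of heartbeat-based failure detection and controller
# re-election. Illustrative only; not ZooKeeper's actual behavior.

SESSION_TIMEOUT_MS = 2000  # stand-in for the ZOO_TICK_TIME-derived timeout

class FailureDetector:
    def __init__(self):
        self.now_ms = 0
        self.last_heartbeat = {}

    def heartbeat(self, broker):
        self.last_heartbeat[broker] = self.now_ms

    def advance(self, ms):
        self.now_ms += ms

    def dead_brokers(self):
        # A broker is presumed dead once its last heartbeat is older
        # than the session timeout.
        return [b for b, t in self.last_heartbeat.items()
                if self.now_ms - t > SESSION_TIMEOUT_MS]

def elect_controller(live_brokers):
    # Simplified election rule (lowest id wins) for illustration.
    return min(live_brokers)

fd = FailureDetector()
for b in (1, 2, 3):
    fd.heartbeat(b)
controller = elect_controller([1, 2, 3])   # broker 1 is the controller

fd.advance(3000)                           # time passes...
fd.heartbeat(2)
fd.heartbeat(3)                            # broker 1 misses its heartbeat
dead = fd.dead_brokers()                   # [1]
controller = elect_controller([2, 3])      # re-election: broker 2 takes over
```

This is essentially what bit us during the live migrations: a broker paused long enough to miss its heartbeat window gets declared dead, triggering leader failovers and, if it was the controller, a controller re-election.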