Replica Assignment assigns SPUs to a replica set and Replica Election coordinates their roles. The election algorithm manages replica sets in an attempt to designate one active leader at all times. SPUs have a powerful multi-threaded engine that can process a large number of leaders and followers at the same time.
If an SPU becomes incapacitated, the election algorithm identifies all impacted replica sets and triggers a re-election. The following section describes the algorithm utilized by each replica set as it elects a new leader.
The Leader
and Followers
of a Replica Sets have different responsibilities.
Leader
responsibilities:
Followers
responsibilities:
All followers are in hot-standby and ready to take-over as leader.
Each data stream has a Live Replica Set (LRS) that describes the SPUs actively replicating data records in their local data store. LRS status can be viewed in show partitions
CLI command.
Replica election covers two core cases:
goes offline
comes back online
When an SPU goes offline, the SC identifies all impacted Replica Sets
and triggers an election:
set Replica Set
status to Election
choose leader candidate from the followers in LRS based on smallest lag behind the previous leader:
leader candidate found:
Replica Set
status to CandidateFoundno eligible leader candidate available:
Replica Set
status to OfflineLeader Candidate
NotificationAll SPUs in the Replica Set receive proposed leader candidate
and perform the following operations:
SPU that matches leader candidate
tries to promote follower replica to leader replica:
follower
to leader
promotion successful
follower
to leader
promotion failed
Other SPUs ignore the message
Promotion Successful
from Leader Candidate
SPUThe SC perform the follower operations:
Replica Set
status to OnlineThe SC chooses the next Leader Candidate
from the LRS list and the process repeats.
If no eligible leader candidate left:
Replica Set
status to OfflineLRS
All SPU followers update their LRS:
When an known SPU comes back Online, the SC identifies all impacted Replica Sets
and triggers a refresh.
For all Replica Sets with status Offline, the SC performs the following operations:
set Replica Set
status to Election
choose leader candidate from follower membership list based on smallest lag behind the previous leader:
leader candidate found:
Replica Set
status to CandidateFoundno eligible leader candidate left:
Replica Set
status to OfflineThe algorithm repeats the same steps as in the “SPU goes Offline” section.
Each SPU has a Leader Controller that manages leader replicas, and a Follower Controller that manages follower replicas. SPU utilizes Rust async framework to run a virtually unlimited number of leader and follower operations simultaneously.
Each Replica Set has a communication channel where for the leader and followers exchange replica information. It is the responsibility of the followers to establish a connection to the leader. Once a connection between two SPUs is created, it is shared by all replica sets.
For example, three replica sets a, b, and c that are distributed across SPU-1
, SPU-2
, and SPU-3
:
The first follower (b, or c) from SPU-1
that tries to communicate with its leader in SPU-2
generates a TCP connection. Then, all subsequent communication from SPU-1
to SPU-2
, irrespective of the replica set, will reuse the same connection.
Hence, each SPU pair will have at most 2 connections. For example:
SPU-1
<=> SPU-2
SPU-1
followers => SPU-2
leadersSPU-2
followers => SPU-1
leadersReplicas use offsets to indicate the position of a record in a data stream. Offsets starts at zero
and are incremented by one anytime a new record is appended.
Log End Offset (LEO) represents the offset of last record in the local store of a replica. A records is considered committed only when replicated by all live replicas. Live Replica Sets (LRS) is the set of active replicas in the membership list. High Watermark (HW) is the last offset of the record committed by the LRS.
Synchronization algorithm collects the LEOs, computes the HW, and manages the (LRS).
In this example:
If Follower-2 goes offline: LRS = 2 and HW = 3.
All replica followers send their replica status, LEO
and HW
, to their leader. The leader:
LEO
to compute the missing records and send to followerHW
(from min replica LEOs
).HW
, LEO
, and LRS
to all followersReplica followers receive the data records, LEO
, and HW
from the leader and perform the following operations:
LEO
and HW
And the cycle repeats.
If the leader detects any of follower’s HW
is less than the LRS HW
by a maximum number of records, the leader removes the follower from the LRS.
Followers removed from the LRS are ineligible for election but continue to receive records. If follower catches up with the leader it is added back to LRS and once again becomes eligible for election.
If a leader goes offline, an election is triggered and one of the followers takes over as leader. The rest of the followers connect to the new leader and synchronize their data store.
When the failed leader rejoins the replica set, it detects the new leader and turns itself into a follower. The replica set continue under the new leadership until a new election is triggered.
Replica leaders receive data records from producers and sends them to consumers.
Consumers can choose to receive either COMMITTED or UNCOMMITTED records. The second method is discouraged as it cannot deterministically survive various failure scenarios.
By default, UNCOMMITTED messages are sent to consumers.