Consensus is Impossible
by b5Our latest video covers a proof paper by Nancy Lynch that shows that 100% certainty of consensus is impossible in a distributed system with even one faulty node. The paper is very comprensive, and we thought picking one to dramatize would be a great way to introduce impossibility proofs!
We're using this post to add details that weren't covered in the video. We'll update this post with more info as comments & questions roll in!
Can I haz paper link?
Do you use this in iroh?
Indirectly, yes! All iroh protocols run up against these laws of distributed systems "physics". Some examples:
- strictly speaking, iroh docs isn't a consensus protocol, it's a "sync" protocol. Here even the word sync is a bit of a misnomer, because there is no such thing as asynchronus synchronization, but we digress. Instead the garuntee iroh docs focuses on is that all nodes will eventually have the same data, once we all have the same entries in our namespace, not that they will all agree on the same data at the same time.
- iroh blobs is content-addressed (refer by hash) specifically to get around the potential of Byzantine faults, assuming you trust the hash you're asking for in the first place.
The extra t
is for buggy nodes
In the video, one fix to address the fault is to transition from requiring 2t + 1
to 3t + 1
. The detail we don't cover in the video: the extra t
describes that you're accounting for buggy or malicious nodes (called "Byzantine" nodes) as well as down/delayed nodes. The down/delayed nodes would be covered by the 2t
part of the equation.
Flipping that ratio around, in essence this extra t
says that if we want to account for buggy or malicious nodes, we can only tollerate less than 1/3 of the nodes being down, delayed, or faulty instead less than half. This makes sense because we still need to handle the same cases as 2t+1
, but now we're also accounting for the fact that a node that is still sending messages could be giving you bad messages — so you need more redundancy. In a 3-node configuration, a single node failure means that you've already crossed that 1/3 threshold, so you need at least one more healthy node in the system to acocunt for all of these possible errors.
This is not the most interesting proof in the paper!
That honor goes to the FLP impossibility proof, for which the authors won a Dijkstra Prize. Nancy Lynch was a co-author on that paper as well, and it's a great read if you're interested in the topic!
Join our discord & nerd out
Come tell us what we missed! We'll update here!
Join DiscordTo get started, take a look at our docs, dive directly into the code, or chat with us in our discord channel.