MSC4297: State Resolution 2.1 (#4297)

* Placeholder
* Add MSC4297: SR2.1 manually

`git apply` hates me

---------

Co-authored-by: Travis Ralston <travisr@matrix.org>
4 months ago · d9ec3387c0
parent 2bb34224dd
commit d9ec3387c0
12 changed files with 268 additions and 0 deletions
--- a/proposals/4297-state-resolution-v2_1.md
+++ b/proposals/4297-state-resolution-v2_1.md
@@ -0,0 +1,268 @@
+# MSC4297: State Resolution v2.1
+
+This MSC proposes two modifications to the existing state resolution algorithm which will improve
+security by reducing the frequency of "state resets". This proposal bases its changes on
+room version 11.
+
+## Background
+
+Matrix is decentralised. This means there is no central entity which orders events. Ordering
+is critical to enforce access control. For example, in order for Bob to change the room name
+he needs to be a moderator/admin _first_. To model this, rooms are represented as a
+directed acyclic graph (DAG) of events. State events are operations like "Bob changing the room name" or
+"Bob gaining admin privileges". These events "point" to the most recent events that the server that created the event has
+seen in that room. In order for Bob to change the room name, his room name event MUST point either
+directly or indirectly to the event which gave him the right to change the room name. As Matrix is
+decentralised, Alice could independently demote Bob without him being aware of it at the same time
+he tries to change the room name. This is concurrent behaviour, and how this is managed is up to the
+state resolution algorithm. The algorithm:
+
+ - selects which events are in conflict,
+ - determines how to order these events,
+ - filters out unauthorised events based on this ordering.
+
+For example, Alice's demotion may be applied first so Bob cannot change the room name.
+
+<img width="1115" src="/proposals/images/4297-msc-1.png" />
+
+
+However, the algorithm can cause surprising behaviour. If Alice cannot communicate Bob's demotion to Bob's
+server quickly, then Bob may perform many more privileged operations over minutes or hours. Eventually,
+Alice's demotion will arrive on Bob's server, which would cause all the operations Bob did to be
+"rolled back" or reverted. This can be unexpected to users on Bob's server, but is an _expected_
+consequence of Matrix being decentralised. A "state reset" is very similar to this in that it causes
+unexpected behaviour. However, the defining characteristic of a state reset is that this happens
+_even when there is no revocation event_ such as a demotion.
+
+Concurrent events can occur at any point in the room DAG. This means when the DAG is put into a
+total order from "oldest" to "newest", previously unseen events may appear at the "older" end of the
+ordering. These unseen events could affect whether later events are authorised or not, so we need to
+"replay" events from the unseen events to the latest events. To avoid replaying too many events,
+the algorithm intelligently calculates the difference between the sets of concurrent events to only replay what is necessary. It is desirable to prevent
+servers from adding concurrent events from an obviously long time ago, but since servers never
+coordinate (e.g. via a consensus algorithm) they can never be sure that they have seen all concurrent
+events. The idea of making it impossible to add concurrent events to some sections of the graph is
+referred to as "causal stability" or "finality".
+
+Matrix allows servers to partially synchronise the DAG. This allows servers that have been offline for
+a long time to quickly resynchronise without being forced to pull in the entire room history. To ensure
+authorisation events are applied correctly, events have another DAG formed of `auth_events`, called
+the "auth chain". These chains only consist of authorisation events and _are_ fully synchronised.
+The `auth_events` cite all the historical events which authorise the event in question. Authorisation
+events are the following state events: `m.room.create`, `m.room.member`, `m.room.power_levels` and
+`m.room.join_rules`. A subset of authorisation events are [power events](https://spec.matrix.org/v1.14/rooms/v11/#definitions).
+
+>[!NOTE]
+> As a reminder, State Resolution v2.0 works by merging together sets of state across branches of
+> the room DAG.  State events common to both branches are called 'unconflicted'; state events which
+> only exist on one branch or another or have different values are called 'conflicted'.  The
+> resolved state is calculated:
+>  * Start with the unconflicted state as our "base layer"
+>  * Consider the conflicted power events (PLs, kicks, bans, join_rules) and any conflicted authorisation
+>    events that are required to authorise said power events, use reverse topological ordering to
+>    provide a consistent ordering and then layer those by replaying them on top, incrementally
+>    authorising as you go.
+>  * Work out the sequence of the conflicted normal state events using mainline ordering (the
+>    'backbone' of power level events through the conflicted set) and then similarly replay those
+>    on top too.
+>  * Reapply any unconflicted state keys which may have been overwritten in the previous steps.
+
+## Problems with the existing algorithm
+
+The algorithm relies on two pieces of ordering information, from `prev_events` and `auth_events`.
+The `prev_events` ordering controls the input state sets to the algorithm. These state sets aren't
+guaranteed to map correctly onto the auth chain ordering induced by `auth_events` due to partial
+synchronisation. If these orderings disagree, the algorithm can select older state.
+This can happen due to federation outages or due to faulty implementations. The scenarios below
+require this to have happened, and are represented by "incorrect" state. This typically manifests
+as _older state_ existing in a state set. State resets happen because the algorithm fails to determine
+all the events that need to be replayed when _older state_ exists in a state set.
+
+Problem A focuses on the "initial state" of the room, before the conflicted events are replayed.
+This is currently set to the "unconflicted" state of the room. The core idea is that if both forks
+agree on the exact event IDs for 99 members, but disagree on the exact event ID for the 100th member,
+we do not need to replay all 99 member events. The problem is that there is more than one way to
+"agree on the exact event IDs", which can cause state to reset under certain circumstances.
+
+Problem B focuses on selecting _which events_ are in conflict. The core idea is that we do
+not want to replay the entire history of the room every time there is a conflict, but only the
+differences between each fork (the _auth difference_). These differences don't include enough events
+under certain circumstances, causing state to reset.
+
+>[!NOTE]
+> The following scenarios present rooms in two ways: via the familiar `prev_events` ordering which
+> represents concurrent behaviour and the more unfamiliar `auth_events` ordering which represents
+> "is authorised by" relationships. The `auth_events` graph contains many redundant edges (e.g. every event
+> references the create event), so we will only present the _transitive reduction_ of this graph, which
+> removes these edges. This helps illustrate the problems better.
+>
+> <img width="201" src="/proposals/images/4297-msc-2.png" />
+>
+> The scenarios also use coloured diamond symbols to indicate the state of the room at a particular
+> edge on the `prev_events` graph. Servers may incorrectly calculate this, e.g. due to partial synchronisation,
+> which is indicated by a dashed arrow pointing from the correct state to the incorrect state. The colour
+> of the diamond only serves to distinguish it from other state sets; it has no other meaning, despite similar
+> colours being used to indicate the sender of an event.
+>
+> <img width="632" src="/proposals/images/4297-msc-3.png" />
+
+
+### Problem A: Conflicting events can be unauthorised by the unconflicted state
+
+Events can be conflicted in _two_ ways: via room state's `prev_events` and via the auth chain's `auth_events`.
+The room state determines what the inputs to the state resolution algorithm are and hence what is in
+the unconflicted state. The auth chains determine which extra events are pulled in and the ordering
+between events. Room state can unexpectedly reset if these two orderings disagree, as outlined in
+the following example[^1]:
+
+<img width="426" src="/proposals/images/4297-msc-problem-a-1.png" />
+<img width="618" src="/proposals/images/4297-msc-problem-a-2.png" />
+
+
+In this scenario, the state of the room at the blue diamond is obtained via partial synchronisation,
+e.g. `/state{_ids}`. This response contains an outdated join rules event, meaning both join rules events
+are now in conflict, even though all events on both forks agree
+on what the latest join rules event is via their auth chains. This causes the join rules to be
+replayed. However, both forks also agreed that Alice had left the room. As such, the unconflicted
+state starts with Alice not being a member in the room. When the join rules events get replayed,
+both fail since Alice (who set them) is not in the room. This causes the room to have no join
+rules event.
+
+<img width="573" src="/proposals/images/4297-msc-problem-a-3.png" />
+
+### Problem B: Conflicting events need extra unconflicted events in order to be authorised
+
+Similarly to Problem A, this occurs when the ordering between `prev_events` and `auth_events` differs.
+In this scenario, one fork can reference newer events via the auth chain whilst claiming the room
+state is an older event. When this happens, the auth difference does not include all relevant events
+between the old and new events as both sides have seen the events in-between via their auth chains. Most
+of the time this will not cause a state reset, but when there are chains of events which are dependent
+on one another, this can cause a state reset.
+
+<img width="432" src="/proposals/images/4297-msc-problem-b-1.png" />
+<img width="554" src="/proposals/images/4297-msc-problem-b-2.png" />
+
+Note that the state of the room at the red diamond is obtained via partial synchronisation, e.g.
+`/state{_ids}`. This response contains an outdated power levels event, meaning the power levels events
+are now in conflict.
+
+In this example[^1], Alice is an Admin who promotes Bob. Bob then promotes Charlie. It is critical that
+Bob's promotion is applied before Charlie's promotion, or else it will be unauthorised. However, the
+auth difference calculation fails to include this event, leading to a state reset.
+
+<img width="446" src="/proposals/images/4297-msc-problem-b-3.png" />
+
+## Proposal
+
+Two modifications are made to the algorithm, as described below. Both changes relate to
+which events are selected for replaying. They do not modify how conflicted events are sorted
+nor do they modify the iterative auth checks.
+
+### Modification 1: Begin the first phase of iterative auth checks with an empty state map
+
+This aims to fix problem A by disregarding the `prev_events` ordering entirely for determining the
+initial state. It does this by starting with an empty state map. This causes the iterative auth
+checks algorithm to load the auth chains, per the [iterative auth checks definition](https://spec.matrix.org/v1.14/rooms/v11/#definitions):
+
+> If a (event_type, state_key) key that is required for checking the authorization rules is not
+> present in the state, then the appropriate state event from the event’s `auth_events` is used if the
+> auth event is not rejected.
+
+This ensures that the algorithm replays events on top of the `auth_events` histories, rather than some unrelated
+history.
+
+Step 2 of the state resolution algorithm is amended to state:
+
+> Apply the _iterative auth checks algorithm_, starting from ~~the unconflicted state map~~
+> __an empty state map__, to the list of events from the previous step to get a partially resolved state.
+
+<img width="593" src="/proposals/images/4297-msc-sol-1.png" />
+
+Note that although we no longer insert the unconflicted state into the partially resolved state,
+Step 5 of the algorithm ensures that the unconflicted state is still present in the final merged
+output, even if it was absent during the resolution process:
+
+> Update the result by replacing any event with the event with the same key from the unconflicted
+> state map, if such an event exists, to get the final resolved state.
+
+### Modification 2: Add events _between_ the conflicted state set to the full conflicted set
+
+This aims to fix problem B by including relevant intermediate events when performing state resolution.
+This modifies the definition of the "full conflicted set" to include _all events_ which are a _descendant_
+of one conflicted event and an _ancestor_ of another conflicted event. This forms a "conflicted subgraph"
+which is then replayed by the algorithm.
+
+This means the full conflicted set contains:
+ - the conflicted state events themselves AND
+ - the auth difference AND
+ - the events between the conflicted state events
+
+The purpose of the auth difference is to replay the relevant auth history from each input state set.
+Most of the time it does this, but when the input state sets are derived from a partial sync response
+there's no guarantee that this will include the relevant history because the response may include
+erroneous older events. By including the conflicted state subgraph we ensure that input state sets
+with _old events_ have the auth history from those old events replayed.
+
+<img width="549" src="/proposals/images/4297-msc-sol-2.png" />
+
+A new term is added in the state resolution algorithm:
+
+> **Conflicted state subgraph.** Starting from an event in the _conflicted
+> state set_ and following `auth_events` edges may lead to another event in the
+> conflicted state set. The union of all such paths between any pair of events
+> in the conflicted state set (including endpoints) forms a subgraph of the
+> original `auth_events` graph, called the _conflicted state subgraph_.
+
+And the following modification is made to the definition of "Full conflicted set":
+
+> **Full conflicted set.** The full conflicted set is the union of the conflicted state set,
+> <ins>the conflicted state subgraph</ins> and the auth difference.
+
+
+## Potential issues
+
+### Performance
+
+These modifications may impact performance in two ways:
+ - more work is done on every state resolution to calculate the conflicted state subgraph.
+ - potentially more events are replayed during resolution.
+
+The data required to calculate the auth difference is also the same information required to
+calculate the conflicted state subgraph, so no extra database requests are needed with these
+changes. However, more CPU work needs to be performed to walk the auth chain to look for
+conflicted events on every resolution.
+
+In the common case, the conflicted state subgraph overlaps entirely with the auth difference,
+meaning no extra events need to be replayed. This has been confirmed via
+[partition/fault tolerance testing](https://github.com/element-hq/chaos) which tests extreme
+cases with large numbers of membership changes and federation outages, producing resolutions such as:
+```
+additional events replayed=0 num_conflicts=17 conflicted_subgraph=271 auth_difference=263
+```
+
+## Further work
+
+More changes are required in order to fix all cases of state resets. The changes proposed here
+are based on real world scenarios where state resolution has produced undesirable results.
+The underlying cause of these state resets is mismatched orderings. If the protocol had a single
+ordering then this would remove this entire class of issues. This will be explored in a future MSC.
+
+## Security considerations
+
+The state resolution algorithm is a critical component in the overall security of a room. This proposal
+modifies the algorithm, so there are inevitable risks associated with it. These risks are mitigated
+because the proposal is _not_ changing how events are ordered nor how events are authorised. It is
+purely _adding_ events to be replayed and relying on the auth chains as the authoritative source
+to rebase changes onto.
+
+## Unstable prefix
+
+This algorithm is in use for room version `org.matrix.hydra.11`.
+
+## Dependencies
+
+This MSC has no dependencies.
+
+
+[^1]: When used in conjunction with [MSC4289](https://github.com/matrix-org/matrix-spec-proposals/pull/4289),
+Alice should not be present in the `m.room.power_levels` event. The examples function in the same way.
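
Review note on Modification 1 in `proposals/4297-state-resolution-v2_1.md`. The toy model below is a minimal sketch, not the spec's algorithm, of why the starting state map matters in Problem A. Everything in it is an assumption made for illustration: the `Event` class, the helpers `is_allowed`, `iterative_auth_checks` and `resolve`, the event IDs and the user ID are all hypothetical, and the only auth rule modelled is "the sender must be joined". Steps 3 and 4 of the real algorithm are left out.

```python
from dataclasses import dataclass

ALICE = "@alice:example.org"  # hypothetical user ID


@dataclass(frozen=True)
class Event:
    event_id: str
    type: str
    state_key: str
    sender: str
    value: str                  # membership or join rule, flattened for brevity
    auth_events: tuple = ()     # IDs of the events cited as authorising this one


def is_allowed(event, state, by_id):
    """Toy auth rule (an assumption): the sender must currently be joined.

    If the sender's membership is missing from `state`, fall back to the
    membership event cited in the event's own `auth_events`.
    """
    key = ("m.room.member", event.sender)
    member = state.get(key)
    if member is None:
        cited = (by_id[i] for i in event.auth_events)
        member = next((e for e in cited if (e.type, e.state_key) == key), None)
    return member is not None and member.value == "join"


def iterative_auth_checks(events, initial_state, by_id):
    """Replay `events` in order, keeping only those that pass `is_allowed`."""
    state = dict(initial_state)
    for event in events:
        if is_allowed(event, state, by_id):
            state[(event.type, event.state_key)] = event
    return state


def resolve(conflicted, unconflicted, by_id, start_empty):
    """Step 2 followed by step 5; steps 3 and 4 are omitted in this sketch."""
    start = {} if start_empty else unconflicted
    partial = iterative_auth_checks(conflicted, start, by_id)
    return {**partial, **unconflicted}  # step 5: unconflicted entries win


join = Event("$alice_join", "m.room.member", ALICE, ALICE, "join")
leave = Event("$alice_leave", "m.room.member", ALICE, ALICE, "leave", ("$alice_join",))
jr_old = Event("$jr_old", "m.room.join_rules", "", ALICE, "invite", ("$alice_join",))
jr_new = Event("$jr_new", "m.room.join_rules", "", ALICE, "public", ("$alice_join",))
by_id = {e.event_id: e for e in (join, leave, jr_old, jr_new)}

unconflicted = {("m.room.member", ALICE): leave}  # both forks agree Alice left
conflicted = [jr_old, jr_new]                     # already sorted, oldest first

v2_0 = resolve(conflicted, unconflicted, by_id, start_empty=False)
v2_1 = resolve(conflicted, unconflicted, by_id, start_empty=True)

print(("m.room.join_rules", "") in v2_0)          # False: join rules were reset
print(v2_1[("m.room.join_rules", "")].event_id)   # $jr_new
```

Starting from the unconflicted state, both join rules events are rejected because that state already says Alice has left. Starting from an empty map, each event is checked against the membership it cites in `auth_events`, and the final overlay still records Alice's departure.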
--- a/proposals/images/4297-msc-1.png
+++ b/proposals/images/4297-msc-1.png
--- a/proposals/images/4297-msc-2.png
+++ b/proposals/images/4297-msc-2.png
--- a/proposals/images/4297-msc-3.png
+++ b/proposals/images/4297-msc-3.png
--- a/proposals/images/4297-msc-problem-a-1.png
+++ b/proposals/images/4297-msc-problem-a-1.png
--- a/proposals/images/4297-msc-problem-a-2.png
+++ b/proposals/images/4297-msc-problem-a-2.png
--- a/proposals/images/4297-msc-problem-a-3.png
+++ b/proposals/images/4297-msc-problem-a-3.png
--- a/proposals/images/4297-msc-problem-b-1.png
+++ b/proposals/images/4297-msc-problem-b-1.png
--- a/proposals/images/4297-msc-problem-b-2.png
+++ b/proposals/images/4297-msc-problem-b-2.png
--- a/proposals/images/4297-msc-problem-b-3.png
+++ b/proposals/images/4297-msc-problem-b-3.png
--- a/proposals/images/4297-msc-sol-1.png
+++ b/proposals/images/4297-msc-sol-1.png
--- a/proposals/images/4297-msc-sol-2.png
+++ b/proposals/images/4297-msc-sol-2.png
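
Review note on Modification 2 in `proposals/4297-state-resolution-v2_1.md`. One possible way to compute the conflicted state subgraph is to intersect the events reachable from the conflicted set along `auth_events` edges with the events that can themselves reach a conflicted event. The sketch below assumes the auth chains are available as an in-memory dict of event ID to cited event IDs. The function names, the event IDs and the empty `auth_difference` are hypothetical, and the graph is only loosely modelled on Problem B.

```python
from collections import deque


def reachable(starts, edges):
    """Return `starts` plus every node reachable from them by following `edges`."""
    seen = set(starts)
    queue = deque(seen)
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def conflicted_state_subgraph(conflicted, auth_events):
    """Events lying on an `auth_events` path between two conflicted events.

    Conflicted events with no such path are also returned, which is harmless
    because the full conflicted set includes them anyway.
    """
    # Everything a conflicted event cites, directly or transitively.
    ancestors = reachable(conflicted, auth_events)
    # Reverse the edges, restricted to that ancestor set.
    cited_by = {}
    for event_id in ancestors:
        for auth_id in auth_events.get(event_id, ()):
            cited_by.setdefault(auth_id, []).append(event_id)
    # Everything that cites a conflicted event, directly or transitively.
    descendants = reachable(conflicted, cited_by)
    return ancestors & descendants


# Alice is admin ($pl1), promotes Bob ($pl2), then Bob promotes Charlie ($pl3).
auth_events = {
    "$create": [],
    "$alice_join": ["$create"],
    "$bob_join": ["$create"],
    "$pl1": ["$create", "$alice_join"],
    "$pl2": ["$create", "$alice_join", "$pl1"],
    "$pl3": ["$create", "$bob_join", "$pl2"],
}
conflicted = {"$pl1", "$pl3"}   # one state set holds the outdated $pl1
auth_difference = set()         # assumed empty for this toy example

subgraph = conflicted_state_subgraph(conflicted, auth_events)
full_conflicted_set = conflicted | subgraph | auth_difference

print(sorted(subgraph))  # ['$pl1', '$pl2', '$pl3']
```

Here `$pl2` (Bob's promotion) joins the full conflicted set even though it is in neither the conflicted state set nor the assumed auth difference, so it gets replayed before `$pl3`.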