diff --git a/proposals/4297-state-resolution-v2_1.md b/proposals/4297-state-resolution-v2_1.md new file mode 100644 index 000000000..2c1429a6a --- /dev/null +++ b/proposals/4297-state-resolution-v2_1.md @@ -0,0 +1,268 @@ +# MSC4297: State Resolution v2.1 + +This MSC proposes two modifications to the existing state resolution algorithm which will improve +security by reducing the frequency of "state resets". This proposal bases its changes on +room version 11. + +## Background + +Matrix is decentralised. This means there is no central entity which orders events. Ordering +is critical to enforce access control. For example, in order for Bob to change the room name +he needs to be a moderator/admin _first_. To model this, rooms are represented as a +directed acyclic graph (DAG) of events. State events are operations like "Bob changing the room name" or +"Bob gaining admin privileges". These events "point" to the most recent events that the server that created the event has +seen in that room. In order for Bob to change the room name, his room name event MUST point either +directly or indirectly to the event which gave him the right to change the room name. As Matrix is +decentralised, Alice could independently demote Bob without him being aware of it at the same time +he tries to change the room name. This is concurrent behaviour, and how this is managed is up to the +state resolution algorithm. The algorithm: + + - selects which events are in conflict, + - determines how to order these events, + - filters out unauthorised events based on this ordering + +For example, Alice's demotion may be applied first so Bob cannot change the room name. + + + + +However, the algorithm can cause surprising behaviour. If Alice cannot communicate Bob's demotion to Bob's +server quickly, then Bob may perform many more privileged operations over minutes or hours. Eventually, +Alice's demotion will arrive on Bob's server, which would cause all the operations Bob did to be +"rolled back" or reverted. This can be unexpected to users on Bob's server, but is an _expected_ +consequence of Matrix being decentralised. A "state reset" is very similar to this in that it causes +unexpected behaviour. However, the defining characteristic of a state reset is that this happens +_even when there is no revocation event_ such as a demotion. + +Concurrent events can occur at any point in the room DAG. This means when the DAG is put into a +total order from "oldest" to "newest", previously unseen events may appear at the "older" end of the +ordering. These unseen events could affect whether later events are authorised or not, so we need to +"replay" events from the unseen events to the latest events. To avoid replaying too many events, +the algorithm intelligently calculates the difference between the sets of concurrent events to only replay what is necessary. It is desirable to prevent +servers from adding concurrent events from an obviously long time ago, but since servers never +coordinate (e.g via a consensus algorithm) they can never be sure that they have seen all concurrent +events. The idea of making it impossible to add concurrent events to some sections of the graph is +referred to as "causal stability" or "finality". + +Matrix allows servers to partially synchronise the DAG. This allows servers that have been offline for +a long time to quickly resynchronise without being forced to pull in the entire room history. To ensure +authorisation events are applied correctly, events have another DAG formed of `auth_events`, called +the "auth chain". These chains only consist of authorisation events and _are_ fully synchronised. +The `auth_events` cite all the historical events which authorise the event in question. Authorisation +events are the following state events: `m.room.create`, `m.room.member`, `m.room.power_levels` and +`m.room.join_rules`. A subset of authorisation events are [power events](https://spec.matrix.org/v1.14/rooms/v11/#definitions). + +>[!NOTE] +> As a reminder, State Resolution v2.0 works by merging together sets of state across branches of +> the room DAG. State events common to both branches are called 'unconflicted'; state events which +> only exist on one branch or another or have different values are called 'conflicted'. The +> resolved state is calculated: +> * Start with the unconflicted state as our "base layer" +> * Consider the conflicted power events (PLs, kicks, bans, join_rules) and any conflicted authorisation +> events that are required to authorise said power events, use reverse topological ordering to +> provide a consistent ordering and then layer those by replaying them on top, incrementally +> authorising as you go. +> * Work out the sequence of the conflicted normal state events using mainline ordering (the +> 'backbone' of power level events through the conflicted set) and then similarly replay those +> on top too. +> * Reapply any unconflicted state keys which may have been overwritten in the previous steps. + +## Problems with the existing algorithm + +The algorithm relies on two pieces of ordering information, from `prev_events` and `auth_events`. +The `prev_events` ordering controls the input state sets to the algorithm. These state sets aren't +guaranteed to map correctly onto the auth chain ordering induced by `auth_events` due to partial +synchronisation. If these orderings disagree, the algorithm can select older state. +This can happen due to federation outages or due to faulty implementations. The scenarios below +require this to have happened, and are represented by "incorrect" state. This typically manifests +as _older state_ existing in a state set. State resets happen because the algorithm fails to determine +all the events that need to be replayed when _older state_ exists in a state set. + +Problem A focuses on the "initial state" of the room, before the conflicted events are replayed. +This is currently set to the "unconflicted" state of the room. The core idea is that if both forks +agree on the exact event IDs for 99 members, but disagree on the exact event ID for the 100th member, +we do not need to replay all 99 member events. The problem is that there is more than one way to +"agree on the exact event IDs", which can cause state to reset under certain circumstances. + +Problem B focuses on selecting _which events_ are in conflict. The core idea is that we do +not want to replay the entire history of the room every time there is a conflict, but only the +differences between each fork (the _auth difference_). These differences don't include enough events +under certain circumstances, causing state to reset. + +>[!NOTE] +> The following scenarios present rooms in two ways: via the familiar `prev_events` ordering which +> represents concurrent behaviour and the more unfamiliar `auth_events` ordering which represents +> "is authorised by" relationships. The `auth_events` graph contains many redundant edges (e.g every event +> references the create event), so we will only present the _transitive reduction_ of this graph, which +> removes these edges. This helps illustrate the problems better. +> +> +> +> The scenarios also use coloured diamond symbols to indicate the state of the room at a particular +> edge on the `prev_events` graph. Servers may incorrectly calculate this e.g due to partial synchronisation, +> which is indicated by a dashed arrow pointing from the correct state to the incorrect state. The colour +> of the diamond only serves to distinguish between other state sets, it has no other meaning, despite similar +> colours being used to indicate the sender of an event. +> +> + + +### Problem A: Conflicting events can be unauthorised by the unconflicted state + +Events can be conflicted in _two_ ways: via room state's `prev_events` and via the auth chain's `auth_events`. +The room state determines what the inputs to the state resolution algorithm are and hence what is in +the unconflicted state. The auth chains determine which extra events are pulled in and the ordering +between events. Room state can unexpectedly reset if these two orderings disagree, as outlined in +the following example[^1]: + + + + + +In this scenario, the state of the room at the blue diamond is obtained via partial synchronisation, +e.g `/state{_ids}`. This response contains an outdated join rules event, meaning both join rules events +are now in conflict, even though all events on both forks agree +on what the latest join rule event is via their auth chains. This causes the join rules to be +replayed. However, both forks also agreed that Alice had left the room. As such, the unconflicted +state starts with Alice not being a member in the room. When the join rules events get replayed, +both fail since Alice (who set them) is not in the room. This causes the room to have no join +rules event. + + + +### Problem B: Conflicting events need extra unconflicted events in order to be authorised + +Similarly to Problem A, this occurs when the ordering between `prev_events` and `auth_events` differs. +In this scenario, one fork can reference newer events via the auth chain whilst claiming the room +state is an older event. When this happens, the auth difference does not include all relevant events +between the old and new events as both sides have seen the events in-between via their auth chains. Most +of the time this will not cause a state reset, but when there are chains of events which are dependent +on one another, this can cause a state reset. + + + + +Note that the state of the room at the red diamond is obtained via partial synchronisation, e.g +`/state{_ids}`. This response contains an outdated power levels event, meaning the power levels events +are now in conflict. + +In this example[^1], Alice is an Admin who promotes Bob. Bob then promotes Charlie. It is critical that +Bob's promotion is applied before Charlie's promotion, or else it will be unauthorised. However, the +auth difference calculation fails to include this event, leading to a state reset. + + + +## Proposal + +Two modifications are made to the algorithm which are described below. Both changes relate to +which events are selected for replaying. They do not modify how conflicted events are sorted +nor do they modify the iterative auth checks. + +### Modification 1: Begin the first phase of iterative auth checks with an empty state map + +This aims to fix problem A by disregarding the `prev_events` ordering entirely for determining the +initial state. It does this by starting with an empty state map. This causes the iterative auth +checks algorithm to load the auth chains, per the [iterative auth checks definition](https://spec.matrix.org/v1.14/rooms/v11/#definitions): + +> If a (event_type, state_key) key that is required for checking the authorization rules is not +present in the state, then the appropriate state event from the event’s `auth_events` is used if the +auth event is not rejected. + +This ensures that the algorithm replays events on top of the `auth_events` histories, rather than some unrelated +history. + +Step 2 of the state resolution algorithm is amended to state: + +> Apply the _iterative auth checks algorithm_, starting from ~~the unconflicted state map~~ +__an empty state map__, to the list of events from the previous step to get a partially resolved state. + + + +Note that even though we no longer insert the unconflicted state into the partially resolved state, +Step 5 of the algorithm ensures that the unconflicted state is still in the final merged +output, even though it may not have been during the resolution process: + +> Update the result by replacing any event with the event with the same key from the unconflicted +> state map, if such an event exists, to get the final resolved state. + +### Modification 2: Add events _between_ the conflicted state set to the full conflicted set + +This aims to fix problem B by including relevant intermediate events when performing state resolution. +This modifies the definition of the "full conflicted set" to include _all events_ which are a _descendant_ +of one conflicted event and an _ancestor_ of another conflicted event. This forms a "conflicted subgraph" +which is then replayed by the algorithm. + +This means the full conflicted set contains: + - the conflicted state events themselves AND + - the auth difference AND + - the events between the conflicted state events + +The purpose of the auth difference is to replay the relevant auth history from each input state set. +Most of the time it does this, but when the input state sets are derived from a partial sync response +there's no guarantee that this will include the relevant history because the response may include +erroneous older events. By including the conflicted state subgraph we ensure that input state sets +with _old events_ have the auth history from those old events replayed. + + + +A new term is added in the state resolution algorithm: + +> **Conflicted state subgraph.** Starting from an event in the _conflicted +> state set_ and following `auth_events` edges may lead to another event in the +> conflicted state set. The union of all such paths between any pair of events +> in the conflicted state set (including endpoints) forms a subgraph of the +> original `auth_event` graph, called the _conflicted state subgraph_. + +And the following modification is made to the definition of "Full conflicted set": + +> **Full conflicted set.** The full conflicted set is the union of the conflicted state set, +> the conflicted state subgraph and the auth difference. + + +## Potential issues + +### Performance + +These modifications may impact performance in two ways: + - more work is done on every state resolution to calculate the conflicted state subgraph. + - potentially more events are replayed during resolution. + +The data required to calculate the auth difference is also the same information required to +calculate the conflicted state subgraph, so no extra database requests are needed with these +changes. However, more CPU work needs to be performed to walk the auth chain to look for +conflicted events on every resolution. + +In the common case, the conflicted state subgraph overlaps entirely with the auth difference, +meaning no extra events need to be replayed. This has been confirmed via +[partition/fault tolerance testing](https://github.com/element-hq/chaos) which tests extreme +cases with large numbers of membership changes and federation outages, producing resolutions such as: +``` +additional events replayed=0 num_conflicts=17 conflicted_subgraph=271 auth_difference=263 +``` + +## Further work + +More changes are required in order to fix all cases of state resets. The changes proposed here +are based on real world scenarios where state resolution has produced undesirable results. +The underlying causes of these state resets is mismatched orderings. If the protocol had a single +ordering then this would remove this entire class of issues. This will be explored in a future MSC. + +## Security considerations + +The state resolution algorithm is a critical component in the overall security of a room. This proposal +is modifying the algorithm so there are inevitable risks associated with it. These risks are mitigated +because the proposal is _not_ changing how events are ordered nor how events are authorised. It is +purely _adding_ events to be replayed and relying on the auth chains as the authoritative source +to rebase changes onto. + +## Unstable prefix + +This algorithm is in use for room version `org.matrix.hydra.11`. + +## Dependencies + +This MSC has no dependencies. + + +[^1]: When used in conjunction with [MSC4289](https://github.com/matrix-org/matrix-spec-proposals/pull/4289) +Alice should not be present in the `m.room.power_levels` event. The examples function in the same way. \ No newline at end of file diff --git a/proposals/images/4297-msc-1.png b/proposals/images/4297-msc-1.png new file mode 100644 index 000000000..92a775954 Binary files /dev/null and b/proposals/images/4297-msc-1.png differ diff --git a/proposals/images/4297-msc-2.png b/proposals/images/4297-msc-2.png new file mode 100644 index 000000000..d429d9a9f Binary files /dev/null and b/proposals/images/4297-msc-2.png differ diff --git a/proposals/images/4297-msc-3.png b/proposals/images/4297-msc-3.png new file mode 100644 index 000000000..b52d86ce0 Binary files /dev/null and b/proposals/images/4297-msc-3.png differ diff --git a/proposals/images/4297-msc-problem-a-1.png b/proposals/images/4297-msc-problem-a-1.png new file mode 100644 index 000000000..16015c41d Binary files /dev/null and b/proposals/images/4297-msc-problem-a-1.png differ diff --git a/proposals/images/4297-msc-problem-a-2.png b/proposals/images/4297-msc-problem-a-2.png new file mode 100644 index 000000000..620b6b63a Binary files /dev/null and b/proposals/images/4297-msc-problem-a-2.png differ diff --git a/proposals/images/4297-msc-problem-a-3.png b/proposals/images/4297-msc-problem-a-3.png new file mode 100644 index 000000000..f52453578 Binary files /dev/null and b/proposals/images/4297-msc-problem-a-3.png differ diff --git a/proposals/images/4297-msc-problem-b-1.png b/proposals/images/4297-msc-problem-b-1.png new file mode 100644 index 000000000..3487f7531 Binary files /dev/null and b/proposals/images/4297-msc-problem-b-1.png differ diff --git a/proposals/images/4297-msc-problem-b-2.png b/proposals/images/4297-msc-problem-b-2.png new file mode 100644 index 000000000..631a2b948 Binary files /dev/null and b/proposals/images/4297-msc-problem-b-2.png differ diff --git a/proposals/images/4297-msc-problem-b-3.png b/proposals/images/4297-msc-problem-b-3.png new file mode 100644 index 000000000..5dcbba70f Binary files /dev/null and b/proposals/images/4297-msc-problem-b-3.png differ diff --git a/proposals/images/4297-msc-sol-1.png b/proposals/images/4297-msc-sol-1.png new file mode 100644 index 000000000..d2eb668a5 Binary files /dev/null and b/proposals/images/4297-msc-sol-1.png differ diff --git a/proposals/images/4297-msc-sol-2.png b/proposals/images/4297-msc-sol-2.png new file mode 100644 index 000000000..b549e2cfe Binary files /dev/null and b/proposals/images/4297-msc-sol-2.png differ