MSC3902: Faster remote room joins over federation (overview)

Background

It is well known that joining large rooms over federation can be very slow (see, for example, synapse#1211).

Much of the reason for this is the large number of events which are returned by the /send_join API. (As of August 2022, a /send_join request for Matrix HQ returns 206479 events.) These events are necessary to correctly validate the state of the room at the point of the join, but the list is expensive for the "resident" server to generate, and even more so for the joining server to validate and store.

This proposal therefore sets out the changes needed so that most of the room state can be populated lazily, in the background, after the user has joined the room.

This proposal supersedes MSC2775.

Proposal

Firstly, we change /send_join to return, on request, a much reduced list of room state. The details of the changes to the API are set out in MSC3706, but in summary: m.room.member events are omitted from the response.
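
As a rough illustration of that summary, the sketch below filters membership events out of a list of state events. It is not the MSC3706 algorithm itself: the function name `omit_member_events` and the simplified event dictionaries are assumptions made for this example, and the precise rules for what is returned are those set out in MSC3706.

```python
from typing import Any

Event = dict[str, Any]


def omit_member_events(state_events: list[Event]) -> list[Event]:
    """Return the given state with every m.room.member event removed.

    Hypothetical helper: it only mirrors the one-line summary above.
    """
    return [ev for ev in state_events if ev.get("type") != "m.room.member"]


# Toy room state; real events carry many more fields.
full_state = [
    {"type": "m.room.create", "state_key": ""},
    {"type": "m.room.topic", "state_key": ""},
    {"type": "m.room.member", "state_key": "@alice:example.org"},
    {"type": "m.room.member", "state_key": "@bob:example.com"},
]

partial_state = omit_member_events(full_state)
```

For a room the size of Matrix HQ, the membership events make up the bulk of the state, which is why dropping them shrinks the response so much.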

This gives the joining server enough information to start handling some interactions with the room. Conceptually, processing then splits into two threads: first, a modified mechanism for handling incoming events and requests in the "partial-state" room; and second, a background process which concurrently "resynchronises" the room state.

Handling requests and events in the partial-state room

A number of changes must be made to handle the "partial-state" scenario. (As of this writing, these changes are limited to homeserver implementations, but the list may be extended to include changes to client implementations before this MSC is concluded.)

  • Processing incoming events received over federation:

    • Currently, the spec requires that an incoming event "Passes authorization rules based on the state before the event, otherwise it is rejected". Since we do not know the (full) state before the event, we can no longer apply this check. Instead, we perform a state-resolution between the limited state that we do have, and the event's auth events; we then check that the incoming event passes the authorization rules based on that resolved state.

      This process means that we are largely trusting remote servers not to send invalid events (hence the need for a revalidation during the resynchronisation process); however, it does mean that if we have a ban for a particular user, their events will still be rejected.

    • Additionally, no attempt is made to perform a "soft fail" check on incoming events.

  • Handling other federation requests: most federation requests require knowledge of the room state for authorisation (we should reject requests from servers which do not have users in the room). However, we can no longer correctly determine that state. MSC3895 specifies a new error code to indicate that we were unable to authorise a request.

  • Handling client-server requests: depending on the request in question, the server may or may not be able to accurately answer it. For example, a request for the topic of the room via /rooms/{roomId}/state/m.room.topic can reliably be answered (since we assume we have all non-membership state in the room), whereas a request for the list of joined members cannot be answered.

    In the current implementation, requests that require knowledge of m.room.member events for remote users will block until the resynchronisation completes.

    (Note that we can reliably answer requests that require knowledge only of the membership state for local users.)

  • /sync requires specific changes:

    • If lazy-loading of memberships is enabled, then any "partial state" room is included in the response. Even when lazy-loading is enabled, the server is expected to "include membership events for the sender of events being returned in the response". Since we do not have the full state of the room, we may be missing membership events for some senders. We resolve this by checking the auth_events for affected events, which must include a reference to a membership event.

    • If lazy-loading is not enabled, partial-state rooms are omitted from the response (until the state synchronisation completes).

      (This is pending implementation in Synapse.)

  • Outgoing events: This is pending implementation, but is likely to require some changes to ensure we do not get into a situation of being unable to safely answer a /get_missing_events or /state_ids request for an event we have generated.

  • Device management: homeserver implementations are expected to maintain a cache of the device list for all remote users that share a room with a local user, via m.device_list_update EDUs. To handle incomplete membership lists, we need to make the following changes:

    • Fixes to outgoing device list updates: we keep a record of any local device list changes that take place during the resynchronisation, and, once resync completes, we send them out to any homeservers that were in the room at any point since we started joining. (Synapse implementation)

    • Fixes to incoming device list updates: normally we ignore device-list updates from users who we don't think we share a room with. To ensure we do not discard incoming device list updates, we keep a record of any remote device list updates we receive, and replay them once resync completes. (Synapse implementation)
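
The two device-list fixes above amount to buffering updates while the room has partial state and flushing them when resynchronisation completes. The following is a minimal sketch of that idea, not the Synapse implementation: the class `PartialStateDeviceBuffer`, its method names, and the shape of the stored updates are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class PartialStateDeviceBuffer:
    """Hypothetical per-room buffer used while the room has partial state."""

    # Local users whose device lists changed during the resync.
    local_changes: list[str] = field(default_factory=list)
    # Incoming remote device list updates, kept in arrival order.
    remote_updates: list[dict[str, Any]] = field(default_factory=list)
    # Every server seen in the room since the join started.
    servers_seen: set[str] = field(default_factory=set)

    def note_server(self, server_name: str) -> None:
        self.servers_seen.add(server_name)

    def record_local_change(self, user_id: str) -> None:
        self.local_changes.append(user_id)

    def record_remote_update(self, update: dict[str, Any]) -> None:
        self.remote_updates.append(update)

    def finish_resync(self) -> tuple[list[tuple[str, str]], list[dict[str, Any]]]:
        """Return (outgoing, replay) and clear the buffer.

        outgoing: (server, local user ID) pairs to notify.
        replay: buffered remote updates to process again with full state.
        """
        outgoing = [
            (server, user_id)
            for server in sorted(self.servers_seen)
            for user_id in self.local_changes
        ]
        replay = list(self.remote_updates)
        self.local_changes.clear()
        self.remote_updates.clear()
        return outgoing, replay


buffer = PartialStateDeviceBuffer()
buffer.note_server("a.example")
buffer.note_server("b.example")
buffer.record_local_change("@alice:local.example")
buffer.record_remote_update({"user_id": "@bob:a.example", "device_id": "D1"})
outgoing, replay = buffer.finish_resync()
```

Sending to every server seen during the join, rather than only those in the room at the end, follows the "at any point since we started joining" wording above.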

Resynchronisation

Once a server receives a "partial state" response to /send_join, it must then call /state/{room_id}, setting event_id to the ID of the join event returned by /send_join, to obtain a full snapshot of the state at that event. It can then update its database accordingly.
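
As a small illustration of that request, the sketch below builds the request path for the state snapshot. The `/_matrix/federation/v1` prefix is taken from the published federation API; the helper name `state_request_path` and the example identifiers are assumptions, and request signing and transport are left out entirely.

```python
from urllib.parse import quote, urlencode


def state_request_path(room_id: str, join_event_id: str) -> str:
    """Build the path and query string for the full-state request.

    Hypothetical helper: a real homeserver would also sign the request
    and send it to the resident server.
    """
    path = "/_matrix/federation/v1/state/" + quote(room_id, safe="")
    return path + "?" + urlencode({"event_id": join_event_id})


# Made-up identifiers, for illustration only.
request_path = state_request_path("!abc:example.org", "$join123")
```

Both identifiers are percent-encoded because room IDs and event IDs contain characters such as `!`, `:` and `$` that are not safe in a URL path or query string.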

However, this process may take some time, and it is likely that other events have arrived in the meantime. These new events will also have been stored with "partial state", and will not have been subject to the full event authorisation process. The server must therefore work forward through the event DAG, recalculating the state at each event, and rechecking event authorisation, until it has caught up with "real time" and new events are being created with "full state".
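
The forward walk can be pictured with the toy sketch below. It makes several loud simplifications: events are a linear, already-ordered list rather than a DAG, there is no state resolution, and the single rule "the sender must be joined" stands in for the real authorization rules. The names `resync` and `sender_is_joined` are hypothetical.

```python
from typing import Any

Event = dict[str, Any]
StateMap = dict[tuple[str, str], Event]


def sender_is_joined(event: Event, state: StateMap) -> bool:
    """Toy stand-in for event authorisation: the sender must be joined."""
    member = state.get(("m.room.member", event["sender"]))
    return member is not None and member["content"].get("membership") == "join"


def resync(full_state_at_join: StateMap, later_events: list[Event]) -> tuple[StateMap, list[str]]:
    """Recheck events that arrived during partial state, oldest first.

    Returns the recalculated state and the IDs of events that fail the
    recheck now that full state is known.
    """
    state = dict(full_state_at_join)
    rejected: list[str] = []
    for event in later_events:
        if not sender_is_joined(event, state):
            rejected.append(event["event_id"])
            continue
        if "state_key" in event:
            # Accepted state events update the state for later events.
            state[(event["type"], event["state_key"])] = event
    return state, rejected


def member(user_id: str, membership: str) -> Event:
    return {"type": "m.room.member", "state_key": user_id, "sender": user_id,
            "content": {"membership": membership}}


full_state = {
    ("m.room.member", "@alice:a.example"): member("@alice:a.example", "join"),
    ("m.room.member", "@mallory:m.example"): member("@mallory:m.example", "ban"),
}
arrived_during_resync = [
    {"event_id": "$1", "type": "m.room.message", "sender": "@alice:a.example", "content": {}},
    {"event_id": "$2", "type": "m.room.message", "sender": "@mallory:m.example", "content": {}},
    {"event_id": "$3", "type": "m.room.topic", "state_key": "", "sender": "@alice:a.example",
     "content": {"topic": "hello"}},
]

state, rejected = resync(full_state, arrived_during_resync)
```

In this toy run the message from the banned user is only caught during the recheck, because the ban was a membership event the joining server did not hold while the room had partial state.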

Potential issues

TBD

Alternatives

TBD

Security considerations

It's important to note that, during the resynchronisation process, events are accepted without running the full set of checks; this is an inevitable consequence of having partial state, but does mean that we might accept abusive events that would otherwise be rejected.

This is mitigated by (a) re-running event authorisation once we have full state, and (b) the fact that "partial state" is a transient state: in other words, the window for sending abusive content is limited, and only users who happen to be in the room during the "resynchronisation" process will observe the abusive content.

Unstable prefix

n/a

Dependencies

This MSC builds on MSC3706 and MSC3895 (which at the time of writing have not yet been accepted into the spec).