Richard van der Hoff 3 weeks ago committed by GitHub
commit dee80e26b4
No known key found for this signature in database
GPG Key ID: B5690EEEBB952194

@ -0,0 +1,171 @@
# MSC3902: Faster remote room joins over federation (overview)
## Background
It is well known that joining large rooms over federation can be very slow (see,
for example, [synapse#1211](https://github.com/matrix-org/synapse/issues/1211)).
Much of the reason for this is the large number of events which are returned by
the [`/send_join`](https://spec.matrix.org/v1.4/server-server-api/#put_matrixfederationv2send_joinroomideventid)
API. (As of August 2022, a `/send_join` request for Matrix HQ returns 206479
events.) These events are necessary to correctly validate the state of the room
at the point of the join, but the list is expensive for the "resident" server to
generate, and even more so for the joining server to validate and store.
This proposal therefore sets out the changes needed so that most of the room
state can be popuated lazily, in the background, *after* the user has joined
the room.
This proposal supersedes [MSC2775](https://github.com/matrix-org/matrix-spec-proposals/pull/2775).
## Proposal
Firstly, we change `/send_join` to return, on request, a much reduced list of
room state. The details of the changes to the API are set out in
[MSC3706](https://github.com/matrix-org/matrix-spec-proposals/pull/3706), but
in summary: `m.room.member` events are omitted from the response.
This gives the joining server enough information to start handling some
interactions with the room. Conceptually, processing then splits into two
threads: one, a modified mechanism for handling incoming events and requests in
the "partial-state" room; and second, a background process which concurrently
"resynchronises" the room state.
### Handling requests and events in the partial-state room
A number of changes must be made to handle the "partial-state" scenario. (As of
this writing, these changes are limited to homeserver implementations, but the
list may be extended to include changes to client implementations before this
MSC is concluded.)
* Processing incoming events received over federation:
* Currently, the
[spec](https://spec.matrix.org/v1.4/server-server-api/#checks-performed-on-receipt-of-a-pdu)
requires that an incoming event "Passes authorization rules based on the
state before the event, otherwise it is rejected". Since we do not know
the (full) state before the event, we can no longer apply this
check. Instead, we perform a state-resolution between the limited state
that we do have, and the event's auth events; we then check that the
incoming event passes the authorization rules based on that resolved
state.
This process means that we are largely trusting remote servers not to send
invalid events (hence the need for a revalidation during the
resynchronisation process); however it does mean that if we have a ban for
a particular user, then their events will be rejected.
* Additionally, no attempt is made to perform a "soft fail" check on incoming events.
* Handling other federation requests: most federation requests require
knowledge of the room state for authorisation (we should reject requests
from servers which do not have users in the room). However, we can no longer
correctly determine that
state. [MSC3895](https://github.com/matrix-org/matrix-spec-proposals/pull/3895)
specifies a new error code to indicate that we were unable to authorise a
request.
* Handling client-server requests: depending on the request in question, the
server may or may not be able to accurately answer it. For example, a
request for the topic of the room via
[`/rooms/{roomId}/state/m.room.topic`](https://spec.matrix.org/v1.4/client-server-api/#get_matrixclientv3roomsroomidstateeventtypestatekey)
can reliably be answered (since we assume we have all non-membership state
in the room), whereas a request for the [list of joined
members](https://spec.matrix.org/v1.4/client-server-api/#get_matrixclientv3roomsroomidjoined_members) cannot be answered.
In the current implementation, requests that require knowledge of
`m.room.member` events for remote users will *block* until the
resynchronisation completes.
(Note that we can reliably answer requests that require knowledge only of
the membership state for local users.)
* [`/sync`](https://spec.matrix.org/v1.4/client-server-api/#get_matrixclientv3sync)
requires specific changes:
* If [lazy-loading](https://spec.matrix.org/v1.4/client-server-api/#lazy-loading-room-members)
of memberships is enabled, then any "partial state" room is included in
the response. Even when lazy-loading is enabled, the server is expected to
"include membership events for the `sender` of events being returned in
the response". Since we do not have the full state of the room, we may be
missing membership events for some senders. We resolve this by checking
the `auth_events` for affected events, which must include a reference to a
membership event.
* If lazy-loading is *not* enabled, partial-state rooms are omitted from the
response (until the state synchronisation completes).
(This is [pending implementation](https://github.com/matrix-org/synapse/issues/12989) in Synapse.)
* Outgoing events: This is [pending
implementation](https://github.com/matrix-org/synapse/issues/12997), but is
likely to require some changes to ensure we do not get into a situation of
being unable to safely answer a
[`/get_missing_events`](https://spec.matrix.org/v1.4/server-server-api/#post_matrixfederationv1get_missing_eventsroomid)
or
[`/state_ids`](https://spec.matrix.org/v1.4/server-server-api/#get_matrixfederationv1state_idsroomid)
request for an event we have generated.
* [Device management](https://spec.matrix.org/v1.4/server-server-api/#device-management):
homeserver implementations are expected to maintain a cache of the device
list for all remote users that share a room with a local user, via
`m.device_list_update` EDUs. To handle incomplete membership lists, we need to make the following changes:
* Fixes to outgoing device list updates: we keep a record of any *local*
device list changes that take place during the resynchronisation, and,
once resync completes, we send them out to any homeservers that were in
the room at any point since we started joining. ([Synapse
implementation](https://github.com/matrix-org/synapse/pull/13934))
* Fixes to incoming device list updates: normally we ignore device-list
updates from users who we don't think we share a room with. To ensure we
do not discard incoming device list updates, we keep a record of any
*remote* device list updates we receive, and replay them once resync
completes. ([Synapse
implementation](https://github.com/matrix-org/synapse/pull/13913))
### Resynchronisation
Once a server receives a "partial state" response to `/send_join`, it must then
call [`/state/{room_id}`](https://spec.matrix.org/v1.4/server-server-api/#get_matrixfederationv1stateroomid),
setting `event_id` to the ID of the join event returned by `/send_join`, to
obtain a full snapshot of the state at that event. It can then update its database
accordingly.
However, this process may take some time, and it is likely that other events
have arrived in the meantime. These new events will also have been stored with
"partial state", and will not have been subject to the full event authorisation
process. The server must therefore work forward through the event DAG,
recalculating the state at each event, and rechecking event authorisation,
until it has caught up with "real time" and new events are being created with
"full state".
## Potential issues
TBD
## Alternatives
TBD
## Security considerations
It's important to note that, during the resynchronisation process, events are
accepted without running the full checks process; this is an inevitable
consequence of having partial state, but does mean that we might accept abusive
events that would otherwise be rejected.
This is mitigated by (a) the process of re-running the event authorisation
process once we have full state, and (b) the fact that "partial state" is a
transient state: in other words, the window for sending abusive content is
limited, and only users who happen to be in the room during the
"resynchronisation" process will observe the abusive content.
## Unstable prefix
n/a
## Dependencies
This MSC builds on [MSC3706](https://github.com/matrix-org/matrix-spec-proposals/pull/3706) and [MSC3895](https://github.com/matrix-org/matrix-spec-proposals/pull/3895)
(which at the time of writing have not yet been accepted into the spec).
Loading…
Cancel
Save