diff --git a/proposals/3618-simplify-federation-send.md b/proposals/3618-simplify-federation-send.md
new file mode 100644
index 000000000..8bed8063f
--- /dev/null
+++ b/proposals/3618-simplify-federation-send.md
@@ -0,0 +1,63 @@
+# MSC3618: Simplify federation `/send` response
+
+## Overview
+
+Currently we specify that the federation `/send` endpoint returns a body of
+`pdus: { string: PDU Processing Result}`. In theory a homeserver can return
+information here on an event-by-event basis as to whether there was a problem
+processing events in the transaction or not.
+
+However, this does not really make much difference in practice (soft-fails
+are silent and rejected events may be too) and server implementations do not
+"cherry-pick" which events in a transaction to retry later. Since the presence
+of a `txnId` in the request implies that we should consider a transaction to be
+idempotent for a given `txnId`, we should therefore either accept that the
+entire transaction was accepted successfully by the remote side or we should
+retry the entire transaction.
+
+The worst case is that the homeserver is not able to process the transaction at
+all for some reason, e.g. due to the database being down or similar, in which
+case the server should just return an HTTP 500 status code, which
+signals to the sender to retry later.
+
+## Proposal
+
+This MSC proposes that we remove the `pdus` section from the response body, so
+that we return only one of two conditions:
+
+* An HTTP 200 with a `{}` body to signal that the transaction was accepted;
+* An HTTP 500 to signal that there was a problem with the transaction and to retry
+  sending later.
+
+## Benefits
+
+A significant benefit is that the receiving homeserver no longer needs to block the
+`/send` request in order to wait for the events to be processed for their `PDU Processing
+Result`s.
+
+Given that it is possible for a transaction to contain events from multiple rooms, or
+EDUs for unrelated purposes, it is undesirable that a single busy room can lengthen the
+time taken to return the `/send` response to the caller. This means that new events for other
+rooms may be held back unnecessarily by processing events for a single busy room, as
+per the spec:
+
+> The sending server must wait and retry for a 200 OK response before sending a
+> transaction with a different txnId to the receiving server.
+
+With this proposal, the receiving server no longer needs to wait for `PDU Processing Result`s
+as this MSC does away with them. Receiving servers that do not want to durably persist transactions
+before processing them can continue to perform all work in-memory by continuing to block the `/send`
+request until all processing is completed, as may be done today. Additionally, a receiving
+server that is receiving too many transactions from a remote homeserver may wish to block for
+an arbitrary period of time for rate-limiting purposes, but this is an implementation-specific
+detail and not strictly required.
+
+Another benefit is that sending homeservers no longer need to parse the response body at
+all and can instead just determine whether the transaction was accepted successfully by
+observing the HTTP status code.
+
+## Potential issues
+
+Synapse appears to use the `"pdus"` key for logging (see [here](https://github.com/matrix-org/synapse/blob/b38bdae3a2e5b7cfe862580368b996b8d7dfa50f/synapse/federation/sender/transaction_manager.py#L160)).
+Conduit does the same and treats the response as an empty list if it is not present. Dendrite
+ignores the response body altogether.