MSC4136: Shared retry hints

1 month ago · 63b249472d
parent 72e694ba0b
commit 63b249472d
1 changed files with 100 additions and 0 deletions
--- a/proposals/4136-shared-retry-hints.md
+++ b/proposals/4136-shared-retry-hints.md
@@ -0,0 +1,100 @@
+# MSC4136: Shared retry hints between servers
+
+## Problem
+
+New Matrix servers currently have no knowledge of which other servers are online, and so must start each retry schedule
+from first principles.
+
+This is far from ideal, as joining a room with many participating servers (e.g. 10K in #matrix:matrix.org right now) is
+incredibly heavy on the connecting server as the newly joined server will have to make 10K connection attempts as
+rapidly as possible (both SRV and .well-known lookups, and then /send attempts) to identify alive servers.  Meanwhile,
+dead servers (e.g. domains which no longer run Matrix servers) will be
+[hammered](https://mastodon.matrix.org/@mnot@techpolicy.social/112319234007365786) by connection attempts.
+
+## Proposal
+
+When joining a room, the server which facilitates the join ('resident server' in spec parlance) should propose retry
+intervals to the joining server in the `/send_join` response.
+
+Specifically, we add an optional `retry_hints` field to the `/send_join` response which provides optional
+recommendations on how long the joining server SHOULD wait in milliseconds before retrying if it cannot connect to the
+given domain.  As the field is optional, it does not require a new room version.
+
+If no hint is provided for a given domain, the retry hint MUST be considered to be zero milliseconds.  Retry hints
+should be provided whether or not the request had `omit_members` specified (i.e. whether or not faster remote room
+joins are in use).
+
+If the joining server is already communicating with a given domain (and so maintaining its own retry schedule), it MUST
+ignore the `retry_hints` provided by the resident server for that domain.
+
+As an example `/send_join` response:
+
+```json
+  ...
+  "servers_in_room": [
+    "matrix.org",
+    "example.com",
+    "element.io"
+  ],
+  "retry_hints": [
+    {
+      "example.com": {
+        "retry_after": 3600000
+      }
+    }
+  ]
+```
+
+This means that the joined room has three participating servers, but `example.com` is not reliably responding over
+federation, and the resident server recommends that the joining server should wait 1 hour before retrying to connect to
+it.
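
A minimal Python sketch of how a joining server might read these hints, assuming the list-of-objects shape shown in the example above. The function names (`parse_retry_hints`, `effective_hint_ms`) and the decision to silently skip malformed entries are illustrative assumptions, not part of the proposal.

```python
from typing import Any, Dict, Optional, Set


def parse_retry_hints(send_join_response: Dict[str, Any]) -> Dict[str, int]:
    """Flatten `retry_hints` into a {domain: retry_after_ms} mapping."""
    hints: Dict[str, int] = {}
    for entry in send_join_response.get("retry_hints", []):
        for domain, hint in entry.items():
            retry_after = hint.get("retry_after") if isinstance(hint, dict) else None
            # Assumption: malformed or negative values are skipped, not rejected.
            if (
                isinstance(retry_after, int)
                and not isinstance(retry_after, bool)
                and retry_after >= 0
            ):
                hints[domain] = retry_after
    return hints


def effective_hint_ms(
    domain: str,
    hints: Dict[str, int],
    domains_with_own_schedule: Set[str],
) -> Optional[int]:
    """Return the hint to apply for `domain`, or None if it must be ignored."""
    # The joining server's own retry schedule takes priority over any hint.
    if domain in domains_with_own_schedule:
        return None
    # A domain without a hint is treated as having a hint of zero milliseconds.
    return hints.get(domain, 0)
```

With the example response, `parse_retry_hints` yields a single entry for `example.com`, while `matrix.org` and `element.io` fall back to zero.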
+
+The spec currently does not specify anything about how federation retry schedules should work: as part of this change,
+we propose explicitly adding that:
+
+ * Servers should follow exponential or geometric backoff schedules (and MUST NOT retry linearly)
+ * Servers SHOULD reset their retry schedule for a given domain if they receive traffic from that domain, and immediately retry.
+ * On joining a room, servers SHOULD attempt to federate with all newfound servers, connecting in ascending order of `retry_after`.
+     * In other words: first attempting servers without an entry in `retry_hints`, then servers with lower
+       `retry_after` values, and finally the servers with the largest `retry_after` values.
+     * However, if federation fails, then the joining server should seed its retry algorithm with the `retry_after`
+       value for that server (rather than starting anew).
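
The rules above could be sketched roughly as follows in Python. This is one possible reading rather than a prescribed algorithm: the constants `MIN_INTERVAL_MS`, `MAX_INTERVAL_MS` and `MULTIPLIER`, and the `RetrySchedule` class itself, are hypothetical values and names chosen for illustration.

```python
from typing import Dict, List

MIN_INTERVAL_MS = 60_000               # hypothetical floor for the first retry
MAX_INTERVAL_MS = 24 * 60 * 60 * 1000  # hypothetical cap on the backoff
MULTIPLIER = 2                         # exponential growth factor


def connection_order(servers: List[str], hints: Dict[str, int]) -> List[str]:
    """Order servers so that unhinted ones come first, largest hints last."""
    # Servers without a hint count as 0 ms; sorted() is stable, so ties keep
    # their original relative order.
    return sorted(servers, key=lambda server: hints.get(server, 0))


class RetrySchedule:
    """Exponential backoff for one destination, optionally seeded by a hint."""

    def __init__(self, seed_ms: int = 0) -> None:
        self.seed_ms = seed_ms
        self.interval_ms = 0  # 0 means no failure has been recorded yet

    def on_failure(self) -> int:
        """Record a failed attempt and return how long to wait before retrying."""
        if self.interval_ms == 0:
            # First failure: start from the hint rather than starting anew.
            self.interval_ms = max(self.seed_ms, MIN_INTERVAL_MS)
        else:
            self.interval_ms *= MULTIPLIER
        self.interval_ms = min(self.interval_ms, MAX_INTERVAL_MS)
        return self.interval_ms

    def on_traffic_received(self) -> None:
        """Reset the schedule; the caller may then retry immediately."""
        self.seed_ms = 0
        self.interval_ms = 0
```

The hint is only consulted on the first failure, so a destination that turns out to be reachable never pays the suggested delay.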
+
+## Alternatives
+
+We could be more detailed in the recommended retry schedule (e.g. also specify the last time that a given server was
+seen to be working; when it was first seen to have failed; when the last retry was attempted; etc). However, this would
+bloat the size of the /send_join response (which we want to be as small as possible, to keep joins fast), and it's
+unclear whether the joining server really needs to know this data to seed its retry algorithm: instead, it can assume
+that retry_hints are present because the server is currently down and has just failed.  It also avoids risks of
+accidentally creating a thundering herd of retry schedules if all servers seed their retry algorithm with precisely the
+same schedule.  It also avoids us being too prescriptive on retry schedule algorithms.
+
+Alternatively, we could go the other way, and avoid timing information entirely in `retry_hints` and simply return a
+flag to say whether the resident server can currently connect to the destination or not. The joining server would then
+use this to prioritise connection attempts. However, it feels useful to recommend how aggressively the joining server
+should retry, as well as whether the server is currently up or down (e.g. to decide whether to attempt an immediate
+retry or not).
+
+## Security considerations
+
+A malicious resident server could tell the joining server that certain destinations are down when they are not.  This is
+mitigated by:
+
+ * The joining server will attempt all servers anyway, just deprioritising ones with higher `retry_after` values.
+ * The joining server will reset its retry schedule if it sees traffic from a given destination, letting the destination
+   assert its own existence and health whenever its users communicate
+ * The joining server will prioritise its own retry schedule over the received hint (if it has one)
+ * The joining server has to trust the resident server to say what servers exist in the first place anyway
+ * Events will indirectly make their way to destinations by being transitively pulled in via DAG references
+   (although E2EE keys may be missing)
+ * The resident server may have different connectivity to the destination than the joining server, so an inaccurate
+   hint is not necessarily malicious behaviour.
+
+## Unstable prefix
+
+While this MSC is in development, the `retry_hints` key should be returned as `org.matrix.msc4136.retry_hints`.
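
As a small illustration, a joining server might look up the field under either name while the MSC is in development. The helper name `raw_retry_hints` and the choice to prefer the stable key when both are present are assumptions made for this sketch.

```python
from typing import Any, Dict, List

STABLE_KEY = "retry_hints"
UNSTABLE_KEY = "org.matrix.msc4136.retry_hints"


def raw_retry_hints(send_join_response: Dict[str, Any]) -> List[Any]:
    """Return the hints list from a /send_join response, or [] if absent."""
    # Assumption: the stable key wins if a server happens to send both.
    for key in (STABLE_KEY, UNSTABLE_KEY):
        value = send_join_response.get(key)
        if isinstance(value, list):
            return value
    return []
```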
+
+## Dependencies
+
+None