Proposal for .well-known for server discovery

pull/977/head
Richard van der Hoff 6 years ago
parent f288facec8
commit 87330b9b9b

@ -0,0 +1,126 @@
# .well-known support for server name resolution
Currently, mapping from a server name to a hostname for federation is done via
`SRV` records. This presents two principal difficulties:
* SRV records are not widely used, so administrators may be unfamiliar with
them. There may also be practical difficulties in deploying them, such as
poor support from hosting providers. [^1]
* It is likely that we will soon require valid X.509 certificates on the
federation endpoint. It will then be necessary for the homeserver to present
a certificate which is valid for the server name. This presents difficulties
for hosted server offerings: BigCorp may be reluctant to hand over the
keys for `bigcorp.com` to the administrators of the `bigcorp.com` matrix
homeserver.

Here we propose to solve these problems by augmenting the current `SRV` record
lookup with a `.well-known` lookup.
## Proposal
For reference, the current [specification for resolving server
names](https://matrix.org/docs/spec/server_server/unstable.html#resolving-server-names)
is as follows:
* If the hostname is an IP literal, then that IP address should be used,
together with the given port number, or 8448 if no port is given.
* Otherwise, if the port is present, then an IP address is discovered by
looking up an AAAA or A record for the hostname, and the specified port is
used.
* If the hostname is not an IP literal and no port is given, the server is
discovered by first looking up a `_matrix._tcp` SRV record for the
hostname, which may give a hostname (to be looked up using AAAA or A queries)
and port. If the SRV record does not exist, then the server is discovered by
looking up an AAAA or A record on the hostname and taking the default
fallback port number of 8448.

Homeservers may use SRV records to load balance requests between multiple TLS
endpoints or to fail over to another endpoint if an endpoint fails.
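
As a rough illustration of the three rules above, the following Python sketch names which rule applies to a given server name. The helper names (`parse_server_name`, `is_ip_literal`, `resolution_step`) are hypothetical and not part of any existing implementation, and the parsing is simplified: it assumes IPv6 literals are written in brackets and does not validate the full server name grammar.

```python
import ipaddress
from typing import NamedTuple, Optional

DEFAULT_PORT = 8448


class ServerName(NamedTuple):
    host: str
    port: Optional[int]


def parse_server_name(server_name: str) -> ServerName:
    """Split 'host[:port]' into its parts (simplified, no grammar validation)."""
    if server_name.startswith("["):
        # Bracketed IPv6 literal, e.g. "[::1]:8448"; strip the brackets.
        end = server_name.index("]")
        host, rest = server_name[1:end], server_name[end + 1:]
        port = int(rest[1:]) if rest.startswith(":") else None
        return ServerName(host, port)
    host, sep, port = server_name.rpartition(":")
    if sep:
        return ServerName(host, int(port))
    return ServerName(server_name, None)


def is_ip_literal(host: str) -> bool:
    """Return True if host is an IPv4 or IPv6 address rather than a DNS name."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def resolution_step(server_name: str) -> str:
    """Describe which of the three current rules applies to server_name."""
    name = parse_server_name(server_name)
    if is_ip_literal(name.host):
        port = name.port if name.port is not None else DEFAULT_PORT
        return f"connect to {name.host} port {port}"
    if name.port is not None:
        return f"AAAA/A lookup on {name.host}, port {name.port}"
    return (
        f"SRV lookup on _matrix._tcp.{name.host}, "
        f"else AAAA/A on {name.host} port {DEFAULT_PORT}"
    )
```

For instance, `resolution_step("example.com:8008")` selects the second rule, while `resolution_step("example.com")` selects the SRV path.
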

The first two points remain unchanged: if the server name is an IP literal, or
contains a port, then requests will be made directly as before.

If the hostname is not an IP literal and has no explicit port, then the
requesting server should continue to make an SRV lookup as before, and use the
result if one is found.

If *no* result is found, the requesting server should make a `GET` request to
`https://<server_name>/.well-known/matrix/server`, with normal X.509
certificate validation. If the request fails in any way, then we fall back as
before to using port 8448 on the hostname.

Rationale: Falling back to port 8448 (rather than aborting the request) is
necessary to maintain compatibility with existing deployments, which may not
present valid certificates on port 443, or may return 4xx or 5xx errors.

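
The ordering described above (SRV first, then `.well-known`, then port 8448) can be sketched as follows. This is only an illustration under stated assumptions: `srv_lookup` and `fetch_well_known` are hypothetical callables supplied by the caller, and `WellKnownRequestError` is an invented exception standing in for any failure of the GET request (connection error, certificate validation failure, 4xx or 5xx status).

```python
from typing import Callable, Optional, Tuple

Target = Tuple[str, int]  # (hostname, port) to connect to
DEFAULT_PORT = 8448


class WellKnownRequestError(Exception):
    """Hypothetical error: the GET failed (connection, TLS or HTTP status)."""


def resolve_hostname(
    hostname: str,
    srv_lookup: Callable[[str], Optional[Target]],
    fetch_well_known: Callable[[str], Target],
) -> Target:
    """Resolve a server name that is not an IP literal and has no port."""
    # Step 1: the SRV lookup is tried first, and wins if a record exists.
    srv_target = srv_lookup("_matrix._tcp." + hostname)
    if srv_target is not None:
        return srv_target

    # Step 2: no SRV record, so ask the .well-known endpoint over HTTPS.
    url = "https://" + hostname + "/.well-known/matrix/server"
    try:
        return fetch_well_known(url)
    except WellKnownRequestError:
        # Step 3: the request itself failed, so fall back to port 8448.
        return (hostname, DEFAULT_PORT)
```

Passing the lookups in as callables keeps the sketch independent of any particular DNS or HTTP library; a real homeserver would plug in its own clients.
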
If the GET request succeeds, it should result in a JSON response, with contents
structured as shown:
```json
{
    "server": "<server>[:<port>]"
}
```
The `server` property has the same format as a [server
name](https://matrix.org/docs/spec/appendices.html#server-name): a hostname
followed by an optional port.
If the request succeeds but the response cannot be parsed as JSON, or lacks a
valid `server` property, resolution of the server name is considered to have
failed, and no fallback to port 8448 takes place.
Otherwise, the requesting server performs an `AAAA`/`A` lookup on the hostname
given in the `server` property, and connects to the resultant address on the
specified port. The port defaults to 8448 if unspecified.
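
A minimal sketch of handling the response body might look like the following. The function name `parse_well_known` and the exception `InvalidWellKnown` are hypothetical, and the `server` value is split with a simple rule rather than validated against the full server name grammar.

```python
import json
from typing import Tuple

DEFAULT_PORT = 8448


class InvalidWellKnown(Exception):
    """Raised when the body is not JSON or lacks a valid `server` property."""


def parse_well_known(body: str) -> Tuple[str, int]:
    """Return the (hostname, port) named by a .well-known response body."""
    try:
        data = json.loads(body)
    except ValueError as exc:  # json.JSONDecodeError is a ValueError
        raise InvalidWellKnown("response is not valid JSON") from exc

    server = data.get("server") if isinstance(data, dict) else None
    if not isinstance(server, str) or not server:
        raise InvalidWellKnown("missing or invalid `server` property")

    host, sep, port = server.rpartition(":")
    if not sep or server.endswith("]"):
        # No port given (bare hostname or bracketed IPv6 literal).
        return server, DEFAULT_PORT
    if not host or not (port.isascii() and port.isdigit()):
        raise InvalidWellKnown("malformed `server` property")
    return host, int(port)
```

A caller would treat `InvalidWellKnown` as a hard failure, in contrast to a failed GET request, which falls back to port 8448.
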
### Caching
Servers should not look up the `.well-known` file for every request, as this
would impose an unacceptable overhead on both sides. Instead, the results of
the `.well-known` request should be cached according to the HTTP response
headers, as per [RFC7234](https://tools.ietf.org/html/rfc7234). If the response
does not include an explicit expiry time, the requesting server should use a
sensible default: 24 hours is suggested.

Because there is no way to request a revalidation, it is also recommended that
requesting servers cap the expiry time. 48 hours is suggested.

Similarly, a failure to retrieve the `.well-known` file should be cached for
a reasonable period. Again, 24 hours is suggested.
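
The suggested lifetimes can be combined as in the sketch below. It is deliberately simplified and is not a full RFC 7234 implementation: it only looks at the `max-age` directive of a `Cache-Control` header value and ignores `Expires`, `no-store` and similar. The function name `cache_lifetime` and the constant names are hypothetical.

```python
import re
from typing import Optional

DEFAULT_LIFETIME = 24 * 60 * 60  # used when the response gives no expiry
MAX_LIFETIME = 48 * 60 * 60      # cap on any expiry supplied by the server
FAILURE_LIFETIME = 24 * 60 * 60  # how long to remember a failed lookup

# Matches a max-age directive at the start or after a comma or whitespace.
_MAX_AGE = re.compile(r"(?:^|[\s,])max-age=(\d+)", re.IGNORECASE)


def cache_lifetime(cache_control: Optional[str], failed: bool = False) -> int:
    """Return how many seconds a .well-known result should be cached."""
    if failed:
        return FAILURE_LIFETIME
    if cache_control:
        match = _MAX_AGE.search(cache_control)
        if match:
            # Honour the server's expiry, but never beyond the cap.
            return min(int(match.group(1)), MAX_LIFETIME)
    return DEFAULT_LIFETIME
```

With these values, a response carrying `Cache-Control: max-age=604800` (one week) would be cached for only 48 hours, while a response with no caching headers would be cached for 24 hours.
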
### The future of SRV records
Note that this proposal maintains support for SRV records for the immediate
future; there are no current plans to deprecate them.

However, a `.well-known` file can clearly provide much of the functionality of
an SRV record, and having to support both may be undesirable. Accordingly, we
may consider sunsetting SRV record support at some point in the future.
### Outstanding questions
Should we follow 30x redirects for the `.well-known` file? On the one hand,
there is no obvious use case and they add complexity (for example: how do they
interact with caches?). On the other hand, we'll presumably be using an HTTP
client library to handle some of the caching, and redirects might be useful
for something?
## Security considerations
The `.well-known` file potentially broadens the attack surface for an attacker
wishing to intercept federation traffic to a particular server.
## Conclusion
This proposal adds a new mechanism, alongside the existing `SRV` record lookup,
for finding the server responsible for a particular Matrix server_name, which
will allow greater flexibility in deploying homeservers.
[^1] For example, Cloudflare automatically "flattens" SRV record responses.