Do a SRV lookup before .well-known lookup

also other clarifications and corrections.
pull/1708/head
Richard van der Hoff 5 years ago
parent e789eb186a
commit f33a540e6d

@ -1,21 +1,17 @@
# MSC1708: .well-known support for server name resolution
Currently, mapping from a server name to a hostname for federation is done via
`SRV` records. This presents two principal difficulties:
* SRV records are not widely used, and administrators may be unfamiliar with
them, and there may be other practical difficulties in their deployment such
as poor support from hosting providers. [^1]
* [MSC1711](https://github.com/matrix-org/matrix-doc/pull/1711) proposes
requiring valid X.509 certificates on the
federation endpoint. It will then be necessary for the homeserver to present
a certificate which is valid for the server name. This presents difficulties
for hosted server offerings: BigCorp may be reluctant to hand over the
keys for `bigcorp.com` to the administrators of the `bigcorp.com` matrix
homeserver.
Here we propose to solve these problems by augmenting the current `SRV` record
`SRV` records. However,
[MSC1711](https://github.com/matrix-org/matrix-doc/pull/1711) proposes
requiring valid X.509 certificates on the federation endpoint. It will then be
necessary for the homeserver to present a certificate which is valid for the
server name. This presents difficulties for hosted server offerings: BigCorp
may want to delegate responsibility for running its Matrix homeserver to an
outside supplier, but it may be difficult for that supplier to obtain a TLS
certificate for `bigcorp.com` (and BigCorp may be reluctant to let them have
one).
This MSC proposes to solve this problem by augmenting the current `SRV` record
with a `.well-known` lookup.
## Proposal
@ -24,59 +20,80 @@ For reference, the current [specification for resolving server
names](https://matrix.org/docs/spec/server_server/unstable.html#resolving-server-names)
is as follows:
* If the hostname is an IP literal, then that IP address should be used,
together with the given port number, or 8448 if no port is given.
1. If the hostname is an IP literal, then that IP address should be used,
together with the given port number, or 8448 if no port is given.
2. Otherwise, if the port is present, then an IP address is discovered by
looking up an AAAA or A record for the hostname, and the specified port is
used.
3. If the hostname is not an IP literal and no port is given, the server is
discovered by first looking up a `_matrix._tcp` SRV record for the
hostname, which may give a hostname (to be looked up using AAAA or A queries)
and port.
4. Finally, the server is discovered by looking up an AAAA or A record on the
hostname, and taking the default fallback port number of 8448.
We insert the following between Steps 3 and 4:
If the SRV record does not exist, the requesting server should make a `GET`
request to `https://<server_name>/.well-known/matrix/server`, with normal
X.509 certificate validation. If the request does not return a 200, continue
to step 4, otherwise:
XXX: should we follow redirects?
* Otherwise, if the port is present, then an IP address is discovered by
looking up an AAAA or A record for the hostname, and the specified port is
used.
The response must have a `Content-Type` of `application/json`, and must be
valid JSON which follows the structure documented below. Otherwise, the
request is aborted.
* If the hostname is not an IP literal and no port is given, the server is
discovered by first looking up a `_matrix._tcp` SRV record for the
hostname, which may give a hostname (to be looked up using AAAA or A queries)
and port. If the SRV record does not exist, then the server is discovered by
looking up an AAAA or A record on the hostname and taking the default
fallback port number of 8448.
If the response is valid, the `m.server` property is parsed as
`<delegated_server_name>[:<delegated_port>]`, and processed as follows:
Homeservers may use SRV records to load balance requests between multiple TLS
endpoints or to failover to another endpoint if an endpoint fails.
a. If `<delegated_server_name>` is an IP literal, then that IP address should
be used, together with `<delegated_port>`, or 8448 if no port is
given. The server should present a valid TLS certificate for
`<delegated_server_name>`.
The first two points remain unchanged: if the server name is an IP literal, or
contains a port, then requests will be made directly as before.
b. Otherwise, if the port is present, then an IP address is discovered by
looking up an AAAA or A record for `<delegated_server_name>`, and the
specified port is used. The server should present a valid TLS certificate
for `<delegated_server_name>`.
If the hostname is neither an IP literal, nor does it have an explicit port,
then the requesting server should continue to make an SRV lookup as before, and
use the result if one is found.
(In other words, the federation connection is made to
`https://<delegated_server_name>:<delegated_port>`).
If *no* SRV result is found, the requesting server should make a `GET` request
to `https://<server_name>/.well-known/matrix/server`, with normal X.509
certificate validation. If the request fails in any way, then we fall back as
before to using port 8448 on the hostname.
c. If the hostname is not an IP literal and no port is given, a second SRV
record is looked up; this time for `_matrix._tcp.<delegated_server_name>`,
which may give yet another hostname (to be looked up using A/AAAA queries)
and port. The server must present a TLS cert for the
`<delegated_server_name>` from the .well-known.
Rationale: Falling back to port 8448 (rather than aborting the request) is
necessary to maintain compatibility with existing deployments, which may not
present valid certificates on port 443, or may return 4xx or 5xx errors.
d. If no SRV record is found, the server is discovered by looking up an AAAA
or A record on `<delegated_server_name>`, and taking the default fallback
port number of 8448.
If the GET request succeeds, it should result in a JSON response, with contents
structured as shown:
(In other words, the federation connection is made to
`https://<delegated_server_name>:8448`).
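The resolution order described above (IP literal or explicit port, then SRV, then `.well-known`, then the port 8448 fallback) could be sketched roughly as follows. This is an illustrative sketch only, not a reference implementation: `resolve`, `split_host_port`, `is_ip_literal`, `lookup_srv` and `fetch_well_known` are hypothetical names. The two callables are assumed to be supplied by the caller: `lookup_srv(host)` is assumed to query `_matrix._tcp.<host>` and return a `(host, port)` pair or `None`, and `fetch_well_known(host)` is assumed to return the `m.server` string on a valid 200 response or `None` otherwise. A/AAAA lookups, TLS certificate checks, and aborting on an invalid response are deliberately left out.

```python
import ipaddress
from typing import Callable, Optional, Tuple

Target = Tuple[str, int]  # (host, port) to connect to
DEFAULT_PORT = 8448


def split_host_port(name: str) -> Tuple[str, Optional[int]]:
    """Split 'host[:port]'; a bracketed IPv6 literal keeps its brackets."""
    if name.endswith("]") or ":" not in name:
        return name, None
    host, _, port = name.rpartition(":")
    return host, int(port)  # raises ValueError on a non-numeric port


def is_ip_literal(host: str) -> bool:
    try:
        ipaddress.ip_address(host.strip("[]"))
        return True
    except ValueError:
        return False


def direct_target(name: str) -> Optional[Target]:
    """Return a target if the name is an IP literal or carries a port."""
    host, port = split_host_port(name)
    if is_ip_literal(host) or port is not None:
        return host, port if port is not None else DEFAULT_PORT
    return None


def resolve(
    server_name: str,
    lookup_srv: Callable[[str], Optional[Target]],        # hypothetical
    fetch_well_known: Callable[[str], Optional[str]],     # hypothetical
) -> Target:
    # Steps 1 and 2: IP literal or explicit port, connect directly.
    target = direct_target(server_name)
    if target is not None:
        return target

    # Step 3: use the SRV record if one exists.
    srv = lookup_srv(server_name)
    if srv is not None:
        return srv

    # Inserted step: no SRV record, so try .well-known.
    delegated = fetch_well_known(server_name)
    if delegated is None:
        # Step 4: fall back to the default port on the hostname.
        return server_name, DEFAULT_PORT

    # Steps a and b: delegated IP literal or explicit delegated port.
    target = direct_target(delegated)
    if target is not None:
        return target

    # Step c: second SRV lookup, this time for the delegated name.
    srv = lookup_srv(delegated)
    if srv is not None:
        return srv

    # Step d: default port on the delegated name.
    return delegated, DEFAULT_PORT
```

Passing the lookups in as callables keeps the ordering logic separate from DNS and HTTP, so the sketch can be exercised with plain dictionaries standing in for the network.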
### Structure of the `.well-known` response
The contents of the `.well-known` response should be structured as shown:
```json
{
"server": "<server>[:<port>]"
"m.server": "<server>[:<port>]"
}
```
The `server` property should be a hostname or IP address, followed by an
The `m.server` property should be a hostname or IP address, followed by an
optional port.
If the response cannot be parsed as JSON, or lacks a valid `server` property,
the request is considered to have failed, and no fallback to port 8448 takes
place.
Otherwise, the requesting server performs an `AAAA/A` lookup on the hostname
(if necessary), and connects to the resultant address and the specified
port. The port defaults to 8448, if unspecified.
(The formal grammar for the `server` property is identical to that of a [server
name](https://matrix.org/docs/spec/appendices.html#server-name).)
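As an illustration of the validation rules for the `.well-known` response (a `Content-Type` of `application/json`, a valid JSON body, and an `m.server` property), a requesting server might check the response along these lines. This is a sketch under assumptions: `extract_m_server` and `InvalidWellKnown` are hypothetical names, tolerating media-type parameters such as `charset` is an assumption not stated in the proposal, and the returned string still needs to be checked against the server-name grammar.

```python
import json


class InvalidWellKnown(Exception):
    """Hypothetical error type: the response must not be used."""


def extract_m_server(content_type: str, body: bytes) -> str:
    """Return the raw `m.server` value, or raise InvalidWellKnown."""
    # Assumption: parameters such as "; charset=utf-8" are tolerated.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type != "application/json":
        raise InvalidWellKnown("unexpected Content-Type")

    try:
        data = json.loads(body)
    except ValueError as exc:  # covers JSON and UTF-8 decoding errors
        raise InvalidWellKnown("body is not valid JSON") from exc

    value = data.get("m.server") if isinstance(data, dict) else None
    if not isinstance(value, str) or not value:
        raise InvalidWellKnown("missing or invalid m.server property")
    return value
```

Raising a dedicated exception, rather than returning `None`, keeps an invalid response distinct from a request that simply did not return a 200.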
@ -92,18 +109,10 @@ sensible default: 24 hours is suggested.
Because there is no way to request a revalidation, it is also recommended that
requesting servers cap the expiry time. 48 hours is suggested.
Similarly, a failure to retrieve the `.well-known` file should be cached for
a reasonable period. 24 hours is suggested again.
### The future of SRV records
It's worth noting that this proposal is very clear in that we will maintain
support for SRV records for the immediate future; there are no current plans to
deprecate them.
However, clearly a `.well-known` file can provide much of the functionality of
an SRV record, and having to support both may be undesirable. Accordingly, we
may consider sunsetting SRV record support at some point in the future.
A failure to retrieve the `.well-known` file should also be cached, though care
must be taken that a single 500 error or connection failure should not break
federation for an extended period. A short cache time of about an hour might be
appropriate; alternatively, servers might use an exponential backoff.
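The caching suggestions could be expressed as two small helpers, sketched here under assumptions. The 24 hour default and 48 hour cap are the suggested figures; applying the default when the response gives no usable expiry is an assumption, as are the 60 second starting point and one hour ceiling for failure backoff, which the proposal leaves open. The names `success_ttl` and `failure_ttl` are hypothetical.

```python
from typing import Optional

DEFAULT_TTL = 24 * 3600   # suggested default lifetime, in seconds
MAX_TTL = 48 * 3600       # suggested cap on the lifetime
FAILURE_BASE = 60         # assumption: first retry after one minute
FAILURE_MAX = 3600        # assumption: never cache a failure beyond an hour


def success_ttl(requested: Optional[int]) -> int:
    """Lifetime for a successful response, given the server's expiry hint."""
    if requested is None or requested < 0:
        return DEFAULT_TTL
    return min(requested, MAX_TTL)


def failure_ttl(consecutive_failures: int) -> int:
    """Exponential backoff: doubles per consecutive failure, then caps."""
    exponent = min(max(consecutive_failures - 1, 0), 16)
    return min(FAILURE_BASE * 2 ** exponent, FAILURE_MAX)
```

With these values a single transient 500 delays the next attempt by only a minute, while a persistently failing server is retried at most once an hour.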
### Outstanding questions
@ -127,7 +136,6 @@ as soon as possible, to maximise uptake in the ecosystem. It is likely that, as
we approach Matrix 1.0, there will be sufficient other new features (such as
new Room versions) that upgrading will be necessary anyway.
## Security considerations
The `.well-known` file potentially broadens the attack surface for an attacker
@ -138,6 +146,3 @@ wishing to intercept federation traffic to a particular server.
This proposal adds a new mechanism, alongside the existing `SRV` record lookup
for finding the server responsible for a particular matrix server_name, which
will allow greater flexibility in deploying homeservers.
[^1] For example, Cloudflare automatically "flattens" SRV record responses.
