You cannot select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
matrix-spec-proposals/proposals/3468-mxc-hash.md

197 lines
6.9 KiB
Markdown

# MSC3468: MXCs to Hashes
Currently, matrix media/content repositories work with a MXC to blob mapping, fetching the media
from the domain embedded in the MXC to present it to the user.
However, this becomes a problem when media retention, redaction, and resiliency come into play,
the singular MXC URI becoming a point of failure once the backing server retracts the URI, either
deliberately (aforementioned redaction), or accidentally (via server reset, or losing the backing media).
This is in opposition to how MXCs are used in matrix today, much like Discord media URLs;
immutable and always online, links are copied and reused across rooms.
## Proposal
I propose for MXCs to be reworked into being a pointer to hashes.
This gives the extra benefit of decoupling aliasing pointers (such as the MXC is) with the underlying media.
Alongside this change, I also propose for an additional client-side endpoint which can quickly "clone"
a MXC. This being done by having the server look up the MXC's hash,
and then creating a new MXC also referencing that hash.
The client-server content API would expose a method for the client to retrieve the hash of a
particular MXC, alongside aforementioned method to clone it.
The server-server content API would add a dedicated fetch method for fetching the hash to a MXC, and
fetching the media to a hash.
### Specification
#### Client-Server
This proposal would like to add the following two methods to CS;
```
POST _matrix/media/v1/clone/{serverName}/{mediaId}
Rate-limited: Yes
Authentication: Yes
Responses:
200: JSON (see below)
429: Ratelimited
503: Could not fetch remote MXC-to-hash mapping
```
200 response:
```json
{
"m.clone.mxc": "mxc://local.server/media_id"
}
```
```
GET _matrix/media/v1/hash/{serverName}/{mediaId}
Rate-limited: Yes
Authentication: Yes
Responses:
200: JSON (see below)
429: Ratelimited
503: Could not fetch remote MXC-to-hash mapping
```
200 response:
```json5
{
"m.mxc.hash": "1234567890abcdef" // hex-encoded hash
}
```
#### Server-Server
This proposal would like to add the following two endpoints to S2S;
```
GET _matrix/federation/v1/media/hash
Rate-limited: No
Authentication: Yes
Query parameters:
media_id: string, the local part of an MXC for which the hash is queried
Responses:
200: Pure-binary encoding of corresponding hash
404: Media ID does not exist
```
```
GET _matrix/media/v1/media/fetch/{hash}
Rate-limited: Yes
Authentication: Yes
Responses:
200: Blob of data corresponding to hash
404: Hash-media not found
429: Ratelimited
```
### "Which hash?"
*Note: this is an area of feedback, this'll be removed in the final draft*
So far, the definition of "hash" has been vague. I think converging on a specific hash function
could be a lock-in for future expansion.
So, i'd like to propose using [`multihash`](https://github.com/multiformats/multihash) for these
purposes, this would allow a common format self-describing the hashes used.
For now, only a set series of hashes would be included (see
[here](https://github.com/multiformats/multicodec/blob/master/table.csv) for a full table), which
can be expanded/deprecated with subsequent matrix spec releases, without changing up the format of
the hash, or documenting checks to differentiate the types of hash used, or to reinvent multihash.
However, this is up for debate.
## Motivation
This MSC wishes to unblock efforts for media retention and redaction;
- https://github.com/matrix-org/synapse/issues/6832
- https://github.com/matrix-org/matrix-doc/issues/701
By addition of the `/clone` endpoint, any client wishing to preserve media, can do so by simply
fetching/storing media locally, reducing the linkrot effect that remote servers redacting media
could have.
This MSC would also wish to make matrix more flexible for diverse media delivery systems.
Mapping MXCs to hashes could allow the hashes themselves to become self-verifying keys in any
(centralized or distributed) KV store.
This, in turn, could prepare matrix better for P2P efforts.
This MSC also wishes to make matrix content delivery more resilient, with the exception of mapping a
MXC alias to a hash, a hash could be retrieved from anywhere, and still be self-verifying,
considerably lessening the bus factor, and allowing for better distributed load (see the first
"future extension" in below section)
## Potential issues
This could have a slight performance hit, as an extra RTT between servers is needed to fetch the
media actual, after fetching the hash corresponding to that bit of media.
I think this is a more acceptable tradeoff, an alternative would be to side-channel the hash in a
header, in an endpoint fetching directly from a MXC.
## Future extensions
*Note: this is free-form speculation, and serves to illustrate how future MSCs can extend the
behavior this MSC is enabling.*
A possible extension would be a server-server endpoint which requests what recommended content
endpoints would be to fetch hashes from.
(I.e. a server would ask `/media/endpoints`, and the server can respond with
`["https://common.caching.server", "https://matrix.org"]`, in decreasing order of priority)
This can be helpful when servers share a common "media server", as is the case today with
[matrix-media-repo](https://github.com/turt2live/matrix-media-repo), which "tricks" federation by
redirecting any request for media to itself. This future extension would formalize this process.
This would also be helpful with dealing with "thundering herds", as servers can be redirected to
multiple servers to fetch media from a hash from.
(However, as-is, this could have security problems with DoS-ing, issues with cache invalidation
after redacting media, and possibly more. This is only to illustrate flexibility.)
Another possible extension could be to allow to tap in natively to decentralized media stores, which
often key their data to hashes. This could make media P2P easier to implement and work with.
One last possible extension is to add `410` to every endpoint pertaining fetching media, this could
help with communicating that media has been deleted to servers and clients.
## Security considerations
A big part of this MSC's motivation is to unblock media redaction/retention efforts. However, that
does not mean this MSC should be blind to the struggle of containing unsavory media across
federation.
This MSC adds a `/clone` endpoint, by which a client, on any server, could easily "copy" media,
seemingly making containment efforts useless.
However, at a room-level, and possibly a server-level, hashes themselves could be banned. This can
be implementation-specific, or be built-into bots like mjolnir.
## Unstable prefix
This MSC uses the unstable prefix `nl.automatia.msc3468`;
- `_matrix/media/nl.automatia.msc3468/clone/{serverName}/{mediaId}`
- `_matrix/media/nl.automatia.msc3468/hash/{serverName}/{mediaId}`
- `_matrix/federation/nl.automatia.msc3468/media/hash`
- `_matrix/media/nl.automatia.msc3468/media/fetch/{hash}`
- `nl.automatia.msc3468.clone.mxc`
- `nl.automatia.msc3468.mxc.hash`