pull/2846/merge
Jan Christian Grünhage 2 months ago committed by GitHub
commit 7887668f9b
No known key found for this signature in database
GPG Key ID: B5690EEEBB952194

@ -0,0 +1,139 @@
# MSC2846: Decentralising media through CIDs
<sup>Authored by Jan Christian Grünhage and Andrew Morgan</sup>
Currently, the Media API is less decentralised than most other aspects of
Matrix. When the homeserver that uploaded a piece of content goes down, the
content is not visible from homeservers that didn't fetch it beforehand. This is
because the MXC URLs are made up of the server name of the sending server and an
opaque media ID. We can't fetch media from servers other than the origin server
because we there is no signature or hash included to verify the integrity of the
file. This proposal modifies the media ID to be a
[CID](https:/github.com/multiformats/cid) (content ID), which (among other
things) includes a hash of the file. This allows us to verify the integrity of
the file both on the server side and the client side.
## Current behaviour
### Sending
![Current flow for sending](images/2846-sending-current.jpeg)
### Receiving
![Current flow for receiving](images/2846-receiving-current.jpeg)
## Proposal
### Data structures
We propose MXC URLs change from `mxc://<server>/<opaque ID>` to
`mxc://<server>/<CID>`. We include the server here for backwards compatibility
reasons, so that old servers and clients would still work as before, and also as
a primary source for downloading the media. If that fails, the server needs a
hint on where to get the media from instead, which the client may send to the
server as a query parameter.
### Sending
![Proposed flow for sending](images/2846-sending-proposed.jpeg)
As you can see, this is very similar to what happens on existing clients and
servers. The client behaviour *can* change (in terms of verification of file
integrity), but a client that does not change its behaviour will still work as
expected. Clients can additionally verify that the MXC URL they received from
the server actually represents the file that was originally sent.
The server should not use v0 CIDs, it should always use v1 CIDs (until we change
that in the future). This is because v1 CIDs have lots of benefits over their v0
counterpart. The server may implement any number of supported hashes from the
multihash spec for decoding (bikeshedding opportunity: Do we want to recommend
hashes here that we don't recommend sending, for making the switch to other
hashes easier in the future?), but it should stick to reasonably widespread
hashes for files it is sending (bikeshedding opportunity: What do these include?
https://github.com/multiformats/multicodec/blob/master/table.csv has a list of
all hashes supported by the multihash spec. SHA2 and SHA3 should be safe bets).
### Receiving
![Proposed flow for receiving](images/2846-receiving-proposed.jpeg)
Again very similar to what happens in the current state. You can drop old
implementations into this just fine, everything will continue to work. New
implementations will be able to verify additional hashes and try more fallbacks
for fetching content.
The server trying more fallbacks requires that the client hints to their server
where the content might be located. This should be done by query parameters on
the download request. There's two options here:
- A pair of room and event ID, given via the `room` and `event` query
parameters.
- A list of explicit fallback servers, via the `via` query parameter.
Giving a room and event ID is preferrable, but for some contexts it might be
better to explicitly give fallback servers. (Bikeshedding opportunity: What
usecases would this be? Avatars? Or do we want to remove this option completely?)
The client usually trusts its own server at least somewhat, so it doesn't need
to verify the CID of the file served there, but the server needs to verify the
CID of the file returned by the remote to prevent malicious remotes from serving
invalid content for rooms that they participate in.
##### Potential remotes
1. Origin encoded in the MXC URL.
2. If the client has supplied explicit fallback servers, try those in the order
the client supplied them in.
3. If the client has supplied a room+event ID combo as a hint:
1. Try the servers that used to be in the room back then and still are.
2. Try the servers that are in the room now but weren't back then.
3. Try the servers that used to be in the room but aren't anymore. These are
tried last to make sure servers leaving a room aren't put under any
unnecessary load from that room anymore.
## Potential issues
- Multihash and CID are not wide spread outside of IPFS and Protocol Labs.
There's implementations for a few languages, but this might be an at least
somewhat limiting factor. Less difficult than E2EE, but still not trivial.
## Alternatives/Related MSCs
1. [**MSC2706**](https://github.com/matrix-org/matrix-doc/pull/2706) proposes to
use IPFS directly, but in a similarily backwards compatible way to how we're
changing MXC URLs here. MSC2706 does make authenticating media worse, because
it publishes the file to IPFS and that is easy to scrape, but that also means
that fallback nodes are automatically found. Public files in this MSC *could*
be put into IPFS in the future, maybe as an updated version of MSC2706,
without changing the MXC URL format again, as we'd already have CIDs here.
1. [**MSC2703**](https://github.com/matrix-org/matrix-doc/pull/2703) specifies a
grammar for media IDs, which could be problematic for us here. It specifies
that media IDs must be opaque, as well as a maximum of 255 characters in
length. This is in conflict to this MSC (and also MSC2706), because we do
encode information in the media ID, which servers and clients do want to
decode. It contains a hash, which should be used to verify the integrity of
the file that was fetched. The other possible conflict with that MSC is the
character limit of 255 characters, which should not affect this MSC, because
a CID is normally 60 characters, but that depends on what the CID actually
looks like. In the future, a CID based on a much longer multihash could mean
we run into issues here, but this is fairly unlikely, as that would mean hash
lengths of over a thousand bits.
1. [**MSC2834**](https://github.com/matrix-org/matrix-doc/pull/2834) proposes to
replace MXC URLs with custom hash identifier + hash strings. This is very
similar to what we're doing here, with the difference of not reusing
pre-existing methods like multihash and CIDs. Also, by removing the server
name from the MXC URL, it breaks backwards compatibility on the server side,
and for clients which attempt to parse the MXC URL.
1. **MSCNaN** proposes authentication of media endpoints using events attached
to the media files. As this MSC also does, it proposes sending the room and
event ID as query parameters when downloading. Its authentication would also
help with the potential issue of leaking file contents, as discussed in the
security considerations section.
## Security considerations
- Without authentication, this enables fetching of files you know the hash of
(assuming the hash you know is one the media repo of your server supports).
This is potentially problematic, as hashes of things are leaked in places
where access to files are not always leaked as well. For example, git commit
IDs are SHA1 hashes of the objects, so a commit ID could lead to the whole
repo (up to that commit) being leaked when the objects end up in matrix's
media repo. This is a fairly far fetched usecase, but it's still an indicator
that this might be problematic. MSCNaN would help here.
## Backwards compatibility concerns
Clients/Servers not implementing this MSC should continue to work normally. New
events sent with non-CID media IDs should not pose a problem either, because
they wouldn't be parsed as CIDs successfully. If they actually are parsed as
CIDs successfully but aren't valid, that's either a huge coincidence, or, a lot
more likely, a malicious MXC URL. In that case, it would just fail, which is not
worse than what malicious MXC URLs can already do right now.

Binary file not shown.

After

Width:  |  Height:  |  Size: 20 KiB

@ -0,0 +1,14 @@
sequenceDiagram
participant C as Client
participant S as Server
participant O as Origin
C->>S: req file
activate S
opt fetch file
S->>O: req file
activate O
O-->>S: serve file
deactivate O
end
S-->>C: serve file
deactivate S

Binary file not shown.

After

Width:  |  Height:  |  Size: 40 KiB

@ -0,0 +1,21 @@
sequenceDiagram
participant C as Client
participant S as Server
participant R as Remote
C->>S: req file with CID and hint
Note over C,S: The (optional) hint is a room+event ID
activate S
opt fetch file from remote
loop try potential remotes
S->>R: req file
activate R
R-->>S: serve file
deactivate R
end
end
S->>S: verify the CID
S-->>C: serve file
deactivate S
opt verify CID
C->>C: verify the CID
end

Binary file not shown.

After

Width:  |  Height:  |  Size: 16 KiB

@ -0,0 +1,7 @@
sequenceDiagram
participant C as Client
participant S as Server
C->>S: upload file
S-->>C: return MXC URL
C->>S: send event with MXC URL
S-->>C: return event ID

Binary file not shown.

After

Width:  |  Height:  |  Size: 25 KiB

@ -0,0 +1,17 @@
sequenceDiagram
participant C as Client
participant S as Server
C->>S: upload file
activate S
S->>S: calculate CID
S-->>C: return MXC URL with CID
deactivate S
opt verify CID
C->>C: verify the CID
end
C->>S: send event with MXC URL
activate S
S-->>C: event ID
deactivate S
Loading…
Cancel
Save