MSC4016: Streaming and resumable E2EE file transfer with random access

Problem

  • File transfers currently take twice as long as they could, as they must first be uploaded in their entirety to the sender's server before being downloaded via the receiver's server.
  • As a result, relative to a dedicated file-copying system (e.g. scp) they feel sluggish. For instance, you can't incrementally view a progressive JPEG or voice or video file as it's being uploaded for “zero latency” file transfers.
  • You can't skip within them without downloading the whole thing (if they're streamable content, such as an .opus file).
  • For instance, you can't do realtime broadcast of voice messages via Matrix, or skip within them (other than by splitting them into a series of separate file transfers).
  • You also can't resume uploads if they're interrupted.
  • Another example is sharing document snapshots for real-time collaboration. If a user uploads 100MB of glTF in Third Room to edit a scene, you want all participants to be able to receive the data and stream-decode it with minimal latency.

Closes https://github.com/matrix-org/matrix-spec/issues/432

N.B. this MSC is not needed to do streaming decryption or encryption of E2EE files (as opposed to streaming transfer). The current APIs let you stream a download of AES-CTR data and incrementally decrypt it without loading the whole thing into RAM, calculating the hash as you go, and then either surfacing or deleting the decrypted result at the end depending on whether the hash matches.

Relatedly, v2 MXC attachments can't be stream-transferred, even if combined with MSC2246 (https://github.com/matrix-org/matrix-spec-proposals/pull/2246), given you won't be able to send the hash in the event contents until you've uploaded the media.

Solution sketch

  • Upload content in a single file made up of contiguous blocks of AES-GCM content.
    • Typically constant block size (e.g. 32KB)
    • Or variable block size (to allow time-based blocksize for low-latency seeking in streamable content) - e.g. one block per opus frame. Otherwise a 32KB block ends up being 8s of typical opus latency.
      • This would then require a registration sequence to identify block boundaries when seeking randomly (potentially escaping the bitstream to avoid registration code collisions).
  • Unlike today's AES-CTR attachments, AES-GCM makes the content self-authenticating, in that it includes an authentication tag (AEAD) to hash the contents and protect against substitution attacks (i.e. where an attacker flips some bits in the encrypted payload to strategically corrupt the plaintext, and nobody notices as the content isn't hashed).
    • (The only reason Matrix currently uses AES-CTR is that native AES-GCM primitives weren't widespread enough on Android back in 2016)
  • To protect against reordering attacks, each AES-GCM block has to include an encrypted block header which includes a sequence number, so we can be sure that when we request block N, we're actually getting block N back - or equivalent.
    • XXX: is there still a vulnerability here? Other approaches use Merkle trees to hash the AEADs rather than simple sequence numbers, but why?
  • We use streaming HTTP upload (https://developer.chrome.com/articles/fetch-streaming-requests/) and/or tus resumable upload headers to incrementally send the file. This also gives us resumable uploads.
  • We then use normal HTTP Range headers to seek while downloading.
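
The latency figure quoted for fixed-size blocks follows from simple arithmetic: a receiver can't authenticate and decrypt a block until all of it has arrived, so the block size divided by the stream bitrate is a lower bound on buffering delay. A minimal sketch, assuming 1KB means 1000 bytes and that "typical opus" means a 32kbps constant-bitrate stream (the function name is illustrative only):

```python
def block_latency_seconds(block_bytes: int, bitrate_bps: int) -> float:
    """Time needed to fill one block at a constant bitrate."""
    return block_bytes * 8 / bitrate_bps


OPUS_BITRATE_BPS = 32_000  # assumption: 32kbps voice stream

print(block_latency_seconds(32_000, OPUS_BITRATE_BPS))  # one 32KB block
print(block_latency_seconds(1_000, OPUS_BITRATE_BPS))   # one 1KB block
```

This is why time-based or smaller blocks are attractive for audio, at the cost of more per-block overhead.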

Advantages

  • Backwards compatible with current implementations at the HTTP layer
  • Fully backwards compatible for unencrypted transfers
  • Relatively minor changes needed from AES-CTR to sequence-of-AES-GCM-blocks for implementations like https://github.com/matrix-org/matrix-encrypt-attachment
  • We automatically maintain a serverside E2EE store of the file as normal, while also getting 1:many streaming semantics
  • Provides streaming transfer for any file type - not just media formats
  • Minimises memory usage in Matrix clients for large file transfers. Currently all(?) client implementations store the whole file in RAM in order to check hashes and then decrypt, whereas this would naturally lend itself to processing files incrementally in blocks.
  • Leverages AES-GCM's existing primitives and hashing rather than inventing our own hashing strategy
  • We've already implemented this once before (pre-E2EE and pre-Matrix) in our 'glow' codebase, and it worked excellently.
  • Random access could enable torrent-like semantics in future (i.e. servers doing parallel downloads of different chunks from different servers, with appropriate coordination)
  • tus looks to be under consideration by the IETF HTTP working group, so we're hopefully picking the right protocol for resumable uploads.

Limitations

  • Enterprisey features like content scanning and CDGs require visibility on the whole file, so would eliminate the advantages of streaming by having to buffer it up in order to scan it. (Clientside scanners would benefit from file transfer latency halving but wouldn't be able to show mid-transfer files)
  • When applied to unencrypted files, server-side content scanning (for trust & safety etc) would be unable to scan until it's too late.
  • For images & video, senders will still have to read (and decompress) enough of the file into RAM in order to thumbnail it or calculate a blurhash, so the benefits of streaming in terms of RAM use on the sender are reduced. One could restrict thumbnailing to the first 500MB of the transfer (or however much available RAM the client has) though, and still stream the file itself, which would hopefully be enough to thumbnail the first frame of a video, or most images, while still being able to transfer arbitrary length files.
  • Cancelled file uploads will still leak a partial file transfer to receivers who start to stream, which could be awkward if the sender sent something sensitive, and then can't tell who downloaded what before they hit the cancel button
  • Small bandwidth overhead for the additional AEADs and block headers - ~32 bytes per block.
  • Out of the box it wouldn't be able to adapt streaming to network conditions (no HLS or DASH style support for multiple bitstreams)
  • Might not play nice with CDNs? (I haven't checked if they pass through Range headers properly)
  • Recorded E2EE SFU streams (from a MSC3898 SFU or LiveKit SFU) could be made available as live-streamed file transfers through this MSC. However, these streams would also have their own S-Frame headers, whose keys would need to be added to the EncryptedFile block in addition to the AES-GCM layer.
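
The ~32 bytes per block overhead quoted above is consistent with the 16-byte block header from the detailed proposal plus a 16-byte GCM authentication tag. The tag length is an assumption here, as the proposal doesn't state it. A rough sketch of how that overhead scales with block size (1KB taken as 1000 bytes):

```python
HEADER_BYTES = 16   # four 32-bit header fields
GCM_TAG_BYTES = 16  # assumption: full-length 128-bit GCM tag


def overhead_fraction(plaintext_block_bytes: int) -> float:
    """Share of the bytes on the wire that is framing rather than payload."""
    per_block = HEADER_BYTES + GCM_TAG_BYTES
    return per_block / (plaintext_block_bytes + per_block)


for size in (32_000, 1_000):
    print(size, f"{overhead_fraction(size):.2%}")
```

The overhead is negligible at the 32KB default but becomes a few percent with the small blocks suggested for low-latency audio.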

Detailed proposal

The file is uploaded asynchronously using MSC2246.

The proposed v3 EncryptedFile block looks like:

"file": {
    "v": "org.matrix.msc4016.v3",
    "key": {
        "alg": "A256GCM",
        "ext": true,
        "k": "cngOuL8OH0W7lxseExjxUyBOavJlomA7N0n1a3RxSUA",
        "key_ops": [
            "encrypt",
            "decrypt"
        ],
        "kty": "oct"
    },
    "iv": "HVTXIOuVEax4E+TB", // 96-bit base-64 encoded initialisation vector
    "url": "mxc://example.com/raAZzpGSeMjpAYfVdTrQILBI",
},

N.B. there is no longer a hashes key, as AES-GCM includes its own hashing to enforce the integrity of the file transfer. Therefore we can authenticate the transfer by the fact we can decrypt it using its key & IV (unless an attacker who controls the same key & IV has substituted it for another file - but the benefit to them of doing so is questionable).
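
As an illustration of consuming this block, the sketch below decodes the key and IV from the example above and checks their sizes. It assumes the key is unpadded base64url (as in a JWK) and the IV is standard base64; the helper names are hypothetical, and the example is restated as strict JSON without the comment and trailing comma.

```python
import base64
import json

ENCRYPTED_FILE = json.loads("""
{
    "v": "org.matrix.msc4016.v3",
    "key": {
        "alg": "A256GCM",
        "ext": true,
        "k": "cngOuL8OH0W7lxseExjxUyBOavJlomA7N0n1a3RxSUA",
        "key_ops": ["encrypt", "decrypt"],
        "kty": "oct"
    },
    "iv": "HVTXIOuVEax4E+TB",
    "url": "mxc://example.com/raAZzpGSeMjpAYfVdTrQILBI"
}
""")


def b64decode_unpadded(value: str) -> bytes:
    """Decode base64 or base64url text whose '=' padding may be missing."""
    padded = value + "=" * (-len(value) % 4)
    return base64.b64decode(padded.replace("-", "+").replace("_", "/"))


def load_key_and_iv(file_block: dict) -> tuple[bytes, bytes]:
    """Return (key, iv), rejecting blocks that are not v3 A256GCM."""
    if file_block["v"] != "org.matrix.msc4016.v3":
        raise ValueError("not an MSC4016 v3 EncryptedFile")
    jwk = file_block["key"]
    if jwk["kty"] != "oct" or jwk["alg"] != "A256GCM":
        raise ValueError("unexpected key type or algorithm")
    key = b64decode_unpadded(jwk["k"])
    iv = b64decode_unpadded(file_block["iv"])
    if len(key) != 32 or len(iv) != 12:
        raise ValueError("expected a 256-bit key and a 96-bit IV")
    return key, iv


key, iv = load_key_and_iv(ENCRYPTED_FILE)
print(len(key) * 8, "bit key,", len(iv) * 8, "bit IV")
```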

We split the file stream into blocks of AES-256-GCM, with the following simple framing:

  • File header with a magic number of: 0x4D, 0x58, 0x43, 0x03 ("MXC" 0x03) - just so the file utility can recognise it.
  • 1..N blocks, each with a header of:
    • a 32-bit field: 0xFFFFFFFF (a registration code to let a parser handle random access within the file)
    • a 32-bit field: block sequence number (starting at zero, used to calculate the IV of the block, and to aid random access)
    • a 32-bit field: the length in bytes of the encrypted data in this block.
    • a 32-bit field: a CRC32 checksum of the block, including headers. This is used when randomly seeking as a consistency check to confirm that the registration code really did indicate the beginning of a valid frame of data. It is not used for cryptographic integrity.
    • the actual AES-GCM bitstream for that block.
      • the plaintext block size can be variable; 32KB is a good default for most purposes.
      • Audio streams may want to use a smaller block size (e.g. 1KB blocks for a CBR 32kbps Opus stream will give 250ms of streaming latency). Audio streams should be CBR to avoid leaking audio waveform metadata via block size.
      • The block is encrypted using an IV formed by concatenating the block sequence number with the IV from the EncryptedFile block (forming a 128-bit IV, which will be hashed down to 96 bits again within AES-GCM). This avoids IV reuse (at least until the sequence number wraps after 2^32-1 blocks, which at 32KB per block is 137TB (18 hours of 8k raw video), or at 1KB per block is 4TB (34 years of 32kbps audio)).
        • Implementations MUST terminate a stream if the seqnum is exhausted, to prevent IV reuse.
        • Receivers MUST terminate a stream if the seqnum does not sequentially increase (to prevent the server from shuffling the blocks)
        • XXX: Alternatively, we could use a 64-bit seqnum, but spending 8 bytes of header on seqnums feels like a waste of bandwidth just to support massive transfers. We'd also have to manually hash it with the 96-bit IV rather than use the GCM implementation.
      • The block is encrypted including the 32-bit block sequence number as Additional Authenticated Data, thus stopping encrypted blocks from impersonating each other.
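
The framing above can be sketched as follows. This is only an illustration under assumptions the proposal leaves open: header fields are taken to be big-endian, and the CRC32 is taken to cover the three leading header fields plus the encrypted data (i.e. everything except the CRC field itself). Placeholder bytes stand in for real AES-GCM output, and all function names are hypothetical.

```python
import struct
import zlib

MAGIC = bytes([0x4D, 0x58, 0x43, 0x03])  # "MXC" 0x03 file header
REGISTRATION_CODE = 0xFFFFFFFF
HEADER = struct.Struct(">IIII")  # assumption: big-endian 32-bit fields


def frame_checksum(seqnum: int, ciphertext: bytes) -> int:
    """CRC32 over the header fields before the CRC, then the ciphertext."""
    prefix = struct.pack(">III", REGISTRATION_CODE, seqnum, len(ciphertext))
    return zlib.crc32(ciphertext, zlib.crc32(prefix))


def pack_block(seqnum: int, ciphertext: bytes) -> bytes:
    """Frame one already-encrypted block."""
    crc = frame_checksum(seqnum, ciphertext)
    return HEADER.pack(REGISTRATION_CODE, seqnum, len(ciphertext), crc) + ciphertext


def unpack_block(buf: bytes, offset: int = 0) -> tuple[int, bytes, int]:
    """Parse one frame at offset; return (seqnum, ciphertext, next_offset)."""
    code, seqnum, length, crc = HEADER.unpack_from(buf, offset)
    if code != REGISTRATION_CODE:
        raise ValueError("registration code not found")
    start = offset + HEADER.size
    ciphertext = buf[start:start + length]
    if len(ciphertext) != length:
        raise ValueError("truncated block")
    if frame_checksum(seqnum, ciphertext) != crc:
        raise ValueError("CRC32 mismatch: not a real block boundary")
    return seqnum, ciphertext, start + length


def find_block(buf: bytes, start: int = 0) -> int:
    """Scan forward for an offset where both the code and the CRC32 check out."""
    marker = struct.pack(">I", REGISTRATION_CODE)
    pos = buf.find(marker, start)
    while pos != -1:
        try:
            unpack_block(buf, pos)
            return pos
        except (ValueError, struct.error):
            pos = buf.find(marker, pos + 1)
    return -1


stream = MAGIC + pack_block(0, b"ciphertext-0") + pack_block(1, b"ciphertext-1")
seq0, data0, nxt = unpack_block(stream, len(MAGIC))
seq1, data1, end = unpack_block(stream, nxt)
print(seq0, data0, seq1, data1)
print(find_block(stream, 5) == nxt)  # resync from the middle of block 0
```

The CRC32 only confirms that a candidate registration code really starts a frame; integrity still comes from the GCM tag when the block is decrypted.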

Or graphically, each frame is:

protocol "Registration Code (0xFFFFFFFF):32,Block sequence number:32,Encrypted block length:32,CRC32:32,AES-GCM encrypted Data:64"

 0                   1                   2                   3  
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                Registration Code (0xFFFFFFFF)                 |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                     Block sequence number                     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                     Encrypted block length                    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                             CRC32                             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
+                     AES-GCM encrypted Data                    +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
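
The per-block IV and Additional Authenticated Data described in the framing list can be sketched as below, together with the arithmetic behind the wrap-around figures. The ordering (sequence number first, then the 96-bit IV) and big-endian encoding are assumptions, as the proposal only says the two are concatenated; the function names are illustrative.

```python
import base64
import struct

MAX_SEQNUM = 2**32 - 1


def block_iv(file_iv: bytes, seqnum: int) -> bytes:
    """128-bit per-block IV: 32-bit seqnum followed by the 96-bit file IV."""
    if len(file_iv) != 12:
        raise ValueError("file IV must be 96 bits")
    if not 0 <= seqnum <= MAX_SEQNUM:
        raise OverflowError("seqnum exhausted: terminate the stream")
    return struct.pack(">I", seqnum) + file_iv


def block_aad(seqnum: int) -> bytes:
    """Additional Authenticated Data: the 32-bit block sequence number."""
    return struct.pack(">I", seqnum)


file_iv = base64.b64decode("HVTXIOuVEax4E+TB")  # IV from the example event
print(block_iv(file_iv, 0).hex())
print(block_iv(file_iv, 1).hex())

# Capacity before the seqnum wraps, taking 1KB as 1000 bytes.
blocks = 2**32
print(blocks * 32_000 / 1e12)                         # TB at 32KB per block
print(blocks * 1_000 / 1e12)                          # TB at 1KB per block
print(blocks * 1_000 * 8 / 32_000 / (365 * 86_400))   # years of 32kbps audio
```

Refusing to produce an IV once the sequence number is exhausted is one way to honour the requirement to terminate the stream rather than reuse an IV.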

The actual file upload can then be streamed in the request body in the PUT (requires HTTP/2 in browsers). Similarly, the download can be streamed in the response body. The download should stream as rapidly as possible from the media server, letting the receiver view it incrementally as the upload happens, providing "zero-latency" - while also storing the stream to disk.
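
On the receiving side, incremental viewing means handing each block on as soon as it has fully arrived instead of waiting for the end of the response body. A minimal sketch of such a reader over any file-like stream, enforcing the rule that sequence numbers must increase one by one. It assumes big-endian header fields and a download starting at block zero, and it ignores the CRC32, which the framing only needs for random access; names are hypothetical.

```python
import io
import struct
from typing import BinaryIO, Iterator

MAGIC = b"MXC\x03"
REGISTRATION_CODE = 0xFFFFFFFF
HEADER = struct.Struct(">IIII")  # assumption: big-endian fields


def read_exact(stream: BinaryIO, n: int) -> bytes:
    """Read exactly n bytes, or fewer only if the stream has ended."""
    chunks = []
    while n > 0:
        chunk = stream.read(n)
        if not chunk:
            break
        chunks.append(chunk)
        n -= len(chunk)
    return b"".join(chunks)


def iter_blocks(stream: BinaryIO) -> Iterator[tuple[int, bytes]]:
    """Yield (seqnum, ciphertext) as each block arrives, in strict order."""
    if read_exact(stream, len(MAGIC)) != MAGIC:
        raise ValueError("missing MXC file header")
    expected = 0
    while True:
        header = read_exact(stream, HEADER.size)
        if not header:
            return  # clean end of stream
        if len(header) < HEADER.size:
            raise ValueError("truncated block header")
        code, seqnum, length, _crc = HEADER.unpack(header)
        if code != REGISTRATION_CODE:
            raise ValueError("bad registration code")
        if seqnum != expected:
            raise ValueError("seqnum out of order: terminating stream")
        ciphertext = read_exact(stream, length)
        if len(ciphertext) < length:
            raise ValueError("truncated block")
        yield seqnum, ciphertext  # decrypt and surface this block here
        expected += 1


def frame(seqnum: int, ciphertext: bytes) -> bytes:
    # CRC32 left as zero: this sequential reader does not use it.
    return HEADER.pack(REGISTRATION_CODE, seqnum, len(ciphertext), 0) + ciphertext


download = io.BytesIO(MAGIC + frame(0, b"aa") + frame(1, b"bbb"))
for seqnum, ciphertext in iter_blocks(download):
    print(seqnum, len(ciphertext))
```

Only one block is held in memory at a time, which is the property that lets clients avoid loading whole files into RAM.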

For resumable uploads (or to upload in blocks for HTTP clients which don't support streaming request bodies), we use tus 1.0.0.
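
In tus 1.0.0, a client resumes by asking the server how much it already holds (a HEAD request, answered with an Upload-Offset header) and then sending the remainder in one or more PATCH requests. The sketch below only builds the headers and body for the next PATCH; how tus maps onto the Matrix media endpoints is not specified in this proposal, so no URL is shown and the function name is hypothetical.

```python
TUS_VERSION = "1.0.0"


def resume_patch(data: bytes, server_offset: int,
                 chunk_size: int) -> tuple[dict[str, str], bytes]:
    """Build headers and body for the next tus PATCH request.

    server_offset is the Upload-Offset value returned by the server,
    i.e. how many bytes of the upload it has already stored.
    """
    if not 0 <= server_offset <= len(data):
        raise ValueError("server offset is outside the data we hold")
    body = data[server_offset:server_offset + chunk_size]
    headers = {
        "Tus-Resumable": TUS_VERSION,
        "Upload-Offset": str(server_offset),
        "Content-Type": "application/offset+octet-stream",
        "Content-Length": str(len(body)),
    }
    return headers, body


framed_file = bytes(range(100))  # stand-in for the framed, encrypted upload
headers, body = resume_patch(framed_file, server_offset=64, chunk_size=32)
print(headers["Upload-Offset"], len(body))
```

The same loop serves clients that can't stream request bodies: they simply send the file as a series of fixed-size PATCH requests.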

For resumable downloads, we then use normal HTTP Range headers to seek and resume while downloading.
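
With a constant block size, the byte range of any block can be computed directly, so a seek is a single ranged request. A sketch under two assumptions: every block before the target is full-size, and the encrypted length of a block is its plaintext length plus a 16-byte GCM tag. Function names are illustrative.

```python
MAGIC_BYTES = 4      # "MXC" 0x03 file header
HEADER_BYTES = 16    # four 32-bit block header fields
GCM_TAG_BYTES = 16   # assumption: full-length GCM tag


def range_for_blocks(first: int, count: int, plaintext_block_bytes: int) -> str:
    """Range header value covering `count` whole blocks starting at `first`."""
    frame = HEADER_BYTES + plaintext_block_bytes + GCM_TAG_BYTES
    start = MAGIC_BYTES + first * frame
    end = start + count * frame - 1  # Range end positions are inclusive
    return f"bytes={start}-{end}"


def range_to_resume(bytes_received: int) -> str:
    """Open-ended Range header value to continue an interrupted download."""
    return f"bytes={bytes_received}-"


print(range_for_blocks(0, 1, 32_000))
print(range_for_blocks(10, 2, 32_000))
print(range_to_resume(320_324))
```

With variable block sizes the offsets can't be computed this way, which is where the registration code and CRC32 come in: the client requests an approximate range and scans forward to the next valid block boundary.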

TODO: We need a way to mark a transfer as complete or cancelled (via a relation?). If cancelled, the sender should delete the partial upload (but the partial contents will have already leaked to the other side, of course).

TODO: While we're at it, let's actually let users DELETE their file transfers, at last.

N.B. Clients which implement displaying blurhashes should progressively load the thumbnail over the top of the blurhash, to make sure the detailed thumbnail streams in and is viewed as rapidly as possible.

Alternatives

  • We could use an existing streaming encrypted framing format of some kind instead (SRTP perhaps, which would give us timestamps for easier random access for audio/video streams) - but this feels a bit strange for plain old file streams.
  • Alternatively, we could descope random access entirely, given it only makes sense for AV streams, and requires timestamps to work nicely - and simply being able to stream encryption/decryption is a win in its own right. For instance, glow doesn't let you seek randomly within files which are mid transfer; only tail.
  • Split files into a series of separate m.file uploads which the client then has to glue back together (as the voice broadcast feature does in Element today).
    • Pros:
      • Works automatically with antivirus & CDGs
      • Could be made to map onto HLS or DASH? (by generating an .m3u8 which contains a bunch of MXC urls? This could also potentially solve the glitching problems we've had, by reusing existing HLS players augmented with our E2EE support)
    • Cons:
      • Is always going to be high latency (e.g. Element currently splits into ~30s chunks) given rate limits on sending file events
      • Can be a pain to glue media uploads back together without glitching
  • Transfer files via streaming P2P file transfer via WebRTC data channels (https://github.com/matrix-org/matrix-spec/issues/189)
    • Pros:
      • Easy to implement with Matrix's existing WebRTC signalling
      • Could use MSC3898-inspired media control to seek in the stream
    • Cons:
      • You don't get a serverside copy of the data
      • Hard for clients to implement relative to a simple HTTP download
      • You expose client IPs to each other if going P2P rather than via TURN
  • Do streaming voice/video messages/broadcast via WebRTC media channels instead
    • Pros:
      • Lowest latency
      • Could use media control to seek
      • Supports multiple senders
      • Works with CDGs and other enterprisey scanners which know how to scan VOIP payloads
      • Could automatically support variable streams via SFU to adapt to network conditions
      • If the SFU does E2EE and archiving, you get that for free.
    • Cons:
      • Complex; you can't just download the file via HTTP
      • Requires client to have a WebRTC stack
      • A suitable SFU still doesn't exist yet
  • Transfer files out of band using a protocol which already provides streaming transfers (e.g. IPFS?)
  • Could use ChaCha20-Poly1305 rather than AES-GCM, but no native webcrypto impl yet: https://github.com/w3c/webcrypto/issues/223
  • We could use YouTube's resumable upload API via Content-Range headers from https://developers.google.com/youtube/v3/guides/using_resumable_upload_protocol, but having implemented both it and tus, tus feels much simpler and less fiddly. YouTube's approach is likely to be well supported by proxies etc, but if tus is adopted by the IETF HTTP WG, then it should be well supported too.

Security considerations

  • Variable size blocks could leak metadata for VBR audio. Mitigation is to use CBR if you care about leaking voice traffic patterns (constant size blocks aren't necessarily enough, as you'd still leak the traffic patterns)
  • Is encrypting a sequence number in the block header (with authenticated encryption) sufficient to mitigate reordering attacks?
    • When doing random access, the reader has to trust the server to serve the right blocks after a discontinuity
  • The resulting lack of atomicity on file transfer means that accidentally uploaded files may leak partial contents to other users, even if they're cancelled.
  • Clients may well wish to scan untrusted inbound file transfers for malware etc, which means buffering the inbound transfer and scanning it before presenting it to the user.
  • Removing the hashes entry on the EncryptedFile description means that an attacker who controls the key & IV of the original file transfer could strategically substitute the file contents. This could be desirable for CDGs wishing to switch a file for a sanitised version without breaking the Matrix event hashes. For other scenarios it could be undesirable. An alternative might be for the sender to keep sending new hashes in related matrix events as the stream uploads, but it's unclear if this is worth it.

Conclusion

For the voice broadcast use case, it's a bit unclear whether this is actually an improvement over splitting files into multiple file uploads (or MSC3888). It's also unfortunate that the benefits of the MSC are reduced with content scanners and CDGs. It's also a bit unclear whether voice/video broadcast would be better served via MSC3888 style behaviour.

However, for halving the transfer time for large videos and files (and the magic "zero latency" of being able to see file transfers instantly start to download as they upload) it still feels like a worthwhile MSC. Switching to GCM is desirable too in terms of providing authenticated encryption and avoiding having to calculate out-of-band hashes for file transfer. Finally, implementing this MSC will force implementations to stream their file encryption/decryption and avoid the temptation to load the whole file into RAM (which doesn't scale, especially in constrained environments such as iOS Share Extensions).

Dependencies

This MSC depends on MSC2246, which has now landed in the spec. Extends MSC3469.

Unstable prefixes

Unstable prefix          Stable prefix
org.matrix.msc4016.v3    v3