Merge 1896fc7cda into d6edcbd946

2 weeks ago · ab93ef5fcc
parent d6edcbd946 1896fc7cda
commit ab93ef5fcc
1 changed files with 443 additions and 0 deletions
--- a/proposals/3898-sfu.md
+++ b/proposals/3898-sfu.md
@ -0,0 +1,443 @@
+# MSC3898: Native Matrix VoIP signalling for cascaded SFUs
+
+[MSC3401](https://github.com/matrix-org/matrix-spec-proposals/pull/3401)
+specifies how full-mesh group calls work in Matrix. While that MSC works well
+for small group calls, it does not work so well for large conferences due to
+bandwidth (and other) issues.
+
+Selective Forwarding Units (SFUs) are servers which forward WebRTC streams
+between peers (which could be clients or SFUs or both). To make use of them
+effectively, peers need to be able to tell the SFU which streams they want to
+receive and at what resolutions.
+
+To solve the issue of centralization, the SFUs are also allowed to connect to
+each other ("cascade"), and therefore the peers also need a way to tell an SFU
+which other SFUs to connect to.
+
+## Proposal
+
+- **TODO: spell out how this works with active speaker detection & associated
+signalling**
+
+### Diagrams
+
+The diagrams of how this all looks can be found in
+[MSC3401](https://github.com/matrix-org/matrix-spec-proposals/pull/3401).
+
+### Additions to the `m.call.member` state event
+
+This MSC proposes adding two _optional_ fields to the `m.call.member` state event:
+`m.foci.preferred` and `m.foci.active`.
+
+Informational: This attempts to avoid the situation where a conference is ongoing
+with several users in, for example, New York. These users are all connected to the
+focus in New York. Alice joins from London: rather than connecting to the focus
+in London, she connects directly to the one in New York since that's where all the
+other participants are connected. If more users then join from London, however, they
+will all make the same decision and connect to the New York focus rather than the
+optimal configuration of the London users connected to the London focus. With active
+and preferred foci, the second user that joins from London will know that although
+Alice's active focus is New York, her preferred is London, and can therefore choose
+the London focus instead.
+
+For instance:
+
+```json
+{
+    "type": "m.call.member",
+    "state_key": "@matthew:matrix.org",
+    "content": {
+        "m.calls": [
+            {
+                "m.call_id": "cvsiu2893",
+                "m.devices": [{
+                    "device_id": "U738KDF9WJ",
+                    "m.foci.active": [
+                        { "user_id": "@sfu-lon:matrix.org", "device_id": "FS5F589EF" }
+                    ],
+                    "m.foci.preferred": [
+                        { "user_id": "@sfu-bon:matrix.org", "device_id": "3FSF589EF" },
+                        { "user_id": "@sfu-mon:matrix.org", "device_id": "GFSDH93EF" }
+                    ]
+                }]
+            }
+        ],
+        "m.expires_ts": 1654616071686
+    }
+}
+```
+
+#### `m.foci.active`
+
+This field is a list of foci the user's device is publishing to. Usually, this
+list will have a length of 1, but a client might publish to multiple foci if
+they are on different networks, for instance, or to simultaneously fan out in
+different directions from the client if there is no nearby focus. If the client
+is participating full-mesh, it should either omit this field from the state
+event or leave the list empty.
+
+#### `m.foci.preferred`
+
+This field is a list of foci the client would prefer to switch to from the
+current active focus, if any other client also starts using the given focus. If
+the client is already using one of its preferred foci, it should either omit
+this field from the state event or leave the list empty.
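+
+As an illustrative sketch (reusing the hypothetical identifiers from the
+`m.call.member` example above, not a normative format), a device that is
+already publishing to one of its preferred foci could advertise only
+`m.foci.active` in its `m.devices` entry and omit `m.foci.preferred`:
+
+```json
+{
+    "device_id": "U738KDF9WJ",
+    "m.foci.active": [
+        { "user_id": "@sfu-lon:matrix.org", "device_id": "FS5F589EF" }
+    ]
+}
+```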
+
+### Choosing a focus
+
+#### Discovering foci
+
+- **TODO: How does a client discover foci? We could use well-known or a custom endpoint**
+
+Foci are identified by a tuple of `user_id` and `device_id`.
+
+#### Determining the best focus
+
+There are many ways to determine the best focus; this MSC recommends picking
+the focus which:
+
+- Is the quickest to respond to `m.call.invite` with `m.call.answer`.
+- Is the quickest to reject a spurious HTTPS request to a high-numbered
+  port on the SFU's IP address, if the SFU exposes its IP somewhere - similar to
+  the [apenwarr/blip](https://github.com/apenwarr/blip) trick, in order to
+  measure media-path latency rather than signalling path latency.
+- Has the best latency of data-channel traffic flows.
+- Has the best latency and bandwidth determined by sending a small splurge of
+  media down the pipe to probe.
+
+#### Joining a call
+
+The following diagram explains how a client chooses a focus when joining a call.
+
+```mermaid
+flowchart TD;
+wantsToJoin[Wants to join a call];
+hasPreferred(Has preferred focus?);
+callPreferred[Calls preferred foci without media to grab a slot];
+publishPreferred[Publishes `m.foci.preferred`];
+checkMembers(Call has more than 2 members including the client itself?);
+callFullMesh[Calls other member full-mesh];
+callMembersFoci[Tries calling foci from `m.call.member` events];
+orderFoci[Orders foci from best to worst];
+findFocusPreferredByOtherMember(Goes through ordered foci to find one which is preferred by at least one other member);
+callBestPreferred[Calls the focus];
+callBestActive[Calls the best active focus in room];
+publishActive[Publishes `m.foci.active`];
+
+wantsToJoin-->hasPreferred;
+hasPreferred--->|Yes|callPreferred;
+hasPreferred--->|No|checkMembers;
+callPreferred--->publishPreferred;
+publishPreferred--->checkMembers;
+checkMembers--->|Yes|callMembersFoci;
+checkMembers--->|No|callFullMesh;
+callMembersFoci--->orderFoci;
+orderFoci--->findFocusPreferredByOtherMember;
+findFocusPreferredByOtherMember--->|Found|callBestPreferred;
+callBestPreferred--->publishActive;
+findFocusPreferredByOtherMember--->|Not found|callBestActive;
+callBestActive--->publishActive;
+```
+
+#### Mid-call changes
+
+Once in a call, the client listens for changes to `m.call.member` state events
+and, if another member starts using one of the client's preferred foci, the
+client switches to that focus.
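+
+For illustration only, suppose the client lists `@sfu-bon:matrix.org` among
+its preferred foci, as in the `m.call.member` example earlier. One reading of
+"starts using" is that the focus shows up in another member's `m.foci.active`.
+Under that assumption, seeing the following event from another member (the
+user ID `@alice:example.org` and device ID `ALICEDEVICE` are hypothetical)
+would cause the client to switch to `@sfu-bon:matrix.org`:
+
+```json
+{
+    "type": "m.call.member",
+    "state_key": "@alice:example.org",
+    "content": {
+        "m.calls": [
+            {
+                "m.call_id": "cvsiu2893",
+                "m.devices": [{
+                    "device_id": "ALICEDEVICE",
+                    "m.foci.active": [
+                        { "user_id": "@sfu-bon:matrix.org", "device_id": "3FSF589EF" }
+                    ]
+                }]
+            }
+        ],
+        "m.expires_ts": 1654616071686
+    }
+}
+```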
+
+**TODO: other cases?**
+
+### Initial offer/answer dance
+
+During the initial offer/answer dance, the client establishes a data-channel
+between itself and the SFU to use later for rapid signalling.
+
+### Simulcast
+
+#### RTP munging
+
+#### vp8 munging
+
+### RTCP re-transmission
+
+### Data-channel messaging
+
+The client uses the established data channel connection to the SFU to perform
+low-latency signalling: to rapidly (un)subscribe/(un)publish streams, exchange
+ping messages and metadata, set up cascading and perform re-negotiation.
+
+See the section about the [rationale](#the-use-of-the-data-channels-for-signaling)
+behind the use of the data channels for signaling.
+
+- **TODO: Spell out how the DC traffic interacts with application-layer
+traffic**
+
+#### SDP Stream Metadata extension
+
+The client will be receiving multiple streams from the SFU and will need to be
+able to distinguish them. This MSC therefore builds on
+[MSC3077](https://github.com/matrix-org/matrix-spec-proposals/pull/3077) and
+[MSC3291](https://github.com/matrix-org/matrix-spec-proposals/pull/3291) to
+provide the client with the necessary metadata. Some of the data-channel events
+include an `sdp_stream_metadata` field containing a description of the stream
+being sent either from the SFU to the client or from the client to the SFU.
+
+Other than mute information and stream purpose, the metadata includes video
+track resolution. The SFU may not be able to determine the resolution of the
+track itself but it does need to know for simulcast; therefore, we include this
+in the metadata.
+
+```json
+{
+    "streamId1": {
+        "purpose": "m.usermedia",
+        "audio_muted": false,
+        "video_muted": true,
+        "tracks": {
+            "trackId1": {
+                "width": 1920,
+                "height": 1080
+            },
+            "trackId2": {}
+        }
+    }
+}
+```
+
+#### Event types
+
+This MSC adds a few new `m.call.*` events and extends a few of the existing ones.
+
+##### `m.call.track_subscription`
+
+This event is sent to the focus to let it know about the tracks the client would
+like to start/stop subscribing to.
+
+Upon receiving this event, a focus should make the subscription changes based
+on the `subscribe` and `unsubscribe` arrays and respond with an
+`m.call.negotiate` event.
+
+In the case of video tracks, the client may also request a specific resolution
+for a given track in the `subscribe` array; this is the resolution the client
+wishes to receive, but the SFU may send a lower one, for example due to
+bandwidth constraints.
+
+If the user, for example, switches from "spotlight" (one large tile) to "grid"
+(multiple small tiles) view, the client should also send this event with the
+updated resolution in the `subscribe` array to let the focus know of the
+resolution change.
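+
+As a sketch of that case, assuming the `subscribe` and `unsubscribe` field
+names used in the example event later in this section and reusing its
+hypothetical stream and track IDs, a client moving a tile from spotlight to
+grid view might re-send the track with a smaller requested resolution:
+
+```json
+{
+    "type": "m.call.track_subscription",
+    "content": {
+        "subscribe": [
+            {
+                "stream_id": "streamId1",
+                "track_id": "trackId1",
+                "width": 256,
+                "height": 144
+            }
+        ],
+        "unsubscribe": []
+    }
+}
+```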
+
+Clients may request each track only once: foci should ignore multiple requests
+of the same track.
+
+- **TODO: how do we prove to the focus that we have the right to subscribe to
+track?**
+
+```json
+{
+    "type": "m.call.track_subscription",
+    "content": {
+        "subscribe": [
+            {
+                "stream_id": "streamId1",
+                "track_id": "trackId1",
+                "width": 1920,
+                "height": 1080
+            },
+            {
+                "stream_id": "streamId2",
+                "track_id": "trackId2",
+                "width": 256,
+                "height": 144
+            }
+        ],
+        "unsubscribe": [
+            {
+                "stream_id": "streamId3",
+                "track_id": "trackId3"
+            },
+            {
+                "stream_id": "streamId4",
+                "track_id": "trackId4"
+            }
+        ]
+    }
+}
+```
+
+##### `m.call.negotiate`
+
+This event works exactly like the `m.call.negotiate` event in 1:1 calls.
+
+```json
+{
+    "type": "m.call.negotiate",
+    "content": {
+        "description": {
+            "type": "offer",
+            "sdp": "..."
+        },
+        "sdp_stream_metadata": {...} // As specified in the Metadata section
+    }
+}
+```
+
+##### `m.call.sdp_stream_metadata_changed`
+
+This event works very similarly to the 1:1 call `m.call.sdp_stream_metadata_changed`.
+
+- **TODO: Spec how foci actually use this to advertise tracks**
+
+```json
+{
+    "type": "m.call.sdp_stream_metadata_changed",
+    "content": {
+        "sdp_stream_metadata": {...} // As specified in the Metadata section
+    }
+}
+```
+
+##### `m.call.ping`, `m.call.pong`
+
+A ping message must be sent by the focus to the client at an interval
+no greater than 30 seconds. On receiving a ping message, a client must respond
+immediately with a pong message. A client may therefore detect that the
+connection has failed after an amount of time of its choosing (greater than
+30 seconds) has elapsed since it last saw a ping message. A focus may deem a
+client unresponsive after not receiving a pong some amount of time after it
+has sent a ping; again, the amount of time the focus waits is up to the
+implementation. Either side should hang up once it deems the other side
+unresponsive.
+
+focus -> client:
+
+```json
+{
+    "type": "m.call.ping",
+    "content": {}
+}
+```
+
+client -> focus:
+
+```json
+{
+    "type": "m.call.pong",
+    "content": {}
+}
+```
+
+##### `m.call.connect_to_focus`
+
+If a user is using their focus in a call, that focus will need to know how to
+connect to the other foci present in order to participate in the full-mesh of
+SFU traffic (if any). The client is responsible for telling it, using the
+`m.call.connect_to_focus` event.
+
+```json
+{
+    "type": "m.call.connect_to_focus",
+    "content": {
+        // TODO: How should this look?
+    }
+}
+```
+
+### Notes
+
+#### Hiding behind foci
+
+We do not recommend that users utilise a focus to hide behind for privacy, but
+instead use a TURN server, only providing relay candidates, rather than
+consuming focus resources and unnecessarily mandating the presence of a focus.
+
+## Potential issues
+
+The SFUs participating in a conference end up in a full mesh. Rather than
+inventing our own spanning-tree system for SFUs however, we should fix it for
+Matrix as a whole (as is happening in the LB work) and use a Pinecone tree or
+similar to decide what better-than-full-mesh topology to use. In practice, full
+mesh cascade between SFUs is probably not that bad (especially if SFUs only
+request the streams over the trunk their clients care about) - and in aggregate
+will be less obnoxious than all the clients hitting a single SFU.
+
+Too many foci will chew bandwidth due to full-mesh between them. In the worst
+case, if every user is on their own HS and picks a different focus, it
+degenerates to a full-mesh call (just server-side rather than client-side).
+Hopefully this shouldn't happen, as you will converge on using a single SFU
+with the most clients, but we need to check how this works in practice.
+
+SFrame currently mandates its own ratchet, which is almost the same as megolm
+but not quite. Switching it out for megolm seems reasonable right now (at least
+until MLS comes along).
+
+## Alternatives
+
+An option would be to treat 1:1 (and full mesh) entirely differently to SFU
+based calling rather than trying to unify them. Also, it's debatable whether
+supporting full mesh is useful at all. In the end, it feels like unifying 1:1
+and SFU calling is for the best though, as it then gives you the ability to
+trivially upgrade 1:1 calls to group calls and vice versa, and avoids
+maintaining two separate hunks of spec.  It also forces 1:1 calls to take
+multi-stream calls seriously, which is useful for more exotic capture devices
+(stereo cameras; 3D cameras; surround sound; audio fields etc).
+
+### The use of the data channels for signaling
+
+The current specification assumes that signaling works over Matrix, but
+side-chains to the data channel once the peer connection is established
+in order to perform low-latency signaling.
+
+In an ideal scenario the use of the data channels would not be required and
+the usage of native Matrix signaling would be sufficient. However, because
+regular Matrix signaling may need to traverse several servers, e.g.
+`client <-> home server <-> home server <-> sfu`, our signaling would not be
+quite as fast as we need it to be. The effect is even greater given that
+certain protocols like HTTP are not as efficient for real-time communication
+as e.g. WebRTC data channels or WebSockets.
+
+The problem would be solved if the clients could connect to the SFU
+**directly** and communicate via Matrix for all signaling messages. This
+would allow us to use a faster transport (WebSockets, QUIC etc) to transmit
+signaling messages. However, this is **currently** not possible because it
+would require support for P2P Matrix, which is still under development at
+the time of writing this MSC.
+
+To read more about the problem and get more context, please refer to the
+[discussion](https://github.com/matrix-org/matrix-spec-proposals/pull/3898#discussion_r1019098025).
+
+### Cascading
+
+One option here is for SFUs to act as an AS and sniff the `m.call.member`
+traffic of their associated server, and automatically call any other `m.foci`
+which appear.  (They don't need to make outbound calls to clients, as clients
+always dial in).
+
+## Security considerations
+
+Malicious users could try to DoS SFUs by specifying them as their foci.
+
+SFrame E2EE may go horribly wrong if we can't send the new megolm session fast
+enough to all the participants when a participant leaves (and meanwhile, if we
+keep using the old session, we're technically leaking call media to the parted
+participant until we manage to rotate).
+
+Need to ensure there's no scope for media forwarding loops through SFUs.
+
+In order to authenticate that only legitimate users are allowed to subscribe to
+a given `conf_id` on an SFU, it would make sense for the SFU to act as an AS and
+sniff the `m.call` events on their associated server, and only act on to-device
+`m.call.*` events which come from a user who is confirmed to be in the room for
+that `m.call`. (In practice, if the conf is E2EE then it's of limited use to
+connect to the SFU without having the keys to decrypt the traffic, but this
+feature is desirable for non-E2EE confs and to stop bandwidth DoS.)
+
+## Unstable prefixes
+
+We probably don't care for this for the data-channel?
+
+While this MSC is not considered stable, implementations should use
+`org.matrix.msc3898` as a namespace.
+
+|Stable (post-FCP) |Unstable                           |
+|------------------|-----------------------------------|
+|`m.foci.active`   |`org.matrix.msc3898.foci.active`   |
+|`m.foci.preferred`|`org.matrix.msc3898.foci.preferred`|
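+
+For illustration, an `m.devices` entry using the unstable names from the table
+above might look as follows. The identifiers are the hypothetical ones from the
+earlier `m.call.member` example, and only the two field names listed in the
+table are shown with the prefix:
+
+```json
+{
+    "device_id": "U738KDF9WJ",
+    "org.matrix.msc3898.foci.active": [
+        { "user_id": "@sfu-lon:matrix.org", "device_id": "FS5F589EF" }
+    ],
+    "org.matrix.msc3898.foci.preferred": [
+        { "user_id": "@sfu-bon:matrix.org", "device_id": "3FSF589EF" },
+        { "user_id": "@sfu-mon:matrix.org", "device_id": "GFSDH93EF" }
+    ]
+}
+```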