# MSC3898: Native Matrix VoIP signalling for cascaded SFUs

[MSC3401](https://github.com/matrix-org/matrix-spec-proposals/pull/3401)
specifies how full-mesh group calls work in Matrix. While that MSC works well
for small group calls, it does not work so well for large conferences due to
bandwidth (and other) issues.

Selective Forwarding Units (SFUs) are servers which forward WebRTC streams
between peers (which could be clients or SFUs or both). To make use of them
effectively, peers need to be able to tell the SFU which streams they want to
receive at what resolutions.

To solve the issue of centralization, the SFUs are also allowed to connect to
each other ("cascade") and therefore the peers also need a way to tell an SFU
which other SFUs to connect to.

## Proposal

- **TODO: spell out how this works with active speaker detection & associated
  signalling**

### Diagrams

The diagrams of how this all looks can be found in
[MSC3401](https://github.com/matrix-org/matrix-spec-proposals/pull/3401).

### Additions to the `m.call.member` state event

This MSC proposes adding two _optional_ fields to the `m.call.member` state event:
`m.foci.preferred` and `m.foci.active`.

Informational: This attempts to avoid the situation where a conference is ongoing
with several users in, for example, New York. These users are all connected to the
focus in New York. Alice joins from London: rather than connecting to the focus
in London, she connects directly to the one in New York since that's where all the
other participants are connected. If more users then join from London, however, they
will all make the same decision and connect to the New York focus rather than the
optimal configuration of the London users connected to the London focus. With active
and preferred foci, the second user that joins from London will know that although
Alice's active focus is New York, her preferred is London, and can therefore choose
the London focus instead.

For instance:

```json
{
    "type": "m.call.member",
    "state_key": "@matthew:matrix.org",
    "content": {
        "m.calls": [
            {
                "m.call_id": "cvsiu2893",
                "m.devices": [{
                    "device_id": "U738KDF9WJ",
                    "m.foci.active": [
                        { "user_id": "@sfu-lon:matrix.org", "device_id": "FS5F589EF" }
                    ],
                    "m.foci.preferred": [
                        { "user_id": "@sfu-bon:matrix.org", "device_id": "3FSF589EF" },
                        { "user_id": "@sfu-mon:matrix.org", "device_id": "GFSDH93EF" }
                    ]
                }]
            }
        ],
        "m.expires_ts": 1654616071686
    }
}
```

#### `m.foci.active`

This field is a list of foci the user's device is publishing to. Usually, this
list will have a length of 1, but a client might publish to multiple foci if
they are on different networks, for instance, or to simultaneously fan out in
different directions from the client if there is no nearby focus. If the client
is participating full-mesh, it should either omit this field from the state
event or leave the list empty.

#### `m.foci.preferred`

This field is a list of foci the client would prefer to switch to from the
current active focus, if any other client also starts using the given focus. If
the client is already using one of its preferred foci, it should either omit
this field from the state event or leave the list empty.

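
As an illustration of how a client might read these two fields, the following
is a minimal sketch that collects the active and preferred foci from the
`content` of one `m.call.member` event. The names `Focus`, `foci_for_device`
and `collect_foci` are hypothetical helpers introduced here, not part of this
MSC, and the sketch treats an omitted field and an empty list the same way.

```python
from typing import Dict, List, NamedTuple, Set, Tuple


class Focus(NamedTuple):
    """Hypothetical helper type: a focus as a (user_id, device_id) tuple."""
    user_id: str
    device_id: str


def foci_for_device(device: Dict, field: str) -> List[Focus]:
    """Return the foci listed under `field`; a missing field yields an empty list."""
    return [Focus(f["user_id"], f["device_id"]) for f in device.get(field, [])]


def collect_foci(content: Dict) -> Tuple[Set[Focus], Set[Focus]]:
    """Collect (active, preferred) foci across all calls and devices of one member."""
    active: Set[Focus] = set()
    preferred: Set[Focus] = set()
    for call in content.get("m.calls", []):
        for device in call.get("m.devices", []):
            active.update(foci_for_device(device, "m.foci.active"))
            preferred.update(foci_for_device(device, "m.foci.preferred"))
    return active, preferred


# Content of the example event shown earlier in this section.
example_content = {
    "m.calls": [{
        "m.call_id": "cvsiu2893",
        "m.devices": [{
            "device_id": "U738KDF9WJ",
            "m.foci.active": [
                {"user_id": "@sfu-lon:matrix.org", "device_id": "FS5F589EF"},
            ],
            "m.foci.preferred": [
                {"user_id": "@sfu-bon:matrix.org", "device_id": "3FSF589EF"},
                {"user_id": "@sfu-mon:matrix.org", "device_id": "GFSDH93EF"},
            ],
        }],
    }],
    "m.expires_ts": 1654616071686,
}

active, preferred = collect_foci(example_content)
```

A client participating full-mesh would simply produce two empty sets here.
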
### Choosing a focus

#### Discovering foci

- **TODO: How does a client discover foci? We could use well-known or a custom endpoint**

Foci are identified by a tuple of `user_id` and `device_id`.

#### Determining the best focus

There are many ways to determine the best focus; this MSC recommends choosing
the focus which:

- Is the quickest to respond to `m.call.invite` with `m.call.answer`.
- Is the quickest to reject a spurious HTTPS request to a high-numbered
  port on the SFU's IP address, if the SFU exposes its IP somewhere - similar to
  the [apenwarr/blip](https://github.com/apenwarr/blip) trick, in order to
  measure media-path latency rather than signalling-path latency.
- Has the best latency of data-channel traffic flows.
- Has the best latency and bandwidth determined by sending a small burst of
  media down the pipe to probe.

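
Whichever probe is used, the result is a ranking of candidate foci. A minimal
sketch of that ranking step follows, assuming the client has already measured
one latency figure per focus by one of the methods above; the function name
`order_foci` and the convention that `None` marks a focus that never responded
are assumptions made for this example.

```python
from typing import Dict, Hashable, List, Optional


def order_foci(latencies_ms: Dict[Hashable, Optional[float]]) -> List[Hashable]:
    """Order foci from best (lowest latency) to worst.

    Foci whose measurement is None (assumed here to mean "did not respond")
    are dropped from the result.
    """
    reachable = {focus: ms for focus, ms in latencies_ms.items() if ms is not None}
    return sorted(reachable, key=reachable.get)


ordered = order_foci({"sfu-lon": 80.0, "sfu-bon": 12.5, "sfu-mon": None})
```
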
#### Joining a call

The following diagram explains how a client chooses a focus when joining a call.

```mermaid
flowchart TD;
    wantsToJoin[Wants to join a call];
    hasPreferred(Has preferred focus?);
    callPreferred[Calls preferred foci without media to grab a slot];
    publishPreferred[Publishes `m.foci.preferred`];
    checkMembers(Call has more than 2 members including the client itself?);
    callFullMesh[Calls other member full-mesh];
    callMembersFoci[Tries calling foci from `m.call.member` events];
    orderFoci[Orders foci from best to worst];
    findFocusPreferredByOtherMember(Goes through ordered foci to find one which is preferred by at least one other member);
    callBestPreferred[Calls the focus];
    callBestActive[Calls the best active focus in room];
    publishActive[Publishes `m.foci.active`];

    wantsToJoin-->hasPreferred;
    hasPreferred--->|Yes|callPreferred;
    hasPreferred--->|No|checkMembers;
    callPreferred--->publishPreferred;
    publishPreferred--->checkMembers;
    checkMembers--->|Yes|callMembersFoci;
    checkMembers--->|No|callFullMesh;
    callMembersFoci--->orderFoci;
    orderFoci--->findFocusPreferredByOtherMember;
    findFocusPreferredByOtherMember--->|Found|callBestPreferred;
    callBestPreferred--->publishActive;
    findFocusPreferredByOtherMember--->|Not found|callBestActive;
    callBestActive--->publishActive;
```

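
The decision part of the diagram (from the member-count check onwards) can be
sketched as a small function. This is only one reading of the flowchart: the
function name `choose_focus`, its parameters, and the choice to fall back to
full-mesh (`None`) when no focus is usable are assumptions of this sketch, as
the diagram does not say what happens in that case.

```python
from typing import Hashable, List, Optional, Set


def choose_focus(
    member_count: int,
    ordered_foci: List[Hashable],
    preferred_by_others: Set[Hashable],
    active_in_room: Set[Hashable],
) -> Optional[Hashable]:
    """Pick the focus to call when joining, or None to call full-mesh.

    member_count        -- members in the call, including the joining client
    ordered_foci        -- reachable foci from `m.call.member` events, best first
    preferred_by_others -- foci found in other members' `m.foci.preferred`
    active_in_room      -- foci found in other members' `m.foci.active`
    """
    # "Call has more than 2 members including the client itself?" -> No
    if member_count <= 2:
        return None
    # Go through the ordered foci to find one preferred by another member.
    for focus in ordered_foci:
        if focus in preferred_by_others:
            return focus
    # Not found: call the best focus that is active in the room.
    for focus in ordered_foci:
        if focus in active_in_room:
            return focus
    # Assumption: with no usable focus at all, fall back to full-mesh.
    return None
```

After a successful call to the returned focus, the client would publish it in
`m.foci.active`, as in the last step of the diagram.
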
#### Mid-call changes

Once in a call, the client listens for changes to `m.call.member` state events
and, if another member starts using one of the client's preferred foci, the
client switches to that focus.

**TODO: other cases?**

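
The single mid-call rule described above could look like the following sketch.
It assumes the client keeps its preferred foci in its own order of preference
and re-evaluates on every `m.call.member` change; the function name
`focus_to_switch_to` and its parameters are hypothetical.

```python
from typing import Hashable, List, Optional, Set


def focus_to_switch_to(
    my_preferred: List[Hashable],
    my_active: Set[Hashable],
    others_active: Set[Hashable],
) -> Optional[Hashable]:
    """Return the first preferred focus that another member now uses, else None.

    A focus the client is already publishing to is skipped, since there is
    nothing to switch to in that case.
    """
    for focus in my_preferred:
        if focus in others_active and focus not in my_active:
            return focus
    return None
```
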
### Initial offer/answer dance

During the initial offer/answer dance, the client establishes a data-channel
between itself and the SFU to use later for rapid signalling.

### Simulcast

#### RTP munging

#### VP8 munging

### RTCP re-transmission

### Data-channel messaging

The client uses the established data channel connection to the SFU to perform
low-latency signalling: to rapidly (un)subscribe/(un)publish streams, send
ping messages and metadata, cascade, and perform re-negotiation.

See the section about the [rationale](#the-use-of-the-data-channels-for-signaling)
behind the use of the data channels for signaling.

- **TODO: Spell out how the DC traffic interacts with application-layer
  traffic**

#### SDP Stream Metadata extension

The client will be receiving multiple streams from the SFU and will need to
be able to distinguish them. This MSC therefore builds on
[MSC3077](https://github.com/matrix-org/matrix-spec-proposals/pull/3077) and
[MSC3291](https://github.com/matrix-org/matrix-spec-proposals/pull/3291) to
provide the client with the necessary metadata. Some of the data-channel events
include an `sdp_stream_metadata` field including a description of the stream
being sent either from the SFU to the client or from the client to the SFU.

Other than mute information and stream purpose, the metadata includes video
track resolution. The SFU may not be able to determine the resolution of the
track itself but it does need to know for simulcast; therefore, we include this
in the metadata.

```json
{
    "streamId1": {
        "purpose": "m.usermedia",
        "audio_muted": false,
        "video_muted": true,
        "tracks": {
            "trackId1": {
                "width": 1920,
                "height": 1080
            },
            "trackId2": {}
        }
    }
}
```

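
As a sketch of how a receiver might consume this structure, the function below
lists the resolution of every track that advertises one. It assumes, based only
on the example above, that a track without `width` and `height` (such as
`trackId2`) carries no resolution, for instance because it is an audio track;
the name `track_resolutions` is hypothetical.

```python
from typing import Dict, Tuple


def track_resolutions(metadata: Dict) -> Dict[Tuple[str, str], Tuple[int, int]]:
    """Map (stream_id, track_id) to (width, height) for tracks that state both."""
    resolutions = {}
    for stream_id, stream in metadata.items():
        for track_id, track in stream.get("tracks", {}).items():
            if "width" in track and "height" in track:
                resolutions[(stream_id, track_id)] = (track["width"], track["height"])
    return resolutions


example_metadata = {
    "streamId1": {
        "purpose": "m.usermedia",
        "audio_muted": False,
        "video_muted": True,
        "tracks": {
            "trackId1": {"width": 1920, "height": 1080},
            "trackId2": {},
        },
    }
}

resolutions = track_resolutions(example_metadata)
```
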
#### Event types

This MSC adds a few new `m.call.*` events and extends a few of the existing ones.

##### `m.call.track_subscription`

This event is sent to the focus to let it know about the tracks the client would
like to start/stop subscribing to.

Upon receiving this event, a focus should make the subscription changes based on
the `subscribe` and `unsubscribe` arrays and respond with an `m.call.negotiate`
event.

In the case of video tracks, in the `subscribe` array the client may also
request a specific resolution for a given track; this is the resolution the
client wishes to receive, but the SFU may send a lower one due to bandwidth
constraints etc.

If the user, for example, switches from "spotlight" (one large tile) to "grid"
(multiple small tiles) view, the client should also send this event with the
updated resolution in the `subscribe` array to let the focus know of the
resolution change.

Clients may request each track only once: foci should ignore duplicate requests
for the same track.

- **TODO: how do we prove to the focus that we have the right to subscribe to
  a track?**

```json
{
    "type": "m.call.track_subscription",
    "content": {
        "subscribe": [
            {
                "stream_id": "streamId1",
                "track_id": "trackId1",
                "width": 1920,
                "height": 1080
            },
            {
                "stream_id": "streamId2",
                "track_id": "trackId2",
                "width": 256,
                "height": 144
            }
        ],
        "unsubscribe": [
            {
                "stream_id": "streamId3",
                "track_id": "trackId4"
            },
            {
                "stream_id": "streamId4",
                "track_id": "trackId4"
            }
        ]
    }
}
```

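
A client could assemble this event with a couple of small helpers. The sketch
below only builds and serialises the payload shown above; how it is written to
the data channel is out of scope, and the helper names `track_ref` and
`track_subscription` are hypothetical.

```python
import json
from typing import Dict, Iterable, Optional


def track_ref(stream_id: str, track_id: str,
              width: Optional[int] = None, height: Optional[int] = None) -> Dict:
    """Describe one track; the resolution is only included when both parts are given."""
    ref = {"stream_id": stream_id, "track_id": track_id}
    if width is not None and height is not None:
        ref["width"] = width
        ref["height"] = height
    return ref


def track_subscription(subscribe: Iterable[Dict] = (),
                       unsubscribe: Iterable[Dict] = ()) -> Dict:
    """Build an `m.call.track_subscription` event from track descriptions."""
    return {
        "type": "m.call.track_subscription",
        "content": {
            "subscribe": list(subscribe),
            "unsubscribe": list(unsubscribe),
        },
    }


# Example: the user switches to "grid" view, so the client asks for a
# smaller rendition of one track and drops another.
event = track_subscription(
    subscribe=[track_ref("streamId2", "trackId2", width=256, height=144)],
    unsubscribe=[track_ref("streamId3", "trackId4")],
)
payload = json.dumps(event)
```
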
##### `m.call.negotiate`

This event works exactly like the `m.call.negotiate` event in 1:1 calls.

```json
{
    "type": "m.call.negotiate",
    "content": {
        "description": {
            "type": "offer",
            "sdp": "..."
        },
        "sdp_stream_metadata": {...} // As specified in the Metadata section
    }
}
```

##### `m.call.sdp_stream_metadata_changed`

This event works very similarly to the 1:1 call `m.call.sdp_stream_metadata_changed`.

- **TODO: Spec how foci actually use this to advertise tracks**

```json
{
    "type": "m.call.sdp_stream_metadata_changed",
    "content": {
        "sdp_stream_metadata": {...} // As specified in the Metadata section
    }
}
```

##### `m.call.ping`, `m.call.pong`

A ping message must be sent by the focus to the client at an interval
no greater than 30 seconds. On receiving a ping message, a client must respond
immediately with a pong message. A client may therefore detect that the
connection has failed after an amount of time of its choosing (greater than
30 seconds) has elapsed since it last saw a ping message. A focus may deem a
client unresponsive after not receiving a pong some amount of time after it
has sent a ping; again, the amount of time the focus waits is up to the
implementation. Either side should hang up once it deems the other side
unresponsive.

focus -> client:

```json
{
    "type": "m.call.ping",
    "content": {}
}
```

client -> focus:

```json
{
    "type": "m.call.pong",
    "content": {}
}
```

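
On the client side, the liveness rule above amounts to a small watchdog. The
sketch below is one possible shape for it, with timestamps passed in explicitly
(in seconds) so that it does not depend on any particular clock or event loop;
the class name `PingWatchdog` and the 45 second timeout used in the example are
assumptions, not values from this MSC.

```python
from typing import Dict


class PingWatchdog:
    """Client-side tracker for `m.call.ping` / `m.call.pong` liveness."""

    def __init__(self, timeout_s: float, now: float) -> None:
        # The client's timeout must be greater than the 30 s maximum ping interval.
        if timeout_s <= 30:
            raise ValueError("timeout must be greater than 30 seconds")
        self.timeout_s = timeout_s
        self.last_ping = now

    def on_ping(self, now: float) -> Dict:
        """Record the ping and return the pong to send back immediately."""
        self.last_ping = now
        return {"type": "m.call.pong", "content": {}}

    def focus_unresponsive(self, now: float) -> bool:
        """True once more than `timeout_s` has elapsed since the last ping."""
        return now - self.last_ping > self.timeout_s


watchdog = PingWatchdog(timeout_s=45, now=0.0)
pong = watchdog.on_ping(now=10.0)
```

When `focus_unresponsive` returns true, the client would hang up, as the
paragraph above requires.
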
##### `m.call.connect_to_focus`

If a user is using their own focus in a call, that focus will need to know how
to connect to the other foci present in order to participate in the full-mesh
of SFU traffic (if any). The client is responsible for telling it, using the
`m.call.connect_to_focus` event.

```json
{
    "type": "m.call.connect_to_focus",
    "content": {
        // TODO: How should this look?
    }
}
```

### Notes

#### Hiding behind foci

We do not recommend that users utilise a focus to hide behind for privacy, but
instead use a TURN server, only providing relay candidates, rather than
consuming focus resources and unnecessarily mandating the presence of a focus.

## Potential issues

The SFUs participating in a conference end up in a full mesh. Rather than
inventing our own spanning-tree system for SFUs however, we should fix it for
Matrix as a whole (as is happening in the LB work) and use a Pinecone tree or
similar to decide what better-than-full-mesh topology to use. In practice, full
mesh cascade between SFUs is probably not that bad (especially if SFUs only
request the streams over the trunk their clients care about) - and on aggregate
will be less obnoxious than all the clients hitting a single SFU.

Too many foci will chew bandwidth due to full-mesh between them. In the worst
case, if every user is on their own HS and picks a different focus, it
degenerates to a full-mesh call (just server-side rather than client-side).
Hopefully this shouldn't happen as you will converge on using a single SFU with
the most clients, but we need to check how this works in practice.

SFrame mandates its own ratchet currently which is almost the same as megolm but
not quite. Switching it out for megolm seems reasonable right now (at least
until MLS comes along).

## Alternatives

An option would be to treat 1:1 (and full mesh) entirely differently to SFU
based calling rather than trying to unify them. Also, it's debatable whether
supporting full mesh is useful at all. In the end, it feels like unifying 1:1
and SFU calling is for the best though, as it then gives you the ability to
trivially upgrade 1:1 calls to group calls and vice versa, and avoids
maintaining two separate hunks of spec. It also forces 1:1 calls to take
multi-stream calls seriously, which is useful for more exotic capture devices
(stereo cameras; 3D cameras; surround sound; audio fields etc).

### The use of the data channels for signaling

The current specification assumes that signaling works over Matrix, but
side-chains to the data channel once the peer connection is established
in order to perform low-latency signaling.

In an ideal scenario the use of the data channels would not be required and
native Matrix signaling would be sufficient. However, because regular Matrix
signaling may need to traverse different servers, e.g.
`client <-> home server <-> home server <-> sfu`, our signaling would not be
quite as fast as we need it to be. The effect is even greater given that
certain protocols like HTTP are not as efficient for real-time communication
as e.g. WebRTC data channels or WebSockets.

The problem would be solved if the clients could connect to the SFU
**directly** and communicate via Matrix for all signaling messages. This
would allow us to use a faster transport (WebSockets, QUIC etc) to transmit
signaling messages. However, this is **currently** not possible because it
would require support for P2P Matrix, which is still under development at the
time of writing this MSC.

To read more about the problem and get more context, please refer to the
[discussion](https://github.com/matrix-org/matrix-spec-proposals/pull/3898#discussion_r1019098025).

### Cascading

One option here is for SFUs to act as an AS and sniff the `m.call.member`
traffic of their associated server, and automatically call any other `m.foci`
which appear. (They don't need to make outbound calls to clients, as clients
always dial in).

## Security considerations

Malicious users could try to DoS SFUs by specifying them as their foci.

SFrame E2EE may go horribly wrong if we can't send the new megolm session fast
enough to all the participants when a participant leaves (and meanwhile if we
keep using the old session, we're technically leaking call media to the parted
participant until we manage to rotate).

We need to ensure there's no scope for media forwarding loops through SFUs.

In order to authenticate that only legitimate users are allowed to subscribe to
a given `conf_id` on an SFU, it would make sense for the SFU to act as an AS and
sniff the `m.call` events on their associated server, and only act on to-device
`m.call.*` events which come from a user who is confirmed to be in the room for
that `m.call`. (In practice, if the conf is E2EE then it's of limited use to
connect to the SFU without having the keys to decrypt the traffic, but this
feature is desirable for non-E2EE confs and to stop bandwidth DoS.)

## Unstable prefixes

We probably don't care for this for the data-channel?

While this MSC is not considered stable, implementations should use
`org.matrix.msc3898` as a namespace.

|Stable (post-FCP) |Unstable                           |
|------------------|-----------------------------------|
|`m.foci.active`   |`org.matrix.msc3898.foci.active`   |
|`m.foci.preferred`|`org.matrix.msc3898.foci.preferred`|
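
An implementation might keep this mapping in one place so that switching to the
stable names later is a one-line change. The following is a minimal sketch; the
names `UNSTABLE_NAMESPACE`, `STABLE_FIELDS` and `to_unstable` are hypothetical,
and only the two fields from the table are rewritten.

```python
UNSTABLE_NAMESPACE = "org.matrix.msc3898"
STABLE_FIELDS = ("m.foci.active", "m.foci.preferred")


def to_unstable(field: str) -> str:
    """Return the unstable name for a field from the table, else the field unchanged."""
    if field not in STABLE_FIELDS:
        return field
    # Replace the leading "m" namespace with the unstable one.
    return UNSTABLE_NAMESPACE + field[len("m"):]
```
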