You cannot select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
matrix-spec-proposals/proposals/3401-group-voip.md

15 KiB

MSC3401: Native Group VoIP signalling

Note: previously this MSC included SFU signalling which has now been moved to MSC3898 to avoid making this MSC too large.

Problem

VoIP signalling in Matrix is currently conducted via timeline events in a 1:1 room. This has some limitations, especially if you try to broaden the approach to multiparty VoIP calls:

  • VoIP signalling can generate a lot of events as candidates are incrementally discovered, and for rapid call setup these need to be relayed as rapidly as possible.
    • Putting these into the room timeline means that if the client has a gappy sync, for VoIP to be reliable it will need to go back and fill in the gap before it can process any VoIP events, slowing things down badly.
    • Timeline events are (currently) subject to harsh rate limiting, as they are assumed to be a spam vector.
  • VoIP signalling leaks IP addresses. There is no reason to keep these around for posterity, and they should only be exposed to the devices which care about them.
  • Candidates are ephemeral data, and there is no reason to keep them around for posterity - they're just clogging up the DAG.

Meanwhile we have no native signalling for group calls at all, forcing you to instead embed a separate system such as Jitsi, which has its own dependencies and doesn't directly leverage any of Matrix's encryption, decentralisation, access control or data model.

Proposal

This proposal provides a signalling framework using to-device messages which can be applied to native Matrix 1:1 calls, full-mesh calls and in the future SFU calls, cascaded SFU calls MCU calls, and hybrid SFU/MCU approaches. It replaces the early flawed sketch at MSC2359.

This does not immediately replace the current 1:1 call signalling, but may in future provide a migration path to unified signalling for 1:1 and group calls.

Diagrammatically, this looks like:

1:1:

          A -------- B

Full mesh between clients

          A -------- B
           \       /
            \     /
             \   /
              \ /
               C

SFU (aka Focus):

          A __    __ B
              \  /   
               F 
               | 
               |
               C
Where F is an SFU focus

Cascaded decentralised SFU:

     A1 --.           .-- B1
     A2 ---Fa ----- Fb--- B2
           \       /
            \     /
             \   /
              \ /
               Fc
              |  |
             C1  C2

Where Fa, Fb and Fc are SFU foci, one per homeserver, each with two clients.

m.call state event

The user who wants to initiate a call sends a m.call state event into the room to inform the room participants that a call is happening in the room. This effectively becomes the placeholder event in the timeline which clients would use to display the call in their scrollback (including duration and termination reason using m.terminated). Its body has the following fields:

  • m.intent to describe the intended UX for handling the call. One of:
    • m.ring if the call is meant to cause the room participants devices to ring (e.g. 1:1 call or group call)
    • m.prompt is the call should be presented as a conference call which users in the room are prompted to connect to
    • m.room if the call should be presented as a voice/video channel in which the user is immediately immersed on selecting the room.
  • m.type to say whether the initial type of call is voice only (m.voice) or video (m.video). This signals the intent of the user when placing the call to the participants (i.e. "i want to have a voice call with you" or "i want to have a video call with you") and warns the receiver whether they may be expected to view video or not, and provide suitable initial UX for displaying that type of call... even if it later gets upgraded to a video call.
  • m.terminated if this event indicates that the call in question has finished, including the reason why. (A voice/video room will never terminate.) (do we need a duration, or can we figure that out from the previous state event?).
  • m.name as an optional human-visible label for the call (e.g. "Conference call").
  • The State key is a unique ID for that call. (We can't use the event ID, given m.type and m.terminated is mutable). If there are multiple non-terminated conf ID state events in the room, the client should display the most recently edited event.

For instance:

{
    "type": "m.call",
    "state_key": "cvsiu2893",
    "content": {
        "m.intent": "m.room",
        "m.type": "m.voice",
        "m.name": "Voice room"
    }
}

We mandate at most one call per room at any given point to avoid UX nightmares - if you want the user to participate in multiple parallel calls, you should simply create multiple rooms, each with one call.

Call participation

Users who want to participate in the call declare this by publishing a m.call.member state event using their matrix ID as the state key (thus ensuring other users cannot edit it). The event contains an array m.calls of objects describing which calls the user is participating in within that room. This array must contain one item (for now).

The fields within the item in the m.calls contents are:

  • m.call_id - the ID of the conference the user is claiming to participate in. If this doesn't match an unterminated m.call event, it should be ignored.
  • m.devices - The list of the member's active devices in the call. A member may join from one or more devices at a time, but they may not have two active sessions from the same device. Each device contains the following properties:
    • device_id - The device id to use for to-device messages when establishing a call
    • session_id - A unique identifier used for resolving duplicate sessions from a given device. When the session_id field changes from an incoming m.call.member event, any existing calls from this device in this call should be terminated. session_id should be generated once per client session on application load.
    • expires_ts - A POSIX timestamp in milliseconds describing when this device data should be considered stale. When updating their own device state, clients should choose a reasonable value for expires_ts in case they go offline unexpectedly. If the user stays connected for longer than this time, the client must actively update the state event with a new expiration timestamp. A device must be ignored if the expires_ts field indicates it has expired, or if the user's m.room.member event's membership field is not join.
    • feeds - Contains an array of feeds the member is sharing and the opponent member may reference when setting up their WebRTC connection.
      • purpose - Either m.usermedia or m.screenshare otherwise the feed should be ignored.

For instance:

{
    "type": "m.call.member",
    "state_key": "@matthew:matrix.org",
    "content": {
        "m.calls": [
            {
                "m.call_id": "cvsiu2893",
                "m.devices": [
                    {
                        "device_id": "ASDUHDGFYUW", // Used to target to-device messages
                        "session_id": "GHKJFKLJLJ", // Used to resolve duplicate calls from a device
                        "expires_ts": 1654616071686,
                        "feeds": [
                            {
                                "purpose": "m.usermedia",
                                "id": "qegwy64121wqw", // WebRTC MediaStream id
                                "tracks": [
                                    {
                                        "kind": "audio",
                                        "id": "zvhjiwqsx", // WebRTC MediaStreamTrack id
                                        "label": "Sennheiser Mic",
                                        "settings": { // WebRTC MediaTrackSettings object
                                            "channelCount": 2,
                                            "sampleRate": 48000,
                                            "m.maxbr": 32000, // Matrix-specific extension to advertise the max bitrate of this track
                                        }
                                    },
                                    {
                                        "kind": "video",
                                        "id": "zbhsbdhzs",
                                        "label": "Logitech Webcam",
                                        "settings": {
                                            "width": 1280,
                                            "height": 720,
                                            "facingMode": "user",
                                            "frameRate": 30.0,
                                            "m.maxbr": 512000,
                                        }
                                    },
                                ],
                            },
                            {
                                "purpose": "m.screenshare",
                                "id": "suigv372y8378",
                                "tracks": [
                                    {
                                        "kind": "video",
                                        "id": "xbhsbdhzs",
                                        "label": "My Screenshare",
                                        "settings": {
                                            "width": 3072,
                                            "height": 1920,
                                            "cursor": "moving",
                                            "displaySurface": "monitor",
                                            "frameRate": 30.0,
                                            "m.maxbr": 768000,
                                        }
                                    },
                                ]
                            }
                        ]
                    }
                ]
            }
        ],
    }
}

This builds on MSC3077, which describes streams in m.call.* events via a sdp_stream_metadata field.

TODO: Do we need all of this data? Why would we need it? TODO: This doesn't follow the MSC3077 format very well - can we do something about that? TODO: Add tracks field TODO: Add bitrate/format fields

Clients should do their best to ensure that calls in m.call.member state are removed when the member leaves the call. However, there will be cases where the device loses network connectivity, power, the application is forced closed, or it crashes. If the m.call.member state has stale device data the call setup will fail. Clients should re-attempt invites up to 3 times before giving up on calling a member.

Call setup

In a full mesh call, for any two participants, the one with the lexicographically lower user ID is responsible for calling the other. If two participants share the same user ID (that is, if a user has joined the call from multiple devices), then the one with the lexicographically lower device ID is responsible for calling the other.

Call setup then uses the normal m.call.* events, except they are sent over to-device messages to the relevant devices (encrypted via Olm). This means:

  • When initiating a 1:1 call, the m.call.invite is sent to the devices listed in m.call.member event's m.devices array using the device_id field.
  • m.call.* events sent via to-device messages should also include the following properties in their content:
    • conf_id - The group call id listed in m.call
    • dest_session_id - The recipient's session id. Incoming messages with a dest_session_id that doesn't match your current session id should be discarded.
    • seq - The sequence number of the to-device message. This is done since the order of to-device messages is not guaranteed. With each new to-device message this number gets incremented by 1 and it starts at 0
  • In addition to the fields above m.call.invite events sent via to-device messages should include the following properties :
    • device_id - The message sender's device id. Used by the opponent member to send response to-device signalling messages even if the m.call.member event has not been received yet.
    • sender_session_id - Like the device_id the sender_session_id is used by the opponent member to filter out messages unrelated to the sender's session even if the m.call.member event has not been received yet.
  • For 1:1 calls, we might want to let the to-device messages flow and cause the client to ring even before the m.call event propagates, to minimise latency. Therefore we'll need to include an m.intent on the m.call.invite too.

Example Diagrams

Legend

Arrow Style Description
Solid State Event
Dashed Event (sent as to-device message)

Basic Call

sequenceDiagram
    autonumber
    participant Alice
    participant Room
    participant Bob
    Alice->>Room: m.call
    Alice->>Room: m.call.member
    Bob->>Room: m.call.member
    Alice-->>Bob: m.call.invite
    Alice-->>Bob: m.call.candidates
    Alice-->>Bob: m.call.candidates
    Bob-->>Alice: m.call.answer
    Bob-->>Alice: m.call.candidates
    Alice-->>Bob: m.call.select_answer

Potential issues

To-device messages are point-to-point between servers, whereas today's m.call.* messages can transitively traverse servers via the room DAG, thus working around federation problems. In practice if you are relying on that behaviour, you're already in a bad place.

Alternatives

There are many many different ways to do this. The main other alternative considered was not to use state events to track membership, but instead gossip it via either to-device or DC messages between participants. This fell apart however due to trust: you effectively end up reinventing large parts of Matrix layered on top of to-device or DC. So you might as well publish and distribute the participation data in the DAG rather than reinvent the wheel.

An alternative to to-device messages is to use DMs. You still risk gappy sync problems though due to lots of traffic, as well as the hassle of creating DMs and requiring canonical DMs to set up the calls. It does make debugging easier though, rather than having to track encrypted ephemeral to-device msgs.

Security considerations

State events are not encrypted currently, and so this leaks that a call is happening, and who is participating in it, and from which devices.

Malicious users in a room could try to sabotage a conference by overwriting the m.call state event of the current ongoing call.

Unstable prefix

stable event type unstable event type
m.call org.matrix.msc3401.call
m.call.member org.matrix.msc3401.call.member