This MSC defines the modules with which the MatrixRTC (Matrix Real Time Communication) signalling system is built.
MatrixRTC specifies how a real time session is described in a room and how Matrix users can connect to
a session.
The MatrixRTC specification is separated into different modules:
- The MatrixRTC room state that defines the state of the real time session.\
  It is the source of truth for:
  - Who is part of a session
  - Who is connected via what technology/backend
  - Metadata per device used by other participants to decide whether the streams
    from this source are of interest / need to be subscribed.
- The MatrixRTC backend.
  - Allows for multiple backend implementations to be used.
  - It defines how to discover the available backend(s).
  - It defines how to connect the participating peers: how to connect to a server/other peers, how to update the connection,
    how to subscribe to different streams...
  - A proposal utilising LiveKit is the standard for this as of writing.
  - Another planned backend is a full mesh implementation based on [MSC3401](https://github.com/matrix-org/matrix-spec-proposals/pull/3401).
- The MatrixRTC application.
  - Each application type can have its own spec.
  - Voice and video conferencing can be done with an application of type `m.call`.
  - The application defines all the details of the RTC experience:
    - How to interpret the metadata of the member events.
    - What streams to connect to.
    - What data in which format to send over the RTC channels.
    - What MatrixRTC backends are supported.
- End-to-end encryption of media streams.
This MSC focuses on the Matrix room state, which is responsible for the high-level
signalling of an RTC session.
## Proposal
Each RTC session is made out of a collection of `m.rtc.member` room state events.
Each `m.rtc.member` event defines who (the `member`) is a participant of which session (the `session`).
The `session` carries the application type (`application`) and a `call_id`.
The first element of the state key is the `userId` and the second the `deviceId`
(see [this proposal for state keys](https://github.com/matrix-org/matrix-spec-proposals/pull/3757#issuecomment-2099010555)
for context about the first/second state key).
### The MatrixRTC room state
All data related to a MatrixRTC session
(current session, session history, join/leave events, ...)
requires only one event type: `m.rtc.member`.
We use a set of `m.rtc.member` state events (one for each participant) to represent a session.
Based on its content, an `m.rtc.member` state event can represent either a connected or a disconnected member.
#### Joining a session
Sending a well-formed `m.rtc.member` event that describes a connected state, for a state key that is not yet used or that currently holds a disconnected `m.rtc.member` event, represents a join action.
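The join rule above can be sketched as a small classifier over the previous and new content of one state key. This is only an illustration: the text here does not define how connected and disconnected content differ, so `is_connected` is a hypothetical predicate that treats missing or empty content as disconnected.

```python
from typing import Optional


def is_connected(content: Optional[dict]) -> bool:
    # ASSUMPTION: a missing or empty content object means "disconnected".
    # The actual rule belongs to the event schema, not to this sketch.
    return bool(content)


def is_join(previous: Optional[dict], new: Optional[dict]) -> bool:
    """Return True if `new` turns an unused or disconnected state key
    into a connected membership."""
    return is_connected(new) and not is_connected(previous)
```

Passing `None` as `previous` models a state key that has never been used.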
The fields are as follows:
- `member` required object - describes the participant of the RTC session:
  - `id` required string - a unique identifier for this session membership as defined above. Recommended to be a UUID. It can be reused if the user leaves and rejoins the session.
    It should be unique across all devices of the user. TODO: define grammar
  - `device_id` required string - the Matrix device ID of the device that is joining the session. This is used when sending
- `full_mesh` - a backend using a full-mesh approach based on [MSC3401](https://github.com/matrix-org/matrix-spec-proposals/pull/3401).
#### Choosing the value of `foci_preferred` for the `m.rtc.member` state event
At some point session participants have to decide/propose which Focus they will use.
Depending on the Focus type and the application, the method for choosing the contents of the `foci_preferred` field of the `m.rtc.member` event
can differ.
There are three guidelines which should be obeyed by a client when building the `foci_preferred` list:
1. It is always desirable to have as few Focus switches as possible.
   If there are other participants in the session (i.e. other `m.rtc.member` events), the client should calculate which Focus it should connect to
   based on the `m.rtc.member` events of the existing participants.
   This should happen reactively on each `m.rtc.member` state event change.
   Each MatrixRTC frontend is responsible for gracefully handling Focus switches caused by changing state. This is part of the design of MatrixRTC and a requirement for an eventually consistent distributed system.
   The calculated Focus should then be placed at the start of the `foci_preferred` list.
2. The client should look up the suggested foci from the homeserver `.well-known/matrix/client` as defined below.
   MatrixRTC is designed around the same culture that makes Matrix possible: a large amount of infrastructure in the form of homeservers is provided by the users.
   To achieve a stable and healthy ecosystem, backend RTC infrastructure should be thought of as a part of a homeserver.
   It is very similar to a TURN server: mostly traffic and little CPU load.
   To avoid a world where every user relies on one central SFU, and instead split the traffic
   over multiple SFUs, it is important that SFU distribution follows the
   homeserver federation.
   These proposals from **your own** homeserver should come next in the `foci_preferred` list of the member event.
3. Clients should not use a hard-coded Focus.
   Sourcing the preferred Foci from the client is toxic to a federated system. If the majority of users
   use the same client, all of those users will use one Focus. This destroys the passive security mechanism whereby
   no single instance is an interesting attack target because it is only a fraction of the network.
   Additionally, performance would be poor if every user on Matrix used the same Focus.
   However, there are cases where this is acceptable:
   - Transitioning to MatrixRTC. Here it might be beneficial to have a client with a fallback Focus
     so that calls also work with homeservers that do not support it.
   - For testing purposes, where a different Focus should be tested without touching the `.well-known`.
   - For custom deployments that benefit from having the Focus configuration on a per-client basis instead of per homeserver.

   Therefore, if a client does use a hard-coded Focus it should come last in the `foci_preferred` list.
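The ordering implied by the three guidelines can be sketched as follows. This is a minimal illustration, not a normative algorithm: how the Focus is calculated from existing `m.rtc.member` events is left to the caller, Focus descriptions are treated as opaque JSON objects, and the function and parameter names are hypothetical.

```python
import json
from typing import Optional


def build_foci_preferred(
    calculated_focus: Optional[dict],
    well_known_foci: list[dict],
    hard_coded_focus: Optional[dict] = None,
) -> list[dict]:
    """Order Focus candidates: the Focus calculated from existing members
    first, the homeserver's foci next, a hard-coded fallback last."""
    candidates: list[dict] = []
    if calculated_focus is not None:
        candidates.append(calculated_focus)
    candidates.extend(well_known_foci)
    if hard_coded_focus is not None:
        candidates.append(hard_coded_focus)

    seen: set[str] = set()
    result: list[dict] = []
    for focus in candidates:
        # Focus objects are opaque here, so a canonical JSON dump is used
        # as their identity. Duplicates keep their earliest position.
        identity = json.dumps(focus, sort_keys=True)
        if identity not in seen:
            seen.add(identity)
            result.append(focus)
    return result
```

Deduplicating by earliest position means a Focus that is both in use and advertised by the homeserver stays at the front, which keeps Focus switches to a minimum.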
#### Discovery of Foci using `.well-known/matrix/client`
> [!NOTE]
> Backend **infrastructure** in this context can be anything that can serve as the backend for a
> MatrixRTC session. In most cases this is an SFU, but a full mesh implementation could also
> be an infrastructure. Not all kinds of infrastructure require a way of sourcing a backend resource
> (e.g. full-mesh). In this MSC we only refer to infrastructure where it is necessary to have access to additional
> data to participate in the MatrixRTC session.
We use a `m.rtc_foci` key in the homeserver `.well-known/matrix/client` that can be used to expose
a sorted (by priority) list of Focus description objects.
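As a rough sketch, a client could extract the Focus list from an already fetched `.well-known/matrix/client` body like this. Fetching the document is left out, the Focus description objects are treated as opaque because their fields are defined per Focus type, and the function name is hypothetical. The fallback to the unstable key is an assumption about how a client might handle both names.

```python
import json

STABLE_KEY = "m.rtc_foci"
UNSTABLE_KEY = "org.matrix.msc4158.rtc_foci"  # unstable name used by this MSC


def foci_from_well_known(body: str) -> list[dict]:
    """Return the priority-sorted Focus list from a well-known document,
    or an empty list if the key is missing or malformed."""
    try:
        doc = json.loads(body)
    except json.JSONDecodeError:
        return []
    if not isinstance(doc, dict):
        return []
    foci = doc.get(STABLE_KEY, doc.get(UNSTABLE_KEY, []))
    if not isinstance(foci, list):
        return []
    # Keep the server's ordering; drop entries that are not JSON objects.
    return [focus for focus in foci if isinstance(focus, dict)]
```

Because the list is sorted by priority, the result can be used as-is for the homeserver portion of `foci_preferred`.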
Each application type might have its own specification in how the different streams
are interpreted and even what Focus type to use. This makes this proposal extremely
flexible. A Jitsi conference could be added by introducing a new `application`
and a new Focus type and would be MatrixRTC compatible. It would not be compatible
with applications that do not use the Jitsi Focus but clients would know that there
is an ongoing session of unknown type and unknown Focus and could display/represent
this in the user interface.
To make it easy for clients to support different application types, the recommended
approach is to provide a Matrix widget for each application type. This way,
client developers can use the widget as a first implementation if they want to
support this RTC application type.
Each application should get its own MSC which explains all the additional
fields and defines how the communication with the possible foci
works:
- `m.call` - voice and video conferencing described by [MSC4196](https://github.com/matrix-org/matrix-spec-proposals/pull/4196).
#### Interoperability between applications
There is a use-case in which an `m.call` app might want to participate in a session of type (application) `custom-call-with-more-features`. A native mobile Matrix client might support `m.call` and be at hand to join the feature-rich application/session.
There could be fallback mechanisms, but the most flexible approach is to treat this per application type. If it makes sense for an application type to fully conform to `m.call`, a client that can connect to an `m.call` RTC session (application) could claim that it is also compatible with `custom-call-with-more-features`. It is then the job of the `custom-call-with-more-features` session type (application) to define some kind of feature list so that it can tell whether users are joining with an `m.call` client or a dedicated `custom-call-with-more-features` client.
### End-to-end encryption of media streams
We define how the key material is shared between the participants of the call to facilitate end-to-end encryption of the media streams.
The backend (e.g. LiveKit) MSC defines how the key material is actually used.
#### Shared password
A shared password that has been distributed ahead of time to the participants may be used to encrypt the media streams sent via the RTC backend.
For example, it could be in the query parameter of a private URL attached to a calendar invitation.
#### Per-participant sender key
A participant can share its chosen key with other participants using Matrix [to-device messaging](https://spec.matrix.org/v1.11/client-server-api/#send-to-device-messaging).
The key is sent as an event of type `m.rtc.encryption_keys` as an encrypted to-device message.
The device ID that is being sent to is the `member`.`device_id` from the `m.rtc.member` events.
The event contains the following fields:
- `session` required object: The contents of the `session` from the `m.rtc.member` event.
- `member` required object: The contents of the `member` from the corresponding `m.rtc.member` event.
- `keys` required array of objects: The sender keys to be distributed to the participant:
  - `key` required string: The base64 encoded key material.
  - `index` required int: The index of the key to distinguish it from other keys. This must be an integer between 0 and 255 inclusive.
    In some implementations of MatrixRTC this may correspond to the `keyID` field of the WebRTC [SFrame](https://www.w3.org/TR/webrtc-encoded-transform/#sframe) header.
  - `invalidates_key_index` optional int: The index of the key that is invalidated by this key. If this is set, the application should invalidate the key identified
    by `invalidates_key_index` once it receives a frame with the new `index`. This is to protect against an exfiltrated key being used to forge frames.
  - `invalidates_after_ms` optional int: The number of milliseconds after which the key identified by `invalidates_key_index` is invalidated by this key, even if no frames
    are received. Again, this is to protect against an exfiltrated key being used to forge frames.
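A receiving application might validate each element of `keys` before using it. The sketch below checks the constraints listed above. The function name is hypothetical, and accepting both padded and unpadded base64 is an assumption, since the field description only says "base64 encoded".

```python
import base64
import binascii


def _is_non_negative_int(value: object) -> bool:
    # bool is a subclass of int in Python, so it is rejected explicitly.
    return isinstance(value, int) and not isinstance(value, bool) and value >= 0


def validate_key_entry(entry: dict) -> bytes:
    """Check one element of `keys` and return the decoded key material."""
    index = entry.get("index")
    if not _is_non_negative_int(index) or index > 255:
        raise ValueError("index must be an integer between 0 and 255")

    key = entry.get("key")
    if not isinstance(key, str):
        raise ValueError("key must be a base64 string")
    try:
        # ASSUMPTION: tolerate unpadded input by restoring the padding.
        padded = key + "=" * (-len(key) % 4)
        material = base64.b64decode(padded, validate=True)
    except binascii.Error as exc:
        raise ValueError("key is not valid base64") from exc

    for name in ("invalidates_key_index", "invalidates_after_ms"):
        value = entry.get(name)
        if value is not None and not _is_non_negative_int(value):
            raise ValueError(f"{name} must be a non-negative integer")
    return material
```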
Depending on the RTC application, additional fields may be added to this event.
An example to-device event:
```json5
// event type: "m.rtc.encryption_keys"
{
  "session": {
    "application": "m.call",
    "call_id": "",
    "scope": "m.room"
  },
  "member": {
    "id": "xyzABCDEF10123",
    "device_id": "DEVICEID",
    "user_id": "@user:matrix.domain"
  },
  "room_id": "!roomid:matrix.domain",
  "keys": [
    {
      "index": 10,
      "key": "base64encodedkey",
      "invalidates_key_index": 9,
      "invalidates_after_ms": 5000
    },
  ],
}
```
On receipt of the `m.rtc.encryption_keys` event the application can associate the received key with the RTC session by matching the `session` and `member` contents with the corresponding `m.rtc.member` event.
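The matching step can be sketched as a plain comparison of the `session` and `member` objects. This assumes the application already holds the contents of the current `m.rtc.member` events as dictionaries. The function name is hypothetical.

```python
from typing import Optional


def find_member_event(keys_event: dict, member_contents: list[dict]) -> Optional[dict]:
    """Return the `m.rtc.member` content whose `session` and `member`
    equal those of the received keys event, or None if there is none."""
    session = keys_event.get("session")
    member = keys_event.get("member")
    if not isinstance(session, dict) or not isinstance(member, dict):
        return None  # malformed event: nothing to associate the key with
    for content in member_contents:
        if content.get("session") == session and content.get("member") == member:
            return content
    return None
```

Keys for which no membership is found can be dropped or held back until the corresponding `m.rtc.member` event arrives. Which of the two is appropriate is an application decision.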
When the application joins the session it should send the key to all the existing participants.
To ensure forward secrecy and post compromise security, the key material should be rotated (i.e. a new key generated) when a participant joins or leaves the session.
Key rotation is done as follows:
- the sending application generates the new key material for the participant.
- the sending application sends the new key material to all the participants with a new `index` value and `invalidates_key_index` set to the current `index`.
- the receiving application stores the new key material for the specified `index`.
- the sending application continues to use the old/current key to encrypt media.
- the sending application waits for a period of time. The default should be 3 seconds.
It is possible to overwrite this on a per application basis in case an application has specific requirements on security or wants to minimize missed stream data.
Also negotiation approaches can be defined where the RTC application uses data channels to communicate if everyone has received the next key.
- the sending application starts to use the new key to encrypt media.
- the receiving application invalidates the existing key with the `invalidates_key_index` value.
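The rotation steps can be sketched from both sides as follows. This is an illustration under stated assumptions: `send_to_all` is a hypothetical callback that delivers the to-device event to every participant, the 32-byte key length and the wrap-around of `index` after 255 are choices made for the example, and the `session`/`member` fields as well as `invalidates_after_ms` handling are left out for brevity.

```python
import base64
import os
import time


class SenderKeyRotation:
    def __init__(self, send_to_all, delay_s: float = 3.0, sleep=time.sleep):
        self._send_to_all = send_to_all  # hypothetical delivery callback
        self._delay_s = delay_s          # default wait of 3 seconds
        self._sleep = sleep
        self.index = 0
        self.key = os.urandom(32)        # ASSUMPTION: 32-byte key material

    def rotate(self) -> None:
        new_index = (self.index + 1) % 256  # ASSUMPTION: wrap within 0..255
        new_key = os.urandom(32)
        self._send_to_all({
            "keys": [{
                "index": new_index,
                "key": base64.b64encode(new_key).decode("ascii"),
                "invalidates_key_index": self.index,
            }]
        })
        # Keep encrypting with the old key while the new one is delivered.
        self._sleep(self._delay_s)
        self.index, self.key = new_index, new_key


class ReceiverKeyStore:
    def __init__(self):
        self._keys = {}         # index -> key material
        self._invalidates = {}  # new index -> index it invalidates

    def on_key(self, entry: dict) -> None:
        self._keys[entry["index"]] = base64.b64decode(entry["key"])
        old_index = entry.get("invalidates_key_index")
        if old_index is not None:
            self._invalidates[entry["index"]] = old_index

    def key_for_frame(self, index: int):
        # The first frame that uses the new index invalidates the old key.
        old_index = self._invalidates.pop(index, None)
        if old_index is not None:
            self._keys.pop(old_index, None)
        return self._keys.get(index)
```

The receiver keeps the old key usable until it sees a frame with the new `index`, which matches the sender continuing to encrypt with the old key during the waiting period.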
### Discovery/negotiation of application types
Problem: If a user wants to make a call to a user or room, then which call/application options should the client present to the user?
This should also take account of non-MatrixRTC calling: legacy 1:1 VoIP, room state widget for Jitsi.
TODO: write up notes.
## Potential issues
## Alternatives
### One state event per user
[MSC3401](https://github.com/matrix-org/matrix-spec-proposals/pull/3401) proposed to have one state event per user with that state event containing an array of memberships.
This introduces two problems:
- potential inconsistency where one user device overwrites the state of another device during a concurrent update.
- when handling client disconnects the MSC3757 proposal could not be used as you would not know what the correct
state is at the time of the disconnect.
### One state event per device
This would mean not using `member`.`id` in the state key anymore. Race conditions can be solved by the client which would need to manage multiple sessions at once.
### A separate system not associated with Matrix accounts
This MSC proposes to combine the MatrixRTC backend infrastructure with the homeserver.
Other sources where the backend could be sourced from are:
- A separate system not associated with Matrix accounts.
(you would need a Matrix account + a "LiveKit provider" account for example)
- The client could bring its own backend link.
- A centralized solution.
The centralized solution would not fit Matrix. A separate system would match the distributed
nature of Matrix but would not match the user experience goals for MatrixRTC calls.
The client defining the SFU that is used is the current solution. The issue with this is that clients
are in general less distributed than homeservers: only a limited set of clients is used by a large
percentage of users.
Using the client as the source for the infrastructure would result in just a handful of very large infrastructure
hosts.
This is harder to scale, and it is harder to justify who covers the costs. (For Matrix homeservers, this
is an already solved problem: individuals, communities and institutions have their own
solutions and answers for how and why they provide the infrastructure.)
### `m.rtc.encryption_keys` room event
Earlier iterations of this MSC used an encrypted `m.rtc.encryption_keys` room event to distribute the per-participant sender keys.
Whilst reducing traffic by only needing to send one event per participant, this approach does not allow for perfect forward secrecy
as the keys are stored in the room history.
The encrypted content of the `m.rtc.encryption_keys` event was as follows:
```json
{
  "session": {
    "application": "m.call",
    "call_id": ""
  },
  "member": {
    "id": "xyzABCDEF10123",
    "device_id": "DEVICEID",
    "user_id": "@user:matrix.domain"
  },
  "keys": [
    {
      "index": 0,
      "key": "base64encodedkey"
    }
  ]
}
```
## Security considerations
### Discoverability of infrastructure
The `.well-known/matrix/client` is publicly readable, so anyone can learn
about the infrastructure, which could lead to resource "stealing".
However, each infrastructure has its own authentication mechanism, defined in the respective infrastructure MSC.
Such a mechanism can, for instance, use a service that interacts with the homeserver and decides on that basis whether to allow users
to use the infrastructure.
### Forward secrecy for end-to-end encryption of media streams
The considerations to ensure forward secrecy are described in the [End-to-end encryption of media streams](#end-to-end-encryption-of-media-streams)
section above.
### End-to-end media encryption key rotation lag
The proposed key rotation semantics mean that a participant could continue to decrypt media sent during the rotation delay
(three seconds by default) after leaving the session.
## Unstable prefix
The state events and the well_known key introduced in this MSC use the unstable prefix
`org.matrix.msc4143.` instead of `m.` as used in the text.
Use `org.matrix.msc3401.call.member` as the state event type in place of `m.rtc.member`.
For discovery via `.well-known/matrix/client` the prefix `org.matrix.msc4158.rtc_foci` is used in place of `m.rtc_foci`.
Use `io.element.call.encryption_keys` in place of the `m.rtc.encryption_keys` room event and to-device event types.
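For implementations that need to switch between stable and unstable identifiers, the names listed above could be kept in one lookup table. This is only a sketch. The helper name is hypothetical, and falling back to the `org.matrix.msc4143.` prefix for identifiers without an explicit entry is an assumption based on the general rule stated at the start of this section.

```python
# Explicit unstable names given in this section.
UNSTABLE_NAMES = {
    "m.rtc.member": "org.matrix.msc3401.call.member",
    "m.rtc_foci": "org.matrix.msc4158.rtc_foci",
    "m.rtc.encryption_keys": "io.element.call.encryption_keys",
}

GENERIC_UNSTABLE_PREFIX = "org.matrix.msc4143."


def wire_name(stable: str, use_unstable: bool = True) -> str:
    """Map a stable identifier from the text to the name used on the wire."""
    if not use_unstable:
        return stable
    if stable in UNSTABLE_NAMES:
        return UNSTABLE_NAMES[stable]
    # ASSUMPTION: other `m.` identifiers take the generic MSC prefix.
    if stable.startswith("m."):
        return GENERIC_UNSTABLE_PREFIX + stable[len("m."):]
    return stable
```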
## Dependencies
This proposal depends on
[MSC3757: Restricting who can overwrite a state event](https://github.com/matrix-org/matrix-spec-proposals/pull/3757)
to provide access control for the decentralised management of call membership state. However, an alternative such
as [MSC3779: "Owned" State Events](https://github.com/matrix-org/matrix-spec-proposals/pull/3779) could be used instead with
some adaptations.
Possible values inside the `m.rtc.member` event (like `m.call`) will use a prefix defined in the
related PR (TODO create and link `m.call` application type PR)
This proposal also depends on [MSC4140: Cancellable delayed events](https://github.com/matrix-org/matrix-spec-proposals/pull/4140)
to provide a mechanism for clients to ensure that they can update the room state even if they lose connection.