You cannot select more than 25 topics
Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
346 lines
21 KiB
Markdown
346 lines
21 KiB
Markdown
2 years ago
|
# MSC2746: Improved Signalling for 1:1 VoIP
|
||
|
|
||
|
Historically, Matrix has basic support for signalling 1:1 WebRTC calls which suffer a number of shortcomings:
|
||
|
|
||
|
* If several devices try to answer the same call, there is no way for them to determine clearly
|
||
|
that the caller has set up the call with a different device, and no way for the caller to
|
||
|
determine which candidate events map to which answer.
|
||
|
* Hangup reasons are often incorrect.
|
||
|
* There is confusion and no clear guidance on how clients should determine whether an incoming
|
||
|
invite is stale or not.
|
||
|
* There is no support for renegotiation of SDP, for changing ICE candidates / hold/resume
|
||
|
functionality, etc.
|
||
|
* There is no distinction between rejecting a call and ending it, which means that in trying
|
||
|
to reject a call, a client can inadvertantly cause a call that has been sucessfully set up
|
||
|
on a different device to be hung up.
|
||
|
|
||
|
## Proposal
|
||
|
### Change the `version` field in all VoIP events to `"1"`
|
||
|
The version property is changed to `"1"` in all existing VoIP events
|
||
|
([`m.call.answer`](https://spec.matrix.org/v1.5/client-server-api/#mcallanswer),
|
||
|
[`m.call.candidates`](https://spec.matrix.org/v1.5/client-server-api/#mcallcandidates),
|
||
|
[`m.call.hangup`](https://spec.matrix.org/v1.5/client-server-api/#mcallhangup)
|
||
|
[`m.call.invite`](https://spec.matrix.org/v1.5/client-server-api/#mcallinvite)). Note
|
||
|
that this changes the type of the `version` field from an integer to a string, as
|
||
|
described in the [Unstable Prefix](#unstable-prefix) section.
|
||
|
|
||
|
This will be used to determine whether devices support this new version of the protocol. For example,
|
||
|
clients can use this field to know whether to expect an `m.call.select_answer` event from their
|
||
|
opponent. If clients see events with `version` other than `0` or `"1"` (including, for example, the numeric
|
||
|
value `1`), they should treat these the same as if they had `version` == `"1"`.
|
||
|
|
||
|
Note that this implies any and all future versions of VoIP events should be backwards-compatible.
|
||
|
If it does become necessary to introduce a non backwards-compatible VoIP spec, the intention would
|
||
|
be for it to simply use a separate set of event types.
|
||
|
|
||
|
### Define the configurations of WebRTC streams and tracks
|
||
|
|
||
|
The [spec](https://spec.matrix.org/v1.5/client-server-api/#voice-over-ip) does not currently define
|
||
|
the WebRTC streams and tracks that should be sent. Under this proposal,
|
||
|
clients are expected to send one stream with one track of kind `audio` (creating a
|
||
|
voice call). They can optionally send a second track in the same stream of kind
|
||
|
`video` (creating a video call).
|
||
|
|
||
|
Clients implementing this specification use the first stream and will ignore any streamless tracks. Note that
|
||
|
in the Javascript WebRTC API, this means `addTrack()` must be passed two parameters: a track and a stream,
|
||
|
not just a track, and in a video call the stream must be the same for both audio and video track.
|
||
|
|
||
|
A client may send other streams and tracks but the behaviour of the other party with respect to presenting
|
||
|
such streams and tracks is undefined.
|
||
|
|
||
|
This follows the existing known implementations of v0 VoIP.
|
||
|
|
||
|
### Add `invitee` field to [`m.call.invite`](https://spec.matrix.org/v1.5/client-server-api/#mcallinvite)
|
||
|
This allows for the following use cases:
|
||
|
* Placing a call to a specific user in a room where other users are also present.
|
||
|
* Placing a call to oneself.
|
||
|
|
||
|
The field should be added for all invites where the target is a specific user, and should be set
|
||
|
to the Matrix user ID of that user. Invites without an `invitee`
|
||
|
field are defined to be intended for any member of the room other than the sender of the event.
|
||
|
|
||
|
Clients should consider an incoming call if they see a non-expired invite event where the `invitee` field is either
|
||
|
absent or equal to their user's Matrix ID, however they should evaluate whether or not to ring based on their
|
||
|
user's trust relationship with the callers and/or where the call was placed. As a starting point, it is
|
||
|
suggested that clients ignore call invites from users in public rooms. It is strongly recommended that
|
||
|
when clients do not ring for an incoming call invite, they still display the call invite in the room and
|
||
|
annotate that it was ignored.
|
||
|
|
||
|
### Add `party_id` to all VoIP events
|
||
|
Whenever a client first participates in a new call, it generates a `party_id` for itself to use for the
|
||
|
duration of the call. This needs to be long enough that the chance of a collision between multiple devices
|
||
|
both generating an answer at the same time generating the same party ID is vanishingly small: 8 uppercase +
|
||
|
lowercase alphanumeric characters is recommended. Parties in the call are identified by the tuple of
|
||
|
`(user_id, party_id)`.
|
||
|
|
||
|
The client adds a `party_id` field containing this ID to the top-level of the content of all VoIP events
|
||
|
it sends on the call, including `m.call.invite`. Clients use this to identify remote echo of their own
|
||
|
events: since a user may now call themselves, they can no longer ignore events from their own user. This
|
||
|
field also identifies different answers sent by different clients to an invite, and matches `m.call.candidates`
|
||
|
events to their respective answer/invite.
|
||
|
|
||
|
A client implementation may choose to use the device ID used in end-to-end cryptography for this purpose,
|
||
|
or it may choose, for example, to use a different one for each call to avoid leaking information on which
|
||
|
devices were used in a call (in an unencrypted room) or if a single device (ie. access token) were used to
|
||
|
send signalling for more than one call party.
|
||
|
|
||
|
A grammar for `party_id` is defined [below](#specify-exact-grammar-for-voip-ids).
|
||
|
|
||
|
### Introduce `m.call.select_answer`
|
||
|
This event is sent by the caller's client once it has decided which other
|
||
|
client to talk to, by selecting one of multiple possible incoming `m.call.answer`
|
||
|
events. Its `selected_party_id` field indicates the answer it's chosen. The `call_id`
|
||
|
and `party_id` of the caller is also included. If the callee's client sees a `select_answer` for an answer
|
||
|
with party ID other than the one it sent, it ends the call and informs the user the call
|
||
|
was answered elsewhere. It does not send any events. Media can start flowing
|
||
|
before this event is seen or even sent. Clients that implement previous
|
||
|
versions of this specification will ignore this event and behave as they did
|
||
|
before.
|
||
|
|
||
|
Example:
|
||
|
```
|
||
|
{
|
||
|
"type": "m.call.select_answer",
|
||
|
"content": {
|
||
|
"version": "1",
|
||
|
"call_id": "12345",
|
||
|
"party_id": "67890",
|
||
|
"selected_party_id": "111213",
|
||
|
}
|
||
|
}
|
||
|
```
|
||
|
|
||
|
### Introduce `m.call.reject`
|
||
|
|
||
|
* If the `m.call.invite` event has `version` `"1"`, a client wishing to reject the call
|
||
|
sends an `m.call.reject` event. This rejects the call on all devices, but if the calling
|
||
|
device sees an `answer` before the `reject`, it disregards the reject event and carries on. The reject has a
|
||
|
`party_id` just like an answer, and the caller sends a `select_answer` for it just like an
|
||
|
answer. If another client had already sent an answer and sees the caller select the
|
||
|
reject response instead of its answer, it ends the call.
|
||
|
* If the `m.call.invite` event has `version` `0`, the callee sends an `m.call.hangup` event.
|
||
|
|
||
|
Example:
|
||
|
```
|
||
|
{
|
||
|
"type": "m.call.reject",
|
||
|
"content" : {
|
||
|
"version": "1",
|
||
|
"call_id": "12345",
|
||
|
"party_id": "67890",
|
||
|
}
|
||
|
}
|
||
|
```
|
||
|
|
||
|
If the calling user chooses to end the call before setup is complete, the client sends `m.call.hangup`
|
||
|
as previously.
|
||
|
|
||
|
### Clarify what actions a client may take in response to an invite
|
||
|
The client may:
|
||
|
* Attempt to accept the call by sending an `m.call.answer`.
|
||
|
* Actively reject the call everywhere: send an `m.call.reject` as per above, which will stop the call from
|
||
|
ringing on all the user's devices and the caller's client will inform them that the user has
|
||
|
rejected their call.
|
||
|
* Ignore the call: send no events, but stop alerting the user about the call. The user's other
|
||
|
devices will continue to ring, and the caller's device will continue to indicate that the call
|
||
|
is ringing, and will time the call out in the normal way if no other device responds.
|
||
|
|
||
|
### Introduce more reason codes to [`m.call.hangup`](https://spec.matrix.org/v1.5/client-server-api/#mcallhangup)
|
||
|
* `ice_timeout`: The connection failed after some media was exchanged (as opposed to current
|
||
|
`ice_failed` which means no media connection could be established). Note that, in the case of
|
||
|
an ICE renegotiation, a client should be sure to send `ice_timeout` rather than `ice_failed` if
|
||
|
media had previously been received successfully, even if the ICE renegotiation itself failed.
|
||
|
* `user_hangup`: Clients must now send this code when the user chooses to end the call, although
|
||
|
for backwards compatability with version 0, a clients should treat an absence of the `reason`
|
||
|
field as `user_hangup`.
|
||
|
* `user_media_failed`: The client was unable to start capturing media in such a way that it is unable
|
||
|
to continue the call.
|
||
|
* `user_busy`: The user is busy. Note that this exists primarily for bridging to other networks such
|
||
|
as the PSTN. A Matrix client that receives a call whilst already in a call would not generally reject
|
||
|
the new call unless the user had specifically chosen to do so.
|
||
|
* `unknown_error`: Some other failure occurred that meant the client was unable to continue the call
|
||
|
rather than the user choosing to end it.
|
||
|
|
||
|
### Introduce `m.call.negotiate`
|
||
|
This introduces SDP negotiation semantics for media pause, hold/resume, ICE restarts and voice/video
|
||
|
call up/downgrading. Clients should implement & honour hold functionality as per WebRTC's
|
||
|
recommendation: https://www.w3.org/TR/webrtc/#hold-functionality
|
||
|
|
||
|
If both the invite event and the accepted answer event have `version` equal to `"1"`, either party may
|
||
|
send `m.call.negotiate` with a `description` field to offer new SDP to the other party. This event has
|
||
|
`call_id` with the ID of the call and `party_id` equal to the client's party ID for that call.
|
||
|
The caller ignores any negotiate events with `party_id` + `user_id` tuple not equal to that of the
|
||
|
answer it accepted and the callee ignores any negotiate events with `party_id` + `user_id` tuple not equal to that of the caller. Clients should use the `party_id` field to ignore the remote echo of their
|
||
|
own negotiate events.
|
||
|
|
||
|
This has a `lifetime` field as in `m.call.invite`, after which the sender of the negotiate event
|
||
|
should consider the negotiation failed (timed out) and the recipient should ignore it.
|
||
|
|
||
|
The `description` field is the same as the `offer` field in `m.call.invite` and `answer`
|
||
|
field in `m.call.answer` and is an `RTCSessionDescriptionInit` object as per
|
||
|
https://www.w3.org/TR/webrtc/#dom-rtcsessiondescriptioninit.
|
||
|
|
||
|
Example:
|
||
|
```
|
||
|
{
|
||
|
"type": "m.call.negotiate",
|
||
|
"content": {
|
||
|
"version": "1",
|
||
|
"call_id": "12345",
|
||
|
"party_id": "67890",
|
||
|
"lifetime": 10000,
|
||
|
"description": {
|
||
|
"sdp": "[some sdp]",
|
||
|
"type": "offer",
|
||
|
},
|
||
|
}
|
||
|
}
|
||
|
```
|
||
|
|
||
|
Once an `m.call.negotiate` event is received, the client must respond with another `m.call.negotiate`
|
||
|
event, with the SDP answer (with `"type": "answer"`) in the `description` property.
|
||
|
|
||
|
This MSC also proposes clarifying the `m.call.invite` and `m.call.answer` events to state that
|
||
|
the `offer` and `answer` fields respectively are objects of type `RTCSessionDescriptionInit`.
|
||
|
Hence the `type` field, whilst redundant in these events, is included for ease of working
|
||
|
with the WebRTC API and is mandatory. Receiving clients should not attempt to validate the `type` field,
|
||
|
but simply pass the object into the WebRTC API.
|
||
|
|
||
|
### Designate one party as 'polite'
|
||
|
In line with WebRTC perfect negotiation (https://w3c.github.io/webrtc-pc/#perfect-negotiation-example)
|
||
|
we introduce rules to establish which party is polite in the process of renegotiation. The callee is
|
||
|
always the polite party. In a glare situation, the politenes of a party is therefore determined by
|
||
|
whether the inbound or outbound call is used: if a client discards its outbound call in favour of
|
||
|
an inbound call, it becomes the polite party.
|
||
|
|
||
|
### Add explicit recommendations for call event liveness.
|
||
|
`m.call.invite` contains a `lifetime` field that indicates how long the offer is valid for. When
|
||
|
a client receives an invite, it should use the event's `age` field in the sync response plus the
|
||
|
time since it received the event from the homeserver to determine whether the invite is still valid.
|
||
|
The use of the `age` field ensures that incorrect clocks on client devices don't break calls.
|
||
|
If the invite is still valid *and will remain valid for long enough for the user to accept the call*,
|
||
|
it should signal an incoming call. The amount of time allowed for the user to accept the call may
|
||
|
vary between clients. For example, it may be longer on a locked mobile device than on an unlocked
|
||
|
desktop device.
|
||
|
|
||
|
The client should only signal an incoming call in a given room once it has completed processing the
|
||
|
entire sync response and, for encrypted rooms, attempted to decrypt all encrypted events in the
|
||
|
sync response for that room. This ensures that if the sync response contains subsequent events that
|
||
|
indicate the call has been hung up, rejected, or answered elsewhere, the client does not signal it.
|
||
|
|
||
|
If on startup, after processing locally stored events, the client determines that there is an invite
|
||
|
that is still valid, it should still signal it but only after it has completed a sync from the homeserver.
|
||
|
|
||
|
The minimal recommended lifetime is 90 seconds - this should give the user
|
||
|
enough time to actually pick up the call.
|
||
|
|
||
|
### Introduce recommendations for batching of ICE candidates
|
||
|
Clients should aim to send a small number of candidate events, with guidelines:
|
||
|
* ICE candidates which can be discovered immediately or almost immediately in the invite/answer
|
||
|
event itself (eg. host candidates). If server reflexive or relay candiates can be gathered in
|
||
|
a sufficiently short period of time, these should be sent here too. A delay of around 200ms is
|
||
|
suggested as a starting point.
|
||
|
* The client should then allow some time for further candidates to be gathered in order to batch them,
|
||
|
rather than sending each candidate as it arrives. A starting point of 2 seconds after sending the
|
||
|
invite or 500ms after sending the answer is suggested as a starting point (since a delay is natural
|
||
|
anyway after the invite whilst the client waits for the user to accept it).
|
||
|
|
||
|
### Mandate the end-of-candidates candidate
|
||
|
Define that an ICE candidate whose value is the empty string means that no more ICE candidates will
|
||
|
be sent, and mandate that clients must send such a candidate in an `m.call.candidates` message.
|
||
|
The WebRTC spec requires browsers to generate such a candidate, however note that at time of writing,
|
||
|
not all browsers do (Chrome does not, but does generate an `icegatheringstatechange` event). The
|
||
|
client should send any remaining candidates once candidate generation finishes, ignoring timeouts above.
|
||
|
This allows bridges to batch the candidates together when bridging to protocols that don't support
|
||
|
trickle ICE.
|
||
|
|
||
|
### Add DTMF
|
||
|
Add that Matrix clients can send DTMF as specified by WebRTC. The WebRTC standard as of August
|
||
|
2020 does not support receiving DTMF but a Matrix client can receive and interpret the DTMF sent
|
||
|
in the RTP payload.
|
||
|
|
||
|
### Specify exact grammar for VoIP IDs
|
||
|
`call_id`s and the newly introduced `party_id` are explicitly defined to be between 1
|
||
|
and 255 characters long, consisting of the characters `[0-9a-zA-Z._~-]`.
|
||
|
|
||
|
(Note that this matches the grammar of 'opaque IDs' from
|
||
|
[MSC1597](https://github.com/matrix-org/matrix-spec-proposals/blob/rav/proposals/id_grammar/proposals/1597-id-grammar.md#opaque-ids),
|
||
|
and that of the `id` property of the
|
||
|
[`m.login.sso` flow schema](https://spec.matrix.org/v1.5/client-server-api/#definition-mloginsso-flow-schema).)
|
||
|
|
||
|
### Specify behaviour on room leave
|
||
|
If the client sees the user it is in a call with leave the room, the client should treat this
|
||
|
as a hangup event for any calls that are in progress. No specific requirement is given for the
|
||
|
situation where a client has sent an invite and the invitee leaves the room, but the client may
|
||
|
wish to treat it as a rejection if there are no more users in the room who could answer the call
|
||
|
(eg. the user is now alone or the `invitee` field was set on the invite).
|
||
|
|
||
|
The same behaviour applies when a client is looking at historic calls.
|
||
|
|
||
|
### Clarify that supported codecs should follow the WebRTC spec
|
||
|
The Matrix spec does not mandate particular audio or video codecs, but instead defers to the
|
||
|
WebRTC spec. A compliant matrix VoIP client will behave in the same way as a supported 'browser'
|
||
|
in terms of what codecs it supports and what variants thereof. The latest WebRTC specification
|
||
|
applies, so clients should keep up to date with new versions of the WebRTC specification whether
|
||
|
or not there have been any changes to the Matrix spec.
|
||
|
|
||
|
## Potential issues
|
||
|
* The ability to call yourself makes the protocol a little more complex for clients to implement,
|
||
|
and is somewhat of a special case. However, some of the necessary additions are also required for
|
||
|
other features so this MSC elects to make it possible.
|
||
|
* Clients must make a decision on whether to ring for any given call: defining this in the spec
|
||
|
would be cumbersome and would limit clients' ability to use reputation-based systems for this
|
||
|
decision in the future. However, having a call ring on one client and not the other because one
|
||
|
had categorised it as a junk call and not the other would be confusing for the user.
|
||
|
|
||
|
## Alternatives
|
||
|
* This MSC does not allow for ICE negotiation before the user chooses to answer the call. This can
|
||
|
make call setup faster by allowing connectivity to be established whilst the call is ringing. This
|
||
|
is problematic with Matrix since any device or user could answer the call, so it is not known which
|
||
|
device is going to answer before the user chooses to answer. It would also leak information on which
|
||
|
of a user's devices were online.
|
||
|
* We could define that the ID of a call is implicitly the event ID of the invite event rather than
|
||
|
having a specific `call_id` field. This would mean that a client would be unable to know the ID of
|
||
|
a call before the it received the response from sending the invite event, which could complicate
|
||
|
implementations. There is probably no compelling reason to change this.
|
||
|
* `m.call.select_answer` was chosen such that its name reflect the intention of the event. `m.call.ack`
|
||
|
is more succinct and mirrors SIP, but this MSC opts for the more descriptive name.
|
||
|
* This MSC elects to allow invites without an `invitee` field to mean a call for anyone in the room.
|
||
|
This could be useful for hunt group style semantics where an incoming call causes many different
|
||
|
users' phones to ring and any one of them may pick up the call. This does mean clients will need
|
||
|
to not blindly ring for any call invites in any room, since this would make unsolicited calls
|
||
|
easy in public rooms. We could opt to leave this out, or make it more explicit with a specific value
|
||
|
for the `invitee` field.
|
||
|
* `party_id` is one of many potential solutions: callees could add `answer_id`s to their events and
|
||
|
callers could be identified by the lack of an `answer_id`. An explicit field on every event may be
|
||
|
easier to comprehend, less error-prone and clearer in the backwards-compatibility scenario.
|
||
|
* We could make `party_id`s more prescriptive, eg. the caller could always have a `party_id` of the
|
||
|
empty string, the word `caller` or equal to the `call_id`, which may make debugging simpler.
|
||
|
* To allow for bridging into protocols that don't support trickle ICE, this proposal requires that
|
||
|
clients send an empty candidate to signal the end of candidates. This means it will be up to bridges
|
||
|
to buffer the invite and edit the SDP to add the candidates once they arrive, adding complexity to
|
||
|
bridges. The alternative would be a discovery mechanism so clients could know whether a callee supports
|
||
|
trickle ICE before calling, and disable it if so. This would add complexity to every Matrix client as
|
||
|
well as having to assume that all current clients did not, disabling trickle ICE everywhere until clients
|
||
|
support the discovery mechanism. The mechanism would also have to be per-user which would make sense for
|
||
|
bridged users, but not where some of a users devices support trickle ICE and some do not.
|
||
|
|
||
|
## Security considerations
|
||
|
* IP addresses remain in the room in candidates, as they did in the previous version of the spec.
|
||
|
This is not ideal, but alternatives were either sending candidates over to-device messages
|
||
|
(would slow down call setup because a target device would have to be established before sending
|
||
|
candidates) or redacting them afterwards (the volume of events sent during calls can already
|
||
|
cause rate limiting issues and this would exacerbate this).
|
||
|
* Clients must take care to not ring for any call, as per the 'alternatives' section.
|
||
|
|
||
|
## Unstable prefix
|
||
|
Since VoIP events already have a 'version' field, we would ideally use a string, namespaced version during
|
||
|
development, but this field is defined to be an int in version 0. This MSC proposes changing the version
|
||
|
field to a string so that this namespacing can be used for future changes. Since there is no other easy way
|
||
|
to namespace events whilst in development and ensure interoperability, we have chosen not to use an unstable
|
||
|
prefix for this change, on the understanding that in future we will be able to use the string `version` field
|
||
|
for the unstable prefix.
|
||
|
|
||
|
For backwards compatibility, strongly typed implementations should allow for
|
||
|
`version` to either be a string or the integer `0`.
|