From ea715b7a3f5e7aaae3699f5e651f0e938521eec4 Mon Sep 17 00:00:00 2001 From: David Baker Date: Fri, 28 Apr 2023 10:06:49 +0100 Subject: [PATCH] MSC2746: Improved VoIP Signalling (#2746) --- proposals/2746-reliable-voip.md | 345 ++++++++++++++++++++++++++++++++ 1 file changed, 345 insertions(+) create mode 100644 proposals/2746-reliable-voip.md diff --git a/proposals/2746-reliable-voip.md b/proposals/2746-reliable-voip.md new file mode 100644 index 00000000..579604c9 --- /dev/null +++ b/proposals/2746-reliable-voip.md @@ -0,0 +1,345 @@ +# MSC2746: Improved Signalling for 1:1 VoIP + +Historically, Matrix has had basic support for signalling 1:1 WebRTC calls, which suffers from a number of shortcomings: + + * If several devices try to answer the same call, there is no way for them to determine clearly + that the caller has set up the call with a different device, and no way for the caller to + determine which candidate events map to which answer. + * Hangup reasons are often incorrect. + * There is confusion and no clear guidance on how clients should determine whether an incoming + invite is stale or not. + * There is no support for renegotiation of SDP, for changing ICE candidates / hold/resume + functionality, etc. + * There is no distinction between rejecting a call and ending it, which means that in trying + to reject a call, a client can inadvertently cause a call that has been successfully set up + on a different device to be hung up. + +## Proposal +### Change the `version` field in all VoIP events to `"1"` +The version property is changed to `"1"` in all existing VoIP events +([`m.call.answer`](https://spec.matrix.org/v1.5/client-server-api/#mcallanswer), +[`m.call.candidates`](https://spec.matrix.org/v1.5/client-server-api/#mcallcandidates), +[`m.call.hangup`](https://spec.matrix.org/v1.5/client-server-api/#mcallhangup), +[`m.call.invite`](https://spec.matrix.org/v1.5/client-server-api/#mcallinvite)). 
Note +that this changes the type of the `version` field from an integer to a string, as +described in the [Unstable Prefix](#unstable-prefix) section. + +This will be used to determine whether devices support this new version of the protocol. For example, +clients can use this field to know whether to expect an `m.call.select_answer` event from their +opponent. If clients see events with `version` other than `0` or `"1"` (including, for example, the numeric +value `1`), they should treat these the same as if they had `version` == `"1"`. + +Note that this implies any and all future versions of VoIP events should be backwards-compatible. +If it does become necessary to introduce a non backwards-compatible VoIP spec, the intention would +be for it to simply use a separate set of event types. + +### Define the configurations of WebRTC streams and tracks + +The [spec](https://spec.matrix.org/v1.5/client-server-api/#voice-over-ip) does not currently define +the WebRTC streams and tracks that should be sent. Under this proposal, +clients are expected to send one stream with one track of kind `audio` (creating a +voice call). They can optionally send a second track in the same stream of kind +`video` (creating a video call). + +Clients implementing this specification use the first stream and will ignore any streamless tracks. Note that +in the Javascript WebRTC API, this means `addTrack()` must be passed two parameters: a track and a stream, +not just a track, and in a video call the stream must be the same for both audio and video track. + +A client may send other streams and tracks but the behaviour of the other party with respect to presenting +such streams and tracks is undefined. + +This follows the existing known implementations of v0 VoIP. 
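
As a rough illustration of the stream and track rule above, the following TypeScript sketch always attaches a track together with its stream. The `TrackLike`, `StreamLike` and `PeerConnectionLike` interfaces and the `addCallMedia` helper are hypothetical names introduced only for this example; in a browser they would correspond to `MediaStreamTrack`, `MediaStream` and `RTCPeerConnection`. It is a sketch under those assumptions, not a reference implementation.

```typescript
// Minimal structural stand-ins for the browser WebRTC types, so that the
// sketch is self-contained. All names here are hypothetical.
interface TrackLike { kind: string }
interface StreamLike { getTracks(): TrackLike[] }
interface PeerConnectionLike {
    addTrack(track: TrackLike, ...streams: StreamLike[]): unknown;
}

// Sends one audio track and, if present, one video track from a single
// stream, always passing that stream as the second argument to addTrack().
// Returns the kinds of the tracks that were added, in order.
function addCallMedia(pc: PeerConnectionLike, stream: StreamLike): string[] {
    const tracks = stream.getTracks();
    const audio = tracks.find((t) => t.kind === "audio");
    if (audio === undefined) {
        throw new Error("a call needs an audio track");
    }
    const video = tracks.find((t) => t.kind === "video");
    const toSend = video === undefined ? [audio] : [audio, video];
    for (const track of toSend) {
        // Passing only the track would produce a streamless track, which
        // the other party ignores under this proposal.
        pc.addTrack(track, stream);
    }
    return toSend.map((t) => t.kind);
}
```

Because both tracks are added with the same `stream` object, a video call ends up with its audio and video in one stream, as the text requires.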
+ +### Add `invitee` field to [`m.call.invite`](https://spec.matrix.org/v1.5/client-server-api/#mcallinvite) +This allows for the following use cases: + * Placing a call to a specific user in a room where other users are also present. + * Placing a call to oneself. + +The field should be added for all invites where the target is a specific user, and should be set +to the Matrix user ID of that user. Invites without an `invitee` +field are defined to be intended for any member of the room other than the sender of the event. + +Clients should consider an incoming call if they see a non-expired invite event where the `invitee` field is either +absent or equal to their user's Matrix ID, however they should evaluate whether or not to ring based on their +user's trust relationship with the callers and/or where the call was placed. As a starting point, it is +suggested that clients ignore call invites from users in public rooms. It is strongly recommended that +when clients do not ring for an incoming call invite, they still display the call invite in the room and +annotate that it was ignored. + +### Add `party_id` to all VoIP events +Whenever a client first participates in a new call, it generates a `party_id` for itself to use for the +duration of the call. This needs to be long enough that the chance of a collision between multiple devices +both generating an answer at the same time generating the same party ID is vanishingly small: 8 uppercase + +lowercase alphanumeric characters is recommended. Parties in the call are identified by the tuple of +`(user_id, party_id)`. + +The client adds a `party_id` field containing this ID to the top-level of the content of all VoIP events +it sends on the call, including `m.call.invite`. Clients use this to identify remote echo of their own +events: since a user may now call themselves, they can no longer ignore events from their own user. 
This +field also identifies different answers sent by different clients to an invite, and matches `m.call.candidates` +events to their respective answer/invite. + +A client implementation may choose to use the device ID used in end-to-end cryptography for this purpose, +or it may choose, for example, to use a different one for each call to avoid leaking information on which +devices were used in a call (in an unencrypted room), or to allow a single device (i.e. access token) to +send signalling for more than one call party. + +A grammar for `party_id` is defined [below](#specify-exact-grammar-for-voip-ids). + +### Introduce `m.call.select_answer` +This event is sent by the caller's client once it has decided which other +client to talk to, by selecting one of multiple possible incoming `m.call.answer` +events. Its `selected_party_id` field indicates the answer it has chosen. The `call_id` +and `party_id` of the caller are also included. If the callee's client sees a `select_answer` for an answer +with party ID other than the one it sent, it ends the call and informs the user the call +was answered elsewhere. It does not send any events. Media can start flowing +before this event is seen or even sent. Clients that implement previous +versions of this specification will ignore this event and behave as they did +before. + +Example: +``` +{ + "type": "m.call.select_answer", + "content": { + "version": "1", + "call_id": "12345", + "party_id": "67890", + "selected_party_id": "111213" + } +} +``` + +### Introduce `m.call.reject` + + * If the `m.call.invite` event has `version` `"1"`, a client wishing to reject the call + sends an `m.call.reject` event. This rejects the call on all devices, but if the calling + device sees an `answer` before the `reject`, it disregards the reject event and carries on. The reject has a + `party_id` just like an answer, and the caller sends a `select_answer` for it just like an + answer. 
If another client had already sent an answer and sees the caller select the + reject response instead of its answer, it ends the call. + * If the `m.call.invite` event has `version` `0`, the callee sends an `m.call.hangup` event. + +Example: +``` +{ + "type": "m.call.reject", + "content": { + "version": "1", + "call_id": "12345", + "party_id": "67890" + } +} +``` + +If the calling user chooses to end the call before setup is complete, the client sends `m.call.hangup` +as previously. + +### Clarify what actions a client may take in response to an invite +The client may: + * Attempt to accept the call by sending an `m.call.answer`. + * Actively reject the call everywhere: send an `m.call.reject` as per above, which will stop the call from + ringing on all the user's devices and the caller's client will inform them that the user has + rejected their call. + * Ignore the call: send no events, but stop alerting the user about the call. The user's other + devices will continue to ring, and the caller's device will continue to indicate that the call + is ringing, and will time the call out in the normal way if no other device responds. + +### Introduce more reason codes to [`m.call.hangup`](https://spec.matrix.org/v1.5/client-server-api/#mcallhangup) + * `ice_timeout`: The connection failed after some media was exchanged (as opposed to the current + `ice_failed` which means no media connection could be established). Note that, in the case of + an ICE renegotiation, a client should be sure to send `ice_timeout` rather than `ice_failed` if + media had previously been received successfully, even if the ICE renegotiation itself failed. + * `user_hangup`: Clients must now send this code when the user chooses to end the call, although + for backwards compatibility with version 0, clients should treat an absence of the `reason` + field as `user_hangup`. + * `user_media_failed`: The client was unable to start capturing media in such a way that it is unable + to continue the call. 
+ * `user_busy`: The user is busy. Note that this exists primarily for bridging to other networks such + as the PSTN. A Matrix client that receives a call whilst already in a call would not generally reject + the new call unless the user had specifically chosen to do so. + * `unknown_error`: Some other failure occurred that meant the client was unable to continue the call + rather than the user choosing to end it. + +### Introduce `m.call.negotiate` +This introduces SDP negotiation semantics for media pause, hold/resume, ICE restarts and voice/video +call up/downgrading. Clients should implement & honour hold functionality as per WebRTC's +recommendation: https://www.w3.org/TR/webrtc/#hold-functionality + +If both the invite event and the accepted answer event have `version` equal to `"1"`, either party may +send `m.call.negotiate` with a `description` field to offer new SDP to the other party. This event has +`call_id` with the ID of the call and `party_id` equal to the client's party ID for that call. +The caller ignores any negotiate events with `party_id` + `user_id` tuple not equal to that of the +answer it accepted and the callee ignores any negotiate events with `party_id` + `user_id` tuple not equal to that of the caller. Clients should use the `party_id` field to ignore the remote echo of their +own negotiate events. + +This has a `lifetime` field as in `m.call.invite`, after which the sender of the negotiate event +should consider the negotiation failed (timed out) and the recipient should ignore it. + +The `description` field is the same as the `offer` field in `m.call.invite` and `answer` +field in `m.call.answer` and is an `RTCSessionDescriptionInit` object as per +https://www.w3.org/TR/webrtc/#dom-rtcsessiondescriptioninit. 
+ +Example: +``` +{ + "type": "m.call.negotiate", + "content": { + "version": "1", + "call_id": "12345", + "party_id": "67890", + "lifetime": 10000, + "description": { + "sdp": "[some sdp]", + "type": "offer" + } + } +} +``` + +Once an `m.call.negotiate` event is received, the client must respond with another `m.call.negotiate` +event, with the SDP answer (with `"type": "answer"`) in the `description` property. + +This MSC also proposes clarifying the `m.call.invite` and `m.call.answer` events to state that +the `offer` and `answer` fields respectively are objects of type `RTCSessionDescriptionInit`. +Hence the `type` field, whilst redundant in these events, is included for ease of working +with the WebRTC API and is mandatory. Receiving clients should not attempt to validate the `type` field, +but simply pass the object into the WebRTC API. + +### Designate one party as 'polite' +In line with WebRTC perfect negotiation (https://w3c.github.io/webrtc-pc/#perfect-negotiation-example) +we introduce rules to establish which party is polite in the process of renegotiation. The callee is +always the polite party. In a glare situation, the politeness of a party is therefore determined by +whether the inbound or outbound call is used: if a client discards its outbound call in favour of +an inbound call, it becomes the polite party. + +### Add explicit recommendations for call event liveness +`m.call.invite` contains a `lifetime` field that indicates how long the offer is valid for. When +a client receives an invite, it should use the event's `age` field in the sync response plus the +time since it received the event from the homeserver to determine whether the invite is still valid. +The use of the `age` field ensures that incorrect clocks on client devices don't break calls. +If the invite is still valid *and will remain valid for long enough for the user to accept the call*, +it should signal an incoming call. 
The amount of time allowed for the user to accept the call may +vary between clients. For example, it may be longer on a locked mobile device than on an unlocked +desktop device. + +The client should only signal an incoming call in a given room once it has completed processing the +entire sync response and, for encrypted rooms, attempted to decrypt all encrypted events in the +sync response for that room. This ensures that if the sync response contains subsequent events that +indicate the call has been hung up, rejected, or answered elsewhere, the client does not signal it. + +If on startup, after processing locally stored events, the client determines that there is an invite +that is still valid, it should still signal it but only after it has completed a sync from the homeserver. + +The minimum recommended lifetime is 90 seconds - this should give the user +enough time to actually pick up the call. + +### Introduce recommendations for batching of ICE candidates +Clients should aim to send a small number of candidate events, following these guidelines: + * ICE candidates which can be discovered immediately or almost immediately (eg. host candidates) + should be sent in the invite/answer event itself. If server reflexive or relay candidates can be gathered in + a sufficiently short period of time, these should be sent here too. A delay of around 200ms is + suggested as a starting point. + * The client should then allow some time for further candidates to be gathered in order to batch them, + rather than sending each candidate as it arrives. A delay of 2 seconds after sending the + invite or 500ms after sending the answer is suggested as a starting point (since a delay is natural + anyway after the invite whilst the client waits for the user to accept it). 
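
The batching guidelines above can be sketched as a small TypeScript function that sorts gathered candidates into three groups. This is only an illustration: `TimedCandidate`, `CandidatePlan` and `planCandidates` are hypothetical names, and the sketch assumes that the invite or answer is sent exactly 200ms after gathering starts, using the starting-point delays from the text.

```typescript
// Hypothetical shape: a candidate string plus the time it was gathered,
// in milliseconds since candidate gathering started.
interface TimedCandidate { candidate: string; gatheredAtMs: number }

type Role = "caller" | "callee";

// Starting points suggested in the text; real clients may tune these.
const INLINE_WINDOW_MS = 200;
const BATCH_DELAY_MS: Record<Role, number> = { caller: 2000, callee: 500 };

interface CandidatePlan {
    inline: string[];     // included in the invite/answer itself
    firstBatch: string[]; // sent together in one m.call.candidates event
    remainder: string[];  // gathered later, sent in a further event
}

function planCandidates(found: TimedCandidate[], role: Role): CandidatePlan {
    // Assumption: the invite/answer goes out when the inline window closes,
    // so the first batch is cut off BATCH_DELAY_MS after that point.
    const batchCutoffMs = INLINE_WINDOW_MS + BATCH_DELAY_MS[role];
    const plan: CandidatePlan = { inline: [], firstBatch: [], remainder: [] };
    for (const c of found) {
        if (c.gatheredAtMs <= INLINE_WINDOW_MS) {
            plan.inline.push(c.candidate);
        } else if (c.gatheredAtMs <= batchCutoffMs) {
            plan.firstBatch.push(c.candidate);
        } else {
            plan.remainder.push(c.candidate);
        }
    }
    return plan;
}
```

A real client would drive this from timers and ICE gathering events rather than from recorded timestamps; the pure function only makes the cut-off points explicit.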
+ +### Mandate the end-of-candidates candidate +Define that an ICE candidate whose value is the empty string means that no more ICE candidates will +be sent, and mandate that clients must send such a candidate in an `m.call.candidates` message. +The WebRTC spec requires browsers to generate such a candidate, however note that at time of writing, +not all browsers do (Chrome does not, but does generate an `icegatheringstatechange` event). The +client should send any remaining candidates once candidate generation finishes, ignoring timeouts above. +This allows bridges to batch the candidates together when bridging to protocols that don't support +trickle ICE. + +### Add DTMF +Add that Matrix clients can send DTMF as specified by WebRTC. The WebRTC standard as of August +2020 does not support receiving DTMF but a Matrix client can receive and interpret the DTMF sent +in the RTP payload. + +### Specify exact grammar for VoIP IDs +`call_id`s and the newly introduced `party_id` are explicitly defined to be between 1 +and 255 characters long, consisting of the characters `[0-9a-zA-Z._~-]`. + +(Note that this matches the grammar of 'opaque IDs' from +[MSC1597](https://github.com/matrix-org/matrix-spec-proposals/blob/rav/proposals/id_grammar/proposals/1597-id-grammar.md#opaque-ids), +and that of the `id` property of the + [`m.login.sso` flow schema](https://spec.matrix.org/v1.5/client-server-api/#definition-mloginsso-flow-schema).) + +### Specify behaviour on room leave +If the client sees the user it is in a call with leave the room, the client should treat this +as a hangup event for any calls that are in progress. No specific requirement is given for the +situation where a client has sent an invite and the invitee leaves the room, but the client may +wish to treat it as a rejection if there are no more users in the room who could answer the call +(eg. the user is now alone or the `invitee` field was set on the invite). 
+ +The same behaviour applies when a client is looking at historic calls. + +### Clarify that supported codecs should follow the WebRTC spec +The Matrix spec does not mandate particular audio or video codecs, but instead defers to the +WebRTC spec. A compliant Matrix VoIP client will behave in the same way as a supported 'browser' +in terms of what codecs it supports and what variants thereof. The latest WebRTC specification +applies, so clients should keep up to date with new versions of the WebRTC specification whether +or not there have been any changes to the Matrix spec. + +## Potential issues + * The ability to call yourself makes the protocol a little more complex for clients to implement, + and is somewhat of a special case. However, some of the necessary additions are also required for + other features so this MSC elects to make it possible. + * Clients must make a decision on whether to ring for any given call: defining this in the spec + would be cumbersome and would limit clients' ability to use reputation-based systems for this + decision in the future. However, having a call ring on one client and not the other because one + had categorised it as a junk call and not the other would be confusing for the user. + +## Alternatives + * This MSC does not allow for ICE negotiation before the user chooses to answer the call. This can + make call setup faster by allowing connectivity to be established whilst the call is ringing. This + is problematic with Matrix since any device or user could answer the call, so it is not known which + device is going to answer before the user chooses to answer. It would also leak information on which + of a user's devices were online. + * We could define that the ID of a call is implicitly the event ID of the invite event rather than + having a specific `call_id` field. 
This would mean that a client would be unable to know the ID of + a call before it received the response from sending the invite event, which could complicate + implementations. There is probably no compelling reason to change this. + * `m.call.select_answer` was chosen such that its name reflects the intention of the event. `m.call.ack` + is more succinct and mirrors SIP, but this MSC opts for the more descriptive name. + * This MSC elects to allow invites without an `invitee` field to mean a call for anyone in the room. + This could be useful for hunt group style semantics where an incoming call causes many different + users' phones to ring and any one of them may pick up the call. This does mean clients will need + to not blindly ring for any call invites in any room, since this would make unsolicited calls + easy in public rooms. We could opt to leave this out, or make it more explicit with a specific value + for the `invitee` field. + * `party_id` is one of many potential solutions: callees could add `answer_id`s to their events and + callers could be identified by the lack of an `answer_id`. An explicit field on every event may be + easier to comprehend, less error-prone and clearer in the backwards-compatibility scenario. + * We could make `party_id`s more prescriptive, eg. the caller could always have a `party_id` of the + empty string, the word `caller` or equal to the `call_id`, which may make debugging simpler. + * To allow for bridging into protocols that don't support trickle ICE, this proposal requires that + clients send an empty candidate to signal the end of candidates. This means it will be up to bridges + to buffer the invite and edit the SDP to add the candidates once they arrive, adding complexity to + bridges. The alternative would be a discovery mechanism so clients could know whether a callee supports + trickle ICE before calling, and disable it if not. 
This would add complexity to every Matrix client as + well as having to assume that all current clients did not, disabling trickle ICE everywhere until clients + support the discovery mechanism. The mechanism would also have to be per-user which would make sense for + bridged users, but not where some of a user's devices support trickle ICE and some do not. + +## Security considerations + * IP addresses remain in the room in candidates, as they did in the previous version of the spec. + This is not ideal, but alternatives were either sending candidates over to-device messages + (would slow down call setup because a target device would have to be established before sending + candidates) or redacting them afterwards (the volume of events sent during calls can already + cause rate limiting issues and this would exacerbate this). + * Clients must take care to not ring for any call, as per the 'alternatives' section. + +## Unstable prefix +Since VoIP events already have a 'version' field, we would ideally use a string, namespaced version during +development, but this field is defined to be an int in version 0. This MSC proposes changing the version +field to a string so that this namespacing can be used for future changes. Since there is no other easy way +to namespace events whilst in development and ensure interoperability, we have chosen not to use an unstable +prefix for this change, on the understanding that in future we will be able to use the string `version` field +for the unstable prefix. + +For backwards compatibility, strongly typed implementations should allow for +`version` to either be a string or the integer `0`.
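
The `version` handling described in this proposal can be sketched in TypeScript as follows. The helper names `effectiveCallVersion` and `expectsSelectAnswer` are hypothetical. The sketch applies the rule literally: only the integer `0` selects legacy behaviour and anything else is treated as `"1"`. How to handle an event with no `version` field at all is not covered by the text, so treating it as `"1"` here is an assumption.

```typescript
// The two behaviours a client needs to tell apart.
type CallVersion = 0 | "1";

// Hypothetical helper: maps whatever arrived in the `version` field to the
// behaviour the client should apply. The parameter is `unknown` because the
// field may be the integer 0 or any string.
function effectiveCallVersion(version: unknown): CallVersion {
    // Only the integer 0 selects legacy behaviour. Everything else,
    // including the number 1 or a namespaced string, is handled as "1".
    // Assumption: a missing field (undefined) also falls through to "1".
    return version === 0 ? 0 : "1";
}

// Whether to expect m.call.select_answer from the other party.
function expectsSelectAnswer(version: unknown): boolean {
    return effectiveCallVersion(version) === "1";
}
```

Note that the string `"0"` is not the integer `0`, so under this literal reading it is also treated as `"1"`.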