diff --git a/proposals/4081-claim-fallback-keys-on-network-failure.md b/proposals/4081-claim-fallback-keys-on-network-failure.md index 4fec4e0f..f0749c01 100644 --- a/proposals/4081-claim-fallback-keys-on-network-failure.md +++ b/proposals/4081-claim-fallback-keys-on-network-failure.md @@ -1,4 +1,4 @@ -# MSC4081: Claim fallback key on network failures +# MSC4081: Eagerly sharing fallback keys with federated servers *Abstract: This MSC aims to increase the robustness of the Olm session setup protocol over federation. With this MSC, transient network failures over federation will not cause undecryptable messages due to @@ -10,69 +10,141 @@ only be used once. However, this presents several problems: - what happens when the device does not upload more keys and the uploaded keys are all used up? (key exhaustion) - what happens if the OTK cannot be claimed due to transient network failures. -[MSC2732](https://github.com/matrix-org/matrix-spec-proposals/pull/2732) introduced the concept of "fallback keys" -which can be claimed when OTKs are exhausted. Fallback keys provide weaker security properties than one-time keys, -specifically impacting forward secrecy, which protects past sessions against future compromises of keys or passwords. -The risk is that if the private part of the fallback key is exposed, an attacker may use the key to decrypt earlier -sessions. This can be mitigated by cycling the fallback key (and hence deleting the private key) once it has been -"used", with some lag time to account for slow networks. +[MSC2732](https://github.com/matrix-org/matrix-spec-proposals/pull/2732) introduced the concept of "fallback keys" +which can be claimed when OTKs are exhausted. Fallback keys provide weaker security properties than one-time keys, +specifically impacting forward secrecy, which protects past sessions against future compromises of keys or +passwords. 
The risk is that if the private part of the fallback key is exposed, an attacker may use the key to +decrypt earlier sessions. This can be mitigated by creating a new fallback key as soon as the old one has been used +(and hence later deleting the private key, with some lag time to account for slow networks). + +For reference, https://crypto.stackexchange.com/a/52825 is a good explanation of why OTKs are preferable +to fallback keys, where they are available. (The question is about Signal rather than Olm, but the principles +are much the same. Signal uses the term "signed prekey" for what Olm calls a "fallback key", and "one-time prekey" for an +OTK.) ## Proposal -Currently, fallback keys are _only_ claimed on key exhaustion, not due to transient network failures. This MSC +Currently, fallback keys are _only_ used on key exhaustion, not due to transient network failures. This MSC proposes to change the semantics to allow fallback keys to be returned by the `/keys/claim` endpoint if the server the target device is on is unreachable. In order for servers to return fallback keys during the network failure, -the fallback keys must be cached _in advance_ on the claiming user's homeserver. This MSC proposes adding a new -key `fallback_keys` to the [`m.device_list_update` EDU](https://spec.matrix.org/v1.9/server-server-api/#definition-mdevice_list_update). This MSC proposes changing the spec wording (bold is new): +the fallback keys must be cached _in advance_ on the claiming user's homeserver. + +### Extend `/_matrix/client/v3/keys/upload` request + +Clients have to opt in to this process when uploading fallback keys. 
To allow this, we extend the [`POST +/_matrix/client/v3/keys/upload`](https://spec.matrix.org/v1.9/client-server-api/#post_matrixclientv3keysupload) +endpoint with a new request body parameter, `eager_share_fallback_keys`, as follows (bold is new): + +| Name | Type | Description | +|-----------------------|---------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| `device_keys` | `DeviceKeys` | Identity keys for the device. May be absent if no new identity keys are required. | +| `fallback_keys` | `OneTimeKeys` | The public key which should be used if the device’s one-time keys are exhausted, **or if the user's homeserver is unreachable**. [etc] | +| `one_time_keys` | `OneTimeKeys` | One-time public keys for “pre-key” messages. The names of the properties should be in the format `<algorithm>:<key_id>`. The format of the key is determined by the key algorithm. May be absent if no new one-time keys are required. | +| **`eager_share_fallback_keys`** | **`boolean`** | **Whether the `fallback_keys` should immediately be sent to other homeservers that have a user who shares a room with this user. Omitting this property is the same as setting it to `false`.** | + +### Extend `m.device_list_update` EDU + +This MSC proposes adding a new key `fallback_keys` to the [`m.device_list_update` +EDU](https://spec.matrix.org/v1.9/server-server-api/#definition-mdevice_list_update). We change the spec wording as +follows (bold is new): > Servers must send `m.device_list_update` EDUs to all the servers who share a room with a given local user, and > must be sent whenever that user’s device list changes (i.e. for new or deleted devices, when that user joins a > room which contains servers which are not already receiving updates for that user’s device list, or changes in -> device information such as the device’s human-readable name **or fallback key**). 
+> device information such as the device’s human-readable name **or, if the client has opted into eager sharing of +> fallback keys, the fallback keys**). -The following key/values are added to the `DeviceKeys` object definition (bold is new): +A new property `fallback_keys` is added to the body of the `m.device_list_update` EDU, as shown below (bold is new): -| Name | Type | Description | -|------------------|-------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| algorithms | [string] | Required: The encryption algorithms supported by this device. | -| device_id | string | Required: The ID of the device these keys belong to. Must match the device ID used when logging in. | -| keys | {string: string} | Required: Public identity keys. The names of the properties should be in the format :. The keys themselves should be encoded as specified by the key algorithm. | -| signatures | Signatures | Required: Signatures for the device key object. A map from user ID, to a map from : to the signature. The signature is calculated using the process described at Signing JSON. | -| user_id | string | Required: The ID of the user the device belongs to. Must match the user ID used when logging in. | -| **fallback_key** | **{string: KeyObject}** | **The fallback key for this device, if set. The format of this object is identical to the /keys/claim response for a single device. This replaces any previously sent fallback key.** | +| Name | Type | Description | +|-----------------------|---------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| `deleted` | `boolean` | True if the server is announcing that this device has been deleted. 
+| `device_display_name` | `string` | The public human-readable name of this device. Will be absent if the device has no name. +| `device_id` | `string` | Required: The ID of the device whose details are changing. +| `keys` | `DeviceKeys` | The updated identity keys (if any) for this device. May be absent if the device has no E2E keys defined. +| `prev_id` | `[integer]` | The `stream_ids` of any prior `m.device_list_update` EDUs sent for this user which have not been referred to already in an EDU’s `prev_id` field. If the receiving server does not recognise any of the `prev_ids`, it means an EDU has been lost and the server should query a snapshot of the device list via `/user/keys/query` in order to correctly interpret future `m.device_list_update` EDUs. May be missing or empty for the first EDU in a sequence. +| `stream_id` | `integer` | Required: An ID sent by the server for this update, unique for a given `user_id`. Used to identify any gaps in the sequence of m.device_list_update EDUs broadcast by a server. +| `user_id` | `string` | Required: The user ID who owns this device. +| **`fallback_keys`** | **`{string: KeyObject}`** | **The fallback keys for this device, if set, and if the client has opted in to eager sharing. This is the same as the most recent `fallback_keys` uploaded by this device via [`POST /_matrix/client/v3/keys/upload`](https://spec.matrix.org/v1.9/client-server-api/#post_matrixclientv3keysupload).** -An example of the new field: + +An example of an EDU with the new property: ```js { - // ... 
- "fallback_key": { + "content": { + "device_display_name": "Mobile", + "device_id": "QBUAZIFURK", + "keys": { + "algorithms": [ + "m.olm.v1.curve25519-aes-sha2", + "m.megolm.v1.aes-sha2" + ], + "device_id": "JLAFKJWSCS", + "keys": { + "curve25519:JLAFKJWSCS": "3C5BFWi2Y8MaVvjM8M22DBmh24PmgR0nPvJOIArzgyI", + "ed25519:JLAFKJWSCS": "lEuiRJBit0IG6nUf5pUzWTUEsRVVe/HJkoKuEww9ULI" + }, + "signatures": { + "@alice:example.com": { + "ed25519:JLAFKJWSCS": "dSO80A01XiigH3uBiDVx/EjzaoycHcjq9lfQX0uWsqxl2giMIiSPR8a4d291W1ihKJL/a+myXS367WT6NAIcBA" + } + }, + "user_id": "@alice:example.com" + }, + "prev_id": [ + 5 + ], + "stream_id": 6, + "user_id": "@john:example.com", + "fallback_keys": { "signed_curve25519:AAAAHg": { + "fallback": true, "key": "zKbLg+NrIjpnagy+pIY6uPL4ZwEG2v+8F9lmgsnlZzs", "signatures": { - "@alice:example.com": { + "@john:example.com": { "ed25519:JLAFKJWSCS": "FLWxXqGbwrb8SM3Y795eB6OA8bwBcoMZFXBqnTn58AYWZSqiD45tlBVcDa2L7RwdKXebW/VzDlnfVJ+9jok1Bw" } } } } + }, + "edu_type": "m.device_list_update" } ``` -As a reminder, clients SHOULD rotate their fallback key when they realise it has been "used", with some lag time -to account for federation. As per MSC2732, 1 hour is recommended. When clients change their fallback key, a new -`m.device_list_update` EDU MUST be sent. +### Changed semantics for `/keys/claim` -The definition of when a fallback key is "used" also needs to change. Previously, a key is "used" -_if it is claimed by another device_. When this happens, the client is told this via `/sync`, either by reducing -the one-time key count by 1, or by removing the algorithm from the `device_unused_fallback_key_types` array. This proposal -makes it impossible to know if the fallback key has been claimed by another device, as it is sent eagerly over -federation. Therefore, this changes the definition of "used" to be "when the device receives and successfully -decrypts an initial pre-key to-device event which uses that key". 
As per the specification, this is identified as -`type: 0` messages. This will require client-side changes to change when new fallback keys get uploaded. +[`POST /_matrix/client/v3/keys/claim`](https://spec.matrix.org/v1.9/client-server-api/#post_matrixclientv3keysclaim) can +now respond with a cached fallback key if the remote server is unreachable. -Due to this change, it is recommended that the fallback key is also **cycled periodically** -_even if the key isn't "used"_, e.g once per week. This reduces the risk of >1 session being established with the same -key, but for some reason the client isn't able to detect it. +### Changed semantics for rotating fallback keys + +As a reminder, clients SHOULD upload a new fallback key when they realise it has been "used". + +The definition of when a fallback key is "used" is changed by this MSC. Previously, a fallback key was "used" +_if it was claimed by another device_. When this happened, the server told the client via `/sync` by removing the +algorithm from the `device_unused_fallback_key_types` array. This is no longer a useful mechanism, as the key is +sent eagerly over federation. + +Therefore, we change the definition of "used" to be "when the device receives and successfully decrypts an initial +pre-key to-device event which uses that key". As soon as such an event is received, a new fallback key should be +created and uploaded via `/keys/upload`. (As above, this will then trigger `m.device_list_update` EDUs.) + +We also recommend that the fallback key is **rotated periodically** _even if the key isn't "used"_, +e.g. once per week. This reduces the risk of the key being used without the client knowing about it (for example, because of a +networking problem). + +Once a new key has been uploaded, the private part of the old key should be scheduled for deletion. This cannot +happen immediately, since there may be other messages in flight which rely on the old key. 
This was also true of +the original fallback keys implementation +([MSC2732](https://github.com/matrix-org/matrix-spec-proposals/pull/2732)). However, there could now be a much more +significant delay between the old key being used to encrypt a message and that message being received at the +recipient, and MSC2732's recommendation (the lesser of "as soon as the new key is used" and 1 hour) is inadequate. +We therefore recommend significantly increasing the period for which an old fallback key is kept on the client, to +30 days after the key was replaced, but making sure that at least one old fallback key is kept at all +times. (Since we recommend rotating keys every week, normally there will be several old keys on the +client. However, if a user does not use their client for a month, there could be a backlog of messages for the most +recent old key; this is why we always keep at least one.) ## Comparisons with X3DH (Signal) @@ -98,42 +170,46 @@ claim OTKs as Signal is not federated). ## Security Considerations -Ultra secure clients may be unhappy that fallback keys are being returned and not one-time keys, because they -dislike the slightly weaker security properties fallback keys provide. This could be resolved by adding a flag to -the `/keys/claim` endpoint to state whether returning a fallback key is acceptable to the client or not. If this -flag is not set/missing, fallback keys would not be returned in place of OTKs, meaning this MSC would be entirely -opt-in, and hence require client-side changes. However, a malicious server can trivially ignore this flag and -return the fallback key anyway, and the client would not be able to detect this. For this reason, it feels like -security theater to add this flag. - -A malicious actor who can control network conditions (but not the servers themselves) can force a client to use a fallback key by temporarily -preventing two homeservers from communicating. 
Previously, the only way such an actor could force a client to -use a fallback key would be to claim all the OTKs before the client had a chance to upload more. Therefore, this -MSC increases the ways attackers can force clients to use fallback keys. Fallback keys weaken forward secrecy. It -is assumed that "most" sessions will be set up using OTKs and not the fallback key. If this assumption holds, -forcing use of a fallback key does nothing to compromise those sessions. This means this attack is only useful for -_active attacks_, where an attacker wants to compromise _sessions that have yet to be established_, and wants to -force those sessions to be set up with the fallback key. - -By sending the fallback key eagerly, an attacker would have access to the public key for a longer period of time than -before. Without this MSC, the fallback key remains on the uploader's homeserver until a federated user requests it. -At that point, the client is notified via `/sync` that the fallback key has been used and hence should be rotated. -With this MSC, the client would not be notified when the fallback key is used on the remote server, because this MSC -is robust to network partitions. Instead, the user will be notified when they receive a to-device event encrypted with -the fallback key. If having access to the public part of the fallback key -_for an extended period of time_ is useful for an attacker, then this MSC decreases security. The author is not aware -of any scenario where having access to the public key for a longer period of time is a security risk. If there is a -risk, other decentralised systems such as bitcoin, etheruem and libp2p which all rely on long-lived public keys as -addresses would also be vulnerable. Furthermore, the user's own homeserver has access to the fallback key today. 
If -access to the key for an extended time is a security risk, and the user does not trust their own homeserver (not -unreasonable given this is for E2EE) then any concerns _are already present today_, just not over federation. +1. Ultra secure clients may be unhappy that fallback keys are being returned and not one-time keys, because they + dislike the slightly weaker security properties fallback keys provide. Since fallback keys are marked as such + with `fallback: true`, such clients can detect this situation and act accordingly (e.g. by refusing to send a + message, or by retrying later). + +2. A malicious actor who can control network conditions (but not the servers themselves) can force a client to use + a fallback key by temporarily preventing two homeservers from communicating. Previously, the only way such an + actor could force a client to use a fallback key was to claim all the OTKs before the client had a chance + to upload more. Therefore, this MSC increases the ways attackers can force clients to use fallback + keys. Fallback keys weaken forward secrecy. It is assumed that "most" sessions will be set up using OTKs and not + the fallback key. If this assumption holds, forcing use of a fallback key does nothing to compromise those + sessions. This means this attack is only useful for _active attacks_, where an attacker wants to compromise + _sessions that have yet to be established_, and wants to force those sessions to be set up with the fallback + key. + +3. Because the fallback key is sent eagerly, an attacker has access to the public key for a longer period of time + than before. Without this MSC, the fallback key remains on the uploader's homeserver until a federated user + requests it. At that point, the client is notified via `/sync` that the fallback key has been used and hence + should be rotated. 
With this MSC, the client would not be notified when the fallback key is used on the remote + server, because this MSC is robust to network partitions. Instead, the user will be notified when they receive a + to-device event encrypted with the fallback key. If having access to the public part of the fallback key _for an + extended period of time_ is useful for an attacker, then this MSC decreases security. + + We are not aware of any scenario where having access to the public key for a longer period of time is a security + risk. If there is a risk, other decentralised systems such as Bitcoin, Ethereum and libp2p, which all rely on + long-lived public keys as addresses, would also be vulnerable. Furthermore, the user's own homeserver has access + to the fallback key today. If access to the key for an extended time is a security risk, and the user does not + trust their own homeserver (not unreasonable given this is for E2EE) then any concerns _are already present + today_, just not over federation. ## Alternatives -Do nothing. In this scenario, if the remote server is unreachable when the client calls `/keys/claim`, the message -will not be encrypted for that device, and the end user will be unable to decrypt the message. What's worse, this -will persist until the client decides to retry the `/keys/claim` endpoint, which could be seconds or much longer. -As a data point, Matrix Rust SDK currently uses [15 seconds](https://github.com/matrix-org/matrix-rust-sdk/issues/2804) -and this is seen as very low. - - +1. Do nothing. In this scenario, if the remote server is unreachable when the client calls `/keys/claim`, the + message will not be encrypted for that device, and the end user will be unable to decrypt the message. What's + worse, this will persist until the client decides to retry the `/keys/claim` endpoint, which could be seconds or + much longer. 
As a data point, Matrix Rust SDK currently uses [15 + seconds](https://github.com/matrix-org/matrix-rust-sdk/issues/2804) and this is seen as very low. + +2. Clients could remember that they were unable to claim keys for a given device, and retry periodically. The main + problem with this approach (other than increased complexity in the client) is that it requires the sending + client to still be online when the remote server comes online, and to notice that has happened. There may be + other benefits to such an approach, but we feel that this MSC nevertheless represents an achievable, incremental + improvement in reliability.
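
For illustration only, the client-side rules proposed under "Changed semantics for rotating fallback keys" could be sketched roughly as below: upload a new fallback key once the current one has been "used" or is about a week old, and keep old private keys for 30 days after replacement while always retaining the most recently replaced one. This is a minimal sketch, not part of the proposal. The names `OldFallbackKey`, `should_rotate` and `prune_old_fallback_keys` are hypothetical, and the weekly and 30-day constants are simply the values recommended in the text.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List

# Values recommended in the proposal text; a real client may make these configurable.
ROTATION_PERIOD = timedelta(days=7)
RETENTION_PERIOD = timedelta(days=30)


@dataclass
class OldFallbackKey:
    """Hypothetical record of a replaced fallback key whose private part is still stored."""
    key_id: str
    replaced_at: datetime


def should_rotate(created_at: datetime, used: bool, now: datetime) -> bool:
    """Rotate once the key has been "used" (a pre-key message using it was
    decrypted), or periodically even if it has not been used."""
    return used or (now - created_at) >= ROTATION_PERIOD


def prune_old_fallback_keys(old_keys: List[OldFallbackKey], now: datetime) -> List[OldFallbackKey]:
    """Return the old keys to keep: every key replaced within the retention
    period, plus the most recently replaced key regardless of its age."""
    if not old_keys:
        return []
    newest = max(old_keys, key=lambda k: k.replaced_at)
    return [
        k for k in old_keys
        if k is newest or (now - k.replaced_at) <= RETENTION_PERIOD
    ]
```

Keeping the most recently replaced key unconditionally covers the case described in the text where a client has been offline for more than a month and still has a backlog of messages encrypted against that key.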