You cannot select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
matrix-spec-proposals/proposals/3970-device-scope-txnid.md

9.8 KiB

MSC3970: Scope transaction IDs to devices

Transaction identifiers in the Client-Server API are currently scoped to the concept of a "client session" which, when refresh tokens are used, can span a sequence of access tokens.

The spec reads:

The scope of a transaction ID is a “client session”, where that session is identified by a particular access token. When refreshing an access token, the transaction IDs scope is retained. This means that if a client with token A uses TXN1 as their transaction ID, refreshes the token to B, and uses TXN1 again itll be assumed to be a duplicate request and ignored. If the client logs out and back in between the A and B tokens, TXN1 could be used once for each.

The "client session" scope concept described can be complicated to implement.

This MSC proposes that the scope of a transaction identifier is changed to something that is easier to implement whilst maintaining required transaction semantics.

The transaction IDs appear in two parts of the Client-Server API spec:

  1. As a identifier to allow the homeserver to make some PUT endpoints idempotent
  2. An unsigned field in the event data model for a client to tell if it sent an event or not. a.k.a. solving the "local echo" problem

For reference, the PUT endpoints that have the a {txnId} param are:

Proposal

It is proposed that the scope of transaction identifiers be changed from a "client session" to a "device".

A "device" is typically represented by a device_id elsewhere in the spec.

For idempotency, this means the homeserver changing the method of identifying a request from:

  • (client session, HTTP path of request which includes the transaction ID)

to:

  • (device_id, HTTP path of request which includes the transaction ID)

For local echo, the homeserver would now include the transaction_id in the event data when it is serving a sync request from the same device_id as determined from the access token.

Potential issues

This is technically a breaking change to the spec

The main "issue" I see with this proposal is that this is technically a breaking change to the spec.

A device ID could have multiple sequences of access tokens associated with it (since device ID can be specified as a parameter to /login).

For example, a "bot" implementation might masquerade as the same "device" despite calling /login on every run by passing the same device ID. Such a device might also use the same transaction ID on each run.

Therefore it could potentially lead to a request being treated as a duplicate where previously it would have been treated as a distinct request.

However, some evidence has been collated which suggests that nothing would be impacted in reality:

1. Data from the matrix.org homeserver suggests the change would have no impact

The matrix.org homeserver is a reasonable size deployment and could be considered reasonably representative of the diversity of Matrix clients.

Synapse maintains a event_txn_id table that contains a rolling 24 hour window of (user_id, token_id, room_id, txn_id) tuples.

Having analysed the contents of the table, it appears that there are no repeated transaction IDs for a given user, token and room.

n.b. not all PUT endpoints contribute to the table but I think the high-volume ones do

This suggests that the widening of the scope from token to device would not have caused any issues during the periods sampled.

For reference the following is the schema of the event_txn_id table:

              Table "matrix.event_txn_id"
   Column    |  Type  | Collation | Nullable | Default
-------------+--------+-----------+----------+---------
 event_id    | text   |           | not null |
 room_id     | text   |           | not null |
 user_id     | text   |           | not null |
 token_id    | bigint |           | not null |
 txn_id      | text   |           | not null |
 inserted_ts | bigint |           | not null |
Indexes:
    "event_txn_id_token_id" btree (token_id)
    "event_txn_id_event_id" UNIQUE, btree (event_id)
    "event_txn_id_ts" btree (inserted_ts)
    "event_txn_id_txn_id" UNIQUE, btree (room_id, user_id, token_id, txn_id)
Foreign-key constraints:
    "event_txn_id_event_id_fkey" FOREIGN KEY (event_id) REFERENCES matrix.events(event_id) ON DELETE CASCADE
    "event_txn_id_token_id_fkey" FOREIGN KEY (token_id) REFERENCES matrix.access_tokens(id) ON DELETE CASCADE

And the query to look for repeated transaction IDs:

SELECT e1.txn_id, LEFT(e1.user_id, 5) AS user_id, e1.token_id, e2.token_id, e1.inserted_ts, e2.inserted_ts FROM matrix.event_txn_id e1, matrix.event_txn_id e2 WHERE e1.txn_id = e2.txn_id AND e1.event_id <> e2.event_id AND e1.event_id < e2.event_id AND e1.user_id = e2.user_id AND e1.room_id = e2.room_id ORDER BY e1.token_id;
 txn_id | user_id | token_id | token_id | inserted_ts | inserted_ts
--------+---------+----------+----------+-------------+-------------
(0 rows)

2. Conduit homeserver already scopes transaction IDs to devices

As highlighted by the new Complement tests the Conduit homeserver is already scoping transaction IDs to devices.

I can't find a related issue listed, so presumably this non-compliant behaviour isn't causing a known issue for admins and users of the Conduit homeserver?

Is the "device" concept the right level of abstraction to use?

One way to look at it is that device is already widely used in the end-to-end encryption parts of the spec and so why isn't it suitable for this use case too?

What about two clients masquerading as a single device ID?

I don't know if this actually works in practice. If this was a concern then it could be mitigated by clarifying in the spec that if a client wishes to submit requests using the same device_id as another client session that it should choose transaction identifiers that are unique to that client session.

Alternatives

Do nothing

We could leave the transaction ID scope as is.

However, it makes it difficult to implement a features like MSC3861: Matrix architecture change to delegate authentication via OIDC as the concept of a "client session" doesn't really exist in OIDC.

As noted above, at least one homeserver implementation is also not implementing the spec as it is today.

It also turns out that the current implementation of refresh tokens in Synapse breaks the transaction ID semantics already and needs to be fixed.

Make a backwards compatible change

A backwards compatible alternative could be something like:

  1. For idempotency have clients opt-in to a new scope of transaction ID, but support the current semantics too for compatibility
  2. Have clients opt-in (e.g. request param on the sync endpoint) to receiving transaction ID for all events in the sync response and make the client responsible for identifying which messages they sent

The disadvantage of this is that we create a load of extra maintenance work to support both semantics for a period of time for (empirically) no gain in return.

Security considerations

A malicious client can adopt an existing device ID of a user. This could possibly allow some kind of denial of service attack.

However, if such an attack where possible it would be possible to do so without this MSC as device IDs are crucial to the implementation of end-to-end encryption.

Other recommendations

I'm not suggesting that these recommendations are addressed in this proposal, but more notes for future proposals or spec clarifications.

Clarification on idempotency semantics

I have separately prepared a spec PR to clarify some of the idempotency semantics that doesn't modify the spec but is useful to understand the context of this proposal.

Clarification on transaction ID time scope

I also suggest that the spec be clarified over what time periods the transaction ID is scoped for such that clients can be aware. This cloud simply be to say that the time period is not defined and so may vary by implementation.

Recommend a less naive transaction ID format

Currently the format of a transaction ID is not specified, but a recommendation is given:

After the [previous] request has finished, the {txnId} value should be changed (how is not specified; a monotonically increasing integer is recommended).

I think this is an unnecessarily naive recommendation.

In most clients environments a pseudo-random number generator will be available and so could be used to generate a UUID/ULID or similar.

As an aside, in my research I have found some clients use a "seconds since epoch" as a transaction ID which introduces a limit on the maximum possible event transmission rate per room to once per second. Perhaps a better recommendation could help prevent the such behaviour being introduced in future.

Unstable prefix

None needed.

Dependencies

None.