From 628e723483f206b43c349e501dc6a226f8483dfb Mon Sep 17 00:00:00 2001 From: Richard van der Hoff Date: Mon, 23 Oct 2017 00:43:01 +0100 Subject: [PATCH] Move the MXID spec to the appendices Also link to them from the /register API doc. --- api/client-server/registration.yaml | 18 +- .../appendices/identifier_grammar.rst | 225 ++++++++++++++++++ specification/intro.rst | 213 ----------------- specification/targets.yaml | 1 + 4 files changed, 242 insertions(+), 215 deletions(-) create mode 100644 specification/appendices/identifier_grammar.rst diff --git a/api/client-server/registration.yaml b/api/client-server/registration.yaml index 9bd6aa75..a01d2559 100644 --- a/api/client-server/registration.yaml +++ b/api/client-server/registration.yaml @@ -45,6 +45,16 @@ paths: If the client does not supply a ``device_id``, the server must auto-generate one. + The server SHOULD register an account with a User ID based on the + ``username`` provided, if any. Note that the grammar of Matrix User ID + localparts is restricted, so the server MUST either map the provided + ``username`` onto a ``user_id`` in a logical manner, or reject + ``username``\s which do not comply with the grammar, with + ``M_INVALID_USERNAME``. + + Matrix clients MUST NOT assume that the localpart of the registered + ``user_id`` matches the provided ``username``. + The returned access token must be associated with the ``device_id`` supplied by the client or generated by the server. The server may invalidate any access token previously associated with that device. See @@ -86,7 +96,7 @@ paths: username: type: string description: |- - The local part of the desired Matrix ID. If omitted, + The basis for the localpart of the desired Matrix ID. If omitted, the homeserver MUST generate a Matrix ID local part. example: cheeky_monkey password: @@ -121,7 +131,11 @@ paths: properties: user_id: type: string - description: The fully-qualified Matrix ID that has been registered. 
+ description: |- + The fully-qualified Matrix user ID (MXID) that has been registered. + + Any user ID returned by this API must conform to the grammar given in the + `Matrix specification `_. access_token: type: string description: |- diff --git a/specification/appendices/identifier_grammar.rst b/specification/appendices/identifier_grammar.rst new file mode 100644 index 00000000..d53f61ac --- /dev/null +++ b/specification/appendices/identifier_grammar.rst @@ -0,0 +1,225 @@ +.. Copyright 2016 Openmarket Ltd. +.. Copyright 2017 New Vector Ltd. +.. +.. Licensed under the Apache License, Version 2.0 (the "License"); +.. you may not use this file except in compliance with the License. +.. You may obtain a copy of the License at +.. +.. http://www.apache.org/licenses/LICENSE-2.0 +.. +.. Unless required by applicable law or agreed to in writing, software +.. distributed under the License is distributed on an "AS IS" BASIS, +.. WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +.. See the License for the specific language governing permissions and +.. limitations under the License. + +Identifier Grammar +------------------ + +Server Name +~~~~~~~~~~~ + +A homeserver is uniquely identified by its server name. This value is used in a +number of identifiers, as described below. + +The server name represents the address at which the homeserver in question can +be reached by other homeservers. The complete grammar is:: + + server_name = dns_name [ ":" port] + dns_name = host + port = *DIGIT + +where ``host`` is as defined by `RFC3986, section 3.2.2 +`_. 
+ +Examples of valid server names are: + +* ``matrix.org`` +* ``matrix.org:8888`` +* ``1.2.3.4`` (IPv4 literal) +* ``1.2.3.4:1234`` (IPv4 literal with explicit port) +* ``[1234:5678::abcd]`` (IPv6 literal) +* ``[1234:5678::abcd]:5678`` (IPv6 literal with explicit port) + + +Common Identifier Format +~~~~~~~~~~~~~~~~~~~~~~~~ + +The Matrix protocol uses a common format to assign unique identifiers to a +number of entities, including users, events and rooms. Each identifier takes +the form:: + + &localpart:domain + +where ``&`` represents a 'sigil' character; ``domain`` is the `server name`_ of +the homeserver which allocated the identifier, and ``localpart`` is an +identifier allocated by that homeserver. + +The sigil characters are as follows: + +* ``@``: User ID +* ``!``: Room ID +* ``$``: Event ID +* ``#``: Room alias + +The precise grammar defining the allowable format of an identifier depends on +the type of identifier. + +User Identifiers +++++++++++++++++ + +Users within Matrix are uniquely identified by their Matrix user ID. The user +ID is namespaced to the homeserver which allocated the account and has the +form:: + + @localpart:domain + +The ``localpart`` of a user ID is an opaque identifier for that user. It MUST +NOT be empty, and MUST contain only the characters ``a-z``, ``0-9``, ``.``, +``_``, ``=``, ``-``, and ``/``. + +The ``domain`` of a user ID is the `server name`_ of the homeserver which +allocated the account. + +The length of a user ID, including the ``@`` sigil and the domain, MUST NOT +exceed 255 characters. + +The complete grammar for a legal user ID is:: + + user_id = "@" user_id_localpart ":" server_name + user_id_localpart = 1*user_id_char + user_id_char = DIGIT + / %x61-7A ; a-z + / "-" / "." / "=" / "_" / "/" + +.. admonition:: Rationale + + A number of factors were considered when defining the allowable characters + for a user ID. + + Firstly, we chose to exclude characters outside the basic US-ASCII character + set. 
User IDs are primarily intended for use as an identifier at the protocol + level, and their use as a human-readable handle is of secondary + benefit. Furthermore, they are useful as a last-resort differentiator between + users with similar display names. Allowing the full Unicode character set + would make it very difficult for a human to distinguish two similar user IDs. The + limited character set used has the advantage that even a user unfamiliar with + the Latin alphabet should be able to distinguish similar user IDs manually, if + somewhat laboriously. + + We chose to disallow upper-case characters because we do not consider it + valid to have two user IDs which differ only in case: indeed it should be + possible to reach ``@user:matrix.org`` as ``@USER:matrix.org``. However, + user IDs are necessarily used in a number of situations which are inherently + case-sensitive (notably in the ``state_key`` of ``m.room.member`` + events). Forbidding upper-case characters (and requiring homeservers to + downcase usernames when creating user IDs for new users) is a relatively simple + way to ensure that ``@USER:matrix.org`` cannot refer to a different user to + ``@user:matrix.org``. + + Finally, we decided to restrict the allowable punctuation to a very basic set + to reduce the possibility of conflicts with special characters in various + situations. For example, "*" is used as a wildcard in some APIs (notably the + filter API), so it cannot be a legal user ID character. + + The length restriction is derived from the limit on the length of the + ``sender`` key on events; since the user ID appears in every event sent by the + user, it is limited to ensure that the user ID does not dominate over the actual + content of the events. + +Matrix user IDs are sometimes informally referred to as MXIDs. + +Historical User IDs +<<<<<<<<<<<<<<<<<<< + +Older versions of this specification were more tolerant of the characters +permitted in user ID localparts. 
There are currently active users whose user +IDs do not conform to the permitted character set, and a number of rooms whose +history includes events with a ``sender`` which does not conform. In order to +handle these rooms successfully, clients and servers MUST accept user IDs with +localparts from the expanded character set:: + + extended_user_id_char = %x21-7E + +Mapping from other character sets +<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< + +In certain circumstances it will be desirable to map from a wider character set +onto the limited character set allowed in a user ID localpart. Examples include +a homeserver creating a user ID for a new user based on the username passed to +``/register``, or a bridge mapping user IDs from another protocol. + +.. TODO-spec + + We need to better define the mechanism by which homeservers can allow users + to have non-Latin login credentials. The general idea is for clients to pass + the non-Latin in the ``username`` field to ``/register`` and ``/login``, and + the HS then maps it onto the MXID space when turning it into the + fully-qualified ``user_id`` which is returned to the client and used in + events. + +Implementations are free to do this mapping however they choose. Since the user +ID is opaque except to the implementation which created it, the only +requirement is that the implementation can perform the mapping +consistently. However, we suggest the following algorithm: + +1. Encode character strings as UTF-8. + +2. Convert the bytes ``A-Z`` to lower-case. + + * In the case where a bridge must be able to distinguish two different users + with IDs which differ only by case, escape upper-case characters by + prefixing with ``_`` before downcasing. For example, ``A`` becomes + ``_a``. Escape a real ``_`` with a second ``_``. + +3. Encode any remaining bytes outside the allowed character set, as well as + ``=``, as their hexadecimal value, prefixed with ``=``. For example, ``#`` + becomes ``=23``; ``á`` becomes ``=c3=a1``. + +.. 
admonition:: Rationale + + The suggested mapping is an attempt to preserve human-readability of simple + ASCII identifiers (unlike, for example, base-32), whilst still allowing + representation of *any* character (unlike punycode, which provides no way to + encode ASCII punctuation). + + +Room IDs and Event IDs +++++++++++++++++++++++ + +A room has exactly one room ID. A room ID has the format:: + + !opaque_id:domain + +An event has exactly one event ID. An event ID has the format:: + + $opaque_id:domain + +The ``domain`` of a room/event ID is the `server name`_ of the homeserver which +created the room/event. The domain is used only for namespacing to avoid the +risk of clashes of identifiers between different homeservers. There is no +implication that the room or event in question is still available at the +corresponding homeserver. + +Event IDs and Room IDs are case-sensitive. They are not meant to be human +readable. + +.. TODO-spec + What is the grammar for the opaque part? https://matrix.org/jira/browse/SPEC-389 + +Room Aliases +++++++++++++ + +A room may have zero or more aliases. A room alias has the format:: + + #room_alias:domain + +The ``domain`` of a room alias is the `server name`_ of the homeserver which +created the alias. Other servers may contact this homeserver to look up the +alias. + +Room aliases MUST NOT exceed 255 bytes (including the ``#`` sigil and the +domain). + +.. TODO-spec + - Need to specify precise grammar for Room Aliases. https://matrix.org/jira/browse/SPEC-391 diff --git a/specification/intro.rst b/specification/intro.rst index 9e12f89c..be2d9201 100644 --- a/specification/intro.rst +++ b/specification/intro.rst @@ -387,219 +387,6 @@ dedicated API. The API is symmetrical to managing Profile data. Would it really be overengineered to use the same API for both profile & private user data, but with different ACLs? 
- -Identifier Grammar ------------------- - -Server Name -~~~~~~~~~~~ - -A homeserver is uniquely identified by its server name. This value is used in a -number of identifiers, as described below. - -The server name represents the address at which the homeserver in question can -be reached by other homeservers. The complete grammar is:: - - server_name = dns_name [ ":" port] - dns_name = host - port = *DIGIT - -where ``host`` is as defined by `RFC3986, section 3.2.2 -`_. - -Examples of valid server names are: - -* ``matrix.org`` -* ``matrix.org:8888`` -* ``1.2.3.4`` (IPv4 literal) -* ``1.2.3.4:1234`` (IPv4 literal with explicit port) -* ``[1234:5678::abcd]`` (IPv6 literal) -* ``[1234:5678::abcd]:5678`` (IPv6 literal with explicit port) - - -Common Identifier Format -~~~~~~~~~~~~~~~~~~~~~~~~ - -The Matrix protocol uses a common format to assign unique identifiers to a -number of entities, including users, events and rooms. Each identifier takes -the form:: - - &localpart:domain - -where ``&`` represents a 'sigil' character; ``domain`` is the `server name`_ of -the homeserver which allocated the identifier, and ``localpart`` is an -identifier allocated by that homeserver. - -The sigil characters are as follows: - -* ``@``: User ID -* ``!``: Room ID -* ``$``: Event ID -* ``#``: Room alias - -The precise grammar defining the allowable format of an identifier depends on -the type of identifier. - -User Identifiers -++++++++++++++++ - -Users within Matrix are uniquely identified by their Matrix user ID. The user -ID is namespaced to the homeserver which allocated the account and has the -form:: - - @localpart:domain - -The ``localpart`` of a user ID is an opaque identifier for that user. It MUST -NOT be empty, and MUST contain only the characters ``a-z``, ``0-9``, ``.``, -``_``, ``=``, ``-``, and ``/``. - -The ``domain`` of a user ID is the `server name`_ of the homeserver which -allocated the account. 
- -The length of a user ID, including the ``@`` sigil and the domain, MUST NOT -exceed 255 characters. - -The complete grammar for a legal user ID is:: - - user_id = "@" user_id_localpart ":" server_name - user_id_localpart = 1*user_id_char - user_id_char = DIGIT - / %x61-7A ; a-z - / "-" / "." / "=" / "_" / "/" - -.. admonition:: Rationale - - A number of factors were considered when defining the allowable characters - for a user ID. - - Firstly, we chose to exclude characters outside the basic US-ASCII character - set. User IDs are primarily intended for use as an identifier at the protocol - level, and their use as a human-readable handle is of secondary - benefit. Furthermore, they are useful as a last-resort differentiator between - users with similar display names. Allowing the full unicode character set - would make very difficult for a human to distinguish two similar user IDs. The - limited character set used has the advantage that even a user unfamiliar with - the Latin alphabet should be able to distinguish similar user IDs manually, if - somewhat laboriously. - - We chose to disallow upper-case characters because we do not consider it - valid to have two user IDs which differ only in case: indeed it should be - possible to reach ``@user:matrix.org`` as ``@USER:matrix.org``. However, - user IDs are necessarily used in a number of situations which are inherently - case-sensitive (notably in the ``state_key`` of ``m.room.member`` - events). Forbidding upper-case characters (and requiring homeservers to - downcase usernames when creating user IDs for new users) is a relatively simple - way to ensure that ``@USER:matrix.org`` cannot refer to a different user to - ``@user:matrix.org``. - - Finally, we decided to restrict the allowable punctuation to a very basic set - to reduce the possibility of conflicts with special characters in various - situations. 
For example, "*" is used as a wildcard in some APIs (notably the - filter API), so it cannot be a legal user ID character. - - The length restriction is derived from the limit on the length of the - ``sender`` key on events; since the user ID appears in every event sent by the - user, it is limited to ensure that the user ID does not dominate over the actual - content of the events. - -Matrix user IDs are sometimes informally referred to as MXIDs. - -Historical User IDs -<<<<<<<<<<<<<<<<<<< - -Older versions of this specification were more tolerant of the characters -permitted in user ID localparts. There are currently active users whose user -IDs do not conform to the permitted character set, and a number of rooms whose -history includes events with a ``sender`` which does not conform. In order to -handle these rooms successfully, clients and servers MUST accept user IDs with -localparts from the expanded character set:: - - extended_user_id_char = %x21-7E - -Mapping from other character sets -<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< - -In certain circumstances it will be desirable to map from a wider character set -onto the limited character set allowed in a user ID localpart. Examples include -a homeserver creating a user ID for a new user based on the username passed to -``/register``, or a bridge mapping user ids from another protocol. - -.. TODO-spec - - We need to better define the mechanism by which homeservers can allow users - to have non-Latin login credentials. The general idea is for clients to pass - the non-Latin in the ``username`` field to ``/register`` and ``/login``, and - the HS then maps it onto the MXID space when turning it into the - fully-qualified ``user_id`` which is returned to the client and used in - events. - -Implementations are free to do this mapping however they choose. Since the user -ID is opaque except to the implementation which created it, the only -requirement is that the implemention can perform the mapping -consistently. 
However, we suggest the following algorithm: - -1. Encode character strings as UTF-8. - -2. Convert the bytes ``A-Z`` to lower-case. - - * In the case where a bridge must be able to distinguish two different users - with ids which differ only by case, escape upper-case characters by - prefixing with ``_`` before downcasing. For example, ``A`` becomes - ``_a``. Escape a real ``_`` with a second ``_``. - -3. Encode any remaining bytes outside the allowed character set, as well as - ``=``, as their hexadecimal value, prefixed with ``=``. For example, ``#`` - becomes ``=23``; ``á`` becomes ``=c3=a1``. - -.. admonition:: Rationale - - The suggested mapping is an attempt to preserve human-readability of simple - ASCII identifiers (unlike, for example, base-32), whilst still allowing - representation of *any* character (unlike punycode, which provides no way to - encode ASCII punctuation). - - -Room IDs and Event IDs -++++++++++++++++++++++ - -A room has exactly one room ID. A room ID has the format:: - - !opaque_id:domain - -An event has exactly one event ID. An event ID has the format:: - - $opaque_id:domain - -The ``domain`` of a room/event ID is the `server name`_ of the homeserver which -created the room/event. The domain is used only for namespacing to avoid the -risk of clashes of identifiers between different homeservers. There is no -implication that the room or event in question is still available at the -corresponding homeserver. - -Event IDs and Room IDs are case-sensitive. They are not meant to be human -readable. - -.. TODO-spec - What is the grammar for the opaque part? https://matrix.org/jira/browse/SPEC-389 - -Room Aliases -++++++++++++ - -A room may have zero or more aliases. A room alias has the format:: - - #room_alias:domain - -The ``domain`` of a room alias is the `server name`_ of the homeserver which -created the alias. Other servers may contact this homeserver to look up the -alias. 
- -Room aliases MUST NOT exceed 255 bytes (including the ``#`` sigil and the -domain). - -.. TODO-spec - - Need to specify precise grammar for Room Aliases. https://matrix.org/jira/browse/SPEC-391 - - License ------- diff --git a/specification/targets.yaml b/specification/targets.yaml index 0d64815b..fb68e13d 100644 --- a/specification/targets.yaml +++ b/specification/targets.yaml @@ -34,6 +34,7 @@ targets: - appendices.rst - appendices/base64.rst - appendices/signing_json.rst + - appendices/identifier_grammar.rst - appendices/threat_model.rst - appendices/test_vectors.rst groups: # reusable blobs of files when prefixed with 'group:'
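
For reference when reviewing the "Mapping from other character sets" text moved by this patch, the suggested three-step algorithm could be sketched roughly as follows. This is only an illustration under stated assumptions, not code from any homeserver: the names ``map_username_to_localpart``, ``preserve_case``, ``ALLOWED`` and ``LOCALPART_RE`` are hypothetical, and the spec explicitly leaves implementations free to map however they choose.

```python
import re

# Bytes permitted in a user ID localpart by the grammar in this patch:
# a-z, 0-9, "-", ".", "=", "_", "/".
ALLOWED = frozenset(b"abcdefghijklmnopqrstuvwxyz0123456789-._=/")

# Hypothetical validator for the non-historical localpart grammar.
LOCALPART_RE = re.compile(r"[a-z0-9._=/-]+")


def map_username_to_localpart(username: str, preserve_case: bool = False) -> str:
    """Map an arbitrary username onto the restricted localpart character set.

    preserve_case=True applies the optional bridge rule from step 2:
    upper-case letters are prefixed with "_" and a literal "_" is doubled.
    """
    out = []
    for byte in username.encode("utf-8"):  # step 1: work on UTF-8 bytes
        ch = chr(byte)
        if preserve_case and ch == "_":
            out.append("__")  # escape a real underscore
        elif "A" <= ch <= "Z":  # step 2: downcase A-Z
            out.append(("_" if preserve_case else "") + ch.lower())
        elif byte in ALLOWED and ch != "=":
            out.append(ch)  # already a legal localpart character
        else:
            # step 3: "=" itself and anything else becomes =xx (lower-case hex)
            out.append("=%02x" % byte)
    return "".join(out)


if __name__ == "__main__":
    for name in ("Alice", "#", "á", "a=b"):
        print(name, "->", map_username_to_localpart(name))
```

With ``preserve_case=True`` the sketch escapes case as described for bridges, so ``A`` and ``a`` map to distinct localparts. The length limit of 255 characters on the full user ID is not checked here.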