.. Copyright 2016 Openmarket Ltd.
.. Copyright 2017, 2018 New Vector Ltd.
..
.. Licensed under the Apache License, Version 2.0 (the "License");
.. you may not use this file except in compliance with the License.
.. You may obtain a copy of the License at
..
.. http://www.apache.org/licenses/LICENSE-2.0
..
.. Unless required by applicable law or agreed to in writing, software
.. distributed under the License is distributed on an "AS IS" BASIS,
.. WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
.. See the License for the specific language governing permissions and
.. limitations under the License.
Identifier Grammar
------------------
Some identifiers are specific to given room versions; please refer to the
`room versions specification`_ for more information.
.. _`room versions specification`: index.html#room-versions
Server Name
~~~~~~~~~~~
A homeserver is uniquely identified by its server name. This value is used in a
number of identifiers, as described below.
The server name represents the address at which the homeserver in question can
be reached by other homeservers. All valid server names are included by the
following grammar::

    server_name = hostname [ ":" port ]
    port        = 1*5DIGIT
    hostname    = IPv4address / "[" IPv6address "]" / dns-name
    IPv4address = 1*3DIGIT "." 1*3DIGIT "." 1*3DIGIT "." 1*3DIGIT
    IPv6address = 2*45IPv6char
    IPv6char    = DIGIT / %x41-46 / %x61-66 / ":" / "."
                      ; 0-9, A-F, a-f, :, .
    dns-name    = 1*255dns-char
    dns-char    = DIGIT / ALPHA / "-" / "."

In other words, the server name is the hostname, followed by an optional
numeric port specifier. The hostname may be a dotted-quad IPv4 address literal,
an IPv6 address literal surrounded with square brackets, or a DNS name.
IPv4 literals must be a sequence of four decimal numbers in the
range 0 to 255, separated by ``.``. IPv6 literals must be as specified by
`RFC3513, section 2.2 <https://tools.ietf.org/html/rfc3513#section-2.2>`_.
DNS names for use with Matrix should follow the conventional restrictions for
internet hostnames: they should consist of a series of labels separated by
``.``, where each label consists of alphanumeric characters or hyphens.
Examples of valid server names are:
* ``matrix.org``
* ``matrix.org:8888``
* ``1.2.3.4`` (IPv4 literal)
* ``1.2.3.4:1234`` (IPv4 literal with explicit port)
* ``[1234:5678::abcd]`` (IPv6 literal)
* ``[1234:5678::abcd]:5678`` (IPv6 literal with explicit port)
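
As a rough, non-normative illustration, the grammar above can be transcribed
into a regular expression. The function name ``parse_server_name`` is
hypothetical, and the sketch makes two assumptions: a DNS name must contain at
least one character, and only the syntax is checked, so neither the 0 to 255
range of IPv4 octets nor the RFC 3513 form of IPv6 literals is enforced.

```python
import re

# Patterns transcribed from the ABNF above (syntax only; assumption: a DNS
# name must be non-empty).
_IPV4 = r"[0-9]{1,3}(?:\.[0-9]{1,3}){3}"
_IPV6 = r"\[[0-9A-Fa-f:.]{2,45}\]"
_DNS = r"[0-9A-Za-z.-]{1,255}"
_SERVER_NAME = re.compile(
    "(?P<host>" + _IPV4 + "|" + _IPV6 + "|" + _DNS + ")"
    "(?::(?P<port>[0-9]{1,5}))?"
)


def parse_server_name(server_name):
    """Split a server name into (hostname, port); port is None when absent."""
    match = _SERVER_NAME.fullmatch(server_name)
    if match is None:
        raise ValueError("not a valid server name: %r" % (server_name,))
    port = match.group("port")
    return match.group("host"), (int(port) if port is not None else None)
```

For instance, ``parse_server_name("matrix.org:8888")`` yields the hostname
``matrix.org`` and the integer port ``8888``.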
.. Note::
This grammar is based on the standard for internet host names, as specified
by `RFC1123, section 2.1 <https://tools.ietf.org/html/rfc1123#page-13>`_,
with an extension for IPv6 literals.
Server names must be treated case-sensitively: in other words,
``@user:matrix.org`` is a different person from ``@user:MATRIX.ORG``.
Some recommendations for a choice of server name follow:
* The length of the complete server name should not exceed 230 characters.
* Server names should not use upper-case characters.
Common Identifier Format
~~~~~~~~~~~~~~~~~~~~~~~~
The Matrix protocol uses a common format to assign unique identifiers to a
number of entities, including users, events and rooms. Each identifier takes
the form::

    &localpart:domain

where ``&`` represents a 'sigil' character; ``domain`` is the `server name`_ of
the homeserver which allocated the identifier, and ``localpart`` is an
identifier allocated by that homeserver.
The sigil characters are as follows:
* ``@``: User ID
* ``!``: Room ID
* ``$``: Event ID
* ``+``: Group ID
* ``#``: Room alias
The precise grammar defining the allowable format of an identifier depends on
the type of identifier.
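
The common format can be illustrated with a small sketch that classifies an
identifier by its sigil and splits it at the first ``:``. The names ``SIGILS``
and ``split_identifier`` are hypothetical. The sketch assumes that the
localpart never contains ``:`` and that the identifier carries a domain, which
may not hold for event IDs in every room version.

```python
# Sigil characters and the kind of identifier each one introduces.
SIGILS = {
    "@": "user ID",
    "!": "room ID",
    "$": "event ID",
    "+": "group ID",
    "#": "room alias",
}


def split_identifier(identifier):
    """Return (kind, localpart, domain) for an identifier of the common form."""
    sigil = identifier[:1]
    if sigil not in SIGILS:
        raise ValueError("unknown sigil in %r" % (identifier,))
    # Split at the FIRST colon: the domain may itself contain colons
    # (an explicit port, or an IPv6 literal).
    localpart, sep, domain = identifier[1:].partition(":")
    if not sep or not localpart or not domain:
        raise ValueError("expected <sigil>localpart:domain in %r" % (identifier,))
    return SIGILS[sigil], localpart, domain
```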
User Identifiers
++++++++++++++++
Users within Matrix are uniquely identified by their Matrix user ID. The user
ID is namespaced to the homeserver which allocated the account and has the
form::

    @localpart:domain

The ``localpart`` of a user ID is an opaque identifier for that user. It MUST
NOT be empty, and MUST contain only the characters ``a-z``, ``0-9``, ``.``,
``_``, ``=``, ``-``, and ``/``.
The ``domain`` of a user ID is the `server name`_ of the homeserver which
allocated the account.
The length of a user ID, including the ``@`` sigil and the domain, MUST NOT
exceed 255 characters.
The complete grammar for a legal user ID is::

    user_id = "@" user_id_localpart ":" server_name
    user_id_localpart = 1*user_id_char
    user_id_char = DIGIT
                 / %x61-7A                   ; a-z
                 / "-" / "." / "=" / "_" / "/"

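
A minimal sketch of a check against this grammar follows. The function name
``is_valid_user_id`` is hypothetical, and the sketch only requires the domain
to be non-empty; it is an assumption of the example that the server name is
validated separately.

```python
import re

# Characters permitted in a user ID localpart: a-z, 0-9, and - . = _ /
_USER_ID_LOCALPART = re.compile(r"[0-9a-z\-.=_/]+")
MAX_USER_ID_LENGTH = 255  # includes the "@" sigil and the domain


def is_valid_user_id(user_id):
    """Check the sigil, localpart characters and overall length of a user ID."""
    if len(user_id) > MAX_USER_ID_LENGTH or not user_id.startswith("@"):
        return False
    # The localpart cannot contain ":", so the first colon ends it.
    localpart, sep, domain = user_id[1:].partition(":")
    if not sep or not domain:
        return False
    return _USER_ID_LOCALPART.fullmatch(localpart) is not None
```

Under these assumptions ``@alice:example.org`` passes, while
``@Alice:example.org`` is rejected because of the upper-case letter.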
.. admonition:: Rationale
A number of factors were considered when defining the allowable characters
for a user ID.
Firstly, we chose to exclude characters outside the basic US-ASCII character
set. User IDs are primarily intended for use as an identifier at the protocol
level, and their use as a human-readable handle is of secondary
benefit. Furthermore, they are useful as a last-resort differentiator between
users with similar display names. Allowing the full unicode character set
would make it very difficult for a human to distinguish two similar user IDs. The
limited character set used has the advantage that even a user unfamiliar with
the Latin alphabet should be able to distinguish similar user IDs manually, if
somewhat laboriously.
We chose to disallow upper-case characters because we do not consider it
valid to have two user IDs which differ only in case: indeed it should be
possible to reach ``@user:matrix.org`` as ``@USER:matrix.org``. However,
user IDs are necessarily used in a number of situations which are inherently
case-sensitive (notably in the ``state_key`` of ``m.room.member``
events). Forbidding upper-case characters (and requiring homeservers to
downcase usernames when creating user IDs for new users) is a relatively simple
way to ensure that ``@USER:matrix.org`` cannot refer to a different user to
``@user:matrix.org``.
Finally, we decided to restrict the allowable punctuation to a very basic set
to reduce the possibility of conflicts with special characters in various
situations. For example, "*" is used as a wildcard in some APIs (notably the
filter API), so it cannot be a legal user ID character.
The length restriction is derived from the limit on the length of the
``sender`` key on events; since the user ID appears in every event sent by the
user, it is limited to ensure that the user ID does not dominate over the actual
content of the events.
Matrix user IDs are sometimes informally referred to as MXIDs.
Historical User IDs
<<<<<<<<<<<<<<<<<<<
Older versions of this specification were more tolerant of the characters
permitted in user ID localparts. There are currently active users whose user
IDs do not conform to the permitted character set, and a number of rooms whose
history includes events with a ``sender`` which does not conform. In order to
handle these rooms successfully, clients and servers MUST accept user IDs with
localparts from the expanded character set::

    extended_user_id_char = %x21-39 / %x3B-7E  ; all ascii printing chars except :

Mapping from other character sets
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
In certain circumstances it will be desirable to map from a wider character set
onto the limited character set allowed in a user ID localpart. Examples include
a homeserver creating a user ID for a new user based on the username passed to
``/register``, or a bridge mapping user IDs from another protocol.
.. TODO-spec
We need to better define the mechanism by which homeservers can allow users
to have non-Latin login credentials. The general idea is for clients to pass
the non-Latin string in the ``username`` field to ``/register`` and ``/login``, and
the HS then maps it onto the MXID space when turning it into the
fully-qualified ``user_id`` which is returned to the client and used in
events.
Implementations are free to do this mapping however they choose. Since the user
ID is opaque except to the implementation which created it, the only
requirement is that the implementation can perform the mapping
consistently. However, we suggest the following algorithm:
1. Encode character strings as UTF-8.
2. Convert the bytes ``A-Z`` to lower-case.
* In the case where a bridge must be able to distinguish two different users
with IDs which differ only by case, escape upper-case characters by
prefixing with ``_`` before downcasing. For example, ``A`` becomes
``_a``. Escape a real ``_`` with a second ``_``.
3. Encode any remaining bytes outside the allowed character set, as well as
``=``, as their hexadecimal value, prefixed with ``=``. For example, ``#``
becomes ``=23``; ``á`` becomes ``=c3=a1``.
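
The suggested algorithm can be sketched as follows. This is an illustration
only, since implementations are free to map however they choose. The function
name ``map_to_localpart`` and the ``preserve_case`` flag (which selects the
``_`` escaping described for bridges) are hypothetical.

```python
# Bytes that may appear unescaped in the result. "=" is deliberately left
# out because it is the escape character and must itself be encoded.
_UNESCAPED = frozenset(b"abcdefghijklmnopqrstuvwxyz0123456789-._/")


def map_to_localpart(name, preserve_case=False):
    """Map an arbitrary string onto the user ID localpart character set."""
    out = []
    for byte in name.encode("utf-8"):            # step 1: UTF-8 bytes
        if 0x41 <= byte <= 0x5A:                  # step 2: A-Z
            lower = chr(byte + 0x20)
            out.append("_" + lower if preserve_case else lower)
        elif byte == 0x5F and preserve_case:      # a real "_" is doubled
            out.append("__")
        elif byte in _UNESCAPED:
            out.append(chr(byte))
        else:                                     # step 3: =xx hex escape
            out.append("=%02x" % byte)
    return "".join(out)
```

With ``preserve_case=True``, ``Alice`` maps to ``_alice`` and ``a_b`` maps to
``a__b``; without it, ``Alice`` simply becomes ``alice``.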
.. admonition:: Rationale
The suggested mapping is an attempt to preserve human-readability of simple
ASCII identifiers (unlike, for example, base-32), whilst still allowing
representation of *any* character (unlike punycode, which provides no way to
encode ASCII punctuation).
Room IDs and Event IDs
++++++++++++++++++++++
A room has exactly one room ID. A room ID has the format::

    !opaque_id:domain

An event has exactly one event ID. The format of an event ID depends upon the
`room version specification <index.html#room-versions>`_.
The ``domain`` of a room ID is the `server name`_ of the homeserver which
created the room/event. The domain is used only for namespacing to avoid the
risk of clashes of identifiers between different homeservers. There is no
implication that the room or event in question is still available at the
corresponding homeserver.
Event IDs and Room IDs are case-sensitive. They are not meant to be human
readable. They are intended to be treated as fully opaque strings by clients.
.. TODO-spec
What is the grammar for the opaque part? https://matrix.org/jira/browse/SPEC-389
Group Identifiers
+++++++++++++++++
Groups within Matrix are uniquely identified by their group ID. The group
ID is namespaced to the group server which hosts this group and has the
form::

    +localpart:domain

The ``localpart`` of a group ID is an opaque identifier for that group. It MUST
NOT be empty, and MUST contain only the characters ``a-z``, ``0-9``, ``.``,
``_``, ``=``, ``-``, and ``/``.
The ``domain`` of a group ID is the `server name`_ of the group server which
hosts this group.
The length of a group ID, including the ``+`` sigil and the domain, MUST NOT
exceed 255 characters.
The complete grammar for a legal group ID is::

    group_id = "+" group_id_localpart ":" server_name
    group_id_localpart = 1*group_id_char
    group_id_char = DIGIT
                  / %x61-7A                   ; a-z
                  / "-" / "." / "=" / "_" / "/"

Room Aliases
++++++++++++
A room may have zero or more aliases. A room alias has the format::

    #room_alias:domain

The ``domain`` of a room alias is the `server name`_ of the homeserver which
created the alias. Other servers may contact this homeserver to look up the
alias.
Room aliases MUST NOT exceed 255 bytes (including the ``#`` sigil and the
domain).
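
Because the limit is expressed in bytes rather than characters, a check has to
measure the encoded form. The sketch below assumes UTF-8 encoding, and the
function name ``room_alias_within_limit`` is hypothetical.

```python
MAX_ROOM_ALIAS_BYTES = 255  # includes the "#" sigil and the domain


def room_alias_within_limit(alias):
    """Return True if the alias fits the byte limit (UTF-8 is assumed)."""
    return len(alias.encode("utf-8")) <= MAX_ROOM_ALIAS_BYTES
```

An alias of fewer than 255 characters can therefore still exceed the limit if
it contains multi-byte characters.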
.. TODO-spec
- Need to specify precise grammar for Room Aliases. https://matrix.org/jira/browse/SPEC-391
matrix.to navigation
++++++++++++++++++++
.. NOTE::
This namespacing is in place pending a ``matrix://`` (or similar) URI scheme.
This is **not** meant to be interpreted as an available web service - see
below for more details.
Rooms, users, aliases, and groups may be represented as a "matrix.to" URI.
This URI can be used to reference particular objects in a given context, such
as mentioning a user in a message or linking someone to a particular point
in the room's history (a permalink).
A matrix.to URI has the following format, based upon the specification defined
in RFC 3986::

    https://matrix.to/#/<identifier>/<extra parameter>?<additional arguments>

The identifier may be a room ID, room alias, user ID, or group ID. The extra
parameter is only used in the case of permalinks where an event ID is referenced.
The matrix.to URI, when referenced, must always start with ``https://matrix.to/#/``
followed by the identifier.
The ``<additional arguments>`` and the preceding question mark are optional and
only apply in certain circumstances, documented below.
Clients should not rely on matrix.to URIs falling back to a web server if accessed
and instead should perform some sort of action within the client. For example, if
the user were to click on a matrix.to URI for a room alias, the client may open
a view for the user to participate in the room.
The components of the matrix.to URI (``<identifier>`` and ``<extra parameter>``)
are to be percent-encoded as per RFC 3986.
Examples of matrix.to URIs are:
* Room alias: ``https://matrix.to/#/%23somewhere%3Aexample.org``
* Room: ``https://matrix.to/#/!somewhere%3Aexample.org``
* Permalink by room: ``https://matrix.to/#/!somewhere%3Aexample.org/%24event%3Aexample.org``
* Permalink by room alias: ``https://matrix.to/#/%23somewhere%3Aexample.org/%24event%3Aexample.org``
* User: ``https://matrix.to/#/%40alice%3Aexample.org``
* Group: ``https://matrix.to/#/%2Bexample%3Aexample.org``
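
The examples above can be reproduced with a short sketch built on
``urllib.parse.quote``. The function name ``matrix_to_uri`` is hypothetical,
and leaving ``!`` unencoded is a choice made here only to match the room ID
examples in the list.

```python
from urllib.parse import quote

MATRIX_TO_PREFIX = "https://matrix.to/#/"


def matrix_to_uri(identifier, event_id=None):
    """Build a matrix.to URI, percent-encoding each component."""
    # safe="!" keeps the room ID sigil readable, as in the examples;
    # every other reserved character, including "/", is encoded.
    uri = MATRIX_TO_PREFIX + quote(identifier, safe="!")
    if event_id is not None:
        uri += "/" + quote(event_id, safe="!")
    return uri
```

For example, ``matrix_to_uri("!somewhere:example.org", "$event:example.org")``
produces the "Permalink by room" URI shown above.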
.. Note::
Historically, clients have not produced URIs which are fully encoded. Clients should
try to interpret these cases to the best of their ability. For example, an unencoded
room alias should still work within the client if possible.
.. Note::
Clients should be aware that decoding a matrix.to URI may result in extra slashes
appearing due to some `room versions <index.html#room-versions>`_. These slashes
should normally be encoded when producing matrix.to URIs, however.
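
One way a client might interpret both encoded and unencoded URIs is sketched
below. The function name ``parse_matrix_to_uri`` is hypothetical. The sketch
assumes that the identifier itself contains no ``/``, so everything after the
first slash is treated as the extra parameter, which also tolerates event IDs
whose slashes were left unencoded.

```python
from urllib.parse import unquote

MATRIX_TO_PREFIX = "https://matrix.to/#/"


def parse_matrix_to_uri(uri):
    """Return (identifier, extra_parameter); the latter is None when absent."""
    if not uri.startswith(MATRIX_TO_PREFIX):
        raise ValueError("not a matrix.to URI: %r" % (uri,))
    # Drop any additional arguments, then split at the first slash only.
    path, _, _query = uri[len(MATRIX_TO_PREFIX):].partition("?")
    identifier, _, extra = path.partition("/")
    # unquote() leaves already-decoded text untouched, so unencoded
    # historical URIs are accepted as well.
    return unquote(identifier), (unquote(extra) if extra else None)
```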
Routing
<<<<<<<
Room IDs are not routable on their own as there is no reliable domain to send requests
to. This is partially mitigated by the ``via`` argument on a matrix.to
URI, but the problem of routability remains. Clients should do their best
to route room IDs to where they need to go, and should also be aware of
`issue #1579 <https://github.com/matrix-org/matrix-doc/issues/1579>`_.
A room (or room permalink) which isn't using a room alias should supply at least one
server using ``via`` in the ``<additional arguments>``, like so:
``https://matrix.to/#/!somewhere%3Aexample.org?via=example.org&via=alt.example.org``. The
parameter can be supplied multiple times to specify multiple servers to try.
The values of ``via`` are intended to be passed along as the ``server_name`` parameters
on the Client Server ``/join`` API.
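
A sketch of adding and reading back the ``via`` arguments with the standard
``urllib.parse`` helpers follows. The function names ``with_via`` and
``via_servers`` are hypothetical, and the input URI is assumed not to carry a
query string already.

```python
from urllib.parse import parse_qs, urlencode


def with_via(uri, servers):
    """Append one ``via`` argument per server to a matrix.to URI."""
    if not servers:
        return uri
    # A list of pairs lets the same key repeat once per server.
    return uri + "?" + urlencode([("via", server) for server in servers])


def via_servers(uri):
    """Return the ``via`` values, e.g. to pass on as ``server_name``."""
    _, _, query = uri.partition("?")
    return parse_qs(query).get("via", [])
```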
When generating room links and permalinks, the application should pick servers which
have a high probability of being in the room in the distant future. How these servers
are picked is left as an implementation detail, however the current recommendation is
to pick 3 unique servers based on the following criteria:
* The first server should be the server of the highest power level user in the room,
provided they are at least power level 50. If no user meets this criterion, pick the
most popular server in the room (most joined users). The rationale for not picking
users with power levels under 50 is that they are unlikely to be around into the
distant future while higher ranking users (and therefore servers) are less likely
to give up their power and move somewhere else. Most rooms in the public federation
have a power level 100 user and have not deviated from the default structure where
power level 50 users have moderator-style privileges.
* The second server should be the next highest server by population, or the first
highest by population if the first server was based on a user's power level. The
rationale for picking popular servers is that the server is unlikely to be removed
as the room naturally grows in membership due to that server joining users. The
server could be refused participation in the future due to server ACLs or similar,
however that is unlikely to happen to a server which is organically joining the
room.
* The third server should be the next highest server by population.
* Servers which are blocked due to server ACLs should never be chosen.
* Servers which are IP addresses should never be chosen. Servers which use a domain
name are less likely to be unroutable in the future whereas IP addresses cannot be
pointed to a different location and are therefore higher-risk options.
* All 3 servers should be unique from each other. If the room does not have enough users
to supply 3 servers, the application should only specify the servers it can. For example,
a room with only 2 users in it would result in at most 2 ``via`` parameters.
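
The criteria above might be implemented along the following lines. This is a
sketch rather than a reference implementation. The function name
``pick_via_servers``, its parameters and the ``is_acl_allowed`` callback are
hypothetical. The sketch also assumes that when the highest-powered user's
server is ineligible, the next user at power level 50 or above is tried, a
case the text does not specify.

```python
import ipaddress
from collections import Counter


def _server_of(user_id):
    # The domain is everything after the first colon of the user ID.
    return user_id.partition(":")[2]


def _is_ip_literal(server_name):
    if server_name.startswith("["):               # bracketed IPv6 literal
        return True
    host = server_name.rsplit(":", 1)[0] if ":" in server_name else server_name
    try:
        ipaddress.ip_address(host)
    except ValueError:
        return False
    return True


def pick_via_servers(joined_user_ids, power_levels,
                     is_acl_allowed=lambda server: True, limit=3):
    """Pick up to ``limit`` unique candidate servers for ``via``."""
    population = Counter(_server_of(u) for u in joined_user_ids)

    def eligible(server):
        return bool(server) and is_acl_allowed(server) and not _is_ip_literal(server)

    chosen = []
    # First choice: the server of the highest-powered joined user (>= 50).
    powerful = sorted(
        (u for u in joined_user_ids if power_levels.get(u, 0) >= 50),
        key=lambda u: power_levels[u], reverse=True)
    for user_id in powerful:
        if eligible(_server_of(user_id)):
            chosen.append(_server_of(user_id))
            break
    # Remaining choices: most populous servers, skipping duplicates.
    for server, _count in population.most_common():
        if len(chosen) >= limit:
            break
        if server not in chosen and eligible(server):
            chosen.append(server)
    return chosen
```

If no user reaches power level 50, the first loop selects nothing and the
result is simply the most populous eligible servers.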