Proposal for human ID rules.

Includes handling of namespaces for bots, handling of capitalisation, spoof
checks and escape sequences.
pull/977/head
Kegan Dougal 10 years ago
parent 8cd5fa822f
commit 4f3ee12409

@@ -1,5 +1,21 @@
 This document outlines the format for human-readable IDs within matrix.
+Summary
+-------
+- Human-readable IDs are Room Aliases and User IDs.
+- They MUST be Unicode as UTF-8.
+- If spoof checks fail, the user ID in question MUST be rewritten to be punycode
+with an additional ``@`` prefix.
+Room aliases cannot be rewritten.
+- Spoof Checks:
+- MUST NOT contain one of the 107 blacklisted characters on this list:
+http://kb.mozillazine.org/Network.IDN.blacklist_chars
+- MUST NOT contain characters from >1 language, defined by
+http://cldr.unicode.org/
+- User IDs MUST NOT contain a ``:`` or start with a ``@`` or ``.``
+- Room aliases MUST NOT contain a ``:``
+- User IDs SHOULD be case-insensitive.
 Overview
 --------
 UTF-8 is quickly becoming the standard character encoding set on the web. As
@@ -10,16 +26,16 @@ identify different users. In addition, there are non-printable characters which
 cannot be rendered by the end-user. This opens up a security vulnerability with
 phishing/spoofing of IDs, commonly known as a homograph attack.
-Web browers encountered this problem when International Domain Names were
+Web browsers encountered this problem when International Domain Names were
 introduced. A variety of checks were put in place in order to protect users. If
 an address failed the check, the raw punycode would be displayed to
 disambiguate the address. Similar checks are performed by home servers in
-Matrix. However, Matrix does not use punycode representations, and so does not
-show raw punycode on a failed check. Instead, home servers must outright reject
-these misleading IDs.
+Matrix in order to protect users. In the event of a failed check, the raw
+punycode is displayed as the user ID along with a special escape sequence to
+indicate the change.
 Types of human-readable IDs
----------------------------
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
 There are two main human-readable IDs in question:
 - Room aliases
@@ -28,54 +44,95 @@ There are two main human-readable IDs in question:
 Room aliases look like ``#localpart:domain``. These aliases point to opaque
 non human-readable room IDs. These pointers can change, so there is already an
 issue present with the same ID pointing to a different destination at a later
-date.
+date. Checks SHOULD be applied to room aliases, but they cannot be renamed in
+punycode as that would break the alias. As a result, the checks in this document
+apply to user IDs, although HSes may wish to enforce them on room alias
+creation.
 User IDs look like ``@localpart:domain``. These represent actual end-users, and
 unlike room aliases, there is no layer of indirection. This presents a much
-greater concern with homograph attacks.
-Checks
-------
-- Similar to web browsers.
-- blacklisted chars (e.g. non-printable characters)
-- mix of language sets from 'preferred' language not allowed.
-- Language sets from CLDR dataset.
-- Treated in segments (localpart, domain)
-- Additional restrictions for ease of processing IDs.
-- Room alias localparts MUST NOT have ``#`` or ``:``.
-- User ID localparts MUST NOT have ``@`` or ``:``.
-Rejecting
----------
-- Home servers MUST reject room aliases which do not pass the check, both on
-GETs and PUTs.
-- Home servers MUST reject user ID localparts which do not pass the check, both
-on creation and on events.
-- Any home server whose domain does not pass this check, MUST use their punycode
-domain name instead of the IDN, to prevent other home servers rejecting you.
-- Error code is ``M_FAILED_HUMAN_ID_CHECK``. (generic enough for both failing
-due to homograph attacks, and failing due to including ``:`` s, etc)
-- Error message MAY go into further information about which characters were
+greater concern with homograph attacks. Checks MUST be applied to user IDs.
+Spoof Checks
+------------
+First, each ID is split into segments (localpart/domain) around the ``:``. For
+this reason, ``:`` is a reserved character and cannot be a localpart or domain
+character.
+User IDs which start with an ``@`` are used as an escape sequence for failed
+user IDs. As a result, the localpart MUST NOT start with an ``@`` in order to
+avoid namespace clashes.
+The checks are similar to web browsers for IDNs. The first check is that the
+segment MUST NOT contain a blacklisted character on this list:
+http://kb.mozillazine.org/Network.IDN.blacklist_chars - NB: Even though
+this is Mozilla, Chrome follows the same list as per
+http://www.chromium.org/developers/design-documents/idn-in-google-chrome
+The second check is that it MUST NOT contain characters from more than 1
+language. This is defined by this dataset http://cldr.unicode.org/ and is
+applied after stripping " 0-9, +, -, [, ], _, and the space character"
+( http://www.chromium.org/developers/design-documents/idn-in-google-chrome )
+Consequences of a failed check
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+If a user ID fails the check, the user ID on the event is renamed. This is
+possible because user IDs contain routing information. This doesn't require
+extra work for clients, and users will see an odd user ID rather than a spoofed
+name. Renaming is done in order to protect users of a given HS, so if a
+malicious HS doesn't rename their IDs, it doesn't affect any other HS.
+- The HS MAY reject the creation of the room alias or user ID. This is the
+preferred choice but it is entirely benevolent: other HSes may not apply this
+rule so checks on incoming events MUST still be applied. The error code returned
+for the rejection is ``M_FAILED_HUMAN_ID_CHECK``, which is generic enough for
+both failing due to homograph attacks, and failing due to including ``:`` s.
+Error message MAY go into further information about which characters were
 rejected and why.
-- Error message SHOULD contain a ``failed_keys`` key which contains an array
-of strings which represent the keys which failed the check e.g::
-failed_keys: [ user_id, room_alias ]
+- The HS MUST rename the localpart which failed the check. It SHOULD be
+represented as punycode. The HS MUST prefix the punycode with the escape
+sequence ``@`` on user ID localparts, e.g. ``@@somepunycode:domain``. Room
+aliases do not need to be escaped, and indeed they cannot be, as the originating
+HS will not understand the rewritten alias. If a HS renames a user ID, it MUST
+be able to apply the reverse mapping in case the user wishes to communicate with
+the ID which failed the check.
-Other considerations
---------------------
-- Basic security: Informational key on the event attached by HS to say "unsafe
+Other rejected solutions for failed checks
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+- Additional key: Informational key on the event attached by HS to say "unsafe
 ID". Problem: clients can just ignore it, and since it will appear only very
 rarely, easy to forget when implementing clients.
-- Moderate security: Requires client handshake. Forces clients to implement
+- Require client handshake: Forces clients to implement
 a check, else they cannot communicate with the misleading ID. However, this
 is extra overhead in both client implementations and round-trips.
-- High security: Outright rejection of the ID at the point of creation /
+- Reject event: Outright rejection of the ID at the point of creation /
 receiving event. Point of creation rejection is preferable to avoid the ID
 entering the system in the first place. However, malicious HSes can just
 allow the ID. Hence, other home servers must reject them if they see them in
 events. Client never sees the problem ID, provided the HS is correctly
-implemented.
-- High security decided; client doesn't need to worry about it, no additional
-protocol complexity aside from rejection of an event.
+implemented. However, it is difficult to ensure that ALL HSes will come to the
+same conclusion (given the CLDR dataset does come out with new versions).
+Namespacing
+-----------
+Bots
+~~~~
+User IDs representing real users SHOULD NOT start with a ``.``. User IDs which
+act on behalf of a real user (e.g. an IRC/XMPP bot) SHOULD start with a ``.``.
+This namespaces real/generated user IDs. Further namespacing SHOULD be applied
+based on the service being used, getting progressively more specific, similar to
+event types: e.g. ``@.irc.freenode.matrix.<username>:domain``. Ultimately, the
+HS in question has control over their user ID namespace, so this is just a
+recommendation.
+Additional recommendations
+--------------------------
+Capitalisation
+~~~~~~~~~~~~~~
+User IDs SHOULD be case-insensitive. This SHOULD be applied based on the
+capitalisation rules in the CLDR dataset: http://cldr.unicode.org/

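Reviewer note: the bot namespacing and case-insensitivity recommendations could look something like this. It is a sketch only. The helper names are hypothetical, and ``str.casefold()`` (Unicode default case folding) is used as a stand-in for the CLDR capitalisation rules the proposal points to, which may differ for some locales.

```python
def bot_user_id(service_path, username, domain):
    """Build '@.<service path>.<username>:domain' for a generated user."""
    parts = list(service_path) + [username]
    return "@." + ".".join(parts) + ":" + domain


def is_generated(user_id):
    """Generated (bot/bridge) IDs start with '.' after the '@' sigil."""
    return user_id.startswith("@.")


def same_user(a, b):
    """Case-insensitive comparison; casefold() approximates the CLDR rules."""
    return a.casefold() == b.casefold()


irc_alice = bot_user_id(["irc", "freenode", "matrix"], "alice", "example.org")
print(irc_alice)
print(is_generated(irc_alice), is_generated("@alice:example.org"))
print(same_user("@Alice:example.org", "@alice:example.org"))
```

Since each home server controls its own user ID namespace, the service path here is only a convention, matching the ``@.irc.freenode.matrix.<username>:domain`` example in the proposal.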