Merge pull request #3 from matrix-org/human-id-rules

Proposal for human ID rules.
7 years ago · a7c28fdf43
parent d80a0192cd aebfcda320
commit a7c28fdf43
1 changed files with 114 additions and 63 deletions
--- a/drafts/human-id-rules.rst
+++ b/drafts/human-id-rules.rst
@@ -1,81 +1,132 @@
-This document outlines the format for human-readable IDs within matrix.
+Abstract
 ========
-Overview
+This document outlines the format for human-readable IDs within Matrix.
--------
+
-UTF-8 is quickly becoming the standard character encoding set on the web. As
+Background
-such, Matrix requires that all strings MUST be encoded as UTF-8. However,
+----------
 UTF-8 is the dominant character encoding for Unicode on the web. However,
 using Unicode as the character set for human-readable IDs is troublesome. There
 are many different characters which appear identical to each other, but would
-identify different users. In addition, there are non-printable characters which
+produce different IDs. In addition, there are non-printable characters which
-cannot be rendered by the end-user. This opens up a security vulnerability with
+cannot be rendered by the end-user. This creates an opportunity for
 phishing/spoofing of IDs, commonly known as a homograph attack.
-Web browers encountered this problem when International Domain Names were
+Web browsers encountered this problem when International Domain Names were
 introduced. A variety of checks were put in place in order to protect users. If
 an address failed the check, the raw punycode would be displayed to
-disambiguate the address. Similar checks are performed by homeservers in
+disambiguate the address.
 Matrix. However, Matrix does not use punycode representations, and so does not
 show raw punycode on a failed check. Instead, homeservers must outright reject
 these misleading IDs.
-Types of human-readable IDs
+The human-readable IDs in Matrix are Room Aliases and User IDs.
---------------------------
+Room aliases look like ``#localpart:domain``. These aliases point to opaque
-There are two main human-readable IDs in question:
+non-human-readable room IDs. These pointers can change to point at a different
 room ID at any time. User IDs look like ``@localpart:domain``. These represent
 actual end-users (there is no indirection).
- Room aliases
+Proposal
- User IDs
+========
-Room aliases look like ``#localpart:domain``. These aliases point to opaque
+User IDs and Room Aliases MUST be Unicode encoded as UTF-8. Checks are performed on
-non human-readable room IDs. These pointers can change, so there is already an
+these IDs by homeservers to protect users from phishing/spoofing attacks.
-issue present with the same ID pointing to a different destination at a later
+These checks are:
-date.
+
-
+User ID Localparts:
-User IDs look like ``@localpart:domain``. These represent actual end-users, and
+ - MUST NOT contain a ``:`` or start with a ``@`` or ``.``
-unlike room aliases, there is no layer of indirection. This presents a much
+ - MUST NOT contain one of the 107 blacklisted characters on this list:
-greater concern with homograph attacks.
+     http://kb.mozillazine.org/Network.IDN.blacklist_chars
-
+ - After stripping 0-9, +, -, [, ], _, and the space character, it MUST NOT
-Checks
+   contain characters from >1 language, defined by the `exemplar characters`_
------
+   on http://cldr.unicode.org/
- Similar to web browsers.
+
- blacklisted chars (e.g. non-printable characters)
+.. _exemplar characters: http://cldr.unicode.org/translation/characters#TOC-Exemplar-Characters
- mix of language sets from 'preferred' language not allowed.
+
- Language sets from CLDR dataset.
+Room Alias Localparts:
- Treated in segments (localpart, domain)
+ - MUST NOT contain a ``:``
- Additional restrictions for ease of processing IDs.
+ - MUST NOT contain one of the 107 blacklisted characters on this list:
-
+   http://kb.mozillazine.org/Network.IDN.blacklist_chars
-  - Room alias localparts MUST NOT have ``#`` or ``:``.
+ - After stripping 0-9, +, -, [, ], _, and the space character, it MUST NOT
-  - User ID localparts MUST NOT have ``@`` or ``:``.
+   contain characters from >1 language, defined by the `exemplar characters`_
-
+   on http://cldr.unicode.org/
-Rejecting
+
---------
+.. _exemplar characters: http://cldr.unicode.org/translation/characters#TOC-Exemplar-Characters
- Homeservers MUST reject room aliases which do not pass the check, both on 
+
-  GETs and PUTs.
+In the event of a failed user ID check, well behaved homeservers MUST:
- Homeservers MUST reject user ID localparts which do not pass the check, both
+ - Rewrite user IDs in the offending events to be punycode with an additional ``@``
-  on creation and on events.
+   prefix **before** delivering them to clients. There are no guarantees for
- Any homeserver whose domain does not pass this check, MUST use their punycode
+   consistency between homeserver ID checking implementations. As a result, user
-  domain name instead of the IDN, to prevent other homeservers rejecting you.
+   IDs MUST be sent in their *original* form over federation. This can be done in
- Error code is ``M_FAILED_HUMAN_ID_CHECK``. (generic enough for both failing 
+   a stateless manner as the punycode form has no information loss.
-  due to homograph attacks, and failing due to including ``:`` s, etc)
+
- Error message MAY go into further information about which characters were
+In the event of a failed room alias check, well behaved homeservers MUST:
-  rejected and why.
+ - Send an HTTP status code 400 with an ``errcode`` of ``M_FAILED_HUMAN_ID_CHECK``
- Error message SHOULD contain a ``failed_keys`` key which contains an array
+   to the client if the client is attempting to *create* this alias.
-  of strings which represent the keys which failed the check e.g::
+ - Send an HTTP status code 400 with an ``errcode`` of ``M_FAILED_HUMAN_ID_CHECK``
-
+   to the client if the client is attempting to *join* a room via this alias.
-    failed_keys: [ user_id, room_alias ]
+
-
+Examples::
-Other considerations
+
--------------------
+  @ebаy:domain.com (Cyrillic 'a', everything else English)
- Basic security: Informational key on the event attached by HS to say "unsafe
+  @@xn--eby-7cd:domain.com (Punycode with additional '@')
 Homeservers SHOULD NOT allow two user IDs that differ only by case. This
 SHOULD be applied based on the capitalisation rules in the CLDR dataset:
 http://cldr.unicode.org/
 This check SHOULD be applied when the user ID is created, in order to prevent
 registration with the same name and different capitalisations, e.g.
 ``@foo:bar`` vs ``@Foo:bar`` vs ``@FOO:bar``. Homeservers MAY canonicalise
 the user ID to be completely lower-case if desired.
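
One way a homeserver might enforce this at registration time is sketched below. It uses Python's ``str.casefold()`` as a stand-in for the CLDR capitalisation rules, which is an assumption and not part of this proposal. The ``UserRegistry`` class and its method are hypothetical names.

```python
class UserRegistry:
    """Hypothetical in-memory registry that rejects case-only duplicates."""

    def __init__(self) -> None:
        # Maps the case-folded user ID to the ID as originally registered.
        self._by_folded: dict[str, str] = {}

    def register(self, user_id: str) -> bool:
        """Return True if registered, False if a case variant already exists."""
        # Assumption: casefold() approximates the CLDR case mapping.
        folded = user_id.casefold()
        if folded in self._by_folded:
            return False
        self._by_folded[folded] = user_id
        return True


registry = UserRegistry()
print(registry.register("@foo:bar"))  # True
print(registry.register("@Foo:bar"))  # False
print(registry.register("@FOO:bar"))  # False
```

A homeserver that prefers to canonicalise could store and return the folded form instead of the original.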
 Rationale
 =========
 Each ID is split into segments (localpart/domain) around the ``:``. For
 this reason, ``:`` is a reserved character and cannot be a localpart character.
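
The split described above could look roughly like the following sketch. The function name ``split_id`` and the choice to raise ``ValueError`` on malformed input are assumptions made for illustration.

```python
def split_id(human_id: str) -> tuple[str, str, str]:
    """Split '@localpart:domain' or '#localpart:domain' into its parts."""
    sigil, rest = human_id[:1], human_id[1:]
    if sigil not in ("@", "#"):
        raise ValueError("ID must start with '@' or '#'")
    # Because ':' is reserved, the first ':' always ends the localpart.
    localpart, sep, domain = rest.partition(":")
    if not sep or not localpart or not domain:
        raise ValueError("ID must look like <sigil>localpart:domain")
    return sigil, localpart, domain


print(split_id("@alice:example.com"))  # ('@', 'alice', 'example.com')
```

Splitting on the first ``:`` only is safe precisely because the localpart can never contain one.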
 The 107 blacklisted characters are used to prevent non-printable characters and
 spaces from being used. The decision to ban characters from more than 1 language
 matches the behaviour of `Google Chrome for IDN handling`_. This is to protect
 against common homograph attacks such as ebаy.com (Cyrillic "a", rest is
 English). This would always result in a failed check. Even so, the check has
 limitations. For example, сахар is entirely Cyrillic, whereas caxap is
 entirely Latin, so both pass the check despite looking nearly identical.
 .. _Google Chrome for IDN handling: https://www.chromium.org/developers/design-documents/idn-in-google-chrome
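
The user ID localpart rules can be sketched as below, under some loudly stated assumptions. The script of a character is approximated by the first word of its Unicode name, which is much cruder than the CLDR exemplar sets the proposal calls for. Only alphabetic characters are counted toward the language check. The blacklist is passed in as a parameter, and the demo uses a single placeholder entry instead of the real 107-character list. All function and constant names are hypothetical.

```python
import unicodedata

# Characters the proposal strips before the language check.
STRIPPED = frozenset("0123456789+-[]_ ")


def script_of(ch: str) -> str:
    """Crude script label: first word of the Unicode name, e.g. 'LATIN'."""
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def user_localpart_ok(localpart: str, blacklist: frozenset[str]) -> bool:
    """Apply the three user ID localpart rules; True means the check passed."""
    if ":" in localpart or localpart.startswith(("@", ".")):
        return False
    if any(ch in blacklist for ch in localpart):
        return False
    # Assumption: only alphabetic characters count toward the language check.
    scripts = {
        script_of(ch)
        for ch in localpart
        if ch not in STRIPPED and ch.isalpha()
    }
    return len(scripts) <= 1


# Placeholder blacklist for the demo; the real list has 107 entries.
DEMO_BLACKLIST = frozenset({"\u00a0"})

mixed = "eb\u0430y"                         # Latin with a Cyrillic 'a'
cyrillic = "\u0441\u0430\u0445\u0430\u0440"  # entirely Cyrillic
for candidate in (mixed, cyrillic, "caxap", "@bob", "alice_99"):
    print(repr(candidate), user_localpart_ok(candidate, DEMO_BLACKLIST))
```

The mixed-script localpart fails, while the entirely Cyrillic and entirely Latin lookalikes both pass, which is the limitation described above.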
 User ID localparts cannot start with ``@`` so that a namespace of localparts
 beginning with ``@`` can be created. This namespace is used for user IDs which
 fail the ID checks. A failed ID could look like ``@@xn--c1yn36f:domain.com``.
 If a user ID fails the check, the user ID on the event is renamed. This doesn't
 require extra work for clients, and users will see an odd user ID rather than a
 spoofed name. Renaming is done in order to protect users of a given HS, so if a
 malicious HS doesn't rename their IDs, it doesn't affect any other HS.
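
As a minimal sketch of the rename, the following uses Python's built-in ``punycode`` codec to produce the ``@@xn--`` form shown in the examples, and to recover the original ID from it. The function names are hypothetical, and the sketch assumes the input is already a well-formed ``@localpart:domain`` string.

```python
def to_failed_form(user_id: str) -> str:
    """Rewrite '@localpart:domain' as '@@xn--<punycode>:domain'."""
    localpart, _, domain = user_id[1:].partition(":")
    encoded = localpart.encode("punycode").decode("ascii")
    return f"@@xn--{encoded}:{domain}"


def from_failed_form(rewritten: str) -> str:
    """Recover the original user ID from its rewritten form."""
    localpart, _, domain = rewritten[1:].partition(":")
    encoded = localpart.removeprefix("@xn--")
    original = encoded.encode("ascii").decode("punycode")
    return f"@{original}:{domain}"


spoofed = "@eb\u0430y:domain.com"  # Cyrillic 'a', everything else Latin
rewritten = to_failed_form(spoofed)
print(rewritten)                               # @@xn--eby-7cd:domain.com
print(from_failed_form(rewritten) == spoofed)  # True
```

Because the round trip is lossless, a homeserver needs no stored mapping between the two forms, which is what allows the rewrite to be stateless.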
 Room aliases cannot be rewritten as punycode and sent to the HS the alias is
 referring to as the HS will not necessarily understand the rewritten alias.
 Other rejected solutions for failed checks
 ------------------------------------------
 - Additional key: Informational key on the event attached by HS to say "unsafe
  ID". Problem: clients can just ignore it, and since it will appear only very
  rarely, easy to forget when implementing clients.
- Moderate security: Requires client handshake. Forces clients to implement
+- Require client handshake: Forces clients to implement
  a check, else they cannot communicate with the misleading ID. However, this
  is extra overhead in both client implementations and round-trips.
- High security: Outright rejection of the ID at the point of creation /
+- Reject event: Outright rejection of the ID at the point of creation /
  receiving event. Point of creation rejection is preferable to avoid the ID
  entering the system in the first place. However, malicious HSes can just
  allow the ID. Hence, other homeservers must reject them if they see them in
  events. Client never sees the problem ID, provided the HS is correctly
-  implemented.
+  implemented. However, it is difficult to ensure that ALL HSes will come to the
- High security decided; client doesn't need to worry about it, no additional
+  same conclusion (given the CLDR dataset does come out with new versions).
-  protocol complexity aside from rejection of an event.
+
 Outstanding Problems
 ====================
 Capitalisation
 --------------
 The capitalisation rules outlined above are nice but do not fully resolve issues
 where ``@alice:example.com`` tries to speak with ``@bob:domain.com`` using
 ``@Bob:domain.com``. It is up to ``domain.com`` to map ``Bob`` to ``bob`` in
 a sensible way.