Updated to reflect more recent progress

10 years ago · 3d5ec5eb15
parent 0131543f00
commit 3d5ec5eb15
1 changed files with 93 additions and 110 deletions
--- a/drafts/human-id-rules.rst
+++ b/drafts/human-id-rules.rst
@@ -1,103 +1,101 @@
-This document outlines the format for human-readable IDs within matrix.
+Abstract
-
+========
-Summary
+
-------
+This document outlines the format for human-readable IDs within Matrix.
- Human-readable IDs are Room Aliases and User IDs.
+
- They MUST be Unicode as UTF-8.
+Background
- If spoof checks fail, the user ID in question MUST be rewritten to be punycode
+----------
-  with an additional ``@`` prefix.
+UTF-8 is the dominant character encoding for Unicode on the web. However,
  Room aliases cannot be rewritten.
 - Spoof Checks:
   - MUST NOT contain one of the 107 blacklisted characters on this list: 
     http://kb.mozillazine.org/Network.IDN.blacklist_chars
   - MUST NOT contain characters from >1 language, defined by
     http://cldr.unicode.org/
 - User IDs MUST NOT contain a ``:`` or start with a ``@`` or ``.``
 - Room aliases MUST NOT contain a ``:``
 - User IDs SHOULD be case-insensitive.
 Overview
 --------
 UTF-8 is quickly becoming the standard character encoding set on the web. As
 such, Matrix requires that all strings MUST be encoded as UTF-8. However,
 using Unicode as the character set for human-readable IDs is troublesome. There
 are many different characters which appear identical to each other, but would
-identify different users. In addition, there are non-printable characters which
+produce different IDs. In addition, there are non-printable characters which
-cannot be rendered by the end-user. This opens up a security vulnerability with
+cannot be rendered by the end-user. This creates an opportunity for
 phishing/spoofing of IDs, commonly known as a homograph attack.
 Web browsers encountered this problem when International Domain Names were
 introduced. A variety of checks were put in place in order to protect users. If
 an address failed the check, the raw punycode would be displayed to
-disambiguate the address. Similar checks are performed by home servers in
+disambiguate the address.
 Matrix in order to protect users. In the event of a failed check, the raw
 punycode is displayed as the user ID along with a special escape sequence to
 indicate the change.
-Types of human-readable IDs
+The human-readable IDs in Matrix are Room Aliases and User IDs.
-~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Room aliases look like ``#localpart:domain``. These aliases point to opaque
-There are two main human-readable IDs in question:
+non human-readable room IDs. These pointers can change to point at a different
 room ID at any time. User IDs look like ``@localpart:domain``. These represent
 actual end-users (there is no indirection).
- Room aliases
+Proposal
- User IDs
+========
-Room aliases look like ``#localpart:domain``. These aliases point to opaque
+User IDs and Room Aliases MUST be Unicode as UTF-8. Checks are performed on
-non human-readable room IDs. These pointers can change, so there is already an
+these IDs by homeservers to protect users from phishing/spoofing attacks.
-issue present with the same ID pointing to a different destination at a later
+These checks are:
-date. Checks SHOULD be applied to room aliases, but they cannot be renamed in
+
-punycode as that would break the alias. As a result, the checks in this document
+User ID Localparts:
-apply to user IDs, although HSes may wish to enforce them on room alias 
+ - MUST NOT contain a ``:`` or start with a ``@`` or ``.``
-creation.
+ - MUST NOT contain one of the 107 blacklisted characters on this list: 
-
+     http://kb.mozillazine.org/Network.IDN.blacklist_chars
-User IDs look like ``@localpart:domain``. These represent actual end-users, and
+ - After stripping " 0-9, +, -, [, ], _, and the space character it MUST NOT
-unlike room aliases, there is no layer of indirection. This presents a much
+   contain characters from >1 language, defined by http://cldr.unicode.org/
-greater concern with homograph attacks. Checks MUST be applied to user IDs.
+
-
+Room Alias Localparts:
-Spoof Checks
+ - MUST NOT contain a ``:``
------------
+ - MUST NOT contain one of the 107 blacklisted characters on this list: 
-First, each ID is split into segments (localpart/domain) around the ``:``. For 
+   http://kb.mozillazine.org/Network.IDN.blacklist_chars
-this reason, ``:`` is a reserved character and cannot be a localpart or domain 
+ - After stripping " 0-9, +, -, [, ], _, and the space character it MUST NOT
-character. 
+   contain characters from >1 language, defined by http://cldr.unicode.org/
-
+
-User IDs which start with an ``@`` are used as an escape sequence for failed 
+
-user IDs. As a result, the localpart MUST NOT start with an ``@`` in order to 
+In the event of a failed user ID check, well behaved homeservers MUST:
-avoid namespace clashes.
+- Rewrite user IDs in the offending events to be punycode with an additional ``@``
-
+  prefix **before** delivering them to clients. There are no guarantees for
-The checks are similar to web browsers for IDNs. The first check is that the 
+  consistency between homeserver ID checking implementations. As a result, user
-segment MUST NOT contain a blacklisted character on this list: 
+  IDs MUST be sent in their *original* form over federation. This can be done in
-http://kb.mozillazine.org/Network.IDN.blacklist_chars - NB: Even though 
+  a stateless manner as the punycode form has no information loss.
-this is Mozilla, Chrome follows the same list as per 
+
-http://www.chromium.org/developers/design-documents/idn-in-google-chrome
+In the event of a failed room alias check, well behaved homeservers MUST:
-
+- Send an HTTP status code 400 with an ``errcode`` of ``M_FAILED_HUMAN_ID_CHECK``
-The second check is that it MUST NOT contain characters from more than 1 
+  to the client if the client is attempting to *create* this alias.
-language. This is defined by this dataset http://cldr.unicode.org/ and is 
+- Send an HTTP status code 400 with an ``errcode`` of ``M_FAILED_HUMAN_ID_CHECK``
-applied after stripping " 0-9, +, -, [, ], _, and the space character" 
+  to the client if the client is attempting to *join* a room via this alias.
-( http://www.chromium.org/developers/design-documents/idn-in-google-chrome )
+
-
+Examples::
-
+
-Consequences of a failed check
+  @ebаy:domain.com (Cyrillic 'a', everything else English)
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+  @@xn--eby-7cd:domain.com (Punycode with additional '@')
-If a user ID fails the check, the user ID on the event is renamed. This is 
+
-possible because user IDs contain routing information. This doesn't require 
+Homeservers SHOULD NOT allow two user IDs that differ only by case. This
-extra work for clients, and users will see an odd user ID rather than a spoofed 
+SHOULD be applied based on the capitalisation rules in the CLDR dataset:
-name. Renaming is done in order to protect users of a given HS, so if a 
+http://cldr.unicode.org/
 This check SHOULD be applied when the user ID is created, in order to prevent
 registration with the same name and different capitalisations, e.g.
 ``@foo:bar`` vs ``@Foo:bar`` vs ``@FOO:bar``. Home servers MAY canonicalise
 the user ID to be completely lower-case if desired.
 Rationale
 =========
 Each ID is split into segments (localpart/domain) around the ``:``. For 
 this reason, ``:`` is a reserved character and cannot be a localpart character. 
 The 107 blacklisted characters are used to prevent non-printable characters and
 spaces from being used. The decision to ban characters from more than 1 language
 matches the behaviour of Google Chrome for IDN handling. This is to protect
 against common homograph attacks such as ebаy.com (Cyrillic "a", rest is
 English). This would always result in a failed check. Even with this though
 there are limitations. For example, сахар is entirely Cyrillic, whereas caxap is
 entirely Latin. 
 User ID localparts cannot start with ``@`` so that a namespace of localparts
 beginning with ``@`` can be created. This namespace is used for user IDs which
 fail the ID checks. A failed ID could look like ``@@xn--c1yn36f:domain.com``.
 If a user ID fails the check, the user ID on the event is renamed. This doesn't
 require extra work for clients, and users will see an odd user ID rather than a
 spoofed name. Renaming is done in order to protect users of a given HS, so if a 
 malicious HS doesn't rename their IDs, it doesn't affect any other HS.
- The HS MAY reject the creation of the room alias or user ID. This is the 
+Room aliases cannot be rewritten as punycode and sent to the HS the alias is
-  preferred choice but it is entirely benevolent: other HSes may not apply this
+referring to as the HS will not necessarily understand the rewritten alias.
  rule so checks on incoming events MUST still be applied. The error code returned
  for the rejection is ``M_FAILED_HUMAN_ID_CHECK``, which is generic enough for 
  both failing due to homograph attacks, and failing due to including ``:`` s. 
  Error message MAY go into further information about which characters were 
  rejected and why.
 - The HS MUST rename the localpart which failed the check. It SHOULD be 
  represented as punycode. The HS MUST prefix the punycode with the escape 
  sequence ``@`` on user ID localparts, e.g. ``@@somepunycode:domain``. Room 
  aliases do not need to be escaped, and indeed they cannot be, as the originating
  HS will not understand the rewritten alias. If a HS renames a user ID, it MUST 
  be able to apply the reverse mapping in case the user wishes to communicate with
  the ID which failed the check.
 Other rejected solutions for failed checks
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -115,29 +113,14 @@ Other rejected solutions for failed checks
  implemented. However, it is difficult to ensure that ALL HSes will come to the
  same conclusion (given the CLDR dataset does come out with new versions).
-Namespacing
+Outstanding Problems
-----------
+====================
 Bots
 ~~~~
 User IDs representing real users SHOULD NOT start with a ``.``. User IDs which
 act on behalf of a real user (e.g. an IRC/XMPP bot) SHOULD start with a ``.``.
 This namespaces real/generated user IDs. Further namespacing SHOULD be applied
 based on the service being used, getting progressively more specific, similar to
 event types: e.g. ``@.irc.freenode.matrix.<username>:domain``. Ultimately, the 
 HS in question has control over their user ID namespace, so this is just a 
 recommendation.
 Additional recommendations
 --------------------------
 Capitalisation
-~~~~~~~~~~~~~~
+--------------
 The home server SHOULD NOT allow two user IDs that differ only by case. This SHOULD be applied based on the 
 capitalisation rules in the CLDR dataset: http://cldr.unicode.org/
-This check SHOULD be applied when the user ID is created, in order to prevent
+The capitalisation rules outlined above are nice but do not fully resolve issues
-registration with the same name and different capitalisations, e.g.
+where ``@alice:example.com`` tries to speak with ``@bob:domain.com`` using
-``@foo:bar`` vs ``@Foo:bar`` vs ``@FOO:bar``. Home servers MAY canonicalise
+``@Bob:domain.com``. It is up to ``domain.com`` to map ``Bob`` to ``bob`` in
-the user ID to be completely lower-case if desired.
+a sensible way.
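
As a rough illustration of the capitalisation rule discussed in this change, the sketch below shows one way a homeserver might detect user IDs that differ only by case. It is a minimal sketch under stated assumptions, not the draft's method: the helper names ``comparison_key`` and ``differs_only_by_case`` are hypothetical, and Python's built-in ``str.casefold()`` (Unicode case folding) is used as a stand-in for the CLDR capitalisation rules that the draft refers to.

```python
def comparison_key(user_id: str) -> str:
    """Return a case-insensitive key for a user ID such as "@Bob:domain.com".

    Hypothetical helper. Assumes str.casefold() is an acceptable stand-in
    for the CLDR capitalisation rules mentioned in the draft.
    """
    if not user_id.startswith("@") or ":" not in user_id:
        raise ValueError(f"not a user ID: {user_id!r}")
    # The localpart cannot contain ":", so the first ":" separates it
    # from the domain.
    localpart, domain = user_id[1:].split(":", 1)
    return "@" + localpart.casefold() + ":" + domain.lower()


def differs_only_by_case(a: str, b: str) -> bool:
    """True if the two user IDs are distinct strings with the same key."""
    return a != b and comparison_key(a) == comparison_key(b)


if __name__ == "__main__":
    # The registration example from the draft: these should collide.
    print(differs_only_by_case("@foo:bar", "@Foo:bar"))
    # The lookup example from the draft: map ``Bob`` to ``bob``.
    print(comparison_key("@Bob:domain.com"))
```

A server could store the key alongside each registered user ID and reject a registration whose key already exists. That still leaves the lookup problem described above to ``domain.com``, which has to decide how to map an incoming ``@Bob:domain.com`` to the stored ID.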