title	weight	type
Appendices	70	docs

Unpadded Base64

Unpadded Base64 refers to 'standard' Base64 encoding as defined in RFC 4648, without "=" padding. Specifically, where RFC 4648 requires that encoded data be padded to a multiple of four characters using = characters, unpadded Base64 omits this padding.

For reference, RFC 4648 uses the following alphabet for Base64:

Value Encoding  Value Encoding  Value Encoding  Value Encoding
    0 A            17 R            34 i            51 z
    1 B            18 S            35 j            52 0
    2 C            19 T            36 k            53 1
    3 D            20 U            37 l            54 2
    4 E            21 V            38 m            55 3
    5 F            22 W            39 n            56 4
    6 G            23 X            40 o            57 5
    7 H            24 Y            41 p            58 6
    8 I            25 Z            42 q            59 7
    9 J            26 a            43 r            60 8
   10 K            27 b            44 s            61 9
   11 L            28 c            45 t            62 +
   12 M            29 d            46 u            63 /
   13 N            30 e            47 v
   14 O            31 f            48 w
   15 P            32 g            49 x
   16 Q            33 h            50 y

Examples of strings encoded using unpadded Base64:

UNPADDED_BASE64("") = ""
UNPADDED_BASE64("f") = "Zg"
UNPADDED_BASE64("fo") = "Zm8"
UNPADDED_BASE64("foo") = "Zm9v"
UNPADDED_BASE64("foob") = "Zm9vYg"
UNPADDED_BASE64("fooba") = "Zm9vYmE"
UNPADDED_BASE64("foobar") = "Zm9vYmFy"

When decoding Base64, implementations SHOULD accept input with or without padding characters wherever possible, to ensure maximum interoperability.
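The encoding and the lenient decoding described above can be sketched in Python as follows. This is a minimal illustration, not a normative implementation; the function names `encode_unpadded_base64` and `decode_base64` are hypothetical and exist only for this example.

```python
import base64


def encode_unpadded_base64(data: bytes) -> str:
    # Encode with the standard RFC 4648 alphabet, then strip the "=" padding.
    return base64.b64encode(data).decode("ascii").rstrip("=")


def decode_base64(text: str) -> bytes:
    # Accept input with or without padding: drop any padding that is present,
    # then restore exactly as much as the standard decoder requires.
    stripped = text.rstrip("=")
    padded = stripped + "=" * (-len(stripped) % 4)
    # validate=True makes the decoder raise on characters outside the alphabet.
    return base64.b64decode(padded, validate=True)
```

With these helpers, `encode_unpadded_base64(b"foob")` gives `"Zm9vYg"`, and both `"Zm9vYg"` and `"Zm9vYg=="` decode back to `b"foob"`.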

Binary data

In some cases it is necessary to encapsulate binary data, for example, public keys or signatures. Given that JSON cannot safely represent raw binary data, all binary values should be encoded and represented in JSON as unpadded Base64 strings as described above.
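As a small illustration of the paragraph above, the following sketch places a binary value into a JSON document as an unpadded Base64 string. The helper `to_unpadded_base64` and the field name `example_key` are assumptions made for this example and are not defined by the specification.

```python
import base64
import json


def to_unpadded_base64(raw: bytes) -> str:
    # Standard Base64 with the trailing "=" padding removed.
    return base64.b64encode(raw).decode("ascii").rstrip("=")


# Bytes that are not valid UTF-8 text and so cannot be placed in JSON directly.
raw_value = bytes([0x00, 0xFF, 0x10, 0x80])

# "example_key" is a hypothetical field name used only for illustration.
payload = json.dumps({"example_key": to_unpadded_base64(raw_value)})
```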

In cases where the Matrix specification refers to either opaque byte or opaque Base64 values, the value is considered to be opaque AFTER Base64 decoding, rather than the encoded representation itself.

It is safe for a client or homeserver implementation to check for correctness of a Base64-encoded value at any point, and to altogether reject a value which is not encoded properly. However, this is optional and is considered to be an implementation detail.

Special consideration is given for future protocol transformations, such as those which do not use JSON, where Base64 encoding may not be necessary in order to represent a binary value safely. In these cases, Base64 encoding of binary values may be skipped altogether.

Signing JSON

Various points in the Matrix specification require JSON objects to be cryptographically signed. This requires us to encode the JSON as a binary string. Unfortunately, the same JSON can be encoded in different ways by changing how much white space is used or by changing the order of keys within objects.

Signing an object therefore requires it to be encoded as a sequence of bytes using Canonical JSON, computing the signature for that sequence and then adding the signature to the original JSON object.
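The following sketch shows why a fixed encoding is needed before signing. Two textual encodings of the same object produce different bytes when re-encoded naively, but identical bytes once keys are sorted and optional white space is removed. The name `canonical_bytes` is hypothetical, and the signing step itself is deliberately left out.

```python
import json


def canonical_bytes(value) -> bytes:
    # Sorted keys, no optional white space, UTF-8 output.
    return json.dumps(
        value,
        ensure_ascii=False,
        separators=(",", ":"),
        sort_keys=True,
    ).encode("utf-8")


# Two textual encodings of the same JSON object.
first = json.loads('{"b": 1, "a": 2}')
second = json.loads('{"a":2,"b":1}')

# A naive re-encoding keeps each object's own key order, so the bytes differ
# and a signature computed over one would not verify against the other.
naive_first = json.dumps(first).encode("utf-8")
naive_second = json.dumps(second).encode("utf-8")
```

Here `naive_first` and `naive_second` differ even though `first == second`, while `canonical_bytes` returns the same byte string for both.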

Canonical JSON

We define the canonical JSON encoding for a value to be the shortest UTF-8 JSON encoding with dictionary keys lexicographically sorted by Unicode codepoint. Numbers in the JSON must be integers in the range [-(2**53)+1, (2**53)-1].

We pick UTF-8 as the encoding as it should be available to all platforms and JSON received from the network is likely to be already encoded using UTF-8. We sort the keys to give a consistent ordering. We force integers to be in the range where they can be accurately represented using IEEE double precision floating point numbers since a number of JSON libraries represent all numbers using this representation.

{{% boxes/warning %}} Events in room versions 1, 2, 3, 4, and 5 might not be fully compliant with these restrictions. Servers SHOULD be capable of handling JSON which is considered invalid by these restrictions where possible.

The most notable consideration is that integers might not be in the range specified above. {{% /boxes/warning %}}

{{% boxes/note %}} Float values are not permitted by this encoding. {{% /boxes/note %}}

import json

def canonical_json(value):
    return json.dumps(
        value,
        # Encode code-points outside of ASCII as UTF-8 rather than \u escapes
        ensure_ascii=False,
        # Remove unnecessary white space.
        separators=(',',':'),
        # Sort the keys of dictionaries.
        sort_keys=True,
        # Encode the resulting Unicode as UTF-8 bytes.
    ).encode("UTF-8")
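Note that `json.dumps` on its own will encode floats and out-of-range integers without complaint, so the numeric restrictions have to be checked separately. The following is a minimal sketch of such a check; the function name `check_canonical_numbers` and the choice of `ValueError` are assumptions made for this example.

```python
MAX_INT = 2**53 - 1
MIN_INT = -(2**53) + 1


def check_canonical_numbers(value):
    """Raise ValueError if value contains a float or an out-of-range integer."""
    if isinstance(value, bool):
        # bool is a subclass of int in Python; true and false are always allowed.
        return
    if isinstance(value, float):
        raise ValueError("float values are not permitted")
    if isinstance(value, int):
        if not MIN_INT <= value <= MAX_INT:
            raise ValueError("integer out of range")
        return
    if isinstance(value, dict):
        for item in value.values():
            check_canonical_numbers(item)
    elif isinstance(value, (list, tuple)):
        for item in value:
            check_canonical_numbers(item)
    # Strings and None need no numeric check.
```

A caller could run this check before encoding and decide how to handle a failure. As the warning above notes, servers may still need to tolerate such values in events from older room versions.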

Grammar

Adapted from the grammar in http://tools.ietf.org/html/rfc7159, removing insignificant whitespace, fractions, exponents and redundant character escapes.

value     = false / null / true / object / array / number / string
false     = %x66.61.6c.73.65
null      = %x6e.75.6c.6c
true      = %x74.72.75.65
object    = %x7B [ member *( %x2C member ) ] %x7D
member    = string %x3A value
array     = %x5B [ value *( %x2C value ) ] %x5D
number    = [ %x2D ] int
int       = %x30 / ( %x31-39 *digit )
digit     = %x30-39
string    = %x22 *char %x22
char      = unescaped / %x5C escaped
unescaped = %x20-21 / %x23-5B / %x5D-10FFFF
escaped   = %x22 ; "    quotation mark  U+0022
          / %x5C ; \    reverse solidus U+005C
          / %x62 ; b    backspace       U+0008
          / %x66 ; f    form feed       U+000C
          / %x6E ; n    line feed       U+000A
          / %x72 ; r    carriage return U+000D
          / %x74 ; t    tab             U+0009
          / %x75.30.30.30 (%x30-37 / %x62 / %x65-66) ; u000X
          / %x75.30.30.31 (%x30-39 / %x61-66)        ; u001X
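As an illustration of how the grammar constrains encoded text, the `number` production above can be expressed as a regular expression. This is a sketch only; the names `CANONICAL_NUMBER` and `is_canonical_number` are hypothetical, and the check covers syntax alone, not the permitted integer range.

```python
import re

# number = [ %x2D ] int ; int = %x30 / ( %x31-39 *digit )
# [0-9] is used instead of \d so that only ASCII digits match.
CANONICAL_NUMBER = re.compile(r"-?(?:0|[1-9][0-9]*)")


def is_canonical_number(token: str) -> bool:
    # fullmatch ensures the whole token is consumed by the production.
    return CANONICAL_NUMBER.fullmatch(token) is not None
```

Under this check, tokens such as `0` and `-12` are accepted, while `012`, `1.5` and `1e3` are rejected because leading zeros, fractions and exponents are not part of the grammar.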

Examples

To assist in the development of compatible implementations, the following test values may be useful for verifying the canonical transformation code.

Given the following JSON object:

{}