Add details about why this proposal should exist

hs/hash-identity
Andrew Morgan 5 years ago
parent b26a9ed1fd
commit 9fd6bd3184

@@ -6,22 +6,41 @@ To summarise the issue, lookups (of Matrix user IDs) are performed using
plain-text 3PIDs (third-party IDs) which means that the identity server can
identify and record every 3PID that the user has in their contacts, whether
that email address or phone number is already known by the identity server or
not.
If the 3PID is hashed, the identity server could not determine the address
unless it has already seen that address in plain-text during a previous call
of the [/bind
mechanism](https://matrix.org/docs/spec/identity_service/r0.2.1#post-matrix-identity-api-v1-3pid-bind)
(without significant resources to reverse the hashes). This helps prevent
bulk collection of users' contact lists by the identity server and reduces
its ability to build social graphs.
This proposal thus calls for the Identity Service API's
[/lookup](https://matrix.org/docs/spec/identity_service/r0.2.1#get-matrix-identity-api-v1-lookup)
endpoint to use hashed 3PIDs instead of their plain-text counterparts (and to
deprecate both it and
[/bulk_lookup](https://matrix.org/docs/spec/identity_service/r0.2.1#post-matrix-identity-api-v1-bulk-lookup)),
which will leak less data to identity servers.
not. In the latter case, an identity server is able to collect email
addresses and phone numbers that have a high probability of being connected
to a real person. It could then use this data for marketing or other
purposes.
However, if the email addresses and phone numbers are hashed before they are
sent to the identity server, the server would have a harder time recovering
the original addresses. This helps prevent the contact information of
non-Matrix users from being exposed by the lookup service.
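
As a rough illustration of hashing addresses before they are sent, the sketch
below hashes each address-book entry on the client. The choice of SHA-256, the
normalisation step and the helper name `hash_3pid` are assumptions made for
this example only; the text here does not fix an algorithm or a wire format.

```python
import hashlib


def hash_3pid(address: str) -> str:
    """Return a hex digest for one 3PID (hypothetical helper, SHA-256 assumed)."""
    # Normalise so the same address always yields the same digest (assumption).
    normalised = address.strip().lower()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


# Made-up address book entries, used only for illustration.
contacts = ["alice@example.com", "+447700900123"]

# Only these digests, not the plain-text addresses, would be sent for lookup.
hashed_contacts = [hash_3pid(c) for c in contacts]
```

The identity server can only match these digests against hashes of addresses
it already knows, which is the property the paragraph above relies on.
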
Hashing is not perfect, though. While reversing a hash directly is not feasible, it
is possible to build a [rainbow
table](https://en.wikipedia.org/wiki/Rainbow_table), which could map many
known email addresses and phone numbers to their hash equivalents. When the
identity server receives a hash, it would then be able to look it up in this
table, and find the email address or phone number associated with it. In an
ideal world, one would use a hashing algorithm such as
[bcrypt](https://en.wikipedia.org/wiki/Bcrypt), with many rounds, which would
make building such a rainbow table an extraordinarily expensive process.
Unfortunately, this is impractical for our use case, as it would require
clients to perform many, many rounds of hashing, linearly dependent on their
address book size, which would likely result in lower-end mobile phones
becoming overwhelmed. Thus, we must use a fast hashing algorithm, at the cost
of making rainbow tables easy to build.
The rainbow table attack is not perfect. While there are only so many
possible phone numbers, and thus it is simple to generate the hash value for
each one, the address space of email addresses is much, much wider. Therefore
if your email address is decently long and is not publicly known to
attackers, it is unlikely that it would be included in a rainbow table.
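
To make the difference between the two address spaces concrete, here is a
minimal sketch of the attack against phone numbers. It builds a plain
precomputed lookup table (simpler than a true rainbow table, which uses hash
chains to save memory, but the effect on the victim is the same). SHA-256, the
number prefix and the function names are assumptions chosen for the example.

```python
import hashlib


def sha256_hex(value: str) -> str:
    """Hash a string with SHA-256 (the fast hash assumed for this sketch)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def build_lookup_table(prefix: str, digits: int) -> dict:
    """Map digest -> phone number for every number under `prefix`."""
    table = {}
    for n in range(10 ** digits):
        # Zero-pad so every suffix has exactly `digits` digits.
        number = f"{prefix}{n:0{digits}d}"
        table[sha256_hex(number)] = number
    return table


# Toy range of 10,000 numbers; a real attacker would enumerate far more.
table = build_lookup_table("+4477009", 4)

# A hashed phone number inside the enumerated range is recovered immediately.
recovered = table.get(sha256_hex("+44770090123"))

# A long, unpublished email address is not in the table, so the lookup fails.
missing = table.get(sha256_hex("a.long.unpublished.address@example.com"))
```

Enumerating every phone number only requires a larger loop, whereas no
comparable loop exists for email addresses the attacker has never seen.
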
Thus the approach of hashing, while adding implementation complexity and
minor resource consumption on the client and identity server, does make it
more difficult for the identity server to harvest contact details, which
should be considered worthwhile.
## Proposal
