Merge bf9b7c87e1 into d6edcbd946

1 month ago · b61a7009ef
parent d6edcbd946 bf9b7c87e1
commit b61a7009ef
1 changed files with 230 additions and 0 deletions
--- a/proposals/3752-markup-locations-for-text.md
+++ b/proposals/3752-markup-locations-for-text.md
@@ -0,0 +1,230 @@
+# Markup locations for Text
+
+[MSC3574](https://github.com/matrix-org/matrix-spec-proposals/pull/3574)
+proposes a mechanism for marking up resources (webpages, documents, videos, and
+other files) using Matrix. The proposed mechanism requires an
+`m.markup.location` schema for representing the location of annotations within
+different kinds of resources. MSC3574 punts on what standard location types
+might be available, deferring that large family of questions to other MSCs.
+This MSC aims to provide basic location types for marking up textual resources.
+ 
+## Proposal
+
+Markup locations for text should approximately follow the format for textual
+annotations provided by the W3C's [web annotation data
+model](https://www.w3.org/TR/annotation-model/). This will simplify
+interoperability with WADM-based annotation systems like
+[hypothes.is](https://hypothes.is).
+
+Markup locations for text should be applicable to `text/*` media types, including
+Markdown and HTML. They should also be at least partly applicable to formats that
+provide an associated text stream, such as `application/pdf`,
+`application/epub+zip`, and video or audio files with embedded lyrics or
+captions.
+
+The WADM model provides two basic notions of locations in text: "Text Position"
+(roughly, an offset) and "Text Quote" (roughly, a search query). In practice,
+both should be provided for a given text location whenever possible, for robust
+anchoring in contexts where the underlying text may change (for example, on the
+web). In these cases, clients can use the Text Position offset to find an
+approximate position, and look for the nearest approximately matching Text
+Quote.
+
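The two-step anchoring described above can be sketched in Python. This is a minimal illustration, not part of the proposal: it assumes exact string matching as a stand-in for approximate matching, and the names `find_all` and `anchor` are hypothetical.

```python
def find_all(text, needle):
    """Return the start offset of every occurrence of needle in text."""
    hits = []
    i = text.find(needle)
    while i != -1:
        hits.append(i)
        i = text.find(needle, i + 1)
    return hits


def anchor(text, exact, hint_start):
    """Pick the occurrence of `exact` closest to the Text Position hint.

    Returns (start, end) in code points, or None if the quote is absent.
    Assumption: exact matching only; a real client may want fuzzy matching.
    """
    if not exact:
        return None
    hits = find_all(text, exact)
    if not hits:
        return None
    # On a tie, min() keeps the earlier occurrence.
    start = min(hits, key=lambda i: abs(i - hint_start))
    return (start, start + len(exact))
```

For instance, if the stored offset is 14 but an edit has shifted the text, `anchor("the cat and the hat and the bat", "the", 14)` settles on the occurrence starting at offset 12.
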
+### Text Positions
+
+Text Positions will be represented within an `m.markup.location` as follows:
+
+```
+"m.markup.location": {
+    "m.markup.text.position": {
+        "start": ..
+        "end": ..
+    }
+    ..
+}
+```
+
+The `start` and `end` values should be non-negative integers, with 0 indicating
+a position before the first character of the document's text, 1 indicating the
+position after the first character and before the second, and so on.
+
+The following requirements from the web annotation data model must be
+respected:
+
+> The selection of the text must be in terms of unicode code points (the
+"character number"), not in terms of code units (that number expressed using a
+selected data type). Selections should not start or end in the middle of a
+grapheme cluster. The selection must be based on the logical order of the text,
+rather than the visual order, especially for bidirectional text.
+
+> The text must be normalized before recording in the Annotation. Thus HTML/XML
+tags should be removed, and character entities should be replaced with the
+character that they encode. 
+
+In view of the ambiguity of the markdown format (and similar text formats), and
+the resulting complexity of normalization, special markdown characters should
+*not* be removed before generating a text position.
+
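As a rough illustration of the two requirements above, the following Python sketch normalizes HTML with the standard library's `html.parser` and converts a UTF-16 code unit offset (the kind a JavaScript client would report) into a code point offset. The names `TextExtractor`, `normalize_html` and `utf16_offset_to_code_points` are hypothetical, and the normalization is deliberately naive: it keeps the contents of `<script>` and `<style>` elements, which a real client would probably want to drop.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects character data, dropping tags and decoding entities."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; before
        # the text reaches handle_data.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def normalize_html(source):
    """Return the text of an HTML fragment without tags or entities."""
    parser = TextExtractor()
    parser.feed(source)
    parser.close()
    return "".join(parser.parts)


def utf16_offset_to_code_points(text, utf16_offset):
    """Convert an offset counted in UTF-16 code units to code points.

    An offset that falls inside a surrogate pair is rounded up to the
    position after that character (an assumption of this sketch).
    """
    units = 0
    for index, ch in enumerate(text):
        if units >= utf16_offset:
            return index
        # Characters outside the Basic Multilingual Plane take two units.
        units += 2 if ord(ch) > 0xFFFF else 1
    return len(text)
```

Python strings are indexed by code point, so once the text is normalized a position can be applied with an ordinary slice: `normalize_html("<p>Fish &amp; <b>chips</b></p>")[7:12]` yields `"chips"`.
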
+### Text Quotes
+
+Text Quotes will be represented within an `m.markup.location` as follows:
+
+```
+"m.markup.location": {
+    "m.markup.text.quote": {
+        "exact": ...
+        "prefix": ...
+        "suffix": ...
+    }
+    ..
+}
+```
+
+The `exact` value should be the text occupying the designated location. The
+`prefix` should be a snippet of text occurring before the designated location,
+and the `suffix` should be a snippet occurring after the designated location.
+`prefix` and `suffix` may be omitted in cases where they're clearly unnecessary
+to disambiguate the location. 
+
+Text should be normalized as above. In the case of multiple matches, all
+matches should be treated as part of the location.
+
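A minimal Python sketch of quote matching, returning every match as the paragraph above requires. It assumes, following the web annotation data model, that `prefix` and `suffix` sit immediately next to `exact`, and it uses exact string comparison on already-normalized text. The name `match_quote` is hypothetical.

```python
def match_quote(text, exact, prefix="", suffix=""):
    """Return (start, end) code point offsets for every match of a quote.

    Assumption: prefix and suffix are directly adjacent to the quoted
    text, so the search is for prefix + exact + suffix as one string.
    """
    if not exact:
        return []
    needle = prefix + exact + suffix
    matches = []
    i = text.find(needle)
    while i != -1:
        start = i + len(prefix)
        matches.append((start, start + len(exact)))
        i = text.find(needle, i + 1)
    return matches
```

With no `prefix` or `suffix`, `match_quote("one fish two fish red fish", "fish")` returns three spans; adding `prefix="two "` narrows the result to the single span `(13, 17)`.
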
+### Text Ranges
+
+There may be cases in which we want to use the selectors above to indicate the
+endpoints of a text range, because we want, for example, to select from the
+beginning of a document to a certain phrase, or because we want to select a
+long quote without including the contents of the quote in the `exact` value.
+
+In these cases, we can use a Text Range location, `m.markup.text.range`. Each
+endpoint of the range should be given either as a non-negative integer, or as a
+`prefix`/`suffix` pair. So, for example,
+
+```
+"m.markup.location": {
+    "m.markup.text.range": {
+        "start": 0,
+        "end": {
+            "prefix": "the",
+            "suffix": " end"
+        }
+    }
+}
+```
+
+would indicate all of "this is the end" except " end".
+
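One way a client might resolve such a range is sketched below in Python. It assumes that a `prefix`/`suffix` endpoint denotes the boundary between the two snippets and that the first occurrence wins when several exist; neither choice is fixed by this proposal. The names `resolve_endpoint` and `resolve_range` are hypothetical.

```python
def resolve_endpoint(text, endpoint):
    """Turn a range endpoint into a code point offset, or None.

    An integer endpoint is used as-is (after a bounds check). A
    prefix/suffix pair resolves to the position between the two
    snippets, using the first occurrence (an assumption of this sketch).
    """
    if isinstance(endpoint, int):
        return endpoint if 0 <= endpoint <= len(text) else None
    prefix, suffix = endpoint["prefix"], endpoint["suffix"]
    i = text.find(prefix + suffix)
    return None if i == -1 else i + len(prefix)


def resolve_range(text, location):
    """Return the text selected by an m.markup.text.range value, or None."""
    start = resolve_endpoint(text, location["start"])
    end = resolve_endpoint(text, location["end"])
    if start is None or end is None or start > end:
        return None
    return text[start:end]
```

Applied to the example above, `resolve_range("this is the end", {"start": 0, "end": {"prefix": "the", "suffix": " end"}})` returns `"this is the"`.
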
+### Web Annotation Data Model Serialization
+
+[MSC3574](https://github.com/matrix-org/matrix-spec-proposals/pull/3574)
+includes a scheme for serializing Matrix markup events as web annotations in
+the web annotation data model. The scheme requires each markup location type to
+have a canonical serialization as [a web annotation
+selector](https://www.w3.org/TR/annotation-model/#selectors). In this section,
+we describe how to serialize `m.markup.text.range`, `m.markup.text.quote` and
+`m.markup.text.position` as WADM selectors.
+
+The correspondence between `m.markup.text.quote` and `m.markup.text.position`
+and WADM
+[TextQuoteSelector](https://www.w3.org/TR/annotation-model/#text-quote-selector)
+and
+[TextPositionSelector](https://www.w3.org/TR/annotation-model/#text-position-selector)
+selectors is very direct. In each case, we only need to add a field indicating
+the selector type. So a location like:
+
+```
+"m.markup.text.quote": {
+    "exact": ...
+    "prefix": ...
+    "suffix": ...
+}
+```
+
+becomes
+
+``` 
+{
+    "type": "TextQuoteSelector"
+    "exact": ... 
+    "prefix": ... 
+    "suffix": ... 
+}
+```
+
+and 
+
+```
+"m.markup.text.position": {
+    "start": ...
+    "end": ...
+}
+```
+
+becomes
+
+``` 
+{
+    "type": "TextPositionSelector"
+    "start": ...
+    "end": ...
+}
+```
+
+The more complicated `m.markup.text.range` should be serialized via the WADM
+[RangeSelector](https://www.w3.org/TR/annotation-model/#range-selector), which
+combines two WADM selectors to designate an area reaching from the beginning of
+the area designated by the first selector to the beginning of the area
+designated by the second selector.
+
+If either endpoint of the `m.markup.text.range` location is an offset, that
+endpoint should be represented by a WADM `TextPositionSelector` with both the
+start and end values equal to the offset. If either endpoint of an
+`m.markup.text.range` location is a `prefix`/`suffix` pair, it should be
+represented by a `TextQuoteSelector` with the corresponding `prefix`, but with
+the `exact` value equal to the suffix, and with no suffix provided. 
+
+So, for example,
+
+```
+"m.markup.text.range": { 
+    "start": 0,
+    "end": { 
+        "prefix": "the", 
+        "suffix": " end" 
+    } 
+}
+```
+
+becomes 
+
+```
+{
+    "type": "RangeSelector",
+    "startSelector": {
+        "type": "TextPositionSelector",
+        "start": 0,
+        "end": 0
+    },
+    "endSelector": {
+        "type": "TextQuoteSelector",
+        "prefix": "the",
+        "exact": " end"
+    }
+}
+```
+
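The serialization rules in this section can be sketched as a small Python function. This is only an illustration of the mapping described above; the names `serialize_endpoint` and `serialize_location` are hypothetical, and the input is assumed to be an already-validated location value.

```python
def serialize_endpoint(endpoint):
    """Serialize one endpoint of an m.markup.text.range as a WADM selector."""
    if isinstance(endpoint, int):
        # An offset becomes a zero-width TextPositionSelector.
        return {"type": "TextPositionSelector", "start": endpoint, "end": endpoint}
    # A prefix/suffix pair becomes a TextQuoteSelector whose `exact`
    # is the suffix, with no suffix of its own.
    return {
        "type": "TextQuoteSelector",
        "prefix": endpoint["prefix"],
        "exact": endpoint["suffix"],
    }


def serialize_location(key, value):
    """Map one key of m.markup.location to a WADM selector dict."""
    if key == "m.markup.text.quote":
        return {"type": "TextQuoteSelector", **value}
    if key == "m.markup.text.position":
        return {"type": "TextPositionSelector", **value}
    if key == "m.markup.text.range":
        return {
            "type": "RangeSelector",
            "startSelector": serialize_endpoint(value["start"]),
            "endSelector": serialize_endpoint(value["end"]),
        }
    raise ValueError(f"unsupported location type: {key}")
```

Passing the `m.markup.text.range` example above through `serialize_location` produces the `RangeSelector` shown, with a zero-width `TextPositionSelector` as the start and a `TextQuoteSelector` as the end.
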
+## Security considerations
+
+Because room state is unencrypted, `m.space.child` events conveying locations
+via `m.markup.text.quote` could leak information about the text of an
+encrypted resource. This is part of a more general problem with state events
+potentially leaking information, and deserves a general resolution along the
+lines of [MSC3414](https://github.com/matrix-org/matrix-spec-proposals/pull/3414).
+
+## Unstable prefix
+
+| Proposed Final Identifier | Purpose                                                    | Development Identifier                        |
+| ------------------------- | ---------------------------------------------------------- | --------------------------------------------- |
+| `m.markup.text.quote`     | key in `m.markup.location`                                 | `com.open-tower.msc3752.markup.text.quote`    |
+| `m.markup.text.position`  | key in `m.markup.location`                                 | `com.open-tower.msc3752.markup.text.position` |
+| `m.markup.text.range`     | key in `m.markup.location`                                 | `com.open-tower.msc3752.markup.text.range`    |