From 871c5f36a7f35d4713164d2366c496f64eda1b29 Mon Sep 17 00:00:00 2001 From: Graham Leach-Krouse Date: Wed, 9 Mar 2022 10:54:30 -0600 Subject: [PATCH 1/8] markup locations for text, initial commit --- proposals/XXXX-markup-locations-for-text.md | 21 +++++++++++++++++++++ 1 file changed, 21 insertions(+) create mode 100644 proposals/XXXX-markup-locations-for-text.md diff --git a/proposals/XXXX-markup-locations-for-text.md b/proposals/XXXX-markup-locations-for-text.md new file mode 100644 index 000000000..c5348c5d5 --- /dev/null +++ b/proposals/XXXX-markup-locations-for-text.md @@ -0,0 +1,21 @@ +# Markup locations for Text + +[MSC3574](https://github.com/opentower/matrix-doc/blob/main/proposals/3574-resource-markup.md) +proposes a mechanism for marking up resources (webpages, documents, videos, and +other files) using Matrix. The proposed mechanism requires an +`m.markup.location` schema for representing the location of annotations within +different kinds of resources. MSC3574 punts on what standard location types +might be available, deferring that large family of questions to other MSCs. +This MSC aims to provide basic location types for marking up textual resources. 
+ +## Proposal + +### Text Positions + +### Text Quotes + +### Text Ranges + +### Web Annotation Data Model Serialization + +## Security considerations From 4291834111a216ad32d7be0a24a2f75970c8bddc Mon Sep 17 00:00:00 2001 From: Graham Leach-Krouse Date: Wed, 9 Mar 2022 11:08:45 -0600 Subject: [PATCH 2/8] rename msc file --- ...up-locations-for-text.md => 3752-markup-locations-for-text.md} | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename proposals/{XXXX-markup-locations-for-text.md => 3752-markup-locations-for-text.md} (100%) diff --git a/proposals/XXXX-markup-locations-for-text.md b/proposals/3752-markup-locations-for-text.md similarity index 100% rename from proposals/XXXX-markup-locations-for-text.md rename to proposals/3752-markup-locations-for-text.md From 7269502e18574cb58642268f940fa34ce668abae Mon Sep 17 00:00:00 2001 From: Graham Leach-Krouse Date: Fri, 11 Mar 2022 10:09:04 -0600 Subject: [PATCH 3/8] Add Text Position, Text Quote locations --- proposals/3752-markup-locations-for-text.md | 74 ++++++++++++++++++++- 1 file changed, 73 insertions(+), 1 deletion(-) diff --git a/proposals/3752-markup-locations-for-text.md b/proposals/3752-markup-locations-for-text.md index c5348c5d5..cf8e4f1a1 100644 --- a/proposals/3752-markup-locations-for-text.md +++ b/proposals/3752-markup-locations-for-text.md @@ -10,10 +10,82 @@ This MSC aims to provide basic location types for marking up textual resources. ## Proposal +Markup locations for text should approximately follow the format for textual +annotations provided by the W3C's [web annotation data +model](https://www.w3.org/TR/annotation-model/). This will simplify +interoperability with WADM-based annotation systems like +[hypothes.is](https://hypothes.is). + +Markup locations for text should be applicable to `text/*` Media Types, including +Markdown and HTML. 
It should also be at least partly applicable to formats that +provide an associated text stream, such as `application/pdf`, +`application/epub+zip`, and video or audio files with embedded lyrics or +captions. + +The WADM model provides two basic notions of locations in text: "Text Position" +(roughly, an offset) and "Text Quote" (roughly, a search query). In practice, +both should be provided for a given text location whenever possible, for robust +anchoring in contexts where the underlying text may change (for example, on the +web). In these cases, clients can use the Text Position offset to find an +approximate position, and look for the nearest approximately matching Text +Quote. + ### Text Positions - + +Text Positions will be represented within an `m.markup.location` as follows: + +``` +m.markup.location: { + m.markup.text.position: { + start: .. + end: .. + } + .. +} +``` + +The `start` and `end` values should be non-negative integers, with 0 indicating +a position before the first character of the document's text, 1 indicating the +position after the first character and before the second, and so on. + +The following requirements from the web annotation data model must be +respected: + +> The selection of the text must be in terms of unicode code points (the +"character number"), not in terms of code units (that number expressed using a +selected data type). Selections should not start or end in the middle of a +grapheme cluster. The selection must be based on the logical order of the text, +rather than the visual order, especially for bidirectional text. + +> The text must be normalized before recording in the Annotation. Thus HTML/XML +tags should be removed, and character entities should be replaced with the +character that they encode. + +In view of the ambiguity of the markdown format (and similar text formats), and +the resulting complexity of normalization, special markdown characters should +*not* be removed before generating a text position. 
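
The code point requirement above can be illustrated with a short Python sketch. Python strings are indexed by code point, so slicing by a Text Position is direct; clients whose native string indices count UTF-16 code units (JavaScript, for example) would need a conversion first. The function names `slice_by_position` and `utf16_units_to_code_points` are hypothetical helpers invented for this illustration and are not part of the proposal.

```python
def slice_by_position(text: str, start: int, end: int) -> str:
    """Return the text between two code point offsets.

    Offset 0 is the position before the first character, 1 is the
    position between the first and second characters, and so on.
    """
    if not (0 <= start <= end <= len(text)):
        raise ValueError("offsets must satisfy 0 <= start <= end <= len(text)")
    return text[start:end]


def utf16_units_to_code_points(text: str, unit_offset: int) -> int:
    """Convert a UTF-16 code unit offset into a code point offset.

    Assumption for this sketch: an offset that falls inside a surrogate
    pair is rounded up to the position after that character.
    """
    units = 0
    for index, char in enumerate(text):
        if units >= unit_offset:
            return index
        # Characters outside the Basic Multilingual Plane take two code units.
        units += 2 if ord(char) > 0xFFFF else 1
    return len(text)


sample = "a\U0001F600b"  # "a", an emoji outside the BMP, then "b"
print(slice_by_position(sample, 1, 2) == "\U0001F600")
print(utf16_units_to_code_points(sample, 3))
```

In the sample, the emoji occupies one code point but two UTF-16 code units, so the code unit offset 3 corresponds to the code point offset 2. This sketch does not check grapheme cluster boundaries, which the quoted requirement also asks clients to respect.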
+ ### Text Quotes +Text Quotes will be represented within an `m.markup.location` as follows: + +``` +m.markup.location: { + m.markup.text.quote: { + exact: ... + prefix: ... + suffix: ... + } + .. +} +``` + +The `exact` value should be the text occupying the designated location. The +`prefix` should be a snippet of text occurring before the designated location, +and the `suffix` should be a snippet occurring after the designated location. +`prefix` and `suffix` may be omitted in cases where they're clearly unnecessary +to disambiguate the location. Text should be normalized as above. + ### Text Ranges ### Web Annotation Data Model Serialization From d35d167d3e6e7789ea0793688273d189d33f1978 Mon Sep 17 00:00:00 2001 From: Graham Leach-Krouse Date: Fri, 11 Mar 2022 17:00:38 -0600 Subject: [PATCH 4/8] Add text ranges --- proposals/3752-markup-locations-for-text.md | 27 ++++++++++++++++++++- 1 file changed, 26 insertions(+), 1 deletion(-) diff --git a/proposals/3752-markup-locations-for-text.md b/proposals/3752-markup-locations-for-text.md index cf8e4f1a1..9d60284eb 100644 --- a/proposals/3752-markup-locations-for-text.md +++ b/proposals/3752-markup-locations-for-text.md @@ -84,10 +84,35 @@ The `exact` value should be the text occupying the designated location. The `prefix` should be a snippet of text occurring before the designated location, and the `suffix` should be a snippet occurring after the designated location. `prefix` and `suffix` may be omitted in cases where they're clearly unnecessary -to disambiguate the location. Text should be normalized as above. +to disambiguate the location. + +Text should be normalized as above. In the case of multiple matches, all +matches should be treated as part of the location. 
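
A minimal sketch of how a client might resolve a Text Quote, written in Python. It assumes exact string matching on already-normalized text, which is a simplification: the proposal also envisages approximate matching near a Text Position. The function name `find_quote` is hypothetical and used only for this illustration.

```python
def find_quote(text: str, exact: str, prefix: str = "", suffix: str = "") -> list[tuple[int, int]]:
    """Return (start, end) code point offsets for every match of the quote.

    A match is an occurrence of `exact` that is immediately preceded by
    `prefix` and immediately followed by `suffix`. Omitted prefix or
    suffix values are treated as empty strings.
    """
    if not exact:
        raise ValueError("exact must be a non-empty string")
    needle = prefix + exact + suffix
    matches = []
    index = text.find(needle)
    while index != -1:
        start = index + len(prefix)
        matches.append((start, start + len(exact)))
        # Advance by one so overlapping occurrences are also found.
        index = text.find(needle, index + 1)
    return matches


document = "the cat and the hat"
print(find_quote(document, "the"))
print(find_quote(document, "the", suffix=" hat"))
```

Without a `prefix` or `suffix`, both occurrences of "the" are returned, and per the rule above both would be treated as part of the location. Supplying the `suffix` narrows the result to the second occurrence.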
### Text Ranges +There may be cases in which we want to use the selectors above to indicate the +endpoints of a text range, because we want, for example, to select from the +beginning of a document to a certain phrase, or because we want to select a +long quote without including the contents of the quote in the `exact` value. + +In these cases, we can use a Text Range location, `m.markup.text.range`. Each +endpoint of the range should be given either as a non-negative integer, or as a +`prefix`/`suffix` pair. So for example, + +``` +m.markup.location: { + m.markup.text.range: { + start: 0 + end: { + prefix: "the", + suffix: " end" + } +} +``` + +would indicate all of "this is the end" except " end". + ### Web Annotation Data Model Serialization ## Security considerations From 3411691d49d44dcad0649d0bf7265be7ae206e7f Mon Sep 17 00:00:00 2001 From: Graham Leach-Krouse Date: Fri, 11 Mar 2022 17:07:40 -0600 Subject: [PATCH 5/8] Balance brackets --- proposals/3752-markup-locations-for-text.md | 1 + 1 file changed, 1 insertion(+) diff --git a/proposals/3752-markup-locations-for-text.md b/proposals/3752-markup-locations-for-text.md index 9d60284eb..a21d3a9f9 100644 --- a/proposals/3752-markup-locations-for-text.md +++ b/proposals/3752-markup-locations-for-text.md @@ -107,6 +107,7 @@ m.markup.location: { end: { prefix: "the", suffix: " end" + } } } ``` From 3e4a47e6a8217c95d2c8d0173b6b818c74528ff7 Mon Sep 17 00:00:00 2001 From: Graham Leach-Krouse Date: Sat, 12 Mar 2022 10:06:16 -0600 Subject: [PATCH 6/8] Add WADM serialization and unstable prefix --- proposals/3752-markup-locations-for-text.md | 113 +++++++++++++++++++- 1 file changed, 112 insertions(+), 1 deletion(-) diff --git a/proposals/3752-markup-locations-for-text.md b/proposals/3752-markup-locations-for-text.md index a21d3a9f9..02c709e58 100644 --- a/proposals/3752-markup-locations-for-text.md +++ b/proposals/3752-markup-locations-for-text.md @@ -103,7 +103,7 @@ endpoint of the range should be given either as a 
non-negative integer, or as a ``` m.markup.location: { m.markup.text.range: { - start: 0 + start: 0, end: { prefix: "the", suffix: " end" @@ -116,4 +116,115 @@ would indicate all of "this is the end" except " end". ### Web Annotation Data Model Serialization +[MSC3574](https://github.com/opentower/matrix-doc/blob/main/proposals/3574-resource-markup.md) +includes a scheme for serializing matrix markup events as web annotations in +the web annotation data model. The scheme requires each markup location type to +have a canonical serialization as [a web annotation +selector](https://www.w3.org/TR/annotation-model/#selectors). In this section, +we describe how to serialize `m.markup.text.range`, `m.markup.text.quote` and +`m.markup.text.position` as WADM selectors. + +The correspondence between `m.markup.text.quote` and `m.markup.text.position` +and WADM +[TextQuoteSelector](https://www.w3.org/TR/annotation-model/#text-quote-selector) +and +[TextPositionSelector](https://www.w3.org/TR/annotation-model/#text-position-selector) +selectors is very direct. In each case, we only need to add a field indicating +the selector type. So a location like: + +``` +m.markup.text.quote: { + exact: ... + prefix: ... + suffix: ... +} +``` + +becomes + +``` +{ + type: "TextQuoteSelector" + exact: ... + prefix: ... + suffix: ... +} +``` + +and + +``` +m.markup.text.position: { + start: ... + end: ... +} +``` + +becomes + +``` +{ + type: "TextPositionSelector" + start: ... + end: ... +} +``` + +The more complicated `m.markup.text.range` should be serialized via the WADM +[RangeSelector](https://www.w3.org/TR/annotation-model/#range-selector) selector, which +combines two WADM selectors to designate an area reaching from the beginning of +the area designated by the first selector to the beginning of the area +designated by the second selector. 
+ +If either endpoint of the `m.markup.text.range` location is an offset, that +endpoint should be represented by a WADM `TextPositionSelector` with both the +start and end values equal to the offset. If either endpoint of an +`m.markup.text.range` location is a `prefix`/`suffix` pair, it should be +represented by a `TextQuoteSelector` with the corresponding `prefix`, but with +the `exact` value equal to the suffix, and with no suffix provided. + +So, for example, + +``` +m.markup.text.range: { + start: 0, + end: { + prefix: "the", + suffix: " end" + } +} +``` + +becomes + +``` +{ + type: RangeSelector, + startSelector: { + type: TextPositionSelector, + start: 0, + end: 0 + } + endSelector: { + type: TextQuoteSelector + prefix: "the", + exact: " end" + } +} +``` + ## Security considerations + +Because room state is unencrypted, `m.space.child` events conveying locations +via `m.markup.location.quote` could leak information about an encrypted +resource text. This is part of a more general problem with state events +potentially leaking information, and deserves a general resolution, a la +[MSC3414](https://github.com/matrix-org/matrix-spec-proposals/blob/travis/msc/encrypted-state/proposals/3414-encrypted-state.md) + +## Unstable prefix + +| Proposed Final Identifier | Purpose | Development Identifier | +| ------------------------- | ---------------------------------------------------------- | --------------------------------------------- | +| `m.markup.text.quote` | key in `m.markup.location` | `com.open-tower.msc3752.markup.text.quote` | +| `m.markup.text.position` | key in `m.markup.location` | `com.open-tower.msc3752.markup.text.position` | +| `m.markup.text.range` | key in `m.markup.location` | `com.open-tower.msc3752.markup.text.range` | From 562ba45af5aef7f231c858210cfd700b48b8fdc2 Mon Sep 17 00:00:00 2001 From: Graham Leach-Krouse Date: Mon, 14 Mar 2022 13:07:53 -0500 Subject: [PATCH 7/8] Use PR links --- proposals/3752-markup-locations-for-text.md | 6 +++--- 
1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/proposals/3752-markup-locations-for-text.md b/proposals/3752-markup-locations-for-text.md index 02c709e58..e810c8025 100644 --- a/proposals/3752-markup-locations-for-text.md +++ b/proposals/3752-markup-locations-for-text.md @@ -1,6 +1,6 @@ # Markup locations for Text -[MSC3574](https://github.com/opentower/matrix-doc/blob/main/proposals/3574-resource-markup.md) +[MSC3574](https://github.com/matrix-org/matrix-spec-proposals/pull/3574) proposes a mechanism for marking up resources (webpages, documents, videos, and other files) using Matrix. The proposed mechanism requires an `m.markup.location` schema for representing the location of annotations within @@ -116,7 +116,7 @@ would indicate all of "this is the end" except " end". ### Web Annotation Data Model Serialization -[MSC3574](https://github.com/opentower/matrix-doc/blob/main/proposals/3574-resource-markup.md) +[MSC3574](https://github.com/matrix-org/matrix-spec-proposals/pull/3574) includes a scheme for serializing matrix markup events as web annotations in the web annotation data model. The scheme requires each markup location type to have a canonical serialization as [a web annotation @@ -219,7 +219,7 @@ Because room state is unencrypted, `m.space.child` events conveying locations via `m.markup.location.quote` could leak information about an encrypted resource text. This is part of a more general problem with state events potentially leaking information, and deserves a general resolution, a la -[MSC3414](https://github.com/matrix-org/matrix-spec-proposals/blob/travis/msc/encrypted-state/proposals/3414-encrypted-state.md) +[MSC3414](https://github.com/matrix-org/matrix-spec-proposals/pull/3414) ## Unstable prefix From bf9b7c87e130408f65d6f055ec8ebc24a5daec4c Mon Sep 17 00:00:00 2001 From: Graham Leach-Krouse Date: Wed, 18 May 2022 15:14:28 -0500 Subject: [PATCH 8/8] Enquote JSON strings. 
--- proposals/3752-markup-locations-for-text.md | 86 ++++++++++----------- 1 file changed, 43 insertions(+), 43 deletions(-) diff --git a/proposals/3752-markup-locations-for-text.md b/proposals/3752-markup-locations-for-text.md index e810c8025..564eb4337 100644 --- a/proposals/3752-markup-locations-for-text.md +++ b/proposals/3752-markup-locations-for-text.md @@ -35,10 +35,10 @@ Quote. Text Positions will be represented within an `m.markup.location` as follows: ``` -m.markup.location: { - m.markup.text.position: { - start: .. - end: .. +"m.markup.location": { + "m.markup.text.position": { + "start": .. + "end": .. } .. } @@ -70,11 +70,11 @@ the resulting complexity of normalization, special markdown characters should Text Quotes will be represented within an `m.markup.location` as follows: ``` -m.markup.location: { - m.markup.text.quote: { - exact: ... - prefix: ... - suffix: ... +"m.markup.location": { + "m.markup.text.quote": { + "exact": ... + "prefix": ... + "suffix": ... } .. } @@ -101,12 +101,12 @@ endpoint of the range should be given either as a non-negative integer, or as a `prefix`/`suffix` pair. So for example, ``` -m.markup.location: { - m.markup.text.range: { - start: 0, - end: { - prefix: "the", - suffix: " end" +"m.markup.location": { + "m.markup.text.range": { + "start": 0, + "end": { + "prefix": "the", + "suffix": " end" } } } @@ -133,10 +133,10 @@ selectors is very direct. In each case, we only need to add a field indicating the selector type. So a location like: ``` -m.markup.text.quote: { - exact: ... - prefix: ... - suffix: ... +"m.markup.text.quote": { + "exact": ... + "prefix": ... + "suffix": ... } ``` @@ -144,19 +144,19 @@ becomes ``` { - type: "TextQuoteSelector" - exact: ... - prefix: ... - suffix: ... + "type": "TextQuoteSelector" + "exact": ... + "prefix": ... + "suffix": ... } ``` and ``` -m.markup.text.position: { - start: ... - end: ... +"m.markup.text.position": { + "start": ... + "end": ... 
} ``` @@ -164,9 +164,9 @@ becomes ``` { - type: "TextPositionSelector" - start: ... - end: ... + "type": "TextPositionSelector" + "start": ... + "end": ... } ``` @@ -186,11 +186,11 @@ the `exact` value equal to the suffix, and with no suffix provided. So, for example, ``` -m.markup.text.range: { - start: 0, - end: { - prefix: "the", - suffix: " end" +"m.markup.text.range": { + "start": 0, + "end": { + "prefix": "the", + "suffix": " end" } } ``` @@ -199,16 +199,16 @@ becomes ``` { - type: RangeSelector, - startSelector: { - type: TextPositionSelector, - start: 0, - end: 0 + "type": "RangeSelector", + "startSelector": { + "type": "TextPositionSelector", + "start": 0, + "end": 0 } - endSelector: { - type: TextQuoteSelector - prefix: "the", - exact: " end" + "endSelector": { + "type": "TextQuoteSelector" + "prefix": "the", + "exact": " end" } } ```
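
The range serialization rules described in the proposal can be sketched in Python as follows. This is an illustration under stated assumptions rather than a reference implementation: the function names `endpoint_to_selector` and `range_to_wadm` are hypothetical, and the input is assumed to be the already-parsed value of an `m.markup.text.range` key.

```python
def endpoint_to_selector(endpoint):
    """Map one range endpoint to a WADM selector dictionary."""
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(endpoint, int) and not isinstance(endpoint, bool):
        # An offset becomes a zero-width TextPositionSelector.
        return {"type": "TextPositionSelector", "start": endpoint, "end": endpoint}
    # A prefix/suffix pair becomes a TextQuoteSelector whose `exact`
    # value is the suffix, with no suffix of its own.
    return {
        "type": "TextQuoteSelector",
        "prefix": endpoint["prefix"],
        "exact": endpoint["suffix"],
    }


def range_to_wadm(text_range):
    """Serialize an m.markup.text.range value as a WADM RangeSelector."""
    return {
        "type": "RangeSelector",
        "startSelector": endpoint_to_selector(text_range["start"]),
        "endSelector": endpoint_to_selector(text_range["end"]),
    }


example = {"start": 0, "end": {"prefix": "the", "suffix": " end"}}
print(range_to_wadm(example))
```

Applied to the example range from the proposal, the sketch produces the same `RangeSelector` structure as the final code block of the patch.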