Canonical XTM

This specification describes a serialization format for topic maps which has the property that all logically equivalent topic maps have the exact same byte-by-byte representation in this format. This can be used to test the conformance of XTM processors.

This document is not an official document in any sense; it is just a proposal for the consideration of the topic map community. The contributions of Geir Ove Grønmo are gratefully acknowledged.

The specification describes the serialization of a topic map into an output document, but does not concern itself with where that topic map came from. It is not a goal to ensure that the canonical topic map can be successfully read into an XTM processor, but merely to confirm that all processing defined by the XTM 1.0 specification has been performed correctly.

Before serialization, the topic map must be processed into a consistent topic map, as defined by XTM 1.0. When canonicalizing XTM documents, string normalization such as Unicode canonical decomposition must not be performed. (This should be based on a topic map data model, which would define this for us.)

The output document must be a canonical XML document. In addition, a line feed (U+000A) must be inserted after every end tag, and likewise after every start tag of elements that have element content or are empty. (This means <baseNameString>, <resourceData>, <topicRef>, <instanceOf>, <resourceRef>, <subjectIndicatorRef>.)

[Remark: Must handle: sorting of topics that have no characteristics and class-instance topic relationships with scope.]

2. Serialization

The document element must be a <topicMap> element with the xmlns attribute value set to http://www.topicmaps.org/cxtm/1.0/.

The topic map is serialized by first writing out all topics, and then writing out all associations. Since only one topic map is output, there is no mergemap information to serialize.

2.1. <topic>

Topics are ordered according to the rules in the 'Ordering principles' section. All <topic> elements must have an id attribute, set to the value 'idN', where N is the number of the topic in sort order, starting with 1.
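
As a minimal sketch of the numbering rule, assuming the topics have already been sorted according to the ordering principles (the function name assign_ids and the use of plain strings as stand-ins for topics are illustrative only, not part of this proposal):

```python
def assign_ids(sorted_topics):
    """Map each topic to 'idN', where N is its 1-based position in sort order."""
    return {topic: f"id{n}" for n, topic in enumerate(sorted_topics, start=1)}


# Hypothetical topics, already in canonical sort order.
ids = assign_ids(["cities", "norway", "oslo"])
print(ids["norway"])  # id2
```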

Topics are serialized by first writing out all class-instance relationships as <instanceOf> elements, then the <subjectIdentity> element, then all <baseName>s, then all <occurrence>s. The <instanceOf>, <baseName> and <occurrence> elements are ordered according to the rules in the 'Ordering principles' section.

2.2. <instanceOf>

A class-instance relationship is serialized as an <instanceOf> element, with the 'href' attribute set to the ID of the <topic> element representing the class topic, with the character '#' prepended.

Note that the <instanceOf> element is an empty element and so, according to the Canonical XML specification, must be serialized with both a start tag and an end tag, with nothing between the tags.
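
The following sketch shows one way the <instanceOf> element could be written out. The helper names escape_attr and instance_of are hypothetical; the escaping follows the attribute value rules of Canonical XML 1.0. The line feeds required by the introduction are deliberately left out to keep the example short.

```python
def escape_attr(value):
    """Escape an attribute value as Canonical XML 1.0 prescribes."""
    replacements = [
        ("&", "&amp;"),   # must be replaced first
        ("<", "&lt;"),
        ('"', "&quot;"),
        ("\t", "&#x9;"),
        ("\n", "&#xA;"),
        ("\r", "&#xD;"),
    ]
    for raw, escaped in replacements:
        value = value.replace(raw, escaped)
    return value


def instance_of(class_topic_id):
    """Serialize a class-instance relationship as an empty element
    with an explicit start tag and end tag."""
    href = escape_attr("#" + class_topic_id)
    return f'<instanceOf href="{href}"></instanceOf>'


print(instance_of("id3"))  # <instanceOf href="#id3"></instanceOf>
```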

2.3. <subjectIdentity>

If the topic has no addressable subject, nor any known subject indicators, this element is not output at all.

If the topic has an addressable subject, that is output first using a <resourceRef> element.

For each subject indicator the topic has, a <subjectIndicatorRef> element is output. The elements must be ordered according to the ordering principles.

2.4. <resourceRef>

The <resourceRef> element is an empty element, holding the reference to the resource in its 'href' attribute.

2.5. <subjectIndicatorRef>

The <subjectIndicatorRef> element is an empty element, holding the reference to the subject indicator in its 'href' attribute.

2.6. <baseName>

Each topic name is serialized using a <baseName> element. First the scope is written out using the <scope> element, then the base name value in the <baseNameString> element and finally the variant names using <variant> elements. The variant names must be ordered according to the ordering principles.
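
To illustrate the order of the child elements, here is a rough sketch. It assumes that the theme ids are already in canonical order and are simple 'idN' values; variants and the line feeds required by the introduction are omitted, and the names escape_text and base_name are hypothetical.

```python
def escape_text(text):
    """Escape character data as Canonical XML 1.0 prescribes for text nodes."""
    for raw, escaped in (("&", "&amp;"), ("<", "&lt;"), (">", "&gt;"), ("\r", "&#xD;")):
        text = text.replace(raw, escaped)
    return text


def base_name(value, theme_ids=()):
    """Write <scope> (only if non-empty), then <baseNameString>."""
    parts = ["<baseName>"]
    if theme_ids:
        parts.append("<scope>")
        parts.extend(f'<topicRef href="#{tid}"></topicRef>' for tid in theme_ids)
        parts.append("</scope>")
    parts.append(f"<baseNameString>{escape_text(value)}</baseNameString>")
    parts.append("</baseName>")
    return "".join(parts)


print(base_name("Rock & roll", ["id2"]))
```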

2.7. <scope>

If the scoped topic map construct has an empty scope, this element is not output at all. If it has a non-empty scope, references to the topics making up that scope are written out using <topicRef> elements in the order defined by the ordering principles.

Note that in all cases the scope that is output must be the scope that results from inheriting the scope of any parent elements that have scope. The scope of a variant name therefore consists of the union of its own scope and the scopes of all its ancestors.
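
The inheritance rule amounts to a set union. A minimal sketch, assuming each scope is represented as a set of topic ids (the function name effective_scope and the sample ids are hypothetical):

```python
def effective_scope(own_scope, ancestor_scopes):
    """Union of a construct's own themes and the themes of all its ancestors."""
    result = set(own_scope)
    for scope in ancestor_scopes:
        result |= set(scope)
    return result


# A variant nested inside another variant, nested inside a base name.
base_name_scope = {"id4"}
variant_scope = {"id7"}
nested_variant_scope = {"id9"}

print(sorted(effective_scope(nested_variant_scope, [variant_scope, base_name_scope])))
# ['id4', 'id7', 'id9']
```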

2.8. <baseNameString>

Contains the base name value.

2.9. <variant>

Each variant name is serialized using a <variant> element. First its parameters are written out using the <scope> element, then the variant name value in the <variantName> element and finally any child variant names using <variant> elements. The variant names must be ordered according to the ordering principles.

2.10. <variantName>

Contains the variant name value.

2.11. <occurrence>

Each occurrence is written out using an <occurrence> element. If the occurrence is an instance of a class, an <instanceOf> element is output first. This is followed by a <scope> element representing the scope of the occurrence (provided it is non-empty), and finally by a <resourceRef> element if the occurrence is an external resource or a <resourceData> element if the occurrence is an internal resource. [Remark: This is probably too vague]

2.12. <resourceData>

Contains the resource inline.

2.13. <association>

Associations are serialized using <association> elements, which first contain an <instanceOf> element (if the association is an instance of a class), a <scope> element (unless the association is in the unconstrained scope), and finally a <member> element for each participating topic in the association. The <member> elements must be ordered according to the ordering principles.

3. Ordering principles

This section establishes how to determine the ordering of each topic map element that is written out. This is used to ensure that all elements are serialized in a specific order. The order is obtained by sorting the elements according to the rules specified below.

3.1. General rules

The following rules are common to all object ordering rules specified below:

String values are ordered in lexicographical order, based on UCS code point values. [Remark: Or using a locale-specific collator?]
Object properties with no value are considered to be ordered before properties with a value.
If no means of differentiating the ordering of objects is found, the system-generated ids of the objects should be used for ordering. The object ids should be converted to strings and ordered by their string values. This should guarantee consistent ordering of objects within a given topic map instance, but no guarantee is given across systems or topic map instances.
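
The general rules above can be sketched as sort keys. This is only an illustration: the names property_key and fallback_key are hypothetical, and it follows the code point reading of the string rule rather than a locale-specific collator.

```python
def property_key(value):
    """Sort key for one string property: absent values first, then code point order."""
    # Python compares str objects by Unicode code point, so no collator is involved.
    return (value is not None, "" if value is None else value)


def fallback_key(object_id):
    """Last-resort key: the system-generated id, compared as a string."""
    return str(object_id)


values = ["b", None, "a", "Z", "\u00e9"]
print(sorted(values, key=property_key))
# [None, 'Z', 'a', 'b', 'é']

print(sorted([10, 9, 100], key=fallback_key))
# [10, 100, 9]
```

Note that comparing ids as strings places 100 before 9; that is a consequence of the string conversion the rule asks for.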

3.2. Collections

The collections must first be sorted using the ordering rules of their components, before they can be compared.
Starting from the beginning of the collections, individual items in the collections are compared. The collections are considered to have different ordering at the first point where their items have different ordering. If one of the collections is exhausted before the other, the collection with the higher number of items is considered to be ordered after the one with the lower number of items.
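
A short sketch of the collection rule follows, under the assumption that items can be reduced to comparable keys (the name collection_key and the sample sets are hypothetical). Python's built-in list comparison already works item by item and orders a proper prefix before the longer list, which matches the rule.

```python
def collection_key(items, item_key=lambda item: item):
    """Sort the items by their own ordering first; the resulting lists are
    then compared position by position, with a shorter prefix ordered first."""
    return sorted(item_key(item) for item in items)


a = {"id2", "id1"}
b = {"id1", "id2", "id3"}
c = {"id1", "id3"}

# a is a prefix of b, so a < b; b and c first differ at the second item.
print(sorted([c, b, a], key=collection_key) == [a, b, c])  # True
```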

3.3. Locators

The ordering is defined by first comparing the locator addresses.
Failing that, the locators are compared by comparing their notations.

3.4. Topics

If the topics have addressable subjects, the order is defined by comparing the locators.
Failing that, if the topics have subject indicators, the sets of locators are compared.
Failing that, if the topics have base names, the sets of base names are compared.
Failing that, if the topics have occurrences, the sets of occurrences are compared.
Failing that, if the topics have types, the sets of types are compared.
Failing that, if the topics have source locators, the sets of locators are compared.

3.5. Topic references

Topic references always have the same relative ordering as the topic that they refer to.

3.6. Base names

The ordering is defined by first comparing the basename values.
Failing that, if the basenames have scope, the sets of themes are compared.
Failing that, if the basenames have variants, the sets of variants are compared.
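
The "failing that" cascade can be expressed as a tuple key, where later components are consulted only when the earlier ones are equal. The sketch below makes several simplifying assumptions: the BaseName class is hypothetical, themes are represented by the sort position of the scoping topic (an integer), and variants are reduced to their string values.

```python
from dataclasses import dataclass, field


@dataclass
class BaseName:
    value: str
    themes: list = field(default_factory=list)    # sort positions of scoping topics
    variants: list = field(default_factory=list)  # variant name values


def base_name_key(name):
    """Compare by value, then by the set of themes, then by the set of variants."""
    return (name.value, sorted(name.themes), sorted(name.variants))


names = [
    BaseName("Oslo", themes=[5]),
    BaseName("Christiania"),
    BaseName("Oslo"),
]
for name in sorted(names, key=base_name_key):
    print(name.value, name.themes)
# Christiania []
# Oslo []
# Oslo [5]
```

An empty list of themes sorts before any non-empty one, which agrees with the general rule that properties with no value are ordered first.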

3.7. Variant names

The ordering is defined by first comparing the inline variant name values.
Failing that, if the variant names have locators, the locators are compared.
Failing that, if the variant names have scope, the sets of themes are compared.

3.8. Occurrences

The ordering is defined by first comparing the inline occurrence values.
Failing that, if the occurrences have locators, the locators are compared.
Failing that, if the occurrences have type, the type topics are compared.
Failing that, if the occurrences have scope, the sets of themes are compared.

3.9. Associations

The ordering is defined by first comparing the association types.
Failing that, if the associations have scope, the sets of themes are compared.
Failing that, if the associations have roles, the sets of roles are compared.

3.10. Association roles

The ordering is defined by first comparing the association role types.
Failing that, if the association roles have players, the player topics are compared.

4. Canonical XTM DTD

<!ELEMENT topicMap (topic*, association*)>
<!ATTLIST topicMap
          xmlns       CDATA #FIXED "http://www.topicmaps.org/cxtm/1.0/">

<!ELEMENT topic (instanceOf*, subjectIdentity?, baseName*, occurrence*)>
<!ATTLIST topic
          id          ID    #REQUIRED>
          
<!ELEMENT instanceOf EMPTY>
<!ATTLIST instanceOf 
          href        CDATA #REQUIRED>

<!ELEMENT subjectIdentity (resourceRef?, subjectIndicatorRef*)>

<!ELEMENT resourceRef EMPTY>
<!ATTLIST resourceRef
          href        CDATA #REQUIRED>

<!ELEMENT subjectIndicatorRef EMPTY>
<!ATTLIST subjectIndicatorRef
          href        CDATA #REQUIRED>


<!ELEMENT baseName (scope?, baseNameString, variant*)>

<!ELEMENT scope (topicRef+)>
<!ELEMENT topicRef EMPTY>
<!ATTLIST topicRef
          href        CDATA #REQUIRED>

<!ELEMENT baseNameString (#PCDATA)>


<!ELEMENT variant (scope, variantName, variant*)>
<!ELEMENT variantName (resourceData | resourceRef)>
<!ELEMENT resourceData (#PCDATA)>


<!ELEMENT occurrence (instanceOf?, scope?, (resourceRef | resourceData))>


<!ELEMENT association (instanceOf?, scope?, member+)>
<!ELEMENT member (instanceOf?, topicRef)>
