Extending tolog

This paper describes a number of extensions that might be made to the tolog query language for topic maps in order to make it fulfill all the requirements for the ISO-standardized TMQL (Topic Map Query Language). First, the lessons to be learned from the considerable body of research into the Datalog query language are considered. Finally, a number of different extensions to the existing tolog query language are considered and evaluated.

This paper extends and improves on earlier work on tolog, first described in [Garshol01].

Biography

Lars Marius Garshol is currently Development Manager at Ontopia, a leading topic map software vendor. He has been active in the XML and topic map communities as a speaker, consultant and open source developer for a number of years.

Lars Marius has also been responsible for adding Unicode support to the Opera web browser. His book on Definitive XML Application Development was published by Prentice-Hall in its Charles Goldfarb series last year.

Lars Marius is one of the editors of the ISO Topic Map Query Language standard, and also co-author of the Topic Map standard itself.

1. Introduction

Work on query languages for topic maps has been progressing for the past 3-4 years, and ISO has started a formal work item to create a standardized query language for topic maps, to be called TMQL (Topic Maps Query Language), which will eventually become ISO 18048. At the moment, however, there is no standard TMQL; only query languages developed by companies and individuals.

The ISO committee responsible for topic maps evaluated the existing query languages for topic maps at its meeting in London in May 2003, and found that simply selecting one of the languages would not be the right approach. Instead, what was needed was further development of the individual proposals, as well as a better understanding of the relationships between them, and the possible approaches to creating a topic map query language.

This paper discusses a particular query language for topic maps, first described in [Garshol01], and since implemented in the OKS (Ontopia Knowledge Suite) and TM4J. A draft specification of the first tolog version, now dubbed 0.1, also exists [Garshol03b]. At the time of writing the language has been in active use for well over a year, and has been used in several commercial projects.

This paper further develops the language, adding several new features, and improving the syntax somewhat, based on this practical experience. The purpose of the paper is to explore how far it is possible to take the language, to contribute to the work on a standardized topic map query language, and to improve the language for future versions of the OKS (and presumably also TM4J).

2. tolog

The tolog query language is inspired by, and syntactically similar to, Prolog, but leaves out most of the Prolog programming language, and also has differences in syntax and evaluation model. Parts of SQL have been grafted onto this base. The Prolog syntax has also been extended to support treating topic map association types as predicates and to express OR conditions directly in queries.

The 0.1 version of tolog supported querying on associations of specified types and class-instance relationships, but all other parts of the topic map model were beyond its reach. Section section 4.5. describes how tolog 0.1 can be extended to reach all of topic maps.

2.1. Goals for tolog

tolog was created to serve three main purposes:

Simplifying the development of topic map-based applications. Extracting information from topic maps using a topic map engine API requires a substantial amount of code; a good query language dramatically reduces the complexity of this task.
Improving the performance of topic map applications. When using the API to traverse a topic map the application only informs the engine of what it will do in small pieces at a time (essentially method call by method call) whereas a query reflects a much larger piece of intent ("give me all topics of type 'composer' that were born in 'Italy'"). This provides the topic map engine with much greater opportunities for optimization, and so leads to better performance.
Use as an embedded language in other languages, such as schema languages, data binding languages, publishing languages, conversion control languages, and so on; that is, tolog was intended to play something of the same role that XPath plays for XML.

The language as it currently exists meets all these goals, but needs to be extended in order to serve them more fully, since at present the language clearly is not sufficiently powerful to meet all the needs of its users.

Its key shortcomings in version 0.1 are:

Names, occurrences, scope, and reification cannot be queried at all. For publishing especially, but also for many other applications, this is an unacceptable shortcoming.
Queries can only be made against specific ontologies; there is no way to write introspective queries without knowledge of the ontology, such as a query to find all association types in a topic map. Again, this is unacceptable.

The extensions proposed in section 4.5. rectify this. There are, however, other shortcomings beyond the key ones, and the discussion of these make up the rest of the paper.

2.2. A quick tutorial

tolog queries basically have two phases: the query phase, which extracts information from the topic map, and the post-processing phase, where the results from the query phase are processed for presentation. The query phase is specified using predicates, which are of the form shown below:

predicate($VAR1, $VAR2)?

This minimal query will produce all combinations of values that make the predicate true. For the built-in 'instance-of' predicate the following query:

instance-of($A, $B)?

would produce a set of (A, B) pairs containing every instance (A) of every class (B), because $A and $B are variable references that can be filled with any value. A typical example might look as shown below.

A	B
Norway	country
Sweden	country
Denmark	country
Oslo	city
Stockholm	city
Köbenhavn	city

By replacing one of them with a literal one side of the predicate can be tied down. Thus

instance-of($A, country)?

will produce all values of $A that make the predicate true; that is, it will find all instances of the type country. (In this case: Norway, Sweden, and Denmark.) Pinning the query down on the opposite side gives a similar, but different, result:

instance-of(Norway, $B)?

namely all the classes Norway belongs to. (In this case only country.)

tolog 0.1 has two built-in predicates: 'instance-of' and 'direct-instance-of' (the former takes superclass-subclass relationships into account; the latter does not). In addition, every association type can be treated as a predicate. That is, if there is an association type 'born-in' with roles 'person' and 'place' the following query will produce all combinations of places where a person is born.

born-in($A : person, $B : place)?

As can be seen, the syntax is slightly different here, since we need to make it clear which role is played by $A and which by $B. This is what the : roletype suffix does^[1]. Without it would be difficult to evaluate the queries correctly, and in some cases even impossible (when both roles are played by topics of the same type).

In addition to queries consisting of single predicates it is possible to do queries by chaining together predicates in the same way as in Prolog. For example:

born-in($A : person, $B : place),
located-in($B : containee, italy : container)?

This query would first find all people and the places they are born, then it would check if the birthplace is in Italy, and, if it is not the match would be rejected^[2]. As in Prolog there is an implicit AND between the predicates.

In tolog 0.1 there was one last way to create a predicate, and that was to define an inference rule. These used Prolog syntax (or a subset thereof), and if we wanted to turn the query above into a rule (for example for ease of use) it would look as follows.

italian($A) :- born-in($A : person, $B : place),
               located-in($B : containee, italy : container).

This rule defines the predicate italian, which is true for any person born in a place that is in Italy.

Rules can use any predicate, including other rules, which enables the user to build quite extensive pieces of business logic using declarative rules, which can be easily reused, thus simplifying application development.

Quite often one wants to run a query where alternatives are necessary. One example is from the Italian Opera topic map where, if one wants to find all opera premieres for a given city, one must check against both the city and the specific theatre, since in several cases the theatre is not known. To find all opera premieres in Milano^[3] we must therefore do the following.

{ premiere($OPERA : opera, milano : place) |
  premiere($OPERA : opera, $THEATRE : place),
  located-in($THEATRE : containee, milano : container)
}?

Here the query is split into two branches, separated by a | character and wrapped in curly braces. This means that the two branches are alternatives, joined with OR. Consequently, we will find all operas that premiered in Milano OR which premiered in a theatre located in Milano.

2.3. The post-processing features

Typically, when retrieving, say, all person topics in the topic map, one would like to get them in alphabetical order. tolog supports this through its order by clause, which can be used as follows:

instance-of($A, person)
order by $A?

This makes tolog sort the query results by the sort name of the person topics before returning the query result. It's also possible to specify whether one wants ascending or descending sort, and to sort on multiple variables.

Another common requirement is to be able to ignore variables that are only used in the query to get the correct result, but which are not actually wanted in the query result. One simple example is the search for Italians in section where the B variable contains the birthplace, which we are not interested in. To get rid of it we can use a select clause to project the query down to the single variable we are interested in:

select $A from
born-in($A : person, $B : place),
located-in($B : containee, italy : container)?

The select clause also supports counting, which means that we can do things like produce a report of the number of people born in each town in Italy.

select $B, count($A) from
born-in($A : person, $B : place),
located-in($B : containee, italy : container)?

There is, however, no support for a group by clause like that of SQL, which means that this feature is less useful than it might be. This is discussed further in section 5.5..

2.4. Topic references

A key problem in designing a topic map query language is the issue of how to address topics, because topics are at one and the same time the means by which the structure of a query is specified and also the main literals used in the query. To reference a table or a column in SQL or an element type and an attribute type in XQuery is trivial; one simply writes its name. In topic maps, however, topics need not have unique names, though they may have URIs, which unfortunately can be of three different kinds.

This considerably complicates the most central part of any topic map query language: how to refer to the topics that make up the query. To illustrate: I want to find every person born in Milano. To do this, I must refer to four topics: the born-in topic, the place topic, the person topic, and the Milano topic. For each I must know: does it have a subject indicator? A subject address? A source locator? And if the answer to any of these is 'yes', what is the locator?

In tolog 0.1 there were five different ways to refer to a topic:

by topic ID, using ordinary XML name syntax (without the ':' character). This ID was resolved by taking the URI in the [base locator] property of the topic map item, adding the '#' character, and then the ID, then finding the topic with the resulting URI in its [source locators] property.
by source locator, by just giving the URI as follows: s"http://www.example.com".
by subject indicator, again by giving the URI in a similar fashion: i"http://www.example.com".
by subject address, again by giving the URI in a similar fashion: a"http://www.example.com".
by object ID, using the @id syntax. There is no such thing as an object ID in the topic map standard, nor in [ISO 13250-2], nor anything like a standard for their syntax, so this part of the language is OKS-specific. If tolog ever becomes the basis for TMQL this feature will have to go.

While this is sufficient in the sense that it allows everything that can be addressed directly to be addressed, it is not good enough, because having to identify topics using full-length URIs inside queries is simply too verbose, making queries hard to both write and read. This is addressed in section 4.1..

3. Datalog for topic maps?

tolog was originally based on inspiration primarily from Prolog, but also to some extent from SQL. However, it was later realized that in a number of respects it is remarkably similar to Datalog, the query language used by deductive databases. Deductive databases were a particular kind of database based on logic programming that was popular in the late 80s and early 90s. These databases were often layered on top of traditional relational databases and implemented by translating Datalog queries to SQL. Good overview papers are [Liu96] and [Ullman90]; [Elmasri94] also has a chapter on Datalog.

3.1. Why reinvent the wheel?

One question that may be lingering in the reader's mind at this point is why we bother to effectively reinvent the deductive database using a new data model when there is an existing body of implementations and research into deductive databases. The answer, of course, is that this is not what is actually happening. Topic maps are a data model first and foremost, and the query language is only secondary, whereas with deductive databases the query language was primary, and that language even dictated the data model.

The topic map data model has a number of features that the deductive databases did not have, such as merging, scope, global object identity, and standardized interchange syntaxes, and it also has some features that were only added to a few deductive databases after extensive research, such as object identity through proxies [Liu90].

In short, the answer is that the wheel is not being reinvented; it is just being adapted to a new kind of vehicle, making use of the experiences from its use on the original type of vehicle. In other words, we are trying to retrofit the good features of deductive databases onto topic maps.

3.2. Comparison with Datalog

Datalog and tolog share a number of important differences with Prolog, the most important of which is that both are substantially simpler than Prolog. Datalog does not have lists, it has a heavily reduced standard library of predicates, and does not support predicates in rule heads. There is also a difference in intent: Datalog is intended as a database query language; Prolog as a full programming language. In all of these cases, tolog matches up with Datalog rather than with Prolog.

3.2.1. Evaluation algorithm

Another important shared difference with Prolog is that neither tolog nor Datalog has a specified evaluation algorithm. Prolog specifies that evaluation must be backward-chaining (or top-down depth-first, as it is also known), whereas in both Datalog and tolog the only thing that is required is to produce the correct result. How to do that is left to the language processor, which can then choose the right strategy for each particular query.

Some Datalog processors also exploit this to avoid infinite loops with recursive inference rules by using a technique known as fixpoint; that is, they stop once the query result no longer changes with new iterations. This allows them to handle infinite loops without requiring developers to write the query to explicitly test for such loops, which is necessary in Prolog.

The freedom of evaluation algorithm can also be exploited for optimization purposes. The existing tolog implementations all use a top-down breadth-first algorithm, which gives good performance in-memory, but is susceptible to infinite loops, and also gives less good performance when implemented against a persistent data store. However, it is expected that as implementations mature processors will make better use of the opportunities for optimization provided by this freedom.

It should be noted that not specifying the evaluation algorithm carries a hidden cost: predicates cannot have side-effects. If predicates can have side-effects the order in which they are evaluated matters, since the side-effects modify the environment, and such modifications are rarely independent. For example, if predicates are used to do output from the query the order of evaluation will affect the order of the resulting output. What this means in practice is that great care must be taken to ensure that the design of the query language does not violate this principle.

3.2.2. tolog queries topic maps

This is obvious, of course, but it is a major difference with Datalog, and the cause of several of the differences between the two. Datalog queries an underlying database of facts and rules, where the rules are like tolog rules, but the facts are statements of the form:

born-in(puccini, lucca).

This gives the fact that Puccini was born in Lucca, and does exactly the same as the born-in association would do in a topic map. Since topic map constructs are structurally different from Datalog constructs in several important ways, this also requires tolog to be substantially different from Datalog, as we will see in the following sections.

3.2.3. Syntax for variables

One obvious syntactical difference between tolog and Datalog is that in tolog variables use the $VAR syntax, whereas in Datalog they use the Var syntax. That is, in tolog variable references begin with a dollar sign, whereas in Datalog they begin with an upper-case letter.

tolog originally used the same syntax as Datalog, but it was quickly realized that since topic IDs could have any combination of upper- and lower-case letters this was ambiguous. The dollar sign was introduced to solve this, and has done its job well. While this means that there is a slight incompatibility between the syntaxes the ease of reference provided by the IDs is considered more important than syntactical compatibility with Datalog.

3.2.4. Direct support for OR

Another difference is the { ... | ... } syntax for OR, which does not exist in Datalog. Instead, to do the opera premiere query in Datalog one would have to define an ad-hoc rule and then query using that, like so:

city-premiere($A, $B) :- premiere($A, $B), instance-of($B, city).
city-premiere($A, $B) :- premiere($A, $C), located-in($C, $B).

city-premiere($A, milano)?

The tolog approach is clearly more intuitive for users, since they don't have to introduce a new rule, think of a name for it, give it two definitions, and finally use it. For this reason it was decided to use the OR syntax rather than require rule definitions (which also complicates the query interface).

So far user feedback seems to indicate that the decision was right.

3.2.5. Unique predicate names

Tied to the issue of inline OR is the issue of the uniqueness of predicates. In tolog there can only be one definition of each predicate, whereas in Datalog it is permitted to have any number of definitions of the same predicate, which, as we saw, is how OR is implemented. Since tolog has direct support for OR there is no need to support multiple definitions of the same predicate, but it would still be entirely possible to do so.

One benefit of doing so is that one would then be able to have inference rules with the same name as an association type that would then generate any instances of the association that could be deduced from other information in the topic map. This would be useful for transitive relationships like 'is contained in', so that contained-in(italy, $PLACE)? would find both places directly contained in Italy and those indirectly contained there. However, since association predicates and rule predicates have different syntaxes it is not clear how this would work in practice, and so it seems better to abandon the idea and insist that predicate names be unique.

This would prevent different definitions of the same rule from different sources from complementing one another rather than be in conflict. This might well be a bug more than a feature, however, as such rules might well have different intentions, in which case one would not want them to merge.

Implementation-wise, having multiple definitions of the same predicate is not a problem. This is just treated as if there were an implicit OR branch between them in the query. It does seem however, that the benefits are difficult to exploit, and that the negatives are quite clear, and so predicate names should be required to be unique.

3.2.6. The interpretation of NOT

In tolog 0.1 the not operator was interpreted quite simply as a filter. Given a temporary result from the preceding parts of the query it would remove all rows in the result for which the contents of the not operator would evaluate as true. The author has not been able to find a name for this approach in existing logic programming literature; possibly because it does not have one.

However, this is not the only possible definition of the not operator. For example, there are two ways to interpret the rule given below.

bachelor($X) :- sex($X, "male"),
                not(married($X : husband, $Y : wife))?

In this rule there is an implicit quantification that becomes obvious if the rule is interpreted logically: is X a bachelor if X is male and not married to anyone or if X is male and not married to everyone? The naïve user will expect the first, but logicians argue that the second is right, since it is equivalent to quantifying every variable as "for all" .

As language designer, however, it seems strange to choose logical purity over what the user expects, and what is actually the most useful. Experience has shown that the desired interpretation in every case (or nearly so) is the first. In such cases it seems clear that it is the duty of the language designer to choose practicality over purity.

Another logical problem with not is how to interpret queries where a negated clause recursively depends on itself. Such queries have no proper logical interpretation, and in Datalog implementations one has traditionally required that this not occur. Rule bases where such recursion does not occur have been called 'stratified', and there are efficient algorithms for determining whether a rule base is stratified and to use stratification information in query evaluation.

At this point the author is forced to admit that this understanding of the interaction of the different logical interpretations of not, optimizations of recursive rules like the magic sets transformations, and stratification is insufficient. Whether and how stratification affects tolog is therefore not yet known, and more research is required to find out.

3.2.7. Support for sets

Pure Datalog does not have support for a set type, but quite a few implementations added this as an extension. [Liu98] discusses the approaches taken by four such implementations: LDL, COL, Hilog, and his own Relationlog. All of these allow the construction of sets through the use of set literals, as well as building sets during querying through special operators in rule heads. Some also allow the accessing of nested sets through partial set terms, which are matched against the queried sets.

Using rule heads to build sets is not very appealing intuitively, and also because it requires users to define rules in order to be able to build sets at all. Further, the evaluation of the four existing implementations provided by [Liu98] is not very promising. All four implementations have serious shortcomings in what operations they support on sets and what can be queried, despite being quite thoroughly researched and designed.

There are several use cases for tolog that would seem to require sets, such as working with scope, which is a set of topics, but as it has not been conclusively demonstrated that there is a real need for sets in tolog the author has decided to leave support for them out until it is clear that they are necessary and that they can be made to work as desired.

3.2.8. The safeness of predicates

Several predicates define relations that are infinite, and for such predicates the query result will be infinite if too few of the arguments are specified directly. For example, the string-length(STR, LEN) predicate has this problem. The query below shows the problem.

string-length(A, B)?

This predicate produces every possible string and for each string its length, which (at least theoretically) is an infinite set. The query below is unproblematic, since the string is given, and the query result then has only a single result row.

string-length("foo", B)?

If we turn this around we get a version of the query which is safe in theory and equally lethal in practice.

string-length(A, 10)?

This query should generate all strings of length 10, and for any finite alphabet (or character set) this is a finite set. However, at the time of writing there are roughly 100,000 characters in Unicode, which would give 100000¹⁰ = 10^5*10 = 10⁵⁰ different strings in the query result. In other words, the query would still be running when the sun swells out to a red giant and swallows the earth, and the query processor with it.

The only solution, in both Datalog and tolog, is to detect and disallow unsafe queries. In tolog 0.1 it is not possible to use predicates in an unsafe way, but this changes in tolog 1.0, with the introduction of predicates such as those in section 4.6..

4. tolog modules

A key problem in tolog is how to address individual topics used as predicates, association roles, and literals in queries. Topics can be addressed using URIs in three different ways, using source locators, subject addresses, or subject identifiers. tolog 0.1 had an elegant solution to this for the most common class of source locators, but for all other kinds of URIs the URIs must be given in full, which bloats the query quite substantially.

At the same time, there is the issue of how to make tolog extensible through the addition of new predicates. This may be association predicates, built-in topic map predicates, predicates for handling values of the basic types, rule predicates, predicates for RDF, and so on.

The module system is an attempt to handle both of these problems with a single solution. The next section presents this system, and the following sections show how it can be applied.

4.1. The module system

The most basic component of the module system is the declaration of name prefixes that are bound to a particular namespace, rather like XML namespaces. In the case of tolog a namespace is a set of values, each of which has a URI. The declaration binds the prefix to a URI prefix and a qualification of it. The example below shows the prefix xtm bound as a subject identifier to the XTM 1.0 core URI prefix.

using xtm for "http://www.topicmaps.org/xtm/1.0/core.xtm#" as identifier
select $SUPER from
  xtm:superclass-subclass($SUPER : xtm:superclass, $SUB : xtm:subclass),
  not(xtm:superclass-subclass($TMP : xtm:superclass, $SUPER : xtm:subclass))?

This query finds all classes which have no superclasses. Identifiers which have the xtm prefix are appended to the URI prefix to produce a URI and the topic with this URI as its subject identifier is then looked up. It is then applied as an association predicate or association role in the same way as in tolog 0.1.

There are more alternatives than just the identifier keyword, however. The source and subject keywords can be used to look up topics by source locator or subject address. In addition to this there is the keyword module, which is used for modules consisting of predicates. It is these modules that are presented in the following sections.

4.2. Topics interpreted as predicates

One thing we have glossed over so far is how topics can actually be interpreted as predicates. In tolog 0.1 a topic could be interpreted as a predicate provided it was used as an association type of at least on association in the topic map. It then had to have arguments that were pairs, of the form topic : role-type, where the latter had to be a topic reference specifying the role type.

In the new tolog version there are three possible cases, depending on whether the topic is

an association type, in which case it denotes an association predicate,
an occurrence type, in which case it denotes an occurrence predicate. Occurrence predicates are always binary, and take simple arguments rather than pairs. The first argument is the topic, the second the occurrence string or locator. Or, the topic may be
a name type^[4] in which case it denotes a name predicate (which is of the same form as occurrence predicates).

This means that tolog 1.0 will allow queries such as

instance-of($COMPOSER, composer),
birthdate($COMPOSER, $DATE)
order by $DATE?

which will return composers and their birth dates, sorted by the dates. (Note that the query engine will not know that the dates are dates, and will sort them as strings. If they are in ISO 8601 format that will give the correct order, but other date formats will not.)

Note that it is perfectly possible (if unusual) for a topic to be an association type, an occurrence type, and a name type all at the same time. Topics are evaluated as predicates in the order given in the list above, which means that a topic will be an association predicate if used as an association type, and only be considered an occurrence predicate if not used as an association type (and so on).

This discourages the abuse of the same topic as multiple kinds of types, but does not make it impossible to query occurrences of a type that is also an association type, since the predicates in section 4.4. provide a (slightly more cumbersome) way to query such occurrences. (An example is provided there.)

4.3. Extensibility

While tolog 1.0 will define some built-in modules it will also allow new modules to be added through various means, one of which will be the addition of new built-in modules in various implementations, in much the same way as in XPath and XSLT.

To allow this, a type system is necessary so that interoperability between different predicates can be ensured. Two issues must be clarified for the values produced by different predicates: under what circumstances are they equal, and what is their ordering? The type system provides a set of basic types for which these two questions are clarified, and allows more types to be added by modules as necessary.

In tolog 0.1 the only possible value that could occur during the query phase was topics, and there was no extensibility, so this was not a problem. With the changes described so far there will be two types:

topics, which have the equality rule assigned to them in [ISO 13250-2], and
strings, which have the equality and ordering rules assigned to them in [ISO 13250-2].

This leaves only the ordering of topics, which is done in the same way as in tolog 0.1.

[[[Yeah, fine, but how do we define an extensible type system? Equality is fine, but what about ordering across types?]]]

4.4. The topic maps module

The current version of tolog can only query associations and the class/instance relationships of topics. Clearly, in order to qualify as a candidate for TMQL tolog will have to be able to query all parts of the topic map model. The topic maps module contains predicates that allow tolog access to all aspects of topic maps.

The predicate for accessing topics is called topic and takes the following keyword arguments:

topic: contains the topic itself,
locators: contains source locators of the topic,
identifiers: contains subject identifiers of the topic,
addresses: contains subject addresses of the topic, and
reified: contains what the topic reifies, if anything.

There is a similar predicate for base names, called basename, which takes the following keyword arguments:

topic: the topic of which this is a base name,
basename: the base name itself,
scope: the scope of the base name,
value: the string value of the base name,
reifier: the topic reifying this base name, and
locators: the source locators of the base name.

The occurrence predicate has the following keyword arguments:

topic: the topic of which this is a occurrence,
occurrence: the occurrence itself,
type: the type of the occurrence,
scope: the scope of the occurrence,
value: the string value of the occurrence,
locator: the locator of the occurrence resource,
reifier: the topic reifying this occurrence, and
locators: the source locators of the occurrence.

There would need to be more predicates (one for each information item in [ISO 13250-2]), but this is enough to give the flavour of the module. Given this, the query below would find all topics used as occurrence types:

select $OCCTYPE from
  occurrence($OCCTYPE : type)?

It would also be possible to find the birthdate of each composer, even though birthdate might also be an association type.

instance-of($COMPOSER, composer),
occurrence($COMPOSER : topic, birthdate : type, $DATE : value)?

As can be seen, this approach works, but the number of arguments to each predicate becomes large, which means that each predicate is doing a large amount of work, and is tricky to understand. It also requires the use of pairs rather than single arguments, and implementing these predicates efficiently is very hard because of the number of possible argument combinations.

4.5. Alternative topic map module

An alternative approach to designing a topic map module is to use the properties in the data model to generate predicates rather than the information items. This leads to a much simpler design, shown below as a list of predicates with definitions. The definitions take the form of a generative procedure for producing all matches to the open form of the predicate (the form where no arguments are bound).

type($ITEM, $TYPE): the complete set of matches can be generated by traversing all information items in the topic map, producing a match whenever the item has a non-null [type] property, (binding the item to ITEM and the value of [type] to TYPE),
scope($ITEM, $THEME): generate by traversing all information items in the topic map, producing a match for each topic item in the [scope] property (if the item has one), (binding the item to ITEM and the elements of the [scope] to THEME),
reifies($REIFIED, $REIFIER): traverse all topic items in the topic map, producing a match whenever the topic item has a non-null value in its [reified] property, (binding the topic item to REIFIER and the value of [reified] to REIFIED),
source-locator($ITEM, $LOCATOR): traverse all information items in the topic map, producing a match for each element in its [source locators] property, (binding the information item to ITEM and its source locator to LOCATOR),
topic($TOPIC): traverse all topic items in the topic map, producing a match for each,
association($ASSOC): traverse all association items in the topic map, producing a match for each,
topicmap($TM): the only match is the topic map item,
topic-name($TOPIC, $NAME): traverse all topic items in the topic map, producing a match for each of the elements in its [topic names] property, (binding the topic to TOPIC and the topic name to NAME),
variant($NAME, $VARIANT): traverse all topic name items in the topic map, producing a match for each of the elements in its [variants] property, (binding the topic name to NAME and the variant to VARIANT),
occurrence($TOPIC, $OCC): traverse all topic items in the topic map, producing a match for each of the elements in its [occurrences] property, (binding the topic to TOPIC and the occurrence to OCC),
association-role($ASSOC, $ROLE): traverse all association items in the topic map, producing a match for each of the elements in its [roles] property, (binding the association to ASSOC and the role to ROLE),
base-locator($TM, $LOCATOR): if the topic map item has a non-null value for its [base locator] property there is a single match, where the topic map item is bound to TM and the locator to LOCATOR,
identifier($TOPIC, $LOCATOR): traverse all topic items in the topic map, producing a match for each of the elements in its [subject identifiers] property, (binding the topic to TOPIC and the locator to LOCATOR),
address($TOPIC, $LOCATOR): traverse all topic items in the topic map, producing a match for each of the elements in its [subject addresses] property, (binding the topic to TOPIC and the locator to LOCATOR),
value($ITEM, $STRING): traverse all topic name, variant, and occurrence items in the topic map, producing a match for each when the [value] property has a non-null value, (binding the information item to ITEM and the value of [value] to STRING),
resource($ITEM, $LOCATOR): traverse all variant and occurrence items in the topic map, producing a match for each when the [resource] property has a non-null value, (binding the information item to ITEM and the value of [resource] to LOCATOR), and
player($ROLE, $TOPIC): traverse all association role items in the topic map, producing a match for each with the role bound to ROLE and the value of the [player] property to TOPIC.

With these predicates in hand it is possible to query all aspects of the topic map model, even without knowledge of the ontology used in the topic map. This makes it possible to do previously impossible queries, such as the one below, which finds all association types in the topic map.

select $ATYPE from
  association($ASSOC), type($ASSOC, $ATYPE)?

We can also find all topics used as scopes, all topics used to scope occurrences, all topics which have no associations, all topics without a name, and so on.

4.6. The string module

A large class of queries can only be supported by allowing access to the internals of strings, to check for the presence of substrings, and when outputting it may be necessary to allow strings to be modified for presentation purposes. Examples of such queries abound:

Find the name of every type and upper-case the first character of each (for display purposes).
Find every topic which has a name or an occurrence containing the phrase 'topic maps'-
[[[fill in more here]]]

To handle this we can define a module of string predicates, with URI http://psi.ontopia.net/tolog/modules/string, which might contain predicates derived from the string functions in [XPath], which have shown themselves to be a well-chosen subset. In doing the derivation the only change that needs to be made is to add the return value as the last argument to each predicate in addition to the arguments the original function already has (except when the return value is boolean).

This would give a set of predicates as follows, omitting the definitions from XPath for brevity:

string($OBJECT, $STRING),
concat($STRING1, $STRING2, $STRING),
starts-with($STRING1, $STRING2),
contains($STRING1, $STRING2),
substring-before($STRING, $BEFORE, $SUB),
substring-after($STRING, $AFTER, $SUB),
substring($STRING, $FROM, $TO, $SUB),
length($STRING, $LEN),
normalize-space($STRING1, $STRING2), and
translate($STRING, $TR1, $TR2, $RESULT).

This would allow the first query in the list above to be written as show below (leaving out the module declaration).

select $UPNAME from
  instance-of($PERSON, person), name($PERSON, $NAME),
  substring($NAME, 0, 1, $FIRST), translate($FIRST, "abcdef...", "ABCDEF...", $U1),
  substring($NAME, 1, 1000, $REST), concat($U1, $REST, $UPNAME)
order by $UPNAME?

This query is quite long (because it actually does quite a lot), but clearly demonstrates the power of the module. The second query can be written as:

select $ITEM from
  value($ITEM, $VALUE), contains($VALUE, "topic maps")?

4.7. Further modules for primitive values

As the previous section showed, there is clearly a need for numbers in tolog. (Several of the arguments to the string functions were numbers.) Many queries are also very likely to use numeric criteria, as in:

Find every person above 35 years of age, with at least NOK 300,000 in income,
Find every person in the topic map and order them by height,
Find every person in the topic map and convert their height from cm to inches,
[[[and so on ad nauseam]]]

There are two approaches one could take to this, both already taken by various Datalog versions. The first is to define a module consisting of predicates for the various arithmetic functions, such as less for <, plus for +, and so on. Alternatively, one could define more intuitive inline predicates as built-ins to the language.

The first query would look as shown below with the first approach (again leaving out prefix bindings):

select $PERSON from
  instance-of($PERSON, person),
  age($PERSON, $AGE), less($AGE, 35),
  income($PERSON, $INCOME), lesseq($INCOME, 300000)?

With the second approach it would look as follows:

select $PERSON from
  instance-of($PERSON, person),
  age($PERSON, $AGE), $AGE > 35,
  income($PERSON, $INCOME), $INCOME => 300000?

Clearly, the second approach is the more intuitive, while the former allows the predicates to be isolated into a module in a way that is not possible with the latter. The module isolation also makes the language more regular by removing any syntactic distinction between built-in arithmetic predicates and ones added as extensions.

[[[resolve!]]]

Another problem here is how the tolog query engine is to know that in the examples above the string value of the 'age' occurrence type is to be interpreted as a number. There are several approaches to this:

Let the 'less' predicate automatically convert string arguments to numbers, emitting an error if the conversion fails for any of the strings.
Supply explicit conversion predicates that can be used to do the conversions.
Use schema information to detect that the 'age' occurrence type must contain an integer. This requires that a schema language with data typing capabilities exist, be used, and be supported by tolog implementations.

[[[a likely choice is automatic conversions, plus explicit conversions since needed for sorting]]]

[[[dates also, unfortunately]]]

5. Handling user requirements

tolog has been available in a commercial implementation (that of Ontopia) since early 2002, and since then a large number of users have been exposed to it, which has brought up a number of requirements. In this section we discuss extensions to the language to support these.

5.1. References to non-existent topics

One problem with the current language definition is that users sometimes want to write queries referring to topics which may not exist. An example of this is found in the Omnigator, which queries topic maps for superclass/subclass associations in order to show the class hierarchy. The Omnigator queries any topic map, and so cannot make assumptions about which topics are present. The current tolog assumes that any reference to a non-existent topic is an error, and so causes such queries to fail with an error, something that is clearly not acceptable.

[[[Discuss solutions.]]]

5.2. Non-binding joins

[[[Depends on ability to reference variables outside query for the examples to really make sense.]]]

In many cases one wants to select a number of different pieces of information relevant to a topic, some of which may be optional. In these cases it is convenient to be able to have clauses in the query that do not make the query as a whole fail even if these clauses fail. This kind of query is very common when using queries to drive topic map presentation, for example in web applications.

The query below, which selects information relevant to an opera, is a typical example of the genre.

composed-by(tosca : opera, $COMPOSER : composer),
written-by(tosca : work, $LIBRETTIST : writer),
occurrence(premiere-date : type, $PREMIERE : date)?

In this case there will always be a composer and a librettist, but there will not always be a premiere date. Even so we always want the query to succeed. One way to achieve this at present is to write the query as follows:

composed-by(tosca : opera, $COMPOSER : composer),
written-by(tosca : work, $LIBRETTIST : writer),
{ occurrence(premiere-date : type, $PREMIERE : date) |
  tosca /= puccini }?

In the cases where the opera has no premiere date the second or branch will succeed, and the result will be a a row with a null value for the premiere date. This is what we want, but it's an awkward way to achieve it. The obvious way around this is to allow or clauses to contain only a single branch and for this to imply that there is a second branch that always succeeds. This allows us to write the query in a rather natural fashion, as follows.

composed-by(tosca : opera, $COMPOSER : composer),
written-by(tosca : work, $LIBRETTIST : writer),
{ occurrence(premiere-date : type, $PREMIERE : date) }?

5.3. Pre-parsed queries

For performance it is very useful to be able to parse and compile a query once and then run it many times reusing the result of the compilation. However, in practice, most queries have parameters that depend on program input, and so pre-parsing is only possible if pre-parsed queries are allowed to take parameters.

In tolog 1.0 a syntax is introduced that allows this:

born-in(%place% : place, $PERSON : person)?

Given a place this query finds everyone who was born there.

5.4. Uniqueness control

In tolog 0.1 there is no way to control whether or not a query should return duplicate result rows, and some implementations remove duplicates if SELECT is used, but leave them in if it is not. At the very least tolog 1.0 needs to define what the appropriate behaviour is, and one might want to include a SELECT DISTINCT option like that of SQL.

[[[Is there any need for duplicates at all? Why not always remove them?]]]

5.5. Controlling aggregate functions

[[[A GROUP BY clause]]]

5.6. Filtering on aggregate results

[[[A HAVING clause]]]

5.7. Avoiding infinite loops

[[[Queries in hierarchical structures where the structure, for whatever reason, may contain a loop. This will result in an infinite loop unless we come up with some clever way to avoid it.]]]

5.8. Cycle and path searches

[[[How to support queries of the types "does this structure contain cycles?" and "what's the shortest path from X to Y?"?]]]

5.9. Hierarchy queries

[[[How do we support querying hierarchies and retaining the structure of the hierarchy in the output?]]]

5.10. Outputting XML/HTML

One of the most common uses for topic maps today is for driving web sites, yet there is no standardized way to produce HTML/XML output from a topic map. A standardized topic map query language is one step towards that goal, but even so the last step of producing output will not be standardized. It seems clear that adding support for output production to the language will meet an existing and reasonable commercial demand.

The question is how to add support for HTML/XML output production. Should one choose an XSLT-like template approach? Or is it better to use something more like XML Query's element constructors? The sections below examine both alternatives.

5.10.1. An XSLT-like templating language

It seems reasonable that a template language for XML/HTML output should use an XML syntax, and allow the use of the tolog query language to produce dynamic output in much the same way that XSLT uses XPath. However, XSLT stylesheets consist of templates matched against document nodes in document order, but this does not seem appropriate for topic maps where topics and associations have no defined order.

Instead, the template should describe the flow of control and select the information used to produce output for each step. It appears that a language consisting of a loop construct, a conditional test, possibly value binding (that is, functional language-like lexical variable binding), and perhaps also support for defining and calling functions, is what is needed.

This idea about the XML-based language to replace the tag libraries
has been churning in my head, and I sat down to rewrite the composer
page in OperaMap with such a language in order to see how it might
work. It turns out that it can work quite well, but there are some
challenges to getting it right.

I kept the templating system we already use, since it's clear that
something like it is a must-have. We may need to extend it, but so far
I don't know that we do, and so I haven't tried, either.

The language is heavily based on tolog, and I'm using the extended
tolog that has predicates for accessing all of topic maps, and also
the following extensions:

 - a 'name(topic, string, scope?)' predicate for selecting the most
   appropriate name; the scope argument is optional,

 - name and occurrence types can be used in the same way as
   association types to generate predicates for querying base names
   and occurrences; in both cases the form is type(topic, value), and

 - direct support for sets, using the syntax [elem1, elem2, ...].


The TMTL language itself has the following constructs:

 - <tmtl:bind select="...">
   - runs the tolog query in the select attribute
   - the variables bound by the first result row are visible inside
     the tag

 - {...}
   - outputs the result from the tolog query in the curlies
   - if the expression is a single variable reference the value of
     that variable is output
   - if the expression is a query it may only bind a single variable;
     the value of the only variable bound in the first result row is
     output; if there are no results nothing is output
   - if the variable is bound to a string or locator, that is output;
     if it is bound to a topic (or other , its ID is output

 - to output curly brackets, write \{ and \}
 
 - <tmtl:if select="...">
   - runs the tolog query in the select attribute
   - if the query produces any results, the test succeeds
   - inside the tag the variables bound in the first row of the query
     results are available

 - whitespace rules are as in XSLT

 - <tmtl:foreach select="...>
   - runs the tolog query in the select attribute
   - executes the contents once for each row in the result
   - variables bound by the current row are visible inside the tag


Points particularly worthy of note: 

  - This works for RDF, too. Or very nearly so: all that is missing is
    a way to deal with the topic IDs being bandied about here.

  - This page is 81 lines; original was 142 + 72. We're down to 40% of
    the original.

  - I believe this can be quite effective when translated to SQL.

  - I believe this is relatively easy to implement.

  - I'm *not* happy with the <tmtl:page> attributes. Most of that
    should move to the JSP tag which invokes the TMTL page, I think.



<tmtl:page xmlns:tmtl="http://psi.ontopia.net/tmtl/" 
           template="template.tmtl"
           objparam="id"
           set="topic">

<tmtl:bind select="name(%topic%, $NAME)?">
<tmtl:put name='title'>{$NAME}</tmtl:put>

<tmtl:put name='content'>
  <tmtl:if select="illustration(%topic%, $PICTURE)?">
    <img align=right src="{$PICTURE}" height="234">&#160;&#160;
  </tmtl:if>

  <h1>{$NAME}</h1>

  <p>Italian composer
     <tmtl:if select="instance-of(%topic%, librettist)?">
       and <a href="librettist.jsp?id={$topic}">librettist</a>
     </tmtl:if>.

     <tmtl:if select="nom-de-plume(%topic%, $NAME)?">
       Also known as {$NAME}.
     </tmtl:if>

     <tmtl:if select="born(%topic%, $DATE)?">
       Born {$DATE}

       <tmtl:if select="born-in(%topic% : person, $CITY : place)?">
         in <a href="city-region.jsp?id={$CITY}">{name(%CITY%, $NAME)?}</a>
       </tmtl:if>.
     </tmtl:if>

     <tmtl:if select="died(%topic%, $DATE)?">
       Died {$DATE}

       <tmtl:if select="died-in(%topic% : person, $CITY : place)?">
         in <a href="city-region.jsp?id={$CITY}">{name(%CITY%, $NAME)?}</a>
       </tmtl:if>.
     </tmtl:if>
  </p>

  <tmtl:if select="description(%topic%, $DESC)?">
    <p>{$DESC}</p>
  </tmtl:if>

  <p><b>Operas:</b></p>

  <ul>
  <tmtl:foreach select="composed-by(%topic% : composer, $OPERA : opera)?">
    <li><a href="opera.jsp?id={$OPERA}">{name(%OPERA%, $NAME, [%TOPIC%])?}</a>
        {premiere-date(%TOPIC%, $DATE)?}
  <tmtl:foreach>
  </ul>

  <tmtl:if select="homepage(%TOPIC%, $URL)?">
    <p><b>Other sites:</b></p>
    <ul>
    <tmtl:foreach select="homepage(%TOPIC%, $URL)?">
      <li><a href="{$URL}">{$URL}</a></li>
    </tmtl:foreach>
    </ul>
  </tmtl:if>

<br/>
<tmtl:bind select="name(%topic%, $NAME, [normal])?">
  <p>Search OperaBase for performances of operas by {$NAME}:</p>
  
  <form method="get" action="operabase.jsp">
    <input type="hidden" name="composer" value="{$NAME}">
    <input type="radio" name="sort" value="V">City
    <input type="radio" name="sort" value="G">Country
    <input type="radio" name="sort" value="T">Title
    <input type="radio" name="sort" value="D" checked>Date
    <input type="submit" value="search">
  </form>
</tmtl:bind>

</tmtl:put>
</tmtl:bind>
</tmtl:page>

5.10.2. An XQuery-like syntax

       select $A, $B, $C from
         ...
       output
       for group($A, $B) {
         <foo><bar>$A</bar>
              <baz>$B</baz>
         for group($C) {
            <gazonk>$C</gazonk>
         }
         </foo>
       }

5.11. Dealing with scope

[[[We left the toughest part for last: how to deal with scope. It can be done using sets, or we can add native support for it in the language, going beyond the Datalog model. A native scope operator must probably be applicable on a clause-by-clause basis.]]]

6. Conclusions

[[[Summarize what we've discovered and what it means.]]]

Acknowledgements

Thanks go to Geir Ove Grønmo and Steve Pepper for developing the topic map module together with me.

Many thanks to Robert Barta, for insights and opinions that clarified many issues for me, and also helped broaden my outlook.

Bibliography

Elmasri94: Fundamentals of Database Systems, Ramez Elmasri and Shamkant B. Navathe, Benjamin/Cummings Publishing, Redwood City, USA, 1994. ISBN 0-8053-1753-8.
Garshol01: tolog — A topic map query language, Lars Marius Garshol, Ontopia. Presented at XML Europe 2001, Berlin, Germany. Available from http://www.ontopia.net/topicmaps/materials/tolog.html.
Garshol03: Living with topic maps and RDF, Lars Marius Garshol, Ontopia. Presented at XML Europe 2003, London, UK. Available from http://www.ontopia.net/topicmaps/materials/tmrdf.html.
Garshol03b: tolog 0.1, Ontopia Technical Report, Lars Marius Garshol, 2003-03-10. Available from http://www.ontopia.net/topicmaps/materials/tolog-spec.html.
Liu96: Deductive Databases — Where to Now?, Mengchi Liu and John Cleary, Journal of Intelligent Information Systems, 1996, 90/377/01. Available from http://www.scs.carleton.ca/~mengchi/papers/ddb-JIIS.ps at the time of writing.
Liu98: Overview of Datalog Extensions with Tuples and Sets, Mengchi Liu, in proceedings of the 6th International Workshop on Deductive Databases and Logic Programming (DDLP '98), Manchester, UK, June 20, 1998, pp. 99-112. Available from http://www.scs.carleton.ca/~mengchi/papers/ddb-DDLP98.ps.
Liu90: Deductive Database — Where to Now?, Mengchi Liu and John Cleary, Computer Science Technical Report 1990-377-1, University of Calgary, Calgary, Alberta, Canada, January 1, 1990. Available from http://pharos.cpsc.ucalgary.ca/Dienst/Repository/2.0/Body/ncstrl.ucalgary_cs/1990-377-1/pdf.
ISO 13250-2: ISO 13250-2: Topic Maps — Data Model, ed. Lars Marius Garshol and Graham Moore, ISO. Available from http://www.isotopicmaps.org/sam/
Ullman90: Deductive Databases: Achievements and Future Directions, Jeffrey D. Ullman and Carlo Zaniolo, SIGMOD Record 19(4), 75-82(1990). Available from http://www.cs.ucla.edu/~zaniolo/papers/sigrecord90.pdf at the time of writing.
XPath

Footnotes

1	The tolog version descibed in [Garshol01] required a rule declaration for each association type that gave the order of the association roles as predicate arguments. This solved the problem, and made queries shorter, but was awkward for users and so was changed in favour of the present syntax.
2	In fact, the optimizer will turn this query around and will start with all cities located in Italy, then find all people born there. The result is the same, but since fewer false matches are produced it is faster.
3	The traditional English misspelling of this name is 'Milan'.
4	The next version of ISO 13250 will allow typed names. See [ISO 13250-2].

Extending tolog

Proposal for tolog 1.0

Table of contents

Abstract

Biography

Acknowledgements

Bibliography

Footnotes