Automagic Topic Maps

1-2-3; Creating topic maps from existing data sources

By:	Geir Ove Grønmo
Affiliation:	Ontopia AS
Email:	grove@ontopia.net
Web:	http://www.ontopia.net

Abstract

This paper explains a new approach for creating topic maps from existing data sources. The approach consists of a three-step procedure that goes via RDF to produce the final topic map.

Very few topic maps have to be developed from scratch, since in most cases there are existing data sources that can be used to populate them, and often more than one such data source exists.

The paper starts out by examining the components of a topic map and their use, with a focus on how to identify subjects and how to assign them characteristics. Steps for ensuring quality and optimizing their usefulness are discussed.

This paper shows that Topic Maps and RDF are technologies particularly suited to be used as the glue between incompatible data sources.

1. Introduction

automagically adverb. Describes a process that occurs automatically and with a certain level of mystery so that it seems somewhat magical.

You may wonder how topic maps are created. There are many ways to do this, ranging from hand-coding to fully automatic generation by computers. This paper focuses on the latter. It might not seem obvious at first how to create a topic map from existing data, but it can actually be very easy. Read on and you'll see how.

This paper introduces a new approach that allows topic maps to be created automagically. The procedure that we'll go through is a simple three-step procedure that includes going via RDF (Resource Description Framework) to produce the topic map.

2. Findability

The Topic Maps paradigm brings a lot of interesting features to the table. The Topic Maps standard focuses heavily on the fact that it should be easy to find what you are looking for. This includes finding things through searching, querying, and traversal of the knowledge base.

Topic Maps is a very explicit form of knowledge representation and it forces you to consider some of the issues that are critical when it comes to optimizing findability. It focuses a lot on how to name and how to identify subjects. This is very important. No other knowledge representation puts this much focus on the two.

2.1. Names are for humans

A name is a label that is used to identify a subject. The name is not necessarily globally unique, but if it is qualified by the context(s) in which it is applicable, it can be. Names in topic maps are specified this way, namely qualified by a set of perspectives [which themselves are subjects]. This makes sure that names of subjects are appropriately qualified.

Words that are spelled the same way but have different meanings are called homonyms. An example of such a word is Paris. Some people might connect the label 'Paris' to the French capital, while others might think of a poisonous plant. They are both valid subjects, but in any given context at most one of them is right. Those that recognize both subjects are left wondering which of the two is correct.

What if it was Paris, the second child of Queen Hecabe 1 of Troy, or Paris, Texas or even Paris, Kentucky? As you can see, there are a lot of meanings to the word 'Paris', and there are many more than those already mentioned.

It can be very problematic to recognize the right subject unless you happen to have some knowledge of the context in which the name is used. If the name occurs in an article about the French revolution you can most likely ignore the Greek hero, the plant and the American cities, but if you do not recognize a distinctive context it is hard, if not impossible.

Contexts are hard to get 100% right, so names are mainly for human consumption. Computers may get most of the subjects right, using techniques like fuzzy logic, but there can be no guarantee that they are correct. The context in which a name occurs is in most situations implicit. Humans can handle implicit knowledge much better than computers. Without any, or with limited, context information it is often impossible for a computer to connect a name to the right subject.

Computers are lousy at figuring out such things by themselves. They need to be given exact and unambiguous information. Enter identities.

2.2. Identities are for computers

A subject identity is an address that can be used to uniquely identify a subject. Subject identities should be globally unique, i.e. each should be used to identify one, and only one, subject.

An identifier is self-contained and contains enough information to identify a single subject, so no information about the context is needed. This kind of identity is perfect for a computer, since the work of identifying subjects then is limited to comparing identity addresses.

Here are some identity addresses that can be used to identify Paris, the French capital:

http://psi.ontopia.net/geography/capital.py?country=FR
http://www.lonelyplanet.com/destinations/europe/paris/
urn:cities.france.paris

In XTM (XML Topic Maps) subject identities are represented as URIs (Uniform Resource Identifiers). The ISO 13250 standard is more flexible and allows any addressing mechanism. Only in exceptional cases would you want to consider anything other than URIs. On the web only URIs are used, and we'll focus on those here.
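
The difference between matching on names and matching on identities can be illustrated with a minimal Python sketch. The record layout and the same_subject helper are hypothetical and not part of any topic map API. Two of the URIs are taken from the list above, while the URI for the plant is invented for illustration.

```python
# Hypothetical records: each has a human-readable name and a set of
# identity addresses (URIs).
paris_city = {
    "name": "Paris",
    "identities": {
        "http://psi.ontopia.net/geography/capital.py?country=FR",
        "urn:cities.france.paris",
    },
}
paris_plant = {
    "name": "Paris",
    # Invented URI, used for illustration only.
    "identities": {"urn:example:plants:paris"},
}
guide_entry = {
    "name": "Paris, France",
    "identities": {"urn:cities.france.paris"},
}


def same_subject(a, b):
    """Two records represent the same subject if they share an identity."""
    return bool(a["identities"] & b["identities"])


# The names collide, although the subjects differ.
print(paris_city["name"] == paris_plant["name"])
# Comparing identities keeps the city and the plant apart...
print(same_subject(paris_city, paris_plant))
# ...and connects two records whose names differ.
print(same_subject(paris_city, guide_entry))
```

No context is needed for the comparison: the work is reduced to checking whether two sets of addresses overlap.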

3. What's in a topic map?

Topic Maps contain topics, which can have identities and characteristics. A topic represents a subject. It is an electronic representation of the subject - a stand-in for the subject that is. Topic characteristics can be names, related information and related concepts.

Topics first become really useful when you assign them identities and characteristics. This makes sense, because the reason for having topics in the first place is that you'd want to say things about them (the subjects that is). And the way to say things about them is to assign them characteristics. Following is a short introduction to the kinds of information you can attach to topics.

3.1. Identities

You use identities to identify the subject that a topic represents, not necessarily the topic itself. The point is that it should be possible to figure out which subject a topic represents by looking at its identities.

As mentioned earlier identities are used to [uniquely] identify subjects. In topic maps there are three kinds of identities: subject addresses, subject indicators, and source locators.

A subject address is an address that references a resource that is the subject.

A subject indicator is an address that references a resource that unambiguously indicates the subject. It can be seen as an indirect subject address. This is useful when the subject is not directly addressable. Most subjects are not directly addressable, but instead they have representations that can be addressed. The difference between a subject address and a subject indicator can be considered to be that the subject address points directly at the subject, while the subject indicator points to something that "describes" the subject.

A source locator is an address that points to the resource that caused a topic map object representation to come into existence. It could also be an address that points to its current representation. A source locator is the address of a representation of a topic map object, e.g. an XML element in a serialized topic map or an object in a Java virtual machine. A source locator can in other words be used to reference the origin(s) of a topic map object. Source locators are really a special kind of subject indicator, but their existence is valuable enough to be handled separately.

3.2. Perspectives (scope)

When we assign a topic characteristics we are essentially making assertions about the topic. However not all assertions are universally valid:

a name (e.g., "St. Petersburg") may be applicable in some contexts, but not in others (pre-1914 and post-1991);
an occurrence might be pertinent in some situations, but not in others (e.g., when the reader is under a certain age);
an association might state an opinion (e.g., that the Sun is orbiting around the Earth) that is not shared by others.

The purpose of assigning perspectives is to allow the topic map author to express the limits within which such characteristic assignments have validity. The default perspective, i.e. when no perspectives have been assigned, is the unconstrained one, which means that the assignment is universally applicable.

All topic characteristics can be assigned perspectives. (Note that subject identities cannot.)
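
As a rough sketch of how perspectives might be used when reading a topic map, the Python fragment below filters characteristics by the current context. The Characteristic class, the valid_in function and the perspective labels are hypothetical, and so is the example.org address. The sketch assumes one possible interpretation of scope: an assignment applies when every perspective in its scope is part of the context, so an empty scope always applies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Characteristic:
    kind: str
    value: str
    # An empty scope is the unconstrained perspective.
    scope: frozenset = frozenset()


def valid_in(characteristics, context):
    """Return the values whose scope is fully covered by the context."""
    context = frozenset(context)
    return [c.value for c in characteristics if c.scope <= context]


characteristics = [
    Characteristic("name", "St. Petersburg", frozenset({"post-1991"})),
    Characteristic("name", "Leningrad", frozenset({"1924-1991"})),
    # Hypothetical occurrence with no perspective assigned.
    Characteristic("occurrence", "http://example.org/st-petersburg"),
]

# In a post-1991 context the older name is filtered out.
print(valid_in(characteristics, {"post-1991"}))
# With an empty context only the unconstrained assignment remains.
print(valid_in(characteristics, set()))
```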

3.3. Names (base names)

The names of topics are called base names. A base name is a label that can be used to name a subject. Base names are assigned to topics in a perspective. A subject usually has more than a single name, and there are no limits on how many of them can be assigned.

A base name can have variants, which are alternative forms of the base name. Variant names are normally defined to be applicable in specific processing contexts, and usually represent context-dependent renditions of the base name. Examples are the plural name, sort keys, and phonetic form.

Example: The beer topic could be assigned two base names each having a single variant.

beer (perspective: English)
- 'bir (perspective: pronunciation)
øl (perspective: Norwegian)
- 'ul (perspective: pronunciation)

Note that the variant names "inherit" the perspectives of their base names.
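
The beer example can be written down as a small data structure. This is only a sketch: the BaseName and Variant classes and the variant_scopes method are hypothetical names, and the inheritance of perspectives is modelled simply as the union of the two scopes.

```python
from dataclasses import dataclass, field


@dataclass
class Variant:
    value: str
    scope: frozenset


@dataclass
class BaseName:
    value: str
    scope: frozenset
    variants: list = field(default_factory=list)

    def variant_scopes(self):
        """Map each variant to its own scope plus the base name's scope."""
        return {v.value: v.scope | self.scope for v in self.variants}


beer_names = [
    BaseName("beer", frozenset({"English"}),
             [Variant("'bir", frozenset({"pronunciation"}))]),
    BaseName("øl", frozenset({"Norwegian"}),
             [Variant("'ul", frozenset({"pronunciation"}))]),
]

for name in beer_names:
    for variant, scope in name.variant_scopes().items():
        print(name.value, variant, sorted(scope))
```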

3.4. Related information (occurrences)

You'd most likely want to attach information [resources] to topics. This would normally be information that is related to the topic for some reason. Information resources are attached to topics through occurrences. Occurrences are typed, so that you can figure out why the information is related to the topic. The related information resources can have identity or be anonymous resources. Anonymous resources are also called inline occurrences, and the ones with identity, external occurrences.

Note that if you want to consider the information resource as a topic by itself you'd have to attach it using an association instead. Occurrences are in other words characteristics that reference information resources that are relevant to the topic, but which are not considered important enough to be represented as topics in the topic map.

Occurrences allow traversal from subjects to related information and from information resources to related subjects. This is part of what makes the GPS for the Web possible.

Example: The beer topic could be assigned three occurrences, the first two descriptions of the subject of beer in two different languages (inline occurrences), and the third an information portal about beer (an external occurrence):

"An alcoholic beverage usually made from malted cereal grain, flavored with hops, and brewed by slow fermentation." (type: description, perspective: English)
"Alkoholholdig drikk laget av vann, malt, humle og gjær." (type: description, perspective: Norwegian)
http://www.realbeer.com/ (type: portal, perspective: unrestricted)
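
Traversal in both directions, from a topic to its resources and from a resource back to its topics, can be sketched with two indexes. The tuple layout and the index names are assumptions made for this example, and the brewing topic is invented to show a resource shared by two topics.

```python
from collections import defaultdict

# (topic, occurrence type, resource address); the topic ids are hypothetical.
occurrences = [
    ("beer", "portal", "http://www.realbeer.com/"),
    ("brewing", "portal", "http://www.realbeer.com/"),
]

by_topic = defaultdict(list)     # topic -> [(type, resource)]
by_resource = defaultdict(list)  # resource -> [(type, topic)]
for topic, occ_type, resource in occurrences:
    by_topic[topic].append((occ_type, resource))
    by_resource[resource].append((occ_type, topic))

# From a subject to related information...
print(by_topic["beer"])
# ...and from an information resource to related subjects.
print(by_resource["http://www.realbeer.com/"])
```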

3.5. Related concepts (associations)

Relationships between topics are defined using associations. Associations have a type and a set of one or more roles that topics can participate in.

Example: capital-of(france : country, paris : capital).

Here Paris participates in an association together with France where Paris plays the role capital and France the role country. The relationship type in this example is capital of. (Note that the role types and association types are references to topics.)

Associations are multidirectional. This is extremely useful because, in the example above, you don't have to say both that "Paris is the capital of France" and that "The capital of France is Paris". The two statements are the same, so there is no need to say it twice. Interestingly, the two sentences can be generated from the names of the association components.

By designating appropriate roles you know why a topic participates in the association. The important thing here is to specify which of the two topics is the capital and which is the country, and why they participate in the same association.

Associations break relationships into the components that are required to explicitly describe relationships between topics. When associations have more than a single participating topic they are connected and you can traverse between them.
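
A minimal sketch of the capital-of example shows why a single association is enough for traversal in both directions. The Association class and the related function are hypothetical, and topics are represented by plain strings for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Association:
    type: str
    roles: tuple  # pairs of (role type, player)


capital_of = Association(
    "capital-of", (("country", "france"), ("capital", "paris"))
)


def related(associations, topic):
    """Yield (association type, own role, other role, other player)."""
    for assoc in associations:
        for role, player in assoc.roles:
            if player != topic:
                continue
            for other_role, other_player in assoc.roles:
                if other_player != topic:
                    yield (assoc.type, role, other_role, other_player)


# The same association is reachable from either participant.
print(list(related([capital_of], "paris")))
print(list(related([capital_of], "france")))
```

The association is stated once, and the roles tell which way to read it from each end.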

4. Data sources

Now that you know what topic maps are, and you have some ideas about what you can do with them, it is time to look at how to create them.

First you need to know what kind of knowledge you'd want to describe. And, secondly, you need to get that knowledge from somewhere. Where would you find that kind of information? You'd not want to collect it and type it in yourself. It is very likely that the information exists in some electronic form already. People store information all over the place. There is an enormous number of information systems around, though most of them are poorly integrated.

Following is a list of electronic information systems that are likely to contain the information you are looking for:

relational databases
web sites
enterprise information systems
directory systems
content management systems
documents in file systems

This means that it is very likely that you have the needed information in electronic form somewhere already. The existing data might not be well structured or easily accessible, but that is what we're trying to change. We'd want to represent the data in topic map form for exactly this reason: to improve findability.

5. Extracting knowledge

All data sources contain knowledge in some form, but it is almost always implicit and hard to recognize, especially for computers. In this section we'll introduce a simple procedure for extracting such implicit knowledge and making it explicit, first in the form of an RDF model and finally as a topic map.

5.1. Overview

Extracting knowledge from data sources can be very simple indeed. Below is a simple three-step procedure that shows how it can be done:

Recognize a subject.
Extract RDF statements about the subject.
Map the statements into topic characteristics.

...and you have a topic map.

When we go through each of the individual steps in detail we'll be using a relational database as the source of data. Data will be read from relational database tables, broken into RDF statements (triples) and mapped into topic characteristics. This procedure is similar to the ones being used by the Ontopia Autogen Toolkit and MDF - The Metadata Processing Framework tools.

Let's say that we'd like to create a topic map about countries, and our relational database happens to contain the following tables:

Table: COUNTRIES

 -------------------------------------
| id | datacode | name      | capital |
|----|----------|-----------|---------| 
|  1 | FR       | France    |       3 |
|  2 | NO       | Norway    |       4 |
|  3 | DK       | Denmark   |       2 |
|  4 | ES       | Spain     |       1 |
|  ...                                |
 -------------------------------------

Table: CITIES

 -----------------------------------------------------
| id | name       | population | latitude | longitude |
|----|------------|------------|----------|-----------|
|  1 | Madrid     | 2.976.064  |    40.41 |     -3.71 |
|  2 | Copenhagen |   625.810  |    55.72 |     12.57 |
|  3 | Paris      | 2.152.329  |    48.87 |      2.33 |
|  4 | Oslo       |   473.454  |    59.93 |     10.75 |
|  5 | Antwerpen  |   470.349  |    51.22 |      4.42 |
|  ...                                                | 
 -----------------------------------------------------

5.2. Recognizing subjects

In order to say something about a subject we need to find occurrences of it in the source data. An occurrence of a subject usually means that you can find some information about the subject in that location. This observation can be used to extract statements about the subject that can be useful in our knowledge base.

A relational database consists of tables, which contain information about subjects. Each table row usually represents a subject and the columns contain information about that specific subject.

To find the occurrences we need to be able to identify the mentioned subjects. In the case of a relational database table we need to figure out which table columns contain data that can be used to identify the subjects. These columns are called key columns, e.g. primary keys.

Once we've identified the key columns we can generate a URI that can be used to identify the subject occurring in each row. The URI can for example be created by concatenating an address prefix and the primary key value. The actual structure of the URI is up to you to decide, but the important thing to remember is that it has to uniquely identify the subject.
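
The concatenation of an address prefix and a key value can be sketched as follows. The subject_uri function is a hypothetical helper, and percent-encoding the key values is an added precaution that the procedure itself does not require. The streets prefix and its composite key are invented to show a key made of more than one column.

```python
from urllib.parse import quote


def subject_uri(prefix, *key_values):
    """Build an identifier from an address prefix and one or more key values."""
    # Percent-encode each key value so that characters such as '/' or
    # spaces cannot change the structure of the resulting URI.
    return prefix + "/".join(quote(str(v), safe="") for v in key_values)


# Single-column primary key, as in the CITIES table.
print(subject_uri("http://www.mysite.org/cities/", 1))
# Hypothetical composite key made of two columns.
print(subject_uri("http://www.mysite.org/streets/", 3, "Rue de Rivoli"))
```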

The next step is to extract what is being said about the subject. We'll be extracting those statements as RDF statements.

5.3. Extracting RDF statements

RDF models are about resources, which correspond to subjects in topic maps. This helps us to realize that there can be an easy transition from RDF to Topic Maps.

An RDF model consists of statements about resources, which are often called triples. A statement has three parts:

subject, the resource the statement is about.
property, the property being assigned to the subject, i.e. the reason why the value is being assigned to the subject.
value, the value assigned to the subject.
Example: ( subject, property, value )

The subject and the property are always URIs, while the value can either be a URI or a string literal.

So in order to extract statements from a relational database we need to break the data stored in tables into triples.

Let's extract the statement that the subject Madrid has a population of 2.976.064. If no two cities have the same name it might be okay to identify the cities by their names. But to be on the safe side let's use the row id to identify the city. The row id can be found in the id column. The id is an integer and is surely not a globally unique identifier. We therefore need to qualify the row id with a more authoritative address fragment. This can for example mean prefixing it with "http://www.mysite.org/cities/", which makes the identifier much more likely to be unique. If we happen to own the mysite.org domain name we can make sure that it is.

Note that we don't necessarily have to make the URIs globally unique this early in the extraction process. Only when we attempt to create topic instances with subject identity do we need to make sure that the URIs are globally unique, but it doesn't hurt to do this from the beginning. (The RDF models are not to be used here except as an intermediate step towards a topic map.)

The identity of the subject can be used to fill in the subject part of the triple.

Example: ( http://www.mysite.org/cities/1, ?, ? )

The property part of a triple is ontological information and seldom comes from the data source itself, so we can invent a resource that represents the "population" property. Let's identify it by a relative URI, #population, since we ourselves are in control of the ontology.

The ontology information can be used to fill in the property part of the triple.

Example: ( http://www.mysite.org/cities/1, #population, ? )

Columns either contain data or reference other rows. Statement values can be found by looking at the other columns in the row. Since we're looking for the population we find it in the population column. The value in this column completes the triple.

The column values can be used to fill in the value part of the triple.

Example: ( http://www.mysite.org/cities/1, #population, "2.976.064" )

We have now completed our first RDF statement:

[1] ( http://www.mysite.org/cities/1, #population, "2.976.064" )

Here are some more examples of statements that can be extracted from the two tables above:

[2] ( http://www.mysite.org/cities/4, #latitude, "59.93" )
[3] ( http://www.mysite.org/countries/1, #name, "France" )
[4] ( http://www.mysite.org/countries/3, #hascapital,  http://www.mysite.org/cities/2 )
[5] ( http://www.mysite.org/countries/2, #datacode, "NO" )

[2] says that Oslo has the latitude 59.93. [3] says that the subject France has the name "France". [4] says that the capital of Denmark is Copenhagen. [5] says that Norway has the ISO 3166 data code NO.
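
The extraction step can be sketched in Python with an in-memory SQLite database standing in for the relational database. Everything here is an assumption made for illustration: the MAPPING structure, the extract function, the #longitude property, and the choice to store the numeric columns as text so that the values appear exactly as in the tables above. It is not how the Ontopia Autogen Toolkit or MDF work internally.

```python
import sqlite3

BASE = "http://www.mysite.org/"

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cities (id INTEGER PRIMARY KEY, name TEXT, population TEXT,
                     latitude TEXT, longitude TEXT);
CREATE TABLE countries (id INTEGER PRIMARY KEY, datacode TEXT, name TEXT,
                        capital INTEGER REFERENCES cities(id));
INSERT INTO cities VALUES (1, 'Madrid', '2.976.064', '40.41', '-3.71');
INSERT INTO cities VALUES (2, 'Copenhagen', '625.810', '55.72', '12.57');
INSERT INTO cities VALUES (3, 'Paris', '2.152.329', '48.87', '2.33');
INSERT INTO cities VALUES (4, 'Oslo', '473.454', '59.93', '10.75');
INSERT INTO cities VALUES (5, 'Antwerpen', '470.349', '51.22', '4.42');
INSERT INTO countries VALUES (1, 'FR', 'France', 3);
INSERT INTO countries VALUES (2, 'NO', 'Norway', 4);
INSERT INTO countries VALUES (3, 'DK', 'Denmark', 2);
INSERT INTO countries VALUES (4, 'ES', 'Spain', 1);
""")

# Hypothetical configuration: which columns hold data (literals) and which
# reference rows in another table. The key column is assumed to be "id".
MAPPING = {
    "cities": {
        "literals": {"name": "#name", "population": "#population",
                     "latitude": "#latitude", "longitude": "#longitude"},
        "references": {},
    },
    "countries": {
        "literals": {"datacode": "#datacode", "name": "#name"},
        "references": {"capital": ("#hascapital", "cities")},
    },
}


def extract(conn, mapping):
    """Break every mapped row into (subject, property, value) triples."""
    triples = []
    for table, spec in mapping.items():
        columns = ["id", *spec["literals"], *spec["references"]]
        # Table and column names come from the trusted mapping above,
        # never from user input.
        query = "SELECT {} FROM {}".format(", ".join(columns), table)
        for row in conn.execute(query):
            record = dict(zip(columns, row))
            subject = f"{BASE}{table}/{record['id']}"
            for column, prop in spec["literals"].items():
                triples.append((subject, prop, str(record[column])))
            for column, (prop, target) in spec["references"].items():
                triples.append((subject, prop, f"{BASE}{target}/{record[column]}"))
    return triples


triples = extract(conn, MAPPING)
for triple in triples[:4]:
    print(triple)
```

Note that the sketch keeps both literal values and URI values as plain strings. A fuller implementation would have to keep the two apart, for example with a separate type for URIs.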

The next step is to map the RDF statements into topics and topic characteristics.

5.4. Mapping statements into topic characteristics

In this step we have to decide whether RDF statements are to be mapped into subject identities, names, occurrences or associations. Let's go through the statements we have created so far.

[1] ( http://www.mysite.org/cities/1, #population, "2.976.064" )

This is a typical inline occurrence since it references an anonymous resource, so we'll map it into an inline occurrence of type population and value 2.976.064. It's an occurrence of the topic Madrid.

[2] ( http://www.mysite.org/cities/4, #latitude, "59.93" )

This one is similar to the previous one, so we'll also map this one into an inline occurrence, but this time with the type latitude. It's an occurrence of the topic Oslo.

[3] ( http://www.mysite.org/countries/1, #name, "France" )

This is a name, so we'll map it into a base name of the France topic. The perspective is English.

[4] ( http://www.mysite.org/countries/3, #hascapital,  http://www.mysite.org/cities/2 )

This is a relationship between two different subjects, so we'll map it into an association. The association type is capital of. The topic Denmark plays the role country and the topic Copenhagen plays the role capital.

[5] ( http://www.mysite.org/countries/2, #datacode, "NO" )

A country data code can be used as a base name with the ISO 3166 two-letter data code perspective, but it could also be used to identify the subject Norway. The XTM specification includes subject indicators for countries. This means that we can create the subject indicator "http://www.topicmaps.org/xtm/1.0/country.xtm#NO" and be able to merge the topic Norway with topics from other topic maps that reference the same subject indicator.

The fact that RDF statements are so primitive helps us to easily decide what kind of topic characteristics the individual triples should be mapped into.
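
The decisions made for statements [1] to [5] can be captured as a table of rules keyed by property, as in the following sketch. The RULES table, the build_topic_map function and the dictionary layout of a topic are hypothetical. Topics are simply keyed by the URI of the RDF subject, which leaves open what kind of identity that URI eventually becomes.

```python
from collections import defaultdict

STATEMENTS = [
    ("http://www.mysite.org/cities/1", "#population", "2.976.064"),
    ("http://www.mysite.org/cities/4", "#latitude", "59.93"),
    ("http://www.mysite.org/countries/1", "#name", "France"),
    ("http://www.mysite.org/countries/3", "#hascapital",
     "http://www.mysite.org/cities/2"),
    ("http://www.mysite.org/countries/2", "#datacode", "NO"),
]

# One rule per property: the kind of characteristic and its details.
RULES = {
    "#population": ("occurrence", {"type": "population"}),
    "#latitude": ("occurrence", {"type": "latitude"}),
    "#name": ("base name", {"scope": "English"}),
    "#hascapital": ("association", {"type": "capital-of",
                                    "subject_role": "country",
                                    "value_role": "capital"}),
    "#datacode": ("subject indicator",
                  {"prefix": "http://www.topicmaps.org/xtm/1.0/country.xtm#"}),
}


def build_topic_map(statements, rules):
    """Turn triples into topics (keyed by subject URI) and associations."""
    topics = defaultdict(
        lambda: {"subject_indicators": set(), "names": [], "occurrences": []}
    )
    associations = []
    for subject, prop, value in statements:
        kind, options = rules[prop]
        topic = topics[subject]
        if kind == "occurrence":
            topic["occurrences"].append((options["type"], value))
        elif kind == "base name":
            topic["names"].append((value, options["scope"]))
        elif kind == "subject indicator":
            topic["subject_indicators"].add(options["prefix"] + value)
        elif kind == "association":
            topics[value]  # make sure the other player exists as a topic
            associations.append((options["type"],
                                 {options["subject_role"]: subject,
                                  options["value_role"]: value}))
    return dict(topics), associations


topics, associations = build_topic_map(STATEMENTS, RULES)
print(topics["http://www.mysite.org/countries/2"]["subject_indicators"])
print(associations)
```

Because each triple carries exactly one property, a single lookup is enough to decide where the statement goes.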

5.5. Merging data sources

If your knowledge is to be extracted from multiple data sources there are a couple of alternatives available for how to merge the results.

The first alternative is to merge the RDF models that are produced by the extraction processes before turning them into topic maps.

The second alternative is to create topic maps for the individual data sources and merge the topic maps.

The alternative to choose here depends on whether you'd want to do further processing on the merged knowledge base or not. If you need to do further processing it might be advantageous to first merge the RDF models before turning them into topic maps.
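
Both alternatives can be sketched in a few lines of Python. The sketch assumes that an RDF model is a set of triples, so that merging is a set union, and that topics are merged when they share a subject indicator. The function names and the dictionary layout of a topic are hypothetical, and real topic map merging involves more rules than shown here.

```python
def merge_rdf(*models):
    """Alternative 1: merge RDF models; duplicate statements collapse."""
    merged = set()
    for model in models:
        merged |= set(model)
    return merged


def merge_topics(*topic_lists):
    """Alternative 2: merge topics that share a subject indicator."""
    merged = []
    for topic in (t for topics in topic_lists for t in topics):
        indicators = set(topic["indicators"])
        names = set(topic["names"])
        remaining = []
        for other in merged:
            if other["indicators"] & indicators:
                # Same subject: absorb the other topic's information.
                indicators |= other["indicators"]
                names |= other["names"]
            else:
                remaining.append(other)
        remaining.append({"indicators": indicators, "names": names})
        merged = remaining
    return merged


france = "http://www.mysite.org/countries/1"
model_a = {(france, "#name", "France")}
model_b = {(france, "#name", "France"), (france, "#datacode", "FR")}
print(len(merge_rdf(model_a, model_b)))

NO = "http://www.topicmaps.org/xtm/1.0/country.xtm#NO"
DK = "http://www.topicmaps.org/xtm/1.0/country.xtm#DK"
map_a = [{"indicators": {NO}, "names": {"Norway"}}]
map_b = [{"indicators": {NO}, "names": {"Norge"}},
         {"indicators": {DK}, "names": {"Denmark"}}]
for topic in merge_topics(map_a, map_b):
    print(sorted(topic["names"]))
```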

5.6. What are the advantages?

The procedure works, but what are the advantages of doing it in this way?

Mapping directly into topic map objects means that you have to do a lot of things at the same time. This can be complicated and it is easy to lose focus because of the extreme verbosity.

Going via RDF statements means that we don't have to express all of the topic map bells and whistles at the same time as we're recognizing the knowledge expressed in the data. This lets us focus on the important things, namely identifying the statements that we want to extract from the data.

Topic characteristics can be rather complex objects, e.g. occurrences have types and perspectives and reference information resources, associations have types, perspectives and roles, and roles have types and players.

What goes where shouldn't be the focus at the point when you are doing statement recognition and generation. This job is better deferred to a later stage. Remember that you may have to do a lot of triple and string manipulation work before you're actually ready to create topic characteristics.

The conclusion is that the procedure makes creating topic maps easier by doing the conversion in a layered approach.

6. Conclusion

The paper has given a brief introduction to Topic Maps, including an explanation of what they consist of and what they can be used for. Topic Maps were mainly invented to solve the findability problem. The paper has explained the approach Topic Maps take to solve this problem.

As has been shown, the three-step procedure described in this paper provides an elegant way of creating topic maps from existing data sources. The approach is very simple and its flexibility can be enormous.

The paper has also shown that RDF and Topic Maps can be very complementary technologies even though they both have been invented to solve much the same problems.

A challenge remains in that there seems to be a lot of potential in building user interfaces on top of systems implementing this approach. Such user interfaces could be of tremendous help to those configuring the process.