The automatic classification module processes textual content in many different formats and extracts the keywords that identify what the content is about. These keywords can be used to classify content into a topic map automatically, or to suggest classifications that users can then fine-tune.

The input to the module is a file in some format. The module will extract the textual content from the file and process it to find the appropriate keywords for the content, which can then be used to classify the content. Each extracted keyword has an associated score (a number between 0 and 1), which indicates how relevant the keyword is to the content.

For example, processing the paper "Metadata? Thesauri? Taxonomies? Topic Maps!" with the module produces the following output:

  • topic maps, 1.0
  • Dublin Core metadata, 0.98
  • subject-based classification, 0.41
  • faceted classification, 0.34
  • metadata, 0.25
  • controlled vocabulary, 0.21

This output could be used to create a topic for each keyword that scores higher than, say, 0.15, and to associate these topics with the classified content, thereby building a topic map automatically.
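The thresholding step described above can be sketched in a few lines of Python. This is only an illustration and not the module's own API: the `Keyword` class and the `select_topics` and `build_associations` functions are hypothetical names, and the document identifier is invented for the example. The scores are the ones listed in the example output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Keyword:
    """Hypothetical container for one extracted keyword."""
    term: str
    score: float  # relevance to the content, between 0 and 1


# Scores copied from the example output above.
EXAMPLE_KEYWORDS = [
    Keyword("topic maps", 1.0),
    Keyword("Dublin Core metadata", 0.98),
    Keyword("subject-based classification", 0.41),
    Keyword("faceted classification", 0.34),
    Keyword("metadata", 0.25),
    Keyword("controlled vocabulary", 0.21),
]


def select_topics(keywords, threshold=0.15):
    """Return the terms scoring strictly higher than threshold, best first."""
    kept = [k for k in keywords if k.score > threshold]
    kept.sort(key=lambda k: k.score, reverse=True)
    return [k.term for k in kept]


def build_associations(document_id, keywords, threshold=0.15):
    """Pair the document with every selected topic (hypothetical helper)."""
    return [(document_id, term) for term in select_topics(keywords, threshold)]


if __name__ == "__main__":
    # "paper-1" is an invented identifier for the classified document.
    for doc, topic in build_associations("paper-1", EXAMPLE_KEYWORDS):
        print(doc, "->", topic)
```

With the threshold of 0.15 all six example keywords are kept. Raising the threshold trades coverage for precision: at 0.3, only the four highest-scoring keywords would become topics.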

The module supports formats such as XML, HTML, PDF, Microsoft Word (binary and OOXML), and Microsoft PowerPoint (binary and OOXML), and support for more formats can be added easily. It supports English and Norwegian, and support for more languages can likewise be plugged in easily.
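One common way to make format support pluggable is a registry that maps file extensions to text extractors. The sketch below illustrates that general idea only; it does not describe how the module is actually implemented. The names `register_extractor`, `extract_text`, and `html_extractor` are hypothetical, and UTF-8 input is assumed.

```python
import os
from html.parser import HTMLParser

# Hypothetical registry: file extension -> function turning bytes into text.
_EXTRACTORS = {}


def register_extractor(extension, extractor):
    """Plug in support for one more format."""
    _EXTRACTORS[extension.lower()] = extractor


def extract_text(filename, data):
    """Pick an extractor by file extension and return the plain text."""
    extension = os.path.splitext(filename)[1].lower()
    if extension not in _EXTRACTORS:
        raise ValueError(f"no extractor registered for {extension!r}")
    return _EXTRACTORS[extension](data)


class _TextCollector(HTMLParser):
    """Collects the character data between tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_extractor(data):
    """Strip markup from UTF-8 HTML and normalize whitespace."""
    parser = _TextCollector()
    parser.feed(data.decode("utf-8"))
    parser.close()
    return " ".join(" ".join(parser.parts).split())


register_extractor(".html", html_extractor)
# Adding another format is a single call:
register_extractor(".txt", lambda data: data.decode("utf-8"))
```

Under this design, the keyword extraction step only ever sees plain text, so a new format needs nothing more than a new extractor function.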