Statistical Classifications: modelling with SKOS
What do you think of when you hear the term controlled vocabularies? Is it Taxonomies? Subject headings? Synonym rings? Within the information management domain, a number of familiar faces arise when the term 'vocabularies' is used. Vocabularies may vary in how they are constructed and what shape they take, but they more or less converge on a common purpose: supporting and enhancing information organisation and retrieval.
Advocates argue that vocabularies are a wise investment: they improve both recall and precision, whether for single-user search tasks or for assisting data mining at scale. Compiling and maintaining vocabularies can be a significant task, so the arguments need to be made in a way that makes the return on investment manifest.
There are, however, some vocabularies that are ubiquitous, often mandated and highly entrenched in everyday operations. They don't get as much attention when talk turns to libraries, cataloguing and subject headings (or search, metadata and thesauri, etc). Yet they are embedded in so many of the functions, activities and transactions that make up day-to-day life.
These vocabularies are statistical classifications, often called standard classifications. Statistical classifications are the instruments developed to standardise and quantify statistical research and reporting. They are developed and maintained by agencies such as the Department of Statistics Singapore (DSS), the Australian Bureau of Statistics (ABS) and Statistics New Zealand (SNZ), to take a regional sample. Governments mandate the use of these statistical classifications in all sorts of applications; you cannot complete high school, vote in an election, fill out a tax return, register a business name, get married, register a birth, or even get buried without (often unwittingly) selecting a code from a schedule of statistical classes.
And this is why statistical classifications should be of interest to information professionals: the argument for these classifications has already been made. Classification data is everywhere - and to the extent that government data becomes 'open data', so too does the embedded classification data.
But are these classifications like those things that we commonly refer to as 'controlled vocabularies'? The examples I will share in this article, drawn from industry classifications published by the statistical agencies mentioned above, exhibit some typical features such as:
- Hierarchy - classifications are arranged into taxonomies, sometimes with named taxonomic levels (e.g. major group, minor group, narrow field etc).
- Synonym and homonym control - terms are unique in form and meaning (at least within the same taxonomic level)
- Identifiers - all terms are 'coded', and sometimes the codes imply hierarchy or some other semantics
This last point is both critical and pertinent to linked-data vocabulary environments and tasks. Statisticians were treating unique identifiers, or 'codes', as key data elements within classifications long before metadata professionals were talking about linked data and URIs. Indeed, the constituent elements of these classifications are commonly referred to as 'the codes', as if the linguistic terms are of secondary interest.
Given (a) the ubiquitous nature of statistical classifications and (b) their similarity to other controlled vocabulary types, it is worth considering whether these classifications can be employed in both traditional and emerging information management systems. Can these classifications, for example, be:
- considered as 'authorities' and quoted within MARC library standards?
- incorporated within web and semantic-web metadata standards?
- leveraged in text mining, entity extraction, assisted and automatic classification or other semantic technologies?
Statistical classifications could be put to work in many scenarios that normally utilise thesaurus-type vocabularies. To make classifications more application-ready, they can be modelled (or re-modelled) using thesaurus conventions. Below I will examine how statistical classifications can be modelled using SKOS in order to make them fit better with both traditional and emerging semantic technologies.
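As a starting point, here is a minimal sketch of how a single classification entry might be expressed as a SKOS concept in Turtle. The namespace, URIs and parent code here are hypothetical placeholders, not any agency's published identifiers:

```turtle
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix ex:   <http://example.org/classification/> .  # hypothetical namespace

# One classification entry modelled as a SKOS concept
ex:46561 a skos:Concept ;
    skos:prefLabel "Wholesale of motor vehicles except motorcycles and scooters"@en ;
    skos:notation  "46561" ;   # the statistical 'code'
    skos:broader   ex:4656 .   # hypothetical parent class in the taxonomy
```

The discussion that follows looks at how labels, notes, relationships and notations like these can be refined.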
Labels that represent SKOS concepts are commonly referred to as terms. Terms may be single- or multi-word in form. Guidelines exist for the construction of terms, covering, for example, the use of nouns, pluralisation, and whether to factor compound concepts into multi-word terms or separate terms. In thesauri that adhere to some kind of guideline (such as Z39.19, section 6.3), the resulting terms are short and compatible with applications such as navigation systems (e.g. breadcrumbs, tag clouds, website taxonomies). They are easy to identify and select from alphabetical lists, drop-down menus or auto-complete search fields.
Statistical classifications, in contrast, tend to favour precision over usability, and can be verbose. Classification terms are often phrases, even clauses. They often do not resemble natural spoken language, or even terminology that would be used in publications. The following illustrative examples are from industry classifications used in the Asia-Pacific region:
- Engineering design and consultancy services supporting mining, oil and gas extraction and offshore exploration activities
- Post-production Services and Other Motion Picture and Video Activities
- Other reservation service activities (eg timeshare, on-line hotel reservation services, reservation of tickets for performances etc)
- Other Publishing (except Software, Music and Internet)
- Sandblasting/Shotblasting works (except ships)
- Wholesale of motor vehicles except motorcycles and scooters
The use of examples is responsible for some of the bloating of these terms (I discuss this further below). Even when examples are not explicitly indicated (such as with "eg"), they are often given following "and" or "and other". Attempts to describe the constituent members of some class are sometimes made within the label itself ('mining, oil and gas...'); whether the class members are supposed to be exhaustive, indicative or representative is not always clear.
A slash (/) is sometimes used to delimit terms that are somehow close in relation, as in "Sandblasting/Shotblasting". Are these concepts synonymous, or associative? Can they be factored into different concepts, or should one of the labels be treated as an alternative label <skos:altLabel>? If so, which label should be preferred? The meaning and intention of such a term may be drawn out by reference to definition or example lists, or from consultation with subject matter experts. If the term is split into separate concepts, then the work of vocabulary modelling crosses over into vocabulary extension (it's important to make sure the vocabulary's source agency is on board before going into this territory).
Exceptions and examples can be removed from the labels but retained in the overall semantics of the vocabulary; let's look at exceptions first.
For end-users, those with some kind of information need, vocabularies need to be familiar and natural. For those creating records and metadata, the cataloguers, vocabularies need to be instructive: "When should I, or shouldn't I, apply this term?" For those creating classification data, this is a problem of scope. What is included in the class? What lies outside it that I may also need to know about?
Thesauri often utilise scope notes, which are expressed in SKOS with the documentation property <skos:scopeNote>. Instructions about when to use or not to use a term are conventionally expressed in these notes, separate from the main textual label.
Sometimes statistical classifications include some of this scope-work within the term itself. For example "Wholesale of motor vehicles except motorcycles and scooters". This label is very clear from a cataloguing perspective and hints strongly that codes for 'motorcycles' and 'scooters' may exist elsewhere in the classification.
However, storing exceptions in labels has significant implications for end-users. Consider a search box environment that uses string-matching and auto-complete to suggest search terms to users. If a user types "MOTORCY...", the class will be returned as a suggested search term. If the user does not read the label all the way through, they may be misled into thinking that they have retrieved results for 'motorcycles' when in fact they have retrieved results that quite deliberately do NOT include motorcycle references.
Exceptions have implications for text mining too. What if the classification "Other Publishing (except Software, Music and Internet)" was used to assist and validate the extraction of the terms 'publishing' and 'music'? False positives would be returned, because music is given in the label as an exception. Natural language processing routines could be taught to handle strings such as 'except', but this would greatly diminish the value, and increase the complexity, of leveraging vocabularies in text mining operations.
If the 'exception' examples above were re-labelled, <skos:scopeNote> could be used to retain the intent of the vocabulary:
Other Publishing
SN: Does not include Software, Music and Internet publishing

Wholesale of motor vehicles
SN: Does not include motorcycles and scooters
This approach retains the instruction which guides the cataloguer, while 'quarantining' misleading terms from search and extraction routines that rely on labels.
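In Turtle, the relabelled motor vehicles concept might be sketched as follows (the namespace and URI are hypothetical placeholders):

```turtle
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix ex:   <http://example.org/classification/> .  # hypothetical namespace

ex:46561 a skos:Concept ;
    # shortened label, safe for search suggestion and extraction
    skos:prefLabel "Wholesale of motor vehicles"@en ;
    # the exception moves into a documentation property
    skos:scopeNote "Does not include motorcycles and scooters"@en .
```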
The use of examples in labels may not produce the same level of false positives that exceptions lead to. Examples may, however, be misleading. We don't always know how representative the example concepts within a label are. In the example "Wholesale of audio and video equipment (eg radio and television sets, sound reproducing and recording equipment)", can we assume that the class has been used to describe an even distribution of sound reproducing equipment, recording equipment, radios and televisions? Perhaps one of these concepts has never been needed? And what emerging concepts may be missing?
It is possible to retain these example terms in SKOS documentation properties AND draw on them in text mining operations by assigning different weightings to each field. So in the following example:
Wholesale of audio and video equipment
EN: radio sets; television sets; sound reproducing equipment; recording equipment
The example terms such as "television sets", stored in <skos:example>, could be assigned lesser weightings in text mining operations than terms such as "video equipment", with the latter retained in the <skos:prefLabel>.
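Expressed in SKOS (again with a hypothetical namespace and URI), the example terms sit alongside the preferred label like this:

```turtle
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix ex:   <http://example.org/classification/> .  # hypothetical namespace

ex:audio-video-wholesale a skos:Concept ;
    skos:prefLabel "Wholesale of audio and video equipment"@en ;
    # example terms: candidates for lower weighting in text mining
    skos:example "radio sets"@en,
                 "television sets"@en,
                 "sound reproducing equipment"@en,
                 "recording equipment"@en .
```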
Differential weighting is only one approach; another is to retain example terms for search box scenarios but exclude them from text mining operations. By extending the semantics with SKOS, more options for distinguishing or combining attributes are opened up.
As hinted in the discussion of exceptions above, there may be associations between concepts in a classification. Exceptions, whether given in labels or note fields, hint at the presence of some other concept that is somehow associated. In SKOS, these associations can be precisely expressed with the property <skos:related>.
Let's extend a class that we looked at earlier:
Wholesale of motor vehicles
SN: Does not include motorcycles and scooters
RT: Wholesale of motorcycles and scooters
By adding the associative relationship, applications can 'complete the set' in search and mining operations. In navigation, 'see also' links based on the <skos:related> property make it easy for users to find more relevant resources.
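In Turtle, the association between the two wholesale concepts might be recorded like this (namespace and URIs hypothetical):

```turtle
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix ex:   <http://example.org/classification/> .  # hypothetical namespace

ex:46561 a skos:Concept ;
    skos:prefLabel "Wholesale of motor vehicles"@en ;
    skos:scopeNote "Does not include motorcycles and scooters"@en ;
    skos:related   ex:46562 .

ex:46562 a skos:Concept ;
    skos:prefLabel "Wholesale of motorcycles and scooters"@en .
```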
Statistical classifications are coded; indeed, in statistical applications it is common to refer to the codes rather than the labels themselves. And in linked-data applications, the codes or identifiers are key data elements. It is the mandated and ubiquitous use of classification codes, and their exchange between systems, that makes statistical classifications so valuable as the basis for taxonomies within semantic technologies.
Where a classification is transformed into an RDF vocabulary, we expect the concepts to be represented using URIs. One option, therefore, is to retain codes in the URI pattern. Using our motor vehicles example, where the code is "46561", a URI could be constructed as follows (the domain here is a hypothetical placeholder): http://example.org/classification/46561
This is the most straightforward and transparent way to retain codes in RDF concepts. Because the code is stored in the URI itself, the URI does not even need to be resolved in order to find the code value.
However, this approach presents a risk. What if the concept is merged with another concept? For example, it may be merged with the associated concept mentioned above (so that scooters and motorcycles are included within motor vehicles wholesale). The associated concept had the code "46562". How would systems that rely only on the semantics of the URI know about the merging of the concepts?
Another approach, which assumes that URIs are being resolved, is to store the codes in the <skos:notation> property. With this approach it is possible to store the legacy code alongside additional codes retained from mergers. For the merged example above, the following would be encoded:
Wholesale of motor vehicles
Notation: 46561; 46562
(The URI suffix for this concept can be minted using any approach, including a 'no semantics' approach such as a UUID scheme.)
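A Turtle sketch of the merged concept, using a deliberately semantics-free (and hypothetical) URI suffix:

```turtle
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix ex:   <http://example.org/classification/> .  # hypothetical namespace

# semantics-free URI suffix; the codes live only in skos:notation
ex:a3f9c2d1 a skos:Concept ;
    skos:prefLabel "Wholesale of motor vehicles"@en ;
    skos:notation  "46561", "46562" .  # legacy codes retained after the merger
```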
Statistical classifications are embedded in many data operations and fulfil a controlled-vocabulary-like function. They are well suited to organising and clustering resources around well-defined concepts. Using widely adopted vocabulary standards like SKOS, classifications can be deployed in even more information management contexts, such as user search and text mining. Some enhancements that vocabulary and thesaurus standards can bring to statistical classifications include:
- Labels - can be shorter in length and more natural in tone
- Exceptions - notes about exceptions and exclusions can be managed in documentation (note) properties, and not in the labels
- Examples - elaborating example or constituent concepts within a label can have unintended consequences for search and retrieval; these may also be managed in documentation properties
- Associations - adding non-hierarchical relationship properties between associated concepts improves navigation and recommender systems
- Codes - can be retained in RDF manifestations, either in concept URI suffixes or in the SKOS notation field