Vocabulary projects that cover multiple knowledge domains, such as multi-disciplinary science, cross-curriculum education or whole-of-government functions, present special challenges. Here are a few things to consider when starting out.
Whether compiling, editing or simply using a controlled vocabulary, one is often confronted with a high-level question: is this one vocabulary or two? Or is it many? This is a daunting question when setting out to build a vocabulary - how happy will your client be to hear that the estimated effort has now doubled or tripled? (It’s worth preparing a good story first if you ever need to do this.)
This question comes about when the context, content and/or community of practice behind the vocabulary is broad in scope. Consider the organisational context behind the vocabulary project – are there key institutions behind the effort, and do they share common functions, activities or fields of research? What content will vocabulary terms be drawn from – will it be from multiple collections, representing a wide array of topics and genres? And who will use the vocabulary or vocabularies – are they a community with similar needs, cultures and aspirations?
In short: will the vocabulary be sourced from one or many domains?
If the answer is one, then a simpler task is ahead. The burden of disambiguating concepts is lifted as there will be broad understanding about the meaning of common terms. For example, in a construction thesaurus, ‘crane’ doesn’t need a qualifier to distinguish it from the bird-sense. In a geophysics thesaurus, a scope note that limits the use of ‘Australia’ to works about continental entities and features, rather than political boundaries, may not be needed.
If the answer is many, then a more complex challenge is ahead. As well as homograph control via qualifiers and scope notes, poly-hierarchies may be difficult to apply without inadvertently creating odd, less-than-intuitive hierarchies. Reference structures may be difficult to untangle later during revision or restructuring.
Despite the challenges, there may be sound reasons for managing multiple domains in the one vocabulary project. In some cases, it may even be inevitable. An education thesaurus is a clear example of this – to capture concepts from across a curriculum, many domains stemming from disparate communities of practice need to be included. And by doing so, opportunities arise: trans-disciplinary, or cross-curriculum associations are more readily made between concepts in the same vocabulary. A broad science thesaurus presents similar challenges as well as opportunities for creating associations between scholarly traditions. And government subject/functions thesauri naturally draw warrant from multiple ministerial portfolios, each responsible for implementing very different policies, but opportunities arise for whole-of-government associations across functions and activities.
There are three approaches that may be considered when building multi-domain vocabularies and these approaches may be combined: Top-down building; Concept schemes; and URI structure.
In top-down vocabulary construction, high-level principles of division are established early in the project. In simple terms, 'top terms' are those terms that themselves do not have a broader term relationship. The top-down construction method starts with analysis of the high-level functions, disciplines or organisational structures that will form the warrant for the top-level vocabulary terms.
The top-down approach draws on framework-like sources such as curriculum, academic or policy instruments to ensure that all domains are covered and evenly represented. In contrast, bottom-up vocabulary construction focuses on the information resources that the vocabulary would describe and retrieve, gradually building up word-stock to be later arranged into taxonomies. Across multiple domains, the bottom-up method is a formidable task, which may or may not result in a logical, intuitive top-level structure.
Top-down vocabularies have neat, triangular top levels that can be used to drive navigation features in retrieval systems - the 'top' is an important entry point for users when navigating, or drilling-down through concepts.
But how does an information retrieval system know which terms are Top terms? The simple definition, that Top terms are "those terms that themselves do not have a broader term relationship", is not wholly adequate. When retrieval systems rely on it as a business rule, the options for vocabulary construction are constrained. Terms that have no broader terms, but are not intended or required to be Top terms, become rogue entities. Such rogue terms are called 'orphan' terms. The analogy is incomplete, for orphans may go on to have children (narrower terms) themselves and will thus be treated by systems as Top terms.
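The effect of the "no broader term" rule can be sketched in a few lines of Python. This is only an illustration: the terms and the `broader` mapping are invented for the example, and a real retrieval system would read these relationships from its own vocabulary store.

```python
# Hypothetical mini-thesaurus: each term maps to its list of broader terms.
broader = {
    "Science": [],                 # intended top term
    "Physics": ["Science"],
    "Optics": ["Physics"],
    "Miscellany": [],              # orphan: never intended as a top term
    "Odd socks": ["Miscellany"],   # the orphan has a narrower term of its own
}

def inferred_top_terms(broader):
    """Return every term with no broader term, as a naive system would."""
    return sorted(term for term, parents in broader.items() if not parents)

print(inferred_top_terms(broader))  # ['Miscellany', 'Science']
```

The orphan 'Miscellany' is reported alongside the intended top term 'Science', because the rule cannot tell them apart.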
It's worth considering how Top terms are identified, managed and reported in vocabulary management software. Some applications allow terms to be explicitly marked as 'top' so that orphans are not reported, exported or published within contexts where only Top terms should appear. Without this feature, you may be left with inflexible rules for management of the top-level structure. When I started work on the Schools Online Thesaurus (ScOT), it had just over 180 top terms. Retrieval systems in the school education sector required a tidy top-term structure with which to build user-navigation features. We managed to consolidate that top term level down to ten terms. This exercise was good for the overall health of the thesaurus: many overlapping concepts were merged and reference structures were made more intuitive, which made it easier to identify conceptual gaps and potential growth areas. But there was also a bit of shoe-horning - terms were packed away into less-than-perfect reference structures so as to pave the way for a navigable top term structure. A method of explicitly marking the required Top terms may have alleviated the need to sweep orphans under the taxonomic carpet.
It is possible to mark terms explicitly as Top terms using widely adopted standards: concept schemes may provide further clarity and control, not only for your top terms but throughout the structure of your multi-domain vocabulary.
The concept scheme approach is named after the SKOS (Simple Knowledge Organization System) class skos:ConceptScheme, although similar approaches pre-date SKOS. For example, the Thesaurus of Australian Government Subjects (TAGS) in 2002 used 'Subject Categories' to "group broadly related terms". TAGS also uses Broader Term (BT) relationships, although sparingly, resulting in a broad, shallow thesaurus shape.
In SKOS, the definition of concept scheme is "suggestive rather than restrictive". The relevance of concept schemes to multi-domain vocabularies is that concept schemes can be identified and therefore used to name and separate those domains. A vocabulary can have one or many concept schemes, and like the Subject categories in TAGS, terms can have membership in multiple concept schemes. Membership is denoted by the In Scheme property (skos:inScheme). Further, Top Concept Of (skos:topConceptOf) denotes that a concept is the most ascendent term in that concept scheme. And the concept scheme itself can declare that it has Top concepts, or skos:hasTopConcept relationships with terms.
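As a rough sketch of how these properties fit together, the Python below models concepts with fields that mirror skos:inScheme, skos:topConceptOf and skos:broader, and derives the scheme-side skos:hasTopConcept view from them. The `Concept` class, the scheme names and the sample terms are all hypothetical; a real project would more likely hold this data as RDF.

```python
from dataclasses import dataclass, field

@dataclass
class Concept:
    label: str
    in_scheme: set = field(default_factory=set)       # mirrors skos:inScheme
    top_concept_of: set = field(default_factory=set)  # mirrors skos:topConceptOf
    broader: set = field(default_factory=set)         # mirrors skos:broader

def has_top_concept(concepts, scheme):
    """Derive the scheme-side view (skos:hasTopConcept) from the concept-side flags."""
    return sorted(c.label for c in concepts if scheme in c.top_concept_of)

concepts = [
    Concept("Science", in_scheme={"science"}, top_concept_of={"science"}),
    Concept("Physics", in_scheme={"science"}, broader={"Science"}),
    # A member of two schemes: top concept in one, ordinary member of the other.
    Concept("Measurement", in_scheme={"science", "mathematics"},
            top_concept_of={"mathematics"}, broader={"Science"}),
    # No broader term, but not marked as the top of anything.
    Concept("Miscellany"),
]

print(has_top_concept(concepts, "science"))      # ['Science']
print(has_top_concept(concepts, "mathematics"))  # ['Measurement']
```

Because top status is stated explicitly, 'Miscellany' never appears as a top concept even though it has no broader term.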
The concept scheme approach introduces some flexibility and control with respect to how domains are structured:
- Terms may belong to multiple concept schemes, and have broader-narrower relationships with terms from any of those concept schemes.
- A term may be a Top concept in one concept scheme and belong to another concept scheme where it is not a Top concept.
- Terms in a distinct domain could be given membership to a given concept scheme yet, at least in theory, have no relationships with each other. This scenario may be better suited to a Collection (skos:Collection) approach.
The concept scheme approach also introduces new possibilities with respect to how a thesaurus is extracted and presented:
- Top-term displays can be limited to terms belonging to a concept scheme without first resolving 'orphan' term issues.
- Rules for presenting one concept scheme, and not another, may be defined for scenarios such as Top term or navigation display.
- All terms belonging to a concept scheme may be extracted and reported. This will be useful if domain-specific terminology is to be reviewed by domain-specific communities of practice.
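The extraction scenario can be sketched as a simple grouping step. In this minimal example the `concepts` mapping (label to scheme memberships) and the scheme names are assumptions made for illustration; the point is only that membership, not hierarchy, drives the report.

```python
from collections import defaultdict

# Hypothetical data: term label -> concept schemes the term belongs to.
concepts = {
    "Science": {"science"},
    "Physics": {"science"},
    "Measurement": {"science", "mathematics"},
    "Algebra": {"mathematics"},
    "Miscellany": set(),  # belongs to no scheme, so appears in no report
}

def report_by_scheme(concepts):
    """Group term labels by concept scheme, alphabetically within each scheme."""
    report = defaultdict(list)
    for label in sorted(concepts):
        for scheme in concepts[label]:
            report[scheme].append(label)
    return dict(report)

report = report_by_scheme(concepts)
print(report["science"])      # ['Measurement', 'Physics', 'Science']
print(report["mathematics"])  # ['Algebra', 'Measurement']
```

Each list could then be sent to the relevant community of practice for review; 'Measurement' appears in both because it has membership in both schemes.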
A really important benefit of concept schemes is what they don't do: concept schemes do not interfere with the basic hierarchical, equivalence and associative relationships that make a thesaurus work. And they can be ignored by reviewers, editors or those responsible for system implementations if they are not needed.
Questions remain about 'partial implementation':
- Should concept schemes and concept scheme properties (e.g. skos:inScheme; skos:topConceptOf) always be used when other parts of the SKOS standard are used?
- Must all terms have at least one concept scheme membership when at least one concept does?
- Should a term that has an antecedent Top concept always be in the same concept scheme?
Of course all sorts of absurd scenarios may be technically valid, such as a thesaurus with as many concept schemes as there are concepts, or one in which all concepts in one concept scheme are also in another, and so on. If concept schemes are to be used, it is worth exploring the rules set out by the thesaurus management system to be used. Software that incorporates a stable set of quality measures, such as qSKOS, may ensure some consistency in what is allowed.
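Business rules of this kind can be expressed as small checks. The sketch below is illustrative only: the data structure and the two rules are assumptions chosen for the example, not a reproduction of the checks performed by qSKOS or any other tool.

```python
# Hypothetical data: label -> scheme memberships, schemes topped, broader terms.
concepts = {
    "Science":    {"in_scheme": {"science"}, "top_of": {"science"}, "broader": set()},
    "Physics":    {"in_scheme": {"science"}, "top_of": set(),       "broader": {"Science"}},
    "Optics":     {"in_scheme": set(),       "top_of": {"science"}, "broader": {"Physics"}},
    "Miscellany": {"in_scheme": set(),       "top_of": set(),       "broader": set()},
}

def without_scheme(concepts):
    """Terms with no concept scheme membership at all."""
    return sorted(label for label, c in concepts.items() if not c["in_scheme"])

def top_with_broader(concepts):
    """Terms marked as a Top concept that nevertheless have a broader term."""
    return sorted(label for label, c in concepts.items()
                  if c["top_of"] and c["broader"])

print(without_scheme(concepts))    # ['Miscellany', 'Optics']
print(top_with_broader(concepts))  # ['Optics']
```

Whether either result counts as an error is a project decision; the checks simply make the chosen rules visible and repeatable.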
If concept schemes support a flexible way of explicitly defining membership in domains, using URIs is the hard-wired approach. URI structure and patterns can be used to distinguish domains within a vocabulary, with or without the concept scheme approach.
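Let's start with the minimum URI structure needed to identify a concept in a vocabulary. You need a server location, a vocabulary name and a concept identifier, following a pattern such as http://myserver/vocabularyName/conceptIdentifier.
We can add a further directory for each of any number of domains, giving a pattern such as http://myserver/vocabularyName/domainName/conceptIdentifier.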
The domains may be based on concept schemes. In a vocabulary system, rules can be set so that all concepts in a concept scheme inherit the extra directory URI prefix.
But some inflexibilities are introduced by this approach, namely:
- Concepts are 'locked-in' to the domain structure.
- Concepts can only have membership in one domain, as defined by the URI.
Locked-in, exclusive membership may well suit some contexts where concepts are substantially different in purpose or intent. For example, the Australian Standard Geographical Classification (ASGC) includes distinct 'classification structures': Main Structure; Local Government Area Structure; Statistical District Structure; Statistical Region Structure. Geographical concepts could be managed in separate vocabularies based on these classification structures (and therefore with different /vocabularyName identifiers). If they were to be expressed in the same vocabulary, the domain-directory approach may be suitable, because City of Brisbane, a Local Government Area, can never be an instance of a Statistical Region. Nor can Gold Coast-Tweed (Queensland/New South Wales) ever belong to the Local Government Area domain. Therefore it would be safe to assign the following URIs:
http://myserver/asgc/lga/cityofbrisbane
http://myserver/asgc/sr/GoldCoast-Tweed(Queensland/New South Wales)
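The URI pattern used above (server, vocabulary name, optional domain directory, concept identifier) can be sketched with the Python standard library. The function name `concept_uri` and its parameters are hypothetical. Percent-encoding each segment is a design choice made for this sketch, not something the pattern itself requires, but it keeps spaces and slashes inside an identifier from being read as extra directories.

```python
from urllib.parse import quote

def concept_uri(server, vocabulary, concept_id, domain=None):
    """Build server/vocabulary[/domain]/concept, percent-encoding each segment."""
    segments = [vocabulary] + ([domain] if domain else []) + [concept_id]
    path = "/".join(quote(segment, safe="") for segment in segments)
    return f"{server.rstrip('/')}/{path}"

# Without a domain directory.
print(concept_uri("http://myserver", "asgc", "cityofbrisbane"))
# With a domain directory.
print(concept_uri("http://myserver", "asgc", "cityofbrisbane", domain="lga"))
# An identifier containing spaces, brackets and a slash stays a single segment.
print(concept_uri("http://myserver", "asgc",
                  "Gold Coast-Tweed (Queensland/New South Wales)", domain="sr"))
```

The first two calls reproduce the plain patterns; the third shows that the slash inside the place name is encoded, so the URI still has exactly one domain directory.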
The URI structure approach to defining domains does not rely on a concept scheme approach. Without concept schemes, the URI structure can't be used to explicitly indicate that a concept is 'top' of either vocabulary or domain. Used in combination, more absurd theoretical scenarios are possible, for example:
- A concept has a URI domain structure and is a member of the concept scheme based on that domain, and also of other concept schemes.
- A concept has a URI domain structure and is not a member of the concept scheme based on that domain.
- Some URI domain structures are based on concept schemes, while others are not.
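One possible business rule for the combined approach is that a concept whose URI carries a domain directory must also be a member of the concept scheme of the same name. The sketch below checks that rule. The rule itself, the three-segment URI layout, the lower-cased identifiers and the scheme names are all assumptions made for illustration.

```python
from urllib.parse import urlparse

def uri_domain(uri):
    """Return the domain directory of a /vocabulary/domain/concept URI, else None."""
    segments = [s for s in urlparse(uri).path.split("/") if s]
    return segments[1] if len(segments) == 3 else None

# Hypothetical data: concept URI -> concept schemes the concept belongs to.
concepts = {
    "http://myserver/asgc/lga/cityofbrisbane": {"lga"},
    "http://myserver/asgc/sr/goldcoast-tweed": {"main"},  # URI says 'sr', scheme disagrees
    "http://myserver/asgc/australia": {"main"},           # no domain directory
}

def mismatches(concepts):
    """URIs whose domain directory is not among the concept's schemes."""
    return sorted(uri for uri, schemes in concepts.items()
                  if uri_domain(uri) is not None and uri_domain(uri) not in schemes)

print(mismatches(concepts))  # ['http://myserver/asgc/sr/goldcoast-tweed']
```

Concepts without a domain directory are skipped rather than flagged, which reflects the scenario where only some URI domain structures are based on concept schemes.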
Again, common sense should prevail. But more importantly, business rules for utilising a combination of URI structure and concept schemes should be established and, ideally, integrated with vocabulary management systems.
So you've identified that you have a multi-domain vocabulary to build - it's a substantial task. You will need to do the following:
- Decide if the domains are distinct: could any concepts logically have membership in more than one domain?
- Identify reporting requirements: will different communities need access to domain-specific terms for consultation, review or approval?
- Determine system requirements for displaying top terms and/or navigation - do these terms need to be explicitly marked in the vocabulary?
- Evaluate vocabulary management applications for their ability to support, flag or enforce business rules needed in multi-domain vocabulary construction.