What is warrant?
When identifying terms for a vocabulary, a combination of source-types should be used. In thesaurus construction, the terminology of the user community and of the information resources that they will access are called user warrant and literary warrant respectively. A less well understood source should also be drawn on: organisational warrant, the third pillar of term sourcing. Thesauri that draw on all three sources are better equipped to support automated extraction services in addition to traditional information retrieval tasks.
Three characteristics make a thesaurus stand out from other controlled vocabularies: hierarchy ('broader/narrower'); association ('see also'); and equivalence ('see', or 'used for'). The last of these is what comes to mind for most people when the word thesaurus is used. The popularity of general thesauri, thanks largely to Roget, is evident in the many hard-copy volumes found in school classrooms, home studies and (decreasingly) the workplace. Equivalence, or synonym, services are of course now commonly integrated with various information and content services.
Synonym control is the key means by which communities are bridged. Different communities of practice with similar information needs may search with different terms, and even within these communities different users will use different terminology. In fact, even individuals may not use the same terms each time they have the same information need.
Equivalence, then, is the means by which we can be inclusive - accommodating many terms that belong in the same meaning-cluster. By authorising synonym relationships, we create more equal opportunities among those that are trying to satisfy information needs.
But each community can be broken down into conceptual types - Z39.19 does this by establishing three types of warrant: user; literary; and organisational.
Literary warrant refers to the information resources, or the literature (meant in a broad sense), in the domain that will be the source of terms for the thesaurus. Typically, they are the information resources that will, in turn, also be described with the thesaurus.
Think of a collection as if it were a community: a literary collection 'speaks' a certain language, which may be in part the product of publishing styles, editorial guidelines or domain-specific jargon.
Thesaurus terms are drawn from literature using different methods. A common practice is to identify candidate terms while collection items are being catalogued. Collection items may also be pre-catalogued (such as 'cataloguing in publication') with index terms also suggested.
When a literary collection is itself the only source of warrant for a thesaurus, the word-stock in the resulting thesaurus will resemble the language spoken by the collection - a sort of cumulation of the many back-of-the-book indexes in the literary works themselves. A thesaurus built using literary warrant only will be less than ideal for retrieval - assuming that the collection will be accessed by another community. We are typically concerned about the 'user' community, although I will discuss another as well. For now let's turn to the warm-bodied user as a source of warrant.
Users are those information-seeking human communities that have information needs. The user can be thought of as the natural language user. If we develop thesauri without reference to the language used by users, we can hardly call the outcome a retrieval thesaurus. A fundamental feature of thesauri is that they match the language of the literature with the search language of the user - and the equivalence relationship is key to performing this match.
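The matching role of the equivalence relationship can be sketched in a few lines of Python. This is a minimal illustration only: the terms, the `EQUIVALENCE` table and the function names are all invented for the example, and a real thesaurus would normally be held in a dedicated vocabulary management system rather than a dictionary.

```python
# Hypothetical mini-thesaurus: each preferred term lists its alternative
# (non-preferred) terms. All terms here are illustrative only.
EQUIVALENCE = {
    "motor vehicles": ["cars", "automobiles", "autos"],
    "adolescents": ["teenagers", "teens", "young people"],
}


def build_entry_index(equivalence):
    """Map every entry term (preferred or alternative) to its preferred term."""
    index = {}
    for preferred, alternatives in equivalence.items():
        index[preferred.lower()] = preferred
        for alt in alternatives:
            index[alt.lower()] = preferred
    return index


def resolve(term, index):
    """Return the preferred term for a search term, or None if it is unknown."""
    return index.get(term.strip().lower())


INDEX = build_entry_index(EQUIVALENCE)
```

A searcher who types 'Teenagers' and a resource indexed under 'adolescents' meet at the same concept, because `resolve("Teenagers", INDEX)` returns the preferred term used in the index.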
User language can also be sourced in a number of ways. Search terms logged in retrieval systems can be extracted, analysed and added to thesauri (either as new concepts or as alternative terms for existing concepts). Outside of closed systems, user communities are often active in social networks - social media, self-publishing and professional events are contexts where the language of domain members can be gleaned and harvested for thesaurus word-stock.
Of course user language and literary language are not vastly different - and are somewhat cross-reinforcing in nature. Authors should consider the language used by their target audience - and readers are in turn influenced by terminology used in what they consume. But by bridging disparities in literary and user terms, search and retrieval are improved. The rationale for thesaurus work relies somewhat on this assumption.
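As a rough sketch of the search-log approach, the Python below counts logged queries and reports those that recur but are not yet entry terms in the thesaurus. The function name, the `min_count` threshold and the sample log are assumptions made for illustration; a real log would need more careful normalisation, and each candidate would still need editorial review before being added.

```python
from collections import Counter


def candidate_terms(search_log, known_terms, min_count=2):
    """Return (term, count) pairs for recurring queries not yet in the thesaurus.

    search_log  -- iterable of raw query strings (hypothetical log extract)
    known_terms -- existing entry terms, preferred and alternative
    min_count   -- illustrative threshold for treating a query as recurring
    """
    normalised = (query.strip().lower() for query in search_log)
    counts = Counter(query for query in normalised if query)
    known = {term.lower() for term in known_terms}
    return [
        (term, n)
        for term, n in counts.most_common()
        if n >= min_count and term not in known
    ]
```

Each candidate can then be assessed as either a new concept or an alternative term for an existing one.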
But are warm-bodied, natural-language users the only agents that have need for information resources? Earlier I broadened our definition of language communities to include both users and information resources. Now let's see if we can broaden the notion of information need beyond warm-bodied users to include organisational frameworks.
Z39.19 introduced the concept of organisational warrant in 2005. At the time I didn't grasp the significance or usefulness of this theoretical extension. But soon after I was working in the education sector, and it was there that its meaning and relevance became quickly apparent. In the new context, the school curriculum was the organisational warrant that was relevant to the school education domain. In fact we started using the phrase "curriculum warrant" when evaluating or justifying choice of terminology for preferred terms (counterbalancing with user and/or literary warrant for alternative terms).
I realised later that while working in the government sector, I had indeed been drawing on organisational warrant: legislation and policy documents. The legalese style of language that is typical of these documents is often different again from the language used by information seekers, and also from the language used in the information resources mandated by those legal documents.
And mandating is what legislation does - legislation mandates, or requires the existence of various resources and services in the community (including information resources and services). The same can be said about curriculum frameworks: curriculum mandates, or requires the existence of information resources and services that will somehow meet the objectives set out in the curriculum content.
Organising frameworks will be relevant given different contexts - they may be company policies, international standards, statutory instruments and so on.
As with literary and user language, organisational language is not vastly different - it may be more formal and standardised, and, usefully, organisational documents are often accompanied by term glossaries. There will be some uptake of this language within publishing and user language. So the benefit of including organisational warrant in thesaurus construction is improved linguistic access to concepts - but only in collaboration with the other warrant sources.
From retrieval to extraction
So far we have established that it is the warm-bodied human user that has the information need. But we have also established that organising frameworks have needs too - certain types of special, formalised documents mandate, require or 'need' resources and services. Can we think, then, of these organising frameworks as being agents too with information needs?
Human agents carry out the work of sourcing resources and services to meet organisational objectives. But this work can be automated - resources can be mined, extracted, using the language in organising frameworks as structured queries. An example of this is the Australian Curriculum - a machine-readable curriculum that has been integrated with educational content retrieval systems. See for example the curriculum-based search features in Scootle. When a user selects a curriculum objective from a learning area and year level, the Scootle system looks for resources that contain the same language as the content description. This works well in Scootle because the resources have been meta-tagged with a thesaurus that has been built using warrant from the Australian Curriculum (presented in more detail in 2015).
Resources can be extracted and matched with organisational objectives in other contexts. This extraction approach works better if thesauri, used as part of extraction rules, are built using organisational as well as literary and user warrant.
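The general idea can be sketched in Python as follows. This is not a description of how Scootle or any other system is implemented: the concepts, warrant labels, objective wording and resource titles are all invented, and the matching is a simple phrase lookup. Each concept carries entry terms from different warrant sources, the text of an organisational objective is scanned for those terms, and resources tagged with the same concepts are returned.

```python
import re

# Hypothetical concepts: concept label -> entry terms with their warrant source.
CONCEPTS = {
    "fractions": {"fractions": "organisational", "parts of a whole": "literary"},
    "number lines": {"number lines": "organisational", "number line": "user"},
}


def concepts_in(text, concepts):
    """Return the set of concepts with at least one entry term found in the text."""
    lowered = text.lower()
    found = set()
    for concept, entry_terms in concepts.items():
        for term in entry_terms:
            # Whole-phrase match so that 'number line' does not match 'number liner'.
            if re.search(r"\b" + re.escape(term.lower()) + r"\b", lowered):
                found.add(concept)
                break
    return found


def match_resources(objective, resources, concepts):
    """Rank resources by how many of the objective's concepts they are tagged with."""
    wanted = concepts_in(objective, concepts)
    scored = []
    for title, tags in resources.items():
        shared = wanted & set(tags)
        if shared:
            scored.append((title, sorted(shared)))
    # Most shared concepts first; ties broken alphabetically by title.
    scored.sort(key=lambda item: (-len(item[1]), item[0]))
    return scored
```

Because the objective is treated as the query, the quality of the match depends on whether the thesaurus actually contains the terms used in the organising document, which is the case for including organisational warrant.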
Thesauri should draw on and reflect the language used by information seekers and the language of the information sought - where terminology between users and literary collections varies, synonyms can be used to match meanings across terms. A third source of warrant is the organising frameworks that govern, enable or mandate the domain itself. Organising documents should be used as a source of additional terms for a thesaurus. A thesaurus constructed using all three sources of warrant - literary, user and organisational - is well placed to facilitate not only enhanced search and retrieval, but also data mining and extraction using domain-organising principles as the basis.