You say potato, I say Solanum

By Judith Doyle (CC BY-ND 2.0)

Equivalence relationships are used in controlled vocabularies to augment search indexes and improve recall in search operations. In addition to creating 'regular' synonyms, taxonomists can refine equivalence relationships to improve recall further. Two approaches are discussed here: Generic posting and Hidden labels. These approaches are also discussed in ANSI/NISO Z39.19, sections 8.2.4 and 9.3.1.1 respectively, with the latter term "Hidden label" introduced later as part of the Simple Knowledge Organization System (SKOS) ontology.

Generic posting and Hidden labels are useful, but should be used with caution. Let's look at the rationale and implications for taking some license with our synonyms.

Generic posting

by Sandos (CC BY-SA 3.0) via Wikimedia Commons

A colleague told me he had been in a pet shop and asked the staff "do you have any rodents?" No, was the reply. "So, no guinea pigs, then?" he persisted. Oh yes, we have those, was the reply.

Given the context (a pet shop), my colleague was probably right to expect the staff to pass the all-some test - a test applied to pairs of concepts to determine whether they are in a hierarchical relationship, also known as a genus-species relationship. As in "all guinea pigs are rodents, and some rodents are guinea pigs - therefore the relationship is hierarchical". In fact, in this example rodents (order: Rodentia) are several levels above guinea pigs (species: Cavia porcellus), with genus (Cavia) and family (Caviidae) separating them.

We can use this anecdote to explore cases where it may be useful to define equivalence between terms that are not, in most contexts, actually equivalent. What if it were unlikely that the pet shop would ever supply rodents other than guinea pigs? In a searchable database it may be worth mapping the terms together, so that a search for rodent and a search for guinea pig yield the same results. Augmenting the search index with this mapping gives greater recall across user groups: those who search for rodent and those who search for guinea pig will retrieve the same results.
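A minimal sketch of how such a mapping might be applied when building and querying an index. The catalogue, the item identifiers and the function names are all hypothetical, invented here for illustration; the only idea taken from the text is that the specific term is posted to the broader one, so both searches resolve to the same index entry.

```python
from collections import defaultdict

# Hypothetical catalogue: item id -> indexing terms assigned by a cataloguer.
CATALOGUE = {
    "item-1": {"guinea pig"},
    "item-2": {"rodent"},
    "item-3": {"goldfish"},
}

# Hypothetical generic-posting table: specific term -> broader term it is posted to.
GENERIC_POSTING = {"guinea pig": "rodent"}


def build_index(catalogue, posting):
    """Build an inverted index in which generically posted terms share one entry."""
    index = defaultdict(set)
    for item_id, terms in catalogue.items():
        for term in terms:
            index[posting.get(term, term)].add(item_id)
    return index


def search(index, posting, query):
    """Normalise the query through the same posting table, then look it up."""
    term = query.strip().lower()
    return sorted(index.get(posting.get(term, term), set()))


INDEX = build_index(CATALOGUE, GENERIC_POSTING)

# Both queries resolve to the same index entry and so return the same items.
print(search(INDEX, GENERIC_POSTING, "rodent"))
print(search(INDEX, GENERIC_POSTING, "guinea pig"))
```

Keeping the posting table separate from the catalogue means the equivalence can be switched off or narrowed for a different collection without re-cataloguing anything.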

Now that you have found your guinea pig, you realise that you need to source some appropriate pet food for them. On the web, it's easy to find pet food that is "recommended for all rodents", yet mentions only rats and mice in the product description. But a search for "guinea pig food" may not return the same results without the aid of equivalence mapping.

Notice that we have jumped contexts from a small, local bricks-and-mortar shop to online shopping. Generic, or 'upward', posting can be used to improve search results within certain domains and for search audiences that are known or understood. What works in this pet shop may work less well for an online retailer supplying a broader range of pets and supplies. It may not work at all in other contexts - imagine if the same search rules were applied in a zoo database or website!

Before creating generic-posted equivalences in a thesaurus, consider the likely user satisfaction that will result. Another example: if I search a hardware store website for lawnmower, I will have a number of items to view, evaluate and compare. Do I want results for line trimmers (aka whipper-snippers) also included in this results set? Both items could be returned if lawnmowers sits in the hierarchy under "garden power tools", which in turn has line trimmers as a narrower term. A system could therefore return these 'sibling' items together. But automatically returning siblings is risky - we all know how different brothers and sisters can be, and this approach will work better in some taxonomies than others.
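To make the 'sibling' case concrete, here is a small sketch of hierarchy-based query expansion. The hierarchy table and function names are hypothetical, and making the expansion opt-in is my own design assumption rather than anything a standard prescribes.

```python
# Hypothetical hierarchy: narrower term -> broader term.
BROADER = {
    "lawnmowers": "garden power tools",
    "line trimmers": "garden power tools",
    "hammers": "hand tools",
}


def siblings(term, broader=BROADER):
    """Return the other terms that share the same broader term."""
    parent = broader.get(term)
    if parent is None:
        return []
    return sorted(t for t, p in broader.items() if p == parent and t != term)


def expand_query(term, include_siblings=False):
    """Return the terms to search on; sibling expansion is off unless requested."""
    terms = [term]
    if include_siblings:
        terms.extend(siblings(term))
    return terms


print(expand_query("lawnmowers"))
print(expand_query("lawnmowers", include_siblings=True))
```

Leaving the flag off by default reflects the caution above: the searcher gets lawnmowers only, and line trimmers are added when the interface, or the user, explicitly asks for related items.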

A key question is: will my users be satisfied when I return results that are not exactly what they searched for? Where is my warrant for doing so? The answer is an empirical one and subject to user testing. I suggest, however, that the answer may also lie in how results are presented: is it clear to the user that equivalence mapping has been used to return the results set? This last point is especially critical when another class of equivalence is used in search: the hidden label.

Hidden labels

By Loveteamin (Own work) [CC BY-SA 3.0] via Wikimedia Commons

In 2011 I worked with a developer to build a bespoke web service that would fetch data from a thesaurus. We were using SKOS elements for the first time, and I was charged with specifying how they should be fetched and returned in the web service interface. During testing, I noticed that the skos:hiddenLabel property was not being returned in my requests. I queried this with the developer - the response was sobering: "it's because those properties are hidden".

I love how literal developers can be. That aside, I realised that the use case for skos:hiddenLabel needed a broader narrative. As the thesaurus taxonomist, I needed a consistent policy on when an equivalent term should be marked up as 'for display' or not. We used skos:hiddenLabel to store common misspellings (e.g. millenium), irregular plurals (e.g. plateaux) and embedded plurals (prisoners/prisoner of war). The rationale for storing these terms was to augment search indexes in ways that would be difficult to achieve with an algorithmic approach. The rationale for hiding the terms was that, in a user search context, these labels may contribute to clutter.

For example, in a search for "BODMAS" the equivalent term "Order of operations" could be employed to expand the search. It may also be worth returning a dialogue along the lines of "results for BODMAS OR Order of operations" so that the user understands why some results not containing "BODMAS" have been returned. However, consider how a typical search page would need to present the following dialogue: "results for BODMAS OR Order of operations OR BEDMAS OR BIDMAS OR BIMDAS OR BIODMAS OR BOMDAS OR BPODMAS OR PEMDAS". Does the user need to know all about the search index? By defining rules based on hidden vs regular equivalence relationships it is easier to design tidier interfaces for this kind of dialogue.
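The rule can be sketched as follows: every label feeds the search index, but only preferred and alternative labels are echoed back in the results dialogue. The Concept class, the function names and the split of labels between alternative and hidden are assumptions made for illustration; this is not the web service we actually built.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    pref_label: str                                     # skos:prefLabel
    alt_labels: list = field(default_factory=list)      # skos:altLabel
    hidden_labels: list = field(default_factory=list)   # skos:hiddenLabel

    def search_terms(self):
        """All labels, hidden ones included, are used to match a query."""
        return [self.pref_label, *self.alt_labels, *self.hidden_labels]

    def display_terms(self):
        """Only preferred and alternative labels are shown to the user."""
        return [self.pref_label, *self.alt_labels]


def find_concept(query, concepts):
    """Return the first concept with any label matching the query, ignoring case."""
    q = query.casefold()
    for concept in concepts:
        if q in (term.casefold() for term in concept.search_terms()):
            return concept
    return None


def results_dialogue(query, concepts):
    """Build the 'results for ...' line without exposing hidden labels."""
    concept = find_concept(query, concepts)
    if concept is None:
        return f"results for {query}"
    others = [t for t in concept.display_terms() if t.casefold() != query.casefold()]
    return "results for " + " OR ".join([query, *others])


# Assumed label assignment, for illustration only.
ORDER = Concept(
    "Order of operations",
    alt_labels=["BODMAS"],
    hidden_labels=["BEDMAS", "BIDMAS", "PEMDAS"],
)

print(results_dialogue("BODMAS", [ORDER]))
```

A query on a hidden variant still matches the concept, but the dialogue lists only the term the user typed plus the displayable labels, which keeps the message short however many variants sit in the index.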

Another context was A-Z navigation. Both preferred and non-preferred terms (skos:prefLabel and skos:altLabel) were displayed alphabetically, with the latter italicised. But to avoid clutter, skos:hiddenLabel terms were not included in the A-Z navigation.
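A sketch of that A-Z rule, assuming concepts arrive as plain dictionaries keyed by SKOS property name. The sample data and the style flag are hypothetical ("Tablelands" is an invented alternative label); a real interface would map the "non-preferred" flag to italics.

```python
def az_entries(concepts):
    """Return (label, style) pairs for an A-Z listing, sorted alphabetically.

    Preferred and alternative labels are listed; hidden labels are skipped.
    """
    entries = []
    for concept in concepts:
        entries.append((concept["prefLabel"], "preferred"))
        entries.extend((label, "non-preferred") for label in concept.get("altLabel", []))
        # concept["hiddenLabel"] is deliberately not read here.
    return sorted(entries, key=lambda entry: entry[0].casefold())


# Hypothetical sample data.
CONCEPTS = [
    {"prefLabel": "Plateaus", "altLabel": ["Tablelands"], "hiddenLabel": ["plateaux"]},
    {"prefLabel": "Millennium", "altLabel": [], "hiddenLabel": ["millenium"]},
]

for label, style in az_entries(CONCEPTS):
    print(label, style)
```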

We were also concerned that displaying misspelled words could validate spelling mistakes - a devastating outcome in what was at the time a school education context. On the other hand, by not displaying common misspellings we may have missed an opportunity to alert students to correct spellings. Because we used skos:hiddenLabel for other purposes (e.g. irregular plurals), we couldn't use this property to perform specific pedagogical tasks such as incorporating spelling tutorials. Naturally, the needs of this special user group (school kids) were a factor when determining how to structure and present search result dialogues.

Summary

Synonym control is a core task for any taxonomist. When retrieval systems take advantage of equivalence-based search indexes, they help to democratise search - my search term differs from your search term, but we mean the same thing, and we retrieve the same resources.

Taxonomists are somewhat justified in using available standards to further refine the equivalence relationship - I have discussed Generic posting and Hidden labels. These are powerful refinements - but bring implications. The following should be considered:

  • search dialogues: should equivalence relationships be communicated during a search task? How should they be displayed, and what information should they include?
  • user literacy: will my typical users benefit from equivalence-based search-expansion, or be confused by it? Do some groups (e.g. younger users) have different needs?
  • collection context: a flexible approach to defining synonyms may work well in some contexts, and less well in others. Policies for defining equivalence should fit well with collection scope and specificity.
  • hierarchical context: are hierarchies already used to expand search results, for example for 'sibling' concepts? If so, Generic posting may be redundant, even counter-productive in search rules.
  • metadata context: what cataloguing rules, if any, are being used in the collection context? Does tagging look more like indexing (using specific concepts) or classifying (using broader concepts)? Equivalence relationships should fit well with how metadata is going to be created.