The Other NLP

By Takamorry from Fukui, Japan (Five Stars) [CC BY-SA 2.0 (https://creativecommons.org/licenses/by-sa/2.0)], via Wikimedia Commons


If you thought the title of this piece referred to natural language processing, you’re right. Mostly right, anyway (more about the other NLP later).

In What is warrant I touch briefly on the use of thesauri to assist data extraction, often called text mining. Using thesauri in conjunction with natural language processing algorithms is the bleeding edge of text mining. It’s five-star NLP. It also jumps ahead of a discussion of the other approaches used to improve text mining, so I intend to jump back and cover those here.

I’m going to outline three approaches that assist natural language processing and give each one a star rating: three, four and five stars out of five (no one who plays in this field deserves fewer than three). The approaches are not mutually exclusive, although the five-star approach more or less assumes the three- and four-star approaches, with some tweaking.

Three stars: write a dictionary

 
"Just this morning I added 'the', 'is' and 'are'..."

Time is precious. Resources are limited. But don’t worry, because you can always convince your boss that writing a dictionary is a reasonable thing to add to your work load, right?

Because you know that by ‘aardvark’ you mean the African mammal, and not the search engine, right? You know it, so add it to your dictionary and mark it up as special. And you know your content repository is peppered with stop words – irrelevant, distracting text that we want to teach our AI engine to ignore. So make a list of them.
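
As a rough illustration of where a hand-written dictionary starts, here is a minimal Python sketch. The stop-word list, the dictionary entry and the function names are all made up for this example; a real list would be tuned to your own repository.

```python
import re

# Hypothetical stop-word list (illustrative only).
STOP_WORDS = {"the", "is", "are", "a", "an", "of", "in", "and"}

# Hypothetical dictionary: term -> the sense we mean in this domain.
DICTIONARY = {
    "aardvark": "African mammal, not the search engine",
}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def extract_terms(text):
    """Drop stop words and attach a dictionary sense where one exists."""
    return [
        (token, DICTIONARY.get(token))
        for token in tokenize(text)
        if token not in STOP_WORDS
    ]

terms = extract_terms("The aardvark is in the burrow")
```

Here `terms` holds two pairs: 'aardvark' with its marked-up sense, and 'burrow' with no sense recorded yet. Every term that comes back without a sense is a candidate for another late-night dictionary entry.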

As you run your extractions and build tables, highlighting frequent and improbable combinations of terms, make a dictionary entry each time you detect low or high relevancy terms, or cases of semantic ambiguity. It will get complicated as you try to maintain reciprocal relationships between synonyms and associations. And you will find cases where terms are broader in meaning than other relevant terms – should you just treat these as equivalent?
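
The bookkeeping described above can be sketched in a few lines of Python. This is only an assumption about how such tables might be built: the documents, terms and function names are invented for the example, and it counts co-occurrence per document rather than anything more refined.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(docs):
    """Count how many documents each unordered pair of terms shares."""
    counts = Counter()
    for terms in docs:
        # sorted(set(...)) gives each pair one canonical order per document
        for pair in combinations(sorted(set(terms)), 2):
            counts[pair] += 1
    return counts

def add_synonym(synonyms, a, b):
    """Record a and b as synonyms of each other (reciprocal by construction)."""
    synonyms.setdefault(a, set()).add(b)
    synonyms.setdefault(b, set()).add(a)

# Hypothetical extracted terms, one list per document.
docs = [
    ["aardvark", "burrow", "termite"],
    ["aardvark", "termite"],
    ["antbear", "termite"],
]
pairs = cooccurrence(docs)

synonyms = {}
add_synonym(synonyms, "aardvark", "antbear")
```

Writing both directions in one step keeps synonym pairs reciprocal, but it does nothing about chains of synonyms or about broader and narrower terms, which is exactly where the ad-hoc dictionary starts to hurt.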

Writing a dictionary is hard work: it'll be late nights and bad dreams with little return. Fundamentally, it's worth doing this kind of work to reduce the inherent ambiguity of your extracted terms. But it's better done with the right data standards and data tools at your disposal (more about that under Five stars).

Four stars: get an encyclopedia

Madam could I sell you a... Wikipedia Premium account?


Think your content repository is big? It’s never as big as the knowledge domain(s) it belongs to. If you want a more comprehensive list of synonyms, antonyms, homographs and associations, draw on wider knowledge – get encyclopedic about your domain.

Encyclopedias are a rich source of term definitions and descriptions, and often include illustrative media that further clarify the meaning of terms. They are also a rich source of terms that describe your concepts, including terms that may not be used by contributing authors. Encyclopedias are therefore a useful source of word-stock when designing text mining systems.

Encyclopedias are very broad in nature and typically lack hierarchical structure. The mighty Wikipedia is a game changer: its categories approach organizes the vast number of articles into a (more or less) logical hierarchy, which helps to further clarify the meaning of concepts within its articles. It doesn't always obey standard rules for hierarchy construction. If you want that in your toolkit, go for five-star NLP.

Five stars: develop a thesaurus


Five shiny, wonderful stars go to those NLP approaches that combine the first two approaches in the following way:

  • Instead of an ad-hoc dictionary, a domain-relevant taxonomy is selected as a starting point
  • The taxonomy is iteratively adapted using both your content database and encyclopedic corpus

Drawing on a combination of structured, semi-structured and unstructured material (your taxonomy, corpus and content respectively) enables powerful entity extraction and auto-classification.

Five-star NLP starts with a taxonomy and finishes with a SKOS thesaurus:

  • add synonyms drawn from both your content and selected corpus to improve both text extraction and search tools
  • validate and update hierarchy structure using organizing frameworks found in encyclopedias (e.g. Wikipedia categories) or other corpora (policies, strategic plans, service directories, etc.)
  • associate concepts that have logical relationships within your content
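
To make the three steps above a little more concrete, here is a minimal Python sketch of a thesaurus held in memory. The field names loosely mirror the SKOS properties skos:prefLabel, skos:altLabel, skos:broader and skos:related; the concepts, labels and helper functions are invented for illustration, and a real SKOS thesaurus would live in RDF rather than in a Python dict.

```python
from dataclasses import dataclass, field

@dataclass
class Concept:
    pref_label: str                                # cf. skos:prefLabel
    alt_labels: set = field(default_factory=set)   # cf. skos:altLabel (synonyms)
    broader: set = field(default_factory=set)      # cf. skos:broader
    related: set = field(default_factory=set)      # cf. skos:related

thesaurus = {}

def add_concept(pref_label, broader=()):
    """Create a concept, optionally placing it under broader concepts."""
    thesaurus[pref_label] = Concept(pref_label, broader=set(broader))
    return thesaurus[pref_label]

def relate(a, b):
    """Associate two concepts; skos:related is symmetric, so store both ways."""
    thesaurus[a].related.add(b)
    thesaurus[b].related.add(a)

def expand_query(term):
    """Return every label of the concept that `term` names, for search expansion."""
    wanted = term.lower()
    for concept in thesaurus.values():
        labels = {concept.pref_label} | concept.alt_labels
        if wanted in {label.lower() for label in labels}:
            return sorted(labels)
    return [term]

# Hypothetical content: a tiny taxonomy, a synonym and an association.
add_concept("Mammals")
add_concept("Aardvark", broader=["Mammals"]).alt_labels.add("antbear")
add_concept("Termite")
relate("Aardvark", "Termite")

labels = expand_query("antbear")
```

A search for 'antbear' now expands to every label of the Aardvark concept, which is the practical payoff of adding synonyms from your content and corpus.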

qSKOS, a quality-assessment tool for SKOS vocabularies, helps to ensure that thesaurus hierarchies are made up of logical relationships and are free of Cyclic Hierarchical Relations, which can confound extraction and search algorithms. Unlike Wikipedia categories, a quality-assured SKOS thesaurus can be relied upon as a stable semantic component within natural language processing.
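
For a sense of what a cyclic hierarchy check involves, here is a minimal Python sketch. It is not qSKOS's own implementation, just a plain depth-first search over a hypothetical mapping from each concept to its broader concepts; the function name and sample data are invented for the example.

```python
def find_cycle(broader):
    """Return one cycle of broader links as a list of concepts, or None."""
    UNSEEN, IN_PROGRESS, DONE = 0, 1, 2
    state = {}

    def visit(node, path):
        state[node] = IN_PROGRESS
        path.append(node)
        for parent in broader.get(node, ()):
            seen = state.get(parent, UNSEEN)
            if seen == IN_PROGRESS:
                # parent is already on the current path, so we have looped back
                return path[path.index(parent):] + [parent]
            if seen == UNSEEN:
                cycle = visit(parent, path)
                if cycle:
                    return cycle
        path.pop()
        state[node] = DONE
        return None

    for concept in broader:
        if state.get(concept, UNSEEN) == UNSEEN:
            cycle = visit(concept, [])
            if cycle:
                return cycle
    return None

# Hypothetical hierarchies: concept -> list of broader concepts.
clean = {"Aardvark": ["Mammals"], "Mammals": ["Animals"], "Animals": []}
broken = {"A": ["B"], "B": ["C"], "C": ["A"]}

clean_result = find_cycle(clean)
broken_result = find_cycle(broken)
```

The clean hierarchy yields no cycle, while the broken one reports the loop from A back to A. The sketch uses recursion, so a very deep hierarchy would need an iterative version.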

Five-star solution

If you want five-star NLP, you need five-star tools that support:

  • Taxonomy management, based on SKOS and SKOS-based quality control. Tools that help you import, adapt and transform taxonomies into powerful thesauri
  • Corpus management – connecting to and extracting terms from web-based encyclopedias and other authorities relevant to your domain
  • Corpus-thesaurus iteration and validation – tools to assist you in updating the word-stock and structure of your thesaurus based on a growing corpus
  • Text mining and entity extraction – tools that combine standard AI methods with your developing corpus and thesaurus to extract concepts with increasing precision
  • Auto-classification – tools that leverage your NLP work and assign clear, representative classes to your content
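
As a toy illustration of the last two items, the Python sketch below matches thesaurus labels in a piece of text and assigns the class that its concepts most often roll up to. Everything in it is an assumption made for the example: the labels, the concept-to-class mapping and the function names are hypothetical, and real entity extraction tools do far more than exact label matching.

```python
import re
from collections import Counter

# Hypothetical thesaurus fragment: label (preferred or synonym) -> concept.
LABELS = {
    "aardvark": "Aardvark",
    "antbear": "Aardvark",
    "termite": "Termite",
    "white ant": "Termite",
}

# Hypothetical hierarchy: concept -> broader concept, used as the class.
BROADER = {"Aardvark": "Mammals", "Termite": "Insects"}

def extract_concepts(text):
    """Count whole-word label matches and roll them up to their concepts."""
    lowered = text.lower()
    found = Counter()
    for label, concept in LABELS.items():
        pattern = r"\b" + re.escape(label) + r"\b"
        found[concept] += len(re.findall(pattern, lowered))
    return +found  # unary plus drops concepts with a zero count

def classify(text):
    """Pick the broader concept that the extracted concepts most often fall under."""
    classes = Counter()
    for concept, count in extract_concepts(text).items():
        classes[BROADER[concept]] += count
    return classes.most_common(1)[0][0] if classes else None

text = "The antbear, or aardvark, eats the white ant."
concepts = extract_concepts(text)
label = classify(text)
```

Because 'antbear' and 'aardvark' both resolve to the same concept, the text counts twice towards Mammals and only once towards Insects. Exact matching like this misses plurals and misspellings, which is one reason the dedicated tools exist.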

Looking for a platform that does all of this? Ask us at makeitatriple about PoolParty.

The Other NLP

By Peacock_(PSF).png: Pearson Scott Foresman derivative work: Ken g6 (Peacock_(PSF).png) [Public domain], via Wikimedia Commons


Enter “nlp” into your browser search. See anything you recognize or find interesting? Human discourse is awash with three-letter acronyms. Wikipedia contains nine articles about phenomena where NLP works as an acronym (in the English version alone). Your search results may vary from mine and from others', depending on which filter bubble you’re in.

In my bubble, I get results about Natural Language Processing and, to a lesser extent, Neuro-linguistic Programming. I think this Other NLP is slightly allegorical: neuro-linguistic programming is a less-than-five-star approach to psychology and therapy.

In the 1970s neuro-linguistic programming enjoyed widespread popularity: in therapeutic practice, within university courses and as part of human resource development programs in large corporations. The central tenet of neuro-linguistic programming is “The Map is not the Territory”, a dictum dating back to a much earlier semantics movement in the 1930s under Alfred Korzybski. The territory refers to the real world and the real conditions for our existence that it provides. The map is our perception of that real world. The map is an imperfect reflection: a distorted version of the real world.

We can address this distorted perception, according to the Other NLP, by consciously adopting a new language. Using the words and phrases that are used by well people (or those with a less distorted, more positive real-world appreciation) has an internal (neuro) effect: our attitudes change, our outlook improves, we become more positive. Of course, this also includes identifying and managing damaging or negative words and phrases, which become the ‘stop words’ of the Other NLP.

Neuro-linguistic programming is no longer mainstream because its results failed to replicate. I’ll offer my own (non-expert) theory on why neuro-linguistic programming failed: you can’t bullshit a bullshitter (spellcheck guesses I mean ‘bulls hitter’ btw). Put less crudely, you can’t win an argument with yourself by following a script that has no basis in reality, or that is not somehow anchored in a knowledge base or structured set of values as a starting point. Whether this knowledge base is a moral compass, ideology or world view, something is needed to graft a new language onto and to give it context and structure. Borrowing and regurgitating an idealized script is like the dictionary approach: it’s less powerful, ultimately unmanageable and will probably give you bad dreams.