The Cruciality of Due Diligence and How to Train the Machine

Eleanor Barlow

So, last week (take a look here) we learnt that Facebook has over twenty thousand employees, all educated to degree level, who sit and annotate the comments that Facebook collects. This information then becomes the training data for all the Artificial Intelligence (AI), Natural Language Processing (NLP) and Machine Learning (ML) that Facebook employs. There are, however, a multitude of issues with this process.

You see, two people can read the same comment and come away with very different interpretations, because comments are highly subjective. Many have tried to solve this issue, and a lot of work has gone into finding a solution. But very few companies can get higher than 85% accuracy, because nobody can agree entirely when analysing comments. So, not only is it difficult for people to decipher and agree upon what a comment signifies, it is also exceedingly difficult for the machine to learn from the human how to solve the same conundrum and remain consistent.
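
The disagreement between two readers described above can be put into numbers. What follows is a minimal sketch in Python, not a description of how Facebook or Pansensic measures it: the labels and the ten annotated comments are made up for illustration, and the function names are hypothetical. It computes raw agreement (the share of comments both annotators labelled identically) and Cohen's kappa, a standard statistic that discounts the agreement you would expect by chance alone.

```python
from collections import Counter


def percent_agreement(labels_a, labels_b):
    """Share of comments that both annotators labelled identically."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def cohens_kappa(labels_a, labels_b):
    """Agreement between two annotators, corrected for chance agreement."""
    n = len(labels_a)
    observed = percent_agreement(labels_a, labels_b)
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Chance agreement: probability that both pick the same label
    # if each simply followed their own label frequencies.
    expected = sum(
        counts_a[label] * counts_b[label]
        for label in set(labels_a) | set(labels_b)
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical annotations of the same ten comments by two people.
annotator_1 = ["positive", "negative", "negative", "neutral", "positive",
               "negative", "neutral", "positive", "negative", "positive"]
annotator_2 = ["positive", "negative", "neutral", "neutral", "positive",
               "positive", "neutral", "positive", "negative", "negative"]

print(f"raw agreement: {percent_agreement(annotator_1, annotator_2):.2f}")
print(f"Cohen's kappa: {cohens_kappa(annotator_1, annotator_2):.2f}")
```

On this invented sample the two annotators agree on 7 of the 10 comments, yet kappa comes out at roughly 0.55, because some of that agreement would have happened by chance. If the humans producing the training data only agree this far, a machine trained on their labels inherits the same ceiling.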

But, before the machine can even attempt to use the training data provided, we have to ask who decides what to look for in a specific data set. This is what we, at Pansensic, refer to as the taxonomy. Typically, when presented with data, people look for themes in said data set. But someone has to decide what themes to look for in the first place. Say, for instance, you are analysing comments on employee experience. This data is bound to have a very different set of themes from data on patient or customer experience. And, again, the themes in customer experience for an airline will differ vastly from customer themes within a mobile phone company, food and drink company or sports shop. And, once you get to this level, you then have all the sub-themes to decipher.


Whoever creates these taxonomies ought to have a holistic understanding of the subject or domain they are analysing. Pansensic believe that the most important thing is that the actual data dictates the themes. We find that most preconceived models, and even academic models, more often than not miss themes that are present in the data. That is why the data itself should dictate the themes you are looking for.

To create a taxonomy, the chief taxonomist needs to read large volumes of comments. Without doing so they will not be able to identify the themes. They also have to be humble enough to accept that they have to read all the comments and themes whilst leaving to one side their own subjectivity.

The most difficult aspect of taxonomy is knowing how many themes you want in the first place, and how big said themes are. If you have too few themes, you get a high-level metric that you can do very little with. Yet, if you have too many themes, you can be swamped and, therefore, lose sight of where the priority is.

In the end, what the majority of taxonomists are trying to do is identify actionable insights. It is, however, exceedingly important that the taxonomist identifies all actionable insights, and then prioritises them. Identifying all the actionable insights, and not just the ones you want, is a tricky task. The danger, and what a lot of people get caught out by, is that the insight that is easiest to identify is not necessarily the insight that should be worked on.

It’s all about doing the right thing, not just about doing something. People often get distracted by what they find interesting, but in order to use the data correctly you can’t simply pick out what you find most interesting and focus on that. You have to do what is most important. Sometimes people will focus on the most urgent element, but this, again, is not necessarily the most important element.

Not only does the quantity of themes matter, but so does their structure. Say, for instance, you end up with 70 themes for a particular domain. A flat structure of this size does not work very well, as you do not get a clustering effect or even a priority effect. We then have to question whether the structure could be tiered. If so, the taxonomist has to contend with knowledge management principles, and with them the need to identify which of the themes are parent themes, which are child themes, and what the dependencies are between one tier and another. So, all of a sudden you have gone from someone who is reading a Facebook comment to someone who has to be able to understand the principles of knowledge management… and understand them well. This is no easy task.
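
To make the difference between a flat and a tiered structure concrete, here is a small sketch in Python. It is an illustration only and not Pansensic's implementation: the `Theme` class, the airline theme names and the comment counts are all hypothetical. Each parent theme holds its child themes, and the comment counts roll up from child to parent, which is what produces the clustering and priority effects a flat list lacks.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Theme:
    """One node in a tiered taxonomy (hypothetical structure)."""
    name: str
    comment_count: int = 0  # comments tagged directly with this theme
    children: list[Theme] = field(default_factory=list)

    def total(self) -> int:
        """Comments on this theme plus everything beneath it."""
        return self.comment_count + sum(c.total() for c in self.children)


def ranked(themes: list[Theme]) -> list[Theme]:
    """Order themes of one tier by rolled-up comment volume, largest first."""
    return sorted(themes, key=lambda t: t.total(), reverse=True)


# Made-up airline example: three parent themes with child themes.
taxonomy = [
    Theme("Booking", children=[
        Theme("Website usability", 40),
        Theme("Payment problems", 15),
    ]),
    Theme("In-flight experience", children=[
        Theme("Seat comfort", 60),
        Theme("Food and drink", 25),
        Theme("Cabin crew", 30),
    ]),
    Theme("Baggage", 5, children=[
        Theme("Lost luggage", 20),
    ]),
]

for parent in ranked(taxonomy):
    print(f"{parent.name}: {parent.total()}")
    for child in ranked(parent.children):
        print(f"  {child.name}: {child.total()}")
```

Note that comment volume is used here only as a simple stand-in for priority. As argued above, the biggest or most visible theme is not automatically the most important one, so a real ranking would need the taxonomist's judgement as well.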


So, when you are looking for a provider to help you understand your unstructured text, you need to go through the due diligence of questioning how many comments a statement is based on, how many comments the taxonomists are reading themselves, and what the capabilities are of the people providing the taxonomy. If the answers to these questions are vague, there is a real risk that:

A) The comments and data analysis may not be very accurate or sensitive.

B) The full potential of your data may be lost.

C) You may action the wrong thing.

Due diligence with an emerging technology, like text analytics, is crucial to a quality purchase, one you can be confident will give you the best ROI: Return on Information.

Contact Pansensic for a chat, or for a demo, and compare us like for like with other providers.
