Last week we tackled the first step of addressing the challenges of unstructured text.
To develop this thought further, there are three areas to explore: first, an understanding of artificial intelligence (AI); second, an understanding of machine learning (ML); and third, an understanding of taxonomy in relation to data analysis. In today’s article we analyse these areas in order to identify how to distil wisdom from your customer and employee text data.
As discussed previously, the very first step in attaining wisdom from your data is understanding that you have to analyse words (qualitative data) rather than numbers (quantitative data). To do this, you first need AI and Natural Language Processing (NLP) capability. In other words, you need a text analytics tool.
Once you have the right technical capability, we come to our first problem: the machine does not understand the meaning of words and phrases in different contexts. For instance, the machine is unable to determine the context of the word ‘cross’. It does not know the difference between ‘I am cross’ and ‘I will cross the road’. The word ‘cross’ can be used in more than six contexts, and it is the human who has to teach the machine these contexts. This has been a problem for as long as AI and NLP have existed.
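To make the ambiguity concrete, here is a deliberately toy Python sketch of why context matters: it guesses which sense of ‘cross’ a sentence uses by counting surrounding cue words. The sense labels and cue lists are purely illustrative assumptions, not how any production NLP system works — real systems need far richer (human-taught) training signal.

```python
# Toy word-sense tagger for the word "cross": a minimal sketch showing why
# context matters. The sense names and cue words below are illustrative only.
SENSE_CUES = {
    "angry":    {"am", "was", "feel", "felt", "very"},
    "traverse": {"road", "street", "bridge", "will"},
    "object":   {"pin", "wear", "gold", "silver"},
}

def guess_sense(sentence: str) -> str:
    """Guess which sense of 'cross' a sentence uses by counting cue words."""
    words = set(sentence.lower().replace(".", "").split())
    scores = {sense: len(words & cues) for sense, cues in SENSE_CUES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"

print(guess_sense("I am cross"))             # angry
print(guess_sense("I will cross the road"))  # traverse
```

Even this toy version makes the point: strip away the cue words and the machine has nothing to go on — which is exactly why humans have to supply the contexts.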
Thus, while AI has a formidable reputation for what it can do with quantitative data (the numbers), few people realise that its capability with unstructured text is poor. The real issue is that if you are using AI and NLP alone, you need a large volume of gold-standard training data to train the machine properly and without bias. But obtaining such a volume of gold-standard training data is next to impossible.
The Real Challenge
Kai-Fu Lee, China’s leading AI expert, states that ‘AI cannot deal with unknown and unstructured spaces, especially ones that it hasn’t observed.’ Google’s AI blog, in a post by Jacob Devlin and Ming-Wei Chang, further argues that one of the biggest challenges in natural language processing (NLP) is the shortage of training data. Because NLP is a diversified field with many distinct tasks, most task-specific datasets contain only a few thousand, or at most a few hundred thousand, human-labelled training examples.
Many organisations struggle with this issue. Take Facebook, for instance. This vast organisation employs 7,500 people who manually read and categorise comments in order to train the machine. In July 2018, it decided to hire an additional 5,000 employees to help train the machine and classify comments on Facebook.
Fortune magazine has reported that Alphabet’s YouTube division is also ramping up its human workforce too, with plans to hire more than 10,000 people this year.
Since July, Google has released a pre-training technique called BERT (Bidirectional Encoder Representations from Transformers). BERT helps a model understand context by pre-training on large amounts of unlabelled text. Pre-training, however, does not solve the problem; it merely reduces the amount of task-specific training needed to understand how words work in different contexts.
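The core idea behind BERT’s pre-training can be sketched in a few lines: hide a fraction of the words in a sentence and ask the model to predict them from the surrounding context. Below is a minimal Python sketch of just the masking step — the 15% mask rate follows the BERT paper, but the function itself is an illustration, not Google’s implementation (which also involves random replacement and keeping some tokens unchanged).

```python
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """Randomly hide a fraction of tokens, as in BERT-style masked
    language-model pre-training: the model must then predict the hidden
    words from their surrounding context."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            masked.append(mask_token)
            targets[i] = tok   # the word the model should recover
        else:
            masked.append(tok)
    return masked, targets

tokens = "i will cross the road to buy a hot cross bun".split()
masked, targets = mask_tokens(tokens)
print(masked)   # some tokens replaced by [MASK]
print(targets)  # positions and the words hidden there
```

Because no human labels are needed — any text can be masked — this is how BERT sidesteps part of the training-data shortage, though only for the pre-training stage.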
But no matter how quickly you can pre-train, you still have to teach the machine the different contexts in the first place. There is no way for the machine to know that you might be cross while you cross the street to eat a hot cross bun as you pin your new cross onto your crosshatch CrossFit jacket… I highly doubt this sentence will ever be used in the real world, but it is easy to see why a machine can, and will, get confused and produce inaccurate data. The correct information has to be taught, and that teaching is extremely complex.
This brings us to the critical question of who creates the training data in the first place. If you are using a provider, you need to do your due diligence on how the machine is trained, and on how the people employed to collect and analyse your data use that technology.
So, join us next week as we discuss the very real problems in training data, ML and taxonomy, why due diligence is crucial, and how your data analysis may be failing you!