Last week we tackled the first step of addressing the challenges of unstructured text.
There are three areas that need to be explored further.
1. An understanding of Artificial Intelligence (AI).
2. An understanding of Machine Learning (ML).
3. An understanding of taxonomy in relation to data analysis.
The very first step in attaining wisdom from your data is understanding that you have to analyse words (qualitative data) rather than numbers (quantitative data). To do this, you need AI and NLP (Natural Language Processing) capability. In other words, you need a text analytics tool.
Once you have the right technical capability, we come to our first problem. The machine does not understand the meaning of words and phrases in various contexts. For instance, it is unable to determine the context of the word ‘cross’. It does not know the difference between ‘I am cross’ and ‘I will cross the road’. The word ‘cross’ can be used in more than six different contexts. It is the human who has to teach the machine these contexts, a problem as old as AI and NLP themselves.
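To make the ‘cross’ problem concrete, here is a minimal Python sketch. The cue table and the function name guess_sense are hypothetical illustrations, not the workings of any real text analytics tool. They only show that a person has to supply the contexts before the machine can tell them apart.

```python
import re

# Hypothetical, hand-written context cues: a person has to supply these.
# Each sense of "cross" is paired with words that tend to appear near it.
SENSE_CUES = {
    "angry": {"am", "is", "was", "feel", "very", "with"},
    "move across": {"will", "road", "street", "bridge", "over"},
}

def guess_sense(sentence):
    """Return the sense of 'cross' whose cue words overlap most with the sentence."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    if "cross" not in words:
        return None
    # Score each sense by how many of its cue words appear in the sentence.
    scores = {sense: len(words & cues) for sense, cues in SENSE_CUES.items()}
    best = max(scores, key=scores.get)
    # With no matching cues the machine has nothing to go on.
    return best if scores[best] > 0 else "unknown"

print(guess_sense("I am cross"))             # angry
print(guess_sense("I will cross the road"))  # move across
print(guess_sense("Hot cross buns"))         # unknown
```

The third sentence comes back as unknown because nobody has told the machine about that context yet.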
While AI already has a formidable reputation as an amazing capability when used on quantitative data (the numbers), few people realise that its capability with unstructured text is known to be poor. The real issue is that if you rely purely on AI and NLP, you need a large volume of gold-standard training data in order to train the machine without bias. But obtaining such a high volume of gold-standard training data is next to impossible.
The Real Challenge
Kai-Fu Lee, China’s top expert in AI, states that ‘AI cannot deal with unknown and unstructured spaces, especially ones that it hasn’t observed.’ Jacob Devlin and Ming-Wei Chang further argue that one of the biggest challenges in natural language processing (NLP) is the shortage of training data. Because NLP is a diversified field with many distinct tasks, most task-specific datasets contain only a few thousand or a few hundred thousand human-labelled training examples.
Many organisations suffer from this issue. Take Facebook, for instance. This vast organisation includes 7,500 people who manually read and categorise comments in order to train the machine. Last July (2018), it took on an additional 5,000 employees to assist with training the machine and to classify comments on Facebook.
Fortune magazine has stated that Alphabet’s YouTube division is also ramping up its human workforce, with plans to hire more than 10,000 people this year.
Since July 2018, Google has come up with a pre-training method named BERT (Bidirectional Encoder Representations from Transformers). BERT helps the machine understand context by pre-training it on large amounts of text before it is trained for a specific task. Pre-training, however, does not solve the problem. It merely tries to reduce the amount of training needed in order to understand how words work in different contexts.
No matter how quickly you are able to pre-train, you still have to teach the machine different contexts in the first place. There is no way for the machine to know that you might be cross while you cross the street to eat a hot cross bun. It is easy to understand why a machine may get confused and provide inaccurate data. The correct information has to be taught, and that teaching is extremely complex.
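As a rough illustration of why the contexts have to be taught first, the Python sketch below learns from a handful of hand-labelled sentences. The example sentences, the labels and the train and predict functions are all assumptions made up for this post, and the counting approach is a deliberately simple stand-in, not how BERT or any commercial tool works.

```python
import re
from collections import Counter, defaultdict

# Hypothetical hand-labelled examples: (sentence, sense of "cross").
LABELLED = [
    ("I am cross with you", "angry"),
    ("She was cross all day", "angry"),
    ("I will cross the road", "move across"),
    ("We cross the street here", "move across"),
]

def tokens(text):
    """Lower-case the text and split it into words."""
    return re.findall(r"[a-z]+", text.lower())

def train(examples):
    """Count which words appear alongside 'cross' for each labelled sense."""
    model = defaultdict(Counter)
    for sentence, sense in examples:
        model[sense].update(w for w in tokens(sentence) if w != "cross")
    return model

def predict(model, sentence):
    """Pick the sense whose counted neighbours best match the sentence."""
    words = [w for w in tokens(sentence) if w != "cross"]
    scores = {sense: sum(counts[w] for w in words) for sense, counts in model.items()}
    best = max(scores, key=scores.get)
    # A score of zero means no labelled example resembled this sentence.
    return best if scores[best] > 0 else "unknown"

model = train(LABELLED)
print(predict(model, "He was cross"))      # angry
print(predict(model, "cross the street"))  # move across
print(predict(model, "a hot cross bun"))   # unknown
```

Nobody labelled a bakery example, so the machine has no way to recognise that context. Each new context means more labelled sentences, which is exactly where the human effort goes.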
This brings us to the very critical question of who prepares the training data in the first place. If you are a provider, you need to do your due diligence on how the machine is trained, as well as on how the people you employ to collect and analyse your data use this technology.
Join us next week as we discuss the very real problems in training data, ML and taxonomy, and how your data analysis may be failing you!