The natural language understanding capabilities of smart devices are improving all the time. But how? Image: Maria Alberto, Pixabay

How Your Digital Personal Assistant Understands What You Want (And Gets it Done)

Katherine Munro
The Startup
Published in
10 min read · Nov 7, 2020


One of the most remarkable things I learned during my studies in Computational Linguistics was that we still don’t know how language is processed in the brain. We know that an average human has around 80,000 words in their vocabulary, and that somehow, when we speak, our brains are able to form ideas, crawl through that vocabulary to find the right words, string them together, and express them. When we listen, we must decode an incoming audio signal into words, and search our own mental lexicons in order to extract meaning from them. And all of this at breathtaking speed.

It’s easy to rationalise this wonder by saying ‘well, our brains are simply the most advanced computational structures in our known universe’. But what about ‘machines’? Have you ever wondered how the invisible ‘persona’ in your mobile phone or smart speaker manages to understand and accomplish tasks for you? Sure, they’re not as good as we are, but they have a fraction of the compute power and complexity we do.

I can’t explain how I wrote this post, or how you’re reading it. But I know how your phone gets things done for you, and I’m going to do my best to explain it. The understanding, dear reader, will be up to you.

Contents

  • Why is language so damn difficult?
  • What is a digital personal assistant?
  • What is Natural Language Processing? And NLU? And NLG?
  • How Does Natural Language Understanding Work?
  • Conclusion

Why is language so damn difficult?

If you’ve ever tried to use an automated chatbot, or the voice interface in your mobile phone or your car, chances are you’ve experienced moments of confusion, or even communication failure. Why? Well, these interactions are facilitated via natural language, and natural languages are just plain hard. That’s because they are:

  • infinitely creative — you can say the same thing in many, many different ways [1]
  • structured — if you perform a Google search for ‘brown gloves’, native speakers know (thanks to the grammatical structure of the utterance) that ‘orange gloves’ would be more relevant than ‘brown purse’, but that’s not so easy for a machine to understand
  • inferential — there’s also meaning in what isn’t said. If I search for ‘formal dress’, we know that I won’t accept results for ‘casual dress’ but I might for ‘formal gowns’. But how should a machine know that?
  • lexically and syntactically ambiguous — words and sentence structures can sometimes be interpreted in multiple ways, which means one search query could match a huge variety of results, only some of which match the user’s intent
  • context based — sometimes the only way to disambiguate a lexically ambiguous word is through the surrounding words; furthermore, the same word can mean different things to different people depending on the ‘context’ that is their life
  • negatable — this is a small but common pain point for natural language processing. For example, adding a ‘not’ to a sentence, or speaking sarcastically, changes the entire meaning
  • multimedia based — we don’t just communicate through text but also spoken messages, emojis, hashtags and so on.
This example of syntactic ambiguity could momentarily confuse a human; it could completely bamboozle a machine. Image: ViralNova

If natural languages are so hard, why build them into our technologies? Why not, for example, have fixed sets of phrases per device, which users need to learn in order to be able to request certain tasks?

Well, because people want to speak naturally. Remember in the past, when you would reformulate search queries in a way you thought the machine would understand? People don’t want to do that anymore. And by the way, they don’t just expect better language capabilities from the digital personal assistant in their phone or smart speaker — they expect the same from search engines, chatbots, and so on.

What is a digital personal assistant?

Despite having a voice, a ‘personality’, and a semblance of ‘self-awareness’, under the hood a digital personal assistant is simply a software application. Throughout this article I will anthropomorphise a little with phrases like ‘the assistant knows…’ or ‘Siri tries to…’, but this is a stylistic choice and should not be taken to imply that the thing you’re interacting with is capable of thinking or knowing anything at all! When I say ‘assistant’, I thus always mean ‘application’.

The digital personal assistant waits for a wake word like “hey Siri” to activate it. Then it performs natural language processing to turn the spoken input into a textual representation, and applies natural language understanding to try to “understand” the text. It then attempts to complete a task for you, by feeding the understood information to various APIs. An API — Application Programming Interface — is an interface between different software programs, such as between a digital personal assistant and a weather website. The API defines how the two programs can ‘communicate’, e.g. which requests can be made, what information each request requires and how it should be written. Once the API returns a result, natural language generation is usually used to convert the information into more user-friendly text or speech.
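
To make that flow concrete, here is a deliberately toy sketch of the loop in Python. Every function is a hypothetical stand-in for a full component (keyword spotter, ASR model, NLU model, API client, NLG module), hard-coded so the example runs end to end; none of it reflects any real assistant’s internal code.

```python
# Toy sketch of the assistant's request loop. All functions are hypothetical
# stand-ins for real components.

def detect_wake_word(audio: bytes) -> bool:
    # Stand-in for an always-on keyword spotter listening for e.g. "Hey Siri".
    return True

def speech_to_text(audio: bytes) -> str:
    # Stand-in for an Automatic Speech Recognition model.
    return "what's the weather in london tomorrow"

def understand(text: str) -> dict:
    # Stand-in for natural language understanding: domain, intent, slots.
    return {"domain": "weather", "intent": "get_forecast", "slots": {"city": "london"}}

def call_weather_api(city: str) -> dict:
    # Stand-in for a request to a weather service via its API.
    return {"city": city, "forecast": "light rain", "high_c": 14}

def generate_response(result: dict) -> str:
    # Stand-in for natural language generation: structured data back to text.
    return f"Expect {result['forecast']} in {result['city'].title()}, with a high of {result['high_c']}°C."

def handle_voice_request(audio: bytes):
    if not detect_wake_word(audio):
        return None
    text = speech_to_text(audio)                 # NLP / ASR
    frame = understand(text)                     # NLU
    result = call_weather_api(**frame["slots"])  # API call
    return generate_response(result)             # NLG

print(handle_voice_request(b"<audio bytes>"))
```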

What is Natural Language Processing? And NLU? And NLG?

Alright, let’s clarify some of the terms introduced above.

Natural Language Processing is the process of preparing human language data for different purposes. Those purposes might be research-based, like allowing linguists to analyse the data, or they might involve machine learning, in which case the processing steps are usually designed to make the text more appropriate for feeding to a machine learning algorithm.

NLP can involve steps like:

  • Automatic Speech Recognition: aka speech-to-text, which translates the incoming sound-waves into text. ASR models are built using machine learning, by showing a model thousands of pairs of input sound-waves and output sentences and letting it learn the correspondences.
  • Tokenizing: means splitting the input text into individual words, aka ‘tokens’. (Purpose for ML: most algorithms take their input as a series of individual tokens).
  • Stemming: involves stripping the endings from words to leave only the word stem. (Purpose for ML: to reduce computational load by reducing the size of the vocabulary that needs to be processed; to improve performance by ensuring all words are represented in a consistent way, thus also boosting the number of training examples which feature each stem).
  • Note that stemming may not always result in a grammatical word. For example, converting plural nouns to singular can be done by removing the suffix -s, but this won’t work for irregular English nouns. Thus we get: dogs → dog, but countries → countrie, and women → women. Similar problems arise in other languages, too. For example, in German many plural nouns can be converted to singular by removing -en or -er, but irregular nouns pose problems here, too. Thus we get Frauen → Frau (Women → Woman), which is correct, but Bücher → Büch (Books → Book, where the latter should actually be spelled Buch).
  • Lemmatizing: means converting each word to its standard form. Again an example could be reducing plural nouns to singular, but with lemmatizing, the result should also be a grammatical word. (Purpose for ML: as above).
  • Part-of-speech tagging: means assigning the grammatical roles, such as ‘noun’, ‘verb’, or ‘adjective’, to each word in the sentence. (Purpose for ML: parts-of-speech can be useful input features for various language tasks).
  • Named Entity Recognition: assigning labels like ‘person’, ‘place’, ‘organisation’, ‘date/time’ to relevant words in the sentence. (Purpose for ML: as above).

Interestingly, although part-of-speech tagging and named entity recognition are usually used in order to prepare features as input for machine learning models, they themselves are also usually achieved via machine learning: we feed an algorithm thousands of examples of already-labelled sequences, and it learns to recognise the patterns in our language which can be used to determine a word’s grammatical role, or whether it represents an entity.

Also note that NLP doesn’t have to include all of the above steps. In fact, modern neural language models rarely utilise features like part-of-speech tags. That’s because they’re powerful enough to take unannotated language input and learn for themselves which features in the input are useful for accomplishing their set task.
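
As a small illustration, here is what a few of those steps look like with two widely used open-source libraries: spaCy (tokenizing, lemmatizing, part-of-speech tagging, named entity recognition) and NLTK (stemming). This is only a sketch, and it assumes the small English spaCy model has been downloaded.

```python
# Requires: pip install spacy nltk
#           python -m spacy download en_core_web_sm
import spacy
from nltk.stem import PorterStemmer

nlp = spacy.load("en_core_web_sm")
doc = nlp("Book me a flight from San Francisco to London on Friday")

# Tokenizing, lemmatizing and part-of-speech tagging in one pass
for token in doc:
    print(token.text, token.lemma_, token.pos_)

# Named Entity Recognition: the cities should come out as places (GPE)
# and "Friday" as a DATE, though the exact labels depend on the model
for ent in doc.ents:
    print(ent.text, ent.label_)

# Stemming with a rule-based suffix stripper; note the non-grammatical stems
stemmer = PorterStemmer()
print([stemmer.stem(w) for w in ["dogs", "countries", "women"]])
# -> something like ['dog', 'countri', 'women']
```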

Once the NLP pipeline is complete and the user utterance has been processed, the following steps usually take place:

  • Natural Language Understanding: is about trying to extract valuable information from (processed) human utterances, in order to accomplish their requests. The field combines artificial intelligence, machine learning and linguistics to enable computers to “understand” and use human language. We’ll examine it in the next section.
  • Natural Language Generation: involves generating human-like text. This can be done using automated rules (for simple, restricted, repetitive contexts like generating weather reports from weather data; see the small template-based sketch after this list), or else using neural networks that were specifically trained to generate text.
  • Speech Synthesis: aka text-to-speech, is, of course, the process of generating synthetic voice audio from text. The models are trained just like ASR models, though of course with the input and output reversed: text goes in, audio comes out.
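
For the rule-based flavour of natural language generation, a minimal example is simply filling a sentence template from structured data. The weather fields below are invented for illustration.

```python
# Template-based NLG: structured data in, human-readable sentence out.
weather = {"city": "Zurich", "condition": "partly cloudy", "high_c": 21, "low_c": 12}

report = (
    f"Today in {weather['city']} it will be {weather['condition']}, "
    f"with a high of {weather['high_c']}°C and a low of {weather['low_c']}°C."
)
print(report)
```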

How Does Natural Language Understanding Work?

Before we discuss the ‘how’, let’s demonstrate NLU in action. What’s the first word which comes into your head when you read the following?

“Hey Siri, book me a — ”

Most people will answer something like “flight”, “holiday”, or “hotel room”. And so does Google:

Google auto-suggestions for an incomplete search show how Google’s language models have learned typical language patterns.

What’s going on here? Well, we all have a language model in our head, which was learned automatically by our child brains taking statistical measurements of which words occur in different life situations and syntactic and lexical contexts [2]. Search engines and other language technologies have language models as well, which they have acquired through machine learning.
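
You can capture the flavour of that statistical learning with a toy experiment: count which word follows a prefix like “book me a” in a (tiny, invented) corpus and predict the most frequent continuation. Real language models learn something far richer from vastly more text, but the underlying idea of learning from co-occurrence statistics is the same.

```python
from collections import Counter

# A tiny invented corpus; a real language model learns from vastly more text.
corpus = [
    "book me a flight to london",
    "book me a hotel room in paris",
    "book me a flight to tokyo",
    "book me a holiday for two",
]

# Count which word follows the prefix "book me a"
continuations = Counter(
    sentence.split()[3] for sentence in corpus if sentence.startswith("book me a")
)
print(continuations.most_common())  # [('flight', 2), ('hotel', 1), ('holiday', 1)]
```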

When you make a spoken request to your phone, the following process takes place:

Automatic Speech Recognition converts the incoming sound-waves into text. Depending on what the models in the next steps expect their input to look like, various other stages of the Natural Language Processing pipeline may be applied.

During Domain Detection, the digital personal assistant tries to classify the domain of assistance the user requires, such as flights, weather, or personal services (like getting a haircut).

Intent Detection is similarly about identifying what the user wants to do. For example, if we’re in the flight domain, do they want to book a flight or just get flight information?

Slot Filling. Once the assistant knows the domain and intent, it knows which APIs it will need to access in order to satisfy the user’s request, and what sort of information those APIs will require. Slot filling is about automatically extracting that information from the user’s utterance. The assistant does this by loading a set of expected slots, such as ‘departure city’ and ‘arrival city’. Then it tries to assign those slot labels to the words in your utterance (if it can’t identify all required slots, it might ask you some further questions).

Domain detection, intent detection, and slot filling are all accomplished via machine learning models. For each task, one must feed the learning algorithm textual representations of thousands or millions of user utterances, together with their target label(s): the intent, and/or domain, and/or slots. I say and/or because, while it is possible to build one model per task, such that only one kind of label would be required per input sequence, it is more common to train a model to do multiple tasks at once. This is known as multi-task learning.
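
To make intent detection less abstract, here is a deliberately tiny sketch of it as supervised text classification, using scikit-learn with TF-IDF features and logistic regression. The utterances and intent labels are invented; production assistants train far larger (typically neural, often multi-task) models on millions of examples.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented training utterances, each labelled with its intent
utterances = [
    "book me a flight to london",
    "i need a flight from berlin to rome",
    "what's the weather like tomorrow",
    "will it rain in sydney this weekend",
    "book me a haircut for friday",
    "find me a barber appointment next week",
]
intents = [
    "book_flight", "book_flight",
    "get_weather", "get_weather",
    "book_appointment", "book_appointment",
]

# TF-IDF features + logistic regression classifier in one pipeline
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(utterances, intents)

print(model.predict(["book a flight to san francisco"]))  # likely ['book_flight']
```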

By completing these three tasks, the assistant can build a semantic frame — a more structured representation of the input — which it can use to complete your request.

Let’s take an example.

“Hey Siri…,” you say to your phone. The assistant wakes and begins trying to classify the request domain, even as you are still speaking. “… book me a flight…,” you continue. The word ‘flight’ makes the domain clear, and since you already said ‘book’, the intent becomes clear too. Note how this shows that the steps in natural language understanding often don’t align with the order in which the relevant input words are uttered: often, later language can disambiguate or completely redirect earlier interpretations.

At this point, the assistant knows it will need to fill some slots like “departure city”, which it does as you continue with “… from San Francisco to London.” It can now query the semantic frame and pass the relevant information on to an API such as Kayak’s flight search.
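
Concretely, the semantic frame for that utterance might look like the dictionary below, with the slots mapped onto whatever parameters the flight-search API expects. The field names and the commented-out endpoint are hypothetical placeholders for illustration, not Kayak’s actual interface.

```python
# Illustrative semantic frame for "book me a flight from San Francisco to London"
semantic_frame = {
    "domain": "flights",
    "intent": "book_flight",
    "slots": {"departure_city": "San Francisco", "arrival_city": "London"},
}

# Map the slots onto the parameters a flight-search API might expect
params = {
    "origin": semantic_frame["slots"]["departure_city"],
    "destination": semantic_frame["slots"]["arrival_city"],
}

# In a real system the request would now be sent, e.g. (hypothetical endpoint):
# import requests
# results = requests.get("https://api.example-flights.com/search", params=params).json()
print(params)
```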

The API returns its results, and, finally, natural language generation is used to convert this back into human-understandable text, or synthesised speech.

Conclusion

So that’s it — how your digital personal assistant knows what you want, and gets it done for you. Simple? No. Remarkable? I think so.

  • [1] The variability of language is probably why Google gets five hundred million unique queries every day.
  • [2] The process of learning a second language is different from that of learning a first, particularly as one grows older, but the general point about having a language model in our heads still holds.

Thanks for reading! If this article helped you, please give it a little clap, so I know to produce more content like it. You can also follow me here on Medium, or on Twitter (where I post loads of interesting content on AI, tech, ethics, and more), or on LinkedIn (where I summarise the best of my Medium and Twitter feed). If you’d like me to speak at your event, please contact me via my socials or the contact form here.

Want to know more about Natural Language Processing? I wrote a whole chapter about it in The Handbook of Data Science and AI, available here.
