Do Voice Assistants Really Understand a Word You Say?

Voice assistants are getting smarter, but do they really understand you? The answer lies in how they work.

One of many questions raised by artificial intelligence is: Does your voice assistant really understand you when you speak to it?

How does it follow your commands about traffic and weather, and how does it play your favorite song when you tell it to play from your playlist?

What about those few situations where your voice assistant—whether it’s Alexa, Siri or Google—may not do the thing you want it to do?

Ask it about the German writer Goethe and see what it comes up with… you may not get what you’re looking for. Depending on your pronunciation of the name, your voice assistant might just return with the standard “I’m sorry, I don’t understand” response.

And this is just the tip of the iceberg.

So let’s look at how your voice assistant works, the problems it faces and whether it really understands you.

How Does It Work?

Let’s begin with understanding how typical voice assistants work.

Usually, the process is simple. It goes through the following steps:

The assistant listens for its voice activation command, e.g. “Hey Google!”
Once activated, it listens to the words you say.
Once it has identified the words, it tries to find the predefined command that best matches them.
Once it finds the command, it executes it.
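
To make those four steps concrete, here is a minimal sketch in Python. It is not how Alexa, Siri or Google are actually built. It assumes the speech has already been turned into text, and every name in it (WAKE_WORDS, COMMANDS, handle_utterance and the two actions) is made up for illustration.

```python
import re
from typing import Callable, Dict, Optional

# Hypothetical wake phrase and fallback reply, chosen for illustration only.
WAKE_WORDS = ["hey", "google"]
FALLBACK = "I'm sorry, I don't understand"


def play_playlist() -> str:
    return "Playing your playlist"


def report_weather() -> str:
    return "Here is today's weather"


# Predefined commands: one trigger word mapped to one action (an assumption;
# real assistants use far richer matching than a single keyword).
COMMANDS: Dict[str, Callable[[], str]] = {
    "play": play_playlist,
    "weather": report_weather,
}


def handle_utterance(transcript: str) -> Optional[str]:
    """Return a reply, or None if the assistant was not addressed."""
    words = re.findall(r"[a-z]+", transcript.lower())
    # Step 1: stay silent unless the utterance starts with the wake phrase.
    if words[:len(WAKE_WORDS)] != WAKE_WORDS:
        return None
    # Steps 2 and 3: scan the remaining words for a predefined command.
    for word in words[len(WAKE_WORDS):]:
        action = COMMANDS.get(word)
        if action is not None:
            # Step 4: execute the matched command.
            return action()
    # No predefined command fits what was said.
    return FALLBACK
```

Calling handle_utterance("Hey Google! Play my playlist") runs the playlist action, while "Hey Google, tell me about Goethe" falls through to the fallback reply because no predefined command matches.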

Perhaps this is why talking to it feels more like giving voice commands than having a conversation. That impression is partly right: your voice assistant is an advanced voice-activated command prompt.

The illusion of understanding, which is astonishingly believable, is created by letting you believe that whatever you say will get a pertinent response.

Problems

The illusion starts to break as soon as you notice that the assistant has to be activated with a predefined voice command.

Activation

For an intelligent being to know that it is being addressed, you don’t always have to call it by a predefined command.

Would Tom not respond if you called him by saying “Tom!” instead of “Hey Tom”? He would not need a specific phrase to respond. His name or any vocative would be enough.

Word Recognition

The next crack in the glass is word recognition.

Any intelligent being is trained to identify and understand the lexicon they will be using in life for communication.

This is why children know the language of their parents, teachers and broader society. It’s the same with voice assistants. They only understand the words they were trained to recognize, spoken in the accents they were trained on. Say the same word in a different accent and it may simply not register.

Voice assistants were first trained to understand neutral English accents and are only now learning other accents, native or otherwise. This is why a person with a non-native accent may have difficulty communicating with a voice assistant.

Remember the viral video of the Italian grandmother attempting to have a conversation with Google Home?

The problem becomes more obvious when you give it a command made up of multiple languages. Now it has to recognize not only the English words but also the non-English portion, along with the accent in which each is spoken.

Context

If you are one of the 43 percent of people on earth who speak more than one language, then you know a few words which sound the same but have a different meaning.

Recognizing them is a bit of a problem, and it comes down to the context of the discussion to understand which words were said in which language.

This is where the voice assistants fail to make an impression.

They don’t really understand the context of the communication. This is why it is hard for them to work out which language a word belongs to. It is also why a voice assistant struggles to play a song when you say “Play it” after mentioning the song a few commands earlier.

It fails to make a contextual connection.
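
A toy sketch in Python shows why “Play it” is hard without memory of the conversation. This is an illustration under stated assumptions, not any vendor’s design: the Conversation class, its remember_context flag and its methods are all hypothetical.

```python
from typing import Optional

FALLBACK = "I'm sorry, I don't understand"


class Conversation:
    """Hypothetical dialogue state; remember_context toggles the memory."""

    def __init__(self, remember_context: bool) -> None:
        self.remember_context = remember_context
        self.last_song: Optional[str] = None

    def ask_about(self, song: str) -> str:
        # Only a context-aware assistant notes what was just discussed.
        if self.remember_context:
            self.last_song = song
        return f"Here is what I found about {song}"

    def play(self, target: str) -> str:
        # "it" carries no meaning on its own; it must be resolved
        # against something mentioned earlier in the conversation.
        if target.strip().lower() == "it":
            if self.last_song is None:
                return FALLBACK
            return f"Playing {self.last_song}"
        return f"Playing {target}"
```

With remember_context set to False, asking about a song and then calling play("it") yields the fallback reply. With it set to True, the same two calls play the song that was just discussed.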

Is the Problem That Simple?

So what exactly is wrong here with the AI?

It was supposed to be quite smart, savvy even, if you use the term loosely. This is the same AI whose representative, DeepMind’s AlphaGo, beat a human professional Go player a few years back. The game is based on intelligence and intuition.

How did it actually do that?

Let’s go back in time here a bit and analyze… it made a move, move 37, which baffled its human opponent and had seemingly come out of nowhere.

Even the AI’s developers got confused and thought maybe the system was malfunctioning. However, later studies showed that the move was pivotal in winning the game.

If you think about how it came to make the move and the way we communicate, then you will understand the connection.

DeepMind’s AlphaGo honed its game by playing against itself. It did so millions of times and came up with its own strategies.

What looked like intuition was really the grit of experience. AlphaGo was more experienced and had played out more scenarios than its human opponents.

So you might argue that a voice assistant, having talked to a great many people, should have gained enough experience to do better.

But here is the difference: a game has a target.

It doesn’t matter how many paths you take, they all lead to one destination. The target is to win the game.

Communication, on the other hand, is something else. We use broken words and sentences. Sometimes we aren’t making any particular sense and, even then, humans understand it.

We use points of reference and build an argument from them. Ask someone about that piece of trash you just saw driving by and they will tell you that no, it was just a pink Pontiac Aztek.

A voice assistant will simply fail to understand.

We do it with our knowledge of the world we live in and the context of the discussion. The AI doesn’t have access to that kind of knowledge and thus fails to grasp context fully.

The AI follows a set of rules, while humans work on instinctive intelligence that doesn’t follow hard and fast rules. That intelligence is built up by the experience we gain living in this physical world: touching, feeling and understanding the relationships between objects.

The AI doesn’t have all that, and that is its huge drawback, for now.

Is It Really That Limited?

Voice assistants have come a long way.

They could only understand the neutral English accent at first.

But now both are improving: Google’s voice assistant through training, and Alexa through deep learning on the large data set it has.

Now they understand some variation in accents and can make a little bit of contextual sense. Ask Google’s Voice assistant about a song, and after it responds, ask it to play it by simply saying “Okay Google! Play it,” and it will be played.

Google is already performing better than the others when it comes to understanding isolated words in different accents (excluding more pronounced accents).

Amazon, with more than 60 percent market share of all smart speakers in use (and Amazon Echo accounting for 23 percent of all smart speakers), has understood this potential from the start. Alexa learns through deep learning techniques and keeps on evolving. It is now flexible enough to understand different forms of the same directive.
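
What “different forms of the same directive” means can be sketched in Python. Real assistants learn these variations from data. The hand-written patterns below are only a simplified stand-in for that learning, and the intent name, the patterns and detect_intent are all hypothetical.

```python
import re
from typing import Dict, List, Optional

# Several phrasings mapped to a single intent. In a real assistant this
# mapping is learned from data; these patterns are written by hand.
INTENT_PATTERNS: Dict[str, List[str]] = {
    "lights_on": [
        r"\bturn (on the lights|the lights on)\b",
        r"\bswitch (on the lights|the lights on)\b",
        r"\blights on\b",
    ],
}


def detect_intent(utterance: str) -> Optional[str]:
    """Return the first intent whose patterns match, or None."""
    text = utterance.lower()
    for intent, patterns in INTENT_PATTERNS.items():
        if any(re.search(pattern, text) for pattern in patterns):
            return intent
    return None
```

Here “Turn on the lights”, “Please switch the lights on” and “Lights on!” all resolve to the same intent, while “Dim the lights” matches nothing.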

The way we see it, this is not limiting at all.

Will It Be the Same in the Future?

Major factors that distinguish humans from other species are the ability to communicate and the understanding of the concept of self and of time.

Machines are now advanced in communication, and we expect them to master it.

The problem today is not that the voice assistants can’t understand the words we say. It is more than just that.

It has two aspects:

1. The Emotional Side

The target market for voice assistants may be the whole world one day, but it is not right now. This is why people with thick accents feel left behind, creating an atmosphere of unplanned bias. Especially in a culture as diverse as the United States, this becomes a problem where one person can perfectly communicate with the device and the other cannot.

Considering the speed at which Google and Amazon are working to resolve this problem, it may be overcome sooner than expected. They do, however, need to widen their data sets so that their assistants work for more people.

2. The Intellectual Side

The other problem, which is more academic in nature, may require a different approach.

What voice assistants need now is the ability to understand multiple languages in the same sentence and to hold a conversation. In that context, the AI powering today’s voice assistants seems, to some, like a fancy version of the command prompt, activated through voice commands.

Promising Future

For humans, understanding voice and words is in itself a sign of intelligence. Intelligence is, in a basic sense, the ability to solve problems. Recognizing words is part of the problem.

It is a start.

Holding a conversation, however, takes some imagination and the ability to make contextual connections. A saying often attributed to Einstein holds that imagination is the true sign of intelligence.

In that sense, lacking imagination and context, your voice assistant does understand the words you say, but it doesn’t really understand you.

Given the speed with which they are evolving, we hope that begins to change in the near future.