Alexa is always listening but not continually recording. It doesn’t send anything to cloud servers until it hears you say the wake word (Alexa, Echo, or Computer). But listening for wake words is harder than you might think.
Echo hardware isn’t all that intelligent. Without the internet, any request or question you ask will fail, because your commands are sent off to the cloud for interpretation and decisions. Amazon doesn’t want to record every conversation you have in front of a smart speaker, only the commands you give it. For this reason, the company employs a wake word to get the smart speaker’s attention. To accomplish this, Amazon uses a combination of fine-tuned microphones, a short memory buffer, and neural net training.
Fine-Tuned Microphones Pinpoint Your Voice
Voice assistant speakers, like Echo and Echo Dot, typically have multiple built-in microphones. The Echo Dot, for instance, has seven. That array gives the devices several abilities, from hearing commands spoken far away to separating background noise from voices.
The latter is especially helpful for wake word detection. Using its multiple microphones, the Echo can pinpoint your location relative to where it’s sitting and listen in that direction while ignoring the rest of the room.
You see this in action whenever you use the wake word. Stand to the side of an Echo or Echo Dot and say the wake word. Notice the ring lights up in dark blue, and then a lighter blue as it circles and “points” toward you. Now, move several steps to the side and say the wake word once again. Notice the light-blue lights follow you.
Knowing where you are helps the device focus on you and tune out noises coming from elsewhere.
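To get a feel for how several microphones can locate a voice, here is a minimal sketch of one textbook approach: measuring how much later a sound reaches one microphone than another, then converting that delay into an angle. Amazon hasn't published the method the Echo uses, so treat this as an illustration only. The function names, the two-microphone setup, the 16 kHz sample rate, and the 7 cm spacing are all assumptions made for the example.

```python
import math

SPEED_OF_SOUND = 343.0   # metres per second in room-temperature air


def best_lag(reference, delayed, max_lag):
    """Return the lag (in samples) at which `delayed` best lines up with `reference`."""
    def score(lag):
        # Sum of products of the overlapping samples for this shift.
        total = 0.0
        for i, value in enumerate(reference):
            j = i + lag
            if 0 <= j < len(delayed):
                total += value * delayed[j]
        return total

    return max(range(-max_lag, max_lag + 1), key=score)


def arrival_angle_degrees(lag, sample_rate, mic_spacing):
    """Convert a sample lag into an angle away from straight ahead of the mic pair."""
    path_difference = lag / sample_rate * SPEED_OF_SOUND
    # Clamp so measurement noise can never push asin() out of its domain.
    ratio = max(-1.0, min(1.0, path_difference / mic_spacing))
    return math.degrees(math.asin(ratio))


# The same short burst of sound, arriving two samples later at the second mic.
mic_a = [0, 0, 1, 3, 1, 0, 0, 0, 0, 0]
mic_b = [0, 0, 0, 0, 1, 3, 1, 0, 0, 0]

lag = best_lag(mic_a, mic_b, max_lag=4)
angle = arrival_angle_degrees(lag, sample_rate=16_000, mic_spacing=0.07)
print(lag, round(angle, 1))
```

With more than two microphones, the same comparison can be repeated across pairs to narrow down a direction in the full circle around the speaker.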
Short Memory Keeps the Speaker from Holding Too Much
Echo devices have plenty of storage, but they don’t use much of it. According to Rohit Prasad, the Vice President at Amazon and Head Scientist of Alexa Artificial Intelligence, an Echo can only physically store a few seconds of audio.
By reducing its capability, Amazon not only gives you more privacy (it’s one less place your voice is stored) but also prevents Echo from listening to entire conversations, limiting its focus to finding the wake word.
Imagine you had a three-second cassette and a tape recorder. Suppose that when the tape reached the end, it looped back around to the beginning, over and over. If you started recording a conversation, anything you said more than three seconds ago would be wiped and immediately recorded over. That’s what an Amazon Echo does.
It records continuously but wipes everything it just recorded at the same time. That short attention span means all it can hear is the word “Alexa” and not much more. Three seconds, though, is long enough for that word to be recorded, examined, and acted upon appropriately.
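In software terms, the looping tape is a ring buffer: a fixed-size store where each new sample pushes out the oldest one. The sketch below shows the idea using Python's standard library. It is not Amazon's code; the class name, the 16 kHz default sample rate, and the three-second window are assumptions chosen to match the cassette analogy.

```python
from collections import deque

SAMPLE_RATE = 16_000   # samples per second (assumed for illustration)
BUFFER_SECONDS = 3     # matches the three-second tape in the analogy


class RollingAudioBuffer:
    """Keeps only the most recent few seconds of audio samples."""

    def __init__(self, seconds=BUFFER_SECONDS, sample_rate=SAMPLE_RATE):
        # A deque with maxlen silently drops the oldest item when it is full.
        self._samples = deque(maxlen=seconds * sample_rate)

    def write(self, chunk):
        """Append new samples; anything older than the window is lost."""
        self._samples.extend(chunk)

    def snapshot(self):
        """Return the current window, oldest sample first."""
        return list(self._samples)


# Tiny sample rate so the result is easy to read: 4 samples per "second".
buf = RollingAudioBuffer(seconds=3, sample_rate=4)
buf.write(range(20))      # five seconds' worth of numbered samples
print(buf.snapshot())     # only the final three seconds survive
```

Because the buffer can never grow past its fixed size, there is no earlier audio left to send anywhere once newer samples have replaced it.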
Neural Net Training Helps with Pattern Matching
Finally, Amazon depends on neural network training to teach the Echo how to pattern match. Much like other forms of machine learning, Amazon trains its algorithms by feeding them instance after instance of the word Alexa (or Computer, or Echo, depending on which wake word the company is training).
The idea is to cover not only every inflection and accent but also the context. Amazon wants your Echo to recognize the difference when you’re talking to it, when you’re talking about it, or, perhaps, when you’re talking to a person named Alexa. The directional mics also assist with that goal.
With every word the Echo hears, it runs audio through layers of algorithms. Each layer is designed to rule out false positives, looking for sound-alikes or context clues. If one layer’s check passes, the word goes to the next. Finally, when the local device decides it did hear the wake word, it begins to record and pass on the audio to Amazon’s cloud servers. Amazon employs four algorithms: one for each wake word (Alexa, Computer, Echo), and one for Alexa Guard, which treats specific sounds, such as glass shattering, like a wake word.
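The layered approach can be pictured as a cascade, where a candidate word has to clear every check in turn and is dropped at the first failure. The sketch below is only a rough model of that flow. The real layers are trained neural networks, whereas the three stages, their names, and their score thresholds here are hypothetical stand-ins.

```python
def run_cascade(candidate, stages):
    """Pass `candidate` through each stage in order; stop at the first rejection."""
    for name, check in stages:
        if not check(candidate):
            return False, name    # rejected, along with the layer that said no
    return True, None             # every layer agreed


# Hypothetical checks that read precomputed scores from a dict.
stages = [
    ("loud_enough", lambda c: c["energy"] > 0.1),
    ("sounds_like_wake_word", lambda c: c["acoustic_score"] > 0.8),
    ("addressed_to_device", lambda c: c["context_score"] > 0.5),
]

spoken_to_device = {"energy": 0.6, "acoustic_score": 0.95, "context_score": 0.9}
sound_alike = {"energy": 0.6, "acoustic_score": 0.4, "context_score": 0.9}

print(run_cascade(spoken_to_device, stages))
print(run_cascade(sound_alike, stages))
```

Putting the cheapest checks first means most ordinary speech is discarded early, so later and more expensive layers only run on promising candidates.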
But even when a match occurs, Amazon still runs more complicated checks. Have you noticed that when someone speaks the word Alexa on a TV show or commercial, it usually doesn’t elicit a response from your Echo? That’s because Amazon also does a cloud check.
Cloud Checks Rule Out Some False Positives
When companies make commercials that feature Alexa, they can submit the audio to Amazon. The company runs the audio through pattern-matching algorithms similar to those used to identify the wake word. Once that exact instance is fully cataloged, it’s added to a database.
As part of the process when reaching out to the cloud, your Echo includes information about the wake word it heard and checks that database. Whenever it finds a match, Amazon instructs your Echo to ignore the wake word, shut down, and discard any recorded audio.
Additionally, Amazon checks for instances of the wake word spoken simultaneously. Not every company submits audio to Amazon, so the company came up with a novel backup solution. After checking for a database match, the company compares the wake word imprint against any other instances coming in at the same time. It’s unlikely that two people who say Alexa simultaneously would sound exactly alike, so if there’s a match, Amazon knows it’s likely a commercial or TV show and ignores the request.
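Both cloud checks can be modeled in a few lines: look up the audio's fingerprint in a set of known commercials, then see whether the same fingerprint has arrived from other devices within a short time window. This is a simplified sketch rather than Amazon's implementation. The class name, the string fingerprints, the one-second window, and the device IDs are all assumptions, and real audio fingerprints would need approximate rather than exact matching.

```python
from collections import defaultdict


class CloudWakeWordFilter:
    """Decides whether a wake word report should be ignored."""

    def __init__(self, known_ads, window_seconds=1.0, max_simultaneous=1):
        self.known_ads = set(known_ads)     # fingerprints submitted by advertisers
        self.window = window_seconds
        self.max_simultaneous = max_simultaneous
        self.recent = defaultdict(list)     # fingerprint -> [(timestamp, device_id)]

    def should_ignore(self, fingerprint, device_id, timestamp):
        # Check 1: an exact match against the catalog of known commercials.
        if fingerprint in self.known_ads:
            return True

        # Check 2: the same fingerprint from several devices at about the same time.
        sightings = [
            (t, d) for t, d in self.recent[fingerprint]
            if timestamp - t <= self.window
        ]
        sightings.append((timestamp, device_id))
        self.recent[fingerprint] = sightings

        devices = {d for _, d in sightings}
        return len(devices) > self.max_simultaneous


cloud = CloudWakeWordFilter(known_ads={"ad-123"})
print(cloud.should_ignore("ad-123", "kitchen-echo", 0.0))     # known commercial
print(cloud.should_ignore("tv-show-7", "house-a-echo", 10.0)) # first report
print(cloud.should_ignore("tv-show-7", "house-b-echo", 10.2)) # same audio elsewhere
```

In this simplified version, the first device to report an uncataloged broadcast is not filtered, because nothing else has arrived yet to compare it against. A production system would presumably wait a moment before deciding.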
Despite all the checks, false positives do still occur. You can listen to what your Echo has recorded at Amazon’s privacy hub, and you’ll likely find at least one false positive in the bunch. But the technology is continually being improved and, eventually, Amazon would like it to function without a wake word at all.