Grainger engineers voice localization techniques for smart speakers
Research team uses nearby wall reflections to improve devices like Amazon’s Alexa
Credit: Romit Roy Choudhury
Smart speakers – think Amazon Alexa or Google Home – offer a wide variety of capabilities that help free up both our time and our hands. We can hear the morning news while brushing our teeth, ask for a weather report while picking out a coat, and set a timer for the oven while handling two hot pans at once. According to Voicebot.ai, Alexa supports more than 100,000 skills worldwide, but one task it hasn’t mastered is determining where in the home the user is.
This localization task was the focus of a University of Illinois at Urbana-Champaign research team’s recently published paper, “Voice Localization Using Nearby Wall Reflections.” The work was accepted to the 26th Annual International Conference on Mobile Computing and Networking. In the paper, the team – led by Coordinated Science Lab graduate student Sheng Shen – explores the development of VoLoc, a system that uses the microphone array on Alexa, as well as room echoes of the human voice, to infer the user’s location inside the home.
Knowing a user’s location within a home could help a smart device better support currently available skills. For instance, after receiving commands like “turn on the light” or “increase the temperature,” Alexa currently has to guess which light or which room the command refers to. Using a technique known as reverse triangulation, Shen and advisor Romit Roy Choudhury are getting closer to voice localization.
“Applying this technique to smart speakers entails quite a few challenges,” shared Shen, an electrical and computer engineering (ECE) student. “First, we must separate the direct human voice and each of the room echoes from the microphone recording. Then, we must accurately compute the direction for each of these echoes. Both challenges are difficult because the microphones simply record a mixture of all the sounds altogether.”
VoLoc addresses these obstacles through an “align-and-cancel algorithm” that iteratively isolates the direction of each arriving voice signal and, from those directions, reverse triangulates the user’s location. Some aspects of the room’s geometry are spontaneously learned, which then helps with the triangulation. While this is an important breakthrough, Shen and Roy Choudhury plan to expand the research to more applications soon.
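To give a rough sense of the geometry behind reverse triangulation, here is a minimal Python sketch. It is not the VoLoc implementation and does not cover the align-and-cancel step. It assumes a simplified two-dimensional room with the speaker at the origin, a single flat wall along the line y = wall_distance, and the two arrival angles already measured. The function name locate_user and its parameters are hypothetical. The idea is that an echo off the wall appears to come from a mirror image of the user behind the wall, so the direct ray and the echo ray together pin down one point.

```python
import math


def locate_user(direct_aoa, echo_aoa, wall_distance):
    """Estimate the user's (x, y) position from two arrival angles.

    Assumed setup (illustrative only):
      - the speaker's microphone array sits at the origin;
      - one wall lies along the line y = wall_distance;
      - direct_aoa is the angle (radians) of the direct voice path;
      - echo_aoa is the angle (radians) at which the wall echo arrives,
        i.e. the direction of the user's mirror image behind the wall.
    """
    # The user is at r * (cos a, sin a) and the mirror image is at
    # (x, 2 * wall_distance - y). Requiring the image to lie on the echo
    # ray gives r = 2 * d * cos(b) / sin(a + b).
    denom = math.sin(direct_aoa + echo_aoa)
    if abs(denom) < 1e-9:
        raise ValueError("direct and echo rays do not give a unique position")
    r = 2.0 * wall_distance * math.cos(echo_aoa) / denom
    if r <= 0:
        raise ValueError("angles are inconsistent with the assumed wall")
    return (r * math.cos(direct_aoa), r * math.sin(direct_aoa))


# Example: a user standing at (2, 1) with the wall 3 units away.
# Their mirror image is at (2, 5), which fixes the echo angle.
direct = math.atan2(1.0, 2.0)
echo = math.atan2(5.0, 2.0)
estimate = locate_user(direct, echo, 3.0)  # roughly (2.0, 1.0)
```

In this toy setup the wall distance is given. In the article's description, by contrast, parts of the room geometry are learned by the system itself.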
“Our immediate next step is to build to the smart speaker’s frame of reference,” Shen explained. “This could mean superimposing the locations, as provided by VoLoc, on a floorplan to determine that the user is in the laundry room. Alternatively, if the smart speaker picks up the sounds made by the washer and dryer in the same location as the voice command, it can come to the same conclusion.”
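The floorplan idea Shen describes can be pictured with a small Python sketch. It is purely illustrative: the Room type, the room_at function, the rectangular room shapes and the coordinates are all assumptions made for this example, not part of VoLoc. It takes a location in the speaker's coordinate frame and looks up which room contains it.

```python
from typing import NamedTuple, Optional, Sequence


class Room(NamedTuple):
    """A hypothetical axis-aligned rectangular room, in meters."""
    name: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def room_at(x: float, y: float, floorplan: Sequence[Room]) -> Optional[str]:
    """Return the name of the first room containing (x, y), or None."""
    for room in floorplan:
        if room.x_min <= x <= room.x_max and room.y_min <= y <= room.y_max:
            return room.name
    return None


# Made-up floorplan expressed in the speaker's frame of reference.
floorplan = [
    Room("kitchen", 0.0, 0.0, 4.0, 3.0),
    Room("laundry room", 4.0, 0.0, 6.0, 3.0),
]

print(room_at(5.0, 1.0, floorplan))
```

A real floorplan would need arbitrary room shapes and some tolerance for localization error, both of which this sketch leaves out.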
The potential uses of this capability are wide-ranging, and it could improve Alexa’s current abilities.
“The implications are important,” said Roy Choudhury, a CSL professor and the W.J. “Jerry” Sanders III – Advanced Micro Devices, Inc. Scholar in Electrical and Computer Engineering. “Location can help Alexa in improving speech recognition, since different speech vocabularies and models can be loaded. For example, a command like ‘add urgent to the shopping list’ may not make sense, but if Alexa knows that the user is in the laundry room, Alexa may be able to infer that the user actually said ‘add detergent to the shopping list’.”
Shen and Roy Choudhury acknowledge that the technology could further erode privacy by allowing companies like Amazon and Google to peer more closely into our homes and daily lives. However, they also believe the benefits are vital, as context-aware smart devices could become crucial supporting technologies for senior independent living and more.
For example, the technology could be used to remind a grandparent who lives independently to take their medication when they pass the medicine cabinet, or to remind a child to turn off the faucet when they run out of the bathroom with the water still running.
“It’s more than interpreting voice commands,” said Shen. “It provides an extra set of eyes when it comes to caring for loved ones as well.”