© 2021 Innovation Trail

Teaching computers to listen like the human ear

Technology becomes obsolete if it doesn't work well.

Talking to computers on the phone has become such a pervasive and hated part of how we do business that there’s an entire website devoted to getting around automated phone menus. It’s called Get Human.

But researchers at Binghamton University haven’t given up on getting computerized speech recognition to work as well as the human ear.

Stephen Zahorian, a professor of electrical engineering at Binghamton University, identifies a number of situations that make the human voice more difficult for a computer to understand.

"I've heard my wife on the phone arguing with the speech recognizer, basically because she got frustrated," said Zahorian. "Of course, it doesn't make sense, because if you lose your normal, clear way of speaking, the speech recognition just gets worse."

When a computer only has to identify words from the limited vocabulary you use while, let's say, banking — words like "transfer" or spoken numbers — it does pretty well.

But try doing that in a noisy car. Or think of a pilot trying to deal with a malfunction in the cockpit.

In these situations, voice recognition could do a lot better.

Zahorian hopes to develop tools that will help computers do the kind of intricate processing the human brain performs on complex audio.

That's where you need grad students.

To build the tools that will help computers process complicated sound, Zahorian and his team of grad students are building a database of conversational, noisy audio.

Most of it is taken from videos pulled off YouTube. 

That means students must transcribe each video to create an accurate record of what was said.

When I visited the lab, Brian Wong was deep into an instructional video on bartending. Each five-minute video takes an hour and a half to transcribe. In all, they hope to collect 150 hours of audio.

"There are people who aren’t good at public speaking," Wong has learned. "They really shouldn’t be doing videos in my opinion."

But the transcription Wong and other students are doing is critical to building a lexicon of complex speech.

"You have theories," Zahorian explains, "based on speech science and probability theory and all kinds of things like that as to what kinds of things might improve the recognizer. But with the database we'll know what the right answer is.”

Translation: To see if the computer’s getting it right, researchers need a record of what was actually said.

The Binghamton database of "noisy" audio in English, Russian, and Mandarin will let them make lots and lots of comparisons between their models and real transcriptions.
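Those comparisons typically come down to a number called word error rate: the fraction of words the recognizer got wrong relative to the human transcript. The article doesn't describe Binghamton's scoring code, but as a rough sketch, the standard calculation looks something like this (the function name and example sentences are illustrative, not from the lab):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words (Levenshtein over words)
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# A recognizer that mishears one word in a four-word banking command:
# reference "transfer five hundred dollars" vs. hypothesis
# "transfer nine hundred dollars" scores 1/4 = 0.25
```

This is why the hand-made transcripts matter: without a trusted reference string, the error rate can't be computed at all.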

Zahorian has worked on visual programs to teach deaf people how to speak and is interested in computer translators (people have been dreaming about those for a long time). But all these systems, he says, will only be as useful as the basic science.

"All of these tools tend to fall into disuse if people say, 'I don’t quite trust this, it’s not quite good enough,'" says Zahorian.

With each hour of painstaking transcription, he thinks he's getting a little closer to helping these technologies stick.


