Google's GAUDI Indexes Audio
Internet search giant Google has launched a new audio search indexing experiment that allows users to locate spoken words inside videos. Google Audio Indexing (GAUDI) was developed by Google Labs and runs the same underlying speech recognition technology used in Google’s Elections Video Search Gadget.
With the new application—which transforms spoken words into text that is then indexed—users searching for words or phrases inside video clips will be able to jump to portions of a video where those words are spoken.
"[GAUDI] provides a novel way to retrieve and navigate video material," wrote Michiel Bacchiani, a research scientist at Google, in an email. "In contrast to what is currently available on YouTube search, it provides a richer search signal by providing the transcript of spoken content in the video. In addition, the time alignment of the transcripts allows content-based within-video navigation."
GAUDI users type in a query and then refine the search results using channel filters that correspond to YouTube channels. Search results provide information about each video, including the number of times query terms are spoken.
For example, a keyword search of recent presidential debate videos revealed that during the first debate, Sen. John McCain said the word "maverick" twice, at 37:22 and at 37:26. Sen. Barack Obama never said the word.
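The idea behind this kind of lookup can be illustrated with a small sketch. It is not Google's implementation, which the article does not describe; it simply assumes a speech recognizer has already produced a time-aligned transcript as (start time in seconds, speaker, word) entries. The function names `build_index`, `format_time`, and `search` are hypothetical, and the sample data is invented apart from the two timestamps taken from the debate example.

```python
from collections import defaultdict


def build_index(transcript):
    """Map each lowercased word to the (time, speaker) pairs where it is spoken.

    `transcript` is a list of (start_seconds, speaker, word) tuples.
    This format is an assumption made for the sketch.
    """
    index = defaultdict(list)
    for start, speaker, word in transcript:
        index[word.lower()].append((start, speaker))
    return index


def format_time(seconds):
    """Render a time offset in seconds as M:SS for display."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes}:{secs:02d}"


def search(index, query):
    """Return (timestamp, speaker) hits for a single-word query."""
    return [(format_time(t), who) for t, who in index.get(query.lower(), [])]


# Hypothetical sample data; only the two "maverick" timestamps come from
# the example in the article (37:22 and 37:26).
transcript = [
    (2240, "McCain", "a"),
    (2242, "McCain", "maverick"),
    (2246, "McCain", "maverick"),
    (2300, "Obama", "economy"),
]

index = build_index(transcript)
hits = search(index, "Maverick")
print(len(hits), hits)
```

Because every indexed word keeps its time offset, the same structure supports both counting how often a term is spoken and jumping to the moment it occurs.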
"Throughout this election season, YouTube has become a dynamic tool for political discussion and a valuable platform for voters to engage with political candidates," Bacchiani wrote.
These videos are "rich with content, but it is often difficult to find both relevant videos as well as relevant information within a specific video. The Google Audio Indexing technology solves both these problems since the transcript allows search across videos as well as within videos as the text is time-aligned with the video content," he continued.
GAUDI is currently in limited beta testing and is used only to index YouTube videos related to the upcoming elections, but Google plans to extend it to other videos and to improve the underlying technology.
"Automatic speech recognition remains in the early stages of development, and we’re constantly working on improving the quality of the transcripts," Bacchiani wrote. "One reason is that vocabulary keeps changing. In addition, sound quality in videos varies as a result of acoustics [and the] quality of the microphone. Such mismatches can cause severe performance degradation."
Finally, the lack of robustness also shows in language use: transcripts suitable for training the language model are hard to find.
"We hope to grow the content covered by the product, and further research will focus on increasing the robustness of the technology to allow coverage of a wider domain," Bacchiani continued. "Large data sets are likely [going to be] the key focus of that research."
One person who sees significant creative potential in this technology is James Larson, a consultant, VoiceXML trainer, and co-chair of the World Wide Web Consortium’s Voice Browser Working Group.
"Advertisers will use this technology to review TV and radio recordings to verify that the advertisements that they paid for were indeed broadcast—that’s the first use that I see of it," Larson says. "Advertisers may use this technology as sort of a data mining strategy, and that is to search out other advertisements that are related to what they are interested in."
Bacchiani echoed Larson’s sentiments about advertising. "Theoretically, the additional detail of the transcript allows the potential for more accurate advertisement targeting," he wrote. "However, we have no immediate plans for using this technology for advertising."