The 2015 Speech Industry Star Performers: Microsoft
Microsoft Is Breaking Barriers with Cortana
Microsoft first launched Cortana, its speech-enabled digital assistant, in the United States in April 2014. Since then, it has brought the product to China, India, Australia, and various European countries.
Beyond its geographic expansion, Microsoft is also taking its Halo-inspired helper to other platforms and devices. The company brought Cortana to mobile devices running on Google's Android platform in June and expects to have a version for Apple's iOS later this year. It is also working to endow Cortana with even greater proactive and predictive capabilities, starting with the ability to determine user intent.
Cortana is also being expanded to bring voice control to several products around the home. An Insteon app for Windows Phone 8.1, for example, lets users employ Cortana's voice recognition through their mobile phones to lock and unlock doors, turn the lights on and off, and adjust the thermostat. With Cortana, users of the Insteon home automation hub can control more than 200 connected devices.
And, through a deal with Telefonica, subscribers to the telecommunications company's Movistar TV Go service will be able to use their Windows phones and Cortana to search for and access TV content with simple voice commands.
Steven Guggenheimer, a corporate vice president and chief evangelist at Microsoft, says Cortana's potential goes far beyond just performing voice-activated commands. Cortana, he says, continually learns about each user and becomes increasingly personalized, with the ultimate goal of proactively performing the right tasks at the right time.
Microsoft's innovations around the technology don't end there, though. As part of its Project Oxford initiative, Microsoft is working to dramatically improve its speech recognition and text-to-speech capabilities through advances in machine learning, neural networks, language understanding, and artificial intelligence. Project Oxford provides a portfolio of representational state transfer (REST) application programming interfaces (APIs) and software development kits that enable developers to add speech, language understanding, facial recognition, and computer vision to their own applications.
On the speech side, the APIs cover speech recognition, speech intent recognition, and text-to-speech conversion.
Another element of Project Oxford is the Language Understanding Intelligent Service (LUIS), which allows developers to add language understanding to applications. Developers can create their own language understanding models or use prebuilt models from Bing and Cortana; deploy those models to HTTP endpoints or on phones, tablets, or other Internet-connected devices; and review commands spoken to applications to spot and correct errors.
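For developers wondering what consuming such a model might look like, here is a minimal Python sketch. It is not taken from Microsoft's documentation: the JSON response shape, the field names (`intents`, `score`, `entities`, `type`), and the helper functions `top_intent` and `entities_by_type` are all assumptions made for illustration. It shows only the general idea of picking the most likely intent from a language understanding result returned by an HTTP endpoint.

```python
import json

# Hypothetical response body, assumed for illustration only.
# A real language understanding service may use different field names.
SAMPLE_RESPONSE = """
{
  "query": "turn off the kitchen lights",
  "intents": [
    {"intent": "TurnOff", "score": 0.92},
    {"intent": "TurnOn", "score": 0.05},
    {"intent": "None", "score": 0.03}
  ],
  "entities": [
    {"entity": "kitchen lights", "type": "Device"}
  ]
}
"""


def top_intent(response_text, threshold=0.5):
    """Return the highest-scoring intent name, or None if no intent
    reaches the confidence threshold."""
    data = json.loads(response_text)
    intents = data.get("intents", [])
    if not intents:
        return None
    best = max(intents, key=lambda item: item["score"])
    return best["intent"] if best["score"] >= threshold else None


def entities_by_type(response_text):
    """Group the recognized entity strings by their (assumed) type label."""
    data = json.loads(response_text)
    grouped = {}
    for item in data.get("entities", []):
        grouped.setdefault(item["type"], []).append(item["entity"])
    return grouped


if __name__ == "__main__":
    print(top_intent(SAMPLE_RESPONSE))
    print(entities_by_type(SAMPLE_RESPONSE))
```

The confidence threshold is a design choice in this sketch: an application that receives a low-scoring result can ask the user to repeat the command rather than act on a guess.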
Those same technologies are also being brought to bear in Microsoft's Skype Translator, a speech-driven communications program that can translate conversations in real time. Since releasing Skype Translator in preview form in December 2014, Microsoft has added several features. Users can now turn on speech recognition warnings that alert them when the translator is having a hard time understanding the speaker. They can also enable text-to-speech translation and switch between it and speech-to-speech translation. Microsoft has also added Mandarin Chinese and Italian to the spoken languages available.
Some of the other changes in the latest version of Skype Translator include the following:
- text-to-speech translation, with the option to hear the instant messages people send in any supported language;
- continuous recognition as the person is speaking;
- automatic volume control; and
- a mute option for translated voice, enabling users to turn the translated audio off if they would prefer to simply read the transcript.
Skype Translator also employs advanced machine learning through deep neural networks to improve translations over time.
"While speech recognition has been an important research topic for decades, widespread adoption of the technology had been stymied by high error rates and sensitivity to speaker variation, noise conditions, etc.," Microsoft group program managers Mo Ladha and Chris Wendt wrote in a blog post. "The advent of deep neural networks for speech recognition...dramatically reduced error rates and improved robustness, finally enabling the use of this technology in broad contexts, such as Skype Translator."