The Third Wave: Speech in Consumer Electronics

The speech recognition market has been projected to be on the verge of explosion for over a quarter of a century, but in reality it is only during the past few years that substantive growth and success have occurred. The market can be roughly divided into three subcategories: PC-based products, telephony/networked products and consumer electronic/embedded products.

The first noticeable growth occurred in PC-based segments, where companies like Dragon, IBM, L&H and Philips introduced PC-based speech dictation packages that have quickly grown to hundreds of millions of dollars in annual sales. More recently, the telephony segment has started taking off, and companies like Nuance, Philips and SpeechWorks have announced major voice transaction design wins on a regular basis. Now the consumer/embedded segment has begun attracting the attention of speech recognition industry players and consumer electronics giants as well.

The following is a list of key motivations for putting speech recognition into consumer applications:

• Easier to use. With general purpose microcontrollers in many of our consumer electronics, it has now become relatively easy for a manufacturer to pack a product full of features. The difficulty lies in enabling the user to use these features. Manuals may be informative, but we don't want to read them. The classic example is the VCR, which many people use for playback only. Many of us don't know how to record, or even how to set the time. It's probably not surprising that one of the earlier consumer electronic speech recognition devices was a remote control by Voice Powered Technologies (which many people thought would help them record, but which still didn't fulfill the need and therefore died a slow death in the market).
• Easier to program. Many of the "easier to use" applications really attempt to improve ease of programming. This is especially important as the average age of the consumer increases, giving rise to more people suffering from age-related ailments like arthritis and vision problems.
• Lack of space for keyboards/buttons. Closely related to ease of use is the fact that keyboards become increasingly dysfunctional as they get smaller. As personal electronics get more powerful, lighter, cheaper, smaller and more full-featured, the need for a voice user interface would appear to increase.
• Hands/eyes too busy. In cars and other situations where the hands and/or eyes are busy, speech can provide a safe and convenient means to control devices.
• Feature differentiation and novelty. Probably the worst reason to add speech recognition, but let's face it, a lot of Pet Rocks got sold for novelty purposes.
• Amusement. I intentionally separate amusement from novelty, since amusement value is true value. Interactive speech (i.e., talking and hearing) makes a product more engaging. Engaging products get used more and are more enjoyable.

Cost Considerations
The driving consideration in incorporating speech recognition into a consumer electronic product is cost. For most consumer-oriented applications, throwing an embedded PC into the system is just too expensive, and in many applications, even DSPs or CODECs add too much cost. Typical consumer electronics sell at retail for 3-6 times the cost of goods, so a product costing $15 might sell at retail for $45-$90. The "added value" of speech recognition typically is not high, and consumers will probably be unwilling to pay an extra $50 for a model that adds recognition. It follows that a speech recognition subsystem must add little or no incremental cost to be viable at retail.

There are consumer products with speech recognition on the market today that sell for as little as $14.95. The only way to get to low price points is to have a fully integrated solution that doesn't require additional processors, memory, royalties, amplifiers and other expenses. Better yet, a speech recognition IC that can replace existing chips can make speech recognition feasible with no incremental cost.

The key to all of this is system cost. Several consumer electronics manufacturers began to incorporate speech, only to kill the project because of hidden costs in implementing the speech technology. For instance, a chip that integrates many of the required speech and system functions onboard, such as the microphone preamplifier, microcontroller and speaker driver, will offer a substantially lower system cost (and no surprises) relative to a standalone microcontroller or DSP running speech recognition software, which requires a lot of externally added hardware functions and features.

Another important consideration is time to market. With product lifecycles in consumer electronics lasting from as little as a few months to as long as a few years, time to market is key to success. Every OEM has its key trade shows to hit and seasons to stock. If its product isn't on the shelf, it doesn't get bought. The key to a fast time to market is finding good technologies with all the necessary tools to support development, or finding a technology or IC vendor that has third-party and/or internal resources to carry out and support development efforts.

A Brief History
The consumer speech recognition technology market is relatively new. Products have been hitting the shelves since the early '80s, but only since the mid '90s, when high accuracy and low cost became achievable at the same time, have there been any large volume successes.

In the early 1980s, Tomy Corporation from Japan introduced a robot and a car ("DanVan") that were voice controlled. Tomy even went so far as to open a U.S. subsidiary that rolled out what was probably the first buttonless voice recognition phone (one with no digit keypad). In the late '80s, Innovative Products developed a line of phones using its internally developed technology. By the early '90s, Origin Systems (later bought by EXAR and disbanded) took over Innovative Products' vision and was working on voice recognition phones using its proprietary speaker-dependent technology. The phones functioned reasonably well, but never made it to any major distribution.

Around this same time, a large U.S. toy company rolled out a line of cars and robots controlled by voice. The line had such a high return rate that the company was forced to include tapes and toll-free numbers that taught the users "how to talk." The label on all the boxes read: "VOICE COMMAND WORKS EVERY TIME IF YOU SAY THE WORDS PROPERLY. If after reading the instruction sheet and listening to the audio tape, you still cannot get it to work; call IN THE US 800-442-7440 IN CANADA 800-463-3353 and we'll teach you how to say the commands properly. Do not return the vehicle to the retailer, we will help you." This company continued to have return issues, developed credibility problems with retailers, and eventually went bankrupt. (The name of the company is not being printed because another company is now doing business under that same name.)

Another toy introduced in the early '90s was a doll, Julie, released with a TI DSP that actually worked. The buyer called Julie's name and her head would turn and talk. She was built full of sensors and activities, but the high implementation cost kept Julie from reaching any commercial success.

Certainly one of the earlier pioneers in the consumer electronics speech recognition space was Voice Control Systems (now a part of Philips). It had some of the earliest automotive design wins. From the early to late 1990s, Voice Powered Technologies released a line of consumer products using licensed speaker-dependent technology. It released remote controls, and had some moderate success with voice recording PDA products. The company went public in the mid '90s, but the stock never took off, and today it is in the midst of bankruptcy reorganization.

Lower cost speech recognition ICs have been released since the 1980s from a string of manufacturers including Toshiba, Panasonic, Sensory, OKI, NEC, Sanyo, HMC, UMC and others. They came in three waves. The first wave was the Japanese players who entered during the mid to late '80s, followed by several Taiwanese companies (HMC, UMC, etc.) in the early '90s. The first speech recognition IC to realize measurable commercial success was Sensory's RSC-164 chip, released in 1995, which integrated speaker-independent recognition, speaker-dependent recognition and speech synthesis on a single low-cost chip. OKI Semiconductor released a product with features similar to Sensory's RSC-164 in early 1996, but at a higher price point. Today, Sensory's Interactive Speech lines of ICs and software solutions are found in the majority of consumer products that use speech recognition.

Today's Market
With the growing demise of the expensive consumer PC, and the rapid growth of embedded PCs, DSPs and high-powered RISC processors, the PC-centric Comdex shows and the consumer electronics-focused CES are starting to overlap more and more. Home appliances and networked products appear to be the buzz across these techno-shows. Along with this, the leading PC speech recognition companies are now making moves towards "low resource" engines for consumer implementations. Lernout & Hauspie has scored design wins in embedded PC-type consumer applications like the Auto-PC. IBM has reorganized and now has a consumer speech recognition division, and Philips has acquired VCS to achieve penetration into consumer applications.

Two of the biggest emerging consumer markets for speech recognition are toys and cell phones. Interactive toys are the rage, and in cell phones voice dialing is becoming a standard feature, with every major manufacturer now releasing phones with voice dialing capabilities. Even cell phone chip providers like Qualcomm are now releasing CDMA chips with speech recognition options, using Qualcomm's internally developed technology. Toshiba Corp., which holds over 60% of the telephone answering device (TAD) chip market in Japan, has recently introduced a new series of TAD DSPs with built-in speech recognition capabilities, using speech technology licensed from Sensory.

In the home, a battle is starting over where the speech technology will reside: Will it be embedded in the products or centralized at a switch? In practice, there will probably be a thin client with a low-resource command-and-control speech engine that can seamlessly take the user into the world of voice commerce natural language engines. There may even be three stages, with a low-resource "appliance," a home-based hub, and a centralized switch.

What Lies Ahead
As we get increasing MIPS and improved speech algorithms, the capabilities of low-cost consumer electronics will advance rapidly. PDAs and convergence telephones are already incorporating simple command and control and voice dialing. Continuous digit dialing and contact lookup/access without training are on the very near-term horizon. Text-to-speech synthesis and multiword spotting will arrive commercially in low-cost consumer electronics within a few years, and constrained natural language interfaces will be in consumer electronics within the next five years.

A/V electronics and white goods will have an abundance of speech recognition offerings in the next few years. Cordless phones will become invisible and will be embedded into the walls of rooms for voice communications without picking up a phone. Toys will become more real and more interactive, and a whole new segment of educational language training will start to emerge using voice I/O capabilities. Hands-free car kits for voice recognition on mobile phones will rapidly be adopted as countries outlaw holding phones while driving; in-car voice recognition will be a standard feature found in most cars introduced beyond 2005.

The key to the success of voice recognition in consumer products is making it possible for an untrained user to immediately start using complex features without training vocabularies or reading manuals. For this to really work, there need to be intelligent synthetic speech responses and natural language understanding, and of course it will have to work in high noise environments without increasing the cost of goods. We're not there yet, but we are clearly making great strides in that direction.

Todd Mozer is the President and CEO of Sensory, Inc., and can be reached by phone at 408-744-9000, by fax at 408-744-1299 or at www.voiceactivation.com and www.sensoryinc.com.
