Speech Analytics: Turn Conversations into Dollars
You might call speech analytics the little engine that could. Once just a blip on the speech technology radar, the field has gained steam, and organizations are sitting up and taking notice.
In the past year, there's been a rash of merger and acquisition activity in the speech analytics market. In February 2011, Hewlett-Packard bought next-generation analytics platform provider Vertica; in July, Verint purchased Vovici, a provider of enterprise feedback management solutions; and in October, Avaya acquired Aurix, a United Kingdom–based speech analytics and audio data mining company.
According to Donna Fluss, founder and president of DMG Consulting, as recently as 2004, there were only 25 traceable speech analytics implementations. Fast forward seven years and there are more than 3,170 implementations.
"The long-term benefits of speech analytics are unbelievable, and it's just going to get better," Fluss says. "It delivers tremendous value to organizations that use and apply its findings."
Paul Stockford, chief analyst at Saddletree Research, conducts annual surveys that measure buyers' attitudes and intentions toward specific technology in conjunction with the National Association of Call Centers at the University of Southern Mississippi. As recently as 2009, Saddletree found that 70 percent of survey respondents said they had no interest in evaluating speech analytics products. In the company's 2011 survey, however, speech analytics was among the top five technologies respondents were looking at, with 24 percent saying that they intended to evaluate it for purchase in 2012. Additionally, 9 percent said they had already secured funds for purchase this year.
"Nine percent doesn't sound like a really impressive number, but if you apply that to our estimate of 66,000 contact centers in the United States, in terms of real numbers, it works out to about 6,000 contact centers that have already funded speech analytics for 2012," Stockford says.
"I think if you look at the evolution of speech analytics in the contact center, it was at first this kind of whiz-bang, exciting thing," says Jeff Schlueter, vice president of marketing and business development at Nexidia. "Now we're seeing larger implementations, and it's becoming more mainstream. Companies are starting to understand how to get the best benefits from speech technology."
Michael Miller, vice president of customer strategy at UTOPY, is also excited about the continuing adoption of speech analytics. "This is a sea change, and there are new tool sets that have radical capabilities to drive performance, and [that] can change and improve how you do business," he says.
What Is Speech Analytics?
Speech analytics, also referred to as audio mining, takes unstructured verbal communication, automatically analyzes it via various methods, and then delivers insights from conversations. It is primarily used in the call center to better understand customer interactions and measure agent performance.
"A five-minute conversation typically includes over 1,000 words," says Daniel Ziv, vice president of customer interaction analytics at Verint. "[The] average contact center agent can generate over 10,000 hours of audio each week. This represents a gold mine of rich and unbiased customer insights."
Using speech analytics, call centers can follow entire conversations and identify topics, problematic agent or customer interactions, customer insights, emerging trends, and opportunities for upselling and cross-selling. Contact centers can instantly determine the reasons customers are calling, how effectively agents delivered service, and where more training might be needed to boost skills and reduce average handle times.
Types of Speech Analytics
There are three major types of speech analytics engines: phonetics, large vocabulary continuous speech recognition (LVCSR), and direct phrase recognition. Additionally, many speech analytics companies add proprietary technology on top of these categories, and/or use a combination of these engines.
Phonetic speech recognition engines (also known as speech-to-phoneme) transcribe audio into sequences of phonemes, the smallest units of human speech (roughly 400 distinct sounds across the world's languages). A text mining or search engine then scans the phonetic output for phoneme combinations that correspond to keywords.
A benefit of phonetic recognition is that it can process large volumes of audio quickly, and it can handle different languages relatively easily. But some speech analytics companies say that, used alone, phonetic recognition is the least precise of the engines. "Phonetics provides extremely fast indexing, because that's all it does," says Ziv. "It's not dictionary constrained, but an index is meaningless unless the engine is told what to look for."
LVCSR (also known as speech-to-text) transcribes recorded conversations into text by matching the spoken words with dictionaries of terms. These transcriptions are then indexed so that they can be quickly searched for keywords. Some believe that the accuracy rate of LVCSR engines is greater than that of phonetics because the recognition unit is entire words. However, LVCSR is slower than phonetic indexing, is dictionary constrained, and its accuracy depends largely on its language models and the processing time allotted.
"These approaches enable rapid, ad-hoc searches and exploration of unstructured audio data," says Sean Murphy, director of marketing at UTOPY. "[On the downside, data] is permanently lost during the conversion process due to the inherent limitations of current speech recognition and conversion engines, and the context in which the words were used (and therefore the business impact or outcome) is very difficult to capture using such a keyword-spotting approach."
The third method, direct phrase recognition, directly recognizes specific phrases that businesses have predefined, without first converting speech into phonemes or text. Because no data is lost in conversion and because phrases are longer and easier to identify correctly than single words, direct phrase recognition is regarded as offering high data reliability. In addition, phrases put words into context, capturing speaker intent and business meaning more accurately.
What Speech Analytics Can Do for You
In a nutshell, speech analytics provides business intelligence that companies can use to improve customer service and sales, reduce the cost of service delivery, and identify areas for process improvement. These areas include:
- Monitoring customer satisfaction: Track and analyze contacts with churn and dissatisfaction language.
- Process improvement: Understand call drivers to reduce call volumes; find periods of silence to identify obvious issues that are training priorities.
- Increasing sales effectiveness: Understand how a company's most successful agents handle themselves, so best practices can be identified and shared.
- Risk management: Find conversations that represent risk to the organization, reducing costs related to litigation, damages, and regulatory fines.
- Cross-selling and upselling opportunities: Improve sales conversion rates.
The ability to achieve an ROI is a real game changer, according to some industry experts, and can be realized in as little as six to 12 months.
Bob Sullebarger, senior vice president of sales and marketing at CallMiner, agrees that the bottom line can be significantly boosted using speech analytics. "Contact centers that aren't using analytics are leaving money on the table and risking their overall competitiveness," he says. "Speech analytics is an investment in competitiveness and profitability that drives incredible return on investment. Contact center quality assurance costs are typically cut in half."
A common ROI example is improvement in sales conversion, Murphy says. "One of our clients [who introduced a speech analytics implementation] improved conversion of phone calls into home loan applications by six percentage points in six months," he says. "Each percentage point improvement was worth $1.2 million per year in revenue. The gain was achieved over six months, so the revenue gain ($100,000) was modest in the first month after deployment. By the third month, however, the accumulated revenue gain was $600,000 on a system cost of $330,000, so the cost of the system was paid before the end of the third month. The system was priced on a three-year contract, so the ROI is calculated based on total revenue gain over three years, in this case $20.1 million divided by a system cost of $330,000. So, ROI is sixty-one times [the cost of the initial investment]."
Speech analytics costs vary by provider. Generally speaking, cost depends on the size of the contact center, the number of seats, how large the implementation is, and the savings that result from buying bundled packages.
But implementing speech analytics is just the beginning. Customers need to identify concepts and themes and fine-tune results. "You need to identify trends on a monthly basis," Fluss says. "If you don't go back and use the information and recommendations to fix problems, you're not going to remedy them."
Fluss also emphasizes that speech analytics findings need to be shared across the entire enterprise. "Once you find an issue, you need to figure out how to communicate it across the organization for change management," she says. "Organizations do this wrong sometimes. There is highly valuable information for different departments to use and understand revenue opportunities."
The Players
Most large companies use a combination of speech analytics engines and often add proprietary technology.
UTOPY's speech-to-text transcription performs speech analytics in a way that is similar to most of the other speech analytics products on the market today. UTOPY also adds a unique phrase-driven approach to speech categorization and analysis (which the company invented and patented), combining speech and business concept recognition into a single step.
"By directly recognizing and analyzing entire phrases, UTOPY's method is automatically tailored to specific business needs and industry-specific terminology, making it more reliable at recognizing speech that is relevant in a business context," Murphy says. "The result is a holistic approach to precisely and completely understanding and analyzing the entirety of the interactions taking place between companies and their customers, and an exponential increase in data reliability and business value."
Verint's Witness Actionable Solutions uses a combination of phonetics, LVCSR, and a third layer, Complete Semantic Index. "Complete Semantic Index is an index of every word and phrase we identify in all the calls we process," Ziv says. "The index retains 100 percent of the content in a meaningful structure versus a phonetic index that carries no linguistic meaning or a direct phrase or keyword recognition approach, which typically retains a very small percentage of the actual content. This technology differentiator translates to being able to identify unknown emerging issues much faster, which drives rapid and greater return on investment for our customers."
Aurix, acquired last year by Avaya, uses phonetics. The company says its speech analytics goes a step further than traditional speech-to-text engines because searches are based on the way a word sounds rather than on its specific spelling. The search process uses hidden Markov models and dynamic programming algorithms to perform keyword searches on the audio stream. The speed and accuracy of searches are increased, the company states, without the need for the huge computing power required by speech-to-text solutions.
A crucial benefit of the Aurix phonetic audio search engine is that all the phonetic intelligence in the audio signal is retained until search, unlike LVCSR mining, where much of the phonetic intelligence is discarded when the text-based transcription is generated, says Chris McGugan, vice president of emerging products and technology at Aurix. "As a result, audio data is instantly searchable, supporting real-time monitoring of many thousands of calls or other audio material."
McGugan says that the Aurix phonetic indexing method also allows a much higher volume of recordings to be processed, more quickly, and with less hardware power than LVCSR systems. "Audio is 'ingested' or indexed…eighty times faster than it is spoken, and the index files are compressed as they are generated," McGugan says. "This indexing is performed only once and can be searched for multiple terms as many times as required."
Nexidia uses patented phonetic indexing and search technology. In the first phase, recorded audio is input into the system and a time-aligned phonetic index is created. The second phase begins when a search is requested. Searches can be done directly on words or phrases, or using special operators such as Boolean strings or time-based proximity to other content. Nexidia's proprietary search engine identifies and matches the phonetic equivalent of the search string and returns relevancy-ranked results.
"Nexidia's patented dialogue search, alignment, and analysis technology finds words or phrases which are spoken in either live or recorded media streams," says Schlueter. "One hundred thousand hours of media can be searched in under a second. Transcripts and captions can be timed within one-hundredth of a second. Languages and speakers can be automatically identified, and multiple languages are supported."
CallMiner uses an LVCSR engine, which it believes is the most sophisticated form of speech-to-text recognition.
"LVCSR performs not only a phonetic indexing of the data, but also applies a statistically weighted language model to the sequences of phonemes to derive the most likely combination of phonemes into words," says Sullebarger. "We use thousands of hours of conversational corpus to derive these models, letting the computational power of many servers take the place of painstakingly slow manual proofing that is required by several linguists using a phonetics-based solution."
The Future of Speech Analytics
What does the future hold for speech analytics? DMG's Fluss says that with a compounded annual growth rate of 124 percent between 2004 and 2010, and growth of 42 percent in 2011, DMG expects the speech analytics market to continue to expand over the next several years, growing by 32 percent in 2012, 25 percent in 2013, and 20 percent in 2014.
More education and wider customer adoption is, of course, behind the growth, as are emerging technologies such as emotion detection and real-time speech analytics.
Contact centers use speech analytics mainly on a reactive basis, with recordings being analyzed overnight and manual changes taking place the following day, which is time consuming and cumbersome, Fluss says. However, with real-time speech analytics, organizations "will be able to judge the emotional state of the caller and determine if supervisor support is needed," she says. "The long-term potential of speech analytics is unbelievable. This is not a field of dreams. The use of speech analytics is only limited by our imaginations."
A Case for Needs Assessment
Some questions to ask a prospective speech analytics provider:
- How many customers do you have?
- What business verticals do you have experience in?
- Can you provide customer case studies?
- What languages do you support?
- What is your call transcription accuracy rate?
- Does your solution record calls as well as analyze them?
- What is the size of your largest speech analytics customer deployment?
Staff Writer Michele Masterson can be reached at mmasterson@infotoday.com.