Archive for April, 2007

Sea, speech and sun at Nuance Conversations in Cannes, France

Monday, April 30th, 2007

Nuance Conversations Europe 2007 was held last week in the legendary Carlton Inter-Continental hotel in Cannes, France, home to many a movie star. After last year’s successful first edition in Mallorca, expectations among the 250+ speech technology crowd were high. Apart from the traditional contact centre business and technology sessions, the new mobile and automotive tracks were testimony to the ubiquity of speech technology in our daily lives. The fil rouge of this year’s conference was “Elevating the User Experience”.

Carlton Inter-Continental hotel in Cannes, France

Nuance executive Peter Hauser painted a rosy picture of the state of the industry in general, and of his company in particular. As basic speech technologies have matured to the point of becoming commoditised, the real battle ground for the industry has shifted to the customer experience. In the same year that Dragon Naturally Speaking is celebrating its 10th birthday, Nuance has combined the best-of-breed features of its various legacy ASR technologies into Recognizer v9, which is now available worldwide. The new recognizer plays a pivotal yet unobtrusive role in Nuance’s vision for multi-modal mobile solutions, which is detailed below.

To bring this vision to the market, Nuance relies heavily on its partner network, so naked motivator David Taylor was called in to infuse the audience with their own dreams and with some simple – cynics would say simplistic – principles on how to realise them. Whether you like Mr. Taylor’s style or not, I guess there’s some truth in his truisms.

But let’s not get distracted. What’s the mobile vision about? Here’s how Nuance sees things: users push a single Voice button and talk. The device then does one of two things: perform the recognition directly by itself, or hand it off to a recognition server in the network. The recognition result is fed into a diverse set of applications including directory assistance, voice mail, music catalogue search, music playing, or navigation, to name just a few. Depending on the application context and the device capabilities, the requested information is then presented in the form of text, graphics, audio and/or video. The basic technology to enable this kind of “elevated user experience” is there, and was convincingly demonstrated in Cannes. Therefore, the one million – or rather, billion – dollar question is not about technology anymore. But if that’s the case, what then are the main drivers or barriers that will make or break this whole new range of mobile services?
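For the programmers among you, the "device or network" decision described above can be pictured as a tiny routing function. This is only a sketch under my own assumptions: the vocabulary-size criterion, the `ON_DEVICE_VOCAB_LIMIT` threshold and all names are hypothetical, and nothing here reflects how Nuance actually makes the choice.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    vocabulary_size: int  # number of distinct phrases the task must recognise


# Hypothetical limit on what an embedded recogniser can handle (assumption).
ON_DEVICE_VOCAB_LIMIT = 5_000


def choose_recogniser(task: Task, network_available: bool) -> str:
    """Pick where recognition runs: on the handset or on a network server."""
    if task.vocabulary_size <= ON_DEVICE_VOCAB_LIMIT:
        return "device"       # small task: recognise locally
    if network_available:
        return "network"      # large task: hand off to a recognition server
    return "unavailable"      # large task, but no connection


print(choose_recogniser(Task("voice dialling", 300), network_available=False))
print(choose_recogniser(Task("music catalogue search", 2_000_000), network_available=True))
```

A real handset would presumably weigh more than vocabulary size (latency, battery, data cost), but the single Voice button hides that choice from the user either way.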

Nuance Conversations Europe 2007 in Cannes, France

Elements of an answer were provided on Wednesday by a panel consisting of Peggy Ann Salz (Informa), Pascal Coutier (Logan Orviss International), Trond Lund (Fast) and Marcel Pirlich (Arvato Mobile). Drivers for experience-rich multi-modal service uptake include personalisation (knowledge of past usage patterns), context-awareness (current time, place & activity), and flat rates. Barriers include lack of usability (see my previous post on free 411 systems in the US), difficulty of rich client installation procedures, absence of data bundling in Europe, availability of capable handsets and above all, lack of decent business plans. There is a consensus in the mobile industry that the mobile search market will eventually be driven by advertisements (a market estimated at $6.5 billion by 2011). Speech is seen as a major enabling technology in this space, as it facilitates data input in application contexts where secrecy and discretion are not an issue. In hands-free contexts like cars or warehouses, human speech is even the only legally allowed or practical means for inputting data.

Frederik Durant at Nuance Conversations 2007 in Cannes, France

To prove their point about the speed, naturalness and safety of the speech interface, Nuance put former Formula One driver Perry ‘The Stig’ McCarthy on stage behind the wheel of a virtual sports car, and asked him to read an SMS and select a given iPod tune while driving on the Monaco F1 circuit at “normal” speed. After 10 minutes and 20-odd crashes, the audience finally got to hear some music. Perry McCarthy’s opponent from Nuance got the same job done with TTS and speech input in a minute or so, without a single scratch to his car – although he did drive a bit slower, I must admit.

Just say 'The Supremes - Ain't No Mountain High Enough' and enjoy the multi-modal experience!

In the Contact Centre Business track I attended two customer testimonials.

Melanie Rowland, Head of Self Service and Automation-IVR at Vodafone UK, presented her company’s long journey from organically-grown IVR silos – a real customer-nightmare – to speech-enabled services hosted by Vicky, Vodafone’s virtual advisor. At the start in 2005, no less than 19% of the 6 million weekly inbound telephony contacts were simply abandoned, and only one caller out of four was able to select the right option for their query. Vodafone first fixed the existing IVRs, which allowed the automation rate for e.g. post-pay customers to jump from 20% to 45%. This investment paid for itself in less than 5 months. As customers were still dissatisfied with the existing voice, Vodafone then went a step further to create Vicky, a virtual persona who embodies Vodafone UK’s brand essence and personality. Since then, customer satisfaction rates have risen dramatically. Vodafone UK now plans to automate 60% of all inbound traffic by next year. As success factors, Mel Rowland cited the primacy of the customer experience over technology concerns, and also stressed the importance of well-defined and consistently tracked key performance indicators. Getting early buy-in from the business, marketing, technology and customer service departments through open communication is also key to success.

Melanie Rowland presents her colleague Vicky, the Voice of Vodafone UK 

On Wednesday afternoon Ross Moody, Head of Shared Services at Standard Bank in South Africa, explained how his company’s initial exploratory initiative to integrate two different platforms turned into a strategic thinking exercise as the full potential of VoiceXML-based speech technology became apparent. In a country with 11 official languages, customers more often than not use a language other than their mother tongue when interacting with service departments or systems. With help from Intelleca, Standard Bank therefore had to spend quite some effort on collecting data, tuning acoustic models, redesigning grammars and adding phonetic transcriptions to the pronunciation dictionaries in order to achieve the desired performance levels. The use of a separate male persona for the wizard/guide as opposed to the female virtual agent seemed awkward to me, as it made the whole user experience slower and overly complex. Despite this negative point, customer feedback to the new, integrated speech solution was said to be “overwhelmingly positive”.

The Carlton Inter-Continental on La Croisette in Cannes, France

Like last year, Chief Scientist Vlad Sejnoha offered a glimpse into Nuance’s R&D future. The new vision, called Care 2.0, is driven by a common technology base which integrates elements from the traditional core technologies Dictation, Network and Embedded, which are converging. The common platform is backed by the availability of huge amounts of performance data and computing grids to process them, the existence of similar requirements and approaches across device categories and applications, and seamless APIs. In this world of converged and always-available basic technologies, the new challenge is to find effective ways of combining speech and visual components in order to offer compelling multi-modal, experience-rich services. The new Quest is to discover what kind of applications users want to talk to, and what type of service users need in a given context.

Unless I was blinded by the April sun, the perfect temperature and the excellent food & drinks, Nuance Conversations Europe 2007 led me to one conclusion: the present and future of speech technology look very bright indeed.

Google gives acte de présence with 1-800-GOOG-411

Saturday, April 14th, 2007

Barely waiting for Microsoft and Tellme to return from their honeymoon, Google Labs recently launched Google Voice Local Search, an experimental 411 (directory assistance) service. For the moment, 1-800-GOOG-411 just offers US local business listings, directly accessible from any US phone. But with a Grandstream SIP phone, an Asterisk PBX and a gateway like FreeWorldDialup, this minor nuisance is quickly bypassed.

So instead of speculating if, when and how Google will integrate the new service in its pay-per-click or pay-per-call advertising model, I just called 1-800-GOOG-411 for a quick try-out. Jingle Networks’ 1-800-FREE-411 service was chosen as Google’s sparring partner.

To make the test a bit more fun and real for myself, I decided to only search for US businesses that I have actually visited at some point in time. This way I not-so-randomly picked David’s World Famous catering service in Burlington, MA; the MIT COOP bookstore in Cambridge, MA; and the Starbucks on El Camino Real in Palo Alto, CA.

First some food from David’s World Famous. My call to 1-800-GOOG-411 was answered by a neutral-sounding male voice saying “calls recorded for quality”. Notice the absence of any verb? After two seconds I got a pre-recorded prompt “GOOG-411 experimental. What city and state?” My answer “Burlington, Massachusetts” was well recognized and explicitly confirmed by the system. To the next question “what business name or category?” I said “David’s World Famous”. There was a short database lookup and after 21 seconds into the call, I got presented with the top-2 results. I chose the first one and could have been connected directly to the catering service after 41 seconds, if I had wanted to. Instead I asked for more address details, which another male TTS voice read aloud twice, presumably to give me a chance to jot it down. The phone number was read correctly, in a conversational, natural way. After this self-chosen digression, I was connected to the David’s World Famous answering machine – not a surprise, really, as the local time in Massachusetts at that moment was well after midnight.

I then tried the same procedure through 1-800-FREE-411, at least that’s what I had in mind. “Welcome to 1-800-FREE-411! Press 9 now to get the last number you requested”, said a female pre-recorded voice. I wasn’t interested in that, so I kept silent. After 12 seconds, a first commercial offered me to take part in Stonebridge Life’s $25,000 give-away. Er, maybe some other time. Thirty-one seconds into the call, I got a “What city and state, please?” prompt, and said “Burlington, Massachusetts”. There was no explicit confirmation; instead the system immediately continued with “Are you looking for a business, government or residential listing?” “A business listing”, I said. Again no confirmation, but another prompt “Would you like to search by name or by category?” “By name”, I answered. “OK, what listing?” “David’s World Famous”, I said. Now things became funny. The call was sponsored by “Girls Gone Wild”, who offered me two videos for free, meaning I just had to pay shipping and handling costs. Yeah, right. Not that I dislike oriental food, but hot ‘n’ spicy DVDs were not exactly what I had asked for. Anyway, back to the call. A flat female voice brought me down to earth with the message “the number you requested is seven eight one – two two nine - eight seven eight six”. You would think any decent VUI designer knows by now that US phone numbers don’t get read this way, but apparently not so at 1-800-FREE-411. What’s worse, after I’d heard the requested phone number, I was presented with two options: hear it again, or get connected to … Girls Gone Wild. While I was waiting for the obvious third option that would connect me to David’s World Famous, the system again threw the flat-spoken number at me, and prompted me for yet another repeat. Just when I thought I was finally going to be connected, the system thanked me for calling, made some more publicity about their own website “to learn about other special offers” and then hung up. 
Two minutes and five seconds had gone by, and I was still left with an empty stomach.
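To make the point about number read-back a bit more concrete, here is a minimal sketch of how a prompt generator could chunk a ten-digit US number before handing it to a TTS engine. The 3-3-2-2 grouping, the use of "zero" rather than "oh", and the function name `speak_us_number` are my own assumptions, not something either service documents; the commas merely stand for the short pauses that make a number easier to jot down.

```python
DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]


def speak_us_number(number: str) -> str:
    """Render a 10-digit US number as digit groups (3-3-2-2) separated by pauses."""
    digits = [c for c in number if c in "0123456789"]  # drop dashes, spaces, brackets
    if len(digits) != 10:
        raise ValueError("expected a 10-digit US phone number")
    groups = [digits[0:3], digits[3:6], digits[6:8], digits[8:10]]
    # Words inside a group are spoken together; commas mark the pauses between groups.
    return ", ".join(" ".join(DIGIT_WORDS[int(d)] for d in group) for group in groups)


print(speak_us_number("781-229-8786"))
# seven eight one, two two nine, eight seven, eight six
```

Grouping is of course only half the story: the falling intonation on the last group is what tells the listener that the number is complete, and that is a job for the voice itself.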

After the stomach, time for the brain. I called 1-800-GOOG-411 again, now searching for the MIT Coop bookstore. The speech recognition of “Cambridge, Massachusetts” went smoothly, as expected. Alas, the business name turned out to be more problematic, with its two abbreviations. “MIT” stands for Massachusetts Institute of Technology, and is customarily pronounced one letter at a time: M-I-T. The word “Coop”, although an abbreviation for “cooperative”, is pronounced as an acronym over there, rhyming with “loop” or “soup”. Being a foreigner, I pretended not to know this and said “M-I-T Co-op” at first. Successive attempts to recognize this same pronunciation generated a “no match” leading to a “try again” prompt, and a low-confidence false match with an attached explicit confirmation prompt. The system then presented me with some indirect matches from its database, all of which were irrelevant. After the fourth list item, the Google voice suggested to start all over again, so that’s what I did and said. I now pronounced MIT as an acronym, sounding like the German preposition “mit”, and stuck to “Co-op” for the second part. Apparently I guessed right, because the system literally confirmed my incorrect pronunciations and offered me a short list of three MIT Coop locations. I chose the second one, and after one minute and fifty-five seconds, I was connected to the answering machine of the MIT Coop on Kendall Square in Cambridge, Massachusetts.
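The MIT Coop case shows why a listing needs more than one accepted spoken form. A real recognizer deals with this at the phonetic level, in its pronunciation lexicon or grammar; the text-level alias table below is just a rough illustration of the idea. The table contents and the names `ALIASES`, `normalise` and `resolve_listing` are hypothetical, and this is certainly not how Google does it.

```python
import re

# Hypothetical alias table: normalised spoken variants -> canonical listing name.
ALIASES = {
    "m i t coop": "MIT Coop",
    "m i t co op": "MIT Coop",
    "mit coop": "MIT Coop",
    "mit co op": "MIT Coop",
}


def normalise(utterance: str) -> str:
    """Lower-case the text and collapse hyphens and punctuation into single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", utterance.lower()).strip()


def resolve_listing(utterance: str):
    """Return the canonical listing name, or None when no variant matches."""
    return ALIASES.get(normalise(utterance))


print(resolve_listing("M-I-T Co-op"))   # MIT Coop
print(resolve_listing("mit Coop"))      # MIT Coop
print(resolve_listing("Harvard Coop"))  # None
```

The more variants a listing accepts, the more forgiving the system becomes for foreigners like me, at the price of a larger search space and more potential confusion between listings.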

My first search for the MIT Coop at 1-800-FREE-411 failed immediately with the message “We’re sorry but no live operators are available at this time. Please try again later”. For an automated system, that’s an illogical answer, especially since 1-800-FREE-411 explains in its own FAQ that they are “no longer supporting live operator services from certain localities”. Subsequent calls [1,2,3,4] did go through, but they all suffered from no matches and false matches, irrespective of my pronunciation of “MIT Coop”. I couldn’t verify if “MIT Coop” was in-grammar or out-of-grammar, but the corresponding web search did return one entry. On the positive side, 1-800-FREE-411 transfers callers to an operator after two failed recognition attempts.
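The fallback to an operator after two failed attempts is a classic error-recovery pattern in dialogue design. A minimal sketch, assuming a hypothetical `recognise` callback that returns the recognised listing or `None` on a no-match (the function names and return strings are mine, not those of any real platform):

```python
from typing import Callable, Optional

MAX_ATTEMPTS = 2  # as observed in the calls: operator transfer after two failures


def collect_listing(recognise: Callable[[], Optional[str]]) -> str:
    """Prompt up to MAX_ATTEMPTS times, then fall back to a live operator."""
    for _ in range(MAX_ATTEMPTS):
        result = recognise()          # one prompt-and-listen turn
        if result is not None:
            return f"listing:{result}"
    return "transfer:operator"        # repeated no-match: escalate


# Simulated caller: first utterance is not recognised, second one is.
answers = iter([None, "Starbucks"])
print(collect_listing(lambda: next(answers)))  # listing:Starbucks
print(collect_listing(lambda: None))           # transfer:operator
```

Note that such a loop only catches no matches. A false match sails straight through unless the dialogue also adds a confirmation step for low-confidence results.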

My last search for Starbucks Coffee on El Camino Real in Palo Alto, California went without a glitch at both 1-800-GOOG-411 and 1-800-FREE-411. With Google, I was transferred after 45 seconds; with the other system I got to hear the complete number after one minute and fifty seconds. This time the irrelevant ads were from InCharge Debt Solutions and American Express, respectively.

Before we draw some conclusions, first a warning: no speech recognition system should ever be evaluated on the basis of a few calls and utterances made by a single speaker over a single channel. To do so would not only be unfair, but also unscientific and possibly completely wrong. This being said, my first impression is that Google’s potential entry in the automated DA space should be a major concern for all other players on the US 411 market. As could be expected, the 1-800-GOOG-411 voice user interface is clean and snappy, with various error recovery mechanisms already in place; speech recognition looks good; and the direct transfer to the requested number is an obvious functionality that’s blatantly missing with 1-800-FREE-411. So looking from the technology side, Google seems to know what they’re doing – hardly a surprise.

A bigger challenge for Google or any competitor will be to balance the economic aspects of sponsored local audio ads (remember the DMarc acquisition) with the human interaction limitations of a spoken phone interface. A caller’s tolerance for inserted ads is inversely proportional to the degree of certainty with which the business or category name is entered. If I ask for Starbucks, I want Starbucks’ phone number; but if I just want coffee, multiple relevant results are expected, including sponsored transfers and special offers. With its army of natural language processing specialists, the richness and vastness of its data, and its very deep pockets, Google is well placed to shake the US Directory Assistance industry, if it wants to. Unless it has other priorities, with even bigger returns.