Last Thursday and Friday I was at Nuance Conversations Europe in the picturesque town of Porto Petro on the isle of Mallorca, Spain. With 235 people from 25 countries, the first European edition of this conference was well attended. This post presents some highlights from the plenary sessions, from Thursday’s technical track and from Friday’s business track.
In the introductory session, Steve Chambers outlined Nuance’s (network) speech strategy. The first focal point in the short term is – or rather, remains – customer care: what callers want is speedy service and a sense of control. “Human Touch” is the unifying theme under which Nuance is addressing the customer’s desire for more natural, unconstrained input. OpenSpeech Dialog (OSD) is the flagship product that should help make this dream come true. The second short-term focus is “Googlizing Voice”, with “Mobi” as a concept for mobile dictation in a multi-modal setting. In the longer term, Nuance’s marketing strategy is centered around the idea of the Visible Customer: organizations should take a holistic approach to customer care by integrating the currently siloed customer personas of repeat callers into a single view. Mass personalization means new revenue opportunities. Mr. Chambers organized an automated vote, in which the audience identified lower prices and advances in speech accuracy as the major factors for accelerating the adoption of speech technology in the telephony network.
Keynote speaker at the conference was Eckhard Geulen, Senior Executive VP of Marketing & Sales of Value Added Solutions at T-Com, Deutsche Telekom’s fixed-line branch. Dr. Geulen strongly pleaded in favor of customer-motivated speech initiatives as a necessary complement to the more traditional, organizationally motivated cost-cutting exercises. Speech technology should indeed be positioned in the larger context of value-added services, and not just as a replacement for expensive customer service representatives. Referring to the Visible Customer concept, Eckhard Geulen openly admitted that Deutsche Telekom still has a long way to go: today, the respective identity records of a customer switching from T-Com to T-Mobile (or vice versa) are not linked (yet). Another frank statement: “the [German or European] market isn’t there yet”. To offer better risk-return ratios for its customers, T-Com has invested heavily in a managed services model. This way, small and mid-size companies that are unable or unwilling to run their own voice platform can profitably develop speech initiatives without incurring heavy up-front investments. To further support market acceleration, T-Com has partnered with Genesys, Nuance and Voice Objects to create a Voice Community for the German-speaking market.
Thursday’s technical sessions featured the products OpenSpeech Dialog (OSD) and PromptSculptor, and also presented some best practices in information-driven VUI design and multi-lingual speech systems.
OpenSpeech Dialog supports a holistic approach to voice application development, based on the higher-level xHMI (eXtensible Human Machine Interface) language. xHMI was developed by Nuance and 20 partners to enable a “simpler and quicker” implementation of adaptive calls, i.e. automated calls that adapt sensibly to callers speaking in their own way. At runtime, the OSD application flow is controlled by a conversation manager that keeps track of which slots from the conversation memory still need to be filled. To implement the same functionality, xHMI code should be more compact and powerful than VoiceXML, which is more susceptible to “state explosion”. The compactness should be no surprise, as xHMI does not live next to VoiceXML but on top of it: the xHMI runtime processor in fact generates VoiceXML. To further ease voice application development, the next version of V-Builder (4.0) will support the creation of xHMI code through a graphical interface. This is one example of a “blue” Nuance tool (V-Builder) integrating with a tool from the ScanSoft/SpeechWorks legacy (OSD).
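The slot-filling idea behind the conversation manager can be illustrated with a minimal sketch. This is not actual OSD or xHMI code – all class and slot names below are made up – but it shows why a conversation-memory approach stays compact: one mixed-initiative utterance can fill several slots at once, and only the remaining gaps are prompted for, instead of enumerating every dialog state by hand.

```python
# Illustrative slot-filling dialog manager, loosely inspired by the
# conversation-memory idea described above. Hypothetical names throughout;
# this is NOT actual OSD/xHMI code.

class ConversationManager:
    """Tracks which slots are still unfilled and picks the next prompt."""

    def __init__(self, slots):
        # slot name -> prompt to play while the slot is still empty
        self.prompts = dict(slots)
        self.memory = {name: None for name in self.prompts}

    def fill(self, recognized):
        """Merge recognized slot values, e.g. from one mixed-initiative utterance."""
        for name, value in recognized.items():
            if name in self.memory:
                self.memory[name] = value

    def next_prompt(self):
        """Return the prompt for the first unfilled slot, or None when done."""
        for name, prompt in self.prompts.items():
            if self.memory[name] is None:
                return prompt
        return None


# A caller saying "I want to fly to Porto on Friday" fills two slots at once,
# so the manager only asks about the one slot that is still open.
booking = ConversationManager([
    ("destination", "Where would you like to fly to?"),
    ("date", "On what date?"),
    ("seat_class", "Economy or business class?"),
])
booking.fill({"destination": "Porto", "date": "Friday"})
print(booking.next_prompt())  # -> "Economy or business class?"
```

Expressing the same three-slot, mixed-initiative behavior as hand-written VoiceXML forms would require spelling out the transitions between partially filled states, which is exactly the “state explosion” the higher-level approach avoids.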
Comment: with respect to the xHMI vs. VoiceXML discussion, Nuance acknowledged that “innovation leads standardization by 3-5 years or more”. It remains to be seen whether xHMI will ever make it into an industry-backed standard as widely supported as VoiceXML is nowadays. Given the announced integration of xHMI, OSD and V-Builder, the first question developers should ask themselves is not whether they need xHMI or VoiceXML, but whether they need (and can afford!) a VoiceXML-generating tool at all. If the answer is positive, the next question is whether OSD/xHMI/V-Builder is suited for the job, as compared to e.g. VoiceObjects X5 or Audium (whose respective companies were, by the way, present at the conference as sponsors).
On the text-to-speech side, PromptSculptor allows VUI designers and implementers to tune statically or dynamically generated prompts. PromptSculptor’s GUI lets users manually edit any input text at the word or even phoneme level, by adapting, among other things, duration, pitch or stress. The adapted prompt elements are stored back in the acoustic database, where they can be used for offline or online TTS generation. PromptSculptor is Nuance’s counterpart to Loquendo’s TTS Director and, to a lesser extent, Acapela’s VirtualSpeaker. Next to PromptSculptor, Nuance’s senior TTS director Jan De Moortel also presented CustomVoices, a program/process allowing brand-aware companies to (have Nuance) develop their own custom TTS voice. Interested readers should count on about one month for voice talent selection, script selection and recording, and another 3 to 5 months for building the actual new TTS voice. The first audio samples are available about 6 weeks after the end of the recording sessions.
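PromptSculptor’s formats are proprietary, but the kind of word-level tuning described above can be approximated in the standard W3C SSML that most TTS engines accept. A minimal sketch – the helper names, the example sentence and the specific pitch/rate values are all made up for illustration:

```python
# Hedged sketch: word-level prosody tuning expressed as standard W3C SSML.
# PromptSculptor works on its own acoustic database; this only illustrates
# the concept of per-word pitch/rate adjustments in a generated prompt.

def tune(word, pitch="+10%", rate="90%"):
    """Wrap a single word in an SSML <prosody> element (hypothetical defaults)."""
    return f'<prosody pitch="{pitch}" rate="{rate}">{word}</prosody>'

def build_prompt(words, tuned):
    """Assemble an SSML prompt, adjusting only the words listed in `tuned`."""
    body = " ".join(tune(w) if w in tuned else w for w in words)
    return f'<speak version="1.0">{body}</speak>'

# Slow down and raise the key word so the amount is clearly articulated.
prompt = build_prompt(["Your", "balance", "is", "fifty", "euros"], tuned={"fifty"})
print(prompt)
```

A dedicated tool like PromptSculptor goes further by letting the designer work at the phoneme level and by persisting the tuned elements for reuse, which plain markup like this cannot do.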
To conclude the TTS session, Michel Arsac-England from VoltDelta International presented some lessons from a directory assistance (DA) implementation at Telix AG. He pointed out the importance of TTS quality to user acceptance: most people dislike mispronounced names. Of course, even with an acoustic tuning tool like PromptSculptor, input data quality remains a prerequisite for TTS quality, as the GIGO (garbage in, garbage out) principle fully applies.
From the session on multi-lingual speech applications, I recall the following lessons: modularize “just enough” to preserve flexibility, find experts on the local culture, be aware that requirements, expectations and success metrics depend on culture, and don’t assume that language equals culture. As always, best practices are easy to enumerate but harder to realize. Experience really makes the difference here.
On Friday I attended two sessions from the business track.
Christian Pereira, CEO of dtms Solutions, made the economic case for hosted and managed service models for speech applications. Mr. Pereira complemented Eckhard Geulen’s observations by pointing out the cost advantage of a mutualized, outsourced voice (application) platform. His company boasts 30% growth in platform capacity per month, which does not, however, require a commensurate growth in support personnel. In a non-mutualized environment, similar growth would be prohibitively costly due to the increase in operational staff, and would therefore hamper market development. By centralizing and mutualizing the voice (application) platform, the cost per port per year drops from approximately 3000 euros (in a 30-port setup) to only 1000 euros (in a 1000-port setup). The cost advantage is shared between the customer and the platform provider (i.e. dtms Solutions). Mr. Pereira identified professional services and network limitations as the bottlenecks in dtms Solutions’ growth path. These are addressed, respectively, by stable partnerships with external application development companies and by contracts with alternative network carriers.
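The economies of scale behind these figures are easy to check. A back-of-the-envelope calculation using only the numbers quoted above (the 30-port customer scenario is my own illustration, not one from the talk):

```python
# Back-of-the-envelope check of the per-port figures quoted above.
SMALL_PORTS, SMALL_COST = 30, 3000     # euros/port/year when running your own 30-port setup
LARGE_COST = 1000                      # euros/port/year on a mutualized 1000-port platform

# Yearly saving for a hypothetical customer who needs 30 ports but buys them
# at the mutualized rate instead of operating its own platform:
saving = SMALL_PORTS * (SMALL_COST - LARGE_COST)
print(saving)  # -> 60000 (euros per year)
```

At a two-thirds reduction per port, even a small 30-port deployment frees a budget that can be split between the customer and the platform provider, which is exactly the shared cost advantage Mr. Pereira described.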
Comment: Mr. Pereira singled out T-Com as his largest competitor, but that may only be true as far as the German-speaking market is concerned. For the moment, dtms Solutions clearly focuses on that market, as their website does not even exist in English (yet). Similar companies like Monaco-based MAP Telecom openly target the pan-European market, but it is unclear today whether that broader focus (a contradiction in terms?) is an asset or a liability. As the market matures and the demand for speech-driven phone applications rises, the value of a stable partner network with local ramifications – remember the importance of culture – may indeed prove to be the differentiating factor.
In the next session, Peter Mahoney, Nuance’s Vice President of Worldwide Marketing, explained how to launch a speech application. Surprisingly often, speech projects focus on planning and building the application, but then neglect to bring it to market in a structured, controlled way. As user acceptance leads to loyalty and drives ROI, no one can really afford to neglect this last part of the project. Mr. Mahoney presented best practices such as building a launch team and involving the marcom function, as well as potentially affected customer agents, up front. In other words, he pleaded for not neglecting the internal audience; their acceptance of the speech initiative is just as important for overall success as the external end customers’ experience.
Nuance Conversations Europe was concluded with a presentation by Vlad Sejnoha, Nuance’s Chief Scientist. The main message I recall is that the traditional categories that prevail today in the speech world (network speech, embedded speech, dictation) are bound to become blurred in the coming decade.
All in all, it was a great event in a great setting; if speech technology and/or its European market penetration indeed evolve as quickly as announced over the next 12 to 24 months, I’d love to be present next time as well.