Dialogue
• Call flow: The structure of how a system branches based on callers’ responses.
• CCXML: Call Control eXtensible Markup Language. Provides telephony call control support for speech applications (whether based on VoiceXML or not).
• MRCP: Media Resource Control Protocol. Standard interaction protocol between server components of a speech platform.
• SALT: Speech Application Language Tags. Set of extensions to existing mark-up languages, developed by the SALT Forum and submitted to the W3C, that enable multimodal and telephony access to information, applications, and Web services from PCs, telephones, and wireless personal digital assistants (PDAs).
• VUI: Voice User Interface. Set of interaction elements of the speech system, which drives the caller experience. Includes the “sound and feel” of the application. Crucial success factor of any speech application.
• VXML = VoiceXML: Voice eXtensible Markup Language. An XML-based document format for describing an automated dialogue between a caller and a system. VoiceXML is to a speech-enabled phone application what HTML is to a web application. W3C-backed industry standard, widely supported.
• VoiceXML Browser = VoiceXML Interpreter: server software that interprets VoiceXML code, and manages the automated dialogue between caller and system. Central component of a VoiceXML platform.
• VoiceXML Platform: set of server-based software components, typically consisting of a VoiceXML browser, an ASR server, a TTS server, and various file caching servers.
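To make the call flow entry above concrete, here is a minimal Python sketch of a dialogue that branches on caller responses. It is only an illustration: the state names, prompts, and accepted phrases are hypothetical, and a real platform would express this in VoiceXML and take the responses from an ASR or DTMF recognizer rather than from a list of strings.

```python
# Hypothetical call flow: every state name, prompt, and phrase below is
# invented for illustration and does not come from any real application.
CALL_FLOW = {
    "main_menu": {
        "prompt": "Would you like balance or transfers?",
        "transitions": {"balance": "balance", "transfers": "transfers"},
    },
    "balance": {"prompt": "Here is your balance.", "transitions": {}},
    "transfers": {
        "prompt": "Transfer to savings or checking?",
        "transitions": {"savings": "confirm", "checking": "confirm"},
    },
    "confirm": {"prompt": "Done. Goodbye.", "transitions": {}},
}


def next_state(state, response):
    """Return the state the flow branches to; stay in place if nothing matches."""
    normalized = response.strip().lower()
    return CALL_FLOW[state]["transitions"].get(normalized, state)


def run_call(responses, start="main_menu"):
    """Walk the call flow for a list of caller responses; return visited states."""
    state = start
    visited = [state]
    for response in responses:
        if not CALL_FLOW[state]["transitions"]:
            break  # terminal state: no further branching
        state = next_state(state, response)
        visited.append(state)
    return visited
```

For example, `run_call(["transfers", "savings"])` visits `main_menu`, then `transfers`, then `confirm`. An unrecognized response leaves the caller in the same state, which is where a real application would typically replay the prompt.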
Input (caller to system)
• ASR: Automatic Speech Recognition. In a telephony context, modern ASR engines are speaker-independent, which means that they do not need to be trained by individual callers before usage.
• Barge-in: The ability for a caller to interrupt a system prompt before it has finished.
• DTMF: Dual-Tone Multi-Frequency. Also called touchtone. In a speech application context, DTMF or touchtone-based input contrasts with input based on the human voice.
• Grammar: Set of rules defining what the recognition engine is able to recognize in a specific dialogue state. Popular grammar formats include GSL and SRGS.
• GSL: Grammar Specification Language created by Nuance Communications. Proprietary but popular format for specifying ASR grammars, supported by the Nuance ASR engine.
• NLU: Natural Language Understanding. The ability to understand complex caller input spoken in a natural, free-style manner.
• SRGS: Speech Recognition Grammar Specification. W3C standard for specifying ASR grammars, with an XML form and an equivalent ABNF form, supported by various ASR engines. Vendor-neutral.
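The "dual-tone" in the DTMF entry above refers to each key being signalled as the sum of two sine waves, one from a low-frequency row group and one from a high-frequency column group. The Python sketch below shows that mapping using the standard keypad frequencies; the function names, the 8 kHz sample rate, and the 0.1 s default duration are assumptions chosen for illustration, not part of any platform API.

```python
import math

# Standard DTMF keypad layout: one row (low) tone plus one column (high) tone, in Hz.
ROW_HZ = (697, 770, 852, 941)
COL_HZ = (1209, 1336, 1477, 1633)
KEYPAD = ("123A", "456B", "789C", "*0#D")


def dtmf_frequencies(key):
    """Return the (low, high) frequency pair in Hz for a single keypad key."""
    if len(key) == 1:
        for row, keys in enumerate(KEYPAD):
            col = keys.find(key)
            if col != -1:
                return ROW_HZ[row], COL_HZ[col]
    raise ValueError(f"not a DTMF key: {key!r}")


def dtmf_samples(key, duration_s=0.1, sample_rate=8000):
    """Return float samples in [-1, 1] for the tone pair of one key.

    The sample rate and duration defaults are illustrative assumptions.
    """
    low, high = dtmf_frequencies(key)
    count = round(duration_s * sample_rate)
    return [
        0.5 * (math.sin(2 * math.pi * low * n / sample_rate)
               + math.sin(2 * math.pi * high * n / sample_rate))
        for n in range(count)
    ]
```

Here `dtmf_frequencies("5")` gives the pair 770 Hz and 1336 Hz. Detecting these pairs on the receiving side is a separate problem that this sketch does not cover.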
Output (system to caller)
• Audio file: Digital sound file that the computer plays to a caller. Contains (part of) a system prompt, or an earcon, or a mixture of both.
• Call script: A list of prompts to be recorded by a voice talent.
• Earcon = audio icon: A short, sometimes musical, sound.
• TTS: Text-To-Speech. Technology whereby a computer pronounces previously unseen words and sentences. Also known as Speech Synthesis.