Recognition of Recorded Utterances (RRU)

Overview

The VoiceGenie platform now allows VoiceXML applications to be run using recorded audio as speech input.

This document explains how to use this feature in your application. It covers the following topics: Background, Getting Started, Usage, Usage Notes, and Summary.

Background

Two typical uses for this feature are tuning applications and running maintenance checks on a deployed application.

Tuning applications

To tune an application for better recognition accuracy, you need to assess performance before and after each tuning adjustment. To be sure that any change in performance comes from the adjustments, all other factors must be held constant, including the input utterances. With this feature, you can reuse a set of utterances (either recorded by testers from a call script, or recordings of actual callers) as input to the application while you make changes to the grammars and properties.

Running maintenance checks

Once an application is deployed, you don't want to wait for negative feedback from your users to find out that something isn't working properly. With this feature, you can create test scripts that periodically "call" the application. You can then check the results of these calls to make sure that the ASR and your grammars are correctly accepting speech input.

Getting Started

This feature is very simple to use. For each <field> in your application, if you want the recognizer to use recorded audio as speech input, specify the audio source with the audioinexpr attribute. Otherwise, the recognizer will wait for speech input to come from the caller, as usual.

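This attribute should specify one of the following sources:

- a full http URI
- a full file URI (audio stored on the platform)
- a relative URI
- a reference to builtin audio on the platform
- a <record> field variable
- a $.utteranceaudio shadow variable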

Note: If you own a VoiceGenie platform, you can put audio files on the platform and reference them with audioinexpr="'file:///file path'". Alternatively, if you put the audio files in VoiceGenie's builtin "audio" directory (or any subdirectory of "audio"), you can reference them with audioinexpr="'builtin:file path, relative to audio directory'".
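
As a rough illustration of the paragraph above, the sketch below puts one field that takes recorded input next to one that listens to the live caller. The path test1/utterance.vox and the field names are placeholders, not files or names known to exist on any platform. The literal URI is wrapped in single quotes inside the attribute value, following the examples in this document, which suggests the value is evaluated as an expression.

```xml
<?xml version="1.0"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <property name="bargein" value="false"/>
  <form>
    <!-- Recorded input: audioinexpr names the audio source.
         test1/utterance.vox is a placeholder path, relative to
         the builtin "audio" directory. -->
    <field name="recorded" audioinexpr="'builtin:test1/utterance.vox'">
      <prompt>Please say goodbye.</prompt>
      <grammar xml:lang="en-US" version="1.0" root="ROOT"
               xmlns="http://www.w3.org/2001/06/grammar"
               type="application/srgs+xml">
        <rule id="ROOT" scope="public">
          <item>goodbye</item>
        </rule>
      </grammar>
      <!-- The same recording would be replayed on a retry, so exit -->
      <catch event="nomatch noinput">
        <exit/>
      </catch>
    </field>

    <!-- No audioinexpr: the recognizer waits for the caller, as usual -->
    <field name="live">
      <prompt>Please say goodbye.</prompt>
      <grammar xml:lang="en-US" version="1.0" root="ROOT"
               xmlns="http://www.w3.org/2001/06/grammar"
               type="application/srgs+xml">
        <rule id="ROOT" scope="public">
          <item>goodbye</item>
        </rule>
      </grammar>
    </field>
  </form>
</vxml>
```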

Usage

Here are three examples that show <field>s receiving their input from different sources of recorded audio.

 

Note: A caller will not be able to hear the recorded audio input.

 

Example 1. Speech input from an audio file. The recognition results are sent to a script that processes them, so the application's recognition performance can be assessed. This example only illustrates the use of the feature; no implementation is given for stats.jsp.


<?xml version="1.0"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <meta name="maintainer" content="yourname@yourserver.com"/>
  <meta name="application" content="Recorded Audio Input 1"/>
  <property name="bargein" value="false"/>

  <form>
    <!-- The caller will not hear the input -->
    <field name="field1"
           audioinexpr="'http://developer.voicegenie.com/libraries/audio/common/goodbye.vox'">
      <prompt>Please say goodbye.</prompt>
      <grammar xml:lang="en-US" version="1.0" root="ROOT"
               xmlns="http://www.w3.org/2001/06/grammar"
               type="application/srgs+xml">
        <rule id="ROOT" scope="public">
          <item>goodbye</item>
        </rule>
      </grammar>
      <nomatch>
        <var name="field1" expr="'nomatch'"/>
        <submit next="stats.jsp" namelist="field1"/>
      </nomatch>
      <noinput>
        <var name="field1" expr="'noinput'"/>
        <submit next="stats.jsp" namelist="field1"/>
      </noinput>
      <filled>
        <var name="confidence" expr="field1$.confidence"/>
        <submit next="stats.jsp" namelist="field1 confidence"/>
      </filled>
    </field>
  </form>
</vxml>

Example 2. Speech input from the result of an earlier <record>.


<?xml version="1.0"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <meta name="maintainer" content="yourname@yourserver.com"/>
  <meta name="application" content="Recorded Audio Input 2"/>
  <property name="bargein" value="false"/>

  <form>
    <record name="recordaudio" beep="true" dtmfterm="true">
      <prompt>
        At the tone, please say goodbye, then press the pound key.
      </prompt>
    </record>

    <!-- Use recording from above as input here -->
    <!-- The caller will not hear the input -->
    <field name="field1" audioinexpr="recordaudio">
      <prompt>Please say goodbye.</prompt>
      <grammar xml:lang="en-US" version="1.0" root="ROOT"
               xmlns="http://www.w3.org/2001/06/grammar"
               type="application/srgs+xml">
        <rule id="ROOT" scope="public">
          <item>goodbye</item>
        </rule>
      </grammar>
      <nomatch>
        I didn't understand your recording.
        <exit/>
      </nomatch>
      <noinput>
        I didn't hear your recording.
        <exit/>
      </noinput>
      <filled>
        I heard your recording say <value expr="field1"/>.
      </filled>
    </field>
  </form>
</vxml>

Example 3. Speech input from the recording of the caller's earlier input. Check out the tutorial on saving caller utterances for more information.

Note: The following example uses the tag format that is supported by OSR (<tag>command='goodbye';</tag>). If you want to run this example with a different ASR engine, confirm the format supported by that engine and, if necessary, modify the tag content below.


<?xml version="1.0"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <meta name="maintainer" content="yourname@yourserver.com"/>
  <meta name="application" content="Recorded Audio Input 3"/>
  <property name="ASRENGINE" value="SPEECHWORKS"/>
  <property name="bargein" value="false"/>

  <form>
    <grammar xml:lang="en-US" version="1.0" root="ROOT"
             xmlns="http://www.w3.org/2001/06/grammar"
             type="application/srgs+xml">
      <rule id="ROOT" scope="public">
        <item>
          goodbye
          <tag>command='goodbye';</tag>
        </item>
      </rule>
    </grammar>

    <field name="field1" saveutterance="true" slot="command">
      <prompt>Please say goodbye.</prompt>
      <catch event="nomatch noinput">
        Try again.
        <reprompt/>
      </catch>
      <filled>
        I recognized goodbye with a confidence of
        <value expr="field1$.confidence"/>.
        Let's make sure I'm a consistent recognizer.
      </filled>
    </field>

    <!-- Use recording of caller's last input as input here -->
    <!-- The caller will not hear the input -->
    <field name="field2" audioinexpr="field1$.utteranceaudio" slot="command">
      <prompt>Please say goodbye.</prompt>
      <filled>
        I recognized goodbye with a confidence of
        <value expr="field2$.confidence"/>.
        <if cond="field1$.confidence == field2$.confidence">
          See, I am a consistent recognizer!
        <else/>
          <!-- This should not happen -->
          I guess I'm not a consistent recognizer.
        </if>
      </filled>
    </field>
  </form>
</vxml>

Usage Notes

  1. A caller will not be able to hear the recorded audio input.
  2. When this feature is in use, any speech or background noise coming into the phone is ignored.
  3. If the attribute is empty or contains an empty expression, it will be ignored and the recognizer will listen for input from the caller.
  4. If the recorded audio input cannot be recognized, the appropriate event (nomatch or noinput) is thrown as expected. The application is responsible for providing logic in the <nomatch> and <noinput> handlers to leave the field in this case: by hanging up or exiting, by transitioning to another field, form, or document, by setting the field variable or cond attribute, and so on. If no such logic is provided (i.e., if the handlers simply prompt to retry recognition), an infinite loop will occur, since the same input is used each time.

    See examples 1 and 2, for an illustration of leaving the field if a nomatch/noinput occurs, to prevent infinite looping. Note that in example 3, a nomatch/noinput will never occur in the field that uses RRU, since the input is the recognized utterance from the first field, so there is no need for nomatch/noinput handlers in that field.

    If you would like to fall back to collecting input from the live caller if the RRU is unsuccessful, you could do something like the following:

    <?xml version="1.0"?>
    <vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
      <property name="bargein" value="false"/>

      <form>
        <var name="autoinput"
             expr="'http://developer.voicegenie.com/libraries/audio/horoscopes/zodiacs.vox'"/>

        <field name="input" audioinexpr="autoinput">
          <prompt>Please say test.</prompt>
          <grammar xml:lang="en-US" version="1.0" root="ROOT"
                   xmlns="http://www.w3.org/2001/06/grammar"
                   type="application/srgs+xml">
            <rule id="ROOT" scope="public"> test </rule>
          </grammar>
          <nomatch>
            <assign name="autoinput" expr=""/>
            Sorry, I didn't understand the automated input.
            Now I'll listen to the live caller.
            <reprompt/>
          </nomatch>
          <filled>
            You said <value expr="input"/>.
          </filled>
        </field>
      </form>
    </vxml>

  5. This feature works both with bargein enabled and disabled, but it is recommended that bargein be disabled to ensure that all queued prompts are played properly.
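
Usage note 4 mentions setting the field variable as one way to leave a field after a failed recognition. The following is a minimal sketch of that approach, assuming standard VoiceXML 2.0 form interpretation, in which a field whose variable already has a value is skipped. The builtin: path, the sentinel value 'unrecognized', and the follow-up <block> are placeholders chosen for illustration.

```xml
<?xml version="1.0"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <property name="bargein" value="false"/>
  <form>
    <!-- Placeholder audio path; substitute your own recording -->
    <field name="field1" audioinexpr="'builtin:test1/utterance.vox'">
      <prompt>Please say goodbye.</prompt>
      <grammar xml:lang="en-US" version="1.0" root="ROOT"
               xmlns="http://www.w3.org/2001/06/grammar"
               type="application/srgs+xml">
        <rule id="ROOT" scope="public">
          <item>goodbye</item>
        </rule>
      </grammar>
      <!-- A retry would replay the same recording, so give the field
           variable a sentinel value; the form then skips this field
           and moves on to the next item -->
      <catch event="nomatch noinput">
        <assign name="field1" expr="'unrecognized'"/>
      </catch>
    </field>

    <block>
      <if cond="field1 == 'unrecognized'">
        The recorded input was not recognized.
      <else/>
        I heard <value expr="field1"/>.
      </if>
    </block>
  </form>
</vxml>
```

Compared with calling <exit/> in the handler, this keeps the call alive, so later form items can still run or report the failure.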

Summary

Here is the <field> attribute that is used to indicate that speech input is coming from recorded audio, and what the source of that audio is:

Attribute     Possible Values
audioinexpr   - full http URI, e.g. audioinexpr="'http://blah.com/audio/utterance.vox'"
              - full file URI (audio on the platform), e.g. audioinexpr="'file:///usr/local/phoneweb/blah/utterance.vox'"
              - relative URI, e.g. audioinexpr="'audio/utterance.vox'"
              - reference to builtin audio on the platform, e.g. audioinexpr="'builtin:test1/utterance.vox'"
              - <record> field variable, e.g. audioinexpr="recording1"
              - $.utteranceaudio shadow variable, e.g. audioinexpr="field1$.utteranceaudio"

Here are the supported audio formats:

File Format      Extension   Sample Rate   Encoding
Dialogic Vox     .vox        8 kHz         u-law, a-law
Microsoft WAVE   .wav        8 kHz         u-law, a-law
AU Audio         .au         8 kHz         u-law, a-law
NIST Sphere      .wav        8 kHz         u-law, a-law