DARPA Communicator Testbed
Log Standard Proposal (v3)
Introduction
This document is intended to establish standards
for logfile contents and format. We will try to determine the smallest
set of data that is sufficient to re-run a system and that also supports
meaningful metrics. This set may vary depending on how much of the system
is to be re-run and on what we would like to measure. In the process we
will attempt to establish a standard format which all logfiles can be
converted to (or generated in, although we foresee that at least a minimal
amount of inferencing might be required to render the logs in this form).
A further goal of this document is to provide a standard that is flexible
and general enough to be used in different domains.
In order to accomplish this goal, we will
propose an XML DTD which records the basic events in a Communicator-compliant
system. These events can be annotated with type information indicating that
a data element is "significant" from the point of view of annotators (and
annotation tools).
To clarify, we will use the following terms
(the definitions are by no means final and are open to suggestion):
- Session - The interaction of a user with the system. In our current
  demonstration the equivalent of a phone call. A session is composed of
  a set of turns.
- Turn - The set of operations performed by the system in the course of
  processing and presenting a single dialogue participant's utterance.
- Operation - Every command executed by the system within a turn. Every
  operation can send and receive data.
- Data - A set of key/value pairs.
The definition of "turn" requires special
attention. In some accounts, a turn is an exchange between user and system.
In a robust dialogue context, this definition fails to be adequate when
the user or system barges in with follow-up information, etc., or when
the dialogue involves more than two parties (a situation which we shouldn't
rule out). We propose that the term "turn" in the context of these log
files be reserved for the processing of a single participant's utterance
(either user or system). This definition is not without its problems. For
instance, it's not clear whether a call to the backend belongs at the end
of the processing of a user's utterance (because it's the presentation
of the utterance to the backend) or the beginning of the processing of
the system's utterance (because it's the source of the system's response).
We currently know of nothing in the data analysis that hinges on this
decision, and recommend that either interpretation be accepted for the
moment.
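The four terms defined above can be pictured as nested containers. The following Python sketch is only an illustration of that nesting, not part of the proposal; the class names and the `participant` field are assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Operation:
    """One command executed by the system within a turn."""
    name: str
    inputs: Dict[str, str] = field(default_factory=dict)   # data received
    outputs: Dict[str, str] = field(default_factory=dict)  # data sent


@dataclass
class Turn:
    """Processing and presentation of one participant's utterance."""
    participant: str  # hypothetical field: "user" or "system"
    operations: List[Operation] = field(default_factory=list)


@dataclass
class Session:
    """One user interaction (e.g. a phone call), made up of turns."""
    turns: List[Turn] = field(default_factory=list)


# A session whose first turn is the system greeting.
session = Session()
greeting = Turn(participant="system")
greeting.operations.append(
    Operation(name="paraphrase_reply", inputs={"tidx": "3813"})
)
session.turns.append(greeting)
```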
Content
Here we discuss the granularity
of data to be logged in an end-to-end system. The contents of the table below
were derived mainly from the information needed by MITRE to do its own
internal evaluation and will probably change as the perspectives of other
sites are incorporated. Every log should contain enough information to
determine the following (here input refers to the user sending information
to the system and output refers to the system sending information to the
user). Ideally, all this information should be extractable from the log
file without any site-specific analysis. In this table, we describe the
data to be logged, whether it's optional or obligatory, and how we propose
to standardize access to the data:
| Data | Obligatory | Standard access |
|---|---|---|
| Duration of session | yes | readable directly off the XML representation proposed below |
| Duration of turn (input or output) | yes | readable directly off the XML representation proposed below |
| Duration of generation of output (in a phone demo, the time the synthesizer takes to generate the audio file) | yes | see 1 |
| Duration of display of output (in a phone demo, how long it takes to play the audio file) | yes | see 2 |
| Duration of recognition of input (in a phone demo, how long it takes the recognizer to produce its hypotheses) | yes | see 3 |
| Duration of arbitrary operations | no | readable directly off the XML representation proposed below |
| Number of turns within a session | yes | readable directly off the XML representation proposed below |
| Number of sessions (in our current model each session is its own logfile) | yes | readable directly off the XML representation proposed below |
| The audio files corresponding to the user input and system output and their formats. The audio files should be stored and distributed with the logs, and the pathnames of these files should be relative to the log. | yes | accessed given an arbitrary search of the logged data (see the "audio_input" and "audio_output" values for the type attribute of the GC_DATA tag, as well as the "mime_type" attribute) |
| The text of the user input chosen by the system | yes | accessed given an arbitrary search of the logged data (see the "text_input" value for the type attribute of the GC_DATA tag) |
| The text of the system output | yes | accessed given an arbitrary search of the logged data (see the "text_output" value for the type attribute of the GC_DATA tag) |
| All possible input sentences (from the recognizer) up to a certain limit (TBD) (N/A to systems that use a word lattice) | no | accessed given an arbitrary search of the logged data (see the "text_input_hypothesis" value for the type attribute of the GC_DATA tag) |
| Indication of whether the parse succeeded | no | see 4 |
| The full input interpretation | no | accessed given an arbitrary search of the logged data |
The elements which may pose minor complications
are marked "see" in the table. Here we make tentative proposals for each
of these:
1. Duration of output generation. In a system where there is a single,
   obvious call to the synthesizer, this is simply the duration of that
   operation, but this is only one possible configuration. We propose that
   the "type" attribute be added to the GC_OPERATION element and that a
   "virtual" operation be generated by a postprocess phase with a
   distinguished type (say, "synthesis_duration"); alternatively, we could
   introduce a new XML element (say, GC_EVENT) reserved for these "virtual"
   events.
2. Duration of output presentation. In the MIT system, this is an inference
   from notifications posted by the audio server (playing_has_begun,
   playing_has_ended; see the Communicator documentation for the MIT audio
   server). This could be handled similarly to output generation, or we
   could add optional start and end time attributes to the GC_DATA element
   which contains the audio file.
3. Duration of recognition. Again, we propose to handle this similarly to
   output generation.
4. Indication of whether the parse succeeded. Again, this is frequently an
   inference. We can insert a distinguished GC_DATA element (say, with a
   type of "input_parse_successful").
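The "virtual" operation proposed for output generation could be produced by a postprocess along the following lines. This is a minimal Python sketch under stated assumptions: the server name "synth", the operation names, and the port in the sample turn are hypothetical, and the virtual operation simply spans from the earliest start to the latest end of all calls to that server.

```python
import xml.etree.ElementTree as ET

# Hypothetical turn with two calls to a server named "synth".
SAMPLE_TURN = """
<GC_TURN id="-01" stime="930254422.720000" etime="930254424.790000">
  <GC_OPERATION server="synth" turnid="-01" location="localhost:12000"
                name="synthesize" stime="930254422.800000"
                etime="930254423.100000"/>
  <GC_OPERATION server="synth" turnid="-01" location="localhost:12000"
                name="write_audio" stime="930254423.100000"
                etime="930254423.450000"/>
</GC_TURN>
"""


def add_synthesis_duration(turn, synth_server="synth"):
    """Append one virtual GC_OPERATION spanning all calls to synth_server."""
    calls = [op for op in turn.findall("GC_OPERATION")
             if op.get("server") == synth_server]
    if not calls:
        return None
    # Compare times as floats, but copy the original strings to the output.
    first = min(calls, key=lambda op: float(op.get("stime")))
    last = max(calls, key=lambda op: float(op.get("etime")))
    return ET.SubElement(turn, "GC_OPERATION", {
        "type": "synthesis_duration",
        "turnid": turn.get("id"),
        "server": synth_server,
        "location": first.get("location"),
        "name": "synthesis_duration",
        "stime": first.get("stime"),
        "etime": last.get("etime"),
    })


turn = ET.fromstring(SAMPLE_TURN)
virtual = add_synthesis_duration(turn)
print(ET.tostring(virtual, encoding="unicode"))
```

The same approach would apply to recognition duration, with a different distinguished type.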
We believe that this sort of proposal will
allow sites to gather data in the form they prefer, and augment it with
sharable semantics in such a way that individual sites' data will retain
its site-specific integrity.
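To illustrate the two kinds of standard access described in this section, the sketch below reads durations and turn counts directly off the XML and searches the logged data by the type attribute of GC_DATA. It is written in Python against the element names proposed in this document; the sample log, the ":input_string" key, and the utterance texts are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-turn log using the element names proposed here.
SAMPLE_LOG = """
<GC_LOG>
  <GC_SESSION id="129.10.2.200:1010:3" stime="930254422.720000"
              etime="930254434.790000">
    <GC_TURN id="-01" stime="930254422.720000" etime="930254424.790000">
      <GC_DATA key=":reply_string" type="text_output">How can I help you?</GC_DATA>
    </GC_TURN>
    <GC_TURN id="000" stime="930254424.790000" etime="930254430.000000">
      <GC_DATA key=":input_string" type="text_input">I want to fly to Boston</GC_DATA>
    </GC_TURN>
  </GC_SESSION>
</GC_LOG>
"""


def duration(element):
    """Elapsed time between an element's stime and etime attributes."""
    return float(element.get("etime")) - float(element.get("stime"))


def data_of_type(root, wanted):
    """Text of every GC_DATA whose type attribute lists `wanted`."""
    # type is declared NMTOKENS, so it may hold several space-separated values.
    return [(d.text or "").strip() for d in root.iter("GC_DATA")
            if wanted in (d.get("type") or "").split()]


log = ET.fromstring(SAMPLE_LOG)
session = log.find("GC_SESSION")
turns = session.findall("GC_TURN")

print("session duration:", round(duration(session), 2))
print("number of turns:", len(turns))
print("user input:", data_of_type(log, "text_input"))
```

Neither function depends on site-specific knowledge; both rely only on the attribute names in the proposed format.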
Format
We believe that XML would be a good candidate
language for this format for many reasons, among them the growing supply
of viewers and editors and the variety of parsers available in
many programming languages.
We propose that operations should be logged
as single XML elements. For example:
<operation server="nl" turnid="-01" location="localhost:11000"
           name="paraphrase_reply"
           stime="930254422.720000" etime="930254422.790000">
  <data type="input" key="tidx">3813</data>
  <data type="output" key=":reply_string">
    Hi! Welcome to MITRE's Travel demonstration.
    This call is being recorded for system development. You may hang up or
    ask for help at any time. How can I help you?
  </data>
</operation>
Since in our distributed architecture
messages are sent asynchronously, and many events may occur before the
completion of an operation, some caching (or post processing) will be necessary
to log operations as single elements.
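The caching mentioned above might look like the following Python sketch. It assumes, purely for illustration, that each request and its reply share a correlation identifier (called `msg_id` here); the class and method names are hypothetical and do not correspond to any Hub API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class PendingOperation:
    """An operation whose request was seen but whose reply has not arrived."""
    server: str
    name: str
    stime: float
    data: List[Tuple[str, str, str]] = field(default_factory=list)


class OperationCache:
    """Hold operations open until their reply arrives, then emit one record."""

    def __init__(self):
        self.pending: Dict[int, PendingOperation] = {}
        self.completed: List[dict] = []

    def on_request(self, msg_id, server, name, time, inputs):
        op = PendingOperation(server, name, time)
        op.data.extend(("input", k, v) for k, v in inputs.items())
        self.pending[msg_id] = op

    def on_reply(self, msg_id, time, outputs):
        op = self.pending.pop(msg_id)  # KeyError signals an unmatched reply
        op.data.extend(("output", k, v) for k, v in outputs.items())
        self.completed.append({
            "server": op.server, "name": op.name,
            "stime": op.stime, "etime": time, "data": op.data,
        })


cache = OperationCache()
cache.on_request(1, "nl", "paraphrase_reply", 930254422.72, {"tidx": "3813"})
cache.on_request(2, "synth", "synthesize", 930254422.75, {})  # interleaved
cache.on_reply(1, 930254422.79, {":reply_string": "Hi!"})
```

After the last call, operation 1 is complete and can be written as a single element, while operation 2 remains cached until its reply is seen.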
Next we define the main entities
in the logfile and their formats. A DTD which defines these terms and their
relations is given at the end of this document. All time values are assumed
to use a standard base time known as "the epoch": the number of milliseconds
since January 1, 1970, 00:00:00 GMT. (Note that the sample values in the
examples in this document, such as 930254422.720000, are written as seconds
with a fractional part.)
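As a sanity check on time values, an epoch timestamp can be converted to a calendar time. The Python sketch below accepts either unit; the function name and the `unit` parameter are assumptions made for this illustration. Read as seconds, the sample value 930254422.720000 used in this document falls on 24 June 1999, which agrees with the date embedded in the sample audio file name.

```python
from datetime import datetime, timezone


def epoch_to_utc(value, unit="s"):
    """Convert an epoch timestamp string to an aware UTC datetime."""
    seconds = float(value) / 1000.0 if unit == "ms" else float(value)
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


# The same instant written as fractional seconds and as whole milliseconds.
as_seconds = epoch_to_utc("930254422.720000")
as_millis = epoch_to_utc("930254422720", unit="ms")
print(as_seconds.isoformat())
print(as_millis.isoformat())
```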
GC_SESSION
A session represents an interaction
of a user with the system. In our current demo, this is the equivalent of
a phone call. The elements in this table refer to the XML DTD.

| Name | Description | Type | Required |
|---|---|---|---|
| id | We should attempt to determine a unique identifier for sessions. MIT's solution uses the format (IP:process id:session counter). Process ids might not be trivial to obtain in every programming language and OS; however, "equivalent" data are usually available. | string | yes |
| stime | time when session started | milliseconds | yes |
| etime | time when session finished | milliseconds | yes |
Example:
<GC_SESSION id="129.10.2.200:1010:3"
            stime="930254422.720000" etime="930254434.790000">
  ...
</GC_SESSION>
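A session identifier in the IP:process id:session counter format could be generated as in the Python sketch below. The function name and the loopback fallback are assumptions made here, not a description of MIT's implementation.

```python
import itertools
import os
import socket

# Per-process counter; each call to make_session_id() consumes one value.
_session_counter = itertools.count(1)


def make_session_id(ip=None, pid=None):
    """Build an 'IP:process id:session counter' identifier."""
    if ip is None:
        try:
            ip = socket.gethostbyname(socket.gethostname())
        except OSError:
            ip = "127.0.0.1"  # fallback when the host name does not resolve
    if pid is None:
        pid = os.getpid()
    return f"{ip}:{pid}:{next(_session_counter)}"


# Fixed values are passed here only to make the result reproducible.
first = make_session_id(ip="129.10.2.200", pid=1010)
second = make_session_id(ip="129.10.2.200", pid=1010)
print(first, second)
```

On platforms without a usable process id, any value that is unique per running system instance could be passed as `pid`.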
GC_TURN
A turn covers the processing and presentation of a single participant's
utterance (either user or system), as discussed in the introduction.
The elements in this table refer to the XML DTD.

| Name | Description | Type | Required |
|---|---|---|---|
| id | A unique identifier within each session | number | yes |
| stime | time when turn started | milliseconds | yes |
| etime | time when turn ended | milliseconds | yes |
Example:
<GC_TURN id="-01" stime="930254422.720000" etime="930254424.790000">
  ...
</GC_TURN>
GC_OPERATION
Every command executed by the system within
a turn. All operations can send and receive data, frames or audio files.
The elements in this table refer to the XML DTD.

| Name | Description | Type | Required |
|---|---|---|---|
| type | the type of operation being executed (specific values TBD) | string | no |
| turnid | the turn id that this operation was executed under | number | yes |
| stime | time when operation started | milliseconds | yes |
| etime | time when operation ended | milliseconds | yes |
| server | the name (according to the program file) of the server that executed the operation | string | yes |
| location | the server (real server name or IP address) and its port (server_name:port_number) | string | yes |
| name | the name of the operation | string | yes |
Example:
<GC_OPERATION server="nl" turnid="-01" location="localhost:11000"
              name="paraphrase_reply"
              stime="930254422.720000" etime="930254422.790000">
  <GC_DATA type="input" key="tidx">3813</GC_DATA>
  <GC_DATA type="output" key=":reply_string">
    Hi! Welcome to Mitre's Travel demonstration.
    This call is being recorded for system development. You may hang up or
    ask for help at any time. How can I help you?
  </GC_DATA>
</GC_OPERATION>
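An element of this shape can be produced with a standard XML library, which also takes care of escaping special characters in logged text. The following Python sketch uses the standard library's ElementTree; the helper name and the reply text are hypothetical.

```python
import xml.etree.ElementTree as ET


def operation_element(server, turnid, location, name, stime, etime, data):
    """Build a GC_OPERATION element; data is a list of (type, key, value)."""
    op = ET.Element("GC_OPERATION", {
        "server": server, "turnid": turnid, "location": location,
        "name": name, "stime": f"{stime:.6f}", "etime": f"{etime:.6f}",
    })
    for data_type, key, value in data:
        item = ET.SubElement(op, "GC_DATA", {"type": data_type, "key": key})
        item.text = value  # ElementTree escapes &, < and > on output
    return op


op = operation_element(
    "nl", "-01", "localhost:11000", "paraphrase_reply",
    930254422.72, 930254422.79,
    [("input", "tidx", "3813"),
     ("output", ":reply_string", "Flights < $200 & nonstop")],
)
xml_text = ET.tostring(op, encoding="unicode")
print(xml_text)
```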
GC_DATA
A key/value pair. This datatype can be used
to record the information involved in an operation, as well as
the contents of a GC_FRAME. The elements in this table refer to the XML DTD.

| Name | Description | Type | Required |
|---|---|---|---|
| key | the name of this data point | string | yes |
| turnid | the turn id that this data point was logged under | number | no |
| time | time stamp for this data point | milliseconds | no |
| type | valid values of type include audio_input, audio_output, text_input, text_output, text_input_hypothesis, and concept. See the Content section. | string | no |
| mime_type | the mime type of the data | string | no |
Examples:
<GC_DATA key=":synth_log_filename" turnid="-01"
         type="audio_output" mime_type="audio/wav">
  /home/communicator/Travel-demo/../logs/travel_cfone/19990624/006/travel_cfone-19990624-006-synth--01.wav
</GC_DATA>
<GC_DATA key=":listening_has_begun" turnid="000" time="930254422.790000">
</GC_DATA>
GC_FRAME
This structure allows frames to be recorded.
The elements in this table refer to the XML DTD.

| Name | Description | Type | Required |
|---|---|---|---|
| frame_type | Galaxy frame type | string | no |
| name | the name of the frame | string | no |
| turnid | the turn id that this frame was logged under | number | no |
Example:
<GC_FRAME turnid="000" name="scores" frame_type="c">
  <GC_DATA key=":total_score">-1408.9955</GC_DATA>
  <GC_DATA key=":acoustic_score">-1367.4408</GC_DATA>
  <GC_DATA key=":ngram_score">-15.5547</GC_DATA>
  <GC_DATA key=":nphones">58</GC_DATA>
  <GC_DATA key=":nwords">13</GC_DATA>
</GC_FRAME>
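Because a GC_FRAME may contain both GC_DATA and further GC_FRAME elements, an analysis tool will typically walk it recursively. The Python sketch below converts a frame into a nested dictionary; the nested "detail" frame in the sample is hypothetical and is included only to exercise the recursion.

```python
import xml.etree.ElementTree as ET

SAMPLE_FRAME = """
<GC_FRAME turnid="000" name="scores" frame_type="c">
  <GC_DATA key=":total_score">-1408.9955</GC_DATA>
  <GC_DATA key=":nwords">13</GC_DATA>
  <GC_FRAME name="detail">
    <GC_DATA key=":nphones">58</GC_DATA>
  </GC_FRAME>
</GC_FRAME>
"""


def frame_to_dict(frame):
    """Recursively convert a GC_FRAME element into a nested dict."""
    result = {}
    for child in frame:
        if child.tag == "GC_DATA":
            result[child.get("key")] = (child.text or "").strip()
        elif child.tag == "GC_FRAME":
            # name is optional in the DTD, so an unnamed frame maps to None.
            result[child.get("name")] = frame_to_dict(child)
    return result


scores = frame_to_dict(ET.fromstring(SAMPLE_FRAME))
print(scores)
```

Values are kept as strings, since the proposed format does not record the type of each value.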
Code support
MITRE volunteers to work with sites to produce
the appropriate conversion tools from MIT logfiles to the proposed logfile
standard. If more appropriate, we will produce a new logging module for
the Hub which will simplify this process; however, we do not expect this
to be necessary.
Document Type Definition (DTD)
Below we provide an XML DTD to define the
above types.
<?xml version="1.0" encoding="UTF-8"?>
<!ELEMENT GC_LOG (GC_SESSION)*>
<!ELEMENT GC_SESSION (GC_TURN)*>
<!ATTLIST GC_SESSION id NMTOKEN #REQUIRED>
<!-- time could be defined as CDATA if we chose to use a non
millisecond format -->
<!ATTLIST GC_SESSION stime NMTOKEN #REQUIRED>
<!ATTLIST GC_SESSION etime NMTOKEN #REQUIRED>
<!ELEMENT GC_TURN ( GC_OPERATION | GC_DATA | GC_FRAME )*>
<!ATTLIST GC_TURN id NMTOKEN #REQUIRED>
<!ATTLIST GC_TURN stime NMTOKEN #REQUIRED>
<!ATTLIST GC_TURN etime NMTOKEN #REQUIRED>
<!ELEMENT GC_OPERATION ( GC_DATA | GC_FRAME )*>
<!ATTLIST GC_OPERATION type NMTOKENS #IMPLIED>
<!ATTLIST GC_OPERATION turnid NMTOKEN #REQUIRED>
<!ATTLIST GC_OPERATION server CDATA #REQUIRED>
<!ATTLIST GC_OPERATION location NMTOKEN #REQUIRED>
<!ATTLIST GC_OPERATION name CDATA #REQUIRED>
<!ATTLIST GC_OPERATION stime NMTOKEN #REQUIRED>
<!ATTLIST GC_OPERATION etime NMTOKEN #REQUIRED>
<!ELEMENT GC_DATA ANY>
<!ATTLIST GC_DATA key NMTOKEN #REQUIRED>
<!ATTLIST GC_DATA type NMTOKENS #IMPLIED>
<!ATTLIST GC_DATA mime_type CDATA #IMPLIED>
<!ATTLIST GC_DATA time NMTOKEN #IMPLIED>
<!ATTLIST GC_DATA turnid NMTOKEN #IMPLIED>
<!ELEMENT GC_FRAME ( GC_DATA | GC_FRAME )*>
<!ATTLIST GC_FRAME frame_type NMTOKEN #IMPLIED>
<!ATTLIST GC_FRAME name CDATA #IMPLIED>
<!ATTLIST GC_FRAME turnid NMTOKEN #IMPLIED>
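Full validation against this DTD requires a validating XML parser. As a lightweight alternative, the Python sketch below checks only that the attributes declared #REQUIRED above are present. It is an approximation written for illustration: it does not check content models or attribute types, and the sample log is invented.

```python
import xml.etree.ElementTree as ET

# Attributes declared #REQUIRED in the DTD above, keyed by element name.
REQUIRED = {
    "GC_SESSION": {"id", "stime", "etime"},
    "GC_TURN": {"id", "stime", "etime"},
    "GC_OPERATION": {"turnid", "server", "location", "name", "stime", "etime"},
    "GC_DATA": {"key"},
}


def missing_attributes(root):
    """Return (tag, missing attribute names) for each incomplete element."""
    problems = []
    for element in root.iter():  # document order, starting at root
        missing = REQUIRED.get(element.tag, set()) - set(element.attrib)
        if missing:
            problems.append((element.tag, sorted(missing)))
    return problems


# Sample log with two deliberate omissions: GC_TURN etime and GC_DATA key.
log = ET.fromstring(
    '<GC_LOG><GC_SESSION id="s1" stime="1.0" etime="2.0">'
    '<GC_TURN id="-01" stime="1.0"><GC_DATA>hello</GC_DATA></GC_TURN>'
    '</GC_SESSION></GC_LOG>'
)
problems = missing_attributes(log)
print(problems)
```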