Recognition of affective intent in speech
Human speech provides a natural and intuitive interface both for
communicating with humanoid robots and for teaching them. To
this end, Kismet recognizes and affectively responds to praise,
prohibition, attention, and comfort in robot-directed speech.
These affective intents are well matched to
human-style instruction scenarios, since praising the robot,
prohibiting it, and directing its attention to relevant aspects of a
task could all be used intuitively to train it.
The system runs in real time and
exhibits robust performance. For a teaching task, confusing strongly
valenced intent with neutrally valenced intent is far less damaging
than confusing oppositely valenced intents: mistaking an approval for
an attentional bid, or a prohibition for neutral speech, is
acceptable, whereas interpreting a prohibition as praise is not.
Communicative efficacy has been tested and demonstrated in
multi-lingual studies with the robot's caregivers as well as with
naive subjects (to date, only female subjects have been tested).
Importantly, we have
discovered some intriguing social dynamics that arise between robot
and human when expressive feedback is introduced. This expressive
feedback plays an important role in facilitating natural and intuitive
human-robot communication.
Infant recognition of affective intent
Developmental psycholinguists have extensively studied how affective
intent is communicated to preverbal infants. Infant-directed speech is
typically quite exaggerated in pitch and intensity. From the results
of a series of cross-cultural studies, Anne Fernald suggests that much
of this information is communicated through the "melody" of
infant-directed speech. In particular, there is evidence for at least
four distinctive prosodic contours, each of which communicates a
different affective meaning to the infant (approval, prohibition,
comfort, and attention) -- see figure.
Maternal exaggerations in
infant-directed speech seem to be particularly well matched to the
innate affective responses of human infants.
Recognition of affective intent
Inspired by this work, we have implemented a recognizer that
distinguishes the four affective intents of praise, prohibition,
comfort, and attentional bids. Of course, not everything a human says
to Kismet will carry an affective meaning, so we also distinguish a
fifth class: neutral robot-directed speech.
We have intentionally designed Kismet to resemble a very young
creature so that people are naturally inclined to speak to it
with appropriately exaggerated prosody. This aesthetic choice
has paid off nicely for us.
As shown below, the preprocessed pitch
contours of labeled utterances resemble Fernald's prototypical prosodic
contours for approval, attention, prohibition, and comfort/soothing.
Prosodic contours of Kismet-directed speech
As shown below, the affective speech recognizer receives
robot-directed speech as input. The speech signal is analyzed by the
low level speech processing system, producing time-stamped pitch (Hz),
percent periodicity (a measure of how likely it is that a frame comes
from a voiced segment), energy (dB), and phoneme values. This low-level auditory
processing code is provided by the Spoken Language Systems Group at
MIT. The next module performs filtering and pre-processing to reduce
the amount of noise in the data. The resulting pitch and energy data
are then passed through the feature extractor, which calculates a set
of selected features.
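To make the data flow concrete, here is a minimal sketch of the
filtering and feature-extraction steps in Python. It is not the actual
implementation (the low-level pitch and energy extraction comes from
the MIT Spoken Language Systems code); the periodicity threshold and
the particular feature set are illustrative assumptions.

```python
import numpy as np

PERIODICITY_THRESHOLD = 0.5  # assumed cutoff for treating a frame as voiced

def preprocess(pitch_hz, periodicity, energy_db):
    """Drop unvoiced frames and median-smooth the pitch track."""
    voiced = np.asarray(periodicity) > PERIODICITY_THRESHOLD
    pitch = np.asarray(pitch_hz, dtype=float)[voiced]
    energy = np.asarray(energy_db, dtype=float)[voiced]
    # A 3-point median filter suppresses isolated pitch-tracker glitches.
    pitch = np.array([np.median(pitch[max(0, i - 1):i + 2])
                      for i in range(len(pitch))])
    return pitch, energy

def global_features(pitch, energy):
    """Global pitch/energy statistics of the kind the classifier uses."""
    return {
        "pitch_mean": pitch.mean(),
        "pitch_var": pitch.var(),
        "pitch_range": pitch.max() - pitch.min(),
        "energy_mean": energy.mean(),
        "energy_var": energy.var(),
    }
```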
Finally, based on the trained model, the classifier determines whether
the computed features are derived from an approval, an attentional
bid, a prohibition, soothing speech, or a neutral utterance. As shown above,
we adopted a multi-stage approach in which several mini-classifiers
process the data in stages. In all training phases we modeled
each class of data using a Gaussian mixture model, updated with the EM
algorithm and a kurtosis-based approach for dynamically deciding the
appropriate number of kernels. In the first stage, the classifier
uses global pitch and energy features to separate some classes (high
arousal versus low arousal). Below, you can see that pitch mean
and energy variance separate the utterances nicely according to
arousal.
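A minimal sketch of one such stage, using scikit-learn, might look
like the following. Note one substitution: the original work selects
the number of kernels with a kurtosis-based criterion, while this
sketch uses BIC for that step; the feature choice and class grouping
are likewise illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_class_model(X, max_kernels=5):
    # Fit one GMM per class. The original uses a kurtosis-based criterion
    # to pick the number of kernels; BIC stands in for that step here.
    best, best_bic = None, np.inf
    for k in range(1, max_kernels + 1):
        gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
        bic = gmm.bic(X)
        if bic < best_bic:
            best, best_bic = gmm, bic
    return best

def classify(models, features):
    # Assign the utterance to the class whose mixture model gives the
    # highest log-likelihood for its feature vector.
    x = np.atleast_2d(features)
    return max(models, key=lambda label: models[label].score_samples(x)[0])

# Stage 1 separates high-arousal classes (approval, attention, prohibition)
# from low-arousal ones (soothing, neutral) on [pitch_mean, energy_var]:
#   models = {"high": fit_class_model(X_high), "low": fit_class_model(X_low)}
#   stage1 = classify(models, [utterance_pitch_mean, utterance_energy_var])
```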
The remaining clustered classes are then
passed to subsequent classification stages. Drawing on Fernald's
findings as prior information, we included a new set of features that
encode the shape of the pitch contour. We found these features useful
for separating the difficult classes in these later stages.
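The exact shape features are not detailed here, but a plausible
sketch, keyed to Fernald's prototypes (exaggerated rise-fall for
approval, rising contours for attentional bids, falling contours for
prohibition and soothing), could measure where the pitch peaks and how
steeply it rises and falls. The definitions below are illustrative
assumptions:

```python
import numpy as np

def contour_shape_features(pitch):
    """Crude descriptors of pitch-contour shape (illustrative, not
    Kismet's actual feature set)."""
    t = np.arange(len(pitch), dtype=float)
    mid = len(pitch) // 2
    # Relative position of the pitch peak: ~0.5 for rise-fall approvals,
    # near 1.0 for rising attentional bids, near 0.0 for falling contours.
    peak_position = float(np.argmax(pitch)) / max(len(pitch) - 1, 1)
    # Straight-line slopes fitted to each half of the contour.
    initial_slope = np.polyfit(t[:mid], pitch[:mid], 1)[0] if mid >= 2 else 0.0
    final_slope = (np.polyfit(t[mid:], pitch[mid:], 1)[0]
                   if len(pitch) - mid >= 2 else 0.0)
    return {
        "peak_position": peak_position,
        "initial_slope": initial_slope,
        "final_slope": final_slope,
    }
```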
For Kismet, the output of the vocal affective intent classifier is
interfaced with the emotion subsystem
where the information is appraised at an affective level and then used
to directly modulate the robot's own affective state. In this way, the
affective meaning of the utterance is communicated to the robot
through a mechanism similar to the one Fernald suggests. The robot's
current "emotive" state is reflected by its facial expression and body
posture. This affective response provides critical feedback to the
human as to whether or not the robot properly understood their
intent. As with human infants, socially manipulating the robot's
affective system is a powerful way to modulate the robot's behavior
and to elicit an appropriate response. The video segment on this page
illustrates
these points.
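As a toy sketch of this coupling, a recognized intent could nudge a
two-dimensional (valence, arousal) state that the expression system
then reads. The specific deltas and the two-dimensional state are
assumptions for illustration, not Kismet's actual emotion-subsystem
parameters:

```python
# Assumed (valence, arousal) nudges per recognized intent; illustrative
# values only.
AFFECT_UPDATES = {
    "approval":    (+0.5, +0.3),  # positive and energizing
    "prohibition": (-0.5, +0.2),  # negative valence
    "comfort":     (+0.2, -0.3),  # mildly positive, calming
    "attention":   ( 0.0, +0.4),  # arousing but valence-neutral
    "neutral":     ( 0.0,  0.0),
}

def apply_intent(state, intent):
    """Update (valence, arousal), clamping each dimension to [-1, 1]."""
    dv, da = AFFECT_UPDATES[intent]
    clamp = lambda x: max(-1.0, min(1.0, x))
    return (clamp(state[0] + dv), clamp(state[1] + da))

# e.g. a prohibition following an approval leaves the robot subdued:
# state = apply_intent(apply_intent((0.0, 0.0), "approval"), "prohibition")
```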