Prosody Based Audio-visual Co-analysis for Coverbal Gesture Recognition
Author: Sanshzar Kettebekov
Publisher:
Total Pages: 30
Release: 2002
Genre: Computer vision
ISBN:
Abstract: "Although recognition of natural speech and gestures has been studied extensively, previous attempts to combine them for multimodal Human-Computer Interaction (HCI) were mostly semantically motivated, e.g., keyword-gesture co-analysis. Such top-down co-analysis for improving gesture recognition is burdened by the inherent complexity of natural language processing and is not always suitable for real-time HCI. This paper explores prosodic phenomena of spontaneous gesture and speech production and presents a computational framework for improving the recognition of continuous gestures. Prosody-based co-analysis of the audio and visual signals is investigated at two levels: physiological and articulation. Physiological constraints are defined in a feature-based integration framework using Hidden Markov Models (HMMs). Co-articulation is analyzed using a Bayesian belief network of naïve classifiers to explore the alignment of intonationally prominent speech segments and hand velocity. A weighted fusion scheme combines the decisions of the two co-analysis models. Both levels of co-analysis were found to contribute uniquely to the detection and disambiguation of kinematically defined gesture primitives, which subsequently improves the performance of continuous gesture recognition. The efficacy of the proposed approach was demonstrated on a large database collected from Weather Channel broadcasts. This formulation opens new avenues for bottom-up frameworks that incorporate natural gesticulation into HCI."
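The abstract describes combining the decisions of two co-analysis models (an HMM-based physiological model and a naïve-Bayes articulation model) through a weighted fusion scheme. The sketch below illustrates the general idea of decision-level weighted fusion; the function name, the weight value, and the use of simple convex combination over per-class posteriors are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

def weighted_fusion(p_physio, p_artic, w=0.5):
    """Fuse per-class scores from two classifiers by convex combination.

    p_physio : per-class posteriors from the physiological (HMM) model
    p_artic  : per-class posteriors from the articulation (naive-Bayes) model
    w        : weight on the physiological model (hypothetical value)
    """
    p_physio = np.asarray(p_physio, dtype=float)
    p_artic = np.asarray(p_artic, dtype=float)
    fused = w * p_physio + (1.0 - w) * p_artic
    return fused / fused.sum()  # renormalize to a probability distribution

# Example with three hypothetical gesture primitives
# (e.g., preparation, stroke, retraction):
physio = [0.2, 0.6, 0.2]   # scores from the HMM co-analysis
artic = [0.1, 0.3, 0.6]    # scores from the naive-Bayes co-analysis
fused = weighted_fusion(physio, artic, w=0.7)
print(int(fused.argmax()))  # prints 1: the fused decision picks class 1
```

Decision-level fusion of this kind lets each co-analysis contribute according to its reliability: the weight can be tuned on held-out data so that neither model dominates.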