Our aim in developing multimodal dialogue systems is to build a system capable of handling different types of input and different services. This diversity makes it possible to build a wide range of applications.
The architecture of our current system is schematically shown in the figure below.
The system can be accessed via microphone (as illustrated), telephone, GSM, PDA, and the web. To satisfy these generic goals, both interface blocks present a domain-independent level to the Dialogue Manager (DM), which does not know which type of device made the request (e.g., a spoken request or a pen tap in a PDA application). The Input and Output Manager (IOM) creates this level by sending every request in a single XML format, independent of its source; the same principle is applied at the Service Manager (SM).
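The idea of a single device-independent envelope can be sketched as follows. This is a minimal illustration only: the element names and the `wrap_request` helper are assumptions, not the actual schema used by the IOM.

```python
import xml.etree.ElementTree as ET

def wrap_request(text, modality):
    """Wrap a user request in a source-independent XML envelope.

    The element names ("request", "modality", "content") are illustrative
    assumptions, not the real message format.
    """
    root = ET.Element("request")
    ET.SubElement(root, "modality").text = modality  # e.g. "speech", "pen"
    ET.SubElement(root, "content").text = text
    return ET.tostring(root, encoding="unicode")

# A spoken request and a pen tap produce envelopes with the same structure,
# so the DM can process them without knowing which device they came from:
spoken = wrap_request("turn on the lights", "speech")
tapped = wrap_request("turn on the lights", "pen")
```

Because both messages share one format, only the `modality` field differs; everything downstream of the IOM stays device-agnostic.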
The four main blocks of the IOM are AUDIMUS (L2F's general-purpose recognizer), DIXI+ (L2F's synthesizer), TM (Text Manager), and FACE (a Java 3D implementation of an animated face, with a set of visemes for Portuguese phonemes and a set of emotions and head movements).
The DM determines the action requested by the user and asks the SM to execute it. It is built on the communication hub of the Galaxy framework, which interconnects the IOM and the SM with a set of blocks: Language Interpretation, Interpretation Manager, Discourse Context, Behavioral Agent, and Generation Manager.
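Hub-style routing of this kind can be sketched in a few lines. The block names below come from the text, but the dispatch logic and the frame contents are invented for illustration; the real Galaxy hub is a separate process with its own scripting, not an in-process dictionary.

```python
class Hub:
    """Toy message hub: named blocks register handlers, frames are routed
    to them by name (a simplified stand-in for the Galaxy hub)."""

    def __init__(self):
        self.blocks = {}

    def register(self, name, handler):
        self.blocks[name] = handler

    def route(self, name, frame):
        # Deliver the frame to the named block and return its reply.
        return self.blocks[name](frame)

hub = Hub()
# Hypothetical handlers for two of the blocks named in the text:
hub.register("language_interpretation", lambda f: {**f, "intent": "lights_on"})
hub.register("service_manager", lambda f: {**f, "status": "done"})

frame = {"utterance": "turn on the lights"}
frame = hub.route("language_interpretation", frame)  # DM interprets the request
frame = hub.route("service_manager", frame)          # then asks the SM to act
```

The point of the pattern is that blocks only ever see frames from the hub, so any block can be replaced or moved to another machine without the others noticing.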
Development of our multimodal dialogue system has been carried out mainly within the scope of a national project financed by FCT, DIGA - Dialog Interface for Global Access (2004-2005), and has been supported by several undergraduate projects.
Our recent progress in this area has been achieved through our participation in the Interactive Home of the Future at the Museu das Comunicações. The home was built as a means of demonstrating new telecommunications and multimedia technologies to the general public. Our spoken dialogue system gives access to a virtual butler (Ambrósio) that controls several devices in the master bedroom. The system combines automatic speech recognition, natural language understanding, speech synthesis, and a visual interface based on a realistic animated face.
The device controller interface is based on a web server that gives access to any device in the home. The system is quite generic and is able to control different domains: the home environment, remote database access (weather, bus, and stock market information), email access, etc.
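A web-based device controller of this kind boils down to issuing HTTP requests with the device and action encoded in the URL. The sketch below only builds such a URL; the host, path, and parameter names are invented for illustration and do not reflect the real controller's API.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

def build_command_url(base_url, device, action):
    """Encode a device command as a query string on a hypothetical
    /control endpoint (the endpoint and parameter names are assumptions)."""
    return f"{base_url}/control?" + urlencode({"device": device, "action": action})

def send_command(base_url, device, action):
    """Issue the command over HTTP and return the controller's reply."""
    with urlopen(build_command_url(base_url, device, action)) as resp:
        return resp.read().decode()

url = build_command_url("http://home.example", "bedroom_lamp", "on")
```

Keeping the controller behind plain HTTP is what makes the system generic: the SM can drive a lamp, a mail gateway, or a remote database with the same request machinery.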