
AAAI 2011 Workshop: Language-Action Tools for Cognitive Artificial Agents

Invited Talks - Abstracts & Presentations

 

Tamara Berg

Stony Brook University, USA

 

"Learning from Descriptive Text"   - Presentation

People communicate using language, whether spoken, written, or typed. A significant amount of this language describes the world around us, especially the visual world in an environment, or depicted in images or video. Such visually descriptive language is potentially a rich source of 1) information about the world, especially the visual world, and 2) training data for how people construct natural language to describe imagery. In addition, there exist billions of photographs with associated text available on the web; examples include web pages, captioned or tagged photographs, and video with speech or closed captioning. In this talk I will describe several projects related to images, descriptive text, and depiction, including: automatically labeling faces in news photographs, discovering visual attribute terms from noisy web collections, and generating simple natural language descriptions for images. All papers, created datasets, and demos are available on my webpage at: http://tamaraberg.com/


Tom Dean

Google Research, USA


"The Tyranny of Retinotopic Maps: What the Primate’s Body Tells the Primate’s Brain" - Presentation

Images projected on the retina are ephemeral and yield far less information than introspection would suggest. Integrating these fleeting glimpses requires registering, stitching, remembering, inventing, etc. Saccades, ego and other motion require us to actively engage our environment. We are amazingly adept in switching between coordinate frames - head, body, car, bicycle, other people, objects, locations, etc. In this talk we briefly introduce the principles that have dominated the contemporary study of biological and computer vision, and suggest reasons for this hegemony. We then outline a set of principles that derive from the study of areas of the brain other than our vaunted neocortex and motivate how these principles could be used to make progress on some of the hard problems that have been given short shrift in mainstream computer vision over the last decade or so.

 

Jerome Feldman and Srini Narayanan

ICSI and UC Berkeley, USA


"Simulation Semantics, Embodied Construction Grammar, and the Language of Events" - Presentation

It remains challenging to communicate with Artificial Agents about actions, events, and processes  where the agents are embedded in dynamic, partially observable environments. This talk will present an overview of current efforts in the ICSI/UC Berkeley Neural Theory of Language (NTL) project. Now well into its third decade, the NTL project combines advanced computational methods with theories and representations based on all relevant biological and behavioral research. One foundational NTL idea is Simulation Semantics and its formalization as Coordinated Probabilistic Relational Models, which have been applied in a wide range of studies. A related, but previously separate, core concept is Embodied Construction Grammar and the related notion of best-fit analysis.

A major current undertaking is the integration of both techniques in a system for language understanding that is also compatible with OWL and the semantic web. The two pilot task domains are interaction with artificial agents in (simulated) robotics and card games. For the card game task, the initial project goal is to build a system that will be able to understand any of the hundreds of Solitaire descriptions well enough to play the game. The robotics task involves less complex language, but a much richer real-time simulation environment.

 

Cornelia Fermüller and Yiannis Aloimonos

University of Maryland, USA

 

"The Cognitive Dialogue: An architecture for integrating vision, action and language"

A multimodal cognitive system can function (or be controlled) through a dialogue between the Language Executive, the Vision Executive and the Motor Executive. The Language Executive is (the set of programs) in charge of linguistic processing and the intentional system (goals). It has access to prior knowledge and to reasoning mechanisms. The Vision Executive is in charge of the visual operators that are applied to visual data, and the Motor Executive is in charge of the motor system. The system operates in a dialogue, reminiscent of the game of twenty questions. The Language Executive poses questions to the Vision and Motor Executives and receives back answers. On the basis of the answers and the goals, the Language Executive comes up with the next question to ask, and so on (for example: Is there a “noun” in the scene, and where? What exists to the left of object X? Is object Y in the hands of agent Z?). We show, using information theory, that it is possible to select the next question in an optimal sense that guarantees termination of the dialogue and goal achievement. Given that a number of "words" have been recognized in the scene, the "next word" to be used in the question should be the one that optimizes the entropy of the system. We describe examples from the application of the principle to the problem of video interpretation (from videos to sentences).
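The entropy-based choice of the next question can be illustrated with a minimal Python sketch. It is not the authors' system: it assumes a finite set of candidate scene hypotheses, noiseless yes/no answers, and one common reading of "optimizing the entropy", namely picking the question with the lowest expected entropy after the answer. The names `BELIEF`, `QUESTIONS` and `pick_next_question` are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a distribution given as an iterable of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def expected_entropy_after(question, belief):
    """Expected entropy over hypotheses after hearing the answer to a yes/no question.

    belief:   dict mapping hypothesis -> probability
    question: function mapping hypothesis -> bool (the answer if that hypothesis holds)
    """
    total = 0.0
    for answer in (True, False):
        # Hypotheses consistent with this answer, and their total probability mass.
        subset = {h: p for h, p in belief.items() if question(h) == answer}
        mass = sum(subset.values())
        if mass > 0:
            total += mass * entropy(p / mass for p in subset.values())
    return total

def pick_next_question(questions, belief):
    """Return the name of the question that leaves the least expected uncertainty."""
    return min(questions, key=lambda name: expected_entropy_after(questions[name], belief))

# Hypothetical example: four equally likely scene hypotheses (assumed, not from the talk).
BELIEF = {"kitchen": 0.25, "bathroom": 0.25, "office": 0.25, "garage": 0.25}
QUESTIONS = {
    "has_knife": lambda h: h == "kitchen",
    "has_desk": lambda h: h == "office",
    "has_sink": lambda h: h in ("kitchen", "bathroom"),
}
best = pick_next_question(QUESTIONS, BELIEF)
```

In this toy setting the question that splits the hypotheses evenly is preferred, because each answer removes half of the remaining probability mass.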

 

Max Garagnani & Friedemann Pulvermüller

MRC - Cognition & Brain Sciences Unit, Cambridge, U.K.

 

"Sensorimotor circuits for language, memory and action in the human brain: a neuroanatomically grounded - computational model" - Presentation

I will present a neurocomputational model that we developed to simulate and explain, at cortical level, word learning and language processes as they are believed to occur in motor and sensory primary, secondary and higher association areas of the (inferior) frontal and (superior) temporal lobes of the human brain. Mechanisms and connectivity of the model aim to reflect, as much as possible, functional and structural features of the corresponding cortices, including well-documented (Hebbian) associative learning mechanisms of synaptic plasticity. The model was able to explain and reconcile seemingly incongruous results on neurophysiological patterns of brain responses to well-learned, familiar sensory input (words) and new, unfamiliar linguistic material (pseudowords), and made novel predictions about the complex interactions between language and attention processes in the human brain. To test the validity of these predictions we carried out a new MEG study in which we presented subjects with familiar words and matched unfamiliar pseudowords during attention demanding tasks and under distraction. The experimental results indicated strong modulatory effects of attention on the brain responses to pseudowords, but not on those to words, fully confirming the model's predictions.

In the second part, I will illustrate how the same six-area network architecture, implementing the same functional features, can also be applied to model and explain cortical mechanisms underlying working memory processes, in the visual – as well as in the language – domain. In particular, I will present new simulation results that provide a mechanistic answer to the question of why “memory cells” (neurons exhibiting persistent activity in working memory tasks that require stimulus information to be kept in mind in view of future action) are found more frequently in prefrontal cortex and higher sensory areas than in primary cortices, i.e. far away from the sensorimotor activations that bring about their formation (a phenomenon that we refer to as “disembodiment” of memory). The results point to the intrinsic connectivity of the sensorimotor cortical structures within which the correlation learning mechanisms operate as the main factor determining the observed topography of memory cells.

 

Barbara Landau

Johns Hopkins University, USA

 

"Putting things together: Insights from human spatial language"

Human languages are well-designed to express spatial events, including objects, actions, and  spatial relationships.  Although many have argued for the existence of universal semantic primitives underlying the basic spatial meanings,  researchers have recently highlighted significant differences between languages in the kinds of meanings that are naturally expressed.  In one celebrated case, researchers have argued that English does not express the distinction between "tight fit" and "loose fit" in spatial actions, although other languages-- such as Korean-- do.   In this talk, I will argue that this conclusion is incorrect, because it takes account only of the range of meanings expressed by English prepositions.  When people describe "tight/loose" fit events, they vary not only in the prepositions they use, but also in the verbs they use, and the syntactic contexts in which these appear.  Once we evaluate how people describe events by examining complete information across the clause, we find that English speakers can and do make the distinction between tight and loose fit.  These findings are crucial in any attempt to understand the nature of human spatial descriptions.  Moreover, they will be crucial in understanding how we can use human spatial language to instruct machines.

 

Giorgio Metta

Italian Institute of Technology, Italy

(with Carlo Ciliberto, Vadim Tikhanoff, Lorenzo Natale, Francesco Rea, Katerina Pastra, Yiannis Aloimonos, Ajay Mishra, Doug Summerstay, Eirini Balta, Panagiotis Dimitrakis, Giorgos Karakatsiotis)


"The POETICON Project: connecting language and action in a humanoid robot"

POETICON explores the “poetics of everyday life”, i.e. the synthesis of sensorimotor representations and natural language in everyday human interaction. This is related to an old problem in Artificial Intelligence on how meaning emerges, which is approached in a new way. In this talk, we show how the “praxicon” - an extensible computational resource which associates symbolic representations (words/concepts) with corresponding sensorimotor representations and  is enriched with information on patterns among these representations for forming conceptual structures – can be used to guide complex action generation in a humanoid robot. We combine speech, vision, language and action in a unifying and promising way.


Raymond J. Mooney

University of Texas at Austin, USA


"Learning Language from its Perceptual Context" - Presentation

Current systems that learn to process natural language require laboriously constructed human-annotated training data.  Ideally, a computer would be able to acquire language like a child by being exposed to linguistic input in the context of a relevant but ambiguous perceptual environment. As a step in this direction, we will present systems that learn to sportscast simulated robot soccer games and to follow navigation instructions in virtual environments by simply observing sample human linguistic behavior. This work builds on our earlier work on supervised learning of semantic parsers that map natural language to a formal meaning representation.  In order to apply such methods to learning from observation, we have developed methods that estimate the meaning of sentences from ambiguous perceptual context.

 

Katerina Pastra

Cognitive Systems Research Institute, Greece


“The Minimalist Grammar of Action and the role of Language” - Presentation

Language and action have been found to share a common neural basis and in particular a common “syntax”, an analogous hierarchical and compositional organization. While language structure analysis has led to the formulation of different grammatical formalisms and associated discriminative or generative computational models, the structure of action is still elusive and so are the related computational models. However, structuring action has important implications for action learning and generalisation, in both human cognition research and computation. In this talk, we present a biologically inspired generative grammar of action, which employs the structure-building operations and principles of Chomsky’s Minimalist Programme as a reference model. In this grammar, action terminals combine hierarchically into temporal sequences of actions of increasing complexity; the actions are bound with the involved tools and affected objects and are governed by certain goals. We show how the tool role and the affected-object role of an entity within an action drive the derivation of the action syntax in this grammar and control recursion, merge and move, the latter being mechanisms that manifest themselves not only in human language, but in human action too. We will present how the minimalist grammar of action can be applied through both (a) a visual action/scene parser to build action syntax trees bottom up and (b) a language parser to build action syntax trees top down. The corresponding tools open new directions to visual scene analysis and embodied language processing respectively.

 

Stanley Peters

Stanford University, USA


TBA

 

Jeffrey Mark Siskind

Purdue University, USA


"Mediating Cross-Modal Perception, Motor Control, Language, and Reasoning with Common and Deep Semantic Representations"

Human intelligence is tightly intertwined. Multiple perceptual modalities, like vision and audition, and multiple motor modalities, like manipulation and locomotion, inform, influence, and mediate each other through multiple thought modalities, like language and reasoning. Yet most computer intelligence research is compartmentalized into disjoint fields like computer vision, robotics, natural language, and AI. Understanding human intelligence and emulating it computationally will require common semantic representations across all these modalities. I will present our concerted effort to develop just that: common representations that allow rich and deep semantic interaction between computer vision, robotics, and natural language. Our efforts focus on two concrete testbed tasks. In the first, one robot builds an assembly out of Lincoln Logs while a second robot observes that activity and communicates those observations, in natural language, to a third robot who must replicate that assembly.

In the second, two robots play a board game, while a third robot---that does not know the game rules---observes the play and must infer the game rules. These tasks are specifically designed to support investigation into integrating vision, robotics, natural language, learning, and planning with common semantic representations and stochastic inference mechanisms. This allows filling in missing information from multiple modalities. When the vision system cannot fully determine the Lincoln Log assembly structure due to occlusion, it can ask questions in natural language, move its head to integrate information from multiple views, or disassemble the structure to view the assembly's internals. Likewise, when there are insufficient training examples to learn game rules, the learner can ask questions that can be answered either linguistically or by robotic demonstration. I will discuss the common stochastic inference mechanism built on top of a novel probabilistic programming language augmented with automatic differentiation to support maximum-likelihood estimation of rich complex models and how this architecture supports rich and deep semantic interaction between computer vision, robotics, and natural language.

 

Jun Tani

RIKEN Brain Institute, Japan


"Generating Cognitive Behavior through Top-Down and Bottom-Up Interaction in Hierarchically Organized Cortical Networks: Neuro-Robotics Experiments"  - Presentation

In this talk, I address two essential aspects for understanding the brain mechanisms for generating cognitive behavior. The first aspect concerns a generative model in which sensory-motor sequences can be predicted/generated with top-down intentions, and where such intentions can be modified by means of bottom-up regression by considering the prediction error with the sensory reality. The second aspect concerns a generative model that is self-organized with a functional hierarchy through dense interactions between the prefrontal cortex, characterized by its slower neural activity dynamics, and the posterior cortices, characterized by their faster dynamics. These two aspects are examined by reviewing some of our neuro-robotics experiments involving goal-directed action generation, mental simulation and planning, free-decisions, and language-action associative learning. The experimental results suggest that interactions between different levels and different modalities involving various local brain regions can lead to the generation of compositional, and yet contextual, cognitive acts.

 

Evelyne Tzoukermann

MITRE Corporation


"Language Models for Semantic Extraction in Video Action Recognition" - Presentation

We present the language models of an end-to-end system capable of automatically annotating real-world broadcast videos containing actions and objects. The semantic extraction takes into account word relatedness as well as word disambiguation. We address the following issues: (a) what are the optimal ways to extract the salient textual information relevant to vision? (b) what is the best way to represent semantic information so that a vision model can utilize it? We automatically process text transcripts and perform syntactic analysis to extract dependency relations. We then perform semantic extraction on the output to filter semantic entities related to actions. The resulting data are used to populate a matrix of co-occurrences utilized by the vision processing modules. Results show that explicitly modeling the co-occurrence of actions and tools significantly improved performance.
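As an illustration of the last step, the following Python sketch populates an action-by-tool co-occurrence matrix from (action, tool) pairs. It assumes the pairs have already been extracted from dependency relations; the pairs, the function name `cooccurrence_matrix` and the list-of-lists layout are hypothetical and not taken from the system described above.

```python
from collections import Counter

def cooccurrence_matrix(pairs):
    """Build an action-by-tool count matrix from (action, tool) pairs.

    Returns (actions, tools, matrix), where matrix[i][j] is the number of
    times actions[i] was observed together with tools[j].
    """
    counts = Counter(pairs)
    actions = sorted({action for action, _ in pairs})
    tools = sorted({tool for _, tool in pairs})
    # Counter returns 0 for pairs that were never observed.
    matrix = [[counts[(a, t)] for t in tools] for a in actions]
    return actions, tools, matrix

# Hypothetical pairs, standing in for the filtered output of a dependency parser.
PAIRS = [
    ("cut", "knife"), ("cut", "knife"), ("cut", "scissors"),
    ("stir", "spoon"), ("pour", "spoon"),
]
ACTIONS, TOOLS, MATRIX = cooccurrence_matrix(PAIRS)
```

A vision module could, for instance, normalise each row to obtain an estimate of how likely each tool is given an action; that normalisation is left out here to keep the sketch minimal.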

 

Gabriella Vigliocco

University College London, U.K.


"An interdisciplinary theory of semantic representation"

Recent years have seen exciting developments in neural and psychological understanding of how humans learn and represent meaning. In addition, we have also seen remarkable development of distributional models of meaning representation from a computational science perspective. It is the case, however, that these approaches have been mainly kept separate. In our work we have argued that, instead, understanding of how meaning is learnt and represented by humans can be greatly improved if we integrate insights from these different fields. In the talk, I will introduce our theoretical framework on the representation of word meaning, according to which information derived from our embodied (perceptual, motoric and affective) experience is integrated with information extracted from language; I will present experimental evidence suggesting how words from different domains (especially concrete and abstract) may depend to varying degrees upon different types of embodied and language-based information; and I will discuss likely mechanisms for the acquisition of meaning representations in childhood.


Britta Wrede

University of Bielefeld, Germany


"How interaction facilitates action and language learning – a case for Natural Pedagogy in robot tutoring" - Presentation

While infants apparently learn language and the meaning of actions and objects effortlessly through interaction with their peers or parents, the emergence of semantics is still a severe challenge for learning in robots. In our research, we follow the idea of Natural Pedagogy formulated by Csibra & Gergely (2009), who state that, on the one hand, parents, when teaching their infants, make use of specific strategies to help their infant understand the meaning of an action they should learn; on the other hand, children give feedback to their caregivers that they are receptive to the learning content. According to this stance, it is the human ability to teach even complex and subtle meanings to one another which distinguishes human from animal intelligence. This ability allows humans to teach and quickly learn the meaning of actions that cannot be inferred by observation alone. Motivated by these observations, we focus on enabling a robot to perceive relevant information provided by a tutor, such as (a) ostensive cues which signal when to learn, or (b) synchronized speech and action events, so-called acoustic packages, which contain information on how to segment a presented action and speech stream into meaningful units. Implementing the principles of Natural Pedagogy, we evaluate, analyze and model how the contingent feedback of the robot influences the tutor’s demonstration. More specifically, we found that the learner’s gaze during the action demonstration as well as the learner’s reproduction of what s/he has learned from the demonstration are important feedback signals that influence the tutor’s movements. We argue that in order for a robot to be able to learn the meaning of action and speech, it needs to be able to perceive and understand the ostensive signals given by the tutor to engage in a tutoring interaction. On the other hand, the robot needs to be able to give meaningful feedback through gaze and imitation behavior to keep the interaction ongoing.

 

Papers - Abstracts & Presentations

 

D’Ausilio A. (1) and L. Fadiga (1, 2)

Italian Institute of Technology, Genova, Italy (1); University of Ferrara, Italy (2)

 

"The common origins of Language and Action" - Presentation

The organization of the motor system shows some interesting parallels with the organization of language. Here we draw out the possible commonalities between Action and Language, basing our claims on neurophysiological, neuroanatomical and neuroimaging data. Furthermore, we speculate that the motor system may have furnished the basic computational capabilities for the emergence of both semantics and syntax.


Klavans J. (1), R. Guerra (1), R. LaPlante (1), R. Stein (2), E. Bachta (2)

University of Maryland College Park (1); Indianapolis Museum of Art (2)


"Beyond Flickr: Not All Image Tagging Is Created Equal"

This paper reports on the linguistic analysis of a tag set of nearly 50,000 tags collected as part of the steve.museum project. The tags describe images of objects in museum collections. We present our results on morphological, part of speech and semantic analysis. We demonstrate that deeper tag processing provides valuable information for organizing and categorizing social tags. This promises to improve access to museum objects by leveraging the characteristics of tags and the relationships between them rather than treating them as individual items. The paper shows the value of using deep computational linguistic techniques in interdisciplinary projects on tagging over images of objects in museums and libraries. We compare our data and analysis to Flickr and other image tagging projects.

 

Pastra K., E. Balta, P. Dimitrakis, G. Karakatsiotis

Cognitive Systems Research Institute, Athens, Greece

 

"Embodied Language Processing: a new generation of language technology" - Presentation

At a computational level, language processing tasks are traditionally processed in a language-only space/context, isolated from perception and action. However, at a cognitive level, language processing has been shown experimentally to be embodied, i.e. to inform and be informed by perception and action. In this paper, we argue that embodied cognition dictates the development of a new generation of language processing tools that bridge the gap between the symbolic and the sensorimotor representation spaces. We describe the tasks and challenges such tools need to address and provide an overview of the first such suite of processing tools developed in the framework of the POETICON project.

 

Swadzba A. and S. Wachsmuth

Applied Informatics, Faculty of Technology, Bielefeld University

 

"Aligned Scene Modeling of a Robot’s Vista Space – An Evaluation" - Presentation

One kind of meaningful structure in indoor rooms is the supporting structure, such as a table or cupboard. A robot will need to know these structures for a natural interaction with the human and the environment. As bottom-up detection of such structures is a challenging problem, we propose to estimate potential supporting structures from a spatial description like “a bowl on the table”. As language and cognition schematize the space in the same way, it is possible to estimate the representation of the space underlying a scene description. To do so, we introduce the aligned modeling approach, which consists of rules transforming a sequence of object relations into a set of trees and a methodology to ground the abstract representation of the scene layout in the current perception using detectors for small movable objects and an extraction of planar surfaces. An analysis of 30 descriptions shows the robustness of our approach to a variety of description strategies and object detection errors.

 

Teo C.L., Y. Yang, H. Daume III, C. Fermüller, Y. Aloimonos

University of Maryland Institute for Advanced Computer Studies, College Park

 

"A Corpus-Guided Framework for Robotic Visual Perception"

We present a framework that produces sentence-level summarizations of videos containing complex human activities and that can be implemented as part of the Robot Perception Control Unit (RPCU). This is done via: 1) detecting pertinent objects in the scene (tools and direct objects), 2) predicting actions guided by a large lexical corpus and 3) generating the most likely sentence description of the video given the detections. We pursue an active object detection approach by focusing on regions of high optical flow. Next, an iterative EM strategy, guided by language, is used to predict the possible actions. Finally, we model the sentence generation process as an HMM optimization problem, combining visual detections and a trained language model to produce a readable description of the video. Experimental results validate our approach and we discuss the implications of our approach for the RPCU in future applications.
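To make the final step concrete, here is a small Python sketch of Viterbi decoding, a standard way to find the most likely state sequence in an HMM. It is only an illustration under stated assumptions: the sentence is treated as a fixed sequence of slots (verb, object, tool), detector confidences play the role of emission probabilities, and bigram language-model scores play the role of transition probabilities. All numbers and names (`SLOTS`, `EMISSION`, `TRANSITION`) are invented for the example and are not taken from the paper.

```python
import math

def viterbi(states_per_step, emission, transition, floor=1e-6):
    """Return the most likely word sequence, one word per sentence slot.

    states_per_step: list of candidate-word lists, one list per slot
    emission:        list of dicts, emission[t][w] = detector confidence for w in slot t
    transition:      dict, transition[(prev, w)] = language-model probability of w after prev
    floor:           probability assumed for word pairs missing from `transition`
    """
    # best[w] = (log score, path) of the best partial sentence ending in w.
    best = {w: (math.log(emission[0][w]), [w]) for w in states_per_step[0]}
    for t in range(1, len(states_per_step)):
        new_best = {}
        for w in states_per_step[t]:
            # Pick the predecessor that maximises score so far plus transition score.
            score, path = max(
                (prev_score + math.log(transition.get((prev, w), floor)), prev_path)
                for prev, (prev_score, prev_path) in best.items()
            )
            new_best[w] = (score + math.log(emission[t][w]), path + [w])
        best = new_best
    return max(best.values())[1]

# Hypothetical slots, detector confidences and language-model scores.
SLOTS = [["cut", "pour"], ["bread", "juice"], ["knife", "cup"]]
EMISSION = [
    {"cut": 0.5, "pour": 0.5},
    {"bread": 0.6, "juice": 0.4},
    {"knife": 0.7, "cup": 0.3},
]
TRANSITION = {
    ("cut", "bread"): 0.8, ("cut", "juice"): 0.2,
    ("pour", "bread"): 0.1, ("pour", "juice"): 0.9,
    ("bread", "knife"): 0.9, ("bread", "cup"): 0.1,
    ("juice", "knife"): 0.2, ("juice", "cup"): 0.8,
}
sentence = viterbi(SLOTS, EMISSION, TRANSITION)
```

Working in log space avoids numerical underflow on longer sequences; the decoded slot fillers would still need a surface template to become a readable sentence.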


Tzoukermann E. (1), J. Neumann (2), J. Kosecka (3), C. Fermüller (4), I. Perera (5), F. Ferraro (6), B. Sapp (5), R. Chaudhry (7) and G. Singh (3)

The MITRE Corporation (1); Comcast (2); George Mason University (3); University of Maryland (4); University of Pennsylvania (5); University of Rochester (6); Johns Hopkins University (7)


"Language Models for Semantic Extraction and Filtering in Video Action Recognition"

The paper addresses the following issues: (a) how to represent semantic information from natural language so that a vision model can utilize it? (b) how to extract the salient textual information relevant to vision? For a given domain, we present a new model of semantic extraction that takes into account word relatedness as well as word disambiguation so that it can be applied to a vision model. We automatically process the text transcripts and perform syntactic analysis to extract dependency relations. We then perform semantic extraction on the output to filter semantic entities related to actions. The resulting data are used to populate a matrix of co-occurrences utilized by the vision processing modules. Results show that explicitly modeling the co-occurrence of actions and tools significantly improved performance.

 

X. Yu, C. Fermüller, Y. Aloimonos

University of Maryland Institute for Advanced Computer Studies, College Park


"Visual Scene Interpretation as a Dialogue between Vision and Language"

We present a framework for semantic visual scene interpretation in a system with vision and language. In this framework the system consists of two modules, a language module and a vision module, that communicate with each other in the form of a dialogue to actively interpret the scene. The language module is responsible for obtaining domain knowledge from linguistic resources and reasoning on the basis of this knowledge and the visual input. It iteratively creates questions that amount to an attention mechanism for the vision module, which in turn shifts its focus to selected parts of the scene and applies selective segmentation and feature extraction. As a formalism for optimizing this dialogue we use information theory. We demonstrate the framework on the problem of recognizing a static scene from its objects and show preliminary results for the problem of human activity recognition from video. Experiments demonstrate the effectiveness of the active paradigm in introducing attention and additional constraints into the sensing process.

last modified Sep 09, 2011 11:46 PM


European Commission • Seventh Framework Programme