Cognitive Vision Systems (CogViSys)

Project Summary

The central goal is to build a vision system that is re-usable across a wider variety of fields. CogViSys pursues this by introducing self-adaptation at the level of perception, by providing categorisation capabilities, and by making the knowledge base at the level of reasoning explicit - thereby enabling that knowledge base to be changed. To make these ideas concrete, CogViSys aims at developing a virtual commentator which is able to translate visual information into a textual description.
This is the unifying theme of the project. In order to build this virtual commentator, several conceptual subgoals have to be achieved. It is crucial that the more cognitive processes can start from a firm basis; hence, some effort will go into state-of-the-art cue integration. Rather than recognising particular textures, objects, or motions, CogViSys aims at recognising instantiations of classes thereof; a key goal is therefore to make substantial progress in the area of categorisation. In addition, approaches will be developed to express and use knowledge about the interpretation of scenes explicitly.

Humans can describe what they see: only a few people would reject such a statement as indefensible - although quite a number might insist on adding some `clarifications'. Any attempt, though, to formulate in more precise terms what could be meant by `to describe' or `to see' - and which conditions have to be satisfied for a human to be able to do it - is likely to end in an infinite regress. The words `human', `can', `see', and `describe' refer to common human experiences and are as such part of colloquial English.
It is common practice to introduce a new concept by analogy to a known one, for example a `Cognitive Vision System' as a computer which describes what it sees - or, in an even more anthropomorphic formulation - as a `virtual reporter'. The casual reader is then likely to acknowledge `OK, got it!' and to proceed to look for the next piece of information which succeeds in capturing his interest. The mere observation, however, that colloquial English provides means to express a complicated mental activity does not imply that anyone can explain in detail what actually happens during the course of such an activity, even if humans perform it routinely. The surreptitious introduction of a Cognitive Vision System as a `virtual reporter' thus serves to illustrate unknown aspects of a familiar activity and, at the same time, to reject unwanted associations, for example with endeavors which attempt to make computers mimic, simulate, emulate, or outdo humans.
It may appear as a paraphrase to say that the mental activity of `describing what one sees' can be understood as a system of mutually interacting components which process information. This paraphrase, though, turns into a challenge, namely to specify the structure of an exemplary information processing system at a level of precision which enables automata to execute the required information processing steps.

Precision of a specification has to be paid for by a delimitation of its scope or by complexity - usually by a mixture of both. In a first attempt, the CogViSys project delimits its scope to three discourse domains, namely (i) road traffic, (ii) gestures for sign languages, and (iii) `ritualized' interactions within a small group of humans, for example as they can be observed in situation comedies (sitcoms). Each of these three discourse domains admits a large enough diversity of activities to prevent a simple `jumping-to-conclusions' approach. On the other hand, activities in each domain usually comply with a set of generally accepted, if not formally coded, rules which should facilitate a system-internal representation of expected `behavior' of `agents'.

Against the background outlined up to this point, a more detailed discussion of CogViSys objectives appears appropriate. CogViSys studies the transformation of image sequences - i.e. signals - into textual descriptions of temporal developments within a scene recorded by one or several video cameras. This transformation is envisaged to take place via several intermediate representations of the recorded information. A coarse outline of these envisaged representations illustrates additional characteristics of the CogViSys approach. In a first transformation step, data-driven processes will extract `cues' from the video input signal. Various combinations of different cues will then be used to instantiate schematic representations of visible (articulated) bodies and prevailing illumination conditions. Additional schematic knowledge concerning anticipated variations of illumination conditions and potential movements of bodies or body components will be activated to estimate state `trajectories' which can be combined with instantiated representations of bodies to generate a numerical spatio-temporal representation of the recorded scene. This numerical representation has then to be converted into a conceptual representation constituting the input for a process which generates the desired output text.
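The transformation chain outlined above - signal to cues, cues to a numerical spatio-temporal representation, that representation to concepts, and concepts to text - can be sketched in miniature. The following is a purely illustrative toy, assuming hypothetical stage names and data structures; it is not actual CogViSys code, but it shows the intended flow of information through the envisaged intermediate representations.

```python
# Hypothetical sketch of the envisaged transformation chain:
# video signal -> cues -> numerical scene representation ->
# conceptual representation -> text. All names are illustrative.

def extract_cues(frame):
    # Data-driven step: derive low-level cues from the raw signal.
    return {"position": frame["x"], "motion": frame["dx"]}

def build_scene(cue_sequence):
    # Combine cues over time into a (very crude) numerical
    # spatio-temporal representation: a trajectory estimate.
    positions = [c["position"] for c in cue_sequence]
    return {"trajectory": positions,
            "net_displacement": positions[-1] - positions[0]}

def conceptualize(scene):
    # Map the numerical representation onto conceptual predicates.
    if scene["net_displacement"] > 0:
        return "moves_forward"
    if scene["net_displacement"] < 0:
        return "moves_backward"
    return "stands_still"

def generate_text(concept, agent="the vehicle"):
    # Template-based text generation from the conceptual representation.
    templates = {
        "moves_forward": f"{agent} drives ahead.",
        "moves_backward": f"{agent} reverses.",
        "stands_still": f"{agent} is stationary.",
    }
    return templates[concept]

def describe(frames):
    cues = [extract_cues(f) for f in frames]
    return generate_text(conceptualize(build_scene(cues)))

# Toy input: an agent whose x-coordinate increases over three frames.
frames = [{"x": 0.0, "dx": 1.0}, {"x": 1.0, "dx": 1.0}, {"x": 2.0, "dx": 0.0}]
print(describe(frames))  # -> the vehicle drives ahead.
```

Each stage consumes only the representation produced by its predecessor, mirroring the project's intent that the intermediate representations be well-defined interfaces between processing steps.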

The CogViSys partners proceed on the hypothesis that cue extraction, cue combination, and text generation processes can be shared among system versions adapted to one of the three discourse domains. It is expected that the structure of schematic representations for bodies and their movements, at the geometrical as well as at the conceptual level, will be the same or at least very similar across the three discourse domains. Specific to each discourse domain will be the selection and characteristics (relative pose, shape, surface properties) of bodies and their admitted spatio-temporal changes. The knowledge specific to each discourse domain will be explicated and separated from `general' knowledge related to the treatment of video signals and text generation. The extent to which this intention can be realized should provide hints towards the viability and generalizability of the CogViSys approach.
A more specific test is planned throughout the last third of the project period: within each discourse domain, video input will be recorded from scenes with substantially different characteristics, for example UK sign language instead of American sign language, or UK (left-hand) versus continental (right-hand) road traffic. If `general' knowledge has been properly separated from the discourse-domain-specific knowledge, only the latter - explicitly represented - knowledge has to be exchanged in order to enable the system to perform at a level comparable to that attained during the preceding phase.
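The planned test amounts to keeping the processing machinery fixed and exchanging only an explicated knowledge base. A minimal sketch of that design choice, under the assumption of hypothetical knowledge-base entries and class names (none of this is actual project code):

```python
# Illustrative separation of explicated, domain-specific knowledge
# from shared `general' interpretation machinery. Swapping the
# knowledge base re-targets the same system to a new variant of the
# discourse domain (here: continental vs UK road traffic).

CONTINENTAL_TRAFFIC = {"driving_side": "right"}  # hypothetical entries
UK_TRAFFIC = {"driving_side": "left"}

class Commentator:
    def __init__(self, domain_knowledge):
        # Only the knowledge base is domain-specific; the
        # interpretation logic below is shared across domains.
        self.kb = domain_knowledge

    def interpret(self, observed_side):
        # Judge an observed lane position against the domain's rules.
        if observed_side == self.kb["driving_side"]:
            return "the car keeps to its normal lane"
        return "the car is overtaking"

# Re-targeting the system requires exchanging only the knowledge base.
continental = Commentator(CONTINENTAL_TRAFFIC)
uk = Commentator(UK_TRAFFIC)
print(continental.interpret("right"))  # normal lane on the continent
print(uk.interpret("right"))           # overtaking in the UK
```

The same observation receives different interpretations purely because a different knowledge base is plugged in, which is the property the planned experiments are meant to demonstrate at full scale.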
Such experiments depend on the ability to quickly and adequately assess system performance across a wide range of input signal variations. In analogy to a student who paraphrases what she has learned in order to prove her understanding of the material taught, the generation of a textual description is taken as a cue for the degree to which an algorithmic approach is able to extract complex information from image signals. Adult humans, and teachers in particular, are well versed not only in assessing the adequacy of natural language answers, but also in detecting hints pointing to possible weaknesses of understanding. The algorithmic generation of descriptive texts is thus expected to facilitate the scrutiny of tentative computational processes which derive a system-internal representation of complex temporal developments from video recordings of a scene.

Of course, there are other means to assess the algorithmic `extraction of meaning' from digitized video. A system executing the envisaged computational processes can be integrated into a robot, in particular a mobile one, so that the performance of tasks which depend on the proper evaluation of video input serves as a cue to the adequacy of the machine vision processes involved. Alternatively, the computational processes used to perform such an extraction of meaning from video could provide the basis for specifying neurophysiological or psychophysical experiments, whose outcomes could then be compared with the corresponding results of an algorithmic system. Although interactions with endeavors devoted to these or other alternatives will not be excluded, such endeavors do not constitute core objectives of the CogViSys project.
For more details, readers are invited to visit the homepages of individual partners involved in CogViSys.

Author: H.-H. Nagel, project coordinator.

Adapted to HTML: M. Arens, 2001-08-31.