The central goal is to build a vision system that is re-usable across a
wider variety of fields. This is to be achieved by introducing self-adaptation
at the level of perception, by providing categorisation capabilities, and
by making the knowledge base at the level of reasoning explicit, thereby
enabling that knowledge base to be exchanged. In order to make these ideas
concrete, CogViSys aims at developing a virtual commentator which
is able to translate visual information into a textual description.
This is the unifying theme of the project. In order to build this virtual commentator,
several conceptual subgoals have to be achieved. It is crucial that the
more cognitive processes can start from a firm basis; hence, some effort
will go into state-of-the-art cue integration. Rather than recognising
particular textures, objects, or motions, CogViSys aims at recognising
instantiations of classes thereof; a key goal is therefore to make substantial
progress in the area of categorisation. Finally, approaches will be developed
to express and use knowledge about the interpretation of scenes explicitly.
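As an illustration of the first of these subgoals, cue integration, the following minimal sketch fuses several per-pixel cue maps into a single confidence map by a weighted average. The cue names (`edges', `motion', `colour'), the linear fusion rule, and the weights are assumptions introduced for this sketch, not CogViSys specifications.

```python
import numpy as np

def integrate_cues(cues, weights):
    """Fuse per-pixel cue maps (values in [0, 1]) into one confidence
    map by a weighted average; cue names and weights are hypothetical."""
    fused = np.zeros_like(next(iter(cues.values())), dtype=float)
    total = 0.0
    for name, cue_map in cues.items():
        w = weights.get(name, 0.0)
        fused += w * cue_map
        total += w
    return fused / total if total > 0.0 else fused

# Hypothetical cue maps for a small image patch.
height, width = 4, 4
cues = {
    "edges":  np.random.rand(height, width),
    "motion": np.random.rand(height, width),
    "colour": np.random.rand(height, width),
}
confidence = integrate_cues(cues, {"edges": 0.5, "motion": 0.3, "colour": 0.2})
```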
Humans can describe what they see: only a few people would reject
such a statement as indefensible - although quite a number might insist
on adding some `clarifications'. Any attempt, though, to formulate in more
precise terms what could be meant by `to describe' or `to see' - and which
conditions have to be satisfied for a human to be able to do it - is likely
to end up in an infinite regress. The words `human', `can', `see', and
`describe' refer to common human experiences and are as such part of colloquial
English.
It is common practice to introduce a new concept by analogy to
a known one, for example a `Cognitive Vision System' as a computer which
describes what it sees - or, in an even more anthropomorphic formulation,
as a `virtual reporter'. The casual reader is then likely to acknowledge
`OK, got it!' and to proceed to look for the next piece of information
that succeeds in capturing his interest. The mere observation, however,
that colloquial English provides means to express a complicated
mental activity does not imply that anyone can explain in detail
what actually happens during the course of such an activity, even if humans
perform it routinely. The surreptitious introduction of a Cognitive Vision
System as a `virtual reporter' thus serves to illustrate unknown aspects
of a familiar activity and, at the same time, to fend off unwanted associations,
for example with endeavors which attempt to make computers mimic, simulate,
emulate, or surpass humans.
It may appear to be a mere paraphrase to say that the mental activity of `describing
what one sees' can be understood as a system of mutually interacting components
which process information. This paraphrase, though, turns into a challenge,
namely to specify the structure of an exemplary information
processing system at a level of precision which enables automata
to execute the required information processing steps.
Precision of a specification has to be paid for by a delimitation of
its scope or by complexity - usually by a mixture of both. In a first attempt,
the CogViSys project delimits its scope to three discourse
domains, namely (i) road traffic, (ii) gestures for sign languages, and
(iii) `ritualized' interactions within a small group of humans, for example
as they can be observed in situation comedies (sitcoms). Each of these
three discourse domains admits a large enough diversity of activities to
prevent a simple `jumping-to-conclusions' approach. On the other hand,
activities in each domain usually comply with a set of generally accepted,
if not formally coded, rules which should facilitate a system-internal
representation of the expected `behavior' of `agents'.
Against the background outlined up to this point, a more detailed discussion
of CogViSys objectives appears appropriate. CogViSys studies
the transformation of image sequences - i.e., signals - into textual descriptions
of temporal developments within a scene recorded by one or several video
cameras. This transformation is envisaged to take place via several intermediate
representations of the recorded information. A coarse outline of these
representations illustrates additional characteristics of the CogViSys
approach. In a first transformation step, data-driven processes will extract
`cues' from the video input signal. Various combinations of these cues
will then be used to instantiate schematic representations of visible (articulated)
bodies and of the prevailing illumination conditions. Additional schematic
knowledge concerning anticipated variations of illumination conditions and
potential movements of bodies or body components will be activated to estimate
state `trajectories'. These trajectories can be combined with the instantiated
body representations to generate a numerical spatio-temporal representation
of the recorded scene. This numerical representation then has to be converted
into a conceptual representation, which constitutes the input for a process
that generates the desired output text.
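The chain of representations just outlined can be condensed into a minimal sketch. Every name below is a hypothetical placeholder introduced for illustration, and each stage is reduced to a stub; none of this is a CogViSys interface.

```python
from dataclasses import dataclass

@dataclass
class SceneState:
    """Numerical spatio-temporal scene representation: instantiated
    body models plus estimated state trajectories (placeholder fields)."""
    bodies: list
    trajectories: list

def extract_cues(frames):
    """Stage 1: data-driven extraction of cues from the video signal."""
    return [{"frame": i, "cues": {}} for i, _ in enumerate(frames)]

def instantiate_bodies(cues):
    """Stage 2: combine cues to instantiate schematic models of visible
    (articulated) bodies and prevailing illumination conditions."""
    return ["body-model"]

def estimate_trajectories(bodies, cues):
    """Stage 3: activate schematic knowledge about admissible motions and
    illumination changes to estimate state trajectories."""
    return [["state_t0", "state_t1"]]

def to_concepts(scene):
    """Stage 4: convert the numerical representation into a conceptual one."""
    return [("agent", "moves")]

def generate_text(concepts):
    """Stage 5: generate the output text from the conceptual representation."""
    return " ".join(f"{subject} {predicate}." for subject, predicate in concepts)

def describe(frames):
    cues = extract_cues(frames)
    bodies = instantiate_bodies(cues)
    scene = SceneState(bodies, estimate_trajectories(bodies, cues))
    return generate_text(to_concepts(scene))

print(describe(frames=[None, None]))  # -> "agent moves."
```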
The CogViSys partners proceed on the hypothesis that the cue
extraction, cue combination, and text generation processes can be shared
among system versions, each adapted to one of the three discourse domains. It
is expected that the structure of schematic representations for
bodies and their movements at the geometrical as well as at the conceptual
level should be the same or at least very similar across the three discourse
domains.
Specific to each discourse domain will be the selection
and characteristics (relative pose, shape, surface properties) of bodies
and their admitted spatio-temporal changes. The knowledge specific to each
discourse domain will be explicated and separated from `general' knowledge
related to the treatment of video signals and text generation. The extent
to which this intention can be realized should provide hints towards
the viability and generalizability of the CogViSys approach.
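One way to picture this separation - and the knowledge exchange test described below - is a `general' interpretation loop that depends only on an interface behind which the domain-specific knowledge sits. The interface, the class names, and the toy traffic rules are assumptions made for this sketch.

```python
from typing import Protocol

class DomainKnowledge(Protocol):
    """Hypothetical interface isolating the explicated, domain-specific
    knowledge from the `general' video/text machinery."""
    def body_classes(self) -> list: ...
    def admits_change(self, body: str, change: str) -> bool: ...

class ContinentalTraffic:
    def body_classes(self):
        return ["car", "truck", "pedestrian"]
    def admits_change(self, body, change):
        return change != "drive_on_left"   # toy right-hand traffic rule

class UKTraffic:
    def body_classes(self):
        return ["car", "truck", "pedestrian"]
    def admits_change(self, body, change):
        return change != "drive_on_right"  # toy left-hand traffic rule

def interpret(observations, knowledge):
    """`General' loop: keep only the spatio-temporal changes that the
    currently plugged-in domain knowledge admits (placeholder logic)."""
    return [(body, change) for body, change in observations
            if body in knowledge.body_classes()
            and knowledge.admits_change(body, change)]

observations = [("car", "drive_on_right"), ("car", "turn_left")]
print(interpret(observations, ContinentalTraffic()))  # both changes admitted
print(interpret(observations, UKTraffic()))           # first change rejected
```

Exchanging ContinentalTraffic for UKTraffic leaves the `general' loop untouched, which is the kind of knowledge swap the test described next is meant to probe.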
A more specific test is planned for the last third of the project
period: within each discourse domain, video input will be recorded from
scenes with substantially different characteristics, for example UK sign
language instead of American sign language, or left-hand UK versus right-hand
continental road traffic. If `general' knowledge has been properly separated
from the discourse-domain-specific knowledge, only the latter, explicated,
knowledge has to be exchanged in order to enable the system to perform at a
level comparable to the one attained during the preceding phase.
Such experiments depend on the ability to quickly and adequately assess
the system performance across a wide range of input signal variations.
In analogy to a student who paraphrases what she has learned in order to
prove her understanding of the material taught, the generation of a textual
description is taken as a cue for the degree to which an algorithmic
approach is able to extract complex information from image signals. Adult
humans, and teachers in particular, are well versed not only in assessing
the adequacy of natural language answers, but also in detecting hints
pointing to possible weaknesses of understanding. The algorithmic generation
of descriptive texts is thus expected to facilitate the scrutiny of
tentative computational processes which derive a system-internal representation
of complex temporal developments from video recordings of a scene.
Of course, there are other means to assess the algorithmic `extraction
of meaning' from digitized video. A system executing the envisaged computational
processes can be integrated into a robot, in particular into a mobile one,
so that the performance of tasks which depend on the proper evaluation of
video input can serve as a cue to the adequacy of the machine vision processes
involved. Alternatively, the computational processes used to perform such an
extraction of meaning from video could provide the basis for specifying
neurophysiological or psychophysical experiments, in order to compare their
outcome with corresponding results of an algorithmic system. Although
interactions with endeavors devoted to these or other alternatives will not
be excluded, such endeavors do not constitute core objectives of the CogViSys
project.
For more details, readers are invited to visit the homepages of individual
partners involved in CogViSys.