Can we see speech?

Of course we can! But let us start at the beginning.
Remember the scene in 2001: A Space Odyssey in which the two characters Dave Bowman and Frank Poole conspire against HAL 9000, the omnipotent AI of the Discovery One? They isolate themselves in a shuttle, assuming that HAL can no longer hear them and that they can talk freely. Well, they obviously forgot HAL’s ability to read lips.


Naively, we know that we can read lips and somehow map their movements onto something like a “meaning”. It is said that deaf people use this technique. This means that our brains contain not only an articulatory representation of speech, that is, how we need to move our lips, tongue and jaw in order to speak, but also a representation of what our mouth looks like when we speak. You can easily check this for yourself: next time you watch a movie in a foreign language, count how often you deliberately look at the characters’ lips in order to enhance perception!

In the 1970s, Harry McGurk and his research assistant John MacDonald demonstrated what has since become known as the McGurk effect: when the visual and the auditory information for a syllable, e.g. [ba], are mismatched, you hear something completely different. Check out the following video:

What happens in the video above is the following: you see the mouth movements for the syllable [ga] and hear the auditory signal for [ba]. Your brain tries to resolve this conflicting information, and what you hear (or are supposed to hear) is {da}. If this is the case for you, then the effect clearly shows that the brain stores both the visual and the auditory information and uses both to process speech. If this were not the case, there would be no conflict to resolve, and you would not perceive something completely different.

Massaro and Stork (1998), “Speech Recognition and Sensory Integration”, explain this effect on the basis of cue similarity, with cues being pieces of information in the signal that the brain processes. [ba] and [da] share auditory cues, while [ga] and [da] share visual cues. When you mix these, i.e. auditory [ba] + visual [ga], the brain chooses the output, or percept, whose category matches the input with the highest probability. In the case of the McGurk Effect, that is {da}. Massaro and Stork used this probability- and similarity-based algorithm to create a computer program which can read lips: the precursor to HAL. At the same time, the McGurk effect shows us that the brain stores and uses all the physical information contained in the input.
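To make the selection rule a little more concrete, here is a minimal sketch in Python. It assumes one simple way of combining the two modalities: multiply the auditory and the visual support for each category and normalise the result into probabilities. Whether this matches Massaro and Stork's exact formulation is an assumption on my part, and the support values, the dictionary names and the functions `integrate` and `perceive` are hypothetical and chosen purely for illustration. They are not measured data.

```python
# Hypothetical support values between 0 and 1: how well the signal in one
# modality matches each syllable category. Illustrative numbers only.
AUDITORY_BA = {"ba": 0.90, "da": 0.60, "ga": 0.05}  # the sound [ba] is close to [da]
VISUAL_GA = {"ba": 0.05, "da": 0.60, "ga": 0.90}    # the lip shape [ga] is close to [da]


def integrate(auditory, visual):
    """Multiply the per-category support of both modalities and
    normalise the products so that they sum to 1."""
    combined = {cat: auditory[cat] * visual[cat] for cat in auditory}
    total = sum(combined.values())
    return {cat: support / total for cat, support in combined.items()}


def perceive(auditory, visual):
    """Return the category with the highest integrated probability."""
    probabilities = integrate(auditory, visual)
    return max(probabilities, key=probabilities.get)


probs = integrate(AUDITORY_BA, VISUAL_GA)
percept = perceive(AUDITORY_BA, VISUAL_GA)
print(probs)
print(percept)
```

With these made-up numbers, neither [ba] nor [ga] is supported by both modalities, whereas [da] receives moderate support from each. The product therefore favours {da}, which mirrors the cue-overlap explanation given above.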

Therefore: Yes, we can see speech. It is nothing special, quite the contrary. The information is there and the brain uses it!

Together with my colleague Daniel Duran from the Institut für Maschinelle Verarbeitung, Stuttgart, I am investigating in a modeling project how cue overlap and frequency differences between [ba], [da] and [ga] might be a source of the McGurk Effect. We want to do this by comparing the “Naive Discriminative Learner”, an algorithm capturing human and animal learning behavior, and the “Exemplar Model”, a model that captures the creation of phonemic categories on the basis of similarities.

We presented our preliminary results during the second Workshop at Schloss Dagstuhl in our talk “Modelling multimodal integration – The case of the McGurk Effect”. This time, the workshop was attended by Bernd Möbius, Laurence White, Frank Zimmerer, Uwe Reichel, Ingmar Steiner, James Kirby, Antje Schweitzer, Katrin Schweitzer, Mike Walsh and our invited guest, Janet Pierrehumbert. The results are promising, and also to a certain extent surprising.



