The well-known "McGurk Effect" is an illusion in which visual cues to the syllable "ga" are combined with auditory cues to the syllable "ba", resulting in the perception of "da" or "tha".
Below is an example from Pat Kuhl’s lab web site. Try listening to the following video with your eyes closed, then, after several repetitions, open your eyes to see how your perception changes in the presence of the visual stimulus.
Having experienced the conditions for the McGurk effect yourselves, now consider this recent paper by Schwartz …. It begins with the following observation: "When a public demonstration of the McGurk effect (McGurk & MacDonald, 1976) is presented to visitors or students, there appears a large variability in the subjects’ audiovisual (AV) responses, some seeming focused on the auditory (A) input, others more sensitive to the visual (V) component and to the McGurk illusion".
This simple observation raises the question of whether people differ in their ability to combine auditory and visual speech, or whether the variability in perceiving the McGurk effect is simply due to differences in unimodal A and V perception per se.
The two points of view are represented by Grant and Seitz (1998) and by Schwartz and colleagues (e.g., Schwartz & Cathiard, 2004; Schwartz, 2006) on the one hand, who take the position that individuals can differ in their ability to fuse AV information, and by Massaro and colleagues (see Massaro, 1987, 1998) on the other, who have adopted the view that all participants are “optimal integrators” who combine AV evidence for the available categories in the same multiplicative way.
Schwartz points out in his paper that answering this question has been obscured by methodological issues in how AV integration is modeled. What I like about the paper is that it both provides a methodological framework for analyzing audiovisual speech perception data and shows, using this framework, that there are inter-individual differences in the process of AV fusion. Nice work Jean-Luc!
So what is it that Schwartz does?
First, he establishes the ground over which this issue will be decided: it is all about models of AV integration and how well they can account for the pattern of observed data. Schwartz then goes on to compare two models, Massaro’s FLMP (which does not have a participant-specific weighting factor) and the WFLMP, a variant of the FLMP that explicitly incorporates participant-dependent weights for AV integration. To give these models something to work with, Schwartz uses AV data that cross a synthetic five-level audio /ba/-/da/ continuum with a similar synthetic five-level video continuum. The 10 unimodal (5A, 5V) and 25 bimodal (AV) stimuli were presented for /ba/ vs. /da/ identification to 82 participants, with 24 observations per participant. These responses have been made publicly available by Massaro and colleagues on their web site.
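To make the two candidate models concrete, here is a minimal sketch of their decision rules for the two-alternative /ba/ vs. /da/ case. Everything below is illustrative: the unimodal support values are free parameters fitted per participant in the actual paper (the numbers here are made up), and the weighted parameterization is one plausible form chosen so that w = 0.5 recovers the plain FLMP; Schwartz's exact WFLMP parameterization may differ.

```python
import numpy as np

def flmp(a, v):
    """FLMP: multiplicative fusion of unimodal supports for /da/.

    a and v are the degrees of support for /da/ (in [0, 1]) contributed
    by the audio and video stimulus levels; the product is normalized
    over the two response alternatives so probabilities sum to 1.
    """
    num = a * v
    return num / (num + (1 - a) * (1 - v))

def wflmp(a, v, w):
    """Weighted FLMP with a participant-specific weight w in [0, 1].

    The exponents are chosen so that w = 0.5 reduces to the plain FLMP;
    this is an assumed parameterization, not necessarily the paper's.
    """
    num = a**(2 * w) * v**(2 * (1 - w))
    return num / (num + (1 - a)**(2 * w) * (1 - v)**(2 * (1 - w)))

# Illustrative 5x5 grid of AV predictions (supports here are made up)
a_levels = np.linspace(0.05, 0.95, 5)   # audio /ba/-/da/ continuum
v_levels = np.linspace(0.05, 0.95, 5)   # video /ba/-/da/ continuum
A, V = np.meshgrid(a_levels, v_levels, indexing="ij")
p_flmp = flmp(A, V)
p_wflmp = wflmp(A, V, w=0.8)            # an audio-dominant listener
```

The key difference is visible in the last line: the FLMP produces the same fusion for everyone given the same unimodal supports, while the WFLMP lets w tilt the fusion toward one modality on a per-participant basis.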
So far, so good; but how “good” a model is depends on more than how closely it can fit the data. Schwartz refers to his earlier work (Schwartz, 2006) in which he identifies the so-called 0/0 problem in Massaro’s FLMP model (Massaro, 1987, 1998).
Basically, this 0/0 problem gives the FLMP an indirect way to decrease the importance of a modality in fusion on a per-participant basis: by slightly but consistently misfitting the unimodal data, it can mimic modality weighting without actually having to use participant-specific weights. This “problem” needs to be taken into account when assessing the models, and Schwartz argues that simply considering error (RMSE) does not do this. In order to “properly” evaluate the models, Schwartz uses a Bayesian Model Selection (BMS) criterion (specifically, a more easily computed Laplace approximation of the BMS).
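To give a feel for how such a comparison might run, here is a rough sketch that fits both models to one participant's 5x5 matrix of /da/ counts by maximum likelihood and scores them with a BIC-style penalty, the crudest form of the Laplace approximation to the model evidence. This is a simplification built on assumptions: the paper also fits the 10 unimodal conditions and uses a fuller Laplace approximation, and the data below are placeholders, not Massaro's actual responses.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import binom

def neg_log_lik(params, k_da, n_trials, model):
    """Binomial negative log-likelihood of /da/ counts for the 25 AV cells.

    params packs the 5 audio and 5 video support values (plus, for the
    WFLMP, the weight w); the paper also fits the unimodal conditions,
    which this sketch omits for brevity.
    """
    a, v = params[:5], params[5:10]
    w = params[10] if model == "wflmp" else 0.5   # w = 0.5 is plain FLMP
    A, V = np.meshgrid(a, v, indexing="ij")
    num = A**(2 * w) * V**(2 * (1 - w))
    p = num / (num + (1 - A)**(2 * w) * (1 - V)**(2 * (1 - w)))
    p = np.clip(p, 1e-9, 1 - 1e-9)                # guard against log(0)
    return -binom.logpmf(k_da, n_trials, p).sum()

def laplace_evidence(fit, n_obs):
    """BIC-style (crudest Laplace) approximation to the log evidence:
    maximized log-likelihood penalized by (k/2) * log(n)."""
    k = fit.x.size
    return -fit.fun - 0.5 * k * np.log(n_obs)

# Placeholder data: 24 observations in each of the 25 AV cells
rng = np.random.default_rng(0)
k_da = rng.binomial(24, 0.5, size=(5, 5))

bounds = [(1e-3, 1 - 1e-3)] * 10
fit_flmp = minimize(neg_log_lik, np.full(10, 0.5), bounds=bounds,
                    args=(k_da, 24, "flmp"))
fit_wflmp = minimize(neg_log_lik, np.full(11, 0.5),
                     bounds=bounds + [(1e-3, 1 - 1e-3)],
                     args=(k_da, 24, "wflmp"))

n_obs = k_da.size * 24   # total trials; what counts as "n" is a choice
print("FLMP evidence: ", laplace_evidence(fit_flmp, n_obs))
print("WFLMP evidence:", laplace_evidence(fit_wflmp, n_obs))
```

The point of the penalty term is exactly the one Schwartz makes: the WFLMP's extra weight parameter must buy enough improvement in fit to pay for its added flexibility, so a model cannot win merely by exploiting loopholes like the 0/0 problem.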
So, what is the outcome? Why don’t you read Jean-Luc’s paper? I would also recommend having a look at a tutorial prepared by Schwartz on a practical implementation of BMSL.
So, what do you think?