Thirty years of object recognition
Glyn W. Humphreys
DOI:10.1093/acprof:oso/9780199228768.003.0012
Abstract and Keywords
This chapter looks at changes and developments in the study of object recognition during the past thirty years. It discusses the Marrian revolution attributed to David Marr, who took ideas and concepts from psychophysics with the aim of translating them into working computer models. Another major development during this period was the growing recognition of, and serious research on, the neural basis of visual perception and object recognition.
Keywords: object recognition, David Marr, psychophysics, computer models, neural system, visual perception
My perception, in the 1970s
I completed my undergraduate degree in 1976. I had started university when student protest was still common: buildings would be taken over, flags bravely unfurled and then abandoned, flapping lifelessly from windows. By 1976 this had begun to seem faintly silly. Rather than pulling things down for some vague goal, one wanted to see how things could be made to work.
I suppose I studied psychology because I wanted to understand how humans worked. I was fortunate to sit in social psychology lectures given by Henri Tajfel on group dynamics, fuelled by his own experiences as a wartime exile. I spent hours learning about various reinforcement regimes, which at least seemed factual and enabled a sort of understanding of which type of learning might apply in which situation. I even undertook a rare gem of a course on mathematical psychology covering topics such as Luce's choice theorem and Bayesian analyses of decision-making. But, though of course I received the statutory courses, I just didn't ‘get’ Perception. Although the classes were filled with enough demonstrations to satisfy even a Royal Society Christmas Lecture audience, I couldn't figure out what it all amounted to, what the mechanisms were. Things began to click together only when I attended a lecture on cognitive psychology, where I remember the idea of using converging operations was discussed. Suddenly some larger picture began to fall into place. The lecture introduced the Atkinson-Shiffrin model of short-term memory—my first encounter with a theory formulated in a box-and-arrows framework, where the representations inside the boxes were specified along with the connections between the boxes. Here was something that could direct experiments and was open to empirical evaluation. Most excitingly, this approach could be tested (indeed a converging operations advocate would argue that it should be tested) using different lines of evidence—not just from studies of free and serial recall by normal participants, but also, for example, from patients whose brain lesion might mean that one part of the model did not function properly. If correct, the model should predict the pattern of impairment found in neuropsychological populations. The idea of using converging evidence to test models was a revelation, suggesting that one should be able to link together work from different fields to construct an overarching account of human cognition. Moreover, it encouraged the idea that different lines of converging evidence could then be designed to assess different component processes in the cognitive system. All this was somehow lacking in my understanding of perception. Our lectures explained that adaptation was the psychophysicist's microscope, but, as it were, all I could see were single cells. I really had no idea of what a perceptual system might comprise.
It would be unfair if these comments were read as a specific criticism of perception as I was taught it, because the fragmented picture reflected something of the state of affairs at that time. There were many clever experiments and interesting, non-intuitive ideas on aspects of perception (the notion that visual coding might operate through spatial frequency analysis had begun to infiltrate the undergraduate curriculum), but it was rare to find the different strands of work being linked together. Failing to see what could be done in perception, I went on to conduct a PhD in an area that would subsequently become known as visual cognition—inspired by Michael Turvey's (1973) work (to this day beautiful) on how different forms of masking could be used to probe sequential stages of visual processing. I hoped to advance our understanding of letter recognition by analysing the time course over which different types of information were made available. I had friends who were perception guys. They studied after-effects and visual gratings. I spent hours in dark labs listening to BBC Radio 4 and trying to detect low contrast patterns. But the world my friends inhabited was a different one to mine. Their vocabulary was foreign.
The Marrian revolution
Then, during my PhD, my psychophysicist friends started to talk about someone called David Marr, who was taking ideas from their field with the aim of translating them into working computer models. What was interesting was that, to do this, you had to specify not only how inputs were coded but also how different codes might be integrated, to think about the order of events, and to specify how the evolving representation could access stored knowledge that might allow the model to do something useful—like recognizing an object. In other words, to have a working computer model, you had to think of perception as a system. Suddenly I could see an analogy with models with which I had grown familiar, particularly accounts such as the dual-route model of word recognition being proposed by Max Coltheart and colleagues (e.g. Coltheart 1978). Marr's ideas offered a new kind of scaffolding to link different aspects of perception together using converging operations from computer science as well as from visual psychophysics. It gave scientists coming from very different approaches a common language.
The initial paper that kindled my interest was Marr and Nishihara (1978), which built on Marr's earlier proposals (Marr 1976; Marr and Hildreth 1980). Note the incremental approach: Marr had earlier dealt with feature coding, and now Marr and Nishihara went on to consider higher-level representations where features were integrated and then associated with past knowledge. This in itself felt like a novel way of thinking—a reflection of a computational approach in which a complex system could be built by linking together modules that each performed their own particular job. Moreover, the proposals put forward by Marr and Nishihara, for how you might go from a feature-based representation to a surface-based description of an object, and then from that to a three-dimensional (3D) model representation, specified mechanisms for how object recognition might actually take place. These mechanisms could be tested.
As I came to read more, it became clear that Marr was not the first person to think of constructing explicit theories of pattern and object recognition using ideas from psychology and physiology—proposals such as the Pandemonium model of Selfridge (1959) long predated the work—but Marr's arguments still felt revolutionary. In part I think this was because they came with a well worked-through philosophy for how different approaches to perception could be linked. Marr argued for the utility of having different levels of description. He proposed a computational level of theory, which set out the constraints that would impact on any system that used vision for object recognition—as relevant to computers as to humans. For example, his work on developing a model for stereopsis (Marr and Poggio 1976) used constraints such as: no point in the world should be represented at more than one point in an internal representation of depth, which limits the possible mappings between points in the two eyes. Similar constraints still influence computational models today, for example the Heinke and Humphreys (2003) account of visual selection. Beneath the computational level of theory, Marr suggested that one could have an algorithmic theory, based on abstracted processing mechanisms, which could be implemented in different kinds of hardware. Further, underneath this, he suggested that there could be a theory of the hardware, which dealt with how particular algorithms were realized in different physical systems. As Coltheart notes in Chapter 5, much of subsequent cognitive science has been built on the idea that theories can be abstracted from the hardware on which processes operate. Marr's framework for different levels of theorizing makes this explicit, and has had a profound influence on the field—though, as I shall describe, the boundaries between, for example, the algorithmic and hardware levels have become increasingly blurred over time.
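To give a concrete feel for how a computational-level constraint can be built into a working model, here is a minimal sketch of a cooperative stereo network in the spirit of Marr and Poggio (1976). It is a toy one-dimensional reconstruction of my own, not the published algorithm, and every parameter value is illustrative: the uniqueness constraint appears as inhibition between cells that share a line of sight, while a continuity constraint appears as excitation between neighbouring cells coding the same disparity.

```python
import numpy as np

def marr_poggio_1d(left, right, n_disp=4, n_iter=5, w_inh=0.5, theta=1.5):
    """Toy 1-D cooperative stereo network in the spirit of Marr and
    Poggio (1976); all parameters are illustrative. Cell C[x, d] codes
    'left pixel x matches right pixel x + d'. Continuity: neighbouring
    cells at the same disparity excite each other. Uniqueness: cells
    sharing a line of sight inhibit each other, so each point ends up
    represented at no more than one depth."""
    n = len(left)
    C = np.zeros((n, n_disp))
    for d in range(n_disp):
        for x in range(n - d):
            C[x, d] = float(left[x] == right[x + d])
    init = C.copy()  # the raw matches keep feeding the network
    for _ in range(n_iter):
        new = np.zeros_like(C)
        for d in range(n_disp):
            for x in range(n - d):
                # excitation from same-disparity neighbours (continuity)
                excite = C[max(x - 1, 0):x + 2, d].sum() - C[x, d]
                # inhibition from rivals on the left-eye ray (same x)...
                inhibit = C[x, :].sum() - C[x, d]
                # ...and on the right-eye ray (same x + d)
                for d2 in range(n_disp):
                    x2 = x + d - d2
                    if d2 != d and 0 <= x2 < n:
                        inhibit += C[x2, d2]
                new[x, d] = float(excite - w_inh * inhibit + init[x, d] >= theta)
        C = new
    return C

left = np.array([0, 1, 1, 0, 1, 0, 1, 1])
right = np.roll(left, 2)  # the whole surface shifted by two pixels
print(marr_poggio_1d(left, right).argmax(axis=1))
# -> [2 2 2 2 2 2 0 0]: the true disparity wins out; the last cells
#    lie beyond x = n - d and so are never matched
```

The point of the exercise is Marr's: the constraints that make the network settle on a single depth per point come from the structure of the physical world, not from any particular hardware.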
After my PhD, I was lucky to gain a lectureship at Birkbeck College, where Max Coltheart had recently taken up a chair in psychology and the department was a hotbed of research into aspects of reading. This work had a particular flavour. It employed functional accounts of performance, such as the dual-route model, to guide experiments, and it used data from neuropsychological patients with disorders of reading alongside data from normal participants. The neuropsychological work seemed especially exciting. Here theorists went outside the laboratory and addressed real-life problems that people experienced after brain lesions. The models could also be used to guide therapy (e.g. Coltheart et al. 1992), and so could be useful practically. As a young lecturer it was impossible not to become infected. The functional account of cognition offered by the dual-route model could be thought of as an algorithmic-level theory, in much the same way as the mechanisms that Marr and Nishihara proposed to underlie visual object recognition. It was thus not difficult to think of testing the Marr and Nishihara account using similar procedures to those used to test dual-route theory—with cognitive neuropsychological studies providing an important part of the empirical armoury. This was how my own work in this area started. It was not a profoundly original approach, and Graeme Ratcliff and Freda Newcombe were already embarked on a similar analysis (Ratcliff and Newcombe 1982). However, up to that date I think it is true to say that neuropsychological data had had little impact on theories of normal object recognition, and indeed there were still controversies within the neurological literature over whether ‘true’ disorders of visual object recognition could occur without contamination from peripheral visual disturbances or more profound cognitive impairments (Bender and Feldman 1972).
My own work was helped enormously on two counts. One was meeting Jane Riddoch, who was beginning a PhD under Max Coltheart's supervision as I came to Birkbeck and who wanted to carry out neuropsychological studies of cognition. It was Jane's insights into patient disorders that helped frame the questions posed in our joint work, and her access to patients made the research possible. The second was meeting HJA, a profoundly agnosic patient with a wonderfully persevering nature. HJA had many low-level visual processes as well as high-level cognitive capacities preserved, refuting the argument that frank disorders of ‘intermediate’ visual processes could not exist (contra Bender and Feldman 1972). HJA subsequently loyally helped our research for over 25 years (e.g. Riddoch and Humphreys 1987a; Riddoch et al. 1999). Patients such as HJA, with selective disturbances of particular aspects of cognition, have made great contributions to the field, and single-case studies should not be overlooked despite current-day emphases on group-based lesion analyses (see ‘The biological revolution’, below).
The first neuropsychological papers on disorders of object recognition that I read were those of Warrington and Taylor (1973, 1978). These distinguished between groups of patients who had deficits either in matching objects depicted in different views or in matching between physically different exemplars of objects used to perform the same basic function (e.g. a wheelchair and a deckchair, both of which serve the function of being sat on). Such data provided early suggestions that aspects of object recognition could be fractionated; for example, the ability to achieve viewpoint-independent matching was distinct from access to semantic/functional knowledge about objects. Moreover, the data indicated that some of the processes proposed by Marr and Nishihara had psychological reality (e.g. that there might be some process that derived common object structures across viewpoints). The basic fractionation made by Warrington and Taylor has also continued to influence much of the work in the field; indeed the question of how objects can be recognized across different points of view has generated enormous heat and perhaps rather less light than one would hope (see Biederman and Gerhardstein 1993; Tarr and Bülthoff 1998). Interestingly, findings that patients with problems in matching objects across different viewpoints can retain an ability to recognize objects in prototypical views (e.g. Davidoff and Warrington 1999) remain perhaps the strongest evidence that Marr and Nishihara's account was not correct in its details. For example, according to Marr and Nishihara, some form of view-independent object representation needs to be constructed to enable recognition to occur. If patients cannot construct a view-independent representation, then their recognition of objects in all views should be impaired. However, perhaps the more important point is that, through formulating their account of the perceptual system underlying object recognition, Marr and Nishihara paved the way for questions about view-independent representation to be addressed in a theoretically coherent way.
After the first revolution
Following from Marr's work, subsequent theories of object recognition have differed in many critical ways. One distinction concerns whether surface-based and 3D representations of objects need to be coded for recognition to take place. For example, Biederman's (1987) influential ‘Recognition by Components’ theory supposed that object representations could be assembled directly from the edges of visual objects, without the need to generate any intermediate surface-based representations. Other theorists have proposed a more direct image-based approach to recognition, where multiple, view-specific, memory representations may be held and used to match objects appearing in different viewpoints (Edelman and Bülthoff 1992). Hybrid accounts, in which view-independent and view-specific procedures operate in parallel, have also been proposed (Hummel and Stankiewicz 1998). These hybrid models hold that view-independent coding requires attentional processes that ensure that the parts of objects are coded in appropriate relative spatial locations, bringing into play the issue of how attention may modulate object recognition. Studies in which attention is manipulated in normal participants, or which use patients who are limited in attending across all the parts of objects, have provided some support for hybrid accounts (e.g. Stankiewicz et al. 1998; Vernier and Humphreys 2006).
A further question highlighted by post-Marrian theories concerns the role of colour and surface texture in object recognition, since edge-based approaches maintain that colour and surface texture should play little causal role. Here there is again converging evidence from studies with normal participants and with patients pointing to an influence of colour and surface texture, at least for some object classes and for objects for which surface information is a reliable cue (e.g. Humphrey et al. 1994; Price and Humphreys 1989; Riddoch and Humphreys 2004; Tanaka and Presnell 1999; Wurm et al. 1993).
We can think of this empirical work as refining our ideas about what we might term the intermediate representations involved in object recognition, such as the surface- and 3D-model representations suggested by Marr and Nishihara (1978). In addition to this, converging experimental work with normal participants and patients has helped to ‘flesh out’ our understanding of how the input into these intermediate representations is coded (how perceptual features are integrated and organized) and also of what later processes are required for object recognition (the involvement of different forms of stored knowledge). For example, we have argued that work with patient HJA distinguishes between processes that group oriented elements into edges, and subsequent processes that code the relations between edges within and across objects (Humphreys 2001; Humphreys and Riddoch 2006). HJA can perform normally on tasks requiring that local oriented elements are grouped (Figure 11.1), but he is profoundly impaired at encoding the correct relations between edges within and across shapes—indeed his recognition errors often involve inappropriate segmentation of shapes based on misinterpreting an internal edge as a segmentation cue (Giersch et al. 2000; Riddoch and Humphreys 1987a). It is thus possible to elaborate on different stages of visual grouping and perceptual organization; a sketch of the two stages follows below. Work by Mary Peterson and colleagues also provides evidence that perceptual organization operates in a top-down as well as a purely bottom-up manner, so that processes such as edge assignment (in figures with ambiguous figure-ground relations) are influenced by whether the edge forms part of a known object representation (see Peterson and Skow-Grant 2003). This notion—that earlier visual processes can be ‘penetrated’ by top-down knowledge—is a critical point that contrasts with the ideas put forward by Marr and colleagues. In keeping with the idea of stand-alone computational modules, Marr proposed a bottom-up approach to object recognition whereby early processes were not affected by feedback from processes at higher levels of representation. The questions of whether, when, and how top-down processes might influence earlier stages of object recognition are ones that will drive research in this field for some time to come.
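The two stages at issue can be made concrete with a small sketch. What follows is an illustrative formulation of my own (not a published model, and the thresholds are arbitrary): stage 1 clusters nearby, similarly oriented elements into edges (the kind of operation HJA performs normally), while stage 2 codes the relations between the resulting edges (the kind of operation at which he fails).

```python
import numpy as np

def group_elements(elements, max_dist=1.5, max_dtheta=0.2):
    """Stage 1: greedily cluster nearby, similarly oriented
    (x, y, orientation) elements into candidate edges."""
    edges, used = [], set()
    for i, (x, y, th) in enumerate(elements):
        if i in used:
            continue
        cluster = [i]
        used.add(i)
        for j, (x2, y2, th2) in enumerate(elements):
            if (j not in used and np.hypot(x - x2, y - y2) < max_dist
                    and abs(th - th2) < max_dtheta):
                cluster.append(j)
                used.add(j)
        edges.append([elements[k] for k in cluster])
    return edges

def relate_edges(edges):
    """Stage 2: code pairwise relations between edges (here just the
    angle between them), the stage argued to be impaired in HJA."""
    rels = []
    for a in range(len(edges)):
        for b in range(a + 1, len(edges)):
            dtheta = abs(edges[a][0][2] - edges[b][0][2])
            rels.append((a, b, dtheta))
    return rels

# Two collinear horizontal elements, one distant horizontal, one vertical.
elems = [(0, 0, 0.0), (1, 0, 0.05), (5, 0, 0.0), (5, 1, 1.57)]
edges = group_elements(elems)  # -> three edges: {0, 1}, {2}, {3}
print(relate_edges(edges))     # -> [(0, 1, 0.0), (0, 2, 1.57), (1, 2, 1.57)]
```

On this way of carving things up, HJA's performance corresponds to stage 1 running normally while the relational codes produced by stage 2 are unreliable, which is just what mis-segmentation errors would look like.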
Distinctions between different forms of higher-level representation in object recognition have also been suggested. Evidence for structural representations of objects separate from semantic/functional representations in normal participants comes from reports by Schacter and Cooper (1993) that normal participants showed long-term priming for novel but plausible 3D shapes (with minimal semantic representations), but no priming for implausible shapes. They argued that plausible, but not implausible, 3D shapes must have persistent structural representations. In neuropsychological studies, several investigators (Fery and Morais 2003; Hillis and Caramazza 1995; Riddoch and Humphreys 1987b; Sheridan and Humphreys 1993; Stewart et al. 1992) have documented patients who can distinguish reliably between real objects and structurally similar non-objects, but who remain impaired at accessing semantic knowledge about the objects, for example in matching together semantically related objects. Such dissociations indicate a separation between stored structural representations of objects and stored semantic knowledge. The framework put forward by Marr and Nishihara (1978) needs to be expanded to take account of these additional distinctions.
One other major change in experimental and neuropsychological work on perception after the Marrian revolution has been to emphasize the importance of visual information for action. If you had followed courses on Perception from the 1970s through to the mid-1990s you would hardly have thought that vision was used for anything other than describing the visual world and recognizing objects. Of course, in everyday life vision is used for much more than this—particularly for guiding our actions on the world. In the 1990s, David Milner, Mel Goodale, and colleagues (Milner and Goodale 1995; Milner et al. 1991) described the agnosic patient DF, whose impairment apparently lay even earlier in the visual stream than HJA's, as she showed profound limitations when making perceptual judgements about groups of visual elements or the orientations of single lines. Strikingly, though, DF was able to reach out and post a letter through a letterbox positioned at different orientations! Milner and Goodale argued that there is a distinction between the visual information that is used for conscious perceptual judgements and for object recognition (processes that are damaged in DF), and the visual information used for action (spared in DF). Subsequently, Goodale and associates have attempted to derive converging evidence from studies of visual illusions in normal participants. Here, it has been argued that our actions are much less susceptible to some illusions than our conscious perceptual judgements (e.g. Aglioti et al. 1995; Bridgeman 2002; Haffenden and Goodale 1998; for alternative views see Franz et al. 2000; Pavani et al. 1999). Other work has suggested that the actions we intend to make can modulate how we attend to objects, and, through this, alter how objects are coded (Linnell et al. 2005). The step towards thinking about what behavioural outcomes result from visual processing has surely been a healthy one in terms of real-world applications, and it now enables converging work to be developed between vision scientists and the computer scientists and engineers working on robotic systems.
The biological revolution
There is one other change I believe worth highlighting that has taken place after the Marrian revolution. This is that the neural basis of visual perception and object recognition (indeed, of all of cognition) is now taken much more seriously. One of the main drivers for this has been the development of functional brain imaging, which now allows us to assess which brain regions are active when we, for example, recognize particular types of object. My view is that brain imaging can contribute to our understanding of the functional basis of object recognition, not least because it brings another type of converging evidence to bear. The new evidence is concerned with where in the brain a given process operates. Now, because we have prior knowledge of what a given brain region is typically involved in, new information indicating that this area is recruited when a given stimulus is processed can constrain our account of what kind of processing is involved. As a concrete example, Moore and Price (1999) contrasted the neural regions activated when participants named black-and-white line drawings relative to when they named colour images. They found differential activation in a number of posterior areas in the right hemisphere. One functional account of why coloured images can be easier to identify than black-and-white images of objects is that coloured images specifically facilitate name retrieval (Ostergaard and Davidoff 1985). A contrasting account is that colour images facilitate the object recognition process itself. Given that changes are observed in the right hemisphere, and that the right hemisphere is not usually thought to modulate name retrieval in normal right-handed participants, these imaging data suggest that the effects of colour are on object recognition itself. Arguments such as this, of course, start to blur Marr's distinction between the algorithmic level of description and descriptions of the hardware. Accounts of what particular regions of the ‘hardware’ are doing can be used to inform accounts of what algorithms might be involved. I find nothing ideologically objectionable in this. It seems simply to be a case of using extra (dare I say converging) evidence to help refine our arguments about complex processes such as object recognition.
This ‘biological revolution’ is still evolving, but some new emphases are apparent. Imaging data suggest that distinct brain regions may be recruited when different objects are recognized. This is perhaps most obvious when contrasting faces and other objects, given the highly reliable finding that small regions of the occipital cortex and fusiform gyrus show enhanced activity to faces compared with other stimuli (Grill-Spector et al. 2004; Kanwisher and Yovel 2006). However, neural specialization can be observed for other classes of object too. Haxby et al. (2001) raised the possibility that there are no generic ‘object recognition procedures’, but rather that contrasting processes may be called into play, depending on the object involved. This idea of recruitment may be important here. For example, there is evidence of activation of middle temporal cortex, left parietal, and ventral frontal cortex when tools are recognized (Grabowski et al. 1998; Grafton et al. 1997). The interesting point that middle temporal cortex is associated with motion processing (Beauchamp et al. 2002), and that parietal and ventral frontal regions are associated with tool use (Decety et al. 1994, 1997), suggests that associations with object motion and functional actions may come into play as we process tools, and these associations may even help us recognize the object involved. These suggestions from imaging sit alongside neuropsychological studies showing that patients can have selective deficits (or sparing) in processing faces versus other objects (Buxbaum et al. 1996; Riddoch et al. 2008; Rossion et al. 2003; Rumiati et al. 1994), or relatively impaired (or preserved) recognition of tools compared with living things (Riddoch and Humphreys 2004; Sirigu et al. 1991). Whereas the earlier emphasis from neuropsychological studies was primarily on the functional deficit involved, arguments about the lesion site now also become relevant. Of course, it can be difficult to argue about lesion site from single cases, given the (relative) idiosyncrasy of different brains, and so this also leads to a change in the way that research is done, moving work towards case series of patients rather than single cases (e.g. Humphreys and Riddoch 2003). However, as I have argued, the continuing importance of single cases, and of functional dissociations, should not be lost when we add in further information about common lesion sites over groups of patients.
One can caricature the box-and-arrow models that emerged during the Marrian revolution as static, based on established representations in set boxes and set connections between the boxes. However, an emergent emphasis from studying the biological basis of visual processing is that perceptual systems are not static but change dynamically over time. In studies of functional imaging, the importance of dynamic change has been highlighted by techniques such as adaptation (a return of the psychophysicist's electrode?), which have been developed to provide a finer-grained analysis of the neural substrates of processing (e.g. Kourtzi and Kanwisher 2001). Imaging studies show that neural areas responding to a stimulus have reduced activity if the same stimulus is presented repeatedly. This would be consistent with the cells in that region responding to that stimulus entering a refractory state. The extent to which activity recovers when the same stimulus is shown under different conditions (e.g. when the viewpoint changes), or when a new stimulus is presented, indicates both whether the same neurones in that region code the different stimuli, and whether the region contains different populations of neurones that can now be prised apart by the selective adaptation of the neurones responding to one stimulus. Given the limited resolution of much of present-day functional imaging (e.g. using voxel sizes of 2×2×2 mm, say), adaptation has proved to be an important way of probing the selectivity of neural responding. But, perhaps even more than this, it indicates that dynamic changes operate continuously in perception, with both short- and longer-term changes being evident (see Kourtzi and DiCarlo 2006; Kourtzi and Kanwisher 2001). Understanding these dynamic changes is a critical issue for future research. The emphasis on dynamic change and learning also enables links to be formed with neural network models that incorporate dynamic fluctuations in activity as part of their normal operation, and with studies of how perceptual systems evolve as they develop. The importance of converging operations will not go away.
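The logic of these adaptation designs can be illustrated with a toy numerical sketch (my own construction, with arbitrary parameters, not a model of any specific study). Imagine a ‘voxel’ containing two stimulus-selective subpopulations whose gain is suppressed after each presentation; full recovery of the summed response when the stimulus changes is the signature that separate populations code the two stimuli.

```python
import numpy as np

def response(stim, weights, adapt, k=0.6):
    """Toy voxel response: the driven subpopulation's gain is cut by a
    factor k after each presentation (repetition suppression)."""
    r = weights[stim] * adapt[stim]
    adapt[stim] *= k  # that population enters a 'refractory' state
    return r

weights = np.array([1.0, 1.0])  # two subpopulations, e.g. two viewpoints
adapt = np.ones(2)              # current gain of each subpopulation

trace = [response(0, weights, adapt) for _ in range(4)]  # repeat stimulus A
trace.append(response(1, weights, adapt))                # then switch to B
print(np.round(trace, 2))  # [1.  0.6  0.36 0.22 1.  ]
# Full recovery on the switch implies separate populations code A and B;
# had a single population coded both, the response to B would stay low.
```

In real experiments the inference runs the same way: the degree of release from adaptation when viewpoint, size, or identity changes indexes whether the underlying neural code treats the two stimuli as the same or different.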