Mental Imagery: In search of a theory*
Zenon W. Pylyshyn
Rutgers Center for Cognitive Science
Rutgers University, Busch Campus
Piscataway, NJ 08854
ABSTRACT
It is widely accepted that there is something special about reasoning by using mental images. The question of how it is special, however, has never been satisfactorily spelled out, despite over thirty years of research in the post-behaviorist tradition. This article considers some of the general motivation for the assumption that entertaining mental images involves inspecting a picture-like object. It sets out a distinction between phenomena attributable to the nature of mind, to what is called the cognitive architecture, and ones that are attributable to tacit knowledge used to simulate what would happen in a visual situation. With this distinction in mind the paper then considers in detail the widely held assumption that in some important sense images are spatially displayed or are depictive, and that examining images uses the same mechanisms that are deployed in visual perception. I argue that the assumption of the spatial or depictive nature of images is only explanatory if taken literally, as a claim about how images are physically instantiated in the brain, and that the literal view fails for a number of empirical reasons – e.g., because of the cognitive penetrability of the phenomena cited in its favor. Similarly, while it is arguably the case that imagery and vision involve some of the same mechanisms, this tells us very little about the nature of mental imagery and does not support claims about the pictorial nature of mental images. Finally I consider whether recent neuroscience evidence clarifies the debate over the nature of mental images. I claim that when such questions as whether images are depictive or spatial are formulated more clearly, the evidence does not provide support for the picture-theory over a symbol structure theory of mental imagery. Even if all the empirical claims were true, they would not warrant the conclusion that many people have drawn from them: that mental images are depictive or are displayed in some (possibly cortical) space. Such a conclusion is incompatible with what is known about how images function in thought. We are then left with the provisional counterintuitive conclusion that the available evidence does not support rejection of what I call the “null hypothesis”; namely, that reasoning with mental images involves the same form of representation and the same processes as that of reasoning in general, except that the content or subject matter of thoughts experienced as images includes information about how things would look.
Table of Contents
1 Why is there a problem about mental imagery?
1.1 The pull of subjective experience
1.2 The imagery debate: What was it about?
2 What is special about image-based reasoning?
3 Why images exhibit certain properties: Cognitive architecture or tacit knowledge?
3.1 What knowledge is relevant to the tacit knowledge explanation?
3.2 Methodological note: cognitive penetrability as a litmus
4 Problem-solving by “mental simulation”: Some examples
4.2 The “size” of mental images
5.1 Depiction and mandatory spatial properties of representations
5.2 Real versus functional space
5.3 Projected mental images: Inheriting spatial properties from real space
5.4 Visuomotor interaction with images
6 Are images “seen” by the visual system?
6.1 The experience of seeing and of imagining
6.2 Interference between imaging and visual perception
6.3 Visual illusions induced by superimposing mental images
6.4 Imagined versus perceived motion
6.5 Extracting novel information from images: Visual (re)perception or inference?
7 Can evidence from neuroscience settle the question?
7.1 Searching for the “mind’s eye” and the “image” in the brain
7.2 What would it mean if all the neuroscience claims turned out to be true?
7.3 Is the ‘mind’s eye’ just like a real eye?
7.4 What has recent neuroscience evidence done for the “imagery debate”?
7.5 Is the “picture theorist” a straw man?
8 Conclusion: What is special about mental imagery?
Cognitive science is rife with ideas that offend our intuitions. It is arguable that nowhere is the pull of the subjective stronger than in the study of perception and mental imagery. It is not easy for us to take seriously the proposal that the visual system creates something like symbol structures in our brain since it seems intuitively obvious that what we have in our mind when we look out onto the world, as well as when we close our eyes and imagine a scene, is something that looks like the scene, and hence whatever it is that we have in our heads must be much more like a picture than a description. Though we may know that this cannot be literally the case, that it would do no good to have an inner copy of the world, this reasoning appears to be powerless to dissuade us from our intuitions. Indeed, the way we describe how it feels to imagine something shows the extent of the illusion; we say that we seem to be looking at something with our “mind’s eye”. This familiar way of speaking reifies an observer, an act of visual perception, and a thing being perceived. All three parts of this equation have now taken their place in one of the most developed theories of mental imagery (Kosslyn 1994), which refers to a “mind’s eye” and a “visual system” that examines a “mental image” located in a “visual buffer”. Dan Dennett has referred to this view picturesquely as the “Cartesian Theater” view of the mind (Dennett 1991) and I will refer to it as the “picture theory” of mental imagery.
There has been a tradition of analyzing this illusion in the case of visual perception, going back to Descartes and Berkeley (it also appears in the 17th century debate between Arnauld and Malebranche – see Slezak submitted), and revived in modern times by Gibson (1966), as well as by computationalists like Marr (1982). More recently O'Regan (1992) and O'Regan & Noë (2002) have argued against the intuitive picture-theory of vision on both empirical and theoretical grounds. Despite the widespread questioning of the intuitive picture view in visual perception, this view remains very nearly universal in the study of mental imagery (with such notable exceptions as Dennett 1991; Rey 1981; Slezak 1995; see also the critical remarks by Fodor 1975; Hinton 1979; Pylyshyn forthcoming; Thomas 1999).
Why should this be so? Why do we find it so difficult to accept that when we “examine our mental image” we are not in fact examining an inner state, but rather are contemplating what the inner state is about – that is, some possible state of the visible world – and therefore that this experience tells us nothing about the nature and form of the representation? Philosophers have referred to this displacement of the object of thought from the (possible) world to a mental state as the “intentional fallacy” and it has much of cognitive science in its grip still.
What I try to do in this paper is show that we are not only deeply deceived by our subjective experience of mental imagery, but that the evidence we have accumulated to support what I call the “picture theory” of mental imagery is equally compatible with a much more parsimonious view, namely that most of the phenomena in question (but not all – see below) are due to the fact that the task of “imaging” invites people to simulate what they believe would happen if they were looking at the actual situation being visualized. I will argue that the alternative picture theory, or depiction-theory, trades heavily on a systematic ambiguity between the assumption of a literal picture and the much weaker assumption that visual properties are somehow encoded. I will also argue that recent evidence from neuroscience (particularly the evidence of neural imaging) brings us no closer to a plausible picture theory than we were before this evidence was available.
There has been a great deal of discussion in the past 30 years about “the imagery debate.” Many people even believe that the debate has, at least in general outline, been put to rest because we now have hard evidence from neuroscience showing what (and where) images are (see, e.g., Kosslyn 1994; and the brief review in Pylyshyn 1994a). But if one looks closer at the “debate” one finds that what people think the debate is about is very far from univocal. For example, some people think that the argument that has been settled is whether images, whatever their nature, are fundamentally different from the form of representation involved in other kinds of reasoning, whether there are two different systems of mental codes. For others it is the question of whether images have certain particular properties – e.g. whether they are spatial, or depictive, or analogue. Others feel that the question that has been settled is whether imagery “involves” the visual system. I will argue that none of these claims has been sufficiently well posed to admit of a solution. In this paper I will concentrate primarily on a particular class of theory of mental imagery, which I refer to as “picture theories” and will consider other aspects of the “debate” only insofar as they bear on the alleged pictorial nature of images.
In this article I defend the provisional view, which I refer to as the “null hypothesis,” that at the relevant level of analysis – the level appropriate for explaining the results of many experiments on mental imagery – the process of imagistic reasoning involves the same mechanisms and the same forms of representation as are involved in general reasoning, though with different content or subject matter. This hypothesis claims that what is special about image-based thinking is that it is typically concerned with a certain sort of content or subject matter, such as optical, geometrical, or what we might call the appearance-properties of the things we are thinking about. If so, nothing is gained by attributing a special format or special mechanisms to mental imagery. While the validity of this null hypothesis remains an open empirical question, what is not open, I claim, is whether certain currently popular views can be sustained.
In the interest of full disclosure I should add that I don’t really, in my heart of hearts, believe that representations and processes underlying imagery are no different from other forms of reasoning. Nonetheless, I do think that nobody has yet articulated the specific way that images are different and that all candidates proposed to date are seriously flawed in a variety of ways that are interesting and revealing. Thus using the null hypothesis as a point of departure may allow us to focus more properly on the real differences between imagistic and other forms of reasoning.
Section 2 reviews some observations that have led many people to hold a picture theory of mental images (although a detailed discussion of what such a theory assumes is postponed until section 5). Section 3 introduces a distinction that is central to our analysis. It distinguishes two reasons why imagery might manifest the properties that are observed in experiments. One reason is that these properties are intrinsic to the architecture of the mental imagery system – they arise because of the particular brain mechanisms deployed in imagery. The other reason is that the properties are extrinsic to the mechanisms employed – they arise because of what people tacitly believe about the situation being imagined, which they then use to simulate certain behaviors that would occur if they were to witness the corresponding situation in reality. This distinction is then applied to some typical experiments on mental imagery where I argue that such experiments tell us little about special dedicated imagery mechanisms. Since section 4 discusses some material that has been published elsewhere, readers who have followed the “imagery debate” may wish to skim this section.
Section 5 discusses two widely held views about the nature of mental images (Kosslyn 1994): that images are “depictive” and that they are laid out in a “functional space”. I claim that the preponderance of evidence argues against the inherent spatial nature of mental images. An exception is evidence from experiments in which subjects project their images onto a visual scene. In this case I claim (section 5.3) that the use of visual indexes and focal attention provides a satisfactory explanation for how spatial properties are inherited from the observed scene, without any need to posit spatial properties of images. In section 5.2 I argue that the notion of a functional space is devoid of any explanatory power, since such a “space” is unconstrained and can have whatever properties one wishes to attribute to it (unless it is taken to be a simulation of a real spatial display as in the model described in Kosslyn, Pinker, Smith, & Shwartz 1979, in which case the underlying theory really is the literal picture theory). Section 6 discusses a claim that is assumed to be entailed by the depictive nature of images; namely, that information in an image is accessed through vision. Although there is evidence for some overlap between the mechanisms of imagery and those of vision, a close examination of this evidence shows that it does not support the assumption of a spatial display in either vision or imagery. Section 7 considers evidence from neuroscience that many writers believe provides the strongest case for a picture theory. Here I argue that, notwithstanding the intrinsic interest of these findings, they do not support the existence of any sort of depictive display in mental imagery. Finally, section 8 closes with a brief discussion of where the “imagery debate” now stands and of the role of imagery in creative thinking.
Imagery seems to follow principles that are different from those of intellectual reasoning and certainly beyond any principles to which we have conscious intellectual access. Imagine a baseball being hit into the air and notice the trajectory it follows. Although few of us could calculate the shape of this trajectory none of us has any difficulty imagining the roughly-parabolic shape traced out by the ball in this thought experiment. Indeed, we can often predict with considerable accuracy where the ball will land (certainly a properly situated professional fielder can). It is very often the case that by visualizing a certain situation, we can predict the dynamics of physical processes that are beyond our ability to solve analytically. Is this because our imagery architecture inherently and automatically obeys the relevant laws of nature?
Opposing the intuition that one’s image unfolds according to some internal principle of natural harmony with the real world is the obvious fact that it is you alone who controls your image. Perhaps, as Humphrey (1951) once put it, viewing the image as being responsible for what happens in your imagining puts the cart before the horse. In the baseball example above, isn’t it equally plausible that the reason the imagined ball takes a particular path is that, under the right circumstances, you can recall having seen a ball inscribe such a path? Surely your image unfolds as it does because you, the image creator, made it do so. You can imagine things being pretty much any size, color or shape that you choose and you can imagine them moving any way you like. You can, if you wish, imagine a baseball sailing off into the sky or following some bizarre path, including getting from one place to another without going through intervening points, as easily as you can imagine it following a more typical trajectory. You can imagine all sorts of physically impossible things happening – and cartoon animators frequently do, to our amusement.
Some imagery theorists might be willing to concede that in imagining physical processes we must use our tacit knowledge of how things work, yet insist that the optical and geometrical properties of images are true intrinsic properties, despite the fact that it is the dynamic properties of images that are often cited in studies of mental images – properties such as mental rotation, mental scanning, or “representational momentum” (discussed in sections 3.1 and 4). Nonetheless, the suggestion that the intrinsic properties of images are geometrical rather than dynamic makes sense both because spatial intuitions are among the most entrenched, and because there is evidence (Pylyshyn 1999) that geometrical and optical-geometrical constraints are built into the early-vision system, as so-called “natural constraints”. While we can easily imagine the laws of physics being violated, it seems nearly impossible to imagine the axioms of geometry and geometrical optics being violated. Try imagining a four-dimensional block or how a cube looks when seen from all sides at once or what it would look like to travel through a non-Euclidean space. However, before concluding that these examples illustrate the intrinsic geometry of images, consider whether your inability to imagine these things might not be due to your not knowing, in a purely factual way, how these things might look (that is, where edges, shadows and other contours would fall). The answer is by no means obvious. It has even been suggested (Goldenberg & Artner 1991) that certain deficits in imagery ability resulting from brain damage are a consequence of a deficiency in the patient’s knowledge about the appearance of objects. At the minimum we are not entitled to conclude from such examples that images have the sort of inherent geometrical properties that we associate with pictures.
We also need to keep in mind that notwithstanding one’s intuitions, there is reason to be skeptical about what one’s subjective experience reveals about the form of a mental image. After all, when we look at an actual scene we have the unmistakable subjective impression that our perceptual representation is of a detailed three-dimensional panoramic view, yet it has now been convincingly demonstrated that the information available to cognition from a single glance is extremely impoverished, sketchy and unstable and that very little is carried over across saccades (see, for example, Blackmore, Brelstaff, Nelson, & Troscianko 1995; Carlson-Radvansky 1999; Carlson-Radvansky & Irwin 1995; Intraub 1981; Irwin 1993; O'Regan 1992; O'Regan & Noë 2002; Rensink 2000a, 2000b; Rensink, O'Regan & Clark 1997; Rensink, O'Regan & Clark 2000; Simons 1996). Indeed, there is now considerable evidence that we visually encode very little in a visual scene unless we explicitly attend to the items in question and that we do that only if our attention or our gaze is attracted to them (Henderson & Hollingworth 1999; although see O'Regan, Deubel, Clark, & Rensink 2000). There are remarkable demonstrations that when presented with alternating images, people find it extremely difficult to detect a difference between the two – even a salient difference in a central part of the image.[1] This so-called change blindness phenomenon (Simons & Levin 1997) suggests that, notwithstanding our phenomenology, we are nowhere near having a detailed internal display since the vast majority of information in a visual scene goes unnoticed and unrecorded. It would thus be reasonable to expect that our subjective experience of mental imagery would be an equally poor guide to the form and content of the information in our mental images.
Nobody denies that the content and behavior of our mental images can be the result of what we intend our images to show, what we know about how things in the world look and work, and the way our mind or our imagery system constrains us. The important question about mental imagery is this: which properties and mechanisms are intrinsic to, or constitutive of, having and using mental images, and which arise because of what we believe, intend, or attribute to the situation we are imagining?
The distinction between effects attributable to the intrinsic nature of mental mechanisms and those attributable to more transitory states, such as people’s beliefs, utilities, habits, or interpretation of the task at hand, is central not only for understanding the nature of mental imagery, but for understanding mental processes in general. Explaining the former kind of phenomena requires that we appeal to what has been called the cognitive architecture (Fodor & Pylyshyn 1988; Newell 1990; Pylyshyn 1980, 1984; Pylyshyn 1991a, 1996) – one of the most important ideas in cognitive science. It refers to the set of properties of mind that are fixed with respect to certain kinds of influences. In particular, the cognitive architecture is, by definition, not directly altered by changes in knowledge, goals, utilities or any other representations (such as fears, hopes, or fantasies). In other words when you find out new things or when you draw inferences from what you know or when you decide something, your cognitive architecture does not change. Of course, if as a result of your state of beliefs and desires you decide to take drugs or to change your diet or even to repeat some act over and over, this can result in changes to your cognitive architecture, but such changes are not a direct result of the changes in your cognitive state. A detailed technical exposition of the distinction between effects attributable to knowledge or other cognitive states and those attributable to the nature of cognitive architecture is beyond the scope of this article (although this distinction is the subject of extensive discussion in Pylyshyn 1984, Chapter 7). This informal characterization and the following example will have to do for present purposes.
Consider the following illustrative example. We have a box of unknown construction, and we discover that it exhibits particular systematic behaviors (discussed in Pylyshyn 1984). The box emits long and short pulses according to the following pattern: pairs of short pulses most often precede single short pulses, except when a pair of long-short pulses occurs first. What is special about this example is that it illustrates a case where the observed behavior, though completely regular when the box is in its “ecological niche,” is not due to the nature of the box (to how it is constructed) but to an entirely extrinsic reason. The reason this particular pattern of behavior occurs can only be understood if we know that the pulses are codes, and the pattern is due to a regularity in what they represent, in particular, that the pulses represent English words spelled out in International Morse Code. The observed pattern does not reflect how the box is wired or its functional architecture; it is due entirely to a regularity in the way English words are spelled (the principle being that generally i comes before e except after c). Similarly, I have argued that in most of the core experiments on mental imagery – such as the mental scanning case described in section 4.1 – the pattern does not reveal the nature of the mental architecture involved in imagery, but reflects a principle that observers know governs the world being imagined. The reason that under certain conditions the behavior of both the code box and the cognitive system does not reveal properties of its intrinsic nature (of its architecture) is that both would be capable of quite different regularities if the world they were representing behaved differently. They would not have to change their architecture in order to change their behavior.
The latter observation, concerning the plasticity of non-architectural properties of thought, is the key to a methodology I have called “cognitive penetrability” for deciding whether tacit knowledge or cognitive architecture is responsible for some particular observed regularity (see section 3.2).
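The code-box example can be made concrete with a minimal sketch. It assumes that the box is nothing more than a fixed Morse lookup table; the two word lists and the names `box` and `pulse_regularity` are hypothetical illustrations and are not part of the original example. The sketch tallies how often a pair of short pulses precedes or follows a single short pulse, and whether a long-short-long-short group came immediately before.

```python
from collections import Counter

# International Morse Code for the 26 letters ('.' = short pulse, '-' = long pulse).
MORSE = {
    "a": ".-",   "b": "-...", "c": "-.-.", "d": "-..",  "e": ".",
    "f": "..-.", "g": "--.",  "h": "....", "i": "..",   "j": ".---",
    "k": "-.-",  "l": ".-..", "m": "--",   "n": "-.",   "o": "---",
    "p": ".--.", "q": "--.-", "r": ".-.",  "s": "...",  "t": "-",
    "u": "..-",  "v": "...-", "w": ".--",  "x": "-..-", "y": "-.--",
    "z": "--..",
}

I_CODE, E_CODE, C_CODE = MORSE["i"], MORSE["e"], MORSE["c"]


def box(words):
    """The 'box' (assumed here to be a plain lookup): one pulse group per letter."""
    return [[MORSE[ch] for ch in word.lower()] for word in words]


def pulse_regularity(emissions):
    """Tally the order of '..' and '.' groups, split by whether '-.-.' came just before."""
    counts = Counter()
    for groups in emissions:
        for k in range(len(groups) - 1):
            pair = (groups[k], groups[k + 1])
            if pair == (I_CODE, E_CODE):
                order = "pair_then_single"
            elif pair == (E_CODE, I_CODE):
                order = "single_then_pair"
            else:
                continue
            after_long_short = k > 0 and groups[k - 1] == C_CODE
            counts[(order, after_long_short)] += 1
    return counts


# Hypothetical "ecological niche": words that follow the usual spelling principle.
usual_words = ["field", "piece", "friend", "believe", "chief",
               "receive", "ceiling", "deceive"]
# Hypothetical alternative world: words that happen to violate the principle.
unusual_words = ["weird", "seize", "their", "science", "species", "ancient"]

usual = pulse_regularity(box(usual_words))
unusual = pulse_regularity(box(unusual_words))
print(usual)
print(unusual)
```

On these word lists the same box yields the familiar pattern in the first tally and the reverse pattern in the second, although nothing in the lookup table has changed. The regularity belongs to the words being represented, which is the sense in which the box's behavior fails to reveal its architecture.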
In interpreting the results of imagery experiments, it is clearly important to distinguish between cognitive architecture and tacit knowledge as possible causes. Take the following example. You are asked what color you see if you look through a yellow filter superimposed on a blue filter. The way that many of us would go about solving this problem, if we did not know the answer as a memorized fact, is to imagine a yellow filter and a blue filter being superimposed; we generally use the “imagine” strategy when we want to solve a problem about how certain things look. What color do you see in your image when the two filters are overlapped? Now ask yourself why you see that color in your mind’s eye rather than some other color? Some people (e.g., Kosslyn 1981) have argued that the color you see follows from a property of imagery, presumably some property of how colors are encoded and displayed in images. But since there can be no doubt that you can make the overlapping part of the filters be any color you wish, it can’t be that the image format or the architecture involved in representing colors is responsible. What else can it be? It seems clear in this case that the color you “see” depends on your tacit knowledge of the principles of color mixing or a recollection of how these particular colors combine (having seen something like them in the past). In fact people who do not know about subtractive color mixing generally give the wrong answer: mixing yellow light with blue light produces white light, but overlapping yellow and blue filters allows green light through.
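The difference between the two mixing rules, and the way the reported color follows from the rule a person believes, can be sketched as follows. This is only an illustration under stated assumptions: light is reduced to three coarse bands (red, green, blue), the filter transmittances are made-up values chosen so that both filters pass some green, and all function names are hypothetical.

```python
# Assumed, illustrative values only: fraction of (red, green, blue) light passed.
YELLOW_FILTER = (0.9, 0.9, 0.1)
BLUE_FILTER = (0.1, 0.5, 0.9)

# Assumed idealized lights: intensity in the (red, green, blue) bands.
YELLOW_LIGHT = (1.0, 1.0, 0.0)
BLUE_LIGHT = (0.0, 0.0, 1.0)


def overlap_filters(f1, f2):
    """Subtractive rule: each filter attenuates what the other lets through."""
    return tuple(a * b for a, b in zip(f1, f2))


def add_lights(l1, l2):
    """Additive rule: intensities of the two lights sum (capped at 1.0)."""
    return tuple(min(1.0, a + b) for a, b in zip(l1, l2))


def describe(bands):
    """Crude color name: 'white' or 'dark' if the bands are nearly equal, else the strongest band."""
    if max(bands) - min(bands) < 0.05:
        return "white" if min(bands) > 0.5 else "dark"
    names = ("red", "green", "blue")
    return names[max(range(3), key=lambda k: bands[k])]


def imagined_overlap_color(believed_rule):
    """Color a simulated subject 'sees', determined only by the rule they believe applies."""
    if believed_rule == "subtractive":
        return describe(overlap_filters(YELLOW_FILTER, BLUE_FILTER))
    return describe(add_lights(YELLOW_LIGHT, BLUE_LIGHT))


print(imagined_overlap_color("additive"))
print(imagined_overlap_color("subtractive"))
```

Nothing about the simulated subject changes between the two calls except the believed rule, yet the "imagined" color differs. That is the pattern the tacit knowledge account predicts for people who do or do not know about subtractive mixing.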
When asked to do this exercise (as reported in Kosslyn 1981), some people claim that they “see” a color that is different from the one they report when they are simply asked to say (without using imagery) what would happen. Results such as this have made people leery of accepting the tacit knowledge explanation. There are indeed many cases where people report a different result when using mental imagery than when asked to merely answer the question without using their image. It is not clear what moral ought to be drawn from this, however, since it is a general property of reasoning that the way the question is put and the reasoning strategy that is used to get to the answer can affect the outcome. Knowledge can be organized and accessed in many different ways (see section 4.3 for more on the relevance of this to mental imagery studies). Indeed, it need not be accessed at all if it seems like more work than it is worth. For example, consider the following analog of the color-mixing task. Close your eyes and imagine someone writing the following on a blackboard: “759 + 356 = ”. Now, imagine that the person continues writing on the board. What number can you “see” being written next? People may give different answers depending on whether they believe that they are supposed to work it out or whether in the interest of speed they are supposed to guess or merely say whatever comes to mind. Each of these is a different task. Even without a theory of what is special about visual imagery, we know that the task of saying what something would look like can be a different task from the task of solving a certain intellectual puzzle about colors or numbers.
In most of the cases studied in imagery research, it would be odd if the results did not come out the way picture theorists predict. For if the results were inconsistent with the picture-theory, the obvious explanation would be that subjects either did not know how things would work in reality or else they misunderstood the instructions to “imagine x”. For example, if you were asked to imagine, in vivid detail, a performance of the Minute Waltz, the failure of the imagined event to take approximately one minute would simply indicate that you had not carried out the task you were supposed to. Since taking roughly one minute is constitutive of a real performance, it is natural to assume it to be indicative of a realistic imaginary re-creation of such a performance.
The concept of tacit knowledge plays an important role in cognitive science (see, for example, Fodor 1968), though it has frequently been maligned because it has to be inferred indirectly. Such knowledge is called “tacit” because it is not always explicitly available for, say, answering questions. There may nonetheless be independent evidence that such knowledge exists. This is a point that has been made forcibly in connection with tacit knowledge of grammar or of social conventions, which typically also cannot be articulated by members of a linguistic or social group, even though violations are easily detected. In our case the role of tacit knowledge can sometimes be detected using the criterion of cognitive penetrability, discussed below.
Not only is the notion of tacit knowledge often misunderstood, but in the case of explaining mental imagery results, the kind of tacit knowledge that is relevant has also been widely misunderstood. The only knowledge that is relevant to the tacit knowledge explanation is knowledge of what things would look like to subjects in situations like the ones in which they are to imagine themselves. Many writers have mistakenly assumed that the tacit knowledge explanation refers to one of several other kinds of knowledge. For example, although tacit knowledge of what results the experimenter expects (sometimes referred to as “experimenter demand effects”) is always an important consideration in psychological experiments (and may be of special concern in mental imagery experiments; see Banks 1981; Intons-Peterson 1983; Intons-Peterson & White 1981; Mitchell & Richman 1980; Reed, Hock & Lockhead 1983; Richman, Mitchell & Reznick 1979), it is not the knowledge that is relevant to the tacit knowledge explanation, as some have assumed (Finke & Kurtzman 1981b). Nor is it knowledge of such things as how the visual system works. It is not relevant to the tacit knowledge explanation that people are unlikely to know how their visual system or the visual brain works (as Farah 1988 has assumed). It is also not the knowledge people might have of what results to expect from experiments on mental imagery (as assumed by Denis & Carfantan 1985). Denis & Carfantan studied “people’s knowledge about images” and found that people often failed to correctly predict what would happen in experiments such as mental scanning. But these sorts of questions invite respondents to consider their folk psychological theories to make predictions about psychological experiments. They do not reflect tacit knowledge of what it would look like if the observers were to see a certain event happening in real life.
The tacit knowledge claim is simply the claim that when subjects are asked to “imagine x” they use their knowledge of what “seeing x” would be like (as well as their other psychophysical skills, such as estimating time-to-collision) and they simulate as many of these effects as they can. Whether a subject has this sort of tacit knowledge cannot always be determined by asking them, and certainly not by testing them for their knowledge of psychology!
Notwithstanding the importance of tacit knowledge explanations of imagery phenomena, it remains true that not all imagery results are subject to this criticism. Even when tacit knowledge is involved, there is often more than one reason for the observed phenomena. An example in which tacit knowledge may not be the only explanation of an imagery finding can be found in Finke & Pinker (1982). The example concerns a particular instance of mental scanning (one in which it takes more time to judge that an arrow points to a dot when the dot is further away). Finke and Pinker argued that these results could not have been due to tacit knowledge because, even though subjects correctly predicted that judgments would take more time when the dots were further away, they failed to predict that the time would actually be longer for the shortest distance used in the study. But this was a result that even the authors failed to anticipate, because the aberrant short-distance time was most likely due to some mechanism (perhaps attentional crowding) different from the one that caused the monotonic increase of time with distance.
Another example in which tacit knowledge does not account for some aspect of an imagery phenomenon is in what has been called “representational momentum”. It was shown that when subjects observe a moving object and are asked to recall its final position from memory, they tend to misremember it as being displaced forward. (Freyd & Finke 1984) attributed this effect to a property of the imagery architecture. On the other hand, (Ranney 1989) suggested that the phenomenon may actually be due to tacit knowledge. It seems that at least some aspects of the phenomenon may not be attributable to tacit knowledge (Finke & Freyd 1989). But here again there are other explanations besides tacit knowledge or image architecture. In this particular case there is good reason to think that part of the phenomenon is actually visual. There is evidence that the perception of the location of moving objects is ahead of the actual location of such objects (Nijhawan 1994). Eye movement studies have also shown that gaze precedes the current location of moving objects in an anticipatory fashion (Kowler 1989, 1990). Thus even though the general phenomenon, involving imagined motion, may be attributable to tacit knowledge, the fact that the moving stimuli are presented visually may result in the phenomena also being modulated by the visual system. The general point in both these examples is that even in cases where tacit knowledge is not the sole determiner of a result in an imagery experiment, the phenomena in question need not reveal properties of the architecture of the imagery system. They may be due to properties of the visual system, the memory system, or a variety of other systems that might be involved.
How is it possible to tell whether certain imagery effects reflect the nature of the imagery architecture or the person’s tacit knowledge? In general, methodologies for answering questions about theoretical constructs are limited only by the imagination of the experimenter. Typically they involve convergent sources of evidence and independent theoretical motivation. One theoretically motivated diagnostic, discussed at length in (Pylyshyn 1984), is to test for the cognitive penetrability of the observations. This criterion is based on the assumption that if a particular pattern of observations arises because people are simulating a situation based on their tacit beliefs, then if we alter their beliefs or their assumptions about the task, say by varying the instructions, the pattern of observations may change accordingly, in ways that are rationally connected with the new beliefs. So, for example, if we instruct a person on the principles of color mixing we would expect the answer to the imaginary color-mixing question discussed above to change appropriately. We will see other examples of the use of this criterion throughout this article (especially the examples in section 4).
Not every imagery-related phenomenon that is genuinely cognitively impenetrable provides evidence for the nature of mental images or their mechanisms. Clearly many beliefs are resistant to change by merely being told that they are false. Nonetheless this criterion has proven useful in identifying parts of the visual system that constitute what is called early vision (Pylyshyn 1999). Cognitive impenetrability remains a necessary but not sufficient condition for a pattern being due to the architecture of the imagery system.
The idea that what happens in certain kinds of problem solving can be viewed as off-line simulation has had a recent history in connection not only with mental imagery (Currie 1995), but also with other sorts of problems in cognitive science (Klein & Crandall 1995). But even if we grant that the “simulation mode” of reasoning is used in various sorts of problem solving, the question still remains: what does the real work in solving the problem by simulation – a special property of images (that is, the architecture of the image system) or tacit knowledge?
In what follows I will sketch a number of influential experimental results often cited in support of the picture theory, and compare explanations given in terms of inherent properties of the image with those given in terms of simulation based on tacit knowledge.
Probably the most cited result in the entire repertoire of research motivated by the picture-theory is the image-scanning phenomenon. Not only has this experimental paradigm been used dozens of times, but various arguments about the “metrical” or spatial nature of mental images, as well as arguments about such properties of the mind’s eye as its “visual angle,” rest on this phenomenon. Indeed, it has been referred to as a “window on the mind” (Denis & Kosslyn 1999).
The finding is that it takes longer to “see” a feature in a mental image that is further away from the place in the image an observer was initially focusing upon. So, for example, if you are asked to imagine a dog and inspect its nose and then to “see” what its tail looks like it will take you longer than if you were asked to first inspect its hind legs. Here is an actual experiment reported in (Kosslyn, Ball & Reiser 1978). Subjects were asked to memorize a map such as the one in Figure 1. They were then asked to imagine the map and to focus their attention on one place — say the “church”. In a typical experiment (there are many variants of this basic study) the experimenter says the name of a second place (say, “beach” or “tree”) and subjects are asked to examine their image and to press a button as soon as they can “see” the second named place on their image of the map. What many researchers have found consistently is that the further away the second place is from where the subject is initially focused, the longer it takes to “see” the second place in the image.
From this scanning result most researchers have concluded that larger map distances are represented by greater distances in image space. In other words, the conclusion that is drawn from this kind of experiment is that mental images have spatial properties – that is, they have spatial magnitudes or distances, as opposed to just encoding such properties in some unspecified manner. This is a strong conclusion about cognitive architecture. It says, in effect, that the symbolic code idea that forms the foundation of computational theories does not apply to mental images. In a symbolic encoding two places can be represented as being further away just the way we do it in language; by saying the places are, say, n meters from one another. But the representation of larger distances is not itself in any sense larger.

Figure 1: Map to be learned and imaged in one’s “mind’s eye” to study mental scanning
Is this strong conclusion about the metrical property of mental images warranted? Does the difference in scanning time reveal a property of the architecture or a property of what is represented? Notice how this distinction exactly parallels the situation in the color-mixing example discussed earlier. There we asked whether a particular observation revealed a property of the architecture or a property of what people know or believe – a property of the represented situation of which they have tacit knowledge. To answer this question for the scanning experiment we need to determine whether the pattern of increasing reaction time arises from a fixed capacity of the image-encoding or image-examining system or whether it can be altered by changing subjects’ understanding of the task or the beliefs that they hold about what it would be like to examine a real map; whether it is cognitively penetrable.
This is a question to be settled in the usual way – by careful analyses and experiments. But even before we do the experiment there is reason to suspect that the time-course of scanning is not a property of the cognitive architecture. Do the following test on yourself. Imagine that there are lights at each of the places on your mental image of the map. Imagine that a light goes on at, say, the beach. Now imagine that this light goes off and one comes on at the lighthouse. Did you need to scan your attention across the image to see the light come on at the lighthouse? Liam Bannon and I repeated the scanning experiment (see the description in Pylyshyn 1981) by showing subjects a real map with lights at the target locations, much as I just described. We allowed the subjects to turn lights on and off. Whenever a light was turned on at one location it was simultaneously extinguished at another location. Then we asked subjects to imagine that very map and to indicate (by pressing a button) when a light was on and they could “see” the illuminated place in their image. The time between button presses was recorded and its correlation to the distances between illuminated places on the map was computed. We found that there was no relation between distance on the imagined map and time. You might think: Of course there was no increase in time with increasing distance, because subjects were not asked to imagine scanning that distance. But that’s just the point: You can imagine scanning over the imagined map if you want to, or you can imagine just hopping from place to place on the imaginary map. If you imagine scanning, you can imagine scanning fast or slow, at a constant speed or at some variable speed, or scanning part way and then turning back or circling around! 
You can, in fact, do whatever you wish since it is your image.[2] At least you can do these things to the extent that you can create the phenomenology or the experience of them, and provided you are able to generate the relevant measurements, such as the time you estimate it would take to get from point to point.
Whether or not you choose to simulate a certain temporal pattern of events in the course of answering a question may also depend in part on whether simulating that particular pattern seems to be relevant to the task. It is not difficult to set up an experimental situation in which simulating the actual scanning from place to place does not appear so obviously relevant to solving a particular problem. For example, we ran the following experiment that involved extracting information from an image (Pylyshyn 1981). Subjects were asked to memorize a map and to refer to their image of the map in solving the problem. As in the original (Kosslyn et al. 1978) studies, subjects had to first focus on one place on their imagined map and then to “look” at a second named place. The experiment differed from the original study, however, in that the task was to indicate the compass direction from the second named place to the previously focused place. This direction-judgment task requires that the subject make a judgment from the perspective of the second place, so it requires focusing at the second place. Yet in this experiment, the question of how you get from the first place to the second place on the map was far less prominent than it was in the “tell me when you can see X” task. In this study we found that the distance between places had no effect on the time taken to make the response. Thus it seems that the effect of distance on reaction time is cognitively penetrable.[3]
Not only do observers sometimes move their attention from one imagined object to another without scanning through the space between them, but we have reason to believe that they cannot move their attention continuously through empty imagined space (see section 0 for a brief description of the relevant study).
Another study closely related to the mental scanning paradigm, is one where it is found that it takes more time to report some visual detail of an imagined object if the object is imagined to be small, than if it is imagined to be large (e.g., it takes longer to report that a mouse has whiskers if the mouse is imagined as tiny, than if it is imagined as huge). This seems like a good candidate for a tacit knowledge explanation, since when you actually see a small object you know that you can make out fewer of its details due to the limited resolution of your eye. So if you are asked to imagine something small, you are likely to imagine it as having fewer visible details than if you are asked to imagine it looming large directly in front of you, whatever form of representation that may involve.
The original picture-theory view of this result is problematic in any case. What does it mean for your image to be “larger”? Such a notion is meaningful only if the image has a real size or scale. If, as in our null hypothesis, the information in the image is in the form of a symbolic description, then size has no literal meaning. You can think of something as large or small, but that does not make some thing in your head large or small. On the other hand, which details are represented in your imagination does have a literal meaning: You can put more or less detail into your active representation. Inasmuch as the task of imagining the mouse as “small” entails that you imagine it having fewer visible details, the result is predictable without any notion of real scale applying to the image.
The obvious test of this proposal is to apply the criterion of cognitive penetrability. Are there instructions that can counteract the effect of the “image size” manipulation, making details easier to report in small images than in large ones and vice versa? Could you imagine a small but extremely high resolution and detailed view of an object, in contrast to a large but low-resolution or fuzzy view that lacks details? I know of no one who has bothered to carry out an experiment that asks subjects to, say, report details from a large blurry image and then from a small clear one. What if such an experiment were done and showed that it is quicker to report details from a large blurry object than a small clear one? The strangeness of such a possibility should alert us to the fact that what is going wrong lies in what it means to have a blurred versus a clear image. Such results would be incompatible with what happens in seeing. If it took longer to see fine details in a real large object there would have to be a reason for it, such as that you were seeing it through a fog or out of focus. Thus so long as examining a visual image means simulating what it is like to see something, the results must be as reported; how could studies involving different sized mental images, or blurred versus clear images, fail to show that they parallel the case of seeing, unless subjects did not understand the instructions (e.g., did not understand the meaning of “blurry”)? The same goes for the imagery analogue of any property of seeing of which observers have some tacit knowledge or recollection. Thus it applies to the findings concerning the acuity profile of imagery, which approximates that of vision (Finke & Kosslyn 1980).
Observers do not need to have articulated scientific knowledge of visual acuity; all they need is to remember roughly how far into the periphery of their visual field things can go before they become hard to see, and it is not surprising that this is easier to do while turning your head (with eyes closed) and pretending to be looking at objects in your periphery (which is how these studies were done).
There are many reasons why one might use a “simulation mode” strategy in answering a question; reasons that have nothing to do with the spatial nature of imagery, and sometimes not even because of what tacit knowledge is available. For example, to answer the question: What is the fourth (or n’th) letter in the alphabet after “M,” people normally have to go through the alphabetical sequence (and it takes them longer the larger the value of n). Similarly, the findings reported by (Shepard & Feng 1972) are easily understood if one considers how the relevant knowledge is organized. In their experiment, subjects are asked to mentally fold pieces of paper, such as shown in Figure 2, and to report whether the arrows marked on the paper would touch one another. They found that the more folds it would require to actually fold the paper and see whether the arrows coincide, the longer it takes to imagine doing so. From this they concluded that working with images parallels working with real objects.

Figure 2: Two of the figures used in the (Shepard & Feng 1972) experiment. The task is to imagine folding the paper (using the dark shaded square as the base) and say whether the arrows in these two figures coincide. The time it takes increases with the number of folds required.
The question that needs to be asked about this task is the same as the question we asked in connection with the color mixing task: What is responsible for the relation between time taken to answer the question and the number of folds it would have taken in folding real paper? This time the answer is not simply that it depends on tacit knowledge, because in this case it is not just the content of the tacit knowledge that makes the difference. It is because of the knowledge that subjects have about paper folding that they can do the task at all. But in this case it appears that imagining making individual folds is required in order to get the answer, and one would presumably get the same result even if one were not trying to imagine folding the paper. It is hard to see how to answer this question without imagining going through the sequence of folds. A plausible explanation for this, which does not appeal to special properties of a mental image system, is that the reason one has to imagine going through a sequence of individual folds is the same as the reason one had to go through a series of letters in the earlier alphabet example. The reason may have to do with how one’s knowledge of the effects of folding is organized. What we know about the effects of paper folding is just this: we know what happens when we make one fold. Consequently to determine what would happen in a task that requires 4 folds, we have to apply our one-fold-at-a-time knowledge four times. Recall the parallel case with letters: In order to determine what the fourth letter after M is we have to apply the “next letter” rote knowledge four times. In both cases a person could, in principle, commit to memory such facts as what results from double folds of different types; or which letter of the alphabet occurs exactly n letters after a given letter. If that were how paper-folding knowledge was organized, the Shepard and Feng results might not hold.
The important point is that once again the result tells us nothing about how the states of the problem are represented — or about any special properties of image representations. They tell us only what knowledge the person has and how it is organized.
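The contrast between the two ways of organizing knowledge can be made concrete with a small sketch of the alphabet example. It is offered only as an illustration, not as a model of how subjects actually perform the task. The names `NEXT_LETTER`, `OFFSET_TABLE`, `nth_letter_after` and `nth_letter_after_memorized` are hypothetical, and counting retrieval steps as a stand-in for response time is an assumption made for the illustration.

```python
import string

ALPHABET = string.ascii_uppercase

# One-step knowledge: for each letter only its immediate successor is stored.
NEXT_LETTER = {a: b for a, b in zip(ALPHABET, ALPHABET[1:])}


def nth_letter_after(letter, n):
    """Apply the one-step 'next letter' rule n times; return (answer, steps)."""
    current = letter
    steps = 0
    for _ in range(n):
        current = NEXT_LETTER[current]
        steps += 1  # hypothetical proxy for time: one retrieval per step
    return current, steps


# Alternative organization (assumed for contrast): every (letter, n) pair
# has been committed to memory in advance.
OFFSET_TABLE = {
    (a, n): ALPHABET[i + n]
    for i, a in enumerate(ALPHABET)
    for n in range(1, len(ALPHABET) - i)
}


def nth_letter_after_memorized(letter, n):
    """A single retrieval regardless of n; return (answer, steps)."""
    return OFFSET_TABLE[(letter, n)], 1
```

Under the first organization the number of steps grows with n; under the second it does not, although both operate on the same symbolic code. Neither version involves anything spatial or pictorial, so in this sketch the dependence of steps on n reflects only how the knowledge is organized.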
The role played by the structure of knowledge is ubiquitous and may account for another common observation about the use of mental imagery in recall. We know that some things are easier to recall than others and that it is easier to recall some things when the recall is preceded by the recall of other things. Memory is linked in various intricate ways. In order to recall what you did on a certain day it helps to first recall what season that was, what day of the week it was, where you were at the time, and so on. (Sheingold & Tenney 1982; Squire & Slater 1975) and others have shown that one’s recall of distant events is far better than one generally believes because once the process of retrieval begins it provides clues for subsequent recollections. The reason for bringing up this fact about recall is that such sequential dependencies are often cited as evidence for the special nature of imagery (Bower 1976; Paivio 1971). Thus, for example, in order to determine how many windows there are in your home, you probably need to imagine each room in turn and look around to see where the windows are, counting them as you go. In order to recall whether someone you know has a beard (or glasses or red hair) you may have to first recall other aspects of what he or she looks like (that is, recall an image of them). Apart from the phenomenology of recalling an appearance, what is going on is absolutely general to every form of memory retrieval. Memory access is an ill-understood process, but at least it is known that it has sequential dependencies and other sorts of access paths and that these paths are often dependent on spatial arrangements (which is why the “method of loci” works well as a mnemonic device).
One of the earliest and most cited results in the research on manipulating mental images is the “mental rotation” finding. (Shepard & Metzler 1971) showed subjects pairs of drawings of three-dimensional figures, such as those illustrated in Figure 3, and asked them to judge whether the two objects depicted in the drawings were identical, except for orientation. Half the cases were mirror reflections of one another (or the 3D equivalent, called enantiomorphs), and therefore could not be brought into correspondence by a rotation. Shepard and Metzler found that the time it took to make the judgment was a linear function of the angular displacement between the pair of objects depicted.

Figure 3. Examples similar to those used by (Shepard & Metzler 1971) to show “mental rotation.” The time it takes to decide whether two figures are identical except for rotation (a, b) or are mirror images (a, c) increases linearly as the angle between them increases.
This result has been universally interpreted as showing that mental images of the objects are “rotated” continuously and at constant speed in the mind and that this is, in fact, the means by which the comparison is made: We rotate one of the pair of figures until the two are sufficiently in alignment that it is possible to see whether they are the same or different. The phenomenology of the Shepard and Metzler task is clearly that we rotate the figure in making the comparison. I do not question either the phenomenology or the description that what goes on in this task is “mental rotation.” But there is some question about what these results tell us about the nature of mental images. The important question is not whether we can or do imagine rotating a figure, but whether we solve the problem by means of the mental rotation. For mental rotation to be a mechanism by which the solution is arrived at, its utility would have to depend on some intrinsic property of images. For example, if it were the case that during mental rotation the figure moves as a rigid form through a continuum of angles, then mental rotation would be capitalizing on an intrinsic property of the image format.
Contrary to the general assumption, however, figural “rotation” is not a holistic process that operates on an entire figure, while the figure retains its rigid shape. Subjects in the original 3D rotation study (Shepard & Metzler 1971) examined both the target and the comparison figures together. In a subsequent study that monitored eye movements, (Just & Carpenter 1976) showed that observers look back and forth between the two figures, checking for distinct features. This point was also made using simpler 2D figures where it was found that observers concentrate on significant milestone features when carrying out the task (Hochberg & Gellman 1977), and that when such milestone features are available, no rotation effect is found. In studies reported in (Pylyshyn 1979) I showed that the apparent “rate of rotation” depends both on the complexity of the figure and on the complexity of the post-rotation comparison task (I used a task in which observers had to indicate whether a test figure, presented at various orientations, was embedded within the original figure). The dependence of the rotation speed on such organizational and task factors shows that whatever is going on in this case does not appear to consist in merely “rotating” one figure in a rigid manner into correspondence with the reference figure.
Even if the process of making the comparison in some sense involves the “rotation” of a represented shape, this tells us nothing about the form of the representation and does not support the view that the representation is pictorial. The proposal that a representation maintains its shape because of the inherent rigidity of the image while it is rotated cannot be literally true, notwithstanding the phenomenology. The representation is not literally being rotated; no codes or patterns of codes are being moved in a circular motion. At most what could be happening is that a representation of a figure is processed in such a way as to produce a representation of a figure at a slightly different orientation, and then this process is iterated (perhaps even continuously). There are probably good reasons, based on computational resource considerations, why the process might proceed by iterating parts of a form over successive small angles (thus causing the comparison time to increase with the angular disparity between the figures) rather than attempt the “rotation” in one step. For example, (Marr & Nishihara 1976) hypothesized what they called a primitive SPASAR mechanism, whose function was to compute the rotation of a simple dihedral vertex and determine its orthographic projections in a reference frame (a slightly different version, which left out the details of the SPASAR mechanism, was later published in Marr & Nishihara 1978). This was an interesting idea that assumed a limited analogue operation that could be applied to one small feature of a representation at a time. Yet the Marr and Nishihara proposal did not postulate a pictorial representation, nor did it assume that a rigid configuration was maintained by an image in the course of its “rotation.” It hypothesized a simple primitive operation on parts of a structured representation in response to a computational complexity issue.
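To see why an iterated small-step process yields comparison times that grow linearly with angular disparity, whatever the form of the representation, consider the following minimal sketch. It is not the Marr and Nishihara SPASAR mechanism and makes no claim about how the brain performs the task. The function names, the use of a list of 2D coordinate pairs as the description of a figure, and the 5-degree step size are all assumptions introduced for illustration.

```python
import math


def rotate_step(points, step_deg):
    """Produce a new description of the figure at a slightly different orientation."""
    t = math.radians(step_deg)
    c, s = math.cos(t), math.sin(t)
    return [(x * c - y * s, x * s + y * c) for x, y in points]


def iterations_to_align(points, disparity_deg, step_deg=5.0):
    """Iterate the small-step operation until the disparity has been covered.

    Returns the final description and the number of iterations, which serves
    here as a hypothetical proxy for comparison time.
    """
    current = list(points)
    turned = 0.0
    steps = 0
    while turned < disparity_deg - 1e-9:
        current = rotate_step(current, step_deg)
        turned += step_deg
        steps += 1
    return current, steps
```

With these assumptions, a disparity of 90 degrees takes twice as many iterations as one of 45 degrees. The figure is described throughout by a list of numbers; nothing is moved in a circular path and no rigid picture is maintained, yet the linear relation between angle and number of steps still appears.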
Like the paper folding task discussed earlier, the mental rotation phenomenon is robust and not cognitively penetrable, and is not a candidate for a straightforward tacit knowledge explanation (as I tried to make clear in Pylyshyn 1979). Rather, the most likely explanation is one that appeals to the computational requirements of the task and general architectural (that is, working memory) constraints, and therefore applies regardless of the form of the representation. No conclusions concerning the format of image representations, or the form of their transformation, follow from the rotation results. Indeed these findings illustrate how treating the phenomenology as explanatory does not help us to understand why or how the behavior occurs.
It has frequently been suggested that images differ from structured descriptions in that the former stand in a special relationship to what they represent, a relationship referred to as depicting. One way of putting this is to say that in order to depict some state of affairs the representation needs to correspond to the spatial arrangement it represents the way that a picture does. One of the few people who have tried to be explicit about what this means is Stephen Kosslyn,[4] so I quote him at some length (Kosslyn 1994, p5):
“A depictive representation is a type of picture, which specifies the locations and values of configurations of points in a space. For example, a drawing of a ball on a box would be a depictive representation. The space in which the points appear need not be physical, such as on this page, but can be like an array in a computer, which specifies spatial relations purely functionally. That is, the physical locations in the computer of each point in an array are not themselves arranged in an array; it is only by virtue of how this information is “read” and processed that it comes to function as if it were arranged into an array (with some points being close, some far, some falling along a diagonal, and so on). In a depictive representation, each part of an object is represented by a pattern of points, and the spatial relation among these patterns in the functional space correspond to the spatial relations among the parts themselves. Depictive representations convey meaning via their resemblance to an object, with parts of the representation corresponding to parts of the object… . When a depictive representation is used, not only is the shape of the represented parts immediately available to appropriate processes, but so is the shape of the empty space… . Moreover, one cannot represent a shape in a depictive representation without also specifying a size and orientation… .”
This quotation introduces a number of issues that need to be examined closely. One idea we can put aside is the claim that depictive representations convey meaning through their resemblance to the objects they depict. This relies on the extremely problematic notion of resemblance, which has been known to be inadequate as a basis for meaning (certainly since Wittgenstein 1953). Resemblance is neither necessary nor sufficient for something to have a particular meaning or reference: Images may resemble what they do not refer to (e.g. an image of John’s twin brother does not refer to John) and they may refer to what they do not resemble (an image of John taken through a distorting lens is an image of John even though it does not resemble him).
Despite its obvious problems, the notion of resemblance keeps surfacing in discussions of mental images, in a way that reveals how deeply the conscious experience of mental imagery contaminates conceivable theories of mental imagery. For example, (Finke 1989) begins with the observation, “People often wonder why mental images resemble the things they depict.” But the statement that images resemble things they depict is just another way of saying that the conscious experience of mental imagery is similar to the conscious experience one would have if one were to see the thing one was imagining. Consider what it would be like if images did not “resemble the things they depict.” It would be absurd if in imagining a table one had an experience that was like that of seeing a dog. Presumably this is because (a) what it means to have a mental image of a chair is that you are having an experience like that of seeing a chair, and (b) what conscious content your image has is something on which you are the final authority. You may be deceived about lots of things concerning your mental image. You may be, and typically are, deceived about what sort of thing your image is (that is, what form and substance underlies it), but surely you cannot be deceived about what your mental image looks like, or what it resembles. These are not empirical facts about imagery, they are just claims about what the phrase “mental image” means.
In contrast to the vacuity of the criterion of resemblance, the proposal that images can be decomposed into “parts” with the spatial relations among parts of the image in some way mapping onto the parts and the spatial relationships among the corresponding parts of the world, deserves closer scrutiny although it has not received systematic treatment in the literature. Some time ago (Sloman 1971) suggested this as a defining characteristic of analogue representations and it is clearly an important criterion. Although it needs to be spelled out in more detail, this is a reasonable proposal, but it will not yield the conclusion that images are spatial in any sense that bears on the “depiction” story. In fact, it is true of any representational system that is compositional (see section 7.1).
Another proposal mentioned in the quotation is that in depictive representations certain aspects are mandatory so that, for example, if you choose to represent a particular object you cannot fail to represent its shape, orientation and size. This claim too has some truth, although the question of which aspects are mandatory, why they are mandatory, and what this tells us about the form of the representation is not so clear. It is a general property of representations that some aspects tend to be encoded (or assigned a default value) if other aspects are. Sometimes that is true by virtue of what it is that you are trying to imagine. For example, you can’t imagine a melody without also imagining each note, and therefore making a commitment as to how many notes it has. This follows from what it means to “imagine a melody,” not from the inherent nature of some particular form of representation. The same is true for other examples of imaginings. When you ask someone to imagine a familiar shape by giving its name, say the letter “B”, the person will make a commitment to such things as whether it is in upper or lower case. It seems as though you can’t imagine a B without imagining either an upper case “B” or a lower case “b”. But is this not another case of a requirement of the task to “imagine a ‘B’”? In this example, are you not being asked to describe what you would see if you were actually looking at a token of a particular printed letter? If you actually saw a token of a B you would see either a lower or an upper case letter, but not both and not neither. If someone claimed to have an image of a B that was noncommittal with respect to its case you would surely be entitled to say that the person did not have a visual image at all.
When you get to other contents of an image, the situation gets murkier because it becomes less clear what exactly the task of “imagining the letter ‘B’” entails. Does your image of the letter have to have a color or texture or shading? Must you represent the background against which you are viewing it, the direction of lighting and the shadows it casts? Must you represent it as viewed from a particular point of view? What about its stereoscopic properties; do you represent the changing parallax of its parts as you imagine moving in relation to it? Could you choose to represent any or none of these things? Most of our visual representations, at least in memory, are noncommittal in various respects (for examples, see Pylyshyn 1978). In particular, they can be noncommittal in ways that no picture can be noncommittal. Shall we then say that they are not images? How you feel about such questions is more terminological (that is, what you are disposed to count as an image representation) than empirical. It shows the futility of assuming that mental images are just like pictures. As the graphic artist M.C. Escher once put it,
…a mental image is something completely different from a visual image, and however much one exerts oneself, one can never manage to capture the fullness of that perfection which hovers in the mind and which one thinks of, quite falsely, as something that is ‘seen’ (Escher 1960, p7).
Despite the temptation to do so, imagery theorists have been reluctant to claim that images are literally laid out in real space – that is, on a physical surface in the brain. However, because theories of imagery have had to appeal to such notions as distance, shape, size and so on, some notion of space is always presupposed. Consequently many writers who see the need for spatial properties speak of a “functional” space, with locations and other spatial properties being defined functionally (e.g., Denis & Kosslyn 1999). The example frequently cited (see the Kosslyn quotation above) is that of a matrix data structure in a computer, which can be viewed as having many of the properties of space without itself being laid out spatially in the physical machine. This is in some ways an attractive idea since it appears to allow us to claim that images have certain spatial properties without being committed to how they are implemented in the brain – so long as the implementation and its accessing operations function the way a real spatial system would function. The hard problem is to give substance to the notion of a functional space that does not reduce it to being either a summary of the data, with no explanatory mechanisms, or a model of real literal space. This problem has been so widely misunderstood that it merits some extended discussion.
Consider first why a matrix data structure might appear to constitute a “functional space”. As typically used it seems to have two (or more) dimensions (since referencing individual cells is typically done by providing two numerical references or “coordinates”), to have distances (if we identify distance with the number of cells lying between two places), and to have empty spaces (so that it explicitly represents both where there are features and where there are no features). Graphical elements, such as points, contours, and regions can be represented by entering features into the cells at quantized coordinates. There is then a natural sense of the properties of “adjacency,” as well as of places being “between” two specified locations (as well as other simple geometrical properties of sets of features, such as being collinear, forming a triangle, and so on). Because of this, operations such as “scanning” from one feature to another, as well as of “shifting” and “rotating” patterns, can be given natural definitions (see, e.g., Funt 1980). Thus, the format of such a data structure appears to lend itself to being interpreted as “depictive” rather than “language-like” as noted in the earlier Kosslyn quote.
Notice, however, that all the spatial notions mentioned in the previous paragraph are properties of a way of thinking about or of interpreting the data structure; they are not intrinsic properties of the matrix data structure itself. What makes cells in a matrix appear to be locations that have such properties as adjacency, betweenness, alignment, distance, and so on, is not any property of the matrix, nor even of the way that this data structure must be used. There is no sense in which any pair of cells is special and so there is no natural sense in which some pairs of cells are “adjacent”, including a sense that derives from how they must be accessed. There are literally no constraints on the order in which cells must be accessed. We can, of course, require that the matrix be accessed in certain ways, and when we model imagery we typically do stipulate that certain pairs of cells be considered “adjacent” and that in accessing any pair of cells in a serial fashion, certain other cells (the ones we designate as being “between” the pair) must be visited first and in a certain order (which we call “scanning”). But it is critical to the interpretation of a computational process as a model of mental imagery that we be clear as to why such constraints hold. If our model of imagery assumed a literal physical surface, then the reason would be clear: physical laws require that movement over a surface follow a certain pattern, such as that the time it takes to get from one place to another is the ratio of the distance traversed to the speed of movement. But in a matrix no such intrinsic constraint exists. Such a constraint must be stipulated as an extrinsic constraint (along with many other constraints, such as those that govern the invariance of adjacency, betweenness, or collinearity, with transformations of scale, orientation, and translation).
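The contrast between intrinsic and extrinsic constraints can be illustrated with a minimal Python sketch. It is not drawn from any published imagery model: the names `grid`, `read_cell`, and `scan`, the feature labels, and the row-then-column stepping rule are all hypothetical choices made for illustration. The sketch is meant only to show that the data structure itself allows any cell to be reached in a single step, and that a cost proportional to "distance" appears only because we chose to route access through a function that imposes it.

```python
# A 5x5 matrix with "features" entered at two cells (illustrative content only).
grid = [[None] * 5 for _ in range(5)]
grid[0][0] = "lighthouse"
grid[0][4] = "beach"


def read_cell(grid, row, col):
    """Direct access: one step, regardless of which cell was read before."""
    return grid[row][col], 1  # (content, number of access steps)


def scan(grid, start, end):
    """Stipulated access: step through 'intermediate' cells one at a time.

    Nothing about the matrix requires this. The rule (move along the row,
    then along the column) is an assumption imposed by this function.
    """
    row, col = start
    steps = 0
    while (row, col) != end:
        if col != end[1]:
            col += 1 if end[1] > col else -1
        else:
            row += 1 if end[0] > row else -1
        steps += 1
    return grid[row][col], steps


# Same cell, same content, different cost: the difference is in the access
# rule we wrote, not in the data structure.
direct = read_cell(grid, 0, 4)
scanned = scan(grid, (0, 0), (0, 4))
```

Replacing `scan` with `read_cell` changes nothing about the matrix, yet removes the "distance effect" entirely, which is one way of seeing that the effect was never a property of the format.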
The spatiality of a matrix, or of any other form of “functional space”, must be stipulated or assumed over and above any intrinsic property of the format of the representation. The crucial fact about extrinsic constraints is that such constraints are independently stipulated, and so could be applied equally to any form of representation, including a model of imagery that used symbolic expressions or structured descriptions. So far as the notion of functional space is concerned, there is nothing to prevent us from modeling the content of an image as a set of sentence-like expressions in a language of thought. We could then stipulate that in order to go from examining one place (referred to, say, by a unique name) to examining another place (also referred to by a name) you must pass through (or apply an operation to) the places whose names are located between the two names on some list. You might object that this sort of model is ad hoc. It is. But no more ad hoc than when the constraints are applied to a matrix formalism. Notice, moreover, that both become completely principled if they are taken to be simulations of a real spatial display.
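The point that the same stipulation carries over to a non-matrix format can be sketched in Python as well. Everything in the sketch is hypothetical: the place names, the `contents` mapping, the `ordering` list, and the functions `examine` and `scan_names` are invented for illustration and do not correspond to any existing model. The sketch assumes only what the paragraph above describes, namely named places and a rule that getting from one name to another requires passing through the names listed between them.

```python
# Image content as name-indexed expressions: each place name maps to a feature.
contents = {
    "a": "lighthouse",
    "b": None,
    "c": "hut",
    "d": None,
    "e": "beach",
}

# A separate, stipulated list that says which names count as "between" others.
ordering = ["a", "b", "c", "d", "e"]


def examine(name):
    """Direct access by name: one step, with no notion of distance."""
    return contents[name], 1


def scan_names(start, end):
    """Stipulated access: pass through every name listed between start and end."""
    i, j = ordering.index(start), ordering.index(end)
    step = 1 if j >= i else -1
    visited = [start]
    while i != j:
        i += step
        visited.append(ordering[i])
    # The number of steps grows with list separation only because of this rule.
    return contents[end], len(visited) - 1, visited


forward = scan_names("a", "e")
backward = scan_names("e", "a")
```

The "scanning time" produced here depends on separation in `ordering` in just the way it would depend on cell separation in a matrix. In both cases the dependence is supplied by the access rule rather than by the format of the representation.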
You might wonder why the matrix feels more natural than other ways of simulating space. The answer may be that a matrix offers a natural model of space because we are used to thinking of and displaying matrices as two-dimensional tables (complete with empty cells) and of viewing the cells as being referenced by names that we think of as pairs of coordinates[5]. We thus find it easy to switch back and forth between the data-structure view and the (physical) table view. Because of this, it is natural to interpret a matrix as a model of real space and therefore it is easy to make the slip between thinking of it on one hand as merely a “functional space” and thinking of it, on the other hand, as a stand-in for (or a simulation of) real space – a slip we encounter over and over in theorizing about the nature of mental imagery. As a simulation of real space it is unproblematic so far as the sorts of problems discussed here are concerned. But we must recognize that in this case we are assuming that images are written on a literal spatial medium, which we happen to be simulating by a matrix (for reasons of convenience). In fact, in Kosslyn et al. (1979) this view was made explicit when the authors invoked what they call the “cathode-ray tube model”. In that case it is the literal space that has the explanatory force, notwithstanding the fact that, as a practical matter, it is being simulated on a digital computer.
The point is that there is no such thing as a “functional space” apart from the set of extrinsic stipulations or constraints we choose to impose on such things as how symbolic names (e.g., matrix coordinates) map onto places in a physical display and how distances and geometrical predicates are to be interpreted over the data structure. What we have, rather, is one of two things: either a real physical space, with its (approximately) Euclidean properties, or a symbolic model of such a space.[6] Anything else is merely metaphoric and not explanatory. It allows one to think of an image as spatial without the attending disadvantages of having made an untenable assumption about the architecture of mental imagery.
The real scientific question is not how we can model space in a theory of mental imagery. Rather, it is whether there is any sense in which the architecture of mental imagery incorporates the geometry of real space. Only after we have answered this empirical question can we know whether one should model properties of space in modeling imagery. My purpose in belaboring the distinction between intrinsic and extrinsic constraints, and what is being presupposed when we talk of “functional space,” is simply to set the stage for the real issues, which are empirical. I have already described some of the relevant empirical findings in connection with mental scanning and have suggested that the same is likely to be true for other findings that imply that images have metrical properties. The cognitive penetrability of such phenomena suggests that the mind does not work as though the imagery architecture imposes constraints like those you would expect of a real spatial display. It appears that we are not required to scan through adjacent places in getting from one place to another in an image – we can get there as quickly or as slowly as we wish, with or without visiting intermediate filled or empty places (assuming that visiting empty places is even possible – see section 0).
In most imagery studies subjects are asked to imagine something while looking at a scene; thus, at least in some phenomenological sense, superimposing or projecting an image onto the perceived world. Yet it has been amply demonstrated (O'Regan & Lévy-Schoen 1983) that true superposition of visual percepts does not occur when visual displays are presented in sequence, or across saccades. So what happens when a mental image (whether constructed or derived from memory) is superimposed over a scene? In many of these cases (e.g., Farah 1989; Hayes 1973; Podgorny & Shepard 1978) a plausible answer is that one allocates attention to the scene according to a pattern that corresponds roughly to the projected image. Alternatively, and more plausibly, one simply thinks of imagined objects as being located at places actually occupied by certain perceived ones. Thinking that something is at a certain location need not entail projecting an imagined shape onto some background. It might require nothing more than allocating attention to a particular object in a scene and thinking of that object as having a certain property. It is no more than thinking “this (e.g., referring to a bit of texture) is where I imagine feature F to be located”. The capacity for this sort of “demonstrative reference” has been investigated extensively and discussed by Pylyshyn (2000, 2001).
Consider, for example, the study reported by (Podgorny & Shepar