Two friends go for a walk one fine summer afternoon and soon find themselves debating the pros and cons of experiments with AI systems. Fred says, ``The kind of designs you describe are fine for psychology experiments, but AI programs are really complicated, and simple experiment designs won't tell us much.'' Fred's companion, Abigail, replies, ``Humans are pretty complicated, too, but that doesn't stop psychologists studying them in experiments.'' Abigail thinks Fred is missing an important point; Fred isn't satisfied with her response. Abigail believes the complexity of AI programs generally does not preclude experimental control, so complexity doesn't bother her. She privately suspects Fred doesn't understand the concept of control because he often refers to the ``huge number of factors that must be controlled.'' Abigail knows most of these factors can be treated as noise and needn't be controlled directly; she knows random sampling will do the job. Fred, on the other hand, believes that if you control all but one of the factors that affect a system's behavior (whether the control is direct or by random sampling), then only small results can be demonstrated. (This argument is essentially Allen Newell's twenty questions challenge from the preface.) Fred also asserts that each small result applies only to the system within which it was demonstrated, and does not automatically apply to other systems.
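To see Abigail's point in miniature, here is a small simulation sketch (in Python, with invented variant names, scores, and effect sizes; nothing here comes from a real system). Task difficulty stands in for one of Fred's many uncontrolled factors: it is never held fixed, but because tasks are randomly assigned to the two program variants it is balanced across the groups and shows up only as noise.

\begin{verbatim}
import random

random.seed(1)

def run_trial(variant, difficulty):
    # Invented performance model: variant "A" scores about five points
    # higher than "B"; harder tasks lower the score; Gaussian noise
    # stands in for every other uncontrolled factor.
    base = 70.0 if variant == "A" else 65.0
    return base - 10.0 * difficulty + random.gauss(0, 3)

# 200 tasks whose difficulties we never control directly.
difficulties = [random.random() for _ in range(200)]
random.shuffle(difficulties)

# Random assignment: half the tasks go to each variant.
group_a, group_b = difficulties[:100], difficulties[100:]

mean_a = sum(run_trial("A", d) for d in group_a) / len(group_a)
mean_b = sum(run_trial("B", d) for d in group_b) / len(group_b)

print("mean difficulty: A %.2f   B %.2f"
      % (sum(group_a) / 100, sum(group_b) / 100))
print("mean score:      A %.1f   B %.1f" % (mean_a, mean_b))
\end{verbatim}

With random assignment the two groups end up with nearly the same mean difficulty, so the roughly five-point gap between A and B reflects the variants rather than the tasks. This is the sense in which the ``huge number of factors'' Fred worries about need not each be controlled by hand.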
Let us agree with Abigail: complexity does not preclude control. Scientists have figured out how to exercise control in very complex physical, biological, social, and psychological experiments. Let us also agree with Fred and borrow a name for his concerns: Psychologists speak of the ecological validity of experiments, meaning the ability of experiments to tell us how real people operate in the real world. The concern has been voiced that psychology experiments, pursuing ever-higher standards of control, have introduced environments and stimuli that people never encounter, and tasks that people would never perform, in the real world. An impassioned appeal for ecological validity is made by Neisser in his book Cognition and Reality (1976). Here is his description of tachistoscopic displays, which illuminate an image for a tiny fraction of a second:
Such displays come very close to not existing at all. They last for only a fraction of a second, and lack all temporal coherence with what preceded or what will follow them. They also lack any spatial link with their surroundings, being physically as well as temporally disconnected from the rest of the world.... The subject is isolated, cut off from ordinary environmental support, able to do nothing but initiate and terminate trials that run their magical course whatever he may do.... Experimental arrangements that eliminate the continuities of the ordinary environment may provide insights into certain processing mechanisms, but the relevance of these insights to normal perceptual activity is far from clear. (Neisser, 1976, p. 36)
Similar things are said about the environments in which AI systems do their work, and the work they do. Here is an analysis by Steve Hanks of two experiments with agents in the Tileworld environment (Hanks et al., 1993, p. 30; see also Pollack and Ringuette, 1990):
I think it's clear that the agents presented in these papers do not in and of themselves constitute significant progress. Both operate in extremely simple domains, and the actual planning algorithm ... is feasible only because the testbed is so simple: the agent has at most four possible primitive actions, it doesn't have to reason about the indirect effects of its actions, it has complete, perfect, and cost-free information about the world, its goals are all of the same form and do not interact strongly, and so on. The argument must therefore be advanced that these experimental results will somehow inform or constrain the design of a more interesting agent.... The crucial part of this extensibility argument will be that certain aspects of the world--those that the testbed was designed to simulate more or less realistically--can be considered in isolation, that is, that studying certain aspects of the world in isolation can lead to constraints and principles that still apply when the architecture is deployed in a world in which the testbed's simplifying assumptions are relaxed.
Neisser and Hanks are saying the same thing: The price of experimental control should not be irrelevant results, nor should control preclude generalization. It would be naive to suggest that experimental control is nothing but a technology, and that the responsibility for minor, irrelevant, hard-to-generalize results rests solely with the researcher. Technologies, and the cultures that adopt them, encourage behaviors: sports cars encourage fast driving, statistics packages encourage data-dredging, and books like this one encourage well-designed experiments, which, if one isn't careful, can be utterly vacuous. This is a danger, not an inevitability. Knowing the danger, we can avoid it. Perhaps the best protection is afforded by the research questions that underlie experimental questions: If you have a reason for running an experiment, a question you're trying to answer, then your results are apt to interest other researchers who ask similar questions.