Phoenix is a simulated environment populated by autonomous agents. It is a simulation of forest fires in Yellowstone National Park and the agents that fight the fires. Agents include watchtowers, fuel trucks, helicopters, bulldozers and, coordinating (but not controlling) the efforts of all, a fireboss. Fires burn in unpredictable ways due to wind speed and direction, terrain and elevation, fuel type and moisture content, and natural boundaries such as rivers, roads and lakes. Agents behave unpredictably, too, because they instantiate plans as they proceed, and they react to immediate, local situations such as encroaching fires.
Lacking a perfect world model, neither a Phoenix planner nor its designers can be absolutely sure of the long-term effects of actions: Does an action interact detrimentally with a later action in the plan? Will an action provide short-term gain at the cost of long-term loss? Are failures caused by a mismatch between the planning system and its environment? These questions are extremely difficult to answer for large, complex systems, and yet these are precisely the systems in which detrimental interactions and failures are most likely [Corbato 91]. To identify the sources of failure and expedite debugging, we have developed a technique called Failure Recovery Analysis (FRA) [Howe 92]. FRA detects dependencies between failure recovery actions - those taken to recover from plan failures - and later failures. FRA also explains how some failure recovery actions might have caused later failures.
FRA involves four steps. First, execution traces are analyzed for statistically significant dependencies between failure recovery actions and subsequent failures; we call this step dependency detection. The remaining three steps explain failures by using the dependencies to focus the search for flaws in the planner that may have caused the observed failures.
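To make the dependency detection step concrete, the following is a minimal sketch, not FRA's actual implementation: it assumes each execution trace is simply a sequence of event labels, and it uses an ordinary chi-square test of independence on a 2x2 contingency table as a stand-in for whatever significance test FRA applies. The function name, trace representation, and threshold are illustrative assumptions.

```python
def dependency_detection(traces, actions, failures, threshold=3.84):
    """Flag (recovery action, subsequent failure) pairs whose co-occurrence
    across execution traces is statistically significant.

    Hypothetical sketch: each trace is a list of event labels; a pair
    co-occurs when the failure appears anywhere after the action in the
    same trace. The test is a chi-square test for independence on a 2x2
    table; 3.84 is the critical value for p < 0.05 at one degree of
    freedom. Only positive dependencies (failure MORE likely after the
    action) are reported, since those suggest the action caused the failure.
    """
    significant = []
    for a in actions:
        for f in failures:
            # Build the 2x2 contingency table over traces:
            # rows = action occurred / did not; columns = failure followed / did not.
            n11 = n10 = n01 = n00 = 0
            for t in traces:
                has_a = a in t
                followed = has_a and f in t[t.index(a) + 1:]
                if followed:
                    n11 += 1
                elif has_a:
                    n10 += 1
                elif f in t:
                    n01 += 1
                else:
                    n00 += 1
            n = n11 + n10 + n01 + n00
            row1, row0 = n11 + n10, n01 + n00
            col1, col0 = n11 + n01, n10 + n00
            if 0 in (row1, row0, col1, col0):
                continue  # degenerate table: no evidence either way
            if n11 <= row1 * col1 / n:
                continue  # association is negative or absent; not a dependency
            # Chi-square statistic for a 2x2 table (shortcut formula).
            chi2 = n * (n11 * n00 - n10 * n01) ** 2 / (row1 * row0 * col1 * col0)
            if chi2 > threshold:
                significant.append((a, f, chi2))
    return significant
```

For example, given ten traces in which recovery action `"A"` is always followed by failure `"F"` and ten traces containing neither, the pair `("A", "F")` is flagged while an unrelated action `"B"` is not. The remaining three FRA steps would then examine only the flagged pairs when searching for the planner flaws responsible.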