Recall that Mycin and the human experts each accrued roughly 65% of the available ``acceptable or equivalent'' scores from the panel of judges (Figure 3.4). We concluded that Mycin's performance was approximately equal to that of human experts. Now imagine that Mycin and the human experts each accrued approximately 100% of the available ``acceptable or better'' scores. Can we conclude that Mycin and human experts perform equally well? At first glance the answer is obvious: the program got the same score as the humans, so they perform equally. But this situation is qualitatively different from the one in which humans and Mycin each got roughly 65%. In the 65% case, 35% of the available score remains to demonstrate higher performance: if Mycin were better than the humans, it could have a higher score. In the 100% case, if Mycin is better, it cannot have a higher score, because both are ``at ceiling.''
When one's hypothesis is Performance(A) > Performance(B), if A and B achieve the maximum level of performance (or close to it), the hypothesis should not be confirmed, due to a ceiling effect. Ceiling effects arise when test problems are insufficiently challenging. Floor effects are just like ceiling effects, but they are found at the opposite end of the performance scale. Imagine therapy recommendation problems so challenging that neither the human experts nor Mycin can solve them correctly: neither could then demonstrate superior performance, so again the hypothesis should not be confirmed.
Technically, a ceiling effect occurs when the dependent variable, y, is equal in the control and treatment conditions, and both are equal to the best possible value of y. In practice, we use the term when performance is nearly as good as possible in the treatment and control conditions. Note that ``good'' sometimes means large (e.g., higher accuracy is better) and sometimes it means small (e.g., lower run times are better), so the ceiling can be approached from above or below. A ceiling thus bounds the abstract ``goodness'' of performance. Floor effects occur when performance is nearly as bad as possible in the treatment and control conditions. Again, poor performance might involve small or large scores, so the ``floor'' can be approached from above or below.
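To make this working definition concrete, here is a minimal sketch in Python; the function names, the 5% tolerance, and the scores are all hypothetical illustrations, not from the Mycin study. It flags a possible ceiling (or floor) effect when both conditions score within the tolerance of the best (or worst) possible value. Because only the distance to the bound matters, the same check works whether the ceiling is a maximum (accuracy) or a minimum (run time).

```python
def near_bound(scores, bound, tolerance):
    """True if the mean score lies within `tolerance` of `bound`."""
    return abs(sum(scores) / len(scores) - bound) <= tolerance

def ceiling_effect(treatment, control, best, tolerance=0.05):
    """Possible ceiling effect: both conditions score near the best possible value."""
    return near_bound(treatment, best, tolerance) and near_bound(control, best, tolerance)

def floor_effect(treatment, control, worst, tolerance=0.05):
    """Possible floor effect: both conditions score near the worst possible value."""
    return near_bound(treatment, worst, tolerance) and near_bound(control, worst, tolerance)

# Hypothetical data: fraction of "acceptable or better" ratings, where 1.0 is
# the best possible score.  Both groups sit near 1.0, so the test problems
# leave no room to show a difference.
program_scores = [0.98, 1.00, 0.97, 1.00]
expert_scores  = [1.00, 0.99, 0.96, 1.00]
print(ceiling_effect(program_scores, expert_scores, best=1.0))  # True
```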
Consider an example from the Phoenix project (section 2.1). Assume the performance variable y is the time required to contain a fire, so good scores are small, and the ceiling is the smallest possible score. The mean time to contain fires within a 50 km radius of the firebase is roughly 20 hours of simulated time. Suppose you have designed a new scheduling algorithm for the Phoenix planner, but unfortunately, it shaves only 30 minutes from the mean containment time. Distraught, you consult a Phoenix wizard, who tells you a bit about how long things take in the Phoenix environment:
| Activity | Average time for the activity |
|---|---|
| Noticing a fire in the environment | 2 hours |
| Deciding which plan to use | 1 hour |
| Average bulldozer transit time from the firebase to any point in a 50 km radius | 4 hours |
| Average time to cut one segment of fireline | 6 hours |
None of these activities involve scheduling. Each bulldozer cuts an average of two segments of fireline, so the average time to contain a fire is 19 hours (2 + 1 + 4 + 2 × 6). The new scheduling algorithm therefore has very little room to show its superiority, because the old version of Phoenix required 20 hours, and any version requires at least 19 hours. This is a ceiling effect, approached from above.
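As a quick check of this arithmetic, the sketch below (a Python fragment using the activity times from the table above) totals the non-scheduling activities and compares the result with the old planner's 20-hour mean, showing that any scheduler has only about one hour of headroom in which to demonstrate an improvement.

```python
# Simulated-time costs (hours) of Phoenix activities that involve no scheduling,
# taken from the table above.
notice_fire = 2
choose_plan = 1
transit = 4
cut_segment = 6
segments_per_bulldozer = 2

# Lower bound on time to contain a fire: no scheduler can reduce these costs.
lower_bound = notice_fire + choose_plan + transit + segments_per_bulldozer * cut_segment
old_mean = 20  # mean containment time (hours) with the old scheduling algorithm

print(lower_bound)             # 19 -- the ceiling, approached from above
print(old_mean - lower_bound)  # 1  -- all the room any scheduler has to show improvement
```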
The most important thing to remember about ceiling and floor effects is how they arise. They arise not because a program in a control condition is very good (or bad) but because the program performs very well (or poorly) on a particular set of test problems. The fact that Phoenix's old scheduling algorithm takes only an hour longer than the minimum does not mean it is a good algorithm: a dozen uncontrolled factors might account for this performance, and its performance in a slightly different scenario might be considerably worse. Ceiling effects and floor effects are due to poorly chosen test problems.