Monday, September 11, 2006

Exhaustive Testing

The most exhaustively-tested program that I know of still had a serious bug when it was released. But this story has a happy ending.

Background: One night circa 1960 a group of Pacific Union College undergraduate students was playing the UCLA Executive Game in the Data Processing Laboratory, when a member of one of the teams spotted an anomaly on his team's balance sheet. The numbers didn't add up. In particular, the bottom line was not exactly the sum of the numbers it was supposed to be the total of. It was off by one. [1]

Initial investigation: Off-by-one errors are not uncommon in programming, but they generally do not show up as errors in arithmetic, particularly in integer arithmetic. Suspicion first fell on the hardware, but running the hardware diagnostics did not reveal any faults, and the error proved to be quite repeatable for those inputs. Next, the Executive Game program was suspected, but the relevant section of code was a single block add of the numbers, and it was hard to see how that could be wrong. So, finally, attention fell on the binary-to-decimal conversions.

It wasn't necessary to convert all the numbers by hand (remember, this was before pocket calculators), since inspection of the program's memory revealed that one of the binary numbers was odd and its corresponding decimal printout was even. This should not happen when converting integers! There must be something wrong with the Bendix-supplied conversion routine, and, sure enough, it failed a stand-alone test with that number as input.

At this point, we felt we had "scored" against Bendix, since the routine was used in hundreds of G-15 installations. However, it would be even more satisfying to tell Bendix what the error was, and better still to provide the fix. And of course it was natural to wonder whether other integers were incorrectly converted, and if so, how many.

Diagnosis: One of the students took up the challenge. He had a binary number that didn't convert correctly, and the source code for the conversion routine, which was only a few dozen instructions of straight-line code, so he could step through it and check the results after each instruction against a hand calculation. He found that the final multiplication--to obtain the least-significant decimal digit--left a fractional part that was very close to, but not quite, one. Had it been just a little larger, there would have been a carry into the integer part, and the final digit of the converted number would have been correct.

Working backwards, he found that the initial division was not programmed for the 58 word times required to produce a full 28-bit quotient. Presumably, the original programmer wanted to speed up the routine and had reasoned that all 28 bits weren't needed, since 10^7 is less than 2^24. However, there were a few unfortunate bit patterns in the truncated quotient that could lead to that loss of a final carry [2]. The smallest integer resulting in such a bit pattern had five digits, which is probably why no one had noticed the problem before.

Constraints: The "obvious solution" was simply to extend the division to 58 word times. This was tried immediately, and generated a correct result for the problem case. Although this was strong evidence that the problem had been correctly diagnosed, it was not an acceptable fix for the routine, because it would have cost an extra drum revolution per call. (The longer division would not be complete before the next instruction came off the drum.) So something more clever would be needed, satisfying some difficult constraints:
  • It could not take an extra drum revolution.
  • The locations of the integer and the fraction entry points could not be changed, since they were hardcoded into calling programs.
  • The exit must happen at the original time, to avoid costing an extra drum revolution in the calling program.
  • The new routine must use only (a subset of) the drum locations used by the original, since calling programs were free to use the remaining locations for their own purposes.

Being another programmer who believed he could improve any program [3], he set himself the challenge of meeting all these constraints, but always generating a correct conversion. And he produced a program that, by his analysis, surely did so.

Testing: Presumably the original programmer had also convinced himself that the original routine was correct. What if there were some other "corner case" that caused the new routine to fail? Before submitting the new routine to Bendix, it seemed prudent to test it stringently.

He devised the following test methodology: Write a loop that counted from 0 to 9,999,999, both in binary and in BCD [4], converted the binary to BCD, and compared the results. This would take weeks to run, but should surely provide incontrovertible evidence of the correctness of the routine. He set it up so that it could be run in chunks, whenever the machine was not otherwise occupied [5]. Whenever someone needed the machine, they'd log how far the test had gotten, and when the machine was free again, it would be restarted at that point.

After weeks of testing, the loop finally reached 9,999,999 without having encountered a single conversion error, and we were all convinced that the new routine was ready for the big time.

Submission: The routine was carefully prepared on DPL (rather than Bendix) coding sheets, and everything was carefully documented, in the hopes that Bendix would distribute it unmodified to all G-15 installations, enhancing the fame of the DPL. And sure enough, shortly after submission, we received a copy of the distribution that went to all sites, and it was on our coding sheets. We were all elated.

OOPS: Soon after Bendix's distribution, the programmer received an urgent telephone call: "Were you aware that your new routine drops the sign of negative numbers?" (He wasn't, but it did.)

Moral: The most exhaustive test I've ever witnessed failed to uncover a significant flaw. Furthermore, had the 10,000,000 test cases been supplemented by any one of the other 9,999,999 values in the domain of the routine, the error would have been caught instantly.

In the spirit of one of Alan Perlis's aphorisms [6]:

If you have ten million test cases, you probably missed one.
Happy ending: Applying even more cleverness, the student was able to produce a program that met all the constraints, and correctly converted both positive and negative integers. But he did not attempt another exhaustive test before submitting it.

Notes:
[1] By focusing on the least-significant digit, he was probably calling his executive qualifications into question. :-)
[2] "Few" is of course relative here. As I recall, it was only a few thousand values out of ten million.
[3] This was not me, although I was surely just as cocky.
[4] Since the BCD counting was just test harness, it could be optimized for simplicity and clarity, rather than speed or space, making it unlikely that it would fail in a fashion that correlated in any way with the conversion routine.
[5] Sort of like SETI@home.
[6] "If you have a procedure with ten parameters, you probably missed some."

6 Comments:

Comment by Anonymous Nitpicker:

In 1957 or so, my first program on any computer was on the G15 at Michigan Tech. I remember well my first bug. I had assumed that memory would be zeroed when my program began. It was not. Oops.

At least it did not take 10**6 + 1 test cases to find the bug.

What I recall is that sines and cosines were calculated by interpolation from tables on paper tape loops. Only later were Taylor expansions used to get better values.

Still floating point arithmetic was a continual hazard until IEEE arithmetic became almost universally used. In fact I was the distributer for Velvel Kahan's Paranoia program which tested such things as whether the number of check digits used in the registers was sufficient to avoid nasty cases. Otherwise, X * 1.0 might not equal X. The program was named for the attitude required of numerical analysts whenever they encountered a new computer arithmetic.

Much later I got to use the Orange Book when my team created a tiny language that we called Nosepicker. We got high marks from McKeeman that summer.

3:10 PM  
Comment by Blogger expraxite:

I see testing as a generally poor tool for verification (by which I mean showing that a program meets its specification, which is what you were trying to do).

If you were going to use testing to show correctness, it would be good practice to write out a specification and then to check that the test set exercised all the obvious partitions of the input data (in your case, at least: positive and negative integers, powers of two, even and odd integers, primes, zero and maxint - and all valid combinations).

9:19 AM  
Comment by Anonymous Anonymous:

" If you have ten million test cases, you probably missed one. "

Surely in this particular instance, you mean "If you have ten million test cases, you probably missed ten million"?

9:57 AM  
Comment by Blogger Jim Horning:

Re expraxite's comment: This routine, of course, had an unusually simple specification. I believe that the original Bendix routine would (with high probability) have passed with the suggested test set.

But I agree that testing is not a complete method for validation.

I think it was Rick Holt who said, "It is the duty of every programmer to show by logical reasoning that his program is correct, and then by testing that it is not."

10:43 AM  
Comment by Blogger Jim Horning:

This reminded Tom Van Vleck of his post Gauss

10:51 AM  
Comment by Blogger drunkenbatman:

I came across this randomly researching a testing issue -- an amusing read. Thanks for taking the time to write it out and share it.

9:51 AM  

Post a Comment

<< The Way It Was home