John Levine wrote:
> Lawrence D'Oliveiro wrote:
>> John Levine wrote:
>>> Lawrence D'Oliveiro wrote:
>>>> You’d think the OS would have hardware protection against
>>>> nonprivileged user processes being able to do such things. But no.
>>> The save process was a system call and the insufficient space was in the system.
>> It shouldn’t be possible for ordinary users to bring down the system in
>> this way.
> No kidding. It was a bug in the OS software.

I had a case where it was a bug in the hardware. And it was hard to fix.
Once the University of Kent had moved to using EMAS (a third-party OS), it
was enjoying a rolling MTBF averaging about 2000 hours over a 13-week
period. This was much better than the 20 hours we had been getting from
VME/K. People were very happy.
And then one day it all began to fall apart. The machine just stopped. No
crash, nothing. The engineer's panel indicated that the microcode had
halted. We re-IPLed the system, and an hour or two later it stopped again.
Eventually we called the engineers, and they ran tests. Lots of them. They
pronounced that there was nothing wrong.
Then the 'crashes' stopped, for a couple of weeks. Then they started
again. We couldn't get a handle on what was wrong at all. It was
eventually decided that, the next time it happened, I should use the
engineer's panel, for as long as it took, to investigate the state of the
machine. In the event, I simply dumped out all the target machine
registers, and the microcode PC.
Our engineers obligingly left a microcode training manual lying around,
together with a microfiche listing of the microcode. Oh, and some circuit
diagrams. I retired to a darkened room for much of that day; and the next.
Eventually I emerged with the reason for the crashes. Without going into
too much technical detail, it seemed that the microcode and the hardware
handed off tasks to each other; in particular, a part of the hardware
called the 'scheduler' was responsible for validating the type field in
the descriptor register during the execution of any instruction that used
a descriptor to access an operand. Any invalid type was trapped, and sent
back to the microcode to force an exception (known as a 'contingency').
All other type values were considered valid, and passed back to the
microcode to be used in accessing a jump table, thence invoking the right
bit of microcode for that descriptor type.
So, what was going wrong? It turned out that there was what can only be
described as a hardware design error. The scheduler didn't detect one
particular invalid type code, so it handed it back to the microcode, which
accessed the jump table with it. This of course accessed an entry marked
'can never happen', and the microcode halted. We later discovered that a
physicist's errant FORTRAN program was overwriting a descriptor, and
generating the bad type value. If the machine stopped, he just submitted
the job again until he got fed up and went off for a week or two. Then he
tried again, never noticing the causal connection.
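
For anyone who wants the shape of the bug in modern terms, here is a toy Python model of it. This is only a sketch: the 3-bit type field, the specific type codes, and all the names are invented for illustration and are not taken from the real microcode or hardware.

```python
# Toy model only: the type codes, table size, and names are all hypothetical.

class Contingency(Exception):
    """Recoverable exception that the operating system can handle."""

class MicrocodeHalt(Exception):
    """Stands in for the microcode stopping dead."""

def can_never_happen(type_code):
    raise MicrocodeHalt("jump table entry marked 'can never happen'")

def make_handler(type_code):
    return lambda code: f"handler for type {type_code}"

# Assumed layout: codes 0-3 are valid descriptor types, 4-7 are invalid.
JUMP_TABLE = [make_handler(c) for c in range(4)] + [can_never_happen] * 4

# The hardware check is supposed to trap every invalid code,
# but in this model it misses exactly one of them (6).
TRAPPED_BY_SCHEDULER = {4, 5, 7}

def scheduler_validate(type_code):
    """Hardware side: trap invalid types, pass everything else through."""
    if type_code in TRAPPED_BY_SCHEDULER:
        raise Contingency(f"invalid descriptor type {type_code}")
    return type_code

def execute(type_code):
    """Microcode side: index the jump table with whatever the hardware passed."""
    return JUMP_TABLE[scheduler_validate(type_code)](type_code)
```

In this model `execute(4)` raises a `Contingency` that the OS could handle, while `execute(6)` slips past the check and ends in `MicrocodeHalt`, which is the behaviour we were seeing.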
We contacted ICL, but we never seemed to reach anyone who either
understood what the problem was, or had the power or inclination to get it
fixed (which would not have been a quick job, in any case).
So I decided I had better fix this another way. Back to the microcode
listing. I found an empty patch area, and hand-assembled a new bit of
microcode which I linked to the right jump table entry. All this did was
generate a 'descriptor error' contingency with a hitherto unused subtype
code. I then wrote a tool to extract the microcode from the system disk,
patch it, and put it back again. We IPLed the system, and tested it (by
this time I had a test program). Success - it correctly triggered the new
contingency and the microcode didn't halt!
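
The patch step itself can be sketched roughly as follows in Python. Everything here is an assumption made for illustration: the 4-byte word size, the idea that an empty patch area reads as zero bytes, and the jump table entry being a big-endian word address are all hypothetical, and the real tool also had to read the image from the system disk and write it back.

```python
# Sketch only: the word size, image layout, and encoding are invented.
WORD = 4  # hypothetical microcode word size in bytes

def apply_patch(image, patch_offset, table_entry_offset, new_code):
    """Return a patched copy of a microcode image.

    new_code is copied into the (assumed all-zero) patch area, and the
    jump table entry is repointed at it as a big-endian word address.
    """
    if patch_offset % WORD or len(new_code) % WORD:
        raise ValueError("patch must be word-aligned and a whole number of words")
    end = patch_offset + len(new_code)
    if end > len(image) or any(image[patch_offset:end]):
        raise ValueError("patch area is not empty")
    patched = bytearray(image)
    patched[patch_offset:end] = new_code
    entry = (patch_offset // WORD).to_bytes(WORD, "big")
    patched[table_entry_offset:table_entry_offset + WORD] = entry
    return bytes(patched)
```

Checking that the patch area really is empty before writing to it is the one safeguard worth keeping even in a sketch: overwriting live microcode would have made matters considerably worse.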
The only thing left to do was to modify the various components of the
operating system to do the right thing, culminating in a change to the
FORTRAN run-time system to generate a suitable message. That only took me
a few minutes.
We had no more microcode halts and the users were happy.
--
Using UNIX since v6 (1975)...
Use the BIG mirror service in the UK:
http://www.mirrorservice.org