I (and others) keynote at NASA/CMU Dependable Computing workshop
https://web.archive.org/web/20011004023230/http://www.hdcc.cs.cmu.edu/may01/index.html
When I first transfer out to SJR in the 2nd half of the 70s, I get to
wander around IBM (and other) datacenters, including disk
bldg14/engineering and bldg15/product-test across the street. They were
running 7x24,
prescheduled, stand alone testing and mentioned that they had recently
tried MVS ... but it had 15min MTBF (in that environment), requiring
re-ipl. I offer to rewrite I/O supervisor to make it bullet proof and
never fail so they could do any amount of on-demand, concurrent testing,
greatly improving productivity (downside was that they wanted me to
increasingly spend time playing disk engineer). I do an internal
research report about "I/O integrity" and happen to mention the MVS
15min MTBF. I then get a call from the MVS group; I thought they wanted
help in improving MVS integrity ... but it seems they wanted to get me
fired for (internally) disclosing their problems.
1980, IBM STL was bursting at the seams and they were moving 300 people
(and their 3270s, from the IMS DBMS group) to an offsite bldg, with
dataprocessing back to the STL datacenter ... they had tried "remote"
3270 support and found the human factors unacceptable. I get con'ed
into doing "channel
extender" support so they can place channel attached 3270 controllers at
the off-site bldg with no perceptible difference in the human factors
offsite and in STL. The vendor then tries to get IBM to release my
support, but there is a group in POK that gets it vetoed (they were
playing with some serial stuff and were afraid that if it was in the
market, it would make it difficult to release their stuff). The vendor
then replicates
my implementation.
Roll forward to 1986, and the 3090 product administrator tracks me down.
https://web.archive.org/web/20230719145910/https://www.ibm.com/ibm/history/exhibits/mainframe/mainframe_PP3090.html
There was an industry service that collected customer mainframe EREP
(detailed error reporting) data and generated periodic summaries. The
3090 engineers had designed the I/O channels predicting there would be a
maximum aggregate of 4-5 "channel errors" across all customer 3090
installations per year ... but the industry summary reported total
aggregate of 20 channel errors for 3090s first year.
It turned out for certain types of channel-extender transmission errors,
I had selected simulating "channel error" in order to invoke channel
program retry (in error recovery) ... and the extra 15 had come from
customers running the channel-extender support. I did a little research
(across various kernel software) and found simulating IFCC (interface
control check) would effectively perform the same kinds of channel
program retry (and got the vendor to change their implementation from
"CC" to "IFCC").
--
virtualization experience starting Jan1968, online at home since Mar1970