|Problems? Is your data what you think it is?|
Sometimes it's not heat but "cold".
So I was working at IBM/Kingston on the AIX/ESA project. They were putting in a new box with 6 ESA engines. Unknown to us, for some reason they (NSD) decided to pipe the cold "chiller water" to it using a feed that wasn't right (7 gal/min rather than 14?); it was all they could find without running new pipes. In the beginning it was physically partitioned into two 3-core "boxes", and while one of them was used a bunch, the other was mostly idle for about a month.
So I start to bring up the second partition, and I stress test it. And overheat alarms start going off all over the place. I mean BELLS, audible alarms as well as "hot boxes" on the console. I shut down the stress test and they go away; I start it again and they come back.
I don't understand, but I go and chat about it with my contact at IBM's National Service Division (NSD). He has no clue, so he gets in touch with them and finds out about the pipe mismatch. He tells them they f'ed up and to fix it. In a week or so they say they have, and I run the stress tests on the second partition and have no problems.
Fast forward a few months: BIG demo, IBM VPs, people from Computerworld, InfoWeek, etc. in to watch rotating globes and planes flying around, and maybe 6 other compute-intensive tasks on each of about 2 dozen PCs, all coupled to the new box, now configured as a single 6-core box. Everything but the X clients was on the ESA box. It was running at about 99%. It was running for about 3 hours before the honchos came in to watch it and get the PR spiel. About an hour in, the operator comes in and points out an overheat hot box on the system console in the far corner. NOT GOOD. I shut down a bunch of non-needed tasks, but the hot box didn't go away. It's warning that it is going to shut down unused sections to lower the heat load.
So me and the op are quietly chatting. I'm sure not going to shut down the demo; we'll just let it do what it wants and hope it doesn't crash. A VP walks over complaining about the noise. I point out the hot box and explain what we were going to do. He agrees and walks away, but not happy. We quit chatting so as not to draw any further attention. The demo finishes with the hot box still up, but no magic smoke anywhere.
When I ran the stress tests on the second partition, the first wasn't running at full load, and while we tested the demo, we didn't run the 6-way for hours and hours and hours, so it didn't show the overheat warning. My NSD contact is pissed; it's supposed to be fixed. He calls NSD and they promise to figure it out. And they get back soon. It seems the plumbing is still undersized: only 11 gal/min, more than the 7, but not the 14 we needed. The VP was pissed, the AIX/ESA leader was pissed, my NSD contact was pissed, I was pissed, the operator was pissed. NSD fixed it pretty quick, and I ran the demo for about 12 hours with no problems.
And it turns out the VP was not pissed at me, or the operator. He gave us a nice writeup about how we saved the demo.
While I'm at it, that 6-way figures into another interesting story. We would tightly couple it to 3 other ESA 6-ways (1 local channel, 2 offsite channels over PVM) and run benchmarks to compare to MVS and VM. One night they want to do it again, but it's Thanksgiving morning, 12-6 am (and I needed about 90 min to patch it all together first). Can I come in to set it up? They even offered me double-time plus time off to do it, 'cause it was Thanksgiving. I said sure.
The IBM Fellow for supercomputing was in Kingston; I met him a few times. He was the one that made the benchmarks we ran, and he was an MVS fan. He comes to talk to me later and tells me that on Thanksgiving I had booted the "largest box" in the world (at that time). All the other supercomputers were down for maintenance or physically partitioned into smaller sections that day. He was right, it was real cute to know I was running the fastest computer in the world, if just for a little time.
In reply to Re^2: Shared DBI handle supporting threads and processes