Archived posting to the Leica Users Group, 2014/07/01
In case you care.
Server computers that are engineered for reliability have two power
supplies and two power cords. Power supplies are the component that
fails most often in server computers, so having two of them lets the
server survive the failure of one.
The server computer that had supported the LUG had two power supplies.
They were stacked vertically, one on top of the other. Both power
supplies had been running 24x7 for about 9 years, and their fans had
sucked in a certain amount of lint. Lint is flammable. The bottom power
supply failed, and the lint caught fire. The flame rose to the upper
power supply and ignited its stored-up lint also. Like firestarters in a
Franklin stove, the 20-second burst of flame was enough to ignite the
various flammable items (including lint) in the main enclosure. The
flash fire probably only lasted 40 or 50 seconds, but it was hot enough
to destroy most of the solder traces that were near the power supplies
on the circuit boards. There were various plastic tags on some of the
cables, which added flammable material.
You can go to the store and buy a laptop or a desktop computer, but you
really can't go buy a server computer. Yes, this being Silicon Valley,
there are stores around that sell server computers (Central Computer is
the best of the lot) but buying a server computer at a retail store is
like buying a bicycle at a department store. It's just not the same
thing. Server computers are special-order, because there are so many
variations on how they are built that no one can afford to keep good
ones in inventory.
The fire was on a Saturday morning, and I knew that the soonest I could
even place an order for a replacement server was Monday, and even at
rush-rush prices I wouldn't get it until Thursday. At the time a
Saturday-to-Thursday outage seemed unconscionable. So I decided to move
the LUG and its supporting software to the newest and emptiest of my
half-dozen servers. It wasn't exactly a spare--it was running a few
little things--but mostly it was idle.
The LUG server had been running software from the era of its
installation, about 2005. The new server was built with chips and
components that the old software didn't understand, so I couldn't just
restore the LUG server backups onto the new server. They wouldn't run. I
had to get the new software working on the replacement server and then
manually move over each piece.
I made the mistake of believing the operating system documentation,
which detailed a function called "system upgrade". It was supposed to
work the way Mac or Windows updates work--you let it do its thing for a
while, and then you reboot and all is well. After running the system
upgrade, nothing worked any more, including the few services that had
been on that machine. After asking the experts, I realized that I was
going to have to wipe the machine, do a clean install, get all of the
necessary apps installed, and then restore both sets of backups (LUG
server and previous contents of that server) to the clean system.
So far this is not a crazy plan. I've done things like it many times
before, though the 9-year software update gap made for a few challenges.
Once I got all of the apps installed and the backups restored, I
immediately typed the command to turn it all on
/local/mailman/bin/mailmanctl start
and nothing happened. The error log showed a preposterous,
hard-to-believe error message.
The wise person's first step in debugging strange failures on computers
is to type the error message into a search engine (I use Bing) to see if
other people had asked about it. To my great astonishment, no one had.
This never happens. Somebody else *always* has the same problem and has
asked about it.
I then started reading the source code of Mailman, trying to see what
circumstances would cause it to generate that message. Mailman is
written in a language called Python. When you are having trouble like
this, a good step is to explore "version skew". Mailman Version XXX
works only with Python Version YYY. The versions of Python that are
extant just now are 2.5, 2.6, 2.7, 3.2, 3.3, and 3.4. This is an
abnormally large spread of "current" versions, which usually means that
the language developers have made incompatible changes and have to keep
old versions around for apps that have come to depend on them.
I tried all 6 of those Python versions. I got the same odd error in the
2.* versions, and absolute chaos in the 3.* versions. Since the version
of Mailman that I wanted to use (2.1.18) failed the same way with all of
the 2.* Python versions, I wiped the slate clean one last time and
installed Python 2.7.
Gonna have to find this problem the old-fashioned way.
Many days pass as I read documentation, run tests, explore the software,
use debuggers, create and read log files, all to no avail.
Then I decided to instrument and log what was happening when
Mailman/Python started up. Figuring out how much information to put in a
log file is a black art. If you log too much, you will never find what
you are looking for in the swamp of details. If you log too little, you
probably won't log what you're looking for.
After far too much time staring at the logs, I saw that Python was
initializing from a library that was not listed in the Mailman
documentation.
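As an illustration only (this is not the instrumentation I actually used),
here is a minimal sketch of logging where each library would be loaded from.
It assumes Python 3.4 or later, not the Python 2.7 in this story, and the
function name log_module_origins is made up for the example.

```python
import importlib.util
import logging


def log_module_origins(names):
    """Log, for each top-level module name, the file Python would load.

    Returns a dict mapping each name to its origin, or None when the
    module cannot be found on the search path.
    """
    origins = {}
    for name in names:
        # find_spec locates a top-level module without importing it.
        spec = importlib.util.find_spec(name)
        origins[name] = spec.origin if spec is not None else None
        logging.info("module %s resolves to %s", name, origins[name])
    return origins


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    log_module_origins(["email", "json"])
```

A path in that log that sits under an unexpected directory is the kind of
clue described here.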
An aside: language systems like Python tend to be aggressive in how they
find libraries. They look around and if they find something that looks
like a library, they use it. I'm sure the Python designers (none of whom
is named Monty) thought they were doing the world a favor by making it
go out and find its own libraries. "Autoconfiguration" run amok. Bad idea.
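To make the aside concrete: Python walks its search path (sys.path) in order
and uses the first match it finds. The sketch below is a toy demonstration
under assumed names (shadow_demo, first_match_wins), written for Python 3; it
builds two throwaway directories holding a module of the same name and shows
that the one listed first wins.

```python
import importlib
import os
import sys
import tempfile


def first_match_wins():
    """Create two directories that each hold a module called shadow_demo
    and return the VERSION of the copy Python actually imports."""
    with tempfile.TemporaryDirectory() as root:
        old_dir = os.path.join(root, "old")
        new_dir = os.path.join(root, "new")
        for directory, version in ((old_dir, "old"), (new_dir, "new")):
            os.makedirs(directory)
            with open(os.path.join(directory, "shadow_demo.py"), "w") as f:
                f.write("VERSION = %r\n" % version)
        # Put the stale directory ahead of the current one on the path.
        sys.path[:0] = [old_dir, new_dir]
        try:
            importlib.invalidate_caches()
            module = importlib.import_module("shadow_demo")
            return module.VERSION
        finally:
            # Undo the changes so the demo leaves no trace behind.
            sys.path.remove(old_dir)
            sys.path.remove(new_dir)
            sys.modules.pop("shadow_demo", None)


if __name__ == "__main__":
    print(first_match_wins())
```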
This library was obsolete. In the 9 years of not upgrading, the Mailman
software had changed the place where it kept certain library functions,
and both of them were present in the version I was trying to run. The
"wipe clean and reinstall" function only wiped the directories that it
knew about, and this obsolete directory was not on its list -- it had
been retired years ago -- so it didn't get removed by the "wipe clean"
function.
If I had run all 12 of the upgrades between Mailman 2.1.6 and 2.1.18,
one of them would surely have deleted that newly-obsolete directory. But
I didn't, so it was still there.
When a complex computer system is using two different versions of the
same library, with creation dates 7 years apart, it doesn't stand a
chance of working.
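A check along the following lines would flag that situation. It is a rough
sketch, not part of Mailman: the function name find_copies is hypothetical,
and it only looks for plain directories and .py files on the search path.

```python
import os
import sys


def find_copies(name, search_path):
    """Return every directory on search_path that contains a module or
    package called `name`, in search order."""
    hits = []
    for directory in search_path:
        package = os.path.join(directory, name, "__init__.py")
        module = os.path.join(directory, name + ".py")
        if os.path.isfile(package) or os.path.isfile(module):
            hits.append(directory)
    return hits


if __name__ == "__main__":
    copies = find_copies("email", sys.path)
    if len(copies) > 1:
        print("more than one copy of 'email' found:")
    for directory in copies:
        print("  " + directory)
```

More than one hit means the first directory listed shadows the rest.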
I typed the Unix command "rm -rf /local/mailman/Mailman/pythonlib/email",
which got rid of the ancient and incompatible library,
and everything started working. Perfectly.
There were hundreds of loose ends, and I spent the next week hunting
them down, but it wasn't taking 18 hours a day and LUG mail was flowing
while I did it.
Thanks for listening.
Brian Reid
LUG Saloonkeeper and server wrangler