I got badly bitten yesterday because of my lack of knowledge of Linux
RAID (and some assumptions that it would work like Solaris' DiskSuite).
I'd a machine that I'd set rootfs mirroring up on, and it had been
working fine for months. I'd to take it over to the UK, and install it in
a rack.
Just before I went, I set it up to use the serial port as a console, and
messed up LILO somehow (I think I used an old lilo.conf). I booted off a
tomsrtbt disk, mounted /dev/hda5 (one half of the rootfs mirror), ran lilo
with the right config file, and the machine booted fine. I powered down,
and brought it over with me to the UK.
I brought the machine back up, changed IPs, did some other configuration.
I rebooted, scanning the boot output to make sure it was OK. One line that
made me go cold was 'Error reading /etc/mtab, I/O error'. To me, that
usually means FS corruption.
I brought the box down to single user mode, remounted / as ro, and fscked
it. I got *hundreds* of errors. Two hours left till the airplane left. I
can fix it, I thought. Did a second fsck, and everything was fine. I
rebooted, same problem...only the corruption was getting worse. I did this
once more, and suddenly the machine didn't reboot. I got 'LIL-' for a boot
prompt.
I rebooted with my handy tomsrtbt disk, and ran LILO. Because I couldn't
mount md0, I mounted hda5, and did an fsck of that. Loads of errors, all
fixable. Cool.
I rebooted without the floppy, and got massive corruption. This time
worse than before. fsck fixed it, but again I got 'LIL-'. Re-ran LILO from
tomsrtbt, rebooted, and this time the machine had a corrupted inittab. My
heart sank.
I rebooted from tomsrtbt, and noticed that /dev/hda5 was fine. /dev/hdd1
(the other half of the mirror) was screwed. So, I changed /etc/raidtab to
set hdd1 as a "failed-disk", did a "raidstop" on md0, and changed / to be
/dev/hda5 and rebooted. More filesystem corruption.
It took about two reboots before I copped on that raidstop wasn't
persistent across reboots, like it is on Solaris. Because the partition
types were set to "RAID Autodetect", every boot it was making an md0, and
even when I wasn't mounting it, it was syncing the two halves of the
mirror. Worse yet, it didn't sync from "last mounted" to "other disk"; it
was always picking the corrupted disk, and syncing that to hda5, which I
had mounted, read-write, as root.
Once I changed both disks' partition types back to 83 (Linux fs), and did
an fsck, there was no more corruption. Alas, it had deleted files like
/etc/sysconfig/network and many others. I am not impressed.
Could someone with a bit more RAID knowledge than I have tell me what the
"right way to do things" was? I've a feeling it probably incorporates
"Wait till both halves of the RAID mirror sync before you reboot" or some
such...as you don't have to do this in Solaris, I didn't bother...and that
could be what caused the massive corruption.
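
For illustration only, the "wait till both halves sync" idea could be
checked with something like the sketch below. It assumes the kernel
prints the word "resync" or "recovery" in /proc/mdstat while a mirror is
rebuilding; the exact layout of that file varies between kernel
versions, and the function name md_is_syncing is made up for this
example.

```shell
#!/bin/sh
# Sketch: report whether any md array still appears to be resyncing.
# Assumption: /proc/mdstat mentions "resync" or "recovery" while a
# mirror is being rebuilt (the layout differs between kernels).

# md_is_syncing [FILE]
# Exit status 0 if FILE (default /proc/mdstat) mentions a resync or
# recovery, non-zero otherwise. A missing file also gives non-zero,
# so a "not syncing" answer is only meaningful if the file exists.
md_is_syncing() {
    grep -Eq 'resync|recovery' "${1:-/proc/mdstat}"
}
```

Something like "if md_is_syncing; then echo 'still syncing'; fi" before a
reboot would then flag a rebuild in progress. It says nothing about
which half of the mirror is being treated as the good copy.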
John