A little background first.
Tue 16th ~22:00
Blackout in server room
Email server not UPSed (don't ask)
~22:30 Power restored, server comes back up.
Wed 17th ~10:30
Problem identified for 1st time
1st technician to server room finds no A/C running
Server room is approx. 40 degrees Celsius.
~11:30 A/C restored, room temp normal.
All systems checked/monitored, no apparent issues.
Thur 18th 13:15
All disk access stops on email server
Console errors on email server e.g.
ext-fs error (device in transaction 24734: journal has aborted)
System unresponsive, rebooted with an alt+sysrq+R after all else failed.
Partitions fscked on reboot, orphaned inodes cleared, no more issues.
Fri 19th 10:30
Same as Thursday, all disk access stops, console errors etc.
Orphaned inodes are only on the /var partition and on all occasions appear
around the same locations (see http://www.it.gcd.ie/inodes.txt). Given
that we don't have massive experience with corrupt filesystems or RAID,
other than when it's working, we're looking for a bit of advice. We do
have rsynced backups of the mail from 04:00 every night. We're thinking
of the following approach.
1. Take the box down.
2. In the scsi host util run verify media on all disks to identify &
mark any bad sectors and make them unavailable.
3. Reboot & remount /var ro
4. Rsync a new backup.
5. Run smartctl to see if it identifies any issues.
6. Format /var?
7. Recreate /var from backup.
Any suggestions/additions, other approaches?
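For steps 3 to 5, this is roughly what we have in mind, as a minimal sketch rather than a tested procedure. It only prints the commands unless DRY_RUN=0 is set. The backup destination (BACKUP_DEST) is a made-up placeholder, /dev/sda is taken from the boot messages further down, and we are assuming smartctl can see anything useful through the RAID controller, which it may not.

```shell
#!/bin/sh
# Sketch of steps 3-5 of the plan. Dry run by default: commands are printed,
# not executed. Set DRY_RUN=0 to really run them (as root).
# BACKUP_DEST is a hypothetical rsync target, not our real backup host.
DRY_RUN="${DRY_RUN:-1}"
BACKUP_DEST="${BACKUP_DEST:-backuphost:/backups/dubmail/var/}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

plan() {
    # Step 3: remount /var read-only so nothing writes during the copy.
    # This fails if any process still has files open for writing there.
    run mount -o remount,ro /var
    # Step 4: fresh backup. -a keeps ownership, permissions and times;
    # -H keeps hard links, which the mail spool may contain.
    run rsync -aH --numeric-ids /var/ "$BACKUP_DEST"
    # Step 5: SMART health summary. Behind a hardware RAID controller this
    # may only report on the logical volume, not the physical disks.
    run smartctl -H /dev/sda
}

plan
```

Run as is, it just lists the three commands prefixed with "+", so the order can be checked before anything touches the box.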
Some questions:
1. Do you think we could continue to trust these disks or should we just
forget it and replace them?
2. Does anyone have any hints from the admittedly little information as to
whether this might be just filesystem corruption or dead disks?
3. There were a lot of servers in the server room which all experienced
this slow cooking but none have shown any obvious problems so far.
Should we be doing something as a precaution for them?
4. Is it safe to assume that this failure is probably a direct result of
the heat?
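On question 3, the only precaution we have come up with so far is to scan the kernel logs on the other servers for the same symptom. A rough sketch follows. The pattern is an assumption based on the console error quoted above plus the generic "I/O error" wording, and the log path in the usage comment varies by distribution.

```shell
#!/bin/sh
# Count log lines that look like the failure seen on the email server.
# The pattern is an assumption, not an exhaustive list of disk error messages.
scan_log() {
    # $1: path to a log file to scan
    # grep -c prints the match count; it exits non-zero on zero matches,
    # so "|| true" keeps the function usable under "set -e".
    grep -c -i -E 'journal has aborted|I/O error' "$1" || true
}

# Usage (path is an assumption):
#   scan_log /var/log/messages
```

A count of zero would not prove the disks are healthy, it would only show that the kernel has not logged this kind of error yet.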
Other info.
Dell PowerEdge 2650
Kernel 2.6.3-29mdksmp
RAID 5
Red Hat/Adaptec aacraid driver (1.1.2-lk1 Nov 28 2005)
AAC0: kernel 2.8.4 build 6089
AAC0: monitor 2.8.4 build 6089
AAC0: bios 2.8.0 build 6089
AAC0: serial 171830d3fafaf001
scsi0 : percraid
Vendor: DELL Model: PERCRAID RAID5 Rev: V1.0
Type: Direct-Access ANSI SCSI revision: 02
SCSI device sda: 1146866176 512-byte hdwr sectors (587195 MB)
[it@dubmail it]$ df
Filesystem            Size  Used Avail Use% Mounted on
/dev/scsi/host0/bus0/target0/lun0/part5
                       15G  5.7G  8.0G  42% /
/dev/scsi/host0/bus0/target0/lun0/part6
                      522G   32G  464G   7% /var
/var hosts the cyrus-imap spool.