LINUX.IE, website of the Irish Linux Users' Group

[ILUG] Systems crashing on disk activity


Niall O Broin niall at linux.ie
Sun Apr 18 14:34:50 IST 2004


I have a nasty problem with which some of you are already familiar but
I'm throwing it open to the wider community to see if anyone has any
ideas.

I admin a couple of servers which are hosted by Rackspace. We recently
upgraded those boxes. The new hardware has an AMD Athlon XP 2600+
with a VIA chipset, 1GB of RAM and 2x36 GB SCSI disks in RAID-1 on a
Megaraid controller.

We migrated one important client to one of these servers and the bloody
box crashed with a scsi timeout error on the console. It was rebooted
and worked away until it crashed again with the same symptoms - we could
ping it, but it wasn't serving pages, and we couldn't ssh to it.

At that point, I asked that Rackspace replace it with new hardware which
they did. My mother having reared no idiots, I proceeded to beat on this
box's disks and it seemed fine. A few hours later, I decided to run an
overnight test which lasted about 5 minutes before it died again. You
can guess how thrilled I was.

On Friday I had a telephone conference with our Rackspace account manager
and a senior Rackspace technician. They were very concerned because they
really found one hardware failure unlikely, and couldn't conceive of
there being two in a row. However, the problem was clearly hardware
related, and not a product of anything I was doing. So, they agreed to
deploy a THIRD server and I would test it and the second box over the
weekend.

As it happened, all three boxes (serv31, serv32, serv33) were still
online (Rackspace has a LOT of hardware - years ago we migrated a server
and the old one was still online months later, lost and forgotten in a
rack somewhere) so I decided to run a little test on all three. The test
was this script 

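#!/bin/sh
# Run from the directory that contains web/; loops until interrupted.
while true
do
    rsync -a web /              # copy ./web to /web
    sleep 60
    rm -fr /web                 # delete the copy again
    date >> /root/hammer.count  # one line per completed iteration
done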

web being a directory with about 2GB of data in it.

This test is somewhat more disk I/O than the box would normally have but
nonetheless a solid combination of hardware, kernel and drivers should
keep running that script on RAID-1 disk until both disk drives died of
old age.
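
For anyone wanting to repeat this kind of test elsewhere, a bounded and
parameterised variant might look like the sketch below. It is not the
script that was run on these boxes: the function name hammer and its
arguments are made up for illustration, cp -Rp stands in for rsync -a so
that the sketch has no dependency beyond a POSIX shell, and the check on
the destination is an added safety net.

```shell
#!/bin/sh
# Hypothetical helper: hammer SRC DEST COUNT LOG [PAUSE]
# Copies SRC to DEST, waits PAUSE seconds (default 60), deletes DEST and
# appends a timestamp to LOG, COUNT times over.
hammer() {
    src=$1 dest=$2 count=$3 log=$4 pause=${5:-60}

    # Refuse an empty or root destination before any rm -rf can run.
    case $dest in
        ''|/) echo "refusing to use '$dest' as destination" >&2; return 2 ;;
    esac

    i=0
    while [ "$i" -lt "$count" ]; do
        cp -Rp "$src" "$dest" || return 1   # stand-in for rsync -a
        sleep "$pause"
        rm -rf "$dest"
        date >> "$log"                      # one line per completed iteration
        i=$((i + 1))
    done
}
```

Something like hammer ./web /web 100 /root/hammer.count would then
approximate 100 passes of the loop above.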

However, serv31 and serv32 died after a short time (don't know how long
as I deliberately did NOT have them rebooted by the ops staff until the
senior tech. people at Rackspace could take a look).  As I went to bed
last night, serv33 appeared OK, having carried out 20 iterations of the
test (each iteration takes about 10 minutes to complete).

First thing this morning when I got up, I tried to ssh to serv33. It was
dead :-( I opened a ticket with Rackspace to have it rebooted and found
that it died sometime after completing 39 iterations of the test.
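
Since the test script appends one date line to /root/hammer.count per
completed iteration, the count and a rough time of death can be read
back from that file after a reboot. A minimal sketch, with a made-up
function name, might be:

```shell
#!/bin/sh
# Hypothetical helper: summarise a log holding one `date` line per
# completed iteration of the hammer loop.
summarise_hammer_log() {
    log=$1
    if [ ! -s "$log" ]; then
        echo "no completed iterations recorded in $log"
        return 1
    fi
    # One line per iteration, so the line count is the iteration count.
    count=$(wc -l < "$log" | tr -d ' ')
    echo "iterations completed: $count"
    echo "first completed at:   $(head -n 1 "$log")"
    echo "last completed at:    $(tail -n 1 "$log")"
}
```

Running summarise_hammer_log /root/hammer.count gives the number of
completed iterations and the last timestamp, which is only a lower bound
on when the box died. The last few entries may also be missing if they
had not been flushed to disk at the moment of the crash.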

So, tomorrow I'll be having a rather fraught (I imagine) telecon with
the people from Rackspace and I'm wondering what to say to them.

It would seem that the chance of getting 3 servers deployed, all of
which have a similar hardware fault, is very small (of course, I could be
after stumbling on a bad motherboard batch - I'm assuming that these
boxes have mobo integrated RAID controllers). That leaves a kernel
problem. The kernel is 2.4.21-9.0.1.EL which hopefully means that we'll
be able to utilise Red Hat's support to help investigate the kernel if
that becomes necessary.

Do any of you have any ideas about this, or have encountered anything
remotely similar?




Niall


