[ILUG] stress-testing scsi discs

From: Andrew Kemmy (kemmya at domain free.net.nz)
Date: Tue 10 Sep 2002 - 10:13:38 IST


Hi,
I have a system with the following disk subsystem :
# lspci
00:10.0 SCSI storage controller: Advanced System Products, Inc ABP940-U / ABP960-U (rev 03)
# dmesg
scsi0 : AdvanSys SCSI 3.3G: PCI Ultra: IO 0xE800-0xE80F, IRQ 0xA
  Vendor: SEAGATE Model: ST39140N Rev: 1498
  Type: Direct-Access ANSI SCSI revision: 02
Attached scsi disk sda at scsi0, channel 0, id 0, lun 0
SCSI device sda: 17783240 512-byte hdwr sectors (9105 MB)

I have installed debian woody with an ext2 filesystem on the disc.
In the first few days I got "input/output" errors on the disc, like
there were bad sectors. I didn't do a "check for bad blocks" when
formatting the disc, though in hindsight I should have.
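
The bad-block check skipped at format time can be approximated afterwards with a read-only pass over the device. The sketch below is only an illustration, not a replacement for the `badblocks` tool from e2fsprogs (or `mke2fs -c`, which runs it at format time). It reads the device in fixed-size chunks and records the offsets at which the kernel reports an I/O error. The function name `scan_readable` and the 64 KiB chunk size are assumptions made for this example.

```python
import os


def scan_readable(path, block_size=64 * 1024):
    """Read `path` chunk by chunk; return offsets of chunks that raised an I/O error.

    Hypothetical helper for illustration. It only reads, never writes.
    """
    bad_offsets = []
    fd = os.open(path, os.O_RDONLY)
    try:
        # Seeking to the end gives the size for regular files and Linux block devices.
        size = os.lseek(fd, 0, os.SEEK_END)
        offset = 0
        while offset < size:
            try:
                os.pread(fd, min(block_size, size - offset), offset)
            except OSError:
                # The kernel could not read this chunk (e.g. EIO).
                bad_offsets.append(offset)
            offset += block_size
    finally:
        os.close(fd)
    return bad_offsets
```

Run as root, `scan_readable("/dev/sda")` would return an empty list if every chunk could be read. A clean pass does not rule out intermittent bus or controller problems; it only shows that the sectors were readable at that moment.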

The original kernel was 2.2.20, with advansys support built as a module.
I installed 2.4.18 debian source and compiled advansys support
statically into the kernel. Since then there have been no more
"input/output" errors, and I have run some fairly heavy duty tests on
the box such as simultaneously doing the following :
1./ running a VMware virtual machine
2./ on the real machine having wget continuously request a web page
from the virtual machine in a "while true....." loop
3./ back-to-back kernel compiles
4./ tarring the entire filesystem continuously
5./ running hdparm -t /dev/sda continuously
So the load average was > 5 for about 2 hours.
The box completed all tests without error and I was happy.
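
A combined load like the one listed above could be driven from one script rather than several terminals. This is a rough sketch under stated assumptions: the helper name `run_load` is hypothetical, each command is run through the shell in its own loop, and any non-zero exit code is recorded so that a failure during an unattended run is not missed.

```python
import subprocess
import threading
import time


def run_load(commands, duration_s):
    """Loop each shell command concurrently for about `duration_s` seconds.

    Returns a dict mapping each command to the list of non-zero exit codes seen.
    """
    deadline = time.monotonic() + duration_s
    failures = {cmd: [] for cmd in commands}

    def worker(cmd):
        # Restart the command until the deadline; a run already in
        # progress at the deadline is allowed to finish.
        while time.monotonic() < deadline:
            result = subprocess.run(
                cmd,
                shell=True,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            )
            if result.returncode != 0:
                failures[cmd].append(result.returncode)

    threads = [threading.Thread(target=worker, args=(cmd,)) for cmd in commands]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return failures
```

For example, `run_load(["hdparm -t /dev/sda", "tar cf - / | cat > /dev/null"], 2 * 60 * 60)` would repeat two of the tests for about two hours. The exact tar command line is an assumption, not the one used in the original run.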

However, today I noticed the same symptoms: all disc I/O would
occasionally cease; the system couldn't even execute simple commands if
it had to read them from disc rather than memory. I used scsi-config to
switch off read and write caching on the disc, and ran a bonnie++
benchmark, which resulted in

Sep 9 14:56:11 cel333 kernel: advansys: advansys_reset: board 0: SCSI bus reset started...
Sep 9 14:56:11 cel333 kernel: advansys: advansys_reset: board 0: SCSI bus reset successful.
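
To see whether bus resets line up with the hangs, the kernel log can be scanned for these messages. The sketch below assumes each message sits on a single line in the log file (the wrapping above comes from the mail client) and matches only the two message forms quoted here. The names `RESET_RE` and `find_resets` are hypothetical.

```python
import re

# Matches the two advansys reset messages quoted in this post.
RESET_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+"
    r"advansys: advansys_reset: board (?P<board>\d+): "
    r"SCSI\s+bus\s+reset\s+(?P<state>started|successful)"
)


def find_resets(lines):
    """Return (timestamp, board, state) tuples for each reset message found."""
    events = []
    for line in lines:
        match = RESET_RE.match(line)
        if match:
            events.append(
                (match.group("stamp"), int(match.group("board")), match.group("state"))
            )
    return events
```

Feeding it the lines of the system log (the path varies by distribution) gives a list of reset times that can be compared against the times the box stopped responding.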

I then ran the bonnie++ benchmarks continuously for about 3 hours
without any errors, whereupon the box again mysteriously got wedged,
responding to pings but not ssh.

Any suggestions for trying to track down the error ?
Options are :
Reformat (possibly low-level) the disc.
Check termination.

The thing is the same cable, disc, adapter, and termination settings
were used on a previous box with no errors at all. The disc may have
been damaged in transit. Reading the scsitools docs tells me that SCSI
discs are "self-healing", i.e. they re-map bad sectors seamlessly, and
these options are set on the disc.

While writing this I have tarred up the entire disc 3 times, again
without errors, which would seem to rule out "bad sectors" as a cause ?

Any help appreciated.
Regs,
Andrew.
