LINUX.IE, website of the Irish Linux Users' Group

[ILUG] Production email server, corrupt ext3 fs - need advice

John Molohan john.molohan at gcd.ie
Tue May 30 16:11:36 IST 2006


Declan Moriarty wrote:
> On Fri, 2006-05-19 at 16:25 +0100, John Molohan wrote:
>   
>> a little background first.
>>     
>
>   
>> We're thinking of the following approach.
>>
>> 1. Take the box down.
>> 2. In the scsi host util run verify media on all disks to identify & 
>>    mark any bad sectors and make them unavailable.
>> 3. Reboot & remount /var ro
>> 4. Rsync a new backup.
>> 5. Run smartctl to see if it identifies any issues.
>> 6. Format /var?
>> 7. Recreate /var from backup.
>>
>> Any suggestions/additions, other approaches?
>>     
>
> Don't let us see you posted this to so many mailing lists ;-)
>
>   
> Is S.M.A.R.T. enabled? It should work on all disks and prevent the
> gradual corruption you seem afraid of.
>
>   
>> Some questions:
>> 1. Do you think we could continue to trust these disks or should we just
>>    forget it and replace them?
>>     
>
> Unless they are pretty new, replace them if you're a serious outfit
> (e.g. a business). I don't know what the site is. If you are students,
> or software heads gifted a server by some company then sure, look again
> at them.
>
> If the disk is damaged, often there are damaged sectors near the ones
> that actually don't read. Near on the same track, or on nearby tracks.
> So you fix, and more go down tomorrow.
>
>   
>> 2. Does anyone have any hints from the admittedly little information as to
>>    whether this might be just filesystem corruption or dead disks?
>>     
>
> The obvious problems you haven't thought of are variations in
> temperature in the server room, and dirt under the heat sinks. People
> put fans on top of CPUs and think they will remain cold. They build up
> dust often between the fins of the heat sink, and then heat problems
> start.
>
>   
>> 3. There were a lot of servers in the server room which all experienced
>>    this slow cooking but none have shown any obvious problems so far.
>>    Should we be doing something as a precaution for them?
>>     
>
> See the answer to 2. Lift the fans & check.
>
>   
>> 4. Is it safe to assume that this failure is probably a direct result of 
>>    the heat? 
>>
>>     
> It is likely.  But it doesn't matter - it's a failure.
>
>
>   
Just a quick update. It seems that the root of our problems may actually 
be a buggy aacraid driver. We switched over to a backup server last 
week, only to hit the exact same error, which was nice. It was 
also a Dell 2650 with the same controller, so it made sense. Anyway, it 
seems that if you have an Adaptec PERC 3/Di and are using the 1.1.2 
driver, you can trigger this bug with heavy disk I/O. We've upgraded to 
1.1.5 and have been testing since the weekend without a repeat. I'll 
give an update when we know for certain.
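
For anyone wanting to check their own boxes, here is a rough sketch of how 
the loaded driver version could be compared against the two versions 
mentioned above. It is only an illustration: the sysfs path, the function 
names and the assumption that the version string contains three leading 
numbers are mine, and the "unknown" bucket is there because we have only 
seen the bug on 1.1.2 and are still testing 1.1.5.

```python
import re
from pathlib import Path

# Assumption: the loaded module exposes its version here; not every kernel build does.
VERSION_FILE = Path("/sys/module/aacraid/version")

REPORTED_BAD = (1, 1, 2)   # version we saw the error with
UNDER_TEST = (1, 1, 5)     # version we upgraded to; not yet confirmed as a fix


def parse_version(text):
    """Return the first three integers in a version string as a tuple."""
    numbers = re.findall(r"\d+", text)
    if len(numbers) < 3:
        raise ValueError(f"unrecognised version string: {text!r}")
    return tuple(int(n) for n in numbers[:3])


def classify(text):
    """Label a driver version using only what this thread reports."""
    version = parse_version(text)
    if version == REPORTED_BAD:
        return "reported-bad"
    if version >= UNDER_TEST:
        return "under-test"
    return "unknown"


if __name__ == "__main__":
    if VERSION_FILE.exists():
        raw = VERSION_FILE.read_text().strip()
        print(f"aacraid {raw}: {classify(raw)}")
    else:
        print("aacraid version not exposed in sysfs (module not loaded?)")
```

If the file is missing, running modinfo against the aacraid module should 
show the same version field.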


