On Mon, Sep 6, 2010 at 7:33 AM, Conor Wynne <mariconor at gmail.com> wrote:
> On 06/09/2010 04:16, paul at clubi.ie wrote:
>> On Sun, 5 Sep 2010, Michael Conry wrote:
>>
>>> crashed after only a little bit of use (just small stuff like Web,
>>> email, etc.). Trying to reboot, it failed. Then tried safe mode,
>>> and it worked, though I've since found that that was a red herring.
>>> The shut-down is a bit random, and may happen during first stage boot
>>> (after grub, before splash), may come during fsck, or may come after
>>> using the system for a few minutes to half an hour.
>>>
>>> If the failure happens early in the boot, I get a stack-trace with
>>> contents that I could only photograph. I'll transcribe some bits here
>>> in case it means anything to anyone...
>>> lots of irq stuff...
>>> ret_from_intr+0x0/0x11
>>> <EOI> <#MC> [<lots of address stuff>] ?panic+0x111/0x137
>>> ?panic+0xa1/0x137
>>> mce_panic+0x1e32/0x210
>>> do_machine_check+0x7d3/0x820
>>> machine_check+0x1c/0x30
>>>
>>> I am now thinking the problem is hardware related.
>>
>> Probably. The machine panicked because of a Machine Check Exception
>> (MCE). These are raised by the hardware for uncorrectable hardware
>> errors, such as ECC failures, temperature limits, etc. There may be
>> further information logged regarding the MCE.
>>
>> The way it crashes after different periods of time reminds me of
>> problems I've had with temperature - if you leave the machine alone a
>> while to cool and then start it, does it get further than if you restart
>> it immediately? As others have said, did you check the fans?
>>
>> regards,
>
> Get your hoover out and suck all the dust off the fans, heatsinks and
> electronics.
> The machine I'm writing on had "died" a few months ago, turns out it was
> down to dust - although it looked clean.
>
> Otherwise do as others have said: memtest86, diags etc.
> Test with a live CD as well and see if it's related to your install /
> kernel.
I've done a share of dusting by now, and it's quite possible that dust
was the original cause. Anyone know if that means permanent damage
(say if the processor heat-sink and side-of-case vent were clogged)?
I'm guessing the answer is "maybe" or "it depends". It's funny how
messy and random and chaotic things get once the problems/risks are
out of software and in hardware. :-)
On Paul's question regarding temperature: I was looking for a pattern
like that too... it's not very clear if there is one. My very first
experience was finding the machine shut down and powered off; I hit
power on and it crashed, reset it and it booted (when it should, if
anything, have been warmer)... if Keith's hypothesis of dry/cracked
solder joints were the case, however, that might make sense (heats up
and closes, heats up a bit more and opens, who knows?!).
I have taken the CPU cooler off (it didn't look badly bonded; I actually
pulled the processor out of the ZIF socket too, and there was no evident
heat damage) and I'll try replacing it (EUR 20 or so) and reflashing the
BIOS. After that, I'm wondering whether to cut my losses and start putting
together the mini-ITX always-on server I had been mulling over.
Anyway, huge thanks to all of you for suggestions and ideas. I'm glad
to know that there's not an obvious and immediate fix staring me in
the face that I'm overlooking.
Cheers,
Michael