# The dreaded random shutdown problem



## GlaielGamer (Apr 15, 2011)

I have a story that goes with this problem but first here's my specs

Case: COOLER MASTER RC-690-KKN1-GP Black SECC/ ABS ATX Mid Tower
Motherboard: ASUS P6X58D-E
GPU: Nvidia GTX480 (asus)
CPU: Intel Core i7-980X
RAM: 6 sticks of this (12 GB)
PSU (original): Corsair 750TX
PSU (new): Corsair 850HX
Running Windows 7 Ultimate
Bought + Assembled September 2010

=========
Ok, so this is the first computer I ever decided to build. It's primarily for work (programmer, lots of Visual Studio-ing and accidentally crashing the graphics drivers / leaking memory / everything bad you could do with software). I had the computer initially overclocked to 3.7GHz (the highest you can safely go without increasing voltage, as I read) and the RAM overclocked to 1600MHz (it's 1600MHz RAM but the MB identified it as 1333). The computer was usually put to sleep at the end of the day instead of shut down.

March comes, and I was away for a little over a week. The computer was unplugged (power strip flipped off) for this time. This was not the first time it was off for a week (it was also unplugged over Christmas and Thanksgiving). I come back, and the computer doesn't boot. I unplugged everything, reseated the CMOS battery and whatnot, and eventually got it to boot up again (overclock settings reverted to default; I haven't changed them since). I also removed the wifi card since I'd only been using ethernet anyway.

Since then, I've been experiencing random shutdowns. Not restarts, not software crashes or BSODs, just suddenly it shuts down as if it had been unplugged. These happen anywhere from minutes to hours to days after startup, in no identifiable pattern. Sometimes under heavy load, sometimes while idling, sometimes during boot (usually only when immediately starting up after a previous shutdown). Happened in safe mode too. Every now and then one of these results in the computer being unable to start up unless I remove and reseat the CMOS battery (unsure if this is a result of an improper shutdown or a symptom of the cause).

- Ran windows 7's built in memtest, no errors
- Intel burn test had no errors
- chkdsk reported no problems
- Temperatures are normal (highest I've seen the CPU is 65-70C when running intel burn test, 25-30C when idling and 50C under my normal "heavy load", GPU hits 90C when playing starcraft but supposedly that's normal for a GTX480, remains at 50-70 during idle / medium load)
- Set "Do Nothing On Power Button Press" in windows settings

I replaced the PSU last week (from a corsair 750TX to a corsair 850HX) thinking that was the most likely cause. Went one week with no problems, then today BAM shutdown. It's not the PSU (I needed a modular one anyway so I don't mind the purchase).

Other symptoms:
For a brief period of time (not anymore) Windows would hibernate instead of sleep.
I can't remember if this coincided with the above, but sometimes at boot/unsleep/unhibernate the keyboard would not work, and a simple reset would fix it. I think this only happened when the computer was un-hibernated after the above problem triggered.
Other than that there are never any other software symptoms. When it's running, it's running fine, apart from the shutdowns.



So.... what do I do? Do I blindly buy a new motherboard and hope that works? Would replacing the CMOS battery do anything? Is it possibly a RAM/HD issue even though I checked both of those? Is it a GPU issue (no visual artifacts ever so I don't think it is...)? Are there more tests I should run? Could a case fan be shorting out? Should I set my overclock settings back to what they used to be instead of assuming default would be safe?

Any help is appreciated.
(if this is the wrong subforum, just let me know and i'll lock and post elsewhere)


----------



## Stu_computer (Jul 7, 2005)

all the bios updates for this mobo include ' Improve system stability' soooo....
what bios version do you have installed?

see what happens is when an acpi os runs for the first time it tells bios to take a hike cause it will run the hardware from now on and so thereafter the hardware is configured to initialize to what acpi (and the drivers you installed) tell it to be. that is till you did your cmos reset (battery trick) and gave it amnesia. so either bios isn't playing nice with acpi or you have some badass drivers installed harry. and did u remember to install the chipset drivers?


----------



## GlaielGamer (Apr 15, 2011)

I'm using whatever bios version was installed when I bought the thing. Haven't updated it. Should I? What if it randomly powers off while updating?


----------



## GlaielGamer (Apr 15, 2011)

Ok I'm on the problem computer right now and can look up info.

- BIOS ver 0303
- checked event viewer, no critical errors other than standard shut down unexpectedly
- updated chipset drivers (download from asus's website, took only a few seconds to run so i'm not sure if it did anything)

(i also went and upped the ram back to 1600mhz and the CPU up to 3.7ghz, what it was before the problems started)


----------



## GlaielGamer (Apr 15, 2011)

Neither of those made any difference. Please help!


----------



## Stu_computer (Jul 7, 2005)

okay, finally got the manual and qvl from the asus site.

first, got the memory wrong.
the F3-12800CL9D-4GBRL modules are okay for 1333 and in slots a1-b1-c1 (a2-b2-c2 empty) for 6Gig total.

for 12Gig and 1600Mhz you should have the F3-12800CL9T2-12GBNQ
Newegg.com - G.SKILL NQ 12GB (6 x 2GB) 240-Pin DDR3 SDRAM DDR3 1600 (PC3 12800) Desktop Memory Model F3-12800CL9T2-12GBNQ


a kit of multiple modules is tested and a proved match (for this kit only).
buying several kits of the same type DOES NOT mean/guarantee they will all be matched and work together. simply, if you want 8Gig total then buy a 8Gig kit for a guaranteed match.

expanding on that further, with several kits the user has to keep track which are the matched modules so they don't get mismatched, defeating the purpose of buying the kit in the first place.

more importantly, the user must know which slots the matched modules need to populate else the modules are servicing the wrong channels and not a match (while in actual use).

---------------------
next, the 303 bios version is better than the original release version (definitely better memory support). when you're satisfied the system is stable i recommend updating to the newest version. do the update with only one ram module (in slot a1).

---------------------
next, no overclocking while system is unstable.
-are speed step and turbo boost enabled/disabled?
-what antivirus program? (bet its norton).

you will need a stable base point in order to pinpoint when a modification/change becomes unstable.
suggest you configure only 3 ram modules (as above) and then in bios set optimum performance and save. see if that stabilizes the system.


----------



## GlaielGamer (Apr 15, 2011)

Been running on 3 sticks today, no shutdowns but again I need to test it for a full week before I know for sure.


"expanding on that further, with several kits the user has to keep track which are the matched modules so they don't get mismatched, defeating the purpose of buying the kit in the first place."

well damn I do remember pulling out all the memory and reinserting it when I had the initial trouble, maybe that could have been the issue? (that would explain why it worked fine up till that point) I have been running them at 1333 since then though. Is there a way of seeing which ones are matched by looking at their serial numbers? Do they need to be matched in pairs of 2 (A1 + A2 matched, etc) or in pairs of 3 (A1+B1+C1 matched, etc)? I never knew that they needed to be matched.


"next, no overclocking while system is unstable."
It wasn't overclocked for the past month (I only turned the overclock back on yesterday)

"-are speed step and turbo boost enabled/disabled?"
yes, that was the default

"-what antivirus program? (bet its norton)."
windows security essentials, scan came up clean

I ran 3 sticks today, with the case sideways and the front panel removed. I won't know if it's stabilized until I run it for a week with no shutdowns (as is the nature of the problem).
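
A rough way to judge how long a quiet stretch has to be before it means anything. This is only a sketch: it assumes the shutdowns arrive at random at a steady average rate (a Poisson process), which nobody in this thread has measured, and the function names and the figure of one shutdown per 7 days are made up for illustration.

```python
import math


def chance_of_quiet_run(mean_days_between_shutdowns, quiet_days):
    """Probability of seeing zero shutdowns in `quiet_days` purely by luck,
    if nothing was fixed and shutdowns follow a Poisson process (assumption)."""
    return math.exp(-quiet_days / mean_days_between_shutdowns)


def quiet_days_needed(mean_days_between_shutdowns, confidence=0.95):
    """Days without a shutdown needed before luck alone explains the quiet
    run less than (1 - confidence) of the time."""
    return -mean_days_between_shutdowns * math.log(1.0 - confidence)


if __name__ == "__main__":
    mean_gap = 7.0  # hypothetical: about one shutdown a week on average
    for days in (1, 7, 14, 21):
        luck = chance_of_quiet_run(mean_gap, days)
        print(f"{days:2d} quiet days: {luck:.0%} chance it is just luck")
    print(f"quiet days needed for 95%: {quiet_days_needed(mean_gap):.1f}")
```

Under those assumptions a single quiet week still leaves better than a one-in-three chance that nothing was actually fixed, and it takes about three quiet weeks before that drops to around 5%.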


Meh if I need to buy a $150 ram kit that isn't so bad I guess.
So its most likely a ram issue then? Thanks for the help.


----------



## GlaielGamer (Apr 15, 2011)

running with 3 sticks didn't get any shutdowns the first day I tried while running the computer, but I did get a kernel-power error while it was sleeping overnight.

I'm using just 2 sticks now (matching ones) to see if it still happens.


----------



## Stu_computer (Jul 7, 2005)

that may have been a coincidence, but go with the 2 sticks for now to get a stable system.

the ram you have isn't up to the task of fully populating all slots. check the qvl and you will see it's okay for DIMM Socket Support (A and B only--which is as dual channel or a single triple channel configuration).

unlike previous processors the i7-980X directly accesses the ram now and there are independent memory controllers for each channel so that puts a brutal demand on the ram's performance.

the other issue is there are a lot of errata for both the i7-980X and the X58 chip.



> HP Z800 Workstation - Bulletin: HP Z400, Z600, and Z800 Workstations, HP Pavilion Elite HPE-180t, HPE-190t, and HPE-190jp Desktop - SYSTEM BIOS UPGRADE REQUIRED; Certain Intel Processors May Cause Unpredictable System Behavior - c02078984 - HP Busin
> The system may stop responding to keyboard or mouse input.
> A system operating in a Microsoft Windows environment may generate a blue screen.
> A system operating in a Linux environment may generate a kernel panic.


some errata is corrected with bios updates (and some never get fixed), hopefully the newest asus update clears your random issues.

when ready to do the update refer to the manual's sec. 3 for the usb update method, and see Hikers (post #8) for an excellent step-by-step on updating the BIOS via EZ Flash 2
ASUSTeK Computer Inc.-Forum- BIOS 0502 released Jan 17, 2011

------------------------------------------------


> "-are speed step and turbo boost enabled/disabled?"
> yes, that was the default


fyi, these need to be disabled when O/Cing, else they can destabilize the system.

> (it's 1600MHz RAM but the MB identified it as 1333)

the 1333 was correct because it's the fail-safe default JEDEC (SPD) setting for the 'first boot' of DDR3 1600 ram.

you have to change the bios setting *Ai Overclock Tuner* [Auto] to [XMP] Profile 1 to get 1600. (Profile 2 for 1800? guessing...g.skill doesn't post their profiles.)

XMP with speed step and turbo boost enabled should let cores 1-2 ramp up to 3.6GHz and cores 3-6 up to 3.46GHz when appropriate.

(kind of makes manually o/cing to 3.7GHz pointless.)
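
For what it's worth, those turbo figures fall out of base clock times multiplier. A small sketch, assuming a 133.33 MHz base clock (BCLK) and multipliers of 25 (stock), 26 and 27 (turbo). Those numbers are assumed typical i7-980X values, not something stated in this thread, and the function names are made up for illustration.

```python
BCLK_MHZ = 133.33  # assumed stock base clock for an X58 / i7-980X system


def core_clock_ghz(multiplier, bclk_mhz=BCLK_MHZ):
    """Core frequency in GHz = base clock (MHz) * multiplier / 1000."""
    return bclk_mhz * multiplier / 1000.0


def multiplier_for(target_ghz, bclk_mhz=BCLK_MHZ):
    """Multiplier needed to reach `target_ghz` at the given base clock."""
    return target_ghz * 1000.0 / bclk_mhz


if __name__ == "__main__":
    # Assumed multipliers: 25 stock, 26 and 27 for the two turbo steps.
    steps = (("stock", 25),
             ("turbo, 3-6 cores busy", 26),
             ("turbo, 1-2 cores busy", 27))
    for label, mult in steps:
        print(f"{label:22s} x{mult}: {core_clock_ghz(mult):.2f} GHz")
    print(f"3.7 GHz needs about x{multiplier_for(3.7):.1f} at stock BCLK")
```

On those assumed numbers a 3.7GHz manual overclock sits less than one multiplier step above the top turbo speed, which is the point of the remark above.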


----------



## GlaielGamer (Apr 15, 2011)

Got a random shutdown with 2 sticks of ram today. It was VERY reluctant to start immediately afterwards: it kept trying to reboot itself and failed every time (fans would spin for half a second, the headphones would pop, then the lights would go out and the fans would stop before it tried again).

Shuffling ram didn't help
Resetting CMOS didn't help
Shuffling ram, unplugging the power supply from the motherboard, unplugging all the case fans, resetting CMOS, then attempting to start up did it (long enough for me to commit today's work to my svn server, which is the only reason I tried so hard to get it restarted).

I'm very skeptical of it being a ram issue now, especially since every single incident has been a "kernel-power" error and it hasn't manifested itself in software at all (no blue screens, no video artifacts, no odd glitches)


----------



## Stu_computer (Jul 7, 2005)

> I'm very skeptical of it being a ram issue now, especially since every single incident has been a "kernel-power" error and it hasn't manifest itself in software at all (no blue screens, no video artifacts, no odd glitches)


Kernel-power error
The last sleep transition was unsuccessful. This error could be caused if the system stopped responding, failed, or lost power during the sleep transition.
The system has rebooted without cleanly shutting down first. This error could be caused if the system stopped responding, *crashed or lost power unexpectedly*.

that's computer lingo for "I was knocked out cold and don't know what happened."

what might be helpful is knowing what event(s) was logged just prior to these TKOs?
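
One way to line that up, as a sketch only: pull the few log entries that sit right before each crash marker. It assumes the markers are Kernel-Power entries with event ID 41 (what Windows 7 normally uses for the "rebooted without cleanly shutting down" message) and that the log has already been exported into a list of rows. The row layout and function name here are hypothetical and the export step is not shown.

```python
from datetime import datetime


def entries_before_crashes(entries, how_many=5,
                           crash_source="Microsoft-Windows-Kernel-Power",
                           crash_id=41):
    """For each crash marker, return the `how_many` entries logged just before it.

    `entries` is a list of dicts with keys "time" (datetime), "source" and "id"
    (a hypothetical layout for rows exported from Event Viewer).
    """
    ordered = sorted(entries, key=lambda e: e["time"])
    report = []
    for index, entry in enumerate(ordered):
        if entry["source"] == crash_source and entry["id"] == crash_id:
            start = max(0, index - how_many)
            report.append((entry, ordered[start:index]))
    return report


if __name__ == "__main__":
    # Made-up sample rows, only to show the shape of the input.
    sample = [
        {"time": datetime(2011, 5, 1, 9, 0), "source": "DNS Client Events", "id": 1014},
        {"time": datetime(2011, 5, 1, 9, 30), "source": "Service Control Manager", "id": 7036},
        {"time": datetime(2011, 5, 1, 10, 5), "source": "Microsoft-Windows-Kernel-Power", "id": 41},
    ]
    for crash, before in entries_before_crashes(sample, how_many=2):
        print("crash marker logged at", crash["time"])
        for e in before:
            print("   preceded by", e["time"], e["source"], e["id"])
```

Keep in mind the Kernel-Power entry is written on the next boot, not at the moment of the shutdown, so the rows right before it can include start-up entries from that boot and it may be necessary to look a little further back.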

that first sentence is interesting too though "system stopped responding, failed, or lost power during the sleep transition"...in other words a C-state error.
you mentioned some sleep/wake issues and i also noticed a lot of the i7-980X errata involves C-state issues and Hyperthreading (logical C-state) issues.

since the system was originally stable, there were no hardware changes, and no hardware problems showed up when you tested, but the shutdowns are becoming more frequent, that suggests narrowing the suspects to thermal degradation or subtle software changes (as in windows core updates and driver updates).

can try running some of the Mats_Runs to see if a problem is identified, like...
Mats_Run.power.exe
Mats_Run.performance.exe
Mats_Run.devices.exe
Download details: Microsoft Fix it: Automated solutions for your issues

and so on, also instead of sleeping the system when not in use try rebooting to a different OS and see if it will shutdown with a linux live CD or Hiren's Boot CD (has mini linux and miniXP) thus eliminating software error.

i'm hesitant to suggest it because it's a lot of bother, but with all the case shifting/moving etc it's possible the CPU isn't properly seated and simply reseating the CPU and a fresh application of thermal paste may be a solution.

if a slight finger pressure to the mobo near the CPU socket causes a shutdown then pull the board and check the standoffs and check the board for heat warp. (slight as in touching your eyelid.)

moving on----------------------

with a 6 core processor the Hyperthreading doesn't benefit most apps, and it appears to be a liability (errata) for this CPU so best to simply disable it and eliminate it as a possible source of an occasional random shutdown.

if you notice a change in performance for some special app you may have, then when system is stable again can always give it a try then and decide if the actual gain is worth it.


----------



## GlaielGamer (Apr 15, 2011)

events: no consistency in events before shutdown (most common is DNS error which doesn't really tell me anything)

pressing on the motherboard: didn't do anything (used the back of a pen instead of my finger, tried a bunch of different spots).

However, I've been running it for 3 days with both side panels of the case removed, all case fans unplugged, the case lying on its side, and the front panel of the case unplugged. I'm thinking.... maybe it could be the case or one of the fans? Is it possible for a short circuited / badly connected case fan to be able to cause the PSU to cut out?

If I can get it to run more than a week like this I'll consider buying a new case then...


(and yes, hyperthreading does matter to me since I run visual studio, its incredibly helpful to compile 12 source files at a time)


----------



## GlaielGamer (Apr 15, 2011)

Ok I made it just over a week (and counting) with no shutdowns with that configuration (all fans unhooked, case power/reset button/led's unhooked, front and back panel of the case taken off)

Tomorrow or monday I'll test to see if its the reset button on the case malfunctioning (though I'm pretty sure i've tested that before).

Is it possible for a faulty case fan / connector to short out the power supply? Should I get a new case or see if there's something I can do to "fix" the one I have? I suppose I could plug in all the fans again and see if the shutoffs return


----------



## Stu_computer (Jul 7, 2005)

yes a fan or its header can short the system, and both do happen occasionally.

if the fault is in the fan power lead then it should either spin erratically or not at all; if it's in the sensor or control lead then the chipset may interpret it as a critical trip fault and force a shutdown.

since there are two types on that board: 3pin (gnd, pwr, sensor) and 4pin (gnd, pwr, sensor, control) the most convenient method is to reconnect all fans again and if system shuts down then that would verify a fan fault, can then do one at a time to isolate the fault (start with 4pin as a likely cause).


----------



## GlaielGamer (Apr 15, 2011)

Update:

The fans and the case weren't the issue. I think I found the problem now though (again it's annoying when it runs stably for 2 weeks before the problem comes back, makes it nearly impossible to test properly).

It's not a software problem (I've now seen it shut down while in BIOS once).

So last week the problem came up in that it would shut down consistently within a minute of starting up (often during boot). I booted into BIOS and noticed my PSU 12V rail was really low (11.1v - 11.3v) [it doesn't report this low when reading through Windows so I never noticed] (I watched it drop then shut off here). Though, testing the power supply with the tester indicates it's fine (and like... I already replaced the power supply, both are high end Corsair ones).
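
For reference, a quick check of BIOS readings like these against a ±5% window on the 12V rail. The ±5% figure is the tolerance usually quoted for ATX supplies and is an assumption here, not something taken from this thread. BIOS sensor readings can also be off, so treat the result as a hint rather than a measurement. The function names are made up for illustration.

```python
def rail_limits(nominal_volts, tolerance=0.05):
    """Return (low, high) allowed voltage for a rail at the given tolerance."""
    return nominal_volts * (1.0 - tolerance), nominal_volts * (1.0 + tolerance)


def rail_in_spec(nominal_volts, measured_volts, tolerance=0.05):
    """True if the measured voltage falls inside the tolerance window."""
    low, high = rail_limits(nominal_volts, tolerance)
    return low <= measured_volts <= high


if __name__ == "__main__":
    # 11.1 and 11.3 are the low BIOS readings; 12.05 is a healthy-looking one.
    for reading in (11.1, 11.3, 12.05):
        verdict = "within" if rail_in_spec(12.0, reading) else "outside"
        print(f"{reading:.2f} V is {verdict} +/-5% of 12 V")
```

With that window the limits work out to 11.4 V and 12.6 V, so readings of 11.1-11.3 V count as out of range while something like 12.05 V does not.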

I unplugged the 24-pin connector and looked at the 2 12v connections on the motherboard, and both were less shiny and more "brown" than the others. I don't have any cleaner for them so I just blew compressed air on them and cleaned them as best I could that way, plugged the connector back in and booted into BIOS. Voltage was reported as 12.05v (normal), and it lasted a day and a half before the issue came back.

Right now it really consistently shuts off when waking from sleep mode. I unplugged it and am gonna stop using it till I get a new motherboard and a UPS (it's possible it's an electrical issue in my office since I have a lot of stuff on the same circuit, though unplugging them doesn't seem to make a difference and the circuit breaker has never tripped).

I will report back again when I get my new parts. Any tips for replacing a motherboard? How should I replace the CPU? Do I need to remove and replace the thermal grease? Any recommendations for a brand?


----------



## Abstand (Jun 20, 2011)

90 degrees Celsius is way too hot for any GPU. Look into that a bit, I recently had a problem with a GT 240 overheating to those temps and had almost the exact same symptoms.


----------



## GlaielGamer (Apr 15, 2011)

Abstand said:


> 90 degrees Celsius is way too hot for any GPU. Look into that a bit, I recently had a problem with a GT 240 overheating to those temps and had almost the exact same symptoms.


it idles at 50-ish, and the issue has happened while idling (even while in bios)- this isn't the issue


----------



## GlaielGamer (Apr 15, 2011)

Well ****, replacing the motherboard didn't make the issue go away.


----------



## HPC toaster (Sep 8, 2011)

I had a similar problem with an i7-980X in a 17.4" laptop (Clevo D900F, mySN QXG7 -- beware of the company; very bad service). After running numerous tests for RAM, HDD, SSD and chipset (X58) I narrowed the problem down to the CPU or the BIOS settings associated with the CPU. The only CPU feature in the Clevo BIOS (v. 1.00.09) which has not been used extensively in preceding machine generations is the C-state. This feature was designed to reduce the heat dissipation of the CPU whenever possible. Unfortunately it leads to instabilities in CPU synchronisation with its immediate components (RAM, northbridge), and Intel even admits this in a white paper:
http://download.intel.com/design/intarch/papers/323671.pdf

It escapes me why mySN is unable to set the right (stable) BIOS settings before delivery; these guys should sell hot dogs or matches.

So I de-activated the C-states flag in the BIOS -- the machine now runs stable for weeks on a high-demand simulation job with all six cores, while the remaining six threads give me enough performance to edit LaTeX docs, etc.


Hope this helps


----------

