Intermittent Part Failure

It’s been a reasonably bad week, but I really don’t want to vent about the work-related things that went wrong, so instead I’ll ramble about my computer a bit.

Intermittently failing parts are the worst to diagnose.

My own computer has been having issues for quite a while, starting late last year when I went to install a BD player and one channel of RAM decided it didn’t want to clear POST ever again. I’d been having off-and-on problems with my secondary hard drive leading up to that, and when the RAM issue occurred, I decided the drive trouble had to be RAM-related.

Of course, that drive’s issues ended up not being RAM-related: a glance at my Windows Event Log a few weeks later turned up several thousand entries for HDD I/O errors.  A quick swap-out later, I’d lost maybe one or two files, which I have a backup of that I haven’t gotten around to restoring.

Over the course of owning this system, one thing became increasingly obvious: something is WRONG with it.  I think I finally pinned it down to something the Asus Sabertooth line is bad about – dynamic overclocking.  Overclocking is a bit of a goofy thing in general, and I think the term has lost much of its meaning in recent years as manufacturers started building “overclock tolerances” into their designs to account for user shenanigans.

I am also not a fan of overclocking, as I never really bought into the idea of spending on cooling what I could have spent on a faster CPU in the first place.  There’s a huge line of thought (and argument) about what hardware is actually capable of, but your warranty is for the numbers on the box.  A general rule of thumb: while your hardware may have started its production cycle as a more powerful chip, mass production of electronics is not the consistent process we like to think it is.

Years of computer repair have taught me that no two computer parts are exactly identical.  They may look the same, act the same, and run the same software…  But the longer you use them, the more each part develops its own distinct personality, based on its quirks and the ab/use you give it.  You can treat them like identical machines, but those chips are built with a rather dramatic fault tolerance, and each one tends to have things that work and things that don’t.  Chips that work but fail certain tests get binned down to lower-powered configurations.  So your nVidia GTX Blah-80 may have started life as an nVidia Quadro that failed performance testing.

This, fundamentally, is the basis of overclocking: you’re taking a gamble that enough of the higher-end features of your given piece of hardware work.  If they don’t, you have a brick.  If they do, you’ve gotten a free upgrade that may or may not be buggier than IE 6.0 in quirks mode.

I generally don’t like taking that bet.  I push my computer to its endurance limit, not its performance limit.  I don’t play Crysis or other games that push graphics cards to their edge.  The two games I’ve put the most time into in the last five years are Minecraft and WoW, and both are far more CPU-bound than GPU-bound.  The most intense thing I did to my computer – a habit I stopped recently – was running Google Chrome at all.  (Firefox tends to get a bad rap for how many gigs of RAM it uses to display a 320 KB webpage, but Google Chrome is even worse.)

Speaking of Minecraft, that plays into the “game test” diagnosis:  I stopped playing MC around 1.3.5 and didn’t come back until the 1.5 snapshots started coming out.  Up to that point, the only game that had given me issues was WoW, with occasional visual artifacts.  (I haven’t even played WoW since January, anyway.)  When I started playing again, MC started having issues, and I wrote that off as the snapshot being unstable.  When 1.5 went live and the problems got worse, I figured it was just me.  A friend talked me into playing StarCraft 2 a month or so ago, and to really prove that your system has something wrong with it, all you need to do is run a Blizzard product.

What do I mean by that?  Basically, Blizzard has their own way of doing things.  They made their major foray into 3D early and stuck with tradition rather than adopting standardized techniques.  The result is an unusual hybrid of “this is how we’ve always done it” and “ooh, that’s awesome, let’s try using that too” that gives Blizzard games very intense hardware requirements for what they’re actually doing.  In my case, SC2 would act like my system didn’t have enough resources to run it for the first two games played, then suddenly be fine on max settings.

So I had a system that I knew was bad: it behaved erratically when I ran SC2, artifacted in WoW, crashed when I played Minecraft, had burned out some parts in the past, and tended to need a reboot around the one-week uptime mark.  It also took an excessively long time to boot by Windows 8 standards.

Something was wrong, but I couldn’t pin down what, exactly.  Then my RL friends invited me to play FTB with them, and I was finally able to provoke a crash on demand.  After a day or two without a reboot, MC would crash on launch.  Thinking it was a mod, I tried vanilla MC with the same result.  Reboot, and it would work for about 8 hours.  Since most of how I play MC is automating machinery (a habit I picked up from playing Better Than Wolves/Better Than Buildcraft), I shrugged and said, “Okay, I’ll just leave you open and see if you crash.  I don’t have a chunkloader anyway.”

Sure enough, MC was fine until I closed it; then I needed to reboot.  So for several days, I just idled in my base, waiting on things like my basalt cobblestone generator to fill barrels, trying to figure out this crazy mess of mods called Feed the Beast.

For the first time since I ran Windows 98, I was back to daily reboots.  That lasted about a week and a half before MC started abruptly crashing whenever Thaumcraft would do an aura node update.  At least, I think it’s Thaumcraft doing it; the only clue I had came a day or so ago, when I crafted my first Thaumcraft wand and discovered that the only corner of my base where I can regenerate Vis is the same corner that causes the random mystery lag preceding an MC crash.

Then, on a reboot, I got an nVidia Control Panel error letting me know it had driver issues.

Some hardware tests that had passed without errors before now turned up that, sure enough, the other two sticks of RAM had gone bad as well.  They still work, but they’re very unstable for anything beyond basic web browsing.

Now for one last piece of info that perhaps I could have opened with: this is the second set of RAM this mobo’s gone through.  Having arrived back at RAM, I started reviewing why I might have been having issues…  and it seems this one is completely my fault for not reading the fine print.  While this mobo says it wants 1600 MHz DDR3, it actually only runs at 1333 with an AI overclock to 1600.  Running actual 1600 MHz DDR3 RAM in it resulted in some rather strange automatic timing adjustments, and it overclocked past that, even though I had manually corrected the timings when I first installed the RAM.

I’m quite amazed I got the life out of it that I did, considering how badly I mismatched the memory.  Rather than try to doctor the system any further, I decided to just do a full upgrade.  I’m moving to a third-gen i7 and a different board without as many “overclocker” features, while still having direct access to adjust timings and clock speeds if I need to correct a bad automatic detection.  I’m hopeful I won’t have to, given that I’ve matched the memory to the board’s non-OC speed rather than its theoretical top OC.

I’ve also decided to take the SSD plunge.  I’ll be running an SSD for my OS with a striped data drive array, possibly upgrading to RAID 5 later on.  I actually purchased three drives in anticipation of using RAID 5 right out of the door, but I need to check whether the motherboard’s on-board RAID is soft RAID or actual hardware RAID.  I suspect it’s soft RAID, and soft RAID 5 arrays apparently suffer terrible write performance.

If it is soft RAID, I’ll need to buy a hardware RAID controller as an expansion later on.  SATA 6 Gb/s hardware RAID controllers are quite a bit more expensive than I’d like, so I’m hoping they come down in price a little.
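As a rough back-of-the-envelope sketch (not anything specific to this board), the RAID 5 write penalty comes from parity: a single small write turns into four disk operations (read old data, read old parity, write new data, write new parity), which is exactly the work a soft-RAID implementation dumps onto the CPU and ordinary disk queue. The numbers below are the textbook figures, with hypothetical drive sizes:

```python
# Back-of-the-envelope comparison of RAID 0 vs. RAID 5 for n equal-size disks.
# Shows why small random writes hurt on (soft) RAID 5: each one needs a
# read-modify-write of the parity block, i.e. 4 I/Os instead of 1.

def raid_profile(level, n_disks, disk_gb):
    if level == 0:
        # Pure striping: full capacity, no redundancy, one I/O per write.
        return {"usable_gb": n_disks * disk_gb, "ios_per_small_write": 1}
    if level == 5:
        # One disk's worth of capacity is consumed by distributed parity.
        # Small write = read old data + read old parity
        #             + write new data + write new parity.
        return {"usable_gb": (n_disks - 1) * disk_gb, "ios_per_small_write": 4}
    raise ValueError("unsupported RAID level")

for level in (0, 5):
    p = raid_profile(level, n_disks=3, disk_gb=1000)
    print(f"RAID {level}: {p['usable_gb']} GB usable, "
          f"{p['ios_per_small_write']} I/Os per small random write")
```

Hardware controllers dodge much of this with battery-backed write caches that absorb the read-modify-write cycle; soft RAID generally can’t, hence the reputation.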

With any luck and no shipping delays, the hardware will arrive Tuesday.

In other news, I got down to the local pet store to find that last week’s 55-gallon tank restock arrived cracked, so I have to wait until it gets restocked again.  Ah well.  Resupply is a mere setback.

5 Comments

  1. daft27 Said,

    April 15, 2013 @ 11:30 am

    Speaking of tolerances… my laptop is perfectly happy running on 120V but has issues with 240V power. I get random BSODs (which indicates a different problem each time) if I don’t run my GPU on non-power saving mode.

    Hope you like working with the SSD. I put one in one of my laptops, and the difference was amazing.

  2. Norren Said,

    April 15, 2013 @ 3:39 pm

    That’s an odd bug. If you had a spare power brick, I’d try swapping it out to see if maybe there’s something wrong with, say, a voltage regulator. Could also be the laptop battery/battery charger.

    I’ve heard nothing but praise for SSD performance, so I’m looking forward to it. ^_^

  3. JinK Said,

    April 17, 2013 @ 5:37 pm

    If the RAM isn’t dead and you’re still waiting for the upgrade, and you suspect it to be a motherboard clocking issue… why not try underclocking? Sure, no one likes a hit to performance, but if it works, you can be sure that your motherboard is killing your RAM.

    I have SSDs on a striped RAID and it’s AWESOME. You never realize how much of a bottleneck the HDD is until you switch. And on laptops, I don’t even want to see a hard drive in them anymore. Laptops usually die from overheating, and SSDs help with that, increase speed incredibly, extend battery life, and you can throw the whole laptop while it’s running and be sure it’s definitely not the SSD that’s broken. I’d rather get a $400 laptop and a $300 SSD than an $800 laptop.

  4. Norren Said,

    April 17, 2013 @ 8:26 pm

    A few reasons. The first part of the answer is the most valid in my book: “I wanted to replace it anyway”. 🙂 Between concern that it was going to up and completely die on me when I started substituting hardware, and some upgrades I’d been window shopping for a while? It just seemed like as good a time as any.

    A second reason is my living room needs an HTPC. My family bought a network enabled 3D TV last summer and it’s gone all this time without a single anime or 3D anything on it. (Hmm… I wish they’d done Fate/Zero in 3D…) Now that I can diagnose the old system without fear that it will die on me, I’ll scavenge it into that HTPC. My old video card was 3D capable by the specs, so I’ll see if I can’t get some

    On the SSDs, Do you ever run into TRIM issues in that configuration? I was thinking about trying that, but the only useful info I found on the configuration in the past was an article talking about how the Raid Array prevents the OS from running TRIM on the particular SSDs the author owned.

    I haven’t had a good chance to abuse/test this thing out yet, but I’m in the process of getting all my apps and games reinstalled, so I expect the SSD to spoil me in the next few hours. 🙂

  5. JinK Said,

    April 17, 2013 @ 11:44 pm

    TRIM is an issue in RAID, but so far, on newer computers, RAID 0 and RAID 0 only is completely fine. RAID 5 is very appealing, but data loss isn’t as much of an issue for SSDs. They aren’t as susceptible to damage from heat or shock, and they have an expected active time in the millions of hours.

    I can see why TRIM is an issue, as each cell deteriorates with each write, but a more appealing solution would be to have one $70 1TB HDD hold a backup for two $300 500GB SSDs in a RAID 0. Instead of three SSDs in a RAID 5, you can have like four backups on HDDs and two SSDs in a RAID 0 for the same price as three SSDs in a RAID 5. TRIM doesn’t work, but… I’m not sure, but I’ve heard there are some motherboards that don’t use TRIM at all on RAID. If you defragment, SSDs get defragmented differently than HDDs, too, to preserve the cells of the SSD.

