Kuro5hin.org: technology and culture, from the trenches
Hardware and Oct 3, 2001

By hurstdog in Site News
Wed Oct 03, 2001 at 01:55:30 PM EST
Tags: Hardware

In case you didn't notice, we were down again this morning (well, this morning for us Americans). The server crashed, froze, or otherwise stopped responding at about 5am Eastern, and we were down for approximately 4-5 hours. The server is a Compaq Proliant 8000 with 2GB of RAM. Since we can't see any software problem that would be making it crash, we think it might be hardware. An old Proliant admin I talked to says that not running SmartStart will sometimes cause intermittent problems. Does anyone here have any experience with Proliants and SmartStart? I give some more info below.


Bubba has been crashing under low load, and it doesn't seem like any software that's running is exhausting the RAM and causing mod_perl and Apache to start thrashing, so it looks like hardware. We don't want to have a hardware problem, so we're hoping it's just a consequence of not running SmartStart.

From what I understand, SmartStart sets up some RAID information (the SmartArray stuff) and sets up the boot partition. When Slackware was installed on bubba, from what I understand, it was just installed as the root, without SmartStart being run at all. Does anyone here know if that will cause problems? Care to share your experience?

Also, Jason at VHosting says that when he is there at the console the disk controller seems to be acting flaky: sometimes it doesn't seem to be communicating with the disk array. We've had that problem since we got the box. Lastly, it doesn't come back up every boot. It seems to come back every 3 or so boots, when it doesn't, it just gets a blank screen...

Poll
When kuro5hin is down I ...
o Keep pinging the box until it's back up 6%
o Hit reload endlessly 28%
o Go outside 5%
o Watch TV 0%
o Read Slashdot 27%
o Actually get work done 32%

Votes: 74

Hardware and Oct 3, 2001 | 15 comments (15 topical, 0 editorial, 0 hidden)
Sounds like. (none / 0) (#1)
by wiredog on Wed Oct 03, 2001 at 02:22:59 PM EST

That box was worth every penny Rusty paid for it. Flaky hardware is a bitch. Does it have Genuine Compaq Memory in it, or something from nocturnalaviation.com? Bad memory can cause that sort of problem. I once had a mobo that was flaky: every time a truck drove by, the PC rebooted.

If there's a choice between performance and ease of use, Linux will go for performance every time. -- Jerry Pournelle
Memory (none / 0) (#2)
by rusty on Wed Oct 03, 2001 at 02:44:06 PM EST

It had some flaky memory, but we found it and yanked it. I think memory was responsible for our earlier problems, but it's probably not that anymore.

____
Not the real rusty
[ Parent ]
Sounds like hardware to me (5.00 / 1) (#4)
by sigwinch on Wed Oct 03, 2001 at 04:29:33 PM EST

The boot failures sound like a hardware problem. The fact that it couldn't cooperate with certain memory tends to support that conclusion too, although memory issues could be coincidental.

I'd recommend running Memtest86 on it. Memtest86 runs as a bootable disk that takes over the computer, so there will be some downtime but well worth it if it finds a problem. The full set of tests takes hours, but my experience has been that a truly flaky data path will be detected before it finishes the first pass of the first test. My current practice is to throw Memtest86 on a new machine before I bother doing anything with it, and I routinely come across flaky mobos/RAM. At work recently it found slightly flaky memory in a pair of 1 GB machines before they went into production as an electromagnetics simulator; finding that one in production would have been hellish.

If you want to keep the machine live, you can try building large projects, e.g., the Linux kernel. The "-j" option of make lets it run multiple jobs at once, which can drastically increase system load, especially on multi-processor machines. OTOH, finding a problem on a live system stands a good chance of blowing away the filesystem. Memtest86 or scratch drives might end up taking less time than blowing away the production partitions.
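
A rough sketch of the repeated-build idea above, in Python. It is not from the original comment: the names `repeat_command` and `looks_flaky` are made up for illustration, and the premise (identical runs that end with different exit codes hint at unstable hardware) is a rule of thumb, since software causes are possible too.

```python
import subprocess
import sys


def repeat_command(cmd, runs):
    """Run the same command `runs` times and return the list of exit codes."""
    codes = []
    for _ in range(runs):
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        codes.append(result.returncode)
    return codes


def looks_flaky(codes):
    """Identical runs should agree; mixed exit codes are the warning sign."""
    return len(set(codes)) > 1


if __name__ == "__main__":
    # Pass the build command on the command line. Each run should start
    # from a clean tree, otherwise later runs have nothing left to build.
    if len(sys.argv) > 1:
        codes = repeat_command(sys.argv[1:], runs=5)
        print(codes, "flaky" if looks_flaky(codes) else "consistent")
    else:
        print("usage: python stress_build.py <command> [args...]")
```

A hypothetical invocation from a kernel source tree would be `python stress_build.py sh -c "make clean && make -j4"`, where the script name and the job count of 4 are arbitrary choices. The same warning applies as for any load test on a live box.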

If you do find flakiness, the first thing to do is turn off the machine (using a power strip to avoid the @#@#$!$#$# always-on power supply), and systematically unplug and reconnect every electrical connector. CPU, RAM, fans, backplane/daughtercard, SCSI card, power connectors, you name it. Reseat every single connector. I've seen this fix a lot of "funnies". In fact, you might try doing this preemptively as the downtime would be fairly small.

--
I don't want the world, I just want your half.
[ Parent ]

We did (none / 0) (#5)
by rusty on Wed Oct 03, 2001 at 05:20:15 PM EST

memtest86 found the flaky memory, and that's out. And Jason did the "unplug/reconnect everything" trick too, because when the machine arrived at VHosting it wouldn't boot at all.

It looks like now the most likely culprits are lack of SmartStart stuff, and/or flaky RAID controller.

____
Not the real rusty
[ Parent ]

Rats (none / 0) (#6)
by sigwinch on Wed Oct 03, 2001 at 08:11:18 PM EST

I was hoping it would be one of the usual suspects. Just had a thought: the booting problem could be the power supply. You could try loading it down with some spare hard drives and see if it boots less often.

--
I don't want the world, I just want your half.
[ Parent ]

the booting problem (none / 0) (#8)
by el_guapo on Thu Oct 04, 2001 at 11:36:47 AM EST

i can't remember, does it have redundant power supplies? i had problems booting when both power supplies were in it - try removing one next time. hell, try removing one and see if the lockup problem goes away. this is all very weird - here bubba and his siblings ran (and his siblings STILL run) rock solid. i'm really sorry he's causing so many problems :-(
mas cerveza, por favor mirrors, manifestos, etc.
[ Parent ]
There was me (none / 0) (#3)
by hulver on Wed Oct 03, 2001 at 04:06:50 PM EST

Thinking I'd crashed it by posting a bad diary entry. Oh well, have to type that one in again.

--
HuSi!
Prob. flaky RAID Controller (none / 0) (#7)
by RedhatV on Wed Oct 03, 2001 at 08:29:54 PM EST

I'd take a guess that it's a flaky RAID controller. SmartStart is handy for setting up Compaq servers (it installs a Diagnostics partition and lets you set up your RAID), but I wouldn't say that it affects the future operation of your server. SmartStart is the build process; it doesn't leave items running after setup, except maybe for the Compaq Agents, and those won't be installed if you chose the manual OS install option anyway. Is there anything in the server log (the hardware log, that is)? I know some of the Compaqs have integrated logs as part of the mobo.

ok, so i may have exhausted (none / 0) (#9)
by el_guapo on Thu Oct 04, 2001 at 04:59:06 PM EST

my free-hardware welcome BUT - I now have in my grubby little hands a dual p3 550 and a quad 833 xeon 512 cache.
mas cerveza, por favor mirrors, manifestos, etc.
What bootloader do you use? (none / 0) (#10)
by JML on Fri Oct 05, 2001 at 01:48:51 AM EST

I had a problem similar to this once. A machine (an 8-way Dell with 8GB) was very flaky, but its twin was rock solid. It turns out that on the flaky machine I had set up grub as the bootloader, because grub is nicer than lilo.

Anyway, grub was passing a mem= argument to the kernel on boot, which was causing the machine to panic or lock up when it used all available RAM, because it was overwriting APIC structures or something technical like that.

The solution was to pass grub the option --no-mem-option on grub's kernel line, and let Linux figure out what to do with the memory.
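
A quick way to check whether a running box was booted with a mem= argument is to look at /proc/cmdline, which holds the command line the Linux kernel was started with. The sketch below is only an illustration: `find_mem_args` is a made-up helper name, and it just reports what it finds without judging whether the value is wrong.

```python
import os


def find_mem_args(cmdline):
    """Return the values of any mem= arguments on a kernel command line."""
    return [
        token.split("=", 1)[1]
        for token in cmdline.split()
        if token.startswith("mem=")
    ]


if __name__ == "__main__":
    path = "/proc/cmdline"
    # Only meaningful on Linux; skip quietly anywhere else.
    if os.path.exists(path):
        with open(path) as handle:
            found = find_mem_args(handle.read())
        print("mem= arguments:", ", ".join(found) if found else "none")
```

If nothing is listed, the bootloader is not overriding the kernel's own memory detection.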

This doesn't seem to relate to your booting trouble, but it was the cause of my problem, so maybe it is related to yours...

You're all going to hate me but. (none / 0) (#11)
by Net_Fish on Mon Oct 08, 2001 at 12:20:43 AM EST

And no this is not a FreeBSD troll.

I was talking to a guy that works for Compaq in .au; he recently helped supply www.ausgamers.com with 3 Compaq dual 1GHz boxes and all the doodads like remote insight cards and stuff. He was saying that one of the big problems with the system might be that it's missing all the Compaq kernel modules for controlling the RAID controller and getting info from all the onboard monitoring (the server would have told you the RAM was bad :) )

As much as ino is going to rip his eyes out, install a supported OS (RedHat 6.1 - 7.1) and then install all the kernel modules required, or hedge your bets and install the modules onto the Slackware install that is there now.

Power supply (none / 0) (#12)
by SnowBlind on Mon Oct 08, 2001 at 11:37:59 AM EST

"Lastly, it doesn't come back up every boot. It seems to come back every 3 or so boots, when it doesn't, it just gets a blank screen..."
is a classic power supply problem. The failed boots can be the power supply failing its internal test. Of course, it doesn't really tell you that it is failing.

There is but One Kernel, and root is His Prophet.
dual redundant power supplies (none / 0) (#13)
by hurstdog on Mon Oct 08, 2001 at 11:40:59 AM EST

So supposedly that isn't a problem...

[ Parent ]
spec was unclear. (none / 0) (#15)
by SnowBlind on Wed Oct 17, 2001 at 06:19:18 PM EST

The Compaq website spec said nothing about dual power supplies, so OK, maybe it is the power supply, maybe it isn't. It might even be that the sensor is reading bad power when the power is just fine.
We have a Sun with sequential serial numbers on the redundant power supplies. Makes me shiver. =)
Then again, it could be the electrical supply circuit not providing stable power.

There is but One Kernel, and root is His Prophet.
[ Parent ]
Linux 2.4.x and memory/VMM issues (5.00 / 1) (#14)
by bored on Mon Oct 08, 2001 at 02:21:17 PM EST

You didn't happen to upgrade the OS recently, or when you upgraded the box, did you? What you're describing sounds very similar to some of the VMM balancing/bug problems in various 2.4.x kernels. The machine's memory subsystem seems to degrade until the swap system is consuming 100% of the CPU, without any corresponding disk activity, and the machine eventually appears to have locked up. Before you start blaming the HW you should verify that your software configuration hasn't changed. If you can't verify the old configuration, try upgrading to the absolute latest 2.4 kernel, since the recent ones have had numerous VMM-related patches. Most of these bugs became more noticeable with large memory configurations; although I wouldn't really consider 2GB to be a lot of memory, it is probably enough to trigger some of the bugs.
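
One way to tell this kind of slow degradation apart from a sudden hardware freeze is to record memory and swap figures at intervals, so the last entries before a lockup are still on disk afterwards. The sketch below assumes /proc/meminfo has "Name: value kB" lines including SwapTotal and SwapFree; the helper names `parse_meminfo` and `swap_used_kb` are made up for illustration.

```python
import os
import time


def parse_meminfo(text):
    """Parse 'Name:  12345 kB' lines into a dict of kilobyte counts."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        # Skip headers and any line that is not a single kB figure.
        if sep and len(fields) == 2 and fields[0].isdigit() and fields[1] == "kB":
            info[key.strip()] = int(fields[0])
    return info


def swap_used_kb(info):
    """Swap in use, derived from the SwapTotal and SwapFree entries."""
    return info["SwapTotal"] - info["SwapFree"]


if __name__ == "__main__":
    path = "/proc/meminfo"
    # Take one timestamped snapshot; only meaningful on Linux.
    if os.path.exists(path):
        with open(path) as handle:
            info = parse_meminfo(handle.read())
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        print(stamp, "MemFree kB:", info.get("MemFree"),
              "swap used kB:", swap_used_kb(info))
```

Run from cron once a minute with the output appended to a log file, this leaves a trail: steadily shrinking free memory before a hang would point toward the VMM theory, while a flat line up to the moment of the crash would point back at hardware.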
