From: Bela Lubkin <belal@sco.com>
Subject: Re: 5.0.6 grinds to a complete halt
Date: Thu, 20 May 2004 08:03:40 GMT
Message-ID: <20040520080340.GS10272@sco.com>
References: <O1ymc.18442$urx.4433@news04.bloor.is.net.cable.rogers.com>
 <20040507002147.GG10272@sco.com>
 <mKgoc.43839$n7P1.28035@twister01.bloor.is.net.cable.rogers.com>
 <d3e78b1d.0405190813.3c671b7f@posting.google.com>

Barry Swane wrote:

> It appears I declared victory a little too early.
> Killing the amirdmon process did indeed have salutary effects on the
> performance. Customer stopped reporting noticeable slowness in system
> performance.
>
> > One disconcerting fact: Before killing the amirdmon, I ran the same job on
> > the new server, and on the 5 year old Acer Altos 9100 (also with RAID 5)
> > server that it replaced. It took 6 times as long on the new server!
> > After killing the amirdmon, I ran the job again-- now it only takes 4 times
> > as long as the old server. Clearly something else is still not correct.
>
> As noted above, file copy type jobs were still 3-4 times slower than
> the 5 year old server. However, the server did run for a full week,
> before back-sliding yesterday. Again, nothing going on that I can pin
> it on.
>
> > To answer your questions:
> > System is remote, so I'm not able to observe disk light.
> > I could in fact ping the system, while it was hung
> > Flipping screens on the console did work-- sort of.
> > i.e., user sees the login prompt, he can type his login,
> > and it echoes the characters that are typed.
> > But, then wait forever for password prompt-- never happens.
>
> I'm now inclined to theorize that Bela's suggestion is correct-- that
> the disk (RAID 5) has stopped responding completely. Would that be
> consistent with the reported behavior? i.e., if you are in a shell,
> you can type characters, and they echo, and you can do a carriage
> return-- but nothing is ever executed?

Perfectly consistent. OpenServer is very conservative about swapping;
it never pushes process pages out to swap unless it's out of memory. On
modern systems this generally means that swap is never touched. Thus,
any active process resides entirely in memory. Also, the kernel itself
is all hard-loaded in RAM -- none of it is pageable. If the disk
subsystem hangs, the kernel continues to function.
Each individual process continues to function until the first time it
tries to access the disk. For instance, the program that provides the
login prompt (`getty`, for console ttys) will continue to accept and
echo characters. If you hit return on a name, it goes to exec `login`,
which involves disk access, so you never get to the password prompt. If
you're sitting at a shell prompt, you can type; you can run internal
commands like "echo foo"; but any attempt to run a binary will hang.
(Even if the binary is fully cached, its access time needs to be updated
on disk.)

> Also, this seems major-league weird-- that the system can perform
> absolutely normally, all the time-- except once in a while it loses
> contact with the disk?

It isn't particularly weird. What you're describing is a fairly
standard set of symptoms for a variety of conditions including SCSI bus
timing, parity or signal integrity problems; internal errors in a disk
drive; and so on. You might rightly expect a RAID controller to be a
bit more thorough about error recovery, but apparently this particular
one -- in this particular failure case, whatever it is -- isn't.

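One way to catch such a hang in the act is a small watchdog that stays in memory and periodically times a tiny write to the suspect filesystem. The following is only a sketch, written in Python on the assumption that an interpreter is available on the machine; the names `probe_disk` and `watch`, the probe directory, and the timeout values are all illustrative and come from nowhere in this thread.

```python
import os
import tempfile
import threading
import time


def probe_disk(directory, timeout=10.0):
    """Time a small write+fsync in `directory`.

    Returns the elapsed seconds, or None if the write did not finish
    within `timeout` seconds (which is what a hung disk would look like).
    Ordinary errors such as a missing directory are re-raised.
    """
    result = {}

    def worker():
        start = time.monotonic()
        try:
            fd, path = tempfile.mkstemp(dir=directory)
            try:
                os.write(fd, b"probe\n")
                os.fsync(fd)  # force the data out to the disk subsystem
            finally:
                os.close(fd)
                os.unlink(path)
            result["elapsed"] = time.monotonic() - start
        except OSError as exc:
            result["error"] = exc

    # The disk access runs in a helper thread so the caller can give up
    # waiting; the helper itself may stay blocked if the disk is hung.
    t = threading.Thread(target=worker, daemon=True)
    t.start()
    t.join(timeout)
    if "error" in result:
        raise result["error"]
    return result.get("elapsed")


def watch(directory, interval=60.0, timeout=10.0, rounds=None):
    """Probe repeatedly and print one line per probe (forever if rounds is None)."""
    done = 0
    while rounds is None or done < rounds:
        elapsed = probe_disk(directory, timeout)
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        if elapsed is None:
            print(stamp, "probe did not finish within", timeout, "seconds", flush=True)
        else:
            print(stamp, "probe took %.3f seconds" % elapsed, flush=True)
        done += 1
        time.sleep(interval)
```

Something like `watch("/u/tmp")` (a hypothetical path) would have to print to a terminal or a remote session rather than to a log file, since a log file on the hung disk would block just like everything else. A thread stuck in an uninterruptible disk wait may also keep the watchdog from exiting cleanly, so treat its last line of output as the useful part.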
You also mischaracterize the situation here. It _isn't_ performing
absolutely normally. It's running 6 times slower than older and
presumably much slower machines. But I bet the two symptoms are
actually unrelated, and you have two separate problems to solve:
(1) complex application jobs run much more slowly than expected;
(2) the disk subsystem occasionally hangs.

> > Thanks for the tip on the debugger. I am optimistic that getting rid of the
> > amirdmon will avoid the hangup again-- if I'm wrong, I will post the results
> > you suggested.
>
> Some questions re the debugger- which I have now configured.
> If the disk has stopped-- am I likely to get anything back from the
> debugger?
> I assume this can only be run from the system console-- I can't do it
> remotely?
> I imagine that, in order to get info from the debugger, root must
> already be logged in, and sitting at # prompt?
> I am trying to experiment with the debugger in advance of the
> freeze-up, to try to get a little bit familiar with it:
> i) if I hold CTRL-ALT-D - it just logs me out, as if I had pressed
> CTRL-D
> ii) I can load scodb, from shell prompt
> If I enter "stack" command, I get
> When operating on /dev/mem, you cannot examine the stack of the
> current process. The "stack" command must be used with the "-p"
> argument.
> If I enter "stack -p", I get the same message
>
> Can someone point me to documentation on scodb? man scodb makes
> reference to the SCODB User's Guide. I thought I had a complete set
> of manuals- but I don't have that one.

These are good questions... I'll post a second reply as a separate
subthread, because I'm going to include some research results that are
worth archiving permanently under a sensible subject line.

>Bela<