Determining what is causing a high load reported by check_load plugin

Rick Mangus rick.mangus+nagios at gmail.com
Tue Dec 7 16:48:49 CET 2010


The kjournald threads handle journalling on ext3 filesystems.  Be glad you
didn't manage to kill them.

To find something that is running many many instances, try this: "ps -ax -o
cmd | sort | uniq -c | sort -n"

The output will be like so:
      3 [kjournald]
      3 [sh] <defunct>
      5 -bash
      7 crond

The column on the left is the number of processes with that command line.  I
occasionally have 10,000 instances of nsca that simply need to be killed.
Do let us know what you find!
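
If it does turn out to be a pile of runaways like that, something along
these lines usually clears them out (a rough sketch -- "nsca" is only my
example, substitute whatever command the pipeline above turns up):

    pgrep -c -x nsca     # how many are there?
    pkill -x nsca        # ask nicely first (SIGTERM)
    pkill -9 -x nsca     # only if they refuse to go away

Kernel threads such as [kjournald] show up in square brackets and cannot be
killed this way -- nor should they be.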

--Rick

On Tue, Dec 7, 2010 at 9:25 AM, Kaplan, Andrew H. <AHKAPLAN at partners.org> wrote:

>  Hi there --
>
> The output shown below shows the top processes on the server:
>
> 439 processes: 438 sleeping, 1 running, 0 zombie, 0 stopped
> CPU0 states: 19.0% user,  9.4% system,  0.0% nice, 71.0% idle
> CPU1 states: 20.1% user, 13.0% system,  0.0% nice, 66.3% idle
> CPU2 states: 27.1% user, 17.3% system,  0.0% nice, 55.0% idle
> Mem:  2064324K av, 2013820K used,   50504K free,       0K shrd,  487764K buff
> Swap: 2096472K av,   12436K used, 2084036K free                  976244K cached
>
>   PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME COMMAND
>  2398 root      15   0  1280 1280   824 R     1.9  0.0   0:00 top
>  5648 root      22   0  1196 1196  1104 S     1.3  0.0   0:00 ASMProServer
>     1 root      15   0   488  484   448 S     0.0  0.0   2:28 init
>     2 root      0K   0     0    0     0 SW    0.0  0.0   0:00 migration_CPU0
>     3 root      0K   0     0    0     0 SW    0.0  0.0   0:00 migration_CPU1
>     4 root      0K   0     0    0     0 SW    0.0  0.0   0:00 migration_CPU2
>     5 root      15   0     0    0     0 SW    0.0  0.0   0:03 keventd
>     6 root      34  19     0    0     0 SWN   0.0  0.0  17:52 ksoftirqd_CPU0
>     7 root      34  19     0    0     0 SWN   0.0  0.0  16:39 ksoftirqd_CPU1
>     8 root      34  19     0    0     0 SWN   0.0  0.0  17:33 ksoftirqd_CPU2
>     9 root      15   0     0    0     0 SW    0.0  0.0  28:22 kswapd
>    10 root      15   0     0    0     0 SW    0.0  0.0  42:39 bdflush
>    11 root      15   0     0    0     0 SW    0.0  0.0   3:08 kupdated
>    12 root      25   0     0    0     0 SW    0.0  0.0   0:00 mdrecoveryd
>    18 root      16   0     0    0     0 SW    0.0  0.0   0:00 scsi_eh_0
>    21 root      15   0     0    0     0 SW    0.0  0.0   4:38 kjournald
>   101 root      15   0     0    0     0 SW    0.0  0.0   0:00 khubd
>   265 root      15   0     0    0     0 SW    0.0  0.0   0:03 kjournald
>   266 root      15   0     0    0     0 SW    0.0  0.0   3:43 kjournald
>   267 root      15   0     0    0     0 SW    0.0  0.0   0:04 kjournald
>   268 root      15   0     0    0     0 SW    0.0  0.0   0:01 kjournald
>   269 root      15   0     0    0     0 SW    0.0  0.0   0:11 kjournald
>   270 root      15   0     0    0     0 SW    0.0  0.0   4:34 kjournald
>   271 root      15   0     0    0     0 SW    0.0  0.0   4:28 kjournald
>   272 root      15   0     0    0     0 SW    0.0  0.0   0:08 kjournald
>   273 root      15   0     0    0     0 SW    0.0  0.0   0:14 kjournald
>   274 root      15   0     0    0     0 SW    0.0  0.0   0:07 kjournald
>   275 root      15   0     0    0     0 SW    0.0  0.0   1:14 kjournald
>   805 root      15   0   588  576   532 S     0.0  0.0   1:39 syslogd
>   810 root      15   0   448  432   432 S     0.0  0.0   0:00 klogd
>   830 rpc       15   0   596  572   508 S     0.0  0.0   0:04 portmap
>   858 rpcuser   19   0   708  608   608 S     0.0  0.0   0:00 rpc.statd
>   970 root      15   0     0    0     0 SW    0.0  0.0   0:21 rpciod
>   971 root      15   0     0    0     0 SW    0.0  0.0   0:00 lockd
>   999 ntp       15   0  1812 1812  1732 S     0.0  0.0   5:04 ntpd
>  1022 root      15   0   772  720   632 S     0.0  0.0   0:00 ypbind
>  1024 root      15   0   772  720   632 S     0.0  0.0   1:16 ypbind
>
> What caught my eye was the number of processes along with the number of
> sleeping processes.
> I tried running the kill command on the kjournald instances, but that did
> not appear to stop them.
>
> Aside from rebooting the server, which can be done if necessary, what other
> approach can I try?
>
>
>
>
>  ------------------------------
> *From:* Daniel Wittenberg [mailto:daniel.wittenberg.r0ko at statefarm.com]
> *Sent:* Tuesday, December 07, 2010 9:11 AM
>
> *To:* Nagios Users List
> *Subject:* Re: [Nagios-users] Determining what is causing a high load
> reported by check_load plugin
>
>  So what are the first few processes listed in top?  That should be what
> is causing your load then.
>
>
>
> Dan
>
> *From:* Kaplan, Andrew H. [mailto:AHKAPLAN at PARTNERS.ORG]
> *Sent:* Tuesday, December 07, 2010 7:49 AM
> *To:* Nagios Users List
> *Subject:* Re: [Nagios-users] Determining what is causing a high load
> reported by check_load plugin
>
>
>
> Hi there --
>
>
>
> The load values that are displayed in top match those for the check_load
> plugin. This is the case whether the plugin is run automatically or
> interactively. The output of the uptime command is shown below:
>
>
>
> 8:48am  up 153 days, 23:21,  1 user,  load average: 73.36, 73.29, 73.21
>
>  ------------------------------
>
> *From:* Daniel Wittenberg [mailto:daniel.wittenberg.r0ko at statefarm.com]
> *Sent:* Monday, December 06, 2010 4:40 PM
> *To:* Nagios Users List
> *Subject:* Re: [Nagios-users] Determining what is causing a high load
> reported by check_load plugin
>
> In top, does it show the same load values?  The status of your memory
> shouldn’t cause the Nagios plugin to report a high load.  What does the
> uptime command say?  Try running the check_load script by hand on that host
> and verify it returns the same results.
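>
> For example (the plugin path and the warning thresholds below are only
> typical defaults, and the NRPE command name is assumed to be check_load;
> adjust all of them to match your install), first run it locally on the
> client:
>
>     /usr/local/nagios/libexec/check_load -w 20,15,10 -c 30,25,20
>
> and then run the same check through NRPE from the Nagios server:
>
>     /usr/local/nagios/libexec/check_nrpe -H <client-host> -c check_load
>
> Both should come back with the same load averages that uptime shows.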
>
>
> Dan
>
> *From:* Marc Powell [mailto:lists at xodus.org]
> *Sent:* Monday, December 06, 2010 3:26 PM
> *To:* Nagios Users List
> *Subject:* Re: [Nagios-users] Determining what is causing a high load
> reported by check_load plugin
>
> On Mon, Dec 6, 2010 at 1:50 PM, Kaplan, Andrew H. <AHKAPLAN at partners.org>
> wrote:
>
> Hi there --
>
> We are running a Nagios 3.1.2 server, and the client that is the subject of
> this e-mail is running version 2.6 of the NRPE client.
>
> The check_load plugin, version 1.4, is reporting the following 1, 5, and
> 15-minute load averages:
>
> load average: 71.00, 71.00, 70.95 CRITICAL
>
> The critical thresholds of the plugin have been set to 30, 25, 20.
>
> When I checked the client in question, the first thing I did was to run the
> top command. The results are shown below:
>
> CPU0 states:  0.0% user,  0.0% system,  0.0% nice, 100.0% idle
> CPU1 states:  0.0% user,  0.0% system,  0.0% nice, 100.0% idle
> CPU2 states:  1.0% user,  4.0% system,  0.0% nice, 93.0% idle
> Mem:  2064324K av, 2032308K used,   32016K free,       0K shrd,  509924K buff
> Swap: 2096472K av,   21432K used, 2075040K free                 1035592K cached
>
> The one thing that I noticed was that the amount of free memory was down to
> thirty-two megabytes. I wanted to know if that is what is causing the
> critical status, or if there is something else that I should investigate.
>
>
> Memory is not a factor in the load calculation, only the number of
> processes running, waiting to run, or (on Linux) sitting in uninterruptible
> sleep waiting on I/O. For at least 15 minutes you had approximately 71
> processes either running or ready to run and waiting on CPU resources.
> Running top/ps was the right thing to do, but you really need to do it
> while the problem is occurring to see what's actually using all the CPU.
> There are far too many possible reasons why load could be high to guess at
> from here, but it should be easy for someone familiar with your system to
> figure it out (at least generally) while it's in the act.
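>
> If you want a quick snapshot of who is contributing to the load at a given
> moment, one rough sketch (standard procps ps options; adjust if your ps
> differs) is to list the processes currently in the R (runnable) or D
> (uninterruptible sleep) states:
>
>     ps -eo stat,pid,user,cmd | awk '$1 ~ /^[RD]/'
>
> Run it a few times while the load is high; whatever keeps showing up is
> usually your culprit.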
>
> --
> Marc
>
>
>
> The information in this e-mail is intended only for the person to whom it
> is
> addressed. If you believe this e-mail was sent to you in error and the
> e-mail
> contains patient information, please contact the Partners Compliance
> HelpLine at
> http://www.partners.org/complianceline . If the e-mail was sent to you in
> error
> but does not contain patient information, please contact the sender and
> properly
> dispose of the e-mail.
>
>
>
_______________________________________________
Nagios-users mailing list
Nagios-users at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting any issue. 
::: Messages without supporting info will risk being sent to /dev/null


More information about the Users mailing list