[opennms-discuss] CPU load calculated, not hard limit.

Discussion:

Rob Walker

2013-06-05 15:46:13 UTC

I have been wondering if it makes sense to have a CPU load limit
calculated based upon the number of CPUs instead of having it be a hard
limit.

Shipping with ONMS is a default setting for CPU load under netsnmp. The
calculation is (loadavg5 / 100.0), with a trigger value of 10.0 and a
re-arm value of 7.5. With the deployment of VMs and dynamic CPU counts,
I am questioning having this as a set value.
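
A minimal sketch of how a trigger/re-arm pair like this behaves, assuming loadavg5 is the 5-minute load average multiplied by 100 (which the division by 100 suggests). The function name `next_state`, the state names, and the inclusive comparisons are assumptions made for illustration; this is not OpenNMS code.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not OpenNMS code: decide the next alarm state from a
# raw loadavg5 sample using a trigger of 10.0 and a re-arm of 7.5.
# usage: next_state <current_state: armed|triggered> <raw_loadavg5>
next_state() {
    local state=$1 raw=$2
    awk -v state="$state" -v raw="$raw" 'BEGIN {
        load = raw / 100.0            # same scaling as (loadavg5 / 100.0)
        if (state == "armed" && load >= 10.0)
            state = "triggered"       # crossed the trigger value
        else if (state == "triggered" && load <= 7.5)
            state = "armed"           # dropped to the re-arm value
        print state
    }'
}

next_state armed 1250       # load 12.5: prints "triggered"
next_state triggered 900    # load 9.0: prints "triggered" (still above re-arm)
next_state triggered 700    # load 7.0: prints "armed"
```

Both limits are constants here, which is the point of the question: a 1-CPU machine and a 24-CPU machine are held to the same value of 10.0.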

Right now, I have a mix of machines, from 1 CPU, 4G RAM up to 24 CPU,
288G RAM. I currently am getting threshold alarms for 2 machines every
hour (database servers where the clients have hourly jobs), and one
machine multiple times a day (app server). Each of these three machines
has 16 CPUs, so I don't sweat it as long as the load stays below 32
(which it does).

My questions for the list are as follows:

1. Do any of you use a calculation for your CPU triggers? (Whether
ONMS, net-snmp config file generation or Nagios)

2. What calculation makes sense? I think that 4xCPU is nothing to
sweat on single-CPU machines, but I am pretty sure that I don't want to
sleep through a load of 96 on my largest boxes.

3. Can ONMS do non-linear thresholding? I would even be OK with a
"calculation" per "variable", which I could set by hand to be something
like this:

1. CPU < 5
(loadavg5 / 100) < 3x CPU count

2. CPU > 4
(loadavg5 / 100) < 2x CPU count

Thanks,
Rob
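
The two-tier rule proposed above can be sketched as a small step function. This is only an illustration of the arithmetic under the stated multipliers; `load_limit` and `over_limit` are hypothetical names, and nothing here is an OpenNMS feature.

```shell
#!/usr/bin/env bash
# Hypothetical helper: load limit as a step function of CPU count,
# 3x the count for fewer than 5 CPUs, 2x the count otherwise.
load_limit() {
    local cpus=$1
    if (( cpus < 5 )); then
        echo $(( cpus * 3 ))
    else
        echo $(( cpus * 2 ))
    fi
}

# usage: over_limit <raw_loadavg5> <cpus>
# Succeeds (exit status 0) when the load exceeds the limit.
over_limit() {
    local raw=$1 cpus=$2
    # Compare in hundredths so everything stays in integer arithmetic.
    (( raw > $(load_limit "$cpus") * 100 ))
}

for cpus in 1 4 16 24; do
    echo "$cpus CPUs -> limit $(load_limit "$cpus")"
done
```

The loop prints limits of 3, 12, 32 and 48. One side effect of the step is that a 4-CPU machine gets a limit of 12 while a 5-CPU machine gets 10, so the multipliers or the cut-over point may need adjusting.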

Mike Diehn

2013-06-05 17:32:28 UTC

I use a script to create a snmpd.local.conf on each node. The script
calculates numbers for a "load 1m 5m 15m" line to go in the file. It does
lots of other such stuff, but that's the one you'd want, I think. I'll
share if you care.

Then, in ONMS, I have defined events for the 1, 5 and 15 load *traps* that
come from the netsnmp agent. I've disabled the thresholder for cpu load,
so my only notifications come from the traps.

Only thing I miss is the auto-clear and resolution features I'd have with a
threshold.

I'd really like to work out a way for both the poller and the trap receiver
to contribute to a state model and have that model drive the alarms and
notifications so I can have my auto-clears and resolutions back. I've
heard tell that JBoss has a business rules thing (forgot the name) that
some folks use for this but I haven't carved out time to study it yet.

I wouldn't be surprised if there's a way to do it with the translator and
vacuumd, so maybe just learning more about ONMS will suffice. :-)

Best,
Mike

------------------------------------------------------------------------------
1. A cloud service to automate IT design, transition and operations
2. Dashboards that offer high-level views of enterprise services
3. A single system of record for all IT processes
http://p.sf.net/sfu/servicenow-d2d-j
_______________________________________________
http://www.opennms.org/index.php/Mailing_List_FAQ
opennms-discuss mailing list
To *unsubscribe* or change your subscription options, see the bottom of
https://lists.sourceforge.net/lists/listinfo/opennms-discuss

--
Mike Diehn
Development Operations
CD-adapco - Lebanon, NH
603 643 9993 x24129

Les Mikesell

2013-06-05 18:12:44 UTC

Mike Diehn

2013-06-05 20:24:29 UTC

Drools is already *in* onms?! I didn't *get* that last time you told me
this, Les!! OK, I've just moved this up my priority list....

Post by Les Mikesell
You are probably thinking of 'drools' which is in fact already
embedded in opennms.
http://www.opennms.org/wiki/Drools_Correlation_Engine
But, you have to set up your own state objects and tie them to events.
This sounds extremely useful but I haven't attempted to use it
myself. I'd love to see some real-world examples, though. I think
it would be great to be able to do things like notifications when some
percent of a load-balanced set fails.
--
Les Mikesell

Rob Walker

2013-06-05 18:25:44 UTC

Mike,

I thought about doing it on the snmpd.local.conf side, but I really like
how ONMS can do the auto-clear and how the resolution emails have Re: so
they show up threaded in my INBOX.

thanks,
Rob


Manuel Villarejo

2013-07-03 09:24:20 UTC

John Blake

2013-07-03 13:36:59 UTC

Since you are using snmpd to send threshold traps about cpu load, could you
use onms to poll it and look for "low" thresholds, but just make the
severity "clear"?
Say you get a trap for high load on a device. Say the 15m load has been at
90% util for 20 minutes.
Could you configure onms to look for the 15m load to be lower than 90%?
When it finds it, the threshold event is "clear" versus minor. Then use
vacuumd to correlate your initial alarm with the clear one.

Use the idea at your own risk as I have never tried it.
heh

John

From: Manuel Villarejo <***@gmail.com>
To: General OpenNMS Discussion
<opennms-***@lists.sourceforge.net>,
Date: 07/03/2013 05:25 AM
Subject: Re: [opennms-discuss] CPU load calculated, not hard limit.

Hello Mike,

I would really appreciate it if you could share your threshold and event
configuration. I am struggling at the moment with this 'kind of simple' issue.

Thanks


Mike Diehn

2013-07-03 14:02:49 UTC

Manuel,

I'm not using a threshold in OpenNMS to monitor the loads on my Linux
systems. I couldn't make it flexible enough.

Instead I configure the Net-SNMP agent on the monitored systems to send
traps when the load is "too high" for the machine. I tailor the "load 1m
5m 15m" line to each machine. Then, in OpenNMS, I have a simple event to
catch the trap net-snmp sends about those loads. It's kind of annoying,
though. I haven't figured out how to configure OpenNMS to tell me when the
loads are normal again.

Hmmm... as I was writing this up for you, something occurred to me. I
suppose I could categorize the systems as 4-core, 8-core, 12-core, 16-core,
9000000-core and such and then write a threshold for each category. That
would let OpenNMS take care of the alert and resolution automatically for
me. Hmmmm!!! I may try that!
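
The categorizing step could look something like the sketch below. It only shows how a core count might be mapped to a bucket name; the function name `core_category`, the bucket sizes and the "over-16-core" label are assumptions, and how each category is then tied to its own threshold in OpenNMS is not shown.

```shell
#!/usr/bin/env bash
# Hypothetical: map a core count to the smallest bucket that holds it.
# The bucket sizes are illustrative, taken from the examples in the message.
core_category() {
    local cores=$1 bucket
    for bucket in 4 8 12 16; do
        if (( cores <= bucket )); then
            echo "${bucket}-core"
            return
        fi
    done
    echo "over-16-core"
}

core_category 1     # prints "4-core"
core_category 6     # prints "8-core"
core_category 16    # prints "16-core"
core_category 24    # prints "over-16-core"
```

A fixed threshold per bucket is coarser than a per-machine calculation, but it keeps the number of threshold definitions small.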

Here's the part of my script that counts the CPUs in a Linux box and then
writes out the custom "load" line for my snmpd.local.conf file:

#--- 8< -------------------------------------------------------------------
cat <<EOM

#--------------------------------------------------------------------------
# load: Check for unreasonable load average values.
#       Watch the load average levels on the machine.
#
#    load [1MAX=12.0] [5MAX=12.0] [15MAX=12.0]
#
#    1MAX:   If the 1 minute load average is above this limit at query
#            time, the errorFlag will be set.
#    5MAX:   Similar, but for 5 min average.
#    15MAX:  Similar, but for 15 min average.
#
# The results are reported in the laTable section of the
# UCD-SNMP-MIB tree
#
# These values were calculated as 2, 1.25 and 0.8 times the
# number of cores on this computer.
#
EOM

# Bash arithmetic is integer-only, so we can't use floats and
# fractional results are dropped: 1.95 -> 1
# (on a 1-core machine max15 therefore comes out as 0).
#
cores=$(grep -c '^processor' /proc/cpuinfo)   # one "processor" line per logical CPU
max1=$(( cores * 2 ))             # 2 * cores
max5=$(( cores * 125 / 100 ))     # 1.25 * cores
max15=$(( cores * 8 / 10 ))       # 80% of cores

echo "load $max1 $max5 $max15"
#--- 8< -------------------------------------------------------------------
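
If the truncation mentioned in the script's comment matters, one possible variant is to let awk do the multiplication so that fractional limits survive. This is a sketch, not part of the original script: `load_line` is a hypothetical name, and it relies only on the `load` directive accepting decimal values, as the `12.0` defaults in the comment block indicate.

```shell
#!/usr/bin/env bash
# Hypothetical variant: compute the "load" line with awk so that fractional
# limits survive. Multipliers 2, 1.25 and 0.8 are the ones from the script.
# usage: load_line <core_count>
load_line() {
    local cores=$1
    awk -v c="$cores" 'BEGIN {
        printf "load %.2f %.2f %.2f\n", c * 2, c * 1.25, c * 0.8
    }'
}

load_line 1     # prints "load 2.00 1.25 0.80"
load_line 16    # prints "load 32.00 20.00 12.80"
```

With integer arithmetic the same two machines would get "load 2 1 0" and "load 32 20 12".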

And here is the event definition that catches the trap the Net-SNMP agent
sends. Maybe you'll want to change the company name in the UEI:

#--- 8< -------------------------------------------------------------------

<event>
  <mask>
    <maskelement>
      <mename>id</mename>
      <mevalue>.1.3.6.1.2.1.88.2</mevalue>
    </maskelement>
    <maskelement>
      <mename>generic</mename>
      <mevalue>6</mevalue>
    </maskelement>
    <maskelement>
      <mename>specific</mename>
      <mevalue>1</mevalue>
    </maskelement>
    <varbind>
      <vbnumber>1</vbnumber>
      <vbvalue>laTable</vbvalue>
    </varbind>
  </mask>
  <uei>uei.cd-adapco.com/standard/netsnmp/traps/laErrorAlertFired</uei>
  <event-label>NET-SNMP: laTable alert fired</event-label>
  <descr>&lt;p&gt;A Net-SNMP agent is reporting a high load average&lt;/p&gt;</descr>
  <logmsg dest="logndisplay">Net-SNMP loadAvg alert: %parm[#7]%</logmsg>
  <severity>Minor</severity>
  <alarm-data reduction-key="%uei%:%dpname%:%nodeid%:%parm[#6]%"
              alarm-type="3" auto-clean="false"/>
</event>

#--- 8< -------------------------------------------------------------------


Rob Walker

2013-07-03 15:27:26 UTC

Manuel,

I also took the approach that Mike did and modified the snmpd.local.conf
file on each machine. As much as I dislike having different config
files per machine, I can see the light at the end of the tunnel with
this approach. If I try to figure it out with ONMS, I just see "a maze
of twisty little passages, all alike", and have no idea how to move forward.

Now that I hear what Mike is talking about, I think that makes sense,
but that's also what I think we should be able to do within the
thresholding daemon, where we say "1 minute threshold = 10x core count".

Thanks,
Rob

Post by Mike Diehn
Manuel,
I'm not using a threshold in OpenSNMP to monitor the loads on my Linux
systems. I couldn't make it flexible enough.
Instead I configure the Net-SNMP agent on the monitored systems to send
traps when the load is "too high" for the machine. I tailor the "load
1m 5m 15m" line to each machine. Then, in OpenNMS, I have a simple
event to catch the trap net-snmp sends about those loads. It's kind of
annoying, though. I haven't figured out how to configure OpenNMS to
tell me when the loads are normal again.
Hmmm... as I was writing this up for you, something occurred to me. I
suppose I could categorize the systems as 4-core, 8-core, 12-core,
16-core, 9000000-core and such and then write a threshold for each
category. That would let OpenNMS take care of the alert and resolution
automatically for me. Hmmmm!!! I may try that!
Here's the part of my script that counts the CPUs in a Linux box and
#--- 8<
---------------------------------------------------------------------------
cat <<EOM
#--------------------------------------------------------------------------
# load: Check for unreasonable load average values.
# Watch the load average levels on the machine.
#
# load [1MAX=12.0] [5MAX=12.0] [15MAX=12.0]
#
# 1MAX: If the 1 minute load average is above this limit at query
# time, the errorFlag will be set.
# 5MAX: Similar, but for 5 min average.
# 15MAX: Similar, but for 15 min average.
#
# The results are reported in the laTable section of the
# UCD-SNMP-MIB tree
#
# These values were calculated as 2, 1.25 and 0.8 times the
# number of cores on this computer.
#
EOM
# With bash's arithmatical system, we can't use
# floats and fractional results are dropped: 1.95 -> 1
#
cores=$(fgrep processor /proc/cpuinfo | wc -l)
let max1=cores*2
let max5=cores*125/100 # 1.25 * cores
let max15=cores*8/10 # 80% of cores
echo load $max1 $max5 $max15
#--- 8<
---------------------------------------------------------------------------
And here is the event definition that catches the trap the Net-SNMP
#--- 8<
average</descr>
%parm[#7]%</logmsg>
<severity>Minor</severity>
<alarm-data reduction-key="%uei%:%dpname%:%nodeid%:%parm[#6]%"
alarm-type="3" auto-clean="false"/>
</event>
#--- 8<
---------------------------------------------------------------------------
Since you are using snmpd to send threshold traps about cpu load,
could you use onms to poll it and look for "low" thresholds but just
make the severity clear?
Say you get a trap for high load on a device. Say the 15m load has
been at 90% util for 20 minutes.
Could you configure onms to look for the 15m load to be lower than 90%?
When it finds it, the threshold event is "clear" versus minor. Then
use vacumd to correlate your initial alarm and then the clear one.
Use the idea at your own risk as I have never tried it.
heh
John
Inactive hide details for Manuel Villarejo ---07/03/2013 05:25:55
AM---Hello Mike, I really appreciate if you share your threshManuel
Villarejo ---07/03/2013 05:25:55 AM---Hello Mike, I really
appreciate if you share your threshold and events configuration, I
To: General OpenNMS Discussion
Date: 07/03/2013 05:25 AM
Subject: Re: [opennms-discuss] CPU load calculated, not hard limit.
------------------------------------------------------------------------
Hello Mike,
I really appreciate if you share your threshold and events
configuration, I am struggle at the moment with this 'kind of
simple' issue.
Thanks
On Wed, Jun 5, 2013 at 7:32 PM, Mike Diehn
I use a script to create a snmpd.local.conf on each node. The
script calculates numbers for a "load 1m 5m 15m" line to go in
the file. It does lots of other such stuff, but that's the one
you'd want, I think. I'll share if you care.
Then, in ONMS, I have defined events for the 1, 5 and 15 load
*traps* that come from the netsnmp agent. I've disabled the
thresholder for cpu load, so my only notifications come from the
traps.
Only thing I miss is the auto-clear and resolution features I'd
have with a threshold.
I'd really like to work out a way for both the poller and the
trap receiver to contribute to a state model and have that model
drive the alarms and notifications so I can have my auto-clears
and resolutions back. I've heard tell that JBoss has a business
rules thing (forgot the name) that some folks use for this but I
haven't carved out time to study it yet.
I wouldn't be surprised if there's a way to do it with the
translator and vacuumd, so maybe just learning more about ONMS
will suffice. :-)
Best,
Mike
On Wed, Jun 5, 2013 at 11:46 AM, Rob Walker wrote:
I have been wondering if it makes sense to have a CPU load limit
calculated based upon the number of CPUs instead of having it be a hard
limit.
Shipping with ONMS is a default setting for CPU load under netsnmp. The
calculation is (loadavg5 / 100.0), with a trigger value of 10.0 and a
re-arm value of 7.5. With the deployment of VMs and dynamic CPU counts,
I am questioning having this as a set value.
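To make the trigger and re-arm values concrete, here is a generic sketch of a high threshold with hysteresis. It is not OpenNMS's implementation: the class name is hypothetical, the raw sample is assumed to be the load average multiplied by 100 (as the division in the expression implies), and any requirement for several consecutive samples is ignored.

```python
class HighThreshold:
    """High threshold with a separate re-arm level (hysteresis)."""

    def __init__(self, trigger=10.0, rearm=7.5):
        self.trigger = trigger
        self.rearm = rearm
        self.armed = True  # an armed threshold is allowed to fire

    def update(self, raw_loadavg5):
        """Feed one raw sample; return 'triggered', 'rearmed' or None."""
        # Assumption: the raw value is load * 100, hence the division.
        value = raw_loadavg5 / 100.0
        if self.armed and value >= self.trigger:
            self.armed = False  # stay quiet until the load falls back
            return "triggered"
        if not self.armed and value <= self.rearm:
            self.armed = True
            return "rearmed"
        return None
```

With the defaults, a load that climbs to 12, dips to 9 and climbs again to 11 fires only once; it has to fall to 7.5 or below before the threshold can fire again.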
Right now, I have a mix of machines, from 1 CPU, 4G RAM up to 24 CPU,
288G RAM. I currently am getting threshold alarms for 2 machines every
hour (database servers where the clients have hourly jobs), and one
machine multiple times a day (app server). Each of these three machines
have 16 CPUs, so I don't sweat it as long as the load stays below 32
(which it does).
1. Do any of you use a calculation for your CPU triggers? (Whether
ONMS, net-snmp config file generation or Nagios)
2. What calculation makes sense? I think that 4xCPU is nothing to
sweat single CPU machines, but I am pretty sure that I don't want to
sleep through a load of 96 on my largest boxes.
3. Can ONMS do non linear thresholding? I would even be OK with
"calculation" per "variable", which I may set by hand to be something
like this.
1. CPU < 5
(loadavg5 / 100) < 3x CPU count
2. CPU > 4
(loadavg5 / 100) < 2x CPU count
Thanks,
Rob
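The two-tier rule Rob proposes above can be written out as a small sketch. The function names are hypothetical, the raw sample is again assumed to be the load average multiplied by 100, and nothing here says whether ONMS itself can evaluate such a rule.

```python
def load_limit(cpu_count):
    """Return the load ceiling for a host under the two-tier rule."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    # Fewer than 5 CPUs: 3x the CPU count; 5 or more: 2x the CPU count.
    multiplier = 3 if cpu_count < 5 else 2
    return multiplier * cpu_count


def over_limit(raw_loadavg5, cpu_count):
    """True when the scaled 5-minute load reaches the host's ceiling."""
    # Assumption: the raw value is load * 100, hence the division.
    return raw_loadavg5 / 100.0 >= load_limit(cpu_count)
```

For a 16-CPU machine this gives a ceiling of 32, which matches the figure Rob says he is comfortable with. Note that the rule is not monotonic at the tier boundary: a 4-CPU host gets a ceiling of 12 while a 5-CPU host gets 10.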
--
Mike Diehn
Development Operations
CD-adapco - Lebanon, NH
603 643 9993 x24129
--
*Manuel Villarejo*
--
Mike Diehn
Development Operations
CD-adapco - Lebanon, NH
603 643 9993 x24129
_______________________________________________
http://www.opennms.org/index.php/Mailing_List_FAQ
opennms-discuss mailing list
https://lists.sourceforge.net/lists/listinfo/opennms-discuss