13 March 2008

NDOutils on Solaris 10

Michael Prochaska was having trouble compiling NDOutils on Solaris 10. Since we have an interest in getting Opsview working on Solaris (the upcoming 2.12 release will add Solaris 10 as a supported platform), we offered to help. This is the result of his company, Bacher Systems, sponsoring our work.


15 January 2008

Monitoring Cisco Netflow Data

Netflow is a great feature of Cisco IOS that gives you a view into the traffic flowing over your Cisco network devices: what that traffic is, where it came from and where it is going.

We wanted to make good use of this information and so we started looking for a way for Opsview to monitor it.

With a little configuration of IOS and some open source magic we achieved just that. Now our Opsview servers are keeping tabs on the data moving across our Cisco devices.

So true to our open source way of life we published our setup as part of the Opsview documentation.

08 January 2008

NSCA's aggregate writing

In our continual task to try and speed up Opsview, we found a bug in NSCA's handling of aggregate writes when run in --single mode.

The specific failure scenario is this:


  1. NSCA and Nagios are told to start up
  2. A send_nsca request is received by NSCA before Nagios has created the nagios.cmd command pipe
  3. NSCA tries to open the command file for writing, but sees it is not there
  4. NSCA opens the alternate dump file instead

Now when Nagios does create the nagios.cmd file, NSCA uses that ... unless aggregate mode is on and daemon mode is --single. In this case, it continues to use the alternate dump file, thus Nagios doesn't see the results from the slaves.
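As a rough illustration of the behaviour we want (this is a Python sketch, not NSCA's C code, and the function name is hypothetical), the target file should be re-evaluated on each write rather than fixed at startup:

```python
import os


def pick_output_path(command_file, alternate_dump_file):
    """Return the path a passive result should be written to right now.

    Hypothetical helper: it checks for the command pipe on every call,
    so a dump file chosen before Nagios started is not used forever.
    """
    if os.path.exists(command_file):
        return command_file
    return alternate_dump_file
```

Calling this before every write means that once nagios.cmd appears, results go to Nagios again, whatever mode the daemon is running in.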

Here's the patch, which we've also added into our source for Opsview.

As we are very keen on good testing, we've managed to recreate the failing behaviour in a test script. You also need a test configuration file and a patch to the test framework. If you run this test, it will show the error and then after the patch is applied, the test should pass.

29 September 2007

Nagios Patch Day!

With Nagios 3 rapidly approaching and Opsview celebrating being a full open source project (GPL licensed, source code repository online, Sourceforge project), we think it is time to share some of our Nagios patches.

These are the latest patches you can find for Opsview within our code repository. Some are Opsview specific, but a lot can be incorporated into the core code - we'll say which is which. You can see all these on our SVN site (we've even tagged the current version so this will stay in our repository), but here's the lowdown:

Freshness checking, with separated file and tests

If there's a patch we definitely want to have applied to the core code, it will be this one. Not because of freshness checking per-se (though we'll explain why later), but because of the included libtap tests.

As much as we love Nagios, we're always a bit concerned that regressions may occur. We have complete faith in Ethan, but he's human and unintended effects may occur. In fact, we made one when we originally suggested this freshness patch, so with testing, future changes should hopefully not cause regressions.

It requires work to add in tests and to separate out files, and we intend to stick to our commitment to add new tests in. But the framework needs to be put in place to encourage other tests, otherwise the overhead for Altinity is too high.

Think of tests like this: the code is the generalised form; the test is under specific conditions. The key is to try and get more and more conditions to prove that things work as expected.

Have a look at the test - we think it is easy to see what is being tested here (to get it from our svn repository, you need to extract the tarball). And note how comprehensive it is - we think every case is considered. A change in logic anywhere will be immediately spotted.

We've refactored how the calculated freshness_threshold is arrived at so that we can run tests against it.

There's also an arbitrary 15 seconds added to the freshness threshold. We've made that a new variable called additional_freshness_latency in nagios.cfg, so you can tweak it without recompiling Nagios.
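To make the effect of the new variable concrete, here is a minimal Python sketch. It assumes the automatically calculated threshold is the check interval in seconds plus the last latency plus the extra allowance; that formula and every name except additional_freshness_latency are assumptions for illustration, not a copy of the patch.

```python
def freshness_threshold(check_interval, interval_length, latency,
                        additional_freshness_latency=15):
    """Seconds a result may age before it is considered stale.

    Assumed formula: interval in seconds + latency + a configurable
    allowance (previously a hard-coded 15 seconds).
    """
    return (check_interval * interval_length
            + latency
            + additional_freshness_latency)
```

With a 5 minute interval and 2 seconds of latency, the default allowance gives 317 seconds; setting the allowance to 0 gives 302.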

Could be applied to core. Please :)


More freshness tweaks

Another thing we found was that Nagios is very fast at reading 10,000 services (5 seconds), but slows down dramatically with NDOUtils integrated (2 minutes). It appears to be reading the configuration and then sending it to the broker modules. Since NDOUtils is synchronously updated, Nagios waits while mysql runs the necessary SQL. We've updated the freshness code by introducing a new variable called monitoring_start. This is when Nagios actually starts monitoring, as opposed to program_start which is the HUP time. This gives us a better idea of how long it takes Nagios to start up.

We've written a little plugin that returns performance data about the startup time.

Also, we've pushed the threshold forward a little bit more to include the max_host_check_spread/max_service_check_spread, which is important for new services.

We've updated the tests to reflect the changes. Patches on top of other patches get really hard to maintain, which is why we need the libtap tests integrated into the core code.

Could be applied to core.

Initial passive state as OK

This is one where we change the Nagios CGIs to show passive states as OK. We just like everything green.

We don't expect this in core.

Issue commands

This has been applied to Nagios 3.

Status link to Nagiosgraph

This helps with our integration to Nagiosgraph.

We don't expect this in core.

Passive checks do not check host

We've discussed this before.

We don't expect this in core.

Ignore certain retained data

We've mentioned this before on the nagios-devel mailing list. Ethan has made changes to Nagios 3 to support this behaviour.

Adding a time=X to the statusmap

With the AJAX goodies we have in Opsview, we found the statusmap wasn't updating correctly. It appears that some browsers try to use cached data for an XMLHttpRequest if the URL is the same. We've added this parameter to the URL so that it is always unique.
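The same cache-busting idea, sketched in Python rather than in the CGI or JavaScript code. The function name and the example URL are hypothetical; only the time parameter comes from the patch description.

```python
import time
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_time_param(url, now=None):
    """Return url with a time=<seconds> query parameter appended.

    Any existing time parameter is replaced, so the URL differs on
    every request and a cached response cannot be reused.
    """
    if now is None:
        now = int(time.time())
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != "time"]
    query.append(("time", str(now)))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

For example, add_time_param("/cgi-bin/statusmap.cgi?host=all", now=1190000000) returns "/cgi-bin/statusmap.cgi?host=all&time=1190000000".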

This is AJAX specific, so we don't expect this in core.

W3 validation: history.cgi

We're big on valid HTML. Partly this is because we wrap the CGI output and remove the use of framesets in Opsview. However, it means the HTML has to be valid. We found several problems in the validity of history.cgi and other CGIs below.

A great tool is HTML Validator, which runs as a Firefox plugin - this tells you if your HTML is valid.

Could be applied to core.

Escalation via notification levels

This is an extra field to the contact stanza where you can specify that they will receive notifications only after the Nth notification. This makes it an easy way of doing escalations.
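A minimal sketch of the rule in Python, assuming each contact carries a numeric level (the data layout and names here are hypothetical, not the patch's):

```python
def contacts_to_notify(contacts, notification_number):
    """Return the names of contacts whose level has been reached.

    contacts is a list of (name, level) pairs; a contact with level N
    only starts receiving notifications from the Nth one onwards.
    """
    return [name for name, level in contacts if notification_number >= level]
```

With contacts [("oncall", 1), ("teamlead", 3), ("manager", 5)], the first notification goes to oncall only, and the third goes to oncall and teamlead.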

We've spotted an issue where if no notifications are sent, the notification number doesn't get incremented. Maybe this is best as a different macro.

Could be applied to core, but requires a bit more thinking.

Documentation patches for validation

The use of markup caused problems for us, so we've fixed some of the docs.

Could be applied to core.

W3 validation: extinfo.c

We've fixed some validation errors with divs. Have you seen HTML Validator? :)

Could be applied to core.

Trust authentication

This patch stops the Author box from being altered by the logged in user. Ethan has applied something similar to Nagios 3.

Already in Nagios 3.

Slice services within hosts

This patch allows a contact in a contactgroup to only see a subset of services. Normally, a contact to a host sees all the services, but this allows the contact to only see the services specified.

This is possible by setting the contact to not have the host in the contactgroup, but then that stops the contact from taking action on that host.

This could be applied to core, but is a (relatively) major change to the use of contactgroups.

Extinfo icon links to service notes

We find that users click on the extinfo icon and then get a bit worried when nothing happens. We make it a clickable link. There are also a few validation fixes here too (should really be separated out).

Could be applied to core.

Object dump

This is a good one! As you know, we love tests. One thing we do with Opsview is make sure that the configuration being generated is the same after we've tinkered with the rules. We tried to find a good way of doing this - initially we thought about using Nagios::Object to read the config data and then do a diff to find the changes. However, this didn't take into account all the relationships.

What we really wanted was some expanded form of the config files.

It then hit us - Nagios already does this! It uses the object.cache file as an expanded version of the configuration objects for the CGIs to use. So we've patched the core nagios executable so -o will now output to stdout this cache file and then exit. It works great in our testing.
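Once you have two dumps as text (for example, captured before and after a change to the generation rules), comparing them is straightforward. This Python sketch uses the standard difflib module; the function name is hypothetical and how the dumps are captured is left out.

```python
import difflib


def config_diff(before_text, after_text):
    """Return unified-diff lines between two expanded config dumps.

    An empty list means the generated configuration is unchanged.
    """
    return list(difflib.unified_diff(
        before_text.splitlines(),
        after_text.splitlines(),
        fromfile="before",
        tofile="after",
        lineterm="",
    ))
```

A regression test can then simply assert that the returned list is empty.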

Could be applied to core.

Retain status file over a reload

In our quest to make Nagios more friendly, there's nothing worse than getting the dreaded "Nagios is not running" screen on your browser. This patch adds a new command line option -F, for fast-reload.

It does two things:


  • It doesn't delete the status file on a HUP signal. This gives the impression that Nagios is still running even though no new status information is being updated. We think this is acceptable - after all, CGIs are displaying the "latest" data, it just so happens that there is no update at this precise moment. The status file age doesn't change, so nagiostat will show that the data is getting older, but it removes that scary screen

  • We ignore the pre-flight check. As part of Opsview, we validate the config before we send a HUP signal, so this is redundant. Along with the long startup times for Nagios, we find this makes Nagios a lot more responsive for large scale systems

Could be applied to core, possibly as two different command line options.

Check command by time period

This is a nice feature which we've discussed before. We have customers asking to run a different command based on a timeperiod. The most obvious use is altering the thresholds for the load of a server - a server may run batch work overnight thus increasing its load.

Could be applied to core.

Using relative path names for config files

We run tests internally on new versions of Opsview, trying to prove that our generated config files do not change unexpectedly. One thing we hit was the use of full path names in nagios.cfg. This meant we either had to change the path on the fly or move directories around.

This patch allows the use of a relative path. The path is taken as relative to the directory that holds nagios.cfg. We find this works really well.
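The resolution rule can be sketched in Python like this (the function name and example paths are hypothetical; the patch itself is C and relies on dirname()):

```python
import os.path


def resolve_config_path(main_config_file, path):
    """Resolve path relative to the directory that holds nagios.cfg.

    Absolute paths are returned unchanged.
    """
    if os.path.isabs(path):
        return path
    base_dir = os.path.dirname(os.path.abspath(main_config_file))
    return os.path.normpath(os.path.join(base_dir, path))
```

So with a main file at /usr/local/nagios/etc/nagios.cfg, the entry objects/hosts.cfg resolves to /usr/local/nagios/etc/objects/hosts.cfg, and moving the whole directory needs no edits.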

There is a dependency on dirname(), which will probably have to be changed to a cross platform implementation.

Could be applied to core.

Making forcecheck option

By default, force check is on when you Reschedule an active check. In a distributed environment where you have a "set to stale" script as the active check, this is not wanted. We change it so that the form enables only if the field is passed through.

We then alter some of the links so that the field is off by default based on whether the service is actively checked.

Could be applied to core.

Add hosts to hostgroup in same order

We make the members field in the contactgroups stanza optional. What this means is that we can add the members of a contactgroup via the contact instead. This turns out to be significantly faster in our configuration generation scripts. Thus we also remove the error in the nagios configuration about the stanza information.

We also add the contacts into the list in the same order as they are processed. When it was added in reverse order, our tests were failing because the order was not preserved.

Could be applied to core.

Handle initial state

In NDO, if a service starts up in an error state, a state change is recorded. However, if a service starts up in OK, a state change is not. This patch will cause a state change to occur.

Technically it is a state change from PENDING to OK, so it should be recorded. This helps us in the NDO nagios_statehistory table, which we'll discuss more in a future blog post.

Could be applied to core.

Validation error in statusmap cgi

An incorrectly placed </form> caused problems with our AJAX screens. This fixes it. Did we mention HTML Validator?

Could be applied to core.

Latency values for passive checks

While working on freshness checking, we discovered that the latency values were incorrect. In fact, looking in the NDO db told us this. This fixes the calculation.

Could be applied to core.

Do not resend retained status to NDO

On startup, Nagios writes all the current host/service status to NDO. However, the database already knows this. This causes problems on large scale systems.

A side effect is that if NDO is switched on after Nagios is running for a long time, each object needs to have a new status result before NDO sees it, but this is probably acceptable.

Another impact is that other future broker modules might want the retained status information, so maybe this is best implemented at the broker level, but we couldn't see an easy way of passing only this particular case.

This also has an impact to NDO, so there's a patch required there.

Could be applied to core.


Segfault when processing no output


We had a big problem with a customer's system where it was crashing occasionally. We had to analyse coredumps and eventually found the problem: on the master server, if the plugin output is only "|" for a passive host check, then sometimes a segfault would occur.

We think this is related to parsing the plugin output, but only if passive checks are processed with a backtrace from check_host.

Anyway, we've fixed it by changing the algorithm for parsing the plugin output. Our guess is that strtok is causing the problem, but we really don't understand why. Sigh.
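To show the edge case, here is a Python sketch of splitting plugin output at the first "|" in a way that tolerates empty halves. It is an illustration only, not the algorithm in the patch, and the function name is hypothetical.

```python
def split_plugin_output(raw):
    """Split raw plugin output into (text, perfdata) at the first '|'.

    Both parts may be empty, so an output of just "|" yields two empty
    strings rather than a missing value.
    """
    text, _, perfdata = raw.partition("|")
    return text.strip(), perfdata.strip()
```

The point is that every input, including "|" on its own, produces two well-defined strings for the rest of the check processing to use.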

With this patch, our customer's Nagios has not crashed for a month - so we're safe!

Could be applied to core.


Returning passive latency values in nagiostats


With the fix to the passive latency values, we then want to find out what the values are for passive latency over a long period of time.

This patch updates nagiostats.

Could be applied to core.


Is that all?


Yes, for now! We've made lots of changes to Nagios over the last 12 months, which we think are suitable for core. Sorry for not publishing them sooner.

If you want to have an Altinity compiled version of Nagios, just do this:


cd /tmp
svn export http://svn.opsview.org/opsview/trunk/opsview-base
cd opsview-base
make nagios

This will patch Nagios and run ./configure with our usual settings (there are some dependencies (autoconf, automake) required, but we'll leave that for you to work out!). You'll get the exact version of Nagios that we use in Opsview - in fact, you'll get them before our customers get them!

We'll do a similar Patch Day for NDO soon and talk about some of the performance tuning we've been doing for our large customers.

Enjoy!

18 July 2007

SMS alerting via AQL

We came across AQL by accident. They came to us because they were interested in Opsview and we looked into what their company was about. They provide SMS messaging services: you buy a prepaid amount of credits and then you can send SMS text messages via their website in a variety of ways.

systempreferences.png

Our sales director thought it would be a good idea to integrate their service with Opsview. We agreed and thought there was a nice synergy about it.

So we've now integrated AQL's messaging through our UI. In the upcoming 2.8 release, there's a new screen: System Preferences. Here you can sign up at AQL and then enter the username and password. We even give you a little Check credits AJAX button for you to test your connectivity.

mobilenumber.png

Then on the Contacts screen, you enter your mobile phone number for SMS (with javascript validation so that it is in the correct format) and you can even send a test SMS to make sure this works correctly.

Finally, when Nagios is ready to alert, we send the notification via the SMS instead of email or RSS. Simple!

Actually, it was quite hard. We just like to make it look easy.

To communicate with AQL's servers, there are various methods: HTTP/HTTPS, XMPP, SOAP and a few others that made my eyes water. We just wanted a nicely encapsulated module to send a message.

And we found one on CPAN. SMS::AQL is a perl module written by David Precious. It works over HTTP and worked a treat. However, we initially worked with version 0.02 and the tests there were failing because it was trying to contact AQL's servers to do testing. This caused us some problems in our automated perl install.

So we set to work enhancing the module. First thing was to update the tests. Using Test::MockObject, we were able to reply to SMS::AQL's HTTP calls as if they were being returned from AQL's servers. This allowed some really intensive testing. Using Devel::Cover, we got 91% coverage in our testing! We found lots of inconsistencies in the API, which we fixed as well. Finally, we cleaned up the messages so there is a single lookup table now.

The guys at AQL have been very helpful in providing us with technical information. And David Precious has updated the perl module with our changes. And he's written a blog post too!

It's a symbiotic way of working - we didn't start from scratch working on an interface with AQL's systems, but we've managed to contribute back to existing code and move it up another level. Everybody wins! (Well, except for other monitoring system companies that want to be international conglomerates.) So now everyone can use the CPAN module to get SMS alerting.

But if you want a quick way of sending alerts, you can download our script here. This is the script that will be distributed with the 2.8 release soon. Just add that onto your server and put a command entry into Nagios like:

define command {
	command_name service-notify-by-sms
	command_line /usr/local/nagios/bin/submit_sms_aql -u aql_username -p aql_password -n $CONTACTPAGER$ -t "$SERVICEDESC$ on $HOSTNAME$ is $SERVICESTATE$: $SERVICEOUTPUT$ ($SHORTDATETIME$)"
}

To be honest, I can't remember all the associations with the contact definitions - check out the Nagios documentation to set it up. I just use Opsview because it makes Nagios easier to administer. And now, Opsview makes SMS alerting easier too.

21 June 2007

Tweaking the freshness checking algorithm

With Opsview, one of the big features is the simple distributed monitoring - you just select a drop down to associate a host with a slave server and then when you hit the Opsview reload button, all the Nagios configurations are generated as you'd expect (slaves monitoring, master with freshness checking, automatic distribution to slaves, synchronized reloading). It works amazingly well.

But one of the niggly issues we have is that some services go stale before we think they should. So we've been tweaking some of the algorithms for setting the freshness_threshold.

One situation we found was that when the master was being restarted, a busy master can lose some slave results during the reload (due to the infamous command pipe being full limitation). So when the master comes back, it could lose one polling cycle's result from the slave and mark the service as stale before the slave has had a chance to send the next result.

So we patched it, by adding the freshness_threshold to the program_start time instead of the service's last check time. And we sent an email to the nagios-devel mailing list to let everyone know. This was accepted into Nagios 2.1. And we got fewer stale results - hooray!

Roll forward a year. Michelle Craft then discovered that this patch caused a problem - if you set a passive service to have a freshness_threshold of 1 day, but you restart Nagios every day, then the service never expires its freshness threshold. That's a bad bug, and I'm quite ashamed that slipped through.
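The bug is easy to reproduce on paper. This Python sketch simulates a passive service that never reports, with a one day threshold and a restart every midnight; the function names and the simplified staleness rule are assumptions for illustration, not the Nagios code.

```python
DAY = 24 * 60 * 60


def is_stale(now, last_check, program_start, threshold, from_program_start):
    """Simplified rule: stale once now passes base time + threshold."""
    base = program_start if from_program_start else last_check
    return now > base + threshold


def stale_by_day(from_program_start, days=5):
    """Check staleness at midday on each day, restarting every midnight."""
    last_check = 0  # the service reported once, at time zero
    results = []
    for day in range(1, days + 1):
        program_start = day * DAY          # restarted at midnight
        now = program_start + 12 * 60 * 60  # checked at midday
        results.append(
            is_stale(now, last_check, program_start, DAY, from_program_start))
    return results
```

Measuring from program_start, the service is never flagged, because each restart pushes the expiry another day ahead. Measuring from the last check, it is flagged on every one of those days.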

Fortunately, we had a solution. Ethan wrote a patch very quickly for Nagios 3, but we wanted something a bit more robust.

At Altinity, we're big fans of testing. This is not because we like to test - heck, we hate testing as much as the next developer. But we hate regressions and unintended consequences more. With the Nagios Plugins, there's a really large set of tests that get run for every nightly build, with a nice web page that displays the state. One of the tools that makes it happen is libtap, a library written by Nik Clayton. This is a way of testing C code with output in a perl test format. Apparently, a lot of FreeBSD tests are being written in libtap to prove there are no regressions.

There are some instructions on the Nagios Plugins site for installing libtap on your development servers.

So we've fixed this problem now by moving the freshness calculation algorithm into a separate file and then writing a small C program with dummy services and hosts to test that the right thresholds are being returned. The benefits were immediate - I found I had put a wrong bracket around an if statement when one of the tests failed.

The patch, which consists of a patch file, a new freshness.c file and a tarball for the new test directory, applies cleanly onto Nagios 2.9. You need to run autoconf afterwards. ./configure will detect the existence of libtap and compile the test executables. Then when you run make test, it should execute the test and make sure it works properly (you may need to export LD_LIBRARY_PATH=/usr/local/lib to get the libtap library detected properly at runtime).

Tests are hard to do, but worthwhile in the long run. I see it as making sure things still continue to work the same way you expect. And that has to be a good thing.

Hopefully this can be the start of some automated testing for our favourite open source monitoring system!

27 April 2007

Changing a service check command depending on time of day

We have been asked by a customer if it is possible to change a check command for a service depending on the time of day.

Why would this be useful?

Well, if a server runs time critical processes during the day and slow running batch processes over night, how can a service check command take into account how it is supposed to report on CPU or memory usage without generating false alerts? Yes, you could write your own plugins to take account of the time and react accordingly for each check this needs to be for, but these would have to be installed on each host for each service, the wealth of plugins from http://www.nagiosexchange.org/ cannot easily be used, setting the system up takes longer, and it is all much harder to maintain.

Instead, we have made changes to the service stanza within the Nagios configuration files to include a "check_timeperiod_command <timeperiod>,<command>" entry:

define service {
	host_name server1
	service_description Free Widgets
	check_command check_widget -w 40% -c 20%
	check_timeperiod_command nonworkhours,check_widget -w 5% -c 2%
	.....
}

You get the idea....

check_command provides the default check for the day. During the nonworkhours period, the alternative command and arguments are used instead.
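The selection logic amounts to something like the following Python sketch. It models a timeperiod as a single daily start and end time, which is a simplification of real Nagios timeperiods, and all names and hours are hypothetical.

```python
from datetime import time


def in_period(now, start, end):
    """True if now falls in [start, end), allowing wrap past midnight."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end


def pick_check_command(now, default_command, period, alternate_command):
    """Use the alternate command inside the period, else the default."""
    start, end = period
    if in_period(now, start, end):
        return alternate_command
    return default_command
```

With nonworkhours taken as 18:00 to 09:00, a check at 23:30 would use the relaxed thresholds and a check at 12:00 would use the default ones.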

This seems far too useful to the community to keep to ourselves, so we offer the patch for Nagios 2.8 here, for peer review and comments (all of which are very welcome).

And here is a patch for ndoutils 1.4b2 that goes with it.

Enjoy!

02 April 2007

Better mysqlclient detection for NDOUtils

We've encountered some problems with mysql detection in NDOUtils - it doesn't work on one of our redhat servers. The specific problem is that the ceil function is not found, which is because -lm is missing from the list of libraries to add at link time:


utils.o(.text+0x14e): In function `ndo_dbuf_strcat':
: undefined reference to `ceil'
collect2: ld returned 1 exit status

Rather than adding that library in manually (along with the -lz library that we found earlier for Mac OS X), we should use information from mysql_config to construct the compile flags. However, this is a bit tricky because of the various permutations.

Fortunately, the Nagios Plugins have a solution already. They have an m4 file, called np_mysqlclient.m4, that is used to detect mysql_config and returns data from mysql_config for configure to use.

So we've patched NDOUtils so that it uses this m4 file now. In order to use, you have to apply the patch to configure.in, add a new m4/ directory to the top level and copy np_mysqlclient.m4 into m4/. Then run:

aclocal -I m4
autoconf
./configure --with-mysql=DIR

The detection is the same as in the Nagios Plugins: ./configure will try to find mysql_config in DIR/bin/mysql_config, otherwise will look in the PATH.
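The search order described above looks roughly like this in Python. The real logic lives in the m4 macro; this sketch, including the function name, is only an illustration of the lookup.

```python
import os
import shutil


def find_mysql_config(mysql_dir=None):
    """Return the path to mysql_config, or None if it cannot be found.

    Looks in <mysql_dir>/bin first when a directory is given, then
    falls back to searching the PATH.
    """
    if mysql_dir:
        candidate = os.path.join(mysql_dir, "bin", "mysql_config")
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return shutil.which("mysql_config")
```

Once mysql_config is found, its own output can supply the compile and link flags, which is what avoids hand-maintaining lists like -lm and -lz.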

The nice thing is that if the logic for detection needs to be enhanced, we can update the m4 file and propagate the changes back to the Nagios Plugins as well. So everyone wins!

There's also a patch for CFLAGS in src/Makefile.in (which were getting overridden - presumably for testing), a small header change in config.h.in and some Makefile.in changes because make errors were getting lost by the cd .. command.

We've tested this on a Mac OS X server, a Debian Etch server, and 32bit and 64bit Redhat, and it is looking good.

Unfortunately, it means deprecating the --with-mysql-inc and --with-mysql-lib configure options. Hopefully, you'll see why this way is so much nicer.

Here's the patch against CVS HEAD.

Update: Here's the patch, reworked for NDOutils 1.4b3

Update: You can get the tarball with just this patch here

26 January 2007

The importance of being earnestly tested

We ran across a problem with NSCA 2.6 yesterday. It turned out that running the nsca daemon in single mode only worked for the first packet of data from send_nsca and hung on subsequent calls.

This was actually first discovered by Rudolf van der Leeden and it looks like it has been with us since April 2006 when NSCA 2.6 was first released, through to the current NSCA 2.7. We never picked it up until running it on a customer site which was tuned to use --single.

The fix is as Rudolf suggests - uncommenting the if statement that was removed. Our patch is here.

How do we know it works? Well, we've written a series of test scripts for NSCA.

We've always been big fans of testing. We love using the Test Anything Protocol (TAP) in Perl. CPAN encourages you to write good tests to make sure your Perl modules run, which is why we know that modules we've uploaded to CPAN continue to work while we've been updating them. And we've provided quite a few fixes to CPAN modules where the tests fail (and some just suggest that we have a broken version of perl).

Here are the test scripts for NSCA. They are more like functional tests - they test that the daemon can start up and accept messages, and compare the output in the dummy nagios.cmd file with the sent data. Unit testing is a bit more tricky to do for C code - though libtap is being used for the Nagios Plugins.

To use the test scripts, drop them into the top level of the NSCA directory after you've compiled NSCA and cd into nsca_tests. Run prove *.t. You will require several CPAN modules: Test::More, Class::Struct, Clone and Parallel::Forker, though most will come with your perl distribution.

There are 3 tests at the moment:


  • basic - just sends a few passive checks and makes sure that the nagios.cmd file receives them

  • multiple - runs the same as basic, but several times to check the daemon can handle multiple requests

  • simultaneous - runs lots of send_nscas at the same time (well, nearly). Uses Parallel::Forker to setup all the sends then executes them all at once. Expect about 200 extra processes to hit your server!

You'll find that the multiple and simultaneous tests fail with the stock NSCA 2.6 and 2.7. But when the patch is applied, all the tests pass.

The tests can obviously be extended, but this is a start and covers this basic functionality.

We hope Ethan will look into adding this to the NSCA distribution.

We're upset that something like this got to one of our customers, but we're more upset with ourselves for not catching this much earlier. This should be a good step towards better QA of future NSCA releases.

Update: Ethan has updated NSCA to 2.7.1 to fix this problem. And the tests are included as well!

23 January 2007

Helping NDOUtils to a final release

There has been a new update to NDOUtils to 1.4b2 recently and we thought we'd share our latest patches here so that they can be evaluated upstream.

We've always argued that it is best to be as close to the released code as possible - we don't want the expense of maintaining a fork, so it's in our interests to inform everyone about our changes. And since NDOUtils is gearing up for the 1.4 release, now is a good time to publicise them.

Of course, the link to our most stable code is updated daily, so the list below will not be accurate over time, but we've also uploaded the patches onto this blog server, so you can still reference them here. All patches will apply cleanly onto NDOUtils 1.4b2.

ndoutils_issue_commands.patch

This addresses the include header problem that arises because we've changed the data structure for a contact. Long term, it would be best if Nagios split the include files out of NDOUtils and installed them itself, but this is probably off Ethan's radar right now.

ndoutils_daemonize.patch

We found that the ndo2db process wasn't closing stdout, which meant the attaching terminal could not close. It looks like it should be set, but was commented out for debugging purposes. We uncomment those lines.

ndoutils_debug.patch

It looks like some memory debugging is switched on by default. We switch it off here.

ndoutils_memory.patch

And the ifdef doesn't actually switch the memory debugging off - we correct that too. (We're too lazy to combine these last two patches together!)

ndoutils_configure.patch


The configure script wasn't respecting the --with-mysql-inc option correctly. We also test for the compress lib, which gave us problems on Mac OS X.

ndoutils_notification_level.patch

This is required to support our use of simple escalations. Again, a separate location for the include files for Nagios would remove the need for this.

ndoutils_clear_tables_on_reload.patch

This is the biggie. If the configuration for Nagios has changed and a reload is requested, the ndo_objects table does not reflect the new configuration. We found that the ndo_objects table is only updated on a restart, not a reload. This caused problems for status views that use the database because the new hosts and services weren't there. This fixes that problem.

We also found that the active flag wasn't correctly set to inactive when the configuration was dumped. Once we fixed that, we found that hosts and services in ndo_objects were marked inactive, when they should be active. This has also been fixed, along with a SQL typo.

Update: we've discovered that the configdumpstart routine gets called twice - once with the original data, and once with retained data. Looking at the ndomod data stream, it looks like the configdumpstart is sent with a huge set of data, then another configdumpstart with more data. The patch above has been re-worked so that the table clearing only occurs once, before the original data is sent through. This does beg the question of what is the difference between the original and the retained data - if there was a table clear happening between the original and retained data yet all data was there, why send all the original data?

Also, we found a bug where configdumpend was not being called. It turned out to be a missing break in the case statement. This is included in the patch above too.


DEFAULT CHARSET mismatches


We also run a perl script when the NDOUtils distribution is unpacked. We strip out all the DEFAULT CHARSET=ascii statements in mysql.src. This is because if the server has a different charset, you can get some collation errors in mysql. We think it is better to remove these altogether and leave the charset to be set by the mysql database. The script is:


perl -pi -e 's/DEFAULT CHARSET=ascii //' db/mysql.src

ndoutils_upgradedb.pl

Upgrading database schemas is a terrible pain. NDOUtils includes scripts to update the database, but there's a manual step required to work out which scripts to apply. We've written a perl script (requires DBI.pm) to apply the upgrade scripts automatically, as long as the filename convention is adhered to. There's also a new table created, called nagios_database_version, which holds a single row with the version of the database schema for subsequent updates.
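To give a feel for the idea, here is a minimal sketch in Python (not our perl script) of how the pending upgrade scripts could be worked out from a stored schema version. The upgrade-<version>.sql naming convention and the function names are assumptions made up for this illustration; the real script follows the NDOUtils file naming and reads the version from nagios_database_version.

```python
import re

# Hypothetical naming convention for this sketch: upgrade-<version>.sql
UPGRADE_RE = re.compile(r"^upgrade-(\d+(?:\.\d+)*)\.sql$")


def parse_version(text):
    """Turn '1.10' into (1, 10) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def pending_upgrades(filenames, current_version):
    """Return the upgrade scripts newer than current_version, oldest first."""
    current = parse_version(current_version)
    found = []
    for name in filenames:
        match = UPGRADE_RE.match(name)
        if match and parse_version(match.group(1)) > current:
            found.append((parse_version(match.group(1)), name))
    return [name for _, name in sorted(found)]
```

Comparing versions as tuples of integers rather than strings matters here: as strings, "1.10" would sort before "1.3" and the scripts would be applied in the wrong order.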

The copyright for this script can be claimed by Ethan if he chooses to include it in the NDOUtils distribution. Otherwise, you are free to use it and distribute it yourselves under the GPL, but the copyright is retained by Altinity.

Hopefully, these patches will get included into the new NDOUtils soon, as we move forward to the next generation of Nagios status viewers.

Update 2: Ethan has applied these changes to CVS, except for the notification_level patch as that is a bit more involved.

15 January 2007

Lessons in .... SNMP trap handling, part 3

It has been some time since we last talked about SNMP trap handling, but there have been some major developments. Recall we use the perl module SNMP::Trapinfo to process an incoming trap. We think this works really well, but there was a major piece of functionality our customer wanted:


Complex calculation of whether a trap passes a test

And by complex, we mean complex. Here's an example trap:


dastardly.altinity.net
10.243.196.251
SNMPv2-MIB::sysUpTime.0 119:2:04:40.34
SNMPv2-MIB::snmpTrapOID.0 CERENT-454-MIB::remoteAlarmIndication
CERENT-454-MIB::cerent454NodeTime.0 20060814114937D
CERENT-454-MIB::cerent454AlarmState.9216.remoteAlarmIndication notAlarmedNonServiceAffecting
CERENT-454-MIB::cerent454AlarmObjectType.9216.remoteAlarmIndication ds1
CERENT-454-MIB::cerent454AlarmObjectIndex.9216.remoteAlarmIndication 9216
CERENT-454-MIB::cerent454AlarmSlotNumber.9216.remoteAlarmIndication 2
CERENT-454-MIB::cerent454AlarmPortNumber.9216.remoteAlarmIndication port36
CERENT-454-MIB::cerent454AlarmLineNumber.9216.remoteAlarmIndication 0
CERENT-454-MIB::cerent454AlarmObjectName.9216.remoteAlarmIndication DS1-2-36-7
SNMP-COMMUNITY-MIB::snmpTrapAddress.0 216.243.196.251

Our customer wanted to be able to say: "Give me a critical alert if cerent454AlarmState.9216.remoteAlarmIndication is not 'cleared' and the cerent454AlarmSlotNumber is greater than 5". Well, this was impossible with our previous setup. I still don't know why it is called Simple Network Management Protocol...

We sat down to think about this and then realised we probably needed a way of evaluating arbitrary conditions against an SNMP trap, but the last thing we wanted to do was write a syntax parser. That would involve a whole new language, all the parsing work involved, etc, etc. This would take months of work!

Looking for inspiration, we realised OpenNMS has claimed this type of functionality. We downloaded a copy and tried to install it, but hit loads of pre-requisites. We're very lazy - we should evaluate other technologies, but if it is too much of a pain to install, then we'll give up right away!

Undeterred, we went for the next best thing - their documentation! Searching around, we found the section on evaluating traps. It appears that OpenNMS have a table called events, which is a list of all the things that happened. Then there are various filters which evaluate against those events to work out whether something needs to be alerted on. SNMP traps are converted into this event format and dropped into that table.

(As an aside, Nagios holds no such processing logic. All that complicated processing is handled by the plugins. Nagios only cares about the result. This is a feature :) )

Then the beauty of OpenNMS' design dawned on us: rules are expressed as SQL statements.

Let me repeat that again: rules are just SQL statements. If the SQL evaluates to 1, then an alert is raised, otherwise ignored. Fantastic! This does away with all the "design your own syntax" work, with a clear, recognised language! No duplication of work!

So the above requirement could be met with a rule in OpenNMS (we think! We haven't actually tried this!) that says:

(cerent454AlarmState != 'cleared') & (cerent454AlarmSlotNumber > 5)

which equates to a SQL statement like:

SELECT ipaddr
FROM ipinterface
WHERE ipaddr in (SELECT ipaddr FROM ipinterface, node
WHERE cerent454AlarmState != 'cleared'
AND ipinterface.nodeid =node.nodeid)
AND ipaddr in (SELECT ipaddr FROM ipinterface, snmpInterface
WHERE cerent454AlarmSlotNumber > 5
AND ipinterface.ipaddr = snmpInterface.ipaddr);

But we couldn't do that with SNMP::Trapinfo - no SQL database. Tacking on DBI.pm support would be terrible. But then it hit us - why not use Perl? Most sysadmins know perl syntax and it would allow useful functionality like regular expressions, which are not as powerful in SQL.

How do we express the SNMP trap variables? Well, we already have that in SNMP::Trapinfo - macros. ${CERENT-454-MIB::cerent454AlarmState.9216.remoteAlarmIndication} evaluates as notAlarmedNonServiceAffecting in the example trap, but instead of making it a line to display, wrap it up in some perl code:

"${CERENT-454-MIB::cerent454AlarmState.9216.remoteAlarmIndication}" eq "cleared"

(These Cerent devices also make it difficult to find a specific variable because it encodes the object index number, 9216, into the oid name. Sigh - no one said SNMP had to be Simple or consistent. To overcome this, we introduced the idea of a wildcard for an OID tuple, so the above could be written as "${CERENT-454-MIB::cerent454AlarmState.*.remoteAlarmIndication}" eq "cleared". There are some issues if there are multiple OIDs which match this name, but we assume that only one matches...)
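As a rough illustration of the macro expansion and the wildcard, here is a small Python sketch. It is not the SNMP::Trapinfo implementation (which is perl); the function names are invented for this example, and it assumes a * stands for exactly one dot-separated component and that the first matching OID wins.

```python
import re

MACRO_RE = re.compile(r"\$\{([^}]+)\}")


def lookup(trap, name):
    """Find the value for a macro name; '*' matches one OID component."""
    if "*" not in name:
        return trap.get(name, "")
    pattern = re.compile(
        r"[^.]+".join(re.escape(part) for part in name.split("*"))
    )
    for oid, value in trap.items():
        if pattern.fullmatch(oid):
            return value  # assumption: only one OID matches, so take the first
    return ""  # a missing variable expands to nothing


def expand(trap, text):
    """Replace every ${name} in text with the matching trap value."""
    return MACRO_RE.sub(lambda m: lookup(trap, m.group(1)), text)
```

With the example trap held as a dictionary of OID names to values, expanding the wildcard form gives "notAlarmedNonServiceAffecting" eq "cleared", the same as using the full name with 9216 in it.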

There's a new method in SNMP::Trapinfo called eval. This evaluates the string as a snippet of perl code and gets the return code. There are three possible results that come back from the eval:


  • 1 = true - the perl snippet runs and evaluates true

  • 0 = false - the perl snippet evaluates as false

  • undef = error - the perl code did not run correctly (most likely a syntax error)

This last case is possible if the variable name does not exist. For instance, the expansion of '${CERENT-454-MIB::cerent454AlarmSlotNumber.*.remoteAlarmIndication} > 5' would convert to ' > 5' which is not valid perl code if the trap coming in did not contain the desired variable.

So our way of expressing the rule required is:


"${cerent454AlarmState.9216.remoteAlarmIndication}" ne "cleared" && ${cerent454AlarmSlotNumber.9216.remoteAlarmIndication} > 5

We have a basic wrapper script: if this code returns true, it sends a passive check result to Nagios.

One final thing: we have a front end application to configure the perl snippet of code. This is obviously tainted. We don't necessarily know what is contained in the code, so it could do things like "system('rm -fr $HOME')". We added on the Safe module, so now it is restricted to only running specific operators, like the comparison and regexps and mathematical functions. Good security lets us sleep at night :)
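The same two ideas (three possible results, and only a whitelist of operators allowed) can be sketched in Python. This is an analogy rather than what SNMP::Trapinfo does: it uses Python expression syntax and the standard ast module instead of perl and the Safe module, and the name safe_eval is made up for this example. It only allows comparisons and boolean logic on literals.

```python
import ast

# Only these syntax nodes may appear in a rule: literals, comparisons
# and boolean logic. Names, calls and attribute access are all rejected.
ALLOWED_NODES = (
    ast.Expression, ast.Constant,
    ast.BoolOp, ast.And, ast.Or,
    ast.UnaryOp, ast.Not, ast.USub,
    ast.Compare, ast.Eq, ast.NotEq, ast.Lt, ast.LtE, ast.Gt, ast.GtE,
)


def safe_eval(snippet):
    """Return True or False for a valid rule, None if it is rejected or fails."""
    try:
        tree = ast.parse(snippet, mode="eval")
        for node in ast.walk(tree):
            if not isinstance(node, ALLOWED_NODES):
                return None  # e.g. a function call such as system(...)
        result = eval(compile(tree, "<rule>", "eval"), {"__builtins__": {}}, {})
    except Exception:
        return None  # syntax error or runtime error in the rule
    return bool(result)
```

A rule that tries to call anything is rejected before it is ever run, and a rule left incomplete by a missing trap variable comes back as None rather than false, so the wrapper can tell "did not match" apart from "could not be evaluated".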

SNMP::Trapinfo is now released on CPAN. We use this for our SNMP trap processing and we think it works fantastically well. And this continues our aim of making the base portions of Opsview as solid as possible.

02 November 2006

Caching NSCA data from slaves to master

The problem

For one customer, we had a major scaling issue with distributed monitoring and NSCA. The initial setup was one master and 5 slaves using send_nsca to send passive service check results back to the master. This is the standard setup, with an ocsp_command along the lines of the submit_check_result script.

But we started to see some bad figures in the Nagios performance. The average Check Latency was showing 9.5 seconds, which seemed far too long. On the master, we could see 50+ nsca daemon processes, though they didn't appear to be doing anything.

The revelation

The revelation came when we looked on the slave. At any one time, there was only one send_nsca running! So even though the service checks were being run in parallel, it looked like ocsp_commands were being sent serially. This had to be our bottleneck.

The solution

So we wrote a script called send_nsca_cached to cache the passive check results. The idea is that the script will take the results as usual, but write to a cache file instead of running send_nsca. This cache file would hold a start time, so if the current result exceeded the start time + cache period, then send_nsca would be invoked and send all passive results at once.

We put the script on the slave and could see that the cache file would fill in spurts - 10 entries looked to be written within half a second, but then nothing for a few seconds. Nagios does some tricks to try and spread the service check load, but I wonder if the "traffic jam" of sending the uncached way was causing the services to be bunched up together.

When we checked again in an hour, the maximum Check Latency dropped to under 1 second and the master had only 9 nsca daemons. And I guess it is much better for network load as well to send a whole bunch of data at once, rather than a single message at a time.

The warnings

There had to be some bad points.
  1. This script is only for Nagios 2.0+ because of the use of environment variables
  2. We don't support passive host checks. Not sure if this is a good or bad thing
  3. Do not use this if your slave is not busy. As send_nsca_cached needs to be invoked in order to send results, if your ocsp_command is only invoked once every minute, then the quickest you will get a batch result sent to the master is every minute, regardless of your cache time. So only use this script on a busy slave. You could use a cache time of 0 to be the same as sending immediately
  4. Don't make the cache time too large. The results have no timestamps, so when Nagios on the master receives the results, it will process it as if the check happened just then. Also, if there is too much data being sent, you could fill the command pipe on the Nagios master
  5. On that point, make sure the master Nagios server has command_check_interval=-1 in nagios.cfg, so that the command pipe is read as quickly as possible. There are known limitations that if the pipe is filled, processes writing to the pipe will hang until more space is available

The future

That last point about the command pipe is being (partially) addressed in Nagios 3.0. Ethan said at the Nagios Conference in Germany that there will be a new external command called PROCESS_FILE. The idea is that nsca can drop a file containing passive check results onto the master and then put only one command into the pipe, which will process that entire batch.
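For reference, a small Python sketch of what that single command line looks like. It assumes the PROCESS_FILE;<file_name>;<delete> form documented for Nagios 3 external commands, so check the documentation for your version; the function name and the file path used below are hypothetical.

```python
import time


def process_file_command(results_path, delete_after=True, now=None):
    """Build the one line that would be written to the Nagios command pipe."""
    stamp = int(time.time() if now is None else now)
    delete_flag = 1 if delete_after else 0   # ask Nagios to remove the file afterwards
    return f"[{stamp}] PROCESS_FILE;{results_path};{delete_flag}\n"
```

The batch file itself would hold ordinary external command lines, one passive check result per line.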

The real solution to point (3) is to let the caching be done at Nagios, rather than externally, and that is also on the radar for Nagios 3.0. So there is lots to look forward to there. But if you want something now, check out our script. It's not a perfect script because it's hard coded in various places and you will need to customise the send_nsca command, but we hope it helps you regardless. Enjoy!

The end?

Not quite. At the Nagios Conference, Ethan was talking to two guys who were complaining that their distributed setup had huge slowdowns. I overheard and the symptoms looked exactly the same, so I gave them a copy of the script. Apparently it helped, but they had some lock ups in Nagios, which they attributed to our script - so caveat emptor. They have since reverted to using the standard uncached mechanism.

We haven't had any issues for our customers, so we're interested in what you find. If you have a distributed environment with similar symptoms and you are thinking of using this script, please take a note of your Check Latency and the number of nsca daemons and add a comment to this blog with some before and after statistics. We'd love to know if this works elsewhere. Good luck!

01 November 2006

LinuxWorld show in London

Altinity were exhibiting at LinuxWorld in London last week. It was our first show and went really well. There were lots of people there and we got to chat to a large number of them.

Some people said that the show was smaller than previous years, but we found it busy. The larger companies (Oracle, Novell, IBM, HP) seemed like they had smaller stands, but this gave more space for the smaller organisations, like us!

We had a lot of interest. We paid for an advert in the official programme and had a great position. Our stand consisted of two flat panel screens: one with a rotating slideshow of Opsview's main features, with a 2nd for demonstration purposes. This was James' idea and it worked really well.

We decided against a live internet connection, because it was prohibitively expensive, so we had to create an entire little network for monitoring. Our best demo was pulling a little cable out of the Mac Mini we had on the front and seeing an SNMP trap raised into Opsview.

There were demo failures, of course. Our main Ubuntu server sporadically hung whenever we shut it away. In the end, we left the server hanging out - it was probably an airflow or cable issue somewhere. Also, the power kept being cut off on our row of stands. This caused one of our mysql tables to be corrupt, which we only noticed when we tried to display a web page. Sigh.

The biggest draw seemed to be the fact that we plastered Nagios across our stand. We had sought permission from Ethan and he was happy to grant, so we put a couple of posters up and we were also giving away 5 copies of Wolfgang Barth's Nagios book. It was amazing the number of people that were walking past, stopped and said "We use Nagios. What's Opsview?". We had a great story to tell around here, so we were happy enough to explain what we did. This just proves the point I was making at NagConf in my talk - use the Nagios brand to enhance your own reputation by "playing nicely" in this space.

It was interesting to find out which organisations had Nagios installed. We met some people from local governments, media companies, hosting companies, consultancies and telecoms. Some had really mature Nagios setups (distributed environments, automated updates of config files), some were just starting out. There were lots in the middle who liked Nagios, got it working and haven't touched it since, because it just keeps working away. We didn't mind who we talked to - another company to add to the Nagios user base is fine by us.

I only got to meet Ben Clewett from the mailing list, though he doesn't hang around there much recently. Hari Sekhon says he was there, but wandered past, thinking we were "just another reseller". Damn, better do better next time...

18 October 2006

NRPE SSL connections during network problems

We had a customer problem where there were hundreds of NRPE processes on one of their monitored servers. It was quite bizarre because strace wouldn't attach to the process. Lsof said there was an established connection from the Nagios server, but when we looked on that server, netstat said there was no such connection! I've never seen anything like that before!

Well, the customer's network team said there was significant packet loss between the Nagios server and the NRPE box (obvious really, when all services to that host were complaining about SSL handshakes). Syslog showed lots of errors too:

Oct 17 10:41:46 host nrpe[2300]: Error: Could not complete SSL handshake. 5
Oct 17 10:42:26 host nrpe[2317]: Could not read request from client, bailing out...
Oct 17 10:42:26 host nrpe[2317]: INFO: SSL Socket Shutdown.

It looks to me like the SSL handshake was probably continuing to retry, but the connection must have been severed because it took too long. When our client tried to ssh onto the NRPE server, it was taking too long and he had to Ctrl-C to break out. We realised that NRPE should have some sort of timeout itself too.

So we've created a patch to NRPE 2.5.2. There is a new parameter in nrpe.cfg called connection_timeout. NRPE now sets an alarm just before handling a connection and then resets it before running the check command. It would have been best to have an alarm set over the entire session, but my_system sets an alarm handler too to make sure the command being executed does not exceed its timeout. This problem is probably SSL only, but the patch is over the connection regardless.
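The alarm pattern can be sketched in Python like this. It is only an illustration of the technique, not the NRPE patch (which is C): the context manager name is invented, it is Unix only, and it has to run in the main thread.

```python
import signal
import time
from contextlib import contextmanager


@contextmanager
def connection_timeout(seconds):
    """Raise TimeoutError if the enclosed block takes longer than `seconds`."""
    def on_alarm(signum, frame):
        raise TimeoutError("connection timed out")

    previous = signal.signal(signal.SIGALRM, on_alarm)
    signal.setitimer(signal.ITIMER_REAL, seconds)   # set the alarm
    try:
        yield
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)     # reset the alarm
        signal.signal(signal.SIGALRM, previous)     # hand the handler back
```

The handshake and request reading would go inside the with block, and the check command would run after it, so that the command's own alarm handler is free to do its job.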

Testing on our customer's servers, we found that check_nrpe returned CHECK_NRPE: Error - Could not complete SSL handshake as expected and the nrpe daemon then died gracefully when it exceeded the connection_timeout parameter.

It would be too ironic for a monitoring system to cause a box to die - although I hear BMC Patrol has that feature :)

28 September 2006

Netways' Nagios Conference in Nuremberg

Netways hosted the first ever Nagios conference in Nuremberg last week. Since I have a dual role of also being the project lead for Nagios Plugins, I was invited to give two talks:

I think they were well received - I even had feedback of "Besonders gefallen haben mir die beiden Vorträge von Ton Voon ;-)" from Lars Sörensen, which translates to "I especially liked the two talks by Ton Voon"!

It was also the first time I had met Ethan Galstad, so we had lots to chat about. It was great meeting his fiancée, Mary.

There were lots of names at the conference that I recognised from the mailing lists, so it was gratifying and humbling to meet so many of you. To all of you that said Hello, I say Vielen Dank.

All the Netways team really looked after me, but especially Julian, Peter, and Karolina (who was the "Hostess with the most-ess"). Thanks for making me feel so welcome.

I've been thinking about what a large contribution Netways make to the Nagios community, with their NagiosExchange, NagiosGrapher and, of course, this conference. Altinity compete with Netways on various levels, but when we get together to make the Nagios community better, it benefits everyone, so I applaud their participation.

By the way, I had two main questions that people asked me:


  • Is it really true you only met Ethan today? - yup! We never even spoke on the phone before!

  • Did you really control your presentation using your mobile phone? - yup. I'm a Keynote fan, and using Salling's remote control via Bluetooth on my Sony Ericsson K750i, I can control the presentation remotely. Some people thought I was going to take a call during my talk!

My personal photos can be found here.

Update: The slides for the presentations at the conference have been published by Netways. You have to select the talk you want here: http://www.netways.de/de/nagios_konferenz/archiv_2006/programm/ablauf/

14 September 2006

Simple escalations

We had a customer that was interested in having multiple levels of escalation, so, for example, the 3rd notification was received by the line manager but the 5th notification would get to the head of department. If it got to the 10th - the CTO would come screaming!

Since Nagios has support for escalations, we worked on getting this functionality exposed in Opsview. However, the definitions got very complicated.

The point of Opsview is to simplify the configuration, but the service escalations in Nagios required an administrator to define data in hosts, services and contacts. In the background, we had to define different contact groups for escalation people, and this meant we had twice as many contact groups as we originally had.

And if we wanted extra levels of escalations, well, the configuration would exponentially explode!

It quickly got out of hand.

When we thought about it a bit more, all we really wanted was another filter on a contact level. Something like:

define contact {
	contact_name	manager
	...
	notification_level	5
}

So the manager would only get notified from the 5th notification onwards.
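In other words, the extra filter is a single comparison. A Python sketch of the logic (the patch itself is C inside Nagios, and the function name here is made up):

```python
def passes_notification_level(notification_number, notification_level=1):
    """True if a contact with this notification_level should be notified."""
    # The default of 1 means existing contacts keep getting every notification.
    return notification_number >= notification_level
```

So a manager with notification_level 5 is skipped for notifications 1 to 4 and included from the 5th onwards.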

This could be done in a notification script using the NAGIOS_NOTIFICATIONNUMBER environment variable, but then notifications.cgi would show an email to the manager, even though they didn't actually get one. We believe this sort of logic should be in Nagios, not in an external script.

So here's our patch. It applies to Nagios 2.5, but you need to apply the can_submit_commands patch first because the change is in the same area of code. There's an associated ndoutils patch as well, if you use that.

To use, just add notification_level in the contact's profile and you're off! It defaults to 1 to keep current configurations happy.

Enjoy!

Warning: We've spotted a problem with this in our testing. If a service fails and no-one gets a notification because the contact filters fail, then the notification number reverts back to 1. So if a manager is set to get the 2nd notification, but there are no other contacts that get the 1st alert, the notification number will never get to 2. Thus no one will ever get an alert. This is serious. We plan on amending the preflight checks so that they fail unless each contact group has at least one contact with notifications enabled and a notification level of 1. In the meantime, if you use this feature, make sure an operations team has a notification level of 1. Or, you could set up RSS alerts for every failure.

We're not convinced this notification number should be decremented, but we'll have that discussion in the mailing lists.
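The planned preflight check could look something like this Python sketch. The data layout (a dictionary of contact groups, each holding a list of contact dictionaries) and the function name are assumptions for the example, not how Opsview stores its configuration.

```python
def groups_without_first_level_contact(contact_groups):
    """Return the contact groups that have nobody to receive the 1st notification.

    contact_groups maps a group name to a list of contacts, each a dict with
    optional 'notifications_enabled' (default True) and 'notification_level'
    (default 1) keys.
    """
    bad = []
    for group, contacts in contact_groups.items():
        has_first_level = any(
            contact.get("notifications_enabled", True)
            and contact.get("notification_level", 1) == 1
            for contact in contacts
        )
        if not has_first_level:
            bad.append(group)
    return sorted(bad)
```

A non-empty result would make the preflight check fail, because in those groups the notification number could never climb past 1.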

11 September 2006

Starting on ndoutils

We are starting to use Ndoutils, which is the first event broker for Nagios. The idea with the event broker modules is that the functionality of Nagios can be extended without the core code being changed. Ethan has released ndoutils which writes Nagios data to a mysql database.

We managed to get Ndoutils 1.3.1 to compile, but whenever Nagios started up, we kept getting SIGSEGV and Nagios would crash. Nagios.log would say:

[1158012772] Nagios 2.5 starting... (PID=21793)
[1158012772] LOG VERSION: 2.0
[1158012773] ndomod: NDOMOD 1.3.1 Copyright (c) 2005-2006 Ethan Galstad (nagios@nagios.org)
[1158012773] ndomod: Successfully connected to data sink.  0 queued items to flush.
[1158012773] Event broker module '/usr/local/nagios/bin/ndomod.o' initialized successfully.
[1158012773] Caught SIGSEGV, shutting down...

It took us a few hours and lots of debugging lines in ndoutils (printf statements + starting up Nagios manually without daemonizing) to work it out, but we eventually found that the Nagios 2 header files distributed with Ndoutils did not have our changes for can_submit_commands. The patch is here - obviously, only use this if you are using the can_submit_commands patch.

We've been speaking to Ethan because we think it is a good idea for Nagios to install the header files (maybe in /usr/local/nagios/include?), so then any local patches are done there, rather than trying to maintain multiple header files across different projects.

Hope this saves you a few hours!

31 July 2006

Annoying developer platform problem in Nagios Plugins

There's a problem in the current CVS snapshot for the Nagios Plugins which caused me to waste a good few hours. I'm posting this so that Google can cache it and find it for anyone else in future.

When running ./configure, there's an error:


./configure: line 38316: gl_CHECK_HEADER_locale_h: command not found

This turns out to be a problem with the snapshot generated on Sourceforge. If you run tools/setup, it comes back with these warnings:

configure.in:1685: warning: gl_CHECK_HEADER_locale_h is m4_require'd but is not m4_defun'd
configure.in:1685: gl_CHECK_HEADER_locale_h is required by...
m4/onceonly_2_57.m4:48: AC_CHECK_HEADERS_ONCE is expanded from...
m4/regex.m4:187: gl_PREREQ_REGEX is expanded from...
m4/regex.m4:177: gl_REGEX is expanded from...
configure.in:1685: gl_REGEX is required by...
m4/np_coreutils.m4:29: np_COREUTILS is expanded from...

So the gl_CHECK_HEADER_locale_h cannot be m4 expanded, and is left in the ./configure script as-is. Tracing this through, the locale.h is defined in regex.m4, one of the files from the coreutils project. Since I trust that project, I didn't think it could be a problem with their files and I was trying to upgrade autoconf and automake to no avail.

The problem was m4. If you have m4 at version 1.4.1 (or below, I presume), you'll get the above error. You need m4 at version 1.4.2 (or above).
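If you want to guard against this in a build script, the comparison is straightforward. A Python sketch, with a made-up function name, that assumes a purely numeric version string such as 1.4.2 (a version with a letter suffix would need extra handling):

```python
def m4_is_new_enough(version, minimum="1.4.2"):
    """Compare dotted version strings numerically, component by component."""
    def as_tuple(text):
        return tuple(int(part) for part in text.split("."))

    return as_tuple(version) >= as_tuple(minimum)
```

Comparing the components as integers avoids the classic mistake of treating 1.4.10 as older than 1.4.2.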

So the snapshots should now be working properly again. And I can get on with fixing the next thing on the list...

28 July 2006

A fixed scale on Nagiosgraph

One of the annoying things about the graphs generated by Nagiosgraph is that the y-axis is sometimes auto scaled and the max/avg/min/cur values can be scaled too.

We find it very confusing. Take a look at this:

nagiosgraph_autounits.png


This one lets RRD automatically scale the units. Note how the y-axis for the weekly graph says 600m, although the daily graph says 1.5. Note also how the Avg value in the daily graph is a different scale to the Max value.

Compare with this:

nagiosgraph_fixedunits.png


We think the latter is more obvious to read, especially comparing across different timeperiods.
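To show why the automatic units are confusing, here is a small Python sketch that mimics SI-prefix scaling next to a fixed scale. It is only an imitation for illustration, not RRD's code and not our patch, and the function names are made up.

```python
# SI prefixes from largest to smallest; "" means no prefix.
PREFIXES = [(1e9, "G"), (1e6, "M"), (1e3, "k"), (1, ""), (1e-3, "m"), (1e-6, "u")]


def autoscale(value):
    """Format a value with an SI prefix, the way an auto-scaled graph would."""
    if value == 0:
        return "0"
    for factor, prefix in PREFIXES:
        if abs(value) >= factor:
            return f"{value / factor:g}{prefix}"
    factor, prefix = PREFIXES[-1]
    return f"{value / factor:g}{prefix}"


def fixedscale(value):
    """Format a value in its base unit, with no prefix."""
    return f"{value:g}"
```

With automatic scaling, 0.6 is shown as 600m while 1.5 stays as 1.5, so two numbers of similar size look wildly different. With a fixed scale they read as 0.6 and 1.5.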

We've spoken to Soren Dossing and he's happy to accept this patch if it is an option. So here it is. This applies cleanly onto Nagiosgraph 0.8.2. To use it, just add fixedscale to the list of URL parameters.

Performance data is hard enough to interpret without having to apply translations in your head. This should help with understanding it a little bit better.

07 July 2006

Lessons in ... RSS

We had a request to integrate RSS feeds into Opsview. People were getting fed up with their amount of emails! Fair enough (I get hundreds of emails a day), but the idea was also that you could use a mobile phone to see the trail of alerts too.

We looked at the current offerings for RSS at Nagios Exchange, but weren't too keen on them. Ssugar's RSS notifications works by using a notification script for a single admin user. This writes a single RSS feed which then people can subscribe to. The main problem with this is that the feed is not personalised - if I was in the networking team, I don't want to know about the Oracle alerts.

Steve Shipway correctly identified this problem. His software, Nagios RSS, uses a slimmed down version of the Nagios status CGIs. Again, we weren't keen on this - it only works with Nagios 1.x, the CGIs are going to disappear by 3.x and there's a continuous poll on the Nagios server.

So we had to design our own. Our main requirement was to get personalised, authenticated feeds. As little performance overhead would be nice too!

We originally thought about some central store of alerts and then a CGI to extract just the alerts required, based on the authenticated user. But it would have been a nightmare to work out what each contact was allowed to see. The key was that Nagios already knows this information, so just let Nagios do it!

Turns out, the trick is to use notifications per contact - each contact that wants RSS feeds has to specify it. This then becomes a direct replacement for email! Superb!

define contact {
	contact_name	admin
	...
	service_notification_commands	service-notify-by-email,service-notify-by-rss
}

Even better - because this hooks into Nagios' notifications, re-notifications will work, as will acknowledgements and escalations.

One possible problem is that notifications only happen with HARD state changes, so you may not see a problem as quickly as you would from a web browser. However, you wouldn't get your email either.

We store each contact's RSS feeds in a separate file, just like their mail server. When a user comes in to read their feed, they only get their data. Perfect!

But, how do we get a user's authentication? Originally, we were thinking that a user goes to http://nagios.server.com/feeds/username to get their feed, but that wouldn't provide the security. As CGIs already have security, why not have a single point, but then read their RSS feed independently?

So now the URL is fixed for everyone: http://nagios.server.com/cgi-bin/rss.cgi. When authenticated, the CGI will read that contact's feed and return that information. There is a CGI invocation overhead, but I think it is a necessary one. The CGI only reads a single file, and does not try to work out the status of all services.

Because it is a single feed, we can use nice things like have the web browser show RSS icons. We amend our HTML headers to be:

<link title="Opsview feed" type="application/rss+xml" rel="alternate" href="/nagios/cgi-bin/rss.cgi" />

On Safari, this displays an RSS box which you can click.

safari bar.png

Job done!

We call this software RSS4NAGIOS and you can get it here.

We've tested on Firefox 1.5+, but we recommend using NetNewsWire for MacOSX.

By the way, we wanted to use Atom instead of RSS2, but we just couldn't get the XML::Atom::Syndication perl module to work nicely. This would be a good enhancement in future (we were thinking things like if a recovery happens, then the earlier failure should be marked as read - this would be impossible in email).

Let us know what you think!

[Update: RSS4NAGIOS 1.1 released to support host notifications]
