With Nagios 3 rapidly approaching and Opsview celebrating being a full open source project (GPL licensed, source code repository online, Sourceforge project), we think it is time to share some of our Nagios patches.
These are the latest patches you can find for Opsview within our code repository. Some are Opsview specific, but a lot can be incorporated into the core code - we'll say which is which. You can see all these on our SVN site (we've even tagged the current version so this will stay in our repository), but here's the lowdown:
If there's one patch we definitely want applied to the core code, it's this one. Not because of freshness checking per se (though we'll explain why later), but because of the included libtap tests.
As much as we love Nagios, we're always a bit concerned that regressions may occur. We have complete faith in Ethan, but he's human and unintended effects may occur. In fact, we made one when we originally suggested this freshness patch, so with testing, future changes should hopefully not cause regressions.
It requires work to add in tests and to separate out files, and we intend to stick to our commitment to add new tests in. But the framework needs to be put in place to encourage other tests, otherwise the overhead for Altinity is too high.
Think of tests like this: the code is the generalised form; the test is under specific conditions. The key is to try and get more and more conditions to prove that things work as expected.
Have a look at the test - we think it is easy to see what is being tested here (to get it from our svn repository, you need to extract the tarball). And note how comprehensive it is - we think every case is considered. A change in logic anywhere will be immediately spotted.
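To give a flavour of "the test is the code under specific conditions", here is a minimal sketch in C. It is not the test from our repository and it does not use libtap itself: it prints TAP-style lines by hand so that it compiles on its own. The function is_stale() and its strict "older than the threshold" rule are assumptions made for illustration only.

```c
#include <stdio.h>
#include <time.h>

/* Hypothetical function under test: a result counts as stale once it is
 * strictly older than the threshold. Illustration only, not Nagios code. */
static int is_stale(time_t last_check, time_t now, int threshold_seconds)
{
    return (now - last_check) > threshold_seconds;
}

/* Print one TAP result line and hand back whether the test passed. */
static int tap_ok(FILE *out, int number, int passed, const char *name)
{
    fprintf(out, "%s %d - %s\n", passed ? "ok" : "not ok", number, name);
    return passed;
}

/* Run each specific condition; returns the number of failed tests. */
int run_staleness_tests(FILE *out)
{
    int failures = 0;
    int n = 0;

    fprintf(out, "1..3\n"); /* TAP plan: three tests follow */
    failures += !tap_ok(out, ++n, is_stale(1000, 1100, 60) == 1,
                        "result older than threshold is stale");
    failures += !tap_ok(out, ++n, is_stale(1000, 1030, 60) == 0,
                        "result newer than threshold is fresh");
    failures += !tap_ok(out, ++n, is_stale(1000, 1060, 60) == 0,
                        "result exactly at threshold is fresh");
    return failures;
}
```

Calling run_staleness_tests(stdout) from a small main() produces output that any TAP harness can read. Each extra line is one more condition that a future change has to keep satisfying.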
We've refactored how the calculated freshness_threshold is arrived at so that we can run tests against it.
There's also an arbitrary 15 seconds added to the freshness threshold. We've made that a new variable called additional_freshness_latency in nagios.cfg, so you can tweak it without recompiling Nagios.
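As a rough sketch of what a testable threshold calculation can look like once it is pulled out into its own function: the struct, the field names and the exact formula below are assumptions for illustration, not a copy of the patch. Only the name additional_freshness_latency comes from the change described above.

```c
/* Hypothetical inputs for one host or service (illustrative names). */
struct freshness_inputs {
    int configured_threshold; /* explicit freshness_threshold, 0 if unset */
    int use_retry_interval;   /* non-zero when the retry interval applies */
    double check_interval;    /* in units of interval_length */
    double retry_interval;    /* in units of interval_length */
    int interval_length;      /* seconds per interval unit */
    double latency;           /* last measured check latency in seconds */
};

/* Returns the freshness threshold in seconds. An explicit threshold wins;
 * otherwise it is derived from the interval, the latency and the
 * configurable extra allowance that replaces the hard-coded 15 seconds. */
int calculate_freshness_threshold(const struct freshness_inputs *in,
                                  int additional_freshness_latency)
{
    double interval;

    if (in->configured_threshold != 0)
        return in->configured_threshold;

    interval = in->use_retry_interval ? in->retry_interval
                                      : in->check_interval;
    return (int)(interval * in->interval_length + in->latency
                 + additional_freshness_latency);
}
```

Because the function depends only on its arguments, a test can feed it specific values without a running Nagios. In nagios.cfg the setting would then be a single line such as additional_freshness_latency=15.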
Could be applied to core. Please :)
Another thing we found was that Nagios is very fast at reading 10,000 services (5 seconds), but slows down dramatically with NDOUtils integrated (2 minutes). It appears to read the configuration and then send it to the broker modules. Since NDOUtils is updated synchronously, Nagios waits while MySQL runs the necessary SQL. We've updated the freshness code by introducing a new variable called monitoring_start. This is when Nagios actually starts monitoring, as opposed to program_start, which is the HUP time. This gives us a better idea of how long Nagios takes to start up.
We've written a little plugin that returns performance data about the startup time.
Also, we've pushed the threshold forward a little bit more to include the max_host_check_spread/max_service_check_spread, which is important for new services.
We've updated the tests to reflect the changes. Patches on top of other patches get really hard to maintain, which is why we need the libtap tests integrated into the core code.
Could be applied to core.
This is one where we change the Nagios CGIs to show passive states as OK. We just like everything green.
We don't expect this in core.
This has been applied to Nagios 3.
This helps with our integration to Nagiosgraph.
We don't expect this in core.
We've discussed this before.
We don't expect this in core.
We've mentioned this before on the nagios-devel mailing list. Ethan has made changes to Nagios 3 to support this behaviour.
With the AJAX goodies we have in Opsview, we found the statusmap wasn't updating correctly. It appears that some browsers try to use cached data for an XMLHttpRequest if the URL is the same. We've added this to the URL so that it is always unique.
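The general idea of making a URL unique can be sketched as below. This is an assumption about the technique, not the patch itself: the parameter name nocache and the helper append_unique_param() are hypothetical, and the value would typically be a timestamp or counter.

```c
#include <stdio.h>
#include <string.h>

/* Append a changing value to a URL so a browser cannot answer the request
 * from its cache. "nocache" is a made-up parameter name for this sketch.
 * Returns 0 on success, -1 if the result does not fit in out. */
int append_unique_param(const char *url, long stamp,
                        char *out, size_t out_size)
{
    /* Use '&' when the URL already has a query string, '?' otherwise. */
    char separator = strchr(url, '?') ? '&' : '?';
    int written = snprintf(out, out_size, "%s%cnocache=%ld",
                           url, separator, stamp);

    return (written < 0 || (size_t)written >= out_size) ? -1 : 0;
}
```

Passing the current time as stamp gives a different URL on every refresh. In a browser the same trick is usually done in the JavaScript that issues the request.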
This is AJAX specific, so we don't expect this in core.
We're big on valid HTML. Partly this is because we wrap the CGI output and remove the use of framesets in Opsview. However, it means the HTML has to be valid. We found several problems in the validity of history.cgi and other CGIs below.
A great tool is HTML Validator, which runs as a Firefox plugin - this tells you if your HTML is valid.
Could be applied to core.
This is an extra field to the contact stanza where you can specify that they will receive notifications only after the Nth notification. This makes it an easy way of doing escalations.
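The rule itself is small. A minimal sketch, assuming a hypothetical per-contact field called first_notification_wanted (the real field name in the patch is not given here) and treating 0 or 1 as "notify from the start":

```c
/* Decide whether a contact should receive the current notification.
 * first_notification_wanted is a hypothetical per-contact setting:
 * values of 0 or 1 mean the contact is notified from the first one. */
int contact_wants_notification(int first_notification_wanted,
                               int current_notification_number)
{
    if (first_notification_wanted <= 1)
        return 1;
    return current_notification_number >= first_notification_wanted;
}
```

A contact set to 3 stays quiet for the first two notifications and is included from the third onwards, which is what makes this behave like a simple escalation.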
We've spotted an issue where if no notifications are sent, the notification number doesn't get incremented. Maybe this is best as a different macro.
Could be applied to core, but requires a bit more thinking.
The use of markup caused problems for us, so we've fixed some of the docs.
Could be applied to core.
We've fixed some validation errors with divs. Have you seen HTML Validator? :)
Could be applied to core.
This patch stops the Author box from being altered by the logged in user. Ethan has applied something similar to Nagios 3.
Already in Nagios 3.
This patch allows a contact in a contactgroup to only see a subset of services. Normally, a contact to a host sees all the services, but this allows the contact to only see the services specified.
This is possible by setting the contact to not have the host in the contactgroup, but then that stops the contact from taking action on that host.
This could be applied to core, but is a (relatively) major change to the use of contactgroups.
We find that users click on the extinfo icon and then get a bit worried when nothing happens. We make it a clickable link. There are also a few validation fixes here too (should really be separated out).
Could be applied to core.
This is a good one! As you know, we love tests. One thing we do with Opsview is make sure that the configuration being generated is the same after we've tinkered with the rules. We tried to find a good way of doing this - initially we thought about using Nagios::Object to read the config data and then do a diff to find the changes. However, this didn't take into account all the relationships.
What we really wanted was some expanded form of the config files.
It then hit us - Nagios already does this! It uses the object.cache file as an expanded version of the configuration objects for the CGIs to use. So we've patched the core nagios executable so -o will now output to stdout this cache file and then exit. It works great in our testing.
Could be applied to core.
In our quest to make Nagios more friendly, there's nothing worse than getting the dreaded "Nagios is not running" screen on your browser. This patch adds a new command line option -F, for fast-reload.
It does two things:
- It doesn't delete the status file on a HUP signal. This gives the impression that Nagios is still running even though no new status information is being updated. We think this is acceptable - after all, CGIs are displaying the "latest" data, it just so happens that there is no update at this precise moment. The status file age doesn't change, so nagiostats will show that the data is getting older, but it removes that scary screen.
- We ignore the pre-flight check. As part of Opsview, we validate the config before we send a HUP signal, so this is redundant. Given the long startup times for Nagios, we find this makes Nagios a lot more responsive on large scale systems.
Could be applied to core, possibly as two different command line options.
This is a nice feature which we've discussed before. We have customers asking to run a different command based on a timeperiod. The most obvious use is altering the thresholds for the load of a server - a server may run batch work overnight thus increasing its load.
Could be applied to core.
We run tests internally on new versions of Opsview, trying to prove that our generated config files do not change unexpectedly. One thing we hit was the use of full path names in nagios.cfg. This meant we either had to change the path on the fly or move directories around.
This patch allows the use of a relative path. The path is taken as relative to the directory that holds nagios.cfg. We find this works really well.
There is a dependency on dirname(), which will probably have to be changed to a cross platform implementation.
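A minimal sketch of the resolution rule, assuming POSIX dirname() from libgen.h. The function name resolve_config_path() and the buffer size are hypothetical; the patch may well do this differently.

```c
#include <libgen.h>
#include <stdio.h>
#include <string.h>

#define CFG_PATH_BUFFER 4096 /* arbitrary limit chosen for this sketch */

/* Resolve path against the directory that holds the main config file.
 * Absolute paths are copied unchanged. Returns 0 on success, -1 if
 * anything does not fit in the buffers. */
int resolve_config_path(const char *main_cfg, const char *path,
                        char *out, size_t out_size)
{
    char cfg_copy[CFG_PATH_BUFFER];
    int written;

    if (path[0] == '/') {
        written = snprintf(out, out_size, "%s", path);
    } else {
        /* dirname() may modify its argument, so work on a copy. */
        if (strlen(main_cfg) >= sizeof(cfg_copy))
            return -1;
        strcpy(cfg_copy, main_cfg);
        written = snprintf(out, out_size, "%s/%s",
                           dirname(cfg_copy), path);
    }
    return (written < 0 || (size_t)written >= out_size) ? -1 : 0;
}
```

With /usr/local/nagios/etc/nagios.cfg as the main file, objects/hosts.cfg resolves to /usr/local/nagios/etc/objects/hosts.cfg. A cross platform version would need a replacement for dirname() on systems without libgen.h.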
Could be applied to core.
By default, force check is on when you Reschedule an active check. In a distributed environment where you have a "set to stale" script as the active check, this is not wanted. We change it so that the form enables it only if the field is passed through.
We then alter some of the links so that the field is off by default based on whether the service is actively checked.
Could be applied to core.
We make the members field in the contactgroups stanza optional. What this means is that we can add the members of a contactgroup via the contact instead. This turns out to be significantly faster in our configuration generation scripts. Thus we also remove the error in the nagios configuration about the stanza information.
We also add the contacts into the list in the same order as they are processed. When it was added in reverse order, our tests were failing because the order was not preserved.
Could be applied to core.
In NDO, if a service starts up in an error state, a state change is recorded. However, if a service starts up in OK, a state change is not. This patch will cause a state change to occur.
Technically it is a state change from PENDING to OK, so it should be recorded. This helps us in the NDO nagios_statehistory table, which we'll discuss more in a future blog post.
Could be applied to core.
An incorrectly placed </form> caused problems with our AJAX screens. This fixes it. Did we mention HTML Validator?
Could be applied to core.
While working on freshness checking, we discovered that the latency values were incorrect. In fact, looking in the NDO db told us this. This fixes the calculation.
Could be applied to core.
On startup, Nagios writes all the current host/service status to NDO. However, the database already knows this. This causes problems on large scale systems.
A side effect is that if NDO is switched on after Nagios has been running for a long time, each object needs a new status result before NDO sees it, but this is probably acceptable.
Another impact is that other future broker modules might want the retained status information, so maybe this is best implemented at the broker level, but we couldn't see an easy way of passing only this particular case.
This also has an impact on NDO, so there's a patch required there.
Could be applied to core.
We had a big problem with a customer's system where it was crashing occasionally. We had to analyse coredumps and eventually found the problem: on the master server, if the plugin output is only "|" for a passive host check, then sometimes a segfault would occur.
We think this is related to parsing the plugin output, but only when passive checks are processed; the backtrace pointed to check_host.
Anyway, we've fixed it by changing the algorithm for parsing the plugin output. Our guess is that strtok is causing the problem, but we really don't understand why. Sigh.
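For illustration, here is one way to split plugin output from performance data without strtok. It is a sketch under our own assumptions, not the algorithm in the patch, and split_plugin_output() is a hypothetical name. One known strtok behaviour is that it returns NULL for a string made up only of delimiters, such as "|", so code that expects a token can trip over it; whether that was the cause of this crash is not something we can confirm.

```c
#include <stddef.h>
#include <string.h>

/* Copy at most dst_size - 1 bytes and always NUL-terminate. */
static void copy_bounded(char *dst, size_t dst_size,
                         const char *src, size_t src_len)
{
    if (dst_size == 0)
        return;
    if (src_len >= dst_size)
        src_len = dst_size - 1;
    memcpy(dst, src, src_len);
    dst[src_len] = '\0';
}

/* Split raw at the first '|': text before it is the plugin output, text
 * after it is the performance data. Both results are always valid,
 * possibly empty, strings, and raw itself is never modified. */
void split_plugin_output(const char *raw,
                         char *output, size_t output_size,
                         char *perfdata, size_t perfdata_size)
{
    const char *bar = strchr(raw, '|');

    if (bar == NULL) {
        copy_bounded(output, output_size, raw, strlen(raw));
        copy_bounded(perfdata, perfdata_size, "", 0);
    } else {
        copy_bounded(output, output_size, raw, (size_t)(bar - raw));
        copy_bounded(perfdata, perfdata_size, bar + 1, strlen(bar + 1));
    }
}
```

With an input of just "|", both output and perfdata come back as empty strings rather than NULL pointers, so later code has nothing to dereference by mistake.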
With this patch, our customer's Nagios has not crashed for a month - so we're safe!
Could be applied to core.
Returning passive latency values in nagiostats
With the fix to the passive latency values, we then want to find out what the values are for passive latency over a long period of time.
This patch updates nagiostats.
Could be applied to core.
Is that all?
Yes, for now! We've made lots of changes to Nagios over the last 12 months, which we think are suitable for core. Sorry for not publishing them sooner.
If you want to have an Altinity compiled version of Nagios, just do this:
cd /tmp
svn export http://svn.opsview.org/opsview/trunk/opsview-base
cd opsview-base
make nagios
This will patch Nagios and run ./configure with our usual settings. There are some dependencies required (autoconf, automake), but we'll leave that for you to work out! You'll get the exact version of Nagios that we use in Opsview - in fact, you'll get it before our customers do!
We'll do a similar Patch Day for NDO soon and talk about some of the performance tuning we've been doing for our large customers.
Enjoy!