Wednesday, May 14, 2008

Zenoss Deathmatch

I've been working for the last week on implementing Zenoss to replace Nagios and Cacti. Individually, Nagios and Cacti are pretty good at what they do, but they don't integrate well.

Nagios is primarily an availability monitor, so it's good for notifying you when something goes down, a disk is filling up, the load average is too high, etc., but it's not so great for monitoring performance. Nagios 1.4 uses text configuration files. There is a templating system, which can be helpful if you have a lot of identical systems.

Cacti, on the other hand, is pretty good at monitoring performance (how much bandwidth you are using, resource utilization, and so on) with nice long-term graphs using RRDtool, but it's not so great for notifying you if something is down. Cacti is almost exclusively SNMP-based, and as a result, you can usually just point it at a device through the web interface and it will auto-discover everything interesting. If you have more than a few hundred items to measure, you need to use cactid, which is a very fast threaded poller written in C.

I've been using both for about 3-4 years separately, but because they don't integrate easily (even though both use MySQL as their backend storage), there's a lot of duplication of effort in getting both of them configured.

And then there's Zenoss. Zenoss does both availability and performance monitoring, with long-term graphing using RRDtool, log analysis, and network-based auto-discovery. Zenoss is written in Python using the Zope 2 framework. Most of the device metadata is stored in ZODB, Zope's native object database. Long-term performance data is stored in RRDtool. Event logs are stored in MySQL.

Everything in Zenoss integrates together very well. The data is faceted in the sense that you can browse devices by location, by class, by group, or by system. It has a built-in syslog server, it can use WMI for monitoring Windows systems, and it has very flexible event handling.

There are still some rough edges in 2.1.92, which is a beta for 2.2. First, it's a bit of a memory hog, and I'm inclined to believe there are some memory leaks. After a day or two the main process will start to use over 200 MB; restarting tends to knock it back down to around 100 MB or so.
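A rough way to check whether this is a real leak or just memory that climbs and then levels off is to record the process size every so often and compare the most recent readings with the ones before them. This is only a sketch: the function name, the window size, and the 10 MB tolerance are all made up for illustration, and the samples would have to come from whatever you use to record memory (ps, top, or a graph).

```python
def still_growing(samples_mb, window=3, tolerance_mb=10.0):
    """Return True if memory use is still climbing.

    samples_mb   -- memory readings in MB, oldest first (hypothetical input)
    window       -- how many readings to average on each side
    tolerance_mb -- growth below this is treated as noise (arbitrary choice)
    """
    if len(samples_mb) < 2 * window:
        raise ValueError("need at least 2 * window samples")
    recent = samples_mb[-window:]
    previous = samples_mb[-2 * window:-window]
    # Compare the average of the newest readings with the average just before.
    growth = sum(recent) / window - sum(previous) / window
    return growth > tolerance_mb
```

If this keeps returning True day after day, a leak is the likely explanation; if it flips to False once the process has been up for a while, the memory is probably just settling at its working size.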

Syslog support has some issues. When I first started feeding it some syslog data, all the events were being classified as "/Unknown". This is normal. Once you have some log entries, you can then tell it to map that entry to an event. The problem was, the events had components (the process name when parsing syslog data), but they had no event class keys set. Looking at the code, it seemed like it should have been setting the event class key to whatever the component/process name was. It just wasn't. After some Googling, I found out the code to build the event class key was just plain broken. After making the suggested changes, I could start mapping events.
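To make the idea concrete, here is a minimal sketch of what I expected to happen: pull the process name out of the syslog tag, store it as the component, and reuse it as the event class key. This is not the Zenoss code or the actual fix; the regular expression, the function name, and the dictionary layout are my own assumptions for illustration.

```python
import re

# Hypothetical pattern, not taken from Zenoss: matches a syslog tag such as
# "sshd[1234]: message" or "kernel: message".
TAG_RE = re.compile(
    r"^(?P<component>[^\s:\[]+)(?:\[(?P<pid>\d+)\])?:\s*(?P<summary>.*)$"
)

def build_event(message):
    """Split a syslog message body into component and summary.

    The component (process name) doubles as the event class key so the
    event can later be mapped to an event class.
    """
    event = {"summary": message, "component": "", "eventClassKey": ""}
    match = TAG_RE.match(message)
    if match:
        event["component"] = match.group("component")
        event["summary"] = match.group("summary")
        # Use the component as the key when nothing more specific is known.
        event["eventClassKey"] = event["component"]
    return event
```

A message with no recognizable tag keeps an empty key, which is what leaves it stuck as unmappable.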

Another syslog problem was in parsing the hostname. I have a satellite syslog-ng server in a remote location that logs to my central syslog-ng server. Because of this, the hostname has the relay information in it. Zenoss' syslog support has an option to parse this though, so no problem, right? Despite turning this on, I was still getting entries like IP/IP, so it was back into the syslog code. It turns out Zenoss expects the separator between the two hostnames to be "@", and syslog-ng uses "/". It was an easy fix in the code, but I suspect "@" may work for standard syslog, so the separator needs to be a configuration option.
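Something along these lines is what I have in mind for a configurable separator. Again, this is a sketch and not the Zenoss code: the function name and the separators parameter are hypothetical, and treating the first part as the originating host is an assumption you should check against your own logs.

```python
def split_relay_hostname(raw, separators=("@", "/")):
    """Split a relayed syslog hostname field into (first_part, second_part).

    separators -- characters to try, in order (hypothetical config option).
    Returns (raw, None) if none of the separators is present.
    """
    for sep in separators:
        if sep in raw:
            # Split only on the first occurrence of the separator.
            first, second = raw.split(sep, 1)
            return first, second
    return raw, None
```

With only "@" configured, an entry like 10.0.0.5/10.1.1.1 comes back unsplit, which matches the IP/IP entries I was seeing; adding "/" to the list splits it.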

Despite all of this, I like Zenoss a lot. I am running it in parallel with Nagios until I get all the event handling nailed down. I might need 2 GB of RAM on the monitoring server though, and I have already moved the MySQL database onto a different server.

9 comments:

Marius Gedminas said...

Zope uses per-thread object caches that start out empty and then fill up to maximum size after a while. That explains the memory size difference after a restart.
Now if it would keep growing and growing every day, that would be a memory leak.

Unknown said...

Thanks, I'll have to let it run a few days then. The server has 1 GB of physical RAM, and I've seen Zenoss grow to the point where it uses half a gig of swap. I'll also look into seeing if there is some way to either limit the number of threads or the size of the object cache in Zope (probably in ZEO).

qhartman said...

Thanks for posting this. I'm a Plone user, so I'm pretty familiar with Zope. Sounds like Zenoss might be right up my alley. Have you looked at OpenNMS at all? Until I read your post it was solidly at the top of my list for a Nagios / Cacti replacement. I find the two useful, but like you, I find their lack of integration frustrating.

mray said...

Thanks for checking out our 2.1.92 beta. We should be releasing the 2.2 version real soon.

If you let your Zenoss install run for more than a day or two, does the memory continue to grow or does it level off? It's quite possible there are memory leaks, so if you can help us reproduce them or track them down, we'd appreciate it.

On the syslog issues, I've flagged the first one to get development to look at it. Without a ticket, forum posts sometimes slip through the cracks. If you could open a ticket on the second one, that would raise the issue and get it looked at.

Thanks,
Matt Ray
Zenoss Community Manager
mray@zenoss.com

Jeronimo Zucco said...

You can use nagiosgraph with Nagios to produce RRD performance graphs. I did it; it's easy to install and to write the maps that generate the graphs:

http://nagiosgraph.wiki.sourceforge.net

Vitaly said...

Thanks for your post!
Did you finish your migration?
Are you happy with Zenoss?
BTW, did you evaluate Zabbix and/or OpenNMS?
