It seems everywhere I go to work, I face the same operational problems. Once again, I must find a way to centralize logs and provide different levels of access to said logs. Sadly, the syslog protocol is getting quite aged, and it’s just not enough anymore. It works well if you have only a few machines, and only need to provide access to sysadmins. But when developers and other types of users are thrown into the mix, you need a more granular system.
Also, support for the syslog protocol varies greatly from daemon to daemon. One major culprit for me has always been Apache (and web servers in general), because out of the box it only supports syslog for error logs. For access logs, you can use different techniques, but no matter which one you use, you end up with the same problem: if you have more than one vhost on the machine, all their logs end up in the same syslog facility. You can obviously filter them afterwards, but that’s far more work than for, say, email logs.
If you work for a company that has a lot of budget, you may consider getting Splunk. It’s a very good commercial product, and there is a free version too. But last I checked, it was priced at US$7,000 per half gig of logs indexed per day. When you have web servers generating several gigs of logs each per day, that could end up being very expensive. That money could be used to buy hardware to deploy more open source software, which I tend to prefer.
So after a week or so of investigation, testing and benchmarking, here are my findings. The architecture of the final setup is not settled yet, but it will more or less look like this.
NOTE: Keep in mind, this post won’t go into details about clustering and scaling. But all chosen products can achieve that. You should be able to come up with that on your own easily. I will concentrate on the tools and the workflow.
In order to avoid changing the syslog daemon on each server, I decided to keep either sysklogd or rsyslog intact, as it’s installed by default. And with a centralized configuration management system, it’s simple to add a new entry to send your logs to another machine. Something like this:
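A minimal sketch of such an entry, assuming a hypothetical central host named logs.example.com. The selector syntax below is accepted by both sysklogd (/etc/syslog.conf) and rsyslog (/etc/rsyslog.conf); a single @ means the messages are forwarded over UDP.

```
# Forward every facility at every priority to the central log host (UDP).
# logs.example.com is a placeholder; use your own collector's name or IP.
*.*    @logs.example.com
```

Restart the syslog daemon after adding the line so that it picks up the new destination.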
That will take care of most of the systems’ logs and send them to a central machine. But what about Apache? As far as error logs are concerned, it’s simple. You need to reconfigure Apache to send them to syslog. On RedHat-based systems, you want to edit /etc/httpd/conf/httpd.conf and on Debian-based systems (I might be wrong, I don’t have one handy with Apache installed) you’d have to modify /etc/apache2/apache2.conf or something like that.
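A minimal sketch of that change, assuming the local2 facility. The directive goes in the main server configuration, or in a vhost if you only want to redirect that one:

```
# Send the error log to the local syslog daemon under facility local2
ErrorLog syslog:local2
```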
I use local2 as an example, but you could pick any facility. In any case, I recommend using one of the local* facilities, as later on, it will allow you to create filters and alerts based on that.
Now, as I was mentioning earlier, the issue with web servers is that you can have more than one vhost on a machine, and syslog was never designed with that in mind. So you will inevitably end up with all logs for a machine in the same facility. Apache supports piping logs to an external program. I found that the simplest was to use the logger tool. So either system-wide or per vhost, you can add something similar to your configuration:
CustomLog "|/bin/logger -p local2.info" combined
During benchmarks, that obviously added some overhead, but not much: maybe 5% more. You will have to plan your site’s capacity with that in mind, but I think it’s well worth it for the operational advantages.
As far as Java applications are concerned, you can easily configure them to send to syslog using log4j.
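As an illustration, here is a minimal log4j.properties sketch. It assumes log4j 1.x and its SyslogAppender, which sends messages over UDP; the host name logs.example.com and the choice of facility LOCAL3 are placeholders, not part of my setup.

```
# Route everything at INFO and above to syslog
log4j.rootLogger=INFO, SYSLOG

# SyslogAppender ships with log4j 1.x
log4j.appender.SYSLOG=org.apache.log4j.net.SyslogAppender
log4j.appender.SYSLOG.SyslogHost=logs.example.com
log4j.appender.SYSLOG.Facility=LOCAL3
log4j.appender.SYSLOG.layout=org.apache.log4j.PatternLayout
log4j.appender.SYSLOG.layout.ConversionPattern=%p %c: %m
```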
2. Tailing log files (warning)
In the past, I sometimes would use tools that would tail log files and then forward them to a central syslog machine. That’s fine if you just need to archive on a file system somewhere. But there’s one important thing to know: what you get in a log file is just a string. It’s not a standard, properly formatted syslog message. Rsyslog is able to log real syslog messages to your files, but I don’t recommend it as it’s harder to read. That said, if you want the webUI at the end of the proposed chain to work properly, you don’t want to forward tailed lines. It’s better to use something like logger if possible.
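To make the difference concrete, here is a small Python sketch of what a syslog message carries on top of the text you would find in a log file. It assumes the classic BSD syslog layout (RFC 3164); the function names and the sample host, tag and text are made up for the example.

```python
# Facility and severity codes from the syslog protocol (RFC 3164).
# Only the local* facilities are listed, since those are the ones used here.
FACILITIES = {"local%d" % n: 16 + n for n in range(8)}
SEVERITIES = {
    "emerg": 0, "alert": 1, "crit": 2, "err": 3,
    "warning": 4, "notice": 5, "info": 6, "debug": 7,
}


def priority(selector):
    """Turn a selector such as 'local2.info' into the numeric PRI value."""
    facility, severity = selector.split(".")
    return FACILITIES[facility] * 8 + SEVERITIES[severity]


def format_message(selector, timestamp, host, tag, text):
    """Build a BSD-style syslog line: <PRI>TIMESTAMP HOST TAG: TEXT."""
    return "<%d>%s %s %s: %s" % (priority(selector), timestamp, host, tag, text)


line = format_message("local2.info", "Oct 11 22:14:15", "web01", "apache", "GET / 200")
print(line)
```

A line tailed from a file only has the last part. The facility, the severity and often the host are gone, and those are exactly the fields you want to filter and alert on later.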
Now we leave the legacy world and enter the present time of logging. Logstash is the Swiss Army knife of the logging world. It’s a very well designed application that can be used in either agent or server mode. In agent mode, you can configure different types of inputs and outputs. It supports a wide range of protocols like file, syslog, amqp, etc.
So here, I decided to use it with a syslog input to receive logs from our machines. It’s also easy to load-balance with a layer 3 load-balancer. It will then send the logs to an exchange in RabbitMQ.
Now, this part is optional. You could send the logs straight from Logstash to Graylog2, but I prefer to have a middleman to do a bit of queuing. Also, once the messages enter the exchange in an AMQP server, you can route them to more than one queue, for different types of processing. In order to do that though, you need to use a fanout exchange.
Why RabbitMQ? Well, it’s written in Erlang and it’s very fast. During my benchmarks, it was processing between 4000 and 5000 messages per second during the peaks. Also, it’s easily clusterable in an elastic kind of way. And all operations can be done while the cluster is live. I also recommend you install the management plugins, as that will provide a very nice webUI to manage your stack. Often, UIs of the sort are limited, but in this case, everything that can be done on the CLI is doable with the webUI as well. And it’s very well designed and pleasant to the eye.
So at this point, my messages are entering an exchange named ‘syslog’ that routes messages to two different queues: graylog and elasticsearch.
NOTE: As I write this, version 2.5.1 has just been released. At this point in time, queues cannot be replicated in your cluster. So if you lose the node where the queue was created, you lose the queue. That said, you can query your queue from any node in the cluster. Support for replicated queues should be available soon. You could use DRBD, though, to mirror just that one node. That would give you high availability at the queue level.
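The routing can be pictured with a few lines of Python. This is only an in-memory model of fanout semantics, not RabbitMQ client code: the FanoutExchange class and its methods are invented for the illustration, and a real setup would declare the exchange and bindings through an AMQP client or the management UI.

```python
from collections import deque


class FanoutExchange:
    """Toy model: every bound queue receives its own copy of each message."""

    def __init__(self, name):
        self.name = name
        self.queues = {}

    def bind(self, queue_name):
        # Binding a queue means it will get a copy of everything published.
        self.queues[queue_name] = deque()

    def publish(self, message):
        for queue in self.queues.values():
            queue.append(message)

    def consume(self, queue_name):
        # Consuming removes the message from that queue only.
        queue = self.queues[queue_name]
        return queue.popleft() if queue else None


exchange = FanoutExchange("syslog")
exchange.bind("graylog")
exchange.bind("elasticsearch")
exchange.publish("web01 apache: GET / 200")

first = exchange.consume("graylog")           # the message
second = exchange.consume("graylog")          # None, it was already consumed
archived = exchange.consume("elasticsearch")  # still there for the other consumer
```

Each consumer drains its own queue, so a message read from one queue is still waiting in the other.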
5. Logstash (again)
Now, we’re almost ready to give access to our logs to our different users. At this step, logs are ready to be sent to Graylog2. So we will use another Logstash instance with an AMQP input that will read messages from our ‘graylog’ queue and forward them to Graylog2 using a GELF output. That’s the preferred protocol for importing messages into Graylog2. I won’t provide an example configuration for Logstash, as it’s really easy and straightforward to configure.
This is where all the magic will happen. Graylog2 has two components: a daemon that receives logs, processes them, and inserts them in a capped collection in MongoDB, and a web interface to browse them. Take a few minutes to go read about capped collections, as it’s important to understand them well. Basically, a capped collection works like a FIFO. So in order to take advantage of the speed inherent to a FIFO, you want to make sure that your capped collection fits in RAM. MongoDB will allow you to create a capped collection larger than that, but you get major performance degradation when your data reaches the amount of available RAM you have.
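If the FIFO behaviour is not obvious, this Python sketch shows the idea with a bounded deque. It is only an analogy and does not talk to MongoDB: a real capped collection is bounded by its size in bytes, while the toy below is bounded by a message count of 3 to keep it readable.

```python
from collections import deque

# A bounded FIFO: once it is full, each new entry pushes out the oldest one.
capped = deque(maxlen=3)

for n in range(1, 6):
    capped.append("message %d" % n)

remaining = list(capped)
print(remaining)
```

Only the three most recent messages are left; the first two were overwritten without any explicit delete.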
So with that taken into consideration, on my test machine, I created a roughly 5GB capped collection. With that, I was able to store more than 10 million messages in it. It’s important to know that Graylog2 is not meant to be used for archiving. Where it excels is in the real-time (or close to it) view of your logs. Also, you can set up different alarms based on facilities, hosts and regexes, and they will email you when they trigger. Very cool. It allows you to be more proactive, and to detect issues a traditional monitoring system can’t find.
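Those two numbers give a rough sizing rule. The Python sketch below assumes that "5GB" means 5 GiB and that exactly 10 million messages were stored, so the result is only an upper bound on the average message size in my test. The helper names are made up.

```python
GIB = 1024 ** 3


def average_message_bytes(collection_bytes, message_count):
    """Average space taken by one stored message."""
    return collection_bytes / message_count


def messages_that_fit(ram_bytes, message_bytes):
    """How many messages of a given average size fit in a RAM budget."""
    return ram_bytes // message_bytes


avg = average_message_bytes(5 * GIB, 10_000_000)
print("about %.0f bytes per message" % avg)

# Example: how many 512-byte messages fit in 1 GiB of RAM
print(messages_that_fit(GIB, 512))
```

Measure your own average message size before sizing the collection, since access logs and application logs can differ a lot.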
Remember I mentioned two different queues? The reason is simple. Once a message is consumed in RabbitMQ, it’s not available anymore. It’s deleted. So you need more than one queue if you want to feed different systems. Graylog2 is great for short-term analysis and real-time debugging, but you can’t count on it for archiving. Enter Elasticsearch: a clusterable full-text indexer/search engine. It’s based on the Lucene project from the Apache Software Foundation. Its main goal is to be an elastic search engine that is very simple to use and configure. And from my short tests with it, it lives up to that. It discovers new nodes using multicast. So basically, you power up a new node, the cluster detects it, rebalances itself, and voilà.
That’s where I plan to store my logs for long-term archiving. Logstash (is there anything it can’t do?), when run in server mode, provides a web interface to search them. You would again use an AMQP input, this time with an Elasticsearch output, to send the logs to Elasticsearch. Then run another instance of Logstash in web mode to provide the webUI.
So that’s it. That’s a home-made Splunk-like system. Obviously, it’s more work to deploy, but it’s much cheaper, more flexible and open source. It will grow as needed by your infrastructure. You can use it to aggregate logs from servers, applications and networking equipment easily, and provide granular access to your logs through Graylog2.