Graphite Query Statistics on an ELK Stack
For quite some time, I have been using a combination of graphite-web and Grafana as the front-end for my Graphite cluster. However, whenever there is an event requiring all developers and sysadmins to look at graphs, the front-end becomes unusable. Graphite-web communicates with the carbon caches to query for metric data. The carbon project does publish internal metrics so I could see how many cache queries were being processed every minute.
The graphite-web process can also be configured to log the requests that are being placed, but the log lines include all functions being applied to the metric and all parameters passed as part of the HTTP API request. Analyzing data by parsing log lines like these would be very cumbersome.
The Problem
During periods of heavy usage, Grafana would become unusable because the graphite-webapp would not respond to requests. The "solution" of late has been to restart the web applications, since their health degrades with increased load. There were a few theories that could explain the problem:
- The disks are I/O bound
- The carbon-caches are unable to handle the load
- The graphite-web application is unable to handle the load
The first two theories were easily discarded. During the incident I ran iostat and confirmed that there was sufficient headroom for the incoming reads and writes. And if the carbon-caches were the culprit, the incoming metric writes would have taken a hit as well; incoming metric rates, average update times and queue sizes all remained constant during the incident. That left the graphite-web application. Tailing its log showed sporadic requests being logged, but none of them completing.
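For reference, the disk check was roughly the following, keeping an eye on the %util and await columns for the volumes holding the whisper files:

    iostat -dx 5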
The Solution
A few weeks back I had been playing around with Cyanite. During the process I stumbled upon the graphite-api project, a replacement for graphite-web. It strips out everything from the default graphite-web project and leaves only the HTTP API layer. The main benefit (at least for me) is that it accesses the filesystem directly instead of communicating with the carbon caches. Since replacing graphite-webapp with graphite-api, I have seen the Grafana dashboards loading almost immediately for time ranges of ~3 hours, and within a few seconds for time ranges of more than 30 days.
I'm Blind!
The graphite-api project is great, but I have no idea what's going on inside it because it doesn't log anything. In addition, since the queries don't go through the carbon caches, I no longer have any query metrics.
Elasticsearch + Logstash + Kibana + RabbitMQ
In true Monitorama 2014 spirit, I decided to try something new. I won't cover all of the details here, but I set up an ELK+R stack on an EC2 host.
graphite-api
First things first: I instrumented the graphite-api code to log query information. To do this, make the code changes indicated in the following PR (if it hasn't been merged already). On a default graphite-api installation, the file to edit is one of the following, depending on your Python version:
- /usr/lib/python2.6/site-packages/graphite_api/finders/whisper.py
- /usr/local/lib/python2.7/dist-packages/graphite_api/finders/whisper.py
Import structlog and define a logger at the top of the file:

    from structlog import get_logger

    logger = get_logger()

In the __init__ function, bind a logger instance:

    self.log = logger.bind()

At the beginning of the find_nodes function, log the incoming query:

    start = query.startTime
    end = query.endTime
    rangeInMins = (end - start) / 60
    self.log.info("find_nodes", start=start, end=end, rangeInMins=rangeInMins, pattern=query.pattern)
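Put together, the patched finder looks roughly like this (a sketch: only the logging lines are new, the rest of WhisperFinder stays untouched):

    from structlog import get_logger

    logger = get_logger()


    class WhisperFinder(object):
        def __init__(self, config):
            # ... existing initialization, unchanged ...
            self.log = logger.bind()

        def find_nodes(self, query):
            # log every incoming query before the finder walks the whisper tree
            start = query.startTime
            end = query.endTime
            rangeInMins = (end - start) / 60
            self.log.info("find_nodes", start=start, end=end,
                          rangeInMins=rangeInMins, pattern=query.pattern)
            # ... existing find_nodes body, unchanged ...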
Restart the application and you should see log output such as this:
{"start": 1399759632, "rangeInMins": 60, "end": 1399763232, "pattern": "*.agents.*.metricsReceived", "event": "find_nodes"}
Elasticsearch
Install the latest version of Elasticsearch and set the cluster name to elkg.
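Setting the cluster name is a one-line change in elasticsearch.yml (the path varies by install method, typically /etc/elasticsearch/elasticsearch.yml):

    cluster.name: elkg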
RabbitMQ
Install RabbitMQ. Set up an exchange and a queue for use by Logstash (a rabbitmqadmin sketch follows the list):
- Exchange: graphite-logstash-fanout
  - type: fanout
  - durable: true
- Queue: graphite-logstash
  - durable: true
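If you have the rabbitmqadmin tool installed, the declarations boil down to something like this (a sketch; note that the queue also needs a binding to the exchange so published log lines actually reach it):

    rabbitmqadmin declare exchange name=graphite-logstash-fanout type=fanout durable=true
    rabbitmqadmin declare queue name=graphite-logstash durable=true
    rabbitmqadmin declare binding source=graphite-logstash-fanout destination=graphite-logstash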
Logstash & Kibana
Install the latest Logstash. I ran two installations: one on the EC2 host with the ELK stack, and another on my Graphite server. The first installation reads messages from RabbitMQ, runs a grok filter and stores the output in Elasticsearch.
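A minimal sketch of that first config (my actual setup uses a grok filter; the json filter shown here is a simpler way to pull the fields out of the structlog JSON lines, and option names such as host vs. hosts vary between Logstash versions):

    input {
      rabbitmq {
        host    => "localhost"              # RabbitMQ runs on the same EC2 host here
        queue   => "graphite-logstash"
        durable => true
      }
    }

    filter {
      json {
        source => "message"                 # parse the JSON log line into fields
      }
    }

    output {
      elasticsearch {
        host => "localhost"                 # "hosts" on newer Logstash versions
      }
    }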
The second Logstash installation tails the graphite-api log and publishes any new log lines to the RabbitMQ exchange, which fans them out to the queue above.
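A sketch of that shipper side, with the log path and the ELK hostname as placeholders for whatever your setup uses:

    input {
      file {
        path => "/var/log/graphite-api.log"   # wherever graphite-api writes its log
      }
    }

    output {
      rabbitmq {
        host          => "elk.example.com"    # the EC2 host running RabbitMQ
        exchange      => "graphite-logstash-fanout"
        exchange_type => "fanout"
        durable       => true
      }
    }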
Kibana
Spin up the Logstash and Kibana processes (the trailing -- web starts the Kibana web interface that ships with Logstash):
- sh bin/logstash -f path/to/logstash.conf -- web
After this you should be able to load the Kibana front-end and open the default Logstash dashboard.
The Fun Part
The fun starts once you have Kibana running and searching the Elasticsearch backend. Just by collecting those simple log lines from graphite-api, I was able to build a very robust and insightful dashboard. I can now answer questions such as these (a couple of example queries follow the list):
- How many queries were received in the past hour?
- How does that compare to the previous hour or previous twelve hours?
- What percentage of queries are for time ranges greater than 7 days?
- What are the most popular metric patterns?
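For example, the 7-day question reduces to a Lucene query against the rangeInMins field (assuming Elasticsearch mapped it as a number, which it will for the JSON lines above; 7 days is 10080 minutes):

    event:"find_nodes" AND rangeInMins:[10080 TO *]

The most popular metric patterns come from a terms panel on the pattern field rather than from a query.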
This is a screenshot of the dashboard I managed to build:
Conclusion
I've already seen graphite-api improve the Grafana dashboard load times; I have yet to see how it holds up on a high usage day, but I'm optimistic. Furthermore, with this new Kibana dashboard I now have near real-time insight into what is happening with Graphite queries.