In the Graphite Series blog posts, I'll provide a guide that walks through all of the steps involved in setting up a monitoring and alerting system using a Graphite stack. Disclaimer: I am no expert; I am just trying to help the Graphite community by providing more detailed documentation. If there's something wrong, please comment below or drop me an email at feangulo@yaipan.com.

In the previous blog posts, we've learned how to set up Carbon (caches) and Whisper, publish metrics and visualize the information and the behavior of the Carbon processes. In this blog post, I'll present another feature of Carbon - the aggregator.

The Carbon Aggregator

Carbon aggregators buffer metrics over time before reporting them into Whisper. For example, let's imagine that you have 10 application servers reporting the number of requests received every 10 seconds:

PRODUCTION.host.ip-0.requests.m1_rate
PRODUCTION.host.ip-1.requests.m1_rate
PRODUCTION.host.ip-2.requests.m1_rate
PRODUCTION.host.ip-3.requests.m1_rate
PRODUCTION.host.ip-4.requests.m1_rate
PRODUCTION.host.ip-5.requests.m1_rate
PRODUCTION.host.ip-6.requests.m1_rate
PRODUCTION.host.ip-7.requests.m1_rate
PRODUCTION.host.ip-8.requests.m1_rate
PRODUCTION.host.ip-9.requests.m1_rate

Per-server data points like these can be very insightful. For example, they let you verify whether the load balancer is actually functioning correctly and balancing the load between your servers.

However, other times you are only interested in the total number of requests received by all your application servers. This could easily be done by applying a Graphite function on your metrics.

sumSeries(PRODUCTION.host.*.requests.m1_rate)

The problem with this approach is that this operation is expensive. In order to render this graph we first need to read the 10 different metrics from their corresponding Whisper files, then we need to combine the results by applying the specified function, and finally build the graph. If we know that this is something we will always be interested in visualizing, we could benefit by precomputing the values.
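To make the render-time cost concrete, here is a minimal Python sketch of the kind of work a point-wise sum has to do for every request. It is only an illustration, not Graphite's implementation: the `sum_series` helper and the sample values are hypothetical, and missing data points are modeled as `None`.

```python
from typing import Dict, List, Optional


def sum_series(series: Dict[str, List[Optional[float]]]) -> List[Optional[float]]:
    """Point-wise sum across several series, skipping missing (None) values."""
    summed = []
    for points in zip(*series.values()):
        present = [p for p in points if p is not None]
        # A time slot with no data at all stays empty instead of becoming 0
        summed.append(sum(present) if present else None)
    return summed


# Hypothetical data: three hosts, three time slots each
per_host = {
    "PRODUCTION.host.ip-0.requests.m1_rate": [1.0, 2.0, None],
    "PRODUCTION.host.ip-1.requests.m1_rate": [3.0, None, None],
    "PRODUCTION.host.ip-2.requests.m1_rate": [5.0, 6.0, None],
}
total = sum_series(per_host)
```

Every render has to load all of the input series and walk every time slot, which is the work that precomputing the aggregate avoids.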

To precompute the values, we can define a rule that matches metrics on a regular expression, buffers them for a specified amount of time, applies a function on the buffered data, and stores the result in a separate Whisper metric file. In our example, we would need the following:

Metric matching rule: PRODUCTION.host.*.requests.m1_rate
Buffering time interval: 60 seconds
Aggregation function: sum
Output metric: PRODUCTION.host.all.requests.m1_rate

The per-server metrics are reported every 10 seconds in our environment. Given this configuration, metrics will be buffered for 6 publishing intervals, combined using the sum function, and stored in the output Whisper metric file. Finally, we can build a graph by querying the aggregate metric data.
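The buffering step can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not Carbon's code: the `aggregate_sum` helper is hypothetical, timestamps are plain integers, and each data point is assigned to the 60-second bucket that contains it.

```python
from collections import defaultdict


def aggregate_sum(datapoints, interval=60):
    """Group (timestamp, value) pairs into fixed-width buckets and sum each bucket."""
    buckets = defaultdict(list)
    for timestamp, value in datapoints:
        # Align the timestamp to the start of its buffering interval
        bucket_start = timestamp - (timestamp % interval)
        buckets[bucket_start].append(value)
    return {start: sum(values) for start, values in sorted(buckets.items())}


# Hypothetical input: two hosts each publishing 1.0 every 10 seconds for one
# minute, plus a single data point that falls into the next interval
example = [(t, 1.0) for t in range(0, 60, 10)] * 2 + [(60, 1.0)]
result = aggregate_sum(example)
```

Note that in this model the sum covers every data point that lands in the interval, across all hosts and all six publishing intervals.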

The Carbon Process Stack

The Carbon aggregators can be configured to run in front of the Carbon caches. Incoming metrics can be received by the aggregators and then passed along to the caches.

The Carbon Cache

Refer to the Carbon & Whisper blog post for instructions on how to configure and run a Carbon cache. In my environment I have a cache with the following configuration:

$ vi /opt/graphite/conf/carbon.conf
[cache]

LINE_RECEIVER_INTERFACE = 0.0.0.0
LINE_RECEIVER_PORT = 2003

PICKLE_RECEIVER_INTERFACE = 0.0.0.0
PICKLE_RECEIVER_PORT = 2004

CACHE_QUERY_INTERFACE = 0.0.0.0
CACHE_QUERY_PORT = 7002

I run it using the following command:

$ cd /opt/graphite/bin
$ ./carbon-cache.py start
$ ps -efla | grep carbon-cache
1 S root     18826     1  4  80   0 - 156222 ep_pol Jun04 ?       02:04:31 /usr/bin/python ./carbon-cache.py start

The Carbon Aggregator

The same Carbon configuration file has some default settings for a Carbon aggregator.

$ vi /opt/graphite/conf/carbon.conf
[aggregator]

LINE_RECEIVER_INTERFACE = 0.0.0.0
LINE_RECEIVER_PORT = 2023

PICKLE_RECEIVER_INTERFACE = 0.0.0.0
PICKLE_RECEIVER_PORT = 2024

DESTINATIONS = 127.0.0.1:2004

AGGREGATION_RULES = aggregation-rules.conf
REWRITE_RULES = rewrite-rules.conf

My Carbon cache process' Pickle receiver port is set to the default (2004). Therefore, I could start up an aggregator process with the default configuration and it would be able to communicate with my cache process.

$ cd /opt/graphite/bin
$ ./carbon-aggregator.py start
$ ps -efla | grep carbon-aggregator
1 S root     23767     1  0  80   0 - 56981 ep_pol 13:53 ?        00:00:00 /usr/bin/python ./carbon-aggregator.py start
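The aggregator forwards its data points to the cache's Pickle receiver port (2004 in the DESTINATIONS setting above). As a rough sketch of what travels over that connection, the Graphite documentation describes the pickle format as a list of (path, (timestamp, value)) tuples, pickled and prefixed with a 4-byte length header. The helper names and the sample data point below are hypothetical.

```python
import pickle
import struct


def build_pickle_message(datapoints):
    """Serialize [(path, (timestamp, value)), ...] with a 4-byte length header."""
    payload = pickle.dumps(datapoints, protocol=2)
    header = struct.pack("!L", len(payload))  # unsigned 32-bit, network byte order
    return header + payload


def parse_pickle_message(message):
    """Inverse of build_pickle_message, handy for checking the framing."""
    (size,) = struct.unpack("!L", message[:4])
    return pickle.loads(message[4:4 + size])


# Hypothetical aggregate data point
sample = [("PRODUCTION.host.all.requests.m1_rate", (1402063200, 42.0))]
message = build_pickle_message(sample)
```

You normally never build these messages yourself when using the aggregator; the sketch only shows why the destination must be the cache's Pickle port rather than its line receiver port.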

The Aggregation Rules

The aggregation-rules configuration file is composed of multiple lines specifying the metrics that need to be aggregated and how they should be aggregated. The form of each line should be:

output_template (buffering_time_interval) = function input_pattern

This will capture any received metrics that match the input_pattern for calculating an aggregate metric. The calculation will occur every buffering_time_interval seconds and the function applied can be either sum or avg. The name of the aggregate metric will be derived from the output_template, filling in any captured fields from the input_pattern. Using the example at the beginning of this blog post, we could build the following aggregation rule:

# aggregate all m1_rate metrics
<env>.host.all.<app_metric>.m1_rate (60) = sum <env>.host.*.<<app_metric>>.m1_rate

Due to the nature of the metrics that I publish, I know that all metrics will begin with the environment, followed by the host string and the corresponding host name. The rest of the metric string corresponds to the actual metric name. The following is a breakdown of the incoming metric and the resulting aggregate metric as they go through the above aggregation rule.

Incoming: PRODUCTION.host.ip-0.requests.m1_rate

   <env>        = PRODUCTION
   host.*       = host.ip-0
   <app_metric> = requests
   m1_rate      = m1_rate

Aggregate: PRODUCTION.host.all.requests.m1_rate

   <env>        = PRODUCTION
   host.all     = host.all
   <app_metric> = requests
   m1_rate      = m1_rate
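The breakdown above can be reproduced with a small Python sketch that turns an input pattern into a regular expression and fills in the output template. This is an approximation written for illustration, not Carbon's rule parser: the `build_regex` and `output_name` helpers are hypothetical, and the sketch assumes that `<field>` captures a single dot-separated segment, `<<field>>` may span several segments, and `*` matches one segment.

```python
import re


def build_regex(input_pattern):
    """Translate an aggregation input pattern into a compiled regex."""
    regex_parts = []
    for part in input_pattern.split("."):
        if part.startswith("<<") and part.endswith(">>"):
            # Double brackets: the field may span several dot-separated segments
            regex_parts.append(r"(?P<%s>.+?)" % part[2:-2])
        elif part.startswith("<") and part.endswith(">"):
            # Single brackets: the field is exactly one segment
            regex_parts.append(r"(?P<%s>[^.]+?)" % part[1:-1])
        elif part == "*":
            regex_parts.append(r"[^.]+")
        else:
            regex_parts.append(re.escape(part))
    return re.compile(r"\.".join(regex_parts) + "$")


def output_name(output_template, input_pattern, metric):
    """Return the aggregate metric name, or None if the metric does not match."""
    match = build_regex(input_pattern).match(metric)
    if match is None:
        return None
    name = output_template
    for field, value in match.groupdict().items():
        name = name.replace("<%s>" % field, value)
    return name


TEMPLATE = "<env>.host.all.<app_metric>.m1_rate"
PATTERN = "<env>.host.*.<<app_metric>>.m1_rate"
aggregate = output_name(TEMPLATE, PATTERN, "PRODUCTION.host.ip-0.requests.m1_rate")
```

Under these assumptions, the double brackets are what allow a multi-segment name such as com.graphite.stresser.a to be captured as a single app_metric field.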

At this point you have a Carbon aggregator process running with a single aggregation rule, sending data points to a Carbon cache. We can now start publishing data points to observe the behavior.
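For a quick manual check, a single data point can be sent to the aggregator's line receiver using Carbon's plaintext protocol, where each line has the form "path value timestamp". The following Python sketch assumes the aggregator from the configuration above is listening on localhost port 2023; the helper names and the sample value are hypothetical.

```python
import socket
import time


def format_line(path, value, timestamp=None):
    """Build one plaintext protocol line: 'path value timestamp' plus newline."""
    if timestamp is None:
        timestamp = int(time.time())
    return "%s %s %d\n" % (path, value, timestamp)


def send_lines(lines, host="localhost", port=2023):
    """Send plaintext lines to a Carbon line receiver (assumed host and port)."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall("".join(lines).encode("ascii"))


line = format_line("PRODUCTION.host.ip-0.requests.m1_rate", 12.5, 1402063200)
# send_lines([line])  # uncomment with an aggregator listening on port 2023
```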

Aggregate The Data

In the previous blog post, we used the Stresser application to publish metrics to a Carbon cache. With some simple parameter modifications, we can configure the Stresser to publish metrics to a Carbon aggregator and simulate metric publishing from multiple hosts - to test the aggregation functionality. Use the following configuration:

Publishing port: 2023 (aggregator)
Number of timers: 2
Number of hosts: 5
Publishing interval: 10 seconds
Total metrics published every 10 seconds: 150
Total metrics published per minute: 900
Debug mode: true

$ java -jar stresser.jar localhost 2023 5 2 10 true
Initializing 2 timers - publishing 150 metrics every 10 seconds from 5 host(s)
Publishing metric: STRESS.host.ip-0.com.graphite.stresser.a
Publishing metric: STRESS.host.ip-1.com.graphite.stresser.a
Publishing metric: STRESS.host.ip-2.com.graphite.stresser.a
Publishing metric: STRESS.host.ip-3.com.graphite.stresser.a
Publishing metric: STRESS.host.ip-4.com.graphite.stresser.a
Publishing metric: STRESS.host.ip-0.com.graphite.stresser.b
Publishing metric: STRESS.host.ip-1.com.graphite.stresser.b
Publishing metric: STRESS.host.ip-2.com.graphite.stresser.b
Publishing metric: STRESS.host.ip-3.com.graphite.stresser.b
Publishing metric: STRESS.host.ip-4.com.graphite.stresser.b

Shortly after kicking off the Stresser you can check that the corresponding Whisper files have been created. Obviously, the Whisper files for each of the individual metrics should have been created.

$ ls -l /opt/graphite/storage/whisper/STRESS/host/ip-*/com/graphite/stresser/a/

/opt/graphite/storage/whisper/STRESS/host/ip-0/com/graphite/stresser/a/:
-rw-r--r--. 1 root root 17308 Jun  6 14:27 m1_rate.wsp

/opt/graphite/storage/whisper/STRESS/host/ip-1/com/graphite/stresser/a/:
-rw-r--r--. 1 root root 17308 Jun  6 14:27 m1_rate.wsp

/opt/graphite/storage/whisper/STRESS/host/ip-2/com/graphite/stresser/a/:
-rw-r--r--. 1 root root 17308 Jun  6 14:27 m1_rate.wsp

/opt/graphite/storage/whisper/STRESS/host/ip-3/com/graphite/stresser/a/:
-rw-r--r--. 1 root root 17308 Jun  6 14:27 m1_rate.wsp

/opt/graphite/storage/whisper/STRESS/host/ip-4/com/graphite/stresser/a/:
-rw-r--r--. 1 root root 17308 Jun  6 14:27 m1_rate.wsp

But most importantly, the aggregate metric should have also been created:

$ ls -l /opt/graphite/storage/whisper/STRESS/host/all/com/graphite/stresser/a/
-rw-r--r--. 1 root root 17308 Jun  6 14:30 m1_rate.wsp
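The same check can be scripted. Each dot-separated segment of a metric name becomes a directory under the Whisper storage root, and the last segment becomes the .wsp file, as the listings above show. The sketch below assumes the storage root used in this post; the `whisper_path` and `missing_files` helpers are hypothetical.

```python
import os

# Storage root used in this post; adjust for your installation
WHISPER_ROOT = "/opt/graphite/storage/whisper"


def whisper_path(metric, root=WHISPER_ROOT):
    """Map a dotted metric name to the Whisper file that should hold it."""
    return os.path.join(root, *metric.split(".")) + ".wsp"


def missing_files(metrics, root=WHISPER_ROOT):
    """Return the metrics whose Whisper file does not exist yet."""
    return [m for m in metrics if not os.path.isfile(whisper_path(m, root))]


aggregate_file = whisper_path("STRESS.host.all.com.graphite.stresser.a.m1_rate")
```

Calling missing_files with the individual and aggregate metric names returns an empty list once all of the files have been created.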

Visualize the Aggregations

I have built a very simple dashboard on the Graphite Webapp to visualize the metrics that I'm publishing using the Stresser. Use the following dashboard definition:

[
  {
    "target": [
      "aliasByNode(STRESS.host.ip*.com.graphite.stresser.a.m1_rate,2)",
      "aliasByNode(STRESS.host.all.com.graphite.stresser.a.m1_rate,2)"
    ],
    "title": "Individual & Aggregate Rates"
  }
]
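The same targets can also be queried outside the dashboard through the Graphite render API, which accepts repeated target parameters along with from and format. The Python sketch below only builds the URL; the base URL is an assumption about where the Graphite Webapp is running, and the `render_url` helper is hypothetical.

```python
from urllib.parse import urlencode


def render_url(base_url, targets, time_from="-1h", fmt="json"):
    """Build a render API URL with one 'target' parameter per expression."""
    params = [("target", t) for t in targets]
    params += [("from", time_from), ("format", fmt)]
    return base_url.rstrip("/") + "/render?" + urlencode(params)


targets = [
    "aliasByNode(STRESS.host.ip*.com.graphite.stresser.a.m1_rate,2)",
    "aliasByNode(STRESS.host.all.com.graphite.stresser.a.m1_rate,2)",
]
# Assumed location of the Graphite Webapp
url = render_url("http://localhost", targets)
```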

Notice how we no longer need to apply a function (e.g. sumSeries) on the individual metrics to get the aggregate data. We can just query the "all" metric, which contains the precomputed aggregate data.

Next Steps

The aggregation rules can be augmented to include any number of metrics that you need to aggregate. These are some things to keep in mind:

As the number of metrics matching your aggregation rules increases, so does the memory consumption of the aggregator processes - because the buffered data points increase.
Make sure that the buffering interval in an aggregation rule is greater than the metric's publishing interval.
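Both points can be turned into a rough sanity check. The Python sketch below is a back-of-the-envelope estimate, not a measurement of Carbon's actual memory use: the `points_per_interval` helper is hypothetical and simply counts how many data points one buffering interval would hold for a rule.

```python
def points_per_interval(matching_metrics, buffering_interval, publishing_interval):
    """Estimate the data points buffered per interval for one aggregation rule."""
    if buffering_interval <= publishing_interval:
        raise ValueError(
            "buffering interval must be greater than the publishing interval"
        )
    return matching_metrics * (buffering_interval // publishing_interval)


# Stresser example: 5 hosts x 2 timers = 10 m1_rate metrics matching the rule,
# published every 10 seconds and buffered for 60 seconds
estimate = points_per_interval(10, 60, 10)
```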

The Graphite Series:

Graphite Series #6: Carbon Aggregators
