Elasticsearch: Introduction to pipeline aggregation

Elasticsearch: Introduction to pipeline aggregation

1. let's imagine a table like this:

The above table shows the number of users per month, but how can we get the following information:

  • Which month has the largest number of users?
  • What is the total number of users from April to January?

  • What is their average number of users?
  • What is the value that changes every month?

All of these problems can be calculated using Pipeline aggregation, because the statistics of these data need to use its parent or sibling-level aggregation output.

Pipeline aggregation is called pipeline aggregation in Chinese. It works on the output produced by other aggregations rather than the document set, thereby adding information to the output tree. Compared with metric and bucket aggregation, pipeline aggregation will process the output produced by other aggregations, and these outputs will transform their calculated values. Therefore, pipeline aggregation is suitable for intermediate values that do not exist in the original document set. This makes pipeline aggregation very useful for calculating complex statistical and mathematical measures (such as cumulative sums, derivatives, and moving averages, etc.). There are many different types of pipeline aggregation, and each type has different information from other aggregation calculations, but these types can be divided into two categories:

Parent

     Provides a set of pipeline aggregations based on the output of the parent aggregation, which can calculate new buckets or new aggregations to add to existing buckets. Derivative and cumulative sum aggregation are two common examples of parent pipeline aggregation in Elasticsearch

Sibling

    The output of the same-level aggregation provides pipeline aggregation, and can calculate a new aggregation at the same level as the same-level aggregation.

Pipeline aggregation requires a way to access the parent or sibling aggregation. They can indicate the desired aggregation by using the buckets_path parameter, which indicates the path of the desired measurement. You need to understand the special syntax of this parameter:

AGG_SEPARATOR = `>`; METRIC_SEPARATOR = `.`; AGG_NAME = <the name of the aggregation>; METRIC = <the name of the metric (in case of multi-value metrics aggregation)>; MULTIBUCKET_KEY = `[<KEY_NAME>]` PATH = <AGG_NAME><MULTIBUCKET_KEY>? (<AGG_SEPARATOR>, <AGG_NAME> )* (<METRIC_SEPARATOR>, <METRIC> ); Copy code

For example, the path "my_bucket>my_stats.avg" will point to the avg value in the "my_stats" metric, which is included in the "my_bucket" bucket aggregation.

It should be noted that the path is relative to the pipeline aggregation location. This is why the path cannot be returned "up" to the aggregation tree. For example, the following derivative pipeline aggregation is embedded in date_histogram and references the metric of the_sum adjacent to him.

curl -X POST "localhost:9200/traffic_stats/_search" -H'Content-Type: application/json' -d' { "aggs": { "total_monthly_visits":{ "date_histogram":{ "field":"date", "interval":"month" }, "aggs":{ "the_sum":{ "sum":{ "field": "visits"} }, "the_derivative":{ "derivative":{ "buckets_path": "the_sum"} } } } } } ' Copy code

The sibling pipeline aggregation can also be placed next to a series of buckets instead of being "embedded" inside them. In this case, to access the desired metric, we need to specify a complete path, including its parent aggregation:

curl -X POST "localhost:9200/traffic_stats/_search?size=0&pretty" -H'Content-Type: application/json' -d' { "aggs": { "visits_per_month": { "date_histogram": { "field": "date", "interval": "month" }, "aggs": { "total_visits": { "sum": { "field": "visits" } } } }, "avg_monthly_visits": { "avg_bucket": { "buckets_path": "visits_per_month>total_visits" } } } } ' Copy code

In the above, our buckets_path is defined as a complete parent aggregate path starting from the aggs root. In the above example, we refer to the sibling aggregation named total_visits through its parent aggregation date_histogram named visits_per_month. Therefore, the full path of the target summary is visits_per_month>total_visits.

Also, it is important to remember that pipeline aggregation cannot contain sub-aggregations. However, some pipeline aggregations (such as derivative pipeline aggregations) can reference other pipeline aggregations in their buckets_path. This allows linking multiple pipeline aggregations. For example, we can link two first-order derivatives together to calculate the second-order derivative (the derivative of the derivative).

As you remember, metrics and bucket aggregation use the "missing" parameter to deal with gaps in the data. Pipeline aggregation uses the gap_policy parameter to handle cases where documents do not contain required fields or no documents match one or more bucket queries. This parameter supports the following gap_policies:

  • skip- Treat the missing data as if the bucket does not exist. If this policy is enabled, the aggregation will skip empty buckets and continue calculations using the next available value.
  • insert_zeros-Replace all missing values with zeros, and the pipeline calculation will proceed as usual.

 

Tutorials

In this tutorial, we will use the locally deployed Elasticsearch to practice. If you have not installed your own Elasticsearch and Kibana, please refer to my previous article " Elastic: A beginner's guide ".

For this tutorial, we will create an index called traffic_stats. It is the document data about the visit of the blog post. The index mapping includes three fields: date, visits and max_time_spent.

First we start Kibana, first we establish the mapping of our index:

PUT traffic_stats { "mappings": { "properties": { "date": { "type": "date", "format": "dateOptionalTime" }, "visits": { "type": "integer" }, "max_time_spent": { "type": "integer" } } } } Copy code

Next, we use the  bulk API  to import our data into Elasticsearch:

PUT _bulk {"index":{"_index":"traffic_stats"}} {"visits":"488", "date":"2018-10-1", "max_time_spent":"900"} {"index":{"_index":"traffic_stats"}} {"visits":"783", "date":"2018-10-6", "max_time_spent":"928"} {"index":{"_index":"traffic_stats"}} {"visits":"789", "date":"2018-10-12", "max_time_spent":"1834"} {"index":{"_index":"traffic_stats"}} {"visits":"1299", "date":"2018-11-3", "max_time_spent":"592"} {"index":{"_index":"traffic_stats"}} {"visits":"394", "date":"2018-11-6", "max_time_spent":"1249"} {"index":{"_index":"traffic_stats"}} {"visits":"448", "date":"2018-11-24", "max_time_spent":"874"} {"index":{"_index":"traffic_stats"}} {"visits":"768", "date":"2018-12-18", "max_time_spent":"876"} {"index":{"_index":"traffic_stats"}} {"visits":"1194", "date":"2018-12-24", "max_time_spent":"1249"} {"index":{"_index":"traffic_stats"}} {"visits":"987", "date":"2018-12-28", "max_time_spent":"1599"} {"index":{"_index":"traffic_stats"}} {"visits":"872", "date":"2019-01-1", "max_time_spent":"828"} {"index":{"_index":"traffic_stats"}} {"visits":"972", "date":"2019-01-5", "max_time_spent":"723"} {"index":{"_index":"traffic_stats"}} {"visits":"827", "date":"2019-02-5", "max_time_spent":"1300"} {"index":{"_index":"traffic_stats"}} {"visits":"1584", "date":"2019-02-15", "max_time_spent":"1500"} {"index":{"_index":"traffic_stats"}} {"visits":"1604", "date":"2019-03-2", "max_time_spent":"1488"} {"index":{"_index":"traffic_stats"}} {"visits":"1499", "date":"2019-03-27", "max_time_spent":"1399"} {"index":{"_index":"traffic_stats"}} {"visits":"1392", "date":"2019-04-8", "max_time_spent":"1294"} {"index":{"_index":"traffic_stats"}} {"visits":"1247", "date":"2019-04-15", "max_time_spent":"1194"} {"index":{"_index":"traffic_stats"}} {"visits":"984", "date":"2019-05-15", "max_time_spent":"1184"} {"index":{"_index":"traffic_stats"}} {"visits":"1228", "date":"2019-05-18", "max_time_spent":"1485"} {"index":{"_index":"traffic_stats"}} {"visits":"1423", "date":"2019-06-14", "max_time_spent":"1452"} {"index":{"_index":"traffic_stats"}} {"visits":"1238", "date":"2019-06-24", "max_time_spent":"1329"} {"index":{"_index":"traffic_stats"}} {"visits":"1388", "date":"2019-07-14", "max_time_spent":"1542"} {"index":{"_index":"traffic_stats"}} {"visits":"1499", "date":"2019-07-24", "max_time_spent":"1742"} {"index":{"_index":"traffic_stats"}} {"visits":"1523", "date":"2019-08-13", "max_time_spent":"1552"} {"index":{"_index":"traffic_stats"}} {"visits":"1443", "date":"2019-08-19", "max_time_spent":"1511"} {"index":{"_index":"traffic_stats"}} {"visits":"1587", "date":"2019-09-14", "max_time_spent":"1497"} {"index":{"_index":"traffic_stats"}} {"visits":"1534", "date":"2019-09-27", "max_time_spent":"1434"} Copy code

Great! We now have 27 data. An example of pipeline aggregation is fully prepared. Let's start with avg bucket aggregation.

 

Avg Bucket Aggregation

Avg Bucket Aggregation is a typical example of sibling pipeline aggregation. It processes the value calculated by another sibling aggregation and calculates the average of all buckets. The two requirements for sibling aggregation are that the sibling aggregation must be a multi-bucket aggregation, and the specified index is a number.

In order to understand the working principle of pipeline aggregation, it is reasonable to divide the calculation process into several stages. Let's take a look at the query below. It will be carried out in three steps. 1. Elasticsearch will create a date histogram with one month intervals and apply it to the "visits" field of the index. The date histogram will generate n buckets containing n documents. Next, the sum sub-aggregate will calculate the sum of all visits in each monthly period. Finally, the average bucket pipeline will reference the sum sibling aggregation and use the sum of each bucket to calculate the average monthly blog visits for all buckets. Therefore, we will get the average of the monthly average blog visits.

GET traffic_stats/_search { "size": 0, "aggs": { "visits_per_month": { "date_histogram": { "field": "date", "interval": "month" }, "aggs": { "total_visits": { "sum": { "field": "visits" } } } }, "avg_monthly_visits": { "avg_bucket": { "buckets_path": "visits_per_month>total_visits" } } } } Copy code

The result we got is:

"aggregations": { "visits_per_month": { "buckets": [ { "key_as_string": "2018-10-01T00:00:00.000Z", "key": 1538352000000, "doc_count": 3, "total_visits": { "value": 2060.0 } }, { "key_as_string": "2018-11-01T00:00:00.000Z", "key": 1541030400000, "doc_count": 3, "total_visits": { "value": 2141.0 } }, { "key_as_string": "2018-12-01T00:00:00.000Z", "key": 1543622400000, "doc_count": 3, "total_visits": { "value": 2949.0 } }, { "key_as_string": "2019-01-01T00:00:00.000Z", "key": 1546300800000, "doc_count": 2, "total_visits": { "value": 1844.0 } }, { "key_as_string": "2019-02-01T00:00:00.000Z", "key": 1548979200000, "doc_count": 2, "total_visits": { "value": 2411.0 } }, { "key_as_string": "2019-03-01T00:00:00.000Z", "key": 1551398400000, "doc_count": 2, "total_visits": { "value": 3103.0 } }, { "key_as_string": "2019-04-01T00:00:00.000Z", "key": 1554076800000, "doc_count": 2, "total_visits": { "value": 2639.0 } }, { "key_as_string": "2019-05-01T00:00:00.000Z", "key": 1556668800000, "doc_count": 2, "total_visits": { "value": 2212.0 } }, { "key_as_string": "2019-06-01T00:00:00.000Z", "key": 1559347200000, "doc_count": 2, "total_visits": { "value": 2661.0 } }, { "key_as_string": "2019-07-01T00:00:00.000Z", "key": 1561939200000, "doc_count": 2, "total_visits": { "value": 2887.0 } }, { "key_as_string": "2019-08-01T00:00:00.000Z", "key": 1564617600000, "doc_count": 2, "total_visits": { "value": 2966.0 } }, { "key_as_string": "2019-09-01T00:00:00.000Z", "key": 1567296000000, "doc_count": 2, "total_visits": { "value": 3121.0 } } ] }, "avg_monthly_visits": { "value": 2582.8333333333335 } } Copy code

Therefore, the average monthly blog visits are 2,582.83. Carefully study the steps we described above, and you can understand how pipeline aggregation works. They take the intermediate results of metrics and/or bucket aggregation and perform other calculations on them. This method is very useful when your data does not contain intermediate results, and the intermediate results should be implicitly exported during the aggregation process.

Derivative Aggregation

Derivative written here is the derivative in our mathematical terms.

This is the parent pipeline aggregation, which is used to calculate the derivation of the specified indicator in the parent histogram or date histogram aggregation. There are two requirements for this aggregation:

  • The indicator must be a number, otherwise the derivative will not be found.
  • The histogram must have min_doc_count set to 0 (this is the default value for histogram aggregation). If min_doc_count is greater than 0, some buckets will be omitted, which may cause confusion or wrong derivative values.

In mathematics, the derivative of a function is used to measure the sensitivity of the function value (output value) to the change of its independent variable (input value). In other words, the derivative evaluates the rate of change of certain functions based on its variables. Applying this concept to our data, we can say that differential aggregation calculates the rate of change of numerical data compared to the previous period. Let's look at a real example to better understand what we are talking about.

1. we will calculate the first derivative. The first derivative tells us whether the function is increasing or decreasing, and by how much. Look at the following example:

GET traffic_stats/_search { "size":0, "aggs": { "visits_per_month": { "date_histogram": { "field": "date", "interval": "month" }, "aggs": { "total_visits": { "sum": { "field": "visits" } }, "visits_deriv": { "derivative": { "buckets_path": "total_visits" } } } } } } Copy code

buckets_path instructs the derivative aggregation to use the output of the total_visits parent aggregation for the derivative (we should use the parent aggregation because derivatives are the parent pipeline aggregation).

The response to the above query should be similar to the following:

"aggregations": { "visits_per_month": { "buckets": [ { "key_as_string": "2018-10-01T00:00:00.000Z", "key": 1538352000000, "doc_count": 3, "total_visits": { "value": 2060.0 } }, { "key_as_string": "2018-11-01T00:00:00.000Z", "key": 1541030400000, "doc_count": 3, "total_visits": { "value": 2141.0 }, "visits_deriv": { "value": 81.0 } }, { "key_as_string": "2018-12-01T00:00:00.000Z", "key": 1543622400000, "doc_count": 3, "total_visits": { "value": 2949.0 }, "visits_deriv": { "value": 808.0 } }, { "key_as_string": "2019-01-01T00:00:00.000Z", "key": 1546300800000, "doc_count": 2, "total_visits": { "value": 1844.0 }, "visits_deriv": { "value": -1105.0 } }, { "key_as_string": "2019-02-01T00:00:00.000Z", "key": 1548979200000, "doc_count": 2, "total_visits": { "value": 2411.0 }, "visits_deriv": { "value": 567.0 } }, { "key_as_string": "2019-03-01T00:00:00.000Z", "key": 1551398400000, "doc_count": 2, "total_visits": { "value": 3103.0 }, "visits_deriv": { "value": 692.0 } }, { "key_as_string": "2019-04-01T00:00:00.000Z", "key": 1554076800000, "doc_count": 2, "total_visits": { "value": 2639.0 }, "visits_deriv": { "value": -464.0 } }, { "key_as_string": "2019-05-01T00:00:00.000Z", "key": 1556668800000, "doc_count": 2, "total_visits": { "value": 2212.0 }, "visits_deriv": { "value": -427.0 } }, { "key_as_string": "2019-06-01T00:00:00.000Z", "key": 1559347200000, "doc_count": 2, "total_visits": { "value": 2661.0 }, "visits_deriv": { "value": 449.0 } }, { "key_as_string": "2019-07-01T00:00:00.000Z", "key": 1561939200000, "doc_count": 2, "total_visits": { "value": 2887.0 }, "visits_deriv": { "value": 226.0 } }, { "key_as_string": "2019-08-01T00:00:00.000Z", "key": 1564617600000, "doc_count": 2, "total_visits": { "value": 2966.0 }, "visits_deriv": { "value": 79.0 } }, { "key_as_string": "2019-09-01T00:00:00.000Z", "key": 1567296000000, "doc_count": 2, "total_visits": { "value": 3121.0 }, "visits_deriv": { "value": 155.0 } } ] } } Copy code

If a careful developer can see from the above results, the visits_deriv here is actually obtained by subtracting the value of the two adjacent query results from the total_visits in the previous example. If you compare two adjacent buckets, you will find that the first derivative is the difference between the total visits of the current bucket and the previous bucket. E.g:

{ "key_as_string": "2019-08-01T00:00:00.000Z", "key": 1564617600000, "doc_count": 2, "total_visits": { "value": 2966.0 }, "visits_deriv": { "value": 79.0 } }, { "key_as_string": "2019-09-01T00:00:00.000Z", "key": 1567296000000, "doc_count": 2, "total_visits": { "value": 3121.0 }, "visits_deriv": { "value": 155.0 } } Copy code

As you can see, the total number of visits in August 2018 was 2,966, while the number of visits in September 2019 was 3121. If we subtract 2966 times from 3121 times, we will get a first derivative value of 155.0. It's that simple!

We can actually display this data in Kibana. In order to display the data, we must create an index pattern of traffic_stats. If you don't know how to create an index pattern, please read my other article " Kibana: How to use Search Bar ".

In order to visualize the derivative, we need to select the custom indicator used by the derivative pipeline aggregation and derivative, that is, the sum on the "visits" field. On the X axis, we should use the "monthly" interval to define the "date histogram" aggregation on the "date" field. After running the visualization, Kibana will create vertical lines for each derivative. The positive derivative will be placed near the top of the graph, and the negative derivative will be placed near the bottom of the graph.

 

Second-order Derivative

The second derivative is the double derivative or the derivative of the derivative. It measures how the rate of change of the quantity itself changes.

In Elasticsearch, we can use the derivative pipeline aggregation to link to the output of another derivative pipeline aggregation to calculate the second derivative. In this way, we first calculate the first derivative, and then calculate the second derivative based on the first derivative. Let's look at the following example:

GET traffic_stats/_search { "size": 0, "aggs": { "visits_per_month": { "date_histogram": { "field": "date", "interval": "month" }, "aggs": { "total_visits": { "sum": { "field": "visits" } }, "visits_deriv": { "derivative": { "buckets_path": "total_visits" } }, "visits_2nd_deriv": { "derivative": { "buckets_path": "visits_deriv" } } } } } } Copy code

As you can see, the first derivative uses the total_visits path calculated from the sum, while the second derivative uses the visits_deriv path to the first derivative pipeline. In this way, we can treat the second derivative calculation as a two-pipe aggregation. The above query should return the following response:

"aggregations": { "visits_per_month": { "buckets": [ { "key_as_string": "2018-10-01T00:00:00.000Z", "key": 1538352000000, "doc_count": 3, "total_visits": { "value": 2060.0 } }, { "key_as_string": "2018-11-01T00:00:00.000Z", "key": 1541030400000, "doc_count": 3, "total_visits": { "value": 2141.0 }, "visits_deriv": { "value": 81.0 } }, { "key_as_string": "2018-12-01T00:00:00.000Z", "key": 1543622400000, "doc_count": 3, "total_visits": { "value": 2949.0 }, "visits_deriv": { "value": 808.0 }, "visits_2nd_deriv": { "value": 727.0 } }, { "key_as_string": "2019-01-01T00:00:00.000Z", "key": 1546300800000, "doc_count": 2, "total_visits": { "value": 1844.0 }, "visits_deriv": { "value": -1105.0 }, "visits_2nd_deriv": { "value": -1913.0 } }, { "key_as_string": "2019-02-01T00:00:00.000Z", "key": 1548979200000, "doc_count": 2, "total_visits": { "value": 2411.0 }, "visits_deriv": { "value": 567.0 }, "visits_2nd_deriv": { "value": 1672.0 } }, { "key_as_string": "2019-03-01T00:00:00.000Z", "key": 1551398400000, "doc_count": 2, "total_visits": { "value": 3103.0 }, "visits_deriv": { "value": 692.0 }, "visits_2nd_deriv": { "value": 125.0 } }, { "key_as_string": "2019-04-01T00:00:00.000Z", "key": 1554076800000, "doc_count": 2, "total_visits": { "value": 2639.0 }, "visits_deriv": { "value": -464.0 }, "visits_2nd_deriv": { "value": -1156.0 } }, { "key_as_string": "2019-05-01T00:00:00.000Z", "key": 1556668800000, "doc_count": 2, "total_visits": { "value": 2212.0 }, "visits_deriv": { "value": -427.0 }, "visits_2nd_deriv": { "value": 37.0 } }, { "key_as_string": "2019-06-01T00:00:00.000Z", "key": 1559347200000, "doc_count": 2, "total_visits": { "value": 2661.0 }, "visits_deriv": { "value": 449.0 }, "visits_2nd_deriv": { "value": 876.0 } }, { "key_as_string": "2019-07-01T00:00:00.000Z", "key": 1561939200000, "doc_count": 2, "total_visits": { "value": 2887.0 }, "visits_deriv": { "value": 226.0 }, "visits_2nd_deriv": { "value": -223.0 } }, { "key_as_string": "2019-08-01T00:00:00.000Z", "key": 1564617600000, "doc_count": 2, "total_visits": { "value": 2966.0 }, "visits_deriv": { "value": 79.0 }, "visits_2nd_deriv": { "value": -147.0 } }, { "key_as_string": "2019-09-01T00:00:00.000Z", "key": 1567296000000, "doc_count": 2, "total_visits": { "value": 3121.0 }, "visits_deriv": { "value": 155.0 }, "visits_2nd_deriv": { "value": 76.0 } } ] } } Copy code

Let us take a closer look at the two adjacent buckets and see what the second derivative really means:

{ "key_as_string": "2018-11-01T00:00:00.000Z", "key": 1541030400000, "doc_count": 3, "total_visits": { "value": 2141.0 }, "visits_deriv": { "value": 81.0 } }, { "key_as_string": "2018-12-01T00:00:00.000Z", "key": 1543622400000, "doc_count": 3, "total_visits": { "value": 2949.0 }, "visits_deriv": { "value": 808.0 }, "visits_2nd_deriv": { "value": 727.0 } }, Copy code

2018-12-01 2019-11-01 808 2949-2141

2018-11-01 81 2018-12-01 808.0 2018-12-01 727.0 808-81

Min and Max Bucket Aggregation

total_visits

POST traffic_stats/_search { "size": 0, "aggs": { "visits_per_month": { "date_histogram": { "field": "date", "interval": "month" }, "aggs": { "total_visits": { "sum": { "field": "visits" } } } }, "max_monthly_visits": { "max_bucket": { "buckets_path": "visits_per_month>total_visits" } } } }

"aggregations" : { "visits_per_month" : { "buckets" : [ { "key_as_string" : "2018-10-01T00:00:00.000Z", "key" : 1538352000000, "doc_count" : 3, "total_visits" : { "value" : 2060.0 } }, { "key_as_string" : "2018-11-01T00:00:00.000Z", "key" : 1541030400000, "doc_count" : 3, "total_visits" : { "value" : 2141.0 } }, { "key_as_string" : "2018-12-01T00:00:00.000Z", "key" : 1543622400000, "doc_count" : 3, "total_visits" : { "value" : 2949.0 } }, { "key_as_string" : "2019-01-01T00:00:00.000Z", "key" : 1546300800000, "doc_count" : 2, "total_visits" : { "value" : 1844.0 } }, { "key_as_string" : "2019-02-01T00:00:00.000Z", "key" : 1548979200000, "doc_count" : 2, "total_visits" : { "value" : 2411.0 } }, { "key_as_string" : "2019-03-01T00:00:00.000Z", "key" : 1551398400000, "doc_count" : 2, "total_visits" : { "value" : 3103.0 } }, { "key_as_string" : "2019-04-01T00:00:00.000Z", "key" : 1554076800000, "doc_count" : 2, "total_visits" : { "value" : 2639.0 } }, { "key_as_string" : "2019-05-01T00:00:00.000Z", "key" : 1556668800000, "doc_count" : 2, "total_visits" : { "value" : 2212.0 } }, { "key_as_string" : "2019-06-01T00:00:00.000Z", "key" : 1559347200000, "doc_count" : 2, "total_visits" : { "value" : 2661.0 } }, { "key_as_string" : "2019-07-01T00:00:00.000Z", "key" : 1561939200000, "doc_count" : 2, "total_visits" : { "value" : 2887.0 } }, { "key_as_string" : "2019-08-01T00:00:00.000Z", "key" : 1564617600000, "doc_count" : 2, "total_visits" : { "value" : 2966.0 } }, { "key_as_string" : "2019-09-01T00:00:00.000Z", "key" : 1567296000000, "doc_count" : 2, "total_visits" : { "value": 3121.0 } } ] }, "max_monthly_visits": { "value": 3121.0, "keys": [ "2019-09-01T00:00:00.000Z" ] } } Copy code

We can see from the above results that the value of max_monthly_visits is 3121.0, which is the total_visits value of all 2019-09-01. Similarly, we can also query this result through Kibana:

The smallest bucket aggregation has the same logic. To make it work, we just need to replace max_bucket with min_bucket in the query.

POST traffic_stats/_search { "size": 0, "aggs": { "visits_per_month": { "date_histogram": { "field": "date", "interval": "month" }, "aggs": { "total_visits": { "sum": { "field": "visits" } } } }, "min_monthly_visits": { "min_bucket": { "buckets_path": "visits_per_month>total_visits" } } } } Copy code

The results are:

"aggregations": { "visits_per_month": { "buckets": [ { "key_as_string": "2018-10-01T00:00:00.000Z", "key": 1538352000000, "doc_count": 3, "total_visits": { "value": 2060.0 } }, { "key_as_string": "2018-11-01T00:00:00.000Z", "key": 1541030400000, "doc_count": 3, "total_visits": { "value": 2141.0 } }, { "key_as_string": "2018-12-01T00:00:00.000Z", "key": 1543622400000, "doc_count": 3, "total_visits": { "value": 2949.0 } }, { "key_as_string": "2019-01-01T00:00:00.000Z", "key": 1546300800000, "doc_count": 2, "total_visits": { "value": 1844.0 } }, { "key_as_string": "2019-02-01T00:00:00.000Z", "key": 1548979200000, "doc_count": 2, "total_visits": { "value": 2411.0 } }, { "key_as_string": "2019-03-01T00:00:00.000Z", "key": 1551398400000, "doc_count": 2, "total_visits": { "value": 3103.0 } }, { "key_as_string": "2019-04-01T00:00:00.000Z", "key": 1554076800000, "doc_count": 2, "total_visits": { "value": 2639.0 } }, { "key_as_string": "2019-05-01T00:00:00.000Z", "key": 1556668800000, "doc_count": 2, "total_visits": { "value": 2212.0 } }, { "key_as_string": "2019-06-01T00:00:00.000Z", "key": 1559347200000, "doc_count": 2, "total_visits": { "value": 2661.0 } }, { "key_as_string": "2019-07-01T00:00:00.000Z", "key": 1561939200000, "doc_count": 2, "total_visits": { "value": 2887.0 } }, { "key_as_string": "2019-08-01T00:00:00.000Z", "key": 1564617600000, "doc_count": 2, "total_visits": { "value": 2966.0 } }, { "key_as_string": "2019-09-01T00:00:00.000Z", "key": 1567296000000, "doc_count": 2, "total_visits": { "value": 3121.0 } } ] }, "min_monthly_visits": { "value": 1844.0, "keys": [ "2019-01-01T00:00:00.000Z" ] } } Copy code

Sum and Cumulative Sum Buckets Aggregations

In some cases, you need to calculate the sum of all bucket values calculated by some other aggregation. In this case, you can use the sum bucket aggregation, which is the same-level pipeline aggregation.

Let's calculate the sum of monthly visits in all buckets:

POST traffic_stats/_search { "size": 0, "aggs": { "visits_per_month": { "date_histogram": { "field": "date", "interval": "month" }, "aggs": { "total_visits": { "sum": { "field": "visits" } } } }, "sum_monthly_visits": { "sum_bucket": { "buckets_path": "visits_per_month>total_visits" } } } } Copy code

As you can see, this pipeline aggregation is for the same level total_visits aggregation, which represents the total number of visits per month. The response should look like this:

"aggregations": { "visits_per_month": { "buckets": [ { "key_as_string": "2018-10-01T00:00:00.000Z", "key": 1538352000000, "doc_count": 3, "total_visits": { "value": 2060.0 } }, { "key_as_string": "2018-11-01T00:00:00.000Z", "key": 1541030400000, "doc_count": 3, "total_visits": { "value": 2141.0 } }, { "key_as_string": "2018-12-01T00:00:00.000Z", "key": 1543622400000, "doc_count": 3, "total_visits": { "value": 2949.0 } }, { "key_as_string": "2019-01-01T00:00:00.000Z", "key": 1546300800000, "doc_count": 2, "total_visits": { "value": 1844.0 } }, { "key_as_string": "2019-02-01T00:00:00.000Z", "key": 1548979200000, "doc_count": 2, "total_visits": { "value": 2411.0 } }, { "key_as_string": "2019-03-01T00:00:00.000Z", "key": 1551398400000, "doc_count": 2, "total_visits": { "value": 3103.0 } }, { "key_as_string": "2019-04-01T00:00:00.000Z", "key": 1554076800000, "doc_count": 2, "total_visits": { "value": 2639.0 } }, { "key_as_string": "2019-05-01T00:00:00.000Z", "key": 1556668800000, "doc_count": 2, "total_visits": { "value": 2212.0 } }, { "key_as_string": "2019-06-01T00:00:00.000Z", "key": 1559347200000, "doc_count": 2, "total_visits": { "value": 2661.0 } }, { "key_as_string": "2019-07-01T00:00:00.000Z", "key": 1561939200000, "doc_count": 2, "total_visits": { "value": 2887.0 } }, { "key_as_string": "2019-08-01T00:00:00.000Z", "key": 1564617600000, "doc_count": 2, "total_visits": { "value": 2966.0 } }, { "key_as_string": "2019-09-01T00:00:00.000Z", "key": 1567296000000, "doc_count": 2, "total_visits": { "value": 3121.0 } } ] }, "sum_monthly_visits": { "value": 30994.0 } } Copy code

Therefore, our total pipeline aggregation simply calculates the sum of monthly visits for each bucket, which itself is the sum of all monthly visits calculated by the sibling sum aggregation.

Cumulative sums use different methods. Generally, the cumulative sum is a sequence of partial sums of a given sequence. For example, the cumulative sum of the sequence {a, b, c,...} is a, a + b, a + b + c,...

Cumulative sun aggregation is the parent pipeline aggregation, used to calculate the cumulative sum of the specified indicators in the parent histogram (or date_histogram) aggregation. Like other parent pipeline aggregations, the specified indicator must be numeric, and the closed histogram must have min_doc_count set to 0 (the default setting for histogram aggregation).

POST traffic_stats/_search { "size": 0, "aggs": { "visits_per_month": { "date_histogram": { "field": "date", "interval": "month" }, "aggs": { "total_visits": { "sum": { "field": "visits" } }, "cumulative_visits": { "cumulative_sum": { "buckets_path": "total_visits" } } } } } } Copy code

We can also use Kibana to manipulate and display this data:

The corresponding data is:

"aggregations": { "visits_per_month": { "buckets": [ { "key_as_string": "2018-10-01T00:00:00.000Z", "key": 1538352000000, "doc_count": 3, "total_visits": { "value": 2060.0 }, "cumulative_visits": { "value": 2060.0 } }, { "key_as_string": "2018-11-01T00:00:00.000Z", "key": 1541030400000, "doc_count": 3, "total_visits": { "value": 2141.0 }, "cumulative_visits": { "value": 4201.0 } }, { "key_as_string": "2018-12-01T00:00:00.000Z", "key": 1543622400000, "doc_count": 3, "total_visits": { "value": 2949.0 }, "cumulative_visits": { "value" : 7150.0 } }, { "key_as_string" : "2019-01-01T00:00:00.000Z", "key" : 1546300800000, "doc_count" : 2, "total_visits" : { "value" : 1844.0 }, "cumulative_visits" : { "value" : 8994.0 } }, { "key_as_string" : "2019-02-01T00:00:00.000Z", "key" : 1548979200000, "doc_count" : 2, "total_visits" : { "value" : 2411.0 }, "cumulative_visits" : { "value" : 11405.0 } }, { "key_as_string" : "2019-03-01T00:00:00.000Z", "key" : 1551398400000, "doc_count" : 2, "total_visits" : { "value" : 3103.0 }, "cumulative_visits" : { "value" : 14508.0 } }, { "key_as_string" : "2019-04-01T00:00:00.000Z", "key" : 1554076800000, "doc_count" : 2, "total_visits" : { "value" : 2639.0 }, "cumulative_visits" : { "value" : 17147.0 } }, { "key_as_string" : "2019-05-01T00:00:00.000Z", "key" : 1556668800000, "doc_count" : 2, "total_visits" : { "value" : 2212.0 }, "cumulative_visits" : { "value" : 19359.0 } }, { "key_as_string" : "2019-06-01T00:00:00.000Z", "key" : 1559347200000, "doc_count" : 2, "total_visits" : { "value" : 2661.0 }, "cumulative_visits" : { "value" : 22020.0 } }, { "key_as_string" : "2019-07-01T00:00:00.000Z", "key" : 1561939200000, "doc_count" : 2, "total_visits" : { "value" : 2887.0 }, "cumulative_visits" : { "value" : 24907.0 } }, { "key_as_string" : "2019-08-01T00:00:00.000Z", "key" : 1564617600000, "doc_count" : 2, "total_visits" : { "value" : 2966.0 }, "cumulative_visits" : { "value" : 27873.0 } }, { "key_as_string" : "2019-09-01T00:00:00.000Z", "key" : 1567296000000, "doc_count" : 2, "total_visits" : { "value" : 3121.0 }, "cumulative_visits" : { "value" : 30994.0 } } ] } }

1 www.elastic.co/guide/en/el

2 qbox.io/blog/compre