Elasticsearch: Aggregating everything: new aggregations in Elasticsearch 7

Since version 1.0, the aggregation framework has been an important part of Elasticsearch. Over the years it has been optimized, fixed, and in places overhauled. Since Elasticsearch 7.0, many new aggregations have been added, such as the rare_terms, top_metrics, or auto_date_histogram aggregation. In this blog post we will explore a few of them and take a closer look at what they can do for you. To try out these new aggregations, we will set up a small sample data set on an Elasticsearch 7.9 deployment.

The following documents could represent an e-commerce use case in which a user clicks on a product and retrieves its details. Of course, many details are missing, such as a session ID per user, which would let you reconstruct and monitor sessions, or even go further and use transforms to gain additional insight into the data. This article keeps things deliberately simple so that all the concepts are easy to follow.

Use the Dev Tools console in Kibana, or curl from the command line, to index the sample data:

PUT website-analytics/_bulk?refresh
{"index":{}}
{"product_id":"123","@timestamp":"2020-10-01T11:11:23.000Z","price":12.34,"response_time_ms":242}
{"index":{}}
{"product_id":"456","@timestamp":"2020-10-02T12:14:00.000Z","price":20.58,"response_time_ms":98}
{"index":{}}
{"product_id":"789","@timestamp":"2020-10-03T13:15:00.000Z","price":34.16,"response_time_ms":123}
{"index":{}}
{"product_id":"123","@timestamp":"2020-10-02T14:16:00.000Z","price":12.34,"response_time_ms":465}
{"index":{}}
{"product_id":"123","@timestamp":"2020-10-02T14:18:00.000Z","price":12.34,"response_time_ms":158}
{"index":{}}
{"product_id":"123","@timestamp":"2020-10-03T15:17:00.000Z","price":12.34,"response_time_ms":168}
{"index":{}}
{"product_id":"789","@timestamp":"2020-10-06T15:17:00.000Z","price":34.16,"response_time_ms":220}
{"index":{}}
{"product_id":"789","@timestamp":"2020-10-10T15:17:00.000Z","price":34.16,"response_time_ms":99}

The data above resembles time series data, with just a handful of events per day.

Auto-bucketing aggregations

These aggregations automatically adjust how the buckets are defined. When you run a time-based aggregation, you usually define the buckets via a time interval (for example, 1d). Sometimes, however, you do not know the nature of the data up front, and it is easier to specify the number of buckets you expect instead.

This is where the following two new aggregations come into play.

auto_date_histogram Aggregation

The auto_date_histogram aggregation runs on date fields and allows you to configure the number of buckets you expect to return. Let's try it on a small data set:

POST website-analytics/_search?size=0
{
  "aggs": {
    "views_over_time": {
      "auto_date_histogram": {
        "field": "@timestamp",
        "buckets": 3
      }
    }
  }
}

Running the above aggregation produces this result:

{ "took": 1, "timed_out": false, "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 }, "hits": { "total": { "value": 8, "relation": "eq" }, "max_score": null, "hits": [] }, "aggregations": { "views_over_time": { "buckets": [ { "key_as_string": "2020-10-01T00:00:00.000Z", "key": 1601510400000, "doc_count": 7 }, { "key_as_string": "2020-10-08T00:00:00.000Z", "key": 1602115200000, "doc_count": 1 } ], "interval": "7d" } } }

We then run the following aggregation:

POST website-analytics/_search?size=0
{
  "aggs": {
    "views_over_time": {
      "auto_date_histogram": {
        "field": "@timestamp",
        "buckets": 10
      }
    }
  }
}

The result of the above command is:

{ "took": 0, "timed_out": false, "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 }, "hits": { "total": { "value": 8, "relation": "eq" }, "max_score": null, "hits": [] }, "aggregations": { "views_over_time": { "buckets": [ { "key_as_string": "2020-10-01T00:00:00.000Z", "key": 1601510400000, "doc_count": 1 }, { "key_as_string": "2020-10-02T00:00:00.000Z", "key": 1601596800000, "doc_count": 3 }, { "key_as_string": "2020-10-03T00:00:00.000Z", "key": 1601683200000, "doc_count": 2 }, { "key_as_string": "2020-10-04T00:00:00.000Z", "key": 1601769600000, "doc_count": 0 }, { "key_as_string": "2020-10-05T00:00:00.000Z", "key": 1601856000000, "doc_count": 0 }, { "key_as_string": "2020-10-06T00:00:00.000Z", "key": 1601942400000, "doc_count": 1 }, { "key_as_string": "2020-10-07T00:00:00.000Z", "key": 1602028800000, "doc_count": 0 }, { "key_as_string": "2020-10-08T00:00:00.000Z", "key": 1602115200000, "doc_count": 0 }, { "key_as_string": "2020-10-09T00:00:00.000Z", "key": 1602201600000, "doc_count": 0 }, { "key_as_string": "2020-10-10T00:00:00.000Z", "key": 1602288000000, "doc_count": 1 } ], "interval": "1d" } } }

Running these two queries shows that the returned interval depends on the number of buckets requested: asking for 3 buckets yields one bucket per week, while asking for 10 buckets yields one bucket per day.

If you need a minimum interval, you can configure that as well; see the auto_date_histogram documentation.
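For example, if the buckets should never be finer than one day, a request could look roughly like the following sketch; minimum_interval caps how small the automatically chosen interval can get:

POST website-analytics/_search?size=0
{
  "aggs": {
    "views_over_time": {
      "auto_date_histogram": {
        "field": "@timestamp",
        "buckets": 10,
        "minimum_interval": "day"
      }
    }
  }
}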

variable_width_histogram Aggregation

The variable width histogram dynamically creates a pre-configured number of buckets. Most importantly, the width of these buckets is variable, in contrast to the fixed width of the regular histogram aggregation.

POST website-analytics/_search?size=0
{
  "aggs": {
    "prices": {
      "variable_width_histogram": {
        "field": "price",
        "buckets": 3
      }
    }
  }
}

Since there are only three distinct prices in our data set, the min/max/key values within each bucket are identical. If you request two buckets instead, you will see that one bucket then spans different values. Running the above aggregation returns:

{ "took": 0, "timed_out": false, "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 }, "hits": { "total": { "value": 8, "relation": "eq" }, "max_score": null, "hits": [] }, "aggregations": { "prices": { "buckets": [ { "min": 12.34000015258789, "key": 12.34000015258789, "max": 12.34000015258789, "doc_count": 4 }, { "min": 20.579999923706055, "key": 20.579999923706055, "max": 20.579999923706055, "doc_count": 1 }, { "min": 34.15999984741211, "key": 34.15999984741211, "max": 34.15999984741211, "doc_count": 3 } ] } } }

If we set buckets to 2, then this is the result:

{ "took": 0, "timed_out": false, "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 }, "hits": { "total": { "value": 8, "relation": "eq" }, "max_score": null, "hits": [] }, "aggregations": { "prices": { "buckets": [ { "min": 12.34000015258789, "key": 13.988000106811523, "max": 20.579999923706055, "doc_count": 5 }, { "min": 34.15999984741211, "key": 34.15999984741211, "max": 34.15999984741211, "doc_count": 3 } ] } } }

Also keep in mind: the bucket bounds are approximate.

One use case might be an e-commerce application where you want to display price ranges as part of faceted navigation. Be aware, though, that this approach makes your navigation quite sensitive to outliers, so consider filtering on a category first, as in the sketch below.
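A rough sketch of such a filtered request, assuming the documents also carried a hypothetical category field (our minimal sample data does not have one; the comment syntax works in the Kibana Dev Tools console):

POST website-analytics/_search?size=0
{
  "query": {
    // "category" is a hypothetical field, not part of the sample data above
    "term": {
      "category.keyword": "accessories"
    }
  },
  "aggs": {
    "prices": {
      "variable_width_histogram": {
        "field": "price",
        "buckets": 3
      }
    }
  }
}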

Aggregations on strings

rare_terms Aggregation

As an Elastic Stack user, you have probably used the terms aggregation and may have run into its limitations. By default, the terms aggregation returns the most frequent terms in the data set. You can reverse the order to return the least frequent terms, but that comes with a potentially unbounded error, so the result may be a poor approximation, because the data is collected across several shards in the cluster. Elasticsearch deliberately avoids copying all the data from the different shards to a single node, because doing so would be expensive and slow.
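For contrast only, the classic approach might look like the sketch below; ordering a terms aggregation by ascending count is exactly what introduces that unbounded error:

POST website-analytics/_search?size=0
{
  "aggs": {
    "least_common_products": {
      "terms": {
        "field": "product_id.keyword",
        "size": 3,
        "order": { "_count": "asc" }
      }
    }
  }
}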

The rare_terms aggregation tries to circumvent these problems with a different implementation. Even though it still does approximate counting, the rare_terms aggregation comes with a well-defined, bounded error.

To find the product IDs that occur least often in the data set above, try the following:

POST website-analytics/_search?size=0
{
  "aggs": {
    "rarest_product_ids": {
      "rare_terms": {
        "field": "product_id.keyword"
      }
    }
  }
}

The result returned by the above command is:

{ "took": 6, "timed_out": false, "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 }, "hits": { "total": { "value": 8, "relation": "eq" }, "max_score": null, "hits": [] }, "aggregations": { "rarest_product_ids": { "buckets": [ { "key": "456", "doc_count": 1 } ] } } }

You can also set max_doc_count (the counterpart of min_doc_count in the terms aggregation) and change the number of buckets to be returned, as in the sketch below.
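For instance, a variant that also returns terms occurring at most twice might look like this sketch:

POST website-analytics/_search?size=0
{
  "aggs": {
    "rarest_product_ids": {
      "rare_terms": {
        "field": "product_id.keyword",
        "max_doc_count": 2
      }
    }
  }
}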

 

string_stats Aggregation

What if you want some statistics about the values of a string field in your data? Let's try the string_stats aggregation:

POST website-analytics/_search?size=0
{
  "aggs": {
    "rarest_product_ids": {
      "string_stats": {
        "field": "product_id.keyword",
        "show_distribution": true
      }
    }
  }
}

The result displayed by the above command is:

{ "took": 1, "timed_out": false, "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 }, "hits": { "total": { "value": 8, "relation": "eq" }, "max_score": null, "hits": [] }, "aggregations": { "rarest_product_ids": { "count": 8, "min_length": 3, "max_length": 3, "avg_length": 3.0, "entropy": 2.9906015629507223, "distribution": { "1": 0.16666666666666666, "2": 0.16666666666666666, "3": 0.16666666666666666, "7": 0.125, "8": 0.125, "9": 0.125, "4": 0.041666666666666664, "5": 0.041666666666666664, "6": 0.041666666666666664 } } } }

This returns statistics about the minimum/maximum/average length of the strings in that field, and by adding the show_distribution parameter you also get the distribution of every character found.

This is an easy way to sanity-check your data and spot outliers, for example documents that were indexed incorrectly with product IDs that are too long or too short. Likewise, the returned Shannon entropy can be used for things like detecting DNS data exfiltration attempts.

Metrics based aggregations

Let's dive into the second group of aggregations: metrics calculated on numeric fields, usually nested inside bucket aggregations.

top_metrics Aggregation

You may already know the top_hits aggregation, which returns the complete hit including its source. However, if you are only interested in a single value, and you want to sort by it, take a look at the top_metrics aggregation. When the whole document is not needed, it is considerably faster than top_hits, and it is commonly used to retrieve the latest value from each bucket.

In our clickstream dataset, you may be interested in the price of the latest click event.

POST website-analytics/_search
{
  "size": 0,
  "aggs": {
    "tm": {
      "top_metrics": {
        "metrics": { "field": "price" },
        "sort": { "@timestamp": "desc" }
      }
    }
  }
}

The above returns the following results:

{ "took": 12, "timed_out": false, "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 }, "hits": { "total": { "value": 8, "relation": "eq" }, "max_score": null, "hits": [] }, "aggregations": { "tm": { "top": [ { "sort": [ "2020-10-10T15:17:00.000Z" ], "metrics": { "price": 34.15999984741211 } } ] } } }

Sorting also supports _score or geo distance. In addition, you can request several metrics at once: add another field and turn the metrics parameter into an array, as in the sketch below.
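A sketch of such a request with two metrics, with metrics written as an array:

POST website-analytics/_search
{
  "size": 0,
  "aggs": {
    "tm": {
      "top_metrics": {
        "metrics": [
          { "field": "price" },
          { "field": "response_time_ms" }
        ],
        "sort": { "@timestamp": "desc" }
      }
    }
  }
}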

boxplot Aggregation

The boxplot aggregation does exactly what its name says: it provides the data for a box plot:

GET website-analytics/_search
{
  "size": 0,
  "aggs": {
    "by_date": {
      "date_histogram": {
        "field": "@timestamp",
        "calendar_interval": "day",
        "min_doc_count": 1
      },
      "aggs": {
        "load_time_boxplot": {
          "boxplot": {
            "field": "price"
          }
        }
      }
    }
  }
}

The result returned by the above query is:

{ "took": 6, "timed_out": false, "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 }, "hits": { "total": { "value": 8, "relation": "eq" }, "max_score": null, "hits": [] }, "aggregations": { "by_date": { "buckets": [ { "key_as_string": "2020-10-01T00:00:00.000Z", "key": 1601510400000, "doc_count": 1, "load_time_boxplot": { "min": 12.34000015258789, "max": 12.34000015258789, "q1": 12.34000015258789, "q2": 12.34000015258789, "q3": 12.34000015258789 } }, { "key_as_string": "2020-10-02T00:00:00.000Z", "key": 1601596800000, "doc_count": 3, "load_time_boxplot": { "min": 12.34000015258789, "max": 20.579999923706055, "q1": 12.34000015258789, "q2": 12.34000015258789, "q3": 18.519999980926514 } }, { "key_as_string": "2020-10-03T00:00:00.000Z", "key": 1601683200000, "doc_count": 2, "load_time_boxplot": { "min": 12.34000015258789, "max": 34.15999984741211, "q1": 12.34000015258789, "q2": 23.25, "q3": 34.15999984741211 } }, { "key_as_string": "2020-10-06T00:00:00.000Z", "key": 1601942400000, "doc_count": 1, "load_time_boxplot": { "min": 34.15999984741211, "max": 34.15999984741211, "q1": 34.15999984741211, "q2": 34.15999984741211, "q3": 34.15999984741211 } }, { "key_as_string": "2020-10-10T00:00:00.000Z", "key": 1602288000000, "doc_count": 1, "load_time_boxplot": { "min": 34.15999984741211, "max": 34.15999984741211, "q1": 34.15999984741211, "q2": 34.15999984741211, "q3": 34.15999984741211 } } ] } } }

The query above returns a box plot for every daily bucket that contains data.

We will skip the t_test aggregation, because our tiny data set does not allow for any meaningful request. To see the value of that aggregation, you need a data set in which you suspect a change in behavior that can be confirmed or rejected via statistical hypothesis testing.

 

Pipeline aggregations

Next up are aggregations that run on top of the output of other aggregations, known as pipeline aggregations. Quite a few of these have been added over the last year.

cumulative_cardinality Aggregation

This is a useful aggregation for finding the number of new items in the data set.

GET website-analytics/_search
{
  "size": 0,
  "aggs": {
    "by_day": {
      "date_histogram": {
        "field": "@timestamp",
        "calendar_interval": "day"
      },
      "aggs": {
        "distinct_products": {
          "cardinality": {
            "field": "product_id.keyword"
          }
        },
        "total_new_products": {
          "cumulative_cardinality": {
            "buckets_path": "distinct_products"
          }
        }
      }
    }
  }
}

The result returned by the above query is:

{ "took": 9, "timed_out": false, "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 }, "hits": { "total": { "value": 8, "relation": "eq" }, "max_score": null, "hits": [] }, "aggregations": { "by_day": { "buckets": [ { "key_as_string": "2020-10-01T00:00:00.000Z", "key": 1601510400000, "doc_count": 1, "distinct_products": { "value": 1 }, "total_new_products": { "value": 1 } }, { "key_as_string": "2020-10-02T00:00:00.000Z", "key": 1601596800000, "doc_count": 3, "distinct_products": { "value": 2 }, "total_new_products": { "value": 2 } }, { "key_as_string": "2020-10-03T00:00:00.000Z", "key": 1601683200000, "doc_count": 2, "distinct_products": { "value": 2 }, "total_new_products": { "value": 3 } }, { "key_as_string": "2020-10-04T00:00:00.000Z", "key": 1601769600000, "doc_count": 0, "distinct_products": { "value": 0 }, "total_new_products": { "value": 3 } }, { "key_as_string": "2020-10-05T00:00:00.000Z", "key": 1601856000000, "doc_count": 0, "distinct_products": { "value": 0 }, "total_new_products": { "value": 3 } }, { "key_as_string": "2020-10-06T00:00:00.000Z", "key": 1601942400000, "doc_count": 1, "distinct_products": { "value": 1 }, "total_new_products": { "value": 3 } }, { "key_as_string": "2020-10-07T00:00:00.000Z", "key": 1602028800000, "doc_count": 0, "distinct_products": { "value": 0 }, "total_new_products": { "value": 3 } }, { "key_as_string": "2020-10-08T00:00:00.000Z", "key": 1602115200000, "doc_count": 0, "distinct_products": { "value": 0 }, "total_new_products": { "value": 3 } }, { "key_as_string": "2020-10-09T00:00:00.000Z", "key": 1602201600000, "doc_count": 0, "distinct_products": { "value": 0 }, "total_new_products": { "value": 3 } }, { "key_as_string": "2020-10-10T00:00:00.000Z", "key": 1602288000000, "doc_count": 1, "distinct_products": { "value": 1 }, "total_new_products": { "value": 3 } } ] } } }

With the query above you can count how many previously unseen products are viewed each day, and keep a running total of them. In an e-commerce setting, this may help you figure out whether your new products are getting real attention, or whether only your long-standing bestsellers top the list and you should perhaps rethink your marketing.

normalize Aggregation

Let's try to find out which day had the highest share of the traffic, where 100% is all of the data matching the query.

GET website-analytics/_search
{
  "size": 0,
  "aggs": {
    "by_day": {
      "date_histogram": {
        "field": "@timestamp",
        "calendar_interval": "day"
      },
      "aggs": {
        "normalize": {
          "normalize": {
            "buckets_path": "_count",
            "method": "percent_of_sum"
          }
        }
      }
    }
  }
}

The result returned above is:

{ "took": 3, "timed_out": false, "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 }, "hits": { "total": { "value": 8, "relation": "eq" }, "max_score": null, "hits": [] }, "aggregations": { "by_day": { "buckets": [ { "key_as_string": "2020-10-01T00:00:00.000Z", "key": 1601510400000, "doc_count": 1, "normalize": { "value": 0.125 } }, { "key_as_string": "2020-10-02T00:00:00.000Z", "key": 1601596800000, "doc_count": 3, "normalize": { "value": 0.375 } }, { "key_as_string": "2020-10-03T00:00:00.000Z", "key": 1601683200000, "doc_count": 2, "normalize": { "value": 0.25 } }, { "key_as_string": "2020-10-04T00:00:00.000Z", "key": 1601769600000, "doc_count": 0, "normalize": { "value": 0.0 } }, { "key_as_string": "2020-10-05T00:00:00.000Z", "key": 1601856000000, "doc_count": 0, "normalize": { "value": 0.0 } }, { "key_as_string": "2020-10-06T00:00:00.000Z", "key": 1601942400000, "doc_count": 1, "normalize": { "value": 0.125 } }, { "key_as_string": "2020-10-07T00:00:00.000Z", "key": 1602028800000, "doc_count": 0, "normalize": { "value": 0.0 } }, { "key_as_string": "2020-10-08T00:00:00.000Z", "key": 1602115200000, "doc_count": 0, "normalize": { "value": 0.0 } }, { "key_as_string": "2020-10-09T00:00:00.000Z", "key": 1602201600000, "doc_count": 0, "normalize": { "value": 0.0 } }, { "key_as_string": "2020-10-10T00:00:00.000Z", "key": 1602288000000, "doc_count": 1, "normalize": { "value": 0.125 } } ] } } }

This returns an additional value for each bucket: the share of documents in that bucket relative to the total number of documents matched by the search.

Take a look at the normalize aggregation documentation, as there are more method values to choose from, such as mean or rescaling to a fixed range.
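As an example, rescaling the daily counts onto a 0-100 range instead of a share of the total could look roughly like this:

GET website-analytics/_search
{
  "size": 0,
  "aggs": {
    "by_day": {
      "date_histogram": {
        "field": "@timestamp",
        "calendar_interval": "day"
      },
      "aggs": {
        "normalize": {
          "normalize": {
            "buckets_path": "_count",
            "method": "rescale_0_100"
          }
        }
      }
    }
  }
}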

moving_percentiles Aggregation

This pipeline aggregation works on top of a percentiles aggregation and uses a sliding window to calculate moving percentiles.

GET website-analytics/_search
{
  "size": 0,
  "aggs": {
    "by_day": {
      "date_histogram": {
        "field": "@timestamp",
        "calendar_interval": "day"
      },
      "aggs": {
        "response_time": {
          "percentiles": {
            "field": "response_time_ms",
            "percents": [75, 99]
          }
        },
        "moving_pct": {
          "moving_percentiles": {
            "buckets_path": "response_time",
            "window": 2
          }
        }
      }
    }
  }
}

The result returned by the above query is:

{ "took": 4, "timed_out": false, "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 }, "hits": { "total": { "value": 8, "relation": "eq" }, "max_score": null, "hits": [] }, "aggregations": { "by_day": { "buckets": [ { "key_as_string": "2020-10-01T00:00:00.000Z", "key": 1601510400000, "doc_count": 1, "response_time": { "values": { "75.0": 242.0, "99.0": 242.0 } } }, { "key_as_string": "2020-10-02T00:00:00.000Z", "key": 1601596800000, "doc_count": 3, "response_time": { "values": { "75.0": 388.25, "99.0": 465.0 } }, "moving_pct": { "values": { "75.0": 242.0, "99.0": 242.0 } } }, { "key_as_string": "2020-10-03T00:00:00.000Z", "key": 1601683200000, "doc_count": 2, "response_time": { "values": { "75.0": 168.0, "99.0": 168.0 } }, "moving_pct": { "values": { "75.0": 353.5, "99.0": 465.0 } } }, { "key_as_string": "2020-10-04T00:00:00.000Z", "key": 1601769600000, "doc_count": 0, "response_time": { "values": { "75.0": null, "99.0": null } }, "moving_pct": { "values": { "75.0": 242.25, "99.0": 465.0 } } }, { "key_as_string": "2020-10-05T00:00:00.000Z", "key": 1601856000000, "doc_count": 0, "response_time": { "values": { "75.0": null, "99.0": null } }, "moving_pct": { "values": { "75.0": 168.0, "99.0": 168.0 } } }, { "key_as_string": "2020-10-06T00:00:00.000Z", "key": 1601942400000, "doc_count": 1, "response_time": { "values": { "75.0": 220.0, "99.0": 220.0 } }, "moving_pct": { "values": { "75.0": null, "99.0": null } } }, { "key_as_string": "2020-10-07T00:00:00.000Z", "key": 1602028800000, "doc_count": 0, "response_time": { "values": { "75.0": null, "99.0": null } }, "moving_pct": { "values": { "75.0": 220.0, "99.0": 220.0 } } }, { "key_as_string": "2020-10-08T00:00:00.000Z", "key": 1602115200000, "doc_count": 0, "response_time": { "values": { "75.0": null, "99.0": null } }, "moving_pct": { "values": { "75.0": 220.0, "99.0": 220.0 } } }, { "key_as_string": "2020-10-09T00:00:00.000Z", "key": 1602201600000, "doc_count": 0, "response_time": { "values": { "75.0": null, "99.0": null } }, "moving_pct": { "values": { "75.0": null, "99.0": null } } }, { "key_as_string": "2020-10-10T00:00:00.000Z", "key": 1602288000000, "doc_count": 1, "response_time": { "values": { "75.0": 99.0, "99.0": 99.0 } }, "moving_pct": { "values": { "75.0": null, "99.0": null } } } ] } } }

Let's unpack this a little. After bucketing by day, the percentiles aggregation calculates the percentiles for each daily bucket. Then the moving_percentiles pipeline aggregation takes the previous two buckets (the window) and calculates the moving percentiles from them. Note that you can change which buckets are used, for example to include the current bucket as well, via the shift parameter, as in the sketch below.
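A sketch with shift set to 1, so that the window moves one bucket to the right and includes the current bucket:

GET website-analytics/_search
{
  "size": 0,
  "aggs": {
    "by_day": {
      "date_histogram": {
        "field": "@timestamp",
        "calendar_interval": "day"
      },
      "aggs": {
        "response_time": {
          "percentiles": {
            "field": "response_time_ms",
            "percents": [75, 99]
          }
        },
        "moving_pct": {
          "moving_percentiles": {
            "buckets_path": "response_time",
            "window": 2,
            "shift": 1
          }
        }
      }
    }
  }
}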

 

pipeline inference Aggregation

We will skip the inference bucket aggregation, because we are planning a dedicated blog post about it soon. To whet your appetite: it lets you run a pre-trained model against the results of its parent bucket aggregation. Stay tuned!

 

Support for the histogram field type

Strictly speaking, this is not an aggregation, but aggregations are affected by this data type, so it is worth mentioning.

You may have missed the histogram field type, which allows you to store pre-aggregated numerical data: for example, this is used heavily in Elastic Observability. This special field type supports a subset of aggregations, and you will still find some aggregations that are not supported yet; more work is being done to support them.
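A minimal sketch of what a mapping with a histogram field and a pre-aggregated document might look like (the index and field names here are made up for illustration; the comment syntax works in the Kibana Dev Tools console):

// hypothetical index and field names, for illustration only
PUT pre-aggregated-metrics
{
  "mappings": {
    "properties": {
      "response_times": { "type": "histogram" }
    }
  }
}

PUT pre-aggregated-metrics/_doc/1
{
  "response_times": {
    "values": [0.1, 0.25, 0.5],
    "counts": [10, 4, 1]
  }
}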

 

Support for geo_shape in geo aggregations

Again, strictly speaking this is not a single aggregation, but it is a big step forward, so it is worth mentioning.

In addition to the geo_point field type, a lot of work has gone into making the geo_bounds, geotile_grid, geohash_grid, and geo_centroid aggregations work with the geo_shape field type.
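For instance, assuming an index with a geo_shape field named location (not part of our sample data), computing the bounding box of all shapes might look like this sketch:

POST places/_search?size=0
{
  "aggs": {
    "viewport": {
      "geo_bounds": {
        // "places" and "location" are hypothetical; "location" would be mapped as geo_shape
        "field": "location"
      }
    }
  }
}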