Elasticsearch – How to create buckets by using information from two fields at the same time?

Question:

My documents are like this:

{'start': 0, 'stop': 3, 'val': 3}
{'start': 2, 'stop': 4, 'val': 1}
{'start': 5, 'stop': 6, 'val': 4}

We can imagine that each document occupies the x-coordinates from 'start' to 'stop',
and has a certain value 'val' ('start' < 'stop' is guaranteed).

The goal is to plot a line showing, for each x-coordinate, the sum of 'val' over all
the documents which occupy that x-coordinate:

[graph: sum of 'val' plotted against x]

In reality there are many documents with many different 'start' and 'stop' coordinates. Speed is important, so:

Is it possible to do this with at most a couple of Elasticsearch requests? If so, how?


What I’ve tried:

With one Elasticsearch request we can get the min_start and max_stop coordinates. These will be the boundaries of x.

Then we divide the x-coordinates into N intervals, and in a loop we make one Elasticsearch request per interval: it filters out all the documents which lie completely outside of this interval and does a sum aggregation of 'val'.

This approach takes too much time because there are N+1 requests, and if we want to have a line with higher precision, the time will increase linearly.
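
As a side note on the filter used in the loop below: assuming closed intervals and start < stop, the three 'should' clauses appear to collapse into a single overlap test, start <= x_to and stop >= x_from. The sketch below checks that equivalence by brute force on small integers. The function names are illustrative and not part of the original code.

```python
from itertools import product


def overlaps_three_clauses(start, stop, x_from, x_to):
    """The three 'should' clauses from the request, written as plain Python."""
    start_inside = x_from <= start <= x_to
    stop_inside = x_from <= stop <= x_to
    covers_interval = start <= x_from and stop >= x_to
    return start_inside or stop_inside or covers_interval


def overlaps_single_test(start, stop, x_from, x_to):
    """Standard overlap test for two closed intervals."""
    return start <= x_to and stop >= x_from


def clauses_agree(limit=8):
    """Compare both tests for every start < stop and x_from < x_to below `limit`."""
    values = range(limit)
    return all(
        overlaps_three_clauses(s, e, a, b) == overlaps_single_test(s, e, a, b)
        for s, e, a, b in product(values, repeat=4)
        if s < e and a < b
    )
```

If the equivalence holds for your data, the query could be reduced to a 'bool' with two 'must' range clauses (start lte x_to, stop gte x_from). This simplifies each request but does not reduce the number of requests.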

Code:

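from elasticsearch import Elasticsearch

# Assumption: a cluster reachable at this URL; adjust to your setup.
es = Elasticsearch('http://localhost:9200')

N = 300  # number of intervals along x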

x = []
y = []

data = es.search(index='index_name',
                 body={
                    'size': 0,  # only the aggregations are needed, not the hits
                    'aggs': {
                        'min_start': {'min': {'field': 'start'}},
                        'max_stop': {'max': {'field': 'stop'}}
                    }
                 })
min_x = data['aggregations']['min_start']['value']
max_x = data['aggregations']['max_stop']['value']

x_from = min_x
x_step = (max_x - min_x) / N
for _ in range(N):
    x_to = x_from + x_step
    
    data = es.search(
        index='index_name', 
        body= {
            'size': 0,  # to not return any actual documents

            'query': {
                'bool': {
                    'should': [
                        # start is in the current x-interval:
                        {'bool': {'must': [
                            {'range': {'start': {'gte': x_from}}},
                            {'range': {'start': {'lte': x_to}}}
                        ]}},

                        # stop is in the current x-interval:
                        {'bool': {'must': [
                            {'range': {'stop': {'gte': x_from}}},
                            {'range': {'stop': {'lte': x_to}}}
                        ]}},

                        # current x-interval is inside start--stop
                        {'bool': {'must': [
                            {'range': {'start': {'lte': x_from}}},
                            {'range': {'stop': {'gte': x_to}}}
                        ]}}
                    ],
                    'minimum_should_match': 1  # at least 1 of these 3 conditions should match
                }
            },

            'aggs': {
                'vals_sum': {'sum': {'field': 'val'}}
            }
        }
    )
    
    # Append info to the lists:
    x.append(x_from)    
    y.append(data['aggregations']['vals_sum']['value'])
    
    # Next x-interval:
    x_from = x_to


from matplotlib import pyplot as plt
plt.plot(x, y)
Asked By: Vladimir Fokow


Answers:

The right way to do this in a single query is to use a range field type (available since 5.2) instead of two separate fields, start and stop, with the overlap logic reimplemented by hand. For example:

PUT test 
{
  "mappings": {
    "properties": {
      "range": {
        "type": "integer_range"
      },
      "val": {
        "type":"integer"
      }
    }
  }
}

Your documents would look like this:

    {
      "range" : {
        "gte" : 0,
        "lt" : 3
      },
      "val" : 3
    }
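
Existing start/stop documents would need to be reshaped before reindexing. A minimal sketch, assuming 'stop' is exclusive (hence "lt", as in the example document above) and that the field is called "range" as in the mapping; the helper name to_range_doc is hypothetical:

```python
def to_range_doc(doc, field="range"):
    """Convert a {'start', 'stop', 'val'} document into the range-field shape."""
    if doc["start"] >= doc["stop"]:
        raise ValueError("expected start < stop")
    return {
        # "lt" treats stop as exclusive; use "lte" instead if stop is inclusive
        field: {"gte": doc["start"], "lt": doc["stop"]},
        "val": doc["val"],
    }


# The three sample documents from the question
docs = [
    {"start": 0, "stop": 3, "val": 3},
    {"start": 2, "stop": 4, "val": 1},
    {"start": 5, "stop": 6, "val": 4},
]
range_docs = [to_range_doc(d) for d in docs]
```

Whether 'stop' should be inclusive or exclusive depends on how the data is meant to be read. The choice changes which bucket a boundary value falls into.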

And then the query would simply leverage a histogram aggregation like this:

POST test/_search 
{
  "size": 0,
  "aggs": {
    "histo": {
      "histogram": {
        "field": "range",
        "interval": 1
      },
      "aggs": {
        "total": {
          "sum": {
            "field": "val"
          }
        }
      }
    }
  }
}
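
From Python, the same request body can be built as a dict and the returned buckets turned into the x and y lists that the question plots. This is a sketch under a few assumptions: the aggregation names "histo" and "total" match the request above, the response has the usual histogram shape (a "buckets" list with "key" and the sub-aggregation's "value"), and the cluster is recent enough to support histograms on range fields (7.4 or later, to my knowledge). The helper names are hypothetical.

```python
def build_histogram_body(interval=1):
    """Request body equivalent to the JSON query above."""
    return {
        "size": 0,  # no hits, only aggregations
        "aggs": {
            "histo": {
                "histogram": {"field": "range", "interval": interval},
                "aggs": {"total": {"sum": {"field": "val"}}},
            }
        },
    }


def buckets_to_xy(response):
    """Extract bucket keys (x) and summed values (y) from a search response."""
    buckets = response["aggregations"]["histo"]["buckets"]
    x = [bucket["key"] for bucket in buckets]
    y = [bucket["total"]["value"] for bucket in buckets]
    return x, y
```

With the 'es' client from the question, usage would look like x, y = buckets_to_xy(es.search(index='test', body=build_histogram_body())), followed by plt.plot(x, y). The interval parameter then plays the role of the question's x_step, and the whole line comes back from one request.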

And the results are as expected: the summed values for the buckets 0 through 5 are 3, 3, 4, 1, 0, 4.
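
To see where these numbers come from without a cluster, the bucketing can be imitated in plain Python. This is only a model of the behaviour, not Elasticsearch's implementation. It assumes half-open [start, stop) ranges, an integer interval, and that empty buckets between the first and last are reported as 0, which matches the 0 in the result above.

```python
import math
from collections import defaultdict


def range_histogram(docs, interval=1):
    """Sum 'val' into every bucket that a half-open [start, stop) range touches."""
    totals = defaultdict(int)
    for doc in docs:
        first = math.floor(doc["start"] / interval)
        # stop is exclusive, so the last touched bucket is the one just below it
        last = math.ceil(doc["stop"] / interval) - 1
        for bucket in range(first, last + 1):
            totals[bucket * interval] += doc["val"]
    lo, hi = min(totals), max(totals)
    # report empty buckets between the first and last key as 0
    return [totals.get(lo + i * interval, 0) for i in range((hi - lo) // interval + 1)]


# The three sample documents from the question
docs = [
    {"start": 0, "stop": 3, "val": 3},
    {"start": 2, "stop": 4, "val": 1},
    {"start": 5, "stop": 6, "val": 4},
]
result = range_histogram(docs)  # [3, 3, 4, 1, 0, 4]
```

The document with range [0, 3) contributes 3 to buckets 0, 1 and 2; the one with [2, 4) adds 1 to buckets 2 and 3; the one with [5, 6) puts 4 into bucket 5, leaving bucket 4 empty.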

Answered By: Val