Python: create a distribution from a list based on number of items that fall within certain ranges

Question:

I tagged this question with poisson as I am not sure if it will be helpful in this case.

I need to create a distribution (probably formatted as an image in the end) from a list of data.

For example:

data = [1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5, 10, 10, 10, 22, 30, 30, 35, 46, 58, 59, 59]

such that the data can be used to create a visual distribution. I might, for example in this case, say that the ranges are in 10 and there needs to be at least 3 items in each range to be a valid point.

With this example data, I would expect the result to be analogous to

ditribution = [1, 2, 4, 6]

since I have > 3 items in ranges 0-9, 10-19, 30-39 and 50-59. Using that result I could generate an image that has the sections segmented out (darker color) that exist in my final distribution. An example of the type of image I am trying to create can be seen below and would have been generated with far more data. Ignore the blue line for now.

I know how to do this the brute force way of iterating over every item in the list and doing my calculation like that. But, my data set may have hundreds of thousands, or even millions of numbers. My range (10) and my required number of items (3) will likely be much larger in a real world example.

distribution image

Thanks for any help.

Asked By: sberry

||

Answers:

If data is always sorted, a compact approach might be:

import itertools as it

d = [k+1 for k, L in
         ((k, len(list(g))) for k, g in it.groupby(data,key=lambda x:x//10))
     if L>=3]

If data isn’t sorted, or if you don’t know, use sorted(data) as the first argument to itertools.groupby, instead of just data.

If you prefer a less dense/compact approach, you can of course expand this, e.g. to:

def divby10(x): return x//10

distribution = []
for k, g in it.groupby(data, key=divby10):
    L = len(list(g))
    if L < 3: continue
    distribution.append(k+1)

In either case, the mechanism is that groupby first applies the callable passed as key= to each item in the iterable passed as its first argument, to obtain each item’s”key”; for each consecutive group of items which have the same “key”, groupby yields a tuple with two items: the value of the key, and an iterable over all items in said group.

Here, the key is obtained by dividing an item by 10 (with truncation); len(list(g)) is the number of consecutive items with that “key”. Since the items must be consecutive, you need the data to be sorted (and, it’s simpler to just sort it, than sort it “by value divided by 10 with truncation”;-).

Answered By: Alex Martelli

This sounds like a job for some form of histogram. Presorting should not be necessary in order to accomplish this. I discuss using a variant of bucket sort to group nearby elements here, though you’ll need to adjust this algorithm to suit your purposes. Note that you do not need to store the numbers themselves in the buckets in order to form a histogram

Answered By: Brian

Since data might be very lengthy, you may want to look into using numpy. It provides many useful functions for numerical work, it requires less memory to store data in a numpy array than a Python list[*], and, since many of the numpy functions call C functions under the hood, you may be able to obtain some speed gains:

import numpy as np

data = np.array([1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5, 10, 10, 10, 22, 30, 30, 35, 46, 58, 59, 59])

hist,bins=np.histogram(data,bins=np.linspace(0,60,7))
print(hist)
# [11  3  1  3  1  3]

distribution=np.where(hist>=3)[0]+1
print(distribution)
# [1 2 4 6]

[*] — Note: In the code above, a Python list was formed in the process of defining data. So the maximum memory requirement here is actually greater than if you had just used a Python list. The memory should get freed however, if there are no other references to the Python list. Alternatively, if the data is stored on disk, numpy.loadtxt can be used to read it directly into a numpy array.

Answered By: unutbu
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.