How do I iterate through data values and add a count if they are within a +- .50 range

Question

I am trying to write a program that iterates through data values from a CSV file and adds them to a dictionary, keeping a running total of how many times each value appears in the list of values I have. I am able to do this, but I need to add a range (not the range function): for example, if the current value is within + or – .50 of another, it should take the average of the two and add one to the running total.

data = {}

#Create value dictionary, add running count to repeated values
with open(fname) as file:  # fname is the path to the CSV file
    for line in file:
        rows = line.split(",")
        for i in range(4):
            price = float(rows[i])
            data[price] = data.get(price, 0) + 1

#Get top 10 most common values
top_dogs = {}
for i in range(10):
    key = max(data, key=data.get)
    value = data.pop(key)
    top_dogs[key] = value

print(top_dogs)

Asked By: TheDude

Answer 1

In general, dicts don’t have a capability for matching ranges, so you need to either collapse the range to a single value or use another data structure, such as a sorted list.

As an example of the first technique, the round() function will suffice for grouping values that lie within "+ or – .50" of the same integer:

data = [10.1, 11.2, 10.5, 12.5, 10.2, 12.6, 11.4, 11.7, 11.8]

d = {}
for x in data:
    k = round(x)
    d[k] = d.get(k, 0) + 1

For the second technique, you can maintain a sorted list with the bisect module, which is good at searching ranges and maintaining sorted order.

from statistics import mean
from bisect import bisect_left, bisect_right, insort

data = [10.1, 11.2, 10.5, 12.5, 10.2, 12.6, 11.4, 11.7, 11.8]

d = {}
sorted_list = []
for x in data:
    lo = bisect_left(sorted_list, x - 0.5)
    hi = bisect_right(sorted_list, x + 0.5)
    if lo == hi:
        new_x = x
        new_count = 1
    else:
        old_x = sorted_list.pop(lo)
        new_x = mean([old_x, x])
        new_count = d.pop(old_x) + 1
    d[new_x] = new_count
    insort(sorted_list, new_x)

Note 1: This code can be tweaked further so that if multiple values are in the lo:hi range, the closest one to x can be updated. For example, if the sorted_list contained [10.1, 10.8], both values are within 0.50 of 10.5, but 10.8 should be selected for update because it is closer to 10.5.
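One way the tweak in Note 1 might look is sketched below. The function name cluster_closest and the tolerance parameter are hypothetical, introduced only for illustration. The sketch also assumes that when an averaged key happens to equal an existing key, the two counts should be combined rather than overwritten.

```python
from bisect import bisect_left, bisect_right, insort
from statistics import mean

def cluster_closest(values, tolerance=0.5):
    """Count values, merging each one into the closest existing key within tolerance."""
    counts = {}
    keys = []  # sorted list of the current dictionary keys
    for x in values:
        lo = bisect_left(keys, x - tolerance)
        hi = bisect_right(keys, x + tolerance)
        if lo == hi:
            # No existing key is within tolerance of x
            new_x, new_count = x, 1
        else:
            # Among the candidates keys[lo:hi], pick the one nearest to x
            idx = min(range(lo, hi), key=lambda i: abs(keys[i] - x))
            old_x = keys.pop(idx)
            new_x = mean([old_x, x])
            new_count = counts.pop(old_x) + 1
        if new_x in counts:
            # Assumption: combine counts if the new key collides with an existing one
            counts[new_x] += new_count
        else:
            counts[new_x] = new_count
            insort(keys, new_x)
    return counts

print(cluster_closest([10.1, 10.8, 10.5]))
```

With the inputs from Note 1, 10.5 is merged with 10.8 rather than 10.1, because 10.8 is the nearer of the two candidates.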

Note 2: The request to average the inputs likely isn’t the right thing to do because it weights the most recently seen input more than the earlier inputs. A better result can be had by keeping a list of all nearby inputs and then averaging them at the end.
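A minimal sketch of the idea in Note 2 follows. It assumes each group is identified by the first value seen in it (called an anchor here) and that new inputs are matched against anchors rather than against running averages. The names cluster_then_average, anchors, and tolerance are hypothetical.

```python
from bisect import bisect_left, bisect_right, insort
from statistics import mean

def cluster_then_average(values, tolerance=0.5):
    """Group values around anchors, then average each group once at the end."""
    groups = {}   # anchor (first value seen in a group) -> list of members
    anchors = []  # sorted list of anchors
    for x in values:
        lo = bisect_left(anchors, x - tolerance)
        hi = bisect_right(anchors, x + tolerance)
        if lo == hi:
            # No anchor within tolerance: x starts a new group
            groups[x] = [x]
            insort(anchors, x)
        else:
            # Add x to the group whose anchor is nearest
            nearest = min(anchors[lo:hi], key=lambda a: abs(a - x))
            groups[nearest].append(x)
    result = {}
    for members in groups.values():
        m = mean(members)
        # Assumption: combine counts if two groups share the same mean
        result[m] = result.get(m, 0) + len(members)
    return result

print(cluster_then_average([10.1, 10.5, 10.2, 12.5, 12.6]))
```

Because the mean is computed once per group, every member carries equal weight regardless of the order in which the inputs arrived.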

Note 3: Rather than the algorithm as requested, it may be better to sort all the inputs, then scan for blocks where all values lie in a specified interval:

from statistics import mean

data = [10.1, 11.2, 10.5, 12.5, 10.2, 12.6, 11.4, 11.7, 11.8]
data.sort()
d = {}
equivalents = []
for x in data:
    if not equivalents or x < equivalents[0] + 1.0:
        equivalents.append(x)
    else:
        # x falls outside the current block: record the block, then start a new one with x
        d[mean(equivalents)] = len(equivalents)
        equivalents = [x]
if equivalents:
    # Record the final block
    d[mean(equivalents)] = len(equivalents)
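To connect any of these count dictionaries back to the question's "top 10" step, one option is collections.Counter, whose most_common method returns the highest counts without popping entries from the dictionary. The helper name top_n and the sample counts below are hypothetical and only illustrate the call.

```python
from collections import Counter

def top_n(counts, n=10):
    """Return the n (value, count) pairs with the highest counts."""
    return Counter(counts).most_common(n)

# Illustrative counts only, not taken from a real file
d = {10.0: 3, 11.0: 2, 12.0: 4}
print(top_n(d, 2))
```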

Answered By: Raymond Hettinger
