How to write a loop for calculating average value for recurring entities

Question:

I’m having problem with python while constructing a loop (or a nested loop) for my calculation. I have a very large data set (about 600k entries) with repeating ID’s and their values. I’d like to form a loop which starts from the first entry, looks if it is repeating and if it is how manty times it does and calculates average of values for the same ID.

For example in below table I’d like it to return Average a = 2, Average b = 4 and Average c = 1.

I tried a nested for loop below but completely lost my track.

ID Value
a 1
a 3
b 3
b 4
b 5
c 1

for i in range(0,len(ID)):
     if ID[i] == ID[i+1]: #if the next ID is equal to the current one
         for j in range (i,len(ID)): #look for how many repating values for current ID
                    if ID[j] != ID[j+1]:
                        i=j
                        break
          #since it's know how many repeating ID's exist, somehow average them and continue for the second (distinct) ID
Asked By: SatisG

||

Answers:

In case you just need an algorithm:
Since the data is sorted, you can iterate the outer loop on the ids, while storing two variables – one will count the number of identical ids, and the other one will sum up the values.
When you come to an iteration where a new id is found, that’s your cue to calculate the average (the sum of values divided by the id count), print it, and then do the same over and over for the next ids.

Answered By: Max Mor

You could do something like this – iterate through the IDs and values, comparing the current ID to the previous ID, and when it is different, calculating and storing the average.

ID = ["a", "a", "b", "b", "b", "c"]
values = [1,3,3,4,5,1]

averages = {}
previous_id = None
count = None
total = None
for id_, value in zip(ID, values):
    if id_ == previous_id:
        count += 1
        total += value
    else:
        if previous_id is not None:
            averages[previous_id] = total / count
        count = 1
        total = value
    previous_id = id_
averages[id_] = total / count  # to ensure we don't miss the last ID

But it will be much better and faster to use pandas:

import pandas as pd
df = pd.DataFrame({"id": ID, "values": values})  # or use read_excel to import from your spreadsheet
print(df.groupby("id").mean())
Answered By: Stuart

This is a snippet of code that does not use any external modules. You are able to use them if you wish, but thought to leave this as it is still an option.

def avg(lst):
    return sum(lst) / len(lst)

filename = "mycsvfile.csv"

averages = dict()
all_values = dict()

with open(filename, "r") as f:
    
    # read all lines and save 
    for line in f.readlines():
        
        split = line.split(",")  # split on comma
        
        row_id = split[0]
        try:
            # "ID is in the first column Value is at fifth." -OP
            value = float(split[4])
            
            # check if there is a new ID or not
            if row_id in all_values.keys():
                all_values[row_id].append(value)
            else:
                all_values[row_id] = [value]
        except ValueError:
            pass  # ignore header lines
        
    # get averages into dictionary
    for k, v in all_values.items():
        averages[k] = avg(v)
    
    # print the data to show it is correct
    for k, v in averages.items():
        print(f"{k}: {v}")

This snippet of code reads in all data from a CSV file and then gets the averages into a dictionary. Key is the ID, value is the average.

How this works:
This code creates 2 dictionaries:

# all_values:
{ID: [list_all_values_here]}
# for example
{'b': [3, 4, 5]}

#####
# averages:
{ID: average}
# for example
{'b': 3}

This starts off by reading in all data from the CSV file and putting it into the respective all_values dictionary. After reading all data, we iterate over all key: value pairs in the all_values dictionary and get the average of the list containing all values.

We iterate over all key:value pairs here in these 2 lines and add the average to the averages dictionary.

for k, v in all_values.items():
    averages[k] = avg(v)

We then print out all values at the end to show that it is correct.

Answered By: Mitchnoff
Categories: questions Tags: ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.