How to parse csv file in python to get this output?

Question:

I have a csv file which contains data like that

Sample csv

Name Start End
John 12:00 13:00
John 12:10 13:00
John 12:20 13:20
Tom 12:00 13:10
John 13:50 14:00
Jerry 14:00 14:30
Alice 15:00 16:00
Jerry 11:00 15:00
  1. I need to find the average time taken by each people in python. How do i do that?

Sample output

Avg time taken by different people are :

John (60+50+60+10)/4 min
Tom (70)/1 min
Jerry (30+240)/2 min
Alice (60)/1 min

I tried parsing the csv file by python csv

import datetime
import csv


with open('people.csv', 'r') as file:
    reader = csv.DictReader(file)
    for row in reader:
        print(row['Start'],row['End'])

But i am unable to parse the column with the particular row name belongs to Jerry and find the difference in their time.

  1. Also Need to find which Person took maximum time

Here in case Jerry took maximum time

  1. Also need to perform merge operation

ex – john [12:00,13:00],[12:10,13:00],[12:20,13:20],[13:50,14:00]

output – [12:00,13:20],[13:50,14:00]

Any help will be appreciated.

Asked By: Bk thomas

||

Answers:

If you are willing to use pandas for this, the code below can do the job –

import pandas as pd

df = pd.read_csv("data.csv")
df = df.apply(pd.to_datetime, errors = "ignore")
time_df = df.iloc[:, 1:].diff(axis = 1).drop(columns = "Start").rename(columns = {"End" : "Time Diff"})
time_df["Name"] = df["Name"]
time_df.groupby("Name").mean()

Output –

Time Diff
Alice 0 days 01:00:00
Jerry 0 days 02:15:00
John 0 days 00:45:00
Tom 0 days 01:10:00

Code Explanation –

  • The 3rd line in the code reads the csv file you have and converts it into a pandas dataframe. A dataframe is just a table and can be manipulated in the same way.

  • The 4th line in the code coverts all the values in valid columns to datetime which can help you in finding the time difference. I have passed the parameter errors = "ignore" as the first column in the dataframe, the column Name, has string values that cannot be converted to datetime. Passing in the errors parameter as ignore would allow me to retain all the original values of that column.

  • The 5th line of code selects the columns from index 1 onwards and substracts them. Once that’s done the drop function gets implemented and the redundant column with null values. Once that’s done the rename function kicks in and renames the End column to Time Diff. All this is stored in a new variable by the name time_df.

  • Because time_df doens’t have the name column in it, the 6th line adds it.

  • Once I have the required dataframe, I just group the data based on the Name column, meaning all the data belonging to particular person would be worked on separately. This is ideal for us as we want to find the mean time taken by every person. To do that we just apply the mean function on the grouped data and voila we have the desired output.

Answered By: Zero

without Pandas:

times = {}

with open('people.csv', 'r') as file:
    reader = csv.DictReader(file)
    for row in reader:
    
        if row['name'] not in times.keys():
            times[row['name']] = []
    
        times[row['name']].append(row['End'] - row['Start'])
    
for person in times.keys():
    print(person + ": " + str(sum(times[person]) / len(times[person])))
Answered By: Dan Luba

A simplified code below. You can write it in fewer lines and further in fewer lines by using pandas.

import csv
from datetime import datetime

avg = {}
with open('people.csv', mode='r') as csv_file:
    csv_reader = csv.DictReader(csv_file)
    for row in csv_reader:
        name = row["Name"]
        start_time = datetime.strptime(row["Start"], "%H:%M")
        end_time = datetime.strptime(row["End"], "%H:%M")
        time_taken = (end_time - start_time).total_seconds() / 60.0
        if name not in avg.keys():
            avg[name] = [time_taken,1]
        else:
            prev_time_total = avg[name][0]
            prev_count = avg[name][1]
            new_time_total = prev_time_total + time_taken
            new_count = prev_count + 1
            avg[name] = [new_time_total,new_count]

for entry in avg:
    print(entry,avg[entry][0]/avg[entry][1])
Answered By: Nisarg

Here’s another method without using pandas

from datetime import datetime, timedelta

with open("data.csv", "r") as f:
    f = csv.DictReader(f)
    data = [row for row in f]

diffs = {list(row.values())[0]: [] for row in data}
for row in data:
    vals = list(row.values())
    diffs[vals[0]].append(datetime.strptime(vals[2], "%H:%M") - datetime.strptime(vals[1], "%H:%M"))

diffs_avg = [str(timedelta(seconds = sum(map(timedelta.total_seconds, times)) / len(times))) for times in diffs.values()]
dict(zip(diffs.keys(), diffs_avg))

Output –

{'Alice': '1:00:00', 'Jerry': '2:15:00', 'John': '0:45:00', 'Tom': '1:10:00'}
Answered By: Zero

calculating with pandas mean and max time in minutes for each person:

import pandas as pd

df = (pd.read_csv('file_01.csv',parse_dates=['Start','End']).
      assign(diff=(df1.End-df1.Start).dt.total_seconds()//60).
      groupby('Name')['diff'].
      agg(['mean','max']))

print(df)
'''
        mean    max
Name               
Alice   60.0   60.0
Jerry  135.0  240.0
John    45.0   60.0
Tom     70.0   70.0
'''

df.to_dict()

>>> out
'''
{'mean': {'Alice': 60.0, 'Jerry': 135.0, 'John': 45.0, 'Tom': 70.0},
 'max': {'Alice': 60.0, 'Jerry': 240.0, 'John': 60.0, 'Tom': 70.0}}
Answered By: SergFSM
Categories: questions Tags: , , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.