How to parse csv file in python to get this output?
Question:
I have a csv file which contains data like that
Sample csv
Name
Start
End
John
12:00
13:00
John
12:10
13:00
John
12:20
13:20
Tom
12:00
13:10
John
13:50
14:00
Jerry
14:00
14:30
Alice
15:00
16:00
Jerry
11:00
15:00
- I need to find the average time taken by each people in python. How do i do that?
Sample output
Avg time taken by different people are :
John (60+50+60+10)/4 min
Tom (70)/1 min
Jerry (30+240)/2 min
Alice (60)/1 min
I tried parsing the csv file by python csv
import datetime
import csv
with open('people.csv', 'r') as file:
reader = csv.DictReader(file)
for row in reader:
print(row['Start'],row['End'])
But i am unable to parse the column with the particular row name belongs to Jerry and find the difference in their time.
- Also Need to find which Person took maximum time
Here in case Jerry took maximum time
- Also need to perform merge operation
ex – john [12:00,13:00],[12:10,13:00],[12:20,13:20],[13:50,14:00]
output – [12:00,13:20],[13:50,14:00]
Any help will be appreciated.
Answers:
If you are willing to use pandas for this, the code below can do the job –
import pandas as pd
df = pd.read_csv("data.csv")
df = df.apply(pd.to_datetime, errors = "ignore")
time_df = df.iloc[:, 1:].diff(axis = 1).drop(columns = "Start").rename(columns = {"End" : "Time Diff"})
time_df["Name"] = df["Name"]
time_df.groupby("Name").mean()
Output –
Time Diff
Alice
0 days 01:00:00
Jerry
0 days 02:15:00
John
0 days 00:45:00
Tom
0 days 01:10:00
Code Explanation –
-
The 3rd line in the code reads the csv file you have and converts it into a pandas dataframe. A dataframe is just a table and can be manipulated in the same way.
-
The 4th line in the code coverts all the values in valid columns to datetime which can help you in finding the time difference. I have passed the parameter errors = "ignore"
as the first column in the dataframe, the column Name
, has string values that cannot be converted to datetime. Passing in the errors
parameter as ignore
would allow me to retain all the original values of that column.
-
The 5th line of code selects the columns from index 1 onwards and substracts them. Once that’s done the drop
function gets implemented and the redundant column with null values. Once that’s done the rename
function kicks in and renames the End
column to Time Diff
. All this is stored in a new variable by the name time_df
.
-
Because time_df
doens’t have the name column in it, the 6th line adds it.
-
Once I have the required dataframe, I just group the data based on the Name
column, meaning all the data belonging to particular person would be worked on separately. This is ideal for us as we want to find the mean time taken by every person. To do that we just apply the mean
function on the grouped data and voila we have the desired output.
without Pandas:
times = {}
with open('people.csv', 'r') as file:
reader = csv.DictReader(file)
for row in reader:
if row['name'] not in times.keys():
times[row['name']] = []
times[row['name']].append(row['End'] - row['Start'])
for person in times.keys():
print(person + ": " + str(sum(times[person]) / len(times[person])))
A simplified code below. You can write it in fewer lines and further in fewer lines by using pandas.
import csv
from datetime import datetime
avg = {}
with open('people.csv', mode='r') as csv_file:
csv_reader = csv.DictReader(csv_file)
for row in csv_reader:
name = row["Name"]
start_time = datetime.strptime(row["Start"], "%H:%M")
end_time = datetime.strptime(row["End"], "%H:%M")
time_taken = (end_time - start_time).total_seconds() / 60.0
if name not in avg.keys():
avg[name] = [time_taken,1]
else:
prev_time_total = avg[name][0]
prev_count = avg[name][1]
new_time_total = prev_time_total + time_taken
new_count = prev_count + 1
avg[name] = [new_time_total,new_count]
for entry in avg:
print(entry,avg[entry][0]/avg[entry][1])
Here’s another method without using pandas
–
from datetime import datetime, timedelta
with open("data.csv", "r") as f:
f = csv.DictReader(f)
data = [row for row in f]
diffs = {list(row.values())[0]: [] for row in data}
for row in data:
vals = list(row.values())
diffs[vals[0]].append(datetime.strptime(vals[2], "%H:%M") - datetime.strptime(vals[1], "%H:%M"))
diffs_avg = [str(timedelta(seconds = sum(map(timedelta.total_seconds, times)) / len(times))) for times in diffs.values()]
dict(zip(diffs.keys(), diffs_avg))
Output –
{'Alice': '1:00:00', 'Jerry': '2:15:00', 'John': '0:45:00', 'Tom': '1:10:00'}
calculating with pandas mean and max time in minutes for each person:
import pandas as pd
df = (pd.read_csv('file_01.csv',parse_dates=['Start','End']).
assign(diff=(df1.End-df1.Start).dt.total_seconds()//60).
groupby('Name')['diff'].
agg(['mean','max']))
print(df)
'''
mean max
Name
Alice 60.0 60.0
Jerry 135.0 240.0
John 45.0 60.0
Tom 70.0 70.0
'''
df.to_dict()
>>> out
'''
{'mean': {'Alice': 60.0, 'Jerry': 135.0, 'John': 45.0, 'Tom': 70.0},
'max': {'Alice': 60.0, 'Jerry': 240.0, 'John': 60.0, 'Tom': 70.0}}
I have a csv file which contains data like that
Sample csv
Name | Start | End |
---|---|---|
John | 12:00 | 13:00 |
John | 12:10 | 13:00 |
John | 12:20 | 13:20 |
Tom | 12:00 | 13:10 |
John | 13:50 | 14:00 |
Jerry | 14:00 | 14:30 |
Alice | 15:00 | 16:00 |
Jerry | 11:00 | 15:00 |
- I need to find the average time taken by each people in python. How do i do that?
Sample output
Avg time taken by different people are :
John (60+50+60+10)/4 min
Tom (70)/1 min
Jerry (30+240)/2 min
Alice (60)/1 min
I tried parsing the csv file by python csv
import datetime
import csv
with open('people.csv', 'r') as file:
reader = csv.DictReader(file)
for row in reader:
print(row['Start'],row['End'])
But i am unable to parse the column with the particular row name belongs to Jerry and find the difference in their time.
- Also Need to find which Person took maximum time
Here in case Jerry took maximum time
- Also need to perform merge operation
ex – john [12:00,13:00],[12:10,13:00],[12:20,13:20],[13:50,14:00]
output – [12:00,13:20],[13:50,14:00]
Any help will be appreciated.
If you are willing to use pandas for this, the code below can do the job –
import pandas as pd
df = pd.read_csv("data.csv")
df = df.apply(pd.to_datetime, errors = "ignore")
time_df = df.iloc[:, 1:].diff(axis = 1).drop(columns = "Start").rename(columns = {"End" : "Time Diff"})
time_df["Name"] = df["Name"]
time_df.groupby("Name").mean()
Output –
Time Diff | |
---|---|
Alice | 0 days 01:00:00 |
Jerry | 0 days 02:15:00 |
John | 0 days 00:45:00 |
Tom | 0 days 01:10:00 |
Code Explanation –
-
The 3rd line in the code reads the csv file you have and converts it into a pandas dataframe. A dataframe is just a table and can be manipulated in the same way.
-
The 4th line in the code coverts all the values in valid columns to datetime which can help you in finding the time difference. I have passed the parameter
errors = "ignore"
as the first column in the dataframe, the columnName
, has string values that cannot be converted to datetime. Passing in theerrors
parameter asignore
would allow me to retain all the original values of that column. -
The 5th line of code selects the columns from index 1 onwards and substracts them. Once that’s done the
drop
function gets implemented and the redundant column with null values. Once that’s done therename
function kicks in and renames theEnd
column toTime Diff
. All this is stored in a new variable by the nametime_df
. -
Because
time_df
doens’t have the name column in it, the 6th line adds it. -
Once I have the required dataframe, I just group the data based on the
Name
column, meaning all the data belonging to particular person would be worked on separately. This is ideal for us as we want to find the mean time taken by every person. To do that we just apply themean
function on the grouped data and voila we have the desired output.
without Pandas:
times = {}
with open('people.csv', 'r') as file:
reader = csv.DictReader(file)
for row in reader:
if row['name'] not in times.keys():
times[row['name']] = []
times[row['name']].append(row['End'] - row['Start'])
for person in times.keys():
print(person + ": " + str(sum(times[person]) / len(times[person])))
A simplified code below. You can write it in fewer lines and further in fewer lines by using pandas.
import csv
from datetime import datetime
avg = {}
with open('people.csv', mode='r') as csv_file:
csv_reader = csv.DictReader(csv_file)
for row in csv_reader:
name = row["Name"]
start_time = datetime.strptime(row["Start"], "%H:%M")
end_time = datetime.strptime(row["End"], "%H:%M")
time_taken = (end_time - start_time).total_seconds() / 60.0
if name not in avg.keys():
avg[name] = [time_taken,1]
else:
prev_time_total = avg[name][0]
prev_count = avg[name][1]
new_time_total = prev_time_total + time_taken
new_count = prev_count + 1
avg[name] = [new_time_total,new_count]
for entry in avg:
print(entry,avg[entry][0]/avg[entry][1])
Here’s another method without using pandas
–
from datetime import datetime, timedelta
with open("data.csv", "r") as f:
f = csv.DictReader(f)
data = [row for row in f]
diffs = {list(row.values())[0]: [] for row in data}
for row in data:
vals = list(row.values())
diffs[vals[0]].append(datetime.strptime(vals[2], "%H:%M") - datetime.strptime(vals[1], "%H:%M"))
diffs_avg = [str(timedelta(seconds = sum(map(timedelta.total_seconds, times)) / len(times))) for times in diffs.values()]
dict(zip(diffs.keys(), diffs_avg))
Output –
{'Alice': '1:00:00', 'Jerry': '2:15:00', 'John': '0:45:00', 'Tom': '1:10:00'}
calculating with pandas mean and max time in minutes for each person:
import pandas as pd
df = (pd.read_csv('file_01.csv',parse_dates=['Start','End']).
assign(diff=(df1.End-df1.Start).dt.total_seconds()//60).
groupby('Name')['diff'].
agg(['mean','max']))
print(df)
'''
mean max
Name
Alice 60.0 60.0
Jerry 135.0 240.0
John 45.0 60.0
Tom 70.0 70.0
'''
df.to_dict()
>>> out
'''
{'mean': {'Alice': 60.0, 'Jerry': 135.0, 'John': 45.0, 'Tom': 70.0},
'max': {'Alice': 60.0, 'Jerry': 240.0, 'John': 60.0, 'Tom': 70.0}}