Pandas DF – Efficient way to loop through DF to find minimum values of one column from rows with common values in another column
Question:
I have a dataframe that looks something like:
matter | work_date |
---|---|
1 | 01/01/2020 |
2 | 01/02/2020 |
1 | 01/04/2020 |
2 | 01/05/2020 |
I want a new column which finds the minimum work_date of all rows with the same matter number so that I can do some time delta calculations. The final result would look like this:
matter | work_date | first_date |
---|---|---|
1 | 01/01/2020 | 01/01/2020 |
2 | 01/02/2020 | 01/02/2020 |
1 | 01/04/2020 | 01/01/2020 |
2 | 01/05/2020 | 01/02/2020 |
Right now, I’m using the following code, but it is taking quite a while to run (the dataframe has approx 300k rows and I’m on an ancient computer).
min_dict = {}
def check_dict(val):
    return min_dict.setdefault(val, min(df[df['tmatter'] == val]['tworkdt']))
df['first_day'] = df.apply(lambda row: check_dict(row.tmatter), axis=1)
Is there a better way to approach this?
Answers:
groupby followed by transform does what you want and should be fast. The steps are: (1) group the rows that share the same matter, (2) for each group calculate the minimum work_date, and (3) broadcast those group minimums back onto every row as a new column.
import pandas as pd
import io

df = pd.read_csv(io.StringIO("""
matter work_date
1 01/01/2020
2 01/02/2020
1 01/04/2020
2 01/05/2020
"""), sep=r'\s+')  # delim_whitespace=True is deprecated in recent pandas

# For each matter, broadcast the group minimum back to every row
df['first_date'] = df.groupby('matter')['work_date'].transform('min')
print(df)
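Since the goal is time delta calculations, it may also help to parse work_date into real datetimes first. String minimums happen to work on this sample, but MM/DD/YYYY strings compare lexicographically, so "12/31/2019" would sort after "01/01/2020". A minimal sketch under that assumption (column names taken from the question's sample; the days_since_first column is illustrative):

```python
import pandas as pd
import io

df = pd.read_csv(io.StringIO("""
matter work_date
1 01/01/2020
2 01/02/2020
1 01/04/2020
2 01/05/2020
"""), sep=r'\s+')

# Parse MM/DD/YYYY strings into datetimes so comparison and
# subtraction behave correctly across month/year boundaries
df['work_date'] = pd.to_datetime(df['work_date'], format='%m/%d/%Y')

# Group minimum broadcast back to each row, then the elapsed time
df['first_date'] = df.groupby('matter')['work_date'].transform('min')
df['days_since_first'] = (df['work_date'] - df['first_date']).dt.days

print(df)
```

The subtraction of two datetime columns yields a Timedelta column, and `.dt.days` extracts whole days for the delta calculation.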