From specific row in a df_a, to count its occurrences in the past a year in df_b
Question:
I have two dataframe as below and I want to return how many Success (Yes) in a year (for a specific person) 1 year prior to his/her specific date, i.e. each entry in to check
to define the range in history
.
For example, in to_check
, Mike 20200602, I want to know how many Success (Yes) in Mike’s history (1 year before, until 20200602).
By using the "to_check" as a list, I came up with a clumsy way:
import pandas as pd
import datetime
import numpy as np
from io import StringIO
import time
from datetime import datetime, date, time, timedelta
csvfile = StringIO("""
Name Check
Mike 20200602
David 20210415
Kate 20201109""")
csvfile_1 = StringIO("""
Name History Success
David 20180312 Yes
David 20180811 Yes
David 20191223 Yes
David 20210311 Yes
Kate 20180906 Yes
Kate 20180912 Yes
Kate 20191204 Yes
Kate 20200505 Yes
Mike 20180912 Yes
Mike 20190312 Yes
Mike 20190806 Yes
Mike 20191204 Yes""")
df_check = pd.read_csv(csvfile, sep = ' ', engine='python')
df_history = pd.read_csv(csvfile_1, sep = ' ', engine='python')
df_history['Date'] = pd.to_datetime(df_history['History'], format='%Y%m%d')
to_check = ["Mike 20200602","David 20210415","Kate 20201109"]
for t in to_check:
name, d = t.split(" ")
date_obj = datetime.strptime(d, '%Y%m%d')
delta = timedelta(days = 365)
day_before = date_obj - delta
m1 = df_history['Name'] == name
m2 = df_history['Date'] >= day_before
df_history['OP'] = np.where(m1 & m2, "fit", '')
how_many = df_history['OP'].value_counts().tolist()[1]
print (t, how_many)
Output:
Mike 20200602 2
David 20210415 1
Kate 20201109 2
What’s the better and smarter way to achieve it?
Answers:
merge
and query
, but I would suggest leaving the dates as number for easy offset:
# both `Check` and `History` are numbers, not dates
(df_check.merge(df_history, on='Name', how='left')
.query('History<=Check<History+10000')
.groupby('Name').agg({'History':'first', 'Success':'size'})
)
Output:
History Success
Name
David 20210311 1
Kate 20191204 2
Mike 20190806 2
I have two dataframe as below and I want to return how many Success (Yes) in a year (for a specific person) 1 year prior to his/her specific date, i.e. each entry in to check
to define the range in history
.
For example, in to_check
, Mike 20200602, I want to know how many Success (Yes) in Mike’s history (1 year before, until 20200602).
By using the "to_check" as a list, I came up with a clumsy way:
import pandas as pd
import datetime
import numpy as np
from io import StringIO
import time
from datetime import datetime, date, time, timedelta
csvfile = StringIO("""
Name Check
Mike 20200602
David 20210415
Kate 20201109""")
csvfile_1 = StringIO("""
Name History Success
David 20180312 Yes
David 20180811 Yes
David 20191223 Yes
David 20210311 Yes
Kate 20180906 Yes
Kate 20180912 Yes
Kate 20191204 Yes
Kate 20200505 Yes
Mike 20180912 Yes
Mike 20190312 Yes
Mike 20190806 Yes
Mike 20191204 Yes""")
df_check = pd.read_csv(csvfile, sep = ' ', engine='python')
df_history = pd.read_csv(csvfile_1, sep = ' ', engine='python')
df_history['Date'] = pd.to_datetime(df_history['History'], format='%Y%m%d')
to_check = ["Mike 20200602","David 20210415","Kate 20201109"]
for t in to_check:
name, d = t.split(" ")
date_obj = datetime.strptime(d, '%Y%m%d')
delta = timedelta(days = 365)
day_before = date_obj - delta
m1 = df_history['Name'] == name
m2 = df_history['Date'] >= day_before
df_history['OP'] = np.where(m1 & m2, "fit", '')
how_many = df_history['OP'].value_counts().tolist()[1]
print (t, how_many)
Output:
Mike 20200602 2
David 20210415 1
Kate 20201109 2
What’s the better and smarter way to achieve it?
merge
and query
, but I would suggest leaving the dates as number for easy offset:
# both `Check` and `History` are numbers, not dates
(df_check.merge(df_history, on='Name', how='left')
.query('History<=Check<History+10000')
.groupby('Name').agg({'History':'first', 'Success':'size'})
)
Output:
History Success
Name
David 20210311 1
Kate 20191204 2
Mike 20190806 2