How to add / impute additional rows onto a DataFrame in Python
Question:
I have a dataframe that looks like this:
| ID | Score | Age | Gender | Date |
|----|-------|-----|--------|------------|
| A  | 25    | 5   | M      | 2019-01-01 |
| A  | 32    | 5   | M      | 2019-01-01 |
| A  | 32    | 5   | M      | 2019-01-05 |
| B  | 45    | 9   | F      | 2019-02-01 |
| B  | 76    | 9   | F      | 2019-05-01 |
| C  | 54    | 7   | F      | 2019-03-01 |
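For reference, the sample frame above can be reproduced with:

```python
import pandas as pd

# Sample data matching the table above
df = pd.DataFrame({
    'ID':     ['A', 'A', 'A', 'B', 'B', 'C'],
    'Score':  [25, 32, 32, 45, 76, 54],
    'Age':    [5, 5, 5, 9, 9, 7],
    'Gender': ['M', 'M', 'M', 'F', 'F', 'F'],
    'Date':   pd.to_datetime(['2019-01-01', '2019-01-01', '2019-01-05',
                              '2019-02-01', '2019-05-01', '2019-03-01']),
})
```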
For each unique ID, I want to ensure there are exactly 2 entries. If an ID has more than 2 entries, I want the two entries with the latest Date (in case of a tie, just take any two rows with that Date). If an ID has fewer than 2 entries, insert / impute a row for that ID where the Score is set to 0, the Date is set to the most recent date for that ID, but the Age and Gender are retained (assume that Age and Gender will always be the same for any one ID).
One possible solution for this would be:
| ID | Score | Age | Gender | Date |
|----|-------|-----|--------|------------|
| A  | 32    | 5   | M      | 2019-01-01 |
| A  | 32    | 5   | M      | 2019-01-05 |
| B  | 45    | 9   | F      | 2019-02-01 |
| B  | 76    | 9   | F      | 2019-05-01 |
| C  | 54    | 7   | F      | 2019-03-01 |
| C  | 0     | 7   | F      | 2019-03-01 |
My dataset is quite big, so multi-indexing with pd.MultiIndex made my memory run out quite quickly (the actual dataset I’m using has about half a million rows).
I tried implementing something similar to here:
How to pad on extra rows in dataframe for Neural Netowrk
But I’m not sure how to implement the "use the latest date" restriction.
Answers:
Brute Force
def f(d):
    d = d.nlargest(2, ['Date'])
    if len(d) < 2:
        # pad the short group with a Score=0 copy of itself
        # (DataFrame.append was removed in pandas 2.0, so use pd.concat)
        d = pd.concat([d, d.assign(Score=0)])
    return d

df.groupby('ID', as_index=False, group_keys=False).apply(f)
# ⇓ Ugly index is ugly
ID Score Age Gender Date
2 A 32 5 M 2019-01-05
0 A 25 5 M 2019-01-01
4 B 76 9 F 2019-05-01
3 B 45 9 F 2019-02-01
5 C 54 7 F 2019-03-01
5 C 0 7 F 2019-03-01
If you wanted a specific number other than 2, say 5:
def f(d, limit):
    d = d.nlargest(limit, ['Date'])
    if len(d) < limit:
        # pad with Score=0 copies of the most recent row (nlargest puts it
        # first) so the group reaches exactly `limit` rows
        d = pd.concat([d] + [d.iloc[[0]].assign(Score=0)] * (limit - len(d)))
    return d

df.groupby('ID', as_index=False, group_keys=False).apply(f, limit=5)
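As a sanity check, calling the `limit` variant with `limit=2` on the sample frame should give every ID exactly two rows (a runnable sketch; `df` built as in the question):

```python
import pandas as pd

df = pd.DataFrame({
    'ID':     ['A', 'A', 'A', 'B', 'B', 'C'],
    'Score':  [25, 32, 32, 45, 76, 54],
    'Age':    [5, 5, 5, 9, 9, 7],
    'Gender': ['M', 'M', 'M', 'F', 'F', 'F'],
    'Date':   pd.to_datetime(['2019-01-01', '2019-01-01', '2019-01-05',
                              '2019-02-01', '2019-05-01', '2019-03-01']),
})

def f(d, limit):
    # keep the `limit` most recent rows; pad short groups with
    # Score=0 copies of the most recent row
    d = d.nlargest(limit, ['Date'])
    if len(d) < limit:
        d = pd.concat([d] + [d.iloc[[0]].assign(Score=0)] * (limit - len(d)))
    return d

out = df.groupby('ID', as_index=False, group_keys=False).apply(f, limit=2)
print(out['ID'].value_counts())  # every ID now has exactly 2 rows
```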
Less Brute, maybe?
pd.concat([
    pd.concat([d, d.assign(Score=0)]) if len(d) < 2 else d.tail(2)
    for _, d in df.sort_values(['ID', 'Date']).groupby('ID')
], ignore_index=True)
ID Score Age Gender Date
0 A 32 5 M 2019-01-01
1 A 32 5 M 2019-01-05
2 B 45 9 F 2019-02-01
3 B 76 9 F 2019-05-01
4 C 54 7 F 2019-03-01
5 C 0 7 F 2019-03-01
Here is a way:
First, get the top 2 most recent rows for each ID:
d = df.sort_values(by='Date', ascending=False).groupby('ID').head(2).set_index('ID')
Then find the IDs that appear only once and duplicate their rows:
a = pd.concat([d.loc[~d.index.duplicated(keep=False)]] * 2)
Then set the Score of one copy in each duplicated pair to 0:
a.loc[a.index.duplicated(), 'Score'] = 0
Then concat back into the new df:
final = pd.concat([d.loc[d.index.duplicated(keep=False)], a]).sort_index()
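Putting those four steps together on the sample frame (a runnable sketch; `df` built as in the question):

```python
import pandas as pd

df = pd.DataFrame({
    'ID':     ['A', 'A', 'A', 'B', 'B', 'C'],
    'Score':  [25, 32, 32, 45, 76, 54],
    'Age':    [5, 5, 5, 9, 9, 7],
    'Gender': ['M', 'M', 'M', 'F', 'F', 'F'],
    'Date':   pd.to_datetime(['2019-01-01', '2019-01-01', '2019-01-05',
                              '2019-02-01', '2019-05-01', '2019-03-01']),
})

# top 2 most recent rows per ID, indexed by ID
d = df.sort_values(by='Date', ascending=False).groupby('ID').head(2).set_index('ID')

# IDs that appear only once, duplicated
a = pd.concat([d.loc[~d.index.duplicated(keep=False)]] * 2)

# zero out the Score of the second copy
a.loc[a.index.duplicated(), 'Score'] = 0

# stitch the complete groups and the padded ones back together
final = pd.concat([d.loc[d.index.duplicated(keep=False)], a]).sort_index()
print(final)
```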
The solution below can handle any number of entries per ID, not just 2:
n = 2
df = df.sort_values('Date', ascending=False)
(pd.concat([df.groupby('ID').head(n),
            v.loc[(v := df.groupby('ID', as_index=False).last()
                        .assign(Score=0))
                  .index
                  .repeat((n - v['ID'].map(df['ID'].value_counts()))
                          .clip(lower=0))]])
   .sort_values('ID'))
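The padding count is the non-obvious part here: `n` minus each ID's row count, clipped at 0, tells `Index.repeat` how many imputed rows to emit per ID. A minimal illustration on the sample IDs:

```python
import pandas as pd

n = 2
ids = pd.Series(['A', 'A', 'A', 'B', 'B', 'C'], name='ID')

counts = ids.value_counts()           # A: 3, B: 2, C: 1
missing = (n - counts).clip(lower=0)  # A: 0, B: 0, C: 1 -> pad C once
print(missing.to_dict())
```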
Let us try lazy groupby and concat:
df = df.sort_values(['ID', 'Date'], ascending=[True, False])
g = df.groupby('ID')
enums = g.cumcount()
sizes = g['ID'].transform('size')

pd.concat([df[enums < 2],                  # rows 1 and 2 of each group
           df[sizes == 1].assign(Score=0)  # duplicate single-row groups with Score=0
          ]).sort_index()
Also another variant with head:
pd.concat([g.head(2),                      # rows 1 and 2 of each group
           df[sizes == 1].assign(Score=0)  # duplicate single-row groups with Score=0
          ]).sort_index()
Output:
ID Score Age Gender Date
0 A 25 5 M 2019-01-01
2 A 32 5 M 2019-01-05
3 B 45 9 F 2019-02-01
4 B 76 9 F 2019-05-01
5 C 54 7 F 2019-03-01
5 C 0 7 F 2019-03-01
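Whichever variant you use, it is cheap to verify the requirement afterwards (a minimal sketch; `result` is a hypothetical stand-in for the output frame of any solution above):

```python
import pandas as pd

# stand-in for the output of any of the solutions above
result = pd.DataFrame({
    'ID':    ['A', 'A', 'B', 'B', 'C', 'C'],
    'Score': [25, 32, 45, 76, 54, 0],
})

# every ID should now appear exactly twice
assert result['ID'].value_counts().eq(2).all()
```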