How to add / impute additional rows onto a DataFrame in Python
Question:
I have a dataframe that looks like this:
| ID | Score | Age | Gender | Date |
|----|-------|-----|--------|------------|
| A  | 25    | 5   | M      | 2019-01-01 |
| A  | 32    | 5   | M      | 2019-01-01 |
| A  | 32    | 5   | M      | 2019-01-05 |
| B  | 45    | 9   | F      | 2019-02-01 |
| B  | 76    | 9   | F      | 2019-05-01 |
| C  | 54    | 7   | F      | 2019-03-01 |
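For reference, the sample frame above can be reproduced with:

```python
import pandas as pd

# Sample data matching the table above
df = pd.DataFrame({
    'ID':     ['A', 'A', 'A', 'B', 'B', 'C'],
    'Score':  [25, 32, 32, 45, 76, 54],
    'Age':    [5, 5, 5, 9, 9, 7],
    'Gender': ['M', 'M', 'M', 'F', 'F', 'F'],
    'Date':   pd.to_datetime(['2019-01-01', '2019-01-01', '2019-01-05',
                              '2019-02-01', '2019-05-01', '2019-03-01']),
})
```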
For each unique ID, I want to ensure there are exactly 2 entries. If an ID has more than 2 entries, I want the two entries with the latest Date (in case of a tie, just take any two rows with that Date). If an ID has fewer than 2 entries, insert / impute a row for that ID where the Score is set to 0, the Date is set to the most recent date for that ID, but the Age and Gender are retained (assume that Age and Gender will always be the same for any one ID).
One possible solution for this would be:
| ID | Score | Age | Gender | Date |
|----|-------|-----|--------|------------|
| A  | 32    | 5   | M      | 2019-01-01 |
| A  | 32    | 5   | M      | 2019-01-05 |
| B  | 45    | 9   | F      | 2019-02-01 |
| B  | 76    | 9   | F      | 2019-05-01 |
| C  | 54    | 7   | F      | 2019-03-01 |
| C  | 0     | 7   | F      | 2019-03-01 |
My dataset is quite big, so multi-indexing with pd.MultiIndex made my memory run out quite quickly (the actual dataset I’m using has about half a million rows).
I tried implementing something similar to here:
How to pad on extra rows in dataframe for Neural Netowrk
But I’m not sure how to implement the "use the latest date" restriction.
Answers:
Brute Force
def f(d):
    d = d.nlargest(2, ['Date'])
    if len(d) < 2:
        # pad the short group with a Score=0 copy of itself
        # (DataFrame.append was removed in pandas 2.0, so use pd.concat)
        d = pd.concat([d, d.assign(Score=0)])
    return d

df.groupby('ID', as_index=False, group_keys=False).apply(f)
# ⇓ Ugly index is ugly
ID Score Age Gender Date
2 A 32 5 M 2019-01-05
0 A 25 5 M 2019-01-01
4 B 76 9 F 2019-05-01
3 B 45 9 F 2019-02-01
5 C 54 7 F 2019-03-01
5 C 0 7 F 2019-03-01
If you wanted a specific number other than 2, say 5:
def f(d, limit):
    d = d.nlargest(limit, ['Date'])
    if len(d) < limit:
        # pad with Score=0 copies of the most recent row (nlargest puts it
        # first) so the group reaches exactly `limit` rows
        d = pd.concat([d] + [d.iloc[[0]].assign(Score=0)] * (limit - len(d)))
    return d

df.groupby('ID', as_index=False, group_keys=False).apply(f, limit=5)
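As a sanity check, calling the `limit` variant with `limit=2` on the sample frame should give every ID exactly two rows (a runnable sketch; `df` built as in the question):

```python
import pandas as pd

df = pd.DataFrame({
    'ID':     ['A', 'A', 'A', 'B', 'B', 'C'],
    'Score':  [25, 32, 32, 45, 76, 54],
    'Age':    [5, 5, 5, 9, 9, 7],
    'Gender': ['M', 'M', 'M', 'F', 'F', 'F'],
    'Date':   pd.to_datetime(['2019-01-01', '2019-01-01', '2019-01-05',
                              '2019-02-01', '2019-05-01', '2019-03-01']),
})

def f(d, limit):
    # keep the `limit` most recent rows; pad short groups with
    # Score=0 copies of the most recent row
    d = d.nlargest(limit, ['Date'])
    if len(d) < limit:
        d = pd.concat([d] + [d.iloc[[0]].assign(Score=0)] * (limit - len(d)))
    return d

out = df.groupby('ID', as_index=False, group_keys=False).apply(f, limit=2)
print(out['ID'].value_counts())  # every ID now has exactly 2 rows
```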
Less Brute, maybe?
pd.concat([
    pd.concat([d, d.assign(Score=0)]) if len(d) < 2 else d.tail(2)
    for _, d in df.sort_values(['ID', 'Date']).groupby('ID')
], ignore_index=True)
ID Score Age Gender Date
0 A 32 5 M 2019-01-01
1 A 32 5 M 2019-01-05
2 B 45 9 F 2019-02-01
3 B 76 9 F 2019-05-01
4 C 54 7 F 2019-03-01
5 C 0 7 F 2019-03-01
Here is a way:
First, get the top 2 most recent rows for each ID:
d = df.sort_values(by='Date', ascending=False).groupby('ID').head(2).set_index('ID')
Then find the IDs that appear only once and duplicate their rows:
a = pd.concat([d.loc[~d.index.duplicated(keep=False)]] * 2)
Then set the Score of one copy in each duplicated pair to 0:
a.loc[a.index.duplicated(), 'Score'] = 0
Then concat back into the new df:
final = pd.concat([d.loc[d.index.duplicated(keep=False)], a]).sort_index()
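Putting those four steps together on the sample frame (a runnable sketch; `df` built as in the question):

```python
import pandas as pd

df = pd.DataFrame({
    'ID':     ['A', 'A', 'A', 'B', 'B', 'C'],
    'Score':  [25, 32, 32, 45, 76, 54],
    'Age':    [5, 5, 5, 9, 9, 7],
    'Gender': ['M', 'M', 'M', 'F', 'F', 'F'],
    'Date':   pd.to_datetime(['2019-01-01', '2019-01-01', '2019-01-05',
                              '2019-02-01', '2019-05-01', '2019-03-01']),
})

# top 2 most recent rows per ID, indexed by ID
d = df.sort_values(by='Date', ascending=False).groupby('ID').head(2).set_index('ID')

# IDs that appear only once, duplicated
a = pd.concat([d.loc[~d.index.duplicated(keep=False)]] * 2)

# zero out the Score of the second copy
a.loc[a.index.duplicated(), 'Score'] = 0

# stitch the complete groups and the padded ones back together
final = pd.concat([d.loc[d.index.duplicated(keep=False)], a]).sort_index()
print(final)
```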
The solution below can handle any number of entries per ID, not just 2:
n = 2
df = df.sort_values('Date', ascending=False)
(pd.concat([df.groupby('ID').head(n),
            v.loc[(v := df.groupby('ID', as_index=False).last()
                        .assign(Score=0))
                  .index
                  .repeat((n - v['ID'].map(df['ID'].value_counts()))
                          .clip(lower=0))]])
   .sort_values('ID'))
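The padding count is the non-obvious part here: `n` minus each ID's row count, clipped at 0, tells `Index.repeat` how many imputed rows to emit per ID. A minimal illustration on the sample IDs:

```python
import pandas as pd

n = 2
ids = pd.Series(['A', 'A', 'A', 'B', 'B', 'C'], name='ID')

counts = ids.value_counts()           # A: 3, B: 2, C: 1
missing = (n - counts).clip(lower=0)  # A: 0, B: 0, C: 1 -> pad C once
print(missing.to_dict())
```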
Let us try lazy groupby and concat:
df = df.sort_values(['ID', 'Date'], ascending=[True, False])
g = df.groupby('ID')
enums = g.cumcount()
sizes = g['ID'].transform('size')

pd.concat([df[enums < 2],                  # rows 1 and 2 of each group
           df[sizes == 1].assign(Score=0)  # duplicate single-row groups with Score=0
          ]).sort_index()
Also another variant with head:
pd.concat([g.head(2),                      # rows 1 and 2 of each group
           df[sizes == 1].assign(Score=0)  # duplicate single-row groups with Score=0
          ]).sort_index()
Output:
ID Score Age Gender Date
0 A 25 5 M 2019-01-01
2 A 32 5 M 2019-01-05
3 B 45 9 F 2019-02-01
4 B 76 9 F 2019-05-01
5 C 54 7 F 2019-03-01
5 C 0 7 F 2019-03-01
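Whichever variant you use, it is cheap to verify the requirement afterwards (a minimal sketch; `result` is a hypothetical stand-in for the output frame of any solution above):

```python
import pandas as pd

# stand-in for the output of any of the solutions above
result = pd.DataFrame({
    'ID':    ['A', 'A', 'B', 'B', 'C', 'C'],
    'Score': [25, 32, 45, 76, 54, 0],
})

# every ID should now appear exactly twice
assert result['ID'].value_counts().eq(2).all()
```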