How to add / impute additional rows to a DataFrame in Python

Question:

I have a dataframe that looks like this:

ID Score Age Gender Date
A 25 5 M 2019-01-01
A 32 5 M 2019-01-01
A 32 5 M 2019-01-05
B 45 9 F 2019-02-01
B 76 9 F 2019-05-01
C 54 7 F 2019-03-01

For each unique ID, I want to ensure there are exactly 2 entries. If an ID has more than 2 entries, I want the two entries with the latest Date (in case of a tie, any two rows with that Date will do). If an ID has fewer than 2 entries, I want to insert / impute a row for that ID where the Score is set to 0 and the Date is set to the most recent date for that ID, while the Age and Gender are retained (assume that Age and Gender will always be the same for any one ID).

One possible solution for this would be:

ID Score Age Gender Date
A 32 5 M 2019-01-01
A 32 5 M 2019-01-05
B 45 9 F 2019-02-01
B 76 9 F 2019-05-01
C 54 7 F 2019-03-01
C 0 7 F 2019-03-01
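
For anyone who wants to reproduce this, the sample frame can be rebuilt roughly as follows. This is a minimal sketch: it assumes Date should be parsed to a datetime column, which the question does not state explicitly, so that "latest date" comparisons are chronological rather than string-based.

```python
import pandas as pd

# Sample data copied from the table in the question
df = pd.DataFrame({
    "ID":     ["A", "A", "A", "B", "B", "C"],
    "Score":  [25, 32, 32, 45, 76, 54],
    "Age":    [5, 5, 5, 9, 9, 7],
    "Gender": ["M", "M", "M", "F", "F", "F"],
    "Date":   ["2019-01-01", "2019-01-01", "2019-01-05",
               "2019-02-01", "2019-05-01", "2019-03-01"],
})

# Assumption: parse the strings so that dates compare chronologically
df["Date"] = pd.to_datetime(df["Date"])
```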

My dataset is quite big, so reindexing with pd.MultiIndex made me run out of memory quite quickly (the actual dataset I’m using has about half a million rows).

I tried implementing something similar to here:
How to pad on extra rows in dataframe for Neural Netowrk

But I’m not sure how to implement the "use the latest date" restriction.

Asked By: lvnwrth


Answers:

Brute Force

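def f(d):
    # Keep the two rows with the latest Date (Date must be a datetime column)
    d = d.nlargest(2, ['Date'])
    if len(d) < 2:
        # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
        d = pd.concat([d, d.assign(Score=0)])
    return d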

df.groupby('ID', as_index=False, group_keys=False).apply(f)

# ⇓ Ugly index is ugly

    ID  Score  Age Gender       Date
  2  A     32    5      M 2019-01-05
  0  A     25    5      M 2019-01-01
  4  B     76    9      F 2019-05-01
  3  B     45    9      F 2019-02-01
  5  C     54    7      F 2019-03-01
  5  C      0    7      F 2019-03-01
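
The repeated label 5 in the output above comes from copying the existing row. If a clean 0..n-1 index is preferred, one option is to sort and then reset the index. The sketch below uses a small stand-in frame (`out` and `tidy` are illustrative names, not part of the answer) instead of the full result.

```python
import pandas as pd

# Stand-in for a padded result: the copied row keeps its original label 5
out = pd.DataFrame(
    {"ID": ["A", "A", "C", "C"], "Score": [32, 25, 54, 0]},
    index=[2, 0, 5, 5],
)

# A stable sort keeps the original row ahead of its zero-score copy,
# then reset_index replaces the labels with 0..len-1
tidy = out.sort_index(kind="mergesort").reset_index(drop=True)
```

Passing `ignore_index=True` to `pd.concat` gives the same fresh index when the pieces are concatenated in one place.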

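If you wanted a specific number other than 2, say 5:

def f(d, limit):
    d = d.nlargest(limit, ['Date'])
    if len(d) < limit:
        # Pad with copies of the most recent row only; repeating all of d
        # would overshoot when a group already has more than one row.
        d = pd.concat([d] + [d.head(1).assign(Score=0)] * (limit - len(d)))
    return d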

df.groupby('ID', as_index=False, group_keys=False).apply(f, limit=5)

Less Brute, maybe?

pd.concat([
    # DataFrame.append was removed in pandas 2.0, so pad with pd.concat
    pd.concat([d, d.assign(Score=0)]) if len(d) < 2 else d.tail(2)
    for _, d in df.sort_values(['ID', 'Date']).groupby('ID')
], ignore_index=True)

  ID  Score  Age Gender       Date
0  A     32    5      M 2019-01-01
1  A     32    5      M 2019-01-05
2  B     45    9      F 2019-02-01
3  B     76    9      F 2019-05-01
4  C     54    7      F 2019-03-01
5  C      0    7      F 2019-03-01
Answered By: piRSquared

Here is a way:

First, get the top 2 rows of each ID:

d = df.sort_values(by='Date',ascending=False).groupby('ID').head(2).set_index('ID')

Then find the IDs that appear only once and duplicate their row:

a = pd.concat([d.loc[~d.index.duplicated(keep=False)]]*2)

Then set Score to 0 on one copy of each duplicated row:

a.loc[a.index.duplicated(),'Score'] = 0

Then concatenate these with the IDs that already had two rows:

final = pd.concat([d.loc[d.index.duplicated(keep=False)],a]).sort_index()

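The solution below should be able to handle a target of more than 2 entries:

n = 2

df = df.sort_values('Date', ascending=False)

# One template row per ID with Score set to 0. v must be defined before it is
# indexed. Since df is sorted newest first, first() picks the most recent row.
v = df.groupby('ID', as_index=False).first().assign(Score=0)

# Number of padding rows each ID still needs (0 if it already has n or more)
missing = (n - v['ID'].map(df['ID'].value_counts())).clip(lower=0)

(pd.concat([df.groupby('ID').head(n),
            v.loc[v.index.repeat(missing)]])
   .sort_values('ID'))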
Answered By: rhug123

Let us try lazy groupby and concat:

df = df.sort_values(['ID', 'Date'], ascending=[True, False])

g = df.groupby('ID')
enums = g.cumcount()
sizes = g['ID'].transform('size')

pd.concat([df[enums<2],                  # row 1 and 2 in each group
           df[sizes==1].assign(Score=0)  # duplicate groups with 1 row
          ]).sort_index()

Also another variant with head:

pd.concat([g.head(2),                   # row 1 and 2 in each group
           df[sizes==1].assign(Score=0)  # duplicate groups with 1 row
          ]).sort_index()

Output:

   ID  Score  Age Gender        Date
0  A      25    5     M   2019-01-01
2  A      32    5     M   2019-01-05
3  B      45    9     F   2019-02-01
4  B      76    9     F   2019-05-01
5  C      54    7     F   2019-03-01
5  C       0    7     F   2019-03-01
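
The two snippets above pad only IDs that have exactly one row. A possible generalization of the same cumcount idea to any target size is sketched below. It is not part of the original answer: `pad_or_trim`, `n`, `sample`, `two` and `three` are names made up for illustration, and the sketch assumes Date is a datetime column.

```python
import pandas as pd

def pad_or_trim(df, n=2):
    """Keep the n newest rows per ID; pad short IDs with Score=0 rows."""
    # Newest rows first within each ID; a fresh RangeIndex keeps labels unique
    df = (df.sort_values(["ID", "Date"], ascending=[True, False])
            .reset_index(drop=True))
    rank = df.groupby("ID").cumcount()        # 0 = newest row of its ID
    kept = df[rank < n]
    newest = df[rank == 0].assign(Score=0)    # one template row per ID
    # Padding rows still needed per ID (0 when the ID already has >= n rows)
    missing = (n - newest["ID"].map(df["ID"].value_counts())).clip(lower=0)
    pad = newest.loc[newest.index.repeat(missing.to_numpy())]
    return (pd.concat([kept, pad])
              .sort_values(["ID", "Date"])
              .reset_index(drop=True))

# Sample data from the question
sample = pd.DataFrame({
    "ID":     ["A", "A", "A", "B", "B", "C"],
    "Score":  [25, 32, 32, 45, 76, 54],
    "Age":    [5, 5, 5, 9, 9, 7],
    "Gender": ["M", "M", "M", "F", "F", "F"],
    "Date":   pd.to_datetime(["2019-01-01", "2019-01-01", "2019-01-05",
                              "2019-02-01", "2019-05-01", "2019-03-01"]),
})

two = pad_or_trim(sample, n=2)    # every ID ends up with 2 rows
three = pad_or_trim(sample, n=3)  # every ID ends up with 3 rows
```

Everything here is a boolean mask, a map, or a repeat on the index, so no per-group Python function is called, which may matter at around half a million rows.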
Answered By: Quang Hoang