How to use IF NOT IN in pandas groupby object?

Question:

I have a dataframe like this:

import pandas as pd
import numpy as np
# create a sample DataFrame
data = {'ID': [1, 1, 1, 2, 2, 2],
        'timestamp': ['2022-01-01 12:00:00', '2022-01-01 13:00:00', '2022-01-01 18:00:00',
                      '2022-01-01 12:02:00', '2022-01-01 13:02:00', '2022-01-01 18:02:00'],
        'value1': [10, 20, 30, 40, 50, 60],
        'gender': ['M', 'M', 'F', 'F', 'F', 'M'],
        'age': [20, 25, 30, 35, 40, 45]}
df = pd.DataFrame(data)

# extract the date from the timestamp column
df['date'] = pd.to_datetime(df['timestamp']).dt.date

For this dataframe, I would like to enumerate the timestamp values. Then, for each single timestamp, I check in the groupby object whether it exists; if it does not exist, I append it. Here is my approach:

for indx, single_date in enumerate(df.timestamp):
    #print(single_date)
    if df.timestamp[indx] not in df.groupby(['ID'],as_index=False):
        df2 = pd.DataFrame([[df.ID[indx],df.timestamp[indx],np.nan,df.gender[indx],df.age[indx]]],
                           columns=['ID', 'timestamp', 'value1', 'gender', 'age'])
        #print(df2)
        df2['timestamp'] = pd.to_datetime(df2['timestamp'])
        new_ckd = df.groupby(['ID']).apply(lambda y: pd.concat([y, df2]))
new_ckd['timestamp'] = pd.to_datetime(new_ckd['timestamp'])
new_ckd = new_ckd.sort_values(by=['timestamp'], ascending=True).reset_index(drop=True)
#print(new_ckd)
#print(df.ID[indx])
print(df.groupby(['ID'],as_index=False).timestamp.apply(print))
for indx, single_date in enumerate(df.timestamp):
    #print(df.timestamp[indx])
    if df.timestamp[indx] in df.groupby(['ID'],as_index=False).timestamp:
        print('a')

I realized that an `if ... not in` condition on a groupby object does not work. How can I make it work?
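A short sketch of why the membership test fails (sample data trimmed for brevity): iterating a `DataFrameGroupBy` yields `(key, sub-DataFrame)` pairs, so `x in groupby` compares a scalar against those pairs and never matches. Checking against each group's column values works instead:

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 2, 2],
                   'timestamp': ['2022-01-01 12:00:00', '2022-01-01 13:00:00',
                                 '2022-01-01 12:02:00', '2022-01-01 13:02:00']})

g = df.groupby('ID')

# `in` iterates the groupby, which yields (key, sub-DataFrame) tuples,
# so a scalar timestamp is compared against tuples and never matches:
print('2022-01-01 12:00:00' in g)   # False, even though ID 1 contains this value

# to test per-group membership, check each group's column values instead:
for key, sub in g:
    print(key, '2022-01-01 12:00:00' in sub['timestamp'].values)
```

This is why the `not in` branch in the question fires for every row, regardless of the data.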

What I have:

ID  value1  timestamp            gender  age
1   50      2022-01-01 12:00:00  m       7
1   80      2022-01-01 12:30:00  m       7
1   65      2022-01-01 13:00:00  m       7
2   65      2022-01-01 12:02:00  f       8
2   83      2022-01-01 12:22:00  f       8
2   63      2022-01-01 12:42:00  f       8

What I expect:

ID  value1  timestamp            gender  age
1   50      2022-01-01 12:00:00  m       7
1   NaN     2022-01-01 12:02:00  m       7
1   NaN     2022-01-01 12:22:00  m       7
1   80      2022-01-01 12:30:00  m       7
1   NaN     2022-01-01 12:42:00  m       7
1   65      2022-01-01 13:00:00  m       7
2   NaN     2022-01-01 12:00:00  f       8
2   65      2022-01-01 12:02:00  f       8
2   83      2022-01-01 12:22:00  f       8
2   NaN     2022-01-01 12:30:00  f       8
2   63      2022-01-01 12:42:00  f       8
2   NaN     2022-01-01 13:00:00  f       8

Asked By: dspractician


Answers:

You can reframe your task as: add the missing timestamps to every unique ID, based on all timestamps present in the dataframe, and then fill the NaNs in the result.

This can be achieved, for example, by reindexing via a MultiIndex and then filling the resulting NaNs:

import pandas as pd
import numpy as np

data = {'ID': [1, 1, 1, 2, 2, 2],
        'timestamp': ['2022-01-01 12:00:00', '2022-01-01 13:00:00', '2022-01-01 18:00:00',
                      '2022-01-01 12:02:00', '2022-01-01 13:02:00', '2022-01-01 18:02:00'],
        'value1': [10, 20, 30, 40, 50, 60],
        'gender': ['M', 'M', 'F', 'F', 'F', 'M'],
        'age': [20, 25, 30, 35, 40, 45]}
df = pd.DataFrame(data)

# cross apply to build index 
cross = df[['ID']].drop_duplicates().merge(df[['timestamp']].drop_duplicates(), how = 'cross')
multiIdx = pd.MultiIndex.from_frame(cross)

# "add" missing rows
df = (df.set_index(['ID', 'timestamp'])
        .reindex(multiIdx, fill_value=np.nan)
        .reset_index()
        .sort_values(by=['ID', 'timestamp'], ignore_index=True))

# fill NaNs
df[['gender', 'age']] = df.groupby('ID')[['gender', 'age']].ffill().bfill()

Update:

If you have non-unique entries (based on the ID + timestamp pair), `reindex` will fail, but you can use a left merge instead:

cross = ...
df = (cross.merge(df, on=['ID', 'timestamp'], how='left')
           .sort_values(by=['ID', 'timestamp'], ignore_index=True))
df[['gender', 'age']] = df.groupby('ID')[['gender', 'age']].ffill().bfill()
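Putting the left-merge variant together as a runnable sketch (same sample data as above; note that `how='cross'` requires pandas 1.2 or later):

```python
import pandas as pd

data = {'ID': [1, 1, 1, 2, 2, 2],
        'timestamp': ['2022-01-01 12:00:00', '2022-01-01 13:00:00', '2022-01-01 18:00:00',
                      '2022-01-01 12:02:00', '2022-01-01 13:02:00', '2022-01-01 18:02:00'],
        'value1': [10, 20, 30, 40, 50, 60],
        'gender': ['M', 'M', 'F', 'F', 'F', 'M'],
        'age': [20, 25, 30, 35, 40, 45]}
df = pd.DataFrame(data)

# every unique ID paired with every unique timestamp
cross = (df[['ID']].drop_duplicates()
         .merge(df[['timestamp']].drop_duplicates(), how='cross'))

# the left merge keeps all (ID, timestamp) pairs; missing rows get NaN
out = (cross.merge(df, on=['ID', 'timestamp'], how='left')
       .sort_values(by=['ID', 'timestamp'], ignore_index=True))

# propagate gender/age within each ID; value1 stays NaN for added rows
out[['gender', 'age']] = out.groupby('ID')[['gender', 'age']].ffill().bfill()
print(out)
```

With two IDs and six distinct timestamps this yields twelve rows, six of them with `value1` as NaN, matching the shape of the expected output.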
Answered By: Guru Stron

You can achieve this by first creating a new DataFrame with all possible timestamp values for each ID, and then merging it with the original DataFrame using an outer join. Finally, you can fill in the missing values using forward fill (ffill) and backward fill (bfill).

# make sure the timestamps are datetimes before building date ranges
df['timestamp'] = pd.to_datetime(df['timestamp'])

# a dense 2-minute grid between each ID's first and last timestamp
id_timestamps = (df.groupby('ID')['timestamp']
                   .apply(lambda x: pd.date_range(start=x.min(), end=x.max(), freq='2min'))
                   .reset_index())
id_timestamps = id_timestamps.explode('timestamp')
id_timestamps['timestamp'] = pd.to_datetime(id_timestamps['timestamp'])

df_merged = pd.merge(id_timestamps, df, on=['ID', 'timestamp'], how='outer')

df_merged = df_merged.sort_values(by=['ID', 'timestamp']).reset_index(drop=True)

# carry gender and age over to the inserted rows within each ID
df_merged[['gender', 'age']] = df_merged.groupby('ID')[['gender', 'age']].ffill().bfill()

I hope this answers your question.

Answered By: Benny

So, first create a DataFrame containing all the possible timestamp combinations for each ID, then merge it with the original DataFrame.

  1. Create a DataFrame with all possible timestamp combinations for each ID.
  2. Merge the original DataFrame with the new DataFrame using pd.merge() on ['ID', 'timestamp'], using an outer join.
  3. Sort the merged DataFrame by 'ID' and 'timestamp'.
  4. Reset the index.

Try this:

import pandas as pd
import numpy as np

# create a sample DataFrame
data = {'ID': [1, 1, 1, 2, 2, 2],
        'timestamp': ['2022-01-01 12:00:00', '2022-01-01 13:00:00', '2022-01-01 18:00:00',
                      '2022-01-01 12:02:00', '2022-01-01 13:02:00', '2022-01-01 18:02:00'],
        'value1': [10, 20, 30, 40, 50, 60],
        'gender': ['M', 'M', 'F', 'F', 'F', 'M'],
        'age': [20, 25, 30, 35, 40, 45]}
df = pd.DataFrame(data)

# Convert 'timestamp' column to datetime
df['timestamp'] = pd.to_datetime(df['timestamp'])

# Create a new DataFrame with all possible timestamp combinations for each ID
unique_ids = df['ID'].unique()
unique_timestamps = df['timestamp'].unique()
all_combinations = pd.MultiIndex.from_product(
    [unique_ids, unique_timestamps], names=['ID', 'timestamp']).to_frame(index=False)

# Merge the original DataFrame with the new DataFrame
merged_df = pd.merge(all_combinations, df, on=['ID', 'timestamp'], how='outer')

# Sort the merged DataFrame by 'ID' and 'timestamp'
merged_df = merged_df.sort_values(by=['ID', 'timestamp'])

# Reset the index
merged_df = merged_df.reset_index(drop=True)

print(merged_df)
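Note that `merged_df` above still carries NaN in `gender` and `age` for the inserted rows. If those attributes should be propagated per ID (as in the expected output), a per-group ffill/bfill can be appended, e.g.:

```python
import pandas as pd

data = {'ID': [1, 1, 1, 2, 2, 2],
        'timestamp': ['2022-01-01 12:00:00', '2022-01-01 13:00:00', '2022-01-01 18:00:00',
                      '2022-01-01 12:02:00', '2022-01-01 13:02:00', '2022-01-01 18:02:00'],
        'value1': [10, 20, 30, 40, 50, 60],
        'gender': ['M', 'M', 'F', 'F', 'F', 'M'],
        'age': [20, 25, 30, 35, 40, 45]}
df = pd.DataFrame(data)
df['timestamp'] = pd.to_datetime(df['timestamp'])

all_combinations = pd.MultiIndex.from_product(
    [df['ID'].unique(), df['timestamp'].unique()],
    names=['ID', 'timestamp']).to_frame(index=False)

merged_df = (all_combinations.merge(df, on=['ID', 'timestamp'], how='outer')
             .sort_values(by=['ID', 'timestamp'])
             .reset_index(drop=True))

# propagate gender and age within each ID; value1 stays NaN for the added rows
merged_df[['gender', 'age']] = merged_df.groupby('ID')[['gender', 'age']].ffill().bfill()
print(merged_df)
```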
Answered By: Dee Dee