How do I 'ungroup' a grouped data frame?
Question:
I have a large data frame containing ~100 million observations of stock returns, with a date and an identifier for each specific stock. My problem is that for some stocks, on some dates, there are missing values in the return column. As a small example, the data frame looks something like this:
ID | Date | Return |
---|---|---|
10001 | 1970-01-01 | 0.043 |
10001 | 1970-01-02 | 0.023 |
10001 | 1970-01-03 | -0.010 |
10002 | 1970-01-01 | 0.010 |
10002 | 1970-01-02 | NaN |
10002 | 1970-01-03 | -0.034 |
10003 | 1970-01-01 | 0.040 |
10003 | 1970-01-02 | -0.041 |
10003 | 1970-01-03 | 0.009 |
I wish to remove every stock with one or more missing values in the return column, which I do by:
df = df[df['Return'].notna().groupby(df['ID']).transform('all')]
From the data frame example above, this generates what looks like a grouped data frame, without the stocks that have missing values in the return column:
ID | Date | Return |
---|---|---|
10001 | 1970-01-01 | 0.043 |
 | 1970-01-02 | 0.023 |
 | 1970-01-03 | -0.010 |
10003 | 1970-01-01 | 0.040 |
 | 1970-01-02 | -0.041 |
 | 1970-01-03 | 0.009 |
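As a self-contained sketch of this filtering step (the data frame is reconstructed here from the example above; grouping the mask by the ID column passed as a Series is an assumption about how the original frame is indexed):

```python
import numpy as np
import pandas as pd

# Rebuild the example data frame from the question.
df = pd.DataFrame({
    'ID': [10001] * 3 + [10002] * 3 + [10003] * 3,
    'Date': ['1970-01-01', '1970-01-02', '1970-01-03'] * 3,
    'Return': [0.043, 0.023, -0.010, 0.010, np.nan, -0.034,
               0.040, -0.041, 0.009],
})

# Per row, True when the return is present; grouped per ID,
# transform('all') broadcasts "does this ID have no missing returns?"
# back onto every row, giving a boolean mask aligned with df.
mask = df['Return'].notna().groupby(df['ID']).transform('all')
result = df[mask]
print(result)
```

Note that boolean indexing yields an ordinary DataFrame, not a GroupBy object, so no explicit "ungrouping" step should be needed; the grouped-looking display in the question is just how a MultiIndex is rendered.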
However, I need the data frame to be ungrouped again. Does anyone have a solution for this? Specifically, I want the data frame to look like this:
ID | Date | Return |
---|---|---|
10001 | 1970-01-01 | 0.043 |
10001 | 1970-01-02 | 0.023 |
10001 | 1970-01-03 | -0.010 |
10003 | 1970-01-01 | 0.040 |
10003 | 1970-01-02 | -0.041 |
10003 | 1970-01-03 | 0.009 |
Thank you in advance for any possible solution.
I have already tried using .unstack() and .reset_index(), without any luck.
Answers:
You don't need grouping here; simple boolean indexing with .isin() is enough:
x = df.loc[df['Return'].isna(), 'ID'].unique()
print(df[~df['ID'].isin(x)])
Prints:
ID Date Return
0 10001 1970-01-01 0.043
1 10001 1970-01-02 0.023
2 10001 1970-01-03 -0.010
6 10003 1970-01-01 0.040
7 10003 1970-01-02 -0.041
8 10003 1970-01-03 0.009
For anyone wondering, this can also be fixed by passing group_keys=False to the groupby call:
df = df[df['Return'].notna().groupby(df['ID'], group_keys=False).transform('all')]
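For completeness, pandas also offers `DataFrameGroupBy.filter`, which drops entire groups failing a predicate and concatenates the survivors back into a regular, ungrouped frame. A sketch on the same example data (reconstructed here; column names assumed from the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID': [10001] * 3 + [10002] * 3 + [10003] * 3,
    'Date': ['1970-01-01', '1970-01-02', '1970-01-03'] * 3,
    'Return': [0.043, 0.023, -0.010, 0.010, np.nan, -0.034,
               0.040, -0.041, 0.009],
})

# Keep only the groups whose returns are all present; filter() returns
# a plain DataFrame with the original row order and index preserved.
clean = df.groupby('ID').filter(lambda g: g['Return'].notna().all())
print(clean)
```

On ~100 million rows, the vectorized mask approaches above will typically be much faster than a Python-level lambda per group, so filter() is best reserved for readability on smaller data.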