Collapse Pandas rows to eliminate NaN entries
Question:
Let’s consider the following DataFrame
Name | A | B | C | D |
---|---|---|---|---|
tom | 10.0 | NaN | NaN | NaN |
tom | NaN | 15.0 | NaN | NaN |
tom | NaN | NaN | 20.0 | NaN |
tom | NaN | NaN | NaN | 25.0 |
tom | 30.0 | NaN | NaN | NaN |
tom | NaN | NaN | NaN | 40.0 |
john | 1.0 | NaN | NaN | NaN |
john | NaN | 2.0 | NaN | NaN |
john | NaN | NaN | 3.0 | NaN |
john | NaN | NaN | NaN | 4.0 |
john | 5.0 | NaN | NaN | NaN |
john | NaN | 6.0 | NaN | NaN |
john | NaN | NaN | 7.0 | NaN |
john | NaN | NaN | NaN | 8.0 |
I want to collapse it to limit the number of NaN values in the DataFrame. The collapse can be sequential, i.e. combining neighboring rows where possible, but all I care about is that the values in columns A-D still correspond to the same Name after the collapse.
My perfect outcome would be
Name | A | B | C | D |
---|---|---|---|---|
tom | 10.0 | 15.0 | 20.0 | 25.0 |
tom | 30.0 | NaN | NaN | 40.0 |
john | 1.0 | 2.0 | 3.0 | 4.0 |
john | 5.0 | 6.0 | 7.0 | 8.0 |
From what I understand, Pandas groupby('Name')
will not do the trick, because it will leave one entry for each name.
If that is of any help, I use a dictionary to create the DataFrame. The dictionary looks like this:

```
{
    "a": {
        "tom": [10.0, 30.0],
        "john": [1.0, 5.0]
    },
    "b": {
        "tom": [15.0],
        "john": [2.0, 6.0]
    },
    .....
}
```
So, basically, I am taking every number in the dictionary, creating a row with just that number, and then combining all of the rows. Is there a simple way to collapse the resulting DataFrame, or to build a more compact DataFrame from such a dictionary?
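One way to skip the collapse entirely is to build the compact DataFrame straight from the dictionary: for each name, zip the per-column lists together with `itertools.zip_longest`, padding shorter lists with NaN. A sketch follows; the elided `"c"`/`"d"` entries are filled in here from the tables above purely for illustration.

```python
from itertools import zip_longest

import pandas as pd

# The questioner's dictionary; the "c" and "d" entries are reconstructed
# from the example tables above (an assumption for illustration).
data = {
    "a": {"tom": [10.0, 30.0], "john": [1.0, 5.0]},
    "b": {"tom": [15.0], "john": [2.0, 6.0]},
    "c": {"tom": [20.0], "john": [3.0, 7.0]},
    "d": {"tom": [25.0, 40.0], "john": [4.0, 8.0]},
}

rows = []
for name in ["tom", "john"]:
    # One list of values per column for this name; zip_longest pads the
    # shorter lists with NaN, so each iteration yields one packed row.
    columns = [data[col].get(name, []) for col in data]
    for values in zip_longest(*columns, fillvalue=float("nan")):
        rows.append([name, *values])

df = pd.DataFrame(rows, columns=["Name", *(c.upper() for c in data)])
print(df)
```

This produces the "perfect outcome" table directly, with no intermediate one-value-per-row DataFrame to collapse afterwards.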
Answers:
You can use .groupby + .transform to "move" the non-NaN values up within each group, then drop the rows that contain only NaN values:
```python
import pandas as pd

print(
    df.set_index("Name")
    .groupby(level=0)
    .transform(lambda x: sorted(x, key=lambda k: pd.isna(k)))
    .dropna(axis=0, how="all")
    .reset_index()
)
```
Prints:

```
   Name     A     B     C     D
0   tom  10.0  15.0  20.0  25.0
1   tom  30.0   NaN   NaN  40.0
2  john   1.0   2.0   3.0   4.0
3  john   5.0   6.0   7.0   8.0
```
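If you'd rather avoid sorting inside `.transform`, an alternative sketch is to drop NaNs column-wise per group and let `pd.concat` re-align the surviving values on a fresh index. The DataFrame is rebuilt here from the question's table so the snippet is self-contained.

```python
import pandas as pd

# Rebuild the questioner's DataFrame from the table above.
df = pd.DataFrame({
    "Name": ["tom"] * 6 + ["john"] * 8,
    "A": [10.0, None, None, None, 30.0, None,
          1.0, None, None, None, 5.0, None, None, None],
    "B": [None, 15.0, None, None, None, None,
          None, 2.0, None, None, None, 6.0, None, None],
    "C": [None, None, 20.0, None, None, None,
          None, None, 3.0, None, None, None, 7.0, None],
    "D": [None, None, None, 25.0, None, 40.0,
          None, None, None, 4.0, None, None, None, 8.0],
})

parts = []
for name, group in df.groupby("Name", sort=False):
    # Drop NaNs per column, then reset to a fresh RangeIndex so the
    # surviving values pack upwards; concat pads short columns with NaN.
    packed = pd.concat(
        [group[col].dropna().reset_index(drop=True)
         for col in ["A", "B", "C", "D"]],
        axis=1,
    )
    packed.insert(0, "Name", name)
    parts.append(packed)

out = pd.concat(parts, ignore_index=True)
print(out)
```

This gives the same four packed rows as the `.transform` approach, and makes the "move the values up" step explicit: each column is compacted independently and the columns are then re-aligned side by side.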