pandas.DataFrame.explode produces too many rows
Question:
Given the following data:
data = {'type': ['chisel', 'disc', 'user_defined'],
'depth': [[152, 178, 203], [127, 152, 178, 203], [0]],
'residue': [[0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0], [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0], [0.0]],
'timing': [["10-nov", "10-apr"], ["10-nov", "10-apr"], ["10-apr"]]}
Create df:
import pandas as pd
df = pd.DataFrame(data)
Output as expected:
explode timing:
df = df.explode('timing')
Output as expected:
- One additional row for each item in timing
explode depth:
df = df.explode('depth')
Output not as expected:
- I expect there to be 6 rows for chisel and 8 rows for disc
  - 3 each, for 10-apr & 10-nov, for chisel
  - 4 each, for 10-apr & 10-nov, for disc
- explode is producing twice as many as expected
  - 12 instead of 6, for chisel
  - 16 instead of 8, for disc
Questions:
- Is my expectation incorrect?
- Am I using explode incorrectly?
Answers:
pandas can produce unexpected results when you work with duplicate indexes. Notice that after the first explode, you end up with duplicated index labels, because every row created from the same original row keeps that row's label.
Resetting them will yield a dataframe that works as you expect.
Fix it with:
df.explode('timing', ignore_index=True).explode('depth')
           type  depth                              residue  timing
0        chisel    152  [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]  10-nov
0        chisel    178  [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]  10-nov
0        chisel    203  [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]  10-nov
1        chisel    152  [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]  10-apr
1        chisel    178  [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]  10-apr
1        chisel    203  [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]  10-apr
2          disc    127  [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]  10-nov
2          disc    152  [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]  10-nov
2          disc    178  [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]  10-nov
2          disc    203  [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]  10-nov
3          disc    127  [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]  10-apr
3          disc    152  [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]  10-apr
3          disc    178  [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]  10-apr
3          disc    203  [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]  10-apr
4  user_defined      0                                [0.0]  10-apr
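If you want to check this yourself, here is a minimal sketch using the data from the question. It assumes pandas 1.1.0 or later, since that is where the ignore_index parameter of explode was added. The names step1, fixed, fixed_alt and counts are only illustrative. It first shows the repeated index labels after the first explode, then applies the fix in two equivalent ways and counts the rows per type.

```python
import pandas as pd

data = {'type': ['chisel', 'disc', 'user_defined'],
        'depth': [[152, 178, 203], [127, 152, 178, 203], [0]],
        'residue': [[0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
                    [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
                    [0.0]],
        'timing': [["10-nov", "10-apr"], ["10-nov", "10-apr"], ["10-apr"]]}
df = pd.DataFrame(data)

# After the first explode, each new row keeps the label of the row it came from,
# so the index is no longer unique.
step1 = df.explode('timing')
print(step1.index.tolist())   # [0, 0, 1, 1, 2]
print(step1.index.is_unique)  # False

# Option 1: let explode renumber the rows (assumes pandas >= 1.1.0).
fixed = df.explode('timing', ignore_index=True).explode('depth')

# Option 2: reset the index explicitly between the two explode calls.
fixed_alt = df.explode('timing').reset_index(drop=True).explode('depth')

# Rows per type should be len(timing) * len(depth) of the original row.
counts = fixed.groupby('type').size().to_dict()
print(counts['chisel'], counts['disc'], counts['user_defined'])  # 6 8 1
print(len(fixed), len(fixed_alt))  # 15 15
```

Option 2 is only needed if you cannot pass ignore_index; both give the 6 rows for chisel and 8 rows for disc that the question expects.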