pandas.DataFrame.groupby loses index and messes up the data
Question:
I have a pandas.DataFrame (named df) with the following data:
    labels               texts
0   labelA  Some Text 12345678
1   labelA  Some Text 12345678
2   labelA  Some Text 12345678
3   labelA  Some Text 12345678
4   labelB  Some Text 12345678
5   labelB  Some Text 12345678
6   labelB  Some Text 12345678
7   labelC  Some Text 12345678
8   labelC  Some Text 12345678
9   labelC  Some Text 12345678
10  labelC  Some Text 12345678
11  labelC  Some Text 12345678
12  labelC  Some Text 12345678
When I group by labels with the following code (the goal is to take 2 samples from each label), the original index is lost:
grouped = df.groupby('labels')
result = grouped.apply(lambda x: x.sample(n=2))
print(result)
The output becomes:
          labels               texts
labels
labelA 0  labelA  Some Text 12345678
       0  labelA  Some Text 12345678
       0  labelB  Some Text 12345678
       0  labelB  Some Text 12345678
       0  labelC  Some Text 12345678
       0  labelC  Some Text 12345678
I would like the output to be:
   labels               texts
0  labelA  Some Text 12345678
1  labelA  Some Text 12345678
2  labelB  Some Text 12345678
3  labelB  Some Text 12345678
4  labelC  Some Text 12345678
5  labelC  Some Text 12345678
How should I make the changes?
I tried result.droplevel(0).reset_index(), following this answer, but the result becomes:
   index  labels               texts
0      0  labelA  Some Text 12345678
1      0  labelA  Some Text 12345678
2      0  labelB  Some Text 12345678
3      0  labelB  Some Text 12345678
4      0  labelC  Some Text 12345678
5      0  labelC  Some Text 12345678
Answers:
Pass group_keys=False to DataFrame.groupby so that the group labels are not added as an extra index level:
grouped = df.groupby('labels', group_keys=False)
result = grouped.apply(lambda x: x.sample(n=2))
print(result)
   labels               texts
0  labelA  Some Text 12345678
1  labelA  Some Text 12345678
4  labelB  Some Text 12345678
6  labelB  Some Text 12345678
9  labelC  Some Text 12345678
8  labelC  Some Text 12345678
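To see what group_keys actually changes, here is a minimal self-contained sketch that rebuilds a frame shaped like the one in the question and compares the index of both results. The helper name take_two and the fixed random_state are my additions for illustration and reproducibility; they are not part of the original code. Depending on your pandas version, apply may also emit a deprecation warning about the grouping column or leave it out of the result, so the comparison below looks only at the index.

```python
import pandas as pd

# Rebuild a frame like the one in the question: 4 x labelA, 3 x labelB, 6 x labelC.
df = pd.DataFrame({
    "labels": ["labelA"] * 4 + ["labelB"] * 3 + ["labelC"] * 6,
    "texts": ["Some Text 12345678"] * 13,
})


def take_two(group):
    # Hypothetical helper; random_state is fixed only to make runs repeatable.
    return group.sample(n=2, random_state=0)


# Default (group_keys=True): the group label is prepended as an outer index level.
with_keys = df.groupby("labels").apply(take_two)

# group_keys=False: only the original row labels remain in the index.
without_keys = df.groupby("labels", group_keys=False).apply(take_two)

print(with_keys.index.nlevels)     # number of index levels with group keys
print(without_keys.index.nlevels)  # number of index levels without group keys
print(without_keys.index.tolist()) # original row labels of the sampled rows
```

The row labels in without_keys are the positions the sampled rows had in df, which is why they are not consecutive.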
Another idea is to discard the index entirely and replace it with a default RangeIndex:
grouped = df.groupby('labels')
result = grouped.apply(lambda x: x.sample(n=2)).reset_index(drop=True)
print(result)
   labels               texts
0  labelA  Some Text 12345678
1  labelA  Some Text 12345678
2  labelB  Some Text 12345678
3  labelB  Some Text 12345678
4  labelC  Some Text 12345678
5  labelC  Some Text 12345678
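As a possible alternative not mentioned in the answer above: newer pandas versions (1.1 and later, as far as I know) provide DataFrameGroupBy.sample, which samples inside each group without going through apply and keeps the original row labels. A rough sketch, again with a rebuilt frame and a fixed random_state that are my own assumptions:

```python
import pandas as pd

# Rebuild a frame like the one in the question: 4 x labelA, 3 x labelB, 6 x labelC.
df = pd.DataFrame({
    "labels": ["labelA"] * 4 + ["labelB"] * 3 + ["labelC"] * 6,
    "texts": ["Some Text 12345678"] * 13,
})

# Sample 2 rows per label; the result keeps the original row labels.
sampled = df.groupby("labels").sample(n=2, random_state=0)

# Drop those labels and renumber the rows from 0, as in the desired output.
renumbered = sampled.reset_index(drop=True)

print(sampled.index.tolist())     # original row labels of the sampled rows
print(renumbered.index.tolist())  # consecutive labels starting at 0
```

Chaining reset_index(drop=True) is only needed if you want the rows renumbered; skip it to keep the original labels.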