Python: Random selection per group
Question:
Say that I have a dataframe that looks like:
Name Group_Id
AAA 1
ABC 1
CCC 2
XYZ 2
DEF 3
YYH 3
How could I randomly select one (or more) rows for each Group_Id? Say that I want one random draw per Group_Id; I would get:
Name Group_Id
AAA 1
XYZ 2
DEF 3
Answers:
Using random.choice, you can do something like this:
import random

name_group = {'AAA': 1, 'ABC': 1, 'CCC': 2, 'XYZ': 2, 'DEF': 3, 'YYH': 3}
names = list(name_group.keys())  # create a list out of the keys in the name_group dict
first_name = random.choice(names)
first_group = name_group[first_name]
print(first_name, first_group)
random.choice(seq)
Return a random element from the non-empty sequence seq. If seq is empty, raises IndexError.
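The snippet above draws a single name from the whole dict rather than one per group. A minimal sketch of extending it to one draw per Group_Id, still using only the standard library (the group_names inversion is a hypothetical helper, not part of the original answer):

```python
import random

name_group = {'AAA': 1, 'ABC': 1, 'CCC': 2, 'XYZ': 2, 'DEF': 3, 'YYH': 3}

# Invert the mapping: group id -> list of names in that group
group_names = {}
for name, group in name_group.items():
    group_names.setdefault(group, []).append(name)

# One random draw per group
picked = {group: random.choice(names) for group, names in group_names.items()}
print(picked)
```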
You can use a combination of pandas.groupby, pandas.concat and random.sample:
import pandas as pd
import random

df = pd.DataFrame({
    'Name': ['AAA', 'ABC', 'CCC', 'XYZ', 'DEF', 'YYH'],
    'Group_ID': [1, 1, 2, 2, 3, 3]
})
grouped = df.groupby('Group_ID')
# .ix has been removed from pandas; use .loc with an explicit list of sampled labels
df_sampled = pd.concat([d.loc[random.sample(list(d.index), 1)] for _, d in grouped]).reset_index(drop=True)
print(df_sampled)
Output:
Group_ID Name
0 1 AAA
1 2 XYZ
2 3 DEF
import numpy as np

size = 2        # sample size
replace = True  # with replacement
fn = lambda obj: obj.loc[np.random.choice(obj.index, size, replace), :]
df.groupby('Group_Id', as_index=False).apply(fn)
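The fn snippet above assumes df is already defined; a self-contained sketch of the same with-replacement approach on the question's data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['AAA', 'ABC', 'CCC', 'XYZ', 'DEF', 'YYH'],
    'Group_Id': [1, 1, 2, 2, 3, 3],
})

size = 2        # sample size per group
replace = True  # with replacement, so size may exceed a group's row count

# Draw `size` row labels per group (possibly repeating) and select those rows
fn = lambda obj: obj.loc[np.random.choice(obj.index, size, replace), :]
sampled = df.groupby('Group_Id', as_index=False).apply(fn)
print(sampled)
```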
Using groupby and random.choice in an elegant one-liner:
df.groupby('Group_Id').apply(lambda x: x.iloc[random.choice(range(len(x)))])
From 0.16.x onwards, pd.DataFrame.sample provides a way to return a random sample of items from an axis of an object.
In [664]: df.groupby('Group_Id').apply(lambda x: x.sample(1)).reset_index(drop=True)
Out[664]:
Name Group_Id
0 ABC 1
1 XYZ 2
2 DEF 3
For randomly selecting just one row per group, try:
df.sample(frac = 1.0).groupby('Group_Id').head(1)
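A quick, self-contained check of the shuffle-then-head idea: sampling with frac=1.0 shuffles every row, so head(1) then keeps one random row per group:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['AAA', 'ABC', 'CCC', 'XYZ', 'DEF', 'YYH'],
    'Group_Id': [1, 1, 2, 2, 3, 3],
})

# Shuffle all rows, then keep the first row encountered in each group
one_per_group = df.sample(frac=1.0).groupby('Group_Id').head(1)
print(one_per_group)
```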
There are two ways to do this very simply, one without using anything except basic pandas syntax:
df[['x','y']].groupby('x').agg(pd.DataFrame.sample)
This takes 14.4 ms on a 50k-row dataset.
The other, slightly faster method involves numpy:
df[['x','y']].groupby('x').agg(np.random.choice)
This takes 10.9 ms on the same 50k-row dataset.
Generally speaking, when using pandas, it's preferable to stick with its native syntax, especially for beginners.
A very pandas-ish way:
n = 2  # desired sample size per group
takesamp = lambda d: d.sample(n)
df = df.groupby('Group_Id').apply(takesamp)
The solutions offered fail if a group has fewer samples than the desired sample size n. This addresses that problem:
n = 10
df.groupby('Group_Id').apply(lambda x: x.sample(min(n,len(x)))).reset_index(drop=True)
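Run on the question's data, where every group has only 2 rows, n = 10 falls back to the whole group instead of raising a ValueError:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['AAA', 'ABC', 'CCC', 'XYZ', 'DEF', 'YYH'],
    'Group_Id': [1, 1, 2, 2, 3, 3],
})

n = 10  # larger than any group; plain x.sample(n) would raise ValueError
out = (df.groupby('Group_Id')
         .apply(lambda x: x.sample(min(n, len(x))))
         .reset_index(drop=True))
print(out)
```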
I found another one:
import numpy as np

size = 2
count_s = df['Group_Id'].value_counts(sort=False)  # counts in order of appearance
# offsets must be cumulative so each group's draws index into its own block of rows
offsets = count_s.cumsum().shift(fill_value=0)
df.iloc[np.concatenate([offset + np.random.choice(count, size)
                        for count, offset in zip(count_s, offsets)])]
Note that this positional approach assumes rows with the same Group_Id are contiguous in df.
df.groupby('Group_Id').sample(n=1)
New in version 1.1.0.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.DataFrameGroupBy.sample.html
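A self-contained sketch of DataFrameGroupBy.sample on the question's data; random_state makes the draw reproducible:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['AAA', 'ABC', 'CCC', 'XYZ', 'DEF', 'YYH'],
    'Group_Id': [1, 1, 2, 2, 3, 3],
})

# One random row per group (requires pandas >= 1.1.0)
out = df.groupby('Group_Id').sample(n=1, random_state=0)
print(out)
```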