Random Sample of a subset of a dataframe in Pandas

Question:

I have a pandas DataFrame with 100,000 rows and want to split it into 100 sections with 1000 rows in each of them.

How do I draw a random sample of a certain size (e.g. 50 rows) from just one of the 100 sections? The df is already ordered such that the first 1000 rows belong to the first section, the next 1000 to the second, and so on.

Asked By: WGP


Answers:

One solution is to use the choice function from numpy.

Say you want 50 entries out of 1000; you can use:

import numpy as np
chosen_idx = np.random.choice(1000, replace=False, size=50)
df_trimmed = df.iloc[chosen_idx]

This does not, of course, take your block structure into account. If you want a 50-item sample from block i, for example, you can do:

import numpy as np
block_start_idx = 1000 * i
chosen_idx = np.random.choice(1000, replace=False, size=50)
df_trimmed_from_block_i = df.iloc[block_start_idx + chosen_idx]
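To see the idea end to end, here is a minimal sketch assuming a hypothetical df of 100,000 rows (100 blocks of 1,000), drawing 50 rows from every block with this offset trick:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the asker's DataFrame: 100 blocks of 1,000 rows.
df = pd.DataFrame({"x": np.arange(100_000)})

samples = {}
for i in range(100):
    # 50 distinct positions within block i, shifted by the block start
    chosen_idx = np.random.choice(1000, replace=False, size=50)
    samples[i] = df.iloc[1000 * i + chosen_idx]
```

Each `samples[i]` then holds 50 distinct rows drawn only from block i.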
Answered By: jpjandrade

You can use the sample method*:

In [11]: df = pd.DataFrame([[1, 2], [3, 4], [5, 6], [7, 8]], columns=["A", "B"])

In [12]: df.sample(2)
Out[12]:
   A  B
0  1  2
2  5  6

In [13]: df.sample(2)
Out[13]:
   A  B
3  7  8
0  1  2

*On one of the section DataFrames.
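For instance, assuming the sections are contiguous blocks of 1,000 rows as described in the question, a section DataFrame can be sliced out with iloc and then sampled (a sketch, with a fixed random_state only for reproducibility):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the asker's DataFrame: 100 blocks of 1,000 rows.
df = pd.DataFrame({"x": np.arange(100_000)})

i = 7  # pick a section
section = df.iloc[1000 * i : 1000 * (i + 1)]  # rows 7000..7999
sampled = section.sample(50, random_state=0)  # 50 rows, no replacement
```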

Note: If the sample size is larger than the size of the DataFrame, this will raise an error unless you sample with replacement.

In [14]: df.sample(5)
ValueError: Cannot take a larger sample than population when 'replace=False'

In [15]: df.sample(5, replace=True)
Out[15]:
   A  B
0  1  2
1  3  4
2  5  6
3  7  8
1  3  4
Answered By: Andy Hayden

This is a nice place for recursion.

import numpy as np


def main2():
    rows = 8  # say you have 8 rows; for real data use len(df)
    rands = []
    for i in range(rows):
        gen = fun(rands)
        rands.append(gen)
    print(rands)  # now range through the random values


def fun(rands):
    # draw until we hit a value not already chosen
    gen = np.random.randint(0, 8)
    if gen in rands:
        return fun(rands)
    else:
        return gen


if __name__ == "__main__":
    main2()

output: [6, 0, 7, 1, 3, 5, 4, 2]
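For reference, the same duplicate-free draw can be had without recursion using np.random.permutation (a swapped-in NumPy built-in, not part of the answer above):

```python
import numpy as np

rows = 8
# every value 0..rows-1 exactly once, in random order
rands = np.random.permutation(rows)
print(list(rands))
```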

Answered By: GeneralCode

You could add a "section" column to your data then perform a groupby and sample:

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"x": np.arange(1_000 * 100), "section": np.repeat(np.arange(100), 1_000)}
)
# >>> df
#            x  section
# 0          0        0
# 1          1        0
# 2          2        0
# 3          3        0
# 4          4        0
# ...      ...      ...
# 99995  99995       99
# 99996  99996       99
# 99997  99997       99
# 99998  99998       99
# 99999  99999       99
#
# [100000 rows x 2 columns]

sample = df.groupby("section").sample(50)
# >>> sample
#            x  section
# 907      907        0
# 494      494        0
# 775      775        0
# 20        20        0
# 230      230        0
# ...      ...      ...
# 99740  99740       99
# 99272  99272       99
# 99863  99863       99
# 99198  99198       99
# 99555  99555       99
#
# [5000 rows x 2 columns]

Add .query("section == 42") (or similar) if you are interested in only a particular section.
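Concretely, for a single section that looks like (reusing the same constructed df; random_state fixed only for reproducibility):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"x": np.arange(100_000), "section": np.repeat(np.arange(100), 1_000)}
)

# filter down to one section, then sample 50 of its 1,000 rows
one_section_sample = df.query("section == 42").sample(50, random_state=0)
```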

Note this requires pandas >= 1.1.0, see the docs here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.DataFrameGroupBy.sample.html

For older versions, see the answer by @msh5678

Answered By: Jeff

Thank you, Jeff, but I received an error:

AttributeError: Cannot access callable attribute 'sample' of 'DataFrameGroupBy' objects, try using the 'apply' method

So instead of sample = df.groupby("section").sample(50), I suggest the command below:

df.groupby('section').apply(lambda grp: grp.sample(50))
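One side note on the result shape, assuming the same constructed df as in Jeff's answer: apply prepends the group key as an extra index level by default. Passing group_keys=False keeps the original row index instead:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"x": np.arange(100_000), "section": np.repeat(np.arange(100), 1_000)}
)

# group_keys=False avoids the extra "section" index level that
# apply would otherwise prepend to the result
sample = df.groupby("section", group_keys=False).apply(lambda grp: grp.sample(50))
```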
Answered By: msh5678