Distribute data from a list into a column in a pandas DataFrame

Question

I have a list users=['a','b','c','d']

I have a dataframe X with 100 rows.
I want to populate X['users'] from the list users, such that:

The distribution is even. In the above example there must be 25 entries of each element.
The distribution is random. It shouldn't follow a fixed pattern each time I run it. abcdabcd, aaabbbcccddd and accbddab are all valid distributions.

How do I go about this?

Asked By: anotherCoder


Answer 1

So, assuming we have a df with 100 rows, I think we can use np.repeat for this.

import numpy as np
import pandas as pd

X = pd.DataFrame(np.zeros((100, 1)), columns=['users'])

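users = ['a','b','c','d'] # users = ['a']
n_users = len(users)
n_rows = X.shape[0]
n_per_user = n_rows // n_users  # full share for every user
n_extra = n_rows % n_users      # rows left over after the full shares

# Shuffle the list in place so the block order differs on each run
np.random.shuffle(users)
# Repeat each user n_per_user times (this also covers a single user)
column = np.repeat(users, n_per_user)
if n_extra:
    # Draw the leftover rows from the distinct users, not from the repeated
    # array, so that no user can receive more than one extra entry
    extra_users = np.random.choice(users, n_extra, replace=False)
    column = np.concatenate([column, extra_users])

X['users'] = column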
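
The code above fills the column in contiguous blocks (for example all of 'c', then all of 'a', and so on), which the question accepts as valid. If an interleaved order is preferred, one option is to shuffle the finished column as well. A minimal sketch, assuming NumPy's Generator API is available; assign_evenly is a hypothetical helper name, not part of pandas or NumPy:

```python
import numpy as np
import pandas as pd

def assign_evenly(labels, n_rows, rng=None):
    """Hypothetical helper: return n_rows labels, evenly split, in random order."""
    rng = np.random.default_rng() if rng is None else rng
    labels = np.asarray(labels)
    base, extra = divmod(n_rows, len(labels))
    # Every label appears `base` times
    column = np.repeat(labels, base)
    # `extra` distinct labels, picked at random, each get one more row
    bonus = rng.choice(labels, size=extra, replace=False)
    column = np.concatenate([column, bonus])
    # Shuffle the row order so the labels are interleaved rather than blocked
    return rng.permutation(column)

X = pd.DataFrame(index=range(100))
X['users'] = assign_evenly(['a', 'b', 'c', 'd'], len(X))
```

When the number of rows is not a multiple of the number of users, the counts produced by this sketch differ by at most one.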

Answered By: artemis

Answer 2

Pass 25 copies of each element (users*25) into np.random.Generator.choice (or the legacy np.random.choice) and set replace=False:

import numpy as np
import pandas as pd

users = list('abcd')
X = pd.DataFrame()
rng = np.random.default_rng(0)

X['users'] = rng.choice(users*25, size=100, replace=False)
#   users
# 0     d
# 1     d
# 2     b
# 3     a
# ...

X.value_counts()
# users
# a        25
# b        25
# c        25
# d        25
# dtype: int64

On additional runs, we get different sampling but always 25 per element:

X['users'] = rng.choice(users*25, size=100, replace=False)
#   users
# 0     b
# 1     b
# 2     c
# 3     c
# ...

X.value_counts()
# users
# a        25
# b        25
# c        25
# d        25
# dtype: int64
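
The counts stay at 25 because drawing all 100 items from a pool of 100 without replacement is simply a random reordering of that pool. A minimal sketch of the same idea written as an explicit permutation, assuming the number of rows is an exact multiple of the number of users; the names pool and counts are illustrative:

```python
import numpy as np
import pandas as pd

users = list('abcd')
rng = np.random.default_rng()  # unseeded, so the order differs on each run

# Build the pool with exactly 25 copies of each user, then reorder it randomly
pool = np.repeat(users, 25)
X = pd.DataFrame({'users': rng.permutation(pool)})

# Each user still appears exactly 25 times; only the row order is random
counts = X['users'].value_counts()
```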

Answered By: tdy
