Distribute data from a list into a column of a pandas DataFrame
Question:
I have a list users=['a','b','c','d']
I have a dataframe X with 100 rows.
I want to populate X['users'] with the list users, such that
- the distribution is even. In the above example there must be 25 entries of each element.
- the distribution is random. It shouldn't follow a fixed pattern each time I run it.
abcdabcd vs aaabbbcccddd vs accbddab are all valid distributions.
How do I go about this?
Answers:
So, assuming we have a df with 100 rows, I think we can use np.repeat for this.
import numpy as np
import pandas as pd

X = pd.DataFrame(np.zeros((100, 1)), columns=['users'])
users = ['a', 'b', 'c', 'd']  # also works for users = ['a']
n_users = len(users)
n_rows = X.shape[0]
n_per_user = n_rows // n_users
n_extra = n_rows % n_users

# Shuffle the order of the users, then repeat each one n_per_user times.
# This yields contiguous blocks (e.g. ccc...aaa...) in a random block order.
np.random.shuffle(users)
assigned = np.repeat(users, n_per_user)
if n_extra != 0:
    # Fill the leftover rows with distinct users drawn from the original list
    extra_users = np.random.choice(users, n_extra, replace=False)
    assigned = np.concatenate([assigned, extra_users])
X['users'] = assigned
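The same idea (repeat each user, then hand out any leftover rows) can be wrapped in a function that also shuffles the final result, so the labels are mixed rather than grouped in blocks. This is only a sketch: assign_evenly is a hypothetical helper name, not part of NumPy or pandas, and it assumes that "even" means the counts may differ by at most one when the row count is not divisible by the number of users.

```python
import numpy as np
import pandas as pd


def assign_evenly(users, n_rows, rng=None):
    """Return n_rows labels in random order, with counts differing by at most one.

    Hypothetical helper for illustration; not a NumPy or pandas function.
    """
    rng = np.random.default_rng() if rng is None else rng
    users = np.asarray(users)
    n_per_user, n_extra = divmod(n_rows, len(users))
    # Every user appears n_per_user times
    base = np.repeat(users, n_per_user)
    # Leftover rows go to n_extra distinct users picked at random
    extra = rng.permutation(users)[:n_extra]
    # Shuffle the combined labels so there is no block pattern
    return rng.permutation(np.concatenate([base, extra]))


X = pd.DataFrame(index=range(100))
X['users'] = assign_evenly(['a', 'b', 'c', 'd'], len(X))
```

With 100 rows and 4 users every label appears exactly 25 times; with, say, 10 rows and 3 users, one randomly chosen user gets 4 rows and the others get 3.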
Pass 25 of each element (users*25) into np.random.Generator.choice (or the legacy np.random.choice) and set replace=False:
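import numpy as np
import pandas as pd

users = list('abcd')
X = pd.DataFrame()
rng = np.random.default_rng(0)  # fixed seed, so this output is reproducible
# Drawing all 100 items without replacement reorders them at random
X['users'] = rng.choice(users*25, size=100, replace=False)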
# users
# 0 d
# 1 d
# 2 b
# 3 a
# ...
X.value_counts()
# users
# a 25
# b 25
# c 25
# d 25
# dtype: int64
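Calling rng.choice again on the same generator gives a different ordering, but always 25 per element. To get a different result on every run of the script, create the generator without a seed, i.e. np.random.default_rng():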
X['users'] = rng.choice(users*25, size=100, replace=False)
# users
# 0 b
# 1 b
# 2 c
# 3 c
# ...
X.value_counts()
# users
# a 25
# b 25
# c 25
# d 25
# dtype: int64
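Because drawing every item without replacement amounts to a random reordering, the same result can probably be written with np.random.Generator.permutation. The sketch below assumes the row count is an exact multiple of the number of users, and it leaves the generator unseeded so that the order changes from run to run.

```python
import numpy as np
import pandas as pd

users = list('abcd')
X = pd.DataFrame(index=range(100))

# No seed: a fresh generator gives a different order on every run
rng = np.random.default_rng()

# Repeat the list to get 25 of each element, then reorder it at random
X['users'] = rng.permutation(users * (len(X) // len(users)))

# Each user should appear exactly 25 times
counts = X['users'].value_counts()
```

Checking counts after the assignment is a cheap way to confirm that the split is even, whichever of the approaches is used.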