Randomly replacing a specific value in a dataset with frac in pandas

Question:

I’ve got a dataset in which missing values are recorded as " ?" in just one column.
I want to replace every missing value with one of the other values in that column (Feature1), starting from this:

Feature1_value_counts = df.Feature1.value_counts(normalize=True)

The code above gives me the numbers I can use for frac in pandas.
Feature1 contains 15 unique values,
so it returns 15 numbers (all fractions between 0 and 1).

Now I need to randomly replace each " ?" with one of those unique values (all strings), using its frac as the probability.

I don’t know how to do this using pandas!

I’ve tried loc and iloc,
and also some for loops and if statements, but I couldn’t get there.

Asked By: Hosna Asgari

||

Answers:

You can take advantage of the p parameter of numpy.random.choice:

import numpy as np
import pandas as pd

# ensure real NaNs are used for missing values
# (the question's marker is ' ?', with a leading space)
df['Feature1'] = df['Feature1'].replace(' ?', np.nan)

# compute the fraction of each non-NaN value
counts = df['Feature1'].value_counts(normalize=True)
# identify the rows with NaNs
m = df['Feature1'].isna()

# replace the NaNs with random values, using the frequencies as weights
df.loc[m, 'Feature1'] = np.random.choice(counts.index, p=counts, size=m.sum())

print(df)

Output (replaced values as uppercase for clarity):

  Feature1
0        a
1        b
2        a
3        A
4        a
5        b
6        B
7        a
8        A

Used input:

df = pd.DataFrame({'Feature1': ['a', 'b', 'a', np.nan, 'a', 'b', np.nan, 'a', np.nan]})
Answered By: mozway
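
For reference, here is a self-contained sketch of the same idea that can be run as-is. It is not part of the answer above: the helper name fill_by_frequency, its missing and seed parameters, and the toy data are all hypothetical, chosen for illustration. It assumes the missing marker is exactly " ?" (as in the question) and that at least one non-missing value exists. It also uses NumPy's Generator API (np.random.default_rng) in place of np.random.choice so the draw can be seeded.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; " ?" (with a leading space) marks a missing value,
# as described in the question.
df = pd.DataFrame({"Feature1": ["a", "b", "a", " ?", "a", "b", " ?", "a", " ?"]})


def fill_by_frequency(s, missing=" ?", seed=None):
    """Return a copy of `s` in which every `missing` entry is replaced by a
    random draw from the other values, weighted by their observed frequencies.

    Hypothetical helper; assumes at least one non-missing value exists.
    """
    out = s.copy()
    mask = out.eq(missing)  # True on the rows that need a replacement
    # Relative frequency of each non-missing value (the weights sum to 1).
    freqs = out[~mask].value_counts(normalize=True)
    rng = np.random.default_rng(seed)  # seeded generator for repeatable draws
    out[mask] = rng.choice(
        freqs.index.to_numpy(), size=int(mask.sum()), p=freqs.to_numpy()
    )
    return out


filled = fill_by_frequency(df["Feature1"], seed=0)
print(filled.tolist())
```

Computing the frequencies only from the non-missing rows keeps " ?" itself out of the candidate values. Passing a seed makes the result repeatable; leaving it as None gives a fresh random fill on every call.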