np.random.choice not returning correct weights when vectorized

Question:

Thank you @tdelaney for guiding me with my first post, I had to edit it:

import pandas as pd
import numpy as np
# This is a hypothetical line to generate a df with a column similar to the one which I'm having trouble with:
dataset_2021 = pd.DataFrame({"genero_usuario":["M", "M", None, "F", None, "F", "M", None, "M", "M", None, "F", "F", "M", None, "M", "M", None, "F", None, "F", "M", None, "M", "M", None, "F", "F", "M", None, "M", "M", None, "F", None, "F", "M", None, "M", "M", None, "F", "F", "M", None, "M", "M", None, "F", None, "F", "M", None]})

The dataset has a string column with the user’s gender: "M" for male and "F" for female, with some nulls that I want to impute. I obtained the weights of "M" and "F" from a value_counts() on the non-null values: M = 0.656, F = 0.344 (these come from my real dataset; the test one I wrote above gives 0.6 and 0.4).

The following line of code works and preserves the weights when the dataset is big enough (on the small test dataset above they shift slightly). The problem is that, because of the size of my df, it takes too long to execute:

dataset_2021["genero_usuario"] = dataset_2021["genero_usuario"].apply(lambda x : x if pd.isnull(x) == False else np.random.choice(a = ["M","F"], p=[0.656,0.344]))

The faster vectorized version I want to use doesn’t work. 1st attempt:

dataset_2021.loc[dataset_2021.genero_usuario.isnull(), dataset_2021.genero_usuario] = np.random.choice(a = ["M","F"], p=[0.656,0.344])

This throws the error:

Cannot mask with non-boolean array containing NA / NaN values

Second attempt:

dataset_2021.fillna(value = {"genero_usuario" : np.random.choice(a = ["M","F"], p=[0.656,0.344])}, inplace = True)

This imputes the nulls but decreases the weight of the "M" and increases the weight of the "F": the value_counts() gives M 0.616 and F 0.384.

  1. Why does the 1st attempt throw that error?
  2. Why does the 2nd attempt change the final weights? With the lambda they stay the same.
  3. How can I solve it? I don’t want to use lambda, I want the code to remain speedy.

Thanks in advance

Asked By: Agustín Bulzomi


Answers:

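Without a size argument, np.random.choice returns a single value, so the same value gets assigned to every null cell. In your fillna run that one draw was evidently "F", which is why the share of "F" went up and the share of "M" went down.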
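
A minimal sketch of the difference between a call with and without size (the variable names are only illustrative and do not come from the question):

```python
import numpy as np

options = ["M", "F"]
weights = [0.656, 0.344]

# Without `size`, one scalar is drawn; using it as a fill value
# puts that same letter into every null cell.
single = np.random.choice(a=options, p=weights)

# With `size=n`, an array of n independent draws comes back,
# one value per cell to fill.
many = np.random.choice(a=options, size=5, p=weights)

print(np.ndim(single), many.shape)
```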

So you first have to locate the null values and then draw one random value per null:

mask = dataset_2021["genero_usuario"].isnull()
dataset_2021.loc[mask, "genero_usuario"] = np.random.choice(a=["M", "F"], size=mask.sum(), p=[0.656, 0.344])
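
As for the first attempt in the question, the error most likely comes from the column indexer: .loc is given the Series dataset_2021.genero_usuario (the column's values, NaN included) where it expects the column label "genero_usuario", so pandas tries to treat those values as a boolean mask. Below is a self-contained sketch of the whole approach, assuming the weights are taken from the observed values with value_counts(normalize=True). The frame df and its contents are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data shaped like the question's column.
df = pd.DataFrame(
    {"genero_usuario": ["M", "M", None, "F", None, "F", "M", None, "M", "M"] * 100}
)

# Relative frequencies of the non-null values (NaN is excluded by default).
weights = df["genero_usuario"].value_counts(normalize=True)

# Boolean mask marking the rows to impute.
mask = df["genero_usuario"].isnull()

# Row indexer: the boolean mask. Column indexer: the column label (a string).
# One independent draw is generated per null row.
df.loc[mask, "genero_usuario"] = np.random.choice(
    weights.index.to_numpy(), size=mask.sum(), p=weights.to_numpy()
)

print(df["genero_usuario"].value_counts(normalize=True))
```

Because each null gets its own draw, the proportions after imputation should stay close to the original weights, with the usual random fluctuation on small samples.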
Answered By: Daniel