np.random.rand() or random.random()

Question:

While analyzing some code, I stumbled upon the following snippet:

msk = np.random.rand(len(df)) < 0.8

The variables "msk" and "df" are irrelevant to my question. After doing some research, I think this usage also applies to the "random" module. It gives True with an 80% chance and False with a 20% chance for each element, and it is done for masking. I understand why it is used, but I don’t understand how it works. Isn’t the random method supposed to return floats? Why do we get booleans when we compare its result against a threshold?

Asked By: Kiddbora


Answers:

np.random.rand(len(df)) returns an array of uniform random numbers in the interval [0, 1). The comparison np.random.rand(len(df)) < 0.8 is applied element by element and turns it into an array of booleans: True where the value is below 0.8, False otherwise.

Since each value has an 80% chance of being below 0.8, about 80% of the values are True on average.
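As a minimal sketch of the point above (the array sizes are arbitrary choices for illustration), the comparison changes the dtype from float to bool, and the share of True values gets close to 0.8 as the array grows:

```python
import numpy as np

values = np.random.rand(10)   # 10 floats drawn uniformly from [0, 1)
mask = values < 0.8           # elementwise comparison -> boolean array

print(values.dtype)  # float64
print(mask.dtype)    # bool

# With many samples, the fraction of True values approaches 0.8.
# (The mean of a boolean array is the fraction of True entries.)
big_mask = np.random.rand(100_000) < 0.8
print(big_mask.mean())
```

The same holds for a single number: random.random() < 0.8 compares one float to 0.8 and yields a single Python bool.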

A more explicit approach would be to use numpy.random.choice:

np.random.choice([True, False], p=[0.8, 0.2], size=len(df))

An even better approach, if your goal is to subset a dataframe, would be to use:

df.sample(frac=0.8)

To split a dataframe 0.8/0.2:

df1 = df.sample(frac=0.8)
df2 = df.drop(df1.index)
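
For comparison, here is a small sketch of both splitting approaches side by side. The dataframe contents and the random_state value are assumptions made up for illustration. The mask-based split gives part sizes that vary from run to run, while sample/drop gives a fixed 8/2 split on 10 rows:

```python
import numpy as np
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame({"x": range(10)})

# Mask-based split: each row lands in the first part with probability 0.8,
# so the sizes are only 80/20 on average.
msk = np.random.rand(len(df)) < 0.8
part_a, part_b = df[msk], df[~msk]

# sample/drop split: frac=0.8 of 10 rows selects exactly 8 rows.
# random_state (an arbitrary value here) makes the split repeatable.
df1 = df.sample(frac=0.8, random_state=0)
df2 = df.drop(df1.index)

print(len(part_a), len(part_b))
print(len(df1), len(df2))
```

Note that df.drop(df1.index) removes rows by index label, so this pattern assumes the index has no duplicate labels.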
Answered By: mozway