Most efficient way to set the lowest 40% of values in each DataFrame row to NaN?
Question:
I have a large DataFrame. For each row, I need to find the elements that fall in the lowest 40% of that row's values and set them to NaN. The elements are not sorted.
I can brute-force the calculation, but as you can imagine that is not very efficient. Is there an efficient way to do it?
Here 40% means: rank the row's elements in ascending order and set the lowest-ranked 40% to NaN. Elements that are already NaN are not counted.
For example, given the ten elements 1,21,20,4,5,6,7,9,10,11, the sorted order is 1,4,5,6,7,9,10,11,20,21, so 1,4,5,6 are removed and the row becomes nan,21,20,nan,nan,nan,7,9,10,11.
1 21 20 4 5 6 7 9 10 11
to
NaN 21 20 NaN NaN NaN 7 9 10 11
Answers:
Use DataFrame.count to get the number of non-missing values per row, then compare 40% of that count against the rank of each value within its row, obtained with a double numpy.argsort, and finally set the matching values to missing with the resulting mask:
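import numpy as np

print(df)
   0   1   2   3  4  5  6    7   8   9    10
0  1   2   3  10  5  6  7  NaN   9   4  11.0
1  1  21  20   4  5  6  7  9.0  10  11   NaN

# 40% of the non-missing count per row, as a column vector for broadcasting
counts = df.count(axis=1).mul(0.4).to_numpy()[:, None]
# double argsort gives each value's 0-based rank within its row (NaN ranks last)
arr = np.argsort(np.argsort(df.to_numpy()))
# set every value whose rank is below the cutoff to NaN
df[arr < counts] = np.nan

print(df)
    0     1     2     3    4    5  6    7   8     9    10
0 NaN   NaN   NaN  10.0  5.0  6.0  7  NaN   9   NaN  11.0
1 NaN  21.0  20.0   NaN  NaN  NaN  7  9.0  10  11.0   NaN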
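For reuse, the same idea can be wrapped in a small function. This is only a sketch under a few assumptions: the name mask_lowest_fraction and the frac parameter are made up for illustration, all columns are assumed to be numeric, and it returns a new DataFrame via DataFrame.mask instead of modifying df in place as the answer does.

```python
import numpy as np
import pandas as pd


def mask_lowest_fraction(df: pd.DataFrame, frac: float = 0.4) -> pd.DataFrame:
    """Return a copy of df with the lowest `frac` of non-NaN values per row set to NaN.

    Hypothetical helper for illustration; assumes every column is numeric.
    """
    values = df.to_numpy(dtype=float)
    # Per-row cutoff: frac * (number of non-NaN values), as a column vector.
    cutoff = (np.count_nonzero(~np.isnan(values), axis=1) * frac)[:, None]
    # Double argsort yields each element's 0-based rank within its row;
    # NaN sorts last, so existing NaNs never fall below the cutoff.
    ranks = np.argsort(np.argsort(values, axis=1), axis=1)
    # mask() replaces entries where the condition is True with NaN.
    return df.mask(ranks < cutoff)


# The ten-element row from the question.
row = pd.DataFrame([[1, 21, 20, 4, 5, 6, 7, 9, 10, 11]])
result = mask_lowest_fraction(row)
print(result)
```

Two details are worth keeping in mind. Because the test is rank < cutoff, a fractional cutoff is effectively rounded up: a row with 11 non-missing values has a cutoff of 4.4 and loses 5 values (ranks 0 to 4). Tied values also receive distinct ranks from argsort, so if two equal values straddle the cutoff only one of them is set to NaN.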