Binarizing pandas dataframe column

Question:

mean radius mean texture    mean perimeter  mean area   mean smoothness mean compactness    mean concavity  mean concave points mean symmetry   mean fractal dimension  ... worst texture   worst perimeter worst area  worst smoothness    worst compactness   worst concavity worst concave points    worst symmetry  worst fractal dimension classification
0   17.99   10.38   122.80  1001.0  0.11840 0.27760 0.3001  0.14710 0.2419  0.07871 ... 17.33   184.60  2019.0  0.1622  0.6656  0.7119  0.2654  0.4601  0.11890 0
1   20.57   17.77   132.90  1326.0  0.08474 0.07864 0.0869  0.07017 0.1812  0.05667 ... 23.41   158.80  1956.0  0.1238  0.1866  0.2416  0.1860  0.2750  0.08902 0
2   19.69   21.25   130.00  1203.0  0.10960 0.15990 0.1974  0.12790 0.2069  0.05999 ... 25.53   152.50  1709.0  0.1444  0.4245  0.4504  0.2430  0.3613  0.08758 0
3   11.42   20.38   77.58   386.1   0.14250 0.28390 0.2414  0.10520 0.2597  0.09744 ... 26.50   98.87   567.7   0.2098  0.8663  0.6869  0.2575  0.6638  0.17300 0
4   20.29   14.34   135.10  1297.0  0.10030 0.13280 0.1980  0.10430 0.1809  0.05883 ... 16.67   152.20  1575.0  0.1374  0.2050  0.4000  0.1625  0.2364  0.07678 0

Suppose I have a pandas DataFrame that looks like the one above.
I want to binarize the mean radius column (change each value to 0 or 1) depending on whether the value is higher than 12.0.

What I’ve tried is

data_df.loc[data_df["mean radius"] > 12.0] = 0

But this gave me a weird result.

How can I solve this?

Asked By: Dawn17


Answers:

Specify the column as well, like so:

data_df.loc[data_df["mean radius"] > 12.0, "mean radius"] = 0
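
To see why the original attempt gave a weird result, here is a minimal sketch on a toy frame (the frame and the names rows_only and rows_and_col are made up for illustration; the values are taken from the sample rows in the question). With only a row selector, .loc assigns to every column of the matching rows; adding the column label restricts the assignment to that one column.

```python
import pandas as pd

# Toy frame for illustration; values copied from the question's sample rows.
df = pd.DataFrame({
    "mean radius": [17.99, 11.42, 20.29],
    "mean texture": [10.38, 20.38, 14.34],
})

# Row selector only: every column of the matching rows is overwritten with 0.
rows_only = df.copy()
rows_only.loc[rows_only["mean radius"] > 12.0] = 0

# Row selector plus column label: only "mean radius" is touched.
rows_and_col = df.copy()
rows_and_col.loc[rows_and_col["mean radius"] > 12.0, "mean radius"] = 0

print(rows_only)      # "mean texture" is zeroed in the matching rows too
print(rows_and_col)   # "mean texture" is left as it was
```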
Answered By: jpp

If you wanted to change the whole column to 1 and 0, you could modify your code slightly to:

# 0 if greater than 12, 1 otherwise
data_df["mean radius"] = (data_df["mean radius"] <= 12.0).astype(int)
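
As a quick check of what that produces, here is a small sketch using the five mean radius values from the question (the frame df and the name is_small are just for illustration). The comparison yields a boolean Series, and astype(int) turns True into 1 and False into 0.

```python
import pandas as pd

# The five "mean radius" values shown in the question.
df = pd.DataFrame({"mean radius": [17.99, 20.57, 19.69, 11.42, 20.29]})

# Boolean Series: True where the radius is 12.0 or less.
is_small = df["mean radius"] <= 12.0

# Cast to integers: 1 for True, 0 for False.
df["mean radius"] = is_small.astype(int)

print(df["mean radius"].tolist())  # [0, 0, 0, 1, 0]
```

If you would rather have 1 for values above 12, flip the comparison to > 12.0.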

If you just wanted to set the rows where the radius is greater than 12 to 0 (leaving values of 12 or less unchanged):

# only change the values > 12
# this method is discouraged, see edit below
data_df[data_df["mean radius"] > 12.0]["mean radius"] = 0

Edit

As @jp_data_analysis pointed out, chained indexing is discouraged. The preferred way to do the second operation is multi-axis indexing with .loc, reproduced here from another answer:

# only change the values > 12
data_df.loc[data_df["mean radius"] > 12.0, "mean radius"] = 0
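
A rough sketch of why the chained form is discouraged, on a made-up three-row frame (the names chained and direct are illustrative). Boolean indexing returns a new object, so the chained assignment should end up writing to a temporary copy and leave the original frame unchanged, while the .loc form writes to the frame itself. The exact warning pandas emits for the chained version varies by version, so it is silenced here.

```python
import warnings

import pandas as pd

df = pd.DataFrame({"mean radius": [17.99, 11.42, 20.29]})

# Chained indexing: the first [] builds a temporary object, and the
# assignment lands on that temporary, not on `chained` itself.
chained = df.copy()
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # pandas warns about chained assignment
    chained[chained["mean radius"] > 12.0]["mean radius"] = 0

# Single .loc call with row mask and column label: modifies `direct` in place.
direct = df.copy()
direct.loc[direct["mean radius"] > 12.0, "mean radius"] = 0

print(chained["mean radius"].tolist())  # still the original values
print(direct["mean radius"].tolist())   # values above 12 replaced by 0
```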
Answered By: pault

You can also use mask:

data_df["mean radius"]=data_df["mean radius"].mask(data_df["mean radius"] > 12.0,0)
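
For reference, a small sketch of how mask behaves on a made-up Series: Series.mask(cond, other) replaces values where the condition is True, and Series.where(cond, other) is its mirror image, replacing values where the condition is False. The names masked and kept are only for illustration.

```python
import pandas as pd

s = pd.Series([17.99, 11.42, 20.29], name="mean radius")

# mask: replace the values where the condition holds.
masked = s.mask(s > 12.0, 0)

# where: keep the values where the condition holds, replace the rest.
kept = s.where(s <= 12.0, 0)

print(masked.tolist())      # [0.0, 11.42, 0.0]
print(masked.equals(kept))  # True
```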
Answered By: BENY

Another way to do this is to convert the values to Booleans (True and False) and then multiply by 1, which gives 1 for True and 0 for False. Here is how it is done:

data_df['mean_radius'] = (data_df['mean radius'] > 12.0)*1

print(data_df['mean_radius'])

This code will add a new column called mean_radius with binarized values. Let me know if this helps.
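
As a sanity check, a minimal sketch on a made-up frame suggests that multiplying the boolean Series by 1 gives the same values as casting it with astype(int), so the choice between the two is mostly a matter of style. The names flag, by_multiply and by_astype are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"mean radius": [17.99, 11.42, 20.29]})

# Boolean Series: True where the radius is above 12.0.
flag = df["mean radius"] > 12.0

by_multiply = flag * 1        # True * 1 -> 1, False * 1 -> 0
by_astype = flag.astype(int)  # explicit cast to integers

print(by_multiply.tolist())  # [1, 0, 1]
print(by_astype.tolist())    # [1, 0, 1]
```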

Answered By: Niga