Binarizing pandas dataframe column
Question:
mean radius mean texture mean perimeter mean area mean smoothness mean compactness mean concavity mean concave points mean symmetry mean fractal dimension ... worst texture worst perimeter worst area worst smoothness worst compactness worst concavity worst concave points worst symmetry worst fractal dimension classification
0 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.3001 0.14710 0.2419 0.07871 ... 17.33 184.60 2019.0 0.1622 0.6656 0.7119 0.2654 0.4601 0.11890 0
1 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.0869 0.07017 0.1812 0.05667 ... 23.41 158.80 1956.0 0.1238 0.1866 0.2416 0.1860 0.2750 0.08902 0
2 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.1974 0.12790 0.2069 0.05999 ... 25.53 152.50 1709.0 0.1444 0.4245 0.4504 0.2430 0.3613 0.08758 0
3 11.42 20.38 77.58 386.1 0.14250 0.28390 0.2414 0.10520 0.2597 0.09744 ... 26.50 98.87 567.7 0.2098 0.8663 0.6869 0.2575 0.6638 0.17300 0
4 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.1980 0.10430 0.1809 0.05883 ... 16.67 152.20 1575.0 0.1374 0.2050 0.4000 0.1625 0.2364 0.07678 0
Suppose I have a pandas DataFrame that looks like the one above.
I want to binarize the mean radius column (change it to 0 or 1) depending on whether the value is higher than 12.0.
What I’ve tried is
data_df.loc[data_df["mean radius"] > 12.0] = 0
But this gave me a weird result.
How can I solve this?
Answers:
Specify the column as well, like so:
data_df.loc[data_df["mean radius"] > 12.0, "mean radius"] = 0
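To see why the original attempt gave a weird result: with only a row mask, .loc assigns 0 to every column of the matching rows, not just mean radius. A minimal sketch on a made-up three-column sample (the values and the reduced column set are assumptions for illustration, not the full dataset from the question):

```python
import pandas as pd

# Hypothetical small sample; the real frame has many more columns.
df = pd.DataFrame({
    "mean radius": [17.99, 11.42, 20.29],
    "mean texture": [10.38, 20.38, 14.34],
    "classification": [0, 0, 1],
})

# Row-only selection: every column in the matching rows becomes 0.
rows_only = df.copy()
rows_only.loc[rows_only["mean radius"] > 12.0] = 0

# Row and column selection: only "mean radius" is changed.
col_only = df.copy()
col_only.loc[col_only["mean radius"] > 12.0, "mean radius"] = 0

print(rows_only)
print(col_only)
```

In rows_only the mean texture and classification values of the matching rows are wiped out as well, which is probably the weird result from the question.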
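If you want to convert the whole column to 1s and 0s, compare it against the threshold and cast the result to int. Note that this writes to a new column, mean_radius (with an underscore), rather than overwriting mean radius: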
# 0 if greater than 12, 1 otherwise
data_df["mean_radius"] = (data_df["mean radius"] <= 12.0).astype(int)
If you just want to set the rows where the radius is greater than 12 to 0 (leaving values of 12 or less unchanged):
# only change the values > 12
# this method is discouraged, see edit below
data_df[data_df["mean radius"] > 12.0]["mean radius"] = 0
Edit
As @jp_data_analysis pointed out, chained indexing is discouraged. The preferred way to do the second operation is multi-axis indexing with .loc, reproduced here from another answer:
# only change the values > 12
data_df.loc[data_df["mean radius"] > 12.0, "mean radius"] = 0
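The reason chained indexing is discouraged here is that the first indexing step (the boolean mask) returns a copy, so the assignment lands on a temporary object and the original frame is left as it was. A rough sketch on a made-up one-column sample (the values are assumptions; the warning is silenced only to keep the output clean):

```python
import warnings

import pandas as pd

# Hypothetical sample values for illustration.
df = pd.DataFrame({"mean radius": [17.99, 11.42, 20.29]})
mask = df["mean radius"] > 12.0

# Chained indexing: df[mask] is a copy, so the assignment is lost.
chained = df.copy()
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # pandas normally warns about this pattern
    chained[mask]["mean radius"] = 0

# Single .loc call: rows and column are selected in one step.
single = df.copy()
single.loc[mask, "mean radius"] = 0

print(chained["mean radius"].tolist())  # values are unchanged
print(single["mean radius"].tolist())   # values above 12.0 are now 0
```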
Using mask:
data_df["mean radius"] = data_df["mean radius"].mask(data_df["mean radius"] > 12.0, 0)
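For reference, Series.mask replaces the values where the condition is True, while Series.where does the opposite and keeps them. A small sketch with made-up values (the sample Series is an assumption for illustration):

```python
import pandas as pd

# Hypothetical sample values for illustration.
s = pd.Series([17.99, 11.42, 20.29], name="mean radius")

# mask: replace values where the condition holds.
masked = s.mask(s > 12.0, 0)

# where: keep values where the condition holds, replace the rest.
kept = s.where(s <= 12.0, 0)

print(masked.tolist())
print(kept.tolist())
```

Both calls give the same result here. Like the .loc version, this only zeroes the values above 12.0 and leaves the others as they are, so the column is not fully binarized.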
A better way to do this is to compare the column against the threshold, which gives Boolean values (True and False), and then multiply by 1 so that True becomes 1 and False becomes 0. Here is how it is done:
data_df['mean_radius'] = (data_df['mean radius'] > 12.0)*1
print(data_df['mean_radius'])
This code will add a new column called mean_radius with binarized values. Let me know if this helps.
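Multiplying by 1 and calling .astype(int) should give the same values, and np.where is another common spelling. A short sketch comparing the three on made-up values (the sample data is an assumption; note this maps values above 12.0 to 1, the opposite of the <= version in the earlier answer):

```python
import numpy as np
import pandas as pd

# Hypothetical sample; 12.0 itself is included to show the boundary case.
df = pd.DataFrame({"mean radius": [17.99, 11.42, 20.29, 12.0]})
above = df["mean radius"] > 12.0  # Boolean Series

by_multiply = above * 1           # True -> 1, False -> 0
by_astype = above.astype(int)     # same values via an explicit cast
by_where = np.where(above, 1, 0)  # NumPy array with the same values

# Assigning to the existing name overwrites the column in place
# instead of adding a separate "mean_radius" column.
df["mean radius"] = by_astype
print(df)
```

If you want to keep the original measurements, assign to a new column name as in the answers above instead of overwriting.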