How to replace the highest 10 values in a column of a csv file?
Question:
Dataset: https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv
Edit: Updated the code with the changes @Maow suggested.
I am currently working on a project that requires me to analyse wine data. I have spotted some extreme outliers in each column of the CSV file. In short, I have decided that the highest 10 values of each column should be replaced by that column's median. I tried the following with help from one article (Pandas Replace certain values in each column) and modified it as shown below, but this is my first time with Python and I have no idea what is causing the error.
import pandas as pd
import numpy as np
df = pd.read_csv('C:/Users/hello/Downloads/winequality-red-ori.csv')
def cut(column):
    condition = column > np.percentile(column, 99.26470588)  # Top 10 rows out of 1360 rows
    replacewith = np.median(column)  # replace with median
    np.select(condition.values.reshape(-1, 1), column.values, replacewith)  # input changes
df.set_index(["citric acid", "quality"], inplace=True)  # exclude citric acid and quality
df = df.apply(lambda x: cut(x)).reset_index()
df.to_csv('C:/Users/hello/Downloads/new.csv')
I have tried researching what causes the error, including checking for missing values in the CSV file, but there are none. I am also not sure whether the code above will help me achieve my goal even without this error. Any help is appreciated.
Answers:
The error appears because you are using np.select incorrectly. It expects a list of conditions, a list of choices, and a default value, in that order.
It works with
np.select(condition.values.reshape(-1, 1), column.values, replacewith)
- You are using a numpy function on pandas objects. This may work, but accessing the underlying np.array is imho good practice.
- Also, np.select is not doing what you think it does. Its purpose is to pick, for each output element, the choice that belongs to the first condition that is true. With the reshaped arguments above there is only one output element, so you basically select the first value that belongs to the 10 largest.
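To make the difference concrete, here is a small sketch on made-up data (the array `values` and the threshold of 10 are assumptions for illustration, not taken from the wine dataset). It compares the reshaped call with a call that passes one condition array for the whole column:

```python
import numpy as np

values = np.array([1.0, 2.0, 50.0, 3.0, 60.0])  # made-up sample data
is_outlier = values > 10                         # hypothetical outlier rule
median = np.median(values)

# Reshaped call: five one-element conditions paired with five scalar choices.
# Only one output slot exists, so only the first matching value comes back.
single = np.select(is_outlier.reshape(-1, 1), values, median)

# One condition array covering the whole column, plus a fallback condition
# that keeps every other value unchanged.
replaced = np.select([is_outlier, ~is_outlier], [median, values])

print(single.shape)   # a one-element array
print(replaced)       # same length as the input
```

The second call returns an array as long as the column, which is the shape needed to write the result back into a DataFrame.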
Final note: by calling set_index twice, you are basically making citric acid a value again. You should call it once, with both columns:
df.set_index(["citric acid", "quality"], inplace=True)  # exclude citric acid and quality
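As a rough illustration of why the single set_index call excludes those columns, the sketch below uses a tiny made-up frame (the values and the doubling step are placeholders, not part of the original code). Columns moved into the index are skipped by apply and come back unchanged after reset_index:

```python
import pandas as pd

# Tiny made-up frame with the same column names as the wine data.
df = pd.DataFrame({
    "citric acid": [0.0, 0.1, 0.2],
    "quality": [5, 6, 5],
    "alcohol": [9.4, 9.8, 10.5],
})

# Move both columns into the index in one call, so apply() only sees "alcohol".
indexed = df.set_index(["citric acid", "quality"])
transformed = indexed.apply(lambda col: col * 2)  # stand-in for the real transformation

# reset_index() turns the index levels back into ordinary columns.
result = transformed.reset_index()
print(result)
```

An alternative with the same effect is to select the columns to transform explicitly and leave the index alone; which one reads better is a matter of taste.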
EDIT:
The np.select function expects a list of bool ndarrays, i.e. a 2D data structure, as per the documentation. If you look at condition, it looks like this:
In [35]: condition
Out[35]: array([False, False, False, ..., False, False, False])
.reshape will change the shape of the array. -1 tells NumPy to infer that dimension from the array's length, so the number of rows stays the same, and 1 means that each row becomes an ndarray with only one element.
In [36]: condition.reshape(-1, 1)
Out[36]:
array([[False],
       [False],
       [False],
       ...,
       [False],
       [False],
       [False]])
This is to match the expected signature.
Figured out an algorithm:
def cut(column):
    condition = column > np.percentile(column, 99.26470588)  # threshold for the top 10 of 1360 rows
    replacewith = np.median(column)  # replace with median
    return np.where(condition, replacewith, column.values)
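A percentile threshold may not select exactly ten rows when values tie at the cutoff or when the row count changes. As a possible alternative (not the method used above), the sketch below picks the n largest entries by position with Series.nlargest. The helper name `replace_top_n_with_median` and the sample frame are hypothetical, and the helper assumes the index labels are unique, as with the default index from read_csv:

```python
import pandas as pd

def replace_top_n_with_median(column: pd.Series, n: int = 10) -> pd.Series:
    """Return a float copy of `column` with its n largest entries set to the median."""
    result = column.astype(float)            # float copy so the median always fits
    top_labels = column.nlargest(n).index    # index labels of the n largest values
    result.loc[top_labels] = column.median() # median of the original column
    return result

# Made-up data: two obvious outliers in "alcohol"; "quality" is left untouched.
df = pd.DataFrame({
    "alcohol": [9.0, 10.0, 11.0, 12.0, 99.0, 98.0],
    "quality": [5, 6, 5, 7, 6, 5],
})

columns_to_clean = [name for name in df.columns if name != "quality"]
df[columns_to_clean] = df[columns_to_clean].apply(replace_top_n_with_median, n=2)
print(df)
```

Here the median is computed before replacement, so the outliers still influence it slightly, which matches the behaviour of the percentile version.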