How to replace the highest 10 values in a column of a csv file?

Question:

Dataset: https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv

Edit: Updated the code with the changes suggested by @Maow.

I am currently doing a project that requires me to analyse wine data. I have spotted some extreme outliers in each column of the csv file. In short, I have determined that the highest 10 values of each column must be replaced by the median value of that column. I tried the following with help from one article (Pandas Replace certain values in each column) and modified it as shown below, but unfortunately this is my first time with Python and I have no idea what is causing the error.

import pandas as pd
import numpy as np
df = pd.read_csv('C:/Users/hello/Downloads/winequality-red-ori.csv')

def cut(column):
    condition = column > np.percentile(column, 99.26470588)  # top 10 rows out of 1360 rows
    replacewith = np.median(column)  # replace with median
    np.select(condition.values.reshape(-1, 1), column.values, replacewith)  # input changes

df.set_index(["citric acid", "quality"], inplace=True)  # exclude citric acid and quality
df = df.apply(lambda x: cut(x)).reset_index()
df.to_csv('C:/Users/hello/Downloads/new.csv')

I have tried researching what causes the error, including missing values in the csv file, but I have none. I am also not sure whether the above code will help me achieve my goal even without this error. Any help is appreciated.

Asked By: Do Ji


Answers:

The error appears because you are calling np.select incorrectly. It expects a list of conditions, a list of choices and a default value, in this order.

It works with

np.select(condition.values.reshape(-1, 1), column.values, replacewith)

  1. You are using a NumPy function on pandas objects. This may work, but accessing the underlying NumPy array (via .values) is, in my opinion, good practice.
  2. Also, np.select is not doing what you think it does. For each output element it picks the choice that belongs to the first condition that is true. With the one-element conditions produced by the reshape, the output has a single element, so you basically select the first value that belongs to the 10 largest.
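
To make the second point concrete, here is a small sketch with made-up numbers (the array `values` and the threshold of 5 are illustrative assumptions, not taken from the wine data). It contrasts the call above with np.select used on full-length conditions and with np.where, which is usually the simpler tool when there is only one condition.

```python
import numpy as np

values = np.array([1.0, 9.0, 2.0, 8.0, 3.0])   # made-up column
condition = values > 5                          # marks the two "outliers"
replacement = np.median(values)                 # median of the made-up column

# The call from the answer: five one-element conditions, five scalar choices.
# Only the first true condition counts, so a single value comes back.
first_hit = np.select(condition.reshape(-1, 1), values, replacement)

# np.select with full-length conditions works element by element.
# np.full_like builds an array of the replacement value to serve as a choice.
elementwise = np.select(
    [condition, ~condition],
    [np.full_like(values, replacement), values],
)

# np.where says the same thing more directly when there is one condition.
via_where = np.where(condition, replacement, values)

print(first_hit)     # one element: the first value above the threshold
print(elementwise)   # outliers replaced, everything else unchanged
print(via_where)     # same result as elementwise
```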

Final note: if you call set_index twice, the second call overrides the first, so citric acid is no longer excluded. You should call set_index once with both columns:

df.set_index(["citric acid", "quality"], inplace=True)  # exclude citric acid and quality
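
As a rough illustration of why moving columns into the index excludes them, the sketch below uses a tiny made-up frame and a hypothetical `double` function standing in for `cut`. Only the remaining value column is passed to the function, and reset_index brings the two index columns back afterwards.

```python
import pandas as pd

# Made-up data; only the column names mirror the wine dataset.
df = pd.DataFrame({
    "citric acid": [0.0, 0.1, 0.2, 0.3],
    "quality": [5, 6, 5, 7],
    "pH": [3.0, 3.5, 3.25, 4.0],
})

def double(column):
    # Hypothetical stand-in for cut(); returns a Series so the index is kept.
    return column * 2

indexed = df.set_index(["citric acid", "quality"])   # both columns leave the value area
result = indexed.apply(double).reset_index()         # only "pH" is passed to double()

print(result)
```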

EDIT:
The np.select function expects a list of boolean ndarrays, i.e. a 2D data structure, as per the documentation. If you look at condition, it looks like this:

In [35]: condition
Out[35]: array([False, False, False, ..., False, False, False])

.reshape changes the shape of the array. -1 tells NumPy to infer that dimension from the length of the array, so the number of rows stays the same, and 1 means that you create a redundant second dimension with only one element in each row.

In [36]: condition.reshape(-1, 1)
Out[36]: 
array([[False],
       [False],
       [False],
       ...,
       [False],
       [False],
       [False]])

This is to match the expected signature.
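
A minimal sketch of the reshape step, using a made-up four-element condition array rather than the real data:

```python
import numpy as np

condition = np.array([False, True, False, True])   # made-up 1D boolean array
as_column = condition.reshape(-1, 1)                # -1: infer the row count, 1: one element per row

print(condition.shape)
print(as_column.shape)
print(len(as_column))   # iterating over as_column yields one single-element array per row
```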

Answered By: maow

Figured out an algorithm:

def cut(column):
    condition = column > np.percentile(column, 99.26470588)
    replacewith = np.median(column)  # replace with median
    return np.where(condition, replacewith, column.values)
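
The percentile in the snippet above is tied to a fixed number of rows. As a possible alternative (a different technique from the percentile threshold, shown only as a sketch), the n largest values can be picked directly with Series.nlargest. The helper name `replace_top_n_with_median`, the synthetic data and n=3 are assumptions made for illustration, and the frame is assumed to have unique row labels.

```python
import numpy as np
import pandas as pd

def replace_top_n_with_median(column, n=10):
    """Return a copy of `column` with its n largest values set to the median."""
    result = column.copy()
    top_labels = column.nlargest(n).index      # row labels of the n largest values
    result.loc[top_labels] = column.median()   # median of the original column
    return result

# Synthetic stand-in for the wine data (assumption: default, unique row labels).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "citric acid": rng.uniform(0, 1, 50),
    "quality": rng.integers(3, 9, 50),
    "pH": rng.normal(3.3, 0.2, 50),
    "alcohol": rng.normal(10, 1, 50),
})

excluded = ["citric acid", "quality"]          # columns to leave untouched
cleaned = df.copy()
for name in df.columns.difference(excluded):
    cleaned[name] = replace_top_n_with_median(df[name], n=3)

print((cleaned != df).sum())                   # number of changed values per column
```

Selecting the columns by name avoids the set_index / reset_index round trip, at the cost of an explicit loop.
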
Answered By: Do Ji