How to skip a point in a .csv file if it is larger than x?

Question:

My data has some outliers that need to be ignored, but I am struggling to work out how to do this. I need any value over 500 to be removed or ignored. Below is my code so far:

import pandas as pd
import matplotlib.pyplot as plt

#convert the files to make sure that only the data needed is selected
INPUT_FILE = 'data.csv'
OUTPUT_FILE = 'machine_data.csv'
PACKET_ID = 'machine'

with open(INPUT_FILE, 'r') as f:
    data = f.readlines()
with open(OUTPUT_FILE, 'w') as f:
    for datum in data:
        if datum.startswith(PACKET_ID):
            f.write(datum)

#read the data file
df = pd.read_csv(OUTPUT_FILE, header=None, usecols=[2,10,11,12,13,14])
#plotting the conc
fig,conc = plt.subplots(1,1)
lns1 = conc.plot(df[2],df[11],color="g", label='Concentration')

As you can see, I have selected the columns that I need, but in column 11 I only need the data that is less than 500.

Asked By: EggSci

Answers:

You just have to filter your DataFrame on that column with a boolean mask, like this:

df = df[(df[11] <= 500)]
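
To make the mask concrete, here is a minimal sketch on a small made-up DataFrame. The values are hypothetical and only stand in for the parsed CSV; the integer column labels 2 and 11 mirror what `header=None` produces.

```python
import pandas as pd

# Hypothetical stand-in for the parsed CSV (integer column labels, as with header=None).
df = pd.DataFrame({
    2: [0, 1, 2, 3, 4],
    11: [120.0, 480.0, 950.0, 500.0, 1200.0],
})

# Boolean Series: True where the row should be kept.
mask = df[11] <= 500

# Indexing with the mask keeps only the rows marked True.
filtered = df[mask]

print(filtered)
```

Note that the surviving rows keep their original index labels, so a call to `reset_index(drop=True)` may be worth adding if later code relies on a contiguous index.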

Your code will then look like this:

import pandas as pd
import matplotlib.pyplot as plt

#convert the files to make sure that only the data needed is selected
INPUT_FILE = 'data.csv'
OUTPUT_FILE = 'machine_data.csv'
PACKET_ID = 'machine'

with open(INPUT_FILE, 'r') as f:
    data = f.readlines()
with open(OUTPUT_FILE, 'w') as f:
    for datum in data:
        if datum.startswith(PACKET_ID):
            f.write(datum)

#read the data file
df = pd.read_csv(OUTPUT_FILE, header=None, usecols=[2,10,11,12,13,14])

# filter your data HERE:
df = df[(df[11] <= 500)]

#plotting the conc
fig,conc = plt.subplots(1,1)
lns1 = conc.plot(df[2],df[11],color="g", label='Concentration')

Answered By: mrCopiCat

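To ignore outliers greater than 500 in column 11, you can mask them with where(), which replaces them with NaN, and then drop those rows from the DataFrame. Calling dropna() on the Series before assigning it back has no effect, because column assignment realigns on the index and the dropped entries come back as NaN:

df[11] = df[11].where(df[11] <= 500)
df = df.dropna(subset=[11])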

Source: DataFrame.where()
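
As a rough illustration of how `where()` and `dropna()` interact with column assignment, the sketch below uses made-up values (not the asker's data) and shows that a shortened Series is realigned to the DataFrame's index when assigned back.

```python
import pandas as pd

# Hypothetical data; column labels 2 and 11 mirror the question.
df = pd.DataFrame({2: [0, 1, 2, 3], 11: [100.0, 700.0, 300.0, 900.0]})

# Values over 500 become NaN; the Series keeps its full length.
masked = df[11].where(df[11] <= 500)

# dropna() returns a shorter Series that only has index labels 0 and 2.
trimmed = masked.dropna()

# Assigning the shorter Series back realigns it on df's index,
# so the dropped positions reappear as NaN and no rows are removed.
df[11] = trimmed
n_missing = int(df[11].isna().sum())

# Removing the rows has to be done on the DataFrame itself.
cleaned = df.dropna(subset=[11])

print(n_missing)
print(cleaned)
```

If the goal is only to hide the outliers in a plot, leaving the NaN values in place may be enough, since matplotlib does not draw NaN points and leaves a gap in the line there.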

Answered By: Ingebrigt Nygård