How to drop a specific column of csv file while reading it using pandas?
Question:
I need to remove the column labelled name while loading a CSV with pandas. I am reading the CSV as follows and would like to add parameters to this call to do so. Thanks.
pd.read_csv("sample.csv")
I know how to do this after reading the CSV:
df.drop('name', axis=1)
Answers:
If you know the column names in advance, you can do it by setting the usecols parameter.
When you know which columns to use
Suppose you have a CSV file with columns ['id','name','last_name'] and you want just ['name','last_name']. You can do it as below:
import pandas as pd
df = pd.read_csv("sample.csv", usecols = ['name','last_name'])
When you want the first N columns
If you don't know the column names but want the first N columns of the file, you can pass their positions instead:
import pandas as pd
n = 2  # number of leading columns to keep
df = pd.read_csv("sample.csv", usecols=list(range(n)))
Edit
When you know the name of the column to be dropped
import pandas as pd
# Read only the column names from the file
cols = list(pd.read_csv("sample_data.csv", nrows=1))
print(cols)
# Use a list comprehension to leave the unwanted column out of usecols
df = pd.read_csv("sample_data.csv", usecols=[i for i in cols if i != 'name'])
Get the column headers from your CSV using pd.read_csv with nrows=1, then do a subsequent read with usecols to pull everything but the column(s) you want to omit.
headers = [*pd.read_csv('sample.csv', nrows=1)]
df = pd.read_csv('sample.csv', usecols=[c for c in headers if c != 'name'])
Alternatively, you can do the same thing (read only the headers) very efficiently using the csv module:
import csv
import pandas as pd
with open("sample.csv", 'r') as f:
    header = next(csv.reader(f))
    # For Python 2, use
    # header = csv.reader(f).next()
df = pd.read_csv('sample.csv', usecols=list(set(header) - {'name'}))
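A note on the set-based version: a set has no defined order, but that should not matter here, because the pandas documentation states that element order in usecols is ignored and the columns come back in file order. A small sketch, reading made-up data from a string rather than from sample.csv:

```python
import csv
import io

import pandas as pd

# Hypothetical CSV contents standing in for sample.csv
csv_text = "id,name,last_name\n1,Ann,Lee\n2,Bob,Ray\n"

# Read only the header row with the csv module
header = next(csv.reader(io.StringIO(csv_text)))

# Set difference removes 'name'; the order of `keep` is arbitrary
keep = list(set(header) - {"name"})

df = pd.read_csv(io.StringIO(csv_text), usecols=keep)

# Columns follow the order in the file, not the order of `keep`
print(list(df.columns))
```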
Using df = df.drop(['ID','prediction'], axis=1) did the job for me. I dropped the 'ID' and 'prediction' columns. Make sure you put them in square brackets, like ['column1','column2'].
There is no need for other complicated solutions.
Columns can be dropped in the same statement that reads the file (the whole file is still parsed before the drop is applied):
columns_to_be_removed = ['a', 'b']
data = pd.read_csv(sourceFileName).drop(columns_to_be_removed, axis = 'columns')
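For comparison, here is a rough sketch showing that chaining drop onto read_csv and skipping the columns with a usecols callable should give the same DataFrame. The difference is only in when the columns are discarded. The CSV text is invented for the example and read from memory instead of from sourceFileName:

```python
import io

import pandas as pd

# Hypothetical CSV contents; the real file would be sourceFileName
csv_text = "a,b,c\n1,2,3\n4,5,6\n"
columns_to_be_removed = ["a", "b"]

# Approach from this answer: parse every column, then drop the unwanted ones
dropped = pd.read_csv(io.StringIO(csv_text)).drop(columns_to_be_removed, axis="columns")

# usecols callable: the unwanted columns are skipped while parsing
skipped = pd.read_csv(
    io.StringIO(csv_text),
    usecols=lambda c: c not in columns_to_be_removed,
)

print(dropped.equals(skipped))
```

On a small file the two are interchangeable; on a wide file, skipping columns at parse time avoids loading data that is thrown away immediately.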
The only parameter to read_csv() that you can use to select columns is usecols. According to the documentation, usecols accepts a list-like or a callable. Because you only know the columns you want to drop, you can't pass a list of the columns you want to keep. So use a callable:
pd.read_csv("sample.csv",
usecols=lambda x: x != 'name'
)
And you could of course say x not in ['unwanted', 'column', 'names'] if you had a list of column names you didn't want to use.
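As a minimal sketch of the callable form with several unwanted columns, the example below reads from an in-memory string instead of a file. The CSV contents and the `unwanted` list are made up for illustration:

```python
import io

import pandas as pd

# Hypothetical CSV contents standing in for sample.csv
csv_text = "id,name,last_name\n1,Ann,Lee\n2,Bob,Ray\n"

# Columns to skip; a name that is not in the file is simply never matched
unwanted = ["name", "prediction"]

# pandas calls the function once per column name and keeps those returning True
df = pd.read_csv(io.StringIO(csv_text), usecols=lambda col: col not in unwanted)

print(list(df.columns))
```

Unlike a list passed to usecols, the callable does not need to know the file's header up front, so a single read is enough.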
This answer with two lines of code will really help you. You can even dynamically remove column names while creating CSV.