Creating new pandas dataframe from certain columns of existing dataframe
Question:
I have read a CSV file into a pandas dataframe and want to do some simple manipulations on it. I cannot figure out how to create a new dataframe based on selected columns from my original dataframe. My attempt:
names = ['A','B','C','D']
dataset = pandas.read_csv('file.csv', names=names)
new_dataset = dataset['A','D']
I would like to create a new dataframe with the columns A and D from the original dataframe.
Answers:
This is called subsetting: pass a list of column names inside []:
dataset = pandas.read_csv('file.csv', names=names)
new_dataset = dataset[['A','D']]
which is the same as:
new_dataset = dataset.loc[:, ['A','D']]
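As a quick check of the equivalence claimed above, here is a minimal sketch. The in-memory CSV text and the names by_brackets and by_loc are assumptions made for illustration; they stand in for file.csv and are not part of the original question.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for file.csv (no header row).
csv_text = "1,2,3,4\n5,6,7,8\n"
names = ['A', 'B', 'C', 'D']
dataset = pd.read_csv(io.StringIO(csv_text), names=names)

# Both forms take a list of labels and return a DataFrame with columns A and D.
by_brackets = dataset[['A', 'D']]
by_loc = dataset.loc[:, ['A', 'D']]

print(by_brackets.equals(by_loc))
print(list(by_brackets.columns))
```

Note the double brackets: the outer pair is the indexing operator and the inner pair is the Python list of column names.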
If you only need the selected columns, pass the usecols parameter to read_csv:
new_dataset = pandas.read_csv('file.csv', names=names, usecols=['A','D'])
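A small sketch of the usecols approach, again assuming a hypothetical in-memory CSV in place of file.csv. Because names is supplied, the strings in usecols refer to those names.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for file.csv (no header row).
csv_text = "1,2,3,4\n5,6,7,8\n"
names = ['A', 'B', 'C', 'D']

# Only columns A and D are kept while parsing; B and C never reach the frame.
new_dataset = pd.read_csv(io.StringIO(csv_text), names=names, usecols=['A', 'D'])

print(list(new_dataset.columns))
```

This avoids building the full dataframe first, which can matter when the file has many columns you will never use.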
EDIT:
If you only use:
new_dataset = dataset[['A','D']]
and then modify the result, you may get this warning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
If you modify values in new_dataset later, you will find that the modifications do not propagate back to the original data (dataset), and that pandas emits the warning above.
As EdChum pointed out, add copy() to avoid the warning:
new_dataset = dataset[['A','D']].copy()
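To illustrate that the copy is independent of the original, here is a minimal sketch. The small literal dataframe is an assumption used in place of the CSV data.

```python
import pandas as pd

# Hypothetical data standing in for the contents of file.csv.
dataset = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6], 'D': [7, 8]})

# An explicit copy makes it unambiguous that new_dataset owns its data.
new_dataset = dataset[['A', 'D']].copy()
new_dataset.loc[0, 'A'] = 100

print(new_dataset['A'].tolist())  # modified copy
print(dataset['A'].tolist())      # original is unchanged
```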
You must pass a list of column names to select columns. Otherwise, the key is interpreted as a single MultiIndex label; df['A','D'] would work only if df.columns were a MultiIndex.
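The difference can be seen in a short sketch. The frames flat and multi, and the second-level label 'E', are made up for illustration.

```python
import pandas as pd

# Flat columns: the tuple ('A', 'D') is looked up as ONE label and is not found.
flat = pd.DataFrame({'A': [1, 2], 'D': [3, 4]})
try:
    flat['A', 'D']
    flat_lookup_failed = False
except KeyError:
    flat_lookup_failed = True

# MultiIndex columns: ('A', 'D') is a valid label and selects one column.
multi = pd.DataFrame(
    [[1, 2], [3, 4]],
    columns=pd.MultiIndex.from_tuples([('A', 'D'), ('A', 'E')]),
)
column = multi['A', 'D']

print(flat_lookup_failed)
print(column.tolist())
```

So df['A','D'] asks for a single column whose label is the pair (A, D), whereas df[['A','D']] asks for two columns.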
The most obvious way is df.loc[:, ['A', 'D']], but there are other ways (note how all of them take lists):
df1 = df.filter(items=['A', 'D'])
df1 = df.reindex(columns=['A', 'D'])
df1 = df.get(['A', 'D']).copy()
N.B. items is the first positional argument, so df.filter(['A', 'D']) also works.
Note that filter() and reindex() return a copy as well, so you don’t need to worry about getting SettingWithCopyWarning later.
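The three alternatives give the same result when every requested column exists, but they differ when a label is missing. The sketch below assumes a small made-up dataframe and a hypothetical missing label 'Z'; it is meant as an illustration, not a full account of each method.

```python
import pandas as pd

# Hypothetical data; 'Z' below is a label that deliberately does not exist.
df = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'D': [5, 6]})

via_filter = df.filter(items=['A', 'D'])
via_reindex = df.reindex(columns=['A', 'D'])
via_get = df.get(['A', 'D']).copy()

# Behaviour with a missing label:
filter_missing = df.filter(items=['A', 'Z'])      # silently keeps only A
reindex_missing = df.reindex(columns=['A', 'Z'])  # adds Z, filled with NaN
get_missing = df.get(['A', 'Z'])                  # returns the default, None

print(list(filter_missing.columns))
print(reindex_missing['Z'].isna().all())
print(get_missing is None)
```

If a missing column should be an error rather than something handled quietly, df[['A', 'D']] or df.loc[:, ['A', 'D']] raise KeyError instead.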