Extracting specific selected columns to new DataFrame as a copy
Question:
I have a pandas DataFrame with 4 columns and I want to create a new DataFrame that only has three of the columns. This question is similar to: Extracting specific columns from a data frame but for pandas not R. The following code does not work, raises an error, and is certainly not the pandas way to do it.
import pandas as pd
old = pd.DataFrame({'A' : [4,5], 'B' : [10,20], 'C' : [100,50], 'D' : [-30,-50]})
new = pd.DataFrame(zip(old.A, old.C, old.D))
# raises TypeError: data argument can't be an iterator
What is the pandas way to do it?
Answers:
There is a way of doing this and it actually looks similar to R
new = old[['A', 'C', 'D']].copy()
Here you are just selecting the columns you want from the original data frame and creating a variable for those. If you want to modify the new dataframe at all you’ll probably want to use .copy()
to avoid a SettingWithCopyWarning
.
An alternative method is to use filter
which will create a copy by default:
new = old.filter(['A','B','D'], axis=1)
Finally, depending on the number of columns in your original dataframe, it might be more succinct to express this using a drop
(this will also create a copy by default):
new = old.drop('B', axis=1)
Another simpler way seems to be:
new = pd.DataFrame([old.A, old.B, old.C]).transpose()
where old.column_name
will give you a series.
Make a list of all the column-series you want to retain and pass it to the DataFrame constructor. We need to do a transpose to adjust the shape.
In [14]:pd.DataFrame([old.A, old.B, old.C]).transpose()
Out[14]:
A B C
0 4 10 100
1 5 20 50
Generic functional form
def select_columns(data_frame, column_names):
new_frame = data_frame.loc[:, column_names]
return new_frame
Specific for your problem above
selected_columns = ['A', 'C', 'D']
new = select_columns(old, selected_columns)
As far as I can tell, you don’t necessarily need to specify the axis when using the filter function.
new = old.filter(['A','B','D'])
returns the same dataframe as
new = old.filter(['A','B','D'], axis=1)
The easiest way is
new = old[['A','C','D']]
.
columns by index:
# selected column index: 1, 6, 7
new = old.iloc[: , [1, 6, 7]].copy()
If you want to have a new data frame then:
import pandas as pd
old = pd.DataFrame({'A' : [4,5], 'B' : [10,20], 'C' : [100,50], 'D' : [-30,-50]})
new= old[['A', 'C', 'D']]
You can drop columns in the index:
df = pd.DataFrame({'A': [1, 1], 'B': [2, 2], 'C': [3, 3], 'D': [4, 4]})
df[df.columns.drop(['B', 'C'])]
or
df.loc[:, df.columns.drop(['B', 'C'])]
Output:
A D
0 1 4
1 1 4
As an alternative:
new = pd.DataFrame().assign(A=old['A'], C=old['C'], D=old['D'])
You can also use get()
to create a new copy (that doesn’t run into SettingWithCopyWarning
).
new = old.get(['A', 'C', 'D'])
Also, filter
selects by column labels by default, so the following works.
new = old.filter(['A', 'C', 'D'])
axis=
is needed if one needs to select by row. For example, old.filter([0], axis=0)
selects the first row.
If new
is an already existing dataframe, then assign()
also works (if you want to keep the old columns with their original column names).
new = pd.DataFrame()
new = new.assign(**old[['A', 'C', 'D']])
I have a pandas DataFrame with 4 columns and I want to create a new DataFrame that only has three of the columns. This question is similar to: Extracting specific columns from a data frame but for pandas not R. The following code does not work, raises an error, and is certainly not the pandas way to do it.
import pandas as pd
old = pd.DataFrame({'A' : [4,5], 'B' : [10,20], 'C' : [100,50], 'D' : [-30,-50]})
new = pd.DataFrame(zip(old.A, old.C, old.D))
# raises TypeError: data argument can't be an iterator
What is the pandas way to do it?
There is a way of doing this and it actually looks similar to R
new = old[['A', 'C', 'D']].copy()
Here you are just selecting the columns you want from the original data frame and creating a variable for those. If you want to modify the new dataframe at all you’ll probably want to use .copy()
to avoid a SettingWithCopyWarning
.
An alternative method is to use filter
which will create a copy by default:
new = old.filter(['A','B','D'], axis=1)
Finally, depending on the number of columns in your original dataframe, it might be more succinct to express this using a drop
(this will also create a copy by default):
new = old.drop('B', axis=1)
Another simpler way seems to be:
new = pd.DataFrame([old.A, old.B, old.C]).transpose()
where old.column_name
will give you a series.
Make a list of all the column-series you want to retain and pass it to the DataFrame constructor. We need to do a transpose to adjust the shape.
In [14]:pd.DataFrame([old.A, old.B, old.C]).transpose()
Out[14]:
A B C
0 4 10 100
1 5 20 50
Generic functional form
def select_columns(data_frame, column_names):
new_frame = data_frame.loc[:, column_names]
return new_frame
Specific for your problem above
selected_columns = ['A', 'C', 'D']
new = select_columns(old, selected_columns)
As far as I can tell, you don’t necessarily need to specify the axis when using the filter function.
new = old.filter(['A','B','D'])
returns the same dataframe as
new = old.filter(['A','B','D'], axis=1)
The easiest way is
new = old[['A','C','D']]
.
columns by index:
# selected column index: 1, 6, 7
new = old.iloc[: , [1, 6, 7]].copy()
If you want to have a new data frame then:
import pandas as pd
old = pd.DataFrame({'A' : [4,5], 'B' : [10,20], 'C' : [100,50], 'D' : [-30,-50]})
new= old[['A', 'C', 'D']]
You can drop columns in the index:
df = pd.DataFrame({'A': [1, 1], 'B': [2, 2], 'C': [3, 3], 'D': [4, 4]})
df[df.columns.drop(['B', 'C'])]
or
df.loc[:, df.columns.drop(['B', 'C'])]
Output:
A D
0 1 4
1 1 4
As an alternative:
new = pd.DataFrame().assign(A=old['A'], C=old['C'], D=old['D'])
You can also use get()
to create a new copy (that doesn’t run into SettingWithCopyWarning
).
new = old.get(['A', 'C', 'D'])
Also, filter
selects by column labels by default, so the following works.
new = old.filter(['A', 'C', 'D'])
axis=
is needed if one needs to select by row. For example, old.filter([0], axis=0)
selects the first row.
If new
is an already existing dataframe, then assign()
also works (if you want to keep the old columns with their original column names).
new = pd.DataFrame()
new = new.assign(**old[['A', 'C', 'D']])