How to select all columns except one in pandas?
Question:
I have a dataframe that look like this:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(4,4), columns=list('abcd'))
df
a b c d
0 0.418762 0.042369 0.869203 0.972314
1 0.991058 0.510228 0.594784 0.534366
2 0.407472 0.259811 0.396664 0.894202
3 0.726168 0.139531 0.324932 0.906575
How I can get all columns except b
?
Answers:
When the columns are not a MultiIndex, df.columns
is just an array of column names so you can do:
df.loc[:, df.columns != 'b']
a c d
0 0.561196 0.013768 0.772827
1 0.882641 0.615396 0.075381
2 0.368824 0.651378 0.397203
3 0.788730 0.568099 0.869127
Don’t use ix
. It’s deprecated. The most readable and idiomatic way of doing this is df.drop()
:
>>> df
a b c d
0 0.175127 0.191051 0.382122 0.869242
1 0.414376 0.300502 0.554819 0.497524
2 0.142878 0.406830 0.314240 0.093132
3 0.337368 0.851783 0.933441 0.949598
>>> df.drop('b', axis=1)
a c d
0 0.175127 0.382122 0.869242
1 0.414376 0.554819 0.497524
2 0.142878 0.314240 0.093132
3 0.337368 0.933441 0.949598
Note that by default, .drop()
does not operate inplace; despite the ominous name, df
is unharmed by this process. If you want to permanently remove b
from df
, do df.drop('b', inplace=True)
.
df.drop()
also accepts a list of labels, e.g. df.drop(['a', 'b'], axis=1)
will drop column a
and b
.
Here is another way:
df[[i for i in list(df.columns) if i != '<your column>']]
You just pass all columns to be shown except of the one you do not want.
df[df.columns.difference(['b'])]
Out:
a c d
0 0.427809 0.459807 0.333869
1 0.678031 0.668346 0.645951
2 0.996573 0.673730 0.314911
3 0.786942 0.719665 0.330833
I think the best way to do is the way mentioned by @Salvador Dali. Not that the others are wrong.
Because when you have a data set where you just want to select one column and put it into one variable and the rest of the columns into another for comparison or computational purposes. Then dropping the column of the data set might not help. Of course there are use cases for that as well.
x_cols = [x for x in data.columns if x != 'name of column to be excluded']
Then you can put those collection of columns in variable x_cols
into another variable like x_cols1
for other computation.
ex: x_cols1 = data[x_cols]
Another slight modification to @Salvador Dali enables a list of columns to exclude:
df[[i for i in list(df.columns) if i not in [list_of_columns_to_exclude]]]
or
df.loc[:,[i for i in list(df.columns) if i not in [list_of_columns_to_exclude]]]
You can use df.columns.isin()
df.loc[:, ~df.columns.isin(['b'])]
When you want to drop multiple columns, as simple as:
df.loc[:, ~df.columns.isin(['col1', 'col2'])]
Here is a one line lambda:
df[map(lambda x :x not in ['b'], list(df.columns))]
before:
import pandas
import numpy as np
df = pd.DataFrame(np.random.rand(4,4), columns = list('abcd'))
df
a b c d
0 0.774951 0.079351 0.118437 0.735799
1 0.615547 0.203062 0.437672 0.912781
2 0.804140 0.708514 0.156943 0.104416
3 0.226051 0.641862 0.739839 0.434230
after:
df[map(lambda x :x not in ['b'], list(df.columns))]
a c d
0 0.774951 0.118437 0.735799
1 0.615547 0.437672 0.912781
2 0.804140 0.156943 0.104416
3 0.226051 0.739839 0.434230
I think a nice solution is with the function filter of pandas and regex (match everything except "b"):
df.filter(regex="^(?!b$)")
Similar to @Toms answer, it is also possible to select all columns except "b" without using .loc, like so:
df[df.columns[~df.columns.isin(['b'])]]
I’ve tested speed and found that for me the .loc
solution was the fastest
df_working_1.loc[:, df_working_1.columns != "market_id"]
# 7.19 ms ± 201 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
df_working_1.drop("market_id", axis=1)
# 7.65 ms ± 136 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
df_working_1[df_working_1.columns.difference(['market_id'])]
# 7.58 ms ± 116 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
df_working_1[[i for i in list(df_working_1.columns) if i != 'market_id']]
# 7.57 ms ± 144 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
I have a dataframe that look like this:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(4,4), columns=list('abcd'))
df
a b c d
0 0.418762 0.042369 0.869203 0.972314
1 0.991058 0.510228 0.594784 0.534366
2 0.407472 0.259811 0.396664 0.894202
3 0.726168 0.139531 0.324932 0.906575
How I can get all columns except b
?
When the columns are not a MultiIndex, df.columns
is just an array of column names so you can do:
df.loc[:, df.columns != 'b']
a c d
0 0.561196 0.013768 0.772827
1 0.882641 0.615396 0.075381
2 0.368824 0.651378 0.397203
3 0.788730 0.568099 0.869127
Don’t use ix
. It’s deprecated. The most readable and idiomatic way of doing this is df.drop()
:
>>> df
a b c d
0 0.175127 0.191051 0.382122 0.869242
1 0.414376 0.300502 0.554819 0.497524
2 0.142878 0.406830 0.314240 0.093132
3 0.337368 0.851783 0.933441 0.949598
>>> df.drop('b', axis=1)
a c d
0 0.175127 0.382122 0.869242
1 0.414376 0.554819 0.497524
2 0.142878 0.314240 0.093132
3 0.337368 0.933441 0.949598
Note that by default, .drop()
does not operate inplace; despite the ominous name, df
is unharmed by this process. If you want to permanently remove b
from df
, do df.drop('b', inplace=True)
.
df.drop()
also accepts a list of labels, e.g. df.drop(['a', 'b'], axis=1)
will drop column a
and b
.
Here is another way:
df[[i for i in list(df.columns) if i != '<your column>']]
You just pass all columns to be shown except of the one you do not want.
df[df.columns.difference(['b'])]
Out:
a c d
0 0.427809 0.459807 0.333869
1 0.678031 0.668346 0.645951
2 0.996573 0.673730 0.314911
3 0.786942 0.719665 0.330833
I think the best way to do is the way mentioned by @Salvador Dali. Not that the others are wrong.
Because when you have a data set where you just want to select one column and put it into one variable and the rest of the columns into another for comparison or computational purposes. Then dropping the column of the data set might not help. Of course there are use cases for that as well.
x_cols = [x for x in data.columns if x != 'name of column to be excluded']
Then you can put those collection of columns in variable x_cols
into another variable like x_cols1
for other computation.
ex: x_cols1 = data[x_cols]
Another slight modification to @Salvador Dali enables a list of columns to exclude:
df[[i for i in list(df.columns) if i not in [list_of_columns_to_exclude]]]
or
df.loc[:,[i for i in list(df.columns) if i not in [list_of_columns_to_exclude]]]
You can use df.columns.isin()
df.loc[:, ~df.columns.isin(['b'])]
When you want to drop multiple columns, as simple as:
df.loc[:, ~df.columns.isin(['col1', 'col2'])]
Here is a one line lambda:
df[map(lambda x :x not in ['b'], list(df.columns))]
before:
import pandas
import numpy as np
df = pd.DataFrame(np.random.rand(4,4), columns = list('abcd'))
df
a b c d
0 0.774951 0.079351 0.118437 0.735799
1 0.615547 0.203062 0.437672 0.912781
2 0.804140 0.708514 0.156943 0.104416
3 0.226051 0.641862 0.739839 0.434230
after:
df[map(lambda x :x not in ['b'], list(df.columns))]
a c d
0 0.774951 0.118437 0.735799
1 0.615547 0.437672 0.912781
2 0.804140 0.156943 0.104416
3 0.226051 0.739839 0.434230
I think a nice solution is with the function filter of pandas and regex (match everything except "b"):
df.filter(regex="^(?!b$)")
Similar to @Toms answer, it is also possible to select all columns except "b" without using .loc, like so:
df[df.columns[~df.columns.isin(['b'])]]
I’ve tested speed and found that for me the .loc
solution was the fastest
df_working_1.loc[:, df_working_1.columns != "market_id"]
# 7.19 ms ± 201 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
df_working_1.drop("market_id", axis=1)
# 7.65 ms ± 136 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
df_working_1[df_working_1.columns.difference(['market_id'])]
# 7.58 ms ± 116 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
df_working_1[[i for i in list(df_working_1.columns) if i != 'market_id']]
# 7.57 ms ± 144 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)