How to change the column order in a pandas dataframe when there are too many columns?
Question:
I have a large pandas dataframe that contains many columns.
I would like to change the order of the columns so that only a subset of them appears first. I don't care about the ordering of the rest (and there are too many variables to list them all).
For instance, if my dataframe is like this
a b c d e f g h i
5 8 7 2 1 4 1 2 3
1 4 2 2 3 4 1 5 3
I would like to specify a subset of the columns
mysubset=['d','f']
and reorder the dataframe so that the order of the columns is now
d,f,a,b,c,e,g,h,i
Is there a way to do that in a panda-esque way?
Answers:
You can use a MultiIndex to do that:
priority=[ 0 if x in {'d','f'} else 1 for x in df.columns]
newdf=df.T.set_index([priority,df.columns]).sort_index().T
Then you have:
In [3]: newdf
Out[3]:
   0     1
   d  f  a  b  c  e  g  h  i
0  2  4  5  8  7  1  1  2  3
1  2  4  1  4  2  3  1  5  3
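The result above carries the helper priority level in its column labels. A minimal sketch of getting back to flat column names, assuming the example dataframe from the question (the name flat is introduced here for illustration). Note that sort_index orders the labels alphabetically within each priority group, which matches the desired output here only because the original columns are already alphabetical.

```python
import pandas as pd

df = pd.DataFrame([[5, 8, 7, 2, 1, 4, 1, 2, 3],
                   [1, 4, 2, 2, 3, 4, 1, 5, 3]],
                  columns=list('abcdefghi'))
mysubset = ['d', 'f']

# 0 for columns that should come first, 1 for everything else
priority = [0 if c in mysubset else 1 for c in df.columns]
newdf = df.T.set_index([priority, df.columns]).sort_index().T

# Drop the helper level so the columns are a flat Index again
flat = newdf.droplevel(0, axis=1)
print(list(flat.columns))
```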
To move an entire subset of columns, you could do this:
#!/usr/bin/python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
print(df)

cols = df.columns.tolist()
print(cols)

# Move each subset column to the front, keeping the order given in mysubset
mysubset = ['B', 'D']
for idx, item in enumerate(mysubset):
    cols.remove(item)
    cols.insert(idx, item)
print(cols)

df = df[cols]
print(df)
Here B and D are moved to the front and the other columns trail behind in their original order. Output:
                   A         B         C         D
2013-01-01  0.905122 -0.004839 -0.697663 -1.307550
2013-01-02  0.651998 -1.092546  0.594493  0.341066
2013-01-03  0.355832 -0.840057  0.016989  0.377502
2013-01-04 -0.544407  0.826708 -0.889118  0.871769
2013-01-05  0.190630  0.717418  1.325479 -0.882652
2013-01-06  2.730582  0.195908 -0.657642  1.606263
['A', 'B', 'C', 'D']
['B', 'D', 'A', 'C']
                   B         D         A         C
2013-01-01 -0.004839 -1.307550  0.905122 -0.697663
2013-01-02 -1.092546  0.341066  0.651998  0.594493
2013-01-03 -0.840057  0.377502  0.355832  0.016989
2013-01-04  0.826708  0.871769 -0.544407 -0.889118
2013-01-05  0.717418 -0.882652  0.190630  1.325479
2013-01-06  0.195908  1.606263  2.730582 -0.657642
For more, read this answer.
You could use a column mask:
>>> mysubset = ["d","f"]
>>> mask = df.columns.isin(mysubset)
>>> pd.concat([df.loc[:,mask], df.loc[:,~mask]], axis=1)
d f a b c e g h i
0 2 4 5 8 7 1 1 2 3
1 2 4 1 4 2 3 1 5 3
Or use sorted:
>>> mysubset = ["d","f"]
>>> df[sorted(df, key=lambda x: x not in mysubset)]
d f a b c e g h i
0 2 4 5 8 7 1 1 2 3
1 2 4 1 4 2 3 1 5 3
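This works because x not in mysubset is False for d and f and True for every other column, and False < True. Since sorted is stable, the columns within each group keep their original relative order.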
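One consequence of the boolean key is that the subset columns come out in the dataframe's order, not in the order written in mysubset. A small sketch of the difference, assuming the question's example data; the rank helper and the names by_membership and by_position are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame([[5, 8, 7, 2, 1, 4, 1, 2, 3],
                   [1, 4, 2, 2, 3, 4, 1, 5, 3]],
                  columns=list('abcdefghi'))
mysubset = ['f', 'd']  # note: f is listed before d

# Boolean key: the subset comes first, but in the dataframe's original order
by_membership = sorted(df, key=lambda c: c not in mysubset)

# Positional key: subset columns are ranked by their position in mysubset;
# every other column shares the rank len(mysubset) and keeps its order
def rank(c):
    return mysubset.index(c) if c in mysubset else len(mysubset)

by_position = sorted(df, key=rank)

print(by_membership)  # d before f
print(by_position)    # f before d
```

Either list can then be passed to df[...] to get the reordered dataframe.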
I usually do something like this:
mysubset = ['d', 'f']
othercols = [c for c in df.columns if c not in mysubset]
df = df[mysubset+othercols]
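If this comes up often, the same idea can be wrapped in a small function. This is only a sketch: move_to_front is a hypothetical helper name, and it assumes unique column labels and raises if a requested column is missing.

```python
import pandas as pd

def move_to_front(df, first):
    """Return a new dataframe with the columns in `first` placed first."""
    missing = [c for c in first if c not in df.columns]
    if missing:
        raise KeyError(f"columns not found: {missing}")
    front = set(first)  # set lookup keeps the filter cheap for wide frames
    rest = [c for c in df.columns if c not in front]
    return df[list(first) + rest]

df = pd.DataFrame([[5, 8, 7, 2, 1, 4, 1, 2, 3],
                   [1, 4, 2, 2, 3, 4, 1, 5, 3]],
                  columns=list('abcdefghi'))
result = move_to_front(df, ['d', 'f'])
print(list(result.columns))
```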
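a = list('abcdefghi')  # current column order
b = list('dfabceghi')  # desired column order
# ind maps each current column name to its position in the desired order
ind = pd.Series(range(9), index=b).reindex(a)
# the key argument of sort_index requires pandas 1.1 or later
df.sort_index(axis=1, inplace=True, key=lambda x: ind)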
The benefit of the above approach is that it uses inplace=True, which can cost less memory and time when df is a large dataframe.
If your dataframe has an ordinary shape:
df.filter(b)
may be more pythonic. Note that it returns a new dataframe.
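The lists a and b above are written out by hand, which does not scale to many columns. A minimal sketch of deriving the desired order from mysubset and feeding it to the same sort_index call; the names desired and rank are introduced for illustration, and the key argument assumes pandas 1.1 or later.

```python
import pandas as pd

df = pd.DataFrame([[5, 8, 7, 2, 1, 4, 1, 2, 3],
                   [1, 4, 2, 2, 3, 4, 1, 5, 3]],
                  columns=list('abcdefghi'))
mysubset = ['d', 'f']

# Desired order: the subset first, then every remaining column as it was
desired = mysubset + [c for c in df.columns if c not in mysubset]

# Map each column name to its target position
rank = {name: pos for pos, name in enumerate(desired)}

# key receives the column Index and must return sort keys of the same length
df.sort_index(axis=1, key=lambda idx: idx.map(rank), inplace=True)
print(list(df.columns))

# The non-inplace equivalent with filter returns a new dataframe
same = df.filter(items=desired)
```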