Index pandas DataFrame by column numbers, when column names are integers
Question:
I am trying to keep just certain columns of a DataFrame, and it works fine when column names are strings:
In [2]: import numpy as np
In [3]: import pandas as pd
In [4]: a = np.arange(35).reshape(5,7)
In [5]: df = pd.DataFrame(a, ['x', 'y', 'u', 'z', 'w'], ['a', 'b', 'c', 'd', 'e', 'f', 'g'])
In [6]: df
Out[6]:
a b c d e f g
x 0 1 2 3 4 5 6
y 7 8 9 10 11 12 13
u 14 15 16 17 18 19 20
z 21 22 23 24 25 26 27
w 28 29 30 31 32 33 34
[5 rows x 7 columns]
In [7]: df[[1,3]] #No problem
Out[7]:
b d
x 1 3
y 8 10
u 15 17
z 22 24
w 29 31
However, when column names are integers, I am getting a key error:
In [8]: df = pd.DataFrame(a, ['x', 'y', 'u', 'z', 'w'], range(10, 17))
In [9]: df
Out[9]:
10 11 12 13 14 15 16
x 0 1 2 3 4 5 6
y 7 8 9 10 11 12 13
u 14 15 16 17 18 19 20
z 21 22 23 24 25 26 27
w 28 29 30 31 32 33 34
[5 rows x 7 columns]
In [10]: df[[1,3]]
Results in:
KeyError: '[1 3] not in index'
I can see why pandas does not allow that: it avoids a mix-up between indexing by column names and indexing by column numbers. However, is there a way to tell pandas that I want to index by column numbers? Of course, one solution is to convert the column names to strings, but I am wondering if there is a better solution.
Answers:
This is certainly one of those things that feels like a bug but is really a design decision (I think).
A few workaround options:
Rename the columns with their positions as their names:
df.columns = np.arange(len(df.columns))
Another way is to get the names from df.columns:
print(df[df.columns[[1, 3]]])
11 13
x 1 3
y 8 10
u 15 17
z 22 24
w 29 31
I suspect this is the most appealing as it just requires adding a wee bit of code and not changing any column names.
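To make the df.columns lookup concrete, here is a minimal sketch built on the question's frame. The names `positions`, `labels` and `subset` are introduced here for illustration only. It first translates positions into labels, then does an ordinary label-based selection, so the column names are left untouched.

```python
import numpy as np
import pandas as pd

a = np.arange(35).reshape(5, 7)
df = pd.DataFrame(a, index=['x', 'y', 'u', 'z', 'w'], columns=range(10, 17))

positions = [1, 3]               # column positions, not labels
labels = df.columns[positions]   # the labels found at those positions
subset = df[labels]              # ordinary label-based selection

print(list(labels))
print(subset)
```

Because the lookup goes through df.columns, the same two lines should work whether the labels are integers, strings, or a mix.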
This is exactly the purpose of iloc, which selects purely by integer position (see the pandas indexing documentation):
In [37]: df
Out[37]:
10 11 12 13 14 15 16
x 0 1 2 3 4 5 6
y 7 8 9 10 11 12 13
u 14 15 16 17 18 19 20
z 21 22 23 24 25 26 27
w 28 29 30 31 32 33 34
In [38]: df.iloc[:,[1,3]]
Out[38]:
11 13
x 1 3
y 8 10
u 15 17
z 22 24
w 29 31
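As a small sketch of what else iloc accepts on the column axis, assuming the same frame as in the question: lists, slices and negative positions are all interpreted as positions, never as labels. The variable names below are illustrative only.

```python
import numpy as np
import pandas as pd

a = np.arange(35).reshape(5, 7)
df = pd.DataFrame(a, index=['x', 'y', 'u', 'z', 'w'], columns=range(10, 17))

by_list = df.iloc[:, [1, 3]]   # positions 1 and 3
by_slice = df.iloc[:, 1:4]     # positions 1, 2, 3 (end is exclusive)
last_two = df.iloc[:, -2:]     # the last two columns

# The label-based counterpart of by_list uses loc with the actual names.
by_label = df.loc[:, [11, 13]]

print(list(by_list.columns), list(by_slice.columns), list(last_two.columns))
print(by_list.equals(by_label))
```

Keeping iloc for positions and loc for labels avoids any ambiguity when the labels themselves are integers.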
Just convert the headers from integers to strings. This is almost always worth doing as a best practice when working with pandas datasets, to avoid surprises:
df.columns = df.columns.map(str)
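A short sketch of what the conversion does, again assuming the question's frame: after mapping to str, the labels become '10' through '16', so label-based selection needs string keys, while positional selection through iloc is unaffected. The names `by_label` and `by_position` are illustrative.

```python
import numpy as np
import pandas as pd

a = np.arange(35).reshape(5, 7)
df = pd.DataFrame(a, index=['x', 'y', 'u', 'z', 'w'], columns=range(10, 17))

df.columns = df.columns.map(str)   # integer labels become string labels
print(list(df.columns))

by_label = df[['11', '13']]        # labels are now strings
by_position = df.iloc[:, [1, 3]]   # positions are unchanged by the rename

print(by_label.equals(by_position))
```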
import numpy as np
import pandas as pd
a = np.arange(35).reshape(5, 7)
df = pd.DataFrame(a, ['x', 'y', 'u', 'z', 'w'], range(10, 17))
# Say you want to keep only the columns at positions 1 and 2 (these are locations, not names)
needed_columns = [1, 2]
df = df[df.columns[needed_columns]]
print(df)
11 12
x 1 2
y 8 9
u 15 16
z 22 23
w 29 30