Get a list from Pandas DataFrame column headers
Question:
I want to get a list of the column headers from a Pandas DataFrame. The DataFrame will come from user input, so I won’t know how many columns there will be or what they will be called.
For example, if I’m given a DataFrame like this:
>>> my_dataframe
y gdp cap
0 1 2 5
1 2 3 9
2 8 7 2
3 3 4 7
4 6 7 7
5 4 8 3
6 8 2 8
7 9 9 10
8 6 6 4
9 10 10 7
I would get a list like this:
>>> header_list
['y', 'gdp', 'cap']
Answers:
That’s available as my_dataframe.columns.
You can get the values as a list by doing:
list(my_dataframe.columns.values)
Also you can simply use (as shown in Ed Chum’s answer):
list(my_dataframe)
n = []
for i in my_dataframe.columns:
    n.append(i)
print(n)
There is a built-in method which is the most performant:
my_dataframe.columns.values.tolist()
.columns returns an Index, .columns.values returns an array, and the array has a helper method .tolist() that returns a list.
If performance is not as important to you, Index objects define a .tolist() method that you can call directly:
my_dataframe.columns.tolist()
The difference in performance is obvious:
%timeit df.columns.tolist()
16.7 µs ± 317 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit df.columns.values.tolist()
1.24 µs ± 12.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
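To make the chain described above concrete, here is a minimal sketch that inspects the object returned at each step. The sample data is made up and only mirrors the column names from the question.

```python
import pandas as pd

# Hypothetical sample data; only the column names matter here.
my_dataframe = pd.DataFrame({"y": [1, 2], "gdp": [2, 3], "cap": [5, 9]})

index_obj = my_dataframe.columns         # a pandas Index
array_obj = my_dataframe.columns.values  # a NumPy ndarray backing the Index
header_list = array_obj.tolist()         # a plain Python list

print(type(index_obj).__name__)
print(type(array_obj).__name__)
print(header_list)
```

Both routes end in the same list; the timings above only differ in how much work happens on the way there.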
For those who hate typing, you can just call list on df, like so:
list(df)
A DataFrame follows the dict-like convention of iterating over the “keys” of the objects.
my_dataframe.keys()
Create a list of keys/columns, using either the object method to_list() or the Pythonic way:
my_dataframe.keys().to_list()
list(my_dataframe.keys())
Basic iteration on a DataFrame returns column labels:
[column for column in my_dataframe]
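As a quick sanity check of the dict-like behaviour, the sketch below compares the spellings mentioned so far. The sample DataFrame is an assumption for illustration, and Index.to_list() needs a reasonably recent Pandas (0.24 or later, as far as I know).

```python
import pandas as pd

# Hypothetical sample data; only the column names matter here.
my_dataframe = pd.DataFrame({"y": [1, 2], "gdp": [2, 3], "cap": [5, 9]})

via_list = list(my_dataframe)                            # iterates the column labels
via_keys = list(my_dataframe.keys())                     # keys() returns the column Index
via_to_list = my_dataframe.keys().to_list()              # Index method
via_comprehension = [column for column in my_dataframe]  # plain iteration

print(via_list == via_keys == via_to_list == via_comprehension)
print(via_list)
```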
Iterating over a DataFrame only walks its column labels, so converting it to a list does not touch the row data. Still, asking for the labels explicitly makes the intent clearer. Do not stop thinking while looking for convenient code samples.
xlarge = pd.DataFrame(np.arange(100000000).reshape(10000,10000))
list(xlarge) # Iterates over the column labels only; cost grows with the number of columns, not rows
list(xlarge.keys()) # Same labels, same cost; keys() returns the column Index
I did some quick tests, and perhaps unsurprisingly the built-in version using dataframe.columns.values.tolist() is the fastest:
In [1]: %timeit [column for column in df]
1000 loops, best of 3: 81.6 µs per loop
In [2]: %timeit df.columns.values.tolist()
10000 loops, best of 3: 16.1 µs per loop
In [3]: %timeit list(df)
10000 loops, best of 3: 44.9 µs per loop
In [4]: %timeit list(df.columns.values)
10000 loops, best of 3: 38.4 µs per loop
(I still really like the list(dataframe) though, so thanks EdChum!)
It gets even simpler (by Pandas 0.16.0):
df.columns.tolist()
will give you the column names in a nice list.
>>> list(my_dataframe)
['y', 'gdp', 'cap']
To list the columns of a dataframe while in debugger mode, use a list comprehension:
>>> [c for c in my_dataframe]
['y', 'gdp', 'cap']
By the way, you can get a sorted list simply by using sorted:
>>> sorted(my_dataframe)
['cap', 'gdp', 'y']
Interestingly, df.columns.values.tolist() is almost three times faster than df.columns.tolist(), although I thought they were the same:
In [97]: %timeit df.columns.values.tolist()
100000 loops, best of 3: 2.97 µs per loop
In [98]: %timeit df.columns.tolist()
10000 loops, best of 3: 9.67 µs per loop
In the Notebook
For data exploration in the IPython notebook, my preferred way is this:
sorted(df)
This produces an easy-to-read, alphabetically ordered list.
In a code repository
In code, I find it more explicit to write
df.columns
because it tells others reading your code what you are doing.
I feel the question deserves an additional explanation.
As fixxxer noted, the answer depends on the Pandas version you are using in your project, which you can check with pd.__version__.
If, like me, you are for some reason using a Pandas version older than 0.16.0 (I use 0.14.1 on Debian 8 (Jessie)), then you need to use:
df.keys().tolist()
because there isn’t any df.columns method implemented yet.
The advantage of this keys method is that it works even in newer versions of Pandas, so it’s more universal.
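A small sketch of that check, assuming nothing beyond a working Pandas install; the sample DataFrame is hypothetical. Note that pd.__version__ is a plain string attribute rather than a command.

```python
import pandas as pd

# The installed version is exposed as a string attribute.
installed_version = pd.__version__
print(installed_version)

# Hypothetical sample data; only the column names matter here.
df = pd.DataFrame({"y": [1, 2], "gdp": [2, 3], "cap": [5, 9]})

# keys() returns the column Index, and tolist() converts it to a list.
headers = df.keys().tolist()
print(headers)
```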
As answered by Simeon Visser, you could do
list(my_dataframe.columns.values)
or
list(my_dataframe) # For less typing.
But I think the sweet spot is:
list(my_dataframe.columns)
It is explicit and at the same time not unnecessarily long.
For a quick, neat, visual check, try this:
for col in df.columns:
    print(col)
Even though the solution that was provided previously is nice, I would also expect something like frame.column_names() to be a function in Pandas, but since it is not, maybe it would be nice to use the following syntax. It somehow preserves the feeling that you are using pandas in a proper way by calling the "tolist" function: frame.columns.tolist()
frame.columns.tolist()
Extended Iterable Unpacking (Python 3.5+): [*df] and Friends
Unpacking generalizations (PEP 448) have been introduced with Python 3.5. So, the following operations are all possible.
df = pd.DataFrame('x', columns=['A', 'B', 'C'], index=range(5))
df
A B C
0 x x x
1 x x x
2 x x x
3 x x x
4 x x x
If you want a list...
[*df]
# ['A', 'B', 'C']
Or, if you want a set,
{*df}
# {'A', 'B', 'C'}
Or, if you want a tuple,
*df, # Please note the trailing comma
# ('A', 'B', 'C')
Or, if you want to store the result somewhere,
*cols, = df # A wild comma appears, again
cols
# ['A', 'B', 'C']
… if you’re the kind of person who converts coffee to typing sounds, well, this is going to consume your coffee more efficiently 😉
P.S.: if performance is important, you will want to ditch the solutions above in favour of
df.columns.to_numpy().tolist()
# ['A', 'B', 'C']
This is similar to Ed Chum’s answer, but updated for v0.24, where .to_numpy() is preferred to the use of .values. See this answer (by me) for more information.
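For what it’s worth, here is a minimal sketch showing that the .to_numpy() spelling and the older .values spelling give the same list. It assumes Pandas 0.24 or later, since that is where .to_numpy() was introduced.

```python
import pandas as pd

# Same toy frame as above: every cell holds the string 'x'.
df = pd.DataFrame('x', columns=['A', 'B', 'C'], index=range(5))

via_values = df.columns.values.tolist()        # older spelling
via_to_numpy = df.columns.to_numpy().tolist()  # spelling preferred since v0.24

print(via_values == via_to_numpy)
print(via_to_numpy)
```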
Visual Check
Since I’ve seen this discussed in other answers, you can use iterable unpacking (no need for explicit loops).
print(*df)
A B C
print(*df, sep='\n')
A
B
C
Critique of Other Methods
Don’t use an explicit for loop for an operation that can be done in a single line (list comprehensions are okay).
Next, using sorted(df) does not preserve the original order of the columns. For that, you should use list(df) instead.
Next, list(df.columns) and list(df.columns.values) are poor suggestions (as of the current version, v0.24). Both Index (returned from df.columns) and NumPy arrays (returned by df.columns.values) define a .tolist() method, which is faster and more idiomatic.
Lastly, listification, i.e., list(df), should only be used as a concise alternative to the aforementioned methods for Python 3.4 or earlier, where extended unpacking is not available.
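The ordering point is easy to see with the column names from the question. This is just a sketch with made-up one-row data.

```python
import pandas as pd

# Hypothetical one-row frame using the question's column names.
my_dataframe = pd.DataFrame({"y": [1], "gdp": [2], "cap": [5]})

original_order = list(my_dataframe)  # keeps the order the columns were defined in
alphabetical = sorted(my_dataframe)  # returns a new, alphabetically sorted list

print(original_order)
print(alphabetical)
```

If the column order carries meaning downstream (for example when writing rows back to a file), the unsorted list is likely the one you want.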
%%timeit
final_df.columns.values.tolist()
948 ns ± 19.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%%timeit
list(final_df.columns)
14.2 µs ± 79.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%%timeit
list(final_df.columns.values)
1.88 µs ± 11.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%%timeit
final_df.columns.tolist()
12.3 µs ± 27.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%%timeit
list(final_df.head(1).columns)
163 µs ± 20.6 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
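If you want to reproduce this kind of comparison outside IPython, a rough sketch using the standard-library timeit module might look like the following. The sample DataFrame and the run count are assumptions, and absolute numbers will vary with your machine and Pandas version.

```python
import timeit

import pandas as pd

# Hypothetical small frame; only the column names matter here.
final_df = pd.DataFrame({"y": [1, 2], "gdp": [2, 3], "cap": [5, 9]})

# Each candidate is wrapped in a lambda so timeit can call it repeatedly.
candidates = {
    "final_df.columns.values.tolist()": lambda: final_df.columns.values.tolist(),
    "list(final_df.columns.values)": lambda: list(final_df.columns.values),
    "final_df.columns.tolist()": lambda: final_df.columns.tolist(),
    "list(final_df.columns)": lambda: list(final_df.columns),
    "list(final_df)": lambda: list(final_df),
}

runs = 10_000  # arbitrary repeat count for this sketch
results = {}
for label, func in candidates.items():
    results[label] = func()  # keep one result to confirm they all agree
    total_seconds = timeit.timeit(func, number=runs)
    print(f"{label:34s} {total_seconds / runs * 1e6:8.2f} µs per call")
```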
If the DataFrame happens to have an Index or MultiIndex and you want those included as column names too:
names = list(filter(None, df.index.names + df.columns.values.tolist()))
It avoids calling reset_index() which has an unnecessary performance hit for such a simple operation.
I’ve run into needing this more often because I’m shuttling data from databases where the dataframe index maps to a primary/unique key, but is really just another “column” to me. It would probably make sense for pandas to have a built-in method for something like this (totally possible I’ve missed it).
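Here is a small sketch of that one-liner on a DataFrame with a named MultiIndex and on one with a default, unnamed index. The data and the names group, item, and value are invented for illustration. One caveat worth knowing: filter(None, ...) drops every falsy label, so a column literally named 0 or '' would be dropped as well.

```python
import pandas as pd

# Hypothetical frame with a two-level, named MultiIndex.
df = pd.DataFrame(
    {"value": [10, 20, 30]},
    index=pd.MultiIndex.from_tuples(
        [("a", 1), ("a", 2), ("b", 1)], names=["group", "item"]
    ),
)

# Index level names first, then the regular column labels.
names = list(filter(None, list(df.index.names) + df.columns.values.tolist()))
print(names)

# With a default index the only level name is None, which filter() removes.
plain = pd.DataFrame({"value": [10, 20]})
plain_names = list(filter(None, list(plain.index.names) + plain.columns.values.tolist()))
print(plain_names)
```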
listHeaders = [colName for colName in my_dataframe]
The simplest option would be:
list(my_dataframe.columns)
or my_dataframe.columns.tolist()
No need for the complex stuff above 🙂
import pandas as pd
# create test dataframe
df = pd.DataFrame('x', columns=['A', 'B', 'C'], index=range(2))
list(df.columns)
Returns
['A', 'B', 'C']
This is the easiest way to reach your goal.
my_dataframe.columns.values.tolist()
And if you are lazy, try this:
list(my_dataframe)
Here is some simple code for you:
for i in my_dataframe:
    print(i)
Just do it.
It’s very simple.
You can do it like this:
list(df.columns)
my_dataframe.columns.tolist() #or
list(my_dataframe.columns)