Get list from pandas dataframe column or row?
Question:
I have a dataframe df imported from an Excel document like this:
cluster load_date budget actual fixed_price
A 1/1/2014 1000 4000 Y
A 2/1/2014 12000 10000 Y
A 3/1/2014 36000 2000 Y
B 4/1/2014 15000 10000 N
B 4/1/2014 12000 11500 N
B 4/1/2014 90000 11000 N
C 7/1/2014 22000 18000 N
C 8/1/2014 30000 28960 N
C 9/1/2014 53000 51200 N
I want to be able to return the contents of column 1, df['cluster'], as a list, so I can run a for-loop over it and create an Excel worksheet for every cluster.
Is it also possible to return the contents of a whole column or row to a list? e.g.
list = [], list[column1] or list[df.ix(row1)]
Answers:
Pandas DataFrame columns are pandas Series when you pull them out, and you can call x.tolist() on a Series to turn it into a Python list. Alternatively, you can cast it with list(x).
import pandas as pd
data_dict = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
             'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(data_dict)
print(f"DataFrame:\n{df}\n")
print(f"column types:\n{df.dtypes}")
col_one_list = df['one'].tolist()
col_one_arr = df['one'].to_numpy()
print(f"\ncol_one_list:\n{col_one_list}\ntype:{type(col_one_list)}")
print(f"\ncol_one_arr:\n{col_one_arr}\ntype:{type(col_one_arr)}")
Output:
DataFrame:
one two
a 1.0 1
b 2.0 2
c 3.0 3
d NaN 4
column types:
one float64
two int64
dtype: object
col_one_list:
[1.0, 2.0, 3.0, nan]
type:<class 'list'>
col_one_arr:
[ 1. 2. 3. nan]
type:<class 'numpy.ndarray'>
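As a minimal sketch of the loop the question describes, the column can be reduced to its distinct values and each value used to select the matching rows. The frame df_demo and the dictionary frames_by_cluster are hypothetical names used only for illustration; each per-cluster frame could then be written to its own sheet, for example with DataFrame.to_excel and a sheet_name, which is not shown here.

```python
import pandas as pd

# Hypothetical stand-in for the frame read from Excel.
df_demo = pd.DataFrame({
    "cluster": ["A", "A", "B", "C"],
    "budget": [1000, 12000, 15000, 22000],
})

# Distinct cluster labels as a plain Python list, in order of first appearance.
clusters = df_demo["cluster"].unique().tolist()

# One sub-frame per cluster, keyed by the cluster label.
frames_by_cluster = {}
for name in clusters:
    frames_by_cluster[name] = df_demo[df_demo["cluster"] == name]

for name, frame in frames_by_cluster.items():
    print(name, frame["budget"].tolist())
```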
This returns a numpy array:
arr = df["cluster"].to_numpy()
This returns a numpy array of unique values:
unique_arr = df["cluster"].unique()
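You can also use numpy to get the unique values, although the two methods differ: Series.unique() returns the values in order of first appearance, while np.unique() returns them sorted:
import numpy as np
arr = df["cluster"].to_numpy()
unique_arr = np.unique(arr)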
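To make the ordering difference between the two approaches concrete, here is a small sketch. The frame df_demo and its deliberately unsorted labels are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: cluster labels deliberately out of alphabetical order.
df_demo = pd.DataFrame({"cluster": ["B", "A", "B", "C", "A"]})

# Series.unique() keeps the order in which values first appear.
unique_pandas = df_demo["cluster"].unique().tolist()

# np.unique() returns the distinct values sorted.
unique_numpy = np.unique(df_demo["cluster"].to_numpy()).tolist()

print(unique_pandas)  # ['B', 'A', 'C']
print(unique_numpy)   # ['A', 'B', 'C']
```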
Example conversion:
NumPy array -> pandas DataFrame -> list from one pandas column
NumPy array
import numpy as np
import pandas as pd
data = np.array([[10,20,30], [20,30,60], [30,60,90]])
Convert the NumPy array into a pandas DataFrame
dataPd = pd.DataFrame(data = data)
print(dataPd)
    0   1   2
0  10  20  30
1  20  30  60
2  30  60  90
Convert one pandas column to a list (the column labels here are integers, so use 2, not '2')
pdToList = list(dataPd[2])
Assuming the name of the dataframe after reading the Excel sheet is df, take an empty list (e.g. dataList), iterate through the dataframe row by row, and append each row to the list like this:
dataList = [] #empty list
for index, row in df.iterrows():
    mylist = [row.cluster, row.load_date, row.budget, row.actual, row.fixed_price]
    dataList.append(mylist)
Or,
dataList = [] #empty list
for row in df.itertuples():
    mylist = [row.cluster, row.load_date, row.budget, row.actual, row.fixed_price]
    dataList.append(mylist)
Now, if you print dataList, you will get each row as a list inside dataList.
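If you do not want to spell out every column name, a shorter variant of the same row-by-row idea is sketched below. It assumes you want all columns in their existing order; df_demo is a made-up frame for illustration.

```python
import pandas as pd

# Hypothetical frame with a few of the question's columns.
df_demo = pd.DataFrame({
    "cluster": ["A", "B"],
    "budget": [1000, 15000],
    "fixed_price": ["Y", "N"],
})

# index=False leaves the row index out of each tuple,
# so every tuple holds only the column values.
data_list = [list(row) for row in df_demo.itertuples(index=False)]

print(data_list)
```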
As this question attained a lot of attention and there are several ways to fulfill your task, let me present several options.
Those are all one-liners by the way 😉
Starting with:
df
cluster load_date budget actual fixed_price
0 A 1/1/2014 1000 4000 Y
1 A 2/1/2014 12000 10000 Y
2 A 3/1/2014 36000 2000 Y
3 B 4/1/2014 15000 10000 N
4 B 4/1/2014 12000 11500 N
5 B 4/1/2014 90000 11000 N
6 C 7/1/2014 22000 18000 N
7 C 8/1/2014 30000 28960 N
8 C 9/1/2014 53000 51200 N
Overview of potential operations:
ser_aggCol (collapse each column to a list)
cluster [A, A, A, B, B, B, C, C, C]
load_date [1/1/2014, 2/1/2014, 3/1/2...
budget [1000, 12000, 36000, 15000...
actual [4000, 10000, 2000, 10000,...
fixed_price [Y, Y, Y, N, N, N, N, N, N]
dtype: object
ser_aggRows (collapse each row to a list)
0 [A, 1/1/2014, 1000, 4000, Y]
1 [A, 2/1/2014, 12000, 10000...
2 [A, 3/1/2014, 36000, 2000, Y]
3 [B, 4/1/2014, 15000, 10000...
4 [B, 4/1/2014, 12000, 11500...
5 [B, 4/1/2014, 90000, 11000...
6 [C, 7/1/2014, 22000, 18000...
7 [C, 8/1/2014, 30000, 28960...
8 [C, 9/1/2014, 53000, 51200...
dtype: object
df_gr (here you get lists for each cluster)
load_date budget actual fixed_price
cluster
A [1/1/2014, 2/1/2014, 3/1/2... [1000, 12000, 36000] [4000, 10000, 2000] [Y, Y, Y]
B [4/1/2014, 4/1/2014, 4/1/2... [15000, 12000, 90000] [10000, 11500, 11000] [N, N, N]
C [7/1/2014, 8/1/2014, 9/1/2... [22000, 30000, 53000] [18000, 28960, 51200] [N, N, N]
a list of separate dataframes for each cluster
df for cluster A
cluster load_date budget actual fixed_price
0 A 1/1/2014 1000 4000 Y
1 A 2/1/2014 12000 10000 Y
2 A 3/1/2014 36000 2000 Y
df for cluster B
cluster load_date budget actual fixed_price
3 B 4/1/2014 15000 10000 N
4 B 4/1/2014 12000 11500 N
5 B 4/1/2014 90000 11000 N
df for cluster C
cluster load_date budget actual fixed_price
6 C 7/1/2014 22000 18000 N
7 C 8/1/2014 30000 28960 N
8 C 9/1/2014 53000 51200 N
just the values of column load_date
0 1/1/2014
1 2/1/2014
2 3/1/2014
3 4/1/2014
4 4/1/2014
5 4/1/2014
6 7/1/2014
7 8/1/2014
8 9/1/2014
Name: load_date, dtype: object
just the values of column number 2
0 1000
1 12000
2 36000
3 15000
4 12000
5 90000
6 22000
7 30000
8 53000
Name: budget, dtype: object
just the values of row number 7
cluster C
load_date 8/1/2014
budget 30000
actual 28960
fixed_price N
Name: 7, dtype: object
============================== JUST FOR COMPLETENESS ==============================
you can convert a series to a list
['C', '8/1/2014', '30000', '28960', 'N']
<class 'list'>
you can convert a dataframe to a nested list
[['A', '1/1/2014', '1000', '4000', 'Y'], ['A', '2/1/2014', '12000', '10000', 'Y'], ['A', '3/1/2014', '36000', '2000', 'Y'], ['B', '4/1/2014', '15000', '10000', 'N'], ['B', '4/1/2014', '12000', '11500', 'N'], ['B', '4/1/2014', '90000', '11000', 'N'], ['C', '7/1/2014', '22000', '18000', 'N'], ['C', '8/1/2014', '30000', '28960', 'N'], ['C', '9/1/2014', '53000', '51200', 'N']]
<class 'list'>
the content of a dataframe can be accessed as a numpy.ndarray
[['A' '1/1/2014' '1000' '4000' 'Y']
['A' '2/1/2014' '12000' '10000' 'Y']
['A' '3/1/2014' '36000' '2000' 'Y']
['B' '4/1/2014' '15000' '10000' 'N']
['B' '4/1/2014' '12000' '11500' 'N']
['B' '4/1/2014' '90000' '11000' 'N']
['C' '7/1/2014' '22000' '18000' 'N']
['C' '8/1/2014' '30000' '28960' 'N']
['C' '9/1/2014' '53000' '51200' 'N']]
<class 'numpy.ndarray'>
code:
# prefix ser refers to pd.Series object
# prefix df refers to pd.DataFrame object
# prefix lst refers to list object
import pandas as pd
import numpy as np
df=pd.DataFrame([
['A', '1/1/2014', '1000', '4000', 'Y'],
['A', '2/1/2014', '12000', '10000', 'Y'],
['A', '3/1/2014', '36000', '2000', 'Y'],
['B', '4/1/2014', '15000', '10000', 'N'],
['B', '4/1/2014', '12000', '11500', 'N'],
['B', '4/1/2014', '90000', '11000', 'N'],
['C', '7/1/2014', '22000', '18000', 'N'],
['C', '8/1/2014', '30000', '28960', 'N'],
['C', '9/1/2014', '53000', '51200', 'N']
], columns=['cluster', 'load_date', 'budget', 'actual', 'fixed_price'])
print('df',df, sep='\n', end='\n\n')
ser_aggCol=df.aggregate(lambda x: [x.tolist()], axis=0).map(lambda x:x[0])
print('ser_aggCol (collapse each column to a list)',ser_aggCol, sep='\n', end='\n\n\n')
ser_aggRows=pd.Series(df.values.tolist())
print('ser_aggRows (collapse each row to a list)',ser_aggRows, sep='\n', end='\n\n\n')
df_gr=df.groupby('cluster').agg(lambda x: list(x))
print('df_gr (here you get lists for each cluster)',df_gr, sep='\n', end='\n\n\n')
lst_dfFiltGr=[ df.loc[df['cluster']==val,:] for val in df['cluster'].unique() ]
print('a list of separate dataframes for each cluster', sep='\n', end='\n\n')
for dfTmp in lst_dfFiltGr:
    print('df for cluster '+str(dfTmp.loc[dfTmp.index[0],'cluster']),dfTmp, sep='\n', end='\n\n')
ser_singleColLD=df.loc[:,'load_date']
print('just the values of column load_date',ser_singleColLD, sep='\n', end='\n\n\n')
ser_singleCol2=df.iloc[:,2]
print('just the values of column number 2',ser_singleCol2, sep='\n', end='\n\n\n')
ser_singleRow7=df.iloc[7,:]
print('just the values of row number 7',ser_singleRow7, sep='\n', end='\n\n\n')
print('='*30+' JUST FOR COMPLETENESS '+'='*30, end='\n\n\n')
lst_fromSer=ser_singleRow7.tolist()
print('you can convert a series to a list',lst_fromSer, type(lst_fromSer), sep='\n', end='\n\n\n')
lst_fromDf=df.values.tolist()
print('you can convert a dataframe to a nested list',lst_fromDf, type(lst_fromDf), sep='\n', end='\n\n')
arr_fromDf=df.values
print('the content of a dataframe can be accessed as a numpy.ndarray',arr_fromDf, type(arr_fromDf), sep='\n', end='\n\n')
As pointed out by cs95, other methods (such as .to_numpy()) should be preferred over the pandas .values attribute from pandas version 0.24 on. I use .values here because most people will (by 2019) still have an older version, which does not support the new recommendations. You can check your version with print(pd.__version__).
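For readers on pandas 0.24 or later, the same conversions can be sketched with .to_numpy() in place of .values. The small frame df_demo is a made-up subset of the data above, used only for illustration.

```python
import pandas as pd

# Hypothetical two-row subset of the example data (all values are strings).
df_demo = pd.DataFrame(
    [["A", "1/1/2014", "1000"], ["B", "4/1/2014", "15000"]],
    columns=["cluster", "load_date", "budget"],
)

# Whole frame as a numpy.ndarray, then as a nested Python list.
arr_from_df = df_demo.to_numpy()
lst_from_df = arr_from_df.tolist()

# A single row (selected by position) as a flat Python list.
lst_from_row = df_demo.iloc[1].tolist()

print(lst_from_df)
print(lst_from_row)
```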
# Flatten every column of df into one list, column by column
amount = list()
for col in df.columns:
    val = list(df[col])
    for v in val:
        amount.append(v)
If your column has only one value, something like pd.Series.tolist() may produce an error. To make sure it works in all cases, use the code below:
(
    df
    .filter(['column_name'])
    .values
    .reshape(1, -1)
    .ravel()
    .tolist()
)
If you do df.T.values.tolist(), it generates a list of lists of column values.
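A short sketch of what the transpose approach produces is shown below, next to DataFrame.to_dict(orient="list"), which keeps the column names as keys. The frame df_demo is a made-up example.

```python
import pandas as pd

# Hypothetical numeric frame with two columns.
df_demo = pd.DataFrame({"budget": [1000, 12000], "actual": [4000, 10000]})

# Transposing turns columns into rows, so each inner list is one column.
columns_as_lists = df_demo.T.values.tolist()

# Same values, but keyed by column name.
columns_by_name = df_demo.to_dict(orient="list")

print(columns_as_lists)  # [[1000, 12000], [4000, 10000]]
print(columns_by_name)   # {'budget': [1000, 12000], 'actual': [4000, 10000]}
```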
Here is a simple one-liner:
list(df['load_date'])
.toList() does not work anymore. It may have been the right API 10 years ago.
TL;DR: Use .tolist(). Don’t use list().
If we look at the source code of .tolist(), under the hood the list() function is called on the underlying data in the dataframe, so both should produce the same output.
However, it looks like tolist() is optimized for columns of Python scalars, because I found that calling list() on a column was 10 times slower than calling tolist(). For the record, I was trying to convert a column of JSON strings in a very large dataframe into a list, and list() was taking its sweet time. That inspired me to test the runtimes of the two methods.
FYI, there’s no need to call .to_numpy() or get the .values attribute, because dataframe columns/Series objects already implement the .tolist() method. Also, because of how numpy arrays are stored, list() and tolist() would give different types of scalars (at least) for numeric columns. For example,
type(list(df['budget'].values)[0]) # numpy.int64
type(df['budget'].values.tolist()[0]) # int
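The scalar-type difference can be checked on a small integer Series, as sketched here. The Series ser is a made-up stand-in for an integer budget column.

```python
import numpy as np
import pandas as pd

# Hypothetical integer column.
ser = pd.Series([1000, 12000, 36000], name="budget")

# list() over the NumPy array keeps NumPy scalar objects.
from_list_on_array = list(ser.to_numpy())

# .tolist() converts each element to a built-in Python int.
from_tolist = ser.tolist()

print(type(from_list_on_array[0]))
print(type(from_tolist[0]))
```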
The following perfplot shows the runtime differences between the two methods on various pandas dtype Series objects. Basically, it’s showing the runtime differences between the following two methods:
list(df['some_col']) # list()
df['some_col'].tolist() # .tolist()
As you can see, no matter the size of the column/Series, for numeric and object dtype columns/Series, the .tolist() method is much faster than list(). Not included here, but the graphs for float and bool dtype columns were very similar to that of the int dtype column shown here. Also, the graph for an object dtype column containing lists was very similar to the graph of the string column shown here. Extension dtypes such as 'Int64Dtype', 'StringDtype', 'Float64Dtype' etc. also showed similar patterns.
On the other hand, there is virtually no difference between the two methods for datetime, timedelta and Categorical dtype columns.
Code used to produce the above plot:
import pandas as pd
from perfplot import plot
kernels = [lambda s: list(s), lambda s: s.tolist()]
labels = ['list()', '.tolist()']
n_range = [2**k for k in range(4, 20)]
xlabel = 'Rows in DataFrame'
eq_chk = lambda x,y: all([x,y])
numeric = lambda n: pd.Series(range(5)).repeat(n)
string = lambda n: pd.Series(['some word', 'another word', 'a word']).repeat(n)
datetime = lambda n: pd.to_datetime(pd.Series(['2012-05-14', '2046-12-31'])).repeat(n)
timedelta = lambda n: pd.to_timedelta(pd.Series([1,2]), unit='D').repeat(n)
categorical = lambda n: pd.Series(pd.Categorical([1, 2, 3, 1, 2, 3])).repeat(n)
for n, f in [('Numeric', numeric), ('Object dtype', string),
             ('Datetime', datetime), ('Timedelta', timedelta),
             ('Categorical', categorical)]:
    plot(setup=f, kernels=kernels, labels=labels, n_range=n_range,
         xlabel=xlabel, title=f'{n} column', equality_check=eq_chk);
If you want to use the column index instead of column names (e.g. in a loop), you can use:
for i in range(len(df.columns)):
    print(df[df.columns[i]].to_list())
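A possible variant of the same idea uses .iloc, which selects columns purely by position and so does not go through the column labels at all. The frame df_demo is made up for illustration.

```python
import pandas as pd

# Hypothetical two-column frame.
df_demo = pd.DataFrame({"cluster": ["A", "B"], "budget": [1000, 15000]})

# df_demo.shape[1] is the number of columns; iloc[:, i] picks column i by position.
columns_by_position = [
    df_demo.iloc[:, i].tolist() for i in range(df_demo.shape[1])
]

print(columns_by_position)  # [['A', 'B'], [1000, 15000]]
```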