How to determine the length of lists in a pandas dataframe column
Question:
How can the length of the lists in the column be determined without iteration?
I have a dataframe like this:
CreationDate
2013-12-22 15:25:02 [ubuntu, mac-osx, syslinux]
2009-12-14 14:29:32 [ubuntu, mod-rewrite, laconica, apache-2.2]
2013-12-22 15:42:00 [ubuntu, nat, squid, mikrotik]
I am calculating the length of the lists in the CreationDate column and making a new Length column like this:
df['Length'] = df.CreationDate.apply(lambda x: len(x))
Which gives me this:
CreationDate Length
2013-12-22 15:25:02 [ubuntu, mac-osx, syslinux] 3
2009-12-14 14:29:32 [ubuntu, mod-rewrite, laconica, apache-2.2] 4
2013-12-22 15:42:00 [ubuntu, nat, squid, mikrotik] 4
Is there a more pythonic way to do this?
Answers:
You can use the str accessor for some list operations as well. In this example, df['CreationDate'].str.len() returns the length of each list. See the docs for str.len.
df['Length'] = df['CreationDate'].str.len()
df
Out:
CreationDate Length
2013-12-22 15:25:02 [ubuntu, mac-osx, syslinux] 3
2009-12-14 14:29:32 [ubuntu, mod-rewrite, laconica, apache-2.2] 4
2013-12-22 15:42:00 [ubuntu, nat, squid, mikrotik] 4
For these operations, vanilla Python is generally faster, but pandas handles NaNs. Here are timings:
import random
import string
import pandas as pd
ser = pd.Series([random.sample(string.ascii_letters,
                               random.randint(1, 20)) for _ in range(10**6)])
%timeit ser.apply(lambda x: len(x))
1 loop, best of 3: 425 ms per loop
%timeit ser.str.len()
1 loop, best of 3: 248 ms per loop
%timeit [len(x) for x in ser]
10 loops, best of 3: 84 ms per loop
%timeit pd.Series([len(x) for x in ser], index=ser.index)
1 loop, best of 3: 236 ms per loop
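To illustrate the NaN handling mentioned above, here is a minimal sketch on a small made-up Series (the sample values are assumptions for illustration, not the asker's data). str.len() should leave NaN where the entry is missing, while calling len on that entry directly raises a TypeError.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: two lists and one missing value.
ser = pd.Series([["ubuntu", "mac-osx"], np.nan, ["nat"]])

# The str accessor skips the missing entry and leaves NaN in its place.
lengths = ser.str.len()

# Plain len() fails on the float NaN, so apply(len) raises TypeError.
try:
    ser.apply(len)
    apply_failed = False
except TypeError:
    apply_failed = True
```

Because of the NaN, the result of str.len() is a float column rather than an integer one.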
- pandas.Series.map(len) and pandas.Series.apply(len) are equivalent in execution time, and slightly faster than pandas.Series.str.len().
- Difference between map, applymap and apply methods in Pandas
import pandas as pd
data = {'os': [['ubuntu', 'mac-osx', 'syslinux'], ['ubuntu', 'mod-rewrite', 'laconica', 'apache-2.2'], ['ubuntu', 'nat', 'squid', 'mikrotik']]}
index = ['2013-12-22 15:25:02', '2009-12-14 14:29:32', '2013-12-22 15:42:00']
df = pd.DataFrame(data, index)
# create Length column
df['Length'] = df.os.map(len)
# display(df)
os Length
2013-12-22 15:25:02 [ubuntu, mac-osx, syslinux] 3
2009-12-14 14:29:32 [ubuntu, mod-rewrite, laconica, apache-2.2] 4
2013-12-22 15:42:00 [ubuntu, nat, squid, mikrotik] 4
%timeit
import pandas as pd
import random
import string
random.seed(365)
ser = pd.Series([random.sample(string.ascii_letters, random.randint(1, 20)) for _ in range(10**6)])
%timeit ser.str.len()
252 ms ± 12.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit ser.map(len)
220 ms ± 7.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit ser.apply(len)
222 ms ± 8.31 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
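As a quick sanity check that the three approaches timed above agree when no values are missing, here is a small sketch. The three-row Series is a made-up sample, not part of the benchmark.

```python
import pandas as pd

# Hypothetical sample with no missing values.
ser = pd.Series([["a", "b"], ["c"], ["d", "e", "f"]])

by_str = ser.str.len()    # str accessor
by_map = ser.map(len)     # element-wise map
by_apply = ser.apply(len) # element-wise apply

# All three should produce the same lengths in the same order.
same = by_str.tolist() == by_map.tolist() == by_apply.tolist()
```

The methods only differ in speed and in how they treat missing values, so the choice mostly comes down to whether the column can contain NaNs.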
Convert to list and map a function
Pandas dataframe columns are not meant to store collections such as lists or tuples, because virtually none of the optimized methods work on these columns. When a dataframe contains such items, it’s usually more efficient to convert the column into a Python list and manipulate the list.
Also, if a function (especially a built-in one like len()) needs to be called on each item in a list, it’s usually faster to map this function than to call it in a loop.
mylist = df['CreationDate'].tolist()
df['Length'] = list(map(len, mylist))
Handle NaNs
A nice thing about str.len() is that it handles NaNs, but a custom function with try-except can fill that gap.
def nanlen(x):
    try:
        return len(x)
    except TypeError:
        return float('nan')
df['Length'] = list(map(nanlen, mylist))
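For a self-contained version of the NaN-safe approach, here is a minimal sketch that builds a small frame with one missing entry. The frame contents are assumptions made up for illustration; nanlen is the helper from this answer.

```python
import math
import pandas as pd

def nanlen(x):
    """Return len(x), or NaN when x has no length (e.g. None or a float NaN)."""
    try:
        return len(x)
    except TypeError:
        return float("nan")

# Hypothetical frame: the second row has no list.
df = pd.DataFrame({
    "CreationDate": [["ubuntu", "mac-osx", "syslinux"], None, ["ubuntu", "nat"]]
})

# Convert the column to a plain list, then map the NaN-safe helper over it.
mylist = df["CreationDate"].tolist()
df["Length"] = list(map(nanlen, mylist))
```

Note that nanlen only guards against objects without a length. A stray string in the column would still be measured by its character count.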
Runtime benchmarks
Essentially, mapping len over lists is approx. 2.5 times faster than looping over a Series, which in turn is 2.5 times faster than pd.Series.str.len for large frames.
Code used to produce the plot above:
import pandas as pd
import random, string, perfplot
random.seed(365)
perfplot.plot(
    setup=lambda n: pd.Series([random.sample(string.ascii_letters, random.randint(1, 20)) for _ in range(n)]),
    kernels=[lambda ser: ser.str.len(), lambda ser: ser.map(len), lambda ser: list(map(len, ser.tolist())), lambda ser: [len(x) for x in ser]],
    labels=["ser.str.len()", "ser.map(len)", "list(map(len, ser.tolist()))", "[len(x) for x in ser]"],
    n_range=[2**k for k in range(21)],
    xlabel='Length of dataframe',
    equality_check=lambda x, y: x.eq(y).all()
)