Find length of longest string in Pandas dataframe column
Question:
Is there a faster way to find the length of the longest string in a Pandas DataFrame than what’s shown in the example below?
import numpy as np
import pandas as pd
x = ['ab', 'bcd', 'dfe', 'efghik']
x = np.repeat(x, int(1e7))
df = pd.DataFrame(x, columns=['col1'])
print(df.col1.map(lambda x: len(x)).max())
# result --> 6
It takes about 10 seconds to run df.col1.map(lambda x: len(x)).max() when timing it with IPython's %timeit.
Answers:
DSM’s suggestion seems to be about the best you’re going to get without doing some manual microoptimization:
%timeit -n 100 df.col1.str.len().max()
100 loops, best of 3: 11.7 ms per loop
%timeit -n 100 df.col1.map(lambda x: len(x)).max()
100 loops, best of 3: 16.4 ms per loop
%timeit -n 100 df.col1.map(len).max()
100 loops, best of 3: 10.1 ms per loop
Note that explicitly using the str.len()
method doesn’t seem to be much of an improvement. If you’re not familiar with IPython, which is where that very convenient %timeit
syntax comes from, I’d definitely suggest giving it a shot for quick testing of things like this.
Just as a minor addition, you might want to loop through all object columns in a data frame:
for c in df:
    if df[c].dtype == 'object':
        print('Max length of column %s: %s\n' % (c, df[c].map(len).max()))
This will prevent errors being thrown by bool, int types etc.
This could be expanded to other non-numeric dtypes such as 'string_' and 'unicode_', i.e.:
if df[c].dtype in ('object', 'string_', 'unicode_'):
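As an alternative to checking the dtype manually, pandas' select_dtypes can do the filtering; a minimal sketch (the column names here are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'name': ['ab', 'bcd'], 'n': [1, 2], 'flag': [True, False]})

# select_dtypes keeps only the object columns, so bool/int
# columns never reach len() and no error is thrown
for c in df.select_dtypes(include=['object']).columns:
    print('Max length of column %s: %s' % (c, df[c].map(len).max()))
```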
Sometimes you want the length of the longest string in bytes. This is relevant for strings that use non-ASCII Unicode characters, whose byte length is greater than their character length. This can matter in specific situations, e.g. for database writes.
col_bytes_len = int(df[col_name].astype(bytes).str.len().max())
Remarks:
Remarks:
- Using astype(bytes) is more reliable than using str.encode(encoding='utf-8'), because astype(bytes) also works correctly with a column of mixed dtypes.
- The output is wrapped in int() because it is otherwise a NumPy object.
- If you hit an encoding error, then instead of df[col_name].astype(bytes), consider:
df[col_name].str.encode('utf-8')
df[col_name].str.encode('ascii', errors='backslashreplace') (last resort)
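To see the difference between character length and byte length, a minimal example (the accented string is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'col1': ['abc', 'héllo']})

char_len = int(df['col1'].str.len().max())                      # 5 characters
byte_len = int(df['col1'].str.encode('utf-8').str.len().max())  # 6 bytes: 'é' is 2 bytes in UTF-8
```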
Excellent answers, in particular those by Marius and Ricky, which were very helpful.
Given that most of us are optimising for coding time, here is a quick extension to those answers to return all the columns’ max item length as a series, sorted by the maximum item length per column:
mx_dct = {c: df[c].map(lambda x: len(str(x))).max() for c in df.columns}
pd.Series(mx_dct).sort_values(ascending=False)
Or as a one-liner:
pd.Series({c: df[c].map(lambda x: len(str(x))).max() for c in df}).sort_values(ascending=False)
Adapting the original sample, this can be demoed as:
import pandas as pd
x = [['ab', 'bcd'], ['dfe', 'efghik']]
df = pd.DataFrame(x, columns=['col1','col2'])
print(pd.Series({c: df[c].map(lambda x: len(str(x))).max() for c in df}).sort_values(ascending=False))
Output:
col2 6
col1 3
dtype: int64
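One caveat with len(str(x)): missing values get stringified too, so NaN counts as 3 characters ('nan'). If that is not what you want, dropping missing values first is one option; a sketch, assuming that behavior is desired:

```python
import pandas as pd

df = pd.DataFrame({'col1': ['ab', None], 'col2': ['x', 'efghik']})

# dropna() first, so a missing value doesn't contribute len('nan') == 3
lengths = pd.Series({c: df[c].dropna().map(lambda v: len(str(v))).max() for c in df})
print(lengths.sort_values(ascending=False))
```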
import pandas as pd
import numpy as np
x = ['ab', 'bcd', 'dfe', 'efghik']
x = np.repeat(x, 10)
df = pd.DataFrame(x, columns=['col1'])
# get longest string index from column
indx = df["col1"].str.len().idxmax()
# get longest string value
df["col1"][indx] # <---------------------
This might be faster (depending on the size of your DataFrame):
maxsize=[df[x].astype('string').array.astype('U').dtype.itemsize // 4 for x in df.columns]
or
maxsize=[df[x].array.astype('U').dtype.itemsize // 4 for x in df.columns]
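The `// 4` works because NumPy's fixed-width Unicode dtype ('U') stores each character as 4 bytes (UCS-4), so itemsize divided by 4 gives the width in characters:

```python
import numpy as np

arr = np.array(['ab', 'bcd', 'dfe', 'efghik'])
print(arr.dtype)                 # <U6: fixed-width Unicode, 6 characters
print(arr.dtype.itemsize)        # 24 bytes = 6 characters * 4 bytes each
print(arr.dtype.itemsize // 4)   # 6, the length of the longest string
```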
For small data frames, it is not worthwhile:
x = ['ab', 'bcd', 'dfe', 'efghik']
x = np.repeat(x, 10)
df = pd.DataFrame(x, columns=['col1'])
%timeit -n 100 df.col1.str.len().max()
%timeit -n 100 df.col1.map(lambda x: len(x)).max()
%timeit -n 100 df.col1.map(len).max()
%timeit -n 100 [df[x].astype('string').array.astype('U').dtype.itemsize // 4 for x in df.columns]
171 µs ± 5.92 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
126 µs ± 4.17 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
124 µs ± 3.71 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
143 µs ± 4.98 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
But the bigger the data frame, the bigger the advantage:
x = ['ab', 'bcd', 'dfe', 'efghik']
x = np.repeat(x, 1000)
df = pd.DataFrame(x, columns=['col1'])
%timeit -n 100 df.col1.str.len().max()
%timeit -n 100 df.col1.map(lambda x: len(x)).max()
%timeit -n 100 df.col1.map(len).max()
%timeit -n 100 [df[x].astype('string').array.astype('U').dtype.itemsize // 4 for x in df.columns]
1.08 ms ± 57.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
1.2 ms ± 9.25 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
878 µs ± 11.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
705 µs ± 3.33 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
x = ['ab', 'bcd', 'dfe', 'efghik']
x = np.repeat(x, 10000)
df = pd.DataFrame(x, columns=['col1'])
%timeit -n 100 df.col1.str.len().max()
%timeit -n 100 df.col1.map(lambda x: len(x)).max()
%timeit -n 100 df.col1.map(len).max()
%timeit -n 100 [df[x].astype('string').array.astype('U').dtype.itemsize // 4 for x in df.columns]
8.87 ms ± 22.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
11 ms ± 25.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
7.88 ms ± 36.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
5.81 ms ± 25 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Since I was testing different methods on my data frame, I had to convert the dtype first (df[x].astype('string')).
If it is already a Series of dtype string, it is about 10% faster:
%timeit -n 100 [df[x].array.astype('U').dtype.itemsize // 4 for x in df.columns]
5.26 ms ± 95.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
This here is even faster:
%timeit -n 100 [df[x].astype('string').array.astype('S').dtype.itemsize for x in df.columns]
3.89 ms ± 207 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit -n 100 [df[x].array.astype('S').dtype.itemsize for x in df.columns]
3.26 ms ± 31.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
But you might get encoding errors:
UnicodeEncodeError: 'ascii' codec can't encode character '\xf3' in position 15: ordinal not in range(128)
If you let NumPy decide what data type to use, you automatically learn the length of the longest item:
df.col1.astype('string').array.astype('S')
Out[173]:
array([b'ab', b'ab', b'ab', ..., b'efghik', b'efghik', b'efghik'],
dtype='|S6')
You can find the longest string itself (not just the index) using this approach:
import pandas as pd
df = pd.DataFrame(['a', 'aaa', 'aaaaa'], columns=['A'])
# 1. Get index of longest string in column
idx = df.A.str.len().idxmax()
# Index: 2
# 2. Get longest string using df['A'][idx]
print('Longest string in column:', df['A'][idx])
# Longest string in column: aaaaa
Source: https://blog.finxter.com/python-find-longest-string-in-a-dataframe-column/