How to calculate number of words in a string in DataFrame?

Question

Suppose we have a simple DataFrame:

import pandas as pd
df = pd.DataFrame(['one apple','banana','box of oranges','pile of fruits outside', 'one banana', 'fruits'])
df.columns = ['fruits']

How can I count how many strings contain each number of words, to get something similar to:

1 word: 2
2 words: 2
3 words: 1
4 words: 1

Asked By: Sergei

Answer 1

IIUC then you can do the following:

In [89]:
count = df['fruits'].str.split().apply(len).value_counts()
count.index = count.index.astype(str) + ' words:'
count.sort_index(inplace=True)
count

Out[89]:
1 words:    2
2 words:    2
3 words:    1
4 words:    1
Name: fruits, dtype: int64

Here we use the vectorised str.split, which splits on whitespace by default, and then apply len to get the number of elements in each resulting list. Calling value_counts then aggregates these lengths into a frequency count.

We then rename the index and sort it to get the desired output.
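To make the chain above easier to follow, here is a minimal sketch that runs each step separately on the question's data. The variable names and the `label` helper are illustrative and not part of the original answer; the helper only adds the singular "1 word" label that the question asked for.

```python
import pandas as pd

df = pd.DataFrame({'fruits': ['one apple', 'banana', 'box of oranges',
                              'pile of fruits outside', 'one banana', 'fruits']})

# Step 1: split each string on whitespace, giving a Series of lists.
tokens = df['fruits'].str.split()

# Step 2: take the length of each list, i.e. the number of words per row.
words_per_row = tokens.apply(len)

# Step 3: count how often each word count occurs, ordered by word count.
freq = words_per_row.value_counts().sort_index()

def label(n):
    # Hypothetical helper: use the singular form for a count of one.
    return f"{n} word" if n == 1 else f"{n} words"

summary = {label(n): int(c) for n, c in freq.items()}
print(summary)
# {'1 word': 2, '2 words': 2, '3 words': 1, '4 words': 1}
```

Keeping the intermediate Series around also makes it easy to inspect which rows ended up with which word count.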

UPDATE

This can also be done using str.len rather than apply, which should scale better:

In [41]:
count = df['fruits'].str.split().str.len().value_counts()
count.index = count.index.astype(str) + ' words:'
count.sort_index(inplace=True)
count

Out[41]:
1 words:    2
2 words:    2
3 words:    1
4 words:    1
Name: fruits, dtype: int64

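Timings (the str.len statement is timed without the final value_counts call, so the two statements do not do identical work)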

In [42]:
%timeit df['fruits'].str.split().apply(len).value_counts()
%timeit df['fruits'].str.split().str.len()

1000 loops, best of 3: 799 µs per loop
1000 loops, best of 3: 347 µs per loop

For a DataFrame with 6,000 rows:

In [51]:
%timeit df['fruits'].str.split().apply(len).value_counts()
%timeit df['fruits'].str.split().str.len()

100 loops, best of 3: 6.3 ms per loop
100 loops, best of 3: 6 ms per loop

Answered By: EdChum

Answer 2

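You could use str.count to count the spaces in each string and add 1. This assumes words are separated by exactly one space, with no leading or trailing whitespace.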

In [1716]: count = df['fruits'].str.count(' ').add(1).value_counts(sort=False)

In [1717]: count.index = count.index.astype('str') + ' words:'

In [1718]: count
Out[1718]:
1 words:    2
2 words:    2
3 words:    1
4 words:    1
Name: fruits, dtype: int64
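The two approaches agree on the question's data, but they can diverge on messier input. The sketch below uses a made-up Series (not from the question) with a double space and with leading and trailing spaces to show the difference. The `\S+` pattern is one possible alternative that counts runs of non-whitespace characters; it is an assumption on my part, not something the answers above used.

```python
import pandas as pd

# Hypothetical input with irregular whitespace.
s = pd.Series(['one apple', 'one  apple', ' banana ', 'box of oranges'])

# Counting spaces and adding 1 overcounts when spaces repeat or pad the string.
by_spaces = s.str.count(' ').add(1)

# Splitting on whitespace ignores repeated, leading, and trailing whitespace.
by_split = s.str.split().str.len()

# str.count takes a regular expression, so runs of non-whitespace
# characters can be counted directly.
by_regex = s.str.count(r'\S+')

print(pd.DataFrame({'text': s, 'by_spaces': by_spaces,
                    'by_split': by_split, 'by_regex': by_regex}))
```

If the column is known to be clean, counting spaces is fine; otherwise the split or `\S+` variants are the safer choice.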

Timings

str.count is marginally faster than str.split followed by apply(len) at every size tested:

Small

In [1724]: df.shape
Out[1724]: (6, 1)

In [1725]: %timeit df['fruits'].str.count(' ').add(1).value_counts(sort=False)
1000 loops, best of 3: 649 µs per loop

In [1726]: %timeit df['fruits'].str.split().apply(len).value_counts()
1000 loops, best of 3: 840 µs per loop

Medium

In [1728]: df.shape
Out[1728]: (6000, 1)

In [1729]: %timeit df['fruits'].str.count(' ').add(1).value_counts(sort=False)
100 loops, best of 3: 6.58 ms per loop

In [1730]: %timeit df['fruits'].str.split().apply(len).value_counts()
100 loops, best of 3: 6.99 ms per loop

Large

In [1732]: df.shape
Out[1732]: (60000, 1)

In [1733]: %timeit df['fruits'].str.count(' ').add(1).value_counts(sort=False)
1 loop, best of 3: 57.6 ms per loop

In [1734]: %timeit df['fruits'].str.split().apply(len).value_counts()
1 loop, best of 3: 73.8 ms per loop
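For anyone who wants to rerun a comparison like this outside IPython, here is a rough sketch using the standard timeit module. It assumes the larger frames were built by repeating the six original rows, which the answers do not actually state, and the labels in `candidates` are my own. All three statements end in value_counts so that they do comparable work. Absolute numbers will depend on your machine and pandas version.

```python
import timeit
import pandas as pd

base = pd.DataFrame({'fruits': ['one apple', 'banana', 'box of oranges',
                                'pile of fruits outside', 'one banana', 'fruits']})

# Assumption: build a 6000-row frame by repeating the six rows 1000 times.
big = pd.concat([base] * 1000, ignore_index=True)

candidates = {
    'split + apply(len)': lambda: big['fruits'].str.split().apply(len).value_counts(),
    'split + str.len': lambda: big['fruits'].str.split().str.len().value_counts(),
    'count spaces + 1': lambda: big['fruits'].str.count(' ').add(1).value_counts(sort=False),
}

timings = {}
for name, fn in candidates.items():
    # Best of 3 repeats, 10 calls per repeat, reported per call.
    best = min(timeit.repeat(fn, number=10, repeat=3)) / 10
    timings[name] = best
    print(f"{name}: {best * 1e3:.2f} ms per call")
```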

Answered By: Zero
