Count number of words per row
Question:
I’m trying to create a new column in a DataFrame that contains the word count for the respective row. I’m looking for the total number of words, not frequencies of each distinct word. I assumed there would be a simple/quick way to do this common task, but after googling around and reading a handful of SO posts (1, 2, 3, 4) I’m stuck. I’ve tried the solutions put forward in the linked SO posts, but got lots of attribute errors back.
words = df['col'].split()
df['totalwords'] = len(words)
results in
AttributeError: 'Series' object has no attribute 'split'
and
f = lambda x: len(x["col"].split()) -1
df['totalwords'] = df.apply(f, axis=1)
results in
AttributeError: ("'list' object has no attribute 'split'", 'occurred at index 0')
Answers:
This is one way, using pd.Series.str.split and pd.Series.map:
df['word_count'] = df['col'].str.split().map(len)
The above assumes that df['col'] is a Series of strings.
Example:
df = pd.DataFrame({'col': ['This is an example', 'This is another', 'A third']})
df['word_count'] = df['col'].str.split().map(len)
print(df)
#                   col  word_count
# 0  This is an example           4
# 1     This is another           3
# 2             A third           2
str.split + str.len
str.len works nicely for any non-numeric column.
df['totalwords'] = df['col'].str.split().str.len()
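One nice property of the .str accessor, shown in a small illustrative sketch below (the sample frame is made up, not from the question): missing values propagate as NaN rather than raising an error, whereas a plain .apply(lambda x: len(x.split())) would fail on a None row.

```python
import pandas as pd

# Hypothetical frame with a missing value in 'col'
df = pd.DataFrame({'col': ['two words', None, 'three little words']})

# .str methods return NaN for the missing row instead of raising
df['totalwords'] = df['col'].str.split().str.len()
print(df['totalwords'].tolist())  # [2.0, nan, 3.0]
```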
str.count
If your words are separated by single spaces, you can simply count the spaces and add 1.
df['totalwords'] = df['col'].str.count(' ') + 1
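One caveat worth noting, sketched below on made-up strings: str.count(' ') + 1 over-counts when words are separated by runs of spaces, while split() with no argument collapses consecutive whitespace.

```python
import pandas as pd

df = pd.DataFrame({'col': ['one  double  spaced', 'normal spacing here']})

# Naive count: every space is assumed to start a new word
df['by_spaces'] = df['col'].str.count(' ') + 1
# split() treats runs of whitespace as a single separator
df['by_split'] = df['col'].str.split().str.len()

print(df[['by_spaces', 'by_split']])
```

Here the first row yields 5 by space-counting but the correct 3 by splitting.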
List Comprehension
This is faster than you think!
df['totalwords'] = [len(x.split()) for x in df['col'].tolist()]
Here is a way using .apply():
df['number_of_words'] = df.col.apply(lambda x: len(x.split()))
Example
Given this df:
>>> df
col
0 This is one sentence
1 and another
After applying .apply():
df['number_of_words'] = df.col.apply(lambda x: len(x.split()))
>>> df
col number_of_words
0 This is one sentence 4
1 and another 2
Note: As pointed out in the comments and in this answer, .apply is not necessarily the fastest method. If speed is important, it is better to go with one of @cᴏʟᴅsᴘᴇᴇᴅ’s methods.
With list and map (data from @cᴏʟᴅsᴘᴇᴇᴅ’s answer):
list(map(lambda x: len(x.split()), df.col))
Out[343]: [4, 3, 2]
You could also map the split and len methods over the strings in the DataFrame column:
df['word_count'] = [*map(len, map(str.split, df['col'].tolist()))]
Here’s a preliminary benchmark of the answers given here. map seems to do well on very large Series:
df = pd.DataFrame(['one apple', 'banana', 'box of oranges', 'pile of fruits outside',
                   'one banana', 'fruits'] * 100000,
                  columns=['col'])
>>> df.shape
(600000, 1)
>>> %timeit df['word_count'] = df['col'].str.split().str.len()
761 ms ± 43.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit df['word_count'] = df['col'].str.count(' ').add(1)
691 ms ± 71.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit df['word_count'] = [len(x.split()) for x in df['col'].tolist()]
405 ms ± 13.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit df['word_count'] = df['col'].apply(lambda x: len(x.split()))
450 ms ± 22.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit df['word_count'] = df['col'].str.split().map(len)
657 ms ± 27.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit df['word_count'] = list(map(lambda x : len(x.split()), df['col'].tolist()))
435 ms ± 21.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit df['word_count'] = [*map(len, map(str.split, df['col'].tolist()))]
329 ms ± 20.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
You may use a simple regular expression with Pandas’ built-in str.count() method:
df['total_words'] = df['col'].str.count(r'\w+')
- \w matches any word character: any letter, digit, or underscore. It is equivalent to the character class [A-Za-z0-9_].
- + matches one or more repetitions of the preceding pattern.
Or use the following regex if you would like to count words consisting of alphabetic characters only:
df['total_words'] = df['col'].str.count('[A-Za-z]+')
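To see how the two patterns differ, here is a small illustrative check on a made-up string: \w+ also counts digit runs and splits on the apostrophe, while [A-Za-z]+ drops the digit token.

```python
import pandas as pd

df = pd.DataFrame({'col': ["it's 2 words, maybe more"]})

# \w+ matches: it, s, 2, words, maybe, more
df['w_count'] = df['col'].str.count(r'\w+')
# [A-Za-z]+ matches: it, s, words, maybe, more (the digit is skipped)
df['alpha_count'] = df['col'].str.count('[A-Za-z]+')

print(df[['w_count', 'alpha_count']])
```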