Python too slow to find text in string in for loop
Question:
I want to speed up a loop that counts word occurrences in text. It currently takes around 5 minutes for 5 records.
DataFrame
No Text
1 I love you forever...*500 other words
2 No , i know that you know xxx *100 words
My word list
wordlist =['i','love','David','Mary',......]
My code to count words
for i in wordlist :
df[i] = df['Text'].str.count(i)
Result :
No Text I love other_words
1 I love you ... 1 1 4
2 No, i know ... 1 0 5
Answers:
Try the Aho–Corasick algorithm:
https://en.wikipedia.org/wiki/Aho–Corasick_algorithm
You can also search for ready-made implementations.
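To illustrate the suggestion above, here is a minimal pure-Python sketch of an Aho–Corasick automaton. The function names build_automaton and count_occurrences are hypothetical, not from any library. Note that it counts substring occurrences (including overlapping ones) in a single pass over the text, which is not the same as counting whole words.

```python
from collections import deque

def build_automaton(patterns):
    """Build trie transitions, failure links and output lists (hypothetical helper)."""
    goto, fail, out = [{}], [0], [[]]
    # dict.fromkeys removes duplicate patterns while keeping their order
    for p in dict.fromkeys(patterns):
        if not p:
            continue  # skip empty patterns
        node = 0
        for ch in p:
            nxt = goto[node].get(ch)
            if nxt is None:
                nxt = len(goto)
                goto[node][ch] = nxt
                goto.append({})
                fail.append(0)
                out.append([])
            node = nxt
        out[node].append(p)
    # Breadth-first pass: depth-1 nodes keep failure link 0 (the root)
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            # Inherit the patterns that end at the failure target
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out

def count_occurrences(text, automaton):
    """Count how often each pattern occurs in text, scanning it once."""
    goto, fail, out = automaton
    counts = {}
    node = 0
    for ch in text:
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for p in out[node]:
            counts[p] = counts.get(p, 0) + 1
    return counts

automaton = build_automaton(["he", "she", "his", "hers"])
result = count_occurrences("ushers", automaton)
```

The automaton is built once per word list and can then be reused for every row, so the text is not rescanned for each word.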
You can do this by making a Counter from the words in each Text value, then converting that into columns (using pd.Series), summing the columns that don’t exist in wordlist into other_words, and then dropping those columns:
import re
import pandas as pd
from collections import Counter
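# Lower-case the word list so it matches the lower-cased text
wordlist = list(map(str.lower, wordlist))
# Count every word in each row; \b anchors the match at word boundaries
counters = df['Text'].apply(lambda t:Counter(re.findall(r'\b[a-z]+\b', t.lower())))
# Expand each Counter into one column per word (missing words become 0)
df = pd.concat([df, counters.apply(pd.Series).fillna(0).astype(int)], axis=1)
# Sum every word column that is not in wordlist into other_words, then drop those columns
other_words = list(set(df.columns) - set(wordlist) - { 'No', 'Text' })
df['other_words'] = df[other_words].sum(axis=1)
df = df.drop(other_words, axis=1)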
Output (for the sample data in your question):
No Text i love other_words
0 1 I love you forever... other words 1 1 4
1 2 No , i know that you know xxx words 1 0 7
Note:
- I’ve converted all the words to lower-case so you’re not counting I and i separately.
- I’ve used re.findall rather than the more obvious split() so that forever... gets counted as the word forever rather than forever...
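A small sketch of the difference described in the second note, using a made-up sample string that combines fragments of the question's data:

```python
import re

# Illustrative sample text, not the real DataFrame contents
text = "I love you forever... No , i know"

# split() only breaks on whitespace, so punctuation stays attached
by_split = text.lower().split()

# The regex keeps only runs of letters between word boundaries
by_regex = re.findall(r'\b[a-z]+\b', text.lower())

print(by_split)
print(by_regex)
```

With split(), the token forever... and the stray comma survive as "words"; the regex drops both.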
If you only want to count the words in wordlist (and don’t want an other_words count), you can simplify this to:
wordlist = list(map(str.lower, wordlist))
counters = df['Text'].apply(lambda t:Counter(w for w in re.findall(r'\b[a-z]+\b', t.lower()) if w in wordlist))
df = pd.concat([df, counters.apply(pd.Series).fillna(0).astype(int)], axis=1)
Output:
No Text i love
0 1 I love you forever... other words 1 1
1 2 No , i know that you know xxx words 1 0
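With a long word list, the `w in wordlist` test scans the list for every word in the text. A possible refinement, sketched here without pandas, is to keep the wanted words in a set and precompile the pattern. The names count_listed_words, WORD_RE and wanted are hypothetical.

```python
import re
from collections import Counter

# Precompiled once instead of on every call
WORD_RE = re.compile(r'\b[a-z]+\b')

def count_listed_words(text, wanted):
    """Count only the words of text that appear in the set `wanted`."""
    return Counter(w for w in WORD_RE.findall(text.lower()) if w in wanted)

# Set membership is a hash lookup rather than a scan of the whole list
wanted = frozenset(w.lower() for w in ['i', 'love', 'David', 'Mary'])

row1 = count_listed_words("I love you forever...", wanted)
row2 = count_listed_words("No , i know that you know xxx", wanted)
```

A function like this could be passed to df['Text'].apply in place of the lambda, with wanted bound beforehand.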
Another way of generating the other_words value is to build 2 sets of counters, one of all the words and one of only the words in wordlist. These can then be subtracted from each other to find the count of words in the text which are not in the wordlist:
wordlist = list(map(str.lower, wordlist))
counters = df['Text'].apply(lambda t:Counter(w for w in re.findall(r'\b[a-z]+\b', t.lower()) if w in wordlist))
df = pd.concat([df, counters.apply(pd.Series).fillna(0).astype(int)], axis=1)
c2 = df['Text'].apply(lambda t:Counter(re.findall(r'\b[a-z]+\b', t.lower())))
df['other_words'] = (c2 - counters).apply(lambda d:sum(d.values()))
Output of this is the same as for the first code sample. Note that in Python 3.10 and later, you can use the new Counter.total() method instead of summing the values:
(c2 - counters).apply(Counter.total)
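To see what the subtraction does for a single row, here is a sketch on plain Counter objects, with tokens taken from the second sample text in the question:

```python
from collections import Counter

# All words of one row, and only the words that are in the word list
all_words = Counter(['no', 'i', 'know', 'that', 'you', 'know', 'xxx'])
listed = Counter(['i'])

# Counter subtraction keeps only the positive counts
other = all_words - listed

# sum(...) works on every Python 3 version; Counter.total() needs 3.10+
other_count = sum(other.values())
print(other, other_count)
```

Because the subtraction discards entries whose count drops to zero, i disappears from the result entirely and only the unlisted words are summed.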
As an alternative, you could try this:
counts = (df['Text'].str.lower().str.findall(r'\b[a-z]+\b')
.apply(lambda x: pd.Series(x).value_counts())
.filter(list(map(str.lower, wordlist))).fillna(0))
df[counts.columns] = counts
print(df)
'''
№ Text i love
0 1 I love you forever... other words 1.0 1.0
1 2 No , i know that you know xxx words 1.0 0.0
'''