Python too slow to find text in string in for loop

Question:

I want to improve the performance of a loop that counts word occurrences in text. It currently takes around 5 minutes for just 5 records.

DataFrame

No                  Text   
1     I love you forever...*500 other words
2     No , i know that you know xxx *100 words

My word list

wordlist = ['i', 'love', 'David', 'Mary', ......]

My code to count word

for i in wordlist:
    df[i] = df['Text'].str.count(i)

Result :

No   Text                  I    love  other_words
 1    I love you ...       1      1      4
 2    No, i know ...       1      0      5  
Asked By: foy

Answers:

Try the Aho–Corasick algorithm:

https://en.wikipedia.org/wiki/Aho–Corasick_algorithm

You can also look for ready-made implementations, such as

https://github.com/Guangyi-Z/py-aho-corasick
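
For illustration, here is a minimal sketch using the pyahocorasick package (imported as ahocorasick, a different ready-made implementation than the repo linked above); the count_words helper and the reuse of the question's df and wordlist are assumptions:

import ahocorasick
import pandas as pd

wordlist = [w.lower() for w in wordlist]

# Build the automaton once; add_word stores the word itself as the payload.
automaton = ahocorasick.Automaton()
for word in wordlist:
    automaton.add_word(word, word)
automaton.make_automaton()

def count_words(text):
    # A single pass over the text counts hits for every word in the automaton.
    counts = dict.fromkeys(wordlist, 0)
    for _end, word in automaton.iter(text.lower()):
        counts[word] += 1
    return pd.Series(counts)

df = df.join(df['Text'].apply(count_words))

Like str.count in the question, this counts raw substring hits (e.g. love would also match inside lovely); the gain is that all the words are counted in one pass over each text instead of one pass per word.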

You can do this by making a Counter from the words in each Text value, then converting that into columns (using pd.Series), summing the columns that don’t exist in wordlist into other_words and then dropping those columns:

import re
import pandas as pd
from collections import Counter

wordlist = list(map(str.lower, wordlist))
counters = df['Text'].apply(lambda t: Counter(re.findall(r'\b[a-z]+\b', t.lower())))
df = pd.concat([df, counters.apply(pd.Series).fillna(0).astype(int)], axis=1)
other_words = list(set(df.columns) - set(wordlist) - { 'No', 'Text' })
df['other_words'] = df[other_words].sum(axis=1) 
df = df.drop(other_words, axis=1)

Output (for the sample data in your question):

   No                                 Text  i  love  other_words
0   1    I love you forever... other words  1     1            4
1   2  No , i know that you know xxx words  1     0            7

Note:

  • I’ve converted all the words to lower-case so you’re not counting I and i separately.
  • I’ve used re.findall rather than the more obvious split() so that forever... gets counted as the word forever rather than forever...
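
As a small illustration (not part of the solution above) of split() versus the word-boundary regex on the first row's text:

import re

text = "I love you forever..."
print(text.lower().split())                     # ['i', 'love', 'you', 'forever...']
print(re.findall(r'\b[a-z]+\b', text.lower()))  # ['i', 'love', 'you', 'forever']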

If you only want to count the words in wordlist (and don’t want an other_words count), you can simplify this to:

wordlist = list(map(str.lower, wordlist))
counters = df['Text'].apply(lambda t: Counter(w for w in re.findall(r'\b[a-z]+\b', t.lower()) if w in wordlist))
df = pd.concat([df, counters.apply(pd.Series).fillna(0).astype(int)], axis=1)

Output:

   No                                 Text  i  love
0   1    I love you forever... other words  1     1
1   2  No , i know that you know xxx words  1     0

Another way to generate the other_words value is to build two sets of counters: one for all the words, and one only for the words in wordlist. One can then be subtracted from the other to find the count of words in the text that are not in the wordlist:

wordlist = list(map(str.lower, wordlist))
counters = df['Text'].apply(lambda t: Counter(w for w in re.findall(r'\b[a-z]+\b', t.lower()) if w in wordlist))
df = pd.concat([df, counters.apply(pd.Series).fillna(0).astype(int)], axis=1)
c2 = df['Text'].apply(lambda t: Counter(re.findall(r'\b[a-z]+\b', t.lower())))
df['other_words'] = (c2 - counters).apply(lambda d: sum(d.values()))

The output of this is the same as for the first code sample. Note that in Python 3.10 and later, you can use the new Counter.total() method:

(c2 - counters).apply(Counter.total)
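
As a small, self-contained illustration of the Counter subtraction and Counter.total() used above (total() requires Python 3.10+):

from collections import Counter

all_words = Counter(['i', 'love', 'you', 'forever'])
in_list   = Counter(['i', 'love'])

extra = all_words - in_list      # Counter({'you': 1, 'forever': 1})
print(extra.total())             # 2 (Python 3.10+)
print(sum(extra.values()))       # 2, the equivalent on older versions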
Answered By: Nick

As an alternative, you could try this:

counts = (df['Text'].str.lower().str.findall(r'\b[a-z]+\b')
          .apply(lambda x: pd.Series(x).value_counts())
          .filter(map(str.lower, wordlist)).fillna(0))
df[counts.columns] = counts

print(df)

Output:

   №                                 Text    i  love
0  1    I love you forever... other words  1.0   1.0
1  2  No , i know that you know xxx words  1.0   0.0
Answered By: SergFSM