Quicker iteration in Python through a big list

Question:

I am trying to scan a list of 100,000,000 strings (list1) and match it against another list (list2).
list1 can have up to 10 million rows.
If the contents of list2 are in list1, I flag those values in a counter and store the result in a third list. My lists look somewhat like this:

list1

['My name is ABC and I live in DEF',
'I am trying XYZ method to speed up my LMN problem'
... 100000 rows
]

list2 ( length 90k )

['ABC','DEF','XYZ','LMN' ......XXX']

I have converted list1 to a dataframe and list2 to a single joined pattern (to reduce the number of passes).
Updated list2:

['ABC|DEF|XYZ...|XXX']

My desired output is:

['My name is ABC and I live in DEF',2] (since two patterns from list2 match)

I have tried the code below, but it takes a long time to iterate through the df and give me the result. Can you please let me know how to make this code faster and what exactly I am doing wrong?

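import re

import snowflake.connector
import pandas as pd
import numpy as np
import tqdm

# cola_val (the list1 strings) and SKU_LIST (the joined 'ABC|DEF|...' pattern)
# are defined elsewhere and not shown here.
my_list = []
df_list1 = pd.DataFrame({'cola': cola_val})
for row in tqdm.tqdm(df_list1.values):
    val = row[0]
    # Find every occurrence of any list2 pattern, ignoring case
    match_list = re.findall(SKU_LIST, str(val), re.IGNORECASE)
    # Store "<sentence>~<number of distinct matches>"
    my_list.append(str(val) + '~' + str(len(set(match_list))))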

Answers:

In your case a regex is not a good option: it is quite costly, and an alternation of 90K items (..|..|..) will cause heavy regex backtracking.
Convert lst2 into a set beforehand and intersect it with the words of each split sentence:

def count_keys_within(lst1, lst2):
    keys = set(lst2)
    for s in lst1:
        yield [s, len(set(s.split()) & keys)]

counts = list(count_keys_within(lst1, lst2))
print(counts)

Sample output:

[['My name is ABC and I live in DEF', 2], ['I am trying XYZ method to speed up my LMN problem', 2]]
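
The code in the question passes re.IGNORECASE, while the set intersection above is case-sensitive. If case-insensitive matching is wanted, one possible variant is sketched below. The function name count_keys_within_ci and the sample data are hypothetical, and lowercasing both sides is an assumption about the desired behaviour. Like the original, it only counts whole whitespace-separated words, not substrings.

```python
def count_keys_within_ci(lst1, lst2):
    # Lowercase the keys once so that lookups ignore case
    # (assumption: mirrors the re.IGNORECASE flag used in the question).
    keys = {k.lower() for k in lst2}
    for s in lst1:
        # Lowercase the sentence, split on whitespace, count distinct matches.
        yield [s, len(set(s.lower().split()) & keys)]


# Hypothetical sample data for illustration only.
lst1 = ['My name is abc and I live in Def', 'No keywords here']
lst2 = ['ABC', 'DEF', 'XYZ', 'LMN']

counts = list(count_keys_within_ci(lst1, lst2))
print(counts)
# [['My name is abc and I live in Def', 2], ['No keywords here', 0]]
```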

Update:
If sentences have more complex delimiters (you mentioned +), use a precompiled regex pattern for splitting; its character class can be extended with additional delimiter characters:

import re

def count_keys_within(lst1, lst2):
    keys = set(lst2)
    # Split on any whitespace character or a literal '+'
    pat = re.compile(r'[\s+]')
    for s in lst1:
        yield [s, len(set(pat.split(s)) & keys)]
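
To illustrate how the delimiter set might be extended, here is a minimal sketch that also splits on commas and semicolons. The name count_keys_split, the extra delimiters, and the sample sentences are hypothetical choices for the example, not part of the original answer.

```python
import re


def count_keys_split(lst1, lst2):
    keys = set(lst2)
    # Character class: whitespace, '+', ',' and ';'. The trailing '+'
    # treats a run of consecutive delimiters as a single separator.
    pat = re.compile(r'[\s+,;]+')
    for s in lst1:
        yield [s, len(set(pat.split(s)) & keys)]


# Hypothetical sample data for illustration only.
sentences = ['ABC+DEF are here', 'XYZ,LMN;ABC and ABC again']
keywords = ['ABC', 'DEF', 'XYZ', 'LMN']

split_counts = list(count_keys_split(sentences, keywords))
print(split_counts)
# [['ABC+DEF are here', 2], ['XYZ,LMN;ABC and ABC again', 3]]
```

Because count_keys_split is a generator, very large inputs can also be consumed row by row instead of being collected into one list.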
Answered By: RomanPerekhrest