Quicker iteration through a big list in Python
Question:
I am trying to scan a list of 100,000,000 strings (list1) and match it against another list (list2).
List 1 can have up to 10 million rows.
If the contents of list2 appear in list1, I flag those values in a counter and store the result in a third list. So my lists look somewhat like this:
list1:
['My name is ABC and I live in DEF',
 'I am trying XYZ method to speed up my LMN problem',
 ... 100000 rows
]
list2 (length 90k):
['ABC', 'DEF', 'XYZ', 'LMN', ... 'XXX']
I have converted list 1 to a dataframe and list 2 to a single '|'-joined pattern (to reduce the number of passes).
Updated List 2 :
['ABC|DEF|XYZ...|XXX']
My desired output is:
['My name is ABC and I live in DEF', 2] (since that sentence contains two patterns from list2)
I have tried the code below, but it takes a long time to iterate through the dataframe and give me the result. Can you please let me know how to make this code faster, and what exactly I am doing wrong?
import re
import tqdm
import snowflake.connector
import pandas as pd
import numpy as np

my_list = []
df_list1 = pd.DataFrame({'cola': cola_val})
for row in tqdm.tqdm(df_product_list.values):
    val = row[0]
    match_list = re.findall(SKU_LIST, str(val), re.IGNORECASE)
    my_list.append(str(val) + '~' + str(len(set(match_list))))
Answers:
In your case a regexp is not a good option: it is quite costly, and an alternation of ~90K items (..|..|..) forces the regex engine to try a huge number of branches at every position.
Convert your lst2 into a set object beforehand and intersect it with each split sentence:
def count_keys_within(lst1, lst2):
    keys = set(lst2)
    for s in lst1:
        yield [s, len(set(s.split()) & keys)]

counts = list(count_keys_within(lst1, lst2))
print(counts)
Sample output:
[['My name is ABC and I live in DEF', 2], ['I am trying XYZ method to speed up my LMN problem', 2]]
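Note that, unlike the original re.IGNORECASE call in the question, the set intersection above is case-sensitive. If case-insensitive matching matters, a minimal sketch is to lower-case both the keys and the sentence tokens up front (assuming the entries of lst2 are plain words):

```python
def count_keys_within_ci(lst1, lst2):
    # Case-insensitive variant: normalise both the keys and the
    # sentence tokens to lower case before intersecting.
    keys = {k.lower() for k in lst2}
    for s in lst1:
        yield [s, len({w.lower() for w in s.split()} & keys)]

lst1 = ['My name is abc and I live in DEF']
lst2 = ['ABC', 'DEF', 'XYZ', 'LMN']
print(list(count_keys_within_ci(lst1, lst2)))
# [['My name is abc and I live in DEF', 2]]
```

Lower-casing each token costs a little per sentence, so skip it if your keys are guaranteed to appear in a fixed case.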
Update:
If sentences have more complex delimiters (you mentioned +), use a precompiled regex pattern for splitting, which can be extended with additional delimiter characters:
import re

def count_keys_within(lst1, lst2):
    keys = set(lst2)
    pat = re.compile(r'[\s+]')  # any whitespace character or a literal '+'
    for s in lst1:
        yield [s, len(set(pat.split(s)) & keys)]
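A quick check of the splitting variant, using a hypothetical sentence that mixes spaces and '+' as delimiters (the function is repeated here so the snippet runs on its own):

```python
import re

def count_keys_within(lst1, lst2):
    keys = set(lst2)
    pat = re.compile(r'[\s+]')  # split on whitespace or '+'
    for s in lst1:
        yield [s, len(set(pat.split(s)) & keys)]

# 'ABC+DEF and XYZ' splits into ['ABC', 'DEF', 'and', 'XYZ'],
# so all three keys are found.
print(list(count_keys_within(['ABC+DEF and XYZ'], ['ABC', 'DEF', 'XYZ'])))
# [['ABC+DEF and XYZ', 3]]
```

Inside a character class, '+' is a literal, so r'[\s+]' matches one whitespace character or one plus sign; add any other delimiter characters to the class as needed. Consecutive delimiters produce empty strings in the split result, but those never match a key, so the count is unaffected.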