# Check if multiple strings exist in another string

## Question:

How can I check if any of the strings in an array exists in another string?

For example:

```python
a = ['a', 'b', 'c']
s = "a123"
if a in s:
    print("some of the strings found in s")
else:
    print("no strings found in s")
```

How can I replace the `if a in s:` line to get the appropriate result?

You can use `any`:

```python
a_string = "A string is more than its parts!"
matches = ["more", "wholesome", "milk"]

if any(x in a_string for x in matches):
```

Similarly, to check whether *all* of the strings from the list are found, use `all` instead of `any`.
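For example, with the same `a_string` (extra match lists are my own illustration), `all` succeeds only when every listed string is present:

```python
a_string = "A string is more than its parts!"

print(any(x in a_string for x in ["more", "wholesome", "milk"]))  # True: "more" is present
print(all(x in a_string for x in ["more", "wholesome", "milk"]))  # False: "wholesome" and "milk" are not
print(all(x in a_string for x in ["more", "parts"]))              # True: both are present
```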

You need to iterate over the elements of `a`:

```python
a = ['a', 'b', 'c']
a_string = "a123"
found_a_string = False
for item in a:
    if item in a_string:
        found_a_string = True

if found_a_string:
    print("found a match")
else:
    print("no match found")
```
Or, with a list comprehension:

```python
a = ['a', 'b', 'c']
str = "a123"

a_match = [True for match in a if match in str]

if True in a_match:
    print("some of the strings found in str")
else:
    print("no strings found in str")
```

You should be careful if the strings in `a` or `str` get longer. The straightforward solutions take O(S*A) time, where `S` is the length of `str` and `A` is the sum of the lengths of all strings in `a`. For a faster solution, look at the Aho–Corasick algorithm for string matching, which runs in linear time, O(S+A).
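To make that complexity concrete, here is a minimal, self-contained Aho–Corasick sketch (my own illustration, not code from any answer here). It builds a trie with failure links and reports every listed pattern that occurs in the text in a single left-to-right pass:

```python
from collections import deque

def aho_corasick_search(text, patterns):
    """Return the set of patterns occurring in text, in O(len(text) + total pattern length)."""
    # Build the trie: one dict of outgoing edges per state.
    goto, fail, out = [{}], [0], [set()]
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto[state][ch] = len(goto)
                goto.append({})
                fail.append(0)
                out.append(set())
            state = goto[state][ch]
        out[state].add(pat)

    # Breadth-first pass to compute failure links.
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] |= out[fail[t]]  # inherit patterns ending at the fallback state

    # Single pass over the text.
    found, state = set(), 0
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        found |= out[state]
    return found

print(aho_corasick_search("ushers", ["he", "she", "his", "hers"]))  # {'he', 'she', 'hers'} (set order may vary)
```

For production use, a maintained C-backed implementation such as the third-party `pyahocorasick` package will be considerably faster than pure Python.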

Just to add some diversity with `regex`:

```python
import re

str = "a123"  # note: this shadows the built-in str

if any(re.findall(r'a|b|c', str, re.IGNORECASE)):
    print('possible matches thanks to regex')
else:
    print('no matches')
```

or, if your list is long, build the pattern from it: `any(re.findall('|'.join(map(re.escape, a)), str, re.IGNORECASE))` (the `re.escape` ensures that regex metacharacters in the strings are matched literally).

`any()` is by far the best approach if all you want is `True` or `False`, but if you want to know specifically which string or strings match, you can use a couple of things.

If you want the first match (with `False` as a default):

```python
match = next((x for x in a if x in a_string), False)
```

If you want to get all matches (including duplicates):

```python
matches = [x for x in a if x in a_string]
```

If you want to get all non-duplicate matches (disregarding order):

```python
matches = {x for x in a if x in a_string}
```

If you want to get all non-duplicate matches in the right order:

```python
matches = []
for x in a:
    if x in a_string and x not in matches:
        matches.append(x)
```
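Since Python 3.7, regular dicts preserve insertion order, so the ordered-deduplication loop above can also be written in one line (example data is my own):

```python
a = ['a', 'b', 'a', 'ab']
a_string = "ab123"

# dict.fromkeys keeps the first occurrence of each key, in order
matches = list(dict.fromkeys(x for x in a if x in a_string))
print(matches)  # ['a', 'b', 'ab']
```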

It depends on the context. If you only want to check for a single literal (one character or word), plain `in` is enough:

```python
original_word = "hackerearcth"
if 'h' in original_word:
    print("YES")
```

If you want to check whether any of the characters of `original_word` occur in your input, use `any`:

```python
if any(your_required in yourinput for your_required in original_word):
```

If you want every character of `original_word` to be present in the input, use `all`:

```python
original_word = ['h', 'a', 'c', 'k', 'e', 'r', 'e', 'a', 'r', 't', 'h']
yourinput = input().lower()
if all(requested_word in yourinput for requested_word in original_word):
    print("yes")
```

Here is one way to use the Aho–Corasick algorithm in Python:

1. Obtain an implementation of the algorithm.
2. Put it in the same directory as your main Python file and name it `aho_corasick.py`.
3. Try the algorithm with the following code (`string` is a placeholder for your own text):

```python
from aho_corasick import aho_corasick  # aho_corasick(string, keywords)

print(aho_corasick(string, ["keyword1", "keyword2"]))
```

Note that the search is case-sensitive.

To search a file line by line for any of several strings (this assumes a file `test.txt` exists):

```python
strlist = ['SUCCESS', 'Done', 'SUCCESSFUL']
res = False
with open('test.txt', 'r') as flog:
    for line in flog:
        for fstr in strlist:
            if line.find(fstr) != -1:
                print('found')
                res = True

if res:
    print('res true')
else:
    print('res false')
```

I would use this kind of function for speed:

```python
def check_string(string, substring_list):
    for substring in substring_list:
        if substring in string:
            return True
    return False
```
For example, checking that mandatory fields appear in the data:

```python
data = "firstName and favoriteFood"
mandatory_fields = ['firstName', 'lastName', 'age']

# check each field in a loop
for field in mandatory_fields:
    if field not in data:
        print("Error, missing req field {0}".format(field))

# still fine: chained conditions
if ('firstName' not in data or
        'lastName' not in data or
        'age' not in data):
    print("Error, missing a req field")

# list comprehension collects all the missing fields
missing_fields = [x for x in mandatory_fields if x not in data]
if missing_fields:
    print("Error, missing fields {0}".format(", ".join(missing_fields)))
```

Just some more info on how to get all the list elements available in the string:

```python
a = ['a', 'b', 'c']
s = "a123"
list(filter(lambda x: x in s, a))  # ['a']
```

A surprisingly fast approach is to use `set`:

```python
a = ['a', 'b', 'c']
a_string = "a123"
if set(a) & set(a_string):
    print("some of the strings found in a_string")
else:
    print("no strings found in a_string")
```

This only works if the strings in `a` are single characters (otherwise, use `any` as shown above). In that case it is simpler to specify `a` as a string: `a = 'abc'`.
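A quick demonstration of that caveat (example strings are my own): with a multi-character entry, the character-level set intersection misses a match that `any` finds:

```python
a = ['ab']
a_string = "ab123"

print(bool(set(a) & set(a_string)))   # False: the set of single characters never contains 'ab'
print(any(x in a_string for x in a))  # True: substring search finds 'ab'
```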

Yet another solution with sets, using `set.intersection`, as a one-liner:

```python
subset = {"some", "words"}
text = "some words to be searched here"

if len(subset & set(text.split())) == len(subset):  # equivalently: subset <= set(text.split())
    print("All values present in text")

if subset & set(text.split()):
    print("At least one value present in text")
```

The third-party `regex` module (mentioned in the `re` documentation) supports this directly via named lists:

```python
import regex  # third-party: pip install regex

words = {'he', 'or', 'low'}
p = regex.compile(r"\L<name>", name=words)
m = p.findall('helloworld')
print(m)
```

output:

```
['he', 'low', 'or']
```

A compact way to find multiple strings in another list of strings is to use `set.intersection`. This executes much faster than a list comprehension on large sets or lists.

```python
>>> astring = ['abc', 'def', 'ghi', 'jkl', 'mno']
>>> bstring = ['def', 'jkl']
>>> a_set = set(astring)  # convert list to set
>>> b_set = set(bstring)
>>> matches = a_set.intersection(b_set)
>>> matches
{'def', 'jkl'}
>>> list(matches)  # if you want a list instead of a set
['def', 'jkl']
```

If you want exact matches of whole words, consider word-tokenizing the target string. I use the recommended `word_tokenize` from nltk:

```python
from nltk.tokenize import word_tokenize
```

Here is the tokenized string from the accepted answer:

```
a_string = "A string is more than its parts!"
tokens = word_tokenize(a_string)
tokens
Out[46]: ['A', 'string', 'is', 'more', 'than', 'its', 'parts', '!']
```

The accepted answer gets modified as follows:

```
matches_1 = ["more", "wholesome", "milk"]
[x in tokens for x in matches_1]
Out[42]: [True, False, False]
```

As in the accepted answer, the word "more" is still matched. If "mo" becomes a match string, however, the accepted answer still finds a match. That is a behavior I did not want.

```
matches_2 = ["mo", "wholesome", "milk"]
[x in a_string for x in matches_2]
Out[43]: [True, False, False]
```

Using word tokenization, "mo" is no longer matched:

```
[x in tokens for x in matches_2]
Out[44]: [False, False, False]
```

That is the additional behavior that I wanted. This answer also responds to the duplicate question.
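If pulling in nltk is too heavy for this purpose, a simple regex tokenizer (my own substitution, not part of the answer above) gives the same exact-word behaviour for this example:

```python
import re

a_string = "A string is more than its parts!"
tokens = re.findall(r"\w+", a_string)    # ['A', 'string', 'is', 'more', 'than', 'its', 'parts']

matches_2 = ["mo", "wholesome", "milk"]
print([x in tokens for x in matches_2])  # [False, False, False]: "mo" no longer matches
```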

I needed to do that in a performance-critical environment, so I benchmarked all the possible variants I could find and think of with Python 3.11. Here are the results:

```python
import re

words = ['test', 'èk', 'user_me', '<markup>', '[^1]']

def find_words(words):
    for word in words:
        if "_" in word or "<" in word or ">" in word or "^" in word:
            pass

def find_words_2(words):
    for word in words:
        for elem in [">", "<", "_", "^"]:
            if elem in word:
                pass

def find_words_3(words):
    for word in words:
        if re.search(r"_|<|>|\^", word):  # '^' must be escaped to match a literal caret
            pass

def find_words_4(words):
    for word in words:
        if re.match(r"\S*(_|<|>|\^)\S*", word):
            pass

def find_words_5(words):
    for word in words:
        if any(elem in word for elem in [">", "<", "_", "^"]):
            pass

def find_words_6(words):
    for word in words:
        if any(map(word.__contains__, [">", "<", "_", "^"])):
            pass
```
```
> %timeit find_words(words)
351 ns ± 6.24 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

> %timeit find_words_2(words)
689 ns ± 15.4 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

> %timeit find_words_3(words)
2.42 µs ± 43.9 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

> %timeit find_words_4(words)
2.75 µs ± 146 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

> %timeit find_words_5(words)
2.65 µs ± 176 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

> %timeit find_words_6(words)
1.64 µs ± 28.6 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
```
1. The naive chained `or` approach wins (function 1).
2. The basic iteration over each element to test (function 2) is at least 50% faster than using `any()`, and even a regex search is faster than the basic `any()` without `map()`, so I don't get why it exists at all. The syntax is also purely algorithmic, so any programmer will understand what it does, even without a Python background.
3. `re.match()` only matches patterns anchored at the beginning of the string (which is confusing if you come from PHP/Perl regexes), so to make it behave like PHP/Perl you need `re.search()`, or you have to tweak the regex to allow preceding characters, which comes with a performance penalty.

If the list of substrings to search for is known at programming time, the ugly chained `or` is definitely the way to go. Otherwise, use the basic `for` loop over the list of substrings to search. `any()` and regex are a loss of time in this context.
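When the substring list is only known at runtime, a middle ground (my own addition, not part of the benchmark above) is to compile the alternation once and reuse it; `re.escape` keeps metacharacters such as `^` literal:

```python
import re

def make_matcher(substrings):
    # Compile one alternation pattern up front; re.escape makes each
    # substring match literally, so '^' or '[' cannot break the pattern.
    pattern = re.compile("|".join(map(re.escape, substrings)))
    return lambda s: pattern.search(s) is not None

has_special = make_matcher(["_", "<", ">", "^"])
print(has_special("user_me"))  # True
print(has_special("test"))     # False
```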

For a more down-to-earth application (searching if a file is an image by looking for its extension in a list):

```python
import re

def is_image(word: str) -> bool:
    if (".bmp" in word or
            ".jpg" in word or
            ".jpeg" in word or
            ".jpe" in word or
            ".jp2" in word or
            ".j2c" in word or
            ".j2k" in word or
            ".jpc" in word or
            ".jpf" in word or
            ".jpx" in word or
            ".png" in word or
            ".ico" in word or
            ".svg" in word or
            ".webp" in word or
            ".heif" in word or
            ".heic" in word or
            ".tif" in word or
            ".tiff" in word or
            ".hdr" in word or
            ".exr" in word or
            ".ppm" in word or
            ".pfm" in word or
            ".nef" in word or
            ".rw2" in word or
            ".cr2" in word or
            ".cr3" in word or
            ".crw" in word or
            ".dng" in word or
            ".raf" in word or
            ".arw" in word or
            ".srf" in word or
            ".sr2" in word or
            ".iiq" in word or
            ".3fr" in word or
            ".dcr" in word or
            ".ari" in word or
            ".pef" in word or
            ".x3f" in word or
            ".erf" in word or
            ".raw" in word or
            ".rwz" in word):
        return True
    return False

IMAGE_PATTERN = re.compile(r"\.(bmp|jpg|jpeg|jpe|jp2|j2c|j2k|jpc|jpf|jpx|png|ico|svg|webp|heif|heic|tif|tiff|hdr|exr|ppm|pfm|nef|rw2|cr2|cr3|crw|dng|raf|arw|srf|sr2|iiq|3fr|dcr|ari|pef|x3f|erf|raw|rwz)")

extensions = [".bmp", ".jpg", ".jpeg", ".jpe", ".jp2", ".j2c", ".j2k", ".jpc", ".jpf", ".jpx", ".png", ".ico", ".svg", ".webp", ".heif", ".heic", ".tif", ".tiff", ".hdr", ".exr", ".ppm", ".pfm", ".nef", ".rw2", ".cr2", ".cr3", ".crw", ".dng", ".raf", ".arw", ".srf", ".sr2", ".iiq", ".3fr", ".dcr", ".ari", ".pef", ".x3f", ".erf", ".raw", ".rwz"]
```

(Note that the extensions are declared in the same order in all variants).

```
> %timeit is_image("DSC_blablabla_001256.nef")  # found
536 ns ± 18.3 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

> %timeit is_image("DSC_blablabla_001256.noop")  # not found
923 ns ± 43.8 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

> %timeit IMAGE_PATTERN.search("DSC_blablabla_001256.nef")  # found
221 ns ± 24.3 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

> %timeit IMAGE_PATTERN.search("DSC_blablabla_001256.noop")  # not found
207 ns ± 4.3 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

> %timeit any(ext in "DSC_blablabla_001256.nef" for ext in extensions)  # found
1.53 µs ± 30.1 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

> %timeit any(ext in "DSC_blablabla_001256.noop" for ext in extensions)  # not found
2.2 µs ± 25.1 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
```

With a lot more options to test, regexes are actually faster and more legible (for once…) than the chained `or`. `any()` is still the worst.

Empiric tests show that the performance threshold is at 9 elements to test:

• below 9 elements, chained `or` is faster,
• above 9 elements, regex `search()` is faster,
• at exactly 9 elements, both run around 225 ns.

I found this question via a link from another, closed question ("Python: How to check a string for substrings from a list?"), but I don't see an explicit solution to that question in the above answers.

Given a list of substrings and a list of strings, return a unique list of strings that have any of the substrings.

```python
substrings = ['hello', 'world', 'python']
strings = ['blah blah.hello_everyone', 'this is a-crazy_world.here',
           'one more string', 'ok, one more string with hello world python']

# one-liner: the set comprehension removes duplicates
list({string_of_interest for string_of_interest in strings
      for substring in substrings if substring in string_of_interest})
```