Check a list of words against page source code and return a unique list of the words found
Question:
I have looked through various other questions, but none seem to fit the bill, so here goes.
I have a list of words:
l = ['red','green','yellow','blue','orange']
I also have the source code of a web page in another variable. I am using the requests library:
import requests
url = 'https://google.com'
response = requests.get(url)
source = response.content
I then created a substring lookup function like so:
def find_all_substrings(string, sub):
    import re
    starts = [match.start() for match in re.finditer(re.escape(sub), string)]
    return starts
I now look up the words using the following code, which is where I am stuck:
for word in l:
    substrings = find_all_substrings(source, word)
    new = []
    for pos in substrings:
        ok = False
        if not ok:
            print(word + ";")
            if word not in new:
                new.append(word)
                print(new)
    page['words'] = new
My ideal output looks like the following:
Found words – ['red', 'green']
Answers:
If all you want is a list of the words that are present, you can avoid most of the regex processing and just use:
found_words = [word for word in target_words if word in page_content]
(I've renamed your source string to page_content and l to target_words.)
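One thing to watch for: in Python 3, response.content is a bytes object, and testing a str against bytes with `in` raises a TypeError. Using response.text (or decoding the bytes yourself) avoids that. Here is a minimal offline sketch; the sample HTML is made up for illustration and stands in for the real response:

```python
# Hypothetical sample standing in for `response.content` (bytes), so this runs offline.
raw_bytes = b"<p style='color: red'>The green light is on.</p>"

# Decode to str first; with requests you could use `response.text` instead.
page_content = raw_bytes.decode("utf-8")
target_words = ['red', 'green', 'yellow', 'blue', 'orange']

# Keep each target word that occurs anywhere in the page, in target_words order.
found_words = [word for word in target_words if word in page_content]
print("Found words:", found_words)
```

Note that this is plain substring matching, so 'red' would also be found inside a word like 'credit'.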
If you need additional information or processing (e.g. the regexes or the BeautifulSoup parser) and end up with a list of items that you need to deduplicate, you can just run it through a set() call. A set is unordered, so if you need a list, wrap the set in list(), and if you want a guaranteed (alphabetical) order for found_words, use sorted(). Any of the following should work fine:
found_words = set(possibly_redundant_list_of_found_words)
found_words = list(set(possibly_redundant_list_of_found_words))
found_words = sorted(set(possibly_redundant_list_of_found_words))
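If you would rather keep the words in the order they were first found (instead of alphabetical order), one option is dict.fromkeys(), which relies on dicts preserving insertion order (Python 3.7+). A small sketch with made-up input:

```python
# Hypothetical findings, with duplicates, in the order they were encountered.
possibly_redundant_list_of_found_words = ['red', 'green', 'red', 'blue', 'green']

# A set removes duplicates but gives no ordering guarantee.
as_set = set(possibly_redundant_list_of_found_words)

# sorted() returns a list in alphabetical order.
alphabetical = sorted(as_set)

# dict.fromkeys() keeps only the first occurrence of each word, in encounter order.
first_seen = list(dict.fromkeys(possibly_redundant_list_of_found_words))

print(alphabetical)
print(first_seen)
```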
If you're parsing some sort of data structure (BeautifulSoup and regex can provide supplemental information about position and context, and you might care about those), then just define a custom function extract_word_from_struct() which extracts the word from that structure, and filter its results through a set() call:
possibly_redundant_list_of_found_words = [extract_word_from_struct(struct) for struct in possibly_redundant_list_of_findings]
found_words = set(word for word in possibly_redundant_list_of_found_words if word in target_words)
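As a rough illustration of that last pattern, here is a sketch where the "structs" are assumed to be re.Match objects from a simple word-tokenizing regex. The sample text, the tokenizing pattern, and the body of extract_word_from_struct() are all my own assumptions, so adapt them to whatever your parser actually returns:

```python
import re

def extract_word_from_struct(struct):
    """Return the matched text from a re.Match object (assumed struct type)."""
    return struct.group(0)

# Hypothetical page text and the word list from the question.
page_content = "red green red purple"
target_words = ['red', 'green', 'yellow', 'blue', 'orange']

# Each finding is a match object, so position information stays available.
possibly_redundant_list_of_findings = list(re.finditer(r"[A-Za-z]+", page_content))
positions = [(m.group(0), m.start()) for m in possibly_redundant_list_of_findings]

possibly_redundant_list_of_found_words = [
    extract_word_from_struct(struct) for struct in possibly_redundant_list_of_findings
]

# Deduplicate and keep only the words we are actually looking for.
found_words = set(
    word for word in possibly_redundant_list_of_found_words if word in target_words
)
print(sorted(found_words))
```

Because the regex here matches whole runs of letters, this variant only reports whole-word hits, unlike the plain substring test.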