Check a web page's source code for a list of words and return a unique list of the words found

Question:

I have looked through various other questions, but none seem to fit the bill, so here goes.

I have a list of words:

l = ['red','green','yellow','blue','orange'] 

I also have the source code of a web page in another variable. I am using the requests library:

import requests

url = 'https://google.com'
response = requests.get(url)
source = response.text  # decoded str; response.content would be raw bytes

I then created a substring lookup function like so:

import re

def find_all_substrings(string, sub):
    # Return the start index of every occurrence of sub in string.
    starts = [match.start() for match in re.finditer(re.escape(sub), string)]
    return starts

I now look up the words using the following code, which is where I am stuck:

for word in l:
    substrings = find_all_substrings(source, word)
    new = []
    for pos in substrings:
        ok = False
        if not ok:
            print(word + ";")
            if word not in new:
                new.append(word)
                print(new)
            page['words'] = new

My ideal output looks like the following:

Found words – ['red', 'green']

Asked By: Sam

||

Answers:

If all you want is a list of the words that are present, you can avoid most of the regex processing and just use:

found_words = [word for word in target_words if word in page_content]

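(I’ve renamed your string -> page_content and l -> target_words.) Note that page_content must be a str: requests exposes the decoded text as response.text, whereas response.content is bytes, and testing a str against bytes with "in" raises a TypeError in Python 3.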
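As a minimal sketch of that one-liner producing the output asked for in the question, the snippet below uses a hard-coded page_content string as a stand-in for the downloaded page text. The sample HTML is an assumption made purely for illustration.

```python
# Hypothetical stand-in for response.text; no network request is made here.
page_content = "<p>The red car passed a green light, then another red car.</p>"
target_words = ['red', 'green', 'yellow', 'blue', 'orange']

# Keep each target word that occurs anywhere in the page text.
# Iterating over target_words (not over matches) means no duplicates appear.
found_words = [word for word in target_words if word in page_content]

print("Found words:", found_words)  # Found words: ['red', 'green']
```

Because this is a plain substring test, 'red' would also be reported for a page that only contains 'hundred'. If whole-word matches matter, a regex with \b word boundaries is one option.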

If you need additional information or processing (e.g. regexes or a BeautifulSoup parser) and end up with a list of items that needs deduplicating, you can just run it through a set() call. A set is unordered, so if you need a list instead, wrap the set in list(); if you want a guaranteed (alphabetical) order for found_words, use sorted(), which also returns a list. Any of the following should work fine:

found_words = set(possibly_redundant_list_of_found_words)
found_words = list(set(possibly_redundant_list_of_found_words))
found_words = sorted(set(possibly_redundant_list_of_found_words))
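To make the difference between these forms concrete, here is a small sketch with made-up sample data. The last variant, dict.fromkeys(), is not part of the answer above; it is added here as one way to keep the order in which words were first seen, relying on dicts preserving insertion order (guaranteed from Python 3.7).

```python
# Hypothetical sample data with repeats.
possibly_redundant_list_of_found_words = ['red', 'green', 'red', 'blue', 'green']

# A set removes duplicates but has no defined order.
as_set = set(possibly_redundant_list_of_found_words)

# sorted() returns a list in alphabetical order.
as_sorted = sorted(as_set)

# Assumption beyond the answer: dict keys are unique and keep insertion
# order, so this deduplicates while preserving first-seen order.
in_first_seen_order = list(dict.fromkeys(possibly_redundant_list_of_found_words))

print(as_sorted)            # ['blue', 'green', 'red']
print(in_first_seen_order)  # ['red', 'green', 'blue']
```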

If you’ve got some sort of data structure you’re parsing (because BeautifulSoup and regexes can provide supplemental information about position and context, and you might care about those), then just define a custom function extract_word_from_struct() which extracts the word from that structure, and build the set from its results:

possibly_redundant_list_of_found_words = [extract_word_from_struct(struct) for struct in possibly_redundant_list_of_findings]
found_words = set(word for word in possibly_redundant_list_of_found_words if word in target_words)
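One way this could look in practice, assuming the "structs" are regex match objects (which carry both the matched word and its position). The sample page text, the whole-word pattern, and the body of extract_word_from_struct() are illustrative assumptions, not something specified above.

```python
import re

target_words = ['red', 'green', 'yellow', 'blue', 'orange']
page_content = "red green red purple blue"  # hypothetical page text

# One pattern matching any target word as a whole word.
pattern = re.compile(
    r"\b(?:" + "|".join(re.escape(word) for word in target_words) + r")\b"
)

# Each finding is a re.Match object; repeated words yield repeated findings.
possibly_redundant_list_of_findings = list(pattern.finditer(page_content))

def extract_word_from_struct(struct):
    # Hypothetical helper: for a re.Match, the word is the full matched text.
    return struct.group(0)

possibly_redundant_list_of_found_words = [
    extract_word_from_struct(struct)
    for struct in possibly_redundant_list_of_findings
]
found_words = sorted(
    set(word for word in possibly_redundant_list_of_found_words if word in target_words)
)

# The position information is still available on the original findings.
positions = [(m.group(0), m.start()) for m in possibly_redundant_list_of_findings]

print(found_words)  # ['blue', 'green', 'red']
print(positions)    # [('red', 0), ('green', 4), ('red', 10), ('blue', 21)]
```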

Answered By: Sarah Messer
