Pandas findall re.IGNORECASE doesn't work

Question:

I have a list of keywords:

keywords = ['fake', 'hoax', 'misleading', etc.]

I’d like to search the text column of DataFrame df1 for the above keywords and return rows containing these keywords (exact match), both in uppercase and lowercase (case-insensitive).

I tried the following:

df2 = df1[df1.text.apply(lambda x: any(i for i in re.findall('\w+', x, flags=re.IGNORECASE) if i in keywords))]
df2

The above code returns all rows with the specified keywords, but it doesn’t include the uppercase words (e.g., it returns text containing "hoax", but not "HOAX").

Can someone please help me with this?

Asked By: mOna

Answers:

Your regex is working properly, but the re.IGNORECASE flag isn't doing anything of note. '\w+' matches one or more "word" characters (letters, digits, and underscores) in sequence, regardless of case, so the flag changes nothing. The pattern picks out the individual words in the text, as I presume you intended.
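To see that the flag has no effect on this pattern, here is a minimal check. The sample string is made up for illustration and is not from the question; the pattern is written as a raw string, r'\w+'.

```python
import re

# Made-up sample text (an assumption, not from the question).
sample = "This is a HOAX, not FakE news"

with_flag = re.findall(r'\w+', sample, flags=re.IGNORECASE)
without_flag = re.findall(r'\w+', sample)

print(with_flag)
print(with_flag == without_flag)
```

Both calls return the same list of words with their original casing intact, which is why the case mismatch has to be handled in the comparison step rather than in the regex.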

The problem lies in your if i in keywords. As an example, if the text contains the word "FakE", re.findall will correctly return it, but your code then checks whether "FakE" is in keywords, which it is not (membership checks are case-sensitive). Changing the final part of your lambda function to if i.lower() in keywords should resolve this issue, provided every entry in keywords is itself lowercase.
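A minimal sketch of that fix, kept free of pandas so it runs on its own. The three-item keyword list, the sample texts, and the helper name has_keyword are hypothetical stand-ins, since the question does not show the full list or df1.

```python
import re

# Hypothetical stand-ins: the question's full keyword list and df1 are not shown.
keywords = ['fake', 'hoax', 'misleading']
texts = [
    'This is a HOAX, plain and simple.',
    'Nothing suspicious here.',
    'Totally FakE news!',
    'Hoaxes are not an exact match.',
]

def has_keyword(text):
    # Lowercase each extracted word before the membership check,
    # which assumes every entry in keywords is already lowercase.
    return any(word.lower() in keywords for word in re.findall(r'\w+', text))

matches = [t for t in texts if has_keyword(t)]
print(matches)
```

With a DataFrame, the same helper should slot into the original filter as df2 = df1[df1.text.apply(has_keyword)]. Because whole words are compared, "Hoaxes" does not count as a match for "hoax", which is the exact-match behaviour the question asks for.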

Answered By: L0tad