Identify words that follow a particular pattern in sentences in a body of text

Question:

I want to find all words that follow each of the four patterns below, individually, in a body of text using Python. I am trying regex. For example, the first pattern should capture the presence of the word ‘carbon’ and phrases like ‘carbon law’, ‘carbon policy’, ‘carbon policies’, ‘carbon police’, ‘carbon regulation’. Matching should be case-insensitive. If there are options other than regex, those are fine too.

 1. ("carbon" AND ("law" OR "polic*" OR "regulation"))
 2. (("Carbon" OR "carbon dioxide") AND "emissions")
 3. (("greenhouse gas*" OR "GHG") AND "emission*")
 4. (("carbon" OR "GHG" OR "greenhouse gas*") AND "pollution")
  * denotes a wildcard character

A reproducible example is the following dataframe column df['Text']. All of the examples here should be identified by the regex or other solution.

 df['Text']

 Text
 1.  carbon footprint reducing law, and the policies have a potential to form regulations. There are many examples of regulations happening.
 2.  Net Zero Carbon sourced emissions come from carbon dioxide generated emissions from fossil fuel.
 3.  carbon reduction essentially means Carbon led greenhouse gases or GHG laced emissions.
 4.  Reducing carbon  and netzero carbon footprint can happen from GHG reduction, greenhouse gases reduction and reducing pollution therefrom.

It should identify matches based on the following conditions:

 1. The word "carbon" AND any of the words ("law" OR "polic*" OR "regulation") appearing anywhere in a group of sentences.
 2. The word ("Carbon" OR "carbon dioxide") AND the word "emissions" appearing anywhere within a group of sentences.
 3. The word ("greenhouse gas*" OR "GHG") AND the word "emission*" appearing anywhere within a group of sentences.
 4. The word ("carbon" OR "GHG" OR "greenhouse gas*") AND the word "pollution" appearing anywhere.

The words can be in lower or upper case, and each of them may occur multiple times.

We can then apply a regex function to df['Text'] to identify the examples:

 df['Text'].apply(lambda x: regex(x))

I have tried:

import pandas as pd

df_x = pd.DataFrame()

df_x['Text'] = ['carbon footprint reducing law, and the policies have a potential to form regulations. There are many examples of regulations happening.',
 'Net Zero Carbon sourced emissions come from carbon dioxide generated emissions from fossil fuel.',
 'carbon reduction essentially means Carbon led greenhouse gases or GHG laced emissions.',
 'Reducing carbon  and netzero carbon footprint can happen from GHG reduction, greenhouse gases reduction and reducing pollution therefrom.']

#Validation Query = ("carbon" AND ("law" OR "polic*" OR "regulation"))

df_x.loc[((df_x["carbon"] == True) & (df_x["law"] == True)) |  ((df_x["carbon"] == True) & (df_x["polic"] == True)) |
    ((df_x["carbon"] == True) & (df_x["regulation"] == True))
]

The above raises an error:
KeyError: 'carbon'

Asked By: shan


Answers:

One way would be to use pandas.Series.str.contains (https://pandas.pydata.org/docs/reference/api/pandas.Series.str.contains.html).

df.Text.str.contains("carbon", case=False)

would give you

0    True
1    True
2    True
3    True

Something like

df["carbon"] = df.Text.str.contains("carbon", case=False)
df["law"] = df.Text.str.contains("law", case=False)
df["regulation"] = df.Text.str.contains("regulation", case=False)
df["polic"] = df.Text.str.contains("polic", case=False)

and query using

df[
    ((df["carbon"] == True) & (df["law"] == True)) |
    ((df["carbon"] == True) & (df["polic"] == True)) |
    ((df["carbon"] == True) & (df["regulation"] == True))
]

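You can generate a boolean matrix by applying contains for each word, and then query the matrix to get the output. Because each word is checked separately, this works whatever order the words appear in. A single ordered regex such as carbon.*law, by contrast, may not match when the words are in reverse order, for example when law occurs before carbon.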

If there are not too many words, you can use this approach. Otherwise, go for regex.
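
A minimal sketch of how the per-word idea could be extended to all four queries without creating one column per word. It uses only the standard re module. The names QUERIES, term_to_regex and matches_query are hypothetical (they are not from the question), and the sketch assumes that the * wildcard means "any further word characters".

```python
import re

# Each query is a list of groups: a text matches when every group (AND)
# has at least one of its terms (OR) present. "*" is the question's wildcard.
# The dictionary keys are made-up labels for the four queries.
QUERIES = {
    "carbon_policy": [["carbon"], ["law", "polic*", "regulation"]],
    "carbon_emissions": [["carbon", "carbon dioxide"], ["emissions"]],
    "ghg_emissions": [["greenhouse gas*", "GHG"], ["emission*"]],
    "pollution": [["carbon", "GHG", "greenhouse gas*"], ["pollution"]],
}


def term_to_regex(term):
    """Turn one query term into a case-insensitive whole-word regex."""
    # Escape the literal parts and translate each "*" into \w*
    # (assumption: the wildcard stands for extra word characters).
    body = r"\w*".join(re.escape(part) for part in term.split("*"))
    return re.compile(r"\b" + body + r"\b", re.IGNORECASE)


def matches_query(text, groups):
    """Return True when every AND-group has at least one OR-term in text."""
    return all(
        any(term_to_regex(term).search(text) for term in group)
        for group in groups
    )
```

With the DataFrame from the question, df['Text'].apply(lambda x: matches_query(x, QUERIES['carbon_policy'])) should then give a boolean Series that can be used as a mask. Note that "regulation" has no wildcard in the query, so under this reading "regulations" on its own would not count as a match.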

Series.str.contains supports regex as well.
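
Since the question asks for the words themselves, here is a rough sketch of how an alternation with word boundaries could list the terms that matched for the first query. It uses the standard re module. CARBON, POLICY_TERMS and find_policy_terms are names made up for this illustration, and \w* is assumed to be an acceptable stand-in for the * wildcard.

```python
import re

# "carbon" as a whole word, in any case.
CARBON = re.compile(r"\bcarbon\b", re.IGNORECASE)

# The OR part of query 1 as one alternation; \w* plays the role of "*".
POLICY_TERMS = re.compile(r"\b(?:law|polic\w*|regulation)\b", re.IGNORECASE)


def find_policy_terms(text):
    """Return the matched OR-terms when "carbon" is also present, else []."""
    if not CARBON.search(text):
        return []
    # findall returns every occurrence, so repeated terms are all listed.
    return POLICY_TERMS.findall(text)
```

For the first example text in the question this returns ['law', 'policies']. The word "regulations" is not listed because the query has "regulation" without a wildcard. df['Text'].apply(find_policy_terms) would give one list per row.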

You can apply a regex like

df.Text.str.contains("^.*carbon.*law.*$", case=False, regex=True)
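
If the words may come in either order, one option is a pair of lookaheads. The following is a sketch using the standard re module rather than anything from the answer itself. The names CARBON_AND_LAW and mentions_carbon_and_law are made up for illustration.

```python
import re

# Two lookaheads anchored at the start of the text. Each one scans the
# whole text independently, so the terms may appear in either order.
CARBON_AND_LAW = re.compile(
    r"^(?=.*\bcarbon\b)(?=.*\b(?:law|polic\w*|regulation)\b)",
    re.IGNORECASE | re.DOTALL,  # DOTALL lets "." cross line breaks
)


def mentions_carbon_and_law(text):
    """True when "carbon" and one of law/polic*/regulation both occur."""
    return CARBON_AND_LAW.search(text) is not None


print(mentions_carbon_and_law("A new law on carbon pricing"))   # True
print(mentions_carbon_and_law("Carbon policies are changing"))  # True
print(mentions_carbon_and_law("Carbon footprint"))              # False
```

The same pattern string should also be usable as the first argument of df.Text.str.contains with case=False.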
Answered By: srinath