Convert contents from a list to a dataframe

Question:

I have a list which looks something like this,

list = ['some random sentence','some random sentence 25% •Assignments for week1',
        'some random sentence','some random sentence 20% •Exam for week2','some random sentence',
        'some random sentence']

This is extracted from a PDF. I want to take only specific characters and words from specific values in this list and convert them into a pandas dataframe.

The word ‘Assignment’ is just an example; there could be different words, but they always come after the percentage sign. The text may have multiple spaces or sometimes 1-2 special characters.
Is there a way to do this?

Asked By: Ritesh Kankonkar

Answers:

I think the simplest approach is to use a regex:

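import re
import pandas as pd

def regex_split(sentence):
    # Capture the percentage digits and the word right after the bullet
    match = re.search(r".+ (\d+)% •(\w+) for week\d", sentence)
    if match:
        return match.group(1) + '%', match.group(2)
    else:
        return "None"

# `list` is the variable from the question (note that it shadows the built-in)
df = pd.DataFrame({"sentence": list})
df["data"] = df["sentence"].apply(regex_split)
# Keep only the rows where the pattern matched
df = df[df["data"] != "None"].copy()
df["Weight"] = df["data"].apply(lambda x: x[0])  # e.g. '25%'
df["Object"] = df["data"].apply(lambda x: x[1])  # e.g. 'Assignments'
df = df.drop(["sentence", "data"], axis=1)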
Answered By: Scaro974

With str.extract:

import pandas as pd

l = ['some random sentence', 'some random sentence 25% •Assignments for week1',
     'some random sentence', 'some random sentence 20% •Exam for week2',
     'some random sentence', 'some random sentence']

# Weight: digits followed by '%'; \W* skips spaces and special characters;
# Object: the first word that follows
out = (pd.Series(l)
         .str.extract(r'(?P<Weight>\d+%)\W*(?P<Object>\w+)')
         .dropna(subset=['Object'])
       )

print(out)

Output:

  Weight       Object
1    25%  Assignments
3    20%         Exam
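
As a quick check of how this pattern copes with the extra spaces or special characters mentioned in the question, here is a minimal sketch using the same regex. Only the first sample string comes from the question; the others are made up for illustration and are not taken from the original PDF.

```python
import pandas as pd

# Same pattern as above: digits plus '%', then any run of non-word
# characters (spaces, bullets, dashes, ...), then the first word.
PATTERN = r'(?P<Weight>\d+%)\W*(?P<Object>\w+)'

# Hypothetical inputs; only the first one appears in the question.
samples = pd.Series([
    'some random sentence 25% •Assignments for week1',
    'some random sentence 30%    Quiz for week3',
    'some random sentence 15% -•Project for week4',
    'some random sentence',
])

# Rows without a percentage yield NaN in both columns and are dropped.
result = samples.str.extract(PATTERN).dropna(subset=['Object'])
print(result)
```

Because `\W*` matches zero or more non-word characters, the word is picked up whether or not a bullet or extra whitespace sits between it and the percentage.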

older answer

If you have a single term to match:

l = ['some random sentence','some random sentence 25% •Assignments for week1',
        'some random sentence','some random sentence 20% •Assignments for week2','some random sentence',
        'some random sentence']

s = pd.Series(l)
m = s.str.contains('assignment', case=False)

out = (s[m].str.extract(r'(?P<Weight>\d+%)')
       .assign(Object='Assignment')
       )

print(out)

Alternative with a regex to match any number of terms:

s = pd.Series(l)
out = (s.str.extractall(r'(?P<Object>Assignment|otherword)|(?P<Weight>\d+%)')
       .groupby(level=0).first()
       )

Output:

       Object Weight
1  Assignment    25%
3  Assignment    20%
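
To see why `groupby(level=0).first()` is needed, it can help to look at what `extractall` returns before grouping. This is a minimal sketch reusing the list from above: each match becomes its own row, with NaN in the group that did not take part in that match, and `first()` then picks the first non-null value per column for every original row.

```python
import pandas as pd

l = ['some random sentence', 'some random sentence 25% •Assignments for week1',
     'some random sentence', 'some random sentence 20% •Assignments for week2',
     'some random sentence', 'some random sentence']

s = pd.Series(l)

# One row per match, indexed by (original index, match number)
raw = s.str.extractall(r'(?P<Object>Assignment|otherword)|(?P<Weight>\d+%)')
print(raw)

# Collapse the matches back to one row per original string
out = raw.groupby(level=0).first()
print(out)
```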
Answered By: mozway