String separator for a Dataframe

Question:

Below is my extracted string:

extractedString = "1) No structured exercise.\n\n2) Above ideal body Mass index.\n\n3) Cancer gene testing.\n\n4) Suboptimal vitamin D.\n\n5) Slight anaemia."

What I am looking for is the output that you see for the dataframe:

import pandas as pd

# "rows" instead of "list", so the built-in list type is not shadowed
rows = [['No structured exercise'],['Above ideal body Mass index'],['Cancer gene testing']
        ,['Suboptimal vitamin D'],['Slight anaemia']]
df = pd.DataFrame(rows)
print(df)

Output:

                             0
0       No structured exercise
1  Above ideal body Mass index
2          Cancer gene testing
3         Suboptimal vitamin D
4               Slight anaemia

How best can I achieve this?

Asked By: WhoamI

Answers:

This is pretty straightforward. You could use pandas like this:

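import pandas as pd

extractedString = "1) No structured exercise.\n\n2) Above ideal body Mass index.\n\n3) Cancer gene testing.\n\n4) Suboptimal vitamin D.\n\n5) Slight anaemia."

# Split on the blank line between items
listOfStrings = extractedString.split("\n\n")

# One row per item, in a single column labelled 0
df = pd.DataFrame(listOfStrings)

# Strip the leading "N) " numbering and the trailing period from each string
# (slicing with x[1:-1] inside apply would drop rows, not characters)
df[0] = df[0].str.replace(r"^\d+\)\s*", "", regex=True).str.rstrip(".")
print(df)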

First, split the string into a list of strings.
Then create a DataFrame from the list of strings.
Lastly, remove the leading numbering and the trailing period from each string.
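
To see the string handling on its own, the steps above can be sketched in plain Python, without pandas. The helper name split_numbered_items is hypothetical, and the sketch assumes the items are separated by a blank line ("\n\n") and that each item starts with "N)" and ends with a period, as in the question:

```python
def split_numbered_items(text):
    """Turn '1) foo.\n\n2) bar.' style text into ['foo', 'bar']."""
    items = []
    for chunk in text.split("\n\n"):
        chunk = chunk.strip()
        if not chunk:
            continue  # skip empty pieces, e.g. from a trailing blank line
        # Drop everything up to and including the first ")" (the numbering)
        _, sep, rest = chunk.partition(")")
        body = rest if sep else chunk
        # Remove surrounding whitespace and the trailing period
        items.append(body.strip().rstrip("."))
    return items


sample = "1) No structured exercise.\n\n2) Above ideal body Mass index."
items = split_numbered_items(sample)
print(items)  # ['No structured exercise', 'Above ideal body Mass index']
```

The resulting list can then be passed to pd.DataFrame(items) to get one row per item.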

I hope this helps!

Answered By: Ahmed

You can try with this regex:

import re
import pandas as pd

data = re.findall(r'\d+\)\s*([^\n.]+)\s{0,2}', extractedString)
df = pd.DataFrame(data, columns=['text'])
print(df)

# Output
                          text
0       No structured exercise
1  Above ideal body Mass index
2          Cancer gene testing
3         Suboptimal vitamin D
4               Slight anaemia
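
For readers less used to regular expressions, here is a rough breakdown of what such a pattern does, written with re.VERBOSE so each part can carry a comment. The name ITEM_PATTERN and the shortened sample string are only for illustration, and the pattern is a simplified sketch of the one above rather than an exact copy:

```python
import re

ITEM_PATTERN = re.compile(
    r"""
    \d+\)        # the item number followed by a closing parenthesis
    \s*          # optional whitespace after the numbering
    ([^\n.]+)    # capture the item text, up to the next newline or period
    """,
    re.VERBOSE,
)

sample = "1) No structured exercise.\n\n2) Slight anaemia."
matches = ITEM_PATTERN.findall(sample)
print(matches)  # ['No structured exercise', 'Slight anaemia']
```

Because the capture group stops at the first period, an item that contains a period in the middle would be cut short there.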

Only with Pandas:

import pandas as pd

df = (pd.Series(extractedString)
        .str.split('\n\n')
        .explode(ignore_index=True)
        .str.extract(r'\d+\)\s*(?P<text>[^.]+)'))
print(df)

# Output
                          text
0       No structured exercise
1  Above ideal body Mass index
2          Cancer gene testing
3         Suboptimal vitamin D
4               Slight anaemia

Answered By: Corralien