String separator for a Dataframe
Question:
Below is my extracted string:
extractedString = "1) No structured exercise.\n\n2) Above ideal body Mass index.\n\n3) Cancer gene testing.\n\n4) Suboptimal vitamin D.\n\n5) Slight anaemia."
What I am looking for is the output that you see for the dataframe:
list = [['No structured exercise'],['Above ideal body Mass index'],['Cancer gene testing']
,['Suboptimal vitamin D'],['Slight anaemia']]
df = pd.DataFrame(list)
print(df)
Output:
0
0 No structured exercise
1 Above ideal body Mass index
2 Cancer gene testing
3 Suboptimal vitamin D
4 Slight anaemia
How best can I achieve this?
Answers:
This is pretty straightforward. You could use pandas like this:
import pandas as pd
extractedString = "1) No structured exercise.\n\n2) Above ideal body Mass index.\n\n3) Cancer gene testing.\n\n4) Suboptimal vitamin D.\n\n5) Slight anaemia."
# split on the blank line that separates the items
listOfStrings = extractedString.split("\n\n")
df = pd.DataFrame(listOfStrings)
# strip the leading "1) " style numbering and the trailing period from each string
df[0] = df[0].str.replace(r'^\d+\)\s*|\.$', '', regex=True)
First, split the string into a list of strings.
Then create a DataFrame from the list of strings.
Lastly, remove the leading numbering and the trailing period from each string.
I hope this helps!
You can try with this regex:
import re
import pandas as pd
data = re.findall(r'\d+\)\s*([^\n.]+)\s{0,2}', extractedString)
df = pd.DataFrame(data, columns=['text'])
print(df)
# Output
text
0 No structured exercise
1 Above ideal body Mass index
2 Cancer gene testing
3 Suboptimal vitamin D
4 Slight anaemia
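To make the pattern easier to read, here is a sketch of the same idea written with re.VERBOSE so each piece can be commented. The name ITEM_PATTERN is my own, not something from the answer, and I dropped the trailing \s{0,2} on the assumption that it does not change what is captured for this input.

```python
import re
import pandas as pd

extractedString = (
    "1) No structured exercise.\n\n2) Above ideal body Mass index.\n\n"
    "3) Cancer gene testing.\n\n4) Suboptimal vitamin D.\n\n5) Slight anaemia."
)

# ITEM_PATTERN is a hypothetical name used only for this illustration.
ITEM_PATTERN = re.compile(
    r"""
    \d+\)       # the item number and its closing parenthesis
    \s*         # optional whitespace after the number
    ([^\n.]+)   # captured text: everything up to the next newline or period
    """,
    re.VERBOSE,
)

# With a single capture group, findall returns a list of the captured strings.
data = ITEM_PATTERN.findall(extractedString)
df = pd.DataFrame(data, columns=["text"])
print(df)
```

Because the capture group stops at the first period, an item that itself contains a period (an abbreviation, say) would be cut short, so this is only suitable when the items look like the ones in the question.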
Only with Pandas:
import pandas as pd
df = (pd.Series(extractedString)
        .str.split('\n\n')
        .explode(ignore_index=True)
        .str.extract(r'\d+\)\s*(?P<text>[^.]+)'))
print(df)
# Output
text
0 No structured exercise
1 Above ideal body Mass index
2 Cancer gene testing
3 Suboptimal vitamin D
4 Slight anaemia