Parse a Word document with Python to find all words containing an underscore (_)
Question:
I want to parse a Word document in Python, load its text into a DataFrame (DF), and print all the words in that DF that contain an underscore (_).
Is there any sample code for this?
I have tried several document libraries, but each had some issue that kept me from doing this cleanly.
Answers:
This did the job:
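import docx2python as docp
import re
# Read the document and take its body text as a single string
doc = docp.docx2python('test.docx')
bodyText = doc.text
# Collect every run of letters, digits, and underscores that contains at least one underscore
list1 = re.findall(r'[a-zA-Z0-9_]*_[a-zA-Z0-9_]*', bodyText)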
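To see what the pattern picks up without needing a .docx file, here is a minimal sketch that runs the same regex over a plain string. The sample sentence and the helper name find_underscore_words are made up for illustration; in the answer above, the string would be bodyText.

```python
import re

# Same pattern as in the answer: a run of letters, digits, and underscores
# that contains at least one underscore.
UNDERSCORE_WORD = re.compile(r"[a-zA-Z0-9_]*_[a-zA-Z0-9_]*")


def find_underscore_words(text):
    """Return every token in `text` that contains an underscore, in order.

    Hypothetical helper for illustration; not part of docx2python.
    """
    return UNDERSCORE_WORD.findall(text)


# Made-up sample text standing in for doc.text
sample = "Set max_retries and user_id before calling init, then read _cache."
words = find_underscore_words(sample)
print(words)  # ['max_retries', 'user_id', '_cache']
```

To get the DataFrame the question asks for, and assuming pandas is installed, the resulting list can be passed to the constructor, for example pd.DataFrame({"word": words}); the column name "word" is an arbitrary choice.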