Parse a Word document using Python to find all words containing an underscore (_)

Question:

I want to parse a Word document in Python, load its text into a DataFrame (DF), and print every word in that DF that contains an underscore (_).

Any sample code for this would be appreciated.

I have tried several document libraries, but each one has had some issue that keeps me from doing this cleanly.

Asked By: as_da_programming


Answers:

This did the job:

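import re

import docx2python as docp

# Extract the document; .text holds all of its text as a single string
doc = docp.docx2python('test.docx')
bodyText = doc.text

# Find every word that contains at least one underscore
list1 = re.findall(r'[a-zA-Z0-9_]*_[a-zA-Z0-9_]*', bodyText)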
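To see what the pattern matches without needing a .docx file, here is a minimal sketch that runs the same regular expression on a plain string. The names `find_underscore_words` and `UNDERSCORE_WORD` and the sample sentence are made up for illustration and are not part of the original answer.

```python
import re

# Same character classes as in the answer: zero or more word characters,
# one required underscore, then zero or more word characters.
UNDERSCORE_WORD = re.compile(r"[a-zA-Z0-9_]*_[a-zA-Z0-9_]*")


def find_underscore_words(text):
    """Return every token in `text` that contains at least one underscore."""
    return UNDERSCORE_WORD.findall(text)


# Hypothetical sample text standing in for the document body.
sample = "Set max_len and _private before calling run_all_tests now."
words = find_underscore_words(sample)
print(words)
```

Note that a lone "_" also counts as a match with this pattern. If pandas is installed, the resulting list can be loaded into a DataFrame with something like `pd.DataFrame({"word": words})`, which is closer to what the question asks for.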

Answered By: as_da_programming