How to convert a txt file or PDF to a Word doc using Python?
Question:
Is there a way to convert PDFs (or text files) to Word docs in Python? I’m doing some web-scraping for my professor and the original docs are PDFs. I converted all 1,611 of those to text files and now we need to convert them to Word docs. The only thing I could find was a Word-to-txt converter, not the reverse.
Thanks!
Answers:
You could check out python-docx. It can create Word docs from Python, so you could write the contents of the text files into Word documents.
See python-docx – what-it-can-do
Using python-docx I was able to pretty easily convert the txt files to Word docs.
Here’s what I did.
from docx import Document
import re
import os

path = '/users/tdobbins/downloads/smithtxt'
direct = os.listdir(path)

for i in direct:
    document = Document()
    document.add_heading(i, 0)
    # read from the same folder that was listed above
    with open(os.path.join(path, i)) as f:
        myfile = f.read()
    # replace runs of non-ASCII characters and form feeds (not XML-compatible) with a space
    myfile = re.sub(r'[^\x00-\x7F]+|\x0c', ' ', myfile)
    p = document.add_paragraph(myfile)
    document.save('/path/to/write/to/' + i + '.docx')
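The regex above replaces every non-ASCII character, which also discards accented letters and other legitimate text. As far as I know, what actually breaks the XML inside a .docx are control characters such as the form feed, so a narrower filter may be enough. The following is a minimal sketch under that assumption; `clean_for_xml` is a hypothetical helper name, and the character ranges are taken from the XML 1.0 definition of a valid character.

```python
import re

# Anything outside the character ranges that XML 1.0 allows:
# tab, newline, carriage return, and the listed Unicode ranges.
_XML_ILLEGAL = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]+"
)


def clean_for_xml(text, replacement=" "):
    """Replace each run of XML-incompatible characters with `replacement`."""
    return _XML_ILLEGAL.sub(replacement, text)


if __name__ == "__main__":
    # The form feed and NUL are replaced, the accented letter is kept.
    print(repr(clean_for_xml("caf\u00e9\x0cbar\x00")))
```

The cleaned string could then be passed to `document.add_paragraph` in place of the output of the `re.sub` call.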
You can use GroupDocs.Conversion Cloud. It offers a Python SDK for Text/PDF to DOC/DOCX conversion, and for converting many other common file formats from one format to another, without depending on any third-party tool or software.
Here is sample Python code.
# Import module
import groupdocs_conversion_cloud

# Get your app_sid and app_key at https://dashboard.groupdocs.cloud (free registration is required).
app_sid = "xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxx"
app_key = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

# Create instance of the API
convert_api = groupdocs_conversion_cloud.ConvertApi.from_keys(app_sid, app_key)
file_api = groupdocs_conversion_cloud.FileApi.from_keys(app_sid, app_key)

try:
    # Upload source file to storage
    filename = 'Sample.pdf'
    remote_name = 'Sample.pdf'
    output_name = 'sample.doc'
    strformat = 'doc'
    request_upload = groupdocs_conversion_cloud.UploadFileRequest(remote_name, filename)
    response_upload = file_api.upload_file(request_upload)

    # Convert PDF to Word document
    settings = groupdocs_conversion_cloud.ConvertSettings()
    settings.file_path = remote_name
    settings.format = strformat
    settings.output_path = output_name

    # Options for loading the source PDF
    loadOptions = groupdocs_conversion_cloud.PdfLoadOptions()
    loadOptions.hide_pdf_annotations = True
    loadOptions.remove_embedded_files = False
    loadOptions.flatten_all_fields = True
    settings.load_options = loadOptions

    # Options for the output document (here: only the first page)
    convertOptions = groupdocs_conversion_cloud.DocxConvertOptions()
    convertOptions.from_page = 1
    convertOptions.pages_count = 1
    settings.convert_options = convertOptions

    request = groupdocs_conversion_cloud.ConvertDocumentRequest(settings)
    response = convert_api.convert_document(request)
    print("Document converted successfully: " + str(response))
except groupdocs_conversion_cloud.ApiException as e:
    print("Exception while converting the document: {0}".format(e.message))
Run the code below. It writes Word (.docx) content back into each file automatically, BUT the file names keep their original extension, so you will have to change the extension to .docx yourself afterwards.
# pip install python-docx
# (re and pathlib are part of the standard library)
import re
from pathlib import Path
from docx import Document

path = Path(r"d:\text")
if path.exists():
    print(path, "exists")
else:
    print(path, "does not exist")
    raise SystemExit(-1)

for file in path.glob("*"):
    # file is a Path object
    document = Document()
    # file.name is the name of the file as a str, without the directory part
    document.add_heading(file.name, 0)
    # Path objects have read_text and read_bytes methods
    # and also support open() with context managers
    # replace runs of non-ASCII characters and form feeds (not XML-compatible) with a space
    file_content = re.sub(r"[^\x00-\x7F]+|\x0c", " ", file.read_text())
    document.add_paragraph(file_content)
    # if Document could not handle Path objects,
    # you must convert the Path object to a str
    # document.save(str(file))
    # note: this overwrites the input file with .docx content
    document.save(file)
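Because the script above saves over the input file, the renaming step can be avoided by computing a separate output path first. This is only a sketch of one way to do that; `docx_target` and the output folder are hypothetical names, not part of the answer.

```python
from pathlib import Path


def docx_target(src, out_dir):
    """Return out_dir/<name without extension>.docx for a source file path."""
    return Path(out_dir) / (Path(src).stem + ".docx")


if __name__ == "__main__":
    # Only the last extension is replaced; the original file is left untouched.
    print(docx_target(Path("in") / "report.txt", Path("out")))
```

In the loop, `document.save(str(docx_target(file, out_dir)))` would then write next to, rather than on top of, the text files, assuming `out_dir` already exists.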
Converting simple plain text into DOCX can be done from the command line, without libraries, either by running a customised OS shell script or by treating it as text input/output in Python.
Note this is a draft (proof of concept) for you to tailor as required. My default is 54 lines per portrait page using Windows Consolas.
MS Word or WordPad is not required (but would help). I used a print preview from WordPad just to illustrate the output, should you wish to auto-print to PDF.
The core function is xpdf/poppler pdftotext -layout
which I will not describe further, as it is covered in so many other places, to get simple plain text in different layouts.
So let's first "round trip" that text to PDF
let's see that in the console:
pdftotext -layout -enc UTF-8 input.pdf -
(NOTE: at this time we do NOT need to see page feeds)
undesired input
...
Line 53
Line 54
♀Line 55
Line 56
Line 57
Line 58
Line 59
Line 60
♀
So there are page feeds after line 54 and after line 60 (let's save as output.txt without them, using -nopgbrk)
pdftotext -nopgbrk -layout -enc UTF-8 input.pdf output.txt
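The ♀ glyph in the listing above is how a Windows console typically renders the form feed character (U+000C). If the text has already been extracted without -nopgbrk, the page feeds can also be handled afterwards in Python. A minimal sketch, with hypothetical helper names:

```python
def split_pages(text):
    """Split pdftotext output into pages on the form feed character."""
    pages = text.split("\f")
    # A form feed at the very end leaves one empty trailing piece; drop it.
    if pages and pages[-1] == "":
        pages.pop()
    return pages


def strip_page_breaks(text):
    """Remove all form feeds, similar in effect to -nopgbrk."""
    return text.replace("\f", "")


if __name__ == "__main__":
    sample = "Line 53\nLine 54\n\fLine 55\nLine 60\n\f"
    print(len(split_pages(sample)), "pages")
```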
Now, I did not say setup is easy, but it is only needed once for thousands of files.
CAUTION: before leaping up and down shouting "Eureka", this simple method has one key common disadvantage (possibly more) that’s mentioned [*] later.
A DOCX file is a zip archive with multiple parts. Thus our template needs to be a Word working folder with the minimal components
WorkFolder
- Our script (for me it's a MakeDocX.cmd, which can be used to loop through files)
- output.txt (from pdftotext; in a batch run this would change constantly)
- possibly our input PDF (again a variable file that could be overwritten)
- DocXheader.txt (this is the one where you set font name and height; 24 units = 12 points)
<?xml version="1.0" encoding="UTF-8" standalone="yes"?><w:document rel="nofollow noreferrer">
let's start with that subfolder, so it needs just one file, unsurprisingly
- document.xml.rels
<?xml version="1.0" encoding="UTF-8"?><Relationships rel="nofollow noreferrer">
[*] During testing, one simple plain-text character caused a failure in the XML, and that was a raw & in the output.txt from pdftotext. Thus ALL & MUST be replaced with &amp;. There are likely other candidates that need similar text replacement. Luckily only 5 are listed at
https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references#Predefined_entities_in_XML so we can easily use a replacement for those (in my case jrepl.bat handled the replacement), so after pdftotext and before the conversion I have jrepl "&" "&amp;" /F dirty.txt /O output.txt
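The same idea (a zip archive holding a handful of XML parts, with the text escaped before it is dropped in) can be sketched in Python using only the standard library. This is not the MakeDocX.cmd script from the answer; `text_to_docx` is a hypothetical name, and the three parts written here are my assumption of the smallest package that Word accepts. `xml.sax.saxutils.escape` handles &, < and >; the two quote entities only matter inside attribute values. Control characters such as form feeds are not handled here.

```python
import zipfile
from xml.sax.saxutils import escape

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

CONTENT_TYPES = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
    '<Default Extension="rels" '
    'ContentType="application/vnd.openxmlformats-package.relationships+xml"/>'
    '<Default Extension="xml" ContentType="application/xml"/>'
    '<Override PartName="/word/document.xml" ContentType="application/'
    'vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>'
    "</Types>"
)

ROOT_RELS = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
    '<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/'
    'officeDocument/2006/relationships/officeDocument" Target="word/document.xml"/>'
    "</Relationships>"
)


def text_to_docx(text, docx_path):
    """Write each line of `text` as one paragraph of a minimal .docx file."""
    paragraphs = []
    for line in text.splitlines():
        # escape() turns &, < and > into &amp;, &lt; and &gt;
        paragraphs.append(
            '<w:p><w:r><w:t xml:space="preserve">%s</w:t></w:r></w:p>' % escape(line)
        )
    document = (
        '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
        '<w:document xmlns:w="%s"><w:body>%s</w:body></w:document>'
        % (W_NS, "".join(paragraphs))
    )
    with zipfile.ZipFile(docx_path, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("[Content_Types].xml", CONTENT_TYPES)
        zf.writestr("_rels/.rels", ROOT_RELS)
        zf.writestr("word/document.xml", document)
```

Calling `text_to_docx(open("output.txt", encoding="utf-8").read(), "output.docx")` would then produce a plain, unstyled document; font name and size would still have to be added to the XML, as the DocXheader.txt in the answer does.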
# pip install python-docx
# (re, os, sys and pathlib are part of the standard library)
import re
import os
from pathlib import Path
import sys
from docx import Document

# Folder where the input files are located
input_path = r'c:\Folder7\input'
# Folder where the docx files will be written
output_path = r'c:\Folder7\output'
# Create the folder structure if it does not exist
os.makedirs(output_path, exist_ok=True)

# Check that the input folder exists
directory_path = Path(input_path)
if directory_path.exists() and directory_path.is_dir():
    print(directory_path, "exists")
else:
    print(directory_path, "is invalid")
    sys.exit(1)

for file_path in directory_path.glob("*"):
    # file_path is a Path object
    print("Processing file:", file_path)
    document = Document()
    # file_path.name is the name of the file as a str, without the directory part
    document.add_heading(file_path.name, 0)
    file_content = file_path.read_text(encoding='UTF-8')
    document.add_paragraph(file_content)
    # build the new path where we store the files
    output_file_path = os.path.join(output_path, file_path.name + ".docx")
    document.save(output_file_path)
    print("Converted file:", file_path, "to:", output_file_path)
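The scripts above put the whole file into a single Word paragraph. If the text files contain ordinary prose with blank lines between paragraphs, it may read better to split them first. A minimal sketch under that assumption (`split_paragraphs` is a hypothetical helper; it would not suit layout-preserving output such as pdftotext -layout, because it re-joins wrapped lines):

```python
import re


def split_paragraphs(text):
    """Split text on blank lines and re-join hard-wrapped lines in each block."""
    blocks = re.split(r"\n\s*\n", text.strip())
    # Collapse internal line breaks and repeated spaces to single spaces.
    return [" ".join(block.split()) for block in blocks if block.strip()]


if __name__ == "__main__":
    sample = "First line\nwrapped.\n\nSecond para.\n\n\nThird."
    for paragraph in split_paragraphs(sample):
        print(paragraph)
```

Each returned string could then be passed to `document.add_paragraph` in the loop, instead of adding `file_content` in one call.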