How to convert a txt file or PDF to a Word doc using Python?

Question:

Is there a way to convert PDFs (or text files) to Word docs in Python? I’m doing some web-scraping for my professor and the original docs are PDFs. I converted all 1,611 of those to text files and now we need to convert them to Word docs. The only thing I could find was a Word-to-txt converter, not the reverse.

Thanks!

Asked By: tmthyjames


Answers:

You could check out python-docx. It can create Word docs from Python, so you could write the contents of the text files into Word documents.
See the python-docx documentation, "What it can do".

Answered By: ebaharilikult

Using python-docx I was able to pretty easily convert the txt files to Word docs.

Here’s what I did.

from docx import Document
import re
import os

path = '/users/tdobbins/downloads/smithtxt'  # folder containing the txt files
direct = os.listdir(path)

for i in direct:
    document = Document()
    document.add_heading(i, 0)  # use the file name as the title
    with open(os.path.join(path, i)) as f:
        myfile = f.read()
    # replace non-ASCII characters and form feeds (\x0c) with spaces
    myfile = re.sub(r'[^\x00-\x7F]+|\x0c', ' ', myfile)
    p = document.add_paragraph(myfile)
    document.save('/path/to/write/to/'+i+'.docx')
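
The pattern above replaces every non-ASCII character, so accented letters are lost as well. A possible alternative, sketched here with only the standard library, is to replace just the characters that XML 1.0 does not allow (the form feed \x0c is one of them). The name clean_for_xml is made up for this illustration and is not part of python-docx.

```python
import re

# Characters allowed by XML 1.0: tab, LF, CR, and the ranges listed below.
# Everything else (NUL, form feed \x0c, other control characters) is matched.
_INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def clean_for_xml(text: str) -> str:
    """Replace each character that XML 1.0 forbids with a single space."""
    return _INVALID_XML_CHARS.sub(" ", text)
```

It could be used in place of the re.sub line, for example myfile = clean_for_xml(myfile).
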
Answered By: tmthyjames

You can use GroupDocs.Conversion Cloud. It offers a Python SDK for Text/PDF to DOC/DOCX conversion, and for converting many other common file formats from one format to another, without depending on any third-party tool or software.

Here is sample Python code.

# Import module
import groupdocs_conversion_cloud

# Get your app_sid and app_key at https://dashboard.groupdocs.cloud (free registration is required).
app_sid = "xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxx"
app_key = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

# Create instance of the API
convert_api = groupdocs_conversion_cloud.ConvertApi.from_keys(app_sid, app_key)
file_api = groupdocs_conversion_cloud.FileApi.from_keys(app_sid, app_key)

try:

        # Upload source file to storage
        filename = 'Sample.pdf'
        remote_name = 'Sample.pdf'
        output_name= 'sample.doc'
        strformat='doc'

        request_upload = groupdocs_conversion_cloud.UploadFileRequest(remote_name,filename)
        response_upload = file_api.upload_file(request_upload)
        #Convert PDF to Word document
        settings = groupdocs_conversion_cloud.ConvertSettings()
        settings.file_path =remote_name
        settings.format = strformat
        settings.output_path = output_name

        loadOptions = groupdocs_conversion_cloud.PdfLoadOptions()
        loadOptions.hide_pdf_annotations = True
        loadOptions.remove_embedded_files = False
        loadOptions.flatten_all_fields = True

        settings.load_options = loadOptions

        convertOptions = groupdocs_conversion_cloud.DocxConvertOptions()
        convertOptions.from_page = 1
        convertOptions.pages_count = 1

        settings.convert_options = convertOptions
        request = groupdocs_conversion_cloud.ConvertDocumentRequest(settings)
        response = convert_api.convert_document(request)

        print("Document converted successfully: " + str(response))
except groupdocs_conversion_cloud.ApiException as e:
        print("Exception when calling convert_document: {0}".format(e.message))
Answered By: Tilal Ahmad

Run the code below. It overwrites each file in the folder with Word (.docx) content but keeps the original file name, so you will have to change the extension to .docx yourself afterwards.

# pip install python-docx
# (re and pathlib are part of the standard library)

import re
from pathlib import Path

from docx import Document

path = Path(r"d:\text")

if path.exists():
    print(path, "exists")
else:
    print(path, "does not exist")
    raise SystemExit(-1)


for file in path.glob("*"):
    # file is a Path object

    document = Document()
    # file.name is the name of the file as str without the Path
    document.add_heading(file.name, 0)

    # Path objects do have the read_text, read_bytes
    # method and also supports
    # open with context managers

    # replace non-ASCII characters and form feeds (\x0c) with spaces
    file_content = re.sub(r"[^\x00-\x7F]+|\x0c", " ", file.read_text())
    document.add_paragraph(file_content)
    # if Document could not handle Path objects,
    # you must convert the Path object to a str

    # document.save(str(file))
    document.save(file)
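
Form feeds usually mark page boundaries in text extracted from PDFs, so instead of turning them into spaces they could be kept as Word page breaks. The sketch below only assumes an object that offers add_paragraph() and add_page_break(), as a python-docx Document does; add_pages is a hypothetical helper name.

```python
def add_pages(document, text: str) -> int:
    """Add text to document, turning form feeds into page breaks.

    `document` is expected to offer add_paragraph() and add_page_break().
    Returns the number of pages that were added.
    """
    # A trailing form feed would otherwise create an empty last page.
    pages = text.rstrip("\x0c").split("\x0c")
    for index, page in enumerate(pages):
        if index > 0:
            document.add_page_break()
        document.add_paragraph(page)
    return len(pages)
```

With python-docx installed, add_pages(document, file.read_text()) could stand in for the add_paragraph call; other control characters in the text would still need to be cleaned first.
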


Answered By: Just Me

Converting simple plain text into docX can be done "commando", without libraries, either by running a customised OS shell script or by handling the text input/output in Python.

Note this is a draft (proof of concept) for you to tailor as required. My default is 54 lines per portrait page using Windows Consolas.

MS Word or WordPad is not required (but would help). The output can be checked in a WordPad print preview, should you wish to auto-print to PDF.


The core function is xpdf/poppler pdftotext -layout, which extracts simple plain text in different layouts. I will not describe it further, as it is covered in so many other places.

So let's first "round trip" that text to PDF.

Let's see that in the console:
pdftotext -layout -enc UTF-8 input.pdf - (NOTE: at this time we do NOT need to see page feeds)

undesired input

...

Line 53
Line 54
♀Line 55
Line 56
Line 57
Line 58
Line 59
Line 60
♀

So there are page feeds after line 54 and after line 60. Let's save the text as output.txt without them, using -nopgbrk:

pdftotext -nopgbrk -layout -enc UTF-8 input.pdf output.txt
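
For a whole folder of PDFs, the same command could be driven from Python. This is a minimal sketch that assumes pdftotext is installed and on the PATH; the function names and the folder layout are illustrative only.

```python
import subprocess
from pathlib import Path


def pdftotext_command(pdf: Path, txt: Path) -> list:
    """Build the pdftotext call shown above for one input/output pair."""
    return ["pdftotext", "-nopgbrk", "-layout", "-enc", "UTF-8", str(pdf), str(txt)]


def convert_folder(folder: Path) -> None:
    """Write a .txt file next to every .pdf in folder by running pdftotext."""
    for pdf in sorted(folder.glob("*.pdf")):
        # check=True raises CalledProcessError if pdftotext reports a failure
        subprocess.run(pdftotext_command(pdf, pdf.with_suffix(".txt")), check=True)
```
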

Now, I did not say setup is easy, but it is only needed once for thousands of files.
CAUTION: before leaping up and down shouting "Eureka", this simple method has one key common disadvantage (possibly more) that is mentioned [*] later.

A docX file is a zip archive with multiple parts. Thus our template needs to be a Word working folder with the minimal components.

WorkFolder

  • Our script (for me it's MakeDocX.cmd, which can be used to loop through files)
  • output.txt (from pdftotext; in a batch run this would vary constantly)
  • possibly our input PDF (again a variable file that could be overwritten)
  • DocXheader.txt (this is the one where you set font name and height; 24 units = 12 points)
<?xml version="1.0" encoding="UTF-8" standalone="yes"?><w:document …

Let's start with that subfolder. Unsurprisingly, it needs just one file:

  • document.xml.rels
<?xml version="1.0" encoding="UTF-8"?><Relationships …
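
The same packaging idea can be sketched in Python with only the standard library. This is not the MakeDocX.cmd script or its template files, whose contents are not shown here; the three parts below are a commonly used minimal set, and the function name and XML strings are assumptions for illustration. Each line of text becomes one paragraph in Consolas at size 24 (half-points, so 12 pt).

```python
import zipfile
from xml.sax.saxutils import escape

CONTENT_TYPES = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
    '<Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>'
    '<Default Extension="xml" ContentType="application/xml"/>'
    '<Override PartName="/word/document.xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>'
    '</Types>'
)
ROOT_RELS = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
    '<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" Target="word/document.xml"/>'
    '</Relationships>'
)
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
RUN_PROPS = '<w:rPr><w:rFonts w:ascii="Consolas" w:hAnsi="Consolas"/><w:sz w:val="24"/></w:rPr>'


def text_to_docx(text, target):
    """Write text as a minimal .docx; target is a path or a binary file object."""
    # escape() turns &, < and > into entities so the XML stays well-formed
    paragraphs = "".join(
        f'<w:p><w:r>{RUN_PROPS}<w:t xml:space="preserve">{escape(line)}</w:t></w:r></w:p>'
        for line in text.splitlines()
    )
    document = (
        '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
        f'<w:document xmlns:w="{W_NS}"><w:body>{paragraphs}</w:body></w:document>'
    )
    with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as package:
        package.writestr("[Content_Types].xml", CONTENT_TYPES)
        package.writestr("_rels/.rels", ROOT_RELS)
        package.writestr("word/document.xml", document)
```

A call such as text_to_docx(open("output.txt", encoding="utf-8").read(), "output.docx") would then write the package in one step.
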

[*] During testing, one simple plain-text character caused a failure in the XML: a raw & in the output.txt from pdftotext. Thus ALL & MUST be replaced with &amp;. There are likely other candidates that need similar text replacement. Luckily only 5 are listed at https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references#Predefined_entities_in_XML so we can easily use a replacement for those (in my case jrepl.bat handled the replacement). So after pdftotext and before the conversion I have: jrepl "&" "&amp;" /F dirty.txt /O output.txt
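
In Python, the same replacement could be done with the standard library instead of jrepl.bat. The sketch below covers all five predefined XML entities; escape_xml_text is a hypothetical helper name.

```python
from xml.sax.saxutils import escape


def escape_xml_text(text: str) -> str:
    """Replace the five characters that have predefined XML entities."""
    # escape() handles &, < and > itself; the two quote characters are extras.
    return escape(text, {'"': "&quot;", "'": "&apos;"})
```

Because escape() replaces & first, the entities it inserts afterwards are not escaped a second time.
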

Answered By: K J

# pip install python-docx
# (re, os, sys and pathlib are part of the standard library)
 
import re
import os
from pathlib import Path
import sys
from docx import Document
 
# Location of the input files
input_path = r'c:\Folder7\input'
# Location where the docx files will be written
output_path = r'c:\Folder7\output'
# Create the folder structure if it does not exist
os.makedirs(output_path, exist_ok=True)
 
# Check that the input folder exists
directory_path = Path(input_path)
if directory_path.exists() and directory_path.is_dir():
    print(directory_path, "exists")
else:
    print(directory_path, "is invalid")
    sys.exit(1)
 
for file_path in directory_path.glob("*"):
    # file_path is a Path object
 
    print("Processing file:", file_path)
    document = Document()
    # file_path.name is the name of the file as str without the Path
    document.add_heading(file_path.name, 0)
 
    file_content = file_path.read_text(encoding='UTF-8')
    document.add_paragraph(file_content)
 
    # build the new path where we store the files
    output_file_path = os.path.join(output_path, file_path.name + ".docx")
 
    document.save(output_file_path)
    print("Converted file:", file_path, "to:", output_file_path)
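
Note that file_path.name + ".docx" yields names such as notes.txt.docx. If notes.docx is preferred, the output path could be built as in the sketch below; docx_path_for is a hypothetical helper, not part of python-docx.

```python
from pathlib import Path


def docx_path_for(source: Path, output_dir: Path) -> Path:
    """Return output_dir/<source name with its last suffix replaced by .docx>."""
    return output_dir / source.with_suffix(".docx").name
```

In the loop above, str(docx_path_for(file_path, Path(output_path))) could then be passed to document.save.
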


Answered By: Just Me