How can I replace text in a PDF using Python?

Question:

I have a pdf file and I want to replace some text within the PDF file and generate new PDF file. How can I do that in Python?

Asked By: Dax Amin

||

Answers:

Have a look in THIS thread for one of the many ways to read text from a PDF. Then you’ll need to create a new pdf, as they will, as far as I know, not retrieve any formatting for you.

Answered By: Stiffo

The CAM::PDF Perl Library can output text that’s not too hard to parse (it seems to fairly randomly split lines of text). I couldn’t be bothered to learn too much Perl, so I wrote these really basic Perl command line scripts, one that reads a single page pdf to a text file perl read.pl pdfIn.pdf textOut.txt and one that writes the text (that you can modify in the meantime) to a pdf perl write.pl pdfIn.pdf textIn.txt pdfOut.pdf.

#!/usr/bin/perl
use Module::Load;
load "CAM::PDF";

$pdfIn = $ARGV[0];
$textOut = $ARGV[1];

$pdf = CAM::PDF->new($pdfIn);
$page = $pdf->getPageContent(1);

open(my $fh, '>', $textOut);
print $fh $page;
close $fh;

exit;

and

#!/usr/bin/perl
use Module::Load;
load "CAM::PDF";

$pdfIn = $ARGV[0];
$textIn = $ARGV[1];
$pdfOut = $ARGV[2];

$pdf = CAM::PDF->new($pdfIn);

my $page;
   open(my $fh, '<', $textIn) or die "cannot open file $filename";
   {
       local $/;
       $page = <$fh>;
   }
close($fh);

$pdf->setPageContent(1, $page);

$pdf->cleanoutput($pdfOut);

exit;

You can call these with python either side of doing some regex etc stuff on the outputted text file.

If you’re completely new to Perl (like I was), you need to make sure that Perl and CPAN are installed, then run sudo cpan, then in the prompt install "CAM::PDF";, this will install the required modules.

Also, I realise that I should probably be using stdout etc, but I was in a hurry 🙂

Also also, any ideas what the format CAM-PDF outputs is? is there any doc for it?

Answered By: leontrolski

You can try Aspose.PDF Cloud SDK for Python, Aspose.PDF Cloud is a REST API PDF Processing solution. It is paid API and its free package plan provides 50 credits per month.

I’m developer evangelist at Aspose.

import os
import asposepdfcloud
from asposepdfcloud.apis.pdf_api import PdfApi

# Get App key and App SID from https://cloud.aspose.com
pdf_api_client = asposepdfcloud.api_client.ApiClient(
    app_key='xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx',
    app_sid='xxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxxxx')

pdf_api = PdfApi(pdf_api_client)
filename = '02_pages.pdf'
remote_name = '02_pages.pdf'
copied_file= '02_pages_new.pdf'
#upload PDF file to storage
pdf_api.upload_file(remote_name,filename)

#upload PDF file to storage
pdf_api.copy_file(remote_name,copied_file)

#Replace Text
text_replace = asposepdfcloud.models.TextReplace(old_value='origami',new_value='polygami',regex='true')
text_replace_list = asposepdfcloud.models.TextReplaceListRequest(text_replaces=[text_replace])

response = pdf_api.post_document_text_replace(copied_file, text_replace_list)
print(response)
Answered By: Tilal Ahmad

disclaimer: I am the author of borb, the library used in this answer

This example comes straight out of the examples repository, which you can find here.

It starts by creating a PDF document:

#!chapter_007/src/snippet_012.py
from borb.pdf import Document
from borb.pdf import Page
from borb.pdf import PageLayout, SingleColumnLayout
from borb.pdf import Table, FixedColumnWidthTable
from borb.pdf import Paragraph
from borb.pdf import PDF

from decimal import Decimal


def main():

    # create empty Document
    doc: Document = Document()

    # add new Page
    pge: Page = Page()
    doc.add_page(pge)

    # set PageLayout
    lay: PageLayout = SingleColumnLayout(pge)

    # add Table
    tab: Table = FixedColumnWidthTable(number_of_columns=2, number_of_rows=3)
    tab.add(Paragraph("Name:", font="Helvetica-Bold"))
    tab.add(Paragraph("Schellekens"))
    tab.add(Paragraph("Firstname:", font="Helvetica-Bold"))
    tab.add(Paragraph("Jots"))
    tab.add(Paragraph("Title:", font="Helvetica-Bold"))
    tab.add(Paragraph("CEO borb"))
    tab.set_padding_on_all_cells(Decimal(5), Decimal(5), Decimal(5), Decimal(5))
    lay.add(tab)

    # store
    with open("output.pdf", 'wb') as pdf_file_handle:
        PDF.dumps(pdf_file_handle, doc)

if __name__ == "__main__":
    main()

This document should look like this:

enter image description here

It contains a misspelling of my name, which we will correct using the following code:

#!chapter_007/src/snippet_013.py
from borb.pdf import Document
from borb.pdf import PDF
from borb.toolkit import SimpleFindReplace

import typing


def main():

    # attempt to read a PDF
    doc: typing.Optional[Document] = None
    with open("output.pdf", "rb") as pdf_file_handle:
        doc = PDF.loads(pdf_file_handle)

    # check whether we actually read a PDF
    assert doc is not None

    # find/replace
    doc = SimpleFindReplace.sub("Jots", "Joris", doc)

    # store
    with open("output2.pdf", "wb") as pdf_file_handle:
        PDF.dumps(pdf_file_handle, doc)


if __name__ == "__main__":
    main()

This gives us the following output:

enter image description here

Keep in mind that SimpleFindReplace does not handle complex re-flow. It only handles cases where you want to replace some text by some other text, without any influence on surrounding text.

You can obtain borb using pip or download its source code here.

Answered By: Joris Schellekens

The documents I am dealing with use subsetted fonts and character maps. I came up with this solution. It translates the search and replacement strings "backwards" and proceeds to modify the actual object stream data:

#!/usr/bin/env python3
import argparse
import pypdf
from typing import Any, Callable, Dict, Tuple, Union, cast
from pypdf.generic import DictionaryObject, NameObject, RectangleObject
from pypdf.constants import PageAttributes as PG
from pypdf._cmap import build_char_map

# from https://github.com/py-pdf/pypdf/blob/27d0e99/pypdf/_page.py#L1546
def get_char_maps(obj: Any, space_width: float = 200.0):
    cmaps: Dict[
        str,
        Tuple[
            str, float, Union[str, Dict[int, str]], Dict[str, str], DictionaryObject
        ],
    ] = {}
    objr = obj
    while NameObject(PG.RESOURCES) not in objr:
        # /Resources can be inherited sometimes so we look to parents
        objr = objr["/Parent"].get_object()
    resources_dict = cast(DictionaryObject, objr[PG.RESOURCES])
    if "/Font" in resources_dict:
        for f in cast(DictionaryObject, resources_dict["/Font"]):
            cmaps[f] = build_char_map(f, space_width, obj)
    return {cmap[4]["/BaseFont"]:cmap[3] for cmap in cmaps.values()}

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Replace text in a PDF file.')
    parser.add_argument('--input')
    parser.add_argument('--output')
    parser.add_argument('--papersize', default="A4")
    parser.add_argument('--font', required=False)
    parser.add_argument('--separator', default="→")
    parser.add_argument('replacements', nargs='*')
    args = parser.parse_args()
    replacements = [pair.split(args.separator) for pair in args.replacements]
    total_replacements = 0
    reader = pypdf.PdfReader(args.input)
    writer = pypdf.PdfWriter()

    for page_index, page in enumerate(reader.pages):
        print(f"Processing page {page_index+1}…")

        cmaps = get_char_maps(page)
        for fontname in cmaps.keys():
            if (not args.font):
                print(fontname)
            elif (fontname.endswith(args.font)):
                args.font = fontname
        if (not args.font):
            continue

        charmap = cmaps[args.font]
        reverse_charmap = {v:k for k,v in charmap.items()}
        def full_to_subsetted(full):
            subsetted = ''.join([reverse_charmap[c] for c in full])
            subsetted = subsetted.replace(r'(',r'(').replace(r')',r')') # TODO: which other characters must be escaped?
            return subsetted.encode('ascii') # TODO: which encoding is actually used?
        subsetted_replacements = [(full_to_subsetted(f),full_to_subsetted(t)) for f,t in replacements]

        page_replacements = 0
        # based on https://stackoverflow.com/questions/41769120/search-and-replace-for-text-within-a-pdf-in-python#69276885
        for content in page.get_contents():
            obj = content.get_object()
            data = obj.get_data()
            for f,t in subsetted_replacements:
                while (f in data):
                    data = data.replace(f, t, 1)
                    page_replacements += 1
            obj.set_data(data)

        if (page_replacements > 0):
            total_replacements += page_replacements
            print(f"Replaced {page_replacements} occurrences on this page.")
        papersize = getattr(pypdf.PaperSize, args.papersize)
        page.mediabox = RectangleObject((0, 0, papersize.width, papersize.height))
        writer.add_page(page)
    if (args.output):
        writer.write(args.output)
        print(f"Replaced {total_replacements} occurrences in document.")

This only works if the text you want to replace is typeset by a single Tj command using a specific syntax and without any adjustments to the character spacing. The streams must not be compressed. I use qpdf --qdf in.pdf uncompressed.pdf for decompression. D.Deriso’s answer was useful. Maybe my answer here helps someone else, too.

Answered By: Hermann
Categories: questions Tags: ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.