PyPDF2 PdfFileMerger losing PDF form data in merged file

Question

I am merging PDF files with PyPDF2, but when one of the files contains a PDF form (a "module") filled with data, as is typical of an application-filled PDF, the form is empty in the merged file: no data is shown.

Here are the two methods I am using to merge the PDFs:

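from PyPDF2 import PdfFileMerger, PdfFileReader, PdfFileWriter

# out_root_path and cf are defined elsewhere in the script.
def merge_pdf_files(pdf_files, i):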
    pdf_merger = PdfFileMerger(strict=False)
    for pdf in pdf_files:
        pdf_merger.append(pdf)
    output_filename = '{out_root}{prog}.{cf}.pdf'.format(out_root=out_root_path, prog=i+1, cf=cf)
    pdf_merger.write(output_filename)

def merge_pdf_files2(pdf_files, i):
    output = PdfFileWriter()
    for pdf in pdf_files:
        reader = PdfFileReader(pdf)  # renamed from "input" to avoid shadowing the builtin
        for page in reader.pages:
            output.addPage(page)
    output_filename = '{out_root}{prog}.{cf}.pdf'.format(out_root=out_root_path, prog=i+1, cf=cf)
    with open(output_filename,'wb') as output_stream:
        output.write(output_stream)

I would expect the final, merged PDF to show all the data filled into the PDF form.
Alternatively, can someone point me to another Python library that does not suffer from this (apparent) bug?
Thanks

UPDATE
I also tried PyMuPDF, with the same results.

import fitz  # PyMuPDF

def merge_pdf_files4(pdf_files, i):
    output = fitz.open()
    for pdf in pdf_files:
        source = fitz.open(pdf)  # renamed from "input" to avoid shadowing the builtin
        output.insertPDF(source)
    output_filename = '{out_root}{prog}.{cf}.pdf'.format(out_root=out_root_path, prog=i+1, cf=cf)
    output.save(output_filename)

I also tried PyPDF4, with the same result as PyPDF2.

I also tried external tools launched from the script with a command line:

subprocess.call(cmd, shell=True)

I tried pdftk first, but it failed too.
The only one that worked was PDFill, commercial version; $19 spent on the task… 🙁
Too bad I couldn’t find an open source, platform-independent solution.

Asked By: A_E


Answer 1

I finally worked it out myself, and I am sharing it here in the hope that it is useful to others.

It’s been a tough task.

In the end I stuck with the pdfrw library (https://pypi.org/project/pdfrw/ and https://github.com/pmaupin/pdfrw), which gives a good PDF-DOM representation, very close to the PDF structure publicly documented in Adobe’s official reference (https://www.adobe.com/devnet/pdf/pdf_reference.html).

Using this library, PyCharm’s object inspector and Adobe’s documentation, I could experiment with the output file’s structure, and I found out that the simple one-line merge:

    from pdfrw import PdfReader, PdfWriter

    output = PdfWriter()
    input = PdfReader(pdf_filename)
    output.addpages(input.pages)

would not add the AcroForm node to the output PDF file, hence losing all form fields.
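
One way to confirm this on your own files is to count the top-level form fields before and after merging. The helper below is only a sketch: count_form_fields is a hypothetical name, not part of pdfrw, and it assumes the reader exposes Root, AcroForm and Fields as attributes, the way a pdfrw PdfReader does.

```python
def count_form_fields(reader):
    """Return the number of top-level AcroForm fields, or 0 if there is no form."""
    # A missing AcroForm (or a missing Fields array) is treated as "no fields".
    acroform = getattr(reader.Root, 'AcroForm', None)
    if acroform is None:
        return 0
    fields = getattr(acroform, 'Fields', None)
    return len(fields) if fields is not None else 0


# Hypothetical usage with pdfrw (the file names are placeholders):
#   from pdfrw import PdfReader
#   print(count_form_fields(PdfReader('filled_form.pdf')))
#   print(count_form_fields(PdfReader('merged.pdf')))
```

If the first count is positive and the second is zero, the form fields were dropped during the merge.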

So I had to write my own code to merge, as best I can, the AcroForm nodes of the various input files.

I stress the “as best I can” part, because the merge function I ended up with is far from perfect, but at least it works for me and can help others build on it if they need to.

One important thing to do is to rename the form fields in order to avoid conflicts, so I renamed them to FILE_{file_num}_FIELD_{field_num}_{original_name}.
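
A minimal sketch of that renaming, using the FILE_/FIELD_ prefix format that appears in the full function further down. The function name prefixed_field_name is hypothetical, and the input is assumed to be the raw field name as it appears in the PDF, wrapped in parentheses.

```python
def prefixed_field_name(file_num, field_num, raw_name):
    """Build a field name that is unique across all merged files.

    raw_name is the /T value in its "(name)" form, e.g. "(Name)".
    """
    # Drop the parentheses that delimit the name in the PDF source.
    original = raw_name.replace('(', '').replace(')', '')
    return 'FILE_{n}_FIELD_{m}_{on}'.format(n=file_num, m=field_num, on=original)


# Two files that both contain a field called "Name" no longer collide.
print(prefixed_field_name(0, 0, '(Name)'))
print(prefixed_field_name(1, 0, '(Name)'))
```

Including both the file index and the field index keeps names unique even when one file repeats a field name.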

Then, not knowing exactly how to merge CO, DA, DR and NeedAppearances nodes, I simply add the nodes of the first source file that has them. If the same node is present in subsequent files, I skip it.

The one exception is fonts: I merge the contents of the Font subnode of the DR node.
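
The merge rule described above (the first file wins, except that fonts are combined) can be sketched with plain dicts standing in for pdfrw's dictionary objects. This is an illustration only: merge_acroform_entries is a hypothetical helper, and the string keys such as '/DR' and the sample values are placeholders for the PdfName keys and PDF objects used in the real code.

```python
def merge_acroform_entries(target, source):
    """Merge source into target in place; entries already in target are kept."""
    for key, value in source.items():
        if key == '/Fields':
            continue  # field lists are concatenated separately
        # setdefault only stores the value when the key is still missing
        target.setdefault(key, value)
    # Fonts are the exception: add every font the target does not have yet.
    source_fonts = source.get('/DR', {}).get('/Font', {})
    if source_fonts:
        target_fonts = target['/DR'].setdefault('/Font', {})
        for font_name, font in source_fonts.items():
            target_fonts.setdefault(font_name, font)
    return target


first = {'/DA': '(/Helv 0 Tf 0 g)', '/DR': {'/Font': {'/Helv': 'font-1'}}}
second = {'/DA': '(/Arial 10 Tf)', '/NeedAppearances': 'true',
          '/DR': {'/Font': {'/Helv': 'other', '/Arial': 'font-2'}}}
merged = merge_acroform_entries(first, second)
print(merged['/DA'])                   # the value from the first file is kept
print(sorted(merged['/DR']['/Font']))  # font names from both files
```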

Last note: in my first attempt, all the above manipulation was done on the output’s trailer attribute. Then I found out that each time I added the pages from a new input file, pdfrw seemed to erase any AcroForm already present in the trailer.
I don’t know the reason, but I had to build an output_acroform variable and assign it to the output file on the line before writing out the final PDF.

In the end, here’s my code.
Forgive me if it’s not Pythonic; I just hope it clarifies the points above.

from pdfrw import PdfReader, PdfWriter, PdfName


def merge_pdf_files_pdfrw(pdf_files, output_filename):
  output = PdfWriter()
  num = 0
  output_acroform = None
  for pdf in pdf_files:
      reader = PdfReader(pdf, verbose=False)
      output.addpages(reader.pages)
      if PdfName('AcroForm') in reader[PdfName('Root')].keys():  # Not all PDFs have an AcroForm node
          source_acroform = reader[PdfName('Root')][PdfName('AcroForm')]
          if PdfName('Fields') in source_acroform:
              output_formfields = source_acroform[PdfName('Fields')]
          else:
              output_formfields = []
          num2 = 0
          for form_field in output_formfields:
              key = PdfName('T')
              old_name = form_field[key].replace('(','').replace(')','')  # Field names are in the "(name)" format
              form_field[key] = 'FILE_{n}_FIELD_{m}_{on}'.format(n=num, m=num2, on=old_name)
              num2 += 1
          if output_acroform is None:
              # copy the first AcroForm node
              output_acroform = source_acroform
          else:
              for key in source_acroform.keys():
                  # Add new AcroForms keys if output_acroform already existing
                  if key not in output_acroform:
                      output_acroform[key] = source_acroform[key]
              # Add missing font entries in /DR node of source file
              if (PdfName('DR') in source_acroform.keys()) and (PdfName('Font') in source_acroform[PdfName('DR')].keys()):
                  if PdfName('Font') not in output_acroform[PdfName('DR')].keys():
                      # if output_acroform is missing entirely the /Font node under an existing /DR, simply add it
                      output_acroform[PdfName('DR')][PdfName('Font')] = source_acroform[PdfName('DR')][PdfName('Font')]
                  else:
                      # else add new fonts only
                      for font_key in source_acroform[PdfName('DR')][PdfName('Font')].keys():
                          if font_key not in output_acroform[PdfName('DR')][PdfName('Font')]:
                              output_acroform[PdfName('DR')][PdfName('Font')][font_key] = source_acroform[PdfName('DR')][PdfName('Font')][font_key]
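          if PdfName('Fields') not in output_acroform:
              output_acroform[PdfName('Fields')] = output_formfields
          elif output_acroform[PdfName('Fields')] is not output_formfields:
              # Add new fields; skip when both names refer to the same array
              # (e.g. the first file), which would otherwise list its fields twice
              output_acroform[PdfName('Fields')] += output_formfields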
      num +=1
  output.trailer[PdfName('Root')][PdfName('AcroForm')] = output_acroform
  output.write(output_filename)

Hope this helps.

Answered By: A_E

Answer 2

@A_E, can’t tell you how much time this saved. Thank you! Brought here from https://github.com/pmaupin/pdfrw/issues/192

To admins: I recognize this is an old question, but it ranks high in Google searches for this information and is referenced in the GitHub repo for the library.

I had a very similar requirement: one part of a form has space for 3 items and, if there are more, I build a separate schedule and attach it as a new page. However, I was getting what looked like blank field values in the resulting PDF when viewed in Kofax PDF, Acrobat Reader or Evince (Linux). The fields would show their values in Gmail’s PDF viewer, or if viewed in a separate browser window (Edge and Chrome worked). They would also show once a field was clicked on, or after changing its font, alignment, etc. in its properties. Exporting the data and reimporting also worked, but that wouldn’t be feasible in my application.

I’m adding this not as an answer, but to provide the code I ended up with after making some changes for my setup: instead of passing in files, I already have some "in memory" readers (the original form, and the additional schedules of extra items).

Replying here to say thank you, and to anyone else that lands here, this method does seem to work (I can’t imagine the process of digging through the debugger and documentation to figure it out).

I pass in a list of PdfReaders, with NeedAppearances set on the first one as shown below; otherwise the fields continued to appear blank until clicked on. pdf_writer is subsequently used in another method to save to the appropriate place. Every other method I used to combine the form with non-form PDFs resulted in the same behaviour.

self.template_pdf.Root.AcroForm.update(pdfrw.PdfDict(NeedAppearances=pdfrw.PdfObject('true')))

where self.template_pdf was a PdfReader instance for the main form.

import pdfrw

def merge_pdf_files_pdfrw(pdf_readers, pdf_writer):
    # output = pdfrw.PdfWriter()
    output_acroform = None
    for reader_idx, pdf_reader in enumerate(pdf_readers):
        # input = PdfReader(pdf,verbose=False)
        pdf_writer.addpages(pdf_reader.pages)
        if pdfrw.PdfName.AcroForm in pdf_reader[pdfrw.PdfName.Root].keys():  # Not all PDFs have an AcroForm node
            source_acroform = pdf_reader[pdfrw.PdfName.Root][pdfrw.PdfName.AcroForm]
            if pdfrw.PdfName.Fields in source_acroform:
                output_formfields = source_acroform[pdfrw.PdfName.Fields]
            else:
                output_formfields = []

            for ff_idx, form_field in enumerate(output_formfields):
                key = pdfrw.PdfName.T
                old_name = form_field[key].replace('(', '').replace(')', '')  # Field names are in the "(name)" format
                form_field[key] = f'FILE_{reader_idx}_FIELD_{ff_idx}_{old_name}'

            if output_acroform is None:
                # copy the first AcroForm node
                output_acroform = source_acroform
            else:
                for key in source_acroform.keys():
                    # Add new AcroForms keys if output_acroform already existing
                    if key not in output_acroform:
                        output_acroform[key] = source_acroform[key]
                # Add missing font entries in /DR node of source file
                if (pdfrw.PdfName.DR in source_acroform.keys()) and (
                        pdfrw.PdfName.Font in source_acroform[pdfrw.PdfName.DR].keys()):
                    if pdfrw.PdfName.Font not in output_acroform[pdfrw.PdfName.DR].keys():
                        # if output_acroform is missing entirely the /Font node under an existing /DR, simply add it
                        output_acroform[pdfrw.PdfName.DR][pdfrw.PdfName.Font] = \
                            source_acroform[pdfrw.PdfName.DR][pdfrw.PdfName.Font]
                    else:
                        # else add new fonts only
                        for font_key in source_acroform[pdfrw.PdfName.DR][pdfrw.PdfName.Font].keys():
                            if font_key not in output_acroform[pdfrw.PdfName.DR][pdfrw.PdfName.Font]:
                                output_acroform[pdfrw.PdfName.DR][pdfrw.PdfName.Font][font_key] = \
                                    source_acroform[pdfrw.PdfName.DR][pdfrw.PdfName.Font][font_key]
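            if pdfrw.PdfName.Fields not in output_acroform:
                output_acroform[pdfrw.PdfName.Fields] = output_formfields
            elif output_acroform[pdfrw.PdfName.Fields] is not output_formfields:
                # Add new fields; skip when both names refer to the same array
                # (e.g. the first reader), which would otherwise list its fields twice
                output_acroform[pdfrw.PdfName.Fields] += output_formfields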

    pdf_writer.trailer[pdfrw.PdfName.Root][pdfrw.PdfName.AcroForm] = output_acroform

Answered By: AMG
