Repairing pdfs with damaged xref table

Question:

Are there any solutions (preferably in Python) that can repair pdfs with damaged xref tables?

I have a pdf that I tried to convert to a png in Ghostscript and received the following error:

**** Error: An error occurred while reading an XREF table.
**** The file has been damaged. This may have been caused
**** by a problem while converting or transfering the file.

However, I am able to open the pdf in Preview on my Mac and when I export the pdf using Preview, I am able to convert the exported pdf.

Is there any way to repair pdfs without having to manually open them and export them?

Asked By: Kelvin


Answers:

If the file renders as expected in Ghostscript then you can run it through GS to the pdfwrite device and create a new PDF file which won’t be damaged.
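The pdfwrite round trip described above can be scripted from Python by calling the Ghostscript command line. The following is only a minimal sketch: it assumes the executable is called `gs` and is on the PATH (on Windows it is typically `gswin64c`), and the function names `build_gs_command` and `repair_with_ghostscript` are made up for illustration.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def build_gs_command(in_path: Path, out_path: Path, gs_binary: str = "gs") -> List[str]:
    """Build the Ghostscript command line that rewrites a PDF via pdfwrite."""
    return [
        gs_binary,
        "-o", str(out_path),   # output file; -o also implies -dBATCH -dNOPAUSE
        "-sDEVICE=pdfwrite",   # re-emit the document as a new PDF
        str(in_path),
    ]


def repair_with_ghostscript(in_path: Path, out_path: Path, gs_binary: str = "gs") -> None:
    """Run Ghostscript; raises CalledProcessError if it exits with an error."""
    if shutil.which(gs_binary) is None:
        raise RuntimeError(f"Ghostscript executable '{gs_binary}' was not found on PATH")
    subprocess.run(build_gs_command(in_path, out_path, gs_binary), check=True)
```

Calling `repair_with_ghostscript(Path("my.pdf"), Path("my_repaired.pdf"))` writes the rewritten file next to the original, so the damaged input is never overwritten. Ghostscript still prints its warnings about the damaged xref while it runs.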

Preview is (like Acrobat) almost certainly silently repairing the problem in the background. Ghostscript does the same, but unlike other applications we feel you need to know that the file has a problem: firstly so that you know it’s broken, and secondly so that if the file renders incorrectly in Ghostscript (or indeed in other applications) you know why.

Note that there are two main reasons for a damaged xref. Firstly, the developer of the producing application didn’t read the specification carefully enough, so the file offsets in the xref are correct but the format is incorrect (this is not uncommon, and a repair by GS will be harmless). Secondly, the file genuinely has been damaged in transit or by editing it.

In the latter case there may be other problems, and Ghostscript will try to warn you about those too. If you don’t get any other warnings or errors, then it’s probably just a malformed xref table.
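As a rough way to tell the two cases apart before repairing, one can check whether the offset recorded after `startxref` actually lands on a cross-reference section. The sketch below is not what Ghostscript does internally. It is a simple heuristic with hypothetical helper names, and it only checks where the offset points, not whether the entries inside the table are correctly formatted.

```python
import re
from typing import Optional


def startxref_offset(data: bytes) -> Optional[int]:
    """Return the byte offset recorded after the last 'startxref', or None."""
    matches = re.findall(rb"startxref\s+(\d+)", data)
    return int(matches[-1]) if matches else None


def xref_offset_looks_valid(data: bytes) -> bool:
    """True if the startxref offset points at an xref table or an object header."""
    offset = startxref_offset(data)
    if offset is None or offset >= len(data):
        return False
    target = data[offset:offset + 32]
    # A classic table starts with the keyword 'xref'. PDF 1.5+ files may
    # instead point at an indirect object ("12 0 obj") that holds a
    # cross-reference stream.
    if target.startswith(b"xref"):
        return True
    return re.match(rb"\d+\s+\d+\s+obj", target) is not None
```

If this returns False for the bytes of a file (read with `open(path, "rb")`), the offsets themselves are off, which hints at the "damaged in transit or by editing" case rather than a merely malformed table.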

Answered By: KenS

I know I’m super late, but you could try…

cp my.pdf temp.pdf && hexdump -C temp.pdf > temp.hex

or

zip my.zip my.pdf && unzip -o my.zip

If you opened the document in UTF-8 text mode and saved it, then you probably changed some key bytes around, specifically octal 012, hexadecimal 0A, decimal 10. This is the line feed or "new line" character, and line endings matter in a PDF because the xref table records byte offsets: if the number of bytes in the line endings changes, the offsets no longer match.

You can dump the bytes in octal or hexadecimal with hexdump, search the whole document for bad newline characters and change them back to an ASCII newline.
Be sure to open the document in bytes mode (or with encoding='ascii'). You’ll have to get out a character matrix…
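To see which newline bytes a file actually contains without altering it, the file can be read in bytes mode and the line endings counted. This is only a sketch with made-up helper names. Compressed streams inside a PDF can contain these byte values too, and CR, LF and CRLF are all legal line endings in a PDF, so treat the counts as a hint rather than a diagnosis.

```python
from typing import Dict


def count_line_endings(data: bytes) -> Dict[str, int]:
    """Count CRLF pairs, lone LF bytes (0x0A) and lone CR bytes (0x0D)."""
    crlf = data.count(b"\r\n")
    return {
        "CRLF": crlf,
        "LF": data.count(b"\n") - crlf,  # LF not preceded by CR
        "CR": data.count(b"\r") - crlf,  # CR not followed by LF
    }


def read_pdf_bytes(path: str) -> bytes:
    """Read the file in binary mode so no newline translation or decoding happens."""
    with open(path, "rb") as fh:
        return fh.read()
```

Comparing `count_line_endings(read_pdf_bytes("my.pdf"))` for the damaged file and for a known-good copy (for example the one exported from Preview) shows whether the line endings were rewritten somewhere along the way.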

I’ve heard of people just compressing the file with zip and uncompressing it to fix this problem as well.

Whenever you fiddle around in a PDF, first make a new copy, then fiddle with the copy.

TL;DR

On line 17 of your document you hit a << or ASCII 'Line/page Separator' character. The guillemet, or double chevron, isn’t used for that in UTF-8, so your reader panicked and raised an error.

PDF grew out of PostScript. If you want to learn how to go crazy on a PDF, I recommend learning PostScript.
This forbidden text is a good start

Answered By: TheCableGUI

Disclaimer: I am the author of borb, the library used in this answer.

Simply opening a PDF in borb and writing it back out should fix some corrupt PDF documents (including a corrupt XREF).

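from pathlib import Path
import typing

from borb.pdf import Document
from borb.pdf import PDF


def fix_pdf(in_path: Path, out_path: Path) -> None:
    doc: typing.Optional[Document] = None
    # Read the (possibly damaged) PDF in binary mode and let borb parse it.
    with open(in_path, "rb") as fh:
        doc = PDF.loads(fh)
    # Write the parsed document to a new file, leaving the original untouched.
    with open(out_path, "wb") as fh:
        PDF.dumps(fh, doc)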

borb is an open source, pure Python PDF library that creates, modifies and reads PDF documents. You can download it using:

pip install borb

Alternatively, you can build from source by forking/downloading the GitHub repository.

Answered By: Joris Schellekens