Reading the PDF properties/metadata in Python

Question:

How can I read the properties/metadata like Title, Author, Subject and Keywords stored on a PDF file using Python?

Asked By: Quicksilver

Answers:

Try pdfminer:

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument

fp = open('diveintopython.pdf', 'rb')
parser = PDFParser(fp)
doc = PDFDocument(parser)

print(doc.info)  # The "Info" metadata

Here’s the output:

>>> [{'CreationDate': 'D:20040520151901-0500',
  'Creator': 'DocBook XSL Stylesheets V1.52.2',
  'Keywords': 'Python, Dive Into Python, tutorial, object-oriented, programming, documentation, book, free',
  'Producer': 'htmldoc 1.8.23 Copyright 1997-2002 Easy Software Products, All Rights Reserved.',
  'Title': 'Dive Into Python'}]

For more info, look at this tutorial: A lightweight XMP parser for extracting PDF metadata in Python.
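The CreationDate value in the output above is a raw PDF date string, not a datetime. A minimal sketch of turning it into one with the standard library only follows. The helper name parse_pdf_date is made up for illustration, and it assumes the full-length D:YYYYMMDDHHmmSS form with an optional offset; shorter forms are not handled.

```python
import re
from datetime import datetime, timedelta, timezone

# Full-length PDF date: D:YYYYMMDDHHmmSS plus an optional offset written as
# -0500, -05'00' or Z. Shorter variants are deliberately not handled here.
_PDF_DATE = re.compile(
    r"D:(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})"
    r"(?:(Z)|([+-])(\d{2})'?(\d{2})'?)?"
)


def parse_pdf_date(value):
    """Convert a PDF date string (str or bytes) to a datetime (hypothetical helper)."""
    if isinstance(value, bytes):
        value = value.decode("ascii")
    match = _PDF_DATE.fullmatch(value)
    if match is None:
        raise ValueError(f"unsupported PDF date: {value!r}")
    year, month, day, hour, minute, second = (int(g) for g in match.groups()[:6])
    zulu, sign, tz_hours, tz_minutes = match.groups()[6:]
    tzinfo = None  # stays naive when the string carries no offset
    if zulu:
        tzinfo = timezone.utc
    elif sign:
        offset = timedelta(hours=int(tz_hours), minutes=int(tz_minutes))
        tzinfo = timezone(-offset if sign == "-" else offset)
    return datetime(year, month, day, hour, minute, second, tzinfo=tzinfo)


print(parse_pdf_date("D:20040520151901-0500"))
```

The same helper also accepts the apostrophe style, e.g. "D:20111108111228-05'00'".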

Answered By: namit

I have implemented this using pypdf; see the sample code below. pypdf has been maintained again since December 2022. The PyPDF2 project was merged back into pypdf.

from pypdf import PdfReader
pdf_toread = PdfReader(open("doc2.pdf", "rb"))
pdf_info = pdf_toread.metadata
print(str(pdf_info))

Output:

{'/Title': u'Microsoft Word - Agnico-Eagle - Complaint (00040197-2)', '/CreationDate': u"D:20111108111228-05'00'", '/Producer': u'Acrobat Distiller 10.0.0 (Windows)', '/ModDate': u"D:20111108112409-05'00'", '/Creator': u'PScript5.dll Version 5.2.2', '/Author': u'LdelPino'}
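Since the question asks for Title, Author, Subject and Keywords specifically, here is a small sketch that picks just those entries out of a mapping shaped like the output above (keys with a leading slash). The helper name pick_fields is hypothetical, the sample dict is copied from the output so the snippet runs without a PDF, and the None check assumes the library may return None when a file has no metadata at all.

```python
def pick_fields(info, names=("Title", "Author", "Subject", "Keywords")):
    """Return the requested entries from a '/Key'-style metadata mapping.

    Hypothetical helper: missing entries come back as None.
    """
    if info is None:  # assumption: no metadata dictionary in the file
        return {name: None for name in names}
    return {name: info.get("/" + name) for name in names}


# Sample data taken from the output shown above.
sample = {
    "/Title": "Microsoft Word - Agnico-Eagle - Complaint (00040197-2)",
    "/CreationDate": "D:20111108111228-05'00'",
    "/Producer": "Acrobat Distiller 10.0.0 (Windows)",
    "/ModDate": "D:20111108112409-05'00'",
    "/Creator": "PScript5.dll Version 5.2.2",
    "/Author": "LdelPino",
}

print(pick_fields(sample))
```

With a real file you would pass pdf_toread.metadata instead of sample.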

Answered By: Quicksilver

As the maintainer of pypdf I strongly recommend pypdf 🙂

from pypdf import PdfReader

reader = PdfReader("test.pdf")

# See what is there:
print(str(reader.metadata))

# Or just access specific values:
print(reader.metadata.creation_date)  # that is actually a datetime object!

Install using pip install pypdf --upgrade.

See also: How to read/write metadata with pypdf

Answered By: Morten Zilmer

For Python 3 and new pdfminer (pip install pdfminer3k):

from pdfminer.pdfparser import PDFParser, PDFDocument

fp = open("foo.pdf", 'rb')
parser = PDFParser(fp)
doc = PDFDocument(parser)
# Link the parser and the document to each other before reading doc.info
parser.set_document(doc)
doc.set_parser(parser)
if len(doc.info) > 0:
    info = doc.info[0]  # the first Info dictionary
    print(info)

Answered By: Rabash

Try pdfreader. You can access the document catalog's Metadata entry like this:

from pdfreader import PDFDocument

f = open("foo.pdf", 'rb')
doc = PDFDocument(f)
metadata = doc.root.Metadata

Answered By: Maksym Polshcha

pikepdf provides an easy and reliable way to do this.

I tested this with a bunch of PDF files, and it seems there are two distinct ways metadata gets inserted when a PDF is created. Some files end up with NUL bytes and other gibberish in the values. pikepdf handles both well.

import pikepdf
p = pikepdf.Pdf.open(r'path/to/file.pdf')
str(p.docinfo['/Author'])  # mind the slash

This returns a string – if you wrapped it with str. Examples:

  • 'Normal person'
  • 'ABC'

Comparing with other options:

  • pdfminer – not actively maintained
  • pdfminer.six – active
  • pdfreader – active (but it still suggests using easy_install, among other things)
  • pypdf – active
  • PyPDF2 – merged back into pypdf. PyPDF2==3.0.0 and pypdf==3.1.0 are essentially the same, but development continues in pypdf
  • borb – active

pdfminer.six

pip install pdfminer.six

import pdfminer.pdfparser
import pdfminer.pdfdocument
h = open('path/to/file.pdf', 'rb')
p = pdfminer.pdfparser.PDFParser(h)
d = pdfminer.pdfdocument.PDFDocument(p)  # PDFDocument lives in pdfminer.pdfdocument
d.info[0]['Author']

This returns a bytes object, including the non-decodable characters if they are present. Examples:

  • b'Normal person'
  • b'\xfe\xff\x00A\x00B\x00C' (ABC)

To convert to a string:

  • b'Normal person'.decode() yields the string 'Normal person'
  • b'\xfe\xff\x00A\x00B\x00C'.decode(encoding='utf-8', errors='ignore').replace('\x00', '') yields the string 'ABC'
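The leading \xfe\xff in the second example is a UTF-16 big-endian byte order mark, so the two conversions above can be folded into one helper that checks for it. This is a rough sketch, not part of pdfminer.six: the name decode_pdf_text is made up, and falling back to latin-1 for values without a BOM is an assumption standing in for the PDF-specific single-byte encoding.

```python
import codecs


def decode_pdf_text(raw):
    """Decode a metadata value given as bytes (hypothetical helper)."""
    bom = codecs.BOM_UTF16_BE  # b'\xfe\xff'
    if raw.startswith(bom):
        # UTF-16BE text: drop the BOM and decode the rest.
        return raw[len(bom):].decode("utf-16-be", errors="replace")
    # Assumption: treat everything else as latin-1, which never fails to decode.
    return raw.decode("latin-1")


print(decode_pdf_text(b"Normal person"))
print(decode_pdf_text(b"\xfe\xff\x00A\x00B\x00C"))
```

Unlike stripping NUL bytes after a lossy utf-8 decode, this keeps non-ASCII characters such as accented letters intact.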

pdfreader

pip install pdfreader

import pdfreader
h = open(r'path/to/file.pdf', 'rb')
d = pdfreader.PDFDocument(h)
d.metadata['Author']

This returns either the string with the requested information, or a string containing the hex representation of the data it found. This then also includes the same non-decodable characters. Examples:

  • 'Normal person'
  • 'FEFF004100420043' (ABC)

You would then first need to detect whether this is still ‘encoded’, which I think is quite a nuisance. The second can be made a sensible string by calling this ugly piece of code:

s = 'FEFF004100420043'
''.join([c for c in (chr(int(s[i:i+2], 16)) for i in range(0, len(s), 2)) if c.isascii()]).replace('\x00', '')
>>> 'ABC'
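A somewhat tidier take on the same idea, as a sketch only: treat the value as hex when it starts with FEFF (the UTF-16 big-endian byte order mark), decode it as UTF-16, and otherwise return it unchanged. The helper name decode_hex_metadata is hypothetical, and the FEFF prefix check is a heuristic of mine, so hex-encoded values without that prefix pass through untouched.

```python
def decode_hex_metadata(value):
    """Decode a 'FEFF...' hex string to text; leave other strings alone.

    Hypothetical helper, not part of pdfreader.
    """
    looks_encoded = len(value) % 2 == 0 and value.upper().startswith("FEFF")
    if not looks_encoded:
        return value
    try:
        raw = bytes.fromhex(value)
    except ValueError:  # not valid hex after all
        return value
    # Skip the two BOM bytes and decode the remainder as UTF-16 big-endian.
    return raw[2:].decode("utf-16-be", errors="replace")


print(decode_hex_metadata("Normal person"))
print(decode_hex_metadata("FEFF004100420043"))
```

This also keeps non-ASCII characters, which the isascii() filter above would drop.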

Borb

pip install borb

import borb.pdf.pdf
h = open(r'path/to/file.pdf', 'rb')
d: borb.pdf.document.Document = borb.pdf.pdf.PDF.loads(h)
str(d.get_document_info().get_author())

This returns a string – if you wrapped it with str. Loading a sizeable PDF takes a long time. I had one PDF on which borb choked with a TypeError exception. See also the examples on borb’s dedicated example repo.

Answered By: parvus