Reading the PDF properties/metadata in Python
Question:
How can I read the properties/metadata like Title, Author, Subject and Keywords stored on a PDF file using Python?
Answers:
Try pdfminer:
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
fp = open('diveintopython.pdf', 'rb')
parser = PDFParser(fp)
doc = PDFDocument(parser)
print(doc.info) # The "Info" metadata
Here’s the output:
>>> [{'CreationDate': 'D:20040520151901-0500',
'Creator': 'DocBook XSL Stylesheets V1.52.2',
'Keywords': 'Python, Dive Into Python, tutorial, object-oriented, programming, documentation, book, free',
'Producer': 'htmldoc 1.8.23 Copyright 1997-2002 Easy Software Products, All Rights Reserved.',
'Title': 'Dive Into Python'}]
For more info, look at this tutorial: A lightweight XMP parser for extracting PDF metadata in Python.
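The CreationDate value in the output above is a PDF date string, not a datetime. A minimal sketch of converting it with the standard library follows. The helper name parse_pdf_date is hypothetical (it is not part of pdfminer), and it assumes the common D:YYYYMMDDHHmmSS layout with an optional UTC offset, written either as -0500 or as -05'00'.

```python
import re
from datetime import datetime, timedelta, timezone

# Everything after the year is optional in a PDF date string.
_PDF_DATE = re.compile(
    r"^D:(?P<year>\d{4})(?P<month>\d{2})?(?P<day>\d{2})?"
    r"(?P<hour>\d{2})?(?P<minute>\d{2})?(?P<second>\d{2})?"
    r"(?P<sign>[+\-Z])?(?P<tzh>\d{2})?'?(?P<tzm>\d{2})?'?$"
)


def parse_pdf_date(value):
    """Hypothetical helper: turn a PDF date string into a datetime, or None."""
    if isinstance(value, bytes):  # some parsers hand back raw bytes
        value = value.decode("ascii", errors="replace")
    match = _PDF_DATE.match(value.strip())
    if match is None:
        return None
    g = match.groupdict()
    try:
        parsed = datetime(
            int(g["year"]), int(g["month"] or 1), int(g["day"] or 1),
            int(g["hour"] or 0), int(g["minute"] or 0), int(g["second"] or 0),
        )
        if g["sign"] is None:
            return parsed  # no offset given: return a naive datetime
        if g["sign"] == "Z":
            return parsed.replace(tzinfo=timezone.utc)
        offset = timedelta(hours=int(g["tzh"] or 0), minutes=int(g["tzm"] or 0))
        if g["sign"] == "-":
            offset = -offset
        return parsed.replace(tzinfo=timezone(offset))
    except ValueError:  # e.g. month 13 or an impossible offset
        return None


print(parse_pdf_date("D:20040520151901-0500"))  # 2004-05-20 15:19:01-05:00
```

Returning None for anything unparseable keeps the caller simple, since metadata written by other tools does not always follow the format.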
I have implemented this using pypdf; see the sample code below. pypdf has been actively maintained again since December 2022, and the PyPDF2 project was merged back into it.
from pypdf import PdfReader
pdf_toread = PdfReader(open("doc2.pdf", "rb"))
pdf_info = pdf_toread.metadata
print(str(pdf_info))
Output:
{'/Title': u'Microsoft Word - Agnico-Eagle - Complaint (00040197-2)', '/CreationDate': u"D:20111108111228-05'00'", '/Producer': u'Acrobat Distiller 10.0.0 (Windows)', '/ModDate': u"D:20111108112409-05'00'", '/Creator': u'PScript5.dll Version 5.2.2', '/Author': u'LdelPino'}
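The keys in that output are PDF names, so each one starts with a slash, and fields that were never set (Subject and Keywords here) are simply absent. A small sketch of turning such a mapping into a plain dict and reading the four fields from the question; plain_keys is a hypothetical helper, and the sample dict is a trimmed copy of the output above rather than a live pypdf object.

```python
def plain_keys(metadata):
    """Hypothetical helper: drop the leading '/' from every metadata key.

    Accepts any mapping, or None in case no metadata was found.
    """
    return {str(key).lstrip("/"): str(value) for key, value in (metadata or {}).items()}


# Trimmed copy of the output shown above, used as stand-in data.
sample = {
    "/Title": "Microsoft Word - Agnico-Eagle - Complaint (00040197-2)",
    "/Author": "LdelPino",
    "/Producer": "Acrobat Distiller 10.0.0 (Windows)",
}

info = plain_keys(sample)
for field in ("Title", "Author", "Subject", "Keywords"):
    # dict.get avoids a KeyError for fields the PDF producer never wrote
    print(field, "=", info.get(field, "<not set>"))
```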
As the maintainer of pypdf I strongly recommend pypdf 🙂
from pypdf import PdfReader
reader = PdfReader("test.pdf")
# See what is there:
print(str(reader.metadata))
# Or just access specific values:
print(reader.metadata.creation_date) # that is actually a datetime object!
Install using pip install pypdf --upgrade.
See also: How to read/write metadata with pypdf
For Python 3 and new pdfminer (pip install pdfminer3k):
from pdfminer.pdfparser import PDFParser, PDFDocument
fp = open("foo.pdf", 'rb')
parser = PDFParser(fp)
doc = PDFDocument(parser)
parser.set_document(doc)
doc.set_parser(parser)
# doc.info is a list of Info dictionaries and may be empty
if len(doc.info) > 0:
    info = doc.info[0]
    print(info)
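Since doc.info is a list of dictionaries, pulling out just Title, Author, Subject and Keywords takes a little care. A sketch under stated assumptions: pick_fields is a hypothetical helper, the sample list copies the pdfminer output quoted earlier in this thread, and letting the first dictionary win is my assumption, chosen to match the doc.info[0] access above.

```python
WANTED = ("Title", "Author", "Subject", "Keywords")


def pick_fields(info_list, wanted=WANTED):
    """Hypothetical helper: collect the wanted fields from a list of Info dicts.

    The first dictionary that defines a field wins; missing fields map to None.
    """
    merged = {}
    for info in info_list:
        for key, value in info.items():
            merged.setdefault(key, value)
    return {name: merged.get(name) for name in wanted}


# Stand-in for doc.info, copied from the sample output in this thread.
sample_info = [{
    "CreationDate": "D:20040520151901-0500",
    "Creator": "DocBook XSL Stylesheets V1.52.2",
    "Keywords": "Python, Dive Into Python, tutorial, object-oriented, "
                "programming, documentation, book, free",
    "Title": "Dive Into Python",
}]

fields = pick_fields(sample_info)
print(fields["Title"])   # Dive Into Python
print(fields["Author"])  # None
```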
Try pdfreader
You can access the document catalog's Metadata entry like this:
from pdfreader import PDFDocument
f = open("foo.pdf", 'rb')
doc = PDFDocument(f)
metadata = doc.root.Metadata
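The catalog's Metadata entry holds an XMP packet, which is XML, so once you have its text you can read the fields with the standard library. The sketch below parses a made-up sample packet. SAMPLE_XMP and xmp_fields are hypothetical names, and how to get the raw XML out of the Metadata object depends on the library and is not shown here.

```python
import xml.etree.ElementTree as ET

# Namespaces commonly used for these fields in XMP packets.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "pdf": "http://ns.adobe.com/pdf/1.3/",
}

# Made-up packet for illustration only.
SAMPLE_XMP = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
 <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about=""
      xmlns:dc="http://purl.org/dc/elements/1.1/"
      xmlns:pdf="http://ns.adobe.com/pdf/1.3/">
   <dc:title><rdf:Alt><rdf:li xml:lang="x-default">Dive Into Python</rdf:li></rdf:Alt></dc:title>
   <dc:creator><rdf:Seq><rdf:li>Example Author</rdf:li></rdf:Seq></dc:creator>
   <pdf:Keywords>Python, tutorial</pdf:Keywords>
  </rdf:Description>
 </rdf:RDF>
</x:xmpmeta>"""


def xmp_fields(xmp_xml):
    """Hypothetical helper: read Title/Author/Subject/Keywords from XMP text."""
    root = ET.fromstring(xmp_xml)

    def first_li(tag):
        # Title, creator and description are stored as rdf:li items in a container.
        node = root.find(f".//{tag}//rdf:li", NS)
        return node.text if node is not None else None

    keywords = root.find(".//pdf:Keywords", NS)
    return {
        "Title": first_li("dc:title"),
        "Author": first_li("dc:creator"),
        "Subject": first_li("dc:description"),
        "Keywords": keywords.text if keywords is not None else None,
    }


print(xmp_fields(SAMPLE_XMP))
```

Real packets can store the same fields in other shapes (for example as attributes on rdf:Description), which this sketch does not handle.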
pikepdf provides an easy and reliable way to do this.
I tested this with a bunch of PDF files, and it seems there are two distinct ways to insert metadata when the PDF is created. Some insert NUL bytes and other gibberish. pikepdf handles both well.
import pikepdf
p = pikepdf.Pdf.open(r'path/to/file.pdf')
str(p.docinfo['/Author']) # mind the slash
This returns a string if you wrap it with str. Examples:
'Normal person'
'ABC'
Comparing with other options:
- pdfminer – not actively maintained
- pdfminer.six – active
- pdfreader – active (but still suggests using easy_install, among other things)
- pypdf – active
- PyPDF2 – was merged back into pypdf. PyPDF2==3.0.0 and pypdf==3.1.0 are essentially the same, but development continues in pypdf
- Borb – active
Pdfminer.six:
pip install pdfminer.six
import pdfminer.pdfparser
import pdfminer.pdfdocument
h = open('path/to/file.pdf', 'rb')
p = pdfminer.pdfparser.PDFParser(h)
d = pdfminer.pdfdocument.PDFDocument(p)
d.info[0]['Author']
This returns a byte string, including the non-decodable characters if they are present. Examples:
b'Normal person'
b'\xfe\xff\x00A\x00B\x00C'
(ABC)
To convert to a string:
b'Normal person'.decode()
yields the string 'Normal person'
b'\xfe\xff\x00A\x00B\x00C'.decode(encoding='utf-8', errors='ignore').replace('\x00', '')
yields the string 'ABC'
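The second value starts with the bytes FE FF, which is the UTF-16 big-endian byte order mark, so the "gibberish" is really UTF-16 text. Stripping bytes as above works for ASCII but silently drops accented characters. A sketch of a decoder that checks for the mark first; decode_pdf_text is a hypothetical helper, and falling back to Latin-1 is my assumption, used here as a rough stand-in for the PDF default text encoding.

```python
import codecs


def decode_pdf_text(raw):
    """Hypothetical helper: decode a raw PDF text string to str."""
    if raw.startswith(codecs.BOM_UTF16_BE):
        # Skip the two-byte mark, then decode the rest as UTF-16 big-endian.
        return raw[2:].decode("utf-16-be", errors="replace")
    if raw.startswith(codecs.BOM_UTF8):
        return raw[3:].decode("utf-8", errors="replace")
    # Assumption: Latin-1 approximates the default encoding well enough here.
    return raw.decode("latin-1")


print(decode_pdf_text(b"Normal person"))                # Normal person
print(decode_pdf_text(b"\xfe\xff\x00A\x00B\x00C"))      # ABC
print(decode_pdf_text(b"\xfe\xff\x00J\x00o\x00s\x00\xe9"))  # José
```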
pdfreader
pip install pdfreader
import pdfreader
h = open(r'path/to/file.pdf', 'rb')
d = pdfreader.PDFDocument(h)
d.metadata['Author']
This returns either the string with the requested information, or a string containing the hex representation of the data it found. This then also includes the same non-decodable characters. Examples:
'Normal person'
'FEFF004100420043'
(ABC)
You would then first need to detect whether this is still ‘encoded’, which I think is quite a nuisance. The second can be made a sensible string by calling this ugly piece of code:
s = 'FEFF004100420043'
''.join([c for c in (chr(int(s[i:i+2], 16)) for i in range(0, len(s), 2)) if c.isascii()]).replace('\x00', '')
>>> 'ABC'
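A somewhat less ugly alternative is to treat the hex text as what it appears to be: UTF-16 big-endian bytes behind an FEFF byte order mark. The sketch below only decodes values that match that pattern and returns everything else unchanged, which also covers the detection step. decode_hex_metadata is a hypothetical helper, not part of pdfreader.

```python
import re

# "FEFF" followed by whole 16-bit code units written as hex digits.
_HEX_UTF16 = re.compile(r"^FEFF(?:[0-9A-F]{4})*$", re.IGNORECASE)


def decode_hex_metadata(value):
    """Hypothetical helper: decode 'FEFF...' hex text, pass other strings through."""
    if _HEX_UTF16.match(value):
        # Drop the four hex digits of the mark, then decode the remaining bytes.
        return bytes.fromhex(value[4:]).decode("utf-16-be", errors="replace")
    return value


print(decode_hex_metadata("Normal person"))     # Normal person
print(decode_hex_metadata("FEFF004100420043"))  # ABC
```

Unlike the ASCII filter above, this keeps non-ASCII characters such as accented letters.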
Borb
pip install borb
import borb.pdf.pdf
h = open(r'path/to/file.pdf', 'rb')
d: borb.pdf.document.Document = borb.pdf.pdf.PDF.loads(h)
str(d.get_document_info().get_author())
This returns a string if you wrap it with str. Loading a sizeable PDF takes a long time. I had one PDF on which borb choked with a TypeError exception. See also the examples on borb’s dedicated example repo.