PyPDF2: extract table of contents/outlines and their page numbers
Question:
I am trying to extract the TOC/outlines from PDFs, along with their page numbers, using Python (PyPDF2). I am aware of reader.outlines, but it does not return the correct page numbers.
PDF example: https://www.annualreports.com/HostedData/AnnualReportArchive/l/NASDAQ_LOGM_2018.pdf
The output of reader.outlines is:
[{'/Title': '2018 Highlights', '/Page': IndirectObject(5, 0), '/Type': '/Fit'},
{'/Title': 'Letter to Stockholders', '/Page': IndirectObject(6, 0), '/Type': '/Fit'},
...
{'/Title': 'Part I', '/Page': IndirectObject(10, 0), '/Type': '/Fit'},
[{'/Title': 'Item 1. Business', '/Page': IndirectObject(10, 0), '/Type': '/Fit'},
{'/Title': 'Item 1A. Risk Factors', '/Page': IndirectObject(19, 0), '/Type': '/Fit'}
...
For instance, Part I was not expected to begin at page 10. Am I missing something?
Does anyone have an alternative?
I’ve tried PyMuPDF, Tabula and the getDestinationPageNumber method with no luck.
Thank you in advance.
Answers:
Check out the package called Tabula. It makes extracting tables easy, and it has options for tables that extend over multiple pages.
Here is a link worth checking out: https://towardsdatascience.com/scraping-table-data-from-pdf-files-using-a-single-line-in-python-8607880c750
Martin Thoma’s answer is exactly what I needed (PyMuPDF).
Diblo Dk’s answer is an interesting workaround as well (PyPDF2).
I am citing Martin Thoma’s code:
from typing import Dict

import fitz  # pip install pymupdf


def get_bookmarks(filepath: str) -> Dict[int, str]:
    # WARNING! One page can have multiple bookmarks!
    # Only the last title seen for a page is kept.
    bookmarks = {}
    with fitz.open(filepath) as doc:
        # get_toc() was spelled getToC() in older PyMuPDF releases.
        toc = doc.get_toc()  # [[lvl, title, page], …]
        for level, title, page in toc:
            bookmarks[page] = title
    return bookmarks


print(get_bookmarks("my.pdf"))
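As the warning in the code above says, a dict keyed by page keeps only one title per page. A minimal sketch of one way around that, assuming rows in the [level, title, page] shape that the TOC call returns: collect every entry for a page in a list. The function name group_toc_by_page and the sample rows are made up for illustration and do not come from the example PDF.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def group_toc_by_page(toc: Sequence[Sequence]) -> Dict[int, List[Tuple[int, str]]]:
    """Map each page number to every (level, title) pair that points at it."""
    pages: Dict[int, List[Tuple[int, str]]] = defaultdict(list)
    for row in toc:
        # Take the first three fields and ignore any extras a row may carry.
        level, title, page = row[:3]
        pages[page].append((level, title))
    return dict(pages)


# Hand-written rows for illustration only; the page numbers are hypothetical.
sample_toc = [
    [1, "Part I", 12],
    [2, "Item 1. Business", 12],
    [2, "Item 1A. Risk Factors", 21],
]
grouped = group_toc_by_page(sample_toc)
print(grouped)
```

With PyMuPDF installed, the same function could presumably be fed the list returned by the TOC call in get_bookmarks instead of sample_toc.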
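You should refer to “PDF outlines and their Page Number”:
from PyPDF2 import PdfFileReader  # PyPDF2 < 3.0; newer releases rename this API

targetPDFFile = 'your_pdf_filename.pdf'
pdfFileObj = open(targetPDFFile, 'rb')
# pdfReader was not defined in the original snippet; assumed to be built like this.
pdfReader = PdfFileReader(pdfFileObj)

# Use the outline rather than bookmarks; the outline is more accurate.
result = {}


def outline_dict(bookmark_list):
    for item in bookmark_list:
        if isinstance(item, list):
            # A nested list holds child entries: recurse into it.
            outline_dict(item)
        else:
            try:
                # getDestinationPageNumber() is 0-based, so add 1.
                pageNum = pdfReader.getDestinationPageNumber(item) + 1
                # Items that share a page number overwrite one another.
                result[pageNum] = item.title
            except Exception as exc:
                print("except:", item, exc)


outline_dict(pdfReader.getOutlines())
print(result)
pdfFileObj.close()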
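A note on the numbers in the question: IndirectObject(10, 0) is a reference to the page object (object number 10, generation 0), not a page number, which is why the outline entry has to be resolved to a page index first. The sketch below shows the same recursive walk as above, but it keeps the nesting level and every entry instead of overwriting by page. It is only an illustration: flatten_outline and page_index_of are hypothetical names, and the sample data is a stand-in built from plain dicts, not real PyPDF2 objects.

```python
from typing import Any, Callable, List, Tuple


def flatten_outline(
    outline: List[Any],
    page_index_of: Callable[[Any], int],
    level: int = 1,
) -> List[Tuple[int, str, int]]:
    """Return (level, title, 1-based page) rows for a nested outline."""
    rows: List[Tuple[int, str, int]] = []
    for item in outline:
        if isinstance(item, list):
            # A nested list holds the children of the preceding entry.
            rows.extend(flatten_outline(item, page_index_of, level + 1))
        else:
            # page_index_of is expected to return a 0-based page index.
            rows.append((level, item["/Title"], page_index_of(item) + 1))
    return rows


# Stand-in data: "/Page" holds a made-up 0-based index instead of an
# IndirectObject, so the example runs without a PDF.
sample_outline = [
    {"/Title": "Part I", "/Page": 11},
    [
        {"/Title": "Item 1. Business", "/Page": 11},
        {"/Title": "Item 1A. Risk Factors", "/Page": 20},
    ],
]
rows = flatten_outline(sample_outline, lambda item: item["/Page"])
for row in rows:
    print(row)
```

With PyPDF2, page_index_of would presumably be pdfReader.getDestinationPageNumber (spelled get_destination_page_number in newer releases), and the outline would come from the reader rather than from sample_outline.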