Python Script for counting the number of Pages for each PDF in a directory

Question:

I am new to Python, and I am trying to create a script that will list all the PDFs in a directory and the number of pages in each file.

I have used the recommended code from this thread: Using Python to pull the number of pages in all the pdf documents in a directory

However, there were two problems:

DeprecationError: PdfFileReader is deprecated and was removed in PyPDF2 3.0.0. Use PdfReader instead.

DeprecationError: reader.getNumPages is deprecated and was removed in PyPDF2 3.0.0. Use len(reader.pages) instead.

I used the recommendations but get the error:

AttributeError: ‘PdfReader’ object has no attribute ‘len’

How can I fix this?

Thanks

Asked By: Assassin47


Answers:

The deprecation warning isn’t very helpful. As an alternative to PyPDF2, you could use the pdfreader package. Note that a PDFDocument’s pages() method returns a generator, and a generator has no length, so in order to count the pages you need to exhaust it. You could do it like this:

from pdfreader import PDFDocument
from glob import glob

for file in glob('/Volumes/G-Drive/*.pdf'):
    with open(file, 'rb') as pdf:
        doc = PDFDocument(pdf)
        print(file, len(list(doc.pages())), 'pages')
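
To illustrate the generator point in isolation, here is a small sketch that needs no PDF library. `fake_pages` and `count_items` are hypothetical names used only for this example; `fake_pages` merely stands in for a generator like the one `doc.pages()` is described as returning above.

```python
from typing import Iterable, Iterator


def fake_pages(n: int) -> Iterator[str]:
    """Hypothetical stand-in for a generator that yields pages."""
    for i in range(n):
        yield f"page {i + 1}"


def count_items(items: Iterable[object]) -> int:
    """Count items by consuming the iterable, without building a list."""
    return sum(1 for _ in items)


pages = fake_pages(3)
try:
    len(pages)  # generators do not support len()
except TypeError as exc:
    print("len() failed:", exc)

print(count_items(pages))  # consumes the generator while counting
```

Counting with `sum(1 for _ in ...)` avoids holding every page object in memory at once, which `len(list(...))` does. Either way the generator is used up afterwards, so a second count on the same object would give zero.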
Answered By: Pingu

With this modification, PdfReader from PyPDF2 works as well: call len(pdf.pages) instead of getNumPages().

import os

import pandas as pd
from PyPDF2 import PdfReader

# Collect one row per PDF, then build the DataFrame once at the end.
rows = []
for root, dirs, files in os.walk(r'/home/papers'):
    for f in files:
        if f.endswith(".pdf"):
            path = os.path.join(root, f)
            # The with-block closes the file once the pages are counted.
            with open(path, 'rb') as fh:
                rows.append([f, path, len(PdfReader(fh).pages)])
df = pd.DataFrame(rows, columns=['fileName', 'fileLocation', 'pageNumber'])
print(df.head())  # head is a method, so it needs the parentheses
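
For context on the AttributeError in the question: `len` is a Python built-in function, not an attribute of the reader object, so it has to be called as `len(reader.pages)`. The question does not show the failing line, but presumably it used something like `reader.len`. The sketch below reproduces the difference with a hypothetical stand-in class (`FakeReader` is not part of PyPDF2 or pypdf).

```python
class FakeReader:
    """Hypothetical stand-in exposing a list-like `pages` attribute."""

    def __init__(self, page_count: int) -> None:
        self.pages = [f"page {i + 1}" for i in range(page_count)]


reader = FakeReader(3)

# Correct: pass the pages collection to the built-in len().
num_pages = len(reader.pages)
print(num_pages)

# Incorrect: the object has no attribute called `len`.
try:
    reader.len
except AttributeError as exc:
    error_message = str(exc)
    print(error_message)
```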
Answered By: delirium78

If you want to use the newer pypdf package, here is the code.

The only thing you need to install is pypdf:

pip install pypdf

Then you can run:

from pathlib import Path
from typing import Mapping

from pypdf import PdfReader

directory = Path("C://YourDirToSearch/")

def get_num_pages(pdf_file: Path) -> int:
    reader = PdfReader(pdf_file)
    return len(reader.pages)

def get_pdf_page_numbers(directory: Path) -> Mapping[Path, int]:
    return {file: get_num_pages(file) for file in directory.glob("*.pdf")}

print(get_pdf_page_numbers(directory))

As a result, you get something like this (the keys are actually Path objects, simplified here):

{
 "path1.pdf": 1,
 "path2.pdf": 2,
}
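
If the directory has subfolders or contains damaged files, a variation like the following may be useful. It is only a sketch: `collect_page_counts` and `count_with_pypdf` are hypothetical helper names, and the page counter is passed in as a parameter so the directory-walking logic can be tried without any PDF library installed. `Path.rglob` searches recursively, and a file whose counter raises an exception is recorded as `None` instead of stopping the run.

```python
from pathlib import Path
from typing import Callable, Dict, Optional


def count_with_pypdf(pdf_file: Path) -> int:
    """Count pages with pypdf (imported lazily; needs `pip install pypdf`)."""
    from pypdf import PdfReader

    return len(PdfReader(pdf_file).pages)


def collect_page_counts(
    directory: Path,
    count_pages: Callable[[Path], int] = count_with_pypdf,
) -> Dict[str, Optional[int]]:
    """Map each PDF path under `directory` (recursively) to its page count."""
    results: Dict[str, Optional[int]] = {}
    for pdf_file in sorted(directory.rglob("*.pdf")):
        try:
            results[str(pdf_file)] = count_pages(pdf_file)
        except Exception:  # e.g. an unreadable or corrupt file
            results[str(pdf_file)] = None
    return results
```

Calling `collect_page_counts(directory)` with the `directory` defined earlier would use pypdf by default. Catching every exception is a deliberately broad choice for a batch listing; narrowing it to the library's own error types would be stricter.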
Answered By: Jan Tkacik