Python Script for counting the number of Pages for each PDF in a directory
Question:
I am new to Python, and I am trying to create a script that will list all the PDFs in a directory and the number of pages in each of the files.
I have used the recommended code from this thread: Using Python to pull the number of pages in all the pdf documents in a directory
However, there were two problems:
DeprecationError: PdfFileReader is deprecated and was removed in PyPDF2 3.0.0. Use PdfReader instead.
DeprecationError: reader.getNumPages is deprecated and was removed in PyPDF2 3.0.0. Use len(reader.pages) instead.
I used the recommendations but get the error:
AttributeError: ‘PdfReader’ object has no attribute ‘len’
How can I fix this?
Thanks
Answers:
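The deprecation message isn’t very helpful. Note that len is a built-in function, not a method on the reader, so the suggested call is written len(reader.pages). Alternatively, you can use the separate pdfreader package. Its PDFDocument’s pages() function returns a generator, so to count the pages you need to exhaust the generator. You could do it like this: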
from pdfreader import PDFDocument
from glob import glob
for file in glob('/Volumes/G-Drive/*.pdf'):
    with open(file, 'rb') as pdf:
        doc = PDFDocument(pdf)
        # pages() is a generator, so build a list to count its items
        print(file, len(list(doc.pages())), 'pages')
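To illustrate the generator point: a generator has no length, so calling len() on it directly fails, and it has to be consumed to be counted. The following is a minimal sketch using only the standard library. fake_pages is a made-up stand-in for doc.pages() and is not part of pdfreader; count_items is likewise a hypothetical helper name.

```python
from typing import Iterable, Iterator


def fake_pages(n: int) -> Iterator[str]:
    """Hypothetical stand-in for a generator of pages such as doc.pages()."""
    for i in range(n):
        yield f"page {i + 1}"


def count_items(items: Iterable[object]) -> int:
    """Count items by consuming the iterable, without building a list."""
    return sum(1 for _ in items)


# len() does not work on a generator object and raises TypeError
try:
    len(fake_pages(3))
except TypeError as exc:
    len_error = str(exc)

pages_via_list = len(list(fake_pages(3)))   # builds a temporary list first
pages_via_sum = count_items(fake_pages(3))  # counts lazily, one item at a time
print(pages_via_list, pages_via_sum)
```

Both approaches give the same count; the sum(...) form may be preferable for very large documents because it never holds all pages in memory at once.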
PyPDF2’s PdfReader works as well once you use len(pdf.pages):
import pandas as pd
import os
from PyPDF2 import PdfReader
columns = ['fileName', 'fileLocation', 'pageNumber']
rows = []
for root, dirs, files in os.walk(r'/home/papers'):
    for f in files:
        if f.endswith(".pdf"):
            path = os.path.join(root, f)
            # PdfReader accepts a path, so no file handle is left open
            pdf = PdfReader(path)
            rows.append([f, path, len(pdf.pages)])
# Build the DataFrame once instead of concatenating inside the loop
df = pd.DataFrame(rows, columns=columns)
print(df.head())  # head is a method, so it needs parentheses
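One thing to watch in the loop above: f.endswith(".pdf") is case-sensitive, so a file named REPORT.PDF would be skipped. A minimal sketch of a recursive search that matches the extension in any letter case, using only the standard library, might look like this (find_pdfs is a hypothetical helper name, not part of PyPDF2):

```python
import os
from pathlib import Path
from typing import List


def find_pdfs(top: str) -> List[Path]:
    """Return all files under `top` whose name ends in .pdf, in any letter case."""
    found = []
    for root, _dirs, files in os.walk(top):
        for name in files:
            # Lower-case the name first so .PDF and .Pdf also match
            if name.lower().endswith(".pdf"):
                found.append(Path(root) / name)
    return sorted(found)
```

The returned paths can then be passed to PdfReader one by one, for example len(PdfReader(path).pages).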
If you want to use the newer pypdf package, here is the code. The only thing you need to install is pypdf:
pip install pypdf
Then you can run:
from pathlib import Path
from typing import Mapping
from pypdf import PdfReader
# Replace with the directory you want to search
directory = Path("C:/YourDirToSearch/")
def get_num_pages(pdf_file: Path) -> int:
    reader = PdfReader(pdf_file)
    return len(reader.pages)
def get_pdf_page_numbers(directory: Path) -> Mapping[Path, int]:
    # Only matches *.pdf directly inside the directory, not in subfolders
    return {file: get_num_pages(file) for file in directory.glob("*.pdf")}
print(get_pdf_page_numbers(directory))
As a result you get something like:
{
"path1.pdf": 1,
"path2.pdf": 2,
}
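If some files in the directory are damaged or otherwise unreadable, a single failure would stop the whole run. One possible way around that, sketched here under the assumption that you pass in your own counting function, is to catch the error per file and record None instead. count_pages_safely is a hypothetical helper name, and the broad except is deliberate because the exception types raised for bad files vary.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, Optional


def count_pages_safely(
    files: Iterable[Path],
    count_pages: Callable[[Path], int],
) -> Dict[Path, Optional[int]]:
    """Map each file to its page count, or to None if counting failed."""
    results: Dict[Path, Optional[int]] = {}
    for file in files:
        try:
            results[file] = count_pages(file)
        except Exception as exc:  # broad on purpose: error types vary by file
            print(f"Skipping {file}: {exc}")
            results[file] = None
    return results
```

With the functions defined earlier, it could be called as count_pages_safely(directory.glob("*.pdf"), get_num_pages).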