How to check if PDF is scanned image or contains text
Question:
I have a large number of files, some of them are scanned images into PDF and some are full/partial text PDF.
Is there a way to check these files to ensure that we are only processing files which are scanned images and not those that are full/partial text PDF files?
environment: PYTHON 3.6
Answers:
The code below will extract text data from both searchable and non-searchable PDFs.
import fitz  # PyMuPDF

text = ""
path = "Your_scanned_or_partial_scanned.pdf"
doc = fitz.open(path)
for page in doc:
    text += page.getText()
If you don’t have the fitz module, install it with:
pip install --upgrade pymupdf
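Once text has been extracted as above, a minimal classification helper can decide between the two categories. This is only a sketch: the 10-character threshold is an assumption you will likely want to tune for your corpus.

```python
def classify_pdf(extracted_text: str, min_chars: int = 10) -> str:
    """Classify a PDF from its already-extracted text.

    A scanned (image-only) PDF yields little or no extractable text,
    so a simple character-count threshold usually suffices.
    """
    if len(extracted_text.strip()) >= min_chars:
        return "text"
    return "scanned"
```

Note that this inherits the limitation discussed later on this page: OCRd scans carry an invisible text layer and will be classified as "text".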
How about checking the PDF metadata under '/Resources'?
Any text in a PDF (an electronic document) is very likely to reference a font: since the objective of PDF is to be a portable file format, it maintains the font definitions it uses. A page with no font in its resources therefore probably contains no text.
If you are a pypdf user, try:
from pypdf import PdfReader

reader = PdfReader(input_file_location)
page = reader.pages[page_num]
page_resources = page["/Resources"]
if "/Font" in page_resources:
    print(
        "[Info]: Looks like there is text in the PDF, contains:",
        page_resources.keys(),
    )
elif len(page_resources.get("/XObject", {})) != 1:
    print("[Info]: PDF contains:", page_resources.keys())

x_object = page_resources.get("/XObject", {})
for obj in x_object:
    obj_ = x_object[obj]
    if obj_["/Subtype"] == "/Image":
        print("[Info]: PDF is image only")
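The per-page check above can be factored into a standalone function. The resource dictionaries are shown here as plain dicts for illustration; with pypdf they would come from page["/Resources"] as in the snippet.

```python
def classify_page(resources: dict) -> str:
    """Classify one page from its /Resources dictionary.

    A '/Font' entry means the page almost certainly draws text; an
    '/XObject' of subtype '/Image' with no font present suggests a
    scanned page.
    """
    if "/Font" in resources:
        return "text"
    xobjects = resources.get("/XObject", {})
    for name in xobjects:
        if xobjects[name].get("/Subtype") == "/Image":
            return "image-only"
    return "unknown"
```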
Try OCRmyPDF. You can use this command to convert a scanned PDF to a digital PDF:
ocrmypdf input_scanned.pdf output_digital.pdf
If the input PDF is already digital, the command throws the error “PriorOcrFoundError: page already has text!”. You can check for this from Python:
import subprocess as sp
import re

output = sp.getoutput("ocrmypdf input.pdf output.pdf")
if not re.search("PriorOcrFoundError: page already has text!", output):
    print("Uploaded scanned pdf")
else:
    print("Uploaded digital pdf")
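Matching on the message substring rather than the full exception name is slightly sturdier against formatting changes between ocrmypdf versions. Here is that check factored into a function; the message wording is taken from the snippet above.

```python
def classify_from_ocrmypdf_output(output: str) -> str:
    """Interpret ocrmypdf's diagnostics: a complaint that a page
    already has text means the input was a digital PDF."""
    if "page already has text" in output:
        return "digital"
    return "scanned"
```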
def get_pdf_searchable_pages(fname):
    # pip install pdfminer
    from pdfminer.pdfpage import PDFPage
    searchable_pages = []
    non_searchable_pages = []
    page_num = 0
    with open(fname, 'rb') as infile:
        for page in PDFPage.get_pages(infile):
            page_num += 1
            if 'Font' in page.resources.keys():
                searchable_pages.append(page_num)
            else:
                non_searchable_pages.append(page_num)
    if page_num > 0:
        if len(searchable_pages) == 0:
            print(f"Document '{fname}' has {page_num} page(s). "
                  f"Complete document is non-searchable")
        elif len(non_searchable_pages) == 0:
            print(f"Document '{fname}' has {page_num} page(s). "
                  f"Complete document is searchable")
        else:
            print(f"searchable_pages : {searchable_pages}")
            print(f"non_searchable_pages : {non_searchable_pages}")
    else:
        print("Not a valid document")


if __name__ == '__main__':
    get_pdf_searchable_pages("1.pdf")
    get_pdf_searchable_pages("1Scanned.pdf")
Output:
Document '1.pdf' has 1 page(s). Complete document is searchable
Document '1Scanned.pdf' has 1 page(s). Complete document is non-searchable
Building on top of Rahul Agarwal’s solution, along with some snippets I found at this link, here is a possible algorithm that should solve your problem.
You need to install the fitz and PyMuPDF modules; you can do so with pip.
The following code has been tested with Python 3.7.9 and PyMuPDF 1.16.14. Moreover, it is important to install fitz BEFORE PyMuPDF, otherwise it raises a weird error about a missing frontend module (no idea why). So here is how I install the modules:
pip3 install fitz
pip3 install PyMuPDF==1.16.14
And here is the Python 3 implementation:
import fitz


def get_text_percentage(file_name: str) -> float:
    """
    Calculate the percentage of the document that is covered by (searchable) text.

    If the returned percentage of text is very low, the document is
    most likely a scanned PDF.
    """
    total_page_area = 0.0
    total_text_area = 0.0

    doc = fitz.open(file_name)
    for page_num, page in enumerate(doc):
        total_page_area = total_page_area + abs(page.rect)
        text_area = 0.0
        for b in page.getTextBlocks():
            r = fitz.Rect(b[:4])  # rectangle where block text appears
            text_area = text_area + abs(r)
        total_text_area = total_text_area + text_area
    doc.close()
    return total_text_area / total_page_area


if __name__ == "__main__":
    text_perc = get_text_percentage("my.pdf")
    print(text_perc)
    if text_perc < 0.01:
        print("fully scanned PDF - no relevant text")
    else:
        print("not fully scanned PDF - text is present")
Although this answers your question (i.e. distinguish between fully scanned and full/partial textual PDFs), this solution is not able to distinguish between full-textual PDFs and scanned PDFs that also have text within them (e.g. this is the case for scanned PDFs processed by OCR software – such as pdfsandwich or Adobe Acrobat – that adds "invisible" text blocks on top of the image, so that you can select the text).
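The area arithmetic in get_text_percentage can be separated from the PyMuPDF calls, which makes the coverage logic easy to test in isolation. In this sketch, the plain numbers stand in for abs(page.rect) and the per-block abs(r) values computed above.

```python
def text_coverage(page_areas, text_block_areas):
    """Fraction of total page area covered by text blocks.

    page_areas: one area per page (e.g. abs(page.rect) in PyMuPDF).
    text_block_areas: for each page, a list of text-block areas.
    """
    total_page = sum(page_areas)
    total_text = sum(sum(blocks) for blocks in text_block_areas)
    # Guard against an empty document rather than dividing by zero.
    return total_text / total_page if total_page else 0.0
```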
I created a script to detect whether a PDF was OCRd. The main idea: in OCRd PDFs, the text is invisible.
Algorithm to test whether a given PDF (f1) was OCRd:
- create a copy of f1, noted as f2
- delete all text on f2
- create images (PNG) for all (or just a few) pages of f1 and f2
- f1 was OCRd if all the images of f1 and f2 are identical
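The final comparison step can be sketched library-free by hashing the rendered page images as raw bytes; actually rendering the pages would be done by a tool such as mutool or Ghostscript, as in the script linked below.

```python
import hashlib


def pages_identical(images_f1, images_f2) -> bool:
    """Compare rendered page images byte-for-byte via SHA-256.

    If every page of f1 renders identically to its text-stripped copy
    f2, the text layer was invisible, i.e. the PDF was OCRd.
    """
    if len(images_f1) != len(images_f2):
        return False
    return all(
        hashlib.sha256(a).digest() == hashlib.sha256(b).digest()
        for a, b in zip(images_f1, images_f2)
    )
```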
https://github.com/jfilter/pdf-scripts/blob/master/is_ocrd_pdf.sh
#!/usr/bin/env bash
set -e
set -x

################################################################################
# Check if a PDF was scanned or created digitally, works on OCRd PDFs
#
# Usage:
#   bash is_scanned_pdf.sh [-p] file
#
# Exit 0: Yes, file is a scanned PDF
# Exit 99: No, file was created digitally
#
# Arguments:
#   -p or --pages: pos. integer, only consider first N pages
#
# Please report issues at https://github.com/jfilter/pdf-scripts/issues
#
# GPLv3, Copyright (c) 2020 Johannes Filter
################################################################################

# parse arguments
# h/t https://stackoverflow.com/a/33826763/4028896
max_pages=-1
# skip over positional argument of the file(s), thus -gt 1
while [[ "$#" -gt 1 ]]; do
  case $1 in
  -p | --pages)
    max_pages="$2"
    shift
    ;;
  *)
    echo "Unknown parameter passed: $1"
    exit 1
    ;;
  esac
  shift
done

# increment to make it easier with page numbering
max_pages=$((max_pages + 1))

command_exists() {
  if ! command -v "$1" &>/dev/null; then
    echo "error: $1 is not installed." >&2
    exit 1
  fi
}

command_exists mutool && command_exists gs && command_exists compare
command_exists pdfinfo

orig=$PWD
num_pages=$(pdfinfo "$1" | grep Pages | awk '{print $2}')

echo $num_pages
echo $max_pages

if (((max_pages > 1) && (max_pages < num_pages))); then
  num_pages=$max_pages
fi

cd "$(mktemp -d)"

for ((i = 1; i <= num_pages; i++)); do
  mkdir -p output/$i && echo $i
done

# important to filter text on output of GS (tmp1), cuz GS alters input PDF...
gs -o tmp1.pdf -sDEVICE=pdfwrite -dLastPage=$num_pages "$1" &>/dev/null
gs -o tmp2.pdf -sDEVICE=pdfwrite -dFILTERTEXT tmp1.pdf &>/dev/null

mutool convert -o output/%d/1.png tmp1.pdf 2>/dev/null
mutool convert -o output/%d/2.png tmp2.pdf 2>/dev/null

for ((i = 1; i <= num_pages; i++)); do
  echo $i
  # difference in pixels; if 0, the images are identical
  # discard diff image
  if ! compare -metric AE output/$i/1.png output/$i/2.png null: 2>&1; then
    echo " pixels difference, not a scanned PDF, mismatch on page $i"
    exit 99
  fi
done
You can use pdfplumber. If the following code prints "None", it’s a scanned PDF; otherwise it’s searchable.
pip install pdfplumber
import pdfplumber

with pdfplumber.open(file_name) as pdf:
    page = pdf.pages[0]
    text = page.extract_text()
    print(text)
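Extending the idea to every page, a small helper can report which pages lack extractable text. It takes the list of extract_text() results (strings or None) rather than the pdf object, so the sketch stays independent of pdfplumber itself.

```python
def scanned_page_numbers(page_texts):
    """Return 1-based numbers of pages with no extractable text.

    page_texts: the result of page.extract_text() for each page;
    this is None (or an empty/whitespace string) for scanned pages.
    """
    return [
        num
        for num, text in enumerate(page_texts, start=1)
        if not (text and text.strip())
    ]
```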
To extract text from a scanned PDF, you can use OCRmyPDF. It is a very easy package, a one-line solution; see its documentation for more details and examples.
You can use ocrmypdf; it has a parameter to skip pages that already contain text.
More info here: https://ocrmypdf.readthedocs.io/en/latest/advanced.html
ocrmypdf.ocr(file_path, save_path, rotate_pages=True, remove_background=False, language=language, deskew=False, force_ocr=False, skip_text=True)
Here is a re-modified version of the code from @Vikas Goel, though in a very few cases it does not give a decent result:
def get_pdf_searchable_pages(fname):
    """Identify whether a PDF was digitally created or scanned."""
    from pdfminer.pdfpage import PDFPage
    searchable_pages = []
    non_searchable_pages = []
    page_num = 0
    with open(fname, 'rb') as infile:
        for page in PDFPage.get_pages(infile):
            page_num += 1
            if 'Font' in page.resources.keys():
                searchable_pages.append(page_num)
            else:
                non_searchable_pages.append(page_num)
    if page_num == 0:
        return "Not a valid document"
    if page_num == len(searchable_pages):
        return "searchable_pages"
    return "non_searchable_pages"
None of the posted answers worked for me. Unfortunately, the solutions often detect scanned PDFs as textual PDFs, most often because of the media boxes present in the documents.
As funny as it may look, the following code proved to be more accurate for my use-case:
extracted_text = ''.join([page.getText() for page in fitz.open(path)])
doc_type = "text" if extracted_text else "scan"
Make sure to install fitz and PyMuPDF beforehand, though:
pip install fitz PyMuPDF
If you only need to know whether the PDF is all images or not, here is another version using PyMuPDF:
import fitz

my_pdf = r"C:\Users\Test\FileName.pdf"
doc = fitz.open(my_pdf)


def pdftype(doc):
    i = 0
    for page in doc:
        if len(page.getText()) > 0:  # for a scanned page it will be 0
            i += 1
    if i > 0:
        print('full/partial text PDF file')
    else:
        print('only scanned images in PDF file')


pdftype(doc)
If your digital PDFs have a table of contents, you can use doc.get_toc() from PyMuPDF. As far as I’m aware, scanned PDFs will never have a table of contents. There’s no guarantee the digital ones will, though, so it really depends on the context.