Detect the content type of multiple PDFs in a folder

Question

So far I have been using PyPDF2 on the Anaconda platform to place a watermark in 20,000+ PDFs. The code works for the majority of the files, but in a few of them the content is a poorly scanned image of a report.

I want to know whether there is a tool in Python, or any other way, to analyse the content of a PDF and determine whether it is an image or a PDF with text characters. This would let me identify the files with this defect and move them to another folder.

Thanks

I added my code.

import PyPDF2  # third-party library; must be installed separately
import os


if __name__ == "__main__":


    ROOT_PATH = "."
    #STAMP_PATH = "." + "/stamped/"
    TEMPLATE_PATH = "."
    
    STAMP_PATH = "."
        
    
    count = 0
    
    for dirName, subdirList, fileList in os.walk(ROOT_PATH):
        
        files=[]

        print('Found directory: %s' % dirName)
        for fileName in fileList:

            # match on the extension only, regardless of case
            if fileName.lower().endswith('.pdf'):
                count += 1

                print('\tHandling %s - %s  %s' % (count, dirName, fileName))

                files.append(fileName)


#=======================main code part ==========================================                
                # open the file relative to the directory being walked
                file = open(os.path.join(dirName, fileName), 'rb')
                reader = PyPDF2.PdfFileReader(file)
                page= reader.getPage(0)
                
                
                water = open(os.path.join(TEMPLATE_PATH, 'StampTemplate1109.pdf'), 'rb')
                reader2 = PyPDF2.PdfFileReader(water)
                waterpage = reader2.getPage(0)
                
                #command to merge parent PDF first page with PDF watermark page
                page.mergeTranslatedPage(waterpage, 0, -20, expand=True)
                
                
                writer =PyPDF2.PdfFileWriter()
                writer.addPage(page)
                
                #add rest of PDF pages
                for pageNum in range(1, reader.numPages):  # pages after the first one
                    pageObj = reader.getPage(pageNum)
                    writer.addPage(pageObj)
                 
                #return the parent PDF file with the watermark 
                # here we are writing so 'wb' is for write binary
                resultFile = open(os.path.join(STAMP_PATH, 'Reviewed ' + fileName), 'wb')
                
                writer.write(resultFile)
                file.close()
                water.close()
                resultFile.close()
#==============================================================================                

    print("TOTAL OF %s PROCESSED" % count)

Asked By: FrankQA

Answer 1

Since you’re already using PyPDF2 you may want to use the PageObject.extractText function to see if you get any text on each page of the PDF. If you get an empty string from a page then it’s probably an image.
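
One way this check could look is sketched below. It is only an illustration, not code from the answer: the helper names `looks_scanned`, `extract_page_texts`, and `move_scanned_pdfs`, the `min_chars` threshold, and the `scanned` folder are all assumptions. It uses the same PyPDF2 1.x calls as the question (`PdfFileReader`, `getPage`, `extractText`); newer PyPDF2 releases rename these to `PdfReader`, `pages`, and `extract_text`.

```python
import os
import shutil


def looks_scanned(page_texts, min_chars=1):
    """True when every page yields fewer than min_chars non-whitespace characters."""
    return all(len("".join(text.split())) < min_chars for text in page_texts)


def extract_page_texts(pdf_path):
    """Return the extracted text of each page, using the PyPDF2 1.x API from the question."""
    import PyPDF2  # imported here so looks_scanned() works without the library

    with open(pdf_path, "rb") as handle:
        reader = PyPDF2.PdfFileReader(handle)
        return [reader.getPage(i).extractText() or "" for i in range(reader.numPages)]


def move_scanned_pdfs(root, scanned_dir, extract=extract_page_texts):
    """Move every PDF directly under root with no extractable text into scanned_dir."""
    os.makedirs(scanned_dir, exist_ok=True)
    moved = []
    for name in sorted(os.listdir(root)):
        path = os.path.join(root, name)
        # skip sub-folders and anything that is not a PDF
        if not (os.path.isfile(path) and name.lower().endswith(".pdf")):
            continue
        if looks_scanned(extract(path)):
            shutil.move(path, os.path.join(scanned_dir, name))
            moved.append(name)
    return moved
```

For example, `move_scanned_pdfs(".", "scanned")` would move the text-free PDFs from the current folder into `scanned` and return their names. An empty result only suggests a scan: some text PDFs extract poorly, and scans that went through OCR do carry text, so raising `min_chars` or spot-checking the moved files may be worthwhile.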

Answered By: J. Owens
