Loop through folder and subfolders and merge pdf

Question:

I tried to create a script that loops through a parent folder and its subfolders and merges all of the PDFs into one. Below is the code I have written so far, but I don’t know how to combine it into one script.

Reference:
Merge PDF files

The first function loops through all of the subfolders under the parent folder to get the path of each PDF.

import os
from PyPDF2 import PdfFileMerger

root = r"folder path"
path = os.path.join(root, "folder path")

def list_dir():
    for path,subdirs,files in os.walk(root):
        for name in files:
            if name.endswith(".pdf") or name.endswith(".ipynb"):
                print (os.path.join(path,name))

            
            

Second, I created a list to hold the paths of all the PDF files in the subfolders, so that I can merge them into one combined file. At this step, I got the following error:

TypeError: listdir: path should be string, bytes, os.PathLike or None, not list

root_folder = []
root_folder.append(list_dir())
    
def pdf_merge():
    
    merger = PdfFileMerger()    
    allpdfs = [a for a in os.listdir(root_folder)]

    
    for pdf in allpdfs:
        merger.append(open(pdf,'rb'))
        
    with open("Combined.pdf","wb") as new_file:
        merger.write(new_file)

pdf_merge()

What should I change in the code, and where, to avoid the error and combine the two functions?

Asked By: Brian C.


Answers:

First, you need a function that builds a list of all the matching files and returns it.

def list_dir(root):
    result = []
    
    for path, dirs, files in os.walk(root):
        for name in files:
            if name.lower().endswith( (".pdf", ".ipynb") ):
                result.append(os.path.join(path, name))
                
    return result

I also use .lower() to catch extensions like .PDF.

endswith() accepts a tuple containing all of the extensions.

It is good practice to pass external values in as arguments: list_dir(root) instead of list_dir()
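The .lower() and tuple behaviour described above can be checked in isolation. This is only a minimal sketch, and the helper name has_ext is made up for the illustration; it is not part of the code in this answer.

```python
def has_ext(name, exts):
    """Return True if `name` ends with one of `exts`, ignoring case.

    `has_ext` is a hypothetical helper used only for this illustration.
    """
    # str.endswith accepts a single string or a tuple of strings
    return name.lower().endswith(exts)


print(has_ext("Report.PDF", ".pdf"))               # True
print(has_ext("notes.ipynb", (".pdf", ".ipynb")))  # True
print(has_ext("image.png", (".pdf", ".ipynb")))    # False
```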


Then you can call it like this:

allpdfs = list_dir("folder path")

inside the merge function:

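def pdf_merge(root):
    # PdfFileMerger comes from "from PyPDF2 import PdfFileMerger", as in the question
    # (newer PyPDF2/pypdf releases rename this class: PdfMerger, later PdfWriter)
    merger = PdfFileMerger()
    # list_dir() above also returns .ipynb files, so keep only the PDFs here
    allpdfs = [p for p in list_dir(root) if p.lower().endswith(".pdf")]

    for pdf in allpdfs:
        # append() also accepts a file path, so no file handle is left open
        merger.append(pdf)

    with open("Combined.pdf", 'wb') as new_file:
        merger.write(new_file)

    merger.close()

pdf_merge("folder path")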
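For completeness, the TypeError in the question seems to come from two things: the original list_dir() only prints the paths, so it returns None, and os.listdir() expects a single path rather than a list. The sketch below reproduces that under simplified assumptions: the function name list_dir_printing is hypothetical, and the directory walk is replaced by a fixed list to keep it short.

```python
import os


def list_dir_printing():
    # Mirrors the question's function: it prints, but has no return statement
    for name in ["a.pdf", "b.pdf"]:  # stand-in for the os.walk() loop
        print(name)


root_folder = []
root_folder.append(list_dir_printing())  # appends None, not the paths
print(root_folder)                        # [None]

error = None
try:
    os.listdir(root_folder)  # a list is not a valid path argument
except TypeError as exc:
    error = exc
    print("TypeError:", exc)
```

Returning the list from the function and passing that list straight to the merge loop avoids both problems.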

EDIT:

The first function can be made more general if it also takes the extensions as an argument:

import os

def list_dir(root, exts=None):
    result = []
    
    for path, dirs, files in os.walk(root):
        for name in files:
            if exts and not name.lower().endswith(exts):
                continue

            result.append(os.path.join(path, name))
                
    return result

all_files  = list_dir('folder_path')
all_pdfs   = list_dir('folder_path', '.pdf')
all_images = list_dir('folder_path', ('.png', '.jpg', '.gif'))

print(all_files)
print(all_pdfs)
print(all_images)
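Note that exts is passed straight to str.endswith(), which accepts a string or a tuple but raises TypeError for a list. The variant below is only a sketch: converting lists to tuples and sorting the result are assumptions added here for illustration, not part of the original function. It is tried out on a throwaway directory tree so it can be run anywhere.

```python
import os
import tempfile


def list_dir(root, exts=None):
    """Return paths of all files under `root`, optionally filtered by extension."""
    # Assumption added for this sketch: accept lists by converting them,
    # because str.endswith only takes a string or a tuple
    if exts is not None and not isinstance(exts, str):
        exts = tuple(exts)

    result = []
    for path, dirs, files in os.walk(root):
        for name in files:
            if exts and not name.lower().endswith(exts):
                continue
            result.append(os.path.join(path, name))

    # Sorting is also an addition, so the order does not depend on os.walk
    return sorted(result)


with tempfile.TemporaryDirectory() as root:
    # Build a tiny tree: a.pdf, sub/b.PDF, sub/c.png
    os.makedirs(os.path.join(root, "sub"))
    for rel in ("a.pdf", os.path.join("sub", "b.PDF"), os.path.join("sub", "c.png")):
        open(os.path.join(root, rel), "w").close()

    pdfs = [os.path.relpath(p, root) for p in list_dir(root, [".pdf"])]
    everything = [os.path.relpath(p, root) for p in list_dir(root)]

print(pdfs)
print(everything)
```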

EDIT:

For single extension you can also do

import glob

all_pdfs = glob.glob('folder_path/**/*.pdf', recursive=True)

The ** pattern has to be combined with recursive=True to search in subfolders.
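One caveat, offered as an assumption about typical setups rather than something stated above: on Linux and macOS the glob pattern *.pdf is usually matched case-sensitively, so a file named b.PDF may be missed. A minimal sketch of a case-insensitive alternative using pathlib follows; find_pdfs is a hypothetical name chosen for the example.

```python
import tempfile
from pathlib import Path


def find_pdfs(root):
    """Return all files under `root` whose suffix is .pdf, ignoring case.

    `find_pdfs` is a hypothetical helper used only for this illustration.
    """
    # rglob("*") walks the whole tree; the suffix check is done in Python
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


with tempfile.TemporaryDirectory() as root:
    base = Path(root)
    # Build a tiny tree: a.pdf, sub/b.PDF, sub/c.txt
    (base / "sub").mkdir()
    for rel in ("a.pdf", "sub/b.PDF", "sub/c.txt"):
        (base / rel).touch()

    found = [p.relative_to(base).as_posix() for p in find_pdfs(base)]

print(found)
```

The resulting Path objects can be converted with str() where a plain string path is needed.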

Answered By: furas