How to compare directories to determine which files have changed?

Question:

We need a script that compares two directories and, for each file that differs between directory 1 and directory 2 (added, deleted, or modified), builds a subset containing only those changed files.

My first thought is to write a Python script that traverses each directory, computes a hash of each file, and, if the hash has changed, copies the file into the new subset. Is this a sound approach? Am I overlooking any existing tools that already do this? I’ve never used it, but could something like rsync do the job?

Thanks

Edit:

The important part is that I am able to compile a subset of only those files that were changed, so if only 3 files have changed between versions, I only need those three files copied to a new directory…

Asked By: Cuga

Answers:

That is one completely reasonable approach, but you are essentially reinventing rsync. So yes, use rsync.

Edit: There’s a way to create “difference-only” folders using rsync.
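
For anyone who would rather stay in Python, a difference-only folder can be approximated with the standard library alone. The sketch below is only an illustration: the function name `copy_changed` and its three directory parameters are made up for this example, and it copies files that are new or whose contents differ. Files that were deleted cannot be copied, so they are simply not represented in the output.

```python
import filecmp
import shutil
from pathlib import Path


def copy_changed(old_dir, new_dir, out_dir):
    """Copy files that are new or different in new_dir into out_dir.

    Hypothetical helper: returns the sorted relative paths (POSIX style)
    of the files that were copied.
    """
    old_dir, new_dir, out_dir = Path(old_dir), Path(new_dir), Path(out_dir)
    copied = []
    for new_file in new_dir.rglob("*"):
        if not new_file.is_file():
            continue
        rel = new_file.relative_to(new_dir)
        old_file = old_dir / rel
        # shallow=False compares file contents, not just os.stat() data
        if old_file.is_file() and filecmp.cmp(old_file, new_file, shallow=False):
            continue  # unchanged, so leave it out of the subset
        target = out_dir / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(new_file, target)  # copy2 also preserves timestamps
        copied.append(rel.as_posix())
    return sorted(copied)
```

Keep `out_dir` outside `new_dir`, otherwise the walk would also pick up the files it has just copied.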

Answered By: Daniel DiPaolo

I like DiffMerge; it works great for this purpose.

Answered By: dcp

It seems to me that you need something as simple as this:

from os import listdir
from os.path import getmtime, join

# Raw strings keep the backslashes in Windows paths literal
rep1 = r'I:\dada'
rep2 = r'I:\didi'

R1 = listdir(rep1)
R2 = listdir(rep2)

# Names found in only one of the two directories
vanished = [filename for filename in R1 if filename not in R2]
appeared = [filename for filename in R2 if filename not in R1]
# Names found in both, but with different modification times
modified = [filename for filename in (f for f in R2 if f in R1)
            if getmtime(join(rep1, filename)) != getmtime(join(rep2, filename))]

print('vanished ==', vanished)
print('appeared ==', appeared)
print('modified ==', modified)
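
As a possible alternative to comparing modification times by hand, the standard library's `filecmp.dircmp` already sorts the entries of two directories into the same three groups. The wrapper below is a minimal sketch; the function name `classify` and the dictionary keys are chosen here only to mirror the variable names above.

```python
import filecmp


def classify(dir1, dir2):
    """Group the entries of two directories (top level only)."""
    comparison = filecmp.dircmp(dir1, dir2)
    return {
        # entries (files or subdirectories) that exist only in dir1
        "vanished": sorted(comparison.left_only),
        # entries that exist only in dir2
        "appeared": sorted(comparison.right_only),
        # files present in both directories that compare as different
        "modified": sorted(comparison.diff_files),
    }
```

Note that `dircmp` compares files "shallowly" by default: two files with the same type, size, and modification time are treated as equal without their contents being read.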
Answered By: eyquem

I have modified @eyquem's answer a bit.

The two directories are passed as arguments:

python file.py dir1 dir2

Note: files are classified on the basis of their modification time.

#!/usr/bin/env python3
import sys
from os import listdir
from os.path import getmtime, join

ORIG_DIR = sys.argv[1]      # original, e.g. /root/backup.FPSS/bin/httpd
MODIFIED_DIR = sys.argv[2]  # modified, e.g. /FPSS/httpd/bin/httpd

LIST_OF_FILES_IN_ORIG_DIR = listdir(ORIG_DIR)
LIST_OF_FILES_IN_MODIFIED_DIR = listdir(MODIFIED_DIR)

# Names present in both directories
common = [f for f in LIST_OF_FILES_IN_MODIFIED_DIR if f in LIST_OF_FILES_IN_ORIG_DIR]

vanished = [f for f in LIST_OF_FILES_IN_ORIG_DIR if f not in LIST_OF_FILES_IN_MODIFIED_DIR]
appeared = [f for f in LIST_OF_FILES_IN_MODIFIED_DIR if f not in LIST_OF_FILES_IN_ORIG_DIR]
# "modified" means the copy in MODIFIED_DIR is strictly newer than the one in ORIG_DIR
modified = [f for f in common if getmtime(join(ORIG_DIR, f)) < getmtime(join(MODIFIED_DIR, f))]
same = [f for f in common if getmtime(join(ORIG_DIR, f)) >= getmtime(join(MODIFIED_DIR, f))]

def print_list(header, files):
    # Print the header first, then one line per file and a total
    print(header, '==>')
    for f in files:
        print('----->', f)
    print('Total ::', len(files))

print('#' * 99)
print_list('Files which have vanished from MOD: %s but are still present in ORIG: %s' % (MODIFIED_DIR, ORIG_DIR), vanished)
print('-' * 101)
print_list('Files which appear in MOD: %s but are not present in ORIG: %s' % (MODIFIED_DIR, ORIG_DIR), appeared)
print('-' * 101)
print_list('Files in MOD: %s which are MODIFIED compared to ORIG: %s' % (MODIFIED_DIR, ORIG_DIR), modified)
print('-' * 101)
print_list('Files in MOD: %s which are NOT modified compared to ORIG: %s' % (MODIFIED_DIR, ORIG_DIR), same)
print('#' * 99)
Answered By: SUMIT KUMAR SINGH

Including subfolders and comparing the hashes of the files (Python 3.11 or later required):

from os import walk
from os.path import join, normpath, relpath
import hashlib

rep1 = normpath(input('Folder 1: '))
rep2 = normpath(input('Folder 2: '))

def hashcheck(fileloc1, fileloc2):
    """Return True if the two files have different SHA-256 digests."""
    # hashlib.file_digest() only exists from Python 3.11 on
    with open(fileloc1, 'rb') as f1:
        f1hash = hashlib.file_digest(f1, 'sha256').hexdigest()
    with open(fileloc2, 'rb') as f2:
        f2hash = hashlib.file_digest(f2, 'sha256').hexdigest()
    return f1hash != f2hash

def relative_files(root):
    """List every file below root, as a path relative to root."""
    return [relpath(join(folder, name), root)
            for folder, _dirs, files in walk(root) for name in files]

R1 = relative_files(rep1)
R2 = relative_files(rep2)

vanished = [pathname for pathname in R1 if pathname not in R2]
appeared = [pathname for pathname in R2 if pathname not in R1]
# Paths present in both trees whose contents hash differently
modified = [pathname for pathname in (f for f in R2 if f in R1)
            if hashcheck(join(rep1, pathname), join(rep2, pathname))]

print('vanished ==', vanished, '\n')
print('appeared ==', appeared, '\n')
print('modified ==', modified, '\n')
input()  # keep the console window open until Enter is pressed
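
On Python versions older than 3.11, where `hashlib.file_digest` is not available, the digest can be computed by feeding the file to `hashlib.sha256` in chunks. This is a minimal sketch; the helper name `sha256_of` and the 64 KiB chunk size are arbitrary choices for the example.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel stops once read() returns b"" at end of file
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Reading in chunks keeps memory use flat even for very large files.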
Answered By: kwtf