Python 3.9 – unable to get correct SHA1 hash for multiple files in loop

Question:

Referring to the code given in the solution at the link below, I am not getting the correct SHA1 hash for the second and subsequent files in the loop. Here is why I say it is incorrect:

Using the code given below:

  • CORRECT -> When generating the SHA1 hash for the same file individually (by executing the code twice, once per location), I get the same SHA1 hash both times (correct), and

  • INCORRECT -> When generating hashes for multiple files (including this one) in a single execution, I get a different hash (incorrect) for this file.

Please advise whether anything in this code needs to be modified, or whether I need to opt for another approach.

Code written by referring to the link given at the bottom:

import glob
import hashlib
import os

path = input("Please provide the path to search for the file pattern (the search will also cover this path's sub-directories): ")
filepattern = input("Please provide the file pattern to search for in the given path, e.g. *.jar, *abc*.jar: ")
assert os.path.exists(path), "I did not find the path " + str(path)
path = path.rstrip("/")
tocheck = (f'{path}/**/{filepattern}')
hash_obj = hashlib.sha1()

searched_file_list = glob.iglob(tocheck, recursive=True)
for file in searched_file_list:
    print(f'{file}')
    try:
        checksum = ""
        file_for_sha1 = ""
        file_for_sha1 = open(file, 'rb')
        hash_obj.update(file_for_sha1.read())
        checksum = hash_obj.hexdigest()
        print(f'sha1 for file ({file})= {checksum}')
    finally:
        file_for_sha1.close()

Example file -> abc.txt, created at /home/test/git/reader/cabin/ with the text below:
Hi This is to test SHA1 code.

This file was then copied to one more location, i.e. /home/test/git/reader/check/cabin/.

Linux console output showing the same SHA1 for both files:

:~/git/reader/check/cabin$ sha1sum abc.txt
fc4db67f46711b2c18bd133abd67965649edfffc  abc.txt
:~/git/reader/check/cabin$ cd ../..
:~/git/reader$ cd cabin/
:~/git/reader/cabin$ sha1sum abc.txt
fc4db67f46711b2c18bd133abd67965649edfffc  abc.txt

The loop, in a single execution, generates two different SHA1 hashes for this abc.txt file across the two locations:

  • sha1 for file (/home/test/git/reader/cabin/abc.txt)= fc4db67f46711b2c18bd133abd67965649edfffc
  • sha1 for file (/home/test/git/reader/check/cabin/abc.txt)= a4691598ea25ea4c7404369a685725115c7f305b

When the code is executed twice for the same file, giving the respective location each time (i.e. one file at a time), it generates the same, correct SHA1 hash:

  • sha1 for file (/home/test/git/reader/check/cabin/abc.txt)= fc4db67f46711b2c18bd133abd67965649edfffc

  • sha1 for file (/home/test/git/reader/cabin/abc.txt)= fc4db67f46711b2c18bd133abd67965649edfffc

Referred code link ->
Generating one MD5/SHA1 checksum of multiple files in Python

Asked By: bsethi24


Answers:

To quote the docs on the update() method:

Repeated calls are equivalent to a single call with the concatenation
of all the arguments: m.update(a); m.update(b) is equivalent to
m.update(a+b).
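
This equivalence is easy to verify directly; a minimal sketch using made-up byte strings as stand-ins for two files' contents:

```python
import hashlib

# Hypothetical stand-ins for the contents of two files
a = b"contents of the first file"
b = b"contents of the second file"

# Two updates on one object hash the concatenation...
h = hashlib.sha1()
h.update(a)
h.update(b)

# ...so the result matches sha1(a + b), not sha1(b) alone.
assert h.hexdigest() == hashlib.sha1(a + b).hexdigest()
assert h.hexdigest() != hashlib.sha1(b).hexdigest()
```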

So instead of finding the hash of each file separately, you're finding the hash of all the files read so far, concatenated. That is what the question you've linked is doing – a single hash for multiple files. You want a hash for each file, so instead of calling update multiple times on the same hash_obj instance, create a new instance for each file, so

hash_obj = hashlib.sha1()
searched_file_list = glob.iglob(tocheck, recursive=True)
for file in searched_file_list:
    print(f'{file}')
    try:
        ...
        hash_obj.update(file_for_sha1.read())

will become

searched_file_list = glob.iglob(tocheck, recursive=True)
for file in searched_file_list:
    print(f'{file}')
    try:
        hash_obj = hashlib.sha1()
        ...
        hash_obj.update(file_for_sha1.read())
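
Putting that fix together, a complete per-file helper might look like the sketch below. It creates a fresh sha1 object for every file and reads in fixed-size chunks so large files don't have to fit in memory (the 64 KiB chunk size is an arbitrary choice):

```python
import glob
import hashlib

def sha1_of_file(filename, chunk_size=65536):
    """Return the SHA1 hex digest of a single file, read in chunks."""
    hash_obj = hashlib.sha1()  # fresh hash object per file
    with open(filename, 'rb') as f:
        while chunk := f.read(chunk_size):
            hash_obj.update(chunk)
    return hash_obj.hexdigest()

# Usage, assuming `tocheck` is built as in the question:
# for file in glob.iglob(tocheck, recursive=True):
#     print(f'sha1 for file ({file})= {sha1_of_file(file)}')
```

On Python 3.11+ the manual chunked read can be replaced by hashlib.file_digest(f, "sha1"), but the question targets 3.9, where the loop above is needed.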
Answered By: Henry