What information describes the quantitative difference between two given large files of the same size?
Question:
Usually, to find out how two binary files differ, I use the diff and hexdump tools. But in some situations, given two large binary files of the same size, I would like to see only their quantitative differences, such as the number of regions that differ and the cumulative difference.
Example: 2 files, A and B. They have 2 diff regions, and their cumulative difference is
6c-a3 + 6c-11 + 6f-6e + 20-22.
File A = 48 65 6c 6c 6f 2c 20 57
File B = 48 65 a3 11 6e 2c 22 57
              |--------|  |--|
                reg 1     reg 2
How can I get such information using standard GNU tools and Bash, or would a simple Python script be better? Other statistics about how two files differ could also be useful, but I don’t know what else could be measured, or how. Entropy difference? Variance difference?
Answers:
For everything but the regions thing you can use numpy. Something like this (untested):
import numpy as np

a = np.fromfile("file A", dtype="uint8")
b = np.fromfile("file B", dtype="uint8")

# Compute the number of bytes that are different
different_bytes = np.sum(a != b)

# Cast to a wider signed type before subtracting: uint8 subtraction
# wraps around modulo 256, so a - b would never be negative and the
# sums below would be wrong.
signed_diff = a.astype(np.int16) - b.astype(np.int16)

# Compute the sum of the differences
difference = np.sum(signed_diff)

# Compute the sum of the absolute values of the differences
absolute_difference = np.sum(np.abs(signed_diff))

# In some cases, the number of bits that have changed is a better
# measurement of change. To compute it we make a lookup array where
# bitcount_lookup[byte] == number_of_1_bits_in_byte (so
# bitcount_lookup[0:16] == [0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4])
bitcount_lookup = np.array(
    [bin(i).count("1") for i in range(256)], dtype="uint8")

# Numpy allows using an array as an index. ^ computes the XOR of
# each pair of bytes. The result is a byte with a 1 bit where the
# bits of the input differed, and a 0 bit otherwise.
bit_diff_count = np.sum(bitcount_lookup[a ^ b])
I couldn’t find a numpy function for computing the regions, but it shouldn’t be hard to write your own using a != b as input. See this question for inspiration.
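For example, a minimal (untested-against-real-files) sketch of counting regions from the boolean mask, using the 8-byte example from the question as hard-coded input:

```python
import numpy as np

# The two example files from the question, as byte arrays.
a = np.frombuffer(bytes.fromhex("48656c6c6f2c2057"), dtype=np.uint8)
b = np.frombuffer(bytes.fromhex("4865a3116e2c2257"), dtype=np.uint8)

diff_mask = a != b  # True where the bytes differ

# A region starts wherever the mask flips from False to True.
# Prepending False also catches a difference at offset 0.
# Cast to int8 first because np.diff does not subtract booleans.
starts = np.diff(np.concatenate(([False], diff_mask)).astype(np.int8)) == 1
num_regions = np.count_nonzero(starts)
```

With the example bytes above, diff_mask is [F, F, T, T, T, F, T, F] and num_regions comes out as 2, matching the two regions in the question.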
One approach that springs to mind is to hack a bit on a binary diffing algorithm, e.g. a Python implementation of the rsync algorithm. Starting from that, it should be relatively easy to get a list of block ranges where the files differ, and then you can compute whatever statistics you want over those blocks.
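To make the block-range idea concrete, here is a toy sketch. It is not rsync (no rolling checksums, and the fixed block size of 4 is just an illustrative choice); it simply compares equal-length files block by block and merges runs of differing blocks into byte ranges:

```python
import numpy as np

def differing_block_ranges(a, b, block_size=4):
    """Return (start, end) byte ranges covering maximal runs of
    differing blocks. a and b are equal-length uint8 arrays; any
    tail shorter than block_size is ignored in this sketch."""
    n_blocks = len(a) // block_size
    ranges = []
    run_start = None
    for i in range(n_blocks):
        lo, hi = i * block_size, (i + 1) * block_size
        block_differs = not np.array_equal(a[lo:hi], b[lo:hi])
        if block_differs and run_start is None:
            run_start = lo          # a new run of differing blocks begins
        elif not block_differs and run_start is not None:
            ranges.append((run_start, lo))  # the run just ended
            run_start = None
    if run_start is not None:       # close a run that reaches the end
        ranges.append((run_start, n_blocks * block_size))
    return ranges

# The 8-byte example from the question.
a = np.frombuffer(bytes.fromhex("48656c6c6f2c2057"), dtype=np.uint8)
b = np.frombuffer(bytes.fromhex("4865a3116e2c2257"), dtype=np.uint8)
ranges = differing_block_ranges(a, b, block_size=4)
```

Since both 4-byte blocks contain a changed byte, the two runs merge and ranges is [(0, 8)]; a real rsync-style approach would use checksums instead of direct comparison so that only one file needs to be read in full.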