How to compare two huge text files in Linux and get the difference
Question:
I have two text files of around 1 billion rows each, but one has 218 more rows than the other. I need to find those 218 rows and save them for analysis.
What would be the fastest way to do this? Is there any miracle shell command or Python library that delivers the needed result with the best efficiency?
Thank you very much.
Answers:
Just use the command-line tool diff:
$ diff ./file1.txt ./file2.txt
comm will produce more readable output than diff (plus its output is easier to pipe to something else), and should be more efficient. It expects both inputs to be sorted, hence the sort calls:
$ cat file1.txt
dog
cat
rabbit
$ cat file2.txt
cat
dog
rabbit
llama
$ comm -13 <(sort file1.txt) <(sort file2.txt)
llama
Its default behavior is to print three columns: lines only in file1, lines only in file2, and lines in both. The -1 and -3 flags suppress those respective columns. If file1 had the extra lines, you’d thus use -23 instead.
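To make the column behavior concrete, here is a small sketch using the same sample data. The scratch directory and the file names file1.sorted and file2.sorted are made up for illustration, and the inputs are written pre-sorted because comm expects sorted input.

```shell
# Scratch directory so no real files are touched (hypothetical paths).
workdir=$(mktemp -d)

# Sample rows from the answer, written in sorted order.
printf 'cat\ndog\nrabbit\n' > "$workdir/file1.sorted"
printf 'cat\ndog\nllama\nrabbit\n' > "$workdir/file2.sorted"

# No flags: three columns. Lines only in the second file get one leading
# tab (column 2); lines common to both get two leading tabs (column 3).
comm "$workdir/file1.sorted" "$workdir/file2.sorted"

# -23 hides columns 2 and 3: lines found only in the first file.
only_in_first=$(comm -23 "$workdir/file1.sorted" "$workdir/file2.sorted")

# -13 hides columns 1 and 3: lines found only in the second file.
only_in_second=$(comm -13 "$workdir/file1.sorted" "$workdir/file2.sorted")

echo "only in first:  [$only_in_first]"
echo "only in second: [$only_in_second]"
```

If you don't know in advance which file holds the extra rows, running both -23 and -13 as above covers either case.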
If your shell doesn’t support <(command) style redirection, you’d have to sort the files as a separate step.
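A minimal sketch of that separate-step variant follows. The names file1.sorted, file2.sorted, and extra_rows.txt are hypothetical, and setting LC_ALL=C is an assumption on my part: it makes sort and comm agree on plain byte ordering, which is usually also faster than locale-aware collation.

```shell
# Scratch directory with the sample files from the answer (hypothetical paths).
workdir=$(mktemp -d)
printf 'dog\ncat\nrabbit\n' > "$workdir/file1.txt"
printf 'cat\ndog\nrabbit\nllama\n' > "$workdir/file2.txt"

# Step 1: sort each file into its own output file.
LC_ALL=C sort -o "$workdir/file1.sorted" "$workdir/file1.txt"
LC_ALL=C sort -o "$workdir/file2.sorted" "$workdir/file2.txt"

# Step 2: keep only the lines unique to the second file and save them.
LC_ALL=C comm -13 "$workdir/file1.sorted" "$workdir/file2.sorted" > "$workdir/extra_rows.txt"

cat "$workdir/extra_rows.txt"
```

For inputs of a billion rows, the sort step will dominate the run time. With GNU sort, the -S (buffer size), -T (temporary directory), and --parallel options may be worth tuning, though the best values depend on the machine.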
Run this command to get the difference between two files (a and b):
diff -U 0 a b
Or, if you want to store the diff in another file (c), run the following (>> appends to c if it already exists; use > to overwrite it):
diff -U 0 a b >> c
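The saved unified diff still contains header and hunk-marker lines. As a rough sketch, assuming the two files list their shared rows in the same order, the added rows alone could be pulled out like this (added_rows.txt is a hypothetical name):

```shell
# Scratch directory with two small sample files (hypothetical paths).
workdir=$(mktemp -d)
printf 'cat\ndog\nrabbit\n' > "$workdir/a"
printf 'cat\ndog\nllama\nrabbit\n' > "$workdir/b"

# diff exits with status 1 when the files differ, which is expected here.
diff -U 0 "$workdir/a" "$workdir/b" > "$workdir/c" || true

# Drop the two header lines (--- and +++), keep only added lines
# (those starting with '+'), then strip the leading '+' marker.
tail -n +3 "$workdir/c" | grep '^+' | cut -c 2- > "$workdir/added_rows.txt"

cat "$workdir/added_rows.txt"
```

Unlike comm, diff is sensitive to line order, so this only isolates the extra rows cleanly when the files otherwise match line for line.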