How to compare two huge text files in Linux and get the difference

Question:

I have two text files, each around 1 billion rows, but one has 218 more rows than the other. I need to find those 218 rows and save them for analysis.

What would be the fastest way to do it? Is there any miracle shell command or Python library that delivers the needed result with the best efficiency?

Thank you very much.

Asked By: mdivk

||

Answers:

Just use the command line tool diff:

$ diff ./file1.txt ./file2.txt
Answered By: setholopolus

comm will produce more readable output than diff (plus, its output is easier to pipe to something else), and should be more efficient:

$ cat file1.txt         
dog
cat
rabbit
$ cat file2.txt
cat
dog
rabbit
llama
$ comm -13 <(sort file1.txt) <(sort file2.txt)
llama

Its default behavior is to print three columns: lines only in file1, lines only in file2, and lines in both. The -1 and -3 options suppress the first and third columns respectively. If file1 had the extra lines, you’d use -23 instead.
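
To make the column layout concrete, here is a small sketch using two sample files that are already sorted. The variable `demo` and the file names `a.sorted` and `b.sorted` are made up for illustration, and `LC_ALL=C` is an assumption that keeps the comparison in plain byte order.

```shell
# Scratch directory with two small, already-sorted sample files (hypothetical names).
demo=$(mktemp -d)
printf 'cat\ndog\nrabbit\nzebra\n' > "$demo/a.sorted"
printf 'cat\ndog\nllama\nrabbit\n' > "$demo/b.sorted"

# No options: all three columns are printed, separated by tab indentation.
LC_ALL=C comm "$demo/a.sorted" "$demo/b.sorted"

# -23 hides columns 2 and 3, leaving lines found only in the first file.
only_in_a=$(LC_ALL=C comm -23 "$demo/a.sorted" "$demo/b.sorted")

# -13 hides columns 1 and 3, leaving lines found only in the second file.
only_in_b=$(LC_ALL=C comm -13 "$demo/a.sorted" "$demo/b.sorted")

echo "only in a: $only_in_a"   # only in a: zebra
echo "only in b: $only_in_b"   # only in b: llama
```

In the unfiltered output, lines unique to the second file are indented by one tab and lines common to both by two tabs, which is why suppressing columns makes the result easier to pipe elsewhere.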

If your shell doesn’t support <(command) style redirection, you’d have to sort the files as a separate step.
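
A minimal sketch of that separate-step approach, which avoids <(command) entirely. The scratch directory `workdir` and the `.sorted` and `only_in_file2.txt` file names are hypothetical; the sample data mirrors the example above. Setting `LC_ALL=C` is an assumption made so that sort and comm agree on the same byte ordering.

```shell
# Scratch directory so the demo does not touch real data (hypothetical paths).
workdir=$(mktemp -d)
printf 'dog\ncat\nrabbit\n' > "$workdir/file1.txt"
printf 'cat\ndog\nrabbit\nllama\n' > "$workdir/file2.txt"

# Step 1: sort each file into its own output file.
LC_ALL=C sort "$workdir/file1.txt" > "$workdir/file1.sorted"
LC_ALL=C sort "$workdir/file2.txt" > "$workdir/file2.sorted"

# Step 2: keep only the lines unique to the second file and save them.
LC_ALL=C comm -13 "$workdir/file1.sorted" "$workdir/file2.sorted" \
    > "$workdir/only_in_file2.txt"

cat "$workdir/only_in_file2.txt"   # llama
```

For inputs on the scale of a billion rows, the sort steps are likely to dominate the run time. If GNU sort is what is installed, its -S (memory buffer size) and -T (temporary directory) options may be worth tuning.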

Answered By: Shawn

Run the following command to get the difference between two files (a and b):

diff -U 0 a b

Or, if you want to store the diff in another file (c), run the command below (>> appends to c if it already exists; use > to overwrite it):

diff -U 0 a b >> c
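
If only the extra rows themselves are wanted, the unified diff can be filtered further. The following is a sketch, not part of the original answer: the directory `d` and the file `added.txt` are hypothetical names, and it assumes both files list their common rows in the same order, since diff compares line by line.

```shell
# Scratch directory with two sample files in the same order (hypothetical names).
d=$(mktemp -d)
printf 'dog\ncat\nrabbit\n' > "$d/a"
printf 'dog\ncat\nrabbit\nllama\n' > "$d/b"

# diff exits with status 1 when the files differ; treat only status 2+ as failure.
diff -U 0 "$d/a" "$d/b" > "$d/c" || [ $? -eq 1 ]

# Skip the two header lines (--- and +++), keep added lines, strip the leading "+".
tail -n +3 "$d/c" | grep '^+' | cut -c2- > "$d/added.txt"

cat "$d/added.txt"   # llama
```

Skipping the header by position rather than by pattern avoids accidentally dropping a data row that happens to begin with "++".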
Answered By: SURJIT KUMAR