How to compare two huge text files in Linux and get the difference

Question

I have two text files, each with around 1 billion rows, but one has 218 more rows than the other. I need to find those 218 rows and save them for analysis.

What would be the fastest way to do this? Is there any miracle shell command or Python library that delivers the needed result with the best efficiency?

Thank you very much.

Asked By: mdivk

Answer 1

Just use the command line tool diff:

$ diff ./file1.txt ./file2.txt

Answered By: setholopolus

Answer 2

comm, which requires sorted input, produces more readable output than diff (and its output is easier to pipe into something else), and should be more efficient:

$ cat file1.txt
dog
cat
rabbit
$ cat file2.txt
cat
dog
rabbit
llama
$ comm -13 <(sort file1.txt) <(sort file2.txt)
llama

By default, comm prints three columns: lines only in file1, lines only in file2, and lines in both. The -1 and -3 options suppress the first and third columns. If file1 had the extra lines, you’d use -23 instead.

If your shell doesn’t support <(command) style redirection, you’d have to sort the files as a separate step.
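
For a shell without process substitution, the separate sorting step might look like the sketch below. The names file1.sorted, file2.sorted and only_in_file2.txt are illustrative, and setting LC_ALL=C is an assumption added here so that sort and comm agree on a plain byte ordering; it is not part of the original answer.

```shell
# Recreate the small sample files from above (illustrative data only).
printf 'dog\ncat\nrabbit\n' > file1.txt
printf 'cat\ndog\nrabbit\nllama\n' > file2.txt

# Sort each file into its own temporary file, using the same locale throughout.
LC_ALL=C sort file1.txt > file1.sorted
LC_ALL=C sort file2.txt > file2.sorted

# Keep only the lines unique to the second file and save them for analysis.
LC_ALL=C comm -13 file1.sorted file2.sorted > only_in_file2.txt
```

With the sample data, only_in_file2.txt ends up containing the single line llama. On billion-row files the two sorts will likely dominate the running time, and the sorted copies need roughly as much disk space as the inputs.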

Answered By: Shawn

Answer 3

Run this command to get the difference between two files (a and b):

diff -U 0 a b

Or, if you want to store the diff in another file (c), run the command below. Note that >> appends to c if it already exists; use > to overwrite it instead.

diff -U 0 a b >> c
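
If only the extra rows themselves are wanted, without the diff markup, one possible follow-up is sketched below. It assumes the two files are otherwise identical and in the same order, so that every difference is a line added in b. The output name added_lines.txt and the sample data are hypothetical.

```shell
# Illustrative sample data: b has one extra line compared with a.
printf 'cat\ndog\nrabbit\n' > a
printf 'cat\ndog\nllama\nrabbit\n' > b

# diff exits with status 1 when the files differ; treat only that as success.
diff -U 0 a b > c || [ $? -eq 1 ]

# Skip the two header lines (--- and +++), then print added lines without the '+'.
sed -n '1,2d; s/^+//p' c > added_lines.txt
```

Here added_lines.txt contains just llama. Lines removed from a would appear in c with a leading '-' and are ignored by this sketch.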

Answered By: SURJIT KUMAR
