How do I combine 2 lines at a time, appending line1 onto line2, pulling only specific parts of each line in Bash?

Question:

I have millions of short input files. PyLauncher will run millions of Python scripts in parallel on supercomputers. Each script runs a program on each of its input files, copies 2 lines from each output, and appends those 2 lines to results.txt. The Python script looks like:

for input_file in directory:
    subprocess.run(f"script_name {input_file} | sed -n '22p; 39p' | tee -a results.txt", shell=True)

results.txt will have 2*num_input_files (millions of) lines like:

Ligand: ./input/ZINC00001677.pdbqt
1       -8.288          0          0
Ligand: ./input/ZINC00001567.pdbqt
1       -10.86          0          0
Ligand: ./input/ZINC00001601.pdbqt
1       -7.721          0          0

I’d like to rearrange this: put each pair of lines on one line, drop the 1, 0, and 0 from the second line, and sort so the most negative number comes first, so it looks like:

-10.86 ZINC00001567.pdbqt
-8.288 ZINC00001677.pdbqt
-7.721 ZINC00001601.pdbqt

I found this StackOverflow question: How do I sort two lines at a time in bash, using the second line as index?

But I can’t quite get the commands to work for my file. Speed of execution is very important, so Bash commands or Python could both work, depending on which is faster.
Thanks in advance!

Asked By: darrowboat

||

Answers:

In python I would do something like this:

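with open('input.txt', 'r') as f_inp, open('output.txt', 'w') as f_out:
    while True:
        # First line of the pair: "Ligand: ./input/....pdbqt"
        one = f_inp.readline().strip('\n')
        if not one:
            break  # end of file
        # Second line of the pair: the row of numbers
        two = f_inp.readline().strip('\n')
        # Write the numbers first, then the ligand line
        f_out.write(f'{two} - {one}\n')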

Then I would leave it to the sort command to do the sort part.
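
A minimal sketch of the same pairing idea that produces the exact "score filename" layout the question asks for. The helper name pair_records and the inline sample data are assumptions made for illustration; the field positions (last field of the Ligand line, second field of the numbers line) are taken from the sample in the question.

```python
import io
from os.path import basename


def pair_records(lines):
    """Yield 'score filename' strings from alternating Ligand/score lines."""
    it = iter(lines)
    # zip(it, it) draws two consecutive lines from the same iterator per step
    for ligand_line, score_line in zip(it, it):
        filename = basename(ligand_line.split()[-1])  # drop "Ligand:" and the path
        score = score_line.split()[1]                 # second column is the score
        yield f"{score} {filename}"


# Sample data copied from the question; a real run would pass an open file instead
sample = io.StringIO(
    "Ligand: ./input/ZINC00001677.pdbqt\n"
    "1       -8.288          0          0\n"
    "Ligand: ./input/ZINC00001567.pdbqt\n"
    "1       -10.86          0          0\n"
    "Ligand: ./input/ZINC00001601.pdbqt\n"
    "1       -7.721          0          0\n"
)

# Sort numerically on the first field so the most negative score comes first
records = sorted(pair_records(sample), key=lambda r: float(r.split()[0]))
print("\n".join(records))
```

If the sorting is left to the shell instead, writing the unsorted records to a file and running sort -n on it should give the same order, since the score is the first field on each line.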

Answered By: stenci

If you have enough RAM to store the output file contents then you could do this:

from os.path import basename

INPUTFILE = 'verylargefile.txt'
OUTPUTFILE = 'results.txt'

result = []

with open(INPUTFILE) as data:
    while line := data.readline():
        filename = basename(line.split()[-1])
        v = data.readline().split()[1]
        result.append(f'{v} {filename}\n')


with open(OUTPUTFILE, 'w') as data:
    data.writelines(sorted(result, key=lambda x: float(x.split()[0])))

Answered By: Pingu