Why is awk so much faster than python in this case?

Question:

I have a clip list with 200,000 rows. Each row has the form

<field1> <field2>

To get just field 1, I can run a script that looks like this:

import os
import sys
jump = open(sys.argv[1],"r")
clips = open("clips.list","w")
text = jump.readlines()
list_of_clips = str()

for line in text:
     clip_to_add =   line.split(" ")[0]
     list_of_clips = list_of_clips + clip_to_add +'\n'

with open ('clips.list', 'w') as file:
    file.write (list_of_clips)

jump.close()

Or I can just use awk '{print $1}'.

Why is awk SO much quicker? It completes the job in about 1 second.

Asked By: ZakS


Answers:

import os
import sys
jump = open(sys.argv[1],"r")
clips = open("clips.list","w")
text = jump.readlines()
list_of_clips = str()

for line in text:
     clip_to_add =   line.split(" ")[0]
     list_of_clips = list_of_clips + clip_to_add +'\n'

with open ('clips.list', 'w') as file:
    file.write (list_of_clips)

jump.close()

From a performance point of view, this code is poorly written. .readlines() has to read the whole file to build a list (a mutable type, a feature you never use), even though you do not need the contents of the whole file to process it. When reading a file, you can use for line in <filehandle>: to avoid loading the whole file into memory. With that, you can print the first field of a space-separated file.txt like so:

with open("file.txt","r") as f:
    for line in f:
        print(line.split(" ")[0])

Moreover, you import os without using anything from it, and you open clips.list twice, first as clips and later as file, and never use the former.
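
Putting those points together, here is a minimal sketch of how the original script could look: no unused import, each file opened once, and lines streamed from input to output. The function name extract_first_fields is made up for illustration, and it keeps the split(" ") behaviour from the question, so it assumes fields are separated by a single space.

```python
def extract_first_fields(src_path, dst_path):
    """Write the first space-separated field of each line of src_path to dst_path."""
    # Both files are opened exactly once and closed automatically by the with block.
    with open(src_path, "r") as src, open(dst_path, "w") as dst:
        for line in src:  # iterate over the file object; no readlines()
            # maxsplit=1 stops at the first space. rstrip removes the newline
            # that remains when a line contains no space at all.
            dst.write(line.split(" ", 1)[0].rstrip("\n") + "\n")
```

To match the question's usage, you would call it as extract_first_fields(sys.argv[1], "clips.list") after import sys.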

To sum up: awk '{print $1}' is correctly written AWK code, whilst the presented Python code is of very dubious quality, so comparing the two gives an unreliable result.
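
If you want to check the difference on your own machine, a rough timing sketch could look like the one below. The data is synthetic (made-up clip names, not your file), the function names are hypothetical, and the work is done in memory with io.StringIO so that disk speed does not affect the numbers. Note that result + field + "\n" may copy the growing string on every iteration, depending on the interpreter, so the first version can slow down sharply as the row count grows. Start with a small n_rows.

```python
import io
import timeit


def concat_version(text):
    """Mirror the question: readlines() plus repeated string concatenation."""
    result = str()
    for line in io.StringIO(text).readlines():
        result = result + line.split(" ")[0] + "\n"
    return result


def streaming_version(text):
    """Iterate over the file object and write each field as soon as it is read."""
    out = io.StringIO()
    for line in io.StringIO(text):
        out.write(line.split(" ")[0] + "\n")
    return out.getvalue()


def time_both(n_rows=20_000, repeats=3):
    """Return {function name: total seconds} measured on synthetic two-field rows."""
    text = "".join(f"clip{i} value{i}\n" for i in range(n_rows))
    return {
        fn.__name__: timeit.timeit(lambda: fn(text), number=repeats)
        for fn in (concat_version, streaming_version)
    }
```

Calling print(time_both()) shows the two timings side by side. Both functions produce identical output, so any gap comes from how the result is built, not from what is computed.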

Answered By: Daweo