Why is awk so much faster than python in this case?
Question:
I have a clip list with 200,000 rows. Each row is of the form
<field1> <field2>
To get just field 1, I can run a script that looks like this:
import os
import sys
jump = open(sys.argv[1],"r")
clips = open("clips.list","w")
text = jump.readlines()
list_of_clips = str()
for line in text:
    clip_to_add = line.split(" ")[0]
    list_of_clips = list_of_clips + clip_to_add +'\n'
with open ('clips.list', 'w') as file:
    file.write (list_of_clips)
jump.close()
Or I can just use awk '{print $1}'.
Why is awk SO much quicker? It completes the job in about 1 second.
Answers:
import os
import sys
jump = open(sys.argv[1],"r")
clips = open("clips.list","w")
text = jump.readlines()
list_of_clips = str()
for line in text:
    clip_to_add = line.split(" ")[0]
    list_of_clips = list_of_clips + clip_to_add +'\n'
with open ('clips.list', 'w') as file:
    file.write (list_of_clips)
jump.close()
This code is poorly written from a performance point of view. .readlines() has to read the whole file in order to build a list (a mutable structure, a feature you never use), even though in your case you do not need the contents of the whole file to do the processing. When reading a file you can use for line in <filehandle>: to avoid loading the whole file into memory. Using this, you can print the first field of a space-separated file.txt like so:
with open("file.txt","r") as f:
    for line in f:
        print(line.split(" ")[0])
Moreover, you import os but never use anything from it, and you open clips.list twice, once as clips and later as file, and never make any use of the former.
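Putting these points together, a tidied version of the script might look like the sketch below. It is only an illustration, not the asker's code: the function name extract_first_fields and its parameters are hypothetical. It also strips the trailing newline before splitting, which is an added assumption so that a line with only one field does not produce a blank line in the output.

```python
def extract_first_fields(src_path, dst_path="clips.list"):
    """Copy the first space-separated field of each line in src_path to dst_path.

    Hypothetical helper for illustration; the name and signature are assumptions.
    """
    # Each file is opened exactly once, and both are closed automatically.
    with open(src_path, "r") as src, open(dst_path, "w") as dst:
        for line in src:  # iterate over the handle instead of calling readlines()
            # Assumption: drop the trailing newline first, so a one-field line
            # does not carry its own newline into the output.
            first = line.rstrip("\n").split(" ")[0]
            dst.write(first + "\n")
```

From a script, this could be called as extract_first_fields(sys.argv[1]) after import sys, which is the only import it would then need.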
In short: awk '{print $1}' is correctly written AWK code, whereas the Python code presented is of very dubious quality, so comparing the two gives an unreliable result.
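To check how much of the gap comes from the way the Python is written, one option is to time the question's approach against the streaming approach on the same input. The following is a rough sketch under stated assumptions: the function names, the file names, and the synthetic rows are all made up for illustration, and the row count is kept smaller than the asker's 200,000 so that the run stays short. Timings will vary by machine and interpreter, so no expected numbers are given.

```python
import os
import tempfile
import timeit


def readlines_version(src_path, dst_path):
    """The question's approach: read every line up front, build one big string."""
    with open(src_path, "r") as src:
        text = src.readlines()
    list_of_clips = str()
    for line in text:
        list_of_clips = list_of_clips + line.split(" ")[0] + "\n"
    with open(dst_path, "w") as dst:
        dst.write(list_of_clips)


def streaming_version(src_path, dst_path):
    """The answer's approach: iterate over the file handle and write as you go."""
    with open(src_path, "r") as src, open(dst_path, "w") as dst:
        for line in src:
            dst.write(line.split(" ")[0] + "\n")


def compare(rows=20_000):
    """Time both versions on a synthetic file; return (timings, outputs_match)."""
    workdir = tempfile.mkdtemp()
    src = os.path.join(workdir, "jump.txt")
    # Made-up rows of the form "<field1> <field2>"; not the asker's real data.
    with open(src, "w") as f:
        for i in range(rows):
            f.write("clip%d extra%d\n" % (i, i))
    timings = {}
    outputs = []
    for fn in (readlines_version, streaming_version):
        dst = os.path.join(workdir, fn.__name__ + ".list")
        # One timed run per version, in seconds.
        timings[fn.__name__] = timeit.timeit(lambda: fn(src, dst), number=1)
        with open(dst) as f:
            outputs.append(f.read())
    return timings, outputs[0] == outputs[1]


if __name__ == "__main__":
    timings, same = compare()
    print(timings, "identical output:", same)
```

The awk command can be timed separately on the same input file, for example with the shell's time builtin, to put all three numbers side by side.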