Scan paired cells of two columns for the same pattern using Python
Question:
I’m a Python beginner and would like to learn how to use it for operations on text files. I have an input txt file of 4 columns separated by TAB, and I want to search whether, row by row, the cell pairs in columns 1 and 4 simultaneously contain the pattern "BBB" or "CCC". If true, send the whole line to output1. If false, send the whole line to output2.
This is the input.txt:
more input.txt
AABBBAA 2 5 AACCCAA
AAAAAAA 4 10 AAAAAAA
AABBBAA 6 15 AABBBAA
AAAAAAA 8 20 AAAAAAA
AACCCAA 10 25 AACCCAA
AAAAAAA 12 30 AAAAAAA
This is the Python code I wrote:
more main.py
import sys
input = open(sys.argv[1], "r")
output1 = open(sys.argv[2], "w")
output2 = open(sys.argv[3], "w")
list = ["BBB", "CCC"]
for line in input:
    for item in list:
        if item in line.split("\t")[0] and item in line.split("\t")[3]:
            output1.write(line)
        else:
            output2.write(line)
input.close()
output1.close()
output2.close()
Command:
python main.py input.txt output1.txt output2.txt
output1.txt is correct
more output1.txt
AABBBAA 6 15 AABBBAA
AACCCAA 10 25 AACCCAA
output2.txt is incorrect. I’m trying to understand why it contains both the lines from output1.txt and two copies of each of the other lines.
more output2.txt
AABBBAA 2 5 AACCCAA
AABBBAA 2 5 AACCCAA
AAAAAAA 4 10 AAAAAAA
AAAAAAA 4 10 AAAAAAA
AABBBAA 6 15 AABBBAA
AAAAAAA 8 20 AAAAAAA
AAAAAAA 8 20 AAAAAAA
AACCCAA 10 25 AACCCAA
AAAAAAA 12 30 AAAAAAA
AAAAAAA 12 30 AAAAAAA
output2.txt should be:
AABBBAA 2 5 AACCCAA
AAAAAAA 4 10 AAAAAAA
AAAAAAA 8 20 AAAAAAA
AAAAAAA 12 30 AAAAAAA
Thank you for your help!
Answers:
You get duplicated lines in output2 because you ask it to do so. Your condition is: if item exists in both columns, write the line to output1, else write it to output2. Then you proceed to do this for each item in list. Since there are two items in list, and (e.g. in line 1) the first item doesn’t exist in both columns, it writes the line once to output2; then the second item doesn’t exist in both columns either, so it writes the line again to output2.
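To make this visible, here is a small sketch that replays the original nested loop on single rows and records where each write would go instead of touching real files. The helper name trace_writes is made up for illustration; the rows are taken from the question's input.txt.

```python
# Replay the original loop body for one row and record the target of
# every write. trace_writes is a hypothetical helper, not part of the
# original script.
def trace_writes(line, patterns=("BBB", "CCC")):
    writes = []
    for item in patterns:
        cols = line.split("\t")
        if item in cols[0] and item in cols[3]:
            writes.append("output1")
        else:
            writes.append("output2")
    return writes


row1 = "AABBBAA\t2\t5\tAACCCAA\n"
row3 = "AABBBAA\t6\t15\tAABBBAA\n"

print(trace_writes(row1))  # ['output2', 'output2']
print(trace_writes(row3))  # ['output1', 'output2']
```

Row 1 is written to output2 twice, once per pattern. Row 3 matches "BBB" but not "CCC", so it goes to output1 and then also once to output2, which is why the lines of output1.txt show up in output2.txt as well.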
Let’s restate your condition:
[Check if] the cell pairs in columns 1 and 4 simultaneously contain the pattern "BBB" or "CCC". If true, send the whole line to output1. If false, send the whole line to output2.
So for each row, you want to check if any (any) of the items in list (for item in lst) occur in both those columns (item in cols[0] and item in cols[3]):
lst = ["BBB", "CCC"]
for line in input_file:
    cols = line.split("\t")
    if any(item in cols[0] and item in cols[3] for item in lst):
        output1.write(line)
    else:
        output2.write(line)
Note that I renamed list to lst and input to input_file in my code to avoid shadowing the builtins.
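For completeness, here is one way the whole script could look with the any() check in place. This is only a sketch: the function name split_rows, the PATTERNS constant, and the choice to send rows with fewer than four columns to the second file are my assumptions, not something the question specifies. The with statement closes all three files automatically, so the explicit close() calls are no longer needed.

```python
import sys

PATTERNS = ["BBB", "CCC"]  # assumed constant name, for illustration only


def split_rows(in_path, match_path, rest_path, patterns=PATTERNS):
    """Copy each row to match_path if any single pattern occurs in both
    column 1 and column 4, otherwise to rest_path."""
    with open(in_path) as src, \
            open(match_path, "w") as matched, \
            open(rest_path, "w") as rest:
        for line in src:
            cols = line.rstrip("\n").split("\t")
            # Assumption: rows with fewer than 4 columns count as non-matching.
            if len(cols) >= 4 and any(
                p in cols[0] and p in cols[3] for p in patterns
            ):
                matched.write(line)
            else:
                rest.write(line)


if __name__ == "__main__" and len(sys.argv) == 4:
    split_rows(sys.argv[1], sys.argv[2], sys.argv[3])
```

It is invoked the same way as before: python main.py input.txt output1.txt output2.txt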
The issue is with the else part of the if statement. Every time the if condition is not True for the current item, you write the line to the output2.txt file, which is not the logic you want.
You will need to change the logic so that the line is written to output2.txt only once, after all the patterns have been checked and neither "BBB" nor "CCC" was found in both columns.
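One way to do that while keeping an explicit loop is to remember whether any pattern matched and decide on the destination only after the loop has finished. The sketch below returns the destination name rather than writing to a file so that it can be run on its own; the function name destination and the found flag are illustrative choices.

```python
# Decide once per row, after all patterns have been checked.
# destination is a hypothetical helper name used for illustration.
def destination(line, patterns=("BBB", "CCC")):
    cols = line.rstrip("\n").split("\t")
    found = False
    for item in patterns:
        if item in cols[0] and item in cols[3]:
            found = True
            break  # one matching pattern is enough
    return "output1" if found else "output2"


print(destination("AABBBAA\t6\t15\tAABBBAA\n"))  # output1
print(destination("AABBBAA\t2\t5\tAACCCAA\n"))   # output2
```

In the real script, the return value would select which file object receives the single write(line) call for that row.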