Scan paired cells of two columns for the same pattern using Python

Question:

I’m a Python beginner and would like to learn how to use it for operations on text files. I have an input txt file of 4 columns separated by TAB, and I want to search whether, row by row, the cell pairs in columns 1 and 4 simultaneously contain the pattern "BBB" or "CCC". If true, send the whole line to output1. If false, send the whole line to output2.

This is the input.txt:


more input.txt

AABBBAA 2   5   AACCCAA
AAAAAAA 4   10  AAAAAAA
AABBBAA 6   15  AABBBAA
AAAAAAA 8   20  AAAAAAA
AACCCAA 10  25  AACCCAA
AAAAAAA 12  30  AAAAAAA

This is the Python code I wrote:

more main.py
import sys

input = open(sys.argv[1], "r")
output1 = open(sys.argv[2], "w")
output2 = open(sys.argv[3], "w")

list = ["BBB", "CCC"]

for line in input:
    for item in list:
        if item in line.split("\t")[0] and item in line.split("\t")[3]:
            output1.write(line)
        else:
            output2.write(line)

input.close()
output1.close()
output2.close()

Command:

python main.py input.txt output1.txt output2.txt

output1.txt is correct

more output1.txt
AABBBAA 6   15  AABBBAA
AACCCAA 10  25  AACCCAA

output2 is incorrect. I’m trying to understand why it takes both the lines of output1.txt and the double copy of the other lines.

more output2.txt
AABBBAA 2   5   AACCCAA
AABBBAA 2   5   AACCCAA
AAAAAAA 4   10  AAAAAAA
AAAAAAA 4   10  AAAAAAA
AABBBAA 6   15  AABBBAA
AAAAAAA 8   20  AAAAAAA
AAAAAAA 8   20  AAAAAAA
AACCCAA 10  25  AACCCAA
AAAAAAA 12  30  AAAAAAA
AAAAAAA 12  30  AAAAAAA

output2.txt should be:

AABBBAA 2   5   AACCCAA
AAAAAAA 4   10  AAAAAAA
AAAAAAA 8   20  AAAAAAA
AAAAAAA 12  30  AAAAAAA

Thank you for your help!

Asked By: Gabriele


Answers:

You get duplicated lines in output2 because you ask it to do so. Your condition is: If item exists in both columns, write the line to output1, else write it to output2. Then you proceed to do this for each item in list. Since there are two items in list, and (e.g. in line 1) the first item doesn’t exist in both columns, it writes the line once to output2, then the second item doesn’t exist in both columns either, so it writes the line again to output2.
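To see this concretely, here is a small sketch that records which branch the original nested loop takes for each pattern. The helper name trace is made up for illustration; the sample lines come from the question's input.txt.

```python
def trace(line, patterns=("BBB", "CCC")):
    """Return the (pattern, destination) pairs the original nested loop produces.

    `trace` is a hypothetical helper, not part of the asker's script.
    """
    cols = line.split("\t")
    branches = []
    for item in patterns:
        # Same test as the original code, evaluated once per pattern.
        if item in cols[0] and item in cols[3]:
            branches.append((item, "output1"))
        else:
            branches.append((item, "output2"))
    return branches


# Row 1: neither pattern is in both columns, so the line goes to output2 twice.
print(trace("AABBBAA\t2\t5\tAACCCAA\n"))
# Row 3: "BBB" matches (output1), but "CCC" does not, so the line also lands in output2.
print(trace("AABBBAA\t6\t15\tAABBBAA\n"))
```

One write happens per pattern, which is why unmatched rows appear twice in output2 and matched rows appear there once.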

Let’s restate your condition:

[Check if] the cell pairs in columns 1 and 4 simultaneously contain the pattern "BBB" or "CCC". If true, send the whole line to output1. If false, send the whole line to output2.

So for each row, you want to check whether any of the items in lst (for item in lst) occurs in both of those columns (item in cols[0] and item in cols[3]).

lst = ["BBB", "CCC"]
for line in input_file:
    cols = line.split("\t")
    if any(item in cols[0] and item in cols[3] for item in lst):
        output1.write(line)
    else:
        output2.write(line)

Note that I renamed list to lst and input to input_file in my code to avoid shadowing the builtins list and input.
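For completeness, a full script built around that any() check might look like the sketch below. The names PATTERNS, matches and main are assumptions introduced here, and the guard for rows with fewer than four columns is an extra safety check that the question did not ask for.

```python
import sys

PATTERNS = ["BBB", "CCC"]  # patterns taken from the question


def matches(line, patterns=PATTERNS):
    """Return True if any single pattern occurs in both column 1 and column 4."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 4:
        # Assumption: short or blank rows count as non-matching.
        return False
    return any(item in cols[0] and item in cols[3] for item in patterns)


def main(argv):
    in_path, out1_path, out2_path = argv[1:4]
    # "with" closes all three files even if an error occurs part-way through.
    with open(in_path) as input_file, \
            open(out1_path, "w") as output1, \
            open(out2_path, "w") as output2:
        for line in input_file:
            target = output1 if matches(line) else output2
            target.write(line)


if __name__ == "__main__" and len(sys.argv) == 4:
    main(sys.argv)
```

It is run the same way as the original: python main.py input.txt output1.txt output2.txt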

Answered By: Pranav Hosangadi

The issue is with the else branch of the if statement. Every time the if condition is False for one item, you write the line to output2.txt, which is not the logic you want.

You need to change the logic so that a line is written to output2.txt only if neither ‘BBB’ nor ‘CCC’ is found in both columns.
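One way to do that while keeping the explicit inner loop is Python's for/else: the else block runs only when the loop finishes without hitting break. This is a sketch, not the only option; the function name split_rows is invented for illustration, and io.StringIO stands in for the real output files.

```python
import io


def split_rows(lines, output1, output2, patterns=("BBB", "CCC")):
    """Write each line to output1 if a pattern is in columns 1 and 4, else to output2."""
    for line in lines:
        cols = line.split("\t")
        for item in patterns:
            if item in cols[0] and item in cols[3]:
                output1.write(line)
                break  # stop at the first matching pattern
        else:
            # Reached only when no pattern triggered the break above.
            output2.write(line)


# Usage with in-memory buffers instead of files:
rows = [
    "AABBBAA\t2\t5\tAACCCAA\n",
    "AABBBAA\t6\t15\tAABBBAA\n",
]
out1, out2 = io.StringIO(), io.StringIO()
split_rows(rows, out1, out2)
print(out1.getvalue(), end="")  # the row with "BBB" in both columns
print(out2.getvalue(), end="")  # the row with mismatched patterns
```

Each line is now written exactly once, to one file or the other.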

Answered By: SgtSafety