How to combine lines from two files based on a condition in Python?

Question:

I need to combine lines from two files based on a condition: part of a line in one file must match part of a line in the other file.

A part of the first file:

12319000    -64,7357668067227   -0,1111052148685535  
12319000    -79,68527661064425  -0,13231739777754026  
12319000    -94,69642857142858  -0,15117839559513543    
12319000    -109,59301470588237 -0,18277783185642743  
12319001    99,70264355742297   0,48329515727315125  
12319001    84,61113445378152   0,4060446341409862  
12319001    69,7032037815126    0,29803063228455073  
12319001    54,93886554621849   0,20958105041136763  
12319001    39,937394957983194  0,13623056582981297  
12319001    25,05574229691877   0,07748669438398018  
12319001    9,99716386554622    0,028110643107892755  

A part of the second file:

12319000.abf    mutant  1  
12319001.abf    mutant  2  
12319002.abf    mutant  3  

I need to create a file in which each line consists of the whole line from the first file followed by everything from the matching line of the second file, except for the filename in the first column.

As you can see, more than one line in the first file corresponds to each line in the second one. I need the operation to be done for each line, so the output should look like this:

12319000    -94,69642857142858  -0,15117839559513543  mutant    1  
12319000    -109,59301470588237 -0,18277783185642743  mutant    1  
12319001    99,70264355742297   0,48329515727315125  mutant 2  
12319001    84,61113445378152   0,4060446341409862  mutant  2  

I’ve written this code:

oocytes = open(file_with_oocytes, 'r')  
results = open(os.path.join(path, 'results.csv'), 'r')  
results_new = open(os.path.join(path, 'results_with_oocytes.csv'), 'w')  
for line in results:  
    for lines in oocytes:  
        if lines[0:7] in line:  
            print line + lines[12:]  

But it prints only this and nothing more, though there are 45 lines in the first file:

12319000    99,4952380952381    0,3011778623990699
    mutant  1  

12319000    99,4952380952381    0,3011778623990699
    mutant  2  

12319000    99,4952380952381    0,3011778623990699
    mutant  3  

What is wrong with the code?
Or should it be done in a completely different way?

Asked By: Phlya


Answers:

Note that this solution doesn’t rely on the lengths of any field except for the length of the file extension in the second file.

# make a dict keyed on the filename before the extension
# with the other two fields as its value
file2dict = dict((row[0][:-4], row[1:])  
                     for row in (line.split() for line in file2))

# then append to each row of file1
# the values stored under its first column
output = [row + file2dict[row[0]] for row in (line.split() for line in file1)]

For testing purposes only, I used:

# I just use this to emulate a file object, as iterating over it yields lines
# just use file1 = open(whatever_the_filename_is_for_this_data)
# and the rest of the program is the same
file1 = """12319000    -64,7357668067227   -0,1111052148685535
12319000    -79,68527661064425  -0,13231739777754026
12319000    -94,69642857142858  -0,15117839559513543
12319000    -109,59301470588237 -0,18277783185642743
12319001    99,70264355742297   0,48329515727315125
12319001    84,61113445378152   0,4060446341409862
12319001    69,7032037815126    0,29803063228455073
12319001    54,93886554621849   0,20958105041136763
12319001    39,937394957983194  0,13623056582981297
12319001    25,05574229691877   0,07748669438398018
12319001    9,99716386554622    0,028110643107892755""".splitlines()

# again, use file2 = open(whatever_the_filename_is_for_this_data)
# and the rest of the program will work the same
file2 = """12319000.abf    mutant  1
12319001.abf    mutant  2
12319002.abf    mutant  3""".splitlines()

In real use, you should just use normal file objects instead. The output for the test data is:

   [['12319000', '-64,7357668067227', '-0,1111052148685535', 'mutant', '1'],
    ['12319000', '-79,68527661064425', '-0,13231739777754026', 'mutant', '1'],
    ['12319000', '-94,69642857142858', '-0,15117839559513543', 'mutant', '1'],
    ['12319000', '-109,59301470588237', '-0,18277783185642743', 'mutant', '1'],
    ['12319001', '99,70264355742297', '0,48329515727315125', 'mutant', '2'],
    ['12319001', '84,61113445378152', '0,4060446341409862', 'mutant', '2'],
    ['12319001', '69,7032037815126', '0,29803063228455073', 'mutant', '2'],
    ['12319001', '54,93886554621849', '0,20958105041136763', 'mutant', '2'],
    ['12319001', '39,937394957983194', '0,13623056582981297', 'mutant', '2'],
    ['12319001', '25,05574229691877', '0,07748669438398018', 'mutant', '2'],
    ['12319001', '9,99716386554622', '0,028110643107892755', 'mutant', '2']]
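
To get from a list of rows like this to the file the question asks for, one option is to join each row with tabs and write it out. The following is only a sketch under assumptions: the function name merge_rows, the tab separator, and the use of io.StringIO in place of real files are illustrative choices, not part of the answer above. It also strips the extension with rsplit rather than a fixed slice, and it skips rows that have no match instead of raising a KeyError.

```python
import io


def merge_rows(data_lines, annotation_lines):
    """Append the annotation fields to every data row with a matching ID.

    Both arguments are iterables of text lines (for example open files).
    """
    # Map "12319000" -> ["mutant", "1"], dropping the extension from the name.
    annotations = {}
    for line in annotation_lines:
        fields = line.split()
        if fields:  # skip blank lines
            annotations[fields[0].rsplit(".", 1)[0]] = fields[1:]

    merged = []
    for line in data_lines:
        fields = line.split()
        # Rows without a matching annotation are skipped in this sketch.
        if fields and fields[0] in annotations:
            merged.append(fields + annotations[fields[0]])
    return merged


# io.StringIO stands in for the two input files from the question.
data = io.StringIO(
    "12319000    -64,7357668067227   -0,1111052148685535\n"
    "12319001    99,70264355742297   0,48329515727315125\n"
)
annotation = io.StringIO(
    "12319000.abf    mutant  1\n"
    "12319001.abf    mutant  2\n"
    "12319002.abf    mutant  3\n"
)

# With a real file this would be open(some_path, "w").
out = io.StringIO()
for row in merge_rows(data, annotation):
    out.write("\t".join(row) + "\n")

print(out.getvalue())
```

Because the lookup table is built once and each data line is visited once, the work grows with the sum of the two file sizes rather than their product.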
Answered By: agf

File handles in Python have state; that is, they do not work like lists. You can repeatedly iterate over a list and get all the values out each time. Files, on the other hand, have a position from which the next read() will occur. When you iterate over the file, each line is read in turn. When you reach the last line, the file pointer is at the end of the file, and a read() from the end of the file returns the empty string ''. In your code, the inner loop exhausts oocytes while processing the first line of results, so every later line of results finds nothing left to compare against.
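
The exhaustion behaviour can be seen in a small sketch. It uses io.StringIO as a stand-in for a real file, an assumption made only so the example is self-contained; an object returned by open() behaves the same way when iterated. The variable names are illustrative.

```python
import io

# io.StringIO stands in for a file opened with open(); the sample lines are
# taken from the question's second file.
oocytes = io.StringIO("12319000.abf    mutant  1\n12319001.abf    mutant  2\n")

first_pass = [line for line in oocytes]   # consumes the whole "file"
second_pass = [line for line in oocytes]  # position is at the end: nothing left

print(len(first_pass))   # 2
print(len(second_pass))  # 0

# Rewinding with seek(0) makes the lines available again.
oocytes.seek(0)
third_pass = [line for line in oocytes]
print(len(third_pass))   # 2
```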

What you need to do is read in the oocytes file once at the beginning, and store the values, maybe something like this:

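oodict = {}
for line in oocytes:
    # key: the full 8-character identifier
    # value: everything after the ".abf" extension, without the newline
    oodict[line[0:8]] = line[12:].rstrip()

for line in results:
    results_key = line[0:8]
    if results_key in oodict:
        # the results line comes first, then the fields from the oocytes file
        print(line.rstrip() + oodict[results_key])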
Answered By: Cuadue

Well, simple things first: you printed the newline at the end of line; you would want to drop that with line[0:-1].

Next, “lines[0:7]” only tests the first 7 characters of the line – you wanted to test 8 chars. That’s why the same value of “line” was printed out with 3 different mutant values.
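
The effect of the slice length can be checked directly on sample lines from the question. This is only an illustrative sketch; the variable names are made up for the example.

```python
# Two lines from the second file in the question.
oocyte_a = "12319000.abf    mutant  1"
oocyte_b = "12319001.abf    mutant  2"
# One line from the first file.
result_line = "12319000    -64,7357668067227   -0,1111052148685535"

# With 7 characters both identifiers collapse to the same prefix,
# so both oocyte lines match the same result line.
print(oocyte_a[0:7], oocyte_b[0:7])   # 1231900 1231900
print(oocyte_a[0:7] in result_line)   # True
print(oocyte_b[0:7] in result_line)   # True

# With 8 characters only the correct identifier matches.
print(oocyte_a[0:8] in result_line)   # True
print(oocyte_b[0:8] in result_line)   # False
```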

Finally, you need to close and re-open oocytes for each line in results. Failure to do so ended your output after the first line of results.

Actually, the other answer is better – don’t open and close oocytes for each line of results – open it and read it in (to a list) once, then iterate over that list for each line of results.

Answered By: Frank Klotz

Your code should work with some tweaks:

oocytes = open(file_with_oocytes, 'r').readlines()
results = open(os.path.join(path, 'results.csv'), 'r').readlines()
results_new = open(os.path.join(path, 'results_with_oocytes.csv'), 'w')
for line in results:
    for lines in oocytes:
        # compare the full 8-character identifier
        if lines[0:8] in line:
            # lines[12:] keeps its trailing newline, which ends the output line
            results_new.write(line.strip() + lines[12:])
results_new.close()

Note the addition of readlines(), which gives lists that can be iterated over repeatedly. Another important fix is the 0:8 range, because you need the whole identifier.

I know this answer is coming more than 10 years later, but I consider this a good exercise in solving a pretty common task.

Answered By: Tonat San