Why does my file keep closing after the first loop in Python?

Question:

I’m trying to read through a large file in which I have marked the start and end lines of each segment. I’m extracting a component of each segment using regex.
What I don’t understand is that after the first inner loop, my code seems to have closed the file and I don’t get the desired output.
Simplified code below

with open("data_full", 'r') as file:
    for x in position:
        print(x)
        s = position[x]['start']
        e = position[x]['end']
        title = []
        abs = []
        mesh = []
        ti_prev = False
        for i,line in enumerate(file.readlines()[s:e]):
            print(i)
            print(s,e)
            if re.search(r'(?<=TI\s{2}-\s).*', line) is not None and ti_prev is False:
                title.append(re.search(r'(?<=TI\s{2}-\s).*', line).group())
                ti_prev = True
                line_mark = i
            if re.search(r'(?<=\s{6}).*',line) is not None and ti_prev is True and i == (line_mark+1):
                title.append(re.search(r'(?<=\s{6}).*',line).group())
            else:
                pass
        data[x]['title']=title

What I think has happened is that after the first inner loop, file.readlines() no longer works because the file is closed. But I don't understand why, since it's within my with open block.

My alternative is to reopen and read the file for each segment (9k+ segments), which is not doing wonders for my performance.
Any suggestions are welcome, with thanks!

Asked By: Warren Manuel

||

Answers:

Assuming the indentation is wrong only in the post and not in your actual code, the file is not being closed. readlines() moves the file pointer to the end of the file, so any later call has nothing left to read and returns an empty list.
You need to either reopen the file or call .seek(0) to rewind it.
See this for more info: Does fp.readlines() close a file?
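
A minimal sketch of that behaviour, using io.StringIO as a stand-in for a real file object (the sample text is made up for illustration):

```python
import io

# io.StringIO stands in for a file opened with open(); the contents are invented.
file = io.StringIO("line 0\nline 1\nline 2\n")

first = file.readlines()      # reads everything; the pointer is now at the end
second = file.readlines()     # nothing left to read, so this is an empty list
still_open = not file.closed  # the file object itself has not been closed

file.seek(0)                  # rewind to the start of the file
third = file.readlines()      # the full contents are available again

print(len(first), len(second), still_open, len(third))
```

Rewinding works, but it still re-reads the whole file on every pass of the outer loop, which is the performance cost the question is trying to avoid.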

Answered By: AvitanD

The file is not being closed. file.readlines() reads the entire file and returns a list of its lines, which leaves the file pointer at the end of the file. On the first pass of the outer loop the call returns every line; on every later pass it returns an empty list, so the inner for loop has nothing to iterate over.

To fix this, you can move the call to file.readlines() outside of the outer for loop. This will cause the entire file to be read and stored in a list before the for loop starts. Then, inside the for loop, you can use the enumerate function on the list of lines to loop through the lines in the segment.

Here’s an example of how you could modify your code to fix the issue:

import re

# Read the entire file once and store the lines in a list
with open("data_full", 'r') as file:
    lines = file.readlines()

# Loop through the positions in the `position` dictionary
for x in position:
    # Get the start and end indices of the current segment
    s = position[x]['start']
    e = position[x]['end']

    # Initialize variables to store the title, abstract, and mesh terms
    title = []
    abs = []
    mesh = []

    # Set a flag to track whether the title has been found
    ti_prev = False

    # Loop through the lines in the current segment
    for i, line in enumerate(lines[s:e]):
        # Check if the current line is a title line
        if re.search(r'(?<=TI\s{2}-\s).*', line) is not None and ti_prev is False:
            # If it is a title line, store it in the `title` list and set the flag
            title.append(re.search(r'(?<=TI\s{2}-\s).*', line).group())
            ti_prev = True
            line_mark = i

        # Check if the current line is a continuation of the title
        if re.search(r'(?<=\s{6}).*',line) is not None and ti_prev is True and i == (line_mark+1):
            # If it is, store it in the `title` list
            title.append(re.search(r'(?<=\s{6}).*',line).group())

    # Store the title in the `data` dictionary
    data[x]['title'] = title
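
To check the approach on a small input, here is a self-contained sketch of the same logic. The sample lines, the keys in position, and the helper name extract_titles are all made up for illustration; in real use, lines would come from a single file.readlines() call as above.

```python
import re

# Patterns from the question: a "TI  - " title line and a 6-space continuation line.
TITLE_RE = re.compile(r'(?<=TI\s{2}-\s).*')
CONT_RE = re.compile(r'(?<=\s{6}).*')

def extract_titles(lines, position):
    """Hypothetical helper: collect the title lines for each segment."""
    titles = {}
    for key, span in position.items():
        title = []
        title_line = None  # index of the title line within the segment
        for i, line in enumerate(lines[span['start']:span['end']]):
            if title_line is None:
                match = TITLE_RE.search(line)
                if match:
                    title.append(match.group())
                    title_line = i
            elif i == title_line + 1:
                # Only the line directly after the title can continue it
                match = CONT_RE.search(line)
                if match:
                    title.append(match.group())
        titles[key] = title
    return titles

# Invented sample data standing in for the result of file.readlines()
lines = [
    "PMID- 1\n",
    "TI  - A long title that\n",
    "      wraps onto a second line\n",
    "AB  - Abstract text\n",
    "PMID- 2\n",
    "TI  - Short title\n",
    "AB  - Another abstract\n",
]
position = {"a": {"start": 0, "end": 4}, "b": {"start": 4, "end": 7}}

titles = extract_titles(lines, position)
print(titles)
```

Because lines is an ordinary list, slicing it for each of the 9k+ segments never touches the file again.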

Hope this helps!

Answered By: A-poc