Why does my file keep closing after the first loop in Python?
Question:
I’m trying to read through a large file in which I have marked the start and end lines of each segment. I’m extracting a component of each segment using regex.
What I don’t understand is that after the first inner loop, my code seems to have closed the file and I don’t get the desired output.
Simplified code below
with open("data_full", 'r') as file:
    for x in position:
        print(x)
        s = position[x]['start']
        e = position[x]['end']
        title = []
        abs = []
        mesh = []
        ti_prev = False
        for i,line in enumerate(file.readlines()[s:e]):
            print(i)
            print(s,e)
            if re.search(r'(?<=TI\s{2}-\s).*', line) is not None and ti_prev is False:
                title.append(re.search(r'(?<=TI\s{2}-\s).*', line).group())
                ti_prev = True
                line_mark = i
            if re.search(r'(?<=\s{6}).*',line) is not None and ti_prev is True and i == (line_mark+1):
                title.append(re.search(r'(?<=\s{6}).*',line).group())
            else:
                pass
        data[x]['title']=title
What I think has happened is that after the first inner loop, file.readlines() no longer works because the file is closed. But I don’t understand why, since it’s inside my with open block.
My alternative is to reopen and read the file for each segment (9k+ segments), which is not doing wonders for my performance.
Any suggestions are welcome, thanks!
Answers:
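Assuming your indentation is wrong in the description and not actually in your original code: the file is not closed. readlines() moves the file pointer to the end of the file, so a second call has nothing left to read and returns an empty list.
You need to either reopen the file or call .seek(0) to rewind it before reading again.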
See this for more info: Does fp.readlines() close a file?
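A minimal sketch of that behaviour, using a throwaway temporary file rather than the asker’s data_full (the file contents and variable names here are made up for illustration):

```python
import os
import tempfile

# Create a small throwaway file (hypothetical contents, for illustration only).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("line 1\nline 2\nline 3\n")
    path = tmp.name

with open(path, "r") as file:
    first = file.readlines()      # reads everything; the pointer is now at end of file
    second = file.readlines()     # nothing left to read, so this is an empty list
    still_open = not file.closed  # the file is still open inside the with block
    file.seek(0)                  # rewind to the start of the file
    third = file.readlines()      # the full contents again

os.remove(path)

print(len(first), len(second), still_open, len(third))  # prints: 3 0 True 3
```

Calling seek(0) on every iteration works, but it still re-reads the whole file once per segment, so it does not address the performance concern in the question.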
file.readlines() reads the entire file and returns a list of its lines, leaving the file pointer at the end of the file. The file is not closed; there is just nothing left to read. Because the call sits inside the outer for loop, it returns the full list only on the first iteration. On every later iteration it returns an empty list, so the inner for loop has nothing to iterate over.
To fix this, move the call to file.readlines() outside the outer for loop, so the file is read once and stored in a list before the loop starts. Then, inside the loop, slice that list and use enumerate to walk through the lines of each segment.
Here’s an example of how you could modify your code to fix the issue:
import re

# Read the entire file once and store the lines in a list
with open("data_full", 'r') as file:
    lines = file.readlines()

# Loop through the positions in the `position` dictionary
# (`position` and `data` are assumed to be defined as in the question)
for x in position:
    # Get the start and end indices of the current segment
    s = position[x]['start']
    e = position[x]['end']
    # Initialize variables to store the title, abstract, and mesh terms
    title = []
    abs = []
    mesh = []
    # Set a flag to track whether the title has been found
    ti_prev = False
    # Loop through the lines in the current segment
    for i, line in enumerate(lines[s:e]):
        # Check if the current line is a title line
        if re.search(r'(?<=TI\s{2}-\s).*', line) is not None and ti_prev is False:
            # If it is a title line, store it in the `title` list and set the flag
            title.append(re.search(r'(?<=TI\s{2}-\s).*', line).group())
            ti_prev = True
            line_mark = i
        # Check if the current line is a continuation of the title
        if re.search(r'(?<=\s{6}).*', line) is not None and ti_prev is True and i == (line_mark+1):
            # If it is, store it in the `title` list
            title.append(re.search(r'(?<=\s{6}).*', line).group())
    # Store the title in the `data` dictionary
    data[x]['title'] = title
Hope this helps!
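Since the question mentions a large file, here is a hedged alternative that avoids holding every line in memory: walk the file once and pull out each segment as the iterator reaches it. This is only a sketch. The helper name iter_segments is made up for illustration, and it assumes the segments do not overlap and that start and end follow the same slice semantics as lines[start:end].

```python
from itertools import islice


def iter_segments(lines, position):
    """Yield (key, segment_lines) pairs in a single forward pass over `lines`.

    Assumes non-overlapping segments; `position` maps
    key -> {'start': int, 'end': int}, sliced like lines[start:end].
    """
    it = iter(lines)
    consumed = 0  # number of lines already pulled from the iterator
    # Visit segments in file order so the iterator only ever moves forward.
    for key, span in sorted(position.items(), key=lambda kv: kv[1]["start"]):
        start, end = span["start"], span["end"]
        # Skip the gap between the previous segment and this one.
        for _ in islice(it, start - consumed):
            pass
        # Take exactly the lines belonging to this segment.
        segment = list(islice(it, end - start))
        consumed = end
        yield key, segment


# Hypothetical sample data standing in for the real file and `position` dict.
sample = ["a\n", "b\n", "c\n", "d\n", "e\n", "f\n"]
position = {
    "first": {"start": 1, "end": 3},
    "second": {"start": 4, "end": 6},
}
segments = dict(iter_segments(sample, position))
print(segments)
```

With a real file you would pass the open file object instead of the sample list, for example iter_segments(file, position) inside the with open block, and run the regex checks on each yielded segment. If all the lines fit comfortably in memory, reading them once with readlines() as shown above is simpler.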