Process text files with Python
Question:
Complete newbie here, please help. Suppose I have a text file that looks like this:
some strings
go here
* head1 head2 head3 ... headN
3 "a" 0.3 ... -2
0.1 "b" 5 ... 1
10 "c" -4 ... 100
# and other rows with some numbers
So there are several lines before the main block of interest. The block has a "header" line; note that it starts with "*", and the actual column names start from the 2nd field. After it come rows of numbers and strings, each value corresponding to a particular head[i].
I need to process this block line by line depending on the string value in the head2 column: if, for example, the value of head2 is "a", then write a new string to a file like ‘param1 = 3, param2 = 0.3’, i.e. take the values from head1 and head3 of the line being processed.
The problem is that this "header" line can have a different number of elements and the order of head[i] can vary, so this row can be
* head3 head1 head2 ... headN
I need some association between column names and column values, so that for each line I can write something like if line.head2 == "a" then … How can I do that?
Answers:
This reads your text file line by line and, once a header row has been found, stores each following row as a dictionary in a list called rows.
I’ve added a simple example of how the elements in the rows list can be accessed. Note that split() keeps the double quotes, so the comparison is against '"a"'.
headers = []
rows = []
with open('input.txt') as f:
    for line in f:
        split_line = line.strip().split()
        if headers:
            rows.append(dict(zip(headers, split_line)))
        if split_line and '*' == split_line[0]:
            headers = split_line[1:]
for row in rows:
    if row['head2'] == '"a"':
        print('found an "a"')
File: input.txt
some strings
go here
* head1 head2 head3 headN
3 "a" 0.3 -2
0.1 "b" 5 1
10 "c" -4 100
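Building on the same idea, here is a minimal sketch of the output step the question asks for. The helper names parse_block and format_row are made up for illustration, and the sample deliberately uses a reordered header to show that the lookup by column name does not depend on column order.

```python
def parse_block(lines):
    """Return the data rows after the '*' header line as dicts keyed by column name."""
    headers = []
    rows = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if fields[0] == '*':
            headers = fields[1:]  # drop the leading '*'
        elif headers:
            rows.append(dict(zip(headers, fields)))
    return rows


def format_row(row):
    """Hypothetical formatter: build the output string for rows whose head2 is "a"."""
    # str.split() keeps the double quotes, hence the comparison with '"a"'
    if row.get('head2') == '"a"':
        return 'param1 = {}, param2 = {}'.format(row['head1'], row['head3'])
    return None


sample = [
    'some strings',
    'go here',
    '* head3 head1 head2',
    '0.3 3 "a"',
    '5 0.1 "b"',
]
output_lines = [s for s in map(format_row, parse_block(sample)) if s is not None]
print(output_lines)  # ['param1 = 3, param2 = 0.3']
```

To write the result to a file instead of printing it, the strings in output_lines could be joined with newlines and passed to the write method of a file opened in 'w' mode.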
You can use a DictReader which will give you all the power and robustness of the csv module, but you will have to first skip the initial lines.
You could process the file in 3 steps:
- ignore any line before a line starting with a *
- extract the field names from that line after skipping its initial * character
- process the remaining lines as a normal csv file whose field names you now know
Possible code:
import csv
import io

filename = 'input.txt'  # assumed name, adjust to your file
with open(filename, newline='') as fd:
    # skip the initial lines up to a line starting with a *
    for line in fd:
        if line.startswith('*'):
            break
    # use a DictReader to parse that line (after the initial *)
    rd = csv.DictReader(io.StringIO(line[1:]), delimiter=' ',
                        skipinitialspace=True)
    # prepare a DictReader for the rest of the file
    fieldnames = rd.fieldnames
    rd = csv.DictReader(fd, fieldnames=fieldnames, delimiter=' ',
                        skipinitialspace=True)
    for row in rd:
        # the csv module strips the double quotes, so compare with 'a'
        if row['head2'] == 'a':
            print(row)  # add your processing here...
The rationale for using the csv module is that Python comes with batteries included, and the csv module is a robust parser able to handle quoted fields containing the delimiter or even newlines. So best practice recommends using it for processing csv-like files instead of writing a custom parser.
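As a small illustration of that robustness, the sketch below compares plain whitespace splitting with csv.reader on a made-up line whose quoted value contains a space. It also shows why the csv-based code compares against 'a' while a split-based parser has to compare against '"a"'.

```python
import csv

line = '3 "a b" 0.3'  # made-up row: the quoted field contains the delimiter

# Plain whitespace splitting breaks the quoted value apart and keeps the quotes.
naive = line.split()

# csv.reader accepts any iterable of strings; with a space delimiter it
# treats the double-quoted text as one field and strips the quotes.
parsed = next(csv.reader([line], delimiter=' '))

print(naive)   # ['3', '"a', 'b"', '0.3']
print(parsed)  # ['3', 'a b', '0.3']
```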
If the block always starts after a fixed number of lines (say 3), you can use Python’s list slicing to skip them: lines[3:].
If you read the text line by line, wait for the ‘*’ line by checking if line[0] == '*'.
Now for the parsing part. First, parse the headers using the split function:
headers = line.split()[1:]
This splits on whitespace (the default for split) and drops the first element of the result ("*"), which gives you a list of headers.
Now we can continue by parsing each data line and creating a mapping between header and value (I’m ignoring the conversion of values from str to int/float/any other type):
data_dict = {}
splitted_line = line.split()
for i in range(len(headers)):
    data_dict[headers[i]] = splitted_line[i]
print(data_dict)
parsed_data.append(data_dict)
where parsed_data is the global data container, a list created before the loop (e.g. parsed_data = []).
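The type conversion that was skipped above could be handled roughly as follows. This is only a sketch: the convert helper is a made-up name, and it assumes that anything that is not a valid int or float should stay a string with its surrounding double quotes removed.

```python
def convert(value):
    """Try int, then float; otherwise return the string without surrounding double quotes."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass  # not a valid number of this type, try the next one
    return value.strip('"')


headers = ['head1', 'head2', 'head3']
line = '3 "a" 0.3'
# zip pairs each header with the field at the same position
data_dict = {name: convert(field) for name, field in zip(headers, line.split())}
print(data_dict)  # {'head1': 3, 'head2': 'a', 'head3': 0.3}
```

With values converted this way, the check on the head2 column becomes data_dict['head2'] == 'a', without the embedded quotes.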