Process text files with Python

Question:

Complete newbie here, please help. Suppose I have a text file that looks like this:

some strings
go here 

* head1 head2 head3 ... headN
  3     "a"   0.3   ... -2
  0.1   "b"   5     ... 1
  10    "c"   -4    ... 100
# and other rows with some numbers

So there are several lines of text before the main block of interest. This block has a "header" line; note that it starts with "*", so the real column names start from the 2nd field. After it come rows of numbers and strings, each value corresponding to a particular head[i].

I need to process this block line by line depending on the string value in the head2 column: if, for example, the value of head2 is "a", then write a new string to a file like ‘param1 = 3, param2 = 0.3’, i.e. take the values from head1 and head3 of the line being processed.

The problem is that this "header" line can have a different number of elements and the order of head[i] can vary, so this row can be

* head3 head1 head2 ... headN

I need some association between column names and column values so that for each line I can write something like if line.head2 == "a" then … How can I do that?

Asked By: Alex

Answers:

This reads your text file line by line and, once the header row has been found, stores each following row as a dictionary in a list called rows.

I’ve added a simple example of how the elements in the rows list can be accessed.

headers = []
rows = []
with open('input.txt') as f:
    for line in f:
        split_line = line.strip().split()
        if headers and split_line:  # ignore blank lines after the header
            rows.append(dict(zip(headers, split_line)))
        if split_line and '*' == split_line[0]:
            headers = split_line[1:]

for row in rows:
    if row['head2'] == '"a"':
        print('found an "a"')

File: input.txt

some strings
go here 

* head1 head2 head3 headN
  3     "a"   0.3   -2
  0.1   "b"   5     1
  10    "c"   -4    100
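
To tie this back to the question, here is a minimal sketch (not the answerer's code) of turning the matching rows into ‘param1 = ..., param2 = ...’ lines. The names SAMPLE, parse_rows and format_row are made up for illustration, the header is deliberately given in a different order, and the quotes around "a" are stripped before comparing.

```python
import io

# Assumed sample data; note the header order differs from the question's first example.
SAMPLE = '''some strings
go here

* head3 head1 head2
  0.3   3     "a"
  5     0.1   "b"
'''

def parse_rows(lines):
    """Return one dict per data line found after the '*' header line."""
    headers = []
    rows = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if fields[0] == '*':
            headers = fields[1:]
        elif headers:
            rows.append(dict(zip(headers, fields)))
    return rows

def format_row(row):
    """Hypothetical helper: build the output line for rows whose head2 is "a"."""
    if row['head2'].strip('"') == 'a':
        return 'param1 = {}, param2 = {}'.format(row['head1'], row['head3'])
    return None

rows = parse_rows(io.StringIO(SAMPLE))
output_lines = [text for text in map(format_row, rows) if text is not None]
print(output_lines)  # ['param1 = 3, param2 = 0.3']
```

To write to a real file instead of printing, open it with open('output.txt', 'w') (the file name is an assumption) and call write(text + '\n') for each entry of output_lines.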

Answered By: bn_ln

You can use a DictReader which will give you all the power and robustness of the csv module, but you will have to first skip the initial lines.

You could process the file in 3 steps:

  • ignore any line before a line starting with a *
  • extract the field names from that line after skipping its initial * character
  • process the remaining lines as a normal csv file whose field names you already know

Possible code:

import csv
import io

filename = 'input.txt'  # placeholder: path to your file

with open(filename, newline='') as fd:
    # skip the initial lines up to a line starting with a *
    for line in fd:
        if line.startswith('*'):
            break
    # use a DictReader to parse that line (after the initial *)
    rd = csv.DictReader(io.StringIO(line[1:]), delimiter=' ',
                        skipinitialspace=True)
    # prepare a DictReader for the rest of the file
    fieldnames = rd.fieldnames
    rd = csv.DictReader(fd, fieldnames=fieldnames, delimiter=' ',
                        skipinitialspace=True)
    for row in rd:
        # csv removes the surrounding double quotes, so compare with 'a'
        if row['head2'] == 'a':
            # add your processing here...
            print(row)

The rationale for using the csv module is that Python comes with batteries included, and that the csv module is a very robust module able to handle fields containing the delimiter or even newlines. So best practices recommend always using it for processing csv files instead of a custom parser.
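
A small sketch of that difference, assuming a made-up line whose quoted value contains a space (the sample file in the question has no such value):

```python
import csv
import io

# Hypothetical data line: the second field contains the delimiter.
line = '3 "a b" 0.3\n'

# Plain whitespace splitting breaks the quoted value into two pieces.
naive = line.split()

# csv treats the double-quoted text as a single field and drops the quotes.
reader = csv.reader(io.StringIO(line), delimiter=' ', skipinitialspace=True)
parsed = next(reader)

print(naive)   # ['3', '"a', 'b"', '0.3']
print(parsed)  # ['3', 'a b', '0.3']
```

With split() the row would have one field too many and every later column would shift by one, which is exactly the kind of problem the csv module avoids.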

Answered By: Serge Ballesta

If the file always starts with 3 lines before the header, you can use Python’s list slicing to skip them: lines[3:].
If you read the text line by line, wait for the ‘*’ by checking if line[0] == '*'.
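
Both ways of skipping the preamble could look roughly like this. The list lines stands in for what f.readlines() would return on the sample file and is an assumption for illustration:

```python
# Assumed stand-in for f.readlines() on the sample file.
lines = [
    'some strings\n',
    'go here \n',
    '\n',
    '* head1 head2 head3\n',
    '  3     "a"   0.3\n',
]

# Case 1: the preamble always has exactly 3 lines, so slice it off.
fixed = lines[3:]

# Case 2: the preamble length is unknown, so scan for the '*' line.
# startswith also works on an empty string, where line[0] would raise IndexError.
header_index = next(i for i, line in enumerate(lines) if line.startswith('*'))
scanned = lines[header_index:]

print(header_index)      # 3
print(fixed == scanned)  # True
```

Scanning is the safer choice when the number of lines before the header can change from file to file.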

Now for the parsing part: first we parse the headers using the split function

headers = line.split()[1:]

We split on whitespace (the default for the split function) and then ignore the first element of the result ("*").
This gives you a list of headers.

Now we can continue by parsing each line and creating a mapping between header and value (I’m ignoring value parsing from str to int/float/any other type)

data_dict = {}
split_line = line.split()
# zip pairs each header with its value and stops at the shorter of the two
for header, value in zip(headers, split_line):
    data_dict[header] = value
print(data_dict)
parsed_data.append(data_dict)

where parsed_data is the global data container (for example, a list created before the loop with parsed_data = [])
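
The value parsing that was left out above could be sketched like this. The helper convert is a made-up name, and trying int, then float, then falling back to an unquoted string is just one possible rule, not something the question specifies:

```python
def convert(value):
    """Turn a raw field into an int, a float, or an unquoted string."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass  # not a number of this type, try the next one
    return value.strip('"')

# Assumed header list and data line, taken from the sample in the question.
headers = ['head1', 'head2', 'head3']
line = '  3     "a"   0.3'

data_dict = {name: convert(raw) for name, raw in zip(headers, line.split())}
print(data_dict)  # {'head1': 3, 'head2': 'a', 'head3': 0.3}
```

After this conversion the comparison becomes data_dict['head2'] == 'a', without the extra quotes.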

Answered By: Yehonatan Bitton