How can I loop through blocks of lines in a file?

Question:

I have a text file that looks like this, with blocks of lines separated by blank lines:

ID: 1
Name: X
FamilyN: Y
Age: 20

ID: 2
Name: H
FamilyN: F
Age: 23

ID: 3
Name: S
FamilyN: Y
Age: 13

ID: 4
Name: M
FamilyN: Z
Age: 25

How can I loop through the blocks and process the data in each block? eventually I want to gather the name, family name and age values into three columns, like so:

Y X 20
F H 23
Y S 13
Z M 25
Asked By: Adia

||

Answers:

Use a dict, namedtuple, or custom class to store each attribute as you come across it, then append the object to a list when you reach a blank line or EOF.

Use a generator.

def blocks( iterable ):
    accumulator= []
    for line in iterable:
        if start_pattern( line ):
            if accumulator:
                yield accumulator
                accumulator= []
        # elif other significant patterns
        else:
            accumulator.append( line )
     if accumulator:
         yield accumulator
Answered By: S.Lott
import re
result = re.findall(
    r"""(?mx)           # multiline, verbose regex
    ^ID:.*s*           # Match ID: and anything else on that line 
    Name:s*(.*)s*     # Match name, capture all characters on this line
    FamilyN:s*(.*)s*  # etc. for family name
    Age:s*(.*)$        # and age""", 
    subject)

Result will then be

[('X', 'Y', '20'), ('H', 'F', '23'), ('S', 'Y', '13'), ('M', 'Z', '25')]

which can be trivially changed into whatever string representation you want.

Answered By: Tim Pietzcker

If file is not huge you can read whole file with:

content = f.open(filename).read()

then you can split content to blocks using:

blocks = content.split('nn')

Now you can create function to parse block of text. I would use split('n') to get lines from block and split(':') to get key and value, eventually with str.strip() or some help of regular expressions.

Without checking if block has required data code can look like:

f = open('data.txt', 'r')
content = f.read()
f.close()
for block in content.split('nn'):
    person = {}
    for l in block.split('n'):
        k, v = l.split(': ')
        person[k] = v
    print('%s %s %s' % (person['FamilyN'], person['Name'], person['Age']))
Answered By: MichaƂ Niklas

simple solution:

result = []
for record in content.split('nn'):
    try:
        id, name, familyn, age = map(lambda rec: rec.split(' ', 1)[1], record.split('n'))
    except ValueError:
        pass
    except IndexError:
        pass
    else:
        result.append((familyn, name, age))
Answered By: Andrey Gubarev

If your file is too large to read into memory all at once, you can still use a regular expressions based solution by using a memory mapped file, with the mmap module:

import sys
import re
import os
import mmap

block_expr = re.compile('ID:.*?nAge: d+', re.DOTALL)

filepath = sys.argv[1]
fp = open(filepath)
contents = mmap.mmap(fp.fileno(), os.stat(filepath).st_size, access=mmap.ACCESS_READ)

for block_match in block_expr.finditer(contents):
    print block_match.group()

The mmap trick will provide a “pretend string” to make regular expressions work on the file without having to read it all into one large string. And the find_iter() method of the regular expression object will yield matches without creating an entire list of all matches at once (which findall() does).

I do think this solution is overkill for this use case however (still: it’s a nice trick to know…)

Answered By: Steven
import itertools

# Assuming input in file input.txt
data = open('input.txt').readlines()

records = (lines for valid, lines in itertools.groupby(data, lambda l : l != 'n') if valid)    
output = [tuple(field.split(':')[1].strip() for field in itertools.islice(record, 1, None)) for record in records]

# You can change output to generator by    
output = (tuple(field.split(':')[1].strip() for field in itertools.islice(record, 1, None)) for record in records)

# output = [('X', 'Y', '20'), ('H', 'F', '23'), ('S', 'Y', '13'), ('M', 'Z', '25')]    
#You can iterate and change the order of elements in the way you want    
# [(elem[1], elem[0], elem[2]) for elem in output] as required in your output
Answered By: Anoop

Here’s another way, using itertools.groupby.
The function groupy iterates through lines of the file and calls isa_group_separator(line) for each line. isa_group_separator returns either True or False (called the key), and itertools.groupby then groups all the consecutive lines that yielded the same True or False result.

This is a very convenient way to collect lines into groups.

import itertools

def isa_group_separator(line):
    return line=='n'

with open('data_file') as f:
    for key,group in itertools.groupby(f,isa_group_separator):
        # print(key,list(group))  # uncomment to see what itertools.groupby does.
        if not key:               # however, this will make the rest of the code not work
            data={}               # as it exhausts the `group` iterator
            for item in group:
                field,value=item.split(':')
                value=value.strip()
                data[field]=value
            print('{FamilyN} {Name} {Age}'.format(**data))

# Y X 20
# F H 23
# Y S 13
# Z M 25
Answered By: unutbu

Along with the half-dozen other solutions I already see here, I’m a bit surprised that no one has been so simple-minded (that is, generator-, regex-, map-, and read-free) as to propose, for example,

fp = open(fn)
def get_one_value():
    line = fp.readline()
    if not line:
        return None
    parts = line.split(':')
    if 2 != len(parts):
        return ''
    return parts[1].strip()

# The result is supposed to be a list.
result = []
while 1:
        # We don't care about the ID.
   if get_one_value() is None:
       break
   name = get_one_value()
   familyn = get_one_value()
   age = get_one_value()
   result.append((name, familyn, age))
       # We don't care about the block separator.
   if get_one_value() is None:
       break

for item in result:
    print item

Re-format to taste.

Answered By: Cameron Laird

This answer isn’t necessarily better than what’s already been posted, but as an illustration of how I approach problems like this it might be useful, especially if you’re not used to working with Python’s interactive interpreter.

I’ve started out knowing two things about this problem. First, I’m going to use itertools.groupby to group the input into lists of data lines, one list for each individual data record. Second, I want to represent those records as dictionaries so that I can easily format the output.

One other thing that this shows is how using generators makes breaking a problem like this down into small parts easy.

>>> # first let's create some useful test data and put it into something 
>>> # we can easily iterate over:
>>> data = """ID: 1
Name: X
FamilyN: Y
Age: 20

ID: 2
Name: H
FamilyN: F
Age: 23

ID: 3
Name: S
FamilyN: Y
Age: 13"""
>>> data = data.split("n")
>>> # now we need a key function for itertools.groupby.
>>> # the key we'll be grouping by is, essentially, whether or not
>>> # the line is empty.
>>> # this will make groupby return groups whose key is True if we
>>> care about them.
>>> def is_data(line):
        return True if line.strip() else False

>>> # make sure this really works
>>> "n".join([line for line in data if is_data(line)])
'ID: 1nName: XnFamilyN: YnAge: 20nID: 2nName: HnFamilyN: FnAge: 23nID: 3nName: SnFamilyN: YnAge: 13nID: 4nName: MnFamilyN: ZnAge: 25'

>>> # does groupby return what we expect?
>>> import itertools
>>> [list(value) for (key, value) in itertools.groupby(data, is_data) if key]
[['ID: 1', 'Name: X', 'FamilyN: Y', 'Age: 20'], ['ID: 2', 'Name: H', 'FamilyN: F', 'Age: 23'], ['ID: 3', 'Name: S', 'FamilyN: Y', 'Age: 13'], ['ID: 4', 'Name: M', 'FamilyN: Z', 'Age: 25']]
>>> # what we really want is for each item in the group to be a tuple
>>> # that's a key/value pair, so that we can easily create a dictionary
>>> # from each item.
>>> def make_key_value_pair(item):
        items = item.split(":")
        return (items[0].strip(), items[1].strip())

>>> make_key_value_pair("a: b")
('a', 'b')
>>> # let's test this:
>>> dict(make_key_value_pair(item) for item in ["a:1", "b:2", "c:3"])
{'a': '1', 'c': '3', 'b': '2'}
>>> # we could conceivably do all this in one line of code, but this 
>>> # will be much more readable as a function:
>>> def get_data_as_dicts(data):
        for (key, value) in itertools.groupby(data, is_data):
            if key:
                yield dict(make_key_value_pair(item) for item in value)

>>> list(get_data_as_dicts(data))
[{'FamilyN': 'Y', 'Age': '20', 'ID': '1', 'Name': 'X'}, {'FamilyN': 'F', 'Age': '23', 'ID': '2', 'Name': 'H'}, {'FamilyN': 'Y', 'Age': '13', 'ID': '3', 'Name': 'S'}, {'FamilyN': 'Z', 'Age': '25', 'ID': '4', 'Name': 'M'}]
>>> # now for an old trick:  using a list of column names to drive the output.
>>> columns = ["Name", "FamilyN", "Age"]
>>> print "n".join(" ".join(d[c] for c in columns) for d in get_data_as_dicts(data))
X Y 20
H F 23
S Y 13
M Z 25
>>> # okay, let's package this all into one function that takes a filename
>>> def get_formatted_data(filename):
        with open(filename, "r") as f:
            columns = ["Name", "FamilyN", "Age"]
            for d in get_data_as_dicts(f):
                yield " ".join(d[c] for c in columns)

>>> print "n".join(get_formatted_data("c:\temp\test_data.txt"))
X Y 20
H F 23
S Y 13
M Z 25
Answered By: Robert Rossney
Categories: questions Tags: ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.