Parsing multi-line table cells from space-aligned table data

Question:

I have a messy generated file that dumps everything into HTML <pre> tags and splits the column headers across 2 lines. I am new to Python and regex and am having trouble finding a way to merge those 2 lines into one so that the column headers end up on one line and match up, with the end goal of parsing the entire file into fields.

Here is an example of how it looks on the web:
[Image: example of output]

What I want to do is match the fields up on one line. For example, if I just get rid of the excess spaces, "CLOCK" would line up with FINISHER instead of TIME. What I want is:

ID# | Place | Class Place | Finisher | Clock Time | Net Time | Pace

Here is the actual HTML:

</B>             CLASS                                            CLOCK       NET    
  ID#  PLACE PLACE         FINISHER                          TIME       TIME     PACE  

Asked By: Steve

Answers:

This code does the job. It splits the two lines into headings under the following assumption: text in the two lines that occupies overlapping or immediately adjacent positions belongs to the same heading, whereas a position where both lines have a space separates two headings. No regexes are needed.

# read in the 2 lines:
line1 = '             CLASS                                            CLOCK       NET    '
line2 = '  ID#  PLACE PLACE         FINISHER                          TIME       TIME     PACE  '

# pad the shorter of the two lines so that both are equally long:
linediff = len(line1) - len(line2)
if linediff > 0:
    line2 += ' ' * linediff
else:
    line1 += ' ' * (-linediff)
length = len(line1)

# go through both lines character-by-character:
top, bottom = [], []
i = 0
while i < length:
    # skip indices where both lines have a space:
    if line1[i] == ' ' and line2[i] == ' ':
        i += 1
    else:
        # find the first j to the right of i for which
        # both lines have a space:
        j = i
        while (j < length) and (line1[j] != ' ' or line2[j] != ' '):
            j += 1
        # copy the lines from position i (inclusive)
        # to j (exclusive) into top and bottom:
        top.append(line1[i:j])
        bottom.append(line2[i:j])
        # we are done with one heading and advance i:
        i = j

# top:
# ['   ', '     ', 'CLASS', '        ', ' CLOCK', '  NET', '    ']
# bottom:
# ['ID#', 'PLACE', 'PLACE', 'FINISHER', 'TIME  ', 'TIME ', 'PACE']

headers = []
for str1, str2 in zip(top, bottom):
    # remove leading/trailing spaces from each partial heading:
    s1, s2 = str1.strip(), str2.strip()
    # merge partial headings
    # (strip is needed because one of the two might be empty):
    headers.append((s1 + ' ' + s2).strip())

# headers:
# ['ID#', 'PLACE', 'CLASS PLACE', 'FINISHER', 'CLOCK TIME', 'NET TIME', 'PACE']
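
The same rule (a position where both lines are blank separates two headings) could also be packaged as a function. The following is only a sketch of an alternative formulation, not a replacement for the loop above: the name merge_header_lines and the shortened sample header are made up for illustration, and itertools.groupby takes the place of the explicit index bookkeeping. It also returns the start and end position of each column.

```python
from itertools import groupby, zip_longest


def merge_header_lines(line1, line2):
    """Return (start, end, header) for each column of a two-line header."""
    # zip_longest pads the shorter line with spaces on the right.
    pairs = list(zip_longest(line1, line2, fillvalue=' '))
    columns = []
    # Group consecutive positions by whether both lines are blank there.
    for blank, group in groupby(enumerate(pairs),
                                key=lambda item: item[1] == (' ', ' ')):
        if blank:
            continue  # a gap between two headings
        group = list(group)
        start, end = group[0][0], group[-1][0] + 1
        top = ''.join(pair[0] for _, pair in group).strip()
        bottom = ''.join(pair[1] for _, pair in group).strip()
        # strip again because one of the two parts may be empty
        columns.append((start, end, (top + ' ' + bottom).strip()))
    return columns


# Hypothetical, shortened header built with explicit widths so that the
# column positions are unambiguous:
sample1 = ' ' * 6 + 'CLASS' + ' ' * 10 + 'CLOCK'
sample2 = (' ID#' + ' ' * 2 + 'PLACE' + ' ' * 2 + 'NAME'
           + ' ' * 4 + 'TIME' + ' ' * 3 + 'PACE')

for start, end, header in merge_header_lines(sample1, sample2):
    print(start, end, header)
# 1 4 ID#
# 6 11 CLASS PLACE
# 13 17 NAME
# 21 26 CLOCK TIME
# 28 32 PACE
```

The (start, end) pairs might help with slicing the data rows later, but only if the values in each row stay within the header's column positions. The question does not show any data rows, so that would need to be checked against the real file.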

Note that the problem is actually not related to HTML and hence doesn’t require any special HTML handling.
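
One caveat if the lines are read straight from the page source: the first header line in the question begins with a </B> tag, which takes up no width in the rendered page, so it has to be dropped before positions are compared (line1 above already omits it). A rough sketch, assuming the <pre> block contains only simple inline tags like this one; strip_tags is a made-up helper name, and a regex is not a general HTML parser:

```python
import html
import re


def strip_tags(line):
    """Remove simple HTML tags, then decode entities such as &amp;."""
    # Assumption: tags are short inline ones like </B>; a regex like
    # this will not cope with arbitrary HTML.
    without_tags = re.sub(r'<[^>]+>', '', line)
    return html.unescape(without_tags)


raw = '</B>' + ' ' * 13 + 'CLASS'
print(repr(strip_tags(raw)))
```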

Answered By: Lover of Structure