Modify the data found between two recurring patterns in a multi-line string

Question:

I have a multi-line string of roughly 10,000 to 40,000 characters (the length varies with the data returned by an API). In this string, there are a number of tables (they are part of the string, but formatted so that they look like tables). The tables always follow a repeating pattern, which looks like this:

==============================================================================

*THE HEADINGS/COLUMN NAMES IN THE TABLE*

------------------------------------------------------------------------------

THE DATA IN THE TABLE, FORMATTED TO BE UNDER RESPECTIVE COLUMNS

I’m trying to display the contents as HTML on a locally hosted webpage, and I want the table headings displayed in a specific way (think color, font size). For that, I’m using Python’s re module to identify the pattern, but I’m failing to do so due to inexperience with it. To modify the part I need changed, I’m using the piece of code below:

re.sub(r'={78}.*-{78}',some_replacement_string, complete_multi_line_string)

But this code is not giving me the output I require, since it is not matching the pattern properly (I’m sure the mistake is in the pattern I’m asking re.sub to match).

However:

re.sub(r'-{78}',some_replacement_string, complete_multi_line_string)

works, in that it returns the string with the replacement applied, but the problem is that there are multiple ------------------------------------------------------------------------------s in the string that I do not want modified. Please help me out here. If it is helpful, the output I want is something like:

==============================================================================

<span>*THE HEADINGS/COLUMN NAMES IN THE TABLE*</span>

------------------------------------------------------------------------------

THE DATA IN THE TABLE, FORMATTED TO BE UNDER RESPECTIVE COLUMNS

Also, please note that there are newlines (\n) after the ==============================================================================s, the <span>*THE HEADINGS/COLUMN NAMES IN THE TABLE*</span>s and the ------------------------------------------------------------------------------s, if that is helpful in getting to the solution.

The code snippet I’m currently trying to debug, if helpful:

result = re.sub(r'={78}.*-{78}', replacement, multi_line_string)
l = result.count('<span>')
print(l)

PS: There are 78 = and 78 - characters in all the occurrences.

Answers:

You should try using the following version:

re.sub(r'(={78})\n(.*?)\n(-{78})', r'\1<span>\2</span>\3', complete_multi_line_string, flags=re.S)

The changes I made here include:

  • Match on lazy dot .*? instead of greedy dot .*, to ensure that we don’t match across header sections
  • Match with the re.S flag, so that .*? will match across newlines
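
To illustrate the two points above, here is a small sketch that counts matches under each combination. The sample text and variable names are made up for illustration, and the rules are shortened from 78 to 5 characters to keep it readable.

```python
import re

# Hypothetical sample: two tables, rules shortened from 78 to 5 characters.
sample = (
    "=====\nNAME AGE\n-----\nann 30\n"
    "=====\nCITY ZIP\n-----\noslo 0150\n"
)

# Without re.S, '.' stops at a newline, so nothing spans rule -> heading -> rule.
no_flag = re.findall(r'={5}.*-{5}', sample)

# With re.S but a greedy '.*', a single match swallows both tables.
greedy = re.findall(r'={5}.*-{5}', sample, flags=re.S)

# With re.S and a lazy '.*?', each heading block is matched on its own.
lazy = re.findall(r'={5}.*?-{5}', sample, flags=re.S)

print(len(no_flag), len(greedy), len(lazy))  # 0 1 2
```
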
Answered By: Tim Biegeleisen
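
As a follow-up sketch (not part of the answer above), the same pattern can be wrapped in a small helper that puts the newlines back around the heading, since they sit outside the capture groups and would otherwise be dropped by the replacement. The function name wrap_headings, the css_class parameter, and the sample data are hypothetical, and the heading text is inserted as-is without HTML escaping.

```python
import re

# A rule of 78 '=', a newline, the heading text, a newline, a rule of 78 '-'.
HEADING_RE = re.compile(r'(={78})\n(.*?)\n(-{78})', flags=re.S)


def wrap_headings(text, css_class="table-heading"):
    """Wrap each table heading in a <span>, keeping rules and line breaks."""
    def _wrap(match):
        top_rule, heading, bottom_rule = match.groups()
        # Re-insert the two newlines that the pattern matched outside the groups.
        return f'{top_rule}\n<span class="{css_class}">{heading}</span>\n{bottom_rule}'
    return HEADING_RE.sub(_wrap, text)


# Hypothetical input: two tables plus one stray dashed rule in the data.
sample = (
    "=" * 78 + "\nNAME   AGE\n" + "-" * 78 + "\nann    30\n"
    + "-" * 78 + "\n"  # stray rule, not preceded by '=' and a heading
    + "=" * 78 + "\nCITY   ZIP\n" + "-" * 78 + "\noslo   0150\n"
)

result = wrap_headings(sample)
print(result.count("<span"))  # 2
```

Because the pattern only starts matching at a run of 78 = characters, the stray dashed rule in the data is left untouched.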