Python – Use regex to extract substrings between two markers

Question:

I have a problem that I need help with. I have the below strings and need to do the following:

  1. Extract all the substrings between the equal sign and "END=STRING"
    string or closing double quotation mark.
  2. Group the extracted substrings into a single group
  3. Do not show the starting and ending markers in the output
  4. If possible, do not show the backslashes or newlines

Two samples of the expected result:

STRING database file 2025.01 ABC_ONE ABC_TWO

STRING database file 2025.01 ABC_ONE:12.3456 ABC_TWO:12.3456 ABC_THREE:12.3456 ABC_FOUR:12.3456 ABC_THREE:12.3456 ABC_FOUR:12.3456 ABC_FIVE:12.3456 ABC_SIX:12.3456 ABC_SEVEN:12.3456 ABC_EIGHT:12.3456 ABC_NINE:12.3456 ABC_TEN:12.3456

I will use Python re.finditer to loop through the results I get from regex. Also, re.MULTILINE and re.IGNORECASE will be used.

Link to what I have on regex101: https://regex101.com/r/CwMaEZ/1

Feel free to suggest a different pattern but keep in mind the following:

  1. Groups are needed like how I show in my pattern.
  2. I want to iterate over the results in Python so I prefer re.finditer

Here is the regex I have so far:

(STRING)\s([a-zA-Z0-9/+._-]+)\s([a-zA-Z0-9/+._-]+)\s([a-zA-Z0-9/+._-]+)\s([a-zA-Z0-9/+._-]+)?\s?\\?\n?(.*VALUE=\s*"?)

Here are the strings:

STRING database file 2025.01 
     0123456789ABCD VALUE="ABC_ONE 
     ABC_TWO " END=STRING
     ST=

STRING database file 2025.01 
     0123456789ABCD VALUE=ABC_ONE 
     ABC_TWO END=STRING
     ST=

STRING database file 2025.01 ABCDEFGH123456 
     VALUE=ABC_ONE ABC_TWO END=STRING 

STRING database file 2025.01 
    VALUE=ABC_ONE:12.3456 END=STRING 
    AAAA=ABCDEFGH1234

STRING database file 2025.01 
    VALUE="ABC_ONE:12.3456 ABC_TWO:12.3456 
    ABC_THREE:12.3456 ABC_FOUR:12.3456 " 
    END=STRING 

STRING database file 2025.01 
    0123456789ABCD VALUE="ABC_ONE ABC_TWO " 
    END=STRING 

STRING database file 2025.01 VALUE="ABC_ONE 
    ABC_TWO ABC_THREE END=STRING

STRING database file 2025.01 
    VALUE="ABC_ONE ABC_TWO ABC_THREE " END=STRING

STRING database file 2025.01 VALUE=
    "ABC_ONE ABC_TWO ABC_THREE " END=STRING 

STRING database file 2025.01 VALUE="ABC_ONE 
    ABC_TWO ABC_THREE " END=STRING

STRING database file 2025.01 VALUE="ABC_ONE ABC_TWO 
    ABC_THREE " END=STRING

STRING database file 2025.01 VALUE="ABC_ONE ABC_TWO ABC_THREE " 

STRING database file 2025.01 
    VALUE="ABC_ONE:12.3456 ABC_TWO:12.3456 
    ABC_THREE:12.3456 ABC_FOUR:12.3456 
    ABC_THREE:12.3456 ABC_FOUR:12.3456 
    ABC_FIVE:12.3456 ABC_SIX:12.3456 
    ABC_SEVEN:12.3456 ABC_EIGHT:12.3456 
    ABC_NINE:12.3456 ABC_TEN:12.3456 
    ABC_ELEVEN:12.3456 ABC_TWELVE:12.3456 
    END=STRING
Asked By: John Doe


Answers:

Question askers are usually expected to put in some effort toward a solution. Here is some code to help you get started:

import re
s = s.replace('\n', '')  # join the lines so each record can be matched in one pass
re.findall(r'VALUE="(.*?)\s*(?: " END=STRING|END=STRING)', s, re.M)
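
The starter code above only covers quoted values. One possible way to extend it to the unquoted samples, while iterating with re.finditer as the question asks, is sketched below. The pattern, the group names (quoted, bare) and the helper extract_values are illustrative assumptions, not part of the original answer, and the sketch assumes every unquoted value is eventually followed by END=STRING.

```python
import re

# Hypothetical pattern (an assumption, not the original answer's regex):
# capture the text after VALUE= up to a closing double quote or END=STRING,
# whichever comes first. DOTALL lets the value span several lines.
VALUE_RE = re.compile(
    r'VALUE=\s*(?:"(?P<quoted>[^"]*?)\s*(?:"|END=STRING)'
    r'|(?P<bare>.*?)\s*END=STRING)',
    re.IGNORECASE | re.DOTALL,
)


def extract_values(text):
    """Return each VALUE payload with backslashes and line breaks removed."""
    values = []
    for match in VALUE_RE.finditer(text):
        raw = match.group("quoted")
        if raw is None:
            raw = match.group("bare")
        # Turn backslashes into spaces, then collapse every whitespace run.
        values.append(" ".join(raw.replace("\\", " ").split()))
    return values


# Three of the question's samples: quoted, unquoted, and an unclosed quote.
sample = (
    'STRING database file 2025.01 \n'
    '     0123456789ABCD VALUE="ABC_ONE \n'
    '     ABC_TWO " END=STRING\n'
    '     ST=\n'
    '\n'
    'STRING database file 2025.01 \n'
    '     0123456789ABCD VALUE=ABC_ONE \n'
    '     ABC_TWO END=STRING\n'
    '     ST=\n'
    '\n'
    'STRING database file 2025.01 VALUE="ABC_ONE \n'
    '    ABC_TWO ABC_THREE END=STRING\n'
)

print(extract_values(sample))
```

Because the markers sit outside the named groups, they never appear in the output. A quoted value with neither a closing quote nor END=STRING would fall through to the unquoted branch and could run into the next record, so real data may need a stricter pattern.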
Answered By: Raymond Hettinger

Firstly, you haven’t really laid out the question neatly.

But suggestion 1 is to just use \S+ instead of [a-zA-Z0-9/+._-]+
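
As a rough illustration of that suggestion, the header part of the pattern might shrink to something like the sketch below. The name HEADER_RE and the sample text are assumptions made for the example, and only the first four groups from the question's pattern are kept.

```python
import re

# Sketch only: \S+ stands in for each [a-zA-Z0-9/+._-]+ group.
# HEADER_RE is a hypothetical name used for this example.
HEADER_RE = re.compile(
    r'^(STRING)\s(\S+)\s(\S+)\s(\S+)',
    re.MULTILINE | re.IGNORECASE,
)

text = (
    'STRING database file 2025.01 \n'
    '    VALUE=ABC_ONE:12.3456 END=STRING \n'
    '    AAAA=ABCDEFGH1234\n'
    '\n'
    'STRING database file 2025.01 VALUE="ABC_ONE \n'
    '    ABC_TWO ABC_THREE END=STRING\n'
)

# One tuple of captured groups per record header.
headers = [match.groups() for match in HEADER_RE.finditer(text)]
print(headers)
```

Note that \S+ is broader than the original character class: it also matches characters such as = and ", so an optional fifth group written as (\S+)? could swallow the start of VALUE="... when that sits on the header line.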

Answered By: user3456886