capture pattern_X repeatedly, then capture pattern_Y once, then repeat until EOS

Question:

[update:] Accepted answer suggests, this can not be done with the python re library in one step. If you know otherwise, please comment.

I’m reverse-engineering a massive ETL pipeline, I’d like to extract the full data lineage from stored procedures and views.

I’m struggling with the following regexp.

import re

select_clause = "`data_staging`.`CONVERT_BOGGLE_DATE`(`landing_boggle_replica`.`CUST`.`u_birth_date`) AS `birth_date`,`data_staging`.`CONVERT_BOGGLE_DATE`(`landing_boggle_replica`.`CUST`.`u_death_date`) AS `death_date`,(case when (isnull(`data_staging`.`CONVERT_BOGGLE_DATE`(`landing_boggle_replica`.`CUST`.`u_death_date`)) and (`landing_boggle_replica`.`CUST`.`u_cust_type` <> 'E')) then timestampdiff(YEAR,`data_staging`.`CONVERT_BOGGLE_DATE`(`landing_boggle_replica`.`CUST`.`u_birth_date`),curdate()) else NULL end) AS `age_in_years`,nullif(`landing_boggle_replica`.`CUST`.`u_occupationCode`,'') AS `occupation_code`,nullif(`landing_boggle_replica`.`CUST`.`u_industryCode`,'') AS `industry_code`,((`landing_boggle_replica`.`CUST`.`u_intebank` = 'Y') or (`sso`.`u_mySecondaryCust` is not null)) AS `online_web_enabled`,(`landing_boggle_replica`.`CUST`.`u_telebank` = 'Y') AS `online_phone_enabled`,(`landing_boggle_replica`.`CUST`.`u_hasProBank` = 1) AS `has_pro_bank`"

# this captures every occurrence of the source fields, but not the target
okay_pattern = r"(?i)((`[a-z0-9_]+`.`[a-z0-9_]+`)[ ,)=<>]).*?"

# this captures the target too, but captures only the first input field
wrong_pattern = r"(?i)((((`[a-z0-9_]+`.`[a-z0-9_]+`)[ ,)=<>]).*?AS (`[a-z0-9_]+)`).*?)"

re.findall(okay_pattern, select_clause)
re.findall(wrong_pattern, select_clause)

TLDR: I’d like to capture

[aaa, bbb, XXX],
[eee, fff, ..., ooo, YYY],
[ppp, ZZZ]

from a string like

"...aaa....bbb...XXX....eee...fff...[many]...ooo... YYY...ppp...ZZZ...."

where a,b,e,f,h match one pattern, X,Y,Z match another, and the first pattern might occur up to ~20 times, before the second one appears, which always appears alone.

I’m open to solutions with the sqlglot, sql-metadata, or sqlparse libraries as well, it is just regex is better documented.

(Probably I’m code golfing, and I should do this in several steps, starting with splitting the string into individual expressions.)

Asked By: Lorinc Nyitrai

||

Answers:

Here’s two regexes, one to group things by the outside pattern, and one for the inside:

(.*?)(XXX|YYY|ZZZ)
(aaa|bbb|ccc|ddd|eee|fff|ggg)

What I would suggest is matching the whole string with the first regex, and then using the second regex on the first regex’s match (.*?)

By using these two regexes, your matches will be grouped first by the outer pattern and then by the inner pattern, but the regexes themselves doesn’t have to be overly complicated.

Answered By: Canyon Turtle

You may use this regex with 3 capture and 1 non-capture groups:

(w+).+(w+)(?:.+(w+))?

RegEx Demo

Code:

import re
s = '...aaa....bbb...XXX....eee...fff...YYY...hhh...ZZZ....'
print (re.findall(r'(w+).+(w+)(?:.+(w+))?', s))

Output:

[('aaa', 'bbb', 'XXX'), ('eee', 'fff', 'YYY'), ('hhh', 'ZZZ', '')]
Answered By: anubhava