How to split/reformat a long line by a list of optional keywords using regex in Python?

Question:

I have an input text file with multiple lines. I want to split each line and start a new line wherever one of a list of optional keywords is detected in it. To do so, I composed a pattern for the keywords that might appear in each line; when one is found, I want to split the line at that keyword. My current attempt does not split the line when a keyword is detected. Is there a better way to do this with regex in Python?

use case

Here is part of the input text whose format I want to check; I want to split a line if I see keywords either in the middle or at the end of the line.

input txt

CREATE APPLICATION vod2_sar_app_wak1_tst_xor WITH ENCRYPTION RECOVERY 30 SECOND INTERVAL AUTORESUME MAXRETRIES 2 RETRYINTERVAL 60;

CREATE FLOW flow_src_vod2_sar_app_wak1_tst_xor;

CREATE OR REPLACE SOURCE vod2_epc_app_cln1_tst_comp_src_cdc USING Global.reader ( 
  DicMod: 'on', 
  url: '$src_url', 
  trans: true, 
  toSend: true ) OUTPUT TO stream_src_vod2_sar_app_wak1_tst_xor;


CREATE OR REPLACE TARGET comp_tgt_vod2_sar_app_wak1_tst_xor USING Global.wirter ( 
  stdSQL: 'true', 
  hasID: 'true', 
  delim: '"' ) 
INPUT FROM vod2_sar_app_wak1_tst_xor_text_clean_stream;

CREATE OR REPLACE Hm Hm_vod2_sar_app_wak1_tst_xor INSERT INTO Hm_STREAM_vod2_sar_app_wak1_tst_xor SELECT putUserdata(s,'Converted_SCN',META(s,"SCN").toString(),'DNOW', to_long(DNow()),'Created_time',DNOW(),'Nanoseconds',System.nanoTime()) FROM stream_src_vod2_sar_app_wak1_tst_xor s;

CREATE OR REPLACE OPEN PROCESSOR vod2_sar_app_wak1_tst_xor_text_clean USING Global.TextCleaner (
  replaceFrom: '[\u0000]', 
  includeBefore: true ) 
INSERT INTO vod2_sar_app_wak1_tst_xor_text_clean_stream FROM Hm_STREAM_vod2_sar_app_wak1_tst_xor;

END FLOW flow_tgt_vod2_sar_app_wak1_tst_xor;

END APPLICATION vod2_sar_app_wak1_tst_xor;

my current attempt:

line_patt = re.compile(r"(INPUT(?: FROM)?|(INSERT(?: INTO))|(OUPUT(?: TO)))")
with open('input.txt', 'r+') as f:
    lines = f.readlines()
    nlines = [v for v in lines if not v.isspace()]
    for line in nlines:
        if match := line_patt.match(line):
            line.split('\n')
        else:
            continue

But this does not give me the desired format. How can I do this correctly with regex?

desired format for input

this is how I want the input to be formatted:

CREATE APPLICATION vod2_sar_app_wak1_tst_xor WITH ENCRYPTION RECOVERY 30 SECOND INTERVAL AUTORESUME MAXRETRIES 2 RETRYINTERVAL 60;

CREATE FLOW flow_src_vod2_sar_app_wak1_tst_xor;

CREATE OR REPLACE SOURCE vod2_epc_app_cln1_tst_comp_src_cdc USING Global.reader ( 
  DicMod: 'on', 
  url: '$src_url', 
  trans: true, 
  toSend: true ) 
 OUTPUT TO stream_src_vod2_sar_app_wak1_tst_xor;


CREATE OR REPLACE TARGET comp_tgt_vod2_sar_app_wak1_tst_xor USING Global.wirter ( 
  stdSQL: 'true', 
  hasID: 'true', 
  delim: '"' ) 
INPUT FROM vod2_sar_app_wak1_tst_xor_text_clean_stream;

CREATE OR REPLACE Hm Hm_vod2_sar_app_wak1_tst_xor INSERT INTO Hm_STREAM_vod2_sar_app_wak1_tst_xor SELECT putUserdata(s,'Converted_SCN',META(s,"SCN").toString(),'DNOW', to_long(DNow()),'Created_time',DNOW(),'Nanoseconds',System.nanoTime()) 
FROM stream_src_vod2_sar_app_wak1_tst_xor s;

CREATE OR REPLACE OPEN PROCESSOR vod2_sar_app_wak1_tst_xor_text_clean USING Global.TextCleaner (
  replaceFrom: '[\u0000]', 
  includeBefore: true ) 
INSERT INTO vod2_sar_app_wak1_tst_xor_text_clean_stream 
FROM Hm_STREAM_vod2_sar_app_wak1_tst_xor;

END FLOW flow_tgt_vod2_sar_app_wak1_tst_xor;

END APPLICATION vod2_sar_app_wak1_tst_xor;
Asked By: beyond_inifinity

Answers:

Once you have read the entire text into a variable text, you can do something like this, assuming the desired keywords are in the list key_words:

key_words = ['FROM', 'INPUT', 'OUTPUT']
for k in key_words:
    b = text.split(k)
    for i in range(1,len(b)):
        b[i] = k+b[i]
    text = '\n'.join(b)
Answered By: Mohammad Tehrani

In your code, you are currently not doing anything with the result of line.split('\n'), and the split does not take any of the matches into account.
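
To make that point concrete, here is a small sketch (the sample line is made up for illustration). It shows that re.match only tests at the start of the string, so a keyword in the middle of a line is not found, and that str.split returns a new list instead of changing the string.

```python
import re

# Illustrative sample line, not read from a real input file.
line = "toSend: true ) OUTPUT TO some_stream;"
patt = re.compile(r"OUTPUT TO")

# re.match only tries to match at the beginning of the string.
print(patt.match(line))      # None

# re.search scans the whole string for the first match.
found = patt.search(line)
print(found.start())         # 15

# str.split returns a new list; the original string is left unchanged.
parts = line.split("\n")
print(parts == [line])       # True
```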

Also note that there is a typo: OUPUT ==> OUTPUT

In this part, for example, (INSERT(?: INTO)), the non-capturing group has no purpose, so you can write it simply as (INSERT INTO)

But looking at the example data, INPUT is optional and FROM occurs two times, so that would be (?:INPUT )?FROM
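
As a quick check of that optional group, the following sketch runs the pattern against two made-up lines (they are illustrative, not taken from the input file): one where FROM follows INPUT and one where it stands alone.

```python
import re

# (?:INPUT )? makes the "INPUT " prefix optional, so a single alternative
# covers both "INPUT FROM" and a bare "FROM".
patt = re.compile(r"(?:INPUT )?FROM")

# Illustrative sample lines.
samples = [
    "INPUT FROM stream_a;",
    "SELECT x FROM stream_b s;",
]
matches = [patt.search(s).group() for s in samples]
print(matches)   # ['INPUT FROM', 'FROM']
```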

One option could be reading the whole file at once and using re.sub to replace each match with a newline followed by what was matched.

As you don't want to prepend a newline to an alternative that is already at the start of a line, you can match the alternatives while asserting that the start of the line is not directly to the left, and only match FROM when it is not preceded by INPUT at the start of a line.

In the replacement use a newline and then the full match using \n\g<0>
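
A minimal sketch of that replacement on its own, using a simplified keyword list and a made-up one-line input (both are assumptions for illustration, without the lookbehind conditions):

```python
import re

# Illustrative one-line input.
text = "CREATE X INSERT INTO a SELECT b FROM c;"

# \g<0> in the replacement refers to the entire match, so each keyword
# is kept and simply gets a newline placed in front of it.
result = re.sub(r"INSERT INTO|FROM", r"\n\g<0>", text)
print(result)
```

Note that the space before each keyword stays at the end of the preceding line, which is why the reformatted lines end with a trailing space.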

\b(?<!^)(?:INPUT FROM|INSERT INTO|OUTPUT TO|(?<!^INPUT )FROM)\b
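
To see how the two lookbehinds behave, here is a small sketch that applies the pattern with re.M to three short made-up lines (the stream names are placeholders): a keyword in the middle of a line gets a newline, while a keyword at the start of a line is left alone.

```python
import re

patt = re.compile(
    r"\b(?<!^)(?:INPUT FROM|INSERT INTO|OUTPUT TO|(?<!^INPUT )FROM)\b",
    re.M,  # let ^ match at the start of every line, not only the string
)

# Illustrative input: one mid-line keyword, one keyword pair at the start
# of a line, and one line with a keyword at the start and one mid-line.
text = (
    "  toSend: true ) OUTPUT TO stream_a;\n"
    "INPUT FROM stream_b;\n"
    "INSERT INTO stream_c FROM stream_d;\n"
)
result = patt.sub(r"\n\g<0>", text)
print(result)
```

Only OUTPUT TO on the first line and FROM on the third line are moved to a new line; INPUT FROM and INSERT INTO stay where they are because they already start a line.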

import re

line_patt = re.compile(r"\b(?<!^)(?:INPUT FROM|INSERT INTO|OUTPUT TO|(?<!^INPUT )FROM)\b", re.M)
with open('input.txt', 'r+') as f:
    lines = line_patt.sub(r"\n\g<0>", f.read())
    print(lines)

Output

CREATE APPLICATION vod2_sar_app_wak1_tst_xor WITH ENCRYPTION RECOVERY 30 SECOND INTERVAL AUTORESUME MAXRETRIES 2 RETRYINTERVAL 60;

CREATE FLOW flow_src_vod2_sar_app_wak1_tst_xor;

CREATE OR REPLACE SOURCE vod2_epc_app_cln1_tst_comp_src_cdc USING Global.reader ( 
  DicMod: 'on', 
  url: '$src_url', 
  trans: true, 
  toSend: true ) 
OUTPUT TO stream_src_vod2_sar_app_wak1_tst_xor;


CREATE OR REPLACE TARGET comp_tgt_vod2_sar_app_wak1_tst_xor USING Global.wirter ( 
  stdSQL: 'true', 
  hasID: 'true', 
  delim: '"' ) 
INPUT FROM vod2_sar_app_wak1_tst_xor_text_clean_stream;

CREATE OR REPLACE Hm Hm_vod2_sar_app_wak1_tst_xor 
INSERT INTO Hm_STREAM_vod2_sar_app_wak1_tst_xor SELECT putUserdata(s,'Converted_SCN',META(s,"SCN").toString(),'DNOW', to_long(DNow()),'Created_time',DNOW(),'Nanoseconds',System.nanoTime()) 
FROM stream_src_vod2_sar_app_wak1_tst_xor s;

CREATE OR REPLACE OPEN PROCESSOR vod2_sar_app_wak1_tst_xor_text_clean USING Global.TextCleaner (
  replaceFrom: '[\u0000]', 
  includeBefore: true ) 
INSERT INTO vod2_sar_app_wak1_tst_xor_text_clean_stream 
FROM Hm_STREAM_vod2_sar_app_wak1_tst_xor;

END FLOW flow_tgt_vod2_sar_app_wak1_tst_xor;

END APPLICATION vod2_sar_app_wak1_tst_xor;
Answered By: The fourth bird