Python re.findall regex and text processing

Question:

I’m looking to find and modify some sql syntax around the convert function. I want basically any convert(A,B) or CONVERT(A,B) in all my files to be selected and converted to B::A.

So far I tried selecting them with re.findall(r"bconvertb(.*?,.*)", l, re.IGNORECASE) But it’s only returning a small selection out of what I want and I also have trouble actually manipulating the A/B I mentioned.

For example, a sample line (note the nested structure here is irrelevant, I’m only getting the outer layer working if possible)

convert(varchar, '/' || convert(nvarchar, es.Item_ID) || ':' || convert(nvarchar, o.Option_Number) || '/') as LocPath

…should become…

'/' || es.Item_ID::nvarchar || ':' || o.Option_Number::nvarchar || '/' :: varchar as LocPath

Example2:

SELECT LocationID AS ItemId, convert(bigint, -1),

…should become…

SELECT LocationID AS ItemId, -1::bigint,

I think this should be possible with some kind of re.sub with groups and currently have a code structure inside a for each loop where line is the each line in the file:

matchConvert = ["convert(", "CONVERT("]
a = next((a for a in matchConvert if a in line), False)
if a:
    print("convert() line")
    #line = re.sub(re.escape(a) + r'', '', line)

Edit: In the end I went with a non re solution and handled each line by identifying each block and manipulate them accordingly.

Asked By: neveratdennys

||

Answers:

The task:

Swap the parameters of all the ‘convert’ functions in the given string. Parameters can contain any character, including nested ‘convert’ functions.

A solution based on the re module:

def convert_re(s):
    import re
    start,part_1,part_2,end=re.search(r'''
                               (.*?)   
                               convert(
                               ([^,)(]+(.+?)[^,)(]*|[^,)(]+)
                               ,
                               ([^,)(]+(.+?)[^,)(]*|[^,)(]+)
                               )
                               (.*)                                     
                                       ''',s,re.X).groups()


    result=start+part_2.lstrip()+' :: '+part_1+end
    return result

def multi_convert_re(s):
    converts=s.count('convert')
    for n in range(converts):
        s=convert_re(s)
    return s

Discription of the ‘convert_re’ function:

Regular expression:

  • start is the first group with what comes before ‘convert’

  • Then follows convert() which has no group and contains the name of the function and the opening ‘(‘

  • part_1 is the second group ([^,)(]+(.+?)[^,)(]*|[^,)(]+). This should match the first parameter. It can be anything except – ,)(, or a function preceded by anything except ,)(, optionally followed by anything except ,)( and with anything inside (except a new line)

  • Then follows a comma ,, which has no group

  • part_2 is the third group and it acts like the second, but should catch everything what’s left inside the external function

  • Then follows ), which has no group

  • end is the fourth group (.*) with what’s left before the new line.

The resulting string is then created by swapping part_1 and part_2, putting ‘ :: ‘ between them, removing spaces on the left from part_2 and adding start to the beginning and end to the end.

Description of the ‘multi_convert_re’ function

Repeatedly calls ‘convert_re’ function until there are no "convert" left.

Notes:

  • N.B.: The code implies that the ‘convert’ function in the string has exactly two parameters.
  • The code works on the given examples, but I’m afraid there may still be unforeseen flaws when it comes to other examples. Please tell, if you find any flaws.
  • I have provided another solution presented in another answer that is not based on the re module. It may turn out that the results will be different.

This may be an X/Y problem, meaning you’re asking how to do something with Regex that may be better solved with parsing (meaning using/modifying/writing a SQL parser). An indication that this is the case is the fact that “convert” calls can be nested. I’m guessing Regex is going to be more of a headache than it’s worth here in the long run if you’re working with a lot of files and they’re at all complicated.

Answered By: bob

The task:

Swap the parameters of all the ‘convert’ functions in this given. Parameters can contain any character, including nested ‘convert’ functions.

A solution:

def convert_py(s):
    #capturing start:
    left=s.index('convert')
    start=s[:left]
    #capturing part_1:
    c=0
    line=''
    for n1,i in enumerate(s[left+8:],start=len(start)+8):
        if i==',' and c==0:
            part_1=line
            break
        if i==')':
            c-=1
        if i=='(':
            c+=1
        line+=i
    #capturing part_2:
    c=0
    line=''
    for n2,i in enumerate(s[n1+1:],start=n1+1):
        if i==')':
            c-=1
        if i=='(':
            c+=1
        if c<0:
            part_2=line
            break
        line+=i
    #capturing end:
    end=s[n2+1:]
    #capturing result:
    result=start+part_2.lstrip()+' :: '+part_1+end
    return result

def multi_convert_py(s):
    converts=s.count('convert')
    for n in range(converts):
        s=convert_py(s)
    return s

Notes:

  • Unlike the solution based on the re module, which is presented in another answer – this version should not fail if there are more than two parameters in the ‘convert’ function in the given string. However, it will swap them only once, for example: convert(a,b, c) –> b, c : a
  • I am afraid that unforeseen cases may arise that will lead to failure. Please tell if you find any flaws

Here’s my solution based on @Иван-Балван‘s code. Breaking this structure into blocks makes further specification a lot easier than I previously thought and I’ll be using this method for a lot of other operations as well.

# Check for balanced brackets
def checkBracket(my_string):
    count = 0
    for c in my_string:
        if c == "(":
            count+=1
        elif c == ")":
            count-=1
    return count


# Modify the first convert in line
# Based on suggestions from stackoverflow.com/questions/73040953
def modifyConvert(l):
    # find the location of convert()
    count = l.index('convert(')

    # select the group before convert() call
    before = l[:count]

    group=""
    n1=0
    n2=0
    A=""
    B=""
    operate = False
    operators = ["|", "<", ">", "="]
    # look for A group before comma
    for n1, i in enumerate(l[count+8:], start=len(before)+8):
        # find current position in l
        checkIndex = checkBracket(l[count+8:][:n1-len(before)-8])
        if i == ',' and checkIndex == 0:
            A = group
            break
        group += i

    # look for B group after comma
    group = ""
    for n2, i in enumerate(l[n1+1:], start=n1+1):
        checkIndex = checkBracket(l[count+n1-len(before):][:n2-n1+1])
        if i == ',' and checkIndex == 0:
            return l
        elif checkIndex < 0:
            B = group
            break
        group += i
        
        # mark operators
        if i in operators:
            operate = True

    # select the group after convert() call
    after = l[n2+1:]

    # (B) if it contains operators
    if operate:
        return before + "(" + B.lstrip() + ') :: ' + A + after
    else:
        return before + B.lstrip() + '::' + A + after


# Modify cast syntax with convert(a,b). return line.
def convertCast(l):

    # Call helper for nested cases
    i = l.count('convert(')
    while i>0:
        i -= 1
        l = modifyConvert(l)

    return l
Answered By: neveratdennys