Python re.findall regex and text processing

Question

I’m looking to find and modify some sql syntax around the convert function. I want basically any convert(A,B) or CONVERT(A,B) in all my files to be selected and converted to B::A.

So far I tried selecting them with re.findall(r"bconvertb(.*?,.*)", l, re.IGNORECASE) But it’s only returning a small selection out of what I want and I also have trouble actually manipulating the A/B I mentioned.

For example, a sample line (note the nested structure here is irrelevant, I’m only getting the outer layer working if possible)

convert(varchar, '/' || convert(nvarchar, es.Item_ID) || ':' || convert(nvarchar, o.Option_Number) || '/') as LocPath

…should become…

'/' || es.Item_ID::nvarchar || ':' || o.Option_Number::nvarchar || '/' :: varchar as LocPath

Example2:

SELECT LocationID AS ItemId, convert(bigint, -1),

…should become…

SELECT LocationID AS ItemId, -1::bigint,

I think this should be possible with some kind of re.sub with groups and currently have a code structure inside a for each loop where line is the each line in the file:

matchConvert = ["convert(", "CONVERT("]
a = next((a for a in matchConvert if a in line), False)
if a:
    print("convert() line")
    #line = re.sub(re.escape(a) + r'', '', line)

Edit: In the end I went with a non re solution and handled each line by identifying each block and manipulate them accordingly.

Asked By: neveratdennys

||

Source

Answer 1

The task:

Swap the parameters of all the ‘convert’ functions in the given string. Parameters can contain any character, including nested ‘convert’ functions.

A solution based on the re module:

def convert_re(s):
    import re
    start,part_1,part_2,end=re.search(r'''
                               (.*?)   
                               convert(
                               ([^,)(]+(.+?)[^,)(]*|[^,)(]+)
                               ,
                               ([^,)(]+(.+?)[^,)(]*|[^,)(]+)
                               )
                               (.*)                                     
                                       ''',s,re.X).groups()


    result=start+part_2.lstrip()+' :: '+part_1+end
    return result

def multi_convert_re(s):
    converts=s.count('convert')
    for n in range(converts):
        s=convert_re(s)
    return s

Discription of the ‘convert_re’ function:

Regular expression:

start is the first group with what comes before ‘convert’
Then follows convert() which has no group and contains the name of the function and the opening ‘(‘
part_1 is the second group ([^,)(]+(.+?)[^,)(]*|[^,)(]+). This should match the first parameter. It can be anything except – ,)(, or a function preceded by anything except ,)(, optionally followed by anything except ,)( and with anything inside (except a new line)
Then follows a comma ,, which has no group
part_2 is the third group and it acts like the second, but should catch everything what’s left inside the external function
Then follows ), which has no group
end is the fourth group (.*) with what’s left before the new line.

The resulting string is then created by swapping part_1 and part_2, putting ‘ :: ‘ between them, removing spaces on the left from part_2 and adding start to the beginning and end to the end.

Description of the ‘multi_convert_re’ function

Repeatedly calls ‘convert_re’ function until there are no "convert" left.

Notes:

N.B.: The code implies that the ‘convert’ function in the string has exactly two parameters.
The code works on the given examples, but I’m afraid there may still be unforeseen flaws when it comes to other examples. Please tell, if you find any flaws.
I have provided another solution presented in another answer that is not based on the re module. It may turn out that the results will be different.

Answered By: Иван Балван

Answer 2

This may be an X/Y problem, meaning you’re asking how to do something with Regex that may be better solved with parsing (meaning using/modifying/writing a SQL parser). An indication that this is the case is the fact that “convert” calls can be nested. I’m guessing Regex is going to be more of a headache than it’s worth here in the long run if you’re working with a lot of files and they’re at all complicated.

Answered By: bob

Answer 3

The task:

Swap the parameters of all the ‘convert’ functions in this given. Parameters can contain any character, including nested ‘convert’ functions.

A solution:

def convert_py(s):
    #capturing start:
    left=s.index('convert')
    start=s[:left]
    #capturing part_1:
    c=0
    line=''
    for n1,i in enumerate(s[left+8:],start=len(start)+8):
        if i==',' and c==0:
            part_1=line
            break
        if i==')':
            c-=1
        if i=='(':
            c+=1
        line+=i
    #capturing part_2:
    c=0
    line=''
    for n2,i in enumerate(s[n1+1:],start=n1+1):
        if i==')':
            c-=1
        if i=='(':
            c+=1
        if c<0:
            part_2=line
            break
        line+=i
    #capturing end:
    end=s[n2+1:]
    #capturing result:
    result=start+part_2.lstrip()+' :: '+part_1+end
    return result

def multi_convert_py(s):
    converts=s.count('convert')
    for n in range(converts):
        s=convert_py(s)
    return s

Notes:

Unlike the solution based on the re module, which is presented in another answer – this version should not fail if there are more than two parameters in the ‘convert’ function in the given string. However, it will swap them only once, for example: convert(a,b, c) –> b, c : a
I am afraid that unforeseen cases may arise that will lead to failure. Please tell if you find any flaws

Answered By: Иван Балван

Answer 4

Here’s my solution based on @Иван-Балван‘s code. Breaking this structure into blocks makes further specification a lot easier than I previously thought and I’ll be using this method for a lot of other operations as well.

# Check for balanced brackets
def checkBracket(my_string):
    count = 0
    for c in my_string:
        if c == "(":
            count+=1
        elif c == ")":
            count-=1
    return count


# Modify the first convert in line
# Based on suggestions from stackoverflow.com/questions/73040953
def modifyConvert(l):
    # find the location of convert()
    count = l.index('convert(')

    # select the group before convert() call
    before = l[:count]

    group=""
    n1=0
    n2=0
    A=""
    B=""
    operate = False
    operators = ["|", "<", ">", "="]
    # look for A group before comma
    for n1, i in enumerate(l[count+8:], start=len(before)+8):
        # find current position in l
        checkIndex = checkBracket(l[count+8:][:n1-len(before)-8])
        if i == ',' and checkIndex == 0:
            A = group
            break
        group += i

    # look for B group after comma
    group = ""
    for n2, i in enumerate(l[n1+1:], start=n1+1):
        checkIndex = checkBracket(l[count+n1-len(before):][:n2-n1+1])
        if i == ',' and checkIndex == 0:
            return l
        elif checkIndex < 0:
            B = group
            break
        group += i
        
        # mark operators
        if i in operators:
            operate = True

    # select the group after convert() call
    after = l[n2+1:]

    # (B) if it contains operators
    if operate:
        return before + "(" + B.lstrip() + ') :: ' + A + after
    else:
        return before + B.lstrip() + '::' + A + after


# Modify cast syntax with convert(a,b). return line.
def convertCast(l):

    # Call helper for nested cases
    i = l.count('convert(')
    while i>0:
        i -= 1
        l = modifyConvert(l)

    return l

Answered By: neveratdennys

Python re.findall regex and text processing

Question:

Answers: