Replacing Overlapping Regex Patterns in Python

Question:

I am trying to make a .ttl file I was handed digestible. One of the issues is that the rdfs:seeAlso values are not sanitized, which breaks downstream programs. By this I mean that there are links of the form:

rdfs:seeAlso prefix:value_(discipline)

In order to fix this, I need to precede particular characters with a backslash (\), per the RDF 1.1 Turtle documentation. Of the characters present, I need to escape the following:

_, ~, -, !, $, &, (, ), *, +, =, ?, #, %

At first I thought this would be easy and I began constructing a re.sub() pattern. I tried a number of potential solutions, but the closest I could get was with:

re.sub(pattern=r"(rdfs:seeAlso)(.{0,}?)([\_\~\-\!\$\&\(\)\*\+\=\?\#\%]{1})(.{0,})", repl='\\1\\2\\\\\\3\\4', string=str_of_ttl_file)

The (rdfs:seeAlso) component was added in order to prevent accidentally changing characters within strings that are instances of rdfs:label and rdfs:comment (i.e. any of the above characters in between '' or "").

However, this has the drawback of only working for the first occurrence, and results in:

rdfs:seeAlso prefix:value\_(discipline)

Where it should be:

rdfs:seeAlso prefix:value\_\(discipline\)

Any help or guidance with this would be much appreciated!

EDIT 1: Instances of rdfs:label and rdfs:comment are strings that are between single (') or double (") quotes, such as:

rdfs:label "example-label"@en

Or

rdfs:comment "This_ is+ an $example$ comment where n&thing should be replaced."@en

The special characters there do not need to be replaced for Turtle to function and should therefore be left alone by the regular expression.

Asked By: user3684314


Answers:

First, you don't have to escape characters inside [...] in your pattern (the - should come last, however, otherwise it will be interpreted as a range). This will make your code more readable. Then you can replace in a while loop and use a negative lookbehind to ensure that the character isn't already escaped:

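import re

input_text = "rdfs:seeAlso prefix:value_(discipline)"

# Group 1: everything from rdfs:seeAlso up to the next special character.
# (?<!\\) skips characters that are already preceded by a backslash.
pattern = re.compile(r"(rdfs:seeAlso.*?)(?<!\\)([_~!$&()*+=?#%-])")

# Each pass escapes one more character; stop when a pass changes nothing.
repl_str = ''
while repl_str != input_text:
    repl_str = input_text
    input_text = re.sub(pattern, r'\1\\\2', repl_str)

print(input_text)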

Note: using a raw string for the replacement pattern makes it much more readable.

Output:

rdfs:seeAlso prefix:value\_\(discipline\)
Answered By: Tranbi
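
As a possible alternative to the while loop above, the same result can probably be reached in a single pass by giving re.sub() a function as the replacement. This is only a sketch and not part of the answer: SPECIAL, SEE_ALSO and escape_see_also are illustrative names, and it assumes that each rdfs:seeAlso statement ends on the line where it starts.

```python
import re

# Characters the question wants escaped; the hyphen is last so it is literal.
# The lookbehind skips characters that already have a backslash before them.
SPECIAL = re.compile(r"(?<!\\)[_~!$&()*+=?#%-]")

# Assumption: an rdfs:seeAlso statement runs to the end of its line.
SEE_ALSO = re.compile(r"rdfs:seeAlso[^\n]*")


def escape_see_also(text):
    """Backslash-escape special characters, but only in rdfs:seeAlso statements."""
    def fix(match):
        # Inner pass: escape every special character in the matched statement.
        return SPECIAL.sub(lambda m: "\\" + m.group(0), match.group(0))

    return SEE_ALSO.sub(fix, text)


sample = 'rdfs:label "example-label"@en\nrdfs:seeAlso prefix:value_(discipline)'
print(escape_see_also(sample))
```

Because the replacement is a function, its return value is inserted literally, so there is no need to double the backslashes in a replacement template. The lookbehind also makes the function safe to run twice on the same text.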

I believe you should separate the check for whether your string starts with rdfs:seeAlso from the replacement itself:

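import re

str_of_ttl_file = "rdfs:seeAlso prefix:value_(discipline)"

# Only touch rdfs:seeAlso statements; labels and comments are left alone.
if str_of_ttl_file.startswith('rdfs:seeAlso'):
    # r'\\\1' inserts a literal backslash followed by the matched character.
    str_of_ttl_file = re.sub(r'([_~!$&()*+=?#%-])', r'\\\1', str_of_ttl_file)

print(str_of_ttl_file)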
Answered By: markalex
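
The snippet above checks a single statement, whereas the question holds the whole file in one string. One way this might be extended is to apply the same check line by line. The following is a minimal sketch under that assumption: escape_ttl_lines is a hypothetical helper name, and it assumes every rdfs:seeAlso statement begins its own line (possibly indented), which may not hold for every Turtle file.

```python
import re

# Same character class as in the answer above, with the hyphen last.
SPECIAL = re.compile(r"([_~!$&()*+=?#%-])")


def escape_ttl_lines(ttl_text):
    """Escape special characters on lines that start with rdfs:seeAlso."""
    out = []
    # keepends=True preserves the original line endings when rejoining.
    for line in ttl_text.splitlines(keepends=True):
        # lstrip() tolerates indentation in front of the predicate.
        if line.lstrip().startswith("rdfs:seeAlso"):
            line = SPECIAL.sub(r"\\\1", line)
        out.append(line)
    return "".join(out)


ttl = (
    '    rdfs:seeAlso prefix:value_(discipline) ;\n'
    'rdfs:label "example-label"@en\n'
)
print(escape_ttl_lines(ttl), end="")
```

Unlike a version with a lookbehind, this sketch is not safe to run twice on the same text, because already escaped characters would be escaped again.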

This solution does it without regular expressions:

def escape(inputstr, chars_to_escape):
    translation_dict = {c: '\\' + c for c in chars_to_escape}
    translation_table = str.maketrans(translation_dict)
    return inputstr.translate(translation_table)

def conditionalTurtleReplace(inputstr):
    if inputstr.startswith('rdfs:seeAlso'):
        return escape(inputstr, r'_~-!$&()*+=?#%')
    else:
        return inputstr

str1 = 'rdfs:seeAlso prefix:value_(discipline)'
str2 = 'rdfs:label "example-label"@en'
str3 = 'rdfs:comment "This_ is+ an $example$ comment where n&thing should be replaced."@en'
print(conditionalTurtleReplace(str1))
# output: rdfs:seeAlso prefix:value\_\(discipline\)
print(conditionalTurtleReplace(str2))
# output: rdfs:label "example-label"@en
print(conditionalTurtleReplace(str3))
# output: rdfs:comment "This_ is+ an $example$ comment where n&thing should be replaced."@en
Answered By: Lover of Structure