Detect latin characters in regex

Question:

I want to apply a regex on a Latin text, and I followed the solution in this question: How to account for accent characters for regex in Python?, where they suggest to add a # character before the regex.

def clean_str(string):
    string = re.sub(r"#(@[a-zA-Z_0-9]+)", " ", string, re.UNICODE)
    string = re.sub(r'#([^a-zA-Z0-9#])', r' \1 ', string, re.UNICODE)
    string = re.sub(r'#([^a-zA-Z0-9#])', r' ', string, re.UNICODE)
    string = re.sub(r'(\s{2,})', ' ', string, re.UNICODE)
    return string.lower().strip()

My problem is that the regex seems to handle the Latin characters, but none of the substitutions are applied to the text.

example:
if I have a text like “@aaa bbb các. ddd”.

It should become “bbb các . ddd”, with a space before the dot and the tag “@aaa” removed.

But it returns the input text unchanged: “@aaa bbb các. ddd”

Did I miss something?

Asked By: Minions


Answers:

You have several issues in the current code:

  • To match any Unicode word character, use \w (rather than [A-Za-z0-9_]) with the Unicode flag.
  • The fourth positional argument of re.sub is count, not flags, so re.sub(pattern, repl, string, re.UNICODE) treats the flag as a replacement limit. Either pass count before the flag (0 replaces all occurrences), or use the keyword form flags=re.U / flags=re.UNICODE.
  • To match any character that is neither a word character nor whitespace, you may use [^\w\s].
  • To reuse the whole match in the replacement, you do not have to wrap the pattern in (...); just use the \g<0> backreference in the replacement pattern.
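
The count/flags point is easy to check in isolation. The sketch below is illustrative only (the sample string and variable names are made up for this example): it passes re.UNICODE in the fourth positional slot of re.sub, where it is read as count, and compares the result with the keyword form.

```python
import re

text = "a" * 40  # made-up sample with 40 possible matches

# The fourth positional argument is `count`, so the flag's integer value
# becomes the maximum number of replacements instead of acting as a flag.
flag_as_count = re.sub(r"a", "b", text, re.UNICODE)

# Keyword form: the flag is applied and every match is replaced.
flag_as_flag = re.sub(r"a", "b", text, flags=re.UNICODE)

print(int(re.UNICODE))           # 32
print(flag_as_count.count("b"))  # 32
print(flag_as_flag.count("b"))   # 40
```

With fewer than 32 matches the two calls give the same result, which is why this mistake tends to go unnoticed on short inputs.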

See an updated method to clean the strings:

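>>> import re
>>> def clean_str(s):
...     s = re.sub(r'@\w+', ' ', s, flags=re.U)           # drop @tags
...     s = re.sub(r'[^\w\s]', r' \g<0>', s, flags=re.U)  # put a space before punctuation
...     s = re.sub(r'\s{2,}', ' ', s, flags=re.U)         # collapse repeated whitespace
...     return s.lower().strip()
...
>>> s = "@aaa bbb các. ddd"
>>> print(clean_str(s))
bbb các . ddd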
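
To see why the character class matters for the example text, here is a small comparison. It is only a sketch: the sample word is written with an explicit escape so that the accented letter is the precomposed U+00E1, and the variable names are made up for illustration. In Python 3, str patterns are Unicode-aware by default, so flags=re.U is redundant there but harmless.

```python
import re

word = "c\u00e1c"  # "các" with a precomposed á (U+00E1)

# ASCII-only class: the accented letter splits the word in two.
ascii_runs = re.findall(r"[a-zA-Z0-9_]+", word)

# Unicode-aware \w: the whole word is one run.
unicode_runs = re.findall(r"\w+", word)

print(ascii_runs)    # ['c', 'c']
print(unicode_runs)  # ['các']

# The negated ASCII class treats á like punctuation; [^\w\s] does not.
ascii_spaced = re.sub(r"[^a-zA-Z0-9\s]", r" \g<0>", word)
unicode_spaced = re.sub(r"[^\w\s]", r" \g<0>", word)

print(ascii_spaced)    # c ác
print(unicode_spaced)  # các
```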
Answered By: Wiktor Stribiżew