Detect Latin characters in regex
Question:
I want to apply a regex to Latin text. I followed the solution in the question "How to account for accent characters for regex in Python?", which suggests adding a # character before the regex.
def clean_str(string):
    string = re.sub(r"#(@[a-zA-Z_0-9]+)", " ", string, re.UNICODE)
    string = re.sub(r'#([^a-zA-Z0-9#])', r' \1 ', string, re.UNICODE)
    string = re.sub(r'#([^a-zA-Z0-9#])', r' ', string, re.UNICODE)
    string = re.sub(r'(\s{2,})', ' ', string, re.UNICODE)
    return string.lower().strip()
My problem is that the regex detects the Latin characters, but none of the substitutions are applied to the text.
Example:
Given a text like “@aaa bbb các. ddd”, the result should be “bbb các . ddd”, with a space before the dot and the tag “@aaa” removed.
Instead, it returns the input unchanged: “@aaa bbb các. ddd”
Did I miss something?
Answers:
You have several issues in the current code:
- To match any Unicode word character, use \w (rather than [A-Za-z0-9_]) with a Unicode flag.
- When using re.U with re.sub, remember to either pass the count argument (set it to 0 to replace all occurrences) before the flag, or just use flags=re.U / flags=re.UNICODE.
- To match any non-word character other than whitespace, you may use [^\w\s].
- When you want to replace with the whole match, you do not have to wrap the whole pattern in (...); just use the \g<0> backreference in the replacement pattern.
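The first three points can be checked with a short sketch. The sample strings and variable names here are illustrative assumptions, not part of the original code. The positional call is wrapped in a warnings filter because newer Python versions emit a DeprecationWarning when count or flags are passed positionally.

```python
import re
import warnings

word = "c\u00e1c."  # "các." with a precomposed á (U+00E1)

# An ASCII-only class treats "á" as a non-word character; \w does not.
ascii_hits = re.findall(r"[^a-zA-Z0-9\s]", word)
unicode_hits = re.findall(r"[^\w\s]", word, flags=re.UNICODE)

text = "x " * 40  # 40 possible matches for \w

with warnings.catch_warnings():
    warnings.simplefilter("ignore", DeprecationWarning)
    # The fourth positional parameter of re.sub is `count`, so re.UNICODE
    # (integer value 32) caps the call at 32 replacements and sets no flag.
    positional = re.sub(r"\w", "_", text, re.UNICODE)

# Passed by keyword, the flag is applied and every match is replaced.
keyword = re.sub(r"\w", "_", text, flags=re.UNICODE)

print(ascii_hits)             # ['á', '.']
print(unicode_hits)           # ['.']
print(positional.count("_"))  # 32
print(keyword.count("_"))     # 40
```

In Python 3, str patterns are Unicode-aware by default, so the flag mainly matters on Python 2; the count mix-up applies to both.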
See an updated method to clean the strings:
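>>> import re
>>> s = '@aaa bbb các. ddd'
>>> def clean_str(s):
...     s = re.sub(r'@\w+', ' ', s, flags=re.U)           # remove @tags
...     s = re.sub(r'[^\w\s]', r' \g<0>', s, flags=re.U)  # put a space before punctuation
...     s = re.sub(r'\s{2,}', ' ', s, flags=re.U)         # collapse runs of whitespace
...     return s.lower().strip()
...
>>> print(clean_str(s))
bbb các . ddd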
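One further point about the question's patterns, offered as a likely explanation rather than a confirmed diagnosis: outside re.VERBOSE mode, # is an ordinary literal character, so a pattern that starts with # can only match where the text actually contains a #. The sample text has none, which would explain why the input comes back unchanged. A minimal sketch using the sample string from the question (the variable names are illustrative):

```python
import re

text = "@aaa bbb c\u00e1c. ddd"  # "@aaa bbb các. ddd"

# The leading "#" must be present in the input for this pattern to match.
with_hash = re.sub(r"#(@[a-zA-Z_0-9]+)", " ", text)

# Dropping the "#" lets the tag pattern match as intended.
without_hash = re.sub(r"@\w+", " ", text)

print(with_hash == text)   # True: nothing was replaced
print(repr(without_hash))  # '  bbb các. ddd'
```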