How to account for accented characters in a regex in Python?

Question:

I currently use re.findall to find and isolate words after the ‘#’ character for hash tags in a string:

hashtags = re.findall(r'#([A-Za-z0-9_]+)', str1)

It searches str1 and finds all the hashtags. This works, but it doesn’t account for accented characters such as áéíóúñü¿.

If one of these letters is in str1, the hashtag is saved only up to the letter before it. For example, #yogenfrüz would be saved as #yogenfr.
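
A minimal reproduction of that truncation, using a made-up sample string (the real contents of str1 are not shown in the question):

```python
import re

# Hypothetical sample input; only #yogenfrüz comes from the question.
str1 = "Try the new #yogenfrüz flavour and #frozen_yogurt"

# The ASCII-only class stops at the first character outside A-Z, a-z, 0-9, _
hashtags = re.findall(r'#([A-Za-z0-9_]+)', str1)
print(hashtags)  # ['yogenfr', 'frozen_yogurt']
```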

I need to account for all accented letters used in German, Dutch, French and Spanish so that I can save hashtags like #yogenfrüz.

How can I go about doing this?

Asked By: deadlock


Answers:

Try the following:

hashtags = re.findall(r'#(\w+)', str1, re.UNICODE)

Regex101 Demo

EDIT
Check the useful comment below from Martijn Pieters.
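
As a rough sketch of how this behaves on Python 3 (an assumption, since the question does not state the Python version): \w in a str pattern is already Unicode-aware there, so re.UNICODE is accepted but redundant, and re.ASCII switches back to the ASCII-only behaviour. The sample string is made up for illustration.

```python
import re

# Hypothetical sample input with precomposed accented letters.
str1 = "#yogenfrüz #mañana #café_2024"

# Python 3: \w on a str pattern matches Unicode word characters by default.
hashtags = re.findall(r'#(\w+)', str1)
print(hashtags)    # ['yogenfrüz', 'mañana', 'café_2024']

# re.ASCII restricts \w to [A-Za-z0-9_], reproducing the original problem.
ascii_only = re.findall(r'#(\w+)', str1, re.ASCII)
print(ascii_only)  # ['yogenfr', 'ma', 'caf']
```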

Answered By: Ibrahim Najjar

You may also want to use

import unicodedata
output = unicodedata.normalize('NFD', my_unicode).encode('ascii', 'ignore')

How do I convert all those escape characters into their respective characters? For example, if there is a Unicode à, how do I convert that into a standard a?

Assume you have loaded your Unicode text into a variable called my_unicode. Normalizing à into a is this simple:

import unicodedata
output = unicodedata.normalize('NFD', my_unicode).encode('ascii', 'ignore')

Explicit example:

>>> myfoo = u'àà'
>>> myfoo
u'\xe0\xe0'
>>> unicodedata.normalize('NFD', myfoo).encode('ascii', 'ignore')
'aa'

Check this answer, it helped me a lot: How to convert unicode accented characters to pure ascii without accents?
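
The example output above looks like a Python 2 session (u'' reprs, a plain str result). On Python 3 the same call returns bytes, so a decode step is needed to get text back. A minimal sketch, where strip_accents is a hypothetical helper name:

```python
import unicodedata

def strip_accents(text):
    """Drop accents by decomposing, then discarding every non-ASCII code point."""
    decomposed = unicodedata.normalize('NFD', text)
    # encode() returns bytes on Python 3; decode to get a str again.
    return decomposed.encode('ascii', 'ignore').decode('ascii')

print(strip_accents('àà'))         # aa
print(strip_accents('yogenfrüz'))  # yogenfruz
# Letters with no ASCII decomposition are silently dropped:
print(strip_accents('straße'))     # strae
```

Note that this changes the hashtag text itself, so it only fits cases where losing the accents is acceptable.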

Answered By: Berk

I know this question is a little outdated, but you may also consider adding the range of accented characters from À (code point 192) to ÿ (code point 255) to your original regex.

hashtags = re.findall(r'#([A-Za-z0-9_À-ÿ]+)', str1)

For the example #yogenfrüz, this returns ['yogenfrüz'].
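
One caveat worth checking: the range À-ÿ also contains the symbols × (215) and ÷ (247). The sketch below compares the range as given with a narrower variant that skips those two code points; the names broad and strict and the sample text are illustrative only.

```python
import re

# The range as given above: every code point from U+00C0 to U+00FF.
broad = re.compile(r'#([A-Za-z0-9_À-ÿ]+)')

# Narrower variant: the same range minus × (U+00D7) and ÷ (U+00F7).
strict = re.compile(r'#([A-Za-z0-9_À-ÖØ-öø-ÿ]+)')

text = '#yogenfrüz #2×2'
print(broad.findall(text))   # ['yogenfrüz', '2×2']
print(strict.findall(text))  # ['yogenfrüz', '2']
```

Either way, the range only covers Latin-1 letters; a character such as œ (U+0153) falls outside it.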

Hope this’ll help anyone else.

Answered By: zanga

Here’s an update to Ibrahim Najjar’s original answer based on the comment Martijn Pieters made to the answer and another answer Martijn Pieters gave in https://stackoverflow.com/a/16467505/5302861:

import re
import unicodedata

s = "#ábá123"
n = unicodedata.normalize('NFC', s)

print(n)
c = ''.join(re.findall(r'#\w+', n, re.UNICODE))
print(s, len(s), c, len(c))
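
To see why the NFC step matters, here is a small sketch that forces the same hashtag into decomposed form (base letter plus combining accent) first. With the standard re module, \w stops at the combining mark, so only the composed form matches in full.

```python
import re
import unicodedata

# NFD splits each á into 'a' + U+0301 (combining acute accent).
decomposed = unicodedata.normalize('NFD', '#ábá123')
# NFC recombines them into single precomposed code points.
composed = unicodedata.normalize('NFC', decomposed)

print(len(decomposed), len(composed))   # 9 7
print(re.findall(r'#\w+', decomposed))  # ['#a']
print(re.findall(r'#\w+', composed))    # ['#ábá123']
```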
Answered By: Shabbir Khan

Building on all the other answers:

The key problem is that the re module differs in significant ways from other regular expression engines. In theory, Unicode’s definition of the \w metacharacter would do what the question requires, but the re module does not implement Unicode’s definition of \w; notably, its \w does not match combining marks.

The easy solution is to swap in a more compatible regular expression engine: install the third-party regex module and use it in place of re. The code that some of the other answers have given will then work as the question needs.

import regex as re
# import unicodedata as ud
import unicodedataplus as ud
hashtags = re.findall(r'#(\w+)', ud.normalize("NFC", str1))

Or, if you only want to focus on Latin script, including non-spacing marks (i.e. combining diacritics):

import regex as re
# import unicodedata as ud
import unicodedataplus as ud
hashtags = re.findall(r'#([\p{Latin}\p{Mn}]+)', ud.normalize("NFC", str1))

P.S. I have used unicodedataplus, which is a drop-in replacement for unicodedata. It has additional methods and is kept up to date with Unicode versions. With the unicodedata module, updating the Unicode version requires updating Python.
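
If installing third-party modules is not an option, a rough standard-library approximation is possible. This is a different technique from the regex-based code above: a hand-written scan that accepts alphanumerics, underscores and non-spacing marks (category Mn). find_hashtags is a hypothetical helper, not part of any library, and it is not script-aware the way \p{Latin} is.

```python
import unicodedata

def find_hashtags(text):
    """Collect hashtags, keeping combining marks that NFC cannot compose away."""
    tags = []
    current = None  # None means we are not inside a hashtag
    for ch in unicodedata.normalize('NFC', text):
        if ch == '#':
            if current:
                tags.append(current)
            current = ''
        elif current is not None and (
            ch.isalnum() or ch == '_' or unicodedata.category(ch) == 'Mn'
        ):
            current += ch
        else:
            if current:
                tags.append(current)
            current = None
    if current:
        tags.append(current)
    return tags

# 'q' + U+0301 has no precomposed form, so the mark survives NFC.
print(find_hashtags('#yogenfrüz and #q\u0301x end') == ['yogenfrüz', 'q\u0301x'])  # True
```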

Answered By: Andj