Regex to match all Hangul (Korean) characters and syllable blocks

Question:

I’m trying to validate user input (in Python) and check whether the right language is being used, Korean in this case. Let’s take the Korean word for email address: 이메일 주소

I can check each character like so:

import unicodedata as ud
for chr in u'이메일 주소':
    if 'HANGUL' in ud.name(chr): print "Yep, that's a Korean character."

But that seems highly inefficient, especially for longer texts. Of course, I could create a static dictionary containing all Korean syllable blocks, but that dictionary would contain some 25,000 characters and again, that would be inefficient to check against. I also need a solution for Japanese and Chinese, which may contain even more characters.

Therefore, I’d like to use a Regex pattern covering all Unicode characters for Hangul syllable blocks. But I have no clue if there is a range for that or where to find it.

As an example, this regex pattern covers all Latin based characters, including brackets and other commonly used symbols:

import re
LATIN_CHARACTERS = re.compile(ur'[\x00-\x7F\x80-\xFF\u0100-\u017F\u0180-\u024F\u1E00-\u1EFF]')

Can somebody translate this regex to match Korean Hangul syllable blocks? Or can you show me a table or reference to look up such ranges myself?

A pattern to match Chinese and Japanese would also be very helpful. Or one regex to match all CJK characters at once. I wouldn’t need to distinguish between Japanese and Korean.

Here’s a Python library for that task, but it works with incredibly huge dictionaries: https://github.com/EliFinkelshteyn/alphabet-detector
I cannot imagine that to be efficient for large texts and lots of user input.

Thanks!

Asked By: Simon Steinberger

||

Answers:

Are you aware that Unicode is divided into blocks, and that each block covers a contiguous range of code points? That means there is a much more efficient solution than a regular expression.

There is a single code block for Hangul Jamo, with additional characters in the CJK block, a compatibility block, Hangul syllables, etc.
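Since the question asks for a regex, here is a minimal sketch of how a few of those blocks could be written as a character class. It assumes Python 3 (where `\u` escapes work in ordinary string literals); the name `hangul_ratio` is made up for illustration, and the block list is not exhaustive (the halfwidth and enclosed forms are left out).

```python
import re

# Character class built from a few Hangul blocks (not an exhaustive list).
HANGUL = re.compile(
    '['
    '\u1100-\u11FF'  # Hangul Jamo
    '\u3130-\u318F'  # Hangul Compatibility Jamo
    '\uA960-\uA97F'  # Hangul Jamo Extended-A
    '\uAC00-\uD7AF'  # Hangul Syllables
    '\uD7B0-\uD7FF'  # Hangul Jamo Extended-B
    ']'
)


def hangul_ratio(text):
    """Return the fraction of non-whitespace characters that are Hangul."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(1 for c in chars if HANGUL.match(c)) / len(chars)


print(hangul_ratio('이메일 주소'))
```

A ratio rather than a strict yes/no may be more forgiving for user input that mixes in digits or punctuation, but that is a design choice, not something the question requires.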

The most efficient way is to check if each character is within the acceptable range, using if/then statements. You could almost certainly speed this up using a C-extension.

For example, if I were just checking the Hangul block (insufficient, but merely a simple starting place), I would check each character in a string with the following code:

def is_hangul_character(char):
    '''Check if character is in the Hangul Jamo block'''

    value = ord(char)
    return 0x1100 <= value <= 0x11FF  # decimal 4352..4607


def is_hangul(string):
    '''Check if all characters are in the Hangul Jamo block'''

    return all(is_hangul_character(i) for i in string)

It would be easy to extend this to the 8 or so blocks that contain Hangul characters. No table lookups, no regex compilation. Just fast range checks based on the block of the Unicode character.
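As a rough illustration of that extension, the sketch below checks a character against several blocks instead of one. It assumes Python 3; the `HANGUL_RANGES` table and the `ignore_whitespace` parameter are additions for this example, and the table covers only some of the Hangul-related blocks.

```python
# Inclusive (first, last) code-point pairs; an illustrative, partial list.
HANGUL_RANGES = (
    (0x1100, 0x11FF),  # Hangul Jamo
    (0x3130, 0x318F),  # Hangul Compatibility Jamo
    (0xA960, 0xA97F),  # Hangul Jamo Extended-A
    (0xAC00, 0xD7AF),  # Hangul Syllables
    (0xD7B0, 0xD7FF),  # Hangul Jamo Extended-B
)


def is_hangul_character(char):
    """Return True if char falls inside any of the listed blocks."""
    value = ord(char)
    return any(first <= value <= last for first, last in HANGUL_RANGES)


def is_hangul(string, ignore_whitespace=True):
    """Return True if every (non-whitespace) character is Hangul."""
    chars = [c for c in string if not (ignore_whitespace and c.isspace())]
    return bool(chars) and all(is_hangul_character(c) for c in chars)


print(is_hangul('이메일 주소'))  # True
print(is_hangul('email 주소'))   # False
```

Whitespace is skipped by default because the example phrase from the question contains a space, which would otherwise fail the check.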

In C, this would be very easy as well (if you would like a significant performance boost, to match a fully-optimized library with little work):

#include <stddef.h>  // size_t
#include <uchar.h>   // char32_t

// Return 0 if a character is in Hangul Jamo block, -1 otherwise
int is_hangul_character(char32_t c)
{
    if (c >= 4352 && c <= 4607) {
        return 0;
    }
    return -1;
}


// Return 0 if all characters are in Hangul Jamo block, -1 otherwise
int is_hangul(const char32_t* string, size_t length)
{
    size_t i;
    for (i = 0; i < length; ++i) {
        if (is_hangul_character(string[i]) < 0) {
            return -1;
        }
    }
    return 0;
}

Edit: A cursory glance at the CPython implementation shows that CPython uses this exact approach for the unicodedata module, i.e. the approach is efficient as well as easy to implement on your own. It is still worth implementing yourself, since you don’t have to allocate any intermediate string or perform superfluous string comparisons (which are likely the primary cost of going through the unicodedata module).

Answered By: Alex Huszagh

If you want a solution that doesn’t depend on the Unicode compliance of the utility app, then for the main block of AC00-D7AF (written as octal escapes for the UTF-8 bytes) you can use

(([\352][\260-\277]|[\353\354][\200-\277]|
[\355][\200-\235])[\200-\277]|[\355][\236][\200-\243]) # mawk/gawk -b 
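To see what this byte-level pattern covers, here is a small Python 3 sketch that rewrites it with hex escapes (octal 352 is 0xEA, 260 is 0xB0, and so on) and cross-checks it against a code-point range. The helper name `is_hangul_syllable_utf8` is hypothetical. Under these assumptions the pattern matches exactly the assigned syllables U+AC00 through U+D7A3, not the unassigned tail of the block up to U+D7AF.

```python
import re

# UTF-8 byte pattern equivalent to the octal one, written with hex escapes.
HANGUL_SYLLABLE_BYTES = re.compile(
    rb'(?:(?:\xea[\xb0-\xbf]|[\xeb\xec][\x80-\xbf]|\xed[\x80-\x9d])[\x80-\xbf]'
    rb'|\xed\x9e[\x80-\xa3])'
)


def is_hangul_syllable_utf8(char):
    """Return True if the UTF-8 encoding of char matches the byte pattern."""
    encoded = char.encode('utf-8')
    return HANGUL_SYLLABLE_BYTES.fullmatch(encoded) is not None


# Compare with a plain range check over the BMP (surrogates cannot be encoded).
mismatches = [
    cp for cp in range(0x10000)
    if not 0xD800 <= cp <= 0xDFFF
    and is_hangul_syllable_utf8(chr(cp)) != (0xAC00 <= cp <= 0xD7A3)
]
print(len(mismatches))  # 0
```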

that slab expanded out would be

(\355\236(\200|\201|\202|\203|\204|\205|\206|\207|
\210|\211|\212|\213|\214|\215|\216|\217|\220|\221|
\222|\223|\224|\225|\226|\227|\230|\231|\232|\233|
\234|\235|\236|\237|\240|\241|\242|\243)|
(\352(\260|\261|\262|\263|\264|\265|\266|
\267|\270|\271|\272|\273|\274|\275|\276|\277)|
\355(\200|\201|\202|\203|\204|\205|\206|\207|
\210|\211|\212|\213|\214|\215|\216|\217|\220|
\221|\222|\223|\224|\225|\226|\227|\230|\231|
\232|\233|\234|\235)|(\353|\354)
(\200|\201|\202|\203|\204|\205|\206|\207|\210|
\211|\212|\213|\214|\215|\216|\217|\220|\221|
\222|\223|\224|\225|\226|\227|\230|\231|\232|
\233|\234|\235|\236|\237|\240|\241|\242|\243|
\244|\245|\246|\247|\250|\251|\252|\253|\254|
\255|\256|\257|\260|\261|\262|\263|\264|\265|
\266|\267|\270|\271|\272|\273|\274|\275|\276|
\277))(\200|\201|\202|\203|\204|\205|\206|\207|\210
|\211|\212|\213|\214|\215|\216|\217|\220|\221
|\222|\223|\224|\225|\226|\227|\230|\231|\232
|\233|\234|\235|\236|\237|\240|\241|\242|\243
|\244|\245|\246|\247|\250|\251|\252|\253|\254
|\255|\256|\257|\260|\261|\262|\263|\264|\265
|\266|\267|\270|\271|\272|\273|\274|\275|\276|\277))

If you need the extra stuff (jamo, compatibility jamo, circled forms, parenthesized forms, and halfwidth forms), append this one to the one above

either

 [\341\204\200-\341\207\277
  \343\204\260-\343\206\217
  \352\245\240-\352\245\277
  \355\236\260-\355\237\277
  \343\200\256-\343\200\257
  \343\210\200-\343\210\236
  \343\211\240-\343\211\276
  \357\276\240-\357\276\276
  \357\277\202-\357\277\207
  \357\277\212-\357\277\217
  \357\277\222-\357\277\227
  \357\277\232-\357\277\234]  # gawk unicode-mode only

or

 ((\343\205|\355\237|\341(\204|\205|\206|\207))
(\200|\201|\202|\203|\204|\205|\206|\207|\210|\211
|\212|\213|\214|\215|\216|\217|\220|\221|\222|\223
|\224|\225|\226|\227|\230|\231|\232|\233|\234|\235
|\236|\237|\240|\241|\242|\243|\244|\245|\246|\247
|\250|\251|\252|\253|\254|\255|\256|\257|\260|\261
|\262|\263|\264|\265|\266|\267|\270|\271|\272|\273
|\274|\275|\276|\277)|(\343\204|\355\236)(\260|\261
|\262|\263|\264|\265|\266|\267|\270|\271
|\272|\273|\274|\275|\276|\277)|\343\206(\200|\201
|\202|\203|\204|\205|\206|\207|\210|\211|\212|\213
|\214|\215|\216|\217)|\352\245(\240|\241|\242|\243
|\244|\245|\246|\247|\250|\251|\252|\253|\254|\255
|\256|\257|\260|\261|\262|\263|\264|\265|\266|\267
|\270|\271|\272|\273|\274|\275|\276|\277)
|\343\200\256|\343\200\257|
\343\210(\200|\201|\202|\203|\204|\205|\206|\207
|\210|\211|\212|\213|\214|\215|\216|\217|\220|\221|
\222|\223|\224|\225|\226|\227|\230|\231|\232|\233|
\234|\235|\236)|(\343\211|\357\276)
(\240|\241|\242|\243|\244|\245|\246|\247|\250|\251
|\252|\253|\254|\255|\256|\257|\260|\261|\262|\263
|\264|\265|\266|\267|\270|\271|\272|\273|\274|\275
|\276)|\357\277(\202|\203|\204|\205|\206|\207|\212
|\213|\214|\215|\216|\217|\222|\223|\224|\225|\226
|\227|\232|\233|\234))

If you only need the modern jamo that make up the 11,172-syllable collection, then it’s a lot cleaner:

((\341)((\204)[\200-\222]|(\205)[\241-\265]|(\206)[\250-\277]|(\207)[\200-\202]))

or, if you prefer it without superfluous brackets:

(\341(\204[\200-\222]|\205[\241-\265]|\206[\250-\277]|\207[\200-\202]))
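For context on where the 11,172 figure comes from: there are 19 leading consonants, 21 vowels, and 28 trailing options (27 consonants plus "none"), and 19 × 21 × 28 = 11,172. The Python 3 sketch below uses the standard Unicode arithmetic to split a precomposed syllable into those modern jamo; the function name `decompose_syllable` is invented for this example.

```python
# Constants from the Unicode Hangul syllable composition arithmetic.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
S_COUNT = L_COUNT * V_COUNT * T_COUNT  # 11172


def decompose_syllable(syllable):
    """Split one precomposed Hangul syllable into its modern jamo."""
    index = ord(syllable) - S_BASE
    if not 0 <= index < S_COUNT:
        raise ValueError('not a precomposed Hangul syllable')
    lead = L_BASE + index // (V_COUNT * T_COUNT)
    vowel = V_BASE + (index % (V_COUNT * T_COUNT)) // T_COUNT
    tail_index = index % T_COUNT  # 0 means no trailing consonant
    jamo = chr(lead) + chr(vowel)
    if tail_index:
        jamo += chr(T_BASE + tail_index)
    return jamo


print([hex(ord(j)) for j in decompose_syllable('한')])
# ['0x1112', '0x1161', '0x11ab']
```

The resulting code points fall in U+1100-U+1112, U+1161-U+1175, and U+11A8-U+11C2, which are the ranges the octal pattern above encodes.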

PS: I only formatted it like this here for readability. There aren’t any spaces, tabs, or newlines between those octal codes. It’s one continuous string.

Personally I’d much rather work with clean modern-era regex, but using these octals is a necessary evil for me to bring mawk1.3.4 and mawk2-beta up to full UTF-8 compliance.

(at least in terms of lengthC() ordC() substrC() and character-level splitting but at the UC13 code-point level, plus hangul-only NFD-to-NFC.

but nothing fancy like grapheme clusters or bi-directional texts)

Answered By: RARE Kpop Manifesto