Generate random UTF-8 string in Python

Question:

I’d like to test the Unicode handling of my code. Is there anything I can put in random.choice() to select from the entire Unicode range, preferably not an external module? Neither Google nor StackOverflow seems to have an answer.

Edit: It looks like this is more complex than expected, so I’ll rephrase the question – Is the following code sufficient to generate all valid non-control characters in Unicode?

import unicodedata

unicode_glyphs = ''.join(
    unichr(char)
    for char in xrange(1114112) # 0x10ffff + 1
    if unicodedata.category(unichr(char))[0] in 'LMNPSZ'
    )
Asked By: l0b0


Answers:

Since Unicode is just a range of, well, codes, what about using unichr() to get the Unicode string corresponding to a random number between 0 and 0xFFFF?
(Of course, that gives just one code point, so iterate as required.)
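
In Python 3, where chr() replaces unichr(), this idea might look like the sketch below. The name random_bmp_string is made up for illustration, and skipping the surrogate range U+D800..U+DFFF is an added assumption: those code points cannot be encoded as UTF-8, so the sketch draws again when it hits one.

```python
import random

def random_bmp_string(length):
    """Return `length` random code points from U+0000..U+FFFF, surrogates excluded."""
    chars = []
    while len(chars) < length:
        code_point = random.randint(0x0000, 0xFFFF)
        # Surrogates (U+D800..U+DFFF) are not valid on their own; draw again.
        if 0xD800 <= code_point <= 0xDFFF:
            continue
        chars.append(chr(code_point))
    return ''.join(chars)

sample = random_bmp_string(20)
encoded = sample.encode('utf-8')  # succeeds because no surrogates are present
```

The result can still contain control characters and unassigned code points, so it is only a starting point.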

Answered By: Joril

You could download a website written in Greek or German that uses Unicode and feed that to your code.

Answered By: Esteban Küber
Answered By: Gumbo

It depends on how thoroughly you want to do the testing and how accurately you want to do the generation. In full, Unicode is a 21-bit code set (U+0000 .. U+10FFFF). However, some quite large chunks of that range are set aside for private use, and many code points are not assigned at all. Do you want to worry about generating combining characters at the start of a string (because they should only appear after another character)?

The basic approach I’d adopt is randomly generate a Unicode code point (say U+2397 or U+31232), validate it in context (is it a legitimate character; can it appear here in the string) and encode valid code points in UTF-8.
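
A minimal sketch of that generate, validate, encode loop in Python 3 follows. The validation rules are assumptions chosen for illustration, not a complete notion of "legitimate": the sketch rejects control, unassigned, private-use and surrogate code points by their unicodedata category, and refuses a combining mark as the first character. The names is_acceptable and random_valid_utf8 are hypothetical.

```python
import random
import unicodedata

# Categories treated as "not a legitimate character" in this sketch (an assumption):
# Cc = control, Cn = unassigned, Co = private use, Cs = surrogate.
REJECTED_CATEGORIES = {'Cc', 'Cn', 'Co', 'Cs'}

def is_acceptable(char, at_start):
    """Validate one candidate character in its context."""
    category = unicodedata.category(char)
    if category in REJECTED_CATEGORIES:
        return False
    # Combining marks (Mn, Mc, Me) need a preceding base character.
    if at_start and category.startswith('M'):
        return False
    return True

def random_valid_utf8(length):
    """Return UTF-8 bytes encoding `length` randomly chosen acceptable characters."""
    chars = []
    while len(chars) < length:
        candidate = chr(random.randint(0, 0x10FFFF))
        if is_acceptable(candidate, at_start=not chars):
            chars.append(candidate)
    return ''.join(chars).encode('utf-8')

data = random_valid_utf8(8)
```

Which code points count as assigned depends on the Unicode version bundled with the interpreter, so results can differ between Python releases.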

If you just want to check whether your code handles malformed UTF-8 correctly, you can use much simpler generation schemes.
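
One such simpler scheme, sketched here as an assumption about what is meant, is to feed the code uniformly random bytes, most of which will not be well-formed UTF-8. The helper names random_bytes and is_well_formed_utf8 are invented for this example.

```python
import random

def random_bytes(length):
    """Return `length` uniformly random bytes; usually not well-formed UTF-8."""
    return bytes(random.randrange(256) for _ in range(length))

def is_well_formed_utf8(data):
    """Report whether Python's strict UTF-8 decoder accepts `data`."""
    try:
        data.decode('utf-8')
    except UnicodeDecodeError:
        return False
    return True

blob = random_bytes(64)
# errors='replace' substitutes U+FFFD for malformed input instead of raising.
lenient = blob.decode('utf-8', errors='replace')
```

Hand-picked cases such as an overlong encoding (b'\xc0\x80') or an encoded surrogate (b'\xed\xa0\x80') are worth adding alongside the random input, since the expected outcome for them is known in advance.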

Note that you need to know what to expect given the input – otherwise you are not testing; you are experimenting.

Answered By: Jonathan Leffler

Here is an example function that probably creates a random well-formed UTF-8 sequence, as defined in Table 3–7 of Unicode 5.0.0:

#!/usr/bin/env python3

# From Table 3–7 of the Unicode Standard 5.0.0

import random

def byte_range(first, last):
    return list(range(first, last+1))

first_values = byte_range(0x00, 0x7F) + byte_range(0xC2, 0xF4)
trailing_values = byte_range(0x80, 0xBF)

def random_utf8_seq():
    first = random.choice(first_values)
    if first <= 0x7F:
        return bytes([first])
    elif first <= 0xDF:
        return bytes([first, random.choice(trailing_values)])
    elif first == 0xE0:
        return bytes([first, random.choice(byte_range(0xA0, 0xBF)), random.choice(trailing_values)])
    elif first == 0xED:
        return bytes([first, random.choice(byte_range(0x80, 0x9F)), random.choice(trailing_values)])
    elif first <= 0xEF:
        return bytes([first, random.choice(trailing_values), random.choice(trailing_values)])
    elif first == 0xF0:
        return bytes([first, random.choice(byte_range(0x90, 0xBF)), random.choice(trailing_values), random.choice(trailing_values)])
    elif first <= 0xF3:
        return bytes([first, random.choice(trailing_values), random.choice(trailing_values), random.choice(trailing_values)])
    elif first == 0xF4:
        return bytes([first, random.choice(byte_range(0x80, 0x8F)), random.choice(trailing_values), random.choice(trailing_values)])

print("".join(str(random_utf8_seq(), "utf8") for i in range(10)))

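Because the Unicode standard is so vast, I cannot test this thoroughly. Also note that the resulting characters are not uniformly distributed, although each byte is chosen uniformly from its allowed range: half of the possible first bytes are ASCII, so about half of the generated characters are too.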
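
If a uniform distribution over code points matters, one alternative (a different technique from the byte-level generator above, sketched under that assumption) is to pick a random Unicode scalar value and let Python do the encoding. The names random_scalar_value and random_utf8_uniform are hypothetical.

```python
import random

SURROGATE_COUNT = 0xE000 - 0xD800  # 2048 code points, U+D800..U+DFFF

def random_scalar_value():
    """Pick uniformly among all code points except the surrogates."""
    index = random.randrange(0x110000 - SURROGATE_COUNT)
    # Indexes at or past the surrogate block are shifted over the gap.
    return index if index < 0xD800 else index + SURROGATE_COUNT

def random_utf8_uniform():
    """Return the UTF-8 encoding of one uniformly chosen scalar value."""
    return chr(random_scalar_value()).encode('utf-8')

sequences = [random_utf8_uniform() for _ in range(10)]
```

Like the byte-level version, this still produces unassigned and private-use code points; it only changes how the choices are weighted.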

Answered By: Philipp

Answering revised question:

Yes, on a strict definition of “control characters” — note that you won’t include CR, LF, and TAB; is that what you want?
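
A quick check in Python 3 shows how the category filter from the question treats these characters. The names KEPT_MAJOR_CLASSES, samples, categories and kept are introduced only for this illustration.

```python
import unicodedata

KEPT_MAJOR_CLASSES = 'LMNPSZ'  # same filter as in the question

samples = {'TAB': '\t', 'LF': '\n', 'CR': '\r', 'SPACE': ' '}
categories = {name: unicodedata.category(char) for name, char in samples.items()}
kept = {name: cat[0] in KEPT_MAJOR_CLASSES for name, cat in categories.items()}

for name in samples:
    print(name, categories[name], 'kept' if kept[name] else 'excluded')
```

TAB, LF and CR are all in category Cc and are therefore dropped, while the ordinary space is in category Zs and stays.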

Please consider responding to my earlier invitation to tell us what you are really trying to do.

Answered By: John Machin

People may find their way here based mainly on the question title, so here’s a way to generate a random string containing a variety of Unicode characters. To include more (or fewer) possible characters, just extend that part of the example with the code point ranges that you want.

import random

def get_random_unicode(length):

    try:
        get_char = unichr
    except NameError:
        get_char = chr

    # Update this to include code point ranges to be sampled
    include_ranges = [
        ( 0x0021, 0x0021 ),
        ( 0x0023, 0x0026 ),
        ( 0x0028, 0x007E ),
        ( 0x00A1, 0x00AC ),
        ( 0x00AE, 0x00FF ),
        ( 0x0100, 0x017F ),
        ( 0x0180, 0x024F ),
        ( 0x2C60, 0x2C7F ),
        ( 0x16A0, 0x16F0 ),
        ( 0x0370, 0x0377 ),
        ( 0x037A, 0x037E ),
        ( 0x0384, 0x038A ),
        ( 0x038C, 0x038C ),
    ]

    alphabet = [
        get_char(code_point) for current_range in include_ranges
            for code_point in range(current_range[0], current_range[1] + 1)
    ]
    return ''.join(random.choice(alphabet) for i in range(length))

if __name__ == '__main__':
    print('A random string: ' + get_random_unicode(10))
Answered By: Jacob Wan

The following code prints every printable Unicode character:

print(''.join(chr(i) for i in range(32, 0x110000) if chr(i).isprintable()))

This includes every printable character, even those that the current font cannot render. The clause and not chr(i).isspace() can be added to filter out whitespace; since isprintable() already rejects every separator except the ASCII space, that clause only removes the space itself.
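
To get back to the question in the title, the same filter can feed a random string generator. This is a minimal sketch: the names printable and random_printable are made up here, and building the list scans the whole code point range once, which takes a moment.

```python
import random

# Every code point that str.isprintable() accepts, computed once up front.
printable = [chr(i) for i in range(32, 0x110000) if chr(i).isprintable()]

def random_printable(length):
    """Return a string of `length` characters drawn uniformly from `printable`."""
    return ''.join(random.choices(printable, k=length))

text = random_printable(12)
```

The exact contents of the list depend on the Unicode version shipped with the interpreter.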

Answered By: aluriak