What is the max length of a Python string?

Question:

If it is environment-independent, what is the theoretical maximum number of characters in a Python string?

Asked By: blippy

Answers:

With a 64-bit Python installation, and (say) 64 GB of memory, a Python string of around 63 GB should be quite feasible, if not maximally fast. If you can upgrade your memory beyond 64 GB, your maximum feasible strings should get proportionally longer. (I don’t recommend relying on virtual memory to extend that by much, or your runtimes will get simply ridiculous;-).

With a typical 32-bit Python installation, the total memory you can use in your application is limited to something like 2 or 3 GB (depending on OS and configuration), so the longest strings you can use will be much smaller than in 64-bit installations with high amounts of RAM.

Answered By: Alex Martelli

I ran this code on an x2iedn.16xlarge EC2 instance, which has 2048 GiB (2.2 TB) of RAM:

>>> one_gigabyte = 1_000_000_000
>>> my_str = 'A' * (2000 * one_gigabyte)

It took a couple minutes but I was able to allocate a 2TB string on Python 3.10 running on Ubuntu 22.04.

>>> import sys
>>> sys.getsizeof(my_str)
2000000000049
>>> my_str
'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA...

The last line actually hangs, but it would print 2 trillion As.
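A scaled-down version of the same experiment runs on any machine, and also shows that the fixed metadata overhead visible in sys.getsizeof above (the extra 49 bytes) doesn't grow with the string. This is just a sketch; the exact overhead differs slightly between CPython versions and builds:

```python
import sys

# Allocate a 100 MB ASCII string -- small enough for any modern machine.
n = 100_000_000
s = 'A' * n

print(len(s))                     # 100000000
print(sys.getsizeof(s) - len(s))  # constant metadata overhead, ~49 bytes here
```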

9 quintillion characters on a 64 bit system on CPython 3.10.

That’s only if your string is made up of only ASCII characters. The max length can be smaller depending on what characters the string contains due to the way CPython implements strings:

  • 9,223,372,036,854,775,758 characters if your string only has ASCII characters (U+00 to U+7F) or
  • 9,223,372,036,854,775,734 characters if your string only has ASCII characters and characters from the Latin-1 Supplement Unicode block (U+80 to U+FF) or
  • 4,611,686,018,427,387,866 characters if your string only contains characters in the Basic Multilingual Plane (for example if it contains Cyrillic letters but no emojis, i.e. U+0100 to U+FFFF) or
  • 2,305,843,009,213,693,932 characters if your string might contain at least one emoji (more formally, if it can contain a character outside the Basic Multilingual Plane, i.e. U+10000 and above)

On a 32 bit system the limit is around 2 billion characters (1 byte each) or 500 million characters (4 bytes each). If you don’t know whether you’re using a 64 bit or a 32 bit system or what that means, you’re probably using a 64 bit system.
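One quick way to check, sketched with the standard struct module ('P' is the platform's pointer type, so its size tells you the build's bitness):

```python
import struct
import sys

bits = struct.calcsize('P') * 8  # pointer size in bits: 64 or 32
print(f"{bits}-bit build")

# sys.maxsize is the largest length any container (including str) can have:
# 2**63 - 1 on 64 bit builds, 2**31 - 1 on 32 bit builds.
print(sys.maxsize)
```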


Python strings are length-prefixed, so their length is limited by the size of the integer holding their length and the amount of memory available on your system. Since PEP 353, Python uses Py_ssize_t as the data type for storing container length. Py_ssize_t is defined as the same size as the compiler’s size_t but signed. On a 64 bit system, size_t is 64 bits wide. 1 bit for the sign means you have 63 bits for the actual quantity, meaning CPython strings cannot be larger than 2⁶³ – 1 bytes, around 9.2 exabytes (8 EiB). This much RAM would cost you around 19 billion dollars if we multiply today’s (November 2022) price of around $2/GB by 9.2 billion GB. On 32-bit systems (which are rare these days), the limit is 2³¹ – 1 bytes, or 2 GiB.

CPython will use 1, 2 or 4 bytes per character, depending on how many bytes it needs to encode the "longest" character in your string. So for example if you have a string like 'aaaaaaaaa', each 'a' takes 1 byte to store, but if you append a single character from outside the Basic Multilingual Plane, such as an emoji ('aaaaaaaaa😀'), every character in the string will now take 4 bytes. 1-byte-per-character strings will also use either 48 or 72 bytes of metadata, and 2- or 4-bytes-per-character strings will take 72 bytes for metadata. Each string also has an extra character at the end for a terminating null, so the empty string is actually 49 bytes.
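You can observe the 1, 2 and 4 byte representations with sys.getsizeof by measuring what one extra character of each kind costs — a sketch that should hold on any CPython 3.3+ build (PEP 393), even though the absolute metadata sizes vary between versions:

```python
import sys

def char_cost(ch):
    # Bytes that each additional character of this kind costs.
    return sys.getsizeof(ch * 10) - sys.getsizeof(ch * 9)

print(char_cost('a'))           # 1: ASCII
print(char_cost('\xff'))        # 1: Latin-1 Supplement
print(char_cost('\u0100'))      # 2: Basic Multilingual Plane
print(char_cost('\U0001F600'))  # 4: outside the BMP (an emoji)
```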

When you allocate a string with PyUnicode_New(Py_ssize_t size, Py_UCS4 maxchar) (see docs) in CPython, it performs this check:

    /* Ensure we won't overflow the size. */
    // [...]
    if (size > ((PY_SSIZE_T_MAX - struct_size) / char_size - 1))
        return PyErr_NoMemory();

Where PY_SSIZE_T_MAX is

/* Largest positive value of type Py_ssize_t. */
#define PY_SSIZE_T_MAX ((Py_ssize_t)(((size_t)-1)>>1))

which casts -1 into a size_t (a type defined by the C compiler; a 64 bit unsigned integer on a 64 bit system), causing it to wrap around to its largest possible value, 2⁶⁴ – 1, then right-shifts it by 1 (so that the sign bit is 0), which makes it 2⁶³ – 1, and casts that into a Py_ssize_t.
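The same bit trick can be replayed in Python arithmetic; the resulting constant is what CPython exposes at runtime as sys.maxsize (a sketch assuming a 64 bit build):

```python
import sys

# ((size_t)-1) on a 64 bit system: -1 wraps around to 2**64 - 1.
size_t_max = (-1) % 2**64        # 18446744073709551615
# >> 1 clears the sign bit, leaving the largest positive Py_ssize_t.
py_ssize_t_max = size_t_max >> 1

print(py_ssize_t_max)                 # 9223372036854775807, i.e. 2**63 - 1
print(py_ssize_t_max == sys.maxsize)  # True on a 64 bit build
```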

struct_size is just the overhead of the str object’s metadata, either 48 or 72 bytes; it’s set earlier in the function:

    struct_size = sizeof(PyCompactUnicodeObject);
    if (maxchar < 128) {
        // [...]
        struct_size = sizeof(PyASCIIObject);
    }

and char_size is either 1, 2 or 4 and so we have

>>> ((2**63 - 1) - 72) // 4 - 1
2305843009213693932
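Plugging the other struct_size and char_size combinations into the same formula reproduces all four limits from the list at the top of this answer (48 and 72 are the CPython 3.10 metadata sizes discussed above):

```python
PY_SSIZE_T_MAX = 2**63 - 1  # sys.maxsize on a 64 bit system

def max_str_length(struct_size, char_size):
    # Mirrors the overflow check in PyUnicode_New: any size above this
    # raises MemoryError.
    return (PY_SSIZE_T_MAX - struct_size) // char_size - 1

print(max_str_length(48, 1))  # 9223372036854775758 (ASCII only)
print(max_str_length(72, 1))  # 9223372036854775734 (Latin-1 Supplement)
print(max_str_length(72, 2))  # 4611686018427387866 (BMP)
print(max_str_length(72, 4))  # 2305843009213693932 (outside the BMP)
```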

There’s of course the possibility that Python strings are practically limited by some other part of Python that I don’t know about, but you should be able to at least allocate a new string of that size, assuming you can get your hands on 9 exabytes of RAM.

Answered By: Boris Verkhovskiy