Why does an empty string in Python sometimes take up 49 bytes and sometimes 51?

Question:

I tested sys.getsizeof('') and sys.getsizeof(' ') in three environments, and in two of them sys.getsizeof('') gives me 51 bytes (one byte more than sys.getsizeof(' ')) instead of 49 bytes:

Screenshots:

Win8 + Spyder + CPython 3.6:

sys.getsizeof('') == 49 and sys.getsizeof(' ') == 50

Win8 + Spyder + IPython 3.6:

sys.getsizeof('') == 51 and sys.getsizeof(' ') == 50

Win10 (VPN remote) + PyCharm + CPython 3.7:

sys.getsizeof('') == 51 and sys.getsizeof(' ') == 50

First edit

I ran a second test in python.exe instead of Spyder and PyCharm (those two still show 51), and everything looks fine there. Apparently I don’t have the expertise to solve this problem so I’ll leave it to you guys 🙂

Win10 + Python 3.7 console versus PyCharm using same interpreter:

Win8 + IPython 3.6 + Spyder using same interpreter:

Asked By: Nicholas Humphrey

Answers:

https://docs.python.org/3.5/library/sys.html#sys.getsizeof

sys is system-specific, so results can easily differ. This is often overlooked. System-specific functionality in Python has been collected in the sys module for years. For example, sys.getwindowsversion() is not portable by definition, but it’s there. It’s like the bottomless pit of rejects in the perfect world of cross-platform coding. What you see is one of the interesting nuggets of Python.

From the getsizeof docs:

Only the memory consumption directly attributed to the object is accounted for, not the memory consumption of objects it refers to.
getsizeof() calls the object’s __sizeof__ method and adds an additional garbage collector overhead if the object is managed by the garbage collector.
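
The second sentence of that quote can be checked directly. This is a minimal sketch, assuming CPython; the helper name gc_overhead is made up for illustration. It subtracts an object's own __sizeof__() from sys.getsizeof() to see how much is added on top, and uses gc.is_tracked() to show whether the collector manages the object.

```python
import gc
import sys

def gc_overhead(obj):
    """Bytes that sys.getsizeof() reports on top of obj.__sizeof__()."""
    return sys.getsizeof(obj) - obj.__sizeof__()

# A str is not managed by the garbage collector, so nothing is added.
print("str :", gc_overhead(""), gc.is_tracked(""))

# A list is a container the collector manages; any extra bytes show up here.
print("list:", gc_overhead([]), gc.is_tracked([]))
```

On a standard CPython build the string line should show no overhead, so this particular term is not what separates the 49-byte and 51-byte readings for ''.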

When garbage collection is in use, getsizeof() adds those extra bytes. If you read the Python GC Q&A "When are objects garbage collected in python?", the folks there have gone into excruciating detail on the GC and how it affects memory and reference counts.

I hope that explains where this is coming from. If you don’t use system-level attributes but more Pythonic attributes, then you will get consistent sizes.

Answered By: Abhishek Dujari

This sounds like something is accessing the deprecated Py_UNICODE API.

As of CPython 3.7, the way the CPython Unicode representation works out, an empty string is normally stored in "compact ASCII" representation, and the base data and padding for a compact ASCII string on a 64-bit build works out to 48 bytes, plus one byte of string data (just the null terminator). You can see the relevant header file here.
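
The "fixed header plus one byte per character plus terminator" arithmetic can be checked from Python. This is a minimal sketch, assuming CPython and pure-ASCII input; the helper name ascii_size_breakdown is hypothetical. It does not hard-code 48 or 49, since the header size depends on the CPython version and build.

```python
import sys

def ascii_size_breakdown(text=""):
    """Split sys.getsizeof(text) into (fixed overhead, character data).

    Assumes `text` is pure ASCII, so CPython stores one byte per
    character followed by a one-byte null terminator.
    """
    total = sys.getsizeof(text)
    data = len(text) + 1  # characters plus the null terminator
    return total - data, data

header, data = ascii_size_breakdown("")
print("header bytes:", header, "data bytes:", data)

# Each extra ASCII character should cost exactly one more byte.
print(sys.getsizeof("abcd") - sys.getsizeof("ab"))
```

On the 64-bit CPython 3.7 builds described above the header figure would be 48; other versions may report a different header, but the per-character growth stays at one byte for ASCII text.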

For now (this is scheduled for removal in 3.12), there is also a deprecated Py_UNICODE API that stores an auxiliary wchar_t representation of the string. On a platform with 2-byte wchar_t, the wchar_t representation of an empty string is 2 bytes (just the null terminator again). The Py_UNICODE API caches this representation on the string object on first access, and str.__sizeof__ accounts for this extra data when it exists, resulting in a 51-byte total.
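
To see where the two extra bytes would come from on a given machine, the platform's wchar_t width can be read through ctypes. This is a rough sketch under the assumption stated above, namely that the cached copy of an ASCII string costs one wchar_t per character plus one for the terminator; the function name cached_wchar_bytes is invented for illustration.

```python
import ctypes

def cached_wchar_bytes(length=0):
    """Estimated size of a cached wchar_t copy of a `length`-character string.

    Assumption: the copy holds `length` characters plus a null terminator,
    each sizeof(wchar_t) bytes wide.
    """
    return (length + 1) * ctypes.sizeof(ctypes.c_wchar)

# Width of wchar_t on this platform (typically 2 on Windows, 4 on Linux/macOS).
print("sizeof(wchar_t):", ctypes.sizeof(ctypes.c_wchar))

# Extra bytes an empty string would carry if the cache were populated.
print("extra bytes for '':", cached_wchar_bytes(0))
```

With a 2-byte wchar_t, as on Windows, the second number is 2, which matches the step from 49 to 51 in the question.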

(If you need a wchar_t representation of a string, the non-deprecated way to get one is to use PyUnicode_AsWideChar or PyUnicode_AsWideCharString. These functions are not scheduled for removal, and do not attach any data to the string object.)
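
The claim that PyUnicode_AsWideCharString leaves the string object alone can be spot-checked from Python through ctypes.pythonapi. This is only a sketch, assuming CPython 3.7 or later; the wrapper name to_wide_copy is hypothetical, and real extension code would call the C function directly rather than through ctypes.

```python
import ctypes
import sys

# Assumption: CPython, where ctypes.pythonapi exposes the C API.
_as_wide = ctypes.pythonapi.PyUnicode_AsWideCharString
_as_wide.argtypes = [ctypes.py_object, ctypes.POINTER(ctypes.c_ssize_t)]
_as_wide.restype = ctypes.c_void_p  # raw pointer to a PyMem-allocated buffer

_mem_free = ctypes.pythonapi.PyMem_Free
_mem_free.argtypes = [ctypes.c_void_p]
_mem_free.restype = None

def to_wide_copy(text):
    """Round-trip `text` through a freshly allocated wchar_t buffer."""
    length = ctypes.c_ssize_t()
    ptr = _as_wide(text, ctypes.byref(length))
    try:
        # Read `length` wide characters back into a new Python str.
        return ctypes.wstring_at(ptr, length.value)
    finally:
        # The caller owns the buffer and must release it with PyMem_Free.
        _mem_free(ptr)

s = "hello"
before = sys.getsizeof(s)
copy = to_wide_copy(s)
after = sys.getsizeof(s)
print(copy, before, after)
```

If the two sizes printed are equal, no auxiliary representation was attached to s by the call.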

Answered By: user2357112