How are small sets stored in memory?

Question:

If we look at the resize behavior for sets under 50k elements:

>>> import sys
>>> s = set()
>>> seen = {}
>>> for i in range(50_000):
...     size = sys.getsizeof(s)
...     if size not in seen:
...         seen[size] = len(s)
...         print(f"{size=} {len(s)=}")
...     s.add(i)
... 
size=216 len(s)=0
size=728 len(s)=5
size=2264 len(s)=19
size=8408 len(s)=77
size=32984 len(s)=307
size=131288 len(s)=1229
size=524504 len(s)=4915
size=2097368 len(s)=19661

This pattern is consistent with quadrupling of the backing storage size once the set is 3/5ths full, plus some presumably constant overhead for the PySetObject:

>>> for i in range(9, 22, 2):
...     print(2**i + 216)
... 
728
2264
8408
32984
131288
524504
2097368

A similar pattern continues even for larger sets, but the resize factor switches to doubling instead of quadrupling.
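
The pattern described above can be reproduced with a small pure-Python model. This is only a sketch under assumptions: the trigger `fill * 5 >= mask * 3` and the "smallest power of two greater than used * factor" sizing are my reading of CPython's setobject.c, the 16-byte entry size and 216-byte base are the 64-bit figures from this question, and `predicted_sizes` is a hypothetical helper, not a CPython API. It only models additions (no deletions, so fill equals used).

```python
ENTRY_SIZE = 16   # assumed sizeof(setentry) on a 64-bit build
BASE_SIZE = 216   # size reported for an unresized set in the question
MIN_SLOTS = 8     # PySet_MINSIZE


def predicted_sizes(n_items):
    """Return (reported_size, length) pairs, one per predicted resize."""
    slots = MIN_SLOTS
    used = 0
    results = [(BASE_SIZE, 0)]
    for _ in range(n_items):
        used += 1
        mask = slots - 1
        # Assumed trigger: the table is at least 3/5 full after the insert.
        if used * 5 >= mask * 3:
            # Quadruple for small sets, double once past 50,000 items.
            factor = 2 if used > 50_000 else 4
            new_slots = MIN_SLOTS
            while new_slots <= used * factor:
                new_slots *= 2
            slots = new_slots
            results.append((BASE_SIZE + slots * ENTRY_SIZE, used))
    return results


for size, length in predicted_sizes(50_000):
    print(f"size={size} len(s)={length}")
```

Under those assumptions the model prints the same eight size/length pairs as the session above, starting at 216 bytes for the empty set.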

The reported size for small sets is an outlier. Instead of 344 bytes, i.e. 16 * 8 + 216 (the storage array of a newly created empty set has 8 slots available until the first resize up to 32 slots), only 216 bytes are reported by sys.getsizeof.

What am I missing? How are those small sets stored so that they use only 216 bytes instead of 344?

Asked By: wim

||

Answers:

The set object in Python is represented by the following C structure.

typedef struct {
    PyObject_HEAD

    Py_ssize_t fill;            /* Number of active and dummy entries*/
    Py_ssize_t used;            /* Number of active entries */

    /* The table contains mask + 1 slots, and that's a power of 2.
     * We store the mask instead of the size because the mask is more
     * frequently needed.
     */
    Py_ssize_t mask;

    /* The table points to a fixed-size smalltable for small tables
     * or to additional malloc'ed memory for bigger tables.
     * The table pointer is never NULL which saves us from repeated
     * runtime null-tests.
     */
    setentry *table;
    Py_hash_t hash;             /* Only used by frozenset objects */
    Py_ssize_t finger;          /* Search finger for pop() */

    setentry smalltable[PySet_MINSIZE];
    PyObject *weakreflist;      /* List of weak references */
} PySetObject;
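
To get a feel for the layout without rebuilding CPython, the struct can be mirrored with ctypes. This is a rough sketch: `SetEntry` and `PySetObjectMirror` are names I made up, and the mirror assumes a standard (non-debug, non-free-threaded) CPython build where PyObject_HEAD is just a reference count plus a type pointer.

```python
import ctypes

PySet_MINSIZE = 8


class SetEntry(ctypes.Structure):
    _fields_ = [
        ("key", ctypes.c_void_p),    # PyObject *key
        ("hash", ctypes.c_ssize_t),  # Py_hash_t hash
    ]


class PySetObjectMirror(ctypes.Structure):
    _fields_ = [
        ("ob_refcnt", ctypes.c_ssize_t),  # PyObject_HEAD (assumed layout)
        ("ob_type", ctypes.c_void_p),     # PyObject_HEAD (assumed layout)
        ("fill", ctypes.c_ssize_t),
        ("used", ctypes.c_ssize_t),
        ("mask", ctypes.c_ssize_t),
        ("table", ctypes.c_void_p),       # setentry *table
        ("hash", ctypes.c_ssize_t),
        ("finger", ctypes.c_ssize_t),
        ("smalltable", SetEntry * PySet_MINSIZE),
        ("weakreflist", ctypes.c_void_p),
    ]


# Print where each field sits and how many bytes it occupies.
for name, _ctype in PySetObjectMirror._fields_:
    field = getattr(PySetObjectMirror, name)
    print(f"{name:<12} offset={field.offset:>3} size={field.size:>3}")

print("mirror size:      ", ctypes.sizeof(PySetObjectMirror))
print("set.__basicsize__:", set.__basicsize__)
```

If the assumed layout matches the running interpreter, the two totals agree, and the embedded smalltable accounts for most of the struct.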

Now remember, getsizeof() calls the object’s __sizeof__ method and adds an additional garbage collector overhead if the object is managed by the garbage collector.
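
The relationship between the two numbers can be checked from Python. A minimal sketch, assuming CPython (`gc_overhead` is a hypothetical helper): the difference between sys.getsizeof and __sizeof__ is whatever the interpreter adds for GC-tracked objects, and it should be zero for an object such as an int that the garbage collector does not track.

```python
import sys


def gc_overhead(obj):
    """Bytes that sys.getsizeof adds on top of obj.__sizeof__()."""
    return sys.getsizeof(obj) - obj.__sizeof__()


small = set()
large = set(range(1000))

print("overhead, empty set:  ", gc_overhead(small))
print("overhead, resized set:", gc_overhead(large))
print("overhead, int:        ", gc_overhead(42))
```

The overhead does not depend on how many elements the set holds, so it cannot explain the small-set outlier by itself.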

OK, set implements __sizeof__ as follows.

static PyObject *
set_sizeof(PySetObject *so, PyObject *Py_UNUSED(ignored))
{
    Py_ssize_t res;

    res = _PyObject_SIZE(Py_TYPE(so));
    if (so->table != so->smalltable)
        res = res + (so->mask + 1) * sizeof(setentry);
    return PyLong_FromSsize_t(res);
}
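
The logic of set_sizeof is short enough to model in plain Python. This is only an illustration: `modeled_sizeof` is a hypothetical function, and the constants (200-byte struct, 16-byte entries, 16 bytes of GC overhead) are the 64-bit figures reported in this thread, not values read from the running interpreter.

```python
def modeled_sizeof(basicsize, mask, uses_smalltable, entry_size=16):
    """Mirror of set_sizeof: count the table only when it was malloc'ed."""
    res = basicsize
    if not uses_smalltable:
        res += (mask + 1) * entry_size
    return res


GC_OVERHEAD = 16  # assumed, from the figures in this thread

# Unresized set: mask is 7 (8 slots), table points at smalltable.
print(modeled_sizeof(200, 7, True) + GC_OVERHEAD)    # 216

# After the first resize: mask is 31 (32 slots), table is malloc'ed.
print(modeled_sizeof(200, 31, False) + GC_OVERHEAD)  # 728
```

Note that the 8 embedded slots are never added separately; they are already part of basicsize.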

Now let’s inspect the line

res = _PyObject_SIZE(Py_TYPE(so));

_PyObject_SIZE is just a macro which expands to (typeobj)->tp_basicsize.

#define _PyObject_SIZE(typeobj) ( (typeobj)->tp_basicsize )

This code reads the tp_basicsize slot, which holds the size in bytes of instances of the type. For set, that is just sizeof(PySetObject).

PyTypeObject PySet_Type = {
    PyVarObject_HEAD_INIT(&PyType_Type, 0)
    "set",                              /* tp_name */
    sizeof(PySetObject),                /* tp_basicsize */
    0,                                  /* tp_itemsize */
    /* Rest of the initializer skipped for brevity. */
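
The tp_basicsize slot is visible from Python as the type's __basicsize__ attribute, so this part can be checked without reading C. A small sketch, assuming CPython:

```python
# tp_basicsize and tp_itemsize are exposed on the type object.
print("set.__basicsize__:      ", set.__basicsize__)
print("set.__itemsize__:       ", set.__itemsize__)
print("frozenset.__basicsize__:", frozenset.__basicsize__)

# For a set that has not been resized, __sizeof__ returns exactly
# the basic size, because the table still lives inside the struct.
print("set().__sizeof__():     ", set().__sizeof__())
```

frozenset reports the same basic size because both types are backed by the same PySetObject struct.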

I modified the set_sizeof C function to print the size of each field:

static PyObject *
set_sizeof(PySetObject *so, PyObject *Py_UNUSED(ignored))
{
    Py_ssize_t res;

    size_t py_object_head_size = sizeof(so->ob_base); // Because PyObject_HEAD expands to PyObject ob_base;
    size_t fill_size = sizeof(so->fill);
    size_t used_size = sizeof(so->used);
    size_t mask_size = sizeof(so->mask);
    size_t table_size = sizeof(so->table);
    size_t hash_size = sizeof(so->hash);
    size_t finger_size = sizeof(so->finger);
    size_t smalltable_size = sizeof(so->smalltable);
    size_t weakreflist_size = sizeof(so->weakreflist);
    int is_using_fixed_size_smalltables = so->table == so->smalltable;

    printf("| PySetObject Fields   | Size(bytes) |\n");
    printf("|------------------------------------|\n");
    printf("|    PyObject_HEAD     |     '%zu'    |\n", py_object_head_size);
    printf("|      fill            |      '%zu'    |\n", fill_size);
    printf("|      used            |      '%zu'    |\n", used_size);
    printf("|      mask            |      '%zu'    |\n", mask_size);
    printf("|      table           |      '%zu'    |\n", table_size);
    printf("|      hash            |      '%zu'    |\n", hash_size);
    printf("|      finger          |      '%zu'    |\n", finger_size);
    printf("|    smalltable        |    '%zu'    |\n", smalltable_size);
    printf("|    weakreflist       |      '%zu'    |\n", weakreflist_size);
    printf("-------------------------------------|\n");
    printf("|       Total          |    '%zu'    |\n", py_object_head_size+fill_size+used_size+mask_size+table_size+hash_size+finger_size+smalltable_size+weakreflist_size);
    printf("\n");
    printf("Total size of PySetObject '%zu' bytes\n", sizeof(PySetObject));
    printf("Has set resized: '%s'\n", is_using_fixed_size_smalltables ? "No": "Yes");
    if(!is_using_fixed_size_smalltables) {
        printf("Size of malloc'ed table: '%zu' bytes\n", (so->mask + 1) * sizeof(setentry));
    }

    res = _PyObject_SIZE(Py_TYPE(so));
    if (so->table != so->smalltable)
        res = res + (so->mask + 1) * sizeof(setentry);
    return PyLong_FromSsize_t(res);
}

Compiling and running with these changes gives:

>>> import sys
>>>
>>> set_ = set()
>>> sys.getsizeof(set_)
| PySetObject Fields   | Size(bytes) |
|------------------------------------|
|    PyObject_HEAD     |     '16'    |
|      fill            |      '8'    |
|      used            |      '8'    |
|      mask            |      '8'    |
|      table           |      '8'    |
|      hash            |      '8'    |
|      finger          |      '8'    |
|    smalltable        |    '128'    |
|    weakreflist       |      '8'    |
-------------------------------------|
|       Total          |    '200'    |

Total size of PySetObject '200' bytes
Has set resized: 'No'
216
>>> set_.add(1)
>>> set_.add(2)
>>> set_.add(3)
>>> set_.add(4)
>>> set_.add(5)
>>> sys.getsizeof(set_)
| PySetObject Fields   | Size(bytes) |
|------------------------------------|
|    PyObject_HEAD     |     '16'    |
|      fill            |      '8'    |
|      used            |      '8'    |
|      mask            |      '8'    |
|      table           |      '8'    |
|      hash            |      '8'    |
|      finger          |      '8'    |
|    smalltable        |    '128'    |
|    weakreflist       |      '8'    |
-------------------------------------|
|       Total          |    '200'    |

Total size of PySetObject '200' bytes
Has set resized: 'Yes'
Size of malloc'ed table: '512' bytes
728

The return values are 216 and 728 bytes because sys.getsizeof adds 16 bytes of GC overhead: 200 + 16 = 216 before the resize, and 200 + 512 + 16 = 728 after it.

But the important thing to note here is this line.

|    smalltable        |    '128'    |

For small tables (before the first resize), so->table just points to the fixed-size so->smalltable, which has 8 slots and needs no malloc’ed memory. sizeof(PySetObject) is therefore sufficient to get the full size, because it already includes the storage: 8 slots * 16 bytes (the size of a setentry) = 128 bytes.
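
This is easy to confirm on an unmodified interpreter. The sketch below assumes CPython with the resize behaviour seen in this thread (first resize on the fifth insertion, growing to 32 slots); the entry size is computed as one pointer plus one Py_ssize_t, which is my assumption about how setentry is laid out.

```python
import struct
import sys

# Assumed layout of setentry: a PyObject pointer plus a Py_hash_t.
ENTRY_SIZE = struct.calcsize("Pn")

s = set()
base = sys.getsizeof(s)

# Record the reported size after each of the first five insertions.
sizes = []
for i in range(5):
    s.add(i)
    sizes.append(sys.getsizeof(s))

print("base size:             ", base)
print("sizes after each add:  ", sizes)
print("growth at first resize:", sizes[-1] - base)
print("32 slots * entry size: ", 32 * ENTRY_SIZE)
```

The first four insertions leave the reported size unchanged because the entries go into the embedded smalltable. The fifth adds the full malloc’ed table on top of a base size that still includes those 8 embedded slots.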

Now what happens when a resize occurs? It constructs an entirely new (malloc’ed) table and uses that table instead of so->smalltable. This means that sets which have been resized also carry a dead weight of 128 bytes (the size of the fixed-size small table) in addition to the malloc’ed so->table.

    else {
        newtable = PyMem_NEW(setentry, newsize);
        if (newtable == NULL) {
            PyErr_NoMemory();
            return -1;
        }
    }

    /* Make the set empty, using the new table. */
    assert(newtable != oldtable);
    memset(newtable, 0, sizeof(setentry) * newsize);
    so->mask = newsize - 1;
    so->table = newtable;

Answered By: Abdul Niyas P M