Python "set" with duplicate/repeated elements

Question:

Is there a standard way to represent a “set” that can contain duplicate elements?

As I understand it, a set has exactly one or zero of an element. I want functionality to have any number.

I am currently using a dictionary with elements as keys, and quantity as values, but this seems wrong for many reasons.

Motivation:
I believe there are many applications for such a collection. For example, a survey of favourite colours could be represented by:
survey = ['blue', 'red', 'blue', 'green']

Here, I do not care about the order, but I do about quantities. I want to do things like:

survey.add('blue')
# would give survey == ['blue', 'red', 'blue', 'green', 'blue']

…and maybe even

survey.remove('blue')
# would give survey == ['blue', 'red', 'green']

Notes:
Yes, set is not the correct term for this kind of collection. Is there a more correct one?

A list of course would work, but the collection required is unordered. Not to mention that the method naming for sets seems to me to be more appropriate.

Asked By: cammil

Answers:

Your approach of using a dict that maps each element to its count seems fine to me. If you need more functionality, have a look at collections.Counter:

  • O(1) test whether an element is present and current count retrieval (faster than with element in list and list.count(element))
  • counter.elements() iterates over every element, repeated as many times as its count
  • easy union/difference operations with other Counters
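
To make the points above concrete, here is a minimal sketch using the survey from the question. The variable names are illustrative only, and deleting a key once its count reaches zero is one possible convention for "remove", not something Counter does for you.

```python
from collections import Counter

# Build the survey from the question; order is irrelevant to a Counter.
survey = Counter(['blue', 'red', 'blue', 'green'])

# "add": increment the count of one element.
survey['blue'] += 1

# Membership test and count lookup are both dict operations.
has_blue = 'blue' in survey      # True
blue_count = survey['blue']      # 3
missing_count = survey['pink']   # 0, and 'pink' is not inserted

# "remove": decrement, then drop the key once the count reaches zero
# so that membership tests stay accurate.
survey['blue'] -= 1
survey['red'] -= 1
if survey['red'] <= 0:
    del survey['red']

# elements() yields each element repeated by its count.
flat = sorted(survey.elements())  # ['blue', 'blue', 'green']
```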
Answered By: eumiro

If you need duplicates, use a list, and convert it to a set when you need to operate on it as a set.

Answered By: Antonio Beamud

You can use a plain list and call list.count(element) whenever you want the number of occurrences of an element.

my_list = [1, 1, 2, 3, 3, 3]

my_list.count(1) # will return 2
Answered By: cfedermann

You are looking for a multiset.

Python’s closest datatype is collections.Counter:

A Counter is a dict subclass for counting hashable objects. It is an
unordered collection where elements are stored as dictionary keys and
their counts are stored as dictionary values. Counts are allowed to be
any integer value including zero or negative counts. The Counter class
is similar to bags or multisets in other languages.

For an actual implementation of a multiset, use the bag class from the data-structures package on PyPI. Note that this is for Python 3 only. If you need Python 2, here is a recipe for a bag written for Python 2.4.
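
As a rough illustration of how far Counter goes as a multiset, the sketch below uses its arithmetic operators. The names a, b and the result variables are illustrative. It also shows the behaviour the documentation quote warns about: subtract() keeps zero and negative counts, which a strict multiset would not allow.

```python
from collections import Counter

a = Counter('aab')    # a: 2, b: 1
b = Counter('abbc')   # a: 1, b: 2, c: 1

total = a + b         # add counts
diff = a - b          # subtract counts, dropping results <= 0
common = a & b        # intersection: minimum of each count
either = a | b        # union: maximum of each count

# subtract() works in place and keeps zero and negative counts.
loose = Counter(a)
loose.subtract(b)     # a: 1, b: -1, c: -1

# Unary plus strips every count that is not positive.
cleaned = +loose      # a: 1
```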

Answered By: Steven Rumbalski

An alternative Python multiset implementation uses a sorted list data structure. There are a couple of implementations on PyPI. One option is the sortedcontainers module, which provides a SortedList data type that efficiently implements set-like methods such as add, remove, and contains. The sortedcontainers module is written in pure Python, is as fast as C implementations (sometimes faster), has 100% unit test coverage, and has undergone hours of stress testing.

Installation is easy from PyPI:

pip install sortedcontainers

If you can’t pip install then simply pull the sortedlist.py file down from the open-source repository.

Use it as you would a set:

from sortedcontainers import SortedList
survey = SortedList(['blue', 'red', 'blue', 'green'])
survey.add('blue')
print(survey.count('blue'))  # 3
survey.remove('blue')  # removes a single occurrence

The sortedcontainers module also maintains a performance comparison with other popular implementations.

Answered By: GrantJ

What you’re looking for is indeed a multiset (or bag), a collection of not necessarily distinct elements (whereas a set does not contain duplicates).

There’s an implementation for multisets here: https://github.com/mlenzen/collections-extended (the collections-extended module on PyPI).

The data structure for multisets is called bag. A bag is a subclass of the Set class from the collections module with an extra dictionary to keep track of the multiplicities of elements.

class _basebag(Set):
    """
    Base class for bag and frozenbag.   Is not mutable and not hashable, so there's
    no reason to use this instead of either bag or frozenbag.
    """
    # Basic object methods

    def __init__(self, iterable=None):
        """Create a new basebag.

        If iterable isn't given, is None or is empty then the bag starts empty.
        Otherwise each element from iterable will be added to the bag
        however many times it appears.

        This runs in O(len(iterable))
        """
        self._dict = dict()
        self._size = 0
        if iterable:
            if isinstance(iterable, _basebag):
                for elem, count in iterable._dict.items():
                    self._inc(elem, count)
            else:
                for value in iterable:
                    self._inc(value)
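
The excerpt above relies on helpers that are not shown. As a rough, self-contained illustration of the same idea (a dictionary of multiplicities behind set-style method names), here is a minimal sketch. The class name Bag and all of its methods are hypothetical; this is not the collections-extended API.

```python
from collections import Counter


class Bag:
    """Hypothetical minimal multiset with set-style method names."""

    def __init__(self, iterable=()):
        self._counts = Counter(iterable)

    def add(self, elem):
        """Add one occurrence of elem."""
        self._counts[elem] += 1

    def remove(self, elem):
        """Remove one occurrence of elem; raise KeyError if absent."""
        if self._counts[elem] <= 0:
            raise KeyError(elem)
        self._counts[elem] -= 1
        if self._counts[elem] == 0:
            del self._counts[elem]

    def count(self, elem):
        """Return how many times elem occurs (0 if absent)."""
        return self._counts[elem]

    def __contains__(self, elem):
        return elem in self._counts

    def __len__(self):
        # Total number of elements, counting duplicates.
        return sum(self._counts.values())

    def __eq__(self, other):
        # Two bags are equal when every element has the same count.
        if not isinstance(other, Bag):
            return NotImplemented
        return self._counts == other._counts


survey = Bag(['blue', 'red', 'blue', 'green'])
survey.add('blue')      # three 'blue' now
survey.remove('blue')   # back to two
survey.remove('red')    # 'red' is gone entirely
```

Wrapping a Counter rather than subclassing it keeps the public surface limited to the set-like methods the question asks for.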

A nice method for bag is nlargest (similar to Counter.most_common), which returns the elements with their multiplicities, most common first. This is fast because the number of occurrences of each element is kept up to date in the bag’s dictionary:

>>> b = bag(random.choice(string.ascii_letters) for x in range(10))
>>> b.nlargest()
[('p', 2), ('A', 1), ('d', 1), ('m', 1), ('J', 1), ('M', 1), ('l', 1), ('n', 1), ('W', 1)]
>>> Counter(b)
Counter({'p': 2, 'A': 1, 'd': 1, 'm': 1, 'J': 1, 'M': 1, 'l': 1, 'n': 1, 'W': 1}) 
Answered By: user2314737

This depends on how you define a set. One may assume that, for the OP:

  1. order does not matter (whether ordered or unordered)
  2. replicates/repeated elements (a.k.a. multiplicities) are permitted

Given these assumptions, the options reduce to two abstract types: a list or a multiset. In Python, these types usually translate to a list and a Counter respectively. See Details below for some subtleties to observe.

Given

import random

import collections as ct

random.seed(123)


elems = [random.randint(1, 11) for _ in range(10)]
elems
# [1, 5, 2, 7, 5, 2, 1, 7, 9, 9]

Code

A list of replicate elements:

list(elems)
# [1, 5, 2, 7, 5, 2, 1, 7, 9, 9]

A "multiset" of replicate elements:

ct.Counter(elems)
# Counter({1: 2, 5: 2, 2: 2, 7: 2, 9: 2})

Details

On Data Structures

We have a mix of terms here that easily get confused. To clarify, here are some basic mathematical data structures compared to ones in Python.

Type        |Abbr|Order|Replicates|   Math*   |   Python    | Implementation
------------|----|-----|----------|-----------|-------------|----------------
Set         |Set |  n  |     n    | {2  3  1} |  {2, 3, 1}  | set(el)
Ordered Set |Oset|  y  |     n    | {1, 2, 3} |      -      | list(dict.fromkeys(el))
Multiset    |Mset|  n  |     y    | [2  1  2] |      -      | <see `mset` below>
List        |List|  y  |     y    | [1, 2, 2] |  [1, 2, 2]  | list(el)

From the table, one can deduce the definition of each type. Example: a set is a container that ignores order and rejects replicate elements. In contrast, a list is a container that preserves order and permits replicate elements.

Also from the table, we can see:

  • Neither an ordered set nor a multiset is explicitly implemented in Python
  • "Order" means the opposite of a random arrangement of elements, e.g. sorted or insertion order
  • Sets and multisets are not strictly ordered. They can be ordered, but order does not matter.
  • Multisets permit replicates, thus they are not strict sets (the term "set" is indeed confusing).

On Multisets

Some may argue that collections.Counter is a multiset. You are safe in many cases to treat it as such, but be aware that Counter is simply a dict (a mapping) of key-multiplicity pairs. It is a map of multiplicities. See an example of elements in a flattened multiset:

mset = [x for k, v in ct.Counter(elems).items() for x in [k]*v]
mset
# [1, 1, 5, 5, 2, 2, 7, 7, 9, 9]

Notice there is some residual ordering, which may be surprising if you expect disordered results. However, disorder does not preclude order. Thus while you can generate a multiset from a Counter, be aware of the following provisos on residual ordering in Python:

  • replicates get grouped together in the mapping, introducing some degree of order
  • since Python 3.6, dicts preserve insertion order (a language guarantee as of Python 3.7)
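
A small sketch of what these provisos mean in practice, assuming Python 3.7 or later. The names first and second are illustrative. Counter equality compares only the counts, so it behaves like multiset equality, while the flattened elements still reflect insertion order.

```python
import collections as ct

# The same multiset, built from two differently ordered lists.
first = ct.Counter([1, 5, 2, 7, 5])
second = ct.Counter([5, 7, 5, 2, 1])

# Equality compares counts only, so insertion order is ignored.
same_multiset = first == second  # True

# The flattened views keep each Counter's insertion order, so they differ.
flat_first = list(first.elements())    # [1, 5, 5, 2, 7]
flat_second = list(second.elements())  # [5, 5, 7, 2, 1]
same_as_lists = flat_first == flat_second  # False

# Sorting removes the residual ordering when a canonical form is needed.
canonical = sorted(first.elements())  # [1, 2, 5, 5, 7]
```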

Summary

In Python, a multiset can be translated to a map of multiplicities, i.e. a Counter, which is not randomly unordered like a pure set. There can be some residual ordering, which in most cases is ok since order does not generally matter in multisets.

*Mathematically (according to N. Wildberger), braces {} denote a set and brackets [] denote a list, as in Python. Unlike in Python, commas (,) imply order.

Answered By: pylang

You can use collections.Counter to implement a multiset, as already mentioned.

Another way to implement a multiset is by using defaultdict, which would work by counting occurrences, like collections.Counter.

Here’s a snippet from the Python docs:

Setting the default_factory to int makes the defaultdict useful for counting (like a bag or multiset in other languages):

>>> from collections import defaultdict
>>> s = 'mississippi'
>>> d = defaultdict(int)
>>> for k in s:
...     d[k] += 1
...
>>> sorted(d.items())
[('i', 4), ('m', 1), ('p', 2), ('s', 4)]
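
For comparison, here is a short sketch of the same count done with both types. The variable names are illustrative. The counts come out identical, but looking up a missing key inserts it into a defaultdict and leaves a Counter unchanged, which matters if you use membership tests to ask whether an element is in the multiset.

```python
from collections import Counter, defaultdict

s = 'mississippi'

# Counting by hand with defaultdict, as in the docs snippet.
d = defaultdict(int)
for k in s:
    d[k] += 1

# Counter does the same counting in one call.
c = Counter(s)

same_counts = dict(d) == dict(c)  # True

# Looking up a missing key behaves differently:
d_lookup = d['z']   # 0, and inserts 'z' into d
c_lookup = c['z']   # 0, c is unchanged
z_in_d = 'z' in d   # True
z_in_c = 'z' in c   # False

# Counter also offers helpers that defaultdict lacks.
top_two = c.most_common(2)
```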
Answered By: stwykd