Python: How to group a list of objects by their characteristics or attributes?

Question:

I want to separate a list of objects into sublists, so that objects with the same attribute/characteristic end up in the same sublist.

Suppose we have a list of strings:

["This", "is", "a", "sentence", "of", "seven", "words"]

We want to separate the strings based on their length as follows:

[['sentence'], ['a'], ['is', 'of'], ['This'], ['seven', 'words']]

The program I have come up with so far is this:

sentence = ["This", "is", "a", "sentence", "of", "seven", "words"]
word_len_dict = {}
for word in sentence:
    if len(word) not in word_len_dict.keys():
        word_len_dict[len(word)] = [word]
    else:
        word_len_dict[len(word)].append(word)


print(list(word_len_dict.values()))

Is there a better way to achieve this?

Asked By: Mark Jin

||

Answers:

With defaultdict(list), you can omit the key-existence check:

from collections import defaultdict

word_len_dict = defaultdict(list)

for word in sentence:
    word_len_dict[len(word)].append(word)
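
The same pattern extends to any attribute if the key is passed in as a function. Below is a minimal sketch, assuming a helper called group_by (a hypothetical name used only for illustration, not a standard-library function):

```python
from collections import defaultdict


def group_by(items, key):
    """Group items into lists keyed by key(item) (hypothetical helper)."""
    groups = defaultdict(list)
    for item in items:
        # A missing key is created with an empty list on first access.
        groups[key(item)].append(item)
    # Return a plain dict so later lookups of missing keys raise KeyError.
    return dict(groups)


sentence = ["This", "is", "a", "sentence", "of", "seven", "words"]
by_length = group_by(sentence, len)
print(list(by_length.values()))
```

On Python 3.7+, where dicts preserve insertion order, this prints [['This'], ['is', 'of'], ['a'], ['sentence'], ['seven', 'words']]. Any callable works as the key, for example operator.attrgetter("some_attribute") for objects.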
Answered By: xiaofeng.li

I am not saying this is better in any way, unless you consider compact code better. Your version (which is perfectly fine, in my opinion) is much more readable and maintainable.

list_ = ["This", "is", "a", "sentence", "of", "seven", "words"]

# In Python 2, filter() returns a list
result = filter(None, [[x for x in list_ if len(x) == i] for i in range(len(max(list_, key=len)) + 1)])

# In Python 3, filter() returns an iterator, so wrap it in list()
result = list(filter(None, [[x for x in list_ if len(x) == i] for i in range(len(max(list_, key=len)) + 1)]))
Answered By: Ma0

Take a look at itertools.groupby(). Note that groupby() only groups consecutive items with the same key, so your list must be sorted by that key first (which makes this more expensive than your method, OP).

>>> from itertools import groupby
>>> l = ["This", "is", "a", "sentence", "of", "seven", "words"]
>>> print([list(g) for _, g in groupby(sorted(l, key=len), len)])
[['a'], ['is', 'of'], ['This'], ['seven', 'words'], ['sentence']]

or if you want a dictionary:

>>> {k: list(g) for k, g in groupby(sorted(l, key=len), len)}
{1: ['a'], 2: ['is', 'of'], 4: ['This'], 5: ['seven', 'words'], 8: ['sentence']}
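
To see why the sort matters, here is a small sketch comparing groupby() on the unsorted and the sorted list. The variable names are only illustrative.

```python
from itertools import groupby

words = ["This", "is", "a", "sentence", "of", "seven", "words"]

# Without sorting, "is" and "of" are not adjacent, so they land in separate groups.
unsorted_groups = [list(g) for _, g in groupby(words, key=len)]

# After sorting by the same key, words of equal length are adjacent and merge.
sorted_groups = [list(g) for _, g in groupby(sorted(words, key=len), key=len)]

print(unsorted_groups)
print(sorted_groups)
```

The first result has six groups, with ['is'] and ['of'] kept apart; the second has the expected five.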
Answered By: ospahiu

The documentation for itertools.groupby has an example that matches exactly what you want.

from itertools import groupby

keyfunc = len
data = ["This", "is", "a", "sentence", "of", "seven", "words"]
# groupby() only groups adjacent items, so sort by the same key first
data = sorted(data, key=keyfunc)
groups = []
for k, g in groupby(data, keyfunc):
    groups.append(list(g))
print(groups)
Answered By: Mauro Baraldi

sentence = ["This", "is", "a", "sentence", "of", "seven", "words"]
# Distinct word lengths, in ascending order
lengths = sorted({len(word) for word in sentence})

result = []

for length in lengths:
    # Collect every word of this length, preserving the original order
    result.append([word for word in sentence if len(word) == length])

print(result)
Answered By: Janarthanan .S

You can do this with a plain dict by using the setdefault method:

sentence = ["This", "is", "a", "sentence", "of", "seven", "words"]
word_len_dict = {}
for word in sentence:
    word_len_dict.setdefault(len(word), []).append(word)

setdefault looks up the key len(word) in your dictionary. If the key doesn’t exist, it stores the second argument (the default value) under that key and returns it; if the key does exist, it just returns the existing value.

It’s important to notice that when the key already exists, the default value passed to setdefault won’t replace the old value. This ensures that each list is created only once; after that, setdefault just returns that same list.
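
The behaviour described above can be checked with a short sketch; the key 2 and the names first and second are chosen only for illustration:

```python
word_len_dict = {}

# Key 2 is missing: the new list is stored under it and returned.
first = word_len_dict.setdefault(2, [])
first.append("is")

# Key 2 exists: the fresh [] is discarded and the stored list is returned.
second = word_len_dict.setdefault(2, [])
second.append("of")

print(second is first)
print(word_len_dict)
```

This prints True and then {2: ['is', 'of']}, since both calls hand back the same list object.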

Answered By: Carlos Afonso

If your goal is to do it in fewer lines, there are always comprehensions:

data = ["This", "is", "a", "sentence", "of", "seven", "words"]
# Get all unique length values
unique_length_vals = {len(word) for word in data}
# Get lists of same-length words (list() is needed in Python 3, where filter() returns an iterator)
res = [list(filter(lambda x: len(x) == lval, data)) for lval in unique_length_vals]

It might be less clear, but useful if you just want to code something quickly.

Answered By: F. Moïni