Grouping words in python dataframe

Question:

I have a dataframe column with different device models. I want to generate summary for common models by grouping similar devices together. I don’t have a list of exact brands/words to use difflib SequenceMatcher.

Main challenge I’m facing: – I don’t have a fixed list of products (devices) – it’s a million rows long data frame with devices ranging from Playstations and Nintendos to all the phones and tablets.

This is how my data looks like:

device_model volume of requests (100k)
iphone XR 1.6151
ipad 1.4768
galaxy s21 5g 1.4443
ultra 5g galaxy s21 1.3395
samsung galaxy s9 1.2902
iphone 11 1.2696
galaxy s10 1.2613
a32 5g galaxy 1.209
galaxy s10e 1.0994
apple iphone 13 pro max 1.0285
a12 galaxy 1.0089
galaxy s22 ultra 5g 1.0054
samsung galaxy s20 fe 5g 0.9686
iphone 13 pro 0.9116
galaxy s10+ 0.8795
S22 galaxy 5g 0.8279
galaxy s9 plus 0.8175
5g galaxy s21 plus 0.8155
apple ipad air 0.7958

This is how I would like my output to be

device_model total request volume (100k)
iPhone 4.8248
iPad 2.2726
Galaxy 13.967

I tried to apply topic modelling using Spacy & Gensim, and N-grams, but it doesn’t work reliably for such small words. Looking for a quick and simple way.

Asked By: lightyagami96

||

Answers:

Found a neat solution (I think), loved the puzzle:

It finds the most common word in the column and assign this word as device for the rows where the most common word appears. Then, for the remaining rows, it finds the most common word and assigns this word as device for these rows. This repeats untill the complete column device is filled.

from collections import Counter

df['device'] = ''

while sum(df['device'] == '') > 0:
    most_common_word = Counter(" ".join(df[df['device'] == '']["device_model"]).split()).most_common(1)[0][0]
    df.loc[df.device_model.str.contains(most_common_word) & (df.device == ''), 'device'] = most_common_word
    print(most_common_word)

Returns:

galaxy
iphone
ipad

Then apply pd.groupby:

df.groupby('device')['volume of requests (100k)'].sum()

Output:

device
galaxy    13.9670
ipad       2.2726
iphone     4.8248
Name: volume of requests (100k), dtype: float64

Note that this does not work when (for example) the word pro is the most common word which is quite possible. However for the dataset given, it works. These über common words need to be removed first, otherwise it is quite hard for a program to know that pro is not a device

Answered By: T C Molenaar