A practical example of GSDMM in python?

Question:

I want to use GSDMM to assign topics to some tweets in my data set. The only examples I found (1 and 2) are not detailed enough. I was wondering if you know of a source (or care enough to make a small example) that shows how GSDMM is implemented using python.

Asked By: Pie-ton


Answers:

GSDMM (Gibbs Sampling Dirichlet Multinomial Mixture) is a short-text
clustering model. It is essentially a modified LDA (Latent Dirichlet
Allocation) which assumes that a document, such as a tweet or any other
short text, covers a single topic.

[Image: GSDMM]

[Image: LDA]

Address: github.com/da03/GSDMM

import numpy as np
from scipy.sparse import lil_matrix
from scipy.sparse import find
import math

class GSDMM:
    def __init__(self, n_topics, n_iter, random_state=910820, alpha=0.1, beta=0.1):
        self.n_topics = n_topics
        self.n_iter = n_iter
        self.random_state = random_state
        np.random.seed(random_state)
        self.alpha = alpha
        self.beta = beta
    def fit(self, X):
        alpha = self.alpha
        beta = self.beta

        D, V = X.shape
        K = self.n_topics

        N_d = np.asarray(X.sum(axis=1)).flatten()  # total words per document
        words_d = {}
        for d in range(D):
            words_d[d] = find(X[d,:])[1]

        # initialization
        N_k = np.zeros(K)
        M_k = np.zeros(K)
        N_k_w = lil_matrix((K, V), dtype=np.int32)

        K_d = np.zeros(D, dtype=np.int32)  # cluster assignment of each document

        for d in range(D):
            k = np.random.choice(K, 1, p=[1.0/K]*K)[0]
            K_d[d] = k
            M_k[k] = M_k[k]+1
            N_k[k] = N_k[k] + N_d[d]
            for w in words_d[d]:
                N_k_w[k, w] = N_k_w[k,w]+X[d,w]

        for iter in range(self.n_iter):
            print('iter', iter)
            for d in range(D):
                k_old = K_d[d]
                M_k[k_old] -= 1
                N_k[k_old] -= N_d[d]
                for w in words_d[d]:
                    N_k_w[k_old, w] -= X[d,w]
                # sample k_new
                log_probs = [0]*K
                for k in range(K):
                    log_probs[k] += math.log(alpha+M_k[k])
                    for w in words_d[d]:
                        N_d_w = X[d,w]
                        for j in range(N_d_w):
                            log_probs[k] += math.log(N_k_w[k,w]+beta+j)
                    for i in range(N_d[d]):
                        log_probs[k] -= math.log(N_k[k]+beta*V+i)
                log_probs = np.array(log_probs) - max(log_probs)
                probs = np.exp(log_probs)
                probs = probs/np.sum(probs)
                k_new = np.random.choice(K, 1, p=probs)[0]
                K_d[d] = k_new
                M_k[k_new] += 1
                N_k[k_new] += N_d[d]
                for w in words_d[d]:
                    N_k_w[k_new, w] += X[d,w]
        self.topic_word_ = N_k_w.toarray()
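
For completeness, a minimal usage sketch for the class above, assuming scikit-learn is available to build the document-term matrix (note that only topic_word_ is exposed, not the per-document assignments):

from sklearn.feature_extraction.text import CountVectorizer

tweets = [
    "new phone battery lasts two days",
    "battery died again on my phone",
    "great pasta at the new italian place",
    "the pasta and wine were amazing",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(tweets)   # sparse document-term matrix, shape (D, V)

model = GSDMM(n_topics=2, n_iter=15)
model.fit(X)

# topic_word_ holds the per-topic word counts after the last Gibbs sweep
print(model.topic_word_.shape)         # (n_topics, vocabulary size)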
Answered By: Mahsa Hassankashi

I am experimenting with GSDMM as well and ran into the same problem: there is just not much online (I was unable to find more than you did, apart from some papers that use it). If you look at the code of the GSDMM GitHub repo, you can see that it is a pretty small repo with only a few functions. These are basically all used in the towardsdatascience tutorial, so I don't think you are missing out on anything.

If you have a specific question, feel free to ask!

Edit: If you follow the tutorial on towardsdatascience you will realize that it is an inconsistent and unfinished project. Some helper functions are missing and the algorithm is not used correctly. The author runs it with K=10 and ends up with 10 clusters. If you increase K (and you should), the number of clusters will be higher than 10, so there is a little bit of cheating happening.
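
To check this on your own data, one option is to count how many of the K candidate clusters actually received documents; a rough sketch, assuming the rwalk/gsdmm MovieGroupProcess used in the answers below and docs/vocab prepared as shown there:

from gsdmm import MovieGroupProcess

K = 40                                  # deliberately larger than the expected number of topics
mgp = MovieGroupProcess(K=K, alpha=0.1, beta=0.1, n_iters=30)
mgp.fit(docs, len(vocab))

populated = sum(1 for count in mgp.cluster_doc_count if count > 0)
print(f"{populated} of {K} clusters are actually populated")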

Answered By: simon

I finally compiled my code for GSDMM and will put it here from scratch for others’ use. I have tried to comment the important parts:

# Imports
import random

import numpy as np
import spacy
from gensim.models.phrases import Phraser, Phrases
from gensim.utils import simple_preprocess
from gsdmm import MovieGroupProcess


# data
data = ...

# stop words
stop_words = ...

# turning sentences into words
data_words = []
for doc in data:
    doc = doc.split()
    data_words.append(doc)

# create vocabulary
vocabulary = ...

# Removing stop Words
stop_words.extend(['from', 'rt'])

def remove_stopwords(texts):
    return [
        [
            word
            for word in simple_preprocess(str(doc))
            if word not in stop_words
        ]
        for doc in texts
    ]

data_words_nostops = remove_stopwords(data_words)

# building bi-grams
bigram = Phrases(data_words_nostops, min_count=5, threshold=100)
bigram_mod = Phraser(bigram)
print('done!')

# Form Bigrams
data_words_bigrams = [bigram_mod[doc] for doc in data_words_nostops]

# lemmatization (assumes a spaCy English model such as en_core_web_sm is installed)
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
pos_to_use = ['NOUN', 'ADJ', 'VERB', 'ADV']
data_lemmatized = []
for sent in data_words_bigrams:
    doc = nlp(" ".join(sent))
    data_lemmatized.append(
        [token.lemma_ for token in doc if token.pos_ in pos_to_use]
    )
      
docs = data_lemmatized
vocab = set(x for doc in docs for x in doc)

# Train a new model 
random.seed(1000)
# Init of the Gibbs Sampling Dirichlet Mixture Model algorithm
mgp = MovieGroupProcess(K=10, alpha=0.1, beta=0.1, n_iters=30)

n_terms = len(vocab)
n_docs = len(docs)

# Fit the model on the data given the chosen seeds
y = mgp.fit(docs, n_terms)

def top_words(cluster_word_distribution, top_cluster, values):
    for cluster in top_cluster:
        sort_dicts = sorted(
            cluster_word_distribution[cluster].items(),
            key=lambda k: k[1],
            reverse=True,
        )[:values]
        print('Cluster %s : %s' % (cluster, sort_dicts))
        print('-' * 20)

doc_count = np.array(mgp.cluster_doc_count)
print('Number of documents per topic :', doc_count)
print('*'*20)

# Topics sorted by the number of documents they are allocated to
top_index = doc_count.argsort()[-10:][::-1]
print('Most important clusters (by number of docs inside):', top_index)
print('*'*20)


# Show the top 10 words in term frequency for each cluster 
top_words(mgp.cluster_word_distribution, top_index, 10)


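Since the question is about assigning topics to tweets, a short sketch of that last step: mgp.fit already returned one cluster label per training document in y, and the rwalk/gsdmm implementation also provides choose_best_label for scoring new token lists.

# label per training document, as returned by mgp.fit
for doc, label in list(zip(docs, y))[:5]:
    print(label, doc[:8])

# scoring an unseen, already-preprocessed tweet (hypothetical example tokens)
new_doc = ['new', 'phone', 'battery']
best_label, prob = mgp.choose_best_label(new_doc)   # (cluster index, probability)
print(best_label, round(prob, 3))
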
Links

  1. gensim modules
  2. Python library gsdmm
Answered By: Pie-ton

As I understand it, you have the code at https://github.com/rwalk/gsdmm but need to decide how to apply it.

How does it work?

You can download the paper A Dirichlet Multinomial Mixture Model-based Approach for Short Text Clustering; it shows that the cluster search is equivalent to a game of table choosing. Imagine you have a group of students and want to group them at tables by their movie interests. Every student (= document) switches in each round to a table (= cluster) that has students with similar movie tastes and is popular. Alpha controls how easily a table gets removed when it is empty (low alpha = fewer tables). A small beta means that a table is chosen based on similarity to the table rather than on the table's popularity. For short text clustering you use words instead of movies; the resulting table-choice rule is sketched below.
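
A rough log-space sketch of that table-choosing rule (it is the same computation the hand-written fit() in the first answer performs in its inner loop):

import math

def log_table_score(doc_word_counts, m_k, n_k, n_k_w, alpha, beta, V):
    # doc_word_counts: {word: count} for the document being re-seated
    # m_k: documents currently at table k, n_k: total words at table k
    # n_k_w: {word: count} of the words already at table k, V: vocabulary size
    score = math.log(m_k + alpha)                        # popularity of the table
    for word, count in doc_word_counts.items():          # similarity to the table
        for j in range(count):
            score += math.log(n_k_w.get(word, 0) + beta + j)
    n_d = sum(doc_word_counts.values())
    for i in range(n_d):
        score -= math.log(n_k + V * beta + i)
    return score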

Alpha, beta, number of iterations

A low alpha therefore results in many clusters with single words, while a high alpha results in fewer clusters with more words. A high beta results in popular clusters, while a low beta results in similar clusters (which are not strongly populated). Which parameters you need depends on the dataset. The number of clusters can mostly be controlled by beta, but alpha also has (as described) an influence. The number of iterations seems to be stable after 20 iterations, but 10 is also OK.
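
A small parameter sweep, assuming docs and vocab are prepared as in the answer above, makes these effects visible on your own dataset:

from gsdmm import MovieGroupProcess

for alpha in (0.01, 0.1, 0.5):
    for beta in (0.01, 0.1, 0.5):
        mgp = MovieGroupProcess(K=40, alpha=alpha, beta=beta, n_iters=20)
        mgp.fit(docs, len(vocab))
        populated = sum(1 for c in mgp.cluster_doc_count if c > 0)
        print(f"alpha={alpha}, beta={beta}: {populated} populated clusters")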

Data preparation process

Before you train the algorithm you need to create a clean data set: convert every text to lower case, remove non-ASCII characters and stop words, and apply stemming or lemmatisation. You will also need to apply the same process when you run the model on a new sample.
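
A minimal sketch of these steps, assuming NLTK's stop word list and WordNet lemmatizer are available (nltk.download('stopwords') and nltk.download('wordnet')) and that tweets is your list of raw texts:

import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def clean(text):
    text = text.lower()                                  # lower case
    text = text.encode('ascii', 'ignore').decode()       # drop non-ASCII characters
    tokens = re.findall(r'[a-z]+', text)
    return [lemmatizer.lemmatize(t) for t in tokens      # lemmatise
            if t not in stop_words]                      # drop stop words

docs = [clean(t) for t in tweets]    # token lists in the format the model expects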

Answered By: simsi