Iterate function across dataframe

Question:

I have a dataset of pre-processed online reviews, where each row contains the words from one review. I am running Latent Dirichlet Allocation (LDA) to extract topics from the entire dataframe. Now I want to assign a topic to each row based on the LDA model’s get_document_topics method.

I found some code, but it only prints the probability of a single document being assigned to each topic. I’m trying to apply it to every document and write the results back to the same dataset. Here’s the code I found…

text = ["user"]
bow = dictionary.doc2bow(text)
print("get_document_topics", model.get_document_topics(bow))
### get_document_topics [(0, 0.74568415806946331), (1, 0.25431584193053675)]

Here’s what I’m trying to get…

                  stemming  probOnTopic1  probOnTopic2  probOnTopic3  topic
0      [bank, water, bank]           0.7           0.3           0.0      0
1  [baseball, rain, track]           0.1           0.8           0.1      1
2     [coin, money, money]           0.9           0.0           0.1      0
3      [vote, elect, bank]           0.2           0.0           0.8      2

Here’s the code that I’m working on…

def bow (text):
    return [dictionary.doc2bow(text) in document]

df["probability"] = optimal_model.get_document_topics(bow)
df[['probOnTopic1', 'probOnTopic2', 'probOnTopic3']] = pd.DataFrame(df['probability'].tolist(), index=df.index)

Asked By: Christabel

||

Answers:

One option is to create a new column in your DataFrame and then iterate over its rows. For each row, call get_document_topics to get the topic distribution, then assign the most likely topic to that row.

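df['topic'] = None
for index, row in df.iterrows():
    text = row['review_text']  # the column holding each review's list of tokens
    bow = dictionary.doc2bow(text)
    topic_distribution = model.get_document_topics(bow)
    # Each entry is a (topic_id, probability) pair; keep the id of the most probable one
    most_likely_topic = max(topic_distribution, key=lambda x: x[1])[0]
    df.at[index, 'topic'] = most_likely_topic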
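
If you also want one probability column per topic, as in the table in the question, something along these lines might work. It is only a sketch: the add_topic_columns helper, its parameters, and the 'stemming' default column name are my own assumptions, and it assumes a gensim LdaModel whose get_document_topics accepts minimum_probability.

```python
import pandas as pd


def add_topic_columns(df, dictionary, model, num_topics, text_column="stemming"):
    """Return a copy of df with one probability column per topic plus 'topic'.

    Hypothetical helper: the name, parameters, and default column are assumptions.
    """
    rows = []
    for tokens in df[text_column]:
        bow = dictionary.doc2bow(tokens)
        # minimum_probability=0.0 asks gensim not to drop low-probability topics
        distribution = dict(model.get_document_topics(bow, minimum_probability=0.0))
        # Any topic still missing from the result is treated as probability 0.0
        rows.append([float(distribution.get(k, 0.0)) for k in range(num_topics)])

    columns = [f"probOnTopic{k + 1}" for k in range(num_topics)]
    # Reuse df's index so the new columns line up row by row
    probs = pd.DataFrame(rows, columns=columns, index=df.index)
    # Position of the largest probability in each row is the 0-based topic id
    probs["topic"] = probs[columns].to_numpy().argmax(axis=1)
    return df.join(probs)
```

Called as df = add_topic_columns(df, dictionary, model, model.num_topics), this should add the probOnTopic1 to probOnTopicN columns plus a topic column holding the 0-based id of the most probable topic.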

Is this helpful?

Answered By: Lorenzo Bassetti

A slightly different approach, @Christabel, that includes your other request for a 0.7 threshold:

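import pandas as pd

results = []

# Iterate over each review (a list of tokens)
for review in df['review']:
    bow = dictionary.doc2bow(review)
    topics = model.get_document_topics(bow)

    # Convert the (topic_id, probability) pairs to a dictionary
    topic_dict = {topic_id: prob for topic_id, prob in topics}
    # Get the topic with the highest probability
    max_topic = max(topic_dict, key=topic_dict.get)

    # Keep the topic only if its probability exceeds the 0.7 threshold
    if topic_dict[max_topic] > 0.7:
        topic = max_topic
    else:
        topic = 0  # fallback label; note that 0 is also a valid topic id

    topic_dict['topic'] = topic
    results.append(topic_dict)

# Build a DataFrame aligned with df's index, then merge it back
df_topics = pd.DataFrame(results, index=df.index)
df = df.merge(df_topics, left_index=True, right_index=True)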

Is this helpful, and does it work for you?
You can then put this code inside a function and pass the 0.7 threshold as a parameter, so that it can be reused in different use cases.
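
A rough sketch of what that function could look like is below. The names pick_topic and assign_topics, and the fallback parameter, are my own inventions for illustration. The fallback defaults to None here so that "no confident topic" is not confused with topic 0; pass fallback=0 to reproduce the behaviour of the loop above.

```python
def pick_topic(topic_distribution, threshold=0.7, fallback=None):
    """Return the most probable topic id if it exceeds threshold, else fallback.

    Hypothetical helper; topic_distribution is a list of (topic_id, probability).
    """
    if not topic_distribution:
        return fallback
    topic_id, probability = max(topic_distribution, key=lambda pair: pair[1])
    return topic_id if probability > threshold else fallback


def assign_topics(reviews, dictionary, model, threshold=0.7, fallback=None):
    """Build one {topic_id: probability, ..., 'topic': label} dict per review."""
    results = []
    for tokens in reviews:
        topics = model.get_document_topics(dictionary.doc2bow(tokens))
        row = {topic_id: probability for topic_id, probability in topics}
        row["topic"] = pick_topic(topics, threshold, fallback)
        results.append(row)
    return results
```

The returned list can then be turned into a DataFrame with pd.DataFrame(assign_topics(df['review'], dictionary, model, threshold=0.7), index=df.index) and merged back into df as before.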

Answered By: Lorenzo Bassetti