Removal of Stop Words and Stemming/Lemmatization for BERTopic

Question:

For Topic Modelling, I’m trying out BERTopic: Link

I’m a little confused here. I am trying out BERTopic on my custom dataset.
Since BERT was trained in such a way that it preserves the semantic meaning of the text/document,
should I remove stop words and stem/lemmatize my documents before passing them to BERTopic?
I’m afraid these stop words might land in my topics as salient terms, which they are not.

Suggestions and advice, please!

Asked By: WarlockQ


Answers:

A good way to know whether this is needed is to check the examples/tutorials linked from the page you provided: Here is Topic Modeling. As you can see, they do not do any preprocessing before calling the model.

It therefore seems that preprocessing is neither needed nor recommended by the authors of the model.

However, removing stop words can make the whole process faster, and by their nature they often do not contain salient information about the topic. It is sometimes recommended not to remove them for certain tasks such as Sentiment Analysis, as you can read in these links:

Why is removing stopwords not always a good idea ?

DataStack discussion over stopwords

As for lemmatization or stemming, this link provides good insights on the subject for a Topic Modeling task, suggesting that they should be applied for improved results.

In conclusion, BERTopic needs neither lemmatization/stemming nor stop word removal in order to work, but these steps can be applied to improve both processing time and results.
In the end, it always depends on your needs and resources. Trying both solutions and comparing the results against what you want is always a good way to understand the pros and cons of these tools.
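
For instance, here is a minimal sketch of such a comparison (docs is a placeholder for your own list of document strings; the stop word list and the side-by-side inspection are illustrative choices, not something the BERTopic authors prescribe):

    from bertopic import BERTopic
    from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

    docs = ["..."]  # placeholder: your own list of raw document strings

    # Run 1: raw documents, as in the official tutorials.
    raw_model = BERTopic()
    raw_topics, _ = raw_model.fit_transform(docs)

    # Run 2: the same documents with stop words stripped beforehand.
    stripped = [
        " ".join(w for w in doc.split() if w.lower() not in ENGLISH_STOP_WORDS)
        for doc in docs
    ]
    stripped_model = BERTopic()
    stripped_topics, _ = stripped_model.fit_transform(stripped)

    # Inspect the salient terms of each run and keep whichever serves you better.
    print(raw_model.get_topic_info())
    print(stripped_model.get_topic_info())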

Answered By: Jules Civel

No.

BERTopic uses transformers that were trained on "real and clean" text, not on text stripped of stopwords, stems, or lemmas. At the end of the computation the stop words have become noise (non-informative) and all end up in topic_id = -1.
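
You can check this yourself; the sketch below assumes topic_model is a BERTopic model already fitted with fit_transform, and only uses its get_topic call:

    # Inspect the outlier topic (-1), where non-informative terms tend to land.
    outlier_terms = topic_model.get_topic(-1)  # list of (term, c-TF-IDF weight) pairs
    print(outlier_terms[:10])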

For the same reason you should not tokenize (that is done internally) or lemmatize (somewhat subjective) the text. That will mess up your topics.

A disadvantage of not lemmatizing is that the keywords of a topic carry a lot of redundancy, like (topn=10) "hotel, hotels", "resort, resorts", etc. It also does not handle bigrams like "New York" or "Barack Obama" elegantly.
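
If the bigram handling matters to you, one possible mitigation is to pass a custom CountVectorizer via BERTopic’s vectorizer_model parameter, so that the topic representation can include bigrams while the embeddings are still computed on the untouched text. A minimal sketch (docs is a placeholder for your own corpus):

    from bertopic import BERTopic
    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["..."]  # placeholder: your own list of raw document strings

    # Bigrams such as "new york" can now appear as topic keywords.
    vectorizer_model = CountVectorizer(ngram_range=(1, 2))
    topic_model = BERTopic(vectorizer_model=vectorizer_model)
    topics, _ = topic_model.fit_transform(docs)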

You can’t have it all 😉

Andreas

PS: You can of course remove HTML tags; they are not in your reference corpus either.

Answered By: user9165100

The official BERTopic FAQ presents a solution for stop word removal:
they can be removed with scikit-learn’s CountVectorizer after the embeddings are generated.

This is recommended especially if disturbing stop words appear in the resulting topics.

See the example in the BERTopic FAQ.
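
For reference, a minimal sketch of that approach (docs is a placeholder; the built-in "english" stop word list is one possible choice):

    from bertopic import BERTopic
    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["..."]  # placeholder: your raw, unpreprocessed documents

    # Stop words are removed only in the topic-representation step,
    # after the embeddings have been generated from the raw text.
    vectorizer_model = CountVectorizer(stop_words="english")
    topic_model = BERTopic(vectorizer_model=vectorizer_model)
    topics, _ = topic_model.fit_transform(docs)

    # For an already-fitted model, the representations can be updated in place:
    # topic_model.update_topics(docs, vectorizer_model=vectorizer_model)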

However, any preprocessing of the input documents themselves (stop word removal before embedding, lemmatization, etc.) should be avoided with BERTopic.

Answered By: oberbus