How to split a WhatsApp conversation into multiple blocks based on the context?

Question:

Let’s imagine I download a CSV file that includes all the conversations I had with a friend over the past 6 months (a WhatsApp chat). I would like to divide that CSV file into multiple "blocks" (each block being a different conversation). E.g.:

Day 1:

  • U1: Hey, how’s going?
  • U2: Fine! Any plan for tomorrow?
  • U1: Nope

Day 2:

  • U2: Hello!

Day 3:

  • U1: Morning!
  • U2: ….

So the idea is to identify that in my WhatsApp chat, following the example above, there should be 3 blocks of different conversations: two initiated by U1 and one initiated by U2.

I cannot split it by time alone because a user may take a long time to reply to the previous message. So it seems I need to identify whether a new sentence in the chat is related to the previous "block" of conversation or is actually starting a new block.

Any ideas of what steps I need to follow if I want to identify different conversations in one chat, or if a sentence is continuing the previous conversation/starting a new one?

Thanks!!

Asked By: Wolfox

Answers:

I think that even though you don't like time as the proxy for a conversation block, it might perform just as well as more complicated NLP.
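As a baseline, the time proxy can be sketched in a few lines of Python. This is only a minimal sketch: the function name `split_by_time_gap`, the 6-hour threshold, and the timestamps in the sample chat are assumptions made up for illustration, and the messages are assumed to be already parsed from the CSV and sorted by time.

```python
from datetime import datetime, timedelta

def split_by_time_gap(messages, max_gap=timedelta(hours=6)):
    """Group (timestamp, sender, text) tuples into conversation blocks.

    A new block starts whenever the pause since the previous message
    exceeds max_gap. Assumes messages are sorted by timestamp.
    The 6-hour default is an arbitrary illustrative value.
    """
    blocks = []
    for msg in messages:
        if blocks and msg[0] - blocks[-1][-1][0] <= max_gap:
            blocks[-1].append(msg)   # short pause: same conversation
        else:
            blocks.append([msg])     # first message or long pause: new block
    return blocks

# Sample chat modelled on the question; the timestamps are invented.
sample_chat = [
    (datetime(2023, 1, 1, 9, 0), "U1", "Hey, how's going?"),
    (datetime(2023, 1, 1, 9, 5), "U2", "Fine! Any plan for tomorrow?"),
    (datetime(2023, 1, 1, 11, 30), "U1", "Nope"),
    (datetime(2023, 1, 2, 10, 0), "U2", "Hello!"),
    (datetime(2023, 1, 3, 8, 0), "U1", "Morning!"),
]

time_blocks = split_by_time_gap(sample_chat)
initiators = [block[0][1] for block in time_blocks]
print(len(time_blocks), initiators)
```

The threshold is the only knob here, so it is worth trying a few values against a hand-labelled stretch of the chat before moving to anything heavier.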

If you want to try something more complicated, you would need some measure of semantic relatedness between texts. A common method is to embed your sentences/messages, e.g. with sentence-BERT (see sbert.net), and use the cosine similarity between consecutive messages. You could say that a block ends once the embedding of the new message is too dissimilar from that of the preceding message. Or you could even use BERT's next-sentence prediction to test which sentences are plausible continuations of others.
But it's unclear whether this performs better than a simple time proxy. Sometimes simpler is better 🙂
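The similarity-threshold idea could look roughly like the sketch below. To keep it runnable without extra dependencies, it uses a toy bag-of-words embedding (`toy_embed`) as a stand-in; this is not sentence-BERT, and in practice you would swap `toy_embed` and `cosine` for a real sentence-embedding model and a dense cosine similarity. The function names, the sample messages, and the 0.3 threshold are all assumptions for illustration.

```python
import math
import re
from collections import Counter

def toy_embed(text):
    """Stand-in embedding: bag-of-words counts (NOT sentence-BERT)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def split_by_similarity(texts, embed=toy_embed, threshold=0.3):
    """Start a new block when a message is too dissimilar from the previous one.

    The threshold is an arbitrary illustrative value and would need tuning
    for whichever embedding is actually used.
    """
    blocks = []
    prev_vec = None
    for text in texts:
        vec = embed(text)
        if prev_vec is not None and cosine(prev_vec, vec) >= threshold:
            blocks[-1].append(text)   # similar enough: continue the block
        else:
            blocks.append([text])     # first message or topic change
        prev_vec = vec
    return blocks

# Invented messages with two obvious topics.
sample_texts = [
    "Any plan for the weekend?",
    "No plan for the weekend yet",
    "Did you watch the match?",
    "Yes, I did watch the match",
]
similarity_blocks = split_by_similarity(sample_texts)
print(similarity_blocks)
```

Word overlap breaks down quickly on short chat messages ("Nope" shares no words with the question it answers), which is exactly the case where a proper sentence embedding or the time signal would have to carry the decision.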

Answered By: Moritz

Moritz is quite right. Another thing that has helped me in a similar task is to look for conversation starters and closers, typically greetings and farewells. In your case, those would be "Hey", "Hello!", etc.

Time + greeting (especially if it is not answering a previous one) most likely marks a new block.
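A rough sketch of the time + greeting rule follows. The greeting list, the 3-hour pause, the function name, and the timestamps are assumptions for illustration only; a greeting that quickly answers another greeting stays in the same block simply because the pause is short.

```python
import re
from datetime import datetime, timedelta

# Hypothetical starter list; extend it for your own chat and language.
GREETING = re.compile(r"^\s*(hey|hi|hello|morning|good morning)\b", re.IGNORECASE)

def split_by_greeting_and_gap(messages, min_gap=timedelta(hours=3)):
    """Start a new block only when a greeting follows a long pause.

    messages: (timestamp, sender, text) tuples sorted by timestamp.
    A slow reply without a greeting stays in the current block.
    """
    blocks = []
    for ts, sender, text in messages:
        starts_new = True  # the very first message always opens a block
        if blocks:
            prev_ts = blocks[-1][-1][0]
            long_pause = ts - prev_ts > min_gap
            starts_new = long_pause and bool(GREETING.match(text))
        if starts_new:
            blocks.append([(ts, sender, text)])
        else:
            blocks[-1].append((ts, sender, text))
    return blocks

# Sample chat modelled on the question; the timestamps are invented.
greeting_chat = [
    (datetime(2023, 1, 1, 9, 0), "U1", "Hey, how's going?"),
    (datetime(2023, 1, 1, 9, 5), "U2", "Fine! Any plan for tomorrow?"),
    (datetime(2023, 1, 1, 20, 0), "U1", "Nope"),  # slow reply, no greeting
    (datetime(2023, 1, 2, 10, 0), "U2", "Hello!"),
    (datetime(2023, 1, 3, 8, 0), "U1", "Morning!"),
]

greeting_blocks = split_by_greeting_and_gap(greeting_chat)
starters = [block[0][1] for block in greeting_blocks]
print(len(greeting_blocks), starters)
```

Compared with a pure time split, the late "Nope" stays attached to the question it answers, which addresses the slow-reply concern raised in the question.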

Answered By: user20886661