NLP: pre-processing a dataset into a new dataset

Question:

I need help with processing an unsorted dataset. Sorry if I am a complete noob, I have never done anything like this before. As you can see below, each conversation is identified by a dialogueID and consists of multiple rows with a "from" and a "to" field, as well as the text of each message.
I would like to concatenate the text messages from the sender of a dialogueID into one column and those from the receiver into another column. This way, I could have a new CSV file with just [dialogueID, sender, receiver].

[image: original dataset]
The new dataset should look like this:
[image: new dataset]
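
In case the screenshots don't show up, here is roughly what I mean (made-up rows; I'll call the message column "text" here):

dialogueID | from  | to    | text
1          | userA |       | hi, my sound stopped working
1          | userB | userA | which driver are you using?
1          | userA |       | alsa, I think

and the new dataset, one row per dialogueID:

dialogueID | sender                                     | receiver
1          | hi, my sound stopped working alsa, I think | which driver are you using?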

I watched multiple tutorials and am really struggling to figure out how to do it. I read in this 9-year-old post that iterating through DataFrames is not a good idea. Could someone help me out with a code snippet or give me a hint on how to do this properly without overcomplicating things? I thought of something like the rough loop below, but the performance with 1 million rows is not great, right?

import pandas as pd

# df is the dialogue DataFrame (~1,038,324 rows); go through it row by row
sender_text = {}    # dialogueID -> concatenated sender messages
receiver_text = {}  # dialogueID -> concatenated receiver messages
for _, row in df.iterrows():
    did = row["dialogueID"]
    if pd.isnull(row["to"]):  # no receiver given -> this row comes from the sender
        sender_text[did] = (sender_text.get(did, "") + " " + row["text"]).strip()
    else:
        receiver_text[did] = (receiver_text.get(did, "") + " " + row["text"]).strip()
Asked By: CodingStudent


Answers:

Edit 1

Based on your clarification, this is what I believe you're looking for.

Create an aggregation function that concatenates your string values with a line-break character, then group by dialogueID and apply that aggregation:

# join the string values in each column with a line break, per dialogueID
d = {}
d['from'] = '\n'.join
d['to'] = '\n'.join
new_df = dialogue_dataframe.groupby('dialogueID', as_index=False).agg(d)

After that rename the columns as you’d like:

new_df = new_df.rename(columns={"from": "sender", "to": "receiver"})
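
If you then want the result as a new CSV file, you can write it out with to_csv (the file name is just an example):

new_df.to_csv("dialogues_aggregated.csv", index=False)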

Original answer

Not quite sure I understood what you're trying to achieve, but maybe this will give you some insight. Could you write out a couple of rows of the table you expect to get, for clarification?

Answered By: Nazar Nintendo

While the exact structure of the data (and thus your task) is not completely clear, maybe DataFrame.apply, or rather DataFrame.aggregate, can help you speed things up. I would also aggregate into either a dictionary or a DataFrame indexed by dialogue ID; that way you can easily check whether a given dialogue / sender already exists.
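
For example, here is one way to sketch that idea with groupby + apply, assuming the message column is called "text" and that rows with an empty "to" field are the sender's messages, as in your snippet above:

import pandas as pd

def split_roles(group):
    # within one dialogue, separate the messages by role and join them
    is_sender = group["to"].isnull()
    return pd.Series({
        "sender": " ".join(group.loc[is_sender, "text"]),
        "receiver": " ".join(group.loc[~is_sender, "text"]),
    })

new_df = df.groupby("dialogueID").apply(split_roles)
# new_df is indexed by dialogueID, so checking whether a dialogue
# already exists is simply: some_id in new_df.index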

Answered By: Yuri Feldman