Concatenate rows based on name with pandas dataframe

Question:

I am trying to concatenate rows of data from this source

https://raw.githubusercontent.com/dherman/wc-demo/master/data/shakespeare-plays.csv

There are lines from the same speaker but they’re broken into different rows in the dataframe. I’m trying to concatenate those speaker blocks into one row instead of 2+ rows.

Here’s what I’ve tried but it doesn’t work. I’m still learning pandas and python.

url = (r'https://raw.githubusercontent.com/dherman/wc-demo/master/data/shakespeare-plays.csv')
data = pd.read_csv(url, on_bad_lines='skip')
data.drop('I', inplace=True, axis=1)
data.drop('I.1', inplace=True, axis=1)
data.rename(columns={'In delivering my son from me, I bury a second husband.': 'Text', 'COUNTESS': 'Speaker'}, inplace=True)
data = data.groupby(['Speaker'])['Text'].apply(' '.join).reset_index()
Asked By: EthanMcQ

||

Answers:

A little ugly, but you can group by consecutive values using a helper series based on shift() and cumsum(). Then aggregating in the group by:

df = pd.read_csv('https://raw.githubusercontent.com/dherman/wc-demo/master/data/shakespeare-plays.csv',on_bad_lines='skip', names=['act','scene','char','line'])
g = df['char'].ne(df['char'].shift()).cumsum().rename('speakernumber')
df = df.groupby(g).agg({'act':'first', 'scene':'first', 'char':'first', 'line': ' '.join}).reset_index()

I believe this will work across acts and scenes as well althought I didn’t dig in deep enough to test.

Answered By: JNevill
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.