Pandas: splitting the strings in a column on " | ", then listing every string once with duplicates removed

Question:

I’m working on a data frame for a made-up TV show. The data frame has these columns: "Season", "EpisodeTitle", "About", "Ratings", "Votes", "Viewership", "Duration", "Date", "GuestStars", "Director", "Writers", with rows indexed in ascending numerical order.

My problem relates to two columns: ‘Writers’ and ‘Viewership’. In the Writers column, some rows list multiple writers, separated by " | ". In the Viewership column, each row holds a float value between 1 and 23, with at most 2 decimal places.

Here’s a condensed example of the data frame I’m working with. I am trying to filter the "Writers" column, and then determine the total average viewership for each individual writer:

df = pd.DataFrame({'Writers' : ['John Doe','Jennifer Hopkins | John Doe','Ginny Alvera','Binny Glasglow | Jennifer Hopkins','Jennifer Hopkins','Sam Write','Lawrence Fieldings | Ginny Alvera | John Doe','John Doe'], 'Viewership' : ['3.4','5.26','22.82','13.5','4.45','7.44','9']})

The solution I came up with to split the column strings:

df["Writers"]= df["Writers"].str.split('|', expand=False)

This does split the string, but in some cases it leaves whitespace before or after the individual names in the resulting lists. I need the whitespace removed, and then I need to list all writers, with each writer listed only once.

Second, for each individual writer, I would like either a column giving their average viewership, or a list of writers showing each one's average viewership across all episodes they worked on:

["John Doe : 15" , "Jennifer Hopkins : 7.54" , "Lawrence Fieldings : 3.7"]

This is my first post here, I really appreciate any help!

Asked By: mattgg01

||

Answers:

# I believe in newer versions of pandas (0.25 and later) you can split cells into multiple rows with explode()
# reference: https://pandas.pydata.org/pandas-docs/stable/whatsnew/v0.25.0.html#series-explode-to-split-list-like-values-to-rows

df2 = df.assign(Writers=df.Writers.str.split('|')).explode('Writers').reset_index(drop=True)

# to remove the whitespace, use str.strip()
# this removes whitespace at the beginning and end of every cell in that column
df2['Writers'] = df2['Writers'].str.strip()
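
Listing each writer only once can be sketched as follows. This is a minimal example under assumptions: the three-row frame is hypothetical sample data rather than the asker's full data frame, and the names writers and unique_writers are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data (an assumption, not the asker's full frame).
df = pd.DataFrame({
    "Writers": ["John Doe", "Jennifer Hopkins | John Doe", "Ginny Alvera"],
    "Viewership": [3.4, 5.26, 22.82],
})

# Split on "|", give each writer its own row, then trim surrounding whitespace.
writers = df["Writers"].str.split("|").explode().str.strip()

# unique() drops repeats and keeps the order of first appearance.
unique_writers = writers.unique().tolist()
print(unique_writers)  # ['John Doe', 'Jennifer Hopkins', 'Ginny Alvera']
```

Stripping after the split matters here: without it, " John Doe" and "John Doe" would count as two different writers.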

# if you want to remove duplicates, do a groupby on the column (the name is case-sensitive: 'Writers')
# this combines (sums) the duplicate rows; you can use any other aggregation
# function as well (for the average, replace sum() with mean())
df2.groupby(['Writers']).sum()
Answered By: Ahmed Sayed
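
Since the question asks for an average per writer, the steps above might be combined as in the sketch below. It rests on several assumptions: the four-row frame is hypothetical sample data, Viewership is stored as strings (as in the question's example) and so is converted with pd.to_numeric first, and the names avg and summary are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data (an assumption); Viewership is stored as strings.
df = pd.DataFrame({
    "Writers": ["John Doe", "Jennifer Hopkins | John Doe",
                "Ginny Alvera", "Jennifer Hopkins"],
    "Viewership": ["3.4", "5.26", "22.82", "4.45"],
})

df2 = (
    df.assign(
        Writers=df["Writers"].str.split("|"),        # string -> list of names
        Viewership=pd.to_numeric(df["Viewership"]),  # string -> float
    )
    .explode("Writers")       # one row per (episode, writer) pair
    .reset_index(drop=True)
)
df2["Writers"] = df2["Writers"].str.strip()  # drop the spaces left around "|"

# Mean viewership over every episode each writer worked on.
avg = df2.groupby("Writers")["Viewership"].mean()

# Optional: the "name : value" list format shown in the question.
summary = [f"{name} : {value:.2f}" for name, value in avg.items()]
print(summary)
```

Without the numeric conversion, mean() would have no numbers to average, and sum() on a string column concatenates the strings instead of adding them.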