How to enrich a dataframe by adding columns under a specific condition

Question:

I have two different datasets:

users:

+-------+---------+--------+
|user_id| movie_id|timestep|
+-------+---------+--------+
|   100 |   1000  |20200728|
|   101 |   1001  |20200727|
|   101 |   1002  |20200726|
+-------+---------+--------+

movies:

+--------+---------+--------------------------+
|movie_id|  title  |         genre            |
+--------+---------+--------------------------+
|   1000 |Toy Story|Adventure|Animation|Chil..|
|   1001 | Jumanji |Adventure|Children|Fantasy|
|   1002 | Iron Man|Action|Adventure|Sci-Fi   |
+--------+---------+--------------------------+

How can I get a dataset in the following format, so that I can build each user’s taste profile and compare users by their similarity score?

+-------+------+---------+---------+--------+-----+
|user_id|Action|Adventure|Animation|Children|Drama|
+-------+------+---------+---------+--------+-----+
|   100 |   0  |    1    |    1    |    1   |  0  |
|   101 |   1  |    1    |    0    |    1   |  0  |
+-------+------+---------+---------+--------+-----+
Asked By: Azamat

Answers:

  • Here, df is the movies dataframe and dfu is the users dataframe.
  • Split the 'genre' column into lists with pandas.Series.str.split, then use pandas.DataFrame.explode to turn each list element into its own row.
  • Use pandas.merge to join the two dataframes on 'movie_id'.
  • Use pandas.DataFrame.groupby on 'user_id' and 'genre' and aggregate by count.
  • Shape final:
    • .unstack converts the grouped result from long to wide format
    • .fillna replaces NaN with 0
    • .astype changes the numeric values from float to int
  • Tested in python 3.10, pandas 1.4.3
import pandas as pd

# data
movies = {'movie_id': [1000, 1001, 1002],
          'title': ['Toy Story', 'Jumanji', 'Iron Man'],
          'genre': ['Adventure|Animation|Children', 'Adventure|Children|Fantasy', 'Action|Adventure|Sci-Fi']}

users = {'user_id': [100, 101, 101],
         'movie_id': [1000, 1001, 1002],
         'timestep': [20200728, 20200727, 20200726]}

# set up dataframes
df = pd.DataFrame(movies)
dfu = pd.DataFrame(users)

# split the genre column strings at '|' to make lists
df.genre = df.genre.str.split('|')

# explode the lists in genre
df = df.explode('genre', ignore_index=True)

# merge df with dfu
dfm = pd.merge(dfu, df, on='movie_id')

# groupby, count and unstack
final = dfm.groupby(['user_id', 'genre'])['genre'].count().unstack(level=1).fillna(0).astype(int)

# display(final)
genre    Action  Adventure  Animation  Children  Fantasy  Sci-Fi
user_id                                                         
100           0          1          1         1        0       0
101           1          2          0         1        1       1
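
Note that the output above holds counts (user 101 has Adventure = 2), whereas the question asks for 0/1 flags. A minimal sketch of one way to get flags, assuming the same sample data: cap every count at 1 with DataFrame.clip. The names exploded, merged, counts and profile are illustrative only, and .size() with unstack(fill_value=0) is used here as a shorter stand-in for the count/fillna/astype chain.

```python
import pandas as pd

# same sample data as above
movies = pd.DataFrame({
    'movie_id': [1000, 1001, 1002],
    'title': ['Toy Story', 'Jumanji', 'Iron Man'],
    'genre': ['Adventure|Animation|Children',
              'Adventure|Children|Fantasy',
              'Action|Adventure|Sci-Fi'],
})
users = pd.DataFrame({
    'user_id': [100, 101, 101],
    'movie_id': [1000, 1001, 1002],
    'timestep': [20200728, 20200727, 20200726],
})

# one row per (movie, genre) pair
exploded = movies.assign(genre=movies['genre'].str.split('|')).explode('genre')

# attach genres to each user's watched movies
merged = users.merge(exploded, on='movie_id')

# count genres per user; missing combinations become 0
counts = merged.groupby(['user_id', 'genre']).size().unstack(fill_value=0)

# cap each count at 1 to get a watched / not watched flag
profile = counts.clip(upper=1)
print(profile)
```

A genre that none of the listed users has watched (Drama in the question's example) will not appear as a column; if a fixed set of genres is needed, profile.reindex(columns=..., fill_value=0) is one way to add the missing ones.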
Answered By: Trenton McKinney