Elegant way to join dataframes by name and autofill blank values with pandas, then create a coloured scatter plot according to value using matplotlib

Question:

I am wondering whether there is an elegant and efficient way to achieve what the title states.

import pandas as pd

data1 = pd.DataFrame([['ad_001','50'], ['ad_002', '100'], ['ad_003', '150']],columns=['name', 'score'])
  
data2 = pd.DataFrame([['ad_001','75'], ['ad_002', '200'], ['ad_004', '100']],columns=['name', 'score'])

I tried using

data1.merge(data2, how='left', left_on='name', right_on='name')

to merge the two dataframes.

My aim is to join the dataframes above and auto-fill the missing values, so that the result looks like this:

data1 = pd.DataFrame([['ad_001','50','75'], ['ad_002', '100', '200'], ['ad_003', '150', '0'], ['ad_004', '0', '100']],columns=['name', 'score_x','score_y'])

Then I want to show a scatterplot of the data using matplotlib and colour each point according to the maximum score of x and y.

  • if x or y >100, colour red
  • if x or y >150, colour green
  • if x or y >200, colour red.

I tried looking at the user guide
[https://matplotlib.org/stable/gallery/lines_bars_and_markers/scatter_with_legend.html#sphx-glr-gallery-lines-bars-and-markers-scatter-with-legend-py] but do not quite know how to implement it.

Or are there any other Python plotting modules that one would recommend to achieve the same outcome?

Answers:

For the first part, merging the two dataframes, one way to do this is to use merge with how='outer' so that every name from both dataframes is captured. The result includes all rows, with NaN where no data is available. Using .fillna(0) will handle this, based on how you mentioned you want the missing numbers to appear.

For the conditions and plotting, the simplest way would be to use something like np.where(), which you can use to assign the colors you want. As your question had red for two conditions, I have kept one as red and made the other blue. You can adjust the numbers and colors as you need. Once the column with colors is available, using groupby() and plotting each group will give you the results you need. Hope this helps…

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

data1 = pd.DataFrame([['ad_001','50'], ['ad_002', '100'], ['ad_003', '150']],columns=['name', 'score'])
data2 = pd.DataFrame([['ad_001','75'], ['ad_002', '200'], ['ad_004', '100']],columns=['name', 'score'])

## Outer merge keeps every name; a score missing on one side becomes NaN
newdata = pd.merge(data1, data2, on="name", how='outer')
## Convert the string scores to numbers, fill the gaps with 0, then cast to int for comparing
for col in ['score_x', 'score_y']:
    newdata[col] = pd.to_numeric(newdata[col]).fillna(0).astype('int64')

## Use np.where to create color column with the colors you need
newdata['color'] = np.where(((newdata.score_x<100) & (newdata.score_y<100)), 'red',
                       np.where(((newdata.score_x<150) & (newdata.score_y<150)), 'green', 'blue'))

## Group by color and plot each group in its own color
fig, ax = plt.subplots()
for clr, d in newdata.groupby('color'):
    ax.scatter(x=d['score_x'], y=d['score_y'], color=clr, label=clr)
ax.legend()
plt.show()
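
Since the question asks for the colour to depend on the maximum of the two scores, here is a minimal sketch of that variant. It assumes np.select() in place of nested np.where() calls, and the thresholds (100 and 150), the colour names and the max_score column are placeholders chosen for illustration, not values confirmed by the question.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

data1 = pd.DataFrame([['ad_001', '50'], ['ad_002', '100'], ['ad_003', '150']], columns=['name', 'score'])
data2 = pd.DataFrame([['ad_001', '75'], ['ad_002', '200'], ['ad_004', '100']], columns=['name', 'score'])

# Scores are stored as strings, so convert them to numbers before merging
data1['score'] = pd.to_numeric(data1['score'])
data2['score'] = pd.to_numeric(data2['score'])

# Outer merge keeps every name; scores missing on one side become NaN, filled with 0
merged = pd.merge(data1, data2, on='name', how='outer').fillna(0)
merged = merged.astype({'score_x': 'int64', 'score_y': 'int64'})

# Larger of the two scores in each row (max_score is a hypothetical helper column)
merged['max_score'] = merged[['score_x', 'score_y']].max(axis=1)

# Assumed thresholds and colours; np.select checks conditions in order, first match wins
conditions = [merged['max_score'] > 150, merged['max_score'] > 100]
colours = ['red', 'green']
merged['color'] = np.select(conditions, colours, default='blue')

# One scatter call per colour so that each group gets its own legend entry
fig, ax = plt.subplots()
for clr, d in merged.groupby('color'):
    ax.scatter(d['score_x'], d['score_y'], color=clr, label=clr)
ax.set_xlabel('score_x')
ax.set_ylabel('score_y')
ax.legend()
# plt.show()  # uncomment to display the figure interactively
```

Adding another colour band only requires one more entry in conditions and colours, with the highest threshold listed first.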


Answered By: Redox