How can I connect two dataframes based on multiple criteria in Python?

Question:

I have the following two dataframes:

Shortlist:

ticker     date     open    high    low     close   volume
ABC     2000-12-29  0.450   0.455   0.445   0.455   205843.0
ABC     2001-01-31  0.410   0.410   0.405   0.410   381500.0
ABC     2001-02-28  0.380   0.405   0.380   0.400   318384.0
...
ABC     2001-06-30  0.430   0.445   0.430   0.440   104016.0

MCap

Code    EOM       mcRank    MktCap
ABC    29/12/2000   74     1563.967892
ABC    31/03/2001   98     998.156279
ABC    30/06/2001   59     2035.603350

I now want to create a new table that adds the mcRank and MktCap columns from the MCap dataframe to the Shortlist dataframe, where the Code and the date match. If a date in Shortlist falls between two dates in MCap, it should use the values from the last known date.

The result should look like this:

ticker     date     open    high    low     close   volume    mcRank    MktCap
ABC     2000-12-29  0.450   0.455   0.445   0.455   205843.0   74     1563.967892
ABC     2001-01-31  0.410   0.410   0.405   0.410   381500.0   74     1563.967892   
ABC     2001-02-28  0.380   0.405   0.380   0.400   318384.0   74     1563.967892
...
ABC     2001-06-30  0.430   0.445   0.430   0.440   104016.0   59     2035.603350

I’ve tried pd.concat and pd.merge – but can’t seem to get the right results.

Asked By: daveskis

Answers:

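What you want to do is:

First, align both date formats. You can process the dates as strings, which makes it easier, as long as both columns end up in the same format (ISO YYYY-MM-DD strings also sort chronologically).

Second, pd.merge them using left_on, right_on and how='outer' to keep every row from both frames, which PURPOSELY creates NA values wherever a Shortlist date has no matching MCap row.

Then sort by ticker and date and use DataFrame.ffill() (the older spelling fillna(method='ffill') is deprecated in recent pandas) to fill the NA values from the previous row.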
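
The merge-then-fill steps above can be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: it uses only the four Shortlist rows shown in the question (with most price columns left out), converts the dates to datetime rather than strings, and renames the MCap key columns so both frames share `ticker` and `date`. The `indicator=True` flag is an addition here, used to drop the MCap-only rows after filling.

```python
import pandas as pd

# Toy data taken from the rows shown in the question (columns trimmed).
shortlist = pd.DataFrame({
    "ticker": ["ABC"] * 4,
    "date": ["2000-12-29", "2001-01-31", "2001-02-28", "2001-06-30"],
    "close": [0.455, 0.410, 0.400, 0.440],
})
mcap = pd.DataFrame({
    "Code": ["ABC"] * 3,
    "EOM": ["29/12/2000", "31/03/2001", "30/06/2001"],
    "mcRank": [74, 98, 59],
    "MktCap": [1563.967892, 998.156279, 2035.603350],
})

# Step 1: bring both date columns to the same dtype.
shortlist["date"] = pd.to_datetime(shortlist["date"], format="%Y-%m-%d")
mcap["EOM"] = pd.to_datetime(mcap["EOM"], format="%d/%m/%Y")
mcap = mcap.rename(columns={"Code": "ticker", "EOM": "date"})

# Step 2: outer merge keeps every date from both frames and creates NA gaps.
merged = shortlist.merge(mcap, on=["ticker", "date"], how="outer", indicator=True)

# Step 3: sort chronologically, then forward-fill within each ticker.
merged = merged.sort_values(["ticker", "date"])
cols = ["mcRank", "MktCap"]
merged[cols] = merged.groupby("ticker")[cols].ffill()

# Drop the rows that exist only in MCap (e.g. 2001-03-31).
result = (
    merged[merged["_merge"] != "right_only"]
    .drop(columns="_merge")
    .reset_index(drop=True)
)
print(result)
```

Grouping by ticker before the forward fill is a precaution so that values from one ticker do not spill into the next when the frames hold more than one.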

Answered By: Michael Hsi

Well, this seems like a merge task, but first make sure the EOM and date columns actually have the same dtype (datetime).

shortlist['date'] = pd.to_datetime(shortlist['date'], format='%Y-%m-%d')
MCap['EOM'] = pd.to_datetime(MCap['EOM'], format='%d/%m/%Y')

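And then do the merge (this won't work if ticker or Code are the indexes; if they are, reset the index first, i.e. shortlist.reset_index(inplace=True)). Note that an exact-match left merge leaves mcRank and MktCap as NaN for any date without a matching EOM, so those rows still need a forward fill afterwards:

new_df = shortlist.merge(MCap, how='left', left_on=['ticker', 'date'], right_on=['Code', 'EOM']).reset_index(drop=True)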
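
For the "use the last known date" requirement, one alternative to an exact-match merge is `pd.merge_asof`, which matches each left row to the most recent right row at or before it. The sketch below is not what this answer proposes; it is a possible variant, assuming the toy rows from the question, both frames sorted by their date columns, and the MCap key renamed to `ticker` for the `by` argument.

```python
import pandas as pd

# Toy data taken from the rows shown in the question (columns trimmed).
shortlist = pd.DataFrame({
    "ticker": ["ABC"] * 4,
    "date": ["2000-12-29", "2001-01-31", "2001-02-28", "2001-06-30"],
    "close": [0.455, 0.410, 0.400, 0.440],
})
MCap = pd.DataFrame({
    "Code": ["ABC"] * 3,
    "EOM": ["29/12/2000", "31/03/2001", "30/06/2001"],
    "mcRank": [74, 98, 59],
    "MktCap": [1563.967892, 998.156279, 2035.603350],
})

shortlist["date"] = pd.to_datetime(shortlist["date"], format="%Y-%m-%d")
MCap["EOM"] = pd.to_datetime(MCap["EOM"], format="%d/%m/%Y")

# merge_asof requires both frames to be sorted by the "on" columns.
left = shortlist.sort_values("date")
right = MCap.rename(columns={"Code": "ticker"}).sort_values("EOM")

# direction="backward" picks the latest EOM that is <= each shortlist date,
# matched separately for each ticker.
new_df = pd.merge_asof(
    left,
    right,
    left_on="date",
    right_on="EOM",
    by="ticker",
    direction="backward",
)
print(new_df)
```

The EOM column is kept in the output so it is visible which snapshot each row was matched to; it can be dropped once the result has been checked.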
Answered By: jcaliz

You might have to break down the steps: first merge the two dataframes on the dates (I use the join function), then fill in the null values with the values from the oldest date in mcap (I am using your expected output as a guide):

Convert to datetime and set index:

df['date'] = pd.to_datetime(df['date'], format = '%Y-%m-%d')
df = df.set_index('date')
mcap['EOM'] = pd.to_datetime(mcap['EOM'], format = '%d/%m/%Y')  # EOM is day-first
mcap = mcap.set_index("EOM")

Combine the dataframes:

res = df.join(mcap)

Get the indices for the null rows:

indices = res[res.isna().any(axis=1)].index

Get the values from mcap for the oldest date:

latest_mcap = mcap.loc[mcap.index.min()].tolist()

Assign latest_mcap to the null values in res:

res.loc[indices,['Code','mcRank','MktCap']] = latest_mcap

           ticker  open   high    low  close    volume Code  mcRank       MktCap
date
2000-12-29    ABC  0.45  0.455  0.445  0.455  205843.0  ABC    74.0  1563.967892
2001-01-31    ABC  0.41  0.410  0.405  0.410  381500.0  ABC    74.0  1563.967892
2001-02-28    ABC  0.38  0.405  0.380  0.400  318384.0  ABC    74.0  1563.967892
2001-06-30    ABC  0.43  0.445  0.430  0.440  104016.0  ABC    59.0  2035.603350
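
Filling every gap with the oldest mcap row matches the sample only because all unmatched dates fall before the second mcap date. A possible index-based variant that picks the last known mcap row for each date is sketched below. It assumes a single ticker and the toy rows from the question, and it swaps the manual assignment for `reindex(..., method="ffill")`, which needs the mcap index to be sorted.

```python
import pandas as pd

# Toy data taken from the rows shown in the question (columns trimmed).
df = pd.DataFrame({
    "ticker": ["ABC"] * 4,
    "date": ["2000-12-29", "2001-01-31", "2001-02-28", "2001-06-30"],
    "close": [0.455, 0.410, 0.400, 0.440],
})
mcap = pd.DataFrame({
    "Code": ["ABC"] * 3,
    "EOM": ["29/12/2000", "31/03/2001", "30/06/2001"],
    "mcRank": [74, 98, 59],
    "MktCap": [1563.967892, 998.156279, 2035.603350],
})

df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d")
df = df.set_index("date")
mcap["EOM"] = pd.to_datetime(mcap["EOM"], format="%d/%m/%Y")
mcap = mcap.set_index("EOM").sort_index()

# For each date in df, take the mcap row with the latest EOM at or before it.
aligned = mcap[["mcRank", "MktCap"]].reindex(df.index, method="ffill")

# Both frames now share df's index, so a plain join lines them up.
res = df.join(aligned)
print(res)
```

With several tickers in one frame this would have to be applied per ticker, for example inside a groupby.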
Answered By: sammywemmy