How can I connect two dataframes based on multiple criteria in Python?
Question:
I have the following two dataframes:
Shortlist:
ticker date open high low close volume
ABC 2000-12-29 0.450 0.455 0.445 0.455 205843.0
ABC 2001-01-31 0.410 0.410 0.405 0.410 381500.0
ABC 2001-02-28 0.380 0.405 0.380 0.400 318384.0
...
ABC 2001-06-30 0.430 0.445 0.430 0.440 104016.0
MCap:
Code EOM mcRank MktCap
ABC 29/12/2000 74 1563.967892
ABC 31/03/2001 98 998.156279
ABC 30/06/2001 59 2035.603350
I now want to create a new table that adds the mcRank and MktCap columns from the MCap dataframe to the Shortlist dataframe, where the Code and the date match. If a date in Shortlist falls between two dates in MCap, it should use the values from the last known date.
The result should look like this:
ticker date open high low close volume mcRank MktCap
ABC 2000-12-29 0.450 0.455 0.445 0.455 205843.0 74 1563.967892
ABC 2001-01-31 0.410 0.410 0.405 0.410 381500.0 74 1563.967892
ABC 2001-02-28 0.380 0.405 0.380 0.400 318384.0 74 1563.967892
...
ABC 2001-06-30 0.430 0.445 0.430 0.440 104016.0 59 2035.603350
I’ve tried pd.concat and pd.merge – but can’t seem to get the right results.
Answers:
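What you want to do is:
First, align both date formats. You can process the dates as strings, which makes it easier, as long as both columns end up in the same sortable format such as YYYY-MM-DD (converting both to datetime works too).
Second, pd.merge them, using left_on, right_on and how='outer' to merge everything and PURPOSELY create NA values.
Then sort by date and use DataFrame.ffill() (fillna(method='ffill') in older pandas versions) to fill the NA values based on the previous values.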
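The three steps above can be sketched roughly as follows. This is only a minimal illustration, not necessarily what the answerer had in mind: it uses datetimes instead of strings, keeps just a subset of the Shortlist columns, leaves out the rows elided by "..." in the question, and the names mcap_keys, merged and result are made up for the example.

```python
import pandas as pd

# Sample rows from the question (subset of columns, elided rows left out)
shortlist = pd.DataFrame({
    "ticker": ["ABC"] * 4,
    "date": ["2000-12-29", "2001-01-31", "2001-02-28", "2001-06-30"],
    "close": [0.455, 0.410, 0.400, 0.440],
})
mcap = pd.DataFrame({
    "Code": ["ABC"] * 3,
    "EOM": ["29/12/2000", "31/03/2001", "30/06/2001"],
    "mcRank": [74, 98, 59],
    "MktCap": [1563.967892, 998.156279, 2035.603350],
})

# Step 1: align the date formats
shortlist["date"] = pd.to_datetime(shortlist["date"], format="%Y-%m-%d")
mcap["EOM"] = pd.to_datetime(mcap["EOM"], format="%d/%m/%Y")

# Give both frames the same key names so the outer merge yields one key pair
mcap_keys = mcap.rename(columns={"Code": "ticker", "EOM": "date"})

# Step 2: outer merge, deliberately creating NaN where the dates do not match
merged = shortlist.merge(mcap_keys, on=["ticker", "date"], how="outer", indicator=True)

# Step 3: sort chronologically and forward-fill within each ticker
merged = merged.sort_values(["ticker", "date"])
value_cols = ["mcRank", "MktCap"]
merged[value_cols] = merged.groupby("ticker")[value_cols].ffill()

# Drop the rows that exist only in MCap, keeping the original Shortlist rows
result = (
    merged[merged["_merge"] != "right_only"]
    .drop(columns="_merge")
    .reset_index(drop=True)
)
print(result)
```

The indicator column is there only so the MCap-only rows (here 31/03/2001) can be dropped again after they have taken part in the forward fill.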
Well, this seems like a merge task, but first make sure the EOM and date columns are actually the same dtype (datetime).
shortlist['date'] = pd.to_datetime(shortlist['date'], format='%Y-%m-%d')
MCap['EOM'] = pd.to_datetime(MCap['EOM'], format='%d/%m/%Y')
And then do the merge (this won't work if ticker or Code are the indexes; if they are, reset the index first, i.e. shortlist.reset_index(inplace=True)):
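new_df = shortlist.merge(MCap, how='left', left_on=['ticker', 'date'], right_on=['Code', 'EOM']).reset_index(drop=True)
Note that a left merge only fills rows whose date matches an EOM exactly; the in-between dates come back as NaN and still have to be filled from the last known MCap row.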
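An exact-key merge only matches rows whose dates are identical, so for the "use the last known date" requirement it may be worth looking at pd.merge_asof instead. This is a different technique from the merge shown above, sketched here on a subset of the question's columns (elided rows left out); the variable name result is just for illustration.

```python
import pandas as pd

# Sample rows from the question (subset of columns, elided rows left out)
shortlist = pd.DataFrame({
    "ticker": ["ABC"] * 4,
    "date": ["2000-12-29", "2001-01-31", "2001-02-28", "2001-06-30"],
    "close": [0.455, 0.410, 0.400, 0.440],
})
MCap = pd.DataFrame({
    "Code": ["ABC"] * 3,
    "EOM": ["29/12/2000", "31/03/2001", "30/06/2001"],
    "mcRank": [74, 98, 59],
    "MktCap": [1563.967892, 998.156279, 2035.603350],
})

shortlist["date"] = pd.to_datetime(shortlist["date"], format="%Y-%m-%d")
MCap["EOM"] = pd.to_datetime(MCap["EOM"], format="%d/%m/%Y")

# merge_asof needs both frames sorted by their date key.
# direction="backward" picks, per ticker, the most recent EOM <= date.
result = pd.merge_asof(
    shortlist.sort_values("date"),
    MCap.sort_values("EOM"),
    left_on="date",
    right_on="EOM",
    left_by="ticker",
    right_by="Code",
    direction="backward",
)

# Drop the right-hand key columns if they were carried into the result
result = result.drop(columns=["Code", "EOM"], errors="ignore")
print(result)
```

Shortlist rows dated before the first MCap entry for a ticker would get NaN here, since there is no earlier value to carry forward.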
You might have to break down the steps: first merge the two dataframes on the dates (I use the join function), then fill in the null values with the values from the oldest date in mcap (I am using your expected output as a guide):
Convert to datetime and set index:
df['date'] = pd.to_datetime(df['date'], format = '%Y-%m-%d')
df = df.set_index('date')
mcap['EOM'] = pd.to_datetime(mcap['EOM'], format='%d/%m/%Y')
mcap = mcap.set_index("EOM")
Combine the dataframes:
res = df.join(mcap)
Get the indices for the null rows:
indices = res[res.isna().any(axis=1)].index
Get the values from mcap for the oldest date:
oldest_mcap = mcap.loc[mcap.index.min()].tolist()
Assign oldest_mcap to the null rows in res:
res.loc[indices,['Code','mcRank','MktCap']] = oldest_mcap
ticker open high low close volume Code mcRank MktCap
date
2000-12-29 ABC 0.45 0.455 0.445 0.455 205843.0 ABC 74.0 1563.967892
2001-01-31 ABC 0.41 0.410 0.405 0.410 381500.0 ABC 74.0 1563.967892
2001-02-28 ABC 0.38 0.405 0.380 0.400 318384.0 ABC 74.0 1563.967892
2001-06-30 ABC 0.43 0.445 0.430 0.440 104016.0 ABC 59.0 2035.603350
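Filling every gap with the oldest mcap row reproduces the expected output for this sample, but a date after 31/03/2001 would presumably need the values from that row rather than from 29/12/2000. One possible generalisation for a single ticker, offered only as a sketch, is to reindex mcap onto the Shortlist dates with method="ffill". The name aligned is made up for the example, and the data is again a subset of the question's columns without the elided rows.

```python
import pandas as pd

# Sample rows from the question (subset of columns, elided rows left out)
df = pd.DataFrame({
    "ticker": ["ABC"] * 4,
    "date": ["2000-12-29", "2001-01-31", "2001-02-28", "2001-06-30"],
    "close": [0.455, 0.410, 0.400, 0.440],
})
mcap = pd.DataFrame({
    "Code": ["ABC"] * 3,
    "EOM": ["29/12/2000", "31/03/2001", "30/06/2001"],
    "mcRank": [74, 98, 59],
    "MktCap": [1563.967892, 998.156279, 2035.603350],
})

# Convert to datetime and set the dates as index, as in the steps above
df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d")
df = df.set_index("date")
mcap["EOM"] = pd.to_datetime(mcap["EOM"], format="%d/%m/%Y")
mcap = mcap.set_index("EOM").sort_index()

# For each Shortlist date, take the last mcap row at or before that date.
# Assumes a single ticker, so the mcap index is unique and sorted.
aligned = mcap[["mcRank", "MktCap"]].reindex(df.index, method="ffill")

res = df.join(aligned)
print(res)
```

With several tickers in the same frame the mcap index would no longer be unique, so this would have to be applied per ticker.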