Joining pandas DataFrames by Column names

Question:

I have two DataFrames with the following column names:

frame_1:
event_id, date, time, county_ID

frame_2:
countyid, state

I would like to get a DataFrame with the following columns by left-joining on county_ID = countyid:

joined_dataframe:
event_id, date, time, county, state

I cannot figure out how to do it if the columns on which I want to join are not the index.

Asked By: Alexis Eggermont

||

Answers:

You can use the left_on and right_on options of pd.merge as follows:

pd.merge(frame_1, frame_2, left_on='county_ID', right_on='countyid')

Or equivalently with DataFrame.merge:

frame_1.merge(frame_2, left_on='county_ID', right_on='countyid')

I was not sure from the question if you only wanted to merge if the key was in the left hand DataFrame. If that is the case then the following will do that (the above will in effect do a many to many merge)

pd.merge(frame_1, frame_2, how='left', left_on='county_ID', right_on='countyid')

Or

frame_1.merge(frame_2, how='left', left_on='county_ID', right_on='countyid')
Answered By: Woody Pride

you need to make county_ID as index for the right frame:

frame_2.join ( frame_1.set_index( [ 'county_ID' ], verify_integrity=True ),
               on=[ 'countyid' ], how='left' )

for your information, in pandas left join breaks when the right frame has non unique values on the joining column. see this bug.

so you need to verify integrity before joining by , verify_integrity=True

Answered By: behzad.nouri

Merging using differently named columns duplicates columns; for example, after the call frame_1.merge(frame_2, how='left', left_on='county_ID', right_on='countyid'), both county_ID and countyid columns are created on the joined frame but they have exactly the same values for every row, so presumably only one of them is needed. To not have this problem from the beginning, rename the merge-key column be the same and merge on that column.

df1 = frame_1.rename(columns={'county_ID':'county'})
df2 = frame_2.rename(columns={'countyid':'county'})
joined_frame = df1.merge(df2, on='county', how='left')

res1

Also, if the second frame has only one new additional column (e.g. state) as in the OP, then you can map that column to frame_1 via the common column.

frame_1['state'] = frame_1['county_ID'].map(frame_2.set_index('countyid')['state'])

If county_ID and countyid were indices (not columns), then a straightforward join call does the job. By default, it performs left join.

joined_frame = frame_1.join(frame_2)

res2

One nice thing about join is that if you want to join multiple dataframes on index, then you can pass a list of dataframes and join efficiently (instead of multiple chained merge calls).

joined_frame = frame_1.join([frame_2, frame_3])
Answered By: cottontail
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.