How To Map Two Columns from One Dataset with One Column from Another Dataset?

Question:

I have two datasets:

import pandas as pd

df1 = pd.DataFrame({'id1': 'AAA ABC ACD ADE AEE AFG'.split(),
                    'id2': 'BBB BBC BCD BDE BEE BFG'.split()})

print(df1)

   id1  id2
0  AAA  BBB
1  ABC  BBC
2  ACD  BCD
3  ADE  BDE
4  AEE  BEE
5  AFG  BFG

-----------

df2 = pd.DataFrame({'student_id': 'ABC BBB AAA DEF AEE BEE'.split(),
                    'center': '11 22 33 44 55 66'.split()})

print(df2)

  student_id center
0        ABC     11
1        BBB     22
2        AAA     33
3        DEF     44
4        AEE     55
5        BEE     66

I need to match both the id1 and id2 columns of dataset 1 against the student_id column of dataset 2, keeping only the rows where both id1 and id2 are present in student_id. Finally, I want the mapped values from dataset 2 as separate columns, one set for id1 and one for id2.

I’m trying the following script and getting the desired output for my example:

map1 = df1.merge(df2, left_on='id1', right_on='student_id').drop(columns=['id2'])
map2 = df1.merge(df2, left_on='id2', right_on='student_id')
map1.merge(map2, on='id1') 

However, it neither scales nor gives the right output when the datasets are large. For example, with map1 at 100,000 rows and map2 at 70,000 rows, the result of joining them has close to 1 million rows. I also tried setting id1 as the index of both mapping datasets and joining them, but that didn’t scale either.
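The row growth described here is what a many-to-many join produces when the join key repeats on both sides: each key contributes (count on the left) × (count on the right) rows. A minimal sketch with hypothetical frames (`left`, `right`, and their values are made up for illustration and are not the question's data):

```python
import pandas as pd

# Hypothetical frames in which the join key id1 repeats on both sides
left = pd.DataFrame({'id1': ['A', 'A', 'B'], 'x': [1, 2, 3]})
right = pd.DataFrame({'id1': ['A', 'A', 'A', 'B'], 'y': [10, 20, 30, 40]})

joined = left.merge(right, on='id1')
n_rows = len(joined)  # 'A' contributes 2 * 3 rows, 'B' contributes 1 * 1

# The same size can be predicted from per-key counts before merging
predicted = (left['id1'].value_counts() * right['id1'].value_counts()).sum()

print(n_rows, int(predicted))
```

Running the same per-key count check on id1 in map1 and map2 would show whether repeated id1 values account for the inflated result.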

Desired Output

   id1  id2 student_id1 center_1 student_id2 center_2
0  AAA  BBB         AAA       33         BBB       22 # Both AAA, BBB present from dataset 1, with respective values from dataset 2
1  AEE  BEE         AEE       55         BEE       66 # Both AEE, BEE present from dataset 1, with respective values from dataset 2

What would be the better ways to do that? Any suggestions would be appreciated. Thanks!

Asked By: Roy


Answers:

It looks like you want to merge df2 onto df1 as you did, and then merge df2 onto that result a second time, on id2 instead of id1, like this:

map1 = df1.merge(df2, left_on='id1', right_on='student_id')
map2 = map1.merge(df2, left_on='id2', right_on='student_id', suffixes=['_1', '_2'])

This should produce the correct output, and it uses two merges instead of three. Because each merge joins directly against df2, the row count cannot grow as long as student_id is unique in df2.
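For reference, here is a self-contained sketch of this two-merge chain on the question's sample data. The `validate='many_to_one'` argument and the final `rename` are additions that are not part of the answer: the first asks pandas to raise a `MergeError` if student_id is duplicated in df2 (which would multiply rows), and the second only matches the column names in the desired output.

```python
import pandas as pd

df1 = pd.DataFrame({'id1': 'AAA ABC ACD ADE AEE AFG'.split(),
                    'id2': 'BBB BBC BCD BDE BEE BFG'.split()})
df2 = pd.DataFrame({'student_id': 'ABC BBB AAA DEF AEE BEE'.split(),
                    'center': '11 22 33 44 55 66'.split()})

# First merge: keep rows whose id1 appears in student_id
map1 = df1.merge(df2, left_on='id1', right_on='student_id',
                 validate='many_to_one')

# Second merge: of those, keep rows whose id2 also appears in student_id
result = map1.merge(df2, left_on='id2', right_on='student_id',
                    suffixes=('_1', '_2'), validate='many_to_one')

# Assumption: rename only to match the column names in the desired output
result = result.rename(columns={'student_id_1': 'student_id1',
                                'student_id_2': 'student_id2'})
print(result)
```

On the sample data this leaves the two rows AAA/BBB and AEE/BEE, with center values 33/22 and 55/66.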

Answered By: mpopken

To map both the id1 and id2 columns of dataset 1 against the student_id column of dataset 2, you can call merge twice, once for each column, and then merge the two results on the id1 column. Here’s example code; note that the final merge stays compact only if id1 values are not repeated:

# Merge id1 column
map1 = df1.merge(df2, left_on='id1', right_on='student_id', how='inner', suffixes=['', '_1'])

# Merge id2 column
map2 = df1.merge(df2, left_on='id2', right_on='student_id', how='inner', suffixes=['', '_2'])

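# Merge both mappings on id1 column; overlapping columns from map2 get the '_2' suffix
# (without explicit suffixes they would be named *_x / *_y and the selection below would fail)
result = map1.merge(map2, on='id1', how='inner', suffixes=['', '_2'])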

# Keep only desired columns
result = result[['id1', 'id2', 'student_id', 'center', 'student_id_2', 'center_2']]

# Rename columns
result.columns = ['id1', 'id2', 'student_id1', 'center_1', 'student_id2', 'center_2']

# Keep only rows where both id1 and id2 are present in student_id column
result = result[(result['student_id1'].notnull()) & (result['student_id2'].notnull())]

# Reset index if needed
result = result.reset_index(drop=True)


Here’s what the code does:

  1. First, it merges df1 with df2 on the id1 column and saves the result as map1.
  2. Then, it merges df1 with df2 on the id2 column and saves the result as map2.
  3. Next, it merges map1 and map2 on the id1 column and saves the result as result. It keeps only the desired columns and renames them to match your desired output.
  4. It then filters result to keep only the rows where both id1 and id2 are present in the student_id column.
  5. Finally, it resets the index if needed.

Note that ‘inner’ is already the default join type for merge; I spelled it out to make explicit that unmatched rows are dropped, which keeps the intermediate results small. I used the ‘suffixes’ parameter to distinguish between the two student_id columns.
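As a possible alternative to merging, the same filter-and-lookup can be sketched with `isin` and `Series.map`. This is not the approach used above, and it assumes student_id is unique in df2 (the `lookup` name is introduced here for illustration). Since the matched student_id values are identical to id1 and id2 themselves, only the center columns are added.

```python
import pandas as pd

df1 = pd.DataFrame({'id1': 'AAA ABC ACD ADE AEE AFG'.split(),
                    'id2': 'BBB BBC BCD BDE BEE BFG'.split()})
df2 = pd.DataFrame({'student_id': 'ABC BBB AAA DEF AEE BEE'.split(),
                    'center': '11 22 33 44 55 66'.split()})

# Lookup Series: student_id -> center (assumes student_id is unique in df2)
lookup = df2.set_index('student_id')['center']

# Keep only rows where both id1 and id2 exist in student_id
mask = df1['id1'].isin(lookup.index) & df1['id2'].isin(lookup.index)
result = df1.loc[mask].copy()

# Look up the center for each id column
result['center_1'] = result['id1'].map(lookup)
result['center_2'] = result['id2'].map(lookup)
result = result.reset_index(drop=True)
print(result)
```

Because no join is performed, the result can never have more rows than df1.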

I hope this helps!

Answered By: Ali Hassan