How To Map Two Columns from One Dataset with One Column from Another Dataset?
Question:
I have two datasets:
import pandas as pd
df1 = pd.DataFrame({'id1': 'AAA ABC ACD ADE AEE AFG'.split(),
                    'id2': 'BBB BBC BCD BDE BEE BFG'.split()})
print(df1)
   id1  id2
0  AAA  BBB
1  ABC  BBC
2  ACD  BCD
3  ADE  BDE
4  AEE  BEE
5  AFG  BFG
-----------
df2 = pd.DataFrame({'student_id': 'ABC BBB AAA DEF AEE BEE'.split(),
                    'center': '11 22 33 44 55 66'.split()})
print(df2)
  student_id center
0        ABC     11
1        BBB     22
2        AAA     33
3        DEF     44
4        AEE     55
5        BEE     66
I need to map both the id1 and id2 columns from dataset 1 against the student_id column of dataset 2, then keep only those rows where both id1 and id2 are present in the student_id column. Finally, I want their mapped values from dataset 2 as separate columns.
I’m trying the following script and getting the desired output for my example:
map1 = df1.merge(df2, left_on='id1', right_on='student_id').drop(columns=['id2'])
map2 = df1.merge(df2, left_on='id2', right_on='student_id')
map1.merge(map2, on='id1')
However, it neither scales nor gives the right output when the dataset is huge. For example, with map1 at 100,000 rows and map2 at 70,000 rows, the result of joining the two has close to 1 million rows. I also tried setting id1 as the index of both mapping datasets and joining them, but that didn’t scale either.
Desired Output
   id1  id2 student_id1 center_1 student_id2 center_2
0  AAA  BBB         AAA       33         BBB       22  # Both AAA, BBB present from dataset 1, with respective values from dataset 2
1  AEE  BEE         AEE       55         BEE       66  # Both AEE, BEE present from dataset 1, with respective values from dataset 2
What would be the better ways to do that? Any suggestions would be appreciated. Thanks!
Answers:
It looks like you want to merge df1 and df2 as you did, and then merge df2 onto the result again, this time on id2 instead of id1:
map1 = df1.merge(df2, left_on='id1', right_on='student_id')
map2 = map1.merge(df2, left_on='id2', right_on='student_id', suffixes=['_1', '_2'])
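This should produce the correct output, and it uses two merges instead of three. Because each merge is against df2, the row count cannot grow as long as student_id is unique in df2, which should help it scale.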
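As a self-contained sketch of the two-merge approach above, here it is run on the sample frames from the question. The `validate='many_to_one'` argument and the final rename are additions that are not in the original answer: the first assumes student_id is unique in df2 and makes pandas raise an error otherwise instead of silently multiplying rows, and the second only adjusts the names to match the desired output.

```python
import pandas as pd

df1 = pd.DataFrame({'id1': 'AAA ABC ACD ADE AEE AFG'.split(),
                    'id2': 'BBB BBC BCD BDE BEE BFG'.split()})
df2 = pd.DataFrame({'student_id': 'ABC BBB AAA DEF AEE BEE'.split(),
                    'center': '11 22 33 44 55 66'.split()})

# First lookup: keep rows of df1 whose id1 appears in df2 (inner join is the default).
map1 = df1.merge(df2, left_on='id1', right_on='student_id',
                 validate='many_to_one')

# Second lookup: of those rows, keep the ones whose id2 also appears in df2.
result = map1.merge(df2, left_on='id2', right_on='student_id',
                    suffixes=('_1', '_2'), validate='many_to_one')

# Optional (assumption): rename to the column names used in the desired output.
result = result.rename(columns={'student_id_1': 'student_id1',
                                'student_id_2': 'student_id2'})
print(result)
```

On the sample data this leaves the two rows AAA/BBB and AEE/BEE. If `validate` raises on real data, df2 contains repeated student_id values and would need deduplicating before the lookup.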
To map both the id1 and id2 columns from dataset 1 to the student_id column of dataset 2, you can also call merge twice, once for each column, and then merge the two results on the id1 column. Note that this final merge multiplies rows whenever an id1 value repeats, so it suits data where id1 is unique. Here’s some example code:
# Merge id1 column (id2 is dropped here so it appears only once in the result)
map1 = df1.merge(df2, left_on='id1', right_on='student_id', how='inner').drop(columns=['id2'])
# Merge id2 column
map2 = df1.merge(df2, left_on='id2', right_on='student_id', how='inner')
# Merge both mappings on id1 column; the suffixes separate the two lookups
result = map1.merge(map2, on='id1', how='inner', suffixes=['_1', '_2'])
# Keep only desired columns
result = result[['id1', 'id2', 'student_id_1', 'center_1', 'student_id_2', 'center_2']]
# Rename columns
result.columns = ['id1', 'id2', 'student_id1', 'center_1', 'student_id2', 'center_2']
# Keep only rows where both id1 and id2 are present in student_id column
# (a no-op after inner joins, kept as a safeguard)
result = result[(result['student_id1'].notnull()) & (result['student_id2'].notnull())]
# Reset index if needed
result = result.reset_index(drop=True)
Here’s what the code does:
- First, it merges df1 with df2 based on the id1 column, and saves the result as map1.
- Then, it merges df1 with df2 based on the id2 column, and saves the result as map2.
- Next, it merges map1 and map2 based on the id1 column, and saves the result as result. It keeps only the desired columns, and renames them to match your desired output.
- It then filters the result to keep only the rows where both id1 and id2 are present in the student_id column.
- Finally, it resets the index if needed.
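The merge of map1 and map2 on id1 in the steps above is the likely source of the row growth described in the question: if an id1 value occurs several times on both sides, every left occurrence is paired with every right occurrence. A toy illustration, using made-up frames `left` and `right` that are not part of the original data:

```python
import pandas as pd

# Hypothetical frames: the key 'AAA' appears 3 times on the left and 2 times on the right.
left = pd.DataFrame({'id1': ['AAA', 'AAA', 'AAA', 'AEE'],
                     'center_1': ['33', '33', '33', '55']})
right = pd.DataFrame({'id1': ['AAA', 'AAA', 'AEE'],
                      'center_2': ['22', '22', '66']})

joined = left.merge(right, on='id1')

# 3 * 2 rows for 'AAA' plus 1 * 1 row for 'AEE'
print(len(left), len(right), len(joined))  # 4 3 7

# Counting repeated keys beforehand shows whether a merge will multiply rows.
print(left['id1'].duplicated().sum())  # 2
```

Checking `duplicated().sum()` on the join key of both inputs before merging is a cheap way to see whether a result much larger than either input should be expected.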
Note that I set how='inner' explicitly, although ‘inner’ is already the default for merge, so that only matching rows are kept and the intermediate results stay small. I used the ‘suffixes’ parameter to distinguish between the two student_id columns.
I hope this helps!
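A merge-free variant may also be worth trying on large inputs. This is a different technique from the answers above, sketched under the assumption that student_id is unique in df2: build a lookup Series once, filter df1 with `isin`, and fill the center columns with `Series.map`. The name `lookup` is illustrative, and the output column names follow the desired output in the question.

```python
import pandas as pd

df1 = pd.DataFrame({'id1': 'AAA ABC ACD ADE AEE AFG'.split(),
                    'id2': 'BBB BBC BCD BDE BEE BFG'.split()})
df2 = pd.DataFrame({'student_id': 'ABC BBB AAA DEF AEE BEE'.split(),
                    'center': '11 22 33 44 55 66'.split()})

# Lookup from student_id to center; assumes student_id is unique in df2.
lookup = df2.set_index('student_id')['center']

# Keep rows where both ids exist in df2; the row count can only shrink here.
both = df1['id1'].isin(lookup.index) & df1['id2'].isin(lookup.index)
result = df1[both].reset_index(drop=True)

# student_id1 / student_id2 equal id1 / id2 by construction.
result['student_id1'] = result['id1']
result['center_1'] = result['id1'].map(lookup)
result['student_id2'] = result['id2']
result['center_2'] = result['id2'].map(lookup)
print(result)
```

Since no join is performed, the result never has more rows than df1, whatever duplicates df1 contains.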