Is there any way to match two different csv files with similar columns in python?

Question:

I am a beginner in numpy and I have two csv files that look like this:

csv1:

ID item_size Cost
0010 4.4
0010 5.5
0012 8
0012 10.1

csv2:

ID item_size Cost
0010 6.1 5
0010 7.2 2
0010 5.3 1
0010 3.1 3
0010 4.7 2
0012 7.6 5
0012 22 4
0012 13.1 2
0012 9.2 3
0012 11.1 3

The output should look like this:
csv_output:

ID item_size Cost
0010 4.4 2
0010 5.5 1
0012 8 5
0012 10.1 3

"The closest item_size value to 4.4 is 4.7; they also share the same ID, so the Cost column is filled with a 2."

The task is to match the IDs between csv1 and csv2, find the closest item_size, and fill in the cost, maintaining csv1 as the output csv.

So I have to fill the missing Cost values in csv1 based on the nearest item_size value between csv1 and csv2.
Also, the matched rows have to belong to the same ID.
I tried separating this problem into several sub-tasks: the nearest-value problem, the matching-ID problem, and the cost-filling problem, and then combining all of them.
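Those three sub-tasks can also be folded into a single pass: for each csv1 row, restrict csv2 to the rows with the same ID, pick the row with the nearest item_size, and copy its Cost. A minimal pure-Python sketch, with the sample data from the question hard-coded:

```python
# Combined sketch: for each csv1 row, find the same-ID csv2 row with the
# nearest item_size and take its Cost. Values copied from the sample data.
csv1_rows = [("0010", 4.4), ("0010", 5.5), ("0012", 8.0), ("0012", 10.1)]
csv2_rows = [("0010", 6.1, 5), ("0010", 7.2, 2), ("0010", 5.3, 1),
             ("0010", 3.1, 3), ("0010", 4.7, 2), ("0012", 7.6, 5),
             ("0012", 22.0, 4), ("0012", 13.1, 2), ("0012", 9.2, 3),
             ("0012", 11.1, 3)]

output = []
for id_, size in csv1_rows:
    # candidates sharing the same ID
    candidates = [(s, c) for i, s, c in csv2_rows if i == id_]
    # the nearest item_size wins; its Cost fills the missing value
    nearest_size, cost = min(candidates, key=lambda sc: abs(sc[0] - size))
    output.append((id_, size, cost))

print(output)
# [('0010', 4.4, 2), ('0010', 5.5, 1), ('0012', 8.0, 5), ('0012', 10.1, 3)]
```

This is O(len(csv1) * len(csv2)), which is fine for small files; the accepted answer below scales better.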

I already solved the nearest-value problem using this script:

import numpy as np

def closest_value(input_list, input_value):
    # return the element of input_list nearest to input_value
    arr = np.asarray(input_list)
    i = (np.abs(arr - input_value)).argmin()
    return arr[i]

# dataset_1 and dataset_2 are the DataFrames read from csv1 and csv2
list1 = dataset_1['item_size'].values.tolist()
list2 = dataset_2['item_size'].values.tolist()

for i in range(len(list2)):
    val = closest_value(list1, list2[i])
    print("The closest value to " + str(list2[i]) + " is", val)
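As a quick check, here is the nearest-value helper applied to the ID 0010 sizes from csv2 (a self-contained sketch; the sample values are hard-coded rather than read from the DataFrames):

```python
import numpy as np

def closest_value(input_list, input_value):
    # return the element of input_list nearest to input_value
    arr = np.asarray(input_list)
    i = np.abs(arr - input_value).argmin()
    return arr[i]

# item_size values for ID 0010 from the sample csv2
sizes = [6.1, 7.2, 5.3, 3.1, 4.7]
print(closest_value(sizes, 4.4))  # 4.7
print(closest_value(sizes, 5.5))  # 5.3
```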

I also tried solving the similar ID problem using this:

import pandas as pd

csv1 = pd.read_csv('csv1.csv')
csv2 = pd.read_csv('csv2.csv')

# keep only IDs that appear more than once
csv2 = csv2[csv2['ID'].duplicated(keep=False)]
# write one csv file per ID group
csv2.groupby('ID').apply(lambda x: x.to_csv(str(x.name) + '.csv'))

However, this method creates one csv file per ID, which is too computationally expensive. If there is another method to solve this, I will gladly take it. I have been trying to solve this problem for at least 5 days, so any help is appreciated. Best regards.

Asked By: Dude Rar


Answers:

You can use a merge_asof:

# df1/df2 are csv1/csv2 loaded as DataFrames
# save the index to restore it later
# sort the data for the merge_asof (required)
(pd.merge_asof(df1.reset_index().sort_values(by='item_size').drop(columns='Cost'),
               df2.sort_values(by='item_size'),
               by='ID', on='item_size', direction='nearest'
               )
   .set_index('index').sort_index()  # restore original order
 )

Output:

       ID  item_size  Cost
index                     
0      10        4.4     2
1      10        5.5     1
2      12        8.0     5
3      12       10.1     3
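For reference, a self-contained reproduction of this answer with the question's sample data (df1 and df2 built by hand here; ID is kept as a string to preserve the leading zeros, which read_csv would otherwise strip, as in the output above):

```python
import pandas as pd

# sample data from the question; Cost in df1 is missing and will be filled
df1 = pd.DataFrame({'ID': ['0010', '0010', '0012', '0012'],
                    'item_size': [4.4, 5.5, 8.0, 10.1],
                    'Cost': [None, None, None, None]})
df2 = pd.DataFrame({'ID': ['0010'] * 5 + ['0012'] * 5,
                    'item_size': [6.1, 7.2, 5.3, 3.1, 4.7,
                                  7.6, 22.0, 13.1, 9.2, 11.1],
                    'Cost': [5, 2, 1, 3, 2, 5, 4, 2, 3, 3]})

# both frames must be sorted on the 'on' key for merge_asof
out = (pd.merge_asof(df1.reset_index()
                        .sort_values(by='item_size')
                        .drop(columns='Cost'),
                     df2.sort_values(by='item_size'),
                     by='ID', on='item_size', direction='nearest')
         .set_index('index').sort_index())

print(out)
# the Cost column comes out as 2, 1, 5, 3, matching the expected csv_output
```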
Answered By: mozway