Vlookup using python when data given in range
Question:
I have two excel files, I want to perform vlookup and find difference of costs using python or even excel.
My files look like this
source_data.xlsx contains contains distance covered and their price, example distance range from 1 to 100 should be charged 4800 and distance range from 101 to 120 should be charged 5100.
DISTANCE COST
1-100 4800
101-120 5100
121-140 5500
141-160 5900
161-180 6200
181-200 6600
210-220 6900
221-240 7200
Analysis.xlsx
loading_station distance_travel total_cost status
PUGU 40 4000 PAID
PUGU 80 3200 PAID
MOROGORO 50 5000 PAID
MOROGORO 220 30400 PAID
DODOMA 150 5100 PAID
KIGOMA 90 2345 PAID
DODOMA 230 6000 PAID
DODOMA 180 16500 PAID
KIGOMA 32 3000 PAID
DODOMA 45 6000 PAID
DODOMA 65 5000 PAID
KIGOMA 77 1000 PAID
KIGOMA 90 4000 PAID
Actual Cost for distance is given in source_data.xlsx
, I want to check cost in Analysis.xlsx
if it correspond to Actual value, I want to detect underpayment and overpayment.
Desired Output should be like this, with two column added, source_cost
which is taken from source_xlsx
by using vlookup
and difference which is difference between total_cost
and source_cost
loading_station distance_travel total_cost status source_cost Difference
PUGU 40 4000 PAID 4800 -800
PUGU 80 3200 PAID 4800 -1600
MOROGORO 50 5000 PAID 4800 200
MOROGORO 220 30400 PAID 6900 23500
DODOMA 150 5100 PAID 5900 -800
KIGOMA 90 2345 PAID 4800 -2455
DODOMA 230 6000 PAID 7200 -1200
DODOMA 180 16500 PAID 6200 10300
KIGOMA 32 3000 PAID 4800 -1800
DODOMA 45 6000 PAID 4800 1200
DODOMA 65 5000 PAID 4800 200
KIGOMA 77 1000 PAID 4800 -3800
KIGOMA 90 4000 PAID 4800 -800
My code so far
# import pandas
import pandas as pd
# read excel data
source_data = pd.read_excel('source_data.xlsx')
analysis_file = pd.read_excel('analysis.xlsx')
source_data.head(5)
analysis_file.head(5)
Answers:
You can use merge_asof
:
source_data["DISTANCE"] = source_data["DISTANCE"].str.split("-").str[1].astype("int64")
res = (pd.merge_asof(analysis_file.reset_index().sort_values("distance_travel"),
source_data,
left_on="distance_travel",
right_on="DISTANCE",
direction="forward")
.set_index("index")
.sort_index())
res["Difference"] = res["total_cost"] - res["COST"]
print (res)
loading_station distance_travel total_cost status DISTANCE COST Difference
index
0 PUGU 40 4000 PAID 100 4800 -800
1 PUGU 80 3200 PAID 100 4800 -1600
2 MOROGORO 50 5000 PAID 100 4800 200
3 MOROGORO 220 30400 PAID 220 6900 23500
4 DODOMA 150 5100 PAID 160 5900 -800
5 KIGOMA 90 2345 PAID 100 4800 -2455
6 DODOMA 230 6000 PAID 240 7200 -1200
7 DODOMA 180 16500 PAID 180 6200 10300
8 KIGOMA 32 3000 PAID 100 4800 -1800
9 DODOMA 45 6000 PAID 100 4800 1200
10 DODOMA 65 5000 PAID 100 4800 200
11 KIGOMA 77 1000 PAID 100 4800 -3800
12 KIGOMA 90 4000 PAID 100 4800 -800
Note that this does not take care of 0 distance traveled. You need to handle that separately.
Since it is a categorical bins problem, I suggest utilizing cut()
and find the corresponding value.
import pandas as pd
# create bins
bh = df_source['DISTANCE'].apply(lambda x: x.split('-')).apply(pd.Series).astype(int).values[:,0]
bt = df_source['DISTANCE'].apply(lambda x: x.split('-')).apply(pd.Series).astype(int).values[:,1]
bins = pd.IntervalIndex.from_arrays(bh, bt, closed='both')
print(bins)
###
IntervalIndex([[1, 100], [101, 120], [121, 140], [141, 160], [161, 180], [181, 200], [210, 220], [221, 240]], dtype='interval[int64, both]')
As it shown, IntervalIndex
, dtype='interval[int64, both]'
# find corresponding values
df_analysis['source_cost'] = pd.cut(df_analysis['distance_travel'], bins=bins).map(dict(zip(bins, df_source['COST']))).astype(int)
# calculation
df_analysis['Difference'] = df_analysis['total_cost'] - df_analysis['source_cost']
print(df_analysis)
###
loading_station
distance_travel
total_cost
status
source_cost
Difference
PUGU
40
4000
PAID
4800
-800
PUGU
80
3200
PAID
4800
-1600
MOROGORO
50
5000
PAID
4800
200
MOROGORO
220
30400
PAID
6900
23500
DODOMA
150
5100
PAID
5900
-800
KIGOMA
90
2345
PAID
4800
-2455
DODOMA
230
6000
PAID
7200
-1200
DODOMA
180
16500
PAID
6200
10300
KIGOMA
32
3000
PAID
4800
-1800
DODOMA
45
6000
PAID
4800
1200
DODOMA
65
5000
PAID
4800
200
KIGOMA
77
1000
PAID
4800
-3800
KIGOMA
90
4000
PAID
4800
-800
I have two excel files, I want to perform vlookup and find difference of costs using python or even excel.
My files look like this
source_data.xlsx contains contains distance covered and their price, example distance range from 1 to 100 should be charged 4800 and distance range from 101 to 120 should be charged 5100.
DISTANCE COST
1-100 4800
101-120 5100
121-140 5500
141-160 5900
161-180 6200
181-200 6600
210-220 6900
221-240 7200
Analysis.xlsx
loading_station distance_travel total_cost status
PUGU 40 4000 PAID
PUGU 80 3200 PAID
MOROGORO 50 5000 PAID
MOROGORO 220 30400 PAID
DODOMA 150 5100 PAID
KIGOMA 90 2345 PAID
DODOMA 230 6000 PAID
DODOMA 180 16500 PAID
KIGOMA 32 3000 PAID
DODOMA 45 6000 PAID
DODOMA 65 5000 PAID
KIGOMA 77 1000 PAID
KIGOMA 90 4000 PAID
Actual Cost for distance is given in source_data.xlsx
, I want to check cost in Analysis.xlsx
if it correspond to Actual value, I want to detect underpayment and overpayment.
Desired Output should be like this, with two column added, source_cost
which is taken from source_xlsx
by using vlookup
and difference which is difference between total_cost
and source_cost
loading_station distance_travel total_cost status source_cost Difference
PUGU 40 4000 PAID 4800 -800
PUGU 80 3200 PAID 4800 -1600
MOROGORO 50 5000 PAID 4800 200
MOROGORO 220 30400 PAID 6900 23500
DODOMA 150 5100 PAID 5900 -800
KIGOMA 90 2345 PAID 4800 -2455
DODOMA 230 6000 PAID 7200 -1200
DODOMA 180 16500 PAID 6200 10300
KIGOMA 32 3000 PAID 4800 -1800
DODOMA 45 6000 PAID 4800 1200
DODOMA 65 5000 PAID 4800 200
KIGOMA 77 1000 PAID 4800 -3800
KIGOMA 90 4000 PAID 4800 -800
My code so far
# import pandas
import pandas as pd
# read excel data
source_data = pd.read_excel('source_data.xlsx')
analysis_file = pd.read_excel('analysis.xlsx')
source_data.head(5)
analysis_file.head(5)
You can use merge_asof
:
source_data["DISTANCE"] = source_data["DISTANCE"].str.split("-").str[1].astype("int64")
res = (pd.merge_asof(analysis_file.reset_index().sort_values("distance_travel"),
source_data,
left_on="distance_travel",
right_on="DISTANCE",
direction="forward")
.set_index("index")
.sort_index())
res["Difference"] = res["total_cost"] - res["COST"]
print (res)
loading_station distance_travel total_cost status DISTANCE COST Difference
index
0 PUGU 40 4000 PAID 100 4800 -800
1 PUGU 80 3200 PAID 100 4800 -1600
2 MOROGORO 50 5000 PAID 100 4800 200
3 MOROGORO 220 30400 PAID 220 6900 23500
4 DODOMA 150 5100 PAID 160 5900 -800
5 KIGOMA 90 2345 PAID 100 4800 -2455
6 DODOMA 230 6000 PAID 240 7200 -1200
7 DODOMA 180 16500 PAID 180 6200 10300
8 KIGOMA 32 3000 PAID 100 4800 -1800
9 DODOMA 45 6000 PAID 100 4800 1200
10 DODOMA 65 5000 PAID 100 4800 200
11 KIGOMA 77 1000 PAID 100 4800 -3800
12 KIGOMA 90 4000 PAID 100 4800 -800
Note that this does not take care of 0 distance traveled. You need to handle that separately.
Since it is a categorical bins problem, I suggest utilizing cut()
and find the corresponding value.
import pandas as pd
# create bins
bh = df_source['DISTANCE'].apply(lambda x: x.split('-')).apply(pd.Series).astype(int).values[:,0]
bt = df_source['DISTANCE'].apply(lambda x: x.split('-')).apply(pd.Series).astype(int).values[:,1]
bins = pd.IntervalIndex.from_arrays(bh, bt, closed='both')
print(bins)
###
IntervalIndex([[1, 100], [101, 120], [121, 140], [141, 160], [161, 180], [181, 200], [210, 220], [221, 240]], dtype='interval[int64, both]')
As it shown, IntervalIndex
, dtype='interval[int64, both]'
# find corresponding values
df_analysis['source_cost'] = pd.cut(df_analysis['distance_travel'], bins=bins).map(dict(zip(bins, df_source['COST']))).astype(int)
# calculation
df_analysis['Difference'] = df_analysis['total_cost'] - df_analysis['source_cost']
print(df_analysis)
###
loading_station | distance_travel | total_cost | status | source_cost | Difference |
---|---|---|---|---|---|
PUGU | 40 | 4000 | PAID | 4800 | -800 |
PUGU | 80 | 3200 | PAID | 4800 | -1600 |
MOROGORO | 50 | 5000 | PAID | 4800 | 200 |
MOROGORO | 220 | 30400 | PAID | 6900 | 23500 |
DODOMA | 150 | 5100 | PAID | 5900 | -800 |
KIGOMA | 90 | 2345 | PAID | 4800 | -2455 |
DODOMA | 230 | 6000 | PAID | 7200 | -1200 |
DODOMA | 180 | 16500 | PAID | 6200 | 10300 |
KIGOMA | 32 | 3000 | PAID | 4800 | -1800 |
DODOMA | 45 | 6000 | PAID | 4800 | 1200 |
DODOMA | 65 | 5000 | PAID | 4800 | 200 |
KIGOMA | 77 | 1000 | PAID | 4800 | -3800 |
KIGOMA | 90 | 4000 | PAID | 4800 | -800 |