VLOOKUP using Python when the data is given as ranges

Question

I have two Excel files. I want to perform a VLOOKUP-style lookup and find the difference in costs, using Python or even Excel.

My files look like this

source_data.xlsx contains distance ranges and their prices. For example, a distance from 1 to 100 should be charged 4800, and a distance from 101 to 120 should be charged 5100.

DISTANCE     COST
1-100        4800
101-120      5100
121-140      5500
141-160      5900
161-180      6200
181-200      6600
210-220      6900
221-240      7200

Analysis.xlsx

loading_station  distance_travel  total_cost  status
PUGU             40               4000        PAID
PUGU             80               3200        PAID
MOROGORO         50               5000        PAID
MOROGORO         220              30400       PAID
DODOMA           150              5100        PAID
KIGOMA           90               2345        PAID
DODOMA           230              6000        PAID
DODOMA           180              16500       PAID
KIGOMA           32               3000        PAID
DODOMA           45               6000        PAID
DODOMA           65               5000        PAID
KIGOMA           77               1000        PAID
KIGOMA           90               4000        PAID

The actual cost for each distance is given in source_data.xlsx. I want to check whether each cost in Analysis.xlsx corresponds to the actual value, so that I can detect underpayment and overpayment.

The desired output should look like this, with two columns added: source_cost, which is taken from source_data.xlsx by the lookup, and Difference, which is total_cost minus source_cost.

loading_station  distance_travel  total_cost  status  source_cost  Difference
PUGU             40               4000        PAID    4800         -800
PUGU             80               3200        PAID    4800         -1600
MOROGORO         50               5000        PAID    4800         200
MOROGORO         220              30400       PAID    6900         23500
DODOMA           150              5100        PAID    5900         -800
KIGOMA           90               2345        PAID    4800         -2455
DODOMA           230              6000        PAID    7200         -1200
DODOMA           180              16500       PAID    6200         10300
KIGOMA           32               3000        PAID    4800         -1800
DODOMA           45               6000        PAID    4800         1200
DODOMA           65               5000        PAID    4800         200
KIGOMA           77               1000        PAID    4800         -3800
KIGOMA           90               4000        PAID    4800         -800

My code so far

# import pandas
import pandas as pd

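# read excel data (file names must match the files on disk, including case)
source_data = pd.read_excel('source_data.xlsx')
analysis_file = pd.read_excel('Analysis.xlsx')

# preview the first rows of each file
print(source_data.head(5))
print(analysis_file.head(5))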

Asked By: tony michael


Answer 1

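You can use pd.merge_asof. Keep the upper bound of each range as the lookup key, then match each distance forward to the nearest upper bound: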

# keep only the upper bound of each range ("1-100" -> 100) as the lookup key
source_data["DISTANCE"] = source_data["DISTANCE"].str.split("-").str[1].astype("int64")

# merge_asof needs sorted keys; the original index is kept so row order can be restored
res = (pd.merge_asof(analysis_file.reset_index().sort_values("distance_travel"),
                     source_data,
                     left_on="distance_travel",
                     right_on="DISTANCE",
                     direction="forward")
       .set_index("index")
       .sort_index())
res["Difference"] = res["total_cost"] - res["COST"]

print(res)

      loading_station  distance_travel  total_cost status  DISTANCE  COST  Difference
index
0                PUGU               40        4000   PAID       100  4800        -800
1                PUGU               80        3200   PAID       100  4800       -1600
2            MOROGORO               50        5000   PAID       100  4800         200
3            MOROGORO              220       30400   PAID       220  6900       23500
4              DODOMA              150        5100   PAID       160  5900        -800
5              KIGOMA               90        2345   PAID       100  4800       -2455
6              DODOMA              230        6000   PAID       240  7200       -1200
7              DODOMA              180       16500   PAID       180  6200       10300
8              KIGOMA               32        3000   PAID       100  4800       -1800
9              DODOMA               45        6000   PAID       100  4800        1200
10             DODOMA               65        5000   PAID       100  4800         200
11             KIGOMA               77        1000   PAID       100  4800       -3800
12             KIGOMA               90        4000   PAID       100  4800        -800

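Note that this does not handle a distance of 0: the forward match would assign it to the 1-100 bracket. The same happens for a distance that falls in a gap between brackets, such as 205, which would be matched to 210-220. You need to handle those cases separately.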
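One possible way to handle such distances is sketched below. It is not part of the original answer: it keeps both bounds of each range (the column names low and high are made up for this illustration), runs the same forward merge_asof on the upper bound, and then blanks the cost wherever the distance lies below the matched lower bound or has no match at all. The small frames are sample data standing in for the two Excel files.

```python
import pandas as pd

# Sample data standing in for source_data.xlsx and Analysis.xlsx
source_data = pd.DataFrame({
    "DISTANCE": ["1-100", "101-120", "121-140", "141-160",
                 "161-180", "181-200", "210-220", "221-240"],
    "COST": [4800, 5100, 5500, 5900, 6200, 6600, 6900, 7200],
})
analysis_file = pd.DataFrame({
    "distance_travel": [40, 0, 205, 230, 300],
    "total_cost": [4000, 100, 7000, 6000, 9000],
})

# Split "1-100" into two integer columns (hypothetical names: low, high)
bounds = source_data["DISTANCE"].str.split("-", expand=True).astype("int64")
source_data["low"] = bounds[0]
source_data["high"] = bounds[1]

# Same forward match as above, keyed on the upper bound
res = (pd.merge_asof(analysis_file.reset_index().sort_values("distance_travel"),
                     source_data[["low", "high", "COST"]],
                     left_on="distance_travel",
                     right_on="high",
                     direction="forward")
       .set_index("index")
       .sort_index())

# A row is outside every bracket if nothing matched or the distance is below the lower bound
outside = res["low"].isna() | (res["distance_travel"] < res["low"])
res["COST"] = res["COST"].where(~outside)
res["Difference"] = res["total_cost"] - res["COST"]

print(res)
```

Rows for 0, 205 and 300 end up with a missing COST and Difference, so they can be reviewed by hand instead of being silently priced.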

Answered By: Henry Yik

Answer 2

Since this is a binning problem, I suggest using pd.cut() to assign each distance to its interval and then looking up the corresponding cost.

import pandas as pd

# df_source and df_analysis are the DataFrames read from source_data.xlsx and Analysis.xlsx
# create bins: split "1-100" into a lower and an upper bound
bounds = df_source['DISTANCE'].str.split('-', expand=True).astype(int)
bins = pd.IntervalIndex.from_arrays(bounds[0], bounds[1], closed='both')

print(bins)
###
IntervalIndex([[1, 100], [101, 120], [121, 140], [141, 160], [161, 180], [181, 200], [210, 220], [221, 240]], dtype='interval[int64, both]')

As shown, the result is an IntervalIndex with dtype='interval[int64, both]', so both endpoints of each range are included.
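To illustrate what closed='both' means for the lookup, here is a small sketch on the first three ranges only. It uses IntervalIndex.get_indexer, which is a different route from the pd.cut call in this answer, and the variable names positions and looked_up are made up for the example.

```python
import numpy as np
import pandas as pd

# First three ranges from the source table, both endpoints included
bins = pd.IntervalIndex.from_arrays([1, 101, 121], [100, 120, 140], closed="both")
costs = np.array([4800, 5100, 5500])

# Position of the interval containing each value; -1 means no interval contains it
positions = bins.get_indexer([1, 100, 101, 140, 0, 150])
print(positions)

# Look up the cost by position and blank out the values that matched no interval
looked_up = pd.Series(costs[positions]).where(positions >= 0)
print(looked_up)
```

The endpoints 1, 100, 101 and 140 all fall inside a bracket, while 0 and 150 match nothing and come back as missing values.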

# find corresponding values
df_analysis['source_cost'] = pd.cut(df_analysis['distance_travel'], bins=bins).map(dict(zip(bins, df_source['COST']))).astype(int)

# calculation
df_analysis['Difference'] = df_analysis['total_cost'] - df_analysis['source_cost']

print(df_analysis)
###

loading_station	distance_travel	total_cost	status	source_cost	Difference
PUGU	40	4000	PAID	4800	-800
PUGU	80	3200	PAID	4800	-1600
MOROGORO	50	5000	PAID	4800	200
MOROGORO	220	30400	PAID	6900	23500
DODOMA	150	5100	PAID	5900	-800
KIGOMA	90	2345	PAID	4800	-2455
DODOMA	230	6000	PAID	7200	-1200
DODOMA	180	16500	PAID	6200	10300
KIGOMA	32	3000	PAID	4800	-1800
DODOMA	45	6000	PAID	4800	1200
DODOMA	65	5000	PAID	4800	200
KIGOMA	77	1000	PAID	4800	-3800
KIGOMA	90	4000	PAID	4800	-800
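
The question also asks to flag underpayment and overpayment. A minimal sketch of one way to do that from the sign of Difference follows; it is not part of the answer above, and the column name payment_check and its labels are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Three sample rows with the columns produced above
df = pd.DataFrame({
    "total_cost": [4000, 5000, 4800],
    "source_cost": [4800, 4800, 4800],
})
df["Difference"] = df["total_cost"] - df["source_cost"]

# Negative difference: paid less than the table price; positive: paid more
df["payment_check"] = np.select(
    [df["Difference"] < 0, df["Difference"] > 0],
    ["UNDERPAID", "OVERPAID"],
    default="EXACT",
)

print(df)
```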

Answered By: Baron Legendre
