groupby and apply multiple conditions
Question:
This is a bit complicated but I will try to explain as best as I can.
i have the following dataframe.
transaction_hash block_timestamp from_address to_address value data token_address
1 0x00685b3aecf64de61bca7a7c7068c17879bb2a2f3ebfe65d4b9421b40ac63952 2023-01-02 03:12:59+00:00 0xe5d84152dd961e2eb0d6c202cf3396f579974983 0x1111111254eeb25477b68fb85ed929f73a960582 1.052e+20 trace ETH
2 0x00685b3aecf64de61bca7a7c7068c17879bb2a2f3ebfe65d4b9421b40ac63952 2023-01-02 03:12:59+00:00 0x1111111254eeb25477b68fb85ed929f73a960582 0x53222470cdcfb8081c0e3a50fd106f0d69e63f20 1.052e+20 trace ETH
3 0x00685b3aecf64de61bca7a7c7068c17879bb2a2f3ebfe65d4b9421b40ac63952 2023-01-02 03:12:59+00:00 0x1111111254eeb25477b68fb85ed929f73a960582 0xe5d84152dd961e2eb0d6c202cf3396f579974983 1.0652365814992255e+20 transfer stETH
4 0x00685b3aecf64de61bca7a7c7068c17879bb2a2f3ebfe65d4b9421b40ac63952 2023-01-02 03:12:59+00:00 0x53222470cdcfb8081c0e3a50fd106f0d69e63f20 0x1111111254eeb25477b68fb85ed929f73a960582 1.0652365814992255e+20 transfer stETH
5 0x00685b3aecf64de61bca7a7c7068c17879bb2a2f3ebfe65d4b9421b40ac63952 2023-01-02 03:12:59+00:00 0x7f39c581f595b53c5cb19bd0b3f8da6c935e2ca0 0x53222470cdcfb8081c0e3a50fd106f0d69e63f20 6.391691160717606e+19 transfer stETH
6 0x00685b3aecf64de61bca7a7c7068c17879bb2a2f3ebfe65d4b9421b40ac63952 2023-01-02 03:12:59+00:00 0xdc24316b9ae028f1497c275eb9192a3ea0f67022 0x53222470cdcfb8081c0e3a50fd106f0d69e63f20 4.260674654274649e+19 transfer stETH
7 0x00a0ff958f99fabe8a6bde12304436ed6c43524d1ab12bced426abf3a507d939 2023-01-04 07:34:47+00:00 0x1111111254eeb25477b68fb85ed929f73a960582 0xcb62961daac29b79ebac9a30e142da0e8ba8ead6 1.401493579633375e+20 trace ETH
8 0x00a0ff958f99fabe8a6bde12304436ed6c43524d1ab12bced426abf3a507d939 2023-01-04 07:34:47+00:00 0x53222470cdcfb8081c0e3a50fd106f0d69e63f20 0x1111111254eeb25477b68fb85ed929f73a960582 1.401493579633375e+20 trace ETH
9 0x00a0ff958f99fabe8a6bde12304436ed6c43524d1ab12bced426abf3a507d939 2023-01-04 07:34:47+00:00 0xcb62961daac29b79ebac9a30e142da0e8ba8ead6 0x53222470cdcfb8081c0e3a50fd106f0d69e63f20 1.419e+20 transfer stETH
10 0x00a0ff958f99fabe8a6bde12304436ed6c43524d1ab12bced426abf3a507d939 2023-01-04 07:34:47+00:00 0x53222470cdcfb8081c0e3a50fd106f0d69e63f20 0x7f39c581f595b53c5cb19bd0b3f8da6c935e2ca0 4.257e+19 transfer stETH
11 0x00a0ff958f99fabe8a6bde12304436ed6c43524d1ab12bced426abf3a507d939 2023-01-04 07:34:47+00:00 0x53222470cdcfb8081c0e3a50fd106f0d69e63f20 0xdc24316b9ae028f1497c275eb9192a3ea0f67022 9.933e+19 transfer stETH
These 11 transfers represent two swap transactions: 2 unique hash) between ETH and stETH. It’d be very nice if there was two clean transactions with one ETH going from A to B, and the other one stETH going from B to A. But, in decentralized exchanges, things work via routers which send multiple transactions through various addresses to complete one swap transaction.
Here, there are two transactions (as you can see, within one transaction (hash) made up of several transfers). I want to verify which ETH amounts corresponds to stETH. ETH and stETH prices are almost 1:1 so they should be quite close in value.
So from the first transaction (1-6), there is one ETH value (1.052e+20) but three different stETH value (1.0652365814992255e+20, 6.391691160717606e+19, and 4.260674654274649e+19). Obviously, it is clear that the pair that corresponds to the 1.052e+20 ETH swap is 1.0652365814992255e+20 stETH as it is the closest in value.
So, in order to filter out the right pair, I want to groupby transaction_hash
, if there are more than 1 unique value of stETH, then I want to pick out the one that is closest to the ETH value.
so, the desired output will be:
transaction_hash block_timestamp from_address to_address value data token_address
1 0x00685b3aecf64de61bca7a7c7068c17879bb2a2f3ebfe65d4b9421b40ac63952 2023-01-02 03:12:59+00:00 0xe5d84152dd961e2eb0d6c202cf3396f579974983 0x1111111254eeb25477b68fb85ed929f73a960582 1.052e+20 trace ETH
2 0x00685b3aecf64de61bca7a7c7068c17879bb2a2f3ebfe65d4b9421b40ac63952 2023-01-02 03:12:59+00:00 0x1111111254eeb25477b68fb85ed929f73a960582 0x53222470cdcfb8081c0e3a50fd106f0d69e63f20 1.052e+20 trace ETH
3 0x00685b3aecf64de61bca7a7c7068c17879bb2a2f3ebfe65d4b9421b40ac63952 2023-01-02 03:12:59+00:00 0x1111111254eeb25477b68fb85ed929f73a960582 0xe5d84152dd961e2eb0d6c202cf3396f579974983 1.0652365814992255e+20 transfer stETH
4 0x00685b3aecf64de61bca7a7c7068c17879bb2a2f3ebfe65d4b9421b40ac63952 2023-01-02 03:12:59+00:00 0x53222470cdcfb8081c0e3a50fd106f0d69e63f20 0x1111111254eeb25477b68fb85ed929f73a960582 1.0652365814992255e+20 transfer stETH
5 0x00a0ff958f99fabe8a6bde12304436ed6c43524d1ab12bced426abf3a507d939 2023-01-04 07:34:47+00:00 0x1111111254eeb25477b68fb85ed929f73a960582 0xcb62961daac29b79ebac9a30e142da0e8ba8ead6 1.401493579633375e+20 trace ETH
6 0x00a0ff958f99fabe8a6bde12304436ed6c43524d1ab12bced426abf3a507d939 2023-01-04 07:34:47+00:00 0x53222470cdcfb8081c0e3a50fd106f0d69e63f20 0x1111111254eeb25477b68fb85ed929f73a960582 1.401493579633375e+20 trace ETH
7 0x00a0ff958f99fabe8a6bde12304436ed6c43524d1ab12bced426abf3a507d939 2023-01-04 07:34:47+00:00 0xcb62961daac29b79ebac9a30e142da0e8ba8ead6 0x53222470cdcfb8081c0e3a50fd106f0d69e63f20 1.419e+20 transfer stETH
Thanks!
EDIT!
I applied the code suggested below. Something strange is happening. My original df has a transaction like this
transaction_hash block_timestamp from_address to_address value data token_address
60347 0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947 2022-05-26 09:57:49 UTC 0xdef171fe48cf0115b1d80b88dc8eab59176fee57 0x558247e365be655f9144e1a0140d793984372ef3 6917030000000000.0 trace ETH
61076 0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947 2022-05-26 09:57:49 UTC 0xdef171fe48cf0115b1d80b88dc8eab59176fee57 0xb1720612d0131839dc489fcf20398ea925282fca 1220650000000000.0 trace ETH
399307 0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947 2022-05-26 09:57:49 UTC 0xdef171fe48cf0115b1d80b88dc8eab59176fee57 0x6d02e95909da8da09865a26b62055bd6a1d5f706 8.4846e+17 trace ETH
30155 0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947 2022-05-26 09:57:49+00:00 0x6d02e95909da8da09865a26b62055bd6a1d5f706 0xdef171fe48cf0115b1d80b88dc8eab59176fee57 8.800918863842962e+17 transfer stETH
625132 0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947 2022-05-26 09:57:49+00:00 0xdef171fe48cf0115b1d80b88dc8eab59176fee57 0x4028daac072e492d34a3afdbef0ba7e35d8b55c4 8.800918863842962e+17 transfer stETH
when I apply the code, instead of getting what I want which is..
399307 0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947 2022-05-26 09:57:49 UTC 0xdef171fe48cf0115b1d80b88dc8eab59176fee57 0x6d02e95909da8da09865a26b62055bd6a1d5f706 8.4846e+17 trace ETH
30155 0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947 2022-05-26 09:57:49+00:00 0x6d02e95909da8da09865a26b62055bd6a1d5f706 0xdef171fe48cf0115b1d80b88dc8eab59176fee57 8.800918863842962e+17 transfer stETH
625132 0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947 2022-05-26 09:57:49+00:00 0xdef171fe48cf0115b1d80b88dc8eab59176fee57 0x4028daac072e492d34a3afdbef0ba7e35d8b55c4 8.800918863842962e+17 transfer stETH
I get more rows! and doesn’t even remove the rows
transaction_hash block_timestamp from_address to_address value data token_address
30155 0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947 2022-05-26 09:57:49+00:00 0x6d02e95909da8da09865a26b62055bd6a1d5f706 0xdef171fe48cf0115b1d80b88dc8eab59176fee57 8.800918863842962e+17 transfer stETH
30155 0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947 2022-05-26 09:57:49+00:00 0x6d02e95909da8da09865a26b62055bd6a1d5f706 0xdef171fe48cf0115b1d80b88dc8eab59176fee57 8.800918863842962e+17 transfer stETH
30155 0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947 2022-05-26 09:57:49+00:00 0x6d02e95909da8da09865a26b62055bd6a1d5f706 0xdef171fe48cf0115b1d80b88dc8eab59176fee57 8.800918863842962e+17 transfer stETH
60347 0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947 2022-05-26 09:57:49 UTC 0xdef171fe48cf0115b1d80b88dc8eab59176fee57 0x558247e365be655f9144e1a0140d793984372ef3 6917030000000000.0 trace ETH
61076 0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947 2022-05-26 09:57:49 UTC 0xdef171fe48cf0115b1d80b88dc8eab59176fee57 0xb1720612d0131839dc489fcf20398ea925282fca 1220650000000000.0 trace ETH
399307 0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947 2022-05-26 09:57:49 UTC 0xdef171fe48cf0115b1d80b88dc8eab59176fee57 0x6d02e95909da8da09865a26b62055bd6a1d5f706 8.4846e+17 trace ETH
what is going on?
Answers:
If you can sort your data by value
you can use .merge_asof()
to find the "nearest neighbor".
The by
argument allows you specify grouping.
df = df.sort_values("value")
left = (
df.loc[df["token_address"] == "ETH", ["transaction_hash", "value"]]
.rename(columns={"value": "value_x"})
.reset_index()
)
right = (
df.loc[df["token_address"] != "ETH", ["transaction_hash", "value"]]
.rename(columns={"value": "value_y"})
.reset_index()
)
# Near matches
nearest = pd.merge_asof(
left = left,
right = right,
by = "transaction_hash",
left_on = "value_x",
right_on = "value_y",
direction = "nearest"
)
# Closest matches only
nearest["abs"] = (nearest["value_x"] - nearest["value_y"]).abs()
idxmin = nearest.groupby("value_y")["abs"].idxmin()
# index_x / index_y contain the row indexes of closest matches
minrows = nearest.loc[idxmin, ["index_x", "index_y"]].to_numpy().ravel()
# Extract their values used to filter rows
pairs = df.loc[minrows, ["transaction_hash", "value", "data", "token_address"]]
print(df.merge(pairs))
We find the distance from each "near neighbor" (x - y).abs()
and .idxmin()
to choose the closest.
This is a bit complicated but I will try to explain as best as I can.
i have the following dataframe.
transaction_hash block_timestamp from_address to_address value data token_address
1 0x00685b3aecf64de61bca7a7c7068c17879bb2a2f3ebfe65d4b9421b40ac63952 2023-01-02 03:12:59+00:00 0xe5d84152dd961e2eb0d6c202cf3396f579974983 0x1111111254eeb25477b68fb85ed929f73a960582 1.052e+20 trace ETH
2 0x00685b3aecf64de61bca7a7c7068c17879bb2a2f3ebfe65d4b9421b40ac63952 2023-01-02 03:12:59+00:00 0x1111111254eeb25477b68fb85ed929f73a960582 0x53222470cdcfb8081c0e3a50fd106f0d69e63f20 1.052e+20 trace ETH
3 0x00685b3aecf64de61bca7a7c7068c17879bb2a2f3ebfe65d4b9421b40ac63952 2023-01-02 03:12:59+00:00 0x1111111254eeb25477b68fb85ed929f73a960582 0xe5d84152dd961e2eb0d6c202cf3396f579974983 1.0652365814992255e+20 transfer stETH
4 0x00685b3aecf64de61bca7a7c7068c17879bb2a2f3ebfe65d4b9421b40ac63952 2023-01-02 03:12:59+00:00 0x53222470cdcfb8081c0e3a50fd106f0d69e63f20 0x1111111254eeb25477b68fb85ed929f73a960582 1.0652365814992255e+20 transfer stETH
5 0x00685b3aecf64de61bca7a7c7068c17879bb2a2f3ebfe65d4b9421b40ac63952 2023-01-02 03:12:59+00:00 0x7f39c581f595b53c5cb19bd0b3f8da6c935e2ca0 0x53222470cdcfb8081c0e3a50fd106f0d69e63f20 6.391691160717606e+19 transfer stETH
6 0x00685b3aecf64de61bca7a7c7068c17879bb2a2f3ebfe65d4b9421b40ac63952 2023-01-02 03:12:59+00:00 0xdc24316b9ae028f1497c275eb9192a3ea0f67022 0x53222470cdcfb8081c0e3a50fd106f0d69e63f20 4.260674654274649e+19 transfer stETH
7 0x00a0ff958f99fabe8a6bde12304436ed6c43524d1ab12bced426abf3a507d939 2023-01-04 07:34:47+00:00 0x1111111254eeb25477b68fb85ed929f73a960582 0xcb62961daac29b79ebac9a30e142da0e8ba8ead6 1.401493579633375e+20 trace ETH
8 0x00a0ff958f99fabe8a6bde12304436ed6c43524d1ab12bced426abf3a507d939 2023-01-04 07:34:47+00:00 0x53222470cdcfb8081c0e3a50fd106f0d69e63f20 0x1111111254eeb25477b68fb85ed929f73a960582 1.401493579633375e+20 trace ETH
9 0x00a0ff958f99fabe8a6bde12304436ed6c43524d1ab12bced426abf3a507d939 2023-01-04 07:34:47+00:00 0xcb62961daac29b79ebac9a30e142da0e8ba8ead6 0x53222470cdcfb8081c0e3a50fd106f0d69e63f20 1.419e+20 transfer stETH
10 0x00a0ff958f99fabe8a6bde12304436ed6c43524d1ab12bced426abf3a507d939 2023-01-04 07:34:47+00:00 0x53222470cdcfb8081c0e3a50fd106f0d69e63f20 0x7f39c581f595b53c5cb19bd0b3f8da6c935e2ca0 4.257e+19 transfer stETH
11 0x00a0ff958f99fabe8a6bde12304436ed6c43524d1ab12bced426abf3a507d939 2023-01-04 07:34:47+00:00 0x53222470cdcfb8081c0e3a50fd106f0d69e63f20 0xdc24316b9ae028f1497c275eb9192a3ea0f67022 9.933e+19 transfer stETH
These 11 transfers represent two swap transactions: 2 unique hash) between ETH and stETH. It’d be very nice if there was two clean transactions with one ETH going from A to B, and the other one stETH going from B to A. But, in decentralized exchanges, things work via routers which send multiple transactions through various addresses to complete one swap transaction.
Here, there are two transactions (as you can see, within one transaction (hash) made up of several transfers). I want to verify which ETH amounts corresponds to stETH. ETH and stETH prices are almost 1:1 so they should be quite close in value.
So from the first transaction (1-6), there is one ETH value (1.052e+20) but three different stETH value (1.0652365814992255e+20, 6.391691160717606e+19, and 4.260674654274649e+19). Obviously, it is clear that the pair that corresponds to the 1.052e+20 ETH swap is 1.0652365814992255e+20 stETH as it is the closest in value.
So, in order to filter out the right pair, I want to groupby transaction_hash
, if there are more than 1 unique value of stETH, then I want to pick out the one that is closest to the ETH value.
so, the desired output will be:
transaction_hash block_timestamp from_address to_address value data token_address
1 0x00685b3aecf64de61bca7a7c7068c17879bb2a2f3ebfe65d4b9421b40ac63952 2023-01-02 03:12:59+00:00 0xe5d84152dd961e2eb0d6c202cf3396f579974983 0x1111111254eeb25477b68fb85ed929f73a960582 1.052e+20 trace ETH
2 0x00685b3aecf64de61bca7a7c7068c17879bb2a2f3ebfe65d4b9421b40ac63952 2023-01-02 03:12:59+00:00 0x1111111254eeb25477b68fb85ed929f73a960582 0x53222470cdcfb8081c0e3a50fd106f0d69e63f20 1.052e+20 trace ETH
3 0x00685b3aecf64de61bca7a7c7068c17879bb2a2f3ebfe65d4b9421b40ac63952 2023-01-02 03:12:59+00:00 0x1111111254eeb25477b68fb85ed929f73a960582 0xe5d84152dd961e2eb0d6c202cf3396f579974983 1.0652365814992255e+20 transfer stETH
4 0x00685b3aecf64de61bca7a7c7068c17879bb2a2f3ebfe65d4b9421b40ac63952 2023-01-02 03:12:59+00:00 0x53222470cdcfb8081c0e3a50fd106f0d69e63f20 0x1111111254eeb25477b68fb85ed929f73a960582 1.0652365814992255e+20 transfer stETH
5 0x00a0ff958f99fabe8a6bde12304436ed6c43524d1ab12bced426abf3a507d939 2023-01-04 07:34:47+00:00 0x1111111254eeb25477b68fb85ed929f73a960582 0xcb62961daac29b79ebac9a30e142da0e8ba8ead6 1.401493579633375e+20 trace ETH
6 0x00a0ff958f99fabe8a6bde12304436ed6c43524d1ab12bced426abf3a507d939 2023-01-04 07:34:47+00:00 0x53222470cdcfb8081c0e3a50fd106f0d69e63f20 0x1111111254eeb25477b68fb85ed929f73a960582 1.401493579633375e+20 trace ETH
7 0x00a0ff958f99fabe8a6bde12304436ed6c43524d1ab12bced426abf3a507d939 2023-01-04 07:34:47+00:00 0xcb62961daac29b79ebac9a30e142da0e8ba8ead6 0x53222470cdcfb8081c0e3a50fd106f0d69e63f20 1.419e+20 transfer stETH
Thanks!
EDIT!
I applied the code suggested below. Something strange is happening. My original df has a transaction like this
transaction_hash block_timestamp from_address to_address value data token_address
60347 0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947 2022-05-26 09:57:49 UTC 0xdef171fe48cf0115b1d80b88dc8eab59176fee57 0x558247e365be655f9144e1a0140d793984372ef3 6917030000000000.0 trace ETH
61076 0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947 2022-05-26 09:57:49 UTC 0xdef171fe48cf0115b1d80b88dc8eab59176fee57 0xb1720612d0131839dc489fcf20398ea925282fca 1220650000000000.0 trace ETH
399307 0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947 2022-05-26 09:57:49 UTC 0xdef171fe48cf0115b1d80b88dc8eab59176fee57 0x6d02e95909da8da09865a26b62055bd6a1d5f706 8.4846e+17 trace ETH
30155 0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947 2022-05-26 09:57:49+00:00 0x6d02e95909da8da09865a26b62055bd6a1d5f706 0xdef171fe48cf0115b1d80b88dc8eab59176fee57 8.800918863842962e+17 transfer stETH
625132 0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947 2022-05-26 09:57:49+00:00 0xdef171fe48cf0115b1d80b88dc8eab59176fee57 0x4028daac072e492d34a3afdbef0ba7e35d8b55c4 8.800918863842962e+17 transfer stETH
when I apply the code, instead of getting what I want which is..
399307 0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947 2022-05-26 09:57:49 UTC 0xdef171fe48cf0115b1d80b88dc8eab59176fee57 0x6d02e95909da8da09865a26b62055bd6a1d5f706 8.4846e+17 trace ETH
30155 0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947 2022-05-26 09:57:49+00:00 0x6d02e95909da8da09865a26b62055bd6a1d5f706 0xdef171fe48cf0115b1d80b88dc8eab59176fee57 8.800918863842962e+17 transfer stETH
625132 0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947 2022-05-26 09:57:49+00:00 0xdef171fe48cf0115b1d80b88dc8eab59176fee57 0x4028daac072e492d34a3afdbef0ba7e35d8b55c4 8.800918863842962e+17 transfer stETH
I get more rows! and doesn’t even remove the rows
transaction_hash block_timestamp from_address to_address value data token_address
30155 0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947 2022-05-26 09:57:49+00:00 0x6d02e95909da8da09865a26b62055bd6a1d5f706 0xdef171fe48cf0115b1d80b88dc8eab59176fee57 8.800918863842962e+17 transfer stETH
30155 0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947 2022-05-26 09:57:49+00:00 0x6d02e95909da8da09865a26b62055bd6a1d5f706 0xdef171fe48cf0115b1d80b88dc8eab59176fee57 8.800918863842962e+17 transfer stETH
30155 0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947 2022-05-26 09:57:49+00:00 0x6d02e95909da8da09865a26b62055bd6a1d5f706 0xdef171fe48cf0115b1d80b88dc8eab59176fee57 8.800918863842962e+17 transfer stETH
60347 0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947 2022-05-26 09:57:49 UTC 0xdef171fe48cf0115b1d80b88dc8eab59176fee57 0x558247e365be655f9144e1a0140d793984372ef3 6917030000000000.0 trace ETH
61076 0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947 2022-05-26 09:57:49 UTC 0xdef171fe48cf0115b1d80b88dc8eab59176fee57 0xb1720612d0131839dc489fcf20398ea925282fca 1220650000000000.0 trace ETH
399307 0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947 2022-05-26 09:57:49 UTC 0xdef171fe48cf0115b1d80b88dc8eab59176fee57 0x6d02e95909da8da09865a26b62055bd6a1d5f706 8.4846e+17 trace ETH
what is going on?
If you can sort your data by value
you can use .merge_asof()
to find the "nearest neighbor".
The by
argument allows you specify grouping.
df = df.sort_values("value")
left = (
df.loc[df["token_address"] == "ETH", ["transaction_hash", "value"]]
.rename(columns={"value": "value_x"})
.reset_index()
)
right = (
df.loc[df["token_address"] != "ETH", ["transaction_hash", "value"]]
.rename(columns={"value": "value_y"})
.reset_index()
)
# Near matches
nearest = pd.merge_asof(
left = left,
right = right,
by = "transaction_hash",
left_on = "value_x",
right_on = "value_y",
direction = "nearest"
)
# Closest matches only
nearest["abs"] = (nearest["value_x"] - nearest["value_y"]).abs()
idxmin = nearest.groupby("value_y")["abs"].idxmin()
# index_x / index_y contain the row indexes of closest matches
minrows = nearest.loc[idxmin, ["index_x", "index_y"]].to_numpy().ravel()
# Extract their values used to filter rows
pairs = df.loc[minrows, ["transaction_hash", "value", "data", "token_address"]]
print(df.merge(pairs))
We find the distance from each "near neighbor" (x - y).abs()
and .idxmin()
to choose the closest.