groupby and apply multiple conditions

Question:

This is a bit complicated but I will try to explain as best as I can.

i have the following dataframe.

    transaction_hash    block_timestamp from_address    to_address  value   data    token_address
1   0x00685b3aecf64de61bca7a7c7068c17879bb2a2f3ebfe65d4b9421b40ac63952  2023-01-02 03:12:59+00:00   0xe5d84152dd961e2eb0d6c202cf3396f579974983  0x1111111254eeb25477b68fb85ed929f73a960582  1.052e+20   trace   ETH
2   0x00685b3aecf64de61bca7a7c7068c17879bb2a2f3ebfe65d4b9421b40ac63952  2023-01-02 03:12:59+00:00   0x1111111254eeb25477b68fb85ed929f73a960582  0x53222470cdcfb8081c0e3a50fd106f0d69e63f20  1.052e+20   trace   ETH
3   0x00685b3aecf64de61bca7a7c7068c17879bb2a2f3ebfe65d4b9421b40ac63952  2023-01-02 03:12:59+00:00   0x1111111254eeb25477b68fb85ed929f73a960582  0xe5d84152dd961e2eb0d6c202cf3396f579974983  1.0652365814992255e+20  transfer    stETH
4   0x00685b3aecf64de61bca7a7c7068c17879bb2a2f3ebfe65d4b9421b40ac63952  2023-01-02 03:12:59+00:00   0x53222470cdcfb8081c0e3a50fd106f0d69e63f20  0x1111111254eeb25477b68fb85ed929f73a960582  1.0652365814992255e+20  transfer    stETH
5   0x00685b3aecf64de61bca7a7c7068c17879bb2a2f3ebfe65d4b9421b40ac63952  2023-01-02 03:12:59+00:00   0x7f39c581f595b53c5cb19bd0b3f8da6c935e2ca0  0x53222470cdcfb8081c0e3a50fd106f0d69e63f20  6.391691160717606e+19   transfer    stETH
6   0x00685b3aecf64de61bca7a7c7068c17879bb2a2f3ebfe65d4b9421b40ac63952  2023-01-02 03:12:59+00:00   0xdc24316b9ae028f1497c275eb9192a3ea0f67022  0x53222470cdcfb8081c0e3a50fd106f0d69e63f20  4.260674654274649e+19   transfer    stETH
7   0x00a0ff958f99fabe8a6bde12304436ed6c43524d1ab12bced426abf3a507d939  2023-01-04 07:34:47+00:00   0x1111111254eeb25477b68fb85ed929f73a960582  0xcb62961daac29b79ebac9a30e142da0e8ba8ead6  1.401493579633375e+20   trace   ETH
8   0x00a0ff958f99fabe8a6bde12304436ed6c43524d1ab12bced426abf3a507d939  2023-01-04 07:34:47+00:00   0x53222470cdcfb8081c0e3a50fd106f0d69e63f20  0x1111111254eeb25477b68fb85ed929f73a960582  1.401493579633375e+20   trace   ETH
9   0x00a0ff958f99fabe8a6bde12304436ed6c43524d1ab12bced426abf3a507d939  2023-01-04 07:34:47+00:00   0xcb62961daac29b79ebac9a30e142da0e8ba8ead6  0x53222470cdcfb8081c0e3a50fd106f0d69e63f20  1.419e+20   transfer    stETH
10  0x00a0ff958f99fabe8a6bde12304436ed6c43524d1ab12bced426abf3a507d939  2023-01-04 07:34:47+00:00   0x53222470cdcfb8081c0e3a50fd106f0d69e63f20  0x7f39c581f595b53c5cb19bd0b3f8da6c935e2ca0  4.257e+19   transfer    stETH
11  0x00a0ff958f99fabe8a6bde12304436ed6c43524d1ab12bced426abf3a507d939  2023-01-04 07:34:47+00:00   0x53222470cdcfb8081c0e3a50fd106f0d69e63f20  0xdc24316b9ae028f1497c275eb9192a3ea0f67022  9.933e+19   transfer    stETH

These 11 transfers represent two swap transactions: 2 unique hash) between ETH and stETH. It’d be very nice if there was two clean transactions with one ETH going from A to B, and the other one stETH going from B to A. But, in decentralized exchanges, things work via routers which send multiple transactions through various addresses to complete one swap transaction.

Here, there are two transactions (as you can see, within one transaction (hash) made up of several transfers). I want to verify which ETH amounts corresponds to stETH. ETH and stETH prices are almost 1:1 so they should be quite close in value.

So from the first transaction (1-6), there is one ETH value (1.052e+20) but three different stETH value (1.0652365814992255e+20, 6.391691160717606e+19, and 4.260674654274649e+19). Obviously, it is clear that the pair that corresponds to the 1.052e+20 ETH swap is 1.0652365814992255e+20 stETH as it is the closest in value.

So, in order to filter out the right pair, I want to groupby transaction_hash, if there are more than 1 unique value of stETH, then I want to pick out the one that is closest to the ETH value.

so, the desired output will be:

    transaction_hash    block_timestamp from_address    to_address  value   data    token_address
1   0x00685b3aecf64de61bca7a7c7068c17879bb2a2f3ebfe65d4b9421b40ac63952  2023-01-02 03:12:59+00:00   0xe5d84152dd961e2eb0d6c202cf3396f579974983  0x1111111254eeb25477b68fb85ed929f73a960582  1.052e+20   trace   ETH
2   0x00685b3aecf64de61bca7a7c7068c17879bb2a2f3ebfe65d4b9421b40ac63952  2023-01-02 03:12:59+00:00   0x1111111254eeb25477b68fb85ed929f73a960582  0x53222470cdcfb8081c0e3a50fd106f0d69e63f20  1.052e+20   trace   ETH
3   0x00685b3aecf64de61bca7a7c7068c17879bb2a2f3ebfe65d4b9421b40ac63952  2023-01-02 03:12:59+00:00   0x1111111254eeb25477b68fb85ed929f73a960582  0xe5d84152dd961e2eb0d6c202cf3396f579974983  1.0652365814992255e+20  transfer    stETH
4   0x00685b3aecf64de61bca7a7c7068c17879bb2a2f3ebfe65d4b9421b40ac63952  2023-01-02 03:12:59+00:00   0x53222470cdcfb8081c0e3a50fd106f0d69e63f20  0x1111111254eeb25477b68fb85ed929f73a960582  1.0652365814992255e+20  transfer    stETH
5   0x00a0ff958f99fabe8a6bde12304436ed6c43524d1ab12bced426abf3a507d939  2023-01-04 07:34:47+00:00   0x1111111254eeb25477b68fb85ed929f73a960582  0xcb62961daac29b79ebac9a30e142da0e8ba8ead6  1.401493579633375e+20   trace   ETH
6   0x00a0ff958f99fabe8a6bde12304436ed6c43524d1ab12bced426abf3a507d939  2023-01-04 07:34:47+00:00   0x53222470cdcfb8081c0e3a50fd106f0d69e63f20  0x1111111254eeb25477b68fb85ed929f73a960582  1.401493579633375e+20   trace   ETH
7   0x00a0ff958f99fabe8a6bde12304436ed6c43524d1ab12bced426abf3a507d939  2023-01-04 07:34:47+00:00   0xcb62961daac29b79ebac9a30e142da0e8ba8ead6  0x53222470cdcfb8081c0e3a50fd106f0d69e63f20  1.419e+20   transfer    stETH


Thanks!

EDIT!

I applied the code suggested below. Something strange is happening. My original df has a transaction like this

    transaction_hash    block_timestamp from_address    to_address  value   data    token_address
60347   0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947  2022-05-26 09:57:49 UTC 0xdef171fe48cf0115b1d80b88dc8eab59176fee57  0x558247e365be655f9144e1a0140d793984372ef3  6917030000000000.0  trace   ETH
61076   0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947  2022-05-26 09:57:49 UTC 0xdef171fe48cf0115b1d80b88dc8eab59176fee57  0xb1720612d0131839dc489fcf20398ea925282fca  1220650000000000.0  trace   ETH
399307  0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947  2022-05-26 09:57:49 UTC 0xdef171fe48cf0115b1d80b88dc8eab59176fee57  0x6d02e95909da8da09865a26b62055bd6a1d5f706  8.4846e+17  trace   ETH
30155   0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947  2022-05-26 09:57:49+00:00   0x6d02e95909da8da09865a26b62055bd6a1d5f706  0xdef171fe48cf0115b1d80b88dc8eab59176fee57  8.800918863842962e+17   transfer    stETH
625132  0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947  2022-05-26 09:57:49+00:00   0xdef171fe48cf0115b1d80b88dc8eab59176fee57  0x4028daac072e492d34a3afdbef0ba7e35d8b55c4  8.800918863842962e+17   transfer    stETH

when I apply the code, instead of getting what I want which is..

399307  0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947  2022-05-26 09:57:49 UTC 0xdef171fe48cf0115b1d80b88dc8eab59176fee57  0x6d02e95909da8da09865a26b62055bd6a1d5f706  8.4846e+17  trace   ETH
30155   0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947  2022-05-26 09:57:49+00:00   0x6d02e95909da8da09865a26b62055bd6a1d5f706  0xdef171fe48cf0115b1d80b88dc8eab59176fee57  8.800918863842962e+17   transfer    stETH
625132  0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947  2022-05-26 09:57:49+00:00   0xdef171fe48cf0115b1d80b88dc8eab59176fee57  0x4028daac072e492d34a3afdbef0ba7e35d8b55c4  8.800918863842962e+17   transfer    stETH

I get more rows! and doesn’t even remove the rows

    transaction_hash    block_timestamp from_address    to_address  value   data    token_address
30155   0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947  2022-05-26 09:57:49+00:00   0x6d02e95909da8da09865a26b62055bd6a1d5f706  0xdef171fe48cf0115b1d80b88dc8eab59176fee57  8.800918863842962e+17   transfer    stETH
30155   0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947  2022-05-26 09:57:49+00:00   0x6d02e95909da8da09865a26b62055bd6a1d5f706  0xdef171fe48cf0115b1d80b88dc8eab59176fee57  8.800918863842962e+17   transfer    stETH
30155   0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947  2022-05-26 09:57:49+00:00   0x6d02e95909da8da09865a26b62055bd6a1d5f706  0xdef171fe48cf0115b1d80b88dc8eab59176fee57  8.800918863842962e+17   transfer    stETH
60347   0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947  2022-05-26 09:57:49 UTC 0xdef171fe48cf0115b1d80b88dc8eab59176fee57  0x558247e365be655f9144e1a0140d793984372ef3  6917030000000000.0  trace   ETH
61076   0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947  2022-05-26 09:57:49 UTC 0xdef171fe48cf0115b1d80b88dc8eab59176fee57  0xb1720612d0131839dc489fcf20398ea925282fca  1220650000000000.0  trace   ETH
399307  0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947  2022-05-26 09:57:49 UTC 0xdef171fe48cf0115b1d80b88dc8eab59176fee57  0x6d02e95909da8da09865a26b62055bd6a1d5f706  8.4846e+17  trace   ETH

what is going on?

Asked By: Olive

||

Answers:

If you can sort your data by value you can use .merge_asof() to find the "nearest neighbor".

The by argument allows you specify grouping.

df = df.sort_values("value")

left = (
   df.loc[df["token_address"] == "ETH", ["transaction_hash", "value"]]
     .rename(columns={"value": "value_x"})
     .reset_index()
)

right = (
   df.loc[df["token_address"] != "ETH", ["transaction_hash", "value"]]
     .rename(columns={"value": "value_y"})
     .reset_index()
)

# Near matches
nearest = pd.merge_asof(
   left      = left,
   right     = right,
   by        = "transaction_hash", 
   left_on   = "value_x",
   right_on  = "value_y",
   direction = "nearest"
)
 
# Closest matches only
nearest["abs"] = (nearest["value_x"] - nearest["value_y"]).abs()
idxmin = nearest.groupby("value_y")["abs"].idxmin()

# index_x / index_y contain the row indexes of closest matches
minrows = nearest.loc[idxmin, ["index_x", "index_y"]].to_numpy().ravel()

# Extract their values used to filter rows
pairs = df.loc[minrows, ["transaction_hash", "value", "data", "token_address"]]

print(df.merge(pairs))

We find the distance from each "near neighbor" (x - y).abs() and .idxmin() to choose the closest.

Answered By: jqurious
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.