How to write an equivalent of pandas' merge_asof in PySpark


I am trying to write an equivalent of pandas' merge_asof in Spark.

Here is a sample:

from datetime import datetime
df1 = spark.createDataFrame(
    ("time", "ticker", "bid", "ask")
df2 = spark.createDataFrame(
    ("time", "ticker", "price", "quantity")


import pandas as pd

d1 = df1.toPandas().sort_values("time", ascending=True)
d2 = df2.toPandas().sort_values("time", ascending=True)

pd.merge_asof(d2, d1, on='time', by='ticker')


                        time ticker   price  quantity     bid     ask
0 2019-02-03 13:30:00.000023   MSFT   51.95        75   51.95   51.96
1 2019-02-03 13:30:00.000038   MSFT   51.95       155   51.95   51.96
2 2019-02-03 13:30:00.000048   GOOG  720.77       100  720.50  720.93
3 2019-02-03 13:30:00.000048   GOOG  720.92       100  720.50  720.93
4 2019-02-03 13:30:00.000048   AAPL   98.00       100     NaN     NaN
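
For readers unfamiliar with merge_asof, its default "backward" matching can be sketched in plain Python: each left row takes the latest right row with the same key whose time is less than or equal to its own. The rows below are hypothetical (times are integer microsecond offsets, not the data used above), and `asof_backward` is an illustrative helper, not a pandas or Spark API.

```python
from bisect import bisect_right
from collections import defaultdict


def asof_backward(left, right):
    """Attach to each left row the latest right payload with the same key and time <= left time."""
    # Group right rows by key, keeping each group sorted by time.
    by_key = defaultdict(list)
    for t, key, payload in sorted(right, key=lambda row: row[0]):
        by_key[key].append((t, payload))

    out = []
    for t, key, payload in left:
        candidates = by_key.get(key, [])
        times = [c[0] for c in candidates]
        i = bisect_right(times, t)  # number of right rows with time <= t
        match = candidates[i - 1][1] if i > 0 else None
        out.append((t, key, payload, match))
    return out


# Hypothetical rows: (time, ticker, payload); quote payload is (bid, ask).
quotes = [
    (23, "GOOG", (720.50, 720.93)),
    (23, "MSFT", (51.95, 51.96)),
    (49, "AAPL", (97.99, 98.01)),
]
trades = [
    (23, "MSFT", 51.95),
    (38, "MSFT", 51.95),
    (48, "GOOG", 720.77),
    (48, "AAPL", 98.00),
]

result = asof_backward(trades, quotes)
for row in result:
    print(row)
```

The AAPL trade gets no match because its only quote arrives later, which corresponds to the NaN row in the pandas output.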

Using a pandas UDF in Spark:

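import pandas as pd

def asof_join(l, r):
    # l and r are the pandas DataFrames of one ticker group from each side
    return pd.merge_asof(l, r, on="time", by="ticker")

# Cogroup both DataFrames by ticker and run the pandas as-of join per group
df2.groupby("ticker").cogroup(df1.groupby("ticker")).applyInPandas(
    asof_join, schema="time timestamp, ticker string, price float, quantity int, bid float, ask float"
).show(10, False)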


|time                      |ticker|price |quantity|bid  |ask   |
|2019-02-03 13:30:00.000048|AAPL  |98.0  |100     |null |null  |
|2019-02-03 13:30:00.000048|GOOG  |720.77|100     |720.5|720.93|
|2019-02-03 13:30:00.000048|GOOG  |720.92|100     |720.5|720.93|
|2019-02-03 13:30:00.000023|MSFT  |51.95 |75      |51.95|51.96 |
|2019-02-03 13:30:00.000038|MSFT  |51.95 |155     |51.95|51.96 |


The UDF works and gives the right results, but is there a more efficient way to do this in PySpark, for example with window functions? I am processing a large amount of data and the UDF is the bottleneck.

Asked By: shahidammer



Here’s a more comprehensive answer with configurable on, by, tolerance and direction clauses of merge_asof. This answer only covers the "backward" direction.

OP’s question can be answered by first doing a full join and then using last over a window:

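from pyspark.sql import functions as F, Window as W

# Flag rows coming from the left side (df2), then full-join so that
# rows from both DataFrames share one timeline per ticker
df = (
    df2.withColumn('_df_left', F.lit(True))
    .join(df1, ['time', 'ticker'], 'full')
)
w = W.partitionBy('ticker').orderBy('time')
# For every df1 value column, fill nulls with the last non-null value seen so far
for c in set(df1.columns) - {'time', 'ticker'}:
    df = df.withColumn(c, F.coalesce(c, F.last(c, True).over(w)))
# Keep only the rows that came from df2
df = df.filter('_df_left').drop('_df_left')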
# +--------------------------+------+------+--------+-----+------+
# |time                      |ticker|price |quantity|bid  |ask   |
# +--------------------------+------+------+--------+-----+------+
# |2019-02-03 13:30:00.000048|AAPL  |98.0  |100     |null |null  |
# |2019-02-03 13:30:00.000048|GOOG  |720.77|100     |720.5|720.93|
# |2019-02-03 13:30:00.000048|GOOG  |720.92|100     |720.5|720.93|
# |2019-02-03 13:30:00.000023|MSFT  |51.95 |75      |51.95|51.96 |
# |2019-02-03 13:30:00.000038|MSFT  |51.95 |155     |51.95|51.96 |
# +--------------------------+------+------+--------+-----+------+
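
The idea behind the join-then-window approach can be sketched without Spark: put rows from both sides on one timeline per ticker, walk it in time order while remembering the last right-side value, and emit only the left-side rows. This is a plain-Python analogue under stated assumptions, not the Spark code above: the rows are hypothetical (times are integer microsecond offsets), `asof_sweep` is an illustrative name, and a tag-and-sort step stands in for the full join.

```python
def asof_sweep(left, right):
    """Backward as-of match by sweeping a merged, per-key timeline."""
    # Tag 0 = right row, tag 1 = left row. At equal (key, time) the right
    # row sorts first, so exact-time matches are picked up.
    events = [(key, t, 0, payload) for t, key, payload in right]
    events += [(key, t, 1, payload) for t, key, payload in left]
    events.sort(key=lambda e: e[:3])

    out = []
    current_key, last_payload = None, None
    for key, t, is_left, payload in events:
        if key != current_key:
            # New partition: forget values carried over from the previous key.
            current_key, last_payload = key, None
        if is_left:
            out.append((t, key, payload, last_payload))
        else:
            last_payload = payload
    return out


# Hypothetical rows: (time, ticker, payload); quote payload is (bid, ask).
quotes = [
    (23, "GOOG", (720.50, 720.93)),
    (23, "MSFT", (51.95, 51.96)),
    (49, "AAPL", (97.99, 98.01)),
]
trades = [
    (23, "MSFT", 51.95),
    (38, "MSFT", 51.95),
    (48, "GOOG", 720.77),
    (48, "AAPL", 98.00),
]

sweep_result = asof_sweep(trades, quotes)
for row in sweep_result:
    print(row)
```

Resetting the carried value at each new key plays the role of partitionBy('ticker'), and the running "last seen" value plays the role of last(..., ignorenulls=True) over the ordered window.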
Answered By: ZygD