Cannot sink window-aggregated streaming data to MongoDB
Question:
Using Spark Structured Streaming, I am trying to sink streaming data to a MongoDB collection. The issue is that I am querying my data using a window, as follows:
from pyspark.sql.functions import window, col, avg

def basicAverage(df):
    return (df.groupby(window(col('timestamp'), "1 hour", "5 minutes"),
                       col('stationcode'))
              .agg(avg('mechanical').alias('avg_mechanical'),
                   avg('ebike').alias('avg_ebike'),
                   avg('numdocksavailable').alias('avg_numdocksavailable')))
It seems that the MongoDB integration cannot handle a writeStream containing windowed data: when I run the script, my collection stays empty and no error shows up. When I removed the window from my query, the sink worked like a charm.
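Note: in append output mode, a streaming aggregation needs an event-time watermark, and a window's row is only emitted once the watermark passes the end of that window, so with a 1-hour window the collection can legitimately stay empty for quite a while. A minimal sketch of adding a watermark upstream of the aggregation; the '10 minutes' delay threshold is an illustrative assumption, not from the original query:

from pyspark.sql.functions import window, col, avg

def basicAverage(df):
    # The watermark bounds event-time lateness so append mode can
    # finalize and emit each window once it closes.
    return (df.withWatermark('timestamp', '10 minutes')
              .groupby(window(col('timestamp'), '1 hour', '5 minutes'),
                       col('stationcode'))
              .agg(avg('mechanical').alias('avg_mechanical'),
                   avg('ebike').alias('avg_ebike'),
                   avg('numdocksavailable').alias('avg_numdocksavailable')))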
Here is my sink method:
(queryBasicAvg.writeStream
    .format('mongodb')
    .queryName("basicAvg")
    .option("checkpointLocation", "./tmp/pyspark7/")
    .option("forceDeleteTempCheckpointLocation", "true")
    .option('spark.mongodb.connection.uri', 'mongodb://127.0.0.1')
    .option("spark.mongodb.database", 'velibprj')
    .option("spark.mongodb.collection", 'stationsBasicAvg')
    .outputMode("append")
    .start())
Any thoughts on how to solve this issue? Thanks in advance.
Answers:
I solved the issue using the foreach() sink method.
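For reference, a minimal sketch of what such a foreach() sink might look like, assuming pymongo is installed and reusing the connection details from the question; the MongoRowWriter class name and the per-partition client handling are illustrative, not from the original answer:

from pymongo import MongoClient

class MongoRowWriter:
    def open(self, partition_id, epoch_id):
        # Called once per partition: open a dedicated MongoDB client.
        self.client = MongoClient('mongodb://127.0.0.1')
        self.collection = self.client['velibprj']['stationsBasicAvg']
        return True

    def process(self, row):
        # Convert the Row, including the nested window struct, to a plain dict.
        self.collection.insert_one(row.asDict(recursive=True))

    def close(self, error):
        self.client.close()

(queryBasicAvg.writeStream
    .foreach(MongoRowWriter())
    .option("checkpointLocation", "./tmp/pyspark7/")
    .outputMode("append")
    .start())

Unlike the built-in 'mongodb' format, foreach() hands each output row to your own code, so the windowed schema is inserted exactly as you serialize it.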