Spark Mongo connector: Upsert only one attribute in MongoDB connection

Question:

Let’s say I have the following Mongo document:

{
 "_id":1, 
 "age": 10,
 "foo": 20
}

and the following Spark DataFrame df:

_id | val
 1  | 'a'
 2  | 'b'

and now I want to append the val from the dataframe to the Mongo document…

Using the MongoDB Spark connector, I can rely on the default upsert logic keyed on “_id”: if the “_id” of a row in the Spark DataFrame matches the “_id” of an existing Mongo document, the connector will not create a new document but will update the existing one.

But! The update basically behaves like replace – if I do the following:

(df
 .write.format("com.mongodb.spark.sql.DefaultSource")
 .mode("append")
 .option("spark.mongodb.output.uri", "mongodb://mongo_server:27017/testdb.test_collection")
 .save())

The collection will look like:

[
    {
     "_id": 1,
     "val": "a"
    },
    {
     "_id": 2,
     "val": "b"
    }
]

and I would like to obtain this:

[
    {
     "_id": 1,
     "age": 10,
     "foo": 20,
     "val": "a"
    },
    {
     "_id": 2,
     "val": "b"
    }
]
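
For context, the default write path effectively issues a per-row replace-with-upsert keyed on “_id”, which is why the other attributes disappear. Expressed with pymongo purely for illustration (a sketch, not the connector's actual code):

from pymongo import MongoClient

# Illustration only: what the default (replaceDocument=true) write amounts to per row.
coll = MongoClient("mongodb://mongo_server:27017")["testdb"]["test_collection"]
for row in [{"_id": 1, "val": "a"}, {"_id": 2, "val": "b"}]:
    # replace_one with upsert=True swaps in the whole row as the new document,
    # so fields like "age" and "foo" that are not in the row are dropped.
    coll.replace_one({"_id": row["_id"]}, row, upsert=True)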

My questions are:

  • Is there a way (some option) to make the Spark connector behave the way
    I want it to?

  • Sure, I could first read the documents from Mongo into Spark, enrich
    them with the “val” attribute and write/append them back to Mongo (see
    the sketch after this list). What is the I/O cost of this operation? Is
    it a full load (reading all documents and then replacing all attributes),
    or is it somewhat cleverer (reading all documents but writing only the
    “val” attribute rather than replacing the entire document)?
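
The read-enrich-write alternative from the second question could look roughly like this (a sketch only; it assumes a SparkSession named spark, the URI from the example above, and a full-outer join on “_id” so that new ids such as 2 are kept):

# Sketch of the read-enrich-write approach: read the whole collection,
# join the new column on, and write the merged documents back.
uri = "mongodb://mongo_server:27017/testdb.test_collection"

existing = (spark.read.format("com.mongodb.spark.sql.DefaultSource")
                 .option("spark.mongodb.input.uri", uri)
                 .load())

enriched = existing.join(df, on="_id", how="full_outer")

(enriched.write.format("com.mongodb.spark.sql.DefaultSource")
         .mode("append")
         .option("spark.mongodb.output.uri", uri)
         .save())

As written, this reads every document and writes each one back in full, so it does not by itself avoid the full rewrite.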

Asked By: mLC


Answers:

Is there a way (some option) to make the Spark connector behave the way I want it to?

Yes, you can set the replaceDocument option to false. For example, using the MongoDB Connector for Spark v2.2.2 and Apache Spark v2.3 in Python:

import pandas as pd

# Read the existing collection into a DataFrame
df = (sqlContext.read.format("com.mongodb.spark.sql.DefaultSource")
                .option("spark.mongodb.input.uri", "mongodb://host101:27017/dbName.collName")
                .load())
df.first()
> Row(_id=1.0, age=10.0, foo=20.0)

# Create a DataFrame with the new attribute
d = {'_id': [1, 2], 'val': ['a', 'b']}
inputdf = pd.DataFrame(data=d)
sparkdf = sqlContext.createDataFrame(inputdf)

# Write back to MongoDB without replacing the existing documents
(sparkdf.write.format("com.mongodb.spark.sql.DefaultSource")
         .mode("append")
         .option("spark.mongodb.output.uri", "mongodb://host101:27017/dbName.collName")
         .option("replaceDocument", "false")
         .save())

# Result 
+---+----+----+---+
|_id| age| foo|val|
+---+----+----+---+
|1.0|10.0|20.0|  a|
|2.0|null|null|  b|
+---+----+----+---+
Answered By: Wan B.

As of Spark connector version 3.x, replaceDocument has been deprecated.

To achieve what you want in Spark connector 3.x+, set operationType to 'update' (rather than the default, 'replace'). You may also want to set upsertDocument to false (rather than the default, which is true) if you only want to update existing documents.

Read more here:
https://www.mongodb.com/docs/spark-connector/current/configuration/write/
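
For illustration, a minimal PySpark write using the options from the linked documentation might look roughly like this (a sketch only; the database and collection names are placeholders taken from the question, and upsertDocument is spelled out only to show its effect):

# Sketch: update matching documents in place instead of replacing them.
# Assumes spark.mongodb.write.connection.uri is already set in the SparkConf.
(df.write.format("mongodb")
   .mode("append")
   .option("database", "testdb")
   .option("collection", "test_collection")
   .option("operationType", "update")   # $set only the columns present in the DataFrame
   .option("idFieldList", "_id")        # match on _id (the default)
   .option("upsertDocument", "true")    # true (the default) also inserts rows with no matching _id (e.g. _id 2);
                                        # set to "false" to update existing documents only
   .save())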

Answered By: rouble