How do I get a list of just the ObjectIds using pymongo?

Question:

I have the following code:

from pymongo import MongoClient

client = MongoClient()
data_base = client.hkpr_restore
agents_collection = data_base.agents
agent_ids = agents_collection.find({},{"_id":1})

This gives me a result of:

{u'_id': ObjectId('553020a8bf2e4e7a438b46d9')}
{u'_id': ObjectId('553020a8bf2e4e7a438b46da')}
{u'_id': ObjectId('553020a8bf2e4e7a438b46db')}

How do I get at just the ObjectIds so I can then use each ID to search another collection?

Asked By: DBWeinstein

||

Answers:

Try creating a list comprehension with just the _ids as follows:

>>> from pymongo import MongoClient
>>> client = MongoClient()
>>> data_base = client.hkpr_restore
>>> agents_collection = data_base.agents
>>> result = agents_collection.find({},{"_id":1})
>>> agent_ids = [x["_id"] for x in result]
>>> 
>>> print(agent_ids)
[ ObjectId('553020a8bf2e4e7a438b46d9'),  ObjectId('553020a8bf2e4e7a438b46da'),  ObjectId('553020a8bf2e4e7a438b46db')]
>>>
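
Since the goal is to use each ID to search another collection, one option is a single `$in` query rather than one query per ID. The following is a minimal sketch, not a definitive implementation: `extract_ids`, `find_related`, and the field name `agent_id` are hypothetical names chosen for illustration, and the collection argument is assumed to behave like a pymongo collection.

```python
def extract_ids(documents):
    """Return the _id of every document in an iterable (e.g. a pymongo cursor)."""
    return [doc["_id"] for doc in documents]


def find_related(other_collection, ids, field="agent_id"):
    """Fetch documents from another collection whose `field` is one of `ids`.

    `field` is a hypothetical name; use whatever field stores the reference
    in your own schema.
    """
    # $in matches documents whose field equals any value in the given list.
    return list(other_collection.find({field: {"$in": list(ids)}}))
```

With the collections from the question this might be called as `find_related(data_base.other_collection, extract_ids(result))`, where `other_collection` is again an assumed name.
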
Answered By: chridam

Use distinct

In [27]: agent_ids = agents_collection.distinct('_id')

In [28]: agent_ids
Out[28]: 
[ObjectId('553662940acf450bef638e6d'),
 ObjectId('553662940acf450bef638e6e'),
 ObjectId('553662940acf450bef638e6f')]

In [29]: agent_id2 = [str(id) for id in agents_collection.distinct('_id')]

In [30]: agent_id2
Out[30]: 
['553662940acf450bef638e6d',
 '553662940acf450bef638e6e',
 '553662940acf450bef638e6f']
Answered By: styvane

I would like to add something which is more general than querying for all _id.

import bson
[...]
results = agents_collection.find({})
objects = [v for result in results for k,v in result.items()
          if isinstance(v,bson.objectid.ObjectId)]

Context: saving objects in GridFS creates ObjectIds. When I needed to retrieve all of them for further querying, this snippet helped me out.
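
The comprehension above only inspects top-level fields. If ObjectIds may also sit inside nested documents or lists, a recursive walk is one way to collect them. This is a sketch under that assumption: `collect_values_of_type` is a hypothetical helper, and the type to look for is passed in (for example `bson.objectid.ObjectId`) so the function itself needs no imports.

```python
def collect_values_of_type(value, value_type):
    """Recursively collect every instance of `value_type` found in `value`.

    `value` may be a single document (dict), a list of documents, or any
    nested mix of dicts, lists, and tuples.
    """
    found = []
    if isinstance(value, value_type):
        found.append(value)
    elif isinstance(value, dict):
        # Descend into each field value of a (sub)document.
        for item in value.values():
            found.extend(collect_values_of_type(item, value_type))
    elif isinstance(value, (list, tuple)):
        # Descend into each element of an array field or a list of documents.
        for item in value:
            found.extend(collect_values_of_type(item, value_type))
    return found
```

A call such as `collect_values_of_type(list(results), bson.objectid.ObjectId)` would then cover both top-level and nested ids.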

Answered By: Jordy Van Landeghem

I solved the problem by following this answer: add a hint to the find call, then simply iterate through the returned cursor.

db.c.find({}, {_id: 1}).hint({_id: 1});

I am guessing that without the hint, the cursor fetches the whole document for each result when iterated, which makes the iteration extremely slow.
With the hint, the cursor returns only the ObjectId, and the iteration finishes very quickly.

The background is that I am working on an ETL job that requires syncing one Mongo collection to another while modifying the data according to some criteria. The total number of ObjectIds is around 100,000,000.

I tried using distinct but got the following error:

Error in : distinct too big, 16mb cap

I also tried aggregation with $group, as suggested in answers to a similar question, only to hit a memory consumption error.
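
With around 100 million ids, it can help to consume the cursor in fixed-size batches instead of building one large list. The following is a minimal sketch, assuming the cursor yields documents that contain `_id`; `iter_id_batches` and the default batch size are illustrative choices, not part of pymongo.

```python
from itertools import islice


def iter_id_batches(cursor, batch_size=1000):
    """Yield lists of at most `batch_size` _id values from an iterable of documents."""
    iterator = iter(cursor)
    while True:
        # islice pulls the next `batch_size` documents without loading the rest.
        batch = [doc["_id"] for doc in islice(iterator, batch_size)]
        if not batch:
            return
        yield batch


# Possible usage with pymongo (assumes `collection` is an existing collection):
#   cursor = collection.find({}, {"_id": 1}).hint([("_id", 1)])
#   for id_batch in iter_id_batches(cursor, 10000):
#       ...  # process or copy this batch
```

Each batch can then be processed or written to the target collection before the next one is read, so memory use stays bounded by the batch size.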

Answered By: Heyang Wang

I wasn’t searching for the _id but extracting another field, and I found this method fast (assuming you have an index on the field):

# Note: the braces make this a set comprehension, so duplicate values are dropped.
list_of_strings = {x.get("MY_FIELD") for x in db.col.find({},{"_id": 0, "MY_FIELD": 1}).hint("MY_FIELDIdx")}

Where MY_FIELDIdx is the name of the index for the field I’m trying to extract.

Answered By: Jack