How do I download a large collection from Firestore with Python without getting a 503 error?

Question:

I'm trying to count the number of docs in a Firestore collection with Python. When I use db.collection('xxxx').stream() I get the following error:

 503 The datastore operation timed out, or the data was temporarily unavailable.

about halfway through. Up to that point it was working fine. Here is the code:

    docs = db.collection(u'theDatabase').stream()
    count = 0
    for doc in docs:
        count += 1
    print(count)

Every time I get a 503 error at about 73,000 records. Does anyone know how to overcome the 20-second timeout?

Answers:

Try using a recursive function to batch the document retrievals and keep each batch under the timeout. Here’s an example based on the delete_collections snippet:

from google.cloud import firestore

# Project ID is determined by the GCLOUD_PROJECT environment variable
db = firestore.Client()


def count_collection(coll_ref, count, cursor=None):
    if cursor is not None:
        docs = [snapshot.reference for snapshot
                in coll_ref.limit(1000).order_by("__name__").start_after(cursor).stream()]
    else:
        docs = [snapshot.reference for snapshot
                in coll_ref.limit(1000).order_by("__name__").stream()]

    count = count + len(docs)

    if len(docs) == 1000:
        # A full batch means there may be more documents, so fetch a
        # snapshot of the last one and recurse from its position.
        return count_collection(coll_ref, count, docs[-1].get())
    else:
        print(count)


count_collection(db.collection('users'), 0)
Answered By: Juan Lara
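One caveat not stated in the answer above: each full batch adds a Python stack frame, so with CPython's default recursion limit of 1000 this pattern caps out at roughly a million documents. The batching logic itself can be sketched in plain Python, with a hypothetical `fetch_page` standing in for `coll_ref.limit(1000).order_by("__name__").start_after(cursor).stream()`:

```python
def fetch_page(items, limit, cursor=None):
    # Hypothetical stand-in for a paged Firestore query: return up to
    # `limit` items strictly after `cursor` (None means start at the top).
    start = 0 if cursor is None else items.index(cursor) + 1
    return items[start:start + limit]


def count_items(items, count=0, cursor=None, limit=1000):
    page = fetch_page(items, limit, cursor)
    count += len(page)
    if len(page) == limit:
        # A full page means there may be more, so recurse from the last item.
        return count_items(items, count, page[-1], limit)
    return count


print(count_items(list(range(2500))))  # 2500
```

The sketch returns the count instead of printing it mid-recursion; the shape of the control flow is otherwise the same as the answer's.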

Although Juan’s answer works for basic counting, if you need more of the data from Firestore than just the IDs (a common use case being a full migration of the data outside GCP), the recursive approach will eat your memory.

So I took Juan’s code and transformed it into a standard iterative algorithm. Hope this helps someone.

limit = 1000  # Reduce this if it uses too much of your RAM


def stream_collection_loop(collection, count, cursor=None):
    while True:
        docs = []  # Important: reassigning this each pass lets the previous page be garbage-collected.

        if cursor:
            docs = [snapshot for snapshot in
                    collection.limit(limit).order_by('__name__').start_after(cursor).stream()]
        else:
            docs = [snapshot for snapshot in collection.limit(limit).order_by('__name__').stream()]

        for doc in docs:
            print(doc.id)
            print(count)
            # The `doc` here is already a `DocumentSnapshot` so you can already call `to_dict` on it to get the whole document.
            process_data_and_log_errors_if_any(doc)  # Placeholder for your own per-document processing.
            count = count + 1

        if len(docs) == limit:
            cursor = docs[limit-1]
            continue

        break


stream_collection_loop(db_v3.collection('collection'), 0)
Answered By: Alec Gerona
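The memory behaviour is easier to see in isolation. Below is a pure-Python sketch of the same iterative loop, written as a generator; `fetch_page` is a hypothetical stand-in for the paged Firestore query, and only one page is ever held in memory at a time:

```python
def fetch_page(items, limit, cursor=None):
    # Hypothetical stand-in for collection.limit(limit).order_by('__name__')
    # .start_after(cursor).stream(): up to `limit` items after `cursor`.
    start = 0 if cursor is None else items.index(cursor) + 1
    return items[start:start + limit]


def stream_all(items, limit=1000):
    # Iterative cursor pagination: `page` is replaced on every pass, so
    # the previous page can be garbage-collected immediately.
    cursor = None
    while True:
        page = fetch_page(items, limit, cursor)
        yield from page
        if len(page) < limit:
            break
        cursor = page[-1]


print(sum(1 for _ in stream_all(list(range(2500)))))  # 2500
```

A short page (fewer than `limit` items) signals the end of the collection, which is the same termination condition the answer uses.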

Other answers have shown how to use pagination to solve the timeout issue.

With asyncio there is a more reusable approach that lets you iterate the stream of documents in the same way as before, with pagination handled for you.

Here is an example of a function that takes an AsyncQuery and returns an async generator, in the same way as the AsyncQuery stream() method.

from typing import AsyncGenerator, Optional

from google.cloud.firestore import AsyncQuery, DocumentSnapshot


async def paginate_query_stream(
    query: AsyncQuery,
    order_by: str,
    cursor: Optional[DocumentSnapshot] = None,
    page_size: int = 10000,
) -> AsyncGenerator[DocumentSnapshot, None]:
    paged_query: AsyncQuery = query.order_by(order_by)
    document = cursor
    has_any = True
    while has_any:
        has_any = False
        if document:
            paged_query = paged_query.start_after(document)
        paged_query = paged_query.limit(page_size)
        async for document in paged_query.stream():
            has_any = True
            yield document

Keep in mind that if your target collection grows constantly, you need to filter on an upper bound in the query in advance to prevent a potential infinite loop.
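A pure-Python sketch of that hazard: the fixed `cutoff` captured before the scan plays the role of an upper-bound `where` filter (the `fetch_page` helper is hypothetical), keeping the loop finite even though the collection grows during iteration:

```python
def fetch_page(items, limit, cursor, cutoff):
    # Hypothetical paged query with an upper bound baked in, mimicking
    # something like query.where('created', '<', cutoff).limit(limit).
    bounded = [x for x in items if x < cutoff]
    start = 0 if cursor is None else bounded.index(cursor) + 1
    return bounded[start:start + limit]


def scan(items, limit=2):
    cutoff = len(items)  # fixed before iterating; later writes get ids >= cutoff
    cursor = None
    while True:
        page = fetch_page(items, limit, cursor, cutoff)
        if not page:
            break
        for x in page:
            items.append(len(items))  # the collection grows while we scan it
            yield x
        cursor = page[-1]


print(list(scan(list(range(5)))))  # [0, 1, 2, 3, 4]
```

Without the cutoff, each yielded item is outpaced by a newly appended one, so the scan would never terminate.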

A usage example that counts documents:

from google.cloud.firestore import AsyncQuery

# db is an AsyncClient instance; its collection() returns an AsyncCollectionReference.
docs = db.collection(u'theDatabase')
# Query without conditions, get all documents.
query = AsyncQuery(docs)

count = 0
async for doc in paginate_query_stream(query, order_by='__name__'):
    count += 1
print(count)
Answered By: vilozio
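The control flow of paginate_query_stream can also be exercised without Firestore. A minimal sketch with a hypothetical in-memory `fetch_page` in place of `paged_query.stream()` reproduces the `has_any` termination logic:

```python
import asyncio
from typing import AsyncGenerator, List, Optional


async def fetch_page(items: List[int], limit: int,
                     cursor: Optional[int]) -> List[int]:
    # Hypothetical stand-in for paged_query.stream(): return up to `limit`
    # items strictly after `cursor` (None means start from the beginning).
    start = 0 if cursor is None else items.index(cursor) + 1
    return items[start:start + limit]


async def paginate(items: List[int], limit: int = 4) -> AsyncGenerator[int, None]:
    # Same control flow as paginate_query_stream: the `has_any` flag
    # detects an empty page and ends the loop.
    cursor = None
    has_any = True
    while has_any:
        has_any = False
        for item in await fetch_page(items, limit, cursor):
            has_any = True
            cursor = item
            yield item


async def main() -> List[int]:
    return [x async for x in paginate(list(range(10)))]


print(asyncio.run(main()))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

As in the original, the loop issues one extra (empty) page query when the collection size is an exact multiple of the page size; that is the price of the simple `has_any` check.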