MongoDB InvalidDocument: Cannot encode object

Question:

I am using Scrapy to scrape blogs and then store the data in MongoDB. At first I got an InvalidDocument exception, so the obvious conclusion was that the data is not in the right encoding. So before persisting the object, in my MongoPipeline I check whether the document is in 'utf-8 strict', and only then do I try to persist the object to MongoDB. BUT I still get InvalidDocument exceptions, which is annoying.

This is the code of my MongoPipeline object that persists items to MongoDB:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#

import pymongo
import sys, traceback
from scrapy.exceptions import DropItem
from crawler.items import BlogItem, CommentItem


class MongoPipeline(object):
    collection_name = 'master'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'posts')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):

        if type(item) is BlogItem:
            try:
                if 'url' in item:
                    item['url'] = item['url'].encode('utf-8', 'strict')
                if 'domain' in item:
                    item['domain'] = item['domain'].encode('utf-8', 'strict')
                if 'title' in item:
                    item['title'] = item['title'].encode('utf-8', 'strict')
                if 'date' in item:
                    item['date'] = item['date'].encode('utf-8', 'strict')
                if 'content' in item:
                    item['content'] = item['content'].encode('utf-8', 'strict')
                if 'author' in item:
                    item['author'] = item['author'].encode('utf-8', 'strict')

            except:  # catch *all* exceptions
                e = sys.exc_info()[0]
                spider.logger.critical("ERROR ENCODING %s", e)
                traceback.print_exc(file=sys.stdout)
                raise DropItem("Error encoding BLOG %s" % item['url'])

            if 'comments' in item:
                comments = item['comments']
                item['comments'] = []

                try:
                    for comment in comments:
                        if 'date' in comment:
                            comment['date'] = comment['date'].encode('utf-8', 'strict')
                        if 'author' in comment:
                            comment['author'] = comment['author'].encode('utf-8', 'strict')
                        if 'content' in comment:
                            comment['content'] = comment['content'].encode('utf-8', 'strict')

                        item['comments'].append(comment)

                except:  # catch *all* exceptions
                    e = sys.exc_info()[0]
                    spider.logger.critical("ERROR ENCODING COMMENT %s", e)
                    traceback.print_exc(file=sys.stdout)

        self.db[self.collection_name].insert(dict(item))

        return item

And still I get the following exception:

au coeur de l\u2019explosion de la bulle Internet n\u2019est probablement pas \xe9tranger au succ\xe8s qui a suivi. Mais franchement, c\u2019est un peu court comme argument !Ce que je sais dire, compte tenu de ce qui pr\xe9c\xe8de, c\u2019est quelles sont les conditions pour r\xe9ussir si l\u2019on est vraiment contraint de rester en France. Ce sont des sujets que je d\xe9velopperai dans un autre article.',
     'date': u'2012-06-27T23:21:25+00:00',
     'domain': 'reussir-sa-boite.fr',
     'title': u'Peut-on encore entreprendre en France ?\t\t\t ',
     'url': 'http://www.reussir-sa-boite.fr/peut-on-encore-entreprendre-en-france/'}
    Traceback (most recent call last):
      File "h:program filesanacondalibsite-packagestwistedinternetdefer.py", line 588, in _runCallbacks
        current.result = callback(current.result, *args, **kw)
      File "H:PDSBNPcrawlercrawlerpipelines.py", line 76, in process_item
        self.db[self.collection_name].insert(dict(item))
      File "h:program filesanacondalibsite-packagespymongocollection.py", line 409, in insert
        gen(), check_keys, self.uuid_subtype, client)
    InvalidDocument: Cannot encode object: {'author': 'Arnaud Lemasson',
     'content': 'Tellement vrai\xe2\x80\xa6 Il faut vraiment \xc3\xaatre motiv\xc3\xa9 aujourd\xe2\x80\x99hui pour monter sa bo\xc3\xaete. On est pr\xc3\xa9lev\xc3\xa9 de partout, je ne pense m\xc3\xaame pas \xc3\xa0 embaucher, cela me co\xc3\xbbterait bien trop cher. Bref, 100% d\xe2\x80\x99accord avec vous. Le probl\xc3\xa8me, je ne vois pas comment cela pourrait changer avec le gouvernement actuel\xe2\x80\xa6 A moins que si, j\xe2\x80\x99ai pu lire il me semble qu\xe2\x80\x99ils avaient en t\xc3\xaate de r\xc3\xa9duire l\xe2\x80\x99IS pour les petites entreprises et de l\xe2\x80\x99augmenter pour les grandes\xe2\x80\xa6 A voir',
     'date': '2012-06-27T23:21:25+00:00'}
    2015-11-04 15:29:15 [scrapy] INFO: Closing spider (finished)
    2015-11-04 15:29:15 [scrapy] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 259,
     'downloader/request_count': 1,
     'downloader/request_method_count/GET': 1,
     'downloader/response_bytes': 252396,
     'downloader/response_count': 1,
     'downloader/response_status_count/200': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2015, 11, 4, 14, 29, 15, 701000),
     'log_count/DEBUG': 2,
     'log_count/ERROR': 1,
     'log_count/INFO': 7,
     'response_received_count': 1,
     'scheduler/dequeued': 1,
     'scheduler/dequeued/memory': 1,
     'scheduler/enqueued': 1,
     'scheduler/enqueued/memory': 1,
     'start_time': datetime.datetime(2015, 11, 4, 14, 29, 13, 191000)}

Another funny thing: following the comment from @eLRuLL, I did the following:

>>> s = "Tellement vraixe2x80xa6 Il faut vraiment xc3xaatre motivxc3xa9 aujourdxe2x80x99hui pour monter sa boxc3xaete. On est prxc3xa9levxc3xa9 de partout, je ne pense mxc3xaame pas xc3xa0 embaucher, cela me"
>>> s
'Tellement vrai\xe2\x80\xa6 Il faut vraiment \xc3\xaatre motiv\xc3\xa9 aujourd\xe2\x80\x99hui pour monter sa bo\xc3\xaete. On est pr\xc3\xa9lev\xc3\xa9 de partout, je ne pense m\xc3\xaame pas \xc3\xa0 embaucher, cela me'
>>> se = s.encode("utf8", "strict")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 14: ordinal not in range(128)
>>> se = s.encode("utf-8", "strict")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 14: ordinal not in range(128)
>>> s.decode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 14: ordinal not in range(128)

Then my question is: if this text cannot be encoded, why is the try/except in my MongoPipeline not catching this EXCEPTION? Only objects that raise no exception should be appended to item['comments'], right?

Asked By: Codious-JR


Answers:

First, when you do "somestring".encode(...), it doesn't change "somestring" in place; it returns a new encoded string, so you should assign the result back, like:

 item['author'] = item['author'].encode('utf-8', 'strict')

and the same for the other fields.
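
For illustration, a minimal Python 2 sketch (the variable contents are made up) of why the assignment is needed:

# -*- coding: utf-8 -*-
author = u'Éric'                   # a unicode string
author.encode('utf-8')             # returns a NEW byte string; author is unchanged
encoded = author.encode('utf-8')   # keep the result by assigning it
print type(author)                 # <type 'unicode'>
print type(encoded)                # <type 'str'>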

Answered By: eLRuLL

Finally I figured it out. The problem was not with encoding; it was with the structure of the documents.

I had gone off the standard MongoPipeline example, which does not deal with nested Scrapy items.

What I am doing is:

BlogItem:
    "url"
    comments = [CommentItem]

So my BlogItem has a list of CommentItems. Now the problem comes here: to persist the object in the database, I do:

self.db[self.collection_name].insert(dict(item))

So here I am converting the BlogItem to a dict, but I am not converting the list of CommentItems inside it. And because the traceback displays the CommentItem much like a dict, it did not occur to me that the problematic object is not a dict!

So finally, the way to fix this problem is to change the line that appends the comment to the comment list, as such:

item['comments'].append(dict(comment))

Now MongoDB considers it a valid document.
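
To make the difference concrete, here is a minimal sketch (the field sets are trimmed-down stand-ins for the real items, and example.com is a placeholder):

import scrapy

class CommentItem(scrapy.Item):
    author = scrapy.Field()
    content = scrapy.Field()

class BlogItem(scrapy.Item):
    url = scrapy.Field()
    comments = scrapy.Field()

comment = CommentItem(author='A', content='B')
item = BlogItem(url='http://example.com', comments=[comment])

# dict() is shallow: the comments list still holds a CommentItem,
# which pymongo cannot encode, hence InvalidDocument:
shallow = dict(item)

# Converting each nested item as well gives a plain, BSON-encodable dict:
item['comments'] = [dict(c) for c in item['comments']]
doc = dict(item)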

Lastly, for the last part, where I ask why I am getting an exception on the Python console and not in the script:

The reason is that on the Python console I was working with byte strings. In Python 2, calling .encode() on a byte string makes Python implicitly decode it as ASCII first, and that implicit decode is what raises the UnicodeDecodeError.
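
A minimal Python 2 sketch of that implicit decode (s is a UTF-8 byte string here):

s = 'caf\xc3\xa9'        # UTF-8 bytes for u'café'
# s.encode('utf-8') effectively runs s.decode('ascii').encode('utf-8'),
# and the ASCII decode fails on the 0xc3 byte:
#   UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 ...
s.decode('utf-8')        # decode with the right codec instead: u'caf\xe9'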

Answered By: Codious-JR

I got this error when running a query

db.collection.find({'attr': {'$gte': 20}})

and some records in the collection had a non-numeric value for attr.
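
A hedged sketch (the database, collection, and field names are illustrative, and the 'number' alias for $type needs MongoDB 3.2+) of how to spot the offending records first:

import pymongo

db = pymongo.MongoClient()['mydb']   # hypothetical database name

# Find documents whose attr exists but is not numeric:
for doc in db.collection.find({'attr': {'$exists': True,
                                        '$not': {'$type': 'number'}}}):
    print doc['_id'], repr(doc['attr'])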

Answered By: duhaime

I ran into the same error using a numpy array in a Mongo query:

'myField' : { '$in': myList },

The fix was simply to convert the np.ndarray into a list:

'myField' : { '$in': list(myList) },
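
A short sketch of that failure mode (the database, collection, and field names are illustrative). Note that ndarray.tolist() also converts each element to a native Python scalar, which matters because bare numpy integers are not BSON-encodable either:

import numpy as np
import pymongo

db = pymongo.MongoClient()['mydb']   # hypothetical database name
myList = np.array([1, 2, 3])

# An ndarray is not a type the BSON encoder knows, so this query raises
# InvalidDocument ("Cannot encode object"):
#   db.collection.find({'myField': {'$in': myList}})

# tolist() yields a plain list of native Python ints, which encodes fine:
cursor = db.collection.find({'myField': {'$in': myList.tolist()}})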

In my case it was super stupid yet not easy to notice:

I accidentally wrote

f"indexes_access.{jsonData['index']}: {jsonData['newState']}"

instead of

{f"indexes_access.{jsonData['index']}": f"{jsonData['newState']}"}

(one long string built with an f-string instead of a key and a value built separately)
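
A hedged reconstruction (the contents of jsonData and the update call are assumptions; only the two expressions above come from the answer):

jsonData = {'index': 'idx_1', 'newState': True}   # hypothetical contents

# Wrong: one long f-string, which is just a str, not a document:
bad = f"indexes_access.{jsonData['index']}: {jsonData['newState']}"

# Right: a dict whose key is the dotted field path:
good = {f"indexes_access.{jsonData['index']}": jsonData['newState']}

# e.g. as part of an update:
#   collection.update_one({'_id': some_id}, {'$set': good})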

Answered By: Eliav Louski