How to create a fast, queryable index from many JSON files (ideally in Python)

Question:

I have 78,000 individual JSON files that I created with a Python script that scrapes a community forum and extracts information from each post. They consist of simple key-value pairs, like so:

{
    "name": "Chris Wilson",
    "item": "Darth Vader speaker phone",
    "price": "$100",
    "notes": "Great condition!"
}

Some keys are common to all files — name and price, for example — while many others appear in only some. (The site I’m crawling allows for user-defined fields.) I want to be able to search, sort, and group by any field I want.

Normally, I would load each file into a SQLite database and query it from there. This would be extremely tedious, given the multitude of fields.

From what little I understand about NoSQL frameworks, this seems like a project better suited to a document-based system than to a traditional relational database. I tried to learn CloudDB, but most of the documentation I can find assumes that you start with an empty database, not with pre-fabricated documents.

Is there a good, reasonably simple (or at least well-documented) solution for indexing and querying large numbers of dictionary objects? I prefer Python, but I'm happy to venture into Node or whatever else.

Thank you!

P.S. Let me know if you’re interested in that Darth Vader phone.

Asked By: Chris Wilson


Answers:

You might want to check Julian Hyde’s blog; he recently posted something about SQL over JSON using Apache Drill.

Answered By: Lucas Soares

This sounds like the perfect use case for MongoDB. Set up MongoDB and import your JSON files directly into a collection using mongoimport --file <filename>

They have great Python support too.
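As an alternative to calling mongoimport once per file, the import step could also be done from Python. The following is only a minimal sketch, assuming the pymongo driver is installed and a MongoDB server is running locally. The database name "forum", the collection name "posts", the directory "scraped_posts", and the extra "_source_file" field are hypothetical names chosen for illustration.

```python
import json
from pathlib import Path


def load_documents(directory):
    """Yield one dict per *.json file found directly inside `directory`."""
    for path in sorted(Path(directory).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            doc = json.load(handle)
        # Hypothetical extra field so each record can be traced back to its file.
        doc["_source_file"] = path.name
        yield doc


def import_documents(collection, documents, batch_size=1000):
    """Insert documents in batches via insert_many; return how many were sent."""
    batch, total = [], 0
    for doc in documents:
        batch.append(doc)
        if len(batch) >= batch_size:
            collection.insert_many(batch)
            total += len(batch)
            batch = []
    if batch:  # flush the final, partially filled batch
        collection.insert_many(batch)
        total += len(batch)
    return total


# Usage (assumes `pip install pymongo` and a server on localhost):
# from pymongo import MongoClient
# posts = MongoClient("mongodb://localhost:27017")["forum"]["posts"]
# print(import_documents(posts, load_documents("scraped_posts")))
```

Batching keeps memory use flat and avoids one network round trip per document, which matters with 78,000 files.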

Some documentation links:

http://docs.mongodb.org/manual/reference/mongoimport/#cmdoption-mongoimport--file

http://docs.mongodb.org/ecosystem/drivers/python/
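Once the documents are in a collection, searching, sorting, and grouping by arbitrary fields might look roughly like the sketch below. It assumes a pymongo Collection that is already populated; the helper names field_filter, group_count_pipeline, and run_examples are hypothetical, while the field names come from the sample document in the question.

```python
def field_filter(field, value=None):
    """Match documents where `field` equals `value`, or merely exists if value is None."""
    if value is None:
        return {field: {"$exists": True}}
    return {field: value}


def group_count_pipeline(field):
    """Build an aggregation pipeline that counts documents per value of `field`."""
    return [
        {"$match": {field: {"$exists": True}}},  # skip documents lacking the field
        {"$group": {"_id": f"${field}", "count": {"$sum": 1}}},
        {"$sort": {"count": -1}},  # most frequent values first
    ]


def run_examples(collection):
    """Run a search, a sorted search, and a group-by against a pymongo Collection."""
    collection.create_index("name")  # index a field that is queried often
    by_name = list(collection.find(field_filter("name", "Chris Wilson")))
    with_notes = list(collection.find(field_filter("notes")).sort("item", 1))
    per_name = list(collection.aggregate(group_count_pipeline("name")))
    return by_name, with_notes, per_name
```

Because price is stored as a string such as "$100", sorting on it compares text rather than numbers; converting it to a numeric value before import would be needed for numeric ordering.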

Answered By: andrewleung