How to create a fast, queryable index from many JSON files (ideally in Python)
Question:
I have 78,000 individual JSON files that I created with a Python script that scrapes a community forum and extracts information from each post. They consist of simple key-value pairs, like so:
{
    "name": "Chris Wilson",
    "item": "Darth Vader speaker phone",
    "price": "$100",
    "notes": "Great condition!"
}
Some keys are common to all files — name and price, for example — while many others appear in only some. (The site I’m crawling allows for user-defined fields.) I want to be able to search, sort, and group by any field I want.
Normally, I would load each file into a SQLite database and query it from there, but declaring a column for every possible user-defined field would be extremely tedious.
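For reference, the SQLite route could look something like this minimal sketch, assuming the files sit in a posts/ directory. Storing each file as raw JSON and pulling fields out with the json1 extension's json_extract() (available in recent SQLite builds) sidesteps declaring a column per field:

import sqlite3
from pathlib import Path

conn = sqlite3.connect("posts.db")
conn.execute("CREATE TABLE IF NOT EXISTS posts (doc TEXT)")

# Store each file verbatim; no per-field schema required.
for path in Path("posts").glob("*.json"):
    conn.execute("INSERT INTO posts (doc) VALUES (?)", (path.read_text(),))
conn.commit()

# Search, sort, or group on any key, whether or not every file has it.
# (Prices are strings like "$100" here, so the sort is lexicographic.)
rows = conn.execute(
    "SELECT json_extract(doc, '$.name') AS name, "
    "       json_extract(doc, '$.price') AS price "
    "FROM posts ORDER BY price"
).fetchall()
print(rows[:5])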
From what little I understand about NoSQL frameworks, this seems like a project that is well suited to a document-based system rather than a traditional relational database. I tried to learn CouchDB, but most of the documentation I can find assumes that you start with an empty database, not with the pre-fabricated documents themselves.
Is there a good, reasonably simple (or at least well-documented) solution for indexing and querying large numbers of dictionary objects? I prefer Python, but I’m happy to venture into Node or whatever else.
Thank you!
P.S. Let me know if you’re interested in that Darth Vader phone.
Answers:
You might want to check Julian Hyde’s blog; he recently posted about running SQL over JSON using Apache Drill.
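For what it’s worth, here is a rough sketch of running such a query from Python against Drill’s REST API, assuming a local Drill instance in embedded mode on its default port 8047; the /data/posts path and the queried fields are placeholders:

import requests

# Drill in embedded mode serves a REST API on port 8047 by default,
# and can query a directory of JSON files directly via the dfs plugin.
query = "SELECT name, price FROM dfs.`/data/posts` WHERE name = 'Chris Wilson'"
resp = requests.post(
    "http://localhost:8047/query.json",
    json={"queryType": "SQL", "query": query},
)
resp.raise_for_status()

# The result set comes back under the "rows" key of the JSON response.
for row in resp.json()["rows"]:
    print(row)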
This sounds like the perfect use case for MongoDB. Set up MongoDB and import your JSON files directly into a collection using mongoimport --file <filename>
There is great Python support too, via the official PyMongo driver.
Some documentation links:
http://docs.mongodb.org/manual/reference/mongoimport/#cmdoption-mongoimport--file
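Here is a minimal sketch of that workflow with PyMongo, assuming MongoDB is running locally on the default port and the scraped files sit in a posts/ directory; the forum database and posts collection names are placeholders:

import json
from pathlib import Path
from pymongo import MongoClient

client = MongoClient("localhost", 27017)
collection = client["forum"]["posts"]

# Read every scraped file and insert the documents in one batch.
docs = [json.loads(p.read_text()) for p in Path("posts").glob("*.json")]
collection.insert_many(docs)

# Index the fields you query most; documents missing a field are fine.
collection.create_index("name")
collection.create_index("price")

# Search and sort on any field, declared ahead of time or not.
for doc in collection.find({"name": "Chris Wilson"}).sort("price", 1):
    print(doc)

# Group with the aggregation pipeline, e.g. count listings per seller.
for row in collection.aggregate(
    [{"$group": {"_id": "$name", "listings": {"$sum": 1}}}]
):
    print(row)

Because MongoDB documents are schemaless, the user-defined fields need no declaration up front; indexes are only an optimization for the fields you filter and sort on most.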