Avro, Hive or HBase – What to use for 10 million records daily?

Question:

I have the following requirements: I need to process around 20,000 elements per day (let's call them baskets), each of which generates between 100 and 1,000 records (let's call them products in a basket). A single record has about 10 columns, and each row is about 500 B to 1 KB in total.

That means I produce around 5 to at most 20 million records per day.

From an analytical perspective, I need to do some aggregation and filtering, and especially show trends over multiple days.

The solution is Python-based, and I am able to use anything: Hadoop, Microsoft SQL Server, Google BigQuery, etc. I have been reading through lots of articles about Avro, Parquet, Hive, HBase, etc.

At first I tested something small with SQL Server and two tables (one for the main elements and the other for the produced items across all days). But with this, the database gets quite large very quickly, and it is not that fast when accessing, filtering, etc.

So I thought about using Avro and creating a single Avro file per day with the corresponding items. When I want to analyze them, I would read one file with Python, or several of them when I need to analyze multiple days.

When I think about this, it could be way too large (30 daily files with 10 million records each) …
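As a rough check of that concern, the raw volume can be estimated from the figures above. This is only a back-of-the-envelope sketch: it assumes decimal units (1 KB = 1,000 bytes), uncompressed rows, and the 10 million records per day mentioned here; the function name is made up for illustration.

```python
def raw_size_gb(records, bytes_per_record):
    """Uncompressed size in gigabytes (decimal units, 1 GB = 1e9 bytes)."""
    return records * bytes_per_record / 1e9


RECORDS_PER_DAY = 10_000_000  # figure used in the paragraph above
DAYS = 30

# Row size is stated as roughly 500 B to 1 KB.
low_gb = raw_size_gb(RECORDS_PER_DAY * DAYS, 500)
high_gb = raw_size_gb(RECORDS_PER_DAY * DAYS, 1_000)

print(f"30 days of raw rows: {low_gb:.0f} GB to {high_gb:.0f} GB")
```

So a month of data would be on the order of 150 to 300 GB before any compression or encoding, which is more than I would want to load into a single Python process at once.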

There must be something else. Then I came across Hive and HBase. But now I am totally confused.

Can anyone out there sort things out for me? What is the easiest or most general way to handle this kind of data?

Asked By: STORM


Answers:

If you want to analyze data based on columns and aggregates, a columnar format such as ORC or Parquet is a better fit than Avro, which is row-oriented. If you don't plan on managing Hadoop infrastructure, then Hive or HBase wouldn't be acceptable. I agree that SQL Server might struggle with large queries… Out of the options listed, that narrows it down to BigQuery.

If you want to explore alternative solutions in the same space, Apache Pinot or Druid support analytical use cases.

Otherwise, throw files (as Parquet or ORC) into GCS and use PySpark.
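A minimal sketch of what that could look like, not a definitive setup. The bucket path, the `price` column, and all function names are hypothetical; it also assumes a Spark installation with the GCS connector configured so that `gs://` paths resolve. Each day goes into its own `day=YYYY-MM-DD` directory, which Spark picks up as a `day` column when reading the base path.

```python
from datetime import date

BASE_PATH = "gs://my-bucket/baskets"  # hypothetical bucket and prefix


def daily_path(base_path, day):
    """Return the Hive-style partition directory for one day of records."""
    return f"{base_path.rstrip('/')}/day={day.isoformat()}"


def write_day(df, base_path, day):
    """Write one day's records (a Spark DataFrame) as Parquet.

    "overwrite" replaces only that day's directory, so a day can be re-run.
    """
    df.write.mode("overwrite").parquet(daily_path(base_path, day))


def daily_totals(spark, base_path):
    """Aggregate per day across all stored days.

    Reading the base path lets Spark's partition discovery add a "day"
    column derived from the directory names.
    """
    from pyspark.sql import functions as F  # imported lazily; needs pyspark

    df = spark.read.parquet(base_path)
    return (
        df.groupBy("day")
        .agg(
            F.count("*").alias("records"),
            F.sum("price").alias("total_price"),  # "price" is an assumed column
        )
        .orderBy("day")
    )
```

With a session created via `SparkSession.builder.appName("baskets").getOrCreate()`, calling `daily_totals(spark, BASE_PATH).show()` would print one row per day. Because Parquet is columnar, a query like this only reads the columns it touches rather than the full 500 B to 1 KB rows.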

Answered By: OneCricketeer