Which one is more efficient: join queries using SQL, or merge queries using pandas?

Question:

I want to use data from multiple tables in a pandas DataFrame. I have two ideas for downloading the data from the server: one is to use a SQL join and retrieve the joined result, and the other is to download the tables as separate DataFrames and merge them using pandas.merge.

SQL Join

This is how I would download the joined data into pandas:

import pandas as pd

# engine: an existing database engine/connection (not shown here)
query = '''SELECT table1.c1, table2.c2
    FROM table1
    INNER JOIN table2 ON table1.ID = table2.ID WHERE condition;'''
df = pd.read_sql(query, engine)

Pandas Merge

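# ID must be selected in both queries so it is available as the merge key.
# Note: PostgreSQL folds unquoted identifiers to lowercase, so the column may come back as 'id'.
df1 = pd.read_sql('SELECT ID, c1 FROM table1 WHERE condition;', engine)
df2 = pd.read_sql('SELECT ID, c2 FROM table2 WHERE condition;', engine)
df = pd.merge(df1, df2, on='ID', how='inner')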

Which one is faster? Assume that I want to do this for more than two tables and two columns.
Is there a better approach?
In case it matters, I use PostgreSQL.

Asked By: Mehdi

||

Answers:

To really know which is faster, you need to try out the two queries using your data on your databases.
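One way to run that comparison is a small timing harness. The sketch below is only an illustration: it uses an in-memory SQLite database as a stand-in for PostgreSQL so that it runs anywhere, and the table contents, the row count `n`, and the helper names `via_sql_join`, `via_pandas_merge`, and `timed` are made up for the example. Timings from it say nothing about your server; swap in your own connection and queries.

```python
import sqlite3
import time

import pandas as pd

# Assumption: in-memory SQLite stands in for the real PostgreSQL server.
conn = sqlite3.connect(":memory:")
n = 10_000  # hypothetical table size
pd.DataFrame({"ID": range(n), "c1": range(n)}).to_sql("table1", conn, index=False)
pd.DataFrame({"ID": range(n), "c2": range(n)}).to_sql("table2", conn, index=False)


def via_sql_join():
    # One round trip: the database performs the join.
    query = """SELECT table1.ID, table1.c1, table2.c2
               FROM table1
               INNER JOIN table2 ON table1.ID = table2.ID"""
    return pd.read_sql(query, conn)


def via_pandas_merge():
    # Two round trips: the join happens in local memory.
    df1 = pd.read_sql("SELECT ID, c1 FROM table1", conn)
    df2 = pd.read_sql("SELECT ID, c2 FROM table2", conn)
    return pd.merge(df1, df2, on="ID", how="inner")


def timed(func):
    # Return the function's result together with its wall-clock duration.
    start = time.perf_counter()
    result = func()
    return result, time.perf_counter() - start


joined, t_sql = timed(via_sql_join)
merged, t_pandas = timed(via_pandas_merge)
print(f"SQL join: {t_sql:.4f}s, pandas merge: {t_pandas:.4f}s")
```

A single run like this is noisy; repeating each call several times and comparing the medians would give a more trustworthy picture.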

The rule of thumb is to do the logic in a single query. Databases are designed for queries. They have sophisticated algorithms, multiple processors, and lots of memory to handle them. So, relying on the database is quite reasonable. In addition, each query has a bit of overhead, so two queries have twice the overhead of one.

That said, there are definitely circumstances where doing the work in pandas is going to be faster. Pandas is going to do the work in local memory. That is limited — but much less so than in the “good old days”. It is probably not going to be multi-threaded.

For example, the result set might be much larger than the two tables. Moving that data from the database to the application might be (relatively) expensive in that case, so doing the work in pandas could be faster than in the database.
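To make the "result larger than the inputs" case concrete, here is a minimal sketch with made-up data in which every row shares the same join key. It again assumes an in-memory SQLite database as a stand-in for the real server, and it only counts rows; it does not measure time.

```python
import sqlite3

import pandas as pd

# Assumption: in-memory SQLite stands in for the real database server.
conn = sqlite3.connect(":memory:")
# Hypothetical data: every row has the same key, so the join fans out.
pd.DataFrame({"ID": [1] * 100, "c1": range(100)}).to_sql("table1", conn, index=False)
pd.DataFrame({"ID": [1] * 100, "c2": range(100)}).to_sql("table2", conn, index=False)

# Separate downloads move 100 + 100 rows over the connection.
df1 = pd.read_sql("SELECT ID, c1 FROM table1", conn)
df2 = pd.read_sql("SELECT ID, c2 FROM table2", conn)
rows_downloaded = len(df1) + len(df2)

# The joined result has 100 * 100 rows, whether built locally or on the server.
merged = pd.merge(df1, df2, on="ID", how="inner")
joined = pd.read_sql(
    "SELECT table1.ID, c1, c2 FROM table1 INNER JOIN table2 ON table1.ID = table2.ID",
    conn,
)
print(rows_downloaded, len(merged), len(joined))  # 200 10000 10000
```

Here the server-side join would send 10,000 rows over the wire, while the two separate queries send only 200 and leave the expansion to local memory.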

At the other extreme, no records might match the JOIN conditions. That is definitely a case where a single query would be faster.
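A minimal sketch of that extreme, using made-up tables whose ID ranges do not overlap and an in-memory SQLite database as an assumed stand-in for the real server:

```python
import sqlite3

import pandas as pd

# Assumption: in-memory SQLite stands in for the real database server.
conn = sqlite3.connect(":memory:")
# Hypothetical data: the ID ranges are disjoint, so no rows can match.
pd.DataFrame({"ID": range(0, 1000), "c1": range(1000)}).to_sql("table1", conn, index=False)
pd.DataFrame({"ID": range(1000, 2000), "c2": range(1000)}).to_sql("table2", conn, index=False)

# Single query: the database discards everything and returns an empty result.
joined = pd.read_sql(
    "SELECT table1.ID, c1, c2 FROM table1 INNER JOIN table2 ON table1.ID = table2.ID",
    conn,
)

# Separate queries: all 2000 rows are transferred, then the merge throws them away.
df1 = pd.read_sql("SELECT ID, c1 FROM table1", conn)
df2 = pd.read_sql("SELECT ID, c2 FROM table2", conn)
merged = pd.merge(df1, df2, on="ID", how="inner")

print(len(joined), len(df1) + len(df2), len(merged))  # 0 2000 0
```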

Answered By: Gordon Linoff

The former is faster than the latter. The former makes a single call to the database and returns the result already joined and filtered. The latter, however, makes two calls to the database and then merges the result sets in the application.

Answered By: alfonsohdez08

Parallel processing can be used in the SQL case; many modern SQL engines use it. With pandas itself, that is not possible, although I know there are a few libraries that support parallel processing.

Answered By: asonagra