Which is more efficient: join queries using SQL, or merge queries using pandas?
Question:
I want to use data from multiple tables in a pandas DataFrame. I have two ideas for downloading the data from the server: one is to join the tables in SQL and retrieve the result, the other is to download the tables separately and combine them with pandas.merge.
SQL Join
This is what I run when I want to download the data into pandas:
query = '''SELECT table1.c1, table2.c2
FROM table1
INNER JOIN table2 ON table1.ID = table2.ID
WHERE condition;'''
df = pd.read_sql(query, engine)
Pandas Merge
# the join key ID must be selected in both queries for the merge to work
df1 = pd.read_sql('SELECT ID, c1 FROM table1 WHERE condition;', engine)
df2 = pd.read_sql('SELECT ID, c2 FROM table2 WHERE condition;', engine)
df = pd.merge(df1, df2, on='ID', how='inner')
Which one is faster? Assume that I want to do this for more than two tables and two columns.
Is there a better approach?
In case it matters, I am using PostgreSQL.
Answers:
To really know which is faster, you need to try out both approaches with your data on your database.
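As a starting point for such a test, here is a self-contained sketch of both approaches against a throwaway in-memory SQLite database (the question uses PostgreSQL, where pd.read_sql would take an SQLAlchemy engine instead; table and column names mirror the question). On real data you would wrap each approach in time.perf_counter() calls to compare timings.

```python
import sqlite3
import pandas as pd

# In-memory SQLite stands in for the PostgreSQL server in the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (ID INTEGER, c1 TEXT);
    CREATE TABLE table2 (ID INTEGER, c2 TEXT);
    INSERT INTO table1 VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO table2 VALUES (1, 'x'), (2, 'y'), (4, 'z');
""")

# Approach 1: one round trip, the database performs the join.
df_sql = pd.read_sql(
    "SELECT table1.ID, table1.c1, table2.c2 "
    "FROM table1 INNER JOIN table2 ON table1.ID = table2.ID;",
    conn,
)

# Approach 2: two round trips, pandas performs the merge.
# The join key ID must be selected in both queries.
df1 = pd.read_sql("SELECT ID, c1 FROM table1;", conn)
df2 = pd.read_sql("SELECT ID, c2 FROM table2;", conn)
df_pd = df1.merge(df2, on="ID", how="inner")

# Both should return the rows with ID 1 and 2.
print(df_sql)
print(df_pd)
```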
The rule of thumb is to do the logic in a single query. Databases are designed for queries. They have sophisticated algorithms, multiple processors, and lots of memory to handle them. So, relying on the database is quite reasonable. In addition, each query has a bit of overhead, so two queries have twice the overhead of one.
That said, there are definitely circumstances where doing the work in pandas will be faster. Pandas does the work in local memory, which is limited, though much less so than in the "good old days". It is probably not going to be multi-threaded.
For example, the result set might be much larger than the two input tables (a join can multiply rows). Moving that data from the database to the application can be relatively expensive, so in that case doing the work in pandas could be faster than in the database.
At the other extreme, no records might match the JOIN conditions. That is definitely a case where a single query would be faster.
The former is faster than the latter. The former makes a single call to the database and returns the result already joined and filtered, whereas the latter makes two calls to the database and then merges the result sets in the application.
Parallel processing can also come into play: many modern SQL engines execute a query in parallel. Plain pandas cannot, although there are a few libraries (e.g. Dask or Modin) that add parallel, pandas-like processing.
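Even with plain pandas, the two downloads in the second approach can at least be issued concurrently, which hides some of the per-query round-trip overhead. A sketch using threads and a file-backed SQLite database as a stand-in for the server (with PostgreSQL you would likewise open one connection or engine per worker):

```python
import os
import sqlite3
import tempfile
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

# A throwaway file-backed SQLite database stands in for the server.
db_path = os.path.join(tempfile.mkdtemp(), "demo.db")
with sqlite3.connect(db_path) as conn:
    conn.executescript("""
        CREATE TABLE table1 (ID INTEGER, c1 TEXT);
        CREATE TABLE table2 (ID INTEGER, c2 TEXT);
        INSERT INTO table1 VALUES (1, 'a'), (2, 'b');
        INSERT INTO table2 VALUES (1, 'x'), (3, 'z');
    """)

def fetch(sql):
    # One connection per worker thread; with PostgreSQL you would
    # similarly give each worker its own connection.
    with sqlite3.connect(db_path) as c:
        return pd.read_sql(sql, c)

# Issue both downloads concurrently, then merge locally in pandas.
with ThreadPoolExecutor(max_workers=2) as pool:
    f1 = pool.submit(fetch, "SELECT ID, c1 FROM table1;")
    f2 = pool.submit(fetch, "SELECT ID, c2 FROM table2;")
df = f1.result().merge(f2.result(), on="ID", how="inner")
print(df)  # only ID 1 appears in both tables
```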