Executing an SQL query over a pandas dataset
Question:
I have a pandas data set, called ‘df’.
How can I do something like below;
df.query("select * from df")
Thank you.
For those who know R, there is a library called sqldf where you can execute SQL code in R, my question is basically, is there some library like sqldf in python
Answers:
You can use DataFrame.query(condition)
to return a subset of the data frame matching condition
like this:
df = pd.DataFrame(np.arange(9).reshape(3,3), columns=list('ABC'))
df
A B C
0 0 1 2
1 3 4 5
2 6 7 8
df.query('C < 6')
A B C
0 0 1 2
1 3 4 5
df.query('2*B <= C')
A B C
0 0 1 2
df.query('A % 2 == 0')
A B C
0 0 1 2
2 6 7 8
This is basically the same effect as an SQL statement, except the SELECT * FROM df WHERE
is implied.
This is not what pandas.query
is supposed to do. You can look at package pandasql
(same like sqldf
in R )
import pandas as pd
import pandasql as ps
df = pd.DataFrame([[1234, 'Customer A', '123 Street', np.nan],
[1234, 'Customer A', np.nan, '333 Street'],
[1233, 'Customer B', '444 Street', '333 Street'],
[1233, 'Customer B', '444 Street', '666 Street']], columns=
['ID', 'Customer', 'Billing Address', 'Shipping Address'])
q1 = """SELECT ID FROM df """
print(ps.sqldf(q1, locals()))
ID
0 1234
1 1234
2 1233
3 1233
Update 2020-07-10
update the pandasql
ps.sqldf("select * from df")
There’s actually a new package that I just released, called dataframe_sql. This gives you the ability to query pandas dataframes using SQL just as you want to. You can find the package here
After some time of using this I realised the easiest way is to just do
from pandasql import sqldf
output = sqldf("select * from df")
Works like a charm where df
is a pandas dataframe
You can install pandasql: https://pypi.org/project/pandasql/
Or, you can use the tools that do what they do best:
-
Install postgresql
-
Connect to the database:
from sqlalchemy import create_engine
import urllib.parse
engconnect = "{0}://{1}:{2}@{3}:{4}/{5}".format(dialect,user_uenc, pw_uenc, host,port, dbname)
dbengine = create_engine(engconnect)
database = dbengine.connect()
- Dump the dataframe into postgres
df.to_sql(‘mytablename’, database, if_exists=’replace’)
- Write your query with all the SQL nesting your brain can handle.
myquery = "select distinct * from mytablename"
- Create a dataframe by running the query:
newdf = pd.read_sql(myquery, database)
I think a better solution than pandassql
would be duckdb. The way it handles the table name mapping to a dataframe object is a little cleaner imo. I have not evaluated performance though.
Much better solution is to use duckdb. It is much faster than sqldf because it does not have to load the entire data into sqlite and load back to pandas.
pip install duckdb
import pandas as pd
import duckdb
test_df = pd.DataFrame.from_dict({"i":[1, 2, 3, 4], "j":["one", "two", "three", "four"]})
duckdb.query("SELECT * FROM test_df where i>2").df() # returns a result dataframe
Performance improvement over pandasql:
test data NYC yellow cabs ~120mb of csv data
nyc = pd.read_csv('https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2021-01.csv',low_memory=False)
from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())
pysqldf("SELECT * FROM nyc where trip_distance>10")
# wall time 16.1s
duckdb.query("SELECT * FROM nyc where trip_distance>10").df()
# wall time 183ms
A improvement of speed of roughly 100x
This article gives good details and claims 1000x improvement over pandasql:
https://duckdb.org/2021/05/14/sql-on-pandas.html
There is also FugueSQL
pip install fugue[sql]
import pandas as pd
from fugue_sql import fsql
comics_df = pd.DataFrame({'book': ['Secret Wars 8',
'Tomb of Dracula 10',
'Amazing Spider-Man 252',
'New Mutants 98',
'Eternals 1',
'Amazing Spider-Man 300',
'Department of Truth 1'],
'publisher': ['Marvel', 'Marvel', 'Marvel', 'Marvel', 'Marvel', 'Marvel', 'Image'],
'grade': [9.6, 5.0, 7.5, 8.0, 9.2, 6.5, 9.8],
'value': [400, 2500, 300, 600, 400, 750, 175]})
# which of my books are graded above 8.0?
query = """
SELECT book, publisher, grade, value FROM comics_df
WHERE grade > 8.0
PRINT
"""
fsql(query).run()
Output
PandasDataFrame
book:str |publisher:str|grade:double|value:long
--------------------------------------------------------------+-------------+------------+----------
Secret Wars 8 |Marvel |9.6 |400
Eternals 1 |Marvel |9.2 |400
Department of Truth 1 |Image |9.8 |175
Total count: 3
References
https://fugue-tutorials.readthedocs.io/tutorials/beginner/beginner_sql.html
https://www.kdnuggets.com/2021/10/query-pandas-dataframes-sql.html
Another solution is RBQL which provides SQL-like query language that allows using Python expression inside SELECT and WHERE statements. It also provides a convenient %rbql
magic command to use in Jupyter/IPyhon:
# Get some test data:
!pip install vega_datasets
from vega_datasets import data
my_cars_df = data.cars()
# Install and use RBQL:
!pip install rbql
%load_ext rbql
%rbql SELECT * FROM my_cars_df WHERE a.Horsepower > 100 ORDER BY a.Weight_in_lbs DESC
In this example my_cars_df
is a Pandas Dataframe.
You can try it in this demo Google Colab notebook.
I have a pandas data set, called ‘df’.
How can I do something like below;
df.query("select * from df")
Thank you.
For those who know R, there is a library called sqldf where you can execute SQL code in R, my question is basically, is there some library like sqldf in python
You can use DataFrame.query(condition)
to return a subset of the data frame matching condition
like this:
df = pd.DataFrame(np.arange(9).reshape(3,3), columns=list('ABC'))
df
A B C
0 0 1 2
1 3 4 5
2 6 7 8
df.query('C < 6')
A B C
0 0 1 2
1 3 4 5
df.query('2*B <= C')
A B C
0 0 1 2
df.query('A % 2 == 0')
A B C
0 0 1 2
2 6 7 8
This is basically the same effect as an SQL statement, except the SELECT * FROM df WHERE
is implied.
This is not what pandas.query
is supposed to do. You can look at package pandasql
(same like sqldf
in R )
import pandas as pd
import pandasql as ps
df = pd.DataFrame([[1234, 'Customer A', '123 Street', np.nan],
[1234, 'Customer A', np.nan, '333 Street'],
[1233, 'Customer B', '444 Street', '333 Street'],
[1233, 'Customer B', '444 Street', '666 Street']], columns=
['ID', 'Customer', 'Billing Address', 'Shipping Address'])
q1 = """SELECT ID FROM df """
print(ps.sqldf(q1, locals()))
ID
0 1234
1 1234
2 1233
3 1233
Update 2020-07-10
update the
pandasql
ps.sqldf("select * from df")
There’s actually a new package that I just released, called dataframe_sql. This gives you the ability to query pandas dataframes using SQL just as you want to. You can find the package here
After some time of using this I realised the easiest way is to just do
from pandasql import sqldf
output = sqldf("select * from df")
Works like a charm where df
is a pandas dataframe
You can install pandasql: https://pypi.org/project/pandasql/
Or, you can use the tools that do what they do best:
-
Install postgresql
-
Connect to the database:
from sqlalchemy import create_engine
import urllib.parse
engconnect = "{0}://{1}:{2}@{3}:{4}/{5}".format(dialect,user_uenc, pw_uenc, host,port, dbname)
dbengine = create_engine(engconnect)
database = dbengine.connect()
- Dump the dataframe into postgres
df.to_sql(‘mytablename’, database, if_exists=’replace’)
- Write your query with all the SQL nesting your brain can handle.
myquery = "select distinct * from mytablename"
- Create a dataframe by running the query:
newdf = pd.read_sql(myquery, database)
I think a better solution than pandassql
would be duckdb. The way it handles the table name mapping to a dataframe object is a little cleaner imo. I have not evaluated performance though.
Much better solution is to use duckdb. It is much faster than sqldf because it does not have to load the entire data into sqlite and load back to pandas.
pip install duckdb
import pandas as pd
import duckdb
test_df = pd.DataFrame.from_dict({"i":[1, 2, 3, 4], "j":["one", "two", "three", "four"]})
duckdb.query("SELECT * FROM test_df where i>2").df() # returns a result dataframe
Performance improvement over pandasql:
test data NYC yellow cabs ~120mb of csv data
nyc = pd.read_csv('https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2021-01.csv',low_memory=False)
from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())
pysqldf("SELECT * FROM nyc where trip_distance>10")
# wall time 16.1s
duckdb.query("SELECT * FROM nyc where trip_distance>10").df()
# wall time 183ms
A improvement of speed of roughly 100x
This article gives good details and claims 1000x improvement over pandasql:
https://duckdb.org/2021/05/14/sql-on-pandas.html
There is also FugueSQL
pip install fugue[sql]
import pandas as pd
from fugue_sql import fsql
comics_df = pd.DataFrame({'book': ['Secret Wars 8',
'Tomb of Dracula 10',
'Amazing Spider-Man 252',
'New Mutants 98',
'Eternals 1',
'Amazing Spider-Man 300',
'Department of Truth 1'],
'publisher': ['Marvel', 'Marvel', 'Marvel', 'Marvel', 'Marvel', 'Marvel', 'Image'],
'grade': [9.6, 5.0, 7.5, 8.0, 9.2, 6.5, 9.8],
'value': [400, 2500, 300, 600, 400, 750, 175]})
# which of my books are graded above 8.0?
query = """
SELECT book, publisher, grade, value FROM comics_df
WHERE grade > 8.0
PRINT
"""
fsql(query).run()
Output
PandasDataFrame
book:str |publisher:str|grade:double|value:long
--------------------------------------------------------------+-------------+------------+----------
Secret Wars 8 |Marvel |9.6 |400
Eternals 1 |Marvel |9.2 |400
Department of Truth 1 |Image |9.8 |175
Total count: 3
References
https://fugue-tutorials.readthedocs.io/tutorials/beginner/beginner_sql.html
https://www.kdnuggets.com/2021/10/query-pandas-dataframes-sql.html
Another solution is RBQL which provides SQL-like query language that allows using Python expression inside SELECT and WHERE statements. It also provides a convenient %rbql
magic command to use in Jupyter/IPyhon:
# Get some test data:
!pip install vega_datasets
from vega_datasets import data
my_cars_df = data.cars()
# Install and use RBQL:
!pip install rbql
%load_ext rbql
%rbql SELECT * FROM my_cars_df WHERE a.Horsepower > 100 ORDER BY a.Weight_in_lbs DESC
In this example my_cars_df
is a Pandas Dataframe.
You can try it in this demo Google Colab notebook.