SQL statement for CSV files on IPython notebook

Question:

I have a tabledata.csv file and I have been using pandas.read_csv to read or choose specific columns with specific conditions.

For instance, I use the following code to select every “name” where session_id = 1, which works fine in an IPython notebook on datascientistworkbench.

             import pandas
             df = pandas.read_csv('/resources/data/findhelp/tabledata.csv')
             # select the "name" column for rows where session_id equals 1
             df.loc[df['session_id'] == 1, 'name']

After reading the CSV file, is it possible to somehow treat it as a SQL database? (I am pretty sure I am not using the correct terms, sorry about that!) What I want is to use SQL statements in the IPython notebook to choose specific rows with specific conditions, something like:

Select `name`, count(distinct `session_id`) from tabledata where `session_id` like "100.1%" group by `session_id` order by `session_id`

I guess I need to figure out a way to convert the CSV file into another form so that I can run SQL statements on it. Many thanks!

Asked By: yingnan liu


Answers:

Here is a quick primer on pandas and SQL, using the built-in sqlite3 package. Generally speaking, you can do all SQL operations in pandas in one way or another, but databases are of course useful. The first thing you need to do is store the original df in a SQL database so that you can query it. The steps are listed below.

import pandas as pd
import sqlite3

# read the CSV
df = pd.read_csv('/resources/data/findhelp/tabledata.csv')
# connect to a database; if it does not exist, this creates
# an Any_Database_Name.db file in the current directory
conn = sqlite3.connect("Any_Database_Name.db")
# store your table in the database; if_exists='replace' avoids a ValueError
# when the cell is re-run, and index=False skips writing the DataFrame index
df.to_sql('Some_Table_Name', conn, if_exists='replace', index=False)
# read a SQL query out of your database and into a pandas DataFrame
sql_string = 'SELECT * FROM Some_Table_Name'
df = pd.read_sql(sql_string, conn)
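
As a rough sketch of how a filtered, grouped query like the one in the question could run against this kind of setup, the example below uses an in-memory SQLite database. The sample rows, the `n_names` alias, and the choice to count distinct names per session are assumptions made for illustration; the real tabledata.csv may look different.

```python
import sqlite3

import pandas as pd

# Hypothetical stand-in for tabledata.csv; the real file's columns may differ.
df = pd.DataFrame({
    "name": ["ann", "bob", "ann", "cy", "ann"],
    "session_id": ["100.11", "100.11", "100.12", "200.5", "100.12"],
})

# ":memory:" keeps the database in RAM, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
df.to_sql("tabledata", conn, index=False)

# The LIKE pattern is passed as a parameter instead of being pasted into the SQL.
query = """
    SELECT session_id, COUNT(DISTINCT name) AS n_names
    FROM tabledata
    WHERE session_id LIKE ?
    GROUP BY session_id
    ORDER BY session_id
"""
result = pd.read_sql(query, conn, params=("100.1%",))
conn.close()
print(result)
```

Here session_id is stored as text so that LIKE "100.1%" behaves as a prefix match; if the real column is numeric, the pattern match would rely on SQLite converting the values to text.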
Answered By: Sam

Another answer suggested using SQLite. However, DuckDB is a much faster alternative to loading your data into SQLite.

First, loading the data into SQLite takes time; second, SQLite is not optimized for analytical queries (e.g., aggregations).

Here’s a full example you can run in a Jupyter notebook:

Installation

pip install jupysql duckdb duckdb-engine

Note: if you want to run this in a notebook, use %pip install jupysql duckdb duckdb-engine

Example

Load extension (%sql magic) and create in-memory database:

%load_ext sql
%sql duckdb://

Download some sample CSV data:

from urllib.request import urlretrieve

urlretrieve("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv", "penguins.csv")

Query:

%%sql
SELECT species, COUNT(*) AS count
FROM penguins.csv
GROUP BY species
ORDER BY count DESC

JupySQL documentation available here

Answered By: Eduardo