Can tqdm be used with Database Reads?
Question:
While reading large relations from a SQL database into a pandas DataFrame, it would be nice to have a progress bar, because the number of tuples is known statically and the I/O rate could be estimated. It looks like the tqdm module has a function tqdm_pandas which will report progress on mapping functions over columns, but by default calling it does not have the effect of reporting progress on I/O like this. Is it possible to use tqdm to make a progress bar on a call to pd.read_sql?
Answers:
Edit: This answer is misleading: chunksize has no effect on the database side of the operation. See comments below.
You could use the chunksize parameter to do something like this:
import pandas as pd
from tqdm import tqdm

# conn is an already-open database connection
chunks = pd.read_sql('SELECT * FROM table', con=conn, chunksize=100)
df = pd.DataFrame()
for chunk in tqdm(chunks):
    df = pd.concat([df, chunk])
I think this would use less memory as well.
Yes, you can!
Expanding on the answer here, and on Alex's answer, to include tqdm, we get:
import pandas as pd
from tqdm import tqdm

# Get the total number of rows.
q = "SELECT COUNT(*) FROM table"
total_rows = pd.read_sql_query(q, conn).values[0, 0]
# Note: COUNT should run on the server and not download the whole table.
# Some engines will prefer SELECT MAX(ROWID) or similar...

# Read the table with a tqdm status bar, one tick per chunk.
q = "SELECT * FROM table"
rows_in_chunk = 1_000
chunks = pd.read_sql_query(q, conn, chunksize=rows_in_chunk)
# total is a float here, so the bar shows a fractional chunk count.
progress = tqdm(chunks, total=total_rows / rows_in_chunk)
df = pd.concat(progress)
Output example:
39%|███▉ | 99/254.787 [01:40<02:09, 1.20it/s]
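For anyone who wants to try this without a real database, here is a minimal self-contained sketch of the same idea. It is only an illustration under a few assumptions: the helper name read_sql_with_progress, the table name demo, and the in-memory SQLite database are all made up for the example. It also advances the bar by rows received instead of by chunk, which avoids the fractional total (254.787) seen in the output above.

```python
import sqlite3

import pandas as pd
from tqdm import tqdm


def read_sql_with_progress(query, count_query, conn, rows_in_chunk=1_000):
    """Read `query` into a DataFrame while showing a per-row progress bar.

    Hypothetical helper: `count_query` must return the number of rows
    that `query` will produce.
    """
    # Ask the database for the row count so tqdm can estimate remaining time.
    total_rows = int(pd.read_sql_query(count_query, conn).iloc[0, 0])

    parts = []
    with tqdm(total=total_rows, unit="rows") as bar:
        for chunk in pd.read_sql_query(query, conn, chunksize=rows_in_chunk):
            parts.append(chunk)
            bar.update(len(chunk))  # advance by the rows actually received

    # Concatenate once at the end instead of growing a DataFrame in the loop.
    return pd.concat(parts, ignore_index=True)


# Demo on a throwaway in-memory SQLite table (the name `demo` is arbitrary).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (id INTEGER PRIMARY KEY, value REAL)")
conn.executemany(
    "INSERT INTO demo (value) VALUES (?)", [(i * 0.5,) for i in range(2_500)]
)
conn.commit()

df = read_sql_with_progress("SELECT * FROM demo", "SELECT COUNT(*) FROM demo", conn)
conn.close()
```

Keep in mind the caveat from the first answer: the bar measures how quickly pandas hands chunks to Python. Depending on the driver, the full result set may already have been fetched on the client before the first chunk appears, so the bar may not reflect the actual database I/O.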