Reading a large table by chunks

Question:

I have a table generated on a server and I connect to it using a presto client as follows:

conn = presto.connect('hostname',port)
db = "some_large_table"

What I would like to do is read in one chunk at a time, do my processing, and append that chunk to an existing df, i.e.:

sql = "select column1, .. column20 limit 100"
chunk_count = 0
dfs = []


for chunk in pd.read_sql_query(sql, conn, chunksize=10) :
    chunk_count +=1
    dfs.append(chunk)

print(dfs[['column1', 'column2']]) 

The print of dfs only shows 10 rows, which means it's not appending.

Asked By: chowpay


Answers:

Well, I can’t be sure this will work without more context, but I can tell you that your issue arises because dfs is a list of data frames, not a data frame. That said, with this approach you will assign dfs to be equal to your first chunk and append each subsequent chunk to that result.

sql = "select column1, .. column20 limit 100"
chunk_count = 0

for chunk in pd.read_sql_query(sql, conn, chunksize=10) :
    chunk_count +=1
    try:
        dfs.append(chunk)
    except:
        dfs = chunk

print(dfs[['column1', 'column2']]) 
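
Alternatively, since the loop in the question already collects the chunks in a list, a minimal sketch (keeping the question's placeholder query and connection) can combine that list into a single frame with pd.concat:

import pandas as pd

sql = "select column1, .. column20 limit 100"
dfs = []

# Collect each chunk, then concatenate the list into one DataFrame at the end.
for chunk in pd.read_sql_query(sql, conn, chunksize=10):
    dfs.append(chunk)

combined = pd.concat(dfs, ignore_index=True)  # a single DataFrame, not a list of them
print(combined[['column1', 'column2']])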
Answered By: kpie

In my query I limited the number of rows to 10. df_full.append() on its own did not work because append returns a new DataFrame rather than modifying df_full in place, so I changed it to df_full = df_full.append() and it works fine.

sql = "select*...limit 10"
df_source = pd.read_sql_query(sql, conn, chunksize=2)
chunk_count = 0

df_list = []
df_full = pd.DataFrame(columns = col_names)

for chunk in df_source:
    df_list.append(chunk)

for df_item in df_list:
    df_full = df_full.append(df_item, ignore_index = True)

print(df_full)

Result:

[10 rows x 38 columns]
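
Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0; on newer pandas, a minimal sketch of the same one-shot build, assuming the df_list collected above, would use pd.concat instead:

import pandas as pd

# pd.concat builds df_full from the list of chunks in a single call,
# replacing the removed DataFrame.append.
df_full = pd.concat(df_list, ignore_index=True)
print(df_full)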
Answered By: chowpay

Appending to dataframes is really slow and should be avoided. Pandas does not do in-place appending. The dataframe is always copied to a new version.

So in your code, you could do this:

dfs = pd.DataFrame()  # start with an empty dataframe

# then, inside the loop:
dfs = dfs.append(chunk)

And that would work. If you don’t have very many chunks, then it is not bad. But as dfs grows, it will start to slow down to a crawl.

I find it is best to append to a lod (list of dicts) and then cast it all at once into a dataframe, if you want one big dataframe.

Thus:

sql = "select column1, .. column20 limit 100"
chunk_count = 0
lod = []

for chunk_df in pd.read_sql_query(sql, conn, chunksize=10) :
   chunk_count += 1
   lod += chunk_df.to_dict(orient='records')
   
dfs = pd.DataFrame.from_dict(lod)

print(f"processed {chunk_count} chunks.n", dfs[['column1', 'column2']]) 

This method does not get slower as you go, because lists can append quickly,
and then the lod is converted to dfs in one shot.
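
If the per-chunk processing mentioned in the question needs to happen before the rows are accumulated, it can slot into the same loop. A minimal sketch, with a hypothetical filter standing in for that processing:

sql = "select column1, .. column20 limit 100"
chunk_count = 0
lod = []

for chunk_df in pd.read_sql_query(sql, conn, chunksize=10):
    chunk_count += 1
    # hypothetical per-chunk processing step; replace with the real work
    processed = chunk_df[chunk_df['column1'].notna()]
    lod += processed.to_dict(orient='records')

dfs = pd.DataFrame.from_dict(lod)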

Answered By: Ray Lutz