Export a dataset having more than 1048576 rows to multiple sheets of a single Excel file, in Python
Question:
I have a dataset given as such:
import numpy as np
import pandas as pd
## Create an array
data = np.arange(7152123)
print(data)
## Dataframe
df = pd.DataFrame(data)
print("n an df = n", df)
## Load the data in excel sheet
df.to_excel('df.xlsx', index=False,header=False)
I get an error:
f"This sheet is too large! Your sheet size is: {num_rows}, {num_cols}
" ValueError: This sheet is too large! Your sheet size is: 7152123, 1
Max sheet size is: 1048576, 16384
The error occurs because a single Excel sheet allows a maximum of 1048576 rows, while the dataset I need to export has 7152123 rows.
Can somebody please let me know how I can export such a huge dataset across multiple sheets of a single Excel file in Python?
Answers:
The data is being generated in Python and then I need to import the data in MATLAB.
In that case I’d suggest not using Excel. Excel was built to analyze data, not exchange data. Not only is splitting the data into 1M-row sheets complex; the MATLAB code will also have to read those sheets and recombine them in memory.
There are better formats for this. to_csv can generate a single text file that MATLAB can read:
df.to_csv('out.csv')
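For a clean hand-off, it may help to drop pandas' index and header so the file contains only the raw numbers; a minimal sketch (readmatrix is MATLAB's plain numeric reader):
# Write data only: no index column, no header row,
# so MATLAB's readmatrix('out.csv') sees pure numbers
df.to_csv('out.csv', index=False, header=False)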
Another option, for really large files that can’t fit in memory, is Parquet. In the MATLAB docs, Parquet Files have their own section under Large Files and Big Data.
The file can be generated with to_parquet (which requires pyarrow or fastparquet) and read with MATLAB’s parquetread:
df.to_parquet('out.parquet')
T = parquetread('out.parquet');
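Before handing the file to MATLAB, it can be sanity-checked from the Python side with read_parquet; a quick sketch:
# Round-trip check: read the file back and compare row counts
check = pd.read_parquet('out.parquet')
assert len(check) == len(df)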
Finally, MATLAB’s rowfilter can be used to filter the data before importing it.
BAD IDEA – Splitting into sheets
This can be done by splitting the dataframe into several smaller ones and storing them in different sheets with the help of ExcelWriter. There are several relevant SO answers already; this answer shows several options, including NumPy’s array_split. Note that the second argument of array_split is the number of chunks, not the chunk size:
import math

max_rows = 1048575  # leave one row per sheet for the header that to_excel writes
list_df = np.array_split(df, math.ceil(len(df) / max_rows))
The to_excel docs show how to write to different sheets in the same file:
with pd.ExcelWriter('output.xlsx') as writer:
    df1.to_excel(writer, sheet_name='Sheet_name_1')
    df2.to_excel(writer, sheet_name='Sheet_name_2')
This can be done in a loop:
with pd.ExcelWriter('output.xlsx') as writer:
    for i, d in enumerate(list_df):
        d.to_excel(writer, sheet_name=f'Sheet_name_{i}')
Meaningful grouping
If the data is meant for analysis using Excel, it makes sense to split the data using a meaningful grouping, assuming no group exceeds 1M rows:
with pd.ExcelWriter('output.xlsx') as writer:
    for name, d in df.groupby('AcctName'):
        d.to_excel(writer, sheet_name=f'Sheet_name_{name}')
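Since no group may exceed the row limit (and Excel also caps sheet names at 31 characters), a quick pre-check can save a failed export midway; a small sketch assuming the AcctName column from the example above:
max_rows = 1048575  # one row reserved for the header
sizes = df.groupby('AcctName').size()
too_big = sizes[sizes > max_rows]
if not too_big.empty:
    raise ValueError(f'Groups too large for one sheet: {too_big.index.tolist()}')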
Try splitting the dataframe into groups of max_rows:
import math

# Split dataframe into groups of at most 1048576 rows
max_rows = 1048576
rows = len(df)
if rows > max_rows:
    # Round up, so the last (smaller) group is included
    n_of_groups = math.ceil(rows / max_rows)
    for i in range(n_of_groups):
        start = i * max_rows
        end = (i + 1) * max_rows
        if end < rows:
            df_write = df[start:end]
        else:
            df_write = df[start:]
        print(f'Write df_write with {len(df_write)} rows on Excel file')
Output:
Write df_write with 1048576 rows on Excel file
Write df_write with 1048576 rows on Excel file
Write df_write with 1048576 rows on Excel file
Write df_write with 1048576 rows on Excel file
Write df_write with 1048576 rows on Excel file
Write df_write with 1048576 rows on Excel file
Write df_write with 860667 rows on Excel file
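To actually write each slice, replace the print with a to_excel call inside an ExcelWriter context; a minimal sketch of that last step, assuming the df from the question (the sheet names are illustrative):
import math
import pandas as pd

max_rows = 1048576
with pd.ExcelWriter('output.xlsx') as writer:
    for i in range(math.ceil(len(df) / max_rows)):
        # slicing past the end is safe: the last chunk is simply shorter
        df_write = df.iloc[i * max_rows:(i + 1) * max_rows]
        # header=False and index=False keep each chunk within the row limit
        df_write.to_excel(writer, sheet_name=f'Sheet_{i}', index=False, header=False)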
Hope it works!
You can. You have to open the pandas.ExcelWriter object in append mode.
Here is an example where I slice the dataframe in batches of 1000 rows and save each slice into its own sheet.
Append mode requires an existing, valid workbook (a zero-byte file will not work), so make sure there is an output.xlsx file in the same path, then try the following.
import numpy as np
import pandas as pd

data = np.arange(10000)
df = pd.DataFrame(data)

# range must start at 0, or the first 1000 rows are skipped
with pd.ExcelWriter(path="output.xlsx", mode="a") as writer:
    for n in range(0, len(df), 1000):
        df.iloc[n:n + 1000].to_excel(writer, sheet_name=f'MySheet_{n}')
However, as already suggested, you should probably just save the data in a single csv file instead.