How to prevent data loss in pandas to_excel when handling very long strings of numbers
Question:
This is my input file (csv)
id1,id2
233924749247492472,9284372492472497294749
298347230474308444,9472943274947429427477
I want to read this file into a dataframe, remove the delimiter, and then write it back to an .xlsx file.
A few code combinations that I have already tried:
Attempt 1:
df2 = pd.read_csv(path, sep=delimiter, float_precision=None)
pd.options.display.float_format = '{:.1f}'.format
df2.to_excel(filepath, index=False)
Attempt 2:
df2 = pd.read_csv(path, sep=delimiter)
writer = pd.ExcelWriter(path, engine=None)
df2.to_excel(writer, index=False)
Attempt 3:
df2 = pd.read_csv(path, sep=delimiter)
df2.to_excel(path, index=False)
Every time I get the same output in the Excel file.
I am seeing data loss in the first column. The output looks like this:
id1 | id2 |
---|---|
233924749247493000 | 9284372492472497294749 |
298347230474309000 | 9472943274947429427477 |
Answers:
You can specify the data type of the first column as a string (instead of letting pandas infer a numeric type) when reading the CSV file.
The code below specifies the data type of both columns as string when reading the CSV file. This prevents any automatic numeric conversion (and conversion to scientific notation) and preserves the full values:
import pandas as pd
# read the CSV file into a pandas dataframe, specifying data types
df = pd.read_csv('input_file.csv', dtype={'id1': str, 'id2': str})
# remove the delimiter (assuming the delimiter is a comma)
df = df.replace(',', '', regex=True)
# write the modified dataframe to an Excel file
df.to_excel('output_file.xlsx', index=False)
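For intuition on why the trailing digits vanish when the column is written as numbers: Excel stores numeric cells as 64-bit floats, which hold only about 15 significant decimal digits. The same precision limit can be seen in pure Python (a minimal sketch using one of the ids from the question):

```python
s = "233924749247492472"  # 18 digits, taken from the question's input

# A 64-bit float cannot hold all 18 significant digits,
# so converting through float changes the trailing digits.
as_float = float(s)
print(int(as_float) == int(s))  # False: the round trip is lossy
```

Keeping the values as strings sidesteps this entirely, since they are never converted to floats at any stage.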
By default, pandas casts integer columns to int64, which is enough for integers between -2⁶³ and 2⁶³ - 1 = 9223372036854775807. If any element in a column exceeds this range, pandas sets the column type to object instead.
Apparently, Excel truncates big ints (those that still fit below 2⁶³ - 1 and are therefore written as numbers) but not objects. So a solution would be to set the dtypes of all your columns to object:
pd.read_csv('input.csv', dtype=object).to_excel('output.xlsx')
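The dtype promotion described above is easy to observe. The sketch below uses an in-memory CSV with the question's values instead of a file on disk:

```python
import io

import pandas as pd

csv_text = "id1,id2\n233924749247492472,9284372492472497294749\n"

# Default parsing: id1 fits in int64; id2 exceeds 2**63 - 1,
# so pandas falls back to object dtype for that column.
df = pd.read_csv(io.StringIO(csv_text))
print(df.dtypes["id1"])  # int64
print(df.dtypes["id2"])  # object

# Forcing dtype=object keeps both columns as exact strings,
# which to_excel then writes as text rather than numbers.
df_obj = pd.read_csv(io.StringIO(csv_text), dtype=object)
print(df_obj.loc[0, "id1"])  # the exact string '233924749247492472'
```

This is why only the first column was damaged in the question's output: it was the only one small enough to be written to Excel as a number.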