How to prevent Floating-Point errors with Pandas
Question:
I have a problem with my Python code. I’m using pandas to read a dataset and store it in a DataFrame. I’m now trying to convert ug to mg (1000 ug == 1 mg) and g to mg (1000 mg == 1 g).
I’m first converting the data type of the column to float64:
df[data_column] = df[data_column].astype("float64")
After that, I’m selecting all the rows whose unit is ug and multiplying them by 0.001, and then the rows with g, multiplying them by 1000:
df.loc[df[unit_colum] == "g", [data_column]] *= 1000
df.loc[df[unit_colum] == "ug", [data_column]] *= 0.001
Btw:
I know that I can also divide values in pandas, but in the end this code should run in a loop where it also converts other units (like l -> ml).
My question now is:
Is there any chance that a floating-point error occurs, and what is the best way to prevent it?
I already thought about not converting the DataFrame columns to float64 and just working with the strings, but this isn’t my preferred way.
Answers:
It is difficult to fully avoid floating point errors in general.
You have two major options to avoid/limit them:
- perform your computations in the smallest available unit (here µg) as integers
- round the values to the desired precision after conversion
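As a rough sketch of both options, assuming the raw values are still available as strings: the column names (`value`, `unit`), the sample rows, and the six-decimal rounding are all made up for illustration.

```python
from decimal import Decimal

import pandas as pd

# Hypothetical sample data; the column names are assumptions for illustration.
df = pd.DataFrame({
    "value": ["1.5", "250", "0.003"],
    "unit": ["mg", "ug", "g"],
})

# Option 1: store everything as integer micrograms.
# Decimal parses the strings exactly, and the factors to µg are whole numbers.
# int() truncates, so this assumes no value is finer than 1 µg.
to_ug = {"ug": 1, "mg": 1_000, "g": 1_000_000}
df["value_ug"] = [
    int(Decimal(v) * to_ug[u]) for v, u in zip(df["value"], df["unit"])
]

# Option 2: convert with floats, then round to the precision you need.
# Six decimals of a milligram (1 ng) is an arbitrary choice for this sketch.
to_mg = {"ug": 0.001, "mg": 1, "g": 1000}
df["value_mg"] = (df["value"].astype("float64") * df["unit"].map(to_mg)).round(6)

print(df)
```

The integer column stays exact under addition and comparison; the rounded float column is only as good as the precision you pick.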
Also, a tip for your conversion: rather than using multiple lines, you can map
the factors:
factors = {'ug': 0.001, 'g': 1000, 'mg': 1}
df['data_column'] *= df['unit_column'].map(factors)
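One thing to watch with this approach: `map` returns NaN for any unit that is missing from the dictionary, so an unexpected unit would silently turn the value into NaN. A minimal sketch of a guard, where the helper name `convert_to_mg` and the sample frame are hypothetical:

```python
import pandas as pd

FACTORS_TO_MG = {"ug": 0.001, "g": 1000, "mg": 1}


def convert_to_mg(df, data_column="data_column", unit_column="unit_column"):
    """Return a copy of df with data_column expressed in mg."""
    factor = df[unit_column].map(FACTORS_TO_MG)
    # Units that are not in the dictionary map to NaN; fail loudly instead.
    if factor.isna().any():
        unknown = df.loc[factor.isna(), unit_column].unique().tolist()
        raise ValueError(f"no conversion factor for units: {unknown}")
    out = df.copy()
    out[data_column] = out[data_column] * factor
    out[unit_column] = "mg"
    return out


# Invented sample rows, one per known unit.
sample = pd.DataFrame({
    "data_column": [2.0, 500.0, 3.0],
    "unit_column": ["g", "ug", "mg"],
})
converted = convert_to_mg(sample)
print(converted)
```

For the loop over other unit families (l -> ml and so on), the same helper could take the factor dictionary as a parameter.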
Going for integers in a known unit is certainly a good option with easy to understand error bounds and good performance. It’s effectively the same as using floating point with an absolute error threshold.
You can also switch to fractions. This should be done starting with the conversion from strings, since that avoids all floating-point effects. In particular, Fraction("0.01") != Fraction(0.01), because the float literal 0.01 is already rounded to the nearest binary value, but Fraction("0.01") == Fraction("0.1") / Fraction(10).
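The difference between the two constructors can be checked with the standard library alone (the variable names here are only illustrative):

```python
from fractions import Fraction

exact = Fraction("0.01")     # parsed from the decimal string: exactly 1/100
from_float = Fraction(0.01)  # exact value of the nearest binary double

print(exact)                                    # 1/100
print(exact == from_float)                      # False
print(Fraction("0.1") / Fraction(10) == exact)  # True
```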
This should work:
import fractions
# Parse the original strings directly; do not convert to float64 first
df[data_column] = df[data_column].map(fractions.Fraction)
df.loc[df[unit_colum] == "g", [data_column]] *= fractions.Fraction(1000)
df.loc[df[unit_colum] == "ug", [data_column]] *= fractions.Fraction(1, 1000)
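For a self-contained variant that combines this with a factor dictionary, here is a sketch on invented sample data. It assumes the values were read as strings (for example with `dtype=str` in `read_csv`), and the `data_mg_float` column is a hypothetical name for the display copy.

```python
from fractions import Fraction

import pandas as pd

# Invented sample rows; values are kept as strings so nothing is rounded yet.
df = pd.DataFrame({
    "data_column": ["2", "500", "0.1"],
    "unit_column": ["g", "ug", "mg"],
})

# Exact conversion factors to mg.
factors = {"ug": Fraction(1, 1000), "g": Fraction(1000), "mg": Fraction(1)}

# Plain Python arithmetic on Fraction objects, one row at a time.
exact_mg = [
    Fraction(v) * factors[u]
    for v, u in zip(df["data_column"], df["unit_column"])
]

# Store the exact values in an object column, plus a float copy for display.
df["data_column"] = pd.Series(exact_mg, index=df.index, dtype=object)
df["data_mg_float"] = [float(x) for x in exact_mg]
df["unit_column"] = "mg"

print(df)
```

Object columns of Fraction values are exact but much slower than float64, so this trade-off mostly makes sense for small tables or where exactness really matters.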