How to get rid of noise (redundant commas/dots) in decimal values – Python

Question:

I have a dataset df with two columns ID and Value. Both are of Dtype "object". However, I would like to convert the column Value to Dtype "double" with a dot as decimal separator. The problem is that the values of this column contain noise due to the presence of too many commas (e.g. 0,1,,) – or after replacement too many dots (e.g. 0.1..). As a result, when I try to convert the Dtype to double, I get the error message: could not convert string to float: '0.2.'

Example code:

#required packages
import pandas as pd
import numpy as np
  
# initialize list of lists
data = [[1, '0,1'], [2, '0,2,'], [3, '0,01,,']]
  
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['ID', 'Value'])

#replace comma with dot as separator
df = df.replace(',', '.', regex=True)

#examine dtype per column
df.info()

#convert dtype from object to double
df = df.astype({'Value': np.double}) #this is where the error message appears

The preferred outcome is to have the values within the column Value as 0.1, 0.2 and 0.01 respectively.

How can I get rid of the redundant commas or, after replacement, dots in the values of the column Values?

Asked By: Foeke Boersma

||

Answers:

One option: use string functions to convert and strip the values. For example:

#required packages                                                                  
import pandas as pd                                                                 
import numpy as np                                                                  
                                                                                    
# initialize list of lists                                                          
data = [[1, '0,1'], [2, '0,2,'], [3, '0,01,,']]                                     
                                                                                    
# Create the pandas DataFrame                                                       
df = pd.DataFrame(data, columns=['ID', 'Value'])                                    
                                                                                    
#replace comma with dot as separator                                                
df['Value'] = df['Value'].str.replace(',', '.', 1).str.rstrip(',') 
                                                                                    
#examine dtype per column                                                           
df.info()                                                                           
                                                                                    
#convert dtype from object to double                                                
df = df.astype({'Value': np.double})
      
print("------ df:")                                                                              
print(df)

prints:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   ID      3 non-null      int64 
 1   Value   3 non-null      object
dtypes: int64(1), object(1)
memory usage: 176.0+ bytes
----- df:
   ID  Value
0   1   0.10
1   2   0.20
2   3   0.01
Answered By: DraftyHat
Categories: questions Tags: , , , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.