Convert the string into a float value

Question:

I have copied a table with three columns from a pdf file. I am attaching the screenshot from the PDF here:

enter image description here

The values in the column padj are exponential values, however, when you copy from the pdf to an excel and then open it with pandas, these are strings or object data types. Hence, these values cannot be parsed as floats or numeric values. I need these values as floats, not as strings. Can someone help me with some suggestions?
So far this is what I have tried.

The excel or the csv file is then opened in python using the escape_unicode encoding in order to circumvent the UnicodeDecodeError

## open the file
df       = pd.read_csv("S2_GSE184956.csv",header=0,sep=',',encoding='unicode_escape')[["DEGs","LFC","padj"]]

   df.head()
        DEGs   padj          LFC
    0   JUNB    1.5 ×10-8   -1.273329
    1   HOOK2   2.39×10-7   -1.109320
    2   EGR1    3.17×10-6   -4.187828
    3   DUSP1   3.95×10-6   -3.251030
    4   IL6     3.95×10-6   -3.415500
    5   ARL4C   5.06×10-6   -2.147519
    6   NR4A2   2.94×10-4   -3.001167
    7   CCL3L1  4.026×10-4  -5.293694


# Convert the string to float by replacing the x10- with exponential sign 
    df['padj'] = df['padj'].apply(lambda x: (unidecode(x).replace('x10-','x10-e'))).astype(float)
That threw an error,

    ValueError: could not convert string to float: '1.5 x10-e8'

Any suggestions would be appreciated. Thanks

Asked By: user1017373

||

Answers:

With the dataframe shared in the question on this last edit, the following using pandas.Series.str.replace and pandas.Series.astype will do the work:

df['padj'] = df['padj'].str.replace('×10','e').str.replace(' ', '').astype(float)

The goal is to get the cells to look like the following 1.560000e-08.


Notes:

  • Depending on the rest of the dataframe, additional adjustments might still be required, such as, removing the spaces ' that might exist in one of the cells. For that one can use pandas.Series.str.replace as follows

    df['padj'] = df['padj'].str.replace("'", '')
    
Answered By: Gonçalo Peres

Considering your sample (column padj), the code below should work:

f_value = eval(str_float.replace('x10', 'e').replace(' ', ''))
Answered By: Adrien Derouene

Updated based on the data you provided above. The most significant thing being that the x is actually a times symbol:

import pandas as pd
DEGs = ["JUNB", "HOOK2", "EGR1", "DUSP1", "IL6", "ARL4C", "NR4A2", "CCL3L1"]
padj = ["1.5 ×10-8", "2.39×10-7", "3.17×10-6", "3.95×10-6", "3.95×10-6", "5.06×10-6", "2.94×10-4", "4.026×10-4"]
LFC = ["-1.273329", "-1.109320", "-4.187828", "-3.251030", "-3.415500", "-2.147519", "-3.001167", "-5.293694"]
df = pd.DataFrame({'DEGs': DEGs, 'padj': padj, 'LFC': LFC})
# change to python-friendly float format
df['padj'] = df['padj'].str.replace(' ×10-', 'e-', regex=False)
df['padj'] = df['padj'].str.replace('×10-', 'e-', regex=False)
# convert padj from string to float
df['padj'] = df['padj'].astype(float)

will give you this dataframe:
enter image description here

Answered By: Daniel Schneider

If you want a numerical vectorial solution, you can use:

df['float'] = (df['padj'].str.extract(r'(d+(?:.d+))s*×10(.?d+)')
 .apply(pd.to_numeric).pipe(lambda d: d[0].mul(10.**d[1]))
)

output:

     DEGs        padj       LFC         float
0    JUNB   1.5 ×10-8 -1.273329  1.500000e-08
1   HOOK2   2.39×10-7 -1.109320  2.390000e-07
2    EGR1   3.17×10-6 -4.187828  3.170000e-06
3   DUSP1   3.95×10-6 -3.251030  3.950000e-06
4     IL6   3.95×10-6 -3.415500  3.950000e-06
5   ARL4C   5.06×10-6 -2.147519  5.060000e-06
6   NR4A2   2.94×10-4 -3.001167  2.940000e-04
7  CCL3L1  4.026×10-4 -5.293694  4.026000e-04

Intermediate:

df['padj'].str.extract('(d+(?:.d+))s*×10(.?d+)')

       0   1
0    1.5  -8
1   2.39  -7
2   3.17  -6
3   3.95  -6
4   3.95  -6
5   5.06  -6
6   2.94  -4
7  4.026  -4
Answered By: mozway
Categories: questions Tags: , , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.