How to convert a column with missing values to string?
Question:
I need to export a DataFrame from pandas to Microsoft SQL Server using SQLAlchemy. Many columns are strings, with missing values and some very long integers, e.g. 999999999999999999999999999999999. These numbers are a kind of foreign key, so the value itself doesn’t mean anything and I can safely convert them to strings.
The long integers cause the following error in SQLAlchemy when exporting to SQL:
OverflowError: int too big to convert
I tried converting to string with astype(str), but then missing values, identified as NaNs, are converted into the string ‘nan’ – so SQL does not see them as NULLs but as the string ‘nan’.
The only solution I have found is to convert to str first and then replace ‘nan’ with numpy.nan. Is there a better way? This is cumbersome, relatively slow, and about as unpythonic as it gets: I convert everything to string, the conversion turns nulls into strings, so I convert those back into NaN (which can only be a float), and I end up with a mixed-type column.
Or do I simply have to suck it up and accept that pandas is dreadful at dealing with missing values?
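A minimal reproduction of the astype(str) behaviour described above (illustrative values only, no SQL involved):

```python
import numpy as np
import pandas as pd

s = pd.Series(['my string', 999999999999999999999999999999999, np.nan])

# astype(str) applies str() element-wise, so NaN becomes the literal string 'nan'
converted = s.astype(str)

print(converted.tolist())  # the last element is now the string 'nan', not a null
```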
I have an example below:
import numpy as np
import pandas as pd
import time
from sqlalchemy import create_engine, MetaData, Table, select
import sqlalchemy

start = time.time()

ServerName = "DESKTOP-MRXSQLEXPRESS"
Database = 'MYDATABASE'
params = '?driver=SQL+Server+Native+Client+11.0'
engine = create_engine('mssql+pyodbc://' + ServerName + '/' + Database + params, encoding='latin1')
conn = engine.connect()

df = pd.DataFrame()
df['mixed'] = np.arange(0, 9)
df.iloc[0, 0] = 'test'
df['numb'] = 3.0
df['text'] = 'my string'
df.iloc[0, 2] = np.nan
df.iloc[1, 2] = 999999999999999999999999999999999

# workaround: convert to str, then turn the literal 'nan' back into NaN
df['text'] = df['text'].astype(str).replace('nan', np.nan)
df.to_sql('test_df_mixed_types', engine, schema='dbo', if_exists='replace')
Answers:
Using np.where would certainly be a bit faster than replace, i.e.
df['text'] = np.where(pd.isnull(df['text']), df['text'], df['text'].astype(str))
Timings:
%%timeit
df['text'].astype(str).replace('nan',np.nan)
1000 loops, best of 3: 536 µs per loop
%%timeit
np.where(pd.isnull(df['text']),df['text'],df['text'].astype(str))
1000 loops, best of 3: 274 µs per loop
x = pd.concat([df['text']]*10000)
%%timeit
np.where(pd.isnull(x),x,x.astype(str))
10 loops, best of 3: 28.8 ms per loop
%%timeit
x.astype(str).replace('nan',np.nan)
10 loops, best of 3: 33.5 ms per loop
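Put together on a small illustrative frame (hypothetical values), the np.where approach converts the non-null values and leaves the nulls intact:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'text': ['my string', 999999999999999999999999999999999, np.nan]})

# convert non-null values to str; null positions keep their original NaN
df['text'] = np.where(pd.isnull(df['text']), df['text'], df['text'].astype(str))

print(df['text'].tolist())  # the big int is now a string, the NaN is still NaN
```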
To keep NaN as NaN and convert only the non-NaN rows to str, use boolean indexing:
msk = df['text'].notna()
df.loc[msk, 'text'] = df.loc[msk, 'text'].astype(str)
Or use the mask() method, which replaces values wherever a condition is True (here, wherever the value is non-NaN), much like np.where():
df['text'] = df['text'].mask(lambda x: x.notna(), df['text'].astype(str))
If, however, you want NaNs to become empty strings (perhaps to operate on the strings later), use fillna():
df['text'] = df['text'].fillna('').astype(str)
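As a side note not covered in the answers above: newer pandas (1.0+) has a nullable string dtype that converts values with str() but keeps missing values as pd.NA, which sidesteps the ‘nan’ problem entirely. Whether a given SQL driver writes pd.NA as NULL is worth verifying separately.

```python
import numpy as np
import pandas as pd

s = pd.Series(['my string', 999999999999999999999999999999999, np.nan])

# nullable StringDtype: non-null values become str, NaN becomes <NA>
s2 = s.astype('string')

print(s2.isna().tolist())  # the missing value survives the conversion
```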