Python – Turn all items in a Dataframe to strings
Question:
I followed the procedure from "In Python, how do I convert all of the items in a list to floats?" because each column of my DataFrame is a list, but instead of floats I chose to change all the values to strings.
df = [str(i) for i in df]
But this failed.
It simply erased all the data except for the first row of column names.
Then, trying df = [str(i) for i in df.values] turned the entire DataFrame into one big list, which messes up the data too much for the goal of my script: exporting the DataFrame to my Oracle table.
Is there a way to convert all the items that are in my Dataframe that are NOT strings into strings?
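For context on why both attempts lose the table, here is a small sketch on a made-up two-column DataFrame (the column names `a` and `b` are assumptions for illustration). Iterating over a DataFrame yields its column labels, and iterating over `df.values` yields one NumPy array per row, so each comprehension builds a plain Python list rather than a converted DataFrame.

```python
import pandas as pd

# Hypothetical example data; not taken from the question.
df = pd.DataFrame({"a": [1, 2], "b": [3.0, 4.0]})

# Iterating over a DataFrame walks its column labels, not its cells.
labels = [str(i) for i in df]

# Iterating over df.values walks the rows; each row becomes one string.
rows = [str(i) for i in df.values]

print(labels)      # the column names only
print(type(rows))  # a plain list, no longer a DataFrame
```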
Answers:
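You can use the applymap method (in pandas 2.1 and later it is deprecated in favor of DataFrame.map, which is called the same way):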
df = df.applymap(str)
You can use this:
df = df.astype(str)
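A minimal sketch of what this does, assuming a small mixed-type DataFrame (the column names and values are invented for illustration): every cell ends up as a Python string, and the shape, index and column labels stay the same.

```python
import pandas as pd

# Hypothetical data with integer, float and string columns.
df = pd.DataFrame({"id": [1, 2], "price": [9.5, 12.0], "name": ["a", "b"]})

as_text = df.astype(str)

# Check that every single cell is now a Python str.
all_strings = all(isinstance(v, str) for v in as_text.to_numpy().ravel())

print(as_text.dtypes)
print(all_strings)
```

Because the result is still a DataFrame, it can be passed on to whatever export step follows.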
Out of curiosity, I decided to see whether there is any difference in efficiency between the accepted solution and mine.
The results are below:
example df:
df = pd.DataFrame([list(range(1000))], index=[0])
test df.astype:
%timeit df.astype(str)
100 loops, best of 3: 2.18 ms per loop
test df.applymap:
%timeit df.applymap(str)
1 loops, best of 3: 245 ms per loop
It seems df.astype is quite a lot faster here, by roughly a factor of 100 🙂
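The `%timeit` lines above only run inside IPython or Jupyter. As a rough sketch, the same comparison can be reproduced in a plain script with the standard-library `timeit` module. The repeat count of 5 is an arbitrary assumption, and the `elementwise` helper is only there because `applymap` is named `map` in newer pandas releases. Absolute timings will differ by machine and pandas version.

```python
import timeit

import pandas as pd

df = pd.DataFrame([list(range(1000))], index=[0])

# DataFrame.map is the pandas 2.1+ name for applymap; use whichever exists.
elementwise = df.map if hasattr(pd.DataFrame, "map") else df.applymap

# Total time for 5 runs of each approach (5 is an arbitrary choice).
t_astype = timeit.timeit(lambda: df.astype(str), number=5)
t_elementwise = timeit.timeit(lambda: elementwise(str), number=5)

print(f"astype(str):      {t_astype / 5 * 1000:.2f} ms per call")
print(f"element-wise str: {t_elementwise / 5 * 1000:.2f} ms per call")
```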
This worked for me (it keeps the first element of each cell that holds a list and sets every other cell to None):
df.applymap(lambda x: x[0] if type(x) is list else None)
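To make the effect of that lambda concrete, here is a small sketch on invented data where some cells hold one-element lists. Note that it unwraps lists rather than converting to strings, so a follow-up `astype(str)` would still be needed for the original goal. The `elementwise` helper is an assumption that picks `DataFrame.map` on newer pandas and `applymap` on older versions.

```python
import pandas as pd

# Hypothetical data: most cells are one-element lists, one is a bare int.
df = pd.DataFrame({"a": [[1], [2]], "b": [["x"], 3]})

# DataFrame.map is the pandas 2.1+ name for applymap; use whichever exists.
elementwise = df.map if hasattr(pd.DataFrame, "map") else df.applymap

# Keep the first element of list cells; every non-list cell becomes missing.
first_items = elementwise(lambda x: x[0] if isinstance(x, list) else None)

print(first_items)
```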
With pandas >= 1.0 there is now a dedicated string datatype.
You can convert your columns to this pandas string datatype using .astype('string'):
df = df.astype('string')
This is different from using str, which sets the pandas 'object' datatype:
df = df.astype(str)
You can see the difference in datatypes when you look at the info of the dataframe:
df = pd.DataFrame({
    'zipcode_str': [90210, 90211],
    'zipcode_string': [90210, 90211],
})
df['zipcode_str'] = df['zipcode_str'].astype(str)
df['zipcode_string'] = df['zipcode_str'].astype('string')
df.info()
# you can see that the first column has dtype object
# while the second column has the new dtype string
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   zipcode_str     2 non-null      object
 1   zipcode_string  2 non-null      string
dtypes: object(1), string(1)
From the docs:
The ‘string’ extension type solves several issues with object-dtype NumPy arrays:
1) You can accidentally store a mixture of strings and non-strings in an object dtype array. A StringArray can only store strings.
2) object dtype breaks dtype-specific operations like DataFrame.select_dtypes(). There isn’t a clear way to select just text while excluding non-text, but still object-dtype columns.
3) When reading code, the contents of an object dtype array is less clear than string.
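Point 2 can be illustrated with a short sketch, assuming pandas >= 1.0 and invented column names: once a column uses the 'string' dtype, `select_dtypes` can pick it out while leaving an object column with mixed contents behind. The intermediate `astype(str)` mirrors the two-step conversion in the example above.

```python
import pandas as pd

# Hypothetical data: one numeric column and one object column of mixed values.
df = pd.DataFrame({"zipcode": [90210, 90211], "mixed": ["a", 1]})

# Convert via str first, then to the dedicated pandas string dtype.
df["zipcode"] = df["zipcode"].astype(str).astype("string")

# Select only the columns that carry the dedicated string dtype.
text_only = df.select_dtypes(include="string")

print(df.dtypes)
print(list(text_only.columns))
```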
Information about pandas 1.0 can be found here:
https://pandas.pydata.org/pandas-docs/version/1.0.0/whatsnew/v1.0.0.html