Strip / trim all strings of a dataframe
Question:
While cleaning the values of a multi-type DataFrame in Python/pandas, I want to trim the strings. I am currently doing it in two statements:
import pandas as pd
df = pd.DataFrame([[' a ', 10], [' c ', 5]])
df.replace(r'^\s+', '', regex=True, inplace=True)  # front
df.replace(r'\s+$', '', regex=True, inplace=True)  # end
df.values
This is quite slow. What could I improve?
Answers:
You can use the apply function of the Series object:
>>> df = pd.DataFrame([[' a ', 10], [' c ', 5]])
>>> df[0][0]
' a '
>>> df[0] = df[0].apply(lambda x: x.strip())
>>> df[0][0]
'a'
Note the use of strip rather than a regex; strip is much faster.
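One caveat with the element-wise apply above: it calls strip() on every value, so a missing value in the column raises an AttributeError. The following is only a minimal sketch on made-up sample data (the Series `s` is hypothetical, not from the question), suggesting that the vectorized .str accessor skips missing values instead:

```python
import pandas as pd

# Hypothetical sample data: a text column with one missing value.
s = pd.Series([' a ', None, ' c '])

# Element-wise apply fails on the missing value, which has no strip() method.
try:
    s.apply(lambda x: x.strip())
    apply_failed = False
except AttributeError:
    apply_failed = True

# The .str accessor strips the strings and leaves the missing value missing.
stripped = s.str.strip()
print(apply_failed)
print(stripped.tolist())
```

If the column is known to contain only strings, both forms should give the same result.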
Another option: use the apply function of the DataFrame object:
>>> df = pd.DataFrame([[' a ', 10], [' c ', 5]])
>>> df.apply(lambda x: x.apply(lambda y: y.strip() if type(y) == type('') else y), axis=0)
   0   1
0  a  10
1  c   5
You can use DataFrame.select_dtypes to select the string columns and then apply the function str.strip.
Note: the values must not be of types such as dicts or lists, because the dtype of those columns is also object.
df_obj = df.select_dtypes(['object'])
print (df_obj)
0 a
1 c
df[df_obj.columns] = df_obj.apply(lambda x: x.str.strip())
print(df)
   0   1
0  a  10
1  c   5
But if there are only a few columns, use str.strip:
df[0] = df[0].str.strip()
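The note about dicts and lists matters because the .str accessor returns NaN for values that are not strings. Below is a guarded variation on the idea above, offered as a sketch rather than as the answer's own method: the helper name `strip_text_columns` and the sample columns are made up for illustration, and the columns are picked with the `pandas.api.types` checks instead of select_dtypes so that dedicated string dtypes are covered as well.

```python
import pandas as pd
from pandas.api.types import is_object_dtype, is_string_dtype

def strip_text_columns(df):
    """Return a copy of df with whitespace stripped from string values only."""
    out = df.copy()
    for col in out.columns:
        # Only look at object or string columns; numeric columns are skipped.
        if is_object_dtype(out[col]) or is_string_dtype(out[col]):
            # Strip real strings; lists, dicts, numbers, etc. are kept as they are.
            out[col] = out[col].map(lambda v: v.strip() if isinstance(v, str) else v)
    return out

# Hypothetical sample data with a column that mixes a string and a list.
df = pd.DataFrame({'name': [' a ', ' c '], 'mixed': [' x ', [1, 2]], 'n': [10, 5]})

# The plain .str accessor turns the list into NaN ...
naive = df['mixed'].str.strip()
# ... while the guarded helper leaves it untouched.
clean = strip_text_columns(df)
print(naive.tolist())
print(clean)
```

The isinstance check costs some speed compared with the vectorized accessor, so it is probably only worth it when object columns can hold non-string values.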
If you really want to use a regex, then:
>>> df.replace(r'(^\s+|\s+$)', '', regex=True, inplace=True)
>>> df
   0   1
0  a  10
1  c   5
But it should be faster to do it like this:
>>> df[0] = df[0].str.strip()
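You can try:
df[0] = df[0].str.strip()
or, more specifically, for all string columns:
# Columns that are not numeric (assumed here to hold strings)
non_numeric_columns = list(set(df.columns) - set(df._get_numeric_data().columns))
# Use the vectorized .str accessor; str(x) would stringify the whole Series instead
df[non_numeric_columns] = df[non_numeric_columns].apply(lambda x: x.str.strip())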
Money Shot
Here’s a compact version using applymap with a straightforward lambda expression that calls strip only when the value is a string:
df.applymap(lambda x: x.strip() if isinstance(x, str) else x)
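Newer pandas releases (2.1 and later) deprecate DataFrame.applymap in favor of DataFrame.map, so the one-liner above may emit a FutureWarning there. A minimal sketch that uses whichever method is available; the helper name `strip_strings` is made up for illustration:

```python
import pandas as pd

def strip_strings(df):
    """Strip whitespace from every string value, leaving other values alone."""
    trim = lambda x: x.strip() if isinstance(x, str) else x
    # DataFrame.map exists from pandas 2.1 on; fall back to applymap before that.
    elementwise = df.map if hasattr(df, 'map') else df.applymap
    return elementwise(trim)

df = pd.DataFrame([[' a ', 10], [' c ', 5]])
result = strip_strings(df)
print(result.values.tolist())
```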
Full Example
A more complete example:
import pandas as pd

def trim_all_columns(df):
    """
    Trim whitespace from ends of each value across all series in dataframe
    """
    trim_strings = lambda x: x.strip() if isinstance(x, str) else x
    return df.applymap(trim_strings)

# simple example of trimming whitespace from data elements
df = pd.DataFrame([[' a ', 10], [' c ', 5]])
df = trim_all_columns(df)
print(df)
   0   1
0  a  10
1  c   5
Working Example
Here’s a working example hosted by trinket:
https://trinket.io/python3/e6ab7fb4ab
def trim(x):
    # Strip only object (string) columns; other dtypes pass through unchanged
    if x.dtype == object:
        x = x.str.strip()
    return x

df = df.apply(trim)
How about this (for string columns):
df[col] = df[col].str.replace(" ", "")
It never fails, but note that it removes every space, including the spaces inside a string.
Strip alone does not remove the inner extra spaces in a string. The workaround to this is to first replace one or more spaces with a single space. This ensures that we remove extra inner spaces and outer spaces.
# Import packages
import re
# First inspect the dtypes of the dataframe
df.dtypes
# Replace each run of whitespace with a single space to remove extra inner spaces
df = df.applymap(lambda x: re.sub(r'\s+', ' ', x) if isinstance(x, str) else x)
# Then strip leading and trailing whitespace from the object (string) columns
df = df.apply(lambda x: x.str.strip() if x.dtype == object else x)
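The same two-step cleanup can also be written per column with the vectorized .str accessor. This is only a sketch on made-up sample values, and the helper name `normalize_whitespace` is hypothetical:

```python
import pandas as pd

def normalize_whitespace(series):
    """Collapse runs of whitespace to one space, then trim both ends."""
    return series.str.replace(r'\s+', ' ', regex=True).str.strip()

# Hypothetical sample data, including a tab and a missing value.
s = pd.Series(['  hello   world ', 'a\t b', None])
normalized = normalize_whitespace(s)
print(normalized.tolist())
```

Missing values stay missing, so the helper can be applied to each text column without an isinstance check.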
@jezrael's answer looks good. But if you also want the other (numeric, integer, etc.) columns in the final result, you would need to merge the result back into the original DataFrame.
In that case, you can use this approach:
df = df.apply(lambda x: x.str.strip() if x.dtype.name == 'object' else x, axis=0)
Thanks!
Benchmarks for the best answers:
bm = Benchmark()
df = pd.read_excel(
    path,
    sheet_name=advantage_sheet_name,
    parse_dates=True
)
bm.mark('Loaded')
# @jezrael's answer (accepted answer)
dfClean_1 = (
    df
    .select_dtypes(['object'])
    .apply(lambda x: x.str.strip())
)
bm.mark('Clean method 1')
# @Jonathan B.'s answer
dfClean_2 = (
    df
    .applymap(lambda x: x.strip() if isinstance(x, str) else x)
)
bm.mark('Clean method 2')
# @MaxU - stop genocide of UA / @Roman Pekar's answer
dfClean_3 = (
    df
    .replace(r'\s*(.*?)\s*', r'\1', regex=True)
)
bm.mark('Clean method 3')
Results
145.734375 - 145.734375 : Loaded
147.765625 - 2.03125 : Clean method 1
155.109375 - 7.34375 : Clean method 2
288.953125 - 133.84375 : Clean method 3
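The Benchmark helper and the Excel file used above are not shown, so the numbers cannot be reproduced directly. As a rough, self-contained sketch of how such a comparison could be rerun, the following uses the standard timeit module on synthetic data. The data, the function names, and the anchored regex are all assumptions made for illustration, and timings will vary by machine and pandas version:

```python
import timeit
import pandas as pd

# Synthetic data (an assumption, not the Excel sheet from the benchmark above).
df = pd.DataFrame({'text': [' a ', ' c '] * 5000, 'num': list(range(10000))})

def strip_by_dtype(frame):
    # Vectorized .str.strip() on the non-numeric columns only.
    out = frame.copy()
    cols = [c for c in out.columns if not pd.api.types.is_numeric_dtype(out[c])]
    out[cols] = out[cols].apply(lambda col: col.str.strip())
    return out

def strip_elementwise(frame):
    # Element-wise strip guarded by isinstance (map in pandas >= 2.1, else applymap).
    trim = lambda x: x.strip() if isinstance(x, str) else x
    elementwise = frame.map if hasattr(frame, 'map') else frame.applymap
    return elementwise(trim)

def strip_regex(frame):
    # Regex replacement of leading and trailing whitespace.
    return frame.replace(r'^\s+|\s+$', '', regex=True)

timings = {}
for fn in (strip_by_dtype, strip_elementwise, strip_regex):
    timings[fn.__name__] = timeit.timeit(lambda: fn(df), number=3)
    print(f'{fn.__name__}: {timings[fn.__name__]:.4f} s for 3 runs')
```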
You can use applymap + str.strip on all columns, provided every value is a string (str.strip raises a TypeError on non-string values):
df.applymap(str.strip)
or on only a few columns:
columns = ['a', 'b', 'c']
df[columns] = df[columns].applymap(str.strip)
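Because str.strip is called on every cell, this approach fails on the question's mixed DataFrame. A small sketch of that failure and of the column-restricted form; the `elementwise` helper is made up for illustration and only papers over the applymap/map rename in newer pandas:

```python
import pandas as pd

def elementwise(frame, func):
    # DataFrame.map exists from pandas 2.1 on; fall back to applymap before that.
    method = frame.map if hasattr(frame, 'map') else frame.applymap
    return method(func)

df = pd.DataFrame([[' a ', 10], [' c ', 5]])

# Applying str.strip to every cell fails on the integer column.
try:
    elementwise(df, str.strip)
    raised = False
except TypeError:
    raised = True

# Restricting the call to the text column works.
df[[0]] = elementwise(df[[0]], str.strip)
print(raised)
print(df)
```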