How to remove excess whitespaces in entire python dataframe columns
Question:
What is the pythonic way of removing all excess whitespaces in a dateframe(all the columns). I know the method .str.strip() can be used for single column or for each column. The dataframe as many columns as such I would like to apply the method on the entire dataframe.
The whitespaces occur at different points, beginning of text, in between and at the end. Attached is view of what the dataframe looks like before and after removing the spaces. The dataframe below is the expected result.
I have searched and most related questions I came across were for a single column.
Sample data
import pandas as pd
data = [[' James Brown ', '10', 'USA'], [' James Bond', '15', 'UK'], ['Jimbo Bosco ', '14', 'Unknown']]
df = pd.DataFrame(data, columns = ['Name', 'Age', 'Country'])
ยดยดยด
Answers:
You could use apply
:
df = df.applymap(lambda x: " ".join(x.split()) if isinstance(x, str) else x)
An idea would be to do a combination of:
regex
to remove duplicate spaces (e.g " James Bond" to " James Bond")
str.strip
to remove leading/trailing spaces (e.g " James Bond" to "James Bond").
You could do this one of two ways:
1. On the whole DataFrame:
df = df.replace("s+", " ", regex=True).apply(lambda x: x.str.strip())
2. On each column individually:
for col in ["Name", "Country"]:
df[col] = df[col].replace("s+", " ", regex=True).str.strip()
This works for me, seems shorter and cleaner:
df[col] = df[col].str.replace(' ','')
You can use it to replace any string items in the column values.
What is the pythonic way of removing all excess whitespaces in a dateframe(all the columns). I know the method .str.strip() can be used for single column or for each column. The dataframe as many columns as such I would like to apply the method on the entire dataframe.
The whitespaces occur at different points, beginning of text, in between and at the end. Attached is view of what the dataframe looks like before and after removing the spaces. The dataframe below is the expected result.
I have searched and most related questions I came across were for a single column.
Sample data
import pandas as pd
data = [[' James Brown ', '10', 'USA'], [' James Bond', '15', 'UK'], ['Jimbo Bosco ', '14', 'Unknown']]
df = pd.DataFrame(data, columns = ['Name', 'Age', 'Country'])
ยดยดยด
You could use apply
:
df = df.applymap(lambda x: " ".join(x.split()) if isinstance(x, str) else x)
An idea would be to do a combination of:
regex
to remove duplicate spaces (e.g " James Bond" to " James Bond")str.strip
to remove leading/trailing spaces (e.g " James Bond" to "James Bond").
You could do this one of two ways:
1. On the whole DataFrame:
df = df.replace("s+", " ", regex=True).apply(lambda x: x.str.strip())
2. On each column individually:
for col in ["Name", "Country"]:
df[col] = df[col].replace("s+", " ", regex=True).str.strip()
This works for me, seems shorter and cleaner:
df[col] = df[col].str.replace(' ','')
You can use it to replace any string items in the column values.