Function to remove a part of a string before a capital letter in Pandas Series

Question:

I have a dataframe that includes a column ['locality_name'] with names of villages, towns, and cities. Some names are written like "town of Hamilton", some like "Hamilton", some like "city of Hamilton", and so on. This makes it hard to count unique values. My goal is to keep only the names.

I want to write a function that removes the part of a string up to the first capital letter and then apply it to my dataframe.

That’s what I tried:

import re

def my_slicer(row):
    """
    Returns a string with the name of locality
    """
    return re.sub('ABCDEFGHIKLMNOPQRSTVXYZ', '', row['locality_name'])

raw_data['locality_name_only'] = raw_data.apply(my_slicer, axis=1)

I expected it to return a new column with the names of places. Instead, nothing changed: ['locality_name_only'] has the same values as ['locality_name'].

Asked By: Cpt_keaSar


Answers:

You can use pandas.Series.str.extract. For example:

import pandas as pd

ser = pd.Series(["town of Hamilton", "Hamilton", "city of Hamilton"])
# Capture the first word that starts with a capital letter (one optional hyphen allowed)
ser_2 = ser.str.extract(r"([A-Z][a-z]+-?\w+)")

In your case, use:

raw_data['locality_name_only'] = raw_data['locality_name'].str.extract(r"([A-Z][a-z]+-?\w+)")

# Output :

print(ser_2)

          0
0  Hamilton
1  Hamilton
2  Hamilton
Answered By: abokey
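
Since the stated goal is to count unique values, the extract approach above can be sketched end to end as follows. This is a minimal sketch, not part of the original answer: the sample rows (including "village of Oakville") are made up for illustration, and expand=False is used so that str.extract returns a Series rather than a one-column DataFrame.

```python
import pandas as pd

# Illustrative data; the real raw_data comes from the asker's dataset.
raw_data = pd.DataFrame(
    {"locality_name": ["town of Hamilton", "Hamilton",
                       "city of Hamilton", "village of Oakville"]}
)

# expand=False makes str.extract return a Series for a single capture group.
raw_data["locality_name_only"] = raw_data["locality_name"].str.extract(
    r"([A-Z][a-z]+-?\w+)", expand=False
)

# Count how often each cleaned name occurs.
counts = raw_data["locality_name_only"].value_counts()
print(counts)
```

Note that this pattern keeps only the first capitalized word, so a multi-word name such as "New York" would be cut to "New".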

I would use str.replace and phrase the problem as removing all words that start with a lowercase letter:

raw_data["locality_name_only"] = raw_data["locality_name"].str.replace(r'\s*\b[a-z]\w*\s*', ' ', regex=True).str.strip()
Answered By: Tim Biegeleisen
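
For comparison, here is a hedged sketch of what the question literally asks for: dropping everything before the first capital letter. It also shows why the original attempt changed nothing. The pattern 'ABCDEFGHIKLMNOPQRSTVXYZ' is a literal sequence of characters, not a character class, so it only matches if that exact run of letters appears in the string. The sample Series and the variable names are illustrative assumptions, not taken from the answers above.

```python
import re
import pandas as pd

names = pd.Series(["town of Hamilton", "Hamilton", "city of New York"])

# The pattern from the question is a literal string, so nothing is replaced.
unchanged = names.map(lambda s: re.sub("ABCDEFGHIKLMNOPQRSTVXYZ", "", s))

# ^[^A-Z]* matches the leading run of characters that are not ASCII capitals,
# so replacing it with "" removes everything before the first capital letter.
trimmed = names.str.replace(r"^[^A-Z]*", "", regex=True)

print(unchanged.tolist())
print(trimmed.tolist())
```

Unlike the two answers, this keeps multi-word names intact, but a value with no capital letter at all would be reduced to an empty string, and only ASCII capitals are recognized.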