splitting the address column in pandas

Question:

I have a pandas dataset like this:

import pandas as pd

data = {'id':  ['001', '002', '003','004'],
        'address': ["William J. Clare\n290 Valley Dr.\nCasper, WY 82604\nUSA",
                    "1180 Shelard Tower\nMinneapolis, MN 55426\nUSA",
                    "William N. Barnard\n145 S. Durbin\nCasper, WY 82601\nUSA",
                    "215 S 11th ST"],
        'locality': [None, None, None,'Laramie'],
        'region': [None, None, None, 'WY'],
        'Zipcode': [None, None, None, '87656'],
        'Country': [None, None, None, 'US']
        }

df = pd.DataFrame(data)

I tried to split the address column on the newline (\n), but two of the addresses start with a name followed by \n, so I am not able to do it. Please help me split the address on \n and extrapolate the parts into locality, region, Zipcode and Country.

sample output:

id  address             locality     region  Zipcode  Country
1   290 Valley Dr.      Casper       WY      82604    USA
2   1180 Shelard Tower  Minneapolis  MN      55426    USA
3   145 S. Durbin       Casper       WY      82601    USA
4   215 S 11th ST       Laramie      WY      87656    US

I tried different methods to split on \n using the split command, but the result contains extra pieces. I am trying to keep it in a pandas DataFrame so that I can carry out further analysis.

Asked By: Sushmitha

||

Answers:

Row two of your example has no name line, so it may cause some trouble. Please check that your example is correct.
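
To see the trouble concretely, here is a small sketch that splits the first three addresses exactly as posted in the question. The names `addresses` and `parts` are only illustrative. Because row two has three lines instead of four, its pieces shift one column to the left:

```python
import pandas as pd

# The first three addresses from the question, unchanged (row 1 has no name line).
addresses = pd.Series([
    "William J. Clare\n290 Valley Dr.\nCasper, WY 82604\nUSA",
    "1180 Shelard Tower\nMinneapolis, MN 55426\nUSA",
    "William N. Barnard\n145 S. Durbin\nCasper, WY 82601\nUSA",
])

parts = addresses.str.split("\n", expand=True)

# Row 1 only has three pieces, so its street lands in column 0 (where the
# other rows hold the name) and its last column is left empty.
print(parts)
```

So a plain split only lines up if every address has the same number of lines.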

Given your df like this (with a placeholder name added to row two):

data = {'id':  ['001', '002', '003','004'],
        'address': ["William J. Clare\n290 Valley Dr.\nCasper, WY 82604\nUSA",
                    "name\n1180 Shelard Tower\nMinneapolis, MN 55426\nUSA",
                    "William N. Barnard\n145 S. Durbin\nCasper, WY 82601\nUSA",
                    "215 S 11th ST"],
        'locality': [None, None, None,'Laramie'],
        'region': [None, None, None, 'WY'],
        'Zipcode': [None, None, None, '87656'],
        'Country': [None, None, None, 'US']
        }

df = pd.DataFrame(data)

You can create a filter condition:

cond = df[["locality", "region", "Zipcode", "Country"]].isna()

and then use split with expand=True

df.loc[cond.all(axis=1)].address.str.split(r"\n", expand=True)

                    0                   1                      2    3
0    William J. Clare      290 Valley Dr.       Casper, WY 82604  USA
1                name  1180 Shelard Tower  Minneapolis, MN 55426  USA
2  William N. Barnard       145 S. Durbin       Casper, WY 82601  USA
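
The split above stops at four raw columns. One possible way to finish the job is sketched below. It is not part of the original answer, and it rests on two assumptions: every row being parsed has exactly four lines (name, street, city line, country), and the city line looks like "City, ST 12345". The names `mask`, `parts`, `city` and `parsed` are made up for the illustration.

```python
import pandas as pd

data = {'id': ['001', '002', '003', '004'],
        'address': ["William J. Clare\n290 Valley Dr.\nCasper, WY 82604\nUSA",
                    "name\n1180 Shelard Tower\nMinneapolis, MN 55426\nUSA",
                    "William N. Barnard\n145 S. Durbin\nCasper, WY 82601\nUSA",
                    "215 S 11th ST"],
        'locality': [None, None, None, 'Laramie'],
        'region': [None, None, None, 'WY'],
        'Zipcode': [None, None, None, '87656'],
        'Country': [None, None, None, 'US']}
df = pd.DataFrame(data)

# Rows where locality/region/Zipcode/Country are all missing still need parsing.
mask = df[["locality", "region", "Zipcode", "Country"]].isna().all(axis=1)

# Assumption: each of these addresses has exactly four lines.
parts = df.loc[mask, "address"].str.split("\n", expand=True)
parts.columns = ["name", "street", "city_line", "Country"]

# Assumption: the city line has the shape "City, ST 12345".
city = parts["city_line"].str.extract(
    r"^(?P<locality>[^,]+),\s*(?P<region>\w+)\s+(?P<Zipcode>\d+)$"
)

# Collect the parsed pieces under the target column names and write them back.
# update() aligns on the index and only copies non-missing values.
parsed = pd.concat([parts["street"].rename("address"), city, parts["Country"]], axis=1)
df.update(parsed)

print(df)
```

Row four is not touched, since it is excluded by the mask and therefore absent from `parsed`.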
Answered By: bitflip

Here is an approach using extract instead of split and update in place:

df.update(df['address'].str.extract(r'([^,]+)(?:,\s(\w+)\s*(\d+)\n(\w+))?$')
 .set_axis(["address", "region", "Zipcode", "Country"], axis=1)
 )

df['name'] = pd.NA

df.update(df['address'].str.extract(r'(?:(.*?)\n)?(.*)\n(.+)')
          .set_axis(['name', 'address', 'locality'], axis=1)
         )

output:

    id             address     locality region Zipcode Country                name
0  001      290 Valley Dr.       Casper     WY   82604     USA    William J. Clare
1  002  1180 Shelard Tower  Minneapolis     MN   55426     USA                <NA>
2  003       145 S. Durbin       Casper     WY   82601     USA  William N. Barnard
3  004       215 S 11th ST      Laramie     WY   87656      US                <NA>
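
To see what each of the two passes captures, the patterns can be tried on single strings with the standard `re` module. This is only a sketch: it assumes the character classes in the patterns are meant to be `\s`, `\w` and `\d`, and the names `first`, `second`, `full`, `no_name` and `short` are illustrative.

```python
import re

# The two patterns from the answer, written with \s, \w and \d (assumed intent).
first = re.compile(r'([^,]+)(?:,\s(\w+)\s*(\d+)\n(\w+))?$')
second = re.compile(r'(?:(.*?)\n)?(.*)\n(.+)')

full = "William J. Clare\n290 Valley Dr.\nCasper, WY 82604\nUSA"

# Pass 1: everything before the comma, then region, zip code and country.
head, region, zipcode, country = first.search(full).groups()

# Pass 2: split what is left into an optional name, the street and the locality.
name, street, locality = second.search(head).groups()
print(name, street, locality, region, zipcode, country, sep=" | ")

# Without a name line, the optional first group of pass 2 stays empty (None).
no_name = "1180 Shelard Tower\nMinneapolis, MN 55426\nUSA"
no_name_groups = second.search(first.search(no_name).group(1)).groups()

# An address with no comma only fills the first group of pass 1,
# and pass 2 finds no newline, so it does not match at all.
short = "215 S 11th ST"
short_groups = first.search(short).groups()
short_second = second.search(short)
```

Unmatched groups come back as missing values from `str.extract`, and `update` skips missing values, which is why row four keeps its existing locality, region, Zipcode and Country.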

Answered By: mozway