Why is pandas.series.str.extract not working here but working elsewhere

Question:

Why is a pandas.series.extract(regex) able to print the correct values, but won’t assign the value to an existing variable using indexing or np.where.

import pandas as pd
import numpy as np

df = pd.DataFrame(
    [
        ['1', np.nan, np.nan, '1 Banana St, 69126 Heidelberg'],
        ['2', "Doloros St", 67898, '2 Choco Rd, 69412 Eberbach']], 
    columns=['id', "Street", 'Postcode', 'FullAddress']
)

m = df['Street'].isna()
print(df["FullAddress"].str.extract(r'(.+?),'))                        # prints street
print(df["FullAddress"].str.extract(r'b(d{5})b'))                   # prints postcode
df.loc[m, 'Street'] = df.loc[m, 'FullAddress'].str.extract(r'(.+?),')  # outputs NaN
df.loc[m, 'Postcode'] = df.loc[m, 'FullAddress'].str.extract(r'b(d{5})b')
# trying where method throws error - NotImplementedError: cannot align with a higher dimensional NDFrame
df["Street"] = df["Street"].where(~(df["Street"].isna()), df["FullAddress"].str.extract(r'(.+?),'))

What I’m trying to do is fill the empty Street and Postcode with the values from FullAddress – without disturbing the existing Street and Postcode values.

There is no problem with the indexing, the regex, or even the extract… I’ve read the docs, searched for anything similar… What does every get, but I don’t understand!?!?!

Asked By: DrWhat

||

Answers:

You can use .fillna to fill in the NaN values in your dataframe:

df["Street"] = df["Street"].fillna(df["FullAddress"].str.extract(r'(.+?),')[0])
df["Postcode"] = df["Postcode"].fillna(df["FullAddress"].str.extract(r'b(d{5})b')[0])

This will fill in all of your null values with the result of the extract while keeping all existing values:

  id       Street Postcode                    FullAddress
0  1  1 Banana St    69126  1 Banana St, 69126 Heidelberg
1  2   Doloros St    67898     2 Choco Rd, 69412 Eberbach
Answered By: rassar

You missed expand=False as parameter of str.extract:

>>> df.loc[m, 'FullAddress'].str.extract(r'(.+?),')

             0  # <- it's not a Series but a DataFrame with one column
0  1 Banana St

>>> df.loc[m, 'FullAddress'].str.extract(r'(.+?),', expand=False)

0    1 Banana St
Name: FullAddress, dtype: object  # <- now it's a Series

In the first version, Pandas can’t align column labels Street vs 0. In the second version, the Series fit into the Street Series, so:

df.loc[m, 'Street'] = df.loc[m, 'FullAddress'].str.extract(r'(.+?),', expand=False)
df.loc[m, 'Postcode'] = df.loc[m, 'FullAddress'].str.extract(r'b(d{5})b', expand=False)
print(df)

# Output
  id       Street Postcode                    FullAddress
0  1  1 Banana St    69126  1 Banana St, 69126 Heidelberg
1  2   Doloros St  67898.0     2 Choco Rd, 69412 Eberbach

Update*: it’s possible to use extract without expand=False using named groups (?P<xxx>...) to align column labels:

df.loc[m, 'Street'] = df.loc[m, 'FullAddress'].str.extract(r'(?P<Street>.+?),')
df.loc[m, 'Postcode'] = df.loc[m, 'FullAddress'].str.extract(r'b(?P<Postcode>d{5})b')

# OR

pattern = r'(?P<Street>.+?),s*b(?P<Postcode>d{5})b'
df.loc[m, ['Street', 'Postcode']] = df.loc[m, 'FullAddress'].str.extract(pattern)
Answered By: Corralien
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.