Why is pandas.series.str.extract not working here but working elsewhere
Question:
Why is a pandas.series.extract(regex) able to print the correct values, but won’t assign the value to an existing variable using indexing or np.where.
import pandas as pd
import numpy as np
df = pd.DataFrame(
[
['1', np.nan, np.nan, '1 Banana St, 69126 Heidelberg'],
['2', "Doloros St", 67898, '2 Choco Rd, 69412 Eberbach']],
columns=['id', "Street", 'Postcode', 'FullAddress']
)
m = df['Street'].isna()
print(df["FullAddress"].str.extract(r'(.+?),')) # prints street
print(df["FullAddress"].str.extract(r'b(d{5})b')) # prints postcode
df.loc[m, 'Street'] = df.loc[m, 'FullAddress'].str.extract(r'(.+?),') # outputs NaN
df.loc[m, 'Postcode'] = df.loc[m, 'FullAddress'].str.extract(r'b(d{5})b')
# trying where method throws error - NotImplementedError: cannot align with a higher dimensional NDFrame
df["Street"] = df["Street"].where(~(df["Street"].isna()), df["FullAddress"].str.extract(r'(.+?),'))
What I’m trying to do is fill the empty Street and Postcode with the values from FullAddress – without disturbing the existing Street and Postcode values.
There is no problem with the indexing, the regex, or even the extract… I’ve read the docs, searched for anything similar… What does every get, but I don’t understand!?!?!
Answers:
You can use .fillna
to fill in the NaN values in your dataframe:
df["Street"] = df["Street"].fillna(df["FullAddress"].str.extract(r'(.+?),')[0])
df["Postcode"] = df["Postcode"].fillna(df["FullAddress"].str.extract(r'b(d{5})b')[0])
This will fill in all of your null values with the result of the extract
while keeping all existing values:
id Street Postcode FullAddress
0 1 1 Banana St 69126 1 Banana St, 69126 Heidelberg
1 2 Doloros St 67898 2 Choco Rd, 69412 Eberbach
You missed expand=False
as parameter of str.extract
:
>>> df.loc[m, 'FullAddress'].str.extract(r'(.+?),')
0 # <- it's not a Series but a DataFrame with one column
0 1 Banana St
>>> df.loc[m, 'FullAddress'].str.extract(r'(.+?),', expand=False)
0 1 Banana St
Name: FullAddress, dtype: object # <- now it's a Series
In the first version, Pandas can’t align column labels Street
vs 0
. In the second version, the Series fit into the Street
Series, so:
df.loc[m, 'Street'] = df.loc[m, 'FullAddress'].str.extract(r'(.+?),', expand=False)
df.loc[m, 'Postcode'] = df.loc[m, 'FullAddress'].str.extract(r'b(d{5})b', expand=False)
print(df)
# Output
id Street Postcode FullAddress
0 1 1 Banana St 69126 1 Banana St, 69126 Heidelberg
1 2 Doloros St 67898.0 2 Choco Rd, 69412 Eberbach
Update*: it’s possible to use extract
without expand=False
using named groups (?P<xxx>...)
to align column labels:
df.loc[m, 'Street'] = df.loc[m, 'FullAddress'].str.extract(r'(?P<Street>.+?),')
df.loc[m, 'Postcode'] = df.loc[m, 'FullAddress'].str.extract(r'b(?P<Postcode>d{5})b')
# OR
pattern = r'(?P<Street>.+?),s*b(?P<Postcode>d{5})b'
df.loc[m, ['Street', 'Postcode']] = df.loc[m, 'FullAddress'].str.extract(pattern)
Why is a pandas.series.extract(regex) able to print the correct values, but won’t assign the value to an existing variable using indexing or np.where.
import pandas as pd
import numpy as np
df = pd.DataFrame(
[
['1', np.nan, np.nan, '1 Banana St, 69126 Heidelberg'],
['2', "Doloros St", 67898, '2 Choco Rd, 69412 Eberbach']],
columns=['id', "Street", 'Postcode', 'FullAddress']
)
m = df['Street'].isna()
print(df["FullAddress"].str.extract(r'(.+?),')) # prints street
print(df["FullAddress"].str.extract(r'b(d{5})b')) # prints postcode
df.loc[m, 'Street'] = df.loc[m, 'FullAddress'].str.extract(r'(.+?),') # outputs NaN
df.loc[m, 'Postcode'] = df.loc[m, 'FullAddress'].str.extract(r'b(d{5})b')
# trying where method throws error - NotImplementedError: cannot align with a higher dimensional NDFrame
df["Street"] = df["Street"].where(~(df["Street"].isna()), df["FullAddress"].str.extract(r'(.+?),'))
What I’m trying to do is fill the empty Street and Postcode with the values from FullAddress – without disturbing the existing Street and Postcode values.
There is no problem with the indexing, the regex, or even the extract… I’ve read the docs, searched for anything similar… What does every get, but I don’t understand!?!?!
You can use .fillna
to fill in the NaN values in your dataframe:
df["Street"] = df["Street"].fillna(df["FullAddress"].str.extract(r'(.+?),')[0])
df["Postcode"] = df["Postcode"].fillna(df["FullAddress"].str.extract(r'b(d{5})b')[0])
This will fill in all of your null values with the result of the extract
while keeping all existing values:
id Street Postcode FullAddress
0 1 1 Banana St 69126 1 Banana St, 69126 Heidelberg
1 2 Doloros St 67898 2 Choco Rd, 69412 Eberbach
You missed expand=False
as parameter of str.extract
:
>>> df.loc[m, 'FullAddress'].str.extract(r'(.+?),')
0 # <- it's not a Series but a DataFrame with one column
0 1 Banana St
>>> df.loc[m, 'FullAddress'].str.extract(r'(.+?),', expand=False)
0 1 Banana St
Name: FullAddress, dtype: object # <- now it's a Series
In the first version, Pandas can’t align column labels Street
vs 0
. In the second version, the Series fit into the Street
Series, so:
df.loc[m, 'Street'] = df.loc[m, 'FullAddress'].str.extract(r'(.+?),', expand=False)
df.loc[m, 'Postcode'] = df.loc[m, 'FullAddress'].str.extract(r'b(d{5})b', expand=False)
print(df)
# Output
id Street Postcode FullAddress
0 1 1 Banana St 69126 1 Banana St, 69126 Heidelberg
1 2 Doloros St 67898.0 2 Choco Rd, 69412 Eberbach
Update*: it’s possible to use extract
without expand=False
using named groups (?P<xxx>...)
to align column labels:
df.loc[m, 'Street'] = df.loc[m, 'FullAddress'].str.extract(r'(?P<Street>.+?),')
df.loc[m, 'Postcode'] = df.loc[m, 'FullAddress'].str.extract(r'b(?P<Postcode>d{5})b')
# OR
pattern = r'(?P<Street>.+?),s*b(?P<Postcode>d{5})b'
df.loc[m, ['Street', 'Postcode']] = df.loc[m, 'FullAddress'].str.extract(pattern)