Pandas: how can I extract by regex from a column into multiple rows?
Question:
I have the following data:
ID | content | date |
---|---|---|
1 | 2429(sach:MySpezialItem :16.59) | 2022-04-12 |
2 | 2429(sach:item 13 :18.59)(sach:this and that costs:16.59) | 2022-06-12 |
And I want to achieve the following:
ID | number | price | date |
---|---|---|---|
1 | 2429 | | 2022-04-12 |
1 | | 16.59 | 2022-04-12 |
2 | 2429 | | 2022-06-12 |
2 | | 18.59 | 2022-06-12 |
2 | | 16.59 | 2022-06-12 |
What I tried
df['sach'] = df['content'].str.split(r'(sach:.*)').explode('content')
df['content'] = df['content'].str.replace(r'(sach:.*)','', regex=True)
Answers:
You can use a single regex with str.extractall:
regex = r'(?P<number>\d+)\(|:(?P<price>\d+(?:\.\d+)?)\)'
df = df.join(df.pop('content').str.extractall(regex).droplevel(1))
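To see why the `droplevel(1)` step is needed, it can help to look at the intermediate result. The following is a minimal sketch, assuming the two sample strings from the question and a regex with the parentheses and dots escaped; the variable names `content`, `matches`, `index_pairs` and `flat` are only illustrative.

```python
import pandas as pd

# Sample values copied from the question's `content` column.
content = pd.Series(
    [
        "2429(sach:MySpezialItem :16.59)",
        "2429(sach:item 13 :18.59)(sach:this and that costs:16.59)",
    ],
    name="content",
)

# Two alternatives: digits directly before "(" are captured as `number`,
# digits (optionally with decimals) between ":" and ")" as `price`.
regex = r"(?P<number>\d+)\(|:(?P<price>\d+(?:\.\d+)?)\)"

matches = content.str.extractall(regex)

# `matches` has a two-level index: (original row label, match counter).
index_pairs = list(matches.index)

# Dropping the match counter leaves only the original row labels,
# which is what lets a later join align each match with its source row.
flat = matches.droplevel(1)

print(matches)
print(flat.index.tolist())
```

Each match becomes its own row, and the group that did not take part in a given match is left as NaN, which is where the alternating NaN values in the final output come from.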
NB. If you want a new DataFrame, don’t pop the column:
df2 = (df.drop(columns='content')
         .join(df['content'].str.extractall(regex).droplevel(1))
      )
output:
   ID        date number  price
0   1  2022-04-12   2429    NaN
0   1  2022-04-12    NaN  16.59
1   2  2022-06-12   2429    NaN
1   2  2022-06-12    NaN  18.59
1   2  2022-06-12    NaN  16.59
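For anyone who wants to reproduce this end to end, here is a self-contained sketch. The DataFrame is rebuilt by hand from the question's data. The numeric conversion and the index reset at the end are optional extras, not part of the approach above; they are included because `str.extractall` returns the captured values as strings.

```python
import pandas as pd

# Data rebuilt from the question.
df = pd.DataFrame(
    {
        "ID": [1, 2],
        "content": [
            "2429(sach:MySpezialItem :16.59)",
            "2429(sach:item 13 :18.59)(sach:this and that costs:16.59)",
        ],
        "date": ["2022-04-12", "2022-06-12"],
    }
)

regex = r"(?P<number>\d+)\(|:(?P<price>\d+(?:\.\d+)?)\)"

# One row per regex match, indexed by the original row label.
extracted = df["content"].str.extractall(regex).droplevel(1)

# Left join on the index repeats ID and date once per match.
result = df.drop(columns="content").join(extracted)

# Optional: the captured values are strings, so convert them if needed.
result["number"] = pd.to_numeric(result["number"])
result["price"] = pd.to_numeric(result["price"])

# Optional: replace the repeated row labels with a fresh 0..n-1 index.
result = result.reset_index(drop=True)

print(result)
```

Because each row holds either a number or a price, both converted columns contain NaN and therefore end up as floats.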