Extract last specific word/value from one column and move it to the next row
Question:
I have a DataFrame like the following
|Animals | Type | Year |
|Penguin AVES | Omnivore | 2015 |
|Caiman REP | Carnivore | 2018 |
|Komodo.Rep | Carnivore | 2019 |
|Blue Jay.aves | Omnivore | 2015 |
|Iguana+rep | Carnivore | 2020 |
I want to extract the last specific words (e.g. AVES and REP) from the values in column "Animals" and move it to the next row while keeping the values of the entire row. There are several specific words other than AVES and REP. It’s not very clean (as shown by the whitespace, dot, and "+" operator before the specific words). The expected new DataFrame would be like the following
| Animals | Type | Year |
| Penguin AVES | Omnivore | 2015 |
| AVES | Omnivore | 2015 |
| Caiman REP | Carnivore | 2018 |
| REP | Carnivore | 2018 |
| Komodo.Rep | Carnivore | 2019 |
| Rep | Carnivore | 2019 |
| Blue Jay.aves | Omnivore | 2015 |
| aves | Omnivore | 2015 |
| Iguana+rep | Carnivore | 2020 |
| rep | Carnivore | 2020 |
I was thinking of using a negative indexing to split the string, but I got confused with the lambda function for this particular issue. Any idea how I should approach this problem? Thanks in advance.
Answers:
You can use str.extract
to get the last word ((w+)$
regex, but you can also use a specific list (?i)(aves|rep)$
if needed) and assign
it to replace the column, then concat
the updated DataFrame to the original one, and sort_index
with a stable method to interleave the rows:
out = (pd.concat([df, df.assign(Animals=df['Animals'].str.extract(r'(w+)$'))])
.sort_index(kind='stable', ignore_index=True)
)
Output:
Animals Type Year
0 Penguin AVES Omnivore 2015
1 AVES Omnivore 2015
2 Caiman REP Carnivore 2018
3 REP Carnivore 2018
4 Komodo.Rep Carnivore 2019
5 Rep Carnivore 2019
6 Blue Jay.aves Omnivore 2015
7 aves Omnivore 2015
8 Iguana+rep Carnivore 2020
9 rep Carnivore 2020
alternative using stack
:
cols = df.columns.difference(['Animals']).tolist()
out = (df.assign(Word=df['Animals'].str.extract(r'(w+)$'))
.set_index(cols).stack().reset_index(cols, name='Animals')
.reset_index(drop=True)[df.columns]
)
alternative with indexing:
Duplicate all rows, modify the odd rows with the extracted word
out = df.loc[df.index.repeat(2)].reset_index(drop=True)
out.loc[1::2, 'Animals'] = out.loc[1::2, 'Animals'].str.extract(r'(w+)$', expand=False)
I have a DataFrame like the following
|Animals | Type | Year |
|Penguin AVES | Omnivore | 2015 |
|Caiman REP | Carnivore | 2018 |
|Komodo.Rep | Carnivore | 2019 |
|Blue Jay.aves | Omnivore | 2015 |
|Iguana+rep | Carnivore | 2020 |
I want to extract the last specific words (e.g. AVES and REP) from the values in column "Animals" and move it to the next row while keeping the values of the entire row. There are several specific words other than AVES and REP. It’s not very clean (as shown by the whitespace, dot, and "+" operator before the specific words). The expected new DataFrame would be like the following
| Animals | Type | Year |
| Penguin AVES | Omnivore | 2015 |
| AVES | Omnivore | 2015 |
| Caiman REP | Carnivore | 2018 |
| REP | Carnivore | 2018 |
| Komodo.Rep | Carnivore | 2019 |
| Rep | Carnivore | 2019 |
| Blue Jay.aves | Omnivore | 2015 |
| aves | Omnivore | 2015 |
| Iguana+rep | Carnivore | 2020 |
| rep | Carnivore | 2020 |
I was thinking of using a negative indexing to split the string, but I got confused with the lambda function for this particular issue. Any idea how I should approach this problem? Thanks in advance.
You can use str.extract
to get the last word ((w+)$
regex, but you can also use a specific list (?i)(aves|rep)$
if needed) and assign
it to replace the column, then concat
the updated DataFrame to the original one, and sort_index
with a stable method to interleave the rows:
out = (pd.concat([df, df.assign(Animals=df['Animals'].str.extract(r'(w+)$'))])
.sort_index(kind='stable', ignore_index=True)
)
Output:
Animals Type Year
0 Penguin AVES Omnivore 2015
1 AVES Omnivore 2015
2 Caiman REP Carnivore 2018
3 REP Carnivore 2018
4 Komodo.Rep Carnivore 2019
5 Rep Carnivore 2019
6 Blue Jay.aves Omnivore 2015
7 aves Omnivore 2015
8 Iguana+rep Carnivore 2020
9 rep Carnivore 2020
alternative using stack
:
cols = df.columns.difference(['Animals']).tolist()
out = (df.assign(Word=df['Animals'].str.extract(r'(w+)$'))
.set_index(cols).stack().reset_index(cols, name='Animals')
.reset_index(drop=True)[df.columns]
)
alternative with indexing:
Duplicate all rows, modify the odd rows with the extracted word
out = df.loc[df.index.repeat(2)].reset_index(drop=True)
out.loc[1::2, 'Animals'] = out.loc[1::2, 'Animals'].str.extract(r'(w+)$', expand=False)