Chain df.str.split() in pandas dataframe
Question:
Edit: 2022NOV21
How do we chain df.col.str.split(), given that it returns the split columns as a DataFrame when expand=True?
I am trying to split a column after performing .melt(). If I use assign, df still refers to the original DataFrame, in which the melted column does not exist yet.
import pandas as pd
df = pd.DataFrame({
'id' : [1,2,3,4],
'2022_amt' : [10.1,20.2,30.3, 40.4],
'2022_qty' : [10,20,30,40]
})
df = (
df
.melt(
id_vars=['id'],
value_vars=['2022_amt', '2022_qty'],
var_name='fy',
value_name='num'
)
# can i chain any pd.Series.str.[METHOD] here
# .assign(
# year=df.fy.str.split('_', expand=True)[0],
# t=df.fy.str.split('_', expand=True)[1]
# )
)
# i can add the two columns in this way but can we use chain to expand dataframe df
df[['year', 't']] = df.fy.str.split('_', expand=True)
df = df.drop(columns = ['fy'])
Answers:
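I am not sure exactly what you are trying to do. What I am sure of is that you cannot use [0] directly on the Series returned by .str.split() (at least not to get what you want), because that indexes the Series by label instead of indexing into each element. You can, however, call .str again and then use the [0] operator element-wise.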
Example
df=pd.DataFrame({'s':['abc-def|ghi', 'one-two|three']})
df.s.str.split('-').str[0]
#0 abc
#1 one
#Name: s, dtype: object
df.s.str.split('-').str[1].str.split('|').str[0]
#0 def
#1 two
#Name: s, dtype: object
df.s.str.split('-').str[1].str.split('|').str[1]
#0 ghi
#1 three
#Name: s, dtype: object
Note that half of the .str calls here are counterintuitive, since we are not really applying string functions to the intermediate results (which are lists, not strings). But .str also works on lists, and on anything else that supports [..] indexing, as long as you do not call a string-specific method on it. So it is a trick: .str on a Series lets you call string methods on each element of the Series, and some of those methods, including indexing, happen to be meaningful for lists too.
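To make the trick above concrete, here is a minimal sketch (the variable names parts, first and counts are only illustrative) showing that .str[0] and .str.len() both operate on the list-like results of .str.split():

```python
import pandas as pd

s = pd.Series(['abc-def|ghi', 'one-two|three'])

# Without expand=True, each element of the result holds the split pieces
parts = s.str.split('-')

# .str[0] indexes into every element, not into the Series itself
first = parts.str[0]

# .str.len() is not string-specific either: here it counts the pieces
counts = parts.str.len()

print(first.tolist())
print(counts.tolist())
```

By contrast, a string-specific method such as .str.upper() would not give a useful result on parts, because its elements are no longer strings.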
Using expand=True converts the result into a DataFrame, which you do not really want here. Also, when chaining, use an anonymous function (lambda) so that assign refers to the DataFrame produced by the previous step:
(df
.melt(id_vars='id',var_name='fy',value_name='num')
.assign(year = lambda df: df.fy.str.split('_').str[0],
t = lambda df: df.fy.str.split('_').str[1])
)
id fy num year t
0 1 2022_amt 10.1 2022 amt
1 2 2022_amt 20.2 2022 amt
2 3 2022_amt 30.3 2022 amt
3 4 2022_amt 40.4 2022 amt
4 1 2022_qty 10.0 2022 qty
5 2 2022_qty 20.0 2022 qty
6 3 2022_qty 30.0 2022 qty
7 4 2022_qty 40.0 2022 qty
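If you would still rather keep expand=True inside the chain, one possible sketch is to use .pipe, which (like the lambda in assign) receives the melted DataFrame. The column labels passed to set_axis and the name out are just choices made for this illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3, 4],
    '2022_amt': [10.1, 20.2, 30.3, 40.4],
    '2022_qty': [10, 20, 30, 40],
})

out = (
    df
    .melt(id_vars='id', var_name='fy', value_name='num')
    # pipe hands over the melted frame, so the 'fy' column exists here
    .pipe(lambda d: d.join(
        d['fy'].str.split('_', expand=True).set_axis(['year', 't'], axis=1)
    ))
    # the combined column is no longer needed once it has been split
    .drop(columns='fy')
)

print(out)
```

This splits the column only once, whereas the assign version splits it once per new column.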
For your use case, there are simpler, more efficient ways to do this:
- with DataFrame.stack:
df = df.set_index('id')
df.columns = df.columns.str.split('_', expand = True)
df.columns.names = ['year', 't']
df.stack(['year', 't']).reset_index(name='num')
id year t num
0 1 2022 amt 10.1
1 1 2022 qty 10.0
2 2 2022 amt 20.2
3 2 2022 qty 20.0
4 3 2022 amt 30.3
5 3 2022 qty 30.0
6 4 2022 amt 40.4
7 4 2022 qty 40.0
- with pivot_longer from pyjanitor:
# pip install pyjanitor
import pandas as pd
import janitor as jn
df.pivot_longer(index = 'id', names_to = ('year','t'), names_sep = '_')
id year t value
0 1 2022 amt 10.1
1 2 2022 amt 20.2
2 3 2022 amt 30.3
3 4 2022 amt 40.4
4 1 2022 qty 10.0
5 2 2022 qty 20.0
6 3 2022 qty 30.0
7 4 2022 qty 40.0