Chain df.str.split() in pandas dataframe

Question:

Edit: 2022NOV21

How do we chain df.col.str.split() since this returns the split columns if expand = True
I am trying to split a column after performing .melt(). If I use assign I end up using the original column and the melted column actually does not even exist.

df = pd.DataFrame().from_dict({
    'id' : [1,2,3,4],
    '2022_amt' : [10.1,20.2,30.3, 40.4],
    '2022_qty' : [10,20,30,40]
})

df = ( 
    df
    .melt(
         id_vars=['id'], 
         value_vars=['2022_amt', '2022_qty'], 
         var_name='fy', 
         value_name='num' 
    )
    # can i chain any pd.Series.str.[METHOD] here
    # .assign(
    #         year=df.fy.str.split('_', expand=True)[0], 
    #         t=df.fy.str.split('_', expand=True)[1]
    # )
    )

# i can add the two columns in this way but can we use chain to expand dataframe df
df[['year', 't']] = df.fy.str.split('_', expand=True)
df = df.drop(columns = ['fy'])
Asked By: manny

||

Answers:

Not sure what you are trying to do.
But what I am sure of, is that you cannot use [0] (at least not to do what you want to) directly on a series. But you can call .str again so that you can use [0] operators

Example

df=pd.DataFrame({'s':['abc-def|ghi', 'one-two|three']})
df.s.str.split('-').str[0]
#0    abc
#1    one
#Name: s, dtype: object

df.s.str.split('-').str[1].str.split('|').str[0]
#0    def
#1    two
#Name: s, dtype: object

df.s.str.split('-').str[1].str.split('|').str[1]
#0      ghi
#1    three
#Name: s, dtype: object

Note that half of the .str here are counter intuitive, since we are not really using string functions on the result (which are arrays). But .str also works on arrays, and on anything that have usage of the [..] indexing. As long as you are not call string specific function on its. So it is a trick: .str on a series allows to call string methods on the elements of the series. And some of the string methods, including indexation, happen to have a meaning on arrays too.

Answered By: chrslg

Using expand converts it into a DataFrame, which you do not really want here; secondly with chaining, use an anonymous function to refer to the previous dataframe:

(df
.melt(id_vars='id',var_name='fy',value_name='num')
assign(year = lambda df: df.fy.str.split('_').str[0],
       t = lambda df: df.fy.str.split('_').str[1])
)

   id        fy   num  year    t
0   1  2022_amt  10.1  2022  amt
1   2  2022_amt  20.2  2022  amt
2   3  2022_amt  30.3  2022  amt
3   4  2022_amt  40.4  2022  amt
4   1  2022_qty  10.0  2022  qty
5   2  2022_qty  20.0  2022  qty
6   3  2022_qty  30.0  2022  qty
7   4  2022_qty  40.0  2022  qty

For your use case, there are simpler, more efficient ways to do this:

  • with pd.stack:
df = df.set_index('id')
df.columns = df.columns.str.split('_', expand = True)
df.columns.names = ['year', 't']
df.stack(['year', 't']).reset_index(name='num')

   id  year    t   num
0   1  2022  amt  10.1
1   1  2022  qty  10.0
2   2  2022  amt  20.2
3   2  2022  qty  20.0
4   3  2022  amt  30.3
5   3  2022  qty  30.0
6   4  2022  amt  40.4
7   4  2022  qty  40.0
  • with pivot_longer from pyjanitor:
# pip install pyjanitor
import pandas as pd
import janitor as jn
df.pivot_longer(index = 'id', names_to = ('year','t'), names_sep = '_')

   id  year    t  value
0   1  2022  amt   10.1
1   2  2022  amt   20.2
2   3  2022  amt   30.3
3   4  2022  amt   40.4
4   1  2022  qty   10.0
5   2  2022  qty   20.0
6   3  2022  qty   30.0
7   4  2022  qty   40.0
Answered By: sammywemmy
Categories: questions Tags: , , , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.