Duplicating Pandas Dataframe rows based on string split, without iteration

Question:

I have a DataFrame with a MultiIndex, where one of the index levels holds multiple values separated by a “|”, like this:

            value
left right 
x    a|b    2
y    b|c|d  -1

I want to duplicate the rows based on the “right” column, to get something like this:

            value
left right
x    a     2
x    b     2
y    b     -1
y    c     -1
y    d     -1

The solution I have to this feels wrong and runs slow, because it’s based on iteration:

df2 = df.iloc[:0]
for index, row in df.iterrows():
    stgs = index[1].split("|")
    for s in stgs:
        row.name = (index[0], s)
        df2 = df2.append(row)

Is there a more vectorized way to do this?

Asked By: Jacob H


Answers:

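Pandas Series have a dedicated string method, str.split, for this operation.

str.split works only on a Series, so first isolate the column you want (if it is an index level, as in the question, move it to a column with df.reset_index() first):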

SO = df['right']

Now three steps at once: str.split returns a Series of lists, apply(pd.Series) converts each list into columns, and stack stacks those columns into a single column. The separator here is ','; use '|' for the data in the question.

S1 = SO.str.split(',').apply(pd.Series).stack()

The only issue is that you now have a multi-index, so just drop the level you don't need:

S1.index = S1.index.droplevel(-1)

Full example

SO = pd.Series(data=["a,b", "b,c,d"])

S1 = SO.str.split(',').apply(pd.Series).stack()
S1
Out[4]:
0  0    a
   1    b
1  0    b
   1    c
   2    d

S1.index = S1.index.droplevel(-1) 
S1 
Out[5]:
0    a
0    b
1    b
1    c
1    d
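
The split / stack / droplevel sequence above can also be sketched with Series.explode, which exists in pandas 0.25 and later. This is not part of the original answer; it is a minimal alternative on the same two strings, and the variable names s and exploded are made up for illustration.

```python
import pandas as pd

# Same toy strings as the full example above.
s = pd.Series(["a,b", "b,c,d"])

# str.split turns each string into a list; explode then emits one row
# per list element and repeats the original index label for each of them.
exploded = s.str.split(",").explode()

print(exploded.index.tolist())  # [0, 0, 1, 1, 1]
print(exploded.tolist())        # ['a', 'b', 'b', 'c', 'd']
```

Because explode keeps the original index, there is no extra level to drop afterwards.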
Answered By: xNok

Building upon @xNok's answer, I am adding the additional step needed to put the result back into the original DataFrame.

We have this data:

arrays = [['x', 'y'], ['a|b', 'b|c|d']]
midx = pd.MultiIndex.from_arrays(arrays, names=['left', 'right'])
df = pd.DataFrame(index=midx, data=[2, -1], columns=['value'])
df

Out[17]:
            value
left right
x    a|b        2
y    b|c|d     -1

First, let’s generate the values for the right index level, as @xNok suggested. Take the index level we want to work on with index.get_level_values(1), convert it to a Series so that we can call str.split(), and finally stack() it to get the result we want. (index.levels[1] would also work on this data, but it holds only the unique, sorted level values rather than one value per row.)

new_multi_idx_val = df.index.get_level_values(1).to_series().str.split('|').apply(pd.Series).stack()
new_multi_idx_val

Out[18]:
right
a|b    0    a
       1    b
b|c|d  0    b
       1    c
       2    d
dtype: object

Now we want to put these values into the original DataFrame df. To do that, let’s change its shape so that the result we generated in the previous step can be copied in.

To do that, we repeat each row (including its index) once per value in the right level, that is, the number of | separators plus one. df.index.get_level_values(1).to_series().str.split('|').apply(lambda x: len(x)) gives the number of times each row should be repeated. We pass this to index.repeat() and fetch the rows at those index labels to create a new DataFrame, df_repeted.

df_repeted = df.loc[df.index.repeat(df.index.get_level_values(1).to_series().str.split('|').apply(lambda x: len(x)))]
df_repeted

Out[19]:
            value
left right
x    a|b        2
     a|b        2
y    b|c|d     -1
     b|c|d     -1
     b|c|d     -1
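
As a hedged sketch of the same repeat step, the counts can be computed directly from the separators and the rows selected by position instead of by label. The names counts, positions and repeated are illustrative and do not appear in the answer; the data is the example DataFrame from above.

```python
import numpy as np
import pandas as pd

midx = pd.MultiIndex.from_arrays(
    [["x", "y"], ["a|b", "b|c|d"]], names=["left", "right"]
)
df = pd.DataFrame({"value": [2, -1]}, index=midx)

# One count per row: number of "|" separators plus one.
# The pattern is a regular expression, so the pipe is escaped.
counts = df.index.get_level_values("right").str.count(r"\|").to_numpy() + 1

# Repeat each row position by its count, then select rows positionally.
positions = np.arange(len(df)).repeat(counts)
repeated = df.iloc[positions]

print(repeated["value"].tolist())  # [2, 2, -1, -1, -1]
```

Selecting with iloc avoids label lookups, which may matter when the same index entry occurs in more than one row.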

The df_repeted DataFrame now has the right shape, so we can change its index to get the answer we want.

Replace the index of df_repeted with the desired values as follows:

df_repeted.index = [df_repeted.index.droplevel(1), new_multi_idx_val]
df_repeted.index.rename(names=['left', 'right'], inplace=True)
df_repeted

Out[20]:
            value
left right
x    a          2
     b          2
y    b         -1
     c         -1
     d         -1
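
For comparison, the whole transformation can be sketched in one chain with DataFrame.explode (pandas 0.25 or later). This is an alternative to the steps above, not the method either answer describes; it assumes the right level can be moved to a column temporarily, and the name out is illustrative.

```python
import pandas as pd

midx = pd.MultiIndex.from_arrays(
    [["x", "y"], ["a|b", "b|c|d"]], names=["left", "right"]
)
df = pd.DataFrame({"value": [2, -1]}, index=midx)

out = (
    df.reset_index()                                        # index levels -> columns
      .assign(right=lambda d: d["right"].str.split("|"))    # strings -> lists
      .explode("right")                                     # one row per list element
      .set_index(["left", "right"])                         # rebuild the MultiIndex
)

print(out.index.tolist())
# [('x', 'a'), ('x', 'b'), ('y', 'b'), ('y', 'c'), ('y', 'd')]
print(out["value"].tolist())
# [2, 2, -1, -1, -1]
```

Every other column is duplicated along with value, so the chain does not need to know how many columns the frame has.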
Answered By: InvisibleWolf