Replace specific values in multiindex dataframe

Question:

I have a multindex dataframe with 3 index levels and 2 numerical columns.

A   1   2017-04-01  14.0    87.346878
        2017-06-01  4.0     87.347504
    2   2014-08-01  1.0     123.110001
        2015-01-01  4.0     209.612503
B   3   2014-07-01  1.0     68.540001
        2014-12-01  1.0     64.370003
    4   2015-01-01  3.0     75.000000

I want to replace the values in first row of 3rd index level wherever a new second level index begins.
For ex: every first row

(A,1,2017-04-01)->0.0   0.0 
(A,2,2014-08-01)->0.0   0.0  
(B,3,2014-07-01)->0.0   0.0  
(B,4,2015-01-01)->0.0   0.0

The dataframe is too big and doing it datframe by dataframe like df.xs('A,1')...df.xs(A,2) gets time consuming. Is there some way where i can get a mask and replace with new values in these positions ?

Asked By: Siddhant Tandon

||

Answers:

Use DataFrame.reset_index on level=2, then use DataFrame.groupby on level=[0, 1] and aggregate level_2 using first, then using pd.MultiIndex.from_arrays create a multilevel index, finally use this multilevel index to change the values in dataframe:

idx = df.reset_index(level=2).groupby(level=[0, 1])['level_2'].first()
idx = pd.MultiIndex.from_arrays(idx.reset_index().to_numpy().T)
df.loc[idx, :] = 0

Result:

# print(df)
               col1        col2
A 1 2017-04-01  0.0    0.000000
    2017-06-01  4.0   87.347504
  2 2014-08-01  0.0    0.000000
    2015-01-01  4.0  209.612503
B 3 2014-07-01  0.0    0.000000
    2014-12-01  1.0   64.370003
  4 2015-01-01  0.0    0.000000
Answered By: Shubham Sharma

We can extract a series of the second-level index with:

df.index.get_level_values(1)
# output: Int64Index([1, 1, 2, 2, 3, 3, 4], dtype='int64')

And check where it changes with:

idx = df.index.get_level_values(1)
np.where(idx != np.roll(idx, 1))[0]
# output: array([0, 2, 4, 6])

So we can simply use the returned value of the second statement with iloc to get the first row of every second-level index and modify their values like this:

idx = df.index.get_level_values(1)
df.iloc[np.where(idx != np.roll(idx, 1))[0]] = 0

output:

                  value1      value2
A 1 2017-04-01       0.0    0.000000
    2017-06-01       4.0   87.347504
  2 2014-08-01       0.0    0.000000
    2015-01-01       4.0  209.612503
B 3 2014-07-01       0.0    0.000000
    2014-12-01       1.0   64.370003
  4 2015-01-01       0.0    0.000000
Answered By: jfaccioni

You can use the grouper indices in a simple iloc:

df.iloc[[a[0] for a in df.groupby(level=[0, 1]).indices.values()]] = 0

Example:

df = pd.DataFrame({'col1': [14., 4., 1., 4., 1., 1., 3.],
                   'col2': [ 87.346878, 87.347504, 123.110001, 209.612503, 68.540001, 64.370003, 75.]},
                   index = pd.MultiIndex.from_tuples(([('A', 1, '2017-04-01'), ('A', 1, '2017-06-01'),
                                                       ('A', 2, '2014-08-01'), ('A', 2, '2015-01-01'),
                                                       ('B', 3, '2014-07-01'), ('B', 3, '2014-12-01'),
                                                       ('B', 4, '2015-01-01')])))

Result:

                col1        col2
A 1 2017-04-01   0.0    0.000000
    2017-06-01   4.0   87.347504
  2 2014-08-01   0.0    0.000000
    2015-01-01   4.0  209.612503
B 3 2014-07-01   0.0    0.000000
    2014-12-01   1.0   64.370003
  4 2015-01-01   0.0    0.000000

Timings:

%%timeit
idx = df.reset_index(level=2).groupby(level=[0, 1])['level_2'].first()
idx = pd.MultiIndex.from_arrays(idx.reset_index().to_numpy().T)
df.loc[idx, :] = 0
#6.7 ms ± 40 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
df.iloc[[a[0] for a in df.groupby(level=[0, 1]).indices.values()]] = 0
#897 µs ± 6.99 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

So this is about 7 times faster than the accepted answer

Answered By: Stef

I think you can use something like this:

import pandas as pd
import numpy as np
arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
   ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
tuples = list(zip(*arrays))

df = pd.DataFrame([['A', 'B'], ['bar', 'two'],
                   ['foo', 'one'], ['foo', 'two']],
                 columns=['first', 'second'])
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8, 4), index=arrays)
df

You can create a list of unique values from your index. Then get the index position, to replace on your column the row value coincidence with the row value.

lst = ['bar','foo', 'qux']
ls = []
for i in lst:
    base = df.index.get_loc(i)
    a = base.indices(len(df))
    a = a[0]
    ls.append(a)
    
    for ii in ls:
    #print(ii)
        df[0][ii] = 0

df

Fortunately, this can help you.

Cheers!