Replacing the first occurrence of a value in each row of a pandas DataFrame
Question:
I have a DataFrame like this:
col1
col2
col3
col4
5
7
12
9
0
9
9
1
9
9
1
1
10
5
2
9
9
3
0
18
Each row has at least one 9 and for each row, I want to replace the first instance of it with 90.
Currently, I’m doing:
out = df.mask(df.eq(9) & df.apply(lambda x: ~x.duplicated(), axis=1), 90)
Is there any better/faster way than this?
Expected output:
col1 col2 col3 col4
0 5 7 12 90
1 0 90 9 1
2 90 9 1 1
3 10 5 2 90
4 90 3 0 18
Constructor:
data = {'col1': [5, 0, 9, 10, 9],
'col2': [7, 9, 9, 5, 3],
'col3': [12, 9, 1, 2, 0],
'col4': [9, 1, 1, 9, 18]}
df = pd.DataFrame(data)
Answers:
One way using idxmax
:
s = df.eq(9).idxmax(axis=1)
s = s.apply(df.columns.get_loc)
df.values[s.index, s.values] = 90
Output:
col1 col2 col3 col4
0 5 7 12 90
1 0 90 9 1
2 90 9 1 1
3 10 5 2 90
4 90 3 0 18
This is about 2.5x faster than the original code:
%timeit df.mask(df.eq(9) & df.apply(lambda x: ~x.duplicated(), axis=1), 90)
# 2.59 ms ± 80.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
s = df.eq(9).idxmax(axis=1)
s = s.apply(df.columns.get_loc)
df.copy().values[s.index, s.values] = 90 # Note the copy is to keep the df same over the `timeit`
# 1.07 ms ± 31.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
You may check with numpy
assign
df.values[df.index, np.argmax(df.values==9,1)] = 90
df
Out[56]:
col1 col2 col3 col4
0 5 7 12 90
1 0 90 9 1
2 90 9 1 1
3 10 5 2 90
4 90 3 0 18
Here is another way:
df.mask(df.eq(9)).T.fillna(90,limit=1).T.fillna(9)
Update:
Here is a way using rank()
df.mask(df.eq(9).rank(axis=1,method='first',ascending=False).eq(1),90)
And here is a way using idxmax()
s = df.set_axis(range(df.shape[1]),axis=1).eq(9).idxmax(axis=1)
df.to_numpy()[range(df.shape[0]),s] = 90
or
df.mask(df.eq(9).cumsum(axis=1).eq(1) & df.eq(9),90)
Output:
col1 col2 col3 col4
0 5.0 7.0 12.0 90.0
1 0.0 90.0 9.0 1.0
2 90.0 9.0 1.0 1.0
3 10.0 5.0 2.0 90.0
4 90.0 3.0 0.0 18.0
You can use eq
+ cummax
+ ^
:
x = df.eq(9).cummax(axis=1)
df[x ^ x.shift(axis=1)] = 90
# One-liner:
df[df.eq(9).cummax(axis=1).pipe(lambda x: x ^ x.shift(axis=1))] = 90
Output:
>>> df
col1 col2 col3 col4
0 5 7 12 90
1 0 90 9 1
2 90 9 1 1
3 10 5 2 90
4 90 3 0 18
I have a DataFrame like this:
col1 | col2 | col3 | col4 |
---|---|---|---|
5 | 7 | 12 | 9 |
0 | 9 | 9 | 1 |
9 | 9 | 1 | 1 |
10 | 5 | 2 | 9 |
9 | 3 | 0 | 18 |
Each row has at least one 9 and for each row, I want to replace the first instance of it with 90.
Currently, I’m doing:
out = df.mask(df.eq(9) & df.apply(lambda x: ~x.duplicated(), axis=1), 90)
Is there any better/faster way than this?
Expected output:
col1 col2 col3 col4
0 5 7 12 90
1 0 90 9 1
2 90 9 1 1
3 10 5 2 90
4 90 3 0 18
Constructor:
data = {'col1': [5, 0, 9, 10, 9],
'col2': [7, 9, 9, 5, 3],
'col3': [12, 9, 1, 2, 0],
'col4': [9, 1, 1, 9, 18]}
df = pd.DataFrame(data)
One way using idxmax
:
s = df.eq(9).idxmax(axis=1)
s = s.apply(df.columns.get_loc)
df.values[s.index, s.values] = 90
Output:
col1 col2 col3 col4
0 5 7 12 90
1 0 90 9 1
2 90 9 1 1
3 10 5 2 90
4 90 3 0 18
This is about 2.5x faster than the original code:
%timeit df.mask(df.eq(9) & df.apply(lambda x: ~x.duplicated(), axis=1), 90)
# 2.59 ms ± 80.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
s = df.eq(9).idxmax(axis=1)
s = s.apply(df.columns.get_loc)
df.copy().values[s.index, s.values] = 90 # Note the copy is to keep the df same over the `timeit`
# 1.07 ms ± 31.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
You may check with numpy
assign
df.values[df.index, np.argmax(df.values==9,1)] = 90
df
Out[56]:
col1 col2 col3 col4
0 5 7 12 90
1 0 90 9 1
2 90 9 1 1
3 10 5 2 90
4 90 3 0 18
Here is another way:
df.mask(df.eq(9)).T.fillna(90,limit=1).T.fillna(9)
Update:
Here is a way using rank()
df.mask(df.eq(9).rank(axis=1,method='first',ascending=False).eq(1),90)
And here is a way using idxmax()
s = df.set_axis(range(df.shape[1]),axis=1).eq(9).idxmax(axis=1)
df.to_numpy()[range(df.shape[0]),s] = 90
or
df.mask(df.eq(9).cumsum(axis=1).eq(1) & df.eq(9),90)
Output:
col1 col2 col3 col4
0 5.0 7.0 12.0 90.0
1 0.0 90.0 9.0 1.0
2 90.0 9.0 1.0 1.0
3 10.0 5.0 2.0 90.0
4 90.0 3.0 0.0 18.0
You can use eq
+ cummax
+ ^
:
x = df.eq(9).cummax(axis=1)
df[x ^ x.shift(axis=1)] = 90
# One-liner:
df[df.eq(9).cummax(axis=1).pipe(lambda x: x ^ x.shift(axis=1))] = 90
Output:
>>> df
col1 col2 col3 col4
0 5 7 12 90
1 0 90 9 1
2 90 9 1 1
3 10 5 2 90
4 90 3 0 18