Find first value in dataframe's columns greater than another
Question:
I have been looking for the most efficient way to find the first value in all columns of a pandas df, from left to right (0,1,2,3), for each row, that is greater than another column (t), and put the corresponding column label in a new column (val). If no column value is greater, then I want 0 instead.
I could not find anything that is both simple and efficient, which matters because the real table is very large.
E.g.:
Initial table:
     t    0    1    2    3
JAN  3  1.9  2.1  2.6  2.9
FEB  6  2.0  4.0  5.0  9.0
MAR  2  1.0  3.0  4.0  4.0
APR  4  1.5  3.0  6.0  2.0
Final table:
     t    0    1    2    3  val
JAN  3  1.9  2.1  2.6  2.9    0
FEB  6  2.0  4.0  5.0  9.0    3
MAR  2  1.0  3.0  4.0  4.0    1
APR  4  1.5  3.0  6.0  2.0    2
Answers:
from io import StringIO
import numpy as np
import pandas as pd
s = """
t 0 1 2 3
JAN 3 1.9 2.1 2.6 2.9
FEB 6 2.0 4.0 5.0 9.0
MAR 2 1.0 3.0 4.0 4.0
APR 4 1.5 3.0 6.0 2.0
"""
# Read in the string, splitting on any run of whitespace
df = pd.read_csv(StringIO(s), sep=r"\s+")
# Mark every cell greater than the row's threshold 't' and replace it with
# its column label; the 't' column itself can never match, so it maps to NaN
labels = np.where(df.gt(df['t'], axis=0), [np.nan, 0, 1, 2, 3], np.nan)
# The smallest label in each row is the first match from the left; rows
# with no match are all NaN and are filled with 0
df['vals'] = pd.DataFrame(labels).min(axis=1).fillna(0).astype(int).values
# Which yields your expected result
print(df)
#      t    0    1    2    3  vals
# JAN  3  1.9  2.1  2.6  2.9     0
# FEB  6  2.0  4.0  5.0  9.0     3
# MAR  2  1.0  3.0  4.0  4.0     1
# APR  4  1.5  3.0  6.0  2.0     2
I based this on the answer to a similar problem here, which suggested that this technique is faster than a few other options.
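For comparison, here is a minimal sketch of a different way to get the same result, using a boolean mask and `argmax` instead of `np.where` plus `min`. It is not part of the original answer and has not been timed against it. The names `value_cols`, `mask`, `first_pos` and `col_labels` are illustrative, and the sketch assumes the column labels are integer-like strings, as they are after `read_csv`.

```python
import numpy as np
import pandas as pd

# Same sample data as in the question
df = pd.DataFrame(
    {
        "t": [3, 6, 2, 4],
        "0": [1.9, 2.0, 1.0, 1.5],
        "1": [2.1, 4.0, 3.0, 3.0],
        "2": [2.6, 5.0, 4.0, 6.0],
        "3": [2.9, 9.0, 4.0, 2.0],
    },
    index=["JAN", "FEB", "MAR", "APR"],
)

# Columns to scan, in left-to-right order (assumed names)
value_cols = ["0", "1", "2", "3"]

# True where a cell is greater than that row's threshold 't'
mask = df[value_cols].gt(df["t"], axis=0).to_numpy()

# argmax gives the position of the first True in each row; it also
# returns 0 for rows with no True at all, so those rows are handled
# separately with any()
first_pos = mask.argmax(axis=1)
col_labels = np.array(value_cols).astype(int)
df["val"] = np.where(mask.any(axis=1), col_labels[first_pos], 0)

print(df)
```

As in the question's specification, a result of 0 is ambiguous: it can mean either "column 0 was the first match" or "no column matched". If that distinction matters, a sentinel such as -1 could be used in place of the final 0.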