Get column name where value is something in pandas dataframe
Question:
I’m trying to find, at each timestamp, the column name in a dataframe for which the value matches with the one in a timeseries at the same timestamp.
Here is my dataframe:
>>> df
col5 col4 col3 col2 col1
1979-01-01 00:00:00 1181.220328 912.154923 648.848635 390.986156 138.185861
1979-01-01 06:00:00 1190.724461 920.767974 657.099560 399.395338 147.761352
1979-01-01 12:00:00 1193.414510 918.121482 648.558837 384.632475 126.254342
1979-01-01 18:00:00 1171.670276 897.585930 629.201469 366.652033 109.545607
1979-01-02 00:00:00 1168.892579 900.375126 638.377583 382.584568 132.998706
>>> df.to_dict()
{'col4': {<Timestamp: 1979-01-01 06:00:00>: 920.76797370744271, <Timestamp: 1979-01-01 00:00:00>: 912.15492332839756, <Timestamp: 1979-01-01 18:00:00>: 897.58592995700656, <Timestamp: 1979-01-01 12:00:00>: 918.1214819496729}, 'col5': {<Timestamp: 1979-01-01 06:00:00>: 1190.7244605667831, <Timestamp: 1979-01-01 00:00:00>: 1181.2203275146587, <Timestamp: 1979-01-01 18:00:00>: 1171.6702763228691, <Timestamp: 1979-01-01 12:00:00>: 1193.4145103184442}, 'col2': {<Timestamp: 1979-01-01 06:00:00>: 399.39533771666561, <Timestamp: 1979-01-01 00:00:00>: 390.98615646597591, <Timestamp: 1979-01-01 18:00:00>: 366.65203285812231, <Timestamp: 1979-01-01 12:00:00>: 384.63247469269874}, 'col3': {<Timestamp: 1979-01-01 06:00:00>: 657.09956023625466, <Timestamp: 1979-01-01 00:00:00>: 648.84863460462293, <Timestamp: 1979-01-01 18:00:00>: 629.20146872682449, <Timestamp: 1979-01-01 12:00:00>: 648.55883747413225}, 'col1': {<Timestamp: 1979-01-01 06:00:00>: 147.7613518219286, <Timestamp: 1979-01-01 00:00:00>: 138.18586102094068, <Timestamp: 1979-01-01 18:00:00>: 109.54560722575859, <Timestamp: 1979-01-01 12:00:00>: 126.25434189361377}}
And the time series with values I want to match at each timestamp:
>>> ts
1979-01-01 00:00:00 1181.220328
1979-01-01 06:00:00 657.099560
1979-01-01 12:00:00 126.254342
1979-01-01 18:00:00 109.545607
Freq: 6H
>>> ts.to_dict()
{<Timestamp: 1979-01-01 06:00:00>: 657.09956023625466, <Timestamp: 1979-01-01 00:00:00>: 1181.2203275146587, <Timestamp: 1979-01-01 18:00:00>: 109.54560722575859, <Timestamp: 1979-01-01 12:00:00>: 126.25434189361377}
Then the result would be:
>>> df_result
value Column
1979-01-01 00:00:00 1181.220328 col5
1979-01-01 06:00:00 657.099560 col3
1979-01-01 12:00:00 126.254342 col1
1979-01-01 18:00:00 109.545607 col1
I hope my question is clear enough. Does anyone have an idea how to get df_result?
Thanks
Greg
Answers:
Here is one, perhaps inelegant, way to do it:
df_result = ts.to_frame('value')  # one-column frame built from ts
Set up a function which grabs the column name which contains the value (from ts):
def get_col_name(row):
    # .loc replaces the removed .ix indexer
    b = (df.loc[row.name] == row['value'])
    return b.index[b.argmax()]
For each row, this tests which elements equal the value and extracts the column name of a True entry.
And apply it (row-wise):
In [3]: df_result.apply(get_col_name, axis=1)
Out[3]:
1979-01-01 00:00:00 col5
1979-01-01 06:00:00 col3
1979-01-01 12:00:00 col1
1979-01-01 18:00:00 col1
i.e. use df_result['Column'] = df_result.apply(get_col_name, axis=1).
Note: there is quite a lot going on in get_col_name, so perhaps it warrants some further explanation:
In [4]: row = df_result.iloc[0] # an example row to pass to get_col_name
In [5]: row
Out[5]:
value 1181.220328
Name: 1979-01-01 00:00:00
In [6]: row.name # use to get rows of df
Out[6]: <Timestamp: 1979-01-01 00:00:00>
In [7]: df.loc[row.name]
Out[7]:
col5 1181.220328
col4 912.154923
col3 648.848635
col2 390.986156
col1 138.185861
Name: 1979-01-01 00:00:00
In [8]: b = (df.loc[row.name] == row['value'])
# checks whether each element equals row['value'] = 1181.220328
In [9]: b
Out[9]:
col5 True
col4 False
col3 False
col2 False
col1 False
Name: 1979-01-01 00:00:00
In [10]: b.argmax() # index of a True value
Out[10]: 0
In [11]: b.index[b.argmax()] # the index value (column name)
Out[11]: 'col5'
There may be a more efficient way to do this…
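For anyone who wants to run the row-wise approach end to end, here is a minimal sketch. It assumes the rounded values shown in the question's printout (not the full-precision ones from to_dict()), uses label-based .loc, and swaps b.index[b.argmax()] for the equivalent idxmax():

```python
import pandas as pd

# Values copied from the question as displayed there (rounded to 6 decimals).
index = pd.to_datetime([
    "1979-01-01 00:00", "1979-01-01 06:00", "1979-01-01 12:00",
    "1979-01-01 18:00", "1979-01-02 00:00",
])
df = pd.DataFrame(
    {
        "col5": [1181.220328, 1190.724461, 1193.414510, 1171.670276, 1168.892579],
        "col4": [912.154923, 920.767974, 918.121482, 897.585930, 900.375126],
        "col3": [648.848635, 657.099560, 648.558837, 629.201469, 638.377583],
        "col2": [390.986156, 399.395338, 384.632475, 366.652033, 382.584568],
        "col1": [138.185861, 147.761352, 126.254342, 109.545607, 132.998706],
    },
    index=index,
)
ts = pd.Series([1181.220328, 657.099560, 126.254342, 109.545607], index=index[:4])

df_result = ts.to_frame("value")


def get_col_name(row):
    # row.name is the timestamp, so this selects the matching row of df
    is_match = df.loc[row.name] == row["value"]
    # idxmax on a boolean Series returns the label of the first True
    return is_match.idxmax()


df_result["Column"] = df_result.apply(get_col_name, axis=1)
print(df_result)
```

Because the function is called once per row in Python, this is easy to follow but will be slow on large frames.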
Following on from Andy’s detailed answer, the solution to selecting the column name of the highest value per row can be simplified to a single line:
df['column'] = df.apply(lambda x: df.columns[x.argmax()], axis = 1)
Just wanted to add that for a situation where multiple columns may have the value and you want all the column names in a list, you can do the following (e.g. get all column names with a value = ‘x’):
df.apply(lambda row: row[row == 'x'].index, axis=1)
The idea is that you turn each row into a series (by adding axis=1) where the column names are now turned into the index of the series. You then filter your series with a condition (e.g. row == 'x'), then take the index values (aka column names!).
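A small sketch of that idea on a made-up frame (the columns a, b, c and their values are hypothetical, purely for illustration). Adding .tolist() turns each Index of labels into a plain Python list:

```python
import pandas as pd

# Hypothetical toy frame in which 'x' can appear in several columns of a row.
df = pd.DataFrame(
    {"a": ["x", "y", "x"], "b": ["x", "x", "z"], "c": ["y", "x", "z"]}
)

# For each row, keep the entries equal to 'x' and collect their labels.
matching_cols = df.apply(lambda row: row[row == "x"].index.tolist(), axis=1)
print(matching_cols)
```

Rows with no match end up with an empty list, which is often easier to handle downstream than a missing value.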
I was trying to create a new column to indicate which existing column has the biggest value for a row. This gave me the desired string column label:
df['column_with_biggest_value'] = df.idxmax(axis=1)
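As a quick check that the two "biggest value per row" one-liners agree, here is a sketch on a hypothetical frame (the numbers are made up). Note that on a tie, both return the first column that holds the maximum:

```python
import pandas as pd

# Hypothetical numbers; the last row has a tie between col1 and col2.
df = pd.DataFrame(
    {"col1": [1.0, 5.0, 2.0], "col2": [3.0, 4.0, 2.0], "col3": [2.0, 6.0, 1.0]}
)

via_idxmax = df.idxmax(axis=1)
via_apply = df.apply(lambda x: df.columns[x.argmax()], axis=1)

print(via_idxmax.tolist())
print(via_apply.tolist())
```

Compute the labels before assigning them back to df; otherwise the new string column would take part in the next idxmax call.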
Use df.eq() for ~300x speedup over df.apply()
The other answers are fine but very slow compared to the vectorized df.eq():
df.loc[ts.index].eq(ts, axis=0).idxmax(axis=1)
# 1979-01-01 00:00:00 col5
# 1979-01-01 06:00:00 col3
# 1979-01-01 12:00:00 col1
# 1979-01-01 18:00:00 col1
# dtype: object
- loc[ts.index] returns the df rows that match the ts timestamps
- eq(ts, axis=0) compares each ts value to one row (axis=0) of df
  - eq(ts.to_numpy()[:, None]) would be the numpy broadcasting equivalent (recent pandas no longer allows ts[:, None] directly on a Series)
- idxmax(axis=1) returns the first matching column (axis=1) in each row
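The three steps can be sketched one at a time on the question's data, assuming the rounded values from the printout, and then assembled into the df_result layout the question asks for:

```python
import pandas as pd

# Values copied from the question as displayed there (rounded to 6 decimals).
index = pd.to_datetime([
    "1979-01-01 00:00", "1979-01-01 06:00", "1979-01-01 12:00",
    "1979-01-01 18:00", "1979-01-02 00:00",
])
df = pd.DataFrame(
    {
        "col5": [1181.220328, 1190.724461, 1193.414510, 1171.670276, 1168.892579],
        "col4": [912.154923, 920.767974, 918.121482, 897.585930, 900.375126],
        "col3": [648.848635, 657.099560, 648.558837, 629.201469, 638.377583],
        "col2": [390.986156, 399.395338, 384.632475, 366.652033, 382.584568],
        "col1": [138.185861, 147.761352, 126.254342, 109.545607, 132.998706],
    },
    index=index,
)
ts = pd.Series([1181.220328, 657.099560, 126.254342, 109.545607], index=index[:4])

rows = df.loc[ts.index]           # only the timestamps present in ts
is_match = rows.eq(ts, axis=0)    # compare every column with ts, aligned on the index
column = is_match.idxmax(axis=1)  # label of the first True in each row

df_result = pd.DataFrame({"value": ts, "Column": column})
print(df_result)
```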
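Testing data:
import numpy as np
import pandas as pd
n = 10_000  # number of rows; the original does not state n, so this value is an assumption
index = pd.date_range('2000-01-01', periods=n, freq='min')  # 'min' replaces the deprecated '1T' alias
df = pd.DataFrame(np.random.random(size=(n, 5)), index=index).add_prefix('col')
ts = df.apply(np.random.choice, axis=1).sample(frac=0.9)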
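To check on your own machine that the vectorized and row-wise versions return the same labels, and to get a rough feel for the timing gap, something like the following sketch may help. The size n = 2_000 and the seeded generator are my assumptions, chosen only to keep the run short and reproducible; the measured times will vary, so none are quoted here:

```python
import time

import numpy as np
import pandas as pd

n = 2_000  # assumed size, small enough that the row-wise version stays quick
rng = np.random.default_rng(0)

index = pd.date_range("2000-01-01", periods=n, freq="min")
df = pd.DataFrame(rng.random(size=(n, 5)), index=index).add_prefix("col")
# Pick one value per row at random, then keep 90% of the timestamps.
ts = df.apply(lambda row: rng.choice(row.to_numpy()), axis=1).sample(
    frac=0.9, random_state=0
)

start = time.perf_counter()
vectorized = df.loc[ts.index].eq(ts, axis=0).idxmax(axis=1)
t_vectorized = time.perf_counter() - start

start = time.perf_counter()
row_wise = ts.to_frame("value").apply(
    lambda row: (df.loc[row.name] == row["value"]).idxmax(), axis=1
)
t_row_wise = time.perf_counter() - start

same = bool((vectorized == row_wise).all())
print(f"same result: {same}")
print(f"eq: {t_vectorized:.4f}s, apply: {t_row_wise:.4f}s")
```

A single perf_counter measurement is noisy; for a real comparison, timeit with several repetitions would be the better tool.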
Use np.isclose() for safer float comparison
Unless you have a specific reason to test strict equality, floats should be compared with a tolerance, e.g., using isclose():
- Use isclose() to compare df with ts, where [:, None] stretches ts to the same size as df:
close = np.isclose(df.loc[ts.index], ts.to_numpy()[:, None])  # to_numpy() because recent pandas no longer allows ts[:, None]
# array([[ True, False, False, False, False],
# [False, False, True, False, False],
# [False, False, False, False, True],
# [False, False, False, False, True]])
- Then, as before, use idxmax(axis=1) to extract the first matching column per row:
pd.DataFrame(close, index=ts.index, columns=df.columns).idxmax(axis=1)
# 1979-01-01 00:00:00 col5
# 1979-01-01 06:00:00 col3
# 1979-01-01 12:00:00 col1
# 1979-01-01 18:00:00 col1
# dtype: object
Using isclose() will be just as fast as eq() (and thus much faster than df.apply()).
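To see why the tolerance matters, here is a small sketch on hypothetical data where a stored value is the result of 0.1 + 0.2 and the series holds 0.3. Strict equality misses the match, while isclose() with its default tolerances finds it:

```python
import numpy as np
import pandas as pd

# Hypothetical data: 0.1 + 0.2 is stored as 0.30000000000000004, not 0.3.
index = pd.to_datetime(["2000-01-01", "2000-01-02"])
df = pd.DataFrame({"col1": [0.1 + 0.2, 1.0], "col2": [0.5, 2.0]}, index=index)
ts = pd.Series([0.3, 2.0], index=index)

# Strict equality: the first row has no match at all.
exact = df.loc[ts.index].eq(ts, axis=0)

# Tolerant comparison: the column vector broadcasts across the columns of df.
close = pd.DataFrame(
    np.isclose(df.loc[ts.index].to_numpy(), ts.to_numpy()[:, None]),
    index=ts.index,
    columns=df.columns,
)

print(exact.any(axis=1).tolist())
print(close.idxmax(axis=1).tolist())
```

The rtol and atol arguments of np.isclose() can be tightened or loosened if the default tolerances do not suit the scale of your data.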
Note that if you have more complex joining conditions, use df.merge(), df.join(), or df.reindex(). For OP’s question, these are overkill but would look something like this:
df.merge(ts.rename('ts'), left_index=True, right_index=True)
df.join(ts.rename('ts'), how='right')
df.reindex(ts.index)
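One case where reindex() may be worth it is when ts contains timestamps that df lacks: loc[ts.index] raises a KeyError there, whereas reindex() fills the missing rows with NaN. A minimal sketch on hypothetical data follows. Since idxmax(axis=1) reports the first column for a row with no True at all, the result is masked with any(axis=1):

```python
import pandas as pd

# Hypothetical data: ts has one timestamp (2000-01-03) that df does not contain.
df = pd.DataFrame(
    {"col1": [1.0, 2.0], "col2": [3.0, 4.0]},
    index=pd.to_datetime(["2000-01-01", "2000-01-02"]),
)
ts = pd.Series(
    [3.0, 2.0, 9.0],
    index=pd.to_datetime(["2000-01-01", "2000-01-02", "2000-01-03"]),
)

aligned = df.reindex(ts.index)     # the missing timestamp becomes a NaN row
is_match = aligned.eq(ts, axis=0)  # NaN never compares equal, so that row is all False

# Keep the label only where at least one column really matched.
result = is_match.idxmax(axis=1).where(is_match.any(axis=1))
print(result)
```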