Get column name where value is something in pandas dataframe

Question:

I’m trying to find, at each timestamp, the column name in a dataframe whose value matches the one in a time series at the same timestamp.

Here is my dataframe:

>>> df
                            col5        col4        col3        col2        col1
1979-01-01 00:00:00  1181.220328  912.154923  648.848635  390.986156  138.185861
1979-01-01 06:00:00  1190.724461  920.767974  657.099560  399.395338  147.761352
1979-01-01 12:00:00  1193.414510  918.121482  648.558837  384.632475  126.254342
1979-01-01 18:00:00  1171.670276  897.585930  629.201469  366.652033  109.545607
1979-01-02 00:00:00  1168.892579  900.375126  638.377583  382.584568  132.998706

>>> df.to_dict()
{'col4': {<Timestamp: 1979-01-01 06:00:00>: 920.76797370744271,
          <Timestamp: 1979-01-01 00:00:00>: 912.15492332839756,
          <Timestamp: 1979-01-01 18:00:00>: 897.58592995700656,
          <Timestamp: 1979-01-01 12:00:00>: 918.1214819496729},
 'col5': {<Timestamp: 1979-01-01 06:00:00>: 1190.7244605667831,
          <Timestamp: 1979-01-01 00:00:00>: 1181.2203275146587,
          <Timestamp: 1979-01-01 18:00:00>: 1171.6702763228691,
          <Timestamp: 1979-01-01 12:00:00>: 1193.4145103184442},
 'col2': {<Timestamp: 1979-01-01 06:00:00>: 399.39533771666561,
          <Timestamp: 1979-01-01 00:00:00>: 390.98615646597591,
          <Timestamp: 1979-01-01 18:00:00>: 366.65203285812231,
          <Timestamp: 1979-01-01 12:00:00>: 384.63247469269874},
 'col3': {<Timestamp: 1979-01-01 06:00:00>: 657.09956023625466,
          <Timestamp: 1979-01-01 00:00:00>: 648.84863460462293,
          <Timestamp: 1979-01-01 18:00:00>: 629.20146872682449,
          <Timestamp: 1979-01-01 12:00:00>: 648.55883747413225},
 'col1': {<Timestamp: 1979-01-01 06:00:00>: 147.7613518219286,
          <Timestamp: 1979-01-01 00:00:00>: 138.18586102094068,
          <Timestamp: 1979-01-01 18:00:00>: 109.54560722575859,
          <Timestamp: 1979-01-01 12:00:00>: 126.25434189361377}}

And the time series with values I want to match at each timestamp:

>>> ts
1979-01-01 00:00:00    1181.220328
1979-01-01 06:00:00    657.099560
1979-01-01 12:00:00    126.254342
1979-01-01 18:00:00    109.545607
Freq: 6H

>>> ts.to_dict()
{<Timestamp: 1979-01-01 06:00:00>: 657.09956023625466, <Timestamp: 1979-01-01 00:00:00>: 1181.2203275146587, <Timestamp: 1979-01-01 18:00:00>: 109.54560722575859, <Timestamp: 1979-01-01 12:00:00>: 126.25434189361377}
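
For reference, df and ts can be rebuilt from the (rounded) reprs above; since both use the same rounded values, the equality-based snippets in the answers below run against them as-is:

import pandas as pd

idx = pd.date_range('1979-01-01', periods=5, freq='6h')
df = pd.DataFrame({
    'col5': [1181.220328, 1190.724461, 1193.414510, 1171.670276, 1168.892579],
    'col4': [912.154923, 920.767974, 918.121482, 897.585930, 900.375126],
    'col3': [648.848635, 657.099560, 648.558837, 629.201469, 638.377583],
    'col2': [390.986156, 399.395338, 384.632475, 366.652033, 382.584568],
    'col1': [138.185861, 147.761352, 126.254342, 109.545607, 132.998706],
}, index=idx)
ts = pd.Series([1181.220328, 657.099560, 126.254342, 109.545607], index=idx[:4])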

Then the result would be:

>>> df_result
                           value Column
1979-01-01 00:00:00  1181.220328   col5
1979-01-01 06:00:00   657.099560   col3
1979-01-01 12:00:00   126.254342   col1
1979-01-01 18:00:00   109.545607   col1

I hope my question is clear enough. Does anyone have an idea how to get df_result?

Thanks

Greg

Asked By: leroygr


Answers:

Here is one, perhaps inelegant, way to do it:

df_result = pd.DataFrame(ts, columns=['value'])

Set up a function that grabs the column name containing the value (from ts):

def get_col_name(row):
    b = (df.loc[row.name] == row['value'])
    return b.index[b.argmax()]

For each row, this tests which elements equal the value, then extracts the column name where the comparison is True.

And apply it (row-wise):

In [3]: df_result.apply(get_col_name, axis=1)
Out[3]: 
1979-01-01 00:00:00    col5
1979-01-01 06:00:00    col3
1979-01-01 12:00:00    col1
1979-01-01 18:00:00    col1

i.e. use df_result['Column'] = df_result.apply(get_col_name, axis=1).


Note: there is quite a lot going on in get_col_name, so perhaps it warrants some further explanation:

In [4]: row = df_result.iloc[0] # an example row to pass to get_col_name

In [5]: row
Out[5]: 
value    1181.220328
Name: 1979-01-01 00:00:00

In [6]: row.name # use to get rows of df
Out[6]: <Timestamp: 1979-01-01 00:00:00>

In [7]: df.loc[row.name]
Out[7]: 
col5    1181.220328
col4     912.154923
col3     648.848635
col2     390.986156
col1     138.185861
Name: 1979-01-01 00:00:00

In [8]: b = (df.loc[row.name] == row['value'])
        # checks whether each element equals row['value'] = 1181.220328

In [9]: b
Out[9]: 
col5     True
col4    False
col3    False
col2    False
col1    False
Name: 1979-01-01 00:00:00

In [10]: b.argmax() # position of the first True value
Out[10]: 0

In [11]: b.index[b.argmax()] # the index value (column name)
Out[11]: 'col5'

There may well be a more efficient way to do this…
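
For instance, Series.idxmax() returns the index label of the first maximal element, so the last two steps of get_col_name collapse into a single call:

def get_col_name(row):
    # for a boolean Series, idxmax() gives the label of the first True
    return (df.loc[row.name] == row['value']).idxmax()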

Answered By: Andy Hayden

Following on from Andy’s detailed answer, selecting the column name of the highest value per row can be simplified to a single line:

df['column'] = df.apply(lambda x: df.columns[x.argmax()], axis=1)
Answered By: Mike

Just wanted to add that, for a situation where multiple columns may contain the value and you want all the column names in a list, you can do the following (e.g. get all column names whose value is 'x'):

df.apply(lambda row: row[row == 'x'].index, axis=1)

The idea is that you turn each row into a series (by adding axis=1) where the column names are now turned into the index of the series. You then filter your series with a condition (e.g. row == 'x'), then take the index values (aka column names!).
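
Note that this returns an Index object per row; chain .tolist() if you want actual lists. A quick sketch on a hypothetical toy frame (columns a and b are made up for illustration):

toy = pd.DataFrame({'a': ['x', 'y'], 'b': ['x', 'x']})
toy.apply(lambda row: row[row == 'x'].index.tolist(), axis=1)

# 0    [a, b]
# 1       [b]
# dtype: object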

Answered By: Nic Scozzaro

I was trying to create a new column to indicate which existing column has the biggest value for a row. This gave me the desired string column label:

df['column_with_biggest_value'] = df.idxmax(axis=1)
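
With the sample frame from the question, col5 holds the largest value in every row, so this yields:

df.idxmax(axis=1)

# 1979-01-01 00:00:00    col5
# 1979-01-01 06:00:00    col5
# 1979-01-01 12:00:00    col5
# 1979-01-01 18:00:00    col5
# 1979-01-02 00:00:00    col5
# dtype: object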
Answered By: Zhanwen Chen

Use df.eq() for ~300x speedup over df.apply()

The other answers are fine but very slow compared to the vectorized df.eq():

df.loc[ts.index].eq(ts, axis=0).idxmax(axis=1)

# 1979-01-01 00:00:00    col5
# 1979-01-01 06:00:00    col3
# 1979-01-01 12:00:00    col1
# 1979-01-01 18:00:00    col1
# dtype: object

[Timing plot: vectorized df.eq() vs df.apply()]

Testing data:
n = 100_000  # example size
index = pd.date_range('2000-01-01', periods=n, freq='1min')
df = pd.DataFrame(np.random.random(size=(n, 5)), index=index).add_prefix('col')
ts = df.apply(np.random.choice, axis=1).sample(frac=0.9)
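
A sketch for reproducing the comparison in IPython (using get_col_name from the first answer; absolute timings vary by machine):

%timeit df.loc[ts.index].eq(ts, axis=0).idxmax(axis=1)    # vectorized
%timeit ts.to_frame('value').apply(get_col_name, axis=1)  # row-wise apply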


Use np.isclose() for safer float comparison

Unless you have a specific reason to test strict equality, floats should be compared with a tolerance, e.g., using isclose():

  • Use isclose() to compare df with ts, where ts.to_numpy()[:, None] reshapes ts into a column vector that broadcasts to the same shape as df:

    close = np.isclose(df.loc[ts.index], ts.to_numpy()[:, None])
    
    # array([[ True, False, False, False, False],
    #        [False, False,  True, False, False],
    #        [False, False, False, False,  True],
    #        [False, False, False, False,  True]])
    
  • Then, as before, use idxmax(axis=1) to extract the first matching column per row:

    pd.DataFrame(close, index=ts.index, columns=df.columns).idxmax(axis=1)
    
    # 1979-01-01 00:00:00    col5
    # 1979-01-01 06:00:00    col3
    # 1979-01-01 12:00:00    col1
    # 1979-01-01 18:00:00    col1
    # dtype: object
    

Using isclose() is just as fast as eq() (and thus much faster than df.apply()):

[Timing plot: vectorized eq() vs isclose()]


Note that if you have more complex joining conditions, use df.merge(), df.join(), or df.reindex(). For OP’s question, these are overkill but would look something like the one-liners below (a fuller merge sketch follows the list):

  • df.merge(ts.rename('ts'), left_index=True, right_index=True)
  • df.join(ts.rename('ts'), how='right')
  • df.reindex(ts.index)
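
For example, a sketch of the merge route, recovering the matching column with eq() as before:

merged = df.merge(ts.rename('ts'), left_index=True, right_index=True)
merged.drop(columns='ts').eq(merged['ts'], axis=0).idxmax(axis=1)

# 1979-01-01 00:00:00    col5
# 1979-01-01 06:00:00    col3
# 1979-01-01 12:00:00    col1
# 1979-01-01 18:00:00    col1
# dtype: object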
Answered By: tdy