Pandas – select column using other column value as column name

Question:

I have a dataframe that contains a column, let’s call it "names". "names" has the name of other columns. I would like to add a new column that would have for each row the value based on the column name contained on that "names" column.

Example:

Input dataframe:
pd.DataFrame.from_dict({"a": [1, 2, 3,4], "b": [-1,-2,-3,-4], "names":['a','b','a','b']})

  a  |  b  | names |
 --- | --- | ----  |
  1  |  -1 | 'a'   |
  2  |  -2 | 'b'   |
  3  |  -3 | 'a'   |
  4  |  -4 | 'b'   |

Output dataframe:
pd.DataFrame.from_dict({"a": [1, 2, 3,4], "b": [-1,-2,-3,-4], "names":['a','b','a','b'], "new_col":[1,-2,3,-4]})

  a  |  b  | names | new_col | 
 --- | --- | ----  | ------  |
  1  |  -1 | 'a'   |    1    |
  2  |  -2 | 'b'   |   -2    |
  3  |  -3 | 'a'   |    3    |
  4  |  -4 | 'b'   |   -4    |
Asked By: ab3

||

Answers:

You can use lookup:

df['new_col'] = df.lookup(df.index, df.names)
df
#   a    b  names   new_col
#0  1   -1      a   1
#1  2   -2      b   -2
#2  3   -3      a   3
#3  4   -4      b   -4

EDIT

lookup has been deprecated, here’s the currently recommended solution:

idx, cols = pd.factorize(df['names'])
df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
Answered By: Psidom

Because DataFrame.lookup is deprecated as of Pandas 1.2.0, the following is what I came up with using DataFrame.melt:

df['new_col'] = df.melt(id_vars='names', value_vars=['a', 'b'], ignore_index=False).query('names == variable').loc[df.index, 'value']

Output:

>>> df
   a  b names  new_col
0  1 -1     a        1
1  2 -2     b       -2
2  3 -3     a        3
3  4 -4     b       -4

Can this be simplified? For correctness, the index must not be ignored.

Additional reference:

Answered By: Asclepius

Solution using pd.factorize (from a pandas issue):

idx, cols = pd.factorize(df['names'])
df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
Answered By: ignoring_gravity

With the straightforward and easy solution (lookup) deprecated, another alternative to the pandas-based ones proposed here is to convert df into a numpy array and use numpy indexing:

df['new_col'] = df.values[df.index.get_indexer(df['names'].index), df.columns.get_indexer(df['names'])]

Let me explain what this does. df.values is a numpy array based on the DataFrame. As numpy arrays have to be indexed numerically, we need to use the get_indexer function to convert the pandas row and column index names to index numbers that can be used with numpy:

>>> df.index.get_indexer(df['names'].index)
array([0, 1, 2, 3], dtype=int64)

>>> df.columns.get_indexer(df['names'])
array([0, 1, 0, 1], dtype=int64)

(In this case, where the row index is already numerical, you could get away with simply using df.index as the first argument inside the bracket, but this does not work generally.)

Here’s a short solution using df.melt and df.merge:

df.merge(df.melt(var_name='names', ignore_index=False), on=[None, 'names'])

Outputs:

   key_0  a  b names value
0      0  1 -1     a     1
1      1  2 -2     b    -2
2      2  3 -3     a     3
3      3  4 -4     b    -4

There’s a redundant key_0 column which you need to drop with df.drop.

Answered By: Leif Metcalf
Categories: questions Tags: ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.