Pandas – select column using other column value as column name
Question:
I have a dataframe that contains a column, let’s call it "names". "names" has the name of other columns. I would like to add a new column that would have for each row the value based on the column name contained on that "names" column.
Example:
Input dataframe:
pd.DataFrame.from_dict({"a": [1, 2, 3,4], "b": [-1,-2,-3,-4], "names":['a','b','a','b']})
a | b | names |
--- | --- | ---- |
1 | -1 | 'a' |
2 | -2 | 'b' |
3 | -3 | 'a' |
4 | -4 | 'b' |
Output dataframe:
pd.DataFrame.from_dict({"a": [1, 2, 3,4], "b": [-1,-2,-3,-4], "names":['a','b','a','b'], "new_col":[1,-2,3,-4]})
a | b | names | new_col |
--- | --- | ---- | ------ |
1 | -1 | 'a' | 1 |
2 | -2 | 'b' | -2 |
3 | -3 | 'a' | 3 |
4 | -4 | 'b' | -4 |
Answers:
You can use lookup
:
df['new_col'] = df.lookup(df.index, df.names)
df
# a b names new_col
#0 1 -1 a 1
#1 2 -2 b -2
#2 3 -3 a 3
#3 4 -4 b -4
EDIT
lookup
has been deprecated, here’s the currently recommended solution:
idx, cols = pd.factorize(df['names'])
df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
Because DataFrame.lookup
is deprecated as of Pandas 1.2.0, the following is what I came up with using DataFrame.melt
:
df['new_col'] = df.melt(id_vars='names', value_vars=['a', 'b'], ignore_index=False).query('names == variable').loc[df.index, 'value']
Output:
>>> df
a b names new_col
0 1 -1 a 1
1 2 -2 b -2
2 3 -3 a 3
3 4 -4 b -4
Can this be simplified? For correctness, the index must not be ignored.
Additional reference:
Solution using pd.factorize
(from a pandas issue):
idx, cols = pd.factorize(df['names'])
df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
With the straightforward and easy solution (lookup
) deprecated, another alternative to the pandas-based ones proposed here is to convert df
into a numpy array and use numpy indexing:
df['new_col'] = df.values[df.index.get_indexer(df['names'].index), df.columns.get_indexer(df['names'])]
Let me explain what this does. df.values
is a numpy array based on the DataFrame. As numpy arrays have to be indexed numerically, we need to use the get_indexer
function to convert the pandas row and column index names to index numbers that can be used with numpy:
>>> df.index.get_indexer(df['names'].index)
array([0, 1, 2, 3], dtype=int64)
>>> df.columns.get_indexer(df['names'])
array([0, 1, 0, 1], dtype=int64)
(In this case, where the row index is already numerical, you could get away with simply using df.index
as the first argument inside the bracket, but this does not work generally.)
I have a dataframe that contains a column, let’s call it "names". "names" has the name of other columns. I would like to add a new column that would have for each row the value based on the column name contained on that "names" column.
Example:
Input dataframe:
pd.DataFrame.from_dict({"a": [1, 2, 3,4], "b": [-1,-2,-3,-4], "names":['a','b','a','b']})
a | b | names | --- | --- | ---- | 1 | -1 | 'a' | 2 | -2 | 'b' | 3 | -3 | 'a' | 4 | -4 | 'b' |
Output dataframe:
pd.DataFrame.from_dict({"a": [1, 2, 3,4], "b": [-1,-2,-3,-4], "names":['a','b','a','b'], "new_col":[1,-2,3,-4]})
a | b | names | new_col | --- | --- | ---- | ------ | 1 | -1 | 'a' | 1 | 2 | -2 | 'b' | -2 | 3 | -3 | 'a' | 3 | 4 | -4 | 'b' | -4 |
You can use lookup
:
df['new_col'] = df.lookup(df.index, df.names)
df
# a b names new_col
#0 1 -1 a 1
#1 2 -2 b -2
#2 3 -3 a 3
#3 4 -4 b -4
EDIT
lookup
has been deprecated, here’s the currently recommended solution:
idx, cols = pd.factorize(df['names'])
df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
Because DataFrame.lookup
is deprecated as of Pandas 1.2.0, the following is what I came up with using DataFrame.melt
:
df['new_col'] = df.melt(id_vars='names', value_vars=['a', 'b'], ignore_index=False).query('names == variable').loc[df.index, 'value']
Output:
>>> df
a b names new_col
0 1 -1 a 1
1 2 -2 b -2
2 3 -3 a 3
3 4 -4 b -4
Can this be simplified? For correctness, the index must not be ignored.
Additional reference:
Solution using pd.factorize
(from a pandas issue):
idx, cols = pd.factorize(df['names'])
df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
With the straightforward and easy solution (lookup
) deprecated, another alternative to the pandas-based ones proposed here is to convert df
into a numpy array and use numpy indexing:
df['new_col'] = df.values[df.index.get_indexer(df['names'].index), df.columns.get_indexer(df['names'])]
Let me explain what this does. df.values
is a numpy array based on the DataFrame. As numpy arrays have to be indexed numerically, we need to use the get_indexer
function to convert the pandas row and column index names to index numbers that can be used with numpy:
>>> df.index.get_indexer(df['names'].index)
array([0, 1, 2, 3], dtype=int64)
>>> df.columns.get_indexer(df['names'])
array([0, 1, 0, 1], dtype=int64)
(In this case, where the row index is already numerical, you could get away with simply using df.index
as the first argument inside the bracket, but this does not work generally.)