How can I select a specific column from each row in a Pandas DataFrame?

Question:

I have a DataFrame in this format:

    a   b   c
0   1   2   3
1   4   5   6
2   7   8   9
3   10  11  12
4   13  14  15

and an array like this, with column names:

['a', 'a', 'b', 'c', 'b']

and I’d like to extract an array of data, one value from each row. The array of column names specifies which column to take from each row. Here, the result would be:

[1, 4, 8, 12, 14]

Is this possible as a single command with Pandas, or do I need to iterate? I tried using indexing:

i = pd.Index(['a', 'a', 'b', 'c', 'b'])
i.choose(df)

but I got a segfault, which I couldn’t diagnose because the documentation is lacking.

Asked By: gggritso

Answers:

You can always use a list comprehension:

[df.loc[idx, col] for idx, col in enumerate(['a', 'a', 'b', 'c', 'b'])]
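
Note that enumerate yields positions 0, 1, 2, ..., which only match the row labels when the DataFrame has the default integer index. A minimal sketch for an arbitrary index, pairing the real row labels with the column names via zip (the small frame and the names list below are made up for illustration):

```python
import pandas as pd

# Hypothetical frame with a non-default (string) index.
df = pd.DataFrame(
    {"a": [1, 4, 7], "b": [2, 5, 8], "c": [3, 6, 9]},
    index=["x", "y", "z"],
)
names = ["a", "b", "c"]  # one column name per row, in row order

# Pair each actual row label with its column name; .at does a scalar lookup.
values = [df.at[row, col] for row, col in zip(df.index, names)]
print(values)
```

This is still an explicit Python-level loop, so it is best suited to small frames.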
Answered By: Gregor

For large datasets, you can index the underlying NumPy array directly, provided you convert your column names into integer positions (simple in this case):

df.values[np.arange(5), [0, 0, 1, 2, 1]]

out: array([ 1,  4,  8, 12, 14])

This will be much more efficient than list comprehensions or other explicit iteration.
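
If the column positions are not known in advance, they can be derived from the names. The following is a sketch rather than part of the original answer: it assumes the example frame from the question and uses Index.get_indexer, which returns -1 for labels that are not columns (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"a": [1, 4, 7, 10, 13], "b": [2, 5, 8, 11, 14], "c": [3, 6, 9, 12, 15]}
)
names = ["a", "a", "b", "c", "b"]

# Translate column labels into integer positions; unknown labels map to -1.
col_pos = df.columns.get_indexer(names)
if (col_pos == -1).any():
    raise KeyError("some entries in names are not columns of df")

# Pair row position i with column position col_pos[i].
picked = df.to_numpy()[np.arange(len(df)), col_pos]
print(picked)
```

The explicit check matters because -1 is a valid NumPy index and would otherwise silently select the last column.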

Answered By: mdurant

You could use lookup, e.g.

>>> i = pd.Series(['a', 'a', 'b', 'c', 'b'])
>>> df.lookup(i.index, i.values)
array([ 1,  4,  8, 12, 14])

where i.index could be different from range(len(i)) if you wanted.

Answered By: DSM

As MorningGlory stated in the comments, lookup was deprecated in version 1.2.0 (and it has since been removed in pandas 2.0).

The documentation states that the same can be achieved using melt and loc, but I didn’t find that very obvious, so here is how it works.

First, use melt to create a look-up DataFrame:

i = pd.Series(["a", "a", "b", "c", "b"], name="col")
melted = pd.melt(
    pd.concat([i, df], axis=1),
    id_vars="col",
    value_vars=df.columns,
    ignore_index=False,
)

  col variable  value
0   a        a      1
1   a        a      4
2   b        a      7
3   c        a     10
4   b        a     13
0   a        b      2
1   a        b      5
2   b        b      8
3   c        b     11
4   b        b     14
0   a        c      3
1   a        c      6
2   b        c      9
3   c        c     12
4   b        c     15

Then, use loc to only get relevant values:

result = melted.loc[melted["col"] == melted["variable"], "value"]

0     1
1     4
2     8
4    14
3    12
Name: value, dtype: int64

Finally – if needed – to get the same index order as before:

result.loc[df.index]

0     1
1     4
2     8
3    12
4    14
Name: value, dtype: int64

The pandas documentation also gives a different solution using factorize and NumPy indexing (note that the first line rebinds df to include the col column):

df = pd.concat([i, df], axis=1)
idx, cols = pd.factorize(df['col'])
df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]

[ 1  4  8 12 14]
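
To avoid overwriting df, the same factorize idea can be wrapped in a small helper. This is only a sketch: pick_per_row is a hypothetical name, not a pandas function, and it assumes names has one entry per row, in row order.

```python
import numpy as np
import pandas as pd

def pick_per_row(frame, names):
    """Return one value per row of frame, taken from the column named in names."""
    # codes[i] is the position of names[i] within uniques.
    codes, uniques = pd.factorize(pd.Series(list(names)))
    # Reorder the columns to match uniques so that codes are valid column positions.
    block = frame.reindex(columns=uniques).to_numpy()
    return block[np.arange(len(frame)), codes]

df = pd.DataFrame(
    {"a": [1, 4, 7, 10, 13], "b": [2, 5, 8, 11, 14], "c": [3, 6, 9, 12, 15]}
)
result = pick_per_row(df, ["a", "a", "b", "c", "b"])
print(result)
```

One caveat: reindex fills columns that do not exist with NaN instead of raising, so a misspelled name yields NaN in the result.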
Answered By: spettekaka