How can I select a specific column from each row in a Pandas DataFrame?
Question:
I have a DataFrame in this format:
    a   b   c
0   1   2   3
1   4   5   6
2   7   8   9
3  10  11  12
4  13  14  15
and an array like this, with column names:
['a', 'a', 'b', 'c', 'b']
and I’m hoping to extract an array of data, one value from each row. The array of column names specifies which column I want from each row. Here, the result would be:
[1, 4, 8, 12, 14]
Is this possible as a single command with Pandas, or do I need to iterate? I tried using indexing
i = pd.Index(['a', 'a', 'b', 'c', 'b'])
i.choose(df)
but I got a segfault, which I couldn’t diagnose because the documentation is lacking.
Answers:
You can always use a list comprehension:
[df.loc[idx, col] for idx, col in enumerate(['a', 'a', 'b', 'c', 'b'])]
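Note that the comprehension above works because the index happens to be 0..4: enumerate yields positions, while .loc looks up labels. A minimal sketch that pairs each index label with its column name instead, assuming the list of names has one entry per row (the names `cols` and `values` are only illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [1, 4, 7, 10, 13], "b": [2, 5, 8, 11, 14], "c": [3, 6, 9, 12, 15]}
)
cols = ["a", "a", "b", "c", "b"]

# Pair every row label with the column wanted for that row,
# so the lookup still works when the index is not 0..n-1.
values = [df.at[idx, col] for idx, col in zip(df.index, cols)]
print(values)
```

This is still a Python-level loop, so it is best suited to small frames.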
For large datasets, you can use indexing on the base numpy data, if you’re prepared to transform your column names into a numerical index (simple in this case):
df.values[np.arange(5), [0, 0, 1, 2, 1]]
out: array([ 1, 4, 8, 12, 14])
This will be much more efficient than list comprehensions or other explicit iteration.
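If you would rather not write the numerical column index by hand, one possible way to derive it is Index.get_indexer, which maps labels to integer positions. This is a sketch under the assumption that the column labels are unique; the variable names are made up for the example:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"a": [1, 4, 7, 10, 13], "b": [2, 5, 8, 11, 14], "c": [3, 6, 9, 12, 15]}
)
cols = ["a", "a", "b", "c", "b"]

# Translate column labels into integer positions.
# get_indexer returns -1 for any label that is not a column.
col_pos = df.columns.get_indexer(cols)
if (col_pos == -1).any():
    raise KeyError("cols contains a name that is not a column of df")

# Pair row position i with column position col_pos[i].
row_pos = np.arange(len(df))
picked = df.to_numpy()[row_pos, col_pos]
print(picked)
```

Because the selection is positional, it does not depend on what the row index labels are.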
You could use lookup, e.g.
>>> i = pd.Series(['a', 'a', 'b', 'c', 'b'])
>>> df.lookup(i.index, i.values)
array([ 1, 4, 8, 12, 14])
where i.index could be different from range(len(i)) if you wanted.
As MorningGlory stated in the comments, lookup has been deprecated in version 1.2.0 (and it was removed entirely in pandas 2.0).
The documentation states that the same can be achieved using melt and loc, but I didn’t think it was very obvious, so here it goes.
First, use melt to create a look-up DataFrame:
i = pd.Series(["a", "a", "b", "c", "b"], name="col")
melted = pd.melt(
    pd.concat([i, df], axis=1),
    id_vars="col",
    value_vars=df.columns,
    ignore_index=False,
)
  col variable  value
0   a        a      1
1   a        a      4
2   b        a      7
3   c        a     10
4   b        a     13
0   a        b      2
1   a        b      5
2   b        b      8
3   c        b     11
4   b        b     14
0   a        c      3
1   a        c      6
2   b        c      9
3   c        c     12
4   b        c     15
Then, use loc to keep only the relevant values:
result = melted.loc[melted["col"] == melted["variable"], "value"]
0 1
1 4
2 8
4 14
3 12
Name: value, dtype: int64
Finally – if needed – to get the same index order as before:
result.loc[df.index]
0 1
1 4
2 8
3 12
4 14
Name: value, dtype: int64
Pandas also provides a different solution in the documentation using factorize and numpy indexing:
df = pd.concat([i, df], axis=1)
idx, cols = pd.factorize(df['col'])
df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
[ 1 4 8 12 14]
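For reuse, the factorize approach can be wrapped in a small function that leaves the original DataFrame untouched. This is only a sketch: the function name `pick_per_row` and its parameters are hypothetical, and it assumes `names` has one entry per row.

```python
import numpy as np
import pandas as pd


def pick_per_row(frame, names):
    """Return one value per row of `frame`, taken from the column named in `names`."""
    # codes[i] is the position of names[i] within `uniques`.
    codes, uniques = pd.factorize(pd.Series(names))
    # Reorder the columns to match `uniques`, so that codes index them directly.
    # A name that is not a column would show up here as an all-NaN column.
    block = frame.reindex(uniques, axis=1).to_numpy()
    return block[np.arange(len(frame)), codes]


df = pd.DataFrame(
    {"a": [1, 4, 7, 10, 13], "b": [2, 5, 8, 11, 14], "c": [3, 6, 9, 12, 15]}
)
result = pick_per_row(df, ["a", "a", "b", "c", "b"])
print(result)
```

Unlike the snippet above, this does not overwrite df with the concatenated frame, so the column of names never becomes part of the data.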