Selecting different columns by row for pandas dataframe

Question:

The closest thing I could find for this question is this.

What I’m trying to do is very similar, except I’m basically trying to extract one matrix from another. To use the same example from that link:

    a   b   c
0   1   2   3
1   4   5   6
2   7   8   9
3   10  11  12
4   13  14  15

Given the above, my extraction matrix would look like:

[ ['a', 'a'],
['a', 'a'],
['a', 'b'],
['c', 'a'],
['b', 'b'] ]

The expected result would be either the following pd.DataFrame or np.array:

[ [1, 1],
[4, 4],
[7, 8],
[12, 10],
[14, 14] ]

I feel like this is probably a common manipulation; I just don’t know how to do it here. I want to rule out DataFrame.iterrows because my parent matrix is really long and really wide, and iterrows is remarkably slow on even a fraction of the matrix. I have a decent amount of memory, so I’d like to lean on that a little bit if I can.

Asked By: John Rouhana


Answers:

With a simple list comprehension and a df.loc call:

# assuming idx is your extraction matrix
res = [df.loc[i, c].tolist() for i, c in enumerate(idx)]

[[1, 1], [4, 4], [7, 8], [12, 10], [14, 14]]
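
For reference, here is a self-contained version of the snippet above. It is only a sketch: it assumes df is the example frame from the question with its default integer index, and that idx is the extraction matrix as a list of lists (both are rebuilt here for illustration).

```python
import pandas as pd

# Example frame from the question (default RangeIndex 0..4).
df = pd.DataFrame({'a': [1, 4, 7, 10, 13],
                   'b': [2, 5, 8, 11, 14],
                   'c': [3, 6, 9, 12, 15]})

# Extraction matrix: one list of column labels per row of df.
idx = [['a', 'a'], ['a', 'a'], ['a', 'b'], ['c', 'a'], ['b', 'b']]

# enumerate() yields positions 0..4; these match the row labels only
# because df uses the default RangeIndex.
res = [df.loc[i, c].tolist() for i, c in enumerate(idx)]
print(res)
```

Note that this still runs one df.loc lookup per row in Python, so it may be slow on a very long frame.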
Answered By: RomanPerekhrest

A bit more complex than the other answer, but I feel like it’s what you are looking for:

extraction_matrix = [ ['a', 'a'],
                      ['a', 'a'],
                      ['a', 'b'],
                      ['c', 'a'],
                      ['b', 'b'] ]

numeric_extraction_matrix = [[list(df.columns).index(col) for col in row] 
                             for row in extraction_matrix]
df.values[np.array([range(5), range(5)]).transpose(), numeric_extraction_matrix]

you will get:

array([[ 1,  1],
       [ 4,  4],
       [ 7,  8],
       [12, 10],
       [14, 14]])
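
The same idea can be sketched without hard-coding the number of rows. This version assumes the column labels are unique and that every label in the extraction matrix exists in df; it uses Index.get_indexer for the label lookup and NumPy broadcasting for the row positions. The names col_idx and row_idx are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 4, 7, 10, 13],
                   'b': [2, 5, 8, 11, 14],
                   'c': [3, 6, 9, 12, 15]})
extraction_matrix = np.array([['a', 'a'],
                              ['a', 'a'],
                              ['a', 'b'],
                              ['c', 'a'],
                              ['b', 'b']])

# Translate every label to its column position in one call.
# get_indexer returns -1 for labels that are not in df.columns.
col_idx = df.columns.get_indexer(extraction_matrix.ravel())
if (col_idx < 0).any():
    raise KeyError("extraction matrix contains unknown column labels")
col_idx = col_idx.reshape(extraction_matrix.shape)

# Column vector of row positions; it broadcasts against col_idx,
# so element (i, j) of the result is values[i, col_idx[i, j]].
row_idx = np.arange(len(df))[:, None]
result = df.to_numpy()[row_idx, col_idx]
print(result)
```

Because the lookup and the indexing are both vectorized, no Python-level loop over rows is needed.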
Answered By: zaki98

Try this:

# Create dataset
data = [{'a': 1, 'b': 2, 'c': 3},
        {'a': 4, 'b': 5, 'c': 6},
        {'a': 7, 'b': 8, 'c': 9},
        {'a': 10, 'b': 11, 'c': 12},
        {'a': 13, 'b': 14, 'c': 15}]
df = pd.DataFrame(data)

# Create an additional matrix used for data extraction
extr_matrix = [['a', 'a'],
               ['a', 'a'],
               ['a', 'b'],
               ['c', 'a'],
               ['b', 'b']]

# Map column names to numbers for data extraction
cols_numeric, _ = pd.factorize(df.columns)
cols_mapper = dict(zip(df.columns.tolist(), cols_numeric))

# Replace values in extract_matrix with column's numeric index
extract_matrix_as_numeric = pd.DataFrame(extr_matrix).replace(cols_mapper)

# Extract data using extract_matrix's row and column combination
extract_rows = extract_matrix_as_numeric.index
extract_cols = [extract_matrix_as_numeric[i]
                for i in extract_matrix_as_numeric.columns]
result = df.values[[extract_rows, extract_rows], extract_cols].T
print(result)
>>>
array([[ 1,  1],
       [ 4,  4],
       [ 7,  8],
       [12, 10],
       [14, 14]], dtype=int64)

Or try this:

# Create dataset
np.random.seed(2)
data = np.random.rand(5, 3)
pd.set_option('display.float_format', lambda x: '%.8f' % x)
df = pd.DataFrame(data, columns=[*'abc'])
print(df)
>>>
           a          b          c
0 0.43599490 0.02592623 0.54966248
1 0.43532239 0.42036780 0.33033482
2 0.20464863 0.61927097 0.29965467
3 0.26682728 0.62113383 0.52914209
4 0.13457995 0.51357812 0.18443987

# extr_matrix is the extraction matrix defined in the first example
cols_mapper = {col: df.columns.get_loc(col) for col in df.columns}
extract_cols = [[cols_mapper[col] for col in row] for row in extr_matrix]
extract_matrix_length = len(extr_matrix)
# Pair each row position with itself: [(0, 0), (1, 1), ...]
extract_rows = [*zip(*[range(extract_matrix_length)] * 2)]
result = df.values[extract_rows, extract_cols]
print(result)
>>>
[[0.4359949  0.4359949 ]
 [0.43532239 0.43532239]
 [0.20464863 0.61927097]
 [0.52914209 0.26682728]
 [0.51357812 0.51357812]]
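
As a possible alternative, the row index array can be avoided altogether with np.take_along_axis, which picks values along one axis using an index array of matching shape. This is a sketch under the same assumptions as above (unique column labels, every label present); values is an illustrative name.

```python
import numpy as np
import pandas as pd

np.random.seed(2)
df = pd.DataFrame(np.random.rand(5, 3), columns=list('abc'))
extr_matrix = [['a', 'a'],
               ['a', 'a'],
               ['a', 'b'],
               ['c', 'a'],
               ['b', 'b']]

# Map each column label to its integer position.
cols_mapper = {col: df.columns.get_loc(col) for col in df.columns}
extract_cols = np.array([[cols_mapper[col] for col in row]
                         for row in extr_matrix])

# For each row i, take the columns listed in extract_cols[i].
values = df.to_numpy()
result = np.take_along_axis(values, extract_cols, axis=1)
print(result)
```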
Answered By: ziying35

A list comprehension – reusing @ziying35 data:

# Create dataset
data = [{'a': 1, 'b': 2, 'c': 3},
        {'a': 4, 'b': 5, 'c': 6},
        {'a': 7, 'b': 8, 'c': 9},
        {'a': 10, 'b': 11, 'c': 12},
        {'a': 13, 'b': 14, 'c': 15}]
df = pd.DataFrame(data)

# Create an additional matrix used for data extraction
matrix = [['a', 'a'],
          ['a', 'a'],
          ['a', 'b'],
          ['c', 'a'],
          ['b', 'b']]
mat = np.array(matrix)
out = [df.reindex(columns=mat[:, n]).to_numpy().diagonal()
       for n in range(mat.shape[-1])]

np.column_stack(out)
array([[ 1,  1],
       [ 4,  4],
       [ 7,  8],
       [12, 10],
       [14, 14]])
Answered By: sammywemmy