Having trouble interpreting a Numpy question

Question:

Here’s the question and the example given:

You are given a 2-d array A of size NxN containing floating-point
numbers. The array represents pairwise correlation between N elements
with A[i,j] = A[j,i] = corr(i,j) and A[i,i] = 1.

Write a Python program using NumPy to find the index of the highest
correlated element for each element and finally print the sum of all
these indexes.

Example: The array A = [[1, 0.3, 0.4], [0.4, 1, 0.5], [0.1, 0.6, 1]]. Then, the indexes of the highest correlated elements for each element
are [3, 3, 2]. The sum of these indexes is 8.

I’m having trouble understanding the question, and the example makes my confusion worse. With each array inside A having only three values, and A itself containing only three arrays, how can any "index of the highest correlated element" be greater than 2 if NumPy is zero-indexed?

Does anyone understand the question?

Asked By: gooby


Answers:

To reiterate, the example is wrong in multiple ways.

Correlation matrices are by definition symmetric, yet the example is not:

array([[1. , 0.3, 0.4],
       [0.4, 1. , 0.5],
       [0.1, 0.6, 1. ]])
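
The asymmetry can also be confirmed programmatically. A minimal sketch, assuming the example matrix is stored in a variable named a (the name is_symmetric is only illustrative):

```python
import numpy as np

# The example matrix from the exercise.
a = np.array([[1, 0.3, 0.4],
              [0.4, 1, 0.5],
              [0.1, 0.6, 1]])

# A real correlation matrix equals its own transpose.
is_symmetric = np.allclose(a, a.T)
print(is_symmetric)  # False for this example
```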

Also, you are right: NumPy arrays (like everything else I know in Python that supports indexing) are zero-indexed. So the indexes in the example solution are off by one.

The exercise wants you to find, for each random variable with index i, the index j of the random variable it is most strongly correlated with, excluding itself (the correlation coefficient of 1 on the diagonal).

Here is one way to do that given your numpy array a:

np.where(a != 1, a, 0).argmax(axis=1)

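Here np.where produces an array identical to a, except that every entry equal to 1 is replaced with 0. This relies on the assumption that for i != j the correlation is always < 1 (and that each row has at least one positive off-diagonal entry, since otherwise the substituted 0 would win the argmax). If that does not hold, the result will be wrong.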

Then argmax gives the indices of the greatest values in each row. Although, in an actual correlation matrix, axis=0 would work just as well, since it would be… you know… symmetrical.

The result is array([2, 2, 1]). To get the sum, you just add a .sum() at the end.
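
Put together as a runnable sketch, using the example matrix from the question (the variable names idx, zero_based_sum and one_based_sum are my own). Adding 1 to each index reproduces the one-based numbers given in the exercise:

```python
import numpy as np

a = np.array([[1, 0.3, 0.4],
              [0.4, 1, 0.5],
              [0.1, 0.6, 1]])

# Replace every entry equal to 1 (here: the diagonal) with 0,
# then take the position of the largest value in each row.
idx = np.where(a != 1, a, 0).argmax(axis=1)

zero_based_sum = idx.sum()        # sum of NumPy-style indices
one_based_sum = (idx + 1).sum()   # sum if counting starts at 1, as in the exercise

print(idx)             # [2 2 1]
print(zero_based_sum)  # 5
print(one_based_sum)   # 8
```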

EDIT:

Now that I think about it, the assumption is too strong. Here is a better way:

b = a.copy()
np.fill_diagonal(b, -1)
b.argmax(axis=1)

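Now we only assume that actual correlations can never be < -1, which holds by definition of the correlation coefficient. (The one edge case is a row whose off-diagonal entries are all exactly -1, which would tie with the diagonal.) If you don’t care about mutating the original array, you could omit the copy and fill the diagonal of a with -1 instead.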
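
To see why masking the diagonal is more robust than masking by value, here is a small sketch with a made-up matrix c in which elements 0 and 1 are perfectly correlated. The helper name highest_correlated is hypothetical, and filling with -np.inf rather than -1 is my own variation, chosen so that the diagonal cannot tie with any finite correlation:

```python
import numpy as np

def highest_correlated(a):
    """Hypothetical helper: row-wise index of the largest off-diagonal entry."""
    b = np.array(a, dtype=float)   # float copy, so the input stays untouched
    np.fill_diagonal(b, -np.inf)   # the diagonal can never win the argmax
    return b.argmax(axis=1)

# Made-up data: elements 0 and 1 have a correlation of exactly 1.
c = np.array([[1.0, 1.0, 0.2],
              [1.0, 1.0, 0.3],
              [0.2, 0.3, 1.0]])

# Masking by value also wipes out the off-diagonal 1s.
print(np.where(c != 1, c, 0).argmax(axis=1))  # [2 2 1]

# Masking only the diagonal finds the perfectly correlated partner.
print(highest_correlated(c))                  # [1 0 1]
```

Because the helper works on a copy, the original array is left unchanged, at the cost of one extra N x N allocation.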

Answered By: Daniil Fajnberg