Any faster way to get the same results?

Question:

I have two given arrays, x and y. I want to calculate the correlation coefficient between the two arrays as follows:

import numpy as np
from scipy.stats import pearsonr

x = np.array([[[1,2,3,4],
               [5,6,7,8]],
              [[11,22,23,24],
               [25,26,27,28]]])


i,j,k = x.shape

y = np.array([[[31,32,33,34],
               [35,36,37,38]],
              [[41,42,43,44],
               [45,46,47,48]]])



xx = np.row_stack(np.dstack(x))
yy = np.row_stack(np.dstack(y))

results = []

for a, b in zip(xx,yy):
    r_sq, p_val = pearsonr(a, b)
    results.append(r_sq)

results = np.array(results).reshape(j,k)

print(results)

[[ 1.  1.  1.  1.]
 [ 1.  1.  1.  1.]]

The answer is correct. However, I would like to know if there are better and faster ways of doing it using numpy and/or scipy.

Asked By: Borys


Answers:

An alternate way (not necessarily better) is:

xx = x.reshape(2, -1).T  # same (8, 2) pairing as row_stack(dstack(x)), but faster
yy = y.reshape(2, -1).T
results = [pearsonr(a,b)[0] for a,b in zip(xx,yy)]
results = np.array(results).reshape(x.shape[1:])
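(For what it's worth, a quick self-contained check that the reshape produces the same (8, 2) pairing as the question's `row_stack`/`dstack` version; `np.vstack` is used here since `np.row_stack` is just an alias for it:)

```python
import numpy as np

x = np.arange(16).reshape(2, 2, 4)
xx_old = np.vstack(np.dstack(x))  # np.row_stack is an alias of np.vstack
xx_new = x.reshape(2, -1).T
print(np.array_equal(xx_old, xx_new))  # → True
```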

Another current thread discussed using list comprehensions to iterate over the values of an array: Confusion about numpy's apply along axis and list comprehensions

As discussed there, an alternative is to initialize results, and fill in values during the iteration. That’s probably faster for really large cases, but for modest ones, this

np.array([... for .. in ...]) 

is reasonable.
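A minimal sketch of that preallocate-and-fill pattern (with made-up random data, since any (N, n) rows work):

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(42)
xx = rng.normal(size=(8, 10))
yy = rng.normal(size=(8, 10))

# allocate the result array once, then fill it in place
results = np.empty(len(xx))
for i, (a, b) in enumerate(zip(xx, yy)):
    results[i] = pearsonr(a, b)[0]
```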

The deeper question is whether pearsonr, or some alternative, can calculate this correlation for many pairs, rather than just one pair. That may require studying the internals of pearsonr, or other functions in stats.

Here’s a first cut at vectorizing stats.pearsonr:

def pearsonr2(a, b):
    # stats.pearsonr adapted so that a and b are (N, n) arrays;
    # returns one correlation coefficient per row
    ma = a.mean(1)
    mb = b.mean(1)
    am, bm = a - ma[:, None], b - mb[:, None]
    r_num = np.add.reduce(am * bm, 1)
    # stats.ss (sum of squares) was removed from scipy, so inline it
    r_den = np.sqrt(np.add.reduce(am * am, 1) * np.add.reduce(bm * bm, 1))
    r = r_num / r_den
    r = np.clip(r, -1.0, 1.0)
    return r

print(pearsonr2(xx, yy))

It matches your case, though these test values don’t really exercise the function. I just took the pearsonr code, added the axis=1 parameter in most of the lines, and made sure everything ran. The prob step could be included with some boolean masking.
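For instance, stats.pearsonr computes the p-value from an incomplete beta function; vectorized, with a mask for the |r| == 1 edge case, that step might look like this (a sketch of that one step, not a drop-in replacement):

```python
import numpy as np
from scipy import special

def pearson_prob(r, n):
    # two-sided p-values for an array of r values, following the
    # betainc formula used inside stats.pearsonr; n is the sample size
    r = np.asarray(r, dtype=float)
    df = n - 2
    prob = np.zeros_like(r)   # |r| == 1 gives p = 0
    ok = np.abs(r) < 1.0      # the boolean masking step
    t_sq = r[ok] ** 2 * (df / ((1.0 - r[ok]) * (1.0 + r[ok])))
    prob[ok] = special.betainc(0.5 * df, 0.5, df / (df + t_sq))
    return prob
```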

(I can add the stats.pearsonr code to my answer if needed).


This version will take a and b of any dimension (as long as their shapes match), and do your pearsonr calc along the designated axis. No reshaping needed.

def pearsonr_flex(a, b, axis=1):
    # stats.pearsonr generalized: a and b are same-shape arrays;
    # the correlation is computed along the designated axis
    ma = a.mean(axis, keepdims=True)
    mb = b.mean(axis, keepdims=True)
    am, bm = a - ma, b - mb
    r_num = np.add.reduce(am * bm, axis)
    # stats.ss (sum of squares) was removed from scipy, so inline it
    r_den = np.sqrt(np.add.reduce(am * am, axis) * np.add.reduce(bm * bm, axis))
    r = r_num / r_den
    r = np.clip(r, -1.0, 1.0)
    return r

print(pearsonr_flex(xx, yy, 1))
print(pearsonr_flex(x, y, 0))
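A quick check of the vectorized algebra against the scalar routine, on random data that actually exercises it (the computation is repeated inline so the snippet stands alone):

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
a = rng.normal(size=(5, 20))
b = rng.normal(size=(5, 20))

# vectorized Pearson r along axis 1: same algebra as the flexible version
am = a - a.mean(1, keepdims=True)
bm = b - b.mean(1, keepdims=True)
r_vec = (am * bm).sum(1) / np.sqrt((am * am).sum(1) * (bm * bm).sum(1))

# reference: one stats.pearsonr call per row
r_loop = np.array([pearsonr(u, v)[0] for u, v in zip(a, b)])
print(np.allclose(r_vec, r_loop))  # → True
```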
Answered By: hpaulj