# Approximation of covariance for differently sized arrays

## Question:

Are there any common tools in NumPy/SciPy for computing a correlation measure that works even when the input variables are differently sized? In the standard formulation of covariance and correlation, one is required to have the same number of observations for each different variable under test. Typically, you must pass a matrix where each row is a different variable and each column represents a distinct observation.

In my case, I have 9 different variables, but for each variable the number of observations is not constant. Some variables have more observations than others. I know that there are fields like sensor fusion which study problems like this, so what standard tools are out there for computing relational statistics on data series of differing lengths (preferably in Python)?

## Answers:

I would examine this page:

http://docs.scipy.org/doc/numpy/reference/generated/numpy.ma.cov.html

UPDATE:

Suppose each row of your data matrix corresponds to a particular random variable, and the entries in the row are observations. What you have is a simple missing data problem, as long as you have a correspondence between the observations. That is to say, if one of your rows has only 10 entries, do these 10 entries (i.e., trials) correspond to 10 samples of the random variable in the first row? E.g., suppose you have two temperature sensors and they take samples at the same times, but one is faulty and sometimes misses a sample. Then you should treat the trials where the faulty sensor missed generating a reading as “missing data.” In your case, it’s as simple as creating two vectors in NumPy that are of the same length, putting zeros (or any value, really) in the positions of the shorter vector that correspond to the missing trials, and creating a mask matrix that indicates *where* your missing values exist in your data matrix.

Supplying such a matrix to the function linked to above should allow you to perform exactly the computation you want.
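To make that concrete, here is a small sketch of building such a masked matrix and passing it to `np.ma.cov`. The sensor readings and variable names are made up for illustration:

```python
import numpy as np

# Hypothetical readings from two temperature sensors sampled at the same
# five times; the second sensor missed trials 2 and 4 (values are made up).
good = np.array([20.1, 20.5, 21.0, 21.4, 21.9])
faulty = np.array([19.8, 0.0, 20.7, 0.0, 21.6])  # 0.0 stands in for "missing"

# True marks a missing value; the mask says *where* data is absent.
mask = [[False, False, False, False, False],
        [False, True,  False, True,  False]]

data = np.ma.masked_array([good, faulty], mask=mask)

# np.ma.cov skips the masked entries when forming each matrix element.
C = np.ma.cov(data)
print(C)
```

The placeholder values in `faulty` never enter the computation; only the mask matters.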

From a purely mathematical point of view, I believe they have to be the same size. To make them the same, you can apply some concepts related to the missing data problem. I guess what I am saying is that it is not strictly a covariance anymore if the vectors aren’t the same size. Whatever tool you use will just make up some points in some smart way to make the vectors of equal length.

“The issue is that each variable corresponds to the response on a survey, and not every survey taker answered every question. Thus, I want some measure of how an answer to question 2, say, affects likelihood of answers to question 8, for example.”

This is the missing data problem. I think what’s confusing people is that you keep referring to your samples as having different lengths. I think you might be visualizing them like this:

sample 1:

```
question number: [1,2,3,4,5]
response : [1,0,1,1,0]
```

sample 2:

```
question number: [2,4,5]
response : [1,1,0]
```

when sample 2 should be more like this:

```
question number: [ 1,2, 3,4,5]
response : [NaN,1,NaN,1,0]
```

It’s the question number, not the number of questions answered, that’s important. Without question-to-question correspondence it’s impossible to calculate anything like a covariance matrix.
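Assuming the padded, NaN-marked layout above, `np.ma.masked_invalid` is one way (a sketch, not the only option) to turn those NaNs into the mask that `np.ma.cov` expects:

```python
import numpy as np

# Sample 2 from above, padded to the full question list with NaN
# for the questions this survey taker skipped.
responses = np.array([np.nan, 1.0, np.nan, 1.0, 0.0])

# masked_invalid masks NaN (and inf) entries automatically.
masked = np.ma.masked_invalid(responses)
print(masked.count())  # number of questions actually answered
```

Stacking one such masked row per survey taker gives a matrix you can hand directly to `np.ma.cov`.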

Anyway, the `numpy.ma.cov` function that ddodev mentioned calculates the covariance by taking advantage of the fact that each element being summed depends on only two values.

So it calculates the terms it can. Then, when it comes to the step of dividing by n, it divides by the number of values that were actually calculated (for that particular covariance-matrix element) instead of by the total number of samples.

Here’s my take on the question. Strictly speaking, the formula for computing the covariance of two random variables, `Cov(X,Y) = E[XY] - E[X]E[Y]`, does not tell you anything about sample sizes or about how X and Y should form a random vector (i.e., the `x_i`’s and `y_i`’s do not explicitly come in pairs).

`E[X]` and `E[Y]` are computed the usual way, no matter that the numbers of observations for X and Y do not match. As for `E[XY]`, in the case of separately sampled X and Y, you can take it as meaning “the mean of all possible combinations of `x_i * y_j`”, in other words:

```
# NumPy code:
import numpy as np

X = ...  # your first data sample (1-D array)
Y = ...  # your second data sample (1-D array, possibly a different length)

# E[XY] as the mean over all len(X) * len(Y) products x_i * y_j:
E_XY = np.outer(X, Y).ravel().mean()
```
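A runnable version of that sketch, with made-up numbers. One consequence worth noting: with this all-pairs definition, `E[XY]` factors exactly into `E[X] * E[Y]`, so `E[XY] - E[X]E[Y]` comes out identically zero for unpaired samples, which echoes the earlier point that this is no longer strictly a covariance:

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0])  # made-up sample, length 3
Y = np.array([4.0, 6.0])       # made-up sample, length 2

# Mean over all 3 * 2 pairwise products x_i * y_j.
E_XY = np.outer(X, Y).ravel().mean()

# mean(outer(X, Y)) == (sum(X) * sum(Y)) / (len(X) * len(Y))
#                   == mean(X) * mean(Y),
# so the "covariance" computed this way is always exactly zero.
cov_unpaired = E_XY - X.mean() * Y.mean()
print(E_XY, cov_unpaired)
```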