# Homemade Pearson correlation implementation returns 0.999…2 when the same data set is passed as both arguments

## Question:

I was getting fed up with scipy and numpy, so I decided to write my own implementation, based on an SO answer somewhere.

``````
from statistics import pstdev, mean

def pearson(x, y):
    sx = []
    sy = []

    mx = mean(x)
    my = mean(y)

    # Population (not sample) standard deviations
    stdx = pstdev(x)
    stdy = pstdev(y)

    # Standardize each series
    for i in x:
        sx.append((i - mx) / stdx)

    for j in y:
        sy.append((j - my) / stdy)

    # Mean of the products of the standard scores
    return sum([i * j for i, j in zip(sx, sy)]) / len(x)
``````

I passed a few numbers into it to see if it was giving the same thing as `scipy.stats.pearsonr`, and it seemed to be fine. A number or so towards the end was different, but wasn’t anything groundbreaking…

Until I tried passing the same set of data to it as both `x` and `y`. When I did, I got back `0.9999999999999992`, whereas scipy and numpy both say it’s `1.0`.

Is there something wrong with this implementation? I’m using the population stdev instead of the sample one, and as far as I’m aware, both numpy and scipy use that. I’m wondering why this isn’t returning `1.0` as it should be. Could it be float issues in python itself? I’ve tried it in Py 2 and 3, and I’m getting `0.999...` in both.

If needed, the set of data I passed into it was:

`[7, 1, 5, 1, 8, 5, 9, 8, 5, 10, 5, 8, 1, 8, 8, 8, 10, 4, 8, 9, 9, 6, 8, 7, 8, 5, 10, 5, 6, 8, 8, 7, 9, 4, 6, 10, 7, 10, 4, 5, 4, 7, 4, 8, 9, 10, 9, 8, 7, 8, 6, 8, 6, 6, 5, 7, 7, 7, 7, 3, 7, 8, 6, 8, 5, 7, 8, 7, 8, 6, 8, 6, 9, 6, 6, 6, 8, 9, 5, 7, 9, 2, 9, 6, 7, 6, 7, 7, 5, 5, 7, 7, 8, 6, 9, 1, 3, 6, 7, 9, 7, 7, 6, 9, 9, 4, 9, 9, 7, 9, 6, 2, 2, 8, 4, 7, 7, 6, 3, 7, 3, 5, 10, 9, 8, 10, 8, 7, 4, 7, 8, 9, 8, 4, 7, 9, 7, 7, 6, 8, 8, 9, 9, 7, 4, 4, 7, 3, 9, 3, 1, 8, 3, 9, 4, 8, 3, 9, 8, 8, 7, 9, 9, 8, 10, 8, 3, 10, 4, 7, 7, 10, 8, 7, 8, 7, 1, 8, 9, 5, 7, 5, 5, 3, 5, 7, 7, 7, 2, 4, 1, 6, 9, 9, 7, 7, 10, 9, 2, 9, 8, 2, 5, 1, 2, 5, 9, 1, 4, 8, 9, 6, 4, 4, 7, 3, 7, 9, 4, 3, 7, 8, 7, 6, 8, 8, 7]`

## Answer:

Your expectations about floating-point behavior are too optimistic. With more experience, you wouldn’t be at all surprised that the result isn’t exactly 1.0. For example, try this much smaller input instead:

``````
[7, 1, 5]
``````

On my box, your function returns 1.0000000000000002. “Close to” 1.0, but not exactly 1.0. That’s the best you can hope for, in general.
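Since an exact `1.0` cannot be relied on, comparing with a tolerance is the usual way to check such a result. A minimal sketch, assuming Python 3.5+ for `math.isclose`; `pearson` is the function from the question, repeated compactly so the snippet runs on its own, and the tolerance is an arbitrary choice:

```python
import math
from statistics import mean, pstdev

def pearson(x, y):
    # Same computation as in the question, written compactly
    mx, my = mean(x), mean(y)
    stdx, stdy = pstdev(x), pstdev(y)
    sx = [(i - mx) / stdx for i in x]
    sy = [(j - my) / stdy for j in y]
    return sum(i * j for i, j in zip(sx, sy)) / len(x)

data = [7, 1, 5]
r = pearson(data, data)

print(r)  # very close to 1.0; the last digit may vary by platform
print(math.isclose(r, 1.0, rel_tol=1e-9))  # compare with a tolerance instead of ==
```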

For a good clue about why, think about what this “should” compute:

``````
math.sqrt(x)**2 == x
``````

Mathematically (working in infinite precision), that should return True for every non-negative `x`. But in floating-point (no matter how much precision is used, provided only that the precision is bounded), it’s impossible for it to always be true. In fact, counterexamples are very easy to find; for instance, on my box just now:

``````
>>> import math
>>> math.sqrt(2)**2
2.0000000000000004
``````
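To get a feel for how common such counterexamples are, one can simply count them over a range of small integers. A quick sketch (the upper bound of 1000 is arbitrary, and the exact count may depend on the platform's `sqrt`):

```python
import math

# Integers in 1..999 for which squaring the square root does not give back x
failures = [x for x in range(1, 1000) if math.sqrt(x) ** 2 != x]

print(len(failures), "of 999 round trips are inexact")
print(failures[:5])

# Every round trip is still extremely close to the original value
print(all(math.isclose(math.sqrt(x) ** 2, x) for x in range(1, 1000)))
```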

The problem is that, with finite precision, `sqrt()` is necessarily a many-to-one function. It squashes the domain `1..N` into the range `1..sqrt(N)`, and with finite precision the cardinality of the domain is larger than the cardinality of the range. Therefore there must exist distinct `x` and `y` in the domain that map to the same value in the range, so there is no exact functional inverse.
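A concrete collision is easy to exhibit. The sketch below assumes IEEE 754 doubles with a correctly rounded `sqrt`, which is what typical platforms provide; `sys.float_info.epsilon` is the gap between `1.0` and the next float above it:

```python
import math
import sys

x = 1.0
y = 1.0 + sys.float_info.epsilon  # the next representable float after 1.0

print(x != y)                        # two distinct inputs...
print(math.sqrt(x) == math.sqrt(y))  # ...that share one square root

# Squaring the shared result can recover at most one of the two inputs
print(math.sqrt(y) ** 2 == y)
```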

Your function is more complicated than a plain `sqrt`, but the same principles are at work.
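One way to reduce the number of rounding steps is to skip the per-element standardization and divide the sum of cross products by a single square root. This is a sketch of an alternative formulation, not a description of what scipy or numpy do internally; the name `pearson_single_sqrt` and the final clamp to [-1, 1] are choices made here for illustration:

```python
import math

def pearson_single_sqrt(x, y):
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n

    # Deviations from the mean
    dx = [i - mx for i in x]
    dy = [j - my for j in y]

    # Sums of squares and cross products; the 1/n factors cancel
    sxx = sum(a * a for a in dx)
    syy = sum(b * b for b in dy)
    sxy = sum(a * b for a, b in zip(dx, dy))

    r = sxy / math.sqrt(sxx * syy)
    # Rounding can push r a hair outside the valid range, so clamp it
    return max(-1.0, min(1.0, r))

print(pearson_single_sqrt([7, 1, 5], [7, 1, 5]))
print(pearson_single_sqrt([1, 2, 3], [3, 2, 1]))
```

For identical inputs the three sums are the same float, which makes an exact `1.0` far more likely, but for general data the result is still subject to rounding.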

## Answer:

Yes, this has to do with floating-point behavior. You can try using the `decimal` module:

``````
from decimal import Decimal

# `pearson` is the function from the question
data = [7, 1, 5, 1, 8, 5, 9, 8, 5, 10, 5, 8, 1, 8, 8, 8, 10, 4, 8]
data = [Decimal(x) for x in data]
print(pearson(data, data))
``````

Note that the mean and standard deviation must also be calculated using decimals.
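As a rough illustration of doing the whole computation in decimals, the sketch below uses `Decimal.sqrt()` and plain `Decimal` arithmetic for the mean and population standard deviation. The helper name `pearson_decimal` is made up for this example, and the result is still rounded (to 28 significant digits by default), so an exact `1` is not guaranteed either:

```python
from decimal import Decimal

def pearson_decimal(x, y):
    x = [Decimal(v) for v in x]
    y = [Decimal(v) for v in y]
    n = Decimal(len(x))

    mx = sum(x) / n
    my = sum(y) / n

    # Population standard deviations, computed entirely in Decimal
    stdx = (sum((v - mx) ** 2 for v in x) / n).sqrt()
    stdy = (sum((v - my) ** 2 for v in y) / n).sqrt()

    # Mean of the products of the standard scores
    return sum(((a - mx) / stdx) * ((b - my) / stdy) for a, b in zip(x, y)) / n

data = [7, 1, 5, 1, 8, 5, 9, 8, 5, 10, 5, 8, 1, 8, 8, 8, 10, 4, 8]
r = pearson_decimal(data, data)
print(r)
print(abs(r - 1) < Decimal("1e-20"))  # close to 1 within a chosen tolerance
```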

Tim Peters explained:

The `decimal` module, by default, uses more precision than native binary floating point. Still, every finite-precision `sqrt` is necessarily a many-to-one function. For the rest: `Decimal(0.5)` is identical to `Decimal("0.5")`, because 0.5 happens to be exactly representable as a binary float. You should use `sqrt()` instead of `**0.5`, because `sqrt()` guarantees a correctly rounded result to full precision and `**` does not.
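These points can be checked directly. A small sketch using only the standard `decimal` module; `0.1` is added here as a contrasting value that is not exactly representable in binary:

```python
from decimal import Decimal

# 0.5 is exactly representable as a binary float, so both spellings agree
print(Decimal(0.5) == Decimal("0.5"))

# 0.1 is not, so converting the float exposes its binary approximation
print(Decimal(0.1) == Decimal("0.1"))
print(Decimal(0.1))

# Two ways of taking a square root in decimal
two = Decimal(2)
root = two.sqrt()
power = two ** Decimal("0.5")
print(root)
print(power)

# Even the correctly rounded root need not square back to exactly 2
print(root ** 2)
```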
