# Homemade Pearson's correlation implementation returning 0.999…2 when passed the same data set twice

## Question:

I was getting fed up with scipy and numpy, and decided to go ahead and write my own implementation, based on an SO answer somewhere.

```
from statistics import pstdev, mean

def pearson(x, y):
    sx = []
    sy = []
    mx = mean(x)
    my = mean(y)
    stdx = pstdev(x)
    stdy = pstdev(y)
    for i in x:
        sx.append((i - mx) / stdx)
    for j in y:
        sy.append((j - my) / stdy)
    return sum([i * j for i, j in zip(sx, sy)]) / len(x)
```

I passed a few numbers into it to see if it was giving the same thing as `scipy.stats.pearsonr`, and it seemed to be fine. A digit or so towards the end was different, but it wasn't anything groundbreaking…

Until I tried passing the same set of data to it as `x` and `y`. When I did, I got back `0.9999999999999992`, when scipy and numpy both say it's `1.0`.

Is there something wrong with this implementation? I'm using the population stdev instead of the sample one, and as far as I'm aware, both numpy and scipy use that. I'm wondering why this isn't returning `1.0` as it should. Could it be float issues in Python itself? I've tried it in Py 2 and 3, and I get `0.999...` in both.

If needed, the set of data I passed into it was:

`[7, 1, 5, 1, 8, 5, 9, 8, 5, 10, 5, 8, 1, 8, 8, 8, 10, 4, 8, 9, 9, 6, 8, 7, 8, 5, 10, 5, 6, 8, 8, 7, 9, 4, 6, 10, 7, 10, 4, 5, 4, 7, 4, 8, 9, 10, 9, 8, 7, 8, 6, 8, 6, 6, 5, 7, 7, 7, 7, 3, 7, 8, 6, 8, 5, 7, 8, 7, 8, 6, 8, 6, 9, 6, 6, 6, 8, 9, 5, 7, 9, 2, 9, 6, 7, 6, 7, 7, 5, 5, 7, 7, 8, 6, 9, 1, 3, 6, 7, 9, 7, 7, 6, 9, 9, 4, 9, 9, 7, 9, 6, 2, 2, 8, 4, 7, 7, 6, 3, 7, 3, 5, 10, 9, 8, 10, 8, 7, 4, 7, 8, 9, 8, 4, 7, 9, 7, 7, 6, 8, 8, 9, 9, 7, 4, 4, 7, 3, 9, 3, 1, 8, 3, 9, 4, 8, 3, 9, 8, 8, 7, 9, 9, 8, 10, 8, 3, 10, 4, 7, 7, 10, 8, 7, 8, 7, 1, 8, 9, 5, 7, 5, 5, 3, 5, 7, 7, 7, 2, 4, 1, 6, 9, 9, 7, 7, 10, 9, 2, 9, 8, 2, 5, 1, 2, 5, 9, 1, 4, 8, 9, 6, 4, 4, 7, 3, 7, 9, 4, 3, 7, 8, 7, 6, 8, 8, 7]`

## Answers:

Your expectations about floating-point behavior are way too optimistic. With experience, you wouldn’t be surprised a bit that the result isn’t exactly 1.0. For example, try this much smaller input instead:

```
[7, 1, 5]
```

On my box, your function returns 1.0000000000000002. “Close to” 1.0, but not exactly 1.0. That’s the best you can hope for, in general.
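Here's that check as a runnable snippet, restating the function from the question in a condensed form (the exact trailing digits you see may differ across platforms, but the result will be within rounding error of 1.0):

```python
from statistics import pstdev, mean

def pearson(x, y):
    # z-score both inputs, then average the products of the z-scores
    mx, my = mean(x), mean(y)
    stdx, stdy = pstdev(x), pstdev(y)
    sx = [(i - mx) / stdx for i in x]
    sy = [(j - my) / stdy for j in y]
    return sum(i * j for i, j in zip(sx, sy)) / len(x)

data = [7, 1, 5]
r = pearson(data, data)
print(r)                       # close to 1.0, but possibly not exactly 1.0
print(abs(r - 1.0) < 1e-12)    # True: within rounding error of 1.0
```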

For a good clue about why, think about what this “should” compute:

```
math.sqrt(x)**2 == x
```

Mathematically (working in infinite precision), that should always return True. But in floating-point (no matter how much precision is used, provided only that the precision is bounded), it’s impossible for it to always be true. In fact, counterexamples are very easy to find; like, on my box just now:

```
>>> math.sqrt(2)**2
2.0000000000000004
```

The problem is that, *with finite precision*, `sqrt()` is necessarily a many-to-one function. It squashes the domain `1..N` into the range `1..sqrt(N)`, and with finite precision the cardinality of the domain is larger than the cardinality of the range. Therefore there must exist distinct `x` and `y` in the domain that map to the same value in the range, so there is no exact functional inverse.
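You can check this pigeonhole argument directly. Just above 2.0, consecutive doubles are spaced 2⁻⁵¹ apart, but their square roots land in a region where doubles are spaced 2⁻⁵² apart and sqrt changes by only about 0.7 of that spacing per step, so collisions must occur. A small search (Python 3.9+ for `math.nextafter`):

```python
import math

# Walk consecutive doubles starting at 2.0 and look for two distinct
# values whose correctly rounded square roots are identical.
x = 2.0
collision = None
for _ in range(1000):
    y = math.nextafter(x, math.inf)   # the very next representable double
    if math.sqrt(x) == math.sqrt(y):  # distinct inputs, same rounded output
        collision = (x, y)
        break
    x = y

print(collision)   # a pair of distinct doubles with the same sqrt
```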

Your function is more complicated than a plain `sqrt`, but the same principles are at work.
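The practical moral: don't compare floating-point results for exact equality; compare them within a tolerance instead. Python's `math.isclose` makes this easy:

```python
import math

r = 0.9999999999999992   # the value the question's implementation returned

# Exact equality fails ...
print(r == 1.0)              # False
# ... but a tolerance-based comparison succeeds.
print(math.isclose(r, 1.0))  # True (default rel_tol=1e-09)
```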

Yes, this has to do with floating-point behavior. You can try using the `decimal` module:

```
from decimal import *
data = [7, 1, 5, 1, 8, 5, 9, 8, 5, 10, 5, 8, 1, 8, 8, 8, 10, 4, 8]
data = [Decimal(x) for x in data]
print(pearson(data, data))
```

Note that you will need to calculate the mean and standard deviation using decimals also.
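Putting it together, here is a self-contained sketch of the same computation carried out in `Decimal` (the `statistics` module's `mean` and `pstdev` accept `Decimal` inputs and return `Decimal` results, so no extra work is needed). With the default 28 significant digits the result may still not be exactly 1, but the error is far smaller than with binary doubles:

```python
from decimal import Decimal
from statistics import pstdev, mean

def pearson(x, y):
    # Same computation as the question's function, condensed; all
    # arithmetic happens in Decimal because the inputs are Decimal.
    mx, my = mean(x), mean(y)
    stdx, stdy = pstdev(x), pstdev(y)
    return sum((i - mx) / stdx * (j - my) / stdy
               for i, j in zip(x, y)) / len(x)

data = [Decimal(v) for v in [7, 1, 5, 1, 8, 5, 9, 8, 5, 10,
                             5, 8, 1, 8, 8, 8, 10, 4, 8]]
r = pearson(data, data)
print(r)                               # extremely close to 1
print(abs(r - 1) < Decimal("1e-20"))   # True
```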

Tim Peters explained:

> The `decimal` module, by default, uses more precision than native binary floating point. Every finite-precision sqrt is necessarily a many-to-one function. For the rest: `Decimal(0.5)` is identical to `Decimal("0.5")`, because 0.5 happens to be exactly representable as a binary float. You should use `sqrt()` instead of `**0.5`, because `sqrt()` guarantees a correctly rounded result to full precision and `**` does not.
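The `Decimal(0.5)` remark is easy to verify: 0.5 is a power of two, so the binary float 0.5 converts to `Decimal` exactly, while 0.1 does not and drags its binary conversion error along with it:

```python
from decimal import Decimal

# 0.5 = 2**-1 is exactly representable as a binary float,
# so converting the float and the string give the same Decimal.
print(Decimal(0.5) == Decimal("0.5"))   # True

# 0.1 is not exactly representable in binary, so the float
# carries its conversion error into Decimal.
print(Decimal(0.1) == Decimal("0.1"))   # False
print(Decimal(0.1))  # 0.1000000000000000055511151231257827021181583404541015625
```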