Pandas Scipy mannwhitneyu in this type of data table

Question:

I have a data table similar to this one (but huge), with many more types and more "Spots" values for each "Color":

Type    Color   Spots
A   Blue    792
A   Blue    56
A   Blue    2726
A   Blue    780
A   Blue    591
A   Blue    2867
A   Blue    193
A   Green   134
A   Green   631
A   Green   1010
A   Green   53
A   Green   5826
A   Green   6409
A   Green   3278
B   Blue    670
B   Blue    42
B   Blue    1165
B   Blue    3203
B   Blue    2164
B   Blue    5876
B   Blue    525
B   Green   26
B   Green   143
B   Green   399
B   Green   68
B   Green   939
B   Green   1528
B   Green   401
B   Green   1842
C   Blue    265
C   Blue    19
C   Blue    1381
C   Blue    4483
C   Blue    1103
C   Blue    1906
C   Blue    691
C   Green   38
C   Green   149
C   Green   87
C   Green   33
C   Green   1427
C   Green   1009
C   Green   342
C   Green   190

I want to run a SciPy mannwhitneyu analysis comparing Blue vs. Green spots within each type. For instance, for type A the comparison would be the following, and the same should happen automatically for all the types:

Blue Green
792 134
56  631
2726 1010
780 53
591 5826
2867 6409
193 3278

I thought that defining those kinds of groups in Pandas and then passing them to SciPy should be the strategy, but my skills are not at that level yet.
The idea is to do it automatically for all of the types, so I get the p-value of A, B, C, etc.
Could somebody give me a hint?
Thanks

Asked By: Eusebio Perdiguero

||

Answers:

Your question leaves implicit a lot that may be obvious to you but not to people who are less familiar with the sort of statistical analysis you are interested in. For other readers, the documentation for the SciPy implementation can be found under https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mannwhitneyu.html:

The Mann-Whitney U test is a nonparametric test of the null hypothesis that the distribution underlying sample x is the same as the distribution underlying sample y. It is often used as a test of difference in location between distributions.

More explanation for the Mann-Whitney test can be found under https://en.wikipedia.org/wiki/Mann–Whitney_U_test. Roughly speaking, what you are probably interested in is whether the numbers of green and blue spots differ statistically within each type of object being observed. Whether this statistic is applicable, given the nature and distribution of the data, I consider to be outside the scope of this question.
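As a minimal sketch of what a single comparison looks like, here are the type A values from the question passed directly to the test. The names blue_a and green_a are mine, and alternative='two-sided' is spelled out as an assumption about what is wanted.

```python
from scipy.stats import mannwhitneyu

# Spots for type A, copied from the table in the question.
blue_a = [792, 56, 2726, 780, 591, 2867, 193]
green_a = [134, 631, 1010, 53, 5826, 6409, 3278]

# Stating the alternative explicitly is an assumption about the intent:
# we ask whether the two samples differ in either direction.
result = mannwhitneyu(blue_a, green_a, alternative='two-sided')
print(result.statistic, result.pvalue)
```

The statistic counts, over all Blue/Green pairs, how often the Blue value is the larger one, so a value near half the number of pairs (here 7 * 7 / 2) suggests little difference.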

If you need to read the data, formatted the way you present it, from a CSV file, you could use the following. A separator of '\s+' is a regular expression that matches any run of whitespace.

import pandas
import scipy.stats
import itertools

# Despite the CSV name, the data is not really comma-separated.
# This uses runs of whitespace as the separator.
data = pandas.read_csv('data.csv', sep=r'\s+')

# Generate all unique combinations of values of the second column.
# Having these ahead of time would save going over the data multiple times, 
# but the idea is to infer these automatically.
combinations = list(itertools.combinations(data[data.columns[1]].unique(), 2))

for key, group in data.groupby(data.columns[0]):
    for c in combinations:
        # Select values for each element of the combination.
        select_x = group[data.columns[1]] == c[0]
        select_y = group[data.columns[1]] == c[1]
        x = group[select_x][data.columns[2]]
        y = group[select_y][data.columns[2]]
        mwu = scipy.stats.mannwhitneyu(x, y)
        print(f'{data.columns[0]}: {key} ({c[0]} vs {c[1]}): {mwu}')

This will print:

Type: A (Blue vs Green): MannwhitneyuResult(statistic=19.0, pvalue=0.534965034965035)
Type: B (Blue vs Green): MannwhitneyuResult(statistic=41.0, pvalue=0.151981351981352)
Type: C (Blue vs Green): MannwhitneyuResult(statistic=41.0, pvalue=0.151981351981352)
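To try the whitespace separator without a file on disk, here is a small sketch that parses a few rows pasted into a string. The name sample_text is hypothetical, and the rows are a subset of the table in the question.

```python
import io
import pandas

# A few rows from the question, pasted as text. Columns are separated by
# runs of spaces (tabs would work as well).
sample_text = """Type    Color   Spots
A   Blue    792
A   Green   134
B   Blue    670
B   Green   26
"""

# r'\s+' is a regular expression matching one or more whitespace characters.
sample = pandas.read_csv(io.StringIO(sample_text), sep=r'\s+')
print(sample.dtypes)
```

If the separator were written as 's+' without the backslash, pandas would split on the letter "s" instead, which is worth checking if the columns come out wrong.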

First of all, I am inferring the types and classes, because of how I interpreted this portion of the question:

The idea is to do it automatically for all of the types, so I get the p-value of A, B, C, etc.

Knowing the types ahead of time could be used to make this code more efficient, but I am purposefully not hardcoding any of the classes such as "A", "B", "C" or the color of the spots because of this requirement from the author of the question above. This requirement might make it necessary to go over the data multiple times, because the unique values have to be collected first in order to build the combinations that are compared within each group.

Documentation for groupby can be found under https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html.

Explanation

First, I generate all combinations of unique values in the second column (data.columns[1]). In your case, those are the colors. You only have "Green" and "Blue", but I assume that there can be more, so I did not hardcode them. I then group the data by the first column (in your case "Type"): data.groupby(data.columns[0]). Iterating over the groupby yields, for each group, a key, which is the value being grouped on (your types), and the rows within that group (group). For each element of the combination, select_x and select_y are objects of type pandas.core.series.Series holding boolean values that specify which rows to select. They are used to pick the values from the third column ("Spots" in your case) as the X and Y samples of the Mann-Whitney test. The column names are available through data.columns (for example data.columns[0]), so I did not need to hardcode the name of the first column ("Type") either.

This code should be agnostic to the names of your columns. It automatically performs the statistical test you asked for by grouping on all unique values it finds in the first column and generating all combinations of unique values from the second, in order to select the actual measurements from the third column.

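Note that groupby sorts the group keys by default, so the types are reported in sorted order. The colors, however, come from unique(), which returns values in order of first appearance. I assume that is not an issue, but if it is, sort them first:

colors = sorted(data[data.columns[1]].unique())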
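The steps described in the explanation can also be wrapped in a function that returns the results as a table instead of printing them. This is only a sketch: mannwhitney_by_group and the result column names are hypothetical, it assumes the first three columns are group, class and value in that order, and it passes alternative='two-sided' explicitly.

```python
import itertools
import pandas
import scipy.stats

def mannwhitney_by_group(data):
    """Hypothetical helper: one result row per (group, pair of classes)."""
    group_col, class_col, value_col = data.columns[:3]
    # All unordered pairs of class values, e.g. ("Blue", "Green").
    pairs = list(itertools.combinations(data[class_col].unique(), 2))
    rows = []
    for key, group in data.groupby(group_col):
        for first, second in pairs:
            # Boolean masks pick the measurements for each class.
            x = group.loc[group[class_col] == first, value_col]
            y = group.loc[group[class_col] == second, value_col]
            if x.empty or y.empty:
                continue  # this group lacks one of the two classes
            result = scipy.stats.mannwhitneyu(x, y, alternative='two-sided')
            rows.append({group_col: key, 'x': first, 'y': second,
                         'statistic': result.statistic,
                         'pvalue': result.pvalue})
    return pandas.DataFrame(rows)

# Types A and B from the question.
data = pandas.DataFrame({
    'Type': ['A'] * 14 + ['B'] * 15,
    'Color': ['Blue'] * 7 + ['Green'] * 7 + ['Blue'] * 7 + ['Green'] * 8,
    'Spots': [792, 56, 2726, 780, 591, 2867, 193,
              134, 631, 1010, 53, 5826, 6409, 3278,
              670, 42, 1165, 3203, 2164, 5876, 525,
              26, 143, 399, 68, 939, 1528, 401, 1842],
})
table = mannwhitney_by_group(data)
print(table)
```

Having the p-values in a DataFrame makes it easier to sort or filter them afterwards than parsing printed lines.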

Efficiency

I do not currently know a straightforward way to select all pairs of groups, which it appears you need. However, Pandas does have the ability to specify more than one column to group by. If the groups and combinations were fixed, it would be possible to reduce the number of times this code goes over the data. Pandas might be able to sufficiently optimize some of these operations, though, so that it’s also feasible to use this approach with larger datasets.
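One possible way to use grouping by more than one column is sketched below. It assumes the column names "Type", "Color" and "Spots" from the question (so it is less generic than the code above), and the names samples and results are mine. The values for every (Type, Color) pair are collected with a single groupby call, and the pairs are then formed per type from that much smaller result.

```python
import itertools
import pandas
import scipy.stats

# Types A and B from the question.
data = pandas.DataFrame({
    'Type': ['A'] * 14 + ['B'] * 15,
    'Color': ['Blue'] * 7 + ['Green'] * 7 + ['Blue'] * 7 + ['Green'] * 8,
    'Spots': [792, 56, 2726, 780, 591, 2867, 193,
              134, 631, 1010, 53, 5826, 6409, 3278,
              670, 42, 1165, 3203, 2164, 5876, 525,
              26, 143, 399, 68, 939, 1528, 401, 1842],
})

# Group on both columns at once: a Series of lists indexed by (Type, Color).
samples = data.groupby(['Type', 'Color'])['Spots'].apply(list)

results = {}
for type_name in samples.index.get_level_values('Type').unique():
    per_color = samples.loc[type_name]  # Series indexed by Color
    # Only colors that actually occur for this type are paired up.
    for first, second in itertools.combinations(per_color.index, 2):
        results[(type_name, first, second)] = scipy.stats.mannwhitneyu(
            per_color[first], per_color[second], alternative='two-sided')

for key, result in results.items():
    print(key, result.pvalue)
```

Whether this is actually faster than the boolean-mask version would need to be measured on the real data.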

Answered By: BananaMango