Pandas/SciPy mannwhitneyu on this type of data table
Question:
I have a data table similar to this one (but huge), many types and more "Spot" cells for each "Color":
Type Color Spots
A Blue 792
A Blue 56
A Blue 2726
A Blue 780
A Blue 591
A Blue 2867
A Blue 193
A Green 134
A Green 631
A Green 1010
A Green 53
A Green 5826
A Green 6409
A Green 3278
B Blue 670
B Blue 42
B Blue 1165
B Blue 3203
B Blue 2164
B Blue 5876
B Blue 525
B Green 26
B Green 143
B Green 399
B Green 68
B Green 939
B Green 1528
B Green 401
B Green 1842
C Blue 265
C Blue 19
C Blue 1381
C Blue 4483
C Blue 1103
C Blue 1906
C Blue 691
C Green 38
C Green 149
C Green 87
C Green 33
C Green 1427
C Green 1009
C Green 342
C Green 190
I want to do a SciPy mannwhitneyu analysis comparing Blue vs Green spots for each type; for instance, this is the comparison for type A, and I want the same for all types automatically:
Blue Green
792 134
56 631
2726 1010
780 53
591 5826
2867 6409
193 3278
I thought that defining those kinds of groups in Pandas and then passing them to SciPy should be the strategy, but my skills are not at that level yet.
The idea is to do it automatically for all of the types, so I get the p-value of A, B, C, etc.
Could somebody give me a hint?
Thanks
Answers:
Your question might leave implied a lot that is obvious to you but not to readers less familiar with this sort of statistical analysis. For those readers, the documentation for the SciPy implementation can be found under https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mannwhitneyu.html:
The Mann-Whitney U test is a nonparametric test of the null hypothesis that the distribution underlying sample x is the same as the distribution underlying sample y. It is often used as a test of difference in location between distributions.
More explanation for the Mann-Whitney test can be found under https://en.wikipedia.org/wiki/Mann–Whitney_U_test. Roughly speaking, what you are probably interested in are the statistical differences in occurrence of green and blue spots between different types of objects being observed. Discussing the applicability of this statistic, given the nature and distribution of the data, I understand to be outside the scope of this question.
If you need to read the data, formatted the way you present it, from a CSV file, you could use the following. A separator of r'\s+' is a regular expression that matches any run of whitespace.
import pandas
import scipy.stats
import itertools

# The data is whitespace-separated rather than comma-separated,
# so use a regular expression matching runs of whitespace as the separator.
data = pandas.read_csv('data.csv', sep=r'\s+')

# Generate all unique pairs of values from the second column (the colors).
# Knowing these ahead of time would save going over the data multiple times,
# but the idea is to infer them automatically.
combinations = list(itertools.combinations(data[data.columns[1]].unique(), 2))

for key, group in data.groupby(data.columns[0]):
    for c in combinations:
        # Boolean masks selecting the rows for each element of the pair.
        select_x = group[data.columns[1]] == c[0]
        select_y = group[data.columns[1]] == c[1]
        x = group[select_x][data.columns[2]]
        y = group[select_y][data.columns[2]]
        mwu = scipy.stats.mannwhitneyu(x, y)
        print(f'{data.columns[0]}: {key} ({c[0]} vs {c[1]}): {mwu}')
This will print:
Type: A (Blue vs Green): MannwhitneyuResult(statistic=19.0, pvalue=0.534965034965035)
Type: B (Blue vs Green): MannwhitneyuResult(statistic=41.0, pvalue=0.151981351981352)
Type: C (Blue vs Green): MannwhitneyuResult(statistic=41.0, pvalue=0.151981351981352)
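If it is more convenient to collect the p-values in a table instead of printing them, the same loop can accumulate rows for a results DataFrame. A minimal sketch, using a small inline subset of the question's data in place of the CSV file and hardcoding the column names for brevity:

```python
import pandas as pd
import scipy.stats

# Inline sample standing in for the CSV file (a subset of the question's data).
data = pd.DataFrame({
    'Type':  ['A'] * 6 + ['B'] * 6,
    'Color': (['Blue'] * 3 + ['Green'] * 3) * 2,
    'Spots': [792, 56, 2726, 134, 631, 1010,
              670, 42, 1165, 26, 143, 399],
})

rows = []
for key, group in data.groupby('Type'):
    x = group.loc[group['Color'] == 'Blue', 'Spots']
    y = group.loc[group['Color'] == 'Green', 'Spots']
    mwu = scipy.stats.mannwhitneyu(x, y)
    rows.append({'Type': key, 'U': mwu.statistic, 'p': mwu.pvalue})

results = pd.DataFrame(rows)
print(results)
```

This gives one row per type, which is easy to sort, filter, or write back out with to_csv.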
First of all, I am inferring the types and classes because of how I interpreted this portion of the question:
The idea is to do it automatically for all of the types, so I get the p-value of A, B, C, etc.
Knowing the types ahead of time could make this code more efficient, but I am purposefully not hardcoding any of the classes such as "A", "B", "C", or the colors of the spots, because of this requirement from the author of the question. This requirement might make it necessary to go over the data multiple times, because these values are needed to determine the combinations of the groupby classes.
Documentation for groupby can be found under https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html.
Explanation
First, I generate all pairs of unique values in the second column (data.columns[1]). In your case, those are the colors. You only have "Green" and "Blue", but I assume that there can be more, so I did not hardcode them. I then group the data by the first column (in your case, "Type"): data.groupby(data.columns[0]). Iterating over the groupby yields a key, which is the value being grouped on (your types), and the rows belonging to that group (group). Then, for each element of the pair, values from the third column ("Spots" in your case) are selected as the x and y samples of the Mann-Whitney test (via select_x and select_y). Those are objects of type pandas.core.series.Series holding boolean values that specify which rows to select. The column name printed in the output also comes from data.columns[0], so I did not need to hardcode the name of the first column ("Type"), either.
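As a tiny, self-contained illustration of such a boolean mask (with made-up values):

```python
import pandas as pd

s = pd.Series(['Blue', 'Green', 'Blue'])
mask = s == 'Blue'      # a boolean Series: True where the value matches
print(list(mask))       # [True, False, True]
print(list(s[mask]))    # ['Blue', 'Blue']
```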
This code should be agnostic to the names of your columns. It automatically performs the statistical test you asked for by grouping on all unique values found in the first column and generating all pairs of unique values from the second, in order to select the actual measurements from the third column.
As you can see, the values might come out in a different order than in the input, because unique() returns them in order of appearance. I assume that is not an issue, but if it is, sort them first:
types = sorted(data[data.columns[0]].unique())
Efficiency
I do not currently know a straightforward way to select all pairs of groups, which it appears you need. However, Pandas does have the ability to group by more than one column. If the groups and pairs were fixed and known ahead of time, it would be possible to reduce the number of times this code goes over the data. Pandas may be able to optimize some of these operations well enough that this approach remains feasible for larger datasets.
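As a sketch of the multi-column grouping idea (again with inline sample data standing in for the CSV; the column names are taken from the question), a single pass of groupby can collect every (Type, Color) group, after which the pairs are formed from the collected groups rather than by re-filtering the data:

```python
import itertools
import pandas as pd
import scipy.stats

# Inline sample standing in for the CSV file (a subset of the question's data).
data = pd.DataFrame({
    'Type':  ['A'] * 6 + ['B'] * 6,
    'Color': (['Blue'] * 3 + ['Green'] * 3) * 2,
    'Spots': [792, 56, 2726, 134, 631, 1010,
              670, 42, 1165, 26, 143, 399],
})

# One pass over the data: map each (Type, Color) pair to its Spots values.
groups = {key: grp['Spots'] for key, grp in data.groupby(['Type', 'Color'])}

colors = data['Color'].unique()
pvalues = {}
for t in data['Type'].unique():
    for c1, c2 in itertools.combinations(colors, 2):
        mwu = scipy.stats.mannwhitneyu(groups[(t, c1)], groups[(t, c2)])
        pvalues[(t, c1, c2)] = mwu.pvalue
        print(t, c1, 'vs', c2, mwu.pvalue)
```

The boolean-mask filtering now happens once inside groupby instead of once per pair, which should matter more as the number of colors grows.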