Aggregate row pair combinations and calculate difference using pandas

Question:

I am using the context and code of this question here: Aggregate all dataframe row pair combinations using pandas

import pandas
import itertools

mygenes=['ABC1', 'ABC2', 'ABC3', 'ABC4']

df = pandas.DataFrame({'Gene' : ['ABC1', 'ABC2', 'ABC3', 'ABC4','ABC5'],
                       'case1'   : [0,1,1,0,0],
                       'case2'   : [1,1,1,0,1],
                       'control1':[0,0,1,1,1],
                       'control2':[1,0,0,1,0] })
>>> df
   Gene  case1  case2  control1  control2
0  ABC1      0      1         0         1
1  ABC2      1      1         0         0
2  ABC3      1      1         1         0
3  ABC4      0      0         1         1
4  ABC5      0      1         1         0

The solution below sums up the rows, however, in my case, I would need the difference.

>>> df = df.set_index("Gene")
>>> cc = list(combinations(mygenes,2))
>>> out = pd.DataFrame([df.loc[c,:].sum() for c in cc], index=cc)
>>> out
              case1  case2  control1  control2
(ABC1, ABC2)      1      2         0         1
(ABC1, ABC3)      1      2         1         1
(ABC1, ABC4)      0      1         1         2
(ABC2, ABC3)      2      2         1         0
(ABC2, ABC4)      1      1         1         1
(ABC3, ABC4)      1      1         2         1

So in my case, (ABC1, ABC2) would yield [-1,0,0,1]

Unfortunately, using .diff() instead of .sum() results in the following error message:

ValueError: Must pass 2-d input. shape=(10, 2, 8)

Asked By: Kaya

||

Answers:

It may help to look at what .sum() and .diff() do

.sum() produces a Series

>>> df.loc[cc[0], :].sum()
case1       1
case2       2
control1    0
control2    1
dtype: int64

.diff() produces a DataFrame

>>> df.loc[cc[0], :].diff()
      case1  case2  control1  control2
Gene                                  
ABC1    NaN    NaN       NaN       NaN
ABC2    1.0    0.0       0.0      -1.0

You can use the period of -1 with diff and then index the first row of the resulting DataFrame

>>> df.loc[cc[0], :].diff(-1).iloc[0]
case1      -1.0
case2       0.0
control1    0.0
control2    1.0
Name: ABC1, dtype: float64
>>> pd.DataFrame([df.loc[c,:].diff(-1).iloc[0] for c in cc], index=cc)
              case1  case2  control1  control2
(ABC1, ABC2)   -1.0    0.0       0.0       1.0
(ABC1, ABC3)   -1.0    0.0      -1.0       1.0
(ABC1, ABC4)    0.0    1.0      -1.0       0.0
(ABC2, ABC3)    0.0    0.0      -1.0       0.0
(ABC2, ABC4)    1.0    1.0      -1.0      -1.0
(ABC3, ABC4)    1.0    1.0       0.0      -1.0
Answered By: jqurious

Another possible solution:

out = pd.DataFrame(
    [[x, (df.loc[df.Gene.eq(x[0]), 'case1':].to_numpy() - 
          df.loc[df.Gene.eq(x[1]), 'case1':].to_numpy())
      .flatten().tolist()] 
     for x in combinations(mygenes, 2)])
out = (pd.concat([out.loc[:,0], pd.DataFrame(out[1].to_list())], axis=1)
       .set_axis(df.columns, axis=1))

Output:

           Gene  case1  case2  control1  control2
0  (ABC1, ABC2)     -1      0         0         1
1  (ABC1, ABC3)     -1      0        -1         1
2  (ABC1, ABC4)      0      1        -1         0
3  (ABC2, ABC3)      0      0        -1         0
4  (ABC2, ABC4)      1      1        -1        -1
5  (ABC3, ABC4)      1      1         0        -1
Answered By: PaulS