How to do I count the number of string combinations in each row of a pandas dataframe?

Question:

I’m trying to count the number of times a combination of strings appear in each row of a dataframe. Each ID uses a number of methods (some IDs use more methods than others) and I want to count the number of times any two methods have been combined together.

# df is from csv and has blank cells - I've used empty strings to demo here
df = pd.DataFrame({'id': ['101', '102', '103', '104'],
    'method_1': ['HR', 'q-SUS', 'PEP', 'ET'],
    'method_2': ['q-SUS', 'q-IEQ', 'AUC', 'EEG'],
    'method_3': ['SC', '', 'HR', 'SC'],
    'method_4': ['q-IEQ', '', 'ST', 'HR'],
    'method_5': ['PEP', '', 'SC', '']})

print(df)

    id method_1 method_2 method_3 method_4 method_5
0  101       HR    q-SUS       SC    q-IEQ      PEP
1  102    q-SUS    q-IEQ                           
2  103      PEP      AUC       HR       ST       SC
3  104       ET      EEG       SC       HR         

I want to end up with a table that looks something like this:

Method A Method B Number of Times Combined
HR SC 3
HR q-SUS 1
HR PEP 2
q-IEQ q-SUS 2
EEG ET 1
EEG SC 1
etc. etc. etc.

So far I’ve been trying variations of this code using itertools.combinations and collections Counter:

import numpy as np
import pandas as pd
import itertools
from collections import Counter

def get_all_combinations_without_nan(row):
    # remove nan - this is for the blank csv cells
    set_without_nan = {value for value in row if isinstance(value, str)}

    # generate all combinations of values in row
    all_combinations = []
    for index, row in df.iterrows():  
        result = list(itertools.combinations(set_without_nan, 2))
        all_combinations.extend(result)

    return all_combinations

# get all possible combinations of values in a row
all_rows = df.apply(get_all_combinations_without_nan, 1).values
all_rows_flatten = list(itertools.chain.from_iterable(all_rows))

count_combinations = Counter(all_rows_flatten)

print(count_combinations)

It’s doing something, but it seems to be counting multiple times or something (it’s counting more combinations than are actually there. I’ve had a good look on Stack, but can’t seem to solve this – everything seems really close though!

I hope someone can help – Thanks!

Asked By: gecko_mike

||

Answers:

Use DataFrame.melt for reshape with remove empty strings or missing values, then use DataFrame.merge for all combinations, remove rows with same methods and count by GroupBy.size:

df1 = df.melt('id', value_name='method_')
df1 = df1[(df1["method_"] != '') & (df1["method_"].notna())]

df = (df1.merge(df1, on='id', suffixes=('A','B'))
          .query("method_A != method_B")
          .groupby(['method_A','method_B'])
          .size()
          .reset_index(name='Number of Times Combined'))
print (df.head(20))
   method_A method_B  Number of Times Combined
0       AUC       HR                         1
1       AUC      PEP                         1
2       AUC       SC                         1
3       AUC       ST                         1
4       EEG       ET                         1
5       EEG       HR                         1
6       EEG       SC                         1
7        ET      EEG                         1
8        ET       HR                         1
9        ET       SC                         1
10       HR      AUC                         1
11       HR      EEG                         1
12       HR       ET                         1
13       HR      PEP                         2
14       HR       SC                         3
15       HR       ST                         1
16       HR    q-IEQ                         1
17       HR    q-SUS                         1
18      PEP      AUC                         1
19      PEP       HR                         2
    
Answered By: jezrael
Categories: questions Tags: , , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.