Count occurrences of a conditional relationship between 2 dataframes and compute correlation
Question:
I want to find the relationship between elements of different yet connected data frames. Here I have a df that shows friendships: id1 is friends with id2, id3, id4, id5, id21 etc.
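| | friend1 | friend2 |
|---|---|---|
| row1 | id1 | id3 |
| row2 | id2 | id1 |
| row3 | id5 | id1 |
| row4 | id12 | id2 |
| row5 | id21 | id1 |
| row6 | id4 | id2 |
| row7 | id7 | id8 |
| row8 | id1 | id4 |
| row9 | id21 | id2 |
| row10 | id3 | id5 |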
Here is another dataframe that shows when someone goes to a party. For example, id5 went to parties on 2012-02-03 and 2012-05-09.
| | person | date |
|---|---|---|
| row1 | id1 | 2012-02-03 |
| row2 | id2 | 2012-05-09 |
| row3 | id5 | 2012-02-03 |
| row4 | id12 | 2012-05-09 |
| row5 | id21 | 2012-02-03 |
| row6 | id7 | 2012-02-22 |
| row7 | id5 | 2012-05-09 |
| row8 | id3 | 2012-02-22 |
| row9 | id8 | 2012-02-22 |
| row10 | id1 | 2012-02-22 |
I want to find the correlation between people attending parties depending on whether their friends attend. For example for id1:
id1 went to the parties on 2012-02-03 (same day as id21, id5) and 2012-02-22 (same day as id7, id3, id8). Of those attendees, id21 and id5 are friends of id1 at the first party and only id3 at the second, so 2 friends on one occasion and 1 on the other (mean = 1.5 friends when he attends a party).
I would like to see the average number of friends present at a party for each person in the dataset. If someone has no friends, visited no parties, or visited parties without his friends, then the mean will be 0.
I tried to build this using pandas methods like value_counts/groupby and dictionaries but I lost hope along the way. Thanks in advance for any help.
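The id1 example above can be checked by hand with plain Python sets. This is only a minimal sketch of the arithmetic; the variable names are made up for illustration and the sets are copied from the two tables.

```python
# Friends of id1, read off the friendship table.
friends_of_id1 = {"id2", "id3", "id4", "id5", "id21"}

# Attendees of the two parties id1 went to, read off the party table.
parties_of_id1 = {
    "2012-02-03": {"id1", "id5", "id21"},
    "2012-02-22": {"id1", "id3", "id7", "id8"},
}

# Count the friends present at each party, then average over the parties.
friends_per_party = {
    date: len(attendees & friends_of_id1)
    for date, attendees in parties_of_id1.items()
}
mean_friends = sum(friends_per_party.values()) / len(friends_per_party)

print(friends_per_party)  # {'2012-02-03': 2, '2012-02-22': 1}
print(mean_friends)       # 1.5
```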
Here are the constructors for the dfs:
import pandas as pd

index = ['r1', 'r2', 'r3', 'r4', 'r5', 'r6', 'r7', 'r8', 'r9', 'r10']
data1 = {'friend1': ['id1', 'id2','id5','id12','id21','id4','id7','id1','id21', 'id3'],
         'friend2': ['id3', 'id1','id1','id2','id1','id2','id8','id4','id2','id5']}
data2 = {'person': ['id1', 'id2','id5','id12','id21','id7','id5','id3','id8','id1'],
         'date': ['2012-02-03', '2012-05-09', '2012-02-03', '2012-05-09', '2012-02-03','2012-02-22','2012-05-09','2012-02-22','2012-02-22','2012-02-22']}
df1 = pd.DataFrame(data1, index=index)
df2 = pd.DataFrame(data2, index=index)
Answers:
The idea is to build two dictionaries: one mapping each person to the set of their friends, and one mapping each party date to the set of all participants. With both in place, a loop over all persons counts the friends present at each party the person attended and calculates the mean:
import pandas as pd

data1 = {'friend1': ['id1', 'id2','id5','id12','id21','id4','id7','id1','id21', 'id3'],
         'friend2': ['id3', 'id1','id1','id2','id1','id2','id8','id4','id2','id5']}
data2 = {'person': ['id1', 'id2','id5','id12','id21','id7','id5','id3','id8','id1'],
         'date': ['2012-02-03', '2012-05-09', '2012-02-03', '2012-05-09', '2012-02-03','2012-02-22','2012-05-09','2012-02-22','2012-02-22','2012-02-22']}

# person -> set of friends (friendship is recorded in both directions)
dctPersons = {}
for friend1, friend2 in zip(data1["friend1"], data1["friend2"]):
    theSet = dctPersons.get(friend1, set())
    theSet.add(friend2)
    dctPersons[friend1] = theSet
    theSet = dctPersons.get(friend2, set())
    theSet.add(friend1)
    dctPersons[friend2] = theSet
print(dctPersons)

# party date -> set of participants
dctParties = {}
for person, date in zip(data2["person"], data2["date"]):
    theSet = dctParties.get(date, set())
    theSet.add(person)
    dctParties[date] = theSet
print(dctParties)

# person -> mean number of friends present at the parties the person attended
dctMeanFriends = {}
for persID, setFriends in dctPersons.items():
    visitedParties = 0
    friendsAtParty = 0
    for setPersAtParty in dctParties.values():
        if persID in setPersAtParty:
            visitedParties += 1
            friendsAtParty += len(setPersAtParty.intersection(setFriends))
    # a person who visited no parties gets a mean of 0
    dctMeanFriends[persID] = friendsAtParty / visitedParties if visitedParties > 0 else 0
print(dctMeanFriends)

dfMeanFriends = pd.DataFrame.from_dict(dctMeanFriends, orient='index')  # other orient values: 'columns', 'tight'
print(dfMeanFriends)
outputs:
{'id1': {'id2', 'id3', 'id21', 'id5', 'id4'}, 'id3': {'id1', 'id5'}, 'id2': {'id21', 'id1', 'id4', 'id12'}, 'id5': {'id1', 'id3'}, 'id12': {'id2'}, 'id21': {'id1', 'id2'}, 'id4': {'id1', 'id2'}, 'id7': {'id8'}, 'id8': {'id7'}}
{'2012-02-03': {'id1', 'id21', 'id5'}, '2012-05-09': {'id12', 'id2', 'id5'}, '2012-02-22': {'id7', 'id1', 'id8', 'id3'}}
{'id1': 1.5, 'id3': 1.0, 'id2': 1.0, 'id5': 0.5, 'id12': 1.0, 'id21': 1.0, 'id4': 0, 'id7': 1.0, 'id8': 1.0}
        0
id1   1.5
id3   1.0
id2   1.0
id5   0.5
id12  1.0
id21  1.0
id4   0.0
id7   1.0
id8   1.0
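The loop above only visits people who appear in the friendship table, so someone who attends parties but has no recorded friendships would not show up in the result. As a hedged sketch (not part of the original answer), the same logic can be wrapped in a function that covers everyone in either table. The function name `mean_friends_per_person` and the extra attendee `id99` are hypothetical, added only to illustrate that case.

```python
from collections import defaultdict


def mean_friends_per_person(friend_pairs, attendance):
    """Return {person: mean number of friends present at the parties attended}."""
    friends = defaultdict(set)
    for a, b in friend_pairs:
        friends[a].add(b)
        friends[b].add(a)

    parties = defaultdict(set)
    for person, date in attendance:
        parties[date].add(person)

    # Everyone mentioned in either table, not only people with friendships.
    everyone = set(friends) | {person for person, _ in attendance}
    result = {}
    for person in sorted(everyone):
        own_friends = friends.get(person, set())
        counts = [len(attendees & own_friends)
                  for attendees in parties.values() if person in attendees]
        # No parties attended means a mean of 0, as the question requests.
        result[person] = sum(counts) / len(counts) if counts else 0.0
    return result


data1 = {'friend1': ['id1', 'id2', 'id5', 'id12', 'id21', 'id4', 'id7', 'id1', 'id21', 'id3'],
         'friend2': ['id3', 'id1', 'id1', 'id2', 'id1', 'id2', 'id8', 'id4', 'id2', 'id5']}
data2 = {'person': ['id1', 'id2', 'id5', 'id12', 'id21', 'id7', 'id5', 'id3', 'id8', 'id1'],
         'date': ['2012-02-03', '2012-05-09', '2012-02-03', '2012-05-09', '2012-02-03',
                  '2012-02-22', '2012-05-09', '2012-02-22', '2012-02-22', '2012-02-22']}

pairs = list(zip(data1['friend1'], data1['friend2']))
# 'id99' is a hypothetical attendee with no friendships at all.
visits = list(zip(data2['person'], data2['date'])) + [('id99', '2012-05-09')]

result = mean_friends_per_person(pairs, visits)
print(result['id1'], result['id5'], result['id4'], result['id99'])  # 1.5 0.5 0.0 0.0
```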
Below is an update to the code above, motivated by the desire to get the number of code lines down. The extra assignment line was required only because the set.add() method returns None; collecting the friends in lists with + and converting them to sets afterwards removes it. This way nine lines of code could be cut down to four lines:
import pandas as pd

data1 = {'friend1': ['id1', 'id2','id5','id12','id21','id4','id7','id1','id21','id3'],
         'friend2': ['id3', 'id1','id1', 'id2', 'id1','id2','id8','id4', 'id2','id5']}
data2 = {'person': ['id1', 'id2','id5','id12','id21','id7','id5','id3','id8','id1'],
         'date': ['2012-02-03', '2012-05-09', '2012-02-03', '2012-05-09', '2012-02-03','2012-02-22','2012-05-09','2012-02-22','2012-02-22','2012-02-22']}

# person -> list of friends, converted to sets once all pairs are collected
dctPersons = {}
for f1, f2 in zip(data1["friend1"], data1["friend2"]):
    dctPersons[f1] = dctPersons.get(f1, []) + [f2]
    dctPersons[f2] = dctPersons.get(f2, []) + [f1]
dctPersons = {key: set(val) for key, val in dctPersons.items()}
print(dctPersons)

# party date -> list of participants
dctParties = {}
for person, date in zip(data2["person"], data2["date"]):
    dctParties[date] = dctParties.get(date, []) + [person]
print(dctParties)

# person -> mean number of friends present at the parties the person attended
dctMeanFriends = {}
for person, friends in dctPersons.items():
    visitedParties = 0
    friendsAtParty = 0
    for personsAtParty in dctParties.values():
        if person in personsAtParty:
            visitedParties += 1
            friendsAtParty += len(set(personsAtParty).intersection(friends))
    dctMeanFriends[person] = friendsAtParty / visitedParties if visitedParties > 0 else 0
print(dctMeanFriends)

dfMeanFriends = pd.DataFrame.from_dict(dctMeanFriends, orient='index')  # other orient values: 'columns', 'tight'
print(dfMeanFriends)
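As an alternative to the list concatenation above, `dict.setdefault` returns the stored value itself, so `add()` can be called on it directly and the sets never need to be rebuilt. This is a small sketch on three of the friendship rows, not the answerer's code; the name `friends_of` is made up for illustration. `collections.defaultdict(set)` would work similarly.

```python
friend1 = ["id1", "id2", "id5"]
friend2 = ["id3", "id1", "id1"]

friends_of = {}
for f1, f2 in zip(friend1, friend2):
    # setdefault returns the set stored under the key (creating it if missing),
    # so add() can be chained without a separate assignment line.
    friends_of.setdefault(f1, set()).add(f2)
    friends_of.setdefault(f2, set()).add(f1)

print(friends_of["id1"] == {"id2", "id3", "id5"})  # True
```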
A similar solution using pandas. However, since this summary doesn’t account for the parties people skipped because few or no friends were there, I can’t see how any kind of correlation/relationship could be calculated from this summary table.
import pandas as pd

# Sample data extended with edge cases: 'sits_home' has a friend but attends nothing,
# 'no_friends' attends a party without any friendships, and id4 attends a party alone.
data1 = {'friend1': ['id1', 'id2','id5','id12','id21','id4','id7','id1','id21', 'id3', 'id3'],
         'friend2': ['id3', 'id1','id1','id2','id1','id2','id8','id4','id2','id5', 'sits_home']}
data2 = {'person': ['id1', 'id2','id5','id12','id21','id7','id5','id3','id8','id1','id4','no_friends'],
         'date': ['2012-02-03', '2012-05-09', '2012-02-03', '2012-05-09', '2012-02-03','2012-02-22','2012-05-09','2012-02-22','2012-02-22','2012-02-22','2012-02-23','2012-02-22']}
friends = pd.DataFrame(data1)
meetings = pd.DataFrame(data2)

# Everyone mentioned anywhere in the data.
the_friends = set(data1['friend1']) | set(data1['friend2']) | set(data2['person'])
# Pair each person with themselves so every attendee keeps one row per party.
all_friends = {(i, i) for i in the_friends}

# Self-merge on date: one row per ordered pair of people at the same party.
df = meetings.merge(meetings, on='date', how='left')
# Sort each pair so (a, b) and (b, a) compare equal; tuple(set(...)) has no guaranteed order.
df['pairs'] = df.apply(lambda x: tuple(sorted((x.person_x, x.person_y))), axis=1)
friends['pairs'] = friends.apply(lambda x: tuple(sorted((x.friend1, x.friend2))), axis=1)
df['count'] = 1.0
df['friends'] = False
all_friends = all_friends | set(friends.pairs.unique())
df.loc[df.pairs.isin(all_friends), 'friends'] = True

# Friends (plus the self-pair) per person and party, then subtract the self-pair.
result = df.loc[df.friends, ['person_x', 'date', 'count']].groupby(['person_x', 'date']).sum().reset_index()[['person_x', 'count']]
result['count'] = result['count'] - 1
result = result.groupby('person_x').mean()
result.index.name = 'friends'
result.columns = ['Mean number of friends at the party attended']

# People who attended no party at all get a mean of 0.
not_attended = the_friends - set(result.index.values)
for i in not_attended:
    result.loc[i, 'Mean number of friends at the party attended'] = 0.0
print(result)
Output:
Mean number of friends at the party attended
friends
id1 1.5
id12 1.0
id2 1.0
id21 1.0
id3 1.0
id4 0.0
id5 0.5
id7 1.0
id8 1.0
no_friends 0.0
sits_home 0.0
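The concern raised in this answer, that the summary ignores parties a person skipped, could be addressed by building a full person-by-party grid first. The sketch below is an assumption about what "correlation" could mean here, since the question does not define it: for every (person, date) pair it records whether the person attended and how many of their friends were there, then takes the Pearson correlation of those two columns over the whole grid. The column names `friends_present` and `attended` are made up for illustration, and the data is the original sample from the question.

```python
import pandas as pd

data1 = {'friend1': ['id1', 'id2', 'id5', 'id12', 'id21', 'id4', 'id7', 'id1', 'id21', 'id3'],
         'friend2': ['id3', 'id1', 'id1', 'id2', 'id1', 'id2', 'id8', 'id4', 'id2', 'id5']}
data2 = {'person': ['id1', 'id2', 'id5', 'id12', 'id21', 'id7', 'id5', 'id3', 'id8', 'id1'],
         'date': ['2012-02-03', '2012-05-09', '2012-02-03', '2012-05-09', '2012-02-03',
                  '2012-02-22', '2012-05-09', '2012-02-22', '2012-02-22', '2012-02-22']}
friends = pd.DataFrame(data1)
meetings = pd.DataFrame(data2).drop_duplicates()

# Friendships in both directions, so each person appears in the 'person' column.
both = pd.concat([
    friends.rename(columns={'friend1': 'person', 'friend2': 'friend'}),
    friends.rename(columns={'friend2': 'person', 'friend1': 'friend'}),
], ignore_index=True).drop_duplicates()

# For every (person, date): how many of that person's friends attended.
friend_visits = both.merge(meetings.rename(columns={'person': 'friend'}), on='friend')
friends_present = (friend_visits.groupby(['person', 'date']).size()
                   .reset_index(name='friends_present'))

# Full grid of every person against every party, attended or not.
people = sorted(set(both['person']) | set(meetings['person']))
dates = sorted(meetings['date'].unique())
grid = pd.MultiIndex.from_product([people, dates],
                                  names=['person', 'date']).to_frame(index=False)

grid = grid.merge(friends_present, on=['person', 'date'], how='left')
grid['friends_present'] = grid['friends_present'].fillna(0)
grid = grid.merge(meetings.assign(attended=1), on=['person', 'date'], how='left')
grid['attended'] = grid['attended'].fillna(0)

# Assumption: Pearson correlation over all (person, party) pairs.
corr = grid['attended'].corr(grid['friends_present'])
print(grid.head())
print(corr)
```

With only three parties the resulting number says very little; a per-person correlation would need many more parties per person to be meaningful.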