Turning project data into a relationship matrix

Question:

My data set a list of people either working together or alone.

I have have a row for each project and columns with names of all the people who worked on that project. If column 2 is the first empty column given a row it was a solo job, if column 4 is the first empty column given a row then there were 3 people working together.

My goal is to find which people have worked together, and how many times, so I want all pairs in the data set, treating A working with B the same as B working with A.

From this a square N x N would be created with every actor labeling the column and row and in cell (A,B) and (B,A) would have how many times that pair worked together, and this would be done for every pair.

I know of a ‘pretty’ quick way to do it in Excel but I want it automated, hopefully in Stata or Python, just in case projects are added or removed I can just 1-click the re-run and not have to re-do it every time.

An example of the data, in a comma delimited fashion:

A
A,B
B,C,E
B,F
D,F
A,B,C
D,B
E,C,B
X,D,A

Hope that helps!

Brice.
F,D
B
F
F,X,C
C,F,D

Asked By: FJ17

||

Answers:

If I were to keep this project around for a while, I’d implement a database and then create the matrix you’re talking about from a query against that database.

You have a Project table (let’s say) with one record per project, an Actor table with one row per person, and a Participant table with a record per project for each actor that was in that project. (Each record would have an ID, a ProjectID, and an ActorID.)

From your example, you’d have 14 Project records, 7 Actor records (A through F, and X), and 31 Participant records.

Now, with this set up, each cell is a query against this database.

To reconstruct the matrix, first you’d add/update/remove the appropriate records in your database, and then rerun the query.

Answered By: John

Maybe something like this would get you started?

import csv
import collections
import itertools

grid = collections.Counter()

with open("connect.csv", "r", newline="") as fp:
    reader = csv.reader(fp)
    for line in reader:
        # clean empty names
        line = [name.strip() for name in line if name.strip()]
        # count single works
        if len(line) == 1:
            grid[line[0], line[0]] += 1
        # do pairwise counts
        for pair in itertools.combinations(line, 2):
            grid[pair] += 1
            grid[pair[::-1]] += 1

actors = sorted(set(pair[0] for pair in grid))

with open("connection_grid.csv", "w", newline="") as fp:
    writer = csv.writer(fp)
    writer.writerow([''] + actors)
    for actor in actors:
        line = [actor,] + [grid[actor, other] for other in actors]
        writer.writerow(line)

[edit: modified to work under Python 3.2]

The key modules are (1)csv, which makes reading and writing csv files much simpler; (2) collections, which provides an object called a Counter — like a defaultdict(int), which you could use if your Python doesn’t have Counter, it’s a dictionary which automatically generates default values so you don’t have to, and here the default count is 0; and (3) itertools, which has a combinations function to get all the pairs.

which produces

,A,B,C,D,E,F,X
A,1,2,1,1,0,0,1
B,2,1,3,1,2,1,0
C,1,3,0,1,2,2,1
D,1,1,1,0,0,3,1
E,0,2,2,0,0,0,0
F,0,1,2,3,0,1,1
X,1,0,1,1,0,1,0

You could use itertools.product to make building the array a little more compact, but since it’s only a line or two I figured it was as simple to do it manually.

Answered By: DSM

I suggest using Python Pandas for this. It enables a slick solutions for formatting your adjacency matrix, and it will make any statistical calculations much easier too. You can also directly extract the matrix of values into a NumPy array, for doing eigenvalue decompositions or other graph-theoretical procedures on the group clusters if needed later.

I assume that the example data you listed is saved into a file called projects_data.csv (it doesn’t need to actually be a .csv file though). I also assume no blank lines between each observations, but this is all just file organization details.

Here’s my code for this:

# File I/O part
import itertools, pandas, numpy as np
with open("projects_data.csv") as tmp:
    lines = tmp.readlines()
lines = [line.split('n')[0].split(',') for line in lines]
# Unique letters
s = set(list(itertools.chain(*lines)))


# Actual work.
df = pandas.DataFrame(
                      np.zeros((len(s),len(s))), 
                      columns=sorted(list(s)), 
                      index=sorted(list(s))
                     )

for line in lines:
    if len(line) == 1:
        df.ix[line[0],line[0]] += 1 # Single-person projects
    elif len(line) > 1:
        # Get all pairs in multi-person project.
        tmp_pairs = list(itertools.combinations(line, 2))

        # Append pair reversals to update (i,j) and (j,i) for each pair.
        tmp_pairs = tmp_pairs + [pair[::-1] for pair in tmp_pairs]
        for pair in tmp_pairs:
            df.ix[pair[0], pair[1]] +=1
            # Uncomment below if you don't want the list
            # comprehension method for getting the reverals.
            #df.ix[pair[1], pair[0]] +=1 

# Final product
print df.to_string()

   A  B  C  D  E  F  X
A  1  2  1  1  0  0  1
B  2  1  3  1  2  1  0
C  1  3  0  1  2  2  1
D  1  1  1  0  0  3  1
E  0  2  2  0  0  0  0
F  0  1  2  3  0  1  1
X  1  0  1  1  0  1  0

Now you can do a lot of stuff for free, like see the total number of project partners (repeats included) for each participant:

>>> df.sum()
A     6
B    10
C    10
D     7
E     4
F     8
X     4
Answered By: ely

I guess that you don’t have thousands of people working together in these projects. This implementation is pretty simple.

fp = open('projects.cvs')

# counts how many times each pair worked together
pairs = {}

# each element of `project` is a person
for project in (p[:-1].split(',') for p in fp):
    project.sort()

    # someone is alone here
    if len(project) == 1:
        continue

    # iterate over each pair 
    for i in range(len(project)):
        for j in range(i+1, len(project)):
            pair = (project[i], project[j])
            # increase `pairs` counter
            pairs[pair] = pairs.get(pair, 0) + 1

from pprint import pprint
pprint(pairs)

It outputs:

{('A', 'B'): 1,
 ('B', 'C'): 2,
 ('B', 'D'): 1,
 ('B', 'E'): 1,
 ('B', 'F'): 2,
 ('C', 'E'): 1,
 ('C', 'F'): 1,
 ('D', 'F'): 1}
Answered By: Pedro Lacerda