How to count people who are below a position

Question:

I’m looking to count how many people are below a given user of the data frame.

Employee Manager
A
B A
C A
D A
E A
F B
G B
H C
I C

I would like to get in the output:
I, H, G, F, E and D have no employees below
C has two employees (H and I) below it
B has two employees (F and G)
A has eight employees below him (B, C, D and E plus the employees of B and C)

Would anyone have any suggestions?
In my DF I have more hierarchy layers and a very large amount of data.

I thought about storing it in a dictionary and doing a loop to update it, but I believe that this solution is not efficient at all. I would like to know if there is any more efficient technique to solve this type of problem.

Answers:

I would use a directed graph with networkx. This is a super fun python package.

import networkx as nx, pandas as pd

#set up data
employee = ['A', 'B', 'C','D','E','F','G','H','I']
manager = ['', 'A', 'A','A','A','B','B','C','C']
relations = pd.DataFrame(list(zip(employee,manager)), columns = ['Employee', 'Manager'])

# If there is no manager, make it the employee
relations.Manager = np.where(relations.Manager == '', relations.Employee, relations.Manager)
# or might need depending on data format:
relations.Manager = np.where(relations.Manager.isna(), relations.Employee, relations.Manager)

# Create tuples for 'edges'
relations['edge'] = list(zip(relations.Manager, relations.Employee))

# Create graph
G = nx.DiGraph()
G.add_nodes_from(list(relations.Employee))
G.add_edges_from(list(set(relations.edge)))

#Find all the descendants of nodes/employees
relations['employees_below'] = relations.apply(lambda row: nx.descendants(G,row.Employee), axis = 1)

returns:

  Employee Manager    edge           employees_below
0        A       A  (A, A)  {C, G, I, D, H, F, E, B}
1        B       A  (A, B)                    {F, G}
2        C       A  (A, C)                    {H, I}
3        D       A  (A, D)                        {}
4        E       A  (A, E)                        {}
5        F       B  (B, F)                        {}
6        G       B  (B, G)                        {}
7        H       C  (C, H)                        {}
8        I       C  (C, I)                        {}

The way it works: graphs are nodes and edges. In this case, your nodes are employees and your edges are a relationship between a manager and an employee. Do a quick google for ‘networkx directed graph’ images and you’ll get the idea of what this looks like in an image representation.

  • Make sure your data is cleaned up where everyone has a manager (make it themselves if there is none, for example)
  • First, create your edges in the form of a tuple of (manager, employee) and save it somewhere (I chose to make it a column in the df called edges).
  • Next, make a directed graph in networkx. A directed graph is needed due to the hierarchical relationship. this means that relationships work down from manager to employee. So, in this case, each edge goes in a direction from manager to employee.
  • Add every employee to your graph as a ‘node’.
  • Add every employee-manager relationship to your graph as an edge, using the pre-defined tuples of (manager, employee) discussed previously.
  • Lastly, you can get the output of an employee’s subordinates by finding all this node’s descendants. Descendants are all nodes (ie, employees) that can be reached from a node (ie, employee). I chose to assign this to a column and apply the descendants function to the employee in each row with apply.
Answered By: 34jbonz

As originally mentioned by @34jbonz networkx is the best tool for the task. There is however no need to preprocess the data as networkx provides a pandas interface

G = nx.from_pandas_edgelist(temp, source='manager',target='employee',create_using=nx.DiGraph)

also the use of apply and descendants should be avoided as it results in some calculations being done multiple times. Here a depth first search is the most efficient solution

for node in nx.dfs_postorder_nodes(G,'-'):
    successors = list(G.successors(node))
    G.nodes[node]['size'] = sum([G.nodes[p]['size'] for p in successors]) + len(successors)
    G.nodes[node]['descendants'] = [s for sn in successors for s in G.nodes[sn]['descendants']]
        + successors

finally information can be extracted in bulk from a networkx graph as a dict, which in turn can be transformed into a dataframe

pd.DataFrame.from_dict(dict(G.nodes(data=True)),orient='index')
Answered By: Arnau