How to count people who are below a position
Question:
I’m looking to count how many people are below a given user of the data frame.
Employee | Manager |
---|---|
A | – |
B | A |
C | A |
D | A |
E | A |
F | B |
G | B |
H | C |
I | C |
I would like to get in the output:
I, H, G, F, E and D have no employees below
C has two employees (H and I) below it
B has two employees (F and G)
A has eight employees below him (B, C, D and E plus the employees of B and C)
Would anyone have any suggestions?
In my DF I have more hierarchy layers and a very large amount of data.
I thought about storing it in a dictionary and doing a loop to update it, but I believe that this solution is not efficient at all. I would like to know if there is any more efficient technique to solve this type of problem.
Answers:
I would use a directed graph with networkx. This is a super fun Python package.
import networkx as nx
import numpy as np  # needed for np.where below
import pandas as pd
#set up data
employee = ['A', 'B', 'C','D','E','F','G','H','I']
manager = ['', 'A', 'A','A','A','B','B','C','C']
relations = pd.DataFrame(list(zip(employee,manager)), columns = ['Employee', 'Manager'])
# If there is no manager, make it the employee
relations.Manager = np.where(relations.Manager == '', relations.Employee, relations.Manager)
# or might need depending on data format:
relations.Manager = np.where(relations.Manager.isna(), relations.Employee, relations.Manager)
# Create tuples for 'edges'
relations['edge'] = list(zip(relations.Manager, relations.Employee))
# Create graph
G = nx.DiGraph()
G.add_nodes_from(list(relations.Employee))
G.add_edges_from(list(set(relations.edge)))
#Find all the descendants of nodes/employees
relations['employees_below'] = relations.apply(lambda row: nx.descendants(G,row.Employee), axis = 1)
returns:
Employee Manager edge employees_below
0 A A (A, A) {C, G, I, D, H, F, E, B}
1 B A (A, B) {F, G}
2 C A (A, C) {H, I}
3 D A (A, D) {}
4 E A (A, E) {}
5 F B (B, F) {}
6 G B (B, G) {}
7 H C (C, H) {}
8 I C (C, I) {}
The way it works: a graph consists of nodes and edges. In this case, your nodes are employees and each edge is the relationship between a manager and an employee. Do a quick Google image search for ‘networkx directed graph’ and you’ll get an idea of what this looks like visually.
- Make sure your data is cleaned up so that everyone has a manager (make it themselves if there is none, for example).
- First, create your edges in the form of a tuple of (manager, employee) and save it somewhere (I chose to make it a column in the df called edge).
- Next, make a directed graph in networkx. A directed graph is needed because of the hierarchical relationship: each edge points from manager to employee.
- Add every employee to your graph as a ‘node’.
- Add every employee-manager relationship to your graph as an edge, using the (manager, employee) tuples created previously.
- Lastly, you can get an employee’s subordinates by finding all of that node’s descendants. Descendants are all nodes (i.e., employees) that can be reached from a node (i.e., employee). I chose to assign this to a column by calling the descendants function on the employee in each row with apply.
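The steps above give the set of subordinates; the question asks for a count, which is just the size of that set. Here is a minimal sketch of that, not a definitive implementation. It assumes the top-level row has an empty string as its manager and simply skips that row when building edges (my own shortcut, instead of the self-loop used above). The column name n_below and the dict counts are made up for illustration.

```python
import networkx as nx
import pandas as pd

relations = pd.DataFrame({
    "Employee": list("ABCDEFGHI"),
    "Manager": ["", "A", "A", "A", "A", "B", "B", "C", "C"],
})

# Assumption: an empty string means "no manager", so that row contributes no edge.
has_manager = relations["Manager"] != ""

G = nx.DiGraph()
G.add_nodes_from(relations["Employee"])
# Each edge points from manager to employee.
G.add_edges_from(zip(relations.loc[has_manager, "Manager"],
                     relations.loc[has_manager, "Employee"]))

# nx.descendants returns the set of nodes reachable from the given node,
# so its length is the number of people below that employee.
relations["n_below"] = [len(nx.descendants(G, e)) for e in relations["Employee"]]

counts = dict(zip(relations["Employee"], relations["n_below"]))
print(counts)
```

For the sample data this gives 8 for A, 2 each for B and C, and 0 for everyone else, matching the output requested in the question.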
As originally mentioned by @34jbonz, networkx is the best tool for the task. There is, however, no need to preprocess the data, as networkx provides a pandas interface:
G = nx.from_pandas_edgelist(temp, source='manager',target='employee',create_using=nx.DiGraph)
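Also, the use of apply and descendants should be avoided, as it results in some calculations being done multiple times. Here, a post-order depth-first search is the most efficient solution, because every node is visited after its children and can reuse their already-computed totals: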
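# '-' is the placeholder manager of the top-level employee, i.e. the root of the tree
for node in nx.dfs_postorder_nodes(G, '-'):
    successors = list(G.successors(node))
    # direct reports plus everything already counted below each of them
    G.nodes[node]['size'] = sum(G.nodes[p]['size'] for p in successors) + len(successors)
    G.nodes[node]['descendants'] = [s for sn in successors for s in G.nodes[sn]['descendants']] + successors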
Finally, information can be extracted in bulk from a networkx graph as a dict, which in turn can be transformed into a DataFrame:
pd.DataFrame.from_dict(dict(G.nodes(data=True)),orient='index')
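The fragments of this second answer can be put together into one runnable sketch. The temp frame below is a hypothetical stand-in for the answer's undefined temp, with lowercase column names and '-' as the "no manager" marker, both assumed from the snippets above. The names children, sizes and employees_below are mine, and only the count is tracked here, not the list of descendants.

```python
import networkx as nx
import pandas as pd

# Hypothetical stand-in for the answer's `temp`; '-' marks "no manager".
temp = pd.DataFrame({
    "employee": list("ABCDEFGHI"),
    "manager": ["-", "A", "A", "A", "A", "B", "B", "C", "C"],
})

# Build the manager -> employee graph straight from the DataFrame.
G = nx.from_pandas_edgelist(temp, source="manager", target="employee",
                            create_using=nx.DiGraph)

# Post-order visits every child before its parent, so each child's 'size'
# is already set when the parent is processed. Each node is handled once.
for node in nx.dfs_postorder_nodes(G, "-"):
    children = list(G.successors(node))
    G.nodes[node]["size"] = sum(G.nodes[c]["size"] for c in children) + len(children)

# Collect the counts and drop the artificial '-' root.
sizes = pd.Series(nx.get_node_attributes(G, "size"), name="employees_below").drop("-")
print(sizes.sort_index())
```

With the sample hierarchy, A ends up with 8, B and C with 2 each, and the remaining employees with 0. This assumes the data really is a tree with a single root; employees not reachable from '-' would never be visited and would have no size attribute.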