Creating dfs from a nested dictionary and counting strings to create a plot with subplots

Question:

I have a df that looks like this:

import pandas as pd

df = pd.DataFrame({
    'job_title': ['Senior Data Scientist', 'Junior Data Analyst', 'Data Engineer Intern', 'Data Engieneer/Architect', 'Data Analyst/Visualisation'],
    'location': ['Berlin', 'Frankfurt', 'Munich', 'Frankfurt', 'Munich'],
    'job_desc': ['something something Python, R, Tableau something', 'something R and SQL',
                 'something Power Bi, Excel and Python', 'something Power Bi, Excel and Python somthing',
                 'Power BI and R something']})
        

My objective is to plot the skills that appear in the job descriptions in the job_desc column, depending on the job title in job_title. What is important is that the job titles in job_title need to be filtered somehow according to the three roles I mention below.

My idea was to do the following:

  1. create sub data frames according to the job title for Data Scientist, Data Analyst and Data Engineer
  2. create new dfs from those sub data frames that count the skills in job_desc
  3. plot the skills in a bar plot with three subplots, one per role

To do this I have done the following:

1.)

# creating sub datasets according to the three roles above to look further into the different skillset

# data analyst dataset
dfa = df[df['job_title'].str.contains('Data Ana')]

# data scientist dataset
dfs = df[df['job_title'].str.contains('Data Sci')]

# data engineer dataset
dfe = df[df['job_title'].str.contains('Data Eng')]

2.) Here I created a loop and stored the obtained information in a nested dictionary. At first I tried to directly store the data from the loop in new data frames, but I read here that it is best to do so using dictionaries.

# looping through each sub dataset to get the skill count
frames = [dfa, dfs, dfe]  # not named `list`, which would shadow the built-in

#creating an empty dictionary to store the new information in
dict_of_df = {}

for li in range(len(frames)):

    # counting the skills in each df of the list
    # note: 'R ' only matches an R followed by a space (so 'R,' is missed),
    # and 'Power BI' is case-sensitive (so 'Power Bi' is not counted)
    python = frames[li].job_desc.str.count('Python').sum()
    R = frames[li].job_desc.str.count('R ').sum()
    tableau = frames[li].job_desc.str.count('Tableau').sum()
    pbi = frames[li].job_desc.str.count('Power BI').sum()
    excel = frames[li].job_desc.str.count('Excel').sum()
    sql = frames[li].job_desc.str.count('SQL').sum()
    
    #creating a dictionary with the skills and their counts
    # the counts are listed in the same order as the skill names
    skills = ['python', 'R', 'pbi', 'tableau', 'excel', 'sql']
    counts = [python, R, pbi, tableau, excel, sql]
    dic = {'Skills': skills, 'Counts': counts}
        
    #appending the information in the empty dictionary
    dict_of_df['df_{}'.format(li)] = dic

This results in the following output:

dict_of_df = {'df_0': {'Skills': ['python', 'R', 'pbi', 'tableau', 'excel', 'sql'], 'Counts': [0, 2, 1, 0, 0, 1]}, 'df_1': {'Skills': ['python', 'R', 'pbi', 'tableau', 'excel', 'sql'], 'Counts': [1, 0, 0, 1, 0, 0]}, 'df_2': {'Skills': ['python', 'R', 'pbi', 'tableau', 'excel', 'sql'], 'Counts': [2, 0, 0, 0, 2, 0]}}

The dictionary contains the correct information, and my desired output would be three dfs, built from df_0, df_1 and df_2, in this format:

    Skills  Counts
0   python       0
1        R       1
2      pbi       0
3  tableau       0
4    excel       0
5      sql       1

However, I have not been able to do this. I tried to apply what I found in these posts:

Creating multiple dataframes from a dictionary in a loop

Construct pandas DataFrame from items in nested dictionary

Construct a pandas DataFrame from items in a nested dictionary with lists as inner values

Python Pandas: Convert nested dictionary to dataframe

However, all of the above posts deal with different dictionary structures, as mine seems to be doubly nested. I also have the impression that my approach may be overcomplicating things.
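For reference, each inner dictionary above already maps a column name to a list of equal length, which is a shape that pd.DataFrame accepts directly. A minimal sketch, assuming only one entry of dict_of_df for brevity (the other two have the same structure) and using the hypothetical name frames_by_key for the result:

```python
import pandas as pd

# One entry copied from the dictionary shown above; 'df_0' and 'df_1'
# have the same {'Skills': [...], 'Counts': [...]} structure.
dict_of_df = {
    'df_2': {
        'Skills': ['python', 'R', 'pbi', 'tableau', 'excel', 'sql'],
        'Counts': [2, 0, 0, 0, 2, 0],
    },
}

# Each inner dict maps column name -> list of values, so it can be
# passed to pd.DataFrame as-is. The result is a dict of DataFrames
# keyed by the original names.
frames_by_key = {name: pd.DataFrame(inner) for name, inner in dict_of_df.items()}

print(frames_by_key['df_2'])
```

Keeping the frames in a dictionary avoids creating one variable per frame, and each frame is then available as frames_by_key['df_0'], frames_by_key['df_1'] and so on.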

Asked By: twoRay

Answers:

Don’t overcomplicate things. Here is a simplified approach:

skills = ['python', 'R', 'pbi', 'tableau', 'excel', 'sql']
pattern = r'(?i)\b(%s)\b' % '|'.join(skills)

s = df.set_index('job_title')['job_desc'].str.extractall(pattern)[0].droplevel(1) # -- step 1
s = pd.crosstab(s.index, s, rownames=['job_title'], colnames=['skills']) # -- step 2

Explained

Build a regex pattern from the skills, then use extractall to find all matching occurrences in each row of the job description column.

# -- step 1

job_title
Senior Data Scientist     Python
Senior Data Scientist          R
Senior Data Scientist    Tableau
Junior Data Analyst            R
Junior Data Analyst          SQL
Data Engineer Intern       Excel
Data Engineer Intern      Python
Name: 0, dtype: object

Create a frequency table using crosstab

# -- step 2

skills                 Excel  Python  R  SQL  Tableau
job_title                                            
Data Engineer Intern       1       1  0    0        0
Junior Data Analyst        0       0  1    1        0
Senior Data Scientist      0       1  1    0        1
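The table above is keyed by the raw job title, while the question asks for three roles. One possible way to get there, sketched under a few assumptions, is to derive a role column first and cross-tabulate on that. The roles mapping, the role column and the skill spellings below are illustrative choices, not part of the original code; in particular 'Power BI' is spelled out because 'pbi' never occurs literally in the descriptions.

```python
import re
import pandas as pd

df = pd.DataFrame({
    'job_title': ['Senior Data Scientist', 'Junior Data Analyst', 'Data Engineer Intern',
                  'Data Engieneer/Architect', 'Data Analyst/Visualisation'],
    'job_desc': ['something something Python, R, Tableau something', 'something R and SQL',
                 'something Power Bi, Excel and Python',
                 'something Power Bi, Excel and Python somthing', 'Power BI and R something'],
})

# Assumed mapping from a title fragment (same fragments as in the question) to a role label.
roles = {'Data Sci': 'Data Scientist', 'Data Ana': 'Data Analyst', 'Data Eng': 'Data Engineer'}
df['role'] = df['job_title'].str.extract('(%s)' % '|'.join(roles), expand=False).map(roles)

# Skill names spelled the way they appear in the text (an assumption).
skills = ['Python', 'R', 'Tableau', 'Power BI', 'Excel', 'SQL']
pattern = r'\b(%s)\b' % '|'.join(map(re.escape, skills))

# One row per match; lower-case so 'Power Bi' and 'Power BI' are counted together.
found = (df['job_desc'].str.extractall(pattern, flags=re.IGNORECASE)[0]
           .str.lower()
           .droplevel('match')
           .rename('skills'))

# Attach the role of each matched row, then count role/skill pairs.
joined = df[['role']].join(found, how='inner').reset_index(drop=True)
table = pd.crosstab(joined['role'], joined['skills'])
print(table)
```

Matching case-insensitively and lower-casing the result keeps spelling variants of the same skill in a single column.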

That’s it. Depending on how you want to visualize the above data, you can use either a bar plot or a heatmap. Personally, I would prefer a heatmap:

import matplotlib.pyplot as plt
import seaborn as sns

# s is the frequency table from step 2
sns.heatmap(s, cmap='Blues')
plt.show()
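If bar plots are preferred, as in the question's original plan, a rough sketch with one subplot per row of the frequency table could look like the following. The table is rebuilt by hand from the step 2 output so that the snippet runs on its own; the figure size and axis label are arbitrary choices.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Frequency table copied from the step 2 output above.
s = pd.DataFrame(
    {'Excel': [1, 0, 0], 'Python': [1, 0, 1], 'R': [0, 1, 1],
     'SQL': [0, 1, 0], 'Tableau': [0, 0, 1]},
    index=pd.Index(['Data Engineer Intern', 'Junior Data Analyst',
                    'Senior Data Scientist'], name='job_title'),
)

# One subplot per row of the table, sharing the y-axis so the bars are comparable.
fig, axes = plt.subplots(1, len(s), figsize=(12, 3), sharey=True)
for ax, (title, counts) in zip(axes, s.iterrows()):
    counts.plot.bar(ax=ax, title=title)
    ax.set_ylabel('count')
fig.tight_layout()
```

Calling plt.show() or fig.savefig(...) afterwards displays or stores the figure.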

Answered By: Shubham Sharma