Count frequency of elements from a given list taken from a dataframe column based on a condition from another column

Question:

I have a data frame df.

dataframe link

It has columns for Job Title and Job Description. Job Description contains the full description for a particular Job Title, including the required skills. I am trying to get a dataframe with two columns, skills and frequency, for the role of ‘Data Scientist’.
Skills should be: Python, R, C/C++, Java, Machine Learning, Deep Learning, NLP, Power BI, Tableau, Excel, spark, aws, MS Access, LMS, SAS.

My take on this was:

(df
 .loc[df['Job Title'].eq('Data Scientist'), 'Job Description']
 .str.split(r',\s*')
 .explode()
 .value_counts()
)

But this gives frequency of each word from the description.

Another way I tried:

test = df['Job Description'].apply(lambda x: 1 if 'python' in x.lower() else 0)
test.value_counts()

This gives count when each skill is fed one by one.
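The one-by-one idea can be generalized by looping over the skill list with `str.contains`. A minimal sketch, using a small hypothetical DataFrame in place of the linked file (note that, like the lambda above, this does substring matching, so a short name like ‘R’ would over-count):

```python
import pandas as pd

skills = ['Python', 'C/C++', 'Java', 'Machine Learning']

# Hypothetical data standing in for the linked file
df = pd.DataFrame({
    'Job Title': ['Data Scientist', 'Data Scientist'],
    'Job Description': ['Python and Machine Learning required.',
                        'Java or C/C++ plus Python.'],
})

desc = df.loc[df['Job Title'].eq('Data Scientist'), 'Job Description']

# Count descriptions mentioning each skill;
# regex=False so 'C/C++' is taken literally, not as a pattern
freq = pd.Series(
    {s: desc.str.contains(s, case=False, regex=False).sum() for s in skills},
    name='frequency',
)
print(freq)
```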

Asked By: Atif


Answers:

I couldn’t check the online file, but if I understand correctly you can use a regex to extract the terms:

l = ['Python', 'R', 'C/C++', 'Java', 'Machine Learning', 'Deep Learning', 'NLP', 'Power BI', 'Tableau', 'Excel', 'spark', 'aws', 'MS Access', 'LMS', 'SAS']

import re

regex = '|'.join(map(re.escape, l))
# 'Python|R|C/C\+\+|Java|Machine Learning|Deep Learning|NLP|Power BI|Tableau|Excel|spark|aws|MS Access|LMS|SAS'

(df
 .set_index('Job Title') # set as index what you want to group on later
 ['Job Description']
 .str.extractall(f'({regex})')[0]
 .groupby(level=0)
 .value_counts()
)
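As a minimal, self-contained sketch of the approach (using a tiny hypothetical DataFrame, since the linked file isn’t available):

```python
import re
import pandas as pd

skills = ['Python', 'Machine Learning', 'NLP', 'Tableau', 'SAS', 'Excel', 'Power BI']

# Hypothetical data standing in for the linked file
df = pd.DataFrame({
    'Job Title': ['Data Scientist', 'Data Scientist', 'Data Analyst'],
    'Job Description': [
        'Python, Machine Learning and NLP experience required.',
        'Python, SAS and Tableau skills preferred.',
        'Excel and Power BI reporting.',
    ],
})

regex = '|'.join(map(re.escape, skills))

counts = (df
          .set_index('Job Title')['Job Description']
          .str.extractall(f'({regex})')[0]  # one row per skill match
          .groupby(level=0)                 # group by Job Title
          .value_counts())
print(counts)
```

One caveat: a single-letter term like ‘R’ will also match inside longer words (e.g. the R in ‘Requires’); wrapping the alternation in `\b` word boundaries is one way to tighten that.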
Answered By: mozway