Looking for a way to count keyword occurrences per sector in a dataframe

Question:

I have a dataframe with the following columns:
[company_name, company_sector, company_country]

There are 10 unique sectors: Business Services, Finance Services, Technology, etc.

I also have a list of keywords, for example keywords = ['services', 'holdings', 'group', 'manufacture'], and so on.

I am looking for a way to count how many times each keyword occurs in company_name and attribute that count to the company's company_sector, so that I end up with one count per sector and keyword.

For example:
If there is a company "Atlantic Navigation Holdings (S) Limited" that belongs to the sector Industrials, then Industrials gets a count of 1 for the keyword holdings. (I have already converted both the keywords and the company names to lowercase.)

Asked By: Justyna Rumpca

Answers:

You can use groupby from pandas [1] to split the dataframe by sector. Then, for each sector's group, loop over the keywords and count how many names contain each one.

I used a defaultdict [2] to collect the counts per sector before turning them into the new dataframe.

[1] https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html

[2] https://docs.python.org/3/library/collections.html#collections.defaultdict
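
As a small illustration of why defaultdict(list) is convenient for this, here is a minimal sketch. The key names and values are made up for the example; the point is only that a missing key starts out as an empty list, so you can append without first checking whether the key exists.

```python
from collections import defaultdict

# A missing key is created on first access with an empty list,
# so values can be appended without an "if key in dict" check.
counts_by_sector = defaultdict(list)
counts_by_sector["one"].append(1)
counts_by_sector["one"].append(2)
counts_by_sector["two"].append(1)

print(dict(counts_by_sector))  # {'one': [1, 2], 'two': [1]}
```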

import pandas as pd

from collections import defaultdict

# dictionary with fake data
d = {'sector': ['one', 'one', 'two', 'two' , 'one'], 'name': ['a', 'b', 'b', 'a', 'b']}
# convert dictionary to pandas DataFrame
df = pd.DataFrame(d)
# df now looks like this:
#   sector name
# 0    one    a
# 1    one    b
# 2    two    b
# 3    two    a
# 4    one    b

keywords = ['a', 'b', 'c']

# create empty dictionary
new_d = defaultdict(list)

for key, group in df.groupby('sector'):
    for k in keywords:
        new_d[key].append(sum(group['name'].str.contains(k)))
pd.DataFrame(new_d, index=keywords)

# Result:
#    one  two
# a    1    1
# b    2    1
# c    0    0

In the resulting dataframe the keywords form the index and the sectors form the columns.
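
The same kind of table can probably also be built without explicit loops over the groups. The following is only a sketch under assumptions: it uses the question's column names (company_name, company_sector), made-up sample rows, and a hypothetical helper name, count_keywords_by_sector. It puts the sectors on the rows and the keywords on the columns, which is the transpose of the table above.

```python
import pandas as pd


def count_keywords_by_sector(df, keywords):
    """Return a sectors x keywords table of how many company names contain each keyword."""
    names = df["company_name"].str.lower()
    # One boolean column per keyword: True where the name contains the keyword.
    # regex=False treats the keyword as a plain substring; na=False counts
    # missing names as "no match".
    flags = pd.DataFrame(
        {kw: names.str.contains(kw, regex=False, na=False) for kw in keywords},
        index=df.index,
    )
    # Summing the booleans within each sector counts the matching companies.
    return flags.groupby(df["company_sector"]).sum().astype(int)


# Made-up sample data, for illustration only.
sample = pd.DataFrame({
    "company_name": [
        "Atlantic Navigation Holdings (S) Limited",
        "Acme Group Holdings",
        "Northern Business Services",
    ],
    "company_sector": ["Industrials", "Industrials", "Business Services"],
})
keywords = ["services", "holdings", "group", "manufacture"]

counts = count_keywords_by_sector(sample, keywords)
print(counts)
```

Call counts.T if you prefer the keywords as the index, as in the loop version.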

Answered By: 3dSpatialUser

  1. First, create a skeleton dataframe with one row per sector and one column per keyword, filled with 0:

    counts_df = pd.DataFrame(0, index=df['company_sector'].unique(), columns=keywords)

  2. Iterate over the dataframe and check whether each keyword is in company_name; if it is, increment the matching cell:

    for _, row in df.iterrows():
        for keyword in keywords:
            if keyword in row['company_name']:
                counts_df.loc[row['company_sector'], keyword] += 1

    counts_df
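
To check the two steps against the example from the question, here is a minimal self-contained sketch. The function name count_keywords_loop is hypothetical, the country value is a placeholder, and the single row is the company quoted in the question, lowercased as the asker describes.

```python
import pandas as pd


def count_keywords_loop(df, keywords):
    """Count, per sector, the company names that contain each keyword."""
    # Sectors as rows, keywords as columns, every cell starting at integer 0.
    counts = pd.DataFrame(0, index=df["company_sector"].unique(), columns=keywords)
    for _, row in df.iterrows():
        for keyword in keywords:
            # Plain substring test on the (already lowercased) company name.
            if keyword in row["company_name"]:
                counts.loc[row["company_sector"], keyword] += 1
    return counts


# The company from the question; the country value is a placeholder.
df = pd.DataFrame({
    "company_name": ["atlantic navigation holdings (s) limited"],
    "company_sector": ["industrials"],
    "company_country": ["unknown"],
})
keywords = ["services", "holdings", "group", "manufacture"]

counts_df = count_keywords_loop(df, keywords)
print(counts_df.loc["industrials", "holdings"])  # 1
```

Because `in` is a plain substring test, a keyword such as group would also be counted in a name like groupon, so the counts are "names containing the text", not "names containing the word".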

Answered By: Kas