Split dataframe column based on specific list of words
Question:
Is it possible to split strings from a dataframe column based on a list of words?
For example: There is a dataframe with a column Company
, each record includes the company name, a legal form, and sometimes additional information after the legal form like ‘electronics’.
Company
XYZ ltd electronics
ABC ABC inc iron
AB XY Z inc
CD EF GHI JK llc incident
I have list with 1500 worldwide legal form for companies (inc, ltd, …). I would like to split the string in the dataframe column, based on this legal form list for example:
['gmbh', 'ltd', 'inc', 'srl', 'spa', 'co', 'sa', 'ag', 'kg', 'ab', 'spol', 'sasu', 'sas', 'pvt', 'sarl', 'gmbh & co kg', 'llc', 'ilc', 'corp', 'ltda', 'coltd', 'se', 'as', 'sp zoo', 'plc', 'pvtltd', 'og', 'gen']
In other words, to separate everything before and after the words in the list to new columns. This is the desired output:
Company
Legal form
Addition
XYZ
ltd
electronics
ABC ABC
inc
iron
AB XY Z
inc
CD EF GHI JK
llc
incident
Note that "inc" appears in the middle, at the end, and also part of a word in the various company name examples.
Answers:
Assuming you are just trying to string split after spaces you could try something like this:
import re
df = pd.DataFrame({'Company': ['XYZ ltd electronics', 'ABC ABC inc iron', 'AB XY Z inc', 'CD EF GHI JK llc chicago']}, columns=['Company'])
df['Addition'] = df['Company'].apply(lambda x: re.split('(ltd|inc|llc)', x))
print(df)
Company Addition
0 XYZ ltd electronics [XYZ , ltd, electronics]
1 ABC ABC inc iron [ABC ABC , inc, iron]
2 AB XY Z inc [AB XY Z , inc, ]
3 CD EF GHI JK llc chicago [CD EF GHI JK , llc, chicago]
You could use regular expression (Regex) to filter out the legal form. Each legal form is in this format: s
legalforms
s
means the legal form is preceded by and ended with a whitespace. Because I have appended all company names with a white space, so the legal form can be at the end as well. The data is processed in lowercase, then converted back to Title Case. So try this:
import pandas as pd
import re
legal_forms = '(sgmbhs|sltds|sincs|ssrls|sspas|scos|ssas|sags|skgs|sabs|sspols|ssasus|ssass|spvts|ssarls|sgmbhs&scoskgs|sllcs|silcs|scorps|sltdas|scoltds|sses|sass|sspszoos|splcs|spvtltds|sogs|sgens)'
df = pd.DataFrame({'Company': ['XYZ ltd electronics', 'ABC ABC inc iron', 'AB XY Z inc', 'CD EF GHI JK llc incident']}, columns=['Company'])
df['Coy']= df['Company'].apply(lambda x: [e.strip() for e in re.split(legal_forms, x.lower()+' ')])
print(df)
This will create a list for each company name, separated by the legal form
Company Coy
0 XYZ ltd electronics [xyz, ltd, electronics]
1 ABC ABC inc iron [abc abc, inc, iron]
2 AB XY Z inc [ab xy z, inc, ]
3 CD EF GHI JK llc incident [cd ef ghi jk, llc, incident]
After that you can split them into 3 separate columns:
df1 = pd.DataFrame(df['Coy'].tolist(), columns=['Company', 'Legal form', 'Addition'])
for col in df1.columns:
df1[col] = df1[col].str.title()
print(df1)
Output:
Company Legal form Addition
0 Xyz Ltd Electronics
1 Abc Abc Inc Iron
2 Ab Xy Z Inc
3 Cd Ef Ghi Jk Llc Incident
Is it possible to split strings from a dataframe column based on a list of words?
For example: There is a dataframe with a column Company
, each record includes the company name, a legal form, and sometimes additional information after the legal form like ‘electronics’.
Company |
---|
XYZ ltd electronics |
ABC ABC inc iron |
AB XY Z inc |
CD EF GHI JK llc incident |
I have list with 1500 worldwide legal form for companies (inc, ltd, …). I would like to split the string in the dataframe column, based on this legal form list for example:
['gmbh', 'ltd', 'inc', 'srl', 'spa', 'co', 'sa', 'ag', 'kg', 'ab', 'spol', 'sasu', 'sas', 'pvt', 'sarl', 'gmbh & co kg', 'llc', 'ilc', 'corp', 'ltda', 'coltd', 'se', 'as', 'sp zoo', 'plc', 'pvtltd', 'og', 'gen']
In other words, to separate everything before and after the words in the list to new columns. This is the desired output:
Company | Legal form | Addition |
---|---|---|
XYZ | ltd | electronics |
ABC ABC | inc | iron |
AB XY Z | inc | |
CD EF GHI JK | llc | incident |
Note that "inc" appears in the middle, at the end, and also part of a word in the various company name examples.
Assuming you are just trying to string split after spaces you could try something like this:
import re
df = pd.DataFrame({'Company': ['XYZ ltd electronics', 'ABC ABC inc iron', 'AB XY Z inc', 'CD EF GHI JK llc chicago']}, columns=['Company'])
df['Addition'] = df['Company'].apply(lambda x: re.split('(ltd|inc|llc)', x))
print(df)
Company Addition
0 XYZ ltd electronics [XYZ , ltd, electronics]
1 ABC ABC inc iron [ABC ABC , inc, iron]
2 AB XY Z inc [AB XY Z , inc, ]
3 CD EF GHI JK llc chicago [CD EF GHI JK , llc, chicago]
You could use regular expression (Regex) to filter out the legal form. Each legal form is in this format: s
legalforms
s
means the legal form is preceded by and ended with a whitespace. Because I have appended all company names with a white space, so the legal form can be at the end as well. The data is processed in lowercase, then converted back to Title Case. So try this:
import pandas as pd
import re
legal_forms = '(sgmbhs|sltds|sincs|ssrls|sspas|scos|ssas|sags|skgs|sabs|sspols|ssasus|ssass|spvts|ssarls|sgmbhs&scoskgs|sllcs|silcs|scorps|sltdas|scoltds|sses|sass|sspszoos|splcs|spvtltds|sogs|sgens)'
df = pd.DataFrame({'Company': ['XYZ ltd electronics', 'ABC ABC inc iron', 'AB XY Z inc', 'CD EF GHI JK llc incident']}, columns=['Company'])
df['Coy']= df['Company'].apply(lambda x: [e.strip() for e in re.split(legal_forms, x.lower()+' ')])
print(df)
This will create a list for each company name, separated by the legal form
Company Coy
0 XYZ ltd electronics [xyz, ltd, electronics]
1 ABC ABC inc iron [abc abc, inc, iron]
2 AB XY Z inc [ab xy z, inc, ]
3 CD EF GHI JK llc incident [cd ef ghi jk, llc, incident]
After that you can split them into 3 separate columns:
df1 = pd.DataFrame(df['Coy'].tolist(), columns=['Company', 'Legal form', 'Addition'])
for col in df1.columns:
df1[col] = df1[col].str.title()
print(df1)
Output:
Company Legal form Addition
0 Xyz Ltd Electronics
1 Abc Abc Inc Iron
2 Ab Xy Z Inc
3 Cd Ef Ghi Jk Llc Incident