Regex: how to keep the relevant words and remove the others?
Question:
The original output looks like this:
JOBS column:
{"/j/03k50": "Waitress Job", "/j/055qm": "Programmer Job", "/j/02h40lc": "Marketing Job"}
{"/j/03k50": "Waitress Job", "/j/055qm": "Programmer Job", "/j/02h40lc": "Marketing Job"}
{"/j/055qm": "Programmer Job", "/j/02h40lc": "Marketing Job"}
I want something like this, with the word "Job" and the associated codes removed:
New JOBS column
{"Waitress", "Programmer", "Marketing"}
{"Waitress", "Programmer", "Marketing"}
{"Programmer", "Marketing"}
Before using the regex, I converted the JOBS column into a list (df_old) and tried this:
df_new = [re.sub('^/j/', '', doc) for doc in df_old]
This raised an error: TypeError: expected string or bytes-like object, so I tried this instead:
df_new = [re.sub('^/j/', '', doc) for doc in str(df_old)]
That ran without errors, but the output was unusable and nowhere near what I wanted.
I hope you can help. Thank you in advance.
Answers:
As per the comment, there are far better ways of doing this. However, here is a rough example that answers the question directly:
import pandas as pd
# Sample data: each row is a string that looks like a JSON object
data = ['{"/j/03k50": "Waitress Job", "/j/055qm": "Programmer Job", "/j/02h40lc": "Marketing Job"}',
        '{"/j/03k50": "Waitress Job", "/j/055qm": "Programmer Job", "/j/02h40lc": "Marketing Job"}',
        '{"/j/055qm": "Programmer Job", "/j/02h40lc": "Marketing Job"}']
df = pd.DataFrame(data, columns=['JOBS'])
# Capture every quoted value that ends in " Job", then join the matches into one string
df['Cleaned_JOBS'] = df['JOBS'].str.findall(r': (".*?\sJob"),?').str.join(', ')
# Drop the literal " Job" suffix from each value
df['Cleaned_JOBS'] = df['Cleaned_JOBS'].str.replace(' Job', '', regex=False)
# Put the surrounding braces back
df['Cleaned_JOBS'] = '{' + df['Cleaned_JOBS'] + '}'
print(df, '\n\n')
Output:
JOBS Cleaned_JOBS
0 {"/j/03k50": "Waitress Job", "/j/055qm": "Prog... {"Waitress", "Programmer", "Marketing"}
1 {"/j/03k50": "Waitress Job", "/j/055qm": "Prog... {"Waitress", "Programmer", "Marketing"}
2 {"/j/055qm": "Programmer Job", "/j/02h40lc": "... {"Programmer", "Marketing"}
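One of the "better ways" could be to parse each cell instead of pattern-matching the raw text. The following is only a sketch under assumptions: the TypeError in the question suggests the list elements were not strings, so the hypothetical helper job_titles accepts either a dict or a JSON-formatted string, and the sample df_old below is a stand-in for the asker's list, not their real data.

```python
import json
import re


def job_titles(cell):
    """Return the job names from one JOBS cell, without codes or the ' Job' suffix."""
    # Assumption: a cell is either already a dict or a JSON-formatted string.
    mapping = cell if isinstance(cell, dict) else json.loads(cell)
    # The codes are the keys, so keep only the values and strip a trailing " Job".
    return [re.sub(r"\s+Job$", "", title) for title in mapping.values()]


# Stand-in for the asker's list: one cell as a string, one as a dict.
df_old = [
    '{"/j/03k50": "Waitress Job", "/j/055qm": "Programmer Job", "/j/02h40lc": "Marketing Job"}',
    {"/j/055qm": "Programmer Job", "/j/02h40lc": "Marketing Job"},
]

df_new = [job_titles(cell) for cell in df_old]
print(df_new)
```

With a DataFrame, the same helper could be applied per cell with df['JOBS'].apply(job_titles). Note that json.loads raises an error on trailing characters after the closing brace, so stray text in a cell would have to be cleaned first.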