How can I separate one row from a data set but repeat in each line some of the variables?
Question:
I have a dataset where each row contains information that needs to be separated and printed in different rows, but I need to keep the name of the company on each newly printed row:
example dataset
These are the headers:
company | marketing_budget | marketing_remaining | finance_budget | finance_remaining | sales_budget | sales_remaining
These are 2 rows of data:
Law Office | 450,000 | 150,000 | 300,000 | 100,000 | 200,000 | 50,000
Restaurant | 30,000 | 7,000 | null | null | 25,000 | 10,000
I need to split each line into as many rows as there are departments with data. Some companies might have a marketing budget but no finance budget, or any other possible combination…
The output should look like this. I also need to add the department, which is not included as a column; it only appears in the name of the column the values are taken from.
Company | Department | Budget | Amount Remaining
Law Office | Marketing | 450,000 | 150,000
Law Office | Finance | 300,000 | 100,000
Law Office | Sales | 200,000 | 50,000
Restaurant | Marketing | 30,000 | 7,000
Restaurant | Sales | 25,000 | 10,000
Answers:
You could use the Python package pandas to build the table, with list comprehensions and the str.split() method to parse the raw text:
import pandas as pd
d='''company | marketing_budget | marketing_remaining | finance_budget | finance_remaining | sales_budget | sales_remaining
Law Office | 450,000 | 150,000 | 300,000 | 100,000 | 200,000 | 50,000
Restaurant | 30,000 | 7,000 | null | null | 25,000 | 10,000'''
# Split the text into lines, then split each line into cells on '|'.
data = [e.strip().split('|') for e in d.split('\n')]
df = pd.DataFrame([[e.strip() for e in l] for l in data[1:]], columns=[e.strip() for e in data[0]])
print(df)
Output:
company marketing_budget marketing_remaining finance_budget finance_remaining sales_budget sales_remaining
0 Law Office 450,000 150,000 300,000 100,000 200,000 50,000
1 Restaurant 30,000 7,000 null null 25,000 10,000
After this, use the df.melt() and df.pivot() methods to get the final result:
df = df.melt(id_vars='company')
df[['department','value_type']] = df.variable.str.split('_', expand=True)
df = df.pivot(index=['company', 'department'], columns='value_type', values='value').sort_index().reset_index()
df = df[df['budget']!='null']
df = df.rename_axis(None, axis=1).reset_index(drop=True)
print(df)
Output:
company department budget remaining
0 Law Office finance 300,000 100,000
1 Law Office marketing 450,000 150,000
2 Law Office sales 200,000 50,000
3 Restaurant marketing 30,000 7,000
4 Restaurant sales 25,000 10,000
Thanks @BeRT2me, good learning for me!
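For reference, here is a minimal, self-contained sketch of the melt/pivot idea above. It assumes the missing finance figures are real NaN values rather than the string 'null', so dropna() can stand in for the string comparison. The names wide, melted and tidy are made up for this example.

```python
import pandas as pd

# Hypothetical in-memory copy of the example data, with numbers instead of strings.
wide = pd.DataFrame({
    "company": ["Law Office", "Restaurant"],
    "marketing_budget": [450000, 30000],
    "marketing_remaining": [150000, 7000],
    "finance_budget": [300000, None],
    "finance_remaining": [100000, None],
    "sales_budget": [200000, 25000],
    "sales_remaining": [50000, 10000],
})

# One row per (company, original column name).
melted = wide.melt(id_vars="company")

# 'marketing_budget' -> department 'marketing', value_type 'budget'.
melted[["department", "value_type"]] = melted["variable"].str.split("_", expand=True)

# Spread budget/remaining back into columns, then drop departments with no data.
tidy = (
    melted.pivot(index=["company", "department"], columns="value_type", values="value")
    .dropna(how="all")
    .reset_index()
    .rename_axis(None, axis=1)
)
print(tidy)
```

Passing a list as the pivot index needs a reasonably recent pandas (1.1 or later, as far as I know).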
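Given a text file, file.txt, that looks like:
company | marketing_budget | marketing_remaining | finance_budget | finance_remaining | sales_budget | sales_remaining
Law Office | 450,000 | 150,000 | 300,000 | 100,000 | 200,000 | 50,000
Restaurant | 30,000 | 7,000 | null | null | 25,000 | 10,000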
We can do:
import pandas as pd
# The separator is treated as a regex, so the pipe has to be escaped.
df = pd.read_csv('file.txt', sep=r' \| ', engine='python')
# Reverse the column names on '_'.
df.columns = ['_'.join(reversed(x.split('_'))) for x in df.columns]
# Use pd.wide_to_long
df = pd.wide_to_long(df, ['budget', 'remaining'], i='company', j='department', sep='_', suffix=r'\w+').sort_index()
df = df.reset_index().dropna()
print(df)
Output:
company department budget remaining
0 Law Office finance 300,000 100,000
1 Law Office marketing 450,000 150,000
2 Law Office sales 200,000 50,000
4 Restaurant marketing 30,000 7,000
5 Restaurant sales 25,000 10,000
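The renaming step is there because pd.wide_to_long expects the stub name ('budget', 'remaining') to come first in each column name and the varying part (the department) to come last. A small sketch of what the reversal does to the headers, using a plain list (cols and flipped are illustrative names):

```python
cols = [
    "company",
    "marketing_budget", "marketing_remaining",
    "finance_budget", "finance_remaining",
    "sales_budget", "sales_remaining",
]

# Split on '_', reverse the parts, and join again; 'company' has no '_' and is unchanged.
flipped = ["_".join(reversed(c.split("_"))) for c in cols]
print(flipped)
# ['company', 'budget_marketing', 'remaining_marketing', 'budget_finance',
#  'remaining_finance', 'budget_sales', 'remaining_sales']
```

Note that this only works as-is when each header contains a single underscore; a department like 'human_resources' would need a different split.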
Testing, and how I’d make the values numeric for future calculations:
import pandas as pd
from io import StringIO
d='''company | marketing_budget | marketing_remaining | finance_budget | finance_remaining | sales_budget | sales_remaining
Law Office | 450,000 | 150,000 | 300,000 | 100,000 | 200,000 | 50,000
Restaurant | 30,000 | 7,000 | null | null | 25,000 | 10,000'''
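# The separator is treated as a regex, so the pipe has to be escaped.
df = pd.read_csv(StringIO(d), sep=r' \| ', engine='python')
# Strip the thousands separators and convert every value column to numbers;
# the 'null' cells are read as NaN and stay NaN.
for col in df.columns.drop('company'):
    df[col] = pd.to_numeric(df[col].astype(str).str.replace(',', ''), errors='coerce')
df.columns = ['_'.join(reversed(x.split('_'))) for x in df.columns]
df = pd.wide_to_long(df, ['budget', 'remaining'], i='company', j='department', sep='_', suffix=r'\w+').sort_index()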
df = df.reset_index().dropna()
print(df)
Output:
company department budget remaining
0 Law Office finance 300000.0 100000.0
1 Law Office marketing 450000.0 150000.0
2 Law Office sales 200000.0 50000.0
4 Restaurant marketing 30000.0 7000.0
5 Restaurant sales 25000.0 10000.0
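Both answers return the departments in alphabetical order and in lower case, while the question asks for Marketing, Finance, Sales and for the headers Company | Department | Budget | Amount Remaining. One possible way to get that layout, sketched under the assumption that you start from the numeric result above (tidy, dept_order and report are names invented for this example):

```python
import pandas as pd

# Hypothetical stand-in for the numeric long-format result shown above.
tidy = pd.DataFrame({
    "company": ["Law Office"] * 3 + ["Restaurant"] * 2,
    "department": ["finance", "marketing", "sales", "marketing", "sales"],
    "budget": [300000.0, 450000.0, 200000.0, 30000.0, 25000.0],
    "remaining": [100000.0, 150000.0, 50000.0, 7000.0, 10000.0],
})

# Assumed display order: the order of the departments in the original headers.
dept_order = ["marketing", "finance", "sales"]
tidy["department"] = pd.Categorical(tidy["department"], categories=dept_order, ordered=True)

# Sort by company, then by the custom department order.
report = tidy.sort_values(["company", "department"]).reset_index(drop=True)

# Capitalise the department names and restore the thousands separators for display.
report["department"] = report["department"].astype(str).str.title()
for col in ["budget", "remaining"]:
    report[col] = report[col].map("{:,.0f}".format)

report = report.rename(columns={
    "company": "Company",
    "department": "Department",
    "budget": "Budget",
    "remaining": "Amount Remaining",
})
print(report.to_string(index=False))
```

Formatting the numbers as strings is only for printing; keep the numeric columns if you still need to do calculations.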