Python: calculate with dataframe and dictionary?
Question:
I have a dataframe/excel sheet with transaction types of business processes and how often a transaction type was performed:
branch | Transaction Type | occurrences |
---|---|---|
aa | red | 12 |
aa | green | 100 |
bb | blue | 20 |
cc | red | 12 |
cc | green | 100 |
cc | blue | 20 |
I have a second df/excel sheet with processing time in seconds per transaction type
Transaction Type | time in S |
---|---|
red | 120 |
green | 320 |
blue | 60 |
What I need is a new column in the processes df, where the number of occurrences is multiplied by the processing time, to get the effort in seconds for each transaction type:
branch | Transaction Type | occurrences | Effort in S |
---|---|---|---|
aa | red | 12 | 1440 |
aa | green | 100 | 32000 |
bb | blue | 20 | 1200 |
cc | red | 12 | 1440 |
cc | green | 100 | 32000 |
cc | blue | 20 | 1200 |
[edit]
I was not precise enough. It is not just a simple merge of two dataframes, but rather the calculation of the effort per branch.
[/edit]
As I am a beginner with only theoretical knowledge, I assume that I have to import my two Excel files with openpyxl and create dataframes with pandas.
Then I need to iterate over the dataframes, and maybe with a function (lambda?) I can do this simple calculation.
Maybe it is better to create a dictionary out of the second Excel file, since it has only two columns?
Any help is appreciated 🙂
Answers:
Use the pandas library in Python; it makes this much easier.
import pandas as pd
df1 = pd.read_csv(<PATH_TO_FILE>)
df2 = pd.read_csv(<PATH_TO_SECOND_FILE>)
# merge returns a new DataFrame, so the result has to be assigned
final_df = df1.merge(df2, on='Transaction Type', how='left')
final_df['Effort in S'] = final_df['time in S'] * final_df['occurrences']
# In case you need to remove the 'time in S' column:
# final_df.drop('time in S', axis=1, inplace=True)
# '\t' writes a tab-separated file
final_df.to_csv(<PATH_TO_Directory/file_name>, sep='\t', encoding='utf-8', index=False)
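One thing worth knowing about the `how` argument: a left merge keeps every row of the first dataframe, while an inner merge drops rows whose transaction type has no processing time. Here is a minimal sketch of the difference. The sample data (including the "yellow" type) is made up for illustration and is not from the question.

```python
import pandas as pd

# Hypothetical sample data: "yellow" has no entry in the time table.
df1 = pd.DataFrame({
    "branch": ["aa", "aa", "bb"],
    "Transaction Type": ["red", "green", "yellow"],
    "occurrences": [12, 100, 5],
})
df2 = pd.DataFrame({
    "Transaction Type": ["red", "green", "blue"],
    "time in S": [120, 320, 60],
})

# how="left" keeps every row of df1; unmatched types get NaN in "time in S".
left = df1.merge(df2, on="Transaction Type", how="left")
left["Effort in S"] = left["occurrences"] * left["time in S"]

# how="inner" drops rows whose type is missing from df2.
inner = df1.merge(df2, on="Transaction Type", how="inner")

# Transaction types that have no processing time yet.
missing = left.loc[left["time in S"].isna(), "Transaction Type"].tolist()
print(missing)
```

With a left merge you can spot transaction types that are missing from the time sheet instead of losing those rows silently.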
Edited after seeing you edited the question.
import pandas as pd
df1 = pd.DataFrame({"branch":["aa","aa","bb","cc","cc","cc"], "Transaction Type": ["red","green","blue", "red","green","blue"], "occurrences":[12,100,20,12,100,20]})
df2 = pd.DataFrame({"Transaction Type": ["red","green","blue"], "time in S":[120,320,60]})
df3 = df1.merge(df2, how='inner')
df3["Effort in S"] = df3["occurrences"]*df3["time in S"]
df3 = df3.drop("time in S", axis=1).sort_values('branch')
print(df3)
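Since the question also asks about a dictionary: that works too. Below is a rough sketch, assuming the second sheet has exactly one row per transaction type. The names `time_per_type` and `effort_per_branch` are my own, and the per-branch total at the end is only one possible reading of "effort per branch" in the edit.

```python
import pandas as pd

df1 = pd.DataFrame({
    "branch": ["aa", "aa", "bb", "cc", "cc", "cc"],
    "Transaction Type": ["red", "green", "blue", "red", "green", "blue"],
    "occurrences": [12, 100, 20, 12, 100, 20],
})
df2 = pd.DataFrame({
    "Transaction Type": ["red", "green", "blue"],
    "time in S": [120, 320, 60],
})

# Build a {transaction type: seconds} lookup from the two-column table.
time_per_type = dict(zip(df2["Transaction Type"], df2["time in S"]))

# Look up each row's processing time and multiply it by the occurrences.
df1["Effort in S"] = df1["occurrences"] * df1["Transaction Type"].map(time_per_type)

# Assumption: "effort per branch" means the summed effort of each branch.
effort_per_branch = df1.groupby("branch")["Effort in S"].sum()

print(df1)
print(effort_per_branch)
```

No explicit loop or lambda is needed here; `map` does the lookup for the whole column at once.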
Thank you, both suggested solutions work fine.