Using Merge on a column and Index in Pandas
Question:
I have two separate dataframes that share a project number. In type_df, the project number is the index. In time_df, the project number is a column. I would like to count the number of rows in time_df that have a Project Type of Type 2. I am trying to do this with pandas.merge(). It works great when both keys are columns, but not when one of them is an index. I'm not sure how to reference the index, or whether merge is even the right way to do this.
import pandas as pd
type_df = pd.DataFrame(data=[['Type 1'], ['Type 2']],
                       columns=['Project Type'],
                       index=['Project2', 'Project1'])
time_df = pd.DataFrame(data=[['Project1', 13], ['Project1', 12],
                             ['Project2', 41]],
                       columns=['Project', 'Time'])
merged = pd.merge(time_df,type_df, on=[index,'Project'])
print(merged[merged['Project Type'] == 'Type 2']['Project Type'].count())
Error:
NameError: name 'index' is not defined
Desired Output:
2
Answers:
If you want to use an index in your merge, you have to specify left_index=True or right_index=True for the frame whose index holds the key, and then use left_on or right_on to name the key column in the other frame. For you it should look something like this:
merged = pd.merge(type_df, time_df, left_index=True, right_on='Project')
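Putting that call together with the frames from the question, a minimal runnable sketch might look like the following. The variable name type2_count is introduced here only for illustration, and the count is taken by summing a boolean mask rather than filtering and calling count().

```python
import pandas as pd

# Frames copied from the question.
type_df = pd.DataFrame(data=[['Type 1'], ['Type 2']],
                       columns=['Project Type'],
                       index=['Project2', 'Project1'])
time_df = pd.DataFrame(data=[['Project1', 13], ['Project1', 12],
                             ['Project2', 41]],
                       columns=['Project', 'Time'])

# The key is the index of the left frame and the 'Project' column of the right.
merged = pd.merge(type_df, time_df, left_index=True, right_on='Project')

# Count the merged rows whose project type is 'Type 2'.
type2_count = int((merged['Project Type'] == 'Type 2').sum())
print(type2_count)  # 2
```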
To merge with on=, the key must be a column with the same name in each dataframe. In this case, just make a 'Project' column for type_df, then merge on that:
type_df['Project'] = type_df.index.values
merged = pd.merge(time_df, type_df, on='Project', how='inner')
merged
#     Project  Time Project Type
# 0  Project1    13       Type 2
# 1  Project1    12       Type 2
# 2  Project2    41       Type 1
print(merged[merged['Project Type'] == 'Type 2']['Project Type'].count())
# 2
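Another solution is to use DataFrame.join, which matches the on column of the calling frame against the index of the other frame. Starting from the original frames, time_df is therefore the caller:
df3 = time_df.join(type_df, on='Project')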
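As a sketch of how join lines up with the question's data: DataFrame.join looks up the on column of the calling frame in the index of the other frame, so the frame that holds 'Project' as a column (time_df) has to be the caller. The names joined and type2_count below are illustrative only.

```python
import pandas as pd

# Frames copied from the question.
type_df = pd.DataFrame(data=[['Type 1'], ['Type 2']],
                       columns=['Project Type'],
                       index=['Project2', 'Project1'])
time_df = pd.DataFrame(data=[['Project1', 13], ['Project1', 12],
                             ['Project2', 41]],
                       columns=['Project', 'Time'])

# time_df['Project'] is matched against type_df's index (a left join by default).
joined = time_df.join(type_df, on='Project')

# Count the joined rows whose project type is 'Type 2'.
type2_count = int((joined['Project Type'] == 'Type 2').sum())
print(type2_count)  # 2
```

Because join defaults to how='left', every row of time_df is kept even if its project is missing from type_df; such rows would get NaN in the Project Type column.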
In pandas 0.23.0 and later, the on, left_on, and right_on parameters may refer to either column names or index level names:
left_index = pd.Index(['K0', 'K0', 'K1', 'K2'], name='key1')
left = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3'],
                     'key2': ['K0', 'K1', 'K0', 'K1']},
                    index=left_index)
right_index = pd.Index(['K0', 'K1', 'K2', 'K2'], name='key1')
right = pd.DataFrame({'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3'],
                      'key2': ['K0', 'K0', 'K0', 'K1']},
                     index=right_index)
print(left)
       A   B key2
key1
K0    A0  B0   K0
K0    A1  B1   K1
K1    A2  B2   K0
K2    A3  B3   K1
print(right)
       C   D key2
key1
K0    C0  D0   K0
K1    C1  D1   K0
K2    C2  D2   K0
K2    C3  D3   K1
df = left.merge(right, on=['key1', 'key2'])
print(df)
       A   B key2   C   D
key1
K0    A0  B0   K0  C0  D0
K1    A2  B2   K0  C1  D1
K2    A3  B3   K1  C3  D3
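The same feature can be applied to the question's frames, assuming pandas 0.23.0 or later. The index of type_df is unnamed in the question, so giving it the name 'Project' with rename_axis is an added step here, and named_type_df and type2_count are illustrative names.

```python
import pandas as pd

# Frames copied from the question.
type_df = pd.DataFrame(data=[['Type 1'], ['Type 2']],
                       columns=['Project Type'],
                       index=['Project2', 'Project1'])
time_df = pd.DataFrame(data=[['Project1', 13], ['Project1', 12],
                             ['Project2', 41]],
                       columns=['Project', 'Time'])

# Name the index so that on='Project' can refer to it as an index level.
named_type_df = type_df.rename_axis('Project')

# 'Project' is a column in time_df and an index level in named_type_df.
merged = time_df.merge(named_type_df, on='Project')

# Count the merged rows whose project type is 'Type 2'.
type2_count = int((merged['Project Type'] == 'Type 2').sum())
print(type2_count)  # 2
```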