Merging 2 datasets in Python
Question:
I have 2 different datasets and I want to merge them on the column "country", keeping only the countries common to both and dropping the rest. I have done it with an inner merge, but the result is not what I want.
inner_merged = pd.merge(TFC_DATA,CO2_DATA,how="inner",on="country")
TFC_DATA (the original dataset has a column called year, but I’ve dropped it):
| Country | TFP |
| Angola | 0.8633379340171814 |
| Angola | 0.9345720410346984 |
| Angola | 1.0301895141601562 |
| Angola | 1.0850582122802734 |
.
.
.
CO2_DATA:
| Country | year | GDP | co2 |
| Afghanistan | 2005 | 25397688320.0 | 1 |
| Afghanistan | 2006 | 28704401408.0 | 2 |
| Afghanistan | 2007 | 34507530240.0 | 2 |
| Afghanistan | 2008 | 1.0850582122802734 | 3 |
| Afghanistan | 2009 | 1.040212631225586 | 1 |
.
.
.
What I want is
Output
Country|Year|gdp|co2|TFP
Angola|2005|51967275008.0|19.006|0.8633379340171814
Angola|2006|66748907520.0|19.006|0.9345720410346984
Angola|2007|87085293568.0|19.006|1.0301895141601562
.
.
.
What I have instead
Output
Country|Year|gdp|co2|Year|TFP
Angola|2005|51967275008.0|19.006|2005|0.8633379340171814
Angola|2005|51967275008.0|19.006|2006|0.9345720410346984
Angola|2005|51967275008.0|19.006|2007|1.0301895141601562
Angola|2005|51967275008.0|19.006|2008|1.0850582122802734
Angola|2005|51967275008.0|19.006|2009|1.040212631225586
Angola|2005|51967275008.0|19.006|2010|1.0594196319580078
Angola|2005|51967275008.0|19.006|2011|1.036203384399414
Angola|2005|51967275008.0|19.006|2012|1.076979637145996
Angola|2005|51967275008.0|19.006|2013|1.0862818956375122
Angola|2005|51967275008.0|19.006|2014|1.096832513809204
Angola|2005|51967275008.0|19.006|2015|1.0682281255722046
Angola|2005|51967275008.0|19.006|2016|1.0160540342330933
Angola|2005|51967275008.0|19.006|2017|1.0
I expected the data for each country to be merged into one dataset, but instead each row of the first dataset is repeated until the rows of the second dataset run out, and then the next row does the same.
Answers:
TFC_DATA (in the original dataset there exists a column called year but I’ve dropped it):
Well, based on your expected output, you should not drop the column Year from the dataframe TFC_DATA. Only then can you use pandas.merge (as shown below). Otherwise, you’ll have duplicated values.
pd.merge(CO2_DATA, TFC_DATA, left_on=["country", "year"], right_on=["country", "Year"])
Or, renaming the column first:
pd.merge(CO2_DATA, TFC_DATA.rename(columns={"Year": "year"}), on=["country", "year"])
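To see why the year column matters, here is a minimal sketch with made-up toy frames (the values, the extra Brazil row, and the variable names by_country and by_country_year are my assumptions, not the asker's data). Merging on country alone pairs every TFP row with every CO2 row of the same country, while merging on both keys should give one row per country and year.

```python
import pandas as pd

# Toy stand-ins for the two datasets; the values are illustrative only.
TFC_DATA = pd.DataFrame({
    "country": ["Angola", "Angola", "Angola", "Brazil"],
    "Year": [2005, 2006, 2007, 2005],
    "TFP": [0.86, 0.93, 1.03, 1.10],
})
CO2_DATA = pd.DataFrame({
    "country": ["Angola", "Angola", "Angola", "Afghanistan"],
    "year": [2005, 2006, 2007, 2005],
    "GDP": [5.2e10, 6.7e10, 8.7e10, 2.5e10],
    "co2": [19.0, 19.0, 19.0, 1.0],
})

# Year dropped, merge on country only: 3 x 3 = 9 Angola rows (many-to-many).
by_country = pd.merge(CO2_DATA, TFC_DATA.drop(columns="Year"), on="country")

# Year kept, merge on both keys: one row per matching (country, year) pair.
by_country_year = pd.merge(
    CO2_DATA,
    TFC_DATA.rename(columns={"Year": "year"}),
    on=["country", "year"],
)

print(len(by_country))       # 9
print(len(by_country_year))  # 3
```

Countries that appear in only one frame (Afghanistan and Brazil here) drop out in both cases, since the default join is inner.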
The pd.merge() function performs an inner join by default, which means it only includes rows that have matching values in the specified columns.
One option is to use a different join type, such as a left outer join, which includes all rows from the left dataset (TFC_DATA) and only the matching rows from the right dataset (CO2_DATA).
Specify a left outer join using the how="left" parameter in the pd.merge() function.
merged_data = pd.merge(TFC_DATA, CO2_DATA, how="left", on="country")
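As a rough illustration of what a left join on "country" does, here is a small sketch on toy data (the frames tfc and co2 and their values are assumptions for the example). The join type decides which unmatched rows are kept, but it does not change how matching rows are paired, so repeated country values on both sides still multiply. The indicator=True option adds a _merge column showing where each row came from.

```python
import pandas as pd

# Toy frames; values are illustrative only.
tfc = pd.DataFrame({
    "country": ["Angola", "Angola", "Brazil"],
    "TFP": [0.86, 0.93, 1.10],
})
co2 = pd.DataFrame({
    "country": ["Angola", "Angola", "Afghanistan"],
    "year": [2005, 2006, 2005],
    "co2": [19.0, 19.0, 1.0],
})

# Left join on country only; _merge marks each row as "both" or "left_only".
left = pd.merge(tfc, co2, how="left", on="country", indicator=True)

# 5 rows: 2 x 2 = 4 Angola pairs, plus 1 unmatched Brazil row with NaN year/co2.
print(left)
```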
EDIT (after @abokey’s comment)
First, create a new column in the TFC_DATA dataset with the year value (this assumes the index of TFC_DATA is datetime-like, e.g. a DatetimeIndex):
TFC_DATA["year"] = TFC_DATA.index.year
Group the TFC_DATA dataset by "country" and "year", and compute the mean TFP value for each group
TFC_DATA_agg = TFC_DATA.groupby(["country", "year"]).mean()
Reset the index to make "country" and "year" columns in the resulting dataset
TFC_DATA_agg = TFC_DATA_agg.reset_index()
Perform the inner merge, using "country" and "year" as the merge keys
merged_data = pd.merge(CO2_DATA, TFC_DATA_agg, how="inner", on=["country", "year"])
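Put together, the steps could look like the sketch below. It runs on toy data and assumes TFC_DATA has a DatetimeIndex, which is what .index.year needs; the values are made up. Selecting "TFP" explicitly and passing validate="one_to_one" are my additions: the first keeps the mean restricted to the numeric column, and the second makes pandas raise an error if either side still has duplicate (country, year) keys.

```python
import pandas as pd

# Toy TFP data indexed by date (assumption: the real index is a DatetimeIndex).
TFC_DATA = pd.DataFrame(
    {"country": ["Angola", "Angola", "Angola"], "TFP": [0.8, 1.0, 1.1]},
    index=pd.to_datetime(["2005-01-01", "2005-07-01", "2006-01-01"]),
)
CO2_DATA = pd.DataFrame({
    "country": ["Angola", "Angola"],
    "year": [2005, 2006],
    "co2": [19.0, 19.0],
})

# Step 1: derive the year from the datetime index.
TFC_DATA["year"] = TFC_DATA.index.year

# Steps 2 and 3: mean TFP per (country, year), with the keys kept as columns.
TFC_DATA_agg = TFC_DATA.groupby(["country", "year"], as_index=False)["TFP"].mean()

# Step 4: inner merge on both keys; validate guards against duplicate keys.
merged_data = pd.merge(
    CO2_DATA, TFC_DATA_agg,
    how="inner", on=["country", "year"], validate="one_to_one",
)
print(merged_data)
```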