Merging 2 datasets in Python

Question:

I have 2 diffferent datasets and I want to merge these 2 datasets based on column "country" with the common country names and dropping the ones different. I have done it with inner merge, but the dataset is not as I want to have.

inner_merged = pd.merge(TFC_DATA,CO2_DATA,how="inner",on="country")

TFC_DATA (in the orginal dataset there exits a column called year but I’ve dropped it):

| Country | TFP |

| Angola | 0.8633379340171814 |

| Angola | 0.9345720410346984 |

| Angola | 1.0301895141601562 |

| Angola | 1.0850582122802734 |

.
.
.

CO2_DATA:
| Country | year | GDP | co2

| Afghanistan | 2005 | 25397688320.0 | 1

| Afghanistan | 2006 | 28704401408.0 | 2

| Afghanistan | 2007 | 34507530240.0 | 2

| Afghanistan | 2008 | 1.0850582122802734 | 3

| Afghanistan | 2009 | 1.040212631225586 | 1

.
.
.

What I want is

Output
|Country|Year|gdp|co2|TFP
Angola|2005|51967275008.0|19.006|0.8633379340171814
Angola|2006|66748907520.0|19.006|0.9345720410346984
Angola|2007|87085293568.0|19.006|1.0301895141601562
.
.
.

What I have instead

Output
Country|Year|gdp|co2|Year|TFP
Angola|2005|51967275008.0|19.006|2005|0.8633379340171814
Angola|2005|51967275008.0|19.006|2006|0.9345720410346984
Angola|2005|51967275008.0|19.006|2007|1.0301895141601562
Angola|2005|51967275008.0|19.006|2008|1.0850582122802734
Angola|2005|51967275008.0|19.006|2009|1.040212631225586
Angola|2005|51967275008.0|19.006|2010|1.0594196319580078
Angola|2005|51967275008.0|19.006|2011|1.036203384399414
Angola|2005|51967275008.0|19.006|2012|1.076979637145996
Angola|2005|51967275008.0|19.006|2013|1.0862818956375122
Angola|2005|51967275008.0|19.006|2014|1.096832513809204
Angola|2005|51967275008.0|19.006|2015|1.0682281255722046
Angola|2005|51967275008.0|19.006|2016|1.0160540342330933
Angola|2005|51967275008.0|19.006|2017|1.0

I expected the datas of the countrys’ merge in one dataset but it duplicates itself until the second one data is over then the second one does the same

Asked By: Pelin Kasap

||

Answers:

TFC_DATA (in the orginal dataset there exits a column called year but
I’ve dropped it):

Well, based on your expected output, you should not drop the column Year from the dataframe TFC_DATA. Only then, you can use pandas.merge (as shown below). Because otherwise, you’ll have duplicated values.

pd.merge(CO2_DATA, TFC_DATA, left_on=["country", "year"], right_on=["country", "Year"])

OR :

pd.merge(CO2_DATA, TFC_DATA.rename(columns={"Year": "year"}), on=["country", "year"])
Answered By: abokey

pd.merge() function performs an inner join by default that means it only includes rows that have matching values in the specified columns.

Use a different join type one option is to use a left outer join, which will include all rows from the left dataset (TFC_DATA) and only the matching rows from the right dataset (CO2_DATA).

Specify a left outer join using the how="left" parameter in the pd.merge() function.

merged_data = pd.merge(TFC_DATA, CO2_DATA, how="left", on="country")

After @abokey’s comment
EDIT

First, create a new column in the TFC_DATA dataset with the year value

TFC_DATA["year"] = TFC_DATA.index.year

Group the TFC_DATA dataset by "country" and "year", and compute the mean TFP value for each group

TFC_DATA_agg = TFC_DATA.groupby(["country", "year"]).mean()

Reset the index to make "country" and "year" columns in the resulting dataset

TFC_DATA_agg = TFC_DATA_agg.reset_index()

Perform the inner merge, using "country" and "year" as the merge keys

merged_data = pd.merge(CO2_DATA, TFC_DATA_agg, how="inner", on=["country", "year"])

Answered By: Sachin