Trying to merge 2 dataframes but get ValueError
Question:
These are my two dataframes saved in two variables:
> print(df.head())
>
club_name tr_jan tr_dec year
0 ADO Den Haag 1368 1422 2010
1 ADO Den Haag 1455 1477 2011
2 ADO Den Haag 1461 1443 2012
3 ADO Den Haag 1437 1383 2013
4 ADO Den Haag 1386 1422 2014
> print(ranking_df.head())
>
club_name ranking year
0 ADO Den Haag 12 2010
1 ADO Den Haag 13 2011
2 ADO Den Haag 11 2012
3 ADO Den Haag 14 2013
4 ADO Den Haag 17 2014
I’m trying to merge these two using this code:
new_df = df.merge(ranking_df, on=['club_name', 'year'], how='left')
how='left' is added because my ranking_df has fewer data points than my main df.
The expected behaviour is as such:
> print(new_df.head())
>
club_name tr_jan tr_dec year ranking
0 ADO Den Haag 1368 1422 2010 12
1 ADO Den Haag 1455 1477 2011 13
2 ADO Den Haag 1461 1443 2012 11
3 ADO Den Haag 1437 1383 2013 14
4 ADO Den Haag 1386 1422 2014 17
But I get this error:
ValueError: You are trying to merge on object and int64 columns. If
you wish to proceed you should use pd.concat
But I do not wish to use concat, since I want to merge the dataframes on their keys, not just add one onto the other.
Another behaviour that seems weird to me is that my code works if I save the first df to .csv and then load that .csv back into a dataframe.
The code for that:
df = pd.DataFrame(data_points, columns=['club_name', 'tr_jan', 'tr_dec', 'year'])
df.to_csv('preliminary.csv')
df = pd.read_csv('preliminary.csv', index_col=0)
ranking_df = pd.DataFrame(rankings, columns=['club_name', 'ranking', 'year'])
new_df = df.merge(ranking_df, on=['club_name', 'year'], how='left')
I think it has to do with the index_col=0 parameter, but I have no idea how to fix it without saving to a file first. It doesn’t matter much, but it is an annoyance to have to do that.
Answers:
In one of your dataframes the year is a string, and in the other it is an int64. You can convert it first and then merge, e.g. df['year'] = df['year'].astype(int) (or, as RafaelC suggested, df.year.astype(int), with the result assigned back to the column).
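A minimal sketch of that conversion, using made-up sample rows shaped like the question's data. The values, and the assumption that year is the string-typed column in df, are for illustration only:

```python
import pandas as pd

# Hypothetical sample data: 'year' holds strings in df and integers in ranking_df.
df = pd.DataFrame({
    "club_name": ["ADO Den Haag", "ADO Den Haag"],
    "tr_jan": [1368, 1455],
    "tr_dec": [1422, 1477],
    "year": ["2010", "2011"],
})
ranking_df = pd.DataFrame({
    "club_name": ["ADO Den Haag", "ADO Den Haag"],
    "ranking": [12, 13],
    "year": [2010, 2011],
})

# Merging on keys with mismatched dtypes is what triggers the ValueError.
try:
    df.merge(ranking_df, on=["club_name", "year"], how="left")
except ValueError as exc:
    print(f"merge failed: {exc}")

# Cast the key to a common dtype and assign it back, then merge.
df["year"] = df["year"].astype(int)
new_df = df.merge(ranking_df, on=["club_name", "year"], how="left")
print(new_df)
```

Casting df rather than ranking_df is an arbitrary choice here; what matters is that both key columns end up with the same dtype.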
Edit: Also note the comment by Anderson Zhu: if you have None or missing values in one of your dataframes, you need to use Int64 (the nullable integer dtype) instead of int.
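To illustrate the missing-value case, here is a small sketch with a hypothetical year column that contains None. Going through pd.to_numeric before casting is one possible route, not the only one:

```python
import pandas as pd

# Hypothetical key column with a missing value.
years = pd.Series(["2010", None, "2012"])

# A plain int cast cannot represent the missing entry.
try:
    years.astype(int)
except (TypeError, ValueError) as exc:
    print(f"astype(int) failed: {exc}")

# Parse the strings to numbers first, then cast to the nullable Int64 dtype,
# which keeps the missing entry as <NA>.
nullable_years = pd.to_numeric(years).astype("Int64")
print(nullable_years.dtype)  # Int64
```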
This happens when the common column has a different data type in each table.
Example: in table1 the date is a string, whereas in table2 the date is a datetime. So before merging, we need to convert the date to a common data type.
Additional: a .csv file stores every value as text, and pd.read_csv infers the dtype again when loading, so a year column that was object (string) before saving comes back as an integer. That is why the merge works when both dataframes are loaded from csv files, while the error above shows up if one df is loaded from a csv file and the other is an existing in-memory df with a different dtype. This is somewhat annoying, but it has an easy solution if kept in mind: convert the column explicitly before merging.
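The round-trip effect can be seen with a small sketch. It writes to an in-memory buffer instead of a file, and the one-row dataframe is invented for the example:

```python
import io

import pandas as pd

# Hypothetical frame whose 'year' column holds strings.
df = pd.DataFrame({"club_name": ["ADO Den Haag"], "year": ["2010"]})
print(df["year"].dtype)  # a string/object dtype, not an integer

# Round-trip through CSV text (an in-memory buffer stands in for the file).
buffer = io.StringIO()
df.to_csv(buffer)
buffer.seek(0)
reloaded = pd.read_csv(buffer, index_col=0)

# read_csv infers the dtype from the text, so the years become integers.
print(reloaded["year"].dtype)
```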
@Arnon Rotem-Gal-Oz’s answer is right for the most part, but I would like to point out the difference between df['year'] = df['year'].astype(int) and df.year.astype(int). df.year.astype(int) returns a new Series and does not change the type of the column in the dataframe, at least in pandas 0.24.2. df['year'] = df['year'].astype(int) explicitly changes the type because it is an assignment. I would argue that this is the safest way to permanently change the dtype of a column.
Example:
df = pd.DataFrame({'Weed': ['green crack', 'northern lights', 'girl scout cookies'], 'Qty': [10, 15, 3]})
df.dtypes
Weed object,
Qty int64
df['Qty'].astype(str)
df.dtypes
Weed object,
Qty int64
Even passing inplace=True doesn’t help, because astype has no inplace parameter. Older pandas versions such as 0.24 silently ignore the extra keyword, and recent versions reject it with a TypeError.
df['Qty'].astype(str, inplace = True)
df.dtypes
Weed object,
Qty int64
Now the assignment,
df['Qty'] = df['Qty'].astype(str)
df.dtypes
Weed object,
Qty object
First check the types of the columns you want to merge on. You will see that one of them is a string while the other is an int. Then convert it to int with the following code:
df["something"] = df["something"].astype(int)
merged = df.merge(df1, on="something")
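The dtype check can be automated with a small helper. The function name key_dtype_mismatches and the sample frames below are hypothetical, made up for this sketch:

```python
import pandas as pd


def key_dtype_mismatches(left, right, keys):
    """Return {key: (left dtype, right dtype)} for keys whose dtypes differ."""
    return {
        key: (left[key].dtype, right[key].dtype)
        for key in keys
        if left[key].dtype != right[key].dtype
    }


# Hypothetical frames: 'year' is a string on the left and an int on the right.
df = pd.DataFrame({"club_name": ["ADO Den Haag"], "year": ["2010"]})
df1 = pd.DataFrame({"club_name": ["ADO Den Haag"], "year": [2010]})

mismatches = key_dtype_mismatches(df, df1, ["club_name", "year"])
print(mismatches)  # only 'year' is reported
```

Running a check like this before the merge points at the column that needs an astype call.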
I found that both of my dfs had the same type for the column (str), but switching from join to merge solved the issue.
This simple solution works for me:
final = pd.concat([df, ranking_df], axis=1, sort=False)
but you may need to drop some duplicate columns first.
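One caveat worth checking before relying on concat: with axis=1 it lines rows up by index label, not by the key columns, so it only matches a merge when both frames are in the same row order. A sketch with invented data where the order differs:

```python
import pandas as pd

# Hypothetical frames whose rows are in a different order.
df = pd.DataFrame({"club_name": ["A", "B"], "tr_jan": [1, 2]})
ranking_df = pd.DataFrame({"club_name": ["B", "A"], "ranking": [5, 7]})

# concat pairs row 0 with row 0 and row 1 with row 1, regardless of club_name.
side_by_side = pd.concat([df, ranking_df], axis=1, sort=False)

# merge pairs rows by the value of club_name.
merged = df.merge(ranking_df, on="club_name", how="left")

print(side_by_side["ranking"].tolist())  # [5, 7]
print(merged["ranking"].tolist())        # [7, 5]
```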
In my case, it happened because I was trying to merge on columns that were not the index. To fix this I set the key to be the index, using this code that I found in the documentation:
df.set_index('key').join(other.set_index('key'))
Documentation:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html
My 2 cents: I had the same issue and could not see why I was getting the error, because when calling data.head() I saw the exact same values in ds (the time column).
The error was fixed when I added parse_dates to the pd.read_csv() call, like this:
data = pd.read_csv('source.csv', sep=';', parse_dates=['Date'], encoding='unicode_escape')
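A small sketch of why the values can look identical in head() while the dtypes differ. The CSV text and the Date column contents are made up for the example:

```python
import io

import pandas as pd

# Hypothetical semicolon-separated CSV content.
csv_text = "Date;value\n2021-01-01;1\n2021-01-02;2\n"

# Without parse_dates the column is read as plain strings.
raw = pd.read_csv(io.StringIO(csv_text), sep=";")

# With parse_dates the same column becomes a datetime64 column.
parsed = pd.read_csv(io.StringIO(csv_text), sep=";", parse_dates=["Date"])

print(raw["Date"].dtype)
print(parsed["Date"].dtype)
```

Both frames print the same dates, but only the parsed one can be merged with another frame whose key is a datetime column.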
I was also facing the same issue when trying to merge 2 dataframes. In my scenario, both datasets are almost identical, except for 2 extra columns.
Hence I followed your solution.
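import os
import pandas as pd

def merge_dataframe(df1, df2, mode, column_name, col_chng: str):
    """
    :param df1: left dataframe (or data accepted by pd.DataFrame)
    :param df2: right dataframe (or data accepted by pd.DataFrame)
    :param mode: merge type passed to how=, e.g. 'left'
    :param column_name: column(s) to merge on; None merges on all common columns
    :param col_chng: dtype to cast the merge column(s) to in both dataframes
    :return: the merged dataframe
    """
    df1 = pd.DataFrame(df1)
    df2 = pd.DataFrame(df2)
    os.chdir(output_dir)  # output_dir must be defined elsewhere
    if column_name is None:
        new_df = df1.merge(df2, how=mode)
    else:
        # Cast the key in both dataframes so their dtypes match before merging
        df1[column_name] = df1[column_name].astype(col_chng)
        df2[column_name] = df2[column_name].astype(col_chng)
        new_df = df1.merge(df2, on=column_name, how=mode)
    new_df.to_excel('new_df.xlsx', index=False)
    return new_df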
Hope this helps others. Kindly correct me if I’m missing something.