pandas – Merging on string columns not working (bug?)

Question:

I’m trying to do a simple merge between two dataframes. These come from two different SQL tables, where the joining keys are strings:

>>> df1.col1.dtype
dtype('O')
>>> df2.col2.dtype
dtype('O')

I try to merge them using this:

>>> merge_res = pd.merge(df1, df2, left_on='col1', right_on='col2')

The result of the inner join is empty, which at first made me think there might not be any entries in the intersection:

>>> merge_res.shape
(0, 19)

But when I try to match a single element, I see this really odd behavior.

# Pick random element in second dataframe
>>> df2.iloc[5,:].col2
'95498208100000'

# Manually look for it in the first dataframe
>>> df1[df1.col1 == '95498208100000']
0 rows × 19 columns
# Empty, which makes sense given the above merge result

# Now look for the same value as an integer
>>> df1[df1.col1 == 95498208100000]
1 rows × 19 columns
# FINDS THE ELEMENT!?!

So the columns have the ‘object’ dtype. Searching for the values as strings doesn’t yield any results, but searching for them as integers does, and I think this is why the merge above doesn’t work.

Any ideas what’s going on?

It’s almost as though pandas converts df1.col1 to an integer just because it can, even though it should be treated as a string while matching.

(I tried to replicate this using sample dataframes, but for small examples, I don’t see this behavior. Any suggestions on how I can find a more descriptive example would be appreciated as well.)

Asked By: user1496984

Answers:

The issue was that the object dtype is misleading. I thought it meant that all items were strings. But apparently, while reading the file, pandas was converting some elements to ints and leaving the rest as strings.

The solution was to make sure that every field is a string:

>>> df1.col1 = df1.col1.astype(str)
>>> df2.col2 = df2.col2.astype(str)

Then the merge works as expected.
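To see whether an object column really holds mixed types, it can help to count the Python type of each element before casting. The following is only a minimal sketch with made-up data (the frames and the extra columns `a` and `b` are hypothetical, not the original SQL tables):

```python
import pandas as pd

# Hypothetical data: col1 mixes an int and a str, yet its dtype is still object.
df1 = pd.DataFrame({"col1": [95498208100000, "95498208100001"], "a": [1, 2]})
df2 = pd.DataFrame({"col2": ["95498208100000", "95498208100001"], "b": [10, 20]})

# Count the Python types actually stored in the key column.
type_counts = df1["col1"].map(lambda v: type(v).__name__).value_counts()
print(type_counts)

# Cast both keys to str so that equal-looking values compare equal, then merge.
df1["col1"] = df1["col1"].astype(str)
df2["col2"] = df2["col2"].astype(str)
after = pd.merge(df1, df2, left_on="col1", right_on="col2")
print(len(after))
```

If `type_counts` lists more than one type, the column is mixed and an exact string comparison will miss the non-string entries.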

(I wish there was a way of specifying a dtype of str…)

Answered By: user1496984

I ran into a case where the df.col = df.col.astype(str) solution did not work. Turns out the problem was in the encoding.

My original data looked like this:

In [72]: df1['col1'][:3]
Out[73]: 
             col1
0  dustin pedroia
1  kevin youkilis
2     david ortiz

In [72]: df2['col2'][:3]
Out[73]: 
             col2
0  dustin pedroia
1  kevin youkilis
2     david ortiz

And after using .astype(str) the merge still wasn’t working so I executed the following:

df1.col1 = df1.col1.str.encode('utf-8')
df2.col2 = df2.col2.str.encode('utf-8')

and was able to find the difference:

In [95]: df1
Out[95]: 
                       col1
0  b'dustin\xc2\xa0pedroia'
1  b'kevin\xc2\xa0youkilis'
2     b'david\xc2\xa0ortiz'

In [95]: df2
Out[95]: 
                col2
0  b'dustin pedroia'
1  b'kevin youkilis'
2     b'david ortiz'

At which point all I had to do was run df1.col1 = df1.col1.str.replace('\xa0',' ') on the decoded df1.col1 variable (i.e. before running .str.encode('utf-8')) and the merge worked perfectly.

NOTE: Regardless of what I was replacing I always used .str.encode('utf-8') to check whether it worked.
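A non-breaking space (U+00A0, written `\xa0` in Python) looks identical to a regular space when printed, which is why the keys appear equal. Here is a small sketch of the check and the fix, using hypothetical two-row frames; `ascii()` is used instead of encoding so the column itself is left untouched:

```python
import pandas as pd

# Hypothetical keys: df1 uses a non-breaking space (U+00A0), df2 a plain space.
df1 = pd.DataFrame({"col1": ["dustin\xa0pedroia", "david\xa0ortiz"]})
df2 = pd.DataFrame({"col2": ["dustin pedroia", "david ortiz"]})

# ascii() escapes every non-ASCII character, making the hidden difference visible.
escaped = df1["col1"].map(ascii).tolist()
print(escaped)

# Replace the non-breaking space with a regular space (literal match, no regex).
df1["col1"] = df1["col1"].str.replace("\xa0", " ", regex=False)
merged = pd.merge(df1, df2, left_on="col1", right_on="col2")
print(len(merged))
```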

Alternatively

Using regular expressions and the Variable Explorer in the Spyder IDE for Anaconda I found the following difference.

import re
#places the raw string into a list
df1.col1 = df1.col1.apply(lambda x: re.findall(x, x))  
df2.col2 = df2.col2.apply(lambda x: re.findall(x, x))

where my df1 data turned into this (copied and pasted from Spyder):

['dustin\xa0pedroia']
['kevin\xa0youkilis']
['david\xa0ortiz']

which just has a slightly different solution. I don’t know in what case the first example wouldn’t work and the second would but I wanted to provide both just in case someone runs into it 🙂

Answered By: seeiespi

Thanks, @seeiespi. The .str.encode('utf-8') check helped me figure out that my strings needed to be stripped, as shown below:

20                 b'Belize '   ...     0,612
21                  b'Benin '   ...     0,546

The solution was to strip the whitespace from the key columns:

df1.col1 = df1.col1.str.strip()
df2.col2 = df2.col2.str.strip()
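
One way to confirm that stray whitespace is the culprit is an outer merge with `indicator=True`, which labels each row by the side it came from. This is a rough sketch on hypothetical data (the `score` and `code` columns are invented for illustration):

```python
import pandas as pd

# Hypothetical frames: keys in df1 carry a trailing space, keys in df2 do not.
df1 = pd.DataFrame({"col1": ["Belize ", "Benin "], "score": [0.612, 0.546]})
df2 = pd.DataFrame({"col2": ["Belize", "Benin"], "code": ["BZ", "BJ"]})

# indicator=True adds a _merge column: 'left_only', 'right_only' or 'both'.
check = pd.merge(df1, df2, left_on="col1", right_on="col2",
                 how="outer", indicator=True)
unmatched = check[check["_merge"] != "both"]
print(unmatched)

# Strip leading and trailing whitespace on both keys, then merge again.
df1["col1"] = df1["col1"].str.strip()
df2["col2"] = df2["col2"].str.strip()
merged = pd.merge(df1, df2, left_on="col1", right_on="col2")
print(len(merged))
```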
Answered By: 0xFK

None of the above solutions worked for me because the merge was actually done correctly but the indexing got messed up. Removing the index solved it for me:

df['sth'] = df.merge(df2, how='left', on=['x', 'y'])['sth'].values
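
To illustrate why `.values` matters here: the merged frame gets a fresh 0..n-1 index, so assigning its column back as a Series aligns on index labels rather than on position. A small sketch with hypothetical frames, assuming the left merge returns exactly one row per row of `df` (no duplicate keys in `df2`):

```python
import pandas as pd

# Hypothetical frames; df deliberately has a non-default index.
df = pd.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]}, index=[10, 11, 12])
df2 = pd.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"], "sth": [100, 200, 300]})

merged = df.merge(df2, how="left", on=["x", "y"])  # result index is 0, 1, 2

# Assigning the Series aligns on labels; 10..12 and 0..2 share none, so all NaN.
aligned = df.assign(sth=merged["sth"])

# .values drops the index, so the assignment is purely positional.
positional = df.assign(sth=merged["sth"].values)
print(aligned)
print(positional)
```

If `df2` had duplicate keys, the merge would return more rows than `df` and the positional assignment would raise a length mismatch error.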
Answered By: Jan

Your columns probably contain some difference, such as a blank space, that is causing this error.

First check your column types and whether the items differ in any way:

df1.col1 = df1.col1.str.encode('utf-8')
df2.col2 = df2.col2.str.encode('utf-8')

If there is any difference between them, you can use

df1.col1 = df1.col1.str.replace("this", "for that")

or if there is any blank space

df1.col1 = df1.col1.apply(str).str.strip()
# apply(str) is used here because without it, the program returns an error about being unable to convert from bytes.
Answered By: Vinicius Raphael

This answer solved it for me:

pd.merge(df1.assign(x=df1.x.astype(str)), 
         df2.assign(x=df2.x.astype(str)), 
         how='left', on='x')

(from the question “Pandas merge issue on key of object type containing number and string values”)

Answered By: Kyle Pennell

This step:

df1.col1 = df1.col1.str.strip()
df2.col2 = df2.col2.str.strip()

and then doing this step:

pd.merge(df1.assign(x=df1.x.astype(str)), 
         df2.assign(x=df2.x.astype(str)), 
         how='left', on='x')

worked for me. I needed BOTH steps together.

Answered By: Ehsan Rezaei