Probably a pretty simple question.

I have two datasets, stored in two separate CSV files, that I read in using pandas.read_csv().

With the following code:

        import mibian
        import pandas as pd

        underlying = pd.read_csv("txt1.csv", names=['dt1','price']);

        options = pd.read_csv("txt2.txt", names=['dt2','ticker','maturity','strike','cP','px','strike','yield','rF','T','rlzd10']);

        merged = underlying.merge(options, left_on='dt1', right_on='dt2');

My two data heads look like this:

>>> underlying.head();

          0         1
0  20040326  3.579987
1  20040329  3.690494
2  20040330  3.755247
3  20040331  3.719373
4  20040401  3.728671


>>> options.head();

         0     1         2     3     4      5     6   7      8         9                10

0  20130628  SVXY  20130817  32.5  call  39.22  32.5   0  0.005  0.136986   0.411224

So column 0 in each data set is the key I want to merge on, and I want to keep all data from both data sets.

How would I go about doing this? All the examples I am finding online require named key columns, but I do not have those in my results.

When I run the join, I get the following error:

                            Traceback (most recent call last):
                              File "<stdin>", line 1, in <module>
                              File "/Applications/", line 540, in runfile
                                execfile(filename, namespace)
                              File "/Users/jasonmellone/.spyder2/", line 12, in <module>
                                merged = underlying.merge(options, left_on='dt1', right_on='dt2',how='outer');
                              File "/Library/Python/2.7/site-packages/pandas-0.13.0-py2.7-macosx-10.9-intel.egg/pandas/core/", line 3723, in merge
                                suffixes=suffixes, copy=copy)
                              File "/Library/Python/2.7/site-packages/pandas-0.13.0-py2.7-macosx-10.9-intel.egg/pandas/tools/", line 40, in merge
                                return op.get_result()
                              File "/Library/Python/2.7/site-packages/pandas-0.13.0-py2.7-macosx-10.9-intel.egg/pandas/tools/", line 197, in get_result
                                result_data = join_op.get_result()
                              File "/Library/Python/2.7/site-packages/pandas-0.13.0-py2.7-macosx-10.9-intel.egg/pandas/tools/", line 722, in get_result
                                return BlockManager(result_blocks, self.result_axes)
                              File "/Library/Python/2.7/site-packages/pandas-0.13.0-py2.7-macosx-10.9-intel.egg/pandas/core/", line 1954, in __init__
                              File "/Library/Python/2.7/site-packages/pandas-0.13.0-py2.7-macosx-10.9-intel.egg/pandas/core/", line 2091, in _set_ref_locs
                                'have _ref_locs set' % (block, labels))
                            AssertionError: Cannot create BlockManager._ref_locs because block [IntBlock: [dt1], 1 x 372145, dtype: int64] with duplicate items [Index([u'dt1', u'price', u'dt2', u'ticker', u'maturity', u'strike', u'cP', u'px', u'strike', u'yield', u'rF', u'T', u'rlzd10'], dtype='object')] does not have _ref_locs set

I searched through my data sets and there are no duplicates.

Thank you!

Asked By: jason m



You should still be able to merge on the columns:

merged = underlying.merge(options, left_on='0', right_on='0')

This performs an inner merge, so the result contains only the intersection of both datasets, i.e. the rows whose column 0 values exist in both. If you want all values, specify how='outer':

merged = underlying.merge(options, left_on='0', right_on='0', how='outer')

          0       1_x   1_y         2     3     4      5     6   7      8  
0  20040326  3.579987   NaN       NaN   NaN   NaN    NaN   NaN NaN    NaN   
1  20040329  3.690494   NaN       NaN   NaN   NaN    NaN   NaN NaN    NaN   
2  20040330  3.755247   NaN       NaN   NaN   NaN    NaN   NaN NaN    NaN   
3  20040331  3.719373   NaN       NaN   NaN   NaN    NaN   NaN NaN    NaN   
4  20040401  3.728671   NaN       NaN   NaN   NaN    NaN   NaN NaN    NaN   
5  20130628       NaN  SVXY  20130817  32.5  call  39.22  32.5   0  0.005   

          9        10  
0       NaN       NaN  
1       NaN       NaN  
2       NaN       NaN  
3       NaN       NaN  
4       NaN       NaN  
5  0.136986  0.411224  

[6 rows x 12 columns]

You would then have to rename or drop the columns that clashed (1_x and 1_y above).
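As a rough illustration of the inner/outer difference and the rename step, here is a minimal sketch. The two frames are made-up miniatures whose column labels are the strings '0' and '1', and the new names 'price' and 'ticker' are only assumptions about what those columns hold.

```python
import pandas as pd

# Made-up miniature frames; column labels are the strings '0' and '1'
underlying = pd.DataFrame({'0': [20040326, 20040329], '1': [3.579987, 3.690494]})
options = pd.DataFrame({'0': [20040329, 20130628], '1': ['SVXY', 'SVXY']})

# Inner merge keeps only keys present in both frames
inner = underlying.merge(options, left_on='0', right_on='0')
# Outer merge keeps every key from either frame, filling gaps with NaN
outer = underlying.merge(options, left_on='0', right_on='0', how='outer')

# Replace the automatic 1_x / 1_y suffixes with meaningful (assumed) names
outer = outer.rename(columns={'1_x': 'price', '1_y': 'ticker'})

print(len(inner), len(outer))
print(list(outer.columns))
```

Alternatively, merge accepts a suffixes argument if you would rather control the suffixes than rename afterwards.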

It is probably better to give the columns sensible names beforehand.
When reading the CSV you can pass a list of column names:

df = pd.read_csv('data.csv', names=['Id', 'Price'])
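A sketch of that approach applied to the two files in the question follows. The rows are made up, the option columns are shortened, and 'strike2' is a hypothetical name: the traceback in the question lists 'strike' twice, which may be what the "duplicate items" message refers to, so every name here is kept unique.

```python
import io
import pandas as pd

# Made-up rows standing in for the two files
underlying_csv = io.StringIO("20040326,3.579987\n20130628,39.10\n")
options_csv = io.StringIO("20130628,SVXY,20130817,32.5,call,39.22,32.5\n")

underlying = pd.read_csv(underlying_csv, names=['dt1', 'price'])

# 'strike2' is a hypothetical name chosen so that no label repeats
option_cols = ['dt2', 'ticker', 'maturity', 'strike', 'cP', 'px', 'strike2']
assert len(option_cols) == len(set(option_cols)), "column names must be unique"
options = pd.read_csv(options_csv, names=option_cols)

# The key columns have different names, so use left_on / right_on
merged = underlying.merge(options, left_on='dt1', right_on='dt2', how='outer')
print(merged)
```

Because the key names differ, both dt1 and dt2 appear in the result; rows found only in underlying have NaN in the option columns.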
Answered By: EdChum

If the key column has the same label in both dataframes, which is true in your case, you can simply use on=0, where 0 is the label of the first column in both dataframes.

import pandas as pd
merged = underlying.merge(options, on=0, how='outer')
# or
merged = pd.merge(underlying, options, on=0, how='outer')

If the key column has a different label in each dataframe, you can use the left_on and right_on options instead.

# here 0 is the key column of df1 and 2 is the key column of df2
pd.merge(df1, df2, left_on=0, right_on=2, how='outer')
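To check that an outer merge really keeps everything from both sides, one option is the indicator argument of merge. This is only a sketch on made-up frames with default integer column labels; it assumes a pandas version that supports indicator.

```python
import pandas as pd

# Made-up frames; DataFrame() assigns integer column labels 0, 1, ...
df1 = pd.DataFrame([[20040326, 3.579987], [20130628, 39.10]])
df2 = pd.DataFrame([[20130628, 'SVXY'], [20130701, 'SVXY']])

# indicator=True adds a '_merge' column: 'left_only', 'right_only' or 'both'
merged = pd.merge(df1, df2, on=0, how='outer', indicator=True)
print(merged[[0, '_merge']])
```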
Answered By: Abu Shoeb

A similar issue brought me to this thread: I was ending up with a KeyError. The fix was to remove the single quotes, changing left_on='0' to left_on=0.

# raised a KeyError for me
merged = underlying.merge(options, left_on='0', right_on='0')
# worked
merged = underlying.merge(options, left_on=0, right_on=0)
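The likely reason, sketched below on made-up data: when read_csv is called with header=None, the column labels are the integers 0, 1, ..., so the string '0' does not match any column.

```python
import io
import pandas as pd

# Made-up rows; header=None makes pandas label the columns with integers
underlying = pd.read_csv(io.StringIO("20040326,3.579987\n20130628,39.10\n"), header=None)
options = pd.read_csv(io.StringIO("20130628,SVXY,32.5\n"), header=None)

print(underlying.columns.tolist())  # integer labels, not strings

# Integer key: matches the actual column label
merged = underlying.merge(options, left_on=0, right_on=0, how='outer')

# String key: no column is labelled '0', so pandas raises KeyError
try:
    underlying.merge(options, left_on='0', right_on='0')
    string_key_failed = False
except KeyError:
    string_key_failed = True

print(string_key_failed)
```

Checking df.columns before merging is a quick way to see which form the labels take.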
Answered By: xl_cheese
