Merging two dataframes in pandas without column names (new to pandas)

Question:

Short explanation:

If you have duplicate column names in your data, be sure to rename one column when you read the file.

If you have NaN etc in your data, remove those.

Then merge using correct answer below.


Probably a pretty simple question.

I have two datasets that I read in using pandas.read_csv().

My data is in two separate csv.

With the following code:

        import mibian
        import pandas as pd


        underlying = pd.read_csv("txt1.csv", names=['dt1','price']);

        options = pd.read_csv("txt2.txt", names=['dt2','ticker','maturity','strike','cP','px','strike','yield','rF','T','rlzd10']);

        merged = underlying.merge(options, left_on='dt1', right_on='dt2');

My two data heads look like this:

>>> underlying.head();

          0         1
0  20040326  3.579987
1  20040329  3.690494
2  20040330  3.755247
3  20040331  3.719373
4  20040401  3.728671

and

>>> options.head();

         0     1         2     3     4      5     6   7      8         9                10

0  20130628  SVXY  20130817  32.5  call  39.22  32.5   0  0.005  0.136986   0.411224

So my column 0 on either data set is the key i want to merge on, and i want to keep all data from both result sets.

How would I go about doing this? All the examples I am finding online require keys, but I do not have that in my results.

But on the join I get the following errors:

                            Traceback (most recent call last):
                              File "<stdin>", line 1, in <module>
                              File "/Applications/Spyder.app/Contents/Resources/lib/python2.7/spyderlib/widgets/externalshell/sitecustomize.py", line 540, in runfile
                                execfile(filename, namespace)
                              File "/Users/jasonmellone/.spyder2/.temp.py", line 12, in <module>
                                merged = underlying.merge(options, left_on='dt1', right_on='dt2',how='outer');
                              File "/Library/Python/2.7/site-packages/pandas-0.13.0-py2.7-macosx-10.9-intel.egg/pandas/core/frame.py", line 3723, in merge
                                suffixes=suffixes, copy=copy)
                              File "/Library/Python/2.7/site-packages/pandas-0.13.0-py2.7-macosx-10.9-intel.egg/pandas/tools/merge.py", line 40, in merge
                                return op.get_result()
                              File "/Library/Python/2.7/site-packages/pandas-0.13.0-py2.7-macosx-10.9-intel.egg/pandas/tools/merge.py", line 197, in get_result
                                result_data = join_op.get_result()
                              File "/Library/Python/2.7/site-packages/pandas-0.13.0-py2.7-macosx-10.9-intel.egg/pandas/tools/merge.py", line 722, in get_result
                                return BlockManager(result_blocks, self.result_axes)
                              File "/Library/Python/2.7/site-packages/pandas-0.13.0-py2.7-macosx-10.9-intel.egg/pandas/core/internals.py", line 1954, in __init__
                                self._set_ref_locs(do_refs=True)
                              File "/Library/Python/2.7/site-packages/pandas-0.13.0-py2.7-macosx-10.9-intel.egg/pandas/core/internals.py", line 2091, in _set_ref_locs
                                'have _ref_locs set' % (block, labels))
                            AssertionError: Cannot create BlockManager._ref_locs because block [IntBlock: [dt1], 1 x 372145, dtype: int64] with duplicate items [Index([u'dt1', u'price', u'dt2', u'ticker', u'maturity', u'strike', u'cP', u'px', u'strike', u'yield', u'rF', u'T', u'rlzd10'], dtype='object')] does not have _ref_locs set

I searched through my data sets and there are no duplicates.

Thank you!

Asked By: jason m

||

Answers:

You should still be able to merge on the columns:

merged = underlying.merge(options, left_on='0', right_on='0')

This will perform an inner merge so only the intersection of both datasets, i.e. where the values in column 0 exist in both, if you want all values, then specifcy outer:

merged = underlying.merge(options, left_on='0', right_on='0', how='outer')

In [10]:  

merged = underlying.merge(options, left_on='0', right_on='0', how='outer')

merged

Out[10]:

          0       1_x   1_y         2     3     4      5     6   7      8  
0  20040326  3.579987   NaN       NaN   NaN   NaN    NaN   NaN NaN    NaN   
1  20040329  3.690494   NaN       NaN   NaN   NaN    NaN   NaN NaN    NaN   
2  20040330  3.755247   NaN       NaN   NaN   NaN    NaN   NaN NaN    NaN   
3  20040331  3.719373   NaN       NaN   NaN   NaN    NaN   NaN NaN    NaN   
4  20040401  3.728671   NaN       NaN   NaN   NaN    NaN   NaN NaN    NaN   
5  20130628       NaN  SVXY  20130817  32.5  call  39.22  32.5   0  0.005   

          9        10  
0       NaN       NaN  
1       NaN       NaN  
2       NaN       NaN  
3       NaN       NaN  
4       NaN       NaN  
5  0.136986  0.411224  

[6 rows x 12 columns]

You would have to rename or move the columns that clashed 1_x and 1_y above.

It is probably better to rename the columns to something sensible before hand.
When reading the csv you can pass a list of column names:

df = pd.read_csv('data.csv', names=['Id', 'Price'])
Answered By: EdChum

If you would like to use the same column for merging, which is true in your case, you can simply use on=0 where 0 represents the first column in both dataframes.

import pandas as pd
merged = underlying.merge(options, on=0, how='outer')
# or
merged = pd.merge(underlying, options, on=0, how='outer')

If the index column is different in both dataframes then you could use left_on and right_on onptions.

# here 0 is the index column for df1 and 2 is the index column for df2
pd.merge(df1, df2, left_on=0, right_on=2, how='outer')
Answered By: Abu Shoeb

Similar issue brought me to this thread. I was ending up with a key error. The fix was removing the single quotes from left_on='0' to left_on=0.

merged = underlying.merge(options, left_on='0', right_on='0')
merged = underlying.merge(options, left_on=0, right_on=0)
Answered By: xl_cheese
Categories: questions Tags: ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.