pandas cross join no columns in common
Question:
How would you perform a cross join of two dataframes with no columns in common using pandas?
In MySQL, you can simply do:
SELECT *
FROM table_1
[CROSS] JOIN table_2;
But in pandas, doing:
df_1.merge(df_2, how='outer')
gives an error:
MergeError: No common columns to perform merge on
The best solution I have so far is using sqlite:
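import pandas as pd
import sqlalchemy as sa

engine = sa.create_engine('sqlite:///tmp.db')
# index=False keeps the DataFrame index out of the tables, so the
# joined result does not end up with two "index" columns
df_1.to_sql('df_1', engine, index=False)
df_2.to_sql('df_2', engine, index=False)
df = pd.read_sql_query('SELECT * FROM df_1 JOIN df_2', engine)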
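The same SQL round trip can be sketched without SQLAlchemy or a file on disk, since pandas also accepts a plain sqlite3 connection. This is only an illustration: the frames df_1 and df_2 below are made-up sample data, and the in-memory database is an assumption, not part of the original question.

```python
import sqlite3

import pandas as pd

# Hypothetical sample frames with no columns in common
df_1 = pd.DataFrame({'a': [1, 2]})
df_2 = pd.DataFrame({'b': ['x', 'y', 'z']})

# ':memory:' creates a throwaway database that vanishes on close
conn = sqlite3.connect(':memory:')
df_1.to_sql('df_1', conn, index=False)
df_2.to_sql('df_2', conn, index=False)

# Every row of df_1 is paired with every row of df_2
df = pd.read_sql_query('SELECT * FROM df_1 CROSS JOIN df_2', conn)
conn.close()

print(df.shape)
```

The row order coming back from SQLite is not guaranteed without an ORDER BY, so sort afterwards if the order matters.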
Answers:
Even in MySQL you have to specify which fields you are joining on.
http://dev.mysql.com/doc/refman/5.7/en/join.html
Example:
SELECT * FROM t1 LEFT JOIN t2 ON (t1.a = t2.a);
Same concept with Pandas:
Parameters:
right : DataFrame
how : {'left', 'right', 'outer', 'inner'}, default 'inner'
    left: use only keys from left frame (SQL: left outer join)
    right: use only keys from right frame (SQL: right outer join)
    outer: use union of keys from both frames (SQL: full outer join)
    inner: use intersection of keys from both frames (SQL: inner join)
on : label or list
    Field names to join on. Must be found in both DataFrames. If on is None and not merging on indexes, then it merges on the intersection of the columns by default.
left_on : label or list, or array-like
    Field names to join on in left DataFrame. Can be a vector or list of vectors of the length of the DataFrame to use a particular vector as the join key instead of columns.
right_on : label or list, or array-like
    Field names to join on in right DataFrame or vector/list of vectors per left_on docs.
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html
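To make the default described for on concrete: when the two frames share no column names, the intersection is empty and merge has nothing to join on. A minimal sketch, using made-up frames df_1 and df_2 (not from the question) and catching the error via pd.errors.MergeError:

```python
import pandas as pd

# Hypothetical frames whose column sets do not overlap
df_1 = pd.DataFrame({'a': [1, 2]})
df_2 = pd.DataFrame({'b': ['x', 'y', 'z']})

try:
    # No on/left_on/right_on given, so pandas looks for shared columns
    df_1.merge(df_2, how='outer')
    raised = False
except pd.errors.MergeError as exc:
    raised = True
    print(exc)
```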
Update:
From Paul’s comment, you can now use df = df1.merge(df2, how="cross") (available since pandas 1.2).
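A short sketch of the how="cross" call, assuming pandas 1.2 or later. The names left, right and crossed are picked here for illustration; the data matches the example frames used in the older method.

```python
import pandas as pd

left = pd.DataFrame({'fld1': ['x', 'y'],
                     'fld2': ['a', 'b1']})
right = pd.DataFrame({'fld3': ['y', 'x', 'y'],
                      'fld4': ['a', 'b1', 'c2']})

# No on/left_on/right_on is passed with how="cross":
# each row of left is paired with every row of right
crossed = left.merge(right, how="cross")
print(crossed)
```

With 2 rows on the left and 3 on the right, the result has 2 * 3 = 6 rows and all four columns.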
The older method of creating temporary columns:
IIUC, you need merge with a temporary column tmp added to both DataFrames:
import pandas as pd

df1 = pd.DataFrame({'fld1': ['x', 'y'],
                    'fld2': ['a', 'b1']})
df2 = pd.DataFrame({'fld3': ['y', 'x', 'y'],
                    'fld4': ['a', 'b1', 'c2']})
print(df1)
  fld1 fld2
0    x    a
1    y   b1
print(df2)
  fld3 fld4
0    y    a
1    x   b1
2    y   c2
# add the same constant key to both frames, merge on it, then drop it
df1['tmp'] = 1
df2['tmp'] = 1
df = pd.merge(df1, df2, on=['tmp'])
df = df.drop('tmp', axis=1)
print(df)
  fld1 fld2 fld3 fld4
0    x    a    y    a
1    x    a    x   b1
2    x    a    y   c2
3    y   b1    y    a
4    y   b1    x   b1
5    y   b1    y   c2
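One side effect of the temporary-column method is that df1 and df2 keep the extra tmp column afterwards. A possible way around that, sketched here with a hypothetical helper cross_join and a hypothetical key name "_tmp", is to add the key with assign, which returns copies and leaves the inputs untouched:

```python
import pandas as pd


def cross_join(left, right, key="_tmp"):
    """Cartesian product of two frames via a temporary constant key."""
    if key in left.columns or key in right.columns:
        raise ValueError("temporary key %r clashes with a column" % key)
    # assign() returns new frames, so the callers' frames are not modified
    merged = left.assign(**{key: 1}).merge(right.assign(**{key: 1}), on=key)
    return merged.drop(columns=key)


df1 = pd.DataFrame({'fld1': ['x', 'y'],
                    'fld2': ['a', 'b1']})
df2 = pd.DataFrame({'fld3': ['y', 'x', 'y'],
                    'fld4': ['a', 'b1', 'c2']})

df = cross_join(df1, df2)
print(df)
```

On pandas 1.2 and later, how="cross" does the same job directly; a helper like this is only useful on older versions.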