Does Pandas "cross" merge keep order of both left and right?
Question:
I would like to know if the merge operation using how="cross"
will keep my lines order on the left and right side, to be more clear, I except something like that :
df1 = pd.DataFrame(["a", "b", "c"])
df2 = pd.DataFrame(["1", "2", "3"])
df1.merge(df2, how="cross")
# I except the result to be ALWAYS like this (with 1, 2, 3 repeating) :
0 a 1
1 a 2
2 a 3
3 b 1
4 b 2
5 b 3
6 c 1
7 c 2
8 c 3
I tested with few data, but I will have to use billions of rows, thus it’s hard to check if the order stays the same.
In pandas doc, they say :
cross: creates the cartesian product from both frames, preserves the order of the left keys.
Left key is preserved, so should I assume right keys order is not ?
Thanks for your help
Answers:
The order of the right keys is preserved within each group of left keys. What the documentation describes is that the left keys will be the primary order.
Example:
df1 = pd.DataFrame(["c", "b", "a"])
df2 = pd.DataFrame(["1", "3", "2"])
df1.merge(df2, how="cross")
Output:
0_x 0_y
0 c 1 # within a group
1 c 3 # the order is always
2 c 2 # 1 -> 3 -> 2
3 b 1
4 b 3
5 b 2
6 a 1
7 a 3
8 a 2
how is it done internally?
pandas sets up dummy columns in each DataFrame with a value of 1 and performs an inner merge:
def _create_cross_configuration(
self, left: DataFrame, right: DataFrame
) -> tuple[DataFrame, DataFrame, str, str]:
"""
Creates the configuration to dispatch the cross operation to inner join,
e.g. adding a join column and resetting parameters. Join column is added
to a new object, no inplace modification
Parameters
----------
left : DataFrame
right : DataFrame
Returns
-------
a tuple (left, right, how, cross_col) representing the adjusted
DataFrames with cross_col, the merge operation set to inner and the column
to join over.
"""
cross_col = f"_cross_{uuid.uuid4()}"
how = "inner"
return (
left.assign(**{cross_col: 1}),
right.assign(**{cross_col: 1}),
how,
cross_col,
)
I would like to know if the merge operation using how="cross"
will keep my lines order on the left and right side, to be more clear, I except something like that :
df1 = pd.DataFrame(["a", "b", "c"])
df2 = pd.DataFrame(["1", "2", "3"])
df1.merge(df2, how="cross")
# I except the result to be ALWAYS like this (with 1, 2, 3 repeating) :
0 a 1
1 a 2
2 a 3
3 b 1
4 b 2
5 b 3
6 c 1
7 c 2
8 c 3
I tested with few data, but I will have to use billions of rows, thus it’s hard to check if the order stays the same.
In pandas doc, they say :
cross: creates the cartesian product from both frames, preserves the order of the left keys.
Left key is preserved, so should I assume right keys order is not ?
Thanks for your help
The order of the right keys is preserved within each group of left keys. What the documentation describes is that the left keys will be the primary order.
Example:
df1 = pd.DataFrame(["c", "b", "a"])
df2 = pd.DataFrame(["1", "3", "2"])
df1.merge(df2, how="cross")
Output:
0_x 0_y
0 c 1 # within a group
1 c 3 # the order is always
2 c 2 # 1 -> 3 -> 2
3 b 1
4 b 3
5 b 2
6 a 1
7 a 3
8 a 2
how is it done internally?
pandas sets up dummy columns in each DataFrame with a value of 1 and performs an inner merge:
def _create_cross_configuration(
self, left: DataFrame, right: DataFrame
) -> tuple[DataFrame, DataFrame, str, str]:
"""
Creates the configuration to dispatch the cross operation to inner join,
e.g. adding a join column and resetting parameters. Join column is added
to a new object, no inplace modification
Parameters
----------
left : DataFrame
right : DataFrame
Returns
-------
a tuple (left, right, how, cross_col) representing the adjusted
DataFrames with cross_col, the merge operation set to inner and the column
to join over.
"""
cross_col = f"_cross_{uuid.uuid4()}"
how = "inner"
return (
left.assign(**{cross_col: 1}),
right.assign(**{cross_col: 1}),
how,
cross_col,
)