SQL Server: How to replicate pandas merge?
Question:
How can I replicate a pandas merge in SQL Server?
I want to do this:
import pandas as pd
# outer-merge the two DataFrames, then drop the rows that appear in both
df1 = pd.DataFrame([
['A', 1, 'c', 'a'],
['A', 2, 'c', 'a'],
['B', 2, 'c', 'a'],
['B', 3, 'c', 'a'],
['C', 3, 'c', 'a'],
['C', 4, 'c', 'a'],
['D', 3, 'c', 'a']
],
columns = ['ID', 'Period', 'Pivot', 'Group'])
df2 = pd.DataFrame([
['A', 1, 'c', 'a'],
['A', 2, 'c', 'a'],
['B', 2, 'c', 'a'],
['B', 3, 'c', 'a'],
['C', 3, 'c', 'a'],
['C', 4, 'd', 'a'],
['D', 3, 'd', 'a']
],
columns = ['ID', 'Period', 'Pivot', 'Group'])
out = (
    df1.merge(df2, how='outer', on=['ID', 'Period', 'Pivot', 'Group'], indicator=True)
       .query('_merge != "both"')
)
What I have tried to do is implement a variant of this:
https://stackoverflow.com/a/511022/6534818
SELECT a.SelfJoinTableID
FROM dbo.SelfJoinTable a
INNER JOIN dbo.SelfJoinTable b
ON a.SelfJoinTableID = b.SelfJoinTableID
INNER JOIN dbo.SelfJoinTable c
ON a.SelfJoinTableID = c.SelfJoinTableID
WHERE a.Status = 'Status to filter a'
AND b.Status = 'Status to filter b'
AND c.Status = 'Status to filter c'
But it does not return what I get in pandas.
Answers:
It looks like you are trying to select the rows that are in df1 or df2, but not both. On SQL Server, you can use the UNION, EXCEPT, and INTERSECT operators like this:
(SELECT * FROM TableA
UNION
SELECT * FROM TableB)
EXCEPT
(SELECT * FROM TableA
INTERSECT
SELECT * FROM TableB)
The first three lines compute the set union of all rows in TableA and TableB. The last three lines compute the set intersection of TableA and TableB (i.e., only the rows that appear in both). Finally, the EXCEPT operator removes the latter from the former.
See also: https://learn.microsoft.com/en-us/sql/t-sql/language-elements/set-operators-union-transact-sql
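To check the set-operator approach against the sample data from the question, here is a minimal sketch that runs it in Python. It uses the standard-library sqlite3 module with an in-memory database as a stand-in for SQL Server, which is an assumption made only so the example runs anywhere. SQLite does not accept parentheses around a compound SELECT, so each half is wrapped in a subquery. On SQL Server the parenthesized form above should work as written, and any derived table there would need an alias. The second query, which adds a `_merge` column similar to pandas' `indicator=True`, is an illustrative extension and not part of the original answer. Note that UNION, EXCEPT, and INTERSECT return distinct rows, so duplicated rows within a table would be collapsed, unlike in a pandas merge.

```python
import sqlite3

# Sample rows copied from df1 and df2 in the question.
rows_a = [
    ("A", 1, "c", "a"), ("A", 2, "c", "a"), ("B", 2, "c", "a"),
    ("B", 3, "c", "a"), ("C", 3, "c", "a"), ("C", 4, "c", "a"),
    ("D", 3, "c", "a"),
]
rows_b = [
    ("A", 1, "c", "a"), ("A", 2, "c", "a"), ("B", 2, "c", "a"),
    ("B", 3, "c", "a"), ("C", 3, "c", "a"), ("C", 4, "d", "a"),
    ("D", 3, "d", "a"),
]

# In-memory SQLite database standing in for SQL Server (assumption).
conn = sqlite3.connect(":memory:")
for name, rows in (("TableA", rows_a), ("TableB", rows_b)):
    conn.execute(
        f'CREATE TABLE {name} (ID TEXT, Period INTEGER, "Pivot" TEXT, "Group" TEXT)'
    )
    conn.executemany(f"INSERT INTO {name} VALUES (?, ?, ?, ?)", rows)

# (A UNION B) EXCEPT (A INTERSECT B): rows present in exactly one table.
symmetric_difference = """
SELECT * FROM (SELECT * FROM TableA UNION SELECT * FROM TableB)
EXCEPT
SELECT * FROM (SELECT * FROM TableA INTERSECT SELECT * FROM TableB)
"""
result = sorted(conn.execute(symmetric_difference).fetchall())

# Illustrative variant: tag each row with the side it came from,
# loosely mirroring the _merge column produced by indicator=True.
tagged_difference = """
SELECT *, 'left_only' AS _merge
FROM (SELECT * FROM TableA EXCEPT SELECT * FROM TableB)
UNION ALL
SELECT *, 'right_only' AS _merge
FROM (SELECT * FROM TableB EXCEPT SELECT * FROM TableA)
"""
tagged = sorted(conn.execute(tagged_difference).fetchall())

for row in result:
    print(row)
for row in tagged:
    print(row)
conn.close()
```

For the sample data, both queries return the four rows for IDs C (Period 4) and D (Period 3), where Pivot differs between the two tables.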