Substantiate that polars isn't copying data even though python reports different id
Question:
If I have two tables A and B, how can I do a join of B into A in place so that A keeps all its data and is modified by that join without having to make a copy?
And can that join only take specified columns from B into A?
A:
┌─────┬─────┬───────┐
│ one ┆ two ┆ three │
╞═════╪═════╪═══════╡
│ a ┆ 1 ┆ 3 │
│ b ┆ 4 ┆ 6 │
│ c ┆ 7 ┆ 9 │
│ d ┆ 10 ┆ 12 │
│ e ┆ 13 ┆ 15 │
│ f ┆ 16 ┆ 18 │
└─────┴─────┴───────┘
B:
┌─────┬─────┬───────┬──────┐
│ one ┆ two ┆ three ┆ four │
╞═════╪═════╪═══════╪══════╡
│ a ┆ 1 ┆ 3 ┆ yes │
│ c ┆ 7 ┆ 9 ┆ yes │
│ f ┆ 16 ┆ 18 ┆ yes │
└─────┴─────┴───────┴──────┘
I’d like to left join A and B, keeping all data in A and the four column of B, renamed as result.
With data.table I can do exactly this after reading A and B:
address(A)
# [1] "0x55fc74197910"
A[B, on = .(one, two), result := i.four]
A
# one two three result
# 1: a 1 3 yes
# 2: b 4 6 <NA>
# 3: c 7 9 yes
# 4: d 10 12 <NA>
# 5: e 13 15 <NA>
# 6: f 16 18 yes
address(A)
# [1] "0x55fc74197910"
With polars in python:
A.join(B, on = ["one", "two"], how = 'left')
# shape: (6, 5)
# ┌─────┬─────┬───────┬─────────────┬──────┐
# │ one ┆ two ┆ three ┆ three_right ┆ four │
# │ --- ┆ --- ┆ --- ┆ --- ┆ --- │
# │ str ┆ i64 ┆ i64 ┆ i64 ┆ str │
# ╞═════╪═════╪═══════╪═════════════╪══════╡
# │ a ┆ 1 ┆ 3 ┆ 3 ┆ yes │
# │ b ┆ 4 ┆ 6 ┆ null ┆ null │
# │ c ┆ 7 ┆ 9 ┆ 9 ┆ yes │
# │ d ┆ 10 ┆ 12 ┆ null ┆ null │
# │ e ┆ 13 ┆ 15 ┆ null ┆ null │
# │ f ┆ 16 ┆ 18 ┆ 18 ┆ yes │
# └─────┴─────┴───────┴─────────────┴──────┘
A
# shape: (6, 3)
# ┌─────┬─────┬───────┐
# │ one ┆ two ┆ three │
# │ --- ┆ --- ┆ --- │
# │ str ┆ i64 ┆ i64 │
# ╞═════╪═════╪═══════╡
# │ a ┆ 1 ┆ 3 │
# │ b ┆ 4 ┆ 6 │
# │ c ┆ 7 ┆ 9 │
# │ d ┆ 10 ┆ 12 │
# │ e ┆ 13 ┆ 15 │
# │ f ┆ 16 ┆ 18 │
# └─────┴─────┴───────┘
A is unchanged. If A is assigned again:
id(A)
# 139703375023552
A = A.join(B, on = ['one', 'two'], how = 'left')
id(A)
# 139703374967280
its memory address changes.
Answers:
There is indeed no copy occurring there. Think of the DataFrame class as a container, like a Python list: a new container has a new id, but the contents of the container are not copied. The same thing happens with plain lists:
# create a list/container with some object data
v1 = [object(), object(), object()]
print(v1)
# [<object at 0x1686b6510>, <object at 0x1686b6490>, <object at 0x1686b6550>]
v2 = v1[:]
print(v2)
# [<object at 0x1686b6510>, <object at 0x1686b6490>, <object at 0x1686b6550>]
v3 = v1[:2]
print(v3)
# [<object at 0x1686b6510>, <object at 0x1686b6490>]
(Each of v1, v2, and v3 will have a different id.)
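The list analogy can be checked directly rather than by comparing printed addresses. The sketch below uses only the standard library and the `is` operator. The names `containers_differ` and `elements_shared` are introduced here for illustration. It demonstrates the container-versus-contents distinction for Python lists only; it does not by itself prove anything about how polars shares column buffers internally.

```python
# Build a list and two shallow slices of it, as in the example above.
v1 = [object(), object(), object()]
v2 = v1[:]    # full slice: a new list holding the same three elements
v3 = v1[:2]   # partial slice: a new list holding the first two elements

# The three containers are distinct objects, so their ids differ.
containers_differ = v1 is not v2 and v1 is not v3 and v2 is not v3

# Every element in the slices is the very same object as in v1, not a copy.
elements_shared = all(a is b for a, b in zip(v1, v2)) and all(
    a is b for a, b in zip(v1, v3)
)

print(containers_differ)  # True
print(elements_shared)    # True
```

A changed `id()` on the outer object therefore only says that a new wrapper was created, not that the wrapped data was duplicated.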