Substantiate that polars isn't copying data even though python reports different id

Question:

If I have two tables A and B, how can I do a join of B into A in place so that A keeps all its data and is modified by that join without having to make a copy?
And can that join only take specified columns from B into A?

A:

┌─────┬─────┬───────┐
│ one ┆ two ┆ three │
╞═════╪═════╪═══════╡
│ a   ┆ 1   ┆ 3     │
│ b   ┆ 4   ┆ 6     │
│ c   ┆ 7   ┆ 9     │
│ d   ┆ 10  ┆ 12    │
│ e   ┆ 13  ┆ 15    │
│ f   ┆ 16  ┆ 18    │
└─────┴─────┴───────┘

B:

┌─────┬─────┬───────┬──────┐
│ one ┆ two ┆ three ┆ four │
╞═════╪═════╪═══════╪══════╡
│ a   ┆ 1   ┆ 3     ┆ yes  │
│ c   ┆ 7   ┆ 9     ┆ yes  │
│ f   ┆ 16  ┆ 18    ┆ yes  │
└─────┴─────┴───────┴──────┘

I’d like to left join A and B, keeping all data in A and adding only column four from B, renamed to result.

With data.table I can do exactly this after reading A and B:

address(A)
# [1] "0x55fc74197910"

A[B, on = .(one, two), result := i.four]
A

#    one two three result
# 1:   a   1     3    yes
# 2:   b   4     6   <NA>
# 3:   c   7     9    yes
# 4:   d  10    12   <NA>
# 5:   e  13    15   <NA>
# 6:   f  16    18    yes

address(A)
# [1] "0x55fc74197910"

With polars in python:

A.join(B, on = ["one", "two"], how = 'left')

# shape: (6, 5)
# ┌─────┬─────┬───────┬─────────────┬──────┐
# │ one ┆ two ┆ three ┆ three_right ┆ four │
# │ --- ┆ --- ┆ ---   ┆ ---         ┆ ---  │
# │ str ┆ i64 ┆ i64   ┆ i64         ┆ str  │
# ╞═════╪═════╪═══════╪═════════════╪══════╡
# │ a   ┆ 1   ┆ 3     ┆ 3           ┆ yes  │
# │ b   ┆ 4   ┆ 6     ┆ null        ┆ null │
# │ c   ┆ 7   ┆ 9     ┆ 9           ┆ yes  │
# │ d   ┆ 10  ┆ 12    ┆ null        ┆ null │
# │ e   ┆ 13  ┆ 15    ┆ null        ┆ null │
# │ f   ┆ 16  ┆ 18    ┆ 18          ┆ yes  │
# └─────┴─────┴───────┴─────────────┴──────┘


A

# shape: (6, 3)
# ┌─────┬─────┬───────┐
# │ one ┆ two ┆ three │
# │ --- ┆ --- ┆ ---   │
# │ str ┆ i64 ┆ i64   │
# ╞═════╪═════╪═══════╡
# │ a   ┆ 1   ┆ 3     │
# │ b   ┆ 4   ┆ 6     │
# │ c   ┆ 7   ┆ 9     │
# │ d   ┆ 10  ┆ 12    │
# │ e   ┆ 13  ┆ 15    │
# │ f   ┆ 16  ┆ 18    │
# └─────┴─────┴───────┘

A is unchanged. If I assign the result of the join back to A:

id(A)
# 139703375023552

A = A.join(B, on = ['one', 'two'], how = 'left')
id(A)

# 139703374967280

its memory address changes.

Asked By: basesorbytes


Answers:

There is indeed no copy occurring there. Think of the DataFrame class as a container (like a Python list): the join returns a new container, so the id changes, but the contents of the container are not copied. You can see the same sort of thing happening with a plain list:

# create a list/container with some object data
v1 = [object(), object(), object()]
print(v1)
# [<object at 0x1686b6510>, <object at 0x1686b6490>, <object at 0x1686b6550>]

v2 = v1[:]
print(v2)
# [<object at 0x1686b6510>, <object at 0x1686b6490>, <object at 0x1686b6550>]

v3 = v1[:2]
print(v3)
# [<object at 0x1686b6510>, <object at 0x1686b6490>]

(Each of v1, v2, and v3 has a different id, yet the addresses printed above show that they hold the very same objects.)
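The same point can be checked with identity tests rather than by reading printed addresses. The sketch below is only an analogy: Frame and with_column are hypothetical names made up for illustration, not polars classes or methods, and the code says nothing about how polars is implemented internally. It shows that a new container can have a different id while still referencing the existing column objects.

```python
# Hypothetical stand-in for a column container; NOT a polars class.
class Frame:
    def __init__(self, columns):
        # A new dict is created, but it references the same column objects.
        self.columns = dict(columns)

    def with_column(self, name, values):
        # Return a new container; existing columns are referenced, not copied.
        new_columns = dict(self.columns)
        new_columns[name] = values
        return Frame(new_columns)


a = Frame({"one": ["a", "b", "c"], "two": [1, 4, 7]})
b = a.with_column("result", ["yes", None, "yes"])

# The containers are distinct objects, so their ids differ.
print(a is b)  # False

# Every column that existed in a is the very same object in b.
print(all(b.columns[name] is a.columns[name] for name in a.columns))  # True
```

Reassigning the new container to the old name (a = b) only rebinds the name, which is why id(A) changes after A = A.join(...) even when no column data was duplicated.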

Answered By: alexander-beedie