How to efficiently create an index-like Polars DataFrame from multiple sparse series?

Question:

I would like to create a DataFrame that has an "index" (integer) from a number of (sparse) Series, where the index (or primary key) is NOT necessarily consecutive integers. Each Series is like a vector of (index, value) tuple or {index: value} mapping.

(1) A small example

In Pandas, this is very easy as we can create a DataFrame at a time, like

>>> pd.DataFrame({
   "A": {0:  'a', 20: 'b', 40: 'c'},
   "B": {10: 'd', 20: 'e', 30: 'f'},
   "C": {20: 'g', 30: 'h'},
}).sort_index()

      A    B    C
0     a  NaN  NaN
10  NaN    d  NaN
20    b    e    g
30  NaN    f    h
40    c  NaN  NaN

but I can’t find an easy way to achieve a similar result with Polars. As described in Coming from Pandas, Polars does not use an index unlike Pandas, and each row is indexed by its integer position in the table; so I might need to represent an "indexed" Series with a 2-column DataFrame:

A = pl.DataFrame({ "index": [0, 20, 40], "A": ['a', 'b', 'c'] })
B = pl.DataFrame({ "index": [10, 20, 30], "B": ['d', 'e', 'f'] })
C = pl.DataFrame({ "index": [20, 30], "C": ['g', 'h'] })

I tried to combine these multiple DataFrames, joining on the index column:

>>> A.join(B, on='index', how='outer').join(C, on='index', how='outer').sort(by='index')

shape: (5, 4)
┌───────┬──────┬──────┬──────┐
│ index ┆ A    ┆ B    ┆ C    │
│ ---   ┆ ---  ┆ ---  ┆ ---  │
│ i64   ┆ str  ┆ str  ┆ str  │
╞═══════╪══════╪══════╪══════╡
│ 0     ┆ a    ┆ null ┆ null │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 10    ┆ null ┆ d    ┆ null │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 20    ┆ b    ┆ e    ┆ g    │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 30    ┆ null ┆ f    ┆ h    │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 40    ┆ c    ┆ null ┆ null │
└───────┴──────┴──────┴──────┘

This gives the result I want, but I wonder:

  • (i) if there is there more concise way to do this over many columns, and
  • (ii) how make this operation as efficient as possible.

Alternatives?

I also tried outer joins as this is one way to combine Dataframes with different number of columns and rows, as described above.

Other alternatives I tried includes diagonal concatenation, but this does not deduplicate or join on index:

>>> pl.concat([A, B, C], how='diagonal')

   index     A     B     C
0      0     a  None  None
1     20     b  None  None
2     40     c  None  None
3     10  None     d  None
4     20  None     e  None
5     30  None     f  None
6     20  None  None     g
7     30  None  None     h

(2) Efficiently Building a Large Table

The approach I found above gives desired results I’d want but I feel there must be a better way in terms of performance. Consider a case with more large tables; say 300,000 rows and 20 columns:

N, C = 300000, 20
pls = []
pds = []

for i in range(C):
    A = pl.DataFrame({
        "index": np.linspace(i, N*3-i, num=N, dtype=np.int32),
        f"A{i}": np.arange(N, dtype=np.float32),
    })
    pls.append(A)
    
    B = A.to_pandas().set_index("index")
    pds.append(B)

The approach of joining two columns in a row is somewhat slow than I expected:

%%time
F = functools.reduce(lambda a, b: a.join(b, on='index', how='outer'), pls)
F.sort(by='index')

CPU times: user 1.49 s, sys: 97.8 ms, total: 1.59 s
Wall time: 611 ms

or than one-pass creation in pd.DataFrame:

%%time
pd.DataFrame({
    f"A{i}": pds[i][f'A{i}'] for i in range(C)
}).sort_index()

CPU times: user 230 ms, sys: 50.7 ms, total: 281 ms
Wall time: 281 ms
Asked By: Jongwook Choi

||

Answers:

Quick solution

I’ve tried a couple of examples but I think pl.from_pandas is faster than any native polars solution I could find.

dt = pl.from_pandas(
  pd.DataFrame({
     "A": {0:  'a', 20: 'b', 40: 'c'},
     "B": {10: 'd', 20: 'e', 30: 'f'},
     "C": {20: 'g', 30: 'h'},
  }).sort_index().reset_index()
)

shape: (5, 4)
┌───────┬──────┬──────┬──────┐
│ index ┆ A    ┆ B    ┆ C    │
│ ---   ┆ ---  ┆ ---  ┆ ---  │
│ i64   ┆ str  ┆ str  ┆ str  │
╞═══════╪══════╪══════╪══════╡
│ 0     ┆ a    ┆ null ┆ null │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 10    ┆ null ┆ d    ┆ null │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 20    ┆ b    ┆ e    ┆ g    │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 30    ┆ null ┆ f    ┆ h    │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 40    ┆ c    ┆ null ┆ null │
└───────┴──────┴──────┴──────┘

UPDATE

Finally found a faster solution using lazyframe joins. You can shave more time if you happen to know all the values of the index ahead of time, instead of constructing it from an unique.

# Convert tables to lazy tables
pls = [dt.lazy() for dt in pls]
%%time
out = (
    pl.concat([dt.melt('index') for dt in pls])
    .select(pl.col('index').unique())
)

for i in range(20):
    out = out.join(pls[i], on='index', how='left')

out.collect()

CPU times: total: 766 ms
Wall time: 277 ms
Answered By: kwk

Following your example, but only informing polars on the fact that the "index" column is sorted (polars will use fast paths if data is sorted).

You can use align_frames together with functools.reduce to get what you want.

This is your data creation snippet:

import functools
import polars as pl

N, C = 300000, 20
pls = []
pds = []

for i in range(C):
    A = pl.DataFrame({
        "index": np.linspace(i, N*3-i, num=N, dtype=np.int32),
        f"A{i}": np.arange(N, dtype=np.float32),
    }).with_column(pl.col("index").set_sorted())
    
    pls.append(A)
    
    B = A.to_pandas().set_index("index")
    pds.append(B)

Creating the frame aligned by index. We need to use functools.reduce because align_frames returns a list of new DataFrame objects that are aligned by index.

frames = pl.align_frames(*pls, on="index")
functools.reduce(lambda a, b: a.with_columns(b.get_columns()), frames)

Performance

The performance is better than the pandas sort_index method.

Pandas

>>> %%timeit
>>> pd.DataFrame({
...     f"A{i}": pds[i][f'A{i}'] for i in range(C)
... }).sort_index()
389 ms ± 8.96 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Polars

>>> %%timeit
>>> frames = pl.align_frames(*pls, on="index")
>>> functools.reduce(lambda a, b: a.with_columns(b.get_columns()), frames)
348 ms ± 11.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Answered By: ritchie46