create a polars dataframe containing unique values from a set of CSVs

Question

I have 3000+ CSVs with 10+ columns each. I need all the unique values from just two of those columns. I can read the unique values from a single file in polars:

import polars as pl

df1 = pl.read_csv("test1.biobank.tsv.gz", sep='\t', dtype={"#chrom": pl.Utf8}, n_threads=8, columns=["#chrom", "pos"], new_columns=["chr", "pos"]).drop_duplicates()

I can read the remaining files one by one, e.g.:

df2 = pl.read_csv("test2.biobank.tsv.gz", sep='\t', dtype={"#chrom": pl.Utf8}, n_threads=8, columns=["#chrom", "pos"], new_columns=["chr", "pos"]).drop_duplicates()

Then I check whether the two frames differ and, if they do, stack them:

if not df1.frame_equal(df2):
    df = df1.vstack(df2)
    del(df1)
    del(df2)

and then call .drop_duplicates() on the result. However, all the input files are already sorted on the two columns (chr, pos), and they differ in only a few thousand of the 16M input rows, so I hope there is a better way to do it.

Thank you for your help in advance

DK

edit

There is another way to do it using Polars and DuckDB.

First, create a Parquet file for each of the inputs:

import glob

tsv_pattern = "gwas_*.gz"

for fn in glob.glob(tsv_pattern):
    print(fn)
    parquet_fn = fn.replace(".gz", ".chr_pos.parquet")
    # keep only the two columns of interest and drop duplicate rows within the file
    df = pl.read_csv(fn, sep='\t', dtype={"#chrom": pl.Utf8}, n_threads=8, columns=["#chrom", "pos"], new_columns=["chr", "pos"]).drop_duplicates()
    df.to_parquet(parquet_fn, compression='zstd')
    del df

Then run DuckDB and execute:

CREATE TABLE my_table AS SELECT DISTINCT * FROM 'my_directory/*.parquet'

Credits go to Mark Mytherin from DuckDB

Asked By: darked89


Answer 1

This sounds like the "merge k sorted arrays" problem. I found an article describing a solution; I hope it helps:
https://medium.com/outco/how-to-merge-k-sorted-arrays-c35d87aa298e
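A minimal sketch of the k-way merge idea, using only the Python standard library rather than polars. It assumes every file is a gzipped TSV with "#chrom" and "pos" columns (as in the question) and that all files are sorted in the same order. The names read_chr_pos and merge_unique are hypothetical helpers made up for this illustration. Because the merged stream is sorted, duplicates end up next to each other, so comparing each row with the previous one is enough to drop them.

```python
import csv
import gzip
import heapq
from typing import Callable, Iterable, Iterator, Optional, Tuple

Row = Tuple[str, int]


def read_chr_pos(path: str) -> Iterator[Row]:
    """Lazily yield (chr, pos) pairs from one gzipped TSV file."""
    with gzip.open(path, "rt", newline="") as handle:
        for record in csv.DictReader(handle, delimiter="\t"):
            yield record["#chrom"], int(record["pos"])


def merge_unique(
    streams: Iterable[Iterable[Row]],
    key: Optional[Callable[[Row], object]] = None,
) -> Iterator[Row]:
    """Merge already-sorted streams and yield each distinct row once.

    Assumption: every stream is sorted in the order defined by `key`
    (plain tuple comparison when `key` is None), and `key` never maps
    two different rows to the same value.
    """
    previous = None
    for row in heapq.merge(*streams, key=key):
        if row != previous:  # duplicates are adjacent in sorted output
            yield row
            previous = row


# Small in-memory demonstration with two sorted inputs
a = [("1", 100), ("1", 200), ("2", 50)]
b = [("1", 100), ("1", 150), ("2", 50), ("2", 75)]
print(list(merge_unique([a, b])))
# [('1', 100), ('1', 150), ('1', 200), ('2', 50), ('2', 75)]
```

With real files this could be called as merge_unique(read_chr_pos(fn) for fn in glob.glob("gwas_*.gz")). Two caveats: if the files order chromosomes numerically (1, 2, ..., 10, ...) rather than as strings, a matching key function has to be supplied, and merging 3000+ files at once keeps that many file handles open, which may exceed the operating system's limit, so merging in batches may be necessary.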

Answered By: prof_FL

Answer 2

You can use a glob pattern to read the CSVs and then call unique (i.e. distinct):

(pl.scan_csv("**/*.csv")
 .unique()
 .collect())

Answered By: ritchie46
