create a polars dataframe containing unique values from a set of CSVs

Question:

I have +3000 CSVs with +10 columns. What I need is to get all unique values from just two of these. I am able to read unique values in polars:

import polars as pl

df1 = pl.read_csv("test1.biobank.tsv.gz", sep='t', dtype={"#chrom": pl.Utf8}, n_threads=8, columns=["#chrom", "pos"], new_columns=["chr", "pos"]).drop_duplicates()

I can read the remaining files one by one, i.e.:

df2 = pl.read_csv("test2.biobank.tsv.gz", sep='t', dtype={"#chrom": pl.Utf8}, n_threads=8, columns=["#chrom", "pos"], new_columns=["chr", "pos"]).drop_duplicates()

check if all the values are not equal:

if not df1.frame_equal(df2):
    df = df1.vstack(df2)
    del(df1)
    del(df2)  

then .drop_duplicates(). But since all the input files are already sorted on the two columns (chr, pos) and the differences are in thousands out of 16M input rows I hope there is a better way to do it.

Thank you for your help in advance

DK

edit

There is another way to do it using Polars and DuckDB.

  • create parquet files for each of the inputs
tsv_pattern = "gwas_*.gz"

for fn in glob.glob(tsv_pattern):
    print(fn)
    parquet_fn = fn.replace(".gz", ".chr_pos.parquet")
    df = pl.read_csv(fn, sep='t', dtype={"#chrom": pl.Utf8}, n_threads=8, columns=["#chrom", "pos"], new_columns=["chr", "pos"]).drop_duplicates()
    df.to_parquet(parquet_fn, compression='zstd')
    del(df)

  • run duckdb and execute:
CREATE TABLE my_table AS SELECT DISTINCT * FROM 'my_directory/*.parquet'

Credits go to Mark Mytherin from DuckDB

Asked By: darked89

||

Answers:

it sounds like merge k sorted arrays,
i’ve found a article for the solution, wish it could help:
https://medium.com/outco/how-to-merge-k-sorted-arrays-c35d87aa298e

Answered By: prof_FL

You can use glob patterns to read the csv’s and then call distinct.

(pl.scan_csv("**/*.csv")
 .unique()
 .collect())
Answered By: ritchie46