Reading a CSV in Polars

Question:

What is the difference between polars.read_csv, polars.read_csv_batched and polars.scan_csv?

polars.read_csv looks equivalent to pandas.read_csv as they have the same name.

Which one should I use in which scenario, and how are they similar to or different from pandas.read_csv?

Answers:

Scenarios:

  • I use pandas.read_csv when my data is messy or complex in structure and not too large.

  • I use polars.read_csv when my data file is very large (> 10 GB).

This is an answer based solely on my (humble) opinion.

Answered By: Corralien

polars.scan_csv produces a query plan (called a LazyFrame). You can then build your query and, at the end, call collect to materialize a DataFrame.

This is the case for all scan_ methods. The benefit is that the Polars optimizer can then push optimizations down into the readers: it can apply filters in the reader and select only the columns it needs. This can save a lot of work.

polars.read_csv can be seen as polars.scan_csv().collect(), i.e. you simply read all the data and immediately produce a DataFrame. This means that you might do work that was not needed: the Polars optimizer cannot do anything if you want a result immediately.

I don’t agree with the other answer that polars.read_csv should only be used when the data is large. It is just as suitable for smaller data.

Answered By: ritchie46