What is the difference between
polars.read_csv looks equivalent to
pandas.read_csv as they have the same name.
Which one should I use in which scenario, and how are they similar or different? For example:
pandas.read_csv when my data is messy or complex in structure and the data is not too large
polars.read_csv when my data file is very large (> 10 GB).
This is an answer based solely on my (humble) opinion.
polars.scan_csv produces a query plan (called a
LazyFrame). You can then build your query and, at the end, call
collect to materialize a DataFrame.
This is the case for all
scan_ methods. The benefit is that the Polars optimizer can then push optimizations down into the readers: it can apply filters while reading and select only the columns it needs. This can save a lot of work.
polars.read_csv can be seen as
polars.scan_csv().collect(). That is, you simply read all the data and immediately produce a
DataFrame. This means that you might do work that was not needed. The Polars optimizer is not able to do anything if you want a result immediately.
I don’t agree with the other answer that
polars.read_csv should only be used when data is large. It is just as suitable for smaller data.