Reading a csv in polars
Question:
What is the difference between polars.read_csv vs polars.read_csv_batched vs polars.scan_csv?
polars.read_csv looks equivalent to pandas.read_csv, as they have the same name. Which one should I use in which scenario, and how are they similar to or different from pandas.read_csv?
Answers:
- polars.read_csv_batched is roughly equivalent to pandas.read_csv(iterator=True) (see the sketch after this list).
- polars.scan_csv doesn’t do anything until you perform an operation on the DataFrame, like dask.dataframe.read_csv (lazy loading).
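A minimal sketch of the batched reader (the file path and batch sizes are placeholders, not from the question):

```python
import polars as pl

# read_csv_batched returns a BatchedCsvReader rather than a DataFrame,
# so you consume the file in chunks, much like pandas' iterator mode.
reader = pl.read_csv_batched("data.csv", batch_size=100_000)  # hypothetical path
while (batches := reader.next_batches(5)) is not None:
    for df in batches:
        # df is a regular polars DataFrame holding one chunk
        print(df.shape)
```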
Scenarios:
- I use pandas.read_csv when my data is messy or complex in structure and the data is not too large.
- I use polars.read_csv when my data file is very large (> 10 GB).
This is an answer based solely on my (humble) opinion.
polars.scan_csv produces a query plan (called a LazyFrame). You can then build your query and, at the end, call collect to materialize a DataFrame.
This is the case for all scan_ methods. The benefit is that the Polars optimizer can push optimizations down into the readers: it can apply filters during reading and parse only the columns it needs. This can save a lot of work.
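A minimal sketch of that pushdown in action (the file path and column names are made up for illustration):

```python
import polars as pl

lazy = (
    pl.scan_csv("data.csv")           # builds a LazyFrame; nothing is read yet
    .filter(pl.col("amount") > 100)   # predicate pushdown: rows filtered in the reader
    .select("id", "amount")           # projection pushdown: only these columns are parsed
)
df = lazy.collect()  # the optimized plan runs and materializes a DataFrame
```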
polars.read_csv can be seen as polars.scan_csv().collect(). That is, you simply read all the data and immediately produce a DataFrame. This means you might do work that was not needed, because the Polars optimizer cannot do anything if you want a result immediately.
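In code, the eager and lazy paths look like this (hypothetical path, just to show the equivalence):

```python
import polars as pl

# Eager: parse everything right away.
df_eager = pl.read_csv("data.csv")

# Lazy equivalent: same result, but the optimizer gets a chance to
# skip work if you add filters or column selections before collect().
df_lazy = pl.scan_csv("data.csv").collect()
```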
I don’t agree with the other answer that polars.read_csv should only be used when data is large. It is just as suitable for smaller data.