polars.read_csv() with German number formatting

Question:

Is there a way in Polars to read a CSV file with German number formatting, as pandas.read_csv() allows with its "decimal" and "thousands" parameters?

Asked By: alexp

Answers:

Currently, the Polars read_csv method does not expose those parameters.

However, there is an easy workaround: let Polars read the German-formatted numbers as strings (utf8), then convert them afterwards. For example, with this CSV:

from io import StringIO
import polars as pl

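# Tab-separated sample data; "\t" inside the string literal becomes a real tab.
my_csv = """col1\tcol2\tcol3
1.234,5\tabc\t1.234.567
9.876\tdef\t3,21
"""
# Newer Polars releases name this parameter `separator`; older ones call it `sep`.
df = pl.read_csv(StringIO(my_csv), separator="\t")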
print(df)

shape: (2, 3)
┌─────────┬──────┬───────────┐
│ col1    ┆ col2 ┆ col3      │
│ ---     ┆ ---  ┆ ---       │
│ str     ┆ str  ┆ str       │
╞═════════╪══════╪═══════════╡
│ 1.234,5 ┆ abc  ┆ 1.234.567 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 9.876   ┆ def  ┆ 3,21      │
└─────────┴──────┴───────────┘

From here, the conversion is just a few lines of code:

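df = df.with_columns(
    pl.col(["col1", "col3"])
    .str.replace_all(r"\.", "")  # drop thousands separators; the dot is escaped because the pattern is a regex
    .str.replace(",", ".")  # decimal comma -> decimal point
    .cast(pl.Float64)  # or whatever datatype needed
)
print(df)

shape: (2, 3)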
┌────────┬──────┬────────────┐
│ col1   ┆ col2 ┆ col3       │
│ ---    ┆ ---  ┆ ---        │
│ f64    ┆ str  ┆ f64        │
╞════════╪══════╪════════════╡
│ 1234.5 ┆ abc  ┆ 1.234567e6 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 9876.0 ┆ def  ┆ 3.21       │
└────────┴──────┴────────────┘
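
To see what the two string steps do to each value, here is a minimal plain-Python sketch of the same idea, applied to the sample values above. The helper name german_to_float is made up for illustration and is not part of Polars. Note that Python's str.replace matches literal text, so the dot needs no escaping here, unlike in a regex pattern.

```python
def german_to_float(text: str) -> float:
    """Convert a German-formatted number string such as '1.234,5' to a float."""
    # Step 1: drop the thousands separators (literal dots).
    without_thousands = text.replace(".", "")
    # Step 2: turn the decimal comma into a decimal point.
    normalized = without_thousands.replace(",", ".")
    return float(normalized)


samples = ["1.234,5", "9.876", "1.234.567", "3,21"]
converted = [german_to_float(s) for s in samples]
print(converted)  # [1234.5, 9876.0, 1234567.0, 3.21]
```

The order matters: if the comma were replaced first, the new decimal point would be removed together with the thousands separators.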

Just be careful to apply this logic only to numbers encoded in German locale. It will mangle numbers formatted in other locales.
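
One way to guard against that is to check a sample of the raw strings before converting. The sketch below is only an assumption about what "German-formatted" means (dots grouping digits in threes, a comma before the decimals); the names GERMAN_NUMBER, looks_german and naive_convert are hypothetical helpers, not Polars functions.

```python
import re

# Assumed shape of a German-formatted number: optional sign, integer part either
# plain digits or grouped in threes with dots, optional decimals after a comma.
GERMAN_NUMBER = re.compile(r"[+-]?(?:\d{1,3}(?:\.\d{3})+|\d+)(?:,\d+)?")


def looks_german(text: str) -> bool:
    """Return True if the whole string fits the assumed German number pattern."""
    return GERMAN_NUMBER.fullmatch(text) is not None


def naive_convert(text: str) -> float:
    """Apply the same two replacements as the workaround, with no format check."""
    return float(text.replace(".", "").replace(",", "."))


print(looks_german("1.234,5"))   # True
print(looks_german("1,234.5"))   # False: US-style grouping
print(naive_convert("1,234.5"))  # 1.2345, not 1234.5
```

A check like this cannot resolve every case: a string such as "9.876" is a valid German thousands-grouped integer, but it could also be a US-style decimal, so knowing where the file came from still matters.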

Answered By: user18559875