Trouble with strptime() conversion of duration time string

Question:

I have some duration type data (lap times) as pl.Utf8 that fails to convert using strptime, whereas regular datetimes work as expected.

Minutes (before :) and seconds (before .) are always padded to two digits; milliseconds are always three digits.

Lap times are always < 2 min.

df = pl.DataFrame({
    "lap_time": ["01:14.007", "00:53.040", "01:00.123"]
})

df = df.with_columns(
    [
        # pl.col('release_date').str.strptime(pl.Date, fmt="%B %d, %Y"), # works
        pl.col('lap_time').str.strptime(pl.Time, fmt="%M:%S.%3f").cast(pl.Duration), # fails
    ]
)

I used the chrono format specifiers from https://docs.rs/chrono/latest/chrono/format/strftime/index.html, which the polars docs for strptime say it relies on.

The second conversion (for lap_time) always fails, whether I use .%f, .%3f, or %.3f. Apparently strptime doesn’t allow creating a pl.Duration directly, so I tried pl.Time instead, but it fails with the error:

ComputeError: strict conversion to dates failed, maybe set strict=False

but setting strict=False yields all null values for the whole Series.

Am I missing something, or is this some weird behavior on chrono’s or python-polars’ part?

Asked By: Dorian

||

Answers:

General case

If a duration may exceed 24 hours, you can extract the components (minutes, seconds, and so on) from the string with a regex pattern. For example:

df = pl.DataFrame({
    "time": ["+01:14.007", "100:20.000", "-05:00.000"]
})

df.with_columns(
    pl.col("time").str.extract_all(r"([+-]?\d+)")
    #                                /
    #                 you will get array of length 3
    #                 ["min", "sec", "ms"]
).with_columns(
    pl.duration(
        minutes=pl.col("time").arr.get(0),
        seconds=pl.col("time").arr.get(1),
        milliseconds=pl.col("time").arr.get(2)
    ).alias("time")
)
┌──────────────┐
│ time         │
│ ---          │
│ duration[ns] │
╞══════════════╡
│ 1m 14s 7ms   │
│ 1h 40m 20s   │
│ -5m          │
└──────────────┘

About pl.Time

To convert to pl.Time, the string must include hours as well. If you prepend 00: as the hours, the code works:

df = pl.DataFrame({"str_time": ["01:14.007", "01:18.880"]})

df.with_columns(
    duration = (pl.lit("00:") + pl.col("str_time"))
        .str.strptime(pl.Time, fmt="%T%.3f")
        .cast(pl.Duration)
)
┌───────────┬──────────────┐
│ str_time  ┆ duration     │
│ ---       ┆ ---          │
│ str       ┆ duration[μs] │
╞═══════════╪══════════════╡
│ 01:14.007 ┆ 1m 14s 7ms   │
│ 01:18.880 ┆ 1m 18s 880ms │
└───────────┴──────────────┘
Answered By: glebcom

Create your own parser – strptime works for DateTime stamps only, not for time deltas. The accepted answer is bad practice as it fails for reasonable inputs like durations of 80 minutes, or negative durations.

You can use pl.Series.str.extract() to make your own regex parser and extract the values you want before passing them into the Duration constructor.

As far as I’m aware, there is no "duration stamp" parser in Rust. It might be a good idea for a crate, if anyone is reading this. The syntax could be similar to strptime but handle cases such as negative durations and non-wrapping of the most significant subunit: for a "minute duration stamp" like this one, seconds would wrap at 60 but minutes would not, so 61 minutes stays 61.
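
To make the parsing rules described here concrete, here is a minimal sketch in plain Python, using only the standard library rather than polars or Rust. The function name parse_minute_duration and the accepted pattern (optional sign, any number of minute digits, two-digit seconds, three-digit milliseconds) are assumptions for illustration, not an existing API.

```python
import re
from datetime import timedelta

# Hypothetical pattern for "[sign]M+:SS.mmm"; the minutes field is unbounded.
_STAMP_RE = re.compile(r"([+-]?)(\d+):(\d{2})\.(\d{3})")


def parse_minute_duration(text: str) -> timedelta:
    """Parse a "minute duration stamp" such as "01:14.007" or "-05:00.000"."""
    match = _STAMP_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"not a duration stamp: {text!r}")
    sign, minutes, seconds, millis = match.groups()
    if int(seconds) >= 60:
        # Seconds wrap at 60; minutes, the most significant unit, do not.
        raise ValueError(f"seconds out of range: {text!r}")
    value = timedelta(
        minutes=int(minutes), seconds=int(seconds), milliseconds=int(millis)
    )
    # Apply the sign to the whole duration so "-00:30.500" stays negative.
    return -value if sign == "-" else value


if __name__ == "__main__":
    for stamp in ["01:14.007", "100:20.000", "-05:00.000", "61:00.000"]:
        print(stamp, "->", parse_minute_duration(stamp).total_seconds(), "s")
```

The sign is captured separately and applied to the full duration, because converting a minutes field of "-00" to an integer would silently drop it.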

Answered By: Cornelius Roemer

Code adapted from glebcom’s answer:

df = df.with_columns(
    [
        # pl.col('release_date').str.strptime(pl.Date, fmt="%B %d, %Y"), # works
        pl.duration(
            minutes=pl.col("lap_time").str.slice(0,2),
            seconds=pl.col("lap_time").str.slice(3,2),
            milliseconds=pl.col("lap_time").str.slice(6,3)
        ).alias('lap_time'),
    ]
)

This answer was posted as an edit to the question Trouble with strptime() conversion of duration time string by the OP Dorian under CC BY-SA 4.0.

Answered By: vvvvv