How to handle timestamps from summer and winter time when converting strings in polars

Question:

I’m trying to convert string timestamps to polars datetimes. The timestamps come from the metadata my camera writes into its RAW files, but polars throws this error when I have timestamps from both summer time and winter time.

ComputeError: Different timezones found during 'strptime' operation.

How do I persuade it to convert these successfully?
(ideally handling different timezones as well as the change from summer to winter time)

And then how do I convert these timestamps back to the proper local clock time for display?

Note that while the timestamp strings show only the offset, the metadata also has an EXIF field "Time Zone City", as well as fields with just the local (naive) timestamp.

import polars as plr

testdata=[
    {'name': 'BST 11:06', 'ts': '2022:06:27 11:06:12.16+01:00'},
    {'name': 'GMT 7:06', 'ts': '2022:12:27 12:06:12.16+00:00'},
]

pdf = plr.DataFrame(testdata)
pdfts = pdf.with_column(plr.col('ts').str.strptime(plr.Datetime, fmt = "%Y:%m:%d %H:%M:%S.%f%z"))

print(pdf)
print(pdfts)

It looks like I need to use tz_convert, but I cannot see how to add it to the conversion expression, and what looks like the relevant doc page (the dt namespace reference) just returns a 404.

Asked By: pootle

||

Answers:

polars 0.16 update

Since PR 6496 was merged, you can parse mixed offsets to UTC, then convert to the time zone you want:

import polars as pl

pdf = pl.DataFrame([
    {'name': 'BST 11:06', 'ts': '2022:06:27 11:06:12.16+01:00'},
    {'name': 'GMT 7:06', 'ts': '2022:12:27 12:06:12.16+00:00'},
])

pdfts = pdf.with_columns(
    # %.f parses the fractional seconds including the leading dot;
    # a plain %f would read "16" as 16 nanoseconds
    pl.col('ts').str.strptime(
        pl.Datetime(time_unit="us"), fmt="%Y:%m:%d %H:%M:%S%.f%z", utc=True)
    .dt.convert_time_zone("Europe/London")
)

print(pdfts)
shape: (2, 2)
┌───────────┬─────────────────────────────┐
│ name      ┆ ts                          │
│ ---       ┆ ---                         │
│ str       ┆ datetime[μs, Europe/London] │
╞═══════════╪═════════════════════════════╡
│ BST 11:06 ┆ 2022-06-27 11:06:12.160 BST │
│ GMT 7:06  ┆ 2022-12-27 12:06:12.160 GMT │
└───────────┴─────────────────────────────┘
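
To see why normalising to UTC resolves the "different timezones" problem, here is a minimal sketch that uses only the Python standard library rather than polars. It assumes the same two sample strings as above; the names FMT, raw, parsed and as_utc are illustrative only. Each parsed value carries its own fixed offset, and converting all of them to UTC keeps the instant while giving every row the same zone, which is what a single-time-zone column needs.

```python
from datetime import datetime, timezone

# Python's %f accepts 1 to 6 fractional digits and %z accepts "+HH:MM".
FMT = "%Y:%m:%d %H:%M:%S.%f%z"

raw = [
    "2022:06:27 11:06:12.16+01:00",
    "2022:12:27 12:06:12.16+00:00",
]

# Each value keeps the fixed offset found in its own string.
parsed = [datetime.strptime(s, FMT) for s in raw]

# Converting to UTC does not change the instant, only its representation.
as_utc = [dt.astimezone(timezone.utc) for dt in parsed]

for dt in as_utc:
    print(dt.isoformat())
```

The summer timestamp moves back by one hour when expressed in UTC, while the winter timestamp is unchanged; converting to a named zone afterwards restores the local clock time for display.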

old version:

Here’s a work-around you could use: remove the UTC offset and localize to a pre-defined time zone. Note: the result will only be correct if UTC offsets and time zone agree.

timezone = "Europe/London"

pdfts = pdf.with_column(
    plr.col('ts')
    .str.replace("[+-][0-9]{2}:[0-9]{2}$", "")
    .str.strptime(plr.Datetime, fmt="%Y:%m:%d %H:%M:%S%.f")
    .dt.tz_localize(timezone)
)

print(pdf)
┌───────────┬──────────────────────────────┐
│ name      ┆ ts                           │
│ ---       ┆ ---                          │
│ str       ┆ str                          │
╞═══════════╪══════════════════════════════╡
│ BST 11:06 ┆ 2022:06:27 11:06:12.16+01:00 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ GMT 7:06  ┆ 2022:12:27 12:06:12.16+00:00 │
└───────────┴──────────────────────────────┘
print(pdfts)
┌───────────┬─────────────────────────────┐
│ name      ┆ ts                          │
│ ---       ┆ ---                         │
│ str       ┆ datetime[ns, Europe/London] │
╞═══════════╪═════════════════════════════╡
│ BST 11:06 ┆ 2022-06-27 11:06:12.160 BST │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ GMT 7:06  ┆ 2022-12-27 12:06:12.160 GMT │
└───────────┴─────────────────────────────┘
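
The caveat that the UTC offsets and the time zone must agree can be checked before the offsets are discarded. The following is a minimal standard-library sketch, not polars code; the helper name offset_matches_zone is hypothetical, and it assumes the IANA time zone database is available to Python's zoneinfo module (from the operating system or the tzdata package).

```python
from datetime import datetime
from zoneinfo import ZoneInfo

FMT = "%Y:%m:%d %H:%M:%S.%f%z"


def offset_matches_zone(ts: str, zone: str) -> bool:
    """Return True if the offset written in `ts` equals the offset
    that `zone` has at the same wall-clock time."""
    aware = datetime.strptime(ts, FMT)
    # Re-interpret the same wall-clock time in the named zone.
    local = aware.replace(tzinfo=ZoneInfo(zone))
    return aware.utcoffset() == local.utcoffset()


print(offset_matches_zone("2022:06:27 11:06:12.16+01:00", "Europe/London"))
print(offset_matches_zone("2022:12:27 12:06:12.16+00:00", "Europe/London"))
# A +02:00 offset in June cannot come from Europe/London:
print(offset_matches_zone("2022:06:27 11:06:12.16+02:00", "Europe/London"))
```

Rows for which the check fails were recorded in a different zone, so stripping the offset and localizing them would shift the instant.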

Side note: to be fair, pandas does not handle mixed UTC offsets either, unless you parse to UTC straight away (keyword utc=True in pd.to_datetime). With mixed UTC offsets, it falls back to a Series of native Python datetime objects, which makes much of the pandas time series functionality, such as the dt accessor, unavailable.

Answered By: FObersteiner

Similar to FObersteiner’s solution, but this parses the offset manually rather than assuming that your camera’s offset matches a predefined time zone definition.

The first step is to use a regex with str.extract to separate the offset from the rest of the timestamp. The offset is split into hours (including the sign) and minutes. Then we strptime the datetime component from the first step as a naive time, subtract the offset, localize the result to UTC, and then convert it to the desired time zone (in this case Europe/London). (I load polars as pl, not plr, so adjust as necessary.)

(pdf
.with_columns(
    [pl.col('ts').str.extract(r"(\d{4}:\d{2}:\d{2} \d{2}:\d{2}:\d{2}\.\d{2})"),
     pl.col('ts').str.extract(r"\d{4}:\d{2}:\d{2} \d{2}:\d{2}:\d{2}\.\d{2}((\+|-)\d{2}):\d{2}")
                 .cast(pl.Float64()).alias("offset"),
     pl.col('ts').str.extract(r"\d{4}:\d{2}:\d{2} \d{2}:\d{2}:\d{2}\.\d{2}(\+|-)\d{2}:(\d{2})", group_index=2)
                 .cast(pl.Float64()).alias("offset_minute")])
.select(
    ['name',
     (pl.col('ts').str.strptime(pl.Datetime(), "%Y:%m:%d %H:%M:%S%.f") - pl.duration(hours=pl.col('offset'), minutes=pl.col('offset_minute')))
                  .dt.tz_localize('UTC').dt.with_time_zone('Europe/London')]))

shape: (2, 3)
┌───────────┬────────┬─────────────────────────────┐
│ name      ┆ offset ┆ dt                          │
│ ---       ┆ ---    ┆ ---                         │
│ str       ┆ f64    ┆ datetime[ns, Europe/London] │
╞═══════════╪════════╪═════════════════════════════╡
│ BST 11:06 ┆ 1.0    ┆ 2022-06-27 11:06:12.160 BST │
│ GMT 7:06  ┆ 0.0    ┆ 2022-12-27 12:06:12.160 GMT │
└───────────┴────────┴─────────────────────────────┘
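
One detail to watch with manual offset handling: in the expression above the hours carry the sign but the minutes do not, so a negative offset with non-zero minutes (for example -03:30) would be applied incorrectly. Below is a minimal standard-library sketch of the same idea with the sign applied to the whole offset. It is not polars code, and the names PATTERN and to_utc are hypothetical.

```python
import re
from datetime import datetime, timedelta, timezone

# Naive timestamp, then sign, hours and minutes of the UTC offset.
PATTERN = re.compile(r"^(.*\d)([+-])(\d{2}):(\d{2})$")


def to_utc(ts: str) -> datetime:
    """Parse 'YYYY:MM:DD HH:MM:SS.ff+HH:MM' and return the UTC instant."""
    m = PATTERN.match(ts)
    if m is None:
        raise ValueError(f"unexpected timestamp: {ts!r}")
    naive_part, sign, hours, minutes = m.groups()
    naive = datetime.strptime(naive_part, "%Y:%m:%d %H:%M:%S.%f")
    offset = timedelta(hours=int(hours), minutes=int(minutes))
    if sign == "-":
        offset = -offset  # the sign applies to hours and minutes together
    # local time = UTC + offset, so UTC = local time - offset
    return (naive - offset).replace(tzinfo=timezone.utc)


print(to_utc("2022:06:27 11:06:12.16+01:00").isoformat())
print(to_utc("2022:06:27 11:06:12.16-03:30").isoformat())
```

In polars the equivalent adjustment would be to negate the minutes whenever the extracted sign is "-" before building the duration.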
Answered By: Dean MacGregor