How to fix date format issues while reading xlsx files using polars?

Question:

I have an Excel file with the .xlsx extension. It has a field called date_of_birth, which contains years ranging from 1860 and 1870 up to 2000.

Here is the command I used to load the Excel file:

df_pl = pl.read_excel('Data_Set_14_Data.xlsx',
                      read_csv_options={'ignore_errors':True,'infer_schema_length':0,'parse_dates':True})

On running this it gives an error:

XlsxValueError: Error: potential invalid date format.

How can I ignore or fix this error while reading the file, so that I get the data as it is in the data frame? Is there any workaround for this?

Asked By: myamulla_ciencia

||

Answers:

The error is most likely due to the strftime function, which in some implementations does not support years before 1900.

Polars is probably using it under the hood, and that causes the problem.

You could try not parsing the dates in the Polars function, so that the file is read with the dates kept as strings. Then, when you need to parse the dates, just use strptime, like:

datetime.datetime.strptime("1800/04/10", "%Y/%m/%d")
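
As a minimal sketch of that idea, using only the standard library: the strings below are made-up sample values (not taken from the asker's file), and parse_dob is a hypothetical helper name. It parses pre-1900 dates with strptime and formats them back with isoformat, which avoids strftime altogether.

```python
from datetime import date, datetime


def parse_dob(text: str, fmt: str = "%Y/%m/%d") -> date:
    """Parse a date-of-birth string into a date object (hypothetical helper)."""
    return datetime.strptime(text, fmt).date()


# Made-up sample values; the real column would come from the spreadsheet.
raw_values = ["1860/01/15", "1870/12/31", "2000/04/10"]

parsed = [parse_dob(value) for value in raw_values]

# isoformat() builds the string directly, so strftime is never called.
iso_strings = [d.isoformat() for d in parsed]
```

The format string has to match how the dates actually appear once they are read as text, so "%Y/%m/%d" here is only an assumption.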

Also, you may try the with_columns method of the Polars DataFrame, reading the dates as strings and converting them afterwards (I couldn’t test it yet; will update after trying it):

import polars as pl
df_pl = pl.read_excel('Data_Set_14_Data.xlsx',
                      read_csv_options={'ignore_errors': True, 'infer_schema_length': 0})
df_pl = df_pl.with_columns(pl.col('<last_col_name>').str.strptime(pl.Date, '%m/%d/%Y'))
Answered By: stuck