How to automatically detect columns that contain datetime in a pandas dataframe

Question:

I’m using Pandas’ read_sql() function to read multiple SQL tables into DataFrames. Through its parse_dates parameter, the function accepts a pre-specified list of column names to read as datetime objects, but it does not infer datetimes automatically from varchar columns on the server. Because of this, I get DataFrames in which all columns are of dtype object (a minimal read_sql sketch follows the sample below):

   col1    col2
   -----------------------------------
0  A       2017-02-04 10:41:00.0000000
1  B       2017-02-04 10:41:00.0000000
2  C       2017-02-04 10:41:00.0000000
3  D       2017-02-04 10:41:00.0000000
4  E       2017-02-03 06:13:00.0000000
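
For reference, a minimal sketch of the kind of call I am trying to avoid (the table name, column name, and connection are placeholders):

import sqlite3
import pandas as pd

conn = sqlite3.connect('example.db')  # placeholder connection

# Works, but only because the datetime column is listed by hand:
df = pd.read_sql('SELECT * FROM my_table', conn,
                 parse_dates=['col2'])  # 'my_table' is a placeholder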

Is there a built-in Pandas function to automatically infer columns which should be datetime64[ns] WITHOUT having to specify the column names?

I’ve tried:

df.apply(pd.to_datetime(df, infer_datetime_format=True), axis=1)

which results in an error:

to assemble mappings requires at least that [year, month, day] be specified: [day,month,year] is missing

I also tried:

pd.to_datetime(df.stack(),  errors='ignore', format='%Y%m%d% H%M%S%f').unstack()

and

pd.to_datetime(df.stack(),  errors='coerce', format='%Y%m%d% H%M%S%f').unstack()

But neither of these works.

Any suggestions about how to infer datetime columns automatically after the DataFrame is constructed?

Asked By: Carlos Muñiz


Answers:

You can identify which columns in your DataFrame are of type object and convert only those to datetime, using errors='coerce' so that values which cannot be converted become NaT instead of raising. Then use combine_first to overwrite the values in your DataFrame with the timestamp values that did convert to datetimes.

df = pd.DataFrame(
     {'col1': ['A', 'B', 'C', 'D', 'E'],
      'col2': ['2017-02-04 18:41:00',
               '2017-02-04 18:41:00',
               '2017-02-04 18:41:00',
               '2017-02-04 18:41:00',
               '2017-02-03 14:13:00'],
      'col3': [0, 1, 2, 3, 4]})

object_cols = [col for col, col_type in df.dtypes.items() if col_type == 'object']

df[object_cols] = df[object_cols].apply(
    pd.to_datetime, errors='coerce').combine_first(df[object_cols])
>>> df
  col1                col2  col3
0    A 2017-02-04 18:41:00     0
1    B 2017-02-04 18:41:00     1
2    C 2017-02-04 18:41:00     2
3    D 2017-02-04 18:41:00     3
4    E 2017-02-03 14:13:00     4

>>> df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 0 to 4
Data columns (total 3 columns):
col1    5 non-null object
col2    5 non-null datetime64[ns]
col3    5 non-null int64
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 160.0+ bytes
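
The same idea can also be written as an explicit loop that swaps a column in only when every value parses cleanly (a minimal sketch, not part of the original answer):

for col in df.select_dtypes(include='object'):
    converted = pd.to_datetime(df[col], errors='coerce')
    if converted.notna().all():  # only replace columns that parsed completely
        df[col] = converted
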
Answered By: Alexander

There is currently no built-in to convert object columns to datetime automatically. One simple way is a list comprehension with a regex pattern for the datetime varchar, i.e.

If you have a df (based on @Alexander’s df)

df = pd.DataFrame( {'col1': ['A', 'B', 'C', 'D', 'E'],
  'col2': ['2017-02-04 18:41:00',
           '2017-02-04 18:41:00',
           '2017-02-04 18:41:00',
           '2017-02-04 18:41:00',
           '2017-02-03 14:13:00'],
  'col3': [0, 1, 2, 3, 4],
  'col4': ['2017-02-04 18:41:00',
           '2017-02-04 18:41:00',
           '2017-02-04 18:41:00',
           '2017-02-04 18:41:00',
           '2017-02-03 14:13:00']})

data = [pd.to_datetime(df[x])
        if df[x].astype(str).str.match(r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}').all()
        else df[x]
        for x in df.columns]

df = pd.concat(data, axis=1, keys=[s.name for s in data])

Or with the help of a mask, i.e.

mask = df.astype(str).apply(lambda x: x.str.match(r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}').all())
df.loc[:,mask] = df.loc[:,mask].apply(pd.to_datetime)

df.dtypes

Output:

col1            object
col2    datetime64[ns]
col3             int64
col4    datetime64[ns]
dtype: object

If you have mixed date formats then you can use r'(\d{2,4}-\d{2}-\d{2,4})+', e.g.:

ndf = pd.DataFrame({'col3': [0, 1, 2, 3, 4],
  'col4': ['2017-02-04 18:41:00',
       '2017-02-04 18:41:00',
       '2017-02-04 18:41:00',
       '2017-02-04 18:41:00',
       '2017-02-03 14:13:00'],
  'col5': ['2017-02-04',
       '2017-02-04',
       '17-02-2004 14:13:00',
       '17-02-2014',
       '2017-02-03']})

mask = ndf.astype(str).apply(lambda x: x.str.match(r'(\d{2,4}-\d{2}-\d{2,4})+').all())
ndf.loc[:,mask] = ndf.loc[:,mask].apply(pd.to_datetime)

Output:

   col3                col4                col5
0     0 2017-02-04 18:41:00 2017-02-04 00:00:00
1     1 2017-02-04 18:41:00 2017-02-04 00:00:00
2     2 2017-02-04 18:41:00 2004-02-17 14:13:00
3     3 2017-02-04 18:41:00 2014-02-17 00:00:00
4     4 2017-02-03 14:13:00 2017-02-03 00:00:00
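
One caveat (my note, not part of the original answer): with mixed formats, pd.to_datetime guesses each value's layout, which can misread ambiguous dates; dayfirst pins this down:

pd.to_datetime('04-02-2017', dayfirst=True)   # Timestamp('2017-02-04 00:00:00')
pd.to_datetime('04-02-2017', dayfirst=False)  # Timestamp('2017-04-02 00:00:00')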

Hope it helps

Answered By: Bharath M Shetty

Use a regex that accepts several date separators. The code below matches dates using -, /, \, . or a space as the separator (e.g. yyyy-mm-dd, yyyy/mm/dd, yyyy.mm.dd, or day-first layouts such as dd-mm-yyyy).

The code below will find and convert the datetime columns. (The original snippet did not show its imports; is_datetime is assumed to be pandas’ is_datetime64_any_dtype.)

from pandas.api.types import is_datetime64_any_dtype as is_datetime

def format_date(df):
    # Only object columns (or columns that are already datetime) are candidates
    non_numeric_cols = [col for col, col_type in df.dtypes.items()
                        if col_type == 'object' or is_datetime(df[col])]
    if len(non_numeric_cols) > 0:
        # Mark columns where at least one value looks like a date
        mask = df.astype(str).apply(
            lambda x: x.str.match(r'(\d{2,4}(-|/|\\|\.| )\d{2}(-|/|\\|\.| )\d{2,4})+').any())
        df.loc[:, mask] = df.loc[:, mask].apply(pd.to_datetime, dayfirst=False)
    return df
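
A quick usage sketch (my addition, reusing the sample frame from the earlier answers):

df = format_date(df)
print(df.dtypes)  # the string date columns now report datetime64[ns]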
Answered By: hp_elite

If none of these works out, I have a workaround that sort of works, although it rests on an assumption: that the datetime columns already present in the dataset have ‘year’, ‘month’, ‘date’ or ‘day’ somewhere in their name, so a regex can pick them all out.

import re

normzd_column_names = [x.lower() for x in df.columns]

pattern = 'year|month|date|day'
for col in normzd_column_names:
    match = re.search(pattern, col)
    if match:
        print(f'Datetime column found: {col}')
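
Building on this, the matched columns could be handed straight to pd.to_datetime (my sketch, under the same naming assumption):

date_like = [col for col in df.columns if re.search(pattern, col.lower())]
for col in date_like:
    df[col] = pd.to_datetime(df[col], errors='coerce')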
Answered By: Ayan