Change a column format while ignoring (or keeping) NaN
Question:
I want to convert a DataFrame column containing values in hh:mm:ss format into a column containing the number of minutes, while keeping the NaN values.
I can’t change it directly in the Excel file, so I’ve tried to do it with pandas (I’m working on an ML model with a health database):
38 00:35:00
39 00:50:00
40 00:45:00
41 01:32:00
42 00:29:00
43 NaN
44 00:45:00
45 00:13:00
46 00:20:00
47 00:31:00
48 00:54:00
49 00:43:00
50 02:33:00
I tried to separate the values from the NaN values using a mask, then convert to minutes with str.split():
df1 = df['delay'][df['delay'].notnull()].astype(str).str.split(':').apply(lambda x: int(x[0]) * 60 + int(x[1]))
df2 = df['delay'][df['delay'].isnull()]
But then I cannot merge the two series without losing the order (the NaN values keep their correct indexes but end up at the end of the merged series):
39 50
40 45
41 92
42 29
44 45
45 13
46 20
47 31
48 54
49 43
50 153
43 NaN
I also tried to go from hh:mm:ss to minutes with datetime.time and timedelta using a loop (without a mask), but I still can’t get a column (Series or DataFrame) with all the values in minutes while keeping the NaN.
Answers:
You can use pd.to_timedelta to convert the delay column to a pandas timedelta Series, then divide it by a Timedelta of 1 minute to get the total minutes:
pd.to_timedelta(df['delay'], errors='coerce') / pd.Timedelta(1, 'min')
39 50.0
40 45.0
41 92.0
42 29.0
43 NaN
44 45.0
45 13.0
46 20.0
47 31.0
48 54.0
49 43.0
50 153.0
Name: delay, dtype: float64
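To make the approach above easy to try, here is a minimal self-contained sketch. The sample rows are a subset of the data in the question; the column names delay_min and delay_min_int are made up for illustration, and the nullable Int64 step is an optional extra that the answer itself does not use.

```python
import numpy as np
import pandas as pd

# Sample rows taken from the question; index labels are kept as-is.
df = pd.DataFrame(
    {"delay": ["00:35:00", "00:50:00", "01:32:00", np.nan, "02:33:00"]},
    index=[38, 39, 41, 43, 50],
)

# Parse hh:mm:ss strings as durations; missing or unparseable values become NaT.
delay_td = pd.to_timedelta(df["delay"], errors="coerce")

# Dividing by a one-minute Timedelta gives float minutes, with NaN where NaT was.
df["delay_min"] = delay_td / pd.Timedelta(minutes=1)

# Optional (illustrative): a nullable integer dtype keeps whole minutes
# and shows <NA> instead of NaN for the missing rows.
df["delay_min_int"] = df["delay_min"].round().astype("Int64")

print(df)
```

Because the conversion is applied to the whole column at once, the NaN row stays at its original position and no mask or merge step is needed.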
Here are some possible solutions:
Input:
delay
38 00:35:00
39 00:50:00
40 00:45:00
41 01:32:00
42 00:29:00
43 NaN
44 00:45:00
45 00:13:00
46 00:20:00
47 00:31:00
48 00:54:00
49 00:43:00
50 02:33:00
Method 1: pd.to_datetime + map
df['delay'] = pd.to_datetime(df['delay'])
# using a lambda function
df['delay2'] = df['delay'].map(lambda x: x.hour * 60 + x.minute)
print(df['delay2'])
# df.drop(['delay'], axis=1, inplace=True)
Method 2: pd.to_datetime + dt
# convert the time column to pandas datetime64[ns] format
df['delay'] = pd.to_datetime(df['delay'])
# use the dt accessor to extract the hour and minute
df['delay2'] = df['delay'].dt.hour * 60 + df['delay'].dt.minute
print(df['delay2'])
Output:
39 50.0
40 45.0
41 92.0
42 29.0
43 NaN
44 45.0
45 13.0
46 20.0
47 31.0
48 54.0
49 43.0
50 153.0
Name: delay2, dtype: float64
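As a hedged variant of Method 2, the sketch below passes an explicit format to pd.to_datetime, which avoids relying on format guessing. It also illustrates one limitation of treating a duration as a clock time: values of 24 hours or more cannot be parsed this way. The variable names and the "25:10:00" value are invented for illustration and do not come from the question's data.

```python
import numpy as np
import pandas as pd

s = pd.Series(
    ["00:35:00", "01:32:00", np.nan, "02:33:00"],
    index=[38, 41, 43, 50],
    name="delay",
)

# An explicit format avoids format inference; NaN becomes NaT.
parsed = pd.to_datetime(s, format="%H:%M:%S", errors="coerce")

# .dt.hour and .dt.minute give NaN for NaT rows, so the result is float64.
minutes = parsed.dt.hour * 60 + parsed.dt.minute
print(minutes)

# Caveat (hypothetical value): a clock-time format cannot hold 25 hours,
# so the value is coerced to NaT, whereas a timedelta parses it as a duration.
long_delay = pd.Series(["25:10:00"])
as_clock = pd.to_datetime(long_delay, format="%H:%M:%S", errors="coerce")
as_duration = pd.to_timedelta(long_delay) / pd.Timedelta(minutes=1)
print(as_clock.isna().iloc[0], as_duration.iloc[0])
```

If the delays in the real data can exceed a day, the timedelta-based conversion is probably the safer choice.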
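You can use errors='ignore' for general cases, as follows:
df['column_name'].astype(int, errors='ignore')
Note that errors='ignore' only suppresses the exception: if the conversion fails, the original column is returned unchanged, so on its own this does not turn hh:mm:ss strings into minutes.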
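Returning to the mask-based attempt in the question, the order problem can also be avoided without changing the parsing logic. The sketch below is one possible way, not the only one: it relies on pandas aligning on the index when a Series is assigned to a column. The helper to_minutes and the column name delay_min are hypothetical names introduced here.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"delay": ["00:50:00", "01:32:00", np.nan, "00:45:00"]},
    index=[39, 41, 43, 44],
)

def to_minutes(value):
    """Convert an 'hh:mm:ss' string to whole minutes (seconds are dropped)."""
    hours, minutes, _seconds = value.split(":")
    return int(hours) * 60 + int(minutes)

# Convert only the non-null rows, as in the question.
mask = df["delay"].notna()
converted = df.loc[mask, "delay"].map(to_minutes)

# Assigning a Series to a column aligns on the index, so rows that are
# missing from `converted` (the NaN rows) get NaN in their original position.
df["delay_min"] = converted

# If the two pieces are concatenated instead, sorting by index restores the order.
merged = pd.concat([converted, df.loc[~mask, "delay"]]).sort_index()

print(df)
print(merged)
```

Either route keeps index 43 between 41 and 44 instead of pushing it to the end.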