“Is there a pandas function for adding a new column based on certain values of another column of the data frame?”
Question:
I am trying to create a new column in a data frame based on the time value in another column, i.e. if the time is between 06:00:00 and 12:00:00 then Morning, if the time is between 12:00:00 and 15:00:00 then Afternoon, and so on.
I have tried using a for loop and if/else statements, but my dataframe has 1549293 rows, so the loop is not feasible.
import datetime
import time
times= [datetime.time(6,0,0),datetime.time(12,0,0),datetime.time(15,0,0),datetime.time(20,0,0),datetime.time(23,0,0)]
times
df['time']=df['start_time'].dt.time
df['day_interval']=df['time']
for i in range(0,df.shape[0]):
    if df['time'][i] >= times[0] and df['time'][i] < times[1]:
        df['day_interval'][i]= "Morning"
    elif df['time'][i] >= times[1] and df['time'][i] < times[2]:
        df['day_interval'][i]= "Afternoon"
    elif df['time'][i] >= times[2] and df['time'][i] < times[3]:
        df['day_interval'][i]= "Evening"
    elif df['time'][i] >= times[3] and df['time'][i] < times[4]:
        df['day_interval'][i]= "Night"
    elif df['time'][i] >= times[4]:
        df['day_interval'][i]= "Late Night"
    if df['time'][i] < times[0]:
        df['day_interval'][i]= "Early Hours"
Is there some way to reduce the time taken for processing?
Answers:
Row-wise loops should almost never be used in pandas. Pandas supports vectorized operations:
df.loc[(df['time'] >= times[0]) & (df['time'] < times[1]),
'day_interval'] = "Morning"
df.loc[(df['time'] >= times[1]) & (df['time'] < times[2]),
'day_interval'] = "Afternoon"
Etc. But using pd.cut is even more elegant – see W-B’s solution.
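For completeness, here is a minimal sketch of how all six intervals from the question could be assigned in one vectorized pass with np.select instead of one df.loc line per interval. The sample start_time values are made up for illustration, and np.select is my substitution, not something the answer above uses. Conditions are checked in order and the first match wins, so each one only needs the upper bound.

```python
import datetime

import numpy as np
import pandas as pd

# Hypothetical sample data standing in for the asker's dataframe.
df = pd.DataFrame({"start_time": pd.to_datetime([
    "2019-01-01 03:30:00", "2019-01-01 06:00:00", "2019-01-01 12:00:00",
    "2019-01-01 17:45:00", "2019-01-01 21:10:00", "2019-01-01 23:30:00",
])})
df["time"] = df["start_time"].dt.time

# Same boundaries as in the question.
times = [datetime.time(6), datetime.time(12), datetime.time(15),
         datetime.time(20), datetime.time(23)]

t = df["time"]
# np.select takes the first condition that is True for each row,
# so an earlier upper bound implicitly acts as the next lower bound.
conditions = [t < times[0], t < times[1], t < times[2], t < times[3], t < times[4]]
choices = ["Early Hours", "Morning", "Afternoon", "Evening", "Night"]

# Anything at or after 23:00:00 falls through to the default.
df["day_interval"] = np.select(conditions, choices, default="Late Night")
print(df[["time", "day_interval"]])
```

This keeps the question's convention that each interval includes its lower bound and excludes its upper bound.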
Using pd.cut
Note that I added two times to your times list: 00:00:00 and 23:59:59
pd.cut(s1,bins=pd.to_datetime(pd.Series(times),format='%H:%M:%S').tolist(),labels=['Early','M','A','E','N','L'])
0 Early
1 M
Name: time, dtype: category
Categories (6, object): [Early < M < A < E < N < L]
Data setup
times= [datetime.time(0,0,0),datetime.time(6,0,0),datetime.time(12,0,0),datetime.time(15,0,0),datetime.time(20,0,0),datetime.time(23,0,0),datetime.time(23,59,59)]
s1=pd.to_datetime(df.time,format='%H:%M:%S')
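One thing to watch with pd.cut: by default the bins are closed on the right, so a value of exactly 06:00:00 lands in the earlier bin and exactly 00:00:00 falls outside all bins. Below is a rough sketch of a variant that cuts on the hour of day with right=False so that the bins match the question's ">= lower and < upper" rule. Cutting on the hour rather than on datetimes is my own simplification (it works here only because every boundary is a whole hour), and the sample data and full label names are assumptions.

```python
import pandas as pd

# Hypothetical sample timestamps, including exact boundary values.
start = pd.Series(pd.to_datetime([
    "2019-01-01 00:00:00", "2019-01-01 05:59:59", "2019-01-01 06:00:00",
    "2019-01-01 14:59:00", "2019-01-01 15:00:00", "2019-01-01 23:00:00",
    "2019-01-01 23:59:59",
]), name="start_time")

# Bin edges in hours; 24 is only there to close the last bin.
bins = [0, 6, 12, 15, 20, 23, 24]
labels = ["Early Hours", "Morning", "Afternoon", "Evening", "Night", "Late Night"]

# right=False makes each bin [lower, upper), i.e. lower bound included.
day_interval = pd.cut(start.dt.hour, bins=bins, labels=labels, right=False)
print(day_interval)
```

The result is a categorical Series, which also keeps memory use low on a frame with more than a million rows.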
In pandas/numpy land, most of the time if you are reaching for a for loop, there is probably a better way.
Not sure if faster, but this I think is at least a little cleaner [hopefully correct also?]
def time_of_day(hour):
    if hour < 6:
        return 'Early Hours'
    elif 6 <= hour < 12:
        return 'Morning'
    elif 12 <= hour < 15:
        return 'Afternoon'
    elif 15 <= hour < 20:
        return 'Evening'
    elif 20 <= hour < 23:
        return 'Night'
    else:
        return 'Late Night'

def main():
    # ... code that generates df ...
    df['day_interval'] = df['start_time'].dt.hour.map(time_of_day)

if __name__ == '__main__':
    main()
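A possible tweak, offered as a sketch rather than a measured improvement: since there are only 24 distinct hours, the function can be evaluated once per hour to build a lookup dict, and the dict passed to map instead of the function. That avoids calling a Python function for every row. The hour_to_label name and the sample data are made up for the example.

```python
import pandas as pd


def time_of_day(hour):
    """Return the interval label for an hour of day (0-23)."""
    if hour < 6:
        return 'Early Hours'
    elif hour < 12:
        return 'Morning'
    elif hour < 15:
        return 'Afternoon'
    elif hour < 20:
        return 'Evening'
    elif hour < 23:
        return 'Night'
    return 'Late Night'


# Evaluate the function once for each possible hour.
hour_to_label = {hour: time_of_day(hour) for hour in range(24)}

# Hypothetical sample data standing in for the real dataframe.
df = pd.DataFrame({"start_time": pd.to_datetime([
    "2019-01-01 05:59:00", "2019-01-01 06:00:00", "2019-01-01 14:30:00",
    "2019-01-01 19:59:00", "2019-01-01 22:00:00", "2019-01-01 23:15:00",
])})

# Mapping with a dict is a plain lookup per value.
df["day_interval"] = df["start_time"].dt.hour.map(hour_to_label)
print(df)
```

Whether this is noticeably faster on 1.5 million rows would need to be timed on the actual data.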
I will throw df.between_time with loc out there as an option:
df = pd.DataFrame(np.random.randn(25), index=pd.date_range('2017-08-20', '2017-08-21', freq='H'))
df.loc[df.between_time('06:00:00', '12:00:00').index, 'newCol'] = 'morning'
df.loc[df.between_time('12:00:00', '15:00:00').index, 'newCol'] = 'afternoon'
Update per comment:
If you want to use between_time on a column and not an index, then try:
# sample data
df = pd.DataFrame(np.random.randn(25),
index=pd.date_range('2017-08-20', '2017-08-21', freq='H'))
df = df.reset_index().rename(columns={'index': 'date'})
# create a datetime index from the date column
idx = pd.DatetimeIndex(df['date'])
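# get the integer positions of the rows inside each time window
# (these match the index labels because reset_index gave a 0..n-1 index)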
morning_mask = idx.indexer_between_time('06:00:00', '12:00:00')
afternoon_mask = idx.indexer_between_time('12:00:00', '15:00:00')
# use loc to assign value to a new column
df.loc[morning_mask, 'newCol'] = 'morning'
df.loc[afternoon_mask, 'newCol'] = 'afternoon'
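To cover all six intervals from the question with this approach, something like the following sketch might work. It assumes DatetimeIndex.indexer_between_time accepts include_end (it does in the pandas versions I am aware of, but check yours), and it relies on the wrap-around behaviour when the start time is later than the end time for the 23:00 to midnight window. The intervals list and the sample timestamps are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with a datetime column.
df = pd.DataFrame({"date": pd.to_datetime([
    "2017-08-20 00:00:00", "2017-08-20 06:00:00", "2017-08-20 12:00:00",
    "2017-08-20 15:00:00", "2017-08-20 20:00:00", "2017-08-20 23:00:00",
    "2017-08-20 23:59:59",
])})

# (start, end, label); each window includes start and excludes end.
intervals = [
    ("00:00", "06:00", "Early Hours"),
    ("06:00", "12:00", "Morning"),
    ("12:00", "15:00", "Afternoon"),
    ("15:00", "20:00", "Evening"),
    ("20:00", "23:00", "Night"),
    ("23:00", "00:00", "Late Night"),  # start > end wraps past midnight
]

idx = pd.DatetimeIndex(df["date"])
labels = np.empty(len(df), dtype=object)
for start, end, label in intervals:
    # Integer positions of the rows whose time of day is in [start, end).
    positions = idx.indexer_between_time(start, end, include_end=False)
    labels[positions] = label

df["day_interval"] = labels
print(df)
```

Writing into a NumPy array by position avoids depending on the dataframe having a default 0..n-1 index.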