Pandas: add a column of numbers to denote month recency
Question:
I have a simple dataframe and want to add a column of numbers indicating how recent each month is: the most recent month gets the highest "score" and the oldest gets the lowest.
The clumsy code below works for this small dataframe, but it does not scale to larger ones:
import pandas as pd
from io import StringIO

csvfile = StringIO("""
Town,Department,Staff,Month,Project,Score
East,Produce,Ethan,1987-08,A814,27
East,Produce,Ethan,1987-09,A848,27
East,Produce,Ethan,1987-10,A736,29
East,Meat,Harry,1987-07,A813,26""")

df = pd.read_csv(csvfile, sep=',', engine='python')

def condition(s):
    if s['Month'] == '1987-10':
        return 4
    if s['Month'] == '1987-09':
        return 3
    if s['Month'] == '1987-08':
        return 2
    if s['Month'] == '1987-07':
        return 1
    else:
        return ''

df["Month score"] = df.apply(condition, axis=1)
print(df)
For a larger dataframe that covers 24 or more months, with the same month repeated across many rows, what is a good way to write this?
Answers:
If possible, use Series.rank:
df['score'] = df['Month'].rank(method='dense').astype(int)
print (df)
   Town Department  Staff    Month Project  Score  score
0  East    Produce  Ethan  1987-08    A814     27      2
1  East    Produce  Ethan  1987-09    A848     27      3
2  East    Produce  Ethan  1987-10    A736     29      4
3  East       Meat  Harry  1987-07    A813     26      1
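Since the question mentions that months repeat in the larger dataframe, here is a minimal sketch of how dense ranking treats duplicates. The sample rows and the "Month score" column name are made up for illustration, and the months are converted with pd.to_datetime first (an assumption on my part, so that the ranking is chronological rather than dependent on string comparison).

```python
import pandas as pd

# Hypothetical data: the same month appears in several rows.
df = pd.DataFrame({
    "Staff": ["Ethan", "Ethan", "Harry", "Harry", "Ethan"],
    "Month": ["1987-08", "1987-10", "1987-08", "1987-07", "1987-10"],
})

# Parse the year-month strings so the ranking is chronological.
months = pd.to_datetime(df["Month"], format="%Y-%m")

# method="dense": rows sharing a month share a rank, and ranks have no gaps.
df["Month score"] = months.rank(method="dense").astype(int)
print(df)
```

With method='dense', every row for the same month gets the same score, and the scores run 1, 2, 3, ... over the distinct months, however many rows each month has.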
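This seems to work without a separate month score: convert Month to datetime and sort on it. Note that sort_values returns a new dataframe, so assign the result.
df['Month'] = pd.to_datetime(df['Month'])
df = df.sort_values('Month', ascending=False)
Or, if you really need a score column (named 'Month score' here so that it does not overwrite the existing 'Score' column):
df['Month score'] = pd.to_datetime(df['Month'])
df = df.sort_values('Month score', ascending=False)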
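If the score should be a plain integer that also reflects gaps between months (so a skipped month leaves a gap in the scores), one possible variation is to count calendar months from the earliest month in the data. This is only a sketch under that assumption; the sample rows and the "Month score" name are hypothetical.

```python
import pandas as pd

# Hypothetical data with a duplicated month and gaps between months.
df = pd.DataFrame({
    "Month": ["1987-07", "1987-08", "1987-10", "1987-10", "1988-01"],
})

dt = pd.to_datetime(df["Month"], format="%Y-%m")

# Turn each date into a running month number, then offset it so the
# earliest month in the data scores 1.
month_number = dt.dt.year * 12 + dt.dt.month
df["Month score"] = month_number - month_number.min() + 1

# Most recent month first.
df = df.sort_values("Month score", ascending=False)
print(df)
```

Unlike a dense rank, this keeps the distance between months: 1987-10 scores 4 and 1988-01 scores 7, because two months are missing in between.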