Add a column (in Pandas) that is calculated based on another column
Question:
I have a simple database that has every month's earnings, with Year (values 1991-2020), Month (Jan-Dec) and Earnings. I want to make a new column, where for years 1991-2005 I divide the Earnings column by 10000, but for 2006-2020 I want it to be the same as in the Earnings column.
I am a beginner, but what I was thinking is that I want the new column (TrueEarn) to be Earnings/10000, but only for the years 1991-2005.
df['TrueEarn'] = df['Earnings']/10000 for (['Year']=('1991':"2005"))
Since I am a newb with Python, this may not make sense to you, but that is how I logically wanted to write it.
Can you help me, please?
Answers:
You should provide a minimal reproducible example. But assuming that you have the year in another column, one way to go could be:
df['TrueEarn'] = np.where((df['Year'] >= 1991) & (df['Year'] <= 2005),
                          df['Earnings'] / 10000, df['Earnings'])
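To make the np.where approach concrete, here is a minimal self-contained sketch. The four sample rows are invented purely for illustration, and the column names Year, Month and Earnings are taken from the question; adjust them if your dataframe differs.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the real table would have one row per month.
df = pd.DataFrame({
    "Year": [1991, 2005, 2006, 2020],
    "Month": ["Jan", "Dec", "Jan", "Dec"],
    "Earnings": [50000.0, 120000.0, 7.5, 12.0],
})

# Boolean Series: True for rows whose year is in 1991-2005 (inclusive).
early = (df["Year"] >= 1991) & (df["Year"] <= 2005)

# Where the condition holds take Earnings / 10000, otherwise keep Earnings.
df["TrueEarn"] = np.where(early, df["Earnings"] / 10000, df["Earnings"])

print(df)
```

The parentheses around each comparison matter, because & binds more tightly than >= and <= in Python.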
As @wjandrea says, this can be done directly with pandas, but in the benchmark below NumPy's np.where is faster (about 695 µs versus 959 µs on a 50,000-row toy dataframe):
import numpy as np
import pandas as pd
df = pd.DataFrame(
    {"YEAR": np.random.randint(1991, 2020, size=50000), "Earnings": np.random.uniform(0, 2e10, size=50000)}
)
%timeit df["TrueEarn"] = np.where((df["YEAR"] >= 1991) & (df["YEAR"] <= 2005), df["Earnings"] / 10000, df["Earnings"])
695 µs ± 3.17 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Versus pandas mask:
%timeit df["TrueEarn"] = df["Earnings"].mask(df["YEAR"].between(1991, 2005), df["Earnings"] / 10000)
959 µs ± 4.45 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
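If you would rather stay in pandas, another option is boolean-indexed assignment with .loc. The sketch below uses invented sample rows and the question's column names, and checks that .loc, mask and np.where all give the same values. It was not part of the benchmark above, so no claim is made about its speed.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data for illustration only.
df = pd.DataFrame({
    "Year": [1991, 2005, 2006, 2020],
    "Earnings": [50000.0, 120000.0, 7.5, 12.0],
})

# between() is inclusive on both ends by default.
early = df["Year"].between(1991, 2005)

# Start from a copy of Earnings, then overwrite only the 1991-2005 rows.
df["TrueEarn"] = df["Earnings"]
df.loc[early, "TrueEarn"] = df.loc[early, "Earnings"] / 10000

# The two approaches from the answer, for comparison.
via_mask = df["Earnings"].mask(early, df["Earnings"] / 10000)
via_where = np.where(early, df["Earnings"] / 10000, df["Earnings"])

all_equal = bool((df["TrueEarn"] == via_mask).all()) and np.array_equal(
    df["TrueEarn"].to_numpy(), via_where
)
print(all_equal)
```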