How Do I Create New Column In Pandas Dataframe Using Two Columns Simultaneously From A Different Dataframe?
Question:
EDIT: Also, it is okay if I lose the month and day in the "Start_Date" column. In other words, it may be easier to convert the "Start_Date" values to int as well, keeping just the year, and then merge on that.
I have two pandas dataframes, df_monthly and df_pop.
df_monthly looks more or less like this, with Start_Date as datatype datetime64:
Jurisdiction | Start_Date | CrimeCount |
---|---|---|
AR0010000 | 2007-02-24 | 10.0 |
WVWSP9000 | 2008-06-04 | 15.0 |
… | … | … |
df_pop is a dataframe containing Jurisdictions and their corresponding populations for any given year (datatype int64), like:
data_year | ori | population |
---|---|---|
1970 | AK0010100 | 44237 |
1970 | AK0010200 | 13311 |
… | … | … |
I want to create a new column in df_monthly called year_pop, which contains the corresponding population for that jurisdiction and year of the Start_Date value.
I tried achieving this with "data_year" as datatype period[A-DEC] with the following:
# merge the two dataframes
merged_df = pd.merge(df_monthly, df_pop, left_on='Jurisdiction', right_on='ori')
# create a new column "year_pop"
merged_df['year_pop'] = merged_df.apply(lambda x: df_pop[(df_pop['ori']==x['ori']) & (df_pop['data_year']==x['Start_Date'].to_period('A-DEC'))]['population'].values[0], axis=1)
# drop unnecessary columns
merged_df.drop(['data_year', 'ori', 'population'], axis=1, inplace=True)
# assign the merged dataframe to 'df_monthly'
df_monthly = merged_df
However, this raises an "index 0 is out of bounds" error. Is there a more straightforward way of doing this?
Answers:
IIUC, why don't you extract the year from the Start_Date column and merge on ['Jurisdiction', df_monthly['Start_Date'].dt.year] on the left and ['ori', 'data_year'] on the right? Something like:
df_merged = (df_monthly.assign(year=df_monthly['Start_Date'].dt.year)
             .merge(df_pop, how='inner',
                    left_on=['Jurisdiction', 'year'],
                    right_on=['ori', 'data_year']))
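To get all the way to the year_pop column the question asks for, the merge above can be followed by a rename and a drop. The following is only a minimal sketch under stated assumptions: the sample rows and population figures are made up for illustration, and it uses how='left' instead of 'inner' so that rows of df_monthly without a matching population are kept (with NaN) rather than dropped.

```python
import pandas as pd

# Hypothetical sample data shaped like the frames in the question
# (the population figures are invented for illustration).
df_monthly = pd.DataFrame({
    "Jurisdiction": ["AR0010000", "WVWSP9000", "AR0010000"],
    "Start_Date": pd.to_datetime(["2007-02-24", "2008-06-04", "2009-01-15"]),
    "CrimeCount": [10.0, 15.0, 7.0],
})
df_pop = pd.DataFrame({
    "data_year": [2007, 2008],
    "ori": ["AR0010000", "WVWSP9000"],
    "population": [20000, 5000],
})

result = (
    # helper column holding the integer year of each Start_Date
    df_monthly.assign(year=df_monthly["Start_Date"].dt.year)
    # left merge: every df_monthly row is kept, matched on jurisdiction and year
    .merge(df_pop, how="left",
           left_on=["Jurisdiction", "year"],
           right_on=["ori", "data_year"])
    # expose the matched population under the requested name
    .rename(columns={"population": "year_pop"})
    # remove the helper column and the duplicated key columns
    .drop(columns=["year", "ori", "data_year"])
)
print(result)
```

In this sample the 2009 row has no entry in df_pop, so its year_pop is NaN, which also turns the column into floats. Switching to how='inner' would drop that row instead.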
You could also use:
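# add the year as a merge key
df_monthly['data_year'] = df_monthly['Start_Date'].dt.year
# df_pop names the jurisdiction column 'ori', so rename it to match before merging;
# how='left' keeps every df_monthly row ('outer' would also add df_pop rows with no crime records)
df_merged = pd.merge(df_monthly, df_pop.rename(columns={'ori': 'Jurisdiction'}), how='left', on=['data_year', 'Jurisdiction'])
df_merged = df_merged.drop(columns='data_year')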
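As for the "index 0 is out of bounds" error in the question: one plausible cause, which I cannot confirm without the real data, is that some (Jurisdiction, year) pairs in df_monthly have no row in df_pop, so the boolean filter returns an empty result and .values[0] fails. The sketch below uses the indicator=True option of merge to list such pairs. The sample data is again hypothetical.

```python
import pandas as pd

# Hypothetical sample data; the 2009 row deliberately has no population entry.
df_monthly = pd.DataFrame({
    "Jurisdiction": ["AR0010000", "WVWSP9000", "AR0010000"],
    "Start_Date": pd.to_datetime(["2007-02-24", "2008-06-04", "2009-01-15"]),
    "CrimeCount": [10.0, 15.0, 7.0],
})
df_pop = pd.DataFrame({
    "data_year": [2007, 2008],
    "ori": ["AR0010000", "WVWSP9000"],
    "population": [20000, 5000],
})

# indicator=True adds a "_merge" column that says where each row was found.
checked = df_monthly.assign(year=df_monthly["Start_Date"].dt.year).merge(
    df_pop, how="left",
    left_on=["Jurisdiction", "year"],
    right_on=["ori", "data_year"],
    indicator=True,
)

# Rows marked "left_only" exist in df_monthly but have no match in df_pop.
unmatched = checked.loc[checked["_merge"] == "left_only", ["Jurisdiction", "year"]]
print(unmatched)
```

If unmatched comes back empty on the real data, the cause is more likely a dtype mismatch between data_year and the value it is compared against.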