Transforming a DataFrame from long to wide with specific columns
Question:
I have a DataFrame that looks like this (call it df1):
id  date  value
A1  day1    0.1
A1  day2    0.2
A1  day3   -0.1
A2  day1    0.3
A3  day2    0.2
A3  day4   -0.5
I need to convert the values to a matrix for calculation, so I think I need to transform the DataFrame into this form (call it df2) first (and then convert it to a NumPy array):
    day1  day2  day3  day4  day5
A1   0.1   0.2  -0.1   0.0   0.0
A2   0.3   0.0   0.0   0.0   0.0
A3   0.0   0.2   0.0  -0.5   0.0
If an id doesn’t have a value on a given day, that day’s value should just be set to 0 (and probably none of the ids has a value for every date).
My idea is to generate an all-zero DataFrame (call it df3) first and then fill df1’s data into it:
    day1  day2  day3  day4  day5
A1   0.0   0.0   0.0   0.0   0.0
A2   0.0   0.0   0.0   0.0   0.0
A3   0.0   0.0   0.0   0.0   0.0
But I don’t know the proper way to iterate over df1’s values to match the cells in df3 (and people say it’s a bad idea to iterate over a DataFrame). Is there a better approach (like pivot or merge)?
Answers:
You could try df.pivot() to reshape the DataFrame:
df2 = df1.pivot(index='id', columns='date', values='value').fillna(0.0)
df2.columns.name = None  # drop the 'date' label so the header prints flat
print(df2)
Output:
    day1  day2  day3  day4
id
A1   0.1   0.2  -0.1   0.0
A2   0.3   0.0   0.0   0.0
A3   0.0   0.2   0.0  -0.5
Assume you have df3 as
    day1  day2  day3  day4  day5  day6  day7
id
A1   0.0   0.0   0.0   0.0   0.0   0.0   0.0
A2   0.0   0.0   0.0   0.0   0.0   0.0   0.0
A3   0.0   0.0   0.0   0.0   0.0   0.0   0.0
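You can merge:
import pandas as pd
# merge joins on every shared column (id, day1..day4); no df3 row matches, so day5..day7 come back as NaN and are filled with 0.0
df4 = pd.merge(df2.reset_index(), df3.reset_index(), how='left').set_index('id').fillna(0.0)
print(df4)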
to get the output
    day1  day2  day3  day4  day5  day6  day7
id
A1   0.1   0.2  -0.1   0.0   0.0   0.0   0.0
A2   0.3   0.0   0.0   0.0   0.0   0.0   0.0
A3   0.0   0.2   0.0  -0.5   0.0   0.0   0.0
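If df3 already exists as an all-zero template, the "fill df1's data into df3" idea from the question can probably be done without any iteration. Below is a minimal sketch, assuming df3 is indexed by id and has one column per day; the names days, wide and matrix are only illustrative. It relies on DataFrame.update, which aligns on index and column labels and overwrites cells only where the other frame has a non-NaN value.

```python
import pandas as pd

# Long-format data from the question
df1 = pd.DataFrame({
    "id": ["A1", "A1", "A1", "A2", "A3", "A3"],
    "date": ["day1", "day2", "day3", "day1", "day2", "day4"],
    "value": [0.1, 0.2, -0.1, 0.3, 0.2, -0.5],
})

# Assumed template: all zeros, one row per id, one column per day
days = [f"day{i}" for i in range(1, 8)]
df3 = pd.DataFrame(0.0, index=pd.Index(["A1", "A2", "A3"], name="id"), columns=days)

# Pivot to wide form; id/day pairs with no value stay NaN here
wide = df1.pivot(index="id", columns="date", values="value")

# update() works in place and skips NaN cells, so df3 keeps its zeros
# wherever df1 has no entry
df3.update(wide)

matrix = df3.to_numpy()
```

Note that update() modifies df3 in place and returns None, so use df3.copy() first if the template should stay untouched.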
This should work.
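# pivot and reindex to add the missing days
# (pivot arguments are keyword-only since pandas 2.0, so name them explicitly)
df1.pivot(index='id', columns='date', values='value').reindex(['day1', 'day2', 'day3', 'day4', 'day5'], axis=1).fillna(0).values
# array([[ 0.1,  0.2, -0.1,  0. ,  0. ],
#        [ 0.3,  0. ,  0. ,  0. ,  0. ],
#        [ 0. ,  0.2,  0. , -0.5,  0. ]])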
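One caveat with pivot: it raises a ValueError when the same (id, date) pair appears more than once. The question's data has no such duplicates, so the following is only a sketch for that hypothetical case. It assumes duplicates should be summed (aggfunc="sum" is my choice, not something stated in the question) and uses pivot_table instead. The duplicated A3/day4 row is made up for illustration.

```python
import pandas as pd

# Same data as the question, except the A3/day4 value is split into
# two rows (hypothetical duplicate) to show the aggregation
df1 = pd.DataFrame({
    "id": ["A1", "A1", "A1", "A2", "A3", "A3", "A3"],
    "date": ["day1", "day2", "day3", "day1", "day2", "day4", "day4"],
    "value": [0.1, 0.2, -0.1, 0.3, 0.2, -0.25, -0.25],
})

days = ["day1", "day2", "day3", "day4", "day5"]

# pivot_table aggregates duplicate (id, date) pairs; fill_value replaces
# the NaN left by missing combinations
wide = df1.pivot_table(index="id", columns="date", values="value",
                       aggfunc="sum", fill_value=0.0)

# reindex adds the days that never occur in the data (day5 here)
matrix = wide.reindex(columns=days, fill_value=0.0).to_numpy()
```

If duplicates would indicate a data error rather than something to aggregate, plain pivot is arguably the safer choice because it fails loudly.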