pandas align() function : illustrative example
Question:
I came across this line of code
app_train_poly, app_test_poly = app_train_poly.align(app_test_poly, join = 'inner', axis = 1)
here app_train_poly and app_test_poly are the pandas dataframe.
I know that with align() you are able to perform some sort of combining of the two dataframes but I am not able to visualize how does it actually work.
I searched the documentation but could not find any illustrative example.
Answers:
You are on the right track, except that DataFrame.align
doesn’t combine two dataframes, rather it aligns them so that the two dataframes have the same row and/or column configuration. Let’s try an example:
Initialising two dataframes with some descriptive column names and toy data:
df1 = pd.DataFrame([[1,2,3,4], [6,7,8,9]], columns=['D', 'B', 'E', 'A'], index=[1,2])
df2 = pd.DataFrame([[10,20,30,40], [60,70,80,90], [600,700,800,900]], columns=['A', 'B', 'C', 'D'], index=[2,3,4])
Now, let’s view these data frames by themselves:
print(df1)
D B E A
1 1 2 3 4
2 6 7 8 9
print(df2)
A B C D
2 10 20 30 40
3 60 70 80 90
4 600 700 800 900
Let’s align these two dataframes, aligning by columns (axis=1
), and performing an outer join on column labels (join='outer'
):
a1, a2 = df1.align(df2, join='outer', axis=1)
print(a1)
print(a2)
A B C D E
1 4 2 NaN 1 3
2 9 7 NaN 6 8
A B C D E
2 10 20 30 40 NaN
3 60 70 80 90 NaN
4 600 700 800 900 NaN
A few things to notice here:
- The columns in
df1
have been rearranged so they align with the columns in df2
.
- There is a column labelled
'C'
that has been added to df1
, and a column labelled 'E'
that has been added to df2
. These columns have been filled with NaN
. This is because we performed an outer join on the column labels.
- None of the values inside the DataFrames have been altered.
- Note that the row labels are not aligned;
df2
has rows 3
and 4
, whereas df1
does not. This is because we requested alignment on columns (axis=1
).
What happens if we align on both rows and columns, but change the join
parameter to 'right'
?
a1, a2 = df1.align(df2, join='right', axis=None)
print(a1)
print(a2)
A B C D
2 9.0 7.0 NaN 6.0
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
A B C D
2 10 20 30 40
3 60 70 80 90
4 600 700 800 900
Note that:
- Only the columns and rows that are found in the "right" dataframe (
df2
) are retained. Column 'E'
is no longer present. This is because we made a right join on both the column and row labels.
- Rows with labels
3
and 4
have been added to df1
, filled with Nan
. This is because we requested alignment on both rows and columns (axis=None
).
- Row labels are now aligned as well as column labels.
- Again, note that none of the actual values within the dataframes have been altered.
Finally, let’s have a look at the code in the question, with join='inner'
and axis=1
:
a1, a2 = df1.align(df2, join='inner', axis=1)
print(a1)
print(a2)
D B A
1 1 2 4
2 6 7 9
D B A
2 40 20 10
3 90 70 60
4 900 700 600
- Only column labels are aligned (
axis=1
).
- Only column labels that are present in both
df1
and df2
are retained (join='inner'
).
In summary, use DataFrame.align()
when you want to make sure the arrangement of rows and/or columns is the same between two dataframes, without altering any of the data contained within the two dataframes.
I came across this line of code
app_train_poly, app_test_poly = app_train_poly.align(app_test_poly, join = 'inner', axis = 1)
here app_train_poly and app_test_poly are the pandas dataframe.
I know that with align() you are able to perform some sort of combining of the two dataframes but I am not able to visualize how does it actually work.
I searched the documentation but could not find any illustrative example.
You are on the right track, except that DataFrame.align
doesn’t combine two dataframes, rather it aligns them so that the two dataframes have the same row and/or column configuration. Let’s try an example:
Initialising two dataframes with some descriptive column names and toy data:
df1 = pd.DataFrame([[1,2,3,4], [6,7,8,9]], columns=['D', 'B', 'E', 'A'], index=[1,2])
df2 = pd.DataFrame([[10,20,30,40], [60,70,80,90], [600,700,800,900]], columns=['A', 'B', 'C', 'D'], index=[2,3,4])
Now, let’s view these data frames by themselves:
print(df1)
D B E A 1 1 2 3 4 2 6 7 8 9
print(df2)
A B C D 2 10 20 30 40 3 60 70 80 90 4 600 700 800 900
Let’s align these two dataframes, aligning by columns (axis=1
), and performing an outer join on column labels (join='outer'
):
a1, a2 = df1.align(df2, join='outer', axis=1)
print(a1)
print(a2)
A B C D E 1 4 2 NaN 1 3 2 9 7 NaN 6 8 A B C D E 2 10 20 30 40 NaN 3 60 70 80 90 NaN 4 600 700 800 900 NaN
A few things to notice here:
- The columns in
df1
have been rearranged so they align with the columns indf2
. - There is a column labelled
'C'
that has been added todf1
, and a column labelled'E'
that has been added todf2
. These columns have been filled withNaN
. This is because we performed an outer join on the column labels. - None of the values inside the DataFrames have been altered.
- Note that the row labels are not aligned;
df2
has rows3
and4
, whereasdf1
does not. This is because we requested alignment on columns (axis=1
).
What happens if we align on both rows and columns, but change the join
parameter to 'right'
?
a1, a2 = df1.align(df2, join='right', axis=None)
print(a1)
print(a2)
A B C D 2 9.0 7.0 NaN 6.0 3 NaN NaN NaN NaN 4 NaN NaN NaN NaN A B C D 2 10 20 30 40 3 60 70 80 90 4 600 700 800 900
Note that:
- Only the columns and rows that are found in the "right" dataframe (
df2
) are retained. Column'E'
is no longer present. This is because we made a right join on both the column and row labels. - Rows with labels
3
and4
have been added todf1
, filled withNan
. This is because we requested alignment on both rows and columns (axis=None
). - Row labels are now aligned as well as column labels.
- Again, note that none of the actual values within the dataframes have been altered.
Finally, let’s have a look at the code in the question, with join='inner'
and axis=1
:
a1, a2 = df1.align(df2, join='inner', axis=1)
print(a1)
print(a2)
D B A 1 1 2 4 2 6 7 9 D B A 2 40 20 10 3 90 70 60 4 900 700 600
- Only column labels are aligned (
axis=1
). - Only column labels that are present in both
df1
anddf2
are retained (join='inner'
).
In summary, use DataFrame.align()
when you want to make sure the arrangement of rows and/or columns is the same between two dataframes, without altering any of the data contained within the two dataframes.