pandas align() function : illustrative example

Question:

I came across this line of code

app_train_poly, app_test_poly = app_train_poly.align(app_test_poly, join = 'inner', axis = 1)

here app_train_poly and app_test_poly are the pandas dataframe.

I know that with align() you are able to perform some sort of combining of the two dataframes but I am not able to visualize how does it actually work.

I searched the documentation but could not find any illustrative example.

Asked By: ashunigion

||

Answers:

You are on the right track, except that DataFrame.align doesn’t combine two dataframes, rather it aligns them so that the two dataframes have the same row and/or column configuration. Let’s try an example:

Initialising two dataframes with some descriptive column names and toy data:

df1 = pd.DataFrame([[1,2,3,4], [6,7,8,9]], columns=['D', 'B', 'E', 'A'], index=[1,2])
df2 = pd.DataFrame([[10,20,30,40], [60,70,80,90], [600,700,800,900]], columns=['A', 'B', 'C', 'D'], index=[2,3,4])

Now, let’s view these data frames by themselves:

print(df1)
   D  B  E  A
1  1  2  3  4
2  6  7  8  9
print(df2)
     A    B    C    D
2   10   20   30   40
3   60   70   80   90
4  600  700  800  900

Let’s align these two dataframes, aligning by columns (axis=1), and performing an outer join on column labels (join='outer'):

a1, a2 = df1.align(df2, join='outer', axis=1)
print(a1)
print(a2)
   A  B   C  D  E
1  4  2 NaN  1  3
2  9  7 NaN  6  8
     A    B    C    D   E
2   10   20   30   40 NaN
3   60   70   80   90 NaN
4  600  700  800  900 NaN

A few things to notice here:

  • The columns in df1 have been rearranged so they align with the columns in df2.
  • There is a column labelled 'C' that has been added to df1, and a column labelled 'E' that has been added to df2. These columns have been filled with NaN. This is because we performed an outer join on the column labels.
  • None of the values inside the DataFrames have been altered.
  • Note that the row labels are not aligned; df2 has rows 3 and 4, whereas df1 does not. This is because we requested alignment on columns (axis=1).

What happens if we align on both rows and columns, but change the join parameter to 'right'?

a1, a2 = df1.align(df2, join='right', axis=None)
print(a1)
print(a2)
     A    B   C    D
2  9.0  7.0 NaN  6.0
3  NaN  NaN NaN  NaN
4  NaN  NaN NaN  NaN
     A    B    C    D
2   10   20   30   40
3   60   70   80   90
4  600  700  800  900

Note that:

  • Only the columns and rows that are found in the "right" dataframe (df2) are retained. Column 'E' is no longer present. This is because we made a right join on both the column and row labels.
  • Rows with labels 3 and 4 have been added to df1, filled with Nan. This is because we requested alignment on both rows and columns (axis=None).
  • Row labels are now aligned as well as column labels.
  • Again, note that none of the actual values within the dataframes have been altered.

Finally, let’s have a look at the code in the question, with join='inner' and axis=1:

a1, a2 = df1.align(df2, join='inner', axis=1)
print(a1)
print(a2)
   D  B  A
1  1  2  4
2  6  7  9
     D    B    A
2   40   20   10
3   90   70   60
4  900  700  600
  • Only column labels are aligned (axis=1).
  • Only column labels that are present in both df1 and df2 are retained (join='inner').

In summary, use DataFrame.align() when you want to make sure the arrangement of rows and/or columns is the same between two dataframes, without altering any of the data contained within the two dataframes.

Answered By: Andrew Guy
Categories: questions Tags: ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.