Selecting only one column from the right dataframe when joining

Question:

I have two huge dataframes. Some of their columns share a name but are otherwise unrelated. I have two join keys, though, and I want to add just one column from data_right to data_left. I tried:

output_df = data_left.join(data_right, on=["join_key_1", "join_key_2"], how="left").select("data_left.*", "data_right.extraColumn")

But it does not recognize the * even after importing it.

Sample:

data_left = 

col_1    col_2   join_key_1    join_key_2
   12        a          a_b             1
   14        c          r_t             2
   12        d          v_b             1
   24        r          a_s             2


data_right = 

col_3    col_4   join_key_1    join_key_2     extraColumn
   12        a          a_b             1             456
   14        g          r_t             2             654
   15        e          v_c             5             464
   24        r          a_s             2             546
   12        d          v_b             1             549

output_df =

col_1    col_2   join_key_1    join_key_2     extraColumn
   12        a          a_b             1             456
   14        c          r_t             2             654
   12        d          v_b             1             549
   24        r          a_s             2             546

If data_right has no matching pair of join keys for a row of data_left, extraColumn should stay empty (null) for that row.

Asked By: johnnydoe


Answers:

Would this work for your use case?

output_df = data_left.join(data_right.select("join_key_1", "join_key_2", "extraColumn"), on=["join_key_1", "join_key_2"], how="left")
Answered By: Robert Kossendey