Merge two Spark dataframes with different columns to get all columns

Question:

Let's say I have two Spark dataframes:

df1:

Location    Date        Date_part   Sector      units
USA         7/1/2021    7/1/2021    Cars        200
IND         7/1/2021    7/1/2021    Scooters    180
COL         7/1/2021    7/1/2021    Trucks      100

df2:

Location    Date    Brands  units   values
UK          null    brand1  400     120
AUS         null    brand2  450     230
CAN         null    brand3  150     34

I need my resultant dataframe to be:

Location    Date        Date_part   Sector      Brands  units   values
USA         7/1/2021    7/1/2021    Cars                200     
IND         7/1/2021    7/1/2021    Scooters            180     
COL         7/1/2021    7/1/2021    Trucks              100
UK          null        7/1/2021                brand1  400     120
AUS         null        7/1/2021                brand2  450     230
CAN         null        7/1/2021                brand3  150     34

So my desired df should contain all columns from both dataframes, and I also need Date_part populated in all rows.
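(For reference, here is a minimal sketch that reproduces the two frames; the schemas are assumptions inferred from the tables above, with dates kept as strings:)

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Schemas are assumed from the sample tables shown above
df1 = spark.createDataFrame(
    [("USA", "7/1/2021", "7/1/2021", "Cars", 200),
     ("IND", "7/1/2021", "7/1/2021", "Scooters", 180),
     ("COL", "7/1/2021", "7/1/2021", "Trucks", 100)],
    "Location string, Date string, Date_part string, Sector string, units long",
)
df2 = spark.createDataFrame(
    [("UK", None, "brand1", 400, 120),
     ("AUS", None, "brand2", 450, 230),
     ("CAN", None, "brand3", 150, 34)],
    "Location string, Date string, Brands string, units long, values long",
)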
This is what I tried:

df_result = df1.union(df2)

But I'm getting this as my result: the values are being swapped, and one column from the second dataframe is missing.

Location    Date        Date_part   Sector      units
USA         7/1/2021    7/1/2021    Cars        200
IND         7/1/2021    7/1/2021    Scooters    180
COL         7/1/2021    7/1/2021    Trucks      100
UK          null        brand1      400         120
AUS         null        brand2      450         230
CAN         null        brand3      150         34

Any suggestions, please?

Asked By: user175025


Answers:

union: this function resolves columns by position (not by name).

That is why the values appear swapped and why one column from the second dataframe seems to be missing.

You should use unionByName, but this function requires both dataframes to have the same structure.

I offer you this simple code to harmonize the structure of your dataframes and then do the unionByName:

from pyspark.sql import DataFrame
from pyspark.sql import functions as F

def add_missing_columns(df: DataFrame, ref_df: DataFrame) -> DataFrame:
    """Add missing columns from ref_df to df

    Args:
        df (DataFrame): dataframe with missing columns
        ref_df (DataFrame): referential dataframe

    Returns:
        DataFrame: df with additional columns from ref_df
    """
    for col in ref_df.schema:
        if col.name not in df.columns:
            df = df.withColumn(col.name, F.lit(None).cast(col.dataType))

    return df


df1 = add_missing_columns(df1, df2)
df2 = add_missing_columns(df2, df1)

df_result = df1.unionByName(df2)
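
One detail from the question remains: the rows coming from df2 will have Date_part as null. If, as in the desired output, every row should carry the date, a possible follow-up (a sketch, assuming the literal date from the sample is an acceptable fill value) is:

# Backfill Date_part for rows that arrived with null (fill value is hypothetical)
df_result = df_result.withColumn(
    "Date_part",
    F.coalesce(F.col("Date_part"), F.lit("7/1/2021")),
)
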
Answered By: Steven

This is an add-on to @Steven’s response (since I don’t have enough reputation to comment directly under his post):

Apart from the optional argument suggested by @minus34 for Spark 3.1+ (a sketch of that variant follows below), @Steven’s solution (add_missing_columns) is a perfect workaround. However, withColumn introduces a projection internally; when called in a large loop it generates big plans that can cause performance issues, eventually amounting to a StackOverflowError on large datasets.
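
For reference, the optional argument referred to above is presumably allowMissingColumns, which unionByName gained in Spark 3.1. With it, and starting from the original df1/df2, the manual harmonization is not needed at all:

# Spark 3.1+ only: columns missing on either side are filled with nulls
df_result = df1.unionByName(df2, allowMissingColumns=True)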

A scalable modification of @Steven’s code could be:

from pyspark.sql import DataFrame
from pyspark.sql import functions as F
from pyspark.sql import types as T

def add_missing_columns(df: DataFrame, ref_df: DataFrame) -> DataFrame:
    """Add missing columns from ref_df to df

    Args:
        df (DataFrame): dataframe with missing columns
        ref_df (DataFrame): referential dataframe

    Returns:
        DataFrame: df with additional columns from ref_df
    """
    # Collect the columns present in ref_df but absent from df, then add them
    # all in a single select (one projection instead of one per column)
    missing_col = [col.name for col in ref_df.schema if col.name not in df.columns]

    df = df.select(['*'] + [F.lit(None).cast(T.NullType()).alias(c) for c in missing_col])

    return df

select is therefore a possible alternative. It might also be better to cast the new empty columns to NullType(), since you then needn’t specify a concrete data type for each one: NullType() coerces cleanly against any data type in union and unionByName in Spark.
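
To illustrate that last point, here is a minimal sketch (hypothetical one-column frames, reusing the spark session from above) showing a NullType() column being coerced to the other side's type:

from pyspark.sql import functions as F
from pyspark.sql import types as T

df_str = spark.createDataFrame([("x",)], "c string")
df_null = spark.range(1).select(F.lit(None).cast(T.NullType()).alias("c"))

# Prints c as string (nullable = true): the NullType column was coerced
df_str.unionByName(df_null).printSchema()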

Answered By: Anselm