How can I read multiple CSV files and merge them in single dataframe in PySpark
Question:
I have 4 CSV files with different columns. Some of the CSVs share column names. The details of the CSV files are:
capstone_customers.csv: [customer_id, customer_type, repeat_customer]
capstone_invoices.csv: [invoice_id,product_id, customer_id, days_until_shipped, product_line, total]
capstone_recent_customers.csv: [customer_id, customer_type]
capstone_recent_invoices.csv: [invoice_id,product_id, customer_id, days_until_shipped, product_line, total]
My code is:
df1 = spark.read.options(inferSchema='True',header='True',delimiter=',').csv("capstone_customers.csv")
df2 = spark.read.options(inferSchema='True',header='True',delimiter=',').csv("capstone_invoices.csv")
df3 = spark.read.options(inferSchema='True',header='True',delimiter=',').csv("capstone_recent_customers.csv")
df4 = spark.read.options(inferSchema='True',header='True',delimiter=',').csv("capstone_recent_invoices.csv")
from functools import reduce

def unite_dfs(df1, df2):
    return df2.union(df1)

list_of_dfs = [df1, df2, df3, df4]
united_df = reduce(unite_dfs, list_of_dfs)
but I got the error:
Union can only be performed on tables with the same number of columns, but the first table has 6 columns and the second table has 3 columns;;
'Union
:- Relation[invoice_id#234,product_id#235,customer_id#236,days_until_shipped#237,product_line#238,total#239] csv
+- Relation[customer_id#218,customer_type#219,repeat_customer#220] csv
How can I merge these into a single DataFrame and deduplicate the shared column names using PySpark?
Answers:
You can provide a list of files, or a path to the files, instead of reading them one by one. And don't forget about the mergeSchema option:
files = [
"capstone_customers.csv",
"capstone_invoices.csv",
"capstone_recent_customers.csv",
"capstone_recent_invoices.csv"
]
df = spark.read.options(inferSchema='True', header='True', delimiter=',', mergeSchema='True').csv(files)
# or
df = spark.read.options(inferSchema='True', header='True', delimiter=',', mergeSchema='True').csv('/path/to/files/')
To read multiple files in Spark, you can build a list of all the files you want and read them at once; you don't have to read them in order.
Here is an example of code you can use:
path = ['file1.csv', 'file2.csv']
df = spark.read.options(header=True).csv(path)
df.show()