How can I read multiple CSV files and merge them into a single DataFrame in PySpark?

Question:

I have 4 CSV files with different columns. Some of the CSVs share column names. The details of the files are:

capstone_customers.csv: [customer_id, customer_type, repeat_customer]

capstone_invoices.csv: [invoice_id, product_id, customer_id, days_until_shipped, product_line, total]

capstone_recent_customers.csv: [customer_id, customer_type]

capstone_recent_invoices.csv: [invoice_id, product_id, customer_id, days_until_shipped, product_line, total]

My code is:

df1 = spark.read.options(inferSchema='True', header='True', delimiter=',').csv("capstone_customers.csv")
df2 = spark.read.options(inferSchema='True', header='True', delimiter=',').csv("capstone_invoices.csv")
df3 = spark.read.options(inferSchema='True', header='True', delimiter=',').csv("capstone_recent_customers.csv")
df4 = spark.read.options(inferSchema='True', header='True', delimiter=',').csv("capstone_recent_invoices.csv")

from functools import reduce

def unite_dfs(df1, df2):
    return df2.union(df1)

list_of_dfs = [df1, df2, df3, df4]
united_df = reduce(unite_dfs, list_of_dfs)

but I got the error:

Union can only be performed on tables with the same number of columns, but the first table has 6 columns and the second table has 3 columns;;
'Union
:- Relation[invoice_id#234,product_id#235,customer_id#236,days_until_shipped#237,product_line#238,total#239] csv
+- Relation[customer_id#218,customer_type#219,repeat_customer#220] csv

How can I merge them into a single DataFrame and deal with the duplicate column names in PySpark?

Asked By: Mary


Answers:

You can provide a list of files, or a path to the files, instead of reading them one by one. And don't forget the mergeSchema option:

files = [
    "capstone_customers.csv",
    "capstone_invoices.csv",
    "capstone_recent_customers.csv",
    "capstone_recent_invoices.csv"
]
df = spark.read.options(inferSchema='True', header='True', delimiter=',', mergeSchema='True').csv(files)

# or point the reader at a directory containing the files
df = spark.read.options(inferSchema='True', header='True', delimiter=',', mergeSchema='True').csv('/path/to/files/')
Answered By: iurii_n

To read multiple files in Spark you can make a list of all the files you want and read them at once; you don't have to read them one at a time.

Here is an example of code you can use:

path = ['file1.csv', 'file2.csv']
 
df = spark.read.options(header=True).csv(path)
df.show()
Answered By: Ondra907
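
For reference, a minimal sketch of one way to combine the four DataFrames from the question without hitting the union error: union only the frames that describe the same entity (matching columns by name), then join customers to invoices on customer_id so the shared column appears only once. This assumes Spark 3.1+ for the allowMissingColumns argument of unionByName, and the left join is an assumption about which rows should be kept:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read each file separately, as in the question.
customers = spark.read.options(inferSchema='True', header='True', delimiter=',').csv("capstone_customers.csv")
recent_customers = spark.read.options(inferSchema='True', header='True', delimiter=',').csv("capstone_recent_customers.csv")
invoices = spark.read.options(inferSchema='True', header='True', delimiter=',').csv("capstone_invoices.csv")
recent_invoices = spark.read.options(inferSchema='True', header='True', delimiter=',').csv("capstone_recent_invoices.csv")

# unionByName matches columns by name instead of position;
# allowMissingColumns=True (Spark 3.1+) fills columns that exist on only
# one side (repeat_customer here) with nulls.
all_customers = customers.unionByName(recent_customers, allowMissingColumns=True)
all_invoices = invoices.unionByName(recent_invoices)

# Join on customer_id so the shared column appears only once in the result.
merged = all_invoices.join(all_customers, on="customer_id", how="left")
merged.printSchema()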