How can I read multiple CSV files and merge them in single dataframe in PySpark
Question:
I have 4 CSV files with different columns. Some of the CSVs share column names. The details of the CSV files are:
capstone_customers.csv: [customer_id, customer_type, repeat_customer]
capstone_invoices.csv: [invoice_id,product_id, customer_id, days_until_shipped, product_line, total]
capstone_recent_customers.csv: [customer_id, customer_type]
capstone_recent_invoices.csv: [invoice_id,product_id, customer_id, days_until_shipped, product_line, total]
My code is:
df1 = spark.read.options(inferSchema='True',header='True',delimiter=',').csv("capstone_customers.csv")
df2 = spark.read.options(inferSchema='True',header='True',delimiter=',').csv("capstone_invoices.csv")
df3 = spark.read.options(inferSchema='True',header='True',delimiter=',').csv("capstone_recent_customers.csv")
df4 = spark.read.options(inferSchema='True',header='True',delimiter=',').csv("capstone_recent_invoices.csv")
from functools import reduce

def unite_dfs(df1, df2):
    return df2.union(df1)

list_of_dfs = [df1, df2, df3, df4]
united_df = reduce(unite_dfs, list_of_dfs)
but I got the error:
Union can only be performed on tables with the same number of columns, but the first table has 6 columns and the second table has 3 columns;;
'Union
:- Relation[invoice_id#234,product_id#235,customer_id#236,days_until_shipped#237,product_line#238,total#239] csv
+- Relation[customer_id#218,customer_type#219,repeat_customer#220] csv
How can I merge these into a single DataFrame and deduplicate the shared column names using PySpark?
Answers:
You can provide a list of files, or a path to the files, instead of reading them one by one. And don't forget about the mergeSchema option:
files = [
"capstone_customers.csv",
"capstone_invoices.csv",
"capstone_recent_customers.csv",
"capstone_recent_invoices.csv"
]
df = spark.read.options(inferSchema='True', header='True', delimiter=',', mergeSchema='True').csv(files)
# or
df = spark.read.options(inferSchema='True', header='True', delimiter=',', mergeSchema='True').csv('/path/to/files/')
To read multiple files in Spark, you can build a list of all the files you want and read them at once; you don't have to read them in order.
Here is an example of code you can use:
path = ['file1.csv', 'file2.csv']
df = spark.read.options(header=True).csv(path)
df.show()