How to make a loop in PySpark
Question:
I have this code:
list_files = glob.glob("/t/main_folder/*/file_*[0-9].csv")
test = sorted(list_files, key=lambda x: x[-5:])
This code finds the files I need to work with; it found 5 CSV files in different folders.
Next, I use the code below to process every file that was found. I need to do a full outer join for each file, one by one: first for main_folder/folder1/file1.csv, then for main_folder/folder2/file2, and so on until the last file that was found.
That's why I need a loop.
df_deltas = (spark.read.format("csv").schema(schema).option("header", "true")
             .option("delimiter", ";").load(test))
df_mirror = (spark.read.format("csv").schema(schema).option("header", "true")
             .option("delimiter", ",").load("/t/org_file.csv").cache())
df_deltas.createOrReplaceTempView("deltas")
df_mirror.createOrReplaceTempView("mirror")
df_mir2=spark.sql("""select
coalesce (deltas.DATA_ACTUAL_DATE,mirror.DATA_ACTUAL_DATE) as DATA_ACTUAL_DATE,
coalesce (deltas.DATA_ACTUAL_END_DATE,mirror.DATA_ACTUAL_END_DATE) as DATA_ACTUAL_END_DATE,
coalesce (deltas.ACCOUNT_RK,mirror.ACCOUNT_RK) as ACCOUNT_RK,
coalesce (deltas.ACCOUNT_NUMBER,mirror.ACCOUNT_NUMBER) as ACCOUNT_NUMBER,
coalesce (deltas.CHAR_TYPE,mirror.CHAR_TYPE) as CHAR_TYPE,
coalesce (deltas.CURRENCY_RK,mirror.CURRENCY_RK) as CURRENCY_RK,
coalesce (deltas.CURRENCY_CODE,mirror.CURRENCY_CODE) as CURRENCY_CODE,
coalesce (deltas.CLIENT_ID,mirror.CLIENT_ID) as CLIENT_ID,
coalesce (deltas.BRANCH_ID,mirror.BRANCH_ID) as BRANCH_ID,
coalesce (deltas.OPEN_IN_INTERNET,mirror.OPEN_IN_INTERNET) as OPEN_IN_INTERNET
from mirror
full outer join deltas on
deltas.ACCOUNT_RK=mirror.ACCOUNT_RK
""")
# Here I'm using my code above to fill .load() with the found files
df_deltas = (spark.read.format("csv").schema(schema).option("header", "true")
             .option("delimiter", ";").load(test))
How can I make a loop that processes the first found file, then the second, and so on?
Answers:
You can use a for loop to do that:

for idx, file in enumerate(test):
    globals()[f"df_{idx}"] = (spark.read.format("csv").schema(schema)
                              .option("header", "true")
                              .option("delimiter", ";").load(file))

This will create DataFrames in the global namespace named df_0 for the first file, df_1 for the second, and so on. You can then use those DataFrames to do whatever you want.
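Since the question is really about applying the full outer join per file, an alternative to creating separately named DataFrames is to fold each delta file into the running mirror inside the loop. Here is a plain-Python sketch of that merge logic (no Spark needed, so the column names and toy data are illustrative): each "table" is a dict keyed by ACCOUNT_RK, and the coalesce rule prefers the delta's value when it is not None, just like the coalesce(deltas.X, mirror.X) columns in the SQL above.

```python
# Plain-Python sketch of the per-file merge loop (no Spark required).
# Each "table" is a dict keyed by ACCOUNT_RK; values are column dicts.

def full_outer_merge(mirror, deltas):
    """Full outer join on the key, coalescing delta columns over mirror."""
    merged = {}
    for key in set(mirror) | set(deltas):   # union of keys = full outer join
        d = deltas.get(key, {})
        m = mirror.get(key, {})
        # coalesce: take the delta value if it is not None, else the mirror value
        merged[key] = {c: d.get(c) if d.get(c) is not None else m.get(c)
                       for c in set(d) | set(m)}
    return merged

mirror = {1: {"ACCOUNT_NUMBER": "A1", "CLIENT_ID": 10},
          2: {"ACCOUNT_NUMBER": "A2", "CLIENT_ID": 20}}

delta_files = [  # stand-ins for the files found by glob, in sorted order
    {2: {"ACCOUNT_NUMBER": "A2-upd", "CLIENT_ID": None}},
    {3: {"ACCOUNT_NUMBER": "A3", "CLIENT_ID": 30}},
]

for deltas in delta_files:          # one iteration per found file
    mirror = full_outer_merge(mirror, deltas)

print(mirror[2]["ACCOUNT_NUMBER"])  # A2-upd : delta value wins
print(mirror[2]["CLIENT_ID"])       # 20 : delta was None, mirror kept
print(mirror[3]["ACCOUNT_NUMBER"])  # A3 : new row from the second file
```

In PySpark the same shape would be: loop over the sorted file list, load each file into df_deltas, run the full-outer-join SQL against the current mirror view, and re-register the result as "mirror" before the next iteration.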