apache-spark

explode a pyspark column with root name intact

Question: I have a PySpark dataframe whose schema looks like this:

|-- col1: timestamp (nullable = true)
|-- col2: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- NM: string (nullable = true)

How can I explode col2 so that the final column name …

Total answers: 1
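
A minimal sketch of one common approach, assuming the schema above (the output name col2_NM for the flattened field is illustrative):

from pyspark.sql import functions as F

# Turn each array element into its own row, then promote the struct
# field while keeping the root column name as a prefix
exploded = df.withColumn("col2", F.explode("col2"))
result = exploded.select(
    "col1",
    F.col("col2.NM").alias("col2_NM"),
)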

how to make loop in pyspark

Question: I have this code:

list_files = glob.glob("/t/main_folder/*/file_*[0-9].csv")
test = sorted(list_files, key=lambda x: x[-5:])

This code has helped me find the files I need to work with; I found 5 CSV files in different folders. As the next step, I'm using the code below to work with …

Total answers: 1
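
A sketch of one way to loop over files like these, assuming all of the CSVs share the same columns (the header/inferSchema settings are illustrative):

import glob

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Find the files and sort them by their trailing characters, as in the question
list_files = sorted(glob.glob("/t/main_folder/*/file_*[0-9].csv"),
                    key=lambda x: x[-5:])

# Read each file, then fold the per-file DataFrames into one
dfs = [spark.read.csv(path, header=True, inferSchema=True) for path in list_files]
combined = dfs[0]
for df in dfs[1:]:
    combined = combined.unionByName(df)

If no per-file logic is needed, passing the glob pattern directly to spark.read.csv reads all matching files in one call.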

How to generate Pyspark dynamic frame name dynamically

Question: I have a table with data as shown in the diagram. I want to store results in dynamically generated data frame names. For example, below I want to create two differently named data frames, dnb_df and es_df, and store …

Total answers: 2
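
Dynamically named variables are usually better modeled as dictionary keys. A sketch, assuming a hypothetical source_name column that distinguishes the groups:

frames = {}
for row in df.select("source_name").distinct().collect():
    # e.g. frames["dnb_df"] and frames["es_df"]
    frames[row.source_name + "_df"] = df.filter(df.source_name == row.source_name)

dnb_df = frames["dnb_df"]
es_df = frames["es_df"]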

Flatten Map Type in Pyspark

Question: I have a dataframe as below:

+-----------+-------------+----+-------+---------------------------------------------------------------------------------+
|empId      |organization |h_cd|status |additional                                                                       |
+-----------+-------------+----+-------+---------------------------------------------------------------------------------+
|FTE:56e662f|CATENA       |0   |CURRENT|{hr_code -> 84534, bgc_val -> 170187, interviewPanel -> 6372, meetingId -> 3671}|
|FTE:633e7bc|Data Science |0   |CURRENT|{hr_code -> 21036, bgc_val -> 170187, interviewPanel -> 764, meetingId -> 577}  |
|FTE:d9badd2|CATENA       |0 …

Total answers: 4
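
If the map keys are known up front, each can be promoted to its own column. A sketch assuming the keys visible in the sample data:

from pyspark.sql import functions as F

keys = ["hr_code", "bgc_val", "interviewPanel", "meetingId"]
flat = df.select(
    "empId", "organization", "h_cd", "status",
    # getItem(k) pulls one map value out per key; missing keys become null
    *[F.col("additional").getItem(k).alias(k) for k in keys],
)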

pySpark check Dataframe contains in another Dataframe

Question: Assume I have two dataframes:

DF1: DATA1, DATA1, DATA2, DATA2
DF2: DATA2

I want to exclude every row whose value appears in DF2 while keeping duplicates in DF1. What should I do? Expected result: DATA1, DATA1 Asked By: TommyQu || Source Answers: Use a left anti join. When you join …

Total answers: 2
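
The left anti join the answer refers to, as a sketch (the single column name "value" is assumed):

# Keeps every DF1 row, duplicates included, that has no match in DF2
result = df1.join(df2, on="value", how="left_anti")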

Write multiple Avro files from pyspark to the same directory

Question: I'm trying to write out a PySpark dataframe as Avro files to the path /my/path/ on HDFS, partitioned by the column 'partition', so under /my/path/ there should be the following subdirectory structure:

partition=20230101
partition=20230102
…

Under these sub …

Total answers: 1
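
A sketch of a partitioned Avro write (this needs the external spark-avro package on the classpath; the write mode depends on whether existing files under the path should be kept):

# "append" adds new files next to whatever already exists under /my/path/
(df.write
   .format("avro")
   .partitionBy("partition")
   .mode("append")
   .save("/my/path/"))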

Delete rows from Pyspark Dataframe which match to header

Question: I have a huge dataframe similar to this:

l = [('20190503', 'par1', 'feat2', '0x0'),
     ('20190503', 'par1', 'feat3', '0x01'),
     ('date', 'part', 'feature', 'value'),
     ('20190501', 'par5', 'feat9', '0x00'),
     ('20190506', 'par8', 'feat2', '0x00f45'),
     ('date', 'part', 'feature', 'value'),
     ('20190501', 'par11', 'feat3', '0x000000000'),
     ('date', 'part', 'feature', 'value'),
     ('20190501', 'par3', 'feat9', …

Total answers: 1
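
One way to drop the repeated header rows, assuming the columns are named date, part, feature, and value as the stray rows suggest:

from pyspark.sql import functions as F

# A header row carries its own column name as the value, so filtering
# on any one column is enough
cleaned = df.filter(F.col("date") != "date")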

TypeError: col should be Column with apache spark

Question: I have this method where I am gathering positive values:

def pos_values(df, metrics):
    num_pos_values = df.where(df.ttu > 1).count()
    df.withColumn("loader_ttu_pos_value", num_pos_values)
    df.write.json(metrics)

However, I get TypeError: col should be Column whenever I go to test it. I tried to cast it, but that doesn't seem to be …

Total answers: 1
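
The likely cause is that withColumn expects a Column, not a plain Python int. A sketch of the usual fix:

from pyspark.sql import functions as F

def pos_values(df, metrics):
    num_pos_values = df.where(df.ttu > 1).count()
    # Wrap the count in F.lit(...) to make it a Column, and capture the
    # result: withColumn returns a new DataFrame rather than mutating df
    df = df.withColumn("loader_ttu_pos_value", F.lit(num_pos_values))
    df.write.json(metrics)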

Create new Data frame from an existing one in pyspark

Question: I created this dataframe with PySpark from a txt file that contains search queries and user IDs:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = (spark.read.option("header", "true")
      .option("delimiter", "\t")
      .option("inferSchema", "true")
      .csv("/content/drive/MyDrive/my_data.txt"))
df.select("AnonID", "Query").show()

And it looks like this:

+------+--------------------+
|AnonID|               Query|
+------+--------------------+
|   142|      rentdirect.com|
|   142|www.prescriptionf…|
| …

Total answers: 1
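
Since the excerpt is truncated, the intended result is unclear; as a generic sketch, a new dataframe can be derived from the existing one with a transformation such as grouping the queries per user (column names taken from the sample output):

from pyspark.sql import functions as F

# One new DataFrame per transformation: here, all queries per user
queries_per_user = (df.groupBy("AnonID")
                      .agg(F.collect_list("Query").alias("Queries")))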