pyspark

Spark SQL – Pivot and concatenation

Spark SQL – Pivot and concatenation Question: I am working with Spark SQL and have a requirement to pivot and concatenate the data. My input data looks like: ID Quantity Location 1 10 US 2 20 UK 2 5 CA 2 20 US 3 15 US 3 20 CA 4 25 US 4 10 CA …

Total answers: 1
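
For the pivot-and-concatenate question above, a minimal sketch of one possible approach (the column names ID, Quantity, Location come from the excerpt; the expected output is not shown, so concatenating duplicate quantities per cell is an assumption):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, 10, "US"), (2, 20, "UK"), (2, 5, "CA"), (2, 20, "US"),
     (3, 15, "US"), (3, 20, "CA"), (4, 25, "US"), (4, 10, "CA")],
    ["ID", "Quantity", "Location"],
)

# Concatenate duplicate quantities per (ID, Location), then pivot Location
# into columns with the single concatenated value per cell.
per_cell = (
    df.groupBy("ID", "Location")
      .agg(F.concat_ws(",", F.collect_list(F.col("Quantity").cast("string"))).alias("qty"))
)
pivoted = per_cell.groupBy("ID").pivot("Location").agg(F.first("qty"))
pivoted.show()

If the requirement is to sum rather than concatenate, the concat_ws/collect_list expression can be replaced with F.sum("Quantity").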

Compute maximum number of consecutive identical integers in array column

Compute maximum number of consecutive identical integers in array column Question: Consider the following: df = spark.createDataFrame([ [0, [1, 1, 4, 4, 4]], [1, [3, 2, 2, -4]], [2, [1, 1, 5, 5]], [3, [-1, -9, -9, -9, -9]]] , ['id', 'array_col'] ) df.show() ''' +---+--------------------+ | id| array_col| +---+--------------------+ | 0| [1, 1, 4, …

Total answers: 2
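
One way to compute the longest run of identical values, sketched with the higher-order aggregate function (Spark 3.1+): the accumulator carries the previous element, the current run length, and the best run seen so far.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [[0, [1, 1, 4, 4, 4]], [1, [3, 2, 2, -4]], [2, [1, 1, 5, 5]], [3, [-1, -9, -9, -9, -9]]],
    ["id", "array_col"],
)

# Fold over the array, carrying (previous element, current run length, best run so far).
zero = F.struct(
    F.lit(None).cast("long").alias("prev"),
    F.lit(0).alias("run"),
    F.lit(0).alias("best"),
)

def step(acc, x):
    run = F.when(acc["prev"] == x, acc["run"] + 1).otherwise(F.lit(1))
    return F.struct(
        x.alias("prev"),
        run.alias("run"),
        F.greatest(acc["best"], run).alias("best"),
    )

result = df.withColumn("max_run", F.aggregate("array_col", zero, step, lambda acc: acc["best"]))
result.show()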

How to Group by Conditional aggregation of adjacent rows In PySpark

How to Group by Conditional aggregation of adjacent rows In PySpark Question: I am facing an issue when doing conditional grouping in a Spark dataframe. Below is a complete example. I have a dataframe which has been sorted by user and by time: activity location user 0 watch movie house A 1 sleep house A 2 cardio gym …

Total answers: 1
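
A common pattern for this kind of adjacent-row grouping, sketched under the assumption that consecutive rows of a user with the same location belong together (the column names come from the excerpt; the ordering column idx is hypothetical): flag each change with lag, then turn the cumulative sum of flags into a block id.

from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical data; "idx" stands in for whatever column defines the time order.
df = spark.createDataFrame(
    [(0, "watch movie", "house", "A"), (1, "sleep", "house", "A"),
     (2, "cardio", "gym", "A"), (3, "stretch", "gym", "A"),
     (4, "cook", "house", "A")],
    ["idx", "activity", "location", "user"],
)

w = Window.partitionBy("user").orderBy("idx")

grouped = (
    df.withColumn("prev_location", F.lag("location").over(w))
      # a new block starts whenever the location differs from the previous row
      .withColumn("new_block", (F.col("location") != F.col("prev_location")).cast("int"))
      .fillna({"new_block": 1})
      .withColumn("block_id", F.sum("new_block").over(w))
      .groupBy("user", "block_id", "location")
      .agg(F.collect_list("activity").alias("activities"))
)
grouped.show(truncate=False)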

Back-ticks in DataFrame.colRegex?

Back-ticks in DataFrame.colRegex? Question: For PySpark, I find back-ticks enclosing regular expressions for DataFrame.colRegex() here, here, and in this SO question. Here is the example from the DataFrame.colRegex doc string: df = spark.createDataFrame([("a", 1), ("b", 2), ("c", 3)], ["Col1", "Col2"]) df.select(df.colRegex("`(Col1)?+.+`")).show() +----+ |Col2| +----+ | 1| | 2| | 3| +----+ The answer to the …

Total answers: 1
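
For reference, a runnable version of the doc-string example quoted in the excerpt; the regular expression is passed wrapped in back-ticks, exactly as in the doc string, and colRegex selects every column whose name matches it.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", 1), ("b", 2), ("c", 3)], ["Col1", "Col2"])

# Selects all columns except Col1, i.e. only Col2 here.
df.select(df.colRegex("`(Col1)?+.+`")).show()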

PySpark loading from MySQL ends up loading the entire table?

PySpark loading from MySQL ends up loading the entire table? Question: I am quite new to PySpark (and Spark in general). I am trying to connect Spark to a MySQL instance I have running on RDS. When I load the table like so, does Spark load the entire table into memory? from pyspark.sql import SparkSession …

Total answers: 1
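
A sketch of the kind of JDBC read the question describes (the host, database, table, and credentials below are placeholders, and the MySQL JDBC driver is assumed to be on the classpath). The load itself is lazy, and simple filters can be pushed down into the query Spark sends to MySQL rather than pulling the whole table first.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder connection details.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://my-rds-host:3306/mydb")
    .option("dbtable", "orders")
    .option("user", "user")
    .option("password", "password")
    .load()
)

# Nothing is read from MySQL until an action runs; simple filters like this one
# can be pushed down into the query Spark issues against the table.
df.filter("order_date >= '2023-01-01'").show()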

How can I access data from a nested dynamic frame to properly format it in Pyspark?

How can I access data from a nested dynamic frame to properly format it in Pyspark? Question: I've uploaded some semi-structured data into AWS Glue using a dynamic frame. From the dynamic frame I just want the payload element, which I selected by executing the following code in a Glue notebook: df_p = df.select_fields(["payload"]) I'm trying …

Total answers: 1
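
Once the payload field is selected, one option is to convert the dynamic frame to a Spark DataFrame with toDF() and flatten the nested struct. A small stand-alone sketch of that flattening step (the nested field names device and reading are hypothetical):

from pyspark.sql import SparkSession, Row
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Stand-in for df_p.toDF(); the nested field names are made up for illustration.
nested = spark.createDataFrame([Row(payload=Row(device="sensor-1", reading=21.5))])

# Flatten the struct into top-level columns.
flat = nested.select(
    F.col("payload.device").alias("device"),
    F.col("payload.reading").alias("reading"),
)
flat.show()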

I want to sum dates in a loop 13 times using PySpark

I want to sum dates in a loop 13 times using PySpark Question: Please help me to solve this issue, as I am still new to Python/PySpark. I want to loop 13 times, adding multiples of 7 days to the dates in the same column. I have a master table like …

Total answers: 1
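
A minimal sketch of the looping approach, assuming the goal is to add 7, 14, …, 91 days to a date column (the table and column names are hypothetical); if the results should land in rows rather than columns, exploding a sequence would do the same job.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical master table with a start_date column.
df = spark.createDataFrame([("2023-01-01",), ("2023-02-15",)], ["start_date"])
df = df.withColumn("start_date", F.to_date("start_date"))

# Loop 13 times, each pass adding another multiple of 7 days.
for i in range(1, 14):
    df = df.withColumn(f"week_{i}", F.date_add("start_date", i * 7))

df.show()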

Python lambda to pyspark

Python lambda to pyspark Question: I have this Python code written in pandas and I need to write the same in PySpark: Source_df_write['default_flag1']=Source_df_write.apply(lambda x: 'T' if ((x['A']==1) or (x['crr'] in ('sss','tttt')) or (x['reg']=='T')) else 'F', axis=1) Asked By: Hala El Henawy || Source Answers: You can use when and otherwise: import pyspark.sql.functions as F Source_df_write.withColumn("default_flag1", F.when( …

Total answers: 1
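
Completing the when/otherwise translation from the answer excerpt, with a small stand-in DataFrame for Source_df_write (the sample values are made up):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Stand-in for Source_df_write with the columns the lambda references.
Source_df_write = spark.createDataFrame(
    [(1, "abc", "F"), (0, "sss", "F"), (0, "xyz", "T"), (0, "xyz", "F")],
    ["A", "crr", "reg"],
)

# Same condition as the pandas lambda: 'T' if any of the three checks holds, else 'F'.
Source_df_write = Source_df_write.withColumn(
    "default_flag1",
    F.when(
        (F.col("A") == 1) | (F.col("crr").isin("sss", "tttt")) | (F.col("reg") == "T"),
        F.lit("T"),
    ).otherwise(F.lit("F")),
)
Source_df_write.show()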

PySpark: which is the default cluster manager?

PySpark: which is the default cluster manager? Question: When using PySpark and getting the Spark session using the following statement: spark = SparkSession.builder .appName("sample-app") .getOrCreate() the app works fine, but I am unsure which cluster manager is being used with this Spark session. Is it local or standalone? I read through the docs but nowhere did I find this …

Total answers: 1
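
One way to see which master a session actually resolved to is to inspect the SparkContext after the session is created. The master("local[*]") line below is an explicit choice for illustration; without it, the value comes from however the script was launched (spark-submit, the pyspark shell, etc.).

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("sample-app")
    .master("local[*]")   # explicit here; otherwise taken from the launcher's configuration
    .getOrCreate()
)

# Both report the master the session is actually using, e.g. "local[*]".
print(spark.sparkContext.master)
print(spark.conf.get("spark.master"))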

Pyspark: JSON to Pyspark dataframe

Pyspark: JSON to Pyspark dataframe Question: I want to transform this JSON to a PySpark dataframe; I have added my current code below. json = { "key1": 0.75, "values":[ { "id": 2313, "val1": 350, "val2": 6000 }, { "id": 2477, "val1": 340, "val2": 6500 } ] } my code: I can get the expected output using …

Total answers: 2
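
A sketch of one way to go from the dictionary in the question to one row per element of values (assuming the expected output keeps key1 alongside the exploded fields):

import json
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

data = {
    "key1": 0.75,
    "values": [
        {"id": 2313, "val1": 350, "val2": 6000},
        {"id": 2477, "val1": 340, "val2": 6500},
    ],
}

# Read the dict as a single JSON record, then explode "values" into one row per element.
df = spark.read.json(spark.sparkContext.parallelize([json.dumps(data)]))
result = (
    df.withColumn("v", F.explode("values"))
      .select("key1", "v.id", "v.val1", "v.val2")
)
result.show()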