apache-spark

Back-ticks in DataFrame.colRegex?

Back-ticks in DataFrame.colRegex? Question: For PySpark, I find back-ticks enclosing regular expressions for DataFrame.colRegex() here, here, and in this SO question. Here is the example from the DataFrame.colRegex docstring:

df = spark.createDataFrame([("a", 1), ("b", 2), ("c", 3)], ["Col1", "Col2"])
df.select(df.colRegex("`(Col1)?+.+`")).show()
+----+
|Col2|
+----+
|   1|
|   2|
|   3|
+----+

The answer to the …

Total answers: 1
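The back-ticks delimit the regex for Spark's parser; the pattern itself, `(Col1)?+.+`, relies on Java's possessive quantifier `?+`, which consumes a leading "Col1" and never backtracks, so "Col1" itself is left with nothing for `.+` and is excluded. A minimal pure-Python sketch emulating that semantics (the helper name is mine, not Spark's):

```python
import re

def matches_colregex(name: str) -> bool:
    """Emulate the Java possessive pattern (Col1)?+.+ on a column name.

    (Col1)?+ consumes a leading "Col1" without backtracking; .+ then
    requires at least one remaining character, so "Col1" itself fails.
    """
    rest = name[4:] if name.startswith("Col1") else name
    return re.fullmatch(r".+", rest) is not None

# "Col1" is fully consumed by the possessive group, leaving nothing
# for .+, so only "Col2" survives.
print([c for c in ["Col1", "Col2"] if matches_colregex(c)])  # → ['Col2']
```

In Spark itself the same selection is the docstring's `df.select(df.colRegex("`(Col1)?+.+`"))`.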

PySpark loading from MySQL ends up loading the entire table?

PySpark loading from MySQL ends up loading the entire table? Question: I am quite new to PySpark (or Spark in general). I am trying to connect Spark with a MySQL instance I have running on RDS. When I load the table like so, does Spark load the entire table in memory? from pyspark.sql import SparkSession …

Total answers: 1
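The crux of this question is pushdown: with only a plain table name, Spark's JDBC reader asks MySQL for the whole table (fetched lazily, partition by partition), whereas a subquery passed as `dbtable` makes MySQL do the filtering before anything crosses the wire. A sketch of the option set, with placeholder host, credentials, and table names (the option keys are the standard Spark JDBC data-source options):

```python
# Hypothetical connection details; only the option keys are real
# Spark JDBC options.
jdbc_options = {
    "url": "jdbc:mysql://my-rds-host:3306/mydb",
    "driver": "com.mysql.cj.jdbc.Driver",
    # A subquery aliased as a table: MySQL evaluates the WHERE clause,
    # so Spark never pulls the full table into memory.
    "dbtable": "(SELECT id, amount FROM orders WHERE created > '2023-01-01') AS t",
    "user": "reader",
    "password": "secret",
}
print("pushdown subquery configured")
```

With a real session, `spark.read.format("jdbc").options(**jdbc_options).load()` would then ship the subquery to MySQL.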

How can I access data from a nested dynamic frame to properly format it in Pyspark?

How can I access data from a nested dynamic frame to properly format it in Pyspark? Question: I’ve uploaded some semi-structured data into AWS Glue using a dynamic frame. From the dynamic frame I selected just the payload element by executing the following code in a Glue notebook: df_p = df.select_fields(["payload"]) I’m trying …

Total answers: 1

PySpark: which is the default cluster manager?

What is Spark's default cluster manager Question: When using PySpark and getting the Spark session with the following statement: spark = SparkSession.builder .appName("sample-app") .getOrCreate() the app works fine, but I am unsure which cluster manager is being used with this Spark session. Is it local or standalone? I read through the docs but nowhere did I find this …

Total answers: 1

Pyspark: JSON to Pyspark dataframe

Pyspark: JSON to Pyspark dataframe Question: I want to transform this JSON into a PySpark dataframe; I have added my current code. json = { "key1": 0.75, "values": [ { "id": 2313, "val1": 350, "val2": 6000 }, { "id": 2477, "val1": 340, "val2": 6500 } ] } my code: I can get the expected output using …

Total answers: 2
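The shape being asked for pairs the top-level scalar "key1" with every element of the "values" array. A pure-Python sketch of that flattening (in PySpark the same result usually comes from explode() on the array column):

```python
# The question's JSON, verbatim.
data = {
    "key1": 0.75,
    "values": [
        {"id": 2313, "val1": 350, "val2": 6000},
        {"id": 2477, "val1": 340, "val2": 6500},
    ],
}

# One output row per element of "values", each carrying key1 along.
rows = [{"key1": data["key1"], **v} for v in data["values"]]
print(rows[0])  # → {'key1': 0.75, 'id': 2313, 'val1': 350, 'val2': 6000}
```

Feeding `rows` to `spark.createDataFrame(rows)` would then give the flat DataFrame.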

Dataframe column name with $$ failing in filter condition with parse error

Dataframe column name with $$ failing in filter condition with parse error Question: I have a dataframe with column names "lastname$$" and "firstname$$":

+-----------+----------+----------+------------------+-----+------+
|firstname$$|middlename|lastname$$|languages         |state|gender|
+-----------+----------+----------+------------------+-----+------+
|James      |          |Smith     |[Java, Scala, C++]|OH   |M     |
|Anna       |Rose      |          |[Spark, Java, C++]|NY   |F     |
|Julia      |          |Williams  |[CSharp, VB]      |OH   |F     |
|Maria      |Anne      |Jones     |[CSharp, …

Total answers: 2
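The parse error comes from Spark's SQL expression parser treating `$$` specially in a string filter; back-tick quoting the identifier avoids it. A small sketch of building such an expression (the helper is mine, not a Spark API):

```python
def quote_col(name: str) -> str:
    """Wrap a column name in back-ticks so Spark's SQL parser treats
    characters like $$ literally, escaping any embedded back-ticks."""
    return "`" + name.replace("`", "``") + "`"

# A string filter expression built this way parses cleanly in Spark:
expr = f"{quote_col('lastname$$')} == 'Smith'"
print(expr)  # → `lastname$$` == 'Smith'
```

In the DataFrame API, `col("lastname$$")` typically needs no quoting at all; the back-ticks matter when the name appears inside an SQL expression string such as `df.filter(expr)`.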

Databricks: Issue while creating spark data frame from pandas

Databricks: Issue while creating spark data frame from pandas Question: I have a pandas data frame which I want to convert into a Spark data frame. Usually I use the code below to create a Spark data frame from pandas, but all of a sudden I started getting the error below. I am aware that pandas has …

Total answers: 2

Pyspark 3.3.0 dataframe show data but writing CSV creates empty file

Pyspark 3.3.0 dataframe show data but writing CSV creates empty file Question: Facing a very unusual issue: the dataframe shows data when df.show() is run; however, when writing it as CSV the operation completes without error but writes a 0-byte empty file. Is this a bug? Is there something missing? --pyspark version ____ __ / …

Total answers: 1

Need to add sequential numbering as per the grouping in Pyspark

Need to add sequential numbering as per the grouping in Pyspark Question: I am working on code where I need to add a sequential number per group, based on column A & column B. Below is the table/dataframe I have. The data is sorted by colA & Date. colA colB Date …

Total answers: 1
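The usual PySpark answer is `row_number()` over `Window.partitionBy("colA", "colB").orderBy("Date")`. A pure-Python emulation of that window, on made-up sample rows, shows the intended numbering:

```python
from itertools import groupby

# Hypothetical sample rows standing in for the question's table.
rows = [
    {"colA": "x", "colB": "p", "Date": "2023-01-01"},
    {"colA": "x", "colB": "p", "Date": "2023-01-02"},
    {"colA": "x", "colB": "q", "Date": "2023-01-01"},
]

# Sort by the partition keys plus the ordering column, then number
# each group from 1 -- exactly what row_number() over the window does.
rows.sort(key=lambda r: (r["colA"], r["colB"], r["Date"]))
for _, grp in groupby(rows, key=lambda r: (r["colA"], r["colB"])):
    for i, r in enumerate(grp, start=1):
        r["seq"] = i

print([r["seq"] for r in rows])  # → [1, 2, 1]
```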

Error in defining pyspark datastructure variables with a for loop

Error in defining pyspark datastructure variables with a for loop Question: I would like to define a set of pyspark features as run-time variables (features). I tried the below; it throws an error. Could you please help with this? colNames = ['colA', 'colB', 'colC', 'colD', 'colE'] tsfresh_feature_set = StructType( [ StructField('field1', StringType(), True), …

Total answers: 1
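The pattern being attempted is usually written as a list comprehension over the name list rather than a loop that mutates the StructType. A pure-Python sketch of that construction, using (name, type, nullable) tuples as stand-ins for pyspark's StructField arguments:

```python
colNames = ['colA', 'colB', 'colC', 'colD', 'colE']

# Stand-ins for StructField(name, <type>(), True); with pyspark
# available, wrap the list in StructType and replace each tuple with
# StructField(name, DoubleType(), True).
fields = [("field1", "string", True)] + [
    (name, "double", True) for name in colNames
]

print(len(fields))  # → 6
```

The key point is that the full field list is built first, then handed to the schema constructor once, instead of trying to redefine the StructType inside the loop.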