pyspark

Dataframe column name with $$ failing in filter condition with parse error

Dataframe column name with $$ failing in filter condition with parse error Question: I have a dataframe with column names "lastname$$" and "firstname$$":

+-----------+----------+----------+------------------+-----+------+
|firstname$$|middlename|lastname$$|languages         |state|gender|
+-----------+----------+----------+------------------+-----+------+
|James      |          |Smith     |[Java, Scala, C++]|OH   |M     |
|Anna       |Rose      |          |[Spark, Java, C++]|NY   |F     |
|Julia      |          |Williams  |[CSharp, VB]      |OH   |F     |
|Maria      |Anne      |Jones     |[CSharp, …

Total answers: 2

Pyspark DataFrame Function

Pyspark DataFrame Function Question: The problem I was having is trying to convert the following code in Python to PySpark. I’m extremely new to PySpark but I have a column of float data and for each row I want to perform a calculation based on the floor function value of the data input into the …

Total answers: 1

Pandas to Pyspark conversion (repeat/explode)

Pandas to Pyspark conversion (repeat/explode) Question: I’m trying to take a notebook that I’ve written in Python/Pandas and modify/convert it to use Pyspark. The dataset I’m working with is (as real world datasets often are) complete and utter garbage, and so some of the things I have to do to it are potentially a little …

Total answers: 1

Pyspark 3.3.0 dataframe show data but writing CSV creates empty file

Pyspark 3.3.0 dataframe show data but writing CSV creates empty file Question: Facing a very unusual issue. The dataframe shows data if I run df.show(); however, when trying to write it as CSV, the operation completes without error but writes a 0-byte empty file. Is this a bug? Is there something missing? –pyspark version ____ __ / …

Total answers: 1

Python pyspark columns count

Python pyspark columns count Question: I have a dataset like this:

Zone 1  Zone 2
A       A
A       B
B       A
A       B
B       B

And I want this:

Category  Zone   count
A         Zone1  3
B         Zone1  2
A         Zone2  2
B         Zone2  3

I tried with a group by Zone1 & Zone2 …

Total answers: 1

Need to add sequential numbering as per the grouping in Pyspark

Need to add sequential numbering as per the grouping in Pyspark Question: I am working on code where I need to add a sequential number per group, based on column A & column B. Below is the table/dataframe I have. The data is sorted by colA & Date. colA colB Date …

Total answers: 1

Error in defining pyspark datastructure variables with a for loop

Error in defining pyspark datastructure variables with a for loop Question: I would like to define a set of pyspark features as run-time variables (features). I tried the below; it throws an error. Could you please help with this?

colNames = ['colA', 'colB', 'colC', 'colD', 'colE']
tsfresh_feature_set = StructType([
    StructField('field1', StringType(), True),
    …

Total answers: 1

How to use Pandas_UDF function in Pyspark program

How to use Pandas_UDF function in Pyspark program Question: I have a Pyspark dataframe with a million records. It has a column with a Persian date as a string, and I need to convert it to a Miladi (Gregorian) date. I tried several approaches; first I used a Python UDF, which did not have good performance. Then I wrote a UDF function in …

Total answers: 1
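The usual performance fix is to move the per-row logic into a function that operates on a whole pandas Series, then wrap it with `pandas_udf` so Spark feeds it Arrow batches instead of calling Python once per row. The converter below is a hypothetical placeholder (it only normalizes the separator); the real Jalali-to-Gregorian calendar arithmetic would replace its body:

```python
import pandas as pd

def to_miladi(dates: pd.Series) -> pd.Series:
    # Placeholder conversion, vectorized over the whole Series.
    # The real Jalali -> Gregorian calendar math would go here.
    return dates.str.replace("/", "-", regex=False)

# In the Spark job this would be wrapped and applied per column, e.g.:
#   from pyspark.sql.functions import pandas_udf
#   from pyspark.sql.types import StringType
#   df = df.withColumn("miladi", pandas_udf(to_miladi, StringType())("persian_date"))
```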

python udf iterator -> iterator giving outputted more rows error

python udf iterator -> iterator giving outputted more rows error Question: I have a dataframe with a text column CALL_TRANSCRIPT (string) and a pii_allmethods column (array of strings). I am trying to search Call_Transcripts for the strings in the array and mask them using a pyspark pandas udf, but I am getting "outputted more than input rows" errors. I tried a couple of ways to troubleshoot, but not …

Total answers: 1
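With the `Iterator[pd.Series] -> Iterator[pd.Series]` flavor of pandas_udf, Spark requires each yielded Series to have exactly the same length as the input batch it came from; yielding per-match or per-row results is what triggers the "outputted more rows" error. A pure-pandas sketch of that invariant, with a hypothetical masking rule standing in for the pii_allmethods lookup:

```python
from typing import Iterator
import pandas as pd

def mask_transcripts(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    for batch in batches:
        # Yield exactly ONE Series per input batch, same length as the batch.
        # Hypothetical rule: mask a literal token; the real job would loop
        # over the strings in pii_allmethods instead.
        yield batch.str.replace("ssn", "***", regex=False)
```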

How to get the maximum value from within a column in pyspark dataframe?

How to get the maximum value from within a column in pyspark dataframe? Question: I have a DataFrame (df_testing) with the following sample data: I need to get the max value from the Amount column. So the output DataFrame (dfnew) looks like this: I’m a newbie in pyspark, so I looped through the dataframe using …

Total answers: 1