apache-spark-sql

Splitting a text file based on empty lines in Spark

Question: I am working on a really big file, a very large text document, almost 2 GB. Something like this:

#*MOSFET table look-up models for circuit simulation
#t1984
#cIntegration, the VLSI Journal
#index1
#*The verification of the protection mechanisms of high-level language machines
…
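One commonly suggested technique for this (a sketch, not necessarily the answer given here) is to set Hadoop's record delimiter to a blank line so that each paragraph arrives as a single record; the path citations.txt is hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Use a blank line ("\n\n") as the record delimiter instead of "\n",
# so each paragraph of the file becomes one RDD element.
records = sc.newAPIHadoopFile(
    "citations.txt",  # hypothetical path
    "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "org.apache.hadoop.io.Text",
    conf={"textinputformat.record.delimiter": "\n\n"},
).map(lambda kv: kv[1])  # drop the byte-offset key, keep the text block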

Total answers: 1

Pyspark create sliding windows from rows with padding

Question: I'm trying to collect groups of rows into sliding windows represented as vectors. Given the example input:

+---+-----+-----+
| id|Label|group|
+---+-----+-----+
|  A|    T|    1|
|  B|    T|    1|
|  C|    F|    2|
|  D|    F|    2|
|  E|    F|    3|
|  F|    T|    3|
…
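One way to approach this (a sketch, assuming a dataframe df with the columns shown and a window of three rows; the padding at group edges would still need to be appended separately) is collect_list over a row-based window frame:

from pyspark.sql import Window
from pyspark.sql import functions as F

# Collect a frame of one row before through one row after into a list;
# rows at the group edges simply get shorter lists, so padding to a
# fixed length has to be added afterwards.
w = Window.partitionBy("group").orderBy("id").rowsBetween(-1, 1)
windowed = df.withColumn("window", F.collect_list("Label").over(w))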

Total answers: 2

Comparing two values in a structfield of a column in pyspark

Question: I have a column where each row is a StructField. I want to get the max of two values in the StructField. I tried this:

trends_df = trends_df.withColumn("importance_score", max(col("avg_total")["max"]["agg_importance"], col("avg_total")["min"]["agg_importance"], key=max_key))

But it throws this error:

ValueError: Cannot convert column into bool: please use '&' …
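The error comes from calling Python's builtin max on Column objects; Spark's row-wise comparison function is F.greatest, so a likely fix (a sketch using the field names from the snippet) is:

from pyspark.sql import functions as F

# F.greatest compares the two struct fields within each row,
# instead of trying to evaluate the columns as Python booleans.
trends_df = trends_df.withColumn(
    "importance_score",
    F.greatest(
        F.col("avg_total")["max"]["agg_importance"],
        F.col("avg_total")["min"]["agg_importance"],
    ),
)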

Total answers: 1

How to unstack a column to create multiple columns out of it in pyspark?

Question: I have a csv file which contains data in the format below:

row_num  classes
1        0:0.2,1:0.3,2:0.5
2        0:0.1,1:0.5:2:0.4
3        0:0.4,1:0.5:2:0.1
4        0:0.2,1:0.4:2:0.4

I want it to be converted as follows:

row_num  class_0  class_1  class_2
1        0.2      0.3      0.5
2        0.1      0.5      0.4
…
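One possible approach (a sketch, assuming the pairs are consistently delimited by , and : and the keys are 0 through 2) is to parse the string into a map with str_to_map and project one column per key:

from pyspark.sql import functions as F

# str_to_map turns "0:0.2,1:0.3,2:0.5" into {"0": "0.2", "1": "0.3", ...};
# one column is then projected per expected key.
parsed = df.withColumn("m", F.expr("str_to_map(classes, ',', ':')"))
unstacked = parsed.select(
    "row_num",
    *[parsed["m"][str(i)].cast("double").alias(f"class_{i}") for i in range(3)],
)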

Total answers: 2

Iterate through each column and find the max length

Question: I want to get the maximum length of each column from a pyspark dataframe. Following is the sample dataframe:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

data2 = [("James", "", "Smith", "36636", "M", 3000),
    ("Michael", "Rose", "", "40288", "M", 4000),
    ("Robert", "", "Williams", "42114", "M", 4000),
    ("Maria", "Anne", "Jones", "39192", "F", 4000),
    ("Jen", "Mary", "Brown", "", "F", -1)]

schema = StructType([
    StructField("firstname", StringType(), True),
    StructField("middlename", StringType(), True),
    StructField("lastname", StringType(), True),
    StructField("id", StringType(), True),
    StructField("gender", StringType(), True), …
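A compact way to do this (a sketch against the sample dataframe, with an explicit string cast so numeric columns are measured too) is to apply length and max to every column in one select:

from pyspark.sql import functions as F

df = spark.createDataFrame(data=data2, schema=schema)

# Measure every value as a string, then take the per-column maximum.
df.select(
    [F.max(F.length(F.col(c).cast("string"))).alias(c) for c in df.columns]
).show()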

Total answers: 1

How to use Round Function with groupBy in Pyspark?

Question: How can we use the round function with groupBy in pyspark? I have a spark dataframe from which I need to generate a result using groupBy and round:

data1 = [{'Name': 'Jhon', 'ID': 21.528, 'Add': 'USA', 'ID_2': '30.90'},
    {'Name': 'Joe', 'ID': 3.69, 'Add': 'USA', 'ID_2': '12.80'},
    {'Name': 'Tina', 'ID': 2.48, 'Add': 'IND', 'ID_2': '11.07'},
    {'Name': 'Jhon', 'ID': 22.22, 'Add': 'USA', 'ID_2': '34.87'},
    {'Name': 'Joe', 'ID': 5.33, 'Add': 'INA', 'ID_2': '56.89'}]

a = sc.parallelize(data1)

In SQL …
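The usual pattern is to aggregate first and round the result; a sketch, assuming the goal is, say, the rounded sum of ID per Add value:

from pyspark.sql import functions as F

df = spark.createDataFrame(data1)

# Round the aggregate, not the raw column: group first, then sum and round.
df.groupBy("Add").agg(F.round(F.sum("ID"), 2).alias("ID_sum")).show()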

Total answers: 1

pyspark createDataframe typeerror: structtype can not accept object 'id' in type <class 'str'>

Question: An API call is returning a dict response similar to the output below:

{'Account': {'id': 123, 'externalIdentifier': None, 'name': 'test acct', 'accountNumber': None, 'Rep': None, 'organizationId': 123, 'streetAddress': '123 Main Road', 'streetAddressCity': 'Town City', 'streetAddressState': 'Texas', 'streetAddressZipCode': '76123', 'contact': [{'id': 10001, …
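That TypeError typically means a bare dict was passed to createDataFrame, which then iterates over the dict's keys (strings such as 'id'); wrapping the record in a list is the usual fix. A sketch, with response standing in for the hypothetical API payload:

# createDataFrame expects an iterable of rows; iterating a bare dict yields
# its keys (strings such as 'id'), which is what triggers the TypeError.
account = response["Account"]          # response: the hypothetical API payload
df = spark.createDataFrame([account])  # wrap the single record in a list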

Total answers: 1

PySpark: create column based on value and dictionary in columns

Question: I have a PySpark dataframe with values and dictionaries that provide a textual mapping for the values. Not every row has the same dictionary, and the values can vary too.

| value | dict |
| ----- | ---- |
| 1 | {"1": …
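If the dict column is parsed into a MapType, the label can be looked up per row with a dynamic key; a sketch, assuming df has the columns shown and dict holds a JSON object keyed by the stringified value:

from pyspark.sql import functions as F
from pyspark.sql import types as T

# Parse the JSON text into a map column, then index that map with the
# row's own value (cast to string to match the map's keys).
parsed = df.withColumn(
    "mapping",
    F.from_json("dict", T.MapType(T.StringType(), T.StringType())),
)
result = parsed.withColumn("label", parsed["mapping"][F.col("value").cast("string")])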

Total answers: 2