apache-spark

How to coalesce multiple pyspark arrays?

How to coalesce multiple pyspark arrays? Question: I have an arbitrary number of arrays of equal length in a PySpark DataFrame. I need to coalesce these, element by element, into a single list. The problem with coalesce is that it doesn’t work by element, but rather selects the entire first non-null array. Any suggestions for …
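
A minimal sketch of one way to do this element by element, assuming Spark 3.1+ and made-up columns a, b, c: zip the arrays positionally with arrays_zip, then coalesce the fields of each zipped struct.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [([1, None, 3], [None, 2, None], [9, 9, 9])],
    ["a", "b", "c"],
)

# zip the arrays positionally, then take the first non-null value at each position
result = df.withColumn(
    "merged",
    F.transform(
        F.arrays_zip("a", "b", "c"),
        lambda s: F.coalesce(s["a"], s["b"], s["c"]),
    ),
)
result.show(truncate=False)  # merged = [1, 2, 3]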

Total answers: 3

Further explode on string datatype pyspark

Further explode on string datatype pyspark Question: I have a df with a column called data. In the data column we can expect either a single value or a list of values per identifier_filed column; lists are shown with [ ] brackets under the data column. For example, Allegren under the values column can have different data types, …
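
A minimal sketch under the assumption that data is a string column holding either a plain value or a bracketed list (the sample values are made up): strip the brackets, split on commas, and explode into one row per item.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("Allegren", "[aspirin, ibuprofen]"), ("Other", "single_value")],
    ["identifier_filed", "data"],
)

exploded = (
    df.withColumn("data_clean", F.regexp_replace("data", r"[\[\]]", ""))      # drop the [ ] brackets
      .withColumn("data_item", F.explode(F.split("data_clean", r"\s*,\s*")))  # one row per element
      .drop("data_clean")
)
exploded.show(truncate=False)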

Total answers: 2

How to change the schema of the spark dataframe

How to change the schema of the spark dataframe Question: I am reading a JSON file with spark.read.json, and it automatically gives me a dataframe with an inferred schema, but is it possible to change the schema of the existing Dataframe to the one below?
schema = StructType([
    StructField("_links", MapType(StringType(), MapType(StringType(), StringType()))),
    StructField("identifier", StringType()),
    StructField("enabled", BooleanType()),
    StructField("family", StringType()),
    StructField("categories", …
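
A minimal sketch of the usual alternative: instead of changing an already-inferred DataFrame, pass the target schema to the reader (the file path is hypothetical; the fields follow the question).

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, MapType, StringType, BooleanType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("_links", MapType(StringType(), MapType(StringType(), StringType()))),
    StructField("identifier", StringType()),
    StructField("enabled", BooleanType()),
    StructField("family", StringType()),
])

# enforce the schema at read time instead of relying on inference
df = spark.read.json("/path/to/file.json", schema=schema)  # hypothetical path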

Total answers: 1

Dealing with very small static tables in pySpark

Dealing with very small static tables in pySpark Question: I am currently using Databricks to process data coming from our Azure Data Lake. The majority of the data is read into pySpark dataframes as relatively big datasets. However, I do have to perform some joins on smaller static tables to fetch additional attributes. Currently, …
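
A minimal sketch of the pattern usually suggested here: broadcast the small static table so the join avoids shuffling the large dataset (all names and data below are made up).

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# stand-ins for the large data-lake dataframe and the small static lookup
large_df = spark.createDataFrame([(1, 10.0), (2, 20.0), (3, 30.0)], ["key", "value"])
small_df = spark.createDataFrame([(1, "attr_a"), (2, "attr_b")], ["key", "attribute"])

# broadcast ships the small table to every executor, avoiding a shuffle join
joined = large_df.join(F.broadcast(small_df), on="key", how="left")
joined.show()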

Total answers: 2

Convert python pandas iterator and string concat into pyspark

Convert python pandas iterator and string concat into pyspark Question: I am attempting to move a process from Pandas into Pyspark, but I am a complete novice in the latter. Note: This is an EDA process, so I am not too worried about having it as a loop for now; I can optimise that at …
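
A minimal sketch of the typical translation, assuming the pandas loop concatenates strings per group (column names are made up): groupBy with collect_list and concat_ws replaces the row-wise iteration.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("a", "x"), ("a", "y"), ("b", "z")],
    ["group_col", "text_col"],
)

# gather each group's strings into a list, then join them into one string
result = df.groupBy("group_col").agg(
    F.concat_ws(", ", F.collect_list("text_col")).alias("concatenated")
)
result.show(truncate=False)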

Total answers: 1

Splitting a text file based on empty lines in Spark

Splitting a text file based on empty lines in Spark Question: I am working on a very large text file, almost 2 GB. Something like this:
#*MOSFET table look-up models for circuit simulation
#t1984
#cIntegration, the VLSI Journal
#index1
#*The verification of the protection mechanisms of high-level language machines …
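
A minimal sketch of one common approach: read the file with a custom Hadoop record delimiter so each blank-line-separated block becomes one record (the file path is hypothetical).

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

records = sc.newAPIHadoopFile(
    "/path/to/citations.txt",                                  # hypothetical path
    "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "org.apache.hadoop.io.Text",
    conf={"textinputformat.record.delimiter": "\n\n"},         # split records on empty lines
).map(lambda kv: kv[1])                                        # keep only the text of each block

df = records.map(lambda block: (block,)).toDF(["record"])
df.show(truncate=False)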

Total answers: 1

Pyspark create sliding windows from rows with padding

Pyspark create sliding windows from rows with padding Question: I’m trying to collect groups of rows into sliding windows represented as vectors. Given the example input:
+---+-----+-----+
| id|Label|group|
+---+-----+-----+
| A| T| 1|
| B| T| 1|
| C| F| 2|
| D| F| 2|
| E| F| 3|
| F| T| 3|
| …
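
A minimal sketch of one way to build such windows, assuming a window of size 3 ordered by id and a made-up "PAD" value at the edges: lag and lead gather the neighbouring labels into an array.

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("A", "T", 1), ("B", "T", 1), ("C", "F", 2),
     ("D", "F", 2), ("E", "F", 3), ("F", "T", 3)],
    ["id", "Label", "group"],
)

w = Window.orderBy("id")
windowed = df.withColumn(
    "window",
    F.array(
        F.coalesce(F.lag("Label", 1).over(w), F.lit("PAD")),   # pad before the first row
        F.col("Label"),
        F.coalesce(F.lead("Label", 1).over(w), F.lit("PAD")),  # pad after the last row
    ),
)
windowed.show(truncate=False)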

Total answers: 2

Comparing two values in a structfield of a column in pyspark

Comparing two values in a structfield of a column in pyspark Question: I have a Column where each row is a StructField. I want to get the max of two values in the StructField. I tried this:
trends_df = trends_df.withColumn("importance_score", max(col("avg_total")["max"]["agg_importance"], col("avg_total")["min"]["agg_importance"], key=max_key))
But it throws this error: ValueError: Cannot convert column into bool: please use '&' …
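
A minimal sketch of the likely fix: Python's built-in max() cannot compare Columns, which is what raises "Cannot convert column into bool"; the column-level equivalent is F.greatest. The sample row below is made up to match the question's field names.

from pyspark.sql import SparkSession, Row, functions as F

spark = SparkSession.builder.getOrCreate()
trends_df = spark.createDataFrame([
    Row(avg_total=Row(max=Row(agg_importance=0.8), min=Row(agg_importance=0.3))),
])

# greatest() compares the two struct fields row by row
trends_df = trends_df.withColumn(
    "importance_score",
    F.greatest(
        F.col("avg_total")["max"]["agg_importance"],
        F.col("avg_total")["min"]["agg_importance"],
    ),
)
trends_df.show(truncate=False)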

Total answers: 1

PySpark executing queries from different processes

PySpark executing queries from different processes Question: Is there any way to have two separate processes executing queries on Spark? Something like:
def process_1():
    spark_context = SparkSession.builder.getOrCreate()
    data1 = spark_context.sql("SELECT * FROM table 1").toPandas()
    do_processing(data1)

def process_2():
    spark_context = SparkSession.builder.getOrCreate()
    data1 = spark_context.sql("SELECT * FROM table 2").toPandas()
    do_processing(data1)

p1 = Process(target=process_1)
p1.start()
p2 = Process(target=process_2) …
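
A minimal sketch of the usual workaround: run the queries from threads inside the same driver process so both share one SparkSession (table_1 and table_2 are hypothetical names; how much actually runs in parallel depends on the cluster's scheduling).

from concurrent.futures import ThreadPoolExecutor
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def run_query(query):
    # each thread submits its own job through the shared session
    return spark.sql(query).toPandas()

with ThreadPoolExecutor(max_workers=2) as pool:
    f1 = pool.submit(run_query, "SELECT * FROM table_1")
    f2 = pool.submit(run_query, "SELECT * FROM table_2")
    data1, data2 = f1.result(), f2.result()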

Total answers: 1