apache-spark

Check if columns exist and if not, create and fill with NaN using PySpark

Check if columns exist and if not, create and fill with NaN using PySpark Question: I have a PySpark dataframe and a separate list of column names. I want to check whether any of the listed column names are missing from the dataframe, and if they are, create them and fill them with null …
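A minimal sketch of one common approach, assuming a hypothetical dataframe and required-column list; `lit(None)` with a cast produces a typed null column:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical dataframe and required-column list for illustration.
df = spark.createDataFrame([(1, "a")], ["id", "colA"])
required_cols = ["colA", "colB", "colC"]

for c in required_cols:
    if c not in df.columns:
        # lit(None) yields a null column; the cast gives it a concrete type.
        df = df.withColumn(c, F.lit(None).cast("string"))

df.show()
# +---+----+----+----+
# | id|colA|colB|colC|
# +---+----+----+----+
# |  1|   a|null|null|
# +---+----+----+----+
```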

Total answers: 1

How to query for the maximum / highest value in a field with PySpark

How to query for the maximum / highest value in a field with PySpark Question: The following dataframe will produce values 0 to 3. df = DeltaTable.forPath(spark, '/mnt/lake/BASE/SQLClassification/cdcTest/dbo/cdcmergetest/1').history().select(col("version")) Can someone show me how to modify the dataframe so that it only returns the maximum value, i.e. 3? I have tried df.select("*").max("version") And df.max("version") But no …
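A common pattern is to aggregate with `F.max` and collect the scalar; a sketch using a stand-in for the question's `version` column:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Stand-in for the Delta history dataframe from the question.
df = spark.createDataFrame([(0,), (1,), (2,), (3,)], ["version"])

# agg() returns a one-row dataframe; collect()[0][0] extracts the scalar.
max_version = df.agg(F.max("version")).collect()[0][0]
print(max_version)  # 3
```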

Total answers: 1

spark dataframe convert a few flattened columns to one array of struct column

spark dataframe convert a few flattened columns to one array of struct column Question: I'd like some guidance on which Spark dataframe functions, together with Scala/Python code, would achieve this transformation. Given a dataframe with the columns columnA, columnB, columnA1, ColumnB1, ColumnA2, ColumnB2 …. ColumnA10, ColumnB10, e.g. Fat Value, Fat Measure, Salt …
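A sketch of the array-of-structs construction, assuming hypothetical paired columns columnA1/columnB1 and columnA2/columnB2 (the question's pattern runs to 10):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical flattened input mirroring the question's column pattern.
df = spark.createDataFrame(
    [("Fat", "g", "Salt", "mg")],
    ["columnA1", "columnB1", "columnA2", "columnB2"],
)

# Build one struct per column pair, then collect them into an array column.
pairs = [(f"columnA{i}", f"columnB{i}") for i in (1, 2)]
df = df.withColumn(
    "items",
    F.array(*[
        F.struct(F.col(a).alias("value"), F.col(b).alias("measure"))
        for a, b in pairs
    ]),
)
df.select("items").show(truncate=False)
```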

Total answers: 1

Pyspark: Compare Column Values across different dataframe

Pyspark: Compare Column Values across different dataframe Question: We are planning to do the following: compare two dataframes, add values to the first dataframe based on the comparison, and then group by to get the combined data. We are using PySpark dataframes, and ours are as follows. Dataframe1: | Manager | Department | isHospRelated | …
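A hedged sketch of the usual approach: join the two frames on their shared keys, then group the combined result. The schemas below are assumptions, since the question's tables are truncated; only Manager, Department, and isHospRelated appear in the excerpt, and the amount column is hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Assumed schemas for illustration only.
df1 = spark.createDataFrame(
    [("Alice", "Cardio", "Yes"), ("Bob", "Sales", "No")],
    ["Manager", "Department", "isHospRelated"],
)
df2 = spark.createDataFrame(
    [("Alice", "Cardio", 10), ("Bob", "Sales", 5)],
    ["Manager", "Department", "amount"],  # "amount" is hypothetical
)

# Join on the shared keys, then aggregate the combined data.
combined = (
    df1.join(df2, on=["Manager", "Department"], how="left")
       .groupBy("Manager", "Department", "isHospRelated")
       .agg(F.sum("amount").alias("total_amount"))
)
combined.show()
```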

Total answers: 1

Calling udf is not working on spark dataframe

Calling udf is not working on spark dataframe Question: I have a dictionary and a function I defined, and I registered a UDF as a SQL function %%spark d = {'I': 'Ice', 'U': 'UN', 'T': 'Tick'} def key_to_val(k): if k in d: return d[k] else: return "Null" spark.udf.register('key_to_val', key_to_val, StringType()) And I have a Spark dataframe that looks like sdf = +----+------------+--------------+ …
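A sketch of why such a call can fail: a UDF registered with `spark.udf.register` is visible to SQL and `F.expr`, while direct DataFrame-API calls need an `F.udf` wrapper. The column name `key` below is hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

d = {"I": "Ice", "U": "UN", "T": "Tick"}

def key_to_val(k):
    return d.get(k, "Null")

# Registration for SQL usage (spark.sql / F.expr):
spark.udf.register("key_to_val", key_to_val, StringType())

# Wrapper for direct DataFrame-API usage:
key_to_val_udf = F.udf(key_to_val, StringType())

sdf = spark.createDataFrame([("I",), ("U",), ("X",)], ["key"])  # hypothetical schema
sdf.withColumn("val", key_to_val_udf(F.col("key"))).show()
sdf.withColumn("val", F.expr("key_to_val(key)")).show()  # via the SQL-registered name
```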

Total answers: 1

Rank does not go in order if the value does not change

Rank does not go in order if the value does not change Question: I have a dataframe: data = [['p1', 't1'], ['p4', 't2'], ['p2', 't1'], ['p4', 't3'], ['p4', 't3'], ['p3', 't1'],] sdf = spark.createDataFrame(data, schema = ['id', 'text']) sdf.show() +---+----+ | id|text| +---+----+ | p1| t1| | p4| t2| | p2| t1| | p4| t3| | …
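The behaviour in the title is how `rank()` is defined: tied values share a rank and the next rank skips ahead, while `dense_rank()` leaves no gaps. A sketch using the question's data:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

data = [['p1', 't1'], ['p4', 't2'], ['p2', 't1'],
        ['p4', 't3'], ['p4', 't3'], ['p3', 't1']]
sdf = spark.createDataFrame(data, schema=['id', 'text'])

w = Window.orderBy("text")
# rank(): ties share a rank, then numbering jumps (1,1,1,4,5,5).
# dense_rank(): ties share a rank with no gap (1,1,1,2,3,3).
sdf.withColumn("rank", F.rank().over(w)) \
   .withColumn("dense_rank", F.dense_rank().over(w)) \
   .show()
```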

Total answers: 1

Default lib jars folder for Apache Toree kernel

Default lib jars folder for Apache Toree kernel Question: Say I want a default relative lib folder in a Jupyter notebook project directory where I can download custom jars, so that I can import them later without the %addjar magic. I was under the impression I could do something like: "__TOREE_OPTS__": "--jar-dir=./lib/" in ~/.local/share/jupyter/kernels/apache_toree_scala/kernel.json, but this doesn't work. What …
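One approach worth trying is the `__TOREE_SPARK_OPTS__` environment entry in the kernel.json, whose contents are forwarded to spark-submit; `--jars` takes a comma-separated list. This is a sketch, not a confirmed fix: the paths below are hypothetical, and absolute paths are safer since relative ones resolve against the kernel's working directory:

```json
{
  "env": {
    "__TOREE_SPARK_OPTS__": "--jars /home/user/project/lib/first.jar,/home/user/project/lib/second.jar"
  }
}
```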

Total answers: 2

PySpark in Databricks error with table conversion to pandas

PySpark in Databricks error with table conversion to pandas Question: I'm using Databricks and want to convert my PySpark DataFrame to a pandas one using the df.toPandas() command. However, I keep getting this error: /databricks/spark/python/pyspark/sql/pandas/conversion.py:145: UserWarning: toPandas attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true, but has reached the error below and can not …
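A common first step, per the warning text itself, is to let the conversion fall back to the non-Arrow path (or to fix the offending column types). A minimal sketch, assuming the Databricks-provided `spark` session and the question's `df`:

```python
# The warning says Arrow optimization failed; disabling it forces the
# slower but more permissive non-Arrow conversion path.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "false")
pdf = df.toPandas()
```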

Total answers: 1

Cannot sink Windowed queried streaming data to MongoDB

Cannot sink Windowed queried streaming data to MongoDB Question: Using Spark Structured Streaming, I am trying to sink streaming data to a MongoDB collection. The issue is that I am querying my data using a window, as follows: def basicAverage(df): return df.groupby(window(col('timestamp'), "1 hour", "5 minutes"), col('stationcode')) .agg(avg('mechanical').alias('avg_mechanical'), avg('ebike').alias('avg_ebike'), avg('numdocksavailable').alias('avg_numdocksavailable')) And it seems that mongodb …
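A hedged sketch of a common workaround: write each micro-batch through `foreachBatch`, where the batch MongoDB writer can be used on the static per-batch dataframe. The option keys follow the 10.x MongoDB Spark connector, and the URI, database, and collection names are hypothetical:

```python
def write_to_mongo(batch_df, batch_id):
    # Each micro-batch arrives as a static dataframe, so the batch writer works.
    (batch_df.write
        .format("mongodb")  # "mongo" on connector versions before 10.x
        .mode("append")
        .option("connection.uri", "mongodb://localhost:27017")  # hypothetical
        .option("database", "stations")                         # hypothetical
        .option("collection", "averages")                       # hypothetical
        .save())

query = (basicAverage(stream_df)   # stream_df: the streaming source dataframe
    .writeStream
    .outputMode("update")          # windowed aggregates need update/complete mode
    .foreachBatch(write_to_mongo)
    .start())
```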

Total answers: 1