pyspark

Check if columns exist and if not, create and fill with NaN using PySpark

Check if columns exist and if not, create and fill with NaN using PySpark Question: I have a pyspark dataframe and a separate list of column names. I want to check and see if any of the list column names are missing, and if they are, I want to create them and fill with null …

Total answers: 1
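
A minimal sketch of one common approach, assuming a dataframe df and a hypothetical list required_cols of expected column names:

    from pyspark.sql import functions as F

    required_cols = ["colA", "colB", "colC"]  # hypothetical list of expected names

    # Add any missing column as a typed null; lit(None) alone is NullType,
    # so cast it to the type the column should eventually hold.
    for c in required_cols:
        if c not in df.columns:
            df = df.withColumn(c, F.lit(None).cast("double"))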

pyspark to pandas dataframe: datetime compatibility

pyspark to pandas dataframe: datetime compatibility Question: I am using pyspark to do most of the data wrangling, but at the end I need to convert to a pandas dataframe. When converting, columns that I have formatted as dates become "object" dtype in pandas. Are datetimes between pyspark and pandas incompatible? How can I keep the date format …

Total answers: 1
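
One common workaround, sketched under the assumption that the column was cast to DateType (which pandas maps to an object column of datetime.date values) rather than TimestampType; the dataframe sdf and column name my_date are placeholders:

    from pyspark.sql import functions as F

    # TimestampType converts to pandas datetime64[ns]; DateType does not
    pdf = sdf.withColumn("my_date", F.col("my_date").cast("timestamp")).toPandas()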

How to query for the maximum / highest value in a field with PySpark

How to query for the maximum / highest value in a field with PySpark Question: The following dataframe will produce values 0 to 3. df = DeltaTable.forPath(spark, '/mnt/lake/BASE/SQLClassification/cdcTest/dbo/cdcmergetest/1').history().select(col("version")) Can someone show me how to modify the dataframe so that it only provides the maximum value, i.e. 3? I have tried df.select("*").max("version") and df.max("version"), but no …

Total answers: 1
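
For reference, a small sketch of the usual pattern: max is an aggregate function, so it goes through agg (or a groupBy) rather than being called on the dataframe directly:

    from pyspark.sql import functions as F

    # agg returns a one-row dataframe; collect()[0][0] extracts the scalar
    max_version = df.agg(F.max("version")).collect()[0][0]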

Get Geometric Mean Over Window in Pyspark Dataframe

Get Geometric Mean Over Window in Pyspark Dataframe Question: I have the following pyspark dataframe

    Car  Time  Val
    1    1     3
    2    1     6
    3    1     8
    1    2     10
    2    2     21
    3    2     33

I want to get the geometric mean of all the cars at each time, resulting df should look like …

Total answers: 2
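
A minimal sketch using the identity geometric mean = exp(avg(log(x))) over a window partitioned by Time, assuming the columns are named Car, Time and Val as in the excerpt:

    from pyspark.sql import functions as F
    from pyspark.sql import Window

    w = Window.partitionBy("Time")
    # exp of the arithmetic mean of the logs gives the geometric mean
    result = sdf.withColumn("geo_mean", F.exp(F.avg(F.log("Val")).over(w)))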

spark dataframe convert a few flattened columns to one array of struct column

spark dataframe convert a few flattened columns to one array of struct column Question: I'd like some guidance on which spark dataframe functions, together with scala/python code, achieve this transformation. Given a dataframe which has the below columns: columnA, columnB, columnA1, ColumnB1, ColumnA2, ColumnB2 … ColumnA10, ColumnB10, e.g. Fat Value, Fat Measure, Salt …

Total answers: 1
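
A sketch of one approach, assuming the pairs follow the columnA1/columnB1 … columnA10/columnB10 naming shown in the excerpt (the struct field names value/measure are illustrative):

    from pyspark.sql import functions as F

    # Zip each columnA{i}/columnB{i} pair into a struct, then collect
    # the structs into a single array<struct> column.
    pairs = [
        F.struct(F.col(f"columnA{i}").alias("value"),
                 F.col(f"columnB{i}").alias("measure"))
        for i in range(1, 11)
    ]
    df2 = df.withColumn("pairs", F.array(*pairs))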

Datetime object error while inserting datetime in pyspark dataframe

Datetime object error while inserting datetime in pyspark dataframe Question: I am getting an error while inserting a datetime object into the pyspark data structure from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, TimestampType, ArrayType, BinaryType filename = 'Rekorder_2022-08-24_14-12-42.mf4' match = re.search(r'(\d+-\d+-\d+_\d+-\d+-\d+)', filename) date_time = datetime.datetime.strptime(match.group(1), "%Y-%m-%d_%H-%M-%S") Now, I am trying to insert the date into the …

Total answers: 1
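
If the error is the common one of passing a bare value where Spark expects a row, a sketch of the fix (regex backslashes restored; the schema and column name recorded_at are assumptions, and an active spark session is assumed):

    import datetime
    import re
    from pyspark.sql.types import StructType, StructField, TimestampType

    filename = "Rekorder_2022-08-24_14-12-42.mf4"
    match = re.search(r"(\d+-\d+-\d+_\d+-\d+-\d+)", filename)
    date_time = datetime.datetime.strptime(match.group(1), "%Y-%m-%d_%H-%M-%S")

    # Each row must be a tuple (or Row); a bare datetime raises a type error
    schema = StructType([StructField("recorded_at", TimestampType(), True)])
    df = spark.createDataFrame([(date_time,)], schema)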

How can we set table properties for a Delta table in PySpark using the DeltaTable API

How can we set table properties for a Delta table in PySpark using the DeltaTable API Question: Below is the code that I am trying in PySpark: from delta import DeltaTable delta_table = DeltaTable.forPath(spark, delta_table_path) delta_table.logRetentionDuration = "interval 1 days" After this, do we need to save this config, or will it apply automatically? How we …

Total answers: 1
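
Worth noting: assigning delta_table.logRetentionDuration as above only sets a Python attribute on the object. As far as I know the property has to be set through SQL DDL, after which it persists in the Delta log with no extra save step. A sketch:

    # delta_table_path is the same path variable used in the excerpt
    spark.sql(f"""
        ALTER TABLE delta.`{delta_table_path}`
        SET TBLPROPERTIES ('delta.logRetentionDuration' = 'interval 1 days')
    """)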

Pyspark: Compare Column Values across different dataframes

Pyspark: Compare Column Values across different dataframes Question: We are planning to do the following: compare two dataframes, add values into the first dataframe based on the comparison, and then groupby to have combined data. We are using pyspark dataframes, and the following are our dataframes. Dataframe1: | Manager | Department | isHospRelated | | ------- | ---------- …

Total answers: 1
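
The excerpt cuts off before Dataframe2, so this is purely an illustrative shape of the usual answer: a join, a derived column from the comparison, then a groupBy. The join key, column names, and derived value are all assumptions:

    from pyspark.sql import functions as F

    # Hypothetical: join on Department, flag matches, aggregate per Manager
    combined = (df1.join(df2, on="Department", how="left")
                   .withColumn("flag",
                               F.when(F.col("isHospRelated") == "Yes", 1).otherwise(0))
                   .groupBy("Manager")
                   .agg(F.sum("flag").alias("hosp_related_count")))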

Calling udf is not working on spark dataframe

Calling udf is not working on spark dataframe Question: I have a dictionary and a function I defined, and I registered a udf as a SQL function %%spark d = {'I':'Ice', 'U':'UN', 'T':'Tick'} def key_to_val(k): if k in d: return d[k] else: return "Null" spark.udf.register('key_to_val', key_to_val, StringType()) And I have a spark dataframe that looks like sdf = +----+------------+--------------+ …

Total answers: 1
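
spark.udf.register makes the function callable from SQL; to call it on a dataframe you can keep the handle register returns, or go through expr. A sketch (the column name "key" is hypothetical, since the excerpt's table is cut off):

    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    # register() returns a callable usable in DataFrame expressions
    key_to_val_udf = spark.udf.register("key_to_val", key_to_val, StringType())

    sdf.withColumn("val", key_to_val_udf(F.col("key"))).show()
    # or via SQL expression syntax:
    sdf.withColumn("val", F.expr("key_to_val(key)")).show()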

Rank does not go in order if the value does not change

Rank does not go in order if the value does not change Question: I have a dataframe: data = [['p1', 't1'], ['p4', 't2'], ['p2', 't1'], ['p4', 't3'], ['p4', 't3'], ['p3', 't1']] sdf = spark.createDataFrame(data, schema = ['id', 'text']) sdf.show() +---+----+ | id|text| +---+----+ | p1| t1| | p4| t2| | p2| t1| | p4| t3| | …

Total answers: 1
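
For reference, a sketch of how the ranking functions differ on tied values, using the excerpt's data:

    from pyspark.sql import functions as F
    from pyspark.sql import Window

    w = Window.orderBy("text")
    # rank() skips numbers after ties (1, 1, 3, ...); dense_rank() does
    # not (1, 1, 2, ...); row_number() ignores ties entirely.
    sdf.select("id", "text",
               F.rank().over(w).alias("rank"),
               F.dense_rank().over(w).alias("dense_rank"),
               F.row_number().over(w).alias("row_number")).show()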