pyspark

How can I have a row-wise rank in a pyspark dataframe

How can I have a row-wise rank in a pyspark dataframe Question: I have a dataset for which I am going to find the rank per row. This is a toy example in pandas.

import pandas as pd
df = pd.DataFrame({"ID":[1,2,3,4], "a":[2,7,9,10], "b":[6,7,4,2], "c":[3,4,8,5]})
print(df)
#    ID   a  b  c
# 0   1   2  6 …
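A minimal PySpark sketch of one way to get such a row-wise rank (the DataFrame mirrors the toy example; the tie handling, where equal values share a rank, is an assumption):

from functools import reduce
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, 2, 6, 3), (2, 7, 7, 4), (3, 9, 4, 8), (4, 10, 2, 5)],
    ["ID", "a", "b", "c"],
)
cols = ["a", "b", "c"]

def row_rank(c):
    # Rank of column c within its row: 1 + number of sibling columns
    # holding a strictly greater value, so ties share the same rank.
    greater = [F.when(F.col(o) > F.col(c), 1).otherwise(0)
               for o in cols if o != c]
    return (reduce(lambda x, y: x + y, greater) + F.lit(1)).alias(f"{c}_rank")

df.select("ID", *[row_rank(c) for c in cols]).show()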

Total answers: 1

How can I calculate a date differential in Python across multiple rows and columns?

How can I calculate a date differential in Python across multiple rows and columns? Question: I’m trying to calculate the differential between the first Sent date/time in an ID and the last Received date/time in an ID, grouping them by Source and Destination. Sample (named test_subset) looks like this (but it is ‘000s of rows): …
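Since the tag is pyspark, here is a hedged sketch of the grouping logic; the column names (Sent, Received, Source, Destination) and the timestamp format are assumptions reconstructed from the excerpt:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
test_subset = spark.createDataFrame(
    [("A", "B", "2023-01-01 09:00:00", "2023-01-01 09:05:00"),
     ("A", "B", "2023-01-01 09:10:00", "2023-01-01 09:20:00")],
    ["Source", "Destination", "Sent", "Received"],
)

diff = (
    test_subset
    .withColumn("Sent", F.to_timestamp("Sent"))
    .withColumn("Received", F.to_timestamp("Received"))
    .groupBy("Source", "Destination")
    .agg(F.min("Sent").alias("first_sent"),
         F.max("Received").alias("last_received"))
    # Timestamps cast to long are epoch seconds, so the subtraction
    # yields the differential in seconds.
    .withColumn(
        "diff_seconds",
        F.col("last_received").cast("long") - F.col("first_sent").cast("long"),
    )
)
diff.show()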

Total answers: 1

Ratio after a groupby in pyspark

Ratio after a groupby in pyspark Question: I have a pyspark df like this

+------+-------------+
|Gender|     Language|
+------+-------------+
|  Male|      Spanish|
|Female|      English|
|Female|       Indian|
|Female|      Spanish|
|Female|       Indian|
|  Male|      English|
|  Male|      English|
|Female|Latin Spanish|
|  Male|      Spanish|
|Female|      English|
|  Male|       Indian|
|  Male|      Catalan| …
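A minimal sketch of one way to get such a ratio: count per (Gender, Language) pair, then divide by the per-Gender total via a window (the choice of Gender as the ratio's denominator is an assumption):

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("Male", "Spanish"), ("Female", "English"), ("Female", "Indian"),
     ("Female", "Spanish"), ("Male", "English")],
    ["Gender", "Language"],
)

per_gender = Window.partitionBy("Gender")
ratios = (
    df.groupBy("Gender", "Language").count()
      # Each group's count divided by the total count of its Gender.
      .withColumn("ratio", F.col("count") / F.sum("count").over(per_gender))
)
ratios.show()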

Total answers: 1

PySpark: Create a condition from a string

PySpark: Create a condition from a string Question: I have to apply conditions to pyspark dataframes based on a distribution. My distribution looks like:

mp = [413, 291, 205, 169, 135]

And I am generating the condition expression like this:

when_decile = (F.when((F.col(colm) >= float(mp[0])), F.lit(1))
    .when((F.col(colm) >= float(mp[1])) & (F.col(colm) < float(mp[0])), F.lit(2))
    .when( …
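One hedged way to finish this pattern is to build the chained when expression in a loop over mp rather than writing each branch by hand (colm is an illustrative column name here; a condition held in an actual SQL string could instead be parsed with F.expr):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
mp = [413, 291, 205, 169, 135]
colm = "score"  # illustrative column name
df = spark.createDataFrame([(500,), (300,), (150,)], [colm])

# Start with the open-ended top bucket, then add one bounded bucket per
# cutoff, mirroring the hand-written branches in the question.
cond = F.when(F.col(colm) >= float(mp[0]), F.lit(1))
for i in range(1, len(mp)):
    cond = cond.when(
        (F.col(colm) >= float(mp[i])) & (F.col(colm) < float(mp[i - 1])),
        F.lit(i + 1),
    )
df.withColumn("decile", cond).show()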

Total answers: 2

How can I convert from 03MAR23 format to yyyy-mm-dd in Python

How can I convert from 03MAR23 format to yyyy-mm-dd in Python Question: I wanted to convert from the 03FEB23 format to yyyy-mm-dd in Python; how can I do it? Use the below code:

from pyspark.sql.functions import *
df = spark.createDataFrame([["1"]],["id"])
df.select(current_date().alias("current_date"), date_format("03MAR23","yyyy-MMM-dd").alias("yyyy-MMM-dd")).show()

Asked By: Gaurav Gangwar || Source Answers:

from datetime import datetime
date_str = '03FEB23'
date = …
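Two hedged sketches: plain Python strptime/strftime, and a PySpark to_date with an explicit ddMMMyy pattern (depending on the Spark version, parsing the upper-case month may require spark.sql.legacy.timeParserPolicy=LEGACY):

from datetime import datetime
from pyspark.sql import SparkSession, functions as F

# Plain Python: %d%b%y matches 03FEB23 (strptime matches month names
# case-insensitively).
date_str = "03FEB23"
print(datetime.strptime(date_str, "%d%b%y").strftime("%Y-%m-%d"))  # 2023-02-03

# PySpark equivalent: parse the string with an explicit pattern; a date
# column renders as yyyy-MM-dd by default.
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("03MAR23",)], ["raw"])
df.select(F.to_date("raw", "ddMMMyy").alias("converted")).show()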

Total answers: 2

PySpark in Databricks error with table conversion to pandas

PySpark in Databricks error with table conversion to pandas Question: I'm using Databricks and want to convert my PySpark DataFrame to a pandas one using the df.toPandas() command. However, I keep getting this error:

/databricks/spark/python/pyspark/sql/pandas/conversion.py:145: UserWarning: toPandas attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true, but has reached the error below and can not …
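A hedged workaround sketch: the warning means the Arrow-backed conversion failed, so one option is to disable that optimization and let toPandas() use the slower non-Arrow path (fixing the offending column types is the longer-term fix):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(5)  # stand-in for the Databricks DataFrame in the question

# Turn off the Arrow path so toPandas() uses the plain collect-based
# conversion instead of raising the Arrow error.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "false")
pdf = df.toPandas()
print(pdf)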

Total answers: 1

Cannot sink Windowed queried streaming data to MongoDB

Cannot sink Windowed queried streaming data to MongoDB Question: Using Spark Structured Streaming I am trying to sink streaming data to a MongoDB collection. The issue is that I am querying my data using a window as follows:

def basicAverage(df):
    return df.groupby(window(col('timestamp'), "1 hour", "5 minutes"), col('stationcode')) \
             .agg(avg('mechanical').alias('avg_mechanical'),
                  avg('ebike').alias('avg_ebike'),
                  avg('numdocksavailable').alias('avg_numdocksavailable'))

And it seems that mongodb …
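A hedged sketch of the usual workaround: write each micro-batch with foreachBatch, since direct streaming writes to MongoDB exist only in newer connector versions (the rate source, URI, and connector options below are illustrative stand-ins, not the question's actual setup):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Illustrative stand-in for the question's stream.
stream = (spark.readStream.format("rate").option("rowsPerSecond", 1).load()
          .withColumn("stationcode", F.lit("S1"))
          .withColumn("mechanical", F.col("value") % 5)
          .withColumn("ebike", F.col("value") % 3)
          .withColumn("numdocksavailable", F.col("value") % 7))

def basic_average(df):
    return (df.groupBy(F.window(F.col("timestamp"), "1 hour", "5 minutes"),
                       F.col("stationcode"))
              .agg(F.avg("mechanical").alias("avg_mechanical"),
                   F.avg("ebike").alias("avg_ebike"),
                   F.avg("numdocksavailable").alias("avg_numdocksavailable")))

def write_to_mongo(batch_df, batch_id):
    # Each micro-batch is a static DataFrame, so the batch writer applies.
    (batch_df.write.format("mongodb")  # use "mongo" for connector < 10.x
        .mode("append")
        .option("spark.mongodb.write.connection.uri",
                "mongodb://localhost:27017/mydb.averages")
        .save())

query = (basic_average(stream).writeStream
         .outputMode("update")
         .foreachBatch(write_to_mongo)
         .option("checkpointLocation", "/tmp/checkpoints/mongo")
         .start())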

Total answers: 1

explode a pyspark column with root name intact

explode a pyspark column with root name intact Question: I have a pyspark dataframe whose schema looks like this:

|-- col1: timestamp (nullable = true)
|-- col2: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- NM: string (nullable = true)

How can I explode col2 so that the final column name …
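A minimal sketch of one way to do this (the sample data and the col2_NM naming convention are assumptions reconstructed from the excerpt's schema):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("2023-01-01 00:00:00", [("x",), ("y",)])],
    "col1 string, col2 array<struct<NM:string>>",
).withColumn("col1", F.to_timestamp("col1"))

# Explode the array, then re-select the struct field with the parent
# name folded into the alias.
exploded = (df.withColumn("col2", F.explode("col2"))
              .select("col1", F.col("col2.NM").alias("col2_NM")))
exploded.printSchema()
exploded.show()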

Total answers: 1

how to make loop in pyspark

how to make loop in pyspark Question: I have this code:

list_files = glob.glob("/t/main_folder/*/file_*[0-9].csv")
test = sorted(list_files, key=lambda x: x[-5:])

This code has helped me find the files I need to work with; I found 5 CSV files in different folders. Next step: I'm using the code below to work with …
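A hedged sketch of the looping step (the header/inferSchema read options are assumptions): either iterate over the sorted list, or hand the whole list to a single spark.read.csv call:

import glob
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

list_files = glob.glob("/t/main_folder/*/file_*[0-9].csv")
test = sorted(list_files, key=lambda x: x[-5:])

# Option 1: process each CSV separately.
for path in test:
    df = spark.read.csv(path, header=True, inferSchema=True)
    # ... per-file work goes here ...
    print(path, df.count())

# Option 2: Spark accepts a list of paths in one call.
all_df = spark.read.csv(test, header=True, inferSchema=True)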

Total answers: 1