databricks

Python Exception Message Escaping <=> character

Question: My colleagues and I are working on some code to produce SQL merge strings for users of a library we’re building in Python to be run in the Azure Databricks environment. These functions provide the SQL string through a custom exception that we’ve written called DebugMode. The issue …

Total answers: 1
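
One way to picture the pattern this question describes: a custom exception that carries the generated SQL string, with <=> (Spark SQL’s null-safe equality operator) embedded in the MERGE condition. This is a hypothetical sketch; the class, function, and table names are assumptions, only the overall shape follows the question.

    # Hypothetical DebugMode-style exception carrying a SQL string.
    class DebugMode(Exception):
        """Raised to surface the generated SQL instead of executing it."""
        def __init__(self, sql: str):
            self.sql = sql
            super().__init__(sql)

    def build_merge_sql(target: str, source: str, key: str, debug: bool = False) -> str:
        sql = (
            f"MERGE INTO {target} t USING {source} s "
            f"ON t.{key} <=> s.{key} "  # <=> is Spark SQL's null-safe equality
            "WHEN MATCHED THEN UPDATE SET * "
            "WHEN NOT MATCHED THEN INSERT *"
        )
        if debug:
            raise DebugMode(sql)
        return sql

    try:
        build_merge_sql("prod.users", "staging.users", "user_id", debug=True)
    except DebugMode as e:
        print(e.sql)  # print the stored attribute, not the rendered traceback

Printing the stored attribute directly sidesteps whatever escaping the notebook applies when it renders the exception message.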

Dealing with very small static tables in pySpark

Question: I am currently using Databricks to process data coming from our Azure Data Lake. The majority of the data is read into pySpark dataframes as relatively big datasets. However, I do have to perform some joins on smaller static tables to fetch additional attributes. Currently, …

Total answers: 2
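
For joins between a large dataframe and small static lookup tables, the usual answer is a broadcast join, which ships the small table to every executor and avoids shuffling the large dataframe. A minimal sketch, assuming a Databricks notebook where spark is predefined; the table name, path, and join key are hypothetical:

    from pyspark.sql import functions as F

    small_df = spark.read.table("lookup.attributes")                    # small static table
    large_df = spark.read.format("delta").load("/mnt/datalake/events")  # big dataset

    # broadcast() hints Spark to replicate the small table instead of shuffling
    joined = large_df.join(F.broadcast(small_df), on="attribute_id", how="left")

Spark also broadcasts automatically below spark.sql.autoBroadcastJoinThreshold (10 MB by default), so the explicit hint mainly matters when the optimizer misjudges the table size.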

Change Column Names using Dictionary (key value pair) in Databricks

Question: I am new to Databricks and Python, and I just want to know the best way to change column names in Databricks. For example, if the column name is ‘ID’ then I want to change it to ‘Patient_ID’, and ‘Name’ to ‘Patient_Name’. So I thought …

Total answers: 1
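
A common way to drive the renames from a dictionary is a loop over withColumnRenamed; the mapping below mirrors the example in the question:

    rename_map = {"ID": "Patient_ID", "Name": "Patient_Name"}

    for old, new in rename_map.items():
        df = df.withColumnRenamed(old, new)

    # On PySpark 3.4+ the loop collapses to a single call:
    # df = df.withColumnsRenamed(rename_map)

Note that withColumnRenamed is a no-op for columns that don’t exist, so a stale key in the dictionary fails silently rather than raising.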

Transposing python/pyspark dataframe

Question: I have data in an Excel file in the following format, and I want it to be transposed into the following format. I have tried transposing the data using the following:

    pdf = df.toPandas()
    df_transposed = pdf.T

But that didn’t work and I get incorrect results… Any help please… thanks Asked By: Pysparker …

Total answers: 1
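
pandas .T transposes by position, so without a meaningful index the result keys on row numbers. A hedged sketch of the usual fix, assuming the frame is small enough to collect and has an identifying column (the "id" name here is hypothetical):

    # Collect to pandas (only safe for small data), set the key column as
    # the index, then transpose so columns are labeled by the key values.
    pdf = df.toPandas()
    pdf_t = pdf.set_index("id").T.reset_index()

    # Column labels must be strings before going back to Spark:
    pdf_t.columns = pdf_t.columns.astype(str)
    df_transposed = spark.createDataFrame(pdf_t)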

Processing large number of JSONs (~12TB) with Databricks

Question: I am looking for guidance/best practice to approach a task. I want to use Azure Databricks and PySpark. Task: Load and prepare data so that it can be efficiently/quickly analyzed in the future. The analysis will involve summary statistics, exploratory data analysis and maybe simple ML (regression). …

Total answers: 1
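
A frequently recommended shape for this kind of workload: read the JSON once with an explicit schema (schema inference alone would mean an extra pass over ~12 TB), then persist to partitioned Delta so later analysis scans only what it needs. The paths, fields, and partition column below are assumptions:

    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType, StructField, StringType, TimestampType

    schema = StructType([
        StructField("id", StringType()),
        StructField("event_time", TimestampType()),
        # ... remaining fields
    ])

    raw = (spark.read.schema(schema)
           .json("/mnt/datalake/raw/json/")           # hypothetical source path
           .withColumn("event_date", F.to_date("event_time")))

    (raw.write
        .format("delta")
        .partitionBy("event_date")                    # coarse date partitions, not raw timestamps
        .mode("overwrite")
        .save("/mnt/datalake/prepared/events"))       # hypothetical target path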

Why are constants not accessible from a python file?

Question: In my Databricks project, I have a very basic notebook which contains some constants as below:

    RAW_FOLDER_PATH = 'dbfs:/mnt/formuleinsstorage/rawdata/unziped/'
    PROCESSED_FOLDER_PATH = 'dbfs:/mnt/formuleinsstorage/processeddata'
    MESSAGE_TO_WHEN_COMPLETING_NOTEBOOK_SUCCESSFULLY = 'Success'
    dbutils.notebook.exit(MESSAGE_TO_WHEN_COMPLETING_NOTEBOOK_SUCCESSFULLY)

and then I need to run this notebook as part of another notebook using this code:

    dbutils.notebook.run("./../helpers/configuration", 0)
    dbutils.notebook.run("./../helpers/functions", …

Total answers: 1
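
The behavior at the heart of this question: dbutils.notebook.run() executes the child notebook in a separate context, so its variables never appear in the caller’s namespace; only the dbutils.notebook.exit() payload comes back, as a string. A sketch of the contrast (these are Databricks-only APIs):

    # Runs in an isolated context: RAW_FOLDER_PATH is NOT defined here afterwards.
    result = dbutils.notebook.run("./../helpers/configuration", 0)
    print(result)  # "Success" -- just the exit() value

    # To pull the constants into the current notebook, use the %run magic
    # instead, which executes the child inline (must be alone in its own cell):
    # %run ./../helpers/configuration
    # print(RAW_FOLDER_PATH)  # now in scope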

'DataFrame' object does not support item assignment

Question: I imported a df into Databricks as a pyspark.sql.dataframe.DataFrame. Within this df I have 3 columns (which I have verified to be strings) that I wish to concatenate. I have tried to use a simple "+" function first, e.g.

    df["fullname"] = df["firstname"] + df["middlename"] + df["lastname"]

But …

Total answers: 2
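
Spark DataFrames are immutable, which is why item assignment raises; new columns come from withColumn. A minimal sketch of the concatenation, with column names taken from the question:

    from pyspark.sql import functions as F

    df = df.withColumn(
        "fullname",
        F.concat_ws(" ", "firstname", "middlename", "lastname"),
    )

concat_ws also skips nulls, whereas plain concat returns null for the whole result if any input is null.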

How to Perform GroupBy, Having and Order by together in Pyspark

Question: I am looking for a solution where I am performing GROUP BY, HAVING and ORDER BY together in PySpark code. Basically we need to shift some data from one dataframe to another with some conditions. The SQL query looks like …

Total answers: 1
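
In the DataFrame API, HAVING is simply a filter applied after the aggregation. A sketch with hypothetical column names, mirroring SELECT … GROUP BY … HAVING … ORDER BY:

    from pyspark.sql import functions as F

    result = (
        df.groupBy("department")
          .agg(F.sum("salary").alias("total_salary"))
          .filter(F.col("total_salary") > 100000)   # HAVING
          .orderBy(F.col("total_salary").desc())    # ORDER BY
    )

Because the filter runs on the already-aggregated frame, it can reference the alias directly, just as HAVING references aggregate results in SQL.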

Standardize large Pyspark dataframe using scipy Z-score

Question: I have PySpark code running in Azure Databricks. I have a Spark dataframe with 20 numerical columns, named column1, column2, … column20. I have to calculate the Z-score (from scipy.stats import zscore) of these 20 columns, and for that I am reading these 20 columns as a numpy array. But …

Total answers: 2
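
Collecting 20 columns into a numpy array pulls the whole dataframe onto the driver, which is what breaks at scale. A distributed alternative: compute each column’s mean and standard deviation in Spark, then standardize in place. stddev_pop matches scipy.stats.zscore’s default ddof=0; the column names follow the question:

    from pyspark.sql import functions as F

    cols = [f"column{i}" for i in range(1, 21)]

    # One pass over the data for all 40 statistics.
    stats = df.select(
        *[F.mean(c).alias(f"{c}_mu") for c in cols],
        *[F.stddev_pop(c).alias(f"{c}_sigma") for c in cols],
    ).first()

    for c in cols:
        df = df.withColumn(c, (F.col(c) - stats[f"{c}_mu"]) / stats[f"{c}_sigma"])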

How to Compare row values in Pyspark using lead/lag?

Question: I have a dataframe with a column named ‘YEAR’, and I want to check if the alternate rows of the column match, updating another column ‘FLAG’ with the value 100 if the alternate value matches.

    df_prod
    Year  FLAG
    2020  None
    2020  None
    2019  None
    2021  …

Total answers: 1
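
A hedged sketch of the lag() approach the title points at: compare each row’s Year with the value a fixed offset back over an ordered window, and set FLAG to 100 on a match. lead/lag need a deterministic ordering, so the "idx" ordering column here is an assumption; adjust the offset (1 for the previous row, 2 for every other row) to whatever "alternate" means in the data:

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    w = Window.orderBy("idx")  # hypothetical ordering column

    df = df.withColumn(
        "FLAG",
        F.when(F.col("Year") == F.lag("Year", 1).over(w), 100),
    )

Rows without a match (including the first, where lag is null) keep FLAG as null, matching the None values in the sample.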