azure-databricks

Databricks DLT pipeline with for..loop reports error "AnalysisException: Cannot redefine dataset"

Databricks DLT pipeline with for..loop reports error "AnalysisException: Cannot redefine dataset" Question: I have the following code, which works fine for a single table. But when I try to use a for loop to process all the tables in my database, I get the error "AnalysisException: Cannot redefine dataset 'source_ds',Map(),Map(),List(),List(),Map())". I need to pass the …
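A minimal sketch of one common fix, assuming the Databricks dlt module and a hypothetical list of source tables: wrapping the decorated function in a small factory and giving each dataset a unique name keeps the loop from redefining the same dataset.

import dlt

# Hypothetical list of source tables in the database
tables = ["customers", "orders", "items"]

def generate_table(table_name):
    # A unique dataset name per iteration avoids "Cannot redefine dataset"
    @dlt.table(name=f"clean_{table_name}")
    def build():
        # spark is provided by the Databricks runtime; source_db is a placeholder
        return spark.table(f"source_db.{table_name}")
    return build

for t in tables:
    generate_table(t)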

Total answers: 1

Split file based on \r into new rows

Split file based on \r into new rows Question: I have a csv file in my source folder and want the output on a new line wherever we have "\r". Source file: nameagegenderrkiran29malerrekha12femalersiva39maler Expected output file: nameagegender kiran29male rekha12female siva39male Asked By: Raj || Source Answers: with open('filename.csv', 'r+') as file: data = file.readlines()[0].replace('\r', '\n')[:-1] print(data) …
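A minimal sketch of the same idea, assuming the file really does use a literal carriage return as the record separator; the input and output file names are placeholders.

# newline="" disables Python's newline translation so the raw "\r" survives the read
with open("filename.csv", "r", newline="") as f:
    raw = f.read()

# Turn carriage returns into line breaks and drop a trailing separator if present
rows = raw.replace("\r", "\n").rstrip("\n")

with open("output.csv", "w") as f:
    f.write(rows)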

Total answers: 1

df to table throw error TypeError: __init__() got multiple values for argument 'schema'

df to table throw error TypeError: __init__() got multiple values for argument 'schema' Question: I have a dataframe in pandas: purchase_df. I want to convert it to a sql table so I can run sql queries against it from pandas. I tried this method: purchase_df.to_sql('purchase_df', con=engine, if_exists='replace', index=False) It throws an error: TypeError: __init__() got multiple values for …
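A minimal, self-contained sketch of the intended usage, with an in-memory SQLite engine standing in for the asker's engine; passing a proper SQLAlchemy Engine (not a raw connection string) and keeping the pandas and SQLAlchemy versions compatible are the usual first checks for this TypeError.

import pandas as pd
from sqlalchemy import create_engine

# In-memory SQLite engine purely for illustration
engine = create_engine("sqlite://")

purchase_df = pd.DataFrame({"item": ["a", "b"], "qty": [1, 2]})

# to_sql writes the dataframe as a table that can then be queried with SQL
purchase_df.to_sql("purchase_df", con=engine, if_exists="replace", index=False)
print(pd.read_sql("SELECT item, qty FROM purchase_df", con=engine))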

Total answers: 3

dbx execute install from azure artifacts / private pypi

dbx execute install from azure artifacts / private pypi Question: I would like to use dbx execute to run a task/job on an Azure Databricks cluster. However, I cannot make it install my code. More details on the situation: Project A, with a setup.py, is dependent on Project B. Project B is also Python based …
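A hedged sketch of one workaround: pulling the dependency from the private feed at runtime with pip. The org, feed, package name, and PAT placeholders are all assumptions about an Azure Artifacts PyPI-compatible feed, not part of the original question.

import subprocess
import sys

# Placeholder Azure Artifacts feed URL; pip's --index-url points at a PyPI-compatible index
index_url = "https://<user>:<PAT>@pkgs.dev.azure.com/<org>/_packaging/<feed>/pypi/simple/"

subprocess.check_call([
    sys.executable, "-m", "pip", "install",
    "--index-url", index_url,
    "project-b",  # hypothetical name of the Project B package
])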

Total answers: 1

parse xlsx file having merged cells using python or pyspark

parse xlsx file having merged cells using python or pyspark Question: I want to parse an xlsx file. Some of the cells in the file are merged and act as a header for the values underneath them, but I do not know which approach to take to parse the file. Should I parse the file from …
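A minimal pandas sketch of one common approach, assuming a single merged header row (the file name and layout are assumptions): merged cells are read back as NaN for every column but the first one they span, so forward-filling the header row recovers the labels.

import pandas as pd

# Read without a header so the merged header row stays a normal data row
raw = pd.read_excel("report.xlsx", header=None)

# Forward-fill the first row so every column under a merged cell gets its label
header = raw.iloc[0].ffill()

df = raw.iloc[1:].copy()
df.columns = header
print(df.head())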

Total answers: 1

Saving Custom TableNet Model (VGG19 based) for table extraction – Azure Databricks

Saving Custom TableNet Model (VGG19 based) for table extraction – Azure Databricks Question: I have a model based on TableNet and VGG19; the training data (Marmoot) and the save path are mapped to a datalake storage (using Azure). I'm trying to save it in the following ways and get the following errors on Databricks: …
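A hedged sketch of a common workaround on Databricks, assuming a trained Keras model and a mounted datalake path (both placeholders): save to the driver's local disk first, then copy the result out with dbutils.

# model is the trained TableNet/VGG19 Keras model; paths are placeholders
local_path = "/tmp/tablenet_model"
model.save(local_path)

# dbutils is available in Databricks notebooks; "file:" addresses the driver's local disk
dbutils.fs.cp(f"file:{local_path}", "dbfs:/mnt/datalake/models/tablenet_model", recurse=True)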

Total answers: 1

PySpark / Python Slicing and Indexing Issue

PySpark / Python Slicing and Indexing Issue Question: Can someone let me know how to pull out certain values from a Python output? I would like to retrieve the value 'ocweeklyreports' from the following output using either indexing or slicing: 'config': '{"hiveView":"ocweeklycur.ocweeklyreports"} This should be relatively easy; however, I'm having problems defining the slicing …
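A minimal sketch of one way to get the value, assuming the 'config' field is a JSON string: parsing it and splitting on the dot is less brittle than positional slicing.

import json

# The value from the question, reproduced here as a plain string
config = '{"hiveView": "ocweeklycur.ocweeklyreports"}'

hive_view = json.loads(config)["hiveView"]
table_name = hive_view.split(".")[1]
print(table_name)  # ocweeklyreports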

Total answers: 3

Processing large number of JSONs (~12TB) with Databricks

Processing large number of JSONs (~12TB) with Databricks Question: I am looking for guidance/best practice to approach a task. I want to use Azure-Databricks and PySpark. Task: Load and prepare data so that it can be efficiently/quickly analyzed in the future. The analysis will involve summary statistics, exploratory data analysis and maybe simple ML (regression). …
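A minimal PySpark sketch of one common pattern, with placeholder storage paths and partition column: read the raw JSON once, then persist it as partitioned Delta so later summary statistics and exploration only touch the data they need.

# spark is provided by the Databricks runtime; paths and partition column are placeholders
raw = spark.read.json("abfss://raw@<account>.dfs.core.windows.net/events/*.json")

(raw.write
    .format("delta")
    .mode("overwrite")
    .partitionBy("event_date")
    .save("abfss://curated@<account>.dfs.core.windows.net/events_delta"))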

Total answers: 1

Dynamically create pyspark dataframes according to a condition

Dynamically create pyspark dataframes according to a condition Question: I have a pyspark dataframe store_df:

store  ID          Div
637    4000000970  Pac
637    4000000435  Pac
637    4000055542  Pac
637    4000042206  Pac
638    2200015935  Pac
638    2200000483  Pac
638    4000014114  Pac
640    4000000162  Pac
640    2200000067  Pac
642    2200000067  Mac
642    4000044148  Mac
642    4000014114  Mac
…
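A minimal sketch of one way to do this without dynamically named variables, assuming store_df from the question: keep one filtered dataframe per distinct Div value in a dict.

from pyspark.sql import functions as F

# Collect the distinct Div values, then build one filtered dataframe per value
divs = [row["Div"] for row in store_df.select("Div").distinct().collect()]
dfs_by_div = {d: store_df.filter(F.col("Div") == d) for d in divs}

dfs_by_div["Pac"].show()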

Total answers: 3