apache-spark-sql

PySpark loading from MySQL ends up loading the entire table?

Question: I am quite new to PySpark (or Spark in general). I am trying to connect Spark with a MySQL instance I have running on RDS. When I load the table like so, does Spark load the entire table into memory? from pyspark.sql import SparkSession …

Total answers: 1
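
A minimal sketch of the usual answer, with placeholder connection details (host, database, table, credentials, and the id column are all assumptions): Spark's JDBC reader is lazy, so nothing is fetched until an action runs, and simple filters are pushed down to MySQL as a WHERE clause rather than pulling the whole table over the wire.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mysql-read").getOrCreate()

# All connection details below are placeholders.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://my-rds-host:3306/mydb")
    .option("dbtable", "mytable")
    .option("user", "user")
    .option("password", "password")
    .option("driver", "com.mysql.cj.jdbc.Driver")
    .load()
)

# Lazy evaluation: the read happens only when an action runs, and this
# filter is pushed down to MySQL instead of being applied after a full scan.
df.filter(df.id > 100).show()   # assumes a hypothetical numeric id column

For genuinely large tables, the partitionColumn / lowerBound / upperBound / numPartitions options split the read across executors instead of funneling it through one connection.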

Dataframe column name with $$ failing in filter condition with parse error

Question: I have a dataframe with column names "lastname$$" and "firstname$$"

+-----------+----------+----------+------------------+-----+------+
|firstname$$|middlename|lastname$$|languages         |state|gender|
+-----------+----------+----------+------------------+-----+------+
|James      |          |Smith     |[Java, Scala, C++]|OH   |M     |
|Anna       |Rose      |          |[Spark, Java, C++]|NY   |F     |
|Julia      |          |Williams  |[CSharp, VB]      |OH   |F     |
|Maria      |Anne      |Jones     |[CSharp, …

Total answers: 2
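
A sketch of the usual fix, assuming the dataframe is named df: backtick-quote the column name wherever it passes through Spark's SQL parser, or sidestep the parser with the Column API.

from pyspark.sql.functions import col

# In a SQL string expression, a name containing $$ must be backtick-quoted,
# otherwise the parser rejects it:
df.filter("`lastname$$` = 'Smith'").show()

# The Column API avoids the string parser entirely:
df.filter(col("`lastname$$`") == "Smith").show()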

Pyspark 3.3.0 dataframe show data but writing CSV creates empty file

Question: Facing a very unusual issue. The dataframe shows data when I run df.show(); however, when I try to write it as CSV, the operation completes without error but writes a 0-byte empty file. Is this a bug? Is there something missing? --pyspark version …

Total answers: 1
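
One common explanation, offered as an assumption rather than a confirmed diagnosis: df.write.csv produces a directory of part-*.csv files plus a zero-byte _SUCCESS marker, and it is easy to open the marker (or, on Windows, to hit a missing-winutils problem) and conclude the write was empty. A sketch with a hypothetical output path:

# Spark writes a directory, not a single file; the real data lives in the
# part-*.csv files inside it, next to a zero-byte _SUCCESS marker.
(
    df.coalesce(1)                      # optional: collapse to one part file
      .write.mode("overwrite")
      .option("header", True)
      .csv("/tmp/output_dir")           # hypothetical path
)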

How to query for the maximum / highest value in a field with PySpark

Question: The following dataframe will produce values 0 to 3. df = DeltaTable.forPath(spark, '/mnt/lake/BASE/SQLClassification/cdcTest/dbo/cdcmergetest/1').history().select(col("version")) Can someone show me how to modify the dataframe so that it only provides the maximum value, i.e. 3? I have tried df.select("*").max("version") and df.max("version") but no …

Total answers: 1
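
A sketch of the usual pattern: max is an aggregate function, so it goes through select/agg rather than being called on the dataframe directly.

from pyspark.sql.functions import max as spark_max

# One-row dataframe holding the maximum:
df.select(spark_max("version").alias("max_version")).show()

# Or pull the value out into Python:
max_version = df.agg(spark_max("version")).collect()[0][0]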

Calling udf is not working on spark dataframe

Question: I have a dictionary and a function I defined, and I registered a UDF as a SQL function %%spark d = {'I': 'Ice', 'U': 'UN', 'T': 'Tick'} def key_to_val(k): if k in d: return d[k] else: return "Null" spark.udf.register('key_to_val', key_to_val, StringType()) And I have a spark dataframe that looks like sdf = +----+------------+--------------+ …

Total answers: 1
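
A sketch reusing the question's d, key_to_val, and sdf, with a hypothetical column name key_col: spark.udf.register only exposes the name to SQL/expr strings, but it also returns a callable that works with the DataFrame API.

from pyspark.sql.functions import expr
from pyspark.sql.types import StringType

d = {'I': 'Ice', 'U': 'UN', 'T': 'Tick'}

def key_to_val(k):
    return d.get(k, "Null")

# register() returns a UDF usable from the DataFrame API as well as from SQL.
key_to_val_udf = spark.udf.register('key_to_val', key_to_val, StringType())

sdf.withColumn("val", key_to_val_udf(sdf["key_col"])).show()   # DataFrame API
sdf.withColumn("val", expr("key_to_val(key_col)")).show()      # SQL expression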

Rank does not go in order if the value does not change

Question: I have a dataframe:

data = [['p1', 't1'], ['p4', 't2'], ['p2', 't1'], ['p4', 't3'], ['p4', 't3'], ['p3', 't1'],]
sdf = spark.createDataFrame(data, schema = ['id', 'text'])
sdf.show()

+---+----+
| id|text|
+---+----+
| p1|  t1|
| p4|  t2|
| p2|  t1|
| p4|  t3|
| …

Total answers: 1
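
A sketch contrasting the three ranking functions over a window ordered by text; which one is "correct" depends on how tied values should be numbered.

from pyspark.sql import Window
from pyspark.sql.functions import rank, dense_rank, row_number

w = Window.orderBy("text")

# rank() repeats a value for ties and then skips ahead; dense_rank() repeats
# without gaps; row_number() gives every row a distinct number.
sdf.withColumn("rank", rank().over(w)) \
   .withColumn("dense_rank", dense_rank().over(w)) \
   .withColumn("row_number", row_number().over(w)) \
   .show()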

Converting string dd.mm.yyyy to date format yyyy-MM-dd using Pyspark

Question: I have a column with a date in string format dd.mm.yyyy. I want to convert it into date format yyyy-MM-dd using Pyspark. I have tried the following but it's returning null values: df.withColumn("date_col", to_date("string_col", "yyyy-mmm-dd"))

string_col    date_col
02.11.2008    null
26.02.2021    null

Asked By: f.ivy || Source …

Total answers: 1
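
The usual fix: the pattern passed to to_date must describe the input string, and month is uppercase MM (lowercase mm means minutes), so dd.mm.yyyy input needs the pattern dd.MM.yyyy.

from pyspark.sql.functions import to_date

# The pattern matches the incoming string; the resulting date column
# renders in Spark's default yyyy-MM-dd form.
df = df.withColumn("date_col", to_date("string_col", "dd.MM.yyyy"))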

Flatten Map Type in Pyspark

Question: I have a dataframe as below

+-------------+--------------+----+-------+---------------------------------------------------------------------------------+
|empId        |organization  |h_cd|status |additional                                                                       |
+-------------+--------------+----+-------+---------------------------------------------------------------------------------+
|FTE:56e662f  |CATENA        |0   |CURRENT|{hr_code -> 84534, bgc_val -> 170187, interviewPanel -> 6372, meetingId -> 3671} |
|FTE:633e7bc  |Data Science  |0   |CURRENT|{hr_code -> 21036, bgc_val -> 170187, interviewPanel -> 764, meetingId -> 577}   |
|FTE:d9badd2  |CATENA        |0 …

Total answers: 4
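
A sketch assuming the map keys are known up front: each key in the additional map column becomes its own top-level column.

from pyspark.sql.functions import col

keys = ["hr_code", "bgc_val", "interviewPanel", "meetingId"]

# getItem(k) pulls a single map value out; one select flattens them all.
flat = df.select(
    "empId", "organization", "h_cd", "status",
    *[col("additional").getItem(k).alias(k) for k in keys]
)
flat.show(truncate=False)

If the keys are not known in advance, they can be collected first with map_keys and a distinct pass, then fed into the same select.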

Split file based on \r into new rows

Question: I have a CSV file in my source folder and want to split it into new lines wherever there is a "\r".

Source file:
nameagegender\rkiran29male\rrekha12female\rsiva39male\r

Expected output file:
nameagegender
kiran29male
rekha12female
siva39male

Asked By: Raj || Source Answers: with open('filename.csv', 'r+') as file: data = file.readlines()[0].replace('\r', '\n')[:-1] print(data) …

Total answers: 1
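
A sketch in plain Python with hypothetical file names: open the file with newline='' so the bare \r characters survive the read, then replace them with newlines.

# newline='' disables universal-newline translation, so '\r' stays visible.
with open('filename.csv', 'r', newline='') as f:
    data = f.read().replace('\r', '\n')

with open('output.csv', 'w') as f:   # hypothetical output name
    f.write(data)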

Delete rows from Pyspark Dataframe which match to header

Question: I have a huge dataframe similar to this: l = [('20190503', 'par1', 'feat2', '0x0'), ('20190503', 'par1', 'feat3', '0x01'), ('date', 'part', 'feature', 'value'), ('20190501', 'par5', 'feat9', '0x00'), ('20190506', 'par8', 'feat2', '0x00f45'), ('date', 'part', 'feature', 'value'), ('20190501', 'par11', 'feat3', '0x000000000'), ('date', 'part', 'feature', 'value'), ('20190501', 'par3', 'feat9', …

Total answers: 1
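
A sketch of the usual approach: once the data is in a dataframe, the stray header rows are just rows whose values equal the column names, so a simple filter removes them.

# Column names taken from the embedded header tuples in the question.
df = spark.createDataFrame(l, ['date', 'part', 'feature', 'value'])

# Keep every row except the repeated header lines.
cleaned = df.filter(df.date != 'date')
cleaned.show()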