scala

Get current number of partitions of a DataFrame

Question: Is there any way to get the current number of partitions of a DataFrame? I checked the DataFrame javadoc (Spark 1.6) and didn't find a method for that, or did I just miss it? (In the case of JavaRDD there's a getNumPartitions() method.) Asked By: kecso || …

Total answers: 5
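A minimal sketch of the commonly cited workaround: the DataFrame itself exposes no partition count, but its underlying RDD does. The snippet assumes a Spark 2.x SparkSession for brevity; the same df.rdd.getNumPartitions() call works on a DataFrame created from a SQLContext in 1.6. Later sketches in this list assume this spark session and its sc = spark.sparkContext.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-count").getOrCreate()
sc = spark.sparkContext

df = spark.range(0, 1000)  # toy DataFrame for illustration
# The DataFrame has no partition-count method, but its backing RDD does.
print(df.rdd.getNumPartitions())
```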

How to get today -"1 day" date in sparksql?

Question: How do I get current_date - 1 day in Spark SQL, the same as cur_date()-1 in MySQL? Asked By: Vishan Rana || Source Answers: The arithmetic functions allow you to perform arithmetic operations on columns containing dates. For example, you can calculate the difference between two dates, add …

Total answers: 6
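A short sketch of one standard approach, using Spark SQL's built-in date_sub function, shown both as raw SQL and as the equivalent DataFrame-API call (assuming an existing SparkSession named spark):

```python
from pyspark.sql.functions import current_date, date_sub

# Pure SQL: subtract one day from today's date.
spark.sql("SELECT date_sub(current_date(), 1) AS yesterday").show()

# Equivalent DataFrame API call.
spark.range(1).select(date_sub(current_date(), 1).alias("yesterday")).show()
```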

How to use a Scala class inside Pyspark

Question: I've been searching for a while for a way to use a Scala class in Pyspark, and I haven't found any documentation or guide on this subject. Let's say I create a simple class in Scala that uses some apache-spark libraries, something like: …

Total answers: 2
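A hedged sketch of the usual pattern: compile the Scala class into a JAR, ship it with --jars, and reach it from Python through the Py4J gateway that PySpark already maintains (note that _jvm is an internal attribute, not a public API). The package, class, and method names below are hypothetical stand-ins.

```python
# Launch with the compiled Scala code on the classpath, e.g.:
#   spark-submit --jars simple-class.jar app.py

# _jvm is PySpark's Py4J gateway into the driver's JVM.
jvm = spark.sparkContext._jvm

# com.example.SimpleClass and exampleMethod are hypothetical names standing
# in for whatever the Scala side actually defines.
obj = jvm.com.example.SimpleClass()
print(obj.exampleMethod("hello"))
```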

Column alias after groupBy in pyspark

Question: I need the resulting data frame in the line below to have an alias name "maxDiff" for the max('diff') column after groupBy. However, the line below does not make any change, nor throw an error. grpdf = joined_df.groupBy(temp1.datestamp).max('diff').alias("maxDiff") Asked By: mhn || Source Answers: This is because you are …

Total answers: 4
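A minimal sketch of the commonly given fix: alias() on the result of max('diff') renames the resulting DataFrame rather than the column, so the aggregation should go through agg(), where the column expression itself can carry the alias. joined_df and temp1 are the names from the question.

```python
from pyspark.sql import functions as F

# agg() accepts column expressions, so the alias attaches to the column
# rather than to the resulting DataFrame.
grpdf = joined_df.groupBy(temp1.datestamp).agg(F.max("diff").alias("maxDiff"))
```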

Calling Java/Scala function from a task

Question: Background My original question here was Why does using DecisionTreeModel.predict inside a map function raise an exception? and is related to How to generate tuples of (original label, predicted label) on Spark with MLlib? When we use the Scala API, a recommended way of getting predictions for RDD[LabeledPoint] using DecisionTreeModel is …

Total answers: 1
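A hedged sketch of the standard workaround: Py4J only bridges Python and the JVM on the driver, so a JVM-backed model such as DecisionTreeModel cannot be invoked from inside a map() running on executors. Instead, predict over the whole RDD from the driver and zip the predictions back onto the labels. Here model and data stand in for the trained MLlib model and the RDD[LabeledPoint] from the question.

```python
# Predict on the driver over the entire RDD of feature vectors...
predictions = model.predict(data.map(lambda lp: lp.features))

# ...then pair each original label with its prediction.
labels_and_predictions = data.map(lambda lp: lp.label).zip(predictions)
```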

How to use JDBC source to write and read data in (Py)Spark?

Question: The goal of this question is to document: the steps required to read and write data using JDBC connections in PySpark; possible issues with JDBC sources and known solutions. With small changes these methods should work with other supported languages including Scala and …

Total answers: 3
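A minimal sketch of the DataFrameReader/DataFrameWriter jdbc API. The URL, table names, and credentials below are placeholders, and the matching JDBC driver JAR must be on the classpath (for example via --jars or spark.jars).

```python
# Placeholder connection details for a hypothetical PostgreSQL database.
url = "jdbc:postgresql://localhost:5432/mydb"
props = {"user": "spark", "password": "secret",
         "driver": "org.postgresql.Driver"}

# Read an existing table into a DataFrame.
df = spark.read.jdbc(url=url, table="source_table", properties=props)

# Write a DataFrame back out over the same connection.
df.write.jdbc(url=url, table="target_table", mode="append", properties=props)
```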

Explain the aggregate functionality in Spark (with Python and Scala)

Question: I am looking for a better explanation of the aggregate functionality that is available via Spark in Python. The example I have is as follows (using pyspark from Spark 1.2.0) sc.parallelize([1,2,3,4]).aggregate( (0, 0), (lambda acc, value: (acc[0] + value, acc[1] + 1)), (lambda …

Total answers: 9
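A sketch completing the question's truncated example: aggregate() takes a zero value, a seqOp that folds one element into a partition-local accumulator, and a combOp that merges accumulators across partitions. Here the pair tracks a running (sum, count).

```python
sum_count = sc.parallelize([1, 2, 3, 4]).aggregate(
    (0, 0),                                           # zero value: (sum, count)
    lambda acc, value: (acc[0] + value, acc[1] + 1),  # seqOp: fold in one element
    lambda a, b: (a[0] + b[0], a[1] + b[1]),          # combOp: merge partitions
)
print(sum_count)  # (10, 4), from which the mean is 10 / 4
```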

How does the pyspark mapPartitions function work?

Question: So I am trying to learn Spark using Python (Pyspark). I want to know how the function mapPartitions works, that is, what input it takes and what output it gives. I couldn't find any proper example on the internet. Let's say I have an RDD object containing …

Total answers: 4
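A minimal sketch of the contract: the function passed to mapPartitions is called once per partition, receives an iterator over that partition's elements, and must return (or yield) an iterator of output elements.

```python
def sum_partition(iterator):
    # One input iterator per partition; yield produces the output iterator.
    yield sum(iterator)

rdd = sc.parallelize([1, 2, 3, 4], 2)  # two partitions: [1, 2] and [3, 4]
print(rdd.mapPartitions(sum_partition).collect())  # [3, 7]
```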

What are the Spark transformations that cause a shuffle?

Question: I am having trouble finding, in the Spark documentation, which operations cause a shuffle and which do not. In this list, which ones cause a shuffle and which do not? Map and filter do not. However, I am not sure about the …

Total answers: 4
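A hedged illustration: narrow transformations such as map and filter keep data on its current partitions, while wide ones such as reduceByKey, groupByKey, join, distinct, and repartition move data between partitions. toDebugString makes the difference visible, since a shuffle stage appears in the lineage only for the wide case.

```python
pairs = sc.parallelize(range(100)).map(lambda x: (x % 10, x))

# Narrow: filter runs partition-locally, no shuffle stage in the lineage.
print(pairs.filter(lambda kv: kv[1] > 5).toDebugString())

# Wide: reduceByKey must co-locate equal keys, so a ShuffledRDD shows up.
print(pairs.reduceByKey(lambda a, b: a + b).toDebugString())
```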

How to turn off INFO logging in Spark?

Question: I installed Spark using the AWS EC2 guide, and I can launch the program fine using the bin/pyspark script to get to the Spark prompt, and can also complete the Quick Start guide successfully. However, I cannot for the life of me figure out how to …

Total answers: 17
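A sketch of the two commonly suggested fixes: set the level at runtime on the SparkContext, or make it permanent through Spark's log4j configuration.

```python
# Runtime fix: quiet the logs for this application only.
sc.setLogLevel("WARN")  # accepts levels such as DEBUG, INFO, WARN, ERROR, OFF
```

For a permanent change, copy conf/log4j.properties.template to conf/log4j.properties and change the line log4j.rootCategory=INFO, console to log4j.rootCategory=WARN, console.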