Spark with Cassandra python setup

Question:

I am trying to use spark to do some simple computations on Cassandra tables, but I am quite lost.

I am trying to follow: https://github.com/datastax/spark-cassandra-connector/blob/master/doc/15_python.md

So I’m running the PySpark shell with:

./bin/pyspark \
  --packages com.datastax.spark:spark-cassandra-connector_2.11:2.0.0-M3

But I am not sure how to set things up from here. How do I let Spark know where my Cassandra cluster is? I’ve seen that CassandraSQLContext can be used for this, but I also read that this is deprecated.

I have read this: How to connect spark with cassandra using spark-cassandra-connector?

But if I use

import com.datastax.spark.connector._

Python says that it can’t find the module.
Can someone maybe point me in the right direction on how to set things up properly?

Asked By: SilverTear

Answers:

The Cassandra connector doesn’t provide any Python modules. All functionality is exposed through the Data Source API, and as long as the required jars are present, everything should work out of the box.

How do I let Spark know where my Cassandra cluster is?

Use the spark.cassandra.connection.host property. You can, for example, pass it as an argument to spark-submit / pyspark:

pyspark ... --conf spark.cassandra.connection.host=x.y.z.v
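When the session is created from a plain Python script rather than through the pyspark launcher, the same flags can be supplied through the PYSPARK_SUBMIT_ARGS environment variable, which PySpark reads when it starts the JVM. The sketch below only assembles that string. The helper name build_submit_args is hypothetical, the host is a placeholder, and the package coordinate is the one from the question, so it has to match your own Spark and Scala versions.

```python
import os


def build_submit_args(host, package):
    """Return a PYSPARK_SUBMIT_ARGS string (hypothetical helper)."""
    args = [
        "--packages", package,
        "--conf", f"spark.cassandra.connection.host={host}",
        "pyspark-shell",  # the value has to end with this token
    ]
    return " ".join(args)


submit_args = build_submit_args(
    "x.y.z.v",  # placeholder address of a Cassandra node
    "com.datastax.spark:spark-cassandra-connector_2.11:2.0.0-M3",
)

# Must be set before the first SparkSession / SparkContext is created.
os.environ["PYSPARK_SUBMIT_ARGS"] = submit_args
```

The variable is only consulted at JVM startup, so changing it after a session already exists has no effect.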

or set it in your session configuration:

(SparkSession.builder
    .config("spark.cassandra.connection.host", "x.y.z.v"))

Configuration such as the table name or keyspace can be set directly on the reader:

(spark.read
    .format("org.apache.spark.sql.cassandra")
    .options(table="kv", keyspace="test", cluster="cluster")
    .load())
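As a rough illustration of how the reader fits into a simple computation, the sketch below wraps the call in a small function and counts rows per key. The helper names load_table and count_rows_per_key are hypothetical, and spark is assumed to be an existing SparkSession started with the connector jars available.

```python
CASSANDRA_FORMAT = "org.apache.spark.sql.cassandra"


def load_table(spark, keyspace, table):
    """Return a DataFrame backed by keyspace.table (hypothetical helper)."""
    return (spark.read
            .format(CASSANDRA_FORMAT)
            .options(table=table, keyspace=keyspace)
            .load())


def count_rows_per_key(df, key_column):
    """A simple computation: number of rows for each value of key_column."""
    return df.groupBy(key_column).count()
```

With a live session, count_rows_per_key(load_table(spark, "test", "kv"), "key").show() would print the counts. The column name key is an assumption about the table's schema.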

From there you can follow the DataFrames documentation.

As a side note

import com.datastax.spark.connector._

is Scala syntax. It happens to parse as a valid Python import statement, which is why Python reports a missing module rather than a syntax error.
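This can be checked with the standard library alone. In the minimal sketch below, the Scala line parses as an ordinary Python import, and only the module lookup fails. It assumes no package named com is installed in the environment.

```python
import ast
import importlib

SCALA_IMPORT = "import com.datastax.spark.connector._"

# The line is syntactically valid Python, so parsing succeeds.
tree = ast.parse(SCALA_IMPORT)
module_name = tree.body[0].names[0].name  # dotted name Python would look up

# Resolving that name is what fails, because no such Python package exists.
try:
    importlib.import_module(module_name)
    import_failed = False
except ImportError as exc:
    import_failed = True
    print(type(exc).__name__, exc)
```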

Answered By: zero323
  1. Copy the pyspark-cassandra connector into spark-folder/jars.
  2. The code below connects to Cassandra.

    from pyspark.sql import SparkSession

    # Build a local session that points at the Cassandra node.
    spark = (SparkSession.builder
      .appName('SparkCassandraApp')
      .config('spark.cassandra.connection.host', 'localhost')
      .config('spark.cassandra.connection.port', '9042')
      .config('spark.cassandra.output.consistency.level', 'ONE')
      .master('local[2]')
      .getOrCreate())

    # SparkSession exposes the reader directly, so the SQLContext wrapper is unnecessary.
    ds = (spark.read
      .format('org.apache.spark.sql.cassandra')
      .options(table='emp', keyspace='demo')
      .load())

    ds.show(10)
    
Answered By: AkshayK

With username and password:

from pyspark.sql import SparkSession

# Parentheses let the chained calls span several lines.
spark = (SparkSession.builder
  .appName('SparkCassandraApp')
  .config('spark.cassandra.connection.host', 'localhost')
  .config('spark.cassandra.connection.port', '9042')
  .config("spark.cassandra.auth.username", "cassandrauser")
  .config("spark.cassandra.auth.password", "cassandrapwd")
  .master('local[2]')
  .getOrCreate())

df = (spark.read.format("org.apache.spark.sql.cassandra")
   .options(table="tablename", keyspace="keyspacename").load())

df.show()
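
To keep the username and password out of the source file, one option is to read them from environment variables and apply them to the builder. This is only a sketch: the environment variable names and both helper functions are hypothetical, while the spark.cassandra.* property names are the ones used above.

```python
import os

# Hypothetical environment variable names; use whatever your deployment provides.
ENV_TO_CONF = {
    "CASSANDRA_HOST": "spark.cassandra.connection.host",
    "CASSANDRA_PORT": "spark.cassandra.connection.port",
    "CASSANDRA_USERNAME": "spark.cassandra.auth.username",
    "CASSANDRA_PASSWORD": "spark.cassandra.auth.password",
}


def cassandra_conf(env=None):
    """Map the environment variables that are set to connector properties."""
    env = os.environ if env is None else env
    return {conf_key: env[name]
            for name, conf_key in ENV_TO_CONF.items()
            if name in env}


def apply_conf(builder, conf):
    """Call builder.config(key, value) for every entry and return the builder."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder
```

With PySpark installed, apply_conf(SparkSession.builder.appName('SparkCassandraApp'), cassandra_conf()).getOrCreate() would then produce a session configured from the environment.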
Answered By: pkwied