Query cassandra table in Databricks using python cassandra driver
Question:
I’m trying to optimize a way to query a Cassandra table when working in Databricks. I read this article https://medium.com/@yoke_techworks/cassandra-and-pyspark-5d7830512f19, in which the author suggests querying the Cassandra table one row at a time and unioning each result.
My attempt, using the Python Cassandra driver, is this:
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider
import pandas as pd

def init_cassandra_session(endpoints, keyspace, username, password, port=9042):
    auth_provider = PlainTextAuthProvider(username, password)
    cluster = Cluster(endpoints, port=port, auth_provider=auth_provider)
    cassandra_session = cluster.connect(keyspace, wait_for_all_pools=False)
    return cassandra_session

def get_rdd_values(rows):
    out_df = None
    cassandra_session = init_cassandra_session(host, keyspace, username, password)
    for row in rows:
        device_id = row['device_id']
        timestamp = row['timestamp']
        category = row['category']
        query = '''
            select * from headcounter_category_h_aggr
            where device_id = %s and timestamp = %s and category = %s
        '''
        result_query = cassandra_session.execute(query, [device_id, timestamp, category])
        if out_df is None:
            out_df = result_query
        else:
            out_df = out_df.append(result_query)
    return out_df

columns = ['device_id', 'timestamp', 'category']
data = [['SIMUL_TEST03', '2020-12-23 11:00:00', 'PERSON'], ['SIMUL_TEST03', '2020-12-23 12:00:00', 'PERSON']]
pdf = pd.DataFrame(data, columns=columns)
dfFromData1 = spark.createDataFrame(pdf)
rdd_values = dfFromData1.rdd.mapPartitions(get_rdd_values)
rdd_values.collect()
When I try to collect the results, rdd_values seems to be NoneType, so it’s not iterable.
I cannot find the error I’m making.
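For reference, `mapPartitions` expects the supplied function to return an iterable of output rows, so a generator is the natural shape; returning a driver `ResultSet` (or `None`) is what makes collection fail. A minimal sketch of that shape, with the Cassandra call replaced by a hypothetical stub `fake_execute` standing in for `cassandra_session.execute`:

```python
def fake_execute(device_id, timestamp, category):
    # Hypothetical stand-in for cassandra_session.execute(); returns result rows.
    return [{"device_id": device_id, "timestamp": timestamp,
             "category": category, "value": 1}]

def get_rdd_values(rows):
    # mapPartitions calls this once per partition with an iterator of rows.
    # Yielding makes the function return a generator, which Spark can consume.
    for row in rows:
        for result in fake_execute(row["device_id"], row["timestamp"], row["category"]):
            yield result

partition = iter([
    {"device_id": "SIMUL_TEST03", "timestamp": "2020-12-23 11:00:00", "category": "PERSON"},
    {"device_id": "SIMUL_TEST03", "timestamp": "2020-12-23 12:00:00", "category": "PERSON"},
])
out = list(get_rdd_values(partition))
print(len(out))  # one result row per input row
```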
EDIT
I resolved the issue: I changed the get_rdd_values() function like this:
def get_rdd_values(rows):
    out_df = []
    cassandra_session = init_cassandra_session(host, keyspace, username, password)
    for row in rows:
        device_id = row['device_id']
        timestamp = row['timestamp']
        category = row['category']
        query = f"select * from headcounter_category_h_aggr where device_id = '{device_id}' and timestamp = '{timestamp}' and category = '{category}'"
        result_query = cassandra_session.execute(query)
        if len(out_df) == 0:
            out_df = result_query
        else:
            out_df = out_df.append(result_query)
    return out_df
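One thing worth noting about the version above, separate from the duplicate issue: `list.append` mutates the list in place and returns `None`, so reassigning `out_df = out_df.append(...)` discards the accumulator on the second row. A quick illustration of the difference:

```python
out = []
out = out.append("row1")   # append returns None, so the reassignment loses the list
print(out)                 # None

out = []
out.extend(["row1", "row2"])   # mutate in place instead; no reassignment needed
print(out)                     # ['row1', 'row2']
```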
But now it seems to run the same query twice, or at least out_df is made of two identical elements.
EDIT 2 and solution:
After some attempts I found that making a Spark DataFrame directly from the RDD removes the duplicate rows. Here is the code:
dfFromRDD = spark.createDataFrame(rdd_values, schema=schema)
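A hedged aside: I’m not certain `createDataFrame` deduplicates by itself; if duplicate rows are the concern, Spark’s `DataFrame.dropDuplicates()` (e.g. `dfFromRDD.dropDuplicates()`) makes the intent explicit. Its default semantics are first-occurrence-wins over whole rows, which in plain Python looks like:

```python
rows = [
    ("SIMUL_TEST03", "2020-12-23 11:00:00", "PERSON", 7),
    ("SIMUL_TEST03", "2020-12-23 11:00:00", "PERSON", 7),  # exact duplicate
    ("SIMUL_TEST03", "2020-12-23 12:00:00", "PERSON", 3),
]
seen = set()
deduped = []
for r in rows:
    if r not in seen:  # whole-row comparison, like dropDuplicates() with no subset
        seen.add(r)
        deduped.append(r)
print(deduped)  # the exact-duplicate second row is removed
```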
Answers:
You shouldn’t do this – instead you should use the Spark Cassandra Connector, which provides native access to Cassandra from Spark via the DataFrame APIs (documentation for PySpark). You just need to install a version matching your Databricks Runtime (on Databricks you need to use the assembly version, for the reasons described here), and then you’ll be able to query Cassandra very easily, like this:
df = (spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(table="table_name", keyspace="ks_name")
      .load())
or integrate with Spark catalogs like this:
spark.conf.set("spark.sql.catalog.myCatalog",
"com.datastax.spark.connector.datasource.CassandraCatalog")
df = spark.read.table("myCatalog.myKs.myTab")
And the Spark Cassandra Connector will perform predicate pushdown where possible (for example, when you query by partition key).
If you need to join your dataset with a Cassandra table, you can follow the instructions on the so-called direct join outlined in the following blog post.