Query cassandra table in Databricks using python cassandra driver
Question:
I’m trying to optimize a way to query a Cassandra table when working in Databricks. I read this article https://medium.com/@yoke_techworks/cassandra-and-pyspark-5d7830512f19, in which the author suggests querying the Cassandra table one row at a time and unioning each result.
My attempt, using the Python Cassandra driver, is this:
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider
import pandas as pd

def init_cassandra_session(endpoints, keyspace, username, password, port=9042):
    auth_provider = PlainTextAuthProvider(username, password)
    cluster = Cluster(endpoints, port=port, auth_provider=auth_provider)
    cassandra_session = cluster.connect(keyspace, wait_for_all_pools=False)
    return cassandra_session

def get_rdd_values(rows):
    out_df = None
    cassandra_session = init_cassandra_session(host, keyspace, username, password)
    for row in rows:
        device_id = row['device_id']
        timestamp = row['timestamp']
        category = row['category']
        query = '''
            select * from headcounter_category_h_aggr
            where device_id = %s and timestamp = %s and category = %s
        '''
        result_query = cassandra_session.execute(query, [device_id, timestamp, category])
        if out_df is None:
            out_df = result_query
        else:
            out_df = out_df.append(result_query)
    return out_df

columns = ['device_id', 'timestamp', 'category']
data = [['SIMUL_TEST03', '2020-12-23 11:00:00', 'PERSON'], ['SIMUL_TEST03', '2020-12-23 12:00:00', 'PERSON']]
pdf = pd.DataFrame(data, columns=columns)
dfFromData1 = spark.createDataFrame(pdf)
rdd_values = dfFromData1.rdd.mapPartitions(get_rdd_values)
rdd_values.collect()
When I try to collect the results, rdd_values seems to be NoneType, so it’s not iterable.
I cannot find the error I’m making.
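For reference, `mapPartitions` expects the supplied function to return an iterable of output rows, so a generator is the natural shape; returning a driver `ResultSet` (or `None`) is what makes collection fail. A minimal sketch of that shape, with the Cassandra call replaced by a hypothetical stub `fake_execute` standing in for `cassandra_session.execute`:

```python
def fake_execute(device_id, timestamp, category):
    # Hypothetical stand-in for cassandra_session.execute(); returns result rows.
    return [{"device_id": device_id, "timestamp": timestamp,
             "category": category, "value": 1}]

def get_rdd_values(rows):
    # mapPartitions calls this once per partition with an iterator of rows.
    # Yielding makes the function return a generator, which Spark can consume.
    for row in rows:
        for result in fake_execute(row["device_id"], row["timestamp"], row["category"]):
            yield result

partition = iter([
    {"device_id": "SIMUL_TEST03", "timestamp": "2020-12-23 11:00:00", "category": "PERSON"},
    {"device_id": "SIMUL_TEST03", "timestamp": "2020-12-23 12:00:00", "category": "PERSON"},
])
out = list(get_rdd_values(partition))
print(len(out))  # one result row per input row
```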
EDIT
I resolved the issue: I changed the get_rdd_values() function like this:
def get_rdd_values(rows):
    out_df = []
    cassandra_session = init_cassandra_session(host, keyspace, username, password)
    for row in rows:
        device_id = row['device_id']
        timestamp = row['timestamp']
        category = row['category']
        query = f"select * from headcounter_category_h_aggr where device_id = '{device_id}' and timestamp = '{timestamp}' and category = '{category}'"
        result_query = cassandra_session.execute(query)
        if len(out_df) == 0:
            out_df = result_query
        else:
            out_df = out_df.append(result_query)
    return out_df
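One thing worth noting about the version above, separate from the duplicate issue: `list.append` mutates the list in place and returns `None`, so reassigning `out_df = out_df.append(...)` discards the accumulator on the second row. A quick illustration of the difference:

```python
out = []
out = out.append("row1")   # append returns None, so the reassignment loses the list
print(out)                 # None

out = []
out.extend(["row1", "row2"])   # mutate in place instead; no reassignment needed
print(out)                     # ['row1', 'row2']
```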
But now it seems to run the same query twice, or at least out_df is made of two identical elements.
EDIT 2 and solution:
After some attempts I found that making a Spark DataFrame directly from the RDD removes the duplicate rows. Here is the code:
dfFromRDD = spark.createDataFrame(rdd_values, schema=schema)
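A hedged aside: I’m not certain `createDataFrame` deduplicates by itself; if duplicate rows are the concern, Spark’s `DataFrame.dropDuplicates()` (e.g. `dfFromRDD.dropDuplicates()`) makes the intent explicit. Its default semantics are first-occurrence-wins over whole rows, which in plain Python looks like:

```python
rows = [
    ("SIMUL_TEST03", "2020-12-23 11:00:00", "PERSON", 7),
    ("SIMUL_TEST03", "2020-12-23 11:00:00", "PERSON", 7),  # exact duplicate
    ("SIMUL_TEST03", "2020-12-23 12:00:00", "PERSON", 3),
]
seen = set()
deduped = []
for r in rows:
    if r not in seen:  # whole-row comparison, like dropDuplicates() with no subset
        seen.add(r)
        deduped.append(r)
print(deduped)  # the exact-duplicate second row is removed
```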
Answers:
You shouldn’t do this – instead you should use the Spark Cassandra Connector, which provides native access to Cassandra from Spark via the DataFrame APIs (documentation for PySpark). You just need to install a version matching your Databricks Runtime (on Databricks you need to use the assembly version, for the reasons described here), and then you’ll be able to query Cassandra very easily, like this:
df = (spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(table="table_name", keyspace="ks_name")
      .load())
or integrate with Spark catalogs like this:
spark.conf.set("spark.sql.catalog.myCatalog",
"com.datastax.spark.connector.datasource.CassandraCatalog")
df = spark.read.table("myCatalog.myKs.myTab")
And the Spark Cassandra Connector will perform predicate pushdown where possible (for example, when you query by partition key).
If you need to join your dataset with a Cassandra table, you can follow the instructions on the so-called direct join outlined in the following blog post.