How to decode/deserialize Avro with Python from Kafka

Question:

I am receiving Kafka Avro messages from a remote server in Python (using the consumer from the Confluent Kafka Python library). The messages represent clickstream data as JSON-like dictionaries with fields such as user agent, location, URL, etc. Here is what a message looks like:

b'\x01\x00\x00\xde\x9e\xa8\xd5\x8fW\xec\x9a\xa8\xd5\x8fW\x1axxx.xxx.xxx.xxx\x02:https://website.in/rooms/\x02Hhttps://website.in/wellness-spa/\x02\xaa\x14\x02\x9c\n\x02\xaa\x14\x02\xd0\x0b\x02V0:j3lcu1if:rTftGozmxSPo96dz1kGH2hvd0CREXmf2\x02V0:j3lj1xt7:YD4daqNRv_Vsea4wuFErpDaWeHu4tW7e\x02\x08null\x02\nnull0\x10pageview\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x10Thailand\x02\xa6\x80\xc4\x01\x02\x0eBangkok\x02\x8c\xba\xc4\x01\x020*\xa9\x13\xd0\x84+@\x02\xec\xc09#J\x1fY@\x02\x8a\x02Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/58.0.3029.96 Chrome/58.0.3029.96 Safari/537.36\x02\x10Chromium\x02\x10Chromium\x028Google Inc. and contributors\x02\x0eBrowser\x02\x1858.0.3029.96\x02"Personal computer\x02\nLinux\x02\x00\x02\x1cCanonical Ltd.'

How do I decode it? I tried bson decode, but the string was not recognized as UTF-8, since it is a specific Avro encoding, I guess. I found https://github.com/verisign/python-confluent-schemaregistry but it only supports Python 2.7. Ideally I would like to work with Python 3.5+ and MongoDB to process and store the data, as that is my current infrastructure.

Answers:

I thought the Avro library was just for reading Avro files, but it actually solves the problem of decoding Kafka messages, as follows: I first import the libraries and pass the schema file as a parameter, then create a function to decode the message into a dictionary that I can use in the consumer loop.

import io

from confluent_kafka import Consumer, KafkaError
from avro.io import DatumReader, BinaryDecoder
import avro.schema

# Parse the schema once and reuse the reader for every message
schema = avro.schema.Parse(open("data_sources/EventRecord.avsc").read())
reader = DatumReader(schema)

def decode(msg_value):
    # Wrap the raw message bytes and decode them against the schema
    message_bytes = io.BytesIO(msg_value)
    decoder = BinaryDecoder(message_bytes)
    event_dict = reader.read(decoder)
    return event_dict

# Broker address, group id, and topic name below are placeholders
c = Consumer({'bootstrap.servers': 'localhost:9092', 'group.id': 'clickstream-consumer'})
topic = "clickstream"
c.subscribe([topic])  # subscribe() expects a list of topic names
running = True
while running:
    msg = c.poll(timeout=1.0)
    if msg is None:  # poll() returns None if no message arrived before the timeout
        continue
    if not msg.error():
        msg_value = msg.value()
        event_dict = decode(msg_value)
        print(event_dict)
    elif msg.error().code() != KafkaError._PARTITION_EOF:
        print(msg.error())
        running = False
c.close()

If you use Confluent Schema Registry and want to deserialize Avro messages produced with it, just add message_bytes.seek(5) to the decode function: Confluent prepends 5 extra bytes (a magic byte plus a 4-byte schema ID) before the usual Avro-encoded data.

def decode(msg_value):
    message_bytes = io.BytesIO(msg_value)
    message_bytes.seek(5)  # skip the magic byte and 4-byte schema ID
    decoder = BinaryDecoder(message_bytes)
    event_dict = reader.read(decoder)
    return event_dict
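
For reference, those 5 bytes are the Confluent wire format header: one magic byte (always 0) followed by a big-endian 4-byte schema ID. As a minimal sketch (the helper name is mine, not from any library), you can validate the header instead of blindly skipping it:

import struct

def split_confluent_header(msg_value: bytes):
    # Confluent wire format: magic byte 0, then a 4-byte big-endian schema ID,
    # then the plain Avro-encoded payload
    magic, schema_id = struct.unpack(">bI", msg_value[:5])
    if magic != 0:
        raise ValueError("not a Confluent Schema Registry framed message")
    return schema_id, msg_value[5:]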
Answered By: Sheng-yi Hsu

If you have access to a Confluent Schema Registry server, you can also use Confluent’s own AvroDeserializer to avoid dealing with the 5 magic bytes yourself:

from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroDeserializer

def process_record_confluent(record: bytes, src: SchemaRegistryClient, schema: str):
    # Building the deserializer per record keeps the example short;
    # in a real consumer, construct it once and reuse it
    deserializer = AvroDeserializer(schema_str=schema, schema_registry_client=src)
    return deserializer(record, None)  # returns a dict
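
A minimal usage sketch, assuming a registry at http://localhost:8081 (the URL and schema path are placeholders; msg is a message returned by the consumer's poll() as in the first answer):

from confluent_kafka.schema_registry import SchemaRegistryClient

# placeholder registry URL; point this at your own Schema Registry
src = SchemaRegistryClient({"url": "http://localhost:8081"})
schema_str = open("data_sources/EventRecord.avsc").read()

event_dict = process_record_confluent(msg.value(), src, schema_str)
print(event_dict)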
Answered By: Khristian S.

Decoding the msg_value (c.poll().value()) was failing in my case, and using the code below to decode the value worked:

import io
import avro.schema
from avro.io import DatumReader, BinaryDecoder

# jstr is the Avro schema as a JSON string; the path here is a placeholder
jstr = open("EventRecord.avsc").read()
schema = avro.schema.parse(jstr)
reader = DatumReader(schema)

message_bytes = io.BytesIO(msg.value())
message_bytes.seek(5)  # skip the 5-byte Schema Registry header
decoder = BinaryDecoder(message_bytes)
event_dict = reader.read(decoder)
print(event_dict)
Answered By: chandru selvaraj