Consuming a Kinesis stream in Python

Question:

I can't seem to find a decent example that shows how to consume an AWS Kinesis stream from Python. Can someone please provide some examples I could look into?

Best

Asked By: aliirz

||

Answers:

You can use boto.kinesis:

from boto import kinesis

After you have created a stream:

Step 1: connect to AWS Kinesis:

auth = {"aws_access_key_id": "id", "aws_secret_access_key": "key"}
connection = kinesis.connect_to_region('us-east-1', **auth)

Step 2: get the stream info (for example, how many shards it has and whether it is active):

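import logging
import time

logger = logging.getLogger(__name__)

tries = 0
while tries < 10:
    tries += 1
    time.sleep(1)
    try:
        response = connection.describe_stream('stream_name')
        if response['StreamDescription']['StreamStatus'] == 'ACTIVE':
            break
    except Exception as exc:
        # Log the actual error instead of swallowing it silently
        logger.error('error while trying to describe kinesis stream : %s', exc)
else:
    # Reached only when the loop ends without hitting break
    raise TimeoutError('Stream is still not active, aborting...')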
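
The polling loop above can also be wrapped in a small reusable function. This is only a minimal sketch: it assumes nothing beyond a connection object that exposes describe_stream as used above. The names wait_until_active and FakeConnection and the sleep parameter are hypothetical, introduced for illustration; the fake stands in for a real boto connection so the snippet runs without AWS access.

```python
import time


def wait_until_active(connection, stream_name, attempts=10, delay=1.0, sleep=time.sleep):
    """Poll describe_stream until the stream is ACTIVE; return the last response."""
    for _ in range(attempts):
        response = connection.describe_stream(stream_name)
        if response['StreamDescription']['StreamStatus'] == 'ACTIVE':
            return response
        sleep(delay)
    raise TimeoutError('Stream %s is still not active' % stream_name)


class FakeConnection(object):
    """Hypothetical stand-in for a boto Kinesis connection (makes no AWS calls)."""

    def __init__(self, statuses):
        self._statuses = list(statuses)

    def describe_stream(self, stream_name):
        status = self._statuses.pop(0)
        return {'StreamDescription': {'StreamName': stream_name,
                                      'StreamStatus': status,
                                      'Shards': []}}


fake = FakeConnection(['CREATING', 'CREATING', 'ACTIVE'])
description = wait_until_active(fake, 'stream_name', sleep=lambda seconds: None)
print(description['StreamDescription']['StreamStatus'])
```

Passing sleep as a parameter is just a convenience so the function can be exercised without real waiting; with a real connection you would leave it at its default.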

Step 3: get all shard ids, and for each shard id get the shard iterator:

shard_ids = []
stream_name = None
# Where to start reading: 'TRIM_HORIZON' (oldest record) or 'LATEST' (new records only)
shard_iterator_type = 'TRIM_HORIZON'
if response and 'StreamDescription' in response:
    stream_name = response['StreamDescription']['StreamName']
    for shard in response['StreamDescription']['Shards']:
        shard_id = shard['ShardId']
        shard_iterator = connection.get_shard_iterator(stream_name, shard_id, shard_iterator_type)
        shard_ids.append({'shard_id': shard_id, 'shard_iterator': shard_iterator['ShardIterator']})
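
The shard iterator type decides where in the shard reading starts. The Kinesis API accepts TRIM_HORIZON (oldest available record), LATEST (only records added after the iterator is created), and AT_SEQUENCE_NUMBER / AFTER_SEQUENCE_NUMBER (relative to a known sequence number). Here is a rough sketch of choosing between them; iterator_args is a hypothetical helper, not part of boto, and the keyword names it returns are my assumption about boto's get_shard_iterator parameters.

```python
def iterator_args(last_sequence_number=None, from_oldest=True):
    """Build the trailing arguments for get_shard_iterator.

    Hypothetical helper: resumes after a saved sequence number when one is
    given, otherwise starts from the oldest or the newest record.
    """
    if last_sequence_number is not None:
        return {'shard_iterator_type': 'AFTER_SEQUENCE_NUMBER',
                'starting_sequence_number': last_sequence_number}
    if from_oldest:
        return {'shard_iterator_type': 'TRIM_HORIZON'}
    return {'shard_iterator_type': 'LATEST'}


print(iterator_args())
print(iterator_args(from_oldest=False))
print(iterator_args(last_sequence_number='12345'))
```

With a real connection the result would presumably be passed along as connection.get_shard_iterator(stream_name, shard_id, **iterator_args(...)).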

Step 4: read the data for each shard

limit is the maximum number of records that you want to receive per call (a single call can return up to 10 MB of data).
shard_iterator is the shard iterator from the previous step.

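def read_shard(connection, shard_iterator, limit=None):
    # Poll the shard until a non-empty batch arrives or we give up
    tries = 0
    result = []
    while tries < 100:
        tries += 1
        response = connection.get_records(shard_iterator=shard_iterator, limit=limit)
        # Always continue from the iterator returned by the last call
        shard_iterator = response['NextShardIterator']
        if len(response['Records']) > 0:
            for res in response['Records']:
                result.append(res['Data'])
            break
    return result, shard_iterator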

In your next call to get_records, you should use the shard_iterator that you received with the result of the previous get_records.
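
To make the iterator chaining concrete, here is a minimal sketch of a generator that keeps feeding NextShardIterator back into get_records. It assumes only the response keys used above (Records, Data, NextShardIterator). iter_records, FakeShard and max_empty_polls are hypothetical names for illustration; the fake shard stands in for a real connection so the snippet runs without AWS access.

```python
def iter_records(connection, shard_iterator, limit=None, max_empty_polls=3):
    """Yield record payloads from one shard, following NextShardIterator."""
    empty_polls = 0
    while shard_iterator is not None and empty_polls < max_empty_polls:
        response = connection.get_records(shard_iterator=shard_iterator, limit=limit)
        # Always continue from the iterator returned by the previous call.
        shard_iterator = response.get('NextShardIterator')
        if not response['Records']:
            empty_polls += 1
            continue
        empty_polls = 0
        for record in response['Records']:
            yield record['Data']


class FakeShard(object):
    """Hypothetical in-memory shard; iterators are just list offsets."""

    def __init__(self, payloads):
        self._payloads = list(payloads)

    def get_records(self, shard_iterator, limit=None):
        start = int(shard_iterator)
        end = len(self._payloads) if limit is None else start + limit
        batch = self._payloads[start:end]
        return {'Records': [{'Data': data} for data in batch],
                'NextShardIterator': str(start + len(batch))}


shard = FakeShard(['a', 'b', 'c', 'd', 'e'])
print(list(iter_records(shard, '0', limit=2)))
```

A real consumer would also pause between polls to stay under the per-shard read limits, and would usually keep polling instead of stopping after a few empty responses.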

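Note: a single call to get_records (even with limit = None) can return an empty list of records, so keep polling with the next iterator.
Records are routed to shards by their partition key, so records with the same partition key always end up in the same shard and are read through that shard's iterator (when you put data into the stream, you have to supply a partition key):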

connection.put_record(stream_name, data, partition_key)
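
As background on how the partition key picks a shard: Kinesis takes the MD5 hash of the partition key as a 128-bit integer and delivers the record to the shard whose hash key range contains that value. A rough sketch of that mapping follows; hash_key and shard_for_key are hypothetical helpers, and the two-shard layout is made up for illustration (real ranges come from the Shards list returned by describe_stream).

```python
import hashlib

MAX_HASH_KEY = 2 ** 128 - 1


def hash_key(partition_key):
    """MD5 of the partition key, read as a 128-bit integer."""
    digest = hashlib.md5(partition_key.encode('utf-8')).hexdigest()
    return int(digest, 16)


def shard_for_key(partition_key, shards):
    """Return the ShardId whose HashKeyRange contains the key's hash."""
    value = hash_key(partition_key)
    for shard in shards:
        key_range = shard['HashKeyRange']
        if int(key_range['StartingHashKey']) <= value <= int(key_range['EndingHashKey']):
            return shard['ShardId']
    raise ValueError('no shard covers hash key %d' % value)


# Hypothetical two-shard layout splitting the hash space in half.
middle = MAX_HASH_KEY // 2
shards = [
    {'ShardId': 'shardId-000000000000',
     'HashKeyRange': {'StartingHashKey': '0', 'EndingHashKey': str(middle)}},
    {'ShardId': 'shardId-000000000001',
     'HashKeyRange': {'StartingHashKey': str(middle + 1), 'EndingHashKey': str(MAX_HASH_KEY)}},
]

for key in ['user-1', 'user-2', 'user-1']:
    print(key, shard_for_key(key, shards))
```

The same key always lands on the same shard, which is why ordering is only guaranteed per partition key and not across the whole stream.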
Answered By: Eyal Ch

While this question has already been answered, it might be a good idea for future readers to consider using the Kinesis Client Library (KCL) for Python instead of using boto directly. It simplifies consuming from the stream when you have multiple consumer instances, and/or changing shard configurations.

https://aws.amazon.com/blogs/aws/speak-to-kinesis-in-python/

A more complete enumeration of what the KCL provides:

  • Connects to the stream
  • Enumerates the shards
  • Coordinates shard associations with other workers (if any)
  • Instantiates a record processor for every shard it manages
  • Pulls data records from the stream
  • Pushes the records to the corresponding record processor
  • Checkpoints processed records (it uses DynamoDB so your code doesn’t have to manually persist the checkpoint value)
  • Balances shard-worker associations when the worker instance count changes
  • Balances shard-worker associations when shards are split or merged

The items in bold are the ones where I think the KCL really provides non-trivial value over boto. But depending on your use case, boto may be much, much simpler.
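
To give a feel for the record-processor idea from the list above, here is a plain-Python sketch of the shape such a processor takes. It does not import the KCL. The method names (initialize, process_records, shutdown), the record keys (data, sequenceNumber) and the 'TERMINATE' reason are modelled on the Python KCL's record-processor interface as I recall it, so treat them as assumptions and check the library's own sample before relying on them. PrintingRecordProcessor and FakeCheckpointer are hypothetical names.

```python
import base64


class PrintingRecordProcessor(object):
    """Hypothetical processor with the three callbacks a KCL-style worker invokes."""

    def __init__(self):
        self.shard_id = None
        self.seen = []

    def initialize(self, shard_id):
        # Called once, before any records from this shard are delivered.
        self.shard_id = shard_id

    def process_records(self, records, checkpointer):
        # Called with each batch; checkpoint after the batch is handled.
        for record in records:
            self.seen.append(base64.b64decode(record['data']).decode('utf-8'))
        if records:
            checkpointer.checkpoint(records[-1]['sequenceNumber'])

    def shutdown(self, checkpointer, reason):
        # On a clean end of shard, record that everything was consumed.
        if reason == 'TERMINATE':
            checkpointer.checkpoint()


class FakeCheckpointer(object):
    """Stand-in for the library's checkpointer (the real one persists to DynamoDB)."""

    def __init__(self):
        self.calls = []

    def checkpoint(self, sequence_number=None):
        self.calls.append(sequence_number)


processor = PrintingRecordProcessor()
checkpointer = FakeCheckpointer()
processor.initialize('shardId-000000000000')
batch = [{'data': base64.b64encode(b'hello').decode('ascii'), 'sequenceNumber': '1'},
         {'data': base64.b64encode(b'world').decode('ascii'), 'sequenceNumber': '2'}]
processor.process_records(batch, checkpointer)
processor.shutdown(checkpointer, 'TERMINATE')
print(processor.seen, checkpointer.calls)
```

The point is that your code only handles batches and decides when to checkpoint; shard discovery, worker coordination and checkpoint storage stay with the library.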

Answered By: jumand