102. Kinesis Basics

Qingye Jiang (John) edited this page Jun 15, 2018 · 14 revisions

(1) Streaming Data

Streaming data is data that is generated continuously by thousands of data sources, which typically send in the data records simultaneously and in small sizes (on the order of kilobytes). Streaming data includes a wide variety of data, such as log files generated by customers using your mobile or web applications, e-commerce purchases, in-game player activity, information from social networks, financial trading floors, or geo-spatial services, and telemetry from connected devices or instrumentation in data centres.

This data needs to be processed sequentially and incrementally on a record-by-record basis or over sliding time windows, and used for a wide variety of analytics including correlations, aggregations, filtering, and sampling. Information derived from such analysis gives companies visibility into many aspects of their business and customer activity, such as service usage (for metering/billing), server activity, website clicks, and the geo-location of devices, people, and physical goods, and enables them to respond promptly to emerging situations. For example, businesses can track changes in public sentiment on their brands and products by continuously analyzing social media streams, and respond in a timely fashion as the necessity arises.

The Producer/Consumer model.

(2) Distributed System Overview

Vertical scaling vs horizontal scaling.

Partitioning / sharding / hash algorithm / request router.

(3) Create a Kinesis Stream

Read the following AWS documentation to understand what a Kinesis stream is.

In your Kinesis console, create a Kinesis stream with 2 shards. Then use the AWS CLI to practice the following operations. For more information on the AWS CLI commands for Kinesis streams, please refer to the following AWS documentation:

$ aws kinesis list-streams
$ aws kinesis describe-stream --stream-name [stream-name]
$ for i in `seq 01 99`; do
> aws kinesis put-record --stream-name [stream-name] --data "Test Data $i" --partition-key "$i"
> done
$ aws kinesis get-shard-iterator --stream-name [stream-name] --shard-id shardId-000000000001 --shard-iterator-type TRIM_HORIZON
$ aws kinesis get-records --shard-iterator [shard-iterator] --limit 10

It should be noted that the "Data" field you see in the CLI output is base64-encoded; you will need to decode it to view the actual payload. After you read some records from the shard, the response includes a NextShardIterator, which you use to continue reading from the current position.
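Decoding the payload is a one-liner. The sample string below is the base64 encoding of "Test Data 42", i.e. what one of the records written by the loop above would look like in the get-records output:

```shell
# "Data" in the get-records output is base64-encoded; decode it to
# recover the original payload. "VGVzdCBEYXRhIDQy" encodes "Test Data 42".
echo "VGVzdCBEYXRhIDQy" | base64 --decode
# prints: Test Data 42
```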

Use the AWS CLI or the AWS Console to change the number of shards in a stream. Compare the output of "aws kinesis describe-stream" before and after to see what happens when the number of shards changes.

Amazon Kinesis Data Streams uses the partition key as input to a hash function that maps the partition key and associated data to a specific shard. Specifically, an MD5 hash function is used to map partition keys to 128-bit integer values and to map associated data records to shards. As a result of this hashing mechanism, all data records with the same partition key map to the same shard within the stream.
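You can reproduce this mapping by hand for the 2-shard stream created above. By default, shard 0 owns hash keys [0, 2^127) and shard 1 owns [2^127, 2^128), so the first hex digit of the MD5 digest is enough to tell which shard a key lands on. This is a sketch, not part of any Kinesis API; the partition key "abc" is just an example:

```shell
# Sketch of the MD5-based mapping for a 2-shard stream, where shard 0
# owns hash keys [0, 2^127) and shard 1 owns [2^127, 2^128).
KEY="abc"   # example partition key
# md5sum prints the 128-bit digest as 32 hex characters; keep the digest only.
DIGEST=$(printf '%s' "$KEY" | md5sum | cut -d' ' -f1)
# The first hex digit decides which half of the hash-key range the key
# falls in: 0-7 maps to shard 0, 8-f maps to shard 1.
case "${DIGEST:0:1}" in
  [0-7]) SHARD="shardId-000000000000" ;;
  *)     SHARD="shardId-000000000001" ;;
esac
echo "partition key $KEY -> $SHARD"
```

Compare the result against the HashKeyRange of each shard in the describe-stream output to confirm the mapping.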

(4) Amazon Kinesis Agent

The Amazon Kinesis Agent is a stand-alone Java software application that offers an easier way to collect and ingest data into Amazon Kinesis services, including Amazon Kinesis Streams and Amazon Kinesis Firehose.

Launch an EC2 instance and install the Apache web server. Install the Amazon Kinesis Agent and configure it to push the Apache logs (/var/log/apache2/*) to your Kinesis stream. Once you have this working, create a new AMI from the EC2 instance, create a new launch configuration with the new AMI, and update your Auto Scaling group to use the new launch configuration. Create an ELB with at least 2 web servers in the back end.
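The agent is configured through /etc/aws-kinesis/agent.json. A minimal sketch for the setup above might look like the following; the stream name is an assumption, so replace it with your own:

```json
{
  "flows": [
    {
      "filePattern": "/var/log/apache2/*",
      "kinesisStream": "my-apache-logs"
    }
  ]
}
```

After editing the file, restart the agent (sudo service aws-kinesis-agent restart) and watch /var/log/aws-kinesis-agent/aws-kinesis-agent.log to confirm that records are being shipped. The agent also needs IAM permission to put records into the stream and read permission on the log files.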

Generate some test traffic to your ELB using curl, wget, or ab. Again, use the AWS CLI to read from the stream. As you can see, you are now aggregating the Apache logs from multiple web servers to the same Kinesis stream. In other words, the Kinesis stream now becomes your data sink for Apache logs.

(5) Data Generation

Design a bash script to ingest 100,000 records into your Kinesis stream. Try to achieve an even distribution across all shards. Collect the necessary data to prove that you do have even data distribution.
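Before writing to the stream, you can check locally whether your partition-key scheme would spread records evenly. The sketch below simulates the MD5 mapping for a 2-shard stream with 200 sequential keys (the key format "key-$i" is an arbitrary choice); in the real script, each key would go to an aws kinesis put-record call instead:

```shell
# Simulate how 200 distinct partition keys spread across a 2-shard stream.
# Shard 0 owns hash keys whose first hex digit is 0-7; shard 1 owns 8-f.
shard0=0
shard1=0
for i in $(seq 1 200); do
  KEY="key-$i"                                  # arbitrary distinct key per record
  FIRST=$(printf '%s' "$KEY" | md5sum | cut -c1) # first hex digit of the digest
  case "$FIRST" in
    [0-7]) shard0=$((shard0 + 1)) ;;
    *)     shard1=$((shard1 + 1)) ;;
  esac
done
echo "shard 0: $shard0 records, shard 1: $shard1 records"
```

Because MD5 output is effectively uniform, distinct keys land close to a 50/50 split, and the same idea scales to 100,000 records and more shards. To prove even distribution on the real stream, compare the number of records you can read back from each shard, or look at per-shard CloudWatch metrics (with enhanced monitoring enabled).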

Generate sufficient write pressure on your Kinesis data stream until you consistently encounter write throttling. (Each shard accepts writes of up to 1,000 records or 1 MB per second; beyond that, put-record calls fail with ProvisionedThroughputExceededException.)

AWS provides a Kinesis Data Generator (KDG) to generate data for ingestion into Kinesis streams. Please set this up in the us-east-1 region and send some data into your Kinesis stream. The KDG extends faker.js, an open-source random data generator. For full documentation of the items that can be "faked" by faker.js, see the faker.js documentation.

How do you read records in the same sequence in which they entered your stream?
