Get to Know Amazon Kinesis Data Streams!

Snega S
3 min read · Feb 2, 2022

In this blog, we will look at:

  • Data Streaming
  • Amazon Kinesis Data Streams
  • Amazon Kinesis Data Analytics
  • Amazon SageMaker Feature Store
  • Streaming Data Workflow

What is Data Streaming?

Data streaming is the continuous generation of data from many different sources; the resulting streams are captured, stored, and processed for real-time analytics. AWS provides a managed data streaming service called Amazon Kinesis Data Streams.

Amazon Kinesis Data Streams

Amazon Kinesis Data Streams is a massively scalable and durable real-time data streaming service. It can continuously capture gigabytes of data per second from sources such as infrastructure logs and websites. The streaming data can then be used for analytics, anomaly detection, and other purposes. Multiple applications can consume the same Kinesis data stream, each with dedicated throughput, which scales with customer needs and reduces the average latency of consuming data. AWS calls this Enhanced Fan-Out.
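As a concrete sketch, ingesting events into a stream with boto3 might look like the following. The stream name and event fields here are hypothetical; the client is passed in as a parameter so the function is easy to exercise without AWS.

```python
import json

def put_events(kinesis, stream_name, events):
    """Send each event to the stream; the partition key decides which shard it lands on."""
    sequence_numbers = []
    for event in events:
        resp = kinesis.put_record(
            StreamName=stream_name,
            Data=json.dumps(event).encode("utf-8"),  # Kinesis payloads are raw bytes
            PartitionKey=str(event["user_id"]),      # hypothetical field; keeps a user's events ordered
        )
        sequence_numbers.append(resp["SequenceNumber"])
    return sequence_numbers
```

With the real SDK you would pass `kinesis = boto3.client("kinesis")`; records with the same partition key land on the same shard, which preserves per-key ordering.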

Amazon Kinesis ingests and stores data streams for processing, and the data can then be delivered to services such as Amazon Kinesis Data Firehose and AWS Lambda. It is easy to administer, low in cost, and provides real-time, elastic performance.

Amazon Kinesis Data Analytics

Amazon Kinesis Data Analytics makes it easy to transform and analyze streaming data in real time. It runs streaming applications continuously and scales them automatically. Because it is serverless, there are no servers to manage and no setup cost; you pay only for what you use, which makes it easy and efficient to build streaming applications.

Amazon SageMaker Feature Store

With Amazon SageMaker Feature Store, we can securely store, discover, and share features for machine learning. Raw data is transformed into meaningful features, which are then stored in the feature store. Features are organized into feature groups, which collect the common parameters used by similar applications. We can create, update, and list feature groups across machine learning applications, and we can search for and discover existing features.
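A feature group is defined by a schema of named, typed features plus a record-identifier field and an event-time field. A minimal sketch of building the `create_feature_group` request for boto3's SageMaker client; the group name, role ARN, and bucket are hypothetical:

```python
def feature_group_request(name, schema, record_id, event_time, role_arn, bucket):
    """Build the request body for sagemaker.create_feature_group()."""
    return {
        "FeatureGroupName": name,
        "RecordIdentifierFeatureName": record_id,  # uniquely identifies each record
        "EventTimeFeatureName": event_time,        # used to resolve duplicate writes
        "FeatureDefinitions": [
            {"FeatureName": feature, "FeatureType": ftype}  # "String" | "Integral" | "Fractional"
            for feature, ftype in schema.items()
        ],
        "OnlineStoreConfig": {"EnableOnlineStore": True},  # low-latency reads at inference time
        "OfflineStoreConfig": {"S3StorageConfig": {"S3Uri": f"s3://{bucket}/feature-store"}},
        "RoleArn": role_arn,
    }
```

You would then call `boto3.client("sagemaker").create_feature_group(**feature_group_request(...))`; enabling both the online and offline stores supports the real-time and batch sides of the workflow below.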

Streaming Data Workflow

Suppose the feature store has two feature groups: one for batch-aggregated features and one for real-time features aggregated from streaming applications. For a given use case, data is collected from backend databases and processed into useful features. This processing job can be written in PySpark and run on Amazon SageMaker or AWS Glue. The extracted features are used to train the model, and the batch aggregates are stored in the feature store. The trained model is then deployed in Amazon SageMaker for inference through an endpoint.

Incoming data is received by Amazon Kinesis Data Streams, which acts as a buffer holding records until they are consumed. Real-time feature generation can be implemented in SQL or Apache Flink and deployed on Kinesis Data Analytics. The resulting features are then pushed into the feature store by an AWS Lambda function that calls the PutRecord API. Since each call is synchronous, small batches of updates can be pushed in a single API call, which keeps feature values fresh; these are called streaming features.
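A minimal sketch of such a Lambda function, assuming JSON payloads and a hypothetical feature group name. The feature-store client is passed in so the handler can be exercised without AWS; in a real Lambda it would be created at module level with `boto3.client("sagemaker-featurestore-runtime")`.

```python
import base64
import json

def write_streaming_features(event, featurestore, feature_group="transactions-stream-fg"):
    """Decode Kinesis records from a Lambda event and push each one into the feature store."""
    for record in event["Records"]:
        # Kinesis delivers record data base64-encoded inside the Lambda event.
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        featurestore.put_record(
            FeatureGroupName=feature_group,
            Record=[
                {"FeatureName": name, "ValueAsString": str(value)}
                for name, value in payload.items()
            ],
        )
    return {"written": len(event["Records"])}
```

The feature store's PutRecord API takes every value as a string, so each payload field is converted before the write.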

To make predictions, another Lambda function is created with the Kinesis data stream as its trigger. For each event in the use case, the batch and streaming features are retrieved from the feature store, the required derived features and ratios are computed, and the model endpoint is invoked to make a prediction. This completes the streaming workflow built around Amazon Kinesis Data Streams.
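A sketch of that prediction step, with hypothetical feature and field names; `featurestore` and `sm_runtime` stand in for the `sagemaker-featurestore-runtime` and `sagemaker-runtime` boto3 clients:

```python
import json

def predict(payload, featurestore, sm_runtime, feature_group, endpoint_name):
    """Look up stored features for the incoming event, derive a ratio, and invoke the endpoint."""
    resp = featurestore.get_record(
        FeatureGroupName=feature_group,
        RecordIdentifierValueAsString=str(payload["user_id"]),  # hypothetical identifier
    )
    stored = {f["FeatureName"]: f["ValueAsString"] for f in resp.get("Record", [])}
    # Hypothetical derived feature: current amount relative to the stored average.
    avg = float(stored.get("avg_amount", "1.0")) or 1.0
    features = {"amount": payload["amount"], "amount_ratio": payload["amount"] / avg}
    result = sm_runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps(features),
    )
    return json.loads(result["Body"].read())
```

GetRecord returns feature values as strings from the online store, which is why the stored average is parsed back to a float before the ratio is computed.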
