New – Amazon Kinesis Data Analytics for Java

Customers are using to collect, process, and analyze real-time streaming data. In this way, they can react quickly to new information from their business, their infrastructure, or their customers. For example, Epic Games ingests more than 1.5 million game events per second for its popular online game, Fornite. With Amazon Kinesis Data Analytics you can process data […]

Customers are using Amazon Kinesis to collect, process, and analyze real-time streaming data. In this way, they can react quickly to new information from their business, their infrastructure, or their customers. For example, Epic Games ingests more than 1.5 million game events per second for its popular online game, Fornite.

With Amazon Kinesis Data Analytics you can process data in real-time using standard SQL. While SQL provides an easy way to quickly query large volumes of streaming data without learning new frameworks or languages, many customers also want to build more sophisticated data processing applications using general-purpose programming languages.

Using Java with Amazon Kinesis Data Analytics

Today, we are introducing support for Java in Amazon Kinesis Data Analytics. Now, developers can use their own Java code to create powerful real-time applications that process streaming data like continuously transforming and loading data into their data lakes, generating metrics to feed real-time gaming leaderboards, applying machine learning models to data streams from connected devices, and more.

To use this new functionality, developers build applications using open source libraries which include built-in operators for common data processing functions that allow applications to organize, transform, aggregate, and analyze data at any scale. These libraries are both open source and you can run them anywhere:

  • Apache Flink, an open source framework and engine for processing data streams.
  • AWS SDK for Java, providing Java APIs for many AWS services.

Developers can use these Java libraries within their Integrated Development Environment (IDE) of choice. Using these libraries, the following AWS services can be integrated with as little as one line of code:

  • Streaming Data Sources: Amazon Kinesis Data Streams
  • Streaming Destinations: Amazon S3, Amazon DynamoDB, Amazon Kinesis Data Streams, Amazon Kinesis Data Firehose

In addition to the pre-built AWS integrations, the Java libraries include more connectors to tools like Cassandra, ElasticSearch, RabbitMQ, Redis, and more, and the ability to build custom integrations.

Building a Kinesis Data Streams Java Application

I prepared a simple Java application that implements the “mandatory” word count example for data processing. I send some paragraphs of text in input and I get, every five seconds, the number of times each word is being used as output.

First, I create two Kinesis Data Streams:

  • TextInputStream, where I am going to send my input records
  • WordCountOutputStream, where I am going to read the output of the Java application

Here is the code of the word-count Java application. To read and write from Kinesis Data Streams, I am using the Kinesis Connector from the Apache Flink project.

public class StreamingJob { private static final String region = "us-east-1"; private static final String inputStreamName = "TextInputStream"; private static final String outputStreamName = "WordCountOutputStream"; private static DataStream<String> createSourceFromStaticConfig( StreamExecutionEnvironment env) { Properties inputProperties = new Properties(); inputProperties.setProperty(ConsumerConfigConstants.AWS_REGION, region); inputProperties.setProperty(ConsumerConfigConstants.STREAM_INITIAL_POSITION, "LATEST"); return env.addSource(new FlinkKinesisConsumer<>(inputStreamName, new SimpleStringSchema(), inputProperties)); } private static FlinkKinesisProducer<String> createSinkFromStaticConfig() { Properties outputProperties = new Properties(); outputProperties.setProperty(ConsumerConfigConstants.AWS_REGION, region); FlinkKinesisProducer<String> sink = new FlinkKinesisProducer<>(new SimpleStringSchema(), outputProperties); sink.setDefaultStream(outputStreamName); sink.setDefaultPartition("0"); return sink; } public static void main(String[] args) throws Exception { final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(); DataStream<String> input = createSourceFromStaticConfig(env); input.flatMap(new Tokenizer()) .keyBy(0) .timeWindow(Time.seconds(5)) .sum(1) .map(new MapFunction<Tuple2<String, Integer>, String>() { @Override public String map(Tuple2<String, Integer> value) throws Exception { return value.f0 + "," + value.f1.toString(); } }) .addSink(createSinkFromStaticConfig()); env.execute("Word Count"); } public static final class Tokenizer implements FlatMapFunction<String, Tuple2<String, Integer>> { @Override public void flatMap(String value, Collector<Tuple2<String, Integer>> out) { String[] tokens = value.toLowerCase().split("\\W+"); for (String token : tokens) { if (token.length() > 0) { out.collect(new Tuple2<>(token, 1)); } } } } }

The most important part of the application is the manipulation of the input object, where I apply a few DataStream Transformations:

  1. I start with a DataFrame containing the String from the input stream.
  2. I use a Tokenizer in a FlatMap to split the sentence into “words”, each word followed by the number “1”.
  3. I apply the KeyBy operator to logically partition the stream in respect to the “word”.
  4. I use a 5 seconds tumbling window.
  5. I aggregate within the window, summing up for each word the number “1” to count them.
  6. I use a simple Map for each record to join the word and the number into a comma-separated values (CSV) String that I send to the output stream.

One of the most powerful operators shown here is the KeyBy operator. It enables you to re-organize a particular stream by a specified key in real-time. This type of re-keying enables further downstream operations like aggregations, counts, and much more. This enables you to set up streaming map-reduce on different keys within the same application.

I build the Java application using Maven and load the output JAR to an Amazon Simple Storage Service (S3) bucket in the region where I want to deploy the application. In the Kinesis Data Analytics console, I create a new application and select “Flink” as runtime:

I then configure the application to use the code on my S3 bucket. The console updates the IAM role for the application to have permissions to read the code.

You can optionally add key/value properties to the configuration of the application. You can read those properties from within the application, to provide customization at deployment time.

For monitoring, I leave the default metrics. I enable logging to Amazon CloudWatch, for errors only.

Don’t forget to add permissions to the IAM role created by the console to allow the Kinesis Analytics application to read and write from the streams used for input and output, TextInputStream and WordCountOutputStream in my case.

I can now start the application with the “Run” button, and when it is running, I use a script that I prepared to put some text (I am using a description of the Amazon Kinesis platform) in the input stream:

$ python put_records.py TextInputStream
Amazon Kinesis makes it easy to collect, process, and analyze real-time, streaming data...

The behavior of my application is summarized in the console in the Application Graph, a visual representation of the data flow consisting of operators and intermediate results (complex applications, using multiple streams, have a much more interesting graph):

To read the output stream, I am using a Lambda function written in Python. I am using the one provided with the Kinesis Record Aggregation & Deaggregation Modules for AWS Lambda, that provides automatic “de-aggregation” of records aggregated by the Amazon Kinesis Producer Library (KPL).

As expected, in the CloudWatch Logs console I get the list of the words and the number of times they were used, updated every 5 seconds by the Lambda function:

Pricing and Availability

With Amazon Kinesis Data Analytics for Java, you pay only for what you use. Pricing is similar to Amazon Kinesis Data Analytics for SQL, but there are a few differences.

For Java applications, you are charged a single additional Amazon Kinesis Processing Unit (KPU) per application, used for application orchestration. Java applications are also charged for running application storage and durable application backups. Running application storage is used for Amazon Kinesis Data Analytics’ stateful processing capabilities and is charged per GB-month. Durable application backups are optional and provide a point-in-time recovery point for applications, charged per GB-month.

For example, pricing is $0.11 per KPU hour in US East (N. Virginia), and you are charged for running application storage ($0.10 per GB-month) and durable application backups ($0.023 per GB-month).

Available Now

Amazon Kinesis Data Analytics for Java is available now in US East (N. Virginia), US East (Ohio), US West (Oregon), EU West (Ireland).

I only scratched the surface of the capabilities for stream processing enabled by the support of Java in Amazon Kinesis Data Analytics. I think this is a powerful tool that can enable new use cases. Let me know what you are going to build with it!

Posted by Web Monkey