AWS Data Processing Solutions

Batch and streaming processing models

Batch processing model

Data is collected over a period of time, then run through analytics tools
Time-consuming; designed for large quantities of information that are not time-sensitive

Streaming processing model

Data is processed in a stream, a record at a time or in micro-batches of tens, hundreds, or thousands of records
Fast, designed for information that’s needed immediately
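
The difference between the two models can be sketched in a few lines of plain Python. This is only an illustration, not an AWS API: the helper names (micro_batches, batch_total, streaming_totals) are hypothetical, and the "records" are just numbers. The batch version waits for all data before producing one result; the streaming version emits an updated result after every micro-batch.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def micro_batches(records: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Yield successive lists of up to batch_size records from a stream."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    it = iter(records)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def batch_total(records: Iterable[int]) -> int:
    """Batch model: collect everything first, then compute one answer."""
    return sum(list(records))


def streaming_totals(records: Iterable[int], batch_size: int) -> Iterator[int]:
    """Streaming model: emit a running total after each micro-batch."""
    running = 0
    for batch in micro_batches(records, batch_size):
        running += sum(batch)
        yield running
```

With the records 1 through 5 and a micro-batch size of 2, the streaming version yields a partial result after each group of records, while the batch version returns a single total only once all five have been collected.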

AWS services used for batch processing

Glue Batch ETL

EMR Batch ETL

AWS services used for streaming processing

Lambda

Kinesis

EMR Streaming ETL

Coordinating your ETL jobs across Glue, EMR, and Redshift

With orchestration you can automate each step in your workflow
- Retry on errors
- Find and recover jobs that fail
- Track the steps in your workflow
- Build repeatable workflows
Respond to state changes on an EMR cluster
Use CloudWatch metrics to manage your EMR cluster
View and monitor your EMR cluster
Audit EMR API calls recorded in CloudTrail
Orchestrate Spark, Redshift, and EMR workloads using Step Functions
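
The orchestration behaviors listed above (retrying on errors, tracking each step, building repeatable workflows) can be illustrated with a minimal sketch in plain Python. This is not how Step Functions or Glue implement orchestration; run_workflow and its parameters are hypothetical names chosen for the example, and each step is simply a callable that raises on failure.

```python
import time
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None]]
HistoryEntry = Tuple[str, int, str]


def run_workflow(
    steps: List[Step], max_attempts: int = 3, delay_seconds: float = 0.0
) -> List[HistoryEntry]:
    """Run steps in order, retrying failures and recording every attempt."""
    history: List[HistoryEntry] = []
    for name, action in steps:
        for attempt in range(1, max_attempts + 1):
            try:
                action()
            except Exception as exc:
                history.append((name, attempt, f"failed: {exc}"))
                if attempt == max_attempts:
                    # Retries exhausted: stop the workflow, keep the history.
                    return history
                time.sleep(delay_seconds)
            else:
                history.append((name, attempt, "succeeded"))
                break
    return history
```

The returned history plays the role of step tracking: it shows which step failed, on which attempt, and why, which is the information needed to find and recover a failed job.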

Use workflows to create and visualize complex ETL tasks involving multiple crawlers, jobs, and triggers
A workflow manages the execution and monitoring of all its components
Glue console provides a visual representation of a workflow as a graph
Chain interdependent jobs and crawlers
- Event triggers can be fired by jobs or crawlers, and can start both jobs and crawlers
Views:
- Static: shows the design of the workflow
- Dynamic: runtime view, shows the latest run information for each of the jobs and crawlers
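
As a rough sketch of how a crawler and a job might be chained with triggers, the helpers below build request dictionaries shaped like the Glue CreateTrigger API as I understand it (fields such as Predicate, Conditions, CrawlState, and Actions). The function names and all workflow, crawler, and job names are hypothetical, and nothing here calls AWS; the dictionaries would typically be passed to boto3 as glue.create_trigger(**request).

```python
from typing import Any, Dict


def on_demand_crawler_trigger(name: str, workflow: str, crawler: str) -> Dict[str, Any]:
    """Start trigger for the workflow: runs a crawler when the workflow starts."""
    return {
        "Name": name,
        "WorkflowName": workflow,
        "Type": "ON_DEMAND",
        "Actions": [{"CrawlerName": crawler}],
    }


def job_after_crawler_trigger(
    name: str, workflow: str, crawler: str, job: str
) -> Dict[str, Any]:
    """Event trigger: fired when the crawler succeeds, then starts a job."""
    return {
        "Name": name,
        "WorkflowName": workflow,
        "Type": "CONDITIONAL",
        "StartOnCreation": True,
        "Predicate": {
            "Conditions": [
                {
                    "LogicalOperator": "EQUALS",
                    "CrawlerName": crawler,
                    "CrawlState": "SUCCEEDED",
                }
            ]
        },
        "Actions": [{"JobName": job}],
    }
```

A trigger in the other direction (a job finishing and starting a crawler) would use JobName and State in the condition and CrawlerName in the action.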

Several ways to operationalize Glue and EMR workflows

A workflow is a grouping of a set of jobs, crawlers, and triggers in Glue

You can design a complex multi-job ETL sequence that Glue can execute and track as a single entity
Create workflows using the AWS Console or the Glue API
Console lets you see the components and flow of a workflow with a graph

Use Lambda functions and CloudWatch Events to orchestrate your workflow
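
A minimal sketch of the Lambda side of this pattern follows. It assumes a CloudWatch Events rule delivers Glue job and EMR cluster state-change events to the function; the event field names (detail-type, detail.state, detail.jobName, detail.clusterId) reflect my understanding of those events and should be checked against a real sample. The returned "action" dictionaries are hypothetical: a real handler would start the next job or publish an alert instead of just returning a decision.

```python
from typing import Any, Dict


def next_action(event: Dict[str, Any]) -> Dict[str, Any]:
    """Decide what the orchestrator should do for one state-change event."""
    detail = event.get("detail", {})
    kind = event.get("detail-type")
    if kind == "Glue Job State Change":
        state = detail.get("state")
        if state == "SUCCEEDED":
            return {"action": "start_next_job", "after": detail.get("jobName")}
        if state in ("FAILED", "TIMEOUT", "STOPPED"):
            return {"action": "alert", "job": detail.get("jobName")}
    if kind == "EMR Cluster State Change":
        if detail.get("state") == "TERMINATED_WITH_ERRORS":
            return {"action": "alert", "cluster": detail.get("clusterId")}
    return {"action": "ignore"}


def lambda_handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
    """Lambda entry point; the AWS calls that act on the decision are omitted."""
    return next_action(event)
```

Keeping the decision logic in a pure function makes it easy to test locally with sample events before wiring it to an event rule.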

Use Step Functions to automate your Glue workflow

Step Functions state machine starts executing, and the input file path of the data stored in S3 is passed to the state machine
First step in the state machine triggers a Lambda function
Lambda function interacts with Apache Spark running on EMR using Apache Livy, and submits a Spark job
State machine checks the Spark job status
Based on the job status, the state machine moves to the success or failure state
Output data from the Spark job is stored in S3
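
The flow above can be sketched as an Amazon States Language definition, built here as a Python dictionary. This is one possible layout under several assumptions: the two Lambda ARNs are placeholders supplied by the caller, the status-checking Lambda is assumed to return an object with a "state" field, and the values "success" and "dead" are assumed to be the terminal Livy batch states. The state names are hypothetical.

```python
import json
from typing import Any, Dict


def spark_job_state_machine(
    submit_lambda_arn: str, status_lambda_arn: str, poll_seconds: int = 30
) -> Dict[str, Any]:
    """Build a submit / wait / check / branch state machine definition."""
    return {
        "Comment": "Submit a Spark job through Livy and poll until it finishes",
        "StartAt": "SubmitSparkJob",
        "States": {
            # Lambda submits the job to Livy; its output is kept under $.job
            "SubmitSparkJob": {
                "Type": "Task",
                "Resource": submit_lambda_arn,
                "ResultPath": "$.job",
                "Next": "WaitForJob",
            },
            "WaitForJob": {
                "Type": "Wait",
                "Seconds": poll_seconds,
                "Next": "CheckJobStatus",
            },
            # Lambda asks Livy for the current job state
            "CheckJobStatus": {
                "Type": "Task",
                "Resource": status_lambda_arn,
                "ResultPath": "$.job",
                "Next": "JobFinished",
            },
            # Branch on the reported state; keep polling otherwise
            "JobFinished": {
                "Type": "Choice",
                "Choices": [
                    {"Variable": "$.job.state", "StringEquals": "success",
                     "Next": "JobSucceeded"},
                    {"Variable": "$.job.state", "StringEquals": "dead",
                     "Next": "JobFailed"},
                ],
                "Default": "WaitForJob",
            },
            "JobSucceeded": {"Type": "Succeed"},
            "JobFailed": {
                "Type": "Fail",
                "Error": "SparkJobFailed",
                "Cause": "Livy reported that the Spark job did not complete",
            },
        },
    }


def to_asl_json(definition: Dict[str, Any]) -> str:
    """Serialize the definition to the JSON text Step Functions expects."""
    return json.dumps(definition, indent=2)
```

The input file path passed to the execution stays in the state input, and ResultPath keeps the Lambda output under a separate key so later states can read both.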

Step Functions now interacts directly with EMR
Use Step Functions to control all steps in your EMR cluster operation
Also integrates other AWS services into your workflow, e.g.
- Lambda, AWS Batch, DynamoDB, ECS/Fargate
- SNS, SQS, Glue, SageMaker, EMR, CodeBuild, Step Functions
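
With the direct EMR integration, a Task state can add a step to a cluster without a Lambda function in between. The sketch below builds such a state as a Python dictionary. It assumes the cluster ID arrives in the state input as ClusterId and that the Spark script lives at an S3 URI supplied by the caller; the function name, step name, and spark-submit arguments are illustrative choices, not the only valid ones.

```python
from typing import Any, Dict, Optional


def emr_spark_step_state(
    script_s3_uri: str, next_state: Optional[str] = None
) -> Dict[str, Any]:
    """Build a Task state that runs a Spark step on EMR and waits for it."""
    state: Dict[str, Any] = {
        "Type": "Task",
        # The .sync suffix makes Step Functions wait for the step to finish
        "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
        "Parameters": {
            # Assumption: the execution input carries the cluster ID
            "ClusterId.$": "$.ClusterId",
            "Step": {
                "Name": "RunSparkJob",
                "ActionOnFailure": "CONTINUE",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": ["spark-submit", "--deploy-mode", "cluster",
                             script_s3_uri],
                },
            },
        },
    }
    # A state must either name its successor or end the execution
    if next_state is None:
        state["End"] = True
    else:
        state["Next"] = next_state
    return state
```

The same pattern applies to the other integrated services: each gets a Task state with its own Resource ARN and Parameters block.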

Use Kinesis Analytics to gain real-time insights into your streaming data