AWS Data Storage Solutions

Operational and Analytical Data Storage Characteristics

Operational Storage Characteristics

Data stored as rows
Low latency
High throughput
Highly concurrent
Frequent changes
Benefits from caching
Often used in enterprise critical applications

Analytical Storage Characteristics

Two types:
- OLAP (ad-hoc queries)
- DSS (long running aggregations)
Data stored as columns
Large datasets that take advantage of partitioning (e.g. Parquet files)
Frequent complex aggregations
Loaded in bulk or via streaming
Less frequent change
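
As a rough illustration of why partitioning helps large analytical datasets, the sketch below lays out object keys in Hive-style `year=/month=` partitions and shows a query for a single month touching only the matching files. The `events/` prefix, the key layout, and the helper names are assumptions made for this example.

```python
# Hypothetical object keys for a dataset partitioned by year and month
# (Hive-style "column=value" path segments, a common layout for Parquet data).
def partition_key(year, month, part):
    return f"events/year={year}/month={month:02d}/part-{part}.parquet"


def prune(keys, year, month):
    """Keep only the files whose partition path matches the query filter."""
    prefix = f"events/year={year}/month={month:02d}/"
    return [k for k in keys if k.startswith(prefix)]


# Two years, twelve months each, two files per month.
all_keys = [
    partition_key(y, m, p)
    for y in (2023, 2024)
    for m in range(1, 13)
    for p in range(2)
]

# A query filtered on year=2024 and month=3 only needs the files in that partition.
selected = prune(all_keys, 2024, 3)
print(len(all_keys), len(selected))
```

The fewer files a query has to open, the less data is scanned, which is the main reason analytical stores partition on commonly filtered columns such as dates.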

Operational Storage Services on AWS

1. RDS - managed relational database service

Use cases
- E-commerce, web, mobile
Fast OLTP database options
- SSD-backed storage options
Scale
- Vertical scaling (scaling up)
- Instance and storage size determine scale
Reliability and durability
- Multi-AZ
- Automated backups and snapshots
- Automated failover

2. DynamoDB - fully managed NoSQL database

Use cases
- Ad Tech, gaming, retail, banking and finance
Fast NoSQL database options
- Single-digit millisecond latency at scale
Scale
- Horizontal scaling
- Can store data without bounds
- High performance and low cost even at extreme scale
Reliability and durability
- Data replicated across three AZs
- Global tables for multi-region replication
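
To make the idea of horizontal scaling more concrete, here is a minimal sketch of hash partitioning: each item's partition key is hashed, and the hash decides which partition stores the item. This is only an illustration. DynamoDB's internal hash function and partition management are not described in these notes, so the MD5 hash, the modulo step, and the `partition_for` helper are stand-ins.

```python
import hashlib
from collections import Counter


def partition_for(partition_key, num_partitions):
    """Map a partition key to a partition index (illustrative only)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions


# Spread 1000 hypothetical user ids over 4 partitions.
keys = [f"user#{i}" for i in range(1000)]
placement = Counter(partition_for(k, 4) for k in keys)
print(sorted(placement.items()))
```

Because the placement depends only on the key, any request for a given key can be routed straight to one partition, and adding partitions adds capacity without making a single machine bigger.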

3. ElastiCache - fully managed Redis and Memcached

Use cases
- Caching, session stores, gaming, real-time analytics
Sub-millisecond response time from in-memory data store
- Single-digit millisecond latency at scale
Reliability and durability
- ElastiCache for Redis offers Multi-AZ with automatic failover

4. Timestream - fully managed time series database

Use cases
- IoT applications, Industrial telemetry, application monitoring
Fast: analyze trillions of events per day
- As little as one tenth the cost of a relational database
Scale:
- Serverless, scales automatically
- Timestream scales up or down depending on our load
Reliability and durability
- Managed service takes care of provisioning, patching, etc.
- Retention policies to manage reliability and durability

Analytical Storage Services on AWS

1. Redshift - cloud data warehouse

Use cases
- Data science queries, marketing analysis
Fast: columnar storage technology that parallelizes queries
- Millisecond latency queries
Reliability and durability
- Data replicated within the Redshift cluster
- Continuous backup to S3

2. S3 - object storage via a web service

Use cases
- Data lake, analytics, data archiving, static website
Fast: query structured and semi-structured data
- Use Athena and Redshift Spectrum to query at low latency
Reliability and durability
- Data replicated across three AZs in a region
- Same-region or cross-region replication

Data Freshness

We should consider our data’s freshness when selecting our storage system components

Place hot data in cache (ElastiCache or DAX) or NoSQL (DynamoDB)
Place warm data in SQL data stores (RDS)
Can use S3 for all types (hot, warm, cold)
Use S3 Glacier for cold data
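
The placement guidance above can be sketched as a small decision function. The notes do not define what makes data hot, warm, or cold, so the age thresholds and the choice of "time since last access" as the freshness measure are assumptions; real cut-offs depend on the workload.

```python
from datetime import datetime, timedelta

# Hypothetical thresholds, chosen only for illustration.
HOT_MAX_AGE = timedelta(days=1)
WARM_MAX_AGE = timedelta(days=90)


def storage_tier(last_accessed, now):
    """Suggest a storage tier based on how recently the data was accessed."""
    age = now - last_accessed
    if age <= HOT_MAX_AGE:
        return "hot: ElastiCache, DAX, or DynamoDB"
    if age <= WARM_MAX_AGE:
        return "warm: RDS"
    return "cold: S3 Glacier"


now = datetime(2024, 6, 1)
for days in (0, 30, 400):
    print(days, storage_tier(now - timedelta(days=days), now))
```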

Columnar storage

Drastically reduces the overall disk I/O requirements and reduces the amount of data we need to load from disk

In row-oriented relational databases, each data block stores the values of consecutive columns in sequence, making up entire rows
In columnar databases, each data block stores values of a single column for multiple rows
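
A small sketch of the two layouts, using plain Python lists as stand-ins for data blocks, shows why an aggregation over one column reads far less data from a columnar layout. The sample table and its column names are made up for the example.

```python
# A tiny hypothetical table with 3 rows and 4 columns.
rows = [
    {"order_id": 1, "customer": "ana", "country": "PT", "amount": 20.0},
    {"order_id": 2, "customer": "bo", "country": "SE", "amount": 35.5},
    {"order_id": 3, "customer": "cy", "country": "US", "amount": 12.5},
]

# Row-oriented layout: each block holds every column value of one row.
row_blocks = [list(r.values()) for r in rows]

# Column-oriented layout: each block holds one column's values for many rows.
column_blocks = {name: [r[name] for r in rows] for name in rows[0]}

# SUM(amount) on the row layout has to load every value of every row.
values_read_row = sum(len(block) for block in row_blocks)
total_row = sum(block[3] for block in row_blocks)

# SUM(amount) on the column layout loads only the "amount" block.
values_read_col = len(column_blocks["amount"])
total_col = sum(column_blocks["amount"])

print(values_read_row, values_read_col, total_row, total_col)
```

Both layouts give the same total, but the columnar one reads 3 values instead of 12. The gap widens as the table gets more columns.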

Data Access and Retrieval Patterns

Characteristics of our data. What type of data are we storing?

Structured data

Examples: accounting data, demographic info, logs, mobile device geolocation data
Storage options: RDS, Redshift, S3 Data Lake

Unstructured data

Examples: email text, photos, video, audio, PDFs
Storage options: S3 Data Lake, DynamoDB

Semi-structured data

Examples: email metadata, digital photo metadata, video metadata, JSON data
Storage options: S3 Data Lake, DynamoDB

Data Storage Lifecycle

How long do we need to retain our data?

Persistent data

OLTP and OLAP
DynamoDB, RDS, Redshift, S3

Transient data

Cached data, streaming data consumed in near-real time
ElastiCache (Redis, Memcached), DynamoDB Accelerator (DAX)
Website session info, streaming gaming data

Archive data

Retained for years, typically regulatory
S3 Glacier
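
Archiving to S3 Glacier is commonly automated with an S3 lifecycle rule. The sketch below only builds the rule as a Python dictionary in the shape S3 lifecycle configurations use. The `audit-logs/` prefix, the 90-day transition, the seven-year expiration, and the `glacier_lifecycle_rule` helper are all hypothetical values chosen for illustration.

```python
def glacier_lifecycle_rule(prefix, archive_after_days, expire_after_days):
    """Build a lifecycle configuration that archives, then expires, objects."""
    return {
        "Rules": [
            {
                "ID": f"archive-{prefix.strip('/')}",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                # Move objects to the Glacier storage class after N days.
                "Transitions": [
                    {"Days": archive_after_days, "StorageClass": "GLACIER"}
                ],
                # Delete objects once the retention period is over.
                "Expiration": {"Days": expire_after_days},
            }
        ]
    }


config = glacier_lifecycle_rule("audit-logs/", 90, 7 * 365)
print(config["Rules"][0]["ID"], config["Rules"][0]["Expiration"])
```

With boto3, a dictionary like this could be passed as the `LifecycleConfiguration` argument of the S3 client's `put_bucket_lifecycle_configuration` call, together with the name of a bucket we own.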

Data Access Retrieval and Latency

How fast does our retrieval need to be? Retrieval speed:

Near-real time

Streaming data with near-real time dashboard display

Cached data

ElastiCache (Memcached, Redis)
DAX
- Write-through cache
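
DAX sits in front of DynamoDB as a write-through cache: a write goes to the table and to the cache in the same operation, so later reads of that item are served from memory. The class below is a minimal in-memory sketch of that pattern, not the DAX client API. A plain dictionary stands in for the table, and the class and attribute names are made up for the example.

```python
class WriteThroughCache:
    """Minimal write-through cache in front of a dict-like backing store."""

    def __init__(self, backing_store):
        self.backing_store = backing_store
        self.cache = {}
        self.store_reads = 0  # counts reads that had to go to the store

    def put(self, key, value):
        # Write-through: update the backing store and the cache together.
        self.backing_store[key] = value
        self.cache[key] = value

    def get(self, key):
        if key in self.cache:
            return self.cache[key]  # cache hit, no store access
        self.store_reads += 1  # cache miss, fall back to the store
        value = self.backing_store.get(key)
        if value is not None:
            self.cache[key] = value
        return value


table = {"user#1": "ana"}  # item written before the cache existed
cache = WriteThroughCache(table)
cache.put("user#2", "bo")

print(cache.get("user#2"), cache.store_reads)  # served from the cache
print(cache.get("user#1"), cache.store_reads)  # first read misses
print(cache.get("user#1"), cache.store_reads)  # second read hits
```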

Data Lake vs Data Warehouse

Data Warehouse

Optimized for relational data produced by transactional systems
Data structure/schema defined in advance and optimized for fast SQL queries
Used for operational reporting and analysis
Schema on write, i.e. data is transformed before loading

Data Lake

Relational and non-relational data
Data structure/schema not defined when stored in the data lake
Big data analytics, text analysis, ML
Schema on read
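
The difference between schema on write and schema on read can be sketched in a few lines. In the warehouse path the record is validated and converted before it is stored. In the lake path the raw text is stored untouched and the schema is applied only when the data is read. The `SCHEMA` mapping, the function names, and the sample record are assumptions for illustration.

```python
import json

# Hypothetical schema: column name -> conversion function.
SCHEMA = {"order_id": int, "amount": float}


def apply_schema(record):
    """Keep only the schema's columns, converted to the declared types."""
    return {name: cast(record[name]) for name, cast in SCHEMA.items()}


def load_into_warehouse(raw_line):
    """Schema on write: transform first, store the structured row."""
    return apply_schema(json.loads(raw_line))


def store_in_lake(raw_line):
    """Data lake: store the raw object as-is, with no schema applied."""
    return raw_line


def read_from_lake(raw_line):
    """Schema on read: parse and convert only at query time."""
    return apply_schema(json.loads(raw_line))


raw = '{"order_id": "7", "amount": "19.5", "note": "gift"}'
warehouse_row = load_into_warehouse(raw)
lake_object = store_in_lake(raw)
print(warehouse_row)
print(lake_object)
print(read_from_lake(lake_object))
```

A malformed record fails at load time in the warehouse path, but only at query time in the lake path. That is the practical trade-off between the two approaches.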

Object vs Block store

Object storage

S3 is used for object storage: highly scalable and available
Store structured, unstructured, and semi-structured data
Web sites, mobile apps, archive, analytics applications
Storage via a web service

File storage

Elastic File System (EFS) is used for file storage: shared file systems
Content repositories, development environments, media stores, user home directories

Block storage

Elastic Block Store (EBS) volumes attached to EC2 instances: volume type choices
Redshift, Operating Systems, DBMS installs, file systems
HDD: throughput intensive, large I/O, sequential I/O, big data
SSD: high I/O per second, transaction, random access, boot volumes
What is the difference between HDD and SSD?