AWS Data Storage Solutions

 · 5 mins read

AWS Data Storage Solutions

Operational and Analytical Data Storages Characteristics

Operational Storage Characteristics

  • Data stored as rows
  • Low latency
  • Hight throughput
  • Highly concurrent
  • Frequent changes
  • Benefits from caching
  • Often used in enterprise critical applications

Analytical Storage Characteristics

  • Two types:
    • OLAP (ad-hoc queries)
    • DSS (long running aggregations)
  • Data stored as columns
  • Large datasets that take advantage of partitioning (e.g. parquet)
  • Frequent complex aggregations
  • Loaded in bulk or via streaming
  • Less frequent change

Operational Storage Services on AWS

1. RDS - distributed relational database service

  • Use cases
    • E-commerce, web, mobile
  • Fast OLTP database options
    • SSD-backed storage options
  • Scale
    • Vertical scaling or in other words scaling up
    • Instance and storage size determine scale
  • Reliability and durability
    • Multi-AZ
    • Automated backups and snapshots
    • Automated failover

2. DynamoDB - fully managed NoSQL database

  • Use cases
    • Ad Tech, gaming, retail, banking and finance
  • Fast NoSQL database options
    • Single-digit milisecond latency at scale
  • Scale
    • Horizontal scaling
    • Can store data without bounds
    • High performance and low cost even at extreme scale
  • Reliability and durability
    • Data replicated across three AZs
    • Global tables for multi-region replication

3. Elasticache - fully managed Redis and Memcached

  • Use cases
    • Caching, session stores, gaming real-time analytics
  • Sub-milisecond response time from in-memory data store
    • Single-digit milisecond latency at scale
  • Reliability and durability
    • Redis Elasticache offers multi-AZ with automatic failover

4. Timestream - fully managed time series database

  • Use cases
    • IoT applications, Industrial telemetry, application monitoring
  • Fast: analyze trillions of events per day
    • One tenth the cost of relational database
  • Scale:
    • Vertical scaling
    • Timestream scales up or down depending on our load
  • Reliability and durability
    • Managed service takes care of provisioning patching, etc.
    • Retention policies to manage reliability and durability

Analytical Storage Services on AWS

1. Redshift - cloud data warehouse

  • Use cases
    • Data science queries, marketing analysis
  • Fast: columnar storage technology that parallelizes queries
    • Milisecond latency queries
  • Reliability and durability
    • Data replicated within the Redshift cluster
    • Continous backup to S3

2. S3 - object storage via a web service

  • Use cases
    • Data lake, analytics, data archiving, static website
  • Fast: query structured and semi-structured data
    • Use Athena and Redshift Spectrum to query at low latency
  • Reliability and durability
    • Data replicated across three AZs in a region
    • Same-region or cross-region replication

Data Freshness

We should consider our data’s freshness when selecting our storage system components

  • Place hot data in cache (Elasticache or DAX) or NoSQL (DynamoDB)
  • Place warm data in SQL data stores (RDS)
  • Can use S3 for all types (hot, warm, cold)
  • Use S3 Glacier for colda data

Columnar storage

Drastically reduces the overall disk I/O requirements and reduces the amount of data we need to load from disk

  • In relational dabases, data blocks store values sequentially for each consecutive column making up the entire row
  • In columnar databases, each data block stores values of a single column for multiple rows

Data Access and Retrieval Patterns

Characteristics of our data. What type of date are we storing?

Structured data

  • Examples: accounting data, demograhpic info, logs, mobile device, geolocation data
  • Storage options: RDS, Redshift, S3 Data Lake

Unstructured data

  • Examples: email text, photos, video, audio, PDFs
  • Storage options: S3 Data Lake, DynamoDB

Semi-structured data

  • Examples: email metadata, digital photo metadata, video metadata, JSON data
  • Storage options: S3 Data Lake, DynamoDB

Data Storage Lifecycle

How long do we need to retain our data?

Persistent data

  • OLTP and OLAP
  • DynamoDB, RDS, Redshift, S3

Transient data

  • Cached data, streaming data consumed in near-real time
  • Elasticache (Redis, Memcached), DynamoDB Accelerator (DAX)
  • Website session infor, streaming gaming data

Archive data

  • Retained for years, typically regulatory
  • S3 Glacier

Data Access Retrieval and Latency

How fast does our retrieval need to be? Retrieval speed:

Near-real time

  • Streaming data with near-real time dashboad display

Cached data

  • Elasticache (Memcached, Redis)
  • DAX
    • Right through cache

Data Lake vs Data Warehouse

Data Warehouse

  • Optimized for relational data produced by transactional systems
  • Data structure/schema defined which optimizes fast SQL queries
  • Used for operational reporting and analysis
  • Schema on write, i.e. data is transformed before loading

Data Lake

  • Relational and non-relational data
  • Data structure/schema not defined when stored in the data lake
  • Big data analytics, text analysis, ML
  • Schema on read

Object vs Block store

Object storage

  • S3 is used for object storage: highly scalable and available
  • Store structured, unstructured, and semi-structured data
  • Web sites, mobile apps, archive, analytics applications
  • Storage via a web service

File storage

  • Elastic File System (EFS) is used for file storage: shared file systems
  • Content repositories, development environments, media stores, user home directories

Block storage

  • Elastic Block Storage (EBS) attached to EC2 instances, EFS: volume type choices
  • Redshift, Operating Systems, DBMS installs, file systems
  • HDD: throughput intensive, large I/O, sequential I/O, big data
  • SSD: high I/O per second, transaction, random access, boot volumes
  • What is the difference between HDD and SSD?