Which cloud database should you use?

The fundamental principle of cloud systems is a focus on multiple, disposable, and replaceable machines. This has direct consequences for the implementation techniques, and therefore the capabilities of database systems implemented in the cloud.

Traditional databases can be roughly classified as parallel-first (for example, MongoDB or Teradata) or single-system first (for example, PostgreSQL or MySQL), often with scale later (for example, Redshift, Greenplum). Each category has limitations inherent to its core design. The extent of these limitations is partially a function of maturity. However, for certain core architectural decisions, particular features may not be efficiently supportable.

For example, Greenplum has sequences, but Redshift does not, despite both being PostgreSQL derivatives. BigQuery has no sequences, but Teradata does (although they aren’t truly sequential, in the traditional sense).

Cloud databases fall into the same categories, with a distinct bias towards parallel-first for new systems. The fundamental properties of cloud systems are parallelism for scale and replaceability of machines.

Within the single-system-first category, cloud instantiations tend to focus on managed cost, upgrade, and reliability (RPO/RTO) of the traditional single-machine product, such as Heroku PostgreSQL, Amazon Aurora (PostgreSQL/MySQL), Google Cloud SQL (PostgreSQL/MySQL), and Azure SQL (SQL Server).

Within the parallel category, there are effectively two subcategories: the SQL/relational category (BigQuery, Snowflake, Redshift, Spark, Azure Synapse) and the DHT/NoSQL (BigTable, Dynamo, Cassandra, Redis) category. This distinction has less to do with the presence or absence of a SQL-like language and more to do with whether the physical layout of the data within the system is tuned for single-row access by hashing for fast lookups on a key, or bulk access using sort-merge and filter operations.

Copyright © 2021 IDG Communications, Inc.

Source link