The modern developer is presented with a staggering number of databases with confusing names. This is an overview of the various kinds of databases and what they are used for. For simplicity, I'm leaving out niche databases such as object-oriented databases (OODBs) and vector databases, which store embedding vectors for the similarity search that LLMs and other machine learning systems rely on.
OLTP vs. OLAP
The most important distinction is between Online Transaction Processing (OLTP) and Online Analytic Processing (OLAP). OLTP refers to the standard kind of processing required to run a business, from posting a comment on social media to transferring funds at your bank from your phone. OLAP refers to analyzing historical data for insights. The following table summarizes the differences.
Feature | OLAP (Online Analytical Processing) | OLTP (Online Transaction Processing) |
--- | --- | --- |
Primary Purpose | Analyzing large volumes of historical data for insights | Managing day-to-day transactions efficiently |
Workload Type | Analytical (read-heavy) | Transactional (write-heavy) |
Query Types | Complex queries (aggregations, joins, reporting) | Simple queries (CRUD operations: Create, Read, Update, Delete) |
Data Model | Often denormalized (e.g., star or snowflake schema for data warehouses) | Highly normalized (to reduce redundancy and ensure consistency) |
Data Volume | Large-scale datasets (terabytes to petabytes) | Small to medium datasets (gigabytes to terabytes) |
Latency | Higher latency tolerated for complex queries | Low latency required for real-time transactions |
Users | Data analysts, BI tools | End users, applications, systems |
Use Cases | Business intelligence, dashboards, forecasting, trend analysis | E-commerce, banking, order processing, inventory management |
Consistency | Eventual consistency acceptable | Strong consistency required (e.g., ACID transactions) |
Scalability | Horizontal scaling for distributed data processing | Vertical scaling for faster transactions |
Examples of Databases | BigQuery, Snowflake, Redshift, Azure Synapse | MySQL, PostgreSQL, Oracle, Microsoft SQL Server |
Storage Format | Columnar (optimized for analytics) | Row-based (optimized for transactions) |
Data Updates | Batch processing, infrequent updates | Frequent updates in real-time |
Indexing | Optimized for aggregations and scanning | Optimized for individual record lookups |
In ancient times (circa 1990), OLAP was all the rage due to a new technology called the Data Warehouse. Data warehouse databases were specialized products. They used a multidimensional variant of the relational model, commonly called the OLAP cube. A data warehouse trades the space efficiency and consistency of normalized relational databases for performance by intentionally introducing redundant data. This denormalization, as seen in star and snowflake schemas, optimizes analytical queries by precomputing relationships and aggregations, enabling fast slicing, dicing, and summarization of large historical datasets.
The good news was that these new data warehouse databases gave corporations unprecedented access to historical data, which could then be used to improve marketing, develop new products, and so on. The bad news was that these products were both expensive and complex: they took a lot of effort to set up and maintain. Also, with the Internet, more and more data (web pages, video, audio, documents, ...) didn't neatly fit into the relational model.
Enter the Data Lake
This led to a new paradigm that supplanted the data warehouse: the data lake. The most important difference between a data warehouse and a data lake is that data enters a data lake with little or no processing. The key technology underpinning early data lakes was Apache Hadoop, whose HDFS is a highly distributed, redundant file system. It is the perfect technology for storing large amounts of data in any form (tables, graphs, documents, video, etc.) because it doesn't impose any schema on the data. This is known as "schema on read" as opposed to the more traditional "schema on write". Typically, you queried a data lake not with a query language but by writing a program using a programming model called MapReduce.
MapReduce is a programming model used for processing large datasets in a distributed and parallel manner. It works in two main stages:
Map Stage: Each piece of input data is processed independently by "mapper" functions, which transform the data into key-value pairs. For example, in counting words across many documents, the mapper might output pairs like ("word", 1) for each word found.
Reduce Stage: The key-value pairs from the map stage are grouped by key, and a "reducer" function processes each group to produce a final result. For example, the reducer would sum all the counts for the same word to compute its total occurrences.
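Here is a minimal Python sketch of that word-count example, simulated on a single machine (a real Hadoop job would distribute the mappers and reducers across a cluster):

```python
from collections import defaultdict

documents = ["the quick brown fox", "the lazy dog", "the fox"]

# Map stage: each document is processed independently,
# emitting ("word", 1) pairs.
mapped = []
for doc in documents:
    for word in doc.split():
        mapped.append((word, 1))

# Shuffle: group the pairs by key (the word).
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce stage: sum the counts for each word.
totals = {word: sum(counts) for word, counts in groups.items()}
print(totals)  # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```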
This model simplifies parallel processing by splitting tasks into independent operations that can be run across multiple machines. The emergence of data lakes also reflected a broader shift in how organizations approached data storage, emphasizing flexibility over the rigid schemas of traditional relational databases. Around the same time, the term NoSQL gained traction, describing a new wave of databases — like key-value, document, and columnar stores (see below) — that eschewed SQL and relational constraints in favor of scalability and adaptability. While NoSQL databases and data lakes share this emphasis on flexibility, they evolved independently, with NoSQL databases targeting operational workloads and data lakes focusing on raw data storage.
Data Lakes Become Data Swamps, Which Leads to Data Lakehouses
One of the warnings in all the Data Lake books is "don't let your data lake become a data swamp", but as might be expected with a paradigm of "throw in all the data and we'll figure it out as we need it", that often happened. This led to the next paradigm: the Data Lakehouse. A data lakehouse attempts to blend the best of data lakes (flexibility, low-cost storage) and data warehouses (performance, structure, reliability) into a single platform. It directly addresses the problem of "data swamps" by enforcing structure and governance without losing the scalability and openness of a data lake. Another factor driving data lakehouses is the increasing push to regulate the control, access, and history of data. Laws like HIPAA in the US mandate certain requirements for medical records data. The European Union's General Data Protection Regulation (GDPR) gave consumers many rights to control their data, including the "right to be forgotten", which applies to OLAP data as much as to OLTP.
While the Data Lakehouse concept brought structure and governance to raw data storage, it also underscored the need for specialized databases tailored to specific analytic and operational requirements. Organizations no longer rely on a single type of database for all tasks; instead, they adopt a diverse array of database technologies, each optimized for different use cases.
For example, some workloads demand low-latency transactional systems, while others prioritize complex analytics across massive datasets. Still others focus on relationships between data points, as in graph databases, or need to manage semi-structured data, as in document databases. This proliferation of database types reflects the growing complexity of modern data ecosystems and the need to match the right tool to the right job.
The table below provides a comparative overview of the major database types used today, their characteristics, and the kinds of tasks they are best suited for.
Feature / Database Type | Key-Value | Document | Columnar | Relational | Property Graph (Neo4j) | RDF Triplestore | Blob Storage | Modern Analytical Databases |
--- | --- | --- | --- | --- | --- | --- | --- | --- |
Schema | None | Flexible | Schema-on-write | Schema-on-write | Schema-less | Ontology-driven | None | Schema-on-write (supports nested fields) |
Data Model | Key-value pairs | JSON-like documents | Columnar storage | Rows in tables | Nodes and edges | Triples (subject-predicate-object) | Objects with metadata | Flat tables, sometimes with nested fields
Query Model | Key lookups | Field-level queries | Aggregations, analytics | SQL | Cypher | SPARQL | None (external tools) | SQL-based, optimized for OLAP
Performance | Ultra-fast lookups | Fast for JSON docs | Optimized for analytics | Great for OLTP | Great for graph traversal and algorithms | Moderate to high | High throughput | Real-time analytics, distributed execution |
Scalability | Very high | Very high | Very high | Moderate | Moderate to high | Moderate to high | Virtually unlimited | Virtually unlimited |
Use Cases | Caching, sessions (primarily OLTP) | CMS, logging (primarily OLTP) | Analytics, time-series (primarily OLAP) | Transactions (primarily OLTP) | Primarily OLAP but often used for OLTP in complex systems | Primarily OLAP but often used for OLTP in complex systems | Data lakes, backups (neutral; supports OLAP and OLTP tools) | Analytics, BI, hybrid workloads (primarily OLAP)
Example Products | Redis, DynamoDB, RocksDB | MongoDB, Couchbase, RavenDB | Apache HBase, ClickHouse, Druid | MySQL, PostgreSQL, Oracle | Neo4j, TigerGraph | AllegroGraph, Stardog, GraphDB | Amazon S3, GCS, Azure Blob | BigQuery, Snowflake, Redshift, Databricks SQL, Synapse Analytics
Overview of Modern Databases
This section describes each of the database types listed in the table.
Key-Value Databases
History: Key-value databases are one of the simplest and oldest forms of databases, originating from early caching systems and databases like Berkeley DB. They gained prominence with modern distributed systems like Amazon DynamoDB, which was pivotal for Amazon's scaling efforts.
Key Idea: These databases store data as a collection of key-value pairs, similar to a hash map. They are optimized for ultra-fast lookups and are ideal for use cases like session management and caching.
Comparison: Compared to other databases, key-value systems sacrifice querying flexibility for speed and simplicity, making them unsuitable for complex queries or for modeling relationships between data items.
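As a concrete illustration, here is a minimal sketch of session caching using the redis-py client; it assumes a Redis instance on localhost, and the key name and session contents are made up:

```python
import json
import redis  # pip install redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Store a session object under a key, expiring after one hour.
session = {"user_id": 42, "cart": ["sku-123"]}  # hypothetical data
r.set("session:abc123", json.dumps(session), ex=3600)

# Lookups are by key only; there is no built-in way to ask
# "find all sessions containing sku-123".
cached = json.loads(r.get("session:abc123"))
print(cached["user_id"])  # 42
```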
Document Databases
History: Emerging in the mid-2000s, document databases like MongoDB and Couchbase grew in response to the limitations of relational databases in handling semi-structured data, such as JSON or XML.
Key Idea: They store data as self-contained documents, at first XML documents but now usually JSON, allowing each document to have its own structure. This flexibility made them popular for content management systems and applications with evolving schemas.
Comparison: Unlike key-value stores, document databases allow querying within the document structure, but they lack the strict relationships and joins found in relational databases.
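For example, here is a sketch using pymongo; it assumes a local MongoDB instance, and the collection and field names are hypothetical:

```python
from pymongo import MongoClient  # pip install pymongo

client = MongoClient("mongodb://localhost:27017")
articles = client["cms"]["articles"]

# Each document carries its own structure; no table schema is declared up front.
articles.insert_one({
    "title": "Intro to Data Lakes",
    "tags": ["data", "hadoop"],
    "author": {"name": "Alice", "email": "alice@example.com"},
})

# Queries can reach inside the document, including nested fields and arrays.
for doc in articles.find({"tags": "hadoop", "author.name": "Alice"}):
    print(doc["title"])
```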
Columnar Databases
History: Columnar databases gained attention in the 2000s with the rise of data warehouses and analytics needs. Wide-column systems like Apache Cassandra and HBase were inspired by Google's Bigtable, while column-oriented analytical engines like ClickHouse and Druid were developed for real-time analytics.
Key Idea: Data is stored by columns instead of rows, enabling efficient aggregation and analytical queries. They are optimized for OLAP workloads and time-series data.
Comparison: Unlike document or key-value databases, columnar databases excel in scanning large datasets for analytics but are not suitable for transactional workloads.
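The toy Python sketch below (no particular product, just an illustration) shows why the layout matters: with a columnar layout, an aggregation only has to scan the one column it needs, whereas a row store must read every field of every row from disk.

```python
# Row-oriented layout: on disk, each row is stored as a unit, so summing
# one field still means reading whole rows.
rows = [
    {"ts": "2024-01-01", "region": "EU", "sales": 120.0},
    {"ts": "2024-01-01", "region": "US", "sales": 310.0},
    {"ts": "2024-01-02", "region": "EU", "sales": 95.0},
]
total_from_rows = sum(r["sales"] for r in rows)

# Column-oriented layout: each column is stored (and scanned) separately,
# so the aggregation touches only the sales column.
columns = {
    "ts":     ["2024-01-01", "2024-01-01", "2024-01-02"],
    "region": ["EU", "US", "EU"],
    "sales":  [120.0, 310.0, 95.0],
}
total_from_columns = sum(columns["sales"])

print(total_from_rows, total_from_columns)  # 525.0 525.0
```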
Relational Databases
History: Relational databases were formalized in the 1970s by Edgar Codd’s relational model, with early systems like IBM’s System R evolving into modern RDBMSs like Oracle, MySQL, and PostgreSQL.
Key Idea: Data is stored in tables with strict schemas, and relationships are managed using foreign keys. Relational databases enforce ACID properties, making them reliable for transactional systems.
Comparison: Unlike the flexible schemas of document or key-value databases, relational databases require predefined schemas and are optimized for OLTP workloads.
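A small sketch using Python's built-in sqlite3 module (the table names are made up) shows the predefined schema, the foreign key relationship, and an ACID transaction:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# Schema is declared up front; the foreign key enforces the relationship.
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("""CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(id),
    total REAL NOT NULL)""")

# A transaction: either both inserts commit or neither does.
with conn:
    conn.execute("INSERT INTO customers (id, name) VALUES (1, 'Alice')")
    conn.execute("INSERT INTO orders (customer_id, total) VALUES (1, 19.99)")

# Joins reassemble the normalized data.
for name, total in conn.execute(
        "SELECT c.name, o.total FROM orders o JOIN customers c ON c.id = o.customer_id"):
    print(name, total)
```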
Property Graph Databases (Neo4j)
History: Graph databases emerged in the 2000s with systems like Neo4j, designed to handle highly connected data. Inspired by graph theory, they became popular for social networks and recommendation systems.
Key Idea: Data is stored as nodes and edges, with relationships as first-class citizens. They are optimized for traversing and querying connections between entities.
Comparison: Graph databases handle relationships more efficiently than relational databases and are ideal for implementing algorithms from mathematical graph theory. They overlap quite a bit with RDF triplestores but lack the deep logical reasoning capabilities available to RDF via OWL and SHACL.
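As an illustration, here is a sketch using the official neo4j Python driver and Cypher; the URI, credentials, labels, and names are placeholders, and it assumes a running Neo4j instance:

```python
from neo4j import GraphDatabase  # pip install neo4j

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Relationships (FOLLOWS) are first-class citizens, created alongside the nodes.
    session.run(
        "MERGE (a:Person {name: $a}) "
        "MERGE (b:Person {name: $b}) "
        "MERGE (a)-[:FOLLOWS]->(b)",
        a="Alice", b="Bob",
    )

    # Traversal query: who does Alice follow, directly or within two hops?
    result = session.run(
        "MATCH (p:Person {name: $name})-[:FOLLOWS*1..2]->(other) "
        "RETURN DISTINCT other.name AS name",
        name="Alice",
    )
    for record in result:
        print(record["name"])

driver.close()
```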
RDF Triplestores
History: Rooted in the Semantic Web movement of the late 1990s, triplestores like AllegroGraph and Stardog were designed to store RDF triples for representing structured knowledge.
Key Idea: Data is modeled as subject-predicate-object triples, enabling formal reasoning and adherence to ontologies like OWL. They are used in domains requiring semantic precision, like healthcare and AI.
Comparison: Unlike property graph databases, RDF triplestores prioritize semantic relationships and reasoning. They can model graphs explicitly by reifying the edges in a graph (creating a class to model the edge). However, they are not as intuitive for sophisticated graph algorithms.
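Here is a small sketch using the rdflib library; the namespace and facts are hypothetical, and a production triplestore such as those listed above exposes the same SPARQL interface over much larger graphs:

```python
from rdflib import Graph, Namespace  # pip install rdflib
from rdflib.namespace import RDF

EX = Namespace("http://example.org/")
g = Graph()

# Every fact is a subject-predicate-object triple.
g.add((EX.aspirin, RDF.type, EX.Drug))
g.add((EX.aspirin, EX.treats, EX.headache))
g.add((EX.ibuprofen, RDF.type, EX.Drug))
g.add((EX.ibuprofen, EX.treats, EX.inflammation))

# SPARQL query: which drugs treat a headache?
results = g.query("""
    PREFIX ex: <http://example.org/>
    SELECT ?drug WHERE { ?drug a ex:Drug ; ex:treats ex:headache . }
""")
for row in results:
    print(row.drug)  # http://example.org/aspirin
```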
Blob Storage
History: Cloud-native object storage systems like Amazon S3, Google Cloud Storage, and Azure Blob Storage became prominent in the 2010s, replacing Hadoop’s HDFS for data lake architectures.
Key Idea: Blob storage systems store raw, unstructured data in a flat namespace, optimized for scalability and cost efficiency. They provide no inherent query capabilities but are often used as foundational storage for other tools.
Comparison: Unlike other database types, blob storage is not a database but a flexible repository for raw data that external tools can query.
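For example, a sketch using boto3 against Amazon S3; the bucket and key names are hypothetical, and it assumes AWS credentials are already configured:

```python
import boto3  # pip install boto3

s3 = boto3.client("s3")

# Store a raw object: any bytes at all, with no schema imposed.
s3.put_object(
    Bucket="my-data-lake",                    # hypothetical bucket
    Key="raw/events/2024-01-01/events.json",  # a flat key, not a real directory
    Body=b'{"event": "page_view", "user": 42}',
)

# Retrieval is by key; any querying happens in external tools (Spark, Athena, etc.).
obj = s3.get_object(Bucket="my-data-lake", Key="raw/events/2024-01-01/events.json")
print(obj["Body"].read())
```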
Modern Analytical Databases
History: Systems like Google BigQuery, Snowflake, and Amazon Redshift emerged in the 2010s as cloud-native data warehouses optimized for OLAP workloads. They were inspired by the limitations of traditional data warehouses in handling large-scale, distributed datasets.
Key Idea: These databases combine columnar storage, distributed query execution, and serverless scalability, allowing fast analytics over petabyte-scale data.
Comparison: Unlike traditional OLAP databases or data warehouses, modern analytical databases integrate seamlessly with cloud ecosystems, offering real-time analytics with minimal management overhead.
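As an illustration, a sketch using the google-cloud-bigquery client; the project, dataset, and table names are hypothetical, and it assumes Google Cloud credentials are configured:

```python
from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()

# A typical OLAP query: aggregate a large table. The engine scans only the
# referenced columns and distributes the work across many workers.
query = """
    SELECT region, SUM(amount) AS total_sales
    FROM `my_project.sales.transactions`   -- hypothetical table
    GROUP BY region
    ORDER BY total_sales DESC
"""
for row in client.query(query).result():
    print(row.region, row.total_sales)
```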
What's the Point?
I started this a while ago when someone told me that RDF triplestores weren't NoSQL databases. I realized that I needed to get up to speed on the various kinds of databases and as often happens, the more I learned, the more I realized how much I didn't know. This summarizes what I've learned so far and I hope it will help others who may also be somewhat confused by all the types of databases out there.
However, there is one more point I want to make. This is just an opinion; it would require a new post to explain why I think this, but I hope it will be a motivating point for many readers. The more I look at all these kinds of databases, the more I think the Semantic Web standards are severely underutilized. OWL, SPARQL, and SHACL have several features, which I hope to go into in future posts, that IMO address major issues such as "schema on read". For example, despite their flexibility and non-relational nature, RDF triplestores are generally not categorized as NoSQL databases. This distinction arises because triplestores are perceived as strictly schema-driven, with schemas defined in formal logic, rather than having the ad hoc, schema-optional flexibility that defines most NoSQL systems. As a community, I think we need to start emphasizing the agility and flexibility these languages have; if that were better understood, the technology could play a far bigger role in industry.