
The Complex Landscape of Modern Databases: Just because your database doesn't use SQL doesn't mean it's a NoSQL database

By Michael DeBellis

The modern developer is presented with a staggering number of databases with confusing names. This is an overview of the various kinds of databases and what they are used for. For simplicity, I'm leaving out niche databases such as OODBs and vector databases, which store vectors, matrices, and tensors for the fast numeric operations needed by LLMs and other machine learning systems.

OLTP vs. OLAP

The most important distinction is between Online Transaction Processing (OLTP) and Online Analytic Processing (OLAP). OLTP refers to the standard kind of processing required to run a business, from posting a comment on social media to transferring funds at your bank with your phone. OLAP refers to processing historical data. The following table summarizes the difference.

| Feature | OLAP (Online Analytical Processing) | OLTP (Online Transaction Processing) |
|---|---|---|
| Primary Purpose | Analyzing large volumes of historical data for insights | Managing day-to-day transactions efficiently |
| Workload Type | Analytical (read-heavy) | Transactional (write-heavy) |
| Query Types | Complex queries (aggregations, joins, reporting) | Simple queries (CRUD operations: Create, Read, Update, Delete) |
| Data Model | Often denormalized (e.g., star or snowflake schema for data warehouses) | Highly normalized (to reduce redundancy and ensure consistency) |
| Data Volume | Large-scale datasets (terabytes to petabytes) | Small to medium datasets (gigabytes to terabytes) |
| Latency | Higher latency tolerated for complex queries | Low latency required for real-time transactions |
| Users | Data analysts, BI tools | End users, applications, systems |
| Use Cases | Business intelligence, dashboards, forecasting, trend analysis | E-commerce, banking, order processing, inventory management |
| Consistency | Eventual consistency acceptable | Strong consistency required (e.g., ACID transactions) |
| Scalability | Horizontal scaling for distributed data processing | Vertical scaling for faster transactions |
| Examples of Databases | BigQuery, Snowflake, Redshift, Azure Synapse | MySQL, PostgreSQL, Oracle, Microsoft SQL Server |
| Storage Format | Columnar (optimized for analytics) | Row-based (optimized for transactions) |
| Data Updates | Batch processing, infrequent updates | Frequent updates in real time |
| Indexing | Optimized for aggregations and scanning | Optimized for individual record lookups |
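
To make the contrast concrete, here is a minimal sketch in Python using the standard-library sqlite3 module. The `orders` table and its rows are hypothetical; the point is the difference between an OLTP-style point write/lookup and an OLAP-style aggregation over history.

```python
import sqlite3

# Hypothetical example: one table standing in for both an operational store
# and a history of transactions, purely to contrast the two query styles.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        order_id   INTEGER PRIMARY KEY,
        customer   TEXT,
        amount     REAL,
        order_date TEXT
    )
""")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [(1, "alice", 30.0, "2024-01-05"),
     (2, "bob",   12.5, "2024-01-06"),
     (3, "alice", 99.0, "2024-02-10")],
)

# OLTP-style: touch one row, low latency, strong consistency.
conn.execute("UPDATE orders SET amount = 35.0 WHERE order_id = 1")
print(conn.execute("SELECT * FROM orders WHERE order_id = 1").fetchone())

# OLAP-style: scan and aggregate many rows for reporting.
for row in conn.execute("""
        SELECT customer, COUNT(*) AS n_orders, SUM(amount) AS total
        FROM orders
        GROUP BY customer
        ORDER BY total DESC"""):
    print(row)
```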

In ancient times (circa 1990) OLAP was all the rage due to a new technology called a Data Warehouse. Data warehouse databases were special products that used a variant of the relational model called the cube architecture. A data warehouse trades the space efficiency and consistency of normalized relational databases for performance by intentionally introducing redundant data. This denormalization, as seen in star and snowflake schemas, optimizes analytical queries by precomputing relationships and aggregations, enabling fast slicing, dicing, and summarization of large historical datasets.
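
As an illustration of that denormalized, dimensional style, here is a minimal star-schema sketch, again using sqlite3 and hypothetical table and column names. A central fact table holds the measures, and the surrounding dimension tables carry the descriptive attributes that queries slice and dice by.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical star schema: one fact table surrounded by dimension tables.
conn.executescript("""
    CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, year INT, month INT);
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE fact_sales (
        date_id    INTEGER REFERENCES dim_date(date_id),
        product_id INTEGER REFERENCES dim_product(product_id),
        units      INTEGER,
        revenue    REAL
    );
""")

# A typical "slice and dice" query: revenue by year and product category.
query = """
    SELECT d.year, p.category, SUM(f.revenue) AS revenue
    FROM fact_sales f
    JOIN dim_date d    ON f.date_id = d.date_id
    JOIN dim_product p ON f.product_id = p.product_id
    GROUP BY d.year, p.category
"""
print(conn.execute(query).fetchall())   # empty here, but shows the shape
```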


The good news was that these new data warehouse databases gave corporations unprecedented access to historical data, which could then be used to improve marketing, develop new products, and so on. The bad news was that these products were both expensive and complex; they took a lot of effort to set up and maintain. Also, with the Internet, more and more data (web pages, video, audio, documents, ...) didn't fit neatly into the relational model.


Enter the Data Lake

This led to a new paradigm that supplanted the data warehouse: the data lake. The most important difference between a data warehouse and a data lake is that data entered into a data lake undergoes little or no processing. The key technology that was the foundation for early data lakes was Apache Hadoop. The core of Hadoop is a highly distributed, redundant file system (HDFS). It is the perfect technology for storing large amounts of data in any form (tables, graphs, documents, video, etc.) because it doesn't impose any schema on the data. This is known as "schema on read" as opposed to the more traditional "schema on write". Typically, you queried a data lake by writing a program using a model called map/reduce rather than using a query language.
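
A minimal way to picture "schema on read": raw records are stored as-is (here, newline-delimited JSON), and structure is imposed only when a query runs. The records and field names below are hypothetical.

```python
import json

# Raw, heterogeneous records as they might land in a data lake:
# no schema is enforced when they are written.
raw_lines = [
    '{"user": "alice", "event": "click", "ts": "2024-03-01T10:00:00"}',
    '{"user": "bob", "page": "/home"}',
    'not even JSON - a stray log line',
]

# Schema on read: the structure (which fields we expect) is applied
# only at query time, and records that don't fit are skipped.
def events_for(user):
    for line in raw_lines:
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue
        if rec.get("user") == user and "event" in rec:
            yield rec

print(list(events_for("alice")))
```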


Map/Reduce is a programming model used for processing large datasets in a distributed and parallel manner. It works in two main stages:

Map Stage: Each piece of input data is processed independently by "mapper" functions, which transform the data into key-value pairs. For example, in counting words across many documents, the mapper might output pairs like ("word", 1) for each word found.

Reduce Stage: The key-value pairs from the map stage are grouped by key, and a "reducer" function processes each group to produce a final result. For example, the reducer would sum all the counts for the same word to compute its total occurrences.
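
Here is a minimal, single-machine sketch of that word-count example in Python. Real frameworks (Hadoop, Spark) distribute the map and reduce stages across many machines; this only shows the two stages and the grouping by key in between.

```python
from collections import defaultdict

documents = ["the cat sat", "the dog sat", "the cat ran"]

# Map stage: emit ("word", 1) pairs for each word in each document.
mapped = []
for doc in documents:
    for word in doc.split():
        mapped.append((word, 1))

# Shuffle: group the pairs by key (the word).
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce stage: sum the counts for each word.
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)   # {'the': 3, 'cat': 2, 'sat': 2, 'dog': 1, 'ran': 1}
```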


This model simplifies parallel processing by splitting tasks into independent operations that can be run across multiple machines. The emergence of data lakes also reflected a broader shift in how organizations approached data storage, emphasizing flexibility over the rigid schemas of traditional relational databases. Around the same time, the term NoSQL gained traction, describing a new wave of databases (key-value, document, and columnar stores; see below) that eschewed SQL and relational constraints in favor of scalability and adaptability. While NoSQL databases and data lakes share this emphasis on flexibility, they evolved independently, with NoSQL databases targeting operational workloads and data lakes focusing on raw data storage.


Data Lakes become Data Swamps which leads to Data Lakehouses

One of the warnings in all the Data Lake books is "don't let your data lake become a data swamp," but as might be expected with a paradigm of "throw in all the data and we'll figure it out as we need it," that often happened. This led to the next paradigm: the Data Lakehouse. A data lakehouse attempts to blend the best of data lakes (flexibility, low-cost storage) and data warehouses (performance, structure, reliability) into a single platform. It directly addresses the problem of "data swamps" by enforcing structure and governance without losing the scalability and openness of a data lake. Another thing driving data lakehouses is the increasing drive to legislate the control, access, and history of data. Laws like HIPAA in the US mandate certain requirements for medical records data. The European Union's General Data Protection Regulation (GDPR) gave consumers many rights to control their data, including the "right to be forgotten," which applies to OLAP data as much as to OLTP.


While the Data Lakehouse concept brought structure and governance to raw data storage, it also underscored the need for specialized databases tailored to specific analytic and operational requirements. Organizations no longer rely on a single type of database for all tasks; instead, they adopt a diverse array of database technologies, each optimized for different use cases.


For example, some workloads demand low-latency transactional systems, while others prioritize complex analytics across massive datasets. Still others focus on relationships between data points, as in graph databases, or need to manage semi-structured data, as in document databases. This proliferation of database types reflects the growing complexity of modern data ecosystems and the need to match the right tool to the right job.


The table below provides a comparative overview of the major database types used today, their characteristics, and the kinds of tasks they are best suited for.

| Feature / Database Type | Key-Value | Document | Columnar | Relational | Property Graph (Neo4j) | RDF Triplestore | Blob Storage | Modern Analytical Databases |
|---|---|---|---|---|---|---|---|---|
| Schema | None | Flexible | Schema-on-write | Schema-on-write | Schema-less | Ontology-driven | None | Schema-on-write (supports nested fields) |
| Data Model | Key-value pairs | JSON-like documents | Columnar storage | Rows in tables | Nodes/edges | Triples (subject-predicate-object) | Objects with metadata | Flat tables, sometimes with nested fields |
| Query Model | Key lookups | Field-level queries | Aggregation/analytics queries | SQL | Cypher | SPARQL | None (external tools) | SQL-based, optimized for OLAP |
| Performance | Ultra-fast lookups | Fast for JSON docs | Optimized for analytics | Great for OLTP | Great for graph traversal and algorithms | Moderate to high | High throughput | Real-time analytics, distributed execution |
| Scalability | Very high | Very high | Very high | Moderate | Moderate to high | Moderate to high | Virtually unlimited | Virtually unlimited |
| Use Cases | Caching, sessions (primarily OLTP) | CMS, logging (primarily OLTP) | Analytics, time-series (primarily OLAP) | Transactions (primarily OLTP) | Primarily OLAP but often used for OLTP in complex systems | Primarily OLAP but often used for OLTP in complex systems | Data lakes, backups (neutral; supports OLAP and OLTP tools) | Analytics, BI, hybrid workloads (primarily OLAP) |
| Example Products | Redis, DynamoDB, RocksDB | MongoDB, Couchbase, RavenDB | Apache HBase, ClickHouse, Druid | MySQL, PostgreSQL, Oracle | Neo4j, TigerGraph | AllegroGraph, Stardog, GraphDB | Amazon S3, GCS, Azure Blob | BigQuery, Snowflake, Redshift, Databricks SQL, Synapse Analytics |


Overview of Modern Databases

This section describes each of the database types listed in the table.


Key-Value Databases

History: Key-value databases are one of the simplest and oldest forms of databases, originating from early caching systems and databases like Berkeley DB. They gained prominence with modern distributed systems like Amazon DynamoDB, which was pivotal for Amazon's scaling efforts.

Key Idea: These databases store data as a collection of key-value pairs, similar to a hash map. They are optimized for ultra-fast lookups and are ideal for use cases like session management and caching.

Comparison: Compared to other databases, key-value systems sacrifice querying flexibility for speed and simplicity, making them unsuitable for complex queries or relationships between data.
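
To illustrate the model, here is a toy in-memory key-value store in Python with a simple TTL, mimicking the way a cache or session store like Redis is used. It is a sketch of the data model only, not of any particular product's API.

```python
import time

class KeyValueStore:
    """Toy in-memory key-value store with optional expiry (TTL)."""

    def __init__(self):
        self._data = {}          # key -> (value, expires_at or None)

    def set(self, key, value, ttl_seconds=None):
        expires = time.time() + ttl_seconds if ttl_seconds else None
        self._data[key] = (value, expires)

    def get(self, key):
        value, expires = self._data.get(key, (None, None))
        if expires is not None and time.time() > expires:
            del self._data[key]  # lazy expiry on read
            return None
        return value

# Typical cache/session usage: everything hangs off a single key.
store = KeyValueStore()
store.set("session:42", {"user": "alice", "cart": [101, 202]}, ttl_seconds=1800)
print(store.get("session:42"))
```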


Document Databases

History: Emerging in the mid-2000s, document databases like MongoDB and Couchbase grew in response to the limitations of relational databases in handling semi-structured data, such as JSON or XML.

Key Idea: They store data as self-contained documents, at first XML documents but now usually JSON, allowing each document to have its own structure. This flexibility made them popular for content management systems and applications with evolving schemas.

Comparison: Unlike key-value stores, document databases allow querying within the document structure, but they lack the strict relationships and joins found in relational databases.
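
A minimal sketch of the document model in plain Python: each record is a self-describing JSON-like document, and queries match on fields inside the document. The documents and field names are hypothetical; real products (MongoDB, Couchbase) provide much richer query languages and indexes.

```python
# Each document carries its own structure; no two need the same fields.
articles = [
    {"_id": 1, "title": "Intro to OLAP", "tags": ["analytics"],
     "author": {"name": "alice"}},
    {"_id": 2, "title": "Graph 101", "tags": ["graphs", "neo4j"]},
]

def find(collection, **criteria):
    """Return documents whose (possibly missing) fields match the criteria."""
    for doc in collection:
        if all(doc.get(field) == value for field, value in criteria.items()):
            yield doc

# Field-level query, roughly analogous to db.articles.find({"title": ...})
print(list(find(articles, title="Graph 101")))
```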


Columnar Databases

History: Columnar databases gained attention in the 2000s with the rise of data warehouses and analytics needs. Systems like Apache Cassandra and HBase were inspired by Google's Bigtable, while ClickHouse and Druid were developed for real-time analytics.

Key Idea: Data is stored by columns instead of rows, enabling efficient aggregation and analytical queries. They are optimized for OLAP workloads and time-series data.

Comparison: Unlike document or key-value databases, columnar databases excel in scanning large datasets for analytics but are not suitable for transactional workloads.
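
A small sketch of why column orientation helps analytics: storing a table as one array per column means an aggregate only touches the column it needs, instead of reading every field of every row. The data and column names are hypothetical.

```python
# Row-oriented: each record stored together (good for OLTP point lookups).
rows = [
    {"order_id": 1, "customer": "alice", "amount": 30.0},
    {"order_id": 2, "customer": "bob",   "amount": 12.5},
    {"order_id": 3, "customer": "alice", "amount": 99.0},
]

# Column-oriented: each column stored contiguously (good for OLAP scans).
columns = {
    "order_id": [1, 2, 3],
    "customer": ["alice", "bob", "alice"],
    "amount":   [30.0, 12.5, 99.0],
}

# An aggregate over the columnar layout reads only the 'amount' column.
print(sum(columns["amount"]))

# The same aggregate over the row layout has to walk every record.
print(sum(r["amount"] for r in rows))
```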


Relational Databases

History: Relational databases were formalized in the 1970s by Edgar Codd’s relational model, with early systems like IBM’s System R evolving into modern RDBMSs like Oracle, MySQL, and PostgreSQL.

Key Idea: Data is stored in tables with strict schemas, and relationships are managed using foreign keys. Relational databases enforce ACID properties, making them reliable for transactional systems.

Comparison: Unlike the flexible schemas of document or key-value databases, relational databases require predefined schemas and are optimized for OLTP workloads.
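
A minimal sqlite3 sketch of the two properties emphasized here: a foreign-key relationship between tables and an ACID transaction that either commits fully or rolls back. The table and column names are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL NOT NULL);
    CREATE TABLE transfers (
        id  INTEGER PRIMARY KEY,
        src INTEGER REFERENCES accounts(id),
        dst INTEGER REFERENCES accounts(id),
        amt REAL
    );
    INSERT INTO accounts VALUES (1, 100.0), (2, 50.0);
""")

# Move money atomically: both updates and the log row commit together,
# or none of them do.
try:
    with conn:   # sqlite3 wraps this block in a transaction
        conn.execute("UPDATE accounts SET balance = balance - 25 WHERE id = 1")
        conn.execute("UPDATE accounts SET balance = balance + 25 WHERE id = 2")
        conn.execute("INSERT INTO transfers (src, dst, amt) VALUES (1, 2, 25)")
except sqlite3.Error:
    pass   # on any failure the whole transaction is rolled back

print(conn.execute("SELECT * FROM accounts").fetchall())  # [(1, 75.0), (2, 75.0)]
```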


Property Graph Databases (Neo4j)

History: Graph databases emerged in the 2000s with systems like Neo4j, designed to handle highly connected data. Inspired by graph theory, they became popular for social networks and recommendation systems.

Key Idea: Data is stored as nodes and edges, with relationships as first-class citizens. They are optimized for traversing and querying connections between entities.

Comparison: Graph databases handle relationships more efficiently than relational databases and are ideal for implementing algorithms from mathematical graph theory. They overlap quite a bit with RDF triplestores but lack the deep logical reasoning capabilities available to RDF via OWL and SHACL.
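
A toy sketch of the property-graph model in plain Python: nodes and edges are first-class, and a traversal follows relationships outward from a starting node. The node names are hypothetical, and the Cypher string in the comment only indicates roughly how the same question would be asked in Neo4j; no real Neo4j driver is used.

```python
# Nodes with properties, and directed, labeled edges between them.
nodes = {
    "alice": {"label": "Person"},
    "bob":   {"label": "Person"},
    "carol": {"label": "Person"},
}
edges = [("alice", "KNOWS", "bob"), ("bob", "KNOWS", "carol")]

def neighbors(node, rel):
    return [dst for src, r, dst in edges if src == node and r == rel]

# Two-hop traversal: friends of friends of alice.
friends_of_friends = {
    fof for friend in neighbors("alice", "KNOWS")
        for fof in neighbors(friend, "KNOWS")
}
print(friends_of_friends)   # {'carol'}

# Roughly the equivalent Cypher query in Neo4j:
# MATCH (a:Person {name: 'alice'})-[:KNOWS]->()-[:KNOWS]->(fof) RETURN fof
```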


RDF Triplestores

History: Rooted in the Semantic Web movement of the late 1990s, triplestores like AllegroGraph and Stardog were designed to store RDF triples for representing structured knowledge.

Key Idea: Data is modeled as subject-predicate-object triples, enabling formal reasoning and adherence to ontologies like OWL. They are used in domains requiring semantic precision, like healthcare and AI.

Comparison: Unlike property graph databases, RDF triplestores prioritize semantic relationships and reasoning. They can model graphs explicitly by reifying the edges in a graph (creating a class to model the edge). However, they are not as intuitive for sophisticated graph algorithms.
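
A small sketch using the rdflib Python library (assuming it is installed; the namespace and triples are made up). It builds a few subject-predicate-object triples and runs a SPARQL query over them, which is the same query model a production triplestore like AllegroGraph or GraphDB exposes at much larger scale.

```python
from rdflib import Graph, Namespace, Literal, RDF

EX = Namespace("http://example.org/")   # hypothetical namespace
g = Graph()

# Subject-predicate-object triples.
g.add((EX.alice, RDF.type, EX.Person))
g.add((EX.alice, EX.knows, EX.bob))
g.add((EX.bob, RDF.type, EX.Person))
g.add((EX.bob, EX.name, Literal("Bob")))

# SPARQL: who does alice know, and what type are they?
results = g.query("""
    PREFIX ex: <http://example.org/>
    SELECT ?friend ?type WHERE {
        ex:alice ex:knows ?friend .
        ?friend a ?type .
    }
""")
for friend, rdf_type in results:
    print(friend, rdf_type)
```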


Blob Storage

History: Cloud-native object storage systems like Amazon S3, Google Cloud Storage, and Azure Blob Storage became prominent in the 2010s, replacing Hadoop’s HDFS for data lake architectures.

Key Idea: Blob storage systems store raw, unstructured data in a flat namespace, optimized for scalability and cost efficiency. They provide no inherent query capabilities but are often used as foundational storage for other tools.

Comparison: Unlike other database types, blob storage is not a database but a flexible repository for raw data that external tools can query.
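
As a sketch of how application code treats blob storage, here is a minimal boto3 example, assuming boto3 is installed, AWS credentials are configured, and the bucket name (which is hypothetical) exists. The store itself only puts and gets opaque objects by key; any querying happens in external tools.

```python
import boto3

s3 = boto3.client("s3")
bucket = "my-data-lake-raw"          # hypothetical bucket name

# Write a raw object: the key is just a path-like string in a flat namespace.
s3.put_object(
    Bucket=bucket,
    Key="events/2024/03/01/events.json",
    Body=b'{"user": "alice", "event": "click"}',
)

# Read it back; interpreting the bytes is entirely up to the caller.
obj = s3.get_object(Bucket=bucket, Key="events/2024/03/01/events.json")
print(obj["Body"].read())
```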


Modern Analytical Databases

History: Systems like Google BigQuery, Snowflake, and Amazon Redshift emerged in the 2010s as cloud-native data warehouses optimized for OLAP workloads. They were inspired by the limitations of traditional data warehouses in handling large-scale, distributed datasets.

Key Idea: These databases combine columnar storage, distributed query execution, and serverless scalability, allowing fast analytics over petabyte-scale data.

Comparison: Unlike traditional OLAP databases or data warehouses, modern analytical databases integrate seamlessly with cloud ecosystems, offering real-time analytics with minimal management overhead.
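
As one hedged example of how these systems are used from code, here is a minimal google-cloud-bigquery sketch, assuming the library is installed, Google Cloud credentials are configured, and the dataset and table names (which are hypothetical) exist. The client just submits SQL; the columnar storage and distributed execution happen inside the service.

```python
from google.cloud import bigquery

client = bigquery.Client()   # picks up project and credentials from the environment

# A typical OLAP-style aggregation pushed down to the warehouse.
query = """
    SELECT customer, SUM(amount) AS total_spend
    FROM `my_project.sales.orders`   -- hypothetical table
    GROUP BY customer
    ORDER BY total_spend DESC
    LIMIT 10
"""
for row in client.query(query).result():
    print(row.customer, row.total_spend)
```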


What's the Point?

I started this post a while ago when someone told me that RDF triplestores weren't NoSQL databases. I realized that I needed to get up to speed on the various kinds of databases, and as often happens, the more I learned, the more I realized how much I didn't know. This post summarizes what I've learned so far, and I hope it will help others who may also be somewhat confused by all the types of databases out there.


However, there is one more point I want to make. This is just an opinion (it would require a new post to explain why I think it), but I hope it will be a motivating point for many readers. The more I look at all these kinds of databases, the more I think the Semantic Web standards are severely underutilized. There are several features of OWL, SPARQL, and SHACL, which I hope to go into in future posts, that in my opinion address major issues such as "schema on read". For example, despite their flexibility and non-relational nature, RDF triplestores are generally not categorized as NoSQL databases. This distinction arises because triplestores are perceived as strictly schema-driven, with schemas defined in formal logic, rather than offering the ad hoc, schema-optional flexibility that defines most NoSQL systems. As a community, I think we need to start emphasizing the agility and flexibility these languages have; if that were better understood, the technology could play a far bigger role in industry.




