The modern developer is presented with a staggering number of databases with confusing names. This is an overview of the various kinds of databases and what they are used for. For simplicity, I'm leaving out niche databases such as object-oriented databases (OODBs) and vector databases, which store embedding vectors for the similarity search that LLMs and other machine learning systems rely on.
OLTP vs. OLAP
The most important distinction is between Online Transaction Processing (OLTP) and Online Analytic Processing (OLAP). OLTP refers to the standard kind of processing required to run a business, from posting a comment on social media to transferring funds at your bank from your phone. OLAP refers to analyzing historical data for insights. The following table summarizes the differences.
Feature | OLAP (Online Analytical Processing) | OLTP (Online Transaction Processing) |
--- | --- | --- |
Primary Purpose | Analyzing large volumes of historical data for insights | Managing day-to-day transactions efficiently |
Workload Type | Analytical (read-heavy) | Transactional (write-heavy) |
Query Types | Complex queries (aggregations, joins, reporting) | Simple queries (CRUD operations: Create, Read, Update, Delete) |
Data Model | Often denormalized (e.g., star or snowflake schema for data warehouses) | Highly normalized (to reduce redundancy and ensure consistency) |
Data Volume | Large-scale datasets (terabytes to petabytes) | Small to medium datasets (gigabytes to terabytes) |
Latency | Higher latency tolerated for complex queries | Low latency required for real-time transactions |
Users | Data analysts, BI tools | End users, applications, systems |
Use Cases | Business intelligence, dashboards, forecasting, trend analysis | E-commerce, banking, order processing, inventory management |
Consistency | Eventual consistency acceptable | Strong consistency required (e.g., ACID transactions) |
Scalability | Horizontal scaling for distributed data processing | Vertical scaling for faster transactions |
Examples of Databases | BigQuery, Snowflake, Redshift, Azure Synapse | MySQL, PostgreSQL, Oracle, Microsoft SQL Server |
Storage Format | Columnar (optimized for analytics) | Row-based (optimized for transactions) |
Data Updates | Batch processing, infrequent updates | Frequent updates in real-time |
Indexing | Optimized for aggregations and scanning | Optimized for individual record lookups |
In ancient times (circa 1990), OLAP was all the rage due to a new technology called the Data Warehouse. Data warehouse databases were specialized products. They used a multidimensional variant of the relational model, commonly called the OLAP cube. A data warehouse trades the space efficiency and consistency of normalized relational databases for performance by intentionally introducing redundant data. This denormalization, as seen in star and snowflake schemas, optimizes analytical queries by precomputing relationships and aggregations, enabling fast slicing, dicing, and summarization of large historical datasets.
The good news was that these new data warehouse databases gave corporations unprecedented access to historical data, which could then be used to improve marketing, develop new products, and so on. The bad news was that these products were both expensive and complex: they took a lot of effort to set up and maintain. Also, with the Internet, more and more data (web pages, video, audio, documents, ...) didn't neatly fit into the relational model.
Enter the Data Lake
This led to a new paradigm that supplanted the data warehouse: the data lake. The most important difference between a data warehouse and a data lake is that data enters a data lake with little or no processing. The key technology underpinning early data lakes was Apache Hadoop, whose HDFS is a highly distributed, redundant file system. It is the perfect technology for storing large amounts of data in any form (tables, graphs, documents, video, etc.) because it doesn't impose any schema on the data. This is known as "schema on read" as opposed to the more traditional "schema on write". Typically, you queried a data lake not with a query language but by writing a program using a programming model called MapReduce.
MapReduce is a programming model used for processing large datasets in a distributed and parallel manner. It works in two main stages:
Map Stage: Each piece of input data is processed independently by "mapper" functions, which transform the data into key-value pairs. For example, in counting words across many documents, the mapper might output pairs like ("word", 1) for each word found.
Reduce Stage: The key-value pairs from the map stage are grouped by key, and a "reducer" function processes each group to produce a final result. For example, the reducer would sum all the counts for the same word to compute its total occurrences.
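Here is a minimal Python sketch of that word-count example, simulated on a single machine (a real Hadoop job would distribute the mappers and reducers across a cluster):

```python
from collections import defaultdict

documents = ["the quick brown fox", "the lazy dog", "the fox"]

# Map stage: each document is processed independently,
# emitting ("word", 1) pairs.
mapped = []
for doc in documents:
    for word in doc.split():
        mapped.append((word, 1))

# Shuffle: group the pairs by key (the word).
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce stage: sum the counts for each word.
totals = {word: sum(counts) for word, counts in groups.items()}
print(totals)  # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```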
This model simplifies parallel processing by splitting tasks into independent operations that can be run across multiple machines. The emergence of data lakes also reflected a broader shift in how organizations approached data storage, emphasizing flexibility over the rigid schemas of traditional relational databases. Around the same time, the term NoSQL gained traction, describing a new wave of databases — like key-value, document, and columnar stores (see below) — that eschewed SQL and relational constraints in favor of scalability and adaptability. While NoSQL databases and data lakes share this emphasis on flexibility, they evolved independently, with NoSQL databases targeting operational workloads and data lakes focusing on raw data storage.
Data Lakes Become Data Swamps, Which Leads to Data Lakehouses
One of the warnings in all the Data Lake books is "don't let your data lake become a data swamp", but as might be expected with a paradigm of "throw in all the data and we'll figure it out as we need it", that often happened. This led to the next paradigm: the Data Lakehouse. A data lakehouse attempts to blend the best of data lakes (flexibility, low-cost storage) and data warehouses (performance, structure, reliability) into a single platform. It directly addresses the problem of "data swamps" by enforcing structure and governance without losing the scalability and openness of a data lake. Another factor driving data lakehouses is the increasing push to regulate the control, access, and history of data. Laws like HIPAA in the US mandate certain requirements for medical records data. The European Union's General Data Protection Regulation (GDPR) gave consumers many rights to control their data, including the "right to be forgotten", which applies to OLAP data as much as to OLTP.
While the Data Lakehouse concept brought structure and governance to raw data storage, it also underscored the need for specialized databases tailored to specific analytic and operational requirements. Organizations no longer rely on a single type of database for all tasks; instead, they adopt a diverse array of database technologies, each optimized for different use cases.
For example, some workloads demand low-latency transactional systems, while others prioritize complex analytics across massive datasets. Still others focus on relationships between data points, as in graph databases, or need to manage semi-structured data, as in document databases. This proliferation of database types reflects the growing complexity of modern data ecosystems and the need to match the right tool to the right job.
The table below provides a comparative overview of the major database types used today, their characteristics, and the kinds of tasks they are best suited for.
Feature / Database Type | Key-Value | Document | Columnar | Relational | Property Graph (Neo4j) | RDF Triplestore | Blob Storage | Modern Analytical Databases |
--- | --- | --- | --- | --- | --- | --- | --- | --- |
Schema | None | Flexible | Schema-on-write | Schema-on-write | Schema-less | Ontology-driven | None | Schema-on-write (supports nested fields) |
Data Model | Key-value pairs | JSON-like documents | Columnar storage | Rows in tables | Nodes and edges | Triples (subject-predicate-object) | Objects with metadata | Flat tables, sometimes with nested fields
Query Model | Key lookups | Field-level queries | Aggregations, analytics | SQL | Cypher | SPARQL | None (external tools) | SQL-based, optimized for OLAP
Performance | Ultra-fast lookups | Fast for JSON docs | Optimized for analytics | Great for OLTP | Great for graph traversal and algorithms | Moderate to high | High throughput | Real-time analytics, distributed execution |
Scalability | Very high | Very high | Very high | Moderate | Moderate to high | Moderate to high | Virtually unlimited | Virtually unlimited |
Use Cases | Caching, sessions (primarily OLTP) | CMS, logging (primarily OLTP) | Analytics, time-series (primarily OLAP) | Transactions (primarily OLTP) | Primarily OLAP but often used for OLTP in complex systems | Primarily OLAP but often used for OLTP in complex systems | Data lakes, backups (neutral; supports OLAP and OLTP tools) | Analytics, BI, hybrid workloads (primarily OLAP)
Example Products | Redis, DynamoDB, RocksDB | MongoDB, Couchbase, RavenDB | Apache HBase, ClickHouse, Druid | MySQL, PostgreSQL, Oracle | Neo4j, TigerGraph | AllegroGraph, Stardog, GraphDB | Amazon S3, GCS, Azure Blob | BigQuery, Snowflake, Redshift, Databricks SQL, Synapse Analytics
Overview of Modern Databases
This section describes each of the database types listed in the table.
Key-Value Databases
History: Key-value databases are one of the simplest and oldest forms of databases, originating from early caching systems and databases like Berkeley DB. They gained prominence with modern distributed systems like Amazon DynamoDB, which was pivotal for Amazon's scaling efforts.
Key Idea: These databases store data as a collection of key-value pairs, similar to a hash map. They are optimized for ultra-fast lookups and are ideal for use cases like session management and caching.
Comparison: Compared to other databases, key-value systems sacrifice querying flexibility for speed and simplicity, making them unsuitable for complex queries or for modeling relationships between data items.
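As a concrete illustration, here is a minimal sketch of session caching using the redis-py client; it assumes a Redis instance on localhost, and the key name and session contents are made up:

```python
import json
import redis  # pip install redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Store a session object under a key, expiring after one hour.
session = {"user_id": 42, "cart": ["sku-123"]}  # hypothetical data
r.set("session:abc123", json.dumps(session), ex=3600)

# Lookups are by key only; there is no built-in way to ask
# "find all sessions containing sku-123".
cached = json.loads(r.get("session:abc123"))
print(cached["user_id"])  # 42
```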
Document Databases
History: Emerging in the mid-2000s, document databases like MongoDB and Couchbase grew in response to the limitations of relational databases in handling semi-structured data, such as JSON or XML.
Key Idea: They store data as self-contained documents, at first XML documents but now usually JSON, allowing each document to have its own structure. This flexibility made them popular for content management systems and applications with evolving schemas.
Comparison: Unlike key-value stores, document databases allow querying within the document structure, but they lack the strict relationships and joins found in relational databases.
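For example, here is a sketch using pymongo; it assumes a local MongoDB instance, and the collection and field names are hypothetical:

```python
from pymongo import MongoClient  # pip install pymongo

client = MongoClient("mongodb://localhost:27017")
articles = client["cms"]["articles"]

# Each document carries its own structure; no table schema is declared up front.
articles.insert_one({
    "title": "Intro to Data Lakes",
    "tags": ["data", "hadoop"],
    "author": {"name": "Alice", "email": "alice@example.com"},
})

# Queries can reach inside the document, including nested fields and arrays.
for doc in articles.find({"tags": "hadoop", "author.name": "Alice"}):
    print(doc["title"])
```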
Columnar Databases
History: Columnar databases gained attention in the 2000s with the rise of data warehouses and analytics needs. Wide-column systems like Apache Cassandra and HBase were inspired by Google's Bigtable, while column-oriented analytical engines like ClickHouse and Druid were developed for real-time analytics.
Key Idea: Data is stored by columns instead of rows, enabling efficient aggregation and analytical queries. They are optimized for OLAP workloads and time-series data.
Comparison: Unlike document or key-value databases, columnar databases excel in scanning large datasets for analytics but are not suitable for transactional workloads.
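The toy Python sketch below (no particular product, just an illustration) shows why the layout matters: with a columnar layout, an aggregation only has to scan the one column it needs, whereas a row store must read every field of every row from disk.

```python
# Row-oriented layout: on disk, each row is stored as a unit, so summing
# one field still means reading whole rows.
rows = [
    {"ts": "2024-01-01", "region": "EU", "sales": 120.0},
    {"ts": "2024-01-01", "region": "US", "sales": 310.0},
    {"ts": "2024-01-02", "region": "EU", "sales": 95.0},
]
total_from_rows = sum(r["sales"] for r in rows)

# Column-oriented layout: each column is stored (and scanned) separately,
# so the aggregation touches only the sales column.
columns = {
    "ts":     ["2024-01-01", "2024-01-01", "2024-01-02"],
    "region": ["EU", "US", "EU"],
    "sales":  [120.0, 310.0, 95.0],
}
total_from_columns = sum(columns["sales"])

print(total_from_rows, total_from_columns)  # 525.0 525.0
```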
Relational Databases
History: Relational databases were formalized in the 1970s by Edgar Codd’s relational model, with early systems like IBM’s System R evolving into modern RDBMSs like Oracle, MySQL, and PostgreSQL.
Key Idea: Data is stored in tables with strict schemas, and relationships are managed using foreign keys. Relational databases enforce ACID properties, making them reliable for transactional systems.
Comparison: Unlike the flexible schemas of document or key-value databases, relational databases require predefined schemas and are optimized for OLTP workloads.
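A small sketch using Python's built-in sqlite3 module (the table names are made up) shows the predefined schema, the foreign key relationship, and an ACID transaction:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# Schema is declared up front; the foreign key enforces the relationship.
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("""CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(id),
    total REAL NOT NULL)""")

# A transaction: either both inserts commit or neither does.
with conn:
    conn.execute("INSERT INTO customers (id, name) VALUES (1, 'Alice')")
    conn.execute("INSERT INTO orders (customer_id, total) VALUES (1, 19.99)")

# Joins reassemble the normalized data.
for name, total in conn.execute(
        "SELECT c.name, o.total FROM orders o JOIN customers c ON c.id = o.customer_id"):
    print(name, total)
```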
Property Graph Databases (Neo4j)
History: Graph databases emerged in the 2000s with systems like Neo4j, designed to handle highly connected data. Inspired by graph theory, they became popular for social networks and recommendation systems.
Key Idea: Data is stored as nodes and edges, with relationships as first-class citizens. They are optimized for traversing and querying connections between entities.
Comparison: Graph databases handle relationships more efficiently than relational databases and are ideal for implementing algorithms from mathematical graph theory. They overlap quite a bit with RDF triplestores but lack the deep logical reasoning capabilities available to RDF via OWL and SHACL.
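As an illustration, here is a sketch using the official neo4j Python driver and Cypher; the URI, credentials, labels, and names are placeholders, and it assumes a running Neo4j instance:

```python
from neo4j import GraphDatabase  # pip install neo4j

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Relationships (FOLLOWS) are first-class citizens, created alongside the nodes.
    session.run(
        "MERGE (a:Person {name: $a}) "
        "MERGE (b:Person {name: $b}) "
        "MERGE (a)-[:FOLLOWS]->(b)",
        a="Alice", b="Bob",
    )

    # Traversal query: who does Alice follow, directly or within two hops?
    result = session.run(
        "MATCH (p:Person {name: $name})-[:FOLLOWS*1..2]->(other) "
        "RETURN DISTINCT other.name AS name",
        name="Alice",
    )
    for record in result:
        print(record["name"])

driver.close()
```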
RDF Triplestores
History: Rooted in the Semantic Web movement of the late 1990s, triplestores like AllegroGraph and Stardog were designed to store RDF triples for representing structured knowledge.
Key Idea: Data is modeled as subject-predicate-object triples, enabling formal reasoning and adherence to ontologies like OWL. They are used in domains requiring semantic precision, like healthcare and AI.
Comparison: Unlike property graph databases, RDF triplestores prioritize semantic relationships and reasoning. They can model graphs explicitly by reifying the edges in a graph (creating a class to model the edge). However, they are not as intuitive for sophisticated graph algorithms.
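Here is a small sketch using the rdflib library; the namespace and facts are hypothetical, and a production triplestore such as those listed above exposes the same SPARQL interface over much larger graphs:

```python
from rdflib import Graph, Namespace  # pip install rdflib
from rdflib.namespace import RDF

EX = Namespace("http://example.org/")
g = Graph()

# Every fact is a subject-predicate-object triple.
g.add((EX.aspirin, RDF.type, EX.Drug))
g.add((EX.aspirin, EX.treats, EX.headache))
g.add((EX.ibuprofen, RDF.type, EX.Drug))
g.add((EX.ibuprofen, EX.treats, EX.inflammation))

# SPARQL query: which drugs treat a headache?
results = g.query("""
    PREFIX ex: <http://example.org/>
    SELECT ?drug WHERE { ?drug a ex:Drug ; ex:treats ex:headache . }
""")
for row in results:
    print(row.drug)  # http://example.org/aspirin
```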
Blob Storage
History: Cloud-native object storage systems like Amazon S3, Google Cloud Storage, and Azure Blob Storage became prominent in the 2010s, replacing Hadoop’s HDFS for data lake architectures.
Key Idea: Blob storage systems store raw, unstructured data in a flat namespace, optimized for scalability and cost efficiency. They provide no inherent query capabilities but are often used as foundational storage for other tools.
Comparison: Unlike other database types, blob storage is not a database but a flexible repository for raw data that external tools can query.
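For example, a sketch using boto3 against Amazon S3; the bucket and key names are hypothetical, and it assumes AWS credentials are already configured:

```python
import boto3  # pip install boto3

s3 = boto3.client("s3")

# Store a raw object: any bytes at all, with no schema imposed.
s3.put_object(
    Bucket="my-data-lake",                    # hypothetical bucket
    Key="raw/events/2024-01-01/events.json",  # a flat key, not a real directory
    Body=b'{"event": "page_view", "user": 42}',
)

# Retrieval is by key; any querying happens in external tools (Spark, Athena, etc.).
obj = s3.get_object(Bucket="my-data-lake", Key="raw/events/2024-01-01/events.json")
print(obj["Body"].read())
```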
Modern Analytical Databases
History: Systems like Google BigQuery, Snowflake, and Amazon Redshift emerged in the 2010s as cloud-native data warehouses optimized for OLAP workloads. They were inspired by the limitations of traditional data warehouses in handling large-scale, distributed datasets.
Key Idea: These databases combine columnar storage, distributed query execution, and serverless scalability, allowing fast analytics over petabyte-scale data.
Comparison: Unlike traditional OLAP databases or data warehouses, modern analytical databases integrate seamlessly with cloud ecosystems, offering real-time analytics with minimal management overhead.
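As an illustration, a sketch using the google-cloud-bigquery client; the project, dataset, and table names are hypothetical, and it assumes Google Cloud credentials are configured:

```python
from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()

# A typical OLAP query: aggregate a large table. The engine scans only the
# referenced columns and distributes the work across many workers.
query = """
    SELECT region, SUM(amount) AS total_sales
    FROM `my_project.sales.transactions`   -- hypothetical table
    GROUP BY region
    ORDER BY total_sales DESC
"""
for row in client.query(query).result():
    print(row.region, row.total_sales)
```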
What's the Point?
I started this a while ago when someone told me that RDF triplestores weren't NoSQL databases. I realized that I needed to get up to speed on the various kinds of databases and as often happens, the more I learned, the more I realized how much I didn't know. This summarizes what I've learned so far and I hope it will help others who may also be somewhat confused by all the types of databases out there.
However, there is one more point I want to make. This is just an opinion; it would require a new post to explain why I think this, but I hope it will be a motivating point for many readers. The more I look at all these kinds of databases, the more I think the Semantic Web standards are severely underutilized. OWL, SPARQL, and SHACL have several features, which I hope to go into in future posts, that IMO address major issues such as "schema on read". For example, despite their flexibility and non-relational nature, RDF triplestores are generally not categorized as NoSQL databases. This distinction arises because triplestores are perceived as strictly schema-driven, with schemas defined in formal logic, rather than having the ad hoc, schema-optional flexibility that defines most NoSQL systems. As a community, I think we need to start emphasizing the agility and flexibility these languages have; if that were better understood, the technology could play a far bigger role in industry.