Learn how to store data, including how and when to use relational databases, key-value stores, file stores, object stores, CDNs, document stores, columnar databases, queues, streams, and more.
By Martin Kleppmann (O'Reilly)
Data is at the center of many challenges in system design today. Difficult issues need to be figured out, such as scalability, consistency, reliability, efficiency, and maintainability. In addition, we have an overwhelming variety of tools, including relational databases, NoSQL datastores, stream or batch processors, and message brokers. What are the right choices for your application? How do you make sense of all these buzzwords? In this practical and comprehensive guide, author Martin Kleppmann helps you navigate this diverse landscape by examining the pros and cons of various technologies for processing and storing data. Software keeps changing, but the fundamental principles remain the same. With this book, software engineers and architects will learn how to apply those ideas in practice, and how to make full use of data in modern applications.
By Laine Campbell and Charity Majors (O'Reilly)
The infrastructure-as-code revolution in IT is also affecting database administration. With this practical book, developers, system administrators, and junior to mid-level DBAs will learn how the modern practice of site reliability engineering applies to the craft of database architecture and operations. Authors Laine Campbell and Charity Majors provide a framework for professionals looking to join the ranks of today’s database reliability engineers (DBRE). You’ll begin by exploring core operational concepts that DBREs need to master. Then you’ll examine a wide range of database persistence options, including how to implement key technologies to provide resilient, scalable, and performant data storage and retrieval. With a firm foundation in database reliability engineering, you’ll be ready to dive into the architecture and operations of any modern database.
By Pramod Sadalage and Martin Fowler (Addison-Wesley Professional)
A brief introduction to the class of non-relational databases known as 'NoSQL.' The book covers core concepts as well as implementation issues and use cases.
By Eric Redmond and Jim Wilson (Pragmatic Bookshelf)
Data is getting bigger and more complex by the day, and so are your choices in handling it. Explore some of the most cutting-edge databases available - from a traditional relational database to newer NoSQL approaches - and make informed decisions about challenging data storage problems. This is the only comprehensive guide to the world of NoSQL databases, with in-depth practical and conceptual introductions to seven different technologies: Redis, Neo4J, CouchDB, MongoDB, HBase, Postgres, and DynamoDB. This second edition includes a new chapter on DynamoDB and updated content for each chapter.
By Jay Kreps (O'Reilly)
Why a book about logs? That’s easy: the humble log is an abstraction that lies at the heart of many systems, from NoSQL databases to cryptocurrencies. Even though most engineers don’t think much about them, this short book shows you why logs are worthy of your attention. Based on his popular blog posts, LinkedIn principal engineer Jay Kreps shows you how logs work in distributed systems, and then delivers practical applications of these concepts in a variety of common uses―data integration, enterprise architecture, real-time stream processing, data system design, and abstract computing models. Go ahead and take the plunge with logs; you’re going to love them.
By Anthony Molinaro and Robert de Graaf (O'Reilly)
You may know SQL basics, but are you taking advantage of its expressive power? This second edition applies a highly practical approach to Structured Query Language (SQL) so you can create and manipulate large stores of data. Based on real-world examples, this updated cookbook provides a framework to help you construct solutions and executable examples in several flavors of SQL, including Oracle, DB2, SQL Server, MySQL, and PostgreSQL. SQL programmers, analysts, data scientists, database administrators, and even relatively casual SQL users will find SQL Cookbook to be a valuable problem-solving guide for everyday issues. No other resource offers recipes in this unique format to help you tackle nagging day-to-day conundrums with SQL.
By Ben Forta (Sams Publishing)
Whether you're an application developer, database administrator, web application designer, mobile app developer, or Microsoft Office user, a good working knowledge of SQL is an important part of interacting with databases. And Sams Teach Yourself SQL in 10 Minutes offers the straightforward, practical answers you need to help you do your job. Expert trainer and popular author Ben Forta teaches you just the parts of SQL you need to know–starting with simple data retrieval and quickly going on to more complex topics including the use of joins, subqueries, stored procedures, cursors, triggers, and table constraints. You'll learn methodically, systematically, and simply–in short, quick lessons that will each take only 10 minutes or less to complete.
By Itzik Ben-Gan (Microsoft Press)
Master Transact-SQL's fundamentals, and write correct, robust code for querying and modifying data with modern Microsoft data technologies, including SQL Server 2022, Azure SQL Database, and Azure SQL Managed Instance. Long-time Microsoft Data Platform MVP Itzik Ben-Gan explains key T-SQL concepts, helping you apply your knowledge with hands-on exercises. Ben-Gan first introduces T-SQL's theory and underlying logic, illuminating it as both a language and a way of thinking. Next, he walks through core topics, including logical query processing, single table queries, joins, subqueries, table expressions, set operators, data analysis, data modifications, temporal tables, and transactions and concurrency. Building on this foundation, you'll enhance your coding capabilities, from programmatic constructs to the powerful new SQL Graph. Throughout, Ben-Gan presents reusable T-SQL sample code that works in cloud, on-premises, and hybrid environments.
By Paper Trail (Blog post)
Writing about distributed systems, compilers, virtual machines, databases and research papers from SOSP, ATC, NSDI, OSDI, EuroSys and others.
By Dan McKinley (Talk)
This is the spoken word version of my essay, Choose Boring Technology. I have largely come to terms with it and the reality that I will never escape its popularity.
By Jeffrey Dean and Sanjay Ghemawat (Article)
MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper.
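As a rough illustration of the model (not the paper's distributed C++ implementation), here is a minimal in-memory word-count sketch in Python with user-supplied map and reduce functions:

```python
from collections import defaultdict

# User-supplied map function: one input record -> intermediate (key, value) pairs.
def map_fn(_key, line):
    for word in line.split():
        yield word, 1

# User-supplied reduce function: one key plus all its intermediate values -> output.
def reduce_fn(word, counts):
    return word, sum(counts)

def map_reduce(records, map_fn, reduce_fn):
    # Map phase: apply map_fn to every record and group intermediate values by key.
    intermediate = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            intermediate[out_key].append(out_value)
    # The grouping above stands in for the shuffle; reduce phase follows.
    return [reduce_fn(k, vs) for k, vs in intermediate.items()]

docs = [("doc1", "the quick brown fox"), ("doc2", "the lazy dog")]
print(map_reduce(docs, map_fn, reduce_fn))  # [('the', 2), ('quick', 1), ...]
```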
By Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber (Article)
Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving). Despite these varied demands, Bigtable has successfully provided a flexible, high-performance solution for all of these Google products. In this paper we describe the simple data model provided by Bigtable, which gives clients dynamic control over data layout and format, and we describe the design and implementation of Bigtable.
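The data model the paper describes is a sparse, sorted map from (row key, column family:qualifier, timestamp) to an uninterpreted byte string. A toy single-process sketch of that map, with illustrative names only (real Bigtable shards rows into tablets across servers):

```python
import time
from collections import defaultdict

# row key -> {(column, timestamp): value}; a stand-in for Bigtable's sorted map.
table = defaultdict(dict)

def put(row_key, column, value, ts=None):
    ts = ts if ts is not None else time.time_ns()
    table[row_key][(column, ts)] = value

def read(row_key, column):
    # Return the most recent version of a cell, mirroring timestamped versions.
    versions = [(ts, v) for (col, ts), v in table[row_key].items() if col == column]
    return max(versions)[1] if versions else None

put("com.example/index.html", "contents:html", b"<html>...</html>")
put("com.example/index.html", "anchor:cnnsi.com", b"CNN")
print(read("com.example/index.html", "contents:html"))
```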
By Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels (Article)
This paper presents the design and implementation of Dynamo, a highly available key-value storage system that some of Amazon’s core services use to provide an “always-on” experience. To achieve this level of availability, Dynamo sacrifices consistency under certain failure scenarios. It makes extensive use of object versioning and application-assisted conflict resolution in a manner that provides a novel interface for developers to use.
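A minimal sketch of the vector-clock comparison at the heart of Dynamo's object versioning: if neither clock dominates the other, the versions are concurrent and are handed back to the application to reconcile. Names here are illustrative, not Dynamo's API:

```python
# Each version of an object carries a vector clock: {node_id: counter}.
def descends(a, b):
    """True if clock `a` includes every update recorded in clock `b`."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def compare(a, b):
    if descends(a, b):
        return "a supersedes b"   # b's version can be discarded
    if descends(b, a):
        return "b supersedes a"
    return "conflict"             # application-assisted reconciliation needed

v1 = {"node_x": 2, "node_y": 1}
v2 = {"node_x": 2, "node_y": 1, "node_z": 1}   # a later write also seen by node_z
v3 = {"node_x": 3}                             # a concurrent write seen only by node_x
print(compare(v2, v1))  # "a supersedes b"
print(compare(v2, v3))  # "conflict" -> both versions are returned to the client
```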
Easy to use, high performance block storage at any scale.
Persistent Disk is Google’s local durable storage service, fully integrated with Google Cloud products, Compute Engine and Google Kubernetes Engine.
Discover a high-performance, highly durable block storage service designed for Azure Virtual Machines.
Serverless, fully elastic file storage.
Fully-managed, secure cloud file storage. Filestore offers petabyte-scale online network attached storage (NAS) for high performance computing.
Azure Files offers fully managed file shares in the cloud that are accessible via the industry standard Server Message Block (SMB) protocol, Network File System (NFS) protocol, and Azure Files REST API. Azure file shares can be mounted concurrently by cloud or on-premises deployments. SMB Azure file shares are accessible from Windows, Linux, and macOS clients. NFS Azure file shares are accessible from Linux clients. Additionally, SMB Azure file shares can be cached on Windows servers with Azure File Sync for fast access near where the data is being used.
PostgreSQL is a powerful, open source object-relational database system with over 35 years of active development that has earned it a strong reputation for reliability, feature robustness, and performance.
The world's most popular open source database.
SQLite is a C-language library that implements a small, fast, self-contained, high-reliability, full-featured, SQL database engine. SQLite is the most used database engine in the world. SQLite is built into all mobile phones and most computers and comes bundled inside countless other applications that people use every day.
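Because the engine is embedded, using it is a library call rather than a network round trip; Python, for example, ships a binding in its standard library. A minimal sketch using an in-memory database:

```python
import sqlite3

# The whole database lives in a single local file (or in memory, as here).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, body TEXT)")
conn.execute("INSERT INTO notes (body) VALUES (?)", ("hello, embedded SQL",))
conn.commit()

for row in conn.execute("SELECT id, body FROM notes"):
    print(row)   # (1, 'hello, embedded SQL')
conn.close()
```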
Get the flexibility you need to use integrated solutions and apps with your data—in the cloud, on-premises, or at the edge. SQL Server 2022 is the most Azure-enabled release yet, with innovation across performance, security, and availability.
Harness AI with Oracle Database 23ai to power app development and critical workloads, enhancing your data operations for free.
From version control to continuous delivery, Redgate Flyway helps individuals, teams, and enterprises build on application delivery processes to automate database development.
Automate database change management so you can code at full speed and continuously deliver with confidence. Liquibase helps developers build applications faster.
Manage your database schema as code.
Schema migration and database security for developer, security, DBA, and platform engineering teams.
Alembic is a lightweight database migration tool for usage with the SQLAlchemy Database Toolkit for Python.
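A typical migration script looks like the sketch below (the revision identifiers and table are placeholders, not from a real project); `alembic upgrade head` applies it and `alembic downgrade -1` reverts it:

```python
"""add users table (illustrative example)"""
from alembic import op
import sqlalchemy as sa

# Revision identifiers used by Alembic; these values are placeholders.
revision = "0001_add_users"
down_revision = None

def upgrade():
    op.create_table(
        "users",
        sa.Column("id", sa.Integer, primary_key=True),
        sa.Column("email", sa.String(255), nullable=False, unique=True),
    )

def downgrade():
    op.drop_table("users")
```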
Database migrations written in Go. Use as CLI or import as library.
Sqitch is the developer-friendly, confidence-inducing, platform-neutral database change management system.
Active Record is part of the M in MVC - the model - which is the layer of the system responsible for representing data and business logic. Active Record helps you create and use Ruby objects whose attributes require persistent storage to a database.
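The pattern itself is language-agnostic: an object wraps a database row and knows how to persist and find itself. A rough sketch of the idea, shown in Python only for brevity (Rails' actual Active Record API is Ruby and far richer):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

class User:
    """Toy Active Record: the object's attributes mirror one table row."""
    def __init__(self, name, id=None):
        self.id, self.name = id, name

    def save(self):
        if self.id is None:
            cur = conn.execute("INSERT INTO users (name) VALUES (?)", (self.name,))
            self.id = cur.lastrowid
        else:
            conn.execute("UPDATE users SET name = ? WHERE id = ?", (self.name, self.id))
        conn.commit()

    @classmethod
    def find(cls, id):
        row = conn.execute("SELECT id, name FROM users WHERE id = ?", (id,)).fetchone()
        return cls(id=row[0], name=row[1]) if row else None

u = User("Ada")
u.save()
print(User.find(u.id).name)  # "Ada"
```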
Sequel is a lightweight database toolkit for Ruby.
Knex.js (pronounced /kəˈnɛks/) is a 'batteries included' SQL query builder for PostgreSQL, CockroachDB, MSSQL, MySQL, MariaDB, SQLite3, Better-SQLite3, Oracle, and Amazon Redshift designed to be flexible, portable, and fun to use. It features both traditional node style callbacks as well as a promise interface for cleaner async flow control, a stream interface, full-featured query and schema builders, transaction support (with savepoints), connection pooling and standardized responses between different query clients and dialects.
The fantastic ORM library for Golang aims to be developer friendly.
Developers love Redis. Unlock the full potential of the Redis database with Redis Enterprise and start building blazing fast apps.
Valkey is an open source (BSD) high-performance key/value datastore that supports a variety of workloads such as caching and message queues, and can act as a primary database. Valkey can run as either a standalone daemon or in a cluster, with options for replication and high availability.
Free & open source, high-performance, distributed memory object caching system, generic in nature, but intended for use in speeding up dynamic web applications by alleviating database load. Memcached is an in-memory key-value store for small chunks of arbitrary data (strings, objects) from results of database calls, API calls, or page rendering.
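The canonical use is the cache-aside pattern: check the cache first, fall back to the database on a miss, then populate the cache. A hedged sketch using the pymemcache client; the `load_user_from_db` helper is hypothetical and assumed to return an already-serialized (bytes or str) record:

```python
from pymemcache.client.base import Client

cache = Client(("localhost", 11211))  # assumes a local memcached instance

def get_user(user_id):
    key = f"user:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached                      # cache hit: no database round trip
    user = load_user_from_db(user_id)      # hypothetical, potentially slow query
    cache.set(key, user, expire=300)       # keep the result for 5 minutes
    return user
```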
Serverless, NoSQL, fully managed database with single-digit millisecond performance at any scale.
With a key/value design that delivers powerful – yet simple – data models for storing massive amounts of unstructured data, Riak KV is built to handle a variety of challenges facing Big Data applications that include tracking user or session information, storing connected device data and replicating data across the globe. Riak KV automates data distribution across the cluster to achieve fast performance and robust business continuity with a masterless architecture that ensures high availability, and scales near linearly using commodity hardware so you can easily add capacity without a large operational burden.
Build the fastest apps and deliver the richest real-time experiences with the official Redis-as-a-service.
Real-time performance for real-time applications.
A fully managed in-memory service for Redis and Memcached that offers sub-millisecond data access, scalability, and high availability.
Distributed, in-memory, scalable solution providing super-fast data access.
Upstash is a serverless data platform providing low latency and high scalability for real-time applications. Optimize your data infrastructure with Upstash's managed services for Redis, Vector, QStash, and other key data technologies.
Make employees, applications and networks faster and more secure everywhere, while reducing complexity and cost.
Akamai is the cybersecurity and cloud computing company that powers and protects business online.
Fastly's edge cloud platform delivers faster, safer, and more scalable sites and apps to customers. Elevate your edge CDN, video delivery, security, and more.
Imperva provides complete cyber security by protecting what really matters most—your data and applications—whether on-premises or in the cloud.
Deliver content hosted on-premises or in another cloud over Google's high-performance distributed infrastructure.
Fast, reliable content delivery network with global reach.
Object storage built to retrieve any amount of data from anywhere.
Cloud Storage lets you store data with multiple redundancy options, virtually anywhere.
Massively scalable and secure object storage for cloud-native workloads, archives, data lakes, HPC, and machine learning.
Cloudflare R2 is an S3-compatible, zero egress-fee, object storage. Move data freely and build the multi-cloud architecture you desire.
With Wasabi, you pay only for what you store. Enjoy the freedom to access your data whenever you want, without fees for egress or API requests.
Backblaze is a pioneer in robust, scalable low cost cloud backup and storage services. Enterprise hot storage, low cost backup and archive, and more.
Get your ideas to market faster with a developer data platform built on the leading modern database. MongoDB makes working with data easy.
Seamless multi-master sync that scales from Big Data to Mobile, with an intuitive HTTP/JSON API, designed for reliability.
Couchbase is the NoSQL cloud developer data platform for critical, AI-powered applications. Uncompromised speed, affordability, and ease of use.
Use our flexible, scalable NoSQL cloud database, built on Google Cloud infrastructure, to store and sync data for client- and server-side development.
Elasticsearch is the leading distributed, RESTful, open source search and analytics engine designed for speed, horizontal scalability, reliability, and easy management. Get started for free.
OpenSearch is a community-driven, Apache 2.0-licensed open source search and analytics suite that makes it easy to ingest, search, visualize, and analyze data.
Securely unlock real-time search, monitoring, and analysis of business and operational data.
Enterprises and developers use Algolia’s AI search infrastructure to understand users and show them what they’re looking for.
Solr is the blazing-fast, open source, multi-modal search platform built on the full-text, vector, and geospatial search capabilities of Apache Lucene.
The Apache Lucene project develops open-source search software. The project releases a core search library, named Lucene core, as well as PyLucene, a python binding for Lucene. Lucene Core is a Java library providing powerful indexing and search features, as well as spellchecking, hit highlighting and advanced analysis/tokenization capabilities. The PyLucene sub project provides Python bindings for Lucene Core.
Apache Cassandra is an open source NoSQL distributed database trusted by thousands of companies for scalability and high availability without compromising performance. Linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data.
Bigtable is an HBase-compatible, enterprise-grade NoSQL database with low single-digit millisecond latency and limitless scale.
Apache HBase is the Hadoop database, a distributed, scalable, big data store. Use Apache HBase when you need random, realtime read/write access to your Big Data. This project's goal is the hosting of very large tables -- billions of rows X millions of columns -- atop clusters of commodity hardware. Apache HBase is an open-source, distributed, versioned, non-relational database modeled after Google's Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.
A new open source Apache Hadoop ecosystem project, Apache Kudu completes Hadoop's storage layer to enable fast analytics on fast data.
Manage all types of time series data in a single, purpose-built database. Optimized for speed in any environment in the cloud, on-premises, or at the edge.
Easy-to-manage time-series databases optimized for security, performance, availability, and scalability.
An open-source monitoring system with a dimensional data model, flexible query language, efficient time series database and modern alerting approach.
Riak TS is the only enterprise-grade NoSQL time series database optimized specifically for IoT and Time Series data. It ingests, transforms, stores, and analyzes massive amounts of time series data. Riak TS is engineered to be faster than Cassandra.
Engineered to handle demanding workloads, like time series, vector, events, and analytics data. Built on PostgreSQL, with expert support at no extra charge.
The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing. The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thus delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.
Cloudera delivers a hybrid data platform with secure data management and portable cloud-native data analytics.
Easily run and scale Apache Spark, Hive, Presto, and other big data workloads.
Dataproc is a fast, easy-to-use, fully managed cloud service for running Apache Spark and Apache Hadoop clusters in a simpler, more cost-efficient way.
Provision cloud Hadoop, Spark and HBase clusters.
Apache Spark is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.
Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink has been designed to run in all common cluster environments, perform computations at in-memory speed and at any scale.
Apache Storm is a free and open source distributed realtime computation system. Apache Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing. Apache Storm is simple, can be used with any programming language, and is a lot of fun to use! Apache Storm has many use cases: realtime analytics, online machine learning, continuous computation, distributed RPC, ETL, and more. Apache Storm is fast: a benchmark clocked it at over a million tuples processed per second per node. It is scalable, fault-tolerant, guarantees your data will be processed, and is easy to set up and operate. Apache Storm integrates with the queueing and database technologies you already use. An Apache Storm topology consumes streams of data and processes those streams in arbitrarily complex ways, repartitioning the streams between each stage of the computation however needed. Read more in the tutorial.
Samza allows you to build stateful applications that process data in real-time from multiple sources including Apache Kafka. Battle-tested at scale, it supports flexible deployment options to run on YARN or as a standalone library.
Apache Beam is an open source, unified model and set of language-specific SDKs for defining and executing data processing workflows, and also data ingestion and integration flows, supporting Enterprise Integration Patterns (EIPs) and Domain Specific Languages (DSLs). Dataflow pipelines simplify the mechanics of large-scale batch and streaming data processing and can run on a number of runtimes like Apache Flink, Apache Spark, and Google Cloud Dataflow (a cloud service). Beam also brings DSL in different languages, allowing users to easily implement their data integration processes.
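A minimal word-count pipeline in the Beam Python SDK; run as-is it uses the local DirectRunner, and the same code can target Flink, Spark, or Cloud Dataflow by switching the runner in the pipeline options:

```python
import apache_beam as beam

with beam.Pipeline() as pipeline:   # DirectRunner by default; other runners via options
    (
        pipeline
        | "Create" >> beam.Create(["to be or not to be"])
        | "Split" >> beam.FlatMap(str.split)
        | "Pair" >> beam.Map(lambda word: (word, 1))
        | "Count" >> beam.CombinePerKey(sum)
        | "Print" >> beam.Map(print)   # e.g. ('to', 2), ('be', 2), ...
    )
```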
A high performance, real-time analytics database that delivers sub-second queries on streaming and batch data at scale and under load.
Realtime distributed OLAP datastore.
Reliably load real-time streams into data lakes, warehouses, and analytics services.
Snowflake enables organizations to learn, build, and connect with their data-driven peers. Collaborate, build data apps & power diverse workloads in the AI Data Cloud.
Deliver unmatched price performance at scale with SQL for your data lakehouse.
BigQuery is a serverless, cost-effective, and multicloud data warehouse designed to help you turn big data into valuable business insights. Start free.
Accelerate time to insight across enterprise data warehouses and big data systems.
The Apache Hive is a distributed, fault-tolerant data warehouse system that enables analytics at a massive scale and facilitates reading, writing, and managing petabytes of data residing in distributed storage using SQL.
Get real-time data access and machine learning generated insights to make better decisions that drive innovation with Enterprise Data Warehouse. Benefit from auto-scalability, high performance, security, and autonomous management, on-premises or in the cloud, eliminating complexity and lowering operational costs. Oracle’s enterprise-class data warehouse solution integrates, transforms, and connects all data across the organization.
Discover how Teradata's complete cloud analytics and data platform scales Trusted AI to generate ROI and drive profits.
Informatica is an Enterprise Cloud Data Management leader that brings data to life by empowering businesses to realize the transformative power of their most critical assets.
Turbocharge your data game with OpenText Analytics Database (Vertica)— high-speed SQL, Python analytics, and built-in ML, tailored for any deployment, anywhere!
Platform created by the community to programmatically author, schedule and monitor workflows.
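Workflows are defined as DAGs in ordinary Python files. A minimal sketch, assuming an Airflow 2.x installation (the DAG id, schedule, and task bodies are illustrative):

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pulling data from the source")   # placeholder work

def load():
    print("writing data to the warehouse")  # placeholder work

with DAG(
    dag_id="example_etl",                   # illustrative name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)
    extract_task >> load_task               # run extract before load
```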
Oracle Data Integrator is a comprehensive data integration platform that covers all data integration requirements: from high-volume, high-performance batch loads, to event-driven, trickle-feed integration processes, to SOA-enabled data services. Oracle Data Integrator (ODI) 12c, the latest version of Oracle’s strategic Data Integration offering, provides superior developer productivity and improved user experience with a redesigned flow-based declarative user interface and deeper integration with Oracle GoldenGate. ODI12c further builds on its flexible and high-performance architecture with comprehensive big data support and added parallelism when executing data integration processes. It includes interoperability with Oracle Warehouse Builder (OWB) for a quick and simple migration for OWB customers to ODI12c. Additionally, ODI can be monitored from a single solution along with other Oracle technologies and applications through the integration with Oracle Enterprise Manager 12c.
SQL Server Integration Services is a platform for building enterprise-level data integration and data transformation solutions.
Discover, prepare, and integrate all your data at any scale.
Azure Data Factory is Azure's cloud ETL service for scale-out serverless data integration and data transformation. It offers a code-free UI for intuitive authoring and single-pane-of-glass monitoring and management. You can also lift and shift existing SSIS packages to Azure and run them with full compatibility in ADF. SSIS Integration Runtime offers a fully managed service, so you don't have to worry about infrastructure management.
Dataflow is a fully managed streaming analytics service that reduces latency, processing time, and cost through autoscaling and real-time data processing.
Get to insights faster with fully automated cloud data pipelines. Rapidly move data from source to warehouse in just a few clicks, no IT expertise required.
Qlik, now with Talend, delivers a data fabric and next-level insights with its end-to-end data integration, data quality, & analytics solutions.
Informatica is an Enterprise Cloud Data Management leader that brings data to life by empowering businesses to realize the transformative power of their most critical assets.
Matillion’s unified ELT platform is the next step in data integration. Use AI to build faster pipelines, enhance data productivity and deliver analytics at scale.
Unify your data while building & managing clean, secure pipelines for better decision making. Power your data warehouse with ETL, ELT, CDC, Reverse ETL, and API Management.
RabbitMQ is a reliable and mature messaging and streaming broker, which is easy to deploy on cloud environments, on-premises, and on your local machine. It is currently used by millions worldwide.
Apache ActiveMQ is the most popular open source, multi-protocol, Java-based message broker. It supports industry standard protocols so users get the benefits of client choices across a broad range of languages and platforms. Connect from clients written in JavaScript, C, C++, Python, .Net, and more. Integrate your multi-platform applications using the ubiquitous AMQP protocol. Exchange messages between your web applications using STOMP over websockets. Manage your IoT devices using MQTT. Support your existing JMS infrastructure and beyond. ActiveMQ offers the power and flexibility to support any messaging use-case.
ZeroMQ (also known as ØMQ, 0MQ, or zmq) looks like an embeddable networking library but acts like a concurrency framework. It gives you sockets that carry atomic messages across various transports like in-process, inter-process, TCP, and multicast. You can connect sockets N-to-N with patterns like fan-out, pub-sub, task distribution, and request-reply. It's fast enough to be the fabric for clustered products. Its asynchronous I/O model gives you scalable multicore applications, built as asynchronous message-processing tasks. It has a score of language APIs and runs on most operating systems.
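A minimal request-reply pair using the pyzmq binding; both sockets live in one process here only to keep the sketch self-contained, and the port is arbitrary:

```python
import zmq

ctx = zmq.Context()

# Reply socket: binds and answers requests.
rep = ctx.socket(zmq.REP)
rep.bind("tcp://127.0.0.1:5555")   # arbitrary local port

# Request socket: connects and sends a request.
req = ctx.socket(zmq.REQ)
req.connect("tcp://127.0.0.1:5555")

req.send_string("ping")            # queued asynchronously by ZeroMQ
print(rep.recv_string())           # "ping"
rep.send_string("pong")
print(req.recv_string())           # "pong"

ctx.destroy()
```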
Fully managed message queuing for microservices, distributed systems, and serverless applications.
Cloud Tasks enables developers to manage large numbers of distributed tasks, small units of asynchronous computing work, through the use of queues and worker services.
Durable queues for large-volume cloud services.
Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications.
Stream, connect, process, and govern your data with a unified Data Streaming Platform built on the heritage of Apache Kafka and Apache Flink.
Securely stream data with a fully managed, highly available Apache Kafka service.
Collect, process, and analyze real-time video and data streams.
Build event-driven applications at scale across AWS, existing systems, or SaaS applications.
Apache Kafka made easy. Learn about the managed service for Apache Kafka that automates Kafka operations and security on Google Cloud.
Ingest events into Pub/Sub to stream to BigQuery, data lakes and databases; messaging middleware for streaming analytics and service integrations.
Provision cloud Hadoop, Spark and HBase clusters.
Apache Pulsar is an open-source, distributed messaging and streaming platform built for the cloud.
NATS is a connective technology powering modern distributed systems, unifying Cloud, On-Premise, Edge, and IoT.
Redpanda is a powerful, simple, and cost-efficient streaming data platform that is compatible with Kafka APIs while eliminating Kafka complexity.
Connect data as it's stored with Neo4j. Perform powerful, complex queries at scale and speed with our graph data platform.
High-performance graph analytics and serverless database for superior scalability and availability.
Aerospike Graph is a developer-ready graph database for real-time data that can be scaled without performance issues.
Build intelligent apps with a single database that combines relational, graph, key value, and search. No maintenance windows mean uninterrupted apps.
Unparalleled high performance and availability at global scale for PostgreSQL, MySQL, and DSQL.
CockroachDB is a distributed database with standard SQL for cloud applications. CockroachDB powers companies like Comcast, Lush, and Bose.
YugabyteDB is the 100% open source cloud native database for mission critical applications. YugabyteDB runs in any public or hybrid cloud.
Ensure uptime, lower costs, and scale easily with the only data platform built for reliable support of mission-critical applications.