Chapter 9 overview

How to Store Data

Learn how to store data, including how and when to use relational databases, key-value stores, file stores, object stores, CDNs, document stores, columnar databases, queues, streams, and more.

Key ideas you'll learn

  • Relational DBs, schemas
  • NoSQL, NewSQL, queues, streams
  • File storage and CDNs
  • Backup and recovery

Examples you'll try

  • Deploy PostgreSQL using RDS
  • Configure RDS backup, replicas
  • Use Knex.js for schema migrations
  • Use S3 and CloudFront for static assets

Table of contents

9.1 Local Storage: Hard Drives
9.2 Primary Data Store: Relational Databases
9.2.1 Reading and Writing Data
9.2.2 ACID Transactions
9.2.3 Schemas and Constraints
9.2.4 Example: PostgreSQL, Lambda, and Schema Migrations
Create an OpenTofu module
Create schema migrations
Create the Lambda function
9.3 Caching: Key-Value Stores and CDNs
9.3.1 Key-Value Stores
9.3.2 CDNs
9.4 File Storage: File Servers and Object Stores
9.4.1 File Servers
9.4.2 Object Stores
9.4.3 Example: Serving Files With S3 and CloudFront
Create an S3 bucket configured for website hosting
Upload static content to the S3 bucket
Deploy CloudFront as a CDN in front of the S3 bucket
9.5 Semi-Structured Data and Search: Document Stores
9.5.1 Reading and Writing Data
9.5.2 ACID Transactions
9.5.3 Schemas and Constraints
9.6 Analytics: Columnar Databases
9.6.1 Columnar Database Basics
9.6.2 Analytics Use Cases
9.7 Asynchronous Processing: Queues and Streams
9.7.1 Message Queues
9.7.2 Event Streams
9.8 Scalability and Availability
9.8.1 Relational Databases
Replication
Partitioning
9.8.2 NoSQL and NewSQL Databases
9.8.3 Distributed Systems
9.9 Backup and Recovery
9.9.1 Backup Strategies
Scheduled disk backups
Scheduled data store backups
Continuous data store backups
Data store replication
9.9.2 Backup Recommendations
9.9.3 Example: Backups and Read Replicas with PostgreSQL
9.10 Conclusion

Related Books

Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems

By Martin Kleppmann (O'Reilly)

recommended

Data is at the center of many challenges in system design today. Difficult issues need to be figured out, such as scalability, consistency, reliability, efficiency, and maintainability. In addition, we have an overwhelming variety of tools, including relational databases, NoSQL datastores, stream or batch processors, and message brokers. What are the right choices for your application? How do you make sense of all these buzzwords? In this practical and comprehensive guide, author Martin Kleppmann helps you navigate this diverse landscape by examining the pros and cons of various technologies for processing and storing data. Software keeps changing, but the fundamental principles remain the same. With this book, software engineers and architects will learn how to apply those ideas in practice, and how to make full use of data in modern applications.

Database Reliability Engineering: Designing and Operating Resilient Database Systems

By Laine Campbell and Charity Majors (O'Reilly)

The infrastructure-as-code revolution in IT is also affecting database administration. With this practical book, developers, system administrators, and junior to mid-level DBAs will learn how the modern practice of site reliability engineering applies to the craft of database architecture and operations. Authors Laine Campbell and Charity Majors provide a framework for professionals looking to join the ranks of today’s database reliability engineers (DBRE). You’ll begin by exploring core operational concepts that DBREs need to master. Then you’ll examine a wide range of database persistence options, including how to implement key technologies to provide resilient, scalable, and performant data storage and retrieval. With a firm foundation in database reliability engineering, you’ll be ready to dive into the architecture and operations of any modern database.

NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence

By Pramod Sadalage and Martin Fowler (Addison-Wesley Professional)

A brief introduction to the class of non-relational databases known as 'NoSQL.' The book covers core concepts as well as implementation issues and use cases.

Seven Databases in Seven Weeks: A Guide to Modern Databases and the NoSQL Movement

By Eric Redmond and Jim Wilson (Pragmatic Bookshelf)

Data is getting bigger and more complex by the day, and so are your choices in handling it. Explore some of the most cutting-edge databases available - from a traditional relational database to newer NoSQL approaches - and make informed decisions about challenging data storage problems. This is the only comprehensive guide to the world of NoSQL databases, with in-depth practical and conceptual introductions to seven different technologies: Redis, Neo4J, CouchDB, MongoDB, HBase, Postgres, and DynamoDB. This second edition includes a new chapter on DynamoDB and updated content for each chapter.

I Heart Logs: Event Data, Stream Processing, and Data Integration

By Jay Kreps (O'Reilly)

Why a book about logs? That’s easy: the humble log is an abstraction that lies at the heart of many systems, from NoSQL databases to cryptocurrencies. Even though most engineers don’t think much about them, this short book shows you why logs are worthy of your attention. Based on his popular blog posts, LinkedIn principal engineer Jay Kreps shows you how logs work in distributed systems, and then delivers practical applications of these concepts in a variety of common uses: data integration, enterprise architecture, real-time stream processing, data system design, and abstract computing models. Go ahead and take the plunge with logs; you’re going to love them.

SQL Cookbook: Query Solutions and Techniques for All SQL Users

By Anthony Molinaro and Robert de Graaf (O'Reilly)

You may know SQL basics, but are you taking advantage of its expressive power? This second edition applies a highly practical approach to Structured Query Language (SQL) so you can create and manipulate large stores of data. Based on real-world examples, this updated cookbook provides a framework to help you construct solutions and executable examples in several flavors of SQL, including Oracle, DB2, SQL Server, MySQL, and PostgreSQL. SQL programmers, analysts, data scientists, database administrators, and even relatively casual SQL users will find SQL Cookbook to be a valuable problem-solving guide for everyday issues. No other resource offers recipes in this unique format to help you tackle nagging day-to-day conundrums with SQL.

SQL in 10 Minutes a Day: Sams Teach Yourself

By Ben Forta (Sams Publishing)

Whether you're an application developer, database administrator, web application designer, mobile app developer, or Microsoft Office user, a good working knowledge of SQL is an important part of interacting with databases. And Sams Teach Yourself SQL in 10 Minutes offers the straightforward, practical answers you need to help you do your job. Expert trainer and popular author Ben Forta teaches you just the parts of SQL you need to know–starting with simple data retrieval and quickly going on to more complex topics including the use of joins, subqueries, stored procedures, cursors, triggers, and table constraints. You'll learn methodically, systematically, and simply–in short, quick lessons that will each take only 10 minutes or less to complete.

T-SQL Fundamentals

By Itzik Ben-Gan (Microsoft Press)

Master Transact-SQL's fundamentals, and write correct, robust code for querying and modifying data with modern Microsoft data technologies, including SQL Server 2022, Azure SQL Database, and Azure SQL Managed Instance. Long-time Microsoft Data Platform MVP Itzik Ben-Gan explains key T-SQL concepts, helping you apply your knowledge with hands-on exercises. Ben-Gan first introduces T-SQL's theory and underlying logic, illuminating it as both a language and a way of thinking. Next, he walks through core topics, including logical query processing, single table queries, joins, subqueries, table expressions, set operators, data analysis, data modifications, temporal tables, and transactions and concurrency. Building on this foundation, you'll enhance your coding capabilities, from programmatic constructs to the powerful new SQL Graph. Throughout, Ben-Gan presents reusable T-SQL sample code that works in cloud, on-premises, and hybrid environments.

Other Related Resources

Distributed systems theory for the distributed systems engineer

By Paper Trail (Blog post)

recommended

Writing about distributed systems, compilers, virtual machines, databases and research papers from SOSP, ATC, NSDI, OSDI, EuroSys and others.

Choose Boring Technology

By Dan McKinley (Talk)

recommended

This is the spoken word version of my essay, Choose Boring Technology. I have largely come to terms with it and the reality that I will never escape its popularity.

MapReduce: Simplified Data Processing on Large Clusters

By Jeffrey Dean and Sanjay Ghemawat (Article)

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper.
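
To make the map and reduce roles in that description concrete, here is a minimal single-process sketch of the model in Python, using word counting as the task. The function names (`map_word_count`, `reduce_word_count`, `run_mapreduce`) and the sample documents are hypothetical and chosen only for illustration; a real MapReduce system would distribute these steps across many machines and handle failures, which this sketch does not attempt.

```python
from collections import defaultdict


def map_word_count(doc_name, text):
    """Map: take one (key, value) input pair and emit intermediate (word, 1) pairs."""
    for word in text.lower().split():
        yield word, 1


def reduce_word_count(word, counts):
    """Reduce: merge all intermediate values that share the same intermediate key."""
    return sum(counts)


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Run map over every input pair, group values by key, then reduce each group."""
    intermediate = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in map_fn(key, value):
            intermediate[out_key].append(out_value)
    return {key: reduce_fn(key, values) for key, values in intermediate.items()}


# Hypothetical sample input: (document name, document contents) pairs.
documents = [
    ("a.txt", "the quick brown fox"),
    ("b.txt", "the lazy dog and the fox"),
]
word_counts = run_mapreduce(documents, map_word_count, reduce_word_count)
```

The grouping step in `run_mapreduce` stands in for the shuffle phase that a distributed implementation performs between the map and reduce stages.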

Bigtable: A Distributed Storage System for Structured Data

By Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber (Article)

Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving). Despite these varied demands, Bigtable has successfully provided a flexible, high-performance solution for all of these Google products. In this paper we describe the simple data model provided by Bigtable, which gives clients dynamic control over data layout and format, and we describe the design and implementation of Bigtable.

Dynamo: Amazon's Highly Available Key-value Store

By Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels (Article)

This paper presents the design and implementation of Dynamo, a highly available key-value storage system that some of Amazon’s core services use to provide an “always-on” experience. To achieve this level of availability, Dynamo sacrifices consistency under certain failure scenarios. It makes extensive use of object versioning and application-assisted conflict resolution in a manner that provides a novel interface for developers to use.

Network-Attached Hard Drive Tools

Amazon Elastic Block Store (EBS)

Easy to use, high performance block storage at any scale.

Google Persistent Disk

Persistent Disk is Google’s local durable storage service, fully integrated with Google Cloud products, Compute Engine and Google Kubernetes Engine.

Azure Disk Storage

Discover a high-performance, highly durable block storage service designed for Azure Virtual Machines.

Amazon Elastic File System (EFS)

Serverless, fully elastic file storage.

Google Cloud Filestore

Fully-managed, secure cloud file storage. Filestore offers petabyte-scale online network attached storage (NAS) for high performance computing.

Azure Files

Azure Files offers fully managed file shares in the cloud that are accessible via the industry standard Server Message Block (SMB) protocol, Network File System (NFS) protocol, and Azure Files REST API. Azure file shares can be mounted concurrently by cloud or on-premises deployments. SMB Azure file shares are accessible from Windows, Linux, and macOS clients. NFS Azure file shares are accessible from Linux clients. Additionally, SMB Azure file shares can be cached on Windows servers with Azure File Sync for fast access near where the data is being used.

Relational Database Tools

PostgreSQL

recommended used-in-book

PostgreSQL is a powerful, open source object-relational database system with over 35 years of active development that has earned it a strong reputation for reliability, feature robustness, and performance.

MySQL

The world's most popular open source database.

SQLite

recommended

SQLite is a C-language library that implements a small, fast, self-contained, high-reliability, full-featured, SQL database engine. SQLite is the most used database engine in the world. SQLite is built into all mobile phones and most computers and comes bundled inside countless other applications that people use every day.

MS SQL Server

Get the flexibility you need to use integrated solutions and apps with your data—in the cloud, on-premises, or at the edge. SQL Server 2022 is the most Azure-enabled release yet, with innovation across performance, security, and availability.

Oracle

Harness AI with Oracle Database 23ai to power app development and critical workloads, enhancing your data operations for free.

Schema Management Tools

Flyway

From version control to continuous delivery, Redgate Flyway helps individuals, teams, and enterprises build on application delivery processes to automate database development.

Liquibase

Automate database change management to code at full speed & continuously deliver with full confidence. Liquibase helps developers build applications faster.

Atlas

Manage your database schema as code.

Bytebase

Schema migration and database security for developer, security, DBA, and platform engineering teams.

Alembic

Alembic is a lightweight database migration tool for usage with the SQLAlchemy Database Toolkit for Python.

migrate

Database migrations written in Go. Use as CLI or import as library.

Sqitch

Sqitch is the developer-friendly, confidence-inducing, platform-neutral database change management system.

ActiveRecord

recommended

Active Record is part of the M in MVC - the model - which is the layer of the system responsible for representing data and business logic. Active Record helps you create and use Ruby objects whose attributes require persistent storage to a database.

Sequel

Ruby Sequel is a lightweight database toolkit for Ruby.

Knex.js

used-in-book

Knex.js (pronounced /kəˈnɛks/) is a 'batteries included' SQL query builder for PostgreSQL, CockroachDB, MSSQL, MySQL, MariaDB, SQLite3, Better-SQLite3, Oracle, and Amazon Redshift designed to be flexible, portable, and fun to use. It features both traditional node style callbacks as well as a promise interface for cleaner async flow control, a stream interface, full-featured query and schema builders, transaction support (with savepoints), connection pooling and standardized responses between different query clients and dialects.

GORM

The fantastic ORM library for Golang aims to be developer friendly.

Key-Value Store Tools

Redis

Developers love Redis. Unlock the full potential of the Redis database with Redis Enterprise and start building blazing fast apps.

Valkey

Valkey is an open source (BSD) high-performance key/value datastore that supports a variety of workloads such as caching, message queues, and can act as a primary database. Valkey can run as either a standalone daemon or in a cluster, with options for replication and high availability.

Memcached

Free & open source, high-performance, distributed memory object caching system, generic in nature, but intended for use in speeding up dynamic web applications by alleviating database load. Memcached is an in-memory key-value store for small chunks of arbitrary data (strings, objects) from results of database calls, API calls, or page rendering.

Amazon DynamoDB

used-in-book

Serverless, NoSQL, fully managed database with single-digit millisecond performance at any scale.

Riak KV

With a key/value design that delivers powerful – yet simple – data models for storing massive amounts of unstructured data, Riak KV is built to handle a variety of challenges facing Big Data applications that include tracking user or session information, storing connected device data and replicating data across the globe. Riak KV automates data distribution across the cluster to achieve fast performance and robust business continuity with a masterless architecture that ensures high availability, and scales near linearly using commodity hardware so you can easily add capacity without a large operational burden.

Redis Cloud

Build the fastest apps and deliver the richest real-time experiences with the official Redis-as-a-service.

Amazon ElastiCache

Real-time performance for real-time applications.

Google Cloud Memorystore

A fully managed in-memory service for Redis and Memcached that offers sub millisecond data access, scalability, and high availability.

Azure Cache for Redis

Distributed, in-memory, scalable solution providing super-fast data access.

Upstash

Upstash is a serverless data platform providing low latency and high scalability for real-time applications. Optimize your data infrastructure with Upstash's managed services for Redis, Vector, QStash, and other key data technologies.

CDN Tools

Cloudflare

recommended

Make employees, applications and networks faster and more secure everywhere, while reducing complexity and cost.

Akamai

Akamai is the cybersecurity and cloud computing company that powers and protects business online.

Fastly

Fastly's edge cloud platform delivers faster, safer, and more scalable sites and apps to customers. Elevate your edge CDN, video delivery, security, and more.

Imperva

Imperva provides complete cyber security by protecting what really matters most—your data and applications—whether on-premises or in the cloud.

Amazon CloudFront

used-in-book

Securely deliver content with low latency and high transfer speeds.

Google Cloud and Media CDN

Deliver content hosted on-premises or in another cloud over Google's high-performance distributed infrastructure.

Azure CDN

Fast, reliable content delivery network with global reach.

Object Store Tools

Amazon Simple Storage Service (S3)

recommended used-in-book

Object storage built to retrieve any amount of data from anywhere.

Google Cloud Storage (GCS)

Cloud Storage lets you store data with multiple redundancy options, virtually anywhere.

Azure Blob Storage

Massively scalable and secure object storage for cloud-native workloads, archives, data lakes, HPC, and machine learning.

Cloudflare R2

Cloudflare R2 is an S3-compatible, zero egress-fee, object storage. Move data freely and build the multi-cloud architecture you desire.

Wasabi

With Wasabi, you pay only for what you store. Enjoy the freedom to access your data whenever you want, without fees for egress or API requests.

Backblaze

Backblaze is a pioneer in robust, scalable low cost cloud backup and storage services. Enterprise hot storage, low cost backup and archive, and more.

Document Store and Search Tools

MongoDB

Get your ideas to market faster with a developer data platform built on the leading modern database. MongoDB makes working with data easy.

CouchDB

Seamless multi-master sync, that scales from Big Data to Mobile, with an Intuitive HTTP/JSON API and designed for Reliability.

Couchbase

Couchbase is the NoSQL cloud developer data platform for critical, AI-powered applications. Uncompromised speed, affordability, and ease of use.

Google Firestore

Use our flexible, scalable NoSQL cloud database, built on Google Cloud infrastructure, to store and sync data for client- and server-side development.

Elasticsearch

Elasticsearch is the leading distributed, RESTful, open source search and analytics engine designed for speed, horizontal scalability, reliability, and easy management. Get started for free.

OpenSearch

OpenSearch is a community-driven, Apache 2.0-licensed open source search and analytics suite that makes it easy to ingest, search, visualize, and analyze data.

Amazon OpenSearch

Securely unlock real-time search, monitoring, and analysis of business and operational data.

Algolia

Enterprises and developers use Algolia’s AI search infrastructure to understand users and show them what they’re looking for.

Apache Solr

Solr is the blazing-fast, open source, multi-modal search platform built on the full-text, vector, and geospatial search capabilities of Apache Lucene.

Apache Lucene

The Apache Lucene project develops open-source search software. The project releases a core search library, named Lucene Core, as well as PyLucene, a Python binding for Lucene. Lucene Core is a Java library providing powerful indexing and search features, as well as spellchecking, hit highlighting and advanced analysis/tokenization capabilities. The PyLucene sub project provides Python bindings for Lucene Core.

Columnar and Time-Series Database Tools

Cassandra

Apache Cassandra is an open source NoSQL distributed database trusted by thousands of companies for scalability and high availability without compromising performance. Linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data.

Google Bigtable

Bigtable is an HBase-compatible, enterprise-grade NoSQL database with low single-digit millisecond latency and limitless scale.

HBase

Apache HBase is the Hadoop database, a distributed, scalable, big data store. Use Apache HBase when you need random, realtime read/write access to your Big Data. This project's goal is the hosting of very large tables -- billions of rows X millions of columns -- atop clusters of commodity hardware. Apache HBase is an open-source, distributed, versioned, non-relational database modeled after Google's Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.

Kudu

A new open source Apache Hadoop ecosystem project, Apache Kudu completes Hadoop's storage layer to enable fast analytics on fast data.

InfluxDB

Manage all types of time series data in a single, purpose-built database. Optimized for speed in any environment in the cloud, on-premises, or at the edge.

Amazon Timestream

Easy-to-manage time-series databases optimized for security, performance, availability, and scalability.

Prometheus

An open-source monitoring system with a dimensional data model, flexible query language, efficient time series database and modern alerting approach.

Riak TS

Riak TS is the only enterprise-grade NoSQL time series database optimized specifically for IoT and Time Series data. It ingests, transforms, stores, and analyzes massive amounts of time series data. Riak TS is engineered to be faster than Cassandra.

Timescale

Engineered to handle demanding workloads, like time series, vector, events, and analytics data. Built on PostgreSQL, with expert support at no extra charge.

Big Data Tools

Hadoop

The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing. The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

Cloudera

Cloudera delivers a hybrid data platform with secure data management and portable cloud-native data analytics.

Amazon EMR

Easily run and scale Apache Spark, Hive, Presto, and other big data workloads.

Google Dataproc

Dataproc is a fast, easy-to-use, fully managed cloud service for running Apache Spark and Apache Hadoop clusters in a simpler, more cost-efficient way.

Azure HDInsight

Provision cloud Hadoop, Spark and HBase clusters.

Fast Data Tools

Apache Spark

Apache Spark is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.

Apache Flink

Apache Storm

Apache Storm is a free and open source distributed realtime computation system. Apache Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing. Apache Storm is simple, can be used with any programming language, and is a lot of fun to use! Apache Storm has many use cases: realtime analytics, online machine learning, continuous computation, distributed RPC, ETL, and more. Apache Storm is fast: a benchmark clocked it at over a million tuples processed per second per node. It is scalable, fault-tolerant, guarantees your data will be processed, and is easy to set up and operate. Apache Storm integrates with the queueing and database technologies you already use. An Apache Storm topology consumes streams of data and processes those streams in arbitrarily complex ways, repartitioning the streams between each stage of the computation however needed.

Apache Samza

Samza allows you to build stateful applications that process data in real-time from multiple sources including Apache Kafka. Battle-tested at scale, it supports flexible deployment options to run on YARN or as a standalone library.

Apache Beam

Apache Beam is an open source, unified model and set of language-specific SDKs for defining and executing data processing workflows, and also data ingestion and integration flows, supporting Enterprise Integration Patterns (EIPs) and Domain Specific Languages (DSLs). Dataflow pipelines simplify the mechanics of large-scale batch and streaming data processing and can run on a number of runtimes like Apache Flink, Apache Spark, and Google Cloud Dataflow (a cloud service). Beam also brings DSL in different languages, allowing users to easily implement their data integration processes.

Apache Druid

A high performance, real-time analytics database that delivers sub-second queries on streaming and batch data at scale and under load.

Apache Pinot

Realtime distributed OLAP datastore.

Amazon Data Firehose

Reliably load real-time streams into data lakes, warehouses, and analytics services.

Data Warehouse Tools

Snowflake

Snowflake enables organizations to learn, build, and connect with their data-driven peers. Collaborate, build data apps & power diverse workloads in the AI Data Cloud.

Amazon Redshift

Deliver unmatched price performance at scale with SQL for your data lakehouse.

Google BigQuery

BigQuery is a serverless, cost-effective, and multicloud data warehouse designed to help you turn big data into valuable business insights. Start free.

Azure Synapse Analytics

Accelerate time to insight across enterprise data warehouses and big data systems.

Apache Hive

Apache Hive is a distributed, fault-tolerant data warehouse system that enables analytics at a massive scale and facilitates reading, writing, and managing petabytes of data residing in distributed storage using SQL.

Oracle Enterprise Data Warehouse

Get real-time data access and machine learning generated insights to make better decisions that drive innovation with Enterprise Data Warehouse. Benefit from auto-scalability, high performance, security, and autonomous management, on-premises or in the cloud, eliminating complexity and lowering operational costs. Oracle’s enterprise-class data warehouse solution integrates, transforms, and connects all data across the organization.

Teradata

Discover how Teradata's complete cloud analytics and data platform scales Trusted AI to generate ROI and drive profits.

Informatica

Informatica is an Enterprise Cloud Data Management leader that brings data to life by empowering businesses to realize the transformative power of their most critical assets.

Vertica

Turbocharge your data game with OpenText Analytics Database (Vertica)— high-speed SQL, Python analytics, and built-in ML, tailored for any deployment, anywhere!

ETL Tools

Apache Airflow

Platform created by the community to programmatically author, schedule and monitor workflows.

Oracle Data Integrator

Oracle Data Integrator is a comprehensive data integration platform that covers all data integration requirements: from high-volume, high-performance batch loads, to event-driven, trickle-feed integration processes, to SOA-enabled data services. Oracle Data Integrator (ODI) 12c, the latest version of Oracle’s strategic Data Integration offering, provides superior developer productivity and improved user experience with a redesigned flow-based declarative user interface and deeper integration with Oracle GoldenGate. ODI12c further builds on its flexible and high-performance architecture with comprehensive big data support and added parallelism when executing data integration processes. It includes interoperability with Oracle Warehouse Builder (OWB) for a quick and simple migration for OWB customers to ODI12c. Additionally, ODI can be monitored from a single solution along with other Oracle technologies and applications through the integration with Oracle Enterprise Manager 12c.

SQL Server Integration Services

SQL Server Integration Services is a platform for building enterprise-level data integration and data transformations solutions.

AWS Glue

Discover, prepare, and integrate all your data at any scale.

Azure Data Factory

Azure Data Factory is Azure's cloud ETL service for scale-out serverless data integration and data transformation. It offers a code-free UI for intuitive authoring and single-pane-of-glass monitoring and management. You can also lift and shift existing SSIS packages to Azure and run them with full compatibility in ADF. SSIS Integration Runtime offers a fully managed service, so you don't have to worry about infrastructure management.

Google Cloud Dataflow

Dataflow is a fully managed streaming analytics service that reduces latency, processing time, and cost through autoscaling and real-time data processing.

Stitch

Get to insights faster with fully automated cloud data pipelines. Rapidly move data from source to warehouse in just a few clicks, no IT expertise required.

Qlik

Qlik, now with Talend, delivers a data fabric and next-level insights with its end-to-end data integration, data quality, & analytics solutions.

Informatica

Informatica is an Enterprise Cloud Data Management leader that brings data to life by empowering businesses to realize the transformative power of their most critical assets.

Matillion

Matillion’s unified ELT platform is the next step in data integration. Use AI to build faster pipelines, enhance data productivity and deliver analytics at scale.

Integrate.io

Unify your data while building & managing clean, secure pipelines for better decision making. Power your data warehouse with ETL, ELT, CDC, Reverse ETL, and API Management.

Message Queue Tools

RabbitMQ

RabbitMQ is a reliable and mature messaging and streaming broker, which is easy to deploy on cloud environments, on-premises, and on your local machine. It is currently used by millions worldwide.

ActiveMQ

Apache ActiveMQ is the most popular open source, multi-protocol, Java-based message broker. It supports industry standard protocols so users get the benefits of client choices across a broad range of languages and platforms. Connect from clients written in JavaScript, C, C++, Python, .Net, and more. Integrate your multi-platform applications using the ubiquitous AMQP protocol. Exchange messages between your web applications using STOMP over websockets. Manage your IoT devices using MQTT. Support your existing JMS infrastructure and beyond. ActiveMQ offers the power and flexibility to support any messaging use-case.

ZeroMQ

ZeroMQ (also known as ØMQ, 0MQ, or zmq) looks like an embeddable networking library but acts like a concurrency framework. It gives you sockets that carry atomic messages across various transports like in-process, inter-process, TCP, and multicast. You can connect sockets N-to-N with patterns like fan-out, pub-sub, task distribution, and request-reply. It's fast enough to be the fabric for clustered products. Its asynchronous I/O model gives you scalable multicore applications, built as asynchronous message-processing tasks. It has a score of language APIs and runs on most operating systems.

Amazon Simple Queue Service (SQS)

recommended

Fully managed message queuing for microservices, distributed systems, and serverless applications.

Google Cloud Tasks

Cloud Tasks enables developers to manage large numbers of distributed tasks, small units of asynchronous computing work, through the use of queues and worker services.

Azure Queue Storage

Durable queues for large-volume cloud services.
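
The tools above share one core pattern: producers put tasks on a queue, and a pool of workers pulls them off so that each task is handled by exactly one worker. A minimal in-process sketch of that pattern, using Python's standard library as a stand-in for a real broker (the function name `run_workers` and the upper-casing "work" are hypothetical, chosen only for illustration):

```python
import queue
import threading


def run_workers(tasks, num_workers=2):
    """Process each task exactly once across a pool of worker threads."""
    q = queue.Queue()
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            task = q.get()
            if task is None:  # sentinel: no more work for this worker
                q.task_done()
                return
            with lock:
                # Placeholder for real work (sending an email, resizing an image, ...)
                results.append(task.upper())
            q.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for task in tasks:
        q.put(task)
    for _ in threads:
        q.put(None)  # one sentinel per worker
    q.join()
    for t in threads:
        t.join()
    return sorted(results)
```

A hosted queue such as SQS adds what this sketch lacks: durability across process restarts, redelivery when a worker fails, and producers and consumers that run on separate machines.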

Event Streaming Tools

Apache Kafka

recommended

Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications.

Confluent

Stream, connect, process, and govern your data with a unified Data Streaming Platform built on the heritage of Apache Kafka and Apache Flink.

Amazon Managed Streaming for Kafka (MSK)

Securely stream data with a fully managed, highly available Apache Kafka service.

Amazon Kinesis

Collect, process, and analyze real-time video and data streams.

Amazon EventBridge

Build event-driven applications at scale across AWS, existing systems, or SaaS applications.

Google Cloud Managed Service for Kafka

Apache Kafka made easy. Learn about the managed service for Apache Kafka that automates Kafka operations and security on Google Cloud.

Google Cloud Pub/Sub

Ingest events into Pub/Sub to stream to BigQuery, data lakes and databases; messaging middleware for streaming analytics and service integrations.

Azure HDInsight

Provision cloud Hadoop, Spark, HBase, and Kafka clusters.

Apache Pulsar

Apache Pulsar is an open-source, distributed messaging and streaming platform built for the cloud.

NATS

NATS is a connective technology powering modern distributed systems, unifying Cloud, On-Premise, Edge, and IoT.

Redpanda

Redpanda is a powerful, simple, and cost-efficient streaming data platform that is compatible with Kafka APIs while eliminating Kafka complexity.
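
What sets the event streaming tools above apart from message queues is that events are kept in an append-only log, and each consumer tracks its own position in that log, so several consumers can read the same events independently. A minimal in-memory sketch of that idea follows; the class `EventLog` and its methods are hypothetical and do not mirror the API of Kafka or any other tool listed here:

```python
class EventLog:
    """Append-only log in which every consumer keeps its own read offset."""

    def __init__(self):
        self._events = []
        self._offsets = {}  # consumer name -> index of the next event to read

    def append(self, event):
        """Add an event to the end of the log and return its offset."""
        self._events.append(event)
        return len(self._events) - 1

    def poll(self, consumer, max_events=10):
        """Return the next batch of unread events for this consumer."""
        start = self._offsets.get(consumer, 0)
        batch = self._events[start:start + max_events]
        self._offsets[consumer] = start + len(batch)
        return batch
```

Because reading does not remove events, a consumer added later can still replay the log from offset 0, which a plain queue cannot offer once its messages have been consumed.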

Graph Database Tools

Neo4j

Connect data as it's stored with Neo4j. Perform powerful, complex queries at scale and speed with our graph data platform.

Amazon Neptune

High-performance graph analytics and serverless database for superior scalability and availability.

Aerospike

Aerospike Graph is a developer-ready graph database for real-time data that can be scaled without performance issues.

NewSQL Tools

Google Spanner

Build intelligent apps with a single database that combines relational, graph, key value, and search. No maintenance windows mean uninterrupted apps.

Amazon Aurora

Unparalleled high performance and availability at global scale for PostgreSQL, MySQL, and DSQL.

CockroachDB

CockroachDB is a distributed database with standard SQL for cloud applications. CockroachDB powers companies like Comcast, Lush, and Bose.

YugabyteDB

YugabyteDB is the 100% open source cloud native database for mission critical applications. YugabyteDB runs in any public or hybrid cloud.

VoltDB

Ensure uptime, lower costs, and scale easily with the only data platform built for reliable support of mission-critical applications.
