Stacking up: Building a modern data warehouse with Amazon Redshift

The cloud has opened up large-scale data analytics opportunities

Paid Feature Data warehouses have come a long way since the big iron days of the late eighties.

Now, the cloud has opened up large-scale data analytics opportunities for users that could never have previously afforded it. Scalable compute and storage options, combined with open data formats and online integration, make it easier for developers to stand up data warehouses that only enterprise giants could have handled before.

Amazon has spent the last few years honing its Amazon Redshift managed data warehouse to ease this task for developers and data scientists. Today, the company offers a hybrid architectural pattern when storing warehouse data.

Redshift can store data in its own cluster, but it can also access data from other sources. This is a common approach for customers who use data lakes in their data warehousing process. A data lake draws on multiple sources of structured and unstructured data, feeding it into a pipeline that aggregates, converts, and organizes it into a collection of data that developers can query using standard SQL.

Data needn't live inside a Redshift cluster to be in a data lake. In 2017, Amazon launched Amazon Redshift Spectrum, which is a feature that allows developers to access data in-place in the S3 storage layer. This eliminates the need to batch transfer S3 data into a Redshift cluster.

Redshift Spectrum divides a query and distributes them across multiple nodes to speed up access. Redshift aggregates the results and delivers them back to the cluster. The service supports a variety of data files ranging from CSVs through to Apache's open-source Parquet data format.

This makes it cheaper to build large data stores - you can store around a terabyte of data in S3 for about $23 per month. That capability arrived right on cue for a massive uptick in data volumes thanks to developments like the increased use of telemetry data and the internet of things.

For example, IoT sensors are present on consumer devices like smart watches and fridges, and industrial scale machinery such as wind turbines and jet engines. They are all capable of generating huge volumes of data at high velocity. Rather than ingesting it directly into a data warehouse, you now have the option to store and analyze this data and keep it for extended periods of time in S3.

Extending Redshift's reach

Since launching Redshift Spectrum, Amazon has expanded the range of data sources that its data warehouse can access. In 2020, the cloud service provider unveiled federated query for Redshift, enabling it to analyze data in external databases directly from within a Redshift cluster. The service began with support for PostgreSQL in its Relational Database Service (RDS), along with its Aurora PostgreSQL service, and recently expanded with support for MySQL.

Offering such queries in Redshift offloads some of the query directly into the target databases to increase performance by minimizing the data that it needs to extract over the network. That does away with cumbersome ETL work to get relational database content ready for Redshift. The system transforms the results on the fly for use in the Redshift cluster.

Developers can build these queries using the Redshift Query Editor, a free web-based tool that supports access to both Redshift Spectrum and federated queries. Redshift lets you touch those different data sources seamlessly, according to AWS, with the upshot that end users don't need to know where it is.

Grappling with rising data volumes at Nasdaq

Nasdaq has taken advantage of Redshift in a deployment that is now in its seventh year. As automated trading platforms have entered the market, the pace and volume of transactions has grown. In 2014, to increase scale and performance and lower operational costs, Nasdaq moved from a legacy on-premises data warehouse to an Amazon Web Services (AWS) data warehouse powered by an Amazon Redshift cluster.

Between 2014 and 2018, this Amazon Redshift cluster grew to 70 nodes as the company expanded the solution to support all its North American markets. By 2018, the solution ingested financial market data from thousands of sources nightly, ranging from 30 billion to 55 billion records and surpassing 4 terabytes.

In January 2019, Nasdaq attended an AWS Data Lab, where it worked with AWS Solutions Architects and analytics service experts who provided prescriptive architectural guidance to rethink how Nasdaq implemented data warehousing. In the four-day lab, Nasdaq reinvented how it delivers analytics by using Amazon Redshift as a compute layer.

As a result, Nasdaq began using Amazon Redshift Spectrum, a feature that powers a lake house architecture to query data both in the data warehouse and in the Amazon S3 data lake.

“We’re putting all the data that comes from our internally operated exchanges into Amazon S3 and Amazon Redshift Spectrum,” says Robert Hunt, vice president of software engineering for global surrounding systems at Nasdaq. “That includes orders, cancellations, quotes, and trades. Those are turned into messages and archived in Amazon S3, and those messages drive our downstream billing and reporting surveillance processes.”

With compute and storage scaling independently, Nasdaq can now flex its compute layer to support the volume of transactions, with the data lake built on Amazon S3 storage easily supporting data that continues to grow in volume and complexity. For example, market volatility spiked in late February 2020, at the beginning of the COVID-19 pandemic, and the solution scaled to support an ingest of 70 billion records daily—with a peak volume of 113 billion.

Redshift's ability to query that volume of records without slowing down is also partly due to its columnar architecture. A traditional relational database's tables store records in rows containing fields, such as name and address. Each record could have hundreds of fields, especially when using JOINs to reference tables against each other. Querying all of that data isn't efficient if all you want is three columns, because a relational database reads through all the fields in a row when querying. When you're scaling to billions of rows, that's a problem.

A columnar database is column-centric. It lets you query records by column rather than row. Querying on three columns across billions of records would cause it to read three long columns, cross-referencing them by row ID. It then employs extra performance measures, such as combining lists of records that have the same values for specific columns.

The benefits of Redshift

Redshift's columnar architecture, along with features such as concurrency scaling that allow Redshift to seamlessly add computing power when needed, improves its price-performance ratio, AWS says. You don’t have to size for your peaks, rather you can size for your average use cases and the system scales up and down as needed.

Another factor that speeds up Redshift is its cloud based massively parallel processing architecture. Original on-premises data warehouses scaled up, using expensive boxes to divide queries across processors sharing proprietary backplanes. As they began using commodity hardware they scaled out, spreading queries across multiple nodes in a cluster of x86 servers. That scales, but only based on the number of CPUs you have on your premises. AWS has lots.

MPP databases are good at analytic queries. Redshift has a smart optimizer that analyzes query requests and distributes work amongst nodes in a cluster, allowing operations to happen in parallel. They are very fast when compared with a traditional database.

The company has also changed the hardware underpinning Redshift to accelerate access to that data. If you ask a question and it requires processing billions of rows, then traditionally it must move that data before processing it. That's a problem when you're accessing 10s of billions of rows. It takes too much time.

To solve this problem, AWS moved computing closer to the storage layer as part of its AQUA hardware accelerated caching system. This enables more of the decryption, filtering, and aggregation to happen at the storage layer, avoiding the network bottlenecks that customers would face trying to pass data back across the infrastructure for processing.

The company processes the data using Nitro chips that AWS designed itself and FPGAs, optimizing them for that purpose. The optimizer might push the query to the AQUA layer and then those nodes do the heavy lifting.

In 2019, Amazon introduced a new type of Redshift cluster node intended to make it easier for customers to mix computing and storage capability. RA3 instances include managed storage that it decouples from the compute side.

Customers pay only for the managed storage that they use at a rate of $0.024 per GB per month and they pay separately for computing power on a per-hour basis or via Reserved Instances which offer a discount over on-demand rates. These nodes will prefetch data from S3 storage automatically based on customer access patterns. They also support the AQUA architecture.

Getting data out of the system

On the other side of the data lake are the business users and data scientists that need extra data from Redshift.

Business users can create reports and visualizations from Redshift using business intelligence tools. AWS works with multiple third parties to provide reporting dashboards, but users can also access Redshift data via the company's own serverless QuickSightBI service. Redshift also has built in machine learning functionality, which makes it easy for data analysts and database developers to create, train, and apply machine learning models using familiar SQL commands.

Individuals needn't just interrogate their own Redshift cluster for data - AWS has identified a need to share data between multiple Redshift clusters owned by different parts of the business.

Companies often run different internal data warehouse instances that create their own data lakes. Traditionally, they batch-transferred the data between systems. This created an ETL burden for Redshift customers that was lengthy and costly in network and computing resources. It also meant that the data available to other clusters lagged behind the data available in the cluster that produced it.

Amazon announced its Redshift data sharing capability at the end of 2020 to help exchange data between different clusters. Using this feature, one cluster can publish data for others to consume in-place. Users of one Redshift cluster can issue SQL queries to the publishing cluster via business intelligence tools.

Amazon recently also announced AWS Data Exchange for Amazon Redshift, currently in public preview. This is a feature built on the data sharing functionality that enables you to find and subscribe to third-party data in AWS Data Exchange, and query that data from Redshift within minutes.

There's a common thread running through AWS's Redshift investments in recent years. It has integrated the data warehouse with other elements of its infrastructure, easing the burden of managing large data volumes.

There is still a place for ETL at some parts of the data lifecycle, but the focus has been on minimizing those tasks. Redshift either manages data transfer invisibly behind the scenes or eliminates the problem altogether by empowering users to make direct queries across other services in place. That will become an increasingly important trend as big data volumes continue to grow.

This article is sponsored by AWS.

Biting the hand that feeds IT © 1998–2021