The rise of the cloud data warehouse

Performance at the right price is key, says AWS

Paid Feature

The cloud has a habit of transforming on-premises technologies that have existed for decades.

It absorbed business applications that used to run exclusively on local servers. It embraced the databases they relied on, presenting an alternative to costly proprietary implementations. And it has also driven new efficiencies into one of the most venerable on-premises data analytics technologies of all: the data warehouse.

Data warehousing is a huge market. Allied Market Research put it at $21.18bn in 2019, and estimates that it will more than double to $51.18bn in 2028. The projected 10.7 percent CAGR between 2020 and 2028 reflects an unprecedented hunger for data-driven insights.

It isn't as though data warehousing is a new concept. It has been around since the late eighties, when researchers began building systems that funneled operational data through to decision-making systems. They wanted that data to help strategists understand the subtle currents that made a business tick.

This product category initially targeted on-premises installations, with big iron servers capable of handling large computing workloads. Many of these systems were designed to scale up, adding more processors connected by proprietary backplanes. They were expensive to buy, complex to operate, and difficult to maintain. The upshot, AWS claims, was that companies found themselves spending a lot on these implementations and not getting enough value in return.

As companies produced more data, it became harder for these implementations to keep up. Data volumes exploded, driven not just by the increase in structured records but also by an expansion in data types. Unstructured data, ranging from social media posts to streaming IoT data, has sent storage and processing requirements soaring.

The evolution of cloud data warehouses

Cloud computing evolved around the same time, and AWS argues that it changed data warehousing for the better. Data warehousing has long been popular with customers in sectors like financial services and healthcare, which have been heavy analytics users.

But the cloud has opened up the concept to far more companies thanks to lower prices and better performance, according to AWS. Applications previously restricted to multinational banks and academic labs are now open to smaller businesses, which can run data analytics in the cloud with benefits such as scale, elasticity, faster time to value, cost efficiency and readily available applications.

The numbers bear this out. According to Research and Markets, the global market for data warehouse as a service (DWaaS) products will enjoy a 21.7 percent CAGR between 2021 and 2026, growing from $1.7bn to $4.5bn.
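For readers who want to check the arithmetic, a compound annual growth rate can be recomputed from the two endpoint figures. The short Python sketch below does that for the DWaaS numbers quoted above; the function name is ours, and the endpoints are the rounded values in the text, so the result should land close to the published rate rather than match it exactly.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate as a fraction (0.10 means 10 percent)."""
    return (end_value / start_value) ** (1 / years) - 1


# Figures quoted above: $1.7bn in 2021 growing to $4.5bn in 2026 (five years).
dwaas_cagr = cagr(1.7, 4.5, 2026 - 2021)
print(f"Implied DWaaS CAGR: {dwaas_cagr:.1%}")
```

The implied rate comes out a little under the quoted 21.7 percent, a gap that is presumably down to rounding in the published market sizes.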

The largest cloud players have leaped on this trend, with Microsoft offering its Synapse service and Google running BigQuery. AWS announced Redshift in 2012, billing it as the first cloud data warehouse to address the market. The idea was pretty simple, AWS told us. The company wanted to give customers a scalable solution, where they could use the flexibility of the cloud to manage data at any scale and velocity while remaining cost effective.

Enter Redshift

Unlike online transaction processing databases like Amazon Aurora, Redshift targets online analytical processing (OLAP), offering support for fast queries thanks to scalable nodes with massively parallel processing (MPP) in a cluster. The cloud-based data warehouse follows the AWS managed database ethos. Rather than relying on a customer's administrators to take care of maintenance tasks, the company handles them behind the scenes in the cloud.

Aside from standing up hardware, this includes patching the software and handling backups and recovery. That leaves developers free to focus on work ranging from modernizing existing data warehouse strategies to accelerating analytics workloads. Redshift handles the latter by using back-end parallel processing to spread queries over up to 128 nodes. Companies can use it for everything from analyzing global sales data to crunching through advertising impression metrics.
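To give a feel for what spreading a query over many nodes means, here is a toy scatter-gather aggregation in Python. It is a sketch of the general MPP idea only and makes no claim about Redshift's internals: the function names, the hash-based distribution and the use of threads in place of real cluster nodes are all our own assumptions.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from zlib import crc32


def distribute(rows, node_count):
    """Assign each (region, amount) row to a node by hashing its region key."""
    slices = [[] for _ in range(node_count)]
    for region, amount in rows:
        slices[crc32(region.encode("utf-8")) % node_count].append((region, amount))
    return slices


def partial_totals(node_rows):
    """Work done independently on one node: sum amounts per region."""
    totals = Counter()
    for region, amount in node_rows:
        totals[region] += amount
    return totals


def sales_by_region(rows, node_count=4):
    """Scatter rows to nodes, aggregate concurrently, then merge the partial results."""
    slices = distribute(rows, node_count)
    with ThreadPoolExecutor(max_workers=node_count) as pool:
        partials = list(pool.map(partial_totals, slices))
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # Counter.update adds values for matching keys
    return dict(merged)


# Hypothetical sales rows: (region, amount)
sales = [("EMEA", 100), ("APAC", 50), ("EMEA", 25), ("AMER", 75)]
print(sales_by_region(sales))
```

Each node only ever sees its own slice of the data, and the final step merges small partial results rather than raw rows, which is why adding nodes can shorten query times for large tables.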

AWS also highlights other applications that can draw on cloud-based data warehouse technology, including predictive analytics, which enables companies to mine historical data for insights that could help to chart future events. Redshift also helps customers with applications that are often time critical, AWS says. These include recommendation and personalization, and fraud detection.

Performance at the right price is key, asserts AWS, which reports that customers’ latency requirements for processing and analyzing their data are shortening, with many wanting results in near real time.

AWS benchmarked Redshift against other market players and found price performance up to three times better than the alternatives. The system's ability to dynamically scale the number of nodes in a cluster helps here, as does its ability to access data in place from various sources across a data lake.

Data sharing

Traditionally, data sharing is a cumbersome process in which files are uploaded manually from one system and copied to another. This approach, AWS says, “does not provide complete and up-to-date views of the data as the manual processes introduce delays, human error and data inconsistencies, resulting in stale data and poor decisions”.

In response to feedback from customers who wanted to share “data at many levels to enable broad and deep insights but also minimize complexity and cost,” AWS has introduced a capability that overcomes this issue.

Announced late last year, Amazon Redshift data sharing lets customers share data without copying it. The capability enables them to query live data at their convenience and get up-to-date views across organizations, customers and partners as the data is updated. In addition, Redshift integrates with AWS Data Exchange, so customers can find and subscribe to third-party data there without extracting, transforming and loading it.
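In practice, data sharing is driven by a handful of SQL statements run on the producer and consumer sides. The Python sketch below simply assembles those statements as strings so the flow is easy to follow. The share, schema, table and database names are hypothetical, the namespace values are placeholders that would be replaced with real cluster namespace identifiers, and the exact options should be checked against the Redshift documentation.

```python
def producer_statements(share, schema, table, consumer_namespace):
    """SQL a producer cluster would run to expose one table through a datashare."""
    return [
        f"CREATE DATASHARE {share};",
        f"ALTER DATASHARE {share} ADD SCHEMA {schema};",
        f"ALTER DATASHARE {share} ADD TABLE {schema}.{table};",
        f"GRANT USAGE ON DATASHARE {share} TO NAMESPACE '{consumer_namespace}';",
    ]


def consumer_statements(share, local_db, producer_namespace, schema, table):
    """SQL a consumer cluster would run to query the shared table live, without a copy."""
    return [
        f"CREATE DATABASE {local_db} FROM DATASHARE {share} "
        f"OF NAMESPACE '{producer_namespace}';",
        f"SELECT COUNT(*) FROM {local_db}.{schema}.{table};",
    ]


# All names below are illustrative placeholders, not real identifiers.
producer_sql = producer_statements(
    "sales_share", "public", "sales", "<consumer-namespace-id>"
)
consumer_sql = consumer_statements(
    "sales_share", "shared_sales", "<producer-namespace-id>", "public", "sales"
)
for statement in producer_sql + consumer_sql:
    print(statement)
```

The point to notice is that the consumer queries the producer's table through a local database name, and at no stage is a file exported or loaded.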

Amazon Redshift data sharing is already proving a hit with AWS customers, who are finding new use cases such as data marketplaces and workload isolation.

Data integration capabilities

Data lakes have evolved as companies draw in data of different types from multiple sources. When unstructured data such as machine logs, sensor data, or website clickstream data comes in, you don't know its quality or what insights it will yield.

AWS told us many customers have asked for data stores where they can break free of data silos and land all of this data quickly, process it, and move it to more SLA-intensive systems for query and reporting like data warehouses and databases.

The cloud is the perfect place to put this data thanks to commodity storage. Storing data in the cloud is cheap thanks to a mixture of economies of scale on the cloud service provider side, and tiered storage that lets you put data in lower-cost tiers such as S3.

Data gravity is the other driver. A lot of data today begins in the cloud, whether it comes from social media, machine logs, or cloud-based business software. It makes little sense to move that data from the cloud to on-premises applications for processing. Instead, AWS asks, why not just shorten the time it takes to get insights from it?

The company designed the data warehouse to share information in the cloud, folding in API support for direct access. Redshift can pull in data from S3's cheap storage layer if necessary for fast, repeated processing, or it can access it in place. It also features different types of nodes optimized for storage or compute. It can interact with data in Amazon's Aurora cloud-native relational database, and other relational databases via Amazon Relational Database Service (RDS).
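The difference between pulling data in and reading it in place shows up in the SQL. As a rough sketch, the Python below builds one statement of each kind: a COPY that loads files from S3 into a local table, and a CREATE EXTERNAL SCHEMA that points Redshift at tables defined in an external data catalog so they can be queried where they sit. The bucket, role ARN, schema and catalog names are invented for illustration, and the external tables themselves are assumed to have been defined in the catalog already.

```python
# Hypothetical IAM role; a real cluster would use a role it has been granted.
EXAMPLE_ROLE = "arn:aws:iam::123456789012:role/ExampleRedshiftRole"


def copy_from_s3(table, s3_prefix, iam_role, file_format="PARQUET"):
    """Load path: copy files under an S3 prefix into a local Redshift table."""
    return (
        f"COPY {table} FROM '{s3_prefix}' "
        f"IAM_ROLE '{iam_role}' FORMAT AS {file_format};"
    )


def external_schema(schema, catalog_database, iam_role):
    """In-place path: expose externally catalogued tables under a local schema name."""
    return (
        f"CREATE EXTERNAL SCHEMA {schema} FROM DATA CATALOG "
        f"DATABASE '{catalog_database}' IAM_ROLE '{iam_role}';"
    )


load_sql = copy_from_s3("sales", "s3://example-bucket/sales/", EXAMPLE_ROLE)
in_place_sql = external_schema("lake", "example_catalog_db", EXAMPLE_ROLE)
print(load_sql)
print(in_place_sql)
```

Loading suits data that will be queried repeatedly and quickly, while the external schema route avoids moving rarely used data at all.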

It also includes support for other interface types. Developers can import and export data from other data warehousing systems using open data formats like Parquet and Optimized Row Columnar (ORC). Client applications can also access the system via standard SQL, ODBC, or JDBC interfaces, making it easy to connect with business intelligence and analytics tools.
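Exporting in an open format is a one-statement job. The following Python sketch assembles a Redshift UNLOAD statement that writes query results to S3 as Parquet files; the helper name, bucket and role ARN are our own placeholders. Because UNLOAD takes the query as a quoted string, any single quotes inside the query are doubled.

```python
# Hypothetical IAM role used only for illustration.
EXAMPLE_ROLE = "arn:aws:iam::123456789012:role/ExampleRedshiftRole"


def unload_to_parquet(query, s3_prefix, iam_role):
    """Build an UNLOAD statement that exports query results to S3 as Parquet."""
    escaped_query = query.replace("'", "''")  # quotes inside the query are doubled
    return (
        f"UNLOAD ('{escaped_query}') TO '{s3_prefix}' "
        f"IAM_ROLE '{iam_role}' FORMAT AS PARQUET;"
    )


export_sql = unload_to_parquet(
    "SELECT * FROM sales WHERE region = 'EMEA'",
    "s3://example-bucket/exports/sales_",
    EXAMPLE_ROLE,
)
print(export_sql)
```

Files written this way can be read by other engines that understand Parquet, which is the practical meaning of an open data format here.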

The ability to scale the storage layer separately to the compute nodes makes the system more flexible and eliminates network bottlenecks, the cloud service provider says.

Additional services and data types

Cloud databases also provide application developers with other services that they can use to enhance those insights. One of the most notable for AWS is its machine learning capability. ML algorithms are good at spotting probabilistic patterns in data, making them useful for analytics applications, but inference - the application of statistical models when processing new data - takes a lot of computing power. Scalable cloud computing power makes that easier, AWS says.

Cloud-based machine learning services are also easy for companies to consume because they are pluggable with data warehouses via application programming interfaces (APIs). AWS makes these available to anyone who knows SQL. Customers can use SQL statements to create and use machine learning models from data warehouse data using Redshift ML, a capability of Redshift that provides integration with Amazon SageMaker, a fully managed machine learning service.
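To show what creating a model with SQL looks like, the Python sketch below puts together a Redshift ML CREATE MODEL statement and a follow-up prediction query. The model, table, column, function, role and bucket names are all made up for the example, and only the basic clauses are shown; the full set of options lives in the Redshift documentation.

```python
# Hypothetical IAM role and bucket used only for illustration.
EXAMPLE_ROLE = "arn:aws:iam::123456789012:role/ExampleRedshiftMLRole"
EXAMPLE_BUCKET = "example-ml-bucket"


def create_model_sql(model, training_query, target, function, iam_role, s3_bucket):
    """Build a CREATE MODEL statement that trains on the rows returned by a query."""
    return (
        f"CREATE MODEL {model}\n"
        f"FROM ({training_query})\n"
        f"TARGET {target}\n"
        f"FUNCTION {function}\n"
        f"IAM_ROLE '{iam_role}'\n"
        f"SETTINGS (S3_BUCKET '{s3_bucket}');"
    )


train_sql = create_model_sql(
    model="customer_churn",
    training_query="SELECT age, plan, monthly_spend, churned FROM customer_history",
    target="churned",
    function="predict_churn",
    iam_role=EXAMPLE_ROLE,
    s3_bucket=EXAMPLE_BUCKET,
)

# Once trained, the model is called like any other SQL function.
predict_sql = (
    "SELECT customer_id, predict_churn(age, plan, monthly_spend) "
    "FROM customers;"
)
print(train_sql)
print(predict_sql)
```

The training query names the input columns plus the target, and the prediction call passes the same inputs minus the target.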

In 2019, Redshift also introduced support for geospatial data by adding a new data type: geometry. It stores coordinate data in table columns, making it possible to handle geospatial polygons for mapping purposes. Location information can then be combined with other data types in conventional data warehousing queries and when building machine learning models in Redshift.
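A typical geospatial question is whether a point falls inside a polygon, for instance which stores sit within a sales territory. The Python sketch below answers it with the classic ray-casting test and then totals sales for the stores inside the area. It illustrates the kind of containment check a spatial query performs, not how Redshift implements its geometry type, and the territory and store data are invented.

```python
def point_in_polygon(x, y, polygon):
    """Ray casting: count how many polygon edges a ray going right from (x, y) crosses."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside  # an odd number of crossings means inside
    return inside


def sales_inside(stores, polygon):
    """Sum sales for stores whose coordinates fall inside the polygon."""
    return sum(
        sales for (x, y, sales) in stores if point_in_polygon(x, y, polygon)
    )


# Hypothetical territory (a simple square) and stores as (x, y, sales).
territory = [(0, 0), (4, 0), (4, 4), (0, 4)]
stores = [(2, 2, 100), (5, 2, 40), (1, 3, 60)]
print(sales_inside(stores, territory))
```

Combining the containment test with an ordinary aggregate is the pattern the article describes: location data joined to conventional business figures in a single query.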

As data warehousing continues its move to the cloud, it shows no sign of slowing down. Customers can choose offerings from the largest cloud service providers or from third-party software vendors alike. Evaluation criteria will depend on each customer's individual strategy, but the need to scale compute and storage capabilities is sure to factor highly in any decision. One thing's for sure: the cloud will help customers as their big data gets bigger still.

This article is sponsored by AWS.

Biting the hand that feeds IT © 1998–2021