Location, location, location: how the cloud handles geospatial data

Using spatial data with AWS Redshift

Paid feature

If John Snow were alive today, he would have loved geospatial databases. The famous epidemiologist pioneered mapping techniques for public health. During an 1850s cholera outbreak in London's Soho area, he plotted cases on an area map along with local water sources. Analyzing the relationships between cases and water sources enabled him to pinpoint the source of the outbreak and save lives. The locals gave him the highest possible honor for a Londoner - they named a pub after him.

Today, an entire generation of data scientists uses geospatial databases for its own location-based calculations. They follow the same basic approach as Snow, but with far more data and more sophisticated methods. Geospatial databases allow them to innovate in areas ranging from healthcare to transportation.

Amazon Web Services (AWS) has supported geospatial calculations in its Redshift data warehouse since 2019, and recently added more capabilities. The company hopes that customers will use it, along with the scalable computing power of the cloud, to crunch datasets with location information and generate new insights.

The growth of geospatial data

AWS already offered geospatial data processing indirectly through the relational databases that it supports in its Relational Database Service (RDS). However, this was limited to whatever the database itself offered, such as the PostGIS extension for PostgreSQL, which might have to be added manually. It also offered some native geospatial capabilities in its own Aurora RDBMS, but these databases are all transactional and not designed for deep, multi-layered analytics work.

To bridge the gap, AWS introduced support for geospatial data in its Redshift data warehouse. The product excels at comparing many data sets of different types against each other, enabling data scientists and data analysts to find new patterns. Adding geospatial capabilities lets it support spatial data overlays in the same way.

Geospatial systems have specific use cases, but they're not as niche as you might think. At their highest level, they are a way to represent objects' characteristics in space. These could cover use cases ranging from selecting sites for electric vehicle chargers based on traffic flow and demographics in specific regions through to measuring the level of rain across different cities and predicting accumulation and flooding.

An uptick in geospatial investment

The geospatial analysis market is booming. Meticulous Research believes that it will grow at a CAGR of 17.6 percent from 2021 through 2028 to reach $256bn. As more data becomes available, people are eager to map it spatially. In the public sector, geospatial data is harnessed for a huge range of purposes, from developing policies through to planning major public works.

Everybody wants to get insights. To take just one example, heat maps derived from geospatial datasets help governments to plan relief efforts, forecast economic growth, or predict the effect of climate change. They use data sets covering population demographics, economic indicators, and weather maps.

Private businesses, from energy companies to retailers, also look for knowledge in spatially mapped data. A retailer might plan store placements by combining economic or demographic data sets with data describing a neighborhood's walkability or access to infrastructure. Logistics companies might use this type of data to optimize their distribution networks, speeding deliveries and saving fuel. Energy companies could use the data for seismological analysis and mining plans.

The data type that you use depends on your specific geospatial application. AWS introduced support for the geometry geospatial data type in Redshift in November 2019. Geometry is useful for SQL queries against two-dimensional and three-dimensional geographic data points that do not span distances so vast that the curvature of the Earth has to be taken into account.

AWS points to insurance as a potential application for this data type. An insurer wants to calculate risk premiums for insurance policies on a regional basis. A customer's address would be a point, and the insurer would then overlay a data set with other features.

This might include proximity to rivers, along with historical flood data, or location relative to forests that could be susceptible to wildfires. Other economic indicators that insurers overlay include population, crime, job opportunities, and income.

Taken together, you could use this overlaid data to calculate property insurance risks and premiums for customers around that region. At a simpler level, an insurer might match population densities with agent locations to determine where they should hire more insurance agents.

Understanding geometry and geography data types

In Redshift, the geometry data type is abstract, processed as points on a Cartesian plane (an X-Y grid). These points also form the basis for linestrings, polygons, multipoints, multilinestrings, multipolygons, and geometry collections, which group any of these shapes together. If your application needed to define the area prone to flooding around a river, for example, you'd define it by constructing polygons from point data.
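To make the idea of building a polygon from point data concrete, here is a minimal Python sketch. It is not how Redshift stores or processes geometry; the names close_ring and polygon_area and the flood-zone coordinates are hypothetical, chosen only for illustration. It closes a list of X-Y points into a ring and measures the enclosed planar area with the shoelace formula.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) on a flat Cartesian plane


def close_ring(points: List[Point]) -> List[Point]:
    """Return the points as a closed ring (first vertex repeated at the end)."""
    if len(points) < 3:
        raise ValueError("a polygon needs at least three points")
    ring = list(points)
    if ring[0] != ring[-1]:
        ring.append(ring[0])
    return ring


def polygon_area(ring: List[Point]) -> float:
    """Planar area of a closed ring, computed with the shoelace formula."""
    twice_area = 0.0
    # Walk each edge of the ring and accumulate the signed cross products.
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0


# Hypothetical flood-zone outline in arbitrary planar units.
flood_zone = close_ring([(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0)])
flood_zone_area = polygon_area(flood_zone)
```

Because everything happens on a flat grid, the area comes out in squared grid units, which is only meaningful when the region is small enough for the Earth's curvature not to matter.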

At its re:Invent conference this year, AWS introduced a new geospatial data type: geography. Unlike geometry, this represents points as coordinates on a sphere (Earth). For this, it uses the World Geodetic System (WGS84), which is the same reference coordinate model used by the Global Positioning System.

Because of the spheroid nature of the Earth, where it is flattened at the poles and bulging at the equator, modeling the planet in an accurate manner mathematically requires certain precision calculations, AWS told us. "And that's really what the functions that operate with the geography data type do."

This data type, which enables precise modeling of latitude and longitude, is especially important for use cases requiring a high degree of accuracy over long distances. A government building a highway or a utility planning a long power distribution line might want to use the geography data type.
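The difference between flat and curved calculations can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Redshift's implementation: it uses the haversine formula, which treats the Earth as a perfect sphere with an assumed mean radius, whereas WGS84 models a flattened spheroid and so gives slightly different results. The function names and city coordinates are hypothetical examples.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius; WGS84 itself is an ellipsoid


def great_circle_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Haversine distance in km between two latitude/longitude points on a sphere."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def naive_planar_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Treat degrees as flat X-Y units and scale by km per degree (ignores curvature)."""
    km_per_degree = EARTH_RADIUS_KM * math.pi / 180
    return math.hypot(lat2 - lat1, lon2 - lon1) * km_per_degree


# Approximate city-centre coordinates, used only as an example.
london = (51.5074, -0.1278)
new_york = (40.7128, -74.0060)

curved = great_circle_km(*london, *new_york)
flat = naive_planar_km(*london, *new_york)
print(f"great-circle: {curved:.0f} km, naive planar: {flat:.0f} km")
```

Over a transatlantic distance the flat calculation overshoots by thousands of kilometres, which is the kind of error a sphere-aware data type is meant to avoid. Over a few city blocks the two figures would be nearly identical.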

In some applications, it's appropriate to use these two data types together. A company that wants to precisely model geographic features for building a high-speed railroad might also want to overlay geometric constructs describing population numbers and existing travel patterns.

Handling geospatial data in Amazon Redshift

Amazon Redshift can ingest geospatial data from Amazon S3 storage using a COPY command. It supports multiple spatial data formats, ranging from the Parquet columnar storage format through to the GeoJSON format for encoding geospatial data. Both are open formats. It also supports well-known text and well-known binary (WKT and WKB), which are open text and binary encodings for representing vector geometry. Or you can feed it with spatial data in ESRI's shapefile format.
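To show how two of these formats relate, the following Python sketch converts a GeoJSON geometry object into its WKT equivalent using only the standard library. It is a simplified illustration: the helper names are hypothetical, only Point, LineString and Polygon are handled, and it says nothing about how Redshift itself parses either format.

```python
import json
from typing import Sequence


def _position(pos: Sequence[float]) -> str:
    """Format one GeoJSON position (x, y[, z]) as space-separated WKT numbers."""
    return " ".join(repr(float(c)) for c in pos)


def _positions(seq: Sequence[Sequence[float]]) -> str:
    """Format a list of positions as a comma-separated WKT coordinate list."""
    return ", ".join(_position(p) for p in seq)


def geojson_to_wkt(geometry: dict) -> str:
    """Convert a GeoJSON Point, LineString or Polygon geometry to WKT text."""
    kind = geometry["type"]
    coords = geometry["coordinates"]
    if kind == "Point":
        return f"POINT ({_position(coords)})"
    if kind == "LineString":
        return f"LINESTRING ({_positions(coords)})"
    if kind == "Polygon":
        # A polygon is a list of rings: one outer ring, then any holes.
        rings = ", ".join(f"({_positions(ring)})" for ring in coords)
        return f"POLYGON ({rings})"
    raise ValueError(f"unsupported geometry type in this sketch: {kind}")


# Example GeoJSON text; GeoJSON orders coordinates as longitude, latitude.
point_json = '{"type": "Point", "coordinates": [-0.1278, 51.5074]}'
point_wkt = geojson_to_wkt(json.loads(point_json))
print(point_wkt)
```

The same point can therefore travel as a JSON document or as a compact text string, and which one you load is largely a matter of what your upstream tools produce.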

Once loaded, Redshift can access this data using straightforward SQL queries, which should make most developers feel at home. They can write JOIN and UNION statements to manipulate the data, either via an ad hoc query using a query editor such as Query Editor v2 in the Amazon Redshift Console or via an API call from their geospatial application. This enables them to combine data sets, pulling multiple layers into their analysis in the same way that they might layer data over a map.

Redshift also includes 76 spatial-specific functions, supporting calculations including intersections of different lines, transformation of coordinates, calculating whether a point is inside a polygon or not, and determining the distance from a point to a line.
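Two of the calculations mentioned here, testing whether a point lies inside a polygon and measuring the distance from a point to a line, can be sketched in plain Python on a flat plane. This is an illustration of the underlying geometry only, not Redshift's code; in Redshift you would call its SQL spatial functions rather than write this yourself. The function names and the sample shapes are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def point_in_polygon(point: Point, polygon: List[Point]) -> bool:
    """Ray-casting test: count how many polygon edges a ray to the right crosses."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point matter.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def point_segment_distance(p: Point, a: Point, b: Point) -> float:
    """Shortest planar distance from point p to the line segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0:
        return math.hypot(px - ax, py - ay)  # a and b coincide
    # Project p onto the segment and clamp to its end points.
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / length_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


# Hypothetical service area and road segment in arbitrary planar units.
service_area = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]
customer_inside = point_in_polygon((2.0, 2.0), service_area)
distance_to_road = point_segment_distance((2.0, 3.0), (0.0, 0.0), (4.0, 0.0))
```

Points that fall exactly on a polygon edge are a known edge case for ray casting and are not handled specially in this sketch.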

When it announced the geography data type, AWS updated many of the existing functions to support it while adding some geography-only ones.

Optimizing geospatial performance in the cloud

Spatial queries are computationally intensive, but Redshift can return results for even complex ones in under a second, according to AWS. That's due in part to Redshift's underlying massively parallel processing architecture. This distributes data queries to multiple nodes in a Redshift cluster.

The company offers three instance types: DS2, which is a dense storage instance type; DC2, which is a dense compute instance type; and its current-generation RA3 instance type, which separates compute and storage. The latter enables customers to scale storage and compute separately, paying only for the managed storage that they use. Any time AWS introduces a new feature, function or data type, it's added to the core engine, the company told us: "It's that one unified engine that we maintain."

Almost 170 years after John Snow discovered the source of a cholera outbreak, geospatial data is once again saving lives, as healthcare professionals use it to track the spread and effects of the COVID-19 pandemic and attempts to mitigate it. IoT technologies and smartphones create more location-based data every minute, all of which compounds the value of geospatial databases. As the volume and diversity of location-based data grow, this powerful type of data analysis is just beginning to shine.

Sponsored by AWS.
