Want to put more data in your database engine?
Learn how bulk Amazon S3 imports open the floodgates for Amazon DynamoDB
Sponsored Feature Amazon's DynamoDB is eleven years old this year. The NoSQL database continues to serve tens of thousands of customers with high-volume, low-latency data thanks to its high-performance key-value data model and scalable architecture. Until recently, customers wanting to take advantage of its speed and capacity had just one primary challenge: how to easily get their data into DynamoDB.
Last August, that issue was solved with the launch of new functionality that makes it easier to import data from Amazon S3 into DynamoDB tables. The bulk import feature was designed to offer a simpler way to import data into DynamoDB at scale while minimizing cost and complexity for users, since it doesn't require writing code or managing infrastructure.
Life before bulk import
Before the bulk import feature was introduced, importing data from S3 to DynamoDB required more manual effort and complex management of throughput and retries. Customers had to write custom scripts using Amazon EMR or use AWS Glue for data integration, both involving trade-offs between cost and performance. These approaches required handling deserialization, managing throughput, and dealing with potential failures due to exceeding available write capacity units (WCUs).
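To make that burden concrete, here is a minimal Python sketch of the kind of retry logic a hand-rolled loader needed. It assumes a boto3-style DynamoDB client is passed in; the function name, chunking, and backoff settings are illustrative choices, not code from AWS or from the article.

```python
import time


def batch_write_with_retry(client, table_name, items, max_attempts=5,
                           base_delay=0.1, sleep=time.sleep):
    """Write items in chunks of 25, retrying whatever comes back unprocessed.

    `client` is assumed to expose batch_write_item() like a boto3 DynamoDB
    client. Items must already be in DynamoDB's attribute-value format.
    """
    for start in range(0, len(items), 25):  # 25 is the per-call request limit
        requests = [{"PutRequest": {"Item": item}}
                    for item in items[start:start + 25]]
        for attempt in range(max_attempts):
            response = client.batch_write_item(RequestItems={table_name: requests})
            # Throttled writes are returned rather than raised, so they
            # have to be collected and sent again.
            requests = response.get("UnprocessedItems", {}).get(table_name, [])
            if not requests:
                break
            sleep(base_delay * 2 ** attempt)  # exponential backoff
        else:
            raise RuntimeError(f"{len(requests)} items still unprocessed")
```

Even this short loop leaves out deserializing the source files and tuning write capacity, which is the work the managed import takes over.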
DynamoDB bulk import from S3 solves all of these problems by automating the whole process for the customer, says Shahzeb Farrukh, Senior Product Manager at AWS.
"We heard from customers that they wanted an easier and quicker way to bulk load data into the DynamoDB tables. They also wanted to do this cost effectively," he explains. "This is a fully managed one-click solution. It helps to alleviate the major pain points involved with importing data from S3."
The bulk import system supports three file types: CSV, JSON, and an Amazon-developed data serialization language called ION, which is a superset of JSON. Customers can compress these files using GZIP or ZSTD if they wish (although pricing for the service is based on the uncompressed size).
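As a rough illustration of preparing such a file, the sketch below writes GZIP-compressed, newline-delimited records in DynamoDB's JSON notation, with each record wrapped in an "Item" object. The helper names are hypothetical and the type mapping is deliberately limited to strings, numbers, and booleans; check the import format documentation before relying on it.

```python
import gzip
import json


def to_dynamodb_json_line(item):
    """Marshal a flat dict into one line of DynamoDB JSON for import."""
    marshalled = {}
    for name, value in item.items():
        if isinstance(value, bool):          # check bool before int
            marshalled[name] = {"BOOL": value}
        elif isinstance(value, (int, float)):
            marshalled[name] = {"N": str(value)}  # numbers travel as strings
        else:
            marshalled[name] = {"S": str(value)}
    return json.dumps({"Item": marshalled})


def write_import_file(path, items):
    """Write one marshalled item per line to a GZIP-compressed file."""
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        for item in items:
            handle.write(to_dynamodb_json_line(item) + "\n")
```

The resulting file would then be uploaded to the S3 bucket and prefix named in the import request.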
Customers need to do very little apart from identifying their partition and sort keys. They activate the import from the AWS Management Console, the AWS Command Line Interface (CLI), or the AWS SDK. They select the input file and specify the final capacity mode and capacity unit settings for the new DynamoDB table. The system then converts the data from these file formats automatically for DynamoDB, creating a new table for the import.
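For the SDK route, a request might be assembled along the following lines in Python with boto3. This is a sketch under stated assumptions: the bucket, prefix, table, and key names are placeholders, both keys are assumed to be strings, and on-demand billing is chosen purely for brevity.

```python
def build_import_request(bucket, key_prefix, table_name, partition_key,
                         sort_key=None, input_format="CSV", compression="NONE"):
    """Assemble keyword arguments for the DynamoDB ImportTable call."""
    # Assumption: key attributes are strings ("S"); adjust for numeric keys.
    attribute_definitions = [{"AttributeName": partition_key, "AttributeType": "S"}]
    key_schema = [{"AttributeName": partition_key, "KeyType": "HASH"}]
    if sort_key:
        attribute_definitions.append({"AttributeName": sort_key, "AttributeType": "S"})
        key_schema.append({"AttributeName": sort_key, "KeyType": "RANGE"})
    return {
        "S3BucketSource": {"S3Bucket": bucket, "S3KeyPrefix": key_prefix},
        "InputFormat": input_format,          # CSV, DYNAMODB_JSON or ION
        "InputCompressionType": compression,  # NONE, GZIP or ZSTD
        "TableCreationParameters": {
            "TableName": table_name,
            "AttributeDefinitions": attribute_definitions,
            "KeySchema": key_schema,
            "BillingMode": "PAY_PER_REQUEST",
        },
    }


def start_import(request):
    """Submit the import; needs AWS credentials and the boto3 package."""
    import boto3  # imported lazily so the builder works without AWS access
    client = boto3.client("dynamodb")
    response = client.import_table(**request)
    return response["ImportTableDescription"]["ImportArn"]
```

Keeping the request builder separate from the network call makes it easy to inspect the parameters before any table is created.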
During the import, AWS creates log entries in its CloudWatch monitoring tool to register any errors such as invalid data or schema mismatches, which helps to identify any issues with the process. AWS recommends a test run with a small data set to see if any such errors crop up before doing the bulk import.
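One way to act on that advice is to check the outcome of a small trial import programmatically. The Python sketch below reads the description returned by the DescribeImport API through boto3; the summary structure and the "needs_review" rule are illustrative assumptions rather than AWS guidance.

```python
def summarize_import(description):
    """Condense an ImportTableDescription dict into a few fields."""
    status = description.get("ImportStatus", "UNKNOWN")
    errors = description.get("ErrorCount", 0)
    return {
        "status": status,
        "errors": errors,
        "imported": description.get("ImportedItemCount", 0),
        # Where to look for the per-item error entries.
        "log_group": description.get("CloudWatchLogGroupArn"),
        "needs_review": status == "FAILED" or errors > 0,
    }


def check_import(import_arn):
    """Fetch and summarize an import; needs AWS credentials and boto3."""
    import boto3  # imported lazily so summarize_import runs offline
    client = boto3.client("dynamodb")
    response = client.describe_import(ImportArn=import_arn)
    return summarize_import(response["ImportTableDescription"])
```

If the trial run reports errors, the referenced CloudWatch log group is the place to find which records were rejected.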
Under the hood
When running an import, the service analyzes the data set and optimizes how it is distributed to deliver the best possible import performance. This all happens under the hood without the customer having to worry about it, says AWS.
The service creates three main benefits for customers: convenience, speed, and lower cost, adds Farrukh. It eliminates the need for the customer to build a custom loader, meaning that they can put those technical skills to use elsewhere. This can shave valuable time from the setup process.
AWS manages the bandwidth capacity automatically for users, solving the table throughput capacity problem. Customers no longer need to worry about the number of WCUs they're using per second, nor do they have to write custom code to throttle table capacity or pay more for on-demand capacity. Instead, they simply run the job and pay a flat fee of $0.15 per GB of imported data, explains Farrukh.
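Budgeting under a flat per-GB fee is simple arithmetic, as this small Python sketch shows. It assumes the $0.15 rate quoted above applies to the uncompressed data size; actual rates may differ by region, so treat the constant as a placeholder.

```python
from decimal import Decimal

# Flat rate quoted in the article; an assumption for any given region.
IMPORT_PRICE_PER_GB = Decimal("0.15")


def import_cost_usd(uncompressed_gb):
    """Estimate the import fee for a given uncompressed data size in GB."""
    cost = Decimal(str(uncompressed_gb)) * IMPORT_PRICE_PER_GB
    return cost.quantize(Decimal("0.01"))  # round to whole cents
```

For example, import_cost_usd(100) gives an estimate of 15.00 for 100 GB of uncompressed source data.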
Not only does this make pricing more predictable for customers, but it also saves them money. "We've purposely priced it to be simple," he says. "It's also priced pretty cheaply compared to the other options available today."
AWS analyzed the cost of importing 381 GB of data containing a hundred million records. It calculated that an import without using bulk S3 import, using on-demand capacity, would cost $500. Using provisioned capacity would cut that down to $83. Using the bulk import function slashed it to $28 while also removing the associated setup headaches.
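Expressed as percentages, the quoted figures work out as follows. The Python sketch uses only the three dollar amounts above; the dictionary keys and helper name are illustrative.

```python
# Dollar figures quoted from the AWS comparison above.
QUOTED_COSTS_USD = {"on_demand": 500, "provisioned": 83, "bulk_import": 28}


def saving_percent(baseline, alternative):
    """Percentage saved by paying `alternative` instead of `baseline`."""
    return round((baseline - alternative) / baseline * 100, 1)


vs_on_demand = saving_percent(QUOTED_COSTS_USD["on_demand"],
                              QUOTED_COSTS_USD["bulk_import"])    # 94.4
vs_provisioned = saving_percent(QUOTED_COSTS_USD["provisioned"],
                                QUOTED_COSTS_USD["bulk_import"])  # 66.3
```

On these numbers, bulk import costs roughly 94 percent less than the on-demand route and about 66 percent less than provisioned capacity.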
Use cases in need of support
What potential applications could this service accommodate? Farrukh highlights three primary possibilities. The first one is migration. AWS has collaborated with clients migrating data from other databases, such as MySQL or NoSQL databases like MongoDB. Farrukh notes that the creation of new tables for imported data through the bulk import feature is particularly beneficial in this context. Bulk importing transfers historical data into a new DynamoDB table, kickstarting the process. Customers can then establish pipelines to capture any data changes in the source database during migration. This feature significantly eases the migration process, according to Farrukh.
The second application involves transferring or duplicating data between accounts. There may be multiple reasons for customers to do this, such as creating a separate function that requires access to the same data, recovering from compromised accounts, or populating a testing or development database managed by a different team with distinct data permissions. The bulk import feature works in conjunction with DynamoDB's bulk export feature to facilitate this use case.
In November 2020, AWS introduced the ability to bulk export data from DynamoDB to S3. Before this, clients had to use the AWS Data Pipeline feature or EMR to transfer their data from the NoSQL database to S3 storage, or depend on custom solutions based on DynamoDB Streams. The bulk export feature enables clients to export data to Amazon S3 in DynamoDB JSON format or Amazon's enhanced ION JSON-based alternative. Customers can choose to export data from any moment in the last 35 days, with granular per-second time intervals. Like the bulk import feature, the export feature does not consume WCUs or RCUs and operates independently of a customer's DynamoDB table capacity.
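A request for such an export could be prepared roughly as follows in Python with boto3. The sketch assumes point-in-time recovery is already enabled on the table; the ARN and bucket values are placeholders, and the 35-day check simply mirrors the window described above.

```python
from datetime import datetime, timedelta, timezone

EXPORT_WINDOW = timedelta(days=35)  # window described in the article


def build_export_request(table_arn, bucket, export_time,
                         export_format="DYNAMODB_JSON", now=None):
    """Assemble keyword arguments for ExportTableToPointInTime."""
    now = now or datetime.now(timezone.utc)
    if export_time > now or now - export_time > EXPORT_WINDOW:
        raise ValueError("export_time must fall within the last 35 days")
    return {
        "TableArn": table_arn,
        "S3Bucket": bucket,
        # Exports are granular to the second, so drop any microseconds.
        "ExportTime": export_time.replace(microsecond=0),
        "ExportFormat": export_format,  # DYNAMODB_JSON or ION
    }


def start_export(request):
    """Submit the export; needs AWS credentials and the boto3 package."""
    import boto3  # imported lazily so the builder works without AWS access
    client = boto3.client("dynamodb")
    response = client.export_table_to_point_in_time(**request)
    return response["ExportDescription"]["ExportArn"]
```

Validating the timestamp locally catches an out-of-range request before it reaches the service.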
Hydrate the data lake
Farrukh suggests that clients could use this data to "hydrate a data lake" for downstream analytics applications. However, it can also contribute to a database copying workflow, with the bulk import feature completing the process. Farrukh emphasizes that the exported data should be directly usable for the import process without any modifications.
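The copy workflow can be sketched by pointing an import at the files an export produced. The Python helper below is hypothetical and assumes exports are written under an "AWSDynamoDB/<export-id>/data/" folder as GZIP-compressed files; verify that layout against a real export before depending on it.

```python
def import_source_from_export(export_description):
    """Derive ImportTable source settings from an ExportDescription dict."""
    # The export ID is taken to be the final segment of the export ARN.
    export_id = export_description["ExportArn"].rsplit("/", 1)[-1]
    prefix = export_description.get("S3Prefix", "").strip("/")
    # Assumed layout: [prefix/]AWSDynamoDB/<export-id>/data/
    parts = [part for part in (prefix, "AWSDynamoDB", export_id, "data") if part]
    return {
        "S3BucketSource": {
            "S3Bucket": export_description["S3Bucket"],
            "S3KeyPrefix": "/".join(parts) + "/",
        },
        "InputFormat": export_description.get("ExportFormat", "DYNAMODB_JSON"),
        "InputCompressionType": "GZIP",  # assumption: export files are gzipped
    }
```

These settings still need table creation parameters, such as the table name and key schema, before they form a complete import request.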
The third use case that has attracted attention involves using the bulk import feature to load machine learning models into DynamoDB. This proves valuable when a model needs to be served with low latency. Farrukh envisions customers utilizing this for inference purposes, using data models to identify patterns in new data for AI-powered applications.
By adding the bulk import feature to DynamoDB, AWS enables customers to more easily introduce additional data into this high-volume, low-latency data engine. This may encourage more users to explore DynamoDB, which offers single-digit millisecond latency, making it suitable for internet-scale applications that can serve hundreds of thousands of users seamlessly.
As such, the ability to quickly, effortlessly, and affordably populate DynamoDB with S3 data could help to drive a significant shift in adoption by delivering what for many will be a highly sought-after feature.
Sponsored by AWS.