Distributed NoSQL database Aerospike is introducing set indexes and SQL operations within expressions in the pursuit of greater machine learning efficiency via its Apache Spark 3.0 connector.
Speaking to The Register, chief product officer Srini Srinivasan claimed the combined tweaks could help reduce the feedback cycle to improve ML models from days to hours.
A key-value and multi-modal database, Aerospike can run on the edge to support so-called real-time decisions based on pre-existing ML models in applications such as fraud detection. It is also used to feed data back into ML model management, commonly handled by data pipeline platform Apache Spark, to ensure models reflect changes to data patterns in the real world.
"This cycle is something we can reduce from weeks and days to hours and minutes because of the efficiency of query that we are adding on top of Aerospike plus the integration of Aerospike with Spark," Srinivasan said.
As well as the Spark connector, the Database 5.6 update includes support for set indexes and improvements to SQL expressions.
Aerospike, which went open source in 2014, has a concept of namespaces, akin to tablespaces in a relational database. Within a namespace reside sets, which are like tables in a relational database.
Srinivasan said customers using Aerospike as a data service might have 100 billion entries in a namespace, a few million in a set. "Indexing a set individually means we can run queries on that set, so you can have efficient access to a subset of data, which is part of a smaller set in a larger namespace, and this will speed up the execution of the set."
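To see why indexing a set individually matters at that ratio, consider a toy model in plain Python. It is a sketch only, not Aerospike's implementation or client API: the `Namespace` class and its method names are hypothetical, invented here for illustration. Without a per-set index, finding one set's records means visiting every record in the namespace; with one, the lookup touches only that set's members.

```python
from collections import defaultdict


class Namespace:
    """Toy stand-in for a namespace holding records from many sets (hypothetical)."""

    def __init__(self):
        self.records = {}                  # key -> (set_name, bins)
        self.set_index = defaultdict(set)  # set_name -> keys belonging to that set

    def put(self, key, set_name, bins):
        # If the key already exists, drop it from its old set's index first.
        old = self.records.get(key)
        if old is not None:
            self.set_index[old[0]].discard(key)
        self.records[key] = (set_name, bins)
        self.set_index[set_name].add(key)

    def scan_without_index(self, set_name):
        """Walk the whole namespace; return (matching keys, records visited)."""
        visited = 0
        matches = []
        for key, (name, _bins) in self.records.items():
            visited += 1
            if name == set_name:
                matches.append(key)
        return sorted(matches), visited

    def scan_with_index(self, set_name):
        """Use the per-set index; return (matching keys, records visited)."""
        keys = self.set_index.get(set_name, set())
        return sorted(keys), len(keys)


# A large set and a small one sharing a single namespace.
ns = Namespace()
for i in range(100_000):
    ns.put(i, "events", {"n": i})
for i in range(100_000, 100_050):
    ns.put(i, "fraud_flags", {"n": i})

_, visited_full = ns.scan_without_index("fraud_flags")
_, visited_indexed = ns.scan_with_index("fraud_flags")
print(visited_full, visited_indexed)  # 100050 50
```

Both scans return the same 50 keys; the difference is how many records each has to visit to find them.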
Complementing this will be SQL expressions to filter queries. "We are adding some processing, where you can also write and read data as part of the expression, which we call operation execution as part of an expression," Srinivasan said. "This means you can write complex expressions and move the processing closer to the data, and it's very efficient because the expressions are implemented in C in the database."
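The benefit of moving the processing closer to the data can be sketched with another toy model. This is an assumption-laden illustration, not Aerospike's expression API: `ToyServer` is a hypothetical class, and a Python lambda stands in for a filter expression that the real database would evaluate server-side. The point is simply how many records have to leave the server in each case.

```python
class ToyServer:
    """Hypothetical stand-in for a database node; counts records it ships out."""

    def __init__(self, records):
        self.records = list(records)
        self.sent = 0  # records sent over the pretend network

    def fetch_all(self):
        # No filtering on the server: every record is shipped to the client.
        self.sent += len(self.records)
        return list(self.records)

    def query(self, predicate):
        # Filter evaluated next to the data: only matching records are shipped.
        hits = [r for r in self.records if predicate(r)]
        self.sent += len(hits)
        return hits


records = [{"id": i, "amount": i * 10} for i in range(1000)]

# Client-side filtering: pull everything, then discard most of it.
server = ToyServer(records)
client_side = [r for r in server.fetch_all() if r["amount"] > 9900]
sent_client_side = server.sent

# Server-side filtering: send the filter to the data instead.
server = ToyServer(records)
server_side = server.query(lambda r: r["amount"] > 9900)
sent_server_side = server.sent

print(sent_client_side, sent_server_side)  # 1000 9
```

Both approaches yield the same nine records, but the second ships only those nine rather than all 1,000.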
Combined, the Spark integration, indexing, expression enhancements, and support for Presto federated querying would improve productivity in managing ML models, or so Aerospike reckoned.
R "Ray" Wang, principal analyst and founder of Constellation Research, said support for set indexes would help make it easier to bring records together with a very low overhead in organisations working at petabyte scale.
"Users will like not only the set indexes but also the connection to Spark 3.0. It's all about the reduction of time, and being able to quickly iterate AI and ML models," Wang said.
In the DB-Engines ranking, which relies on sales, mentions and downloads of databases, Aerospike is ranked 65, well below fellow key-value database Redis, placed at number 7.
Redis's popularity might be largely down to its common usage as a web-application cache – it is the most popular database on AWS. None of this seems to put off Aerospike's users, which include companies of some size such as PayPal and Verizon Media.
A paper from Bloor Research noted that Aerospike was specifically designed, from its inception, to run on SSDs rather than rotating disks, and runs efficiently at scale, compared with rival databases.
It does not directly support the popular deep-learning library TensorFlow, but engineers can load data from Aerospike into a Spark DataFrame and connect to libraries such as Pandas, NumPy, R, and frameworks such as TensorFlow and PyTorch.
Redis Labs is also keen to have its database seen as an ML platform, and is working with Tensorwerk to give developers the freedom to choose their own AI back end, including PyTorch and TensorFlow.
Redis already has a Spark connector, but it does not support Spark 3.0. ®