Watch for schema design in graph database rollout, business user warns
Hardware configuration also important to get query times down, says global research publisher
Developers and architects considering deploying graph databases should spend more time on defining schema than they initially think, according to one multi-national user of the system.
Global science and legal publisher RELX Group, formerly Reed Elsevier, runs a search engine for its content which receives about three-quarters of a trillion requests per year, 95 percent of which are fulfilled by structured data, Erik Schwartz, vice president of product management, knowledge discovery, told The Register.
After two years of working with graph database vendor Neo4j to improve citation counts and other use cases on the data, the 10-strong squad of engineers figured that getting the schema right was the thorniest problem.
"Whenever you load a new database from any existing data source, you always run into some ripples. We also made a few mistakes in terms of how we designed the schema internally, so that had to be redone a couple of times," Schwartz said.
Nonetheless, the £8.6 billion ($10.6 billion) revenue organization found that queries for this initial use case were taking 15 to 20 seconds to come back. The team worked closely with Neo4j to review the schema again, and also loaded the entire dataset into memory to solve the immediate performance issues.
The longer-term solution for performance came from employing direct-attached storage (solid-state drives attached to AWS virtual machines) instead of network-attached storage. "Our latency dropped dramatically. So, it's about getting the design right, getting the schema right, and getting the hardware configuration right, even in a virtual cloud-based environment," Schwartz said.
The primary use case was based on citation counts. Still, the publisher also manages the peer-review process for research papers, so it also wanted to be able to spot any potential conflicts of interest for reviewers based on work they had already done.
"Without building anything additional, we could identify potential conflicts of interest when we're going through that process of identifying experts that can do peer review," Schwartz said.
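To illustrate why a conflict-of-interest check can fall out of a citation graph that already exists, here is a minimal Python sketch. It assumes the simplest possible definition of a conflict (a candidate reviewer who has previously co-authored with one of the submission's authors) and uses plain in-memory dictionaries. The function names, data layout, and sample names are hypothetical; this is not RELX's schema or Neo4j's API, where the same idea would be expressed as a graph traversal.

```python
from collections import defaultdict


def build_coauthor_index(papers):
    """Map each author to the set of people they have co-authored with.

    `papers` is a hypothetical dict of paper id -> list of author names.
    """
    coauthors = defaultdict(set)
    for authors in papers.values():
        for author in authors:
            coauthors[author].update(a for a in authors if a != author)
    return coauthors


def conflicted_reviewers(papers, submission_authors, candidates):
    """Return candidate reviewers who share a past paper with any submitting author.

    The result maps each conflicted reviewer to the submitting authors
    they have previously co-authored with.
    """
    coauthors = build_coauthor_index(papers)
    submitting = set(submission_authors)
    conflicts = {}
    for reviewer in candidates:
        shared = coauthors.get(reviewer, set()) & submitting
        if shared:
            conflicts[reviewer] = shared
    return conflicts


if __name__ == "__main__":
    # Illustrative data only.
    papers = {
        "p1": ["alice", "bob"],
        "p2": ["carol", "dave"],
    }
    print(conflicted_reviewers(papers, ["alice"], ["bob", "carol"]))
```

The check reuses the same author-to-paper relationships that citation counting needs, which is the sense in which nothing additional has to be built.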
The database currently has around six use cases in production, with more in the pipeline. But the hope is to also reduce costs.
"The fact that we build something that solves a particular point, and then solve other use cases down the line, was really the value add. And now we have close to 200 nodes search engine to support this load. We think we can deal with a lot smaller, so even though the license is a bit more expensive — because we're using Neo4j Enterprise Edition — we think we can get a [return on investment] case because it's a lot less hardware," Schwartz said.
Schwartz presented at this week's Neo4j GraphSummit conference, where the database biz announced the bite-sized purchase of Distributed Technology Associates, thereby taking on 11 staff to expand its capability in global cloud management services. ®