MongoDB speaks elephantese with Hadoop Connector upgrades

10Gen proves square JSON pegs can be inserted into round HDFS holes


MongoDB steward 10Gen has increased the capabilities of its Hadoop Connector, which lets administrators shuttle data between MongoDB and HDFS and other Hadoop services.

The updates, announced on Tuesday, add support for MongoDB's Binary JSON (BSON) backup files to the connector, along with support for Apache Hive and incremental MapReduce jobs.

The Hadoop Connector puts MongoDB data in a Hadoop Distributed File System (HDFS) costume, letting MapReduce jobs fiddle with the datastores. This tech lets organizations manipulate MongoDB data without having to move it through the data center, saving bandwidth.

Combined, these enhancements help 10Gen push MongoDB into being more than a NoSQL datastore, and into its own platform for minor analytics, data storage, and cross-platform querying. It follows on from IBM implementing support for MongoDB's JSON-oriented query method inside DB2 and WebSphere.

Apache Hive is a query engine for Hadoop that lets people probe HDFS datasets with a SQL-like query language instead of writing MapReduce jobs. Hive's model does not map perfectly to MongoDB, which created some challenges.

"Figuring out a way to express field mappings for fields in Hive to fields in MongoDB in a way that covers the edge cases users may encounter is tricky," 10Gen software engineer Mike O'Brien told The Register via email. "Also, there are data types in MongoDB that do not have analogous counterparts in Hive (for example, ObjectId) so there are some design decisions around how to handle those as well."

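To make the mapping problem concrete, here is a minimal Python sketch of flattening a nested MongoDB-style document into a flat, Hive-style row. It is not the connector's code: the `FakeObjectId` class, the `COLUMN_MAP` layout, the dotted-path convention, and the choice to render an ObjectId as a hex string are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeObjectId:
    """Hypothetical stand-in for MongoDB's 12-byte ObjectId type."""
    raw: bytes


def get_path(doc, dotted):
    """Follow a dotted path such as 'address.city'; None if any part is missing."""
    current = doc
    for key in dotted.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def to_hive_value(value):
    """ObjectId has no Hive counterpart, so this sketch emits it as a hex string."""
    if isinstance(value, FakeObjectId):
        return value.raw.hex()
    return value


def to_hive_row(doc, column_map):
    """Build a flat row keyed by Hive column name from a nested document."""
    return {col: to_hive_value(get_path(doc, path))
            for col, path in column_map.items()}


# Hypothetical mapping: Hive column name -> MongoDB field path.
COLUMN_MAP = {"id": "_id", "name": "name",
              "city": "address.city", "zip": "address.zip"}

doc = {"_id": FakeObjectId(bytes(range(12))),
       "name": "Ada",
       "address": {"city": "London"}}
row = to_hive_row(doc, COLUMN_MAP)
print(row)
```

The missing `address.zip` field comes back as `None`, which is one of the edge cases any real mapping has to settle: schemaless documents will not always contain every column a Hive table declares.
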
BSON is also not a native Hadoop file format, so work had to be done to get the system to churn through the objects without introducing errors.

"To handle splitting for parallelism, it crawls through a BSON file and calculates byte-offsets in the files to create a list of fixed size chunks which are then processed in parallel," O'Brien writes. "Or, the splits can be pre-built locally with a provided script. When reading the bson off disk, it decodes the bson documents on the fly and passes them into the Mapper as a 'BSONObject' which is the base class used to represent a simple document in the mongo java driver."

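The offset crawl O'Brien describes can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the connector's implementation: it relies only on the BSON rule that every document starts with a little-endian int32 giving its total length, and the names `compute_splits`, `encode_int_doc`, and `target_size` are invented here. Splits land on document boundaries, so each one is roughly the target size rather than exactly fixed.

```python
import io
import struct


def encode_int_doc(name, value):
    """Build a minimal BSON document with one int32 field (sample data only)."""
    body = (b"\x10" + name.encode("utf-8") + b"\x00"
            + struct.pack("<i", value) + b"\x00")
    # The length prefix counts itself (4 bytes) plus the body.
    return struct.pack("<i", len(body) + 4) + body


def compute_splits(stream, target_size):
    """Return (start_offset, byte_length) pairs that each cover whole documents."""
    splits = []
    start = 0    # offset where the current split begins
    offset = 0   # offset of the next unread document
    while True:
        header = stream.read(4)
        if not header:
            break  # clean end of file
        if len(header) < 4:
            raise ValueError("truncated length prefix at offset %d" % offset)
        (doc_len,) = struct.unpack("<i", header)
        if doc_len < 5:
            raise ValueError("invalid document length at offset %d" % offset)
        rest = stream.read(doc_len - 4)
        if len(rest) < doc_len - 4:
            raise ValueError("truncated document at offset %d" % offset)
        offset += doc_len
        # Close the split once it has reached the target size.
        if offset - start >= target_size:
            splits.append((start, offset - start))
            start = offset
    if offset > start:
        splits.append((start, offset - start))  # leftover documents
    return splits


# Ten 12-byte documents back to back, as in a dump file.
data = b"".join(encode_int_doc("n", i) for i in range(10))
splits = compute_splits(io.BytesIO(data), target_size=30)
print(splits)
```

Each pair could then be handed to a separate worker, which seeks to the start offset and decodes only its own documents.
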
In the future, the company plans to boost performance, improve integration with various Hadoop APIs, and "expose some more fine-grained control options to the user on how jobs run and read/write data," O'Brien said.

As more and more companies invite Hadoop into their data center, gaining compatibility with the technology will be crucial for new databases, lest developers start forsaking the data stores for more HDFS-friendly systems. With the Hadoop connector, 10Gen is working to make sure this problem doesn't appear, and that DBAs can dance with the elephant, wherever their data is stored. ®

