SciDB: Relational daddy answers Google, Hadoop, NoSQL

Stonebraker doesn't drop ACID


Mention "relational databases" and a few people's names might spring to mind: Oracle's Larry Ellison, thanks to his billions, or Monty Widenius, main author of the ferociously popular MySQL. Geekier types might plump for Oracle's former Dr DBA Ken Jacobs or open-sourcer Brian Akers, who helped architect MySQL.

Michael Stonebraker's name probably doesn't jump very high in many minds outside computer science, yet it was Stonebraker's quick thinking 40 years ago that paved the way for the industry these better-knowns call home.

A faculty member of the University of California in Berkeley, Stonebraker seized on the research of IBM mathematician Tedd Codd to start work co-developing the industry's first relational database in 1973: Ingres. Ellison wasn't around until 1977, while it took lumbering IBM, owner of the mighty DB2, another eight years before it had something.

Ingres fed into Sybase as Sybase founder Robert Epstein was one of those who worked on Ingres, and then Microsoft's SQL Server. Through his various teaching positions, Stonebraker has also schooled CEOs, CTOs, founders and vice presidents of engineering at VMware, Sleepycat Software, Tibco, Oracle, Documentum, Alfresco and Cloudera. Along the way, Stonebraker found time to deliver Postgres, Mariposa, Aurora, C-Store and H-Store and help found startups to sell and support them: Illustra Information Technologies, Cohera, Streambase, Vertica and VoltDB.

After 40 years, though, Stonebraker finally thinks it's no longer a "one-size-fits-all" world and that there could be more to life than just relational. His latest work, SciDB, is going post-relational to serve the needs of those working with big data - large volumes of information crunched on thousands of nodes in distributed data centers.

"In the 1980s, the 'answer' was if all you wanted to do was business data processing, then it was relational databases. Try to stretch SQL to do everything, though, and that's an unnatural act."

And as this pioneer from the past keeps working, he's come into conflict with those on today's leading edge - the NoSQL movement - as he's put the sacred cows of the Web 2.0 crowd in their place for cheaply sacrificing the benefits of the relational technology he pioneered.

As autumn kicks in, the 66-year-old MIT adjunct professor is on the cusp of releasing the first code under open source for SciDB, a collaboration with long-time colleague Dave DeWitt.

SciDB is Stonebraker's big-data analytics play in an era of Google's MapReduce, Apache Software Foundation's Hadoop and the NoSQL evangelists who seem to be setting the pace, if not hogging the limelight, on big data in massive data centers today.

The database targets boffins, number crunchers and computer scientists and will scale, it's claimed, from megabytes of data to petabytes running on tens of thousands - all on industry standard, multi-core x86 servers with little human administration.

"The 'answer' is the current thing I'm focused on," Stonebraker told The Reg about the work on SciDB.

"The 'answer' in the 1980s was there was only one database market. In 2010, there are business processing databases with OLTP, science databases, document databases. There are genomic databases. The horizontal world of the database space has mushroomed.

"In the 1980s, the 'answer' was if all you wanted to do was business data processing, then it was relational databases. Try to stretch SQL to do everything, though, and that's an unnatural act."

According to Stonebraker, the relational model he helped popularize doesn't work in data-intensive scientific discovery because the data is multidimensional.

Combining and sharing that data means complicated engineering work on the part of developers and database admins, and it produces bottlenecks. Scientists have been rolling their own architectures or - recently - deploying Hadoop, the open-source implementation of Google's MapReduce.

SciDB goes beyond the relational world Stonebraker helped pioneer by swapping rows and columns for mathematical arrays that put fewer restrictions on the data and can work in any number of dimensions. Stonebraker claimed arrays are 100 or so times faster than a RDBMS on this class of problem.

A database for all seasons

It's a world away from where Stonebraker started. In addition to being multidimensional and offering array-based scaling from megabytes to petabytes and running on tens of thousands of clustered nodes, SciDB's will be write once read many, allow bulk load rather than single road insert, provide parallel computation, be designed for automatic rather than manual administration, and work with R, Matlab, IDL, C++ and Python.

SciDB's being piloted by healthcare products giant Novartis, the Large Synoptic Survey Telescope (LSST), Fermilab and an unnamed Russian astronomy project.

"Postgres is no good at the data warehouse market because the science market wants arrays, they don't want tables. But arrays are impossibly slow on top of tables. Postgres has arrays, but they were supported by blobs, so weren't first-class citizens," Stonebraker said.

"I learned if you want to advance the data warehouse market and want to go fast, you need a column store, not a row store...There are unbelievable advantages to specialization."

SciDB's is Stonebraker's biggest departure from the rules of the relational road. It's a journey that in the last five years has seen Stonebraker reinvent different parts of the relational stack.

Next page: Battle of the rows

Other stories you might like

  • DigitalOcean sets sail for serverless seas with Functions feature
    Might be something for those who find AWS, Azure, GCP overly complex

    DigitalOcean dipped its toes in the serverless seas Tuesday with the launch of a Functions service it's positioning as a developer-friendly alternative to Amazon Web Services Lambda, Microsoft Azure Functions, and Google Cloud Functions.

    The platform enables developers to deploy blocks or snippets of code without concern for the underlying infrastructure, hence the name serverless. However, according to DigitalOcean Chief Product Officer Gabe Monroy, most serverless platforms are challenging to use and require developers to rewrite their apps for the new architecture. The ultimate goal being to structure, or restructure, an application into bits of code that only run when events occur, without having to provision servers and stand up and leave running a full stack.

    "Competing solutions are not doing a great job at meeting developers where they are with workloads that are already running today," Monroy told The Register.

    Continue reading
  • Patch now: Zoom chat messages can infect PCs, Macs, phones with malware
    Google Project Zero blows lid off bug involving that old chestnut: XML parsing

    Zoom has fixed a security flaw in its video-conferencing software that a miscreant could exploit with chat messages to potentially execute malicious code on a victim's device.

    The bug, tracked as CVE-2022-22787, received a CVSS severity score of 5.9 out of 10, making it a medium-severity vulnerability. It affects Zoom Client for Meetings running on Android, iOS, Linux, macOS and Windows systems before version 5.10.0, and users should download the latest version of the software to protect against this arbitrary remote-code-execution vulnerability.

    The upshot is that someone who can send you chat messages could cause your vulnerable Zoom client app to install malicious code, such as malware and spyware, from an arbitrary server. Exploiting this is a bit involved, so crooks may not jump on it, but you should still update your app.

    Continue reading
  • Google says it would release its photorealistic DALL-E 2 rival – but this AI is too prejudiced for you to use
    It has this weird habit of drawing stereotyped White people, team admit

    DALL·E 2 may have to cede its throne as the most impressive image-generating AI to Google, which has revealed its own text-to-image model called Imagen.

    Like OpenAI's DALL·E 2, Google's system outputs images of stuff based on written prompts from users. Ask it for a vulture flying off with a laptop in its claws and you'll perhaps get just that, all generated on the fly.

    A quick glance at Imagen's website shows off some of the pictures it's created (and Google has carefully curated), such as a blue jay perched on a pile of macarons, a robot couple enjoying wine in front of the Eiffel Tower, or Imagen's own name sprouting from a book. According to the team, "human raters exceedingly prefer Imagen over all other models in both image-text alignment and image fidelity," but they would say that, wouldn't they.

    Continue reading
  • Facebook opens political ad data vaults to researchers
    Facebook builds FORT to protect against onslaught of regulation, investigation

    Meta's ad transparency tools will soon reveal another treasure trove of data: advertiser targeting choices for political, election-related, and social issue spots.

    Meta said it plans to add the targeting data into its Facebook Open Research and Transparency (FORT) environment for academic researchers at the end of May.

    The move comes a day after Meta's reputation as a bad data custodian resurfaced with news of a lawsuit filed in Washington DC against CEO Mark Zuckerberg. Yesterday's filing alleges Zuckerberg built a company culture of mishandling data, leading directly to the Cambridge Analytica scandal. The suit seeks to hold Zuckerberg responsible for the incident, which saw millions of users' data harvested and used to influence the 2020 US presidential election.

    Continue reading
  • Toyota cuts vehicle production over global chip shortage
    Just as Samsung pledges to invest $360b to shore up next-gen industries

    Toyota is to slash global production of motor vehicles due to the semiconductor shortage. The news comes as Samsung pledges to invest about $360 billion over the next five years to bolster chip production, along with other strategic sectors.

    In a statement, Toyota said it has had to lower the production schedule by tens of thousands of units globally from the numbers it provided to suppliers at the beginning of the year.

    "The shortage of semiconductors, spread of COVID-19 and other factors are making it difficult to look ahead, but we will continue to make every effort possible to deliver as many vehicles to our customers at the earliest date," the company said.

    Continue reading

Biting the hand that feeds IT © 1998–2022