Google's MapReduce patent - no threat to stuffed elephants

Hadoop will keep its head


In mid-January, Google won a patent for MapReduce, the distributed data crunching platform that underpins its globe-spanning online infrastructure. And that means there's at least a question mark hanging over Hadoop, the much-hyped open source platform that helps drive Yahoo!, Facebook, Microsoft's Bing, and an ever-expanding array of other web services and back-end business applications.

Hadoop is based in part on a MapReduce research paper Google published in 2004, about six months after it applied for the patent.

The Mountain View Chocolate Factory doesn't officially comment on specific patents in its portfolio. "Like other responsible, innovative companies, Google files patent applications on a variety of technologies it develops," the company recently told GigaOM, in response to questions about its MapReduce patent. "We feel that our behavior to date has been in line with our corporate values and priorities."

But the general assumption is that Google wouldn't use its patent against Hadoop or any other software that takes a lead from MapReduce, including databaseware from the likes of Aster Data Systems or Teradata. This is certainly the view of Cloudera, the all-star Silicon Valley startup that recently commercialized Hadoop in Red Hat-like fashion.

"I don't speak for Google. But Google has lots of patents, and it has basically has no track record of using those patents offensively, either involving licensing or pursuing people for infringement," Cloudera chief executive Mike Olson tells The Reg, before pointing out that Google is a member in the Open Invention Network, a patent pool that grants use licenses for patented technology in an effort to promote Linux.

"All of this convinces us that this is a strategic move from Google and not something that is aimed at the head of any Hadoop adopter or satellite company - Cloudera included."

Olson adds that Cloudera has "excellent ties" back to the Mountain View search giant and that he and his backers were well aware of Google's patent before Cloudera was founded. "We - and our investors - talked about it in detail and at length, and without a qualm, we went ahead and founded the company."

The salient Google link is Cloudera vice president Christophe Bisciglia - the former Google engineer who Mountain View famously dispatched to the University of Washington to teach a course on what it likes to call Big Data, i.e. net-scale distributed computing. Bisciglia's curriculum actually made use of Hadoop, and he stresses that the open source platform has become an important teaching tool for Google.

"In the past, it took three to six months to get hires up to speed with how to work with [Google] technology," Bisciglia has told The Reg. "But if schools are teaching this as part of the standard undergraduate curriculum, Google saved that three to six months - multiplied by thousands of engineers."

Google hired about half the students who took Bisciglia's first class.

But even if Google did change tack, if it suddenly went on the offensive with that MapReduce patent, you wonder how successful it would be. As Yahoo! vice president of labs and research Ron Brachman points out, the basic concepts behind MapReduce are far from revolutionary. "To my mind, having grown up as a computer scientist in the 70s and taking courses on what was then thought of as parallel processing, there were techniques around that felt very similar to [MapReduce's] type of parallelism," Brachman tells The Reg.

The patent - which you can see here - describes a "system and method for efficient large-scale data processing," and this involves "map" and "reduce" functions that have indeed been a part of parallel programming since Brachman's school days.

In essence, Google's platform "maps" data-crunching tasks across a collection of distributed machines, splitting them into tiny sub-tasks, before "reducing" the results into one master calculation. As the patent abstract puts it, one or more map modules read input data, apply an operation to "produce intermediate data values," and distribute these values "across multiple processors in the parallel processing environment." One or more reduce modules then retrieve the intermediate data and apply a new operation to provide the ultimate output.
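
The map-then-reduce flow described above can be sketched in a few lines of Python. This is a toy, single-process illustration only, not Google's implementation or Hadoop's API: the function names (map_words, reduce_counts, run_job) and the word-count task are assumptions made up for the example, and the "distribution across multiple processors" is simulated by grouping intermediate values by key in memory.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple


def map_words(chunk: str) -> List[Tuple[str, int]]:
    """Map step (hypothetical): emit an intermediate (word, 1) pair per word."""
    return [(word.lower(), 1) for word in chunk.split()]


def reduce_counts(word: str, counts: List[int]) -> int:
    """Reduce step (hypothetical): combine all intermediate values for one key."""
    return sum(counts)


def run_job(
    chunks: Iterable[str],
    map_fn: Callable[[str], List[Tuple[str, int]]],
    reduce_fn: Callable[[str, List[int]], int],
) -> Dict[str, int]:
    # Map phase: each input chunk is processed independently, which is
    # what lets a real system hand chunks to separate machines.
    intermediate: Dict[str, List[int]] = defaultdict(list)
    for chunk in chunks:
        for key, value in map_fn(chunk):
            # Group intermediate values by key; a real cluster would
            # route each key to the machine running its reducer.
            intermediate[key].append(value)

    # Reduce phase: one call per key produces the final output.
    return {key: reduce_fn(key, values) for key, values in intermediate.items()}


result = run_job(
    ["the elephant", "the patent and the elephant"],
    map_words,
    reduce_counts,
)
print(result)
```

Because map_words looks at one chunk at a time and reduce_counts looks at one key at a time, neither step depends on the others' state. That independence is what makes the pattern easy to spread over many machines.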

In any event, Hadoop mirrors this general setup, as Google described it in a research paper published in December 2004. The platform was originally developed by Nutch founder Doug Cutting, who needed a distributed data crunching platform for his open source web crawler, and after he open sourced it at Apache, the platform - named for his son's yellow stuffed elephant - soon spread to some of the web's biggest names.

Yahoo! uses it to generate, among other things, the Yahoo! Search Webmap, which provides the index for its search engine. And it underpins Powerset, the so-called semantic search engine that was purchased by Microsoft and now drives portions of Bing.

Meanwhile, Cloudera is helping to deploy the platform on clusters used by countless other companies, including Rackspace, Netflix, LinkedIn, Samsung, and eHarmony. Rackspace, for one, is using a Hadoop cluster to crunch log data from its hosting infrastructure and serve up reports to support reps. The platform can be applied to almost any breed of Big Data - and not so big data.

"We really don't like the term 'Big Data,'" Olson says. "To use Hadoop, you don't need to have petabytes of data. You don't even need terabytes. When customers hear a word like 'Big Data', they think 'It must be a Google thing.' But it's not," says Olson.

It's not - no matter what's on file at the US patent office. And we're quite sure that Google would agree. ®

Biting the hand that feeds IT © 1998–2022