Yahoo! looks beyond Google's data cruncher

Can you really MapReduce natural language?


Nowadays, when it comes to crunching epic amounts of web data, Google's MapReduce credo is all the rage. The Mountain View method of distributing back-end compute tasks across a sea of commodity machines has given rise to the open source Hadoop platform, which now underpins Yahoo!, Facebook, and even a chunk of Microsoft Bing.

But for Ron Brachman - the former Bell Labs and DARPA man who now serves as vice president of Yahoo! labs and research - a future interwebs may need something very different. MapReduce splinters compute tasks into tiny pieces that are processed independently of each other, and this sort of parallelism by complete separation, he argues, may be ill-suited to a more nuanced breed of web application.

One example is a web that leans heavily on natural language processing. "When we get closer to doing broad-scale language processing that's more, if you will, semantic, we might need to move away from a MapReduce architecture to something that may be equally parallel but with a very different computational architecture," Brachman tells The Reg.

Yahoo! calls itself the leading Hadoop contributor, and the general assumption is that its Yahoo! Search Webmap - which generates the index for its public search engine - is still the world's largest Hadoop application. But two years after the launch of Webmap, the net giant is looking beyond the much-hyped open source platform.

"We continue to explore how to run complex computational jobs on data, and that starts with MapReduce," Brachman says. "But we're looking at other methods of very large scale parallelism. All of this stuff is still emerging - even though some people claim to offer the be-all, end-all 'cloud computing' product already."

Google's MapReduce framework maps data-crunching tasks across distributed machines, splitting them into tiny sub-tasks, before reducing the results into one master calculation. Mountain View published a research paper on the platform in 2004, and this inspired Hadoop, an Apache project founded by former Yahooligan Doug Cutting.
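
As a rough illustration of that map-then-reduce pattern, here is a minimal single-process sketch in Python that counts words. It is not Google's or Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are hypothetical, and a real framework would run each map call on a different machine.

```python
from collections import defaultdict

def map_phase(chunk):
    """Emit (word, 1) pairs for one chunk, with no knowledge of other chunks."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(mapped_chunks):
    """Group every emitted value under its key, across all chunks."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Collapse each key's list of values into a single total."""
    return {key: sum(values) for key, values in groups.items()}

def word_count(chunks):
    # Each map_phase call is independent, so a real framework could
    # hand every chunk to a separate worker; here they run in a loop.
    mapped = [map_phase(chunk) for chunk in chunks]
    return reduce_phase(shuffle(mapped))

if __name__ == "__main__":
    print(word_count(["the web is big", "the index is bigger"]))
```

The point of the sketch is the shape of the computation: the map step never looks outside its own chunk, and only the final gathering step sees results from more than one piece.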

Though Hadoop is "quite a significant piece" of Yahoo!'s current distributed-computing research, Brachman wouldn't call it the only way to crunch data. "I couldn't tell you whether [it's a significant piece] just because it's there or because it's the most essential way to do data processing."

Certainly, it gets the job done on today's web. After all, it handles back-end processing for three net giants likely juggling more data than any other web outfits on the planet. But that doesn't mean it's future-proof.

"There are cases where running very large scale parallelism on completely separable units of data - where there is no interaction between the units - and then gathering up the results is the natural way to attack a problem," Brachman says. "But clearly, there are also problems where we need to invent new ways of doing large-scale computing that are not MapReduce-oriented."

Brachman points to natural-language processing in part because the so-called semantic web is his particular area of expertise. "If you start trying to do true language understanding - which is beyond our reach right now, especially if you want to do it deeply - you need something else," he says. "If you're trying to understand, say, a single English sentence with multiple clauses, you can't just process a sentence sequentially and know the meaning as you add one word and then another...

"Now imagine growing that to not just sentences, but discourses and dialogues across the entire web - or, more broadly, the entire internet. There will be cases where you can't just process little pieces completely divorced from everything else. You need to pull things together."
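
A toy sketch of why fully separated pieces can fall short for language, assuming a deliberately naive task: replacing a pronoun with the most recently seen name. This is not Brachman's example or any real NLP method; the name list, pronoun list, and function names are all hypothetical.

```python
NAMES = {"Alice", "Bob"}   # hypothetical vocabulary of known names
PRONOUNS = {"she", "he"}   # hypothetical pronouns to resolve

def resolve(tokens, last_name=None):
    """Replace each pronoun with the most recently seen name, if any."""
    out = []
    for tok in tokens:
        if tok in NAMES:
            last_name = tok
            out.append(tok)
        elif tok.lower() in PRONOUNS and last_name is not None:
            out.append(last_name)
        else:
            out.append(tok)
    return out, last_name

def resolve_independently(sentences):
    # MapReduce-style: every sentence is handled with no shared state.
    return [resolve(s.split())[0] for s in sentences]

def resolve_with_context(sentences):
    # The last name seen is carried from one sentence into the next.
    results, last = [], None
    for s in sentences:
        resolved, last = resolve(s.split(), last)
        results.append(resolved)
    return results

if __name__ == "__main__":
    text = ["Alice built the index", "she shipped it"]
    print(resolve_independently(text))
    print(resolve_with_context(text))
```

Handled in isolation, the second sentence keeps its unresolved "she" because the name it refers to lives in a different piece; only the version that carries state across sentences can fill it in.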

This is not to say that MapReduce is completely incompatible with semantic processing. Hadoop drives Carnegie Mellon's Read the Web project - an effort to create a semantic map of the web that runs on the M45 cluster Yahoo! serves up to various academic institutions - and it underpins Powerset, the semantic search engine that Microsoft has applied to portions of Bing. But Brachman is looking further down the road, to an altogether different level of machine "understanding."

Yes, Yahoo! will continue to explore such far-reaching avenues. Regulators are on the verge of approving the company's mega-pact with Microsoft, which will see Bing handle search duties on Yahoo.com, but even without search, Brachman and company are still in the business of juggling epic amounts of web data.

And as it pushes for advancements in distributed computing, Yahoo! will do so in tandem with the community at large - through continued contributions to Hadoop and other open source projects and through partnerships with academic institutions such as Carnegie Mellon and the University of California at Berkeley. "A rising tide," Brachman says, "lifts all ships."

This too contrasts with the Mountain View credo. Google did release that MapReduce research paper. And like Yahoo!, it offers back-end compute resources to academic researchers. But Google MapReduce is decidedly closed. And as of last month, it's patented. ®

Biting the hand that feeds IT © 1998–2021