Machine learning climbs atop Hadoop

Pattern hoists machine-learning models onto HDFS


Hadoop whisperer Concurrent has released a free tool for porting machine-learning models over to Hadoop.

The Pattern tool lets you run machine-learning models on top of the Hadoop compute and storage framework via either exported Predictive Model Markup Language (PMML) files or a Pattern Java API.

Designing machine-learning models requires a precise set of skills. The technology can bring great efficiencies by creating programs that can, say, automatically score query results by relevance, but machine-learning experts, themselves a subcategory of the data scientist breed of tech bod, are rarely also familiar with the vagaries of MapReduce jobs.

Rather, many data scientists work within the confines of mathematical or machine-learning programs such as R or MicroStrategy, and it can be a tall order for these people to learn HDFS and MapReduce well enough to re-implement their algorithms on large HDFS-stored datasets.

With Pattern, Concurrent has created a free technology that can take machine-learning models exported into PMML files and run them atop Hadoop. "You should be able to export from your favorite tools your PMML docs and get into production at least at scale," Concurrent founder Chris Wensel says. "The goal with Pattern is to be able to apply a [machine-learning] scoring model and run it at scale."

Pattern is the third prong in Concurrent's pitchfork for getting useful data in and out of Hadoop without having to learn the vagaries of the application. It sits alongside Cascading, the company's Java API for Hadoop, and its Lingual add-on for making SQL queries on Hadoop easy.

The tool is designed for data scientists who are unfamiliar with Hadoop but want to use the technology to run machine-learning models against large pools of data. It works with any program capable of exporting a model as a PMML file, such as R, MicroStrategy, or SAS.

"We've used the Cascading APIs and implemented the scoring aspect of these models against the Cascading APIs," Wensel says. "It'll generalize itself thanks to the facilities Hadoop provides. If you export the model from R into PMML and run [it] across Hadoop, it'll parallelize itself appropriately."
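
To make the idea of scoring a PMML model concrete, here is a minimal sketch in Python. It is not Pattern's implementation or API: the PMML fragment is a hypothetical and heavily simplified linear regression (a real export would also carry a header, data dictionary, and mining schema), and the field names and helper functions are invented for illustration. It shows why scoring parallelizes so readily: each record is scored on its own, so the data can be split into partitions and handled separately.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified PMML fragment (assumption: a real export from R
# or SAS would include Header, DataDictionary and MiningSchema elements).
PMML_DOC = """
<PMML version="4.1">
  <RegressionModel functionName="regression">
    <RegressionTable intercept="1.5">
      <NumericPredictor name="clicks" coefficient="2.0"/>
      <NumericPredictor name="dwell_seconds" coefficient="0.25"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
"""


def load_linear_model(pmml_text):
    """Read the intercept and per-field coefficients from the fragment."""
    root = ET.fromstring(pmml_text.strip())
    table = root.find("./RegressionModel/RegressionTable")
    intercept = float(table.get("intercept"))
    coefficients = {
        p.get("name"): float(p.get("coefficient"))
        for p in table.findall("NumericPredictor")
    }
    return intercept, coefficients


def score(model, record):
    """Score one record: intercept plus the weighted sum of its fields."""
    intercept, coefficients = model
    return intercept + sum(c * record[name] for name, c in coefficients.items())


def score_partition(model, records):
    """Score a batch of records; no record depends on any other."""
    return [score(model, r) for r in records]


model = load_linear_model(PMML_DOC)
records = [
    {"clicks": 3, "dwell_seconds": 8},
    {"clicks": 0, "dwell_seconds": 4},
    {"clicks": 1, "dwell_seconds": 0},
    {"clicks": 2, "dwell_seconds": 2},
]
scores = score_partition(model, records)
```

Because `score` only looks at one record at a time, splitting `records` into chunks and scoring each chunk on a different machine gives the same results as scoring the whole list in one place, which is the property a framework like Hadoop can exploit.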

Pattern is part of Concurrent's overall strategy of shifting Cascading into an all-purpose translation layer for people who want to access the inherent scalability of Hadoop without having to invest time in learning its peculiarities.

Its closest counterpart would be the open source Apache Mahout project. However, Mahout is more a collection of HDFS-compatible machine-learning algorithms than anything else, so it lacks the flexibility and tooling that software like R may have.

"Mahout is a set of standalone and independent applications that have to be orchestrated with other applications to do their job, each using different file formats," Wensel says. "This is fundamentally very brittle and adds lots of latency to the applications."

The company expects existing Cascading users such as Airbnb to start experimenting with the Pattern tool imminently. It is already in use by AgileOne.

Over time, Concurrent hopes to build an ecosystem of complementary tools for Hadoop around the Cascading data analysis software. This announcement comes after the company took $4m from VCs to give it time to follow through on Wensel's ambition to "build a sustainable business around Cascading." ®

Biting the hand that feeds IT © 1998–2022