IBM compiles dataset to teach software how software is made: 14m code samples, half of which actually work

Big Blue hopes to create the ImageNet of training resources for AI-powered programming tools

IBM has assembled a massive silo of source code for teaching machine-learning programs about programming.

Dubbed Project CodeNet, the set contains, we're told, 14 million code samples totaling 500 million lines in more than 55 programming languages, from Java, C, and Go to COBOL, Pascal, and FORTRAN. Truth be told, more than three-quarters of it all is in C++ and Python.

This source code wasn't taken from production or in-development applications. It was collected from entries submitted to two programming contests organized in Japan: Aizu and AtCoder. In these contests, competitors are challenged to write the necessary code to turn a given set of inputs into a set of desired outputs. About half of the samples work as expected, and the rest are labeled as either wrong solutions, non-building, or buggy.

Ideally, you would, for example, train an AI tool to pick out the good programs and reject the bad ones. For seven million of the samples, the input and required output are included.
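To make those labels concrete, here is a rough Python sketch of how a contest-style judge might sort submissions into the four buckets described above, given one input and its required output. Everything in it is our own illustration: the `judge` function, the label strings, and the toy submissions are hypothetical, and it says nothing about how Aizu, AtCoder, or CodeNet actually store or score entries.

```python
import contextlib
import io
import sys

# Hypothetical label names, loosely mirroring the four buckets in the article.
ACCEPTED, WRONG, NON_BUILDING, BUGGY = "accepted", "wrong", "non-building", "buggy"


def judge(source, stdin_text, expected):
    """Label one Python submission against a single input/output pair."""
    try:
        program = compile(source, "<submission>", "exec")
    except SyntaxError:
        return NON_BUILDING  # the code does not even compile

    captured = io.StringIO()
    saved_stdin = sys.stdin
    sys.stdin = io.StringIO(stdin_text)  # feed the test input to input()
    try:
        with contextlib.redirect_stdout(captured):
            # Sketch only: a real judge would sandbox untrusted code.
            exec(program, {"__name__": "__main__"})
    except Exception:
        return BUGGY  # crashed while running
    finally:
        sys.stdin = saved_stdin

    # Compare token by token so trailing whitespace does not matter.
    return ACCEPTED if captured.getvalue().split() == expected.split() else WRONG


# Four toy entries for an "add two numbers" problem (made up for illustration).
samples = {
    "s1": "a, b = map(int, input().split())\nprint(a + b)",
    "s2": "a, b = map(int, input().split())\nprint(a - b)",
    "s3": "a, b = map(int, input().split()\nprint(a + b)",
    "s4": "a, b = map(int, input().split())\nprint(a // 0)",
}
labels = {name: judge(src, "2 3\n", "5\n") for name, src in samples.items()}
```

A model trained on a corpus labeled this way would, in principle, see both the source text and the verdict, which is what makes the failing half of the dataset useful rather than dead weight.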

Big Blue wants CodeNet to follow in the footsteps of ImageNet, the database of pictures and labels for training computer-vision applications, and become the leading dataset for teaching software to understand the blueprints of software – what code actually looks like, and how it compares to other code. It's hoped CodeNet can be used to train development tools that can, for instance, search application and library source for desired routines, or perhaps translate from one language to another, or recognize faulty or correct implementations.

"IBM believes Project CodeNet will serve as a valuable benchmark dataset for source-to-source translation and transitioning legacy codebases to modern code languages, helping businesses speed up their application of AI," the biz gushed in announcing the project as part of its Think online conference this week.

The IBM and MIT-IBM Watson AI Lab team behind the dataset has produced a paper [PDF] describing their work, and put all the collated material on the project's GitHub page.

"This dataset is not only unique in its scale, but also in the diversity of coding tasks it can help benchmark: from code similarity and classification for advances in code recommendation algorithms, and code translation between a large variety of programming languages, to advances in code performance improvement techniques," the boffins concluded in their report. ®

Biting the hand that feeds IT © 1998–2022