IBM compiles dataset to teach software how software is made: 14m code samples, half of which actually work

Big Blue hopes to create the ImageNet of training resources for AI-powered programming tools


Think IBM has assembled a massive silo of source code for teaching machine-learning programs about programming.

Dubbed Project CodeNet, the set contains, we're told, 14 million code samples totaling 500 million lines in more than 55 programming languages, from Java, C, and Go to COBOL, Pascal, and FORTRAN. Truth be told, more than three-quarters of it all is in C++ and Python.

This source code wasn't taken from production nor in-development applications: it was collected from entries submitted to two programming contests organized in Japan: Aizu and AtCoder. In these contests, competitors are challenged to write the necessary code to turn a given set of inputs into a set of desired outputs. About half of the samples work as expected, and the rest are labeled as either wrong solutions, non-building, or buggy.

Ideally, you would train an AI tool to favorably identify the good programs, and reject the bad ones, for example. For seven million of the samples, the input and required output is included.

Big Blue wants CodeNet to follow in the footsteps of ImageNet, the database of pictures and labels for training computer-vision applications, and become the leading dataset for teaching software to understand the blueprints of software – what code actually looks like, and how it compares to other code. It's hoped CodeNet can be used to train development tools that can, for instance, search application and library source for desired routines, or perhaps translate from one language to another, or recognize faulty or correct implementations.

"IBM believes Project CodeNet will serve as a valuable benchmark dataset for source-to-source translation and transitioning legacy codebases to modern code languages, helping businesses speed up their application of AI," the biz gushed in announcing the project as part of its Think online conference this week.

The IBM and MIT-IBM Watson AI Lab team behind the dataset has produced a paper [PDF] describing their work, and put all the collated material on the project's GitHub page.

"This dataset is not only unique in its scale, but also in the diversity of coding tasks it can help benchmark: from code similarity and classification for advances in code recommendation algorithms, and code translation between a large variety of programming languages, to advances in code performance improvement techniques," the boffins concluded in their report. ®

Similar topics

Broader topics


Other stories you might like

  • Experts: AI should be recognized as inventors in patent law
    Plus: Police release deepfake of murdered teen in cold case, and more

    In-brief Governments around the world should pass intellectual property laws that grant rights to AI systems, two academics at the University of New South Wales in Australia argued.

    Alexandra George, and Toby Walsh, professors of law and AI, respectively, believe failing to recognize machines as inventors could have long-lasting impacts on economies and societies. 

    "If courts and governments decide that AI-made inventions cannot be patented, the implications could be huge," they wrote in a comment article published in Nature. "Funders and businesses would be less incentivized to pursue useful research using AI inventors when a return on their investment could be limited. Society could miss out on the development of worthwhile and life-saving inventions."

    Continue reading
  • Declassified and released: More secret files on US govt's emergency doomsday powers
    Nuke incoming? Quick break out the plans for rationing, censorship, property seizures, and more

    More papers describing the orders and messages the US President can issue in the event of apocalyptic crises, such as a devastating nuclear attack, have been declassified and released for all to see.

    These government files are part of a larger collection of records that discuss the nature, reach, and use of secret Presidential Emergency Action Documents: these are executive orders, announcements, and statements to Congress that are all ready to sign and send out as soon as a doomsday scenario occurs. PEADs are supposed to give America's commander-in-chief immediate extraordinary powers to overcome extraordinary events.

    PEADs have never been declassified or revealed before. They remain hush-hush, and their exact details are not publicly known.

    Continue reading
  • Stolen university credentials up for sale by Russian crooks, FBI warns
    Forget dark-web souks, thousands of these are already being traded on public bazaars

    Russian crooks are selling network credentials and virtual private network access for a "multitude" of US universities and colleges on criminal marketplaces, according to the FBI.

    According to a warning issued on Thursday, these stolen credentials sell for thousands of dollars on both dark web and public internet forums, and could lead to subsequent cyberattacks against individual employees or the schools themselves.

    "The exposure of usernames and passwords can lead to brute force credential stuffing computer network attacks, whereby attackers attempt logins across various internet sites or exploit them for subsequent cyber attacks as criminal actors take advantage of users recycling the same credentials across multiple accounts, internet sites, and services," the Feds' alert [PDF] said.

    Continue reading

Biting the hand that feeds IT © 1998–2022