Google open sources standardized code in bid to become Mr Robots.txt

Forget about past technical decisions made by fiat: this time your thoughts matter

Google on Monday released its robots.txt parsing and matching library as open source, in the hope that its now-public code will encourage web developers to agree on a standard way to spell out the proper etiquette for web crawlers.

The C++ library powers Googlebot, the company's crawler for indexing websites in accordance with the Robots Exclusion Protocol (REP), a scheme that lets website owners declare how crawlers that visit their sites to index them should behave. REP specifies how directives can be included in a text file, robots.txt, to tell visiting crawlers like Googlebot which website resources can be visited and which can be indexed.

In the 25 years since Martijn Koster – creator of the first web search engine – drew up the rules, REP has been widely adopted by web publishers but never blessed as an official internet standard.

"[S]ince its inception, the REP hasn't been updated to cover today's corner cases," explained a trio of Googlers – Henner Zeller, Lizzi Harvey, and Gary Illyes – in a blog post. "This is a challenging problem for website owners because the ambiguous de-facto standard made it difficult to write the rules correctly."

For example, differences in the way text editors handle newline characters on different operating systems can prevent robots.txt files from working as expected.
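To illustrate the newline problem, here is a minimal Python sketch of a line splitter that treats Windows, classic Mac, and Unix line endings alike. It is not taken from Google's library; the function name `split_robots_lines` and the decision to drop blank lines are assumptions made for this example.

```python
import re
from typing import List

# Matches Windows (\r\n), classic Mac (\r), and Unix (\n) line endings.
# \r\n is listed first so it is consumed as a single break, not two.
_LINE_BREAK = re.compile(r"\r\n|\r|\n")


def split_robots_lines(text: str) -> List[str]:
    """Split robots.txt text into non-empty, stripped lines.

    Hypothetical helper for illustration; blank lines are dropped here
    for simplicity, which a real parser may not want to do.
    """
    lines = (line.strip() for line in _LINE_BREAK.split(text))
    return [line for line in lines if line]


# A file whose lines were saved with three different newline styles.
mixed = "User-agent: *\r\nDisallow: /private\rAllow: /public\n"
print(split_robots_lines(mixed))
```

A parser that split on "\n" alone would read the second and third directives above as one line, which is the kind of silent failure the Googlers describe.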

Google's library goes out of its way to try to make such files less brittle. For example, it includes code to accept five different misspellings of the "disallow" directive in robots.txt.
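A tolerant parser of this kind might look something like the Python sketch below. The list of accepted misspellings is illustrative and is not claimed to match the one in Google's C++ code; `parse_line` and `DISALLOW_SPELLINGS` are hypothetical names.

```python
from typing import Optional, Tuple

# Illustrative spellings only; not claimed to match Google's library.
DISALLOW_SPELLINGS = (
    "disallow",
    "dissallow",
    "dissalow",
    "disalow",
    "diasllow",
    "disallaw",
)


def parse_line(line: str) -> Optional[Tuple[str, str]]:
    """Return (directive, value), or None if the line has no 'key: value' pair.

    Hypothetical helper: the key is lower-cased, and any spelling listed
    in DISALLOW_SPELLINGS is normalized to 'disallow'.
    """
    line = line.split("#", 1)[0].strip()  # drop trailing comments
    if ":" not in line:
        return None
    key, value = line.split(":", 1)
    key = key.strip().lower()
    if key in DISALLOW_SPELLINGS:
        key = "disallow"
    return key, value.strip()


print(parse_line("Dissalow: /tmp"))
```

Normalizing the key once at parse time means the matching logic further down only ever has to deal with the correct spelling.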

To make REP implementations more consistent, Google is pushing to make the REP an Internet Engineering Task Force standard. It has published a draft proposal in the hope that anyone concerned about such things will voice an opinion about what's needed.

The latest draft expands robots.txt from HTTP to any URI-based transfer protocol, including FTP and CoAP. Other changes include a requirement that developers parse only the first 500 kibibytes of a robots.txt file, to minimize demands on servers, and a maximum caching time of 24 hours, unless the robots.txt file is unreachable.
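The two limits can be sketched in a few lines of Python. This is a rough reading of the figures quoted above, not the draft's normative wording; the names `truncate_body` and `may_use_cached` are hypothetical, and the handling of the unreachable case is an assumption.

```python
import time
from typing import Optional

MAX_ROBOTS_BYTES = 500 * 1024      # 500 kibibytes = 512,000 bytes
MAX_CACHE_SECONDS = 24 * 60 * 60   # 24 hours


def truncate_body(body: bytes) -> bytes:
    """Keep only the first 500 KiB of a fetched robots.txt body.

    A naive cut: it may end mid-line, which a real parser has to tolerate.
    """
    return body[:MAX_ROBOTS_BYTES]


def may_use_cached(fetched_at: float,
                   now: Optional[float] = None,
                   unreachable: bool = False) -> bool:
    """Decide whether a cached copy may still be used.

    Assumption for this sketch: a copy older than 24 hours is only
    acceptable when the live robots.txt cannot be reached.
    """
    if now is None:
        now = time.time()
    if unreachable:
        return True
    return (now - fetched_at) <= MAX_CACHE_SECONDS


print(len(truncate_body(b"a" * (600 * 1024))))
```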

The trio of Googlers note that RFC stands for "request for comments," and insist they really do want to hear from developers if they have thoughts on shaping the standard for the better.

"As we work to give web creators the controls they need to tell us how much information they want to make available to Googlebot, and by extension, eligible to appear in Search, we have to make sure we get this right," they said.

The Chocolate Factory isn't always so solicitous of input. The ad biz last month decided to push for the adoption of a <toast> notification element for the web – already present in the Materialize design framework – without much input from outsiders.

Reaction to Google's toast proposal has been skeptical. Developers have become worried that Chrome's dominant market share, amplified by Microsoft's recent adoption of the open source Chromium project as the foundation of its Edge browser, makes Google's technical decisions de facto standards. The company has so much sway over the web, they fret, that it doesn't have to consult with the web community.

"It feels like a Google-designed, Google-approved, Google-benefiting idea which has been dumped onto the Web without any consideration for others," observed developer Terence Eden last month.

Dave Cramer, a developer who works for book publisher Hachette, edits the EPUB spec, and contributes to the CSS Working Group, published a similar lament about Google's habit of presenting new web technology before involving outsiders.

"It does not appear that any discussions happened with other browser vendors or standards bodies before the intent to implement," he said in a GitHub post. "Why is this a problem? Google is seeking feedback on a solution, not on how to solve the problem." ®

Biting the hand that feeds IT © 1998–2022