Google on Monday released its robots.txt parsing and matching library as open source, in the hope that the now-public code will encourage web developers to agree on a standard way to spell out the proper etiquette for web crawlers.
The C++ library powers Googlebot, the company's crawler for indexing websites in accordance with the Robots Exclusion Protocol (REP), a scheme that allows website owners to declare how code that visits websites to index them should behave. REP specifies how directives can be included in a text file, robots.txt, to tell visiting crawlers like Googlebot which website resources can be visited and which can be indexed.
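To make the mechanism concrete, here is a minimal sketch of how a crawler might consult robots.txt rules before fetching a page. It uses Python's standard-library parser rather than Google's C++ library, so its matching behavior may differ from Googlebot's. The rules, the `ExampleBot` name, and the example.com URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for an imaginary site; not taken from any real robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

# Parse the rules from memory instead of fetching them over the network.
parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(user_agent: str, url: str) -> bool:
    """Return True if the parsed rules let this user agent fetch the URL."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("https://example.com/index.html",
                "https://example.com/private/page.html"):
        print(url, may_fetch("ExampleBot", url))
```

A well-behaved crawler would run a check of this kind before every request and skip any URL the rules disallow.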
In the 25 years since Martijn Koster – creator of the first web search engine – drew up the rules, REP has been widely adopted by web publishers but never blessed as an official internet standard.
"[S]ince its inception, the REP hasn't been updated to cover today's corner cases," explained a trio of Googlers – Henner Zeller, Lizzi Harvey, and Gary Illyes – in a blog post. "This is a challenging problem for website owners because the ambiguous de-facto standard made it difficult to write the rules correctly."
For example, differences in the way text editors handle newline characters on different operating systems can prevent robots.txt files from working as expected.
Google's library goes out of its way to try to make such files less brittle. For example, it includes code to accept five different misspellings of the "disallow" directive in robots.txt.
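The kind of leniency described above can be sketched in a few lines. This is an illustration of the idea, not Google's code: the function names are made up, and the misspellings listed are assumptions chosen for the example, so check the library's source for the variants it actually accepts.

```python
import re
from typing import List, Optional, Tuple

# Illustrative misspellings only; an assumption for this sketch,
# not a verified copy of the list in Google's library.
DISALLOW_VARIANTS = {
    "disallow", "dissallow", "dissalow", "disalow", "diasllow", "disallaw",
}


def split_lines(text: str) -> List[str]:
    """Split on CRLF (Windows), lone CR (old Mac) or LF (Unix) line endings."""
    return re.split(r"\r\n|\r|\n", text)


def parse_line(line: str) -> Optional[Tuple[str, str]]:
    """Return a (directive, value) pair, or None for blank or malformed lines."""
    line = line.split("#", 1)[0].strip()  # drop trailing comments
    if not line or ":" not in line:
        return None
    key, value = line.split(":", 1)
    key = key.strip().lower()
    if key in DISALLOW_VARIANTS:
        key = "disallow"  # normalize known typos to the canonical spelling
    return key, value.strip()


def parse(text: str) -> List[Tuple[str, str]]:
    """Parse a whole robots.txt body into a list of (directive, value) pairs."""
    pairs = (parse_line(line) for line in split_lines(text))
    return [pair for pair in pairs if pair is not None]
```

Fed a file that mixes all three newline styles and two typos, `parse` still yields ordinary `disallow` rules rather than silently dropping them.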
To make REP implementations more consistent, Google is pushing to make the REP an Internet Engineering Task Force standard. It has published a draft proposal in the hope that anyone concerned about such things will voice an opinion about what's needed.
The latest draft expands robots.txt from HTTP to any URI-based transfer protocol, including FTP and CoAP. Other changes include a requirement that developers parse at least the first 500 kibibytes of a robots.txt file, a size ceiling intended to minimize demands on servers, and a maximum caching time of 24 hours, unless the robots.txt file is unreachable.
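A crawler could honor the size and caching figures along these lines. This is a rough sketch under stated assumptions: `CachedRobots`, `refresh`, and the `fetch` callback are hypothetical names, and `fetch` is assumed to return `None` when the file cannot be reached. The sketch keeps the first 500 KiB of the body for parsing and reuses a cached copy for up to 24 hours, or longer if the file is unreachable.

```python
from dataclasses import dataclass
from typing import Callable, Optional

MAX_PARSE_BYTES = 500 * 1024      # 500 kibibytes
MAX_CACHE_SECONDS = 24 * 60 * 60  # 24 hours


@dataclass
class CachedRobots:
    body: bytes        # the portion of robots.txt kept for parsing
    fetched_at: float  # timestamp of the fetch, in seconds


def refresh(cache: Optional[CachedRobots],
            now: float,
            fetch: Callable[[], Optional[bytes]]) -> Optional[CachedRobots]:
    """Return the robots.txt copy a crawler should use at time `now`.

    `fetch` is a hypothetical callback that returns the file's bytes,
    or None when the file is unreachable.
    """
    # A copy younger than 24 hours is used as-is.
    if cache is not None and now - cache.fetched_at < MAX_CACHE_SECONDS:
        return cache
    body = fetch()
    if body is None:
        # Unreachable: fall back to the stale copy, if there is one.
        return cache
    # Keep only the first 500 KiB for parsing.
    return CachedRobots(body[:MAX_PARSE_BYTES], now)
```

Passing the clock and the fetch function in as arguments keeps the sketch free of network calls, which makes the expiry and fallback logic easy to exercise on its own.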
The trio of Googlers note that RFC stands for "request for comments," and insist they really do want to hear from developers if they have thoughts on shaping the standard for the better.
"As we work to give web creators the controls they need to tell us how much information they want to make available to Googlebot, and by extension, eligible to appear in Search, we have to make sure we get this right," they said.
The Chocolate Factory isn't always so solicitous of input. The ad biz last month decided to push for the adoption of a <toast> notification element for the web – already present in the Materialize design framework – without much input from outsiders.
Reaction to Google's toast proposal has been skeptical. Developers have become worried that Chrome's dominant market share, amplified by Microsoft's recent adoption of the open source Chromium project as the foundation of its Edge browser, makes Google's technical decisions de facto standards. The company has so much sway over the web, they fret, that it doesn't have to consult with the web community.
"It feels like a Google-designed, Google-approved, Google-benefiting idea which has been dumped onto the Web without any consideration for others," observed developer Terence Eden last month.
Dave Cramer, a developer who works for book publisher Hachette, edits the EPUB spec, and contributes to the CSS Working Group, published a similar lament about Google's habit of presenting new web technology before involving outsiders.
"It does not appear that any discussions happened with other browser vendors or standards bodies before the intent to implement," he said in a GitHub post. "Why is this a problem? Google is seeking feedback on a solution, not on how to solve the problem." ®