Software


Google open sources standardized code in bid to become Mr Robots.txt

Forget about past technical decisions made by fiat, this time your thoughts matter

Tue 2 Jul 2019 // 07:03 UTC

Google on Monday released its robots.txt parsing and matching library as open source, in the hope that its now-public code will encourage web developers to agree on a standard way to spell out the proper etiquette for web crawlers.

The C++ library powers Googlebot, the company's crawler for indexing websites in accordance with the Robots Exclusion Protocol (REP), a scheme that lets website owners declare how crawlers that visit their sites to index them should behave. REP specifies how directives can be included in a text file, robots.txt, to tell visiting crawlers like Googlebot which website resources can be visited and which can be indexed.
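To make that concrete, here is a minimal sketch of a robots.txt check. It uses Python's standard-library urllib.robotparser rather than Google's C++ library, and the rules, the crawler name ExampleBot, and the example.com URLs are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: every crawler is asked to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules let user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so split the text first.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for path in ("/index.html", "/private/report.html"):
        url = "https://example.com" + path
        print(path, is_allowed(ROBOTS_TXT, "ExampleBot", url))
```

Other parsers, Google's included, may resolve edge cases differently, which is the inconsistency the standardization effort is meant to address.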

In the 25 years since Martijn Koster, creator of the first web search engine, drew up the rules, REP has been widely adopted by web publishers but never blessed as an official internet standard.

"[S]ince its inception, the REP hasn't been updated to cover today's corner cases," explained a trio of Googlers – Henner Zeller, Lizzi Harvey, and Gary Illyes – in a blog post. "This is a challenging problem for website owners because the ambiguous de-facto standard made it difficult to write the rules correctly."

For example, differences in the way text editors handle newline characters on different operating systems can prevent robots.txt files from working as expected.
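One way a parser might guard against that, sketched here under the assumption that only the three common conventions matter (Windows CRLF, old Mac CR, Unix LF): split on any of them before looking at directives. The function name split_robots_lines is hypothetical and is not taken from Google's library.

```python
import re
from typing import List

# Match CRLF first so a Windows line ending is not counted as two breaks.
_LINE_BREAK = re.compile(r"\r\n|\r|\n")


def split_robots_lines(text: str) -> List[str]:
    """Split robots.txt text into non-blank lines, whatever the newline style."""
    return [line for line in _LINE_BREAK.split(text) if line.strip()]
```

A file saved with mixed endings, such as "User-agent: *\r\nDisallow: /a\rAllow: /b\n", then yields the same three lines as a cleanly saved one.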

Google's library goes out of its way to try to make such files less brittle. For example, it includes code to accept five different misspellings of the "disallow" directive in robots.txt.
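A rough sketch of how typo-tolerant matching could work follows. The article does not list the five misspellings, so the set below is an illustrative assumption, and the helper names canonical_directive and parse_rule are hypothetical rather than anything from Google's code.

```python
from typing import Optional, Tuple

# Assumed examples of common typos; not confirmed to be Google's actual list.
DISALLOW_TYPOS = {"dissallow", "dissalow", "disalow", "diasllow", "disallaw"}


def canonical_directive(key: str) -> str:
    """Lower-case a directive name and map known misspellings to 'disallow'."""
    name = key.strip().lower()
    if name == "disallow" or name in DISALLOW_TYPOS:
        return "disallow"
    return name


def parse_rule(line: str) -> Optional[Tuple[str, str]]:
    """Split 'Key: value' into (canonical key, value); None if there is no colon."""
    key, sep, value = line.partition(":")
    if not sep:
        return None
    return canonical_directive(key), value.strip()
```

With this in place, a line such as "Dissalow: /tmp/" is treated the same as "Disallow: /tmp/" instead of being silently ignored.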

To make REP implementations more consistent, Google is pushing to make the REP an Internet Engineering Task Force standard. It has published a draft proposal in the hope that anyone concerned about such things will voice an opinion about what's needed.

The latest draft expands robots.txt from HTTP to any URI-based transfer protocol, including FTP and CoAP. Other changes include a requirement that developers parse only the first 500 kibibytes of a robots.txt file, to minimize demands on servers, and a maximum caching time of 24 hours, unless the robots.txt file is unreachable.
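Those two limits are simple enough to sketch. The constants below follow the figures quoted above (500 kibibytes is 500 × 1024 = 512,000 bytes), while the function names are hypothetical. The article does not say how long a stale copy may be reused when the file is unreachable, so this sketch simply assumes it stays usable.

```python
MAX_ROBOTS_BYTES = 500 * 1024       # 500 KiB parsing limit from the draft
MAX_CACHE_SECONDS = 24 * 60 * 60    # 24-hour maximum caching time


def truncate_robots(body: bytes) -> bytes:
    """Keep only the first 500 KiB of a fetched robots.txt body."""
    return body[:MAX_ROBOTS_BYTES]


def cached_copy_usable(age_seconds: float, origin_reachable: bool) -> bool:
    """Decide whether a cached robots.txt may still be used.

    Assumption: if the origin cannot be reached, the cached copy is kept
    regardless of age; otherwise it expires after 24 hours.
    """
    if not origin_reachable:
        return True
    return age_seconds <= MAX_CACHE_SECONDS
```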

The trio of Googlers note that RFC stands for "request for comments," and insist they really do want to hear from developers if they have thoughts on shaping the standard for the better.

"As we work to give web creators the controls they need to tell us how much information they want to make available to Googlebot, and by extension, eligible to appear in Search, we have to make sure we get this right," they said.

The Chocolate Factory isn't always so solicitous of input. The ad biz last month decided to push for the adoption of a <toast> notification element for the web – already present in the Materialize design framework – without much input from outsiders.

Reaction to Google's toast proposal has been skeptical. Developers have become worried that Chrome's dominant market share, amplified by Microsoft's recent adoption of the open source Chromium project as the foundation of its Edge browser, makes Google's technical decisions de facto standards. The company has so much sway over the web, they fret, that it doesn't have to consult with the web community.

"It feels like a Google-designed, Google-approved, Google-benefiting idea which has been dumped onto the Web without any consideration for others," observed developer Terence Eden last month.

Dave Cramer, a developer who works for book publisher Hachette, edits the EPUB spec, and contributes to the CSS Working Group, published a similar lament about Google's habit of presenting new web technology before involving outsiders.

"It does not appear that any discussions happened with other browser vendors or standards bodies before the intent to implement," he said in a GitHub post. "Why is this a problem? Google is seeking feedback on a solution, not on how to solve the problem." ®
