Data, it has been argued, is the new oil – the fuel for the information economy – but its importance to search engines may be overstated.
In a paper released on Monday through the National Bureau of Economic Research, Lesley Chiou, an associate professor at Occidental College, and Catherine Tucker, a professor at the MIT Sloan School of Management, both in the US, argue that retaining search log data doesn't do much for search quality.
Data retention has implications in the debate over Europe's right to be forgotten, the authors suggest, because retained data undermines that right. It's also relevant to US policy discussions about privacy regulations.
A decade ago, Google changed its search data retention policy for server logs from as long as it wants, to... as long as it wants, with a caveat: the data is identifiable only for the first 18‑24 months, after which it gets anonymized.
It was an issue other search engine providers like Microsoft and Yahoo! had to confront, too.
By 2008, Google had settled on the removal of the last 8 bits of the IP address after nine months, and on more substantive anonymization after 18 months.
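For readers wondering what removing the last 8 bits of an IP address looks like in practice, here is a minimal sketch. It assumes IPv4 and uses Python's standard `ipaddress` module. The function name `drop_last_octet` is hypothetical and this is not Google's code, just an illustration of the masking step described above.

```python
import ipaddress


def drop_last_octet(ip: str) -> str:
    """Zero the lowest 8 bits of an IPv4 address (illustrative only)."""
    addr = ipaddress.IPv4Address(ip)
    # Keep the top 24 bits and clear the final octet.
    masked = int(addr) & 0xFFFFFF00
    return str(ipaddress.IPv4Address(masked))


if __name__ == "__main__":
    # 203.0.113.57 is an address from a range reserved for documentation.
    print(drop_last_octet("203.0.113.57"))
```

After masking, up to 256 addresses collapse into the same logged value, so a single log entry no longer points to one specific address.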
At the time, the company said one of its reasons for keeping search logs was "to improve our search algorithms for the benefit of users."
There are other reasons to retain data, such as legal compliance and anti-spam efforts.
But it can be beneficial to avoid keeping too much data around. Data retention turns a company into a magnet for legal requests and represents a liability in the event of hacking. Storage infrastructure also has a cost.
To determine whether retention policies affected the accuracy of search results, Chiou and Tucker used data from metrics biz Hitwise to assess web traffic being driven by search sites.
They looked at Microsoft Bing and Yahoo! Search over a period in which Bing cut its search data retention period from 18 months to 6 months and Yahoo! cut its own from 13 months to 3 months. The period also covers the point at which Yahoo! had second thoughts and shifted to an 18‑month retention period.
According to Chiou and Tucker, data retention periods didn't affect the flow of traffic from search engines to downstream websites.
"Our findings suggest that long periods of data storage do not confer advantages in search quality, which is an often-cited benefit of data retention by companies," their paper states.
Asked via email whether these findings suggest that Google has overstated the value of search log data, Chiou told The Register, "Our study examined retention data policies for Yahoo! and Bing and did not study Google, as Google did not undergo any changes in its retention policy at the time. Our paper does not find evidence that Yahoo!'s and Bing's change conferred an advantage."
Chiou and Tucker observe that the supposed cost of privacy laws to consumers and to companies may be lower than perceived. They also contend that their findings weaken the claim that data retention affects search market dominance, which could make data retention less relevant in antitrust discussions of Google. ®