Online tracking is alive and well in link decoration

The pending death of third-party cookies won't do much for other privacy intrusions

Analysis Link decoration, the practice of appending data to the end of web links, has become more of a privacy problem than most people realize. The data exfiltration practice is now widely used to send info associated with web users – including email addresses – to ad tracking firms.

The phase out of third-party cookies, planned for next year in Chrome, is supposed to mitigate ad-tech data gathering. But data-focused operators have many other ways to track web users.

About 73 percent of websites, based on a data set of 20,000, use at least one link decoration for tracking, said Shaoor Munir, a doctoral student in computer science at University of California at Davis, during the Ad-Filtering Dev Summit in Amsterdam on Wednesday.

Munir gave a presentation about a machine-learning tool for privacy called PURL that may help the privacy-conscious avoid unwanted tracking.

PURL, a Python-based tool, is designed to identify link decorations that need to be sanitized to prevent web tracking. Its creators claim it does so better than existing tracking countermeasures – including CrumbCruncher, Cookiepedia, Request filter lists (EasyList, EasyPrivacy), and link-decoration-specific filter lists (AdGuard, uBlock Origin, Brave, Safari, and Firefox).

To understand why one might want to cleanse URLs of decorations, it helps to identify the problem area. Link decorations are not just the part of the URL known as query parameters – the key-value pairs that can be passed to servers following the question mark delimiter in a URL (for example https://example.com?id=email).

They encompass the resource path, query parameters, and fragments – though query parameters are used for tracking far more often (~93 percent) than paths and fragments.
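As a rough illustration of those three locations, the following Python sketch splits a URL into its path segments, query parameters, and fragment using the standard library. The URL and the function name split_decorations are invented for this example and do not come from the paper or from PURL.

```python
from urllib.parse import urlsplit, parse_qsl


def split_decorations(url: str) -> dict:
    """Break a URL into the three places link decoration can live."""
    parts = urlsplit(url)
    return {
        # Non-empty segments of the resource path
        "path_segments": [seg for seg in parts.path.split("/") if seg],
        # Key-value pairs after the question mark
        "query_params": parse_qsl(parts.query, keep_blank_values=True),
        # Everything after the hash sign
        "fragment": parts.fragment,
    }


# Hypothetical URL, invented for illustration only.
example = "https://example.com/shop/item/42?id=abc123&lang=en#section-2"
decorations = split_decorations(example)
print(decorations)
```

Any of the three parts can carry an identifier, which is why stripping the query string alone would not cover every case.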

Such practices also tend to co-opt the user's computing resources – specifically local browser storage – to store data that gets transmitted. According to Munir's presentation, 69.4 percent of tested sites contain instances where tracking storage values – specifically first-party cookies and local storage – are shared via link decorations.
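One simple way to spot this kind of sharing is to compare values held in browser storage against the values in an outgoing URL. The sketch below is an assumption-laden illustration, not PURL's method: the cookie contents, the tracker URL, the minimum-length threshold, and the function name leaked_storage_values are all hypothetical, and it only inspects query parameters.

```python
from urllib.parse import urlsplit, parse_qsl


def leaked_storage_values(url: str, storage: dict[str, str], min_len: int = 8) -> list[tuple[str, str]]:
    """Return (storage_key, query_param) pairs where a stored value
    shows up inside a query parameter of the URL."""
    params = parse_qsl(urlsplit(url).query, keep_blank_values=True)
    hits = []
    for store_key, store_val in storage.items():
        if len(store_val) < min_len:
            continue  # short values would match by coincidence
        for name, value in params:
            if store_val in value:
                hits.append((store_key, name))
    return hits


# Hypothetical first-party storage and outgoing request, for illustration.
storage = {"_ga": "GA1.2.1234567890.1700000000", "theme": "dark"}
url = "https://tracker.example/collect?cid=GA1.2.1234567890.1700000000&page=home"
hits = leaked_storage_values(url, storage)
print(hits)
```

A plain substring match like this misses values that have been hashed, encoded, or encrypted before being attached to the link.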

As Munir explained in his presentation and a related research paper [PDF] – co-authored by Patrick Lee (UC Davis), Umar Iqbal (University of Washington), Zubair Shafiq (UC Davis), and Sandra Siby (Imperial College London) – link decorations are difficult to deal with because they can have benign functional uses or be used for tracking.

"The abuse of link decoration for tracking is not a new phenomenon," the boffins explain in their paper. "For example, going as far back as 1996, Webtrends (an analytics service) used the WT.mc_id query parameter for click tracking in advertising campaigns."

Several years ago – when Brave, Firefox, and Safari decided at different times to block third-party cookies by default, and Google embarked on its controversial Privacy Sandbox project to reinvent cookie-based tracking in a way palatable to regulators – it appeared that web privacy might improve somewhat.

But marketers have just moved on to other tracking techniques – like first-party cookies, email address-based identifiers, canvas fingerprinting, AudioContext fingerprinting, and CNAME cloaking. Privacy-focused browser makers have tried to deal with these techniques too, but defensive measures tend to lead to more sophisticated data gathering techniques – an evolutionary Red Queen scenario.

"Users are generally not aware of different tracking techniques but they definitely freak out when they see the effects of online tracking (e.g., ads that are too personalized)," Munir told The Register. "Ever since there has been an indication of restrictions on third-party cookies by major browsers, trackers have started moving on to more robust, invasive, and implicit ways of tracking users."

Munir worked previously on a research paper that showed how first-party cookies are being used for tracking instead of third-party cookies.

"All of Google’s Analytics and Advertising platforms use first-party cookies such as _ga and _gid," he explained. "Fingerprinting and the use of first-party data such as email addresses and phone numbers have also become much more common."

The general lack of awareness about the invasiveness of this tracking is one of the reasons he believes marketers don't feel pressure to limit their collection of personal data unless there's government intervention.

A growing problem

Based on a sample of 20,000 sites drawn from the top million websites, Munir and his colleagues identified almost 45 million link decorations. Of these, about 45 percent were flagged by PURL as belonging to advertising and tracking services (ATS).

This is why you can't just strip query parameters for the sake of privacy: just over half of link decorations serve functional purposes (non-ATS) and removing them can prevent websites from functioning properly. But there's reason to be concerned about ATS link decorations from a privacy perspective.

"Our analysis showed that there were significant instances where email addresses that we entered on the webpage were also being exfiltrated, either in clear text or in a hashed format," said Munir. "There are a number of new trackers that work on 'cookieless' solutions (Feathr, Rich Audience, LiveIntent) and they rely on email addresses to identify users and make full use of the exfiltration of such personal information.

"We also observed link decorations being used by scripts involved in fingerprinting as well. These are the cases that we were able to analyze, however, trackers have been resorting to encryption more and more, which makes pinpointing exact examples more difficult, and the behavior-based analysis that PURL does is more important to distinguish tracking behavior."
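The clear-text and hashed email leakage Munir describes can be checked for in a simple way when the tester knows which address was entered. The Python sketch below compares each query parameter against the address and a few common digests of it. The choice of MD5, SHA-1, and SHA-256, the lowercase normalization, the tracker URL, and the function names are all assumptions made for this example, not details taken from the paper.

```python
import hashlib
from urllib.parse import urlsplit, parse_qsl


def email_forms(email: str) -> dict[str, str]:
    """Return the plaintext address plus a few common hex digests of it."""
    normalized = email.strip().lower()
    raw = normalized.encode("utf-8")
    return {
        "plaintext": normalized,
        "md5": hashlib.md5(raw).hexdigest(),
        "sha1": hashlib.sha1(raw).hexdigest(),
        "sha256": hashlib.sha256(raw).hexdigest(),
    }


def find_email_leaks(url: str, email: str) -> list[tuple[str, str]]:
    """Return (param_name, form) pairs where a query value equals the
    email address or one of its digests."""
    forms = email_forms(email)
    leaks = []
    for name, value in parse_qsl(urlsplit(url).query, keep_blank_values=True):
        candidate = value.strip().lower()
        for form, expected in forms.items():
            if candidate == expected:
                leaks.append((name, form))
    return leaks


# Hypothetical address and tracker URL, for illustration only.
email = "user@example.com"
hashed = hashlib.sha256(email.encode("utf-8")).hexdigest()
url = f"https://tracker.example/sync?em={hashed}&u=user%40example.com&ref=home"
print(find_email_leaks(url, email))
```

A check like this only works for known transformations of a known value. Once the value is encrypted, as Munir notes, exact matching fails and behavior has to be analyzed instead.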

Munir said these behaviors at least violate privacy expectations because people would not be aware that information they provide to a website might be sent to a third party. But the legal status of such tracking is a bit murkier, he suggested, because privacy policies may disclose enough to provide legal cover – even though most people don't read the fine print.

Google is said to be the most common destination for ATS link decorations operated by Yahoo, DoubleVerify, Adform, StackAdapt, and BidSwitch. And the paper's authors observe that Raptive, an ad management platform, is a source of tracking links that feed other tracking services like OpenX, RubiconProject, Yahoo, and GumGum.

Here's an example of a URL with link decoration that mixes tracking and non-tracking functions:

http://go.artinstitutes.edu/search/brand/local/PSGLC?source=BGNAG&ven=search&Tac=sem&school=newyork&Matchtype=Exact&gclid=KjwKEAjwq6m3BRsdfdfsdfCP7IfMq6Oo9gsdfACRc0bN3J-fcQ1t1DdfO5AyuTfKIyFbgTFPfCmPXyGdrKRBoCmv3w_wcB

In this URL, it's the portion that begins with the gclid key that contains a tracking identifier.
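A filter-list approach would remove only that parameter and leave the rest of the URL alone. The Python sketch below shows the idea with a two-entry blocklist. The set contents and the function name strip_tracking_params are assumptions for illustration (real browser lists are longer), and the gclid value is shortened to a placeholder.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical blocklist for illustration; real filter lists are longer.
TRACKING_PARAMS = {"gclid", "fbclid"}


def strip_tracking_params(url: str, blocked: set[str] = TRACKING_PARAMS) -> str:
    """Drop query parameters whose names are on the blocklist,
    keeping every other part of the URL."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name.lower() not in blocked
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


# The article's example URL, with the gclid value replaced by a placeholder.
url = (
    "http://go.artinstitutes.edu/search/brand/local/PSGLC"
    "?source=BGNAG&ven=search&Tac=sem&school=newyork"
    "&Matchtype=Exact&gclid=EXAMPLE_CLICK_ID"
)
print(strip_tracking_params(url))
```

The functional parameters such as school and Matchtype survive. Note that urlencode re-encodes the remaining values, so unusual characters may come out escaped differently than in the original link.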

Munir said it's to the advantage of advertisers that the function of link decorations can be hard to pin down.

"If a URL contains only ATS link decorations, it is very easy to block that URL without any impact on website functionality," he said. "However, if it's accompanied by one or more non-ATS link decorations, it makes it a difficult choice for the user, where they now have to choose between protecting their privacy or accessing the website’s full functionality.

"We report in our paper as well that on average, we observed that an ATS link decoration is accompanied by 16 non-ATS link decorations in the same URL. Google uses one query parameter named 'v' quite a lot in both ATS and non-ATS use cases. Facebook also uses fbclid which is appended to any link that you click on from within Facebook. This shows that advertisers and trackers do rely on this ambiguity and fear of losing out functionality to track users."

Le jeu est terminé

All this has not been lost on privacy-focused browser makers. In July 2020, Brave implemented tracking parameter removal, based on a filtering list that identifies about 47 query parameters. In January 2022, Firefox added parameter stripping and currently blocks about 23 query parameters. And in Safari 17, which arrived last month with the launch of iOS 17, Apple began blocking about 24 query parameters based on a filter list.

To prevent the removal of tracking information, Facebook has encrypted its link decoration.

However, this doesn't entirely prevent PURL from recognizing when tracking is going on. In fact, the paper notes that Facebook's approach creates higher entropy and allows its ATS link decoration to be flagged about 83 percent of the time.
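To give a sense of what "higher entropy" means here, the Python sketch below computes the Shannon entropy of a string's characters. This is a generic textbook measure offered as an assumption. The article does not say how PURL measures entropy, and the sample strings are invented.

```python
import math
from collections import Counter


def shannon_entropy(value: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not value:
        return 0.0
    counts = Counter(value)
    total = len(value)
    # Sum of p * log2(1/p) over each distinct character
    return sum((n / total) * math.log2(total / n) for n in counts.values())


# Invented sample values: a readable word versus a random-looking token.
readable = "school"
token = "a1B2c3D4e5F6g7H8"
print(shannon_entropy(readable) < shannon_entropy(token))
```

Readable functional values tend to reuse a small set of characters, while encrypted or randomly generated identifiers spread across many, which pushes the score up.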

"We use machine learning to differentiate between ATS and non-ATS," explained Munir. "We capture the webpage execution in a graph and look for unique patterns that can be used to identify tracking. For example, third-party JavaScript code reading a cookie or a script accessing APIs which can give unique information about the user's device. These behaviors help our ML model to understand the difference between ATS and non-ATS link decorations."

The paper claims PURL is 98.74 percent accurate, which is six percentage points better than community-run filter lists, and better still than other approaches. It also results in less site breakage (0.7 percent compared to 6.2 percent for Request filter lists). But the software is intended to improve those other approaches rather than compete with them.

"To minimize the friction in using PURL, we output a list of tracking link decorations which can then be used to augment the existing lists that these browsers and privacy-enhancing extensions use [to block trackers]," Munir explained.

Munir argued that an automated approach is necessary to deal with the scale of online tracking.

"Previously we used to rely on filter lists which were manually curated, but the scale at which trackers are moving to other forms of tracking and the complication that arises due to the mixing of functional and tracking resources requires us to use more and more automated solutions which can perform analysis at a large scale to counteract advancements in tracking," he said.

"These automated solutions do rely on machine learning as it does give us flexibility to cater to a larger number of different tracking behaviors which would be difficult to identify manually or through other methods." ®
