Wayback is way ahead: Three million webpages are set to become hacker fodder, according to research that could predict which websites will become vulnerable ahead of time.
The research by Kyle Soska and Nicolas Christin of Carnegie Mellon University used an engine which divined the future by looking at the past - more specifically, by trawling the Wayback Machine, with its 391 billion stored pages, for sites that had become malicious.
It determined that of 4,916,203 current benign webpages (tied to 444,519 websites) about 3 million would become vulnerable within a year.
The work was a boon to search engines for assessing malicious hits, blacklist operators, and affected website admins who could be warned ahead of potential compromise, according to Soska and Christin.
Their predictions, made with 66 per cent accuracy, were determined by an intelligent algorithm fed with two sets of samples: malicious ones from blacklists including PhishTank, and benign ones from the .com zone file. An astonishing 89 per cent of these samples had been captured by the Wayback Machine.
It was then a matter of looking back three to 12 months before a site was compromised to acquire indicators of why it was popped.
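The look-back step can be pictured as a date filter over archived snapshots. The following is a minimal sketch only, not the researchers' code: the function name `snapshots_in_window`, the 90 and 365 day bounds standing in for "three to 12 months", and the sample dates are all assumptions made for illustration.

```python
from datetime import date, timedelta


def snapshots_in_window(snapshot_dates, compromised_on, min_days=90, max_days=365):
    """Return snapshot dates that fall roughly 3 to 12 months before a compromise.

    The day counts are an assumed approximation of the three-to-12-month window.
    """
    earliest = compromised_on - timedelta(days=max_days)
    latest = compromised_on - timedelta(days=min_days)
    return [d for d in sorted(snapshot_dates) if earliest <= d <= latest]


# Hypothetical data: the date a site was blacklisted and the dates it was archived.
compromised_on = date(2014, 6, 1)
archived = [date(2013, 5, 1), date(2013, 9, 15), date(2014, 1, 20), date(2014, 5, 10)]

selected = snapshots_in_window(archived, compromised_on)
print(selected)
```

In this made-up example only the September 2013 and January 2014 snapshots fall inside the window; the others are either too old or too close to the compromise.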
Those indicators included sudden increases in traffic, the presence of certain files, like those of a WordPress CMS install which may be unpatched, and particular HTML tags.
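To give a flavour of how HTML tags might be turned into indicators, here is a small sketch using Python's standard `html.parser` module. It is an illustration under assumptions, not the method from the paper: the `TagCollector` class, the `tag|attr=value` feature format, and the sample page with a WordPress generator tag are all hypothetical.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collect each start tag, with its attributes, as a string feature."""

    def __init__(self):
        super().__init__()
        self.features = set()

    def handle_starttag(self, tag, attrs):
        # Build a feature such as 'meta|content=WordPress 3.5.1|name=generator'.
        pairs = sorted((k, v) for k, v in attrs if v is not None)
        parts = [tag] + [f"{k}={v}" for k, v in pairs]
        self.features.add("|".join(parts))


def extract_tag_features(html):
    collector = TagCollector()
    collector.feed(html)
    return collector.features


# Hypothetical page: the generator tag hints at the CMS and its version.
page = (
    '<html><head><meta name="generator" content="WordPress 3.5.1"></head>'
    "<body><p>Hello</p></body></html>"
)
features = extract_tag_features(page)
print(sorted(features))
```

A tag that reveals a CMS version is the kind of signal that could separate patched from unpatched installs, which is why it is used as the example here.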
User-generated content was stripped from the assessed website data as it was not useful for predicting which sites would become vulnerable in the future.
"Our approach relies on an online classification algorithm that can automatically detect whether a server is likely to become malicious," the duo wrote in their paper Automatically detecting vulnerable websites before they turn malicious [PDF].
"At a high level, the classifier determines if a given website shares a set of features with websites known to have been malicious. A key aspect of our approach is that the feature list used to make this determination is automatically extracted from a training set of malicious and benign webpages, and is updated over time, as threats evolve."
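The idea of checking whether a site "shares a set of features" with known-bad sites can be sketched in a few lines. This is a deliberately simplified toy, not the classifier described in the paper: the scoring rule, the function names `select_features` and `risk_score`, and the feature labels are all assumptions made for illustration.

```python
from collections import Counter


def select_features(malicious_pages, benign_pages, top_n=2):
    """Pick the features most over-represented among malicious pages.

    Each page is a set of feature strings. Re-running this on fresh
    training data is how the list would be refreshed as threats evolve.
    """
    mal = Counter(f for page in malicious_pages for f in page)
    ben = Counter(f for page in benign_pages for f in page)
    # Score: share of malicious pages with the feature minus share of benign pages.
    scores = {
        f: mal[f] / len(malicious_pages) - ben[f] / len(benign_pages) for f in mal
    }
    ranked = sorted(scores, key=lambda f: (-scores[f], f))
    return ranked[:top_n]


def risk_score(page, selected):
    """Fraction of the selected features that the page exhibits."""
    return sum(1 for f in selected if f in page) / len(selected)


# Hypothetical training sets of per-page feature sets.
malicious = [{"wp-3.5", "old-jquery", "p"}, {"wp-3.5", "p"}, {"old-jquery", "wp-3.5"}]
benign = [{"p", "wp-4.0"}, {"p"}, {"p", "old-jquery"}]

selected = select_features(malicious, benign)
score = risk_score({"wp-3.5", "p"}, selected)
print(selected, score)
```

Note that the ubiquitous `p` feature is never selected because it is just as common on benign pages, which is the point of extracting the list from both classes rather than from malicious pages alone.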
The classifier was efficient, interpretable, robust to imbalanced data and missing features, and adaptive to drastic changes over time, they said.
Plenty of systems existed to detect vulnerable and compromised websites but all were reactive, prompting the duo to develop a means to divine future flaws.
Determining vulnerable websites ahead of time could help decrease black hat search engine poisoning and redirection, which were increasingly common. Sophos said in its 2013 threat report that 80 per cent of malware-foisting websites were hacked web servers owned by innocent third parties.
There were limitations in the system's assumption that a potentially vulnerable site could be identified by its traffic and content: attackers could compromise sites by brute-forcing passwords or could host their own sites with malicious intent.
The software developed in the research would later be released publicly. Further technical details were available in the Usenix paper. ®