Alibaba Cloud boosts failure prediction with logfile timestamps

Machine learning helps, but more data catches more faults - so Chinese champ has shared its data

Alibaba Cloud has revealed homebrew tech it used to improve server fault prediction and detection, and claims the tool beats comparable tech at detecting problems by ten percent.

The Chinese cloud champ's claims emerged last week in a paper [PDF] presented at the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining.

The document points out that reliability is a major selling point for public clouds, making failure prediction an important ability. Log files, the authors observe, contain plenty of info on "exceptions" to normal performance that indicate potential problems. The authors opine that tools using logs to predict failures lean on machine learning and deep learning, while a more obvious indicator – timestamps – isn't paid the attention it is due.

Here's the thinking, in a nutshell:

The time interval lengths between successive exceptions often reflect the urgency and severity of the anomalies. For instance, a server with 1,000 "machine check exceptions" in three days may not fail, but a server with 1,000 such exceptions in five minutes tends to fail. Therefore, effective failure prediction must adequately make use of the exception timestamp information.
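To make that concrete, here's a minimal Python sketch of the interval idea. It is not TAAT and not anything from the paper: the `interval_summary` and `evenly_spaced` helpers, the synthetic evenly spaced timestamps, and the choice of median gap and per-minute rate as summary features are all our own assumptions for illustration.

```python
from datetime import datetime, timedelta
from statistics import median


def interval_summary(timestamps):
    """Summarize the gaps between successive exception timestamps.

    Hypothetical helper: returns the event count, the total time span,
    the median gap between neighbours, and the average rate per minute.
    """
    ordered = sorted(timestamps)
    if len(ordered) < 2:
        # Not enough events to measure any interval.
        return {
            "count": len(ordered),
            "span_seconds": 0.0,
            "median_gap_seconds": None,
            "rate_per_minute": None,
        }
    gaps = [(b - a).total_seconds() for a, b in zip(ordered, ordered[1:])]
    span = (ordered[-1] - ordered[0]).total_seconds()
    return {
        "count": len(ordered),
        "span_seconds": span,
        "median_gap_seconds": median(gaps),
        "rate_per_minute": len(ordered) / (span / 60) if span > 0 else None,
    }


def evenly_spaced(start, count, window):
    """Synthetic data: `count` timestamps spread evenly across `window`."""
    step = window / (count - 1)
    return [start + step * i for i in range(count)]


start = datetime(2024, 1, 1)

# Same number of exceptions, very different time windows.
slow = interval_summary(evenly_spaced(start, 1000, timedelta(days=3)))
fast = interval_summary(evenly_spaced(start, 1000, timedelta(minutes=5)))

print("three days:  ", slow)
print("five minutes:", fast)
```

Both servers report the same count of 1,000, so a model that only sees the sequence of exception messages can't tell them apart. The gap and rate figures differ by orders of magnitude, which is the information the authors argue gets thrown away.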

Alibaba Cloud therefore created its own tool called Time-Aware Attention-Based Transformer (TAAT) to analyze timestamp info.

TAAT doesn't entirely ignore ML tools. Instead, it uses the Bidirectional Encoder Representations from Transformers (BERT) – a language model developed by Google that represents text as vectors and has been used to predict server failures. The paper asserts, however, that BERT hasn't been tuned to make full use of log timestamps.

Alibaba's tool therefore relies on BERT for some failure analysis and compares that with TAAT's analysis of logfile timestamps. The paper contains a lot of math describing exactly how Alibaba analyzes log info, but the bottom line was apparently a ten percent improvement in fault predictions – and presumably slightly more reliable cloudy IaaS.

Alibaba's boffins think TAAT's output is also useful because it doesn't need expert analysis – meaning folks familiar with cloudy crashes aren't needed to help as often. It's already in production at Alibaba Cloud.

TAAT appears not to be available for download. But Alibaba Cloud has posted a colossal dataset comprising "∼2.7 billion syslogs from ∼300,000 servers in a four-month period of the real productional system of Alibaba Cloud" to help researchers consider how to develop log sampling strategies of their own to inform future failure prediction efforts.

The authors have also posted a video outlining TAAT's operation. ®
