Me-ow! Russian tech titan Yandex open-sources ML library CatBoost
Something to do with categories, not feline antidepressants
Russia's tech behemoth Yandex has open-sourced its first machine learning library, CatBoost.
Many big-name tech companies including Google, Facebook, Microsoft and even Sony already offer machine learning frameworks. These tend to focus on neural networks, computer systems modelled on the human brain that can be trained to recognise specific objects or events in images and videos.
"Neural nets are getting all the hype," said Misha Bilenko, who left Microsoft's machine learning group to head up Yandex's earlier this year.
CatBoost is a library with Python and R APIs, built on a C++ core, which supports Linux, Windows and macOS. It integrates with scikit-learn, the popular Python machine learning workhorse, and supports classification, regression and custom prediction tasks.
Yandex had previously open-sourced ClickHouse, a column-oriented database management system, about a year ago.
Unlike the popular TensorFlow, Theano or Caffe, however, Yandex's first open-source machine learning tool – quite conspicuously – isn't a neural network library. It's a gradient boosting library.
To train a gradient boosting ML model, you first fit a simple model (typically a decision tree) to your data, then look at the difference between its predictions and the original target outcomes (the residual error). You then fit a new model to predict that residual, and repeat ad nauseam. Each new model reduces the prediction error (the gradient indicating the direction of steepest decrease in the loss). In the end, you sum the models to get a combined "minimal error" prediction model.
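The loop described above can be sketched in a few lines of Python. This is an illustrative toy only – hand-rolled decision stumps as the weak learners, squared-error loss, and a made-up one-dimensional dataset – not CatBoost's actual implementation, which uses full trees plus many refinements:

```python
# Minimal sketch of gradient boosting: repeatedly fit a weak learner
# (here, a decision stump) to the residuals of the current ensemble.

def fit_stump(x, r):
    """Find the threshold split of x that best fits residuals r in squared error."""
    best = None
    for t in sorted(set(x)):
        left = [ri for xi, ri in zip(x, r) if xi <= t]
        right = [ri for xi, ri in zip(x, r) if xi > t]
        if not left or not right:
            continue
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        err = sum((ri - (lv if xi <= t else rv)) ** 2 for xi, ri in zip(x, r))
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    _, t, lv, rv = best
    return lambda xi: lv if xi <= t else rv

def boost(x, y, rounds=20, lr=0.5):
    base = sum(y) / len(y)          # start from the mean prediction
    preds = [base] * len(y)
    stumps = []
    for _ in range(rounds):
        # For squared-error loss, the negative gradient is just the residual.
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        preds = [pi + lr * stump(xi) for pi, xi in zip(preds, x)]
    return lambda xi: base + lr * sum(s(xi) for s in stumps)

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.1, 0.9, 1.0, 3.0, 3.1, 2.9]
model = boost(x, y)
mse = sum((model(xi) - yi) ** 2 for xi, yi in zip(x, y)) / len(y)
print(round(mse, 4))  # training error shrinks each round
```

Each round provably does not increase the training error here, since a stump fitted by per-side means can only reduce the residual sum of squares.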
Gradient boosting is particularly well suited to prediction problems on datasets with defined categorical attributes, such as user IDs or zip codes, Bilenko says. He claims it can sometimes be more efficient than neural nets – in one empirical comparison (PDF) of 10 supervised learning methods, boosted trees produced the best probability predictions once calibrated.
Why use CatBoost over another gradient boosting library? "Our [benchmark] table looks really good," Bilenko said.
Comparing CatBoost's performance on 9 Kaggle datasets against three competing libraries (LightGBM, XGBoost and H2O), CatBoost appears to come out on top. The log loss values Yandex reports on test data – a measure of how far a model's predicted probabilities fall from the true labels, with confidently wrong predictions penalised most heavily – are lowest for CatBoost (methodology here).
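Log loss itself is simple to compute: it is the average negative log-likelihood of the true labels under the model's predicted probabilities. A minimal sketch (the function name and toy numbers are ours, not taken from Yandex's benchmark code):

```python
import math

def log_loss(y_true, p_pred, eps=1e-15):
    """Average negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Confident correct predictions cost little; confident wrong ones cost a lot.
print(round(log_loss([1, 0, 1], [0.9, 0.1, 0.8]), 4))  # → 0.1446
```

Lower is better, which is why the benchmark table reports the smallest values as wins.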
However, Bilenko admitted that "in some cases" CatBoost can take longer to train than competing open-source libraries. Yandex declined to immediately provide any specific training speed comparisons, claiming it would be difficult to get a "cleanroom" setup that isolates competing factors.
Patrick Jähnichen, a machine learning researcher at Humboldt University of Berlin, told The Register that he'd try CatBoost for its pre-processing features (CatBoost automatically extracts features from categorical data) and visualisation capabilities.
He said it's useful to try different types of machine learning models on datasets. He was not, however, aware of a specific case where gradient boosting would be more efficient on defined categorical data than neural networks with a single hidden layer and single output layer. He pointed out that the paper Yandex supplied only compared boosted trees to classic artificial neural nets, not the modern convolutional neural networks used in deep learning. The empirical paper also covered only specific data with specific metrics – the results could change under different conditions.
As for other gradient boosting libraries, he said the lower training speed of CatBoost could be a problem for real-time applications (if you can train in advance and then just query the model, it wouldn't really matter, but otherwise it's something to keep in mind). He would also like to see more comparisons against scikit-learn's existing boosting tools.
He added that he would need time to analyse the algorithms to get to the bottom of the performance improvement.
A spokesperson for CERN, which is using CatBoost in its Large Hadron Collider Beauty Experiment, told The Register: "The state-of-the-art algorithm developed using Yandex's CatBoost has been deployed in LHCb to improve the performance of our particle identification subsystems. CatBoost will improve how efficiently we can identify charged particles, providing greater accuracy in the selection of our data."
Bilenko says he wants to close the training speed gap "in the coming months". Yandex uses some gradient boosting internally for web search, spam detection, weather forecasts, ad recommendations, speech recognition and translations – though it has not yet rolled out CatBoost to any of its services outside of testing. He hopes to do so by the fall.
He said Yandex could potentially open-source more software in future. ®