Naver debuts multilingual HyperCLOVA X LLM it will use to build sovereign AI for Asia
Because English isn't the only language
Korean web giant Naver last week debuted a family of large language models named HyperCLOVA X, which it claimed perform better at cross-lingual reasoning in Asian languages than other models – and may therefore help the region to develop sovereign large language models.
Naver announced the debut of HyperCLOVA X in Korean and pointed to an English-language technical report on the arXiv preprint repository that asserts "We believe that HyperCLOVA X – with its competitive capabilities in English and other languages beyond Korean – can provide helpful guidance for regions or countries on developing their own sovereign LLMs."
The LLMs were pre-trained on data "comprised of Korean, multilingual, and code segments."
The multilingual subset was predominantly English, but also included a variety of other languages – such as Japanese, German, and French.
Korean language material made up around a third of the pre-training data – a sign that Naver prioritized its models' performance in its home tongue. The pre-training process also took into account the particular grammar of the Korean language.
The result of that effort, Naver asserts, is models "with inherent proficiency in both Korean and English."
Better yet, the models display "multilinguality" – the ability to work in languages other than those they were trained to handle.
"Our analysis shows that HyperCLOVA X is not only able to extend its reasoning capability beyond its primarily targeted languages but also achieve the state-of-the-art level in machine translation between Korean and untargeted languages, such as Japanese and Chinese," the technical report states. "HyperCLOVA X's impressive multilingual ability also includes cross-lingual transfer between Korean and English, where instruction-tuning in one language can lead to the emergence of instruction-following capabilities in the other," it added.
Multilingual test results led the developer to conclude HyperCLOVA X "can be transferred to Asian languages that are underrepresented in the pre-training data."
Sovereign AI is emerging as a necessary national capacity – as a means of ensuring data security and reducing dependency on offshore providers. Nvidia has championed the concept, which coincidentally has the potential to create an even larger market for its wares.
But as Naver's technical report points out, English and North American cultures "are extremely overrepresented in the pre-training corpora" for existing mainstream LLMs.
"Consequently, these LLMs exhibit limitations in their capacity to process and understand non-English languages like Korean, which embodies distinctive cultural nuances, geopolitical situations, and other regional specificities, as well as unique linguistic attributes," it explains.
Regional heavyweight China has sought to develop LLMs in its national interest – or at least the CCP's interest – with varying success. Nonetheless, chatbots like Baidu's ERNIE had garnered over 100 million users by the end of 2023.
Nak-ho-Seon, head of Naver Cloud Hyperscale AI technology, declared that the company plans "to create specialized super-scale AI for various regions and countries in the future."
Meanwhile, the technical report includes a pledge to "explore multimodality, aiming to broaden HyperCLOVA X's capabilities to seamlessly process and integrate diverse types of data, such as text, images, and audio," while seeking to optimize the model's inferencing abilities.
Naver claimed to be "actively researching the integration of external tools and APIs to augment the model's functionalities" – an endeavor it believes will "enable HyperCLOVA X to access specialized datasets and services." ®