Encrypted training offers new path to safer language models

A research team from the University of Tokyo has outlined a new approach to training large language models that aims to curb sensitive data leakage while preserving performance, addressing one of the most pressing challenges facing artificial intelligence developers as regulation tightens worldwide.

The study, titled Crypto-LLM, proposes encrypting large volumes of training text before it is fed into a model, preventing the system from memorising personal or copyrighted material in readable form. The work has drawn attention because it attempts to reconcile two competing pressures shaping the AI sector: the demand for ever-larger datasets to improve model quality, and the growing legal and ethical constraints on how data can be used.

Developed by Yohei Kobashi, Fumiya Uchiyama, Takeshi Kojima, Andrew Gambardella, Qi Cao, Yusuke Iwasawa and Yutaka Matsuo at the University of Tokyo, the method relies on classical polyalphabetic substitution ciphers. Instead of removing or masking sensitive information, the entire dataset is transformed into encrypted text before tokenisation and pre-training. The model learns abstract linguistic structures from the ciphered data without being able to reproduce the original content in natural language.
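To make the idea of a polyalphabetic substitution cipher concrete, here is a minimal sketch in Python. It assumes a Vigenère-style scheme that shifts ASCII letters only and passes everything else through unchanged. The function names, the example key and the choice to operate on characters are illustrative assumptions and are not taken from the study, whose exact scheme may differ.

```python
def shift_letter(ch: str, shift: int) -> str:
    """Rotate one ASCII letter by `shift` places; leave everything else untouched."""
    if ch.isascii() and ch.isalpha():
        base = ord("A") if ch.isupper() else ord("a")
        return chr(base + (ord(ch) - base + shift) % 26)
    return ch


def encrypt(text: str, key: list[int]) -> str:
    """Polyalphabetic substitution: the i-th letter is shifted by key[i % len(key)]."""
    out = []
    letter_index = 0  # only letters advance the key; spaces and punctuation do not
    for ch in text:
        if ch.isascii() and ch.isalpha():
            out.append(shift_letter(ch, key[letter_index % len(key)]))
            letter_index += 1
        else:
            out.append(ch)
    return "".join(out)


def decrypt(text: str, key: list[int]) -> str:
    """Undo encrypt() by applying the negated shifts in the same order."""
    return encrypt(text, [-k for k in key])


key = [3, 1, 4]  # hypothetical key; a longer key uses more substitution alphabets
plain = "Alice lives in Tokyo."
cipher = encrypt(plain, key)
print(cipher)
print(decrypt(cipher, key) == plain)
```

In this example the three occurrences of the letter "i" each fall on a different key position and so encrypt to three different letters, which is what distinguishes a polyalphabetic cipher from a simple one-to-one substitution.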

The researchers tested the approach using a 551-million-parameter variant of Llama 3, first pre-training it on roughly 7.5 billion tokens of encrypted text and then continuing training on 2.5 billion tokens of normal English. They compared the results with a baseline model trained only on the plaintext portion. According to their evaluation, the Crypto-LLM model delivered an average improvement of 7.75 per cent across downstream tasks when shorter encryption keys were used, suggesting that security measures did not come at the cost of capability.

The findings arrive at a time when concerns over data leakage from language models have moved beyond theory into regulatory scrutiny. In Europe, the General Data Protection Regulation places strict obligations on how personal data can be processed, and similar frameworks are emerging elsewhere. Models that inadvertently reproduce names, addresses or copyrighted passages from their training corpora face potential legal exposure and barriers to deployment.

Existing safeguards have struggled to provide a clear answer. Data scrubbing techniques attempt to remove sensitive content before training, but studies have shown that fragments can still be reconstructed. Differential privacy offers stronger mathematical guarantees but imposes heavy computational costs and can degrade model quality, limiting its appeal for large-scale systems. Synthetic data generation has also been promoted, yet over-reliance on synthetic text risks amplifying errors and may still echo the sensitive information present in the original sources.

Against this backdrop, the Tokyo team argues that encryption offers a practical alternative that data holders can apply themselves before sharing material for model development. By encrypting content at source, organisations retain control over their data while still contributing to model training. The strength of protection can be tuned by adjusting key length, allowing a balance between privacy and utility.
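One rough way to see how key length acts as a tuning knob is to count the possible keys. The sketch below assumes a Vigenère-style cipher over a 26-letter alphabet, which is an illustrative assumption and not a detail confirmed by the study. Key-space size is only a crude proxy for protection, since classical ciphers can also be attacked through letter-frequency analysis.

```python
import math


def key_space_bits(key_length: int, alphabet_size: int = 26) -> float:
    """Size of the key space in bits: log2(alphabet_size ** key_length)."""
    return key_length * math.log2(alphabet_size)


# Hypothetical key lengths, chosen only to show the trend.
for k in (1, 4, 16, 64):
    print(k, round(key_space_bits(k), 1))
```

Each additional key position multiplies the number of possible keys by the alphabet size, so the key space grows exponentially with key length while the bit count grows linearly.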

Security testing formed a central part of the evaluation. The researchers subjected the model to name reconstruction, true-prefix completion and data extraction attacks, techniques commonly used to probe whether a system has memorised training examples. These attacks succeeded markedly less often against the encrypted-pretraining model than against models trained on scrubbed or unprotected text, indicating stronger resistance to leakage of personal and copyrighted material.
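A true-prefix completion test can be sketched as follows: the model is prompted with the real beginning of a training record, and the attack counts as a success if the continuation contains the sensitive remainder. The code below is a simplified illustration, not the study's evaluation harness. The `generate` callable, the toy records and the substring-match criterion are all assumptions.

```python
from typing import Callable, Iterable


def extraction_rate(
    generate: Callable[[str], str],
    records: Iterable[tuple[str, str]],
) -> float:
    """Fraction of (prefix, secret) pairs whose secret appears in the model's continuation."""
    pairs = list(records)
    if not pairs:
        return 0.0
    hits = sum(1 for prefix, secret in pairs if secret in generate(prefix))
    return hits / len(pairs)


# Toy stand-ins for a model: one that has memorised its data and one that has not.
memorised = {"Contact Jane Doe at ": "jane.doe@example.com"}


def leaky_model(prompt: str) -> str:
    return memorised.get(prompt, "")


def safe_model(prompt: str) -> str:
    return "[no continuation]"


records = list(memorised.items())
print(extraction_rate(leaky_model, records))
print(extraction_rate(safe_model, records))
```

A lower rate on the same set of records indicates that less of the training text can be recovered verbatim.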

For developers and policymakers, the work highlights a shift in thinking about responsible AI development. Rather than relying solely on post-hoc safeguards or legal disclaimers, Crypto-LLM embeds protection into the training pipeline itself. This aligns with the principle of privacy by design, increasingly favoured by regulators and corporate governance bodies.

The approach is not without trade-offs. Encryption adds an extra preprocessing step and requires careful key management. The study also focused on English text and a mid-sized model, leaving open questions about performance across languages and at the multi-billion-parameter scale now common in commercial systems. Still, the results suggest that the core idea is workable at this scale and warrants testing on larger models and other languages.

🕚 Wed 21 Jan 2026 | AP News
