Encrypted training offers new path to safer language models

A research team from the University of Tokyo has outlined a new approach to training large language models that aims to curb sensitive data leakage while preserving performance, addressing one of the most pressing challenges facing artificial intelligence developers as regulation tightens worldwide.

The study, titled Crypto-LLM, proposes encrypting large volumes of training text before it is fed into a model, preventing the system from memorising personal or copyrighted material in readable form. The work has drawn attention because it attempts to reconcile two competing pressures shaping the AI sector: the demand for ever-larger datasets to improve model quality, and the growing legal and ethical constraints on how data can be used.

Developed by Yohei Kobashi, Fumiya Uchiyama, Takeshi Kojima, Andrew Gambardella, Qi Cao, Yusuke Iwasawa and Yutaka Matsuo at the University of Tokyo, the method relies on classical polyalphabetic substitution ciphers. Instead of removing or masking sensitive information, the entire dataset is transformed into encrypted text before tokenisation and pre-training. The model learns abstract linguistic structures from the ciphered data without being able to reproduce the original content in natural language.
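
To make the idea concrete, here is a minimal sketch of a polyalphabetic substitution cipher applied to text before it would be tokenised. It uses the Vigenère cipher, one well-known member of that family; the article does not say which variant or settings the researchers used, so the function name `vigenere`, the letters-only handling and the sample key are all illustrative assumptions rather than the study's pipeline.

```python
import string

ALPHABET = string.ascii_lowercase


def vigenere(text: str, key: str, decrypt: bool = False) -> str:
    """Shift each ASCII letter by the matching key letter (hypothetical helper).

    Non-letters pass through unchanged and do not advance the key position.
    """
    if not key or not (key.isascii() and key.isalpha()):
        raise ValueError("key must be a non-empty string of ASCII letters")
    shifts = [ALPHABET.index(k) for k in key.lower()]
    sign = -1 if decrypt else 1
    out = []
    pos = 0  # counts letters consumed so far, used to pick the key letter
    for ch in text:
        lower = ch.lower()
        if lower in ALPHABET:
            shift = sign * shifts[pos % len(shifts)]
            shifted = ALPHABET[(ALPHABET.index(lower) + shift) % 26]
            out.append(shifted.upper() if ch.isupper() else shifted)
            pos += 1
        else:
            out.append(ch)
    return "".join(out)


# Assumed sample input: the whole corpus would be transformed like this
# before tokenisation, so the model never sees the readable original.
sample = "Alice lives at 12 Elm Street."
encrypted = vigenere(sample, "lemon")
restored = vigenere(encrypted, "lemon", decrypt=True)
```

Because the same letter is shifted by different amounts depending on its position, a name such as "Alice" does not map to a single fixed ciphertext string, while regularities such as word lengths, spacing and punctuation remain visible to the model.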

The researchers tested the approach using a 551-million-parameter variant of Llama 3, first pre-training it on roughly 7.5 billion tokens of encrypted text and then continuing training on 2.5 billion tokens of normal English. They compared the results with a baseline model trained only on the plaintext portion. According to their evaluation, the Crypto-LLM model delivered an average improvement of 7.75 per cent across downstream tasks when shorter encryption keys were used, suggesting that security measures did not come at the cost of capability.

The findings arrive at a time when concerns over data leakage from language models have moved beyond theory into regulatory scrutiny. In Europe, the General Data Protection Regulation places strict obligations on how personal data can be processed, and similar frameworks are emerging elsewhere. Models that inadvertently reproduce names, addresses or copyrighted passages from their training corpora face potential legal exposure and barriers to deployment.

Existing safeguards have struggled to provide a clear answer. Data scrubbing techniques attempt to remove sensitive content before training, but studies have shown that fragments can still be reconstructed. Differential privacy offers stronger mathematical guarantees but imposes heavy computational costs and can degrade model quality, limiting its appeal for large-scale systems. Synthetic data generation has also been promoted, yet over-reliance on synthetic text risks amplifying errors and may still echo the sensitive information present in the original sources.

Against this backdrop, the Tokyo team argues that encryption offers a practical alternative that data holders can apply themselves before sharing material for model development. By encrypting content at source, organisations retain control over their data while still contributing to model training. The strength of protection can be tuned by adjusting key length, allowing a balance between privacy and utility.
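
One way to picture why key length acts as a tuning knob, offered as an intuition rather than a finding from the study: in a Vigenère-style cipher, the same word can encrypt to a different string depending on where it falls relative to the key, so a longer key usually means more ciphertext forms of each word for the model to contend with. The helper `ciphertext_variants` below is hypothetical and assumes lowercase ASCII words and keys.

```python
def shift_word(word: str, shifts: list[int]) -> str:
    """Shift each lowercase letter of `word` by the corresponding amount."""
    return "".join(
        chr((ord(c) - ord("a") + s) % 26 + ord("a")) for c, s in zip(word, shifts)
    )


def ciphertext_variants(word: str, key: str) -> set[str]:
    """Collect every encryption of `word` across all starting key offsets."""
    key_shifts = [ord(k) - ord("a") for k in key.lower()]
    n = len(key_shifts)
    variants = set()
    for offset in range(n):
        # The key letters that would line up with the word at this offset.
        shifts = [key_shifts[(offset + i) % n] for i in range(len(word))]
        variants.add(shift_word(word.lower(), shifts))
    return variants


short_key_forms = ciphertext_variants("data", "b")      # key length 1
long_key_forms = ciphertext_variants("data", "lemon")   # key length 5
```

With a one-letter key the word always encrypts the same way, which keeps the text easy to learn from but also easy to reverse; a five-letter key spreads the same word across several forms. The number of forms is at most the key length, and can be lower for keys with repeated patterns.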

Security testing formed a central part of the evaluation. The researchers subjected the model to name reconstruction, true-prefix completion and data extraction attacks, techniques commonly used to probe whether a system has memorised training examples. These attacks succeeded markedly less often against the encrypted-pretraining model than against models trained on scrubbed or unprotected text, indicating stronger resistance to leakage of personal and copyrighted material.

For developers and policymakers, the work highlights a shift in thinking about responsible AI development. Rather than relying solely on post-hoc safeguards or legal disclaimers, Crypto-LLM embeds protection into the training pipeline itself. This aligns with the principle of privacy by design, increasingly favoured by regulators and corporate governance bodies.

The approach is not without trade-offs. Encryption adds an extra preprocessing step and requires careful key management. The study also focused on English text and a mid-sized model, leaving open questions about performance across languages and at the multi-billion-parameter scale now common in commercial systems. Still, the results suggest that the core idea is viable and warrants further exploration at larger scale.


