back_office_ops · saas · workflow

How DeepL built next-generation LLMs with FP8 for training and inference

BF16 training constrained DeepL's ability to scale LLMs to larger parameter counts within practical memory and latency budgets; moving to 8-bit computation was needed to increase throughput and enable more sophisticated models without sacrificing production inference latency.

How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.
Stage 1 · LLM pre-training initiation
The FP8 training and inference journey starts with the pre-training of DeepL's LLMs.
Tools used
NVIDIA DGX SuperPOD, NVIDIA H100 Tensor Core GPUs, NVIDIA Transformer Engine, NVIDIA TensorRT-LLM
Outcome

FP8 accelerated model training by 50% as measured by Model FLOPs Utilization (MFU), ultimately reaching 80% MFU after further optimization. It also doubled inference throughput at the same latency budget and enabled translation quality that outperforms previous models by 1.4x for European languages and 1.7x for complex language pairs.

Results
Time saved: 25%
Volume: 67% MFU
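
The MFU figures above are ratios of achieved model FLOP/s to the hardware's peak FLOP/s. The sketch below shows one common way to estimate that ratio and how the reported percentages relate to each other arithmetically. It is an illustration, not DeepL's measurement code. The function names, the "6 FLOPs per parameter per token" approximation for dense transformer training, and every numeric input in the example call (parameter count, token rate, GPU count, peak FLOP/s) are assumptions chosen for illustration. Reading 67% as the value reached after the 50% gain is likewise an assumption.

```python
def mfu(tokens_per_second, n_params, n_gpus, peak_flops_per_gpu):
    """Estimate Model FLOPs Utilization for dense transformer training.

    Uses the common approximation of ~6 FLOPs per parameter per token
    (forward plus backward pass). This is an assumption, not a figure
    taken from the source.
    """
    achieved_flops = 6 * n_params * tokens_per_second
    return achieved_flops / (n_gpus * peak_flops_per_gpu)


def implied_baseline(value_after, relative_gain):
    """Value before a relative gain, e.g. relative_gain=0.5 for +50%."""
    return value_after / (1 + relative_gain)


def relative_speedup(mfu_before, mfu_after):
    """Throughput ratio on the same hardware when MFU changes."""
    return mfu_after / mfu_before


# Hypothetical cluster and model, for illustration only.
example = mfu(
    tokens_per_second=1.0e6,
    n_params=7e9,
    n_gpus=64,
    peak_flops_per_gpu=1.0e15,  # placeholder peak, not a vendor spec
)

# Assumption: 67% MFU is the value after the 50% relative gain.
baseline = implied_baseline(0.67, 0.5)

# Going from 67% to the reported 80% MFU.
further_gain = relative_speedup(0.67, 0.80)

print(f"example MFU: {example:.3f}")
print(f"implied baseline MFU: {baseline:.3f}")
print(f"speedup from 67% to 80% MFU: {further_gain:.3f}x")
```

Under that assumption, the implied pre-FP8 baseline is roughly 45% MFU, and the later move from 67% to 80% corresponds to about a 1.19x further throughput gain on the same hardware.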
Source

https://www.deepl.com/en/blog/tech/next-generation-llm-fp8-training


Grounding & classification
Source type: technical build writeup
22 fields verified against source quotes.
translation, metric backed, named customer, production runtime claimed, source backed, tools described, workflow described, software, accuracy improvement, cycle time reduction, throughput increase, technical build writeup, back office ops