
How DeepL built next-generation LLMs with FP8 for training and inference

BF16 training constrained DeepL's ability to scale LLMs to larger parameter counts within practical memory and latency budgets; moving to 8-bit computation was needed to increase throughput and enable more sophisticated models without sacrificing production inference latency.
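To make the memory side of this constraint concrete, here is a rough back-of-the-envelope sketch of weight storage at 16-bit versus 8-bit precision. The parameter count is a hypothetical example, not a figure from DeepL, and the estimate covers weight storage only. It ignores optimizer state, activations, and any higher-precision master weights that a real FP8 training setup may keep.

```python
# Bytes needed to store one value in each format.
BYTES_PER_VALUE = {"bf16": 2, "fp8": 1}


def weight_memory_gib(num_params: int, dtype: str) -> float:
    """Return the memory needed to hold the weights alone, in GiB."""
    return num_params * BYTES_PER_VALUE[dtype] / 2**30


if __name__ == "__main__":
    hypothetical_params = 7_000_000_000  # assumption: illustrative model size
    for dtype in ("bf16", "fp8"):
        gib = weight_memory_gib(hypothetical_params, dtype)
        print(f"{dtype}: {gib:.1f} GiB of weights")
```

Halving the bytes per value halves the weight footprint, which is one reason 8-bit formats leave more room for larger models or batches within the same hardware budget.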

How it works

Common implementation structure

How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.

Stage 1 · LLM pre-training initiation

The FP8 training and inference journey starts with the pre-training of DeepL's LLMs.

Tools used

NVIDIA DGX SuperPOD · NVIDIA H100 Tensor Core GPUs · NVIDIA Transformer Engine · NVIDIA TensorRT-LLM

Outcome

FP8 accelerated model training by 50% as measured by Model FLOPs Utilization (MFU), and further optimization ultimately brought MFU to 80%. It also doubled inference throughput at the same latency budget and enabled translation quality that outperforms previous models by 1.4x for European languages and 1.7x for complex language pairs.
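For readers unfamiliar with MFU, the following is a minimal sketch of how the metric is commonly estimated. It assumes the widely used approximation of about 6 FLOPs per parameter per training token for dense transformers. The function names and every input value are hypothetical, and the source does not state which exact formula or peak-FLOPs baseline DeepL used.

```python
def model_flops_utilization(
    tokens_per_second: float,
    num_params: int,
    peak_flops_per_gpu: float,
    num_gpus: int = 1,
) -> float:
    """Estimate MFU as achieved training FLOPs divided by hardware peak FLOPs.

    Assumption: ~6 FLOPs per parameter per token (forward plus backward pass).
    """
    achieved_flops = 6 * num_params * tokens_per_second
    return achieved_flops / (peak_flops_per_gpu * num_gpus)


def relative_speedup(mfu_before: float, mfu_after: float) -> float:
    """Relative throughput gain when MFU changes against the same peak baseline."""
    return mfu_after / mfu_before - 1.0


if __name__ == "__main__":
    # Hypothetical inputs chosen only to exercise the formula.
    mfu = model_flops_utilization(
        tokens_per_second=1000.0,
        num_params=1_000_000,
        peak_flops_per_gpu=12e9,
    )
    print(f"MFU: {mfu:.0%}")
    print(f"Speedup from 0.4 to 0.6 MFU: {relative_speedup(0.4, 0.6):.0%}")
```

When the peak-FLOPs denominator is held fixed, a relative rise in MFU corresponds to the same relative rise in training throughput, which is how a change in MFU can be read as a training speedup.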

Results

Time saved: 25%

Volume: 67% MFU

Source

https://www.deepl.com/en/blog/tech/next-generation-llm-fp8-training


Grounding & classification

Source type: technical build writeup

22 fields verified against source quotes.

translation · metric backed · named customer · production runtime claimed · source backed · tools described · workflow described · software · accuracy improvement · cycle time reduction · throughput increase · technical build writeup · back office ops