Liger-Kernel: LinkedIn's open-source Triton kernel library improves LLM training throughput by 20% and cuts memory usage by 60%

Training LLMs on GPUs is slowed by two key bottlenecks. First, every kernel launch incurs GPU memory I/O overhead as data moves between slow HBM and fast SRAM. Second, eager-execution frameworks add per-operation overhead: operations run synchronously, line by line, and output activations must be stored in memory for the backward pass.
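The memory I/O bottleneck can be illustrated with a rough back-of-the-envelope model. This is a sketch, not a measurement: it assumes a chain of elementwise operations in which each unfused kernel reads its input from HBM and writes its output back, while a single fused kernel reads once and writes once. The function names, the tensor shape, and the 2-byte element size are illustrative assumptions, not taken from the source.

```python
def unfused_hbm_bytes(num_elements: int, num_ops: int, bytes_per_element: int = 2) -> int:
    """Each op runs as its own kernel: one HBM read and one HBM write per element."""
    return num_ops * 2 * num_elements * bytes_per_element


def fused_hbm_bytes(num_elements: int, bytes_per_element: int = 2) -> int:
    """One fused kernel: intermediates stay on-chip, so one read and one write in total."""
    return 2 * num_elements * bytes_per_element


# Hypothetical activation tensor: 4 sequences x 4096 tokens x 4096 hidden units.
n = 4 * 4096 * 4096
ops = 3  # hypothetical chain of three elementwise operations

unfused = unfused_hbm_bytes(n, ops)
fused = fused_hbm_bytes(n)

print(f"unfused: {unfused / 2**20:.0f} MiB, fused: {fused / 2**20:.0f} MiB")
print(f"traffic ratio: {unfused / fused:.1f}x")
```

Under these assumptions, HBM traffic grows linearly with the number of separately launched operations, which is why fusing them into one kernel helps. Real kernels also differ in launch latency, caching, and occupancy, none of which this model captures.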

How it works

Common implementation structure

This section describes how this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools used” below.

Stage 1 · Training job submitted via platform

Users submit training tasks that are scheduled by Flyte onto Kubernetes, which efficiently allocates GPUs.

Tools used

Liger-Kernel, Triton, Axolotl, LLaMa-Factory, SFTTrainer, Hugging Face Trainer, SWIFT, PyTorch FSDP, Microsoft DeepSpeed, Flash Attention, Flyte, Kubernetes

Outcome

Liger-Kernel improves training throughput by 20% and reduces memory usage by 60% with a single line of code. LinkedIn observed a 3X reduction in end-to-end training time for an in-house model at ~70B scale, and 10%–20% throughput gains at ~100B and ~10B scale.
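As a worked illustration of what these figures imply, here is a minimal arithmetic sketch. Only the 20%, 60%, and 3X figures come from the text above. The baseline of 100 training hours and 80 GB of GPU memory is hypothetical, chosen to make the numbers easy to read, and the helper names are illustrative.

```python
def time_ratio(throughput_gain: float) -> float:
    """Relative wall-clock time for a fixed amount of work after a throughput gain."""
    return 1.0 / (1.0 + throughput_gain)


def memory_after(baseline_gb: float, reduction: float) -> float:
    """Memory footprint after a fractional reduction."""
    return baseline_gb * (1.0 - reduction)


baseline_hours = 100.0   # hypothetical baseline training time
baseline_mem_gb = 80.0   # hypothetical baseline GPU memory footprint

hours_after_20pct = baseline_hours * time_ratio(0.20)   # 20% throughput gain
mem_after_60pct = memory_after(baseline_mem_gb, 0.60)   # 60% memory reduction
hours_after_3x = baseline_hours / 3.0                   # 3X end-to-end reduction

print(f"20% more throughput: {hours_after_20pct:.1f} h")
print(f"60% less memory:     {mem_after_60pct:.1f} GB")
print(f"3X faster end to end: {hours_after_3x:.1f} h")
```

Note that a 20% throughput gain shortens a fixed workload by about 17%, not 20%, because time scales with the inverse of throughput. The 3X figure is an end-to-end result reported for one in-house model and is separate from the kernel-level throughput number.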

Results

Time saved: 3X reduction

Volume: 20%

Running since: August 2024

Source

https://www.linkedin.com/blog/engineering/open-source/liger-kernel-open-source-ecosystem-for-efficient-llm-training

Grounding & classification

Source type: technical build writeup

35 fields verified against source quotes.

builder submitted, metric backed, named customer, production runtime claimed, tools described, workflow described, software, cost reduction, cycle time reduction, throughput increase, technical build writeup, back office ops