Salesforce reduces AI inference infrastructure costs up to 8x with Amazon SageMaker AI inference components

Salesforce's AI Platform team faced two GPU underutilization problems. Large models (20–30 GB) with low traffic ran on expensive multi-GPU instances that sat mostly idle, while medium models (~15 GB) serving high-traffic workloads were over-provisioned on similarly expensive multi-GPU setups. Both drove avoidable infrastructure cost.

How it works

Common implementation structure

How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.

Stage 1 · Identify GPU underutilization

Salesforce identified two distinct optimization challenges: larger models underutilizing multi-GPU instances and medium high-traffic models over-provisioned on similar setups.
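
As a rough illustration of how such underutilization might be flagged, the sketch below compares each model's size against the GPU memory reserved for it. It is not Salesforce's method. The 40 GB per GPU and 8 GPUs per instance figures assume a p4d.24xlarge, the deployment names and the 0.5 threshold are hypothetical, and model size is only a proxy: a real assessment would rely on measured GPU memory and compute metrics.

```python
from dataclasses import dataclass
from typing import List

GPU_MEMORY_GB = 40.0    # assumption: A100 40 GB GPUs, as on p4d.24xlarge
GPUS_PER_INSTANCE = 8   # assumption: p4d.24xlarge


@dataclass(frozen=True)
class Deployment:
    name: str             # hypothetical label, not a real endpoint name
    model_size_gb: float  # approximate size of the model weights
    gpus_reserved: int    # GPUs held exclusively for this model


def memory_utilization(d: Deployment) -> float:
    """Fraction of the reserved GPU memory that the model weights occupy."""
    return d.model_size_gb / (d.gpus_reserved * GPU_MEMORY_GB)


def flag_underutilized(deployments: List[Deployment], threshold: float = 0.5) -> List[str]:
    """Names of deployments whose weights fill less than `threshold` of reserved memory."""
    return [d.name for d in deployments if memory_utilization(d) < threshold]


# Illustrative inputs based on the sizes quoted above; the GPU counts are assumed.
deployments = [
    Deployment("large-low-traffic", 30.0, GPUS_PER_INSTANCE),
    Deployment("medium-high-traffic", 15.0, GPUS_PER_INSTANCE),
    Deployment("large-on-one-gpu", 30.0, 1),
]

for d in deployments:
    print(f"{d.name}: {memory_utilization(d):.1%} of reserved GPU memory")
print("underutilized:", flag_underutilized(deployments))
```

Under these assumptions, a model that holds a whole instance to itself uses only a small fraction of the memory it reserves, which is the gap that sharing an endpoint aims to close.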

Tools used

Amazon SageMaker AI, Amazon EC2 P4d, CodeGen, XGen, ApexGuru

Outcome

By deploying multiple models as inference components on shared SageMaker AI endpoints with dynamic scaling, Salesforce achieved up to an eight-fold reduction in deployment and infrastructure costs while maintaining high performance.
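
A minimal sketch of what deploying a model as an inference component with copy-count scaling could look like is shown below. It only builds the request payloads, shaped for the boto3 SageMaker `create_inference_component` call and the Application Auto Scaling `register_scalable_target` call, and sends nothing to AWS. The endpoint, variant, model, and component names, the memory figure, and the copy limits are hypothetical and do not come from Salesforce's setup.

```python
from typing import Any, Dict


def inference_component_request(
    endpoint_name: str,
    variant_name: str,
    model_name: str,
    component_name: str,
    gpus: int,
    min_memory_mb: int,
    copies: int = 1,
) -> Dict[str, Any]:
    """Build keyword arguments for SageMaker's create_inference_component."""
    return {
        "InferenceComponentName": component_name,
        "EndpointName": endpoint_name,
        "VariantName": variant_name,
        "Specification": {
            "ModelName": model_name,
            "ComputeResourceRequirements": {
                # Accelerators and memory reserved for each copy of the model.
                "NumberOfAcceleratorDevicesRequired": gpus,
                "MinMemoryRequiredInMb": min_memory_mb,
            },
        },
        "RuntimeConfig": {"CopyCount": copies},
    }


def scaling_target_request(component_name: str, min_copies: int, max_copies: int) -> Dict[str, Any]:
    """Build keyword arguments for Application Auto Scaling's register_scalable_target."""
    return {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"inference-component/{component_name}",
        "ScalableDimension": "sagemaker:inference-component:DesiredCopyCount",
        "MinCapacity": min_copies,
        "MaxCapacity": max_copies,
    }


# Hypothetical example: one model placed on a shared endpoint, one GPU per copy.
component = inference_component_request(
    endpoint_name="shared-llm-endpoint",
    variant_name="AllTraffic",
    model_name="example-medium-model",
    component_name="example-medium-ic",
    gpus=1,
    min_memory_mb=16384,
)
scaling = scaling_target_request("example-medium-ic", min_copies=1, max_copies=4)

# With AWS credentials and an existing endpoint, these could be sent as:
#   boto3.client("sagemaker").create_inference_component(**component)
#   boto3.client("application-autoscaling").register_scalable_target(**scaling)
print(component["InferenceComponentName"], scaling["ResourceId"])
```

Repeating the first call for each model places several components on the same endpoint, and a scaling policy attached to the registered target would then adjust each component's copy count with traffic.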

Results

Volume: substantial reduction in operational cost

Cost replaced: up to an eight-fold reduction

Source

https://aws.amazon.com/blogs/machine-learning/optimizing-salesforces-model-endpoints-with-amazon-sagemaker-ai-inference-components?tag=soumet-20

Grounding & classification

Source type: technical build writeup

17 fields verified against source quotes.

code generation · metric backed · named customer · production runtime claimed · tools described · vendor confirmed · workflow described · software · cost reduction · technical build writeup · back office ops