One Prompt To Rule Them All: LLMs For Opinion Summary Evaluation at Flipkart

Traditional automatic metrics like ROUGE fail to provide a comprehensive assessment of opinion summaries and align poorly with human judgment, leaving e-commerce teams without a reliable way to evaluate AI-generated review summaries.

How it works

Common implementation structure

How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.

Stage 1 · Customer browsing triggers review overload

A customer browsing a product is confronted with hundreds of user reviews requiring synthesis.

Tools used

OP-I-PROMPT, SUMMEVAL-OP, ROUGE, BERTSCORE, ChatGPT-3.5, GPT-4, Solar-10.7B

Outcome

OP-I-PROMPT achieved a Spearman correlation of 0.70 with human judgments and outperformed G-EVAL on open-source models. The Flipkart use case also demonstrated that high-quality summaries can drive increased conversion rates and reduced product returns.
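Spearman correlation measures how closely an evaluator's scores rank a set of summaries in the same order as human raters. The following is a minimal sketch of that check, not the evaluation code from the source: the function names are hypothetical and the score lists are invented toy values chosen only to show one well-aligned evaluator and one negatively correlated metric.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences of at least 2 items")
    return pearson(average_ranks(xs), average_ranks(ys))


# Toy data (invented for illustration): one score per summary.
human_ratings = [1, 2, 3, 4, 5]
llm_scores = [1.5, 1.0, 3.2, 4.1, 4.8]            # mostly agrees with humans
overlap_scores = [0.40, 0.35, 0.30, 0.32, 0.20]   # ranks summaries almost in reverse

print(round(spearman(human_ratings, llm_scores), 2))      # 0.9
print(round(spearman(human_ratings, overlap_scores), 2))  # -0.9
```

A value near 1 means the evaluator orders summaries much as humans do, while a value near or below 0 means its scores carry little or misleading signal about human-perceived quality.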

What failed first

Reference-based metrics ROUGE and BERTSCORE showed very poor and sometimes negative correlation with human ratings of summary quality, confirming they are inadequate for assessing modern generative model outputs.

Results

Volume: 0.70 (Spearman correlation with human judgments)

Running since: ACL 2024

Source

https://blog.flipkart.tech/one-prompt-to-rule-them-all-llms-for-opinion-summary-evaluation-d5dd4eb6f225

Grounding & classification

Source type: technical build writeup

29 fields verified against source quotes, 3 dropped as unverifiable.

quality inspection, summarization, knowledge base, human review described, metric backed, named customer, source backed, tools described, workflow described, ecommerce, accuracy improvement, conversion increase, customer satisfaction, technical build writeup, ecommerce ops, quality assurance, case to summary