One Prompt To Rule Them All: LLMs For Opinion Summary Evaluation at Flipkart

Traditional automatic metrics like ROUGE fail to provide a comprehensive assessment of opinion summaries and align poorly with human judgment, leaving e-commerce teams without a reliable way to evaluate AI-generated review summaries.

How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.
Stage 1 · Customer browsing triggers review overload
A customer browsing a product is confronted with hundreds of user reviews requiring synthesis.
Tools used
OP-I-PROMPT, SUMMEVAL-OP, ROUGE, BERTSCORE, ChatGPT-3.5, GPT-4, Solar-10.7B
Outcome

OP-I-PROMPT achieved a Spearman correlation of 0.70 with human judgments, outperforming G-EVAL on open-source models, while the Flipkart use case demonstrated that high-quality summaries can drive increased conversion rates and reduced product returns.
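
Since the headline result is a Spearman correlation between evaluator scores and human judgments, it may help to see how such a figure is computed. The following is a minimal sketch in plain Python, not the authors' evaluation code: the function names and the sample ratings are hypothetical, and ties are handled with average ranks.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; raises if either input has zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences of length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: human ratings (1-5) and an evaluator's scores
# for the same five summaries.
human_ratings = [4, 2, 5, 1, 3]
evaluator_scores = [0.80, 0.40, 0.90, 0.30, 0.35]
print(round(spearman(human_ratings, evaluator_scores), 2))
```

Because the statistic depends only on rank order, an evaluator does not need to use the same scale as the human raters to score well; it only needs to order the summaries the same way.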

What failed first

Reference-based metrics ROUGE and BERTSCORE showed very poor and sometimes negative correlation with human ratings of summary quality, confirming they are inadequate for assessing modern generative model outputs.
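
One plausible reason reference-based metrics behave this way is that they reward word overlap with a reference rather than meaning. The sketch below is a simplified ROUGE-1 F1 (lowercased whitespace tokens, no stemming), not the official ROUGE implementation, and the example sentences are hypothetical. It is offered only as an illustration of the failure mode, not as the analysis performed in the source.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: clipped unigram overlap between two texts."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical reference summary and two candidate summaries.
reference = "battery life is excellent"
faithful_paraphrase = "the phone lasts all day on one charge"
contradictory = "battery life is terrible"

print(rouge1_f1(faithful_paraphrase, reference))
print(rouge1_f1(contradictory, reference))
```

In this toy case the contradictory summary shares three of four tokens with the reference and so outscores the faithful paraphrase, which shares none. That is the kind of mismatch that can push correlation with human ratings toward zero or below.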

Results
Volume: 0.70 (Spearman correlation with human judgments)
Running since: ACL 2024
Source

https://blog.flipkart.tech/one-prompt-to-rule-them-all-llms-for-opinion-summary-evaluation-d5dd4eb6f225

Grounding & classification
Source type: technical build writeup
29 fields verified against source quotes, 3 dropped as unverifiable.
quality inspection, summarization, knowledge base, human review described, metric backed, named customer, source backed, tools described, workflow described, ecommerce, accuracy improvement, conversion increase, customer satisfaction, technical build writeup, ecommerce ops, quality assurance, case to summary