compliance_monitoring · saas · workflow

Grammarly develops DeTexD benchmark dataset and RoBERTa-based classifier for delicate text detection

Existing toxic text detection methods leave a gap for the broader category of delicate text — emotionally charged or potentially triggering writing that may not be explicitly offensive but still carries risk for users and LLMs exposed to it.

How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.
Stage 1 · Delicate text data sourcing
Data was sourced by targeting news websites, forums discussing sensitive topics, and generally controversial forums.
Tools used
RoBERTa, HateBERT, Google's Perspective API, OpenAI content filter, OpenAI moderation API
Outcome

Grammarly created the DeTexD dataset (40,000 training samples and 1,023 benchmark paragraphs) and a RoBERTa-based baseline model achieving 79.3% F1, outperforming all evaluated methods. Annotation guidelines, dataset, and baseline model were made publicly available.
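The F1 figure above is the harmonic mean of precision and recall. As a minimal sketch of how such a score is computed, assuming binary labels (1 = delicate, 0 = not delicate): the function name and the toy label lists below are hypothetical and are not taken from the DeTexD benchmark or Grammarly's evaluation code.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for the positive (delicate) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # delicate, flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # not delicate, flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # delicate, missed

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy labels for illustration only (hypothetical, not benchmark data).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0, 0, 1, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Because F1 penalizes an imbalance between the two components, a detector that misses whole topic areas (low recall) or over-flags on offensive keywords (low precision) scores poorly even if the other component is high.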

What failed first

All evaluated toxic and hate-speech detection methods underperformed on delicate text: they either lacked coverage of medical and mental health topics or showed lower precision on texts that contain offensive keywords but are not actually delicate.

Results
Volume: 40,000 samples
Source

https://www.grammarly.com/blog/engineering/detecting-delicate-text/


Grounding & classification
Source type: technical build writeup
24 fields verified against source quotes.
document classification, sentiment analysis, social media post, failure mode described, human review described, metric backed, source backed, tools described, workflow described, software, accuracy improvement, technical build writeup, compliance monitoring, quality assurance, extract classify route