
Grammarly develops DeTexD benchmark dataset and RoBERTa-based classifier for delicate text detection

Existing toxic text detection methods leave a gap for the broader category of delicate text — emotionally charged or potentially triggering writing that may not be explicitly offensive but still carries risk for users and LLMs exposed to it.

How it works

Common implementation structure

How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.

Stage 1 · Delicate text data sourcing

Data was sourced by targeting news websites, forums discussing sensitive topics, and generally controversial forums.

Tools used

RoBERTa, HateBERT, Google's Perspective API, OpenAI content filter, OpenAI moderation API

Outcome

Grammarly created the DeTexD dataset (40,000 training samples and 1,023 benchmark paragraphs) and a RoBERTa-based baseline model achieving 79.3% F1, outperforming all evaluated methods. Annotation guidelines, dataset, and baseline model were made publicly available.
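The F1 figure above is the harmonic mean of precision and recall on the binary delicate / not-delicate decision. As a minimal sketch of how such a score is computed from gold and predicted labels, assuming labels are encoded as 1 (delicate) and 0 (not delicate), the function below uses plain Python. The function name and the toy labels are hypothetical and are not taken from the DeTexD release.

```python
from typing import Sequence, Tuple


def precision_recall_f1(gold: Sequence[int], pred: Sequence[int]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for the positive class (label 1).

    Assumption: 1 means "delicate" and 0 means "not delicate".
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    # Count true positives, false positives, and false negatives.
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)

    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1


# Toy labels invented for illustration only.
toy_gold = [1, 1, 1, 0, 0]
toy_pred = [1, 1, 0, 1, 0]
p, r, f1 = precision_recall_f1(toy_gold, toy_pred)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Because F1 penalizes an imbalance between precision and recall, a detector that flags nearly everything or nearly nothing scores poorly even if one of the two numbers looks strong.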

What failed first

All evaluated toxic and hate-speech detection methods underperformed on delicate text: they either lacked coverage of medical and mental health topics or showed lower precision on texts that contain offensive keywords but are not actually delicate.
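The two failure modes described here can be illustrated with a deliberately simplified stand-in. The sketch below is a naive keyword matcher, not any of the evaluated systems; the keyword set, the example sentences, and their labels are all hypothetical and invented for illustration.

```python
import re

# Hypothetical keyword list; real toxicity detectors are far more sophisticated.
OFFENSIVE_KEYWORDS = {"idiot", "stupid"}


def keyword_flag(text: str) -> bool:
    """Flag text if any token matches the hypothetical offensive-keyword list."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return any(token in OFFENSIVE_KEYWORDS for token in tokens)


# (text, assumed_is_delicate) pairs, invented for illustration only.
examples = [
    # Contains an offensive keyword but is assumed not delicate: a false positive.
    ("That was a stupid typo in my spreadsheet.", False),
    # Mental-health content with no offensive keyword, assumed delicate: a false negative.
    ("I have not been able to get out of bed since my diagnosis.", True),
]

for text, is_delicate in examples:
    flagged = keyword_flag(text)
    print(f"flagged={flagged} assumed_delicate={is_delicate} text={text!r}")
```

Both toy examples are misclassified: the first lowers precision and the second lowers recall, which mirrors the two gaps the writeup attributes to toxicity-oriented detectors.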

Results

Volume: 40,000 samples

Source

https://www.grammarly.com/blog/engineering/detecting-delicate-text/


Grounding & classification

Source type: technical build writeup

24 fields verified against source quotes.

document classification, sentiment analysis, social media post, failure mode described, human review described, metric backed, source backed, tools described, workflow described, software, accuracy improvement, technical build writeup, compliance monitoring, quality assurance, extract classify route