Technical Report · UIC Fall 2025

Measuring and Mitigating Toxicity in Large Language Models: A Comprehensive Replication Study

Mokshit Surana, Archit Rathod, Akshaj Kurra Satishkumar

abstract

Large Language Models (LLMs) trained on web-scale corpora absorb toxic patterns from their training data. The result is "toxic degeneration", in which even innocuous prompts can trigger harmful outputs. This poses significant risks for real-world deployments and calls for mitigation strategies that preserve model utility while ensuring safety. In this replication study, we evaluate DExperts (Decoding-time Experts), an inference-time mitigation technique that steers generation without retraining the model. We structure the work into three phases: (1) establishing baseline toxicity measurements using RealToxicityPrompts on standard GPT-2 models; (2) implementing and evaluating DExperts to mitigate explicit toxicity; and (3) stress-testing the method against implicit hate speech using the adversarial ToxiGen dataset. Our empirical results confirm that DExperts achieves a 100% safety rate on explicit toxicity benchmarks but exhibits brittleness against adversarial, implicit hate speech, where the safety rate drops to 98.5%. We also quantify a critical trade-off: the method introduces a roughly 10x latency penalty (from 0.2 s to 2.0 s per generation), which poses challenges for real-time deployment.
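As a rough illustration of the decoding-time steering idea, the sketch below follows the logit combination described in the original DExperts paper: the base model's next-token logits are shifted by the scaled difference between an expert (non-toxic) and an anti-expert (toxic) model. It is a toy example and not the implementation used in this study: the function names, the three-token vocabulary, the logit values, and the choice of `alpha` are all hypothetical.

```python
import math


def softmax(logits):
    """Convert logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dexperts_logits(base, expert, anti_expert, alpha=2.0):
    """Combine next-token logits as base + alpha * (expert - anti_expert).

    alpha is a steering strength; the default here is an illustrative
    assumption, not a value reported in this study.
    """
    if not (len(base) == len(expert) == len(anti_expert)):
        raise ValueError("all logit vectors must share the same vocabulary size")
    return [b + alpha * (e - a) for b, e, a in zip(base, expert, anti_expert)]


# Hypothetical three-token vocabulary and made-up logits.
vocab = ["kind", "neutral", "rude"]
base = [1.0, 1.0, 1.0]         # base model is indifferent
expert = [2.0, 1.0, 0.0]       # non-toxic expert prefers "kind"
anti_expert = [0.0, 1.0, 2.0]  # toxic anti-expert prefers "rude"

steered = dexperts_logits(base, expert, anti_expert, alpha=1.0)
probs = softmax(steered)

for token, p in zip(vocab, probs):
    print(f"{token}: {p:.3f}")
```

Because each decoding step in this scheme needs logits from three models rather than one, some added inference cost is expected. The sketch does not attempt to reproduce the latency figures reported above.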

keywords

LLM Safety, Toxicity Mitigation, DExperts, GPT-2, ToxiGen, RealToxicityPrompts, AI Safety, Responsible AI
