Publication: Evaluating Online Moderation Via LLM-Powered Counterfactual Simulations

14 March 2026

Online Social Networks (OSNs) widely adopt content moderation to mitigate the spread of abusive and toxic discourse. Nonetheless, the real effectiveness of moderation interventions remains unclear due to the high cost of data collection and limited experimental control. The latest developments in Natural Language Processing pave the way for a new evaluation approach. Large Language Models (LLMs) can be successfully leveraged to enhance Agent-Based Modeling and simulate human-like social behavior with an unprecedented degree of believability. Yet, existing tools do not support simulation-based evaluation of moderation strategies.

In this study, EMERGE partners from the University of Pisa fill this gap by designing an LLM-powered simulator of OSN conversations that enables parallel, counterfactual simulations in which toxic behavior is influenced by moderation interventions while all else is kept equal. The authors conduct extensive experiments, unveiling the psychological realism of OSN agents, the emergence of social contagion phenomena and the superior effectiveness of personalized moderation strategies.
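
To make the idea of a parallel, counterfactual run concrete, here is a minimal toy sketch in Python. It is not the authors' simulator and uses no LLM: the agents, the contagion rule, the warning-style intervention and every name and number in it (`simulate`, `propensity`, the 0.2 boost, the halving) are illustrative assumptions. It only shows how two runs that share the same random seed can differ in nothing but the presence of moderation.

```python
import random


def simulate(moderated, seed=0, n_agents=5, n_turns=200):
    """Run one toy conversation and return the fraction of toxic posts.

    Hypothetical stand-in for an LLM-driven conversation; all rules and
    numbers below are assumptions made for illustration only.
    """
    # Same seed => same speakers and same noise in both arms ("all else equal").
    rng = random.Random(seed)
    propensity = [rng.uniform(0.1, 0.5) for _ in range(n_agents)]
    toxic_count = 0
    last_toxic = False
    for _ in range(n_turns):
        # Exactly two draws per turn in either arm keeps the streams aligned.
        speaker = rng.randrange(n_agents)
        noise = rng.random()
        # Toy contagion: a toxic previous post raises the chance of a toxic reply.
        p = min(1.0, propensity[speaker] + (0.2 if last_toxic else 0.0))
        last_toxic = noise < p
        if last_toxic:
            toxic_count += 1
            if moderated:
                # Toy intervention: a warning halves the speaker's propensity.
                propensity[speaker] *= 0.5
    return toxic_count / n_turns


baseline = simulate(moderated=False, seed=42)
treated = simulate(moderated=True, seed=42)
# Estimated effect of the intervention on this one paired run.
effect = baseline - treated
print(f"toxic share without moderation: {baseline:.3f}")
print(f"toxic share with moderation:    {treated:.3f}")
print(f"difference attributable to moderation: {effect:.3f}")
```

Because both arms consume the same random draws, any difference in the toxic share comes from the intervention alone. In practice one would repeat the pair over many seeds and average the differences.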

Read the paper in the link below.
