Alessandro Bondielli, Martina Miliani, Luca Paglione, Serena Auriemma, Lucia Passaro and Alessandro Lenci, "LLMs Struggle on Explicit Causality in Italian", CLiC-it 2025: Eleventh Italian Conference on Computational Linguistics, September 24-26, 2025, Cagliari, Italy

Abstract: The ability to recognize and interpret causal relations is fundamental for building robust intelligent systems. Recent research has focused on developing benchmarks and tasks to evaluate the inferential and causal reasoning capabilities of LLMs, such as the Pairwise Causal Discovery (PCD) task. However, most of these resources are limited to English. In this paper, we present ExpliCITA, an Italian translation of the English ExpliCa dataset [1] and the first publicly available dataset for joint temporal-causal reasoning in Italian, enabling the evaluation of LLMs on Italian PCD. We conduct an extensive empirical study across 20 Italian and multilingual models of varying sizes and training strategies, combining a perplexity-based evaluation of causal reasoning competence with multiple-choice prompting tasks in both zero-shot and few-shot settings. Our results show that all tested models, including the GPT family, struggle with the ExpliCITA PCD task, more so than with the original English ExpliCa, in both evaluation scenarios. Moreover, native Italian models do not outperform fine-tuned multilingual alternatives. Consistent with prior findings, we observe that the linguistic competence of models, measured with perplexity-based metrics, exceeds their task performance, measured as accuracy on prompting tasks; this gap, however, tends to narrow with increasing model size. Finally, a per-class performance analysis reveals that models handle causal relations relatively better than temporal ones.
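To make the perplexity-based evaluation mentioned above concrete, here is a minimal illustrative sketch (not the paper's actual implementation). The idea is to score each candidate connective by the perplexity of the sentence it produces and pick the lowest-perplexity option. The connective names and the per-token log-probabilities below are hypothetical stand-ins for scores an actual language model would return.

```python
import math

def perplexity(token_logprobs):
    # Perplexity is the exponential of the negative mean
    # per-token log-probability of a sequence.
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def pick_connective(candidates):
    # candidates: dict mapping a connective to the per-token
    # log-probabilities of the sentence built with it (here,
    # toy numbers in place of real LM scores).
    return min(candidates, key=lambda c: perplexity(candidates[c]))

# Toy scores: the causal connective gets less negative log-probs,
# i.e. the model finds that sentence more plausible.
scores = {
    "quindi": [-1.2, -0.8, -1.0],  # "therefore" (causal)
    "poi":    [-2.5, -2.1, -2.8],  # "then" (temporal)
}
print(pick_connective(scores))  # -> quindi
```

Under this scheme a model's "linguistic competence" can be probed without prompting at all: the model is never asked a question, only scored on which surface realization it assigns lower perplexity.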