31 July 2025

Publication: ExpliCa: Evaluating Explicit Causal Reasoning in Large Language Models

Whether Large Language Models (LLMs) truly comprehend causal relationships in natural language texts, or merely act as ‘stochastic parrots’ replicating statistical associations from their pretraining data, remains a topic of debate. The question is crucial for applying LLMs in domains that demand interpretive and inferential accuracy.

In this work, EMERGE partners from the University of Pisa introduce ExpliCa, a new dataset for evaluating explicit causal reasoning in LLMs. ExpliCa uniquely integrates causal and temporal relations, presented in different linguistic orders and explicitly expressed through linguistic connectives, and is enriched with crowdsourced human acceptability ratings. The authors evaluated seven commercial and open-source LLMs on ExpliCa through prompting and perplexity-based metrics, finding that even top models struggle to reach 0.80 accuracy. Interestingly, models tend to confound temporal relations with causal ones, and their performance is also strongly influenced by the linguistic order of the events. Finally, perplexity-based scores and prompting performance are affected differently by model size.
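To illustrate what a perplexity-based evaluation of this kind looks like, here is a minimal sketch using an open-source causal language model. The model choice (gpt2) and the example sentence pair are illustrative assumptions, not taken from the paper, and the authors' actual scoring procedure may differ; the idea shown is simply that a lower perplexity for one connective over another indicates which reading the model finds more plausible.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical model choice for illustration; the paper evaluates
# seven commercial and open-source LLMs, not necessarily this one.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def perplexity(sentence: str) -> float:
    """Perplexity of a sentence under the model: the exponential of
    the mean negative log-likelihood of its tokens."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels=input_ids makes the model return the mean
        # cross-entropy loss over the sequence.
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

# Hypothetical event pair joined by a causal vs. a purely temporal
# connective; comparing the scores probes which relation the model prefers.
causal = "The roads were icy, so the driver lost control."
temporal = "The roads were icy, then the driver lost control."
print(f"causal:   {perplexity(causal):.2f}")
print(f"temporal: {perplexity(temporal):.2f}")
```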

Read the paper at the link below.