26 September 2025
Large Language Models (LLMs) have demonstrated remarkable capabilities in many tasks, often achieving near-human performance. Nevertheless, many questions remain about their internal workings: whether and how they perform some form of reasoning, and to what extent their grasp of concepts expressed through natural language approximates human conceptual understanding.
Sparse Autoencoders (SAEs) have become a popular technique for identifying interpretable concepts in Language Models. They have been successfully applied to several models of varying sizes, both open and commercial, and have become one of the main avenues for interpretability research. Despite these advances, little attention has been given to applying SAEs to Italian language models. In this work, EMERGE partners from the University of Pisa present an initial step toward addressing this gap. They train an SAE on the residual stream of the Minerva-1B-base-v1.0 model and release its weights; they also leverage an automated interpretability pipeline based on LLMs to evaluate the quality of the latents and to provide explanations for some of them. They show that, despite the limitations of the approach, some interpretable concepts can be found in the weights of the model.
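To make the general setup concrete, below is a minimal sketch of the kind of sparse autoencoder typically trained on residual-stream activations: a linear encoder with a ReLU, a linear decoder, and a reconstruction loss with an L1 sparsity penalty. This is an illustration of the standard technique, not the authors' implementation; the dimensions, hyperparameters, and the random activation batch are placeholders rather than values from the paper.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE: an overcomplete ReLU dictionary over residual-stream activations."""

    def __init__(self, d_model: int, d_latent: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_latent)   # activations -> sparse latents
        self.decoder = nn.Linear(d_latent, d_model)   # latents -> reconstruction

    def forward(self, x: torch.Tensor):
        z = torch.relu(self.encoder(x))                # non-negative latents
        x_hat = self.decoder(z)
        return x_hat, z

def sae_loss(x, x_hat, z, l1_coeff: float = 1e-3):
    """Reconstruction error plus an L1 sparsity penalty on the latents."""
    recon = (x - x_hat).pow(2).mean()
    sparsity = z.abs().sum(dim=-1).mean()
    return recon + l1_coeff * sparsity

# Illustrative training step on a batch of residual-stream activations
# of shape [batch, d_model]; all sizes here are placeholder assumptions.
d_model, d_latent = 2048, 16384
sae = SparseAutoencoder(d_model, d_latent)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

acts = torch.randn(64, d_model)   # stand-in for activations hooked from the model
x_hat, z = sae(acts)
loss = sae_loss(acts, x_hat, z)
opt.zero_grad()
loss.backward()
opt.step()
```

In practice the activation batches would be collected from a chosen residual-stream layer of the model during forward passes over a text corpus, and the learned latents are the candidate "concepts" that the interpretability pipeline then evaluates and explains.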
Read the paper at the link below.

