Mattia Proietti, Lucia C. Passaro, and Alessandro Lenci. 2025. Leveraging LLMs to Build a Semi-synthetic Dataset for Legal Information Retrieval: A Case Study on the Italian Civil Code and GPT4-O. In Proceedings of the Eleventh Italian Conference on Computational Linguistics (CLiC-it 2025), pages 933–941.
Abstract: Although raw textual data in the legal domain is abundant, making it easy to collect large amounts of material from several sources, the structured and annotated data needed to fine-tune machine learning models is limited and difficult to obtain. Creating human-annotated datasets is both time-consuming and expensive, which often makes it impractical to obtain quality data for training models on various legal language tasks. AI models such as Large Language Models (LLMs) are becoming appealing tools to generate synthetic data, judge model responses, and annotate textual information, thereby coping with these shortcomings. In this work, we evaluate the applicability of LLMs to the automatic generation of a dataset of legal query-passage pairs for training retrieval systems. Legal Information Retrieval (LIR) has long been crucial for building robust search systems over legal documents and is now gaining new importance in the context of the Retrieval Augmented Generation (RAG) framework, which is becoming a widespread tool for mitigating LLM hallucinations. Our goal is to test the feasibility of building a query-passage dataset in which the queries are generated by an LLM about real textual passages, and to assess the reliability of such a process in terms of producing hallucination-free data points in a sensitive domain such as the legal one. We do so with a two-step pipeline: i) we use the Italian Civil Code as a source of self-contained, semantically coherent legal passages and ask the model to generate hypothetical questions about them; ii) we use the LLM itself to judge the coherence of the questions, spotting those inconsistent with the passage. We then select a random subset of the question-passage pairs and ask human annotators to evaluate them. Finally, we compare human and model evaluations on this subset.
We show that the model generates many questions with ease; while it lags behind humans when evaluating, in zero-shot settings, the appropriateness of the generated questions with respect to the reference passages, it substantially narrows the gap with human judgements when just two examples are provided.
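The two-step pipeline described in the abstract can be sketched in a few lines of Python. This is a hedged illustration only: `ask_llm` is a hypothetical callable standing in for a real LLM API call (e.g. to GPT-4o), and the prompt wording, the question-generation instruction, and the yes/no coherence-judging format are assumptions for the sake of the example, not the paper's actual prompts.

```python
from typing import Callable, Iterable, List, Tuple

# Step i): ask the LLM to generate a hypothetical question about a passage.
def generate_question(passage: str, ask_llm: Callable[[str], str]) -> str:
    prompt = (
        "Write a hypothetical legal question that can be answered "
        f"using only this passage:\n{passage}"
    )
    return ask_llm(prompt).strip()

# Step ii): use the LLM itself as a judge of question-passage coherence.
# `examples` holds optional (passage, question, label) triples, mirroring
# the paper's two-shot setting.
def judge_coherence(
    passage: str,
    question: str,
    ask_llm: Callable[[str], str],
    examples: Iterable[Tuple[str, str, str]] = (),
) -> bool:
    shots = "\n".join(
        f"Passage: {p}\nQuestion: {q}\nCoherent: {label}"
        for p, q, label in examples
    )
    prompt = (shots + "\n" if shots else "") + (
        f"Passage: {passage}\nQuestion: {question}\nCoherent:"
    )
    return ask_llm(prompt).strip().lower().startswith("yes")

# Full pipeline: keep only pairs the judge deems coherent.
def build_pairs(
    passages: Iterable[str],
    ask_llm: Callable[[str], str],
    examples: Iterable[Tuple[str, str, str]] = (),
) -> List[Tuple[str, str]]:
    pairs = []
    for passage in passages:
        question = generate_question(passage, ask_llm)
        if judge_coherence(passage, question, ask_llm, list(examples)):
            pairs.append((question, passage))
    return pairs
```

In practice `ask_llm` would wrap an API client; separating it out also makes the pipeline easy to test with a deterministic stub before spending API calls.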

