26 September 2025

Publication: Leveraging LLMs to Build a Semi-synthetic Dataset for Legal Information Retrieval: A Case Study on the Italian Civil Code and GPT4-O

Raw textual data in the legal domain is abundant and easy to collect in large amounts from several sources, but the structured, annotated data needed to fine-tune machine learning models is limited and difficult to obtain. Creating human-annotated datasets costs both time and money, which often makes it impractical to obtain quality data for training models on legal language tasks. To cope with these shortcomings, Large Language Models (LLMs) are becoming an appealing way to generate synthetic data, judge model responses, and annotate textual information.

In this study, EMERGE partners from the University of Pisa evaluate the applicability of LLMs to the automatic generation of a dataset of legal query-passage pairs for training retrieval systems. Legal Information Retrieval (LIR) has been crucial for building robust search systems for legal documents and is now gaining new importance in the context of Retrieval Augmented Generation (RAG), a framework that is becoming a widespread tool for coping with LLMs' hallucinations. The authors show that the model easily generates many questions. When it evaluates the appropriateness of the generated questions with respect to the reference passages, it lags behind humans in zero-shot settings, but it substantially reduces the gap with human judgements when only two examples are provided.
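To picture the difference between the zero-shot and two-example settings mentioned above, here is a minimal Python sketch of how a judging prompt might be assembled. It is an illustration only, not the prompt used in the paper: the instruction wording, the `Example` class, the `build_judge_prompt` function and the placeholder texts are all assumptions introduced here, and no model is actually called.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Example:
    """A labelled query-passage pair used as an in-context demonstration."""
    passage: str
    question: str
    appropriate: bool


# Hypothetical instruction text; the wording used in the paper is not given here.
INSTRUCTION = (
    "You are given a legal passage and a question. "
    "Answer 'yes' if the passage answers the question, otherwise 'no'."
)


def build_judge_prompt(
    passage: str,
    question: str,
    examples: Optional[Sequence[Example]] = None,
) -> str:
    """Return a zero-shot prompt (no examples) or a few-shot prompt."""
    parts = [INSTRUCTION]
    # Each demonstration shows the model a pair together with its label.
    for ex in examples or []:
        label = "yes" if ex.appropriate else "no"
        parts.append(
            f"Passage: {ex.passage}\nQuestion: {ex.question}\nAnswer: {label}"
        )
    # The pair to be judged comes last, with the answer left open.
    parts.append(f"Passage: {passage}\nQuestion: {question}\nAnswer:")
    return "\n\n".join(parts)


# Placeholder demonstrations; real ones would be human-annotated pairs.
demos = [
    Example("<passage A>", "<question matching passage A>", True),
    Example("<passage B>", "<question unrelated to passage B>", False),
]

zero_shot = build_judge_prompt("<new passage>", "<generated question>")
two_shot = build_judge_prompt("<new passage>", "<generated question>", demos)
```

The only difference between the two prompts is the pair of labelled demonstrations placed before the item to be judged, which is the kind of small change the study reports as narrowing the gap with human judgements.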

Read the paper in the link below.