Retrieving Semantically Similar Decisions under Noisy Institutional Labels: Robust Comparison of Embedding Methods

Authors

NOVOTNÁ Tereza, HARAŠTA Jakub

Year of publication 2025
Type Article in Periodical
Magazine / Source arXiv
MU Faculty or unit

Faculty of Law

Citation
web Text (arXiv.org)
Doi https://doi.org/10.48550/arXiv.2512.05681
Keywords legal information retrieval; case law; embeddings; evaluation under noisy labels; Czech Constitutional Court
Description Retrieving case law is a time-consuming task predominantly carried out by querying databases. We provide a comparison of two models in three different settings for Czech Constitutional Court decisions: (i) a large general-purpose embedder (OpenAI) and (ii) a domain-specific BERT trained from scratch on ~30,000 decisions using sliding windows and attention pooling. We propose a noise-aware evaluation that includes IDF-weighted keyword overlap as graded relevance, binarization via two thresholds (0.20 balanced, 0.28 strict), significance testing via paired bootstrap, and an nDCG diagnosis supported by qualitative analysis. Despite modest absolute nDCG (expected under noisy labels), the general OpenAI embedder decisively outperforms the domain pre-trained BERT in both settings at @10/@20/@100 across both thresholds; the differences are statistically significant. Diagnostics attribute the low absolute scores to label drift and strong ideals rather than lack of utility. Additionally, our framework is robust enough to be used for evaluation under a noisy gold dataset, which is typical when handling data with heterogeneous labels stemming from legacy judicial databases.
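The evaluation steps named in the description can be sketched roughly as follows. This is an illustrative Python sketch, not the authors' implementation: the description does not give the exact overlap formula, so an IDF-weighted Jaccard overlap is assumed here, and the smoothed IDF, the one-sided form of the paired bootstrap, and all function names are hypothetical choices. Only the two thresholds (0.20 and 0.28) and the cutoffs @10/@20/@100 come from the description.

```python
import math
import random
from collections import Counter

# Thresholds quoted in the description; everything else below is assumed.
BALANCED_THRESHOLD = 0.20
STRICT_THRESHOLD = 0.28


def idf_weights(keyword_sets):
    """Smoothed IDF per keyword over all decisions (assumed smoothing)."""
    n = len(keyword_sets)
    df = Counter(kw for kws in keyword_sets for kw in set(kws))
    return {kw: math.log((1 + n) / (1 + c)) + 1.0 for kw, c in df.items()}


def graded_relevance(query_kws, doc_kws, idf):
    """IDF-weighted Jaccard overlap in [0, 1] (one plausible definition)."""
    q, d = set(query_kws), set(doc_kws)
    union = sum(idf.get(k, 0.0) for k in q | d)
    if union == 0:
        return 0.0
    return sum(idf.get(k, 0.0) for k in q & d) / union


def binarize(score, threshold):
    """Turn a graded relevance score into a 0/1 label."""
    return 1 if score >= threshold else 0


def ndcg_at_k(ranked_gains, all_gains, k):
    """nDCG@k: DCG of the system ranking divided by DCG of the ideal ranking."""
    def dcg(gains):
        return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

    ideal = dcg(sorted(all_gains, reverse=True))
    return dcg(ranked_gains) / ideal if ideal > 0 else 0.0


def paired_bootstrap(scores_a, scores_b, n_resamples=2000, seed=0):
    """Share of resamples in which system A is not better than system B.

    scores_a and scores_b are per-query metrics (e.g. nDCG@10) for the
    same queries in the same order; queries are resampled with replacement.
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    not_better = 0
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        if sum(sample) / n <= 0:
            not_better += 1
    return not_better / n_resamples
```

In use, one would compute `graded_relevance` between a query decision and each candidate, rank the candidates by embedding similarity for each model, score each ranking with `ndcg_at_k` at 10, 20, and 100 (on graded gains or on labels binarized at either threshold), and pass the two per-query score lists to `paired_bootstrap`. A small returned share would suggest the difference between the models is unlikely to be a resampling artifact.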