Network information extraction from medieval trial records combining LLM-based coreference resolution with string matching in pre-existing lists of persons
| Authors | |
|---|---|
| Year of publication | 2025 |
| Type | Appeared in Conference without Proceedings |
| MU Faculty or unit | |
| Citation | |
| Description | This study presents a method for extracting person-to-person network data from medieval trial records by combining Large Language Model (LLM)-based coreference resolution with string matching against pre-existing person lists. Focusing on a corpus of depositions from 14th-century Bologna, we evaluated the performance of a multi-stage pipeline under four conditions differing in the availability and specificity of external person data. Using GPT-4o for clause classification and entity extraction, followed by string normalization and ID matching, we assessed the pipeline’s precision, recall, and ability to replicate a ground-truth incrimination network. While basic LLM extraction without tailored data yielded low performance, enriching the pipeline with document-specific name lists and trial role metadata (Conditions C3 and C4) significantly improved network reconstruction, achieving F1 scores up to 0.77 and high correlation with ground-truth centrality rankings. These results demonstrate that combining LLMs with structured pre-existing data can produce network datasets suitable for historical analysis, while also highlighting the limitations of LLM-based extraction in the absence of contextual person identifiers. |
| Related projects | |
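
As a rough illustration of the string normalization, ID matching, and precision/recall/F1 evaluation steps named in the description, the following Python sketch shows one way such steps could look. It is not the authors' implementation: the LLM stage is omitted, the exact-match strategy is an assumption, and all function names, person IDs, and names (`normalize`, `match_ids`, `edge_scores`, `P1`, `Petrus de Bononia`, and so on) are hypothetical.

```python
import unicodedata


def normalize(name: str) -> str:
    """Lowercase a name, strip diacritics, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(without_marks.lower().split())


def match_ids(mentions, person_list):
    """Map each extracted mention to a person ID by exact match on the
    normalized form; mentions absent from the list map to None.

    person_list is a hypothetical {person_id: name} dictionary standing in
    for a pre-existing list of persons.
    """
    index = {normalize(name): pid for pid, name in person_list.items()}
    return [index.get(normalize(mention)) for mention in mentions]


def edge_scores(predicted, truth):
    """Precision, recall, and F1 of predicted edges against ground-truth edges.

    Both arguments are sets of (source_id, target_id) tuples.
    """
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Hypothetical data, for illustration only.
persons = {"P1": "Petrus de Bononia", "P2": "Iohannes"}
mentions = ["  Pétrus   DE Bononia ", "iohannes", "Unknown Person"]
ids = match_ids(mentions, persons)

predicted_edges = {("P1", "P2"), ("P1", "P3")}
truth_edges = {("P1", "P2"), ("P2", "P3")}
scores = edge_scores(predicted_edges, truth_edges)
```

Exact matching on normalized strings is the simplest possible choice here; a real pipeline for medieval names would likely need to handle spelling variants and Latin inflection, which this sketch does not attempt.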