Modeling text editions as linked data (edition2LD)
Term: January 1, 2023–December 31, 2023
Editions as Linked Data
Scholarly text editions are difficult to reuse and to analyze across disciplines. The reasons lie both in how the content is compiled and in how the collected data is analyzed and preserved. The challenges include:
- different temporal and geographical focuses of the research topics
- various languages and language levels
- different writing systems
- presentation in various combinations of original text, translation, facsimiles, etc.
- digital data held in various system architectures and data models
- data that may change during the publication period (“hot data”); only long-term archiving after the project is completed (“cold data”) guarantees that the data remains unchanged
Making research data available over the long term is therefore an essential part of scientific research.
The project
The "edition2LD" project is working on a data curation solution that makes heterogeneous research data accessible over the long term and across the boundaries listed above. This solution must be flexible enough to accommodate the heterogeneity of the data while also being robust enough to ensure long-term viability. To achieve this, the edition2LD project follows the Linked Data (LD) paradigm and integrates the data into the Semantic Web.
Case Study as a "Best Practice"
To develop the workflow for Linked Data (LD) modeling, the project has chosen as its use case the editions produced by the HAdW Research Center for “Religious and Legal Historical Sources of Pre-modern Nepal” (Nepal-FS). The research center produces digital editions of Nepali texts (in Devanagari script), including facsimiles, English translations (in Roman script), commentaries, and index entries (people, places, technical terms).
Articles are published on nepalica.hadw.de and, with a DOI assigned, on the Heidelberg University Library website (see, for example, hadw).
Approach
The workflow for LD modeling should be generic enough to be transferable to other projects in the future. At the same time, when triggered repeatedly, it should convert data to RDF in batches, thereby addressing the major challenge posed by dynamic “hot data.” When developing the automated mapping processes, it is therefore essential to minimize manual post-processing, ideally so that it needs to be performed only once.
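The requirement that repeated runs over changing “hot data” should not trigger renewed manual work suggests a conversion step that is deterministic: the same input always yields the same RDF. The following Python sketch illustrates that idea only; it is not the project's actual pipeline. The namespace https://example.org/edition2ld/entity/, the record fields id, label, and lang, and all function names are assumptions made for illustration.

```python
import hashlib

# Hypothetical base namespace; the project's real IRIs are not given in the text.
BASE = "https://example.org/edition2ld/entity/"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def escape_literal(text):
    """Escape characters that may not appear raw inside an N-Triples literal."""
    return (text.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
                .replace("\r", "\\r"))


def record_to_triples(record):
    """Map one source record (assumed shape: id, label, lang) to N-Triples lines."""
    subject = f"<{BASE}{record['id']}>"
    label = escape_literal(record["label"])
    return [f'{subject} <{RDFS_LABEL}> "{label}"@{record["lang"]} .']


def convert_batch(records):
    """Convert a batch; the sorted set makes the output independent of input order."""
    triples = set()
    for record in records:
        triples.update(record_to_triples(record))
    return sorted(triples)


def fingerprint(triples):
    """Hash the serialized batch so that a rerun can detect whether anything changed."""
    return hashlib.sha256("\n".join(triples).encode("utf-8")).hexdigest()


# Invented sample records, used only to exercise the functions.
batch = [
    {"id": "place-1", "label": "Kathmandu", "lang": "en"},
    {"id": "person-1", "label": "Example Person", "lang": "en"},
]
first_run = convert_batch(batch)
second_run = convert_batch(list(reversed(batch)))
```

Because both runs produce identical output, comparing fingerprints between runs is one simple way to decide whether a batch has to be republished after the source data has changed.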
The project focuses on modeling the information units “text,” “English translation,” named entities (names of people and places), and technical terms, as well as metadata. For modeling named entities, the project can draw on entries from the Nepal-FS register, some of which already include links to authority files and encyclopedic resources.
The vocabularies, ontologies, and repositories used for modeling are those already established as standards: RDFS, SKOS, Gemeinsame Normdatei (GND), VIAF, DBpedia, GeoNames, FOAF, etc. In addition, the LD modeling incorporates links to instances of two ontologies for historical personal and place names in Nepal (NepalPeople and NepalPlaces, see Tittel 2022*), which are developed based on research data from the Nepal-FS.
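To make the role of these vocabularies concrete, the following Python sketch serializes one place name as Turtle, using RDFS and SKOS terms and a link to GeoNames. It is an illustration under assumptions, not the project's data model: the ex: namespace, the typing of the entry as skos:Concept, the use of skos:exactMatch for the external link, and the function names are all hypothetical, and the GeoNames identifier shown is illustrative and should be verified before reuse.

```python
PREFIXES = {
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    # Hypothetical namespace for the project's own place entries.
    "ex": "https://example.org/edition2ld/place/",
}


def turtle_literal(text, lang):
    """Build a language-tagged Turtle literal, escaping backslashes and quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"@{lang}'


def describe_place(local_id, labels, external_links):
    """Return a small Turtle document describing one place.

    labels maps a language tag to a label; external_links is a list of IRIs
    in external resources (for example GeoNames) taken to denote the same place.
    """
    lines = [f"@prefix {prefix}: <{iri}> ." for prefix, iri in PREFIXES.items()]
    lines.append("")
    # Assumed modeling choice: the index entry is typed as a SKOS concept.
    statements = ["a skos:Concept"]
    for lang, text in labels.items():
        statements.append(f"rdfs:label {turtle_literal(text, lang)}")
    for iri in external_links:
        statements.append(f"skos:exactMatch <{iri}>")
    lines.append(f"ex:{local_id} " + " ;\n    ".join(statements) + " .")
    return "\n".join(lines) + "\n"


turtle_doc = describe_place(
    "kathmandu",
    {"en": "Kathmandu", "ne": "काठमाडौं"},
    ["https://sws.geonames.org/1283240/"],  # illustrative identifier, to be verified
)
print(turtle_doc)
```

Labels in Devanagari and in Roman script sit side by side as language-tagged literals, which is one way a single resource can carry several writing systems.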
The data sources are:
- the Nepal-FS database
- files with additional information
- information that is integrated into the pipeline via web crawling from registry and glossary entries
- the NepalPeople and NepalPlaces ontologies: Python scripts synchronize the modeled data with the entries in NepalPeople and NepalPlaces and, where applicable, integrate links to their instances
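The synchronization step named in the last item can be pictured as a label lookup against the ontology instances. The Python sketch below shows one possible approach under stated assumptions; the project's actual scripts are not described in the text. The matching rule (comparison of labels after removing diacritics and case), the IRIs, the sample names, and the function names are all hypothetical.

```python
import unicodedata


def normalize(label):
    """Reduce a label to a comparison key: no diacritics, no case, single spaces."""
    decomposed = unicodedata.normalize("NFKD", label)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(stripped.casefold().split())


def build_index(instances):
    """Map each normalized label to the set of ontology IRIs that carry it."""
    index = {}
    for iri, labels in instances.items():
        for label in labels:
            index.setdefault(normalize(label), set()).add(iri)
    return index


def link_entities(entities, instances):
    """Link modeled entities to ontology instances when the match is unambiguous.

    Entities with no match or with several candidates are returned separately
    so that they can be reviewed by hand instead of being linked automatically.
    """
    index = build_index(instances)
    links, unresolved = {}, []
    for entity_id, label in entities.items():
        candidates = index.get(normalize(label), set())
        if len(candidates) == 1:
            links[entity_id] = next(iter(candidates))
        else:
            unresolved.append(entity_id)
    return links, unresolved


# Hypothetical ontology instances (IRI -> known labels) and modeled entities.
ontology = {
    "https://example.org/nepalpeople/p1": ["Jaṅga Bahādura Rāṇā"],
    "https://example.org/nepalpeople/p2": ["Example Person"],
    "https://example.org/nepalpeople/p3": ["Example Person"],
}
modeled = {
    "e1": "Janga Bahadura Rana",
    "e2": "Example Person",
    "e3": "Unknown Name",
}
links, unresolved = link_entities(modeled, ontology)
```

Keeping ambiguous and unmatched entries out of the automatic links limits manual work to the cases that genuinely need a decision.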
As of September 2023, the modeling of named entities and terms is 90% complete; the modeling of the “English translation,” the “Nepali edition,” and the metadata is in progress.
The project “Language-Data-Based Modeling of Knowledge Networks in Medieval Romania – ALMA,” launched on August 1, 2022, as an inter-academy project of HAdW, BAdW, and AdW Mainz under the Academies Program, follows an approach that is consistently based on Linked Data. Since ALMA also produces text editions (in this case of medieval legal and medical texts), its data is likewise well suited to the edition2LD approach.
Publications
Svoboda-Baas, Dieta/Tittel, Sabine: Text+Plus, #04: Modeling Text Editions as Linked Data (edition2LD), in: Text+ Blog, Dec. 18, 2023, https://textplus.hypotheses.org/8723.
*Tittel, Sabine: Towards an Ontology for Toponyms in Nepalese Historical Documents, in: Proceedings of the Workshop on Resources and Technologies for Indigenous, Endangered, and Lesser-Resourced Languages in Eurasia within the 13th Language Resources and Evaluation Conference, Marseille, June 2022 (European Language Resources Association, ELRA), pp. 7–16.

