Modeling Text Editions as Linked Data (edition2LD)

Term: January 1, 2023–December 31, 2023

Editions as Linked Data

Scholarly text editions are difficult to reuse and to analyze in secondary, cross-disciplinary research. The obstacles arise both in how the content is compiled and in how the collected data is analyzed and preserved. The challenges include:

Different temporal and geographical focuses of the research topics
Various languages and language levels
Different writing systems
Presentation in various combinations of original text, translation, facsimiles, etc.
Digital data held in various system architectures and data models
Data that may change during the publication period (“hot data”); only long-term archiving after the project is completed (“cold data”) guarantees that the data remains unchanged

Making research data available over the long term is therefore an essential part of scientific research.

The project

The "edition2LD" project is working on a data curation solution that makes heterogeneous research data accessible over the long term and across the boundaries listed above. This solution must be flexible enough to accommodate the heterogeneity of the data while also being robust enough to ensure long-term viability. To achieve this, the edition2LD project follows the Linked Data (LD) paradigm and integrates the data into the Semantic Web.

Case Study as a "Best Practice"

To develop the workflow for Linked Data (LD) modeling, the project has chosen as its use case the editions produced by the HAdW Research Center for “Religious and Legal Historical Sources of Pre-modern Nepal” (Nepal-FS). The research center produces digital editions of Nepali texts (in Devanagari script), including facsimiles, English translations (in Roman script), commentaries, and index entries (people, places, technical terms).

Articles are published on nepalica.hadw.de and, with a DOI assigned, on the Heidelberg University Library website (see, for example, hadw).

Digital edition of “Sources on the History of Religion and Law in Pre-Modern Nepal,” nepalica.hadw.de.

Approach

The workflow for LD modeling should be generic enough to be transferable to other projects in the future. At the same time, it should be able to convert data to RDF in batches each time it is triggered, thereby addressing the major challenge posed by dynamic “hot data.” When developing the automated mapping processes, it is therefore essential to minimize manual post-processing, ideally to the point where it only needs to be performed once.
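The requirement that the conversion can be triggered repeatedly can be illustrated with a minimal sketch. It is not the project's code: the base URI, the record fields (`id`, `label`), and all function names are assumptions made for illustration. The idea shown is that subject URIs are derived deterministically from record identifiers and the output is sorted, so re-running the batch on unchanged data yields byte-identical N-Triples, and a changed fingerprint signals that the “hot data” has actually changed.

```python
import hashlib

# Hypothetical base URI; the project's real namespace is not given in the text.
BASE = "https://example.org/edition2ld/"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def escape_literal(text):
    """Escape a string for use inside an N-Triples literal."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def record_to_triples(record):
    """Map one record (assumed fields: 'id', 'label') to N-Triples lines."""
    subject = f"<{BASE}entity/{record['id']}>"
    label = escape_literal(record["label"])
    return [f'{subject} <{RDFS_LABEL}> "{label}" .']


def convert_batch(records):
    """Convert a batch; the result does not depend on input order or duplicates."""
    triples = set()
    for record in records:
        triples.update(record_to_triples(record))
    return "\n".join(sorted(triples)) + "\n"


def fingerprint(ntriples):
    """Hash of the serialized batch, usable to detect whether anything changed."""
    return hashlib.sha256(ntriples.encode("utf-8")).hexdigest()


# Illustrative records only; the identifiers are invented.
records = [
    {"id": "p1", "label": "Pṛthvīnārāyaṇa Śāha"},
    {"id": "pl1", "label": "Gorkha"},
]
first_run = convert_batch(records)
second_run = convert_batch(list(reversed(records)))
unchanged = fingerprint(first_run) == fingerprint(second_run)
```

Because the second run produces the same serialization as the first, any manual post-processing tied to a given URI would not need to be repeated when the batch is regenerated.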

The project focuses on modeling the information units “text,” “English translation,” named entities (names of people and places), and technical terms, as well as metadata. For modeling named entities, the project can draw on entries from the Nepal-FS register, some of which already include links to standard data repositories and encyclopedic resources.

Entry in the register for Pṛthvīnārāyaṇa Śāha, a king of Gorkha and later Nepal, on nepalica.hadw.de

The vocabularies, ontologies, and repositories used for modeling are those already established as standards: RDFS, SKOS, Gemeinsame Normdatei (GND), VIAF, DBpedia, GeoNames, FOAF, etc. In addition, the LD modeling incorporates links to instances of two ontologies for historical personal and place names in Nepal (NepalPeople and NepalPlaces, see Tittel 2022*), which are developed based on research data from the Nepal-FS.
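How these vocabularies might combine for a single register entry can be sketched as follows. This is an illustration under stated assumptions, not the project's data model: the `ex:` namespace, the local identifier, the choice of `rdfs:seeAlso` as the linking property, and the placeholder authority identifiers (`EXAMPLE-ID`) are all hypothetical. Only the RDFS, SKOS, and FOAF namespace URIs are the published ones.

```python
# Namespace URIs for RDFS, SKOS, and FOAF are the published ones;
# "ex" is a hypothetical project namespace used only for this sketch.
PREFIXES = {
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "ex": "https://example.org/edition2ld/",
}


def turtle_literal(text):
    """Quote a string as a Turtle literal, escaping backslashes and quotes."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def person_to_turtle(local_id, name, note, see_also):
    """Serialize one person entry as a small Turtle document."""
    lines = [f"@prefix {prefix}: <{uri}> ." for prefix, uri in PREFIXES.items()]
    lines.append("")
    statements = [
        "a foaf:Person",
        f"foaf:name {turtle_literal(name)}",
        f"rdfs:label {turtle_literal(name)}",
        f"skos:note {turtle_literal(note)}",
    ]
    # Linking via rdfs:seeAlso is an assumption; the text does not say
    # which property the project uses for authority-file links.
    statements += [f"rdfs:seeAlso <{uri}>" for uri in see_also]
    lines.append(f"ex:{local_id} " + " ;\n    ".join(statements) + " .")
    return "\n".join(lines) + "\n"


turtle = person_to_turtle(
    "person_1",  # hypothetical local identifier
    "Pṛthvīnārāyaṇa Śāha",
    "A king of Gorkha and later Nepal",
    [
        "https://d-nb.info/gnd/EXAMPLE-ID",  # placeholder, not a real GND ID
        "http://viaf.org/viaf/EXAMPLE-ID",  # placeholder, not a real VIAF ID
    ],
)
print(turtle)
```

A production pipeline would more likely build such statements with an RDF library rather than string formatting; plain strings are used here only to keep the sketch dependency-free.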

The data sources are:

The Nepal-FS database
Files with additional information
Information that is integrated into the pipeline via web crawling from registry and glossary entries
The NepalPeople and NepalPlaces ontologies: Python scripts synchronize the modeled data with the entries in NepalPeople and NepalPlaces and, where applicable, integrate links to their instances.
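The text does not describe how the synchronization with NepalPeople and NepalPlaces works. As one possible approach, the sketch below matches register entries to ontology instances by comparing names after stripping diacritics and case. The matching rule, the instance URIs, the register identifiers, and all function names are assumptions for illustration only.

```python
import unicodedata


def normalize(name):
    """Reduce a name to a comparison key: no diacritics, case-folded, single spaces."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(stripped.casefold().split())


def build_index(instances):
    """Map normalized labels to instance URIs (input: URI -> label)."""
    return {normalize(label): uri for uri, label in instances.items()}


def link_entries(entries, index):
    """Return (links, unmatched) for register entries (input: entry id -> name)."""
    links = {}
    unmatched = []
    for entry_id, name in entries.items():
        uri = index.get(normalize(name))
        if uri is None:
            unmatched.append(entry_id)  # left for manual review
        else:
            links[entry_id] = uri
    return links, unmatched


# Hypothetical ontology instances and register entries.
ontology_instances = {
    "https://example.org/nepalpeople/person_1": "Prthvinarayana Saha",
}
register_entries = {
    "reg-1": "Pṛthvīnārāyaṇa Śāha",
    "reg-2": "Unknown Person",
}

index = build_index(ontology_instances)
links, unmatched = link_entries(register_entries, index)
```

Exact matching on normalized names keeps false links unlikely, at the cost of leaving variant spellings in the unmatched list, which fits the stated goal of confining manual work to a single pass.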

Data processing pipeline using Python scripts (current status: terminology.py; NamedEntities.py) that generate RDF data from input from various sources

As of September 2023, the modeling of named entities and terms is 90% complete; the modeling of the "English translation," "Nepali edition," and metadata is in progress.

The project “Language-Data-Based Modeling of Knowledge Networks in Medieval Romania – ALMA,” an inter-academic collaboration between the HAdW, BAdW, and AdW Mainz that started on August 1, 2022, as part of the Academies' Programme, follows an approach based entirely on Linked Data. Since ALMA produces text editions (in this case, of medieval legal and medical texts), its data is also well suited to an edition2LD approach.

Publications

Svoboda-Baas, Dieta/Tittel, Sabine: Text+Plus, #04: Modeling Text Editions as Linked Data (edition2LD), in: Text+ Blog, Dec. 18, 2023, https://textplus.hypotheses.org/8723.

*Tittel, Sabine: Towards an Ontology for Toponyms in Nepalese Historical Documents, in: Proceedings of the Workshop on Resources and Technologies for Indigenous, Endangered, and Lesser-Resourced Languages in Eurasia within the 13th Language Resources and Evaluation Conference, Marseille, June 2022 (European Language Resources Association - ELRA), pp. 7–16.

Project Management

Dr. Dieta Svoboda-Baas
Assistant Professor Dr. Sabine Tittel

Employees

Bishwo Bijaya Shah, B.A.
Congcong Xu, B.A.

Knowledge Networks in Medieval Romania (ALMA)
Sources on the History of Religion and Law in Pre-Modern Nepal
