Corpora of the Tundra Nenets language
During the project, we gathered both published and unpublished sources in the Tundra Nenets and Forest Nenets languages. These materials were then digitised for further analysis. For a detailed overview of the digitisation process, please refer to Mus & Metzger (2021a) and Mus & Metzger (2021b).
The project also encompasses several corpora, most of which are currently unpublished. These corpora include the following:
- Tundra Nenets Monolingual Corpus
  Compiled by Nikolett Mus & Réka Metzger
  Digitally processed; includes metadata and searchable text (approx. 500,000 tokens).
- Forest Nenets Monolingual Corpus
  Compiled by Nikolett Mus, Péter Csényi & Szilárd Tóth
  OCR-processed and normalised; metadata catalogued (approx. 28,800 tokens).
- Tundra Nenets – Russian – English Parallel Corpus
  Sentence-level aligned (currently 2,155 and 1,107 aligned entries).
- Forest Nenets – Russian – English Parallel Corpus
  Sentence-level alignment in progress.
- Nenets UD treebank
  Based on spoken language data collected in 2017, we began developing a UD treebank for Tundra Nenets. The first version was released in 2025 as part of the UD v2.16 release.
To describe our resources, we collected the metadata associated with these sources. The metadata categories were defined according to established standards, including IMDI (ISLE Meta Data Initiative), the CLARIN (Common Language Resources and Technology Infrastructure) metadata standard, FIMS (Fieldwork Information Management System), and MARC (Machine-Readable Cataloging). The metadata were organised into a catalogue with one column per category. Data formats were kept consistent across all fields, for example by using a uniform date format (YYYY-MM-DD) and standardising categorical values, which streamlines processing and analysis. The catalogue includes the following information:
- Language Information
- Corpus Information
- Data Context
- Speaker(s) Information
- Data Information
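The consistency rules described above (a uniform YYYY-MM-DD date format and standardised categorical values) could be checked automatically. The following is only a minimal sketch in Python, not the project's actual tooling: the field names `date` and `language`, the helper functions, and the set of allowed values are all assumptions made for illustration.

```python
import re
from datetime import date

# Hypothetical controlled vocabulary; the real catalogue's values are not
# specified in the text above.
ALLOWED_LANGUAGES = {"Tundra Nenets", "Forest Nenets"}

# Strict YYYY-MM-DD shape: four digits, two digits, two digits.
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def normalise_category(value: str) -> str:
    """Collapse runs of whitespace and apply title case to a categorical value."""
    return " ".join(value.split()).title()


def validate_record(record: dict) -> list:
    """Return a list of problems found in one catalogue row (empty if none)."""
    problems = []

    raw_date = record.get("date", "")
    if not DATE_PATTERN.fullmatch(raw_date):
        problems.append(f"date not in YYYY-MM-DD format: {raw_date!r}")
    else:
        try:
            # Rejects well-shaped but impossible dates such as 2017-02-30.
            date.fromisoformat(raw_date)
        except ValueError:
            problems.append(f"date is not a valid calendar date: {raw_date!r}")

    language = normalise_category(record.get("language", ""))
    if language not in ALLOWED_LANGUAGES:
        problems.append(f"unknown language value: {language!r}")

    return problems
```

Calling `validate_record` on each row of the catalogue and collecting the non-empty results would give a simple report of entries that need manual correction. Normalising categorical values before comparing them means that spacing or capitalisation variants of the same label are not reported as errors.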