Corpora of the Tundra Nenets language
During the project, we gathered both published and unpublished sources in the Tundra Nenets and Forest Nenets languages. These materials were then digitised for further analysis. For a detailed overview of the digitisation process, please refer to Mus & Metzger (2021a) and Mus & Metzger (2021b).
The project has also produced several corpora, which are currently unpublished:
- A Tundra Nenets monolingual corpus (OCR-ed and unified),
- A Forest Nenets monolingual corpus (OCR-ed and unified),
- A Tundra Nenets – Russian – English parallel corpus (sentence-level aligned),
- A Forest Nenets – Russian – English parallel corpus (sentence-level aligned).
To describe our resources, we collected metadata associated with these sources. The metadata categories were defined according to established standards, including the IMDI (ISLE Meta Data Initiative), the CLARIN Metadata Standard (Common Language Resources and Technology Infrastructure), the FIMS (Fieldwork Information Management System), and the MARC (Machine-Readable Cataloging) standards. The metadata were organised into a catalog in which each category occupies its own column, which keeps the records easy to read, sort, and analyse. Data formats were kept uniform across all fields, for example by using a single date format (YYYY-MM-DD) and a fixed set of values for each categorical field, so that the catalog can be processed automatically. The catalog includes the following information:
- Language Information
- Corpus Information
- Data Context
- Speaker(s) Information
- Data Information
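The uniform formats mentioned above (a single date format and standardised categorical values) can be illustrated with a minimal Python sketch. It is not the project's actual tooling: the field names `date` and `language`, the accepted input date formats, and the list of language labels are all assumptions made for this example.

```python
from datetime import datetime

# Hypothetical input formats and labels, chosen for illustration only.
DATE_FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%d/%m/%Y")
LANGUAGE_LABELS = {
    "tundra nenets": "Tundra Nenets",
    "forest nenets": "Forest Nenets",
}


def normalise_date(raw: str) -> str:
    """Return the date as YYYY-MM-DD, trying each assumed input format in turn."""
    text = raw.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"unrecognised date: {raw!r}")


def normalise_language(raw: str) -> str:
    """Map a free-text language name onto one fixed label."""
    key = " ".join(raw.lower().split())  # lower-case and collapse whitespace
    if key not in LANGUAGE_LABELS:
        raise ValueError(f"unknown language label: {raw!r}")
    return LANGUAGE_LABELS[key]


def normalise_record(record: dict) -> dict:
    """Return a copy of one catalog row with its date and language normalised."""
    cleaned = dict(record)
    cleaned["date"] = normalise_date(record["date"])
    cleaned["language"] = normalise_language(record["language"])
    return cleaned
```

Under these assumptions, a row such as `{"date": "05.03.1998", "language": "tundra  NENETS"}` would be stored with the date `1998-03-05` and the label `Tundra Nenets`. Values that match no known format raise an error instead of being guessed, so inconsistent entries surface during cataloguing.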