Spoken Data Annotation Manual: Tundra Nenets spoken data

Author: Nikolett Mus

Date: 28.05.2025


1. Introduction

This document outlines the conventions for annotating features specific to spoken language in Tundra Nenets, including prosodic tags, discourse phenomena, and structural relations. It reflects current annotation practice and serves as a working reference.

2. Spoken Phenomena: Tags and Relational Encoding

  1. Noises

    ID: lexical unit on its own
    FORM: tag
    LEMMA: tag
    UPOS: X
    DEPREL: noise, attached to the element it follows
    Gloss: NOISE

    Illustration of Noise
  2. Speech Interruptions

    Interruptions in the speaker's normal flow, including errors, hesitations, restarts.

    2.1 Disfluency: Audible disfluency
      • <a_d> = Audible disfluency (e.g., "uh", "erm")

      ID: lexical unit on its own
      FORM: tag
      LEMMA: tag
      UPOS: INTJ
      DEPREL: discourse, attached to the following element
      Gloss: DISFL

      Illustration of Audible Disfluency
    2.2 Disfluency: Articulatory lengthening
      • <a> = Articulatory lengthening (e.g., prolonged sound)

      ID: attached to lexeme
      FORM: The tag <a> should be attached to the relevant lexeme without a space.
      LEMMA: not relevant
      UPOS: not relevant
      DEPREL: not relevant
      Gloss: not relevant

      Illustration of Articulatory Lengthening
    2.3 Cut-offs: Unfinished lexeme
      • <u_l> = Unfinished lexeme

      ID: attached to lexeme
      FORM: The tag <u_l> should be attached to the relevant lexeme without a space.
      LEMMA: Full form of expected word
      UPOS: not relevant
      DEPREL: not relevant
      Gloss: not relevant

      Illustration of Cut-off
    2.4 False Start
      • <f_s> = False start

      ID: attached to lexeme
      FORM: The tag <f_s> should be attached to the lexeme without a space.
      LEMMA: underlying lemma (base word) without the tag
      UPOS: POS of the base word
      DEPREL: reparandum attached to the node that replaces or repairs it
      Gloss: Gloss of base/intended word

      Illustration of False Start
    2.5 Exact Repetition
      • <e_r> = Exact repetition

      ID: attached to lexeme
      FORM: The tag <e_r> should be attached to the lexeme without a space.
      LEMMA: underlying lemma (base word) without the tag
      UPOS: POS of the base word
      DEPREL: reparandum attached to the next instance of the word (the "actual" use, which serves as the repair).
      Gloss: Gloss of base/intended word

      Illustration of Repetition
    2.6 Partial Repetition
      • <p_r> = Partial repetition

      ID: attached to lexeme
      FORM: The tag <p_r> should be attached to the lexeme without a space.
      LEMMA: underlying lemma (base word) without the tag
      UPOS: POS of the base word
      DEPREL: reparandum attached to the next instance of the word (the "actual" use, which serves as the repair).
      Gloss: Gloss of base/intended word

      Illustration of Repetition
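
The attachment rule for exact repetitions can be sketched in Python as follows. This is a minimal illustration, not part of the annotation guidelines: the function name `link_exact_repetitions`, the 1-based token IDs, and the placeholder forms are assumptions. The tag is stripped wherever it occurs in the form, since the FORM rule only says it is attached without a space.

```python
EXACT_REPETITION = "<e_r>"


def link_exact_repetitions(forms):
    """Map each <e_r>-tagged token to the next token with the same bare form.

    `forms` is a list of FORM strings; token IDs are 1-based positions
    (an assumption of this sketch). Returns {token_id: head_id}; every
    entry would carry the DEPREL `reparandum`.
    """
    heads = {}
    for i, form in enumerate(forms):
        if EXACT_REPETITION not in form:
            continue
        bare = form.replace(EXACT_REPETITION, "")
        # Look rightwards for the next instance of the same word.
        for j in range(i + 1, len(forms)):
            if forms[j].replace(EXACT_REPETITION, "") == bare:
                heads[i + 1] = j + 1
                break
    return heads


# Placeholder forms, not real Tundra Nenets data.
print(link_exact_repetitions(["foo<e_r>", "foo", "bar"]))
```

A tagged token with no later match is left out of the result, so such cases can be reviewed manually.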
  3. Connected Speech: Merged lexemes

    ID: attached to lexeme
    FORM: The lexemes must be separated and the tag <m_l> should precede the subsequent lexeme without a space.
    LEMMA: underlying lemma (base word) without the tag
    UPOS: POS of the base word
    DEPREL: As per syntax
    Gloss: Gloss of lemma

    Illustration of Connected Speech
  4. Lexical Error

    <u_w> = Unusual or incorrect word (word choice error)

    ID: attached to lexeme
    FORM: The tag <u_w> should be attached to the word without a space.
    LEMMA: Lemma of incorrect word
    UPOS: POS of the word as it appears
    DEPREL: As if correct
    Gloss: Literal gloss of the incorrect word

    Illustration of Lexical Error
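
Because the attached tags (<a>, <u_l>, <f_s>, <e_r>, <p_r>, <m_l>, <u_w>) are written directly on the lexeme, recovering the bare word form is a simple string operation. The Python sketch below shows one possible way to do it; the regular expression and the function name `split_attached_tags` are assumptions made for illustration, and the tag is accepted at either edge of the form.

```python
import re

# Attached tags listed in this manual. The regex is an assumption of
# this sketch, not a prescribed part of the annotation scheme.
ATTACHED_TAGS = ("a", "u_l", "f_s", "e_r", "p_r", "m_l", "u_w")
TAG_RE = re.compile(r"<(" + "|".join(re.escape(t) for t in ATTACHED_TAGS) + r")>")


def split_attached_tags(form):
    """Return (bare_form, tags) for a FORM value such as 'word<f_s>'."""
    tags = TAG_RE.findall(form)   # tag names without angle brackets
    bare = TAG_RE.sub("", form)   # the form with all attached tags removed
    return bare, tags


# Placeholder forms, not real Tundra Nenets data.
print(split_attached_tags("word<f_s>"))
print(split_attached_tags("<m_l>word"))
```

Standalone tags such as <a_d> are deliberately not matched, since they are lexical units of their own rather than attachments.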
  5. Pauses coinciding with a clear syntactic boundary

    In Tundra Nenets, this type of pause is annotated as follows:

    ID: lexical unit on its own
    FORM: Tag
    LEMMA: Tag
    UPOS: PUNCT
    DEPREL: punct attached to the head of the clause/sentence it follows. Note that the last punct is always dependent on the root.
    Gloss: SIL

    Illustration of Pauses
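
The requirement that the last punct depends on the root lends itself to an automatic check. The following Python sketch assumes a hypothetical token representation (dictionaries with `id`, `head`, and `deprel` keys, where the root has `head` 0); none of these names come from the manual.

```python
def last_punct_attached_to_root(tokens):
    """Check that the final `punct` token of a sentence depends on the root.

    `tokens` is a list of dicts with the keys 'id', 'head' and 'deprel'
    (a representation assumed for this sketch). The root is the token
    whose head is 0.
    """
    root_ids = [t["id"] for t in tokens if t["head"] == 0]
    if len(root_ids) != 1:
        return False  # the sketch expects exactly one root per sentence
    puncts = [t for t in tokens if t["deprel"] == "punct"]
    if not puncts:
        return True  # nothing to check
    return puncts[-1]["head"] == root_ids[0]


# Placeholder sentence: token 2 is the root, token 3 is the final pause.
sentence = [
    {"id": 1, "head": 2, "deprel": "nsubj"},
    {"id": 2, "head": 0, "deprel": "root"},
    {"id": 3, "head": 2, "deprel": "punct"},
]
print(last_punct_attached_to_root(sentence))
```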
  6. Pauses: disfluency-related

    In Tundra Nenets, this type of pause is annotated as follows:

    ID: lexical unit on its own
    FORM: Tag
    LEMMA: Tag
    UPOS: X
    DEPREL: discourse attached to the constituent it follows.
    Gloss: PAUSE

    Illustration of Pauses
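
The four tag types that form lexical units of their own (noises, audible disfluencies, and the two kinds of pauses) share the same pattern: FORM and LEMMA are the tag itself, while UPOS, DEPREL, and gloss are fixed per type. A lookup table makes this explicit. The Python sketch below is only an illustration: the dictionary keys, the class name `StandaloneTag`, and the function `annotate_standalone` are hypothetical labels, and only the field values are taken from the conventions above.

```python
from typing import NamedTuple


class StandaloneTag(NamedTuple):
    upos: str
    deprel: str
    gloss: str
    attaches_to: str


# Keys are hypothetical labels chosen for this sketch; the field values
# follow the conventions listed in this manual.
STANDALONE_TAGS = {
    "noise": StandaloneTag("X", "noise", "NOISE", "element it follows"),
    "audible_disfluency": StandaloneTag("INTJ", "discourse", "DISFL", "following element"),
    "syntactic_pause": StandaloneTag("PUNCT", "punct", "SIL", "head of the clause/sentence it follows"),
    "disfluency_pause": StandaloneTag("X", "discourse", "PAUSE", "constituent it follows"),
}


def annotate_standalone(kind, tag):
    """Build the fixed annotation fields for a standalone tag token."""
    spec = STANDALONE_TAGS[kind]
    return {
        "FORM": tag,    # FORM and LEMMA are both the tag itself
        "LEMMA": tag,
        "UPOS": spec.upos,
        "DEPREL": spec.deprel,
        "GLOSS": spec.gloss,
    }


print(annotate_standalone("audible_disfluency", "<a_d>"))
```

The head of each token still has to be chosen from the surrounding syntax; `attaches_to` only records the direction stated in the conventions.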

3. Attribution and Reuse

This document is part of the [Project Name] and is freely available under a CC BY 4.0 license.