NATIONAL CORPUS OF THE KAZAKH LANGUAGE

GENERAL INFORMATION

The National Corpus of the Kazakh Language (NCKL) is a comprehensive electronic collection of deeply annotated texts. It contains millions of word usages and covers the full lexical and grammatical system of the Kazakh language. It serves as a specialized, "intelligent" knowledge base that brings together all available information about the Kazakh language.

The NCKL currently consists of 16 sub-corpora, each developed for a specific purpose.
Total word usages: 65,000,000.

The Main Corpus is an electronic collection of texts drawn from five styles of the Kazakh language (fiction, scientific, journalistic, business, and conversational) and functions as a research and educational IT resource.

The objective of the Main Corpus is to provide texts from every stylistic layer of the Kazakh language, so that the language is represented as a single, unified whole.

Total word usages in the Main Corpus database: 31,105,900.

The Main Corpus includes a search system that allows queries by word and word form (inflected forms).
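As a rough illustration of the difference between the two query types, the sketch below indexes a few annotated tokens both by exact word form and by dictionary form. It assumes that a query "by word" returns every inflected form of that word, while a query "by word form" returns only exact matches. The token list, the `build_indexes` function, and the index layout are hypothetical and are not a description of how the corpus search system is implemented.

```python
from collections import defaultdict

# Hypothetical annotated tokens in text order: (word form, dictionary form).
TOKENS = [
    ("балалар", "бала"),      # "children"
    ("кітапты", "кітап"),     # "book" (accusative)
    ("бала", "бала"),         # "child"
    ("кітаптарда", "кітап"),  # "in the books"
]


def build_indexes(tokens):
    """Return two position indexes: one keyed by word form, one by dictionary form."""
    by_form = defaultdict(list)
    by_lemma = defaultdict(list)
    for position, (form, lemma) in enumerate(tokens):
        by_form[form].append(position)
        by_lemma[lemma].append(position)
    return by_form, by_lemma


by_form, by_lemma = build_indexes(TOKENS)
print(by_form["бала"])   # exact form only: [2]
print(by_lemma["бала"])  # every inflected form: [0, 2]
```

Keeping two separate indexes makes both query types a single dictionary lookup, at the cost of storing each position twice.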

The Main Corpus and all sub-corpora use morphological, semantic, lexical, and phonetic-phonological markup. This markup provides information about the searched word at every level of the language:

Morphological markup: the analyzer automatically splits the word or word form into its root and affixes (lemmatization), assigns a part of speech to the root (the lemma), and gives the grammatical characteristics of each affix.

Lexical markup: all meanings of the word, as listed in the explanatory dictionary.

Phonetic markup: the word's pronunciation according to orthoepic rules, its automatic division into syllables, and a description of the syllable types.

Phonological markup: a phonemic description of the sounds in the word.
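The morphological step described above (root plus affixes, a part of speech for the root, a grammatical label for each affix) can be sketched as a small dictionary-driven segmenter. This is only a minimal illustration under stated assumptions: the `ROOTS` and `AFFIXES` tables, the `Analysis` record, and the `analyze` function are hypothetical, cover a handful of noun forms, and ignore vowel harmony and consonant assimilation, so they will also accept some ungrammatical forms. The analyzer actually used by the corpus is not described in this text.

```python
from dataclasses import dataclass

# Hypothetical toy lexicon: root -> part of speech.
ROOTS = {"кітап": "NOUN", "бала": "NOUN", "мектеп": "NOUN"}

# Hypothetical toy affix table: affix -> grammatical label.
AFFIXES = {
    "лар": "plural", "лер": "plural",
    "тар": "plural", "тер": "plural",
    "да": "locative", "де": "locative",
    "та": "locative", "те": "locative",
    "ға": "dative", "ге": "dative",
    "қа": "dative", "ке": "dative",
}


@dataclass
class Analysis:
    lemma: str
    pos: str
    affixes: list  # list of (affix, grammatical label) pairs


def _segment(rest):
    """Split the remainder into known affixes; return None if that is impossible."""
    if not rest:
        return []
    for affix, label in AFFIXES.items():
        if rest.startswith(affix):
            tail = _segment(rest[len(affix):])
            if tail is not None:
                return [(affix, label)] + tail
    return None


def analyze(word_form):
    """Return an Analysis for a word form, or None if the toy tables cannot parse it."""
    word = word_form.lower()
    # Try longer roots first so a short root cannot shadow a longer one.
    for root in sorted(ROOTS, key=len, reverse=True):
        if word.startswith(root):
            affixes = _segment(word[len(root):])
            if affixes is not None:
                return Analysis(root, ROOTS[root], affixes)
    return None


result = analyze("кітаптарда")  # "in the books"
print(result.lemma, result.pos, result.affixes)
```

For the form "кітаптарда" the sketch yields the root "кітап" tagged as a noun, followed by the plural affix "тар" and the locative affix "да". Forms outside the toy tables return `None` rather than a guess.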

Each text included in the sub-corpora is accompanied by meta-markup, which records details about its source. The meta-markup window (text author, title, author's gender, text style, target audience, distribution type, publication date, topic, complete source information, etc.) opens on a separate page when the cursor hovers over the author's name.

Users of the corpus can narrow a word search by meta-markup type, for example by author, text style, or publication date.
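A search restricted by meta-markup can be pictured as a word count over only those texts whose metadata matches the chosen values. The sketch below is a simplified illustration: the `TextMeta` and `CorpusText` records, the `search` function, and the two sample texts (with placeholder authors and titles) are all invented for the example, and only a subset of the meta-markup fields listed above is modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextMeta:
    # A hypothetical subset of the meta-markup fields.
    author: str
    title: str
    author_gender: str
    style: str
    publication_year: int
    topic: str


@dataclass(frozen=True)
class CorpusText:
    meta: TextMeta
    tokens: tuple  # tokenized text


def search(texts, word, **filters):
    """Count `word` in every text whose metadata equals all given filter values."""
    hits = []
    for text in texts:
        if all(getattr(text.meta, field) == value for field, value in filters.items()):
            count = text.tokens.count(word)
            if count:
                hits.append((text.meta.title, count))
    return hits


# Invented sample data, for illustration only.
TEXTS = [
    CorpusText(
        TextMeta("Author A", "Sample novel", "female", "fiction", 1998, "family"),
        ("бала", "кітап", "бала"),
    ),
    CorpusText(
        TextMeta("Author B", "Sample article", "male", "journalistic", 2015, "education"),
        ("мектеп", "бала"),
    ),
]

print(search(TEXTS, "бала"))                   # both texts match
print(search(TEXTS, "бала", style="fiction"))  # only the fiction text
```

Filters are passed as keyword arguments, so any combination of the modeled fields can be applied in one call. A filter name that is not a field of `TextMeta` raises `AttributeError`.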