Welcome to the Main Corpus of the Kazakh Language!

The Main Corpus is an electronic collection of texts taken from 5 functional styles of the Kazakh language (fiction, scientific, journalistic, official/business, and colloquial), serving as an IT resource for research and education. The purpose of the Main Corpus is to be a source of texts that covers all stylistic layers of the Kazakh language, representing a complete picture of the language. Total volume: 31,105,900 word usages.

The Main Corpus includes a search system by word and word form (inflected forms).

The Main Corpus operates with morphological, semantic, lexical, and phonetic-phonological annotation types. These annotations provide information about the searched word at all linguistic levels:

In morphological annotation, the analyzer automatically splits the word/word form into root and affixes (lemmatization) and assigns a part of speech to the root (lemma). It also provides grammatical descriptions of affixes.

Lexical annotation shows all meanings of words as presented in explanatory dictionaries.

Phonetic annotation provides the orthoepy of the word, automatically divides it into syllables, and describes types of syllables.

Phonological annotation provides phonemic characteristics of the sounds within the word.

Each text included in the Main Corpus contains source information (metadata). The metadata window opens on a separate page when hovering over the author. Users can search for words using metadata types such as author, text title, author gender, text style, audience, distribution type, time period, topic, and full source information.


GENERAL INFORMATION



The Kazakh National Corpus (NCKL) is a comprehensive electronic collection of texts, containing millions of word usages with deep annotation, that fully covers the lexical and grammatical system of the Kazakh language. It serves as a specialized, "intelligent" knowledge base that brings together all available information about the Kazakh language.

Currently, the Kazakh National Corpus consists of 16 sub-corpora, each developed for specific purposes.
Total word usages: 65,000,000.

The Main Corpus is an electronic collection of texts drawn from five styles of the Kazakh language (fiction, scientific, journalistic, business, and conversational) and functions as a research and educational IT resource.

The objective of the Main Corpus is to serve as a source of texts that covers all stylistic layers of the Kazakh language and so presents a complete picture of the language as a whole.

Total word usages in the Main Corpus database: 31,105,900.

The Main Corpus includes a search system that allows queries by word and word form (inflected forms).

The Main Corpus, like all the sub-corpora, uses morphological, semantic, lexical, and phonetic-phonological markup. This markup provides information about the searched word at every level of the language:

Morphological markup: the analyzer automatically divides the word or word form into its root and affixes (lemmatization), assigns a part of speech to the root (lemma), and provides grammatical characteristics of the affixes.

Lexical markup displays all meanings of the word as listed in the explanatory dictionary.

Phonetic markup provides the word's pronunciation according to orthoepy rules, automatically divides it into syllables, and describes the syllable types.

Phonological markup offers a phonemic description of the sounds within the word.
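
As a rough illustration of what the morphological markup described above involves, the following minimal sketch splits a word form into a root and a chain of affixes and attaches a part of speech and affix descriptions. It is not the corpus's actual analyzer: the tiny lexicon, the affix table, and the names `LEXICON`, `AFFIXES`, `Analysis`, `segment`, and `analyze` are all hypothetical, chosen only to show the idea on a few regular noun forms.

```python
from dataclasses import dataclass

# Hypothetical toy data for illustration only; the real analyzer's
# lexicon and affix inventory are not described in this document.
LEXICON = {"бала": "noun", "кітап": "noun"}
AFFIXES = {
    "лар": "plural",
    "тар": "plural",
    "ға": "dative case",
    "қа": "dative case",
}


@dataclass
class Analysis:
    lemma: str      # root found in the lexicon
    pos: str        # part of speech assigned to the root
    affixes: list   # (affix, grammatical description) pairs, in order


def segment(rest):
    """Split the remainder of a word form into known affixes, or return None."""
    if not rest:
        return []
    for affix, description in AFFIXES.items():
        if rest.startswith(affix):
            tail = segment(rest[len(affix):])
            if tail is not None:
                return [(affix, description)] + tail
    return None


def analyze(word_form):
    """Find a lexicon root that prefixes the word form and segment what follows."""
    for root, pos in LEXICON.items():
        if word_form.startswith(root):
            affixes = segment(word_form[len(root):])
            if affixes is not None:
                return Analysis(root, pos, affixes)
    return None


result = analyze("балаларға")
print(result.lemma, result.pos, result.affixes)
```

A production analyzer would also need to handle vowel harmony, consonant assimilation, and ambiguous segmentations, none of which this sketch attempts.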

Each text included in the sub-corpora is accompanied by meta-markup (metadata), which contains details about its source. The meta-markup window (text author, title, author's gender, text style, target audience, distribution type, publication date, topic, complete source information, etc.) opens on a separate page when the cursor is hovered over the author's name.

Users of the corpus can search for specific words based on the types of meta-markup.
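
A search restricted by meta-markup can be pictured as a word lookup combined with a filter on the metadata fields. The sketch below is only an assumption about how such a query might be modeled, not the corpus's real interface: the `TextRecord` type, its field names, the `search` function, and the sample records are all hypothetical, and matching is by exact token rather than by word form.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextRecord:
    # Hypothetical subset of the meta-markup fields listed above.
    author: str
    title: str
    style: str
    text: str


def search(records, word, **filters):
    """Return titles of texts that contain `word` and match every metadata filter."""
    hits = []
    for record in records:
        # Skip the text if any requested metadata field has a different value.
        if any(getattr(record, field) != value for field, value in filters.items()):
            continue
        if word in record.text.split():
            hits.append(record.title)
    return hits


# Placeholder records, invented for this example.
records = [
    TextRecord("Author A", "Text 1", "fiction", "бала кітап оқыды"),
    TextRecord("Author B", "Text 2", "scientific", "кітап туралы мақала"),
    TextRecord("Author A", "Text 3", "fiction", "бала ойнады"),
]

print(search(records, "кітап"))
print(search(records, "кітап", style="fiction"))
print(search(records, "бала", author="Author A"))
```

Passing the filters as keyword arguments keeps the example short; any combination of the metadata fields can be supplied, and a text is returned only if it satisfies all of them.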