NATIONAL CORPUS OF THE KAZAKH LANGUAGE

GENERAL INFORMATION

The Kazakh National Corpus (KNCK / NCKL) is a comprehensive, deeply annotated electronic collection of texts containing millions of word usages, intended to cover the lexical and grammatical system of the Kazakh language. It serves as a specialized, “intelligent” knowledge base that brings together the available information about the Kazakh language.

Total word usages: 250,000,000.

Currently, the Kazakh National Corpus consists of 20 subcorpora, each developed for specific purposes.

Main Corpus. An electronic collection of texts drawn from five styles of the Kazakh language (fiction, scientific, journalistic, business, and conversational) that functions as a research and educational IT resource. Its objective is to represent all stylistic layers of Kazakh within a single, unified corpus. Total word usages in the Main Corpus: 31,105,900. It includes a search system supporting queries by word and by word form (inflected form). The Main Corpus, like all subcorpora, uses morphological, semantic, lexical, and phonetic-phonological markup, which provides information about the queried word at all levels of language:

Morphological markup: the analyzer automatically divides a word or word form into root and affixes (lemmatization), assigns a part of speech to the root (lemma), and provides grammatical characteristics of the affixes.

Lexical markup: displays all meanings of the word as listed in the explanatory dictionary.

Phonetic markup: provides the word’s pronunciation according to orthoepic rules, divides it into syllables, and describes syllable types.

Phonological markup: offers a phonemic description of the sounds within the word.
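
As a rough illustration of the morphological markup described above, the following Python sketch splits a word form into a root and a chain of affixes by stripping known suffixes from the right. It is a toy under stated assumptions, not the corpus’s actual analyzer: the names `ROOTS`, `SUFFIXES`, and `segment` are hypothetical, the tiny lexicons are invented for the example, and real Kazakh morphology (vowel harmony, consonant assimilation, ambiguity) needs far more than suffix matching.

```python
# Toy lexicons (hypothetical, for illustration only).
ROOTS = {"бала": "NOUN"}                     # root -> part of speech
SUFFIXES = {"лар": "plural", "ға": "dative"}  # affix -> grammatical gloss


def segment(word):
    """Split a word form into (root, part of speech, affixes).

    Strips known suffixes from the right until the remainder is a known
    root. Returns None if the word cannot be analyzed with the toy lexicons.
    """
    affixes = []
    rest = word
    while rest not in ROOTS:
        for suffix, gloss in SUFFIXES.items():
            # Require a non-empty remainder after removing the suffix.
            if rest.endswith(suffix) and len(rest) > len(suffix):
                affixes.insert(0, (suffix, gloss))  # keep left-to-right order
                rest = rest[: -len(suffix)]
                break
        else:
            # No suffix matched and the remainder is not a known root.
            return None
    return rest, ROOTS[rest], affixes


print(segment("балаларға"))
```

Here `segment("балаларға")` splits the form into the root `бала` and the affixes `-лар` (plural) and `-ға` (dative); a word whose root is missing from the toy lexicon yields `None`.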

Each text included in the subcorpora is accompanied by meta-markup containing details about its source. The meta-markup window opens on a secondary page when the cursor hovers over the author’s name and lists the text author, title, author’s gender, text style, target audience, distribution type, publication date, topic, complete source information, and so on. Users can search by word and by type of meta-markup.
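
A combined search by word and by meta-markup can be pictured with a minimal Python sketch. Everything in it is an assumption made for illustration: the `TextRecord` fields are a small subset of the meta-markup listed above, the `search` function and the sample records are invented, and the corpus’s real search system is not documented here.

```python
from dataclasses import dataclass


@dataclass
class TextRecord:
    """A text plus a few hypothetical meta-markup fields."""
    author: str
    title: str
    style: str
    year: int
    text: str


def search(records, word=None, **meta):
    """Return records that match every given meta field and, if a word
    is given, contain it as a whole whitespace-separated token."""
    hits = []
    for record in records:
        # Every requested meta-markup field must match exactly.
        if any(getattr(record, key) != value for key, value in meta.items()):
            continue
        # Case-insensitive whole-token match on the text.
        if word is not None and word.lower() not in record.text.lower().split():
            continue
        hits.append(record)
    return hits


# Invented sample data, not taken from the corpus.
SAMPLE = [
    TextRecord("Author A", "Title 1", "fiction", 1990, "бала кітап оқыды"),
    TextRecord("Author B", "Title 2", "scientific", 2005, "тіл туралы мақала"),
    TextRecord("Author A", "Title 3", "fiction", 2010, "кітап жақсы"),
]

for hit in search(SAMPLE, word="кітап", author="Author A"):
    print(hit.title)
```

Passing only keyword arguments, for example `search(SAMPLE, style="scientific")`, filters by meta-markup alone, while passing only `word` searches the text regardless of its source details.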

Ahmet Baitursynuly Subcorpus. An electronic database including Ahmet Baitursynuly’s poems, stories, articles, and educational materials. The subcorpus page provides search fields and meta-markup categories (author, title, publication date/place, genre, topic). Featuring internal and external markup, it supports studies of Baitursynuly’s scholarly and educational works, the history of the Kazakh literary language, vocabulary development, the differentiation and evolution of functional styles, and analysis of the Alash heritage. Total volume: 132,000 word usages.

Spoken Subcorpus. An electronic database of transcribed, orthoepically marked, and audio-aligned oral utterances in Kazakh, taken from audio and video recordings. Its purpose is to create a marked-up database of spoken Kazakh as a resource for studying speech characteristics and for developing Kazakh-language speech technologies. It enables analysis of the features of the national spoken language, the speech and style of notable figures, oral skills measured against orthoepic norms, phonetic and orthoepic phenomena (production, articulation, perception), processes in oral speech, regional features, and sociolinguistic traits; it also supports the development of listening, speaking, and comprehension skills in Kazakh. The search system allows queries by word or by meta-markup; results give access to the meta-markup, the markup, the orthographic form, the speaker’s pronunciation, and the normalized orthoepic form. Embedded links let users watch the video and listen to the audio for each transcript. Content: 136 interviews (76 with pronunciation and normalized forms; 60 with orthographic transcriptions). Total: 1,000,000 word usages.

Historical Subcorpus. An electronic database containing meta-markup of written heritage texts in various graphic systems from earlier periods. Goal: to develop a database of texts from the 12th–20th centuries, including originals in Arabic/Latin scripts, their transcriptions and translations, aligned page-by-page for historical-comparative, diachronic, and synchronic studies. Meta-markup covers summary, theme, author, year, place of preservation, style, genre, versions, publisher, page count, etc. Linguistic annotations include script, part of speech, and contextual meaning (translation). The database of early and medieval texts serves Orientalists, historians, Turkologists, and the public. Coverage: 66 texts, 655,997 word usages.

Parallel Subcorpus. A collection of source texts and their translations. Goal: a linguistic platform for teaching Kazakh, built on a balanced database of texts and their translations into other languages. Components: balanced text database, annotations, meta-markup, and search. The initial phase covers the literary and official-business styles. Both the Kazakh and the Russian texts are morphologically analyzed. Literary texts have 28 meta-parameters. Sizes: official-business 600,000; literary 1,500,000; total: 2,100,000 word usages.

Cultural-Representative Subcorpus. Provides information on the cultural semantics of ethnocultural units. Total volume: 8,000,000 word usages. The texts are grouped by four themes: folklore; authorial oral literature; ethnographic works; scientific papers/articles. Search by thematic groups (personal names, kinship terms, national foods/clothing, jewelry, weapons, sacred numbers, utensils). Also includes religious terms, archaisms, loanwords, ethnographic terms, variants, and cultural onomastics.

Advertising Subcorpus. An electronic database of Kazakh-language advertising texts with a search system. Goal: to build a database that reflects industrial/business communication and to establish meta-markup. Sources include street, transport, retail, media, and internet ads. Markup (15 parameters): advertisement image, material, text, type, themes (macro/micro), language, format, year, advertiser, region, feedback, source, annotator, and time of inclusion, which together help ensure research authenticity. Currently: 4,631 texts; total: 140,000 word usages.

Dialectological Subcorpus. An electronic collection of oral and written texts with regional linguistic features, accessible via search. Goal: to create a database of local speech patterns as a research/teaching IT resource. Phase 1 used the “Regional Dictionary of the Kazakh Language” (2005; G. Kaliyev, O. Nakysbekov, Sh. Sarybaev, A. Uderbaev). Total volume: 180,000 word usages.

Proverbs and Sayings Subcorpus. A searchable database providing linguistic and cultural/ethnolinguistic explanations. Goal: to preserve proverbs and make them easily accessible. Sources include Academician Ä. Qaidar’s “People’s Wisdom” and the 100-volume “Words of the Ancestors.” Currently: 3,000+ proverbs with explanatory notes; total: 138,647 word usages.

Phraseological Subcorpus. A searchable database of Kazakh phraseological units with meanings/definitions, aimed at preserving fixed expressions digitally and transmitting them across generations, enabling quick retrieval. Includes units from the 15-volume “Explanatory Dictionary of the Kazakh Literary Language.” Contains 58,070 entries with clarified lexical meanings. Also includes ethnophraseologisms with unique cultural information (drawn from the 5-volume “Kazakh Ethnographic Encyclopedia”; authored by scholars such as Prof. Zh. Mankeeva and K. Gabitkhan). Total: 667,300 word usages.

Onomastic Subcorpus. A tool for collecting, organizing, and digitizing onomastic data—tracking changes, usage, and frequency of place names and proper names, and forming databases of them. Goal: to collect toponyms, anthroponyms, and other onym types by region; provide geographical, cultural-semantic, and word-formation labels while gathering texts. Current coverage: ~2,000 onym types; text base: 12,000,000 word usages.

Writers’ Texts Subcorpus. An annotated collection of prose and drama, serving as a repository of stylistic devices representing writers’ artistic language. Provides access to Kazakh novels, stories, novellas, short stories, essays, and plays (readable online). Includes works by 100 Kazakh writers; each work has full meta-markup. Total: 10,000,000 word usages. Frequently used metaphors, similes, figurative expressions, idioms, proverbs, and sayings are grouped into separate search categories. Search modes: Word; Meta-markup (author, title, genre, chronotope, audience); Literary Devices.

Contemporary Poetic Subcorpus. An electronic database of written and oral texts by modern Kazakh poets with search access, highlighting structural features of poems and enabling comparison across poets. Includes works by Abai Qunanbaiuly, Shakarim Qudaiberdiuly, Iliias Zhansugirov, Saken Seifullin, Kasym Amanzholov, Zhuban Moldagaliyev, Abdilda Taizhubaev, Dikhan Abilov, Mukaghali Makataev, Mukhtar Shakhanov, Sultanmakhmut Toraigyrov, Abish Kekilbaev, Taiyr Zharokov, Isa Baizakov, and others. Current size: ~460,000 word usages (text + audio) with meta-markup and prosodic features. The subcorpus continues to expand.

Historical Poetic Subcorpus. Written works of Kazakh literature from the 6th–19th centuries. Enables exploration and analysis of old texts by script, style, lexicon, and poetic structure. Aim: preservation and easy retrieval for future generations. Currently includes poetic works from the 15th–19th centuries (epics, religious poems, heroic songs, love poetry, folk tales, debates, aphorisms), many preserved as rare Arabic-script manuscripts; originals and transliterations are provided. Total: 148,564 word usages.

Terminological Subcorpus. An electronic database of Kazakh-language texts across scientific fields, providing comprehensive linguistic information on terminology. Supports cross-disciplinary research; standardization of general/technical terms; high-quality translation; development of MT and online dictionaries; educational standardization; and public understanding of terminology. Size: ~5,000,000 word usages; meta-markup includes source, type, subject, style, keywords, and field. Covers ~10,000 terms with Kazakh/Russian variants, etymologies, definitions, and legal-document usage.

Educational Subcorpus. An information-corpus resource for learners, equipped with tools for instructors and students and containing pedagogical texts. Organized by level-based instruction: Practical Kazakh, Professional Kazakh, Theoretical Kazakh, Business Kazakh, Oratory Kazakh. Includes literacy tools (orthography, punctuation, lexico-grammatical guides) and explanatory dictionaries; supports teachers, textbook developers, L2 learners, and professional/academic users. Includes texts from children’s literature anthologies and publications such as “Ulan” and “Alzhelken.” Total: ~3,000,000 word usages.

Learner’s Corpus. An interactive instructional subcorpus designed to teach Kazakh to English-speaking learners. The platform includes texts for levels A1–C1, grammar and vocabulary reference guides, an illustrated dictionary, video materials, and sample exercises.

Six-Language Parallel Subcorpus. A balanced database that presents versions of the same text side by side in Kazakh, English, Turkish, Uzbek, Uyghur, and Azerbaijani.

Corpus of Errors. An electronic database created to collect errors from written work and to analyze them linguistically. It is aimed at assessing the written literacy of Kazakh speakers, determining the proficiency level of learners of Kazakh, improving writing skills through error analysis, and enhancing language-teaching methodology.