NATIONAL CORPUS OF THE KAZAKH LANGUAGE

INFORMATION ABOUT THE SPOKEN SUBCORPUS

The Spoken Subcorpus of the Kazakh Language is an electronic database that contains audio and video recordings of Kazakh speech together with accurate written transcriptions. The subcorpus records the speaker’s pronunciation (orthoepy) and intonation alongside the standard literary pronunciation, so that actual speech can be compared with normative forms. All these components are aligned and systematically organized within a single digital resource. A spoken subcorpus of this type has no direct analogues in Kazakhstan and does not replicate foreign models. It was developed by the A. Baitursynuly Institute of Linguistics; the concept was proposed by the Institute’s director, Anar Fazylzhan.

The main purpose of the spoken subcorpus is to demonstrate the correct pronunciation of Kazakh words and the features of spoken language. It also aims to preserve patterns of vowel harmony and phonetic consistency that are gradually disappearing. To preserve authentic spoken language practices, the subcorpus includes speech samples from prominent public figures, writers, professionals, traditional speakers, and local residents.

The spoken subcorpus allows users to:

– learn correct spoken Kazakh;
– analyze spoken language processes;
– identify sociolinguistic features;
– conduct phonetic and orthoepic analysis;
– study the speech and language style of well-known individuals;
– identify regional variations of the Kazakh language;
– develop standard pronunciation skills;
– explore and research features of the national spoken language;
– analyze prosodic features of speech.

In addition, the subcorpus serves as a high-quality resource for artificial intelligence systems. It enables the development of technologies that preserve the phonological harmony of the Kazakh language. The subcorpus can be used in interdisciplinary fields such as speech recognition (speech-to-text conversion), speech synthesis (text-to-speech), and natural language processing.

The search system of the spoken subcorpus allows users to retrieve data by entering a word or by applying filters. Users can select a region, topic, speech style, sociolinguistic level, or a specific speaker to obtain relevant information. Regardless of the search method, each result shows the word, its spelling, the speaker’s pronunciation, and the standard pronunciation. Users can also watch the video recordings and listen to the audio materials. Clicking a word in the “Spelling of the word” section opens its phonetic, phonological, and prosodic analyses, which are supported by scientifically validated data.
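
For readers who want to picture how a word search and a filter search can return the same kind of result, the following is a minimal sketch in Python. It is not the subcorpus’s actual implementation or interface: the `Entry` record, its field names, the `search` function, and the sample values are all hypothetical placeholders chosen for illustration, and the sociolinguistic-level filter is omitted for brevity.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Entry:
    # Hypothetical record layout; field names are assumptions, not the corpus schema.
    spelling: str
    speaker_pronunciation: str
    standard_pronunciation: str
    region: str
    topic: str
    style: str
    speaker: str


def search(entries: Iterable[Entry], word: Optional[str] = None, **filters: str) -> List[Entry]:
    """Return entries that match the word (if given) and every supplied filter."""
    results = []
    for entry in entries:
        # Word search: compare against the orthographic form.
        if word is not None and entry.spelling != word:
            continue
        # Filter search: every requested field must equal the requested value.
        if all(getattr(entry, field) == value for field, value in filters.items()):
            results.append(entry)
    return results


# Placeholder data for illustration only; not taken from the subcorpus.
SAMPLE = [
    Entry("word-1", "spoken-1", "norm-1", "Region A", "Topic X", "formal", "Speaker 1"),
    Entry("word-1", "spoken-2", "norm-1", "Region B", "Topic X", "informal", "Speaker 2"),
    Entry("word-2", "spoken-3", "norm-2", "Region A", "Topic Y", "formal", "Speaker 1"),
]

by_word = search(SAMPLE, word="word-1")
by_filter = search(SAMPLE, region="Region A", style="formal")
```

In this sketch both calls return full `Entry` records, so the spelling, the speaker’s pronunciation, and the standard pronunciation are available whichever search method is used.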

The Spoken Subcorpus database consists of recorded interviews annotated with the speaker’s pronunciation, standard pronunciation norms, and orthography. It includes speech samples from well-known figures such as Gabit Musrepov, Azilkhan Nurshaiykov, Sherkhan Murtaza, Mukhtar Auezov, Maulen Balaqayev, Myrzatay Zholdasbekov, Satybaldy Narymbetov, Asanali Ashimov, Ibragim Agytaiuly, Zhaksylyk Ushkempirov, Kanipash Madibai, Zeynep Akhmetova, Askar Zhumadilayev, Alimkhan Zhunisbek, Bakytzhan Khasanov, Rabiga Syzdyk, Fauziya Orazbayeva, Okas Nakhysbekov, Bauyrzhan Momyshuly, Nygmet Mynzhani, Akseleu Seidimbek, Kassym Kaysenov, Bekbolat Tileukhan, Ermurat Bapi, Abzal Kuspan, Abduali Kaidar, Nurtore Zhussip, Ularbek Nurgalymuly, Nurgeldi Uali, Kalikhan Iskak, Saken Zhunisov, as well as many other public figures, professionals, and ordinary speakers.