NATIONAL CORPUS OF THE KAZAKH LANGUAGE

GENERAL INFORMATION

The Spoken Subcorpus is an electronic database of spoken Kazakh (audio and video recordings) in which transcription, orthoepy, and audio are aligned and presented in a synchronized format.

The purpose of developing the spoken subcorpus of the National Corpus of the Kazakh Language is to create an electronic database of spoken language samples with internal and external annotation, enabling the study of spoken language features and supporting the development of Kazakh-language speech technologies.

The spoken subcorpus allows users to: study, learn, and analyze the characteristics of spoken Kazakh; investigate the language and speech style of specific individuals; develop pronunciation skills in accordance with orthoepic norms; conduct phonetic and orthoepic analysis in terms of articulation, pronunciation, and perception; identify processes characteristic of spoken language; determine regional linguistic variations; explore sociolinguistic features; develop listening and speaking skills in language learning; and analyze prosodic features of speech.

The search system supports queries by word and by metadata. Search results provide metadata, linguistic annotations, orthographic forms, the speaker's pronunciation, and the standardized orthoepic form. In addition, users can watch video recordings and listen to audio materials via the provided links.

The textual database of the spoken subcorpus consists of 136 interviews in total: 76 interviews include both the speaker's pronunciation and the normative orthoepy, and 60 interviews are presented in orthographic transcription. Total number of word usages: 1,000,000.