Interview with Baiba Saulīte about "Voice Gathering" (Balsu talka)
What is "Balsu talka"? Why does this campaign need public involvement? To learn more about this initiative, we interviewed Baiba Saulīte.
Could you briefly introduce yourself - what is your occupation? What is your relationship with language technology?
I am a linguist and a leading researcher at the Artificial Intelligence Laboratory (AiLab) of LU MII, and my main object of research is the modern Latvian language in its various aspects. AiLab's mission is to develop resources and technologies that sustain the Latvian language in a multilingual environment. Here I can gain knowledge about the language both by creating Latvian language resources and by analyzing various linguistic phenomena in resources that already exist. By resources I mean the language data used in language technologies: lexical databases (dictionaries) and language corpora. The main resources useful to the general public on a daily basis are gathered on two platforms - the popular dictionary Thesaurus and the National Corpus Collection. I should perhaps explain that a language corpus is a collection of written texts, transcribed speech or video recordings intended for modern linguistic analysis and for the development of language technologies. Corpora contain authentic and very extensive language material, running to millions of words, that reflects how the language is actually used, and corpus browsers make it possible to analyze these voluminous texts and to identify typical or unusual, rare phenomena of the language in them.
What is "Balsu talka"?
"Balsu talka" is a campaign in which we ask the public to help create a speech corpus of the Latvian language by recording a few sentences. The aim is to collect as many samples of Latvian speech as possible and to build a diverse, open and accessible Latvian speech data set. To collect voice samples, we use the internationally known Mozilla Common Voice platform, where the collected speech data is available to everyone. To date, more than 170 hours of recordings from 4,364 speakers have been collected, and more than half of these recordings have been verified.
How did "Balsu talka" come about? Where did the idea for this project come from?
Creating speech corpora is very expensive and time-consuming because the speech data must be accurately transcribed. Most of the Latvian speech data currently used by research institutions and language technology companies is not open or freely available. For a year and a half, AiLab has been creating a freely available corpus of spontaneous speech, but, as I said, transcription is quite slow. That is why we started thinking about how to collect many transcribed recordings as conveniently as possible and create a large, open dataset of modern Latvian speech. Together with our cooperation partner LU LFMI, we considered various options, because we were also interested in collecting spontaneous speech, but we could not afford to transcribe such a large amount of data ourselves. Then we met Raivis Dejus, who was already inviting people to record sentences in Mozilla Common Voice. And so, in a short time, the "Voice Gathering" campaign was born. We launched it together with LU LFMI and the Latvian Open Technologies Association (LATA) as early as May 4, the date on which we collected the first 100 hours of varied voice recordings.
How will the data collected in "Balsu talka" be used?
The data is regularly published on the Common Voice platform under the Creative Commons CC0 public domain license, which means that no one owns the copyright to the data. Everyone is free to use it for any purpose, which allows research in both linguistics and language technology to develop without restriction. The data collected during "Balsu talka" up to midsummer can be viewed and listened to in the "Balsutalka.lv speech corpus". There, for example, you can listen to how different people recited fragments from Anna Brigadere's "Sprīdītis" or other texts. This corpus can be used by language researchers, especially phoneticians. For example, the data makes it possible to analyze syllable intonation, which is characteristic of long syllables in Latvian, in the speech of different people, and to see which positional sound changes occur in words and how regularly. Since the set of spoken sentences includes sentences of different communicative types (declarative, interrogative and exclamatory), it is also possible to analyze sentence intonation.
How does "Balsu talka" contribute to the development of the Latvian language and what is its impact on the use of the language?
This initiative does not really promote the development of the Latvian language itself, but, as I have already mentioned, the data obtained allows various aspects of the Latvian language to be analyzed. In cooperation with the Rēzekne Academy of Technology, a Latgalian version of the initiative, "Bolsu tolka", has been created, in which people who can read sentences aloud in the written Latgalian language take an active part. As it turns out, many have to concentrate hard to read the sentences. Perhaps, for some, it is practice in reading written Latgalian. It should be emphasized that our aim is to obtain recordings of a variety of voices, including inflections and accents. The age, gender and nationality of the participants are irrelevant - the more diverse the voice samples in Latvian and Latgalian, the more valuable they are. This is also a culturally and historically important initiative, because samples of voices from Latvia and from the diaspora will be preserved for future generations.
What are the main challenges and tasks when working in "Balsu talka"?
We have formed a very good team in this project, where everyone has their own task: AiLab takes care of the content and the analysis of the data obtained, LU LFMI and LATA inspire us to hold various public engagement events, and Raivis Dejus looks after the website balsutalka.lv and the integration with the Mozilla Common Voice platform. At this point, it is important to evaluate the data already obtained. For example, an initial analysis of the results shows that longer sentences are more useful for speech recognition than sentences of one to five words. On the other hand, such short sentences, and even single words, are very much needed for language analysis. It is also clear that collecting the largest possible speech corpus is not the main task. It is essential to create a diverse corpus in which different texts have been spoken by people with different accents or inflectional features. When selecting the texts that campaign participants read, we try as far as possible to cover the most frequently used words in Latvian, to include sentences with different syntactic and communicative structures, and so on. We also remind participants that checking the sentences that have already been recorded is equally important. By the way, it is very pleasant to hear how, with what expression and intonation, the participants speak the sentences.
What could be the next step to improve language technology?
As I have already mentioned, it is important to diversify the set of sentences to be read and to make sure that longer sentences are added (observing the Common Voice limit of up to 14 words, of course). It would also be valuable to increase the number of participants in "Balsu talka", so that the corpus contains as many different speech samples as possible. And of course, checking the data already recorded is just as important as recording new voice samples.
Do you see other countries or projects that could serve as a model or inspiration for "Balsu talka"?
Of course. When we were thinking about collecting speech data, we analyzed how other countries have involved the wider public in this work. The most impressive seemed to be the Finnish project "Donate Speech" (the Estonians were later inspired by this project), but it does not use the Mozilla Common Voice platform and collects spontaneous speech instead of asking people to read sentences. This means that the speech recordings initially come without transcriptions, and transcription is the most difficult stage in creating a speech corpus. Like us, Icelanders also collect voice recordings.
We would like to thank Baiba Saulīte for her time and contribution to the speech data collection project!
The project "Language technology initiative" (No. 2.3.1.1.i.0/1/22/I/CFLA/002) is co-financed by the European Union Recovery and Resilience Mechanism investment and the State Budget.