February 26, 2021

More than 333 million orthographic forms in the new Spanish corpus

Madrid, Feb 15 (EFE).- More than 333 million orthographic forms, drawn from written texts and oral transcriptions, make up the latest update of the Corpus of 21st-Century Spanish (CORPES XXI), presented by the Royal Spanish Academy (RAE) in collaboration with the Association of Academies of the Spanish Language.

Version 0.93, the latest release of this linguistic tool, contains more than 316,000 documents and more than 333 million orthographic forms, an increase of more than 21 million forms over the previous version, published in May 2020, the RAE reported on Monday.

A corpus is a collection of texts, as large and well organized as possible, that is typically used to study the context and properties of words, expressions and constructions on the basis of actual recorded usage. Given its size, a corpus must be in electronic format.
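As a rough illustration of how a corpus reveals the context of a word, the sketch below builds a tiny keyword-in-context listing in Python. The `concordance` function and the two sample sentences are invented for this example; they are not taken from CORPES XXI and do not reflect how the RAE's own query tools work.

```python
import re

def concordance(texts, keyword, width=3):
    """Return (left context, match, right context) for each occurrence of keyword."""
    hits = []
    for text in texts:
        # Lowercase and split into word tokens; \w matches accented letters too.
        tokens = re.findall(r"\w+", text.lower())
        for i, token in enumerate(tokens):
            if token == keyword.lower():
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                hits.append((left, token, right))
    return hits

# Hypothetical sample sentences, made up for illustration only.
sample_texts = [
    "El corpus recoge textos de todos los países.",
    "Los textos orales proceden de la radio y la televisión.",
]

for left, word, right in concordance(sample_texts, "textos"):
    print(f"{left:>20} | {word} | {right}")
```

Each row shows one occurrence of the search word with a few words on either side, which is the basic view used to compare how a word behaves across documents.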

A general corpus (also called a reference corpus) serves primarily to capture the overall characteristics of a language at a given moment in its history. In the case of present-day Spanish, the corpus must contain texts of all types and from all the countries that make up the Spanish-speaking world.

More than four and a half million of the forms added in this update come from transcriptions of oral texts (radio and television programs, media interviews, and YouTube).

In the fiction block (novels, film scripts, short stories, plays), CORPES now exceeds 93 million forms, while non-fiction texts from books and periodicals (social sciences, health, politics, arts, technology) account for close to 238 million.

Texts from books account for almost 166 million forms, and periodicals for about 158 million. A further six and a half million come from blogs, digital interviews and social networks.

As for the distribution over time, the number of texts produced between 2016 and 2020 has increased, reaching just over 42 million forms in this version. By five-year period, the greatest weight in this still-provisional version falls on 2006-2010, with more than 107 million forms; more than 100 million forms come from texts produced between 2001 and 2005, and almost 82 million from 2011 to 2015.
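For a sense of the proportions, the short Python sketch below turns the rounded five-year figures quoted above into approximate shares. The inputs are the article's rounded bounds ("more than", "almost", "just over"), not exact counts, so the percentages are only indicative.

```python
# Rounded figures from the article, in millions of forms (bounds, not exact counts).
forms_by_period = {
    "2001-2005": 100,
    "2006-2010": 107,
    "2011-2015": 82,
    "2016-2020": 42,
}

total = sum(forms_by_period.values())

# Share of each five-year period relative to the sum of the listed figures.
shares = {
    period: round(100 * millions / total, 1)
    for period, millions in forms_by_period.items()
}

for period, share in shares.items():
    print(f"{period}: about {share}% of the listed forms")
```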

Texts produced in Spain account for slightly more than 30 percent of the forms; the Americas contribute more than 217 million forms, and the corpus also includes texts from the Philippines and Equatorial Guinea.

