Collection: LCC

The Wortschatz Leipzig project offers downloadable corpora for many languages in standardized sizes, all using the same format and drawn from comparable sources. The data are intended for scientific use by corpus linguists, for applications such as knowledge extraction programs, and as training data.

The corpora contain randomly selected sentences in the specified language and are available in sizes from 10,000 up to 1 million sentences. The sources are either newspaper texts or texts randomly collected from the web. The texts are split into sentences and the sentence order is randomized, so that the original documents cannot be reconstructed. Non-sentences and foreign-language material are removed. Because word co-occurrence information is useful for many applications, these data are precomputed and included as well. For each word, the corpora list the most significant words that appear as its immediate left or right neighbor, or anywhere within the same sentence. Additional details concerning the creation of the corpora can be found in the accompanying publication. More information about the format and content of the corpus files is available on the Wortschatz Leipzig web portal.
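The two kinds of co-occurrence described above (immediate neighbors and same-sentence co-occurrence) can be illustrated with a minimal Python sketch. It is not the project's own implementation: the function name `cooccurrence_counts` is hypothetical, tokenization is naive whitespace splitting, and raw counts stand in for the significance scores the corpora actually provide.

```python
from collections import Counter


def cooccurrence_counts(sentences):
    """Count neighbor and same-sentence co-occurrences (illustrative only).

    Returns three Counters:
      left     -- (word, left neighbor)  -> count
      right    -- (word, right neighbor) -> count
      sentence -- (word_a, word_b), alphabetically ordered -> number of
                  sentences containing both words
    """
    left, right, sentence = Counter(), Counter(), Counter()
    for text in sentences:
        # Assumption: lowercase + whitespace split; real corpora use a proper tokenizer.
        tokens = text.lower().split()
        for i, word in enumerate(tokens):
            if i > 0:
                left[(word, tokens[i - 1])] += 1
            if i + 1 < len(tokens):
                right[(word, tokens[i + 1])] += 1
        # Each unordered word pair is counted once per sentence.
        unique = sorted(set(tokens))
        for a_idx, a in enumerate(unique):
            for b in unique[a_idx + 1:]:
                sentence[(a, b)] += 1
    return left, right, sentence


if __name__ == "__main__":
    left, right, sentence = cooccurrence_counts(["the cat sat", "the dog sat"])
    print(right.most_common(3))
    print(sentence.most_common(3))
```

To get from raw counts to the "most significant" co-occurrences, one would rank each pair with a statistical association measure rather than by frequency alone; which measure the project uses is documented in its publication, not assumed here.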

CC BY 4.0

Download page of the Leipzig Corpora Collection on the Wortschatz Leipzig web portal.
https://wortschatz-leipzig.de/download

Leipzig Corpora Collection (LCC)
