The Wortschatz Leipzig project has collected digital text since 1994 and has made it available online since 1998. For more than 20 years, it has maintained one of the largest collections of digital news text for German, currently around 100 billion tokens of cleaned text per year. To this end, freely available internet documents are continuously collected and processed. The results include corpus-based dictionaries with numerous linguistic and language-statistical annotations, as well as large text corpora, which for many languages represent the largest publicly available datasets of their kind.
Because the data sets are sizable, with up to several hundred million sentences per language (after de-duplication), the project's resources provide statistical information on all words and linguistic phenomena. For German, this is one of the largest information systems of its kind, and the available data are continuously extended with additional languages. As of 2024, data for over 250 languages are available via the web portal, via web services, or in the Leipzig Corpora Collection (LCC) as downloadable corpora of standardized sizes.