Wortschatz Leipzig

Project

General Information

Project-URL:

https://wortschatz-leipzig.de

Institutions:

Saxon Academy of Sciences and Humanities in Leipzig
Leipzig University
Institute for Applied Informatics (InfAI) e.V.

Start: 1994

Description

The Wortschatz Leipzig project has collected digital text since 1994 and has made it available online since 1998. For more than 20 years, it has maintained one of the largest collections of digital news text for German, currently adding around 100 billion tokens of cleaned text per year. To this end, freely available internet documents are continuously collected and processed. The results include corpus-based dictionaries with numerous linguistic and language-statistical annotations, as well as large text corpora, which for many languages represent the largest publicly available dataset of this kind.

Because the data sets are sizable, with up to several hundred million sentences per language (after de-duplication), the project's resources provide statistical information on all words and linguistic phenomena. For German, this is one of the largest information systems, and the available data are continuously extended with additional languages. As of 2024, data for over 250 languages are available via the web portal, via web services, or in the Leipzig Corpora Collection (LCC) as downloadable corpora of standardized sizes.

Resources

lcc/corpora

lcc/sentiws
