Southern Sotho Web subcorpus (South Africa) from 2018 (sot-za_web_2018_10K)

Corpus
Identifier: 11022/0000-0007-CA6B-E

Description

Southern Sotho Web subcorpus (South Africa) based on material from 2018 (10,000 sentences), created in the "Deutscher Wortschatz" or "Leipzig Corpora Collection" project.
The project regularly collects and processes documents available on the Internet (typically in an annual cycle) and from other sources. The results are corpora and corpus-based dictionaries for more than 250 languages, which provide statistical information, example sentences, and links to related words for almost every word. This coverage is possible because the underlying text material is large, comprising several million sentences. The service ranks among the most comprehensive information systems about the German language and provides the largest freely available amounts of data for many other languages.

Metadata

Details

Type/Media Type: Written corpus, text/plain, application/gzip
License: CC BY-NC
Language: Southern Sotho (ISO 639-3: sot)
Temporal coverage: 2018-09-13 - 2018-10-01
Keywords: Southern Sotho, web, Corpus

Size

Number of sentences: 10000
Number of types: 19391
Number of tokens: 213418
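
The figures above allow two simple derived statistics to be worked out: the average sentence length and the type-token ratio. The following is a minimal sketch in Python that uses only the three counts listed in this section; the variable names are illustrative and are not part of the corpus distribution.

```python
# Counts copied from the Size section of this record.
num_sentences = 10000
num_types = 19391
num_tokens = 213418

# Average sentence length, measured in tokens per sentence.
avg_tokens_per_sentence = num_tokens / num_sentences

# Type-token ratio: distinct word forms divided by running tokens.
type_token_ratio = num_types / num_tokens

print(f"Average tokens per sentence: {avg_tokens_per_sentence:.2f}")
print(f"Type-token ratio: {type_token_ratio:.4f}")
```

Both values depend on how the project tokenizes text, so they are best read as rough descriptors of this subcorpus rather than as properties of the language.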

Contact

Administrative contact
Technical contact

Citation

Leipzig Corpora Collection: Southern Sotho Web subcorpus (South Africa) from 2018 (sot-za_web_2018_10K). Leipzig Corpora Collection. Dataset. Identifier: 11022/0000-0007-CA6B-E.