Gujarati news subcorpus from 2014 (guj_newscrawl_2014_1M)

Corpus
Identifier: 11022/0000-0000-7F62-4

Description

Gujarati news subcorpus based on material crawled in 2014 (1,000,000 sentences), created in the project "Deutscher Wortschatz" or "Leipzig Corpora Collection".
The project regularly collects and processes documents available on the Internet (typically in an annual cycle) and from other sources. The results are corpora and corpus-based dictionaries for more than 250 languages, which provide statistical information, example sentences and links to related words. Because the underlying text material comprises several million sentences, this information is available for almost every word. The service ranks among the most comprehensive information systems about the German language and provides the largest freely available amounts of data for many other languages.

Applications

Search Portal
CLARIN Federated Content Search (FCS)
Text+ Federated Content Search (FCS)
Show in Virtual Language Observatory

Downloads

Metadata

Details

Type/Media Type: Written corpus, text/tab-separated-values, application/zstd
License: CC BY-NC
Language: Gujarati (ISO 639-3: guj)
Temporal coverage: 2014-02-27 - 2014-03-14
Keywords: Gujarati, newscrawl, Corpus
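
The media types listed above indicate tab-separated text delivered in a zstd-compressed container. A minimal sketch of reading such a file once it has been decompressed is given below. The column layout (a numeric sentence ID followed by the sentence text) and the function name read_sentences are assumptions made for illustration; neither is stated on this page, so check them against the actual download.

```python
import csv
import io
from typing import Iterator, TextIO, Tuple


def read_sentences(handle: TextIO) -> Iterator[Tuple[int, str]]:
    """Yield (sentence_id, sentence) pairs from tab-separated text.

    Assumed layout (not confirmed by the corpus page): column 0 is a
    numeric sentence ID, column 1 is the sentence itself.
    """
    # QUOTE_NONE keeps quotation marks inside sentences as literal text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        yield int(row[0]), row[1]


# Small in-memory stand-in for a decompressed sentences file.
sample = io.StringIO("1\tFirst example sentence.\n2\tSecond example sentence.\n")
sentences = list(read_sentences(sample))
print(sentences)
```

For a real file, the same function could be passed a handle opened with encoding="utf-8" after the zstd layer has been removed with a tool of your choice.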

Size

Number of sentences: 1000000
Number of types: 583642
Number of tokens: 13988086
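
A few derived figures follow directly from the three counts above. The sketch below computes the mean sentence length and the type/token ratio; the variable names are illustrative, and the counts are copied from this page without being recomputed from the corpus.

```python
# Counts as reported in the Size section.
sentences = 1_000_000
types = 583_642
tokens = 13_988_086

# Mean sentence length in tokens.
tokens_per_sentence = tokens / sentences

# Type/token ratio: share of distinct word forms among all tokens.
type_token_ratio = types / tokens

# Average number of occurrences per distinct word form.
tokens_per_type = tokens / types

print(f"tokens per sentence: {tokens_per_sentence:.2f}")
print(f"type/token ratio:    {type_token_ratio:.4f}")
print(f"tokens per type:     {tokens_per_type:.2f}")
```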

Contact

Administrative contact
Technical contact

Citation

Leipzig Corpora Collection: Gujarati news subcorpus from 2014 (guj_newscrawl_2014_1M). Leipzig Corpora Collection. Dataset. Identifier: 11022/0000-0000-7F62-4.
