Croatian Web Corpus

hrWaC

http://www.nljubesic.net/resources/corpora/hrwac/

ID:

The Croatian Web Corpus (hrWaC) is the largest corpus collected for Croatian so far. It was collected in June 2011 by crawling the whole .hr internet domain, yielding ca. 1.2 billion tokens. The corpus has been cleaned of HTML code, then lemmatised and MSD-tagged automatically using the CroTag system (Agić et al., 2008). The compilation of the corpus is described in the TSD 2011 paper: Ljubešić, N., Erjavec, T. hrWaC and slWac: Compiling Web Corpora for Croatian and Slovene. The morphosyntactically annotated and lemmatised corpus is distributed under the CC-BY-SA licence. It is also installed in NoSketchEngine for free online querying: http://faust.ffzg.hr/bonito2/run.cgi/first_form?corpname=hrwac.
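Because the corpus is lemmatised and MSD-tagged at word level, a downloaded copy can be processed token by token. The sketch below is only an illustration: this record does not state the distribution file format, so the layout used here (one token per line with tab-separated form, MSD tag and lemma, plus structural lines such as `<p>`) is an assumption, and the names `Token`, `read_vertical` and `lemma_frequencies` as well as the sample tags are hypothetical.

```python
from collections import Counter
from typing import Iterable, Iterator, NamedTuple


class Token(NamedTuple):
    form: str   # surface word form
    msd: str    # morphosyntactic description tag
    lemma: str  # dictionary form


def read_vertical(lines: Iterable[str]) -> Iterator[Token]:
    """Yield tokens from lines in the ASSUMED form<TAB>msd<TAB>lemma layout."""
    for raw in lines:
        line = raw.rstrip("\n")
        # Blank lines and structural lines (e.g. <p>, </p>) contain no tab.
        if "\t" not in line:
            continue
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # ignore lines that do not match the assumed layout
        yield Token(*parts)


def lemma_frequencies(lines: Iterable[str]) -> Counter:
    """Count how often each lemma occurs."""
    return Counter(tok.lemma for tok in read_vertical(lines))


# Invented sample data for illustration; not taken from the corpus.
sample = [
    "<p>\n",
    "Psi\tNcmpn\tpas\n",
    "laju\tVmr3p\tlajati\n",
    ".\tZ\t.\n",
    "</p>\n",
]

tokens = list(read_vertical(sample))
freqs = lemma_frequencies(sample)
```

For a corpus of this size, passing an open file object to `read_vertical` keeps memory use low, since lines are consumed lazily rather than loaded at once.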

Distribution

Availability

Available - Restricted Use

Licence

CC-BY-SA

Restrictions: Attribution, Share Alike

Execution location: hidden

Distribution Access/Medium: Downloadable

Distribution rights holders:

University of Zagreb, Faculty of Humanities and Social Sciences

IPR Holder

University of Zagreb, Faculty of Humanities and Social Sciences

Contact Person

Nikola Ljubešić

text

Monolingual text corpus

Languages

Croatian

Language Script: Latn

Linguality

Linguality type: Monolingual

Size

1 200 000 000 Tokens

Character encoding

UTF-8

Annotation

Lemmatization

Segmentation level: Word

Morphosyntactic Annotation - B Pos Tagging

Segmentation level: Word

Segmentation

Segmentation level: Word

Segmentation

Segmentation level: Paragraph

Resource Creation

Resource Creator

Univ. of Zagreb, Faculty of Humanities and Social Sciences, Depts. of Linguistics & Information Sci.

Creation started: 01/06/2011

Funding Project

Central and South-East European Resources (CESAR)

URL: http://www.cesar-pro...

Funding Types: EU Funds, National Funds

Funders: European Commission (50%), University of Zagreb, Faculty of Humanities and Social Sciences (50%)

Project duration: 01/02/2011 - 31/01/2013

Metadata

Created: 30/07/2012

Last Updated: 04/02/2013

Metadata Creator

Marko Tadić

Version

Version: 1.0

Last Updated: 30/07/2012

Documentation

Nikola Ljubešić and Tomaž Erjavec. hrWaC and slWac: Compiling Web Corpora for Croatian and Slovene. Text, Speech and Dialogue 2011. Lecture Notes in Computer Science, Springer.
