Web Content Extractor

WebContentExtractor

http://www.nljubesic.net/resources/tools/webcontentextractor/

ID: 311

Web Content Extractor is a tool for extracting content from web pages for building web corpora. The content extraction algorithm, developed for building hrWaC and slWaC, is described in the TSD 2011 paper: Ljubešić, N., Erjavec, T. hrWaC and slWac: Compiling Web Corpora for Croatian and Slovene. An implementation (a Java file) is published under the Apache 2.0 licence. A Croatian evaluation sample used in the paper can also be downloaded; it is distributed under the CC-BY-SA licence.

DistributionAvailability

Available - Unrestricted Use

Licence

Apache Licence 2.0

Restrictions: Inform Licensor

Execution location: hidden

Distribution Access/Medium: Downloadable

Distribution rights holders:

University of Zagreb, Faculty of Humanities and Social Sciences

IPR Holder

University of Zagreb, Faculty of Humanities and Social Sciences

Marko Tadić

Contact Person

Nikola Ljubešić

toolService

Tool

Language Independent

Input

Media type: Text

Resource type: Language Description

Modality: Written Language

Output

Media type: Text

Resource type: Language Description

Modality: Written Language

Operation

Operating system: Linux

Required Software

Python (version 2.6 or higher)

Evaluation

Evaluated: True

Evaluation level: Diagnostic

Evaluation type: Black Box

Evaluation criteria: Intrinsic

Evaluation measure: Human

Evaluator: Nikola Ljubešić

Creation

Programming language: Python

Resource Creation

Resource Creator

Univ. of Zagreb, Faculty of Humanities and Social Sciences, Depts. of Linguistics & Information Sci.

Creation started: 01/04/2011

Funding Project

Central and South-East European Resources (CESAR)

URL: http://www.cesar-pro...

Funding Types: EU Funds, National Funds

Funders: European Commission (50%), University of Zagreb, Faculty of Humanities and Social Sciences (50%)

Project duration: 01/02/2011 - 31/01/2013

Metadata

Created: 30/07/2012

Last Updated: 04/02/2013

Metadata Creator

Marko Tadić

Version

Version: 1.0

Last Updated: 30/07/2012
