CINTIL-TreeBank

The CINTIL-TreeBank (Branco et al., 2011) is a corpus of syntactic constituency trees for Portuguese text. It comprises 10,039 sentences and 110,166 tokens drawn from two domains: news (8,861 sentences; 101,430 tokens) and novels (399 sentences; 3,082 tokens). The remaining 779 sentences (5,654 tokens) are used for regression testing of the computational grammar that supported the annotation of the corpus.
To create this TreeBank, we adopted semi-automatic analysis with double-blind annotation followed by adjudication. The resulting dataset contains one level of information: phrase constituency.
The main motivation for creating this resource was to build a high-quality dataset with syntactic information that could support the development of a wide range of automatic resources and tools for Portuguese NLP.
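
The size figures in the description can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration: the dictionary keys and the `totals` helper are hypothetical names, and the numbers are copied from the description, not read from the corpus files.

```python
# Figures copied from the description: (sentences, tokens) per part.
# The key names are illustrative labels, not identifiers from the corpus.
parts = {
    "news": (8861, 101430),
    "novels": (399, 3082),
    "regression_test": (779, 5654),
}


def totals(breakdown):
    """Sum sentence and token counts over all parts of the breakdown."""
    sentences = sum(s for s, _ in breakdown.values())
    tokens = sum(t for _, t in breakdown.values())
    return sentences, tokens


total_sentences, total_tokens = totals(parts)
print(total_sentences, total_tokens)  # 10039 110166
```

The three parts add up to the stated 10,039 sentences and 110,166 tokens, so the regression-test sentences are counted within the overall total.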

Distribution

Availability

Under Negotiation

Licence

Other

Licensors:

António Branco

Distribution rights holders:

António Branco

IPR Holder

University of Lisbon, Faculty of Sciences

Contact Person

António Branco

text

Monolingual text corpus

Languages

Portuguese (10,140 Sentences)

Linguality

Linguality type: Monolingual

Text Format

text/xml (10,140 Sentences)

Size

110,166 Tokens

10,039 Sentences

Character encoding

UTF-8 (10,140 Sentences)

Domains

Novels (403 Sentences)

News (8,952 Sentences)

Test (785 Sentences)

Modalities

Written Language

Geographic coverage

Portugal (10,140 Sentences)

United States of America (106 Sentences)

Creation

Creation mode: Mixed

Resource Creation

Resource Creator

António Branco

Funding Project

SemanticShare - Resources and Tools for Semantic Processing (SemanticShare - FCT/PTDC/PLP/81157/2006)

URL: http://nlx.di.fc.ul....

Funding Type: National Funds

Funder: FCT - Fundação para a Ciência e Tecnologia

Funding Country: Portugal

Project duration: 01/06/2006 - 31/12/2010

Metadata

Created: 01/06/2012

Last Updated: 11/12/2015

Source: META-SHARE

METANET4U

Metadata Language: English

Metadata Creator

Catarina Carvalheiro

Version

Version: 1

Last Updated: 01/06/2012

Documentation

Tool Documentation: Online

Samples Location: http://194.117.45.19...

Document Type: Other

Catarina Carvalheiro, CINTIL Treebank Narrative Description, http://194.117.45.19..., 2012

Document Type: Tech Report

António Branco; João Silva; Francisco Costa; Sérgio Castro, CINTIL TreeBank Handbook: Design options for the representation of syntactic constituency, http://docs.di.fc.ul..., 2011

Publisher: Department of Informatics, University of Lisbon

Document Language: English
