Creating metadata for parallel treebanks

Oct 08, 2012 at 10:58

META NORD has developed two parallel treebanks as part of the "horizontal actions" that run throughout the entire project period. The Sofie Parallel Treebank was ready in November 2011 (although not yet made publicly available due to IPR issues), and the JRC Acquis Parallel Treebank will be ready for the final resource upload at the end of this year. 

Both treebanks consist of annotations of translations of one particular text in several languages (most of the META-NORD official languages): The Sofie treebank contains analyses of the first 255 sentences of the book Sophie's World by Jostein Gaarder, and the Acquis treebank is based on an EU directive that was selected on the basis of its availability in all the relevant languages as well as its "parseability". In effect, these parallel treebanks are collections of monolingual treebanks that are based on the same source text.

A metadata record for Parallel Sofie already exists, but we are not sure that this record describes our resource satisfactorily. We would thus need some advice before registering metadata also for Acquis. 

The Sofie treebank was originally developed under the Nordic Treebank Network, while Acquis is being created by the META-NORD partners themselves. Ideally, since a parallel treebank is conceptually one resource and the annotations are based on the same material, we would like each treebank to be one complex resource with subcomponents. However, it is not obvious how to account for the different properties of the component treebanks using the existing schema.

A few examples: In Sofie, there are different IPR holders for the different translations, and each annotation is created using different tools/grammars, sometimes with different annotation modes (manual/semi-automatic/automatic) etc. Likewise, since some of the JRC annotations have undergone manual supervision (some are even created manually), we would like to be able to specify the annotator(s) for each annotation.

As far as we can see there are currently two possible ways of doing this: 

1: by adding "Corpus Text Info" for each monolingual treebank

2: by creating one metadata record for each monolingual treebank and relating the records to each other afterwards (in our opinion this is a suboptimal solution)
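
To make the difference between the two options concrete, here is a rough sketch in Python. It is only an illustration: the class names, field names and placeholder values are assumptions made up for this example and do not reproduce the actual META-SHARE schema.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical, simplified model. NOT the actual META-SHARE schema.


@dataclass
class TextInfo:
    """Properties that differ per monolingual treebank."""
    language: str                # placeholder language label
    ipr_holder: str              # each translation may have its own IPR holder
    annotation_mode: str         # "manual", "semi-automatic" or "automatic"
    annotators: List[str] = field(default_factory=list)


@dataclass
class ResourceRecord:
    """One metadata record."""
    name: str
    text_infos: List[TextInfo] = field(default_factory=list)
    related_to: List[str] = field(default_factory=list)


def option_1(name: str, components: List[TextInfo]) -> ResourceRecord:
    """Option 1: one record with one text-info block per monolingual treebank."""
    return ResourceRecord(name=name, text_infos=list(components))


def option_2(name: str, components: List[TextInfo]) -> List[ResourceRecord]:
    """Option 2: one record per monolingual treebank, related afterwards."""
    records = [
        ResourceRecord(name=f"{name} ({c.language})", text_infos=[c])
        for c in components
    ]
    for record in records:
        # Link every record to all of its sibling records by name.
        record.related_to = [r.name for r in records if r is not record]
    return records


# Invented placeholder values, for illustration only.
components = [
    TextInfo("lang-1", "holder A", "semi-automatic", ["annotator 1"]),
    TextInfo("lang-2", "holder B", "automatic"),
    TextInfo("lang-3", "holder C", "manual", ["annotator 2"]),
]

single = option_1("Example Parallel Treebank", components)
separate = option_2("Example Parallel Treebank", components)
```

In this sketch, option 1 keeps everything in one record (`single`), while option 2 yields three records (`separate`) whose connection exists only through the `related_to` links.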

How do you recommend that we proceed? Are there other, better solutions to describing parallel treebanks that we have overlooked? 

 

Kind regards,

Gyri Smørdal Losnegaard

 

Tags: parallel treebanking corpora multilingual resource

Discussion 4 answers

  • Answer by pennyl67 on Oct 09, 2012 at 11:40

    Hi Gyri!

    I would consider this a "complex resource", a case which is still pending in the current version.

    The first option you suggest, adding a separate "corpusTextInfo" for each language module, makes it easier to describe the peculiarities of each of them; however, given that the "lingualityInfo" and the "languageInfo" are in the textInfo, you lose the multilinguality dimension. The second option has the same problem (i.e. losing the multilinguality dimension).

    What we have done in a similar case, the INTERA corpus, is to create one metadata record, using the sizePerLanguage to give the minimal information for each language.

    In upcoming versions, the situation will be improved by handling complex resources.

     

    Cheers,

    Penny

    for the metadata team

  • Answer by Gyri Smørdal Losnegaard on Oct 09, 2012 at 12:11

    Dear Penny,

    Thank you for the prompt reply, I'm glad to hear the issue is pending. We'll await the improved version, then.

    Best,

    Gyri

  • Answer by Gyri Smørdal Losnegaard on Oct 16, 2012 at 14:41

    A quick follow-up: When can we expect a META-SHARE version handling complex resource descriptions to be released? 

    The parallel treebanks mentioned above are hosted by the INESS project (INESS is a language-independent system for building, accessing and exploiting treebanks), and we are considering a metadata solution based on harvesting metadata from META-SHARE. However, due to project collaboration we must provide this solution before META-NORD ends in early 2013. This means we must consider other alternatives if the improved META-SHARE version is too far away.

    For the same reasons we are also anxious to know what is happening with respect to LRT profile templates (also a pending issue, cf. post #29).

    Many thanks!
