Description

Collective Elaboration of a Coreference Annotated Corpus for Portuguese Texts

Task description

The general objective of the task is the collective elaboration of a Portuguese corpus annotated for nominal coreference. The idea is to call for annotation teams, each of which submits its own corpus for annotation. Each annotation team should have at least 3 annotators. We will organize the annotation process with the aid of annotation tools and then check annotation concordance within and between groups. The task will use two resources: CORP and CorrefVisual. The first is a coreference resolution toolkit for Portuguese; the second is a resource for the manual correction of CORP’s output. As a result, we hope to obtain an annotated corpus that can be used in NLP tasks. The main contribution of this study is the availability of a coreference corpus for computational and linguistic studies.

The coordinating team is going to be responsible for:

  • Providing the annotation tools;
  • Selecting the participant teams;
  • Training the participant teams for the annotation task;
  • Analysing annotation concordance within and between teams (Kappa);
  • Generating the reference annotated corpus.

Each participant team will submit a corpus of its interest to be annotated. The teams will receive tools and instructions for the annotation task and will annotate a set of texts using the provided annotation tools.

Activities

The coreference annotation task consists of the following steps:
  • Corpus Submission instructions:
    • The corpus must be in Portuguese;
    • The corpus should be composed of 30 texts, each containing around 1200 tokens;
    • The submission must include a statement of the reason(s) for the corpus choice, including current related studies;
    • The submission must confirm the availability of an annotation team composed of at least 3 participants;
    • The participants will annotate texts from their own corpus as well as texts submitted by other teams;
    • The entire submission process will take place via this form.
  • Training Phase: The coordinators will provide the annotation tools. The participants will annotate a sample and submit their annotation. In this phase, the coordinators will train the participants and answer any questions about the task.
  • Corpus Annotation: In this stage, we will distribute the corpus to the teams, who will then work on the annotation. It is important to clarify some aspects:
    • Each team will receive a set of at most 30 texts to be annotated;
    • Each team has to submit the revised texts to the coordinators;
    • The annotation will be performed using the aforementioned CorrefVisual tool.
  • Concordance analysis: After the annotation task, the organizers will analyse the annotation concordance, calculating Kappa at the intra-team and inter-team levels. The results will be published at the conference.
  • Annotated Corpus Generation: The resulting gold annotated corpus will be based on a majority voting scheme over the participants’ annotations.
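
To make the last two steps more concrete, the following is a minimal sketch of how concordance and majority voting could be computed. It is not the organizers' actual procedure. The task description only says "Kappa", so the choice of Fleiss' kappa (a variant suited to three or more annotators) is an assumption, as are the unit of annotation (one decision per candidate mention pair), the label names "coref" and "no", and the function names fleiss_kappa and majority_vote. How ties are resolved is not specified in the task description; here they are simply flagged.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list of labels (one per annotator).

    Assumption: every item was labelled by the same number of annotators.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("each item needs the same number (>= 2) of annotations")

    category_totals = Counter()
    per_item_agreement = []
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Proportion of annotator pairs that agree on this item.
        agreeing_pairs = sum(c * (c - 1) for c in counts.values())
        per_item_agreement.append(agreeing_pairs / (n_raters * (n_raters - 1)))

    observed = sum(per_item_agreement) / n_items
    total = n_items * n_raters
    # Agreement expected by chance, from the overall label distribution.
    expected = sum((c / total) ** 2 for c in category_totals.values())
    if expected == 1.0:
        # Only one label was ever used; kappa is undefined, treated here as 1.0.
        return 1.0
    return (observed - expected) / (1 - expected)


def majority_vote(labels):
    """Return the most frequent label, or None when the top labels are tied."""
    ranked = Counter(labels).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie: left for manual adjudication (hypothetical policy)
    return ranked[0][0]


# Hypothetical data: 4 candidate mention pairs, each judged by 3 annotators.
ratings = [
    ["coref", "coref", "coref"],
    ["coref", "coref", "no"],
    ["no", "no", "no"],
    ["no", "coref", "no"],
]
print(round(fleiss_kappa(ratings), 3))          # 0.333
print([majority_vote(item) for item in ratings])  # ['coref', 'coref', 'no', 'no']
```

With an odd number of annotators and two labels, as in this example, ties cannot occur; with more labels or an even number of annotators, a tie-breaking policy would have to be defined.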

Schedule

The activity schedule for this proposal is presented below.
Activity                                             Timetable        Responsible
Corpus submission                                    01/03 to 27/03   Teams
Analysis of submitted corpus and selection of teams  28/03 to 03/04   Committee
Training phase                                       04/04 to 17/04   Teams
Corpus annotation                                    18/04 to 31/05   Teams
Kappa calculation                                    01/06 to 16/06   Committee
Annotated corpus generation                          17/06 to 30/06   Committee
Camera ready submissions due                         01/07            Committee
IberEval Workshop                                    19/09