U-Compare E-txt2DB: Giving structure to unstructured data
Etxt2DB is a framework for specifying and executing Entity Recognition (ER) programs. These programs accept as input a text containing potentially interesting entities to be extracted and produce the input text annotated with the recognized entities.
Etxt2DB operates in two distinct phases. First, the training phase creates a classification model from a given ER technique and one or more resources that guide its creation. Examples of such resources are dictionaries for rule-based ER techniques and training data for statistical learning techniques (e.g., Conditional Random Fields). Second, in the execution phase, a previously created classification model receives plain text as input and produces annotations corresponding to the recognized entities.
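The two phases can be illustrated with a minimal Python sketch of a dictionary-based recognizer. This is not Etxt2DB code and does not use its specification language: the names `train`, `execute`, `annotate` and `Annotation`, the regular-expression "model", and the inline tag format are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    start: int   # character offset where the entity begins
    end: int     # character offset just past the entity
    label: str   # entity type, e.g. "CITY"
    text: str    # surface form found in the input


def train(dictionary):
    """Training phase (hypothetical): turn a label -> terms dictionary into a model.

    The "model" is simply one compiled regular expression per label.
    """
    model = {}
    for label, terms in dictionary.items():
        if not terms:
            continue  # nothing to learn for this label
        # Longest terms first, so a longer entry wins over its own prefix.
        ordered = sorted(terms, key=len, reverse=True)
        pattern = r"\b(?:" + "|".join(re.escape(t) for t in ordered) + r")\b"
        model[label] = re.compile(pattern)
    return model


def execute(model, text):
    """Execution phase (hypothetical): apply a trained model to plain text."""
    annotations = []
    for label, pattern in model.items():
        for match in pattern.finditer(text):
            annotations.append(
                Annotation(match.start(), match.end(), label, match.group())
            )
    return sorted(annotations, key=lambda a: a.start)


def annotate(text, annotations):
    """Return the input text with inline tags around each recognized entity."""
    pieces, pos = [], 0
    for a in annotations:
        if a.start < pos:
            continue  # skip an annotation that overlaps the previous one
        pieces.append(text[pos:a.start])
        pieces.append(f"<{a.label}>{a.text}</{a.label}>")
        pos = a.end
    pieces.append(text[pos:])
    return "".join(pieces)
```

In this sketch the model is built once by `train` and can then be applied to any number of texts through `execute`, which mirrors the separation between the training and execution phases described above. A statistical technique such as Conditional Random Fields would replace the dictionary with labeled training data, but the two-phase shape would stay the same.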
The Etxt2DB framework consists of a software layer, built on top of Minorthird and Lingpipe, that offers a command-like specification language. Existing Machine Learning Java APIs (such as Minorthird and Lingpipe) provide implementations of Entity Recognition techniques, but some developers of ER applications do not want to deal with the implementation details of those techniques. Instead, they want to focus on the choice of the technique to be used, the resources used in the process (e.g., dictionaries), and a good set of features that helps the ER program make appropriate decisions. The objective of the Etxt2DB specification language is to make the development and tuning of ER programs easier for developers who are mainly concerned with these topics.
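To make the notion of a "set of features" concrete, the following Python sketch computes a few token-level features of the kind commonly used by statistical ER techniques. The function name `token_features`, the feature names, and the `<BOS>`/`<EOS>` boundary markers are assumptions for illustration; they are not taken from Etxt2DB, Minorthird, or Lingpipe.

```python
def token_features(tokens, i):
    """Describe the token at position i with simple, hypothetical features."""
    token = tokens[i]
    return {
        "lower": token.lower(),                         # normalized word form
        "is_capitalized": token[:1].isupper(),          # e.g. proper nouns
        "is_all_caps": token.isupper(),                 # e.g. acronyms
        "has_digit": any(c.isdigit() for c in token),   # e.g. dates, codes
        "suffix3": token[-3:].lower(),                  # last three characters
        # Neighboring words give the classifier some context.
        "prev_lower": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next_lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }
```

Tuning an ER program largely comes down to adding, removing, or refining entries like these and retraining, which is the kind of decision the specification language is meant to let developers concentrate on.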
In the context of the METANET project, the goal was to build a component-generator tool that encapsulates Etxt2DB. In the training phase, this tool accepts a training data set as input and produces both a classification model and a U-Compare component that is able to interpret that model. In the execution phase, the generated component is loaded into the U-Compare platform and is then ready to recognize entities in text.