QT21 Domain Specific Human Post-Edited data set

Last update: 2018-03-02

http://www.qt21.eu/

https://lindat.mff.cuni.cz/

ID:

http://hdl.handle.net/11372/LRT-2390

Training of Automatic Post-editing and Quality Estimation components / Quality Estimation / Error Analysis.
Set of 195,000 domain-specific Human Post-Edited (HPE) quadruplets for four language pairs and six translation engines. Each quadruplet consists of (source, reference, target, HPE). The domain for En-De and En-Cz is IT; the domain for En-Lv and De-En is Pharma. Six translation engines were used to produce the targets that were post-edited: PBMT from KIT and NMT (using Nematus) for En-De, PBMT from KIT for De-En, PBMT from CUNI for En-Cz, and both PBMT and NMT systems from Tilde for En-Lv. For each language pair, a single set of source segments was used as input to the different translation engines. The De-En and En-Cz engines provided 45,000 target segments each, the two En-De engines provided 30,000 target segments each, and the two En-Lv engines provided 22,500 target segments each. En-De and De-En HPEs were collected from professional translators at Text&Form, En-Lv HPEs from professional translators at Tilde, and En-Cz HPEs from professional translators at Traductera.
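The structure and counts described above can be sketched in a few lines of Python. This is only an illustration: the class name `HpeQuadruplet`, its field names, the engine labels, and the helper `segments_per_language_pair` are hypothetical and do not reflect the file layout of the released data. The per-engine figures are copied from the description, and the sketch checks that they add up to the stated total.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HpeQuadruplet:
    """One record as described above (field names are assumptions)."""
    source: str     # source-language segment
    reference: str  # reference translation
    target: str     # machine translation output
    hpe: str        # human post-edit of the target


# (language pair, engine label) -> number of target segments,
# taken from the description; the engine labels are illustrative.
SEGMENTS_PER_ENGINE = {
    ("En-De", "PBMT (KIT)"): 30_000,
    ("En-De", "NMT (Nematus)"): 30_000,
    ("De-En", "PBMT (KIT)"): 45_000,
    ("En-Cz", "PBMT (CUNI)"): 45_000,
    ("En-Lv", "PBMT (Tilde)"): 22_500,
    ("En-Lv", "NMT (Tilde)"): 22_500,
}


def segments_per_language_pair(counts):
    """Sum the per-engine counts for each language pair."""
    totals = {}
    for (pair, _engine), n in counts.items():
        totals[pair] = totals.get(pair, 0) + n
    return totals


TOTAL_SEGMENTS = sum(SEGMENTS_PER_ENGINE.values())
```

Under these figures, the six engines together account for the 195,000 post-edited segments, and grouping by language pair gives the number of post-edits available per direction.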

IMPORTANT LEGAL NOTICE (This dataset is provided under the following terms of use)
TAUS Terms of Use (https://lindat.mff.cuni.cz/repository/xmlui/page/licence-TAUS_QT21).
TAUS grants to QT21 User access to the WMT Data Set with the following rights:
i) the right to use the target side of the translation units into a commercial product, provided that QT21 User may not resell the WMT Data Set as if it is its own new translation;
ii) the right to make Derivative Works; and
iii) the right to use or resell such Derivative Works commercially and for the following goals:
i) research and benchmarking;
ii) piloting new solutions; and
iii) testing of new commercial services.

Distribution

Availability

Available - Restricted Use

Licence

Other

Distribution Access/Medium: Downloadable

Contact Person

Christian Dugast

text

Bilingual text corpus

Languages

English, Latvian

Linguality

Linguality type: Bilingual

Multi-linguality type: Parallel

Size

45,000 segments

Domains

Pharma

Bilingual text corpus

Languages

German, English

Linguality

Linguality type: Bilingual

Multi-linguality type: Parallel

Size

45,000 segments

Domains

Pharma

Bilingual text corpus

Languages

English, Czech

Linguality

Linguality type: Bilingual

Multi-linguality type: Parallel

Size

45,000 segments

Domains

Information Technology

Bilingual text corpus

Languages

English, German

Linguality

Linguality type: Bilingual

Size

30,000 segments

Domains

Information Technology

Metadata

Created: 13/12/2017

Last Updated: 02/03/2018

Metadata Creator

Kanella Pouli

Usage

Foreseen Use - NLP Applications

Use NLP Specific: Machine Translation

Actual Use - NLP Applications

Use NLP Specific: Machine Translation
