Hungarian Parliamentary Speech and Aligned Text Selection Database

ID: 216

Database of recordings and official transcripts of Hungarian parliamentary speeches. The recordings are segmented at speech pauses, which do not necessarily correspond to sentence boundaries. The official transcripts are not completely accurate, since the parliamentary transcribers correct most grammatical mistakes and speech disfluencies. Hence, an automatic speech recognizer was used to select only those segments where there is a high match between the automatic and manual transcriptions. The database therefore comprises only segments that are considered to have a reliable transcription. It can be applied in speech technology research, in phonetic and phonological research, and for developing and testing speech and speaker recognition systems.

Distribution

Availability

Available - Unrestricted Use

Licence

CC-BY

Distribution Access/Medium: CD-ROM

Licensors:

Henk Tamás

IPR Holder

Budapest University of Technology and Economics

Contact Person

Péter Mihajlik

text
audio

Monolingual text corpus

Languages

Hungarian

Linguality

Linguality type: Monolingual

Size

134 Mb

Character encoding

UTF-8 (100,000,000 Phonemes)

Creation

Creation mode: Automatic

Original Sources

http://www.parlament...

Monolingual audio corpus

Languages

Hungarian

Linguality

Linguality type: Monolingual

Size

204 Gb

Audio duration

1,898 Hours

Annotation

Speech Annotation - Orthographic Transcription

Segmentation level: Other

Format: plain txt

Annotation Mode: Mixed (The official transcripts were downloaded from the website of the Hungarian Parliament. These transcripts are not completely accurate, since the parliamentary transcribers correct most grammatical mistakes and speech disfluencies. Hence, the recordings were transcribed and segmented by an automatic speech recognizer, and the output was compared with the downloaded transcripts. The text corpus comprises only those segments where the match (letter-based accuracy) between the automatic and manual transcriptions is over 98%. The audio files are not part of this shared corpus; they can be downloaded freely from the Hungarian Parliament website. The size data above applies to the selected part of the speeches.)

Annotation Tools:

voXerver ASR engine, other self-developed processing tools
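
The selection rule described under Annotation Mode (keep a segment only if the letter-based accuracy between the automatic and the manual transcription is over 98%) can be illustrated with a minimal sketch. This is not the creators' tooling: the function names are hypothetical, letter-based accuracy is assumed here to mean 1 minus the character-level edit distance divided by the length of the official transcript, and no text normalization (case, punctuation) is applied because the record does not specify any.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insert, delete, substitute all cost 1)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or match at cost 0
            ))
        previous = current
    return previous[-1]


def letter_accuracy(manual: str, automatic: str) -> float:
    """Assumed definition: 1 - edit_distance / len(manual), floored at 0."""
    if not manual:
        return 1.0 if not automatic else 0.0
    return max(0.0, 1.0 - levenshtein(manual, automatic) / len(manual))


def select_segments(pairs, threshold=0.98):
    """Keep (manual, automatic) pairs whose accuracy is strictly above threshold."""
    return [(m, a) for m, a in pairs if letter_accuracy(m, a) > threshold]


if __name__ == "__main__":
    # Toy segments, not taken from the corpus.
    pairs = [
        ("a" * 100, "a" * 99 + "b"),  # 1 error in 100 letters -> accuracy 0.99
        ("abcdefghij", "abcdefghix"),  # 1 error in 10 letters -> accuracy 0.90
    ]
    for manual, automatic in select_segments(pairs):
        print(len(manual), round(letter_accuracy(manual, automatic), 3))
```

With the toy input above, only the first pair passes the threshold. Dividing by the length of the manual transcript treats the official text as the reference; a real pipeline might normalize by the longer string or by alignment length instead.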

Content

Speech items: Free Speech

Noise Level: Low

Setting

Naturality: Planned

Conversational type: Monologue

Scenario: Other

Audience: Some

Interactivity: Non Interactive

Audio Formats

video/mpeg

Compression loss: True

Compression name: Mpeg

Compression: False

Recording quality: High

Quantization: 16

Number of tracks: 1

Signal encoding: Other

Creation

Original Sources

http://www.parlament...

Metadata

Created: 10/07/2012

Last Updated: 23/01/2013

Source: METANET4U

Metadata Creator

Gellért Sárosi

Version

Version: 2.0
