An efficient unified extraction algorithm for bilingual data

Christoph Tillmann; Sanjika Hewavitharana

INTERSPEECH 2011

Conference paper

01 Dec 2011

Abstract

This paper presents a unified algorithm for aligning sentences with their translations in bilingual data. Sentence alignment is treated as a large-scale pattern recognition problem, similar to finding the word sequence that corresponds to an acoustic input signal in automatic speech recognition (ASR). The algorithm gains efficiency from related work on dynamic programming (DP) search for speech recognition [1]. A stack-based search is parametrized in a novel way so that the unified algorithm can be applied to types of data previously handled by separate implementations: the extracted text chunk pairs can be sub-sentential pairs, one-to-one sentence pairs, or many-to-many sentence-level pairs. The one-stage search is carried out in a single run over the data. With unified beam-search candidate pruning, the algorithm is very efficient: it avoids any document-level prefiltering and uses less restrictive sentence-level filtering. Results are presented on a Russian-English and a Spanish-English extraction task. Based on a simple word-based scoring model, text chunk pairs are extracted out of several trillion candidates. Copyright © 2011 ISCA.
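The abstract does not spell out the search or the scoring model, so the following is only a minimal illustrative sketch of the general idea: a dynamic programming alignment over sentence positions with beam pruning and a toy word-based score. The bead set `BEADS`, the penalty `SKIP_PENALTY`, the functions `chunk_score` and `align`, and the dictionary-style `lexicon` are all assumptions made for illustration and are not taken from the paper.

```python
# Hypothetical bead types: (source sentences consumed, target sentences consumed).
BEADS = [(1, 1), (1, 0), (0, 1), (2, 1), (1, 2)]
SKIP_PENALTY = -1.0  # assumed score for leaving a sentence unaligned


def chunk_score(src_chunk, tgt_chunk, lexicon):
    """Toy word-based score: fraction of source words that have a
    lexicon translation somewhere in the target chunk."""
    if not src_chunk or not tgt_chunk:
        return SKIP_PENALTY
    src_words = [w for s in src_chunk for w in s.lower().split()]
    tgt_words = {w for s in tgt_chunk for w in s.lower().split()}
    if not src_words:
        return 0.0
    hits = sum(1 for w in src_words if lexicon.get(w, set()) & tgt_words)
    return hits / len(src_words)


def align(src, tgt, lexicon, beam=5):
    """Beam-pruned DP over states (i, j) = sentences consumed on each side.
    Returns (total score, list of (source chunk, target chunk) pairs)."""
    n, m = len(src), len(tgt)
    # hyps[(i, j)] = (best score so far, predecessor state)
    hyps = {(0, 0): (0.0, None)}
    # Every bead increases i + j, so visiting states by diagonal d = i + j
    # guarantees a state's score is final before it is expanded.
    for d in range(n + m):
        frontier = [(score, state) for state, (score, _) in hyps.items()
                    if state[0] + state[1] == d]
        frontier.sort(reverse=True)
        for score, (i, j) in frontier[:beam]:  # beam pruning
            for di, dj in BEADS:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                new = score + chunk_score(src[i:ni], tgt[j:nj], lexicon)
                if (ni, nj) not in hyps or new > hyps[(ni, nj)][0]:
                    hyps[(ni, nj)] = (new, (i, j))
    # Trace back from the final state to recover the chunk pairs.
    pairs, state = [], (n, m)
    while hyps[state][1] is not None:
        prev = hyps[state][1]
        pairs.append((src[prev[0]:state[0]], tgt[prev[1]:state[1]]))
        state = prev
    pairs.reverse()
    return hyps[(n, m)][0], pairs


if __name__ == "__main__":
    lexicon = {"the": {"das"}, "house": {"haus"}, "is": {"ist"},
               "small": {"klein"}, "i": {"ich"}, "like": {"mag"},
               "it": {"es"}}
    src = ["the house is small", "i like it"]
    tgt = ["das haus ist klein", "ich mag es"]
    print(align(src, tgt, lexicon))
```

In this sketch the beam keeps only the best few hypotheses per diagonal, which bounds the work per step at the cost of possibly missing the globally best alignment. Restricting or extending `BEADS` is one way a single search routine could cover one-to-one as well as many-to-many chunk pairs.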
