Scalable Cross-lingual Treebank Synthesis for Improved Production Dependency Parsers

Yousef El-Kurdi; Hiroshi Kanayama; Efsun Kayi; Vittorio Castelli; Todd Ward; Hans Florian

COLING 2020

Short paper

07 Dec 2020

Abstract

We present scalable dependency treebank synthesis techniques that exploit advances in language representation modeling, which in turn leverage vast amounts of unlabeled, general-purpose multilingual text. We introduce a data augmentation technique that uses the synthetic treebanks to improve production-grade parsers. The synthetic treebanks are generated using a state-of-the-art biaffine parser enhanced with pretrained Transformer models, such as Multilingual BERT (M-BERT). The new parser improves labeled attachment score (LAS) by up to two points on seven languages. Trend lines of LAS as the augmented treebank size scales show performance surpassing that of production models trained on the originally annotated Universal Dependencies (UD) treebanks.
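For readers unfamiliar with the metric reported above, LAS is the percentage of tokens that receive both the correct head and the correct dependency label; the unlabeled variant (UAS) checks the head only. The following is a minimal sketch of that computation, not the evaluation code used in the paper. The function name `attachment_scores`, the `Arc` type, and the toy sentence are hypothetical and chosen only for illustration, and the sketch assumes gold and predicted analyses share the same tokenization.

```python
from typing import List, Tuple

# One token's analysis: (index of its head, dependency label).
# Head index 0 denotes the artificial root. Hypothetical type for illustration.
Arc = Tuple[int, str]


def attachment_scores(gold: List[Arc], pred: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) as percentages over token-aligned analyses."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted analyses must cover the same tokens")
    if not gold:
        raise ValueError("cannot score an empty analysis")
    # UAS counts tokens whose predicted head matches the gold head.
    head_ok = sum(g[0] == p[0] for g, p in zip(gold, pred))
    # LAS counts tokens whose head AND label both match.
    both_ok = sum(g == p for g, p in zip(gold, pred))
    n = len(gold)
    return 100.0 * head_ok / n, 100.0 * both_ok / n


# Toy example: "She reads the book"
gold = [(2, "nsubj"), (0, "root"), (4, "det"), (2, "obj")]
# Token 3 gets the wrong head; token 4 gets the right head but the wrong label.
pred = [(2, "nsubj"), (0, "root"), (2, "det"), (2, "nmod")]

uas, las = attachment_scores(gold, pred)
print(f"UAS = {uas:.1f}, LAS = {las:.1f}")  # UAS = 75.0, LAS = 50.0
```

In this reading, a gain of "two points" means two percentage points of tokens with a correct head and label. Full evaluations also have to align system and gold tokenization, which this sketch deliberately leaves out.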

Conference paper