Short paper

Scalable Cross-lingual Treebank Synthesis for Improved Production Dependency Parsers


We present scalable dependency treebank synthesis techniques that exploit advances in language representation models trained on vast amounts of unlabeled, general-purpose multilingual text. We introduce a data augmentation technique that uses the synthetic treebanks to improve production-grade parsers. The synthetic treebanks are generated with a state-of-the-art biaffine parser enhanced with pretrained Transformer models such as Multilingual BERT (M-BERT). The new parser improves LAS by up to two points across seven languages. LAS trend lines show that, as the augmented treebank grows, performance surpasses that of production models trained only on the originally annotated Universal Dependencies (UD) treebanks.
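The augmentation loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `parse_fn` stands in for the biaffine parser with M-BERT encodings, `toy_parse` is a trivial placeholder, and the function names (`synthesize_trees`, `augment`) are hypothetical.

```python
import random

def synthesize_trees(sentences, parse_fn):
    """Run a trained parser over unlabeled sentences to produce
    silver-standard dependency trees (rows of (id, form, head, deprel))."""
    return [parse_fn(s) for s in sentences]

def augment(gold, silver, ratio=1.0, seed=0):
    """Mix the gold treebank with a sampled fraction of synthetic trees.

    ratio controls how large the synthetic portion is relative to gold,
    so the augmented treebank size can be scaled up gradually.
    """
    rng = random.Random(seed)
    k = min(len(silver), int(ratio * len(gold)))
    return gold + rng.sample(silver, k)

def toy_parse(sentence):
    # Placeholder parser: attach the first token to the root (head 0)
    # and every other token to the first token (head 1).
    tokens = sentence.split()
    return [(i + 1, tok, 0 if i == 0 else 1, "dep")
            for i, tok in enumerate(tokens)]

# Example: one gold tree plus one synthesized tree.
gold = [toy_parse("the gold tree")]
silver = synthesize_trees(["an unlabeled sentence"], toy_parse)
augmented = augment(gold, silver, ratio=1.0)
```

In the paper's setting, `parse_fn` would be the M-BERT-enhanced biaffine parser applied to unlabeled multilingual text, and the resulting CoNLL-U-style trees would be concatenated with the gold UD training data.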