Sandhikosh: A benchmark corpus for evaluating Sanskrit sandhi tools

Shubham Bhardwaj; Neelamadhav Gantayat; Nikhil Chaturvedi; Rahul Garg; Sumeet Agarwal

LREC 2018

Conference paper

07 May 2018

Abstract

Sanskrit is an ancient Indian language. Several important texts that remain of interest to people all over the world were written in Sanskrit. Sanskrit grammar has a precise and complete specification, given in the text Aṣṭādhyāyī by Pāṇini. This has led to the development of a number of Sanskrit computational linguistics tools for processing and analyzing Sanskrit texts. Unfortunately, there has been no effort to standardize and critically validate these tools. In this paper, we develop a Sanskrit benchmark called SandhiKosh to evaluate the completeness and accuracy of Sanskrit sandhi tools. We present the results of this benchmark on the three most prominent Sanskrit tools and demonstrate that these tools have substantial scope for improvement. This benchmark will be freely available to researchers worldwide, and we hope it will help everyone working in this area evaluate and validate their tools.
