DCS and SanskritTagger

DCS is an extract from a larger database that stores the linguistic data created with the program SanskritTagger. SanskritTagger, whose development started in 1999, is a part-of-speech (POS) and lexical tagger for post-Vedic Sanskrit. The program can analyze unpreprocessed digital Sanskrit texts lexically and morphologically. It uses a Sandhi analyzer, a morphological database, and a lexical database to generate proposals for the analysis of strings of Sanskrit text. Based on statistical key values extracted from the program database, it selects the most probable of these proposals, which the user can then store in the program database. In this way, the Sanskrit analyzer can be trained: a larger text corpus yields better statistical key values, and better values speed up the analysis (see note 1). The training process can be sketched as follows:
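
As a rough illustration of this loop, the following Python sketch ranks candidate analyses by how often each has been confirmed. It is not SanskritTagger's actual model: the class ToyTagger, its methods, and the plain frequency count are assumptions made for this example, and the two candidate analyses of "bhavati" are supplied by hand rather than generated by a Sandhi analyzer or a database.

```python
from collections import Counter


class ToyTagger:
    """Toy stand-in for a statistically trained tagger (hypothetical)."""

    def __init__(self):
        # How often each (surface form, analysis) pair has been confirmed.
        self.counts = Counter()

    def rank(self, form, proposals):
        """Order candidate analyses of `form`, most often confirmed first."""
        # sorted() is stable, so unseen proposals keep their input order.
        return sorted(proposals, key=lambda analysis: -self.counts[(form, analysis)])

    def confirm(self, form, analysis):
        """Store a user-approved analysis, which updates the statistics."""
        self.counts[(form, analysis)] += 1


tagger = ToyTagger()
form = "bhavati"
# Hand-written candidates; a real system would generate these automatically.
proposals = ["bhū: verb, 3rd sg. present", "bhavat: participle, loc. sg."]

first_guess = tagger.rank(form, proposals)[0]   # no statistics yet: input order
tagger.confirm(form, proposals[1])              # the user stores the other reading
second_guess = tagger.rank(form, proposals)[0]  # the stored reading now ranks first
```

Each stored analysis shifts the counts, so later rankings of the same form favor the confirmed reading. A real tagger would presumably draw on much richer statistics, such as the surrounding context.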

In addition, SanskritTagger has some basic functions for the semantic annotation of Sanskrit texts. At the moment, approximately 30,000 lexical units in the SanskritTagger database are enriched with semantic information. Further areas of SanskritTagger that are likewise not reproduced in the DCS are the thematic and syntactic annotation of texts.

The DCS is created automatically from the most recent version of the SanskritTagger database. The conversion translates the Microsoft Access database used by SanskritTagger into the MySQL database used by DCS, and it can be performed with built-in commands of SanskritTagger. As a consequence, additions and corrections are entered only into the SanskritTagger database, from which a new version of DCS can then be generated.
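
The export commands themselves are not described here. Purely as an illustration of what such a one-way conversion involves, the Python sketch below renders rows as MySQL INSERT statements. The table name, column names, and helper functions are hypothetical and do not reflect the actual schema of SanskritTagger or DCS.

```python
def mysql_literal(value):
    """Render a Python value as a MySQL literal (None, numbers, strings only)."""
    if value is None:
        return "NULL"
    if isinstance(value, (int, float)):
        return str(value)
    # Escape backslashes first, then single quotes.
    text = str(value).replace("\\", "\\\\").replace("'", "\\'")
    return f"'{text}'"


def insert_statement(table, row):
    """Build one INSERT statement from a dict mapping column names to values."""
    columns = ", ".join(f"`{name}`" for name in row)
    values = ", ".join(mysql_literal(value) for value in row.values())
    return f"INSERT INTO `{table}` ({columns}) VALUES ({values});"


# Hypothetical row, as it might be read from the source database.
row = {"id": 1, "lemma": "deva", "meaning": None}
statement = insert_statement("lexicon", row)
```

Because the export runs in one direction only, a correction made in the source rows simply appears in the next generated set of statements. A production export would more likely use a database driver with parameterized queries than hand-built SQL strings.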

Most texts contained in SanskritTagger were entered and checked by the author of this site (Oliver Hellwig), who is responsible for any errors occurring in these texts. The selection of texts therefore reflects my personal areas of interest, which means that the corpus is neither exhaustive nor representative of the whole body of Sanskrit literature. In addition, some texts are incomplete because I was only interested in parts of them. I hope to improve both points in future releases of DCS. More details about the people who digitized and managed each text can be found on the page "Bibliographical details".

SanskritTagger is now available as freeware from the ind.senz website.

1 For details regarding the program SanskritTagger and the automatic linguistic analysis of Sanskrit, refer to the following papers:
Oliver Hellwig: SanskritTagger, a stochastic lexical and POS tagger for Sanskrit. In: Proceedings of the First International Sanskrit Computational Linguistics Symposium, pp. 37-46.
Oliver Hellwig: Performance of a lexical and POS tagger for Sanskrit. In: Proceedings of the Fourth International Sanskrit Computational Linguistics Symposium, pp. 162-172.