Quickstart

This page covers the most common use cases for the sanskrit package.

Building the database

The sanskrit package is most useful when it can work with linguistic data. Not all of the code in the sanskrit package requires access to this data, but the most interesting code does. This data is managed internally as an SQL database, but you don’t need to know any SQL to work with it.

The database is initialized from a set of special data files. To get that data, you can clone the sanskrit/data repository from GitHub:

git clone https://github.com/sanskrit/data.git

and run a script to prepare the data for use:

cd data && python bin/make_data.py && cd -

After you run these two commands, the data/all-data folder will contain CSV files that the sanskrit package can process directly. sanskrit.setup loads these files into a database, which makes the data they contain easier to query.
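Before loading the data, it can help to confirm that the folder really contains CSV files. The following is a minimal sketch that uses only the Python 3 standard library and does not depend on the sanskrit package. The helper name preview_csv_files is hypothetical, and the sketch assumes the generated files end in .csv and are UTF-8 encoded; it makes no assumption about their column layout.

```python
import csv
import os


def preview_csv_files(data_path, max_rows=2):
    """Map each .csv file under data_path to its first max_rows rows."""
    previews = {}
    for folder, _subdirs, files in os.walk(data_path):
        for name in sorted(files):
            # Assumption: the generated data files use the .csv extension.
            if not name.endswith('.csv'):
                continue
            path = os.path.join(folder, name)
            rows = []
            with open(path, newline='', encoding='utf-8') as f:
                for i, row in enumerate(csv.reader(f)):
                    if i >= max_rows:
                        break
                    rows.append(row)
            # Key by the path relative to data_path so nested files stay distinct.
            previews[os.path.relpath(path, data_path)] = rows
    return previews
```

Calling preview_csv_files('/path/to/sanskrit/data/all-data') returns a dictionary keyed by file name. An empty dictionary suggests that the preparation script has not run yet or that the path is wrong.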

Currently, the easiest way to create this database is to run a script like the following. You only need to run this once:

from sanskrit import context, setup

# Path to the 'all-data' folder you produced above.
data_path = '/path/to/sanskrit/data/all-data'
# Absolute path to where you want your sqlite database to be.
sqlite_path = '/path/to/your/data.sqlite'

ctx = context.Context({
    'DATABASE_URI': 'sqlite:///' + sqlite_path,
    'DATA_PATH': data_path
})
setup.run(ctx)

Here, ctx is a Context, which contains information like the database URI and the locations of various data files.
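To check that the setup script produced a populated database, you can inspect the SQLite file directly. This is a minimal sketch using the standard sqlite3 module; the helper name list_tables is hypothetical, and the sketch does not assume any particular table names, since the schema is an internal detail of the sanskrit package.

```python
import sqlite3


def list_tables(sqlite_path):
    """Return the names of all tables in the SQLite file at sqlite_path."""
    # Note: sqlite3.connect creates an empty file if sqlite_path does not exist.
    conn = sqlite3.connect(sqlite_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]
```

If list_tables(sqlite_path) returns an empty list after setup has run, the database URI and the path you are inspecting probably point to different files.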

Sounds and meter

TODO. For now, see sanskrit.sounds.

Sandhi rules

TODO. For now, see sanskrit.sandhi.

Analyzing words

The SimpleAnalyzer class provides a way to analyze Sanskrit words. By analyze we mean that the analyzer identifies the stem or root of the word (if applicable) and provides useful information about how it was inflected. By word we mean a finished Sanskrit pada (as opposed to a stem or root). For example, this:

from sanskrit import analyze
a = analyze.SimpleAnalyzer(ctx)
for item in a.analyze('anugacCati'):
    print(item)

would produce the following output:

Verb(134712, u'anugacCati')
Nominal(None, 'anugacCati')
Nominal(None, 'anugacCati')

where the first result is an inflected verb and the next two results are inflected participles with either masculine or neuter gender. You can inspect these results for more information. For example, this:

p = a.analyze('anugacCati')[1]
print(p.stem.name)
print(p.stem.root.name)
print(p.stem.mode.name)
print(p.stem.voice.name)

would produce the following output:

anugacCat
anugam
present
parasmaipada
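Different kinds of results may not expose the same attributes, so code that loops over every result can fail on a missing one. The sketch below collects the four attributes used above and returns None for any that are absent. The helpers attr_path and summarize are hypothetical and are not part of the sanskrit package. Which attributes a Verb result carries is not documented on this page, so the sketch does not assume any.

```python
def attr_path(obj, path, default=None):
    """Follow a dotted attribute path such as 'stem.root.name'."""
    for part in path.split('.'):
        obj = getattr(obj, part, None)
        if obj is None:
            # Stop early: the attribute is missing or explicitly None.
            return default
    return obj


def summarize(result):
    """Collect the attribute paths shown in this section into a dict."""
    paths = ('stem.name', 'stem.root.name', 'stem.mode.name', 'stem.voice.name')
    return {path: attr_path(result, path) for path in paths}
```

With the analyzer from this section, summarize(a.analyze('anugacCati')[1]) would gather the same four values that the print statements show, while a result without a stem would yield None for every path.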

Tagging text

The Tagger class provides a way to analyze all of the words in some Sanskrit passage. It does so by undoing the sandhi changes at word boundaries, then analyzing each of the resulting words for a good parse. Since natural language is ambiguous, the tagger won’t always produce “correct” output. Instead, it will make its best guess.

For example, this:

from sanskrit import tagger
t = tagger.Tagger(ctx)
for item in t.tag_segment('kAntAvirahaguruRA'):
    print(item.form)

would produce the following output:

Nominal(None, 'kAntA')
Nominal(None, 'viraha')
Nominal(None, 'guruRA')

Currently, the tagger runs greedily. Future versions of the tagger will be more sophisticated.
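To illustrate what a greedy strategy means in general, here is a toy segmenter that always takes the longest known prefix and never reconsiders that choice. It is a sketch, not the tagger's actual algorithm: the function greedy_segment and its lexicon argument are hypothetical, it ignores sandhi entirely, and the word list is a hand-written stand-in for the database.

```python
def greedy_segment(text, lexicon):
    """Split text by repeatedly taking the longest known prefix.

    Returns a list of words, or None if a greedy choice leads to a dead end.
    """
    words = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first and commit to the first match.
        for end in range(len(text), pos, -1):
            if text[pos:end] in lexicon:
                words.append(text[pos:end])
                pos = end
                break
        else:
            # No known word starts at pos; a greedy strategy cannot backtrack.
            return None
    return words


toy_lexicon = {'kAntA', 'viraha', 'guruRA'}
print(greedy_segment('kAntAvirahaguruRA', toy_lexicon))
# prints ['kAntA', 'viraha', 'guruRA']
```

The weakness of committing early shows up with a lexicon such as {'ab', 'abc', 'cd'} and the input 'abcd': the function takes 'abc' first, cannot match 'd', and returns None even though 'ab' followed by 'cd' is a valid split.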

Interactively tagging words

Even the best tagger is liable to make mistakes. The shell module makes it easy to fix these mistakes interactively. For example, this:

from sanskrit import shell

data = "Darmakzetre kurukzetre"
shell.run(ctx, data)

would produce the following output:

*--------------------------------------------------
*
* Interactive Sanskrit tagger
*
*--------------------------------------------------
1  : get alternatives for form 1
s  : re-split the chunk
pc : previous chunk
nc : next chunk
q  : quit
?  : help
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Darmakzetre kurukzetre

1 : ('Darmakzetre', 'nominal', u'Darmakzetra', u'm-7-s')

:

where the last line is a prompt for user input.

The shell is still in its early stages and isn’t very useful currently, but it does give a feel for how the tagger and the analyzer work.