API¶

This page documents the entire linguistic “stack,” from individual sounds (sanskrit.sounds) to paragraph-level morphosyntax (sanskrit.tagger). Lower-level modules are at the top of the page.

Context¶

sanskrit.context¶

Manages the package context. For details, see Context.

license:	MIT

class sanskrit.context.Context(config=None, connect=True)[source]¶

The package context. In addition to storing basic config information, such as the database URI or paths to various data files, a Context also constructs a Session class for connecting to the database.

You can populate a context in several ways. For example, you can pass a dict:

context = Context({'DATABASE_URI': 'sqlite:///data.sqlite'})

or a path to a Python module:

context = Context('project/config.py')

If you initialize a context from a module, note that only uppercase variables will be stored in the context. This lets you use lowercase variables as temporary values.

Config values are stored internally as a dict, so you can always just use ordinary dict methods:

context.config['FOO'] = 'baz'

Parameters:	config – an object to read from. If this is a string, treat config as a module path and load values from that module. Otherwise, treat config as a dictionary.
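The loading behaviour described above can be sketched roughly as follows. This is an illustration written for Python 3, not the package's actual code: load_config is a hypothetical helper, and executing the module file and keeping only its uppercase names is an assumption inferred from the description.

```python
import os
import tempfile


def load_config(config):
    """Hypothetical helper: build a settings dict from a dict or a module path."""
    if isinstance(config, str):
        # Treat the string as a path to a Python file and execute it.
        namespace = {}
        with open(config) as f:
            exec(compile(f.read(), config, 'exec'), namespace)
        # Keep only uppercase names; lowercase names act as temporary values.
        return {k: v for k, v in namespace.items() if k.isupper()}
    # Otherwise, treat the object as a dictionary.
    return dict(config or {})


# Demonstration with a throwaway config module.
with tempfile.NamedTemporaryFile('w', suffix='.py', delete=False) as tmp:
    tmp.write("base = 'sqlite:///'\nDATABASE_URI = base + 'data.sqlite'\n")

settings = load_config(tmp.name)
os.remove(tmp.name)
print(settings)
```

Here the lowercase variable base is used while the module runs but does not end up in the resulting settings.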

build()[source]¶: Build all data.

config = None¶: A dict of various settings. By convention, all keys are uppercase. These are used to create engine and session.

connect()[source]¶: Connect to the database.

create_all()[source]¶: Create tables for every model in sanskrit.schema.

drop_all()[source]¶: Drop all tables defined in sanskrit.schema.

engine = None¶: The Engine that underlies the session.

enum_abbr¶: Maps an ID or name to an abbreviation.

enum_id¶: Maps a name or abbreviation to an ID.

session = None¶: A Session class.

Sounds and sandhi¶

sanskrit.sounds¶

Code for checking and transforming Sanskrit sounds. This module also contains basic metrical functions (see sanskrit.sounds.meter() and sanskrit.sounds.num_syllables()).

All functions assume SLP1.

license:	MIT

sanskrit.sounds.ALL_SOUNDS = frozenset(["'", 'A', 'C', 'B', 'E', 'D', 'G', 'F', 'I', 'H', 'K', 'J', 'M', 'L', 'O', 'N', 'Q', 'P', 'S', 'R', 'U', 'T', 'W', 'Y', 'X', 'a', 'c', 'b', 'e', 'd', 'g', 'f', 'i', 'h', 'k', 'j', 'm', 'l', 'o', 'n', 'q', 'p', 's', 'r', 'u', 't', 'w', 'v', 'y', 'x', 'z', '~'])¶: All legal sounds, including anusvara, ardhachandra, and Vedic ‘L’.

sanskrit.sounds.ALL_TOKENS = frozenset(['\n', ' ', "'", 'A', 'C', 'B', 'E', 'D', 'G', 'F', 'I', 'H', 'K', 'J', 'M', 'L', 'O', 'N', 'Q', 'P', 'S', 'R', 'U', 'T', 'W', 'Y', 'X', 'a', 'c', 'b', 'e', 'd', 'g', 'f', 'i', 'h', 'k', 'j', 'm', 'l', 'o', 'n', 'q', 'p', 's', 'r', 'u', 't', 'w', 'v', 'y', 'x', 'z', '|', '~'])¶: All legal tokens, including sounds, punctuation (‘|’), and whitespace.

sanskrit.sounds.CONSONANTS = frozenset(['C', 'B', 'D', 'G', 'K', 'J', 'L', 'N', 'Q', 'P', 'S', 'R', 'T', 'W', 'Y', 'c', 'b', 'd', 'g', 'h', 'k', 'j', 'm', 'l', 'n', 'q', 'p', 's', 'r', 't', 'w', 'v', 'y', 'z'])¶: Consonants.

sanskrit.sounds.NASALS = frozenset(['Y', 'n', 'R', 'm', 'N'])¶: Nasals.

sanskrit.sounds.SAVARGA = frozenset(['h', 'S', 'z', 's'])¶: Savarga, i.e. the sibilants and h.

sanskrit.sounds.SEMIVOWELS = frozenset(['y', 'r', 'v', 'l', 'L'])¶: Semivowels.

sanskrit.sounds.SHORT_VOWELS = frozenset(['a', 'i', 'u', 'x', 'f'])¶: Short vowels.

sanskrit.sounds.STOPS = frozenset(['q', 'c', 'J', 'g', 'G', 'P', 'k', 'j', 'C', 'D', 'd', 'W', 'p', 'b', 'Q', 't', 'w', 'B', 'K', 'T'])¶: Stop consonants.

sanskrit.sounds.VALID_FINALS = frozenset(['a', 'A', 'e', 'f', 'I', 'k', 'm', 'o', 'N', 'i', 's', 'r', 'u', 'O', 'w', 'p', 'U', 'n', 'E', 't'])¶: Valid word-final sounds.

sanskrit.sounds.VOWELS = frozenset(['a', 'A', 'e', 'f', 'I', 'F', 'o', 'i', 'u', 'O', 'X', 'x', 'U', 'E'])¶: All vowels.

sanskrit.sounds.aspirate(L)¶

Aspirate L. If this is not possible, return L unchanged.

Parameters:	L – the letter to aspirate

sanskrit.sounds.clean(phrase, valid)[source]¶

Remove all characters from phrase that are not in valid.

Parameters:
phrase – the phrase to clean
valid – the set of valid characters. A sensible default is sounds.ALL_TOKENS.
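As a rough illustration of this filtering, the sketch below keeps only characters found in a token set. clean_sketch is a hypothetical stand-in rather than the library function, and TOKENS is rebuilt by hand from the ALL_TOKENS listing above.

```python
# Rebuilt by hand from the ALL_TOKENS listing (whitespace, sounds, '|', '~').
TOKENS = frozenset(
    "\n '"
    "ABCDEFGHIJKLMNOPQRSTUWXY"
    "abcdefghijklmnopqrstuvwxyz"
    "|~"
)


def clean_sketch(phrase, valid):
    """Drop every character of `phrase` that is not in `valid`."""
    return ''.join(ch for ch in phrase if ch in valid)


cleaned = clean_sketch('aTa, yogAnuSAsanam!', TOKENS)
print(cleaned)
```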

sanskrit.sounds.deaspirate(L)¶

Deaspirate L. If this is not possible, return L unchanged.

Parameters:	L – the letter to deaspirate
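Several functions in this module follow the same contract as aspirate() and deaspirate(): transform the letter if a counterpart exists, otherwise return it unchanged. A minimal sketch of that contract, assuming a simple lookup table of SLP1 stops (the names below are hypothetical and the library may be implemented differently):

```python
# Unaspirated -> aspirated stops in SLP1.
ASPIRATED = {'k': 'K', 'g': 'G', 'c': 'C', 'j': 'J', 'w': 'W',
             'q': 'Q', 't': 'T', 'd': 'D', 'p': 'P', 'b': 'B'}
# Invert the table for the opposite direction.
DEASPIRATED = {v: k for k, v in ASPIRATED.items()}


def aspirate_sketch(letter):
    """Return the aspirated counterpart, or the letter itself if none exists."""
    return ASPIRATED.get(letter, letter)


def deaspirate_sketch(letter):
    """Return the unaspirated counterpart, or the letter itself if none exists."""
    return DEASPIRATED.get(letter, letter)
```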

sanskrit.sounds.dentalize(L)¶

Dentalize L. If this is not possible, return L unchanged.

Parameters:	L – the letter to dentalize

sanskrit.sounds.devoice(L)¶

Devoice L. If this is not possible, return L unchanged.

Parameters:	L – the letter to devoice

sanskrit.sounds.guna(L)¶: Apply guna to the given letter, if possible.

sanskrit.sounds.key_fn(s)[source]¶: Sorting function for Sanskrit words in SLP1.
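A sort key is needed because plain ASCII ordering does not match Sanskrit alphabetical order in SLP1 (for example, 'A' sorts before 'a' in ASCII). One way such a key could work is sketched below; slp1_key is a hypothetical name, and placing unknown characters last is an assumption of this sketch.

```python
# Traditional Sanskrit alphabetical order, written in SLP1.
SLP1_ORDER = "aAiIuUfFxXeEoOMHkKgGNcCjJYwWqQRtTdDnpPbBmyrlvSzsh"
RANK = {letter: i for i, letter in enumerate(SLP1_ORDER)}


def slp1_key(word):
    """Hypothetical sort key: rank each letter; unknown characters sort last."""
    return [RANK.get(letter, len(SLP1_ORDER)) for letter in word]


words = ['deva', 'Atman', 'kAla', 'agni']
ascii_sorted = sorted(words)               # plain ASCII order
slp1_sorted = sorted(words, key=slp1_key)  # Sanskrit order
print(ascii_sorted)
print(slp1_sorted)
```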

sanskrit.sounds.lengthen(L)¶

Lengthen L. If this is not possible, return L unchanged.

Parameters:	L – the letter to lengthen

sanskrit.sounds.meter(phrase, heavy='_', light='.')[source]¶

Find the meter of the given phrase. Results are returned as a list whose elements are each either heavy or light.

By the traditional definition, a syllable is heavy if one of the following is true:

the vowel is long
the vowel is short and followed by multiple consonants
the vowel is followed by an anusvara or visarga

All other syllables are light.

Parameters:
phrase – the phrase to scan
heavy – used to indicate heavy syllables. By default it’s a string, but you can pass in anything.
light – used to indicate light syllables. By default it’s a string, but you can pass in anything.
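The three weight rules can be sketched in a few lines. scan_sketch below is a hypothetical re-implementation for illustration only; how the library treats edge cases such as the final syllable is not covered here.

```python
LONG_VOWELS = set('AIUFXeEoO')
SHORT_VOWELS = set('aiufx')
VOWELS = LONG_VOWELS | SHORT_VOWELS
CONSONANTS = set('kKgGNcCjJYwWqQRtTdDnpPbBmyrlvSzshL')


def scan_sketch(phrase, heavy='_', light='.'):
    """Hypothetical scanner applying the three traditional weight rules."""
    # Keep only the characters that affect syllable weight.
    sounds = [ch for ch in phrase
              if ch in VOWELS or ch in CONSONANTS or ch in ('M', 'H')]
    result = []
    for i, ch in enumerate(sounds):
        if ch not in VOWELS:
            continue
        # Collect everything between this vowel and the next one.
        tail = []
        for nxt in sounds[i + 1:]:
            if nxt in VOWELS:
                break
            tail.append(nxt)
        is_heavy = (
            ch in LONG_VOWELS                              # long vowel
            or (bool(tail) and tail[0] in ('M', 'H'))      # anusvara or visarga
            or sum(1 for t in tail if t in CONSONANTS) >= 2  # consonant cluster
        )
        result.append(heavy if is_heavy else light)
    return result


print(''.join(scan_sketch('Darmakzetre kurukzetre')))
```

Because heavy and light are only used as list elements, any values can be passed in, for example True and False.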

sanskrit.sounds.nasalize(L)¶

Nasalize L. If this is not possible, return L unchanged.

Parameters:	L – the letter to nasalize

sanskrit.sounds.num_syllables(phrase)[source]¶

Find the number of syllables in phrase.

Parameters:	phrase – the phrase to test
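Since every syllable contains exactly one vowel, a syllable count can be approximated by counting vowels. The sketch below assumes SLP1 input; count_syllables is a hypothetical name, not the library function.

```python
# Same members as the VOWELS constant documented above.
VOWELS = frozenset('aAiIuUfFxXeEoO')


def count_syllables(phrase):
    """Hypothetical counter: one syllable per vowel."""
    return sum(1 for ch in phrase if ch in VOWELS)


print(count_syllables('Darmakzetre kurukzetre'))
```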

sanskrit.sounds.retroflex(L)¶

Make L retroflex. If this is not possible, return L unchanged.

Parameters:	L – the letter to make retroflex

sanskrit.sounds.samprasarana(L)¶: Apply samprasarana to the given letter, if possible.

sanskrit.sounds.semivowel(L)¶

Convert L to its corresponding semivowel. If this is not possible, return L unchanged.

Parameters:	L – the letter to convert to a semivowel

sanskrit.sounds.shorten(L)¶

Shorten L. If this is not possible, return L unchanged.

Parameters:	L – the letter to shorten

sanskrit.sounds.simplify(L)¶

Simplify the given letter, if possible.

Here, to “simplify” a letter is to reduce it to a sound that is permitted to end a Sanskrit word. For instance, the c in vAc should be reduced to k:

assert simplify('c') == 'k'

Parameters:	L – the letter to simplify

sanskrit.sounds.voice(L)¶

Voice L. If this is not possible, return L unchanged.

Parameters:	L – the letter to voice

sanskrit.sounds.vrddhi(L)¶: Apply vrddhi to the given letter, if possible.
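For reference, the standard guna and vrddhi grades of the simple vowels can be written as lookup tables. This is a sketch under assumptions: the function names are hypothetical, and both the exact return values of guna() and vrddhi() and the choice to return other letters unchanged are not confirmed by this page.

```python
# Standard vowel grades in SLP1 (short and long vowels share a grade).
GUNA = {'a': 'a', 'i': 'e', 'I': 'e', 'u': 'o', 'U': 'o',
        'f': 'ar', 'F': 'ar', 'x': 'al'}
VRDDHI = {'a': 'A', 'i': 'E', 'I': 'E', 'u': 'O', 'U': 'O',
          'f': 'Ar', 'F': 'Ar', 'x': 'Al', 'e': 'E', 'o': 'O'}


def guna_sketch(letter):
    """Return the guna grade, or the letter itself if none applies (assumed)."""
    return GUNA.get(letter, letter)


def vrddhi_sketch(letter):
    """Return the vrddhi grade, or the letter itself if none applies (assumed)."""
    return VRDDHI.get(letter, letter)
```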

sanskrit.sandhi¶

Classes that apply and undo sandhi rules.

license:	MIT

class sanskrit.sandhi.Exempt[source]¶

A helper class for marking strings as exempt from sandhi changes. To mark a string as exempt, just do the following:

original = 'amI'
exempt = Exempt('amI')

Exempt is a subclass of unicode, so you can use normal string methods on Exempt objects.

class sanskrit.sandhi.Joiner(rules=None)[source]¶

Joins multiple Sanskrit terms by applying sandhi rules.

add_rules(rules)[source]¶

Add rules for joining words.

Example usage:

joiner.add_rules([('a', 'i', 'e'), ('a', 'a', 'A')])

Parameters:	rules – a list of 3-tuples, each of which contains:

the first part of the combination
the second part of the combination
the result
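To make the 3-tuple format concrete, here is a toy joiner that stores rules in a dict keyed by the two inputs. ToyJoiner and join_pair are hypothetical and handle only two chunks; they are a sketch of the idea, not the Joiner implementation.

```python
class ToyJoiner:
    """Hypothetical stand-in for a rule-driven joiner (two chunks only)."""

    def __init__(self, rules=None):
        self.rules = {}
        if rules:
            self.add_rules(rules)

    def add_rules(self, rules):
        # Index each (first, second, result) tuple by its two inputs.
        for first, second, result in rules:
            self.rules[(first, second)] = result

    def join_pair(self, left, right):
        # Apply the first rule whose inputs match the word boundary.
        for (first, second), result in self.rules.items():
            if left.endswith(first) and right.startswith(second):
                stem = left[:len(left) - len(first)]
                return stem + result + right[len(second):]
        # No rule applies: leave the words separate.
        return left + ' ' + right


joiner = ToyJoiner()
joiner.add_rules([('a', 'i', 'e'), ('a', 'a', 'A')])
print(joiner.join_pair('tasya', 'icCA'))
```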

static internal_retroflex(term)[source]¶

Apply the “n -> ṇ” and “s -> ṣ” rules of internal sandhi.

Parameters:	term – the string to process

join(chunks, internal=False)[source]¶

Join the given chunks according to the object’s rules:

assert joiner.join(['tasya', 'icCA']) == 'tasyecCA'

join() does not take pragṛhya rules into account. As a reminder, the main exceptions are:

1.1.11 “ī”, “ū”, and “e” when they end words in the dual;

1.1.12 the same vowels after the “m” of adas;

1.1.13 particles with just one vowel, apart from “ā”;

1.1.14 particles that end in “o”.

One simple way to account for these rules is to wrap exempt strings with Exempt:

assert joiner.join(['te', 'iti']) == 'ta iti'
assert joiner.join([Exempt('te'), 'iti']) == 'te iti'

Parameters:
chunks – a list of the strings that should be joined
internal – if true, join words using the empty string instead of a space (' ').
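The interaction between Exempt and the internal flag can be sketched as follows. Everything here is illustrative: this Exempt is a Python 3 str subclass standing in for the unicode subclass described above, join_sketch is a hypothetical function, and RULES holds just two example rules.

```python
class Exempt(str):
    """Marker subclass: behaves like a string but opts out of sandhi."""


# (last letter, first letter) -> replacement. Illustrative rules only.
RULES = {('e', 'i'): 'a i', ('a', 'i'): 'e'}


def join_sketch(chunks, internal=False):
    """Hypothetical join that skips rules after an exempt chunk."""
    separator = '' if internal else ' '
    result = chunks[0]
    blocked = isinstance(chunks[0], Exempt)
    for chunk in chunks[1:]:
        rule = None if blocked else RULES.get((result[-1:], chunk[:1]))
        if rule is None:
            # No change at this boundary: just concatenate.
            result = result + separator + chunk
        else:
            result = result[:-1] + rule + chunk[1:]
        blocked = isinstance(chunk, Exempt)
    return result


print(join_sketch(['te', 'iti']))
print(join_sketch([Exempt('te'), 'iti']))
```

Because the marker is an ordinary string subclass, calls such as Exempt('amI').upper() keep working, and the joiner only needs an isinstance check.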

class sanskrit.sandhi.Splitter(rules=None)[source]¶

Splits Sanskrit terms by undoing sandhi rules.

iter_splits(chunk)[source]¶

Return a generator for all splits in chunk. Results are yielded as 2-tuples containing the term before the split and the term after:

for item in s.iter_splits('nareti'):
    before, after = item

iter_splits() will generate many false positives, usually when the first part of the split ends in an invalid consonant:

assert ('narAv', 'iti') in s.iter_splits('narAviti')

These should be filtered out in the calling function.

Splits are generated from left to right, but the function makes no guarantees on when certain rules are applied. That is, output is loosely ordered but nondeterministic.
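A caller might generate and then filter splits along these lines. The generator below is a hypothetical sketch with two illustrative rules, not the Splitter implementation; the filter uses the members of VALID_FINALS from sanskrit.sounds.

```python
# Same members as sanskrit.sounds.VALID_FINALS, copied by hand.
VALID_FINALS = frozenset('aAeEfIikmoONsruUwpnt')

# (first, second, result) tuples. Illustrative rules only.
RULES = [('a', 'i', 'e'), ('O', 'i', 'Avi')]


def iter_splits_sketch(chunk, rules=RULES):
    """Hypothetical generator: yield (before, after) candidates left to right."""
    for i in range(1, len(chunk)):
        # A plain split with no sandhi change.
        yield chunk[:i], chunk[i:]
        # Splits that undo a rule whose result starts at position i.
        for first, second, result in rules:
            if chunk.startswith(result, i):
                yield chunk[:i] + first, second + chunk[i + len(result):]


def plausible(split):
    """Reject splits whose first part ends in an invalid final sound."""
    before, _ = split
    return before[-1] in VALID_FINALS


candidates = list(iter_splits_sketch('narAviti'))
filtered = [s for s in candidates if plausible(s)]
print(filtered)
```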

Database schema¶

sanskrit.schema¶

Schema for Sanskrit data.

class sanskrit.schema.AbstractNominal(**kwargs)[source]¶: A complete nominal form. This corresponds to Panini’s subanta.

class sanskrit.schema.Case(**kwargs)[source]¶

Grammatical case:

Nominative case, corresponding to Panini’s prathamā
Accusative case, corresponding to Panini’s dvitīyā
Instrumental case, corresponding to Panini’s tṛtīyā
Dative case, corresponding to Panini’s caturthī
Ablative case, corresponding to Panini’s pañcamī
Genitive case, corresponding to Panini’s ṣaṣṭhī
Locative case, corresponding to Panini’s saptamī
Vocative case, corresponding to Panini’s saṃbodhana

class sanskrit.schema.EnumBase(**kwargs)[source]¶

Base class for enumerations.

Each enumeration has a name and an abbreviation.

class sanskrit.schema.Form(**kwargs)[source]¶

A complete form. This corresponds to Panini’s pada:

1.4.14 Terms ending in “sup” and “tiṅ” are called pada.

In other words, a Form is a self-contained linguistic unit that could be used in a sentence as-is.

class sanskrit.schema.Gender(**kwargs)[source]¶

Grammatical gender:

masculine, corresponding to Panini’s puṃliṅga
feminine, corresponding to Panini’s strīliṅga
neuter, corresponding to Panini’s napuṃsakaliṅga
unknown/undefined

class sanskrit.schema.GenderGroup(**kwargs)[source]¶

Grammatical gender of a nominal stem. Since stems support nearly every combination of genders, this class stores both individual genders:

masculine
feminine
neuter
unknown/undefined

and collections of genders:

masculine and feminine
masculine and neuter
feminine and neuter
masculine, feminine, and neuter

class sanskrit.schema.Gerund(**kwargs)[source]¶: A complete form. This corresponds to Panini’s ktvānta and lyabanta.

class sanskrit.schema.Indeclinable(**kwargs)[source]¶: A complete form. This corresponds to Panini’s avyaya.

class sanskrit.schema.Infinitive(**kwargs)[source]¶: A complete form. This corresponds to Panini’s tumanta.

class sanskrit.schema.Mode(**kwargs)[source]¶

Tenses and moods:

present, corresponding to Panini’s laṭ
aorist, corresponding to Panini’s luṅ
imperfect, corresponding to Panini’s laṅ
perfect, corresponding to Panini’s liṭ
simple future, corresponding to Panini’s lṛṭ
distant future, corresponding to Panini’s luṭ
conditional, corresponding to Panini’s lṛṅ
optative, corresponding to Panini’s vidhi-liṅ
imperative, corresponding to Panini’s loṭ
benedictive, corresponding to Panini’s āśīr-liṅ
injunctive
future optative
future imperative

class sanskrit.schema.Modification(**kwargs)[source]¶

Verb modification:

Causative, corresponding to Panini’s ṇic
Desiderative, corresponding to Panini’s san
Intensive, corresponding to Panini’s yaṅ

class sanskrit.schema.ModifiedRoot(**kwargs)[source]¶: A root with one or more modifications.

class sanskrit.schema.NominalEnding(**kwargs)[source]¶: A suffix for regular nouns and adjectives. This corresponds to Panini’s sup.

class sanskrit.schema.NominalStem(**kwargs)[source]¶: Stem of a Nominal.

class sanskrit.schema.NounPrefix(**kwargs)[source]¶: A noun prefix. This includes nañ, among others.

class sanskrit.schema.Number(**kwargs)[source]¶

Grammatical number:

singular, corresponding to Panini’s ekavacana
dual, corresponding to Panini’s dvivacana
plural, corresponding to Panini’s bahuvacana

class sanskrit.schema.Paradigm(**kwargs)[source]¶: Represents an inflectional paradigm. This associates a root with a particular class and voice.

class sanskrit.schema.Participle(**kwargs)[source]¶: A complete form. This corresponds to Panini’s niṣṭhā, sat, and kvasu.

class sanskrit.schema.ParticipleStem(**kwargs)[source]¶: Stem of a Participle.

class sanskrit.schema.PerfectIndeclinable(**kwargs)[source]¶: A complete form. This corresponds to forms ending in the suffix ām, as in “īkṣāṃ cakre”.

class sanskrit.schema.Person(**kwargs)[source]¶

Grammatical person:

first person, corresponding to Panini’s uttamapuruṣa
second person, corresponding to Panini’s madhyamapuruṣa
third person, corresponding to Panini’s prathamapuruṣa

class sanskrit.schema.Prefix(**kwargs)[source]¶: A generic prefix.

class sanskrit.schema.PrefixedModifiedRoot(**kwargs)[source]¶: A prefixed root with one or more modifications.

class sanskrit.schema.PrefixedRoot(**kwargs)[source]¶: A root with one or more prefixes.

class sanskrit.schema.PronounStem(**kwargs)[source]¶: Stem of a Pronoun.

class sanskrit.schema.Root(**kwargs)[source]¶

A verb root. This corresponds to Panini’s dhātu:

1.3.1 “bhū” etc. are called dhātu.

3.1.22 Terms ending in san etc. are called dhātu.

Moreover, Root contains prefixed roots. Although this modeling choice is non-Paninian, it does express the notion that verb prefixes can cause profound changes in a root’s meaning and identity.

basis_id¶: The ultimate ancestor of this root. For instance, the basis of “sam-upa-gam” is “gam”.

class sanskrit.schema.RootModAssociation(modification)[source]¶: Associates a modified root with a list of modifications.

class sanskrit.schema.RootPrefixAssociation(prefix)[source]¶: Associates a prefixed root with a list of prefixes.

class sanskrit.schema.SandhiType(**kwargs)[source]¶

Rule type. Sandhi rules are usually of three types:

external rules, which act between words
internal rules, which act between morphemes
general rules, which act in any context

class sanskrit.schema.SimpleBase(**kwargs)[source]¶

A simple default base class.

This automatically creates:

__tablename__
id (primary key)
name (string)

class sanskrit.schema.Stem(**kwargs)[source]¶

A nominal stem. This corresponds to Panini’s aṅga:

1.4.13 Anything to which a suffix can be added is called an aṅga.

But although “aṅga” also includes “dhātu,” Stem does not. Verb roots are stored in Root.

dependent¶: True iff a stem can produce its own words. For stems like “nara” or “agni” this value is True. For stems like “ja” (dependent on upapada, as in “agra-ja”) or “tva” (a suffix, as in “sama-tva”), this value is False.

class sanskrit.schema.StemIrregularity(**kwargs)[source]¶

Record of an irregular stem.

Some Sanskrit stems are inflected irregularly. Since only a very small number of stems are irregular (fewer than 100), it’s easiest to record those irregularities in their own scheme.

fully_described¶: If True, assume that only the forms stored in the database are valid. If False, assume that all unspecified forms can be generated by applying normal rules and endings to the stem.

class sanskrit.schema.Tag(**kwargs)[source]¶

Part of speech tag. It contains the following:

noun
pronoun
adjective
indeclinable
verb
gerund
infinitive
participle
perfect indeclinable (also known as the periphrastic perfect)
noun prefix
verb prefix

class sanskrit.schema.VClass(**kwargs)[source]¶

Verb class:

class 1, corresponding to Panini’s bhvādi
class 2, corresponding to Panini’s adādi
class 3, corresponding to Panini’s juhotyādi
class 4, corresponding to Panini’s divādi
class 5, corresponding to Panini’s svādi
class 6, corresponding to Panini’s tudādi
class 7, corresponding to Panini’s rudhādi
class 8, corresponding to Panini’s tanādi
class 9, corresponding to Panini’s kryādi
class 10, corresponding to Panini’s curādi
class unknown, for verbs like “ah”
nominal, corresponding to various Paninian terms

class sanskrit.schema.Verb(**kwargs)[source]¶: A complete form. This corresponds to Panini’s tiṅanta.

class sanskrit.schema.VerbEnding(**kwargs)[source]¶

A suffix for conjugated verbs of any kind. This corresponds to Panini’s tiṅ.

category¶

Name of the stem category that uses this ending. This is useful for partitioning the list of endings according to the stem under consideration.

The names of these groups depend on the initial data. The default data uses these names:

‘simple’ for classes 1, 4, 6, and 10
‘complex’ for classes 2, 3, 5, 7, 8, and 9
‘both’ for all classes
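Partitioning endings by category could look like the sketch below. The helper names and the placeholder endings are hypothetical, and the class groupings are an assumption based on the default names described above.

```python
SIMPLE_CLASSES = {1, 4, 6, 10}
COMPLEX_CLASSES = {2, 3, 5, 7, 8, 9}  # assumed membership of the 'complex' group


def category_for(vclass):
    """Map a verb class number to its stem category name."""
    if vclass in SIMPLE_CLASSES:
        return 'simple'
    if vclass in COMPLEX_CLASSES:
        return 'complex'
    raise ValueError('no category for class %r' % (vclass,))


def endings_for(vclass, endings):
    """Select endings usable by `vclass` from (name, category) pairs."""
    wanted = {category_for(vclass), 'both'}
    return [name for name, category in endings if category in wanted]


# Placeholder data, not real endings from the database.
endings = [('ending_a', 'both'), ('ending_b', 'simple'), ('ending_c', 'complex')]
print(endings_for(1, endings))
print(endings_for(7, endings))
```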

class sanskrit.schema.VerbPrefix(**kwargs)[source]¶: A verb prefix. This corresponds to Panini’s gati, which includes cvi (svāgatī-karoti) and upasarga (anu-karoti).

class sanskrit.schema.VerbalIndeclinable(**kwargs)[source]¶: A complete form. VerbalIndeclinable is a superclass for three more specific classes: Gerund, Infinitive, and PerfectIndeclinable.

class sanskrit.schema.Voice(**kwargs)[source]¶

Grammatical voice:

parasmaipada
ātmanepada
ubhayapada

Analyzing forms¶

sanskrit.analyze¶

Code for analyzing Sanskrit forms, i.e. finding the basic lexical forms that produced them and specifying the word’s inflectional information.

license:	MIT

class sanskrit.analyze.SimpleAnalyzer(ctx)[source]¶

A simple analyzer for Sanskrit words. The analyzer is simple for a few reasons:

It doesn’t do any caching.
It uses an ORM instead of raw SQL queries.
Its output is always “well-formed.” For example, neuter nouns can take only neuter endings.

This analyzer is best used when memory is at a premium and speed is a secondary concern (e.g. when on a web server).

analyze(word)[source]¶

Return all possible solutions for the given word. Any ORM objects used in these solutions will be in a detached state.

Parameters:	word – the word to analyze. This should be a complete word, or what Panini would call a pada.

Analyzing text¶

sanskrit.tagger¶

Code for converting Sanskrit paragraphs into a list of linguistic forms. This is done through a part-of-speech tagger (Tagger) that also removes sandhi and identifies the lexical roots that underlie the forms in some passage.

When tagging, a block of text (i.e. a verse or paragraph) is called a segment and its space-separated substrings are called chunks. For example, the segment 'aTa SabdAnuSAsanam |' has chunks 'aTa', 'SabdAnuSAsanam', and '|'.

license:	MIT

class sanskrit.tagger.NonForm(name)[source]¶: Wraps a chunk that couldn’t be parsed by the Tagger.

class sanskrit.tagger.TaggedItem(segment_id, chunk_index, form)[source]¶: Associates a linguistic form with a specific chunk and segment.

class sanskrit.tagger.Tagger(ctx)[source]¶

The part-of-speech tagger.

iter_chunks(segment)[source]¶

Iterate over the chunks in segment.

Parameters:	segment – an arbitrary string
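Given the definition of chunks as the space-separated substrings of a segment, chunk iteration can be sketched with str.split. iter_chunks_sketch is a hypothetical stand-in, and whether the real method does more than split on whitespace is not stated here.

```python
def iter_chunks_sketch(segment):
    """Yield the whitespace-separated substrings of `segment`."""
    for chunk in segment.split():
        yield chunk


chunks = list(iter_chunks_sketch('aTa SabdAnuSAsanam |'))
print(chunks)
```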

tag(segment, segment_id=None)[source]¶

Return the linguistic forms that compose segment. If a form can’t be parsed, it’s wrapped in NonForm.

Parameters:	segment – an arbitrary string
Returns:	a list of TaggedItem objects.
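The shape of the return value can be illustrated with a toy tagger. Everything below is hypothetical: the namedtuples only mirror the constructor arguments of TaggedItem and NonForm, the lexicon is placeholder data, and unlike the real Tagger this sketch does not undo sandhi.

```python
from collections import namedtuple

# Stand-ins that mirror the documented constructor arguments.
TaggedItem = namedtuple('TaggedItem', ['segment_id', 'chunk_index', 'form'])
NonForm = namedtuple('NonForm', ['name'])


def tag_sketch(segment, lexicon, segment_id=None):
    """Hypothetical tagger: look each chunk up, wrapping misses in NonForm."""
    items = []
    for index, chunk in enumerate(segment.split()):
        form = lexicon.get(chunk, NonForm(chunk))
        items.append(TaggedItem(segment_id, index, form))
    return items


# Placeholder lexicon mapping a chunk to some description of its form.
lexicon = {'aTa': 'indeclinable aTa'}
tagged = tag_sketch('aTa SabdAnuSAsanam |', lexicon, segment_id=1)
for item in tagged:
    print(item)
```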