API

This page documents the entire linguistic “stack,” from individual sounds (sanskrit.sounds) to paragraph-level morphosyntax (sanskrit.tagger). Lower-level modules are at the top of the page.

Context

sanskrit.context

Manages the package context. For details, see Context.

license:MIT
class sanskrit.context.Context(config=None, connect=True)[source]

The package context. In addition to storing basic config information, such as the database URI or paths to various data files, a Context also constructs a Session class for connecting to the database.

You can populate a context in several ways. For example, you can pass a dict:

context = Context({'DATABASE_URI': 'sqlite:///data.sqlite'})

or a path to a Python module:

context = Context('project/config.py')

If you initialize a context from a module, note that only uppercase variables will be stored in the context. This lets you use lowercase variables as temporary values.

Config values are stored internally as a dict, so you can always just use ordinary dict methods:

context.config['FOO'] = 'baz'
Parameters:config – an object to read from. If this is a string, treat config as a module path and load values from that module. Otherwise, treat config as a dictionary.
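The uppercase-only rule described above can be illustrated with a small standalone sketch. It is not the package's implementation: `load_uppercase` and `make_config` are hypothetical helpers, and the sketch assumes the module path points to a plain Python file that can be executed directly.

```python
import os
import tempfile


def load_uppercase(path):
    """Execute the Python file at `path` and keep only its uppercase names."""
    namespace = {}
    with open(path) as f:
        exec(compile(f.read(), path, 'exec'), namespace)
    # Lowercase names (and dunder entries such as __builtins__) are dropped.
    return {k: v for k, v in namespace.items() if k.isupper()}


def make_config(config=None):
    """Hypothetical stand-in for the config handling in Context."""
    if config is None:
        return {}
    if isinstance(config, str):
        # A string is treated as a path to a Python module.
        return load_uppercase(config)
    # Anything else is treated as a dictionary.
    return dict(config)


# Write a throwaway config module so the example is self-contained.
with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, 'config.py')
    with open(path, 'w') as f:
        f.write("DATABASE_URI = 'sqlite:///data.sqlite'\n")
        f.write("tmp_dir = '/tmp'\n")  # lowercase, so it is ignored
    from_module = make_config(path)

from_dict = make_config({'DATABASE_URI': 'sqlite:///data.sqlite'})
```

Both calls produce the same dict here, because the lowercase `tmp_dir` is discarded when loading from the module.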
build()[source]

Build all data.

config = None

A dict of various settings. By convention, all keys are uppercase. These are used to create engine and session.

connect()[source]

Connect to the database.

create_all()[source]

Create tables for every model in sanskrit.schema.

drop_all()[source]

Drop all tables defined in sanskrit.schema.

engine = None

The Engine that underlies the session.

enum_abbr

Maps an ID or name to an abbreviation.

enum_id

Maps a name or abbreviation to an ID.

session = None

A Session class.

Sounds and sandhi

sanskrit.sounds

Code for checking and transforming Sanskrit sounds. This module also contains basic metrical functions (see sanskrit.sounds.meter() and sanskrit.sounds.num_syllables()).

All functions assume SLP1.

license:MIT
sanskrit.sounds.ALL_SOUNDS = frozenset(["'", 'A', 'C', 'B', 'E', 'D', 'G', 'F', 'I', 'H', 'K', 'J', 'M', 'L', 'O', 'N', 'Q', 'P', 'S', 'R', 'U', 'T', 'W', 'Y', 'X', 'a', 'c', 'b', 'e', 'd', 'g', 'f', 'i', 'h', 'k', 'j', 'm', 'l', 'o', 'n', 'q', 'p', 's', 'r', 'u', 't', 'w', 'v', 'y', 'x', 'z', '~'])

All legal sounds, including anusvara, ardhachandra, and Vedic ‘L’.

sanskrit.sounds.ALL_TOKENS = frozenset(['\n', ' ', "'", 'A', 'C', 'B', 'E', 'D', 'G', 'F', 'I', 'H', 'K', 'J', 'M', 'L', 'O', 'N', 'Q', 'P', 'S', 'R', 'U', 'T', 'W', 'Y', 'X', 'a', 'c', 'b', 'e', 'd', 'g', 'f', 'i', 'h', 'k', 'j', 'm', 'l', 'o', 'n', 'q', 'p', 's', 'r', 'u', 't', 'w', 'v', 'y', 'x', 'z', '|', '~'])

All legal tokens, including sounds, punctuation (‘|’), and whitespace.

sanskrit.sounds.CONSONANTS = frozenset(['C', 'B', 'D', 'G', 'K', 'J', 'L', 'N', 'Q', 'P', 'S', 'R', 'T', 'W', 'Y', 'c', 'b', 'd', 'g', 'h', 'k', 'j', 'm', 'l', 'n', 'q', 'p', 's', 'r', 't', 'w', 'v', 'y', 'z'])

Consonants.

sanskrit.sounds.NASALS = frozenset(['Y', 'n', 'R', 'm', 'N'])

Nasals.

sanskrit.sounds.SAVARGA = frozenset(['h', 'S', 'z', 's'])

Savarga.

sanskrit.sounds.SEMIVOWELS = frozenset(['y', 'r', 'v', 'l', 'L'])

Semivowels.

sanskrit.sounds.SHORT_VOWELS = frozenset(['a', 'i', 'u', 'x', 'f'])

Short vowels.

sanskrit.sounds.STOPS = frozenset(['q', 'c', 'J', 'g', 'G', 'P', 'k', 'j', 'C', 'D', 'd', 'W', 'p', 'b', 'Q', 't', 'w', 'B', 'K', 'T'])

Stop consonants.

sanskrit.sounds.VALID_FINALS = frozenset(['a', 'A', 'e', 'f', 'I', 'k', 'm', 'o', 'N', 'i', 's', 'r', 'u', 'O', 'w', 'p', 'U', 'n', 'E', 't'])

Valid word-final sounds.

sanskrit.sounds.VOWELS = frozenset(['a', 'A', 'e', 'f', 'I', 'F', 'o', 'i', 'u', 'O', 'X', 'x', 'U', 'E'])

All vowels.

sanskrit.sounds.aspirate(L)

Aspirate L. If this is not possible, return L unchanged.

Parameters:L – the letter to aspirate
sanskrit.sounds.clean(phrase, valid)[source]

Remove all characters from phrase that are not in valid.

Parameters:
  • phrase – the phrase to clean
  • valid – the set of valid characters. A sensible default is sounds.ALL_TOKENS.
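As a rough illustration of the behavior described for clean(), the sketch below filters a phrase against a set of valid characters. It is a standalone approximation, not the package's code: `sketch_clean` is a hypothetical name, and `ALL_TOKENS` is rebuilt here from the value listed on this page.

```python
# Rebuilt from the ALL_TOKENS value documented above (assumed to be complete).
ALL_TOKENS = frozenset(
    "\n '"
    "ABCDEFGHIJKLMNOPQRSTUWXY"
    "abcdefghijklmnopqrstuvwxyz"
    "|~"
)


def sketch_clean(phrase, valid):
    """Keep only the characters of `phrase` that appear in `valid`."""
    return ''.join(ch for ch in phrase if ch in valid)


cleaned = sketch_clean('aTa, SabdAnuSAsanam!', ALL_TOKENS)
```

Here the comma and exclamation mark are dropped, while the space survives because it is a legal token.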
sanskrit.sounds.deaspirate(L)

Deaspirate L. If this is not possible, return L unchanged.

Parameters:L – the letter to deaspirate
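The "return L unchanged" contract shared by aspirate(), deaspirate(), and the other single-letter transformations can be sketched with lookup tables. This is only an illustration under the assumption that the functions act on SLP1 stops; the names `sketch_aspirate` and `sketch_deaspirate` are hypothetical and the package may implement the mapping differently.

```python
# Unaspirated SLP1 stops paired with their aspirated counterparts.
_ASPIRATE = dict(zip('kgcjwqtdpb', 'KGCJWQTDPB'))
_DEASPIRATE = {aspirated: plain for plain, aspirated in _ASPIRATE.items()}


def sketch_aspirate(L):
    """Aspirate L, or return L unchanged if it has no aspirated form."""
    return _ASPIRATE.get(L, L)


def sketch_deaspirate(L):
    """Deaspirate L, or return L unchanged if it is not aspirated."""
    return _DEASPIRATE.get(L, L)
```

Using `dict.get(L, L)` gives the fallback for free: any letter missing from the table, such as a vowel, passes through untouched.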
sanskrit.sounds.dentalize(L)

Dentalize L. If this is not possible, return L unchanged.

Parameters:L – the letter to dentalize
sanskrit.sounds.devoice(L)

Devoice L. If this is not possible, return L unchanged.

Parameters:L – the letter to devoice
sanskrit.sounds.guna(L)

Apply guna to the given letter, if possible.

sanskrit.sounds.key_fn(s)[source]

Sorting function for Sanskrit words in SLP1.
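A sort key is needed because plain ASCII ordering of SLP1 does not match the traditional Sanskrit alphabet. The sketch below shows one way such a key could work. The ordering string and the name `sketch_key_fn` are assumptions made for illustration; the package's own ordering (for example, where it places Vedic 'L') may differ.

```python
# Assumed traditional order: vowels, anusvara and visarga, stops and nasals
# by place of articulation, semivowels, sibilants, then h.
SLP1_ORDER = 'aAiIuUfFxXeEoOMHkKgGNcCjJYwWqQRtTdDnpPbBmyrlvSzsh'
_RANK = {ch: i for i, ch in enumerate(SLP1_ORDER)}


def sketch_key_fn(word):
    """Map each character to its rank; unknown characters sort last."""
    return [_RANK.get(ch, len(SLP1_ORDER)) for ch in word]


words = ['kara', 'gaja', 'Atman', 'agni']
in_sanskrit_order = sorted(words, key=sketch_key_fn)
in_ascii_order = sorted(words)
```

With this key, short `a` sorts before long `A` and `k` before `g`, which is the reverse of what ASCII ordering produces for these words.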

sanskrit.sounds.lengthen(L)

Lengthen L. If this is not possible, return L unchanged.

Parameters:L – the letter to lengthen
sanskrit.sounds.meter(phrase, heavy='_', light='.')[source]

Find the meter of the given phrase. Results are returned as a list whose elements are either heavy or light.

By the traditional definition, a syllable is heavy if one of the following is true:

  • the vowel is long
  • the vowel is short and followed by multiple consonants
  • the vowel is followed by an anusvara or visarga

All other syllables are light.

Parameters:
  • phrase – the phrase to scan
  • heavy – used to indicate heavy syllables. By default it’s a string, but you can pass in anything.
  • light – used to indicate light syllables. By default it’s a string, but you can pass in anything.
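The traditional definition above can be turned into a short scanning routine. The following is a minimal sketch, not the package's implementation: `sketch_meter` is a hypothetical name, the sound sets are written out by hand for SLP1, and anusvara and visarga are assumed to be written as `M` and `H`.

```python
VOWELS = set('aAiIuUfFxXeEoO')
SHORT_VOWELS = set('aiufx')
CONSONANTS = set('kKgGNcCjJYwWqQRtTdDnpPbBmyrlvSzshL')
ANUSVARA_VISARGA = set('MH')


def sketch_meter(phrase, heavy='_', light='.'):
    """Scan `phrase` and return one weight marker per syllable."""
    # Drop whitespace and punctuation so clusters can span word boundaries.
    sounds = [ch for ch in phrase
              if ch in VOWELS or ch in CONSONANTS or ch in ANUSVARA_VISARGA]
    result = []
    for i, sound in enumerate(sounds):
        if sound not in VOWELS:
            continue
        # Collect everything between this vowel and the next one.
        following = []
        for nxt in sounds[i + 1:]:
            if nxt in VOWELS:
                break
            following.append(nxt)
        is_long = sound not in SHORT_VOWELS
        has_m_or_h = any(s in ANUSVARA_VISARGA for s in following)
        cluster = sum(1 for s in following if s in CONSONANTS) >= 2
        result.append(heavy if (is_long or has_m_or_h or cluster) else light)
    return result


scan = sketch_meter('Darmakzetre kurukzetre')
```

Since the result has one element per vowel, its length also gives a syllable count in this sketch.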
sanskrit.sounds.nasalize(L)

Nasalize L. If this is not possible, return L unchanged.

Parameters:L – the letter to nasalize
sanskrit.sounds.num_syllables(phrase)[source]

Find the number of syllables in phrase.

Parameters:phrase – the phrase to test
sanskrit.sounds.retroflex(L)

Make L retroflex. If this is not possible, return L unchanged.

Parameters:L – the letter to make retroflex
sanskrit.sounds.samprasarana(L)

Apply samprasarana to the given letter, if possible.

sanskrit.sounds.semivowel(L)

Convert L to its corresponding semivowel. If this is not possible, return L unchanged.

Parameters:L – the letter to convert to a semivowel
sanskrit.sounds.shorten(L)

Shorten L. If this is not possible, return L unchanged.

Parameters:L – the letter to shorten
sanskrit.sounds.simplify(L)

Simplify the given letter, if possible.

Here, to “simplify” a letter is to reduce it to a sound that is permitted to end a Sanskrit word. For instance, the c in vAc should be reduced to k:

assert simplify('c') == 'k'
Parameters:L – the letter to simplify
sanskrit.sounds.voice(L)

Voice L. If this is not possible, return L unchanged.

Parameters:L – the letter to voice
sanskrit.sounds.vrddhi(L)

Apply vrddhi to the given letter, if possible.

sanskrit.sandhi

Classes that apply and undo sandhi rules.

license:MIT
class sanskrit.sandhi.Exempt[source]

A helper class for marking strings as exempt from sandhi changes. To mark a string as exempt, just do the following:

original = 'amI'
exempt = Exempt('amI')

Exempt is a subclass of unicode, so you can use normal string methods on Exempt objects.

class sanskrit.sandhi.Joiner(rules=None)[source]

Joins multiple Sanskrit terms by applying sandhi rules.

add_rules(rules)[source]

Add rules for joining words.

Example usage:

joiner.add_rules([('a', 'i', 'e'), ('a', 'a', 'A')])
Parameters:rules – a list of 3-tuples, each of which contains:
  • the first part of the combination
  • the second part of the combination
  • the result
static internal_retroflex(term)[source]

Apply the “n -> ṇ” and “s -> ṣ” rules of internal sandhi.

Parameters:term – the string to process
join(chunks, internal=False)[source]

Join the given chunks according to the object’s rules:

assert 'tasyecCA' == joiner.join(['tasya', 'icCA'])

join() does not take pragṛhya rules into account. As a reminder, the main exceptions are:

1.1.11 “ī”, “ū”, and “e” when they end words in the dual;
1.1.12 the same vowels after the “m” of adas;
1.1.13 particles with just one vowel, apart from “ā”;
1.1.14 particles that end in “o”.

One simple way to account for these rules is to wrap exempt strings with Exempt:

assert joiner.join(['te', 'iti']) == 'ta iti'
assert joiner.join([Exempt('te'), 'iti']) == 'te iti'
Parameters:
  • chunks – a list of the strings that should be joined
  • internal – if true, join words using the empty string instead of ‘ ’.
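To make the interplay between 3-tuple rules, joining, and Exempt concrete, here is a small standalone sketch. It is not the Joiner class itself: `SketchJoiner` and this local `Exempt` are hypothetical stand-ins, the rules are assumed to match exactly one final sound and one initial sound, and chunks with no matching rule are simply joined with a space.

```python
class Exempt(str):
    """Marks a chunk whose final sound must not undergo sandhi."""


class SketchJoiner:
    def __init__(self, rules=None):
        self.rules = {}
        if rules:
            self.add_rules(rules)

    def add_rules(self, rules):
        # Each rule is (first part, second part, result).
        for first, second, result in rules:
            self.rules[(first, second)] = result

    def join(self, chunks):
        result = ''
        previous = None
        for chunk in chunks:
            if previous is None:
                result = str(chunk)
            else:
                key = (result[-1:], chunk[:1])
                if isinstance(previous, Exempt) or key not in self.rules:
                    # No change applies: keep both chunks as they are.
                    result = result + ' ' + chunk
                else:
                    # Replace the two sounds at the boundary with the result.
                    result = result[:-1] + self.rules[key] + chunk[1:]
            previous = chunk
        return result


sketch = SketchJoiner([('a', 'i', 'e'), ('e', 'i', 'a i')])
merged = sketch.join(['tasya', 'icCA'])
spaced = sketch.join(['te', 'iti'])
kept = sketch.join([Exempt('te'), 'iti'])
```

Tracking the previous chunk separately matters because the running result is a plain string, so the Exempt marker would otherwise be lost after the first join.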
class sanskrit.sandhi.Splitter(rules=None)[source]

Splits Sanskrit terms by undoing sandhi rules.

iter_splits(chunk)[source]

Return a generator for all splits in chunk. Results are yielded as 2-tuples containing the term before the split and the term after:

for item in s.iter_splits('nareti'):
    before, after = item

iter_splits() will generate many false positives, usually when the first part of the split ends in an invalid consonant:

assert ('narAv', 'iti') in s.iter_splits('narAviti')

These should be filtered out in the calling function.

Splits are generated from left to right, but the function makes no guarantees on when certain rules are applied. That is, output is loosely ordered but nondeterministic.
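The idea of undoing sandhi rules, and of filtering false positives in the caller, can be sketched as follows. This is an illustration only: `sketch_iter_splits` is a hypothetical function, the rule format is assumed to be the same 3-tuples used by Joiner.add_rules(), and the filter uses the VALID_FINALS value documented for sanskrit.sounds.

```python
# Copied from the VALID_FINALS value documented in sanskrit.sounds.
VALID_FINALS = frozenset('aAefIkmoNisruOwpUnEt')


def sketch_iter_splits(chunk, rules):
    """Yield (before, after) pairs for every position in `chunk`."""
    for i in range(1, len(chunk)):
        before, after = chunk[:i], chunk[i:]
        # A split with no sandhi change at all.
        yield before, after
        # Splits that undo a rule whose result starts at this position.
        for first, second, result in rules:
            if after.startswith(result):
                yield before + first, second + after[len(result):]


rules = [('a', 'i', 'e'), ('O', 'i', 'Avi')]
simple = list(sketch_iter_splits('nareti', rules))
noisy = list(sketch_iter_splits('narAviti', rules))

# The caller filters out splits whose first part ends in an invalid sound.
filtered = [(b, a) for b, a in noisy if b[-1] in VALID_FINALS]
```

The unfiltered output contains the false positive ('narAv', 'iti'), while the filtered list keeps the plausible ('narO', 'iti').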

Database schema

sanskrit.schema

Schema for Sanskrit data.

class sanskrit.schema.AbstractNominal(**kwargs)[source]

A complete nominal form. This corresponds to Panini’s subanta.

class sanskrit.schema.Case(**kwargs)[source]

Grammatical case:

  • nominative, corresponding to Panini’s prathamā
  • accusative, corresponding to Panini’s dvitīyā
  • instrumental, corresponding to Panini’s tṛtīyā
  • dative, corresponding to Panini’s caturthī
  • ablative, corresponding to Panini’s pañcamī
  • genitive, corresponding to Panini’s ṣaṣṭhī
  • locative, corresponding to Panini’s saptamī
  • vocative, corresponding to Panini’s saṃbodhana
class sanskrit.schema.EnumBase(**kwargs)[source]

Base class for enumerations.

Each enumeration has a name and an abbreviation.

class sanskrit.schema.Form(**kwargs)[source]

A complete form. This corresponds to Panini’s pada:

1.4.14 Terms ending in “sup” and “tiṅ” are called pada.

In other words, a Form is a self-contained linguistic unit that could be used in a sentence as-is.

class sanskrit.schema.Gender(**kwargs)[source]

Grammatical gender:

  • masculine, corresponding to Panini’s puṃliṅga
  • feminine, corresponding to Panini’s strīliṅga
  • neuter, corresponding to Panini’s napuṃsakaliṅga
  • unknown/undefined
class sanskrit.schema.GenderGroup(**kwargs)[source]

Grammatical gender of a nominal stem. Since stems support nearly every combination of genders, this class stores both individual genders:

  • masculine
  • feminine
  • neuter
  • unknown/undefined

and collections of genders:

  • masculine and feminine
  • masculine and neuter
  • feminine and neuter
  • masculine, feminine, and neuter
class sanskrit.schema.Gerund(**kwargs)[source]

A complete form. This corresponds to Panini’s ktvānta and lyabanta.

class sanskrit.schema.Indeclinable(**kwargs)[source]

A complete form. This corresponds to Panini’s avyaya.

class sanskrit.schema.Infinitive(**kwargs)[source]

A complete form. This corresponds to Panini’s tumanta.

class sanskrit.schema.Mode(**kwargs)[source]

Tenses and moods:

  • present, corresponding to Panini’s laṭ
  • aorist, corresponding to Panini’s luṅ
  • imperfect, corresponding to Panini’s laṅ
  • perfect, corresponding to Panini’s liṭ
  • simple future, corresponding to Panini’s lṛṭ
  • distant future, corresponding to Panini’s luṭ
  • conditional, corresponding to Panini’s lṛṅ
  • optative, corresponding to Panini’s vidhi-liṅ
  • imperative, corresponding to Panini’s loṭ
  • benedictive, corresponding to Panini’s āśīr-liṅ
  • injunctive
  • future optative
  • future imperative
class sanskrit.schema.Modification(**kwargs)[source]

Verb modification:

  • causative, corresponding to Panini’s ṇic
  • desiderative, corresponding to Panini’s san
  • intensive, corresponding to Panini’s yaṅ
class sanskrit.schema.ModifiedRoot(**kwargs)[source]

A root with one or more modifications.

class sanskrit.schema.NominalEnding(**kwargs)[source]

A suffix for regular nouns and adjectives. This corresponds to Panini’s sup.

class sanskrit.schema.NominalStem(**kwargs)[source]

Stem of a Nominal.

class sanskrit.schema.NounPrefix(**kwargs)[source]

A noun prefix. This includes nañ, among others.

class sanskrit.schema.Number(**kwargs)[source]

Grammatical number:

  • singular, corresponding to Panini’s ekavacana
  • dual, corresponding to Panini’s dvivacana
  • plural, corresponding to Panini’s bahuvacana
class sanskrit.schema.Paradigm(**kwargs)[source]

Represents an inflectional paradigm. This associates a root with a particular class and voice.

class sanskrit.schema.Participle(**kwargs)[source]

A complete form. This corresponds to Panini’s niṣṭhā and sat, as well as kvasu.

class sanskrit.schema.ParticipleStem(**kwargs)[source]

Stem of a Participle.

class sanskrit.schema.PerfectIndeclinable(**kwargs)[source]

A complete form. This corresponds to forms ending in the suffix ām, as in “īkṣāṃ cakre”.

class sanskrit.schema.Person(**kwargs)[source]

Grammatical person:

  • first person, corresponding to Panini’s uttamapuruṣa
  • second person, corresponding to Panini’s madhyamapuruṣa
  • third person, corresponding to Panini’s prathamapuruṣa
class sanskrit.schema.Prefix(**kwargs)[source]

A generic prefix.

class sanskrit.schema.PrefixedModifiedRoot(**kwargs)[source]

A prefixed root with one or more modifications.

class sanskrit.schema.PrefixedRoot(**kwargs)[source]

A root with one or more prefixes.

class sanskrit.schema.PronounStem(**kwargs)[source]

Stem of a Pronoun.

class sanskrit.schema.Root(**kwargs)[source]

A verb root. This corresponds to Panini’s dhātu:

1.3.1 “bhū” etc. are called dhātu.
3.1.22 Terms ending in san etc. are called dhātu.

Moreover, Root contains prefixed roots. Although this modeling choice is non-Paninian, it does express the notion that verb prefixes can cause profound changes in a root’s meaning and identity.

basis_id

The ultimate ancestor of this root. For instance, the basis of “sam-upa-gam” is “gam”.

class sanskrit.schema.RootModAssociation(modification)[source]

Associates a modified root with a list of modifications.

class sanskrit.schema.RootPrefixAssociation(prefix)[source]

Associates a prefixed root with a list of prefixes.

class sanskrit.schema.SandhiType(**kwargs)[source]

Rule type. Sandhi rules are usually of three types:

  • external rules, which act between words
  • internal rules, which act between morphemes
  • general rules, which act in any context
class sanskrit.schema.SimpleBase(**kwargs)[source]

A simple default base class.

This automatically creates:

  • __tablename__
  • id (primary key)
  • name (string)
class sanskrit.schema.Stem(**kwargs)[source]

A nominal stem. This corresponds to Panini’s aṅga:

1.4.13 Anything to which a suffix can be added is called an aṅga.

But although “aṅga” also includes “dhātu,” Stem does not. Verb roots are stored in Root.

dependent

True iff a stem can produce its own words. For stems like “nara” or “agni” this value is True. For stems like “ja” (dependent on upapada, as in “agra-ja”) or “tva” (a suffix, as in “sama-tva”), this value is False.

class sanskrit.schema.StemIrregularity(**kwargs)[source]

Record of an irregular stem.

Some Sanskrit stems are inflected irregularly. Since only an exceedingly small number of stems is irregular (< 100), it’s easiest to record those irregularities in their own scheme.

fully_described

If True, assume that only the forms stored in the database are valid. If False, assume that all unspecified forms can be generated by applying normal rules and endings to the stem.

class sanskrit.schema.Tag(**kwargs)[source]

Part of speech tag. It contains the following:

  • noun
  • pronoun
  • adjective
  • indeclinable
  • verb
  • gerund
  • infinitive
  • participle
  • perfect indeclinable (also known as the periphrastic perfect)
  • noun prefix
  • verb prefix
class sanskrit.schema.VClass(**kwargs)[source]

Verb class:

  • class 1, corresponding to Panini’s bhvādi
  • class 2, corresponding to Panini’s adādi
  • class 3, corresponding to Panini’s juhotyādi
  • class 4, corresponding to Panini’s divādi
  • class 5, corresponding to Panini’s svādi
  • class 6, corresponding to Panini’s tudādi
  • class 7, corresponding to Panini’s rudhādi
  • class 8, corresponding to Panini’s tanādi
  • class 9, corresponding to Panini’s kryādi
  • class 10, corresponding to Panini’s curādi
  • class unknown, for verbs like “ah”
  • nominal, corresponding to various Paninian terms
class sanskrit.schema.Verb(**kwargs)[source]

A complete form. This corresponds to Panini’s tiṅanta.

class sanskrit.schema.VerbEnding(**kwargs)[source]

A suffix for conjugated verbs of any kind. This corresponds to Panini’s tiṅ.

category

Name of the stem category that uses this ending. This is useful for partitioning the list of endings according to the stem under consideration.

The names of these groups depend on the initial data. The default data uses these names:

  • ‘simple’ for classes 1, 4, 6, and 10
  • ‘complex’ for classes 2, 3, 5, 7, 8, and 9
  • ‘both’ for all classes

class sanskrit.schema.VerbPrefix(**kwargs)[source]

A verb prefix. This corresponds to Panini’s gati, which includes cvi (svāgatī-karoti) and upasarga (anu-karoti).

class sanskrit.schema.VerbalIndeclinable(**kwargs)[source]

A complete form. VerbalIndeclinable is a superclass for three more specific classes: Gerund, Infinitive, and PerfectIndeclinable.

class sanskrit.schema.Voice(**kwargs)[source]

Grammatical voice:

  • parasmaipada
  • ātmanepada
  • ubhayapada

Analyzing forms

sanskrit.analyze

Code for analyzing Sanskrit forms, i.e. finding the basic lexical forms that produced them and specifying the word’s inflectional information.

license:MIT
class sanskrit.analyze.SimpleAnalyzer(ctx)[source]

A simple analyzer for Sanskrit words. The analyzer is simple for a few reasons:

  • It doesn’t do any caching.
  • It uses an ORM instead of raw SQL queries.
  • Its output is always “well-formed.” For example, neuter nouns can take only neuter endings.

This analyzer is best used when memory is at a premium and speed is a secondary concern (e.g. when on a web server).

analyze(word)[source]

Return all possible solutions for the given word. Any ORM objects used in these solutions will be in a detached state.

Parameters:word – the word to analyze. This should be a complete word, or what Panini would call a pada.

Analyzing text

sanskrit.tagger

Code for converting Sanskrit paragraphs into a list of linguistic forms. This is done through a part-of-speech tagger (Tagger) that also removes sandhi and identifies the lexical roots that underlie the forms in some passage.

When tagging, a block of text (i.e. a verse or paragraph) is called a segment and its space-separated substrings are called chunks. For example, the segment 'aTa SabdAnuSAsanam |' has chunks 'aTa', 'SabdAnuSAsanam', and '|'.

license:MIT
class sanskrit.tagger.NonForm(name)[source]

Wraps a chunk that couldn’t be parsed by the Tagger.

class sanskrit.tagger.TaggedItem(segment_id, chunk_index, form)[source]

Associates a linguistic form with a specific chunk and segment.

class sanskrit.tagger.Tagger(ctx)[source]

The part-of-speech tagger.

iter_chunks(segment)[source]

Iterate over the chunks in segment.

Parameters:segment – an arbitrary string
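Going by the module description, chunks are the space-separated substrings of a segment. A minimal sketch of that splitting step is shown below. `sketch_iter_chunks` is a hypothetical stand-in, and it assumes that splitting on whitespace is all that is required; the real method may do more.

```python
def sketch_iter_chunks(segment):
    """Yield the whitespace-separated chunks of `segment` in order."""
    for chunk in segment.split():
        yield chunk


chunks = list(sketch_iter_chunks('aTa SabdAnuSAsanam |'))

# Pair each chunk with its position, as a tagger would need for bookkeeping.
indexed = list(enumerate(chunks))
```

Note that the punctuation mark `|` comes out as its own chunk, matching the example in the module description.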
tag(segment, segment_id=None)[source]

Return the linguistic forms that compose segment. If a form can’t be parsed, it’s wrapped in NonForm.

Parameters:segment – an arbitrary string
Returns:a list of TaggedItem objects.