API¶
This page documents the entire linguistic “stack,” from individual sounds
(sanskrit.sounds) to paragraph-level morphosyntax
(sanskrit.tagger). Lower-level modules are at the top of the page.
Context¶
sanskrit.context¶
Manages the package context. For details, see
Context.
| license: | MIT |
|---|
-
class
sanskrit.context.Context(config=None, connect=True)[source]¶ The package context. In addition to storing basic config information, such as the database URI or paths to various data files, a
Context also constructs a Session class for connecting to the database.
You can populate a context in several ways. For example, you can pass a dict:
context = Context({'DATABASE_URI': 'sqlite:///data.sqlite'})
or a path to a Python module:
context = Context('project/config.py')
If you initialize a context from a module, note that only uppercase variables will be stored in the context. This lets you use lowercase variables as temporary values.
Config values are stored internally as a
dict, so you can always use ordinary dict methods:
context.config['FOO'] = 'baz'
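The loading rules described above can be sketched as follows. This is a minimal illustration and not the package's implementation: `load_config` is a hypothetical helper, and executing the module file with `runpy.run_path` is an assumption about how a module path might be read.

```python
import runpy


def load_config(config=None):
    """Return a plain dict of settings from a dict or a module path."""
    if config is None:
        return {}
    if isinstance(config, str):
        # Treat the string as a path to a Python file, run it, and keep
        # only the uppercase names it defines.
        namespace = runpy.run_path(config)
        return {key: value for key, value in namespace.items() if key.isupper()}
    # Otherwise treat the object as a dictionary.
    return dict(config)


settings = load_config({'DATABASE_URI': 'sqlite:///data.sqlite'})
settings['FOO'] = 'baz'  # ordinary dict methods work as usual
```

Filtering on `key.isupper()` is what lets lowercase variables in a config module act as temporary values: they are executed but never stored.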
Parameters: config – an object to read from. If this is a string, treat config as a module path and load values from that module. Otherwise, treat config as a dictionary.
-
config = None¶ A dict of various settings. By convention, all keys are uppercase. These are used to create engine and session.
-
enum_abbr¶ Maps an ID or name to an abbreviation.
-
enum_id¶ Maps a name or abbreviation to an ID.
-
Sounds and sandhi¶
sanskrit.sounds¶
Code for checking and transforming Sanskrit sounds. This module also
contains basic metrical functions (see sanskrit.sounds.meter()
and sanskrit.sounds.num_syllables()).
All functions assume SLP1.
| license: | MIT |
|---|
-
sanskrit.sounds.ALL_SOUNDS= frozenset(["'", 'A', 'C', 'B', 'E', 'D', 'G', 'F', 'I', 'H', 'K', 'J', 'M', 'L', 'O', 'N', 'Q', 'P', 'S', 'R', 'U', 'T', 'W', 'Y', 'X', 'a', 'c', 'b', 'e', 'd', 'g', 'f', 'i', 'h', 'k', 'j', 'm', 'l', 'o', 'n', 'q', 'p', 's', 'r', 'u', 't', 'w', 'v', 'y', 'x', 'z', '~'])¶ All legal sounds, including anusvara, ardhachandra, and Vedic ‘L’.
-
sanskrit.sounds.ALL_TOKENS= frozenset(['\n', ' ', "'", 'A', 'C', 'B', 'E', 'D', 'G', 'F', 'I', 'H', 'K', 'J', 'M', 'L', 'O', 'N', 'Q', 'P', 'S', 'R', 'U', 'T', 'W', 'Y', 'X', 'a', 'c', 'b', 'e', 'd', 'g', 'f', 'i', 'h', 'k', 'j', 'm', 'l', 'o', 'n', 'q', 'p', 's', 'r', 'u', 't', 'w', 'v', 'y', 'x', 'z', '|', '~'])¶ All legal tokens, including sounds, punctuation (‘|’), and whitespace.
-
sanskrit.sounds.CONSONANTS= frozenset(['C', 'B', 'D', 'G', 'K', 'J', 'L', 'N', 'Q', 'P', 'S', 'R', 'T', 'W', 'Y', 'c', 'b', 'd', 'g', 'h', 'k', 'j', 'm', 'l', 'n', 'q', 'p', 's', 'r', 't', 'w', 'v', 'y', 'z'])¶ Consonants.
-
sanskrit.sounds.NASALS= frozenset(['Y', 'n', 'R', 'm', 'N'])¶ Nasals.
-
sanskrit.sounds.SAVARGA= frozenset(['h', 'S', 'z', 's'])¶ Savarga
-
sanskrit.sounds.SEMIVOWELS= frozenset(['y', 'r', 'v', 'l', 'L'])¶ Semivowels.
-
sanskrit.sounds.SHORT_VOWELS= frozenset(['a', 'i', 'u', 'x', 'f'])¶ Short vowels.
-
sanskrit.sounds.STOPS= frozenset(['q', 'c', 'J', 'g', 'G', 'P', 'k', 'j', 'C', 'D', 'd', 'W', 'p', 'b', 'Q', 't', 'w', 'B', 'K', 'T'])¶ Stop consonants.
-
sanskrit.sounds.VALID_FINALS= frozenset(['a', 'A', 'e', 'f', 'I', 'k', 'm', 'o', 'N', 'i', 's', 'r', 'u', 'O', 'w', 'p', 'U', 'n', 'E', 't'])¶ Valid word-final sounds.
-
sanskrit.sounds.VOWELS= frozenset(['a', 'A', 'e', 'f', 'I', 'F', 'o', 'i', 'u', 'O', 'X', 'x', 'U', 'E'])¶ All vowels.
-
sanskrit.sounds.aspirate(L)¶ Aspirate L. If this is not possible, return L unchanged.
Parameters: L – the letter to aspirate
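The single-letter transforms in this module share one contract: return the transformed letter, or the input unchanged when no transform applies. A minimal sketch of that contract for aspiration, assuming a plain lookup table over the SLP1 stops (`ASPIRATED` and `aspirate_sketch` are hypothetical names, not the module's own):

```python
# SLP1 pairs of unaspirated and aspirated stops.
ASPIRATED = {
    'k': 'K', 'g': 'G',
    'c': 'C', 'j': 'J',
    'w': 'W', 'q': 'Q',
    't': 'T', 'd': 'D',
    'p': 'P', 'b': 'B',
}


def aspirate_sketch(letter):
    """Return the aspirated form of `letter`, or `letter` if there is none."""
    return ASPIRATED.get(letter, letter)


assert aspirate_sketch('k') == 'K'
assert aspirate_sketch('a') == 'a'  # vowels have no aspirated form
```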
-
sanskrit.sounds.clean(phrase, valid)[source]¶ Remove all characters from phrase that are not in valid.
Parameters: - phrase – the phrase to clean
- valid – the set of valid characters. A sensible default is sounds.ALL_TOKENS.
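As an illustration of the behavior described for clean, the filtering can be sketched as a single pass over the phrase. `clean_sketch` is a hypothetical stand-in, and the small `valid` set below is chosen only for this example; in practice you would pass something like sounds.ALL_TOKENS.

```python
def clean_sketch(phrase, valid):
    """Drop every character of `phrase` that is not in `valid`."""
    return ''.join(char for char in phrase if char in valid)


# A deliberately small set of valid characters for this example.
valid = frozenset('aATSbdnusm |')
cleaned = clean_sketch('aTa, SabdAnuSAsanam! |', valid)
assert cleaned == 'aTa SabdAnuSAsanam |'
```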
-
sanskrit.sounds.deaspirate(L)¶ Deaspirate L. If this is not possible, return L unchanged.
Parameters: L – the letter to deaspirate
-
sanskrit.sounds.dentalize(L)¶ Dentalize L. If this is not possible, return L unchanged.
Parameters: L – the letter to dentalize
-
sanskrit.sounds.devoice(L)¶ Devoice L. If this is not possible, return L unchanged.
Parameters: L – the letter to devoice
-
sanskrit.sounds.guna(L)¶ Apply guna to the given letter, if possible.
-
sanskrit.sounds.lengthen(L)¶ Lengthen L. If this is not possible, return L unchanged.
Parameters: L – the letter to lengthen
-
sanskrit.sounds.meter(phrase, heavy='_', light='.')[source]¶ Find the meter of the given phrase. Results are returned as a list whose elements are either heavy or light.
By the traditional definition, a syllable is heavy if one of the following is true:
- the vowel is long
- the vowel is short and followed by multiple consonants
- the vowel is followed by an anusvara or visarga
All other syllables are light.
Parameters: - phrase – the phrase to scan
- heavy – used to indicate heavy syllables. By default it’s a string, but you can pass in anything.
- light – used to indicate light syllables. By default it’s a string, but you can pass in anything.
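The traditional rules above can be sketched as a scan over the sounds of an SLP1 phrase. This is an illustrative sketch under stated assumptions, not the module's implementation: `meter_sketch` is a hypothetical name, whitespace and punctuation are ignored so consonant clusters may span word boundaries, and 'M' and 'H' stand for anusvara and visarga.

```python
SHORT_VOWELS = frozenset('aiufx')
LONG_VOWELS = frozenset('AIUFXeEoO')
VOWELS = SHORT_VOWELS | LONG_VOWELS
CONSONANTS = frozenset('kKgGNcCjJYwWqQRtTdDnpPbBmyrlvSzshL')


def meter_sketch(phrase, heavy='_', light='.'):
    """Return one `heavy` or `light` marker per syllable of `phrase`."""
    sounds = [c for c in phrase if c in VOWELS or c in CONSONANTS or c in 'MH']
    result = []
    for index, sound in enumerate(sounds):
        if sound not in VOWELS:
            continue
        # Collect everything between this vowel and the next one.
        following = []
        for later in sounds[index + 1:]:
            if later in VOWELS:
                break
            following.append(later)
        if sound in LONG_VOWELS:
            result.append(heavy)          # the vowel is long
        elif 'M' in following or 'H' in following:
            result.append(heavy)          # anusvara or visarga follows
        elif len(following) >= 2:
            result.append(heavy)          # multiple consonants follow
        else:
            result.append(light)
    return result


scan = meter_sketch('Darmakzetre kurukzetre')
assert scan == ['_', '_', '_', '_', '.', '_', '_', '_']
```

Under the same assumptions, the number of syllables in a phrase is simply the length of the returned list.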
-
sanskrit.sounds.nasalize(L)¶ Nasalize L. If this is not possible, return L unchanged.
Parameters: L – the letter to nasalize
-
sanskrit.sounds.num_syllables(phrase)[source]¶ Find the number of syllables in phrase.
Parameters: phrase – the phrase to test
-
sanskrit.sounds.retroflex(L)¶ Retroflex L. If this is not possible, return L unchanged.
Parameters: L – the letter to retroflex
-
sanskrit.sounds.samprasarana(L)¶ Apply samprasarana to the given letter, if possible.
-
sanskrit.sounds.semivowel(L)¶ Convert L to its corresponding semivowel. If this is not possible, return L unchanged.
Parameters: L – the letter to convert
-
sanskrit.sounds.shorten(L)¶ Shorten L. If this is not possible, return L unchanged.
Parameters: L – the letter to shorten
-
sanskrit.sounds.simplify(L)¶ Simplify the given letter, if possible.
Here, to “simplify” a letter is to reduce it to a sound that is permitted to end a Sanskrit word. For instance, the c in vAc should be reduced to k:
assert simplify('c') == 'k'
Parameters: L – the letter to simplify
-
sanskrit.sounds.voice(L)¶ Voice L. If this is not possible, return L unchanged.
Parameters: L – the letter to voice
-
sanskrit.sounds.vrddhi(L)¶ Apply vrddhi to the given letter, if possible.
sanskrit.sandhi¶
Classes that apply and undo sandhi rules.
| license: | MIT |
|---|
-
class
sanskrit.sandhi.Exempt[source]¶ A helper class for marking strings as exempt from sandhi changes. To mark a string as exempt, just do the following:
original = 'amI'
exempt = Exempt('amI')
Exempt is a subclass of unicode, so you can use normal string methods on Exempt objects.
-
class
sanskrit.sandhi.Joiner(rules=None)[source]¶ Joins multiple Sanskrit terms by applying sandhi rules.
-
add_rules(rules)[source]¶ Add rules for joining words.
Example usage:
joiner.add_rules([('a', 'i', 'e'), ('a', 'a', 'A')])
Parameters: rules – a list of 3-tuples, each of which contains:
- the first part of the combination
- the second part of the combination
- the result
-
static
internal_retroflex(term)[source]¶ Apply the “n -> ṇ” and “s -> ṣ” rules of internal sandhi.
Parameters: term – the string to process
-
join(chunks, internal=False)[source]¶ Join the given chunks according to the object’s rules:
assert 'tasyecCA' == joiner.join(['tasya', 'icCA'])
join() does not take pragṛhya rules into account. As a reminder, the main exceptions are:
- 1.1.11 “ī”, “ū”, and “e” when they end words in the dual;
- 1.1.12 the same vowels after the “m” of adas;
- 1.1.14 particles with just one vowel, apart from “ā”;
- 1.1.15 particles that end in “o”.
One simple way to account for these rules is to wrap exempt strings with Exempt:
assert joiner.join(['te', 'iti']) == 'ta iti'
assert joiner.join([Exempt('te'), 'iti']) == 'te iti'
Parameters: - chunks – a list of the strings that should be joined
- internal – if true, join words using the empty string instead of ‘ ‘.
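To make the interaction between rules and exempt strings concrete, here is a minimal sketch of rule-based joining. It is not the Joiner implementation: `join_sketch` and `RULES` are hypothetical, rules are assumed to match exactly one final and one initial sound, and the third rule is included only so the exempt example has something to block.

```python
class Exempt(str):
    """Marks a string as exempt from sandhi changes."""


# (first part, second part, result) triples, as accepted by add_rules.
RULES = [('a', 'i', 'e'), ('a', 'a', 'A'), ('e', 'i', 'a i')]


def join_sketch(chunks, rules, internal=False):
    """Join `chunks` left to right, applying the first matching rule."""
    separator = '' if internal else ' '
    lookup = {(first, second): result for first, second, result in rules}
    output = chunks[0]
    previous = chunks[0]
    for chunk in chunks[1:]:
        key = (previous[-1:], chunk[:1])
        if key in lookup and not isinstance(previous, Exempt):
            # Replace the two boundary sounds with the rule's result.
            output = output[:-1] + lookup[key] + chunk[1:]
        else:
            output = output + separator + chunk
        previous = chunk
    return str(output)


assert join_sketch(['tasya', 'icCA'], RULES) == 'tasyecCA'
assert join_sketch(['te', 'iti'], RULES) == 'ta iti'
assert join_sketch([Exempt('te'), 'iti'], RULES) == 'te iti'
```

Subclassing the string type keeps exempt values usable anywhere a normal string is expected, while still letting the joiner detect them with `isinstance`.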
-
-
class
sanskrit.sandhi.Splitter(rules=None)[source]¶ Splits Sanskrit terms by undoing sandhi rules.
-
iter_splits(chunk)[source]¶ Return a generator for all splits in chunk. Results are yielded as 2-tuples containing the term before the split and the term after:
for item in s.iter_splits('nareti'):
    before, after = item
iter_splits() will generate many false positives, usually when the first part of the split ends in an invalid consonant:
assert ('narAv', 'iti') in s.iter_splits('narAviti')
These should be filtered out in the calling function.
Splits are generated from left to right, but the function makes no guarantees on when certain rules are applied. That is, output is loosely ordered but nondeterministic.
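The following sketch shows one way splits might be generated and then filtered by the caller. It is an assumption-laden illustration, not the Splitter implementation: `iter_splits_sketch` and `plausible_splits` are hypothetical names, rules are the same 3-tuples used for joining, and unlike the real method this sketch is deterministic. The filter uses the sounds listed in sanskrit.sounds.VALID_FINALS.

```python
VALID_FINALS = frozenset('aAiIuUfeEoOkwtpNnmsr')
RULES = [('a', 'i', 'e'), ('a', 'a', 'A')]


def iter_splits_sketch(chunk, rules):
    """Yield (before, after) pairs, scanning `chunk` from left to right."""
    for index in range(1, len(chunk)):
        # A plain split that undoes no sandhi change.
        yield chunk[:index], chunk[index:]
        # Splits that undo a rule whose result starts at this position.
        for first, second, result in rules:
            if chunk.startswith(result, index):
                rest = chunk[index + len(result):]
                yield chunk[:index] + first, second + rest


def plausible_splits(splits):
    """Drop splits whose first part ends in an invalid word-final sound."""
    return [(before, after) for before, after in splits
            if before[-1] in VALID_FINALS]


assert ('nara', 'iti') in iter_splits_sketch('nareti', RULES)
raw = list(iter_splits_sketch('narAviti', RULES))
assert ('narAv', 'iti') in raw
assert ('narAv', 'iti') not in plausible_splits(raw)
```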
-
Database schema¶
sanskrit.schema¶
Schema for Sanskrit data.
-
class
sanskrit.schema.AbstractNominal(**kwargs)[source]¶ A complete nominal form. This corresponds to Panini’s subanta.
-
class
sanskrit.schema.Case(**kwargs)[source]¶ Grammatical case.
- Nominative case, corresponding to Panini’s prathamā
- Accusative case, corresponding to Panini’s dvitīyā
- Instrumental case, corresponding to Panini’s tṛtīyā
- Dative case, corresponding to Panini’s caturthī
- Ablative case, corresponding to Panini’s pañcamī
- Genitive case, corresponding to Panini’s ṣaṣṭhī
- Locative case, corresponding to Panini’s saptamī
- Vocative case, corresponding to Panini’s saṃbodhana
-
class
sanskrit.schema.EnumBase(**kwargs)[source]¶ Base class for enumerations.
Each enumeration has a name and an abbreviation.
-
class
sanskrit.schema.Form(**kwargs)[source]¶ A complete form. This corresponds to Panini’s pada:
1.4.14 Terms ending in “sup” and “tiṅ” are called pada.
In other words, a Form is a self-contained linguistic unit that could be used in a sentence as-is.
-
class
sanskrit.schema.Gender(**kwargs)[source]¶ Grammatical gender:
- masculine, corresponding to Panini’s puṃliṅga
- feminine, corresponding to Panini’s strīliṅga
- neuter, corresponding to Panini’s napuṃsakaliṅga
- unknown/undefined
-
class
sanskrit.schema.GenderGroup(**kwargs)[source]¶ Grammatical gender of a nominal stem. Since stems support nearly every combination of genders, this class stores both individual genders:
- masculine
- feminine
- neuter
- unknown/undefined
and collections of genders:
- masculine and feminine
- masculine and neuter
- feminine and neuter
- masculine, feminine, and neuter
-
class
sanskrit.schema.Gerund(**kwargs)[source]¶ A complete form. This corresponds to Panini’s ktvānta and lyabanta.
-
class
sanskrit.schema.Indeclinable(**kwargs)[source]¶ A complete form. This corresponds to Panini’s avyaya.
-
class
sanskrit.schema.Infinitive(**kwargs)[source]¶ A complete form. This corresponds to Panini’s tumanta.
-
class
sanskrit.schema.Mode(**kwargs)[source]¶ Tenses and moods:
- present, corresponding to Panini’s laṭ
- aorist, corresponding to Panini’s luṅ
- imperfect, corresponding to Panini’s laṅ
- perfect, corresponding to Panini’s liṭ
- simple future, corresponding to Panini’s lṛṭ
- distant future, corresponding to Panini’s luṭ
- conditional, corresponding to Panini’s lṛṅ
- optative, corresponding to Panini’s vidhi-liṅ
- imperative, corresponding to Panini’s loṭ
- benedictive, corresponding to Panini’s āśīr-liṅ
- injunctive
- future optative
- future imperative
-
class
sanskrit.schema.Modification(**kwargs)[source]¶ Verb modification:
- Causative, corresponding to Panini’s ṇic
- Desiderative, corresponding to Panini’s san
- Intensive, corresponding to Panini’s yaṅ
-
class
sanskrit.schema.NominalEnding(**kwargs)[source]¶ A suffix for regular nouns and adjectives. This corresponds to Panini’s sup.
-
class
sanskrit.schema.Number(**kwargs)[source]¶ Grammatical number:
- singular, corresponding to Panini’s ekavacana
- dual, corresponding to Panini’s dvivacana
- plural, corresponding to Panini’s bahuvacana
-
class
sanskrit.schema.Paradigm(**kwargs)[source]¶ Represents an inflectional paradigm. This associates a root with a particular class and voice.
-
class
sanskrit.schema.Participle(**kwargs)[source]¶ A complete form. This corresponds to Panini’s niṣṭhā and sat. Moreover, it also corresponds to kvasu.
-
class
sanskrit.schema.ParticipleStem(**kwargs)[source]¶ Stem of a
Participle.
-
class
sanskrit.schema.PerfectIndeclinable(**kwargs)[source]¶ A complete form. This corresponds to forms ending in the suffix ām, as in “īkṣāṃ cakre”.
-
class
sanskrit.schema.Person(**kwargs)[source]¶ Grammatical person:
- first person, corresponding to Panini’s uttamapuruṣa
- second person, corresponding to Panini’s madhyamapuruṣa
- third person, corresponding to Panini’s prathamapuruṣa
-
class
sanskrit.schema.PrefixedModifiedRoot(**kwargs)[source]¶ A prefixed root with one or more modifications.
-
class
sanskrit.schema.Root(**kwargs)[source]¶ A verb root. This corresponds to Panini’s dhātu:
1.3.1 “bhū” etc. are called dhātu.
3.1.32 Terms ending in san etc. are called dhātu.
Moreover, Root contains prefixed roots. Although this modeling choice is non-Paninian, it does express the notion that verb prefixes can cause profound changes in a root’s meaning and identity.
-
basis_id¶ The ultimate ancestor of this root. For instance, the basis of “sam-upa-gam” is “gam”.
-
-
class
sanskrit.schema.RootModAssociation(modification)[source]¶ Associates a modified root with a list of modifications.
-
class
sanskrit.schema.RootPrefixAssociation(prefix)[source]¶ Associates a prefixed root with a list of prefixes.
-
class
sanskrit.schema.SandhiType(**kwargs)[source]¶ Rule type. Sandhi rules are usually of three types:
- external rules, which act between words
- internal rules, which act between morphemes
- general rules, which act in any context
-
class
sanskrit.schema.SimpleBase(**kwargs)[source]¶ A simple default base class.
This automatically creates:
- __tablename__
- id (primary key)
- name (string)
-
class
sanskrit.schema.Stem(**kwargs)[source]¶ A nominal stem. This corresponds to Panini’s aṅga:
1.4.13 Anything to which a suffix can be added is called an aṅga.
But although “aṅga” also includes “dhātu,” Stem does not. Verb roots are stored in Root.
-
dependent¶ True iff a stem can produce its own words. For stems like “nara” or “agni” this value is True. For stems like “ja” (dependent on upapada, as in “agra-ja”) or “tva” (a suffix, as in “sama-tva”), this value is False.
-
-
class
sanskrit.schema.StemIrregularity(**kwargs)[source]¶ Record of an irregular stem.
Some Sanskrit stems are inflected irregularly. Since only an exceedingly small number of stems is irregular (< 100), it’s easiest to record those irregularities in their own scheme.
-
fully_described¶ If
True, assume that only the forms stored in the database are valid. IfFalse, assume that all unspecified forms can be generated by applying normal rules and endings to the stem.
-
-
class
sanskrit.schema.Tag(**kwargs)[source]¶ Part of speech tag. It contains the following:
- noun
- pronoun
- adjective
- indeclinable
- verb
- gerund
- infinitive
- participle
- perfect indeclinable (also known as the periphrastic perfect)
- noun prefix
- verb prefix
-
class
sanskrit.schema.VClass(**kwargs)[source]¶ Verb class:
- class 1, corresponding to Panini’s bhvādi
- class 2, corresponding to Panini’s adādi
- class 3, corresponding to Panini’s juhotyādi
- class 4, corresponding to Panini’s divādi
- class 5, corresponding to Panini’s svādi
- class 6, corresponding to Panini’s tudādi
- class 7, corresponding to Panini’s rudhādi
- class 8, corresponding to Panini’s tanādi
- class 9, corresponding to Panini’s kryādi
- class 10, corresponding to Panini’s curādi
- class unknown, for verbs like “ah”
- nominal, corresponding to various Paninian terms
-
class
sanskrit.schema.Verb(**kwargs)[source]¶ A complete form. This corresponds to Panini’s tiṅanta.
-
class
sanskrit.schema.VerbEnding(**kwargs)[source]¶ A suffix for conjugated verbs of any kind. This corresponds to Panini’s tiṅ.
-
category¶ Name of the stem category that uses this ending. This is useful for partitioning the list of endings according to the stem under consideration.
The names of these groups depend on the initial data. The default data uses these names:
- ‘simple’ for classes 1, 4, 6, and 10
- ‘complex’ for classes 2, 3, 5, 7, 8, and 9
- ‘both’ for all classes
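As a sketch of how category could be used to partition endings, assume the complex group covers classes 2, 3, 5, 7, 8, and 9 and that endings are available as (name, category) pairs. Both helper functions and the placeholder ending names are hypothetical; the real data lives in the database.

```python
SIMPLE_CLASSES = frozenset([1, 4, 6, 10])
COMPLEX_CLASSES = frozenset([2, 3, 5, 7, 8, 9])


def categories_for_class(vclass):
    """Return the ending categories that apply to verb class `vclass`."""
    categories = {'both'}
    if vclass in SIMPLE_CLASSES:
        categories.add('simple')
    elif vclass in COMPLEX_CLASSES:
        categories.add('complex')
    return categories


def endings_for_class(endings, vclass):
    """Keep the names of endings usable by verb class `vclass`."""
    allowed = categories_for_class(vclass)
    return [name for name, category in endings if category in allowed]


# Placeholder ending names, for illustration only.
endings = [('ending_1', 'simple'), ('ending_2', 'complex'), ('ending_3', 'both')]
assert endings_for_class(endings, 4) == ['ending_1', 'ending_3']
assert endings_for_class(endings, 9) == ['ending_2', 'ending_3']
```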
-
-
class
sanskrit.schema.VerbPrefix(**kwargs)[source]¶ A verb prefix. This corresponds to Panini’s gati, which includes cvi (svāgatī-karoti) and upasarga (anu-karoti).
-
class
sanskrit.schema.VerbalIndeclinable(**kwargs)[source]¶ A complete form.
VerbalIndeclinable is a superclass for three more specific classes: Gerund, Infinitive, and PerfectIndeclinable.
Analyzing forms¶
sanskrit.analyze¶
Code for analyzing Sanskrit forms, i.e. finding the basic lexical forms that produced them and specifying the word’s inflectional information.
| license: | MIT |
|---|
-
class
sanskrit.analyze.SimpleAnalyzer(ctx)[source]¶ A simple analyzer for Sanskrit words. The analyzer is simple for a few reasons:
- It doesn’t do any caching.
- It uses an ORM instead of raw SQL queries.
- Its output is always “well-formed.” For example, neuter nouns can take only neuter endings.
This analyzer is best used when memory is at a premium and speed is a secondary concern (e.g. when on a web server).
Analyzing text¶
sanskrit.tagger¶
Code for converting Sanskrit paragraphs into a list of linguistic
forms. This is done through a part-of-speech tagger
(Tagger) that also removes sandhi and
identifies the lexical roots that underlie the forms in some passage.
When tagging, a block of text (i.e. a verse or paragraph) is called
a segment and its space-separated substrings are called chunks.
For example, the segment 'aTa SabdAnuSAsanam |' has chunks
'aTa', 'SabdAnuSAsanam', and '|'.
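The segment and chunk terminology can be illustrated with a short sketch. `iter_chunks_sketch` is a hypothetical stand-in that assumes chunks are separated by any run of whitespace; the tagger's own iter_chunks may differ in detail.

```python
def iter_chunks_sketch(segment):
    """Yield the whitespace-separated chunks of `segment` in order."""
    for chunk in segment.split():
        yield chunk


chunks = list(iter_chunks_sketch('aTa SabdAnuSAsanam |'))
assert chunks == ['aTa', 'SabdAnuSAsanam', '|']
```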
| license: | MIT |
|---|
-
class
sanskrit.tagger.TaggedItem(segment_id, chunk_index, form)[source]¶ Associates a linguistic form with a specific chunk and segment.
-
class
sanskrit.tagger.Tagger(ctx)[source]¶ The part-of-speech tagger.
-
iter_chunks(segment)[source]¶ Iterate over the chunks in segment.
Parameters: segment – an arbitrary string
-
tag(segment, segment_id=None)[source]¶ Return the linguistic forms that compose segment. If a form can’t be parsed, it’s wrapped in NonForm.
Parameters: segment – an arbitrary string
Returns: a list of TaggedItem objects.
-