Indus Script & Computational Linguistics
Writing is an epitome of the intellectual creation of a civilisation. It involves the comprehension as well as the abstraction of symbols that signify specific achievements of human creativity and communication. Renfrew points out that "The practice of writing, and the development of a coherent system of signs, a script, is something which is seen only in complex societies... Writing, in other words, is a feature of civilisations". When a civilisation leaves behind written records, they are invaluable for understanding not only its civic society but also the basic thinking processes that moulded it.
Decipherment of any script is a challenging task. At times it is aided by the discovery of a multilingual text where the same text is written in an undeciphered script as well as known script(s). Both Egyptian hieroglyphs and Mesopotamian cuneiform texts were deciphered with the help of multilingual texts. In some cases, continuing linguistic traditions provide significant clues and at times interlocking phonetic values are used as a proof of decipherment. In the absence of these, statistical studies can provide important insights into the structure of the script and can be used to define a syntactic framework for the script.
The Indus script is a product of one of the largest Bronze Age civilisations, often referred to as the Harappan civilisation. At its peak, from 2500 BC to 1900 BC, the civilisation was spread over an area of more than a million square kilometres across most of present-day Pakistan, Afghanistan and north-western India. It was distinguished by its highly utilitarian and standardised lifestyle, excellent water management system and architecture. The civilisation had flourishing trade links with West Asia, and artefacts of the Harappan civilisation have been found several thousand kilometres away in West Asia.
The Indus script is predominantly found on objects such as seals, sealings (made of terracotta or steatite), copper tablets, ivory sticks, bronze implements and pottery from almost all sites of this civilisation, and at some West Asian sites too. The objects on which the script was written are typically a few square centimetres in size (with the exception of a signboard at Dholavira) and often have multiple components, with highly decorated unicorn and other animal motifs, with or without a feeding trough. Many of these objects also have geometric designs with multiple folds of symmetry and depictions of scenes involving humans. Sir Mortimer Wheeler, one of the excavators of Mohenjo Daro, says: "At their best, it would be no exaggeration to describe them as little masterpieces of controlled realism, with a monumental strength - in one sense out of all proportion to their size and in another entirely related to it."
The Indus script has defied decipherment in spite of several serious attempts. This is primarily because no multilingual texts have been found, the underlying language(s) is unknown and the script occurs in very short texts. The average length of an Indus text is five signs and the longest text in a single line has only 14 signs.
Through a series of systematic studies (see table below), the TIFR group, in collaboration with colleagues from India and abroad, has been working on understanding the structure of Indus writing. Adopting a novel methodology based on statistical and computational techniques, the group has approached the problem in a manner that makes no assumptions about the script's underlying content, language or connection to later writing. The study explores the structure of the Indus script in unprecedented detail using developments in machine learning, data mining and information theory, together with techniques from computational linguistics and pattern recognition such as Markov models and n-grams. Using these methods, the group first established that Indus writing has definite rules, or a grammatical structure. Having established that the writing is neither random nor disordered, the group is now working on revealing the subtleties of its structure. They have identified specific signs that begin and end the texts. There are frequently occurring sign combinations (pairs and triplets) which tend to appear at specific locations in the texts. A bigram model of the Indus script can restore the illegible or incomplete texts found on broken or damaged objects with about 75% accuracy. Equally interestingly, the flexibility of sign usage in Indus texts, as measured by conditional entropy, falls within the range of linguistic systems and is distinct from that of non-linguistic systems such as protein or DNA sequences or Fortran.
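To make the idea of bigram-based restoration concrete, here is a minimal sketch and not the group's implementation. It assumes a toy corpus of texts written as lists of hypothetical integer sign IDs, uses add-one smoothing (an assumption chosen for simplicity), and fills a single missing sign by picking the candidate that best fits both its left and right neighbours. The names `train_bigram`, `bigram_prob` and `restore_sign` are illustrative only.

```python
from collections import Counter

START, END = "<s>", "</s>"  # boundary markers, so text beginners/enders are modelled too


def train_bigram(texts):
    """Count contexts and adjacent sign pairs over a corpus of sign sequences."""
    contexts = Counter()
    pairs = Counter()
    for text in texts:
        padded = [START] + list(text) + [END]
        contexts.update(padded[:-1])             # every token that has a successor
        pairs.update(zip(padded, padded[1:]))    # adjacent (previous, current) pairs
    vocab = sorted({sign for text in texts for sign in text})
    return contexts, pairs, vocab


def bigram_prob(prev, cur, contexts, pairs, n_outcomes):
    """P(cur | prev) with add-one smoothing (an assumption of this sketch)."""
    return (pairs[(prev, cur)] + 1) / (contexts[prev] + n_outcomes)


def restore_sign(text, position, model):
    """Return the sign that maximises P(sign | left) * P(right | sign)."""
    contexts, pairs, vocab = model
    padded = [START] + list(text) + [END]
    i = position + 1                  # shift by one because of the START marker
    n_outcomes = len(vocab) + 1       # all signs plus the END marker
    scores = {}
    for candidate in vocab:
        left = bigram_prob(padded[i - 1], candidate, contexts, pairs, n_outcomes)
        right = bigram_prob(candidate, padded[i + 1], contexts, pairs, n_outcomes)
        scores[candidate] = left * right
    return max(scores, key=scores.get)


# Toy corpus: the sign IDs are hypothetical and carry no meaning.
corpus = [[1, 2, 3], [1, 2, 4], [5, 2, 3], [1, 2, 3], [5, 6, 3]]
model = train_bigram(corpus)

damaged = [1, None, 3]                # None marks the illegible sign
print(restore_sign(damaged, 1, model))
```

On this toy corpus the sketch proposes sign 2 for the damaged position, since 2 is the most common successor of 1 and the most common predecessor of 3. A real evaluation would hold out known signs and measure how often they are recovered, as in the cross-validation reported in the table.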
These studies will eventually help in defining a syntactic framework of the Indus script against which different hypotheses about its content can be tested.
| Sl. No. | Test/Measure | Results | Conclusions |
|---|---|---|---|
| 1. | Zipf-Mandelbrot law | Best fit for a = 15.4, b = 2.6, c = 44.5 (95% confidence interval) | A small number of signs account for the bulk of the data, while a large number of signs contribute to a long tail. |
| 2. | Cumulative frequency distribution | 69 signs: 80% of EBUDS; 23 signs: 80% of text enders; 82 signs: 80% of text beginners | Indicates asymmetry in the usage of the 417 distinct signs. Suggests logic and structure in the writing. |
| 3. | Bigram probability | Conditional probability matrix is strikingly different from the matrix assuming no correlations. | Indicates presence of significant correlations between signs. |
| 4. | Conditional probabilities of text beginners and text enders | A restricted number of signs follow frequent text beginners, whereas a large number of signs precede frequent text enders. | Indicates presence of signs having similar syntactic functions. |
| 5. | Log-likelihood significance test | Significant sign pairs and triplets extracted. | The most significant sign pairs and triplets are not always the most frequent ones. |
| 6. | Entropy | Random: 8.70; EBUDS: 6.68 | Indicates presence of correlations. |
| 7. | Mutual information | Random: 0; EBUDS: 2.24 | Indicates flexibility in sign usage. |
| 8. | Perplexity | Monotonic reduction as n increases from 1 to 5. | Indicates presence of long-range correlations. |
| 9. | Sign restoration | Restoration of missing and illegible signs. | Bigram model can restore illegible signs according to probability. |
| 10. | Cross validation | Sensitivity of the bigram model = 74% | Bigram model can predict signs with 74% accuracy. |
| 11. | Conditional entropy | Closer to linguistic systems than non-linguistic systems. | The flexibility of sign usage in Indus texts is closer to that of linguistic systems. |
| 12. | Comparison of compound signs with constituent sign sequences | The environments in which compound signs appear are very different from those of their constituent sign sequences, which rarely appear together. | Compound signs were not created as shorthand but seem to have a different function. |
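The entropy, mutual information and conditional entropy measures in the table can be illustrated with a short sketch. It is not the group's code: it assumes the textbook definitions over adjacent sign pairs, with all values in bits, and the function names and toy corpora are hypothetical. For reference, a uniform distribution over 417 signs has entropy log2(417) ≈ 8.70 bits, which is consistent with the "Random" entropy value reported above.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy in bits of the distribution given by a Counter of counts."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values() if c > 0)


def pair_statistics(texts):
    """Entropy measures over adjacent sign pairs (first sign, second sign)."""
    pairs = Counter()
    for text in texts:
        pairs.update(zip(text, text[1:]))
    first, second = Counter(), Counter()
    for (a, b), count in pairs.items():
        first[a] += count     # marginal distribution of the first sign in a pair
        second[b] += count    # marginal distribution of the second sign in a pair
    h_first, h_second, h_joint = entropy(first), entropy(second), entropy(pairs)
    return {
        "entropy_first": h_first,
        "entropy_second": h_second,
        # H(second | first) = H(first, second) - H(first)
        "conditional_entropy": h_joint - h_first,
        # I(first; second) = H(first) + H(second) - H(first, second)
        "mutual_information": h_first + h_second - h_joint,
    }


# Two hypothetical extremes: signs that combine freely vs. rigidly fixed pairs.
free_order = [["A", "B"], ["A", "D"], ["C", "B"], ["C", "D"]]
fixed_order = [["A", "B"], ["C", "D"]]

print(pair_statistics(free_order))
print(pair_statistics(fixed_order))
print(round(math.log2(417), 2))   # entropy of 417 equally likely signs
```

In the freely combining corpus the mutual information is 0 and the conditional entropy is 1 bit, while in the rigid corpus the mutual information is 1 bit and the conditional entropy is 0. Systems with intermediate flexibility fall between these two extremes.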
A statistical approach for pattern search in Indus writing. Nisha Yadav, M N Vahia, Iravatham Mahadevan and H. Joglekar
International Journal of Dravidian Linguistics, 37, 39 - 52, January 2008
Segmentation of Indus text. Nisha Yadav, M N Vahia, Iravatham Mahadevan and H. Joglekar
International Journal of Dravidian Linguistics, 37, 53 - 72, January 2008
Statistical analysis of the Indus script using n-grams. Nisha Yadav, Hrishikesh Joglekar, Rajesh P.N. Rao, M. N. Vahia, Iravatham Mahadevan, R. Adhikari
PLoS ONE 5(3): e9506, doi:10.1371/journal.pone.0009506, March 2010
A probabilistic model for analyzing undeciphered scripts and its application to the 4500-year-old Indus script. Rajesh P. N. Rao, Nisha Yadav, Mayank N. Vahia, Hrishikesh Joglekar, R. Adhikari, Iravatham Mahadevan
Proceedings of the National Academy of Sciences (PNAS), Dec. 2009, 106:13685-13690; published online before print August 5, 2009, doi:10.1073/pnas.0906237106
Evidence for linguistic structure in the Indus script. Rajesh P. N. Rao, Nisha Yadav, Mayank N. Vahia, Hrishikesh Joglekar, R. Adhikari, Iravatham Mahadevan
Science, 324, 1165, 2009
Network analysis reveals structure indicative of syntax in the corpus of undeciphered Indus civilisation inscriptions. Sitabhra Sinha, Raj Kumar Pan, Nisha Yadav, Mayank Vahia and Iravatham Mahadevan
Proceedings of the 2009 Workshop on Graph-based Methods for Natural Language Processing, ACL-IJCNLP 2009, pages 5–13, Suntec, Singapore
Entropy, the Indus script and language: A reply to R. Sproat. Rajesh Rao, Nisha Yadav, M N Vahia, H Joglekar, R Adhikari and I Mahadevan
Computational Linguistics 36(4), 2010
Harappan geometry and symmetry: A study of geometrical patterns on Indus objects. M N Vahia and Nisha Yadav
Indian Journal of History of Science, 45, 343, 2010
Classification of patterns on Indus objects. Nisha Yadav and M. N. Vahia
International Journal of Dravidian Linguistics, Vol. 40: No. 2, June 2011
Indus script: A study of its sign design. Nisha Yadav and M N Vahia
Scripta, Vol. 3, pp. 133-172, September 2011
FAQ
Pankaj: Ma'am, what were they trying to say? Was this their way of communicating with each other, or was it a way to represent their territory, or something like that? Their way of communicating is or was superior to today's ways.