Thursday, November 23, 2017

The Rice Transliteration Standard (RTS) for Roman Transliteration of Telugu


Some Telugu language lovers have been working tirelessly to popularize the use of Telugu on the Internet and to standardize it with Unicode.
As time goes by, people who have no knowledge of these efforts may try to reinvent the wheel.

I already see some apps that are worse than ones that were available 20 years ago.

The reason is that there is no single place that records every development and the details of these executables, tools, and apps.

I am still in love with the simple RTS standard for Telugu,

so I am mirroring a post by one of the inventors of RTS.

The Rice Transliteration Standard for Roman Transliteration of Telugu
                           
Roman transliteration of Telugu simply means writing Telugu using the
English (Roman) alphabet. Modern Telugu text has Telugu words, English
words written in Roman or Telugu script, and modern punctuation marks.
Transliteration is merely a way to represent modern Telugu text using
the English alphabet. Transliteration is not software; it is a form of
information representation. Transliteration is (typically) done by
humans, resulting in a file written with the English alphabet and
punctuation marks.

Inverse transliteration is an operation that extracts Telugu and
English from this file. The inverse transliteration function can be
realized as software. It is this software we refer to in this
document. The output of this software is a file, which when printed
contains Telugu written in Telugu script and English written either in
Roman script or Telugu script, and is an approximation to the text
that was transliterated in the first place. It is this output we refer
to here.

There can be (and are) several such transliteration schemes. We are
proposing the following scheme as a standard. In this scheme, many
letters can be transliterated in more than one way.  Some of them are
designed to cater to varying intuition, some to increase speed, and
some to be fault-tolerant. For the sake of a later reference, the
preferred form of the transliteration of the Telugu alphabet is
presented first. We emphasize that one doesn't need to stick to Table
1, and it is only a part of the standard.
Table 1:
-------
vowels:       a  aa  i  ee  u  oo  R  Ru  e  ea  ai  o  oe  ou  

plosives
and nasals:            
              k  kh  g  gh  ~m
 
              c  C   j  jh  ~n
              T  Th  D  Dh  N
   
              t  th  d  dh  n
              p  f   b  bh  m
fluids:
              y  r  l  v  S  sh  s  h  L  x  ~r

where S is "melika sa" and ~r is "banDi ra".

Examples:
     English meaning       Transliteration
     uncle                 maama
     ant                   cheema
     monkey                koeti
     play                  aaTa
     old                   paata
     important             mukhyam
     saw (n)               rampam
     eggplant              vankaaya
     order                 aaj~na
Software takes care of "guDintaalu" (consonant-vowel combinations) and
"vattulu" (consonant-consonant combinations) automatically. This is
not the only way to transliterate these words, though. There are
several other ways. Many letters have alternatives (equivalents), as
in the following table.
Table RTS:
_____________________________________________________________________

    a    aa=aaa=a'    i    ee=ii=ia=i'    u    oo=uu=U=ua=u'
    R    Ru    e     ea=ae=E=e'    ai   o    oe=O=oa=o'    au=ou
    k         kh=K=Kh        g        gh=G=Gh      ~m
    c=ch      C=Ch           j        jh=J=Jh      ~n
    T=t'      Th=th'         D=d'     Dh=dh'       N=nh
    t         th             d        dh           n
    p         f=P=ph=Ph      b        bh=B=Bh      m
    y   r   l   v=w   S   sh   s   h   L=lh=Lh   x=ksh   ~r

Throughout, h  = H.
alu (archaic)  = ~l
aloo (archaic) = ~L
arasunna = @M
visarga  = @h
avagraha (used in Sanskrit) = @2
na pollu (archaic) = @n
null operation = _  (underscore) (see below)

Syllable break = ^  (see below)
Force combination = & (see below)
For "sunnaa", see below.
tcha (allophone of c, now extinct) = ~c
tja  (allophone of j, now extinct) = ~j
________________________________________________________________________

Example. The Telugu word meaning monkey can be transliterated as any of
the following: koati, koeti, kOti, ko'ti. The same information is
represented by all of them. Any of these can be chosen, based on
personal preference or convenience.
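
The equivalence classes of Table RTS can be sketched as a lookup that rewrites any accepted spelling into the preferred spelling of Table 1. This is only an illustration, not the authors' software: the names ALTERNATIVES, PLAIN and to_preferred are hypothetical, the longest-match-first strategy is an assumption, and the "h = H" rule and the @ codes are left out.

```python
# Alternatives from Table RTS, keyed by the preferred spelling of Table 1.
ALTERNATIVES = {
    "aa": ["aaa", "a'"], "ee": ["ii", "ia", "i'"],
    "oo": ["uu", "U", "ua", "u'"], "ea": ["ae", "E", "e'"],
    "oe": ["O", "oa", "o'"], "ou": ["au"],
    "kh": ["K", "Kh"], "gh": ["G", "Gh"], "c": ["ch"], "C": ["Ch"],
    "jh": ["J", "Jh"], "T": ["t'"], "Th": ["th'"], "D": ["d'"],
    "Dh": ["dh'"], "N": ["nh"], "f": ["P", "ph", "Ph"],
    "bh": ["B", "Bh"], "v": ["w"], "L": ["lh", "Lh"], "x": ["ksh"],
}
# Letters that Table RTS lists without alternatives.
PLAIN = ("a i u R Ru e ai o k g ~m j ~n t th d dh n p b m "
         "y r l S sh s h ~r").split()

TO_PREFERRED = {tok: tok for tok in PLAIN}
for preferred, alts in ALTERNATIVES.items():
    TO_PREFERRED[preferred] = preferred
    for alt in alts:
        TO_PREFERRED[alt] = preferred

MAX_LEN = max(len(tok) for tok in TO_PREFERRED)

def to_preferred(word):
    """Rewrite one transliterated word using the Table 1 spellings."""
    out, i = [], 0
    while i < len(word):
        for size in range(MAX_LEN, 0, -1):    # longest match first
            chunk = word[i:i + size]
            if chunk in TO_PREFERRED:
                out.append(TO_PREFERRED[chunk])
                i += len(chunk)
                break
        else:                                  # not an RTS letter: copy as is
            out.append(word[i])
            i += 1
    return "".join(out)

for spelling in ["koati", "koeti", "kOti", "ko'ti"]:
    print(spelling, "->", to_preferred(spelling))
```

All four spellings of the word for monkey come out as "koeti", which is one way to check that two transliterations carry the same information.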

Notes.
1. The following symbols are treated as both Telugu and English symbols:
, < . > / ? : * ; + ] } [ { ` " ! $ % ( ) - = 1 2 3 4 5 6 7 8 9 0.
These symbols are transliteration-invariant. That is, these symbols
retain their meaning:
     mana de'Saaniki "svaatantryam" 1947 lo' vaccindi. kaanii idi
     nijangaa svaatantryamaa?

2. The following are special characters:  ~ @ & ' _ ^ #
They have special meanings, as can be noted from Table RTS. (However,
there is a way to print them in the output, as explained later.)
3. Both ' and "a" serve as a vowel-elongation suffix. That is,
"short vowel followed by ' or "a" becomes a long vowel."
    ceema = ciima = ci'ma = ciama, pOru = poeru = po'ru = poaru
4. There is a retroflex suffix, namely '. That is, "dental plosive
followed by ' becomes a retroflex."
                  aaTa = aat'a, enDa = end'a

5. "sunnaa" Generation (Nasal Contraction):
------------------------------------------
All nasals are contracted before plosives as in Rule 1 below. Rule 2,
like Rule 1, improves typo-tolerance.
Rule 1. Whenever the letter n or m is followed by one of {k, K, g,
G, c, C, j, J, T, Th, D, Dh, t, th, d, dh, p, P, b, B} (or their
alternatives), it will be converted to sunnaa.
Rule 2. Also, whenever the letter m is followed by one of {l, v, s, S},
it will be converted to "sunnaa" automatically.
Example: vankaaya, vamkaaya, lankhaNam, lamkhaNam, anga, amga, kance,
kamce, manTa, mamTa, SunTha, SumTha, enDa, emDa, santa, samta,
panthaa, pamthaa, undi, umdi, kampa, kanpa, cembu, cenbu, kaalamloe,
samvatsaram, hamsa, amSa - all generate a "sunnaa" automatically.
Force combination:
-----------------
The "sunnaa" generation rules produce unwanted results in rare cases.
The Sanskrit word for acid "aamla" doesn't have a "sunnaa" in it -
we need to force "la-vattu" under ma. Similarly, "kaanpu", "paanpu"
don't have a "sunnaa" in them: we need to force "pa-vattu" under na.
This is done by using "&", as in "aam&la", "kaan&pu", "paan&pu". We
emphasize that & is used only rarely, in special cases such as above.
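
Rules 1 and 2 and the & override can be sketched with two regular expressions. This is a minimal illustration under stated assumptions, not the RIT implementation: the function name contract_nasals is hypothetical, the letter M is borrowed as a stand-in for "sunnaa", and Rule 2 is read literally, so m before the digraphs lh and sh is left alone.

```python
import re

# Rule 1: n or m before a plosive. Checking the first letter is enough,
# because every alternative spelling (kh, Ch, th', ph, ...) starts with
# one of these letters. The lookbehind skips the nasals ~n and ~m.
RULE_1 = r"(?<!~)[nm](?=[kKgGcCjJTDtdpPfbB])"
# Rule 2: m before l, v (or w), s, S. Assumption: lh and sh do not count.
RULE_2 = r"(?<!~)m(?=[vwS]|l(?![hH])|s(?![hH]))"
NASAL = re.compile(RULE_1 + "|" + RULE_2)

def contract_nasals(word):
    """Mark each contracted nasal as M, then drop the & separators."""
    marked = NASAL.sub("M", word)
    # An & between the nasal and the next letter has already blocked the
    # match above, so it can now be removed.
    return marked.replace("&", "")

for word in ["vankaaya", "vamkaaya", "kaalamloe", "aamla", "aam&la", "kaan&pu"]:
    print(word, "->", contract_nasals(word))
```

In the output, M marks a "sunnaa" and a surviving lowercase n or m is a real nasal consonant, so "aamla" becomes "aaMla" while "aam&la" stays "aamla".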
Syllable break:
--------------
Suppose we want to write "wrong number" in Telugu script as one word.
If we write "raangnembar", there will be a "na-vattu" under "ge". But
writing "raang^nembar" breaks the syllable after "raang" and writes
"nembar" next to it, without producing the (unwanted)
consonant-consonant combination. That is, k^ is the "praaNa" (pure)
form of ka (without any vowel added to it). [In particular, typing ^
after m generates a "sunnaa".] However, a word ending in a consonant
always assumes ^ at the end by default. That is, we write "shaap" (for
shop), "lak" (for luck) and not "shaap^", "lak^".
Null-operation:
--------------
"poruguvaad'iki toeDupad'avoeyi" is perhaps too tough on the eye.  For
human readability, it may be typed as "porugu_vaad'iki
toeDu_paDa_voeyi". Both represent the same information, including
white spaces. The symbol _ is invisible to the software, that is why
we call it a null-op.  (However, _ serves another purpose, as will be
explained later.) We recommend using null-op only when the
transliterated text is supposed to be processed by humans. Otherwise,
typing effort is wasted by breaking the words by null-op, since it
is transparent to the software.
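
A minimal sketch of how software might discard the null-op follows. The function name strip_null_ops is hypothetical, and it assumes that an underscore at the very start of a word must be kept because of the other purpose of _ mentioned above.

```python
import re

# An underscore with a non-space character before it sits inside a word.
INNER_UNDERSCORE = re.compile(r"(?<=\S)_")

def strip_null_ops(text):
    """Remove null-ops inside words; keep a word-initial underscore."""
    return INNER_UNDERSCORE.sub("", text)

print(strip_null_ops("porugu_vaad'iki toeDu_paDa_voeyi"))
```

White space is untouched, so the readable and the compact spellings yield the same text.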
More equivalents:
----------------
j~n = jn
d'd' = dd'
t't' = tt'
How to represent English words:
------------------------------
Consider
           naa flight delay ayindi
in which it is obvious that the second and third words are English.
So, normally there is no need to take any special action when using
English words (which are to be printed in Roman script). Software
should normally be able to handle such a representation. You can skip
the next section which may be read when you run into an unusual
problem.
Automatic determination of English words:
----------------------------------------
Since the Rice Transliteration Standard as defined in Table RTS is almost
orthogonal to English [1], we provide automatic determination of
English words. However, there are some rare cases in which it is not
clear whether a word is Telugu or English:
   me'm ekkad'ikee poem. Sree Sree poem caduvutuu ikkad'e' unt'aam
where poem in the first instance is Telugu, in the second English.
There are a few more Telugu words, which when transliterated become
valid English words: are, gala, mana, nee, poem, eg.  Based on their
potential frequency, we treat some of them as Telugu and some as
English, by default.  For example, we treat "mana" as a Telugu word,
and "are" as an English word, by default. What if we want to use
"mana" as an English word? We simply enclose it by #s thus: # mana #.
Text enclosed between #s is inverse-transliteration-invariant. That
is, it will be printed as it is.
Similarly, we write _are to use "are" as a Telugu word. That is, we
have a way to force Telugu using _.  In other words, just as we force English
words by enclosing them with #, we force certain Telugu words (rare
cases) by prefixing them with _ .  Finally, the defaults associated with
the conflicting words can be changed by the users.  That is, if a user
wants to change "are" default to Telugu, (s)he can do so by editing a
defaults file.
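
The decision order described in this section can be sketched as follows. Everything named here is hypothetical: classify, the DEFAULTS dictionary standing in for the defaults file, and the is_english callback standing in for whatever automatic detection the software performs. Only the two default entries come from the text.

```python
# Defaults for conflicting words; only these two are stated in the text.
DEFAULTS = {"mana": "telugu", "are": "english"}

def classify(tokens, defaults=DEFAULTS, is_english=lambda word: False):
    """Label each whitespace-separated token as telugu, english or verbatim."""
    labelled = []
    verbatim = False
    for tok in tokens:
        if tok == "#":                      # toggles the invariant region
            verbatim = not verbatim
        elif verbatim:
            labelled.append((tok, "verbatim"))
        elif tok.startswith("_"):           # forced Telugu
            labelled.append((tok[1:], "telugu"))
        elif tok in defaults:               # conflicting word: use the default
            labelled.append((tok, defaults[tok]))
        else:                               # automatic determination
            labelled.append((tok, "english" if is_english(tok) else "telugu"))
    return labelled

print(classify("# mana # _are mana are".split()))
```

A user who prefers "are" to be Telugu would pass a modified dictionary, for example classify(tokens, defaults={**DEFAULTS, "are": "telugu"}).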

How to represent Special Characters:
-----------------------------------
We noted that @, ~, ^, &, ', _, # are special characters. Suppose the text
to be transliterated has these characters. How do we represent them in
transliteration? We enclose them by #s.  That is, # is an ESCAPE
character that toggles transliteration off and on. In other words,
text enclosed between #s is inverse-transliteration-invariant. It will
be printed as it is.
Example: #'# prints ', ### prints #, #Hello!# prints Hello!.  However,
the single quote ' retains its meaning when it doesn't follow a, i, u,
e, o, t, th, d, dh. Hopefully, future software, in most cases,
determines automatically whether ' is a quote (punctuation mark) or
whether it is a suffix.
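
The # escape can be sketched as a splitter that separates invariant text from text to be inverse-transliterated. The name split_escapes and the segment labels are hypothetical, and treating ### as a special case that yields a single # is an assumption made to match the example above.

```python
import re

# Either the special case ### (a literal #) or any run without # between #s.
ESCAPE = re.compile(r"#(#|[^#]*)#")

def split_escapes(text):
    """Split text into ("translit", ...) and ("literal", ...) segments."""
    segments, pos = [], 0
    for match in ESCAPE.finditer(text):
        if match.start() > pos:
            segments.append(("translit", text[pos:match.start()]))
        segments.append(("literal", match.group(1)))
        pos = match.end()
    if pos < len(text):
        segments.append(("translit", text[pos:]))
    return segments

for sample in ["#'#", "###", "#Hello!#", "Poetry of # Nannaya # and"]:
    print(sample, "->", split_escapes(sample))
```

Only the "translit" segments would be passed on to the rest of the inverse transliterator; a lone unmatched # is left inside a "translit" segment in this sketch.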

Line-breaks and Verse Environment:
---------------------------------
When typing we may or may not hit return. The `return' key strokes in
the input file have nothing to do with where the line breaks in the
output (except in the verse environment. See below).  We start new
paragraphs after a blank line. There is a verse environment, delimited
by |'s, where 'return' keystroke means line break in the output
(equivalent to \obeylines in TeX).
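
The line-break rules can be sketched as a small block splitter. The function name layout is hypothetical, and the sketch assumes that each | delimiter stands on a line of its own, as it does in the example file.

```python
def layout(text):
    """Group input lines into ("para", str) and ("verse", [lines]) blocks."""
    blocks, para, verse = [], [], []
    in_verse = False

    def flush_para():
        if para:
            blocks.append(("para", " ".join(para)))
            para.clear()

    for line in text.splitlines():
        stripped = line.strip()
        if stripped == "|":                  # enter or leave the verse environment
            if in_verse:
                blocks.append(("verse", list(verse)))
                verse.clear()
            else:
                flush_para()
            in_verse = not in_verse
        elif in_verse:
            verse.append(stripped)           # every return is a line break
        elif stripped == "":
            flush_para()                     # a blank line ends the paragraph
        else:
            para.append(stripped)            # returns inside a paragraph are ignored
    flush_para()
    if in_verse and verse:                   # tolerate a missing closing |
        blocks.append(("verse", list(verse)))
    return blocks

sample = "one\ntwo\n\nthree\n|\n  a b\n  c\n|\nfour"
print(layout(sample))
```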
Examples:
--------
     English meaning       Transliteration, with alternatives
     uncle                 maama,  ma'ma
     ant                   cheema, ceema, chiima, ciama
     monkey                kOti, koati, koeti, ko'ti
     play                  aaTa, aat'a
     old                   paata
     important             mukhyam, muKyam
     saw (n)               rampam, ranpam
     eggplant              vankaaya,  vamkaaya
     order                 aaj~na, aajna
   
Examples containing English words:
 Nobody is doing that nowadays and'ee, e'mant'aaru?
 Modern culture loe TV, videos part and parcel ayipoeyaayand'ee!
Examples containing English words written in Telugu:
krist'afar kaad'vel aa maat'a eppud'oe ceppad'u.
san^set' bulevaard' meeda oka kaameraa shaap undi.
The following is an example file.
-------------------------------------------------------------------
Free verse movement was spearheaded by Kundurti Anjaneyulu. The
movement can be traced back to the 1930s, but it really took off only
recently. The eighties have seen a number of good Telugu poets writing
excellent free verse. But free verse is not necessarily easily
understood. The reason for this is that modern poetry, like modern
life, is complex. While using increasingly complex imagery, modern
poetry also tends to shift its frame of reference to outside, rather
than keeping it inside. Poetry of # Nannaya # and # Peddana # can be
understood (with the help of a dictionary) without references to the
outside society or history. Contrast this with the poetry of  T. S.
Eliot. However, this shift is  a hallmark of modern poetry, and not
something peculiar to free verse.
Some examples of modern free verse follow. In the first example, the
poet expresses his closeness to soil, with which his umbilical cord is
still attached.
|
   nagna bhoommeeda nagna de'hantoe Sayaninci_nappaTi anubhavam.
   naa naraalu ekkad'oe bhoomi loepali poralloe modalai
   naaloeki vyaapinci_natt'u -
   bhoomi hRdayamloe janmistunna agni
   naa gund'egaa vikasistu_nnatt'u
   naakoo bhoomikee oka avinaa_bhaava sambandham
....
   bhoomi vittu   andu_loenci ne' putt'u_kostaa.
   bhoomi oka naxatra pushpam
   andu_loenci ne' parima_Listaa
   bhoomi oka nayanam   andu_loenci ne' dRshTi saaristaa
....
|
 (by  K. Siva Reddy,  `nagna bhoomeeda ', in
a collection of his poems "mOhanaa! O mOhanaa!", 1988)
--------------------------------------------------------------------
Note `Kundurti Anjaneyulu' is not enclosed by #s, whereas `Nannaya'
and `Peddana' are. The reason is that software should be able to
recognize modern names and handle them appropriately.  However, if we
write `kundurti aanjanEyulu', it will be printed in Telugu script.
Software:
--------
We will present the inverse transliteration software, called Rice
Inverse Transliterator (RIT), in a separate posting.

We now state the Rice Internal Representation below. This is used as a
common platform for all subsequent text processing tasks such as
typesetting and spell-checking. It is not necessary to know this
representation for transliteration purposes. Only software enthusiasts
may find this useful. Others may skip this section.

                  The Rice Internal Representation
                  --------------------------------
Text processing becomes simpler if each Telugu character is
represented by a single ASCII character. Furthermore, the internal
representation serves as a canonical one-to-one mapping between Telugu
alphabet and ASCII. For example, "Th" and "th'" are both represented
internally by "Q". Since the internal representation is not meant to be
read, only to be processed by the software, intuition does not play a
role here. The Rice Internal Representation follows.
         a A i I u U R H e E y o O w
         k K g G V
         c C j J W
         T Q D Z N
         t q d z n
         p f b B m
   
         Y r l v S P s h L x F

The Special Characters:
sunnaa         = M
visarga        = X
alu            = ASCII(1)
aloo           = ASCII(2)
arasunna       = ASCII(5)
avagraha       = ASCII(6)
na pollu       = ASCII(11)
Syllable break = ASCII(30)
tcha           = ASCII(16)  (^P)
tja            = ASCII(25)  (^Y)

Since the internal representation is not intended to be read by
humans, we need to be able to produce a human readable representation
from this. In case we need to do so, we represent the information
using Table 1, given in the beginning of this document.
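
The mapping between Table 1 and the internal representation can be sketched as a pair of dictionaries. This is an illustration under assumptions: the two tables are paired position by position (which agrees with the "Th" to "Q" example above), the names encode and decode are hypothetical, and of the special characters only visarga (@h, internal X) is included; the ASCII control codes are left out.

```python
TABLE_1 = ("a aa i ee u oo R Ru e ea ai o oe ou "
           "k kh g gh ~m c C j jh ~n T Th D Dh N "
           "t th d dh n p f b bh m "
           "y r l v S sh s h L x ~r").split()
INTERNAL = ("aAiIuURHeEyoOw" "kKgGV" "cCjJW" "TQDZN"
            "tqdzn" "pfbBm" "YrlvSPshLxF")
assert len(TABLE_1) == len(INTERNAL)         # positional pairing (assumed)

ENCODE = dict(zip(TABLE_1, INTERNAL))
ENCODE["@h"] = "X"                           # visarga
DECODE = {code: token for token, code in ENCODE.items()}

def encode(text):
    """Table 1 spelling -> one internal character per Telugu letter."""
    out, i = [], 0
    while i < len(text):
        for size in (2, 1):                  # longest match first
            chunk = text[i:i + size]
            if chunk in ENCODE:
                out.append(ENCODE[chunk])
                i += len(chunk)
                break
        else:                                # anything else passes through
            out.append(text[i])
            i += 1
    return "".join(out)

def decode(internal):
    """Internal representation -> readable Table 1 spelling."""
    return "".join(DECODE.get(ch, ch) for ch in internal)

print(encode("mukhyam"), decode(encode("mukhyam")))
```

Characters outside the table, such as M for "sunnaa", spaces and punctuation, pass through unchanged. Decoding is meant for human reading only: a decoded string such as t followed by h would be read back as th, so a round trip from internal form is not guaranteed in this sketch.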

Reference: Ananda Kishore, "On Roman Transliteration of Telugu,"
soc.culture.indian.telugu, revised after posting.

- Ananda Kishore
  Rama Rao Kanneganti
