Saturday, August 25, 2018

Telugu Spelling Correction: Why Are Indic Languages Resource-Scarce?

Spelling correction is a well-known task in Natural Language Processing (NLP). Automatic spelling correction matters for many NLP applications, such as web search engines, text summarization, and sentiment analysis. Most approaches use parallel data of noisy and correct word mappings, gathered from different sources, as training data. Indic languages are resource-scarce and lack such parallel data, owing to low query volumes and the absence of prior implementations.
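
For concreteness, here is a minimal sketch of the classic dictionary-plus-edit-distance approach (in the style of Norvig's well-known corrector), which needs only a word-frequency list rather than parallel noisy/correct mappings. The file telugu_freq.txt below is hypothetical; that even such a frequency list is hard to come by for Telugu is the resource-scarcity problem in miniature.

```python
# Minimal Norvig-style corrector sketch. "telugu_freq.txt" is a
# hypothetical resource with lines like "<word> <count>"; no standard
# Telugu frequency list of this kind actually exists.
from collections import Counter

WORDS = Counter()
with open("telugu_freq.txt", encoding="utf-8") as f:
    for line in f:
        word, count = line.split()
        WORDS[word] = int(count)

# Candidate-generation alphabet: the whole Telugu Unicode block.
# A real system would restrict to assigned codepoints and valid
# consonant/matra positions.
ALPHABET = [chr(cp) for cp in range(0x0C00, 0x0C80)]

def edits1(word):
    """All strings one edit (delete/transpose/replace/insert) away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in ALPHABET]
    inserts = [L + c + R for L, R in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)

def known(words):
    return {w for w in words if w in WORDS}

def correct(word):
    """Pick the most frequent known candidate, preferring fewer edits."""
    candidates = (known([word]) or known(edits1(word))
                  or {e2 for e1 in edits1(word) for e2 in known(edits1(e1))}
                  or {word})
    return max(candidates, key=WORDS.get)
```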

The task of spelling correction is especially challenging for resource-scarce languages.
Then why do we have multiple disparate efforts, by multiple groups, with no cooperation or coordination, to create such resources for Indic languages?
Google, Microsoft, Mozilla, HP, IIIT, TDLP, and various IITs all want to go their separate ways. As soon as some progress is made, victory is declared and the troops are withdrawn, or just a few soldiers and advisors are left on the ground to preserve the status quo as much as possible. The end result is that, even after three decades of effort, we do not have a decent Telugu spelling and grammar correction tool with 90+% accuracy.
Why do we keep reinventing the wheel?
Why are we satisfied with mediocre software?

"The accuracy of the system was 98%"
The above sentence really has no proper scientific significance.

98% accuracy at what? At detecting every misspelt word and suggesting the correct, context-sensitive replacement? Measured over how many documents, on a corpus of what size?
Does it include scientific and technical works?
Does it include classical and poetic works?
How about slang words?
How about borrowed but "Telugized" words?
What about the major regional varieties of Telangana, Andhra, and Rayalaseema?
What about other dialects of Telugu?
At a minimum, an evaluation should answer these separately, as sketched below.
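
Here is a sketch of the minimum a credible evaluation should report, with entirely hypothetical names and data: detection (was the misspelt word flagged?) kept separate from correction (was the right replacement suggested?), each computed per corpus and per dialect. The `checker` interface assumed here (returning a flagged/suggestion pair) is an illustration, not any published system's API.

```python
# Sketch: why a single "98% accuracy" figure is underspecified.
# Detection and correction quality must be reported separately.
# All names and data here are hypothetical.

def evaluate(checker, test_set):
    """test_set: list of (observed_word, intended_word) pairs;
    observed == intended for correctly spelt words.
    checker(word) is assumed to return (flagged: bool, suggestion: str)."""
    tp = fp = fn = 0       # detection counts
    corrected = wrong = 0  # correction counts among true errors
    for observed, intended in test_set:
        flagged, suggestion = checker(observed)
        is_error = observed != intended
        if flagged and is_error:
            tp += 1
            if suggestion == intended:
                corrected += 1
            else:
                wrong += 1
        elif flagged and not is_error:
            fp += 1   # false alarm on a correct word
        elif not flagged and is_error:
            fn += 1   # missed a real error
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    correction_acc = corrected / (corrected + wrong) if corrected + wrong else 0.0
    return {"detection_precision": precision,
            "detection_recall": recall,
            "correction_accuracy": correction_acc}
```

Running such an evaluation separately on Telangana, Andhra, and Rayalaseema test sets would answer the dialect questions above; a single pooled number cannot.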

Using a training corpus of just 25,000 words to test language software for Telugu, which is spoken by 35.19 million people in Telangana and 88.64 million in Andhra Pradesh, is laughable.
(International Journal of Engineering Trends and Technology (IJETT), Volume 21, Number 7, March 2015)

Do we even have a standard POS tag set?

How many parts of speech are there? What level of granularity are we achieving?

"part-of-speech annotation in various research applications is incomparable which is variously due to the variations in tag set definitions. We understand that the morphosyntactic features of the language and the degree of desire to represent the granularity of these morpho-syntactic features, domain etc., decide the tags in the tag set. "
And what about agglutinative languages like Telugu? Are we going to set a standard for where to draw the line on the level of "sandhi"? (See the small illustration below.)
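
To make the granularity question concrete, here is a toy illustration; the segmentation and tags below are for exposition only, not from any published standard. The same surface token can be annotated as one word-level unit or split at the sandhi boundary into morpheme-level units, and corpora tagged the two different ways are not directly comparable.

```python
# Toy illustration of annotation granularity for agglutinative Telugu.
# Segmentation and tags are illustrative, not a published standard.

token = "ఇంటికి"   # "to the house": ఇల్లు (house) + dative case marker -కి

# Option A: one tag on the whole token (word-level annotation);
# the case information is either folded into a finer tag or lost.
word_level = [("ఇంటికి", "NN")]

# Option B: split at the morpheme boundary and tag each piece.
morph_level = [("ఇంటి", "NN"),    # oblique stem of ఇల్లు
               ("కి",   "PSP")]   # postposition/case-marker tag

# Corpora annotated under A and B disagree on token counts and tag
# distributions, so their reported accuracies cannot be compared.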

The IL POS tag set [14] proposed by Bharti et al. for Hindi is one such proposal.

Ten years ago it was suggested in a paper:

"It is strongly felt that all Indian languages should have the same tag set, so that annotated corpora in the corresponding languages may be useful in cross-lingual NLP applications, reducing much of the load on language-to-language transfer engines. This point can be explained by analogy with the existing script representation for Indian languages. The ISCII and Unicode representations for all Indian languages can be viewed in whichever language we like, just by setting the language code. There is no one-to-one alphabet mapping between the scripts of Indian languages. For example, the short e, o (ఎ, ఒ) are present in Telugu, while they are not available in Hindi, Sanskrit, etc. Similar alphabet variations exist between Telugu and Tamil. Even then, all these issues are taken care of in the process of language-to-language script conversion. Similarly, POS variations across Indian languages should also be taken care of."
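
The script-conversion analogy in the quote is easy to demonstrate: because the Unicode Indic blocks inherit the parallel ISCII layout, a naive Devanagari-to-Telugu converter is just a fixed codepoint offset, and exception tables are needed precisely for the mismatches (such as the short vowels) the quote mentions. A minimal sketch:

```python
# Naive script conversion via the parallel ISCII-derived Unicode layout.
# Devanagari occupies U+0900-097F and Telugu U+0C00-0C7F, so the same
# letter usually sits at the same slot, 0x300 codepoints apart.
# A real converter needs exception tables for letters without an
# everyday counterpart (e.g. Telugu short ఎ/ఒ map to Devanagari
# codepoints used only for transcribing Dravidian sounds).

DEVANAGARI_BASE = 0x0900
TELUGU_BASE = 0x0C00
OFFSET = TELUGU_BASE - DEVANAGARI_BASE  # 0x300

def devanagari_to_telugu(text: str) -> str:
    out = []
    for ch in text:
        cp = ord(ch)
        if 0x0900 <= cp <= 0x097F:        # inside the Devanagari block
            out.append(chr(cp + OFFSET))  # shift into the Telugu block
        else:
            out.append(ch)                # leave everything else alone
    return "".join(out)

print(devanagari_to_telugu("नमस्ते"))  # -> నమస్తే (same slots, Telugu glyphs)
```

The same offset trick works between most pairs of Indic blocks, which is what makes the quote's "just by setting their language code" claim plausible at the script level; the open question the post raises is why POS tag sets never got the same treatment.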
What happened after that?
