Spelling correction is a well-known task in Natural Language Processing (NLP). Automatic spelling correction is important for many NLP applications such as web search engines, text summarization, and sentiment analysis. Most approaches use parallel data of noisy and correct word mappings from different sources as training data for automatic spelling correction. Indic languages are resource-scarce and do not have such parallel data, owing to a low volume of queries and the absence of prior implementations.
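As a concrete illustration of what such parallel data is used for, here is a minimal sketch in Python, assuming a hypothetical list of (noisy, correct) word pairs and a small vocabulary; the class, the data, and the fallback strategy are placeholders for illustration, not any published Telugu system.

from collections import Counter, defaultdict

def edit_distance(a, b):
    # Standard Levenshtein distance, single-row dynamic programming.
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,            # deletion
                                     dp[j - 1] + 1,        # insertion
                                     prev + (ca != cb))    # substitution
    return dp[-1]

class PairTrainedCorrector:
    # Learns corrections from parallel (noisy, correct) word pairs and falls
    # back to the nearest vocabulary word by edit distance for unseen forms.
    def __init__(self, pairs, vocabulary):
        self.corrections = defaultdict(Counter)
        for noisy, correct in pairs:
            self.corrections[noisy][correct] += 1
        self.vocabulary = set(vocabulary)

    def correct(self, word):
        if word in self.vocabulary:
            return word                                    # already well formed
        if word in self.corrections:
            # Most frequent correction observed for this noisy form.
            return self.corrections[word].most_common(1)[0][0]
        return min(self.vocabulary, key=lambda v: edit_distance(word, v))

# Hypothetical toy data; real systems mine such pairs from query logs.
pairs = [("teh", "the"), ("recieve", "receive")]
vocab = ["the", "receive", "language"]
print(PairTrainedCorrector(pairs, vocab).correct("recieve"))   # -> receive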
The task of spelling correction is challenging for resource-scarce languages.
Why, then, do we have multiple disparate efforts by different groups, with no cooperation or coordination, to create such resources for Indic languages?
Google, Microsoft, Mozilla, HP, IIIT, TDLP, and various IITs all want to go their separate ways, and as soon as some progress is made, victory is declared and all the troops are withdrawn, or just a few soldiers and advisors are left on the ground trying to preserve the status quo as far as possible. The end result is that even after three decades of effort we do not have a decent Telugu spelling and grammar correction software with 90+% accuracy.
Why do we keep reinventing the wheel?
Why are we satisfied with mediocre software?
"The accuracy of the system was 98%"
The above sentence, on its own, has no real scientific significance.
98% accuracy at what? At detecting all misspelt words and suggesting the correct, context-sensitive corrections? Measured across how many documents, on what sized corpus?
Does it include scientific and technical works?
Does it include classical and poetic works?
How about slang words?
How about borrowed but "Telugized" words?
What about the major divisions of the Telangana, Andhra, and Rayalaseema language corpora?
What about other dialects of Telugu?
Using a training corpus of 25,000 words to test language software for Telugu, a language spoken by 35.19 million people in Telangana and 88.64 million people in Andhra Pradesh, is laughable.
(International Journal of Engineering Trends and Technology (IJETT) – Volume 21 Number 7 – March 2015)
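To make the complaint concrete, here is a minimal sketch, on hypothetical toy data, of the minimum breakdown an accuracy claim should come with: how many test words, how many genuine errors, and separate figures for error detection and for correction. The function and variable names are illustrative, not any real evaluation toolkit.

def evaluate(test_set, detect, suggest):
    # test_set: list of (observed_word, gold_word) pairs; detect/suggest
    # stand in for the system under test. All names are placeholders.
    errors = detected = corrected = false_alarms = 0
    for observed, gold in test_set:
        if observed != gold:                     # genuinely misspelt
            errors += 1
            if detect(observed):                 # was the error flagged?
                detected += 1
                if suggest(observed) == gold:    # was the right fix offered?
                    corrected += 1
        elif detect(observed):                   # correct word wrongly flagged
            false_alarms += 1
    return {
        "test_words": len(test_set),
        "true_errors": errors,
        "false_alarms": false_alarms,
        "detection_accuracy": detected / errors if errors else None,
        "correction_accuracy": corrected / errors if errors else None,
    }

# Toy usage; a credible report would also state corpus size and domain coverage
# (technical, classical, poetic, slang, dialectal) and how the test set was built.
known = {"the", "language"}
fixes = {"teh": "the", "langauge": "language"}
test = [("teh", "the"), ("the", "the"), ("langauge", "language")]
print(evaluate(test,
               detect=lambda w: w not in known,
               suggest=lambda w: fixes.get(w, w)))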
Do we even have a standard POS tag set?
How many parts of speech are there? What level of granularity are we achieving?
"part-of-speech annotation in various research applications is incomparable which is variously due to the variations in tag set definitions. We understand that the morphosyntactic features of the language and the degree of desire to represent the granularity of these morpho-syntactic features, domain etc., decide the tags in the tag set. "
What about agglutinative languages like Telugu? Are we going to set a standard on where to draw the line on the level of "sandhi"?
Consider the IL POS tag set [14] proposed by Bharti et al. for Hindi.
Ten years ago it was suggested in a paper:
"It is strongly felt that all Indian languages should have the same tag set so that the annotated corpus in corresponding languages may be useful in cross lingual NLP applications, reducing much load on language to language transfer engines. This point can be well explained by taking analogy of existing script representation for Indian Languages. The ISCII and Unicode representations for all Indian languages can be viewed appropriately in the languages we like, just by setting their language code. There is no one-to-one alphabet mapping in the scripts of Indian Languages. For example, the short e,o ( ఏ,ఒ) are present in Telugu, while they are not available in Hindi, Sanskrit etc. Similarly alphabet variations between Telugu and Tamil exist. Even then, all these issues are taken care of, in the process of language to language script conversion. Similarly POS variations across Indian Languages also should be taken care of."
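To see why the script analogy holds, here is a minimal sketch of the trick the quoted passage alludes to: the Unicode blocks for Indic scripts follow the parallel ISCII-style layout, so Telugu-to-Devanagari conversion is, to a first approximation, a fixed code-point offset, with the exceptions (such as Telugu short vowels unused in Hindi) needing special handling. The function is an assumption for illustration, not a production converter.

# Unicode Indic blocks are laid out in parallel (after ISCII), so
# Telugu -> Devanagari conversion is largely a fixed code-point offset.
TELUGU_START, TELUGU_END = 0x0C00, 0x0C7F     # Telugu block
DEVANAGARI_START = 0x0900                     # Devanagari block
OFFSET = TELUGU_START - DEVANAGARI_START      # 0x0300

def telugu_to_devanagari(text):
    out = []
    for ch in text:
        cp = ord(ch)
        if TELUGU_START <= cp <= TELUGU_END:
            out.append(chr(cp - OFFSET))      # shift into the Devanagari block
        else:
            out.append(ch)                    # leave other characters alone
    return "".join(out)

# కమల (kamala) maps cleanly to कमल; Telugu short vowels (ఎ, ఒ), however,
# land on Devanagari code points that standard Hindi orthography does not
# use, which is exactly the kind of variation the quoted passage mentions.
print(telugu_to_devanagari("కమల"))            # -> कमल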
What happened after that?