Thursday, December 26, 2013

Challenges in the Recognition and Searching of Printed Books in Indian Languages and Scripts

An excerpt:

I will also post why I concur with Randy Schekman.

"Randy Schekman, one of the 2013 crop of Nobel prize-winners (for physiology or medicine, in his case), decided to criticise the way scientific journals are run, he did not hold back.
Many good papers by Indian authors are either not published in well-known journals or, if they are, they are not easily available to other Indians working in India due to the exorbitant prices.

In today's internet world these journals are like the dinosaurs of yesteryear, and their demise is to be applauded and actively encouraged.

I am told the collaboration between Hwang Woo-suk and Schatten of the cloning infamy was just to get access to one of the prestigious journals. All that the American did was to suggest using a well-known professional photographer to take a picture of the cloned Afghan hound!

I have decided to post useful information on Indic computing and, wherever available, full papers, at least until someone complains of copyright violation. I intend to do this in both of my blogs.

"
Challenges in the Recognition and Searching of
Printed Books in Indian Languages and Scripts
R. Manmatha and C. V. Jawahar
Department of Computer Science,
University of Massachusetts, Amherst, MA, USA
manmatha@cs.umass.edu
Center for Visual Information Technology,
IIIT, Hyderabad


Optical character recognizers have been combined with text search engines to
successfully build digital book libraries in a number of European languages.
The recognition of printed books in Indian languages and scripts is, however,
still a challenging problem. This paper describes some of the challenges in
building recognizers for Indian languages. Challenges include the complexity of
some cursive scripts, the large number of classes, the paucity of annotated data
sets and the document quality. However, content-level browsing and accessing
document images is immediately required for accessing image collections in
digital libraries. We describe some possible approaches to this problem, including
techniques to directly search the text using word spotting.
Keywords: Indian Languages, Indian Scripts, Content-level Browsing, Image
Retrieval, Digital Libraries, OCR, Recognizer, Search System, Word Spotting.

6.1 Introduction
Books, newspapers and magazines form an important part of a society's
cultural heritage. To enhance cultural communication within a society, it is
important to provide easy access to such material. This is being currently
achieved for many European languages such as English, French and Russian
by projects such as the Million Book project, Google Books or the Internet
Archive. Books, newspapers and magazines are scanned, converted to text
using commercial optical character recognition (OCR) software and then
indexed for search. Making handwritten material searchable in a similar
manner is still difficult (even for English) and we will not discuss that further.
There is significant interest in making printed material in Indian languages
searchable in the same manner - for example, the Digital Library of India
(DLI) is such an attempt. However, there are no good commercial

recognizers available for Indian languages. Major challenges in developing
such recognizers include the variety of languages and scripts in India.
Additional challenges include the large number of character classes in many
of these scripts and the lack of good annotated data sets.
Why is it difficult to build OCR systems for Indian languages? There are
probably good financial reasons why companies have not tried to build OCR
systems for Indian languages unlike, say, Chinese. Much of the formal internal
and external communication in companies, large organizations and even
much of the government in India still takes place in English. The language
of high-end business in India is primarily English. This has proven financially
advantageous to India, for example in the software industry. However, it
also means that there is less of a financial incentive to build Indian language
OCRs. Here we will focus on the technical difficulties in recognizing Indic
documents. While the language of business may be English, Indian
languages are widely used in communications between people and form an
important intermediary in preserving the culture. A significant number of
newspapers, magazines and books get printed in these languages and it
would be useful to search such material. While considerable historical material
in Indian languages exists on palm leaf manuscripts and as inscriptions on
structures, these are much beyond what current technology can recognize
and we will, therefore, not discuss such material further.

OCR systems first have to find the layout of a document image to find
possible text regions. The text blocks are segmented into words and
characters. This is followed by a recognition phase using a classifier. The
accuracy of the pattern classifier is improved by post-processing with the
help of a dictionary or, for research systems, an appropriate language model.
In a number of scripts (for example Latin scripts) it is straightforward to
divide words into characters using white space as long as good quality printed
documents are used. This makes it possible to use character recognizers
built for each character, thus simplifying the problem. Note that this assumes
that the layout analysis as well as word/character segmentation work well.
In a number of Indian languages a line, the Shirorekha, connects characters,
making character segmentation trickier. In addition, with natural degradations
in the document, characters (or symbols) in the image may get cut into
multiple parts. Multiple characters can also get merged into one unit. Such
degradations make recognition challenging. Finally, Indian languages have a
potentially large number of character classes. We touch upon these points
later.
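
To make the post-processing step concrete, the sketch below (in Python, with an illustrative toy dictionary) snaps each word the classifier emits to its nearest dictionary entry under edit distance; this is a minimal sketch of the idea, not any particular system's post-processor.

# Minimal sketch of dictionary-based OCR post-processing: each word the
# classifier emits is replaced by the closest dictionary word under
# Levenshtein (edit) distance.  The dictionary here is a toy stand-in.

def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def correct(word, dictionary):
    # Snap a noisy recognized word to its nearest dictionary entry.
    return min(dictionary, key=lambda w: edit_distance(word, w))

dictionary = ["recognition", "language", "script", "character"]
print(correct("rec0gniti0n", dictionary))   # -> "recognition"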

There are some techniques which can potentially simplify this problem:
one can recognize words rather than characters, thus avoiding the character
segmentation problem. This technique is not favored for good quality printed
Latin scripts since the characters can usually be cleanly segmented. However,
it is used in handwriting and is also likely to be good for noisy text. Directly
recognizing each word leads to a large expansion of character classes but
has been successfully used, for example, in historical handwriting recognition.
A modification of this approach adapted from speech recognition involves
segmentation and recognition, at the same time, using an HMM - again a
technique used for print in some cases and widely used for handwriting.
Manmatha gives a brief overview of document image recognition while
Nagy reviews the recent literature in the area. Pal and Chaudhuri
survey the state of the art in Indian scripts.
For many digital libraries, recognition is considered as a prerequisite to
developing a search engine. Often, however, the output of the recognition is
not directly viewed by the user. In such cases search can avoid OCR completely. Search has the advantage that it is usually not reliant
on accurately finding every word. Rather, context is implicitly used in search
(as with multiple word queries). Search systems produce a ranking from
which the user selects the most appropriate or relevant result. This means that while the result must be in the top n, it does not have to be the top
result.
At least two different approaches exist for searching scanned images of
text directly. The first one, based on word spotting, involves clustering
similar word images together in the dataset using image matching (notice
that we are talking about clustering the test dataset and not the training set).
The clusters that are formed can either be annotated by a user (one
annotation per cluster rather than one per image) or alternatively searched
directly using an image query. Since people prefer text queries, a reverse
annotation step can be performed where the text query produced by a person
is converted to an image query by creating a glyph. For printed words creating
a reverse annotation is possible. A number of image matching steps have
been tried although dynamic time warping has been the most successful for
both handwriting and print, including printed Indian scripts. Dynamic
time warping is, however, slow and problems arise when trying to scale this
approach to large corpora. Alternative techniques have been suggested in

this case. One, proposed for handwriting, involves converting the handwritten
features to discrete segments using clustering. Another, tried for printed
books in Hindi and Telugu, involves using locality sensitive hashing to index
printed books. This latter approach is sensitive to font variations but, given
that much of a book is in a single font, it can be searched well using this
approach.
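
The reverse annotation step lends itself to a short sketch: render the typed query with a font for the target script and use the resulting image as the query. This assumes the Pillow imaging library, and the font file name is a placeholder for whatever font covers the script in question.

# Sketch of "reverse annotation": a text query is rendered into a
# black-on-white word image (a glyph) that can then be matched against
# the scanned word images.  Pillow is assumed; the font path below is
# a placeholder for any font covering the target script.
from PIL import Image, ImageDraw, ImageFont

def render_query(text, font_path="NotoSansTelugu-Regular.ttf", size=32):
    font = ImageFont.truetype(font_path, size)
    left, top, right, bottom = font.getbbox(text)
    img = Image.new("L", (right - left + 8, bottom - top + 8), 255)
    ImageDraw.Draw(img).text((4 - left, 4 - top), text, font=font, fill=0)
    return img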
A second approach involves image annotation techniques. Here, a set
of word images from a training set is annotated with the correct word. The
words are automatically segmented and continuous features computed which
are in turn transformed to discrete features. A statistical mapping is then
learned between the discrete features of a word and its English rendering
using the training set. The word segmentation and feature extraction are
repeated for the test set and the word images are automatically annotated
with their textual versions using the statistical model. A single word image
may be annotated with large numbers of words and associated probabilities.
Given a query, a language modeling based retrieval approach is then used to
rank the documents. Note that for multi-word queries it is possible that the
top annotation for any single word may not be the best match for the query
if a number of the other words in the query bias it towards a different
answer. This relevance model based approach has been successfully used
for historical handwriting but has not been tested for printed Indian languages.
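
As a rough illustration of this kind of ranking (a simplified stand-in, not the authors' exact relevance model), the sketch below scores documents whose word images each carry a probability distribution over candidate transcriptions; the documents, words and probabilities are invented.

# Each document is a list of word images, each represented here by a
# distribution over candidate transcriptions.  A document's score for a
# query term is the average probability its word images assign to the
# term; multi-word queries multiply (smoothed) term scores, a simple
# query-likelihood model.  All data below are illustrative.

def term_score(doc, term, epsilon=1e-6):
    p = sum(dist.get(term, 0.0) for dist in doc) / len(doc)
    return p + epsilon          # smoothing avoids zeroing the query

def score(doc, query):
    s = 1.0
    for term in query.split():
        s *= term_score(doc, term)
    return s

docs = {
    "page1": [{"raja": 0.7, "raga": 0.3}, {"kavi": 0.9, "ravi": 0.1}],
    "page2": [{"ravi": 0.8, "kavi": 0.2}],
}
for name in sorted(docs, key=lambda n: score(docs[n], "raja kavi"),
                   reverse=True):
    print(name, score(docs[name], "raja kavi"))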


In this chapter we focus on methods which enable us to search printed
digital libraries without doing explicit character recognition. We first
present some of the challenges in designing robust recognizers for Indic
scripts. We then look at word spotting and the proposed uses for
searching Indian scripts. Given that traditional word spotting approaches
can be slow, we consider the use of indexes built with locality sensitive hashing
to search word images rapidly. Finally, we conclude the chapter.
6.2 Challenges for Indic OCRs
The languages of India belong to either the Indo-European or Dravidian
language families, with a small number (in terms of speakers) belonging
to the Austro-Asiatic and Tibeto-Burman language families. Roughly
speaking, most speakers of Dravidian languages are concentrated in the
south of India while the rest of India has speakers of languages derived
from Sanskrit (an Indo-European language). The Austro-Asiatic and
Tibeto-Burman languages are found in small pockets. There are possibly
more than 200 languages in India. However, the twenty-two official
languages in India are Assamese, Bengali, Dogri, Gujarati, Hindi,
Kashmiri, Konkani, Maithili, Manipuri, Marathi, Nepali, Oriya, Punjabi,
Sanskrit, Sindhi, Urdu (all Indo-European), Kannada, Malayalam,
Tamil, Telugu (all Dravidian), Bodo (Tibeto-Burman) and Santhali

(Austro-Asiatic). The situation is further complicated since all the
Dravidian languages have loan words from Sanskrit - in particular
Telugu has a substantial fraction of its vocabulary derived from Sanskrit.
For most speakers, Hindi and Urdu are essentially the same language.
The two main distinctions between them are (a) that Hindi's vocabulary is
more Sanskritized while Urdu's vocabulary borrows more heavily from
Persian and Arabic and (b) Hindi is written in Devanagari while Urdu is
written in an Arabic-derived script. However, the latter difference means
that Hindi and Urdu require separate OCR systems.
Many of these languages have their own distinct scripts. For example,
each of the four main Dravidian languages has its own script; these are
very different from each other and from those of other Indian languages.
This is unlike Europe where a large number of languages share a Latin
alphabet with small variations. A number of these languages are spoken,
written and read by significant populations. Hindi (including Urdu) has about


600 million native speakers, Telugu about 75 million and Malayalam about
37 million speakers. All these languages have significant printed books and
resources and OCRs are immediately required to make them accessible.
The largest newspaper in India by circulation, Dainik Jagran, is a Hindi
newspaper.
A good OCR algorithm has two important components - a layout
recognition step and a step which recognizes the segmented characters.
While a lot of work has been done on recognizing isolated characters/words
in Indian scripts and languages, it is not often recognized that an essential
element in building a good OCR is a good layout recognition algorithm which
describes how the page is formatted. For example, a good layout analysis
algorithm extracts the image and text parts and segments the text into
groups such as columns, paragraphs, sentences, words and characters.
Layout analysis as well as segmentation is difficult for many reasons. For
Indian scripts, the complex shapes and their scattered distribution on a 2D
plane makes many text blocks similar to line drawings and pictures. The
situation gets further complicated when the printed text is generated using
word processors which are designed for scripts and fonts in European
languages. This has a direct bearing on the arrangements of glyphs and
their spacing. A quick look at the distance-angle plot of components in English
and Telugu (in Fig. 6.1) reveals that the symbol distribution in Telugu is
highly dispersed while that in English is highly structured (clear
peaks are seen in Fig. 6.1(a)). An empirical comparison of segmentation
algorithms argues that many popular algorithms are not directly applicable
to many of the Indian scripts. Popular algorithms available in the literature
use local geometric information in the form of distance between connected
components to develop segmentation algorithms. In many situations (as in
Fig. 6.2), the inter-line spacing may be smaller than the intra-word spacing.
In summary, not enough work has been done on layout analysis for Indian
languages; current layout analysis algorithms produce segmented
characters which are noisy, and results obtained on isolated characters
and words do not directly translate to real documents. Hence, more work is
needed on Indian languages to recognize real output from document images.
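
The distance-angle statistic behind Fig. 6.1 can be sketched as follows, assuming NumPy and SciPy are available: label the connected components of a binarized page image and, for each component, record the distance and angle from its centroid to that of its nearest neighbour.

# Sketch of the distance-angle statistic of Fig. 6.1: for every
# connected component on a binarized page, find the nearest other
# component and record the centroid-to-centroid distance and angle.
# Structured scripts (e.g. Latin) show sharp peaks in this
# distribution; scripts such as Telugu are far more spread out.
import numpy as np
from scipy import ndimage

def distance_angle_pairs(binary_page):
    # binary_page: 2-D array, nonzero where ink is present.
    labels, n = ndimage.label(binary_page)
    centroids = np.array(ndimage.center_of_mass(binary_page, labels,
                                                range(1, n + 1)))
    pairs = []
    for i, c in enumerate(centroids):
        d = np.linalg.norm(centroids - c, axis=1)
        d[i] = np.inf                    # ignore the component itself
        j = int(np.argmin(d))
        dy, dx = centroids[j] - c
        pairs.append((d[j], np.degrees(np.arctan2(dy, dx))))
    return pairs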







In virtually all scripts (such as Latin, Cyrillic, Chinese, Japanese
and Korean) for which commercial OCR has been successful,
characters or the corresponding units are separated by a space. Thus for
good quality documents it is possible without a great deal of effort to
segment individual characters in these scripts. In these languages,
problems arise when the documents are noisy or complex since the layout
recognition and character segmentation is then more likely to fail. For
example, drawing a line through a word or underlining a word is likely to
cause a commercial OCR to fail in English. The Latin alphabet also has
glyphs which are sufficiently different (distinct) - although one could
argue that the accent marks used in some languages complicate
recognition. A number of Indian scripts (Devanagari or the Bengali script
for example) have the characters joined using a Shirorekha or line
drawn on top of the characters. This makes character segmentation
more difficult. In addition most Indian languages use vowel modifiers
which may in some cases be just a dot. Thus minor changes in the image
can lead to a major change in interpretation of the character or word. Many
Indian scripts create new character classes by taking half a character and
joining it with the next character greatly expanding both the number of
character classes and the possibility of confusion since these are close to
the original characters. Rice et al. provide a good discussion of the problems
faced by OCR systems (even in Latin languages) while Pal and Chaudhuri
survey character recognition in Indian scripts. We now expand on these
difficulties.
A large number of characters are present in Indian scripts compared to
European languages. This makes recognition difficult for
conventional pattern classifiers. The basic unit of the language, the akshara,
may be a consonant (C), a consonant-vowel combination (CV), CCV or
CCCV. This makes the number of basic units enormously high, though many
of these valid units are rarely used in the language (but rare use complicates
training). These aksharas may consist of a single glyph (connected
component) or multiple components. Complex character graphemes with
curved shapes and the added inflections also make recognition
difficult. Additional challenges derive from the large number of similar/
confusing characters. Fig. 6.3 shows some of the pairs of similar characters
in Malayalam. The variation between these characters is extremely small.
Even humans find it difficult to recognize them in isolation. However, we
usually read them correctly from the context. Such issues in the recognition
process also increase the need for computational resources. Increased
computational complexity and memory requirements due to the large number
of classes, has been a bottleneck in developing robust OCR systems.
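
A back-of-the-envelope count shows how quickly the class inventory grows; the figures of 35 consonants and 15 vowel forms below are illustrative round numbers, not exact counts for any particular script.

# Rough count of possible akshara classes with 35 consonants (C) and
# 15 vowel modifiers (V), for the shapes C, CV, CCV and CCCV.  Most
# combinations are rare in running text, but a recognizer still has to
# allow for them.
C, V = 35, 15
counts = {"C": C, "CV": C * V, "CCV": C**2 * V, "CCCV": C**3 * V}
for shape, n in counts.items():
    print(f"{shape:5s}{n:>10,d}")
print(f"total{sum(counts.values()):>10,d}")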
The lack of standard databases, statistical information and benchmarks
for testing is another set of challenges in developing robust OCR systems
for Indian languages. This has prevented the scaling of available results to



large document collections. The absence of large standard datasets also
makes it difficult to develop OCR systems robust to natural variations. The
lack of well developed language models makes conventional post-processors
practically impossible. Many languages like Malayalam and Telugu have
very complex language structures, which does not make them attractive
candidates for linguistic post-processing (e.g. using dictionaries,
bigrams).
Unicode/display/font related issues in building, testing and deploying
working systems have slowed down research in the development of
character recognition systems for Indian languages. Many standard
representations such as Unicode fail to encode all the valid characters in
many of the Indian scripts. Such representational issues seriously affect the
development of software and systems. In some scripts the same character
can be written in multiple ways and all the multiple methods may coexist in
the same document. This has possibly happened because of script revisions
which have happened officially or unofficially at various times. Fig. 6.4



shows how three Malayalam characters were written during different periods of
time. Present day readers can comfortably read all these variants. Fig. 6.5
shows a similar example in Telugu where the same character (shown in
Hindi) gets written in Telugu. Variations in the glyph/shape of a character could
happen due to font/style. As the font or style changes, the glyph of a character
also changes considerably, which makes recognition difficult.
Fig. 6.5: The same Telugu characters are written in different ways. The first column shows
a Hindi character and the other columns show the corresponding Telugu
characters - with the same sound - written in different ways.


A significant population of Indian educated people can read, write or
comprehend multiple languages. Many Indian language documents contain
foreign language words (printed in the same or a foreign script). In practice,
script separation at word or character level is difficult. Fig. 6.6
demonstrates that highly similar shapes exist across characters. The
appearance of foreign or unknown symbols in the document makes the
recognition difficult, and sometimes unpredictable. For example, English
words might occur in the middle of an Indian language sentence even in
printed books.
There has been some progress in research on Indian language OCRs in
Bangla, Gurmukhi, Kannada and Telugu. Most of these attempts
have demonstrated recognition on a limited number of characters and pages.
Since the recognition system is not robustly tested on a large enough corpus
to validate the results, extending the work to commercial prototypes is
challenging. It is hoped that with the emergence of a large annotated corpus
for Indian languages, the situation will significantly improve.

6.3 Word Spotting
Word spotting has been tried for many different kinds of documents, both
hand-written and printed. Rath and Manmatha used dynamic time warping
to compute image similarities for handwriting. The word similarities
are then used for clustering using K-means or agglomerative clustering
techniques. This approach was adopted in Jawahar et al. for printed Indian
language document images. To simplify the process of querying, a word
image is generated for each query and the cluster corresponding to this
word is identified. In such methods, efficiency is achieved by significant
offline computation. Gatos et al. used word spotting for old Greek
typewritten manuscripts for which OCRs did not work. One advantage of
word spotting over traditional OCR methods is that it takes advantage of
the fact that within corpora such as books the word images are likely to be
much more similar, which traditional OCRs do not do.
Many of these techniques (for example DTW) are computationally
expensive and do not scale very well. In spite of this, Sankar et al.
successfully indexed 500 books in Indian languages using this approach by
doing virtually all the computation offline. This made the retrieval
instantaneous. Avoiding DTW, Ram et al. demonstrated the use of direct
clustering of word image features on historical handwritten manuscripts.
However, clustering is itself an expensive operation. An alternative, doing
this efficiently using locality sensitive hashing (LSH), will be discussed in the
next section.
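
Before turning to LSH, the core DTW matching step can be sketched in a few lines, using column-wise ink counts as a stand-in for the richer profile features used in the actual systems.

# Minimal sketch of DTW-based word-image matching.  Each word image is
# reduced to a 1-D sequence of column features (here simply the number
# of ink pixels per column); DTW then aligns two sequences of possibly
# different widths and returns a length-normalized dissimilarity.
import numpy as np

def column_profile(binary_word):
    # binary_word: 2-D array, nonzero where ink is present.
    return binary_word.astype(bool).sum(axis=0).astype(float)

def dtw_distance(x, y):
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j],      # insertion
                                 D[i, j - 1],      # deletion
                                 D[i - 1, j - 1])  # match
    return D[n, m] / (n + m)

# Two word images img_a, img_b would be compared as:
#   d = dtw_distance(column_profile(img_a), column_profile(img_b))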

6.4 Efficient Indexing Using LSH
Direct matching of images is inefficient due to the complexity of
matching and thus impractical for large databases. This may be solved by
directly hashing word image representations, using an efficient mechanism
for indexing and retrieval in large document image collections. First, words
are automatically segmented. Then features are computed at word level
and indexed; in this case profile features are used. Word retrieval is done
very efficiently by using an approximate nearest neighbor retrieval technique
called locality sensitive hashing (LSH). The word images are hashed into
hash tables using features computed at word level. Content-sensitive hash
functions are used to hash words such that the probability of grouping similar
words in the same index of the hash table is high. The sub-linear time content-sensitive
hashing scheme makes the search very fast without degrading
accuracy. Experiments on a collection of books in Telugu by Kalidasa - the
classical Indian poet of antiquity - demonstrate that 20,000 word images
may be searched in a few milliseconds. The approach thus makes searching
large document image collections practical.
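
The following is a minimal sketch of one standard LSH construction (random-hyperplane sign hashes) applied to fixed-length word-image feature vectors; the parameters and hash family are illustrative rather than the exact scheme used in these experiments.

# Sketch of LSH over word-image feature vectors.  Random hyperplanes
# map each vector to a tuple of signs; similar vectors tend to share a
# bucket, so a query is compared only against its own bucket rather
# than the whole collection.  Several tables reduce the miss rate.
import numpy as np
from collections import defaultdict

class LSHIndex:
    def __init__(self, dim, n_bits=12, n_tables=4, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = [rng.normal(size=(n_bits, dim))
                       for _ in range(n_tables)]
        self.tables = [defaultdict(list) for _ in range(n_tables)]

    def _keys(self, v):
        # One sign-pattern key per table.
        return [tuple(p @ v > 0) for p in self.planes]

    def add(self, v, label):
        for table, key in zip(self.tables, self._keys(v)):
            table[key].append((v, label))

    def query(self, v, k=5):
        # Gather candidates from matching buckets, then rank exactly.
        cand = {}
        for table, key in zip(self.tables, self._keys(v)):
            for u, label in table[key]:
                cand[label] = float(np.linalg.norm(u - v))
        return sorted(cand, key=cand.get)[:k]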



The query image and example search results are shown in Fig. 6.8. The
first two rows show correct results. The last column in the last two rows
shows examples where erroneous words may be retrieved although they
appear somewhat visually similar.
The results of queries containing words of different sizes and style types
are shown in Fig. 6.9. Such results are obtained by querying the same word



in multiple books of the collection. Using the same query on two different
books of the collection retrieves words which are content-wise similar.
Indian language words have small form variations. For example, the same
word may have different case endings. Such words are also searched
correctly using the proposed solution. Example results of such queries are
shown in Fig. 6.10 (row 2). The retrieved words have the same stem, which
is due to the similarity in image content. There are limits to the font variations
that can be handled by the proposed retrieval technique. Experiments show



that combinations of words in different fonts cannot be handled, but such
combinations are very unlikely to occur in books.
The proposed hash based search is sub-linear and much faster than
exhaustive nearest neighbor search. The experiments were conducted on
data sets of increasing size (by 5,000 words) in each iteration. The maximum
number of words used was around 45,000. With the maximum-size
data set, the maximum time to search for relevant words was of the order
of milliseconds. The experiments were conducted on an AMD Athlon 64-bit
processor with 512 MB of memory.
6.5 Discussions, Conclusion and Future Directions
Clearly, word spotting using locality sensitive hashing can successfully
retrieve documents in response to a query if the font variations are small. The
method is also extremely fast. The approach, used at word level, avoids
character segmentation and the complexities which may arise therein. Finally,
like all word spotting techniques it leverages the actual data set (rather than
just a training set). The method has also been tried on some other Indian
languages besides Telugu.
However, if font variations are large the method does not work as well.
This may be due to feature limitations and also technique limitations. Gradient
features may possibly improve results. One can also envision other
approaches which involve a combination of recognition and locality sensitive
hashing approaches. Other approaches which may be successful include
image annotation based approaches.
The difficulties inherent in Indian scripts imply that we need to think out
of the box and build robust systems which have better layout analysis systems
and recognizers. Even OCR systems for English are not that robust to noise.
Underlining a word can make the OCR system fail. The widely reported
recognition rate of 99% for English OCR systems is misleading since it is
only true for good quality printed documents in standard fonts. It hides
robustness problems when layouts are complex or there is noise. We believe
building recognition and search systems for Indian languages/scripts may
give us the opportunity to build better systems for all languages/ scripts.

