[PLUG] Localisation Camp followup - Payyans Workout session on Sunday 22 August 2010

Wed Aug 18 15:07:09 IST 2010

2010/8/18 Navin Kabra <navin.kabra at gmail.com>:
> What exactly is payyans?

Well, you found out the basic description yourself.

> I tried looking at the wiki, but it is in Malayalam; and google gives
> irrelevant information.

It was initially written to convert Malayalam ASCII encoded data to Unicode.

> I found this:
>
>> Payyans is a Language independent encoding converter – ASCII to Unicode
> and reverse. Read more from here : http://wiki.smc.org.in/Payyans
>
> but I cant really "read more" from there. Anybody willing to give more
> details, or translate that page?

Well, come down and help write the documentation - that is one task
for the workout, i.e. writing the documentation in English.

To get you started with the basics, we have to start from the number
system. As you might already know, computers understand only binary
data, i.e. zeros and ones. So how do we represent data in a way
computers can understand? Using a sequence of ones and zeros we can
represent any number. Now what about letters? Character encoding was
introduced as a way of representing characters as numbers. ASCII uses
7 bits to represent a character (there are 8-bit extensions as well).
Using 7 bits, we can represent up to 2^7 (128) characters. That was
sufficient to represent the English/Latin letters, digits and special
characters (including control characters). But there are so many
scripts around the world, and with 128 numbers we cannot represent
all of them.
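
A minimal Python sketch of the above, for illustration only (it is not
taken from Payyans). It uses the built-in ord() and chr() to show the
number behind a character, and shows that a Devanagari letter simply
has no ASCII number:

```python
# Every character is stored as a number; ASCII uses the values 0..127.
ascii_capacity = 2 ** 7                    # 7 bits give 128 distinct values

code_for_A = ord("A")                      # character -> number
binary_for_A = format(code_for_A, "07b")   # same number as 7 binary digits
char_for_97 = chr(97)                      # number -> character

print(ascii_capacity, code_for_A, binary_for_A, char_for_97)

# A Devanagari letter cannot be written in ASCII at all.
try:
    "\u0915".encode("ascii")               # U+0915 is DEVANAGARI LETTER KA
    fits_in_ascii = True
except UnicodeEncodeError:
    fits_in_ascii = False

print(fits_in_ascii)
```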

There were different attempts to solve this issue. For many European
languages, 8-bit extensions of ASCII were sufficient. For Indian
languages, we kept using the same numbers (from 0 to 127), so
internally the computer still handled the text as English characters.
But we substituted Indian language glyphs in the font and fooled the
computer into displaying Indian languages. This was good enough for
displaying Indian languages on screen and for printing, but other
important tasks like sorting and searching were not really possible,
because internally the text was still understood as English
characters. This technique became widely popular and even now many
popular newspapers use this system. The technique is so closely tied
to a font that the same font used for entering the data has to be
available on every system where one wants to read it.
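
To make the font trick concrete, here is a small Python sketch. The
glyph table is entirely made up (it is not the layout of any real
font); it only shows why the stored data stays "English" and why
searching and sorting go wrong:

```python
# HYPOTHETICAL glyph table for an imaginary legacy font: the font draws a
# Devanagari glyph where the file really contains a plain Latin letter.
IMAGINARY_FONT = {"a": "म", "b": "क", "c": "र"}


def render(text, font):
    """Return what a reader sees when `text` is drawn with `font`."""
    return "".join(font.get(ch, ch) for ch in text)


stored = "bac"                                   # what is actually in the file
seen_with_font = render(stored, IMAGINARY_FONT)  # looks like the word कमर
seen_without_font = render(stored, {})           # plain "bac" without the font

# Searching for the word as real Unicode text fails: the file has only "bac".
found = "कमर" in stored

# Sorting compares the stored Latin codes, not the letters the reader sees.
letters = ["b", "a", "c"]                        # displayed as क, म, र
sorted_by_stored_codes = [render(ch, IMAGINARY_FONT) for ch in sorted(letters)]
sorted_as_unicode = sorted(["क", "म", "र"])      # order of the real code points

print(seen_with_font, seen_without_font, found)
print(sorted_by_stored_codes, sorted_as_unicode)
```

Sorting by code point is only a rough stand-in for proper dictionary
order, but it is enough to show that the two results differ.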

Now Unicode comes into the picture with the promise of uniquely
identifying every character in the world. The limit of 128 (or 256
with 8-bit extensions) characters is gone, and it became possible to
have separate code points (numbers) for each of the Indian scripts.
There are different ways of representing these numbers as bytes, and
these are called encodings. The most popular is UTF-8, which uses a
variable number of bytes (one to four) to represent a character. There
is also UTF-16, which uses one or two 16-bit units per character.
Unicode encoded data can be read using any Unicode font that covers
the script, taking away the dependency on a particular font. The
OpenType specification for fonts has an option for substituting a
sequence of characters with another glyph (a glyph is the pictorial
representation of a character). This takes care of conjuncts, e.g. ka
halant ka (क ् क) is substituted with kka (क्क).
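
A short Python sketch of the same idea, for illustration only. It
prints the code point of ka, its size in UTF-8 and UTF-16, and the
three code points that are stored for the conjunct kka (the single
joined shape exists only in the font, not in the data). The character
U+11013 is used here just as an example of a code point above U+FFFF:

```python
ka = "\u0915"                                  # DEVANAGARI LETTER KA

code_point = f"U+{ord(ka):04X}"                # the Unicode number of ka
utf8_ka = ka.encode("utf-8")                   # 3 bytes in UTF-8
utf16_ka = ka.encode("utf-16-be")              # one 16-bit unit = 2 bytes
utf8_latin = "A".encode("utf-8")               # ASCII letters stay 1 byte

# A code point above U+FFFF needs 4 bytes in UTF-8 and two 16-bit units
# (a surrogate pair, 4 bytes) in UTF-16.
high = "\U00011013"
utf8_high = high.encode("utf-8")
utf16_high = high.encode("utf-16-be")

# The conjunct kka is stored as ka + halant + ka; the font joins them.
kka = "\u0915\u094D\u0915"
conjunct_code_points = [f"U+{ord(ch):04X}" for ch in kka]

print(code_point, len(utf8_ka), len(utf16_ka), len(utf8_latin))
print(len(utf8_high), len(utf16_high))
print(conjunct_code_points)
```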

Even though Unicode is widely used on the internet, some applications
popularly used for DTP still do not support it, and many people have
not moved to Unicode. So there is a lot of data encoded in ASCII
format which needs to be converted to Unicode if we want to make it
readable without needing a specific font, searchable, sortable ...

Payyans is one such program, written in Python, for converting ASCII
font specific data into Unicode. Padma is a Firefox plugin which does
the same for many Indian languages. It may seem simple to map the
ASCII data to the corresponding Unicode, but each font followed its
own encoding, so for every ASCII font you need a separate mapping
table. Moreover, proper conversion requires script specific
reordering, like moving the ikar from left to right (in ASCII the ikar
is added before the conjunct, but in Unicode the ikar is added after
the conjunct).
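
The two steps (table lookup, then reordering) can be sketched in
Python as below. This is only a rough sketch under my own assumptions
and is not how Payyans is actually written: the mapping table
FONT_MAP and the function convert() are hypothetical and do not
describe any real font.

```python
import re

# HYPOTHETICAL mapping table for one imaginary ASCII font. A real
# converter needs one such table per font.
FONT_MAP = {
    "d": "\u0915",   # ka
    "e": "\u092E",   # ma
    "j": "\u0930",   # ra
    "f": "\u093F",   # ikar (vowel sign i), typed BEFORE the consonant
    "~": "\u094D",   # halant
}

# ikar followed by a consonant or a conjunct (consonant + halant, repeated,
# then a final consonant). U+0915..U+0939 are the Devanagari consonants.
IKAR_BEFORE_CLUSTER = re.compile(
    "\u093F((?:[\u0915-\u0939]\u094D)*[\u0915-\u0939])"
)


def convert(legacy_text):
    """Map each legacy character, then move ikar after its conjunct."""
    mapped = "".join(FONT_MAP.get(ch, ch) for ch in legacy_text)
    return IKAR_BEFORE_CLUSTER.sub(lambda m: m.group(1) + "\u093F", mapped)


simple = convert("fd")       # ikar + ka            -> ka + ikar
conjunct = convert("fd~d")   # ikar + ka halant ka  -> ka halant ka + ikar
plain = convert("dej")       # no ikar, so only the table lookup applies

print(simple, conjunct, plain)
```

Real fonts also have separate glyphs for half forms, ready-made
conjuncts and so on, so an actual table and set of rules is much
bigger than this.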

For Devanagari, the conversion requirements are more complex than for
Malayalam, so we need to adapt Payyans to support Devanagari. Work has
already started, and it needs handling of some specific cases.

Hope it is clear now.
-- 
പ്രവീണ്‍ അരിമ്പ്രത്തൊടിയില്‍
You have to keep reminding your government that you don't get your
rights from them; you give them permission to rule, only so long as
they follow the rules: laws and constitution.