The Fate of Thai localisation

by Trin Tantsetthi
First published on the Bangkok Post Annual Information Technology Directory 1992 (May 10, 1992)

LOCALISATION is a process of adapting a product to suit the requirements of the local market. In the software industry, the localisation process is split into two main parts: code modification and message translation.

Software authors make certain assumptions when writing code. When the product is distributed worldwide, some of these assumptions turn out to be false. The most notable example is the assumption that each character takes one byte of storage and occupies a rectangular piece of real estate on the output device (called a cell). One impact of this assumption, as observed in Thailand, is that most software applications do not allow the user to enter data correctly.

The native Thai character set has characters that do not take up a space on output devices (called combining characters). Applications that process text, such as a word processor or text editor, must be modified, re-engineered or even rewritten just to eliminate this assumption.

The other part of software that needs localisation is text messages. User-friendly software speaks the native language of its users. In Thailand, application menus, error messages and manuals should appear in Thai.

Market picture

Software that processes Thai information started to flood the low-end market in the mid-1980s, when personal computers became more affordable.

In the early days, a few companies started a niche market by providing a piece of hardware called the Thai card. Basically, this is a video adaptor that has been modified to produce a higher number of scan lines on the screen. The higher number of scan lines is required because the glyphs (shapes) of Thai letters need a higher resolution, for good legibility, than the Latin-based scripts that PCs were designed for.

A Thai card also comes with a piece of software called a Thai driver, which provides the capability to handle Thai and serves as an enhancement to the BIOS (the basic system software that handles primitive input and output functions), which is unfortunately illiterate when it comes to Thai.

Later, when personal computers came equipped with more powerful processors and typically with higher resolution displays, a new class of Thai driver without special hardware started to emerge.

The choice between a hardware-assisted solution and a pure software one, or even which brand, is not left entirely to the user's judgement. Application availability is the key. Thai solution suppliers have to provide a great many software packages which have been modified or adapted to work on the Thai cards/drivers that they supply.

Local representatives of software packages also face the same situation. Plain English software, written elsewhere, will be used by only a small number of people in offices where no Thai is used. Given that Thai is the national language, correspondence with authorities in the public sector is done in Thai.

Since most of the population do not use English as an alternative to Thai, English-oriented software limits its own market potential. Besides, the Thai legal system requires that all transactions be recorded in hard copy, and most of these records must be in Thai. There are some exceptions, but such special arrangements are insignificant in the general picture.

The business community, which is the largest market for software sales, therefore undoubtedly needs Thai-language software.

As stated before, Thai solution suppliers stay in a niche market. Not many software packages have been localised mostly because of limited resources to do localisation. This also leaves a lot of room for local application developers to develop native Thai applications; another niche.

The biggest headache for application developers is that the Thai cards/drivers available in the market are incompatible with each other. To take full advantage of a Thai card/driver, the application must use the software interface provided by each driver. These interfaces are also functionally different from one another: they use different subroutine names and take different arguments, among other differences.

The implication is that a software application becomes closely tied to one or a few Thai cards/drivers. Users who wish to use that application in Thai must choose that Thai card/driver as well. They must also choose the associated hardware, such as a printer that the Thai card/driver supports, because a given printer may not work with certain drivers or certain applications.

In the case of Thai cards, users are forced to stay with the hardware, losing the liberty to use applications which are not supported by that card.

For the software-only Thai driver solution, there seems to be more flexibility in the choice of application.

Software-only drivers in the market are commonly available as unloadable TSRs (Terminate and Stay Resident programs -- a programming technique that keeps the driver in the PC's main memory after it is invoked from the command line). The unloadable feature enables the user to remove the driver from memory and install another one suitable for a new application that he or she would like to run. Still, this can be a hassle for general, non-technical users.

Wor-Taw-Taw comes to the rescue

Back in 1990, the National Electronics and Computer Technology Center (NECTEC) funded research led by Dr. Thaweesak Koanantakool of the Information Processing Institute for Education and Development at Thammasat University to devise a strategy for enabling native Thai applications to run across multiple hardware platforms. The project was named Wor-Taw-Taw, WTT for short, a Thai acronym for Wing-Took-Tee (run everywhere).

WTT is not a Thai driver. It is actually a common specification for writing native Thai application software. To prove the concept, the research team produced a version of a driver which, while conforming to the common specification, runs on ordinary video graphics adaptors as well as on three different commercial Thai cards.

The project in its final stages went public for review in the middle of 1991. Many excellent comments were raised by the reviewers, and NECTEC became aware that major subsequent efforts would be needed to integrate these comments into WTT.

To do this, NECTEC organised an entity called the Thai API Consortium (TAPIC), comprising leading systems engineers from many Thai solution suppliers. TAPIC's mission was to revise the WTT Application Programming Interface (API -- the commonly used subroutines through which applications call for services from the operating system) and I/O behaviour to make sure that software applications written on top of the WTT subsystem would be unleashed to enjoy full market coverage.

It hit the nail on the head by enabling native Thai applications to run across multiple platforms regardless of CPU or operating systems.

For Thai solution suppliers, this means they also get exposure to a larger market. Because the Thai cards/drivers of the past were different from each other, each Thai solution provider had to commit scarce human resources to keep adding sophisticated features to the product to attract application developers. The more these features were added, the harder application developers found it to keep pace with them, and the deeper users were locked into a specific piece of hardware or driver.

Consequently the market that they could get exposure to became smaller and smaller. With a driver implementing an industry-wide common approach, Thai solution providers can focus on the quality of their product, e.g. speed and memory usage, while being able to spend more time on bug fixes.

WTT has gained a considerable degree of public interest as well. It is a viable approach for avoiding being locked into a certain hardware platform. For example, WTT's printing subsystem provides run-time code conversion from TIS 620-2533 (the national standard character set that WTT applications use) for printing on various kinds of printers which might not have the TIS 620 or TIS 988 character set.

This conversion is done transparently to application logic and programmers, and requires minimal set-up effort from the user: setting an environment variable to select the proper conversion table. On the DOS platform, this is designed to be done in the AUTOEXEC.BAT file, so once the user has set it up, he or she does not have to worry about printing any more.
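The table-selection step can be sketched in C. The variable name WTT_PRINTER_TABLE and the table names below are hypothetical, invented for illustration; the actual names used by WTT drivers may differ:

```c
#include <string.h>

/* Pick the printer conversion table at start-up, as a WTT print
 * subsystem might.  The caller passes in the result of
 * getenv("WTT_PRINTER_TABLE") -- a hypothetical variable name. */
const char *select_conversion_table(const char *env_value)
{
    if (env_value == NULL)
        return "tis620-passthrough";   /* default: printer already has TIS 620 */
    return env_value;                  /* user's choice, e.g. set in AUTOEXEC.BAT */
}
```

A DOS user would then add one line such as SET WTT_PRINTER_TABLE=epson-thai to AUTOEXEC.BAT and never think about printing again.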

Since the approach used by TAPIC and NECTEC to tackle WTT is technically sound, the IT standards committee of the Thai Industrial Standards Institute (TISI -- commonly known as Saw-Maw-Or) has formed a subcommittee on software standards to review and revise WTT and, by its charter, will finally endorse WTT as another Thai Industrial Standard.

The WTT subsystem decouples Thai dependencies into two parts: input/output methods and Thai semantic services, the latter available in the form of an API and a run-time library. The separation of the I/O from the API is necessary so that the operating-system-dependent part (the I/O) can be kept out of the application program, allowing the WTT subsystem to be implemented on various operating systems with different ways of handling I/O. Porting the API part of the WTT subsystem to another operating system can be as simple as a recompile.
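This decoupling can be illustrated with a small C sketch. This is not the real WTT API -- the structure and function names below are invented for illustration -- but it shows how a portable layer can drive an OS-specific I/O layer through a narrow interface:

```c
#include <stddef.h>

/* OS-dependent layer: each platform (DOS, UNIX, ...) supplies its own
 * implementation of this I/O interface. */
struct wtt_io {
    void (*put_cell)(int col, const unsigned char *bytes, int n);
};

/* TIS 620 positions that occupy no display cell of their own (upper
 * and lower vowels, tone marks); the authoritative classification
 * comes from the WTT specification itself. */
static int is_combining(unsigned char c)
{
    return c == 0xD1 || (c >= 0xD4 && c <= 0xDA) || (c >= 0xE7 && c <= 0xEE);
}

/* Portable layer: groups each base character with its trailing
 * combining marks into one display cell and hands each cell to the
 * platform's put_cell.  (The real WTT rules further restrict which
 * marks may legally share a cell.)  Returns the number of cells;
 * passing io == NULL counts cells without drawing. */
int wtt_draw_string(const struct wtt_io *io, const unsigned char *s, int len)
{
    int col = 0, i = 0;
    while (i < len) {
        int start = i++;
        while (i < len && is_combining(s[i]))
            i++;                        /* marks join the current cell */
        if (io != NULL)
            io->put_cell(col, s + start, i - start);
        col++;
    }
    return col;
}
```

Porting to a new operating system means supplying a new struct wtt_io implementation; wtt_draw_string itself needs only a recompile.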

The I/O method is a complete description of how Thai information is entered from the input device and displayed on the output device. It explains in an unambiguous way how to do it right. The big side-effect of the I/O behaviour description is that it eliminates all reinterpretation of how a Thai character string should be treated at the I/O level, as well as under the processing model that application software and APIs understand. This ensures consistent behaviour for Thai processing across all WTT-compliant implementations.

Take a look at the string Paw-Plaa Sara-Ee Sara-Ee

Users of today's Thai implementations can argue to death over how this string should appear on the screen and how a software application should understand it. On some implementations, Paw-Plaa and Sara-Ee are seen as two symbols in the same cell on the screen but there are three letters in computer memory; on others there are only two letters in memory. In some implementations, Paw-Plaa Sara-Ee are written in one cell and Sara-Ee alone in another cell, and there are three letters in memory.

Under WTT, the second Sara-Ee is rejected at input time, since it is not possible, according to Thai grammar, to put two Sara-Ees in one cell. The WTT I/O subsystem thus helps reduce the typographical error rate by rejecting it at input time.

Obviously, in this case, Paw-Plaa and Sara-Ee are seen on the screen in one cell and there are only two letters in memory. But if the second Sara-Ee had somehow been entered into the system, e.g. by a non-WTT system, and were processed remotely across the network by a WTT application, the second Sara-Ee would be seen on the screen in a separate cell.
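The input-time check can be sketched as follows. The character classification here is a simplification invented for illustration; the WTT specification defines the full table of which character may legally follow which:

```c
/* TIS 620 above-cell vowels (Mai Han-Akat, Sara-I .. Sara-Uee);
 * Sara-Ee is 0xD5 (decimal 213).  A simplified class, for
 * illustration only. */
static int is_above_vowel(unsigned char c)
{
    return c == 0xD1 || (c >= 0xD4 && c <= 0xD7);
}

/* Decide whether `key` may be accepted after `prev` in the same cell.
 * Rejecting at input time, as WTT does, keeps the illegal sequence
 * out of memory entirely. */
int wtt_accept_key(unsigned char prev, unsigned char key)
{
    if (is_above_vowel(prev) && is_above_vowel(key))
        return 0;               /* two above-vowels in one cell: reject */
    return 1;                   /* otherwise accept */
}
```

A second Sara-Ee typed after the first would fail this check and never reach the application's buffer.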

The catch is that WTT guarantees consistency between what you see on the output device and what is stored in and understood by the computer. Otherwise, when Paw-Plaa Sara-Ee appears on the screen, one could NEVER know how many Sara-Ees are at that spot. Many database searches have failed for no reason apparent to the user, just because the I/O system implementation had not been thought out.

WTT, as of now, belongs to neither NECTEC nor TAPIC. Rights to use WTT Thai behaviour, I/O handling techniques and APIs have all been put in the public domain. The I/O subsystem specification has been completed by TAPIC and reviewed by TISI's software subcommittee.

It is anticipated that this part of WTT will be endorsed into an industrial standard by the end of 1992. Given that a complete description of Thai behaviour and I/O subsystem has been published and made available for free to the public by NECTEC, a few commercial Thai system software packages have already implemented it.

Software publishers' view

While local developers are stumbling with localisation, major software publishers are generally taking a different view, which is that in-house localisation efforts must be drastically reduced.

As mentioned earlier, localisation can be divided into two main parts. Message and manual translation cannot be eliminated; translation efforts will remain until good automatic language translation exists.

The tremendous effort of most concern to software publishers who operate a worldwide business is code modification to suit local markets. As their business grew, it did not take long for a package to become unmanageable in terms of engineering and quality.

A well-known database supplier is reported to have some 200 versions of code for each release of its popular package, for different hardware, operating systems and end-user languages. The work to add a new feature or fix a bug must be folded into the other 199 versions as well, and then all 200 versions have to be tested. The effort to maintain anything at this scale is non-trivial. Evidently, from the publishers' perspective, the problem of duplicated engineering effort must be fixed.

Some software publishers have entered into business agreements with local business entities.

From the software companies' point of view, Thailand has not provided software protection to a level that will comfort them. Therefore, the chance that local business firms can get the source code to Thai-ise a product is quite minimal.

By and large, since most modifications will be done without source code access, international software publishers have a common tendency not to classify Thai-ised products in the same category as the original version. The risk areas are compatibility with the original version, execution speed and software reliability. Thai-ised products are hardly mentioned in any worldwide marketing programme. Local companies have no choice but to stay in the vicious circle of learn - modify - test - release for each version of the product.

However, some software publishers do treat local developers well. These publishers normally enter agreements with local firms to jointly develop and market the product together.

International Standards

However, building a common source code base for all different end-user languages is not as easy as one might think. A computer program differentiates one character from another by looking at its value. In the Thai processing environment, Sara-Ee has the character value 213. The computer must know that character 213 is a combining mark before it can display this character on the screen or print it correctly. Software must know that character 213 does not take up a space on the output device so that it can compute field lengths correctly.
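In C, the field-length computation the software must perform looks roughly like this under Thai (TIS 620) assumptions; under another country's character set, the very same byte values would need an entirely different classification:

```c
/* TIS 620 combining marks (no display cell of their own); 213 (0xD5,
 * Sara-Ee) falls in this range.  The list follows the TIS 620 code
 * layout; the authoritative table is in the standard itself. */
static int tis620_is_combining(unsigned char c)
{
    return c == 0xD1 || (c >= 0xD4 && c <= 0xDA) || (c >= 0xE7 && c <= 0xEE);
}

/* Display width, in cells, of a TIS 620 string: combining marks are
 * stored in memory but occupy no cell on the output device. */
int tis620_display_width(const unsigned char *s, int len)
{
    int w = 0, i;
    for (i = 0; i < len; i++)
        if (!tis620_is_combining(s[i]))
            w++;
    return w;
}
```

For example, the two-byte string Paw-Plaa Sara-Ee (values 187, 213) has a display width of one cell.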

However, character 213 means different things in different countries. It is referred to as the letter Phi in Greece but Vav in Israel. It means something different again in European countries, depending on which country one is looking at. It is undefined in Arab countries and the United States. The value 213 is only half a character in Japan, Korea and China.

But since computer software must know how to process each character, it is basically impossible to write a generic software application whose code can process different local languages correctly. A technique has been developed to cope with this: locale switching.

A locale is a set of culture-dependent contexts for which software can ask the operating system for services. Such contexts include, for example, the order of the alphabet, the monetary unit, numeric formats and date/time formats. When the technique was developed, it was hoped that operating systems would provide a small database and run-time services to resolve cultural differences in end-user languages. Application software would select one locale at a time.

Locale switching mechanisms have been integrated into the ANSI/ISO C programming language, IEEE/ISO POSIX (the portable operating system interface specifications), XPG3 (X/Open Portability Guide, Issue 3) and X11R3 (the X Window System), to name a few. Application programs can thus enjoy the use of a national standard character set.
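In ANSI C the mechanism looks like this. The locale name "th_TH" used in the usage note below is an assumption for illustration; actual locale names vary by operating system, and a Thai locale may not be installed at all, so the sketch falls back to the default:

```c
#include <locale.h>

/* Select a locale and report the decimal-point string the C library
 * now uses -- one small example of the culture-dependent contexts a
 * locale carries.  Falls back to the default "C" locale when the
 * requested locale is not available on this system. */
const char *decimal_point_under(const char *locale_name)
{
    if (setlocale(LC_ALL, locale_name) == NULL)
        setlocale(LC_ALL, "C");         /* locale not installed: use default */
    return localeconv()->decimal_point;
}
```

A Thai application would call something like decimal_point_under("th_TH") once at start-up, and from then on the library's formatting routines would follow Thai conventions -- if, and only if, the operating system supplies that locale.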

However, there is one big problem with the locale switching mechanism: it was designed around Latin-based scripts and does not work well for non-Latin scripts, Sanskrit- and Semitic-based ones in particular.

Another subtle problem with the locale switching mechanism is that when a character, say 213, is sent to another computer, or even processed by the same computer under a different locale, the data can be interpreted incorrectly. The word Paw-Plaa Sara-Ee is represented by the characters 187 and 213. When this string is processed under the Chinese locale, the word year in Thai turns into badge (a symbol that signifies things) in Chinese. The semantics of the information are changed.

In a modern world where client/server programming is emerging, data exchange across the network is becoming common. When information coded in a national standard character set is sent to a computer that makes a different assumption about cultural context, things can easily go wrong.

ISO/IEC attacked this problem by defining a global character set, dubbed ISO/IEC 10646 Multiple Octet Coded Character Set, covering most, if not all, scripts used in computer applications. The attempt was to assign a unique value to each character.

ISO 10646 did not catch the attention of the international standards community until the late 1980s. It was structured as a 4-octet (each octet has 8 bits) canonical form and supported five `compression' mechanisms, one of which would enable existing software to continue working with a national standard character set. When a version called DIS2 (2nd Draft) was released, there were hot debates that ISO 10646, as it was then called, could never satisfy the processing requirements of some types of application. Finally, DIS2 failed in the ballot and a new draft had to be created.

A group of computer linguists started an effort parallel to DIS2 to define another global character set. More and more software industry giants joined the group, which eventually became the powerful Unicode Consortium. Unicode's supporters are major operating system suppliers who effectively control the market for low-end systems. The Unicode character set employs a fixed 2-byte structure.

After DIS2 had failed, ISO and the Unicode supporters started high-level talks, having realised that two completely incompatible global character sets would be a disaster for the software industry. The result was a merger: DIS ISO/IEC 10646-1.2, which has been voted on and passed. The final text of the ISO/IEC 10646 Universal Coded Character Set is expected to be released later this year (1992).

The new ISO/IEC IS 10646 will have two notable characteristics:

  • it has a 4-byte canonical form and a 2-byte short form. The short form will be identical to Unicode V1.1; and
  • the basic entity for characters will change from one byte to two. The two extra bytes in the 10646 4-byte canonical form, which make it different from 2-byte Unicode, are currently set to zero. So one can say that an assigned character value in 10646 is identical to that in Unicode.

While most, if not all, operating systems of the next generation will support either 10646 or Unicode, it is not clear at this moment how application programs will have to change. A change seems inevitable, since a character is now two bytes wide. Existing popular operating systems also show a tendency to evolve towards 10646/Unicode.

If the operating system interface changes from byte-based characters to word-based ones, virtually all existing software will no longer work. But even if the operating system interface continues to work with byte-oriented I/O (taking care to issue two I/O operations for each character), internal data structures would break anyway and existing software will, unfortunately, still fail.

There are other subtle problems. For example, a capital `A' in the existing byte-oriented character set is defined as 65 in decimal, or 41 in hexadecimal. In Unicode, capital A also has the value 41 (hexadecimal), but since Unicode is a 2-byte character set, its full value is 0041. Software written in the C language will break, since it will mistakenly interpret the 00 part of capital A as a string terminator. While 10646 provides an optional encoding mechanism to eliminate the use of 00 in character values, that encoding introduces a non-uniform character length problem.
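The breakage is easy to demonstrate. The bytes below are a big-endian two-octet encoding of the single letter A; byte-oriented C code sees the leading 00 as the end of the string:

```c
#include <string.h>

/* Length, as strlen() sees it, of the two-octet Unicode encoding of
 * capital A (0x0041) stored high byte first.  strlen() stops at the
 * very first octet, which is 0x00 -- the letter has vanished. */
size_t byte_length_of_unicode_A(void)
{
    static const unsigned char a[] = { 0x00, 0x41, 0x00, 0x00 };
    return strlen((const char *)a);
}
```

Every C routine that treats 00 as a terminator -- strlen, strcpy, strcat and countless application loops -- fails the same way on 2-byte characters.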

In short, there is no magic way to deal with these 2-byte characters easily.

Preparing for what is to come

The big question is: what should the local software industry do to prepare for these unforeseen changes?

A common reaction is simply not to upgrade from existing operating systems and applications. Well, not quite. As software evolves, providing ever more features and sophistication, end-users can hardly resist the temptation to use them. While there might be some exceptions, most users are no longer using the original DOS 1.x and its applications.

Even though existing popular operating systems might be good enough as execution platforms for Thai applications, it would be unwise to ignore preparing for the coming waves of technology.

Given that future operating systems will support 10646/Unicode, it is anticipated that the basic operating system will provide a rendering engine that can handle Thai without localisation. The market for Thai cards/drivers is likely to shrink, while competition becomes fiercer.

Software written on top of WTT has one distinguishing advantage: WTT applications do not deal directly with I/O. It is therefore possible to insert bidirectional character conversion between TIS 620 and 10646/Unicode underneath the WTT I/O layer. The effect is that a WTT application will continue to run even if the operating system interface changes to two-byte character I/O.
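Such a conversion layer is straightforward for Thai, because 10646/Unicode places the Thai letters at a fixed offset from their TIS 620 values: TIS 620 code 0xA1 (Ko Kai) maps to U+0E01, so the whole block is shifted by 0x0D60. A minimal sketch, ignoring the few undefined positions in the TIS 620 table:

```c
/* TIS 620 -> 10646/Unicode: Thai letters occupy 0xA1..0xFB in TIS 620
 * and U+0E01..U+0E5B in Unicode, a constant offset of 0x0D60; the
 * ASCII half of TIS 620 passes through unchanged. */
unsigned int tis620_to_ucs(unsigned char c)
{
    if (c >= 0xA1)
        return c + 0x0D60;      /* Thai block starts at U+0E01 */
    return c;
}

/* The reverse direction, for characters coming back from a
 * Unicode-based operating system into a TIS 620 WTT application. */
unsigned char ucs_to_tis620(unsigned int u)
{
    if (u >= 0x0E01 && u <= 0x0E5B)
        return (unsigned char)(u - 0x0D60);
    return (unsigned char)u;    /* assumes u is in the ASCII range */
}
```

Sara-Ee (TIS 620 value 213, i.e. 0xD5) becomes U+0E35 and back again, so the application above the WTT I/O layer never notices the change of character set.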

Software written abroad will have a better chance of penetrating the Thai market, since 10646/Unicode will have solved the problem of multiple semantics for each character value, and authors will understand combining marks well enough to write packages that deal with them.

Knowledge about Thai and the side-effects of combining marks is still very rare in the international software community. This opens an opportunity for the WTT API to play an important role as a Thai-specific run-time library for 10646/Unicode-based operating systems.

An obvious thing that the local software industry should consider at this moment is WTT implementation. It is a win-win solution for all parties concerned. It proposes a new programming model with a clear separation of I/O from processing. WTT is not positioned to be the Thai system, but it has so many attractive features that it could provide a head start for those who wish to work out a contingency plan for a possible catastrophe in the future. Its design is in the public domain anyway.

The decision of whether to opt for WTT or not and its success now relies on how well Thai solution suppliers will implement it as well as how much application programmers and end-users demand it.

Trin Tantsetthi is an active member of TAPIC and the TISI software committee. He is a software architect at Digital Equipment (Thailand) Ltd, where his group works on software internationalisation and standards and also offers consultancy services on software internationalisation. As an individual contributor, he has followed 10646 since 1989 and has regularly defended Thailand's interests in the ISO 10646 design process, which is now complete.

As a network evangelist, Trin has played an active part in NECTEC's ThaiSarn network, for which he serves as network architect and backbone designer as well as running a major ThaiSarn hub. His name appears on various BBSs around Bangkok.