"The localisation of Linux to Indian languages can spark off a revolution that reaches down to the grassroots levels of the country." [95]
Prof. Venkatesh Hariharan
The challenges involved in localizing GNU/Linux vary with each locale or country. Some locales may find that localization requires minimal effort. Other locales may find that localization requires extensive modification and customized programming. This depends largely on how similar the locale’s requirements are to those that GNU/Linux already supports.
There are many different methods used to localize GNU/Linux, using different encoding, input and display systems. At present, the most technically effective method is localization via the Linux-Unicode-OpenType model. A brief explanation of the different technologies follows.
Unicode (www.unicode.org)
The Unicode encoding system, the latest version being Unicode 4.0, is an industry standard for encoding characters and symbols. It is closely related to the ISO Universal Character Set standard 10646. Additions to either standard are coordinated between the ISO and the Unicode Consortium. The Unicode Consortium, co-founded by Apple and Xerox in 1991, now has more than 100 members, including Adobe, IBM, Microsoft, Sybase, Compaq, Hewlett Packard, Oracle, Sun Microsystems, Netscape and Ericsson.
The aim of Unicode and ISO 10646 is to encompass all of the languages of the world, assigning each character a unique code that a font then renders as a ‘glyph’. Combinations of character codes produce combined glyphs for complex characters (particularly in the Asian languages). The initial Unicode standard specified a 16-bit encoding, which allows for a total of 65,536 possible characters/symbols. Later versions of the standard extended the code space beyond 16 bits, so that over one million different characters and symbols (1,114,112 code points, each of which fits in a 32-bit unit) can be encoded.
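The relationship between characters, code points and encoded bytes can be illustrated with a short Python sketch. It is only an illustration using Python's standard `unicodedata` module; the helper name `encoded_sizes` and the sample characters are choices made here, not part of the original text.

```python
import unicodedata

def encoded_sizes(ch: str) -> dict[str, int]:
    """Return how many bytes one character needs in each Unicode encoding form."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")),  # "-le" avoids counting a byte order mark
        "utf-32": len(ch.encode("utf-32-le")),
    }

# Sample characters: Latin A, Devanagari KA, a CJK ideograph, and an emoji
# that lies outside the original 16-bit range.
samples = ["A", "\u0915", "\u4e2d", "\U0001F600"]
for ch in samples:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), encoded_sizes(ch))

# A base letter followed by a combining mark is displayed as a single glyph.
decomposed = "e\u0301"  # 'e' + COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)  # single precomposed code point
print(len(decomposed), len(composed))
```

The emoji needs four bytes in every encoding form, which shows why a strict 16-bit design could not cover all characters.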
The Unicode standard is more and more relevant in light of accelerated globalization. It is the most relevant encoding system for the Internet. As Internet penetration continues to increase in both developing and developed countries, the benefits of integrating Unicode in software and content development cannot be ignored.
OpenType (www.adobe.com/type/opentype/main.html)
Fonts are at the ‘front end’ of localization and often receive the most attention from non-technical observers. Thus, font development is very often seen as the be-all and end-all of localization. However, font development is only one crucial component of the entire localization process, although it is the most visible.
Just as we are advocating the Unicode encoding system, we are advocating OpenType font file formats as the appropriate standard for font development in localization efforts.
OpenType is a cross-platform font file format jointly developed by Microsoft and Adobe. It is based on the Unicode encoding standard and offers multiple language character sets in one font file. Whereas traditional Western PostScript fonts are limited to 256 glyphs, an OpenType font may contain more than 65,000 glyphs, allowing multiple languages to be displayed using a single font.
Using the Linux-Unicode-OpenType model, most localization efforts involve the following steps:
1) Unicode standard corrections/enhancements
2) Font development
3) Input methods
4) Modifying applications to handle local language characteristics
5) Translating application messages
6) Ensuring that changes are accepted by the global FOSS community
Unicode standard corrections/enhancements
Creating an encoding that adequately handles the needs of the countless languages throughout the world is highly complex. The immensity of this task has resulted in errors and inadequacies in the specification of certain languages, particularly languages from countries that have low levels of ICT development. Additionally, while Unicode may have included encoding for all of the major languages in the world, encoding for the other languages and dialects (India alone has over 1,000 languages and dialects) is either incomplete or non-existent. In countries where the existing Unicode standard is lacking, a review of the existing Unicode standard and recommendation of changes to the Unicode Consortium will be necessary.
Font development
Once a satisfactory Unicode standard has been developed, the next challenge is ensuring that there is a freely available, cross-platform font. Without fonts, it is impossible to display, use and manipulate any language electronically. Modern fonts, particularly OpenType fonts, are more than just the visual representation of a language. OpenType fonts contain the logic behind the display of the words, including how glyphs interact with and change surrounding glyphs. Languages that differ greatly from the western alphabet (Arabic, Laotian, Dzongkha, etc.) often do not have a commonly available, non-proprietary font.
Font development is no small task. A high-quality, professional font can take several years to develop.
Input methods
The next step involves standardizing and implementing a system for input in that language. The most common input method in computing is via the keyboard, and many countries have created mappings from the standard keys to characters in their local language. These are often ad hoc adoptions and several are used within a country. For example, there are several keyboard layouts in use regularly in Bangladesh. The lack of a single standard is a result of and contributes further to incompatible implementations of character sets/encoding, keyboard mappings, fonts, and the like. Addressing and standardizing input methods from the outset provides developers with a common starting point.
Once an input method has been standardized, software has to be written to implement the standard under GNU/Linux. If the number of characters is less than the possible key combinations, this becomes a simple task of remapping the keys on a keyboard. It is when the number of characters far outnumbers the keys on a keyboard (e.g., Chinese with its 30,000 characters) that more advanced techniques become necessary.
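The simple remapping case can be sketched as a lookup table. The layout below is hypothetical: it is not any of the Bangladeshi layouts mentioned above, and the names `KEYMAP` and `remap` are invented for this illustration.

```python
# Hypothetical layout: Latin key -> Bengali character (illustrative only,
# not a real keyboard standard).
KEYMAP = {
    "k": "\u0995",  # BENGALI LETTER KA
    "g": "\u0997",  # BENGALI LETTER GA
    "m": "\u09AE",  # BENGALI LETTER MA
    "a": "\u09BE",  # BENGALI VOWEL SIGN AA
}

def remap(keystrokes: str, keymap: dict[str, str] = KEYMAP) -> str:
    """Translate each keystroke through the layout; unmapped keys pass through."""
    return "".join(keymap.get(key, key) for key in keystrokes)

print(remap("kam"))  # three keystrokes become three Bengali code points
```

A one-to-one table like this stops working once the script has far more characters than there are keys. In that case an input method typically has to collect several keystrokes and offer candidate characters for the user to choose from.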
Modifying applications to handle local language characteristics
While most major FOSS applications have been internationalized, some modification may still be necessary to adapt to local language characteristics. For example, most word processors break words on a space, but in languages that do not use spaces, special rules must be created to specify where words break. Similar problems exist with word sorting, text flow and other issues. Most languages will require minimal modification but certain languages may require extensive modification to applications.
Additionally, locale-specific information such as date formats, currency symbols and similar conventions has to be specified. This is normally a simple task involving editing text files.
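One common way to break text that has no spaces is a dictionary-based "longest match" rule. The sketch below is a simplified illustration, not how any particular word processor works. The function name `segment` and the tiny lexicon are assumptions, and Latin letters stand in for a script written without spaces.

```python
def segment(text: str, lexicon: set[str]) -> list[str]:
    """Greedy longest-match segmentation of text that contains no spaces."""
    max_len = max(map(len, lexicon), default=1)
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character
        # so that unknown text still advances.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                words.append(piece)
                i += size
                break
    return words

# Placeholder lexicon; a real one would hold words of the target language.
lexicon = {"free", "soft", "ware", "software"}
print(segment("freesoftware", lexicon))
```

Greedy matching is easy to implement but can choose the wrong break when words overlap, which is one reason some languages need more extensive changes to applications.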
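Locale-specific conventions can be pictured as a small table of data that the application consults when formatting values. The records below are a hypothetical, heavily simplified stand-in for real locale definition files. The names `LOCALES`, `format_date` and `format_money` are invented for this sketch.

```python
from datetime import date

# Hypothetical locale records (illustrative only).
LOCALES = {
    "en_US": {"date": "{m:02d}/{d:02d}/{y}", "currency": "$",
              "decimal": ".", "symbol_first": True},
    "de_DE": {"date": "{d:02d}.{m:02d}.{y}", "currency": "\u20ac",  # euro sign
              "decimal": ",", "symbol_first": False},
}

def format_date(day: date, locale_name: str) -> str:
    """Format a date using the field order and separators of the locale."""
    pattern = LOCALES[locale_name]["date"]
    return pattern.format(d=day.day, m=day.month, y=day.year)

def format_money(amount: float, locale_name: str) -> str:
    """Format an amount with the locale's decimal mark and currency symbol."""
    loc = LOCALES[locale_name]
    number = f"{amount:.2f}".replace(".", loc["decimal"])
    if loc["symbol_first"]:
        return f"{loc['currency']}{number}"
    return f"{number} {loc['currency']}"

print(format_date(date(2004, 3, 7), "de_DE"), format_money(12.5, "de_DE"))
```

Supporting a new locale here means adding one more record rather than changing code, which is why this step is usually a matter of editing text files.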
Translating application messages
The next step in localizing GNU/Linux involves the translation of messages that the application passes to the user. Messages such as “File Not Found” or “Operation Complete” have to be translated to the local language. This task involves very little technical skill as the messages are normally stored in text files for easy viewing and editing. However, translating the thousands of messages and help files is an undertaking that can take several years to complete and is often the slowest part of the localization process. Even if the task is limited to the most commonly used applications (web browser, office productivity suite), significant effort has to be expended.
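Message translation can be sketched as a lookup in a text catalog. The example below assumes a catalog written in the `msgid`/`msgstr` style of GNU gettext PO files. The parser is deliberately simplified (single-line entries only), and the names `parse_catalog`, `translate` and `SPANISH_PO` are invented for this illustration.

```python
def parse_catalog(text: str) -> dict[str, str]:
    """Parse simplified single-line msgid/msgstr pairs into a lookup table."""
    catalog: dict[str, str] = {}
    msgid = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("msgid "):
            msgid = line[len("msgid "):].strip('"')
        elif line.startswith("msgstr ") and msgid is not None:
            msgstr = line[len("msgstr "):].strip('"')
            if msgstr:  # an empty msgstr means "not translated yet"
                catalog[msgid] = msgstr
            msgid = None
    return catalog

def translate(message: str, catalog: dict[str, str]) -> str:
    """Return the translation, or the original message if none exists."""
    return catalog.get(message, message)

# Sample catalog: one translated entry and one still untranslated.
SPANISH_PO = '''
msgid "File Not Found"
msgstr "Archivo no encontrado"

msgid "Operation Complete"
msgstr ""
'''

catalog = parse_catalog(SPANISH_PO)
print(translate("File Not Found", catalog))
print(translate("Operation Complete", catalog))  # falls back to the English text
```

Falling back to the original message lets a partially translated application remain usable while the thousands of remaining messages are worked through.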
Ensuring that changes are accepted by the global FOSS community
One of the major advantages of the FOSS development method is that maintenance costs are often shared among the various users of the software. However, this is possible only if the changes made are accepted by the global community. Localization may involve changes in many different software components, each maintained by different project teams. Therefore, there should be a focused effort to ensure that all changes made are accepted by each of the teams, often by ensuring that the changes are made in a manner compatible with the future direction of the project team. In essence, one must be a player in the global team effort from the very start or risk being the only one left maintaining an isolated version of GNU/Linux.