附录 B. 本地化技术一览

This annex discusses more technical details. The aim is to give implementers the information they need to start localization work. It is not, however, intended to be a hands-on cookbook.

在本附录中,将讨论更多的技术细节。目标是给工作者开始本地化工作所需要的信息。但是,这并不是一本手把手具体教授技术的手册。

Unicode
=Unicode=

As a universal character set that aims to include all the characters of the world, Unicode originally assigned code points to its characters as 16-bit integers, which means that up to 65,536 characters could be encoded. However, due to the huge set of CJK characters, this became insufficient, and Unicode 3.0 extended the index to 21 bits, which supports up to 1,114,112 code points (0 to 10FFFF).

Unicode 是一个包括了世界上所有字符的字符集,用16位整数来编码字符指针,也就是可以编码最多65,536个字符。但是,由于 CJK 字符集的庞大规模,连这个容量也不够使用,因此 Unicode 3.0 把索引字长扩展到21位,支持多达1,114,112个字符。

Planes
=平面=

A Unicode code point is a numeric value between 0 and 10FFFF (hexadecimal). The code space is divided into planes of 64K code points each. In Unicode 4.0, the allocated planes are Planes 0, 1, 2 and 14.

Unicode 编码指针是一个在0和10FFFF之间的数值,分成64K个字符组成的平面。在 Unicode 4.0 里,分配的平面是平面0,1,2和14。

Plane 0, ranging from 0000 to FFFF, is called Basic Multilingual Plane (BMP), which is the set of characters assigned by the previous 16-bit scheme.

平面0,从0000到FFFF,叫做基本多语言平面(Basic Multilingual Plane, BMP),由过去的16位编码系统下的字符集组成。

Plane 1, ranging from 10000 to 1FFFF and called Supplementary Multilingual Plane (SMP), is dedicated to lesser used historic scripts, special-purpose invented scripts and special notations. These include Gothic, Shavian and musical symbols. Many more historic scripts may be encoded in this plane in the future.

平面1,从10000到1FFFF,叫做辅助多语言平面(Supplementary Multilingual Plane, SMP),用于较少使用的古文字,特殊用途的文字和特殊符号。这些文字包括哥特文字,Shavian 文字和乐谱符号。今后可能会有更多的古文字被编码到这个平面中。

Plane 2, ranging from 20000 to 2FFFF and called Supplementary Ideographic Plane (SIP), is the spillover allocation area for those CJK characters that cannot fit into the blocks for common CJK characters in the BMP. Plane 14, ranging from E0000 to EFFFF and called Supplementary Special-purpose Plane (SSP), is for some control characters that do not fit into the small areas allocated in the BMP.

平面2,从20000到2FFFF,称为辅助表意文字平面(Supplementary Ideographic Plane, SIP),用于容纳 BMP 中一般 CJK 字符容纳不下的字符的区域。平面14,从E0000到EFFFF,称为辅助特殊用途平面(Supplementary Special-purpose Plane, SSP),是为 BMP 中有限的小区域无法容纳的控制字符准备的。

Two more planes, Plane 15 and Plane 16, are reserved for private use; no characters are assigned in them.

还有两个保留平面,平面15和平面16,用于个别用途,没有分配编码指针。
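The plane layout described above follows directly from the code point value: since each plane holds 64K (0x10000) code points, the plane number is simply the code point shifted right by 16 bits. The sketch below illustrates this; the function name `plane_of` and the `PLANE_NAMES` table are illustrative helpers, not part of any standard API.

```python
# Illustrative helper (hypothetical name): map a code point to its plane.
PLANE_NAMES = {0: "BMP", 1: "SMP", 2: "SIP", 14: "SSP"}


def plane_of(code_point: int) -> int:
    """Return the plane number (0 to 16) that contains a code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    # Each plane holds 0x10000 (65,536) code points.
    return code_point >> 16


if __name__ == "__main__":
    # Latin A, Thai KO KAI, a Gothic letter, and a CJK Extension B ideograph.
    for ch in ("A", "\u0e01", "\U00010330", "\U00020000"):
        plane = plane_of(ord(ch))
        label = PLANE_NAMES.get(plane, "other")
        print(f"U+{ord(ch):04X} is in plane {plane} ({label})")
```

The same arithmetic gives the total size of the code space: 17 planes of 65,536 code points each, or 1,114,112.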

Basic Multilingual Plane
==基本多语言平面==

The Basic Multilingual Plane (BMP), or Plane 0, is the plane most commonly used in general documents. Code points are allocated for common characters in contemporary scripts with exactly the same set as ISO/IEC 10646-1, as summarized in Figure 2. Note that the code points between E000 and F900 are reserved for vendors' private use. No character is assigned in this area.

基本多语言平面(Basic Multilingual Plane, BMP),或平面0,是一般文本中使用最多的平面。现代文字中常用字符的编码指针被按照与 ISO/IEC 10646-1 完全相同的方式分配,如图2所示。注意E000和F900之间的编码指针为软件提供商的特别用途被保留,该区域中没有分配字符。

图2 Unicode 基本多语言平面

Character Encoding
==字符编码==

There are several ways of encoding Unicode strings for information interchange. One may simply represent each character using a fixed-size integer (called a wide char), which is defined by ISO/IEC 10646 as UCS-2 and UCS-4, where 2-byte and 4-byte integers are used, respectively (6), and where UCS-2 is for the BMP only. But the common practice is to encode the characters as sequences of 8-bit, 16-bit or 32-bit integers, called UTF-8, UTF-16 and UTF-32, respectively (7); of these, UTF-8 and UTF-16 are variable-length. There is also UTF-7 for e-mail transmissions that are strictly 7-bit, but UTF-8 is safe in most cases.

用于信息交换的 Unicode 字符串有几种编码方式。每个字符可以简单地用固定长度的整数表示(称为宽字符),这种方式在 ISO/IEC 10646 中定义为 UCS-2 和 UCS-4,分别使用2字节和4字节长度的整数(6),而且 UCS-2 只用于基本多语言平面。但一般的做法是用可变长度的整数序列表示,根据使用的是8位,16位还是32位的整数,分别称为 UTF-8,UTF-16,和 UTF-32(7)。还有7位的专用于电子邮件传输的 UTF-7 编码,但多数情况下 UTF-8 都被支持。

UTF-32
===UTF-32===

UTF-32 is the simplest Unicode encoding form. Each Unicode code point is represented directly by a single 32-bit unsigned integer. It is therefore a fixed-width character encoding form. This makes UTF-32 an ideal form for APIs that pass single character values. However, it is inefficient in terms of storage for Unicode strings.

UTF-32 是最简单的 Unicode 编码形式。每个 Unicode 编码指针都由一个单个32位无符号整数直接表示,因此它是一种固定宽度的编码形式。这使得 UTF-32 适合用于传递单个字符值的应用程序接口。但是,它不能有效满足 Unicode 字符串的存储需要。

UTF-16
===UTF-16===

UTF-16 encodes code points in the range 0000 to FFFF (i.e. the BMP) as a single 16-bit unsigned integer. Code points in the supplementary planes are instead represented as pairs of 16-bit unsigned integers, called surrogate pairs. The values used for the surrogate pairs are in the range D800 to DFFF, which is not assigned to any character, so UTF-16 readers can easily distinguish single code units from surrogate pairs. The Unicode Standard(8) provides more details on surrogates.

UTF-16 在0000到FFFF范围(即基本多语言平面)内以单个16位无符号整数编码指针。辅助平面内的编码指针由两个16位无符号整数代表。这些编码单位被称为代用对。代用对的值在D800到DFFF间,没有分配给任何字符。这样,UTF-16 程序容易分辨单个编码单位和代用对。Unicode 标准(8)给出了代用对的详情。
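The surrogate mechanism can be sketched in a few lines. The arithmetic below follows the usual UTF-16 definition (subtract 10000, split the remaining 20 bits into two 10-bit halves, add them to D800 and DC00); the function names are illustrative only.

```python
def to_surrogates(code_point: int) -> tuple[int, int]:
    """Split a supplementary-plane code point into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only supplementary code points need surrogates")
    offset = code_point - 0x10000          # a 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def from_surrogates(high: int, low: int) -> int:
    """Recombine a surrogate pair into the original code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


if __name__ == "__main__":
    # U+1D11E is MUSICAL SYMBOL G CLEF, a character in Plane 1.
    high, low = to_surrogates(0x1D11E)
    print(f"U+1D11E -> {high:04X} {low:04X}")
```

Because both halves fall inside D800 to DFFF, a reader that sees a code unit in that range knows it is part of a pair rather than a BMP character.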

UTF-16 is a good choice for keeping general Unicode strings, as it is optimized for characters in BMP, which is used in 99 percent of Unicode texts. It consumes about half of the storage required by UTF-32.

UTF-16 是保存一般 Unicode 字符串的好方法,因为它对在99%的 Unicode 文本中使用的基本多语言平面内的字符进行了优化。它只需要相当于 UTF-32 所需一半的存储空间。
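The storage trade-off between the encoding forms is easy to check with Python's built-in codecs. This is only a rough sketch: the helper name `encoded_sizes` is made up for illustration, and the big-endian codec variants are used so that no byte order mark is counted.

```python
def encoded_sizes(text: str) -> dict[str, int]:
    """Return the number of bytes needed to store text in each encoding form."""
    codecs = (("UTF-8", "utf-8"), ("UTF-16", "utf-16-be"), ("UTF-32", "utf-32-be"))
    return {name: len(text.encode(codec)) for name, codec in codecs}


if __name__ == "__main__":
    samples = {
        "ASCII": "hello",
        "Thai": "\u0e01\u0e02\u0e03",
        "CJK": "\u4e2d\u6587",
    }
    for label, text in samples.items():
        print(label, encoded_sizes(text))
```

For BMP-only text, UTF-16 always needs exactly half the bytes of UTF-32, while UTF-8 is smaller still for ASCII and larger for scripts that need three bytes per character.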

UTF-8
===UTF-8===

To meet the requirements of legacy byte-oriented, ASCII-based systems, UTF-8 is defined as a variable-width encoding form that preserves ASCII compatibility. It uses one to four 8-bit code units to represent a Unicode character, depending on the code point value. Code points between 0000 and 007F are encoded in a single byte, making any ASCII string valid UTF-8. Beyond the ASCII range of Unicode, some non-ideographic characters between 0080 and 07FF are encoded with two bytes. Then, Indic scripts and CJK ideographs between 0800 and FFFF are encoded with three bytes. Supplementary characters beyond the BMP require four bytes. The Unicode Standard(9) provides more details on UTF-8.

为满足旧式的基于 ASCII 的,面向字节处理的系统的要求,UTF-8 被定义为一种保留了 ASCII 兼容性的可变宽度编码形式。根据编码指针数值的不同,它使用一个到四个8位的编码单位来表示一个 Unicode 字符。在0000到007F范围内的编码指针用一个字节编码,这样任何 ASCII 字符串在 UTF-8 下都同样有效。在 Unicode 的 ASCII 范围外,一些在0080到07FF之间的非表意字符用两个字节编码。在其后的位于0800和FFFF范围内的印地语和 CJK 表意文字用三个字节编码。基本多语言平面之外的辅助字符需要四个字节。Unicode 标准(9)提供了 UTF-8 的详细介绍。
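The byte counts listed above depend only on which range the code point falls in. The following sketch encodes that rule and compares it against Python's own UTF-8 codec; `utf8_length` is an illustrative name, not a library function.

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 uses for a code point."""
    if code_point < 0x80:       # 0000 to 007F: ASCII
        return 1
    if code_point < 0x800:      # 0080 to 07FF
        return 2
    if code_point < 0x10000:    # 0800 to FFFF: rest of the BMP
        return 3
    return 4                    # supplementary planes


if __name__ == "__main__":
    # ASCII, Latin-1 letter, Thai letter, CJK ideograph, Plane 2 ideograph.
    for ch in ("A", "\u00e9", "\u0e01", "\u4e2d", "\U00020000"):
        actual = len(ch.encode("utf-8"))
        print(f"U+{ord(ch):04X}: predicted {utf8_length(ord(ch))}, actual {actual}")
```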

UTF-8 is typically the preferred encoding form for the Internet. Its ASCII compatibility helps a lot in migration from old systems. UTF-8 also has the advantage of being byte-serialized and friendly to the APIs of C and other programming languages. For example, traditional string collation using byte-wise comparison works with UTF-8.

UTF-8 是因特网上典型的理想编码形式。ASCII 兼容性对从旧系统迁移帮助很大。UTF-8 还有字节串行化和对 C 或其他语言编程接口友好的优点。例如,传统的逐字节比较方式的字符排序表在 UTF-8 下也能工作。
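The remark about byte-wise comparison can be checked directly: sorting strings by their raw UTF-8 bytes gives the same order as sorting them by code point. The sketch below assumes nothing beyond the standard library; the helper names are illustrative. Note that this is code point order, not the language-aware ordering defined by the Unicode Collation Algorithm.

```python
def sort_by_code_points(items: list[str]) -> list[str]:
    """Sort strings by code point (Python's default string ordering)."""
    return sorted(items)


def sort_by_utf8_bytes(items: list[str]) -> list[str]:
    """Sort strings by plain byte-wise comparison of their UTF-8 form."""
    return sorted(items, key=lambda s: s.encode("utf-8"))


if __name__ == "__main__":
    words = ["zebra", "apple", "\u00e9t\u00e9", "\u0e01\u0e02", "\u4e2d\u6587", "\U00020000"]
    print(sort_by_code_points(words) == sort_by_utf8_bytes(words))
```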

In short, UTF-8 is the most widely adopted encoding form of Unicode.

一句话,UTF-8 是 Unicode 最普及的编码形式。

Character Properties
==字符属性==

In addition to code points, Unicode also provides a database of character properties called the Unicode Character Database (UCD), which consists of a set of files describing the following properties:

除了编码指针外,Unicode 还提供了一个称为 Unicode 字符数据库(Unicode Character Database, UCD)(10)的字符属性数据库,包括一系列文件用来描述以下的属性:

  • Name.
  • General category (classification as letters, numbers, symbols, punctuation, etc.).
  • Other important general characteristics (white space, dash, ideographic, alphabetic, noncharacter, deprecated, etc.).
  • Character shaping (bidi category, shaping, mirroring, width, etc.).
  • Case (upper, lower, title, folding; both simple and full).
  • Numeric values and types (for digits).
  • Script and block.
  • Normalization properties (decompositions, decomposition type, canonical combining class, composition exclusions, etc.).
  • Age (version of the standard in which the code point was first designated).
  • Boundaries (grapheme cluster, word, line and sentence).
  • Standardized variants.

  • 名字
  • 一般类别(分类为字母、数字、符号、标点,等等)。
  • 其他重要一般性质(空白,连字符,表意,字母顺序,非字符,已过时,等等)
  • 字符外形(bidi 分类,外形,镜像,宽度,等等)。
  • 形式(大写,小写,标题,折叠;简写和全写)。
  • 数值和类型(用于数字)。
  • 字符和字符块。
  • 标准化属性(分解,分解类型,最简组成类,不合法组合,等等)。
  • 历史(编码指针最初被指定的标准版本)。
  • 边界(字,词,分行和断句)。
  • 标准变形。

The database is useful for Unicode implementation in general. It is available at the Unicode.org Web site. The Unicode Standard(11) provides more details of the database.

这个数据库可用于一般的 Unicode 实现。在 Unicode.org 网站上可以找到它。Unicode 标准(11)提供了这个数据库的详情。
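Many programming environments ship a compiled copy of part of this database. As one illustration, Python's standard `unicodedata` module exposes several of the properties listed above (name, general category, bidi category, numeric value, combining class). The `describe` helper below is a hypothetical convenience wrapper, and the module covers only a subset of the UCD.

```python
import unicodedata


def describe(ch: str) -> dict:
    """Collect a few UCD properties for a single character."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch, "<unnamed>"),
        "category": unicodedata.category(ch),       # general category
        "bidi": unicodedata.bidirectional(ch),      # bidi category
        "combining": unicodedata.combining(ch),     # canonical combining class
        "numeric": unicodedata.numeric(ch, None),   # numeric value, if any
    }


if __name__ == "__main__":
    # Latin capital A, Thai KO KAI, Arabic-Indic digit three, Hebrew alef.
    for ch in ("A", "\u0e01", "\u0663", "\u05d0"):
        print(describe(ch))
```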

Technical Reports
==技术报告==

In addition to the code points, encoding forms and character properties, Unicode also provides some technical reports that can serve as implementation guidelines. Some of these reports have been included as annexes to the Unicode standard, and some are published individually as Technical Standards.

除了编码指针,编码形式和字符属性外,Unicode 还提供了一些技术报告,可以作为实现的指导。其中一些报告作为 Unicode 标准的附录提供,另一些则单独作为技术标准发布。

In Unicode 4.0, the standard annexes are:

在 Unicode 4.0 中,标准附录包括:

UAX 9: The Bidirectional Algorithm
Specifications for the positioning of characters flowing from right to left, such as Arabic or Hebrew.

UAX 11: East-Asian Width
Specifications of an informative property of Unicode characters that is useful when interoperating with East-Asian Legacy character sets.

UAX 14: Line Breaking Properties
Specification of line breaking properties for Unicode characters as well as a model algorithm for determining line break opportunities.

UAX 15: Unicode Normalization Forms
Specifications for four normalized forms of Unicode text. With these forms, equivalent text (canonical or compatibility) will have identical binary representations. When implementations keep strings in a normalized form, they can be assured that equivalent strings have a unique binary representation.

UAX 24: Script Names
Assignment of script names to all Unicode code points. This information is useful in mechanisms such as regular expressions, where it produces much better results than simple matches on block names.

UAX 29: Text Boundaries
Guidelines for determining default boundaries between certain significant text elements: grapheme clusters ("user characters"), words and sentences.

  • UAX 9:双向算法:对于从右向左书写的文字,如阿拉伯文和希伯来文字符位置的规定。
  • UAX 11:东亚字符宽度:当操作旧式东亚字符集时对 Unicode 字符属性的规定。
  • UAX 14:断行属性:对 Unicode 字符断行属性的规定,以及决定断行时机的模型算法。
  • UAX 15:Unicode 标准化形式:规定了 Unicode 字符的四种标准形式。通过这些形式,等价(相同或兼容)的文本将具有同样的二进制值。当实现的程序用标准形式保存字符串时,可以确保等价的字符串有唯一的二进制值。
  • UAX 24:语言名称:为所有 Unicode 编码指针分配了语言的名称。这种信息在正则表达式这样的机制中能产生比仅仅匹配字符块名称更好的效果,因而非常有用。
  • UAX 29:字符边界:定义某些重要文本元素,如字符组合(“用户字符”),词和句子缺省边界的指导。
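The effect of the normalization forms in UAX 15 can be seen with a short experiment. The sketch below uses `unicodedata.normalize` from the Python standard library; the sample strings are chosen for illustration only.

```python
import unicodedata

# Two canonically equivalent spellings of "e with acute accent".
composed = "\u00e9"        # single precomposed code point
decomposed = "e\u0301"     # base letter followed by a combining accent


def same_after(form: str, a: str, b: str) -> bool:
    """Return True if two strings are identical once normalized to form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


if __name__ == "__main__":
    print("raw strings equal:", composed == decomposed)
    for form in ("NFC", "NFD", "NFKC", "NFKD"):
        print(form, same_after(form, composed, decomposed))
    # Compatibility forms also fold characters such as the "fi" ligature.
    print(unicodedata.normalize("NFKC", "\ufb01"))
```

Keeping stored strings in one chosen form (NFC is a common choice) lets equivalent strings be compared with a plain binary comparison.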

The individual technical standards are:

单独的技术标准包括:

UTS 6: A Standard Compression Scheme for Unicode
Specifications of a compression scheme for Unicode and sample implementation.

UTS 10: Unicode Collation Algorithm
Specifications for how to compare two Unicode strings while conforming to the requirements of the Unicode Standard. The UCA also supplies the Default Unicode Collation Element Table (DUCET) as the data specifying the default collation order for all Unicode characters.

UTS 18: Unicode Regular Expression Guidelines
Guidelines on how to adapt regular expression engines to use Unicode.

  • UTS 6:Unicode 标准压缩方式:对 Unicode 和样本实现的压缩方式的规定。
  • UTS 10:Unicode 排序算法(UCA):在与 Unicode 标准兼容的前提下比较两个 Unicode 字符串的规定。UCA 也提供了缺省 Unicode 排序元素表(Default Unicode Collation Element Table, DUCET)用来指定所有 Unicode 字符的缺省排序顺序。
  • UTS 18:Unicode 正则表达式导则:关于如何让正则表达式引擎使用 Unicode 的导则。

All Unicode Technical Reports are accessible from the Unicode.org web site (12).

所有 Unicode 技术报告都可以从 Unicode.org 网站(12)上得到。

Fonts
=字体=

Font Development Tools
==字体开发工具==

Some FOSS tools for developing fonts are available. Although not as many as their proprietary counterparts, they are adequate to get the job done, and are continuously being improved. Some interesting examples are:

有一些用于开发字体的自由/开源软件工具。虽然这类工具不像私有的开发工具那样丰富,但它们足以胜任工作,而且在不断地改进。一些有趣的例子包括:

1. XmBDFEd(13). Developed by Mark Leisher, XmBDFEd is a Motif-based tool for developing BDF fonts. It allows one to edit bit-map glyphs of a font, do some simple transformations on the glyphs, transfer information between different fonts, and so on.

2. FontForge(14) (formerly PfaEdit(15)). Developed by George Williams, FontForge is a tool for developing outline fonts, including PostScript Type1, TrueType and OpenType. Scanned images of letters can be imported and their outline vectors automatically traced. The splines can be edited, and transformations such as skewing, scaling, rotating and thickening can be applied, among many others. It provides sufficient functionality for editing the properties of Type1 and TrueType fonts. OpenType tables can also be edited in recent versions. One weak point, however, is hinting: hint quality is assured for Type1, but not for TrueType.

3. TTX/FontTools(16). Just van Rossum's TTX/FontTools is a tool to convert OpenType and TrueType fonts to and from XML. FontTools is a library for manipulating fonts, written in Python. It supports TrueType, OpenType, AFM and, to a certain extent, Type 1 and some Mac-specific formats. It allows one to dump OpenType tables, examine and edit them with XML or plain text editor, and merge them back to the font.

  1. XmBDFEd(13)。它由 Mark Leisher 开发,是一个基于 Motif 的开发 BDF 字体的工具。它允许编辑字体的点阵形式,对符号进行简单的变换,在不同字体间传递信息,等等。
  2. FontForge(14)(过去的 PfaEdit(15))。George Williams 开发的 FontForge 是用于开发 Postscript Type1,TrueType 和 OpenType 等轮廓字体的工具。它可以导入字母的扫描图像,并自动追踪其轮廓向量。它还能对样条曲线进行编辑,并进行倾斜、缩放、旋转、加粗以及其他许多种变换。其功能足够用于编辑 Type1 和 TrueType 字体。新版本还能编辑 OpenType 字体表。但 hinting 是它的一个弱项,它只能保证 Type1 hinting 的质量,但 TrueType 的则不理想。
  3. TTX/FontTools(16)。Just van Rossum 的 TTX/FontTools 是一种用于 OpenType 和 TrueType 字体与 XML 文件相互转换的工具。FontTools 是用 Python 写成的处理字体的函数库。它支持 TrueType,OpenType,AFM 并提供了 Type1 和一些 Mac 专用字体的有限支持。它允许导出 OpenType 字体表,并用 XML 和纯文本编辑器检验和编辑,再合并回字体文件。

Font Configuration
==字体配置==

There have been several font configuration systems available in GNU/Linux desktops. The most fundamental one is the X Window font system itself. But, due to some recent developments, another font configuration called fontconfig has been developed to serve some specific requirements of modern desktops. These two font configurations will be discussed briefly.

在 GNU/Linux 桌面上有几种字体配置系统。最基本的是 X Window 字体系统本身。但是,在近期的开发中,另一种称为 fontconfig 的字体配置被开发出来以满足现代桌面的一些特定需要。以下简单讨论这两种字体系统。

First, however, let us briefly discuss the X Window architecture in order to understand the font systems. X Window(17) is a client-server system. X servers are the agents that control hardware devices, such as video cards, monitors, keyboards, mice or tablets, and pass user input events from the devices to the clients. X clients are GUI application programs that request the X server to draw graphical objects on the screen, and accept user input via the events fed by the X server. Note that with this architecture, the X client and server can be on different machines in the network. In that case, the X server runs on the machine that the user operates, while the X client can be a process running on the same machine or on a remote machine in the network.

不过首先,我们简要讨论一下 X Window 架构,以便理解字体系统。X Window(17) 是一种客户端-服务器系统。X 服务器是提供显卡、显示器、键盘、鼠标或触摸板等硬件设备控制服务的主体,也负责把用户输入事件从设备传送到客户。X 客户端是请求 X 服务器在屏幕上描绘图形对象,并通过 X 服务器的事件传送接受用户输入的图形界面程序。注意在这种架构中,X 客户端和服务器可以处在网络中不同的机器上。这种情况下,X 服务器是用户操作的机器,而 X 客户端可以是同一台机器上运行的进程,或网络中的远程机器。

In this client-server architecture, fonts are provided on the server side. Thus, installing fonts means configuring X server by installing fonts and registering them to its font path.

在这个客户端-服务器架构中,字体是服务器端提供的。因此,安装字体意味着在 X 服务器上加入字体并注册其字体路径。

However, since X server is sometimes used to provide thin-client access in some deployments, where X server may run on cheap PCs booted by floppy or across network, or even from ROM, font installation on each X server is not always appropriate. Thus, font service has been delegated to a separate service called X Font Server (XFS). Another machine in the network can be dedicated for font service so that all X servers can request font information. Therefore, with this structure, an X server may be configured to manage fonts by itself or to use fonts from the font server, or both.

但是,由于 X 服务在一些配置中有时被用来提供瘦客户机访问,而这些 X 服务器可能是运行在用软盘或网络方式启动的廉价机器上,甚至是从固化的 ROM 启动,在每台 X 服务器上安装字体不一定合适。因此,字体服务被分离成一个单独的服务,称为 X 字体服务器(X Font Server, XFS)。网络中另一台机器可以专门提供字体服务,这样所有的 X 服务器都可以请求字体信息。这样,在这个构架下,X 服务器可以配置成自我管理字体,或者使用来自字体服务器的字体,或者两者并存。

Nevertheless, recent changes in XFree86 have addressed some requirements for managing fonts on the client side. The Xft extension renders anti-aliased glyph images using font information provided by the X client. Along with this, the first version of Xft also provided font management functionality to X clients. In Xft2, this functionality was split off into a separate library called fontconfig. fontconfig is a font management system independent of X, which means it can also serve non-GUI applications such as printing services. Modern desktops, including KDE 3 and GNOME 2, have adopted fontconfig as their font management system, and have benefited from closer integration in providing an easy font installation process. Moreover, client-side fonts also allow applications to do all glyph manipulations, such as special effects, while enjoying a consistent appearance on the screen and in printed output.

不过,在 XFree86 中最近的改变注意到了一些在客户端管理字体的需求。Xft 扩展通过 X 客户端提供的字体信息实现了抗锯齿的符号图像。这个功能也使 Xft 在其第一版中提供了 X 客户端的字体管理能力。后来这个功能从 Xft2 中分离成一个单独的库,称为 fontconfig。fontconfig 是独立于 X 的一个字体管理系统,因此它也支持像打印服务这样的非图形界面应用。包括 KDE 3 和 GNOME 2 在内的现代桌面都采用了 fontconfig 作为字体管理系统,并且得益于紧密的整合,提供了简单的字体安装过程。而且,客户端的字体也允许应用程序进行特效等各种符号操作,同时在屏幕上和打印输出中都可以得到一致的效果。

The splitting of the X client-server architecture is not standard practice on stand-alone desktops. However, it is important to always keep the split in mind, to enable particular features.

X 客户端-服务器的分离式架构并不是独立桌面的标准形式。但是,要使用某些特别的功能,必须记住这个特点。

Output Methods
=输出方法=

Since the usefulness of XOM is still being questioned, we shall discuss only the output methods already implemented in the two major toolkits: Pango of GTK+ 2 and Qt 3.

由于 XOM 的有用程度还有疑问,我们将只讨论在两个主要的工具包中已经实现的输出方法:GTK+ 2的 Pango 和 Qt 3。

Pango Text Layout Engines
==Pango 文本外观引擎==

Pango ('Pan' means 'all' in Greek and 'go' means 'language' in Japanese)(18) is a multilingual text layout engine designed for quality text typesetting. Although it is the text drawing engine of GTK+, it can also be used outside GTK+ for other purposes, such as printing(19). This section will provide localizers with a bird's-eye view of Pango. The Pango reference manual(20) should be consulted for more detail.

Pango(“Pan”在希腊语里意思是“全部”,而“go”是日语中“语言”的意思)(18) 是一个用于高质量文本排版的多语言文本外观引擎。虽然它是 GTK+ 的文本描绘引擎,它也可以用于 GTK+ 之外的其他用途,例如打印(19)。这一节将为本地化工作者提供 Pango 的概览。如需要更多详情,应阅读 Pango 参考手册(20)。

PangoLayout
===PangoLayout===

At a high level, Pango provides the PangoLayout class that takes care of typesetting text in a column of given width, as well as other information necessary for editing, such as cursor positions. Its features may be summarized as follows:

在较高的层级,Pango 提供了 PangoLayout 类,处理给定宽度内的一列文本的排版,以及光标位置等其他编辑时必要的信息。其功能可以概括如下:

Paragraph Properties

  • indent
  • spacing
  • alignment
  • justification
  • word/character wrapping modes
  • tabs

段落属性

  • 缩进
  • 间距
  • 段落对齐
  • 两端对齐
  • 字/词换行模式
  • 制表位

Text Elements

  • get lines and their extents
  • get runs and their extents
  • character search at (x, y) position
  • character logical attributes (is line break, is cursor position, etc.)
  • cursor movements

文本元素

  • 行及其范围
  • 语流及其范围
  • 在 (x,y) 位置的字符搜索
  • 字符逻辑属性(是换行符,是光标位置控制符,等等)
  • 光标移动

Text Contents

  • plain text
  • markup text

文本内容

  • 纯文本
  • 标记文本

Middle-level Processing
==中级处理==

Pango also provides access to some middle-level text processing functions, although most clients in general do not use them directly. To gain a brief understanding of Pango internals, some highlights are discussed here.

Pango 还提供了一些中级的文本处理功能,虽然大部分客户端都不直接使用这些功能。为了简单了解 Pango 的能力,这里讨论一些重要特性。

There are three major steps for text processing in Pango(21):

Pango 中的文本处理有三个主要步骤(21):

Itemize. Breaks input text into chunks (items) of consistent direction and shaping engine. This usually means chunks of text of the same language with the same font. Corresponding shaping and language engines are also associated with the items.

分项:将文本打散成具有相同方向和形状引擎的文本块(项目)。这通常是指同一种语言和同一种字体的文本块。相应的形状和语言引擎也和项目相关联。

Break. Determines possible line, word and character breaks within the given text item. It calls the language engine of the item (or the default engine based on Unicode data if no language engine exists) to analyze the logical attributes of the characters (is-line-break, is-char-break, etc.).

分解:确定给定的文本项中可能的行、词和字符分割。它调用项目的语言引擎(如语言引擎不存在则调用基于 Unicode 数据的缺省引擎)来分析字符的逻辑属性(是断行,是断字,等等)。

Shape. Converts the text item into glyphs, with proper positioning. It calls the shaping engine of the item (or the default shaping engine that is currently suitable for European languages) to obtain a glyph string that provides the information required to render the glyphs (code point, width, offsets, etc.).

造型:把文本项转化成具有正确位置的符号。它调用项目的造型引擎(或者适用于欧洲语言的缺省造型引擎)生成提供渲染符号所需信息(编码指针,宽度,偏移量等)的符号串。
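To make the itemize step more concrete, here is a deliberately simplified toy, not Pango's actual implementation and not the full bidirectional algorithm of UAX 9. It splits text into runs of consistent direction using only the bidi category from Python's `unicodedata` module; neutral characters such as spaces simply stay with the run in progress. The function name and the "ltr"/"rtl" labels are assumptions made for this sketch.

```python
import unicodedata

RTL_CLASSES = {"R", "AL"}   # strong right-to-left bidi categories


def itemize_by_direction(text: str) -> list[tuple[str, str]]:
    """Split text into (direction, chunk) items of consistent direction."""
    items: list[tuple[str, str]] = []
    current = None
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi in RTL_CLASSES:
            direction = "rtl"
        elif bidi == "L":
            direction = "ltr"
        else:
            # Neutral or weak character: keep it in the current run.
            direction = current or "ltr"
        if items and direction == current:
            items[-1] = (direction, items[-1][1] + ch)
        else:
            items.append((direction, ch))
            current = direction
    return items


if __name__ == "__main__":
    # Latin text followed by a Hebrew word.
    for direction, chunk in itemize_by_direction("abc \u05e9\u05dc\u05d5\u05dd xyz"):
        print(direction, repr(chunk))
```

A real engine would go on to attach a language engine and a shaping engine to each item, as described in the break and shape steps above.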

Pango Engines
==Pango 引擎==

Pango engines are implemented in loadable modules that provide entry functions for querying and creating the desired engine. During initialization, Pango queries the list of all engines installed in the memory. Then, when it itemizes input text, it also searches the list for the language and shaping engines available for the script of each item and creates them for association to the relevant text item.

Pango 引擎以可加载的模块形式实现,提供查询和建立所需引擎的函数。在初始化时,Pango 查询内存中所有引擎的列表。然后,在对输入文字分项后,它为每个项目中的文字搜索可用的语言和造型引擎并建立与相关的文本项目关联的引擎。

Pango Language Engines
==Pango 语言引擎==

As discussed above, the Pango language engine is called to determine possible break positions in a text item of a certain language. It provides a method to analyze the logical attributes of every character in the text as listed in Table 3.

如上所述,调用 Pango 语言引擎是为了确定某种语言中文本项的可能的分解位置。它提供了分析文本中每个字符逻辑属性的方法,如表3所示:

Table 3 Pango Logical Attributes

Flag                           Description
is_line_break                  can break line in front of the character
is_mandatory_break             must break line in front of the character
is_char_break                  can break here when doing character wrap
is_white                       is white space character
is_cursor_position             cursor can appear in front of character
is_word_start                  is first character in a word
is_word_end                    is first non-word character after a word
is_sentence_boundary           is inter-sentence space
is_sentence_start              is first character in a sentence
is_sentence_end                is first non-sentence character after a sentence
backspace_deletes_character    backspace deletes one character, not entire cluster (new in Pango 1.3.x)

表3 Pango 逻辑属性

标志(Flag)                     描述
is_line_break                  可以在字符前断行
is_mandatory_break             必须在字符前断行
is_char_break                  字符分行时可以在这里断行
is_white                       是空格字符
is_cursor_position             光标可以在字符前出现
is_word_start                  是单词的第一个字符
is_word_end                    是单词后的第一个非单词字符
is_sentence_boundary           是句子间的空格
is_sentence_start              是句子的第一个字符
is_sentence_end                是句子后的第一个非句子的字符
backspace_deletes_character    退格删除一个字符而不是整个字符簇

Pango Shaping Engines
==Pango 造型引擎==

As discussed above, the Pango shaping engine converts the characters in a text item of a certain language into glyphs, and positions them according to the constraints of the script. It provides a method that converts a given text string into a sequence of glyph information (glyph code, width and positioning) and a logical map that maps the glyphs back to character positions in the original text. With all this information, the text can be properly rendered on output devices and navigated with the cursor, despite the difference between logical and rendering order in scripts such as Indic, Hebrew and Arabic.

如上所述,Pango 造型引擎把一个特定语言的文本项中的字符转换成符号,并且按照文字的规则放置这些符号。它提供了一种将给定的文本串转化为符号信息序列(符号编码、宽度和位置)的方法以及按原文本中字符位置将符号映射回字符的规则。利用这些信息,文本可以在输出设备上正确地显示,也可以正确地处理光标位置,而不用管像印地语、希伯来语和阿拉伯语这样的语言中不同的逻辑和显示顺序。

Qt Text Layout
==Qt 文本外观==

Qt 3 text rendering is different from that of GTK+/Pango. Instead of modularizing, it handles all complex text rendering in a single class, called QComplexText, which is mostly based on the Unicode character database. This is equivalent to the default routines provided by Pango. Due to the incompleteness of the Unicode database, this class sometimes needs extra workarounds to override some values. Developers should examine this class if a script is not rendered properly.

Qt 3 的文本渲染与 GTK+/Pango 的不同。它不是模块化的,而是在一个称为 QComplexText 的基于 Unicode 字符数据库的类中处理所有复杂文本渲染。它与 Pango 提供的缺省处理方法是一样的。由于 Unicode 数据库的不完整,这个类需要更多的修改来处理某些数值。如果一种语言渲染不正确,开发者需要检查这个类。

Although relying on the Unicode database appears to be a straightforward method for rendering Unicode texts, this makes the class rigid and error prone. Checking the Qt Web site regularly to find out whether there are bugs in latest versions is advisable. However, a big change has been planned for Qt 4, which is the Scribe text layout engine, similar to Pango for GTK+.

虽然依赖于 Unicode 数据库看起来是一种直接的渲染 Unicode 文本的办法,但这样的类不灵活而且容易出错。建议经常查看 Qt 的网站了解最新版本中是否存在问题。不过,Qt 4 当中计划引入一个大的变化,即 Scribe 文本布局引擎,与 GTK+ 的 Pango 类似。

Keyboard Layouts
==键盘布局==

The first step in providing text input for a particular language is to prepare the keyboard map. X Window handles the keyboard map using the X Keyboard (XKB) extension. When you start an X server on GNU/Linux, a virtual terminal is attached to it in raw mode, so that keyboard events are sent from the kernel without any translation.

为一种特定语言提供文本输入功能的第一步是定义键盘布局。X Window 用 X 键盘扩展处理(X Keyboard extension, XKB)键盘布局。当你在 GNU/Linux 上启动 X 服务器时,它附带一个简单的虚拟控制台,这样内核可以发送键盘事件而不需要任何转换。

The raw scan code of the key is then translated into keycode according to the keyboard model. For XFree86 on PC, the keycode map is usually "xfree86" as kept under /etc/X11/xkb/keycodes directory. The keycodes just represent the key positions in symbolic form, for further referencing.

击键的原始扫描码按照键盘型号被转换成键位代码。对于 PC 上的 XFree86 ,键位映射通常是保存在 /etc/X11/xkb/keycodes 目录下的“xfree86”。键位代码只以符号形式表示键位,以供查询。

The keycode is then translated into a keyboard symbol (keysym) according to the specified layout, such as qwerty, dvorak, or a layout for a specific language, chosen from the data under /etc/X11/xkb/symbols directory. A keysym does not represent a character yet. It requires an input method to translate sequences of key events into characters, which will be described later. For XFree86, all of the above setup is done via the setxkbmap command. (Setting up values in /etc/X11/XF86Config means setting parameters for setxkbmap at initial X server startup.) There are many ways of describing the configuration, as explained in Ivan Pascal's XKB explanation(22). The default method for XFree86 4.x is the "xfree86" rule (XKB rules are kept under /etc/X11/xkb/rules), with additional parameters:

之后,键位代码根据指定的键盘布局,如 qwerty, dvorak 或者从 /etc/X11/xkb/symbols 目录下的文件中指定的语言的布局,被翻译成键盘符号(keysym)。一个键盘符号还不是一个字符。它需要输入方法来把键盘事件序列转换成字符,后面会提到这个转换过程。对于 XFree86,所有上述的设置都是通过 setxkbmap 命令来完成(在 /etc/X11/XF86Config 中的设置可以在 X 服务器启动时为 setxkbmap 设定参数)。有许多描述配置的方法,在 Ivan Pascal 的 XKB 文档(22)中有说明。XFree86 4.x 的缺省方法是“xfree86”规则(XKB 规则保存在 /etc/X11/xkb/rules),有以下一些参数:

  • model: pc104, pc105, microsoft, microsoftplus, ...
  • layout: us, dk, ja, lo, th, ... (For XFree86 4.0+, up to 64 groups can be provided as part of the layout definition.)
  • variant: (mostly for Latin layouts) nodeadkeys
  • option: group switching key, swap caps, LED indicator, etc. (See /etc/X11/xkb/rules/xfree86 for all available options.)

  • 型号-pc104,pc105,microsoft,microsoftplus,……
  • 布局-us,dk,ja,lo,th,……(对 XFree86 4.0 以上版本,布局定义可以提供最多64个分组)
  • 变形-(主要用于拉丁语系的)语音辅助键
  • 可选项-切换键,大小写交换,LED 指示灯,等等(其他可选项见 /etc/X11/xkb/rules/xfree86)

For example:

例如:

$ setxkbmap us,th -option grp:alt_shift_toggle,grp_led:scroll

Sets layout using US symbols as the first group, and Thai symbols as the second group. The Alt-Shift combination is used to toggle between the two groups. Scroll Lock LED will be the group indicator, which will be on when the current group is not the first group, that is, on for Thai, off for US.

把美国英语符号作为第一组,泰国语符号作为第二组。Alt-Shift 组合键用来在两组之间切换。Scroll Lock LED 指示灯将作为分组状况显示,在当前组不是第一组时将点亮,也就是亮表示泰国语,灭表示美国英语。

You can even mix more than two languages:

你甚至可以混合更多的语言:

$ setxkbmap us,th,lo -option grp:alt_shift_toggle,grp_led:scroll

This loads trilingual layout. Alt-Shift is used to rotate among the three groups; that is, Alt-RightShift chooses the next group and Alt-LeftShift chooses the previous group. Scroll Lock LED will be on when the Thai or Lao group is active.

这个命令装入三种语言的布局。Alt-Shift 用来在三个组之间轮换;Alt-右Shift 选择下一组,而 Alt-左Shift 选择上一组。Scroll Lock LED 指示灯在启用泰国语和老挝语组时点亮。

The arguments for setxkbmap can be specified in /etc/X11/XF86Config for initialization on X server startup by describing the "InputDevice" section for keyboard, for example:

setxkbmap 的参数可以在 /etc/X11/XF86Config 的“InputDevice”小节中为键盘指定,在 X 服务器启动时初始化,例如:

Section "InputDevice"
    Identifier "Generic Keyboard"
    Driver     "keyboard"
    Option     "CoreKeyboard"
    Option     "XkbRules"   "xfree86"
    Option     "XkbModel"   "microsoftplus"
    Option     "XkbLayout"  "us,th_tis"
    Option     "XkbOptions" "grp:alt_shift_toggle,lv3:switch,grp_led:scroll"
EndSection

Notice the last four option lines. They tell setxkbmap to use "xfree86" rule, with "microsoftplus" model (with Internet keys), mixed layout of US and Thai TIS-820.2538, and some more options for group toggle key and LED indicator. The "lv3:switch" option is only for keyboard layouts that require a 3rd level of shift (that is, one more than the normal shift keys). In this case for "th_tis" in XFree86 4.4.0, this option sets RightCtrl as 3rd level of shift.

注意最后四行选项。它们告诉 setxkbmap 使用“xfree86”规则,“microsoftplus”型号(带有因特网功能键),混合美国英语和泰国语 TIS-820.2538 布局,以及组切换键和 LED 指示灯的选项。“lv3:switch”这个选项只用于需要第三级上档键(比一般的上档键多一级)的键盘布局。这里在 XFree86 4.4.0 中的“th_tis”布局设定右 Ctrl 键为第三级上档键。

Providing a Keyboard Map
==提供键盘布局==

If the keyboard map for a language is not available, one needs to prepare a new one. In XKB terms, one needs to prepare a symbols map, associating keysyms to the available keycodes.

如果一种语言没有可用的键盘布局,就需要制作一个新的。对于 XKB,需要提供一个符号映射,并将键盘符号和可用的击键代码联系起来。

The quickest way to start is to read the available symbols files under the /etc/X11/xkb/symbols directory. In particular, the files used by default rules of XFree86 4.3.0 are under the pc/ subdirectory. Here, only one group is defined per file, unlike the old files in its parent directory, in which groups are pre-combined. This is because XFree86 4.3.0 provides a flexible method for mixing keyboard layouts.

开始的最快方法是读取 /etc/X11/xkb/symbols 目录下已有的符号文件。要注意的是 XFree86 4.3.0 使用的缺省规则文件在 pc/ 子目录下。这里每个文件只定义了一个组,而不像上一级目录中那些老式的文件,把不同的组合并在一起。这是因为 XFree86 4.3.0 为混合的键盘布局提供了一种灵活的方法。

Therefore, unless you need to support the old versions of XFree86, all you need to do is to prepare a single-group symbols file under the pc/ subdirectory.

因此,除非你需要支持旧版本的 XFree86,否则只要在 pc/ 目录下提供一个单独的组符号文件就可以了。

Here is an excerpt from the th_tis symbols file:

以下是 th_tis 符号文件的片段:


partial default alphanumeric_keys
xkb_symbols "basic" {
    name[Group1]= "Thai (TIS-820.2538)";
    // The Thai layout defines a second keyboard group and changes
    // the behavior of a few modifier keys.
    key { [ 0x1000e4f, 0x1000e5b ] };
    key { [ Thai_baht, Thai_lakkhangyao ] };
    key { [ slash, Thai_leknung ] };
    key { [ minus, Thai_leksong ] };
    key { [ Thai_phosamphao, Thai_leksam ] };
    ...
};

Each element in the xkb_symbols data, except the first one, associates a keycode with its keysyms for the unshifted and shifted states, respectively. Some keysyms are predefined in Xlib; the complete list can be found in X11/keysymdef.h. If the keysyms for a language are not defined there, Unicode keysyms can be used, as shown in the first key entry. (In fact, this may be a more effective way of adding new keysyms.) The Unicode value must be prefixed with "0x100" to describe the keysym for a single character.

xkb_symbols 列表中,除了第一个元素外,每个元素都表示了未切换和切换状态下不同键位代码代表的键盘字符的关联。在这里,有些键盘字符是在 Xlib 中预先定义的。完整的列表可以在 X11/keysymdef.h 文件中找到。如果一种语言的键盘字符没有定义,则可以像第一个 key 元素所示,使用 Unicode 键盘字符。(实际上,这可能是更有效地加入新的键盘字符的方法。)描述单个字符的键盘符号时 Unicode 值前面要加“0x100”。
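The "0x100" prefix rule amounts to adding 0x1000000 to the code point, so the values to put in a symbols file can be generated rather than typed by hand. The sketch below is a small convenience under that assumption; `unicode_keysym` and `symbols_entry` are made-up helper names, not part of Xlib or any XKB tool.

```python
UNICODE_KEYSYM_BASE = 0x1000000   # the "0x100" prefix followed by four hex digits


def unicode_keysym(ch: str) -> int:
    """Return the Unicode keysym value for a single character."""
    return UNICODE_KEYSYM_BASE + ord(ch)


def symbols_entry(unshifted: str, shifted: str) -> str:
    """Format the keysym list for one key, unshifted level first."""
    return f"[ {unicode_keysym(unshifted):#x}, {unicode_keysym(shifted):#x} ]"


if __name__ == "__main__":
    # U+0E4F and U+0E5B are the two Thai characters in the first key entry.
    print(symbols_entry("\u0e4f", "\u0e5b"))
```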

For more details of the file format, see Ivan Pascal's XKB explanation(23). When finished, the symbols.dir file should be regenerated so that the symbols file is listed:

关于文件格式的更多细节,参看 Ivan Pascal 的 XKB 解释(23)。完成文件编写后,需要重新生成 symbols.dir 文件以便列出新添加的文件。

# cd /etc/X11/xkb/symbols
# xkbcomp -lhlpR '*' -o ../symbols.dir

Then, the new layout may be tested as described in the previous section.

然后,可以按照上一节描述的方法测试新的布局。

Additionally, entries may be added to /etc/X11/xkb/rules/xfree86.lst so that some GUI keyboard configuration tools can see the layout.

此外,还可以在 /etc/X11/xkb/rules/xfree86.lst 中加入条目以便一些图形界面的键盘配置工具能够发现这些布局。

Once the new keyboard map is completed, it may also be included in XFree86 source where the data for XKB are kept under the xc/programs/xkbcomp subdirectory.

新的键盘布局完成后,也可以加入到 XFree86 源代码中,关于 XKB 的数据放在 xc/programs/xkbcomp 目录下。

XIM - X Input Method
==XIM — X 输入方法==

For some languages, such as English, text input is a straightforward one-to-one mapping from keysyms to characters. For European languages, this is a little more complicated because of accents. For Chinese, Japanese and Korean (CJK), however, a one-to-one mapping is impossible: each character requires the interpretation of a series of keystrokes.

对于一些像英语这样的语言,文本输入只是简单的击键代码与字符的一一对应。对于欧洲语言,注音符号只是使得输入稍显复杂。但对于中、日、韩文字(CJK),一一对应的映射是不可能的。这些文字需要一系列的击键来表示每个字符。

X Input Method (XIM) is a locale-based framework designed to address the requirements of text input for any language. It is a separate service for handling input events as requested by X clients. Any text entry in X clients is represented by X Input Context (XIC). All the keyboard events will be propagated to the XIM, which determines the appropriate action for the events based on the current state of the XIC, and passes back the resulting characters.

X 输入方法是一种基于区域设置的框架,用来满足任何语言的文本输入需要。它是一个处理 X 客户程序输入事件请求的单独服务。X 客户程序中的任何文字输入都用 X 输入语境 (X Input Context,XIC)来表示。所有键盘事件都被传递到 XIM,它根据 XIC 的当前状态决定事件对应的正确动作,然后传送回相应的字符。

Internally, a common process of every XIM is to translate keyboard scan code into keycode and then to keysym, by calling XKB, whose process detail has been described in previous sections. The following processes to convert keysyms into characters are different for different locales.

在内部,各种 XIM 的共同机理是把调用 XKB 通过上面叙述的过程把键盘扫描码转换成击键代码和键盘符号。之后的把键盘符号转换成字符的过程则随不同的区域设置而变化。

XIM is usually implemented using the client-server model. A more detailed discussion of XIM implementation is beyond the scope of this document. Please see Section 13.5 of the Xlib document [18] and the XIM protocol [19] for more information.

一般情况下,XIM 是使用客户端-服务器模型实现的。关于 XIM 实现的更多详细讨论超出了本册的范围。请参考 Xlib 文档[18]的 13.5 节和 XIM 协议[19]获取更多信息。(脚注编号需要确认)

In general, users can choose their favourite XIM server by setting the XMODIFIERS environment variable, like this:

通常,用户可以通过设置系统环境变量 XMODIFIERS 选择他们想要的 XIM 服务器,例如:


$ export LANG=th_TH.TIS-620
$ export XMODIFIERS="@im=Strict"

This selects the Strict input method for the Thai locale.

以上命令设置了泰国区域设置中的 Strict 输入法。

Client Side
===客户端===

A normal GTK+-based text entry widget will provide an “Input Methods” context menu that can be opened by right clicking within the text area. This menu provides the list of all installed GTK+ IM modules, which the user can choose from. The menu is initialized by querying all installed modules for the engines they provide.

标准的基于 GTK+ 的文本输入框组件会提供一个“输入方法”的上下文菜单,可通过在文本区点击鼠标右键打开。这个菜单提供了所有已安装的 GTK+ 输入法列表供用户选择。菜单是通过在所有已安装的模块中查询它们提供的引擎而生成的。

From the client’s point of view, each text entry is represented by an IM context, which communicates with the IM module after every key press event by calling a key filter function provided by the module. This allows the IM to intercept the key presses and translate them into characters. Non-character keys, such as function keys or control keys, are not usually intercepted. This allows the client to handle special keys, such as shortcuts.

在客户程序看来,每一个输入的文本都是由输入法上下文来表示的,后者在每次击键事件之后调用模块提供的键过滤函数。这使得输入法能够截获击键事件并将它们转化为字符。非字符的键,如功能键和控制键不被截获。这使得客户程序能够处理快捷键等特殊键。

There are also interfaces for the other direction. The IM can also call the client for some actions by emitting GLib signals, for which the handlers may be provided by the client by connecting callbacks to the signals:

还有些接口是为其他文本输入流设计的。输入法可以通过发送 GLib 信号调用客户程序执行某些动作。客户程序可以通过连接以下信号反馈提供信号的句柄:

  • “preedit_changed”: The uncommitted (pre-edit) string has changed. The client may update the display, but not the input buffer, to let the user see the keystrokes.
  • “commit”: Some characters have been committed by the IM. The committed string is passed along so that the client can add it to its input buffer.
  • “retrieve_surrounding”: The IM wants to retrieve some text around the cursor.
  • “delete_surrounding”: The IM wants to delete text around the cursor. The client should delete the requested portion of text around the cursor.

  • "preedit_changed"没有提交(预编辑状态)的字符串发生了改变。客户程序可以刷新显示让用户看到击键序列,但不刷新输入缓冲区。
  • "commit"输入法提交了一些字符。提交的字符也一并传递过来,客户程序可以将其放入输入缓冲区。
  • "retrieve_surrounding"输入法要获得光标附近的一些文本。
  • "delete_surrounding"输入法要删除光标附近的文本。客户程序需要按照需求删除光标附近的部分文字。

IM Modules
===输入法模块===

GTK+ input methods are implemented using loadable modules that provide entry functions for querying and creating the desired IM context. These are used as interface with the “Input Methods” context menu in text entry areas.

GTK+ 输入法是用可装载的模块实现的,模块提供输入函数以便查询和建立需要的输入法上下文。这些函数也被用做文本输入区中“输入方法”上下文菜单的接口。

The IM module defines one or more new IM context classes and provides filter functions to be called by the client upon key press events. The filter determines the proper action for the key and returns TRUE if it intercepts the event, or FALSE to pass the event back to the client.

输入法模块定义一个或多个新的输入法上下文类,并提供客户程序遇到按键事件时调用的过滤器函数。模块也负责确定对击键的正确响应,如果要截获事件就返回“真”值,要把事件传回客户程序就返回“假”值。

Some IMs (e.g., CJK and European) perform a stateful conversion: the input string is matched incrementally against predefined patterns, and the converted string is committed only once a unique pattern has been matched. During the partial matching, the IM emits the “preedit_changed” signal to the client for every change, so that the client can update the pre-edit string on the display. Finally, to commit characters, the IM emits the “commit” signal, with the converted string as the argument, to the IM context. Other IMs (e.g., Thai) are context-sensitive. They need to retrieve the text around the cursor to determine the appropriate action. This can be done through the “retrieve_surrounding” signal.

有些输入法(如 CJK 和欧洲文字)可能需要进行有状态的转换,即把输入字符串与预定义的样式进行增量匹配,直到所有字符串匹配完成才提交转换后的字符串。在部分匹配状态,每一次改变时输入法都向客户程序发出“preedit_changed”信号,以便后者在显示中更新预编辑字符串。最后,为了提交字符,输入法发出“commit”信号,将转换后的字符串作为参数提交给输入法上下文。还有些输入法(如泰文)是上下文敏感的。他需要获取光标两侧的文字决定适当的动作。这可以通过“retrieve_surrounding”信号来完成。

In addition, the IM may request to delete some text from the client’s input buffer as required by Thai advanced IM. This is also used to correct the illegal sequences. This can be done via the “delete_surrounding” signal.

此外,输入法可能需要按照泰文输入的要求从客户端的输入缓冲区中删除一些文字。修正非法的序列时也要用到这个功能。它通过“delete_surrounding”信号实现。
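
The filter-and-commit flow described above can be sketched in plain C. This is a toy model, not the GTK+ API: the type ToyIMContext, the functions toy_im_filter_keypress and toy_client_on_commit, and the single compose rule (apostrophe followed by e yields é) are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Called when the toy IM commits text; plays the role of the "commit" signal. */
typedef void (*ToyCommitFn)(const char *utf8, void *user_data);

typedef struct {
    char pending;          /* uncommitted (pre-edit) key, or 0 if none */
    ToyCommitFn commit;    /* handler supplied by the client */
    void *user_data;       /* passed back to the handler unchanged */
} ToyIMContext;

/* Example client handler: append committed text to a 64-byte input buffer. */
void toy_client_on_commit(const char *utf8, void *user_data)
{
    char *buffer = user_data;
    strncat(buffer, utf8, 63 - strlen(buffer));
}

/* Returns true if the IM consumed the key, false to hand it back to the client. */
bool toy_im_filter_keypress(ToyIMContext *ctx, char key)
{
    assert(ctx != NULL && ctx->commit != NULL);

    if (ctx->pending == 0) {
        if (key == '\'') {
            /* Start composing; a real IM would signal a pre-edit change here. */
            ctx->pending = key;
            return true;
        }
        return false;      /* not ours: shortcuts and plain keys stay with the client */
    }

    ctx->pending = 0;
    if (key == 'e') {
        ctx->commit("\xc3\xa9", ctx->user_data);   /* U+00E9 encoded as UTF-8 */
        return true;
    }
    ctx->commit("'", ctx->user_data);   /* no pattern matched: commit the accent alone */
    return false;
}
```

The client owns the input buffer and only ever sees committed text, which is the division of labour the signals above describe.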

Locales
==区域设置==

As mentioned earlier, the GNU C library is internationalized according to POSIX and ISO/IEC 14652. Both locale models are discussed in this section.

正如前面提到的,GNU C 库是按照 POSIX 和 ISO/IEC 14652 标准国际化的。本节对两种区域设置都进行讨论。

Locale Naming
===区域命名===

A locale is described by its language, country and character set. The naming convention as given in OpenI18N guideline (26) is:

区域设置是用语言,国家和字符集描述的。在 OpenI18N 导则(26)中给出的命名规定是:

lang_territory.codeset[@modifiers]

(语言)_(地区).代码集[@修饰]

where

其中

lang is a two-letter language code defined in ISO 639:1988. Three-letter codes in ISO 639-2 are also allowed in the absence of the two-letter version. The ISO 639-2 Registration Authority at Library of Congress (27) has a complete list of language codes.

语言是在 ISO 639:1988 中定义的两个字母的语言代码。如果没有两字母的代码,也可以使用 ISO 639-2 中的三字母代码。国会图书馆的 ISO 639-2 注册机构(27)有语言代码的完整列表。

territory is a two-letter country code defined in ISO 3166-1:1997. The list of two-letter country codes is available online from ISO 3166 Maintenance agency.(28)

地区是在 ISO 3166-1:1997 中定义的两个字母的国家代码。ISO 3166 维护处的网站(28)有双字母国家代码的清单。

codeset describes the character set used in the locale.

编码集描述该区域设置使用的字符集。

modifiers add more information for the locale by setting options (turn on flags or use equal sign to set values). Options are separated by commas. This part is optional and implementation-dependent. Different I18N frameworks provide different options.

修饰通过设置选项(打开标志或者用等号设置值)添加更多的信息。选项用逗号分开。这部分是可选的,而且取决于具体的实现。不同的 I18N 框架提供不同的选项。

For example:

例如:

fr_CA.ISO-8859-1 = French language in Canada using ISO-8859-1 character set
th_TH.TIS-620 = Thai language in Thailand using TIS-620 encoding

fr_CA.ISO-8859-1 = 在加拿大使用的法语,使用 ISO-8859-1 字符集
th_TH.TIS-620 = 在泰国使用的泰语,使用 TIS-620 编码
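
As a rough sketch of the naming convention, the following C code splits a locale name into its four parts. The LocaleName type, its field sizes and the function parse_locale_name are assumptions made for this example; they are not part of glibc.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical container for the parts of lang_territory.codeset[@modifiers]. */
typedef struct {
    char lang[16];
    char territory[16];
    char codeset[32];
    char modifiers[32];
} LocaleName;

/* Copy at most dst_size-1 bytes of [begin, end) into dst and terminate it. */
static void copy_span(char *dst, size_t dst_size, const char *begin, const char *end)
{
    size_t len = (size_t)(end - begin);
    if (len >= dst_size)
        len = dst_size - 1;
    memcpy(dst, begin, len);
    dst[len] = '\0';
}

/* Split a locale name from right to left; missing parts become empty strings. */
void parse_locale_name(const char *name, LocaleName *out)
{
    assert(name != NULL && out != NULL);
    memset(out, 0, sizeof *out);

    const char *end = name + strlen(name);

    const char *at = strchr(name, '@');
    if (at != NULL) {
        copy_span(out->modifiers, sizeof out->modifiers, at + 1, end);
        end = at;
    }
    const char *dot = memchr(name, '.', (size_t)(end - name));
    if (dot != NULL) {
        copy_span(out->codeset, sizeof out->codeset, dot + 1, end);
        end = dot;
    }
    const char *underscore = memchr(name, '_', (size_t)(end - name));
    if (underscore != NULL) {
        copy_span(out->territory, sizeof out->territory, underscore + 1, end);
        end = underscore;
    }
    copy_span(out->lang, sizeof out->lang, name, end);
}
```

Parsing from the right keeps the codeset intact even when it contains hyphens or digits, as in ISO-8859-1.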

If territory or codeset is omitted, default values are usually resolved by means of locale aliasing.

如果省略了区域和字符集代码,缺省值通常通过区域设置别名的方式得到。

Note that for the GNU/Linux desktop, the modifiers part is not supported yet. Locale modifiers for X Window are to be set through the XMODIFIERS environment instead.

注意在 GNU/Linux 桌面系统中,修饰部分尚未被支持。X Window 的区域设置修饰通过 XMODIFIERS 环境变量来设定。

Character Sets
===字符集===

A character set is part of a locale definition. It defines all the characters available in the locale as well as how they are encoded for information interchange. In the GNU C library (glibc), locales are described in terms of Unicode.

字符集是区域设置定义的一部分。它定义文字中所有的字符以及它们在信息交换中的编码。在 GNU C 库(glibc)中,区域用 Unicode 的方式描述。

A new character set is described as a Unicode subset, with each element associated with the byte string that encodes it in the target character set. For example, the UTF-8 encoding is described like this:

新的字符集被作为 Unicode 的子集描述,其每个元素在定义的字符集中被分配一串字节作为编码。例如,UTF-8 编码是这样描述的:


...
<U0041>     /x41         LATIN CAPITAL LETTER A
<U0042>     /x42         LATIN CAPITAL LETTER B
<U0043>     /x43         LATIN CAPITAL LETTER C
...
<U0E01>     /xe0/xb8/x81 THAI CHARACTER KO KAI
<U0E02>     /xe0/xb8/x82 THAI CHARACTER KHO KHAI
<U0E03>     /xe0/xb8/x83 THAI CHARACTER KHO KHUAT
...

The first column is the Unicode value. The second is the encoded byte string. And the rest are comments.

第一列是 Unicode 值,第二列是被编码的字符串,其余是注释。
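
The byte strings in the second column follow from the UTF-8 encoding rules. A minimal C sketch is given below; the function name utf8_encode is an assumption for this example and is not a glibc interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-8 into out (room for 4 bytes).
 * Returns the number of bytes written, or 0 if cp cannot be encoded. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                       /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                               /* beyond the Unicode range */
}
```

Encoding U+0E01 with this function yields the three bytes e0 b8 81 listed for THAI CHARACTER KO KAI.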

As another example, the TIS-620 encoding for Thai is a simple 8-bit single-byte encoding. The first half of the code table is the same as ASCII, and the second half encodes its first character at 0xA1. Therefore, the character map looks like:

另一个例子,泰语的 TIS-620 编码是简单的8位单字节。码表的前半部分与 ASCII 相同,后半部分在 0xA1 编码第一个字符。这样,字符表就是这个样子:


...
<U0041>     /x41         LATIN CAPITAL LETTER A
<U0042>     /x42         LATIN CAPITAL LETTER B
<U0043>     /x43         LATIN CAPITAL LETTER C
...
<U0E01>     /xa1         THAI CHARACTER KO KAI
<U0E02>     /xa2         THAI CHARACTER KHO KHAI
<U0E03>     /xa3         THAI CHARACTER KHO KHUAT
...

POSIX Locales
POSIX 区域设置

According to POSIX, standard C library functions are internationalized according to the following categories:

根据 POSIX 标准,标准的 C 函数库根据如下分类进行国际化:

Category Description
LC_CTYPE character classification
LC_COLLATE string collation
LC_TIME date and time format
LC_NUMERIC number format
LC_MONETARY currency format
LC_MESSAGES messages in locale language
分类 描述
LC_CTYPE 字符分类
LC_COLLATE 字符排列
LC_TIME 日期和时间格式
LC_NUMERIC 数字格式
LC_MONETARY 货币格式
LC_MESSAGES 信息使用的语言

Setting Locale
设定区域设置

A C application can set the current locale with the setlocale() function (declared in <locale.h>). The first argument indicates the category to be set; alternatively, LC_ALL is used to set all categories. The second argument is the name of the locale to be chosen; alternatively, an empty string ("") means that the locale is taken from the system environment settings.

C 语言编写的应用程序可以用 setlocale()(在 <locale.h> 中定义)函数设定当前的区域设置。第一个参数表示设定的类别;也可以使用 LC_ALL 设置所有的类别。第二个参数是选择的区域设置名称,或者用空字符串("")表示依赖系统的环境设定。

Therefore, the program initialization of a typical internationalized C program may appear as follows:

这样,典型的国际化 C 程序的初始化部分就如下所示:

#include <locale.h>
...
const char *prev_locale;
prev_locale = setlocale (LC_ALL, "");


#include <locale.h>
...
const char *prev_locale;
prev_locale = setlocale (LC_ALL, "");

and the system environments are looked up to determine the appropriate locale as follows:

然后查询系统环境,按如下的顺序确定适当的区域设置:

1. If LC_ALL is defined, it shall be used as the locale name.
2. Otherwise, if corresponding values of LC_CTYPE, LC_COLLATE, LC_MESSAGES are defined, they shall be used as locale names for corresponding categories.
3. For categories that are still undefined after the above checks, LANG is used as the locale name if it is defined.
4. For categories that are still undefined by the above checks, "C" (or "POSIX") locale shall be used.

  1. 如果定义了 LC_ALL,就用它作为区域设置名称。
  2. 否则,如果定义了 LC_CTYPE,LC_COLLATE,LC_MESSAGES 的值,就用它们作为相应类别的区域设置名称。
  3. 没有被上面的步骤定义的类别,如果定义了 LANG,则用它作为区域设置名称。
  4. 仍然没有被定义的类别,则使用“C”或者“POSIX”区域设置。

The "C" or "POSIX" locale is a dummy locale in which all behaviours are C defaults (e.g. ASCII sort for LC_COLLATE).

“C”或“POSIX”区域设置是一个表示 C 语言默认值的虚设置(例如 LC_COLLATE 使用 ASCII 排序)。
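
The lookup order above can be expressed as a small C sketch. The functions pick_locale and locale_from_environment are hypothetical helpers written for this example; setlocale() performs this resolution internally. Empty values are treated here like unset ones, which is how POSIX describes these variables.

```c
#include <assert.h>
#include <stdlib.h>

/* Treat unset and empty values alike. */
static const char *non_empty(const char *value)
{
    return (value != NULL && *value != '\0') ? value : NULL;
}

/* Apply the precedence LC_ALL > LC_<category> > LANG > "C". */
const char *pick_locale(const char *lc_all, const char *lc_category, const char *lang)
{
    if (non_empty(lc_all))
        return lc_all;
    if (non_empty(lc_category))
        return lc_category;
    if (non_empty(lang))
        return lang;
    return "C";
}

/* Resolve one category, e.g. "LC_COLLATE", from the process environment. */
const char *locale_from_environment(const char *category_var)
{
    assert(category_var != NULL);
    return pick_locale(getenv("LC_ALL"), getenv(category_var), getenv("LANG"));
}
```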

LC_CTYPE
LC_CTYPE

LC_CTYPE defines character classification for functions declared in <ctype.h>:
LC_CTYPE 定义 <ctype.h> 中声明的函数的字符类别。


iscntrl() isgraph() isprint()
isspace() ispunct() isalnum()
isalpha() isdigit() isxdigit()
islower() isupper() tolower()
toupper()

iscntrl() isgraph() isprint()
isspace() ispunct() isalnum()
isalpha() isdigit() isxdigit()
islower() isupper() tolower()
toupper()

Since glibc is Unicode-based, and all character sets are defined as Unicode subsets, it makes no sense to redefine character properties in each locale. Typically, the LC_CTYPE category in most locale definitions refers to the default definition (called "i18n").

glibc 是基于 Unicode 的,而且所有的字符集都定义为 Unicode 子集,所以没有必要在每种区域设置中重新定义字符的属性。一般,大部分区域设置定义中的 LC_CTYPE 分类都是指缺省的定义(称为“i18n”)。
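
A minimal sketch of a locale-aware classification loop follows. The function count_alpha is a made-up example; note that these functions work on single bytes, so multi-byte encodings such as UTF-8 need the wide-character equivalents (for example iswalpha()) instead.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Count the bytes of s that the current LC_CTYPE locale classifies as letters. */
size_t count_alpha(const char *s)
{
    assert(s != NULL);
    size_t count = 0;
    for (; *s != '\0'; ++s) {
        /* The cast avoids undefined behaviour for bytes above 0x7F. */
        if (isalpha((unsigned char)*s))
            ++count;
    }
    return count;
}
```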

LC_COLLATE
LC_COLLATE

C functions that are affected by LC_COLLATE are strcoll() and strxfrm().

LC_COLLATE 影响的 C 语言函数包括 strcoll() 和 strxfrm()。

strcoll() compares two strings in a manner similar to strcmp(), but in a locale-dependent way. Note that the behaviour of strcmp() never changes under different locales.

strcoll() 用与 strcmp() 类似的方式比较两个字符串,但依赖于区域设置。注意 strcmp() 的行为在不同的区域设置下不发生改变。

strxfrm() translates string into a form that can be compared using the plain strcmp() to get the same result as when directly compared with strcoll().

strxfrm() 将字符串转换为能够用一般的 strcmp() 函数比较的形式,以获得与使用 strcoll() 函数相同的结果。
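
The relationship between the two functions can be sketched as follows. The helpers make_sort_key and orderings_agree are assumptions for this example. Precomputed keys are mainly worthwhile when the same strings are compared many times, as in sorting.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Return a malloc'ed sort key for s under the current LC_COLLATE locale,
 * or NULL if memory is exhausted. The caller frees the result. */
char *make_sort_key(const char *s)
{
    assert(s != NULL);
    size_t needed = strxfrm(NULL, s, 0) + 1;   /* length query, then room for NUL */
    char *key = malloc(needed);
    if (key != NULL)
        strxfrm(key, s, needed);
    return key;
}

static int sign(int value)
{
    return (value > 0) - (value < 0);
}

/* Return 1 if strcmp() on the keys orders a and b the same way as strcoll(). */
int orderings_agree(const char *a, const char *b)
{
    char *key_a = make_sort_key(a);
    char *key_b = make_sort_key(b);
    int agree = 0;
    if (key_a != NULL && key_b != NULL)
        agree = sign(strcmp(key_a, key_b)) == sign(strcoll(a, b));
    free(key_a);
    free(key_b);
    return agree;
}
```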

The LC_COLLATE specification is the most complicated of all locale categories. There is a separate standard for collating Unicode strings, called ISO/IEC 14651 International String Ordering (29). The glibc default locale definition is based on this standard. Locale developers may consider investigating the Common Tailorable Template (CTT) defined there before beginning their own locale definition.

LC_COLLATE 规则是所有区域设置类别中最复杂的。对于 Unicode 字符串的排序有专门的标准,称为 ISO/IEC 14651 国际字符串排序(29)。glibc 默认的区域设置定义是基于这个标准的。区域设置的开发者在开始定义自己的区域设置前可以研究一下该标准定义的通用可裁减模板(Common Tailorable Template, CTT)。

In the CTT, collation is done through multiple passes. Character weights are defined in multiple levels (four levels for ISO/IEC 14651). Some characters can be ignored (by using "IGNORE" as weight) at first passes and be brought into consideration in later passes for finer adjustment. Please see ISO/IEC 14651 document for more details.

在 CTT 中,排序通过多次过程来完成。字符的权重有多级定义(ISO/IEC 14651 有四级)。在第一次排序中有些字符可以被忽略(通过设置权重为“忽略”)并被放入下一次排序过程中以实现更细致的调整。详情请参看 ISO/IEC 14651 文档。

LC_TIME
LC_TIME

LC_TIME allows localization of date/time strings formatted by the strftime() function. Days of the week and months can be translated into the locale language.

LC_TIME 对用 strftime() 函数格式化的日期/时间字符串进行本地化。星期和月份名称可以可以翻译成本地语言中正确的名称。
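
As an illustration, the sketch below formats a date with the weekday and month names supplied by the current LC_TIME locale. The function name format_long_date and the chosen format string are assumptions for this example.

```c
#include <assert.h>
#include <time.h>

/* Format a date such as "Thursday 01 January 2004" using the names
 * defined by the current LC_TIME locale. Returns the number of bytes
 * written, or 0 if the buffer is too small. */
size_t format_long_date(char *buf, size_t size, const struct tm *when)
{
    assert(buf != NULL && when != NULL);
    return strftime(buf, size, "%A %d %B %Y", when);
}
```

A program that first calls setlocale(LC_TIME, "") would get the names in the user's language, provided that locale is installed.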

LC_NUMERIC & LC_MONETARY
LC_NUMERIC 和 LC_MONETARY

Each culture uses different conventions for writing numbers, namely, the decimal point, the thousand separator and grouping. This is covered by LC_NUMERIC.

书写数字的方式,例如小数点,千位分割符和分组随文化的不同而不同,LC_NUMERIC 处理这些差别。

LC_MONETARY defines currency symbols used in the locale as per ISO 4217, as well as the format in which monetary amounts are written. A single function, localeconv(), declared in <locale.h>, retrieves information from both locale categories. Glibc provides an extra function, strfmon(), in <monetary.h> for formatting monetary amounts as per LC_MONETARY, but this is not a standard C function.

LC_MONETARY 根据 ISO 4217 定义区域设置中使用的货币符号,以及货币数量的书写格式。在 <locale.h> 中定义了专门的函数 localeconv() 用来从这两个区域设置类别中提取信息。Glibc 在 <monetary.h> 中提供了另一个函数 strfmon() 用来根据 LC_MONETARY 的设置格式化货币数额,但它不是标准的 C 函数。
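
A short sketch of reading both categories through localeconv() is given below. The function describe_conventions and its output format are invented for this example.

```c
#include <assert.h>
#include <locale.h>
#include <stdio.h>

/* Write a one-line summary of the active numeric and monetary conventions.
 * Returns the value of snprintf(). */
int describe_conventions(char *buf, size_t size)
{
    assert(buf != NULL);
    const struct lconv *lc = localeconv();
    return snprintf(buf, size,
                    "decimal='%s' thousands='%s' currency='%s' intl='%s'",
                    lc->decimal_point,     /* LC_NUMERIC */
                    lc->thousands_sep,     /* LC_NUMERIC */
                    lc->currency_symbol,   /* LC_MONETARY, local symbol */
                    lc->int_curr_symbol);  /* LC_MONETARY, ISO 4217 code */
}
```

In the "C" locale only the decimal point is set; the other strings are empty until a program selects a real locale with setlocale().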

LC_MESSAGES
LC_MESSAGES

LC_MESSAGES is mostly used for message translation purposes. The only use in POSIX locale is the description of a yes/no answer for the locale.

LC_MESSAGES 常常用于信息的翻译文本。在 POSIX 区域设置中的唯一用途是对是否使用区域设置的回答的描述。
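
The yes/no description can be queried with nl_langinfo(YESEXPR), which returns an extended regular expression. The sketch below matches a user's answer against it; the function name is_affirmative is an assumption for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <langinfo.h>
#include <regex.h>
#include <stddef.h>

/* Return 1 if answer matches the locale's affirmative pattern,
 * 0 if it does not, or -1 if the pattern cannot be compiled. */
int is_affirmative(const char *answer)
{
    assert(answer != NULL);
    regex_t pattern;
    if (regcomp(&pattern, nl_langinfo(YESEXPR), REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    int matched = (regexec(&pattern, answer, 0, NULL, 0) == 0);
    regfree(&pattern);
    return matched;
}
```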

ISO/IEC 14652
ISO/IEC 14652

The ISO/IEC 14652 Specification method for cultural conventions (30) is basically an extended POSIX locale specification. In addition to the details in each of the six categories, it introduces six more:

ISO/IEC 14652 对文化习惯的规定(30)基本上是 POSIX 区域设置规定的扩展。除了上面的六个类别外,它还引入了另外六个:

Category Description
LC_PAPER paper size
LC_NAME personal name format
LC_ADDRESS address format
LC_TELEPHONE telephone number
LC_MEASUREMENT measurement units
LC_VERSION locale version

类别 描述
LC_PAPER 纸张规格
LC_NAME 人名格式
LC_ADDRESS 地址格式
LC_TELEPHONE 电话号码
LC_MEASUREMENT 测量单位
LC_VERSION 区域设置版本

All of the above categories are already supported by glibc, which exposes the version information under the name LC_IDENTIFICATION. C applications can retrieve all locale information using the nl_langinfo() function.

上述类别都已被 glibc 支持。C 语言程序可以用 nl_langinfo() 函数获取所有的区域设置信息。

Building Locales
===构建区域设置===

To build a locale, a locale definition file describing data for ISO/IEC 14652 locale categories must be prepared. (See the standard document for the file format.) In addition, when defining a new character set, a charmap file must be created for it; this gives every character a symbolic name and describes encoded byte strings.

构建区域设置之前,需要编写一个描述 ISO/IEC 14652 区域设置类别资料的文件。(文件格式见标准文档。)此外,在定义新的字符集时,也要同时编写一个字符映射文件,描述编码的字符串并给每个字符一个符号名称。

In general, glibc uses UCS symbolic names (<Uxxxx>) in locale definitions, for convenience in generating locale data for any charmap. The actual locale data used by C programs is in binary form, so the locale definition must be compiled with the localedef command, which accepts arguments like this:

通常,glibc 在区域设置定义中使用 UCS 符号名称(<Uxxxx>),以便为任意的字符映射生成区域设置信息。C 程序实际使用的区域设置数据是二进制格式的。因此区域设置信息必须用 localedef 命令编译,它依赖如下的参数:

localedef [-f <charmap>] [-i <input>] <name>

localedef [-f <字符映射>] [-i <输入>] <名称>

For example, to build th_TH locale from locale definition file th_TH using TIS-620 charmap:

例如,用 TIS-620 字符映射和 th_TH 定义文件生成 th_TH 区域设置:

# localedef -f TIS-620 -i th_TH th_TH

# localedef -f TIS-620 -i th_TH th_TH

The charmap file may be installed at /usr/share/i18n/charmaps directory, and the locale definition file at /usr/share/i18n/locales directory, for further reference.

字符定义文件可以放在 /usr/share/i18n/charmaps 目录,区域设置定义文件放在 /usr/share/i18n/locales 目录,用于今后使用。

The locale command can be used with "-a" option to check for all installed locales and "-m" option to list supported charmaps. Issuing the command without argument shows the locale categories selected by environment setting.

locale 命令可以用“-a”选项检查所有安装了的区域设置,用“-m”选项列出支持的字符映射。不带参数的命令显示环境变量设置的区域设置类别的值。

Translation
==翻译==

The translation framework most commonly used in FOSS is GNU gettext, although some cross-platform FOSS, such as AbiWord, Mozilla and OpenOffice.org, use their own frameworks as a result of their cross-platform abstractions. This section briefly discusses GNU gettext, which covers more than 90 percent of GNU/Linux desktops. The concepts discussed here, however, apply to other frameworks as well.

Messages in program source code are wrapped in a short macro that calls a gettext function to retrieve the translated version. At program initialization, the hashed message database corresponding to the LC_MESSAGES locale category is loaded. Then, all messages covered by the macros are translated by quick lookup during program execution. Therefore, the task of translation is to build the message translation database for a particular language and install it in the appropriate place for the locale. With that preparation, programs that use gettext are translated automatically according to the locale setting, without any change to the source code.

在自由/开源软件翻译工作中使用得最多的框架是 GNU gettext,不过也有些跨平台的自由/开源软件如 AbiWord, Mozilla 和 OpenOffice.org 为了跨平台抽象而使用自己的翻译框架。本节将简要介绍在超过90%的 GNU/Linux 桌面中使用的 GNU gettext,但这里讨论的概念也适用于其它的翻译框架。程序源代码中的消息被放进调用 gettext 函数的短小程序中,以产生翻译后的版本。在程序初始化过程中,根据 LC_MESSAGES 区域设置调入哈希表编码的消息数据库。然后,小程序中包含的消息串在程序执行过程中通过迅速查找一张对照表被翻译。这样,翻译的任务就是为特定的语言建立消息翻译数据库,并将它安装在区域设置规定的适当位置。这样,gettext 程序就可以自动按照区域设置进行翻译,而不用改动源代码。
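
A minimal sketch of the macro and the initialization step is shown below. The text domain "myapp-example" and the directory "/usr/share/locale" are placeholder assumptions, as are the function names i18n_init and format_greeting; a real package would normally take the domain and directory from its build system.

```c
#include <assert.h>
#include <libintl.h>
#include <locale.h>
#include <stdio.h>

/* Conventional short macro; xgettext can be told to extract its arguments. */
#define _(msgid) gettext(msgid)

#define EXAMPLE_DOMAIN    "myapp-example"       /* hypothetical text domain */
#define EXAMPLE_LOCALEDIR "/usr/share/locale"   /* assumed catalog location */

/* Select the user's locale and tell gettext where the MO files live. */
void i18n_init(void)
{
    setlocale(LC_ALL, "");
    bindtextdomain(EXAMPLE_DOMAIN, EXAMPLE_LOCALEDIR);
    textdomain(EXAMPLE_DOMAIN);
}

/* Build a greeting from a translatable format string. */
int format_greeting(char *buf, size_t size, const char *name)
{
    assert(buf != NULL && name != NULL);
    return snprintf(buf, size, _("Hello, %s!"), name);
}
```

When no catalog is found for the domain, gettext() returns its argument unchanged, so the program still runs with the original messages.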

GNU gettext also provides tools for creating the message database. Two kinds of files are involved in the process:

GNU gettext 还为建立消息数据库提供了工具。在建立数据库的过程中涉及到两种文件:

PO (Portable Object) file. This is a file in human-readable form for the translators to work with. It is named so because of its plain-text nature, which makes it portable to other platforms.

PO (Portable Object,可移植对象)文件:这是翻译者工作时使用的一种人类可读的文件。它是纯文本文件,可以移植到各种平台,因此而得名。

MO (Machine Object) file. This is a hashed database for machines to read. It is in the final format to be loaded by the gettext program. There are many translation frameworks in commercial Unices, and these MO files are not compatible. One may also find some GMO files as immediate output from GNU gettext tools. They are MO files containing some GNU gettext enhanced features.

MO (Machine Object,机器对象)文件:这是计算机读取的哈希表排序的数据库。它是 gettext 程序载入的最终形式。商用的 Unix 系统中有许多中翻译框架,这些 MO 文件并不兼容。GNU gettext 工具也可能产生 GMO 文件,它们是包含有 GNU gettext 增强功能的 MO 文件。

Important GNU gettext tools will be discussed by describing the summarized steps of translation from scratch (See Figure 3):

我们通过描述从头开始翻译的概要步骤讨论重要的 GNU gettext 工具(见图3):

图3 GNU gettext 工作流程

1. Extract messages with the xgettext utility. What you get is the "package.pot" file as a template for the PO file.
2. Create the PO file for your language from the template, either by copying it to "xx.po" (where xx is your locale language) and filling its header information with your information, or by using the msginit utility.
3. Translate the messages by editing the PO file with your favourite text editor. Some specialized editors for PO files, such as kbabel and gtranslator, are also available.
4. Convert the PO file into MO file using the msgfmt utility.
5. Install the MO file under the LC_MESSAGES directory of your locale.
6. When the program develops, new strings are introduced. You need not begin from scratch again. Rather, you extract the new PO template with the xgettext utility as usual, and then merge the template with your current PO with the msgmerge utility. Then, you can continue by translating the new messages.

  1. 用 xgettext 工具提取消息。这一步产生的“package.pot”文件是 PO 文件的模板。
  2. 从模板创建你的语言的 PO 文件,可以把模板拷贝到“xx.po”(这里 xx 是你的区域设置语言)并在文件头部填入你的信息,也可以使用 msginit 工具。
  3. 通过用你最喜欢的文本编辑器编辑 PO 文件来翻译消息。也可以使用 PO 文件的专用编辑器,例如 kbabel 和 gtranslator。
  4. 用 msgfmt 工具把 PO 文件转换成 MO 文件。
  5. 把 MO 文件安装到你所在区域设置的 LC_MESSAGES 目录。
  6. 随着程序的开发,会引入新的字符串。你不需要再从头翻译,只要用 xgettext 工具提取新的 PO 模板,然后用 msgmerge 工具把它合并到你自己的 PO 文件中即可。然后,你就可以继续翻译新的消息了。

GNOME intltool
===GNOME intltool===

GNU/Linux desktops have more things to translate than messages in C/C++ source code. The system menu entries, lists of sounds on events, for example, also contain messages, mostly in XML formats that are not supported by GNU gettext. One may dig into these individual files to translate the messages, but this is very inconvenient to maintain and is also error prone.

GNU/Linux 桌面需要翻译的东西不仅仅是 C/C++ 源代码中的消息。比如系统菜单项和事件提示音效列表,也含有需要翻译的消息,多数是 GNU gettext 不支持的 XML 格式。打开这些单个的文件进行翻译当然可以,但这样做不便维护,而且容易出错。

KDE has a strong policy for translation. PO files for all KDE core applications are extracted into a single directory for each language, so that translators can work in a single place to translate the desktop without a copy of the source code. But in practice, one needs to look into the sources occasionally to verify the exact meaning of some messages, especially error messages. This already includes all the messages outside the C++ sources mentioned above.

KDE 的翻译规定比较严格。所有 KDE 核心应用程序的 PO 文件都被提取出来,按语言放在不同的目录下,这样翻译者可以在一个地方翻译桌面程序,而不需要源代码。但实际上,翻译者需要不时地查看源代码以确定一些消息的准确含义,尤其是错误信息。这些 PO 文件也已经包括了 C++ 源代码之外的所有消息。

GNOME takes a different approach. The PO files are still placed in the source under the "po" subdirectory as usual. But instead of directly using xgettext to extract messages from the source, the GNOME project has developed an automatic tool called intltool. This tool extracts messages from the XML files into the PO template along with the usual things xgettext does, and merges the translations back as well. As a result, despite the heterogeneous translation system, all translators need to do is edit a single PO file for a particular language.

GNOME 采用了不同的方式。PO 文件还是放在名为“PO”的目录下,和源代码放在一起。但 GNOME 项目不是用 xgettext 来从源代码提取消息,他们开发了一个称为 intltool 的自动化工具。这个工具不仅完成 xgettext 的功能,从 XML 文件中提取消息放入 PO 模板,还可以把翻译后的消息合并回去。因此,虽然翻译系统不同,翻译者也只需要为一种语言编辑单个的 PO 文件。

The use of intltool is easy. To generate a PO template, change the directory to the "po" subdirectory and run:

intltool 的使用很容易。要生成 PO 模板,只需进入“po”子目录并运行:

$ intltool-update --pot

$ intltool-update --pot

To generate a new PO file and merge with existing translation:

生成新的 PO 文件并和已有的翻译合并:

$ intltool-update xx

$ intltool-update xx

where xx is the language code. That is all that is required. Editing the PO file as usual can then begin.

其中 xx 是语言代码。这样就够了,然后就可以像平常一样编辑 PO 文件。

When PO editing is complete, the usual installation process of typical GNOME sources will automatically call the appropriate intltool command to merge the translations back into those XML files before installing. Note that, with this automated system, one should not directly call the xgettext and msgmerge commands any more.

完成 PO 文件编辑后,典型的 GNOME 程序安装过程会调用适当的 intltool 命令将翻译结果合并回 XML 文件。注意在这个自动化过程中,不能直接使用 xgettext 和 msgmerge 命令。

The following sites and documents provide more information on KDE and GNOME translation:

以下站点和文档提供了有关 KDE 和 GNOME 翻译的更多信息:

  • KDE Internationalization Home (i18n.kde.org/)
    • The KDE Translation HOWTO (i18n.kde.org/translation-howto/)
  • The GNOME Translation Project (developer.gnome.org/projects/gtp/)
    • Localizing GNOME Applications (developer.gnome.org/projects/gtp/l10n-guide/)
    • How to Use GNOME CVS as a Translator (developer.gnome.org/doc/tutorials/gnome-i18n/translator.html)

  • KDE 国际化主页(i18n.kde.org/)
    • KDE 翻译 HOWTO (i18n.kde.org/translation-howto/)
  • GNOME 翻译项目(developer.gnome.org/projects/gtp/)
    • 本地化 GNOME 程序(developer.gnome.org/projects/gtp/l10n-guide/)
    • 翻译者怎样使用 GNOME CVS(developer.gnome.org/doc/tutorials/gnome-i18n/translator.html)

PO Editors
===PO 文件编辑器===

A PO file is a plain text file, so it can be edited with any text editor. But, as stated earlier, translation is a labour-intensive task. It is worth considering some convenient tools to speed up the job.

PO 文件是纯文本文件,可以用任意一种文本编辑器编辑。但是,像前面所说的,翻译是一项任务繁重的工作,可以考虑用一些方便的工具来加速工作进程。

Normally, the editor needs to be able to edit UTF-8, as both KDE and GNOME now use it as their standard text encoding. The following tools offer many other features as well.

所选用的编辑器一般需要能够编辑 UTF-8 文件,因为现在 KDE 和 GNOME 都使用这种编码作为标准文本编码。而以下的工具还有其他很多功能。

KBabel
KBabel

Part of the KDE Software Development Kit, KBabel is an advanced and easy-to-use PO-files editor with full navigation and editing capabilities, syntax checking and statistics. The editor separates translated, un-translated and fuzzy messages so that it is easy to find and edit the unfinished parts.

KBabel 是 KDE 开发工具包的一部分,是一个先进而容易使用的 PO 文件编辑器,拥有全面的导航和编辑能力,语法检查和统计功能。编辑器把已翻译、未翻译和模糊的信息分开,从而易于寻找和编辑没有完成的部分。

KBabel also provides CatalogManager, which allows keeping track of many PO-files at once, and KBabelDict for keeping the glossary, which is important for translation consistency, especially among team members from different backgrounds.

KBabel 还提供了允许同时跟踪多个 PO 文件的目录管理器(CatalogManager),以及用于维护词汇表的 KBabelDict ,这对于保持翻译的一致性非常重要,特别是由来自不同背景的成员组成的队伍中。

Gtranslator
Gtranslator

Gtranslator is the PO-file editor for the GNOME desktop. It is very similar to KBabel in core functionality.

Gtranslator 是 GNOME 桌面使用的 PO 文件编辑器。它与 KBabel 在核心功能上非常相似。

Gtranslator also supports auto-translation, where translations are learnt and transferred into its memory, and can be applied in later translations using a hot key.

Gtranslator 还支持自动翻译,它能学习翻译并记录下来,在后面的翻译中通过一个热键可以调用记录的翻译。



6 UCS is the acronym for Universal multi-octet coded Character Set
6 UCS 是通用多字节编码字符集(Universal multi-octet coded Character Set)的缩写形式。
7 UTF is the acronym for Unicode (UCS) Transformation Format
7 UTF 是 Unicode 变换格式(Unicode (UCS) Transformation Format)的缩写。
8 The Unicode Consortium. The Unicode Standard, Version 4.0, pp. 76-77.
9 The Unicode Consortium. The Unicode Standard, Version 4.0, pp. 77-78.
10 Ibid., pp. 95-104.
11 Unicode.org, 'Unicode Technical Reports'; available from www.unicode.org/reports/index.html.
12 Unicode.org, 'Unicode Technical Reports'; available from www.unicode.org/reports/index.html.
13 Leisher, M., 'The XmBDFEd Font Editor'; available from crl.nmsu.edu/~mleisher/xmbdfed.html.
14 Williams, G., 'PfaEdit'; available from pfaedit.sourceforge.net.
15 van Rossum, J., 'TTX/FontTools'; available from fonttools.sourceforge.net/.
16 Note the difference with Microsoft's "Windows" trademark. X Window is without `s'.
16 注意与微软的“Windows“的差别。X Window 不带“s“。
(这里原书的脚注编号不正确——译注)
18 Taylor, O., 'Pango - Design'; available from www.pango.org/design.shtml.
19 GNOME Development Site, 'Pango Reference Manual'; available from developer.gnome.org/doc/API/2.0/pango/.
20 This is a very rough classification. Obviously, there are further steps, such as line breaking, alignment and justification. They need not be discussed here, as they go beyond localization.
20 这是一种粗糙的分类。显然,还有更多的步骤,例如断行、对齐。由于它们超出了本地化的范围,所以不在这里讨论。
21 Pascal, I., X Keyboard Extension; available from pascal.tsu.ru/en/xkb/.
22 Pascal, I., X Keyboard Extension; available from pascal.tsu.ru/en/xkb/.
23 Gettys, J., Scheifler, R.W., 'Xlib - C Language X Interface', X Consortium Standard, X Version 11 Release 6.4.
24 Narita, M., Hiura, H., The Input Method Protocol Version 1.0. X Consortium Standard, X Version 11 Release 6.4.
25 OpenI18N.org. OpenI18N Locale Name Guideline, Version 1.1 – 2003-03-11; available from www.openi18n.org/docs/text/LocNameGuide-V11.txt.
26 Library of Congress, ISO 639-2 Registration Authority; available from lcweb.loc.gov/standards/iso639-2.
27 ISO, ISO 3166 Maintenance agency (ISO 3166/MA) – ISO’s focal point for country codes; available from www.iso.org/iso/en/prods-services/iso3166ma/index.html.
28 ISO/IEC, ISO/IEC JTC1/SC22/WG20 – Internationalization; available from anubis.dkuug.dk/jtc1/sc22/wg20.
29 ISO/IEC, ISO/IEC JTC1/SC22/WG20 – Internationalization; available from anubis.dkuug.dk/jtc1/sc22/wg20.
30 同上。

