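Implementing FOSS Localization in Asia
Skills and Tools Required for Localization Projects
The Cost of Localizing FOSS
The Work of Localization
Annex A. Key Concepts in Localization
Annex B. Localization: A Technical Overview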
Asia is the world's battleground for FOSS localization. The monopolistic proprietary companies have not yet established dominance in Asia. For reasons of national interest, many Asian governments have adopted policies that encourage software alternatives, primarily FOSS.
In some countries the localization to date has been the work of a few dedicated enthusiasts scattered around the globe. Few have been paid for their efforts, and the disorganized nature of their translations has unintentionally produced ambiguities.
For the most part, the volunteers are programmers, not linguists. They need help from translators, technical writers and testers. Since localization deals primarily with language rather than programming issues, non-technical staff should outnumber programmers five to one. Even before adopting a formal FOSS policy, direct support for technical dictionaries and standards for localization can begin. Local language specialists require only professional salaries and offices. Technical writers and testers can be trained within a few months.
Supporting international localization initiatives, where languages that share significant similarities (e.g., Thai, Lao and Khmer) share programming and technical resources, is very cost effective. The value created for society as a whole, with new dictionaries and a technical standard enabling programmers and translators to consistently localize any FOSS for a low price, is undeniable. Other Asian countries should follow the example of the CJK initiative.
Localization initiatives should have very clear objectives, and the resources required to meet those objectives. They require funding, professional management and technical expertise. In addition, thorough linguistic knowledge is critical to success.
These initiatives will result in the creation of local centres where the knowledge is dispersed to those who will perform the actual work. Such centres can be the product of governmental action or business partnerships, or operate as part of a university. Regardless of how the centres are founded, they should enjoy the full support of government policy.
It is time to professionalize the process of Asian software localization, especially for developing countries. A great opportunity can be lost if haphazard efforts lead to undependable results. Where public good cannot be achieved by individual effort, the government is expected to help. A professional group can request the help of volunteers, but central coordination and basic work ought to be done by a dedicated team of paid staff.
It could be that the yearly fees paid by a developing country for a single department's commercial software are enough to underwrite that country's participation in the FOSS movement.
Establish localization centres to be the focal point for FOSS developers to share information, develop skills, and build on existing accomplishments. Where different countries have linguistic commonalities, a regional localization centre could share the cost of development. Specialists who are familiar with source code, linguists and analysts could be available to assist a wide variety of projects and build a knowledge base to accelerate future development.
Sponsor the creation of technical dictionaries and standards so that consistency is retained in all FOSS projects. With standard terminology, computer users are less likely to encounter frustration, and with standard technological procedures and processes, FOSS code can remain comprehensible to all IT professionals. Adherence to such standards should be mandated in all software procurement policies implemented by the government.
Move fast. It is important to have things done correctly, but it is also important to do them quickly. Writing a good computer glossary for a low-technology language can take over a year, but a first glossary that will suffice for translating the first versions of the programs can be done very quickly (e.g., in three months). Future versions of the programs will use the final glossary, but first versions can be available within months. An official "portal" detailing the prescribed terminology and standards should be the first priority.
Encourage the distribution of FOSS operating systems, applications and platforms. With little cost, governments can distribute localized FOSS to schools, businesses and other organizations. This would jumpstart the rate of adoption of computers and software in general, and prevent the unnecessary illegal copying of proprietary software. Because FOSS often works with older machines, the total price of providing computing access to the masses would be lower than that for any other approach.
Provide FOSS training not only for computer professionals, but also in primary and secondary schools. In developing countries where educational budgets are stretched thin, the use of localized FOSS operating on low-cost computers is well suited for increasing educational opportunities in rural communities. The natural curiosity of the youth should quickly result in a new generation that knows how to use computers in their native language. Those students who show a special talent for using computers can be encouraged to learn programming through scholarships, contests and other age-appropriate activities.
Beyond establishing governmental purchasing policies that favour localized FOSS, governments have an important role in removing obstacles, providing funding and coordinating standards. Without governmental support, "anglicisms" and inconsistencies will severely hamper the continued localization of FOSS, and limit the possibilities for growth of an indigenous software industry.
Localization often occurs when the country is already using computers in a foreign language. Computer scientists and trainers are used to an English or French computer vocabulary. Localization therefore requires creating training materials based on the language used in the glossary, so that trainers and new users will start using the local language.
As it is difficult to engage linguists, preparatory work can be done first, such as looking for different translation options for each term.
After this, the work is mainly that of translators, who follow glossary guidelines and rules. There should be professional translators and computer scientists in the same team to assure linguistic and technical correctness of the terms used.
Localization can increasingly be performed without too many technical resources, once the first layer of the work is done (fonts, language support, etc.). In the future, it will become easier, since almost all FOSS projects are adopting new tools and techniques to make it easier for non-experts to perform the work.
The skilled workers who can perform software localization are often already available, or can be trained locally or abroad. Regional software localization training and coordination centres could act as clearinghouses and colleges for individuals to improve their skills, and thereby produce new workers for the years ahead. Fortunately, only the programmers need to have specialized knowledge of FOSS. The other professionals can have previous experience with any type of software.
Office space that is sufficient and appropriate for the work at hand is a must for any project where work is not distributed ad hoc around the world. For a professional localization effort, and especially for multilingual regional localization centres, a commercial space is best. This includes stable low-cost broadband connections to the Internet, LAN and development servers, sufficient client computers for each employee and three or four terminals for each tester.
Active participation and cooperation from universities, especially linguists and translators of English, should be solicited. Publishing rights for scholars who make significant contributions to technical dictionaries and standards should be granted, as well as public recognition for student volunteers.
Typically, the following people need to be trained, organized and provided with the tools to succeed:
Project managers (technical and translation).
Analysts and linguists.
FOSS programmers.
Translators and technical writers.
Testers.
Trainers.
Project management for localization should be split into two jobs: (i) Technical Managers direct the actual editing of code to ensure proper language support; and (ii) Translation Managers coordinate the creative efforts of linguists, technical writers and trainers.
Analysts and linguists work together with project managers, sociologists and programme sponsors to identify the technical challenges to be overcome and the cultural-linguistic requirements to be met. Their work results in requirement specifications and a project description that the project manager uses to guide the project to completion. This roadmap guides the programmers in their work, and provides the benchmark against which the software will be tested. The analysts are also responsible for gathering, organizing and disseminating the technical standards and specifications required by the programmers to perform their work.
Since both the operating system user interface and various application user interfaces should be localized, often several different types of programmers will be required. Enthusiasts can perform this work remotely with others worldwide, but only if the problem has been thoroughly documented by the analysts. Wherever possible, local programming staff should be used for this stage of the work. The lessons they learn, and write down for later reference, can be spread to others who are performing localization. They may also create the technical standards for that language if none exist. Compliance with such FOSS conventions as "G11N" (globalization), "I18N" (internationalization) and "L10N" (localization) (please see Glossary), and others, will ensure that work proceeds quickly with the confidence that successive developers can continue to update and improve the software.
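As a rough illustration of what internationalized code looks like in practice, the sketch below marks user-visible strings for translation and looks them up in a catalog at run time. It uses Python's standard `gettext` module; the `DictTranslations` class, the sample strings and the Thai translations are assumptions made for this example. Real projects normally load compiled message catalogs from disk rather than an in-memory dictionary.

```python
import gettext


class DictTranslations(gettext.NullTranslations):
    """Hypothetical in-memory catalog used only for this illustration."""

    def __init__(self, catalog):
        super().__init__()
        self._catalog = catalog

    def gettext(self, message):
        # Fall back to the original English string when no translation exists.
        return self._catalog.get(message, message)


# Illustrative Thai translations (assumed for this example).
thai = DictTranslations({"Open": "เปิด", "Save": "บันทึก"})
_ = thai.gettext

# The program only ever refers to the English source strings;
# the catalog decides what the user actually sees.
menu = [_("Open"), _("Save"), _("Print")]
print(menu)
```

Because "Print" has no entry in the catalog, it is shown in English. This is the kind of gap that testers are expected to catch before release.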
Translators and technical writers perform the lion's share of the work. All the error messages, buttons, menus, commands, help files and user guides must be translated. In consultation with linguists for consistency and accuracy, translators and technical writers compile technical dictionaries, often coining new technical words and phrases that enable future developers to communicate more effectively with their colleagues. Just as the technical standards for localization are vital to programmers, the technical dictionary used by the writers and translators is vital to the project's success.
Testers use the requirement specifications to check the completed work of both the programmers and technical writers. Their painstaking work identifies errors and inconsistencies to be corrected, and rechecked, before release to the users of the software. Additional apprentice testers, especially those who speak no English and who are computer novices, can provide excellent feedback for programmers and translators.
Trainers introduce the localized software to the users. Often, local teachers who have been taught how to use the system give seminars, answer questions and mentor computer enthusiasts. Local businesses and governments may also hire trainers to educate their workforce. It is important to ensure that these software trainers are locally recruited and speak the native language, rather than being English speakers imported at great expense.
For both proprietary and free software, training on how the software works is essential. To teach local users how to operate the software, one needs:
Training equipment and materials.
Classrooms.
Instructors.
Most often, developers of the software ‘train the trainer’, who then instructs novices to make them advanced users. Training can be further divided between user training, system administrator training and developer training. Except for user training, which should be widespread, most specialized FOSS training takes place in educational institutions. Countries that have advanced quickly in FOSS localization have all devoted considerable resources to training and education. Without actual adoption of the software by a large segment of the population, the work of localization is an exercise in futility.
Tools and equipment for FOSS development and localization are less expensive than those required for proprietary localization. Version control, project management, documentation change management, and development tool kits for programmers are all available either free of charge or at a low cost. To work on FOSS, experience shows that it is best to use free/open source tools.
All other equipment, including most development computers, should be up to date and kept in a secure environment. A separate budget should be set aside for libraries, references and language tools specific to the language to be localized. If these materials do not already exist, they must be created.
Wherever possible, information on FOSS localization should be shared with the international FOSS development community so that the necessary tools do not have to be recreated by every team.
Technically speaking, localizing FOSS costs about as much as localizing commercial software. Only the techniques of programming are significantly different, since the linguistic and operational challenges exist no matter what type of software is to be localized. To localize any software, the following are needed:
Office space.
Office equipment and tools.
Technical staff.
Access to technical information.
Access to linguists and translators.
The largest cost will be staff salaries. The total cost of a project depends heavily on the wage expectations of local technical, translation, writing and testing staff, and their individual levels of experience with software localization, language and cultural issues.
The programmers and project managers probably require a higher-than-average education and salary, but most of the other staff use skills that are not particular to software and can be found more readily in the general population.
Trainers are hired when the software is near finalization, and presumably remain employed in teaching new users, system administrators and developers how to use the software.
For countries seeking independence from proprietary English language software, a permanent local office whose purpose is to train and disseminate technical information about localization could yield substantial savings. This establishment could be associated with a public library or university, where interested parties can access information at little or no cost.
FOSS can often operate well on older computers. This offers advantages to both developed countries with an overstock of used computers they must dispose of, and developing countries that can configure these computers to operate FOSS in the local language.
The total cost of localizing any particular piece of software is highly variable. Each project requires individual analysis for complexity, experience and availability of technical staff, and the characteristics of the local language.
Software cost and schedule estimating is not a simple calculation. In addition to a rough estimate based on the number of message strings to be translated, other factors must be considered.
Experience: Do the programmers, translators and testers have previous experience with this kind of work? If not, it will require extra time and effort to train them in the processes and standards of localization. But translators learn very quickly, and productivity increases dramatically after the first month or two. With a stable team, the members become very productive.
Environment: Does the staff have the tools and equipment needed to perform the work in a professional manner? Without modern office space, tools and techniques, it is unrealistic to expect the staff to perform at top efficiency.
Linguistic factors: How different is the local language from English? Translating from English to Swedish, for example, is fairly simple. The grammar, word length and vocabulary are very similar. English fluency is nearly universal, and translators are easy to find. On the other hand, translating from English to Lao is very difficult. The grammar, spelling conventions, word length, collation and other factors are not similar at all, so the size and position of user interface elements must be changed. In addition, a lack of experienced translators or even of a basic technical glossary means that projects would begin from practically nothing and take much more time and effort.
Scope: How much is enough? Is it acceptable to merely change the primary user interface menus and commands? Should the help files also be translated? What about documentation and user training materials? Are "anglicisms" acceptable? How many new words will be introduced into the language? To avoid failure, a very clear definition of the project's scope is necessary.
Metrics: Professional software cost and schedule estimating relies on the experience of previous projects for determining future schedules. If there is little past experience to rely on, the first few project estimates can only be educated guesses. After several projects have been completed, the actual time for completion can be compared to the initial estimates in order to refine future estimates. So it is important that accurate records of person-hours, resources, experience and other factors are collected for future reference.
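The refinement described under Metrics can be sketched as a simple calibration step: compare actual and estimated durations of completed projects, then scale the next estimate by the average ratio. The project records and the new estimate below are invented for illustration, and averaging the ratios is only one of several reasonable ways to do this.

```python
# Hypothetical records of past projects: (estimated weeks, actual weeks).
past_projects = [(10.0, 14.0), (8.0, 10.0), (12.0, 15.0)]


def calibration_factor(records):
    """Return the average ratio of actual to estimated duration."""
    ratios = [actual / estimated for estimated, actual in records]
    return sum(ratios) / len(ratios)


factor = calibration_factor(past_projects)

# Illustrative raw estimate for a new project, before calibration.
raw_estimate_weeks = 11.0
adjusted = raw_estimate_weeks * factor

print(f"calibration factor: {factor:.2f}")
print(f"adjusted estimate: {adjusted:.1f} weeks")
```

A factor above 1 means past projects ran longer than estimated, so new estimates are stretched accordingly. With only a handful of records the factor is itself uncertain, which is why careful record keeping matters.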
With the points mentioned above in mind, consider the following formula as a very rough "rule of thumb" for estimating localization project schedules.
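(Number of message strings to be translated x estimated minutes to translate one "average" string x translator experience factor*) / (60** x "actual" working hours per person per week*** x number of available staff) + 10 to 20 percent for management and training = estimated number of weeks to completion
* Less experience requires more time (no experience: x 1.5; extensive experience: x 0.75)
** Converts minutes to person-hours
*** Always fewer than 40 hours per week; typically about 20 hours per week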
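The rule of thumb can also be written compactly. The symbols are introduced here for convenience and do not appear in the original text: N is the number of message strings, m the estimated minutes per "average" string, e the experience factor (1.5 for no experience, 0.75 for extensive experience), h the "actual" working hours per person per week, s the number of available staff, and p the management and training overhead. Treating the overhead as a percentage of the base term is an interpretation of the wording.

```latex
W = \frac{N \cdot m \cdot e}{60 \cdot h \cdot s}\,(1 + p), \qquad 0.10 \le p \le 0.20
```

Here W is the estimated number of weeks to completion, and the factor 60 converts minutes to hours.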
Example Case 1:
1. 10,000 message strings to be translated
2. 10 minutes per message string
3. Less than average experience with software localization translating tools and processes.
4. 10 staff members
Example Case 1 Estimate:
1. 10,000 strings x 10 minutes, divided by 60 = about 1,667 person-hours
2. Divide by 20 "actual" person-hours per week = 83.33 person-weeks
3. Add 16.66 person-weeks for testing and editing = 99.99 person-weeks
4. Add 16.66 person-weeks for management and training = 116.65 person-weeks
5. Multiply by 1.5 to reflect lack of experience = 174.97 person-weeks
6. Divide by 10 staff members = about 17.5 weeks
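The steps of this estimate can be reproduced with a short script, which makes it easy to try other team sizes or experience levels. The function name and parameters are illustrative. The 20 percent rates for testing and editing and for management and training are inferred from the 16.66 person-week allowances in the example, not stated as rates in the text. The script carries full precision, so its results differ slightly from the hand-rounded figures.

```python
def estimate_weeks(strings, minutes_per_string, hours_per_week, staff,
                   experience_factor, testing_rate=0.20, management_rate=0.20):
    """Follow the worked example step by step and return calendar weeks."""
    person_hours = strings * minutes_per_string / 60
    base_weeks = person_hours / hours_per_week      # person-weeks of translation
    testing = base_weeks * testing_rate             # testing and editing allowance
    management = base_weeks * management_rate       # management and training allowance
    person_weeks = (base_weeks + testing + management) * experience_factor
    return person_weeks / staff


# Experience factors taken from the rule of thumb: 1.5, 1.0 and 0.75.
for label, factor in [("little experience", 1.5),
                      ("average experience", 1.0),
                      ("extensive experience", 0.75)]:
    weeks = estimate_weeks(10_000, 10, 20, 10, factor)
    print(f"{label}: {weeks:.2f} weeks")
```

Changing only the experience factor shows how strongly the schedule depends on the team's familiarity with localization tools and processes.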
In other words, such a project would require a staff of 10 people working almost five months. If the average salary for these professionals is a thousand dollars a month, staff costs alone come to USD 50,000.
Add 10 computers, office space, Internet connectivity, copy machines and other routine expenses, and a rough estimate of the overall cost of localization can be made.
Consider the same example with the following change: average experience with software localization translating tools and processes.
1. Multiply by 1.0 to reflect average level of experience = 116.65 person-weeks
2. Divide by 10 staff members = 11.665 weeks.
A project whose staff have average experience would require less than four months. If the average salary for these professionals is a thousand dollars a month, staff costs alone come to USD 40,000.
Consider the same example with the following change: staff with more experience with software localization, translation tools and processes.
1. Multiply by 0.75 to reflect above-average experience = 87.49 person-weeks
2. Divide by 10 staff members = 8.749 weeks.
An above average staff of 10 people, requiring no additional training, will need only about two months. If the average salary for these professionals is a thousand dollars a month, cost for staff is only USD20,000.00. Compared to proprietary products, FOSS is ideal for localizing. When a few projects have been completed, the costs drop quickly because the underlying concepts and techniques remain the same. The first few projects must develop new dictionaries, tools and specialties relating to language and technical processes. The experience of the staff is the key to increased productivity.
Once these are in place and well understood, the time and money required to complete additional projects are reduced. Because FOSS developers tend to adhere to open standards, the developers of a local version do not have to reverse-engineer the code to guess what work must be done. The localization process should be very similar from project to project. With commercial proprietary closed software, the opposite is true.
As we have seen, a major pitfall of proprietary software is that only the owner of the copyright can maintain or modify it. With FOSS, any person with the appropriate skills can do the work. So, instead of users being locked into an expensive maintenance contract with a single foreign vendor, support and maintenance of software can be freely contracted to a wide variety of local companies.
Programmers working alone cannot localize software. When estimating the cost, time and effort required for any localization, set aside only about 10 percent of the budget for technical issues. All the rest goes to the time-consuming task of translating, writing, testing and training.
When a proprietary company localizes software, it first determines whether the effort would be commercially viable. Then it hires localization experts, including linguists and cultural experts, to develop a technical dictionary. Meanwhile, well-paid analysts and programmers modify the software to accept and display the script for the language.
The lion's share of localization work involves translating and then replacing the labels, menu items, error messages and help files of the software. Sometimes the appearance of the software must also be modified to fit awkwardly long or short words. The technical dictionary, in some cases, is the first of its kind for that language, and new terms are invented.
Closely following the technical dictionary and programming standards, teams of technical writers enter the new phrases alongside the English original. When it is all done, testers ensure not only that every message has been translated, but also that the terminology is consistent and logical.
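The two checks that testers perform here, completeness and consistent terminology, lend themselves to simple automation. The sketch below assumes a message catalog held as a dictionary of English source strings and their translations, plus a glossary with one approved translation per term. The sample strings, the French translations and the function names are invented for illustration, and the substring matching is deliberately naive.

```python
# Hypothetical catalog: English source string -> translation ("" means untranslated).
catalog = {
    "Open file": "Ouvrir le fichier",
    "Save file": "Enregistrer le fichier",
    "Delete folder": "Supprimer le dossier",
    "Open folder": "Ouvrir le répertoire",
    "Print": "",
}

# Hypothetical glossary: English term -> the single approved translation.
glossary = {"file": "fichier", "folder": "dossier"}


def untranslated(catalog):
    """List source strings that still have no translation."""
    return [src for src, dst in catalog.items() if not dst.strip()]


def glossary_violations(catalog, glossary):
    """List (source, term, approved) where the approved term is missing."""
    problems = []
    for src, dst in catalog.items():
        if not dst.strip():
            continue  # reported separately as untranslated
        for term, approved in glossary.items():
            if term in src.lower() and approved not in dst.lower():
                problems.append((src, term, approved))
    return problems


print(untranslated(catalog))
print(glossary_violations(catalog, glossary))
```

A check of this kind cannot judge whether a translation reads naturally; it only flags entries for a human tester to review.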
Work of this sort follows an exponential curve, where the initial work is painfully slow, and then accelerates rapidly when the technical dictionary and standards are well established. After a few programs have been localized, an experienced team can localize additional software at greatly reduced costs. Marketing and training determine how successful they are in getting users to adopt the software. Often, they publish their technical dictionaries and standards, selling them for a profit or making them available free of charge to governments, universities and the online community.
Language Considerations
FOSS has typically been localized by a few volunteers working remotely, without the benefit of linguists or a technical dictionary for translation. The work can take a long time and it can be riddled with inconsistencies or errors.
The pace of FOSS localization is uneven. In countries where the language is similar to English and there are many bilingual volunteers, FOSS localization is well established. Where governments and other agencies have stepped in to provide financial support for localization, the results have also been impressive. (The "CJK" partnership of China, Japan, and Korea stands out as an example.)
In countries without much technical infrastructure, localization of both commercial software and FOSS is slow. It is far slower when the language is not of the Indo-European group. Commercial companies see little profit in the work, and few local professionals have the time or skills to localize FOSS. Even though the source code is freely available for localization work to begin, few specialized technical standards or technical dictionaries exist.
Some languages, particularly those with Latin-based scripts, are relatively easy to localize. Others can be very difficult. As an example, both Lao and Thai share a 42-consonant script, with vowel and tone marks. These scripts follow complex rules of layout involving consonants, vowels, special symbols, conjuncts and ligatures. These writing systems share certain characteristics: spaces are not necessarily used to separate words, and vowels can appear before, under, over and after consonants.
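Both characteristics can be observed directly in Unicode data. The sketch below uses Python's standard `unicodedata` module to list a few Thai vowel and tone signs with their general categories, then shows that whitespace splitting cannot find word boundaries in a Thai phrase. The choice of sample characters and the sample phrase are assumptions made for this illustration.

```python
import unicodedata

# Thai vowel and tone signs, given as code points to avoid rendering issues:
# a leading vowel, a vowel above, a vowel below, a following vowel, a tone mark.
signs = ["\u0e40", "\u0e34", "\u0e38", "\u0e32", "\u0e48"]

for ch in signs:
    # Category "Mn" marks a nonspacing sign drawn above or below its consonant.
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} {unicodedata.category(ch)}")

nonspacing = [ch for ch in signs if unicodedata.category(ch) == "Mn"]

# A Thai phrase ("I love the Thai language") has no spaces between words,
# so splitting on whitespace yields a single token.
phrase = "ฉันรักภาษาไทย"
tokens = phrase.split()
print(len(tokens))
```

This is why localizing such languages usually needs rendering support for stacked marks and dictionary-based word breaking, rather than the space-based logic that works for Latin scripts.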
Thai and Lao volunteers responsible for localizing FOSS have saved a great deal of time and avoided frustration by cooperating on technical issues, and sharing information on resources and tools.
Across Asia, opportunities exist for shared localization efforts at the inter-governmental level. Many other Asian languages share similarities, and often the programming tasks are nearly identical across similar language groups. Properly funded and organized, pan-Asian software localization is a realistic goal.
This annex provides a quick tour of the key concepts of localization, so that those interested in localizing FOSS for their own language get a broad picture of the kind of knowledge that is needed. The next annex provides the technical details required to get started.
Standardization
When two or more entities interact, common conventions are important. Car drivers must abide by traffic rules to prevent accidents. People need common conventions on languages and gestures to communicate. Likewise, software needs standards and protocols to interoperate seamlessly. In terms of software engineering, contracts between parts of programs need to be established before implementation. Such contracts matter most for systems developed by a large group of individual developers from different backgrounds, and are essential for cross-platform interoperability.
Standards provide such contracts for all computing systems in the world. Software developers need to conform to such conventions to prevent miscommunication. Therefore, standardization should be the very first step for any kind of software development, including localization.
标准为世界上所有计算系统提供了这样的协议。软件开发者需要遵循这些标准以避免沟通的障碍。因此,对于包括本地化在内的任何类型的软件开发,标准化都应该是第一步。
To start localization, it is a good idea to study related standards and use them throughout the project. Nowadays, many international standards and specifications have been developed to cover the languages of the world. If these do not fit the project's needs, one may consider participating in standardization activities. Important sources are:
要开始本地化,一个好办法是研究相关的标准,并在项目中自始至终使用这些标准。现在,人们开发了许多国际标准和技术指标以覆盖世界各地的语言。如果这些标准不符合项目的需要,可以考虑参与标准化活动。重要的标准来源包括:
ISO/IEC JTC1 (International Organization for Standardization and International Electrotechnical Commission Joint Technical Committee 1): A joint technical committee for international standards for information technology. There are many subcommittees (SC) for different categories, under which working groups (WG) are formed to work on subcategories of standards. For example, ISO/IEC JTC1/SC2/WG2 is the working group for Universal Coded Character Set (UCS). The standardization process, however, proceeds in a closed manner. If the national standard body is an ISO/IEC member, it can propose the requirements for the project. Otherwise, one may need to approach individual committees. They may ask for participation as a specialist. Information for JTC1/SC2 (coded character sets) is published at anubis.dkuug.dk/JTC1/SC2. Information for JTC1/SC22 (programming languages, their environments and system software interfaces) is at anubis.dkuug.dk/JTC1/SC22.
ISO/IEC JTC1 (国际标准化组织和国际电子技术协会第一联合技术委员会,International Organization for Standardization and International Electrotechnical Commission Joint Technical Committee 1):一个负责制定信息技术国际标准的联合技术委员会。其中包括许多负责不同方面的子委员会(subcommittees, SC),再下一层则是负责标准子类工作的工作组(working groups, WG)。例如,ISO/IEC JTC1/SC2/WG2 是负责统一编码字符集(Universal Coded Character Set, UCS)的工作组。标准化的过程是以封闭的方式进行的。如果国家的标准组织是 ISO/IEC 的成员,它可以提出项目需求的建议。否则,提议制定标准需要接触单独的委员会。委员会可能需要专家参与。关于 JTC1/SC2 (编码字符集)的信息在 anubis.dkuug.dk/JTC1/SC2 发布。关于 JTC1/SC22 (编程语言,环境和系统软件界面)的信息在 anubis.dkuug.dk/JTC1/SC22 发布。
Unicode Consortium: A non-profit organization working on a universal character set. It is closely related to ISO/IEC JTC1 subcommittees. Its Web site is at www.unicode.org, where channels of contribution are provided.
Unicode 联合会:一个开发通用字符集的非营利组织。它与 ISO/IEC JTC1 子委员会的关系密切。其网站是 www.unicode.org,上面提供了参与工作的渠道。
Free Standards Group: A non-profit organization dedicated to accelerating the use of FOSS by developing and promoting standards. Its Web site is at www.freestandards.org. It is open to participation. There are a number of work groups under its umbrella, including OpenI18N for internationalization (www.openi18n.org).
自由标准小组:一个专门通过开发和提倡标准来促进自由/开源软件使用的非营利组织。它的网站是 www.freestandards.org。任何人都可以参与。在其下有许多工作组,包括做国际化工作的 OpenI18N (www.openi18n.org)。
Note, however, that some issues such as national keyboard maps and input/output methods are not covered by the standards mentioned above. The national standards body should define these standards, or unify existing solutions used by different vendors, so that users can benefit from the consistency.
但是注意,像国家键盘布局和输入/输出方法这样的问题并不包含在上述的标准中。国家的标准组织应当定义这些标准,或者把不同提供商使用的解决方案统一起来,这样用户才能从一致的标准中受益。
Unicode
=Unicode=
Characters are the most fundamental units for representing text data of any particular language. In mathematical terms, the character set defines the set of all characters used in a language. In ICT terms, the character set must be encoded as bytes in storage according to a set of conventions called an encoding. These conventions must be agreed upon by both the sender and the receiver of data for the information to remain intact and exact.
字符是表示任何语言文本资料的最基本单位。字符集以数字形式定义了一种语言中使用的所有字符。在国际通信技术术语中,字符集必须按照一定的规定被编码成字符以便存储,这个过程称为编码。这些规定必须由数据的发送方和接收方达成一致,以保证信息完整准确。
In the 1970s, the character set used by most programs consisted of letters of the English alphabet, decimal digits and some punctuation marks. The most widely used encoding was the 7-bit ASCII (American Standard Code for Information Interchange), in which up to 128 characters can be represented, which is just sufficient for English. However, when the need to use non-English languages in computers arose, other encodings were defined. The concept of codepages was devised as an enhancement to ASCII: a second set of 128 characters was added as the upper half of the table, making an 8-bit code table in total. Several codepages were defined by vendors for special decorative characters and for accented Latin letters. Some non-European languages, such as Hebrew and Thai, were added by this strategy. National standards were also defined for character encoding.
在20世纪70年代,大多数程序使用的字符包括了英文字母表,10进制数字和一些标点符号。最广泛使用的编码是7位的 ASCII(美国信息交换标准代码,American Standard Code for Information Interchange),最多可以表示128个字符,仅仅足够英语使用。不过,随着使用非英语语言的需要产生,其他的编码也被定义出来。编码页(codepage)的概念作为通过在第二个半区增加字符增强 ASCII 的方法而被发明出来,使得码表达到8比特。供应商定义了几种编码页用于表示修饰和拉丁文音调的特殊字符。一些非欧洲语言通过这种策略被加入码表中,例如希伯来语和泰语。一些字符编码的国家标准也被开发出来。
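The codepage idea can be illustrated with a small sketch. It assumes Python and three of the 8-bit code tables shipped with its standard codecs (`latin_1`, `iso8859_8` for Hebrew, `tis_620` for Thai); the helper name `decode_with` is chosen here for illustration only. The same byte value in the upper half of the table stands for a different character in each codepage, while the ASCII half stays the same.

```python
def decode_with(codec: str, value: int) -> str:
    """Interpret one byte value under the given 8-bit code table."""
    return bytes([value]).decode(codec)


for codec in ("latin_1", "iso8859_8", "tis_620"):
    char = decode_with(codec, 0xE0)
    # The upper half (0x80-0xFF) differs from one codepage to the next.
    print(f"{codec:10} 0xE0 -> U+{ord(char):04X}")
    # The lower, 7-bit ASCII half is identical in every table.
    assert decode_with(codec, 0x41) == "A"
```

This is why sender and receiver must agree on the encoding: without that agreement, the byte 0xE0 alone does not say whether a Latin, Hebrew or Thai character was meant.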
The traditional encoding systems were not suitable for Asian languages that have large character sets and particular complexities. For example, the encoding of the Han characters used by Chinese, Japanese and Korean (CJK), the total number of which is still not determined, is much more complicated. A large number of codepages would have to be defined to cover all of them. Moreover, compatibility with other single-byte encodings is another significant challenge. The result was a number of multi-byte encodings for CJK.
传统的编码系统不适合有大量字符和特殊形式的亚洲语言。例如,汉语,日语和韩语(CJK)使用的尚未确定总数的汉字编码就复杂得多。要提供所有的汉字编码,必须定义大量的编码页。此外,与其他单字节编码系统的兼容性也是一个巨大的挑战。因此 CJK 使用了多字节的编码。
However, having a lot of encoding standards to support is a problem for software developers. A group of vendors thus agreed to work together to define a single character set that covers the characters of all languages of the world, so that developers have a single point of reference, and users have a single encoding. The Unicode Consortium was thus founded. Major languages in the world were added to the code table. Later on, ISO and IEC formed JTC1/SC2/WG2 to standardize the code table, which is published as ISO/IEC 10646. The Unicode Consortium is also a member of the working group, along with standard bodies of ISO member countries. Unicode and ISO/IEC 10646 are kept synchronized, so the code tables are the same. But Unicode also provides additional implementation guidelines, such as character properties, rendering, editing, string collation, etc.
但是,需要支持多种编码标准对软件开发者来说是个问题。一组厂商为此商定写作定义一个统一的字符集,涵盖世界上所有语言的字符,这样开发者就有一个单一的参照标准,用户也只需要使用一种编码。为此 Unicode 联合会就成立了。世界上的主要语言都被加入到码表中。不久后,ISO 和 IEC 组建了 JTC1/SC2/WG2 来标准化码表,并作为 ISO/IEC 10646 发布。Unicode 联合会也和其他 ISO 成员国的标准组织一样是工作组的成员之一。Unicode 和 ISO/IEC 10646 是同步的,因此码表一致。但 Unicode 也提供了附加的实现导则,例如字符属性、渲染、编辑、字符排序,等等。
Nowadays, many applications have moved to Unicode and have benefited from the clear definitions for supporting new languages. Users of Unicode are able to exchange information in their own languages, especially through the Internet, without compatibility issues.
现在,许多应用程序已经使用了 Unicode 并且从新语言定义的清晰支持中受益。Unicode 的用户可以用自己的语言交换信息,尤其是通过因特网,而没有兼容性的问题。
Fonts
=字体=
Once the character set and encoding of a script are defined, the first step to enabling it on a system is to display it. Rendering text on the screen requires some resource to describe the shapes of the characters, i.e., the fonts, and some process to render the character images as per script conventions. The process is called the output method. This section will try to cover important aspects of these requirements.
一种文字的字符集和编码得到定义后,在一种系统上使用它的第一步就是显示。在屏幕上渲染文本需要一些资源来描述字符的形状,即字体,还需要一些按照每种文字的规定渲染字符图像的程序。这种程序被称为输出方法。本节将讨论有关这些需求的一些重要事项。
Characters and Glyphs
==字符和符号==
A font is a set of glyphs for a character set. A glyph is an appearance form of a character or a sequence of characters. It is quite important to distinguish the concepts of characters and glyphs. For some scripts, a character can have more than one variation, depending on the context. In that case, the font may contain more than one glyph for each of those characters, so that the text renderer can dynamically pick the appropriate one. On the other hand, the concept of ligatures, such as "ff" in English text, also allows some sequence of characters to be drawn together. This introduces another kind of mapping of multiple characters to a single glyph.
字体是对应一个字符集的一系列符号。符号是一个字符或一串字符的表现形式。区分字符和符号的概念非常重要。对于一些文字,一个字符可能根据上下文有多种形式。这种情况下,字体对于每个这样的字符需要包含多个符号,这样文字渲染程序可以动态地选取适合的符号。另一方面,像英文中的“ff“这样的连写的概念也允许一些特定顺序的字符在一起描画。这引入了一种将多个字符映射到单个符号的做法。
Bitmap and Vector Fonts
==点阵和矢量字体==
In principle, there are two methods of describing glyphs in fonts: bitmaps and vectors. Bitmap fonts describe glyph shapes by plotting the pixels directly onto a two-dimensional grid of determined size, while vector fonts describe the outlines of the glyphs with line and curve drawing instructions. In other words, bitmap fonts are designed for a particular size, while vector fonts are designed for all sizes. The quality of the glyphs rendered from bitmap fonts always drops when they are scaled up, while that from vector fonts does not. However, vector fonts often render poorly at small sizes on low-resolution devices, such as computer screens, due to the limited pixels available to fit the curves. In this case, bitmap fonts may be more precise.
原则上,在字体中有两种描述符号的方法:点阵和向量。点阵字体通过直接在确定大小的二维网格上描画像素来描述符号形状,而向量字体用勾画直线和曲线的指令描述符号的轮廓。换句话说,点阵字体是为特定的大小而设计,而向量字体是为所有的规格设计。点阵字体符号在放大时质量一定会下降,而向量符号则不会。但是,在计算机屏幕这样的低分辨率设备上,由于用于拟合曲线的像素数量有限,小尺寸的向量符号渲染很差。这种情况下,点阵字体可能更为准确。
Nevertheless, the quality problem at low resolution has been addressed by font technology. For example:
但是,字体技术试图解决低分辨率下的质量问题,例如:
Hinting, additional guideline information stored in the fonts for rasterizers to fit the curves in a way that preserves the proper glyph shape.
Anti-aliasing, the capability of the rasterizer to simulate partially covered pixels in a way that tricks human perception, such as using grayscales and coloured subpixels, resulting in the feeling of "smooth curves."
These can improve the quality of vector fonts at small sizes. Moreover, the need for bitmap fonts in modern desktops is gradually diminishing.
这些技术可以提高小尺寸矢量字体的质量。而且,现代桌面系统对点阵字体的需求正在逐渐减少。
Font Formats
==字体格式==
Currently, the X Window system on GNU/Linux desktops supports many font formats.
目前,GNU/Linux 桌面使用的 X Window 系统支持多种字体格式。
BDF Fonts
BDF 字体
BDF (Bitmap Distribution Format) is a bitmap font format of the X Consortium for exchanging fonts in a form that is both human-readable and machine-readable. Its content is actually in plain text.
BDF(点阵发布格式,Bit-map Distribution Format)是 X 联合会的一种点阵字体格式,用于以人和机器都能读取的形式交换字体。其内容实际上是纯文本。
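Because BDF is plain text, a glyph can be inspected with a few lines of code. The sketch below is only an illustration: the glyph record in `SAMPLE_GLYPH` is a hand-made example, not taken from any real font, and the function name `bitmap_rows` is hypothetical. It assumes the usual BDF layout in which each line after `BITMAP` is one pixel row written in hexadecimal, most significant bit first, and `BBX` gives the glyph width and height.

```python
SAMPLE_GLYPH = """\
STARTCHAR A
ENCODING 65
SWIDTH 500 0
DWIDTH 8 0
BBX 8 7 0 0
BITMAP
18
24
42
7E
42
42
42
ENDCHAR
"""


def bitmap_rows(glyph_text: str) -> list[str]:
    """Return the BITMAP rows of one BDF glyph record as '#'/'.' strings."""
    lines = glyph_text.splitlines()
    # BBX <width> <height> <x offset> <y offset>
    width = next(int(line.split()[1]) for line in lines if line.startswith("BBX"))
    start = lines.index("BITMAP") + 1
    end = lines.index("ENDCHAR")
    rows = []
    for hex_row in lines[start:end]:
        nbits = len(hex_row) * 4  # each hex digit carries four pixels
        bits = format(int(hex_row, 16), f"0{nbits}b")[:width]
        rows.append(bits.replace("1", "#").replace("0", "."))
    return rows


for row in bitmap_rows(SAMPLE_GLYPH):
    print(row)
```

Printing the rows draws the letter as a grid of pixels, which is exactly the "particular size" limitation of bitmap fonts described earlier: the glyph exists only at this 8 by 7 resolution.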
PCF Fonts
PCF 字体
PCF (Portable Compiled Format) is just the compiled form of the BDF format. It is binary and thus only machine-readable. The utility that compiles BDF into PCF is bdftopcf. Although BDF fonts can be directly installed into the X Window system, they are usually compiled for better performance.
PCF(可移植编译格式,Portable Compiled Format)是 BDF 格式编译后的形式。它是二进制的,因此只有机器可读。将 BDF 格式编译成 PCF 格式的程序是 bdftopcf。虽然 BDF 字体可以直接安装在 X Window 系统中,为了提高性能一般还是要编译它们。
Type 1 Fonts
Type 1 字体
Type 1 is a vector font standard devised by Adobe and supported by its PostScript standard. It is therefore well supported under most UNIX and GNU/Linux systems, through the X Window system and Ghostscript, and it is the recommended format for traditional UNIX printing.
Type 1 是一种由 Adobe 发明的向量字体标准,并在其 Postscript 标准中提供支持。因此它在绝大多数 Unix 和 GNU/Linux 系统下都被通过 X Window 系统和 Ghostscript 很好地支持。因此,它是传统 Unix 打印输出的推荐格式。
TrueType Fonts
TrueType 字体
TrueType is a vector font standard developed by Apple, and is also used in Microsoft Windows. Its popularity has grown along with the growth of Windows. XFree86 also supports TrueType fonts with the help of the FreeType library. Ghostscript also supports TrueType. Thus, it has become another potential choice for fonts on GNU/Linux desktops.
TrueType 是由 Apple 开发的一种向量字体标准,也在微软 Windows 中使用。它随着 Windows 的成长而得到广泛使用。Ghostscript 也支持 TrueType。因此,它成为 GNU/Linux 桌面的又一种可能选择。
OpenType Fonts
OpenType 字体
Recently, Adobe and Microsoft have agreed to create a new font standard that covers both Type 1 and TrueType technologies with some enhancements to cover the requirements of different scripts in the world. The result is OpenType.
最近,Adobe 和微软同意开发一种包括 Type 1 和 TrueType 技术并按照世界上不同字体的需要增强的新字体标准。其产物就是 OpenType。
An OpenType font can describe glyph outlines with either Type 1 or TrueType splines. In addition, information for relative glyph positioning (namely, GPOS table) has been added for combining marks to base characters or to other marks, as well as some glyph substitution rules (namely, GSUB table), so that it is flexible enough to draw characters of various languages.
OpenType 字体可以用 Type 1 或者 TrueType 样条曲线描述符号轮廓。此外,还增加了符号相对位置的信息(例如 GPOS 表)以便在基础字符上增加标记或者组合不同的标记,以及一些符号替换规则(例如 GSUB 表),因此它具有足够的灵活性,可以描绘多种语言的字符。
Output Methods
=输出方法=
An output method is a procedure for drawing text on output devices. It converts text strings into sequences of properly positioned glyphs of the given fonts. For simple cases like English, the character-to-glyph mapping may be straightforward. But for other scripts the output methods are more complicated. Some scripts use combining marks, some are written in directions other than left-to-right, some have glyph variations of a single character, some require character reordering, and so on.
输出方法是在输出设备上描绘文本的过程。它将文本串转换成一系列正确放置的给定字体中的符号。对于像英文这样的简单情况,字符到符号的映射可能非常直观。但对于其他文字来说输出方法要更复杂。有些文字可能有复合的标记,有些可能不是按从左到右的顺序书写,有些对于一个字符有不同符号的变化形式,有些需要字符的重新排序,等等。
With traditional font technologies, the information for handling complex scripts is not stored in the fonts. So the output methods bear the burden. But with OpenType fonts, where all of the rules are stored, the output methods just need the capability to read and apply the rules.
在传统的字体技术中,处理复杂文字的信息并没有贮存在字体中。这个工作就由输出方法来承担。但对于贮存了所有规则信息的 OpenType 字体,输出方法只需要读取和应用规则的能力。
Output methods are implemented differently in each environment. In X Window, the facility is called the X Output Method (XOM). GTK+ uses a separate module called Pango. Qt implements the output method in a set of classes. Modern rendering engines are now capable of using OpenType fonts. So, there are two ways of drawing texts in output method implementations. If you are using TrueType or Type 1 fonts and your script is more complicated than Latin-based scripts, you need to provide an output method that knows how to process and typeset characters of your script. Otherwise, you may use OpenType fonts with OpenType tables that describe rules for glyph substitution and positioning.
输出方法在不同的实现中被定义。对于 X Window,它被称为 X 输出方法(X Output Method, XOM)。对于 GTK+,它使用一个称为 Pango 的单独模块。对于 Qt,则通过某些类来实现输出方法。现代渲染引擎能够使用 OpenType 字体。因此,在输出方法实现中有两种描绘文本的方法。如果使用 TrueType 或者 Type 1 字体,而且你的文字相对基于拉丁字母的语言有一些变化,你需要提供知道怎样处理和排版你的文字中字符的输出方法。否则,你需要使用带有描述符号替换和定位规则的表格的 OpenType 字体。
Input Methods
=输入方法=
There are many factors in the design and implementation of input methods. The greater the mismatch between the size of the character set and the capability of the input device, the more complicated the input method becomes. For example, inputting English characters with a 104-key keyboard is straightforward (mostly one-to-one; that is, one key stroke produces one character), while inputting English with a mobile phone keypad requires some more steps. For languages with huge character sets, such as CJK, character input is very complicated, even with PC keyboards.
输入方法的设计与实现中有许多考虑因素。字符集的大小和输入设备的能力之间差别越大,输入法就需要变得越复杂。例如,用104键键盘输入英文字符是非常简单的(基本上是一一对应——即,一次击键产生一个字符),而用手机键盘输入英文就需要更多的步骤。对于有大量字符的语言,例如中日韩文字,字符输入即使使用 PC 键盘也是非常复杂的。
Therefore, analysis and design are important stages of input method creation. The first step is to list all the characters (not glyphs) needed for input, including digits and punctuation marks. The next step is to decide whether they can be matched one-to-one with the available keys, or whether some composing (like European accents) or conversion (like CJK Romaji input) mechanism is needed, in which multiple key strokes are required to input some characters.
因此,在输入法的开发中分析和设计是重要的步骤。第一步是列出所有需要输入的字符(不是符号),包括数字和标点符号。下一步是决定现有的键位能否与这些字符一一对应,还是需要一些合成(例如欧洲变音)或转换(例如中日韩注音输入)机制,用多次击键输入一些字符。
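The composing mechanism mentioned above can be sketched as follows. This is a toy model under stated assumptions, not the way any particular input method framework works: the `DEAD_KEYS` table and the `compose` function are hypothetical, and the sketch leans on Unicode NFC normalization (from Python's `unicodedata` module) to merge a base letter with a combining accent.

```python
import unicodedata

# Hypothetical dead keys: each maps to the combining mark it contributes.
DEAD_KEYS = {"'": "\u0301", "`": "\u0300", "^": "\u0302"}


def compose(keystrokes: str) -> str:
    """Turn a key sequence into text, merging dead key + letter into one character."""
    out = []
    pending = None  # combining mark waiting for its base letter
    for key in keystrokes:
        if pending is None and key in DEAD_KEYS:
            pending = DEAD_KEYS[key]  # remember the accent, emit nothing yet
        elif pending is not None:
            # Two key strokes produce one character.
            out.append(unicodedata.normalize("NFC", key + pending))
            pending = None
        else:
            out.append(key)  # plain one-to-one key
    return "".join(out)


print(compose("caf'e"))
print(compose("h^otel"))
```

A conversion-style input method (as used for CJK) follows the same outline but replaces the small table with a dictionary lookup and usually a candidate selection step.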
When the input scheme is decided for the script, the keyboard layout may be designed. A good keyboard layout should help users by putting the most frequently used characters in the home row, and the rest in the upper and lower rows. If the script has no concept of upper/lower cases (which is the case for most non-Latin scripts), rare characters may be put in the shift positions.
决定文字使用的输入方式后,就可以设计键盘布局。好的键盘布局应当通过把最常使用的字符放在基准键位上,其他放在上方或下方键位上来方便用户。如果文字没有大小写的概念(对于非拉丁文字基本都如此),少见的字符可以放在上档键位上。
Then, there are two major steps to implement the input method. First, a map of the keyboard layout is created. This is usually an easy step, as there are existing keyboard maps to refer to. Then, if necessary, the second step is to write the input method based on the keyboard map. In general, this means writing an input method module to plug into the system framework.
之后,输入方法的实现有两个主要步骤。第一,建立键盘布局图。这一步通常比较简单,因为可以借鉴已有的键盘图。然后,如果有必要,第二步是基于键盘图编写输入方法。通常,这意味着编写一个输入方法模块插入到系统框架中。
Locales
=区域设置=
Locale is a term introduced by the concept of internationalization (I18N), in which generic frameworks are made so that the software can adjust its behaviour to the requirements of different native languages, cultural conventions and coded character sets, without modification or re-compilation.
区域设置是国际化(internationalization, I18N)概念引入的一个术语,国际化提供了一个框架,让软件能够根据当地语言、文化习惯和编码字符集的不同需要调整自身的行为,而不需要修改代码或重新编译。
Within such frameworks, locales are defined for describing particular cultures. Users can configure their systems to pick up their locales. The programs will load the corresponding predefined locale definition to accomplish internationalized functions. Therefore, to make internationalized software support a new language or culture, one must create a locale definition and fill in the required information, and things will work without having to touch the software code.
在这样一个框架中,定义了区域设置用来描述特定的文化习惯。用户可以通过配置系统来选择区域设置。程序可以载入相应的预先定义的区域设置来实现国际化功能。因此,要让国际化的软件支持一种新的语言或文化,必须建立一个区域定义并填入所需要的信息,这样软件就能工作而不需要改动代码。
According to POSIX (1), the behaviour of a number of C library functions, such as those handling date and time formats, string collation, and numerical and monetary formats, is locale-dependent. ISO/IEC 14652 has added more features to POSIX locale specifications and defined new categories for paper size, measurement unit, address and telephone formats, and personal names. The GNU C library has implemented all of these categories. Thus, cultural conventions may be described through it.
根据 POSIX 标准(1),像日期和时间格式、字符串排序、数字和货币格式这样的一系列 C 程序库函数,都是依赖区域设置的。ISO/IEC 14652 为 POSIX 区域设置增加了更多的功能,并为纸张规格,计量单位,地址和电话格式,以及人名定义了新的类别。GNU C 库实现了所有这些分类。这样,就可以通过它来描述文化的不同。
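As a rough illustration of locale-dependent behaviour, the sketch below uses Python's `locale` module, which wraps the C library's locale functions. Only the `C` locale is guaranteed to exist; the other two names (`th_TH.UTF-8`, `de_DE.UTF-8`) are examples that may or may not be installed on a given system, so the code treats them as optional. The helper name `describe` is hypothetical.

```python
import locale
import time


def describe(name: str) -> dict:
    """Load a locale by name and report a few of its conventions."""
    locale.setlocale(locale.LC_ALL, name)
    conv = locale.localeconv()
    return {
        "decimal_point": conv["decimal_point"],
        "number": locale.format_string("%.2f", 1234567.891, grouping=True),
        "date": time.strftime("%x", time.gmtime(0)),
    }


print("C", describe("C"))  # the default POSIX locale

for name in ("th_TH.UTF-8", "de_DE.UTF-8"):  # assumed names, possibly absent
    try:
        print(name, describe(name))
    except locale.Error:
        print(name, "is not installed on this system")
    finally:
        locale.setlocale(locale.LC_ALL, "C")  # restore the default
```

The program text is the same for every locale; only the locale definition that gets loaded changes the output, which is the point of the framework.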
Locale definitions are discussed in detail in Annex B.
区域设置的定义在附录 B 进行了详细的讨论。
Translation
=翻译=
Translating messages in programs, including menus, dialog boxes, button labels, error messages, and so on, ensures that local users who are not familiar with English can use the software. This task can be accomplished only after the input methods, output methods and fonts are done; otherwise the translated messages will be useless.
翻译程序中的信息,包括菜单、对话框、按钮标签、错误信息等等,确保不熟悉英语的本地用户可以使用软件。这项任务只有在输入方法、输出方法和字体都完成后才有可能成功——否则翻译出来的信息是无用的。
There are many message translation frameworks available, but the general concepts are the same. Messages are extracted into a working file to be translated and compiled into a hash table. When the program executes, it loads the appropriate translation data as per locale. Then, messages are quickly looked up for the translation to be used in the user interface.
有许多的信息翻译框架可供使用,但一般概念都是相同的。信息被提取到工作文件中进行翻译,并被编译成一张哈希表。执行程序时,程序按照区域设置载入适当的翻译资料。然后翻译的信息被快速地查找出来并应用于用户界面。
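The general concept described above can be sketched in a few lines. This is a simplified model, not the API of any real translation framework: the `CATALOGS` table stands in for compiled message files, and `make_translator` is a hypothetical name. Real frameworks such as GNU gettext work on the same principle but read the catalogue from a compiled file chosen according to the locale.

```python
# Hypothetical compiled catalogues: one hash table (dict) per locale.
CATALOGS = {
    "zh_CN": {"File": "文件", "Open": "打开", "Quit": "退出"},
}


def make_translator(locale_name: str):
    """Load the catalogue for a locale once and return a fast lookup function."""
    table = CATALOGS.get(locale_name, {})

    def translate(message: str) -> str:
        # Fall back to the original message when no translation exists.
        return table.get(message, message)

    return translate


_ = make_translator("zh_CN")
print(_("Open"))   # translated
print(_("Print"))  # not in the catalogue, shown untranslated
```

The fallback in `translate` shows why partially translated programs still run: untranslated messages simply appear in the original language.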
Translation is a labour-intensive task. It takes time to translate a huge number of messages, which is why it is always done by a group of people. When forming a team, make sure that all members use consistent terminology in all parts of the programs. Therefore it is vital to work together in a forum through close discussion and to build the glossary database from the decisions made collectively. Sometimes the translator needs to run the program to see the context surrounding the message, in order to find a proper translation. At other times the translator needs to investigate the source code to locate conditional messages, such as error messages. Translating each message individually in a literal manner, without running the program, can often result in incomprehensible outputs.
翻译是一项费时费力的工作。翻译大量的信息需要时间,因此这项工作总是由一组人来进行。组建队伍时,要确保所有成员在软件的各个部分都使用一致的术语。因此在一个场所中通过大量的讨论来协同工作和通过集体决定建立词汇数据库至关重要。有时翻译者需要运行程序来查看信息的上下文,以便找到正确的翻译。有些时候翻译者需要查看源代码来定位需要条件的信息,例如出错信息。按照字面意思一条一条翻译信息,而不运行程序,常常会导致不合情理的结果。
Like other FOSS development activities, translation is a long-term commitment. New messages are usually introduced in every new version. Even though all messages have been completed in the current version, it is necessary to check for new messages before the next release. There is usually a string freeze period before a version is released, when no new strings are allowed in the code base, and an appropriate time period is allocated for the translators. Technical aspects of the message translation process are discussed at the end of Annex B.
像其他自由/开源软件开发活动一样,翻译是一项长期的任务。通常每个新版本都会加入新的信息。即使现有版本中所有信息都得到了翻译,在下一个版本发布时也有必要检查新信息。在新版本发布之前通常都有一个字符串冻结时期,这期间新字符串不允许加入代码库,并分配给翻译人员足够的时间。信息翻译过程的技术在附录 B 最后进行了讨论。
GNU/Linux Desktop Structure
=GNU/Linux 桌面构架=
Before planning to enable a language on the GNU/Linux desktop, a clear understanding of its overall structure is required. The GNU/Linux desktop is composed of layers of subsystems working on top of one another. Every layer has its own locale-dependent operations. Therefore, to enable a language completely, it is necessary to work in all layers. The layers, from the bottom up, are as follows (see Figure 1):
在准备为 GNU/Linux 桌面提供一种语言支持之前,需要对它的大致构架有一个清晰的了解。GNU/Linux 桌面是由在不同层次上协调工作的许多子系统构成的。每一层都有依赖于区域设置的操作。因此,要完全支持一种语言,需要在所有层次进行工作,这些层次从上到下包括(见图1):
1. The C Library. C is the programming language of the lowest level for developing GNU/Linux applications. Other languages rely on the C library to make calls to the operating system kernel.
2. The X Window. In most UNIX systems, the graphical environment is provided by the X Window system. It is a client-server system, where X clients make requests to X server and receive events from it across the network connection (or through a local inter-process communication channel) based on X protocol. A library called X Library (Xlib) encapsulates this protocol by a set of application programming interfaces (API), so that X clients can do everything in terms of function calls. Due to its liberal license terms, which allow even commercial redistributions, there have been several versions of X Window in the UNIX market. For GNU/Linux, XFree86 is the major code base, although new releases of some major distributions are now migrating to the newly forked X.org released in April 2004. All forks differ mainly in X server implementation and some extensions. But the X protocol and Xlib function calls are still standardized.
3. Toolkits. Writing programs using the low-level Xlib can be tedious, and it leads to inconsistent GUIs when every application draws menus and buttons in its own way. Some libraries are developed as a middle layer to help reduce both problems. In X terminology, these libraries are called toolkits. And the GUI components they provide, such as buttons, text entries, etc., are called widgets. Many toolkits have been developed in the past, either by the X Consortium itself like the X Toolkit and Athena widget set (Xaw), or by vendors like XView from Sun, Motif from Open Group, etc. In the FOSS realm, the toolkits most widely adopted are GTK+ (The GIMP Toolkit)(2) and Qt(3).
4. Desktop Environments. Toolkits help developers create a consistent look-and-feel among a set of programs. But to make a complete desktop, applications need to interoperate more closely to form a convenient workplace. The concept of desktop environment has been invented to provide common conventions, resource sharing and communication among applications. The first desktop environment ever created on UNIX platforms was CDE (Common Desktop Environment) by Open Group, based on its Motif toolkit. But it is proprietary. The first FOSS desktop environment for GNU/Linux was KDE (K Desktop Environment) (4), based on TrollTech's Qt toolkit. However, some developers objected to the licensing conditions of Qt at that time. A second one was thus created, called GNOME (GNU Network Object Model Environment)(5), based on GTK+. Nowadays, although the licensing issue of Qt has been resolved, GNOME continues to grow and get more support from vendors and the community. KDE and GNOME have thus become the desktops most widely used on GNU/Linux and other FOSS operating systems such as FreeBSD.
Each component is internationalized, allowing local implementation for different locales:
每一部分都是国际化的,使得不同区域设置下的本地化实现成为可能:
1. GNU C Library: Internationalized according to POSIX and ISO/IEC 14652.
2. XFree86 (and X Window in general): Internationalization in this layer includes X locale (XLC) describing font set and character code conversion; X Input Method (XIM) for text input process, in which X Keyboard Extension (XKB) is used in describing keyboard map; and X Output Method (XOM) for text rendering. XOM, however, was implemented too late, when both GTK+ and Qt already handled rendering with their own solutions. Therefore, it is questionable whether XOM is still needed.
3. GTK+: For GTK+ 2, internationalization frameworks have been defined in a modular way. It has its own input method framework called GTK+ IM, where input method modules can be dynamically plugged in as per user command. Text rendering in GTK+ 2 is handled by a separate general-purpose text layout engine called Pango. Pango can be used for any application that needs to render multilingual texts and not just for GTK.
4. Qt: Internationalization in Qt 3 is done in a minimal way. It relies solely on XIM for all text inputs, and handles text rendering with QComplexText C++ class, which relies completely on Unicode data for character properties from Unicode.org.
For the desktop environment layer, namely, GNOME and KDE, there is no additional internationalization apart from what is provided by GTK+ and Qt.
在桌面环境这一层,如 GNOME 和 KDE,除了 GTK+ 和 Qt 提供的国际化框架外没有其他的国际化技术。
Figure 1: GNU/Linux Desktop Structure and Internationalization
图1:GNU/Linux 桌面构成和国际化
In this annex, more technical details will be discussed. The aim is to give implementers necessary information to start localization. However, this is not intended to be a hands-on cookbook.
在本附录中,将讨论更多的技术细节。目标是给工作者开始本地化工作所需要的信息。但是,这并不是一本手把手具体教授技术的手册。
Unicode
=Unicode=
As a universal character set that includes all characters of the world, Unicode originally assigned code points to its characters as 16-bit integers, which means that up to 65,536 characters could be encoded. However, due to the huge set of CJK characters, this became insufficient, and Unicode 3.0 extended the index to 21 bits, with a code space of 1,114,112 code points (17 planes of 65,536 each).
Unicode 是一个包括了世界上所有字符的字符集,用16位整数来编码字符指针,也就是可以编码最多65,536个字符。但是,由于 CJK 字符集的庞大规模,连这个容量也不够使用,因此 Unicode 3.0 把索引字长扩展到21位,支持多达1,114,112个字符。
Planes
=平面=
A Unicode code point is a numeric value between 0 and 10FFFF (hexadecimal), and the code space is divided into planes of 64K characters. In Unicode 4.0, the allocated planes are Planes 0, 1, 2 and 14.
Unicode 编码指针是一个在0和10FFFF之间的数值,分成64K个字符组成的平面。在 Unicode 4.0 里,分配的平面是平面0,1,2和14。
Plane 0, ranging from 0000 to FFFF, is called Basic Multilingual Plane (BMP), which is the set of characters assigned by the previous 16-bit scheme.
平面0,从0000到FFFF,叫做基本多语言平面(Basic Multilingual Plane, BMP),由过去的16位编码系统下的字符集组成。
Plane 1, ranging from 10000 to 1FFFF and called Supplementary Multilingual Plane (SMP), is dedicated to lesser used historic scripts, special-purpose invented scripts and special notations. These include Gothic, Shavian and musical symbols. Many more historic scripts may be encoded in this plane in the future.
平面1,从10000到1FFFF,叫做辅助多语言平面(Supplementary Multilingual Plane, SMP),用于较少使用的古文字,特殊用途的文字和特殊符号。这些文字包括哥特文字,Shavian 文字和乐谱符号。今后可能会有更多的古文字被编码到这个平面中。
Plane 2, ranging from 20000 to 2FFFF and called Supplementary Ideographic Plane (SIP), is the spillover allocation area for those CJK characters that cannot fit into the blocks for common CJK characters in the BMP. Plane 14, ranging from E0000 to EFFFF and called Supplementary Special-purpose Plane (SSP), is for some control characters that do not fit into the small areas allocated in the BMP.
平面2,从20000到2FFFF,称为辅助表意文字平面(Supplementary Ideographic Plane, SIP),用于容纳 BMP 中一般 CJK 字符容纳不下的字符的区域。平面14,从E0000到EFFFF,称为辅助特殊用途平面(Supplementary Special-purpose Plane, SSP),是为 BMP 中有限的小区域无法容纳的控制字符准备的。
There are two more reserved planes, Plane 15 and Plane 16, which are set aside for private use; no characters are assigned in them.
还有两个保留平面,平面15和平面16,用于个别用途,没有分配编码指针。
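The plane layout follows directly from the arithmetic: with 64K code points per plane, the plane number is simply the code point divided by 10000 (hexadecimal). The following sketch makes that explicit; the names `PLANE_NAMES` and `plane_of` are chosen here for illustration, and the table lists only the planes named in this section.

```python
PLANE_NAMES = {
    0: "Basic Multilingual Plane (BMP)",
    1: "Supplementary Multilingual Plane (SMP)",
    2: "Supplementary Ideographic Plane (SIP)",
    14: "Supplementary Special-purpose Plane (SSP)",
    15: "Private use",
    16: "Private use",
}


def plane_of(code_point: int) -> int:
    """Return the plane number (0-16) that a code point belongs to."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    return code_point >> 16  # same as integer division by 0x10000


# 17 planes of 65,536 code points give the whole code space.
assert 17 * 0x10000 == 0x10FFFF + 1 == 1_114_112

for cp in (0x0041, 0x1D11E, 0x20000, 0xE0001, 0x10FFFF):
    plane = plane_of(cp)
    print(f"U+{cp:04X} -> plane {plane}: {PLANE_NAMES.get(plane, 'unallocated')}")
```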
Basic Multilingual Plane
==基本多语言平面==
Basic Multilingual Plane (BMP), or Plane 0, is the plane most commonly used in general documents. Code points are allocated for common characters in contemporary scripts with exactly the same set as ISO/IEC 10646-1, as summarized in Figure 2. Note that the code points between E000 and F900 are reserved for the vendors' private use. No character is assigned in this area.
基本多语言平面(Basic Multilingual Plane, BMP),或平面0,是一般文本中使用最多的平面。现代文字中常用字符的编码指针被按照与 ISO/IEC 10646-1 完全相同的方式分配,如图2所示。注意E000和F900之间的编码指针为软件提供商的特别用途被保留,该区域中没有分配字符。
Figure 2: Unicode Basic Multilingual Plane
图2 Unicode 基本多语言平面
Character Encoding
==字符编码==
There are several ways of encoding Unicode strings for information interchange. One may simply represent each character using a fixed-size integer (called a wide char), which is defined by ISO/IEC 10646 as UCS-2 and UCS-4, where 2-byte and 4-byte integers are used, respectively (6), and where UCS-2 is for BMP only. But the common practice is to encode the characters as sequences of 8-bit, 16-bit or 32-bit code units, in the forms called UTF-8, UTF-16 and UTF-32, respectively (7). There is also UTF-7 for e-mail transmissions that are 7-bit strict, but UTF-8 is safe in most cases.
用于信息交换的 Unicode 字符串有几种编码方式。每个字符可以简单地用固定长度的整数表示(称为宽字符),这种方式在 ISO/IEC 10646 中定义为 UCS-2 和 UCS-4,分别使用2字节和4字节长度的整数(6),而且 UCS-2 只用于基本多语言平面。但一般的做法是用可变长度的整数序列表示,根据使用的是8位,16位还是32位的整数,分别称为 UTF-8,UTF-16,和 UTF-32(7)。还有7位的专用于电子邮件传输的 UTF-7 编码,但多数情况下 UTF-8 都被支持。
UTF-32
===UTF-32===
UTF-32 is the simplest Unicode encoding form. Each Unicode code point is represented directly by a single 32-bit unsigned integer. It is, therefore, a fixed-width character encoding form. This makes UTF-32 an ideal form for APIs that pass single character values. However, it is inefficient in terms of storage for Unicode strings.
UTF-32 是最简单的 Unicode 编码形式。每个 Unicode 编码指针都由一个单个32位无符号整数直接表示,因此它是一种固定宽度的编码形式。这使得 UTF-32 适合用于传递单个字符值的应用程序借口。但是,它不能有效满足 Unicode 字符串的存储需要。
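A short check of the fixed-width property, assuming Python's built-in `utf-32-be` codec (big-endian, no byte order mark); the function name `utf32_units` is illustrative. Each code unit read back from the encoded bytes is exactly the code point of the corresponding character.

```python
import struct


def utf32_units(text: str) -> list[int]:
    """Encode text as big-endian UTF-32 and return its 32-bit code units."""
    data = text.encode("utf-32-be")
    return list(struct.unpack(f">{len(data) // 4}I", data))


sample = "A\u0e01\U0001d11e"  # Latin letter, Thai letter, musical symbol
units = utf32_units(sample)

# One code unit per character, equal to the code point itself.
assert units == [ord(ch) for ch in sample]
# Always four bytes per character, even for plain ASCII.
assert len(sample.encode("utf-32-be")) == 4 * len(sample)
print([f"{u:08X}" for u in units])
```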
UTF-16
===UTF-16===
UTF-16 encodes code points in the range 0000 to FFFF (i.e. BMP) as a single 16-bit unsigned integer. Code points in supplementary planes are instead represented as pairs of 16-bit unsigned integers. These pairs of code units are called surrogate pairs. The values used for the surrogate pairs are in the range D800 to DFFF, which are not assigned to any character. So, UTF-16 readers can easily distinguish between single code units and surrogate pairs. The Unicode Standard(8) provides more details of surrogates.
UTF-16 在0000到FFFF范围(即基本多语言平面)内以单个16位无符号整数编码指针。辅助平面内的编码指针由两个16位无符号整数代表。这些编码单位被称为代用对。代用对的值在D800到DFFF间,没有分配给任何字符。这样,UTF-16 程序容易分辨单个编码单位和代用对。Unicode 标准(8)给出了代用对的详情。
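The surrogate mechanism can be worked through by hand. The sketch below follows the standard UTF-16 rule: subtract 10000 (hexadecimal) from the code point, then put the upper 10 bits of the result into a unit starting at D800 and the lower 10 bits into a unit starting at DC00. The function name `surrogate_pair` is chosen for this example, and the result is checked against Python's own `utf-16-be` codec.

```python
def surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a supplementary code point into its UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000       # a 20-bit value
    high = 0xD800 + (offset >> 10)      # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # lower 10 bits
    return high, low


# U+1D11E is a musical symbol in Plane 1.
high, low = surrogate_pair(0x1D11E)
print(f"U+1D11E -> {high:04X} {low:04X}")

# Cross-check against the built-in codec.
expected = "\U0001d11e".encode("utf-16-be")
assert high.to_bytes(2, "big") + low.to_bytes(2, "big") == expected
```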
UTF-16 is a good choice for keeping general Unicode strings, as it is optimized for characters in BMP, which is used in 99 percent of Unicode texts. It consumes about half of the storage required by UTF-32.
UTF-16 是保存一般 Unicode 字符串的好方法,因为它对在99%的 Unicode 文本中使用的基本多语言平面内的字符进行了优化。它只需要相当于 UTF-32 所需一半的存储空间。
UTF-8
===UTF-8===
To meet the requirements of legacy byte-oriented ASCII-based systems, UTF-8 is defined as a variable-width encoding form that preserves ASCII compatibility. It uses one to four 8-bit code units to represent a Unicode character, depending on the code point value. The code points between 0000 and 007F are encoded in a single byte, making any ASCII string valid UTF-8. Beyond the ASCII range of Unicode, some non-ideographic characters between 0080 and 07FF are encoded with two bytes. Then, Indic scripts and CJK ideographs between 0800 and FFFF are encoded with three bytes. Supplementary characters beyond the BMP require four bytes. The Unicode Standard(9) provides more details of UTF-8.
为满足旧式的基于 ASCII 的,面向字节处理的系统的要求,UTF-8 被定义为一种保留了 ASCII 兼容性的可变宽度编码形式。根据编码指针数值的不同,它使用一个到四个8位的编码单位来表示一个 Unicode 字符。在0000到007F范围内的编码指针用一个字节编码,这样任何 ASCII 字符串在 UTF-8 下都同样有效。在 Unicode 的 ASCII 范围外,一些在0080到07FF之间的非表意字符用两个字节编码。在其后的位于0800和FFFF范围内的印地语和 CJK 表意文字用三个字节编码。基本多语言平面之外的辅助字符需要四个字节。Unicode 标准(9)提供了 UTF-8 的详细介绍。
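The length rule described above can be written down directly. In this Python sketch, `utf8_length` is a hypothetical helper that predicts the number of bytes from the code point range, and the loop checks the prediction against Python's built-in UTF-8 encoder for one sample character per range.

```python
def utf8_length(code_point: int) -> int:
    """Number of bytes UTF-8 needs for a code point, by range."""
    if code_point <= 0x7F:
        return 1        # ASCII
    if code_point <= 0x7FF:
        return 2        # e.g. Latin-1 supplement, Greek, Hebrew, Arabic
    if code_point <= 0xFFFF:
        return 3        # rest of the BMP, e.g. Thai and CJK ideographs
    return 4            # supplementary planes


# One sample per range: A, e-acute, Thai KO KAI, a CJK ideograph, U+10400.
samples = ["A", "\u00e9", "\u0e01", "\u4e2d", "\U00010400"]
for ch in samples:
    encoded = ch.encode("utf-8")
    assert len(encoded) == utf8_length(ord(ch))
    print(f"U+{ord(ch):04X} -> {encoded.hex()}")

# Any pure-ASCII string is already valid UTF-8, byte for byte.
assert "plain ASCII".encode("utf-8") == b"plain ASCII"
```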
UTF-8 is typically the preferred encoding form for the Internet. Its ASCII compatibility helps a lot in migration from old systems. UTF-8 also has the advantage of being byte-serialized and friendly to the APIs of C and other programming languages. For example, traditional string collation using byte-wise comparison works with UTF-8, because byte order matches code point order.
UTF-8 是因特网上典型的理想编码形式。ASCII 兼容性对从旧系统迁移帮助很大。UTF-8 还有字节串行化和对 C 或其他语言编程接口友好的优点。例如,传统的逐字节比较方式的字符排序表在 UTF-8 下也能工作。
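The byte-wise comparison property can be checked with a short Python experiment. The word list is arbitrary sample data. Note that this is plain binary ordering by code point, not the language-aware ordering defined by the Unicode Collation Algorithm.

```python
# Sample strings from several ranges, including a supplementary character
# (U+10400) and a BMP character with a high value (U+FF21).
words = ["z", "\u00e9t\u00e9", "\u0e01\u0e32", "a", "\U00010400", "\uff21"]

by_code_point = sorted(words, key=lambda s: [ord(c) for c in s])
by_utf8_bytes = sorted(words, key=lambda s: s.encode("utf-8"))
by_utf16_units = sorted(words, key=lambda s: s.encode("utf-16-be"))

# Comparing UTF-8 bytes gives the same order as comparing code points.
assert by_utf8_bytes == by_code_point

# UTF-16 does not have this property: surrogates (D800..DFFF) sort below
# BMP characters such as U+FF21 even though they stand for larger values.
assert by_utf16_units != by_code_point

print(by_utf8_bytes)
```

This is why byte-oriented routines such as C's `strcmp` keep producing a stable, well-defined order when legacy code is switched to UTF-8.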
In short, UTF-8 is the most widely adopted encoding form of Unicode.
一句话,UTF-8 是 Unicode 最普及的编码形式。
Character Properties
==字符属性==
In addition to code points, Unicode also provides a database of character properties called the Unicode Character Database (UCD), which consists of a set of files describing the following properties:
除了编码指针外,Unicode 还提供了一个称为 Unicode 字符数据库(Unicode Character Database, UCD)(10)的字符属性数据库,包括一系列文件用来描述以下的属性:
Name.
General category (classification as letters, numbers, symbols, punctuation, etc.).
Other important general characteristics (white space, dash, ideographic, alphabetic, noncharacter, deprecated, etc.).
Character shaping (bidi category, shaping, mirroring, width, etc.).
Case (upper, lower, title, folding; both simple and full).
Numeric values and types (for digits).
Script and block.
Normalization properties (decompositions, decomposition type, canonical combining class, composition exclusions, etc.).
Age (version of the standard in which the code point was first designated).
Boundaries (grapheme cluster, word, line and sentence).
Standardized variants.
The database is useful for Unicode implementation in general. It is available at the Unicode.org Web site. The Unicode Standard(11) provides more details of the database.
这个数据库可用于一般的 Unicode 实现。在 Unicode.org 网站上可以找到它。Unicode 标准(11)提供了这个数据库的详情。
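Many programming environments ship a copy of part of the UCD. As one example, Python's standard `unicodedata` module exposes several of the properties listed above. The `describe` helper below is a made-up convenience wrapper, and the set of properties it collects is only a subset of the database.

```python
import unicodedata


def describe(ch: str) -> dict:
    """Collect a few UCD properties of one character."""
    return {
        "name": unicodedata.name(ch, "<unnamed>"),
        "category": unicodedata.category(ch),          # general category
        "bidi": unicodedata.bidirectional(ch),         # bidi category
        "combining": unicodedata.combining(ch),        # canonical combining class
        "east_asian_width": unicodedata.east_asian_width(ch),
        "mirrored": bool(unicodedata.mirrored(ch)),
        "decomposition": unicodedata.decomposition(ch),
    }


# Latin capital, Thai consonant, Hebrew letter, precomposed letter, bracket.
for ch in ["A", "\u0e01", "\u05d0", "\u00e9", "("]:
    print(f"U+{ord(ch):04X}", describe(ch))

# The UCD version bundled with this Python build.
print("UCD version:", unicodedata.unidata_version)
```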
Technical Reports
==技术报告==
In addition to the code points, encoding forms and character properties, Unicode also provides some technical reports that can serve as implementation guidelines. Some of these reports have been included as annexes to the Unicode standard, and some are published individually as Technical Standards.
除了编码指针,编码形式和字符属性外,Unicode 还提供了一些技术报告,可以作为实现的指导。其中一些报告作为 Unicode 标准的附录提供,另一些则单独作为技术标准发布。
In Unicode 4.0, the standard annexes are:
在 Unicode 4.0 中,标准附录包括:
UAX 9: The Bidirectional Algorithm
Specifications for the positioning of characters flowing from right to left, such as Arabic or Hebrew.
UAX 11: East-Asian Width
Specifications of an informative property of Unicode characters that is useful when interoperating with East-Asian Legacy character sets.
UAX 14: Line Breaking Properties
Specification of line breaking properties for Unicode characters as well as a model algorithm for determining line break opportunities.
UAX 15: Unicode Normalization Forms
Specifications for four normalized forms of Unicode text. With these forms, equivalent text (canonical or compatibility) will have identical binary representations. When implementations keep strings in a normalized form, they can be assured that equivalent strings have a unique binary representation.
UAX 24: Script Names
Assignment of script names to all Unicode code points. This information is useful in mechanisms such as regular expressions, where it produces much better results than simple matches on block names.
UAX 29: Text Boundaries
Guidelines for determining default boundaries between certain significant text elements: grapheme clusters ("user characters"), words and sentences.
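Of the annexes listed above, the normalization forms of UAX 15 are easy to try out. The following Python sketch uses the standard `unicodedata.normalize` function to show that canonically equivalent strings become identical after normalization, and that the compatibility forms (NFKC, NFKD) additionally fold compatibility characters. The sample strings are chosen only for illustration.

```python
import unicodedata

composed = "\u00e9"        # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"     # "e" followed by COMBINING ACUTE ACCENT

# The two strings look the same but differ in their binary representation.
assert composed != decomposed

# After normalization to any one form, equivalent text compares equal.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    a = unicodedata.normalize(form, composed)
    b = unicodedata.normalize(form, decomposed)
    assert a == b
    print(form, [f"U+{ord(c):04X}" for c in a])

# Compatibility equivalence: the "fi" ligature is kept by NFC
# but replaced by the two letters under NFKC.
ligature = "\ufb01"
assert unicodedata.normalize("NFC", ligature) == ligature
assert unicodedata.normalize("NFKC", ligature) == "fi"
```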
The individual technical standards are:
单独的技术标准包括:
UTS 6: A Standard Compression Scheme for Unicode
Specifications of a compression scheme for Unicode and sample implementation.
UTS 10: Unicode Collation Algorithm
Specifications for how to compare two Unicode strings while conforming to the requirements of the Unicode Standard. The UCA also supplies the Default Unicode Collation Element Table (DUCET) as the data specifying the default collation order for all Unicode characters.
UTS 18: Unicode Regular Expression Guidelines
Guidelines on how to adapt regular expression engines to use Unicode.
All Unicode Technical Reports are accessible from the Unicode.org web site (12).
所有 Unicode 技术报告都可以从 Unicode.org 网站(12)上得到。
Fonts
=字体=
Font Development Tools
==字体开发工具==
Some FOSS tools for developing fonts are available. Although they are not as numerous as their proprietary counterparts, they are adequate to get the job done, and are continuously being improved. Some interesting examples are:
有一些用于开发字体的自由/开源软件工具。虽然这类工具不像私有的开发工具那样丰富,但它们足以胜任工作,而且在不断地改进。一些有趣的例子包括:
1. XmBDFEd(13). Developed by Mark Leisher, XmBDFEd is a Motif-based tool for developing BDF fonts. It allows one to edit bit-map glyphs of a font, do some simple transformations on the glyphs, transfer information between different fonts, and so on.
2. FontForge(14) (formerly PfaEdit(15)). Developed by George Williams, FontForge is a tool for developing outline fonts, including PostScript Type1, TrueType, and OpenType. Scanned images of letters can be imported and their outline vectors automatically traced. The splines can be edited, and transformations such as skewing, scaling, rotating and thickening may be applied, among much else. It provides sufficient functionality for editing the properties of Type1 and TrueType fonts. OpenType tables can also be edited in its recent versions. One weak point, however, is hinting: it guarantees the quality of Type1 hints, but not of TrueType hints.
3. TTX/FontTools(16). Just van Rossum's TTX/FontTools is a tool to convert OpenType and TrueType fonts to and from XML. FontTools is a library for manipulating fonts, written in Python. It supports TrueType, OpenType, AFM and, to a certain extent, Type 1 and some Mac-specific formats. It allows one to dump OpenType tables, examine and edit them with an XML or plain-text editor, and merge them back into the font.
Font Configuration
==字体配置==
Several font configuration systems have been available on GNU/Linux desktops. The most fundamental one is the X Window font system itself. More recently, another font configuration system called fontconfig has been developed to serve some specific requirements of modern desktops. These two font configuration systems will be discussed briefly.
在 GNU/Linux 桌面上有几种字体配置系统。最基本的是 X Window 字体系统本身。但是,在近期的开发中,另一种称为 fontconfig 的字体配置被开发出来以满足现代桌面的一些特定需要。以下简单讨论这两种字体系统。
First, however, let us briefly discuss the X Window architecture, to understand font systems. X Window(17) is a client-server system. X servers are the agents that provide service to control hardware devices, such as video cards, monitors, keyboards, mice or tablets, and that pass user input events from the devices to the clients. X clients are GUI application programs that request the X server to draw graphical objects on the screen, and accept user input via the events fed by the X server. Note that with this architecture, the X client and server can be on different machines in the network. In that case, the X server is the machine that the user operates, while the X client can be a process running on the same machine or on a remote machine in the network.
不过首先,我们简要讨论一下 X Window 架构,以便理解字体系统。X Window(17) 是一种客户端-服务器系统。X 服务器是提供显卡、显示器、键盘、鼠标或触摸板等硬件设备控制服务的主体,也负责把用户输入事件从设备传送到客户。X 客户端是请求 X 服务器在屏幕上描绘图形对象,并通过 X 服务器的事件传送接受用户输入的图形界面程序。注意在这种架构中,X 客户端和服务器可以处在网络中不同的机器上。这种情况下,X 服务器是用户操作的机器,而 X 客户端可以是同一台机器上运行的进程,或网络中的远程机器。
In this client-server architecture, fonts are provided on the server side. Thus, installing fonts means configuring X server by installing fonts and registering them to its font path.
在这个客户端-服务器架构中,字体是服务器端提供的。因此,安装字体意味着在 X 服务器上加入字体并注册其字体路径。
However, since the X server is sometimes used to provide thin-client access in some deployments, where it may run on cheap PCs booted from floppy, over the network, or even from ROM, font installation on each X server is not always appropriate. Thus, font service has been delegated to a separate service called the X Font Server (XFS). Another machine in the network can be dedicated to font service so that all X servers can request font information from it. Therefore, with this structure, an X server may be configured to manage fonts by itself, to use fonts from the font server, or both.
但是,由于 X 服务在一些配置中有时被用来提供瘦客户机访问,而这些 X 服务器可能是运行在用软盘或网络方式启动的廉价机器上,甚至是从固化的 ROM 启动,在每台 X 服务器上安装字体不一定合适。因此,字体服务被分离成一个单独的服务,称为 X 字体服务器(X Font Server, XFS)。网络中另一台机器可以专门提供字体服务,这样所有的 X 服务器都可以请求字体信息。这样,在这个构架下,X 服务器可以配置成自我管理字体,或者使用来自字体服务器的字体,或者两者并存。
Nevertheless, recent changes in XFree86 have addressed some requirements to manage fonts on the client side. The Xft extension renders anti-aliased glyph images using font information provided by the X client. With this, the Xft extension also provided font management functionality to X clients in its first version. This functionality was later split from Xft2 into a separate library called fontconfig. fontconfig is a font management system independent of X, which means it can also apply to non-GUI applications such as printing services. Modern desktops, including KDE 3 and GNOME 2, have adopted fontconfig as their font management system, and have benefited from closer integration in providing an easy font installation process. Moreover, client-side fonts also allow applications to do all glyph manipulations, such as making special effects, while enjoying a consistent appearance on the screen and in printed output.
不过,在 XFree86 中最近的改变注意到了一些在客户端管理字体的需求。Xft 扩展通过 X 客户端提供的字体信息实现了抗锯齿的符号图像。这个功能也使 Xft 在其第一版中提供了 X 客户端的字体管理能力。后来这个功能从 Xft2 中分离成一个单独的库,称为 fontconfig。fontconfig 是独立于 X 的一个字体管理系统,因此它也支持像打印服务这样的非图形界面应用。包括 KDE 3 和 GNOME 2 在内的现代桌面都采用了 fontconfig 作为字体管理系统,并且得益于紧密的整合,提供了简单的字体安装过程。而且,客户端的字体也允许应用程序进行特效等各种符号操作,同时在屏幕上和打印输出中都可以得到一致的效果。
The splitting of the X client-server architecture is not standard practice on stand-alone desktops. However, it is important to always keep the split in mind, to enable particular features.
X 客户端-服务器的分离式架构并不是独立桌面的标准形式。但是,要使用某些特别的功能,必须记住这个特点。
Output Methods
=输出方法=
Since the usefulness of XOM is still being questioned, we shall discuss only the output methods already implemented in the two major toolkits: Pango of GTK+ 2 and Qt 3.
由于 XOM 的有用程度还有疑问,我们将只讨论在两个主要的工具包中已经实现的输出方法:GTK+ 2的 Pango 和 Qt 3。
Pango Text Layout Engines
==Pango 文本外观引擎==
Pango ["Pan" means "all" in Greek and "go" means "language" in Japanese](18) is a multilingual text layout engine designed for quality text typesetting. Although it is the text drawing engine of GTK+, it can also be used outside GTK+ for other purposes, such as printing(19). This section will provide localizers with a bird's-eye view of Pango. The Pango reference manual(20) should be consulted for more detail.
Pango(“Pan”在英语里意思是“全部”,而“go”是日语中“语言”的意思)(18) 是一个用于高质量文本排版的多语言文本外观引擎。虽然它是 GTK+ 的文本描绘引擎,它也可以用于 GTK+ 之外的其他用途,例如打印(19)。这一节将为本地化工作者提供 Pango 的概览。如需要更多详情,应阅读 Pango 参考手册(20)。
PangoLayout
===PangoLayout===
At a high level, Pango provides the PangoLayout class that takes care of typesetting text in a column of given width, as well as other information necessary for editing, such as cursor positions. Its features may be summarized as follows:
在较高的层级,Pango 提供了 PangoLayout 类,处理给定宽度内的一列文本的排版,以及光标位置等其他编辑时必要的信息。其功能可以概括如下:
Paragraph Properties
indent
spacing
alignment
justification
word/character wrapping modes
tabs
段落属性
Text Elements
get lines and their extents
get runs and their extents
character search at (x, y) position
character logical attributes (is line break, is cursor position, etc.)
cursor movements
文本元素
Text Contents
plain text
markup text
文本内容
Middle-level Processing
==中级处理==
Pango also provides access to some middle-level text processing functions, although most clients in general do not use them directly. To gain a brief understanding of Pango internals, some highlights are discussed here.
Pango 还提供了一些中级的文本处理功能,虽然大部分客户端都不直接使用这些功能。为了简单了解 Pango 的能力,这里讨论一些重要特性。
There are three major steps for text processing in Pango(21):
Pango 中的文本处理有三个主要步骤(21):
Itemize. Breaks input text into chunks (items) of consistent direction and shaping engine. This usually means chunks of text of the same language with the same font. Corresponding shaping and language engines are also associated with the items.
分项:将文本打散成具有相同方向和形状引擎的文本块(项目)。这通常是指同一种语言和同一种字体的文本块。相应的形状和语言引擎也和项目相关联。
Break. Determines possible line, word and character breaks within the given text item. It calls the language engine of the item (or the default engine based on Unicode data if no language engine exists) to analyze the logical attributes of the characters (is-line-break, is-char-break, etc.).
分解:确定给定的文本项中可能的行、词和字符分割。它调用项目的语言引擎(如语言引擎不存在则调用基于 Unicode 数据的缺省引擎)来分析字符的逻辑属性(是断行,是断字,等等)。
Shape. Converts the text item into glyphs, with proper positioning. It calls the shaping engine of the item (or the default shaping engine that is currently suitable for European languages) to obtain a glyph string that provides the information required to render the glyphs (code point, width, offsets, etc.).
造型:把文本项转化成具有正确位置的符号。它调用项目的造型引擎(或者适用于欧洲语言的缺省造型引擎)生成提供渲染符号所需信息(编码指针,宽度,偏移量等)的符号串。
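To make the itemize step more concrete, here is a deliberately simplified Python sketch that splits text into runs of consistent direction. It is not Pango's implementation: Pango also considers script, font and language and applies the full bidirectional algorithm, whereas this sketch only looks at the bidi category from the Unicode Character Database. The names `direction` and `itemize` are invented for the example.

```python
import unicodedata


def direction(ch: str) -> str:
    """Classify one character as 'ltr', 'rtl' or 'neutral' by bidi category."""
    bidi = unicodedata.bidirectional(ch)
    if bidi in ("R", "AL"):
        return "rtl"
    if bidi == "L":
        return "ltr"
    return "neutral"    # spaces, digits, punctuation, marks, ...


def itemize(text: str) -> list[tuple[str, str]]:
    """Split text into (direction, chunk) items of consistent direction.

    Simplification: neutral characters stay with the preceding item, and
    leading neutrals adopt the direction of the first strong character.
    """
    items: list[list[str]] = []     # each entry is [direction, chunk]
    for ch in text:
        d = direction(ch)
        if items and (d == "neutral" or d == items[-1][0]):
            items[-1][1] += ch
        elif items and items[-1][0] == "neutral":
            items[-1][0] = d
            items[-1][1] += ch
        else:
            items.append([d, ch])
    return [(d, chunk) for d, chunk in items]


# English, then two Hebrew letters, then English again.
for item in itemize("abc \u05d0\u05d1 def"):
    print(item)
```

In a real layout engine each resulting item would then be handed to the break and shape steps together with its associated language and shaping engines.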
Pango Engines
==Pango 引擎==
Pango engines are implemented in loadable modules that provide entry functions for querying and creating the desired engine. During initialization, Pango queries the list of all engines installed in the memory. Then, when it itemizes input text, it also searches the list for the language and shaping engines available for the script of each item and creates them for association to the relevant text item.
Pango 引擎以可加载的模块形式实现,提供查询和建立所需引擎的函数。在初始化时,Pango 查询内存中所有引擎的列表。然后,在对输入文字分项后,它为每个项目中的文字搜索可用的语言和造型引擎并建立与相关的文本项目关联的引擎。
Pango Language Engines
==Pango 语言引擎==
As discussed above, the Pango language engine is called to determine possible break positions in a text item of a certain language. It provides a method to analyze the logical attributes of every character in the text as listed in Table 3.
如上所述,调用 Pango 语言引擎是为了确定某种语言中文本项的可能的分解位置。它提供了分析文本中每个字符逻辑属性的方法,如表3所示:
Table 3. Pango Logical Attributes

| Flag | Description |
|---|---|
| is_line_break | can break line in front of the character |
| is_mandatory_break | must break line in front of the character |
| is_char_break | can break here when doing character wrap |
| is_white | is white space character |
| is_cursor_position | cursor can appear in front of character |
| is_word_start | is first character in a word |
| is_word_end | is first non-word character after a word |
| is_sentence_boundary | is inter-sentence space |
| is_sentence_start | is first character in a sentence |
| is_sentence_end | is first non-sentence character after a sentence |
| backspace_deletes_character | backspace deletes one character, not entire cluster (new in Pango 1.3.x) |
表3 Pango 逻辑属性

| 标志(Flag) | 描述 |
|---|---|
| is_line_break | 可以在字符前断行 |
| is_mandatory_break | 必须在字符前断行 |
| is_char_break | 字符分行时可以在这里断行 |
| is_white | 是空格字符 |
| is_cursor_position | 光标可以在字符前出现 |
| is_word_start | 是单词的第一个字符 |
| is_word_end | 是单词后的第一个非单词字符 |
| is_sentence_boundary | 是句子间的空格 |
| is_sentence_start | 是句子的第一个字符 |
| is_sentence_end | 是句子后的第一个非句子的字符 |
| backspace_deletes_character | 退格删除一个字符而不是整个字符簇 |
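The flags in Table 3 describe the position in front of each character, plus one final position after the last character. The Python sketch below computes a few of them for whitespace-delimited text. It is only an approximation under simple assumptions (a word character is anything alphanumeric, and a cursor cannot stop in front of a combining mark). A real language engine, for example one for Thai, where words are not separated by spaces, needs far more knowledge. The `LogAttr` class and `simple_log_attrs` function are hypothetical names; only the flag names come from the table.

```python
import unicodedata
from dataclasses import dataclass


@dataclass
class LogAttr:
    is_white: bool = False
    is_word_start: bool = False
    is_word_end: bool = False
    is_cursor_position: bool = False


def simple_log_attrs(text: str) -> list[LogAttr]:
    """Return len(text) + 1 attribute records, one per inter-character position."""
    attrs = []
    for i in range(len(text) + 1):
        cur = text[i] if i < len(text) else ""     # character after the position
        prev = text[i - 1] if i > 0 else ""        # character before the position
        cur_is_word = cur.isalnum()
        prev_is_word = prev.isalnum()
        is_mark = cur != "" and unicodedata.category(cur).startswith("M")
        attrs.append(LogAttr(
            is_white=cur.isspace(),
            is_word_start=cur_is_word and not prev_is_word,
            is_word_end=prev_is_word and not cur_is_word,
            is_cursor_position=not is_mark,
        ))
    return attrs


text = "hi there"
for i, attr in enumerate(simple_log_attrs(text)):
    print(i, repr(text[i:i + 1]), attr)
```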
Pango Shaping Engines
==Pango 造型引擎==
As discussed above, the Pango shaping engine converts characters in a text item in a certain language into glyphs, and positions them according to the script constraints. It provides a method to convert a given text string into a sequence of glyph information (glyph code, width and positioning) and a logical map that maps the glyphs back to character positions in the original text. With all the information provided, the text can be properly rendered on output devices, as well as accessed by the cursor despite the difference between logical and rendering order in some scripts like Indic, Hebrew and Arabic.
如上所述,Pango 造型引擎把一个特定语言的文本项中的字符转换成符号,并且按照文字的规则放置这些符号。它提供了一种将给定的文本串转化为符号信息序列(符号编码、宽度和位置)的方法以及按原文本中字符位置将符号映射回字符的规则。利用这些信息,文本可以在输出设备上正确地显示,也可以正确地处理光标位置,而不用管像印地语、希伯来语和阿拉伯语这样的语言中不同的逻辑和显示顺序。
Qt Text Layout
==Qt 文本外观==
Qt 3 text rendering is different from that of GTK+/Pango. Instead of modularizing, it handles all complex text rendering in a single class, called QComplexText, which is mostly based on the Unicode character database. This is equivalent to the default routines provided by Pango. Due to the incompleteness of the Unicode database, this class sometimes needs extra workarounds to override some values. Developers should examine this class if a script is not rendered properly.
Qt 3 的文本渲染与 GTK+/Pango 的不同。它不是模块化的,而是在一个称为 QComplexText 的基于 Unicode 字符数据库的类中处理所有复杂文本渲染。它与 Pango 提供的缺省处理方法是一样的。由于 Unicode 数据库的不完整,这个类需要更多的修改来处理某些数值。如果一种语言渲染不正确,开发者需要检查这个类。
Although relying on the Unicode database appears to be a straightforward method for rendering Unicode texts, this makes the class rigid and error-prone. Checking the Qt Web site regularly to find out whether there are bugs in the latest versions is advisable. However, a big change has been planned for Qt 4: the Scribe text layout engine, which is similar to Pango for GTK+.
虽然依赖于 Unicode 数据库看起来是一种直接的渲染 Unicode 文本的办法,但这样的类不灵活而且容易出错。建议经常查看 Qt 的网站了解最新版本中是否存在问题。不过,Qt 4 当中计划引入一个大的变化,即 Scribe 文本布局引擎,与 GTK+ 的 Pango 类似。
Keyboard Layouts
==键盘布局==
The first step to providing text input for a particular language is to prepare the keyboard map. X Window handles the keyboard map using the X Keyboard (XKB) extension. When you start an X server on GNU/Linux, a virtual terminal is attached to it in raw mode, so that keyboard events are sent from the kernel without any translation.
为一种特定语言提供文本输入功能的第一步是定