::-- limodou [2005-09-25 01:12:38]

[Translation] Unicode HOWTO by liqust at gmail dot com http://liqust.com/ -- 09/24/2005

Original text at www.amk.ca/python/howto/unicode

1. Unicode HOWTO

Version 1.02

This HOWTO discusses Python's support for Unicode, and explains various problems that people commonly encounter when trying to work with Unicode.

1.1. Introduction to Unicode

1.1.1. History of Character Codes

In 1968, the American Standard Code for Information Interchange, better known by its acronym ASCII, was standardized. ASCII defined numeric codes for various characters, with the numeric values running from 0 to 127. For example, the lowercase letter 'a' is assigned 97 as its code value.

ASCII was an American-developed standard, so it only defined unaccented characters. There was an 'e', but no 'é' or 'Í'. This meant that languages which required accented characters couldn't be faithfully represented in ASCII. (Actually the missing accents matter for English, too, which contains words such as 'naïve' and 'café', and some publications have house styles which require spellings such as 'coöperate'.)

For a while people just wrote programs that didn't display accents. I remember looking at Apple ][ BASIC programs, published in French-language publications in the mid-1980s, that had lines like these:

   PRINT "FICHIER EST COMPLETE."
   PRINT "CARACTERE NON ACCEPTE."

Those messages should contain accents, and they just look wrong to someone who can read French.

In the 1980s, almost all personal computers were 8-bit, meaning that bytes could hold values ranging from 0 to 255. ASCII codes only went up to 127, so some machines assigned values between 128 and 255 to accented characters. Different machines had different codes, however, which led to problems exchanging files. Eventually various commonly used sets of values for the 128-255 range emerged. Some were true standards, defined by the International Standards Organization, and some were de facto conventions that were invented by one company or another and managed to catch on.

256 characters aren't very many. For example, you can't fit both the accented characters used in Western Europe and the Cyrillic alphabet used for Russian into the 128-255 range because there are more than 128 such characters.

You could write files using different codes (all your Russian files in a coding system called KOI8, all your French files in a different coding system called Latin1), but what if you wanted to write a French document that quotes some Russian text? In the 1980s people began to want to solve this problem, and the Unicode standardization effort began.

Unicode started out using 16-bit characters instead of 8-bit characters. 16 bits means you have 2^16 = 65,536 distinct values available, making it possible to represent many different characters from many different alphabets; an initial goal was to have Unicode contain the alphabets for every single human language. It turns out that even 16 bits isn't enough to meet that goal, and the modern Unicode specification uses a wider range of codes, 0-1,114,111 (0x10ffff in base-16).

There's a related ISO standard, ISO 10646. Unicode and ISO 10646 were originally separate efforts, but the specifications were merged with the 1.1 revision of Unicode.

(This discussion of Unicode's history is highly simplified. I don't think the average Python programmer needs to worry about the historical details; consult the Unicode consortium site listed in the References for more information.)

1.1.2. Definitions

A character is the smallest possible component of a text. 'A', 'B', 'C', etc., are all different characters. So are 'È' and 'Í'. Characters are abstractions, and vary depending on the language or context you're talking about. For example, the symbol for ohms (Ω) is usually drawn much like the capital letter omega (Ω) in the Greek alphabet (they may even be the same in some fonts), but these are two different characters that have different meanings.

The Unicode standard describes how characters are represented by code points. A code point is an integer value, usually denoted in base 16. In the standard, a code point is written using the notation U+12ca to mean the character with value 0x12ca (4810 decimal). The Unicode standard contains a lot of tables listing characters and their corresponding code points:

   0061    'a'; LATIN SMALL LETTER A
   0062    'b'; LATIN SMALL LETTER B
   0063    'c'; LATIN SMALL LETTER C
   ...
   007B    '{'; LEFT CURLY BRACKET

Strictly, these definitions imply that it's meaningless to say 'this is character U+12ca'. U+12ca is a code point, which represents some particular character; in this case, it represents the character 'ETHIOPIC SYLLABLE WI'. In informal contexts, this distinction between code points and characters will sometimes be forgotten.

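For instance, you can look that name up from Python itself (the unicodedata module is covered later in this HOWTO):

   >>> import unicodedata
   >>> unicodedata.name(u'\u12ca')
   'ETHIOPIC SYLLABLE WI'
   >>> ord(u'\u12ca')
   4810
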
A character is represented on a screen or on paper by a set of graphical elements that's called a glyph. The glyph for an uppercase A, for example, is two diagonal strokes and a horizontal stroke, though the exact details will depend on the font being used. Most Python code doesn't need to worry about glyphs; figuring out the correct glyph to display is generally the job of a GUI toolkit or a terminal's font renderer.

1.1.3. Encodings

To summarize the previous section: a Unicode string is a sequence of code points, which are numbers from 0 to 0x10ffff. This sequence needs to be represented as a set of bytes (meaning, values from 0-255) in memory. The rules for translating a Unicode string into a sequence of bytes are called an encoding.

The first encoding you might think of is an array of 32-bit integers. In this representation, the string "Python" would look like this:

      P           y           t           h           o           n
   0x50 00 00 00 79 00 00 00 74 00 00 00 68 00 00 00 6f 00 00 00 6e 00 00 00 
      0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 
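
This layout is easy to reproduce with the struct module; here is a minimal sketch (the helper name is mine), packing each code point as a native-order 32-bit integer:

   import struct

   def naive_encode(ustr):
       # Four bytes per character, in whatever byte order
       # the current processor happens to use.
       return ''.join(struct.pack('=I', ord(ch)) for ch in ustr)

   print repr(naive_encode(u'Python'))   # 24 bytes for 6 characters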

This representation is straightforward but using it presents a number of problems.

  1. It's not portable; different processors order the bytes differently.
  2. It's very wasteful of space. In most texts, the majority of the code points are less than 127, or less than 255, so a lot of space is occupied by zero bytes. The above string takes 24 bytes compared to the 6 bytes needed for an ASCII representation. Increased RAM usage doesn't matter too much (desktop computers have megabytes of RAM, and strings aren't usually that large), but expanding our usage of disk and network bandwidth by a factor of 4 is intolerable.
  3. It's not compatible with existing C functions such as strlen(), so a new family of wide string functions would need to be used.
  4. Many Internet standards are defined in terms of textual data, and can't handle content with embedded zero bytes.

Generally people don't use this encoding, choosing other encodings that are more efficient and convenient.

Encodings don't have to handle every possible Unicode character, and most encodings don't. For example, Python's default encoding is the 'ascii' encoding. The rules for converting a Unicode string into the ASCII encoding are simple; for each code point:

  1. If the code point is <128, each byte is the same as the value of the code point.
  2. If the code point is 128 or greater, the Unicode string can't be represented in this encoding. (Python raises a UnicodeEncodeError exception in this case; the short session below shows both rules in action.)
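
A quick interactive demonstration of these rules (a sketch; any character with a code point >= 128 triggers the error):

   >>> u'abc'.encode('ascii')        # every code point is < 128
   'abc'
   >>> u'caf\xe9'.encode('ascii')    # 0xe9 (233) is >= 128
   Traceback (most recent call last):
     File "<stdin>", line 1, in ?
   UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 3: 
                       ordinal not in range(128)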

Latin-1, also known as ISO-8859-1, is a similar encoding. Unicode code points 0-255 are identical to the Latin-1 values, so converting to this encoding simply requires converting code points to byte values; if a code point larger than 255 is encountered, the string can't be encoded into Latin-1.

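A parallel sketch for Latin-1 (u'\u0151' is LATIN SMALL LETTER O WITH DOUBLE ACUTE, code point 337):

   >>> u'caf\xe9'.encode('latin-1')   # 0xe9 (233) fits in 0-255
   'caf\xe9'
   >>> u'\u0151'.encode('latin-1')    # 337 is larger than 255
   Traceback (most recent call last):
     File "<stdin>", line 1, in ?
   UnicodeEncodeError: 'latin-1' codec can't encode character u'\u0151' in position 0: 
                       ordinal not in range(256)
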
Encodings don't have to be simple one-to-one mappings like Latin-1. Consider IBM's EBCDIC, which was used on IBM mainframes. Letter values weren't in one block: 'a' through 'i' had values from 129 to 137, but 'j' through 'r' were 145 through 153. If you wanted to use EBCDIC as an encoding, you'd probably use some sort of lookup table to perform the conversion, but this is largely an internal detail.

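A toy sketch of such a lookup-table conversion, covering only the letter values quoted above (illustrative, not a real EBCDIC table):

   # Partial Unicode -> EBCDIC table built from the ranges mentioned above.
   TO_EBCDIC = {}
   for i, ch in enumerate(u'abcdefghi'):
       TO_EBCDIC[ch] = 129 + i    # 'a'..'i' -> 129..137
   for i, ch in enumerate(u'jklmnopqr'):
       TO_EBCDIC[ch] = 145 + i    # 'j'..'r' -> 145..153

   def toy_ebcdic_encode(ustr):
       # Look each character up; a real codec would also handle errors.
       return ''.join(chr(TO_EBCDIC[ch]) for ch in ustr)

   print repr(toy_ebcdic_encode(u'chip'))   # '\x83\x88\x89\x97'
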
UTF-8 is one of the most commonly used encodings. UTF stands for "Unicode Transformation Format", and the '8' means that 8-bit numbers are used in the encoding. (There's also a UTF-16 encoding, but it's less frequently used than UTF-8.) UTF-8 uses the following rules:

  1. If the code point is <128, it's represented by the corresponding byte value.
  2. If the code point is between 128 and 0x7ff, it's turned into two byte values between 128 and 255.
  3. Code points >0x7ff are turned into three- or four-byte sequences, where each byte of the sequence is between 128 and 255. (The short session below shows one example of each length.)
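
For example, one character from each range (a sketch; byte values shown as hex escapes):

   >>> u'a'.encode('utf-8')          # code point < 128: one byte
   'a'
   >>> u'\xe9'.encode('utf-8')       # 128..0x7ff: two bytes
   '\xc3\xa9'
   >>> u'\u4500'.encode('utf-8')     # > 0x7ff: three bytes
   '\xe4\x94\x80'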

UTF-8 has several convenient properties:

  1. It can handle any Unicode code point.
  2. A Unicode string is turned into a string of bytes containing no embedded zero bytes. This avoids byte-ordering issues, and means UTF-8 strings can be processed by C functions such as strcpy() and sent through protocols that can't handle zero bytes.
  3. A string of ASCII text is also valid UTF-8 text.
  4. UTF-8 is fairly compact; the majority of code points are turned into two bytes, and values less than 128 occupy only a single byte.
  5. If bytes are corrupted or lost, it's possible to determine the start of the next UTF-8-encoded code point and resynchronize. It's also unlikely that random 8-bit data will look like valid UTF-8.

1.1.4. References

The Unicode Consortium site at <http://www.unicode.org> has character charts, a glossary, and PDF versions of the Unicode specification. Be prepared for some difficult reading. <http://www.unicode.org/history/> is a chronology of the origin and development of Unicode.

To help understand the standard, Jukka Korpela has written an introductory guide to reading the Unicode character tables, available at <http://www.cs.tut.fi/~jkorpela/unicode/guide.html>.

Roman Czyborra wrote another explanation of Unicode's basic principles; it's at <http://czyborra.com/unicode/characters.html>. Czyborra has written a number of other Unicode-related documents, available from <http://www.czyborra.com>.

Two other good introductory articles were written by Joel Spolsky <http://www.joelonsoftware.com/articles/Unicode.html> and Jason Orendorff <http://www.jorendorff.com/articles/unicode/>. If this introduction didn't make things clear to you, you should try reading one of these alternate articles before continuing.

Wikipedia entries are often helpful; see the entries for "character encoding" <http://en.wikipedia.org/wiki/Character_encoding> and UTF-8 <http://en.wikipedia.org/wiki/UTF-8>, for example.

1.2. Python's Unicode Support

Now that you've learned the rudiments of Unicode, we can look at Python's Unicode features.

1.2.1. The Unicode Type

Unicode strings are expressed as instances of the unicode type, one of Python's repertoire of built-in types. It derives from an abstract type called basestring, which is also an ancestor of the str type; you can therefore check if a value is a string type with isinstance(value, basestring). Under the hood, Python represents Unicode strings as either 16- or 32-bit integers, depending on how the Python interpreter was compiled.

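For instance, the basestring check accepts both string types:

   >>> isinstance(u'abc', basestring), isinstance('abc', basestring)
   (True, True)
   >>> isinstance(u'abc', str)
   False
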
The unicode() constructor has the signature unicode(string[, encoding, errors]). All of its arguments should be 8-bit strings. The first argument is converted to Unicode using the specified encoding; if you leave off the encoding argument, the ASCII encoding is used for the conversion, so characters greater than 127 will be treated as errors:

   >>> unicode('abcdef')
   u'abcdef'

   >>> s = unicode('abcdef')
   >>> type(s)
   <type 'unicode'>
   >>> unicode('abcdef' + chr(255))
   Traceback (most recent call last):
     File "<stdin>", line 1, in ?
   UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 6: 
                       ordinal not in range(128)

The errors argument specifies the response when the input string can't be converted according to the encoding's rules. Legal values for this argument are 'strict' (raise a UnicodeDecodeError exception), 'replace' (add U+FFFD, 'REPLACEMENT CHARACTER'), or 'ignore' (just leave the character out of the Unicode result). The following examples show the differences:

   >>> unicode('\x80abc', errors='strict')
   Traceback (most recent call last):
     File "<stdin>", line 1, in ?
   UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0: 
                       ordinal not in range(128)
   >>> unicode('\x80abc', errors='replace')
   u'\ufffdabc'

   >>> unicode('\x80abc', errors='ignore')
   u'abc'

Encodings are specified as strings containing the encoding's name. Python 2.4 comes with roughly 100 different encodings; see the Python Library Reference at <http://docs.python.org/lib/standard-encodings.html> for a list. Some encodings have multiple names; for example, 'latin-1', 'iso_8859_1' and '8859' are all synonyms for the same encoding.

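Synonymous names are interchangeable; a quick check using the aliases mentioned above:

   >>> unicode('\xe9', 'latin-1') == unicode('\xe9', 'iso_8859_1') == unicode('\xe9', '8859')
   True
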
One-character Unicode strings can also be created with the unichr() built-in function, which takes integers and returns a Unicode string of length 1 that contains the corresponding code point. The reverse operation is the built-in ord() function that takes a one-character Unicode string and returns the code point value:

   >>> unichr(40960)
   u'\ua000'
   >>> ord(u'\ua000')
   40960

Instances of the unicode type have many of the same methods as the 8-bit string type for operations such as searching and formatting:

   >>> s = u'Was ever feather so lightly blown to and fro as this multitude?'
   >>> s.count('e')
   5

   >>> s.find('feather')
   9
   >>> s.find('bird')
   -1
   >>> s.replace('feather', 'sand')
   u'Was ever sand so lightly blown to and fro as this multitude?'
   >>> s.upper()
   u'WAS EVER FEATHER SO LIGHTLY BLOWN TO AND FRO AS THIS MULTITUDE?'

Note that the arguments to these methods can be Unicode strings or 8-bit strings. 8-bit strings will be converted to Unicode before carrying out the operation; Python's default ASCII encoding will be used, so characters greater than 127 will cause an exception:

   >>> s.find('Was\x9f')
   Traceback (most recent call last):
     File "<stdin>", line 1, in ?
   UnicodeDecodeError: 'ascii' codec can't decode byte 0x9f in position 3: ordinal not in range(128)

   >>> s.find(u'Was\x9f')
   -1

Much Python code that operates on strings will therefore work with Unicode strings without requiring any changes to the code. (Input and output code needs more updating for Unicode; more on this later.)

Another important method is .encode([encoding], [errors='strict']), which returns an 8-bit string version of the Unicode string, encoded in the requested encoding. The errors parameter is the same as the parameter of the unicode() constructor, with one additional possibility; as well as 'strict', 'ignore', and 'replace', you can also pass 'xmlcharrefreplace' which uses XML's character references. The following example shows the different results:

   >>> u = unichr(40960) + u'abcd' + unichr(1972)

   >>> u.encode('utf-8')
   '\xea\x80\x80abcd\xde\xb4'
   >>> u.encode('ascii')
   Traceback (most recent call last):
     File "<stdin>", line 1, in ?
   UnicodeEncodeError: 'ascii' codec can't encode character '\ua000' in position 0: ordinal not in range(128)

   >>> u.encode('ascii', 'ignore')
   'abcd'
   >>> u.encode('ascii', 'replace')
   '?abcd?'
   >>> u.encode('ascii', 'xmlcharrefreplace')
   '&#40960;abcd&#1972;'

Python's 8-bit strings have a .decode([encoding], [errors]) method that interprets the string using the given encoding:

   >>> u = unichr(40960) + u'abcd' + unichr(1972)   # Assemble a string
   >>> utf8_version = u.encode('utf-8')             # Encode as UTF-8
   >>> type(utf8_version), utf8_version
   (<type 'str'>, '\xea\x80\x80abcd\xde\xb4')

   >>> u2 = utf8_version.decode('utf-8')            # Decode using UTF-8
   >>> u == u2                                      # The two strings match
   True

The low-level routines for registering and accessing the available encodings are found in the codecs module. However, the encoding and decoding functions returned by this module are usually more low-level than is comfortable, so I'm not going to describe the codecs module here. If you need to implement a completely new encoding, you'll need to learn about the codecs module interfaces, but implementing encodings is a specialized task that also won't be covered here. Consult the Python documentation to learn more about this module.

The most commonly used part of the codecs module is the codecs.open() function which will be discussed in the section on input and output.

1.2.2. Unicode Literals in Python Source Code

In Python source code, Unicode literals are written as strings prefixed with the 'u' or 'U' character: u'abcdefghijk'. Specific code points can be written using the \u escape sequence, which is followed by four hex digits giving the code point. The \U escape sequence is similar, but expects 8 hex digits, not 4.

Unicode literals can also use the same escape sequences as 8-bit strings, including \x, but \x only takes two hex digits so it can't express an arbitrary code point. Octal escapes can go up to U+01ff, which is octal 777.

   1 >>> s = u"a\xac\u1234\u20ac\U00008000"
   2            ^^^^ two-digit hex escape
   3                ^^^^^^ four-digit Unicode escape 
   4                            ^^^^^^^^^^ eight-digit Unicode escape
   5 >>> for c in s:  print ord(c),
   6 ... 
   7 97 172 4660 8364 32768

Using escape sequences for code points greater than 127 is fine in small doses, but becomes an annoyance if you're using many accented characters, as you would in a program with messages in French or some other accent-using language. You can also assemble strings using the unichr() built-in function, but this is even more tedious.

Ideally, you'd want to be able to write literals in your language's natural encoding. You could then edit Python source code with your favorite editor which would display the accented characters naturally, and have the right characters used at runtime.

Python supports writing Unicode literals in any encoding, but you have to declare the encoding being used. This is done by including a special comment as either the first or second line of the source file:

   #!/usr/bin/env python
   # -*- coding: latin-1 -*-

   u = u'abcdé'

   print ord(u[-1])

The syntax is inspired by Emacs's notation for specifying variables local to a file. Emacs supports many different variables, but Python only supports 'coding'. The -*- symbols indicate that the comment is special; within them, you must supply the name coding and the name of your chosen encoding, separated by ':'.

If you don't include such a comment, the default encoding used will be ASCII. Versions of Python before 2.4 were Euro-centric and assumed Latin-1 as a default encoding for string literals; in Python 2.4, characters greater than 127 still work but result in a warning. For example, the following program has no encoding declaration:

   #!/usr/bin/env python
   u = u'abcdé'

   print ord(u[-1])

When you run it with Python 2.4, it will output the following warning:

   amk:~$ python p263.py
   sys:1: DeprecationWarning: Non-ASCII character '\xe9' 
        in file p263.py on line 2, but no encoding declared; 
        see http://www.python.org/peps/pep-0263.html for details

1.2.3. Unicode Properties

The Unicode specification includes a database of information about code points. For each code point that's defined, the information includes the character's name, its category, the numeric value if applicable (Unicode has characters representing the Roman numerals and fractions such as one-third and four-fifths). There are also properties related to the code point's use in bidirectional text and other display-related properties.

The following program displays some information about several characters, and prints the numeric value of one particular character:

   import unicodedata

   u = unichr(233) + unichr(0x0bf2) + unichr(3972) + unichr(6000) + unichr(13231)

   for i, c in enumerate(u):
       print i, '%04x' % ord(c), unicodedata.category(c),
       print unicodedata.name(c)

   # Get numeric value of second character
   print unicodedata.numeric(u[1])

When run, this prints:

   0 00e9 Ll LATIN SMALL LETTER E WITH ACUTE
   1 0bf2 No TAMIL NUMBER ONE THOUSAND
   2 0f84 Mn TIBETAN MARK HALANTA
   3 1770 Lo TAGBANWA LETTER SA
   4 33af So SQUARE RAD OVER S SQUARED
   1000.0

The category codes are abbreviations describing the nature of the character. These are grouped into categories such as "Letter", "Number", "Punctuation", or "Symbol", which in turn are broken up into subcategories. To take the codes from the above output, 'Ll' means 'Letter, lowercase', 'No' means "Number, other", 'Mn' is "Mark, nonspacing", and 'So' is "Symbol, other". See <http://www.unicode.org/Public/UNIDATA/UCD.html#General_Category_Values> for a list of category codes.

1.2.4. References

The Unicode and 8-bit string types are described in the Python library reference at <http://docs.python.org/lib/typesseq.html>.

The documentation for the unicodedata module is at <http://docs.python.org/lib/module-unicodedata.html>.

The documentation for the codecs module is at <http://docs.python.org/lib/module-codecs.html>.

Marc-André Lemburg gave a presentation at EuroPython 2002 titled "Python and Unicode". A PDF version of his slides is available at <http://www.egenix.com/files/python/Unicode-EPC2002-Talk.pdf>, and is an excellent overview of the design of Python's Unicode features.

1.3. Reading and Writing Unicode Data

Once you've written some code that works with Unicode data, the next problem is input/output. How do you get Unicode strings into your program, and how do you convert Unicode into a form suitable for storage or transmission?

It's possible that you may not need to do anything depending on your input sources and output destinations; you should check whether the libraries used in your application support Unicode natively. XML parsers often return Unicode data, for example. Many relational databases also support Unicode-valued columns and can return Unicode values from an SQL query.

Unicode data is usually converted to a particular encoding before it gets written to disk or sent over a socket. It's possible to do all the work yourself: open a file, read an 8-bit string from it, and convert the string with unicode(str, encoding). However, the manual approach is not recommended.

One problem is the multi-byte nature of encodings; one Unicode character can be represented by several bytes. If you want to read the file in arbitrary-sized chunks (say, 1K or 4K), you need to write error-handling code to catch the case where only part of the bytes encoding a single Unicode character are read at the end of a chunk. One solution would be to read the entire file into memory and then perform the decoding, but that prevents you from working with files that are extremely large; if you need to read a 2Gb file, you need 2Gb of RAM. (More, really, since for at least a moment you'd need to have both the encoded string and its Unicode version in memory.)

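A small sketch of the chunk-boundary problem, reusing the three-byte UTF-8 sequence shown earlier (the exact error message varies across Python versions):

   >>> data = u'\u4500'.encode('utf-8')   # three bytes: '\xe4\x94\x80'
   >>> data[:2].decode('utf-8')           # a chunk boundary splits the character
   Traceback (most recent call last):
     File "<stdin>", line 1, in ?
   UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-1: 
                       unexpected end of data
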
The solution would be to use the low-level decoding interface to catch the case of partial coding sequences. The work of implementing this has already been done for you: the codecs module includes a version of the open() function that returns a file-like object that assumes the file's contents are in a specified encoding and accepts Unicode parameters for methods such as .read() and .write().

The function's parameters are open(filename, mode='rb', encoding=None, errors='strict', buffering=1). mode can be 'r', 'w', or 'a', just like the corresponding parameter to the regular built-in open() function; add a '+' to update the file. buffering is similarly parallel to the standard function's parameter. encoding is a string giving the encoding to use; if it's left as None, a regular Python file object that accepts 8-bit strings is returned. Otherwise, a wrapper object is returned, and data written to or read from the wrapper object will be converted as needed. errors specifies the action for encoding errors and can be one of the usual values of 'strict', 'ignore', and 'replace'.

Reading Unicode from a file is therefore simple:

   import codecs
   f = codecs.open('unicode.rst', encoding='utf-8')
   for line in f:
       print repr(line)

It's also possible to open files in update mode, allowing both reading and writing:

   f = codecs.open('test', encoding='utf-8', mode='w+')
   f.write(u'\u4500 blah blah blah\n')
   f.seek(0)
   print repr(f.readline()[:1])
   f.close()

Unicode character U+FEFF is used as a byte-order mark (BOM), and is often written as the first character of a file in order to assist with autodetection of the file's byte ordering. Some encodings, such as UTF-16, expect a BOM to be present at the start of a file; when such an encoding is used, the BOM will be automatically written as the first character and will be silently dropped when the file is read. There are variants of these encodings, such as 'utf-16-le' and 'utf-16-be' for little-endian and big-endian encodings, that specify one particular byte ordering and don't skip the BOM.

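For example (a sketch; the plain 'utf-16' output below assumes a little-endian machine):

   >>> import codecs
   >>> u'abc'.encode('utf-16')            # BOM prepended automatically
   '\xff\xfea\x00b\x00c\x00'
   >>> (codecs.BOM_UTF16_LE + 'a\x00').decode('utf-16')   # BOM silently dropped
   u'a'
   >>> u'abc'.encode('utf-16-le')         # explicit byte order: no BOM
   'a\x00b\x00c\x00'
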
1.3.1. Unicode filenames

Most of the operating systems in common use today support filenames that contain arbitrary Unicode characters. Usually this is implemented by converting the Unicode string into some encoding that varies depending on the system. For example, MacOS X uses UTF-8 while Windows uses a configurable encoding; on Windows, Python uses the name "mbcs" to refer to whatever the currently configured encoding is. On Unix systems, there will only be a filesystem encoding if you've set the LANG or LC_CTYPE environment variables; if you haven't, the default encoding is ASCII.

The sys.getfilesystemencoding() function returns the encoding to use on your current system, in case you want to do the encoding manually, but there's not much reason to bother. When opening a file for reading or writing, you can usually just provide the Unicode string as the filename, and it will be automatically converted to the right encoding for you:

   filename = u'filename\u4500abc'
   f = open(filename, 'w')
   f.write('blah\n')
   f.close()

Functions in the os module such as os.stat() will also accept Unicode filenames.

os.listdir(), which returns filenames, raises an issue: should it return the Unicode version of filenames, or should it return 8-bit strings containing the encoded versions? os.listdir() will do both, depending on whether you provided the directory path as an 8-bit string or a Unicode string. If you pass a Unicode string as the path, filenames will be decoded using the filesystem's encoding and a list of Unicode strings will be returned, while passing an 8-bit path will return the 8-bit versions of the filenames. For example, assuming the default filesystem encoding is UTF-8, running the following program:

   fn = u'filename\u4500abc'
   f = open(fn, 'w')
   f.close()

   import os
   print os.listdir('.')
   print os.listdir(u'.')

will produce the following output:

   amk:~$ python t.py
   ['.svn', 'filename\xe4\x94\x80abc', ...]
   [u'.svn', u'filename\u4500abc', ...]

The first list contains UTF-8-encoded filenames, and the second list contains the Unicode versions.

1.3.2. Tips for Writing Unicode-aware Programs

This section provides some suggestions on writing software that deals with Unicode.

The most important tip is:

    Software should only work with Unicode strings internally, converting to a particular encoding on output.

If you attempt to write processing functions that accept both Unicode and 8-bit strings, you will find your program vulnerable to bugs wherever you combine the two different kinds of strings. Python's default encoding is ASCII, so whenever a character with an ASCII value >127 is in the input data, you'll get a UnicodeDecodeError because that character can't be handled by the ASCII encoding.

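A minimal sketch of the recommended pattern (the helper name and choice of UTF-8 are illustrative):

   def process(raw_bytes, encoding='utf-8'):
       text = raw_bytes.decode(encoding)   # decode once, at the input boundary
       text = text.upper()                 # all internal work is on unicode
       return text.encode(encoding)        # encode once, at the output boundary
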
It's easy to miss such problems if you only test your software with data that doesn't contain any accents; everything will seem to work, but there's actually a bug in your program waiting for the first user who attempts to use characters >127. A second tip, therefore, is:

    Include characters >127 and, even better, characters >255 in your test data.

When using data coming from a web browser or some other untrusted source, a common technique is to check for illegal characters in a string before using the string in a generated command line or storing it in a database. If you're doing this, be careful to check the string once it's in the form that will be used or stored; it's possible for encodings to be used to disguise characters. This is especially true if the input data also specifies the encoding; many encodings leave the commonly checked-for characters alone, but Python includes some encodings such as 'base64' that modify every single character.

For example, let's say you have a content management system that takes a Unicode filename, and you want to disallow paths with a '/' character. You might write this code:

   def read_file (filename, encoding):
       if '/' in filename:
           raise ValueError("'/' not allowed in filenames")
       unicode_name = filename.decode(encoding)
       f = open(unicode_name, 'r')
       # ... return contents of file ...

However, if an attacker could specify the 'base64' encoding, they could pass 'L2V0Yy9wYXNzd2Q=', which is the base-64 encoded form of the string '/etc/passwd', to read a system file. The above code looks for '/' characters in the encoded form and misses the dangerous character in the resulting decoded form.

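One way to restructure the check, following the advice above, is to validate the decoded form that will actually be used (a sketch):

   def read_file(filename, encoding):
       unicode_name = filename.decode(encoding)   # convert first...
       if u'/' in unicode_name:                   # ...then check the final form
           raise ValueError("'/' not allowed in filenames")
       f = open(unicode_name, 'r')
       # ... return contents of file ...
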
1.3.3. References

The PDF slides for Marc-André Lemburg's presentation "Writing Unicode-aware Applications in Python" are available at <http://www.egenix.com/files/python/LSM2005-Developing-Unicode-aware-applications-in-Python.pdf> and discuss questions of character encodings as well as how to internationalize and localize an application.

1.4. Revision History and Acknowledgements

Thanks to the following people who have noted errors or offered suggestions on this article: Nicholas Bastin, Marius Gedminas, Kent Johnson, Ken Krugler, Marc-André Lemburg, Martin von Löwis.

Version 1.0: posted August 5 2005.

Version 1.01: posted August 7 2005. Corrects factual and markup errors; adds several links.

Version 1.02: posted August 16 2005. Corrects factual errors.


Email: liqust at gmail dot com ©2005 2pole
