This article is from the Python Cookbook.

The translation is solely for personal study; it has nothing to do with any commercial copyright dispute!

-- dreamingk [2004-08-12 04:02:40]

3.1 Introduction

Credit: Fred L. Drake, Jr., PythonLabs

Text-processing applications form a substantial part of the application space for any scripting language, if only because everyone can agree that text processing is useful. Everyone has bits of text that need to be reformatted or transformed in various ways. The catch, of course, is that every application is just a little bit different from every other application, so it can be difficult to find just the right reusable code to work with different file formats, no matter how similar they are.

3.1.1 What Is Text?

Sounds like an easy question, doesn't it? After all, we know it when we see it, don't we? Text is a sequence of characters, and it is distinguished from binary data by that very fact. Binary data, after all, is a sequence of bytes.

Unfortunately, all data enters our applications as a sequence of bytes. There's no library function we can call that will tell us whether a particular sequence of bytes represents text, although we can create some useful heuristics that tell us whether data can safely (not necessarily correctly) be handled as text.
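
One such heuristic might look like the sketch below. It is only an illustration, written in Python 3 terms where raw input is a `bytes` object; the helper name `looks_like_text`, the sample size, and the 30% threshold are assumptions made for the example, not part of any library.

```python
def looks_like_text(data, sample_size=512, threshold=0.30):
    """Guess whether a byte string can safely be treated as text.

    Hypothetical helper: the name, sample size, and threshold are
    illustrative assumptions, not a standard-library facility.
    """
    sample = data[:sample_size]
    if not sample:
        return True   # an empty input is trivially safe to treat as text
    if b"\0" in sample:
        return False  # NUL bytes almost never occur in text files
    # Count control bytes other than common whitespace (tab, LF, CR, FF, BS).
    allowed = b"\t\n\r\f\b"
    suspicious = sum(1 for byte in sample if byte < 32 and byte not in allowed)
    return suspicious / len(sample) <= threshold
```

A positive answer means only that the data is probably safe to handle as text, not that any particular interpretation of it is correct.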

Python strings are immutable sequences of bytes or characters. Most of the ways we create and process strings treat them as sequences of characters, but many are just as applicable to sequences of bytes. Unicode strings are immutable sequences of Unicode characters: transformations of Unicode strings into and from plain strings use codecs (coder-decoder) objects that embody knowledge about the many standard ways in which sequences of characters can be represented by sequences of bytes (also known as encodings and character sets). Note that Unicode strings do not serve double duty as sequences of bytes.
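
The round trip between characters and bytes can be illustrated as follows. The sketch uses Python 3 spelling, where the character string type is `str` and the byte string type is `bytes` (in the Python 2 described here they are `unicode` and `str`); the sample word is an arbitrary choice.

```python
text = "caf\u00e9"                      # four characters, the last is e-acute

as_latin1 = text.encode("iso-8859-1")   # one byte per character in this encoding
as_utf8 = text.encode("utf-8")          # e-acute takes two bytes in UTF-8

print(len(text), len(as_latin1), len(as_utf8))

# Decoding with the matching codec recovers the original characters.
assert as_latin1.decode("iso-8859-1") == text
assert as_utf8.decode("utf-8") == text
```

The same characters yield different byte sequences under different encodings, which is why the encoding must be known before bytes can be read back as text.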

Okay, let's assume that our application knows from the context that it's looking at text. That's usually the best approach, because that's where external input comes into play. We're either looking at a file because it has a well-known name and defined format (common in the Unix world) or because it has a well-known filename extension that indicates the format of the contents (common on Windows). But now we have a problem: we had to use the word "format" to make the previous paragraph meaningful. Wasn't text supposed to be simple?

Let's face it: there's no such thing as "pure" text, and if there were, we probably wouldn't care about it (with the possible exception of applications in the field of computational linguistics, where pure text may indeed be studied for its own sake). What we want to deal with in our applications is information content contained in text. The text we care about may contain configuration data, commands to control or define processes, documents for human consumption, or even tabular data. Text that contains configuration data or a series of commands usually can be expected to conform to a fairly strict syntax that can be checked before relying on the information in the text. Informing the user of an error in the input text is typically sufficient to deal with things that aren't what we were expecting.

Documents intended for humans tend to be simple, but they vary widely in detail. Since they are usually written in a natural language, their syntax and grammar can be difficult to check, at best. Different texts may use different character sets or encodings, and it can be difficult or even impossible to tell what character set or encoding was used to create a text if that information is not available in addition to the text itself. It is, however, necessary to support proper representation of natural-language documents. Natural-language text has structure as well, but the structures are often less explicit in the text and require at least some understanding of the language in which the text was written. Characters make up words, which make up sentences, which make up paragraphs, and still larger structures may be present as well. Paragraphs alone can be particularly difficult to locate unless you know what typographical conventions were used for a document: is each line a paragraph, or can multiple lines make up a paragraph? If the latter, how do we tell which lines are grouped together to make a paragraph? Paragraphs may be separated by blank lines, indentation, or some other special mark. See Recipe 4.9 and Recipe 12.8 for examples of processing and inputting paragraphs separated by blank lines.
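
For the simplest of these conventions, paragraphs separated by blank lines, a minimal sketch could look like this. It is not the code of the recipes cited above; the generator name `paragraphs` and the choice to join lines with a single space are assumptions for illustration.

```python
def paragraphs(lines):
    """Yield paragraphs, each as a single string, from an iterable of lines.

    Assumes blank (whitespace-only) lines separate paragraphs.
    """
    current = []
    for line in lines:
        if line.strip():
            current.append(line.strip())
        elif current:
            # A blank line closes the paragraph collected so far.
            yield " ".join(current)
            current = []
    if current:
        yield " ".join(current)   # the last paragraph may lack a trailing blank line


sample = "First line.\nSecond line.\n\n\nA new paragraph.\n"
result = list(paragraphs(sample.splitlines()))
```

Documents that mark paragraphs by indentation or by some other special mark would need a different test in place of `line.strip()`.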

Tabular data has many issues that are similar to the problems associated with natural-language text, but it adds a second dimension to the input format: the text is no longer linear. It is no longer a sequence of characters, but rather a matrix of characters from which individual blocks of text must be identified and organized.

3.1.2 Basic Textual Operations

As with any other data format, we need to do different things with text at different times. However, there are still three basic operations: parsing, transforming, and generating.

Parsing can be performed in a variety of ways, and many formats can be suitably handled by ad hoc parsers that deal effectively with a very constrained format. Examples of this approach include parsers for RFC 2822-style email headers (see the rfc822 module in Python's standard library) and the configuration files handled by the ConfigParser module. The netrc module offers another example of a parser for an application-specific file format, this one based on the shlex module. shlex offers a fairly typical tokenizer for basic languages, useful in creating readable configuration files or allowing users to enter commands to an interactive prompt. These sorts of ad hoc parsers are abundant in Python's standard library, and recipes using them can be found in Chapter 4 and Chapter 10. More formal parsing tools are also available for Python; they depend on larger add-on packages and are surveyed in the introduction to Chapter 15.
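
As a small illustration of the tokenizer mentioned above, the following sketch uses `shlex.split` on a made-up command line; the command and its arguments are hypothetical.

```python
import shlex

# A command as a user might type it at an interactive prompt.
command = 'copy "my file.txt" backup/ --verbose'

# shlex.split honors shell-style quoting, so the quoted name stays one token.
tokens = shlex.split(command)
print(tokens)
```

A plain `command.split()` would break the quoted filename into two pieces, which is the kind of detail an ad hoc tokenizer takes care of.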

Transforming text from one format to another is more interesting when viewed as text processing, which is what we usually think of first when we talk about text. In this chapter, we'll take a look at some ways to approach transformations that can be applied for different purposes, including three different recipes that deal with replacing embedded Python expressions with their evaluations. Sometimes we'll work with text stored in external files, and other times we'll simply work with it as strings in memory.

The generation of textual data from application-specific data structures is most easily performed using Python's print statement or the write method of a file or file-like object. This is often done using a method of the application object or a function, which takes the output file as a parameter. The function can then use statements such as these:

print >>file, sometext
file.write(sometext)

which generate output to the appropriate file. However, this isn't generally thought of as text processing, as here there is no input text to be processed. Examples of using both print and write can be found throughout this book.
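
The pattern of a function that takes the output file as a parameter might be sketched as below. The example assumes Python 3, where `print` is a function and the redirection is written `print(..., file=out)` rather than `print >>file`; the function name `write_report` and the record layout are made up for illustration.

```python
import io


def write_report(records, out):
    """Write one 'name: value' line per record to a file-like object."""
    for name, value in records:
        print("%s: %s" % (name, value), file=out)   # print adds the newline
    out.write("total: %d\n" % len(records))          # write needs it explicitly


buffer = io.StringIO()   # any object with a write method works
write_report([("alpha", 1), ("beta", 2)], buffer)
print(buffer.getvalue())
```

Because the function never opens the file itself, the same code can write to a disk file, to standard output, or to an in-memory buffer.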

3.1.3 Sources of Text

Working with text stored as a string in memory can be easy when the text is not too large. Operations that search the text can operate over multiple lines very easily and quickly, and there's no need to worry about searching for something that might cross a buffer boundary. Being able to keep the text in memory as a simple string makes it very easy to take advantage of the built-in string operations available as methods of the string object.

File-based transformations deserve special treatment, because there can be substantial overhead related to I/O performance and the amount of data that must actually be stored in memory. When working with data stored on disk, we often want to avoid loading entire files into memory, due to the size of the data: loading an 80-MB file into memory should not be done too casually! When our application needs only part of the data at a time, working on smaller segments of the data can yield substantial performance improvements, simply because we've allowed enough space for our program to run. If we are careful about buffer management, we can still maintain the performance advantage of using a small number of relatively large disk read and write operations by working on large chunks of data at a time. File-related recipes are found in Chapter 4.
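
Working on large chunks instead of the whole file could look like the sketch below. The helper names `read_chunks` and `count_char` and the default chunk size are assumptions for illustration; an in-memory `io.StringIO` stands in for a file on disk.

```python
import io


def read_chunks(fileobj, chunk_size=64 * 1024):
    """Yield successive chunks from a file object until it is exhausted."""
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        yield chunk


def count_char(fileobj, char, chunk_size=64 * 1024):
    """Count occurrences of a single character without loading the whole file."""
    return sum(chunk.count(char) for chunk in read_chunks(fileobj, chunk_size))


demo = io.StringIO("a,b,c\n" * 1000)
commas = count_char(demo, ",", chunk_size=100)
```

Counting a single character is safe because one character cannot straddle two chunks; searching for longer patterns would need extra care at chunk boundaries.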

Another interesting source for textual data comes to light when we consider the network. Text is often retrieved from the network using a socket. While we can always view a socket as a file (using the makefile method of the socket object), the data that is retrieved over a socket may come in chunks, or we may have to wait for more data to arrive. The textual data may also not consist of all data until the end of the data stream, so a file object created with makefile may not be entirely appropriate to pass to text-processing code. When working with text from a network connection, we often need to read the data from the connection before passing it along for further processing. If the data is large, it can be handled by saving it to a file as it arrives and then using that file when performing text-processing operations. More elaborate solutions can be built when the text processing needs to be started before all the data is available. Examples of the parsers that are useful in such situations may be found in the htmllib and HTMLParser modules in the standard library.
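
Processing that starts before all the data is available can be sketched with an incremental parser. The example assumes Python 3, where the HTMLParser class lives in the `html.parser` module; the subclass name `LinkCollector` and the hard-coded chunks, which stand in for data arriving from a socket, are made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags as data is fed in piece by piece."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)


parser = LinkCollector()
# Chunks as they might arrive from the network; tags are split across chunks.
chunks = ['<p><a href="/on', 'e">one</a> and <a hr', 'ef="/two">two</a></p>']
for chunk in chunks:
    parser.feed(chunk)   # incomplete markup is buffered until more data arrives
parser.close()
print(parser.links)
```

The parser keeps incomplete markup in its own buffer, so the calling code does not need to know where the chunk boundaries fall.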

3.1.4 String Basics

The main tool Python gives us to process text is strings: immutable sequences of characters. There are actually two kinds of strings: plain strings, which contain eight-bit (ASCII) characters; and Unicode strings, which contain Unicode characters. We won't deal much with Unicode strings here: their functionality is similar to that of plain strings, except that each character takes up 2 (or 4) bytes, so that the number of different characters is in the tens of thousands (or even billions), as opposed to the 256 different characters that comprise plain strings. Unicode strings are important if you must deal with text in many different alphabets, particularly Asian ideographs. Plain strings are sufficient to deal with English or any of a limited set of non-Asian languages. For example, all Western European alphabets can be encoded in plain strings, typically using the international standard encoding known as ISO-8859-1 (or ISO-8859-15, if you need the Euro currency symbol as well).
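
The difference between the two ISO encodings can be checked directly. The sketch below is written for Python 3, where text is Unicode by default; the variable names are arbitrary.

```python
euro_price = "\u20ac100"                  # EURO SIGN followed by digits

# ISO-8859-15 assigns the Euro sign to a single byte.
encoded = euro_price.encode("iso-8859-15")

# ISO-8859-1 has no Euro sign, so encoding fails.
try:
    euro_price.encode("iso-8859-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

print(encoded, latin1_ok)
```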

In Python, you express a literal string as:

'this is a literal string'
"this is another string"

String values can be enclosed in either single or double quotes. The two different kinds of quotes work the same way, but having both allows you to include one kind of quotes inside of a string specified with the other kind of quotes, without needing to escape them with the backslash character:

'isn\'t that grand'
"isn't that grand"

To have a string span multiple lines, you can use a backslash as the last character on the line, which indicates that the next line is a continuation:

big = "This is a long string\
that spans two lines."

You must embed newlines in the string if you want the string to output on two lines:

big = "This is a long string\n\
that prints on two lines."

Another approach is to enclose the string in a pair of matching triple quotes (either single or double):

bigger = """
This is an even
bigger string that
spans three lines.
"""

In this case, you don't need to use the continuation character, and line breaks in the string literal are preserved as newline characters in the resulting Python string object. You can also make a string a "raw" string by preceding it with an r or R:

big = r"This is a long string\
with a backslash and a newline in it"

With a raw string, backslash escape sequences are left alone, rather than being interpreted. Finally, you can precede a string with a u or U to make it a Unicode string:

hello = u'Hello\u0020World'

Strings are immutable, which means that no matter what operation you do on a string, you will always produce a new string object, rather than mutating the existing string. A string is a sequence of characters, which means that you can access a single character:

mystr = "my string"
mystr[0]        # 'm'
mystr[-2]       # 'n'

You can also access a portion of the string with a slice:

mystr[1:4]      # 'y s'
mystr[3:]       # 'string'
mystr[-3:]      # 'ing'

You can loop on a string's characters:

for c in mystr:
    print(c)

This will bind c to each of the characters in mystr in turn. You can form another sequence:

list(mystr)     # returns ['m','y',' ','s','t','r','i','n','g']

You can concatenate strings by addition:

mystr+'oid'     # 'my stringoid'

You can also repeat strings by multiplication:

'xo'*3          # 'xoxoxo'

In general, you can do anything to a string that you can do to a sequence, as long as it doesn't require changing the sequence, since strings are immutable.
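
The following sketch illustrates the point: operations that would change the string in place fail, while operations that only read it or build a new string work as on any other sequence. The variable names are arbitrary.

```python
mystr = "my string"

try:
    mystr[0] = "M"            # item assignment would change the string in place
    assignment_allowed = True
except TypeError:
    assignment_allowed = False

# Build a new string instead of modifying the old one.
capitalized = "M" + mystr[1:]

# Non-mutating sequence operations work as on any other sequence.
length = len(mystr)
has_space = " " in mystr
backwards = mystr[::-1]
```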

String objects have many useful methods. For example, you can test a string's contents with s.isdigit, which returns true if s is not empty and all of the characters in s are digits (otherwise, it returns false). You can produce a new modified string with a method such as s.upper, which returns a new string that is like s, but with every letter changed into its uppercase equivalent. You can search for a string inside another with haystack.count("needle"), which returns the number of times the substring "needle" appears in the string haystack. When you have a large string that spans multiple lines, you can split it into a list of single-line strings with splitlines:

list_of_lines = one_large_string.splitlines()

And you can produce the single large string again with join:

one_large_string = '\n'.join(list_of_lines)

The recipes in this chapter show off many methods of the string object. You can find complete documentation in Python's Library Reference.
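
The methods mentioned above can be tried together in a short session such as this one; the sample strings are made up.

```python
s = "12345"
digits_only = s.isdigit()          # every character is a digit
empty_is_digit = "".isdigit()      # an empty string does not qualify

shout = "hello, world".upper()     # a new string; the original is unchanged

haystack = "needle in a haystack, another needle"
needle_count = haystack.count("needle")

one_large_string = "first\nsecond\nthird"
list_of_lines = one_large_string.splitlines()
rejoined = "\n".join(list_of_lines)
```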

Strings in Python can also be manipulated with regular expressions, via the re module. Regular expressions are a powerful (but complicated) set of tools that you may already be familiar with from another language (such as Perl), or from the use of tools such as the vi editor and text-mode commands such as grep. You'll find a number of uses of regular expressions in recipes in the second half of this chapter. For complete documentation, see the Library Reference. Mastering Regular Expressions, by J. E. F. Friedl (O'Reilly), is also recommended if you do need to master this subject: Python's regular expressions are basically the same as Perl's, which Friedl covers thoroughly.
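
A brief taste of the re module is sketched below. The pattern is deliberately simplistic and the sample text is made up; it is meant only to show searching and substitution, not to validate real email addresses.

```python
import re

text = "Contact: alice@example.com, bob@example.org"

# A deliberately simple pattern for illustration; real address syntax is richer.
pattern = re.compile(r"(\w+)@(\w+\.\w+)")

# findall returns one tuple of groups per match.
pairs = pattern.findall(text)

# sub builds a new string; \1 refers back to the first group.
masked = pattern.sub(r"\1@***", text)
```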

Python's standard module string offers much of the same functionality that is available from string methods, packaged up as functions instead of methods. The string module also offers additional functions, such as the useful string.maketrans function that is demonstrated in a few recipes in this chapter, and helpful string constants (string.digits, for example, is '0123456789'). The string-formatting operator, %, provides a handy way to put strings together and to obtain precisely formatted strings from such objects as floating-point numbers. Again, you'll find recipes in this chapter that show how to use % for your purposes. Python also has lots of standard and extension modules that perform special processing on strings of many kinds, although this chapter doesn't cover such specialized resources.
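
The formatting operator and a translation table can be sketched as follows. One assumption to note: the example is written for Python 3, where the table is built with `str.maketrans`; the `string.maketrans` function named above is the Python 2 spelling. The sample values are arbitrary.

```python
import string

# % formatting: width, precision, and padding are spelled in the format string.
line = "%-8s|%6.2f|%04d" % ("pi", 3.14159, 42)

# A translation table mapping each digit to '#'.
# Python 3 spells this str.maketrans; string.maketrans is the Python 2 form.
table = str.maketrans(string.digits, "#" * len(string.digits))
masked = "call 555-0199".translate(table)

print(line)
print(masked)
```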

PyCkBk-3-1 (last edited 2009-12-25 07:18:30 by localhost)