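This article is from the Python Cookbook.

The translation was made solely for personal study and has no connection with any commercial copyright dispute.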

-- 61.182.251.99 [2004-09-21 22:29:37]

Description

Processing Every Word in a File

Credit: Luther Blissett

Problem

You need to do something to every word in a file, similar to the foreach function of csh.

Solution

This is best handled by two nested loops, one on lines and one on the words in each line:

    # Python 2 idiom: xreadlines() was removed in Python 3
    for line in open(thefilepath).xreadlines():    # method 1
        for word in line.split():
            dosomethingwith(word)

This implicitly defines words as sequences of non-whitespace characters separated by sequences of whitespace (just as the Unix program wc does).
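As a minimal sketch of the same nested loops in current Python 3, the snippet below iterates over the file object directly and uses a with statement so the file is closed afterwards. The helper name process_words, the handle_word callback, and the temporary sample file are assumptions made for illustration; they are not part of the recipe.

```python
import os
import tempfile

def process_words(path, handle_word):
    """Call handle_word on every whitespace-separated word in the file."""
    with open(path, encoding="utf-8") as f:    # closed automatically on exit
        for line in f:                         # file objects iterate line by line
            for word in line.split():          # no argument: split on runs of whitespace
                handle_word(word)

# Demonstration on a throwaway file (hypothetical sample data).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("two  words\n\n  and\tthree more\n")

seen = []
process_words(tmp.name, seen.append)
os.remove(tmp.name)
print(seen)  # ['two', 'words', 'and', 'three', 'more']
```

Repeated spaces, tabs, and blank lines produce no empty words, which is the wc-like behavior described above.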

For other definitions of words, you can use regular expressions. For example:

    import re
    re_word = re.compile(r'[\w-]+')

    for line in open(thefilepath).xreadlines():
        for word in re_word.findall(line):
            dosomethingwith(word)

In this case, a word is defined as a maximal sequence of alphanumerics and hyphens. (Translator's note: presumably this refers to greedy matching?)
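To make the difference between the two word definitions concrete, here is a small sketch that applies both to the same line. The sample sentence is an assumption chosen for illustration. Note that \w also matches the underscore, so underscores count as word characters under this pattern.

```python
import re

re_word = re.compile(r'[\w-]+')

# Hypothetical sample line, not taken from the recipe.
line = "state-of-the-art parsing, isn't it?"

by_whitespace = line.split()
by_regex = re_word.findall(line)

print(by_whitespace)  # ['state-of-the-art', 'parsing,', "isn't", 'it?']
print(by_regex)       # ['state-of-the-art', 'parsing', 'isn', 't', 'it']
```

Whitespace splitting keeps punctuation attached to words, while the regular expression drops it but also breaks "isn't" at the apostrophe. A different definition of a word would need the apostrophe added to the character class.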

Discussion

For other definitions of words you will obviously need different regular expressions. The outer loop, on all lines in the file, can of course be done in many ways. The xreadlines method is good, but you can also use the list obtained by the readlines method, the standard library module fileinput, or, in Python 2.2 and later, simply iterate over the file object itself:

    for line in open(thefilepath):

which is the simplest and fastest form.
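The alternatives for the outer loop can be compared side by side. The following is a sketch in Python 3, where xreadlines no longer exists. The function names and the temporary sample file are assumptions for illustration only.

```python
import fileinput
import os
import tempfile

def words_via_readlines(path):
    """Read all lines into a list first, then split each line."""
    with open(path) as f:
        lines = f.readlines()              # the entire file is held in memory
    return [word for line in lines for word in line.split()]

def words_via_fileinput(paths):
    """Treat one or more files as a single stream of lines."""
    with fileinput.input(files=paths) as stream:
        return [word for line in stream for word in line.split()]

def words_via_iteration(path):
    """Iterate over the file object itself, one line at a time."""
    with open(path) as f:
        return [word for line in f for word in line.split()]

# Hypothetical sample file used only for this comparison.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("alpha beta\ngamma\n")
sample_path = tmp.name

a = words_via_readlines(sample_path)
b = words_via_fileinput([sample_path])
c = words_via_iteration(sample_path)
os.remove(sample_path)
print(a)            # ['alpha', 'beta', 'gamma']
print(a == b == c)  # True
```

All three produce the same words. The readlines variant loads the whole file at once, while the other two read it line by line, which matters mainly for large files.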

In Python 2.2 and later, it's often a good idea to wrap iterations as iterator objects, most commonly by simple generators:

    from __future__ import generators    # needed only in Python 2.2

    def words_of_file(thefilepath):
        for line in open(thefilepath):
            for word in line.split():
                yield word

    for word in words_of_file(thefilepath):
        dosomethingwith(word)

This approach lets you separate, cleanly and effectively, two different concerns: how to iterate over all items (in this case, words in a file) and what to do with each item in the iteration. (Translator's note: compare separation of concerns in AOP.)

Once you have cleanly encapsulated iteration concerns in an iterator object (often, as here, a generator), most of your uses of iteration become simple for statements.

You can often reuse the iterator in many spots in your program, and if maintenance is ever needed, you can then perform it in just one place (the definition of the iterator) rather than having to hunt for all uses.

The advantages are thus very similar to those you obtain, in any programming language, by appropriately defining and using functions rather than copying and pasting pieces of code all over the place.

With Python 2.2's iterators, you can get these advantages for looping control structures, too.
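The reuse argument can be illustrated with a short sketch in which one generator feeds two unrelated consumers. This is a Python 3 rendering of words_of_file (using a with statement and yield from, neither of which appears in the recipe), and the sample file contents are an assumption for illustration.

```python
import os
import tempfile
from collections import Counter

def words_of_file(path):
    """Yield every whitespace-separated word in the file, one at a time."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield from line.split()

# Hypothetical sample file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("the cat saw the other cat\nthe end\n")
path = tmp.name

# Two independent uses of the same iteration logic.
counts = Counter(words_of_file(path))        # word frequencies
longest = max(words_of_file(path), key=len)  # first word of maximal length

os.remove(path)
print(counts["the"], counts["cat"])  # 3 2
print(longest)                       # other
```

If the definition of a word later changes, for example to the regular expression shown earlier, only words_of_file needs editing. Neither consumer has to change.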

See Also

The fileinput module section of the Python documentation;

PEP 255, on simple generators: http://www.python.org/peps/pep-0255.html

Perl Cookbook Recipe 8.3

PyCkBk-4-8 (last edited 2009-12-25 07:16:21 by localhost)