This article is taken from the Python Cookbook.

The translation was made for personal study only and has no connection to any commercial copyright dispute.

-- 0.706 [2004-09-22 19:30:33]

1. Replacing Multiple Patterns in a Single Pass

Credit: Xavier Defrang

1.1. Problem

You need to perform several string substitutions on a string.

1.2. Solution

Sometimes regular expressions afford the fastest solution even in cases where their applicability is anything but obvious. In particular, the sub method of re objects makes regular expressions a good way to perform string substitutions. Here is how you can produce a result string from an input string where each occurrence of any key in a given dictionary is replaced by the corresponding value in the dictionary:

    # requires Python 2.1 or later
    from __future__ import nested_scopes

    import re

    # the simplest, lambda-based implementation
    def multiple_replace(adict, text):
        # Create a regular expression from all of the dictionary keys
        regex = re.compile("|".join(map(re.escape, adict.keys())))

        # For each match, look up the corresponding value in the dictionary
        return regex.sub(lambda match: adict[match.group(0)], text)

A more powerful and flexible approach is to wrap the dictionary in a callable object that directly supports the lookup-and-replace idea, and to pass that object as the callback to the sub method. This object-oriented approach is more flexible because the callable object can keep its own state and is therefore easy to extend to other tasks. In Python 2.2 and later, you can create a class for this object by extending the dict built-in type, while in older Python versions you must fall back on UserDict.UserDict (built-in types were not subclassable in older versions). A try/except lets us easily write code that works optimally on both old and new versions of Python:

    # Use the built-in dict where it exists; otherwise fall back on UserDict
    try:
        dict
    except NameError:
        from UserDict import UserDict as dict

    class Xlator(dict):
        """ All-in-one multiple-string-substitution class """
        def _make_regex(self):
            """ Build re object based on the keys of the current dictionary """
            return re.compile("|".join(map(re.escape, self.keys())))

        def __call__(self, match):
            """ Handler invoked for each regex match """
            return self[match.group(0)]

        def xlat(self, text):
            """ Translate text, returns the modified text. """
            return self._make_regex().sub(self, text)

1.3. Discussion

This recipe shows how to use the Python standard re module to perform single-pass multiple-string substitution using a dictionary. Let's say you have a dictionary-based, one-to-one mapping between strings. The keys are the set of strings (or regular-expression patterns) you want to replace, and the corresponding values are the strings with which to replace them. You can perform the substitution by calling re.sub for each key/value pair in the dictionary, thus processing and creating a new copy of the whole text several times, but it is clearly better to do all the changes in a single pass, processing and creating a copy of the text only once. Fortunately, re.sub's callback facility makes this better approach quite easy.
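Speed is not the only difference between the two approaches. The sketch below is an illustration only: the names sequential_replace and single_pass_replace are hypothetical, and the sequential version uses str.replace in place of a re.sub call per pair for brevity. It shows that when a replacement value is itself a key, repeated passes rewrite the output of earlier passes, while a single pass does not.

```python
import re

def sequential_replace(adict, text):
    # One full pass over the text per key; later passes can
    # rewrite text produced by earlier passes.
    for key, value in adict.items():
        text = text.replace(key, value)
    return text

def single_pass_replace(adict, text):
    # One alternation built from all escaped keys; replaced text
    # is never scanned again.
    regex = re.compile("|".join(map(re.escape, adict)))
    return regex.sub(lambda match: adict[match.group(0)], text)

swap = {"cat": "dog", "dog": "cat"}
text = "cat chases dog"

sequential_result = sequential_replace(swap, text)
single_pass_result = single_pass_replace(swap, text)
print(sequential_result)
print(single_pass_result)
```

The single-pass version swaps the two words, giving "dog chases cat". The sequential version cannot, because its second pass undoes or duplicates the work of the first.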

First, we have to build a regular expression from the set of keys we want to match. Such a regular expression is a pattern of the form "a1|a2|...|an" and can easily be generated using a one-liner, as shown in the recipe. Then, instead of giving re.sub a replacement string, we call it with a callback argument. re.sub calls this object for each match, with a re.MatchObject as its only argument, and expects the replacement string as the call's result. In our case, the callback just has to look up the matched text in the dictionary and return the corresponding value.
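The two steps just described can be sketched in isolation. The dictionary contents and the name replace_match are made up for this illustration. A key containing a regular-expression metacharacter is included to show why the recipe passes every key through re.escape before joining.

```python
import re

adict = {"a.b": "X", "c": "Y"}

# Step 1: build the "a1|a2|...|an" pattern. re.escape keeps the "."
# in "a.b" literal, so the pattern will not match "aXb".
pattern = "|".join(map(re.escape, adict))

# Step 2: a callback that receives one match object per match and
# returns the replacement string for it.
def replace_match(match):
    return adict[match.group(0)]

result = re.sub(pattern, replace_match, "a.b aXb c")
print(pattern)
print(result)
```

Here the pattern is `a\.b|c`, and the result is "X aXb Y": the literal "a.b" and the "c" are replaced, while "aXb" is left alone.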

The recipe has two implementations: one is lambda-based, and the other uses a callable, dictionary-like object. The second option is better if you want to perform additional processing on each match (e.g., build a histogram of the number of times each possible substitution is actually performed) or if you just dislike lambda. Another potential advantage of the class-based approach is performance. If you know that the translation dictionary is static, and you must apply the same translation to several input strings, you can move the _make_regex call from the xlat method, where it's currently done, to an __init__ method, to avoid repeatedly preparing and compiling the regular expression.
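Both ideas in the paragraph above, counting substitutions and compiling the regular expression once, can be combined in one class. This is a minimal sketch under stated assumptions, not part of the original recipe: the class name CountingXlator and its counts attribute are hypothetical, it is written for Python 3 (super() without arguments, collections.Counter), and it assumes the dictionary is not modified after construction.

```python
import re
from collections import Counter

class CountingXlator(dict):
    """Hypothetical Xlator variant: compiles once, counts substitutions."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Compiled once here instead of on every xlat call. This is
        # only valid while the dictionary keys stay unchanged.
        self._regex = re.compile("|".join(map(re.escape, self.keys())))
        # Histogram of how often each key was actually replaced.
        self.counts = Counter()

    def __call__(self, match):
        key = match.group(0)
        self.counts[key] += 1
        return self[key]

    def xlat(self, text):
        return self._regex.sub(self, text)

xl = CountingXlator({"Perl": "Python", "creator": "BDFL"})
translated = xl.xlat("Perl creator, Perl fan")
print(translated)
print(dict(xl.counts))
```

The state kept in counts is exactly what the lambda-based function has no natural place for. The price of compiling in __init__ is that later changes to the dictionary are not reflected in the regular expression.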

Here's a usage example for each half of this recipe. We would normally have it as a part of the same .py source file as the function and class shown in the recipe, so it is guarded by the traditional Python idiom that runs it if and only if the module is called as a main script:

    if __name__ == "__main__":
        text = "Larry Wall is the creator of Perl"
        adict = {
          "Larry Wall" : "Guido van Rossum",
          "creator" : "Benevolent Dictator for Life",
          "Perl" : "Python",
        }

        # Both calls print:
        # Guido van Rossum is the Benevolent Dictator for Life of Python
        print(multiple_replace(adict, text))

        xlat = Xlator(adict)
        print(xlat.xlat(text))

Substitutions such as those performed by this recipe are often intended to operate on entire words, rather than on arbitrary substrings. Regular expressions are good at picking up the beginnings and endings of words, thanks to the special sequence r'\b'. Thus, we can easily make a version of the Xlator class that is constrained to substitute only entire words:

    class WordXlator(Xlator):
        """ An Xlator version to substitute only entire words """
        def _make_regex(self):
            return re.compile(
              r'\b'+r'\b|\b'.join(map(re.escape, self.keys()))+r'\b')
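The effect of the r'\b' anchors is easiest to see side by side. The following sketch is self-contained and uses a hypothetical helper, make_regex, that builds either the plain pattern or the word-bounded one in the same way as the two _make_regex methods. The sample dictionary and text are made up for the illustration.

```python
import re

def make_regex(keys, whole_words=False):
    # Hypothetical helper mirroring Xlator._make_regex (plain) and
    # WordXlator._make_regex (word-bounded).
    escaped = map(re.escape, keys)
    if whole_words:
        return re.compile(r'\b' + r'\b|\b'.join(escaped) + r'\b')
    return re.compile("|".join(escaped))

adict = {"cat": "dog"}
text = "cat concatenate cats"

def lookup(match):
    return adict[match.group(0)]

substring_result = make_regex(adict).sub(lookup, text)
whole_word_result = make_regex(adict, whole_words=True).sub(lookup, text)
print(substring_result)
print(whole_word_result)
```

The plain pattern rewrites every occurrence and yields "dog condogenate dogs", while the word-bounded pattern yields "dog concatenate cats". Note that r'\b' only matches between a word character and a non-word character, so keys that begin or end with punctuation may not match as whole words in the way one might expect.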

Note how much easier it is to customize Xlator than it would be to customize the multiple_replace function. Ease of customization by subclassing and overriding helps you avoid copy-and-paste coding, and this is another excellent reason to prefer object-oriented structures over simpler procedural ones. Of course, just because some functionality is packaged up as a class doesn't magically make it customizable in just the way you want. Customizability also takes some foresight in dividing the functionality into separately overridable methods that correspond to the right pieces of overall functionality. Fortunately, you don't have to get it right the first time; when code does not have the optimal internal structure for the task at hand (e.g., reuse by subclassing and selective overriding), you can and should refactor the code so that its internal structure serves your needs. Just make sure you have a suitable battery of tests ready to run to ensure that your refactoring hasn't broken anything, and then you can refactor to your heart's content. See http://www.refactoring.com for more information on the important art and practice of refactoring.

1.4. See Also

Documentation for the re module in the Library Reference; the Refactoring home page (http://www.refactoring.com).