This article is from the *Python Cookbook*.


-- Zoom.Quiet [2004-08-11 17:15:31]


2.10 Performing Frequent Membership Tests on a Sequence


Credit: Alex Martelli

Problem

You need to perform frequent tests for membership in a sequence. The O(N) behavior of repeated in operators hurts performance, but you can't switch to using just a dictionary, as you also need the sequence's order.

Solution

Say you need to append items to a list only if they're not already in the list. The simple, naive solution is excellent but may be slow:

    def addUnique1(baseList, otherList):
        for item in otherList:
            if item not in baseList:
                baseList.append(item)

If otherList is large, it may be faster to build an auxiliary dictionary:


    def addUnique2(baseList, otherList):
        auxDict = {}
        for item in baseList:
            auxDict[item] = None
        for item in otherList:
            if not auxDict.has_key(item):
                baseList.append(item)
                auxDict[item] = None

For a list on which you must often perform membership tests, it may be best to wrap the list, together with its auxiliary dictionary, into a class. You can then define a special __contains__ method to speed the in operator. The dictionary must be carefully maintained to stay in sync with the sequence. Here's a version that does the syncing just in time, when a membership test is required and the dictionary is out of sync, and works with Python 2.1 or later:

    from __future__ import nested_scopes
    import UserList
    try: list.__getitem__
    except AttributeError: Base = UserList.UserList
    else: Base = list

    class FunkyList(Base):
        def __init__(self, initlist=None):
            Base.__init__(self, initlist)
            self._dict_ok = 0
        def __contains__(self, item):
            if not self._dict_ok:
                self._dict = {}
                # Use a distinct loop variable so the 'item' argument
                # is not shadowed while the dictionary is rebuilt
                for elem in self:
                    self._dict[elem] = 1
                self._dict_ok = 1
            return self._dict.has_key(item)

    def _wrapMethod(methname):
        _method = getattr(Base, methname)
        def wrapper(self, *args):
            # Reset 'dictionary OK' flag, then delegate
            self._dict_ok = 0
            return _method(self, *args)
        setattr(FunkyList, methname, wrapper)

    for meth in 'setitem delitem setslice delslice iadd'.split():
        _wrapMethod('__%s__' % meth)
    for meth in 'append insert pop remove extend'.split():
        _wrapMethod(meth)
    del _wrapMethod

Discussion

Python's in operator is extremely handy, but it's O(N) when applied to an N-item sequence. If a sequence is subject to frequent in tests, and the items are hashable, an auxiliary dictionary at the sequence's side can provide a significant performance boost. A membership check (using the in operator) on a sequence of N items is O(N); if M such tests are performed, the overall time is O(M x N). Preparing an auxiliary dictionary whose keys are the sequence's items is also roughly O(N), but the M tests are roughly O(M), so overall we have roughly O(N+M). This is rather less than O(N x M) and can thus offer a very substantial performance boost when M and N are large.

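The same O(N+M) idea is even more direct in modern Python, where a built-in set can play the auxiliary dictionary's role; a minimal sketch (the set-based helper `addUnique3` is our own, not part of the original recipe):

```python
def addUnique3(baseList, otherList):
    # Seed the auxiliary set with the sequence's current items: O(N).
    seen = set(baseList)
    for item in otherList:
        # Each membership test on the set is O(1) on average,
        # so M tests cost roughly O(M) overall.
        if item not in seen:
            baseList.append(item)
            seen.add(item)
```

For example, starting from `base = [1, 2, 3]`, calling `addUnique3(base, [3, 4, 2, 5])` leaves `base` as `[1, 2, 3, 4, 5]`, preserving the original order while skipping duplicates.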
Even better overall performance can often be obtained by permanently placing the auxiliary dictionary alongside the sequence, encapsulating both into one object. However, in this case, the dictionary must be maintained as the sequence is modified, so that it stays in sync with the actual membership of the sequence.

The FunkyList class in this recipe, for example, extends list (UserList in Python 2.1) and delegates every method to it. However, each method that can modify list membership is wrapped in a closure that clears the flag asserting that the auxiliary dictionary is in sync. The in operator calls the __contains__ method when it is applied to an instance that has such a method. The __contains__ method rebuilds the auxiliary dictionary, unless the flag is set, proving that the rebuilding is unnecessary.

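The flag-and-rebuild mechanism can be seen in miniature in a Python 3 re-rendering of the recipe (our own condensation, not the book's code; the `rebuilds` counter is added purely as instrumentation to make the just-in-time syncing visible):

```python
class FunkyList(list):
    """A list whose __contains__ uses a lazily rebuilt auxiliary set."""
    def __init__(self, iterable=()):
        list.__init__(self, iterable)
        self._set_ok = False
        self.rebuilds = 0          # instrumentation, not in the recipe
    def __contains__(self, item):
        if not self._set_ok:
            self._aux = set(self)  # just-in-time resync
            self._set_ok = True
            self.rebuilds += 1
        return item in self._aux

def _wrapMethod(methname):
    _method = getattr(list, methname)
    def wrapper(self, *args):
        # Any mutation may change membership: clear the 'in sync' flag.
        self._set_ok = False
        return _method(self, *args)
    setattr(FunkyList, methname, wrapper)

for meth in ('append', 'insert', 'pop', 'remove', 'extend',
             '__setitem__', '__delitem__', '__iadd__'):
    _wrapMethod(meth)
del _wrapMethod
```

A burst of appends followed by many in tests triggers only one rebuild, which is exactly the amortization the recipe is after.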
If our program needs to run only on Python 2.2 and later versions, we can rewrite the __contains__ method in a much better way:

    def __contains__(self, item):
        if not self._dict_ok:
            self._dict = dict(zip(self, self))
            self._dict_ok = 1
        return item in self._dict

The built-in type dict, new in Python 2.2, lets us build the auxiliary dictionary faster and more concisely. Furthermore, the ability to test for membership in a dictionary directly with the in operator, also new in Python 2.2, has similar advantages in speed, clarity, and conciseness.

Instead of building and installing the wrapping closures for all the mutating methods of the list into the FunkyList class with the auxiliary function _wrapMethod, we could simply write all the needed defs for the wrapper methods in the body of FunkyList, with the advantage of extending backward portability to Python versions even older than 2.1. Indeed, this is how I tackled the problem in the first version of this recipe that I posted to the online Python cookbook. However, the current version of the recipe has the important advantage of minimizing boilerplate (repetitious plumbing code that is boring and voluminous and thus a likely home for bugs). Python's advanced abilities for introspection and dynamic modification give you a choice: you can build method wrappers, as this recipe does, in a smart and concise way, or you can choose to use the boilerplate approach anyway, if you don't mind repetitious code and prefer to avoid what some would call the "black magic" of advanced introspection and dynamic modification of class objects.

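For comparison, the boilerplate alternative simply spells each wrapper out by hand; a hypothetical sketch (class name `FunkyListVerbose` is ours) covering just two mutators makes the repetitiveness obvious:

```python
class FunkyListVerbose(list):
    # Hypothetical sketch of the boilerplate approach the text describes:
    # each mutating method is written out explicitly instead of being
    # generated by an auxiliary function such as _wrapMethod.
    def __init__(self, iterable=()):
        list.__init__(self, iterable)
        self._set_ok = False
    def __contains__(self, item):
        if not self._set_ok:
            self._aux = set(self)
            self._set_ok = True
        return item in self._aux
    def append(self, item):
        self._set_ok = False
        list.append(self, item)
    def remove(self, item):
        self._set_ok = False
        list.remove(self, item)
    # ...and so on for insert, pop, extend, __setitem__, __delitem__, etc.
```

Every additional mutator repeats the same two lines of plumbing, which is precisely the kind of code the introspective `_wrapMethod` version factors out.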
Performance characteristics depend on the actual pattern of membership tests versus membership modifications, and some careful profiling may be required to find the right approach for a given use. This recipe, however, caters well to a rather common pattern of use, where sequence-modifying operations tend to happen in bunches, followed by a period in which no sequence modification is performed, but several membership tests may be performed.

Rebuilding the dictionary when needed is far simpler than incrementally maintaining it at each sequence-modifying step. Incremental maintenance requires careful analysis of what is being removed and of what is inserted, particularly upon such operations as slice assignment. If that strategy is desired, the values in the dictionary should probably be a count of the number of occurrences of each key's value in the sequence. A list of the indexes in which the value is present is another possibility, but that takes even more work to maintain. Depending on usage patterns, the strategy of incremental maintenance can be substantially faster or slower.

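An incremental-maintenance sketch for just append and remove, keeping occurrence counts as the text suggests (class name `CountedList` and its narrow scope are our own; slice operations are deliberately left out because they need much more careful bookkeeping):

```python
class CountedList(list):
    # Hypothetical sketch: keeps a value -> occurrence-count dict in sync
    # incrementally instead of rebuilding it on demand.
    def __init__(self, iterable=()):
        list.__init__(self, iterable)
        self._counts = {}
        for item in self:
            self._counts[item] = self._counts.get(item, 0) + 1
    def append(self, item):
        list.append(self, item)
        self._counts[item] = self._counts.get(item, 0) + 1
    def remove(self, item):
        list.remove(self, item)    # raises ValueError if absent
        self._counts[item] -= 1
        if not self._counts[item]:
            del self._counts[item]
    def __contains__(self, item):
        return item in self._counts
```

Counts (rather than a plain presence flag) matter here: removing one of two duplicates must leave the item a member, and only deleting the key when the count reaches zero gets that right.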
Of course, all of this is necessary only if the sequence itself is needed (i.e., if the order of items in the sequence is significant). Otherwise, keeping just the dictionary is obviously simpler and more effective. Again, the dictionary can map values to counts, if you need the data structure to be, in mathematical terms, a bag rather than a set.

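In modern Python, collections.Counter provides exactly this value-to-count mapping for the bag case; a brief sketch:

```python
from collections import Counter

# A bag keeps duplicate counts; order is lost, membership stays O(1).
bag = Counter(['a', 'b', 'a', 'c', 'a'])
assert bag['a'] == 3        # 'a' occurs three times
assert 'b' in bag           # membership test on the mapping
bag['a'] -= 1               # remove one occurrence of 'a'
bag += Counter()            # drop entries whose count fell to zero
assert bag['a'] == 2
```

A built-in set covers the plain-set case the same way, with membership but no counts.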
An important requisite for any of these membership-test optimizations is that the values in the sequence must be hashable (otherwise, of course, they cannot be keys in a dictionary). For example, a list of tuples might be subjected to this recipe's treatment, but for a list of lists the recipe as it stands is not applicable. You can sometimes use cPickle.dumps to create dictionary keys (or, for somewhat different application needs, the object's id), but neither workaround is always fully applicable. In the case of cPickle.dumps, even when it is applicable, the overhead may negate some or most of the optimization.

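A sketch of that pickle workaround for unhashable items (the helper name is hypothetical, and modern pickle stands in for the old cPickle; note it only works when equal objects pickle to equal byte strings, which holds for plain lists of primitives):

```python
import pickle

def addUniqueUnhashable(baseList, otherList):
    # Hypothetical helper: key the auxiliary set on each item's pickle,
    # since the items themselves (e.g. lists) are not hashable.
    seen = {pickle.dumps(item) for item in baseList}
    for item in otherList:
        key = pickle.dumps(item)
        if key not in seen:
            baseList.append(item)
            seen.add(key)
```

As the text warns, each pickle.dumps call has a real cost, so profiling is needed to confirm this beats the naive O(N) scan for a given workload.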
See Also

The Library Reference sections on sequence types and mapping types.

PyCkBk-2-10 (last edited 2009-12-25 07:09:38 by localhost)