文章来自《Python cookbook》.

翻译仅仅是为了个人学习,其它商业版权纠纷与此无关!

-- Zoom.Quiet [2004-08-11 17:11:32]

不用循环从一个list中选择随机的元素

2.9 Selecting Random Elements from a List Without Repetition

不用循环从一个list中选择随机的元素

Credit: Iuri Wickert, Duncan Grisby, Steve Holden, Alex Martelli

问题 Problem

You need to consume, in random order, the items of a rather long list, and the most direct approach is painfully slow.

你需要按随机顺序处理一个相当长的list的条目并且多数直接的方法是慢的令人痛苦。

解决 Solution

While it's a common mistake to be overly concerned with speed, you should not ignore the different performances of various algorithms. Suppose we must process all of the items in a long list in random order, without repetition. For example:

虽然极度关心速度是一个普遍错误,但你不应该忽略不同算法之间不同的性能。假定我们必须不用循环在一个长的list里以随机顺序处理所有的条目。例如:

   1 import random
   2 
   3 # an example of a processing function
   4 def process(datum): print datum
   5 # an example of a set of data to process
   6 data = range(10000)
   7 #The simplest version is very slow:
   8 
   9 def simple(  ):
  10     while data:
  11         # returns element, not index (doesn't help)
  12         elem = random.choice(data) 
  13         # needs to search for the element throughout the list!
  14         data.remove(elem) 
  15         process(elem)

Here is a faster version:

这里是一个更快的版本:

   1 def faster(  ):
   2     while data:
   3         index = random.randrange(len(data))
   4         elem = data[index]
   5         # direct deletion, no search needed
   6         del data[index] 
   7         process(elem)
   8 
   9 # the same, but preserving the data list
  10 def faster_preserve(  ):
  11     aux = range(len(data))
  12     while aux:
  13         posit = random.randrange(len(aux))
  14         index = aux[posit]
  15         elem = data[index]
  16         # alters the auxiliary list only
  17         del aux[posit] 
  18         process(elem)

However, the key improvement is to switch to an O(N) algorithm:

然而,关键的改进是切换到一个O(N)算法。

   1 def improved(  ):
   2     size = len(data)
   3     while size:
   4         index = random.randrange(size)
   5         elem = data[index]
   6         data[index] = data[size-1]
   7         size = size - 1
   8         process(elem)

Of course, you can also implement a version of this that preserves the data list.

当然,你也能实现它的保持数据list的版本

But the winner is the version that appears to be the simplest:

但是,胜利者是看上去最简单的版本:

   1 def best(  ):
   2     random.shuffle(data)
   3     for elem in data: process(elem)
   4 
   5 # or, if you need to preserve the data list's original ordering:
   6 def best_preserve(  ):
   7     aux = list(data)
   8     random.shuffle(aux)
   9     for elem in aux: process(elem)

讨论 Discussion

The simplest, most direct way of consuming a list in a random fashion is painfully slow for lists with a few hundred elements. While it is tempting to use the simple, clear choice/remove combination, as in the simple function, this is a bad choice, because remove must linearly search through the list to find the element to delete. In other words, the overall performance is O(N2), with a large multiplicative constant.

在一个随机样式中,使用一个list最简单,最直接的方法对具有好几百个元素的list来说是慢的令人痛苦。 当它正企图去使用简单,清晰的choice/remove联合,象在simple函数里一样, 这是一个很坏的选择。因为remove必须通过list来线性查找去发现需要删除的元素。换句话说,全部的性能是O(N2),一个大的倍增的常量。

Fortunately, there are equally simple (or simpler), but much faster, ways to do this. The faster function, using randrange/del to generate the random indexes for the list, can skip the costly search for the deletion. If it's important to preserve the input list, you can use a disposable auxiliary list to generate the data list indexes, as in the faster_preserve function.

幸运的是,有一些同样simple(或者simpler),但是更快速的方法可以做这个。Faster函数使用randrange/del去为list 产生随机的索引,能够为删除跳过昂贵的查找。如果保持输入list是重要的,你能使用任意一个faster_preserve函数。

However, del anylist[x] for a random x is still O(N), so overall performance is still O(N2), albeit with a much smaller multiplicative constant. Incidentally, the pseudorandom order in which items are processed is not the same with the various approaches, even if random is seeded in the same way. All of the orderings are equally pseudorandom, though.

然而, 为一个随机的x来del anylist[x]仍然是O(N)。所以总体性能仍然是O(N2),虽然是一个更小一点的配增常量。顺便地,即使如果在相同的方法里随机数被作为种子, 在它的元素中被处理的伪随机顺序没有同样多种的方法。

Pursuing O(N) performance, one possibility is not to delete an item from a list at all, but rather to overwrite it with the last item and decrease at each step the number of items from which we're choosing. The improved function takes this tack and benefits greatly from it, in terms of performance.

追踪O(N)性能,一个可能性是根本不从一个list去删除一个元素,而是从我们选择的list 里,用最后一个元素去覆盖它并且在每一步减少元素的数量。 根据性能,improved函数从这里能够获取好处和利益

The fastest approach, however, is to shuffle the data (or an auxiliary copy of it) once, at the start, then just process the items sequentially in the resulting pseudorandom order. The nice thing about this approach, shown in the best and best_preserve functions, is that it's actually the simplest of all.

然而,最快的方法是一旦在开始重新整理数据(或者一个它的辅助拷贝),那么仅仅在结果的伪随机顺序里继续处理元素。关于这个方法最后的事情,显示在best和best_preserve函数,它是所有方法里真正最简单的。

On lists of 10,000 items, as shown in this recipe, the overhead (meaning pure overhead, using a do-nothing processing function) of simple is about 13 or 14 times more than that of faster and faster_preserve. Those functions, in turn, have over twice as much overhead as improved, best, and best_preserve. On lists of 100,000 items, faster and faster_preserve become about 15 times slower than improved, best, and best_preserve. The latter two have, for every list size, about 20%-30% less overhead than improved梐 very minor issue, although their utter simplicity clearly does make them deserve their names.

10,000个元素的list,就象在这个配方里显示的,simple的开销(是纯粹的开销,使用do-nothing 处理函数)是大约超过faster 和 faster_preserve的13或者14倍。improved ,Best和best_preserve,这些函数依次有两倍开销的提高。在一个100,000元素的list 上,faster 和 faster_preserve比best和best_preserve慢了大约15倍。虽然他们绝对的简单清晰使他们的名字受之无愧,但对于每一个list尺寸,后两个函数比改良的函数少大约20%-30%的开销【梐 very minor issue】

While an improvement of 25%, or even a factor of 2, may be neglected without substantial cost for the performance of your program as a whole, the same does not apply to an algorithm that is 10 or more times as slow as it could be. Such terrible performance is likely to make that program fragment a bottleneck, all by itself, particularly when we're talking about O(N2) versus O(N) behavior.

当有25%或者甚至于是2的因数的提高时候,对于你的程序作为一个整体的性能不用真正的花费,可以被忽视。相同的是不应用于一个比它慢10倍或者更慢的算法。明显地,当我们讨论O(N2) 对O(N)行为的时候, 那样糟糕的性能很可能是程序片断的瓶颈,

参考 See Also

The documentation for the random module in the Library Reference.