-- Zoom.Quiet [2004-08-11 17:11:32]

# Selecting Random Elements from a List Without Repetition

2.9 Selecting Random Elements from a List Without Repetition

Credit: Iuri Wickert, Duncan Grisby, Steve Holden, Alex Martelli

## Problem

You need to consume, in random order, the items of a rather long list, and the most direct approach is painfully slow.

## Solution

While it's a common mistake to be overly concerned with speed, you should not ignore the different performances of various algorithms. Suppose we must process all of the items in a long list in random order, without repetition. For example:

```
import random

# an example of a processing function
def process(datum):
    print(datum)

# an example of a set of data to process
data = list(range(10000))

# The simplest version is very slow:
def simple():
    while data:
        # returns an element, not an index (doesn't help)
        elem = random.choice(data)
        # must search the whole list for the element!
        data.remove(elem)
        process(elem)
```

Here is a faster version:

```
def faster():
    while data:
        index = random.randrange(len(data))
        elem = data[index]
        # direct deletion by index, no search needed
        del data[index]
        process(elem)

# the same, but preserving the data list
def faster_preserve():
    aux = list(range(len(data)))
    while aux:
        posit = random.randrange(len(aux))
        index = aux[posit]
        elem = data[index]
        # alters the auxiliary list only
        del aux[posit]
        process(elem)
```

However, the key improvement is to switch to an O(N) algorithm:

```
def improved():
    size = len(data)
    while size:
        index = random.randrange(size)
        elem = data[index]
        # overwrite the chosen item with the last "live" item
        data[index] = data[size-1]
        size = size - 1
        process(elem)
```

Of course, you can also implement a version of this that preserves the data list.

But the winner is the version that appears to be the simplest:

```
def best():
    random.shuffle(data)
    for elem in data:
        process(elem)

# or, if you need to preserve the data list's original ordering:
def best_preserve():
    aux = list(data)
    random.shuffle(aux)
    for elem in aux:
        process(elem)
```

## Discussion

The simplest, most direct way of consuming a list in random order is painfully slow even for lists of just a few hundred elements. While it is tempting to use the simple, clear choice/remove combination, as in the simple function, it is a bad choice, because remove must linearly search through the list to find the element to delete. In other words, the overall performance is O(N²), with a large multiplicative constant.

Fortunately, there are equally simple (or simpler), but much faster, ways to do this. The faster function, using randrange/del to generate the random indexes for the list, can skip the costly search for the deletion. If it's important to preserve the input list, you can use a disposable auxiliary list to generate the data list indexes, as in the faster_preserve function.

However, del anylist[x] for a random x is still O(N), so overall performance is still O(N²), albeit with a much smaller multiplicative constant. Incidentally, the pseudorandom order in which items are processed is not the same with the various approaches, even if random is seeded in the same way. All of the orderings are equally pseudorandom, though.
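This is easy to see by capturing the processing order of two of the approaches under the same seed. A small sketch (the helper names order_by_swap and order_by_shuffle are illustrative, not from the recipe):

```python
import random

def order_by_swap(items):
    # the "improved" technique: swap chosen item with the last live one
    items = list(items)
    size = len(items)
    order = []
    while size:
        index = random.randrange(size)
        order.append(items[index])
        items[index] = items[size-1]
        size -= 1
    return order

def order_by_shuffle(items):
    # the "best" technique: shuffle once, then walk sequentially
    aux = list(items)
    random.shuffle(aux)
    return aux

random.seed(42)
a = order_by_swap(range(10))
random.seed(42)
b = order_by_shuffle(range(10))
# both are permutations of the same ten items, even though the
# two approaches generally visit them in different orders
print(sorted(a) == sorted(b))
```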

Pursuing O(N) performance, one possibility is not to delete an item from a list at all, but rather to overwrite it with the last item and decrease at each step the number of items from which we're choosing. The improved function takes this tack and benefits greatly from it, in terms of performance.

The fastest approach, however, is to shuffle the data (or an auxiliary copy of it) once, at the start, then just process the items sequentially in the resulting pseudorandom order. The nice thing about this approach, shown in the best and best_preserve functions, is that it's actually the simplest of all.

On lists of 10,000 items, as shown in this recipe, the overhead (meaning pure overhead, measured with a do-nothing processing function) of simple is about 13 or 14 times that of faster and faster_preserve. Those functions, in turn, have over twice as much overhead as improved, best, and best_preserve. On lists of 100,000 items, faster and faster_preserve become about 15 times slower than improved, best, and best_preserve. The latter two have, for every list size, about 20%-30% less overhead than improved, a very minor issue, although their utter simplicity clearly does make them deserve their names.
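The exact ratios vary by machine and Python version; a rough harness to reproduce the comparison (rewriting two of the functions to take their data as an argument, so each run gets a fresh copy) might look like:

```python
import random
import timeit

def process(datum):
    pass  # do-nothing processing function, to measure pure overhead

def simple(data):
    while data:
        elem = random.choice(data)
        data.remove(elem)   # O(N) linear search on every step
        process(elem)

def best(data):
    random.shuffle(data)
    for elem in data:
        process(elem)

for func in (simple, best):
    # each timed call consumes its own fresh copy of the data
    t = timeit.timeit(lambda: func(list(range(2000))), number=5)
    print("%s: %.3fs" % (func.__name__, t))
```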


While an improvement of 25%, or even a factor of 2, may be neglected without substantial cost for the performance of your program as a whole, the same does not apply to an algorithm that is 10 or more times as slow as it could be. Such terrible performance is likely to make that program fragment a bottleneck, all by itself, particularly when we're talking about O(N²) versus O(N) behavior.

## See Also

The documentation for the random module in the Library Reference.
