This article is from the Python Cookbook.

The translation is for personal study only; it has nothing to do with any commercial copyright dispute!

-- Zoom.Quiet [2004-08-11 16:44:45]

1. Processing Selected Pairs of Structured Data Efficiently

2.3 Processing Selected Pairs of Structured Data Efficiently

Credit: Alex Martelli, David Ascher

1.1. Problem

You need to efficiently process pairs of data from two large and related data sets.

1.2. Solution

Use an auxiliary dictionary to do preprocessing of the data, thereby reducing the need for iteration over mostly irrelevant data. For instance, if xs and ys are the two data sets, with matching keys as the first item in each entry, so that x[0] == y[0] defines an "interesting" pair:

    auxdict = {}
    for y in ys:
        auxdict.setdefault(y[0], []).append(y)
    # .get with an empty-list default skips any x whose key never occurs
    # in ys, instead of raising KeyError
    result = [ process(x, y) for x in xs for y in auxdict.get(x[0], []) ]
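
To see the pattern in action, here is a minimal, self-contained sketch; the sample tuples and the process function are hypothetical stand-ins, not part of the recipe:

    # Hypothetical sample data: the first item of each entry is the key
    xs = [('a', 1), ('b', 2), ('c', 3)]
    ys = [('a', 10), ('a', 20), ('c', 30)]

    def process(x, y):
        # stand-in computation; a real application would do real work here
        return (x[0], x[1] + y[1])

    auxdict = {}
    for y in ys:
        auxdict.setdefault(y[0], []).append(y)
    result = [ process(x, y) for x in xs for y in auxdict.get(x[0], []) ]
    # result == [('a', 11), ('a', 21), ('c', 33)]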

1.3. Discussion

To make the problem more concrete, let's look at an example. Say you need to analyze data about visitors to a web site who have purchased something online. This means you need to perform some computation based on data from two log files: one from the web server and one from the credit-card processing framework. Each log file is huge, but only a small number of the web server log entries correspond to credit-card log entries. Let's assume that cclog is a sequence of records, one for each credit-card transaction, and that weblog is a sequence of records describing each web site hit. Let's further assume that each record uses the attribute ipaddress to refer to the IP address involved in each event. In this case, a reasonable first approach would be to do something like:

    results = [ process(webhit, ccinfo) for webhit in weblog for ccinfo in cclog
                if ccinfo.ipaddress == webhit.ipaddress ]
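
For concreteness, here is a sketch of the kind of record this assumes; the Record class and every field other than ipaddress are hypothetical, since the recipe relies only on that one attribute:

    # Hypothetical record type standing in for parsed log entries
    class Record:
        def __init__(self, ipaddress, **fields):
            self.ipaddress = ipaddress
            self.__dict__.update(fields)

    weblog = [ Record('1.2.3.4', url='/cart'), Record('5.6.7.8', url='/home') ]
    cclog = [ Record('1.2.3.4', amount=42.0) ]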

The problem with this approach is that the nested list comprehension will iterate over each entry in the web server log, and, for each entry in the web server log, it will also iterate over each entry in the credit-card log. This means that the algorithm has O(M x N) performance characteristics; in other words, the time it takes to compute will be proportional to the product of the size of both logs. As the web site becomes more popular and as data accumulates, the performance of the algorithm will rapidly deteriorate.

The key to optimizing this algorithm is to recognize that the computation (process) needs to happen for only a small subset of all of the possible combinations of the two variables (in this case, webhit and ccinfo). If we could search for only the right pairs, we might be able to speed up the process. As Tim Peters says in the introduction to this chapter, if you need to search, use an auxiliary dictionary. The solution described earlier, rewritten to use our variables, yields:

优化这个算法的关键是认识到计算(处理)仅发生在两个变量(在这种情况下是,webhit和ccinfo)所有可能的积集的一个很小的子集上.如果我们 能只搜索正确的组合,就可以提高处理的速度。如 Tim Peters 在本章的介绍中所述,如果你需要搜索,就使用附助的字典。前述 的解决方案,用我们的变量来写,就是:

    ipdict = {}
    for webhit in weblog:
        ipdict.setdefault(webhit.ipaddress, []).append(webhit)
    # .get with an empty-list default means a transaction from an IP address
    # with no logged web hits simply contributes no pairs
    results = [ process(webhit, ccinfo) for ccinfo in cclog
                for webhit in ipdict.get(ccinfo.ipaddress, []) ]

The line that populates ipdict creates a dictionary mapping IP addresses to lists containing the data for each web hit. Because we're indexing the dictionary by IP address, we are optimizing the data structures for a particular query: "give me all of the web records for a particular IP address." The list comprehension now iterates over only the data we require: for each credit-card transaction, we iterate over all of the web hits corresponding to the IP address used in that transaction. Not only did the algorithm go from O(M x N) to O(M+N) from a theoretical point of view, but, because we chose to hold in the auxiliary dictionary the data that is sparser from the point of view of the task at hand, we've made the solution faster than the alternative of indexing cclog instead (which would also be O(M+N)).
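
On Python 2.5 and later, collections.defaultdict can express the same preprocessing step a little more directly; a sketch equivalent in behavior to the setdefault version above:

    from collections import defaultdict

    ipdict = defaultdict(list)       # missing keys start out as empty lists
    for webhit in weblog:
        ipdict[webhit.ipaddress].append(webhit)
    # indexing a defaultdict never raises KeyError: an unseen IP address
    # yields an empty list, so that transaction contributes no pairs
    results = [ process(webhit, ccinfo) for ccinfo in cclog
                for webhit in ipdict[ccinfo.ipaddress] ]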

Note that the test used to determine whether two records correspond to a pair of interest can be arbitrary. The generic description of the solution uses indexing, while the web example uses attribute-getting. The test you use will depend on your application and your data structures.
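
One way to make that flexibility explicit is to parameterize the join on key-extraction functions. A sketch; join_pairs and its argument names are invented here for illustration, not part of the recipe:

    def join_pairs(xs, ys, xkey, ykey, process):
        # index ys by key once, then pair each x only with the matching ys;
        # process receives each (x, y) pair of interest
        auxdict = {}
        for y in ys:
            auxdict.setdefault(ykey(y), []).append(y)
        return [ process(x, y) for x in xs for y in auxdict.get(xkey(x), []) ]

    # indexing-based test, as in the generic solution:
    #   result = join_pairs(xs, ys, lambda x: x[0], lambda y: y[0], process)
    # attribute-based test, as in the web-log example:
    #   results = join_pairs(cclog, weblog,
    #                        lambda c: c.ipaddress, lambda w: w.ipaddress,
    #                        lambda c, w: process(w, c))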

1.4. See Also

PyCkBk-1-16--Recipe 1.6 for more details on the setdefault method of dictionaries; the introduction to Chapter 2.