This article is from the Python Cookbook.

This translation is for personal study only and has nothing to do with any commercial copyright dispute.

-- 大熊 [2004-10-08 16:48:13]

1. Description

12.3 Counting Tags in a Document (Credit: Paul Prescod)

1.1. Problem

You want to get a sense of how often particular elements occur in an XML document, and the relevant counts must be extracted rapidly.

1.2. Solution

You can subclass SAX's ContentHandler to make your own specialized classes for any kind of task, including the collection of such statistics:

# Python 2 code, as in the original recipe (has_key and the print statement).
from xml.sax.handler import ContentHandler
import xml.sax

class countHandler(ContentHandler):
    def __init__(self):
        self.tags = {}
    def startElement(self, name, attr):
        # Called by the parser at the start of every element.
        if not self.tags.has_key(name):
            self.tags[name] = 0
        self.tags[name] += 1

parser = xml.sax.make_parser()
handler = countHandler()
parser.setContentHandler(handler)
parser.parse("test.xml")

# Emit the counts in alphabetical order of tag name.
tags = handler.tags.keys()
tags.sort()
for tag in tags:
    print tag, handler.tags[tag]
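The recipe's code targets Python 2. As a rough sketch of how the same handler might look under Python 3, the block below swaps has_key for the `in` operator and the print statement for the print function. The class name CountHandler and the inline SAMPLE document are assumptions made for illustration (the recipe itself reads "test.xml"); this is not the book's code.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class CountHandler(ContentHandler):
    """Count how many times each element name starts."""

    def __init__(self):
        super().__init__()
        self.tags = {}

    def startElement(self, name, attrs):
        # `in` replaces dict.has_key, which Python 3 removed.
        if name not in self.tags:
            self.tags[name] = 0
        self.tags[name] += 1


# Hypothetical sample document, inlined so the sketch needs no test.xml.
SAMPLE = b"<doc><item/><item><note/></item></doc>"

handler = CountHandler()
xml.sax.parseString(SAMPLE, handler)

# dict.keys() is a view in Python 3, so sorted() replaces keys() + sort().
for tag in sorted(handler.tags):
    print(tag, handler.tags[tag])
```

To read from a file instead, xml.sax.parse("test.xml", handler) should be a close equivalent of the make_parser sequence in the recipe.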

1.3. Discussion

When I start with a new XML content set, I like to get a sense of which elements are in it and how often they occur. I use variants of this recipe. I can also collect attributes just as easily, since startElement receives them along with the tag name. If you add a stack, you can keep track of which elements occur within other elements (for this, of course, you also have to override the endElement method so you can pop the stack).
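The stack variant mentioned above could be sketched roughly as follows, in Python 3. The names NestingHandler, pairs, and SAMPLE, and the "(root)" placeholder for the top-level parent, are hypothetical choices for this illustration and do not come from the recipe.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class NestingHandler(ContentHandler):
    """Count (parent, child) element pairs using a stack of open elements."""

    def __init__(self):
        super().__init__()
        self.stack = []   # names of the elements that are currently open
        self.pairs = {}   # (parent name, child name) -> count

    def startElement(self, name, attrs):
        # The innermost open element is the parent of the one now starting.
        parent = self.stack[-1] if self.stack else "(root)"
        key = (parent, name)
        self.pairs[key] = 1 + self.pairs.get(key, 0)
        self.stack.append(name)

    def endElement(self, name):
        # Pop on every end tag so the stack mirrors the open elements.
        self.stack.pop()


# Hypothetical sample document for the sketch.
SAMPLE = b"<doc><item/><item><note/></item></doc>"

nesting = NestingHandler()
xml.sax.parseString(SAMPLE, nesting)

for (parent, child), n in sorted(nesting.pairs.items()):
    print(parent, ">", child, n)
```

Because the parser only reports well-formed documents, the stack should be empty again once parsing finishes.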

This recipe also works well as a simple example of a SAX application, usable as the basis for any SAX application. Alternatives to SAX include pulldom and minidom, but these would be overkill for this simple job. That is generally the case for any simple processing, particularly if the document you are processing is very large. DOM approaches are generally justified only when you need to perform complicated editing and alteration on an XML document, when the document itself is complicated by references that go back and forth inside it, or when you need to correlate (e.g., compare) multiple documents with each other.
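For comparison, here is a minimal sketch of the same count done with minidom in Python 3. It builds the entire document tree in memory before counting anything, which is the cost that makes a DOM approach overkill for this job on large inputs. The SAMPLE document and the dom_counts name are assumptions for illustration.

```python
from xml.dom import minidom

# Hypothetical sample document for the sketch.
SAMPLE = b"<doc><item/><item><note/></item></doc>"

# parseString loads the whole document into a tree of node objects.
dom = minidom.parseString(SAMPLE)

dom_counts = {}
# "*" matches every element in the tree, including the root element.
for element in dom.getElementsByTagName("*"):
    dom_counts[element.tagName] = 1 + dom_counts.get(element.tagName, 0)

for tag in sorted(dom_counts):
    print(tag, dom_counts[tag])
```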

ContentHandler subclasses offer many other options, and the online Python documentation does a good job of explaining them. This recipe's countHandler class overrides ContentHandler's startElement method, which the parser calls at the start of each element, passing as arguments the element's tag name as a Unicode string and the collection of attributes. Our override of this method counts the number of times each tag name occurs. In the end, we extract the dictionary used for counting and emit it (in alphabetical order, which we easily obtain by sorting the keys).
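Since startElement also receives the attribute collection, the same handler can tally attributes. The sketch below (Python 3) is one way this might look; AttributeCountHandler, the attributes dictionary, and the SAMPLE document are hypothetical names introduced here, not part of the recipe.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class AttributeCountHandler(ContentHandler):
    """Count tags and, per tag, how often each attribute name appears."""

    def __init__(self):
        super().__init__()
        self.tags = {}
        self.attributes = {}   # (tag name, attribute name) -> count

    def startElement(self, name, attrs):
        self.tags[name] = 1 + self.tags.get(name, 0)
        # attrs.getNames() lists the attribute names on this element.
        for attr_name in attrs.getNames():
            key = (name, attr_name)
            self.attributes[key] = 1 + self.attributes.get(key, 0)


# Hypothetical sample document for the sketch.
SAMPLE = b'<doc lang="en"><item id="a"/><item id="b" kind="x"/></doc>'

attr_handler = AttributeCountHandler()
xml.sax.parseString(SAMPLE, attr_handler)

for (tag, attr_name), n in sorted(attr_handler.attributes.items()):
    print(tag, attr_name, n)
```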

In the implementation of this recipe, an alternative to testing the tags dictionary with has_key offers a slightly more concise way to code the startElement method:

def startElement(self, name, attr):
    self.tags[name] = 1 + self.tags.get(name, 0)

This counting idiom for dictionaries is so frequent that it's probably worth encapsulating in its own function despite its utter simplicity:

def count(adict, key, delta=1, default=0):
    adict[key] = delta + adict.get(key, default)
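A short usage sketch may make the delta and default parameters clearer. The count function is repeated from the text so the block runs on its own; the totals dictionary and its keys are made up for illustration.

```python
def count(adict, key, delta=1, default=0):
    # Add delta to the stored value, starting from default if key is new.
    adict[key] = delta + adict.get(key, default)


totals = {}
count(totals, "item")               # new key: 1 + 0
count(totals, "item")               # existing key: 1 + 1
count(totals, "note", delta=3)      # new key with a larger step: 3 + 0
count(totals, "page", default=10)   # new key with a starting value: 1 + 10

print(totals)
```

Note that default only matters the first time a key is seen; after that the stored value is used.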

Using this, you could code the startElement method in the recipe as:

def startElement(self, name, attr):
    count(self.tags, name)
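In later Python versions the standard library's collections.Counter covers the same counting idiom. The sketch below substitutes Counter for the recipe's plain dictionary and count helper; it is a swapped-in alternative, not the recipe's approach, and CounterHandler and SAMPLE are hypothetical names.

```python
import xml.sax
from collections import Counter
from xml.sax.handler import ContentHandler


class CounterHandler(ContentHandler):
    """Count element names with collections.Counter instead of a dict."""

    def __init__(self):
        super().__init__()
        self.tags = Counter()

    def startElement(self, name, attrs):
        # Missing keys read as 0 in a Counter, so no membership test is needed.
        self.tags[name] += 1


# Hypothetical sample document for the sketch.
SAMPLE = b"<doc><item/><item><note/></item></doc>"

counter_handler = CounterHandler()
xml.sax.parseString(SAMPLE, counter_handler)

# most_common() orders tags from most to least frequent.
for tag, n in counter_handler.tags.most_common():
    print(tag, n)
```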

1.4. See Also

Recipe 12.2, Recipe 12.4, and Recipe 12.6 for other uses of the SAX API.