This article is from the Python Cookbook.


-- 大熊 [2004-10-08 16:48:13]

1. Description

12.3 Counting Tags in a Document. Credit: Paul Prescod

1.1. Problem

12.3.1 Problem: You want to get a sense of how often particular elements occur in an XML document, and the relevant counts must be extracted rapidly.

1.2. Solution

12.3.2 Solution: You can subclass SAX's ContentHandler to make your own specialized classes for any kind of task, including the collection of such statistics:

    from xml.sax.handler import ContentHandler
    import xml.sax

    class countHandler(ContentHandler):

        def __init__(self):
            self.tags = {}    # tag name -> number of occurrences

        def startElement(self, name, attr):
            # Called by the parser at the start of every element.
            # (The original Python 2 code tested self.tags.has_key(name).)
            if name not in self.tags:
                self.tags[name] = 0
            self.tags[name] += 1

    parser = xml.sax.make_parser()
    handler = countHandler()
    parser.setContentHandler(handler)
    parser.parse("test.xml")

    # Print the counts in alphabetical order of tag name.
    for tag in sorted(handler.tags):
        print(tag, handler.tags[tag])

1.3. Discussion

12.3.3 Discussion: When I start with a new XML content set, I like to get a sense of which elements are in it and how often they occur. I use variants of this recipe. I can also collect attributes just as easily, as you can see. If you add a stack, you can keep track of which elements occur within other elements (for this, of course, you also have to override the endElement method so you can pop the stack).
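The stack idea mentioned above could look roughly like the following. This is only a sketch in Python 3, not code from the recipe: the names NestingCountHandler, count_nesting and nesting_sample are made up for illustration, and the sample document is parsed from an in-memory byte string rather than from test.xml.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class NestingCountHandler(ContentHandler):
    """Count (parent, child) element pairs (hypothetical helper class)."""

    def __init__(self):
        super().__init__()
        self.stack = []   # names of the elements that are currently open
        self.pairs = {}   # (parent name or None, child name) -> count

    def startElement(self, name, attrs):
        # The innermost open element is the parent; the root has none.
        parent = self.stack[-1] if self.stack else None
        key = (parent, name)
        self.pairs[key] = self.pairs.get(key, 0) + 1
        self.stack.append(name)

    def endElement(self, name):
        # Leaving an element: it is no longer a possible parent.
        self.stack.pop()


def count_nesting(xml_bytes):
    """Parse a byte string and return the (parent, child) counts."""
    handler = NestingCountHandler()
    xml.sax.parseString(xml_bytes, handler)
    return handler.pairs


nesting_sample = b"<doc><sec><p/><p/></sec><p/></doc>"
nesting_pairs = count_nesting(nesting_sample)
for (parent, child), n in nesting_pairs.items():
    print(parent, child, n)
```

Keying the dictionary on a (parent, child) tuple keeps the same counting idiom as the recipe while recording one level of nesting; a full path could be recorded instead by using tuple(self.stack) as part of the key.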

This recipe also works well as a simple example of a SAX application, usable as the basis for any SAX application. Alternatives to SAX include pulldom and minidom. These would be overkill for this simple job, though. For any simple processing, this is generally the case, particularly if the document you are processing is very large. DOM approaches are generally justified only when you need to perform complicated editing and alteration on an XML document, when the document itself is complicated by references that go back and forth inside it, or when you need to correlate (e.g., compare) multiple documents with each other.
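For comparison, here is a rough sketch of the same count done with minidom (Python 3). It is not part of the recipe, and the names count_tags_dom and dom_sample are invented for this example. Note that the whole tree is built in memory before anything is counted, which is the overhead the paragraph above warns about for large documents.

```python
from xml.dom import minidom


def count_tags_dom(xml_text):
    """Count element names by first building the complete DOM tree."""
    doc = minidom.parseString(xml_text)
    counts = {}
    # "*" matches every element in the document, including the root.
    for element in doc.getElementsByTagName("*"):
        counts[element.tagName] = counts.get(element.tagName, 0) + 1
    return counts


dom_sample = "<doc><sec><p/><p/></sec><p/></doc>"
dom_counts = count_tags_dom(dom_sample)
for tag in sorted(dom_counts):
    print(tag, dom_counts[tag])
```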


ContentHandler subclasses offer many other options, and the online Python documentation does a good job of explaining them. This recipe's countHandler class overrides ContentHandler's startElement method, which the parser calls at the start of each element, passing as arguments the element's tag name as a Unicode string and the collection of attributes. Our override of this method counts the number of times each tag name occurs. In the end, we extract the dictionary used for counting and emit it (in alphabetical order, which we easily obtain by sorting the keys).
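Because startElement also receives the element's attributes, a variant can count those too. The following is a sketch in Python 3 under that assumption; AttributeCountHandler, count_attributes and attr_sample are illustrative names, not part of the recipe. It relies on the getNames method of the attributes object that the parser passes in.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class AttributeCountHandler(ContentHandler):
    """Count how often each attribute appears on each element name."""

    def __init__(self):
        super().__init__()
        self.attributes = {}   # (element name, attribute name) -> count

    def startElement(self, name, attrs):
        # attrs holds the attributes of the element being opened.
        for attr_name in attrs.getNames():
            key = (name, attr_name)
            self.attributes[key] = self.attributes.get(key, 0) + 1


def count_attributes(xml_bytes):
    """Parse a byte string and return the attribute counts."""
    handler = AttributeCountHandler()
    xml.sax.parseString(xml_bytes, handler)
    return handler.attributes


attr_sample = b'<doc id="1"><p class="a"/><p class="b" id="2"/></doc>'
attr_counts = count_attributes(attr_sample)
for (element, attribute), n in sorted(attr_counts.items()):
    print(element, attribute, n)
```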


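In the implementation of this recipe, an alternative to explicitly testing whether the tag is already in the tags dictionary (has_key in Python 2, the in operator in Python 3) is the dictionary's get method, which offers a slightly more concise way to code the startElement method: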


    def startElement(self, name, attr):
        self.tags[name] = 1 + self.tags.get(name, 0)

This counting idiom for dictionaries is so frequent that it's probably worth encapsulating in its own function despite its utter simplicity:


    def count(adict, key, delta=1, default=0):
        adict[key] = delta + adict.get(key, default)

Using this, you could code the startElement method in the recipe as:


    def startElement(self, name, attr):
        count(self.tags, name)
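In current Python the standard library already packages this counting idiom as collections.Counter, which was added after the recipe was written. The sketch below (Python 3) shows how the handler might look with it; CounterHandler, count_tags and counter_sample are names chosen for this example only.

```python
import xml.sax
from collections import Counter
from xml.sax.handler import ContentHandler


class CounterHandler(ContentHandler):
    """Same tag count as the recipe, with a Counter instead of a dict."""

    def __init__(self):
        super().__init__()
        self.tags = Counter()   # missing keys count as zero

    def startElement(self, name, attrs):
        self.tags[name] += 1


def count_tags(xml_bytes):
    """Parse a byte string and return a Counter of tag names."""
    handler = CounterHandler()
    xml.sax.parseString(xml_bytes, handler)
    return handler.tags


counter_sample = b"<doc><sec><p/><p/></sec><p/></doc>"
tag_counts = count_tags(counter_sample)
for tag in sorted(tag_counts):
    print(tag, tag_counts[tag])
```

A Counter also provides most_common, which is handy when you care about the most frequent tags rather than an alphabetical listing.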

1.4. See Also

12.3.4 See Also: Recipe 12.2, Recipe 12.4, and Recipe 12.6 for other uses of the SAX API.