This article is from the Python Cookbook.

This translation is for personal study only; it has nothing to do with any commercial copyright disputes!

-- 大熊 [2004-10-08 16:42:30]

1. Description

12.2 Checking XML Well-Formedness

Credit: Paul Prescod

1.1. Problem

You need to check whether an XML document is well-formed (not whether it conforms to a DTD or schema), and you need to do this quickly.

1.2. Solution

SAX (presumably using a fast parser such as Expat underneath) is the fastest and simplest way to perform this task:

    from xml.sax.handler import ContentHandler
    from xml.sax import make_parser
    from glob import glob
    import sys

    def parsefile(file):
        # Parse with a do-nothing handler; any syntax error raises an exception.
        parser = make_parser()
        parser.setContentHandler(ContentHandler())
        parser.parse(file)

    # Each command-line argument may be a glob pattern such as *.xml.
    for arg in sys.argv[1:]:
        for filename in glob(arg):
            try:
                parsefile(filename)
                print("%s is well-formed" % filename)
            except Exception as e:
                print("%s is NOT well-formed! %s" % (filename, e))

1.3. Discussion

A text is a well-formed XML document if it adheres to all the basic syntax rules for XML documents. In other words, its XML declaration (when one is present) is correct, it has a single root element, all tags are properly nested, tag attributes are quoted, and so on.
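As a rough illustration of these rules, the sketch below feeds a few short byte strings to `xml.sax.parseString` and reports which ones parse. The helper name `is_well_formed` and the sample documents are hypothetical, chosen only for this example; they are not part of the recipe.

```python
import xml.sax
from xml.sax.handler import ContentHandler


def is_well_formed(data):
    """Return True if the bytes in `data` parse as well-formed XML.

    Hypothetical helper for illustration; not part of the original recipe.
    """
    try:
        xml.sax.parseString(data, ContentHandler())
        return True
    except xml.sax.SAXParseException:
        return False


# Hypothetical sample documents, one per rule mentioned in the text.
samples = {
    "nested correctly": b"<a><b/></a>",
    "tags not nested": b"<a><b></a></b>",
    "unquoted attribute": b"<a x=1/>",
    "two root elements": b"<a/><b/>",
}

for label, data in samples.items():
    print("%-20s %s" % (label, is_well_formed(data)))
```

Only the first sample should be reported as well-formed; each of the others breaks one of the rules listed above.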

This recipe uses the SAX API with a dummy ContentHandler that does nothing. Generally, when we parse an XML document with SAX, we use a ContentHandler instance to process the document's contents. In this case, however, we only want to know whether the document meets the most fundamental syntax constraints of XML, so there is no processing to do and the do-nothing handler suffices.
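For contrast, here is a minimal sketch of a handler that does do some processing, next to the do-nothing base class. The class name `ElementCounter` and the sample document are assumptions made for this example only.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class ElementCounter(ContentHandler):
    """Hypothetical handler that counts start tags as the parser reports them."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        # Called by the parser once for every opening (or empty) tag.
        self.count += 1


data = b"<a><b/><c/></a>"  # hypothetical sample document

counter = ElementCounter()
xml.sax.parseString(data, counter)
print("elements seen:", counter.count)

# The base ContentHandler implements every callback as a no-op, so parsing
# with it exercises only the parser's syntax checks.
xml.sax.parseString(data, ContentHandler())
```

Both calls run the same parser over the same input; the only difference is what the callbacks do with the events.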

The parsefile function parses the whole document and raises an exception if there is an error. The recipe's main code catches any such exception and prints it out like this:

    $ python wellformed.py test.xml
    test.xml is NOT well-formed! test.xml:1002:2: mismatched tag

This means that character 2 on line 1,002 has a mismatched tag.
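The location in that message can also be read programmatically. The sketch below assumes the error arrives as an `xml.sax.SAXParseException` (which is what the standard Expat-based reader raises for syntax errors) and uses a small hypothetical in-memory document instead of a file.

```python
import xml.sax
from xml.sax.handler import ContentHandler

# Hypothetical input: <b> is never closed, so </a> on line 3 is mismatched.
data = b"<a>\n<b>\n</a>"

try:
    xml.sax.parseString(data, ContentHandler())
    error = None
except xml.sax.SAXParseException as e:
    # Keep a reference; the name bound by "as" is cleared after the block.
    error = e

if error is not None:
    print("line:   ", error.getLineNumber())
    print("column: ", error.getColumnNumber())
    print("message:", error.getMessage())
```

Formatting the exception with `%s`, as the recipe does, combines the source name, line, column, and message into a single string.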

This recipe does not check adherence to a DTD or schema. That is a separate procedure called validation. The script should perform quite well, precisely because it focuses on a minimal, irreducible core task.
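To make the distinction concrete, the following sketch parses a document that violates its own DTD yet is still well-formed. The sample document is hypothetical, and the example assumes the default non-validating Expat-based SAX reader.

```python
import xml.sax
from xml.sax.handler import ContentHandler

# Hypothetical document: the DTD declares <a> as EMPTY, yet <a> contains <b/>.
# That makes it invalid against its own DTD, but still well-formed.
data = (
    b'<?xml version="1.0"?>\n'
    b"<!DOCTYPE a [<!ELEMENT a EMPTY>]>\n"
    b"<a><b/></a>"
)

try:
    xml.sax.parseString(data, ContentHandler())
    accepted = True
except xml.sax.SAXParseException:
    accepted = False

print("accepted by the well-formedness check:", accepted)
```

A validating parser would reject this document; the check in this recipe accepts it because only the syntax is examined.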

1.4. See Also

Recipe 12.3, Recipe 12.4, and Recipe 12.6 for other uses of the SAX API; the PyXML package (http://pyxml.sourceforge.net/) includes the pure-Python validating parser xmlproc, which checks the conformance of XML documents to specific DTDs; the PyRXP package from ReportLab (http://www.reportlab.com/xml/pyrxp.html) is a wrapper around the faster validating parser RXP and is available under the GPL license.