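This article is from the Python Cookbook.

The translation was made for personal study only and has nothing to do with any commercial copyright dispute.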

-- 大熊 [2004-10-08 16:48:48]

1. Description

12.4 Extracting Text from an XML Document. Credit: Paul Prescod

1.1. Problem

12.4.1 Problem You need to extract only the text from an XML document, not the tags.

1.2. Solution

12.4.2 Solution Once again, subclassing SAX's ContentHandler makes this extremely easy:

from xml.sax.handler import ContentHandler
import xml.sax
import sys

class textHandler(ContentHandler):
    # Called by the parser for each run of text between tags.
    def characters(self, ch):
        # Python 2 idiom: encode the Unicode text before writing it to stdout.
        sys.stdout.write(ch.encode("Latin-1"))

parser = xml.sax.make_parser()
handler = textHandler()
parser.setContentHandler(handler)
parser.parse("test.xml")
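The recipe's code was written for Python 2, where sys.stdout accepts encoded byte strings. What follows is a minimal sketch of the same idea for Python 3, not the author's code: it assumes you want the text back as a single str rather than printed, and the names TextCollector and extract_text are hypothetical.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class TextCollector(ContentHandler):
    """Collect character data and ignore tags (hypothetical helper name)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        # The parser may deliver one run of text in several calls,
        # so accumulate the pieces and join them at the end.
        self.chunks.append(content)


def extract_text(xml_bytes):
    """Return all character data of a well-formed XML document as a str."""
    handler = TextCollector()
    xml.sax.parseString(xml_bytes, handler)
    return "".join(handler.chunks)


if __name__ == "__main__":
    sample = b"<doc><title>Hello</title> <body>XML <b>world</b></body></doc>"
    print(extract_text(sample))  # prints "Hello XML world"
```

Returning a str leaves the choice of output encoding to the caller, who can print the result or write it to a file opened with whatever encoding fits.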

1.3. Discussion

12.4.3 Discussion Sometimes you want to get rid of XML tags, for example, to rekey a document or to spellcheck it. This recipe performs this task and will work with any well-formed XML document. It is quite efficient. If the document isn't well-formed, you could try a solution based on the XML lexer (shallow parser) shown in Recipe 12.12.
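The well-formedness requirement shows up in practice as an exception: with the default error handler, the SAX parser raises xml.sax.SAXParseException when it hits a fatal error such as a mismatched tag. The sketch below, written for Python 3, illustrates one way to detect that case so the caller can fall back to another approach. Returning None on failure is an assumption made for illustration, and the names _Collector and extract_text_or_none are hypothetical.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class _Collector(ContentHandler):
    """Accumulate character data (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


def extract_text_or_none(xml_bytes):
    """Return the document's text, or None if it is not well-formed."""
    handler = _Collector()
    try:
        xml.sax.parseString(xml_bytes, handler)
    except xml.sax.SAXParseException:
        # Not well-formed: signal the caller to try something else,
        # for example a lexer-based approach.
        return None
    return "".join(handler.chunks)
```

Text collected before the error is discarded here on purpose; a caller who prefers partial output could read handler.chunks in the except branch instead.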

In this recipe's textHandler class, we override ContentHandler's characters method, which the parser calls for each string of text in the XML document (excluding tags, XML comments, and processing instructions), passing as the only argument the piece of text as a Unicode string. We have to encode this Unicode before we can emit it to standard output. In this recipe, we're using the Latin-1 (also known as ISO-8859-1) encoding, which covers all Western-European alphabets and is supported by many popular output devices (e.g., printers and terminal-emulation windows). However, you should use whatever encoding is most appropriate for the documents you're handling and is supported by the devices you use. The configuration of your devices may depend on your operating system's concepts of locale and code page. Unfortunately, these vary too much between operating systems for me to go into further detail.
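To make the encoding choice concrete, here is a hedged Python 3 sketch in which the encoding is a parameter instead of a hard-coded "Latin-1". It assumes the output is a binary stream (such as sys.stdout.buffer or an open file) and that characters the target encoding cannot represent should be replaced rather than raise an error. EncodingTextHandler and write_text are hypothetical names, not part of the recipe.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler


class EncodingTextHandler(ContentHandler):
    """Write character data to a binary stream in a chosen encoding."""

    def __init__(self, out, encoding="latin-1", errors="replace"):
        super().__init__()
        self.out = out            # any object with a write(bytes) method
        self.encoding = encoding  # pick what your output device supports
        self.errors = errors      # "replace" substitutes "?" when encoding

    def characters(self, content):
        self.out.write(content.encode(self.encoding, self.errors))


def write_text(xml_bytes, out, encoding="latin-1"):
    """Parse xml_bytes and write only its text to out, encoded as requested."""
    xml.sax.parseString(xml_bytes, EncodingTextHandler(out, encoding))


if __name__ == "__main__":
    buf = io.BytesIO()
    # The euro sign has no Latin-1 code point, so it comes out as "?".
    write_text("<p>caf\u00e9 \u20ac5</p>".encode("utf-8"), buf)
    print(buf.getvalue())  # prints b'caf\xe9 ?5'
```

Passing sys.stdout.buffer as out sends the bytes to the terminal, and passing encoding="utf-8" avoids the replacement entirely when the device supports it.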

1.4. See Also

12.4.4 See Also Recipe 12.2, Recipe 12.3, and Recipe 12.6 for other uses of the SAX API; see Recipe 12.12 for a very different approach to XML lexing that works on XML fragments.