This article is taken from the Python Cookbook. The translation was made purely for personal study; it has nothing to do with any commercial copyright dispute.
-- 61.182.251.99 [2004-09-21 18:36:08]
1. Description
Counting Lines in a File
Credit: Luther Blissett
1.1. Problem
You need to count the number of lines in a file.
1.2. Solution
The simplest approach, for reasonably sized files, is to read the file as a list of lines so that the count of lines is the length of the list.
If the file's path is in a string bound to the thefilepath variable, that's just:
count = len(open(thefilepath).readlines())  # method 1
For a truly huge file, this may be very slow or even fail to work. If you have to worry about humongous files, a loop using the xreadlines method always works:
count = 0  # method 2
for line in open(thefilepath).xreadlines(): count += 1
Here's a slightly tricky alternative, if the line terminator is '\n' (or has '\n' as a substring, as happens on Windows):
count = 0  # method 3
thefile = open(thefilepath, 'rb')
while 1:
    buffer = thefile.read(8192*1024)
    if not buffer: break
    count += buffer.count('\n')  # count() method of the str type
thefile.close()
Without the 'rb' argument to open, this will work anywhere, but performance may suffer greatly on Windows or Macintosh platforms.
1.3. Discussion
If you have an external program that counts a file's lines, such as wc -l on Unix-like platforms, you can of course choose to use that (e.g., via os.popen( )). However, it's generally simpler, faster, and more portable to do the line-counting in your program. You can rely on almost all text files having a reasonable size, so that reading the whole file into memory at once is feasible. For all such normal files, the len of the result of readlines gives you the count of lines in the simplest way.
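For reference, a minimal sketch of the external-program route (assuming a Unix-like system where wc is on the PATH, and using the same thefilepath variable introduced above) could look like this:
import os
# run 'wc -l' on the file and parse the line count from its output
count = int(os.popen('wc -l ' + thefilepath).read().split()[0])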
If the file is larger than available memory (say, a few hundred of megabytes on a typical PC today), the simplest solution can become slow, as the operating system struggles to fit the file's contents into virtual memory. It may even fail, when swap space is exhausted and virtual memory can't help any more.
On a typical PC, with 256 MB of RAM and virtually unlimited disk space, you should still expect serious problems when you try to read into memory files of, say, 1 or 2 GB, depending on your operating system (some operating systems are much more fragile than others in handling virtual-memory issues under such overstressed load conditions).
In this case, the xreadlines method of file objects, introduced in Python 2.1, is generally a good way to process text files line by line.
In Python 2.2, you can do even better, in terms of both clarity and speed, by looping directly on the file object:
count = 0
for line in open(thefilepath): count += 1
However, xreadlines does not return a sequence, and neither does a loop directly on the file object, so you can't just use len in these cases to get the number of lines. Rather, you have to loop and count line by line, as shown in the solution.
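As an aside that is not part of the original recipe, Python versions from 2.4 onward let you write the loop-and-count as a generator expression, which still avoids reading the whole file into a list:
# sums one 1 per line; lines are consumed lazily, so memory use stays small
count = sum(1 for line in open(thefilepath))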
Counting line-terminator characters while reading the file by bytes, in reasonably sized chunks, is the key idea in the third approach. It's probably the least immediately intuitive, and it's not perfectly cross-platform, but you might hope that it's fastest (for example, by analogy with Recipe 8.2 in the Perl Cookbook).
However, remember that, in most cases, performance doesn't really matter all that much. When it does matter, the time sink might not be what your intuition tells you it is, so you should never trust your intuition in this matter. Instead, always benchmark and measure.
For example, I took a typical Unix syslog file of middling size, a bit over 18 MB of text in 230,000 lines:
[situ@tioni nuc]$ wc nuc
  231581  2312730 18508908 nuc
and I set up the following benchmark framework script, bench.py:
import time

def timeo(fun, n=10):
    start = time.clock()
    for i in range(n): fun()
    stend = time.clock()
    thetime = stend - start
    return fun.__name__, thetime  # tuple of (function name, total time for n runs)

import os

def linecount_wc():  # uses the external program wc -l
    return int(os.popen('wc -l nuc').read().split()[0])

def linecount_1():  # method 1
    return len(open('nuc').readlines())

def linecount_2():  # method 2
    count = 0
    for line in open('nuc').xreadlines(): count += 1
    return count

def linecount_3():  # method 3
    count = 0
    thefile = open('nuc')
    while 1:
        buffer = thefile.read(65536)
        if not buffer: break
        count += buffer.count('\n')
    return count

for f in linecount_wc, linecount_1, linecount_2, linecount_3:
    print f.__name__, f()

for f in linecount_1, linecount_2, linecount_3:
    print "%s: %.2f" % timeo(f)
First, I print the line counts obtained by all methods, thus ensuring that there is no anomaly or error (counting tasks are notoriously prone to off-by-one errors).
Then, I run each alternative 10 times, under the control of the timing function timeo, and look at the results. Here they are:
[situ@tioni nuc]$ python -O bench.py
linecount_wc 231581
linecount_1 231581
linecount_2 231581
linecount_3 231581
linecount_1: 4.84
linecount_2: 4.54
linecount_3: 5.02
As you can see, the performance differences hardly matter: a difference of 10% or so in one auxiliary task is something that your users will never even notice. However, the fastest approach (for my particular circumstances, a cheap but very recent PC running a popular Linux distribution, as well as this specific benchmark) is the humble loop-on-every-line technique, while the slowest one is the ambitious technique that counts line terminators by chunks. In practice, unless I had to worry about files of many hundreds of megabytes, I'd always use the simplest approach (i.e., the first one presented in this recipe).
(Translator's note: might the chunk-counting method's performance depend on the chunk size? It would be worth measuring.)
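Concerning the translator's question above, one possible check (a sketch only, reusing the timeo() harness and the nuc test file from bench.py; the buffer sizes below are arbitrary examples) is to time the chunk-counting method with several buffer sizes:
def make_linecount_3(bufsize):
    # build a variant of method 3 that reads the test file in chunks
    # of the given size, so the effect of bufsize can be measured
    def counter():
        count = 0
        thefile = open('nuc', 'rb')
        while 1:
            buffer = thefile.read(bufsize)
            if not buffer: break
            count += buffer.count('\n')
        thefile.close()
        return count
    return counter

for bufsize in (8192, 65536, 1024*1024):
    name, t = timeo(make_linecount_3(bufsize))
    print "bufsize %7d: %.2f" % (bufsize, t)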
1.4. See Also
The Library Reference sections on file objects and the time module;
Perl Cookbook Recipe 8.2.